PostgreSQL and Databases in general: Different Approaches for MVCC used in well known Databases

Database Management Systems uses MVCC to avoid the problem of
Writers blocking Readers and vice-versa, by making use of multiple
versions of data.

There are essentially two approaches to multi-version concurrency.

Approaches for MVCC
The first approach is to store multiple versions of records in the
database, and garbage collect records when they are no longer
required. This is the approach adopted by PostgreSQL and
Firebird/Interbase. SQL Server also uses somewhat similar approach
with the difference that old versions are stored in tempdb
(database different from main database).

The second approach is to keep only the latest version of data in
the database, but reconstruct older versions of data dynamically
as required by using undo. This is approach adopted by Oracle
and MySQL/InnoDB

MVCC in PostgreSQL
In PostgreSQL, when a row is updated, a new version (called a tuple)
of the row is created and inserted into the table. The previous version
is provided a pointer to the new version. The previous version is
marked “expired", but remains in the database until it is garbage collected.

In order to support multi-versioning, each tuple has additional data
recorded with it:
xmin - The ID of the transaction that inserted/updated the
row and created this tuple.
xmax - The transaction that deleted the row, or created a
new version of this tuple. Initially this field is null.

Transaction status is maintained in CLOG which resides in $Data/pg_clog.
This table contains two bits of status information for each transaction;
the possible states are in-progress, committed, or aborted.

PostgreSQL does not undo changes to database rows when a transaction
aborts - it simply marks the transaction as aborted in CLOG . A PostgreSQL
table therefore may contain data from aborted transactions.

A Vacuum cleaner process is provided to garbage collect expired/aborted
versions of a row. The Vacuum Cleaner also deletes index entries
associated with tuples that are garbage collected.

A tuple is visible if its xmin is valid and xmax is not.
“Valid" means “either committed or the current transaction".
To avoid consulting the CLOG table repeatedly, PostgreSQL maintains
status flags in the tuple that indicate whether the tuple is “known committed"
or “known aborted".

MVCC in Oracle
Oracle maintain old versions in rollback segments (also known as
'undo log'). A transaction ID is not a sequential number; instead, it is
made of a set of numbers that points to the transaction entry (slot) in a
Rollback segment header.

Rollback segments have the property that new transactions can reuse
storage and transaction slots used by older transactions that are
committed or aborted.
This automatic reuse facility enables Oracle to manage large numbers
of transactions using a finite set of rollback segments.

The header block of the rollback segment is used as a transaction table.
Here the status of a transaction is maintained (called System Change Number,
or SCN, in Oracle). Rather than storing a transaction ID with each row
in the page, Oracle saves space by maintaining an array of unique transactions
IDs separately within the page, and stores only the offset of this array with
the row.

Along with each transaction ID, Oracle stores a pointer to the last undo record
created by the transaction for the page. Not only are table rows stored in this
way, Oracle employs the same techniques when storing index rows. This is
one of the major difference between PostgreSQL and Oracle.

When an Oracle transaction starts, it makes a note of the current SCN. When
reading a table or an index page, Oracle uses the SCN number to determine if
the page contains the effects of transactions that should not be visible to the
current transaction. Oracle checks the commit status of a transaction by
looking up the associated Rollback segment header, but, to save time, the first
time a transaction is looked up, its status is recorded in the page itself to avoid
future lookups.

If the page is found to contain the effects of invisible transactions, then Oracle
recreates an older version of the page by undoing the effects of each such
transaction. It scans the undo records associated with each transaction and
applies them to the page until the effects of those transactions are removed.
The new page created this way is then used to access the tuples within it.

Record Header in Oracle
A row header never grows, always a fixed size. For non-cluster tables,
the row header is 3 bytes. One byte is used to store flags, one byte to
indicate if the row is locked (for example because it's updated but not
committed), and one byte for the column count.

MVCC in SQL Server
Snapshot isolation and read committed using row versioning are enabled
at the database level. Only databases that require this option must enable
it and incur the overhead associated with it.

Versioning effectively starts with a copy-on-write mechanism that is
invoked when a row is modified or deleted. Row versioning–based
transactions can effectively "view" the consistent version of the data
from these previous row versions.

Row versions are stored within the version store that is housed within the
tempdb database. More specifically, when a record in a table or index is
modified, the new record is stamped with the "sequence_number" of the
transaction that is performing the modification.
The old version of the record is copied to the version store, and the new record
contains a pointer to the old record in the version store.
If multiple long-running transactions exist and multiple "versions" are required,
records in the version store might contain pointers to even earlier versions of
the row.

Version store cleanup in SQL Server
SQL Server manages the version store size automatically, and maintains a
cleanup thread to make sure it does not keep versioned rows around longer
than needed. For queries running under Snapshot Isolation, the version
store retains the row versions until the transaction that modified the data
completes and the transactions containing any statements that reference the
modified data complete. For SELECT statements running under
Read Committed Snapshot Isolation, a particular row version is no longer
required, and is removed, once the SELECT statement has executed.

If tempdb actually runs out of free space, SQL Server calls the cleanup
function and will increase the size of the files, assuming we configured the
files for auto-grow. If the disk gets so full that the files cannot grow,
SQL Server will stop generating versions. If that happens, any snapshot
query that needs to read a version that was not generated due to space
constraints will fail.

Record Header in SQL Server
4 bytes long
- two bytes of record metadata (record type)
- two bytes pointing forward in the record to the NULL bitmap. This is
offset to some actual data in record (fixed length columns).

Versioning tag - this is a 14-byte structure that contains a timestamp
plus a pointer into the version store in tempdb.
Here timestamp is trasaction_seq_number, the only time that rows get
versioning info added to record is when it’s needed to support a
versioning operation.

As the versioning information is optional, I think that is the reason
they could store this info in index records as well without much
impact.

Database	PostgreSQL	Oracle	SQL Server
Storage for Old Versions	In the main Segment (Heap/Index)	In the separate segment (Rollback Segment/Undo)	In the separate database (tempdb – known as version store)
Size of Tuple Header (bytes)	24	3	Fixed – 4 Variable - 14
Clean up	Vacuum	System Monitor Process (SMON)	Ghost Cleanup task

Conclusion of study
As other databases store version/visibility information in index, that makes
index cleanup easier (as it is no longer tied to heap for visibility information).
The advantage for not storing the visibility information in index is that for
Delete operations, we don't need to perform an index delete and probably the
size of index record could be somewhat smaller.

Oracle and probably MySQL (Innodb) needs to write the record in undo
segment for Insert statement whereas in PostgreSQL/SQL Server, the new
record version is created only when a row is modified or deleted.

Only changed values are written to undo whereas PostgreSQL/SQL Server
creates a complete new tuple for modified row. This avoids bloat in the main
heap segment.

Both Oracle and SQL Server has some way to restrict the growth of version
information whereas PostgreSQL/PPAS doesn't have any way.

16 comments:

volk17 March 2015 at 11:38
AFAIK, Oracle does not write inserted row o undo, only updated or deleted, as in PostgreSQL.
Amit Kapila17 March 2015 at 20:20
Volk, there are some advantages and some disadvantages for each of the method. One advantage of using Vacuum based method is that for Delete operation, there is no overhead of maintaining undo and also Rollback is very cheap in PostgreSQL. On the other hand, the advantage of maintaining undo is that it helps to control bloat (due to multiple versions) in a much better way and it seems that's what majority people care.
Amit Kapila19 March 2015 at 07:23
Yes the link will be stored from old to new tuple.
There is no direct ability in PostgreSQL for "AS OF TIMESTAMP" (aka Flashback in Oracle), however you can once look at few timetravel functions at link : http://www.postgresql.org/docs/devel/static/contrib-spi.html and see if that can serve your purpose.
Anonymous27 March 2015 at 17:35
24 -- too wide.
tableoid - why do we store per each tuple? wouldn't it be better/possible to store tableoid once per page instead?

have ideas?
Merlin Moncure1 June 2018 at 07:04
Is it possible to consider a hybrid approach? If the application can participate in the decision to use an undo log, you can exploit many factors (likelyhood of rollback, tolerance of rollback etc) that may be specific to the application that might tune performance.

Or, you could use rollback log for small transactions and start writing into the heap after a certain number of pages gets exceeded.
Amit Kapila7 March 2019 at 20:32
Good point, though this overhead should be very small, but if it happens for every other tuple, then it might matter. However, I don't expect any big difference due to this.

Tuesday 17 March 2015

Different Approaches for MVCC used in well known Databases

16 comments: