Advance Data Structures for Data Engineering — Part III
Part III of Data Structures for Data Engineers
Write Ahead Log (WAL)
Write Ahead Log is one of those data structures that is very popular in SQL and NoSQL databases. At its core, it is a simple append-only log of changes that are made to a database.
Write-ahead logging (WAL) is a database technique used to ensure atomicity and durability, which are two key ACID properties. In addition to ensuring that all changes to a database will be committed even in case of system failures, WAL also plays a crucial role in maintaining data durability. By recording modifications in a log before writing them to the database itself, WAL ensures that the database can recover to a consistent state after a failure. It is also used to buffer writes to databases since actual writes to databases are often performed in batches at some time later.
Databases like Postgres, RocksDB, and MySQL, all use a WAL under the hood. Even Spark uses WAL for Spark Streaming.
MemTable
MemTable is an in-memory data structure that is used to hold reads and writes until they are flushed to disk-based SSTables. The Memtable serves as an in-memory write-back cache for recent write operations
MemTable handles both read and write operations:
- New writes are always inserted into the Memtable
- While reads must query the Memtable first before checking the SSTable files, as the data in the Memtable is newer.
When a memtable becomes full, it is marked as immutable and replaced by a new one. A background process then flushes the contents of the Memtable to an SST file, after which the old Memtable can be safely discarded.
Implementations
- Skiplist MemTable
- HasghSkipList MemTable
- LinkedList MemTable
- Vector
RocksDB and LevelDB are some examples of databases that use a MemTable.
BitMap Indexes
BitMap indexes use bitmaps (array of 0s and 1s) to efficiently store and query data. It is particularly useful for columns with low cardinality of distinct value.
Bitmap indexes use a series of bits to represent the presence or absence of a value in each row. Each bit corresponds to a row in the table, and its value (0 or 1) indicates whether the row contains a specific value for the indexed column.
Databases use these bitmaps to run efficient queries using bitwise operations. It is much quicker to combine bitmaps using ANDs and ORs!
Many modern databases like Postgres provide support for BitMap indexes.
Prefix Trees
Prefix Tree (popularly known as Trie) is a data structure optimized to work with strings or text-based data. It creates a tree structure out of a text corpus and uses it to perform efficient prefix searches. Common features like autocomplete or spell-checking are more efficient when implemented with a prefix tree.
Add: O(n)
Search: O(n)
Prefix-search: O(n)
Remove: O(n)
Databases like Elasticsearch use Trie for building their inverted index structures.
Suffix Trees
A Suffix tree is yet another advanced structure that is used for full-text search capabilities. Suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees help in solving a lot of string-related problems like pattern matching, finding distinct substrings in a given string, finding longest palindrome etc.
Database engines like Elasticsearch and SQLite FTS that support full-text searching capabilities make heavy use of such trees.