For when the database exceeds the size of main memory

According to [1], the ideal DBMS is one that “has high performance, characteristic of main-memory databases, when the working set fits in RAM” and “the ability of a traditional disk-based database engine to keep cold data on the larger and cheaper storage media, while supporting, as efficiently as possible, the infrequent cases where data needs to be retrieved from secondary storage”.

To realize a system that achieves both requirements, there are two approaches: (1) take a disk-based database and optimize it for the former, or (2) take a main-memory database and optimize it for the latter.

The first approach would require a full architectural rework. This is because there isn’t one single glaring bottleneck: everything from the buffer pool manager to the recovery component contributes significant overhead, with ‘useful work’ comprising only a small fraction of the instructions processed in a query (figure from ‘OLTP Through the Looking Glass, and What We Found There’ [2]):

Figure 1: breakdown of instructions in an OLTP query, from [2]

Hence the current tendency to favour the second approach [8,9]: take highly optimized main-memory systems and extend them to handle larger-than-memory workloads. This is key not just for resource flexibility but also for lowering costs: DRAM is still quite costly, while SSDs have gotten drastically cheaper and offer ever better performance (IOPS). There’s also data skew - some of the data tends to be hot while most of it stays cold. Therefore, it makes more economic sense to keep hot data in main memory and only page in cold data from secondary storage when needed [6,4,5]. Since most queries will hit the hot memory-resident data, the overall impact on performance should be minimal.
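
To make the economics a bit more concrete, here is a back-of-the-envelope rendering of the five-minute rule [4,5] in Python. The formula comes from the cited papers; the SSD price, IOPS figure, and per-page DRAM price in the example are assumptions I’ve picked purely for illustration.

```python
# A back-of-the-envelope version of Gray's five-minute rule [4,5]:
# keep a page in RAM if it is re-read more often than the break-even
# interval at which the RAM it occupies costs as much as the storage
# throughput needed to fetch it on demand.

def break_even_interval_seconds(
    device_iops: float,          # random reads/sec the storage device sustains
    device_price: float,         # price of that device, in dollars
    ram_price_per_page: float,   # price of enough RAM to hold one page, in dollars
) -> float:
    """Access interval at which caching a page and re-reading it cost the same."""
    price_per_access_per_second = device_price / device_iops
    return price_per_access_per_second / ram_price_per_page

# Assumed, illustrative numbers: an NVMe SSD sustaining 500k IOPS at $200,
# and DRAM at roughly $0.00002 per 4 KB page.
print(break_even_interval_seconds(500_000, 200.0, 0.00002))  # -> 20.0 seconds
```

With those assumed numbers, any 4 KB page re-read more often than roughly every 20 seconds is cheaper to keep in DRAM - which is exactly why hot data belongs in memory while cold data can live on secondary storage.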

There are several considerations when it comes to larger-than-memory approaches: how should the system tell hot data from cold (offline periodic analysis vs online tracking), at what granularity should it do so (page-level or record-level), how is the hot/cold metadata maintained, how should queries that touch cold data be handled (synchronous retrieval or abort-and-restart), and so on [7,8]. This is why I’ve decided to do an informal survey of all the ‘larger-than-memory’ techniques out there in the next couple of posts, so that I can understand the trade-offs and get a glimpse of the cutting edge. Do stay tuned!
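
Before diving into the individual papers, here’s a deliberately simplified sketch of one point in that design space: online, record-level hot/cold classification using LRU ordering, with cold records retrieved synchronously (and re-heated) on access. None of the class or variable names below come from any surveyed system; it’s just a toy to anchor the terminology.

```python
from collections import OrderedDict

class HotStore:
    """Toy record-level hot/cold split: hot records in memory, cold ones
    in a stand-in 'cold store' that plays the role of secondary storage."""

    def __init__(self, capacity: int, cold_store: dict):
        self.capacity = capacity      # max number of records kept hot
        self.hot = OrderedDict()      # key -> record, maintained in LRU order
        self.cold = cold_store        # hypothetical secondary-storage tier

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)         # touch: record stays hot
            return self.hot[key]
        # Cold access handled by synchronous retrieval; the alternative
        # would be to abort the query and restart it once the record is loaded.
        record = self.cold.pop(key)
        self.put(key, record)                 # re-heat the record
        return record

    def put(self, key, record):
        self.hot[key] = record
        self.hot.move_to_end(key)
        if len(self.hot) > self.capacity:
            k, r = self.hot.popitem(last=False)   # evict the least recently used record
            self.cold[k] = r                      # ...and demote it to the cold tier
```

The systems in the list below make different choices along exactly these axes: eviction granularity, where the classification runs, and what happens when a query touches evicted data.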

Update: I’m done! Here’s the list:

  1. Anti-Caching
  2. Offline Classification of Hot and Cold Data
  3. Hot/Cold Data-Reorganization in Virtual Memory for efficient OS Paging
  4. Compacting Transactional Data in HyPer DB
  5. Utilizing Pointer Swizzling in Buffer Pools
  6. Tiered Storage via 2-Tree
  7. Leanstore: High Performance Low-Overhead Buffer Pool

References

  1. Enabling Efficient OS Paging for Main-Memory OLTP Databases - Radu Stoica, Anastasia Ailamaki
  2. OLTP through the looking glass, and what we found there - Stavros Harizopoulos, Daniel Abadi, Samuel Madden, Michael Stonebraker
  3. The End of an Architectural Era (It’s Time for a Complete Rewrite) - Michael Stonebraker et al
  4. The 5 Minute Rule for Trading Memory for Disc Accesses and The 10 Byte Rule for Trading Memory for CPU Time - Jim Gray
  5. The Five Minute Rule 30 Years Later and Its Impact on the Storage Hierarchy - Raja Appuswamy, Goetz Graefe, Renata Borovica-Gajic, Anastasia Ailamaki
  6. Data Caching Systems Win the Cost/Performance Game - David Lomet
  7. Larger-Than-Memory Database Architectures - Andy Pavlo - CMU Advanced Database Systems Spring 2020
  8. Larger-than-Memory Data Management on Modern Storage Hardware for In-Memory OLTP Database Systems - Lin Ma et al
  9. Auto Tiering Offers Twice the Throughput at Half the Latency for Large Datasets - Alon Magrafta - Redis
  10. In-memory vs. disk-based databases: Why do you need a larger than memory architecture? - Andi Skrgat - Memgraph