Posts
Archiving Time-Series Data from PostgreSQL into Parquet
Keeping your database lean
Graph Query Interfaces: A Comparison Between SQL and Cypher
Featuring DuckDB & KuzuDB
LRU vs FIFO (with Lazy Promotion and Quick Demotion)
Sprinkling some lazy promotion and quick demotion on FIFO
Notes on 'A large scale analysis of hundreds of in-memory cache clusters at Twitter'
TTLs are prevalent, object sizes are small, metadata overhead can be large, object sizes change, FIFO is better than LRU, you’ve got to address memory fragmentation
Some Notes on Vector Indexing in DuckDB
Once you’ve indexed your vectors for similarity search, be sure to check your query plans, just in case the DB decides to opt for a sequential scan
Combining Lexical and Semantic Search with Reciprocal Rank Fusion
Best of both worlds sort of thing
Vector Indexing and Search with DuckDB & FastEmbed
Using DuckDB for vector/semantic search
Leanstore: High Performance Low-Overhead Buffer Pool
A dash of pointer swizzling, a sprinkle of optimistic locking and a touch of lean eviction, that’s the secret to a high performance buffer pool!
Tiered Storage via 2-Tree
Split a data-structure into two: a memory-optimized ‘top’-tree and a disk optimized ‘bottom’-tree. Implement a lightweight migration protocol for hot records to move up and cold records down.
Pointer Swizzling Buffer Pools
Switching pointers as the data they point to moves back and forth between memory and secondary storage
Compacting Transactional Data in HyPer
Keep hot tuples uncompressed, organize cold data into chunks of columns then use lightweight compression, handle both OLTP and OLAP workloads
Virtual Memory Hot/Cold Data Re-organization for OLTP
Hot/Cold aware memory allocation with locking of the hot region to physical memory and letting the OS swap out cold LRU pages as needed.
Offline (but Faster and more Accurate) Classification of Hot and Cold Data
Hint: it’s based on exponential smoothing
Anti-Caching
Track hot/cold data at tuple-level granularity. Evict LRU cold data in blocks.
Larger-Than-Memory Data Management
For when the database exceeds the main memory size
Hybrid Locking & Synchronization
Fast-path optimistic locking with fallback to pessimistic RW locks under contention
Optimizing Data Placement for Distributed OLAP Systems
Using MIP solvers to model and optimize shard placement
DuckDB JIT Compiled UDFs with Numba
JIT compiling your vectorized UDFs with Numba. Plus pure SQL is plenty fast if you can figure out how to write it
Guided Local Search for the Capacitated Facility Location Problem
Overview of Guided Local Search plus how it can be applied to the capacitated facility location problem.
Minizinc: Alternative Modeling Approaches for the Facility Location Problem
Multiple views and Channeling Constraints make for faster models (in some cases)
The Facility Location Problem
Discrete Optimization: Where to construct facilities so as to minimize setup costs and customer servicing costs while ensuring each facility is able to meet customer demands.
Vectorized DuckDB UDFs with Rust and Python FFI
Implementing vectorized UDFs in Rust that you can use in DuckDB, with a little help from Arrow
Optimizing CPU & Memory Interaction: Matrix Multiplication
Same algo, different memory access patterns, what could go wrong (or right)!
x86 Cache Control Instructions
Wherein the OS and user get more control over the L1, L2 and L3 caches, mostly for performance.
Retrieving Memory and Cache Organization
Memory, Cache levels, Cache sizes, TLB, associativity and so on
Microbenchmarking: Way more than I set out to know
RDTSC, Out-of-order execution, OS interrupts, cycles, frequency and more.
Logging in Go
Best practices, Logging levels, structured logging, Logging & Telemetry (Metrics, Tracing), Audit logs
Handling Missing Values in Timeseries Datasets
Filling gaps using Last observation carried forward, next observation carried backwards, median & linear interpolation
Wrangling JSON with DuckDB
For when you need to impose some structure on semi-structured data
Parquet + Zstd: Smaller faster data formats
Often, parquet files have to be compressed. For fast compression, use LZ4 or Snappy. For the highest data compression ratio, use brotli. For both, zstd
Lateral Joins & Iterators in SQL
Sneaking for-loops into SQL without anyone noticing
SQL Grouping sets, Rollups & Cube
Computing multiple Group-bys with fewer steps
Programmatically creating a DuckDB table from an Arrow schema
PyArrow lets you create an empty table. Use that instead of custom mappings to create a DuckDB schema.
Handling panics from goroutines you've spawned
It’s one thing to handle a panic that’s occurred within a function. It’s an entirely different affair to handle a panic that occurred within a goroutine that’s been spawned.
Benchmarking SQLite inserts
Going from 750 writes per second to 25,0000 with a bit of configuring
Getting started with TLA+
End-to-End Arguments in System Design
Distributed Reference Counting
Malloc excess bytes
Space requested from malloc is a lower bound (at least or more)
Go Channels Suffice for Synchronization
Or how to implement Futures/Promises in Go without having to juggle locks and waitgroups
Speeding up unique constraint checks in Postgres... or not
Are exclusion constraints using hash indexes faster than plain old uniqueness checks? Let’s find out
Bugs From ignoring C Operator Associativity
This is a quick post and it’s here more or less to serve as a reminder to myself, in case I make the same mistake again.
Hello World
Excerpt