Go data-structure tricks: google/btree


I’ve been studying B-Tree implementations in Golang of late, and the first one I came across was Google’s btree package. It’s entirely in-memory and is meant to serve as a more efficient, drop-in replacement for the llrb package, which implements a Left-Leaning Red-Black tree. Before going further, if you aren’t familiar with B-Trees, the following video (12 min) provides a great introduction; Wikipedia does too.

Memory allocation optimization #

The first optimization used is the FreeList. From the documentation, the FreeList is described as follows:

FreeList represents a free list of btree nodes. By default each BTree has its own FreeList, but multiple BTrees can share the same FreeList.

If you’ve dabbled in C/C++ before, the FreeList here is somewhat similar to the pool allocation strategy: in order to reduce the cost of mallocs and memory fragmentation, a large chunk of memory is pre-allocated, and the programmer then slices it up manually. So that this doesn’t end up as an ad-hoc re-implementation of malloc, the application is required to request memory in fixed-size blocks. Once the application is done with a block, it can be stored back into a list, a free-list if you will, and made available for later use.
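To make the idea concrete, here’s a rough sketch of a fixed-block pool in Go (purely illustrative, not code from google/btree): blocks are pre-allocated up front and recycled through a free-list instead of being released back to the allocator.

package main

import "fmt"

const blockSize = 4096

type block [blockSize]byte

// pool hands out fixed-size blocks and recycles them via a free-list.
type pool struct {
	free []*block
}

// newPool pre-allocates n blocks up front.
func newPool(n int) *pool {
	p := &pool{free: make([]*block, n)}
	for i := range p.free {
		p.free[i] = new(block)
	}
	return p
}

// get returns a recycled block if one is available, otherwise allocates a new one.
func (p *pool) get() *block {
	if n := len(p.free); n > 0 {
		b := p.free[n-1]
		p.free = p.free[:n-1]
		return b
	}
	return new(block)
}

// put hands a block back to the pool for later reuse.
func (p *pool) put(b *block) {
	p.free = append(p.free, b)
}

func main() {
	p := newPool(8)
	b := p.get()             // served from the pre-allocated pool
	p.put(b)                 // recycled rather than discarded
	fmt.Println(len(p.free)) // 8
}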

However, google/btree’s FreeList diverges from the pool allocation strategy quite a bit. For one, nodes are not pre-allocated at the start; rather, the FreeList fills up only when nodes are deleted, which occurs during merging. Since there isn’t a pre-allocated pool, when the btree requests a new node (such as during splitting) and the free-list is empty, new is called directly. However, if there is an unused node lying around, it’s returned instead:

// newNode hands out a node from the free list if one is available,
// otherwise it allocates a fresh one.
func (f *FreeList) newNode() (n *node) {
	f.mu.Lock()
	index := len(f.freelist) - 1
	if index < 0 {
		// free-list is empty: allocate directly
		f.mu.Unlock()
		return new(node)
	}
	// pop the last node off the free-list and clear the slot
	// so the slice no longer retains a reference to it
	n = f.freelist[index]
	f.freelist[index] = nil
	f.freelist = f.freelist[:index]
	f.mu.Unlock()
	return
}

The number of nodes held in a free-list is capped at its capacity, as we see in the freeNode method below.

// freeNode adds the given node to the list, returning true if it was added
// and false if it was discarded.
func (f *FreeList) freeNode(n *node) (out bool) {
	f.mu.Lock()
	if len(f.freelist) < cap(f.freelist) {
		f.freelist = append(f.freelist, n)
		out = true
	}
	f.mu.Unlock()
	return
}

By default, this capacity is 32, but it can be tuned based on the application’s btree usage patterns. Capping the length of the free-list seems to be a mitigation against errant usage: if it were unbounded, an attacker, or simply faulty code, could insert numerous entries and then delete most of them, leaving only one or two behind. All of the deleted nodes would end up on the free-list, and those chunks of memory would remain unavailable to the rest of the application even though the btree won’t be using them any time soon.
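For instance, assuming the non-generic v1 API, the capacity can be bumped up via NewFreeList, and a single FreeList can even be shared across trees with NewWithFreeList:

package main

import (
	"fmt"

	"github.com/google/btree"
)

func main() {
	// Raise the free-list capacity from the default 32 to 128 and
	// share the same FreeList between two trees.
	fl := btree.NewFreeList(128)
	t1 := btree.NewWithFreeList(2, fl) // degree 2, purely for illustration
	t2 := btree.NewWithFreeList(2, fl)

	for i := 0; i < 100; i++ {
		t1.ReplaceOrInsert(btree.Int(i))
		t2.ReplaceOrInsert(btree.Int(i))
	}
	fmt.Println(t1.Len(), t2.Len()) // 100 100
}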

Concurrency optimization #

On concurrency, the approaches highlighted here may or may not be optimizations, depending on the application’s usage patterns. The documentation isn’t quite specific about how one should handle concurrent access. The most it mentions is the following:

Write operations are not safe for concurrent mutation by multiple goroutines, but Read operations are.

In some way, this is great. From the source code, all reads proceed without any locking involved. However, the documentation doesn’t quite specify how one should synchronize or even organize writes; this seems to be left up to the end-user.

Given the way the code is structured, a single writer with multiple concurrent readers will not work, since it leads to data races. Just to confirm that this is the case with Go’s race detector, I wrote and ran a quick script, which you can check out.
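The gist of it is something like the following sketch (not necessarily the exact script): one goroutine writes while another reads, and running it with go run -race produces a data race report.

package main

import (
	"sync"

	"github.com/google/btree"
)

func main() {
	t := btree.New(2)
	var wg sync.WaitGroup
	wg.Add(2)

	// writer: a single goroutine inserting entries
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			t.ReplaceOrInsert(btree.Int(i))
		}
	}()

	// reader: concurrent lookups against the same tree
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			t.Get(btree.Int(i))
		}
	}()

	wg.Wait() // `go run -race main.go` flags the races
}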

One option for synchronizing concurrent reads interleaved with writes is simply to tack on readers-writer locks wherever necessary and call it a day.
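A minimal wrapper along these lines would do; lockedBTree and its method set here are my own, not something the package provides:

package main

import (
	"fmt"
	"sync"

	"github.com/google/btree"
)

// lockedBTree guards every operation with a readers-writer lock.
type lockedBTree struct {
	mu sync.RWMutex
	t  *btree.BTree
}

func (l *lockedBTree) Get(k btree.Item) btree.Item {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.t.Get(k)
}

func (l *lockedBTree) ReplaceOrInsert(it btree.Item) btree.Item {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.t.ReplaceOrInsert(it)
}

func main() {
	lt := &lockedBTree{t: btree.New(2)}
	lt.ReplaceOrInsert(btree.Int(42))
	fmt.Println(lt.Get(btree.Int(42))) // 42
}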

A better option, one that doesn’t involve modifying the source code directly or wrapping each and every public method, is to use the Clone method provided to implement a somewhat hackish form of ‘snapshot isolation’. From the documentation:

Clone clones the btree, lazily. Clone should not be called concurrently, but the original tree (t) and the new tree (t2) can be used concurrently once the Clone call completes.

Reads can use the original ‘snapshot’, and synchronized writes can go into the new clone. Once the writes are complete, new reads then shift to the new clone. Furthermore, given the way the Clone function is structured, modifications made to the original afterwards are not visible to any of its clones; all a clone sees is the old ‘snapshot’ from just before cloning took place. This works because a Clone in fact results in 3 trees: the original itself becomes a clone, and it ‘shares’ its old nodes (as the actual ‘original’ tree) with the new clone. The old nodes are thereby rendered immutable, since neither the original nor the clones own them.
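Sketched out, and assuming a single writer whose writes are already serialized, the pattern might look something like this; the atomic.Value plumbing is mine, not something the package provides:

package main

import (
	"fmt"
	"sync/atomic"

	"github.com/google/btree"
)

// current holds the tree that readers should use (*btree.BTree).
var current atomic.Value

func read(k btree.Item) btree.Item {
	t := current.Load().(*btree.BTree) // readers use whichever snapshot is published
	return t.Get(k)
}

// write assumes a single, externally-serialized writer.
func write(items ...btree.Item) {
	t := current.Load().(*btree.BTree)
	clone := t.Clone() // writes go into the clone; the old snapshot stays untouched
	for _, it := range items {
		clone.ReplaceOrInsert(it)
	}
	current.Store(clone) // new reads now shift to the updated tree
}

func main() {
	current.Store(btree.New(2))
	write(btree.Int(1), btree.Int(2))
	fmt.Println(read(btree.Int(1)), read(btree.Int(2))) // 1 2
}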

However, I’m probably using the wrong term here, since this approach doesn’t quite provide the guarantees of snapshot isolation. For example, if you hold a reference to an Item that’s currently stored within one of the nodes and modify it, both the original and the clone will see the change.
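A quick illustration of that caveat, using a hypothetical pointer-based Item type of my own:

package main

import (
	"fmt"

	"github.com/google/btree"
)

// record is a hypothetical Item whose value can be mutated in place.
type record struct {
	key int
	val string
}

func (r *record) Less(than btree.Item) bool {
	return r.key < than.(*record).key
}

func main() {
	orig := btree.New(2)
	r := &record{key: 1, val: "old"}
	orig.ReplaceOrInsert(r)

	clone := orig.Clone()
	r.val = "new" // mutating the shared Item, not the tree structure

	fmt.Println(orig.Get(&record{key: 1}).(*record).val)  // "new"
	fmt.Println(clone.Get(&record{key: 1}).(*record).val) // "new"
}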

Lastly, I’d like to highlight an interesting fork of google/btree, tidwall/btree; do check it out if you’re interested. That’s all for now.