Is the BTree an efficient structure to model an order book?
Introduction
The order book is an interesting object to model: it is likely to be dense with orders near the top of the book and sparser as we walk outwards towards the “stink bids” that are unlikely to ever fill. While reading Database Internals: A Deep-Dive into How Distributed Data Systems Work¹, its low-level discussion of BTree internals piqued my curiosity about how efficient a BTree really is for modelling an orderbook. My main observations are:
- It was presented as a disadvantage that file-based databases have to marshal their data structures explicitly on disk, while in-memory databases only have to `malloc()` memory chunks from an allocator. However, if the memory returned by the allocator is fragmented, the cache efficiency of traversing the BTree structure suffers.
- The search path from the root node to the leaves involves repeatedly searching sorted lists, which is a lot of work for the CPU's branch predictor. Incorrectly guessed branches equate to wasted computation.
- Mutations and admin tasks, where nodes are split and merged dynamically, suggest unpredictable latency spikes. This cost is “amortized” in the CPU complexity of the structure; it would be unfortunate for us if it appeared during a period of high market intensity.

One parameter we can tune to attempt to optimise the points above is `B`, which controls the number of entries in each BTree node. If we increase it we can leverage bigger nodes for memory cache efficiency through data locality, at the cost of some extra work when searching and mutating. Unfortunately the Rust std implementation does not let you specify this parameter; it is hardcoded at 16.
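For context, here is a minimal sketch of the BTreeMap-based book side that serves as the baseline idea (integer price ticks as keys, aggregate quantity as values; the type and helper names are illustrative, not taken from the benchmark code):

```rust
use std::collections::BTreeMap;

/// One side of the book: price tick -> aggregate quantity resting at that level.
struct BTreeBookSide {
    levels: BTreeMap<u64, u64>,
}

impl BTreeBookSide {
    fn new() -> Self {
        Self { levels: BTreeMap::new() }
    }

    /// Apply an L2 update; a quantity of zero clears the level.
    fn apply(&mut self, price_tick: u64, qty: u64) {
        if qty == 0 {
            self.levels.remove(&price_tick);
        } else {
            self.levels.insert(price_tick, qty);
        }
    }

    /// Best level for an ask side (lowest price). A bid side would use
    /// `next_back()` to take the highest price instead.
    fn best(&self) -> Option<(u64, u64)> {
        self.levels.iter().next().map(|(p, q)| (*p, *q))
    }
}
```

Every insert and removal here can split or merge nodes, which is exactly the unpredictable bookkeeping described in the observations above.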
An alternative - StableVec
Since we have a predictable range of keys (price ticks), can we throw memory at the problem and pre-allocate a cache-friendly contiguous array with a slot for every possible price?
The StableVec library allows us to allocate a contiguous array with “stable” keys, stable meaning the memory they point to never changes (unlike a Vec, where indexes shift as items are removed). When an entry is removed it is simply flagged as deleted, so we never pay the cost of shifting later entries backwards as we would with a standard Vec. This is conceptually the same as allocating a large Vec of Options and setting values to `None` instead of removing them, for example `Vec::<Option<T>>::with_capacity(N)`.
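As a rough illustration of that idea (this is the conceptual `Vec<Option<T>>` stand-in, not the StableVec API itself, and the names are made up):

```rust
/// Conceptual stand-in for the StableVec approach: one pre-allocated slot per
/// possible price tick, with `None` marking an empty level.
struct FlatBookSide {
    levels: Vec<Option<u64>>, // index = price tick, value = aggregate quantity
}

impl FlatBookSide {
    fn new(max_tick: usize) -> Self {
        Self { levels: vec![None; max_tick] }
    }

    /// Removing a level just flags the slot empty; nothing is shifted.
    fn apply(&mut self, price_tick: usize, qty: u64) {
        self.levels[price_tick] = if qty == 0 { None } else { Some(qty) };
    }
}
```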
The orderbook structure is useless if we can't get timely snapshots of the data to feed to our trading logic. After some research it becomes clear that the generic StableVec `Iterator` is too inefficient for retrieving our existing data: it implements a naive scan from index 0 up to the capacity of the structure, which is wasteful given how sparse our data is:
```rust
unsafe fn first_filled_slot_from(&self, idx: usize) -> Option<usize> {
    debug_assert!(idx <= self.cap());

    (idx..self.len()).find(|&idx| self.has_element_at(idx))
}
```
In this proof of concept we have a significant capacity of 1_000_000_000 slots. To work around this we maintain tracking pointers which continually identify the best bid and best ask indexes. When we generate a snapshot we skip the library Iterator and instead walk outward from these indexes manually.
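A sketch of that walk-outward snapshot (hypothetical names, shown against the plain `Vec<Option<u64>>` stand-in from above rather than the real StableVec types):

```rust
/// Collect the top `depth` ask levels by walking upward from the tracked
/// best-ask index, skipping empty slots instead of scanning from index 0.
fn snapshot_asks(levels: &[Option<u64>], best_ask: usize, depth: usize) -> Vec<(usize, u64)> {
    let mut out = Vec::with_capacity(depth);
    let mut tick = best_ask;
    while out.len() < depth && tick < levels.len() {
        if let Some(qty) = levels[tick] {
            out.push((tick, qty));
        }
        tick += 1;
    }
    out
}
```

The bid side is symmetric, walking downward from the tracked best-bid index. Because populated levels cluster around the top of the book, the walk should only touch a handful of nearby slots.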
Profiling
Profiling the orderbook message processing logic
To benchmark we process a day's worth of Binance L2 price updates for BTCUSDT into each orderbook implementation (141,327,104 bid and ask updates). We capture the elapsed time and look at the detailed output of the `perf` command.
| Implementation | Elapsed |
|---|---|
| BTree | 12.9s |
| StableVec | 5.9s |
We see a significant end-to-end speed-up and, as expected, cache-misses and branches are notably reduced.
BTreeMap perf output
```
Performance counter stats for 'target/release/btree-ob-eval b':

      3,681,556,457      cache-references
        202,381,615      cache-misses          # 5.497 % of all cache refs
    224,968,394,162      cycles
    595,727,002,465      instructions          # 2.65 insn per cycle
    140,040,270,447      branches
          2,761,188      faults
                 38      migrations
```
StableVec perf output
```
Performance counter stats for 'target/release/btree-ob-eval s':

      2,781,531,040      cache-references
        132,323,303      cache-misses          # 4.757 % of all cache refs
    196,283,494,716      cycles
    544,535,582,371      instructions          # 2.77 insn per cycle
    129,142,369,121      branches
          2,785,045      faults
                 33      migrations
```
Profiling the orderbook snapshotting logic
When benchmarking the two implementations we take the mean of five `get_snapshot()` elapsed-time measurements (with a prior call intended to warm the cache). Snapshot sizes are selected as multiples of the BTreeMap's `B` parameter (16), as the memory fragmentation of the BTreeMap implementation likely scales roughly in multiples of this. The results are below:
| Implementation | snapshot(32) | snapshot(64) | snapshot(128) | snapshot(256) |
|---|---|---|---|---|
| BTree | 698.866µs | 711.125µs | 669.833µs | 745.166µs |
| StableVec | 524ns | 825ns | 1.641µs | 3.075µs |
Conclusion
The StableVec implementation has proven itself a superior alternative, notably in the bulk-load benchmark and more strikingly in the snapshot benchmark, where it is consistently orders of magnitude faster than the BTree alternative. I believe this is due to the cache efficiency of the pre-allocated structure.
In terms of tuning the BTreeMap there are some interesting avenues we've not explored. One is a study of tuning the `B` parameter on an implementation which allows it. Another is to provide the BTreeMap with a custom allocator (such as a bump allocator), allowing it to allocate nodes from the same block of contiguous memory even over a long lifetime, exploiting memory locality.
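As a rough, untested sketch of what that second avenue might look like on nightly Rust, assuming bumpalo's `allocator_api` cargo feature and the unstable `btreemap_alloc` std feature:

```rust
#![feature(allocator_api, btreemap_alloc)] // nightly-only at the time of writing

use std::collections::BTreeMap;
use bumpalo::Bump; // assumes bumpalo built with its `allocator_api` feature

fn main() {
    // All BTree nodes are carved out of one contiguous bump-allocated arena,
    // so they stay close together in memory for the lifetime of the book.
    let arena = Bump::new();
    let mut bids: BTreeMap<u64, u64, &Bump> = BTreeMap::new_in(&arena);
    bids.insert(50_000, 3);
    bids.insert(49_999, 7);
}
```

One trade-off to keep in mind is that a bump arena never reuses freed nodes, so memory consumption only grows over the book's lifetime.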
Profiling code is provided here
¹ Database Internals: A Deep-Dive into How Distributed Data Systems Work – https://www.databass.dev/