Recently at Scalathon I was made aware of this article, detailing a data structure for high-throughput data processing. I could not help thinking that the ideas expressed in that article sound very familiar: while doing my PhD research at the COMPASS experiment at CERN, I implemented an online event filter in 2004 which utilized the same data structure for the central stream buffer. I briefly described it in my PhD thesis (page 38ff) and an IEEE conference record for NSS/MIC 2007. We named the application “Cinderella” after the fairy tale, as its purpose was to sort the bad (i.e. uninteresting) events out of the stream of good ones before it hit the disk buffers.
The Cinderella project was written in plain C and weighed in at about 250kSLOC, of which the vast majority is concerned with reconstructing aspects of the physics events and filtering based on these data. The main buffer’s purpose was to decouple the continuous disk writing (the bottleneck at that time) from the SPS spill structure (4–5s of data production every 12–15s), leading to quite different timing profiles for writing to and reading from the queue.
The idea was the same as at LMAX: getting rid of locking as the central contention point. But my solution goes one step further than the disruptor, as I spread the synchronizing elements out across the whole data set, effectively giving each element its own lock-less protection. This is facilitated by a data structure 32 bytes wide, where 28 bytes are used for storing various time stamps, for attaching the raw data to be processed (stored in another ring buffer for variable-size elements) and for the results of the processing steps. The first 32 bits are the magic part: a state indicator, which always follows the same sequence of transitions (a rough sketch of the element layout follows after the list):
- the input thread waits for the next element to become “empty” and for data to arrive, then writes the data to the buffer and sets the element to “new”
- the filter thread waits for the next element to become “new”, performs the expensive reconstruction and filtering operation, enriches the data and sets the element to “accept” or “reject”
- the output thread waits for the next element to become “accept” or “reject”, takes it from the buffer and sets the element to “empty”, closing the circle.
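To make this more concrete, here is a minimal sketch in C of what such an element could look like. The field names and the exact partitioning of the 28 payload bytes are my illustration, not the original source; only the 32-byte size and the four-state flag follow the description above.

```c
#include <stdint.h>

/* Illustrative only: field names and the split of the 28 payload bytes are
 * assumptions; the 32-byte size and the state sequence are as described. */
enum state {
    EMPTY  = 0,   /* slot free, waiting for the input thread              */
    NEW    = 1,   /* raw data attached, waiting for a filter thread       */
    ACCEPT = 2,   /* filtered and kept, waiting for the output thread     */
    REJECT = 3    /* filtered and dropped, waiting for the output thread  */
};

struct element {                 /* 32 bytes in total                     */
    volatile uint32_t state;     /* the "magic" first 32 bits             */
    uint32_t raw_offset;         /* where the raw event sits in the       */
    uint32_t raw_size;           /* variable-size ring buffer             */
    uint32_t result;             /* enriched result of the filter step    */
    uint32_t t_arrival;          /* various time stamps                   */
    uint32_t t_filtered;
    uint32_t t_written;
    uint32_t reserved;
};
```

The point of this shape is that every slot carries its own state flag, so no counter shared between the threads ever needs to be consulted.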
The filtering was actually performed by two threads (the system was built of dual-Opteron machines), which synchronized access to the buffer only between themselves using a mutex, whose overhead was not noticeable in comparison to the actual filtering cost. What I found is that bunching similar tasks increased throughput slightly; I implemented this by letting a thread that is waiting for its next block sleep in macroscopic quanta of 100ms. As described by Martin Fowler, this data structure works quite well for situations where single threads might fall behind and need to catch up, as they can just run without any contention whatsoever.
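Building on the element sketch above, the waiting and state-flipping of a filter thread might look roughly like this. It is a simplified single-consumer version (the real system ran two filter threads coordinated by a mutex), and the two helper functions are hypothetical stand-ins for the reconstruction and filtering code.

```c
#include <stddef.h>
#include <unistd.h>

#define RING_SIZE 4096           /* illustrative capacity                 */

extern struct element ring[RING_SIZE];

/* hypothetical stand-ins for the actual reconstruction/filtering code    */
extern void filter_event(struct element *e);
extern int  keep_event(const struct element *e);

static void filter_loop(void)
{
    size_t idx = 0;
    for (;;) {
        struct element *e = &ring[idx];

        /* wait for the input thread, sleeping in macroscopic 100ms quanta */
        while (e->state != NEW)
            usleep(100 * 1000);

        filter_event(e);                          /* the expensive part    */
        e->state = keep_event(e) ? ACCEPT : REJECT;

        idx = (idx + 1) % RING_SIZE;              /* walk the ring forward */
    }
}
```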
Comparison with Disruptor
The design of the disruptor is based on the same principle, but the entries of the ring buffer are not tagged by a status flag; instead each thread manages its own pointer into the ring, which it only ever advances. The check whether a thread may proceed hence has to read memory which another thread potentially just wrote to, leading to inefficiencies if multiple such pointers lie in the same cache line of the CPU (cache line bouncing), which depending on the exact circumstances can completely kill performance. For this reason Trisha explains the magic cache line padding, which luckily also works through all the layers of the Hotspot JVM (just imagine what would happen if it could prove that these fields are never accessed and optimized them away?).
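For comparison, the general padding trick looks like this when expressed in C (the disruptor itself is Java, so this is only an illustration of the principle, not its actual code): each independently written sequence counter gets a 64-byte cache line to itself.

```c
#include <stdint.h>

/* One per-thread sequence pointer per 64-byte cache line, so that
 * advancing one counter does not invalidate the line holding another. */
struct padded_sequence {
    volatile uint64_t value;
    uint8_t pad[64 - sizeof(uint64_t)];
} __attribute__((aligned(64)));  /* GCC/Clang syntax for the alignment    */
```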
This technique is not necessary in my case, since no such central memory locations exist: each thread sets various fields on the same cache line and finishes the state transition by writing to the flag field. The cache line is dirtied only once and not touched again (modulo the fact that two elements share one line, but they are written close together in time anyway), and the CPUs can actually do a rather good job of prefetching the next elements, as all accesses walk memory in a contiguous fashion (except for the wrap-around, obviously).
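Put into code, the input thread’s side of such a transition might look as follows. The explicit compiler barrier is my addition for the sketch (the original predates C11 atomics, and on the Opterons of that time the hardware keeps stores in program order, so only the compiler has to be kept from reordering them).

```c
/* Fill the payload fields of the element layout sketched above, then flip
 * the state flag as the very last store.  Field names as assumed earlier. */
static void publish_event(struct element *e, uint32_t raw_offset,
                          uint32_t raw_size, uint32_t t_arrival)
{
    e->raw_offset = raw_offset;
    e->raw_size   = raw_size;
    e->t_arrival  = t_arrival;

    __asm__ __volatile__("" ::: "memory");  /* compiler barrier (GCC)      */

    e->state = NEW;   /* the single flag write completes the transition    */
}
```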
Conclusion
So, I guess the LMAX people do not really think that they were the first to utilize such a data structure, do they? I for sure would not exclude the possibility that this rather obvious idea was implemented before I started working on it at the end of 2003.