The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer.
A Brief Introduction
One of the clearest advantages of using D compared to other typical data science workflows is that it compiles down into machine code. Without an interpreter or virtual machine layer, we can rip through data significantly faster than tools like a Java Hadoop framework, R, or Python would allow. But D’s compiler is fast enough that in many cases it can be run as if it were a scripting language. Let’s try comparing to Python by generating a million uniform random variates, sorting them, and finding the deciles:
And similarly for D:
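A minimal sketch of what such a script can look like in D, meant to be run directly with rdmd (illustrative rather than the exact script being timed):

```d
import std.algorithm : sort;
import std.random : uniform01;
import std.stdio : writeln;

void main() {
    // Draw a million uniform random variates.
    auto values = new double[1_000_000];
    foreach (ref v; values)
        v = uniform01();

    // Sort and report the deciles.
    sort(values);
    foreach (d; 1 .. 10)
        writeln(d * 10, "th percentile: ", values[d * values.length / 10]);
}
```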
Wait… what? It took longer, but that’s because I just ran it the once and it includes the compilation time. If I don’t change anything and run it again:
rdmd won’t bother to recompile if there’s no code change. These savings can add up quite significantly once your code becomes more computationally complex and you need to perform many computations over a substantial amount of data.
But of course there’s no real revelation here. If we want hyper-efficient code, we know that it’s best to drop into a compiled language. The key thing here that separates D from other efficient languages like the oft-suggested C or C++ is that D frees you to program in the style you feel most comfortable with at the given time.
The code above shows how we can write a quick “script” in D without incurring any additional headache over the simpler Python. But D is also just as clean if you want to start writing some object-oriented code:
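For instance, a toy class (purely illustrative, not from our codebase) reads about as cleanly as the script above:

```d
import std.stdio;

// A tiny accumulator class, just to show the object-oriented style.
class RunningStat {
    private size_t n;
    private double total = 0.0;

    void push(double x) {
        n++;
        total += x;
    }

    double mean() const {
        return n ? total / n : double.nan;
    }
}

void main() {
    auto stat = new RunningStat();
    foreach (x; [1.0, 2.0, 3.0, 4.0])
        stat.push(x);
    writeln(stat.mean()); // 2.5
}
```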
And D is as ready as you are for your high-performance needs. For example, if we want to calculate a fast inverse square root, we can use some pointer voodoo (very lightly modified from the linked Wikipedia article):
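A sketch of that routine in D, following the standard listing (the magic constant and single Newton step are the usual ones):

```d
import std.stdio : writeln;

// Fast inverse square root: reinterpret the float's bits as an int, apply the
// magic constant, then refine the guess with one round of Newton's method.
float fastInvSqrt(float number) {
    float x2 = number * 0.5f;
    float y = number;
    int i = *cast(int*) &y;        // the pointer voodoo
    i = 0x5f3759df - (i >> 1);
    y = *cast(float*) &i;
    y = y * (1.5f - x2 * y * y);
    return y;
}

void main() {
    writeln(fastInvSqrt(4.0f)); // approximately 0.5
}
```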
D will even let you write inline assembly if you really want to squeeze the most performance out of it. For the curious, DMD’s built-in x86 inline assembler looks like this (adapted from the example in the D language documentation; LDC and GDC use different inline-assembly syntax):
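```d
void main() {
    int x = 3;
    asm {
        mov EAX, x; // load x into a register
        add EAX, 4;
        mov x, EAX; // and store the result back
    }
    assert(x == 7);
}
```

But this is all just fun and games. How can D help in a real-world scenario?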
Ripping Through Data
During the course of our work at AdRoll, we had some infrastructure in D that was running fine for a while, but at a certain point, when our data problems exceeded the scope this code was designed for, we had to optimize. And, believe it or not, the problem was as banal as pulling some fields out of delimited files. This is the gist of what we did.
This particular log file contains some ad data delimited by the ASCII record separator. Let’s say we want to pull out some timestamps and the country whence these data come. As you can probably imagine by now, the naïve D solution is quite readable, but is perhaps not as snappy as we’d like:
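In sketch form, with placeholder field indices standing in for the real schema, it looks something like this:

```d
import std.array : split;
import std.stdio;

void main(string[] args) {
    auto file = File(args[1], "r");
    foreach (line; file.byLine()) {
        // Split on the ASCII record separator and pull out the fields we want,
        // e.g. a timestamp and a country code (placeholder indices).
        auto fields = line.split("\x1e");
        writeln(fields[2], "\t", fields[11]);
    }
}
```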
Thirteen seconds? That’s an eternity! This one log file is but a mere sample of what we’re producing at AdRoll: uncompressed, we generate about 130TB of log files each day. Ok, we could potentially distribute the problem, but spinning up clusters takes time, and one of our mottos is to “do more with less.” How can we improve the performance here and avoid having to scale out?
One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line. But even beyond that, we read through the line to put it into this char[], and then read through it again to split it by our record separator. This splitting creates an array of all of the fields in the line, allocating memory for each and every one of them. From an efficiency standpoint, this clean, straightforward code is actually quite a mess.
To address our first concern with memory allocation, we can have a buffer that’s already allocated to read into. The tricky part is that the next line will immediately follow the previous line in the same buffer, and the end of it may not fit into the buffer. Worse, we may only get a partial bit of a field that we care about.
The solution is to have two buffers which we swap between. The fields we care about may straddle both buffers, so we’ll also construct a double-length buffer for simple reconstruction. When we reach the end of the current buffer, we’ll make it the “last buffer,” and then load more data into the other buffer, promoting it to the current buffer. The caveat is that we need to make sure that none of our lines has a length longer than both of these buffers combined.
Our second concern had to do with the inefficiency of splitting. Instead of breaking apart the line in totality, once it’s in memory, let’s just sequentially read through it. We’ll keep track of our progress through both the buffer (index of the array) and our current line (number of delimiters we’ve seen). Once we hit the right number of delimiters, we just need to find the next delimiter, and we know our field is the contents in between. If we hit the end of a line by reading a newline, we’ll reset our line progress.
Finally, once we’ve collected all our fields, we just need to rip through the buffer until we find the next newline.
Enough chit-chat; let’s look at some code:
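What follows is a condensed sketch of our FastReader struct using the member and method names from the discussion below; the buffer size, field indices, and some of the finer details are illustrative assumptions rather than the exact production code:

```d
import std.stdio;

enum BUF_SIZE = 1 << 20;   // 1 MiB per backing buffer (an assumed size)
enum char DELIM = '\x1e';  // ASCII record separator

struct FastReader {
    File file;
    char[] bufferA, bufferB;      // the two backing buffers
    char[] double_buffer;         // reconstruction space for straddling fields
    char[] current_buffer, last_buffer;
    ulong num_buf;                // chunks loaded so far; num_buf % 2 picks the backing buffer
    size_t index;                 // progress through current_buffer
    int num_del;                  // delimiters seen so far on the current line
    bool end_of_line;             // have we already hit this line's newline?

    this(string path) {
        file = File(path, "r");
        bufferA = new char[BUF_SIZE];
        bufferB = new char[BUF_SIZE];
        double_buffer = new char[2 * BUF_SIZE];
        current_buffer = file.rawRead(bufferA);   // keep only the slice actually filled
    }

    void reset_progress() { num_del = 0; end_of_line = false; }

    // Make the current buffer the last one and load the next chunk into the other.
    void swap_and_load() {
        last_buffer = current_buffer;
        num_buf++;
        current_buffer = file.rawRead(num_buf % 2 == 0 ? bufferA : bufferB);
        index = 0;
    }

    // Only truly done once the file is exhausted and we've consumed everything we read.
    bool eof() { return file.eof && index >= current_buffer.length; }

    // Return the contents of 0-indexed field field_num on the current line.
    const(char)[] get_field(int field_num) {
        if (end_of_line) return "unknown";
        while (num_del < field_num) {             // count delimiters up to our field
            if (index >= current_buffer.length) {
                if (eof()) { end_of_line = true; return "unknown"; }
                swap_and_load();
                continue;
            }
            char c = current_buffer[index++];
            if (c == DELIM) num_del++;
            else if (c == '\n') { end_of_line = true; return "unknown"; }
        }
        size_t start = index, end = index;        // scan to the end of the field,
        bool swapped = false;                     // noting if we cross a buffer boundary
        while (true) {
            if (index >= current_buffer.length) {
                if (eof()) { end = index; break; }
                swap_and_load();
                swapped = true;
                continue;
            }
            char c = current_buffer[index];
            if (c == DELIM) { end = index++; num_del++; break; }
            if (c == '\n') { end = index++; end_of_line = true; break; }
            index++;
        }
        if (!swapped) return current_buffer[start .. end];
        // The field straddles both buffers: stitch it together in double_buffer.
        size_t head = last_buffer.length - start;
        double_buffer[0 .. head] = last_buffer[start .. $];
        double_buffer[head .. head + end] = current_buffer[0 .. end];
        return double_buffer[0 .. head + end];
    }

    // Rip through the rest of the line, then reset our per-line state.
    void advance_to_next() {
        while (!end_of_line) {
            if (index >= current_buffer.length) {
                if (eof()) break;
                swap_and_load();
                continue;
            }
            if (current_buffer[index++] == '\n') end_of_line = true;
        }
        reset_progress();
    }

    // Request fields in ascending order, then move on to the next line.
    void process_file() {
        while (!eof()) {
            writeln(get_field(2), "\t", get_field(11));   // placeholder field indices
            advance_to_next();
        }
    }
}

void main(string[] args) {
    FastReader(args[1]).process_file();
}
```

The double-length double_buffer only comes into play when a field straddles the boundary between the two backing buffers, which keeps the common case allocation-free.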
This really beefs up our simple program from before to accomplish the same task. Let’s break it down a bit.
bufferA and bufferB are the two backing buffers, and they will be pointed to by either current_buffer or last_buffer, depending on our state, which we track with num_buf. To keep track of our progress through current_buffer we use index, and to keep track of our progress through a line we use num_del to represent the number of delimiters we’ve seen thus far. Finally, we want to know if we’ve hit the end of a line, which we track with a boolean flag.
The instantiation of FastReader is straightforward: we set up our buffers and read in the first chunk of data. reset_progress() gets called when we hit the start of a new line and it just updates the state to reflect that.
swap_and_load() is our toggling method. Note that the operation we use on num_buf to determine our current_buffer is equivalent to num_buf % 2.
There’s also a quick catch here with the file.rawRead() method: though it takes a char[] in and populates it from the File, if we hit an end of file, the char[] will still have its previous contents past the EOF. For example:
For our purposes, this is undesirable behavior because we want to know when we’ve actually hit the end of the File and the end of the last line. It turns out that file.rawRead() also returns a char[], which is the slice of the passed-in array that has new content. Just changing the line to buffer = file.rawRead(buffer) fixes the issue:
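A tiny demonstration of the fixed behavior (small.txt here is a stand-in for any file shorter than the buffer):

```d
import std.stdio;

void main() {
    auto file = File("small.txt", "r");   // a placeholder file shorter than 16 bytes
    auto buffer = new char[16];
    buffer = file.rawRead(buffer);        // keep only the slice that was actually filled
    writeln(buffer.length, " bytes read: ", buffer);
}
```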
We use the same construction in swap_and_load(), finally resetting our progress through current_buffer to 0.
get_field() is the trickiest method, but still intelligible. If we’ve already hit the end of the line, there’s no field to find, so we write out that it’s unknown. Starting from where we are in current_buffer, we just start counting delimiters. If we hit the end of current_buffer, it’s time to swap_and_load(). Once we’ve hit the correct number of delimiters, we need to find the next one. This is essentially the same code, but we need to know if we swap buffers during this process. If we hit a newline, we count that as the end of the field.
Constructing the field is simple: if we didn’t swap, it’s just the slice of current_buffer from our start to end indices. Otherwise, we piece it together from both buffers in our double-length reconstruction buffer. The advance_to_next() method works in a similar way to searching out delimiters, but instead we look for newline characters, move into the next line, and then reset our per-line progress.
There are a couple of catches at this point. First, there’s our eof() method. It’s possible that we hit EOF but there are more lines to read. We check for this by ensuring that we’re only truly done when our index is the same as our current_buffer length. Finally, in our process_file() call, we make sure that we move on to the next line after getting all the fields we require. Critically, we search for the fields in their numerical order. We need to do this because we run through our lines in sequence and never look back.
Ok, so our code gained some weight. But fortunately, it gained weight in terms of muscle mass instead of bloat. Check it out:
Hey, that’s better than a 6x improvement! We can now run over way more data in the same amount of time. Just to make sure nothing fishy is going on:
Awesome. But hey, y’know what? It would be really cool if we could exploit the fact that I’m running on a multi-core machine and read in multiple files at once. How would we do that?
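With std.parallelism this is a small change; here’s a sketch that reuses the FastReader struct from earlier and fans out over the file names passed on the command line (ignoring that output from different files will interleave):

```d
import std.parallelism : parallel;

void main(string[] args) {
    // Each log file is independent, so hand one to each worker in the task pool.
    foreach (path; parallel(args[1 .. $], 1)) {
        FastReader(path).process_file();   // FastReader as sketched above
    }
}
```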
Well, that was pretty easy. What type of performance do we get?
Wow! That’s over five times faster than our original solution while running over four times the data, for a 20x performance boost! And we’re not done yet…
rdmd doesn’t perform as many optimizations as it could. Once our code is in the state that we want it, it makes sense to compile with dmd rather than running it under a scripting idiom. To be fair, we’ll also compile down our original naïve solution and our non-parallelized one:
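Concretely, that means invoking dmd with its usual optimization flags; something like this, with placeholder file names:

```
dmd -O -release -inline fast_reader.d
./fast_reader ads.log
```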
After compiling everything down, our fast solution has nearly a 12x performance boost over the naïve one, and our parallelized solution, when run over four times the data, has nearly a 22x boost.
At AdRoll Data Science, we’ve become big fans of D, and it’s easy to see why. We can rapidly prototype new infrastructure and analysis tasks, and when efficiency becomes a core concern, we have the ability to refactor that same code base to squeeze as much performance out as possible. If you’re interested in tackling big data problems with an eye for lean, mean code, you should let us know.