Today, almost a year after the initial TrailDB open-source release, we are happy to announce the next major version of TrailDB, 0.6. The release is long overdue: it is packed with major new features (and some minor bug fixes) that have been driven both by the increasing internal use of TrailDB at AdRoll and by requests and contributions from the community.
You can download and install the latest version by following the getting started guide.
TrailDB in Action
First, let’s start with some highlights about the usage of TrailDB. At AdRoll, the amount of data stored in TrailDBs has been growing exponentially. It took two years to reach the first trillion events stored. Now we are routinely creating new TrailDBs storing trillions of events every week. Besides the scale, TrailDBs are powering even more critical services at AdRoll, including various machine learning pipelines, large-scale data analysis, and our new attribution engine.
We want TrailDB to be a small, battle-hardened piece of software which works with different languages and environments without too much trouble. We have been preaching this gospel at a number of events. For instance, see our presentation at an SF Data Mining meetup or the TrailDB tutorial at the PyData SF 2016 conference.
Most importantly, we are excited to see TrailDB gaining traction outside AdRoll. Thank you for all the raised issues, suggestions, and contributions, such as the nascent language bindings for NodeJS. Recently, we adopted community-contributed Rust bindings under the official TrailDB organization on GitHub. If you have any feedback or ideas about TrailDB, get in touch on the TrailDB Gitter channel.
Major New Features
You can see the full list of changes since the 0.5 release in the changelog. Here are some highlights:
Trck - a highly optimized query language for TrailDB
When we made the initial open-source release, we hinted, “We have a number of tools built on top of TrailDB which make computing various user-level metrics easier. We are planning to open-source some of these tools in the future.” This finally happened in March, when we open-sourced trck, a highly optimized query language for TrailDB.
In brief, if you have ever needed to run queries on user-level funnels like “how many users have first done A, then B within T seconds”, trck is the query language for you. Almost all systems powered by TrailDB at AdRoll use trck in one form or another. It is hard to overstate its usefulness when your data is shaped like TrailDB. To learn more, read the separate blog post about trck and its documentation.
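To make the funnel question concrete, here is a minimal sketch in plain Python. It is not trck syntax and does not use the TrailDB API; the trail representation (a time-ordered list of timestamp and event-type pairs) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one trail per user, sorted by timestamp,
# where each event is a (timestamp_in_seconds, event_type) pair.
Trail = List[Tuple[int, str]]


def matches_funnel(trail: Trail, first: str, then: str, window: int) -> bool:
    """Return True if `first` is followed by `then` within `window` seconds."""
    last_first_ts = None
    for ts, event in trail:
        if event == first:
            # Remembering only the most recent `first` is enough: it is the
            # one closest in time to any later `then` event.
            last_first_ts = ts
        elif event == then and last_first_ts is not None:
            if ts - last_first_ts <= window:
                return True
    return False


def count_funnel(trails: Dict[str, Trail], first: str, then: str, window: int) -> int:
    """Count the users whose trail completes the funnel."""
    return sum(matches_funnel(t, first, then, window) for t in trails.values())


trails = {
    "user-1": [(100, "A"), (130, "B")],  # B follows A after 30 seconds
    "user-2": [(100, "A"), (500, "B")],  # B follows A after 400 seconds
    "user-3": [(100, "B"), (120, "A")],  # wrong order
}
print(count_funnel(trails, "A", "B", 60))  # prints 1
```

A dedicated query language earns its keep when many such patterns must be evaluated over very large numbers of trails; the loop above only shows what a single pattern means.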
Filters, Views, and Multi-Cursors
Most new features in this release are related to creating event filters that select a subset of features or trails from TrailDB. You can filter events by defining a boolean expression over fields, including timestamps. See the PyData presentation for examples of filters.
You can set filters to cover the whole TrailDB with tdb_set_opt which, in effect, creates a view over the TrailDB that can be materialized. You can also set the filters to cover only individual trails, allowing fine-grained whitelisting and blacklisting of trails. Or you can attach filters only to an individual cursor.
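As a rough illustration of what a boolean expression over fields can look like, the sketch below evaluates a filter written as an AND of OR-clauses over field values. It is plain Python rather than the TrailDB API, the tuple-based filter representation is an assumption made for this example, and timestamp conditions are left out for brevity.

```python
from typing import Dict, List, Tuple

Event = Dict[str, str]
# Hypothetical representation for this sketch:
#   term   = (field, value, negated)
#   clause = list of terms joined by OR
#   filter = list of clauses joined by AND
Term = Tuple[str, str, bool]
Clause = List[Term]
Filter = List[Clause]


def term_matches(event: Event, term: Term) -> bool:
    field, value, negated = term
    hit = event.get(field, "") == value
    # A negated term matches exactly when the plain comparison does not.
    return hit != negated


def event_matches(event: Event, flt: Filter) -> bool:
    """Every clause must hold; a clause holds if any of its terms holds."""
    return all(any(term_matches(event, t) for t in clause) for clause in flt)


events = [
    {"type": "click", "country": "US"},
    {"type": "view", "country": "FI"},
    {"type": "purchase", "country": "FI"},
]

# (type == click OR type == purchase) AND NOT (country == US)
flt = [
    [("type", "click", False), ("type", "purchase", False)],
    [("country", "US", True)],
]

selected = [e for e in events if event_matches(e, flt)]
print(selected)  # prints [{'type': 'purchase', 'country': 'FI'}]
```

Applying the same predicate to every event of every trail gives the effect of a view over the whole database, while applying it to a single trail or a single iteration corresponds to the narrower scopes described above.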
Multi-cursors allow iterating over separate trails (users) as if they were a single trail. Multi-cursors even work across multiple separate TrailDBs. This feature was motivated by the fact that multiple UUIDs may correspond to the same logical user and we want to query all events related to the user, even if they were stored as separate trails.
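Conceptually, this is a merge of several time-ordered event streams into one. The sketch below shows the idea in plain Python using the standard library; it does not use the TrailDB multi-cursor API, and the trail contents and names are made up for illustration.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical representation: each trail is a list of
# (timestamp, event_type) pairs, already sorted by timestamp.
Event = Tuple[int, str]


def merged_events(trails: Iterable[List[Event]]) -> Iterator[Event]:
    """Lazily yield the events of several sorted trails in timestamp order."""
    return heapq.merge(*trails, key=lambda event: event[0])


# Two trails assumed to belong to the same logical user,
# for example a browser cookie and a mobile device ID.
cookie_trail = [(10, "view"), (40, "click")]
mobile_trail = [(25, "view"), (60, "purchase")]

merged = list(merged_events([cookie_trail, mobile_trail]))
print(merged)
# prints [(10, 'view'), (25, 'view'), (40, 'click'), (60, 'purchase')]
```

Because the merge is lazy, downstream code can treat the result like an ordinary cursor over a single trail, regardless of how many trails, or databases, the events came from.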
Command line tool improvements
The tdb command line tool has improved in this release. Here are the highlights:
Specify an event filter with the --filter flag. You can speed up filtered queries significantly by creating an index.
The tdb merge command merges TrailDBs, even when they have mismatching sets of fields.
You can return a subset of trails.
Last but not least, the tdb functionality is now automatically tested by Travis for every pull request.
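To illustrate what merging sources with mismatching sets of fields involves, here is a minimal sketch in plain Python. It is not how tdb merge is implemented; the data layout is hypothetical, and filling absent fields with an empty string is an assumption of this example.

```python
from typing import Dict, List, Tuple

Event = Dict[str, str]
# Hypothetical source layout: (list of field names, list of events).
Source = Tuple[List[str], List[Event]]


def merge_event_sets(sources: List[Source]) -> Tuple[List[str], List[Event]]:
    """Merge sources into one event list over the union of their fields."""
    fields: List[str] = []
    for source_fields, _ in sources:
        for name in source_fields:
            if name not in fields:  # keep first-seen order, drop duplicates
                fields.append(name)

    merged: List[Event] = []
    for _, events in sources:
        for event in events:
            # Assumption: a field missing from a source becomes an empty string.
            merged.append({name: event.get(name, "") for name in fields})
    return fields, merged


web = (["type", "page"], [{"type": "view", "page": "/home"}])
app = (["type", "screen"], [{"type": "open", "screen": "main"}])

fields, merged = merge_event_sets([web, app])
print(fields)  # prints ['type', 'page', 'screen']
print(merged[1])  # prints {'type': 'open', 'page': '', 'screen': 'main'}
```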
We want to keep the core TrailDB very stable and robust. At the same time, it is fun and beneficial to experiment with new directions which might find their way to the core eventually.
A good example of this is a feature that allows you to query TrailDBs directly from Amazon S3 without downloading them locally. This is possible thanks to a relatively new feature in the Linux kernel, user-space page fault handling, which allows us to download only parts of TrailDB on demand with minimal changes to the TrailDB codebase. This feature can reduce query latencies significantly if your application needs to access only a subset of trails, events, or fields.
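The on-demand idea can be sketched without any kernel support. The plain-Python example below fetches fixed-size blocks of a remote object only when a read touches them and caches them afterwards. It is an analogy for illustration, not the page-fault-based mechanism described above: the class name, the block size, and the fetch_range callback (a stand-in for a ranged download) are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


class LazyBlob:
    """Read-only view of a remote object that downloads blocks on first use."""

    def __init__(self, size: int, fetch_range: Callable[[int, int], bytes],
                 block_size: int = 4096) -> None:
        self.size = size
        self.fetch_range = fetch_range  # hypothetical: returns bytes [start, stop)
        self.block_size = block_size
        self.blocks: Dict[int, bytes] = {}  # cache of already fetched blocks

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        out = bytearray()
        for index in range(first, last + 1):
            if index not in self.blocks:
                start = index * self.block_size
                stop = min(start + self.block_size, self.size)
                self.blocks[index] = self.fetch_range(start, stop)
            out += self.blocks[index]
        skip = offset - first * self.block_size
        return bytes(out[skip:skip + (end - offset)])


# Simulated remote object; a real setup would issue ranged requests instead.
remote = bytes(range(256)) * 64  # 16384 bytes
fetched: List[Tuple[int, int]] = []


def fetch_range(start: int, stop: int) -> bytes:
    fetched.append((start, stop))
    return remote[start:stop]


blob = LazyBlob(len(remote), fetch_range)
data = blob.read(5000, 10)
print(fetched)  # prints [(4096, 8192)]
```

Only one of the four blocks is transferred for this read, which is the source of the latency savings when a query touches a small part of a large file.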
Another experimental feature is Reel, an AWK-like query language for TrailDB. As mentioned above,
trck is our trusted workhorse for expressing user-level queries. Reel was motivated by a particularly complex query that needed to be executed over a trillion events. Although it is not quite as mature as
trck, you can easily embed and extend it for your own use cases.
TrailDB 0.6 is a robust data backend for applications that need to execute complex computation over discrete events over time. As we have emphasized before, we take the stability of the C API, ABI, and especially the on-disk format very seriously. You can use the 0.6 release to read TrailDBs created with any previous version of the software. This should hold true for any future version of TrailDB as well.
Our next big focus after this release is to optimize the creation of TrailDBs. As mentioned above, AdRoll creates multi-trillion event TrailDBs weekly. Even when using spot instances on AWS, it costs thousands of dollars to create these files from raw log files. Some easy optimizations are targeted for the upcoming 0.7 release, which should drastically lower the cost of creating massive TrailDBs.
Meanwhile, we hope that you enjoy the 0.6 release. If you have any questions, comments, or contributions, you can reach us at the TrailDB Gitter channel.