The New AdRoll Data Analysis Platform

Written by Andrew Pascoe, April 01, 2016

Introduction

If there is one thing that has been made abundantly clear over the last decade in tech, it’s that understanding your data is paramount to achieving your goals. Understandably, many startups have sprung up promising to aid every company and their employees with data analytics to fully optimize not only their work lives, but their personal lives too. Knowledge is power.

Of course, the more power you have, the more knowledge you can gain. After analyzing current market offerings, we on AdRoll Data Science found only solutions that sluggishly accelerated our growth to the Singularity™. Some like to say they welcome our new robot overlords, but we have a policy of actively ushering them in. (Note: We will soon be adding a lovable automaton, Gort, to our collection of spirit animals.) This is why we set to work and developed the New AdRoll Data Analytics (NADA) platform.

Gort in action.

Technology

As some readers of our previous blog posts may know, AdRoll strongly encourages its engineers to explore new technologies, and if you’ve been following the Data Science Engineering team in particular, you know that we love to get to the metal to eke out as much performance as we can.

But this isn’t enough. AdRoll also has a long-standing philosophy of bringing the best to everyone, not just those with the time and resources of enterprise businesses. We knew that NADA had to be fast and powerful and, most of all, had to keep us close to our data. But this stuff is only great if the platform is usable, so we knew we wanted a nice, clean API. Take a look for yourself at this k-means code:

I don’t know about you, but this is just about the clearest code I’ve ever seen. Yes, we at AdRoll have become huge fans of Whitespace.

This may seem like a controversial choice to some, but consider the benefits:

You can code for longer as eye strain is minimized, and with appropriate hardware, the ergonomics are fantastic.
Rather than slogging through PRs, each diff is a colorful delight that warms the soul. See image below.
In a process of discovery, printing out your code gives no benefit to your legal opposition.
By inputting your data as essentially raw binary, you’re thinking much more like a machine. This makes Gort happy.
You want Gort to be happy.

These are just the benefits of Whitespace in and of itself. Let’s move on to NADA.

Beautiful diff.

NADA Features

The core features of NADA are a boon to any data scientist, statistician, or Swiss army personnel. You can regress your convex optimizations, trick your kernels into matrix factorizations, and backpropagate your support vector machinations. All of this is so abstracted away, you can be sure you won’t miss the random forest for the decision stumps.

For example, take a look at this snippet, which convolutes a neural network into a stochastic principal component Lp space:

start_with_your_neural_net_to_convolute...   	 	 	  
	
     		 	   
	
     		    	
	
     			 	  
	
      	     
	
     			 			
	
     		    	
	
     			  		
	
stochasticizing...      	     
	
     		 	 	 
	
     			 	 	
	
     			  		
	
     			 	  
	
      	     
	
     		  			
	
     		 	  	
	
     		   	 
	
     		   	 
	
     		  	 	
	
     			  	 
	
     		 	  	
	
     			  		
	
     		 	   
	
      	 			 
	
      	     
	
     	 	 			
	
now_for_components...     		  	 	
	
      	     
	
     		 	 		
	
     		 			 
	
     		 				
	
     			 			
	
      	     
	
     			 	  
	
     		 	   
	
output_Lp_space!     		 	  	
	
     			  		
	
      	 			 
	
        	 	 
	
  


repeat_for_recurrent_space.

So much processing in such a small snippet really shows the expressive power of NADA. This expressivity has allowed AdRoll to bootstrap our data science efforts faster than we ever thought possible, because NADA is also incredibly efficient and scalable.
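For readers who want to see how so little ink can carry so much meaning, here is a minimal Python sketch of the three Whitespace instructions the snippet above appears to rely on: push a number, print the top of the stack as a character, and end the program. It assumes the standard Whitespace encoding (space for 0 and positive sign, tab for 1, line feed as terminator) and treats every other character as a comment, which is why the visible labels in the snippet are harmless. The names `encode` and `run` are hypothetical helpers for illustration, not part of NADA, and the interpreter deliberately handles nothing beyond these three instructions.

```python
# Minimal sketch: only three Whitespace instructions are supported.
SPACE, TAB, LF = " ", "\t", "\n"

PUSH = SPACE + SPACE                 # stack manipulation: push a number
OUT_CHAR = TAB + LF + SPACE + SPACE  # I/O: pop a number, print it as a character
END = LF + LF + LF                   # flow control: end the program


def encode(text):
    """Return a Whitespace program that prints `text`."""
    parts = []
    for ch in text:
        bits = format(ord(ch), "b")
        digits = bits.replace("0", SPACE).replace("1", TAB)
        # The SPACE after PUSH is the sign (positive); LF terminates the number.
        parts.append(PUSH + SPACE + digits + LF + OUT_CHAR)
    return "".join(parts) + END


def run(program):
    """Interpret a program that uses only push, print-char, and end."""
    # Anything that is not a space, tab, or line feed is a comment.
    src = "".join(c for c in program if c in (SPACE, TAB, LF))
    out, stack, i = [], [], 0
    while i < len(src):
        if src.startswith(END, i):
            break
        if src.startswith(PUSH, i):
            i += len(PUSH)
            sign = -1 if src[i] == TAB else 1
            i += 1
            end = src.index(LF, i)
            digits = src[i:end].replace(SPACE, "0").replace(TAB, "1")
            stack.append(sign * int(digits or "0", 2))
            i = end + 1
        elif src.startswith(OUT_CHAR, i):
            out.append(chr(stack.pop()))
            i += len(OUT_CHAR)
        else:
            raise ValueError(f"unsupported instruction at offset {i}")
    return "".join(out)


if __name__ == "__main__":
    program = "say_it..." + encode("NADA")
    print(repr(program))
    print(run(program))
```

If the tabs and spaces in the snippets on this page survived your browser and clipboard intact, pasting them into `run` should reveal what they compute. Gort recommends a round trip through `encode` first to confirm your editor is not quietly trimming trailing whitespace.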

NADA’s underlying data framework is a parallelized, fault-tolerant, NoSQL, relational data structure that guarantees 100% consistency. This plays a key part in our ability to rip through tons more data than we have in the past. It’s a unique structure, for sure, so there is some ramp-up time before you can take full advantage of its features. However, we went ahead and implemented more standard APIs to get your feet wet.

For example, for our Prospecting product we were asked to find out whether certain subsets of our data contained graphs of a particular size. Tackling this somewhat naïvely, we have:

good...   	 	    
	
      	     
	
      				 	
	
      	     
	
     	  			 
	
     	 	    
	
      						
	
        	 	 
	
  


...but_slow

It does all right even with the standard API calls, but that’s mainly a testament to NADA’s computational ingenuity. However, as soon as we switch to NADA’s unique brand of data structure:

“Wow,” indeed! Even a cursory glance reveals the speedup here. We’re expecting a novelty check in the mail soon.

Onwards and Upwards!

We’ve only just scratched the surface of NADA here. While NADA is primarily a data analytics framework, we have also found it useful in a variety of other contexts, such as automatic intrusion detection, improving our WiFi signals, supplanting the human race, and defragging our Hadoop clusters. Also, NADA is absolutely delightful spread on a thick piece of toast with a glass of orange juice.

So where do we go from here? Naturally, we’ll be open-sourcing NADA shortly. Gort is looking forward to seeing what you can do with our framework as soon as you get your hands on it. Gort does not like to be disappointed, so please, contribute as much NADA as you can.

Happy coding!
