HyperLogLog in PostgreSQL Amazon Aurora RDS

Written by Piyush Srivastava, August 06, 2019

We moved from HLLs stored in HBase to HLLs stored in Postgres with great results.

5-10 minute read


What is HyperLogLog (HLL)?

HyperLogLog (HLL) is a useful and interesting probabilistic data structure for counting the unique values in a data set with good accuracy and speed. Normally, counting unique values exactly requires memory proportional to the number of unique values, which becomes a problem when working with large data sets. HyperLogLog solves this by letting us trade memory consumption for (tunable) precision, making it possible to estimate cardinalities beyond 1 billion with a standard error of 2% using only 1.5 kilobytes of memory. For more on how this works, you can refer to our past blog post here.
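To make that trade-off concrete, here is a minimal, illustrative Python sketch of the core HyperLogLog idea: hash each value, use the first few bits to pick a register, and keep the maximum leading-zero count seen per register. The class name, hash choice, and register count are our own for illustration; production implementations such as postgresql-hll use faster hashes and add bias and small-range corrections.

```python
import hashlib

def _hash64(value: str) -> int:
    # 64-bit hash of the input (illustrative; real HLLs use fast
    # non-cryptographic hashes such as MurmurHash)
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

class TinyHLL:
    """Toy HyperLogLog: 2**p registers, each holding the maximum
    'rank' (leading zeros + 1) observed for values hashed into it."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - self.p)   # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self) -> float:
        # harmonic mean of 2**register, scaled by alpha_m * m**2
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z
```

With p = 12 (4096 single-byte registers, about 4 KB), the expected standard error is roughly 1.04 / sqrt(4096), i.e. about 1.6%, regardless of how many distinct values are added.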

At AdRoll Group, we use this data structure in our reports to measure metrics like unique visitors and engaged visitors. We implemented HLL in-house and stored the sketches in HBase, but HBase was causing us some headaches.

Problems with HBase

While working with HBase we encountered two major challenges:

  1. Defining the key in a way that supports a variety of queries
  2. Having to build aggregation operations in a Thrift layer

Key structure

Because HBase is a key-value data store, every record needs a key that uniquely identifies it. This key is used to look up metrics and drive aggregations in reporting. It was always a challenge to define the key so that record lookups stay efficient across different query parameters, especially compared to a relational database.

As an example, let’s say we are storing unique visitors for a customer with details about visitors every day.

For that we can create a table called Daily Unique Visitors with the following columns: Date, Customer ID, Browser Type, Location, and Unique Visitors (the HLL).

Since the value we care about is only the last column and the others identify it, in an HBase table we'll need to define a key structure from the first 4 columns concatenated with underscores. Something like: Date_Customer-id_browser-type_location

A key will then look like 2019-07-01_A_Chrome_US, i.e. July 1st, for customer A, Chrome browser, US location (key) => unique visitors (value)

This works well when we need to scan it in the same order. For example, to get the number of unique visitors in the month of July 2019, for a given customer, for the Chrome browser, for all locations.

But as soon as we have different search criteria, such as the number of unique visitors in July, for customer A, across all browsers, in the US, the key structure is no longer effective: the scan returns many irrelevant rows that must be filtered out afterwards.
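The limitation can be sketched with a toy in-memory model of the row keys (the keys and counts here are made up; in HBase the range scan happens server-side, but the prefix constraint is the same):

```python
def make_key(date: str, customer_id: str, browser: str, location: str) -> str:
    # composite HBase-style row key: fields joined in a fixed order
    return f"{date}_{customer_id}_{browser}_{location}"

rows = {
    make_key("2019-07-01", "A", "Chrome", "US"): 120,
    make_key("2019-07-01", "A", "Chrome", "DE"): 45,
    make_key("2019-07-01", "A", "Firefox", "US"): 30,
    make_key("2019-07-02", "A", "Chrome", "US"): 98,
}

# Query 1: date, customer, and browser fixed -> all matches share a
# key prefix, so an ordered range scan finds them efficiently
prefix = "2019-07-01_A_Chrome_"
hits = {k: v for k, v in rows.items() if k.startswith(prefix)}

# Query 2: "all browsers, US only" shares no usable prefix beyond the
# date and customer, so every candidate row must be scanned and the
# non-US rows discarded after the fact
us_hits = {k: v for k, v in rows.items()
           if k.startswith("2019-07-01_A_") and k.endswith("_US")}
```

The second query degrades as the number of browsers grows, because the scan width is determined by the key order, not by the filter.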

Thrift layer

For interacting with HBase from a non-Java platform, people usually use Thrift, since it provides a language-independent interface. To support our reporting needs we had to implement operations like aggregation, filtering, and ordering of HLL records efficiently in our Thrift layer, which added development, maintenance, and resource costs. We also encountered stability issues with Thrift that required frequent restarts.

In the earlier HBase table example, scanning, filtering, ordering, etc. are all done in the Thrift layer. Sometimes the date ranges are so large that Thrift spawns many parallel HBase scan threads, causing resource crunches and frequent timeouts.

HLL in PostgreSQL Amazon Aurora RDS

We started looking for alternatives, and fortunately Amazon Aurora RDS started supporting the postgresql-hll extension, which provides a new hll data type. Switching to a relational database system solves a lot of our problems. With RDS as the data store, we can define our schema efficiently, using indexes and so on, to fulfill all our reporting needs (no more single keys). We also don't have to maintain anything for our aggregation, filtering, and ordering needs, since RDS has that built in (no more Thrift).

Going back to our daily unique visitors table example: We can get results easily and efficiently using SQL:

SELECT hll_cardinality(hll_union_agg(unique_visitors))
FROM   daily_unique_visitors
WHERE  ts >= '2019-07-01'
       AND ts < '2019-08-01'
       AND customer_id = 'A'
       AND location = 'US' 

With standard relational database optimization techniques such as indexing and table partitioning, queries like this return results fast.
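Part of what makes this query cheap is that hll_union_agg can merge the daily sketches losslessly before hll_cardinality produces the estimate: the union of two HLLs with the same parameters is just the element-wise maximum of their registers. A minimal Python sketch of that merge rule (the register values are invented for illustration):

```python
def hll_union(a: list[int], b: list[int]) -> list[int]:
    # Union of two same-sized HLL register arrays: take the max of
    # each register pair. The result is exactly the sketch you would
    # have gotten by adding both days' visitors to a single HLL.
    assert len(a) == len(b), "sketches must share the same register count"
    return [max(x, y) for x, y in zip(a, b)]

day1 = [3, 0, 5, 1]           # toy 4-register sketches
day2 = [2, 4, 1, 1]
monthly = hll_union(day1, day2)
```

Because max is associative and commutative, a month of daily sketches can be folded together in any order, which is what lets the database aggregate them row by row.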

Proof-of-concept

To measure the improvements we did a proof-of-concept. We replicated 3 months of data in RDS and compared the HBase and RDS results.

Query performance improvements

Long date-range queries spanning 3 months used to take more than 60 seconds in HBase; RDS returned results in less than 10 seconds.

Less management effort

HBase and the Thrift layer require ongoing maintenance: installation, monitoring, troubleshooting, development, and so on. With Amazon Aurora, all of that is taken care of, since it's a managed service.

After evaluation, we decided to move to RDS, but it had its own challenges and limitations as explained below.

Challenges in migration

  1. The biggest challenge we encountered was our old data. Our HBase HLLs were not compatible with the new RDS HLL format, due to the different hashing and storage-conversion mechanisms between our HBase implementation and the RDS extension. Where possible, we rebuilt the data as RDS HLLs from the raw source data. Where we couldn't, we generated HBase HLLs and RDS HLLs side by side on a go-forward basis. We called the first day RDS data was available the "cutoff date". Whenever a query covered a period before the cutoff date, we used the HBase HLLs; date ranges after the cutoff date used RDS. After some time, we were able to retire our HBase queries completely.
  2. Limited library support in the programming languages we use. The current PostgreSQL extension has library support in Java and JavaScript only. This works fine if you are using one of those languages in your technology stack, but otherwise it is a limitation. We use Python a lot, so we had to put in some effort to port these libraries to Python. We tackled that project in our most recent HackWeek.
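The dual-read approach from point 1 above can be sketched as a simple date-based router (the cutoff date, names, and return shape here are hypothetical, not our production code):

```python
from datetime import date

CUTOFF = date(2019, 5, 1)  # hypothetical first day with RDS HLLs

def backends_for(start: date, end: date) -> list[tuple[str, date, date]]:
    """Split a reporting date range [start, end) into the sub-ranges
    that must be served from HBase (before the cutoff) and from RDS
    (on or after the cutoff)."""
    plan = []
    if start < CUTOFF:
        plan.append(("hbase", start, min(end, CUTOFF)))
    if end > CUTOFF:
        plan.append(("rds", max(start, CUTOFF), end))
    return plan
```

A range that straddles the cutoff yields two sub-queries whose estimates are then combined in the reporting layer; once all queried ranges fall after the cutoff, the HBase branch is dead code and can be retired.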

Python-hll: PostgreSQL-hll extension Python library

We created a Python library to read, write, count, and perform operations such as unions on PostgreSQL-compatible HLLs, similar to the Java library. We plan to open-source it soon – stay tuned.

Conclusion

Migrating from HBase to PostgreSQL HLL on Amazon Aurora was a big win for us. The PostgreSQL HLL extension is a much better fit for storing our HLL reporting metrics, and the benefits of a relational database with HLL support, like Amazon Aurora, let us retrieve and present this information efficiently. We highly recommend that you check it out.