HyperLogLog in Python
We open-sourced a Python library for HLLs compatible with postgresql-hll.
5 minute read
What is python-hll?
We recently open-sourced python-hll, which is an implementation of HyperLogLog whose goal is to be storage compatible with java-hll, js-hll, and postgresql-hll. At NextRoll, we use these libraries to do fast counting of unique values in PostgreSQL and Java, but we were dismayed that there was no Python library for it. So we decided to make one.
What are HLLs?
In a previous blog post, we talked about how we use HLLs in PostgreSQL at NextRoll. As a quick recap, HLLs let you quickly count unique values. You basically throw a bunch of unique values into a blob called an HLL. Then you can ask the HLL how many unique values it has:
Not only that, but you can also union HLL blobs together then ask it how many unique values it has:
Here’s how it works using our Python library:
from python_hll.hll import HLL
import mmh3
hll1 = HLL(13, 5)
hll2 = HLL(13, 5)
hll1.add_raw(mmh3.hash('17'))
hll1.add_raw(mmh3.hash('17'))
hll1.add_raw(mmh3.hash('31'))
hll1.add_raw(mmh3.hash('5'))
cardinality = hll1.cardinality()
# Output: 3
hll2.add_raw(mmh3.hash('6'))
hll2.add_raw(mmh3.hash('6'))
hll2.add_raw(mmh3.hash('31'))
hll2.add_raw(mmh3.hash('5'))
cardinality = hll2.cardinality()
# Output: 3
hll1.union(hll2)
cardinality = hll1.cardinality()
# Output: 4
And then you can export the HLL to a string that works with postgresql-hll:
from python_hll.util import NumberUtil
bytes = hll1.to_bytes()
output = "\\x" + NumberUtil.to_hex(bytes, 0, len(bytes))
# \x128D7F00000000201E4B390000000027FA7CC000000000531A35E4000000006E6F0B04
How We Built and Tested It
We built python-hll as part of NextRoll’s June 2019 Hack Week, a time when engineers take some time to build things that interest them personally. As we only had a week, we thought it would be expedient to do a fairly literal translation/port of java-hll to Python.
To minimize changes, we decided to internally represent bytes as Java-style bytes (-128 to 127) rather than true Python bytes (0 to 255). Issues we ran into included: how to translate Java’s >>>
, into Python, how to handle integer overflows from <<
identically to Java so the tests pass (actually we were quite surprised that so many integer overflows were happening in the Java unit tests), and how to make sure the tests pass in both Python 2.7 and Python 3.
Fortunately, postgresql-hll has an extensive set of test data with 22,614 data points that we were able to use to test our library. In addition, we also ported the unit tests from java-hll to Python.
Unfortunately one drawback to our python-hll library is that it is quite slow compared to java-hll. For example, in Java, HLLSerializationTest takes 12 seconds to run, while in Python, the equivalent test_hll_serialization takes 1.5 hours to run - it’s about 400x slower. It should be okay if a few HLL operations are needed, but for many HLL operations you may find that it could slow your program down. Test it and see.
Acknowledgements
Thanks to the 7 engineers who translated code and tests from Java to Python for this project at NextRoll’s June 2019 Hack Week:
- Jon Aquino
- Kushagra Verma
- Alex Leu
- Michael Tran
- Rodrigo Westrupp
- Sridharan Subramanian
- Piyush Srivastava