The project I work on at AdRoll is mature. In a mature project the economics of software flip around to where doing something right is more desirable that doing something quickly. In 2015 we invested a lot of effort in resolving long-tail issues with the bidder system, issues we’d recognized in our failure-mode conversations but had left unresolved pending future work. Sometimes, a system will evolve away from such things and sometimes you’ll have to put in the time and energy to, not repair, but mitigate. Long-tail issues are often intrinsic, those fun little sideshows Charles Perrow called Normal Accidents, failures inherent to a system that can’t be removed but must be lived with. It’s good times.
This blog post details one of my favorite resolutions in 2015, the bidding credit system.
A brief run-down of the bidders. A bid comes in from an “exchange”–these are companies like Google, AppNexus or OpenX–and we’re obligated to reply. Replies can be either a “no bid”, essentially a bid of $0, or a “bid”, a bid of greater than $0. If we win the auction the exchange sends us a specially coded additional request, called a “win”. When this arrives, we reduce the amount of money the system as a whole has to spend. If a “win” doesn’t arrive, we assume that there’s still money to spend and the system does its best to do so.
Here’s a fun question: what happens if an actual win gets decoupled from “win” events? By protocol contract with the exchanges they’re not supposed to be but, you know, things go wrong. What happens?
The bidder used to simply assume, without a “win” that it had money to spend and would continue to do so. This meant that it would continue to spend money, money it may, were the “wins” arriving appropriately, have not spent. Worse, the bidders might erroneously spend more money that in its budget, a worst-case scenario we call “overspend”. Overspends are potentially catastrophic as there’s no fixed upper bound.
Win decoupling took us by surprise in early 2015. Alarms fired alerting us that the system’s rate of win had dropped below normal bounds but a side-channel in a more delayed notification informed us that the system was still spending money. Oops! Manual intervention killed all bidding and communication with the exchange in question informed us that a bug in their latest release caused a failure of win notifications. This was… not cheap.
In the lifetime of the system this decoupling had only been seen just the once but it was bad enough to need a safety system to stop it from happening again. My colleague Mike and I kicked around ideas and settled on a take off a classic feedback control system.
The algorithm is this:
Grant a “bidding credit pool” of some initial capacity C, maximally bounded at Cm, some bid rate Br (per second) and some win rate Wr (per second):
- Each bid issued by the system to an exchange decrements C by B.
- Each win received by the system from an exchange increases C by W.
- If at bid time C =< L a no-bid is automatically issued.
- Every X seconds Bmp credits are added to the pool.
The first two points decide for the system how much a losing bid costs. Combine that with the third point, it’s possible to set values such that a losing streak will cause the pool to exhaust, shutting off bidding. The fourth point protects against losing streaks and allows for automatic recovery in the event of pool exhaustion.
It seemed to Mike and me that the algorithm would work. Say you have Cm = 100, C = 100, Br = 100, Wr = 0.1, B = 1, W = 1, L = 0, X = 5 and Bmp = 100. Then you’ll find that per second Br*Wr = 10 and so every second you’ll remove 90 credits from C, or, C(t1) = C(t0) - Br(1-Wr).
C(0) = 100 C(1) = 100 - 100(1-0.1) = 90 C(2) = 90 - 100(1-0.1) = 0 C(3) = 0 C(4) = 0 C(5) = 0 + 100 - 100(1-0.1) = 90 = C(0).
It’s a loop. In reality there’d be some variation as Br won’t always be a constant but notice that for 3/5 of the time the system is in a state where it can’t bid. That’s not good: the whole purpose of the system is to make bids.
Charles Perrow’s analysis in Normal Accidents argued that by attempting to resolve failure in one subsystem with the addition of another you introduce new failures elsewhere. Well, check it out, the bidding credit system can be configured to cause a worse problem than the one we’re attempting to fix.
Clearly, the trick here is finding the right parameters for the system. By team consensus we wanted something that would be slow to shut off bidding, something with a gradual wind-down, and quick to start bidding again, something with “get up and go”. This bidding credit system is dangerous enough that it can’t be released into the world behind a feature gate and fiddled with but we also must have operational experience with it.
What to do?
We modeled it! What we had on our hands was a simple-to-understand system and a search space for ideal configuration parameters. A small bit of python:
and an even smaller bit of gnuplot:
#!/usr/bin/env gnuplot # # Plotting data about the credit model (see credit_model.dat) reset # png set terminal png size 2560,1440 enhanced font 'Verdana,10' set output 'credit_model.png' set xlabel 'Time (seconds)' set tics scale 0.75 plot 'credit_model.dat' using 1:2 smooth acsplines with lines title 'Credits Available', \ 'credit_model.dat' using 1:4 smooth acsplines with lines title 'Bids Submitted', \ 'credit_model.dat' using 1:5 smooth acsplines with lines title 'Outbound Bids', \ 'credit_model.dat' using 1:6 smooth acsplines with lines title 'Instantaneous Wins'
let the team understand how the system would behave in a no-risk manner. You’ll notice that the python code is simulating more than the brief sketch above: waning traffic through the day, random fluctuation around rates. Here’s a similar periodic crash scenario graphed:
blt> python credit_model.py --time_end_lower_limit=1800 --bid_request_rate_per_second=1000 --initial_credits=1000 --time_step=1 --credit_cap=1000 --periodic_credit_bump=60 --periodic_credit_bump_amount=1000 --decriment_per_bid=1 --incriment_per_win=1 --bid_rate=1.0 --win_rate=0.1 > credit_model.dat && ./credit_model.gnu
Here’s another scenario with W = 10, effectively canceling out all the losing bids to get to a win.
blt> python credit_model.py --time_end_lower_limit=1800 --bid_request_rate_per_second=1000 --initial_credits=1000 --time_step=1 --credit_cap=1000 --periodic_credit_bump=60 --periodic_credit_bump_amount=1000 --decriment_per_bid=1 --incriment_per_win=10 --bid_rate=1.0 --win_rate=0.1 > credit_model.dat && ./credit_model.gnu
Configured well, the system is stable in normal operation. Good! What about if there’s a win decoupling event? The careful reader will have noticed that the python script has the ability to simulate a decoupling. Here’s the stable system with a decouple thrown in:
blt> python credit_model.py --time_end_lower_limit=1800 --bid_request_rate_per_second=1000 --initial_credits=1000 --time_step=1 --credit_cap=1000 --periodic_credit_bump=60 --periodic_credit_bump_amount=1000 --decriment_per_bid=1 --incriment_per_win=10 --bid_rate=1.0 --win_rate=0.1 --win_kill_point=900 --win_recover_point=1200 > credit_model.dat && ./credit_model.gnu
You’ll note it’s got the properties we desire, a clamping of bidding and a rapid jump back to full capacity once wins start entering the system again. At no point do bids ever permanently drop to zero–the periodic credit bump keeps us issuing a moderate number bid submissions–but the fall-off is drastic enough to put the loss in the realm of pennies.
And it turns out, it works well in practice, too: