Recently we had a problem with a buggy update to a piece of third-party client software. It produced a flood of valid but nonsensical requests targeting our system.
This post details how we added dynamic rate limiting to our HAProxy load balancers, heavily throttling only the very specific set of HTTP requests caused by the client bug while maintaining regular operations for all other requests, even on the same URLs.
The files described in this article are available in a GitHub repository for easy access.
What made things interesting was that the client software was mostly fine, but a single background sync feature repeatedly (and quite relentlessly) uploaded a tremendous amount of small objects, even though they had already been sent, creating lots of duplicate copies on the backend. At the same time, the interactive portion of the application was working nicely. Moreover, even though the problematic update was distributed to a wide audience, only a certain usage pattern would trigger the bug in a comparatively small portion of installs.
Because the client could send these requests at a high frequency with almost no effort, the more heavyweight asynchronous server-side processing was not able to keep up, leading to a slowly but continuously growing queue of outstanding requests.
While fair queueing made sure that most users did not notice much of a slowdown in their regular work with the system at first, it was clear that we needed a way to resolve this situation on our side until a fixed client update could be developed and rolled out.
The most obvious solution would have been to revoke access for the affected OAuth Client ID, but it would also have been the one with the most drastic side-effects. Effectively, the application would have stopped working for all customers, including those who either did not yet have the broken update installed or whose behavior had not triggered the bug. Clearly not a good option.
Another course of action we briefly considered was to introduce a rate limit using the Client ID as a discriminator. It would have had the same broad side-effects as locking the client out completely, affecting lots of innocent users. Basically, anything that takes only the Client ID into account would hit more users than necessary.
What we came up with is a rate limiting configuration keyed on the user’s access token instead of the client software, and restricted to the specific API call the broken client flooded. While the approach itself is not particularly ingenious, the implementation of the corresponding HAProxy configuration turned out to be a little trickier than anticipated. Most examples are based on the sender’s IP address; however, we did not want to punish every user behind the same NATing company firewall for a single offender.
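To make the idea concrete before turning to HAProxy, here is a minimal Python sketch of the decision being described: count only requests that match the offending shape (a POST upload to the documents API), keep one counter per access token, and throttle a token once it exceeds a limit within a time window. The class and method names (UploadLimiter, should_throttle) are hypothetical, and the sliding-window log is a simplification chosen for readability; it is not how HAProxy computes its rate counters internally. The path prefix, the 60 second window, and the limit of 100 mirror the values used in the configuration in this post.

```python
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 60.0  # assumed window, mirroring the 60s rate period
MAX_UPLOADS = 100      # assumed limit per token and window


class UploadLimiter:
    """Sliding-window counter keyed by access token (illustrative only)."""

    def __init__(self, limit: int = MAX_UPLOADS, window: float = WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        # token -> timestamps of tracked uploads still inside the window
        self._seen = defaultdict(deque)

    @staticmethod
    def is_tracked(method: str, path: str, content_type: str) -> bool:
        # Only the request shape produced by the broken client is counted.
        return (
            method == "POST"
            and path.lower().startswith("/v2/documents")
            and content_type.lower().startswith("multipart/form-data")
        )

    def should_throttle(self, token: Optional[str], method: str, path: str,
                        content_type: str, now: float) -> bool:
        if token is None or not self.is_tracked(method, path, content_type):
            return False  # everything else passes through untouched
        stamps = self._seen[token]
        stamps.append(now)  # count this request first, then check the rate
        # Drop timestamps that have fallen out of the window.
        while stamps and stamps[0] <= now - self.window:
            stamps.popleft()
        return len(stamps) > self.limit


if __name__ == "__main__":
    limiter = UploadLimiter(limit=3, window=60.0)
    upload = ("POST", "/v2/documents", "multipart/form-data; boundary=x")
    print([limiter.should_throttle("Bearer abc", *upload, now=float(t))
           for t in range(5)])
    # prints [False, False, False, True, True]
```

Because the key is the token rather than the source IP address or the Client ID, a second user behind the same firewall, or the same user issuing other kinds of requests, is never counted against the offender.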
So, without further ado, here is the relevant snippet from our HAProxy configuration:
frontend fe_api_ssl
  bind 192.168.0.1:443 ssl crt /etc/haproxy/ssl/api.pem no-sslv3 ciphers ...
  default_backend be_api

  tcp-request inspect-delay 5s

  acl document_request path_beg -i /v2/documents
  acl is_upload hdr_beg(Content-Type) -i multipart/form-data
  acl too_many_uploads_by_user sc0_gpc0_rate() gt 100
  acl mark_seen sc0_inc_gpc0 gt 0

  stick-table type string size 100k store gpc0_rate(60s)

  tcp-request content track-sc0 hdr(Authorization) if METH_POST document_request is_upload

  use_backend be_429_slow_down if mark_seen too_many_uploads_by_user

backend be_429_slow_down
  timeout tarpit 2s
  errorfile 500 /etc/haproxy/errorfiles/429.http
  http-request tarpit
Let’s go through these in some more detail.
First of all, right after declaring the frontend’s name to be fe_api_ssl, we bind the appropriate IP address and port, and set up the TLS settings with the certificate/private key and a set of ciphers (left out for brevity).