Recently we had a problem with a buggy update to a piece of 3rd party client software. It produced a large number of valid but nonsensical requests targeting our system.
This post details how we added dynamic rate limiting to our HAProxy load balancers, heavily throttling only a very specific set of HTTP requests caused by the client bug, while maintaining regular operations for other requests, even on the same URLs.
The files described in this article are available in a GitHub repository for easy access.
What made things interesting was that the client software was mostly fine, but a single background sync feature repeatedly (and quite relentlessly) uploaded a tremendous number of small objects, even though they had already been sent, creating lots of duplicate copies on the backend. At the same time, the interactive portion of the application was working nicely. Moreover, even though the problematic update was distributed to a wide audience, only a certain usage pattern would trigger the bug in a comparatively small portion of installs.
Because the requests came in at a high frequency with almost no effort on the client side, the more heavyweight asynchronous server-side processing was not able to keep up, leading to a slowly but continuously growing queue of outstanding requests.
While fair queueing made sure that most users did not notice much of a slowdown in their regular work with the system at first, it was clear that we needed a way to resolve this situation on our side until a fixed client update could be developed and rolled out.
The most obvious solution would have been to revoke access for the affected OAuth Client ID, but it would also have been the one with the most drastic side-effects. Effectively, the application would have stopped working for all customers, including those who either did not yet have the broken update installed or whose behavior had not triggered the bug. Clearly not a good option.
Another course of action we considered for a short moment was to introduce a rate limit using the Client ID as a discriminator. It would have had the same broad side-effects as locking them out completely, affecting lots of innocent users. Basically anything just taking the Client ID into account would hit more users than necessary.
What we came up with is a rate limiting configuration based on the user’s access token instead of the client software, and the specific API call the broken client flooded. While the approach itself is not particularly ingenious, the implementation of the corresponding HAProxy configuration turned out to be a little trickier than anticipated. Most examples are based on the sender’s IP address. However, we did not want to punish all users behind the same NATing company firewall as if they were one single offender.
So without further ado, here is the relevant snippet from our HAProxy configuration:
bind 192.168.0.1:443 ssl crt /etc/haproxy/ssl/api.pem no-sslv3 ciphers ...
default_backend be_api
tcp-request inspect-delay 5s
acl document_request path_beg -i /v2/documents
acl is_upload hdr_beg(Content-Type) -i multipart/form-data
acl too_many_uploads_by_user sc0_gpc0_rate() gt 100
acl mark_seen sc0_inc_gpc0 gt 0
stick-table type string size 100k store gpc0_rate(60s)
tcp-request content track-sc0 hdr(Authorization) if METH_POST document_request is_upload
use_backend be_429_slow_down if mark_seen too_many_uploads_by_user
backend be_429_slow_down
timeout tarpit 2s
errorfile 500 /etc/haproxy/errorfiles/429.http
Let’s go through these in some more detail.
First of all, right after declaring the frontend’s name, we bind the appropriate IP address and port, and set up the TLS settings with the certificate/private key and a set of ciphers (left out for brevity).
Then we set the default_backend to be be_api. This will handle the default case of all requests that are not rate limited.
The next line, tcp-request inspect-delay, is required to ensure the following checks have all required information available. Leaving it out will even cause HAProxy to issue a warning, because we are using TCP related metrics a few lines further down. Setting the delay like this will make HAProxy wait at most 5 seconds for the connection handshaking to complete before it starts evaluating the inspection rules. Not setting it would provoke race conditions, because the rules would be run immediately upon arrival of the first (potentially incomplete) data, leading to unpredictable results.
The next block contains ACL rule definitions. It is important to note that they are not yet evaluated here. The ACL names merely get bound to the rule following them.
document_request checks if the requested resource’s path begins (path_beg) with the string /v2/documents, performing a case-insensitive comparison (-i).
is_upload checks if the value of the Content-Type header begins with the search string multipart/form-data, again case-insensitive. This is the Content-Type the broken client sends from its buggy code path. The other client features might access the same resource, but with different content types. We do not want to limit those.
mark_seen defines that on its execution the General Purpose Counter (GPC) "0" should be incremented. This is the counter whose rate of increase is checked in too_many_uploads_by_user.
too_many_uploads_by_user is a little more involved. It checks whether the average increment rate of the General Purpose Counter (GPC) "0" is greater than 100 over the configured time period. We will get back to that in a moment.
Next up we define a lookup table to keep track of string objects (type string) with up to 100,000 table rows. The content of that string will be the Authorization header value, i.e. the user’s access token (next line). The value stored alongside each token is the General Purpose Counter "0"’s rate of increase over 1 minute.
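To make the idea of "a rate stored per token" more concrete, here is a minimal Python sketch of such a table. It is only an illustration: the RateTable class and its method names are made up for this example, and it counts events in an exact sliding window, which is a simplification and not necessarily how HAProxy computes gpc0_rate internally.

```python
from collections import defaultdict, deque


class RateTable:
    """Toy stand-in for a stick-table that stores an event rate per string key.

    Hypothetical helper for illustration only; not HAProxy's implementation.
    """

    def __init__(self, period_seconds=60.0):
        self.period = period_seconds
        # key (e.g. the Authorization header value) -> timestamps of increments
        self._events = defaultdict(deque)

    def increment(self, key, now):
        """Record one counter increment for `key` at time `now` (in seconds)."""
        self._events[key].append(now)

    def rate(self, key, now):
        """Return how many increments `key` had within the last period."""
        events = self._events[key]
        # Forget increments that have fallen out of the time window.
        while events and events[0] <= now - self.period:
            events.popleft()
        return len(events)
```

Each access token gets its own entry, so one user flooding the API does not affect the rate seen for any other token, even if both sit behind the same IP address.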
So much for the definition of rules. Now we will actually inspect an incoming request’s content (tcp-request content). We enable tracking of the session’s Authorization header’s value in the aforementioned stick-table under certain conditions. Those are listed after the if keyword (logical AND is the default). In this particular case we are only interested in tracking HTTP POST requests (METH_POST) that are document_requests (as defined before in the ACL of that name) and have the right Content-Type (is_upload).
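The tracking condition can be restated as a small Python predicate. This is a sketch for illustration, not something HAProxy runs: the function name should_track and the plain dict used for headers are assumptions of this example, and the header lookup here is simplified to an exact key match.

```python
def should_track(method, path, headers):
    """Return True if a request matches all three tracking conditions.

    Hypothetical helper mirroring METH_POST, document_request and is_upload.
    """
    content_type = headers.get("Content-Type", "")
    is_post = method == "POST"
    # path_beg -i: case-insensitive prefix match on the path
    document_request = path.lower().startswith("/v2/documents")
    # hdr_beg(Content-Type) -i: case-insensitive prefix match on the header value
    is_upload = content_type.lower().startswith("multipart/form-data")
    # Conditions listed after "if" are combined with a logical AND
    return is_post and document_request and is_upload
```

A POST to /v2/documents/123 with Content-Type "multipart/form-data; boundary=x" matches, while the same path with "application/json" or a GET request does not, so those requests are never counted against the user.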
Notice that the too_many_uploads_by_user and mark_seen ACLs have not yet been executed, because so far they were only declared.
They are executed now as part of the use_backend directive. This selects a backend other than the default one in case the too_many_uploads_by_user ACL matches. For this check to ever yield any meaningful result, we must ensure the GPC is actually incremented, so that the stick-table contains values other than 0 for each user access token.
This is where the mark_seen pseudo-ACL comes into play. Its only purpose is to increment the GPC for the tracking entry in the stick-table. It might seem a bit strange to do it like this, but remember, the ACL declaration did not actually do anything yet, but only connected names and actions/checks to be executed later.
Important: Notice that the mark_seen ACL is listed first! It must be, because HAProxy uses short-circuit evaluation of the conditions: If the first condition evaluates to false, the remaining ones could never change the overall result to true again, hence they are not evaluated at all. If mark_seen were placed after too_many_uploads_by_user, it would never be considered, as initially, of course, the upload limit was not yet reached.
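The effect of the ordering can be reproduced with Python's own short-circuiting "and" operator. The following is a toy model under stated assumptions: the Counter class and the two functions are invented for this illustration, and it compares a plain count against the limit instead of a rate over time.

```python
class Counter:
    """Toy counter standing in for the per-token GPC entry (illustrative only)."""

    def __init__(self):
        self.value = 0

    def mark_seen(self):
        # Side effect first, then a check that is always true afterwards.
        self.value += 1
        return self.value > 0

    def too_many(self, limit=100):
        return self.value > limit


def right_order(counter):
    # mark_seen runs on every request, so the counter actually grows.
    return counter.mark_seen() and counter.too_many()


def wrong_order(counter):
    # too_many is false at the start, so mark_seen is never reached.
    return counter.too_many() and counter.mark_seen()
```

With the side-effecting condition first, the counter grows with every request and the limit is eventually exceeded. With the order swapped, the counter stays at 0 forever and no request is ever throttled.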
In effect, whenever a request comes in that matches the conditions (POST method, correct path, correct Content-Type) a counter is incremented. If the rate of increase goes above 100 per minute, the request will be forwarded to the special be_429_slow_down backend. If the requests come in slowly enough, they will be handled by the default backend.
The be_429_slow_down backend uses the so-called tarpit feature, usually used to tie up an attacker’s resources by keeping a request open for a defined period of time before closing it. The HTTP tarpit option sends an error to the client. Unfortunately, HAProxy does not allow the specification of a particular HTTP response code for tarpits, but always defaults to 500. As we want to both slow broken clients down and inform them about the particular error cause, we use a little hack: Using errorfile we specify a custom file 429.http to be sent for 500, which in fact contains an HTTP 429 response. This goes against best practices, but works nicely nevertheless:
HTTP/1.1 429 Too Many Requests
Too Many Requests (HAP429).
See the HAProxy documentation for details.
Most examples found online for rate limiting with HAProxy are based purely on ports and IP addresses, not on higher level protocol information. It took us a little while to put together all the pieces and wrap our heads around the concept of HAProxy’s counters, stick-tables and the time of ACL evaluations.
The config described above has been in production for a few weeks now and works flawlessly, keeping our backend servers safe from problematic clients. Should the need for other limits arise in the future, we now have an effective way to handle those in a fine-grained way.