Rate Limiting based on HTTP headers with HAProxy

Recently we had a problem with a buggy update to a piece of 3rd party client software. It produced lots and lots of valid, but nonsensical requests, targeting our system.

This post details how we added dynamic rate limiting to our HAProxy load balancers, heavily throttling only the very specific set of HTTP requests caused by the client bug, while maintaining regular operations for other requests, even on the same URLs.

The files described in this article are available in a GitHub repository for easy access.

Stampede

What made things interesting was that the client software was mostly fine, but a single background sync feature repeatedly (and quite relentlessly) uploaded a tremendous number of small objects, even though they had already been sent, creating lots of duplicate copies on the backend. At the same time, the interactive portion of the application was working nicely. Moreover, even though the problematic update was distributed to a wide audience, only a certain usage pattern would trigger the bug in a comparatively small portion of installs.

Due to the high frequency of requests coming in with almost no effort client-side, the more heavyweight asynchronous server-side processing was not able to keep up, leading to a slowly but continuously growing queue of outstanding requests.

While fair queueing made sure that most users did not notice much of a slowdown in their regular work with the system at first, it was clear that we needed a way to resolve this situation on our side until a fixed client update could be developed and rolled out.

Options

The most obvious solution would have been to revoke access for the affected OAuth Client ID, but it would also have been the one with the most drastic side-effects. Effectively, the application would have stopped working for all customers, including those who either did not yet have the broken update installed or whose behavior had not triggered the bug. Clearly not a good option.

Another course of action we considered for a short moment was to introduce a rate limit using the Client ID as a discriminator. It would have had the same broad side-effects as locking them out completely, affecting lots of innocent users. Basically anything just taking the Client ID into account would hit more users than necessary.

Implemented Fix

What we came up with is a rate limiting configuration based on the user’s access token instead of the client software, and the specific API call the broken client flooded. While the approach itself is not particularly ingenious, the implementation of the corresponding HAProxy configuration turned out to be a little trickier than anticipated. Most examples are based on the sender’s IP address; however, we did not want to punish all users behind the same NATing company firewall as one single offender.

So without further ado here is the relevant snippet from haproxy.cfg:

frontend fe_api_ssl
  bind 192.168.0.1:443 ssl crt /etc/haproxy/ssl/api.pem no-sslv3 ciphers ...
  default_backend be_api
 
  tcp-request inspect-delay 5s
 
  acl document_request path_beg -i /v2/documents
  acl is_upload hdr_beg(Content-Type) -i multipart/form-data
  acl too_many_uploads_by_user sc0_gpc0_rate() gt 100
  acl mark_seen sc0_inc_gpc0 gt 0
 
  stick-table type string size 100k store gpc0_rate(60s)
 
  tcp-request content track-sc0 hdr(Authorization) if METH_POST document_request is_upload
 
  use_backend be_429_slow_down if mark_seen too_many_uploads_by_user
 
backend be_429_slow_down
  timeout tarpit 2s
  errorfile 500 /etc/haproxy/errorfiles/429.http
  http-request tarpit

Let’s go through these in some more detail.

First of all, right after declaring the frontend’s name to be fe_api_ssl we bind the appropriate IP address and port, and set up the TLS settings with the certificate/private key and a set of ciphers (left out for brevity).

Then we set the default_backend to be be_api. This will handle the default case of all requests that are not rate limited.

The next line tcp-request inspect-delay is required to ensure the following checks have all required information available. Leaving it out will even cause HAProxy to issue a warning, because we are using TCP related metrics a few lines further down. Setting the delay like this makes HAProxy wait at most 5 seconds for the connection handshake to complete before it starts evaluating the inspection rules. Not setting it would provoke race conditions, because the rules would be run immediately upon arrival of the first – potentially incomplete – data, leading to unpredictable results.

The next block contains ACL rule definitions. It is important to note that they are not evaluated here yet. The ACL names merely get bound to the rules following them.

  • document_request checks if the requested resource’s path begins (path_beg) with the string
    /v2/documents, performing a case-insensitive comparison (-i).
  • is_upload checks if the value of the Content-Type header matches the search string multipart/form-data, again case-insensitive. This is the Content-Type the broken client sends from its buggy code path. The other client features might access the same resource, but with different content types. We do not want to limit those.
  • mark_seen defines that on its execution the General Purpose Counter (GPC) "0" should be incremented. This is the counter whose increase rate is checked in too_many_uploads_by_user.
  • too_many_uploads_by_user is a little more involved. It checks if the average increment rate of the General Purpose Counter (GPC) "0" is greater than 100 over the configured time period. We will get back to that in a moment.

Next up we define a lookup table to keep track of string objects (type string) with up to 100,000 table rows. The content of that string will be the Authorization header value, i.e., the user’s access token (see the next line). The value stored alongside each token is the General Purpose Counter "0"’s increase rate over 1 minute.
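One caveat worth keeping in mind: for type string, HAProxy’s stick-table only stores a limited number of characters per key (32 by default), and OAuth bearer tokens in an Authorization header are typically much longer. Truncated keys could make distinct tokens collide in the table. A variant with an explicit len parameter, assuming your tokens fit in 256 characters, might look like this:

```
  stick-table type string len 256 size 100k store gpc0_rate(60s)
```

The trade-off is memory: each table row reserves space for the full key length, so pick a len that covers your real token sizes without being excessive.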

So much for the definition of rules. Now we will actually inspect an incoming request’s content (tcp-request content). We enable tracking of the session’s Authorization header’s value in the aforementioned stick-table under certain conditions. Those are listed after the if keyword (logical AND is the default). In this particular case we are only interested in tracking HTTP POST requests (METH_POST) that are document_requests (as defined before in the ACL of that name) and have the right Content-Type (is_upload ACL).
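As an aside, newer HAProxy versions can express the same tracking on the HTTP level with an http-request rule, which is evaluated after the request has been fully parsed. A sketch of that variant (assuming HAProxy 1.6 or later; otherwise the tcp-request form from the listing above applies):

```
  http-request track-sc0 hdr(Authorization) if METH_POST document_request is_upload
```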

Notice that so far the too_many_uploads_by_user and mark_seen ACLs have not been executed; they have only been declared.

They are executed now as part of the use_backend directive. This selects a backend other than the default in case the too_many_uploads_by_user ACL matches. For this check to ever yield any meaningful result, we must ensure the GPC is actually incremented, so that the stick-table contains values other than 0 for each user access token.

This is where the mark_seen pseudo-ACL comes into play. Its only purpose is to increment the GPC for the tracking entry in the stick-table. It might seem a bit strange to do it like this, but remember, the ACL declaration did not actually do anything yet, but only connected names and actions/checks to be executed later.

Important: Notice that the mark_seen ACL is listed first! It must be, because HAProxy uses short-circuit evaluation of the conditions: if the first condition evaluates to false, the remaining ones could never change the overall result to true again, so they are not evaluated at all. If mark_seen were placed after too_many_uploads_by_user, it would never be considered, because initially, of course, the upload limit has not been reached yet.

In effect, whenever a request comes in that matches the conditions (POST method, correct path, correct Content-Type) a counter is incremented. If the rate of increase goes above 100 per minute, the request will be forwarded to the special be_429_slow_down backend.
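The mechanics of that per-token rate window can be sketched in Python. This is a rough model only: HAProxy computes gpc0_rate over a fixed period internally, whereas the class below uses an exact sliding window, and all names here are illustrative, not HAProxy code:

```python
import time
from collections import deque

class RateTracker:
    """Rough model of a stick-table storing gpc0_rate(60s) per key."""

    def __init__(self, window_seconds=60, limit=100):
        self.window = window_seconds
        self.limit = limit
        self.events = {}  # key -> deque of event timestamps

    def record(self, key, now=None):
        """Increment the counter for key (the 'mark_seen' step)."""
        now = time.monotonic() if now is None else now
        self.events.setdefault(key, deque()).append(now)

    def over_limit(self, key, now=None):
        """True if more than `limit` events fall inside the window
        (the 'too_many_uploads_by_user' check)."""
        now = time.monotonic() if now is None else now
        q = self.events.get(key, deque())
        while q and now - q[0] > self.window:
            q.popleft()  # drop events that aged out of the window
        return len(q) > self.limit

tracker = RateTracker(window_seconds=60, limit=100)
token = "Bearer abc123"  # stands in for the Authorization header value
for i in range(150):
    tracker.record(token, now=i * 0.1)  # 150 uploads within 15 seconds
print(tracker.over_limit(token, now=15.0))  # -> True
```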

If the requests come in slowly enough, they will be handled by the default backend.

The be_429_slow_down backend uses the so-called tarpit feature, which is usually used to bind an attacker’s resources by keeping a request open for a defined period of time before closing it. The HTTP tarpit option sends an error to the client. Unfortunately, HAProxy does not allow specifying a particular HTTP response code for tarpits, but always defaults to 500. As we want to both slow broken clients down and inform them about the particular error cause, we use a little hack: using errorfile we specify a custom file 429.http to be sent for status 500, which in fact contains an HTTP 429 response. This goes against best practices, but works nicely nevertheless:

HTTP/1.1 429 Too Many Requests
Cache-Control: no-cache
Connection: close
Content-Type: text/plain
Retry-After: 60
 
Too Many Requests (HAP429).

See the HAProxy documentation for details.

Conclusion

Most examples found online for rate limiting with HAProxy are based purely on ports and IP addresses, not on higher level protocol information. It took us a little while to put together all the pieces and wrap our heads around the concept of HAProxy’s counters, stick-tables and the time of ACL evaluations.

The config described above has been in production for a few weeks now and works flawlessly, keeping our backend servers safe from problematic clients. Should the need for other limits arise in the future, we now have an effective way to handle those in a fine-grained way.

Daniel Schneller has been designing and implementing complex software and database systems for more than 15 years and is the author of the MySQL Admin Cookbook. His current job title is Principal Cloud Engineer at CenterDevice GmbH, where he focuses on OpenStack and Ceph based cloud technologies. He has given talks at FroSCon, Data2Day and DWX Developer Week among others.
