Rate Limiting based on HTTP headers with HAProxy

3.12.2014 | 7 minutes of reading time

Recently we had a problem with a buggy update to a piece of 3rd party client software. It produced lots and lots of valid but nonsensical requests targeting our system.

This post details how we added dynamic rate limiting to our HAProxy load balancers, heavily throttling only a very specific set of HTTP requests caused by the client bug, while maintaining regular operations for other requests, even on the same URLs.

The files described in this article are available in a GitHub repository for easy access.

Stampede

What made things interesting was that the client software was mostly fine, but a single background sync feature repeatedly (and quite relentlessly) uploaded a tremendous amount of small objects, even though they had already been sent, creating lots of duplicate copies on the backend. At the same time, the interactive portion of the application was working nicely. Moreover, even though the problematic update was distributed to a wide audience, only a certain usage pattern would trigger the bug in a comparatively small portion of installs.

Due to the high frequency of requests coming in with almost no effort client-side, the more heavyweight asynchronous server-side processing was not able to keep up, leading to a slowly but continuously growing queue of outstanding requests.

While fair queueing made sure that most users did not notice much of a slowdown in their regular work with the system at first, it was clear that we needed a way to resolve this situation on our side until a fixed client update could be developed and rolled out.

Options

The most obvious solution would have been to revoke access for the affected OAuth Client ID, but it would also have been the one with the most drastic side-effects. Effectively, the application would have stopped working for all customers, including those who either did not yet have the broken update installed or whose behavior had not triggered the bug. Clearly not a good option.

Another course of action we considered for a short moment was to introduce a rate limit using the Client ID as a discriminator. It would have had the same broad side-effects as locking them out completely, affecting lots of innocent users. Basically anything just taking the Client ID into account would hit more users than necessary.

Implemented Fix

What we came up with is a rate limiting configuration based on the user’s access token instead of the client software, combined with the specific API call the broken client flooded. While the approach itself is not particularly ingenious, the implementation of the corresponding HAProxy configuration turned out to be a little trickier than anticipated. Most examples are based on the sender’s IP address; however, we did not want to punish all users behind the same NATing company firewall for the behavior of a single offender.

So without further ado here is the relevant snippet from haproxy.cfg:

frontend fe_api_ssl
  bind 192.168.0.1:443 ssl crt /etc/haproxy/ssl/api.pem no-sslv3 ciphers ...
  default_backend be_api

  tcp-request inspect-delay 5s

  # ACLs are only declared here; they are evaluated where they are used.
  acl document_request path_beg -i /v2/documents
  acl is_upload hdr_beg(Content-Type) -i multipart/form-data
  acl too_many_uploads_by_user sc0_gpc0_rate() gt 100
  acl mark_seen sc0_inc_gpc0 gt 0

  stick-table type string size 100k store gpc0_rate(60s)

  tcp-request content track-sc0 hdr(Authorization) if METH_POST document_request is_upload

  # mark_seen must come first so the counter is incremented before the check.
  use_backend be_429_slow_down if mark_seen too_many_uploads_by_user

backend be_429_slow_down
  timeout tarpit 2s
  errorfile 500 /etc/haproxy/errorfiles/429.http
  http-request tarpit

Let’s go through these in some more detail.

First of all, right after declaring the frontend’s name to be fe_api_ssl we bind the appropriate IP address and port, and set up the TLS settings with the certificate/private key and a set of ciphers (left out for brevity).

Then we set the default_backend to be be_api. This will handle the default case of all requests that are not rate limited.

The next line, tcp-request inspect-delay, is required to ensure that the following checks have all the information they need. Leaving it out will even cause HAProxy to issue a warning, because we rely on tcp-request content inspection a few lines further down. Setting the delay like this makes HAProxy wait at most 5 seconds for the required request data to arrive before it starts evaluating the inspection rules. Not setting it would provoke race conditions, because the rules would be run immediately upon arrival of the first, potentially incomplete, data, leading to unpredictable results.

The next block contains ACL rule definitions. It is important to note that they are not yet evaluated here. The ACL names are merely bound to the rules that follow them.

document_request checks if the requested resource’s path begins with the string /v2/documents (path_beg), performing a case-insensitive comparison (-i).
is_upload checks if the value of the Content-Type header begins with the string multipart/form-data (hdr_beg), again case-insensitive. This is the Content-Type the broken client sends from its buggy code path. The other client features might access the same resource, but with different content types. We do not want to limit those.
mark_seen defines that on its execution the General Purpose Counter (GPC) "0" should be incremented. This is the counter whose rate of increase is checked in too_many_uploads_by_user.
too_many_uploads_by_user is a little more involved. It checks if the average increment rate of GPC "0" is greater than 100 over the configured time period. We will get back to that in a moment.

Next up we define a lookup table to keep track of string objects (type string) with room for about 100,000 table rows (size 100k). The content of that string will be the Authorization header value, i.e. the user’s access token (next line). The value stored alongside each token is the General Purpose Counter "0"’s increase rate over 1 minute.
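To make the idea of "a table keyed by token that stores a rate" more tangible, here is a minimal Python sketch. It is only a mental model: the class name RateTable and its methods are made up for illustration, and it uses a plain sliding window of timestamps, which is an assumption and not how HAProxy implements its rate counters internally.

```python
from collections import defaultdict, deque


class RateTable:
    """Simplified model of a stick-table storing an event rate per key.

    Assumption: a plain sliding window of timestamps. HAProxy's own rate
    counters work differently; only the concept is modeled here.
    """

    def __init__(self, period_seconds=60.0):
        self.period = period_seconds
        self._events = defaultdict(deque)  # key -> timestamps of increments

    def increment(self, key, now):
        """Record one counter increment for `key` at time `now` (seconds)."""
        self._events[key].append(now)

    def rate(self, key, now):
        """Return the number of increments for `key` in the last period."""
        events = self._events.get(key)
        if not events:
            return 0
        # Drop increments that have fallen out of the window.
        while events and events[0] <= now - self.period:
            events.popleft()
        return len(events)


table = RateTable(period_seconds=60)
token = "Bearer example-token"  # hypothetical Authorization header value
for i in range(101):
    table.increment(token, now=i * 0.5)  # 101 uploads within about 50 seconds

print(table.rate(token, now=50.0))   # 101, i.e. above the limit of 100
print(table.rate(token, now=120.0))  # 0, the window has moved on
```

Each access token gets its own entry, so one misbehaving user does not influence the rate seen for anybody else.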

So much for the definition of rules. Now we will actually inspect an incoming request’s content (tcp-request content). We enable tracking of the session’s Authorization header’s value in the aforementioned stick-table under certain conditions. Those are listed after the if keyword (logical AND is the default). In this particular case we are only interested in tracking HTTP POST requests (METH_POST) that are document_requests (as defined before in the ACL of that name) and have the right Content-Type (is_upload ACL).
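Expressed as ordinary code, the tracking condition boils down to three checks combined with a logical AND. The following Python sketch mirrors that logic; the function name should_track and its parameters are hypothetical and only serve to illustrate which requests end up in the table.

```python
def should_track(method, path, content_type):
    """Mirror of the three conditions listed after `if` (implicitly ANDed)."""
    # METH_POST: only POST requests are of interest.
    meth_post = method == "POST"
    # document_request: case-insensitive prefix match on the path.
    document_request = path.lower().startswith("/v2/documents")
    # is_upload: case-insensitive prefix match on the Content-Type header.
    is_upload = (content_type or "").lower().startswith("multipart/form-data")
    return meth_post and document_request and is_upload


# An upload from the buggy code path is tracked ...
print(should_track("POST", "/v2/documents", "multipart/form-data; boundary=x"))  # True
# ... while other calls to the same resource are not.
print(should_track("POST", "/v2/documents", "application/json"))  # False
print(should_track("GET", "/v2/documents/123", None))  # False
```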

Notice that the too_many_uploads_by_user and mark_seen ACLs have not been executed yet, because so far they were only declared.

They are executed now as part of the use_backend directive. It selects a backend other than the default one in case the too_many_uploads_by_user ACL matches. For this check to ever yield any meaningful result, we must ensure the GPC is actually incremented, so that the stick-table contains values other than 0 for each user access token.

This is where the mark_seen pseudo-ACL comes into play. Its only purpose is to increment the GPC for the tracking entry in the stick-table. It might seem a bit strange to do it like this, but remember, the ACL declaration did not actually do anything yet, but only connected names and actions/checks to be executed later.

Important: Notice that the mark_seen ACL is listed first! It must be, because HAProxy uses short-circuit evaluation of the conditions: If the first condition evaluates to false, the remaining ones could never change the overall result to true again, hence they are not evaluated at all. If mark_seen were placed after too_many_uploads_by_user, it would never be evaluated, because initially, of course, the upload limit has not been reached yet.
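The effect of the ordering can be reproduced in a few lines of Python, whose `and` operator short-circuits in the same way. This is a simplified sketch: it uses a plain counter instead of a rate, and the names Entry, mark_seen, too_many and count_throttled are made up for the illustration.

```python
class Entry:
    """Stand-in for one stick-table entry holding a single counter."""

    def __init__(self):
        self.counter = 0


def mark_seen(entry):
    # Side effect first, then a comparison that always succeeds ("gt 0").
    entry.counter += 1
    return entry.counter > 0


def too_many(entry, limit):
    return entry.counter > limit


def count_throttled(requests, limit, mark_first):
    """Return (number of throttled requests, final counter value)."""
    entry = Entry()
    throttled = 0
    for _ in range(requests):
        if mark_first:
            hit = mark_seen(entry) and too_many(entry, limit)
        else:
            # too_many() is False at the start, so mark_seen() never runs.
            hit = too_many(entry, limit) and mark_seen(entry)
        if hit:
            throttled += 1
    return throttled, entry.counter


print(count_throttled(5, limit=3, mark_first=True))   # (2, 5)
print(count_throttled(5, limit=3, mark_first=False))  # (0, 0)
```

With the conditions in the wrong order the counter stays at 0 forever, so the limit can never trigger.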

In effect, whenever a request comes in that matches the conditions (POST method, correct path, correct Content-Type) a counter is incremented. If the rate of increase goes above 100 per minute, the request will be forwarded to the special be_429_slow_down backend.

If the requests come in slowly enough, they will be handled by the default backend.

The be_429_slow_down backend uses the so-called tarpit feature, usually used to tie up an attacker’s resources by keeping a request open for a defined period of time before closing it. The HTTP tarpit option sends an error to the client. Unfortunately, HAProxy does not allow the specification of a particular HTTP response code for tarpits, but always defaults to 500. As we want both to slow broken clients down and to inform them about the particular error cause, we use a little hack: using errorfile, we specify a custom file 429.http to be sent for 500, which in fact contains an HTTP 429 response. This goes against best practices, but works nicely nevertheless:

HTTP/1.1 429 Too Many Requests
Cache-Control: no-cache
Connection: close
Content-Type: text/plain
Retry-After: 60

Too Many Requests (HAP429).

See the HAProxy documentation for details.
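Because the error file is sent to clients verbatim, it can be worth checking that it really forms a complete HTTP response before deploying it. The following Python sketch shows one possible way to do that; the helper parse_raw_response is hypothetical and deliberately simple, not a full HTTP parser.

```python
def parse_raw_response(raw):
    """Split a raw HTTP response into status code, reason, headers and body."""
    # Accept both CRLF and LF line endings for this check.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    status_line, *header_lines = head.split("\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


# Contents of 429.http as shown above.
raw = (
    "HTTP/1.1 429 Too Many Requests\n"
    "Cache-Control: no-cache\n"
    "Connection: close\n"
    "Content-Type: text/plain\n"
    "Retry-After: 60\n"
    "\n"
    "Too Many Requests (HAP429).\n"
)

code, reason, headers, body = parse_raw_response(raw)
print(code, reason)            # 429 Too Many Requests
print(headers["retry-after"])  # 60
```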

Conclusion

Most examples found online for rate limiting with HAProxy are based purely on ports and IP addresses, not on higher level protocol information. It took us a little while to put together all the pieces and wrap our heads around the concept of HAProxy’s counters, stick-tables and the timing of ACL evaluation.

The config described above has been in production for a few weeks now and works flawlessly, keeping our backend servers safe from problematic clients. Should the need for other limits arise in the future, we now have an effective way to handle those in a fine-grained way.


Blog author

Daniel Schneller



