In this note I’ll show why the rate limit algorithms GCRA, leaky bucket, and token bucket behave the same.
Tony Finch – blog
There is an IETF draft that aims to standardize RateLimit header
fields for HTTP. A RateLimit header in a successful response
can inform a client when it might expect to be throttled, so it can
avoid 429 Too Many Requests errors. Servers can also
include RateLimit headers in a 429 response to make the error more
informative.
The draft is in reasonably good shape. However, as written it seems to require (or at least assume) that the server uses bad quota-reset rate limit algorithms. Quota-reset algorithms encourage clients into cyclic burst-pause behaviour; the draft has several paragraphs discussing this problem.
However, if we consider that RateLimit headers are supposed to tell
the client what acceptable behaviour looks like, they can be used with
any rate limit algorithm. (And it isn’t too hard to rephrase the draft
so that it is written in terms of client behaviour instead of server
behaviour.)
When a client has more work to do than will fit in a single window’s
quota, linear rate limit algorithms such as GCRA encourage the client
to smooth out its requests nicely. In this article I’ll describe how a
server can use a linear rate limit algorithm with HTTP RateLimit
headers.
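As a taste of what that means in practice, here’s a minimal GCRA sketch (my own illustration, with made-up names, not code from the draft): the entire per-client state is a single timestamp, and the same state yields sensible values for the RateLimit fields.

// minimal GCRA sketch: `quota` requests per `window` seconds; the state is
// one timestamp, the "theoretical arrival time" (TAT)
struct Gcra {
    tat: f64,      // theoretical arrival time, in seconds
    interval: f64, // window / quota: the cost of one request
    burst: f64,    // how far tat may run ahead of now (window, for a full burst)
}

impl Gcra {
    // allow the request, or say how long the client should wait
    fn check(&mut self, now: f64) -> Result<(), f64> {
        let new_tat = self.tat.max(now) + self.interval;
        if new_tat - now > self.burst {
            Err(new_tat - self.burst - now) // retry-after, in seconds
        } else {
            self.tat = new_tat;
            Ok(())
        }
    }

    // roughly how many requests would be accepted right now without waiting,
    // a plausible basis for a RateLimit "remaining" value
    fn remaining(&self, now: f64) -> u64 {
        ((self.burst - (self.tat - now).max(0.0)) / self.interval) as u64
    }
}

A client that spaces its requests by interval seconds is never throttled, which is exactly the smooth behaviour the headers are supposed to steer it towards.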
A while back I wrote about the linear rate limit algorithms leaky bucket and GCRA. Since then I have been vexed by how common it is to implement rate limiting using complicated and wasteful algorithms (for example).
But linear (and exponential) rate limiters have a disadvantage: they can be slow to throttle clients whose request rate is above the limit but not super fast. (For example, a limiter that allows 10 requests per second plus a burst allowance of 600 requests takes 10 minutes to start throttling a client sending 11 requests per second.) And I just realised that this disadvantage can be unacceptable in some situations, when it’s imperative that no more than some quota of requests is accepted within a window of time.
In this article I’ll explore a way to enforce rate limit quotas more precisely, without undue storage costs, and without encouraging clients to oscillate between bursts and pauses. However I’m not sure it’s a good idea.
Here’s a pearlescent winter holiday gift for you!
There are four variants of the algorithm for shuffling an array, arising from two independent choices:
- whether to swap elements in the higher or lower parts of the array
- whether the boundary between the parts moves upwards or downwards
The variants are perfectly symmetrical, but they work in two fundamentally different ways: sampling or permutation.
The most common variant is Richard Durstenfeld’s shuffle algorithm, which moves the boundary downwards and swaps elements in the lower part of the array. Knuth describes it in TAOCP vol. 2 sect. 3.4.2; TAOCP doesn’t discuss the other variants.
(Obeying Stigler’s law, it is often called a “Fisher-Yates” shuffle, but their pre-computer algorithm is arguably different from the modern algorithm.)
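For reference, here’s a sketch of that most common variant in Rust (my own rendering; the source of random indexes is passed in as a closure rather than assumed):

// Durstenfeld's shuffle: the boundary between the unshuffled (lower) and
// shuffled (upper) parts moves downwards; each step swaps a randomly chosen
// element of the lower part into the slot just below the boundary.
// rand_upto(i) is assumed to return a uniform index in 0..=i.
fn shuffle<T>(a: &mut [T], mut rand_upto: impl FnMut(usize) -> usize) {
    for i in (1..a.len()).rev() {
        let j = rand_upto(i);
        a.swap(i, j);
    }
}

The other three variants arise from flipping those two choices: which direction the boundary moves, and which part of the array the swap works in.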
Concrete tetrapods are used to dissipate wave energy in coastal defences.
There’s a bit of a craze for making tetrapod-shaped things: recently I’ve seen people making a plush tetrapod and a tetrapod lamp. So I thought it might be fun to model one.
I found a nice way to describe tetrapods that relies on very few arbitrary aesthetic choices.
Click here to play with an animated tetrapod which I made using three.js. (You can see its source code too.)
Last year I wrote a pair of articles about ratelimiting:
Recently, Chris “cks” Siebenmann has been working on ratelimiting HTTP bots that are hammering his blog. His articles prompted me to write some clarifications, plus a few practical anecdotes about ratelimiting email.
Although it looks really good, I have not yet tried the Jujutsu (jj) version control system, mainly because it’s not yet clearly superior to Magit. But I have been following jj discussions with great interest.
One of the things that jj has not yet tackled is how to do better than git refs / branches / tags. As I understand it, jj currently has something like Mercurial bookmarks, which are more like raw git ref plumbing than a high-level porcelain feature. In particular, jj lacks signed or annotated tags, and it doesn’t have branch names that always automatically refer to the tip.
This is clearly a temporary state of affairs because jj is still incomplete and under development and these gaps are going to be filled. But the discussions have led me to think about how git’s branches are unsatisfactory, and what could be done to improve them.
What does it mean when someone writes that a programming language is “strongly typed”?
I’ve known for many years that “strongly typed” is a poorly-defined term. Recently I was prompted on Lobsters to explain why it’s hard to understand what someone means when they use the phrase.
I came up with more than five meanings!
Previously, I wrote some sketchy ideas for what I call a p-fast trie, which is basically a wide fan-out variant of an x-fast trie. It allows you to find the longest matching prefix or nearest predecessor or successor of a query string in a set of names in O(log k) cache misses, where k is the key length.
My initial sketch was more complicated and greedy for space than necessary, so here’s a simplified revision.
Here’s a sketch of an idea that might or might not be a good idea. Dunno if it’s similar to something already described in the literature – if you know of something, please let me know via the links in the footer!
The gist is to throw away the tree and interior pointers from a qp-trie. Instead, the p-fast trie is stored using a hash map organized into stratified levels, where each level corresponds to a prefix of the key.
Exact-match lookups are normal O(1) hash map lookups. Predecessor / successor searches use binary chop on the length of the key. Where a qp-trie search is O(k), where k is the length of the key, a p-fast trie search is O(log k).
This smaller O(log k) bound is why I call it a “p-fast trie” by analogy with the x-fast trie, which has O(log log N) query time. (The “p” is for popcount.) I’m not sure if this asymptotic improvement is likely to be effective in practice; see my thoughts towards the end of this note.
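As a toy illustration of the stratified-levels idea (my own simplification, not the real data structure), here’s how binary chop on the key length can find the longest matching prefix using only hash lookups:

use std::collections::HashSet;

// levels[len] holds every length-len prefix of every key, so the set of
// prefix lengths that match a query is downward-closed and can be binary
// chopped: O(log k) hash lookups for a key of length k
struct Levels {
    levels: Vec<HashSet<Vec<u8>>>,
}

impl Levels {
    fn new(keys: &[&[u8]]) -> Self {
        let maxlen = keys.iter().map(|k| k.len()).max().unwrap_or(0);
        let mut levels = vec![HashSet::new(); maxlen + 1];
        for key in keys {
            for len in 0..=key.len() {
                levels[len].insert(key[..len].to_vec());
            }
        }
        Levels { levels }
    }

    // length of the longest prefix of query that is a prefix of some key
    fn longest_prefix(&self, query: &[u8]) -> usize {
        let mut lo = 0;
        let mut hi = query.len().min(self.levels.len() - 1);
        while lo < hi {
            let mid = (lo + hi + 1) / 2;
            if self.levels[mid].contains(&query[..mid]) {
                lo = mid; // this prefix is present, so try longer ones
            } else {
                hi = mid - 1; // absent, so every longer prefix is absent too
            }
        }
        lo
    }
}

The real structure needs more machinery per level to answer predecessor and successor queries, but the binary chop over prefix lengths is where the O(log k) bound comes from.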
Here are a few tangentially-related ideas vaguely near the theme of comparison operators.
Here’s a story from nearly 10 years ago.
the bug
I think it was my friend Richard Kettlewell who told me about a bug he encountered with Let’s Encrypt in its early days in autumn 2015: it was failing to validate mail domains correctly.
the context
At the time I had been responsible for Cambridge University’s email anti-spam system for about 10 years, and in 2014 I had been given responsibility for Cambridge University’s DNS. So I knew how Let’s Encrypt should validate mail domains.
Let’s Encrypt was about one year old. Unusually, the code that runs their operations, Boulder, is free software and open to external contributors.
Boulder is written in Golang, and I had not previously written any code in Golang. But it has a reputation for being easy to get to grips with.
So, in principle, the bug was straightforward for me to fix. How difficult would it be as a Golang newbie? And what would Let’s Encrypt’s contribution process be like?
the hack
I cloned the Boulder repository and had a look around the code.
As is pretty typical, there are a couple of stages to fixing a bug in an unfamiliar codebase:
- work out where the problem is
- try to understand if the obvious fix could be better
In this case, I remember discovering a relatively substantial TODO item that intersected with the bug. I can’t remember the details, but I think there were wider issues with DNS lookups in Boulder. I decided it made sense to fix the immediate problem without getting involved in things that would require discussion with Let’s Encrypt staff.
I faffed around with the code and pushed something that looked like it might work.
A fun thing about this hack is that I never got a working Boulder test setup on my workstation (or even Golang, I think!) – I just relied on the Let’s Encrypt cloud test setup. The feedback time was very slow, but it was tolerable for a simple one-off change.
the fix
My pull request was small, +48-14.
After a couple of rounds of review and within a few days, it was merged and put into production!
A pleasing result.
the upshot
I thought Golang (at least as it was used in the Boulder codebase) was as easy to get to grips with as promised. I did not touch it again until several years later, because there was no need to, but it seemed fine.
I was very impressed by the Let’s Encrypt continuous integration and automated testing setup, and by their low-friction workflow for external contributors. One of my fastest drive-by patches to get into worldwide production.
My fix was always going to be temporary, and all trace of it was overwritten years ago. It’s good when “temporary” turns out to be true!
the point
I was reminded of this story in the pub this evening, and I thought it was worth writing down. It demonstrated to me that Let’s Encrypt really were doing all the good stuff they said they were doing.
So thank you to Let’s Encrypt for providing an exemplary service and for giving me a happy little anecdote.
A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn’t pay attention to its performance.
Recently, a comment from Oliver Hunt and a blog post from
Alisa Sireneva prompted me to wonder if I made an
unwarranted assumption. So I wrote a little benchmark, which you can
find in pcg-dxsm.git.
(Note 2025-06-09: I’ve edited this post substantially after discovering some problems with the results.)
In hot weather I like to drink my coffee in an iced latte. To make it, I have a very large Bialetti Moka Express. Recently when I got it going again after a winter of disuse, it took me a couple of attempts to get the technique right, so here are some notes as a reminder to my future self next year.
TIL (or this week-ish I learned) why big-sigma and big-pi turn up in the notation of dependent type theory.
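Very roughly, and as my own gloss rather than anything from the article: if you read types as sets, a dependent pair type has as many elements as a sum over the index type, and a dependent function type has as many elements as a product:

\left|\, \sum_{x:A} B(x) \,\right| \;=\; \sum_{a \in A} \bigl|B(a)\bigr|
\qquad\qquad
\left|\, \prod_{x:A} B(x) \,\right| \;=\; \prod_{a \in A} \bigl|B(a)\bigr|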
About half a year ago I encountered a paper bombastically
titled “the ultimate conditional syntax”. It has the
attractive goal of unifying pattern matching with boolean if tests,
and its solution is in some ways very nice. But it seems
over-complicated to me, especially for something that’s a basic
work-horse of programming.
I couldn’t immediately see how to cut it down to manageable proportions, but after reading syntactic musings on match expressions by Yoshua Wuyts a couple of weeks ago, I had an idea. I’ll outline it under the “penultimate conditionals” heading below, after reviewing the UCS and explaining my motivation.
Recently, Alex Kladov wrote on the TigerBeetle blog about swarm testing data structures. It’s a neat post about randomized testing with Zig. I wrote a comment with an idea that was new to Alex (@matklad), so I’m reposting a longer version here.
I have added syntax highlighting to my blog using tree-sitter. Here are some notes about what I learned, with some complaining.
Last year I wrote about inlining just the fast path of Lemire’s
algorithm for nearly-divisionless unbiased bounded random
numbers. The idea was to reduce code bloat by eliminating lots
of copies of the random number generator in the rarely-executed slow
paths. However a simple split prevented the compiler from being able
to optimize cases like pcg32_rand(1 << n), so a lot of the blog post
was toying around with ways to mitigate this problem.
On Monday while procrastinating a different blog post, I realised that
it’s possible to do better: there’s a more general optimization which
gives us the 1 << n special case for free.
nearly divisionless
Lemire’s algorithm has about 4 neat tricks:
1. use multiplication instead of division to reduce the output of a random number generator modulo some limit
2. eliminate the bias in (1) by (counterintuitively) looking at the lower digits
3. fun modular arithmetic to calculate the reject threshold for (2)
4. arrange the reject tests to avoid the slow division in (3) in most cases
The nearly-divisionless logic in (4) leads to two copies of the random number generator, in the fast path and the slow path.
Generally speaking, compilers don’t try to deduplicate code that was written by the programmer, so they can’t simplify the nearly-divisionless algorithm very much when the limit is constant.
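For concreteness, here’s the run-time-limit version sketched in Rust (my own rendering, with the generator passed in as a closure); the two calls to the generator are the fast path at the top and the rarely-taken slow path that pays for the division:

// Lemire's nearly-divisionless algorithm for a run-time limit (limit > 0)
fn rand_below(limit: u32, mut rng: impl FnMut() -> u32) -> u32 {
    let mut sample = rng() as u64 * limit as u64;  // fast path: one multiply
    if (sample as u32) < limit {                   // cheap pre-test, usually false
        let reject = limit.wrapping_neg() % limit; // the division: 2^32 mod limit
        while (sample as u32) < reject {           // resample the biased region
            sample = rng() as u64 * limit as u64;  // slow-path copy of the generator
        }
    }
    (sample >> 32) as u32
}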
constantly divisionless
Two points occurred to me:
- when the limit is constant, the reject threshold (3) can be calculated at compile time
- when the division is free, there’s no need to avoid it using (4)
These observations suggested that when the limit is constant, the function for random numbers less than a limit should be written:
static inline uint32_t
pcg32_rand_const(pcg32_t *rng, uint32_t limit) {
    uint32_t reject = -limit % limit; /* 2^32 mod limit, constant-folded when limit is constant */
    uint64_t sample;
    do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    while ((uint32_t)(sample) < reject); /* resample while the low 32 bits are in the biased region */
    return ((uint32_t)(sample >> 32));
}
This has only one call to pcg32_random(), saving space as I wanted,
and the compiler is able to eliminate the loop automatically when the
limit is a power of two. The loop is smaller than a call to an
out-of-line slow path function, so it’s better all round than the code
I wrote last year.
algorithm selection
As before it’s possible to automatically choose the
constantly-divisionless or nearly-divisionless algorithms depending on
whether the limit is a compile-time constant or run-time variable,
using arcane C tricks or GNU C
__builtin_constant_p().
I have been idly wondering how to do something similar in other languages.
Rust isn’t very keen on automatic specialization, but it has a reasonable alternative. The thing to avoid is passing a runtime variable to the constantly-divisionless algorithm, because then it becomes never-divisionless. Rust has a much richer notion of compile-time constants than C, so it’s possible to write a method like the following, which can’t be misused:
pub fn upto<const LIMIT: u32>(&mut self) -> u32 {
    let reject = LIMIT.wrapping_neg().wrapping_rem(LIMIT);
    loop {
        let (lo, hi) = self.get_u32().embiggening_mul(LIMIT);
        if lo < reject {
            continue;
        } else {
            return hi;
        }
    }
}
assert!(rng.upto::<42>() < 42);
(embiggening_mul is my stable replacement for the unstable widening_mul API.)
This is a nugatory optimization, but there are more interesting cases
where it makes sense to choose a different implementation for constant
or variable arguments – that is, the constant case isn’t simply a
constant-folded or partially-evaluated version of the variable case.
Regular expressions might be lex-style or pcre-style, for example.
It’s a curious question of language design whether it should be
possible to write a library that provides a uniform API that
automatically chooses constant or variable implementations, or whether
the user of the library must make the choice explicit.
Maybe I should learn some Zig to see how its comptime works.
One of the neat things about the PCG random number generator by Melissa O’Neill is its use of instruction-level parallelism: the PCG state update can run in parallel with its output permutation.
However, PCG only has a limited amount of ILP, about 3 instructions. Its overall speed is limited by the rate at which a CPU can run a sequence where the output of one multiply-add feeds into the next multiply-add.
… Or is it?
With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster.
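Roughly speaking, the linear algebra is the standard jump-ahead identity for linear generators (a sketch of the general trick, not the exact AVX512 code): k steps of the state update collapse into a single affine step, so several lanes can each be pre-advanced by a different offset and then all stepped by k at once, with no serial dependency between them.

// collapse k steps of state = a*state + c (mod 2^64) into one affine step,
// by repeated squaring of the affine map; all arithmetic wraps mod 2^64
fn lcg_jump(mut a: u64, mut c: u64, mut k: u32, state: u64) -> u64 {
    let (mut big_a, mut big_c) = (1u64, 0u64); // the identity map
    while k > 0 {
        if k & 1 == 1 {
            big_a = big_a.wrapping_mul(a);
            big_c = big_c.wrapping_mul(a).wrapping_add(c);
        }
        c = c.wrapping_mul(a).wrapping_add(c); // square the map (uses the old a)
        a = a.wrapping_mul(a);
        k >>= 1;
    }
    big_a.wrapping_mul(state).wrapping_add(big_c)
}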