Indexing Hackage: Glean vs. hiedb
May 22, 2025

I thought it might be fun to try to use Glean to index as much of Hackage as I could, and then do some rough comparisons against hiedb and also play around to see what interesting queries we could run against a database of all the code in Hackage.
This project was mostly just for fun: Glean is not going to replace
hiedb any time soon, for reasons that will become clear. Neither are
we ready (yet) to build an HLS plugin that can use Glean, but
hopefully this at least demonstrates that such a thing should be
possible, and Glean might offer some advantages over hiedb in
performance and flexibility.
A bit of background:
Glean is a code-indexing system that we developed at Meta. It’s used internally at Meta for a wide range of use cases, including code browsing, documentation generation and code analysis. You can read about the ways in which Glean is used at Meta in Indexing Code At Scale with Glean.
hiedb is a code-indexing system for Haskell. It takes the .hie files that GHC produces when given the option -fwrite-ide-info and writes the information to a SQLite database in various tables. The idea is that putting the information in a DB allows certain operations that an IDE needs to do, such as go-to-definition, to be fast.
You can think of Glean as a general-purpose system that does the same
job as hiedb, but for multiple languages and with a more flexible
data model. The open-source version of Glean comes with indexers for
ten languages or
so, and moreover Glean supports SCIP which has
indexers for various languages available from SourceGraph.
Since a hiedb is just a SQLite DB with a few tables, if you want you
can query it directly using SQL. However, most users will access the
data through either the command-line hiedb tool or through the API,
which provide the higher-level operations such as go-to-definition and
find-references. Glean has a similar setup: you can make raw queries
using Glean’s query language (Angle) using the
Glean shell or the command-line tool, while the higher-level
operations that know about symbols and references are provided by a
separate system called Glass which also has a command-line tool and
API. In Glean the raw data is language-specific, while the Glass
interface provides a language-agnostic view of the data in a way
that’s useful for tools that need to navigate or search code.
An ulterior motive
In part all of this was an excuse to rewrite Glean’s Haskell
indexer. We built a Haskell indexer a while ago but it’s pretty
limited in what information it stores, only capturing enough
information to do go-to-definition and find-references and only for a
subset of identifiers. Furthermore the old indexer works by first
producing a hiedb and consuming that, which is both unnecessary and
limits the information we can collect. By processing the .hie files
directly we have access to richer information, and we don’t have the
intermediate step of creating the hiedb which can be slow.
The rest of this post
The rest of the post is organised as follows, feel free to jump around:
- Performance: a few results comparing hiedb with Glean on an index of all of Hackage.
- Queries: a couple of examples of queries we can do with a Glean index of Hackage: searching by name, and finding dead code.
- Apparatus: more details on how I set everything up and how it all works.
- What’s next: some thoughts on what we still need to add to the indexer.
Performance
All of this was performed on a build of 2900+ packages from Hackage; for more details see Building all of Hackage below.
Indexing performance
I used this hiedb command:
hiedb index -D /tmp/hiedb . --skip-types
I’m using --skip-types because at the time of writing I haven’t
implemented type indexing in Glean’s Haskell indexer, so this should
hopefully give a more realistic comparison.
This was the Glean command:
glean --service localhost:1234 \
index haskell-hie --db stackage/0 \
--hie-indexer $(cabal list-bin hie-indexer) \
~/code/stackage/dist-newstyle/build/x86_64-linux/ghc-9.4.7 \
--src '$PACKAGE'
Time to index:
- hiedb: 1021s
- Glean: 470s
I should note that in the case of Glean the only parallelism is
between the indexer and the server that is writing to the DB. We
didn’t try to index multiple .hie files in parallel, although that
would be fairly trivial to do. I suspect hiedb is also
single-threaded just going by the CPU load during indexing.
Size of the resulting DB
- hiedb: 5.2GB
- Glean: 0.8GB
It’s quite possible that hiedb is simply storing more information, but Glean does have a rather efficient storage system based on RocksDB.
Performance of find-references
Let’s look up all the references of Data.Aeson.encode:
hiedb -D /tmp/hiedb name-refs encode Data.Aeson
This is the query using Glass:
cabal run glass-democlient -- --service localhost:12345 \
references stackage/hs/aeson/Data/Aeson/var/encode
This is the raw query using Glean:
glean --service localhost:1234 --db stackage/0 \
'{ Refs.file, Refs.uses[..] } where Refs : hs.NameRefs; Refs.target.occ.name = "encode"; Refs.target.mod.name = "Data.Aeson"'
- hiedb: 2.3s
- Glean (via Glass): 0.39s
- Glean (raw query): 0.03s
(side note: hiedb found 416 references while Glean found 415. I
haven’t yet checked where this discrepancy comes from.)
But these results don’t really tell the whole story.
In the case of hiedb, name-refs does a full table scan so it’s
going to take time proportional to the number of refs in the DB. Glean
meanwhile has indexed the references by name, so it can serve this
query very efficiently. The actual query takes a few milliseconds, the
main overhead is encoding and decoding the results.
The reason the Glass query takes longer than the raw Glean query is because Glass also fetches additional information about each reference, so it performs a lot more queries.
We can also do the raw hiedb query using the sqlite shell:
sqlite> select count(*) from refs where occ = "v:encode" AND mod = "Data.Aeson";
417
Run Time: real 2.038 user 1.213905 sys 0.823001
Of course hiedb could index the refs table to make this query much
faster, but it’s interesting to note that Glean has already done that
and it was still quicker to index and produced a smaller DB.
Performance of find-definition
Let’s find the definition of Data.Aeson.encode, first with hiedb:
$ hiedb -D /tmp/hiedb name-def encode Data.Aeson
Data.Aeson:181:1-181:7
Now with Glass:
$ cabal run glass-democlient -- --service localhost:12345 \
describe stackage/hs/aeson/Data/Aeson/var/encode
stackage@aeson-2.1.2.1/src/Data/Aeson.hs:181:1-181:47
(worth noting that hiedb is giving the span of the identifier only,
while Glass is giving the span of the whole definition. This is just a
different choice; the .hie file contains both.)
And the raw query using Glean:
$ glean --service localhost:1234 query --db stackage/0 --recursive \
'{ Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N' | jq
{
"id": 18328391,
"key": {
"tuplefield0": {
"id": 9781189,
"key": "aeson-2.1.2.1/src/Data/Aeson.hs"
},
"tuplefield1": {
"start": 4136,
"length": 46
}
}
}
Times:
- hiedb: 0.18s
- Glean (via Glass): 0.05s
- Glean (raw query): 0.01s
In fact there’s a bit of overhead when using the Glean CLI; we can get a better picture of the real query time using the shell:
stackage> { Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N
{
"id": 18328391,
"key": {
"tuplefield0": { "id": 9781189, "key": "aeson-2.1.2.1/src/Data/Aeson.hs" },
"tuplefield1": { "start": 4136, "length": 46 }
}
}
1 results, 2 facts, 0.89ms, 696176 bytes, 2435 compiled bytes
The query itself takes less than 1ms.
Again, the issue with hiedb is that its data is not indexed in a way
that makes this query efficient: the defs table is indexed by the
pair (hieFile,occ) not occ alone. Interestingly, when the module
is known it ought to be possible to do a more efficient query with
hiedb by first looking up the hieFile and then using that to query
defs.
What other queries can we do with Glean?
I’ll look at a couple of examples here, but really the possibilities
are endless. We can collect whatever data we like from the .hie
file, and design the schema around whatever efficient queries we want
to support.
Search by case-insensitive prefix
Let’s search for all identifiers that start with the case-insensitive
prefix "withasync":
$ glass-democlient --service localhost:12345 \
search stackage/withasync -i | wc -l
55
In less than 0.1 seconds we find 55 such identifiers in Hackage. (the
output isn’t very readable so I didn’t include it here, but for
example this finds results not just in async but in a bunch of
packages that wrap async too).
Case-insensitive prefix search is supported by an index that Glean produces when the DB is created. It works in the same way as efficient find-references, more details on that below.
Why only prefix and not suffix or infix? What about fuzzy search? We could certainly provide a suffix search too; infix gets more tricky and it’s not clear that Glean is the best tool to use for infix or fuzzy text search: there are better data representations for that kind of thing. Still, case-insensitive prefix search is a useful thing to have.
Could we support Hoogle using Glean? Absolutely. That said, Hoogle doesn’t seem too slow. Also we need to index types in Glean before it could be used for type search.
Identify dead code
Dead code is, by definition, code that isn’t used anywhere. We have a handy way to find that: any identifier with no references isn’t used. But it’s not quite that simple: we want to ignore references in imports and exports, and from the type signature.
Admittedly finding unreferenced code within Hackage isn’t all that useful, because the libraries in Hackage are consumed by end-user code that we haven’t indexed so we can’t see all the references. But you could index your own project using Glean and use it to find dead code. In fact, I did that for Glean itself and identified one entire module that was dead, amongst a handful of other dead things.
Here’s a query to find dead code:
N where
N = hs.Name _;
N.sort.external?;
hs.ModuleSource { mod = N.mod, file = F };
!(
hs.NameRefs { target = N, file = RefFile, uses = R };
RefFile != F;
coderef = (R[..]).kind
)
Without going into all the details, here’s roughly how it works:
- N = hs.Name _; declares N to be a fact of hs.Name
- N.sort.external?; requires N to be external (i.e. exported), as opposed to a local variable
- hs.ModuleSource { mod = N.mod, file = F }; finds the file F corresponding to this name’s module
- The last part is checking to see that there are no references to this name that are (a) in a different file and (b) are in code, i.e. not import/export references. Restricting to other files isn’t exactly what we want, but it’s enough to exclude references from the type signature. Ideally we would be able to identify those more precisely (that’s on the TODO list).
You can try this on Hackage and it will find a lot of stuff. It might
be useful to focus on particular modules to find things that aren’t
used anywhere, for example I was interested in which identifiers in
Control.Concurrent.Async aren’t used:
N where
N = hs.Name _;
N.mod.name = "Control.Concurrent.Async";
N.mod.unit = "async-2.2.4-inplace";
N.sort.external?;
hs.ModuleSource { mod = N.mod, file = F };
!(
hs.NameRefs { target = N, file = RefFile, uses = R };
RefFile != F;
coderef = (R[..]).kind
)
This finds 21 identifiers, which I can use to decide what to deprecate!
Apparatus
Building all of Hackage
The goal was to build as much of Hackage as possible and then to index
it using both hiedb and Glean, and see how they differ.
To avoid problems with dependency resolution, I used a Stackage LTS snapshot of package versions. Using LTS-21.21 and GHC 9.4.7, I was able to build 2922 packages. About 50 failed for some reason or other.
I used this cabal.project file:
packages: */*.cabal
import: https://www.stackage.org/lts-21.21/cabal.config
package *
  ghc-options: -fwrite-ide-info
  tests: False
  benchmarks: False
allow-newer: *
And did a large cabal get to fetch all the packages in LTS-21.21.
Then
cabal build all --keep-going
After a few retries to install any required RPMs to get the dependency resolution phase to pass, and to delete a few packages that weren’t going to configure successfully, I went away for a few hours to let the build complete.
It’s entirely possible there’s a better way to do this that I don’t know about - please let me know!
Building Glean
The Haskell indexer I’m using is in this pull request which at the time of writing isn’t merged yet. (Since I’ve left Meta I’m just a regular open-source contributor and have to wait for my PRs to be merged just like everyone else!).
Admittedly Glean is not the easiest thing in the world to build, mainly because it has a couple of troublesome dependencies: folly (Meta’s library of highly-optimised C++ utilities) and RocksDB. Glean depends on a very up to date version of these libraries so we can’t use any distro packaged versions.
Full instructions for building Glean are here but roughly it goes like this on Linux:
- Install a bunch of dependencies with apt or yum
- Build the C++ dependencies with ./install-deps.sh and set some env vars
- make
The Makefile is needed because there are some codegen steps that
would be awkward to incorporate into the Cabal setup. After the first
make you can usually just switch to cabal for rebuilding stuff
unless you change something (e.g. a schema) that requires re-running
the codegen.
Running Glean
I’ve done everything here with a running Glean server, which was started like this:
cabal run exe:glean-server -- \
--db-root /tmp/db \
--port 1234 \
--schema glean/schema/source
While it’s possible to run Glean queries directly on the DB without a server, running a server is the normal way because it avoids the latency from opening the DB each time, and it keeps an in-memory cache which significantly speeds up repeated queries.
The examples that use Glass were done using a running Glass server, started like this:
cabal run glass-server -- --service localhost:1234 --port 12345
How does it work?
The interesting part of the Haskell indexer is the schema in hs.angle. Every language that Glean indexes needs a schema, which describes the data that the indexer will store in the DB. Unlike an SQL schema, a Glean schema looks more like a set of datatype declarations, and it really does correspond to a set of (code-generated) types that you can work with when programmatically writing data, making queries, or inspecting results. For more about Glean schemas, see the documentation.
Being able to design your own schema means that you can design
something that is a close match for the requirements of the language
you’re indexing. In our Glean schema for Haskell, we use a Name,
OccName, and Module structure that’s similar to the one GHC uses
internally and is stored in the .hie files.
The indexer
itself
just reads the .hie files and produces Glean data using datatypes
that are generated from the schema. For example, here’s a fragment of
the indexer that produces Module facts, which contain a ModuleName
and a UnitName:
mkModule :: Glean.NewFact m => GHC.Module -> m Hs.Module
mkModule mod = do
  modname <- Glean.makeFact @Hs.ModuleName $
    fsToText (GHC.moduleNameFS (GHC.moduleName mod))
  unitname <- Glean.makeFact @Hs.UnitName $
    fsToText (unitFS (GHC.moduleUnit mod))
  Glean.makeFact @Hs.Module $
    Hs.Module_key modname unitname

Also interesting is how we support fast find-references. This is done using a stored derived predicate in the schema:
predicate NameRefs:
{
target: Name,
file: src.File,
uses: [src.ByteSpan]
} stored {Name, File, Uses} where
FileXRefs {file = File, refs = Refs};
{name = Name, spans = Uses} = Refs[..];
Here NameRefs is a predicate—which you can think of as a datatype,
or a table in SQL—defined in terms of another predicate,
FileXRefs. The facts of the predicate NameRefs (rows of the table)
are derived automatically using this definition when the DB is
created. If you’re familiar with SQL, a stored derived predicate in
Glean is rather like a materialized view in SQL.
What’s next?
As I mentioned earlier, the indexer doesn’t yet index types, so that would be an obvious next step. There are a handful of weird corner cases that aren’t handled correctly, particularly around record selectors, and it would be good to iron those out.
Longer term ideally the Glean data would be rich enough to produce the
Haddock docs. In fact Meta’s internal code browser does produce
documentation on the fly from Glean data for some languages - Hack and
C++ in particular. Doing it for Haskell is a bit tricky because while
I believe the .hie file does contain enough information to do this,
it’s not easy to reconstruct the full ASTs for declarations. Doing it
by running the compiler—perhaps using the Haddock API—would be
an option, but that involves a deeper integration with Cabal so it’s
somewhat more awkward to go that route.
Could HLS use Glean? Perhaps it would be useful to have a full Hackage index to be able to go-to-definition from library references? As a plugin this might make sense, but there are a lot of things to fix and polish before it’s really practical.
Longer term should we be thinking about replacing hiedb with Glean? Again, we’re some way off from that. The issue of incremental updates is an interesting one - Glean does support incremental indexing but so far it’s been aimed at speeding up whole-repository indexing rather than supporting IDE features.
Rethinking Static Reference Tables in GHC
June 22, 2018

It seems rare these days to be able to make an improvement that’s unambiguously better on every axis. Most changes involve a tradeoff of some kind. With a compiler, the tradeoff is often between performance and code size (e.g. specialising code to make it faster leaves us with more code), or between performance and complexity (e.g. adding a fancy new optimisation), or between compile-time performance and runtime performance.
Recently I was lucky enough to be able to finish a project I’ve been working on intermittently in GHC for several years, and the result was satisfyingly better on just about every axis.
- Code size: overall binary sizes are reduced by ~5% for large programs, ~3% for smaller programs.
- Runtime performance: no measurable change on benchmarks, although some really bad corner cases where the old code performed terribly should now be gone.
- Complexity: some complex representations were removed from the runtime, making GC simpler, and the compiler itself also became simpler.
- Compile-time performance: slightly improved (0.2%).
To explain what the change is, first we’ll need some background.
Garbage collecting CAFs
A Constant Applicative Form (CAF) is a top-level thunk. For example:
myMap :: HashMap Text Int
myMap = HashMap.fromList [
    -- lots of data
  ]
Now, myMap is represented in the compiled program by a static
closure that looks like this:
[figure: the static closure for myMap before it is evaluated]
When the program demands the value of myMap for the first time, the
representation will change to this:
[figure: the static closure for myMap after evaluation, now pointing to a value in the dynamic heap]
At this point, we have a reference from the original static closure,
which is part of the compiled program, into the dynamic heap. The
garbage collector needs to know about this reference, because it has
to treat the value of myMap as live data, and ensure that this
reference remains valid.
How could we do that? One way would be to just keep all the CAFs alive for ever. We could keep a list of them and use the list as a source of roots in the GC. That would work, but we’d never be able to garbage-collect any top-level data. Back in the distant past GHC used to work this way, but it interacted badly with the full-laziness optimisation which likes to float things out to the top level - we had to be really careful not to float things out as CAFs because the data would be retained for ever.
Or, we could track the liveness of CAFs properly, like we do for other
data. But how can we find all the references to myMap? The problem
with top-level closures is that their references appear in code, not
just data. For example, somewhere else in our program we might have
myLookup :: String -> Maybe Int
myLookup name = HashMap.lookup name myMap
and in the compiled code for myLookup will be a reference to
myMap.
To be able to know when we should keep myMap alive, the garbage
collector has to traverse all the references from code as well as
data.
Of course, actually searching through the code for symbols isn’t
practical, so GHC produces an additional data structure for all the
code it compiles, called the Static Reference Table (SRT). The SRT
for myLookup will contain a reference to myMap.
The naive way to do this would be to just have a table of all the static references for each code block. But it turns out that there are quite a lot of opportunities for sharing between SRTs - lots of code blocks refer to the same things - so it makes sense to try to use a more optimised representation.
The representation that GHC 8.4 and earlier used was this:
[figure: the old representation: a single per-module table (ThisModule_srt) selected via an srt pointer and srt_bitmap]
All the static references in a module were collected together into a
single table (ThisModule_srt in the diagram), and every static
closure selects the entries it needs with a combination of a pointer
(srt) into the table and a bitmap (srt_bitmap).
This had a few problems:
- On a 64-bit machine we need at least 96 bits for the SRT in every static closure and continuation that has at least one static reference: 64 bits to point to the table and a 32-bit bitmap.
- Sometimes the heuristics in the compiler for generating the table worked really badly. I observed some cases with particularly large modules where we generated an SRT containing two entries that were thousands of entries apart in the table, which required a huge bitmap.
- There was complex code in the RTS for traversing these bitmaps, and complex code in the compiler to generate this table that nobody really understood.
The shiny new way
The basic idea is quite straightforward: instead of the single table and bitmap representation, each code block that needs an SRT will have an associated SRT object, like this:
[figure: each code block pointing to its own SRT object]
Firstly, this representation is a lot simpler, because an SRT object has exactly the same representation as a static constructor, so we need no new code in the GC to handle it. All the code to deal with bitmaps goes away.
However, just making this representation change by itself will cause a lot of code growth, because we lose many of the optimisations and sharing that we were able to do with the table and bitmap representation.
But the new representation has some great opportunities for optimisation of its own, and exploiting all these optimisations results in more compact code than before.
We never need a singleton SRT
If an SRT has one reference in it, we replace the pointer to the SRT with the pointer to the reference itself.
[figure: replacing a singleton SRT with a direct pointer to its single entry]
The SRT field for each code block can be 32 bits, not 96
Since we only need a pointer, not a pointer and a bitmap, the overhead goes down to 64 bits. Furthermore, by exploiting the fact that we can represent local pointers by 32-bit offsets (on x86_64), the overhead goes down to 32 bits.
[figure: the SRT reference stored as a 32-bit offset]
We can common up identical SRTs
This is an obvious one: if multiple code blocks have the same set of static references, they can share a single SRT object.
We can drop duplicate references from an SRT
Sometimes an SRT refers to a closure that is also referred to by something that is reachable from the same SRT. For example:
[figure: an outer SRT and an inner SRT that both reference x]
In this case we can drop the reference to x in the outer SRT,
because it’s already contained in the inner SRT. That leaves the
outer SRT with a single reference, which means the SRT object itself
can just disappear, by the singleton optimisation mentioned earlier.
For a function, we can combine the SRT with the static closure itself
A top-level function with an SRT would look like this:
[figure: a top-level function closure with a separate SRT object]
We might as well just merge the two objects together, and put the SRT entries in the function closure, to give this:
[figure: the function closure with the SRT entries merged in]
Together, these optimisations were enough to reduce code size compared with the old table/bitmap representation.
Show me the code
- An overhaul of the SRT representation
- Save a word in the info table on x86_64
- Merge FUN_STATIC closure with its SRT
Look out for (slightly) smaller binaries in GHC 8.6.1.
Fixing 17 space leaks in GHCi, and keeping them fixed
June 20, 2018

In this post I want to tackle a couple of problems that have irritated me from time to time when working with Haskell.
GHC provides some powerful tools for debugging space leaks, but sometimes they’re not enough. The heap profiler shows you what’s in the heap, but it doesn’t provide detailed visibility into the chain of references that cause a particular data structure to be retained. Retainer profiling was supposed to help with this, but in practice it’s pretty hard to extract the signal you need - retainer profiling will show you one relationship at a time, but you want to see the whole chain of references.
Once you’ve fixed a space leak, how can you write a regression test for it? Sometimes you can make a test case that will use O(n) memory if it leaks instead of O(1), and then it’s straightforward. But what if your leak is only a constant factor?
We recently noticed an interesting space leak in GHCi. If we loaded a set of modules, and then loaded the same set of modules again, GHCi would need twice as much memory as just loading the modules once. That’s not supposed to happen - GHCi should release whatever data it was holding about the first set of modules when loading a new set. What’s more, after further investigation we found that this effect wasn’t repeated the third time we loaded the modules; only one extra set of modules was being retained.
Conventional methods for finding the space leak were not helpful in this case. GHCi is a complex beast, and just reproducing the problem proved difficult. So I decided to try a trick I’d thought about for a long time but never actually put into practice: using GHC’s weak pointers to detect data that should be dead, but isn’t.
Weak pointers can detect space leaks
The System.Mem.Weak library provides operations for creating “weak” pointers. A weak pointer is a reference to an object that doesn’t keep the object alive. If we have a weak pointer, we can attempt to dereference it, which will either succeed and return the value it points to, or it will fail in the event that the value has been garbage collected. So a weak pointer can detect when things are garbage collected, which is exactly what we want for detecting space leaks.
Here’s the idea:
- Call mkWeakPtr v Nothing where v is the value you’re interested in.
- Wait until you believe v should be garbage.
- Call System.Mem.performGC to force a full GC.
- Call System.Mem.Weak.deRefWeak on the weak pointer to see if v is alive or not.
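To make the idea concrete, here is a minimal sketch of such a check (a hypothetical helper of my own, not the actual GHC patch, which is linked below):

import Control.Monad (when)
import Data.Maybe (isJust)
import System.Mem (performGC)
import System.Mem.Weak (deRefWeak, mkWeakPtr)

-- Take a weak pointer to a value we expect to become garbage, and return an
-- action that forces a GC and complains if the value is still reachable.
leakCheck :: String -> a -> IO (IO ())
leakCheck label value = do
  weak <- mkWeakPtr value Nothing
  return $ do
    performGC
    alive <- deRefWeak weak
    when (isJust alive) $
      putStrLn (label ++ ": value is still alive!")

Note that the returned action captures only the weak pointer and the label, not the value itself, so the check can’t accidentally keep the value alive.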
Here’s how I
implemented this for GHCi. One thing to note is that just because
v was garbage-collected doesn’t mean that there aren’t still pieces
of v being retained, so you might need to have several weak pointers
to different components of v, like I did in the GHC patch. These
really did detect multiple different space leaks.
This patch reliably detected leaks in trivial examples, including many of the tests in GHCi’s own test suite. That meant we had a way to reproduce the problem without having to use unpredictable measurement methods like memory usage or heap profiles. This made it much easier to iterate on finding the problems.
Back to the space leaks in GHCi
That still leaves us with the problem of how to actually diagnose the
leak and find the cause. Here the techniques are going to get a bit
more grungy: we’ll use gdb to poke around in the heap at runtime,
along with some custom utilities in the GHC runtime to help us search
through the heap.
To set things up for debugging, we need to
- Compile GHC with -g and -debug, to add debugging info to the binary and debugging functionality to the runtime, respectively.
- Load up GHCi in gdb (that’s a bit fiddly and I won’t go into the details here).
- Set things up to reproduce the test case.
*Main> :l
Ok, no modules loaded.
-fghci-leak-check: Linkable is still alive!
Prelude>
The -fghci-leak-check code just spat out a message when it
detected a leak. We can Ctrl-C to break into gdb:
Program received signal SIGINT, Interrupt.
0x00007ffff17c05b3 in __select_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
Next I’m going to search the heap for instances of the LM
constructor, which corresponds to the Linkable type that the leak
detector found. There should be none of these alive, because the :l
command tells GHCi to unload everything, so any LM
constructors we find must be leaking:
(gdb) p findPtr(ghc_HscTypes_LM_con_info,1)
0x4201a073d8 = ghc:HscTypes.LM(0x4201a074b0, 0x4201a074c8, 0x4201a074e2)
-->
0x4200ec2000 = WEAK(key=0x4201a073d9 value=0x4201a073d9 finalizer=0x7ffff2a077d0)
0x4200ec2000 = WEAK(key=0x4201a073d9 value=0x4201a073d9 finalizer=0x7ffff2a077d0)
0x42017e2088 = ghc-prim:GHC.Types.:(0x4201a073d9, 0x7ffff2e9f679)
0x42017e2ae0 = ghc-prim:GHC.Types.:(0x4201a073d9, 0x7ffff2e9f679)
$1 = void
The findPtr function comes from the RTS, it’s a function designed
specifically for searching through the heap for things from inside
gdb. I asked it to search for ghc_HscTypes_LM_con_info,
which is the info pointer for the LM constructor - every
instance of that constructor will have this pointer as its first word.
The findPtr function doesn’t just search for objects in the heap, it
also attempts to find the object’s parent, and will continue tracing
back through the chain of ancestors until it finds multiple parents.
In this case, it found a single LM constructor, which had four
parents: two WEAK objects and two ghc-prim:GHC.Types.: objects,
which are the list constructor (:). The WEAK objects we know
about: those are the weak pointers used by the leak-checking code. So
we need to trace the parents of the other objects, which we can do with
another call to findPtr:
(gdb) p findPtr(0x42017e2088,1)
0x42016e9c08 = ghc:Linker.PersistentLinkerState(0x42017e2061, 0x7ffff3c2bc63, 0x42017e208a, 0x7ffff2e9f679, 0x42016e974a, 0x7ffff2e9f679)
-->
0x42016e9728 = THUNK(0x7ffff74790c0, 0x42016e9c41, 0x42016e9c09)
-->
0x42016e9080 = ghc:Linker.PersistentLinkerState(0x42016e9728, 0x7ffff3c2e7bb, 0x7ffff2e9f679, 0x7ffff2e9f679, 0x42016e974a, 0x7ffff2e9f679)
-->
0x4200dbe8a0 = THUNK(0x7ffff7479138, 0x42016e9081, 0x42016e90b9, 0x42016e90d1, 0x42016e90e9)
-->
0x42016e0b00 = MVAR(head=END_TSO_QUEUE, tail=END_TSO_QUEUE, value=0x4200dbe8a0)
-->
0x42016e0828 = base:GHC.MVar.MVar(0x42016e0b00)
-->
0x42016e0500 = MUT_VAR_CLEAN(var=0x42016e0829)
-->
0x4200ec6b80 = base:GHC.STRef.STRef(0x42016e0500)
-->
$2 = void
This time we traced through several objects, until we came to an
STRef, and findPtr found no further parents. Perhaps the next
parent is a CAF (a top-level thunk) which findPtr won’t find because
it only searches the heap. Anyway, in the chain we have two
PersistentLinkerState objects, and some THUNKs - it looks like
perhaps we’re holding onto an old version of the
PersistentLinkerState, which contains the leaking Linkable object.
Let’s pick one THUNK and take a closer look.
(gdb) p4 0x42016e9728
0x42016e9740: 0x42016e9c09
0x42016e9738: 0x42016e9c41
0x42016e9730: 0x0
0x42016e9728: 0x7ffff74790c0 <sorW_info>
The p4 command is just a macro for dumping memory (you can get these
macros from here).
The header of the object is 0x7ffff74790c0 <sorW_info>, which is just a
compiler-generated symbol. How can we find out what code this object
corresponds to? Fortunately, GHC’s new -g option generates DWARF
debugging information which gdb can understand, and because we
compiled GHC itself with -g we can get gdb to tell us what code
this address corresponds to:
(gdb) list *0x7ffff74790c0
0x7ffff74790c0 is in sorW_info (compiler/ghci/Linker.hs:1129).
1124
1125 itbl_env' = filterNameEnv keep_name (itbl_env pls)
1126 closure_env' = filterNameEnv keep_name (closure_env pls)
1127
1128 new_pls = pls { itbl_env = itbl_env',
1129 closure_env = closure_env',
1130 bcos_loaded = remaining_bcos_loaded,
1131 objs_loaded = remaining_objs_loaded }
1132
1133 return new_pls
In this case it told us that the object corresponds to line 1129 of
compiler/ghci/Linker.hs. This is all part of the function
unload_wkr, which is part of the code for unloading compiled
code in GHCi. It looks like we’re on the right track.
Now, -g isn’t perfect - the line it pointed to isn’t actually a
thunk. But it’s close: the line it points to refers to closure_env' which is defined on line 1126, and it is
indeed a thunk. Moreover, we can see that it has a reference to pls,
which is the original PersistentLinkerState before the unloading
operation.
To avoid this leak, we could pattern-match on pls eagerly rather
than doing the lazy record selection (closure_env pls) in the
definition of closure_env'. That’s exactly what I did to fix this
particular leak, as you can see in the patch that fixes
it.
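For illustration, here is the general shape of this kind of leak, using a made-up State type rather than the actual GHC code:

{-# LANGUAGE BangPatterns #-}

data State = State { big :: [Int], small :: Int }

-- Leaky: the new state was meant to drop the old one entirely, but the field
-- 'small' is a lazy selection thunk that captures all of 'old', including the
-- big field we wanted to discard.
shrinkLeaky :: State -> State
shrinkLeaky old = State { big = [], small = small old + 1 }

-- Fixed: force the selection before building the new record, so nothing in
-- the new state refers back to 'old'.
shrinkFixed :: State -> State
shrinkFixed old =
  let !s = small old + 1
  in State { big = [], small = s }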
Fixing one leak isn’t necessarily enough: the data structure might be retained in multiple different ways, and it won’t be garbage collected until all the references are squashed. In total I found
- 7 leaks in GHCi that were collectively responsible for the original leak, and
- A further 10 leaks that only appeared when GHC was compiled without optimisation. (It seems that GHC’s optimiser is pretty good at fixing space leaks by itself)
You might ask how anyone could have found these without undergoing this complicated debugging process. And whether there are more lurking that we haven’t found yet. These are really good questions, and I don’t have a good answer for either. But at least we’re in a better place now:
- The leaks are fixed, and we have a regression test to prevent them being reintroduced.
- If you happen to write a patch that introduces a leak, you’ll know what the patch is, so you have a head start in debugging it.
Could we do better?
Obviously this is all a bit painful and we could definitely build
better tools to make this process easier. Perhaps something based on
heap-view which was recently added to
GHC? I’d love to see someone tackle this.
Hotswapping Haskell
October 17, 2017

This is a guest post by Jon Coens. Jon worked on the Haxl project since the beginning in 2013, and nowadays he works on broadening Haskell use within Facebook.
From developing code through deployment, Facebook needs to move fast. This is especially true for one of our anti-abuse systems that deploys hundreds of code changes every day. Releasing a large application (hundreds of Kloc) that many times a day presents plenty of intriguing challenges. Haskell’s strict type system means we’re able to confidently push new code knowing that we can’t crash the server, but getting those changes out to many thousands of machines as fast as possible requires some ingenuity.
Given the application size and deployment speed constraints:
- Building a new application binary for every change would take too long
- Starting and tearing down millions of heavy processes a day would create undue churn on other infrastructure
- Splitting the service into multiple smaller services would slow down developers.
To overcome these constraints, our solution is to build a shared object file that contains only the set of frequently changing business logic and dynamically load it into our server process. With some clever house-keeping, the server drops old unneeded shared objects to make way for new ones without dropping any requests.
It’s like driving a car down the road, having a new engine fall into your lap, installing it on-the-fly, and dumping the old engine behind you, all while never touching the brakes.
Show Me The Code!
For those who want a demo, look here. Make sure you have GHC 8.2.1 or later, then follow the README for how to configure the projects.
What about…
A Statically built server
The usual way of deploying updates requires building a fully statically-linked binary and shipping that to every machine. This has many benefits, the biggest being a streamlined and well-understood deployment, but it results in long update times due to the size of our large final binary. Each business logic change, no matter how small, needs to re-link the entire binary and be shipped out to all machines. Both binary link time and distribution time are correlated with file size, so the larger the binary, the longer the updates. In our case, the application binary’s size is too large for us to do frequent updates by this method.
GHCi-as-a-service
GHCi’s incremental module reloading is another way of updating code quickly. Mimicking the local development workflow, you could ship code updates to each service, and instruct them to reload as necessary. Continually re-interpreting the code significantly decreases the amount of time to distribute an update. In fact, a previous version of our application (not based on Haskell) worked this way. This approach severely hinders performance, however. Running interpreted code is strictly slower than optimized compiled code, and GHCi can’t currently handle running multiple requests at the same time.
The model of reloading libraries in GHCi closely matches what we want our end behavior to look like. What about loading those libraries into a non-interpreted Haskell binary?
Shipping shared objects for great good
Using the GHCi.Linker API, our update deployment looks roughly as follows:
- Commit a code change onto trunk
- Incrementally build a shared object file containing the frequently-changing business logic
- Ship that file to each machine
- In each process, use GHCi’s dynamic linker to load in the new shared object and lookup a symbol from it (while continuing to serve requests using the previous code)
- If all succeeds, start serving requests using the new code and mark the previous shared object for unloading by the GC
This minimizes the amount of time between making a code change and having it running in an efficient production environment. It only rebuilds the minimum set of code, deploys a much smaller file to each server, and keeps the server running through each update.
Not every module or application can follow this update model as there are some crucial constraints to consider when figuring out what can go into the shared object.
- The symbol API boundaries into and out of the shared object must remain constant
- The main binary cannot persist any reference to code or data originating from the shared object, because that will prevent the GC from unloading the object.
Fortunately, our use-case fits this mold.
Details
We’ll talk about a handful of libraries and example code:
- GHCi.ObjLink - A library provided by GHC
- ghc-hotswap - A library to use
- ghc-hotswap-types - User-written code to define the API
- ghc-hotswap-so - User-written code that lives in the shared object
- ghc-hotswap-demo - User-written application utilizing the above
Loading and extracting from the shared object
Let’s start with bringing in a new shared object, the guts of which can be found in loadNewSO. It makes heavy use of the GHCi.ObjLink library.
We need the name of an exported symbol to lookup inside the shared object (symName) and the file path to where the shared object lives (newSO). With these, we can return an instance of some data that originates from that shared object.
initObjLinker DontRetainCAFs
GHCi’s linker needs to be initialized before use, and fortunately the call is idempotent. “DontRetainCAFs” tells the linker and GC not to retain CAFs (Constant Applicative Forms, i.e. top-level values) in the shared object. GHCi normally retains all CAFs as the user can type an expression that refers to anything at all, but for hot-swapping this would prevent the object from being unloaded as we would have references into the object from the heap-resident CAFs.
loadObj newSO
resolved <- resolveObjs
unless resolved $
...
This maps the shared object into the memory of the main process, brings the shared object’s symbols into GHCi’s symbol table, and ensures any undefined symbols in the SO are present in the main binary. If any of these fail, an exception is thrown.
c_sym <- lookupSymbol symName
Here we ask GHCi’s symbol table whether the given name exists, and get back a pointer to that symbol.
h <- case c_sym of
  Nothing -> throwIO ...
  Just p_sym ->
    bracket (callExport $ castPtrToFunPtr p_sym) freeStablePtr deRefStablePtr
When getting a pointer to the symbol (Just p_sym), a couple things happen. We know that the underlying symbol is a function (as we’ll ensure later), so we cast it to a function pointer. A FunPtr doesn’t do us much good on its own, so use callExport to turn it into a callable Haskell function as well as execute the function. This call is the first thing to run code originating from the shared object. Since our call returns a StablePtr a, we dereference and then free the stable pointer, resulting in our value of type a from the shared object.
We want to query the shared object and get a Haskell value back. The best way to do that safely and without baking in too much low-level knowledge is for the shared object to expose a function using foreign export. The Haskell value must therefore be returned wrapped in a StablePtr, and so we have to get at the value itself using deRefStablePtr, before finally releasing the StablePtr with freeStablePtr.
purgeObj newSO
return h
Assuming everything has gone well, we purge GHCi’s symbol table of all symbols defined from our shared object and then return the value we retrieved. Purging the symbols makes room for the next shared object to come in and resolve successfully without fully unloading the shared object that we’re actively holding references to. We could tell GHCi to unload the shared object at this point, but this would cause the GC to aggressively crawl the entire shared object every single time, which is a lot of unnecessary work. Purging retains the code in the process to make the GC’s work lighter while making room for the next shared object. See Safely Transition Updates for when to unload the shared object.
The project that defines the code for the shared object must be generated in a relocatable fashion. It must be configured with the --enable-library-for-ghci flag, otherwise loadObj and resolveObjs will throw a fit.
Defining the shared object’s API
During compilation, the function names from code turn into quasi-human-readable symbol names. Ensuring you look up the correct symbol name from a shared object can become brittle if you rely on hardcoded munged names. To mitigate this, we define a single data type to house all the symbols we want to expose to the main application, and export a ccall using Haskell’s Foreign library. This guarantees we can export a particular symbol with a name we control. Placing all our data behind a single symbol (that both the shared object and main binary can depend on), we reduce the coupling to only a couple of points.
Let’s look at Types.hs.
data SOHandles = SOHandles
  { someData :: Text
  , someFn :: Int -> IO ()
  } deriving (Generic, NFData)
Here’s our common structure for everything we want to expose out of the shared object. Notice that you can put constants, like someData, as well as full functions to execute, like someFn.
type SOHandleExport = IO (StablePtr SOHandles)
This defines the type for the extraction function the main binary will run to get an instance of the handles from the shared object
foreign import ccall "dynamic"
  callExport :: FunPtr SOHandleExport -> SOHandleExport
Here we invoke Haskell’s FFI to generate a function that calls a function pointer to our export function as an actual Haskell function. The “dynamic” parameter to ccall does exactly this. We saw using this earlier when loading in a shared object.
Next let’s look at code for the shared object itself.
Note that we depend on and import the Types module defined in ghc-hotswap-types.
foreign export ccall "hs_soHandles"
  hsNewSOHandle :: SOHandleExport
This uses the FFI to explicitly export a Haskell function called hsNewSOHandle as a symbol named “hs_soHandles”. This is the function our main binary is going to end up calling, so set its type to our export function.
hsNewSOHandle = newStablePtr SOHandles
  { ...
  }
In our definition of this function, we return a stable pointer to an instance of our data type, which will end up being read by our main application
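For concreteness, a filled-in version might look something like the following. The field values here are hypothetical (the real definition lives in ghc-hotswap-so); it assumes newStablePtr from Foreign.StablePtr, Data.Text imported qualified as Text, and the Types module shown above:

hsNewSOHandle :: SOHandleExport
hsNewSOHandle = newStablePtr SOHandles
  { someData = Text.pack "business logic, build #1"   -- hypothetical value
  , someFn   = \n -> putStrLn ("someFn called with " ++ show n)
  }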
Using these common types, we’ve limited the amount of coupling down to using callExport, exporting the symbol as “hs_soHandles” from the shared object, and can combine these in our usage of loadNewSO.
Safely Transition Updates
With some extra care, we can cleanly transition to new shared objects while minimizing the amount of work the GC needs to do.
Let’s look closer at Hotswap.hs.
registerHotswap uses loadNewSO to load the first shared object and then provides some accessor functions on the data extracted. We save some state associated with the shared object: the path to the object, the value we extract, as well as a lock to keep track of usage.
The unWrap function reads the state for the latest shared object and runs a user-supplied action on the extracted value. Wrapping the user-function in the read lock ensures we won’t accidentally try to remove the underlying code while actively using it. Without this, we run the risk of creating unnecessary stress on the GC.
The updater function (updateState) assumes we already have one shared object mapped into memory with its symbol table purged.
newVal <- force <$> loadNewSO dynamicCall symbolName nextPath
We first attempt to load in the next shared object located at nextPath, using the same export call and symbol name as before. At this point we actually have two shared objects mapped into memory at the same time; one being the old object that’s actively being used and the other being the new object with our desired updates.
Next we build some state associated with this object, and swap our state MVar.
oldState <- swapMVar mvar newState
After this call, any user that uses unWrap will get the new version of code that was just loaded up. This is when we would observe the update being “live” in our application.
L.withWrite (lock oldState) $
  unloadObj (path oldState)
Here we finally ask the GC to unload the old object. Once the write lock is obtained, no readers are present, so nothing can be running code from this old shared object (unless one is nefariously holding onto some state). Calling unloadObj doesn’t immediately unmap the object, as it only informs the GC that the object is valid to be dumped. The next major GC ensures that no code is referencing anything from that shared object and will fully dump it out.
At this point we now have only the next shared object mapped in memory and being used in the main application.
Shortcomings / Future work
Beware sticky shared objects
The trickiest problem we’ve come across has been when the GC doesn’t want to drop old shared objects. Eventually so many shared objects are linked at once that the process runs out of space to load in a new object, stalling all updates until the process is restarted. We’ll call this problem shared object retention, or just retention.
An object is unloaded when (a) we’ve called unloadObj on it, and (b) the GC determines that there are no references from heap data into the object. Retention can therefore only happen if we have some persistent data that lives across a shared object swap. Obviously it’s better if you can avoid this, but sometimes it’s necessary: e.g. in Sigma the persistent data consists of the pre-initialized data sources that we use with the Haxl monad, amongst other things. The first step in avoiding retention is to be very clear about what this data is, and to fully audit it.
To get retention, the persistent data must be mutable in some way (e.g. contain an IORef), and for retention to occur we must write something into the persistent IORef during the course of executing code from the shared object. The data we wrote into the IORef can end up referring to the shared object in two ways:
- If it contains a thunk or a function, these will refer to code in the shared object.
- If it contains data where the datatype is defined in the shared object (rather than in the packages that the object depends on, which are statically linked), then again we have a reference from the heap-resident data into the shared object, which will cause retention.
So to avoid retention while having mutable persistent data, the rules of thumb are:
- rnf everything before writing into the persistent IORef, and ensure that any manual NFData instances don’t lie (see the sketch after this list).
- Don’t store values that contain functions.
- Don’t store values that use datatypes defined in the shared object.
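A sketch of the first rule, using the deepseq package (the helper name here is made up):

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.IORef (IORef, atomicWriteIORef)

-- Fully evaluate a value before it goes into a persistent IORef, so that no
-- thunk left inside it can pin code or data from the old shared object.
writePersistent :: NFData a => IORef a -> a -> IO ()
writePersistent ref x = do
  x' <- evaluate (force x)
  atomicWriteIORef ref x'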
Debugging retention problems can be really hard, involving attaching to the process with gdb and then following the offending references from the heap. We hope that the new DWARF support in GHC 8.2 will be able to help here.
Linker addressable memory is limited
Calling the built file a shared object is a bit of a misnomer, as it isn’t compiled with -fPIC and is actually just an object file. Files like these can only be loaded into the lower 2GB of memory (x86_64 small memory model uses 32 bit relative jumps), which can become restrictive when your object file gets large. Since the update mechanism relies on having multiple objects in memory at the same time, fragmentation of the mappable address space can become a problem. We’ve already made a few improvements to the GHCi linker to reduce the impact of these problems, but we’re running out of options.
Ideally we’d switch to using true shared objects (built with -fPIC) to remove this limitation. It requires some work to get there, though: GHC’s dynamic linking support is designed to support a model where each package is in a separate shared library, whereas we want a mixed static/dynamic model.
Asynchronous Exceptions in Practice
January 24, 2017

Asynchronous exceptions are a controversial feature of Haskell. You
can throw an exception to another thread, at any time; all you need is
its ThreadId:
throwTo :: Exception e => ThreadId -> e -> IO ()
The other thread will receive the exception immediately, whatever it is doing. So you have to be ready for an asynchronous exception to fire at any point in your code. Isn’t that a scary thought?
It’s an old idea - in fact, when we originally added asynchronous exceptions to Haskell (and wrote a paper about it), it was shortly after Java had removed the equivalent feature, because it was impossible to program with.
So how do we get away with it in Haskell? I wrote a little about the
rationale in my
book. Basically it comes down to this: if we want to be able to
interrupt purely functional code, asynchronous exceptions are the only
way, because polling would be a side-effect. Therefore the remaining
problem is how to make asynchronous exceptions safe for the impure
parts of our code. Haskell provides functionality for disabling
asynchronous exceptions during critical sections (mask) and
abstractions based around it that can be used for safe resource
acquisition (bracket).
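As a small example of that idiom, here is a sketch of a helper that runs a body alongside a worker thread (simplified; the async package’s withAsync is the robust version of this pattern):

import Control.Concurrent (forkIO, killThread)
import Control.Exception (bracket)

-- The worker thread is killed even if an asynchronous exception arrives while
-- the body is running; bracket runs the acquire action under mask, so the
-- ThreadId can't be lost to an exception between forking and installing the
-- cleanup handler.
withWorker :: IO () -> IO a -> IO a
withWorker worker body = bracket (forkIO worker) killThread (const body)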
At Facebook I’ve had the opportunity to work with asynchronous exceptions in a large-scale real-world setting, and here’s what I’ve learned:
- They’re really useful, particularly for catching bugs that cause excessive use of resources.
- In the vast majority of our Haskell codebase we don’t need to worry about them at all. The documentation that we give to our users who write Haskell code to run on our platform doesn’t mention asynchronous exceptions.
- But some parts of the code can be really hard to get right. Code in the IO monad dealing with multithreading or talking to foreign libraries, for example, has to care about cleaning up resources and recovering safely in the event of an asynchronous exception.
Let me take each of those points in turn and elaborate.
Where asynchronous exceptions are useful
The motivating example often used is timeouts, for example of
connections in a network service. But this example is not all that
convincing: in a network server we’re probably writing code that’s
mostly in the IO monad, we know the places where we’re blocking, and
we could use other mechanisms to implement timeouts that would be less
“dangerous” but almost as reliable as asynchronous exceptions.
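For reference, here is roughly what a timeout built on asynchronous exceptions looks like. This is a toy sketch of my own: the real System.Timeout.timeout uses a unique exception and cancels the watchdog when the action finishes early.

import Control.Concurrent (forkIO, myThreadId, threadDelay, throwTo)
import Control.Exception (Exception)

data Timeout = Timeout deriving Show
instance Exception Timeout

-- Fork a watchdog that throws Timeout to the calling thread after n
-- microseconds; the action is interrupted wherever it happens to be.
crudeTimeout :: Int -> IO a -> IO a
crudeTimeout n action = do
  me <- myThreadId
  _ <- forkIO (threadDelay n >> throwTo me Timeout)
  action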
In Sigma, we use asynchronous exceptions to prevent huge requests from degrading the performance of our server for other clients.
In a complex system, it’s highly likely that some requests will end up using an excessive amount of resources. Perhaps there’s a bug in the code that sometimes causes it to use a lot of CPU (or even an infinite loop), or perhaps the code fetches some data to operate on, and the data ends up being unexpectedly large. In principle we could find all these cases and fix them, but in practice, large systems can have surprising emergent behaviour and we can’t guarantee to find all the bugs outside production.
Beware Elephants
So sometimes a request turns out to be an elephant, and we have to deal with it. If we do nothing, the elephant will trample around, slowing everything down, or maxing out some resource like memory or network bandwidth, which can cause failures for other requests running on the system.
One way or another something is going to die. We would rather it was the elephant, and not the many other requests currently running on the same machine. Stopping the elephant minimises the destruction. The elephant’s owner will then fix their problem, and we’ve mitigated a bug with minimal disruption.
Our elephant gun is called Allocation Limits. The Haskell runtime
keeps track of how much memory each Haskell thread has allocated in
total, and if that total exceeds the limit we set, the thread receives
an asynchronous exception, namely AllocationLimitExceeded. The user
code running on our platform is not permitted to catch this exception,
instead the server catches it, logs some data to aid debugging, and
sends an error back to the client that initiated the request.
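The per-thread allocation-limit machinery is exposed in base, so you can experiment with it outside our setup; here is a minimal sketch (the budget and the handler are made up, and the real server wiring is more involved):

import Control.Concurrent (forkIO)
import Control.Exception (AllocationLimitExceeded (..), handle)
import GHC.Conc (enableAllocationLimit, setAllocationCounter)

-- Run a request on its own thread with a ~100MB allocation budget; exceeding
-- it delivers AllocationLimitExceeded to that thread as an async exception.
runLimited :: IO () -> IO ()
runLimited request = do
  _ <- forkIO $ do
    setAllocationCounter (100 * 1000 * 1000)
    enableAllocationLimit
    handle (\AllocationLimitExceeded -> putStrLn "allocation limit exceeded")
           request
  return ()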
We’re using “memory allocated” as a proxy for “work done”. Most computation in Haskell allocates memory, so this is a more predictable measure than wall-clock time. It’s a fairly crude way to identify excessively large requests, but it works well for us.
Here’s what happened when we enabled allocation limits early on during Sigma’s development. The graph tracks the maximum amount of live memory across different groups of machines. It turns out there were a very small fraction of requests consuming a huge amount of resources, and enabling allocation limits squashed them nicely:
[figure: maximum live memory across groups of machines, dropping after allocation limits were enabled]
Allocation limits have helped protect us from disaster on several occasions. One time, an infinite loop made its way into production; the result was that our monitoring showed an increase in requests hitting the allocation limit. The data being logged allowed it to be narrowed down to one particular type of request; we were quickly able to identify the change that caused the problem, undo it, and notify the owner. Nobody else noticed.
In the vast majority of code, we don’t need to worry about asynchronous exceptions
Because you don’t have to poll for an asynchronous exception, they work almost everywhere. All pure code works with asynchronous exceptions without change.
In our platform, clients write code on top of the Haxl framework in which I/O is provided only via a fixed set of APIs that we control, so we can guarantee that those APIs are safe, and therefore all of the client code is safe by virtue of abstraction.
Some parts of the code can be really hard to get right
That leaves the parts of the code that implement the I/O libraries and other lower level functionality. These are the places where we have to care about asynchronous exceptions: if an async exception fires when we have just opened a connection to a remote server, we have to close it again and free all the resources associated with the connection, for example.
In principle, you can follow a few guidelines to be safe.
- Use bracket when allocating any kind of resource that needs to be explicitly released. This is not specific to asynchronous exceptions: coping with ordinary synchronous exceptions requires a good resource-allocation discipline, so your code should be using bracket anyway.
- Use the async package, which avoids some of the common problems, such as making sure that you fork a thread inside mask to avoid asynchronous exceptions leaking (see the sketch after this list).
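Here is a sketch of that async-package idiom (this is essentially what Control.Concurrent.Async’s concurrently does for you):

import Control.Concurrent.Async (withAsync, wait)

-- Both children are forked with the right masking, and each is cancelled
-- automatically if the body throws or receives an asynchronous exception.
fetchBoth :: IO a -> IO b -> IO (a, b)
fetchBoth ioA ioB =
  withAsync ioA $ \a ->
  withAsync ioB $ \b -> do
    ra <- wait a
    rb <- wait b
    return (ra, rb)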
Nevertheless it’s still possible to go wrong. Here are some ways:
If you want asynchronous exceptions to work, be careful you don’t accidentally run inside mask or uninterruptibleMask. We’ve seen examples of third-party libraries that run callbacks inside mask (e.g. the hinotify library until recently). Use getMaskingState to assert that you’re not masked when you don’t want to be (a small helper is sketched below).
Be careful that those asynchronous exceptions don’t escape from a thread if the thread is created by calling a foreign export, because uncaught exceptions will terminate the whole process. Unlike when using async, a foreign export can’t be created inside mask. (This is something that should be fixed in GHC, really.)
Catching all exceptions seems like a good idea when you want to be bullet-proof, but if you catch and discard the ThreadKilled exception it becomes really hard to actually kill that thread.
If you’re coordinating with some foreign code and the Haskell code gets an asynchronous exception, make sure that the foreign code will also clean up properly.
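For the masking-state point above, a small assertion helper is easy to write; this one is hypothetical but uses only functions from Control.Exception:

import Control.Exception (MaskingState (Unmasked), getMaskingState)
import Control.Monad (unless)

-- Fail loudly if we are unexpectedly running masked, e.g. because some
-- library invoked our callback inside 'mask'.
assertUnmasked :: String -> IO ()
assertUnmasked site = do
  st <- getMaskingState
  unless (st == Unmasked) $
    error (site ++ ": expected Unmasked, but masking state is " ++ show st)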
The type system is of no help at all with finding these bugs; the only way you can find them is with careful eyeballs, good abstractions, lots of testing, and plenty of assertions.
It’s worth it
My claim is, even though some of the low-level code can be hard to get right, the benefits are worth it.
Asynchronous exceptions generalise several exceptional conditions that relate to resource consumption: stack overflow, timeouts, allocation limits, and heap overflow exceptions. We only have to make our code asynchronous-exception-safe once, and it’ll work with all these different kinds of errors. What’s more, being able to terminate threads with confidence that they will clean up promptly and exit is really useful. (It would be nice to do a comparison with Erlang here, but not having written a lot of this kind of code in Erlang I can’t speak with any authority.)
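For instance, System.Timeout.timeout is built on the same mechanism: it interrupts the action with an asynchronous exception, so code that is already exception-safe needs nothing extra to support timeouts. A tiny sketch, with an arbitrary two-second budget:

import System.Timeout (timeout)

-- Run an action with a 2-second budget (the argument is in microseconds);
-- Nothing means it was interrupted. Any bracketed resources inside the
-- action are released on the way out.
withTwoSecondBudget :: IO a -> IO (Maybe a)
withTwoSecondBudget = timeout 2000000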
In a high-volume network service, having a guarantee that a class of runaway requests will be caught and killed off can help reliability, and give you breathing room when things go wrong.
Haskell in the Datacentre
December 8, 2016At Facebook we run Haskell on thousands of servers, together handling over a million requests per second. Obviously we’d like to make the most efficient use of hardware and get the most throughput per server that we can. So how do you tune a Haskell-based server to run well?
Over the past few months we’ve been tuning our server to squeeze out as much performance as we can per machine, and this has involved changes throughout the stack. In this post I’ll tell you about some changes we made to GHC’s runtime scheduler.
Summary
We made one primary change: GHC’s runtime is based around an M:N threading model which is designed to map a large number (M) of lightweight Haskell threads onto a small number (N) of heavyweight OS threads. In our application M is fixed and not all that big: we can max out a server’s resources when M is about 3-4x the number of cores, and meanwhile setting N to the number of cores wasn’t enough to let us use all the CPU (I’ll explain why shortly).
To cut to the chase, we ended up increasing N to be the same as M (or close to it), and this bought us an extra 10-20% throughput per machine. It wasn’t as simple as just setting some command-line options, because GHC’s garbage collector is designed to run with N equal to the number of cores, so I had to make some changes to the way GHC schedules things to make this work.
All these improvements are upstream in GHC, and they’ll be available in GHC 8.2.1, due early 2017.
Background: Capabilities
When the GHC runtime starts, it creates a number of capabilities
(also sometimes called HEC, for Haskell Execution Context). The
number of capabilities is determined by the -N flag when you start
the Haskell program, e.g. prog +RTS -N4 would run prog with 4
capabilities.
A capability is the ability to run Haskell code. It consists of an allocation area (also called nursery) for allocating memory, a queue of lightweight Haskell threads to run, and one or more OS threads (called workers) that will run the Haskell code. Each capability can run a single Haskell thread at a time; if the Haskell thread blocks, the next Haskell thread in the queue runs, and so on.
Typically we choose the number of capabilities to be equal to the number of physical cores on the machine. This makes sense: there is no advantage in trying to run more Haskell threads simultaneously than we have physical cores.
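Capabilities can also be inspected and changed from Haskell at runtime. As a small sketch, assuming you would rather compute the value at startup than hard-code a -N flag:

import Control.Concurrent (getNumCapabilities, setNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Match the number of capabilities to the number of processors the OS
-- reports (which may include hyperthreads), instead of relying on -N.
matchCapabilitiesToCores :: IO ()
matchCapabilitiesToCores = do
  cores <- getNumProcessors
  caps  <- getNumCapabilities
  putStrLn ("processors: " ++ show cores ++ ", capabilities: " ++ show caps)
  setNumCapabilities cores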
How our server maps onto this
Our system is based on the C++ Thrift server, which provides a fixed set of worker threads that pull requests from a queue and execute them. We choose the number of worker threads to be high enough that we can fully utilize the server, but not so high that we create too much contention and increase latency under maximum load.
Each worker thread calls into Haskell via a foreign export to do the
actual work. The GHC runtime then chooses a capability to run the
call. It normally picks an idle capability, and the call executes
immediately. If there are no idle capabilities, the call blocks on
the queue of a capability until the capability yields control to it.
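On the Haskell side such an entry point is just a foreign export. A stripped-down sketch, where the module, function name and request type are made up for illustration:

{-# LANGUAGE ForeignFunctionInterface #-}
module RequestEntry (handleRequest) where

import Foreign.C.Types (CInt (..))

-- Each C++ worker thread calls this exported function; the GHC runtime
-- picks a capability (ideally an idle one) to run the call on.
foreign export ccall handleRequest :: CInt -> IO CInt

handleRequest :: CInt -> IO CInt
handleRequest requestId = do
  -- ... do the actual work for the request here ...
  return requestId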
The problem
At high load, even though we have enough threads to fully utilize the CPU cores, the intermediate layer of scheduling where GHC assigns threads to capabilities means that we sometimes have threads idle that could be running. Sometimes there are multiple runnable workers on one capability while other capabilities are idle, and the runtime takes a little while to load-balance during which time we’re not using all the available CPU capacity.
Meanwhile the kernel is doing its own scheduling, trying to map those OS threads onto CPUs. Obviously the kernel has a rather more sophisticated scheduler than GHC and could do a better job of mapping those threads onto the available cores, but we aren’t letting it. In this scenario, the extra layer of scheduling in GHC is just a drag on performance.
First up, a bug in the load-balancer.
While investigating this I found a bug in the way GHC’s load-balancing worked: it could cause a large number of spurious wakeups of other capabilities while load-balancing. Fixing this was worth a few percent right away, but I had my sights set on larger gains.
Couldn’t we just increase the number of capabilities?
Well yes, and of course we tried just bumping up the -N value, but
increasing -N beyond the number of cores just tends to increase CPU
usage without increasing throughput.
Why? Well, the problem is the garbage collector. The GC keeps all its threads running trying to steal work from each other, and when we have more threads than we have real cores, the spinning threads are slowing down the threads doing the actual work.
Increasing the number of capabilities without slowing down GC
What we’d like to do is to have a larger set of mutator threads, but only use a subset of those when it’s time to GC. That’s exactly what this new flag does:
+RTS -qn<threads>
For example, on a 24-core machine you might use +RTS -N48 -qn24 to
have 48 mutator threads, but only 24 threads during GC. This is great
for using hyperthreads too, because hyperthreads work well for the
mutator but not for the GC.
Which threads does the runtime choose to do the GC? The scheduler has a heuristic which looks at which capabilities are currently inactive and picks those to be the ones that sit out the GC, to avoid having to synchronise with threads that are currently asleep.
+RTS -qn will now be turned on by default!
This is a slight digression, but it turns out that setting +RTS -qn
to the number of CPU cores is always a good idea if -N is too large.
So the runtime will be doing
this by default from now on. If -N accidentally gets set too
large, performance won’t drop quite so badly as it did with GHC 8.0
and earlier.
Capability affinity
Now we can safely increase the number of capabilities well beyond the
number of real cores, provided we set a smaller number of GC threads
with +RTS -qn.
The final step that we took in Sigma is to map our server threads 1:1 with capabilities. When the C++ server thread calls into Haskell, it immediately gets a capability, there’s never any blocking, and the GHC runtime doesn’t need to do any load-balancing.
How is this done? There’s a new C API exposed by the RTS:
void rts_setInCallCapability (int preferred_capability, int affinity);
In each thread you call this to map that thread to a particular capability. For example you might call it like this:
static std::atomic<int> counter;
...
// Each worker thread claims the next capability index; passing 0 for the
// affinity argument means the thread is not pinned to a CPU core.
rts_setInCallCapability(counter.fetch_add(1), 0);
And ensure that you call this once per thread. The affinity
argument is for binding a thread to a CPU core, which might be useful
if you’re also using GHC’s affinity setting (+RTS -qa). In our case
we haven’t found it necessary.
Future
You might be thinking, but isn’t the great thing about Haskell that we have lightweight threads? Yes, absolutely. We do make use of lightweight threads in our system, but the main server threads that we inherit from the C++ Thrift server are heavyweight OS threads.
Fortunately in our case we can fully load the system with 3-4 heavyweight threads per core, and this solution works nicely with the constraints of our platform. But if the ratio of I/O waiting to CPU work in our workload increased, we would need more threads per core to keep the CPU busy, and the balance tips towards wanting lightweight threads. Furthermore, using lightweight threads would make the system more resilient to increases in latency from downstream services.
In the future we’ll probably move to lightweight threads, but in the meantime these changes to scheduling mean that we can squeeze all the available throughput from the existing architecture.
