Indexing Hackage: Glean vs. hiedb
May 22, 2025

I thought it might be fun to try to use Glean to index as much of Hackage as I could, and then do some rough comparisons against hiedb and also play around to see what interesting queries we could run against a database of all the code in Hackage.
This project was mostly just for fun: Glean is not going to replace
hiedb any time soon, for reasons that will become clear. Neither are
we ready (yet) to build an HLS plugin that can use Glean, but
hopefully this at least demonstrates that such a thing should be
possible, and Glean might offer some advantages over hiedb in
performance and flexibility.
A bit of background:
Glean is a code-indexing system that we developed at Meta. It’s used internally at Meta for a wide range of use cases, including code browsing, documentation generation and code analysis. You can read about the ways in which Glean is used at Meta in Indexing Code At Scale with Glean.
hiedb is a code-indexing system for Haskell. It takes the .hie files that GHC produces when given the option -fwrite-ide-info and writes the information to a SQLite database in various tables. The idea is that putting the information in a DB allows certain operations that an IDE needs to do, such as go-to-definition, to be fast.
You can think of Glean as a general-purpose system that does the same
job as hiedb, but for multiple languages and with a more flexible
data model. The open-source version of Glean comes with indexers for
ten languages or
so, and moreover Glean supports SCIP which has
indexers for various languages available from SourceGraph.
Since a hiedb is just a SQLite DB with a few tables, if you want you
can query it directly using SQL. However, most users will access the
data through either the command-line hiedb tool or through the API,
which provide the higher-level operations such as go-to-definition and
find-references. Glean has a similar setup: you can make raw queries
using Glean’s query language (Angle) using the
Glean shell or the command-line tool, while the higher-level
operations that know about symbols and references are provided by a
separate system called Glass which also has a command-line tool and
API. In Glean the raw data is language-specific, while the Glass
interface provides a language-agnostic view of the data in a way
that’s useful for tools that need to navigate or search code.
An ulterior motive
In part all of this was an excuse to rewrite Glean’s Haskell
indexer. We built a Haskell indexer a while ago but it’s pretty
limited in what information it stores, only capturing enough
information to do go-to-definition and find-references and only for a
subset of identifiers. Furthermore the old indexer works by first
producing a hiedb and consuming that, which is both unnecessary and
limits the information we can collect. By processing the .hie files
directly we have access to richer information, and we don’t have the
intermediate step of creating the hiedb which can be slow.
The rest of this post
The rest of the post is organised as follows, feel free to jump around:
- Performance: a few results comparing hiedb with Glean on an index of all of Hackage.
- Queries: a couple of examples of queries we can do with a Glean index of Hackage: searching by name, and finding dead code.
- Apparatus: more details on how I set everything up and how it all works.
- What’s next: some thoughts on what we still need to add to the indexer.
Performance
All of this was performed on a build of 2900+ packages from Hackage; for more details see Building all of Hackage below.
Indexing performance
I used this hiedb command:
hiedb index -D /tmp/hiedb . --skip-types
I’m using --skip-types because at the time of writing I haven’t
implemented type indexing in Glean’s Haskell indexer, so this should
hopefully give a more realistic comparison.
This was the Glean command:
glean --service localhost:1234 \
index haskell-hie --db stackage/0 \
--hie-indexer $(cabal list-bin hie-indexer) \
~/code/stackage/dist-newstyle/build/x86_64-linux/ghc-9.4.7 \
--src '$PACKAGE'
Time to index:
- hiedb: 1021s
- Glean: 470s
I should note that in the case of Glean the only parallelism is
between the indexer and the server that is writing to the DB. We
didn’t try to index multiple .hie files in parallel, although that
would be fairly trivial to do. I suspect hiedb is also
single-threaded just going by the CPU load during indexing.
Size of the resulting DB
- hiedb: 5.2GB
- Glean: 0.8GB
It’s quite possible that hiedb is simply storing more information, but Glean does have a rather efficient storage system based on RocksDB.
Performance of find-references
Let’s look up all the references of Data.Aeson.encode:
hiedb -D /tmp/hiedb name-refs encode Data.Aeson
This is the query using Glass:
cabal run glass-democlient -- --service localhost:12345 \
references stackage/hs/aeson/Data/Aeson/var/encode
This is the raw query using Glean:
glean --service localhost:1234 --db stackage/0 \
'{ Refs.file, Refs.uses[..] } where Refs : hs.NameRefs; Refs.target.occ.name = "encode"; Refs.target.mod.name = "Data.Aeson"'
- hiedb: 2.3s
- Glean (via Glass): 0.39s
- Glean (raw query): 0.03s
(side note: hiedb found 416 references while Glean found 415. I
haven’t yet checked where this discrepancy comes from.)
But these results don’t really tell the whole story.
In the case of hiedb, name-refs does a full table scan so it’s
going to take time proportional to the number of refs in the DB. Glean
meanwhile has indexed the references by name, so it can serve this
query very efficiently. The actual query takes a few milliseconds, the
main overhead is encoding and decoding the results.
The reason the Glass query takes longer than the raw Glean query is because Glass also fetches additional information about each reference, so it performs a lot more queries.
We can also do the raw hiedb query using the sqlite shell:
sqlite> select count(*) from refs where occ = "v:encode" AND mod = "Data.Aeson";
417
Run Time: real 2.038 user 1.213905 sys 0.823001
Of course hiedb could index the refs table to make this query much
faster, but it’s interesting to note that Glean has already done that
and it was still quicker to index and produced a smaller DB.
Performance of find-definition
Let’s find the definition of Data.Aeson.encode, first with hiedb:
$ hiedb -D /tmp/hiedb name-def encode Data.Aeson
Data.Aeson:181:1-181:7
Now with Glass:
$ cabal run glass-democlient -- --service localhost:12345 \
describe stackage/hs/aeson/Data/Aeson/var/encode
stackage@aeson-2.1.2.1/src/Data/Aeson.hs:181:1-181:47
(worth noting that hiedb is giving the span of the identifier only,
while Glass is giving the span of the whole definition. This is just a
different choice; the .hie file contains both.)
And the raw query using Glean:
$ glean --service localhost:1234 query --db stackage/0 --recursive \
'{ Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N' | jq
{
"id": 18328391,
"key": {
"tuplefield0": {
"id": 9781189,
"key": "aeson-2.1.2.1/src/Data/Aeson.hs"
},
"tuplefield1": {
"start": 4136,
"length": 46
}
}
}
Times:
- hiedb: 0.18s
- Glean (via Glass): 0.05s
- Glean (raw query): 0.01s
In fact there’s a bit of overhead when using the Glean CLI; we can get a better picture of the real query time using the shell:
stackage> { Loc.file, Loc.span } where Loc : hs.DeclarationLocation; N : hs.Name; N.occ.name = "encode"; N.mod.name = "Data.Aeson"; Loc.name = N
{
"id": 18328391,
"key": {
"tuplefield0": { "id": 9781189, "key": "aeson-2.1.2.1/src/Data/Aeson.hs" },
"tuplefield1": { "start": 4136, "length": 46 }
}
}
1 results, 2 facts, 0.89ms, 696176 bytes, 2435 compiled bytes
The query itself takes less than 1ms.
Again, the issue with hiedb is that its data is not indexed in a way
that makes this query efficient: the defs table is indexed by the
pair (hieFile,occ) not occ alone. Interestingly, when the module
is known it ought to be possible to do a more efficient query with
hiedb by first looking up the hieFile and then using that to query
defs.
What other queries can we do with Glean?
I’ll look at a couple of examples here, but really the possibilities
are endless. We can collect whatever data we like from the .hie
file, and design the schema around whatever efficient queries we want
to support.
Search by case-insensitive prefix
Let’s search for all identifiers that start with the case-insensitive
prefix "withasync":
$ glass-democlient --service localhost:12345 \
search stackage/withasync -i | wc -l
55
In less than 0.1 seconds we find 55 such identifiers in Hackage. (the
output isn’t very readable so I didn’t include it here, but for
example this finds results not just in async but in a bunch of
packages that wrap async too).
Case-insensitive prefix search is supported by an index that Glean produces when the DB is created. It works in the same way as efficient find-references, more details on that below.
Why only prefix and not suffix or infix? What about fuzzy search? We could certainly provide a suffix search too; infix gets more tricky and it’s not clear that Glean is the best tool to use for infix or fuzzy text search: there are better data representations for that kind of thing. Still, case-insensitive prefix search is a useful thing to have.
Could we support Hoogle using Glean? Absolutely. That said, Hoogle doesn’t seem too slow. Also we need to index types in Glean before it could be used for type search.
Identify dead code
Dead code is, by definition, code that isn’t used anywhere. We have a handy way to find that: any identifier with no references isn’t used. But it’s not quite that simple: we want to ignore references in imports and exports, and from the type signature.
Admittedly finding unreferenced code within Hackage isn’t all that useful, because the libraries in Hackage are consumed by end-user code that we haven’t indexed so we can’t see all the references. But you could index your own project using Glean and use it to find dead code. In fact, I did that for Glean itself and identified one entire module that was dead, amongst a handful of other dead things.
Here’s a query to find dead code:
N where
N = hs.Name _;
N.sort.external?;
hs.ModuleSource { mod = N.mod, file = F };
!(
hs.NameRefs { target = N, file = RefFile, uses = R };
RefFile != F;
coderef = (R[..]).kind
)
Without going into all the details, here’s roughly how it works:
- N = hs.Name _; declares N to be a fact of hs.Name
- N.sort.external?; requires N to be external (i.e. exported), as opposed to a local variable
- hs.ModuleSource { mod = N.mod, file = F }; finds the file F corresponding to this name’s module
- The last part is checking to see that there are no references to this name that are (a) in a different file and (b) are in code, i.e. not import/export references. Restricting to other files isn’t exactly what we want, but it’s enough to exclude references from the type signature. Ideally we would be able to identify those more precisely (that’s on the TODO list).
You can try this on Hackage and it will find a lot of stuff. It might
be useful to focus on particular modules to find things that aren’t
used anywhere, for example I was interested in which identifiers in
Control.Concurrent.Async aren’t used:
N where
N = hs.Name _;
N.mod.name = "Control.Concurrent.Async";
N.mod.unit = "async-2.2.4-inplace";
N.sort.external?;
hs.ModuleSource { mod = N.mod, file = F };
!(
hs.NameRefs { target = N, file = RefFile, uses = R };
RefFile != F;
coderef = (R[..]).kind
)
This finds 21 identifiers, which I can use to decide what to deprecate!
Apparatus
Building all of Hackage
The goal was to build as much of Hackage as possible and then to index
it using both hiedb and Glean, and see how they differ.
To avoid problems with dependency resolution, I used a Stackage LTS snapshot of package versions. Using LTS-21.21 and GHC 9.4.7, I was able to build 2922 packages. About 50 failed for some reason or other.
I used this cabal.project file:
packages: */*.cabal
import: https://www.stackage.org/lts-21.21/cabal.config
package *
  ghc-options: -fwrite-ide-info
  tests: False
  benchmarks: False
allow-newer: *
And did a large cabal get to fetch all the packages in LTS-21.21.
Then
cabal build all --keep-going
After a few retries to install any required RPMs to get the dependency resolution phase to pass, and to delete a few packages that weren’t going to configure successfully, I went away for a few hours to let the build complete.
It’s entirely possible there’s a better way to do this that I don’t know about - please let me know!
Building Glean
The Haskell indexer I’m using is in this pull request which at the time of writing isn’t merged yet. (Since I’ve left Meta I’m just a regular open-source contributor and have to wait for my PRs to be merged just like everyone else!).
Admittedly Glean is not the easiest thing in the world to build, mainly because it has a couple of troublesome dependencies: folly (Meta’s library of highly-optimised C++ utilities) and RocksDB. Glean depends on a very up to date version of these libraries so we can’t use any distro packaged versions.
Full instructions for building Glean are here but roughly it goes like this on Linux:
- Install a bunch of dependencies with apt or yum
- Build the C++ dependencies with ./install-deps.sh and set some env vars
- make
The Makefile is needed because there are some codegen steps that
would be awkward to incorporate into the Cabal setup. After the first
make you can usually just switch to cabal for rebuilding stuff
unless you change something (e.g. a schema) that requires re-running
the codegen.
Running Glean
I’ve done everything here with a running Glean server, which was started like this:
cabal run exe:glean-server -- \
--db-root /tmp/db \
--port 1234 \
--schema glean/schema/source
While it’s possible to run Glean queries directly on the DB without a server, running a server is the normal way because it avoids the latency from opening the DB each time, and it keeps an in-memory cache which significantly speeds up repeated queries.
The examples that use Glass were done using a running Glass server, started like this:
cabal run glass-server -- --service localhost:1234 --port 12345
How does it work?
The interesting part of the Haskell indexer is the schema in hs.angle. Every language that Glean indexes needs a schema, which describes the data that the indexer will store in the DB. Unlike an SQL schema, a Glean schema looks more like a set of datatype declarations, and it really does correspond to a set of (code-generated) types that you can work with when programmatically writing data, making queries, or inspecting results. For more about Glean schemas, see the documentation.
Being able to design your own schema means that you can design
something that is a close match for the requirements of the language
you’re indexing. In our Glean schema for Haskell, we use a Name,
OccName, and Module structure that’s similar to the one GHC uses
internally and is stored in the .hie files.
The indexer
itself
just reads the .hie files and produces Glean data using datatypes
that are generated from the schema. For example, here’s a fragment of
the indexer that produces Module facts, which contain a ModuleName
and a UnitName:
mkModule :: Glean.NewFact m => GHC.Module -> m Hs.Module
mkModule mod = do
  modname <- Glean.makeFact @Hs.ModuleName $
    fsToText (GHC.moduleNameFS (GHC.moduleName mod))
  unitname <- Glean.makeFact @Hs.UnitName $
    fsToText (unitFS (GHC.moduleUnit mod))
  Glean.makeFact @Hs.Module $
    Hs.Module_key modname unitname

Also interesting is how we support fast find-references. This is done using a stored derived predicate in the schema:
predicate NameRefs:
{
target: Name,
file: src.File,
uses: [src.ByteSpan]
} stored {Name, File, Uses} where
FileXRefs {file = File, refs = Refs};
{name = Name, spans = Uses} = Refs[..];
Here NameRefs is a predicate—which you can think of as a datatype,
or a table in SQL—defined in terms of another predicate,
FileXRefs. The facts of the predicate NameRefs (rows of the table)
are derived automatically using this definition when the DB is
created. If you’re familiar with SQL, a stored derived predicate in
Glean is rather like a materialized view in SQL.
What’s next?
As I mentioned earlier, the indexer doesn’t yet index types, so that would be an obvious next step. There are a handful of weird corner cases that aren’t handled correctly, particularly around record selectors, and it would be good to iron those out.
Longer term ideally the Glean data would be rich enough to produce the
Haddock docs. In fact Meta’s internal code browser does produce
documentation on the fly from Glean data for some languages - Hack and
C++ in particular. Doing it for Haskell is a bit tricky because while
I believe the .hie file does contain enough information to do this,
it’s not easy to reconstruct the full ASTs for declarations. Doing it
by running the compiler—perhaps using the Haddock API—would be
an option, but that involves a deeper integration with Cabal so it’s
somewhat more awkward to go that route.
Could HLS use Glean? Perhaps it would be useful to have a full Hackage index to be able to go-to-definition from library references? As a plugin this might make sense, but there are a lot of things to fix and polish before it’s really practical.
Longer term should we be thinking about replacing hiedb with Glean? Again, we’re some way off from that. The issue of incremental updates is an interesting one - Glean does support incremental indexing but so far it’s been aimed at speeding up whole-repository indexing rather than supporting IDE features.
Rethinking Static Reference Tables in GHC
June 22, 2018

It seems rare these days to be able to make an improvement that’s unambiguously better on every axis. Most changes involve a tradeoff of some kind. With a compiler, the tradeoff is often between performance and code size (e.g. specialising code to make it faster leaves us with more code), or between performance and complexity (e.g. adding a fancy new optimisation), or between compile-time performance and runtime performance.
Recently I was lucky enough to be able to finish a project I’ve been working on intermittently in GHC for several years, and the result was satisfyingly better on just about every axis.
- Code size: overall binary sizes are reduced by ~5% for large programs, ~3% for smaller programs.
- Runtime performance: no measurable change on benchmarks, although some really bad corner cases where the old code performed terribly should now be gone.
- Complexity: some complex representations were removed from the runtime, making GC simpler, and the compiler itself also became simpler.
- Compile-time performance: slightly improved (0.2%).
To explain what the change is, first we’ll need some background.
Garbage collecting CAFs
A Constant Applicative Form (CAF) is a top-level thunk. For example:
myMap :: HashMap Text Int
myMap = HashMap.fromList [
    -- lots of data
  ]
Now, myMap is represented in the compiled program by a static
closure that looks like this:
[figure: the static closure for myMap before it is evaluated]
When the program demands the value of myMap for the first time, the
representation will change to this:
[figure: the static closure for myMap after evaluation, now pointing to a value in the dynamic heap]
At this point, we have a reference from the original static closure,
which is part of the compiled program, into the dynamic heap. The
garbage collector needs to know about this reference, because it has
to treat the value of myMap as live data, and ensure that this
reference remains valid.
How could we do that? One way would be to just keep all the CAFs alive for ever. We could keep a list of them and use the list as a source of roots in the GC. That would work, but we’d never be able to garbage-collect any top-level data. Back in the distant past GHC used to work this way, but it interacted badly with the full-laziness optimisation which likes to float things out to the top level - we had to be really careful not to float things out as CAFs because the data would be retained for ever.
Or, we could track the liveness of CAFs properly, like we do for other
data. But how can we find all the references to myMap? The problem
with top-level closures is that their references appear in code, not
just data. For example, somewhere else in our program we might have
myLookup :: String -> Maybe Int
myLookup name = HashMap.lookup name myMap
and in the compiled code for myLookup will be a reference to
myMap.
To be able to know when we should keep myMap alive, the garbage
collector has to traverse all the references from code as well as
data.
Of course, actually searching through the code for symbols isn’t
practical, so GHC produces an additional data structure for all the
code it compiles, called the Static Reference Table (SRT). The SRT
for myLookup will contain a reference to myMap.
The naive way to do this would be to just have a table of all the static references for each code block. But it turns out that there are quite a lot of opportunities for sharing between SRTs - lots of code blocks refer to the same things - so it makes sense to try to use a more optimised representation.
The representation that GHC 8.4 and earlier used was this:
[figure: the old representation: a single per-module table (ThisModule_srt) selected via an srt pointer and srt_bitmap]
All the static references in a module were collected together into a
single table (ThisModule_srt in the diagram), and every static
closure selects the entries it needs with a combination of a pointer
(srt) into the table and a bitmap (srt_bitmap).
This had a few problems:
- On a 64-bit machine we need at least 96 bits for the SRT in every static closure and continuation that has at least one static reference: 64 bits to point to the table and a 32-bit bitmap.
- Sometimes the heuristics in the compiler for generating the table worked really badly. I observed some cases with particularly large modules where we generated an SRT containing two entries that were thousands of entries apart in the table, which required a huge bitmap.
- There was complex code in the RTS for traversing these bitmaps, and complex code in the compiler to generate this table that nobody really understood.
The shiny new way
The basic idea is quite straightforward: instead of the single table and bitmap representation, each code block that needs an SRT will have an associated SRT object, like this:
[figure: each code block pointing to its own SRT object]
Firstly, this representation is a lot simpler, because an SRT object has exactly the same representation as a static constructor, so we need no new code in the GC to handle it. All the code to deal with bitmaps goes away.
However, just making this representation change by itself will cause a lot of code growth, because we lose many of the optimisations and sharing that we were able to do with the table and bitmap representation.
But the new representation has some great opportunities for optimisation of its own, and exploiting all these optimisations results in more compact code than before.
We never need a singleton SRT
If an SRT has one reference in it, we replace the pointer to the SRT with the pointer to the reference itself.
[figure: replacing a singleton SRT with a direct pointer to its single entry]
The SRT field for each code block can be 32 bits, not 96
Since we only need a pointer, not a pointer and a bitmap, the overhead goes down to 64 bits. Furthermore, by exploiting the fact that we can represent local pointers by 32-bit offsets (on x86_64), the overhead goes down to 32 bits.
[figure: the SRT reference stored as a 32-bit offset]
We can common up identical SRTs
This is an obvious one: if multiple code blocks have the same set of static references, they can share a single SRT object.
We can drop duplicate references from an SRT
Sometimes an SRT refers to a closure that is also referred to by something that is reachable from the same SRT. For example:
[figure: an outer SRT and an inner SRT that both reference x]
In this case we can drop the reference to x in the outer SRT,
because it’s already contained in the inner SRT. That leaves the
outer SRT with a single reference, which means the SRT object itself
can just disappear, by the singleton optimisation mentioned earlier.
For a function, we can combine the SRT with the static closure itself
A top-level function with an SRT would look like this:
[figure: a top-level function closure with a separate SRT object]
We might as well just merge the two objects together, and put the SRT entries in the function closure, to give this:
[figure: the function closure with the SRT entries merged in]
Together, these optimisations were enough to reduce code size compared with the old table/bitmap representation.
Show me the code
- An overhaul of the SRT representation
- Save a word in the info table on x86_64
- Merge FUN_STATIC closure with its SRT
Look out for (slightly) smaller binaries in GHC 8.6.1.
Fixing 17 space leaks in GHCi, and keeping them fixed
June 20, 2018

In this post I want to tackle a couple of problems that have irritated me from time to time when working with Haskell.
GHC provides some powerful tools for debugging space leaks, but sometimes they’re not enough. The heap profiler shows you what’s in the heap, but it doesn’t provide detailed visibility into the chain of references that cause a particular data structure to be retained. Retainer profiling was supposed to help with this, but in practice it’s pretty hard to extract the signal you need - retainer profiling will show you one relationship at a time, but you want to see the whole chain of references.
Once you’ve fixed a space leak, how can you write a regression test for it? Sometimes you can make a test case that will use O(n) memory if it leaks instead of O(1), and then it’s straightforward. But what if your leak is only a constant factor?
We recently noticed an interesting space leak in GHCi. If we loaded a set of modules, and then loaded the same set of modules again, GHCi would need twice as much memory as just loading the modules once. That’s not supposed to happen - GHCi should release whatever data it was holding about the first set of modules when loading a new set. What’s more, after further investigation we found that this effect wasn’t repeated the third time we loaded the modules; only one extra set of modules was being retained.
Conventional methods for finding the space leak were not helpful in this case. GHCi is a complex beast, and just reproducing the problem proved difficult. So I decided to try a trick I’d thought about for a long time but never actually put into practice: using GHC’s weak pointers to detect data that should be dead, but isn’t.
Weak pointers can detect space leaks
The System.Mem.Weak library provides operations for creating “weak” pointers. A weak pointer is a reference to an object that doesn’t keep the object alive. If we have a weak pointer, we can attempt to dereference it, which will either succeed and return the value it points to, or it will fail in the event that the value has been garbage collected. So a weak pointer can detect when things are garbage collected, which is exactly what we want for detecting space leaks.
Here’s the idea:
- Call mkWeakPtr v Nothing where v is the value you’re interested in.
- Wait until you believe v should be garbage.
- Call System.Mem.performGC to force a full GC.
- Call System.Mem.Weak.deRefWeak on the weak pointer to see if v is alive or not.
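To make the idea concrete, here is a minimal sketch of such a check (a hypothetical helper of my own, not the actual GHC patch, which is linked below):

import Control.Monad (when)
import Data.Maybe (isJust)
import System.Mem (performGC)
import System.Mem.Weak (deRefWeak, mkWeakPtr)

-- Take a weak pointer to a value we expect to become garbage, and return an
-- action that forces a GC and complains if the value is still reachable.
leakCheck :: String -> a -> IO (IO ())
leakCheck label value = do
  weak <- mkWeakPtr value Nothing
  return $ do
    performGC
    alive <- deRefWeak weak
    when (isJust alive) $
      putStrLn (label ++ ": value is still alive!")

Note that the returned action captures only the weak pointer and the label, not the value itself, so the check can’t accidentally keep the value alive.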
Here’s how I
implemented this for GHCi. One thing to note is that just because
v was garbage-collected doesn’t mean that there aren’t still pieces
of v being retained, so you might need to have several weak pointers
to different components of v, like I did in the GHC patch. These
really did detect multiple different space leaks.
This patch reliably detected leaks in trivial examples, including many of the tests in GHCi’s own test suite. That meant we had a way to reproduce the problem without having to use unpredictable measurement methods like memory usage or heap profiles. This made it much easier to iterate on finding the problems.
Back to the space leaks in GHCi
That still leaves us with the problem of how to actually diagnose the
leak and find the cause. Here the techniques are going to get a bit
more grungy: we’ll use gdb to poke around in the heap at runtime,
along with some custom utilities in the GHC runtime to help us search
through the heap.
To set things up for debugging, we need to
- Compile GHC with -g and -debug, to add debugging info to the binary and debugging functionality to the runtime, respectively.
- Load up GHCi in gdb (that’s a bit fiddly and I won’t go into the details here).
- Set things up to reproduce the test case.
*Main> :l
Ok, no modules loaded.
-fghci-leak-check: Linkable is still alive!
Prelude>
The -fghci-leak-check code just spat out a message when it
detected a leak. We can Ctrl-C to break into gdb:
Program received signal SIGINT, Interrupt.
0x00007ffff17c05b3 in __select_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
Next I’m going to search the heap for instances of the LM
constructor, which corresponds to the Linkable type that the leak
detector found. There should be none of these alive, because the :l
command tells GHCi to unload everything, so any LM
constructors we find must be leaking:
(gdb) p findPtr(ghc_HscTypes_LM_con_info,1)
0x4201a073d8 = ghc:HscTypes.LM(0x4201a074b0, 0x4201a074c8, 0x4201a074e2)
-->
0x4200ec2000 = WEAK(key=0x4201a073d9 value=0x4201a073d9 finalizer=0x7ffff2a077d0)
0x4200ec2000 = WEAK(key=0x4201a073d9 value=0x4201a073d9 finalizer=0x7ffff2a077d0)
0x42017e2088 = ghc-prim:GHC.Types.:(0x4201a073d9, 0x7ffff2e9f679)
0x42017e2ae0 = ghc-prim:GHC.Types.:(0x4201a073d9, 0x7ffff2e9f679)
$1 = void
The findPtr function comes from the RTS, it’s a function designed
specifically for searching through the heap for things from inside
gdb. I asked it to search for ghc_HscTypes_LM_con_info,
which is the info pointer for the LM constructor - every
instance of that constructor will have this pointer as its first word.
The findPtr function doesn’t just search for objects in the heap, it
also attempts to find the object’s parent, and will continue tracing
back through the chain of ancestors until it finds multiple parents.
In this case, it found a single LM constructor, which had four
parents: two WEAK objects and two ghc-prim:GHC.Types.: objects,
which are the list constructor (:). The WEAK objects we know
about: those are the weak pointers used by the leak-checking code. So
we need to trace the parents of the other objects, which we can do with
another call to findPtr:
(gdb) p findPtr(0x42017e2088,1)
0x42016e9c08 = ghc:Linker.PersistentLinkerState(0x42017e2061, 0x7ffff3c2bc63, 0x42017e208a, 0x7ffff2e9f679, 0x42016e974a, 0x7ffff2e9f679)
-->
0x42016e9728 = THUNK(0x7ffff74790c0, 0x42016e9c41, 0x42016e9c09)
-->
0x42016e9080 = ghc:Linker.PersistentLinkerState(0x42016e9728, 0x7ffff3c2e7bb, 0x7ffff2e9f679, 0x7ffff2e9f679, 0x42016e974a, 0x7ffff2e9f679)
-->
0x4200dbe8a0 = THUNK(0x7ffff7479138, 0x42016e9081, 0x42016e90b9, 0x42016e90d1, 0x42016e90e9)
-->
0x42016e0b00 = MVAR(head=END_TSO_QUEUE, tail=END_TSO_QUEUE, value=0x4200dbe8a0)
-->
0x42016e0828 = base:GHC.MVar.MVar(0x42016e0b00)
-->
0x42016e0500 = MUT_VAR_CLEAN(var=0x42016e0829)
-->
0x4200ec6b80 = base:GHC.STRef.STRef(0x42016e0500)
-->
$2 = void
This time we traced through several objects, until we came to an
STRef, and findPtr found no further parents. Perhaps the next
parent is a CAF (a top-level thunk) which findPtr won’t find because
it only searches the heap. Anyway, in the chain we have two
PersistentLinkerState objects, and some THUNKs - it looks like
perhaps we’re holding onto an old version of the
PersistentLinkerState, which contains the leaking Linkable object.
Let’s pick one THUNK and take a closer look.
(gdb) p4 0x42016e9728
0x42016e9740: 0x42016e9c09
0x42016e9738: 0x42016e9c41
0x42016e9730: 0x0
0x42016e9728: 0x7ffff74790c0 <sorW_info>
The p4 command is just a macro for dumping memory (you can get these
macros from here).
The header of the object is 0x7ffff74790c0 <sorW_info>, which is just a
compiler-generated symbol. How can we find out what code this object
corresponds to? Fortunately, GHC’s new -g option generates DWARF
debugging information which gdb can understand, and because we
compiled GHC itself with -g we can get gdb to tell us what code
this address corresponds to:
(gdb) list *0x7ffff74790c0
0x7ffff74790c0 is in sorW_info (compiler/ghci/Linker.hs:1129).
1124
1125 itbl_env' = filterNameEnv keep_name (itbl_env pls)
1126 closure_env' = filterNameEnv keep_name (closure_env pls)
1127
1128 new_pls = pls { itbl_env = itbl_env',
1129 closure_env = closure_env',
1130 bcos_loaded = remaining_bcos_loaded,
1131 objs_loaded = remaining_objs_loaded }
1132
1133 return new_pls
In this case it told us that the object corresponds to line 1129 of
compiler/ghci/Linker.hs. This is all part of the function
unload_wkr, which is part of the code for unloading compiled
code in GHCi. It looks like we’re on the right track.
Now, -g isn’t perfect - the line it pointed to isn’t actually a
thunk. But it’s close: the line it points to refers to closure_env' which is defined on line 1126, and it is
indeed a thunk. Moreover, we can see that it has a reference to pls,
which is the original PersistentLinkerState before the unloading
operation.
To avoid this leak, we could pattern-match on pls eagerly rather
than doing the lazy record selection (closure_env pls) in the
definition of closure_env'. That’s exactly what I did to fix this
particular leak, as you can see in the patch that fixes
it.
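For illustration, here is the general shape of this kind of leak, using a made-up State type rather than the actual GHC code:

{-# LANGUAGE BangPatterns #-}

data State = State { big :: [Int], small :: Int }

-- Leaky: the new state was meant to drop the old one entirely, but the field
-- 'small' is a lazy selection thunk that captures all of 'old', including the
-- big field we wanted to discard.
shrinkLeaky :: State -> State
shrinkLeaky old = State { big = [], small = small old + 1 }

-- Fixed: force the selection before building the new record, so nothing in
-- the new state refers back to 'old'.
shrinkFixed :: State -> State
shrinkFixed old =
  let !s = small old + 1
  in State { big = [], small = s }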
Fixing one leak isn’t necessarily enough: the data structure might be retained in multiple different ways, and it won’t be garbage collected until all the references are squashed. In total I found
- 7 leaks in GHCi that were collectively responsible for the original leak, and
- A further 10 leaks that only appeared when GHC was compiled without optimisation. (It seems that GHC’s optimiser is pretty good at fixing space leaks by itself)
You might ask how anyone could have found these without undergoing this complicated debugging process. And whether there are more lurking that we haven’t found yet. These are really good questions, and I don’t have a good answer for either. But at least we’re in a better place now:
- The leaks are fixed, and we have a regression test to prevent them being reintroduced.
- If you happen to write a patch that introduces a leak, you’ll know what the patch is, so you have a head start in debugging it.
Could we do better?
Obviously this is all a bit painful and we could definitely build
better tools to make this process easier. Perhaps something based on
heap-view which was recently added to
GHC? I’d love to see someone tackle this.
Hotswapping Haskell
October 17, 2017

This is a guest post by Jon Coens. Jon worked on the Haxl project since the beginning in 2013, and nowadays he works on broadening Haskell use within Facebook.
From developing code through deployment, Facebook needs to move fast. This is especially true for one of our anti-abuse systems that deploys hundreds of code changes every day. Releasing a large application (hundreds of Kloc) that many times a day presents plenty of intriguing challenges. Haskell’s strict type system means we’re able to confidently push new code knowing that we can’t crash the server, but getting those changes out to many thousands of machines as fast as possible requires some ingenuity.
Given the application size and deployment speed constraints:
- Building a new application binary for every change would take too long
- Starting and tearing down millions of heavy processes a day would create undue churn on other infrastructure
- Splitting the service into multiple smaller services would slow down developers.
To overcome these constraints, our solution is to build a shared object file that contains only the set of frequently changing business logic and dynamically load it into our server process. With some clever house-keeping, the server drops old unneeded shared objects to make way for new ones without dropping any requests.
It’s like driving a car down the road, having a new engine fall into your lap, installing it on-the-fly, and dumping the old engine behind you, all while never touching the brakes.
Show Me The Code!
For those who want a demo, look here. Make sure you have GHC 8.2.1 or later, then follow the README for how to configure the projects.
What about…
A Statically built server
The usual way of deploying updates requires building a fully statically-linked binary and shipping that to every machine. This has many benefits, the biggest being a streamlined and well-understood deployment, but it results in long update times due to the size of our large final binary. Each business logic change, no matter how small, needs to re-link the entire binary and be shipped out to all machines. Both binary link time and distribution time are correlated with file size, so the larger the binary, the longer the updates. In our case, the application binary’s size is too large for us to do frequent updates by this method.
GHCi-as-a-service
GHCi’s incremental module reloading is another way of updating code quickly. Mimicking the local development workflow, you could ship code updates to each service, and instruct them to reload as necessary. Continually re-interpreting the code significantly decreases the amount of time to distribute an update. In fact, a previous version of our application (not based on Haskell) worked this way. This approach severely hinders performance, however. Running interpreted code is strictly slower than optimized compiled code, and GHCi can’t currently handle running multiple requests at the same time.
The model of reloading libraries in GHCi closely matches what we want our end behavior to look like. What about loading those libraries into a non-interpreted Haskell binary?
Shipping shared objects for great good
Using the GHCi.Linker API, our update deployment looks roughly as follows:
- Commit a code change onto trunk
- Incrementally build a shared object file containing the frequently-changing business logic
- Ship that file to each machine
- In each process, use GHCi’s dynamic linker to load in the new shared object and lookup a symbol from it (while continuing to serve requests using the previous code)
- If all succeeds, start serving requests using the new code and mark the previous shared object for unloading by the GC
This minimizes the amount of time between making a code change and having it running in an efficient production environment. It only rebuilds the minimum set of code, deploys a much smaller file to each server, and keeps the server running through each update.
Not every module or application can follow this update model as there are some crucial constraints to consider when figuring out what can go into the shared object.
- The symbol API boundaries into and out of the shared object must remain constant
- The main binary cannot persist any reference to code or data originating from the shared object, because that will prevent the GC from unloading the object.
Fortunately, our use-case fits this mold.
Details
We’ll talk about a handful of libraries and example code:
- GHCi.ObjLink - A library provided by GHC
- ghc-hotswap - A library to use
- ghc-hotswap-types - User-written code to define the API
- ghc-hotswap-so - User-written code that lives in the shared object
- ghc-hotswap-demo - User-written application utilizing the above
Loading and extracting from the shared object
Let’s start with bringing in a new shared object, the guts of which can be found in loadNewSO. It makes heavy use of the GHCi.ObjLink library.
We need the name of an exported symbol to lookup inside the shared object (symName) and the file path to where the shared object lives (newSO). With these, we can return an instance of some data that originates from that shared object.
initObjLinker DontRetainCAFs
GHCi’s linker needs to be initialized before use, and fortunately the call is idempotent. “DontRetainCAFs” tells the linker and GC not to retain CAFs (Constant Applicative Forms, i.e. top-level values) in the shared object. GHCi normally retains all CAFs as the user can type an expression that refers to anything at all, but for hot-swapping this would prevent the object from being unloaded as we would have references into the object from the heap-resident CAFs.
loadObj newSO
resolved <- resolveObjs
unless resolved $
...
This maps the shared object into the memory of the main process, brings the shared object’s symbols into GHCi’s symbol table, and ensures any undefined symbols in the SO are present in the main binary. If any of these fail, an exception is thrown.
c_sym <- lookupSymbol symName
Here we ask GHCi’s symbol table whether the given name exists, and get back a pointer to that symbol.
h <- case c_sym of
  Nothing -> throwIO ...
  Just p_sym ->
    bracket (callExport $ castPtrToFunPtr p_sym) freeStablePtr deRefStablePtr
When getting a pointer to the symbol (Just p_sym), a couple things happen. We know that the underlying symbol is a function (as we’ll ensure later), so we cast it to a function pointer. A FunPtr doesn’t do us much good on its own, so use callExport to turn it into a callable Haskell function as well as execute the function. This call is the first thing to run code originating from the shared object. Since our call returns a StablePtr a, we dereference and then free the stable pointer, resulting in our value of type a from the shared object.
We want to query the shared object and get a Haskell value back. The best way to do that safely and without baking in too much low-level knowledge is for the shared object to expose a function using foreign export. The Haskell value must therefore be returned wrapped in a StablePtr, and so we have to get at the value itself using deRefStablePtr, before finally releasing the StablePtr with freeStablePtr.
purgeObj newSO
return h
Assuming everything has gone well, we purge GHCi’s symbol table of all symbols defined from our shared object and then return the value we retrieved. Purging the symbols makes room for the next shared object to come in and resolve successfully without fully unloading the shared object that we’re actively holding references to. We could tell GHCi to unload the shared object at this point, but this would cause the GC to aggressively crawl the entire shared object every single time, which is a lot of unnecessary work. Purging retains the code in the process to make the GC’s work lighter while making room for the next shared object. See Safely Transition Updates for when to unload the shared object.
The project that defines the code for the shared object must be generated in a relocatable fashion. It must be configured with the --enable-library-for-ghci flag, otherwise loadObj and resolveObjs will throw a fit.
Defining the shared object’s API
During compilation, the function names from code turn into quasi-human-readable symbol names. Ensuring you look up the correct symbol name from a shared object can become brittle if you rely on hardcoded munged names. To mitigate this, we define a single data type to house all the symbols we want to expose to the main application, and export a ccall using Haskell’s Foreign library. This guarantees we can export a particular symbol with a name we control. Placing all our data behind a single symbol (that both the shared object and main binary can depend on), we reduce the coupling to only a couple of points.
Let’s look at Types.hs.
data SOHandles = SOHandles
  { someData :: Text
  , someFn :: Int -> IO ()
  } deriving (Generic, NFData)
Here’s our common structure for everything we want to expose out of the shared object. Notice that you can put constants, like someData, as well as full functions to execute, like someFn.
type SOHandleExport = IO (StablePtr SOHandles)
This defines the type for the extraction function the main binary will run to get an instance of the handles from the shared object
foreign import ccall "dynamic"
  callExport :: FunPtr SOHandleExport -> SOHandleExport
Here we invoke Haskell’s FFI to generate a function that calls a function pointer to our export function as an actual Haskell function. The “dynamic” parameter to ccall does exactly this. We saw using this earlier when loading in a shared object.
Next let’s look at code for the shared object itself.
Note that we depend on and import the Types module defined in ghc-hotswap-types.
foreign export ccall "hs_soHandles"
  hsNewSOHandle :: SOHandleExport
This uses the FFI to explicitly export a Haskell function called hsNewSOHandle as a symbol named “hs_soHandles”. This is the function our main binary is going to end up calling, so set its type to our export function.
hsNewSOHandle = newStablePtr SOHandles
  { ...
  }
In our definition of this function, we return a stable pointer to an instance of our data type, which will end up being read by our main application
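For concreteness, a filled-in version might look something like the following. The field values here are hypothetical (the real definition lives in ghc-hotswap-so); it assumes newStablePtr from Foreign.StablePtr, Data.Text imported qualified as Text, and the Types module shown above:

hsNewSOHandle :: SOHandleExport
hsNewSOHandle = newStablePtr SOHandles
  { someData = Text.pack "business logic, build #1"   -- hypothetical value
  , someFn   = \n -> putStrLn ("someFn called with " ++ show n)
  }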
Using these common types, we’ve limited the amount of coupling down to using callExport, exporting the symbol as “hs_soHandles” from the shared object, and can combine these in our usage of loadNewSO.
Safely Transition Updates
With some extra care, we can cleanly transition to new shared objects while minimizing the amount of work the GC needs to do.
Let’s look closer at Hotswap.hs.
registerHotswap uses loadNewSO to load the first shared object and then provides some accessor functions on the data extracted. We save some state associated with the shared object: the path to the object, the value we extract, as well as a lock to keep track of usage.
The unWrap function reads the state for the latest shared object and runs a user-supplied action on the extracted value. Wrapping the user-function in the read lock ensures we won’t accidentally try to remove the underlying code while actively using it. Without this, we run the risk of creating unnecessary stress on the GC.
The updater function (updateState) assumes we already have one shared object mapped into memory with its symbol table purged.
newVal <- force <$> loadNewSO dynamicCall symbolName nextPath
We first attempt to load in the next shared object located at nextPath, using the same export call and symbol name as before. At this point we actually have two shared objects mapped into memory at the same time; one being the old object that’s actively being used and the other being the new object with our desired updates.
Next we build some state associated with this object, and swap our state MVar.
oldState <- swapMVar mvar newState
After this call, any user that uses unWrap will get the new version of code that was just loaded up. This is when we would observe the update being “live” in our application.
L.withWrite (lock oldState) $
  unloadObj (path oldState)
Here we finally ask the GC to unload the old object. Once the write lock is obtained, no readers are present, so nothing can be running code from this old shared object (unless one is nefariously holding onto some state). Calling unloadObj doesn’t immediately unmap the object, as it only informs the GC that the object is valid to be dumped. The next major GC ensures that no code is referencing anything from that shared object and will fully dump it out.
At this point we now have only the next shared object mapped in memory and being used in the main application.
Shortcomings / Future work
Beware sticky shared objects
The trickiest problem we’ve come across has been when the GC doesn’t want to drop old shared objects. Eventually so many shared objects are linked at once that the process runs out of space to load in a new object, stalling all updates until the process is restarted. We’ll call this problem shared object retention, or just retention.
An object is unloaded when (a) we’ve called unloadObj on it, and (b) the GC determines that there are no references from heap data into the object. Retention can therefore only happen if we have some persistent data that lives across a shared object swap. Obviously it’s better if you can avoid this, but sometimes it’s necessary: e.g. in Sigma the persistent data consists of the pre-initialized data sources that we use with the Haxl monad, amongst other things. The first step in avoiding retention is to be very clear about what this data is, and to fully audit it.
To get retention, the persistent data must be mutable in some way (e.g. contain an IORef), and for retention to occur we must write something into the persistent IORef during the course of executing code from the shared object. The data we wrote into the IORef can end up referring to the shared object in two ways:
- If it contains a thunk or a function, these will refer to code in the shared object.
- If it contains data where the datatype is defined in the shared object (rather than in the packages that the object depends on, which are statically linked), then again we have a reference from the heap-resident data into the shared object, which will cause retention.
So to avoid retention while having mutable persistent data, the rules of thumb are:
- rnf everything before writing into the persistent IORef, and ensure that any manual NFData instances don’t lie (see the sketch after this list).
- Don’t store values that contain functions.
- Don’t store values that use datatypes defined in the shared object.
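A sketch of the first rule, using the deepseq package (the helper name here is made up):

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.IORef (IORef, atomicWriteIORef)

-- Fully evaluate a value before it goes into a persistent IORef, so that no
-- thunk left inside it can pin code or data from the old shared object.
writePersistent :: NFData a => IORef a -> a -> IO ()
writePersistent ref x = do
  x' <- evaluate (force x)
  atomicWriteIORef ref x'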
Debugging retention problems can be really hard, involving attaching to the process with gdb and then following the offending references from the heap. We hope that the new DWARF support in GHC 8.2 will be able to help here.
Linker addressable memory is limited
Calling the built file a shared object is a bit of a misnomer, as it isn’t compiled with -fPIC and is actually just an object file. Files like these can only be loaded into the lower 2GB of memory (x86_64 small memory model uses 32 bit relative jumps), which can become restrictive when your object file gets large. Since the update mechanism relies on having multiple objects in memory at the same time, fragmentation of the mappable address space can become a problem. We’ve already made a few improvements to the GHCi linker to reduce the impact of these problems, but we’re running out of options.
Ideally we’d switch to using true shared objects (built with -fPIC) to remove this limitation. It requires some work to get there, though: GHC’s dynamic linking support is designed to support a model where each package is in a separate shared library, whereas we want a mixed static/dynamic model.
Asynchronous Exceptions in Practice
January 24, 2017

Asynchronous exceptions are a controversial feature of Haskell. You
can throw an exception to another thread, at any time; all you need is
its ThreadId:
throwTo :: Exception e => ThreadId -> e -> IO ()
The other thread will receive the exception immediately, whatever it is doing. So you have to be ready for an asynchronous exception to fire at any point in your code. Isn’t that a scary thought?
It’s an old idea - in fact, when we originally added asynchronous exceptions to Haskell (and wrote a paper about it), it was shortly after Java had removed the equivalent feature, because it was impossible to program with.
So how do we get away with it in Haskell? I wrote a little about the
rationale in my
book. Basically it comes down to this: if we want to be able to
interrupt purely functional code, asynchronous exceptions are the only
way, because polling would be a side-effect. Therefore the remaining
problem is how to make asynchronous exceptions safe for the impure
parts of our code. Haskell provides functionality for disabling
asynchronous exceptions during critical sections (mask) and
abstractions based around it that can be used for safe resource
acquisition (bracket).
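As a small example of that idiom, here is a sketch of a helper that runs a body alongside a worker thread (simplified; the async package’s withAsync is the robust version of this pattern):

import Control.Concurrent (forkIO, killThread)
import Control.Exception (bracket)

-- The worker thread is killed even if an asynchronous exception arrives while
-- the body is running; bracket runs the acquire action under mask, so the
-- ThreadId can't be lost to an exception between forking and installing the
-- cleanup handler.
withWorker :: IO () -> IO a -> IO a
withWorker worker body = bracket (forkIO worker) killThread (const body)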
At Facebook I’ve had the opportunity to work with asynchronous exceptions in a large-scale real-world setting, and here’s what I’ve learned:
- They’re really useful, particularly for catching bugs that cause excessive use of resources.
- In the vast majority of our Haskell codebase we don’t need to worry about them at all. The documentation that we give to our users who write Haskell code to run on our platform doesn’t mention asynchronous exceptions.
- But some parts of the code can be really hard to get right. Code in the IO monad dealing with multithreading or talking to foreign libraries, for example, has to care about cleaning up resources and recovering safely in the event of an asynchronous exception.
Let me take each of those points in turn and elaborate.
Where asynchronous exceptions are useful
The motivating example often used is timeouts, for example of
connections in a network service. But this example is not all that
convincing: in a network server we’re probably writing code that’s
mostly in the IO monad, we know the places where we’re blocking, and
we could use other mechanisms to implement timeouts that would be less
“dangerous” but almost as reliable as asynchronous exceptions.
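For reference, here is roughly what a timeout built on asynchronous exceptions looks like. This is a toy sketch of my own: the real System.Timeout.timeout uses a unique exception and cancels the watchdog when the action finishes early.

import Control.Concurrent (forkIO, myThreadId, threadDelay, throwTo)
import Control.Exception (Exception)

data Timeout = Timeout deriving Show
instance Exception Timeout

-- Fork a watchdog that throws Timeout to the calling thread after n
-- microseconds; the action is interrupted wherever it happens to be.
crudeTimeout :: Int -> IO a -> IO a
crudeTimeout n action = do
  me <- myThreadId
  _ <- forkIO (threadDelay n >> throwTo me Timeout)
  action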
In Sigma, we use asynchronous exceptions to prevent huge requests from degrading the performance of our server for other clients.
In a complex system, it’s highly likely that some requests will end up using an excessive amount of resources. Perhaps there’s a bug in the code that sometimes causes it to use a lot of CPU (or even an infinite loop), or perhaps the code fetches some data to operate on, and the data ends up being unexpectedly large. In principle we could find all these cases and fix them, but in practice, large systems can have surprising emergent behaviour and we can’t guarantee to find all the bugs outside production.
Beware Elephants
So sometimes a request turns out to be an elephant, and we have to deal with it. If we do nothing, the elephant will trample around, slowing everything down, or maxing out some resource like memory or network bandwidth, which can cause failures for other requests running on the system.
One way or another something is going to die. We would rather it was the elephant, and not the many other requests currently running on the same machine. Stopping the elephant minimises the destruction. The elephant’s owner will then fix their problem, and we’ve mitigated a bug with minimal disruption.
Our elephant gun is called Allocation Limits. The Haskell runtime
keeps track of how much memory each Haskell thread has allocated in
total, and if that total exceeds the limit we set, the thread receives
an asynchronous exception, namely AllocationLimitExceeded. The user
code running on our platform is not permitted to catch this exception,
instead the server catches it, logs some data to aid debugging, and
sends an error back to the client that initiated the request.
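The per-thread allocation-limit machinery is exposed in base, so you can experiment with it outside our setup; here is a minimal sketch (the budget and the handler are made up, and the real server wiring is more involved):

import Control.Concurrent (forkIO)
import Control.Exception (AllocationLimitExceeded (..), handle)
import GHC.Conc (enableAllocationLimit, setAllocationCounter)

-- Run a request on its own thread with a ~100MB allocation budget; exceeding
-- it delivers AllocationLimitExceeded to that thread as an async exception.
runLimited :: IO () -> IO ()
runLimited request = do
  _ <- forkIO $ do
    setAllocationCounter (100 * 1000 * 1000)
    enableAllocationLimit
    handle (\AllocationLimitExceeded -> putStrLn "allocation limit exceeded")
           request
  return ()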
We’re using “memory allocated” as a proxy for “work done”. Most computation in Haskell allocates memory, so this is a more predictable measure than wall-clock time. It’s a fairly crude way to identify excessively large requests, but it works well for us.
Here’s what happened when we enabled allocation limits early on during Sigma’s development. The graph tracks the maximum amount of live memory across different groups of machines. It turns out there were a very small fraction of requests consuming a huge amount of resources, and enabling allocation limits squashed them nicely:
[figure: maximum live memory across groups of machines, dropping after allocation limits were enabled]
Allocation limits have helped protect us from disaster on several occasions. One time, an infinite loop made its way into production; the result was that our monitoring showed an increase in requests hitting the allocation limit. The data being logged allowed it to be narrowed down to one particular type of request; we were quickly able to identify the change that caused the problem, undo it, and notify the owner. Nobody else noticed.
In the vast majority of code, we don’t need to worry about asynchronous exceptions
Because you don’t have to poll for an asynchronous exception, they work almost everywhere. All pure code works with asynchronous exceptions without change.
In our platform, clients write code on top of the Haxl framework in which I/O is provided only via a fixed set of APIs that we control, so we can guarantee that those APIs are safe, and therefore all of the client code is safe by virtue of abstraction.
Some parts of the code can be really hard to get right
That leaves the parts of the code that implement the I/O libraries and other lower level functionality. These are the places where we have to care about asynchronous exceptions: if an async exception fires when we have just opened a connection to a remote server, we have to close it again and free all the resources associated with the connection, for example.
In principle, you can follow a few guidelines to be safe.
- Use bracket when allocating any kind of resource that needs to be explicitly released. This is not specific to asynchronous exceptions: coping with ordinary synchronous exceptions requires a good resource-allocation discipline, so your code should be using bracket anyway.
- Use the async package, which avoids some of the common problems, such as making sure that you fork a thread inside mask to avoid asynchronous exceptions leaking (see the sketch after this list).
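Here is a sketch of that async-package idiom (this is essentially what Control.Concurrent.Async’s concurrently does for you):

import Control.Concurrent.Async (withAsync, wait)

-- Both children are forked with the right masking, and each is cancelled
-- automatically if the body throws or receives an asynchronous exception.
fetchBoth :: IO a -> IO b -> IO (a, b)
fetchBoth ioA ioB =
  withAsync ioA $ \a ->
  withAsync ioB $ \b -> do
    ra <- wait a
    rb <- wait b
    return (ra, rb)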
Nevertheless it’s still possible to go wrong. Here are some ways:
If you want asynchronous exceptions to work, be careful you don’t accidentally run inside mask or uninterruptibleMask. We’ve seen examples of third-party libraries that run callbacks inside mask (e.g. the hinotify library until recently). Use getMaskingState to assert that you’re not masked when you don’t want to be (a small helper is sketched below).
Be careful that those asynchronous exceptions don’t escape from a thread if the thread is created by calling a foreign export, because uncaught exceptions will terminate the whole process. Unlike when using async, a foreign export can’t be created inside mask. (This is something that should be fixed in GHC, really.)
Catching all exceptions seems like a good idea when you want to be bullet-proof, but if you catch and discard the ThreadKilled exception it becomes really hard to actually kill that thread.
If you’re coordinating with some foreign code and the Haskell code gets an asynchronous exception, make sure that the foreign code will also clean up properly.
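For the masking-state point above, a small assertion helper is easy to write; this one is hypothetical but uses only functions from Control.Exception:

import Control.Exception (MaskingState (Unmasked), getMaskingState)
import Control.Monad (unless)

-- Fail loudly if we are unexpectedly running masked, e.g. because some
-- library invoked our callback inside 'mask'.
assertUnmasked :: String -> IO ()
assertUnmasked site = do
  st <- getMaskingState
  unless (st == Unmasked) $
    error (site ++ ": expected Unmasked, but masking state is " ++ show st)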
The type system is of no help at all with finding these bugs; the only way you can find them is with careful eyeballs, good abstractions, lots of testing, and plenty of assertions.
It’s worth it
My claim is, even though some of the low-level code can be hard to get right, the benefits are worth it.
Asynchronous exceptions generalise several exceptional conditions that relate to resource consumption: stack overflow, timeouts, allocation limits, and heap overflow exceptions. We only have to make our code asynchronous-exception-safe once, and it’ll work with all these different kinds of errors. What’s more, being able to terminate threads with confidence that they will clean up promptly and exit is really useful. (It would be nice to do a comparison with Erlang here, but not having written a lot of this kind of code in Erlang I can’t speak with any authority.)
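For instance, System.Timeout.timeout is built on the same mechanism: it interrupts the action with an asynchronous exception, so code that is already exception-safe needs nothing extra to support timeouts. A tiny sketch, with an arbitrary two-second budget:

import System.Timeout (timeout)

-- Run an action with a 2-second budget (the argument is in microseconds);
-- Nothing means it was interrupted. Any bracketed resources inside the
-- action are released on the way out.
withTwoSecondBudget :: IO a -> IO (Maybe a)
withTwoSecondBudget = timeout 2000000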
In a high-volume network service, having a guarantee that a class of runaway requests will be caught and killed off can help reliability, and give you breathing room when things go wrong.
Haskell in the Datacentre
December 8, 2016At Facebook we run Haskell on thousands of servers, together handling over a million requests per second. Obviously we’d like to make the most efficient use of hardware and get the most throughput per server that we can. So how do you tune a Haskell-based server to run well?
Over the past few months we’ve been tuning our server to squeeze out as much performance as we can per machine, and this has involved changes throughout the stack. In this post I’ll tell you about some changes we made to GHC’s runtime scheduler.
Summary
We made one primary change: GHC’s runtime is based around an M:N threading model which is designed to map a large number (M) of lightweight Haskell threads onto a small number (N) of heavyweight OS threads. In our application M is fixed and not all that big: we can max out a server’s resources when M is about 3-4x the number of cores, and meanwhile setting N to the number of cores wasn’t enough to let us use all the CPU (I’ll explain why shortly).
To cut to the chase, we ended up increasing N to be the same as M (or close to it), and this bought us an extra 10-20% throughput per machine. It wasn’t as simple as just setting some command-line options, because GHC’s garbage collector is designed to run with N equal to the number of cores, so I had to make some changes to the way GHC schedules things to make this work.
All these improvements are upstream in GHC, and they’ll be available in GHC 8.2.1, due early 2017.
Background: Capabilities
When the GHC runtime starts, it creates a number of capabilities
(also sometimes called HEC, for Haskell Execution Context). The
number of capabilities is determined by the -N flag when you start
the Haskell program, e.g. prog +RTS -N4 would run prog with 4
capabilities.
A capability is the ability to run Haskell code. It consists of an allocation area (also called nursery) for allocating memory, a queue of lightweight Haskell threads to run, and one or more OS threads (called workers) that will run the Haskell code. Each capability can run a single Haskell thread at a time; if the Haskell thread blocks, the next Haskell thread in the queue runs, and so on.
Typically we choose the number of capabilities to be equal to the number of physical cores on the machine. This makes sense: there is no advantage in trying to run more Haskell threads simultaneously than we have physical cores.
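Capabilities can also be inspected and changed from Haskell at runtime. As a small sketch, assuming you would rather compute the value at startup than hard-code a -N flag:

import Control.Concurrent (getNumCapabilities, setNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Match the number of capabilities to the number of processors the OS
-- reports (which may include hyperthreads), instead of relying on -N.
matchCapabilitiesToCores :: IO ()
matchCapabilitiesToCores = do
  cores <- getNumProcessors
  caps  <- getNumCapabilities
  putStrLn ("processors: " ++ show cores ++ ", capabilities: " ++ show caps)
  setNumCapabilities cores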
How our server maps onto this
Our system is based on the C++ Thrift server, which provides a fixed set of worker threads that pull requests from a queue and execute them. We choose the number of worker threads to be high enough that we can fully utilize the server, but not so high that we create too much contention and increase latency under maximum load.
Each worker thread calls into Haskell via a foreign export to do the
actual work. The GHC runtime then chooses a capability to run the
call. It normally picks an idle capability, and the call executes
immediately. If there are no idle capabilities, the call blocks on
the queue of a capability until the capability yields control to it.
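On the Haskell side such an entry point is just a foreign export. A stripped-down sketch, where the module, function name and request type are made up for illustration:

{-# LANGUAGE ForeignFunctionInterface #-}
module RequestEntry (handleRequest) where

import Foreign.C.Types (CInt (..))

-- Each C++ worker thread calls this exported function; the GHC runtime
-- picks a capability (ideally an idle one) to run the call on.
foreign export ccall handleRequest :: CInt -> IO CInt

handleRequest :: CInt -> IO CInt
handleRequest requestId = do
  -- ... do the actual work for the request here ...
  return requestId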
The problem
At high load, even though we have enough threads to fully utilize the CPU cores, the intermediate layer of scheduling where GHC assigns threads to capabilities means that we sometimes have threads idle that could be running. Sometimes there are multiple runnable workers on one capability while other capabilities are idle, and the runtime takes a little while to load-balance during which time we’re not using all the available CPU capacity.
Meanwhile the kernel is doing its own scheduling, trying to map those OS threads onto CPUs. Obviously the kernel has a rather more sophisticated scheduler than GHC and could do a better job of mapping those threads onto the available cores, but we aren’t letting it. In this scenario, the extra layer of scheduling in GHC is just a drag on performance.
First up, a bug in the load-balancer.
While investigating this I found a bug in the way GHC’s load-balancing worked: it could cause a large number of spurious wakeups of other capabilities while load-balancing. Fixing this was worth a few percent right away, but I had my sights set on larger gains.
Couldn’t we just increase the number of capabilities?
Well yes, and of course we tried just bumping up the -N value, but
increasing -N beyond the number of cores just tends to increase CPU
usage without increasing throughput.
Why? Well, the problem is the garbage collector. The GC keeps all its threads running trying to steal work from each other, and when we have more threads than we have real cores, the spinning threads are slowing down the threads doing the actual work.
Increasing the number of capabilities without slowing down GC
What we’d like to do is to have a larger set of mutator threads, but only use a subset of those when it’s time to GC. That’s exactly what this new flag does:
+RTS -qn<threads>
For example, on a 24-core machine you might use +RTS -N48 -qn24 to
have 48 mutator threads, but only 24 threads during GC. This is great
for using hyperthreads too, because hyperthreads work well for the
mutator but not for the GC.
Which threads does the runtime choose to do the GC? The scheduler has a heuristic which looks at which capabilities are currently inactive and picks those to be the ones that sit out the GC, to avoid having to synchronise with threads that are currently asleep.
+RTS -qn will now be turned on by default!
This is a slight digression, but it turns out that setting +RTS -qn
to the number of CPU cores is always a good idea if -N is too large.
So the runtime will be doing
this by default from now on. If -N accidentally gets set too
large, performance won’t drop quite so badly as it did with GHC 8.0
and earlier.
Capability affinity
Now we can safely increase the number of capabilities well beyond the
number of real cores, provided we set a smaller number of GC threads
with +RTS -qn.
The final step that we took in Sigma is to map our server threads 1:1 with capabilities. When the C++ server thread calls into Haskell, it immediately gets a capability, there’s never any blocking, and the GHC runtime doesn’t need to do any load-balancing.
How is this done? There’s a new C API exposed by the RTS:
void rts_setInCallCapability (int preferred_capability, int affinity);
In each thread you call this to map that thread to a particular capability. For example you might call it like this:
static std::atomic<int> counter;
...
// Each worker thread claims the next capability index; passing 0 for the
// affinity argument means the thread is not pinned to a CPU core.
rts_setInCallCapability(counter.fetch_add(1), 0);
And ensure that you call this once per thread. The affinity
argument is for binding a thread to a CPU core, which might be useful
if you’re also using GHC’s affinity setting (+RTS -qa). In our case
we haven’t found it necessary.
Future
You might be thinking, but isn’t the great thing about Haskell that we have lightweight threads? Yes, absolutely. We do make use of lightweight threads in our system, but the main server threads that we inherit from the C++ Thrift server are heavyweight OS threads.
Fortunately in our case we can fully load the system with 3-4 heavyweight threads per core, and this solution works nicely with the constraints of our platform. But if the ratio of I/O waiting to CPU work in our workload increased, we would need more threads per core to keep the CPU busy, and the balance tips towards wanting lightweight threads. Furthermore, using lightweight threads would make the system more resilient to increases in latency from downstream services.
In the future we’ll probably move to lightweight threads, but in the meantime these changes to scheduling mean that we can squeeze all the available throughput from the existing architecture.
