That’s where Static Application Security Testing (SAST) tools like CodeQL come in.
If you’ve used CodeQL, you already know how powerful it is. Instead of relying on predefined rules or pattern matching, it treats your code like data, allowing deep semantic analysis that finds subtle, logic-based vulnerabilities other scanners miss — and it does so with impressive precision, minimizing false positives.
If this is your first contact with CodeQL, it’s worth checking out this great introduction before diving in.
However, while CodeQL’s power is undeniable, integrating it smoothly across large organizations, monorepos, or custom CI/CD pipelines can be challenging.
Here are some of the issues teams typically encounter:
- Multiple programming languages and build systems in a single repository
- The need to scan only what actually changed in a pull request
- Running CodeQL outside of GitHub Actions
We’ve been there ourselves. That’s why we built CodeQL Wrapper, an open-source tool that simplifies running CodeQL in any environment, no matter how complex your repo or CI system might be.
This post explains why we built it, what it does in practice, and how to use it to simplify your CodeQL workflows without a deep technical dive.
What Is CodeQL Wrapper?
CodeQL Wrapper is a universal Python CLI that abstracts away much of the setup pain and provides a consistent way to run CodeQL across projects.
It allows you to run CodeQL anywhere — locally or in CI systems like GitHub Actions, Jenkins, Azure Pipelines, CircleCI, or Harness while ensuring consistent behavior across environments.
It automatically fetches and installs the correct version of the CodeQL CLI, meaning your local runs and CI analyses always stay in sync. And even if your pipelines don’t run on GitHub, CodeQL Wrapper can still send results back to GitHub Advanced Security in SARIF format for centralized visibility.
Beyond simplifying setup, CodeQL Wrapper helps teams maintain consistent configuration using a flexible .codeql.json file. This lets you define build modes, custom queries, and paths once — then apply them consistently across projects.
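As a rough illustration of the idea (the field names below are hypothetical, not the actual schema; check the CodeQL Wrapper documentation for the real one), such a file might centralize settings like:

```json
{
  "projects": [
    {
      "path": "services/api",
      "build-mode": "none",
      "queries": ["security-extended"]
    }
  ]
}
```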
But where the wrapper really shines is in its ability to handle monorepos, tackling two of the biggest pain points: language detection and performance.
Automatic Language Detection
Many modern projects mix languages. A Python API here, a JavaScript frontend there, maybe a bit of Go for background services.
CodeQL’s default GitHub Action setup requires manually specifying which language to analyze. That’s fine for small projects, but a maintenance nightmare for monorepos.
CodeQL Wrapper removes this burden entirely. It automatically detects languages present in your codebase and configures the analysis for you.
This automation ensures nothing slips through the cracks (no forgotten languages, no partial scans) and keeps your configuration simple and future-proof.
Smarter Performance Through Parallelization and Change Detection
Running CodeQL can be time-consuming, especially in large repositories. CodeQL Wrapper optimizes this in two complementary ways: it runs analyses in parallel and skips unchanged code altogether.
The parallelization happens on two levels:
- Across projects – In the context of CodeQL Wrapper, a project refers to a distinct subdirectory within a repository that can be analyzed independently. Each project gets its own CodeQL database, allowing multiple components to be analyzed simultaneously.
- Within each project – If a project contains multiple languages (Python, JavaScript, Go, etc.), CodeQL Wrapper can run analysis for each language concurrently instead of processing them sequentially.
The tool intelligently adapts to your system’s hardware, using all available CPU cores without exhausting memory, though you can manually tune the settings if needed.
And because avoiding unnecessary work is even better than parallelizing it, CodeQL Wrapper includes a change detection algorithm that analyzes only the code that has actually changed since the last scan.
It uses Git to identify modified files and determines which projects those files belong to. If no relevant files have changed, the tool automatically skips redundant analysis steps, avoiding unnecessary database creation and query execution, cutting analysis time from hours to minutes on large monorepos.
At a high level, the wrapper determines a base commit (from CI variables or an explicit flag), fetches any missing refs, computes the diff, and maps changed files to projects. Only those projects proceed to database creation and query execution. The logic is platform‑agnostic, so the same behavior applies whether you run in GitHub Actions, Azure Pipelines, CircleCI, Harness, or locally.
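In Git terms, the change-detection step boils down to something like the following sketch (a conceptual illustration rather than the wrapper’s actual code, assuming each top-level directory is a project):

```sh
# Determine the base commit (e.g. the target branch of the pull request)
base="$(git merge-base origin/main HEAD)"

# List files changed since that base commit
changed_files="$(git diff --name-only "$base" HEAD)"

# Map changed files to project directories; only these projects get analyzed
echo "$changed_files" | cut -d/ -f1 | sort -u
```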
Installing and Running CodeQL Wrapper
We designed CodeQL Wrapper so that the first run feels frictionless and the hundredth run still feels predictable. Installation is a single step, and from there you can either prepare the environment or jump straight into analysis.
pip install codeql-wrapper
This will install the tool in your environment. Once installed, you’ll have access to two commands:
- install – Installs CodeQL CLI and query packs (you can pin a specific version if needed).
- analyze – Runs CodeQL analysis on your project(s), with automatic project detection and parallelized scans.
If you only need the CodeQL CLI (and don’t want to run a scan yet), just run:
codeql-wrapper install
This fetches the CodeQL CLI and query packs, and you can pin versions with --version to keep results reproducible. When you’re ready to analyze, analyze takes over the heavy lifting: it confirms CodeQL is installed, detects languages and projects, creates databases, and runs queries in parallel to suit your hardware.
Basic Usage
Most teams start by pointing the wrapper at the repository root. That’s intentional: you shouldn’t need to hand‑craft language lists or database steps to get useful results.
codeql-wrapper analyze ./repo
Behind the scenes, the wrapper inspects your codebase, identifies the languages that matter, and builds one database per language. If your repo is a monorepo, it scopes the work to projects it finds, running them concurrently. The default configuration leans on safe, broadly applicable queries and build-mode: none, which makes first‑time adoption smooth even without bespoke build instructions.
As codebases grow, consistency matters more than one‑off tweaks. The wrapper’s model — automatic detection plus ergonomic overrides — offers practical advantages: it adapts to monorepos with different needs per project, keeps query policies uniform across CI providers, and makes exceptions explicit without touching pipeline YAML.
Why CI Integration is Easier
Rather than generic promises, here’s what changes in day‑to‑day pipelines:
- One command surface: install and analyze. The wrapper handles language detection, database creation, and query execution, so pipelines don’t need fragile per‑language steps.
- Version pinning: install --version locks the CodeQL CLI version, avoiding environment drift between developer machines and CI runners.
- Monorepo awareness: Independent projects (subdirectories) are processed in parallel, reducing custom CI logic.
- Change detection: Git diffs are mapped to affected projects; unchanged ones are skipped, cutting incremental scan time.
- Consistent outputs: SARIF is generated uniformly, so reporting stays the same regardless of CI platform.
Usage Examples
Local runs establish trust, but enterprise adoption hinges on repeatable automation. To make that transition easy, we ship codeql-wrapper-pipelines, a companion repository with templates for common CI/CD providers. These examples handle the practicalities — auth, artifacts, and reporting — while keeping the interface consistent. You get one way of invoking the wrapper, regardless of whether your pipelines live in GitHub Actions, Azure Pipelines, CircleCI, Harness, or elsewhere.
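Whatever the provider, the templates boil down to the same few commands (a sketch; the published templates add provider-specific authentication, artifact handling, and SARIF upload on top):

```sh
pip install codeql-wrapper
codeql-wrapper install --version 2.x.x   # version is a placeholder; pin it for reproducibility
codeql-wrapper analyze ./repo            # detects languages and projects, runs scans in parallel
```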
Why We Built It
Our work with global enterprises adopting GitHub Advanced Security led us to build both codeql-wrapper and codeql-wrapper-pipelines. The tools distill hard‑won lessons from monorepo rollouts: minimize manual configuration, keep behavior consistent across environments, and make smart defaults easy to override.
The goal isn’t just faster scans; it’s a smoother path to reliable, organization‑wide CodeQL usage. If you’re strengthening DevSecOps without piling on process, we think this approach strikes the right balance. Give the wrapper a try, explore the pipelines, and reach out on the repositories if you want help shaping them to your environment.
See, when I write a parser, I frequently write a pretty-printer as well1, and the pretty-printer is almost the same as the parser. This makes maintenance harder, if only because parser and pretty-printer have to be kept in sync. Besides, it simply feels like unnecessary duplication.
This blog post is the story of the latest developments in the quest for more general grammar combinators—or as Mathieu Boespflug and I have been calling them, format descriptors—and how it led me to publish a new library, called Pup. For further reading, you can also check the paper that Mathieu and I wrote about it for Olivier Danvy’s festschrift, held at the ICFP/SPLASH conference in Singapore last October.
A library of format descriptors
The Pup library lets you write format descriptors in a familiar style. In fact, Pup uses existing parser combinator libraries as a backend. But in addition to parsing, these format descriptors let you pretty-print the code (using an existing pretty-printing library as a backend).
An example is better than a long explanation, so let me show you a format descriptor for simplified S-expressions. It lets you parse expressions such as this one:
((a-fun 57 :tag) "a string" (list 1 "somewhat" :long "list of mixed types"
(list "at least" 3) (list "but maybe" :more)))
and then pretty-print the result (or any abstract syntax tree, really), which looks like this:
((a-fun 57 :tag)
"a string"
(list
1
"somewhat"
:long
"list of mixed types"
(list "at least" 3)
(list "but maybe" :more)))
I didn’t implement any parsing or printing algorithm. Parsing is done by Megaparsec and pretty-printing by Prettyprinter. These are the only backends that I have implemented so far—it’s a new library, but the library is written to be easily extensible to other parsing and pretty-printing backends. What Pup gives you is the surface syntax of the format descriptor DSL.
So here’s our example format descriptor. In the rest of the blog post I’ll
explain what’s going on (in particular why there are those labels like
#SSymb and #":" rather than constructors SSymb and (:)):
sexpr =
group (nest 2 (#SList <* try ("(") <* space <*> try sexpr `sepBy` space1 <* space <* ")"))
<|> #SSymb <*> try symbol
<|> #SInt <*> try nat
<|> #SStr <* try ("\"") <*> takeWhileP Nothing (/= '"') <* "\""
where
symbol = #":" <*> symbol_lead <*> many (try symbol_other)
symbol_lead = oneOf (':' : ['a' .. 'z'] ++ ['A' .. 'Z'])
symbol_other = oneOf (['a' .. 'z'] ++ ['A' .. 'Z'] ++ ['0' .. '9'] ++ ['-'])
data SExpr
= SList [SExpr]
| SSymb String
| SStr Text
| SInt Int
deriving (Generic, Show, Eq)
-- For completeness parse the example string, then pretty-print the result
do
let str =
-- Did you know that GHC has multi-line strings now?
"""
((a-fun 57 :tag) "a string" (list 1 "somewhat" :long "list of mixed types"
(list "at least" 3) (list "but maybe" :more)))
"""
case parse sexpr "<test>" str of
Left e -> putStrLn e
Right expr -> do
let doc = Maybe.fromJust (Pup.print sexpr expr)
let str' = Prettyprinter.renderStrict $ Prettyprinter.layoutPretty Prettyprinter.defaultLayoutOptions doc
putStrLn $ Text.unpack str'
The sexpr format descriptor is almost what you would write with Megaparsec.
There are two main differences:
- The second line reads #SSymb <*> try symbol. In Megaparsec you would see SSymb <$> try symbol. We call #SSymb a lead. It is in charge not only of building an SSymb when parsing, but also of reading out an SSymb when printing. This is also why we have to use the applicative <*> instead of the functorial <$>. The #SSymb lead is automatically generated because our type implements Generic.
- There are pretty-printing combinators group (nest 2 (…)). These are Wadler-Leijen style combinators as implemented in the Prettyprinter library. They’re responsible for breaking lines and inserting indentation. You’ll need some of these since a standard grammar can tell you what can be parsed, but not what is pretty.
Other than that, if you’re familiar with Megaparsec, you should feel right at
home. You even have access to the monadic >>= if you need it (and yes, it
pretty-prints).
A word of caution: the Pup library helps you to make sure that the output of the pretty-printer can be reparsed to the same AST, but it doesn’t guarantee it. It’s probably unreasonable to hope for parsers and pretty-printers to be inverses by construction with only Haskell’s type system (barring some very drastic restrictions on possible parsers).
With that said, I think the library is cool, and you can go use it today! I hope you enjoy it, and please let me know if you’re missing anything. And if you want to know more about what makes the library tick: read on.
Tuple troubles
The type of the sexpr format descriptor is maybe not what you’re expecting:
sexpr :: Pup' (SExpr -> r) r SExpr
Why are there three type arguments rather than one? What is this r type
variable about? These are where Pup’s secret sauce lies.
Fundamentally, a parser for the type a is something of the form String -> a,
and a printer something of the form a -> String. Parsers are covariant
functors and printers are contravariant functors. So our bidirectional format
descriptor should be something like P0 a = (a -> String, String -> a), which
is neither a covariant nor a contravariant functor. The only map-like
transformation that P0 supports is something like
isomap :: (a -> b, b -> a) -> P0 a -> P0 b
We also need something to chain parsers. To that effect, we can
introduce an andThen combinator (we’d really need to enrich the P0 type to be
able to do that, but let’s pretend):
andThen :: P0 a -> P0 b -> P0 (a, b)
Together this adds up to a rather so-so experience.
p0 :: P0 (A, (B, (C, D)))
p0 = foo `andThen` bar `andThen` baz `andThen` buzz
p0' :: P0 (A, B, C, D)
p0' = isomap (\(a, (b, (c, d))) -> (a, b, c, d), \(a, b, c, d) -> (a, (b, (c, d)))) p0
Part of what makes this style uncomfortable is the deep nesting of products. And when you need to reshuffle them, having to pass two functions at the same time adds a lot of visual boilerplate. This boilerplate would probably dominate the parsing logic. This is what in our paper we call “tuple troubles”.
In the case of covariant functors, there’s a ready-made solution: use an
applicative functor interface. It’s equivalent to pairing, it just works much
better in Haskell. But P0 isn’t a functor, and there’s really little we can do
to improve on this interface.
In order to address these shortcomings let’s use a classic technique and decorrelate the type of what is parsed from the type of what is printed2.
type P1 a b = (a -> String, String -> b)
A P1 a b can parse a b and print an a. Of course we are generally
interested in using a P1 a a, but we now have the ability to locally change
one and not the other. So that we can use:
-- If we know how to parse a `b` we can parse a `c`
rmap :: (b -> c) -> P1 a b -> P1 a c
-- If we know how to print an `a` we can print a `d`
lmap :: (d -> a) -> P1 a b -> P1 d b
In Haskell parlance, P1 is a profunctor.
And now, since P1 a is a functor for any a, we can also equip it with an
applicative functor structure. The paper Composing Bidirectional Programs
Monadically even shows how to equip P1 a with a monadic structure. Pup
uses their technique for monadic parsing. Now our example reads as follows:
p1 :: P1 (A, B, C, D) (A, B, C, D)
p1 = (,,,) <$> lmap fst foo <*> lmap snd bar <*> lmap thd baz <*> lmap fth buzz
This looks much more reasonable: the expression has the familiar structure of
applicative expressions. But the actual parsing logic is still drowned in all
these lmap (and I’ve been quite generous here, as I’ve invented projections
fst…fth for 4-tuples). We really need a better way to combine printers.
Indexing with a stack
To improve on the profunctor style, we need a better way to combine the contravariant side.
The style I propose, in the Pup library, looks like the following:
p2 :: P2 (A -> B -> C -> D -> r) r (A, B, C, D)
p2 = (,,,) <$> foo <*> bar <*> baz <*> buzz
I’ll come back to the definition of P2 in a moment. But in the meantime, here’s
how I want you to read the type:
- In some descriptor
descr :: P2 r r' a,ais the return type, andrandr'represent a stack of arguments. Withrrepresenting the shape of the stack beforedescr, andr'the state of the stack afterdescr. - A stack of the form
A -> B -> C -> D -> ris a stack with at least four elements, and whose topmost elements are of typeA,B,C, andDin that order. The rest of the stack is unknown, which is represented by a type variabler. - The format descriptor
foo, for instance, has an argument of typeA(used when printing). To obtain this argument,foopops the topmost element of the argument stack, sofoohas typefoo :: P2 (A -> r) r A.
This usage pattern of passing a different argument to each format descriptor in a sequence, as a stack naturally does, appears to be by far the most common in format descriptors, so Pup optimises for it.
You might have noticed the catch, though: the type arguments r and r' vary
during a sequence of actions. This means that P2 isn’t quite an applicative
functor and <*>, above, isn’t the combinator we’re used to. Instead P2 is
an indexed applicative functor (and possibly an indexed monad too).
The type of <*> for indexed applicatives f is:
(<*>) :: f r r' (a -> b) -> f r' r'' a -> f r r'' b
On the other hand, <$> is the usual <$> from the Functor type class.
Specialised to an indexed applicative f, its type is:
(<$>) :: (a -> b) -> f r r' a -> f r r' b
With this established, we can see how the type of individual syntax descriptors can be sequenced:
foo :: P2 (A -> r1) r1 A
bar :: P2 (B -> r2) r2 B
-- `r1` is universally quantified
-- so we can unify `r1` with `B -> r2`
q :: P2 (A -> B -> r2) r2 (A, B)
q = (,) <$> foo <*> bar
Sequencing an action which pops an A and an action which pops a B yields an
action which pops both an A and a B. Precisely what we needed. Notice how,
crucially, r1 in foo and r2 in bar are universally quantified variables,
so that they can be refined when a subsequent value requires a value from the
stack.
The existing support for indexed applicatives and monads isn’t very good (mostly
it consists of the somewhat minimalistic and pretty dusty indexed
library). This is why I’m also releasing a new, modern,
indexed-applicative-and-monads companion library Stacked, which makes use of
modern GHC facilities such as quantified constraints and qualified do. Pup is
built on top of Stacked.
The last piece of the puzzle is what we’ve been calling “leads”, which are
actions which neither print nor parse anything, but serve as a bidirectional
version of constructors. For types which implement the Generic type class,
leads are derived automatically. And with that we can complete our example:
#"(,,,)" :: P2 ((a, b, c, d) -> r) (a -> b -> c -> d -> r) (a -> b -> c -> d -> (a, b, c, d))
p2' :: P2 ((A, B, C, D) -> r) r (A, B, C, D)
p2' = #"(,,,)" <*> foo <*> bar <*> baz <*> buzz
This uses the OverloadedLabels extension. Equivalently,
you can use lead @"(,,,)" instead of #"(,,,)".
A note on divisible
There’s an existing structure on contravariant functors which is analogous to
applicative functors on covariant functors: the Divisible.
However, it doesn’t help us with our goal in this section, since the divisible
structure is already used implicitly in P1’s instance of Applicative.
In fact, here’s a theorem (the proof is left as an exercise to the reader): let d and f be a
divisible and an applicative functor, respectively, then t a b = (d a, f b) is
applicative (with respect to b).
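If you want to convince yourself, here is a minimal sketch of such an instance (my own illustration, not code from Pup or the paper), using divide and conquer from Data.Functor.Contravariant.Divisible:

```haskell
import Data.Functor.Contravariant.Divisible (Divisible (..))

-- Pairing a Divisible `d` with an Applicative `f` is Applicative in the
-- second type argument.
newtype Both d f a b = Both (d a, f b)

instance Functor f => Functor (Both d f a) where
  fmap g (Both (da, fb)) = Both (da, fmap g fb)

instance (Divisible d, Applicative f) => Applicative (Both d f a) where
  pure b = Both (conquer, pure b)
  Both (dg, fg) <*> Both (dx, fx) =
    -- `divide` duplicates the printed value and feeds it to both sides,
    -- while the covariant side composes as usual.
    Both (divide (\x -> (x, x)) dg dx, fg <*> fx)
```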
Implemented with continuations
This concludes the motivation behind the library’s design. You can treat the fact
that we use the function arrow (->) as a stack constructor as an
implementation detail. But if you’re interested, let’s now get into how we can
possibly implement P2.
The idea goes back to Olivier Danvy’s Functional
unparsing, where Danvy presents a straightforward implementation of
combinators for printf-like format strings using continuations. The idea was
that just like the original printf, all the arguments to be formatted were just
additional arguments passed to the printf function. This is the technique we
used. With that in mind, P2 can be defined as:
type P2 r1 r2 b = ((String -> r2) -> r1, String -> b)
-- Compare:
-- P2 (a -> r) r b = ((String -> r) -> a -> r, String -> b)
-- P1 a b = (a -> String, String -> b)
With this definition, we naturally have a stack defined in terms of the function arrow.
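To make the continuation trick concrete, here is a tiny Danvy-style unparser (my own illustration, not Pup’s code): each combinator takes a continuation expecting the output accumulated so far, and every value to be printed becomes one extra function argument.

```haskell
-- A combinator that prints a literal string.
lit :: String -> (String -> r) -> String -> r
lit s k acc = k (acc ++ s)

-- A combinator that prints an Int taken as an extra argument.
int :: (String -> r) -> String -> Int -> r
int k acc n = k (acc ++ show n)

-- Composing combinators with (.) stacks their arguments after the accumulator:
greet :: Int -> String
greet = (lit "x = " . int) id ""

-- greet 42 == "x = 42"
```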
Final touches
Now, this isn’t the definition of the real Pup' type. This definition of P2
can’t deal with data types with several constructors. To deal with multiple
constructors, we need a way for parsers to signal failures and for failures to be handled. And,
maybe surprisingly, we need a way for printers to signal failures, and for
those failures to be caught. Even though we typically think of pretty-printers as
total, we are going to build them piecewise, constructor by constructor, and
those partial printers can fail (another way to think about it is that a format
descriptor for a parser which can only parse into some of the constructors of a
type, can only print from those same constructors).
Failures are handled by the (<|>) combinator which we can see in the
S-expression example. These are reminiscent of the Alternative and MonadPlus
classes from the Base library. But because we are dealing with indexed
applicatives and monads, it’s a different combinator, which is provided, in
the Stacked library, by the Additive class:
class Additive a where
empty :: a
(<|>) :: a -> a -> a
The standard way to add failures in Haskell is the Maybe monad. We can, in
fact, use the Maybe monad in the parser to signal failure (in the actual Pup
library, we simply reuse the failure mechanism from the parsing backend). But it
doesn’t work for the pretty-printer, as the Maybe monad would interfere with the
continuation-based stacks (this is explained in more detail in §5.2 of the
paper).
As it turns out, it would seem that the only way to add failures and handlers to the pretty-printer is to add a second continuation. So our type will look like this:
type P3 r1 r2 b = ((String -> r2 -> r2) -> r1 -> r1, String -> Maybe b)
That’s about it. As you can see, the types get pretty hairy. But it’s kept decently under control by defining the right abstractions (of course, indexed applicatives and monads are such abstractions, but we also propose some new ones). In particular, the most central abstraction in Pup’s design is the Stacked type class. All this is explained in our paper (spoiler, because I’m quite happy with this one: it features comonads).
In conclusion
Of course Pup is pretty fresh, and as such there will be some rough edges. But I think it’s already quite useful. I expect it to be straightforward to convert a parser implemented using Megaparsec to Pup. So if you have one of those, you should consider it.
Alternatively, you can check Mathieu’s proof-of-concept library cassette, whose current design is described in our paper as well. Cassette’s format descriptors combine better than Pup’s. But cassette doesn’t support monadic parsing, or using existing parsers (yet?).
But really, the techniques that we’ve developed for Pup, ostensibly for parsing and pretty-printing format descriptors, can presumably be used in many bidirectional programming situations. In fact, I assume that you can often replace profunctors by stacked applicatives. It’s just completely unexplored. I’m curious to hear your thoughts about that.
1. A pretty-printer, by the way, is not the same thing as a formatter. A pretty-printer takes an abstract syntax tree and turns it into something human-readable. A formatter takes text written by a human and normalises it according to some styling rule. This blog post is about pretty-printing, not formatting. For our take on formatting, see Ormolu and Topiary.
2. Here are some Haskell libraries which use this decoupling trick:
   - tomland, a bidirectional TOML parsing library
   - autodocodec, a bidirectional JSON parsing library
   - distributors, a generic bidirectional parsing library
   The same idea can even be found in pure mathematics in the concept of dinatural transformation.
rules_img fixes it.
Prefer watching to reading? The content of this post is also available in video form.
The components
Before we dive in, we need to establish where data lives and where it moves. There are three main players:
- The registry (like Docker Hub or gcr.io): A remote server that stores container images. You download base images from here and push your built images back to it.
- Your local machine: Where you run bazel build or bazel run. This is your laptop or workstation.
- Remote execution and remote cache: A remote caching backend (like Aspect Workflows, BuildBuddy, EngFlow, or Google’s RBE) that runs Bazel actions on remote machines and caches the results. Optional, but common in CI and larger projects.
The core tension is simple: to build a container image that extends a base image, you need information about that base. The question is how much information, and where does it need to be?
The scenario
Here’s what building a container image looks like with rules_oci, the current recommended approach. I’ll show the data flow explicitly:
# Pull a base image
# → Downloads manifest + config + all layer blobs from registry to local machine
pull(
name = "ubuntu",
image = "index.docker.io/library/ubuntu:24.04",
digest = "sha256:1e622c5...",
)
# Build your image
# → Creates a directory containing all blobs (base layers + your layer)
# → With remote execution: all blobs are inputs and outputs of these actions
oci_image(
name = "app_image",
base = "@ubuntu", # References the complete local directory
tars = [":app_layer.tar"],
entrypoint = ["/app/bin/server"],
)
# Push to registry
# → Downloads all image blobs from remote cache to local machine
# → Uploads missing blobs from local machine to registry
oci_push(
name = "push",
image = ":app_image",
repository = "gcr.io/my-project/app",
)
Data flow summary:
- Registry → Local machine: full base image (hundreds of MB)
- Local machine → Remote cache: full base image (anything that’s not already cached)
- Remote cache → Remote Executor (creating an image): full image (hundreds of MB)
- Remote cache → Local machine: full image (hundreds of MB)
- Local machine → Registry: missing layers
Here’s the same thing with rules_img:
# Pull a base image
# → Downloads only manifest + config JSON from registry to local machine (~10 KB)
# → Layer blobs stay in the registry
pull(
name = "ubuntu",
registry = "index.docker.io",
repository = "library/ubuntu",
tag = "24.04",
digest = "sha256:1e622c5...",
)
# Build a layer
# → Writes layer tar + metadata to Bazel's content-addressable storage
# → With remote execution: layer blob stays in remote cache
image_layer(
name = "app_layer",
srcs = {
"/app/bin/server": "//cmd/server", # Bazel-built binary
"/app/config": "//configs:prod",
},
)
# Assemble the image
# → Writes manifest JSON referencing base layers + your layers (by digest)
# → Only metadata is read and written
image_manifest(
name = "app",
base = "@ubuntu", # References only metadata, not blobs
layers = [":app_layer"], # References only metadata, not blobs
entrypoint = ["/app/bin/server"],
)
# Push (at bazel run time, not build time)
# → Checks registry: which blobs are already present?
# → Streams only missing blobs: remote cache → local machine → registry
# → If layers are already in registry: nothing to transfer
image_push(
name = "push_app",
image = ":app",
registry = "ghcr.io",
repository = "my-project/app",
tag = "latest",
)
Data flow summary:
- Registry → Local machine: only manifest + config (~10 KB)
- Local machine → Remote cache: only metadata on base images
- Remote cache → Local machine → Registry: only missing blobs (often just your new layers)
- Base layers (almost) never move through local machine or remote executors
A two‑minute primer on images
An OCI image is a bundle of metadata and bytes. The bytes live in layers, which are compressed tar archives that encode file additions and deletions. The metadata lives in three JSON objects:
- The config: what to run, environment variables, user, working directory, and the list of uncompressed layer digests (also called diff IDs)
- The manifest: pointers to one config and many layer blobs, identified by digest, size, and media type
- The index: for multi‑architecture images, a list of per‑platform manifests
Tags in a registry point at a manifest digest. The digests are content-addressed, so the same bytes always mean the same name everywhere.
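For concreteness, here is roughly what a (heavily trimmed) image manifest looks like; the digests and sizes are made up, but the field names and media types follow the OCI image specification:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:4f53...",
    "size": 2420
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9c1b...",
      "size": 31668923
    }
  ]
}
```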
How builds usually work. docker build executes a Dockerfile inside a base image. Each step like RUN, COPY, or ADD runs against a snapshot of the previous root file system and produces a new layer. The final image is the base’s layers plus the layers created by those steps. This is convenient, but it assumes you have the base image bytes locally while you build.
How Bazel thinks about it. Bazel does not need to run inside the base at all. It builds your program artifacts the same way it always does1, then assembles an image by writing a config and a manifest that reference the base image by digest alongside the new layers you produced. Bazel needs the base’s identity to compose a correct manifest and, later, to upload or load the image. But it doesn’t have to materialize the base layers during the build itself2.
Why this matters for performance. Assembling an image is easy. It’s mostly JSON with a few checksums. The hard part is data locality: getting the right bytes to the right place at the right time. Do the executors have to download layers just to write a small manifest? Does a pusher really need to pull all blobs to a workstation before uploading them again? Does a local daemon have to ingest layers it already owns? rules_img answers those questions by moving metadata first and moving bytes only at the edges.
The status quo: rules_oci
The first major ruleset for building container images in Bazel was rules_docker, which integrated with every language ecosystem: Python, Node.js, Java, Scala, Groovy, C++, Go, Rust, and D. This approach proved extremely hard to maintain. Any change in a language ruleset could ripple into rules_docker. Today it is mostly unmaintained and lacks official bzlmod support.
The current recommendation is rules_oci, which takes the opposite approach: use only off‑the‑shelf tools, maintain a strict complexity budget, and delegate layer creation to language rulesets or end users. This design results in a maintainable project with a narrow scope that’s easy to understand.
Under the hood, rules_oci represents images as complete OCI layouts on disk. When you pull a base image, the repository rule downloads the full image—all blobs, all layers—into a tree artifact. When you build an image with oci_image or oci_image_index, the result is again a directory containing every blob of that image. Layers are always tar files, with no separate metadata to describe them, and the ruleset does not use Bazel providers to pass structured information between targets. This approach is simple and works well for local builds, but as we scaled to Remote Execution, we encountered bottlenecks that this design did not address.
From bottlenecks to breakthroughs: how rules_img works
I started with a simple goal: build container images in Bazel and let Remote Execution carry the weight. I used rules_oci in my experiments, the recommended way of building container images in Bazel today3. I was surprised by the inefficiencies I saw. Repository rules that pulled base images ran again and again in CI, even when nothing had changed4. My laptop shoveled data uphill to the remote cache before any real work could begin. Actions that only wrote a few lines of JSON insisted on dragging entire layer blobs along for the ride. When the build finally finished on RBE, Bazel downloaded every layer into a push tool’s runfiles, only to upload them to a registry a moment later. Loading images into Docker added insult to injury by ignoring layers that were already present. None of that felt like Bazel, so I ran experiments until a pattern emerged.
The breakthrough: treat images as metadata first. The key was to see the whole build as a metadata pipeline and to move bytes only at the edges. Keep base images shallow until you truly need a blob. Assemble manifests from digests and sizes, not gigabytes. Push and load by streaming from content‑addressable storage straight to the destination, and skip anything that already exists there. Once that clicked, the rest of the design fell into place.
Pulling, without the pain. Base pulls were the first time sink. In rules_img, the repository rule fetches only the manifest and config JSON files at build time. Just enough metadata to know what layers exist and their digests. The actual layer blobs are never downloaded during the build5. They wait until the run phase when you bazel run a push or load target. CI becomes predictable, and Remote Execution doesn’t spend its morning downloading CUDA for the fourth time this week. Less data moves during builds, and the cache behaves like a cache.
Stop hauling bytes uphill. The next fix addressed the torrent of developer-to-remote uploads. We generate metadata-only providers wherever possible. The heavy blobs live in content-addressable storage (CAS) and stream later to whoever needs them, whether that’s a registry or a local daemon. Your workstation stops being a relay, cold starts are faster, and incremental builds are a breeze.
Let manifests stay tiny. Manifest assembly had been oddly heavyweight. We reshaped the graph so each layer is built in a single action that computes both the blob and the metadata that describes it6. The layer blob stays in Bazel’s CAS, while only a small JSON descriptor (digest, size, media type) flows through the build graph. Downstream actions consume only this metadata during the build phase, so they schedule quickly, cache well, and avoid pulling gigabytes across executors. The manifests remain correct, and the path to them is light. The actual blob bytes only move later during bazel run when you push or load.
Push without the round trip. Pushing used to mean downloading all layers to a local tool and then sending them back up again. With rules_img, we defer all blob transfers to the run phase (bazel run //:push). The build phase only produces a lightweight push specification: a JSON file listing what needs pushing. When you run the pusher, it first asks the registry what blobs it already has, then streams only the missing ones directly from CAS. In environments where your registry speaks the same CAS protocol, the push is close to zero‑copy. For very large monorepos, you can even emit pushes as a side effect of Build Event Service uploads. The principle is simple. Build time produces metadata, run time moves bytes, and nothing passes through your workstation unnecessarily. See the push strategies documentation for other configurations including direct CAS-to-registry transfers.
Loading should be incremental. docker load treats every import like a blank slate. When containerd is available, rules_img talks to its content store and streams only what is missing7. It can also load a single platform from a multi‑platform image, which keeps feedback loops tight8. If containerd isn’t available, we fall back to docker load and tell you what you’re giving up.
Extra touches that add up. Performance rarely comes from one trick alone. We use hardlink-based deduplication inside layers so identical files don’t bloat your tars. We support eStargz to make layers seekable and quick to start with the stargz snapshotter.
Quick start. If you want to try it, here is a minimal setup:
# MODULE.bazel
bazel_dep(name = "rules_img", version = "<version>")
pull = use_repo_rule("@rules_img//img:pull.bzl", "pull")
# Pulls manifest+config only (no layer blobs yet)
pull(
name = "ubuntu",
registry = "index.docker.io",
repository = "library/ubuntu",
tag = "24.04",
digest = "sha256:1e622c5f073b4f6bfad6632f2616c7f59ef256e96fe78bf6a595d1dc4376ac02",
)
# BUILD.bazel
load("@rules_img//img:layer.bzl", "image_layer")
load("@rules_img//img:image.bzl", "image_manifest")
image_layer(
name = "app_layer",
srcs = {
"/app/bin/server": "//cmd/server",
"/app/config": "//configs:prod",
},
compress = "zstd",
)
image_manifest(
name = "app_image",
base = "@ubuntu",
layers = [":app_layer"],
)
Optional .bazelrc speed dials if you like the metadata‑first defaults:
common --@rules_img//img/settings:compress=zstd
common --@rules_img//img/settings:estargz=enabled
common --@rules_img//img/settings:push_strategy=lazy
# Or: cas_registry / bes (see docs for setup)
Conclusion: container images that feel native to Bazel
The performance gains are real. Pulling large base images on fresh machines takes seconds instead of minutes. Loading into Docker takes milliseconds for incremental updates instead of reloading the full image, which could waste 1–5 minutes in the workflows we examined. Manifest assembly actions run dramatically faster, especially on RBE systems that fetch inputs eagerly9. Building push targets no longer destroys the benefits of Build without the Bytes. Where other rulesets might download gigabytes to your machine, rules_img downloads only a few kilobytes of metadata, saving many gigabytes in transfers and minutes per push. A comprehensive benchmark would warrant its own blog post given the wide matrix of possible configurations, RBE backends, image sizes, and network conditions.
Our aim with rules_img is straightforward: make Bazel feel native for container images, with no unnecessary bytes and no unnecessary waits. By treating images as metadata with on-demand bytes, we get faster CI, quieter laptops, and a build graph that scales without drama. Try it, tell us what flies, and tell us what still hurts. There’s more to tune, and we intend to keep tuning.
Get started with rules_img: github.com/bazel-contrib/rules_img
1. Bazel actions only access explicitly declared inputs within the execroot, whilst docker build mounts all base image layers into the build environment. This means Bazel builds layers independently of base layers, though specific toolchains (like C++) can use workarounds such as --sysroot to simulate filesystem mounting.
2. Building without mounting base layers risks creating binaries incompatible with the target image, as there’s no way to execute code atop existing layers during the build. The container-structure-test tool addresses this by enabling tests on the final image, including running commands in a Docker daemon with assertions.
3. rules_oci is a good ruleset for building container images in Bazel, and I am a happy user most of the time. However, I quickly noticed that it is only optimized for local execution. I want to stress that most of the issues I’m describing here only matter for the remote execution case.
4. Repository rules downloading image blobs are inefficient in stateless CI environments where caches aren’t preserved between runs. Whilst CI can be configured to preserve these directories, rules_img avoids the problem entirely by fetching only metadata.
5. This is configurable. If you really need access to base image layers at build time (for instance, to run a container structure test), you can set the layer_handling attribute accordingly (docs).
6. In rules_oci, manifest assembly receives entire layer tar files as inputs, transferring multi-gigabyte blobs with remote execution. rules_img instead generates small metadata files (containing digests and diff IDs) when writing layers, passing only this fixed-size metadata to downstream actions. Layer contents enter the build graph once, remain in CAS, while tiny metadata flows through subsequent actions.
7. docker load requires a tar file with all layers, even those previously loaded. Whilst hacks exist to skip lower layers by abusing chain IDs, rules_oci doesn’t support this. rules_img interfaces directly with containerd’s content-addressable store to check blob existence and load only missing layers. Docker will soon expose its own content store via the socket (moby/moby#44369), making this approach more widely accessible.
8. Platform filtering is not yet implemented for the containerd backend (rules_img#107). The docker load fallback path does support platform filtering.
9. Some RBE systems can make action inputs available lazily on-demand, while others fetch all declared inputs eagerly before the action runs. For the latter, avoiding multi-gigabyte layer inputs makes a dramatic difference in action scheduling and execution time.
This blog post is a departure from our normal content to remember our friend and colleague Alexander Esgen, who passed away on 1st November at the age of 26.
Alex joined Tweag as an intern in the summer of 2021, working on Ormolu, our Haskell source code formatter. His talent was immediately apparent; he approached problems with mathematical rigour and practical insight, drawing on his background in category theory. He became a full-time engineer that autumn and, more recently, took on the role of team lead for the Peras project.
Alex made significant contributions to the Haskell ecosystem, particularly in WebAssembly support and functional programming tooling. His work was marked by exceptional thoroughness and attention to detail. Beyond his technical excellence, he was a natural mentor with a gift for explaining complex ideas patiently and clearly, always willing to help colleagues no matter how many times they asked.
What we remember most, however, was his character. Alex had a calm, warm presence that made people feel heard and valued. He was genuinely interested in others and their work and had a way of encouraging those around him. Colleagues consistently describe him as humble, kind and deeply passionate about his craft; someone who truly embodied the spirit of collaboration and mutual support that defines our company.
His loss leaves a void that cannot be filled. Our thoughts are with his family, friends and all who knew him.
- Introduction to the dependency graph
- Managing dependency graph in a large codebase
- The anatomy of a dependency graph
In the previous post, we took a closer look at some of the issues working in a large codebase in the context of the dependency graph. In this post, we are about to explore some concepts related to scale and scope of the dependency graph to understand its granularity and what really impacts your builds.
Dependency graph detail
When working on the source code, you likely think of dependencies in the graph as individual modules that import code from each other. When drawing a project architecture diagram, however, the individual files are likely to be grouped in packages to hide the individual files that a package consists of, primarily for brevity. Likewise, in build systems such as Bazel, you would often have one “node” in the dependency graph per directory. Of course, it wouldn’t be totally unreasonable to have a few “nodes” that represent a couple of packages in the same directory on disk. You could, for instance, store performance and chaos tests in the same directory, but have them modeled as individual units since they might have a different set of dependencies.
So while both packageA and packageB in the graph below depend on the package shared (solid lines),
we can see that individually, only testsA.ts depends on service.ts
and only testB.ts depends on cluster.ts (dotted lines).
If operating at the package level, the build metadata stored on disk in files would actually lead to the construction of this dependency graph:
This means that whenever any file in the shared directory would change,
tests within both packages (A and B) would be considered impacted.
A build system that relies on dependency inference (such as Pants) is able to track dependencies across each file individually with the powerful concept of target generators. This means that every file in your project may be an individual node in the dependency graph with all the dependencies mapped out by statically analyzing the source code of the files and augmenting the build metadata manually where the inference falls short. In practice, this means that even though you can organize your code and specify dependencies at the broader package level — mirroring how you are likely to think about the project architecture and deliverable artifacts — Pants still provides the benefit of fine-grained recompilation avoidance. This reduces unnecessary rebuilds and test runs, shortens feedback cycles, and encourages better dependency hygiene — all without forcing you to manage dependencies at a per-file level manually. It might also provide better incentive for engineers to care more about dependencies in files individually which can be harder to achieve if a file is part of a build target with lots of other files with many dependencies.
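As a rough illustration (python_sources and python_tests are real Pants target generators, but the layout and names here are hypothetical), a single BUILD file can cover a whole directory while Pants still generates one fine-grained target per file behind the scenes:

```python
# BUILD file: a minimal Pants sketch
python_sources(
    name="lib",      # expands into one target per .py source file in this directory
)

python_tests(
    name="tests",    # likewise, one target per test file, each with its own inferred deps
)
```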
Intuitively, one may want to go with as fine-grained a dependency graph as possible, hoping to avoid unnecessary build actions. For big repositories, however, the granularity of build targets often doesn’t matter as much as it does for smaller projects. This is because doing distributed builds across multiple machines would immediately require the presence of a shared cache to be able to track results of any previously executed build actions that could be reused (such as compilation or linking). It is not immediately obvious, though, what operation would complete faster — rebuilding an entire directory (using packages as nodes) or querying the cache on the network for each individual file (using files as nodes) with the ambition to invoke only truly required build actions.
Dependency graph scope
With source code modules or packages being nodes, what are other types of dependencies that also contribute to the dependency graph? Almost any project relies on third-party libraries which might be available in version control or are to be downloaded from some kind of binary repository at build time when the dependency graph needs to be constructed. The same applies to any static resources and data your applications might need to be built or to run.
These are called explicit dependencies because users declare them in the build metadata.
In some build systems, such as Bazel, only a subset of dependencies, typically the ones you explicitly declare,
are fully tracked, while others may be left to the surrounding system.
These become implicit dependencies, because the build system relies on their presence without explicitly encoding them.
For example, your application might link to libcurl, which in turn depends on OpenSSL.
In some environments, OpenSSL is provided by the system rather than the build metadata, making it an implicit dependency.
Note that in other systems, like Nix, the complete transitive dependency graph is fully encoded,
so this distinction does not always apply.
Apart from those, there are implicit dependencies which bind your code to build-time dependencies
such as a compiler.
In addition to saying that your source code (say, a C++ application) depends on a compiler,
it also depends on the compiler’s runtime libraries such as libstdc++ or libc++, an assembler,
and a linker.
Depending on the build system used and how your build workflow is configured,
the connection between build targets and the compilers might be recorded.
For example, in Bazel, it is documented that when processing build metadata for C++ sources, a dependency
between each source unit and a C++ compiler is created implicitly.
Any configuration such as options passed to the compiler (copts among other inputs) matters, too,
just as any environment variables you might have set in the build environment.
You wouldn’t probably think of compiler flags as something your application’s logic might depend on,
but some optimizations like inlining can make the performance of your application worse in certain situations.
Take a look at the Compiler Upgrade use case
from the Software Engineering at Google book to appreciate
how critical compilers are in a large codebase context.
Transitively, the compilation would also typically depend on system headers and libraries
such as libc (e.g., glibc), a C library used for system-level functionality
(unless you are able to link against musl libc, which is designed for static linking).
For instance, a binary compiled with glibc 2.17 might not run on a system with glibc 2.12
because the system may lack the required symbols.
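One way to make this dependency visible (an illustrative check, not something the build system does for you) is to list the versioned glibc symbols a binary requires:

```sh
# Show which glibc symbol versions the binary needs (output illustrative)
$ objdump -T ./yourbinary | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -3
GLIBC_2.14
GLIBC_2.16
GLIBC_2.17
```

If the newest version listed is higher than what the target system’s glibc provides, the binary will fail to start there.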
How far can one go? In addition to the build dependencies —
everything that might be needed to build your application — one could argue
that libraries your compiled application needs to run might also be part of the dependency graph
as ultimately this is what’s necessary for this application to be useful to the end user.
For example, you can run ldd <yourbinary> in Linux (or otool -L on a macOS device)
to take a peek at shared object dependencies your binary might have.
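The output (illustrated here with made-up libraries and addresses) is itself a small dependency graph of shared objects:

```sh
$ ldd ./yourbinary
        linux-vdso.so.1 (0x00007ffc...)
        libcurl.so.4 => /lib/x86_64-linux-gnu/libcurl.so.4 (0x00007f32...)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f32...)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f32...)
        /lib64/ld-linux-x86-64.so.2 (0x00007f32...)
```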
Taking it to the extreme, if your application accesses Linux kernel features such as ioctl in unsupported or undocumented ways, the kernel version might matter, too. Driver interfaces can change between kernel versions, breaking user-space tools, so the operating system version along with the underlying kernel version are technically part of the dependency graph, too:
So far, we have looked at the dependency graph in one dimension only. However, it’s not uncommon to have conditional dependencies particularly when doing cross-compilation or producing artifacts for multiple environments. For instance, the backend system of the matplotlib visualization library is chosen based on the platform and available GUI libraries, which affects what transitive dependencies are going to be pulled when being installed. Imagine building your application for various CPU architectures (x86_64 or ARM) or a package for different operating systems (Linux or Windows) and the graph complexity explodes.
A typical approach to “freeze” the build environment and treat it as an immutable set of instructions is to provide a “golden” Docker image where you can find the build dependencies that are known to produce the correct artifacts. They are baked in and then all the build actions are taking place within containers spun from that image. This is a lot better than relying on the host, but this solution has a number of drawbacks as it forces you to treat all your dependencies as a single blob — an image. Making changes and experimenting is not encouraged as, from the dependency graph perspective, every time you do so would rebuild the image, which doesn’t end up being the same bit-for-bit, so you need to “re-build” your whole project as it’s not known what exactly has changed and what part of your project depends on it.
To focus on build-time dependencies only, even documenting (not mentioning properly declaring!)
all the inputs necessary to build your application is not a trivial task.
Using tools such as nix and NixOS to drive the build workflow is appealing,
as it makes it possible to describe practically all inputs to your build,
though, admittedly, this can require a significant investment from your engineering organization (but we can help you with it).
The least one can do is to be aware of the implicit dependencies even if properly describing them in build instructions is not immediately possible. No matter what approach is taken, any implicit relationship between your code and some other dependency that can be expressed in build metadata gets you closer to a fully declared state even if achieving it might be impossible in practice.
In the next post, we’ll explore some graph querying techniques that can help with related test selection, code review strategy, and more.
But that advice didn’t help me, because I wanted to distribute a static library
and the size was causing me problems. Specifically, I had a Rust library1
that I wanted to make available to Go developers. Both Rust and Go can interoperate
with C, so I compiled the Rust code into a C-compatible library and made a little
Go wrapper package for it.
Like most pre-compiled C libraries, I can distribute it either as a static
or a dynamic library. Now Go developers are accustomed to static linking, which
produces self-contained binaries that are refreshingly easy to deploy. Bundling
a pre-compiled static library with our Go package allows Go developers to just
go get https://github.com/nickel-lang/go-nickel and get to work. Dynamic
libraries, on the other hand, require runtime dependencies, linker paths, and installation instructions.
So I really wanted to go the static route, even if it came with a slight size penalty. How large of a penalty are we talking about, anyway?
❯ ls -sh target/release/
132M libnickel_lang.a
15M libnickel_lang.so
😳 Ok, that’s too much. Even if I were morally satisfied with 132MB of library, it’s way beyond GitHub’s 50MB file size limit.2 (Honestly, even the 15M shared library seems large to me; we haven’t put much effort into optimizing code size yet.)
The compilation process in a nutshell
Back in the day, your compiler or assembler would turn each source file into an “object” file containing the compiled code. In order to allow for source files to call functions defined in other source files, each object file could announce the list of functions3 that it defines, and the list of functions that it very much hopes someone else will define. Then you’d run the linker, a program that takes all those object files and mashes them together into a binary, matching up the hoped-for functions with actual function definitions or yelling “undefined symbol” if it can’t. Modern compiled languages tweak this pipeline a little: Rust produces an object file per crate4 instead of one per source file. But the basics haven’t changed much.
A static library is nothing but a bundle of object files, wrapped in an ancient and never-quite-standardized archive format. No linker is involved in the creation of a static library: it will be used eventually to link the static library into a binary. The unfortunate consequence is that a static library contains a lot of information that we don’t want. For a start, it contains all the code of all our dependencies even if much of that code is unused. If you compiled your code with support for link-time optimization (LTO), it contains another copy (in the form of LLVM bitcode — more on that later) of all our code and the code of all our dependencies. And then because it has so much redundant code, it contains a bunch of metadata (section headers) to make it easier for the linker to remove that redundant code later. The underlying reason for all this is that extra fluff in object files isn’t usually considered a problem: it’s removed when linking the final binary (or shared library), and that’s all that most people care about.
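You can see the “bundle of object files” structure directly by listing the archive’s members (the member names below are illustrative):

```sh
$ ar t libnickel_lang.a | head -3
nickel_lang_core-1a2b3c4d.o
serde_json-5e6f7a8b.o
regex-9c0d1e2f.o
```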
Re-linking with ld
I wrote above that a linker takes a bunch of object files and mashes
them together into a binary. Like everything in the previous section,
this was an oversimplification: if you pass the --relocatable flag
to your linker, it will mash your object files together but write out
the result as an object file instead of a binary.
If you also pass the --gc-sections flag, it will remove
unused code while doing so.
This gives us a first strategy for shrinking a static archive:
- unpack the archive, retrieving all the object files
- link them all together into a single large object, removing unused code. In this step we need to tell the linker which code is used, and then it will remove anything that can’t be reached from the used code.
- pack that single object back into a static library
# Unpack the archive
ar x libnickel_lang.a
# Link all the objects together, keeping only the parts reachable from our
# public API (about 50 functions worth)
ld --relocatable --gc-sections -o merged.o *.o -u nickel_context_alloc -u nickel_context_free ...
# Pack it back up
ar rcs libsmaller_nickel_lang.a merged.o
This helps a bit: the archive size went from 132MB to 107MB. But there’s clearly still room for improvement.
Examining our merged object file with the size command, the largest
section by far — weighing in at 84MB — is .llvmbc. Remember I wrote
that we’d come back to the LLVM bitcode? Well, when you compile something with
LLVM (and the Rust compiler uses LLVM), it converts the original source
code into an intermediate representation, then it converts the
intermediate representation into machine code, and then5 it writes both
the intermediate representation and the machine code into an object file.
It keeps the intermediate representation around in case it has useful
information for further optimization during linking time. Even if that
information is useful, it isn’t 84MB useful.6 Out it goes:
objcopy --remove-section .llvmbc merged.o without_llvmbc.o

The next biggest sections contain debug information. Those might be useful, but we’ll remove them for now just to see how small we can get.
strip --strip-unneeded without_llvmbc.o -o stripped.o

At this point there aren’t any giant sections left. But there are more
than 48,000 small sections. It turns out that the Rust compiler puts
every single tiny function into its own little section within the object
file. It does this to help the linker remove unused code: remember the
--gc-sections argument to ld? It removes unused sections, and so if the
sections are small then unused code can be removed precisely. But
we’ve already removed unused code, and each of those 48,000 section
headers is taking up space.
To get rid of those headers, we write a linker script that tells ld to merge sections together.
The meaning of the various sections isn’t important here: the point is that
we’re merging sections with names like .text._ZN11nickel_lang4Expr7to_json17h
and .text._ZN11nickel_lang4Expr7to_yaml17h into a single big .text section.
/* merge.ld */
SECTIONS
{
.text :
{
*(.text .text.*)
}
.rodata :
{
*(.rodata .rodata.*)
}
/* and a couple more */
}

And we use it like this:
ld --relocatable --script merge.ld stripped.o -o without_tiny_sections.o

Let’s take a look back at what we did to our archive, and how much it helped:
| Step | Size |
|---|---|
| original | 132MB |
| linked with --gc-sections | 107MB |
| removed .llvmbc | 33MB |
| stripped | 25MB |
| merged sections | 19MB |
It’s probably possible to continue, but this is already a big improvement. We got rid of more than 85% of our original size!
We did lose something in the last two steps, though. Stripping the debug information might make backtraces less useful, and merging the sections removes the ability for future linking steps to remove unused code from the final binaries. In our case, our library has a relatively small and coarse API; I checked that as soon as you use any non-trivial function, less than 150KB of dead code remains. But you’ll need to decide for yourself whether these costs are worth the size reduction.
More portability with LLVM bitcode
I was reasonably pleased with the outcome of the previous section until I tried to port
it to MacOS, because it turns out that the MacOS linker doesn’t support
--gc-sections (it has a -dead_strip option, but it’s incompatible with --relocatable
because apparently no one cares about code size unless they’re building a binary).
After drafting this post but before publishing it, I found
this
nice post on shrinking MacOS static libraries using the toolchain from XCode.
I’m no MacOS expert so I’m probably using it wrong, but I only got down to
about 25MB (after stripping) using those tools. (If you know how to do better, let me know!)
But there is another way! Remember that we had two copies of all our code: the LLVM intermediate representation and the machine code.7 Last time, we chucked out the intermediate representation and used the machine code. But since I don’t know how to massage the machine code on MacOS, we can work with the intermediate representation instead.
The first step is to extract the LLVM bitcode and throw out the rest.
(The section name on MacOS is __LLVM,__bitcode instead of .llvmbc like it was on Linux.)
for obj_file in ./*.o; do
llvm-objcopy --dump-section=__LLVM,__bitcode="$obj_file.bc" "$obj_file"
done

Then we combine all the little bitcode files into one gigantic one:
llvm-link -o merged.bc ./*.bc

And we remove the unused code by telling LLVM which functions make up the public API. We ask it to “internalize” every function that isn’t in the list, and to remove code that isn’t reachable from a public function (the “dce” in “globaldce” stands for “dead-code elimination”).
opt \
--internalize-public-api-list=nickel_context_alloc,... \
--passes='internalize,globaldce' \
-o small.bc \
merged.bc

Finally, we recompile the result back into an object file and pop
it into a static library. llc turns the LLVM bitcode back into
machine code, so the resulting object file can be consumed by
non-LLVM toolchains.
llc --filetype=obj --relocation-model=pic small.bc -o small.o
ar rcs libsmaller_nickel_lang.a small.o

The result is a 19MB static library, pretty much the same as the other workflow.
Note that we don’t need the section-merging step here, because we
didn’t ask llc to generate a section per function.
Dragonfire
Soon after drafting this post, I learned about dragonfire, a recently-released and awesomely-named tool for shrinking collections of static libs by pulling out and deduplicating object files. I don’t think this post’s techniques can be combined with theirs for extra savings, because you can’t both deduplicate and merge object files (I guess in principle you could deduplicate some and merge others, if you have very specific needs.) But it’s a great read, and I was gratified to discover that someone else shared my giant-Rust-static-library concerns.
Conclusion
We saw two ways to significantly reduce the size of a static library, one using
classic tools like ld and objcopy and another using LLVM-specific tools.
They both produced similar-sized outputs, but as with everything in life
there are some tradeoffs. The “classic” bintools approach works with both GNU bintools
and LLVM bintools, and it’s significantly faster — a few seconds, compared
to a minute or so — than the LLVM tools,
which need to recompile everything from the intermediate representation to
machine code. Moreover, the bintools approach should work with any static library,
not just one compiled with an LLVM-based toolchain.
On the other hand, the LLVM approach works on MacOS (and Linux, Windows, and probably others). For this reason alone, this is the way we’ll be building our static libraries for Nickel.
- Namely, the library API for Nickel, which is going to have a stable embedding API real soon now, including bindings for C and Go!↩
- Go expects packages with pre-compiled dependencies to check the compiled code directly into a git repository.↩
- technically “symbols”, not “functions”. But for this abbreviated discussion, the distinction doesn’t matter.↩
- Or not. To improve parallelization, Rust sometimes generates multiple object files per crate.↩
- if you’ve turned on link-time optimization↩
- Linux distributions that use LTO seem to agree that this intermediate representation should be stripped before distributing the library.↩
- We have the LLVM intermediate representation because we build with LTO.
If you aren’t using LTO then there are probably other ways to get it, like with
Rust’s --emit=llvm-ir flag.↩
This post is intended for experienced Bazel engineers or those tasked with modernizing the build metadata of their codebases. The following discussion assumes a solid working knowledge of Bazel’s macro system and build file conventions. If you are looking to migrate legacy macros or deepen your understanding of symbolic macros, you’ll find practical guidance and nuanced pitfalls addressed here.
What are symbolic macros?
Macros instantiate rules by acting as templates that generate targets.
As such, they are expanded in the loading phase,
when Bazel definitions and BUILD files are loaded and evaluated.
This is in contrast with build rules that are run later in the analysis phase.
In older Bazel versions, macros were defined exclusively as Starlark functions
(the form that is now called “legacy macros”).
Symbolic macros are an improvement on that idea;
they allow defining a set of attributes similar to those of build rules.
In a BUILD file, you invoke a symbolic macro by supplying attribute values as arguments.
Because Bazel is explicitly aware of symbolic macros and their function in the build process,
they can be considered “first-class macros”.
See the Symbolic macros design document
to learn more about the rationale.
Symbolic macros also intend to support lazy evaluation,
a feature that is currently being considered for a future Bazel release.
When that functionality is implemented,
Bazel would defer evaluating a macro until
the targets defined by that macro are actually requested.
Conventions and restrictions
There is already good documentation that explains how to write symbolic macros. In this section, we are going to take a look at some practical examples of the restrictions that apply to their implementation, which you can learn more about in the Restrictions docs page.
Naming
Any targets created by a symbolic macro must either match the macro’s name parameter exactly
or begin with that name followed by a _ (preferred), ., or -.
This is different from legacy macros which don’t have naming constraints.
This symbolic macro
# defs.bzl
def _simple_macro_impl(name):
native.genrule(
name = "genrule" + name,
outs = [name + "_out.data"],
srcs = ["//:file.json"],
)
simple_macro = macro(implementation = _simple_macro_impl)

# BUILD.bazel
simple_macro(name = "tool")

would fail when evaluated:
$ bazel cquery //...
ERROR: in genrule rule //src:genruletool: Target //src:genruletool declared in symbolic macro 'tool'
violates macro naming rules and cannot be built.

This means simple_macro(name = "tool") may only produce files or targets named tool or starting with tool_,
tool., or tool-.
In this particular macro, tool_genrule would work.
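A minimal fix, sketched under the same setup as above, is to put the macro’s name first when building the target name:

```python
# defs.bzl
def _simple_macro_impl(name):
    native.genrule(
        name = name + "_genrule",    # "tool_genrule" satisfies the naming rules
        outs = [name + "_out.data"],
        srcs = ["//:file.json"],
    )
```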
Access to undeclared resources
Symbolic macros must follow Bazel’s standard visibility rules:
they cannot directly access source files unless those files are passed in as arguments
or are made public by their parent package.
This is different from legacy macros,
whose implementations were effectively inlined into the BUILD file where they were called.
Attributes
Positional arguments
In legacy macro invocations, you could have passed the attribute values as positional arguments. For instance, these are perfectly valid legacy macro calls:
# defs.bzl
def special_test_legacy(name, tag = "", **kwargs):
kwargs["name"] = name
kwargs["tags"] = [tag] if tag else []
cc_test(**kwargs)
# BUILD.bazel
special_test_legacy("no-tag")
special_test_legacy("with-tag", "manual")

With the macro’s name and tags collected as expected:
$ bazel cquery //test/package:no-tag --output=build
cc_test(
name = "no-tag",
tags = [],
...
)
$ bazel cquery //test/package:with-tag --output=build
cc_test(
name = "with-tag",
tags = ["manual"],
...
)

You can control how arguments are passed to functions by using an asterisk (*)
in the parameter list of a legacy macro, as per the Starlark language specs.
If you are a seasoned Python developer (Starlark’s syntax is heavily inspired by Python), you might have already guessed
that this asterisk separates positional arguments from keyword-only arguments:
# defs.bzl
def special_test_legacy(name, *, tag = "", **kwargs):
kwargs["name"] = name
kwargs["tags"] = [tag] if tag else []
cc_test(**kwargs)
# BUILD.bazel
special_test_legacy("no-tag") # okay
special_test_legacy("with-tag", tag="manual") # okay
# Error: special_test_legacy() accepts no more than 1 positional argument but got 2
special_test_legacy("with-tag", "manual")

Positional arguments are not supported in symbolic macros:
attributes must either be declared in the attrs dictionary
(which automatically makes them keyword arguments)
or be inherited, in which case they must also be provided by name.
Arguably, avoiding positional arguments in macros altogether is helpful
because it eliminates subtle bugs caused by incorrect order of parameters passed
and makes them easier to read and easier to process by tooling such as buildozer.
Default values
Legacy macros accepted default values for their parameters which made it possible to skip passing certain arguments:
# defs.bzl
def special_test_legacy(name, *, purpose = "dev", **kwargs):
kwargs["name"] = name
kwargs["tags"] = [purpose]
cc_test(**kwargs)
# BUILD.bazel
special_test_legacy("dev-test")
special_test_legacy("prod-test", purpose="prod")

With symbolic macros, the default values are declared in the attrs dictionary instead:
# defs.bzl
def _special_test_impl(name, purpose = "dev", **kwargs):
kwargs["tags"] = [purpose]
cc_test(
name = name,
**kwargs
)
special_test = macro(
inherit_attrs = native.cc_test,
attrs = {
"purpose": attr.string(configurable = False, default = "staging"),
"copts": None,
},
implementation = _special_test_impl,
)
# BUILD.bazel
special_test(
name = "my-special-test-prod",
srcs = ["test.cc"],
purpose = "prod",
)
special_test(
name = "my-special-test-dev",
srcs = ["test.cc"],
)

Let’s see what kind of tags are going to be set for these cc_test targets:
$ bazel cquery //test/package:my-special-test-prod --output=build
cc_test(
name = "my-special-test-prod",
tags = ["prod"],
...
)
$ bazel cquery //test/package:my-special-test-dev --output=build
cc_test(
name = "my-special-test-dev",
tags = ["staging"],
...
)

Notice how the default dev value declared in the macro implementation was never used.
This is because the default values defined for parameters in the macro’s function are going to be ignored,
so it’s best to remove them to avoid any confusion.
Also, all the inherited attributes have a default value of None,
so make sure to refactor your macro logic accordingly.
Be careful when processing the keyword arguments to avoid
subtle bugs such as checking whether a user has passed [] in a keyword argument
merely by doing if not kwargs["attr-name"]
as None would also be evaluated to False in this context.
This might be potentially confusing as the default value for many common attributes is not None.
Take a look at the target_compatible_with attribute
which normally has the default value [] when used in a rule,
but when used in a macro, would still by default be set to None.
Using bazel cquery //:target --output=build
with some print calls in your .bzl files can help when refactoring.
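For example, a defensive pattern for inherited attributes (the attribute used here is just for illustration) is to compare against None explicitly rather than relying on truthiness:

```python
# defs.bzl
def _special_test_impl(name, **kwargs):
    # Inherited attributes default to None when the caller omits them,
    # so distinguish "not passed" from "explicitly passed an empty list".
    if kwargs.get("target_compatible_with") == None:
        kwargs["target_compatible_with"] = []  # fall back to the rule's usual default
    cc_test(
        name = name,
        **kwargs
    )
```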
Inheritance
Macros are frequently designed to wrap a rule (or another macro), and the macro’s author typically aims to pass
most of the wrapped symbol’s attributes using **kwargs directly to the macro’s primary target
or the main inner macro without modification.
To enable this behavior, a macro can inherit attributes from a rule or another macro by providing the rule
or macro symbol to the inherit_attrs parameter of macro().
Note that when inherit_attrs is set, the implementation function must have a **kwargs parameter.
This makes it possible to avoid listing every attribute that the macro may accept,
and it is also possible to disable certain attributes that you don’t want macro callers to provide.
For instance, let’s say you don’t want copts to be defined in macros that wrap cc_test
because you want to manage them internally within the macro body instead:
# BUILD.bazel
special_test(
name = "my-special-test",
srcs = ["test.cc"],
copts = ["-std=c++22"],
)

This can be done by setting the attributes you don’t want to inherit to None.
# defs.bzl
special_test = macro(
inherit_attrs = native.cc_test,
attrs = { "copts": None },
implementation = _special_test_impl,
)

Now the macro caller will see that copts is not possible to declare when calling the macro:
$ bazel query //test/package:my-special-test
File "defs.bzl", line 19, column 1, in special_test
special_test = macro(
Error: no such attribute 'copts' in 'special_test' macro

Keep in mind that all inherited attributes are going to be included in the kwargs parameter
with the default value of None unless specified otherwise.
This means you have to be extra careful in the macro implementation function if you refactor a legacy macro:
you can no longer merely check for the presence of a key in the kwargs dictionary.
Mutation
In symbolic macros, you will not be able to mutate the arguments passed to the macro implementation function.
# defs.bzl
def _simple_macro_impl(name, visibility, env):
print(type(env), env)
env["some"] = "more"
simple_macro = macro(
attrs = {
"env": attr.string_dict(configurable = False)
},
implementation = _simple_macro_impl
)
# BUILD.bazel
simple_macro(name = "tool", env = {"state": "active"})

Let’s check how this would get evaluated:
$ bazel cquery //...
DEBUG: defs.bzl:36:10: dict {"state": "active"}
File "defs.bzl", line 37, column 17, in _simple_macro_impl
env["some"] = "more"
Error: trying to mutate a frozen dict value

This, however, is no different to legacy macros where you could not modify mutable objects in place either.
In situations like this, creating a new dict with env = dict(env) would be of help.
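A minimal sketch of that fix, applied to the earlier example:

```python
# defs.bzl
def _simple_macro_impl(name, visibility, env):
    env = dict(env)        # copy the frozen dict; the copy is mutable here
    env["some"] = "more"
    print(type(env), env)  # dict {"state": "active", "some": "more"}
```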
In legacy macros you can still modify objects in place when they are inside the kwargs,
but this arguably leads to code that is harder to reason about
and invites subtle bugs that are a nightmare to troubleshoot in a large codebase.
See the Mutability in Starlark section to learn more.
This is still possible in legacy macros:
# defs.bzl
def special_test_legacy(name, **kwargs):
kwargs["name"] = name
kwargs["env"]["some"] = "more"
cc_test(**kwargs)
# BUILD.bazel
special_test_legacy("small-test", env = {"state": "active"})

Let’s see how the updated environment variables were set for the cc_test target created in the legacy macro:
$ bazel cquery //test/package:small-test --output=build
...
cc_test(
name = "small-test",
...
env = {"state": "active", "some": "more"},
)

This is no longer allowed in symbolic macros:
# defs.bzl
def _simple_macro_impl(name, visibility, **kwargs):
print(type(kwargs["env"]), kwargs["env"])
kwargs["env"]["some"] = "more"

It would fail to evaluate:
$ bazel cquery //...
DEBUG: defs.bzl:35:10: dict {"state": "active"}
File "defs.bzl", line 36, column 27, in _simple_macro_impl
kwargs["env"]["some"] = "more"
Error: trying to mutate a frozen dict value

Configuration
Symbolic macros, just like legacy macros, support configurable attributes,
commonly known as select(), a Bazel feature that lets users determine the values of build rule (or macro)
attributes at the command line.
Here’s an example symbolic macro with the select toggle:
# defs.bzl
def _special_test_impl(name, **kwargs):
cc_test(
name = name,
**kwargs
)
special_test = macro(
inherit_attrs = native.cc_test,
attrs = {},
implementation = _special_test_impl,
)
# BUILD.bazel
config_setting(
name = "linking-static",
define_values = {"static-testing": "true"},
)
config_setting(
name = "linking-dynamic",
define_values = {"static-testing": "false"},
)
special_test(
name = "my-special-test",
srcs = ["test.cc"],
linkstatic = select({
":linking-static": True,
":linking-dynamic": False,
"//conditions:default": False,
}),
)

Let’s see how this expands in the BUILD file:
$ bazel query //test/package:my-special-test --output=build
cc_test(
name = "my-special-test",
...(omitted for brevity)...
linkstatic = select({
"//test/package:linking-static": True,
"//test/package:linking-dynamic": False,
"//conditions:default": False
}),
)

The query command does show that the macro was expanded into a cc_test target,
but it does not show what the select() is resolved to.
For this, we would need to use the cquery (configurable query)
which is a variant of query that runs after select()s have been evaluated.
$ bazel cquery //test/package:my-special-test --output=build
cc_test(
name = "my-special-test",
...(omitted for brevity)...
linkstatic = False,
)

Let’s configure the test to be statically linked:
$ bazel cquery //test/package:my-special-test --output=build --define="static-testing=true"
cc_test(
name = "my-special-test",
...(omitted for brevity)...
linkstatic = True,
)

)Each attribute in the macro function explicitly declares whether it tolerates select() values,
in other words, whether it is configurable.
For common attributes, consult the Typical attributes defined by most build rules
to see which attributes can be configured.
Most attributes are configurable, meaning that their values may change
when the target is built in different ways;
however, there are a handful which are not.
For example, you cannot assign a *_test target to be flaky using a select()
(e.g., to mark a test as flaky only on aarch64 devices).
Unless specifically declared, all attributes in symbolic macros are configurable (if they support this)
which means they will be wrapped in a select() (that simply maps //conditions:default to the single value),
and you might need to adjust the code of the legacy macro you migrate.
For instance, this legacy code used to append some dependencies with the .append() list method,
but this might break:
# defs.bzl
def _simple_macro_impl(name, visibility, **kwargs):
print(kwargs["deps"])
kwargs["deps"].append("//:commons")
cc_test(**kwargs)
simple_macro = macro(
attrs = {
"deps": attr.label_list(),
},
implementation = _simple_macro_impl,
)
# BUILD.bazel
simple_macro(name = "simple-test", deps = ["//:helpers"])

Let’s evaluate the macro:
$ bazel cquery //...
DEBUG: defs.bzl:35:10: select({"//conditions:default": [Label("//:helpers")]})
File "defs.bzl", line 36, column 19, in _simple_macro_impl
kwargs["deps"].append("//:commons")
Error: 'select' value has no field or method 'append'

Keep in mind that select is an opaque object with limited interactivity.
It does, however, support modification in place, so that you can extend it,
e.g., with kwargs["deps"] += ["//:commons"]:
$ bazel cquery //test/package:simple-test --output=build
...
cc_test(
name = "simple-test",
generator_name = "simple-test",
...
deps = ["//:commons", "//:helpers", "@rules_cc//:link_extra_lib"],
)

Be extra vigilant when dealing with attributes of bool type that are configurable
because the return type of select converts silently in truthy contexts to True.
This can lead to some code being legitimate, but not doing what you intended.
See Why does select() always return true? to learn more.
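Here’s a minimal sketch of the pitfall (the attribute and the branch are illustrative): because every configurable value arrives wrapped in a select(), a plain truthiness check no longer reflects what the caller actually passed.

```python
# defs.bzl
def _special_test_impl(name, **kwargs):
    if kwargs["linkstatic"]:
        # This branch is taken even when the caller wrote linkstatic = False,
        # because the value is a select() object, which is always truthy.
        print("assuming static linking")
    cc_test(
        name = name,
        **kwargs
    )
```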
When refactoring, you might need to make an attribute configurable, however, it may stop working using the existing macro implementation. For example, imagine you need to pass different files as input to your macro depending on the configuration specified at runtime:
# defs.bzl
def _deployment_impl(name, visibility, filepath):
print(filepath)
# implementation
deployment = macro(
attrs = {
"filepath": attr.string(),
},
implementation = _deployment_impl,
)
# BUILD.bazel
deployment(
name = "deploy",
filepath = select({
"//conditions:default": "deploy/config/dev.ini",
"//:production": "deploy/config/production.ini",
}),
)

In rules, select() objects are resolved to their actual values,
but in macros, select() creates a special object of type select
that isn’t evaluated until the analysis phase,
which is why you won’t be able to get actual values out of it.
$ bazel cquery //:deploy
...
select({
Label("//conditions:default"): "deploy/config/dev.ini",
Label("//:production"): "deploy/config/production.ini"
})
...

In some cases, such as when you need to have the selected value available in the macro function,
you can have the select object resolved before it’s passed to the macro.
This can be done with the help of an alias target, and the label of a target can be turned into a filepath
using the special location variable:
# defs.bzl
def _deployment_impl(name, visibility, filepath):
print(type(filepath), filepath)
native.genrule(
name = name + "_gen",
srcs = [filepath],
outs = ["config.out"],
cmd = "echo '$(location {})' > $@".format(filepath)
)
deployment = macro(
attrs = {
"filepath": attr.label(configurable = False),
},
implementation = _deployment_impl,
)
# BUILD.bazel
alias(
name = "configpath",
actual = select({
"//conditions:default": "deploy/config/dev.ini",
"//:production": "deploy/config/production.ini",
}),
visibility = ["//visibility:public"],
)
deployment(
name = "deploy",
filepath = ":configpath",
)

You can confirm the right file is chosen when passing different configuration flags before building the target:
$ bazel cquery //tests:configpath --output=build
INFO: Analyzed target //tests:configpath (0 packages loaded, 1 target configured).
...
alias(
name = "configpath",
visibility = ["//visibility:public"],
actual = "//tests:deploy/config/dev.ini",
)
...
$ bazel build //tests:deploy_gen && cat bazel-bin/tests/config.out
...
DEBUG: defs.bzl:29:10: Label //tests:configpath
...
tests/deploy/config/dev.ini

Querying macros
Since macros are evaluated when BUILD files are queried,
you cannot use Bazel itself to query “raw” BUILD files.
Identifying definitions of legacy macros is quite difficult,
as they resemble Starlark functions, but instantiate targets.
Using bazel cquery with the --output=starlark
might help printing the properties of targets to see
if they have been instantiated from macros.
When using --output=build, you can also inspect some of the properties:
- generator_name (the name attribute of the macro)
- generator_function (which function generated the rules)
- generator_location (where the macro was invoked)
This information with some heuristics might help you to identify the macros.
Once you have identified the macro name,
you can run bazel query --output=build 'attr(generator_function, simple_macro, //...)'
to find all targets that are generated by a particular macro.
Finding symbolic macros, in contrast, is trivial
as you would simply need to grep for macro() function calls in .bzl files.
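For instance, something along these lines (adjust the pattern to your codebase) lists the symbolic macro definitions in a workspace:

```sh
# Find symbolic macro definitions in Starlark files
grep -rn --include='*.bzl' ' = macro(' .
```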
To query unprocessed BUILD files, you might want to use buildozer
which is a tool that lets you query the contents of BUILD files using a static parser.
The tool will come in handy for various use cases when refactoring, such as migrating the macros.
Because both legacy and symbolic macros follow the same BUILD file syntax,
buildozer can be used to query build metadata for either type.
Let’s write some queries for these macro invocations:
# BUILD.bazel
perftest(
name = "apis",
srcs = ["//:srcA", "//:srcB"],
env = {"type": "performance"},
)
perftest(
name = "backend",
srcs = ["//:srcC", "//:srcD"],
env = {"type": "performance"},
)

Print all macro invocations (raw) across the whole workspace:
$ buildozer 'print rule' "//...:%perftest"
perftest(
name = "apis",
srcs = [
"//:srcA",
"//:srcB",
],
env = {"type": "performance"},
)
perftest(
name = "backend",
srcs = [
"//:srcC",
"//:srcD",
],
env = {"type": "performance"},
)

Print an attribute’s values for all macro invocations:
$ buildozer 'print label srcs' "//...:%perftest"
//test/package:apis [//:srcA //:srcB]
//test/package:backend [//:srcC //:srcD]

Print the paths of the files where the macros are invoked:
$ buildozer 'print path' "//...:%perftest" | xargs realpath --relative-to "$PWD" | sort | uniq
test/package/BUILD.bazel

The path can be combined with an attribute, e.g., printing the path and the srcs together to make reviewing easier:
$ buildozer 'print path srcs' "//...:%perftest"
/home/user/code/project/test/package/BUILD.bazel [//:srcA //:srcB]
/home/user/code/project/test/package/BUILD.bazel [//:srcC //:srcD]

Remove an attribute from a macro invocation (e.g., env will be set up in the macro implementation function):
$ buildozer 'remove env' "//...:%perftest"
fixed /home/user/code/project/test/package/BUILD.bazel

You might also want to check that no macro invocation passes an attribute that is not supposed to be passed.
In the command output, (missing) means the attribute doesn’t exist;
these lines can of course be ignored with grep -v missing:
$ buildozer -quiet 'print path env' "//...:%perftest" 2>/dev/null
/home/user/code/project/test/package/BUILD.bazel {"type": "performance"}
/home/user/code/project/test/package/BUILD.bazel (missing)

We hope that these practical suggestions and examples will assist you in your efforts to modernize the use of macros throughout your codebase. Remember that you can compose legacy and symbolic macros, which may be useful during the transition. Also, legacy macros can still be used and will remain supported in Bazel for the foreseeable future. Some organizations may even choose not to migrate at all, particularly if they rely heavily on the current behavior of legacy macros.
From my experience, one of the most effective methods to achieve this is with Continuous Performance Testing (CPT). In this post, I want to explain how CPT is effective in catching performance-related issues during development. CPT is a performance testing strategy, so you might benefit from a basic understanding of the latter. A look at my previous blog post will be helpful!
What is Continuous Performance Testing?
Continuous Performance Testing (CPT) is an automated and systematic approach to performance testing, leveraging various tools to continuously conduct tests throughout the development lifecycle. Its primary goal is to gather insightful data, providing real-time feedback on how code changes impact system performance and ensuring the system is performing adequately before proceeding further.
As shown in the example below, CPT is integrated directly into the Continuous Integration and Continuous Deployment (CI/CD) pipeline. This integration allows performance testing to act as a crucial gatekeeper, enabling quick and accurate assessments to ensure that software meets required performance benchmarks before moving to subsequent stages.
A key benefit of this approach is its alignment with shift-left testing, which emphasizes bringing performance testing earlier into the development lifecycle. By identifying and addressing performance issues much sooner, teams can avoid costly late-stage fixes, improve software quality, and accelerate the overall development process, ultimately ensuring that performance standards and Service Level Agreements (SLAs) are consistently met.
To which types of performance testing can CPT be applied?
Continuous performance testing can be applied to all types of performance testing; however, each type has different challenges.
Automated performance testing is:
- Easily applied to load testing
- Hard to apply to stress and spike tests, but still has benefits
- Very hard to apply to soak-endurance tests
For more details about why the latter two performance testing types are difficult to implement in CI/CD, see the previous blog post.
Why prefer automated load testing?
The load test is designed with the primary objective of assessing how well the system performs under a specific and defined load. This type of testing is crucial for evaluating the system’s behavior and ensuring it can handle expected levels of user activity or data processing. The success of a load test is determined by its adherence to predefined metrics, which serve as benchmarks against which the system’s performance is measured. These metrics might include factors such as response times, throughput, and resource utilization. Given this focus on quantifiable outcomes, load testing is the most appropriate and best-suited type of performance testing for Continuous Performance Testing (CPT), and the easiest to automate.
How to apply continuous load testing
Strategy
Performance testing can be conducted at every level, starting with unit testing. It should be tailored to evaluate the specific performance requirements of each development stage, ensuring the system meets its performance goals and user expectations.
Load testing can be performed at any level—unit, integration, system, or acceptance. In Continuous Performance Testing (CPT), performance testing should start as early as possible in the development process to provide timely feedback, especially at the integration level. Early testing helps identify bottlenecks and optimize the application before progression. When CPT is applied at the system level, it offers insights into the overall performance of the entire system and how components interact, helping ensure the system meets its performance goals.
In my opinion, to maximize CPT benefits, it’s best to apply automated load testing at both integration and system level. This ensures realistic load conditions, highlights performance issues early, and helps optimize performance throughout development for a robust, efficient application.
Evaluation with static thresholds
Continuous Performance Testing (CPT) is fundamentally centered around fully automated testing processes, meaning that the results obtained from performance testing must also be evaluated automatically to ensure efficiency and accuracy. This automatic evaluation can be achieved in different ways. Establishing static metrics that serve as benchmarks against which the current results can be measured is one of them. By setting and comparing against these predefined metrics, we can effectively assess whether the application meets the required performance standards.
The code snippet below shows how we can set threshold values for various metrics with K6, an open-source performance testing tool built in Go. It lets us write performance testing scripts in JavaScript, and it has an embedded threshold feature that we can use to evaluate the performance test results. For more information about setting thresholds, please see the documentation of K6 thresholds.
import { check, sleep } from "k6"
import http from "k6/http"
export let options = {
vus: 250, // number of virtual users
duration: "30s", // duration of the test
thresholds: {
http_req_duration: [
"avg<2", // average response time must be below 2ms
"p(90)<3", // 90% of requests must complete below 3ms
"p(95)<4", // 95% of requests must complete below 4ms
"max<5", // max response time must be below 5ms
],
http_req_failed: [
"rate<0.01", // http request failures should be less than 1%
],
checks: [
"rate>0.99", // 99% of checks should pass
],
},
}

With the example above, K6 tests the service for 30 seconds with 250 virtual users and compares the results to the metrics defined in the thresholds section. Let’s look at the results of this test:
running (0m30.0s), 250/250 VUs, 7250 complete and 0 interrupted iterations
default [ 100% ] 250 VUs 30.0s/30s
✓ is status 201
✓ is registered
✓ checks.........................: 100.00% 15000 out of 15000
✗ http_req_duration..............: avg=2.45ms min=166.47µs med=1.04ms max=44.52ms p(90)=3.68ms p(95)=7.71ms
{ expected_response:true }...: avg=2.45ms min=166.47µs med=1.04ms max=44.52ms p(90)=3.68ms p(95)=7.71ms
✓ http_req_failed................: 0.00% 0 out of 7500
iterations.....................: 7500 248.679794/s
vus_max........................: 250 min=250 max=250
running (0m30.2s), 000/250 VUs, 7500 complete and 0 interrupted iterations
default ✓ [ 100% ] 250 VUs 30s
time="2025-03-12T12:09:54Z" level=error msg="thresholds on metrics 'http_req_duration' have been crossed"
Error: Process completed with exit code 99.

Although the checks and the http_req_failed rate thresholds are satisfied, this test failed because all the calculated http_req_duration metrics are greater than the thresholds defined above.
Evaluation by comparing to historical data
Another method of evaluation involves comparing the current results with historical data within a defined confidence level. This statistical approach allows us to understand trends over time and determine if the application’s performance is improving, declining, or remaining stable.
In many cases, performance metrics such as response times or throughput can be assumed to follow a normal distribution, especially when you have a large enough sample size. The normal distribution, often referred to as the bell curve, is a probability distribution that is symmetric about the mean. You can read more about it on Wikipedia.
Here’s how the statistical analysis works: from your historical data, calculate the mean (or average, μ) and standard deviation (SD, σ) of the performance metrics. These values will serve as the basis for hypothesis testing. Then, determine the performance metric from the current test run that you want to compare against the historical data. This could be the mean response time, p(90), error rate, etc.
Define test hypotheses
Concretely, let’s first create a hypothesis to test the current result against the historical data.
- Null Hypothesis (H0): The current performance metric is equal to the historical mean (no significant difference). H0: μcurrent = μhistorical
- Alternative Hypothesis (H1): The current performance metric is not equal to the historical mean (there is a significant difference). H1: μcurrent ≠ μhistorical
Define a comparison metric and acceptance criterion
To compare the current result to the historical mean, we calculate the Z-score, which tells you how many standard deviations the current mean is from the historical mean. The formula for the Z-score is:

Z = (μcurrent - μhistorical) / σhistorical

Where:
- μcurrent is the current mean.
- μhistorical is the historical mean.
- σhistorical is the standard deviation of the historical data.
Finally, we need to determine the critical value of the Z-score: for a 95% confidence level, you can extract it from the standard normal distribution table. For a two-tailed test, the critical values are approximately ±1.96. For the full standard normal distribution table, see, for example, this website.
The confidence level means that the calculated difference between current and historical performance would fall within the chosen range around the historical mean in 95% of the cases. I believe the 95% confidence level provides good enough coverage for most purposes, but depending on the criticality of the product or service, you can increase or decrease it.
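If you’d rather compute the critical value than look it up, SciPy can do it; a quick sketch:

```python
from scipy import stats

# Two-tailed critical value for a 95% confidence level
critical_value = stats.norm.ppf((1 + 0.95) / 2)
print(round(critical_value, 2))  # 1.96
```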
Make a decision
If the calculated Z-score falls outside the range of -1.96 to +1.96, you reject the null hypothesis (H0) and conclude that there is a statistically significant difference between the current performance metric and the historical mean. If the Z-score falls within this range, you fail to reject the null hypothesis, indicating no significant difference.
Based on these findings, you can interpret whether the application’s performance has improved, declined, or remained stable compared to historical data. This statistical analysis provides a robust framework for understanding performance trends over time and making data-driven decisions for further optimizations.
Implementation
In the above section, I tried to provide a clear explanation of how we can effectively evaluate the results of performance testing using historical data.
It is important to note that we do not need to engage in complex manual statistical analyses to check the validity of these results.
Instead, we should focus on scripting a comprehensive process that allows us to test the hypothesis for the Z-score within the 95% confidence level.
This approach will streamline our evaluation and ensure that we rely on a straightforward method to assess the performance outcomes in the CI/CD pipeline.
import numpy as np
from scipy import stats
def hypothesis_test(historical_data, current_data, confidence_level=0.95):
# Calculate historical mean and standard deviation
historical_mean = np.mean(historical_data)
historical_std = np.std(historical_data, ddof=1)
# Calculate the current mean
current_mean = np.mean(current_data)
# Number of observations in the current dataset
n_current = len(current_data)
# Calculate Z-score
z_score = (current_mean - historical_mean) / historical_std
# Determine the critical Z-values for the two-tailed test
critical_value = stats.norm.ppf((1 + confidence_level) / 2)
# Print results
print(f"Historical Mean: {historical_mean:.2f}")
print(f"Current Mean: {current_mean:.2f}")
print(f"Z-Score: {z_score:.2f}")
print(f"Critical Value for {confidence_level*100}% confidence: ±{critical_value:.2f}")
# Hypothesis testing decision: fail the pipeline step if the difference is significant
assert abs(z_score) <= critical_value, f"z_score {z_score:.2f} exceeds the critical value {critical_value:.2f}"
if __name__ == "__main__":
# Read the historical data (performance metrics)
historical_data = get_historical_data()
# Current data to compare
current_data = get_current_result()
hypothesis_test(historical_data, current_data, confidence_level=0.95)

The challenges with CPT
CPT can add additional cost to your project. It is an additional step in the CI pipeline, and requires performance engineering expertise that organizations might need to hire for. Furthermore, an additional test environment is needed to run the performance testing.
In addition to the costs, maintenance can be challenging. Data generation is also critical to the success of performance testing: it requires obtaining data, masking sensitive information, and deleting it securely. CPT also requires adding tests for new services, reflecting changes in existing services, and removing tests for unused ones. Following up on detected issues and on new features of the performance testing tools is also necessary. All of this must be done regularly to keep the system afloat, adding to existing maintenance efforts.
The benefits of CPT
Continuous Performance Testing offers significant benefits by enabling automatic early detection of performance issues within the development process. This proactive approach allows teams to identify and address bottlenecks before they reach production, reducing both costs and efforts associated with fixing problems later. By continuously monitoring and optimizing application performance, CPT helps ensure a fast, responsive user experience and minimizes the risk of outages or slowdowns that could disrupt users and business operations.
In addition to early detection, CPT enhances resource utilization by pinpointing inefficient code and infrastructure setups, ultimately reducing overall costs despite initial investments. It also fosters better collaboration among development, testing, and operations teams by providing a shared understanding of performance metrics: each test generates valuable data that supports advanced analysis and better decision-making regarding code improvements, infrastructure upgrades, and capacity planning. Finally, CPT offers the convenience of on-demand testing with just one click, providing an easy-to-use baseline for more rigorous performance evaluations when needed.
Conclusion
Continuous Performance Testing (CPT) transforms traditional performance testing by integrating it directly into the CI/CD pipeline. CPT can, in principle, be applied to each performance testing type, but load testing is most advantageous with lower cost and higher benefits.
The core idea is to automate and conduct performance tests continuously and earlier in the development cycle, aligning with the “shift-left” philosophy. This approach provides real-time feedback on performance impacts, helps identify and resolve issues sooner, and ultimately leads to improved software quality, faster development, and consistent adherence to performance standards and SLAs.
At Modus Create, we wanted to cut through the noise. So we ran a real experiment: two teams, same scope, same product, same timeline. One team used traditional workflows. The other used AI agents to scaffold, implement, and iterate — working in a new paradigm we call Agentic Coding.
Every technique we learned along the way and every insight this approach taught us is collected in our Agentic Coding Handbook. This article distills the lessons from the handbook into the core principles and practices any engineer can start applying today.
From Typing Code to Designing Systems
Agentic coding isn’t about writing code faster. It’s about working differently. Instead of manually authoring every line, engineers become high-level problem solvers. They define the goal, plan the implementation, and collaborate with an AI agent that writes code on their behalf.
Agentic Coding is a structured, AI-assisted workflow where skilled engineers prompt intentionally, validate rigorously, and guide the output within clear architectural boundaries.
This approach is fundamentally different from what many refer to as “vibe coding”, the idea that you can throw a vague prompt at an LLM and see what comes back. That mindset leads to bloated code, fragile architecture, and hallucinations.
Agentic Coding vs. Vibe Coding
To illustrate the difference, here’s how agentic coding compares to the more casual “vibe coding” approach across key dimensions:
| Dimension | Agentic Coding | Vibe Coding |
|---|---|---|
| Planning | Structured implementation plan | None or minimal upfront thinking |
| Prompting | Scoped, intentional, reusable | Loose, improvisational, trial-and-error |
| Context | Deliberately curated via files/MCPs | Often missing or overloaded |
| Validation | Treated as a critical engineering step | Frequently skipped or shallow |
| Output Quality | High, repeatable, aligned to standards | Inconsistent, often needs full rewrite |
| Team Scalability | Enables leaner squads with high output | Prone to technical debt and drift |
Agentic coding provides the structure, discipline, and scalability that large organizations need to standardize success across multiple squads. It aligns AI workflows with existing engineering quality gates, enabling automation without losing control. In contrast, vibe coding may produce short-term wins but fails to scale under the weight of enterprise demands for predictability, maintainability, and shared accountability.
A Note on Our Experiment
We ran a structured experiment with two engineering squads working on the same product. One team (DIY) built the product using traditional methods. The other team (AI) used Cursor and GitHub Copilot Agent to complete the same scope, using agentic workflows. The AI team had 30% fewer engineers and delivered in half the time. More importantly, the code quality — verified by SonarQube and human reviewers — was consistent across both teams.
Core Practices That Make the Difference
Implementation Planning is Non-Negotiable
Before any prompting happens, engineers must do the thinking. Creating an implementation plan isn’t just a formality but the most critical piece in making agentic coding work. It’s where intent becomes design.
A solid implementation plan defines what to build, but also why, how, and within what constraints. It includes:
- Functional goals: What should this piece of code do?
- Constraints: Performance expectations, architecture rules, naming conventions, etc.
- Edge cases: Known pitfalls, alternate flows, integration risks.
- Required context: Links to schemas, designs, existing modules, etc.
- Step-by-step plan: Breakdown of the task into scoped units that will become individual prompts.
This plan is usually written in markdown and lives inside the codebase. It acts like a contract between the engineer and the AI agent.
The more precise and explicit this document is, the easier it is to turn each unit into a high-quality prompt. This is where agentic coding shifts from “throw a prompt and see what happens” to deliberate system design, supported by AI.
In short, prompting is the act. Planning is the discipline. Without it, you’re not doing agentic coding — you’re just taking shots in the dark and hoping something works.
Prompt Engineering is a Real Skill
Prompt engineering is not about being clever. It’s about being precise, scoped, and iterative. We teach engineers to break down tasks into discrete steps, write action-oriented instructions, avoid vague intentions, chain prompts, and use prompting strategies like:
- Three Experts: Use this when you want multiple perspectives on a tough design problem. For example, ask the AI to respond as a senior engineer, a security expert, and a performance-focused architect.
- N-Shot Prompting: Provide the AI with N examples of the desired output format or pattern. Zero-shot uses no examples, one-shot provides a single example, and few-shot (N-shot) includes multiple examples to guide the AI toward the expected structure and style.
- 10 Iteration Self-Refinement: Best used when you want the AI to improve its own output iteratively. Give it a problem, then prompt it to improve its previous response 10 times, evaluating each step with reasoning.
Choosing the right style depends on the type of challenge you’re tackling — architectural design, implementation, refactoring, or debugging.
Context is a First-Class Citizen
Model Context Providers (MCPs) give GitHub Copilot a second brain. Instead of treating the LLM as an isolated suggester, MCPs stream relevant context — from Figma designs, documentation in Confluence, code changes from GitHub, and decision logs — directly into the Copilot chat session.
This allows engineers to ask Copilot to write code that matches an actual UI layout, or implements some logic described in a design doc, without manually pasting content into the prompt. The results are significantly more relevant and aligned. Some of the MCPs we use are:
- GitHub MCP: Pulls in pull request content and comments to give the model full context for writing review responses, proposing changes, or continuing implementation from feedback.
- Figma MCP: Streams UI layouts into the session, enabling the AI to generate frontend code that accurately reflects the design.
- Database Schema MCP: Injects table structures, column types, and relationships to help the AI write or update queries, migrations, or API models with accurate field-level context.
- Memory Bank MCP: Shares scoped memory across sessions and team members, maintaining continuity of architectural decisions, prompt history, and recent iterations.
- CloudWatch MCP: Supplies log output to the AI for debugging and incident triage — essential during the Debugging workflow.
- SonarQube MCP: Feeds static analysis results so the AI can refactor code to eliminate bugs, smells, or duplication.
- Confluence MCP: Integrates architecture and business documentation to inform decisions around domain logic, constraints, and requirements.
MCPs are just one part of the context curation puzzle. Engineers also need to deliberately craft the model’s working memory for each session. That includes:
- Implementation Plans: Markdown files that define goals, steps, constraints, and trade-offs, acting as an onboarding doc for the AI agent.
- Codebase Files: Selectively attaching relevant parts of the codebase (like entry points, shared utilities, schemas, or config files) so the AI operates with architectural awareness.
- Console Logs or Test Output: Including runtime details helps the AI understand execution behavior and suggest context-aware fixes.
- Instructions or TODO Blocks: GitHub Copilot supports markdown-based instruction files and inline TODO comments to guide its code generation. These instructions act like lightweight tickets embedded directly in the repo. For example, an
INSTRUCTIONS.md might define architectural rules, file responsibilities, or interface contracts; see the sketch after this list. Within code files, TODOs like // TODO: replace mock implementation with production-ready logic act as scoped prompts that Copilot can act on directly. Used consistently, these become in-repo signals (markers inside the code) that align the agent’s output with team expectations and design intent and direct the model towards a specific change or design pattern.
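For illustration, a hypothetical INSTRUCTIONS.md along these lines (the file name, rules, and paths are all made up) shows what such an instruction file can look like:

```markdown
# INSTRUCTIONS.md (hypothetical example)

## Architectural rules
- All HTTP handlers live in src/api/ and must not import from src/db/ directly.
- Use the shared logger from src/lib/logger.ts; never call console.log.

## Current task
- Replace the mock implementation in src/api/orders.ts with production-ready logic.
- Add unit tests in src/api/__tests__/orders.test.ts and keep coverage above 80%.
```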
Effective context curation is an engineering discipline. Give too little, and the agent hallucinates. Give too much, and it loses focus or runs out of space in the LLM context window. The best results come from curating the smallest possible set of high-signal resources. When you treat context as a design artifact the AI becomes a more reliable collaborator.
The Role of Workflows
We embedded AI in our delivery pipeline using a set of core workflows. You can explore each one in more detail in our handbook, but here is the high-level overview:
| Workflow | Purpose |
|---|---|
| Spec-First | Write a scoped prompt plan before coding |
| Exploratory | Understand unfamiliar codebases with AI help |
| Memory Bank | Maintain continuity across sessions and team members |
| TDD | Test-first with AI-generated test coverage |
| Debugging | Use AI to triage, investigate, and fix bugs |
| Visual Feedback | Align AI output with Figma and screenshots |
| Auto Validations | Run tools like SonarQube, ESLint post-output |
In our experience, these workflows are not just productivity boosters; they’re the foundation for scaling AI-assisted development across teams. They provide consistency, repeatability, and shared mental models. We believe this approach is especially critical in enterprise environments, where large engineering organizations require predictable output, quality assurance, and alignment with established standards. Agentic workflows bring just enough structure to harness AI’s strengths without sacrificing accountability or control.
Building a Validation Loop
We use validation tools like SonarQube, ESLint, Vitest, and Prettier to provide automatic feedback to the AI. For example, if SonarQube flags duplication, we prompt the AI to refactor accordingly. This creates a tight loop where validation tools become coaching signals.
Some tools, like GitHub Copilot, can even collect log output from the terminal running tests or executing scripts. This allows the AI to observe the outcome of code execution, analyze stack traces or test failures, and automatically attempt fixes. One common approach is asking the AI to run a test suite, interpret the failed test results, make corrections, and repeat this process until all tests pass.
Lizard, a tool that calculates code complexity metrics, is another useful validation tool. Engineers can instruct the AI to execute Lizard against the codebase. When the output indicates that a function exceeds the defined complexity threshold (typically 10), the AI is prompted to refactor that function into smaller, more maintainable blocks. This method forces the AI to act on specific, measurable quality signals and improves overall code readability.
In this setup, engineers can let the AI operate in a closed loop for several iterations. Once the AI produces clean validation results — whether through passing tests, static analysis, or complexity reduction — the human engineer steps back in to review the result. This combination of automation and oversight speeds up bug fixing while maintaining accountability.
But here’s the thing: the team needs to actually understand what the AI built. If you’re just rubber-stamping AI changes without really getting what they do, you’re setting yourself up for trouble. The review step isn’t just a checkbox — it’s where you make sure the code actually makes sense for your system.
Why Human Oversight Still Matters
No AI is accountable for what goes to production. Engineers are. AI doesn’t own architectural tradeoffs, domain-specific reasoning, or security assumptions. Human-in-the-loop is the safety mechanism.
Humans are the only ones who can recognize when business context changes, when a feature should be cut for scope, or when a security concern outweighs performance gains. AI can assist in code generation, validation, and even debugging — but it lacks the experience, judgment, and ownership required to make trade-offs that affect users, stakeholders, or the long-term health of the system.
Human engineers are also responsible for reviewing the AI’s decisions, ensuring they meet legal, ethical, and architectural constraints. This is especially critical in regulated industries, or when dealing with sensitive data. Without a human to enforce these standards, the risk of silent failure increases dramatically.
Agentic coding isn’t about handing off responsibility, it’s about amplifying good engineering judgment.
Where People Fail (And Blame the AI)
Common mistakes include vague prompts, lack of planning, poor context, and not validating output. While LLMs have inherent limitations — they hallucinate, make incorrect assumptions, and produce plausible-sounding but wrong outputs even with good inputs — engineering discipline significantly increases the reliability of results.
A prompt like “make this better” tells the AI nothing about what “better” means — faster? more readable? safer? Without clear constraints and context, LLMs default to producing generic solutions that may not align with your actual needs. The goal isn’t to eliminate all AI errors, but to create workflows that catch and correct them systematically.
Lack of validation is another key failure mode. Trusting the first output, skipping tests, or ignoring code quality tools defeats the point of the feedback loop. AI agents need boundaries and coaching signals or, without them, they can drift into plausible nonsense.
Using these tools effectively also means understanding their current limitations. AI models work best with well-represented programming languages like JavaScript, TypeScript, and Python (to name a few examples). However, teams working in specialized domains may see limited results even with popular languages.
A Closer Look at Our Tooling
GitHub Copilot played a key role in our experiment, especially when paired with instruction files, validation scripts, and Model Context Providers (MCPs).
What made GitHub Copilot viable for agentic workflows wasn’t just its autocomplete or inline chat. It was how we surrounded it with structure and feedback mechanisms:
Instruction Files
Instruction files served as the AI’s map. These markdown-based guides detailed the implementation plan, scoped tasks, architectural constraints, naming conventions, and even file-level goals. When placed inside the repo, they gave GitHub Copilot context it otherwise wouldn’t have. Unlike ad-hoc prompts, these files were written with intent and discipline, and became a critical part of the repo’s knowledge layer.
Validation Scripts
We paired Copilot with post-generation validation tools like ESLint, Vitest, Horusec, and SonarQube. These weren’t just guardrails; they closed the loop. When Copilot generated code that violated rules or failed tests, engineers would reframe the prompt with validation results as input. This prompted Copilot to self-correct. It’s how we turned passive AI output into an iterative feedback process.
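A minimal sketch of such a loop, with the validator commands as examples and the agent hand-off as a placeholder rather than any real Copilot API:
import subprocess

VALIDATORS = [["npx", "eslint", "."], ["npx", "vitest", "run"]]  # example commands

def run_validators():
    failures = []
    for cmd in VALIDATORS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.stdout + result.stderr))
    return failures

def hand_back_to_agent(failures):
    # Placeholder: in our workflow a human pasted this output back into Copilot chat.
    for cmd, output in failures:
        print(f"Validation failed for {' '.join(cmd)}:\n{output}")

for _ in range(3):  # a bounded number of iterations keeps the loop accountable
    failures = run_validators()
    if not failures:
        break
    hand_back_to_agent(failures)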
Copilot + Workflows = Impact
Used this way, GitHub Copilot became more than a helper. It became a participant in our structured workflows:
- In Spec-First, Copilot consumed instruction files to scaffold code.
- In Debugging, it analyzed logs fed via MCP and proposed targeted fixes.
- In TDD, it generated unit tests from requirements, then refactored code until tests passed.
- In Visual Feedback, it aligned components with Figma via the design MCP.
By aligning Copilot with prompts, plans, validation, and context, we moved from “code completion” to code collaboration.
So no — GitHub Copilot isn’t enough on its own. But when embedded inside a disciplined workflow, with context and feedback flowing in both directions, it’s a capable agent. One that gets better the more structured your engineering practice becomes.
Final Advice: How to Actually Start
The path to agentic coding begins with a single, well-chosen task. Pick something atomic that you understand deeply — a function you need to refactor, a component you need to build, or a bug you need to fix. Before touching any AI tool, write an implementation plan that defines your goals, constraints, and step-by-step approach.
Once you have your plan, start experimenting with the workflows we’ve outlined. Try Spec-First to scaffold your implementation, then use Auto Validations to create feedback loops. If you’re working with UI, explore Visual Feedback with design tools. As you gain confidence, introduce Model Context Providers to give your AI agent richer context about your codebase and requirements. Always keep in mind that the quality of AI output depends on the quality of the task setup and the availability of feedback.
Treat each interaction as both an experiment and a learning opportunity. Validate every output as if it came from a junior developer. Most importantly, remember that this isn’t about replacing your engineering judgment; it’s about amplifying it. The most successful engineers in our experiments were the ones who treated the AI as a collaborator — not a magician.
What we’ve described isn’t just a productivity technique — it’s a fundamental shift in how we think about human creativity and machine capability. When engineers become high-level problem solvers, supported by AI agents within well-defined boundaries, we unlock new possibilities for what software teams can accomplish. Welcome to the next era of software development.
In a previous post, I introduced Topiary, a universal formatter (or one could say a formatter generator), and showed how to start a formatter for a programming language from scratch. This post is the second part of the tutorial, where we’ll explore more advanced features of Topiary that come in handy when handling real-life languages, and in particular the single-line and multi-line layouts. I’ll assume that you have a working setup to format our toy Yolo language. If you don’t, please follow the relevant sections of the previous post first.
Single-line and multi-line
A fundamental tenet of formatting is that you want to lay code out in different
ways depending on whether it fits on one line or not. For example, in
Nickel, or any functional programming language for that matter, it’s
idiomatic to write small anonymous functions on one line, as in std.array.map (fun x => x * 2 + 1) [1,2,3]. But longer functions would rather look like:
fun x y z =>
  if x then
    y
  else
    z
This is true for almost any language construct that you can think of: you’d
write a small boolean condition is_a && is_b, but write a long validation
expression as:
std.is_string value
&& std.string.length value > 5
&& std.string.length value < 10
&& !(std.string.is_match "\\d" value)
In Rust, with rustfmt, short method calls are formatted on one line as in
x.clone().unwrap().into(), but they are spread over several lines when the
line length is over a fixed threshold:
value
    .maybe_do_something(|x| x+1)
    .or_something_else(|_| Err(()))
    .into_iter()
You usually either want the single-line layout or the multi-line one. A hybrid solution wouldn’t be very consistent:
std.is_string value
&& std.string.length value > 5 && std.string.length value < 10
&& !(std.string.is_match "\\d" value)
Some formatters, such as Rust’s, choose the layout automatically depending on the length of the line. Long lines are wrapped and laid out in the multi-line style automatically, freeing the programmer from any micro decision. On the flip side, the programmer can’t force one style in cases where it’d make more sense.
Some other formatters, like our own Ormolu for Haskell, decide on the layout based on the original source code. For any syntactic construct, the programmer has two options:
- Write it on one line, or
- Write it on two lines or more.
Option 1 will trigger the single-line layout, and option 2 the multi-line one. No effort is made to try to fit within reasonable line lengths. That’s up to the programmer.
As we will see, Topiary follows the same approach as Ormolu, although future support for optional line wrapping isn’t off the table1.
Softlines
Fewer line breaks, please
Let’s see how our Yolo formatter handles the following source:
input income, status
output income_tax
income_tax := case { status = "exempted" => 0, _ => income * 0.2 }
Since the case is short, we want to keep it single-line. Alas, this gets
formatted as:
input income, status
output income_tax
income_tax := case {
status = "exempted" => 0,
_ => income * 0.2
}The simplest mechanism for multi-line-aware layout is to use soft
lines instead of spaces or hardlines. Let’s change the
@append_hardline capture in the case branches separating
rule to @append_spaced_softline:
; Put case branches on their own lines
(case
"," @append_spaced_softline
)
As the name indicates, a spaced softline will result in a space for the
single-line case, and a line break for the multi-line case, which is precisely
what we want. However, if we try to format our example, we get the dreaded
idempotency check failure, meaning that formatting once and formatting twice in a
row don’t give the same result, which is usually a red flag (and is why
Topiary performs this check). What happens is that our braces { and } also
introduce hardlines, so the double formatting goes like:
income_tax := case { status = "exempted" => 0, _ => income * 0.2 }
--> (case is single-line: @append_spaced_softline is a space)
income_tax := case {
  status = "exempted" => 0, _ => income * 0.2
}
--> (case is multi-line! @append_spaced_softline is a line break)
income_tax := case {
  status = "exempted" => 0,
  _ => income * 0.2
}
We need to amend the rule for braces as well:
; Lay out the case skeleton
(case
"{" @prepend_space @append_spaced_softline
"}" @prepend_spaced_sofline
)Our original example is now left untouched, as desired. Note that softline
annotations are expanded depending on the multi-lineness of the direct parent of
the node they attach to (not of the subtree matched by the whole query,
nor of the node itself). Topiary applies this logic because this is most often what
you want. The parse tree of the multi-line version of income_tax:
income_tax := case {
status = "exempted" => 0,
_ => income * 0.2
}is as follows (hiding irrelevant parts in [...]):
0:0 - 4:0 tax_rule
  0:0 - 3:1 statement
    0:0 - 3:1 definition_statement
      0:0 - 0:10 identifier `income_tax`
      0:11 - 0:13 ":="
      0:14 - 3:1 expression
        0:14 - 3:1 case
          0:14 - 0:18 "case"
          0:19 - 0:20 "{"
          1:2 - 1:26 case_branch
            [...]
          1:26 - 1:27 ","
          2:2 - 2:19 case_branch
            [...]
          3:0 - 3:1 "}"
The left part is the span of the node, in the format start_line:start_column - end_line:end_column. A node is multiline simply if end_line > start_line. You
can see that since "{" is not multiline (it can’t be, as it’s only one
character!), if Topiary considered the multi-lineness of the node itself, our
previous "{" @append_spaced_softline would always act as a space.
What happens is that Topiary considers the direct parent instead, which is 0:14 - 3:1 case
here, and is indeed multi-line.
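As a side note, if you want to reproduce this check yourself with the Python Tree-sitter bindings (an assumption on my part; Topiary's own implementation is in Rust and this is not its code), it boils down to comparing rows:
# A node spans multiple lines exactly when its end row is greater than its
# start row; start_point/end_point are (row, column) pairs in py-tree-sitter.
def is_multiline(node) -> bool:
    return node.end_point[0] > node.start_point[0]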
Both single-line and multi-line case are now formatted as expected.
More line breaks, please
Let’s consider the dual issue, where line breaks are unduly removed. We’d like to allow inputs and outputs to span multiple lines, but the following snippet:
input
income,
status,
tax_coefficient
output income_tax
is formatted as:
input income, status, tax_coefficient
output income_tax
The rule for spacing around input and
output and the rule for spacing around
, and identifiers both use @append_space. We
can simply replace this with a spaced softline. Recall that a spaced softline
turns into a space and thus behaves like @append_space in a single-line
context, making it a proper substitution.
; Add spaced softline after `input` and `output` decl
[
"input"
"output"
] @append_spaced_softline
; Add a spaced softline after and remove space before the comma in an identifier
; list
(
(identifier)
.
"," @prepend_antispace @append_spaced_softline
.
(identifier)
)
We also need to add new rules to indent multi-line lists of inputs or outputs.
; Indent multi-line lists of inputs.
(input_statement
"input" @append_indent_start
) @append_indent_end
; Indent multi-line lists of outputs.
(output_statement
"output" @append_indent_start
) @append_indent_end
A matching pair of indentation captures *_indent_start and *_indent_end will
amount to a no-op if they are on the same line, so those rules don’t disturb the
single-line layout.
Recall that as long as you don’t use anchors (.), additional nodes can be
omitted from a Tree-sitter query: here, the first query will match an input
statement with an "input" child somewhere, and any children before or after
that (although in our case, there won’t be any children before).
Scopes
More (scoped) line breaks, please
Let us now consider a similar example, at least on the surface. We want to allow long arithmetic expressions to be laid out on multiple lines as well, as in:
input
some_long_name,
other_long_name,
and_another_one
output result
result :=
some_long_name
+ other_long_name
+ and_another_one
As before, result is currently smashed back into one line by our current
formatter. Unsurprisingly, since our keywords rule uses
@prepend_space and @append_space. At this point, you start to get the trick:
let’s use softlines! I’ll only handle + for simplicity. We remove "+" from
the original keywords rule and add the following rule:
; (Multi-line) spacing around +
("+" @prepend_spaced_softline @append_space)Ignoring indentation for now, the line wrapping seems to work. For the following example at least:
result :=
some_long_name
+ other_long_name + and_another_one
which is reformatted as:
result := some_long_name
+ other_long_name
+ and_another_one
However, perhaps surprisingly, the following example:
result :=
some_long_name + other_long_name
+ and_another_one
is reformatted as:
result := some_long_name + other_long_name
+ and_another_one
The first addition hasn’t been split! To understand why, we have to look at how our grammar parses arithmetic expressions:
expression: $ => choice(
$.identifier,
$.number,
$.string,
$.arithmetic_expr,
$.case,
),
arithmetic_expr: $ => choice(
prec.left(1, seq(
$.expression,
choice('+', '-'),
$.expression,
)),
prec.left(2, seq(
$.expression,
choice('*', '/'),
$.expression,
)),
prec(3, seq(
'(',
$.expression,
')',
)),
),
Even if you don’t understand everything, there are two important points:
- Arithmetic expressions are recursively nested. Indeed, we can compose
arbitrarily complex expressions, as in
(foo*2 + 1) + (bar / 4 * 6).
- They are parsed in a left-associative way.
This means that our big addition is parsed as: ((some_long_name "+" other_long_name) "+" and_another_one). In the first example, since the line
break happens just after some_long_name in the original source, both the inner
node and the outer one are multi-line. However, in the second example, the line
break happens after other_long_name, meaning that the innermost arithmetic
expression is contained in a single line, and the corresponding + isn’t
considered multi-line. Indeed, you can see here that the parent of the first +
is 7:0 - 7:32 arithmetic_expr, which fits entirely on line 7.
7:0 - 8:17 arithmetic_expr
  7:0 - 7:32 expression
    7:0 - 7:32 arithmetic_expr
      7:0 - 7:14 expression
        7:0 - 7:14 identifier `some_long_name`
      7:15 - 7:16 "+"
      7:17 - 7:32 expression
        7:17 - 7:32 identifier `other_long_name`
  8:0 - 8:1 "+"
  8:2 - 8:17 expression
    8:2 - 8:17 identifier `and_another_one`
The solution here is to use scopes. A scope is a user-defined group of nodes
associated with an identifier. Crucially, when using scoped softline captures
such as @append_scoped_space_softline within a scope, Topiary will consider
the multi-lineness of the whole scope instead of the multi-lineness of the
(parent) node.
Let’s create a scope for all the nested sub-expressions of an arithmetic
expression. Scopes work the same as other node groups in Topiary: we create them
by using a matching pair of begin and end captures. We need to find a parent
node that can’t occur recursively in an arithmetic expression. A good candidate
would be definition_statement, which
encompasses the whole right-hand side of the definition of an output:
; Creates a scope for the whole right-hand side of a definition statement
(definition_statement
(#scope_id! "definition_rhs")
":="
(expression) @prepend_begin_scope @append_end_scope
)
We must specify an identifier for the scope using the
predicate scope_id. Identifiers are useful when
several scopes might be nested or even overlap, and help readability in general.
We then amend our initial attempt at formatting multi-line arithmetic expressions:
; (Multi-line) spacing around +
(
(#scope_id! "definition_rhs")
"+" @prepend_scoped_spaced_softline @append_space
)
We use a scoped version of softlines, in which case we need to specify the
identifier of the corresponding scope. The captured node must also be part of
said scope. You can check that both examples (and multiple variations of them)
are finally formatted as expected.
Conclusion
This second part of the Topiary tutorial has shown how to specify an alternative formatting layout depending on whether an expression spans multiple lines or not. The main concepts at play here are multi-line versus single-line nodes, and scopes. There is an extension to this concept not covered here, measuring scopes, but standard scopes already go a long way when formatting a real-life language. If you’re looking for a comprehensive resource to help you write your formatter, the official Topiary book is for you. You can also find the complete code for this post in the companion repository. Happy hacking!
- Introduction to the dependency graph
- Managing dependency graph in a large codebase
- The anatomy of a dependency graph
In the previous post, we explored the concepts of the dependency graph and got familiar with some of its applications in the context of build systems. We also observed that managing dependencies can be complicated.
In this post, we are going to take a closer look at some of the issues you might need to deal with when working in a large codebase, such as having incomplete build metadata or conflicting requirements between components.
Common issues
Diamond dependency
The diamond dependency problem is common in large projects, and resolving it often requires careful dependency version management or deduplication strategies.
Imagine you have these dependencies in your project:
Packaging appA and appB individually is not a problem
because each of them ends up with a single, well-defined version of libX.
But what if appA starts using something from libB as well?
Now when building appA, it is unclear
what version of libX should be used — v1 or v2.
This results in a part of the dependency graph shaped
like a diamond, hence the name of the problem.
Depending on the programming language and the packaging mechanisms, it might be possible to specify
that when calls are made from libA, then libX.v1 should be used,
and when calls are made from libB, then libX.v2 should be used,
but in practice it can get quite complicated.
The worst situation is perhaps when appA is compatible with both v1 and v2,
but may suffer from intermittent failures when being used in certain conditions such as under high load.
Then you would actually be able to build your application,
and since it includes a “build compatible” yet different version of the third-party library,
you won’t be able to spot the issue straight away.
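To make the failure mode concrete, here is a small, self-contained Python sketch (the package names and versions are hypothetical) that flattens appA's transitive requirements and reports the conflicting versions of libX:
from collections import defaultdict

# direct dependencies, pinned as (name, version) pairs
DEPS = {
    ("appA", None): [("libA", "1.0"), ("libB", "1.0")],
    ("libA", "1.0"): [("libX", "1")],
    ("libB", "1.0"): [("libX", "2")],
    ("libX", "1"): [],
    ("libX", "2"): [],
}

def transitive(node):
    """Collect every (name, version) reachable from `node`."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for dep in DEPS.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

versions = defaultdict(set)
for name, version in transitive(("appA", None)):
    versions[name].add(version)

conflicts = {name: vs for name, vs in versions.items() if len(vs) > 1}
print(conflicts)  # {'libX': {'1', '2'}} -- the diamond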
Some tools, such as the functional package manager nix, treat packages as immutable values and allow you to specify exact versions of dependencies for each package, and these can coexist without conflict.
Having a single set of requirements can also be desirable, because if all the code uses the same versions of required libraries, you avoid version conflicts entirely and everyone in the company works with the same dependencies, reducing “works on my machine”-type issues. In practice, however, this is often unrealistic for large or complex projects, especially in large monorepos or polyglot codebases. For instance, upgrading a single dependency may require updating many parts of the codebase at once, which might be risky and time-consuming. Likewise, if you want to split your codebase into independently developed modules or services, a single requirements set can become a bottleneck.
Re-exports
Re-exports — when a module imports a member from another module and re-exports it — are possible in some languages such as Python or JavaScript.
Take a look at this graph
where appA needs value of dpi from the config, but instead of importing from the config,
it imports it from libA.
While re-exports may simplify imports and improve encapsulation,
they also introduce implicit dependencies:
downstream code like appA becomes coupled not only to libA,
but also to the transitive closure of libA.
In this graph this means that changes in any modules
that libA depends on would require rebuilding appA.
This is not truly needed since appA doesn’t really depend on any code members from that closure.
To improve the chain of dependencies, the refactored graph would look like this:
Identifying re-exports can be tricky particularly with highly dynamic languages such as Python. The available tooling is limited (e.g. see mypy), and custom static analysis programs might need to be written.
Stale dependencies
Maintaining up-to-date and correct build metadata is necessary to represent the dependency graph accurately, but issues might appear silently. For example, you might have modules that were once declared to depend on a particular library but no longer do (even though the metadata in build files suggests they still do). This can cause your modules to be unnecessarily rebuilt every time the library changes.
Some build systems such as Pants rely on dependency inference where users do not have to maintain the build metadata in build files, but any manual dependencies declared (where inference cannot be done programmatically in all situations) still need to be kept up-to-date and might easily get stale.
There are tools that can help ensure the dependency metadata is fresh for C++ (1, 2), Python, and JVM codebases, but keeping the build metadata up-to-date is often still a semi-automated process that cannot be fully automated due to edge cases and occasional false positives.
Incompatible dependencies
It is possible for an application to end up depending on third-party libraries that cannot be used together. This could be enforced for multiple reasons:
- to ensure the design is sane (e.g., only a single cryptography library may be used by an application)
- to avoid malfunctioning of the service (e.g., two resource-intensive backend services can’t be run concurrently)
- to keep the CI costs under control (e.g., tests may not depend on a live database instance and should always use rich mock objects instead).
Appropriate rules vary between organizations, and should be updated continuously as the dependency graph evolves. If you use Starlark for declaring build metadata, take a look at buildozer which can help querying the build files when validating dependencies statically.
Large transitive closures
If a module depends on a lot of other modules, it’s more likely that it will also need to be changed whenever any of those dependencies change. Usually, bigger files (with more lines of code) have more dependencies, but that’s not always true. For example, a file full of boilerplate or generated code might be huge, but barely depend on anything else. Sticking to good design practices — like grouping related code together and making sure classes only do one thing — can help keep your dependencies under control.
For example, with this graph
a build system is likely to require running all test cases in tests whenever any of the apps change,
which would be wasteful most of the time since you are most likely going to change only one of them at a time.
This could be refactored by having individual test modules, each targeting a single application:
Third-party dependencies
It is generally advisable to be cautious about adding any dependency, particularly a third-party one, and its usage should be justified — it may pay off to be reluctant to add any external dependency unless the benefits of bringing it in outweigh the associated cost.
For instance, a team working on a Python command-line application processing some text data may consider
using pandas because it’s a powerful data manipulation tool
and twenty lines of code written using built-in modules could be replaced by a one-liner with pandas.
But what happens when this application is going to be distributed?
The team will have to make sure that pandas (which contains C code that needs to be compiled)
can be used on all supported operating systems and CPU architectures meeting the reliability and performance constraints.
It may sound harsh, but there’s truth to the idea that every dependency eventually becomes a liability. By adding a dependency (either to your dependency graph, if it’s a new one, or to your program), you are committing to stay on top of its security vulnerabilities, compatibility with other dependencies and your build system, and licensing compliance.
Adding a new dependency means adding a new node or a new edge to the dependency graph, too. The graph traversal time is negligible, but the time spent on rebuilding code at every node is not. The absolute build time is less of a problem since most build systems can parallelize build actions very aggressively, but what about the computational time? While developer time (mind they still have to wait for the builds to finish!) is far more valuable than machine time, every repeated computation during a build contributes to the total build cost. These operations still consume resources — whether you’re paying a cloud provider or covering the energy and maintenance costs of an on-premises setup.
Cross-component dependencies
It is common for applications to depend on libraries (shared code), however, it is also possible (but less ideal) for an application to use code from another application. If multiple applications have some code they both need, it is often advisable that this code is extracted into a shared library so that both applications can depend on that instead.
Modern build systems such as Pants and Bazel have a visibility control mechanism that enforces rules of dependency between your codebase components. These safeguards exist to prevent developers from accessing and incorporating code from unrelated parts of the codebase. For instance, when building source code for accounting software, the billing component should never depend on the expenses component just because it also needs to support exports to PDF.
However, visibility rules may not be expressive enough to cover certain cases.
For instance, if you follow a particular deployment model,
you may need to make sure that a specified module will never end up as a transitive dependency of a certain package.
You may also want to enforce that some code is justified to exist in a particular package
only if it’s being imported by some others.
For example, you may want to prevent placing any modules in the src/common-plugins package
unless they are imported by src/plugins package modules to keep the architecture robust.
Keep in mind that when introducing a modern build system to a large, legacy codebase that has evolved without paying attention to the dependency graph’s shape, builds may be slow not because the code compilation or tests take long, but because any change in the source code requires re-building most or all nodes of the dependency graph. That is, if all nodes of the graph transitively depend on a node with many widely used code members that are modified often, there will be lots of re-build actions unless this module is split into multiple modules, each containing only closely related code.
Direct change propagation
When source code in a module is changed, downstream nodes (reverse dependencies of this module) often get rebuilt even if the specific changes don’t truly require it. In large codebases, this causes unnecessary rebuilds, longer feedback cycles, and higher CI costs.
In most build systems (including Bazel and GNU Make), individual actions or targets are invalidated
if their inputs change.
In GNU Make, this would be mtime of declared input files,
and in Bazel, this would be digests, or the action key.
Most build systems can perform an “early cutoff” if the output of an action doesn’t change.
Granted, with GNU Make, the mtime could be updated even if the output was already correct from a previous build
(which will force unnecessary rebuilds), but that’s a very nuanced point.
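As a toy illustration of the early-cutoff idea (this is not how any particular build system implements it), consider tracking both input and output digests and only propagating a rebuild when the output digest actually changes:
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# last known digests, keyed by action name (a real tool would persist these)
input_cache: dict[str, str] = {}
output_cache: dict[str, str] = {}

def run_action(name: str, inputs: bytes, action) -> bool:
    """Run `action` if inputs changed; return True only if the output changed."""
    in_digest = digest(inputs)
    if input_cache.get(name) == in_digest:
        return False                      # inputs unchanged: skip entirely
    input_cache[name] = in_digest
    out_digest = digest(action(inputs))
    if output_cache.get(name) == out_digest:
        return False                      # early cutoff: same output, nothing propagates
    output_cache[name] = out_digest
    return True                           # output changed: downstream must rebuild

# Example: a comment-like prefix changes the input but not the "compiled" output.
strip = lambda src: src.replace(b"# note\n", b"")
print(run_action("compile", b"print(1)\n", strip))          # True (first build)
print(run_action("compile", b"# note\nprint(1)\n", strip))  # False (early cutoff)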
However, with Application Binary Interface (ABI) awareness, it would only be necessary to rebuild downstream dependencies if the interface they rely on has actually changed.
A related idea is having a stable API, which can help figure out which nodes in the graph actually changed. Picture a setup like this — an application depends on the database writer module which in turn depends on the database engine:
This application calls the apply function from the database writer module to insert some rows,
which then uses the database engine to handle the actual disk writing.
If anything in internals changes (e.g., how the data is compressed before writing to disk),
the client won’t notice as long as the writer’s interface stays the same.
That interface acts as a “stable layer” between the parts.
In the build context, running tests of the application should not be necessary on changes in the database component.
Practically, reordering methods in a Java class, adding a docstring to a Python function,
or even making minor changes in the implementation (such as return a + b instead of return b + a)
would still mark that node in the graph as “changed”, particularly if you rely on tooling
that queries modified files in the version control repository without taking into account the semantics of the change.
Therefore, relying on the checksum of a source file or of all files in a package (depending on what a node in your dependency graph represents), just like relying on checksums of compiled objects (be it machine code or bytecode), may prove insufficient when determining what kind of change deserves to be propagated further along the dependency chain of the graph. Take a look at the Recompilation avoidance in rules_haskell to learn more about checksum-based recompilation avoidance in Haskell.
Many programming languages have language constructs, such as interfaces in Go, that can avoid this problem by replacing a dependency on some concrete implementation with a dependency on a shared public interface. The application from the example above could depend on a database interface (or abstract base class) instead of the actual implementation. This is another kind of “ABI” system that avoids unnecessary rebuilds and helps to decouple components.
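A minimal Python sketch of this decoupling, using an abstract base class as the stable layer (the names mirror the database-writer example above and are otherwise made up):
from abc import ABC, abstractmethod

class DatabaseWriter(ABC):
    @abstractmethod
    def apply(self, rows: list[dict]) -> None:
        """Insert rows; how they reach disk is an implementation detail."""

class CompressingDiskWriter(DatabaseWriter):
    def apply(self, rows: list[dict]) -> None:
        # compression/encoding details can change freely without affecting callers
        print(f"writing {len(rows)} rows")

def ingest(writer: DatabaseWriter, rows: list[dict]) -> None:
    writer.apply(rows)  # the application only sees the stable interface

ingest(CompressingDiskWriter(), [{"id": 1}])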
How ABI compatibility is handled depends on the build system used. In Buck, there is a concept of Java ABI that is used to figure out which nodes actually need rebuilding during an incremental build. For example, a Java library doesn’t always need to be rebuilt just because one of its dependencies changed unless the public interface of that dependency changed too. Knowing this helps skip unnecessary rebuilds when the output would be the same anyway.
In the most recent versions of Bazel, there is experimental support for dormant dependencies which are not an actual dependency, but the possibility of one. The idea is that every edge between nodes can be marked as dormant, and then it is possible for it to be passed up the dependency graph and turned into an actual dependency (“materialized”) in the reverse transitive closure. Take a look at the design document to learn more about the rationale.
We hope it is clear now how notoriously complex managing a large dependency graph in a monorepo is. Changes in one package can ripple across dozens or even hundreds of interconnected modules. Developers must carefully coordinate versioning, detect and prevent circular dependencies, and ensure that builds remain deterministic, particularly in industries with harder reproducibility constraints such as automotive or biotech.
Failing to keep the dependency graph sane often leads to brittle CI pipelines and long development feedback loops which impedes innovation and worsens developer experience. In the future, we can expect more intelligent tools to emerge such as machine learning based dependency impact analyzers that predict downstream effects of code changes and self-healing CI pipelines that auto-adjust scope and change propagation. Additionally, semantic-aware refactoring tools and “intent-based” build systems could automate much of the manual effort that is currently required to manage interdependencies at scale.
In the next post, we’ll talk about scalability problems and limitations of the dependency graph scope that is exposed by build systems.
During my GSoC 2025 Haskell.org project with Tweag, I worked on a seemingly
small but impactful feature: allowing LH’s type and predicate aliases to be written
in qualified form.
That is, being able to write Foo.Nat instead of just Nat, like we can for regular Haskell type aliases.
In this post, I introduce these annotations and their uses, walk through some of the design decisions, and share how I approached the implementation.
Aliasing refinement types
Type and predicate aliases in LH help users abstract over refinement type
annotations, making them easier to reuse and more concise. A type alias refines
an existing type. For instance, LH comes with built-in aliases like Nat and
Odd, which refine Int to represent natural and odd numbers, respectively.
{-@ type Nat = {v: Int | v >= 0 } @-}
{-@ type Odd = {v: Int | (v mod 2) = 1 } @-}
Predicate aliases, by contrast, capture only the predicate part of a refinement type. For example, we might define aliases for positive and negative numerical values.
-- Value parameters in aliases are specified in uppercase,
-- while lowercase is reserved for type parameters.
{-@ predicate Neg N = N < 0 @-}
{-@ predicate Pos N = N > 0 @-}
Enter the subtle art of giving descriptive names so that our specifications read more clearly. Consider declaring aliases for open intervals with freely oriented boundaries.
{-@ predicate InOpenInterval A B X =
(A != B) &&
((X > A && X < B) || (X > B && X < A)) @-}
{-@ type OpenInterval A B = { x:Float | InOpenInterval A B x } @-}
These aliases can then be used to prove, for instance, that an implementation
of an affine transformation, fromUnitInterval below, from the open unit interval to an
arbitrary interval is a bijection. The proof proceeds by supplying an inverse
function (toUnitInterval) and specifying1 that their composition is the identity.
The example shows one half of the proof; the other half is straightforward
and left to the reader.
type Bound = Float
{-@ inline fromUnitInterval @-}
{-@ fromUnitInterval :: a : Bound
-> { b : Bound | a != b }
-> x : OpenInterval 0 1
-> v : OpenInterval a b @-}
fromUnitInterval :: Bound -> Bound -> Float -> Float
fromUnitInterval a b x = a + x * (b - a)
{-@ inline toUnitInterval @-}
{-@ toUnitInterval :: a : Bound
-> { b : Bound | a != b }
-> x : OpenInterval a b
-> v : OpenInterval 0 1 @-}
toUnitInterval :: Bound -> Bound -> Float -> Float
toUnitInterval a b x = (x - a) / (b - a)
{-@ intervalId :: a : Bound
-> { b : Bound | a != b }
-> x : OpenInterval a b
-> {v : OpenInterval a b | x = v} @-}
intervalId :: Bound -> Bound -> Float -> Float
intervalId a b x = (fromUnitInterval a b . toUnitInterval a b) x
Another case: refining a Map type to a fixed length allows us to enforce that
a function can only grant access privileges to a bounded number of users at any
call site.
type Password = String
type Name = String
{-@ type FixedMap a b N = { m : Map a b | len m = N } @-}
{-@ giveAccess :: Name
-> Password
-> FixedMap Name Password 3
-> Bool @-}
giveAccess :: Name -> Password -> Map Name Password -> Bool
giveAccess name psswd users =
  Map.lookup name users == Just psswd
None of these specifications strictly require aliases, but they illustrate the practical convenience they bring.
A crowded name space
When we try to be simple and reasonable about such aliases, it becomes quite
likely for other people to converge on the same names to describe similar
types. Even a seemingly standard type such as Nat is not safe: someone
with a historically informed opinion might want to define it as strictly positive
numbers2, or may just prefer to refine Word8 instead of Int.
Naturally, this is the familiar problem of name scope, for which established
solutions exist, such as modules and local scopes. Yet for LH and its Nat, it
was the case that one would have to either invent a non-conflicting name,
exclude assumptions for the base package, or avoid
importing the Prelude altogether. It might be argued that having to invent
alternative names is a minor nuisance, but also that it can quickly lead to
unwieldy and convoluted naming conventions once multiple dependencies expose
their own specifications.
Simply stated, the problem was that LH imported all aliases from transitive dependencies into a flat namespace. After my contribution, LH still accumulates aliases transitively, but users gain two key capabilities: (i) to disambiguate occurrences by qualifying an identifier, and (ii) to overwrite an imported alias without conflict. In practice, this prevents spurious verification failures and gives the user explicit means to resolve clashes when they matter.
Consider the following scenario. Module A defines alias Foo. Two other
modules, B and B', both define an alias Bar and import A.
module A where
{-@ type Foo = { ... } @-}
module B where
import A
{-@ type Bar = { ... } @-}
module B' where
import A
{-@ type Bar = { ... } @-}
A module C that imports B and B' will now see Foo in scope unambiguously,
while any occurrence of Bar must be qualified in the usual Haskell manner.
module C where
import B
import B'
{-@ baz :: Foo -> B.Bar @-}
baz _ = undefined
Previously, this would have caused C to fail verification with a conflicting
definitions error, even if Bar was never used.
examples/B.hs:3:10: error:
Multiple definitions of Type Alias `Bar`
Conflicting definitions at
.
* examples/B.hs:3:10-39
.
* examples/B'.hs:3:10-39
|
3 | {-@ type Bar = { ... } @-}
| ^^^^^^^^^^^^^^
This error is now only triggered when the alias is defined multiple times within the same module. Instead, when an ambiguous type alias is found, the user is prompted to choose among the matching names in scope and directed to the offending symbol.
examples/C.hs:6:19: error:
Ambiguous specification symbol `Bar` for type alias
Could refer to any of the names
.
* Bar imported from module B defined at examples/B.hs:3:10-39
.
* Bar imported from module B' defined at examples/B'.hs:3:10-39
|
6 | {-@ baz :: Foo -> Bar @-}
| ^^^
The precise behavior is summarized in a set of explicit rules that I proposed, which specify how aliases are imported and exported under this scheme.
The initial name resolution flow
The project goals were initially put forward on a GitHub issue as a
spin-off from a recent refactoring of the codebase that changed the
internal representation of names to a structured LHName type that
distinguishes between resolved and unresolved names and stores information about
where the name originates, so that names are resolved only once for each compiled
module.
Name resolution has many moving parts, but in broad terms its implementation is divided into two phases: The first handles names corresponding to entities GHC knows of—data and type constructors, functions, and annotation binders of aliases, measures, and data constructors—and uses its global reader environment to look them up. The resolution of logical entities (i.e. those found in logical expressions) is left for the second phase, where the names resolved during the first phase are used to build custom lookup environments.
Occurrences of type and predicate aliases were resolved by looking them up in an environment indexed by their unqualified name. When two or more dependencies (possibly transitive) defined the same alias, resolution defaulted to whichever definition happened to be encountered first during collection. This accidental choice was effectively irrelevant, however, since a later duplicate-name check would short-circuit with the aforementioned error. Locally defined aliases were recorded in the module’s interface file after verification, and LH assembled the resolution environment by accumulating the aliases from the interface files of all transitive dependencies.
The reason a module import brings all aliases from transitive dependencies into scope is that no mechanism exists to declare which aliases a module exports or imports. Implementing such a mechanism exceeded the project’s allocated time, so a trade-off was called for. On the importing side, Haskell’s qualifying directives could be applied, but an explicit defaulting mechanism was needed to determine what aliases a module exposes. This left us with at least three possibilities:
- Export no aliases, so that they would be local to each module alone. This no-op solution would allow the user to use any names she wants, but quickly becomes inconvenient as an alias would have to be redefined in each module she intends to use it.
- Export only those locally defined, so that only aliases from direct dependencies would be in scope for any given module. This could leave out aliases used to specify re-exported functions, so we would end up in a similar situation as before.
- Export all aliases from transitive dependencies, avoiding the need to ever duplicate an alias definition.
The chosen option (3) reflects the former behavior and, complemented by the ability to qualify and overwrite aliases, it was deemed the most effective solution.
Qualifying type aliases
Type aliases are resolved during the first phase, essentially because they are parsed as type constructors, which are resolved uniformly across the input specification. Two changes had to be made to qualify them: include module import information in the resolution environment to discern which module aliases can be used to qualify an imported type alias, and make sure transitively imported aliases are stored in the interface file along with the locally defined type aliases.
Careful examination of the code revealed that we could reuse environments built for other features of LH that could be qualified already! And as a bonus, their lookup function returns close-match alternatives in case of failure. Factoring this out almost did the trick. In addition, I had to add some provisions to give precedence to locally defined aliases during lookups.
Qualifying predicate aliases
Two aspects of the code made predicate aliases somewhat hard to reason about.
First, predicate aliases are conflated in environments with
Haskell entities lifted by inline and define annotations.
The rationale is to use a single mechanism to expand these definitions in
logical expressions.
Second, the conflated environments were redundantly gathered twice with different purposes: to resolve Haskell function names in logical expressions, and afterwards again to resolve occurrences of predicate aliases.
Neither was straightforward to deduce from the code. These facts, together with some code comments from the past about predicate aliases being the last names that remained “unhandled”, pointed the way.
The surgical change, then, was to sieve out predicate aliases from the lifted Haskell functions as they were stored together in interface files, and include these predicate aliases in the environment used to resolve qualified names for other features.
Alias expansion
Although the problem I set out to solve was primarily about name resolution, the
implementation also required revisiting another process: alias expansion. For a
specification to be ready for constraint generation, all aliases must be fully
expanded (or unfolded), since liquid-fixpoint3 has no notion of aliases.
Uncovering this detail was crucial to advance with the implementation. It
clarified why Haskell functions lifted
with inline or define are eventually converted into predicate aliases: doing
so allows for every aliasing annotation to be expanded consistently in a single
pass wherever they appear in a specification. With qualified aliases, the
expansion mechanism needed some adjustments, as the alias names were now more
structured (LHName).
An additional complication was that the logic to expand type aliases was shared with predicate aliases, and since I did qualification of type aliases first, I needed to have different behavior for type and predicate aliases. In the end, I opted for duplicating the expansion logic for each case during the transition, and unified it again after implementing qualification of predicate aliases.
Closing remarks
My determination to understand implementation details was rewarded by insights that allowed me to refactor my way to a solution. For perspective, my contribution consisted of a 210 LOC addition for the feature implementation alone, after familiarizing myself with 2,150 LOC out of the 25,000 LOC making up the LH plugin. The bulk of this work is contained in two merged PRs (#2550 and #2566), which include detailed source documentation and tests.
The qualified aliases support and the explicit rules that govern it are a modest addition, but hopefully one of a positive impact on user experience. LH tries to be as close as possible to Haskell, but refinement type aliases still mark the boundary between both worlds. Perhaps the need for an ad hoc mechanism for importing and exporting logic entities will be revised in a horizon where LH gets integrated into GHC (which sounds good to me!).
This project taught me about many language features and introduced me to the GHC API; knowledge I will apply in future projects and to further contribute to the Haskell ecosystem. I am grateful to Facundo Domínguez for his generous and insightful mentoring, which kept a creative flow going throughout the project. Working on Liquid Haskell was lots of fun!
- Note that, in this example, the
inline annotation is used to translate the Haskell definitions into the logic so Liquid Haskell can unfold calls to these functions when verifying specifications.↩
- It took humanity quite a while to think clearly about a null quantity, and further still for it to play a fundamental role as a placeholder for positional number notation.↩
- liquid-fixpoint is the component of Liquid Haskell that transforms a module’s specification into a set of constraints for an external SMT solver.↩
- Introduction to the dependency graph
- Managing dependency graph in a large codebase
- The anatomy of a dependency graph
A dependency graph is a representation of how different parts of a software project rely on each other. Understanding the dependency graph helps a software engineer see the bigger picture of how their component fits into the whole project and why certain changes might affect other areas. It’s a useful tool for organizing, debugging, and improving the source code.
Engineers responsible for managing the development and build environments also benefit greatly from understanding dependency graph concepts and how they are used by the build system. This knowledge is crucial for optimizing build times since it allows engineers to identify opportunities to parallelize and improve the incrementality of builds. Understanding the dependency graph also helps in troubleshooting build failures, managing changes safely, and ensuring that updates or refactors do not worsen the overall design of the codebase.
In this blog post, we’ll take a fresh look at dependency graphs, starting from the basic concepts and building up from there. You will learn what a dependency graph is, some terminology required to be successful in managing it, and what it is used for.
What is a dependency graph?
A dependency graph is a visual map that explains the connectivity between parts of a software project.
Let’s use a contrived example of a dependency graph in a tiny codebase and lay out some key terminology.
Nodes and edges
A node in a dependency graph represents an individual item which can be a software package, a module, or a component.
The edges (connections) between nodes represent dependencies, meaning one node relies on another to function or build correctly.
Dependencies
appA depends on libX directly, therefore libX is a direct dependency of appA.
For example, if you import the requests package in your Python module,
this would be that module’s direct dependency.
appB depends on commons via libY, therefore commons is a transitive dependency of appB.
For example, if your C++ program depends on libcurl, then it also depends (transitively)
on every external library that libcurl depends on
such as OpenSSL or zlib.
Dependents
libX and libY directly depend on commons.
This could also be reversed — commons has two direct dependents: libX and libY.
In fact, the dependents are often called reverse dependencies.
Similarly, secrets has two reverse dependencies: one direct (appB) and one transitive (testB).
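To make the terminology concrete, the example graph described above can be written down as a plain adjacency mapping (edges point from dependent to dependency; any edge not spelled out in the text is an assumption for illustration):
GRAPH = {
    "appA": ["libX"],
    "appB": ["libY", "secrets"],
    "testB": ["appB"],
    "libX": ["commons"],
    "libY": ["commons"],
    "secrets": [],
    "commons": [],
}

# direct dependencies of appA, and direct dependents (reverse dependencies) of commons
print(GRAPH["appA"])                                             # ['libX']
print([node for node, deps in GRAPH.items() if "commons" in deps])  # ['libX', 'libY']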
Shape and orientation
A simple dependency graph can sometimes look like a tree, with one common base component at the root, supporting multiple dependents (components pointing back towards the root), which in turn are depended on by the leaves (components with no further dependents).
However, dependency graphs are usually more complex than trees and belong to a more general family of graphs known as directed acyclic graphs (DAG), where you can only follow the arrows in one direction, and you can never end up back at the same node you started from. We’ll talk about the word “acyclic” in more detail later in the post.
When describing this project, we could emphasize that commons is foundational -
the root that everything else builds upon.
Libraries and apps become the trunk and branches, with tests as leaves.
Without clearly defining how arrows show dependencies,
we might easily draw all arrows pointing the opposite way (a reverse dependency graph1):
This makes terms like “roots” or “leaves” potentially confusing, but it’s important to be aware of them as you will likely hear them being used when talking about graphs.
What is it used for?
Dependency graph concepts have lots of applications:
-
Dependency resolution techniques such as Minimal Version Selection and Backtracking are used by package managers.
-
In artifact-based build systems such as Bazel, a dependency graph is used to determine the order in which different parts of a project should be built. Having access to this allows building only what is necessary and in the correct sequence.
-
GNU Make uses a dependency graph implicitly through its rules: each target specifies its dependencies, and Make constructs a graph to determine the order in which to build targets.
-
Native programming language build tools use the dependency graph to fetch and build modules in the correct order, e.g., in Go, it is used to maintain a cache of passing test results (where
go test checks whether any of the transitive dependencies of the tests have changed since the last run).
Graph theory applications
Graph theory is a branch of mathematics focused on networks of connected items. Understanding some graph theory ideas can make managing dependencies much smarter. Being familiar with the terminology also helps to find relevant tooling, for instance, knowing that a part of the graph is called a subgraph would let you find more relevant results when searching for algorithms to extract a part of the graph.
Connected Components
A connected component is a group of nodes where each one can reach every other by following edges in either direction. In a dependency graph, this means a set of source code modules that are all linked together by a dependency link (or a reverse dependency link) — what’s important is that there is some sort of connection.
When two applications share modules in the same connected component, they become indirectly connected which might make it hard to test or deploy them separately. In a worse scenario, if the modules of these apps actually import from each other, then code changes in one app can unexpectedly break another. Applications with isolated dependencies are much easier to extract and move to separate repositories.
In the example below, the configuration is shared among three applications making them part of the same connected component. That is, you can’t move any of the applications along with the shared configuration out of the codebase. This could be refactored by splitting the shared configuration into separate configurations for each application.
After the split, making changes specific to appA no longer triggers rebuilds of all applications
and reruns of all their tests.
One connected component:
Three connected components:
Isolated nodes (nodes without any edges) are also connected components, and they may represent software units that are no longer needed. For instance, a program might have once used a third-party library, but later stopped using its functionality. If nothing else in the codebase depends on that library, it is now isolated, and can be removed to avoid rebuilding.
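A short Python sketch of finding connected components by ignoring edge direction and flood-filling; the edge list is a hypothetical stand-in for the shared-configuration example:
from collections import defaultdict

edges = [("appA", "shared-config"), ("appB", "shared-config"),
         ("appC", "shared-config"), ("orphan-lib", "orphan-helper")]

undirected = defaultdict(set)
for a, b in edges:
    undirected[a].add(b)
    undirected[b].add(a)

def components(nodes):
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(undirected[node] - group)
        seen |= group
        result.append(group)
    return result

print(components(list(undirected)))  # two components in this toy graph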
Cut Points and Bridges
A cut point (also called a “point of connection” or “articulation point”) is a node that, if removed, would split the graph into separate components. A bridge is an edge whose removal would produce a new connected component.
In the example below, if we stop depending on the third-party library third-party-lib,
we would stop depending transitively on all those third-party libraries
that third-party-lib brought into the dependency graph of our project.
To remove a “cut point” like third-party-lib, you can replace its functionality with an existing dependency or reimplement it yourself.
This can make builds faster (fewer downloads), more secure, and more reliable.
The npm left-pad incident shows
how third-party dependencies can cause problems.
Creating isolated groups in the dependency graph is often a good thing as it means those modules can now evolve, be tested, and deployed independently, reducing risk and complexity. However, in a large dependency graph, the hard part is to identify the best cut points as often breaking the dependency between two modules might still leave the part of the dependency graph you are concerned about connected to the rest of the codebase.
Breaking appA -> config1 (incorrectly assuming that this is a bridge)
would still leave appA connected to the rest of the codebase via the libX connection.
Identifying that libX might still lead to the rest of the codebase via a chain of connections is not trivial,
and to be able to refactor the dependency graph into something one can reason about,
it is often necessary to use advanced dependency graph querying and visualization tooling.
To estimate how much work it would be to break a connection, one can list all paths between your module
and the undesired dependency, which will be discussed later.
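If you already have the graph in networkx (an assumption; any graph library with articulation-point and bridge queries will do), surfacing candidates is one call per question. Both notions are defined on undirected graphs, so the direction is dropped first; node names here are hypothetical:
import networkx as nx

g = nx.DiGraph([
    ("appA", "libX"), ("appA", "config1"),
    ("libX", "third-party-lib"),
    ("third-party-lib", "transitive-dep-1"),
    ("third-party-lib", "transitive-dep-2"),
])

undirected = g.to_undirected()
print(sorted(nx.articulation_points(undirected)))  # includes 'third-party-lib'
print(sorted(nx.bridges(undirected)))              # every edge in this toy tree is a bridge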
Subgraphs
A subgraph is just a smaller part of the whole graph, focusing on a subset of nodes and their connections. Depending on the complexity and shape of your dependency graph, it might only make sense to interact with a subgraph of it. Take a look at the dependency graphs of the microservices at tech giants to appreciate the complexity of their dependency management.
Visualizing or analyzing a subgraph (e.g., all dependencies of a single service) helps you zoom in on what matters for your project. If the dependencies of a program are complicated, it may make sense to extract only its direct dependencies and their direct dependencies. In graph theory terms, this means focusing on nodes that are at most two degrees away from the program node. The degree of a node refers to the number of direct connections (dependencies) it has. We can extract a subgraph by limiting our view to nodes within a certain depth (in this case, a depth of two). By controlling the depth, you avoid being overwhelmed by the entire transitive chain of dependencies.
With the same dependency graph we had seen in the very first graph of the post,
we can extract the subgraph containing dependencies with depth of 2 for appB:
Transitivity
The transitive closure of a node in a graph is the set of all nodes that can be reached from that node by following edges. In the context of a dependency graph, the transitive closure2 of a module is the entire “tree” of things required for that module to work.
In this dependency graph,
both appA and appB depend on secrets (directly) and cloud (directly and transitively).
In this cluttered visualization of the graph, the direct dependency edge between appA/appB and cloud
could be removed for clarity as we already know that they are connected:
The process of simplifying the graph by removing edges that are implied by other edges is called transitive reduction. Keep in mind that you would not normally want to do this for any other reason than clearer visualization of the graph.
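Both operations are readily available in graph libraries; here is a sketch with networkx (assumed), using edges that mirror the appA/secrets/cloud example:
import networkx as nx

g = nx.DiGraph([("appA", "secrets"), ("appA", "cloud"), ("secrets", "cloud")])

print(nx.descendants(g, "appA"))               # {'secrets', 'cloud'}: everything appA needs
print(list(nx.transitive_reduction(g).edges))  # the redundant appA -> cloud edge is gone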
If your build tool tracks node dependencies by reading build metadata (stored in files maintained by engineers),
this information must stay up-to-date so the build system can correctly identify necessary build steps.
Imagine that at some point in time appA used to import some code from cloud, however, after some refactoring,
it doesn’t depend on it directly any longer:
Now, what if in the build metadata files, the direct dependencies of appA are still [cloud, secrets]?
Stale build metadata such as a redundant declaration of a direct dependency won’t be an issue
from the build system’s perspective: cloud will ultimately end up in the transitive closure of appA.
However, if after further refactorings, appA no longer depends on secrets, we end up with this graph used by the build system:
Since appA depends on cloud, it becomes dependent on the transitive closure of cloud
which might lead to slower build times (all resources that cloud depends on now need to be downloaded to build appA).
Paths
Finding paths between arbitrary modules in a dependency graph helps understand how different parts of your system are connected. In this context, we are primarily interested in finding simple paths — paths where all nodes visited are distinct.
By finding a path from module A to module B, you can see if changes in A might affect B (or vice versa). This helps estimate the risk of changes and debug issues that propagate through dependencies. For example, if a module contains source code under a specific license, you might want to ensure no paths from applications with incompatible licenses lead to it, preventing its inclusion in the application bundle.
With this contrived example of a dependency graph,
there are two paths from appA to commons:
- appA -> libX -> libY -> commons
- appA -> secrets -> commons
In a large, highly connected dependency graph, there may be hundreds of paths between two modules.
When listing paths, shortest paths help to understand the minimal set of dependencies connecting two modules. In contrast, the longest path between two modules tells you how deep the dependency chains are. The higher the average number of nodes in all paths in the graph, the more interconnected your codebase is. Having a very interconnected dependency graph might be problematic because it becomes hard to reason about how changes will propagate and a change in a low-level module can ripple through many layers, increasing the risk of unexpected breakages.
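To illustrate, listing and comparing these paths takes only a few lines with a graph library; a sketch with networkx, assuming the edges of the contrived example above:

import networkx as nx

graph = nx.DiGraph([
    ("appA", "libX"), ("libX", "libY"), ("libY", "commons"),
    ("appA", "secrets"), ("secrets", "commons"),
])

# All simple paths between the two modules (order of results may vary).
for path in nx.all_simple_paths(graph, "appA", "commons"):
    print(" -> ".join(path))  # appA -> libX -> libY -> commons / appA -> secrets -> commons

# The shortest path shows the minimal chain connecting them.
print(nx.shortest_path(graph, "appA", "commons"))  # ['appA', 'secrets', 'commons']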
Topological sort
Topological sort (or order) is a way of ordering the nodes in a dependency graph so that every node comes after all the nodes it depends on. A build system might use topological sort to determine what must be built first and which targets can be built in parallel.
Having access to this contrived dependency graph,
and oversimplifying what a modern build system would do with this dependency graph, we could produce a parallelizable list of build actions.
In order to build a particular node (say, produce a binary executable), we need to first build all nodes
that this node depends on (transitively).
For instance, let’s say we want to build appA:
- To build appA, we need to first build its direct dependency, libX.
- To build libX, we need to first build its direct dependencies, commons and secrets.
- commons and secrets can be built immediately as they do not have any dependencies.
This means that our dependency graph nodes would be sorted like this:
[secrets, commons], libX, appA
secrets and commons can be built in parallel, and once both of them are built,
we can start building libX and, thereafter, appA.
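Python's standard library ships graphlib.TopologicalSorter, which produces exactly this kind of batched order; a minimal sketch for the four modules above:

from graphlib import TopologicalSorter

# Maps each node to the set of nodes it depends on.
dependencies = {
    "appA": {"libX"},
    "libX": {"commons", "secrets"},
    "commons": set(),
    "secrets": set(),
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
while sorter.is_active():
    batch = sorter.get_ready()  # nodes whose dependencies are all built
    print(sorted(batch))        # each batch could be built in parallel
    sorter.done(*batch)
# ['commons', 'secrets']
# ['libX']
# ['appA']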
Parallelism emerges only when the graph has branches, that is, multiple independent subgraphs that can be built concurrently once their dependencies are satisfied. Practically, this means that flattening overly nested or serial dependencies can unlock better parallelism leading to faster builds.
In an extreme case, if your graph is in the shape of a linked list
such as app -> lib -> secrets -> commons, no parallelism can be achieved
because every node would need to wait for its dependency to be built first.
However, even when components must be built sequentially due to their dependencies,
parallelism can still occur within each component,
for instance, compiling multiple source files simultaneously within a single library.
Cycles
Cycles in a dependency graph mean that some components depend on each other in a loop, making it impossible to determine the order in the dependency chain. Build systems like Bazel require the dependency graph to be a directed graph without cycles (commonly known as Directed Acyclic Graph, or DAG) because cycles would lead to infinite build loops and prevent the system from knowing which component to build first.
With this graph having a cycle (libA -> libB -> libC), it is unclear
in what order dependencies of app should be built:
When adopting a build system that needs to construct a DAG out of your dependency graph, you might need to refactor the codebase to break cycles. This is particularly true for legacy codebases written in Python, JavaScript, or Ruby where native build tools might tolerate cycles in the dependency graph.
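Finding the offending cycles is something graph tooling can do for you. Here is a small networkx sketch, assuming an edge app -> libA plus the cycle libA -> libB -> libC -> libA from the example above:

import networkx as nx

graph = nx.DiGraph([
    ("app", "libA"), ("libA", "libB"), ("libB", "libC"), ("libC", "libA"),
])

print(nx.is_directed_acyclic_graph(graph))  # False
print(nx.find_cycle(graph, "app"))          # the edges forming the libA/libB/libC cycle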
A DAG is a very common data structure used by various build systems such as Bazel, Pants, and Buck2, process orchestration software such as Dagster, Flyte, and Airflow, and software engineering tooling such as Git.
In this post, we have reviewed the basic principles related to graph theory and talked about dependency graphs that consist of modules in a codebase. In sophisticated build systems, you’ll find that more kinds of graphs exist, with differences between them. In Bazel, there is a build graph (what we have called dependency graph in this post for simplicity) and an action graph that breaks down each component into specific actions (like compiling a file or generating code) that need to be executed. There are some more advanced kinds of graphs you might run into such as the evaluation graph (Skyframe graph) representing Bazel’s internal state (see skyscope to learn more) and the shadow dependency graph which is created when aspects are used.
In the next blog post, we will cover common problems associated with managing project dependencies and share best practices for keeping a large dependency graph healthy over time.
- The reversed dependency graph concept is useful in scenarios like impact analysis (e.g., “If changes are made to this core library, what other components will be affected?”).↩
- You won’t see this term often, but the transitive closure that also includes the node itself from which we start the search is called a reflexive transitive closure.↩
I will cover aspects that are required for putting custom queries into production. I’ll explain:
- how CodeQL sources are organized,
- what query metadata is,
- how to run CodeQL in GitHub Actions, and
- how to visualize results.
While the first two topics are specific to teams that need to write their own queries, the last two are applicable both to teams that write their own queries and to teams relying on the default queries shipped with CodeQL (which do capture a vast number of issues already).
I won’t dive deep on any topic, but rather give an overview of the features you will most likely need to put your own CodeQL queries into production. I’ll often link to GitHub’s official documentation, so that you have quick access to the documentation most useful to you. Finding what you need can be a bit of a challenge, because CodeQL’s documentation is spread over both https://docs.github.com/en/code-security and https://codeql.github.com/docs/.
Structure of CodeQL sources
There are four main types of CodeQL file:
- *.ql files are query files. A query is an executable request and a query file must contain exactly one query. I will describe the query syntax below. A query file cannot be imported by other files.
- *.qll files are library files. A library file can contain types and predicates, but it cannot contain a query. Library files can be imported.
- *.qls files are YAML files describing query suites. They are used to select queries, based on various filters such as a query's filename, name, or metadata. Query suites are documented in detail in the official documentation.
- *.qlpack files are YAML files describing packs. Packs are containers for the three previous kinds of files. A pack can either be a query pack, containing queries to be run; a library pack, containing code to be reused; or a model pack, which is an experimental kind of pack meant to extend existing CodeQL rules. Packs are described in detail here.

When developing custom queries, I need to wrap them in a query pack in order to declare which parts of the CodeQL standard library my queries depend on (here's an example showing how to depend on the Java standard library).
Queries in *.ql files have the following structure (as explained in more detail in the official documentation):
from /* ... variable declarations ... */
where /* ... logical formula ... */
select /* ... expressions ... */

This can be understood like an SQL query:
- First, the from clause declares typed variables that can be referenced in the rest of the query. Because types define predicates, this clause already constrains the possible instances returned by the where clause that follows.
- The where clause constrains the query to only return the variables that satisfy the logical formula it contains. It can be omitted, in which case all instances of variables with the type specified in the from clause are returned.
- The select clause limits the query to operate on the variables declared in the from clause. The select clause can also contain formatting instructions, so that the results of the query are more human readable.
To give an example, if I need to write a query to track
tainted data in Java, in a file named App.java, I'll start with the following
and refine the where clause iteratively, based on the query's results:
from DataFlow::Node node // A node in the syntax tree
where node.getLocation().getFile().toString() = "App" // .java extension is stripped
select node, "node in App"

select clauses must obey the following constraints with respect to the number of columns selected:
- A problem query (see below) must select an even number of columns. The format is supposed to be select var1, formatting_for_var1, var2, formatting_for_var2, ..., where formatting_for_var* must be an expression returning a string, as described earlier in the select paragraph. If you omit the formatting, the query is executed, but a warning is issued.
- A path-problem query must select four columns, the first three referring to syntax nodes and the fourth one a string describing the issue. This assumption is required by the CodeQL Query Results view in VSCode to show the results as paths (using the alerts style in the drop down):
Query metadata
The header of a query defines a set of properties called query metadata:
/**
* @name Code injection
* @description Interpreting unsanitized user input as code allows a malicious user to perform arbitrary
* code execution.
* @kind path-problem
* @problem.severity error
* ...
*/

Query metadata is documented in detail in CodeQL's official documentation. I don't want to repeat GitHub's documentation here, so I'm focusing on the important information:
- @kind can take two values: problem and path-problem. The former is for queries that flag one specific location, while the latter is for queries that track tainted data flow from a source to a sink.
- Severity of issues is defined through two means, depending on whether the query is considered a security-related one or not 🤷
  - @problem.severity is used for queries that don't have @tags security. @problem.severity can be one of error, warning, or recommendation.
  - @security-severity is a score between 0.0 and 10.0, for queries with @tags security.
Metadata is most useful for filtering queries in qls files.
This is used extensively in queries shipped with CodeQL itself, as visible for example in
security-experimental-selectors.yml1. To give an idea of the filtering capability, here is an excerpt of this file that declares filtering criteria:
- include:
    kind:
      - problem
      - path-problem
    precision:
      - high
      - very-high
    tags contain:
      - security
- exclude:
    query path:
      - Metrics/Summaries/FrameworkCoverage.ql
      - /Diagnostics/Internal/.*/
- exclude:
    tags contain:
      - modeleditor
      - modelgenerator

To smooth the introduction of CodeQL (and security tools in general), I recommend starting small and only reporting the most critical alerts at first (in other words: filtering aggressively). This helps to convince teammates that CodeQL reports useful insights, and it doesn't make the task of fixing security vulnerabilities look insurmountable.
Once the most critical alerts are fixed, I advise loosening the filtering, so that pressing — but not critical — issues can be addressed.
Running CodeQL in GitHub Actions
The following GitHub Actions are required to run CodeQL:
- github/codeql-action/init installs CodeQL and creates the database. It can be customized to specify the list of programming languages to analyze, as well as many other options. Customization is done in the YAML workflow file, or via an external YAML configuration file, as explained in the customize advanced setup documentation.
- github/codeql-action/autobuild is required if you are analyzing a compiled language (such as C# or Java, as opposed to Python). This action can either work out of the box, guessing what to do based on the presence of the build files that are idiomatic in your programming language's ecosystem. I must admit this is not very principled — you need to look up the corresponding documentation to see how CodeQL is going to behave for your programming language and platform. If the automatic behavior doesn't work out of the box, you can manually specify the build commands to perform.
- github/codeql-action/analyze runs the queries. Its results are used to populate the Security tab, as shown below.
Since the actions work out of the box on GitHub, replicating them in another CI/CD system is non-trivial: you will have to build your own solution.
Visualizing results
Once CodeQL executes successfully in CI, GitHub’s UI picks up its results automatically and shows them in the Security tab:
You may wonder why you cannot see the Security tab on the repository used to create this post’s screenshots yourself. This is because, as GitHub’s documentation explains, security alerts are only visible to people with the necessary rights to the repository. The required rights depend on whether the repository is owned by a user or an organisation. In any case, security alerts cannot be made visible to people who do not have at least some rights to the relevant repository. Clicking on View alerts brings up the main CodeQL view:
As visible in the screenshot, this view allows you to filter the alerts in multiple ways, as well as to select the branch from which the alerts are shown.
Conclusion
In this post, I covered multiple aspects that you need to know to put your custom queries in production. I described how CodeQL codebases are organized and the constraints that individual queries must obey. I described queries’ metadata and how metadata is used. I concluded by showing how to run queries in CI and how everyone in a team can visualize the alerts found. Equipped with this knowledge, I think you are ready to experiment with CodeQL and later pitch it to your stakeholders, as part of your security posture 😉
Performance testing helps in:
- Validating System Performance: Ensuring that the system performs well under expected load conditions.
- Identifying Bottlenecks: Detecting performance issues that could degrade the user experience.
- Ensuring Scalability: Verifying that the system can scale to accommodate increased load, and scale back down when load decreases.
- Improving User Experience: Providing a consistently smooth and responsive experience for end users, which increases loyalty.
Performance Testing process
Like other software development activities, performance testing is most effective when it follows a defined process. The process requires collaboration with other teams such as business, DevOps, system, and development teams.
Let’s explain the process with a real-world scenario. Imagine Wackadoo Corp wants to implement performance testing because they’ve noticed their e-commerce platform slows down dramatically during peak sales events, leading to frustrated customers and lost revenue. When this issue is raised to the performance engineers, they suspect it could be due to inadequate server capacity or inefficient database queries under heavy load and recommend running performance tests to pinpoint the problem. The engineers begin by gathering requirements, such as simulating 10,000 concurrent users while maintaining response times under 2 seconds, and then create test scripts to mimic real user behavior, like browsing products and completing checkouts.
A testing environment mirroring production is set up, and the scripts are executed while the system is closely monitored to ensure it handles the expected load. After the first test run, the engineers analyze the results and identify slow database queries as the primary bottleneck. They optimize the queries, add caching, and re-run the tests, repeating this process until the system meets all performance criteria. Once satisfied, they publish the final results, confirming the platform can now handle peak traffic smoothly, improving both customer experience and sales performance.
How to Apply Performance Testing
Like functional testing, performance testing should be integrated at every level of the system, starting from the unit level up. The test pyramid traditionally illustrates functional testing, with unit tests at the base, integration tests in the middle, and end-to-end or acceptance tests at the top. However, the non-functional aspect of testing—such as performance testing—often remains less visible within this structure. It is essential to apply appropriate non-functional tests at each stage to ensure a comprehensive evaluation. By conducting tailored performance tests across different levels, we can obtain early and timely feedback, enabling continuous assessment and improvement of the system’s performance.
Types of Performance Testing
There are several types of performance tests, each designed to evaluate different aspects of system performance. We can basically categorize performance testing with three main criteria:
- Load; for example, the number of virtual users
- The strategy for varying the load over time
- How long we apply performance testing
The following illustrates the different types of performance testing with regard to the three main criteria.
The three main criteria are a good starting point, but they don't completely characterize the types of performance tests. For example, we can also vary the type of load (for example, to test CPU-bound or I/O-heavy tasks) or the testing environment (for example, whether the system is allowed to scale up the number of instances).
Load Testing
Load testing is a basic form of performance testing that evaluates how a system behaves when subjected to a specific level of load. This specific load represents the optimal or expected amount of usage the system is designed to handle under normal conditions. The primary goal of load testing is to verify whether the system can deliver the expected responses while maintaining stability over an extended period. By applying this consistent load, performance engineers can observe the system’s performance metrics, such as response times, resource utilization, and throughput, to ensure it functions as intended.
- Basic and widely known form of performance testing
- Load tests are run under the optimum load of the system
- Load tests give a result that real users might face in production
- Easiest type to run in a CI/CD pipeline
Let’s make it clearer by again looking at Wackadoo Corp. Wackadoo Corp wants to test that a new feature is performing similarly to the system in production. The business team and performance engineers have agreed that the new feature should meet the following requirements while handling 5,000 concurrent users:
- It can handle 1,000 requests per second (rps)
- 95% of the response times are less than 1,000 ms
- Longest responses are less than 2,000 ms
- 0% error rate
- The test server is not exceeding 70% of CPU usage with 4GB of RAM
With these constraints in place, Wackadoo Corp can deploy the new feature in a testing environment and observe how it performs.
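To make this concrete, here is a minimal sketch of what such a load test could look like with Locust, a Python load-testing tool. The endpoints and task weights are illustrative assumptions, not Wackadoo's real traffic profile:

# locustfile.py -- e.g. run with: locust -f locustfile.py --users 5000 --spawn-rate 100
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    """Simulates one of the 5,000 concurrent users browsing and checking out."""

    wait_time = between(1, 3)  # think time between actions, in seconds

    @task(3)  # browsing happens three times as often as checking out
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "demo"})

The agreed thresholds (throughput, 95th percentile latency, error rate, CPU) are then checked against Locust's statistics and the monitoring of the test server.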
Stress Testing
Stress testing evaluates a system’s upper limits by pushing it beyond normal operation to simulate extreme conditions like high traffic or data processing. It identifies breaking points and assesses the system’s ability to recover from failures. This testing uncovers weaknesses, ensuring stability and performance during peak demand, and improves reliability and fault tolerance.
- Tests the upper limits of the system
- Requires more resources than load testing, to create more virtual users, etc.
- The boundary of the system should be investigated during the stress test
- Stress tests can break the system
- Stress tests can give us an idea about the performance of the system under heavy loads, such as promotional events like Black Friday
- Hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Wackadoo Corp wants to investigate the system behavior when exceeding the optimal users/responses so it decides to run a stress test. Performance engineers have the metrics for the upper limit of the system, so during the tests the load will be increased gradually until the peak level. The system can handle up to 10,000 concurrent users. The expectation is that the system will continue to respond, but the response metrics will degrade within the following expected limits:
- It can handle 800 requests per second (rps)
- 95% of the response times are less than 2,500 ms
- Longest responses are less than 5,000 ms
- 10% error rate
- The test server is around 95% of CPU usage with 4GB of RAM
If any of these limits are exceeded when monitoring in the test environment, then Wackadoo Corp knows it has a decision to make about resource scaling and its associated costs, if no further efficiencies can be made.
Spike Testing
A spike test is a type of performance test designed to evaluate how a system behaves when there is a sudden and significant increase or decrease in the amount of load it experiences. The primary objective of this test is to identify potential system failures or performance issues that may arise when the load changes unexpectedly or reaches levels that are outside the normal operating range.
By simulating these abrupt fluctuations in load, the spike test helps to uncover weaknesses in the system’s ability to handle rapid changes in demand. This type of testing is particularly useful for understanding how the system responds under stress and whether it can maintain stability and functionality when subjected to extreme variations in workload. Ultimately, the spike test provides valuable insights into the system’s resilience and helps ensure it can manage unexpected load changes without critical failures.
- Spike tests give us an idea about the behavior of the system under unexpected increases and decreases in load
- We can get an idea about how fast the system can scale up and scale down
- They can require additional performance testing tools, as not all tools support this load profile
- Good for some occasions like simulating push notifications, or critical announcements
- Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Let’s look at an example again, Wackadoo Corp wants to send push notifications to 20% of the mobile users at 3pm for Black Friday. They want to investigate the system behavior when the number of users increase and decrease suddenly so they want to run a spike test. The system can handle up to 10,000 concurrent users, so the load will be increased to this amount in 10 seconds and then decreased to 5,000 users in 10 seconds. The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:
- Maximum latency is 500ms
- 95% of the response times are less than 5,000 ms
- Longest responses are less than 10,000 ms
- 15% error rate
- The test server is around 95% of CPU usage but it should decrease when the load decreases
Again, if any of these expectations are broken, it may suggest to Wackadoo Corp that its resources are not sufficient.
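Some tools let you script such a load profile directly. As an illustration, here is a hedged sketch using Locust's LoadTestShape, with numbers mirroring the scenario above; the user classes generating the traffic are assumed to live in the same locustfile:

from locust import LoadTestShape

class SpikeShape(LoadTestShape):
    """Ramp to 10,000 users in 10 s, then drop suddenly to 5,000."""

    # (end of stage in seconds, target user count, spawn/stop rate per second)
    stages = [
        (10, 10_000, 1_000),   # sudden increase
        (20, 5_000, 1_000),    # sudden decrease
        (120, 5_000, 1_000),   # observe how the system recovers
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # returning None stops the test

When a LoadTestShape subclass is present in the locustfile, Locust drives the user count from its tick() method instead of the command-line flags.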
Endurance Testing (Soak Testing)
An endurance test focuses on evaluating the upper boundary of a system over an extended period of time. This test is designed to assess how the system behaves under sustained high load and whether it can maintain stability and performance over a prolonged duration.
The goal is to identify potential issues such as memory leaks, resource exhaustion, or degradation in performance that may occur when the system is pushed to its limits for an extended time. By simulating long-term usage scenarios, endurance testing helps uncover hidden problems that might not be evident during shorter tests. This approach ensures that the system remains reliable and efficient even when subjected to continuous high demand over an extended period.
- Soak tests run for a prolonged time
- They check the system stability when the load does not decrease for a long time
- Soak testing can give a better idea about the performance of the system for campaigns like Black Friday than the other tests, hence the need for a diverse testing strategy
- Hard to run in a CI/CD pipeline since it aims to test for a long period, which goes against the expected short feedback loop
This time, Wackadoo Corp wants to send push notifications to 10% of users every hour, starting from 10am until 10pm, for Black Friday to increase sales for a one-day 50%-off promotion. They want to investigate the system behavior when the number of users increases but the load stays stable between nominal and the upper boundary for a long time, so they run an endurance test. The system can handle up to 10,000 concurrent users, so the load will be increased to 8,000 users in 30 seconds and held there. The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:
- Maximum latency is 300ms
- 95% of the response times are less than 2,000 ms
- Longest responses are less than 3,000 ms
- 5% error rate
- The test server is around 90% of CPU usage
Scalability Testing
Scalability testing is a critical type of performance testing that evaluates how effectively a system can manage increased load by incorporating additional resources, such as servers, databases, or other infrastructure components. This testing determines whether the system can efficiently scale up to accommodate higher levels of demand as user activity or data volume grows.
By simulating scenarios where the load is progressively increased, scalability testing helps identify potential bottlenecks, resource limitations, or performance issues that may arise during expansion. This process ensures that the system can grow seamlessly to meet future requirements without compromising performance, stability, or user experience. Ultimately, scalability testing provides valuable insights into the system’s ability to adapt to growth, helping organizations plan for and support increasing demands over time.
- Scalability tests require collaboration for system monitoring and scaling
- They can require more load generators, depending on the performance testing tool (i.e. load the system, then spike it)
- They aim to check the behavior of the system during the scaling
- Very hard to run in a CI/CD pipeline since it requires the scaling to be orchestrated
Performance engineers at Wackadoo Corp want to see how the system scales when the load exceeds the upper boundary, so they perform a scalability test. The system can handle up to 10,000 concurrent users for one server, so this time the load will be increased gradually starting from 5,000 users, and every 2 minutes 1,000 users will join the system. The expectation is that the system keeps responding, but the response metrics increase with the load (as before) until after 10,000 users, when a new server should join the system, at which point we should observe the response metrics starting to decrease. Once scaling up is tested, we can continue with testing the scaling down by decreasing the number of users under the upper limit.
Volume Testing
Volume testing assesses the system’s behavior when it is populated with a substantial amount of data. The purpose of this testing is to evaluate how well the system performs and maintains stability under conditions of high data volume. By simulating scenarios where the system is loaded with large datasets, volume testing helps identify potential issues related to data handling, storage capacity, and processing efficiency.
This type of testing is particularly useful for uncovering problems such as slow response times, data corruption, or system crashes that may occur when managing extensive amounts of information. Additionally, volume testing ensures that the system can effectively store, retrieve, and process large volumes of data without compromising its overall performance or reliability.
- Volume tests simulate the system behavior when huge amounts of data are received
- They check if databases have any issue with indexing data
- For example, in a Black Friday sale scenario, with a massive surge of new users accessing the website simultaneously, they ensure that no users experience issues such as failed transactions, slow response times, or an inability to access the system
- Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Wackadoo Corp wants to attract more customers, so they implemented an "invite your friend" feature. The company plans to give a voucher to both members and invited members, which will result in a huge amount of database traffic. Performance engineers want to run a volume test, which mostly includes scenarios like inviting, registering, checking the voucher code state, and loading the checkout page. During the test, the load will increase to 5,000 users by adding 1,000 users every 2 minutes, and these users should simulate normal user behavior. After that, heavy write operations can start. As a result, we should expect the following:
- Maximum latency is 500ms
- 95% of the response times are less than 3,000 ms
- Longest responses are less than 5,000 ms
- 0% error rate
- The test server is around 90% of CPU usage
A failure here might suggest to Wackadoo Corp that its database service is a bottleneck.
Conclusion
Performance testing plays a crucial role in shaping the overall user experience because an application that performs poorly can easily lose users and damage its reputation. When performance problems are not detected and resolved early, the cost of fixing them later can increase dramatically, impacting both time and resources.
Moreover, collaboration between multiple departments, including development, operations, and business teams, is essential to ensure that the testing process aligns with real-world requirements and produces meaningful, actionable insights. Without this coordinated effort and knowledge base, performance testing may fail to deliver valuable outcomes or identify critical issues.
There are many distinct types of performance testing, each designed to assess the system’s behavior from a specific angle and under different conditions. Load testing can be easily adapted to the CI/CD pipeline; the other performance testing types can be more challenging, but they can still provide a lot of benefits.
In my next blog post, I will talk about my experiences on how we can apply performance testing continuously.
CodeQL is designed to do two things:
- Perform all kinds of quality and compliance checks. CodeQL's query language is expressive enough to describe a variety of patterns (e.g., "find any loop, enclosed in a function named foo, when the loop's body contains a call to function bar"). As such, it enables complex, semantic queries over codebases, which can uncover a wide range of issues and patterns.
- Track the flow of tainted data. Tainted data is data provided by a potentially malicious user. If tainted data is sent to critical operations (database requests, custom processes) without being sanitized, it can have catastrophic consequences, such as data loss, a data breach, arbitrary code execution, etc. Statements of your source code from where tainted data originates are called sources, while statements of your source code where tainted data is consumed are called sinks.
This tutorial is targeted at software and security engineers who want to try out CodeQL, focusing on the second use case from above. I explain how to set up CodeQL, how to write your first taint tracking query, and give a methodology for doing so. To dig deeper, you can check out the second article in this CodeQL series.
Writing the vulnerable code
First, I need to write some code to execute my query against. As the attack surface, I’m choosing calls to the sarge Python library, for three reasons:
- It is available on PyPI, so it is easy to install.
- It is niche enough that it is not already modeled in CodeQL's Python standard library, so out-of-the-box queries from CodeQL won't catch vulnerabilities that use sarge. We need to write our own rules.
- It performs calls to subprocess.Popen, which is a data sink. As a consequence, code calling sarge is prone to having command injection vulnerabilities.
For my data source, I use flask.
That’s because HTTP requests contain user-provided data, and as such,
they are modeled as data sources in CodeQL’s standard library.
With both sarge and flask in place, we can write the following vulnerable code:
from flask import Flask, request
import sarge
app = Flask(__name__)
@app.route("/", methods=["POST"])
def user_to_sarge_run():
    """This function shows a vulnerability: it forwards user input (through a POST request) to sarge.run."""
    print("/ handler")
    if request.method != "POST":
        return "Method not allowed"
    default_value = "default"
    received: str = request.form.get("key", "default")
    print(f"Received: {received}")
    sarge.run(received)  # Unsafe, don't do that!
    return "Called sarge"

To run the application locally, execute in one terminal:
> flask --debug run

In another terminal, trigger the vulnerability as follows:
> curl -X POST https://localhost:5000/ -d "key=ls"

Now observe that in the terminal running the app, the ls command (provided by the user! 💣) was executed:
/ handler
Received: ls
app.py __pycache__ README.md requirements.txt

Wow, pretty scary, right? What if I had passed the string rm -Rf ~/*? Now let's see how to catch this vulnerability with CodeQL.
Running CodeQL on the CLI
To run CodeQL on the CLI, I need to download the CodeQL binaries from the github/codeql-cli-binaries repository.
At the time of writing, there are CodeQL binaries for the three major platforms. Where I clone this repository doesn’t matter,
as long as the codeql binary ends up in PATH.
Then, because I am going to write my own queries (as opposed to solely using the queries shipped with CodeQL),
I need to clone CodeQL’s standard library: github/codeql.
I recommend putting this repository in a folder that is a sibling of the repository being analyzed.
In this manner, the codeql binary will find it automatically.
Before I write my own query, let’s run standard CodeQL queries for Python. First, I need to create a database. Instead of analyzing code at each run, CodeQL’s way of operating is to:
- Store the code in a database,
- Then run one or many queries on the database.
While I develop a query, and so iterate on step 2 above, having the two steps distinct saves computing time. As long as the code being analyzed doesn't change, there is no need to rebuild the database. Let's build the database as follows:
> codeql database create --language=python codeql-db --source-root=.Now that the database is created, let’s call the python-security-and-quality (a set of default queries for Python, provided
by CodeQL’s standard library) queries3:
> codeql database analyze codeql-db python-security-and-quality --format=sarif-latest --output=codeql.sarif
# Now, transform the SARIF output into CSV, for better human readability; using https://pypi.org/project/sarif-tools/
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/unused-local-variable,Variable default_value is not used.,app.py,12

Indeed, in the snippet above, it looks like the developer intended to use a variable to store the value "default" but forgot to use it in the end.
This is not a security vulnerability, but it exemplifies the kind of programming mistakes that CodeQL’s default rules find.
Note that the vulnerability of passing data from the POST request to the sarge.run call is not yet caught. That is because
sarge is not in CodeQL’s list of supported Python libraries.
Writing a query to model sarge.run: modeling the source
The sarge.run function executes a command, like
subprocess does. As such it is a sink for tainted data:
one should make sure that data passed to sarge.run is controlled.
CodeQL performs a modular analysis: it doesn’t inspect the source code of your dependencies. As a consequence,
you need to model your dependencies’ behavior for them to be treated correctly by CodeQL’s analysis.
Modeling tainted sources and sinks is done by implementing the
DataFlow::ConfigSig interface:
/** An input configuration for data flow. */
signature module ConfigSig {
/** Holds if `source` is a relevant data flow source. */
predicate isSource(Node source);
/** Holds if `sink` is a relevant data flow sink. */
predicate isSink(Node sink);
}

In this snippet, a predicate is a function returning a Boolean, while Node is a class modeling statements in the source code.
So to implement isSource I need to capture the Nodes that we deem relevant sources of tainted data w.r.t. sarge.run.
Since any source of tainted data is dangerous if you send its content to sarge.run, I implement isSource as follows:
predicate isSource(DataFlow::Node source) { source instanceof ActiveThreatModelSource }

Threat models
control which sources of data are considered dangerous. Usually, only remote sources (data in an HTTP request,
packets from the network) are considered dangerous. That’s because, if local sources (content of local files, content passed by the user in the terminal)
are tainted, it means an attacker already has such a level of control over your software that you are doomed.
That is why, by default, CodeQL’s default threat model is to only consider remote sources.1
In isSource, by using ActiveThreatModelSource, we declare that the sources of interest are the sources of the current active threat model.
To make sure that ActiveThreatModelSource works correctly on my codebase, I write the following test query in file Scratch.ql:
import python
import semmle.python.Concepts
from ActiveThreatModelSource src
select src, "Tainted data source"

Because this file depends on the python APIs of CodeQL, I need to put a qlpack.yml file close to Scratch.ql, as follows:
name: smelc/sarge-queries
version: 0.0.1
extractor: python
library: false
dependencies:
  codeql/python-queries: "*"

I can now execute Scratch.ql as follows:
> codeql database analyze codeql-db queries/Scratch.ql --format=sarif-latest --output=codeql.sarif
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/get-remote-flow-source,Tainted data source,app.py,1

This seems correct: something is flagged. Let's make it more visual by running the query in VSCode.
For that I need to install the CodeQL extension.
To run queries within VSCode, I first need to specify the database to use. It is the codeql-db folder which
we created with codeql database create above:
Now I run the query by right-clicking in its opened file:
Doing so opens the CodeQL results view:
I see that the import of request is flagged as a potential data source. This is correct: in my program,
tainted data can come through usages of this package.
Writing a query to model sarge.run: modeling the sink
This is where things get more interesting. As per the ConfigSig interface above, I need to implement isSink(Node sink),
so that it captures calls to sarge.run. Because CodeQL is a declarative2 object-oriented language, this means isSink must return true
for subclasses of Node that represent calls to sarge.run. Let me describe a methodology to discover how to do that.
First, modify the Scratch.ql query to find out all instances of Node in my application:
import python
import semmle.python.dataflow.new.DataFlow
from DataFlow::Node src
select src, "DataFlow::Node"

Executing this query in VSCode yields the following results:
Wow, that’s a lot of results! In a real codebase with multiple files, this would be unmanageable.
Fortunately code completion works in CodeQL, so I can filter the results using the where clause, discovering
the methods to call by looking at completions on the . symbol. Since the call to sarge.run I am looking for is at line 17,
I can refine the query as follows:
from DataFlow::Node src, Location loc
where src.getLocation() = loc
and loc.getFile().getBaseName() = "app.py"
and loc.getStartLine() = 17
select src, "DataFlow::Node"

With these constraints, the query returns only a handful of results:
Still, there are 4 hits on line 17. Let’s see how I can disambiguate those. For this, CodeQL provides the getAQlClass predicate
that returns the most specific type a variable has (as explained in
CodeQL zero to hero part 3):
from DataFlow::Node src, Location loc
where src.getLocation() = loc
and loc.getFile().getBaseName() = "app.py"
and loc.getStartLine() = 17
select src, src.getAQlClass(), "DataFlow::Node"

See how the select clause now includes src.getAQlClass() as second element. This makes the CodeQL Query Results show it
in the central column:
There are many more results, and that is because entries that were indistinguishable before are now disambiguated by the class.
If in doubt, one can consult the list of classes of CodeQL's standard Python library
to understand what each class is about. In our case, I had read the
official documentation on using CodeQL for Python,
and I recognize the CallNode class from this list.
As the documentation explains, there is actually an API to retrieve CallNode instances corresponding to functions imported from a distant module, using
the moduleImport function. Let's use it to restrict our Nodes to instances of CallNode (using a cast) and
to require that the call is a call to sarge.run:
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs
from DataFlow::Node src
where src.(API::CallNode) = API::moduleImport("sarge").getMember("run").getACall()
select src, "CallNode calling sarge.run"

Executing this query yields the only result we want:
Putting this all together, I can finalize the implementation of ConfigSig as shown below.
The getArg(0) suffix models that the tainted data flows into sarge.run’s first argument:
private module SargeConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
source instanceof ActiveThreatModelSource
}
predicate isSink(DataFlow::Node sink) {
sink = API::moduleImport("sarge").getMember("run").getACall().getArg(0)
}
}

Following the official template for queries tracking tainted data, I write the query as follows:
module SargeFlow = TaintTracking::Global<SargeConfig>;
from SargeFlow::PathNode source, SargeFlow::PathNode sink
where SargeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Tainted data passed to sarge"

Executing this query in VSCode returns the paths (list of steps) along which the vulnerability takes place:
Conclusion
I have demonstrated how to use CodeQL to model a Python library, covering the setup and the steps a developer must take to write their first CodeQL query. I gave a methodology for writing instances of CodeQL interfaces, even when one lacks intimate knowledge of CodeQL APIs. I believe this is important, as the CodeQL ecosystem is small and resources are limited: users of CodeQL often have to find out what to write on their own, with limited support from both the tooling and generative AI tools (which perform poorly here, probably for the same reason).
To dive deeper, I recommend that you:
- read the official CodeQL for Python resource,
- join the GitHub Security Lab Slack to get support from CodeQL users and developers, and
- read the second article in this CodeQL series.
And remember that this tutorial’s material is available at tweag/sarge-codeql-minimal if you want to experiment with this tutorial yourself!
- In this third post of our series, we provide a tool dedicated to handling CodeQL’s SARIF output in monorepos.↩
- The default threat model can be overridden by command line flags and by configuration files.↩
- CodeQL belongs to the Datalog family of languages.↩
At Tweag, we adhere to high standards for reproducible builds, which Buck2 doesn’t fully uphold in its vanilla configuration. In this post, we will introduce our ruleset that provides integration with Nix. I’ll demonstrate how it can be used, and you will gain insights into how to leverage Nix to achieve more reliable and reproducible builds with Buck2.
Reproducibility, anyone?
In short, Buck2 is a fast, polyglot build tool very similar to Bazel. Notably, it also provides fine-grained distributed caching and even speaks (in its open source variant) the same remote caching and execution protocols used by Bazel. This means you’re able to utilize the same Bazel services available for caching and remote execution.
However, in contrast to Bazel, Buck2 uses a remote-first approach and does not restrict build actions using a sandbox on the local machine. As a result, build actions can be non-hermetic, meaning their outcome might depend on what files or programs happen to be present on the local machine. This lack of hermeticity can lead to non-reproducible builds, which is a critical concern for the effective caching of build artifacts.
Non-hermeticity issues can be elusive, often surfacing unexpectedly for new developers, which affects the onboarding of new team members or open source contributors. If left undetected, they can even cause problems down the line in production, which is why we think reproducible builds are important!
Achieving Reproducibility with Nix
If we want reproducible builds, we must not rely on anything installed on the local machine. We need to precisely control every compiler and build tool used in our project. Although defining each and every one of these inside the Buck2 build itself is possible, it would also be a lot of work. Nix can be the solution to this problem.
Nix is a package manager and build system for Linux and Unix-like operating systems. With nixpkgs, there is a very large and comprehensive collection of software packaged using Nix, which is extensible and can be adapted to one’s needs. Most importantly, Nix already strictly enforces hermeticity for its package builds and the nixpkgs collection goes to great lengths to achieve reproducible builds.
So, using Nix to provide compilers and build tools for Buck2 is a way to benefit from that preexisting work and introduce hermetic toolchains into a Buck2 build.
Let’s first quickly look into the Nix setup and proceed with how we can integrate it into Buck2 later.
Nix with flakes
After installing Nix, the nix command is available, and we can start declaring dependencies on packages from nixpkgs in a nix file. The Nix tool uses the Nix language, a domain-specific, purely functional and lazily evaluated programming language to define packages and declare dependencies. The language has some wrinkles, but don’t worry; we’ll only use basic expressions without delving into the more advanced concepts.
For example, here is a simple flake.nix which provides the Rust compiler as a package output:
{
inputs = {
nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
};
outputs = { self, nixpkgs }:
{
packages = {
aarch64-darwin.rustc = nixpkgs.legacyPackages.aarch64-darwin.rustc;
x86_64-linux.rustc = nixpkgs.legacyPackages.x86_64-linux.rustc;
};
};
}

Note: While flakes have been widely used for a long time, the feature still needs to be enabled explicitly by setting extra-experimental-features = nix-command flakes in the configuration. See the wiki for more information.
In essence, a Nix flake is a Nix expression following a specific schema. It defines its inputs (usually other flakes) and outputs (e.g. packages) which depend on the inputs. In this example the rustc package from nixpkgs is re-used for the output of this flake, but more complex expressions could be used just as well.
Inspecting this flake shows the following output:
$ nix flake show --all-systems
path:/source/project?lastModified=1745857313&narHash=sha256-e1sxfj1DZbRjhHWF7xfiI3wc1BpyqWQ3nLvXBKDya%2Bg%3D
└───packages
├───aarch64-darwin
│ └───rustc: package 'rustc-wrapper-1.86.0'
└───x86_64-linux
└───rustc: package 'rustc-wrapper-1.86.0'

In order to build the rustc package output, we can call Nix in the directory of the flake.nix file like this: nix build '.#rustc'. This will either fetch pre-built artifacts of this package from a binary cache if available, or directly build the package if not. The result is the same in both cases: the rustc package output will be available in the local nix store, and from there it can be used just like other software on the system.
$ nix build --print-out-paths '.#rustc'
/nix/store/ssid482a107q5vw18l9millwnpp4rgxb-rustc-wrapper-1.86.0-man
/nix/store/szc39h0qqfs4fvvln0c59pz99q90zzdn-rustc-wrapper-1.86.0

The output displayed above illustrates that a Nix build of a single package can produce multiple outputs. In this case the rustc package was split into a default output and an additional, separate output for the man pages.
The default output contains the main binaries such as the Rust compiler:
$ /nix/store/szc39h0qqfs4fvvln0c59pz99q90zzdn-rustc-wrapper-1.86.0/bin/rustc --version
rustc 1.86.0 (05f9846f8 2025-03-31) (built from a source tarball)

It is also important to note that the output of a Nix package depends on the specific nixpkgs revision stored in the flake.lock file, rather than any changes in the local environment. This ensures that each developer checking out the project at any point in time will receive the exact same (reproducible) output no matter what.
Using Buck2
As part of our work for Mercury, a company providing financial services, we developed rules for Buck2 which can be used to integrate packages provided by a nix flake as part of a project’s build. Recently, we have been able to publish these rules, called buck2.nix, as open source under the Apache 2 license.
To use these rules, you need to make them available in your project first. Add the following configuration to your .buckconfig:
[cells]
nix = none
[external_cells]
nix = git
[external_cell_nix]
git_origin = https://github.com/tweag/buck2.nix.git
commit_hash = accae8c8924b3b51788d0fbd6ac90049cdf4f45a # change to use a different version

This configures a cell called nix to be fetched from the specified repository on GitHub. Once set up, you can refer to that cell in your BUCK files and load rules from it.
Note: for clarity, I am going to indicate the file name in the topmost comment of a code block when it is not obvious from the context already
To utilize a Nix package from Buck2, we need to introduce a new target that runs nix build inside of a build action producing a symbolic link to the nix store path as the build output. Here is how to do that using buck2.nix:
# BUCK
load("@nix//flake.bzl", "flake")
flake.package(
name = "rustc",
binary = "rustc",
path = "nix", # path to a nix flake
package = "rustc", # which package to build, default is the value of the `name` attribute
output = "out", # which output to build, this is the default
)

Note: this assumes the flake.nix and the accompanying flake.lock file are found alongside the BUCK file in the nix subdirectory
With this build file in place, a new target called rustc is made available which builds the output called out of the rustc package of the given flake. This target can be used as a dependency of other rules in order to generate an output artifact:
# BUCK
genrule(
name = "rust-info",
out = "rust-info.txt",
cmd = "$(exe :rustc) --version > ${OUT}"
)

Note: Buck2 supports expanding references in string parameters using macros, such as the $(exe ) part in the cmd parameter above which expands to the path of the executable output of the :rustc target
Using Buck2 (from nixpkgs of course!) to build the rust-info target yields:
$ nix run nixpkgs#buck2 -- build --show-simple-output :rust-info
Build ID: f3fec86b-b79f-4d8e-80c7-acea297d4a64
Loading targets. Remaining 0/10 24 dirs read, 97 targets declared
Analyzing targets. Remaining 0/20 5 actions, 5 artifacts declared
Executing actions. Remaining 0/5 9.6s exec time total
Command: build. Finished 2 local
Time elapsed: 10.5s
BUILD SUCCEEDED
buck-out/v2/gen/root/904931f735703749/__rust-info__/out/rust-info.txt
$ cat buck-out/v2/gen/root/904931f735703749/__rust-info__/out/rust-info.txt
rustc 1.86.0 (05f9846f8 2025-03-31) (built from a source tarball)

For this one-off command we just ran buck2 from the nixpkgs flake on the current system. This is nice for illustration, but it is also not reproducible, and you'll probably end up with a different Buck2 version when you try this on your machine.
In order to provide the same Buck2 version consistently, let’s add another Nix flake to our project:
# flake.nix
{
inputs = {
nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
};
outputs = { self, nixpkgs }:
{
devShells.aarch64-darwin.default =
nixpkgs.legacyPackages.aarch64-darwin.mkShellNoCC {
name = "buck2-shell";
packages = [ nixpkgs.legacyPackages.aarch64-darwin.buck2 ];
};
devShells.x86_64-linux.default =
nixpkgs.legacyPackages.x86_64-linux.mkShellNoCC {
name = "buck2-shell";
packages = [ nixpkgs.legacyPackages.x86_64-linux.buck2 ];
};
};
nixConfig.bash-prompt = "(nix) \\$ "; # visual clue if inside the shell
}

This flake defines a default development environment, or dev shell for short. It uses the mkShellNoCC function from nixpkgs which creates an environment where the programs from the given packages are available in PATH.
After entering the shell by running nix develop in the directory of the flake.nix file, the buck2 command has the exact same version for everyone working on the project as long as the committed flake.lock file is not changed. For convenience, consider using direnv which automates entering the dev shell as soon as changing into the project directory.
Hello Rust
With all of that in place, let’s have a look at how to build something more interesting, like a Rust project.
Similar to the genrule above, it would be possible to define custom rules utilizing the :rustc target to compile real-world Rust projects. However, Buck2 already ships with rules for various languages in its prelude, including rules to build Rust libraries and binaries.
In a default project setup with Rust these rules would simply use whatever Rust compiler is installed in the system, which may cause build failures due to version mismatches.
To avoid this non-hermeticity, we’re going to instruct the Buck2 rules to use our pinned Rust version from nixpkgs.
Let’s start by preparing such a default setup for the infamous “hello world” example in Rust:
# src/hello.rs
fn main() {
println!("Hello, world!");
}

# src/BUCK
rust_binary(
name = "hello",
srcs = ["hello.rs"],
)

Toolchains
What’s left to do to make these actually work is to provide a Rust toolchain. In this context, a toolchain is a configuration that specifies a set of tools for building a project, such as the compiler, the linker, and various command-line tools. In this way, toolchains are decoupled from the actual rule definitions and can be easily changed to suit one’s needs.
In Buck2, toolchains are expected to be available in the toolchains cell under a specific name. Conventionally, the toolchains cell is located in the toolchains directory of a project. For example, all the Rust rules depend on the target toolchains//:rust which is defined in toolchains/BUCK and must provide Rust specific toolchain information.
Luckily, we do not need to define a toolchain rule ourselves but can re-use the nix_rust_toolchain rule from buck2.nix:
# toolchains/BUCK
load("@nix//toolchains:rust.bzl", "nix_rust_toolchain")
flake.package(
name = "clippy",
binary = "clippy-driver",
path = "nix",
)
flake.package(
name = "rustc",
binaries = ["rustdoc"],
binary = "rustc",
path = "nix",
)
nix_rust_toolchain(
name = "rust",
clippy = ":clippy",
default_edition = "2021",
rustc = ":rustc",
rustdoc = ":rustc[rustdoc]",
visibility = ["PUBLIC"],
)

The rustc target is defined almost identically to before, but the nix_rust_toolchain rule also expects the rustdoc attribute to be present. In this case, the rustdoc binary is available from the rustc Nix package as well and can be referenced using the sub-target syntax :rustc[rustdoc] which refers to the corresponding item of the binaries attribute given to the flake.package rule.
Additionally, we need to pass in the clippy-driver binary, which is available from the clippy package in the nixpkgs collection. Thus, the flake.nix file needs to be changed by adding the clippy package outputs:
# toolchains/nix/flake.nix
{
inputs = {
nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
};
outputs =
{
self,
nixpkgs,
}:
{
packages = {
aarch64-darwin.rustc = nixpkgs.legacyPackages.aarch64-darwin.rustc;
aarch64-darwin.clippy = nixpkgs.legacyPackages.aarch64-darwin.clippy;
x86_64-linux.rustc = nixpkgs.legacyPackages.x86_64-linux.rustc;
x86_64-linux.clippy = nixpkgs.legacyPackages.x86_64-linux.clippy;
};
};
}
At this point we are able to successfully build and run the target src:hello:
(nix) $ buck2 run src:hello
Build ID: 530a4620-bfb2-454d-bae1-e937ae9e764f
Analyzing targets. Remaining 0/53 75 actions, 101 artifacts declared
Executing actions. Remaining 0/11 1.1s exec time total
Command: run. Finished 3 local
Time elapsed: 0.7s
BUILD SUCCEEDED
Hello, world!
Building a real-world Rust project would be a bit more involved. Here is an interesting article on how one can do that using Bazel.
Note that buck2.nix currently also provides toolchain rules for C/C++ and Python. Have a look at the example project provided by buck2.nix, which you can directly use as a template to start your own project:
$ nix flake new --template github:tweag/buck2.nix my-project
A big thank you to Mercury for their support and for encouraging us to share these rules as open source! If you’re looking for a different toolchain or have other suggestions, feel free to open a new issue. Pull requests are very welcome, too!
If you’re interested in exploring a more tightly integrated solution, you might want to take a look at the buck2-nix project, which also provides Nix integration. Since it defines an alternative prelude that completely replaces Buck2’s built-in rules, we could not use it in our project but drew good inspiration from it.
Conclusion
With the setup shown, we saw that all that is needed really is Nix (pun intended1):
- we provide the buck2 binary with Nix as part of a development environment
- we leverage Nix inside Buck2 to provide build tools such as compilers, their required utilities and third-party libraries in a reproducible way
Consequently, onboarding new team members no longer means following seemingly endless and quickly outdated installation instructions. Installing Nix is easy; entering the dev shell is fast, and you’re up and running in no time!
And using Buck2 gives us fast, incremental builds by only building the minimal set of dependencies needed for a specific target.
Next time, I will delve into how we seamlessly integrated the Haskell toolchain libraries from Nix and how we made it fast as well.
- The name Nix is derived from the Dutch word niks, meaning nothing; build actions don’t see anything that hasn’t been explicitly declared as an input↩
Feature Flags in Frontend Development
Feature flags (or feature toggles) are runtime-controlled switches that let you enable or disable features without redeploying your application.
For example, imagine you are working on a new feature that requires significant changes to the UI. By using feature flags, you can deploy the changes to all environments but only enable the feature in specific ones (like development or UAT), or for a subset of users in a single environment (like users on a Pro subscription). This allows you to test the feature without exposing it to unintended users, reducing the risk of introducing bugs or breaking changes. And if things go wrong, like a feature not working as expected, you can easily disable it without having to roll back the entire deployment.
What is LaunchDarkly?
LaunchDarkly is a feature management platform that provides an easy and scalable way to wrap parts of your code (new features, UI elements, backend changes) in flags so they can be turned on/off without redeploying. It provides a user-friendly dashboard to manage and observe flags, and supports over a dozen SDKs for client/server platforms. In my experience, LaunchDarkly is easier to use — including for non-technical users — and more scalable than most home-grown feature flag solutions.
LaunchDarkly supports targeting and segmentation, so you can control which users see specific features based on things like a user’s location or subscription plan. It also offers solid tooling for running experiments, including A/B testing and progressive rollouts (where a new feature is released to users in stages, rather than all at once). All feature flags can be updated in real-time, meaning that there’s no need for users to refresh the page to see changes.
Those are just my favorites, but if you are interested in learning more about it, LaunchDarkly has a blog post with more information.
Flag Evaluations
LaunchDarkly flags have unique identifiers called flag keys that are defined in the LaunchDarkly dashboard. When you request a flag value, supported client-side SDKs (such as React, iOS, Android, or, now, Svelte) send the flag key along with user information (called the “context”) to LaunchDarkly. LaunchDarkly’s server computes the value of the flag using all the applicable rules (the rules are applied in order) and sends the result back to the app. This process is called flag evaluation. By default, LaunchDarkly uses streaming connections to update flags in real time. This lets you flip flags in the dashboard and see the effect almost instantly in your app.
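To make this concrete, here is a rough sketch of a single evaluation using the plain JavaScript client-side SDK (the client-side ID, flag key, and context below are placeholders):
import { initialize } from 'launchdarkly-js-client-sdk';

// Create a client for a given context; LaunchDarkly evaluates the flags
// and streams results back to this client.
const client = initialize('your-client-side-id', { kind: 'user', key: 'user-key' });
await client.waitForInitialization();

// Read the evaluated value, falling back to `false` if the flag is unknown.
const showNewFeature = client.variation('my-feature-flag', false);

// React to real-time updates pushed over the streaming connection.
client.on('change:my-feature-flag', (newValue) => {
  console.log('flag changed to', newValue);
});
The Svelte SDK described below wraps this same flow in components and stores, so you rarely need to call these methods directly.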
Svelte in Brief
Svelte is a modern JavaScript framework that I’ve come to appreciate for its performance, simplicity, and excellent developer experience. What I particularly like about Svelte is that it lets you write reactive code directly using standard JavaScript variables, with an intuitive syntax that requires less boilerplate than traditional React or Vue applications. Reactive declarations and stores are built into the framework, so you don’t need Redux or similar external state management libraries for most use cases.
Svelte’s Approach
- Superior Runtime Performance: Svelte doesn’t rely on a virtual DOM. By eliminating the virtual DOM and manipulating the real DOM directly, Svelte can update the UI more quickly and efficiently, resulting in a more responsive application.
- Faster Load Times: Svelte’s compilation process generates smaller JavaScript bundles and more efficient code, resulting in faster initial page load times compared to frameworks that ship runtime libraries to the browser.
A Simple Example of a Svelte Component
In this example, we define a SimpleCounter component that increments a count when a button is clicked. The count variable is reactive, meaning that any changes to it will automatically update the UI.
// SimpleCounter.svelte
<script lang="ts">
let count = $state(0);
</script>
<button onclick={() => count++}>
clicks: {count}
</button>
Now, we can use this component in our application, which is in fact just another Svelte component. For example, App.svelte:
// App.svelte
<script lang="ts">
import SimpleCounter from './SimpleCounter.svelte';
</script>
<SimpleCounter />
After doing this, we can end up with something like this:

Overview of the LaunchDarkly Svelte SDK
Why Use a Dedicated Svelte SDK?
Although LaunchDarkly’s vanilla JavaScript SDK could be used in a Svelte application, this new SDK aligns better with Svelte’s reactivity model and integrates with Svelte-tailored components, allowing us to use LaunchDarkly’s features more idiomatically in our Svelte projects. I originally developed it as a standalone project and then contributed it upstream to be an official part of the LaunchDarkly SDK.
Introduction to LaunchDarkly Svelte SDK
Here are some basic steps to get started with the LaunchDarkly Svelte SDK:
1. Install the Package: First, install the SDK package in your project.
Note: Since the official LaunchDarkly Svelte SDK has not been released yet, for the purposes of this blog post I’ve created a temporary package available on npm that contains the same code as the official repo. You can still check the source code in LaunchDarkly’s official repository.
npm install @nosnibor89/svelte-client-sdk
2. Initialize the SDK: Next, you need to initialize the SDK with your LaunchDarkly client-side ID (you need a LaunchDarkly account). This is done using the LDProvider component, which provides the necessary context for feature flag evaluation. Here is an example of how to set it up:
<script lang="ts">
  import { LDProvider } from '@nosnibor89/svelte-client-sdk';
  import MyLayout from './MyLayout.svelte';

  // Use context relevant to your application. More info at https://docs.launchdarkly.com/home/observability/contexts
  const context = {
    user: {
      key: 'user-key',
    },
  };
</script>

<LDProvider clientID="your-client-side-id" {context}>
  <MyLayout />
</LDProvider>
Let’s clarify the code above:
- Notice how I wrapped the MyLayout component with the LDProvider component. Usually, you will wrap a high-level component that encompasses most of your application with LDProvider, although it’s up to you and how you want to structure the app.
- You can also notice two parameters provided to our LDProvider. The "your-client-side-id" refers to the LaunchDarkly Client ID and the context object refers to the LaunchDarkly Context used to evaluate feature flags. This is necessary information we need to provide for the SDK to work correctly.
3. Evaluate a flag: The SDK provides the LDFlag component for evaluating your flag1. This component covers a common use case where you want to render different content based on the state of a feature flag. By default, LDFlag takes a boolean flag but can be extended to work with the other LaunchDarkly flag types as well.
<script lang="ts">
import { LDFlag } from '@nosnibor89/svelte-client-sdk';
</script>
<LDFlag flag={'my-feature-flag'}>
{#snippet on()}
<p>renders if flag evaluates to true</p>
{/snippet}
{#snippet off()}
<p>renders if flag evaluates to false</p>
{/snippet}
</LDFlag>
In this example, the LDFlag component will render the content inside the on snippet2 if the feature flag my-feature-flag evaluates to true. If the flag evaluates to false, the content inside the off snippet will be rendered instead.
Building an application with SvelteKit
Now that we have seen the basics of how to use the LaunchDarkly Svelte SDK, let’s see how we can put everything together in a real application.
For the sake of brevity, I’ll be providing only the key source code in this example, but if you are curious or need help, you can check out the full source code on GitHub.
How the app works
This is a simple ‘movies’ app where the main page displays a list of movies in a card format with a SearchBar component at the top. This search bar allows users to filter movies based on the text entered.

The scenario we’re simulating is that Product Owners want to replace the traditional search bar with a new AI-powered assistant that helps users get information about specific movies. This creates a perfect use case for feature flags and can be described as follows:
Feature Flag Scenarios
- SearchBar vs AI Assistant: We’ll use a boolean feature flag to determine whether to display the classic SearchBar component or the new MoviesSmartAssistant3 component, simulating a simple all-at-once release.
- AI Model Selection: We’ll use a JSON feature flag to determine which AI model (GPT or Gemini) the MoviesSmartAssistant will use. This includes details about which model to use for specific users, along with display information like labels. This simulates a progressive rollout where Product Owners can gather insights on which model performs better.
Prerequisites
To follow along, you’ll need:
- A LaunchDarkly account
- A LaunchDarkly Client ID (Check this guide to get it)
- Two feature flags (see the creating new flags guide): a boolean flag (show-movie-smart-assistant) and a JSON flag (smart-assistant-config) looking like this: { "model": "gpt-4", "label": "Ask GPT-4 anything" }
- A SvelteKit4 application (create with npx sv create my-app)
Integrating the LaunchDarkly Svelte SDK
After creating the project, you’ll have a scaffolded SvelteKit application, meaning you should have a src directory where your application code resides. Inside this folder, you will find a routes directory, which is where SvelteKit handles routing. More specifically, there are two files, +layout.svelte and +page.svelte, which are the main files we are going to highlight in this post.
Setting up the layout
// src/routes/+layout.svelte
<script lang="ts">
import "../app.css";
import { LDProvider } from "@nosnibor89/svelte-client-sdk";
import { PUBLIC_LD_CLIENT_ID } from '$env/static/public';
import LoadingSpinner from "$lib/LoadingSpinner.svelte"; // Check source code in Github https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/lib/LoadingSpinner.svelte
let { children } = $props();
// randomly pick 0 or 1 to simulate two different orgs
const orgId = Math.round(Math.random());
const orgKey = `sdk-example-org-${orgId}`
const ldContext = {
kind: "org",
key: orgKey,
};
</script>
<LDProvider clientID={PUBLIC_LD_CLIENT_ID} context={ldContext}>
{#snippet initializing()}
<div class="...">
<LoadingSpinner message={"Loading flags"}/>
</div>
{/snippet}
{@render children()}
</LDProvider>
Let’s analyze this:
- We are importing the LDProvider component from the LaunchDarkly Svelte SDK and wrapping our layout with it. In SvelteKit, the layout acts as the entry point for our application, so it’s a good place to initialize the SDK, allowing us to use other members of the SDK in pages or child components.
- We are also importing the PUBLIC_LD_CLIENT_ID variable from the environment variables. You can set this variable in your .env file at the root of the project (this is a SvelteKit feature); see the example right after this list.
- Another thing to notice is that we are using a LoadingSpinner component while the SDK is initializing. This is optional and is a good place to provide feedback to the user while the SDK is loading and feature flags are being evaluated for the first time. Also, don’t worry about the code for LoadingSpinner, you can find it in the source code on GitHub.
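For reference, a minimal .env file at the project root could look like this (the value is a placeholder):
# .env
PUBLIC_LD_CLIENT_ID=your-client-side-id
SvelteKit only exposes variables prefixed with PUBLIC_ to client-side code via $env/static/public, which is why the import above works.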
Add the movies page
At this point, we are ready to start evaluating flags, so let’s now go ahead and add our page where the SDK will help us accomplish scenarios 1 and 2.
Movies Page (SearchBar vs AI Assistant)
The movies page is the main and only page of our application. It displays a list of movies along with a search bar. This is the part where we will evaluate our first feature flag to switch between the SearchBar and the MoviesSmartAssistant components.
// src/routes/+page.svelte
<script lang="ts">
// ...some imports hidden for brevity. Check source code on Github
import SearchBar from "$lib/SearchBar.svelte";
import MoviesSmartAssistant from "$lib/MoviesSmartAssistant.svelte";
import { LD, LDFlag } from "@nosnibor89/svelte-client-sdk";
let searchQuery = $state("");
let prompt = $state("");
const flagKey = "show-movie-smart-assistant";
const flagValue = LD.watch(flagKey);
flagValue.subscribe((value) => {
// remove search query or prompt when flag changes
searchQuery = "";
prompt = "";
});
// ...rest of the code hidden for brevity. Check source code on Github
// https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/routes/%2Bpage.svelte
</script>
<div class="...">
<LDFlag flag={flagKey}>
{#snippet on()}
<MoviesSmartAssistant
prompt={prompt}
onChange={handlePromptChange}
onSubmit={handleSendPrompt}
/>
{/snippet}
{#snippet off()}
<SearchBar value={searchQuery} onSearch={handleSearch} />
{/snippet}
</LDFlag>
<div
class="..."
>
{#each filteredMovies as movie}
<MovieCard {movie} />
{/each}
</div>
</div>
Again, let’s break this down:
- We are using the LDFlag component from the SDK. It allows us to determine which component to render based on the state of the show-movie-smart-assistant feature flag. When the flag evaluates to true, the on snippet will run, meaning the MoviesSmartAssistant component will be rendered, and when the flag evaluates to false, the off snippet will run, meaning the SearchBar component will be rendered.
- We are also using the LD.watch function. This is useful when you need to get the state of a flag and keep track of it. In this case, we are simply resetting the search query or prompt so that the user can start fresh when the flag changes.
- The rest of the code you are not seeing is just functionality for the filtering mechanism and the rest of the presentational components. Remember you can find the code for those on GitHub.
MoviesSmartAssistant Component (AI Model Selection)
Whenever our MoviesSmartAssistant component is rendered, we want to check the value of the smart-assistant-config feature flag to determine which AI model to use for the assistant.
// src/lib/MoviesSmartAssistant.svelte
<script lang="ts">
import { LD } from "@nosnibor89/svelte-client-sdk";
import type { Readable } from "svelte/store";
type MoviesSmartAssistantConfig = { model: string; label: string;};
const smartAssistantConfig = LD.watch("smart-assistant-config") as Readable<MoviesSmartAssistantConfig>;
// ... rest of the code hidden for brevity. Check source code on Github
// https://github.com/tweag/blog-resources/blob/master/launchdarkly-svelte-sdk-intro/src/lib/MoviesSmartAssistant.svelte
</script>
<div class="...">
<input
type="text"
placeholder={$smartAssistantConfig?.label ?? "Ask me anything..."}
value={prompt}
oninput={handleInput}
class="..."
/>
<button type="button" onclick={handleClick} aria-label="Submit">
// ...svg code hidden for brevity
</button>
</div>
As before, I’m hiding some code for brevity, but here are the key points:
- We are using the LD.watch method to watch for changes in the smart-assistant-config feature flag, which contains information about the AI model. This allows us to use the proper model for a given user based on the flag evaluation.
- Notice how the SDK understands it’s a JSON flag and returns a JavaScript object (with a little help5), as we defined in the LaunchDarkly dashboard.
Running the Application
Now that we have everything set up, let’s run the application. Here we are going to use the Client ID and set it as an environment variable.
PUBLIC_LD_CLIENT_ID={your_client_id} npm run dev
Open your browser and navigate to http://localhost:5173 (check your terminal as it may run on a different port). You should see the movies application with either the SearchBar or MoviesSmartAssistant component, depending on your feature flag configuration.
Seeing Feature Flags in Action
If you set everything up correctly, you should be able to interact with the application and the LaunchDarkly dashboard, toggling the feature flags and validating the application’s behavior.
I have included this demo video to show you how the application works and how the feature flags are being evaluated.
Conclusion
We just saw how to use the LaunchDarkly Svelte SDK and integrate it into a SvelteKit application using a realistic example. I hope this post gave you an understanding of the features the SDK provides, as well as what it still lacks in these early stages while awaiting the official release.
For now, my invitation for you is to try the SDK yourself and explore different use cases. For example, change the context with LD.identify to simulate users signing in to an application, or maybe try a different flag type like a string or number flag. Also, stay tuned for updates on the official LaunchDarkly Svelte SDK release.
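As a hypothetical sketch (assuming LD.identify accepts a new context and returns a promise, mirroring the underlying JavaScript SDK’s identify method), a sign-in handler could look like this:
import { LD } from '@nosnibor89/svelte-client-sdk';

// Hypothetical sign-in handler: re-identify with the new user context so
// that all flags are re-evaluated for the signed-in user.
async function onSignIn(userKey: string) {
  await LD.identify({ kind: 'user', key: userKey });
}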
- LDFlag is a key component but there are other ways to evaluate a flag using the SDK.↩
- Snippets are a Svelte feature and can also be named slots. Check out https://svelte.dev/docs/svelte/snippet↩
- The MoviesSmartAssistant component is just a visual representation without actual AI functionality — my focus is on demonstrating how the LaunchDarkly Svelte SDK enables these feature flag implementations.↩
- SvelteKit is the official application framework for Svelte. It comes with out-of-the-box support for TypeScript, server-side rendering, and automatic routing through file-based organization.↩
- Ok, I’m also using TypeScript here to hint the type of the object returned by the LD.watch method. Maybe this is something to fix in the future.↩
Workspaces
The Rust unit of packaging — like a gem in Ruby or a module in Go — is called a “crate”, and it’s pretty common for a medium-to-large Rust project to be divided into several of them. This division helps keep code modular and interfaces well-defined, and also allows you to build and test components individually. Cargo supports multi-crate workflows using “workspaces”: a workspace is just a bunch of crates that Cargo handles “together”, sharing a common dependency tree, a common build directory, and so on. A basic workspace might look like this:
.
├── Cargo.toml
├── Cargo.lock
├── taco
│ ├── Cargo.toml
│ └── src
│ ├── lib.rs
│ └── ... more source files
└── tortilla
├── Cargo.toml
└── src
├── lib.rs
└── ... more source files
The top-level Cargo.toml just tells Cargo where the crates in the workspace live.2
# ./Cargo.toml
workspace.members = ["taco", "tortilla"]
The crate-level Cargo.toml files tell us about the crates (surprise!). Here’s taco’s Cargo.toml:
# ./taco/Cargo.toml
[package]
name = "taco"
version = "2.0"

[dependencies]
tortilla = { path = "../tortilla", version = "1.3" }
The dependency specification is actually pretty interesting. First, it tells
us that the tortilla package is located at ../tortilla (relative to
taco). When you’re developing locally, Cargo uses this local path to find the
tortilla crate. But when you publish the taco crate for public consumption, Cargo strips out the
path = "../tortilla" setting because it’s only meaningful within your local
workspace. Instead, the published taco crate will depend on version 1.3 of
the published tortilla crate. This doubly-specified dependency gives you the
benefits of a monorepo (for example, you get to work on tortilla and taco
simultaneously and be sure that they stay compatible) without leaking that local setup
to downstream users of your crates.
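In other words, after publishing, the manifest that downstream users see declares only the registry version, roughly like this (illustrative):
# What the published taco effectively depends on
[dependencies]
tortilla = "1.3"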
If you’ve been hurt by packaging incompatibilities before, the previous
paragraph might have raised some red flags: allowing a dependency to come
from one of two places could lead to problems if they get out-of-sync. Like,
couldn’t you accidentally make a broken package by locally updating both your
crates and then only publishing taco? You won’t see the breakage when building locally,
but the published taco will be incompatible with the previously published tortilla.
To deal with this issue, Cargo verifies packages before you publish them.
When you type cargo publish --package taco, it packages up the taco crate
(removing the local ../tortilla dependency) and then unpackages the new
package in a temporary location and attempts to build it from scratch. This
rebuild-from-scratch sees the taco crate exactly as a downstream user would,
and so it will catch any incompatibilities between the existing, published
tortilla and the about-to-be-published taco.
Cargo’s crate verification is not completely fool-proof because it only checks that the package compiles.3 In practice, I find that checking compilation is already pretty useful, but I also like to run other static checks.
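For example, before publishing I might run something like the following (cargo-semver-checks is a third-party subcommand and needs to be installed separately; treat the exact invocations as a sketch):
$ cargo publish --package taco --dry-run   # package and verify without uploading
$ cargo clippy --workspace --all-targets   # extra lints beyond plain compilation
$ cargo semver-checks check-release        # catch accidental semver breakage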
Publish all my crates
Imagine you’ve been working in your workspace, updating your crates in backwards-incompatible
ways. Now you want to bump tortilla to version 2.0 and taco to version 3.0
and publish them both. This isn’t too hard:
- Edit tortilla/Cargo.toml to increase the version to 2.0.
- Run cargo publish --package tortilla, and wait for it to appear on crates.io.
- Edit taco/Cargo.toml to increase its version to 3.0, and change its tortilla dependency to 2.0.
- Run cargo publish --package taco.
The ordering is important here. You can’t publish the new taco before tortilla 2.0 is
publicly available: if you try, the verification step will fail.
This multi-crate workflow works, but it has two problems:
- It can get tedious. With two crates it’s manageable, but what about when the dependency graph gets complicated? I worked for a client whose CI had custom Python scripts for checking versions, bumping versions, publishing things in the right order, and so on. It worked, but it wasn’t pretty.4
- It’s non-atomic: if in the process of verifying and packaging dependent crates you discover some problems with the dependencies then you’re out of luck because you’ve already published them. crates.io doesn’t allow deleting packages, so you’ll just have to yank5 the broken packages, increase the version number some more, and start publishing again. This one can’t be solved by scripts or third-party tooling: verifying the dependent crate requires the dependencies to be published.
Starting in mid-2024, my colleague Tor Hovland and I began working on native support for this in Cargo. A few months and dozens of code-review comments later, our initial implementation landed in Cargo 1.83.0. By the way, the Cargo team are super supportive of new contributors — I highly recommend going to their office hours if you’re interested.
How it works
In our implementation, we use a sort of registry “overlay” to verify dependent crates before their dependencies are published. This overlay wraps an upstream registry (like crates.io), allowing us to add local crates to the overlay without actually publishing them upstream. This kind of registry overlay is an interesting topic on its own. The “virtualization” of package sources is an often-requested feature that hasn’t yet been implemented in general because it’s tricky to design without exposing users to dependency confusion attacks: the more flexible you are about where dependencies come from, the easier it is for an attacker to sneak their way into your dependency tree. Our registry overlay passed scrutiny because it’s only available to Cargo internally, and only gets used for workspace-local packages during workspace publishing.
The registry overlay was pretty simple to implement, since it’s just a composition of two existing Cargo features: local registries and abstract sources. A local registry in Cargo is just a registry (like crates.io) that lives on your local disk instead of in the cloud. Cargo has long supported them because they’re useful for offline builds and integration testing. When packaging a workspace we create a temporary, initially-empty local registry for storing the new local packages as we produce them.
Our second ingredient is Cargo’s Source trait: since Cargo can pull dependencies
from many different kinds of places (crates.io, private registries, git repositories, etc.),
they already have a nice abstraction that encapsulates how to query
availability, download, and cache packages from different places. So our registry
overlay is just a new implementation of the Source trait that wraps two other Sources:
the upstream registry (like crates.io) that we want to publish to, and the local registry
that we put our local packages in.
When someone queries our overlay source for a package, we check in the local registry
first, and fall back to the upstream registry.
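To illustrate the idea (a simplified sketch, not Cargo’s actual Source trait, which has a richer interface), the overlay behaves roughly like this:
// Simplified illustration of the overlay source; names and signatures are
// invented for this sketch and do not match Cargo's internals.
struct Package; // stand-in for a package's metadata

trait Source {
    fn query(&mut self, name: &str) -> Option<Package>;
}

struct Overlay<L: Source, U: Source> {
    local: L,     // temporary local registry holding freshly packaged crates
    upstream: U,  // the registry we are publishing to, e.g. crates.io
}

impl<L: Source, U: Source> Source for Overlay<L, U> {
    fn query(&mut self, name: &str) -> Option<Package> {
        // Prefer the workspace-local package if we have one, and fall back
        // to the upstream registry otherwise.
        self.local.query(name).or_else(|| self.upstream.query(name))
    }
}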
Now that we have our local registry overlay, the workspace-publishing workflow looks like this:
- Gather all the to-be-published crates and figure out any inter-dependencies. Sort them in a “dependency-compatible” order, meaning that every crate will be processed after all its dependencies.
- In that dependency-compatible order, package and verify each crate. For each crate:
  - Package it up, removing any mention of local path dependencies.
  - Unpackage it in a temporary location and check that it builds. This build step uses the local registry overlay, so that it thinks all the local dependencies that were previously added to the local overlay are really published.
  - “Publish” the crate in the local registry overlay.
- In the dependency-compatible order, actually upload all the crates to crates.io. This is done in parallel as much as possible. For example, if tortilla and carnitas don’t depend on one another but taco depends on them both, then tortilla and carnitas can be uploaded simultaneously.
It’s possible for the final upload to fail (if your network goes down, for example) and for some crates to remain unpublished; in that sense, the new workspace publishing workflow is not truly atomic. But because all of the new crates have already been verified with one another, you can just retry publishing the ones that failed to upload.
How to try it
Cargo, as critical infrastructure for Rust development, is pretty conservative about introducing new features. Multi-package publishing was recently promoted to a stable feature, but the Cargo version containing it is currently only available in nightly builds. If you’re using a recent nightly build of Cargo 1.90.0 or later, running cargo publish in a workspace will work as described in this blog post.
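For example, with a rustup-managed toolchain (and assuming --dry-run interacts with workspace publishing the same way it does with single-package publishing):
$ rustup update nightly
$ cargo +nightly publish --dry-run   # package and verify everything without uploading
$ cargo +nightly publish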
If you don’t want to publish everything in your workspace, the usual package-selection arguments
should work as expected: cargo publish --package taco --package tortilla
will publish just taco and tortilla, while correctly managing any dependencies
between them. Or you can exclude packages like cargo publish --exclude onions.
If you’re using a stable Rust toolchain, workspace publishing will be available in Cargo 1.90 in September 2025.
- If you use Node.js, Cargo is like the npm command and crates.io is like the NPM registry. If you use Python, Cargo is like pip (or Poetry, or uv) and crates.io is like PyPI.↩
- It can also contain lots of other useful workspace-scoped information, like dependencies that are common between crates or global compiler settings.↩
- To be even more precise, it only checks that the package compiles against the dependencies that are locked in your Cargo.lock file, which gets included in the package. If you or someone in your dependency tree doesn’t correctly follow semantic versioning, downstream users could still experience compilation problems. In practice, we’ve seen this cause binary packages to break because cargo install ignores the lock file by default.↩
- There are also several third-party tools (for example, cargo-release, cargo-smart-release, and release-plz) to help automate multi-crate releases. If one of these meets your needs, it might be better than a custom script.↩
- “Yanking” is Cargo’s mechanism for marking packages as broken without actually deleting their contents and breaking everyone’s builds.↩
As part of our consulting business we are often invited to solve problems that our clients cannot tackle on their own. It is not uncommon for us to collaborate with a client for extended periods of time; during which, many opportunities for knowledge transfer present themselves, be it in the form of documentation, discussions, or indeed, when the client finds it desirable, in the form of specialized workshops.
In this post we’d like to talk about a workshop that we developed and delivered (so far) five times to different groups of people at the same client. We received positive feedback for it and we believe it was helpful for those who attended it.
The workshop intends to give a principled introduction to the Bazel build system for people who have little or no knowledge of Bazel, but who are software developers and have used a build system before. It is definitely a workshop for a technical audience, and as such it was presented to (among others) dedicated DevOps and DevX teams of the client.
We are happy to announce that the materials of this workshop are now publicly available in the form of:
- the git repository of the example project that we use in the exercises: https://github.com/tweag/bazel-workshop-2024, and
- the accompanying slides.
The original intended duration of the workshop was three days. However, one of these days was dedicated almost entirely to a case study that we cannot share publicly; therefore, the public version is shorter and should amount to approximately two days.
Here are a couple of the introductory slides to give you an impression of the scope, structure, and expected knowledge in this workshop:
It must be pointed out that the workshop was developed in 2024, when the
WORKSPACE-based approach to dependency management was still the default
choice and so, given that we were time-constrained both at the authoring and
presentation stages, we chose not to cover Bzlmod. We are still convinced
that familiarity with WORKSPACE and simple repository rules is a
prerequisite for understanding Bzlmod. Some newer features like symbolic
macros are also not covered. Learning materials for Bazel go out of date
quickly, but even so, we believe that the workshop, now public, is still
relevant and can be of use for people who are about to embark on their Bazel
journey.