Surprising to most C++ programmers, but not to language lawyers, idiomatic C memory allocation was ill-formed in C++ until recently:
int *newint(int v)
{
    int *r = (int *)malloc(sizeof(*r));
    if (r) {
        *r = v;  // <-- undefined behavior before C++20
    }
    return r;
}
This program allocates memory for an object but never starts a lifetime.
Assignment without a lifetime is invalid. Pointer casts are that much more
suspicious in C++, and due to lifetime semantics, in many cases indicate
incorrect code. (To be clear, I’m not arguing in favor of these semantics,
but reasoning about the facts on the ground.) C++20 carved out special
exceptions for malloc and friends, but addressing this kind of thing in general is the purpose of the brand-new start_lifetime_as (and similar), the slightly older construct_at, or a classic placement new. They all start lifetimes. The last looks like:
int *newint(int v)
{
    void *r = malloc(sizeof(int));
    if (r) {
        return new(r) int{v};
    }
    return nullptr;
}
That’s no good as a C/C++ polyglot, though per the differing old semantics that was impossible anyway without macros, which is basically cheating. An important detail: the corrected version has no casts, and it returns the result of new. That’s important because only the pointer returned by new is imbued as a pointer to the new lifetime, not r. There are no side effects affecting the provenance of r, which still points to raw memory as far as the language is concerned.
With that in mind let’s revisit my arena from last time, which does not necessarily benefit from the recent changes, not being one of the special case C standard library functions:
struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    for (ptrdiff_t i = 0; i < count; i++) {
        new((void *)&r[i]) T{};
    }
    return r;
}
Hey, look, placement new! I did that to produce a nicer interface, but I
lucked out also starting lifetimes appropriately. Except it returns the
wrong pointer. This allocator discards the pointer blessed with the new
lifetime. Both pointers have the same address but different provenance.
That matters. But I’m calling new many times, so how do I fix this? Array new, duh.
template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}
Wow… that’s actually much better anyway. No explicit casts, no loop. Why didn’t I think of this in the first place? The catch is I can’t forward constructor arguments, emplace-style — the part that gave me the trouble with perfect forwarding — but that’s for the best. Forwarding more than once was unsound, made more obvious by new[].
Caveat: This only works starting in C++20, and strictly with operator new[](size_t, void *). Any other placement new[] may require array overhead — e.g. it prepends an array size so that delete[] can run non-trivial destructors — which is unknowable and therefore impossible to provide or align correctly. Overhead for placement new[] is nonsense, of course, but as of this writing, all three major C++ compilers do it and essentially have broken custom placement new[].
Since I’m thinking about lifetimes, what about the other end? My arena does not call destructors, by design, and starts new lifetimes on top of objects that are technically still alive. Is that undefined behavior? As far as I can tell this is allowed, even for non-trivial destructors, with the caveat that it might leak resources. In this case the resource is memory managed by the arena, so that’s fine of course.
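As a minimal illustration of "starting a new lifetime on top" (my own example, not the article's), storage can be reused by placement new without an intervening destructor call, which at worst leaks whatever the old object owned:

```cpp
#include <cassert>
#include <new>

struct Widget { int id; };

// Reuse the same storage for a second object without destroying the
// first. The old Widget's lifetime ends when its storage is reused;
// no destructor runs, so any owned resources would leak.
int reuse_demo()
{
    alignas(Widget) char storage[sizeof(Widget)];
    Widget *a = new (storage) Widget{1};
    assert(a->id == 1);
    Widget *b = new (storage) Widget{2};  // ends a's lifetime, starts b's
    return b->id;
}
```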
So addressing pointer provenance also produced a nicer definition. What a great result from reading that book! While researching, I noticed Jonathan Müller, who personally gave me great advice and feedback on my previous article, talked about lifetimes just a couple weeks later. I recommend both.
The above description is probably confusing on its own, and an example is worth a thousand words, so here’s a listing naming 7 fields:
timestamp,point.x,point.y,foo.bar.z,point.z,foo.bar.y,foo.bar.x
Where point is a substructure, as are foo and bar, but note they’re interleaved in the record. So if a record contains these values:
{1758158348, 1.23, 4.56, -100, 7.89, -200, -300}
The JSON representation would look like:
{
    "timestamp": 1758158348,
    "point": {
        "x": 1.23,
        "y": 4.56,
        "z": 7.89
    },
    "foo": {
        "bar": {
            "z": -100,
            "y": -200,
            "x": -300
        }
    }
}
Notice point.z moved up and foo.bar.z down, so that substructures are contiguous in this representation as required for JSON. Sorting the field names lexicographically would group them together as a simple solution. However, as an additional constraint I want to retain the original field order as much as possible. For example, timestamp is first in both the original and JSON representations, but sorting would put it last. If all substructures are already contiguous, nothing should change.
Solution with string interning
My solution is to intern the segment strings, assigning each a unique, monotonic integral token in the order they’re observed. In my program, zero is reserved as a special “root” token, and so the first string has the value 1. The concrete values aren’t important, only that they’re assigned monotonically.
The trick is that a string is always interned in the “namespace” of a previous token. That is, we’re building a (token, string) -> token map. For our segments that namespace is the token for the parent structure, and the top-level fields are interned in the reserved “root” namespace. When applied to the example, we get the token sequences:
timestamp -> 1
point.x -> 2 3
point.y -> 2 4
foo.bar.z -> 5 6 7
point.z -> 2 8
foo.bar.y -> 5 6 9
foo.bar.x -> 5 6 10
And our map looks like:
{0, "timestamp"} -> 1
{0, "point"} -> 2
{2, "x"} -> 3
{2, "y"} -> 4
{0, "foo"} -> 5
{5, "bar"} -> 6
{6, "z"} -> 7
{2, "z"} -> 8
{6, "y"} -> 9
{6, "x"} -> 10
Notice how "x" is assigned 3 and 10 due to different namespaces. That’s important because otherwise the fields of foo.bar would sort in the same order as point. Namespaces give these fields unique identities.
Once we have the token representation, sort lexicographically by token. That pulls point.z up to its siblings.
timestamp -> 1
point.x -> 2 3
point.y -> 2 4
point.z -> 2 8
foo.bar.z -> 5 6 7
foo.bar.y -> 5 6 9
foo.bar.x -> 5 6 10
Now we have the “output” order with minimal re-ordering. If substructures were already contiguous, nothing changes. Assuming a reasonable map, this is O(n log n), primarily due to sorting.
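As a sketch of the whole scheme, with std::map standing in for the article's hash trie and all names my own, the interning and sort fit in a few lines:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Intern each dot-separated segment in the namespace of its parent's
// token, then sort fields lexicographically by their token sequences.
std::vector<std::string> order_fields(const std::vector<std::string> &names)
{
    std::map<std::pair<int, std::string>, int> intern;
    int ntokens = 0;  // token 0 is the reserved "root" namespace
    std::vector<std::pair<std::vector<int>, std::string>> fields;
    for (const std::string &name : names) {
        std::vector<int> tokens;
        int prev = 0;
        std::istringstream segments(name);
        for (std::string seg; std::getline(segments, seg, '.');) {
            int &token = intern[{prev, seg}];
            if (!token) {
                token = ++ntokens;  // assign monotonically on first sight
            }
            tokens.push_back(token);
            prev = token;
        }
        fields.emplace_back(std::move(tokens), name);
    }
    std::sort(fields.begin(), fields.end());  // lexicographic by tokens
    std::vector<std::string> out;
    for (const auto &f : fields) out.push_back(f.second);
    return out;
}
```

Because tokens are assigned in order of first appearance, sorting by token sequence reproduces the original order except where a substructure's stragglers must be pulled up to their siblings.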
Alternatives
Before I thought of namespaces, my initial idea was to intern the whole prefix of a segment. The sequence of look-ups would be:
"timestamp" -> 1 -> {1}
"point" -> 2
"point.x" -> 3 -> {2, 3}
"point" -> 2
"point.y" -> 4 -> {2, 4}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.z" -> 7 -> {5, 6, 7}
"point" -> 2
"point.z" -> 8 -> {2, 8}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.y" -> 9 -> {5, 6, 9}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.x" -> 10 -> {5, 6, 10}
Ultimately it produces the same tokens, and this is a more straightforward string -> token map. The prefixes are acting as namespaces. However, I wrote it this way as a kind of visual proof: notice the right-triangle shape formed by the strings for each field. From the area we can see that processing prefixes as strings is O(n^2), quadratic in the number of segments! In my real problem the inputs were never large enough for this to matter, but I hate leaving behind avoidable quadratic algorithms. Using a token as a namespace flattens the prefix to a constant size.
Another option is a different map for each namespace. So for foo.bar.z, look up the "foo" map (string -> map) in the root (string -> map), then within that look up the "bar" table (string -> token) (since this is the penultimate segment), then intern "z" within that to get its token. That wouldn’t have quadratic time complexity, but it seems quite a bit more complicated than a single, flat (token, string) -> token map.
Implementation in C
Because the standard library has little useful for us, I am building on previously-established definitions, so refer to that article for basic definitions like Str. To start off, tokens will be a size-typed integer so we never need to worry about overflowing the token counter. We’d run out of memory first:
typedef ptrdiff_t Token;
We’re building a (token, string) -> token map, so we’ll need a hash function for such keys:
uint64_t hash(Token t, Str s)
{
    uint64_t r = (uint64_t)t << 8;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        r ^= s.data[i];
        r *= 1111111111111111111u;
    }
    return r;
}
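A quick host-side sanity check of that mixing (with std::string standing in for the article's Str, an assumption on my part) shows the namespace token separating otherwise-identical segment strings:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Same mixing scheme as the article's hash: fold the namespace token in
// up front, so the same segment in different namespaces hashes apart.
uint64_t hash(int64_t t, const std::string &s)
{
    uint64_t r = (uint64_t)t << 8;
    for (char c : s) {
        r ^= (unsigned char)c;
        r *= 1111111111111111111u;
    }
    return r;
}
```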
The map itself is a forever-useful hash trie.
typedef struct Map Map;
struct Map {
    Map  *child[4];
    Token namespace;
    Str   segment;
    Token token;
};

Token *upsert(Map **m, Token namespace, Str segment, Arena *a)
{
    for (uint64_t h = hash(namespace, segment); *m; h <<= 2) {
        if (namespace==(*m)->namespace && equals(segment, (*m)->segment)) {
            return &(*m)->token;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(a, 1, Map);
    (*m)->namespace = namespace;
    (*m)->segment = segment;
    return &(*m)->token;  // caller will assign
}
We’ll use this map to convert a string naming a field into a sequence of tokens, so we’ll need a slice. Fields also have an offset within the record and a type, which I’ll track via the field’s original ordering using an index field (e.g. into the original header). Also track the original name:
typedef struct {
    Str          name;
    ptrdiff_t    index;
    Slice(Token) tokens;
} Field;
To sort fields we’ll need a comparator:
ptrdiff_t field_compare(Field a, Field b)
{
    ptrdiff_t len = min(a.tokens.len, b.tokens.len);
    for (ptrdiff_t i = 0; i < len; i++) {
        Token d = a.tokens.data[i] - b.tokens.data[i];
        if (d) {
            return d;
        }
    }
    return a.tokens.len - b.tokens.len;
}
Because field names are unique, each token sequence is unique, and so we need not use index in the comparator.
Finally down to business: cut up the list and build the token sequences with the established push macro. The sort function isn’t interesting, and could be as simple as libc qsort with the above comparator (and adapter), so I’m only listing the prototype.
void field_sort(Slice(Field), Arena scratch);
Slice(Field) parse_fields(Str fieldlist, Arena *a)
{
    Slice(Field) fields = {};
    Map *strtab = 0;
    ptrdiff_t ntokens = 0;
    for (Cut c = {.tail=fieldlist, .ok=true}; c.ok;) {
        c = cut(c.tail, ',');
        Field field = {};
        field.name = c.head;
        field.index = fields.len;
        Token prev = 0;
        for (Cut f = {.tail=field.name, .ok=true}; f.ok;) {
            f = cut(f.tail, '.');
            Token *token = upsert(&strtab, prev, f.head, a);
            if (!*token) {
                *token = ++ntokens;
            }
            *push(a, &field.tokens) = *token;
            prev = *token;
        }
        *push(a, &fields) = field;
    }
    field_sort(fields, *a);
    return fields;
}
Usage here suggests Cut::ok should be inverted to Cut::done so that it better zero-initializes. Something I’ll need to consider. Because it’s all allocated from an arena, there’s no need for destructors or anything like that, so this is the complete implementation. Back to the example:
Str fieldlist = S(
    "timestamp,"
    "point.x,"
    "point.y,"
    "foo.bar.z,"
    "point.z,"
    "foo.bar.y,"
    "foo.bar.x"
);
Slice(Field) fields = parse_fields(fieldlist, &scratch);
for (ptrdiff_t i = 0; i < fields.len; i++) {
    Str name = fields.data[i].name;
    fwrite(name.data, 1, name.len, stdout);
    putchar('\n');
}
This program will print the proper output field order. In a real program we’d hold onto the string table, define an inverse lookup to translate tokens back into strings, and use it when producing output. I do just that in my exploratory program, rec2json.c, written a little differently than presented above. It uses the sorted tokens to compile a simple bytecode program that, when run against a record, produces its JSON representation. It compiles the example to:
OPEN # print '{'
KEY 1 # print token 1 as a key, i.e. "timestamp:"
READ 0 # print double at record offset 0
COMMA # print ','
KEY 2 # print token 2 as a key, i.e. "point:"
OPEN
KEY 3
READ 8 # print double at record offset 8
COMMA
KEY 4
READ 16
COMMA
KEY 8
READ 32
CLOSE # print '}'
COMMA
KEY 5
OPEN
KEY 6
OPEN
KEY 7
READ 24
COMMA
KEY 9
READ 40
COMMA
KEY 10
READ 48
CLOSE
CLOSE
CLOSE
Seeing it written out, I notice more room for improvement. An optimization pass could coalesce instructions so that, for instance, OPEN then KEY concatenate to a single string at compile time so that it only needs one instruction. This program could be 15 instructions instead of 31. In my real case I didn’t need anything quite this sophisticated, but it was fun to explore.
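A sketch of such a pass, over a hypothetical two-op encoding of my own (LIT for any fixed string such as OPEN, COMMA, or a rendered KEY; READ for a record offset), merges adjacent fixed-string ops:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical two-op bytecode: a LIT op prints a fixed string, a READ
// op prints a double at a record offset. A peephole pass concatenates
// runs of adjacent LITs into one instruction.
struct Op { bool lit; std::string text; int offset; };

std::vector<Op> coalesce(const std::vector<Op> &prog)
{
    std::vector<Op> out;
    for (const Op &op : prog) {
        if (op.lit && !out.empty() && out.back().lit) {
            out.back().text += op.text;  // merge into the previous LIT
        } else {
            out.push_back(op);
        }
    }
    return out;
}
```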
git log is a kind of miniature project blog. I understand this is neither discoverable nor obvious, especially because the GitHub UI (ugh) lacks anything like git log in the terminal. So let’s talk about it here, along with other recent changes.
NASM is a nice assembler, and in general I still prefer its x86 syntax to the GNU Assembler, GAS. It’s tidy, self-contained, dependency-free other than a C toolchain, reliable, and easy to build and cross-compile, which is why I included it in the first place. However, it’s just not a good fit for w64dk. It’s redundant with Binutils as, which is already mandatory for supporting GCC. As a rule, w64dk is a curation that avoids redundancy, and a second assembler requires special justification. Originally it was that the syntax was nicer, but last year I decided that wasn’t enough to outweigh NASM’s weaknesses.
First, it doesn’t integrate well with the GNU toolchain, at least not to the extent that as does. For example, many people don’t realize they can assemble GAS assembly through the general gcc driver.
$ gcc -c myprogram.s
There’s rarely a reason to invoke as directly. Need a specific assembler flag? Use -Wa or -Xassembler. Going through the compiler driver for both assembly and linking (i.e. instead of ld) has the advantage of operating at a “higher level.” The compiler driver knows about the whole toolchain, and better understands the sysroot. Use a capital file extension, .S, and gcc will automatically run it through the C preprocessor.
$ gcc -c myprogram.S
This is quite nifty, especially for cross builds. NASM isn’t so integrated and so requires invoking the nasm command directly. It’s not a big deal, but it’s friction that adds up. But even more, the most useful form of assembly in the context of w64dk is inline assembly, which of course is completely out of NASM’s lane. So most of the time you’re going to be writing GAS anyway. Again, NASM is just not integrated into the toolchain like as.
Second, NASM is a dreadfully slow assembler — in the vicinity of two orders of magnitude slower than GAS! If your assembly program is a couple hundred lines, no big deal. If you’re writing a compiler targeting NASM, it’s impractical beyond toy programs. Friends don’t let friends use NASM as a back-end. If you’re so allergic to GAS, note that YASM has matching syntax and better performance.
Third, and the last nail in the coffin, NASM doesn’t support DWARF debug information for Windows targets (win32/win64). That means you cannot debug NASM programs with GDB on Windows, at least not with source-level debugging. Were you even aware GAS has source-level debugging with GDB? Sure, you can show the assembly pane (layout asm) and step through the disassembly with ni, but stepping through the original source (layout src) is a whole different experience. Even better, bring up the register pane (layout regs) and you have something akin to a Visual Studio watch window. It’s such a pleasant experience, and yet no tutorial I’ve seen has ever mentioned it. It should be the first thing people are taught. Get your act together, assembly tutorials!
In theory, YASM ought to solve this with its -g dwarf2, but alas this feature appears to be broken. So that really just leaves GAS as the most practical game in town for assembly on Windows with GNU-style toolchains. In case it helps: learning a little PDP-11 assembly gave me a deeper understanding — and appreciation — of why GAS x86 is the way it is. Makes it sting a little less than before.
Compiling NASM
At the same time w64dk lost NASM, it gained the ability to run Autotools configure scripts. It only took a few environment variables as hints for the script, and a small hack in the shell to undo an unnecessary Autotools hack. The typical native cmd.exe invocation with a command uses /c:
cmd /c echo hello world
It’s roughly equivalent to -c in a unix shell. This is what you’d use if you’re in another shell and you need to invoke cmd for a specific purpose. However, if you’re in an MSYS2 shell with its virtual file system, /c looks like a path to the C drive. So MSYS2 “helpfully” translates the switch to a native path, something like so:
cmd C:\ echo hello world
Not helpful at all. So Autotools, assuming Cygwin-like environments are the only kind that exist, uses a special escape form when invoking cmd:
cmd //c echo hello world
Which Cygwin-like environments translate into the desired /c. If you’re not in such an environment, then cmd.exe sees the //c, which doesn’t work. So the busybox-w32 shell now pattern matches for precisely:
cmd //c echo [...]
For a similar translation. That’s right, it matches echo in particular because that’s the only cmd feature Autotools uses. So it’s completely unnecessary, just poor code generation.
With that work in place, you can download a NASM source release, untar it, run ./configure, then make -j$(nproc), and copy the resulting nasm.exe into w64dk or wherever else on your $PATH is convenient. The same is true for quite a bit of software! You can build Binutils, including as itself, exactly the same way. Being so easy for users to build their own tools means I’m less concerned with including extraneous, more specialized tools, such as NASM.
Path Style
Borrowing a concept from MSYS2, w64devkit.ini now has a path style option for controlling the initial PATH, using the same names and configuration. I’ve already found it useful in testing w64dk itself in a relatively pristine, hermetic environment:
[w64devkit]
home = .
path style = strict
This uses the w64devkit/ directory itself as $HOME, and $PATH is initially just the w64dk bin/ directory. See the w64devkit.ini header for full documentation.
Otherwise, most of the major features have been discussed already: peports and vc++filt, pkg-config, and xxd.
How can a TU have multiple definitions of a struct? Scope. Prior to C23 this wouldn’t compile because the compound literal type and the return type were distinct types:
struct Example { int x, y, z; };

struct Example example(void)
{
    struct Example { int x, y, z; };
    return (struct Example){1, 2, 3};
}
Otherwise the definition of struct Example within example was fine, if strange. At first this may not seem like a big deal, but let’s revisit my technique for dynamic arrays:
typedef struct {
    T *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} SliceT;
Where I write out one of these for each T that I might want to put into a slice. With the new rule we can change it slightly, taking note of the introduction of a tag (the name after struct):
#define Slice(T)       \
    struct Slice##T {  \
        T *data;       \
        ptrdiff_t len; \
        ptrdiff_t cap; \
    }
This makes the “write it out ahead of time” thing simpler, but with the new rule we can skip the “ahead of time” part and conjure slice types on demand. Each declaration with the same T is compatible with the others due to matching tags and fields. So, for example, with this macro we can declare functions using slices parameterized for different element types.
Slice(int) range(int, Arena *);
float mean(Slice(float));
Slice(Str) split(Str, char delim, Arena *);
Str join(Slice(Str), char delim, Arena *);
Or using it with our model parser:
typedef struct {
    float x, y, z;
} Vec3;

typedef struct {
    int32_t v[3];
    int32_t n[3];
} Face;

typedef struct {
    Slice(Vec3) verts;
    Slice(Vec3) norms;
    Slice(Face) faces;
} Model;

typedef Slice(Vec3) Polygon;
I worried these macros might confuse my tools, particularly Universal Ctags, because it’s important to me. Everything handles prototypes better than expected, but ctags doesn’t see fields with slice types. Overall they’re like a very limited form of C++ templates, though only the types are parameterized, not the functions operating on those types. Outside of unwarranted macro abuse, this new technique does nothing regarding generic functions. On the other hand, my generic slice function complements the new technique, especially with the help of C23’s new typeof to mitigate _Alignof’s limitations:
typedef struct { char *beg, *end; } Arena;
void *alloc(Arena *, ptrdiff_t count, int size, int align);

#define push(a, s) \
    ((s)->len == (s)->cap \
        ? (s)->data = push_( \
            (a), \
            (s)->data, \
            &(s)->cap, \
            sizeof(*(s)->data), \
            _Alignof(typeof(*(s)->data)) \
          ), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

void *push_(Arena *a, void *data, ptrdiff_t *pcap, int size, int align)
{
    ptrdiff_t cap = *pcap;
    if (a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        memcpy(copy, data, cap*size);
        data = copy;
    }
    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size, align);
    *pcap = cap + extend;
    return data;
}
This exploits the fact that implementations adopting the new tag rule also have the upcoming C2y null pointer rule (note: also requires a cooperating libc). Putting it together, now I can write stuff like this:
Slice(int64_t) generate_primes(int64_t limit, Arena *a)
{
    Slice(int64_t) primes = {};
    if (limit > 2) {
        *push(a, &primes) = 2;
    }
    for (int64_t n = 3; n < limit; n += 2) {
        bool valid = true;
        for (ptrdiff_t i = 0; valid && i<primes.len; i++) {
            valid = n % primes.data[i];
        }
        if (valid) {
            *push(a, &primes) = n;
        }
    }
    return primes;
}
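To see the grow-in-place behavior in isolation, here is a simplified, host-runnable sketch of the same core logic as push_. The names and the alignment-free alloc are mine, not the article's:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Minimal bump arena; alignment is ignored for brevity.
struct Arena { char *beg, *end; };

static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size)
{
    assert(count <= (a->end - a->beg) / size);  // OOM policy
    void *r = a->beg;
    a->beg += count * size;
    return r;
}

// If the slice's backing memory sits at the top of the arena, extend it
// in place; otherwise copy it to the top first, then extend.
static void *grow(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap = *pcap;
    if (a->beg != (char *)data + cap * size) {
        void *copy = alloc(a, cap, size);
        if (cap) {  // guard unneeded once C2y defines memcpy(p, 0, 0)
            std::memcpy(copy, data, cap * size);
        }
        data = copy;
    }
    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size);
    *pcap = cap + extend;
    return data;
}
```

Pushing onto the most recently allocated slice never copies: the data pointer stays stable while capacity doubles.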
But it doesn’t take long to run into limitations. It makes little sense to define, say, a Map(K, V) without a generic function to manipulate it. This also doesn’t work:
typedef struct {
    Slice(Str) names;
    Slice(Slice(float)) edges;
} Graph;
Due to Slice##T in the macro, required to establish a unique tag for each element type. The parameter to the macro must be an identifier, so you have to build up to it (or define another macro), which sort of defeats the purpose, which was entirely about convenience.
typedef Slice(float) Edges;

typedef struct {
    Slice(Str) names;
    Slice(Edges) edges;
} Graph;
The benefits are small enough that perhaps it’s not worth the costs, but it’s been at least worth investigating. I’ve written a small demo of the technique if you’d like to see it in action, or to test the abilities of your local C implementation: demo.c
For the purposes of this discussion, the actual allocator isn’t important. It could be a simple arena allocator, or a more general purpose buddy allocator. It could even be garbage collected with Boehm GC. Though WebAssembly’s linear memory is a poor fit for such a conservative garbage collector. In a compact address space starting at zero, and which doesn’t include code, memory addresses will be small numbers, and less distinguishable from common integer values. There’s also the issue that the garbage collector cannot scan the Wasm stack, which is hidden from Wasm programs by design. Only the ABI stack is visible. So a garbage collector requires cooperation from the compiler — essentially as a distinct calling convention — to spill all heap pointers on the ABI stack before function calls. Wasm C and C++ toolchains do not yet support this in a practical capacity.
Exporting a static heap
Let’s start with the embedded case because it’s simpler, and reserve a dynamic memory region at link time. WebAssembly has just reached 8 years old, so it’s early, and as we keep discovering, Wasm tooling is still immature. wasm-ld doesn’t understand linker scripts, and there’s no stable, low-level assembly language on which to build, e.g. to reserve space, define symbols, etc. WAT is too high level and inflexible for this purpose, as we’ll soon see. So our only option is to brute force it in a high-level language:
char heap[16<<20]; // 16MiB
Plugging it into an arena allocator:
typedef struct {
    char *beg;
    char *end;
} Arena;

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    a.end = heap + sizeof(heap);
    return a;
}
Unfortunately heap isn’t generic memory, but a high-level variable with a specific, fixed type. That’s why it would have been nice to reserve the memory outside the high-level language. In practice this works fine so long as everything is aligned, but strictly speaking, allocating any type except char from this arena involves incompatible loads and stores on a char array. Clang doesn’t document any inline assembly interface for Wasm, but neither does Clang forbid it. That leaves just enough room to launder the pointer if you’re worried about this technicality:
Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    asm ("" : "+r"(a.beg));  // launder
    a.end = a.beg + sizeof(heap);
    return a;
}
The +r means a.beg is both input and output. The address of the heap goes into the black box, and as far as the compiler is concerned, some mystery address comes out which, critically, has no effective type. The assembly block is empty (""), so it’s just a no-op, and we know (wink wink) it’s really the same address. Because the heap was “used” by the black box, Clang won’t optimize the heap out of existence beneath us. Also note that a.end was derived from the laundered pointer.
Update: The next C standard will improve this situation, and so pointer laundering will no longer be necessary. A straight char array could be used as an arena.
This static variable technique works well only in an exported memory configuration, which is what wasm-ld uses by default. When a module exports its memory, it indicates how much linear memory it requires on start, and the Wasm runtime allocates and zero-initializes it at module initialization time. C and C++ toolchains depend on that runtime zeroing to initialize static and global variables, which are defined to be so initialized. Compilers generate code assuming these variables are zero-initialized. This same paradigm is used for .bss sections in hosted environments.
In an imported memory configuration, linear memory is uninitialized. The memory may be re-used from, say, a destroyed module without zeroing, and may contain arbitrary data. In that case, C and C++ toolchains must zero the memory explicitly. It could potentially be done with a memory.fill instruction in the start section, but LLVM does not support start sections. Instead it uses an active data segment — a chunk of data copied into linear memory by the Wasm runtime during initialization, before running the start function.
That is, when importing memory, LLVM actually stores all those zeros in the Wasm module so that the runtime can copy it into linear memory. Wasm has no built-in compression, so your Wasm module will be at least as large as your heap! Exporting or importing memory is determined at link-time, so at compile-time the compiler must assume the worst case. If you compile the example above, you get a 16MiB “object” file (in Wasm format):
$ clang --target=wasm32 -c -O example.c
$ du -h example.o
16.0M example.o
The WAT version of this file is 48MiB — clearly unsuitable as a low-level assembler. If linking with exported memory, wasm-ld discards all-zero active data segments. If using an imported memory configuration, it’s copied into the final image, producing a huge Wasm image, though a highly compressible one. As a rule, avoid importing memory when using an LLVM toolchain. Regardless, large heaps created this way will have a significant compile-time cost.
$ time echo 'char heap[256<<20];' | clang --target=wasm32 -c -xc -
real 0m0.334s
user 0m0.013s
sys 0m0.262s
(If only Clang had some sort of “noinit” variable attribute in order to allow heap to be uninitialized…)
Growing a dynamic heap
Wasm programs can grow linear memory using an sbrk-like memory.grow instruction. It operates in quantities of pages (64KiB), and returns the old memory size. Because memory starts at zero, the old memory size is also the base address of the new allocation. Clang provides access to this instruction via an undocumented built-in:
size_t __builtin_wasm_memory_grow(int, size_t);
The first parameter selects a memory because someday there might be more than one. From this built-in we can define sbrk:
void *sbrk(ptrdiff_t size)
{
    size_t npages = (size + 0xffffu) >> 16;  // round up
    size_t old = __builtin_wasm_memory_grow(0, npages);
    if (old == -1ul) {
        return 0;
    }
    return (void *)(old << 16);
}
To which Clang compiles (note the memory.grow):
(func $sbrk (param i32) (result i32)
  (select
    (i32.const 0)
    (i32.shl
      (local.tee 0
        (memory.grow
          (i32.shr_u
            (i32.add
              (local.get 0)
              (i32.const 65535))
            (i32.const 16))))
      (i32.const 16))
    (i32.eq
      (local.get 0)
      (i32.const -1))))
Applying that to create an arena like before:
Arena newarena(ptrdiff_t cap)
{
    Arena a = {0};
    a.beg = sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}
Now we can choose the size of the arena, and we can use this to create multiple arenas (e.g. permanent, scratch, etc.). We could even continue growing the last-created arena in-place when it’s full.
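Here is a host-runnable sketch of that in-place growth, with a hypothetical fake_sbrk (a static buffer bumped in page units) standing in for the memory.grow wrapper. When the arena was the most recent allocation and its capacity is a page multiple, new pages land exactly at its end, so only the end pointer moves:

```cpp
#include <cassert>
#include <cstddef>

struct Arena { char *beg, *end; };

// Host-side stand-in for the Wasm sbrk: bump a cursor through a static
// "linear memory" in 64KiB page units.
static char memory[1 << 20];
static ptrdiff_t used;

static void *fake_sbrk(ptrdiff_t size)
{
    ptrdiff_t npages = (size + 0xffff) >> 16;  // round up to pages
    if (used + (npages << 16) > (ptrdiff_t)sizeof(memory)) {
        return nullptr;  // out of linear memory
    }
    void *r = memory + used;
    used += npages << 16;
    return r;
}

static Arena newarena(ptrdiff_t cap)
{
    Arena a = {};
    a.beg = (char *)fake_sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}

// Grow the most recently created arena: assuming its capacity is a page
// multiple, the new pages begin exactly at its current end.
static bool growarena(Arena *a, ptrdiff_t more)
{
    char *p = (char *)fake_sbrk(more);
    if (!p) {
        return false;
    }
    assert(p == a->end);  // contiguous: nothing allocated since
    a->end = p + more;
    return true;
}
```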
If there were no memory.grow instruction, it could be implemented as a request through an imported function. The embedder using the Wasm runtime can grow the memory on the module’s behalf in the same manner. But as that documentation indicates, either way growing the memory comes with a downside in the most common Wasm runtimes, browsers: it “detaches” the memory from references, which complicates its use for the embedder. If a Wasm module may grow its memory at any time, the embedder must reacquire the memory handle after every call. It’s not difficult, but it’s easy to forget, and mistakes are likely to go unnoticed until later.
Importing a dynamic heap
There’s a middle ground where a Wasm module imports a dynamic-sized heap. That is, linear memory beyond the module’s base initialization. This might be the case, for instance, in a programming competition, where contestants submit Wasm modules which must complete a task using the supplied memory. In that case we don’t reserve a static heap, so we’re not facing the storing-zeros issue. However, how do we “find” the memory? Linear memory layout will look something like so:
0 <-- stack | data | heap --> ?
|-----------------------------|
This diagram reflects the more sensible wasm-ld --stack-first layout, where the ABI stack overflows off the bottom end of memory. The heap is just excess memory beyond the data. To find the upper bound, Wasm has a memory.size instruction to query the linear memory size, which again Clang provides as an undocumented built-in:
size_t __builtin_wasm_memory_size(int);
Like before, this returns the result as a number of 64KiB pages. That’s the high end. How do we find the low end? Similar to __stack_pointer, the linker creates a __heap_base constant, which is the address delineating data and heap in the diagram above. To use it, we need to declare it:
extern char __heap_base[];
Notice how it’s an array, not a pointer. It doesn’t hold an address, it is an address. In an ELF context this would be called an absolute symbol. That’s everything we need to find the bounds of the heap:
Arena getarena(void)
{
    Arena a = {0};
    a.beg = __heap_base;
    a.end = (char *)(__builtin_wasm_memory_size(0) << 16);
    return a;
}
Then we continue forward using whatever memory the embedder deigned to provide. Hopefully it’s enough!
Wasm is a specification defining an abstract stack machine with a Harvard architecture, and related formats. There are just four types: i32, i64, f32, and f64. It also has “linear” octet-addressable memory starting at zero, with no alignment restrictions on loads and stores. Address zero is a valid, writable address, which resurfaces some old-school, high-level-language challenges regarding null pointers. There are 32-bit and 64-bit flavors, though the latter remains experimental. That suits me: I appreciate smaller pointers on 64-bit hosts, and I wish I could opt into them more often (e.g. x32).
As browser tech goes, they chose an apt name: WebAssembly is to the web as JavaScript is to Java.
There are distinct components at play, and much of the online discussion doesn’t do a great job drawing lines between them:
-
Wasm module: A compiled and linked image — like ELF or PE — containing sections for code, types, globals, import table, export table, and so on. The export table lists the module’s entry points. It has an optional start section indicating which function initializes a loaded image. (In practice almost nobody actually uses the start section.) A Wasm module can only affect the outside world through imported functions. Wasm itself defines no external interfaces for Wasm programs, not even printing or logging.
-
Wasm runtime: Loads Wasm modules, linking import table entries into the module. Because Wasm modules include types, the runtime can type check this linkage at load time. With imports resolved, it executes the start function, if any, then executes zero or more of its entry points, which hopefully invoke import functions in such a way as to produce useful results, or perhaps simply return useful outputs.
-
Wasm compiler: Converts a high-level language to low-level Wasm. In order to do so, it requires some kind of Application Binary Interface (ABI) to map the high-level language concepts onto the machine. This typically introduces additional execution elements, and it’s important that we distinguish them from the abstract machine’s execution elements. Clang is the only compiler we’ll be discussing in this article, though there are many. During compilation the function indices are yet unknown and so references will need to be patched in by a linker.
-
Wasm linker: Settles the shape of the Wasm module and links up the functions emitted by the compiler. LLVM comes with
wasm-ld
, and it goes hand-in-hand with Clang as a compiler. -
Language runtime: Unless you’re hand-writing raw Wasm, your high-level language probably has a standard library with operating system interfaces. C standard library, POSIX interfaces, etc. This runtime likely maps onto some standardized set of imports, most likely the aforementioned WASI, which defines a set of POSIX-like functions that Wasm modules may import. Because I think we could do better, as usual around here, in this article we’re going to eschew the language runtime and code directly against raw WASI. You still have easy access to hash tables and dynamic arrays.
A combination of compiler-linker-runtime is conventionally called a
toolchain. However, because almost any Clang installation can target
Wasm out-of-the-box, and we’re skipping the language runtime, you can
compile any of the programs discussed in this article, including my game, with
nothing more than Clang (invoking wasm-ld
implicitly). If you have a
Wasm runtime, which includes your browser, you can run them, too! Though
this article will mostly focus on WASI, and you’ll need a WASI-capable
runtime to run those examples, which doesn’t include browsers (short of
implementing the API with JavaScript).
I wasn’t particularly happy with the Wasm runtimes I tried, so I cannot enthusiastically recommend one. I’d love it if I could point to one and say, “Use the same Clang to compile the runtime that you’re using to compile Wasm!” Alas, I had issues compiling, the runtime was buggy, or WASI was incomplete. However, wazero (Go) was the easiest for me to use and it worked well enough, so I will use it in examples:
$ go install github.com/tetratelabs/wazero/cmd/wazero@latest
The Wasm Binary Toolkit (WABT) is good to have on hand when working
with Wasm, particularly wasm2wat
to inspect Wasm modules, sort of like
objdump
or readelf
. It converts Wasm to the WebAssembly Text Format
(WAT).
Learning Wasm I had quite some difficulty finding information. Outside of
the Wasm specification, which, despite its length, is merely a narrow
slice of the ecosystem, important technical details are scattered all over
the place. Some are only available as source code, some are buried in
comments in GitHub issues, and some are lost behind dead links as
repositories have moved. Large parts of LLVM are undocumented beyond a
mention of existence. WASI
has no documentation in a web-friendly format — so I have nothing to link
from here when I mention its system calls — just some IDL sources in a Git
repository. An old wasi.h
was the most readable, complete
source of truth I could find.
Fortunately Wasm is old enough that LLMs are well-versed in it, and simply asking questions, or for usage examples, was more effective than searching online. If you’re stumped on how to achieve something in the Wasm ecosystem, try asking a state-of-the-art LLM for help.
Example programs
Let’s go over concrete examples to lay some foundations. Consider this simple C function:
float norm(float x, float y)
{
return x*x + y*y;
}
To compile to Wasm (32-bit) with Clang, we use the --target=wasm32
flag:
$ clang -c --target=wasm32 -O example.c
The object file example.o
is in Wasm format, so WABT can examine it.
Here’s the output of wasm2wat -f
, where -f
produces output in the
“folded” format, which is how I prefer to read it.
(module
(type (;0;) (func (param f32 f32) (result f32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $norm (type 0) (param f32 f32) (result f32)
(f32.add
(f32.mul
(local.get 0)
(local.get 0))
(f32.mul
(local.get 1)
(local.get 1)))))
We can see the ABI taking shape: Clang has predictably mapped
float
into f32
. It similarly maps char
, short
, int
and long
onto i32
. In 64-bit Wasm, the Clang ABI is LP64 and maps long
onto
i64
. There’s also a $norm
function which takes two f32
parameters
and returns an f32
.
Getting a little more complex:
__attribute((import_name("f")))
void f(int *);
__attribute((export_name("example")))
void example(int x)
{
f(&x);
}
The import_name
function attribute indicates the module will not define
it, even in another translation unit, and that it intends to import it.
That is, wasm-ld
will place it in the import table. The export_name
function attribute indicates it’s an entry point, and so wasm-ld
will
list it in the export table. Linking it will make things a little clearer:
$ clang --target=wasm32 -nostdlib -Wl,--no-entry -O example.c
The -nostdlib
is because we won’t be using a language runtime, and
--no-entry
to tell the linker not to implicitly export a function
(default: _start
) as an entry point. You might think this is connected
with the Wasm start function, but wasm-ld
does not support the start
section at all! We’ll have use for an entry point later. The folded WAT:
(module $a.out
(type (;0;) (func (param i32)))
(import "env" "f" (func $f (type 0)))
(func $example (type 0) (param i32)
(local i32)
(global.set $__stack_pointer
(local.tee 1
(i32.sub
(global.get $__stack_pointer)
(i32.const 16))))
(i32.store offset=12
(local.get 1)
(local.get 0))
(call $f
(i32.add
(local.get 1)
(i32.const 12)))
(global.set $__stack_pointer
(i32.add
(local.get 1)
(i32.const 16))))
(table (;0;) 1 1 funcref)
(memory (;0;) 2)
(global $__stack_pointer (mut i32) (i32.const 66560))
(export "memory" (memory 0))
(export "example" (func $example)))
There’s a lot to unfold:
-
Pointers were mapped onto
i32
. Pointers are a high-level concept, and linear memory is addressed by an integral offset. This is typical of assembly after all. -
There’s now a
__stack_pointer
, which is part of the Clang ABI, not Wasm. The Wasm abstract machine is a stack machine, but that stack doesn’t exist in linear memory. So you cannot take the address of values on the Wasm stack. There are lots of things C needs from a stack that Wasm doesn’t provide. So, in addition to the Wasm stack, Clang maintains another downward-growing stack in linear memory for these purposes, and the__stack_pointer
global is the stack register of its ABI. We can see it’s allocated something like 64kB for the stack. (It’s a little more because program data is placed below the stack.) -
It should be mostly readable without knowing Wasm: The function subtracts a 16-byte stack frame, stores a copy of the argument in it, then uses its memory offset for the first parameter to the import
f
. Why 16 bytes when it only needs 4? Because the stack is kept 16-byte aligned. Before returning, the function restores the stack pointer.
As mentioned earlier, address zero is valid as far as the Wasm runtime is concerned, though dereferences are still undefined in C. This makes it more difficult to catch bugs. Given a null pointer, this function would most likely read a zero at address zero and the program would keep running:
int get(int *p)
{
return *p;
}
In WAT:
(func $get (type 0) (param i32) (result i32)
(i32.load
(local.get 0)))
Since the “hardware” won’t fault for us, ask Clang to do it instead:
$ clang ... -fsanitize=undefined -fsanitize-trap ...
Now in WAT:
(module
(type (;0;) (func (param i32) (result i32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $get (type 0) (param i32) (result i32)
(block ;; label = @1
(block ;; label = @2
(br_if 0 (;@2;)
(i32.eqz
(local.get 0)))
(br_if 1 (;@1;)
(i32.eqz
(i32.and
(local.get 0)
(i32.const 3)))))
(unreachable))
(i32.load
(local.get 0))))
Given a null pointer, get
executes the unreachable
instruction,
causing the runtime to trap. In practice this is unrecoverable. Consider:
nothing will restore __stack_pointer
, and so the stack will “leak” the
existing frames. (This can be worked around by exporting __stack_pointer
and __stack_high
via the --export
linker flag, then restoring the
stack pointer in the runtime after traps.)
Wasm was extended with bulk memory operations, and so there are
single instructions for memset
and memmove
, which Clang maps onto the
built-ins:
void clear(void *buf, long len)
{
__builtin_memset(buf, 0, len);
}
(Below LLVM 20 you will need the undocumented -mbulk-memory
option.) In WAT we see this as memory.fill
:
(module
(type (;0;) (func (param i32 i32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $clear (type 0) (param i32 i32)
(block ;; label = @1
(br_if 0 (;@1;)
(i32.eqz
(local.get 1)))
(memory.fill
(local.get 0)
(i32.const 0)
(local.get 1)))))
That’s great! I wish this worked so well outside of Wasm. It’s one reason
w64devkit has -lmemory
, after all. Similarly __builtin_trap()
maps
onto the unreachable
instruction, so we can reliably generate those as
well.
What about structures? They’re passed by address. Parameter structures go on the stack, then their address is passed. To return a structure, a function accepts an implicit out parameter in which to write the return. This isn’t unusual, except that it’s challenging to manage across module boundaries, i.e. in imports and exports, because caller and callee are in different address spaces. It’s especially tricky to return a structure from an export, as the caller must somehow allocate space in the callee’s address space for the result. The multi-value extension solves this, but using it in C involves an ABI change, which is still experimental.
Water Sort Game
Something you might not have expected: My water sort game imports no functions! It only exports three functions:
void game_init(i32 seed);
DrawList *game_render(i32 width, i32 height, i32 mousex, i32 mousey);
void game_update(i32 input, i32 mousex, i32 mousey, i64 now);
The game uses IMGUI-style rendering. The caller passes in the inputs, and the game returns a kind of display list telling it what to draw. In the SDL version these turn into SDL renderer calls. In the web version, these turn into canvas draws, and “mouse” inputs may be touch events. It plays and feels the same on both platforms. Simple!
I didn’t realize it at the time, but building the SDL version first was critical to my productivity. Debugging Wasm programs is really dang hard! Wasm tooling has yet to catch up with 1995, let alone 2025. Source-level debugging is still experimental and impractical. Developing applications on the Wasm platform is about as ergonomic as developing in MS-DOS. Instead, develop on a platform much better suited for it, then port your application to Wasm after you’ve got the issues worked out. The less Wasm-specific code you write, the better, even if it means writing more code overall. Treat it as you would some weird embedded target.
The game comes with 10,000 seeds. I generated ~200 million puzzles, sorted them by difficulty, and skimmed the top 10k most challenging. In the game they’re still sorted by increasing difficulty, so it gets harder as you make progress.
Wasm System Interface
WASI allows us to get a little more hands on. Let’s start with a Hello
World program. A WASI application exports a traditional _start
entry
point which returns nothing and takes no arguments. I’m also going to set
up some basic typedefs:
typedef unsigned char u8;
typedef signed int i32;
typedef signed long long i64;
typedef signed long iz;
void _start(void)
{
}
wasm-ld
will automatically export this function, so we don’t need an
export_name
attribute. This program successfully does nothing:
$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm && echo ok
ok
To write output WASI defines fd_write()
:
typedef struct {
u8 *buf;
iz len;
} IoVec;
#define WASI(s) __attribute((import_module("wasi_unstable"),import_name(s)))
WASI("fd_write") i32 fd_write(i32, IoVec *, iz, iz *);
Technically those iz
variables are supposed to be size_t
, passed
through Wasm as i32
, but this is a foreign function, I know the ABI, and
so I can do as I please. I absolutely love that WASI barely uses
null-terminated strings, not even for paths, which is a breath of fresh
air, but they still marred the API with unsigned sizes. Which I
choose to ignore.
This function is shaped like POSIX writev()
. I’ve also set it
up for import, including a module name. The oldest, most stable version of
WASI is called wasi_unstable
. (I suppose it shouldn’t be surprising that
finding information in this ecosystem is difficult.)
Every returning WASI function returns an errno
value, with zero as
success rather than some kind of in-band signaling. Hence the
final out parameter unlike POSIX writev()
.
Armed with this function, let’s use it:
void _start(void)
{
u8 msg[] = "hello world\n";
IoVec iov = {msg, sizeof(msg)-1};
iz len = 0;
fd_write(1, &iov, 1, &len);
}
Then:
$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm
hello world
Keep going and you’ll have something like printf
before long. If
the write fails, we should probably communicate the error with at least
the exit status. Because _start
doesn’t return a status, we need to
exit, for which we have proc_exit
. It doesn’t return, so no errno
return value.
WASI("proc_exit") void proc_exit(i32);
void _start(void)
{
// ...
i32 err = fd_write(1, &iov, 1, &len);
proc_exit(!!err);
}
To get the command line arguments, call args_sizes_get
to get the size,
allocate some memory, then args_get
to read the arguments. Same goes for
the environment with a similar pair of functions. The sizes do not include
a null pointer terminator, which is sensible.
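Pieced together, argument loading might look something like the following sketch. The prototypes are my own reading of the WASI interface definitions (taking the same iz liberties as fd_write above), so treat them as assumptions rather than gospel:

```c
WASI("args_sizes_get") i32 args_sizes_get(iz *, iz *);
WASI("args_get")       i32 args_get(u8 **, u8 *);

void _start(void)
{
    iz argc   = 0;
    iz buflen = 0;
    if (args_sizes_get(&argc, &buflen)) {
        proc_exit(1);
    }

    // Allocate however you like; static buffers keep the sketch simple.
    static u8 *argv[256];
    static u8  buf[1<<12];
    if (argc > 256 || buflen > 1<<12) {
        proc_exit(1);  // sketch: punt on oversized inputs
    }
    args_get(argv, buf);  // each argv[i] points into buf

    // ... use argc and argv as usual ...
}
```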
Now that you know how to find and use these functions, you don’t need me to go through each one. However, opening files is a special, complicated case:
WASI("path_open") i32 path_open(i32,i32,u8*,iz,i32,i64,i64,i32,i32*);
That’s 9 parameters — and I had thought Win32 CreateFileW
was
over the top. It’s even more complex than it looks. It works more like
POSIX openat()
, except there’s no current working directory
and so no AT_FDCWD
. Every file and directory is opened relative to
another directory, and absolute paths are invalid. If there’s no
AT_FDCWD
, how does one open the first directory? That’s called a
preopen and it’s core to the file system security mechanism of WASI.
The Wasm runtime preopens zero or more directories before starting the
program and assigns them the lowest numbered file descriptors starting at
file descriptor 3 (after standard input, output, and error). A program
intending to use path_open
must first traverse the file descriptors,
probing for preopens with fd_prestat_get
and retrieving their path name
with fd_prestat_dir_name
. This name may or may not map back onto a real
system path, and so this is a kind of virtual file system for the Wasm
module. The probe stops on the first error.
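Here’s a sketch of that probe. The prototypes, and the prestat layout as a tag word followed by a name length, are my reading of the WASI definitions, so double-check them against your wasi.h:

```c
WASI("fd_prestat_get")      i32 fd_prestat_get(i32, i32 *);
WASI("fd_prestat_dir_name") i32 fd_prestat_dir_name(i32, u8 *, iz);

void findpreopens(void)
{
    for (i32 fd = 3;; fd++) {
        i32 prestat[2] = {0};  // [0] tag, [1] directory name length
        if (fd_prestat_get(fd, prestat)) {
            break;  // the probe stops on the first error
        }
        static u8 path[1<<10];
        iz len = prestat[1];
        if (len <= (iz)sizeof(path) && !fd_prestat_dir_name(fd, path, len)) {
            // ... remember (fd, path, len) for later path_open calls ...
        }
    }
}
```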
To open an absolute path, it must find a matching preopen, then from it construct a path relative to that directory. This part I much dislike, as the module must contain complex path parsing functionality even in the simple case. Opening files is the most complex piece of the whole API.
I mentioned before that program data is below the Clang stack. With the stack growing down, this sounds like a bad idea. A stack overflow quietly clobbers your data, and is difficult to recognize. More sensible to put the stack at the bottom so that it overflows off the bottom of memory and causes a fast fault. Fortunately there’s a switch for that:
$ clang --target=wasm32 ... -Wl,--stack-first ...
This is what you want by default. The actual default layout is left over
from an early design flaw in wasm-ld
, and it’s an oversight that it has
not yet been corrected.
u-config
The above is in action in the u-config Wasm port. You can download the Wasm module, pkg-config.wasm, used in the web demo to run it in your favorite WASI-capable Wasm runtime:
$ wazero run pkg-config.wasm --modversion pkg-config
0.33.3
Though there are no preopens, so it cannot read any files. The -mount
option maps real file system paths to preopens. This mounts the entire
root file system read-only (ro
) as /
.
$ wazero run -mount /::ro pkg-config.wasm --cflags sdl2
-I/usr/include/SDL2 -D_REENTRANT
I doubt this is useful for anything, but it was a vehicle for learning and trying Wasm, and the results are pretty neat.
In the next article I discuss allocating the allocator.
The original program uses ARM64 assembly. I’m a lot more comfortable with x86-64 assembly, plus that’s the hardware I have readily on hand. So the assembly language will be different, but all the concepts apply to both these architectures. Almost none of these OpenBSD system interfaces are formally documented (or stable for that matter), and I had to dig around the OpenBSD source tree to figure it out (along with a helpful jart nudge). So don’t be afraid to get your hands dirty.
There are lots of subtle problems in the original demo, so let’s go through the program piece by piece, starting with the entry point:
void
start()
{
w("hello\n", 6);
x();
}
This function is registered as the entry point in the ELF image, so it has
no caller. That means no return address on the stack, so the stack is
not aligned for a function.(Correction: The stack alignment issue is
true for x86, but not ARM, so the original demo is fine.) In toy programs
that goes unnoticed, but compilers generate code assuming the stack is
aligned. In a real application this is likely to crash deep on the first
SIMD register spill.
We could fix this with a force_align_arg_pointer
attribute, at
least for architectures that support it, but I prefer to write the entry
point in assembly. Especially so we can access the command line arguments
and environment variables, which is necessary in a real application. That
happens to work the same as it does on Linux, so here’s my old, familiar
entry point:
asm (
" .globl _start\n"
"_start: mov %rsp, %rdi\n"
" call start\n"
);
Per the ABI, the first argument passes through rdi
, so I pass a copy of
the stack pointer, rsp
, as it appeared on entry. Entry point arguments
argc
, argv
, and envp
are all pushed on the stack at rsp
, so the
first real function can retrieve it all from just the stack pointer. The
original demo won’t use it, though. Using call
to pass control pushes a
return address, which will never be used, and aligns the stack for the
first real function. I name it _start
because that’s what the linker
expects, so things will go a little smoother. It’s rather convenient
that the original didn’t use this name.
Next up, the “write” function:
int
w(void *what, size_t len) {
__asm(
" mov x2, x1;"
" mov x1, x0;"
" mov w0, #1;"
" mov x8, #4;"
" svc #0;"
);
return 0;
}
There are two serious problems with this assembly block. First, the function arguments are not necessarily in those registers by the time control reaches the basic assembly block. The function prologue could move them around. Even more so if this function was inlined. This is exactly the problem extended inline assembly is intended to solve. Second, it clobbers a number of registers. Compilers assume this does not happen when generating their own code. This sort of assembly falls apart the moment it comes into contact with a non-zero optimization level.
Solving this is just a matter of using inline assembly properly:
long w(void *what, long len)
{
char err;
long rax = 4; // SYS_write
asm volatile (
"syscall"
: "+a"(rax), "+d"(len), "=@ccc"(err)
: "D"(1), "S"(what)
: "rcx", "r11", "memory"
);
return err ? -rax : rax;
}
I’ve enhanced it a bit, returning a Linux-style negative errno on
error. In the BSD ecosystem, syscall errors are indicated using the carry
flag, which here is output into err
via =@ccc
. When set, the return
value is an errno. Further, the OpenBSD kernel uses both rax
and rdx
for return values, so I’ve also listed rdx
as an input+output even though I don’t consume the result. Despite all
these changes, this function is not
yet complete! We’ll get back to it later.
The “exit” function, x
, is just fine:
void
x() {
__asm(
" mov x8, #1;"
" svc #0;"
);
}
It doesn’t set an exit status, so it passes garbage instead, but otherwise this works. No inputs, plus clobbers and outputs don’t matter when control never returns. In a real application I might write it:
__attribute((noreturn))
void x(int status)
{
asm volatile ("syscall" :: "a"(1), "D"(status));
__builtin_unreachable();
}
This function will need a little additional work later, too.
The ident
section is basically fine as-is:
__asm(" .section \".note.openbsd.ident\", \"a\"\n"
" .p2align 2\n"
" .long 8\n"
" .long 4\n"
" .long 1\n"
" .ascii \"OpenBSD\\0\"\n"
" .long 0\n"
" .previous\n");
The compiler assumes the current section remains the same at the end of
the assembly block, which here is accomplished with .previous
. Though it
clobbers the assembler’s remembered “other” section and so may interfere
with surrounding code using .previous
. Better to use .pushsection
and
.popsection
for good stack discipline. There are many such examples in
the OpenBSD source tree.
asm (
".pushsection .note.openbsd.ident, \"a\"\n"
".long 8, 4, 1, 0x6e65704f, 0x00445342, 0\n"
".popsection\n"
);
Now the trickiest part, the pinsyscall table:
struct whats {
unsigned int offset;
unsigned int sysno;
} happening[] __attribute__((section(".openbsd.syscalls"))) = {
{ 0x104f4, 4 },
{ 0x10530, 1 },
};
Those offsets — offsets from the beginning of the ELF image — were entered manually, and it kind of ruins the whole demo. We don’t have a good way to get at those offsets from C, or any high level language. However, we can solve that by tweaking the inline assembly with some labels:
__attribute((noinline))
long w(void *what, long len)
{
// ...
asm volatile (
"_w: syscall"
// ...
);
// ...
}
__attribute((noinline,noreturn))
void x(int status)
{
asm volatile (
"_x: syscall"
// ...
);
// ...
}
Very importantly I’ve added noinline
to prevent these functions from
being inlined into additional copies of the syscall
instruction, which
of course won’t be registered. This also prevents duplicate labels causing
assembler errors. Once we have the labels, we can use them in an assembly
block listing the allowed syscall instructions:
asm (
".pushsection .openbsd.syscalls\n"
".long _x, 1\n"
".long _w, 4\n"
".popsection\n"
);
That lets the linker solve the offsets problem, which is its main job after all. With these changes the demo works reliably, even under high optimization levels. I suggest these flags:
$ cc -static -nostdlib -no-pie -o where where.c
Disabling PIE with -no-pie
is necessary in real applications or else
strings won’t work. You can apply more flags to strip it down further, but
these are the flags generally necessary to compile these sorts of programs
on at least OpenBSD 7.6.
So, how do I know this stuff works in general? Because I ported my ultra
portable pkg-config clone, u-config, to use raw OpenBSD syscalls:
openbsd_main.c
. Everything still works at high optimization
levels.
$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c libmemory.a
$ ./pkg-config --cflags --libs libcurl
-I/usr/local/include -L/usr/local/lib -lcurl
Because the new syscall wrappers behave just like Linux system calls, it
leverages the linux_noarch.c
platform, and the whole port is ~70 lines
of code. A few more flags (-fno-stack-protector
, -Oz
, -s
, etc.), and
it squeezes into a slim 21.6K static binary.
Despite making no libc calls, it’s not possible to stop compilers from
fabricating (hallucinating?) string function calls, so the build
above depends on external definitions. In the command above, libmemory.a
comes from libmemory.c
found in w64devkit. Alternatively,
and on topic, you could link the OpenBSD libc string functions by
omitting libmemory.a
from the build.
$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c
Though it pulls in a lot of bloat (~8x size increase), and teasing out the necessary objects isn’t trivial.
If you’d like to see the ready-to-run full source: objrender.c
.
All images are screenshots of this program.
First let’s establish the requirements. By robust I mean no undefined behavior for any input, valid or invalid; no out of bounds accesses, no signed overflows. Input is otherwise not validated. Invalid input may load as valid by chance, which will render as either garbage or nothing. The behavior will also not vary by locale.
We’re also only worried about vertices, normals, and triangle faces with
normals. In OBJ these are v
, vn
, and f
elements. Normals let us
light the model effectively while checking our work. A cube fitting this
subset of OBJ might look like:
v -1.00 -1.00 -1.00
v -1.00 +1.00 -1.00
v +1.00 +1.00 -1.00
v +1.00 -1.00 -1.00
v -1.00 -1.00 +1.00
v -1.00 +1.00 +1.00
v +1.00 +1.00 +1.00
v +1.00 -1.00 +1.00
vn +1.00 0.00 0.00
vn -1.00 0.00 0.00
vn 0.00 +1.00 0.00
vn 0.00 -1.00 0.00
vn 0.00 0.00 +1.00
vn 0.00 0.00 -1.00
f 3//1 7//1 8//1
f 3//1 8//1 4//1
f 1//2 5//2 6//2
f 1//2 6//2 2//2
f 7//3 3//3 2//3
f 7//3 2//3 6//3
f 4//4 8//4 5//4
f 4//4 5//4 1//4
f 8//5 7//5 6//5
f 8//5 6//5 5//5
f 3//6 4//6 1//6
f 3//6 1//6 2//6
Take note:
- Some fields are separated by more than one space.
- Vertices and normals are fractional (floating point).
- Faces use 1-indexing instead of 0-indexing.
- Faces in this model lack a texture index, hence
//
(empty).
Inputs may have other data, but we’ll skip over it, including face texture indices, or face elements beyond the third. Some of the models I’d like to test have relative indices, so I want to support those, too. A relative index refers backwards from the last vertex, so the order of the lines in an OBJ matter. For example, the cube faces above could have instead been written:
f -6//-6 -2//-6 -1//-6
f -6//-6 -1//-6 -5//-6
f -8//-5 -4//-5 -3//-5
f -8//-5 -3//-5 -7//-5
f -2//-4 -6//-4 -7//-4
f -2//-4 -7//-4 -3//-4
f -5//-3 -1//-3 -4//-3
f -5//-3 -4//-3 -8//-3
f -1//-2 -2//-2 -3//-2
f -1//-2 -3//-2 -4//-2
f -6//-1 -5//-1 -8//-1
f -6//-1 -8//-1 -7//-1
Due to this the parser cannot be blind to line order, and it must handle negative indices. Relative indexing has the nice effect that we can group faces, and those groups are relocatable. We can reorder them without renumbering the faces, or concatenate models just by concatenating their OBJ files.
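Resolving either form down to a 0-based index is then simple arithmetic given the running element count. A small helper like the following (the name and shape are mine, not from any spec) covers both cases; the caller must still bounds-check the result against the count:

```c
#include <stdint.h>

// Convert an OBJ index (1-based absolute, or negative relative) into a
// 0-based index, given the number of elements parsed so far. Yields a
// negative value for some invalid inputs, e.g. an index of zero, but the
// caller must still bounds-check against count.
int32_t resolve(int32_t idx, int32_t count)
{
    return idx < 0 ? count + idx : idx - 1;
}
```

With the eight cube vertices above, the absolute index 3 and the relative index -6 both resolve to the same vertex.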
The fundamentals
To start off, we’ll be using an arena of course, trivializing memory management while sweeping aside all hard-coded limits. A quick reminder of the interface:
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))
typedef struct {
char *beg;
char *end;
} Arena;
// Always returns an aligned pointer inside the arena. Allocations are
// zeroed. Does not return on OOM (never returns a null pointer).
void *alloc(Arena *, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align);
Also, no null-terminated strings, perhaps the main source of problems with bespoke parsers.
#define S(s) (Str){s, sizeof(s)-1}
typedef struct {
char *data;
ptrdiff_t len;
} Str;
Pointer arithmetic is error prone, so the tricky stuff is relegated to a handful of functions, each of which can be exhaustively validated almost at a glance:
Str span(char *beg, char *end)
{
Str r = {0};
r.data = beg;
r.len = beg ? end-beg : 0;
return r;
}
_Bool equals(Str a, Str b)
{
return a.len==b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}
Str trimleft(Str s)
{
for (; s.len && *s.data<=' '; s.data++, s.len--) {}
return s;
}
Str trimright(Str s)
{
for (; s.len && s.data[s.len-1]<=' '; s.len--) {}
return s;
}
Str substring(Str s, ptrdiff_t i)
{
if (i) {
s.data += i;
s.len -= i;
}
return s;
}
Each avoids the purposeless special cases around null pointers (i.e.
zero-initialized Str
objects) that would otherwise work out naturally.
The space character and all control characters are treated as whitespace
for simplicity. When I started writing this parser, I didn’t define all
these functions up front. I defined them as needed. (A good standard
library would have provided similar definitions out-of-the-box.) If
you’re worried about misuse, add the appropriate assertions.
A powerful and useful string function I’ve discovered, and which I use in
every string-heavy program, is cut
, a concept I shamelessly stole from
the Go standard library:
typedef struct {
Str head;
Str tail;
_Bool ok;
} Cut;
Cut cut(Str s, char c)
{
Cut r = {0};
if (!s.len) return r; // null pointer special case
char *beg = s.data;
char *end = s.data + s.len;
char *cut = beg;
for (; cut<end && *cut!=c; cut++) {}
r.ok = cut < end;
r.head = span(beg, cut);
r.tail = span(cut+r.ok, end);
return r;
}
It slices, it dices, it juliennes! Need to iterate over lines? Cut it up:
Cut c = {0};
c.tail = input;
while (c.tail.len) {
c = cut(c.tail, '\n');
Str line = c.head;
// ... process line ...
}
Need to iterate over the fields in a line? Cut the line on the field
separator. Then cut the field on the element separator. No allocation, no
mutation (strtok
).
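For instance, pulling the vertex and normal indices out of a single face field like 3//1 is two cuts. A standalone sketch, repeating the definitions above so it compiles on its own (the splitfield helper is mine):

```c
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;
#define S(s) (Str){s, sizeof(s)-1}

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

static Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len = beg ? end-beg : 0;
    return r;
}

static Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

// Split a face field "v/vt/vn" into its three components. For "3//1"
// the texture component comes out empty.
static void splitfield(Str field, Str *v, Str *vt, Str *vn)
{
    Cut c = cut(field, '/');
    *v = c.head;
    c = cut(c.tail, '/');
    *vt = c.head;
    *vn = c.tail;
}
```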
Reading input
Unlike a program designed to process arbitrarily large inputs, the
intention here is to load the entire model into memory. We don’t need to
fiddle around with loading a line of input at a time (fgets, getline,
etc.) — the usual approach with OBJ parsers. If the OBJ source cannot fit
in memory, then the model won’t fit in memory. This greatly simplifies the
parser, makes it faster, and lifts hard-coded limits like maximum
line length.
The simple arena I use makes whole-file loading so easy. Read straight
into the arena without checking the file size (ftell
, etc.), which means
streaming inputs (i.e. pipes) work automatically.
Str loadfile(Arena *a, FILE *f)
{
Str r = {0};
r.data = a->beg;
r.len = a->end - a->beg;
r.len = fread(r.data, 1, r.len, f);
return r;
}
Without buffered input, you may need a loop around the read:
Str loadfile(Arena *a, int fd)
{
Str r = {0};
r.data = a->beg;
ptrdiff_t cap = a->end - a->beg;
for (;;) {
ptrdiff_t n = read(fd, r.data+r.len, cap-r.len);
if (n < 1) {
return r; // ignoring read errors
}
r.len += n;
}
}
You might consider triggering an out-of-memory error if the arena was filled to the brim, which almost certainly means the input was truncated. Though that’s likely to happen anyway because the next allocation from that arena will fail.
Side note: When using a multi-GB arena, issuing such huge read requests
stress tests the underlying IO system. I’ve found libc bugs this way. In
this case I used SDL2 for the demo, and SDL lost the ability to
read files after I increased the arena size to 4GB in order to test a
gigantic model (“Power Plant”). I’ve run into this before, and
I assumed it was another Microsoft CRT bug. After investigating deeper for
this article, I learned it’s an ancient SDL bug that’s made it all the way
into SDL3. -Wconversion
warns about it, but was accidentally squelched
in the 64-bit port back in 2009. It seems nobody else loads files
this way, so watch out for platform bugs if you use this technique!
Parsing data
In practice, rendering systems limit counts to the 32-bit range, which is
reasonable. So in the OBJ parser, vertex and normal indices will be 32-bit
integers. Negatives will be needed for at least relative indexing. Parsing
from a Str means functions that require null-terminated input, like
strtol, are off limits.
So here’s a function to parse a signed integer out of a Str
:
int32_t parseint(Str s)
{
uint32_t r = 0;
int32_t sign = 1;
for (ptrdiff_t i = 0; i < s.len; i++) {
switch (s.data[i]) {
case '+': break;
case '-': sign = -1; break;
default : r = 10*r + s.data[i] - '0';
}
}
return r * sign;
}
The uint32_t
means it’s free to overflow. If it overflows, the input was
invalid. If it doesn’t hold an integer, the input was invalid. In either
case it produces a harmless, garbage result. Despite being unsigned, it
works just fine with negative inputs thanks to two’s complement.
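To see the unsigned trick in action, here is a self-contained check — the Str type and S() macro from earlier in the article, plus a copy of parseint as written above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { char *data; ptrdiff_t len; } Str;

#define S(s) (Str){s, sizeof(s)-1}

// Copy of parseint from above: the unsigned accumulator means overflow
// on invalid input wraps harmlessly instead of being undefined behavior,
// and two's complement makes negative results come out right anyway.
static int32_t parseint(Str s)
{
    uint32_t r = 0;
    int32_t sign = 1;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+': break;
        case '-': sign = -1; break;
        default : r = 10*r + s.data[i] - '0';
        }
    }
    return r * sign;
}
```

For "-123" the digits accumulate to 123 in the unsigned accumulator, and the multiply by -1 wraps modulo 2^32 to exactly the bit pattern of -123.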
For floats I didn’t intend to parse exponential notation, but some models I wanted to test actually did use it — probably by accident — so I added it anyway. That requires a function to compute the exponent.
float expt10(int32_t e)
{
float y = 1.0f;
float x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
int32_t n = e<0 ? e : -e;
for (; n < -1; n /= 2) {
y *= n%2 ? x : 1.0f;
x *= x;
}
return x * y;
}
That’s exponentiation by squaring, avoiding signed overflow on the
exponent. Traditionally a negative exponent is inverted, but applying
unary -
to an arbitrary integer might overflow (consider -2147483648).
So instead I iterate from the negative end. The negative range is larger
than the positive, after all. Finally we can parse floats:
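A quick check of that claim — a copy of expt10 as written above, exercised at the pathological exponent. INT32_MIN is never negated, so there is no signed overflow; the result merely underflows to zero:

```c
#include <assert.h>
#include <stdint.h>

// Copy of expt10 from above: iterate from the negative end so the
// exponent never needs negating, which would overflow for INT32_MIN.
static float expt10(int32_t e)
{
    float y = 1.0f;
    float x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
    int32_t n = e<0 ? e : -e;
    for (; n < -1; n /= 2) {
        y *= n%2 ? x : 1.0f;  // n%2 is -1 or 0 for negative n
        x *= x;
    }
    return x * y;
}
```

Even expt10(INT32_MIN) runs to completion: n halves through even values down to -1, and x squares its way below the smallest denormal float, leaving a clean 0.0f.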
float parsefloat(Str s)
{
float r = 0.0f;
float sign = 1.0f;
float exp = 0.0f;
for (ptrdiff_t i = 0; i < s.len; i++) {
switch (s.data[i]) {
case '+': break;
case '-': sign = -1; break;
case '.': exp = 1; break;
case 'E':
case 'e': exp = exp ? exp : 1.0f;
exp *= expt10(parseint(substring(s, i+1)));
i = s.len;
break;
default : r = 10.0f*r + (s.data[i] - '0');
exp *= 0.1f;
}
}
return sign * r * (exp ? exp : 1.0f);
}
Probably not as precise as strtof
, but good enough for loading a model.
It’s also ~30% faster for this purpose than my system’s strtof
. If it
hits an exponent, it combines parseint
and expt10
to augment the
result so far. At least for all the models I tried, the exponent only
appeared for tiny values. They round to zero with no visible effects, so
you can cut the implementation by more than half in one fell swoop if you
wish (no more expt10
nor substring
either):
switch (s.data[i]) {
// ...
case 'E':
case 'e': return 0; // probably small *shrug*
// ...
}
Why not strtof
? That has the rather annoying requirement that input is
null terminated, which is not the case here. Worse, it’s affected by the
locale and doesn’t behave consistently or reliably.
A vertex is three floats separated by whitespace. So combine cut
and
parsefloat
to parse one.
typedef struct {
float v[3];
} Vert;
Vert parsevert(Str s)
{
Vert r = {0};
Cut c = cut(trimleft(s), ' ');
r.v[0] = parsefloat(c.head);
c = cut(trimleft(c.tail), ' ');
r.v[1] = parsefloat(c.head);
c = cut(trimleft(c.tail), ' ');
r.v[2] = parsefloat(c.head);
return r;
}
cut
parses a field between every space, including empty fields between
adjacent spaces, so trimleft
discards extra space before cutting. If the
line ends early, this passes empty strings into parsefloat
which come
out as zeros. No special checks required for invalid input.
A face is a set of three vertex indices and three normal indices, and it parses almost the same way as a vertex. Relative indices are immediately converted to absolute indices using the number of vertices/normals so far.
typedef struct {
int32_t v[3];
int32_t n[3];
} Face;
static Face parseface(Str s, ptrdiff_t nverts, ptrdiff_t nnorms)
{
Face r = {0};
Cut fields = {0};
fields.tail = s;
for (int i = 0; i < 3; i++) {
fields = cut(trimleft(fields.tail), ' ');
Cut elem = cut(fields.head, '/');
r.v[i] = parseint(elem.head);
elem = cut(elem.tail, '/'); // skip texture
elem = cut(elem.tail, '/');
r.n[i] = parseint(elem.head);
// Process relative subscripts
if (r.v[i] < 0) {
r.v[i] = (int32_t)(r.v[i] + 1 + nverts);
}
if (r.n[i] < 0) {
r.n[i] = (int32_t)(r.n[i] + 1 + nnorms);
}
}
return r;
}
Since nverts
must be non-negative, and a relative index is negative by
definition, adding them together can never overflow. If there are too many
vertices, the result might be truncated, as indicated by the cast. That’s
fine. Just invalid input.
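To isolate that argument, here is a hypothetical helper (not part of the parser above) performing just the conversion. For a negative idx, idx + 1 cannot overflow, the remaining addition happens in ptrdiff_t, and only the final narrowing cast can lose information:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: convert a relative (negative) OBJ index into an
// absolute 1-based index, given the current vertex/normal count n.
// idx + 1 stays in range because idx < 0, the sum with n happens in
// ptrdiff_t, and the cast may truncate only on invalid input.
static int32_t absindex(int32_t idx, ptrdiff_t n)
{
    return (int32_t)(idx + 1 + n);
}
```

In OBJ, index -1 refers to the most recently defined vertex, so with five vertices it maps to absolute index 5.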
There’s an interesting interview question here: Consider this alternative
to the above, maintaining the explicit cast to dismiss the -Wconversion
warning.
r.v[i] += (int32_t)(1 + nverts);
Is it equivalent? Can this overflow? (Answers: No and yes.) If yes, under what conditions? Unfortunately a fuzz test would never hit it.
Putting it together
For this case, a model is three arrays of vertices, normals, and indices.
While faces only support 32-bit indexing, I use ptrdiff_t
in order to
skip overflow checks. There cannot possibly be more vertices than bytes of
source, so these counts cannot overflow.
typedef struct {
Vert *verts;
ptrdiff_t nverts;
Vert *norms;
ptrdiff_t nnorms;
Face *faces;
ptrdiff_t nfaces;
} Model;
Model parseobj(Arena *, Str);
They’d probably look a little nicer as dynamic arrays, but we won’t need that machinery. That’s because the parser makes two passes over the OBJ source, the first time to count:
Model m = {0};
Cut lines = {0};
lines.tail = obj;
while (lines.tail.len) {
lines = cut(lines.tail, '\n');
Cut fields = cut(trimright(lines.head), ' ');
Str kind = fields.head;
if (equals(S("v"), kind)) {
m.nverts++;
} else if (equals(S("vn"), kind)) {
m.nnorms++;
} else if (equals(S("f"), kind)) {
m.nfaces++;
}
}
It’s a lightweight pass, skipping over the numeric data. With that information collected, we can allocate the model:
m.verts = new(a, m.nverts, Vert);
m.norms = new(a, m.nnorms, Vert);
m.faces = new(a, m.nfaces, Face);
m.nverts = m.nnorms = m.nfaces = 0;
On the next pass we call parsevert
and parseface
to fill it out.
lines.tail = obj;
while (lines.tail.len) {
lines = cut(lines.tail, '\n');
Cut fields = cut(trimright(lines.head), ' ');
Str kind = fields.head;
if (equals(S("v"), kind)) {
m.verts[m.nverts++] = parsevert(fields.tail);
} else if (equals(S("vn"), kind)) {
m.norms[m.nnorms++] = parsevert(fields.tail);
} else if (equals(S("f"), kind)) {
m.faces[m.nfaces++] = parseface(fields.tail, m.nverts, m.nnorms);
}
}
At this point the model is parsed, though it’s not necessarily consistent. Face indices may still be out of range. The next step is to transform it into a more useful representation.
Transformation
Rendering the model is the easiest way to verify it came out alright, and
it’s generally useful for debugging problems. Because it basically does
all the hard work for us, and doesn’t require ridiculous contortions to
access, I’m going to render with old school OpenGL 1.1. It provides a
glInterleavedArrays
function with a bunch of predefined formats.
The one that interests me is GL_N3F_V3F
, where each vertex is a normal
and a position. Each face is three such elements. I came up with this:
typedef struct { // GL_N3F_V3F
Vert n, v;
} N3FV3F[3];
typedef struct {
N3FV3F *data;
ptrdiff_t len;
} N3FV3Fs;
// Transform a model into a GL_N3F_V3F representation.
N3FV3Fs n3fv3fize(Arena *, Model);
If you’re being precise you’d use GLfloat
, but this is good enough for
me. By using a different arena for this step, we can discard the OBJ data
once it’s in the “local” format. For example:
Arena perm = {...};
Arena scratch = {...};
N3FV3Fs *scene = new(&perm, nmodels, N3FV3Fs);
for (int i = 0; i < nmodels; i++) {
Arena temp = scratch; // free OBJ at end of iteration
Str obj = loadfile(&temp, path[i]);
Model model = parseobj(&temp, obj);
scene[i] = n3fv3fize(&perm, model);
}
The conversion allocates the GL_N3F_V3F
array, discards invalid faces,
and copies the valid faces into the array:
N3FV3Fs n3fv3fize(Arena *a, Model m)
{
N3FV3Fs r = {0};
r.data = new(a, m.nfaces, N3FV3F);
for (ptrdiff_t f = 0; f < m.nfaces; f++) {
_Bool valid = 1;
for (int i = 0; i < 3; i++) {
valid &= m.faces[f].v[i]>0 && m.faces[f].v[i]<=m.nverts;
valid &= m.faces[f].n[i]>0 && m.faces[f].n[i]<=m.nnorms;
}
if (valid) {
ptrdiff_t t = r.len++;
for (int i = 0; i < 3; i++) {
r.data[t][i].n = m.norms[m.faces[f].n[i]-1];
r.data[t][i].v = m.verts[m.faces[f].v[i]-1];
}
}
}
return r;
}
Here’s what that looks like in OpenGL with suzanne.obj
and
bmw.obj
:
This was a fun little project, and perhaps you learned a new technique or two after checking it out.