Surprising to most C++ programmers, but not to language lawyers, idiomatic C memory allocation was ill-formed in C++ until recently:
int *newint(int v)
{
    int *r = (int *)malloc(sizeof(*r));
    if (r) {
        *r = v;  // <-- undefined behavior before C++20
    }
    return r;
}
This program allocates memory for an object but never starts a lifetime.
Assignment without a lifetime is invalid. Pointer casts are that much more
suspicious in C++, and due to lifetime semantics, in many cases indicate
incorrect code. (To be clear, I’m not arguing in favor of these semantics,
but reasoning about the facts on the ground.) C++20 carved out special
exceptions for malloc and friends, but addressing this kind of thing in general is the purpose of the brand-new start_lifetime_as (and similar), the slightly older construct_at, or a classic placement new. They all start lifetimes. The last looks like:
int *newint(int v)
{
    void *r = malloc(sizeof(int));
    if (r) {
        return new(r) int{v};
    }
    return nullptr;
}
That’s no good as a C/C++ polyglot, though per the differing old semantics that was impossible anyway without macros, which is basically cheating. An important detail: the corrected version has no casts, and it returns the result of new. That’s important because only the pointer returned by new is imbued as a pointer to the new lifetime, not r. There are no side effects affecting the provenance of r, which still points to raw memory as far as the language is concerned.
With that in mind let’s revisit my arena from last time, which does not necessarily benefit from the recent changes, not being one of the special case C standard library functions:
struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    for (ptrdiff_t i = 0; i < count; i++) {
        new((void *)&r[i]) T{};
    }
    return r;
}
Hey, look, placement new! I did that to produce a nicer interface, but I
lucked out also starting lifetimes appropriately. Except it returns the
wrong pointer. This allocator discards the pointer blessed with the new
lifetime. Both pointers have the same address but different provenance.
That matters. But I’m calling new many times, so how do I fix this? Array new, duh.
template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}
Wow… that’s actually much better anyway. No explicit casts, no loop. Why didn’t I think of this in the first place? The catch is I can’t forward constructor arguments, emplace-style — the part that gave me the trouble with perfect forwarding — but that’s for the best. Forwarding more than once was unsound, made more obvious by new[].
Caveat: This only works starting in C++20, and strictly with operator new[](size_t, void *). Any other placement new[] may require array overhead — e.g. it prepends an array size so that delete[] can run non-trivial destructors — which is unknowable and therefore impossible to provide or align correctly. Overhead for placement new[] is nonsense, of course, but as of this writing, all three major C++ compilers do it and essentially have broken custom placement new[].
Since I’m thinking about lifetimes, what about the other end? My arena does not call destructors, by design, and starts new lifetimes on top of objects that are technically still alive. Is that undefined behavior? As far as I can tell this is allowed, even for non-trivial destructors, with the caveat that it might leak resources. In this case the resource is memory managed by the arena, so that’s fine of course.
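As a minimal illustration of "starting a new lifetime on top" (my own example, not the article's), storage can be reused by placement new without an intervening destructor call, which at worst leaks whatever the old object owned:

```cpp
#include <cassert>
#include <new>

struct Widget { int id; };

// Reuse the same storage for a second object without destroying the
// first. The old Widget's lifetime ends when its storage is reused;
// no destructor runs, so any owned resources would leak.
int reuse_demo()
{
    alignas(Widget) char storage[sizeof(Widget)];
    Widget *a = new (storage) Widget{1};
    assert(a->id == 1);
    Widget *b = new (storage) Widget{2};  // ends a's lifetime, starts b's
    return b->id;
}
```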
So addressing pointer provenance also produced a nicer definition. What a great result from reading that book! While researching, I noticed Jonathan Müller, who personally gave me great advice and feedback on my previous article, talked about lifetimes just a couple weeks later. I recommend both.
The above description is probably confusing on its own, and an example is worth a thousand words, so here’s a listing naming 7 fields:
timestamp,point.x,point.y,foo.bar.z,point.z,foo.bar.y,foo.bar.x
Where point is a substructure, as are foo and bar, but note they’re interleaved in the record. So if a record contains these values:
{1758158348, 1.23, 4.56, -100, 7.89, -200, -300}
The JSON representation would look like:
{
    "timestamp": 1758158348,
    "point": {
        "x": 1.23,
        "y": 4.56,
        "z": 7.89
    },
    "foo": {
        "bar": {
            "z": -100,
            "y": -200,
            "x": -300
        }
    }
}
Notice point.z moved up and foo.bar.z down, so that substructures are contiguous in this representation as required for JSON. Sorting the field names lexicographically would group them together as a simple solution. However, as an additional constraint I want to retain the original field order as much as possible. For example, timestamp is first in both the original and JSON representations, but sorting would put it last. If all substructures are already contiguous, nothing should change.
Solution with string interning
My solution is to intern the segment strings, assigning each a unique, monotonic integral token in the order they’re observed. In my program, zero is reserved as a special “root” token, and so the first string has the value 1. The concrete values aren’t important, only that they’re assigned monotonically.
The trick is that a string is always interned in the “namespace” of a previous token. That is, we’re building a (token, string) -> token map. For our segments that namespace is the token for the parent structure, and the top-level fields are interned in the reserved “root” namespace. When applied to the example, we get the token sequences:
timestamp -> 1
point.x -> 2 3
point.y -> 2 4
foo.bar.z -> 5 6 7
point.z -> 2 8
foo.bar.y -> 5 6 9
foo.bar.x -> 5 6 10
And our map looks like:
{0, "timestamp"} -> 1
{0, "point"} -> 2
{2, "x"} -> 3
{2, "y"} -> 4
{0, "foo"} -> 5
{5, "bar"} -> 6
{6, "z"} -> 7
{2, "z"} -> 8
{6, "y"} -> 9
{6, "x"} -> 10
Notice how "x" is assigned 3 and 10 due to different namespaces. That’s important because otherwise the fields of foo.bar would sort in the same order as point. Namespaces give these fields unique identities.
Once we have the token representation, sort lexicographically by token. That pulls point.z up to its siblings.
timestamp -> 1
point.x -> 2 3
point.y -> 2 4
point.z -> 2 8
foo.bar.z -> 5 6 7
foo.bar.y -> 5 6 9
foo.bar.x -> 5 6 10
Now we have the “output” order with minimal re-ordering. If substructures were already contiguous, nothing changes. Assuming a reasonable map, this is O(n log n), primarily due to sorting.
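As a sketch of the whole scheme, with std::map standing in for the article's hash trie and all names my own, the interning and sort fit in a few lines:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Intern each dot-separated segment in the namespace of its parent's
// token, then sort fields lexicographically by their token sequences.
std::vector<std::string> order_fields(const std::vector<std::string> &names)
{
    std::map<std::pair<int, std::string>, int> intern;
    int ntokens = 0;  // token 0 is the reserved "root" namespace
    std::vector<std::pair<std::vector<int>, std::string>> fields;
    for (const std::string &name : names) {
        std::vector<int> tokens;
        int prev = 0;
        std::istringstream segments(name);
        for (std::string seg; std::getline(segments, seg, '.');) {
            int &token = intern[{prev, seg}];
            if (!token) {
                token = ++ntokens;  // assign monotonically on first sight
            }
            tokens.push_back(token);
            prev = token;
        }
        fields.emplace_back(std::move(tokens), name);
    }
    std::sort(fields.begin(), fields.end());  // lexicographic by tokens
    std::vector<std::string> out;
    for (const auto &f : fields) out.push_back(f.second);
    return out;
}
```

Because tokens are assigned in order of first appearance, sorting by token sequence reproduces the original order except where a substructure's stragglers must be pulled up to their siblings.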
Alternatives
Before I thought of namespaces, my initial idea was to intern the whole prefix of a segment. The sequence of look-ups would be:
"timestamp" -> 1 -> {1}
"point" -> 2
"point.x" -> 3 -> {2, 3}
"point" -> 2
"point.y" -> 4 -> {2, 4}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.z" -> 7 -> {5, 6, 7}
"point" -> 2
"point.z" -> 8 -> {2, 8}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.y" -> 9 -> {5, 6, 9}
"foo" -> 5
"foo.bar" -> 6
"foo.bar.x" -> 10 -> {5, 6, 10}
Ultimately it produces the same tokens, and this is a more straightforward string -> token map. The prefixes are acting as namespaces. However, I wrote it this way as a kind of visual proof: notice the right-triangle shape formed by the strings for each field. From the area we can see that processing prefixes as strings is O(n^2), quadratic in the number of segments! In my real problem the inputs were never large enough for this to matter, but I hate leaving behind avoidable quadratic algorithms. Using a token as a namespace flattens the prefix to a constant size.
Another option is a different map for each namespace. So for foo.bar.z, look up the "foo" map (string -> map) in the root (string -> map), then within that look up the "bar" table (string -> token) (since this is the penultimate segment), then intern "z" within that to get its token. That wouldn’t have quadratic time complexity, but it seems quite a bit more complicated than a single, flat (token, string) -> token map.
Implementation in C
Because the standard library has little useful for us, I am building on previously-established definitions, so refer to that article for basic definitions like Str. To start off, tokens will be a size-typed integer so we never need to worry about overflowing the token counter. We’d run out of memory first:
typedef ptrdiff_t Token;
We’re building a (token, string) -> token map, so we’ll need a hash function for such keys:
uint64_t hash(Token t, Str s)
{
    uint64_t r = (uint64_t)t << 8;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        r ^= s.data[i];
        r *= 1111111111111111111u;
    }
    return r;
}
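A quick host-side sanity check of that mixing (with std::string standing in for the article's Str, an assumption on my part) shows the namespace token separating otherwise-identical segment strings:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Same mixing scheme as the article's hash: fold the namespace token in
// up front, so the same segment in different namespaces hashes apart.
uint64_t hash(int64_t t, const std::string &s)
{
    uint64_t r = (uint64_t)t << 8;
    for (char c : s) {
        r ^= (unsigned char)c;
        r *= 1111111111111111111u;
    }
    return r;
}
```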
The map itself is a forever-useful hash trie.
typedef struct Map Map;
struct Map {
    Map  *child[4];
    Token namespace;
    Str   segment;
    Token token;
};

Token *upsert(Map **m, Token namespace, Str segment, Arena *a)
{
    for (uint64_t h = hash(namespace, segment); *m; h <<= 2) {
        if (namespace==(*m)->namespace && equals(segment, (*m)->segment)) {
            return &(*m)->token;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(a, 1, Map);
    (*m)->namespace = namespace;
    (*m)->segment = segment;
    return &(*m)->token;  // caller will assign
}
We’ll use this map to convert a string naming a field into a sequence of tokens, so we’ll need a slice. Fields also have an offset within the record and a type, which I’ll track via the field’s original ordering using an index field (e.g. into the original header). Also track the original name:
typedef struct {
    Str          name;
    ptrdiff_t    index;
    Slice(Token) tokens;
} Field;
To sort fields we’ll need a comparator:
ptrdiff_t field_compare(Field a, Field b)
{
    ptrdiff_t len = min(a.tokens.len, b.tokens.len);
    for (ptrdiff_t i = 0; i < len; i++) {
        Token d = a.tokens.data[i] - b.tokens.data[i];
        if (d) {
            return d;
        }
    }
    return a.tokens.len - b.tokens.len;
}
Because field names are unique, each token sequence is unique, and so we need not use index in the comparator.
Finally down to business: cut up the list and build the token sequences with the established push macro. The sort function isn’t interesting, and could be as simple as libc qsort with the above comparator (and adapter), so I’m only listing the prototype.
void field_sort(Slice(Field), Arena scratch);
Slice(Field) parse_fields(Str fieldlist, Arena *a)
{
    Slice(Field) fields = {};
    Map *strtab = 0;
    ptrdiff_t ntokens = 0;
    for (Cut c = {.tail=fieldlist, .ok=true}; c.ok;) {
        c = cut(c.tail, ',');
        Field field = {};
        field.name = c.head;
        field.index = fields.len;
        Token prev = 0;
        for (Cut f = {.tail=field.name, .ok=true}; f.ok;) {
            f = cut(f.tail, '.');
            Token *token = upsert(&strtab, prev, f.head, a);
            if (!*token) {
                *token = ++ntokens;
            }
            *push(a, &field.tokens) = *token;
            prev = *token;
        }
        *push(a, &fields) = field;
    }
    field_sort(fields, *a);
    return fields;
}
Usage here suggests Cut::ok should be inverted to Cut::done so that it better zero-initializes. Something I’ll need to consider. Because it’s all allocated from an arena, there’s no need for destructors or anything like that, so this is the complete implementation. Back to the example:
Str fieldlist = S(
    "timestamp,"
    "point.x,"
    "point.y,"
    "foo.bar.z,"
    "point.z,"
    "foo.bar.y,"
    "foo.bar.x"
);
Slice(Field) fields = parse_fields(fieldlist, &scratch);
for (ptrdiff_t i = 0; i < fields.len; i++) {
    Str name = fields.data[i].name;
    fwrite(name.data, 1, name.len, stdout);
    putchar('\n');
}
This program will print the proper output field order. In a real program we’d hold onto the string table, define an inverse lookup to translate tokens back into strings, and use it when producing output. I do just that in my exploratory program, rec2json.c, written a little differently than presented above. It uses the sorted tokens to compile a simple bytecode program that, when run against a record, produces its JSON representation. It compiles the example to:
OPEN # print '{'
KEY 1 # print token 1 as a key, i.e. "timestamp:"
READ 0 # print double at record offset 0
COMMA # print ','
KEY 2 # print token 2 as a key, i.e. "point:"
OPEN
KEY 3
READ 8 # print double at record offset 8
COMMA
KEY 4
READ 16
COMMA
KEY 8
READ 32
CLOSE # print '}'
COMMA
KEY 5
OPEN
KEY 6
OPEN
KEY 7
READ 24
COMMA
KEY 9
READ 40
COMMA
KEY 10
READ 48
CLOSE
CLOSE
CLOSE
Seeing it written out, I notice more room for improvement. An optimization pass could coalesce instructions so that, for instance, OPEN then KEY concatenate to a single string at compile time so that it only needs one instruction. This program could be 15 instructions instead of 31. In my real case I didn’t need anything quite this sophisticated, but it was fun to explore.
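A sketch of such a pass, over a hypothetical two-op encoding of my own (LIT for any fixed string such as OPEN, COMMA, or a rendered KEY; READ for a record offset), merges adjacent fixed-string ops:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical two-op bytecode: a LIT op prints a fixed string, a READ
// op prints a double at a record offset. A peephole pass concatenates
// runs of adjacent LITs into one instruction.
struct Op { bool lit; std::string text; int offset; };

std::vector<Op> coalesce(const std::vector<Op> &prog)
{
    std::vector<Op> out;
    for (const Op &op : prog) {
        if (op.lit && !out.empty() && out.back().lit) {
            out.back().text += op.text;  // merge into the previous LIT
        } else {
            out.push_back(op);
        }
    }
    return out;
}
```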
git log is a kind of miniature project blog. I understand this is neither discoverable nor obvious, especially because the GitHub UI (ugh) lacks anything like git log in the terminal. So let’s talk about it here, along with other recent changes.
NASM is a nice assembler, and in general I still prefer its x86 syntax to the GNU Assembler, GAS. It’s tidy, self-contained, dependency-free other than a C toolchain, reliable, and easy to build and cross-compile, which is why I included it in the first place. However, it’s just not a good fit for w64dk. It’s redundant with Binutils as, which is already mandatory for supporting GCC. As a rule, w64dk is a curation that avoids redundancy, and a second assembler requires special justification. Originally it was that the syntax was nicer, but last year I decided that wasn’t enough to outweigh NASM’s weaknesses.
First, it doesn’t integrate well with the GNU toolchain, at least not to the extent that as does. For example, many people don’t realize they can assemble GAS assembly through the general gcc driver.
$ gcc -c myprogram.s
There’s rarely a reason to invoke as directly. Need a specific assembler flag? Use -Wa or -Xassembler. Going through the compiler driver for both assembly and linking (i.e. instead of ld) has the advantage of operating at a “higher level.” The compiler driver knows about the whole toolchain, and better understands the sysroot. Use a capital file extension, .S, and gcc will automatically run it through the C preprocessor.
$ gcc -c myprogram.S
This is quite nifty, especially for cross builds. NASM isn’t so integrated and so requires invoking the nasm command directly. It’s not a big deal, but it’s friction that adds up. But even more, the most useful form of assembly in the context of w64dk is inline assembly, which of course is completely out of NASM’s lane. So most of the time you’re going to be writing GAS anyway. Again, NASM is just not integrated into the toolchain like as.
Second, NASM is a dreadfully slow assembler — in the vicinity of two orders of magnitude slower than GAS! If your assembly program is a couple hundred lines, no big deal. If you’re writing a compiler targeting NASM, it’s impractical beyond toy programs. Friends don’t let friends use NASM as a back-end. If you’re so allergic to GAS, note that YASM has matching syntax and better performance.
Third, and the last nail in the coffin, NASM doesn’t support DWARF debug information for Windows targets (win32/win64). That means you cannot debug NASM programs with GDB on Windows, at least not with source-level debugging. Were you even aware GAS has source-level debugging with GDB? Sure, you can show the assembly pane (layout asm) and step through the disassembly with ni, but stepping through the original source (layout src) is a whole different experience. Even better, bring up the register pane (layout regs) and you have something akin to a Visual Studio watch window. It’s such a pleasant experience, and yet no tutorial I’ve seen has ever mentioned it. It should be the first thing people are taught. Get your act together, assembly tutorials!
In theory, YASM ought to solve this with its -g dwarf2, but alas this feature appears to be broken. So that really just leaves GAS as the most practical game in town for assembly on Windows with GNU-style toolchains. In case it helps: learning a little PDP-11 assembly gave me a deeper understanding — and appreciation — of why GAS x86 is the way it is. Makes it sting a little less than before.
Compiling NASM
At the same time w64dk lost NASM, it gained the ability to run Autotools configure scripts. It only took a few environment variables as hints for the script, and a small hack in the shell to undo an unnecessary Autotools hack. The typical native cmd.exe invocation with a command uses /c:
cmd /c echo hello world
It’s roughly equivalent to -c in a unix shell. This is what you’d use if you’re in another shell and you need to invoke cmd for a specific purpose. However, if you’re in an MSYS2 shell with its virtual file system, /c looks like a path to the C drive. So MSYS2 “helpfully” translates the switch to a native path, something like so:
cmd C:\ echo hello world
Not helpful at all. So Autotools, assuming Cygwin-like environments are the only kind that exist, uses a special escape form when invoking cmd:
cmd //c echo hello world
Which Cygwin-like environments translate into the desired /c. If you’re not in such an environment, then cmd.exe sees the //c, which doesn’t work. So the busybox-w32 shell now pattern matches for precisely:
cmd //c echo [...]
For a similar translation. That’s right, it matches echo in particular because that’s the only cmd feature Autotools uses. So it’s completely unnecessary, just poor code generation.
With that work in place, you can download a NASM source release, untar it, run ./configure, then make -j$(nproc), and copy the resulting nasm.exe into w64dk or wherever else on your $PATH is convenient. The same is true for quite a bit of software! You can build Binutils, including as itself, exactly the same way. Being so easy for users to build their own tools means I’m less concerned with including extraneous, more specialized tools, such as NASM.
Path Style
Borrowing a concept from MSYS2, w64devkit.ini now has a path style option for controlling the initial PATH, using the same names and configuration. I’ve already found it useful in testing w64dk itself in a relatively pristine, hermetic environment:
[w64devkit]
home = .
path style = strict
This uses the w64devkit/ directory itself as $HOME, and $PATH is initially just the w64dk bin/ directory. See the w64devkit.ini header for full documentation.
Otherwise, most of the major features have been discussed already: peports and vc++filt, pkg-config, and xxd.
How can a TU have multiple definitions of a struct? Scope. Prior to C23 this wouldn’t compile because the compound literal type and the return type were distinct types:
struct Example { int x, y, z; };

struct Example example(void)
{
    struct Example { int x, y, z; };
    return (struct Example){1, 2, 3};
}
Otherwise the definition of struct Example within example was fine, if strange. At first this may not seem like a big deal, but let’s revisit my technique for dynamic arrays:
typedef struct {
    T *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} SliceT;
Where I write out one of these for each T that I might want to put into a slice. With the new rule we can change it slightly, taking note of the introduction of a tag (the name after struct):
#define Slice(T)       \
    struct Slice##T {  \
        T *data;       \
        ptrdiff_t len; \
        ptrdiff_t cap; \
    }
This makes the “write it out ahead of time” thing simpler, but with the new rule we can skip the “ahead of time” part and conjure slice types on demand. Each declaration with the same T is compatible with the others due to matching tags and fields. So, for example, with this macro we can declare functions using slices parameterized for different element types.
Slice(int) range(int, Arena *);
float mean(Slice(float));
Slice(Str) split(Str, char delim, Arena *);
Str join(Slice(Str), char delim, Arena *);
Or using it with our model parser:
typedef struct {
    float x, y, z;
} Vec3;

typedef struct {
    int32_t v[3];
    int32_t n[3];
} Face;

typedef struct {
    Slice(Vec3) verts;
    Slice(Vec3) norms;
    Slice(Face) faces;
} Model;

typedef Slice(Vec3) Polygon;
I worried these macros might confuse my tools, particularly Universal Ctags, because it’s important to me. Everything handles prototypes better than expected, but ctags doesn’t see fields with slice types. Overall they’re like a very limited form of C++ templates, though only the types are parameterized, not the functions operating on those types. Outside of unwarranted macro abuse, this new technique does nothing regarding generic functions. On the other hand, my generic slice function complements the new technique, especially with the help of C23’s new typeof to mitigate _Alignof’s limitations:
typedef struct { char *beg, *end; } Arena;
void *alloc(Arena *, ptrdiff_t count, int size, int align);

#define push(a, s) \
    ((s)->len == (s)->cap \
        ? (s)->data = push_( \
            (a), \
            (s)->data, \
            &(s)->cap, \
            sizeof(*(s)->data), \
            _Alignof(typeof(*(s)->data)) \
          ), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

void *push_(Arena *a, void *data, ptrdiff_t *pcap, int size, int align)
{
    ptrdiff_t cap = *pcap;
    if (a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        memcpy(copy, data, cap*size);
        data = copy;
    }
    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size, align);
    *pcap = cap + extend;
    return data;
}
This exploits the fact that implementations adopting the new tag rule also have the upcoming C2y null pointer rule (note: also requires a cooperating libc). Putting it together, now I can write stuff like this:
Slice(int64_t) generate_primes(int64_t limit, Arena *a)
{
    Slice(int64_t) primes = {};
    if (limit > 2) {
        *push(a, &primes) = 2;
    }
    for (int64_t n = 3; n < limit; n += 2) {
        bool valid = true;
        for (ptrdiff_t i = 0; valid && i<primes.len; i++) {
            valid = n % primes.data[i];
        }
        if (valid) {
            *push(a, &primes) = n;
        }
    }
    return primes;
}
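To see the grow-in-place behavior in isolation, here is a simplified, host-runnable sketch of the same core logic as push_. The names and the alignment-free alloc are mine, not the article's:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Minimal bump arena; alignment is ignored for brevity.
struct Arena { char *beg, *end; };

static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size)
{
    assert(count <= (a->end - a->beg) / size);  // OOM policy
    void *r = a->beg;
    a->beg += count * size;
    return r;
}

// If the slice's backing memory sits at the top of the arena, extend it
// in place; otherwise copy it to the top first, then extend.
static void *grow(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap = *pcap;
    if (a->beg != (char *)data + cap * size) {
        void *copy = alloc(a, cap, size);
        if (cap) {  // guard unneeded once C2y defines memcpy(p, 0, 0)
            std::memcpy(copy, data, cap * size);
        }
        data = copy;
    }
    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size);
    *pcap = cap + extend;
    return data;
}
```

Pushing onto the most recently allocated slice never copies: the data pointer stays stable while capacity doubles.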
But it doesn’t take long to run into limitations. It makes little sense to define, say, a Map(K, V) without a generic function to manipulate it. This also doesn’t work:
typedef struct {
    Slice(Str) names;
    Slice(Slice(float)) edges;
} Graph;
Due to Slice##T in the macro, required to establish a unique tag for each element type. The parameter to the macro must be an identifier, so you have to build up to it (or define another macro), which sort of defeats the purpose, which was entirely about convenience.
typedef Slice(float) Edges;

typedef struct {
    Slice(Str) names;
    Slice(Edges) edges;
} Graph;
The benefits are small enough that perhaps it’s not worth the costs, but it’s been at least worth investigating. I’ve written a small demo of the technique if you’d like to see it in action, or to test the abilities of your local C implementation: demo.c
For the purposes of this discussion, the actual allocator isn’t important. It could be a simple arena allocator, or a more general purpose buddy allocator. It could even be garbage collected with Boehm GC. Though WebAssembly’s linear memory is a poor fit for such a conservative garbage collector. In a compact address space starting at zero, and which doesn’t include code, memory addresses will be small numbers, and less distinguishable from common integer values. There’s also the issue that the garbage collector cannot scan the Wasm stack, which is hidden from Wasm programs by design. Only the ABI stack is visible. So a garbage collector requires cooperation from the compiler — essentially as a distinct calling convention — to spill all heap pointers on the ABI stack before function calls. Wasm C and C++ toolchains do not yet support this in a practical capacity.
Exporting a static heap
Let’s start with the embedded case because it’s simpler, and reserve a dynamic memory region at link time. WebAssembly has just reached 8 years old, so it’s early, and as we keep discovering, Wasm tooling is still immature. wasm-ld doesn’t understand linker scripts, and there’s no stable, low-level assembly language on which to build, e.g. to reserve space, define symbols, etc. WAT is too high level and inflexible for this purpose, as we’ll soon see. So our only option is to brute force it in a high-level language:
char heap[16<<20]; // 16MiB
Plugging it into an arena allocator:
typedef struct {
    char *beg;
    char *end;
} Arena;

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    a.end = heap + sizeof(heap);
    return a;
}
Unfortunately heap isn’t generic memory, but a high-level variable with a specific, fixed type. That’s why it would have been nice to reserve the memory outside the high-level language. In practice this works fine so long as everything is aligned, but strictly speaking, allocating any type except char from this arena involves incompatible loads and stores on a char array. Clang doesn’t document any inline assembly interface for Wasm, but neither does Clang forbid it. That leaves just enough room to launder the pointer if you’re worried about this technicality:
Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    asm ("" : "+r"(a.beg));  // launder
    a.end = a.beg + sizeof(heap);
    return a;
}
The +r means a.beg is both input and output. The address of the heap goes into the black box, and as far as the compiler is concerned, some mystery address comes out which, critically, has no effective type. The assembly block is empty (""), so it’s just a no-op, and we know (wink wink) it’s really the same address. Because the heap was “used” by the black box, Clang won’t optimize the heap out of existence beneath us. Also note that a.end was derived from the laundered pointer.
Update: The next C standard will improve this situation, and so pointer laundering will no longer be necessary. A straight char array could be used as an arena.
This static variable technique works well only in an exported memory configuration, which is what wasm-ld uses by default. When a module exports its memory, it indicates how much linear memory it requires on start, and the Wasm runtime allocates and zero-initializes it at module initialization time. C and C++ toolchains depend on that runtime zeroing to initialize static and global variables, which are defined to be so initialized. Compilers generate code assuming these variables are zero-initialized. This same paradigm is used for .bss sections in hosted environments.
In an imported memory configuration, linear memory is uninitialized. The memory may be re-used from, say, a destroyed module without zeroing, and may contain arbitrary data. In that case, C and C++ toolchains must zero the memory explicitly. It could potentially be done with a memory.fill instruction in the start section, but LLVM does not support start sections. Instead it uses an active data segment — a chunk of data copied into linear memory by the Wasm runtime during initialization, before running the start function.
That is, when importing memory, LLVM actually stores all those zeros in the Wasm module so that the runtime can copy it into linear memory. Wasm has no built-in compression, so your Wasm module will be at least as large as your heap! Exporting or importing memory is determined at link-time, so at compile-time the compiler must assume the worst case. If you compile the example above, you get a 16MiB “object” file (in Wasm format):
$ clang --target=wasm32 -c -O example.c
$ du -h example.o
16.0M example.o
The WAT version of this file is 48MiB — clearly unsuitable as a low-level assembler. If linking with exported memory, wasm-ld discards all-zero active data segments. If using an imported memory configuration, it’s copied into the final image, producing a huge Wasm image, though a highly compressible one. As a rule, avoid importing memory when using an LLVM toolchain. Regardless, large heaps created this way will have a significant compile-time cost.
$ time echo 'char heap[256<<20];' | clang --target=wasm32 -c -xc -
real 0m0.334s
user 0m0.013s
sys 0m0.262s
(If only Clang had some sort of “noinit” variable attribute in order to allow heap to be uninitialized…)
Growing a dynamic heap
Wasm programs can grow linear memory using an sbrk-like memory.grow instruction. It operates in quantities of pages (64KiB), and returns the old memory size. Because memory starts at zero, the old memory size is also the base address of the new allocation. Clang provides access to this instruction via an undocumented built-in:
size_t __builtin_wasm_memory_grow(int, size_t);
The first parameter selects a memory because someday there might be more than one. From this built-in we can define sbrk:
void *sbrk(ptrdiff_t size)
{
    size_t npages = (size + 0xffffu) >> 16;  // round up
    size_t old = __builtin_wasm_memory_grow(0, npages);
    if (old == -1ul) {
        return 0;
    }
    return (void *)(old << 16);
}
To which Clang compiles (note the memory.grow):
(func $sbrk (param i32) (result i32)
  (select
    (i32.const 0)
    (i32.shl
      (local.tee 0
        (memory.grow
          (i32.shr_u
            (i32.add
              (local.get 0)
              (i32.const 65535))
            (i32.const 16))))
      (i32.const 16))
    (i32.eq
      (local.get 0)
      (i32.const -1))))
Applying that to create an arena like before:
Arena newarena(ptrdiff_t cap)
{
    Arena a = {0};
    a.beg = sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}
Now we can choose the size of the arena, and we can use this to create multiple arenas (e.g. permanent, scratch, etc.). We could even continue growing the last-created arena in-place when it’s full.
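Here is a host-runnable sketch of that in-place growth, with a hypothetical fake_sbrk (a static buffer bumped in page units) standing in for the memory.grow wrapper. When the arena was the most recent allocation and its capacity is a page multiple, new pages land exactly at its end, so only the end pointer moves:

```cpp
#include <cassert>
#include <cstddef>

struct Arena { char *beg, *end; };

// Host-side stand-in for the Wasm sbrk: bump a cursor through a static
// "linear memory" in 64KiB page units.
static char memory[1 << 20];
static ptrdiff_t used;

static void *fake_sbrk(ptrdiff_t size)
{
    ptrdiff_t npages = (size + 0xffff) >> 16;  // round up to pages
    if (used + (npages << 16) > (ptrdiff_t)sizeof(memory)) {
        return nullptr;  // out of linear memory
    }
    void *r = memory + used;
    used += npages << 16;
    return r;
}

static Arena newarena(ptrdiff_t cap)
{
    Arena a = {};
    a.beg = (char *)fake_sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}

// Grow the most recently created arena: assuming its capacity is a page
// multiple, the new pages begin exactly at its current end.
static bool growarena(Arena *a, ptrdiff_t more)
{
    char *p = (char *)fake_sbrk(more);
    if (!p) {
        return false;
    }
    assert(p == a->end);  // contiguous: nothing allocated since
    a->end = p + more;
    return true;
}
```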
If there were no memory.grow instruction, it could be implemented as a request through an imported function. The embedder using the Wasm runtime can grow the memory on the module’s behalf in the same manner. But as that documentation indicates, either way growing the memory comes with a downside in the most common Wasm runtimes, browsers: it “detaches” the memory from references, which complicates its use for the embedder. If a Wasm module may grow its memory at any time, the embedder must reacquire the memory handle after every call. It’s not difficult, but it’s easy to forget, and mistakes are likely to go unnoticed until later.
Importing a dynamic heap
There’s a middle ground where a Wasm module imports a dynamic-sized heap. That is, linear memory beyond the module’s base initialization. This might be the case, for instance, in a programming competition, where contestants submit Wasm modules which must complete a task using the supplied memory. In that case we don’t reserve a static heap, so we’re not facing the storing-zeros issue. However, how do we “find” the memory? Linear memory layout will look something like so:
0 <-- stack | data | heap --> ?
|-----------------------------|
This diagram reflects the more sensible wasm-ld --stack-first layout, where the ABI stack overflows off the bottom end of memory. The heap is just excess memory beyond the data. To find the upper bound, Wasm has a memory.size instruction to query the linear memory size, which again Clang provides as an undocumented built-in:
size_t __builtin_wasm_memory_size(int);
Like before, this returns the result as a number of 64KiB pages. That’s the high end. How do we find the low end? Similar to __stack_pointer, the linker creates a __heap_base constant, which is the address delineating data and heap in the diagram above. To use it, we need to declare it:
extern char __heap_base[];
Notice how it’s an array, not a pointer. It doesn’t hold an address, it is an address. In an ELF context this would be called an absolute symbol. That’s everything we need to find the bounds of the heap:
Arena getarena(void)
{
    Arena a = {0};
    a.beg = __heap_base;
    a.end = (char *)(__builtin_wasm_memory_size(0) << 16);
    return a;
}
Then we continue forward using whatever memory the embedder deigned to provide. Hopefully it’s enough!
Wasm is a specification defining an abstract stack machine with a Harvard architecture, and related formats. There are just four types: i32, i64, f32, and f64. It also has “linear” octet-addressable memory starting at zero, with no alignment restrictions on loads and stores. Address zero is a valid, writable address, which resurfaces some old-school, high-level-language challenges regarding null pointers. There are 32-bit and 64-bit flavors, though the latter remains experimental. That suits me: I appreciate smaller pointers on 64-bit hosts, and I wish I could opt into them more often (e.g. x32).
As browser tech goes, they chose an apt name: WebAssembly is to the web as JavaScript is to Java.
There are distinct components at play, and much of the online discussion doesn’t do a great job drawing lines between them:
-
Wasm module: A compiled and linked image — like ELF or PE — containing sections for code, types, globals, import table, export table, and so on. The export table lists the module’s entry points. It has an optional start section indicating which function initializes a loaded image. (In practice almost nobody actually uses the start section.) A Wasm module can only affect the outside world through imported functions. Wasm itself defines no external interfaces for Wasm programs, not even printing or logging.
-
Wasm runtime: Loads Wasm modules, linking import table entries into the module. Because Wasm modules include types, the runtime can type check this linkage at load time. With imports resolved, it executes the start function, if any, then executes zero or more of its entry points, which hopefully invoke import functions in such a way as to produce useful results, or perhaps simply return useful outputs.
-
Wasm compiler: Converts a high-level language to low-level Wasm. In order to do so, it requires some kind of Application Binary Interface (ABI) to map the high-level language concepts onto the machine. This typically introduces additional execution elements, and it’s important that we distinguish them from the abstract machine’s execution elements. Clang is the only compiler we’ll be discussing in this article, though there are many. During compilation the function indices are yet unknown and so references will need to be patched in by a linker.
-
Wasm linker: Settles the shape of the Wasm module and links up the functions emitted by the compiler. LLVM comes with
wasm-ld
, and it goes hand-in-hand with Clang as a compiler. -
Language runtime: Unless you’re hand-writing raw Wasm, your high-level language probably has a standard library with operating system interfaces. C standard library, POSIX interfaces, etc. This runtime likely maps onto some standardized set of imports, most likely the aforementioned WASI, which defines a set of POSIX-like functions that Wasm modules may import. Because I think we could do better, as usual around here, in this article we’re going to eschew the language runtime and code directly against raw WASI. You still have easy access to hash tables and dynamic arrays.
A combination of compiler-linker-runtime is conventionally called a
toolchain. However, because almost any Clang installation can target
Wasm out-of-the-box, and we’re skipping the language runtime, you can
compile any of the programs discussed in this article, including my game, with
nothing more than Clang (invoking wasm-ld
implicitly). If you have a
Wasm runtime, which includes your browser, you can run them, too! Though
this article will mostly focus on WASI, and you’ll need a WASI-capable
runtime to run those examples, which doesn’t include browsers (short of
implementing the API with JavaScript).
I wasn’t particularly happy with the Wasm runtimes I tried, so I cannot enthusiastically recommend one. I’d love it if I could point to one and say, “Use the same Clang to compile the runtime that you’re using to compile Wasm!” Alas, I had issues compiling, the runtime was buggy, or WASI was incomplete. However, wazero (Go) was the easiest for me to use and it worked well enough, so I will use it in examples:
$ go install github.com/tetratelabs/wazero/cmd/wazero@latest
The Wasm Binary Toolkit (WABT) is good to have on hand when working
with Wasm, particularly wasm2wat
to inspect Wasm modules, sort of like
objdump
or readelf
. It converts Wasm to the WebAssembly Text Format
(WAT).
Learning Wasm I had quite some difficulty finding information. Outside of
the Wasm specification, which, despite its length, is merely a narrow
slice of the ecosystem, important technical details are scattered all over
the place. Some are only available as source code, some are buried in
comments in GitHub issues, and some are lost behind dead links as
repositories have moved. Large parts of LLVM are undocumented beyond a
mention of existence. WASI
has no documentation in a web-friendly format — so I have nothing to link
from here when I mention its system calls — just some IDL sources in a Git
repository. An old wasi.h
was the most readable, complete
source of truth I could find.
Fortunately Wasm is old enough that LLMs are well-versed in it, and simply asking questions, or for usage examples, was more effective than searching online. If you’re stumped on how to achieve something in the Wasm ecosystem, try asking a state-of-the-art LLM for help.
Example programs
Let’s go over concrete examples to lay some foundations. Consider this simple C function:
float norm(float x, float y)
{
return x*x + y*y;
}
To compile to Wasm (32-bit) with Clang, we use the --target=wasm32
flag:
$ clang -c --target=wasm32 -O example.c
The object file example.o
is in Wasm format, so WABT can examine it.
Here’s the output of wasm2wat -f
, where -f
produces output in the
“folded” format, which is how I prefer to read it.
(module
(type (;0;) (func (param f32 f32) (result f32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $norm (type 0) (param f32 f32) (result f32)
(f32.add
(f32.mul
(local.get 0)
(local.get 0))
(f32.mul
(local.get 1)
(local.get 1)))))
We can see the ABI taking shape: Clang has predictably mapped
float
into f32
. It similarly maps char
, short
, int
and long
onto i32
. In 64-bit Wasm, the Clang ABI is LP64 and maps long
onto
i64
. There’s also a $norm
function which takes two f32
parameters
and returns an f32
.
Getting a little more complex:
__attribute((import_name("f")))
void f(int *);
__attribute((export_name("example")))
void example(int x)
{
f(&x);
}
The import_name
function attribute indicates the module will not define
it, even in another translation unit, and that it intends to import it.
That is, wasm-ld
will place it in the import table. The export_name
function attribute indicates it’s an entry point, and so wasm-ld
will
list it in the export table. Linking it will make things a little clearer:
$ clang --target=wasm32 -nostdlib -Wl,--no-entry -O example.c
The -nostdlib
is because we won’t be using a language runtime, and
--no-entry
to tell the linker not to implicitly export a function
(default: _start
) as an entry point. You might think this is connected
with the Wasm start function, but wasm-ld
does not support the start
section at all! We’ll have use for an entry point later. The folded WAT:
(module $a.out
(type (;0;) (func (param i32)))
(import "env" "f" (func $f (type 0)))
(func $example (type 0) (param i32)
(local i32)
(global.set $__stack_pointer
(local.tee 1
(i32.sub
(global.get $__stack_pointer)
(i32.const 16))))
(i32.store offset=12
(local.get 1)
(local.get 0))
(call $f
(i32.add
(local.get 1)
(i32.const 12)))
(global.set $__stack_pointer
(i32.add
(local.get 1)
(i32.const 16))))
(table (;0;) 1 1 funcref)
(memory (;0;) 2)
(global $__stack_pointer (mut i32) (i32.const 66560))
(export "memory" (memory 0))
(export "example" (func $example)))
There’s a lot to unfold:
-
Pointers were mapped onto
i32
. Pointers are a high-level concept, and linear memory is addressed by an integral offset. This is typical of assembly after all. -
There’s now a
__stack_pointer
, which is part of the Clang ABI, not Wasm. The Wasm abstract machine is a stack machine, but that stack doesn’t exist in linear memory. So you cannot take the address of values on the Wasm stack. There are lots of things C needs from a stack that Wasm doesn’t provide. So, in addition to the Wasm stack, Clang maintains another downward-growing stack in linear memory for these purposes, and the__stack_pointer
global is the stack register of its ABI. We can see it’s allocated something like 64kB for the stack. (It’s a little more because program data is placed below the stack.) -
It should be mostly readable without knowing Wasm: The function subtracts a 16-byte stack frame, stores a copy of the argument in it, then uses its memory offset for the first parameter to the import
f
. Why 16 bytes when it only needs 4? Because the stack is kept 16-byte aligned. Before returning, the function restores the stack pointer.
As mentioned earlier, address zero is valid as far as the Wasm runtime is concerned, though dereferences are still undefined in C. This makes it more difficult to catch bugs. Given a null pointer, this function would most likely read a zero at address zero and the program would keep running:
int get(int *p)
{
return *p;
}
In WAT:
(func $get (type 0) (param i32) (result i32)
(i32.load
(local.get 0)))
Since the “hardware” won’t fault for us, ask Clang to do it instead:
$ clang ... -fsanitize=undefined -fsanitize-trap ...
Now in WAT:
(module
(type (;0;) (func (param i32) (result i32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $get (type 0) (param i32) (result i32)
(block ;; label = @1
(block ;; label = @2
(br_if 0 (;@2;)
(i32.eqz
(local.get 0)))
(br_if 1 (;@1;)
(i32.eqz
(i32.and
(local.get 0)
(i32.const 3)))))
(unreachable))
(i32.load
(local.get 0))))
Given a null pointer, get
executes the unreachable
instruction,
causing the runtime to trap. In practice this is unrecoverable. Consider:
nothing will restore __stack_pointer
, and so the stack will “leak” the
existing frames. (This can be worked around by exporting __stack_pointer
and __stack_high
via the --export
linker flag, then restoring the
stack pointer in the runtime after traps.)
Wasm was extended with bulk memory operations, and so there are
single instructions for memset
and memmove
, which Clang maps onto the
built-ins:
void clear(void *buf, long len)
{
__builtin_memset(buf, 0, len);
}
(Below LLVM 20 you will need the undocumented -mbulk-memory
option.) In WAT we see this as memory.fill
:
(module
(type (;0;) (func (param i32 i32)))
(import "env" "__linear_memory" (memory (;0;) 0))
(func $clear (type 0) (param i32 i32)
(block ;; label = @1
(br_if 0 (;@1;)
(i32.eqz
(local.get 1)))
(memory.fill
(local.get 0)
(i32.const 0)
(local.get 1)))))
That’s great! I wish this worked so well outside of Wasm. It’s one reason
w64devkit has -lmemory
, after all. Similarly __builtin_trap()
maps
onto the unreachable
instruction, so we can reliably generate those as
well.
What about structures? They’re passed by address. Parameter structures go on the stack, then their address is passed. To return a structure, a function accepts an implicit out parameter in which to write the return. This isn’t unusual, except that it’s challenging to manage across module boundaries, i.e. in imports and exports, because caller and callee are in different address spaces. It’s especially tricky to return a structure from an export, as the caller must somehow allocate space in the callee’s address space for the result. The multi-value extension solves this, but using it in C involves an ABI change, which is still experimental.
Water Sort Game
Something you might not have expected: My water sort game imports no functions! It only exports three functions:
void game_init(i32 seed);
DrawList *game_render(i32 width, i32 height, i32 mousex, i32 mousey);
void game_update(i32 input, i32 mousex, i32 mousey, i64 now);
The game uses IMGUI-style rendering. The caller passes in the inputs, and the game returns a kind of display list telling it what to draw. In the SDL version these turn into SDL renderer calls. In the web version, these turn into canvas draws, and “mouse” inputs may be touch events. It plays and feels the same on both platforms. Simple!
I didn’t realize it at the time, but building the SDL version first was critical to my productivity. Debugging Wasm programs is really dang hard! Wasm tooling has yet to catch up with 1995, let alone 2025. Source-level debugging is still experimental and impractical. Developing applications on the Wasm platform is about as ergonomic as developing in MS-DOS. Instead, develop on a platform much better suited for it, then port your application to Wasm after you’ve got the issues worked out. The less Wasm-specific code you write, the better, even if it means writing more code overall. Treat it as you would some weird embedded target.
The game comes with 10,000 seeds. I generated ~200 million puzzles, sorted them by difficulty, and skimmed the top 10k most challenging. In the game they’re still sorted by increasing difficulty, so it gets harder as you make progress.
Wasm System Interface
WASI allows us to get a little more hands on. Let’s start with a Hello
World program. A WASI application exports a traditional _start
entry
point which returns nothing and takes no arguments. I’m also going to set
up some basic typedefs:
typedef unsigned char u8;
typedef signed int i32;
typedef signed long long i64;
typedef signed long iz;
void _start(void)
{
}
wasm-ld
will automatically export this function, so we don’t need an
export_name
attribute. This program successfully does nothing:
$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm && echo ok
ok
To write output WASI defines fd_write()
:
typedef struct {
u8 *buf;
iz len;
} IoVec;
#define WASI(s) __attribute((import_module("wasi_unstable"),import_name(s)))
WASI("fd_write") i32 fd_write(i32, IoVec *, iz, iz *);
Technically those iz
variables are supposed to be size_t
, passed
through Wasm as i32
, but this is a foreign function, I know the ABI, and
so I can do as I please. I absolutely love that WASI barely uses
null-terminated strings, not even for paths, which is a breath of fresh
air, but they still marred the API with unsigned sizes. Which I
choose to ignore.
This function is shaped like POSIX writev()
. I’ve also set it
up for import, including a module name. The oldest, most stable version of
WASI is called wasi_unstable
. (I suppose it shouldn’t be surprising that
finding information in this ecosystem is difficult.)
Every returning WASI function returns an errno
value, with zero as
success rather than some kind of in-band signaling. Hence the
final out parameter unlike POSIX writev()
.
Armed with this function, let’s use it:
void _start(void)
{
u8 msg[] = "hello world\n";
IoVec iov = {msg, sizeof(msg)-1};
iz len = 0;
fd_write(1, &iov, 1, &len);
}
Then:
$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm
hello world
Keep going and you’ll have something like printf
before long. If
the write fails, we should probably communicate the error with at least
the exit status. Because _start
doesn’t return a status, we need to
exit, for which we have proc_exit
. It doesn’t return, so no errno
return value.
WASI("proc_exit") void proc_exit(i32);
void _start(void)
{
// ...
i32 err = fd_write(1, &iov, 1, &len);
proc_exit(!!err);
}
To get the command line arguments, call args_sizes_get
to get the size,
allocate some memory, then args_get
to read the arguments. Same goes for
the environment with a similar pair of functions. The sizes do not include
a null pointer terminator, which is sensible.
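Pieced together, argument loading might look something like the following sketch. The prototypes are my own reading of the WASI interface definitions (taking the same iz liberties as fd_write above), so treat them as assumptions rather than gospel:

```c
WASI("args_sizes_get") i32 args_sizes_get(iz *, iz *);
WASI("args_get")       i32 args_get(u8 **, u8 *);

void _start(void)
{
    iz argc   = 0;
    iz buflen = 0;
    if (args_sizes_get(&argc, &buflen)) {
        proc_exit(1);
    }

    // Allocate however you like; static buffers keep the sketch simple.
    static u8 *argv[256];
    static u8  buf[1<<12];
    if (argc > 256 || buflen > 1<<12) {
        proc_exit(1);  // sketch: punt on oversized inputs
    }
    args_get(argv, buf);  // each argv[i] points into buf

    // ... use argc and argv as usual ...
}
```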
Now that you know how to find and use these functions, you don’t need me to go through each one. However, opening files is a special, complicated case:
WASI("path_open") i32 path_open(i32,i32,u8*,iz,i32,i64,i64,i32,i32*);
That’s 9 parameters — and I had thought Win32 CreateFileW
was
over the top. It’s even more complex than it looks. It works more like
POSIX openat()
, except there’s no current working directory
and so no AT_FDCWD
. Every file and directory is opened relative to
another directory, and absolute paths are invalid. If there’s no
AT_FDCWD
, how does one open the first directory? That’s called a
preopen and it’s core to the file system security mechanism of WASI.
The Wasm runtime preopens zero or more directories before starting the
program and assigns them the lowest numbered file descriptors starting at
file descriptor 3 (after standard input, output, and error). A program
intending to use path_open
must first traverse the file descriptors,
probing for preopens with fd_prestat_get
and retrieving their path name
with fd_prestat_dir_name
. This name may or may not map back onto a real
system path, and so this is a kind of virtual file system for the Wasm
module. The probe stops on the first error.
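Here’s a sketch of that probe. The prototypes, and the prestat layout as a tag word followed by a name length, are my reading of the WASI definitions, so double-check them against your wasi.h:

```c
WASI("fd_prestat_get")      i32 fd_prestat_get(i32, i32 *);
WASI("fd_prestat_dir_name") i32 fd_prestat_dir_name(i32, u8 *, iz);

void findpreopens(void)
{
    for (i32 fd = 3;; fd++) {
        i32 prestat[2] = {0};  // [0] tag, [1] directory name length
        if (fd_prestat_get(fd, prestat)) {
            break;  // the probe stops on the first error
        }
        static u8 path[1<<10];
        iz len = prestat[1];
        if (len <= (iz)sizeof(path) && !fd_prestat_dir_name(fd, path, len)) {
            // ... remember (fd, path, len) for later path_open calls ...
        }
    }
}
```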
To open an absolute path, it must find a matching preopen, then from it construct a path relative to that directory. This part I much dislike, as the module must contain complex path parsing functionality even in the simple case. Opening files is the most complex piece of the whole API.
I mentioned before that program data is below the Clang stack. With the stack growing down, this sounds like a bad idea. A stack overflow quietly clobbers your data, and is difficult to recognize. More sensible to put the stack at the bottom so that it overflows off the bottom of memory and causes a fast fault. Fortunately there’s a switch for that:
$ clang --target=wasm32 ... -Wl,--stack-first ...
This is what you want by default. The actual default layout is left over
from an early design flaw in wasm-ld
, and it’s an oversight that it has
not yet been corrected.
u-config
The above is in action in the u-config Wasm port. You can download the Wasm module, pkg-config.wasm, used in the web demo to run it in your favorite WASI-capable Wasm runtime:
$ wazero run pkg-config.wasm --modversion pkg-config
0.33.3
Though there are no preopens, so it cannot read any files. The -mount
option maps real file system paths to preopens. This mounts the entire
root file system read-only (ro
) as /
.
$ wazero run -mount /::ro pkg-config.wasm --cflags sdl2
-I/usr/include/SDL2 -D_REENTRANT
I doubt this is useful for anything, but it was a vehicle for learning and trying Wasm, and the results are pretty neat.
In the next article I discuss allocating the allocator.
The original program uses ARM64 assembly. I’m a lot more comfortable with x86-64 assembly, plus that’s the hardware I have readily on hand. So the assembly language will be different, but all the concepts apply to both these architectures. Almost none of these OpenBSD system interfaces are formally documented (or stable for that matter), and I had to dig around the OpenBSD source tree to figure it out (along with a helpful jart nudge). So don’t be afraid to get your hands dirty.
There are lots of subtle problems in the original demo, so let’s go through the program piece by piece, starting with the entry point:
void
start()
{
w("hello\n", 6);
x();
}
This function is registered as the entry point in the ELF image, so it has
no caller. That means no return address on the stack, so the stack is
not aligned for a function.(Correction: The stack alignment issue is
true for x86, but not ARM, so the original demo is fine.) In toy programs
that goes unnoticed, but compilers generate code assuming the stack is
aligned. In a real application this is likely to crash deep on the first
SIMD register spill.
We could fix this with a force_align_arg_pointer
attribute, at
least for architectures that support it, but I prefer to write the entry
point in assembly. Especially so we can access the command line arguments
and environment variables, which is necessary in a real application. That
happens to work the same as it does on Linux, so here’s my old, familiar
entry point:
asm (
" .globl _start\n"
"_start: mov %rsp, %rdi\n"
" call start\n"
);
Per the ABI, the first argument passes through rdi
, so I pass a copy of
the stack pointer, rsp
, as it appeared on entry. Entry point arguments
argc
, argv
, and envp
are all pushed on the stack at rsp
, so the
first real function can retrieve it all from just the stack pointer. The
original demo won’t use it, though. Using call
to pass control pushes a
return address, which will never be used, and aligns the stack for the
first real function. I name it _start
because that’s what the linker
expects, so things will go a little smoother. It’s rather convenient
that the original didn’t use this name.
Next up, the “write” function:
int
w(void *what, size_t len) {
__asm(
" mov x2, x1;"
" mov x1, x0;"
" mov w0, #1;"
" mov x8, #4;"
" svc #0;"
);
return 0;
}
There are two serious problems with this assembly block. First, the function arguments are not necessarily in those registers by the time control reaches the basic assembly block. The function prologue could move them around. Even more so if this function was inlined. This is exactly the problem extended inline assembly is intended to solve. Second, it clobbers a number of registers. Compilers assume this does not happen when generating their own code. This sort of assembly falls apart the moment it comes into contact with a non-zero optimization level.
Solving this is just a matter of using inline assembly properly:
long w(void *what, long len)
{
char err;
long rax = 4; // SYS_write
asm volatile (
"syscall"
: "+a"(rax), "+d"(len), "=@ccc"(err)
: "D"(1), "S"(what)
: "rcx", "r11", "memory"
);
return err ? -rax : rax;
}
I’ve enhanced it a bit, returning a Linux-style negative errno on
error. In the BSD ecosystem, syscall errors are indicated using the carry
flag, which here is output into err
via =@ccc
. When set, the return
value is an errno. Further, the OpenBSD kernel uses both rax
and rdx
for return values, so I’ve also listed rdx
as an input+output even though I don’t consume the result. Despite all
these changes, this function is not
yet complete! We’ll get back to it later.
The “exit” function, x
, is just fine:
void
x() {
__asm(
" mov x8, #1;"
" svc #0;"
);
}
It doesn’t set an exit status, so it passes garbage instead, but otherwise this works. No inputs, plus clobbers and outputs don’t matter when control never returns. In a real application I might write it:
__attribute((noreturn))
void x(int status)
{
asm volatile ("syscall" :: "a"(1), "D"(status));
__builtin_unreachable();
}
This function will need a little additional work later, too.
The ident
section is basically fine as-is:
__asm(" .section \".note.openbsd.ident\", \"a\"\n"
" .p2align 2\n"
" .long 8\n"
" .long 4\n"
" .long 1\n"
" .ascii \"OpenBSD\\0\"\n"
" .long 0\n"
" .previous\n");
The compiler assumes the current section remains the same at the end of
the assembly block, which here is accomplished with .previous
. Though it
clobbers the assembler’s remembered “other” section and so may interfere
with surrounding code using .previous
. Better to use .pushsection
and
.popsection
for good stack discipline. There are many such examples in
the OpenBSD source tree.
asm (
".pushsection .note.openbsd.ident, \"a\"\n"
".long 8, 4, 1, 0x6e65704f, 0x00445342, 0\n"
".popsection\n"
);
Now the trickiest part, the pinsyscall table:
struct whats {
unsigned int offset;
unsigned int sysno;
} happening[] __attribute__((section(".openbsd.syscalls"))) = {
{ 0x104f4, 4 },
{ 0x10530, 1 },
};
Those offsets — offsets from the beginning of the ELF image — were entered manually, and it kind of ruins the whole demo. We don’t have a good way to get at those offsets from C, or any high level language. However, we can solve that by tweaking the inline assembly with some labels:
__attribute((noinline))
long w(void *what, long len)
{
// ...
asm volatile (
"_w: syscall"
// ...
);
// ...
}
__attribute((noinline,noreturn))
void x(int status)
{
asm volatile (
"_x: syscall"
// ...
);
// ...
}
Very importantly I’ve added noinline
to prevent these functions from
being inlined into additional copies of the syscall
instruction, which
of course won’t be registered. This also prevents duplicate labels causing
assembler errors. Once we have the labels, we can use them in an assembly
block listing the allowed syscall instructions:
asm (
".pushsection .openbsd.syscalls\n"
".long _x, 1\n"
".long _w, 4\n"
".popsection\n"
);
That lets the linker solve the offsets problem, which is its main job after all. With these changes the demo works reliably, even under high optimization levels. I suggest these flags:
$ cc -static -nostdlib -no-pie -o where where.c
Disabling PIE with -no-pie
is necessary in real applications or else
strings won’t work. You can apply more flags to strip it down further, but
these are the flags generally necessary to compile these sorts of programs
on at least OpenBSD 7.6.
So, how do I know this stuff works in general? Because I ported my ultra
portable pkg-config clone, u-config, to use raw OpenBSD syscalls:
openbsd_main.c
. Everything still works at high optimization
levels.
$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c libmemory.a
$ ./pkg-config --cflags --libs libcurl
-I/usr/local/include -L/usr/local/lib -lcurl
Because the new syscall wrappers behave just like Linux system calls, it
leverages the linux_noarch.c
platform, and the whole port is ~70 lines
of code. A few more flags (-fno-stack-protector
, -Oz
, -s
, etc.), and
it squeezes into a slim 21.6K static binary.
Despite making no libc calls, it’s not possible to stop compilers from
fabricating (hallucinating?) string function calls, so the build
above depends on external definitions. In the command above, libmemory.a
comes from libmemory.c
found in w64devkit. Alternatively,
and on topic, you could link the OpenBSD libc string functions by
omitting libmemory.a
from the build.
$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c
Though it pulls in a lot of bloat (~8x size increase), and teasing out the necessary objects isn’t trivial.
If you’d like to see the ready-to-run full source: objrender.c
.
All images are screenshots of this program.
First let’s establish the requirements. By robust I mean no undefined behavior for any input, valid or invalid; no out of bounds accesses, no signed overflows. Input is otherwise not validated. Invalid input may load as valid by chance, which will render as either garbage or nothing. The behavior will also not vary by locale.
We’re also only worried about vertices, normals, and triangle faces with
normals. In OBJ these are v
, vn
, and f
elements. Normals let us
light the model effectively while checking our work. A cube fitting this
subset of OBJ might look like:
v -1.00 -1.00 -1.00
v -1.00 +1.00 -1.00
v +1.00 +1.00 -1.00
v +1.00 -1.00 -1.00
v -1.00 -1.00 +1.00
v -1.00 +1.00 +1.00
v +1.00 +1.00 +1.00
v +1.00 -1.00 +1.00
vn +1.00 0.00 0.00
vn -1.00 0.00 0.00
vn 0.00 +1.00 0.00
vn 0.00 -1.00 0.00
vn 0.00 0.00 +1.00
vn 0.00 0.00 -1.00
f 3//1 7//1 8//1
f 3//1 8//1 4//1
f 1//2 5//2 6//2
f 1//2 6//2 2//2
f 7//3 3//3 2//3
f 7//3 2//3 6//3
f 4//4 8//4 5//4
f 4//4 5//4 1//4
f 8//5 7//5 6//5
f 8//5 6//5 5//5
f 3//6 4//6 1//6
f 3//6 1//6 2//6
Take note:
- Some fields are separated by more than one space.
- Vertices and normals are fractional (floating point).
- Faces use 1-indexing instead of 0-indexing.
- Faces in this model lack a texture index, hence
//
(empty).
Inputs may have other data, but we’ll skip over it, including face texture indices, or face elements beyond the third. Some of the models I’d like to test have relative indices, so I want to support those, too. A relative index refers backwards from the last vertex, so the order of the lines in an OBJ matter. For example, the cube faces above could have instead been written:
f -6//-6 -2//-6 -1//-6
f -6//-6 -1//-6 -5//-6
f -8//-5 -4//-5 -3//-5
f -8//-5 -3//-5 -7//-5
f -2//-4 -6//-4 -7//-4
f -2//-4 -7//-4 -3//-4
f -5//-3 -1//-3 -4//-3
f -5//-3 -4//-3 -8//-3
f -1//-2 -2//-2 -3//-2
f -1//-2 -3//-2 -4//-2
f -6//-1 -5//-1 -8//-1
f -6//-1 -8//-1 -7//-1
Due to this the parser cannot be blind to line order, and it must handle negative indices. Relative indexing has the nice effect that we can group faces, and those groups are relocatable. We can reorder them without renumbering the faces, or concatenate models just by concatenating their OBJ files.
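Resolving either form down to a 0-based index is then simple arithmetic given the running element count. A small helper like the following (the name and shape are mine, not from any spec) covers both cases; the caller must still bounds-check the result against the count:

```c
#include <stdint.h>

// Convert an OBJ index (1-based absolute, or negative relative) into a
// 0-based index, given the number of elements parsed so far. Yields a
// negative value for some invalid inputs, e.g. an index of zero, but the
// caller must still bounds-check against count.
int32_t resolve(int32_t idx, int32_t count)
{
    return idx < 0 ? count + idx : idx - 1;
}
```

With the eight cube vertices above, the absolute index 3 and the relative index -6 both resolve to the same vertex.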
The fundamentals
To start off, we’ll be using an arena of course, trivializing memory management while sweeping aside all hard-coded limits. A quick reminder of the interface:
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))
typedef struct {
char *beg;
char *end;
} Arena;
// Always returns an aligned pointer inside the arena. Allocations are
// zeroed. Does not return on OOM (never returns a null pointer).
void *alloc(Arena *, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align);
Also, no null-terminated strings, perhaps the main source of problems with bespoke parsers.
#define S(s) (Str){s, sizeof(s)-1}
typedef struct {
char *data;
ptrdiff_t len;
} Str;
Pointer arithmetic is error prone, so the tricky stuff is relegated to a handful of functions, each of which can be exhaustively validated almost at a glance:
Str span(char *beg, char *end)
{
Str r = {0};
r.data = beg;
r.len = beg ? end-beg : 0;
return r;
}
_Bool equals(Str a, Str b)
{
return a.len==b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}
Str trimleft(Str s)
{
for (; s.len && *s.data<=' '; s.data++, s.len--) {}
return s;
}
Str trimright(Str s)
{
for (; s.len && s.data[s.len-1]<=' '; s.len--) {}
return s;
}
Str substring(Str s, ptrdiff_t i)
{
if (i) {
s.data += i;
s.len -= i;
}
return s;
}
Each avoids the purposeless special cases around null pointers (i.e.
zero-initialized Str
objects) that would otherwise work out naturally.
The space character and all control characters are treated as whitespace
for simplicity. When I started writing this parser, I didn’t define all
these functions up front. I defined them as needed. (A good standard
library would have provided similar definitions out-of-the-box.) If
you’re worried about misuse, add the appropriate assertions.
A powerful and useful string function I’ve discovered, and which I use in
every string-heavy program, is cut
, a concept I shamelessly stole from
the Go standard library:
typedef struct {
Str head;
Str tail;
_Bool ok;
} Cut;
Cut cut(Str s, char c)
{
Cut r = {0};
if (!s.len) return r; // null pointer special case
char *beg = s.data;
char *end = s.data + s.len;
char *cut = beg;
for (; cut<end && *cut!=c; cut++) {}
r.ok = cut < end;
r.head = span(beg, cut);
r.tail = span(cut+r.ok, end);
return r;
}
It slices, it dices, it juliennes! Need to iterate over lines? Cut it up:
Cut c = {0};
c.tail = input;
while (c.tail.len) {
c = cut(c.tail, '\n');
Str line = c.head;
// ... process line ...
}
Need to iterate over the fields in a line? Cut the line on the field
separator. Then cut the field on the element separator. No allocation, no
mutation (strtok
).
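For instance, pulling the vertex and normal indices out of a single face field like 3//1 is two cuts. A standalone sketch, repeating the definitions above so it compiles on its own (the splitfield helper is mine):

```c
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;
#define S(s) (Str){s, sizeof(s)-1}

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

static Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len = beg ? end-beg : 0;
    return r;
}

static Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

// Split a face field "v/vt/vn" into its three components. For "3//1"
// the texture component comes out empty.
static void splitfield(Str field, Str *v, Str *vt, Str *vn)
{
    Cut c = cut(field, '/');
    *v = c.head;
    c = cut(c.tail, '/');
    *vt = c.head;
    *vn = c.tail;
}
```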
Reading input
Unlike a program designed to process arbitrarily large inputs, the
intention here is to load the entire model into memory. We don’t need to
fiddle around with loading a line of input at a time (fgets, getline,
etc.) — the usual approach with OBJ parsers. If the OBJ source cannot fit
in memory, then the model won’t fit in memory. This greatly simplifies the
parser, makes it faster, and lifts hard-coded limits like maximum
line length.
The simple arena I use makes whole-file loading so easy. Read straight
into the arena without checking the file size (ftell
, etc.), which means
streaming inputs (i.e. pipes) work automatically.
Str loadfile(Arena *a, FILE *f)
{
Str r = {0};
r.data = a->beg;
r.len = a->end - a->beg;
r.len = fread(r.data, 1, r.len, f);
return r;
}
Without buffered input, you may need a loop around the read:
Str loadfile(Arena *a, int fd)
{
Str r = {0};
r.data = a->beg;
ptrdiff_t cap = a->end - a->beg;
for (;;) {
ptrdiff_t n = read(fd, r.data+r.len, cap-r.len);
if (n < 1) {
return r; // ignoring read errors
}
r.len += n;
}
}
You might consider triggering an out-of-memory error if the arena was filled to the brim, which almost certainly means the input was truncated. Though that’s likely to happen anyway because the next allocation from that arena will fail.
Side note: When using a multi-GB arena, issuing such huge read requests
stress tests the underlying IO system. I’ve found libc bugs this way. In
this case I used SDL2 for the demo, and SDL lost the ability to
read files after I increased the arena size to 4GB in order to test a
gigantic model (“Power Plant”). I’ve run into this before, and
I assumed it was another Microsoft CRT bug. After investigating deeper for
this article, I learned it’s an ancient SDL bug that’s made it all the way
into SDL3. -Wconversion
warns about it, but was accidentally squelched
in the 64-bit port back in 2009. It seems nobody else loads files
this way, so watch out for platform bugs if you use this technique!
Parsing data
In practice, rendering systems limit counts to the 32-bit range, which is
reasonable. So in the OBJ parser, vertex and normal indices will be 32-bit
integers. Negatives will be needed for at least relative indexing. Parsing
from a Str means functions that require null-terminated input, like
strtol, are off limits.
So here’s a function to parse a signed integer out of a Str
:
int32_t parseint(Str s)
{
uint32_t r = 0;
int32_t sign = 1;
for (ptrdiff_t i = 0; i < s.len; i++) {
switch (s.data[i]) {
case '+': break;
case '-': sign = -1; break;
default : r = 10*r + s.data[i] - '0';
}
}
return r * sign;
}
The uint32_t
means it’s free to overflow. If it overflows, the input was
invalid. If it doesn’t hold an integer, the input was invalid. In either
case it produces a harmless, garbage result. Despite being unsigned, it
works just fine with negative inputs thanks to two’s complement.
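To see the unsigned trick in action, here is a self-contained check — the Str type and S() macro from earlier in the article, plus a copy of parseint as written above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { char *data; ptrdiff_t len; } Str;

#define S(s) (Str){s, sizeof(s)-1}

// Copy of parseint from above: the unsigned accumulator means overflow
// on invalid input wraps harmlessly instead of being undefined behavior,
// and two's complement makes negative results come out right anyway.
static int32_t parseint(Str s)
{
    uint32_t r = 0;
    int32_t sign = 1;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+': break;
        case '-': sign = -1; break;
        default : r = 10*r + s.data[i] - '0';
        }
    }
    return r * sign;
}
```

For "-123" the digits accumulate to 123 in the unsigned accumulator, and the multiply by -1 wraps modulo 2^32 to exactly the bit pattern of -123.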
For floats I didn’t intend to parse exponential notation, but some models I wanted to test actually did use it — probably by accident — so I added it anyway. That requires a function to compute the exponent.
float expt10(int32_t e)
{
float y = 1.0f;
float x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
int32_t n = e<0 ? e : -e;
for (; n < -1; n /= 2) {
y *= n%2 ? x : 1.0f;
x *= x;
}
return x * y;
}
That’s exponentiation by squaring, avoiding signed overflow on the
exponent. Traditionally a negative exponent is inverted, but applying
unary -
to an arbitrary integer might overflow (consider -2147483648).
So instead I iterate from the negative end. The negative range is larger
than the positive, after all. Finally we can parse floats:
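A quick check of that claim — a copy of expt10 as written above, exercised at the pathological exponent. INT32_MIN is never negated, so there is no signed overflow; the result merely underflows to zero:

```c
#include <assert.h>
#include <stdint.h>

// Copy of expt10 from above: iterate from the negative end so the
// exponent never needs negating, which would overflow for INT32_MIN.
static float expt10(int32_t e)
{
    float y = 1.0f;
    float x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
    int32_t n = e<0 ? e : -e;
    for (; n < -1; n /= 2) {
        y *= n%2 ? x : 1.0f;  // n%2 is -1 or 0 for negative n
        x *= x;
    }
    return x * y;
}
```

Even expt10(INT32_MIN) runs to completion: n halves through even values down to -1, and x squares its way below the smallest denormal float, leaving a clean 0.0f.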
float parsefloat(Str s)
{
float r = 0.0f;
float sign = 1.0f;
float exp = 0.0f;
for (ptrdiff_t i = 0; i < s.len; i++) {
switch (s.data[i]) {
case '+': break;
case '-': sign = -1; break;
case '.': exp = 1; break;
case 'E':
case 'e': exp = exp ? exp : 1.0f;
exp *= expt10(parseint(substring(s, i+1)));
i = s.len;
break;
default : r = 10.0f*r + (s.data[i] - '0');
exp *= 0.1f;
}
}
return sign * r * (exp ? exp : 1.0f);
}
Probably not as precise as strtof
, but good enough for loading a model.
It’s also ~30% faster for this purpose than my system’s strtof
. If it
hits an exponent, it combines parseint
and expt10
to augment the
result so far. At least for all the models I tried, the exponent only
appeared for tiny values. They round to zero with no visible effects, so
you can cut the implementation by more than half in one fell swoop if you
wish (no more expt10
nor substring
either):
switch (s.data[i]) {
// ...
case 'E':
case 'e': return 0; // probably small *shrug*
// ...
}
Why not strtof
? That has the rather annoying requirement that input is
null terminated, which is not the case here. Worse, it’s affected by the
locale and doesn’t behave consistently or reliably.
A vertex is three floats separated by whitespace. So combine cut
and
parsefloat
to parse one.
typedef struct {
float v[3];
} Vert;
Vert parsevert(Str s)
{
Vert r = {0};
Cut c = cut(trimleft(s), ' ');
r.v[0] = parsefloat(c.head);
c = cut(trimleft(c.tail), ' ');
r.v[1] = parsefloat(c.head);
c = cut(trimleft(c.tail), ' ');
r.v[2] = parsefloat(c.head);
return r;
}
cut
parses a field between every space, including empty fields between
adjacent spaces, so trimleft
discards extra space before cutting. If the
line ends early, this passes empty strings into parsefloat
which come
out as zeros. No special checks required for invalid input.
A face is a set of three vertex indices and three normal indices, and it parses almost the same way as a vertex. Relative indices are immediately converted to absolute indices using the number of vertices/normals so far.
typedef struct {
int32_t v[3];
int32_t n[3];
} Face;
static Face parseface(Str s, ptrdiff_t nverts, ptrdiff_t nnorms)
{
Face r = {0};
Cut fields = {0};
fields.tail = s;
for (int i = 0; i < 3; i++) {
fields = cut(trimleft(fields.tail), ' ');
Cut elem = cut(fields.head, '/');
r.v[i] = parseint(elem.head);
elem = cut(elem.tail, '/'); // skip texture
elem = cut(elem.tail, '/');
r.n[i] = parseint(elem.head);
// Process relative subscripts
if (r.v[i] < 0) {
r.v[i] = (int32_t)(r.v[i] + 1 + nverts);
}
if (r.n[i] < 0) {
r.n[i] = (int32_t)(r.n[i] + 1 + nnorms);
}
}
return r;
}
Since nverts
must be non-negative, and a relative index is negative by
definition, adding them together can never overflow. If there are too many
vertices, the result might be truncated, as indicated by the cast. That’s
fine. Just invalid input.
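To isolate that argument, here is a hypothetical helper (not part of the parser above) performing just the conversion. For a negative idx, idx + 1 cannot overflow, the remaining addition happens in ptrdiff_t, and only the final narrowing cast can lose information:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: convert a relative (negative) OBJ index into an
// absolute 1-based index, given the current vertex/normal count n.
// idx + 1 stays in range because idx < 0, the sum with n happens in
// ptrdiff_t, and the cast may truncate only on invalid input.
static int32_t absindex(int32_t idx, ptrdiff_t n)
{
    return (int32_t)(idx + 1 + n);
}
```

In OBJ, index -1 refers to the most recently defined vertex, so with five vertices it maps to absolute index 5.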
There’s an interesting interview question here: Consider this alternative
to the above, maintaining the explicit cast to dismiss the -Wconversion
warning.
r.v[i] += (int32_t)(1 + nverts);
Is it equivalent? Can this overflow? (Answers: No and yes.) If yes, under what conditions? Unfortunately a fuzz test would never hit it.
Putting it together
For this case, a model is three arrays of vertices, normals, and indices.
While faces only support 32-bit indexing, I use ptrdiff_t
in order to
skip overflow checks. There cannot possibly be more vertices than bytes of
source, so these counts cannot overflow.
typedef struct {
Vert *verts;
ptrdiff_t nverts;
Vert *norms;
ptrdiff_t nnorms;
Face *faces;
ptrdiff_t nfaces;
} Model;
Model parseobj(Arena *, Str);
They’d probably look a little nicer as dynamic arrays, but we won’t need that machinery. That’s because the parser makes two passes over the OBJ source, the first time to count:
Model m = {0};
Cut lines = {0};
lines.tail = obj;
while (lines.tail.len) {
lines = cut(lines.tail, '\n');
Cut fields = cut(trimright(lines.head), ' ');
Str kind = fields.head;
if (equals(S("v"), kind)) {
m.nverts++;
} else if (equals(S("vn"), kind)) {
m.nnorms++;
} else if (equals(S("f"), kind)) {
m.nfaces++;
}
}
It’s a lightweight pass, skipping over the numeric data. With that information collected, we can allocate the model:
m.verts = new(a, m.nverts, Vert);
m.norms = new(a, m.nnorms, Vert);
m.faces = new(a, m.nfaces, Face);
m.nverts = m.nnorms = m.nfaces = 0;
On the next pass we call parsevert
and parseface
to fill it out.
lines.tail = obj;
while (lines.tail.len) {
lines = cut(lines.tail, '\n');
Cut fields = cut(trimright(lines.head), ' ');
Str kind = fields.head;
if (equals(S("v"), kind)) {
m.verts[m.nverts++] = parsevert(fields.tail);
} else if (equals(S("vn"), kind)) {
m.norms[m.nnorms++] = parsevert(fields.tail);
} else if (equals(S("f"), kind)) {
m.faces[m.nfaces++] = parseface(fields.tail, m.nverts, m.nnorms);
}
}
At this point the model is parsed, though it’s not necessarily consistent. Face indices may still be out of range. The next step is to transform it into a more useful representation.
Transformation
Rendering the model is the easiest way to verify it came out alright, and
it’s generally useful for debugging problems. Because it basically does
all the hard work for us, and doesn’t require ridiculous contortions to
access, I’m going to render with old school OpenGL 1.1. It provides a
glInterleavedArrays
function with a bunch of predefined formats.
The one that interests me is GL_N3F_V3F
, where each vertex is a normal
and a position. Each face is three such elements. I came up with this:
typedef struct { // GL_N3F_V3F
Vert n, v;
} N3FV3F[3];
typedef struct {
N3FV3F *data;
ptrdiff_t len;
} N3FV3Fs;
// Transform a model into a GL_N3F_V3F representation.
N3FV3Fs n3fv3fize(Arena *, Model);
If you’re being precise you’d use GLfloat
, but this is good enough for
me. By using a different arena for this step, we can discard the OBJ data
once it’s in the “local” format. For example:
Arena perm = {...};
Arena scratch = {...};
N3FV3Fs *scene = new(&perm, nmodels, N3FV3Fs);
for (int i = 0; i < nmodels; i++) {
Arena temp = scratch; // free OBJ at end of iteration
Str obj = loadfile(&temp, path[i]);
Model model = parseobj(&temp, obj);
scene[i] = n3fv3fize(&perm, model);
}
The conversion allocates the GL_N3F_V3F
array, discards invalid faces,
and copies the valid faces into the array:
N3FV3Fs n3fv3fize(Arena *a, Model m)
{
N3FV3Fs r = {0};
r.data = new(a, m.nfaces, N3FV3F);
for (ptrdiff_t f = 0; f < m.nfaces; f++) {
_Bool valid = 1;
for (int i = 0; i < 3; i++) {
valid &= m.faces[f].v[i]>0 && m.faces[f].v[i]<=m.nverts;
valid &= m.faces[f].n[i]>0 && m.faces[f].n[i]<=m.nnorms;
}
if (valid) {
ptrdiff_t t = r.len++;
for (int i = 0; i < 3; i++) {
r.data[t][i].n = m.norms[m.faces[f].n[i]-1];
r.data[t][i].v = m.verts[m.faces[f].v[i]-1];
}
}
}
return r;
}
Here’s what that looks like in OpenGL with suzanne.obj
and
bmw.obj
:
This was a fun little project, and perhaps you learned a new technique or two after checking it out.