CARVIEW |
Select Language
HTTP/2 200
date: Tue, 29 Jul 2025 18:25:12 GMT
content-type: text/html; charset=utf-8
cache-control: max-age=0, private, must-revalidate
content-security-policy: default-src 'none'; base-uri 'self'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com *.rel.tunnels.api.visualstudio.com wss://*.rel.tunnels.api.visualstudio.com objects-origin.githubusercontent.com copilot-proxy.githubusercontent.com proxy.individual.githubcopilot.com proxy.business.githubcopilot.com proxy.enterprise.githubcopilot.com *.actions.githubusercontent.com wss://*.actions.githubusercontent.com productionresultssa0.blob.core.windows.net/ productionresultssa1.blob.core.windows.net/ productionresultssa2.blob.core.windows.net/ productionresultssa3.blob.core.windows.net/ productionresultssa4.blob.core.windows.net/ productionresultssa5.blob.core.windows.net/ productionresultssa6.blob.core.windows.net/ productionresultssa7.blob.core.windows.net/ productionresultssa8.blob.core.windows.net/ productionresultssa9.blob.core.windows.net/ productionresultssa10.blob.core.windows.net/ productionresultssa11.blob.core.windows.net/ productionresultssa12.blob.core.windows.net/ productionresultssa13.blob.core.windows.net/ productionresultssa14.blob.core.windows.net/ productionresultssa15.blob.core.windows.net/ productionresultssa16.blob.core.windows.net/ productionresultssa17.blob.core.windows.net/ productionresultssa18.blob.core.windows.net/ productionresultssa19.blob.core.windows.net/ github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com api.githubcopilot.com api.individual.githubcopilot.com api.business.githubcopilot.com api.enterprise.githubcopilot.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com copilot-workspace.githubnext.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: blob: github.githubassets.com media.githubusercontent.com camo.githubusercontent.com identicons.github.com avatars.githubusercontent.com private-avatars.githubusercontent.com github-cloud.s3.amazonaws.com objects.githubusercontent.com release-assets.githubusercontent.com secured-user-images.githubusercontent.com/ user-images.githubusercontent.com/ private-user-images.githubusercontent.com opengraph.githubassets.com copilotprodattachments.blob.core.windows.net/github-production-copilot-attachments/ github-production-user-asset-6210df.s3.amazonaws.com customer-stories-feed.github.com spotlights-feed.github.com objects-origin.githubusercontent.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com/ secured-user-images.githubusercontent.com/ private-user-images.githubusercontent.com github-production-user-asset-6210df.s3.amazonaws.com gist.github.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; upgrade-insecure-requests; worker-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/
referrer-policy: no-referrer-when-downgrade
server-timing: discussion_layout-fragment;desc="discussion_layout fragment";dur=212.236263,content_1-fragment;desc="content_1 fragment";dur=160.067469,content_2-fragment;desc="content_2 fragment";dur=44.545864,sidebar_content-fragment;desc="sidebar_content fragment";dur=79.231529,nginx;desc="NGINX";dur=0.773541,glb;desc="GLB";dur=100.156424
strict-transport-security: max-age=31536000; includeSubdomains; preload
vary: X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, X-Requested-With,Accept-Encoding, Accept, X-Requested-With
x-content-type-options: nosniff
x-frame-options: deny
x-voltron-version: 321f992
x-xss-protection: 0
server: github.com
content-encoding: gzip
accept-ranges: bytes
set-cookie: _gh_sess=p3ShiPKGFB%2BmZfo0ZzkctEEYKgdndqmd04grILqqvNc9LK%2FRgQpzI%2F81nraJ2jGSRzWT9748HAA2wGxMsPhjtSFFxt%2B8SMNvXDs1BiKtEfkpaNIdpvXeoEB%2FCXeGaUamRUeMAKUYAPUW0b5yedVG45LalbmE9IoPQO6%2FXNVzBV%2FdK4MHyDRxnmJrCJgytsbmtAPlUkv5IifELgD%2BbRea752pZ2uScQIPIX%2FgfP%2BXGWjANzUnun1xLxDGmvuvFBBJZojMZeEYfeSCl0qQUpmISg%3D%3D--NT6ZSaDbykvfDnMs--bM2muFRvrsz8WDMYMqJAdQ%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
set-cookie: _octo=GH1.1.1902722599.1753813512; Path=/; Domain=github.com; Expires=Wed, 29 Jul 2026 18:25:12 GMT; Secure; SameSite=Lax
set-cookie: logged_in=no; Path=/; Domain=github.com; Expires=Wed, 29 Jul 2026 18:25:12 GMT; HttpOnly; Secure; SameSite=Lax
x-github-request-id: 939A:8019C:F06DA9:11C7051:68891208
How to embed a HashMap with lots of strings in program Β· web-infra-dev Β· Discussion #27 Β· GitHub
Jul 7, 2025
·
0 comments
Heading
Bold
Italic
Quote
Code
Link
Numbered list
Unordered list
Task list
Attach files
Mention
Reference
Menu
reacted with thumbs up emoji
reacted with thumbs down emoji
reacted with laugh emoji
reacted with hooray emoji
reacted with confused emoji
reacted with heart emoji
reacted with rocket emoji
reacted with eyes emoji
Skip to content
Navigation Menu
{{ message }}
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
You canβt perform that action at this time.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
What do you do when you need to embed a large amount of queryable data statically in your application?
A common style is to embed json data directly and then parse it during initial access:
But this has some disadvantages:
Here we introduce some techniques to construct a completely static and efficient Map:
Efficient static queryable Map
The hashmap has a time complexity of
O(1)
and does not take up any extra space,but this is only true when there is no key conflict.
And phf can guarantee this, make the hashmap always perfect and compact.
The simplest MPHF
the MPHF is defined as a completely injective function (aka bijection)
that can map a specified set of keys one-to-one to a set of values of the same size.
The simplest MPHF can be constructed as follows:
f(x) = h(seed, x) mod n
seed
that can make allx
generatedhash mod n
have no conflictBut for large number of keys, find a good seed in a reasonable time is almost impossible.
The practical MPHF
Practical MPHF implementations usually use buckets to resolve conflicts
The newest MPHF: PtrHash
Ptrhash has many improvements over PtHash,
the most interesting of which is the use of cuckoo hash strategy for bucket adjustment.
Cuckoo hash is a classic hashmap conflict resolve algorithm
The efficient collision resolve strategy allows ptrhash to use a lower-quality (faster) hash algorithm
and use only 1byte for each displace, which reduces time and space overhead compared to traditional solutions.
Slow and bloat...
MPHF solves the data query problem,
but if we write code like the following, we will find that it compiles very slow and size is very large.
What's the problem?
&[&str]
is badAbout
&str
If have some understand of Rust's memory layout, it's easy to notice that
&str
is essentially(*const u8, usize)
.This means that in
&[&str]
, each string has an additional size of 16 bytes.for short strings, the binary size overhead of its index is even greater than the string itself.
But thatβs not all
About relocation
I constructed a program that uses a lot of
&[&str]
and printed the section headers and foundIts largest section is not
.rodata
which is used to store constant data, but.rela.dyn
and.data.rel.ro
!Take a look at the global variable
NAMES
we defined. This is a list of&str
, which contains pointers to strings.but before so is loaded, its base address is still unknown.
How can the compiler generate a pointer list that is still uncertain?
The answer is that the dynamic linker needs to adjust the data when so is loaded, and this process is called relocation.
.data.rel.ro
and.rela.dyn
sections.data.rel.ro
stores readonly data that can only be changed in the relocation.rela.dyn
stores metadata of data that needs relocation.rela.dyn
section to adjust the.data.rel.ro
section (and others)This means that each string in
&[&str]
has an additional 24 bytes of size overhead,and will affect the startup performance of program!
String pack
Position index
Since compiler generated string indexes has so much overhead,
we can pack strings and create indexes manually.
A simple solution is to concatenate all the strings and then use the string end position as the index.
This makes each string index cost only 4byte and completely eliminates the relocation overhead.
.rodata
and can be accessed directly through relative addresses.String pool
The Position index solution is compact, and we can use the Elias-Fano technique to make it even more compact.
However, simply concatenat strings will prevent the compiler from merge repeated strings,
which makes it perform worse in specific cases.
Looking at following code,
we can observe that the two string addresses of
S[0]
andS[1]
are exactly the same,proving that the compiler merged them.
We can use string pool to manually merge strings to avoid this shortcoming.
As a space-saving trick, we can pack the offset and length into a single u32 using bitpack.
This makes it the same as position index solution, with an index overhead of only 4byte per string,
but at the cost of limit the maximum length of a single string to 255.
Compile speed
As an added benefit of string packing, this also significantly improves compile time and memory usage.
The three phases that rustc spends a lot of time on with the naive
&[&str]
solutionare reduced to negligible amounts of time after string pack.
before:
after:
The string pack saves the compiler from have to deal with a lot of objects and from have to do a lot of unnecessary optimizations.
Conclusions
Traditionally, embedding static hashmaps has various costs,
but the MPHF and string pack techniques we introduced perform well in all aspects.
&[&str]
Beta Was this translation helpful? Give feedback.
All reactions