"The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes. If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0')."
There's no need for the keys to be NULL-terminated... That's so from the 90s! Also, note that Python strings (and pretty much any dynamic language) already keep length information. This will save us quite a few `strlen` calls.
You should have seen how slow it was in pure python... Anyway, I like the switch to a more efficient hash function, I like the change from NULL-terminated string to buffer+length, I like the fix of that crazy fseek loop. Right now I'm tracking down what I think might be a bug, but (depending on what @Hines and @mreiferson think) this could go in soon after.
I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.
tl;dr Honestly though, thanks for this excellent contribution! We're really excited to see all the interest in this project and we're glad GitHub can find some use for it. Also, I'm on board with these changes. Can't argue much with the raw speed improvements of the change to the hash function + file resize. I also completely agree with the API change to take length args. It is more robust, was overlooked, and we might as well make the breaking changes now while the project is young. We'll need to do some testing early this week and as @ploxiln mentioned there's a possible bug being investigated so we'll bring this in soon. Thanks and keep 'em coming!
Thanks! The md5 bottleneck has been known for some time, but we seemed to …

Justin Hines
✨ Yey ✨ Glad you like the changes! Sorry it took me a while to answer, I was watching SCIENCE.
I see what you mean. I brought Murmur mostly intact from the original C++ sources (just with minimal translation to make it build as C), but it would make things simpler if we were just using the same hash for generating the salts and the key hashes. I'll look into this. Regarding …
We think you should drop the x86 version; we're not concerned about absolute best 32-bit performance, and it'll be faster than md5 on x86 anyway.
```c
int i, num_salts = bloom->nfuncs / 4;

if (bloom->nfuncs % 4)
```
we should document this better in the README...

we use astyle on our C stuff, specifically this command line:

```
astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
    --fill-empty-lines --pad-header --indent-switches \
    --align-pointer=name --align-reference=name --pad-oper -n <file(s)>
```

the rest of your changes look fine, but this line would get the curly brace police all over it :)
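For illustration, the 1tbs-braced form of that line would look something like this — the if-body is elided in the diff above, so the `num_salts++` is an assumption about what it contains:

```c
/* hypothetical braced form under --style=1tbs; the body shown here
 * (rounding num_salts up for a partial group of 4) is assumed, since
 * the diff above cuts off after the condition */
if (bloom->nfuncs % 4) {
    num_salts++;
}
```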
also, we notably don't run astyle on code we've imported that isn't ours (like the md5/murmur files)
by the way, we use revision 353 from the astyle sourceforge svn repo; it has some fixes not present in the latest release. I'll definitely make note of this stuff in the README
now documented in the README
Micro-optimization: `% 4` is the same as `& 3`
@jinghao: Wouldn't any decent compiler backend transform integer arithmetic with constant parameters into the most efficient representation on the target automatically?
In case anyone had doubts, yes, gcc optimises trivial short computations like `x / 4` and `x % 4`, even without `-O2`. For demonstration:
src/dablooms.c:new_salts()

```c
void new_salts(counting_bloom_t *bloom)
{
    int div = bloom->nfuncs / 4;
    int mod = bloom->nfuncs % 4;
    ...
```
the assembly generated (`objdump -d build/test_dablooms | less`):

```
...
000000000040184c <new_salts>:
  40184c: 55                      push   %rbp
  40184d: 48 89 e5                mov    %rsp,%rbp
  401850: 48 81 ec a0 00 00 00    sub    $0xa0,%rsp
  401857: 48 89 bd 68 ff ff ff    mov    %rdi,-0x98(%rbp)
  40185e: 48 8b 85 68 ff ff ff    mov    -0x98(%rbp),%rax
  401865: 48 8b 40 28             mov    0x28(%rax),%rax
  401869: 48 c1 e8 02             shr    $0x2,%rax        /* ">> 2" */
  40186d: 89 45 fc                mov    %eax,-0x4(%rbp)
  401870: 48 8b 85 68 ff ff ff    mov    -0x98(%rbp),%rax
  401877: 48 8b 40 28             mov    0x28(%rax),%rax
  40187b: 83 e0 03                and    $0x3,%eax        /* "& 3" */
```
@vmg can you please rebase on current master (there are going to be conflicts in the test_libdablooms.c due to recently merged changes of mine, sorry), and squash into 3 commits:

…

thanks :)
Sorry for the delay, I've just landed in SF. I'll rebase the PR as soon as the jet lag allows. (...why is this on the front page of HN?)
```c
const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

for(i = -nblocks; i; i++) {
    uint32_t k1 = getblock(blocks,i*4+0);
```
Micro-optimization: `<< 2` is the same as `* 4`, and `(i << 2) | 1` is the same as `i*4 + 1`; same with 2, 3
This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at https://code.google.com/p/smhasher/issues/list. But then again, don't underestimate optimizing compilers: in my experience, most backends recognize simple peephole optimizations like this just fine.
This is the best pull request I have ever seen. You obviously know your stuff @vmg; the development community should be very grateful for your contributions.
Note that a Bloom filter only needs two hash functions: you can get an arbitrary number of hash values from a linear combination of the output of the two functions. This does not harm the accuracy of the Bloom filter compared to having many independent hash functions. See this paper for details: https://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
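For illustration (not code from this PR), a minimal C sketch of that trick — the i-th index is derived from just two base hash values:

```c
#include <stdint.h>

/* Kirsch-Mitzenmacher: g_i(x) = h1(x) + i * h2(x) mod m.
 * h1 and h2 stand for any two independent hash values, e.g. the two
 * 64-bit halves produced by a single MurmurHash3_x64_128 call. */
static uint64_t nth_hash(uint64_t h1, uint64_t h2, uint32_t i, uint64_t m)
{
    return (h1 + (uint64_t)i * h2) % m;
}
```

One 128-bit hash per key would then cover any `nfuncs`, instead of one seeded hash per group of four.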
"(...why is this in the frontpage of HN?)" Nipple touching and SCIENCE watching. |
We really liked this pull request, and it's now merged, albeit as #39. Thanks!
Oh shit. Sorry guys, I've spent the last week mostly sick, so this went totally over my head. Thanks tons for taking the time to squash and merge this, @ploxiln. ✨ I'm going to strike back with faster hashing thanks to linear combination. Stay tuned.
might you be talking about something similar to #41 - if so, we're already looking at that
@vmg what tool did you use to generate this:
That is Apple's Instruments, in profiling mode, for Mac OS X 10.7. ✨
best pull request ever. sorry for the necropost, but I was digging for examples for some of my devs and I had to comment.
@vmg This PR is referenced in the article "Effective pull requests and other good practices for teams using github" as an excellent pull request. I like your style btw.
HAI FRIENDLY ENGINEERS OF BITLY I COME IN PEACE
I wanted to play around with your awesome bloom library for some internal super-secret GitHub stuff (HAH), but as it turns out, I found it a tad lacking in raw performance, despite being all ballin' and stuff when it comes to memory usage.
I ran some instrumentation on the library, made some minor (major?) changes, and wanted to share them back. Here's the rationale:
"So I was like, let me trace this, yo"
This is the first instrumentation I ran with the original code (unmodified). The corpus is the `words` file that ships with Mac OS 10.7:

Clearly, the two main bottlenecks are:

- MD5 hashing
- `lseek`, when the code only uses `lseek` in one place... (??)

Everything else in the trace was noise, so I decided to go HAM on just these two things. The lowest hanging fruit is certainly MD5: it is not a general purpose hash function, so its use in a bloom filter is ill-advised.
Sex & the CityHash
My first choice for a replacement was Google's CityHash (ayuup, I drank the Google koolaid -- I'm a moron and deserve to be mocked). I left the original commit in the branch for reference.
This simple change halved the runtime, but traces were still showing way too much time spent in hashing. The cause? Well, the bloom filter requires a pretty wide `nfunc` size for most corpuses if you want reasonable error rates, but CityHash has only two hash modes: either 64 bit, or 128 bit. Neither of these modes is optimal for bloom filter operation, given the `nfunc` required for the `words` file of this synthetic benchmark. In practice, we end up performing too many calls to CityHash to fill the `nfunc` spectrum because of the small output size.

To top off this disaster, CityHash doesn't really have a native "seeded mode". The seed API performs a standard hash and then an extra iteration (??) on top of the result to mix in the seed, instead of seeding the standard hash initially.
...So I killed CityHash with fire.
Enter Murmur
MurmurHash has always been my favorite real-world hash function, and in retrospect I should have skipped City and gone straight for it.
It offers brilliant performance for all kinds of string values and is always linear with string sizes, without requiring special-casing for short strings. It also takes alignment issues into account.
To top it off, Murmur doesn't return the hash value on the stack/registers but writes it directly to a provided buffer. This makes it exceedingly easy to fill the `bloom->hashes` buffer with a lot of random data and perform the modularization incrementally (note that we have aligned the `hashes` buffer to 16 bytes to prevent corner-case overflow checks). This is simple and straightforward, and makes my nipples tingle. There are `n` salts, and each salt throws 128 bits. Wrap 'em and we're done here!
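A minimal sketch of that scheme (field and function names here are approximations, not the exact dablooms code):

```c
#include <stddef.h>
#include <stdint.h>

/* declared in the imported MurmurHash3 sources */
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

/* each seeded call writes 128 bits = four 32-bit hashes straight into
 * the (16-byte aligned) hashes buffer, so ceil(nfuncs / 4) calls fill
 * the whole spectrum; the caller then reduces each 32-bit slot modulo
 * the filter size */
static void fill_hashes(const uint32_t *salts, int num_salts,
                        const char *key, size_t key_len, uint32_t *hashes)
{
    int i;
    for (i = 0; i < num_salts; i++) {
        MurmurHash3_x64_128(key, (int)key_len, salts[i], hashes + 4 * i);
    }
}
```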
Enlarge your files
After dropping in an optimal hash function, the instrumentation showed a hilariously high percentage of time spent in the kernel performing `lseek`s. I wondered where it was coming from...

Apparently the code to resize a file on the filesystem was performing an absolute seek for every single byte by which the file had to be grown. This is... heuh... I don't know if this is for compatibility reasons, but the POSIX standard defines a very very handy `ftruncate` call (see the man-page excerpt quoted earlier in this thread). This works wonders on both Mac OS X and Linux, and lets the kernel fill the file efficiently with those pesky NULL bytes, even in highly fragmented filesystems. After replacing the `lseek` calls with a `ftruncate`, all kernel operations (including the `mmap`s) became noise in the instrumentation. Awesome!
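The change boils down to something like this (a sketch under assumed names, not the exact patch):

```c
#include <sys/types.h>
#include <unistd.h>

/* before: one absolute lseek() + write() per byte of growth;
 * after: a single ftruncate(), and the kernel zero-fills the new range */
static int resize_file(int fd, off_t new_size)
{
    if (ftruncate(fd, new_size) != 0) {
        return -1;
    }
    return 0;
}
```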
This is where we're at now
As far as I'm concerned, the instrumentation trace has been obliterated. What remains:

- `hash_func` is stalling with all the modulo operations (even though they have no interdeps and should be going simultaneously on the pipeline, I think...). There are no SIMD modulo instructions, so I don't see how to work around this.
- `strchr`, for splitting up the words in the dictionary file.
- `bitmap_check` and `bitmap_increment` are tiny and fast. Nothing to do here. :/

Also, binary strings
This is not performance related (at least not directly), but it totally bummed me that the API was requiring NULL-terminated strings, especially since I'm pretty sure you wrote this to be wrapped from dynamic languages, and all these languages incur a penalty when asked for a NULL-terminated string (see: Python string slices yo, that's some memory being duped all over the place for NULL-termination) instead of grabbing the raw buffer + its length.
I've changed the API accordingly, adding a `len` argument to all calls. Obviously, NULL-terminated strings can still be used by passing `strlen(string)` in the external API, instead of performing the measurement internally like before.
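In practice the calls look something like this (a sketch; I'm assuming the `counting_bloom_add` name and the `dablooms.h` header, so check the actual headers for exact signatures):

```c
#include <string.h>
#include "dablooms.h" /* assumed header name */

void add_examples(counting_bloom_t *bloom,
                  const char *word,                /* NULL-terminated */
                  const char *buf, size_t buf_len) /* raw buffer from a binding */
{
    /* C strings keep working; the caller measures the length once, explicitly */
    counting_bloom_add(bloom, word, strlen(word));

    /* a dynamic-language binding hands over buffer + length directly,
     * with no NULL-termination (and no defensive copy) required */
    counting_bloom_add(bloom, buf, buf_len);
}
```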
Final benchmarks
Averaged 8 runs for the original code; `words` is still the corpus.

Averaged 8 runs for the updated code, same corpus.
700% increase on this synthetic benchmark. For our specific corpus (bigger count, strings significantly larger than dictionary words), I get a 1300% increase. This is basically Murmur at work. Results may vary.
Hey you piece of shit did you break the filtering?
Don't think so. Murmur generates very high quality entropy, high enough to come close to MD5 for all measurements.
It's on my TODO list to perform some tests and see if there's a statistically significant variance in the amount of false positives between the two hash functions. Anecdotally, for the `words` dataset, MD5 was generating 1859 positives, while Murmur decreased that to 1815. THIS IS NOT SCIENTIFIC.

Common sense tells us that MD5, being cryptographically sound, should always stay ahead on pure entropy measurement, but the avalanching properties of Murmur are gorgeous. So I'm happy with this. 100% Pinkie Pie approved.
THAT'S IT
Ayup, as far as I'm concerned this now has acceptable performance to start building big stuff with it. I may look into making this even faster when I can play with more of our real-world data.
I understand these are very serious changes coming out of fucking nowhere, so I don't expect this to be merged straight away. Feel free to look at them with a magnifying glass, test them, see how they perform with your corpus (I assume they are links?), call me a moron and set my ass on 🔥... Anyway, you know how the song goes.