"The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes. If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0')."
There's no need for the keys to be NULL-terminated... That's so from the 90s! Also, note that Python strings (and pretty much any dynamic language) already keep length information. This will save us quite a few `strlen` calls.
You should have seen how slow it was in pure python... Anyway, I like the switch to a more efficient hash function, I like the change from NULL-terminated string to buffer+length, I like the fix of that crazy fseek loop. Right now I'm tracking down what I think might be a bug, but (depending on what @Hines and @mreiferson think) this could go in soon after.
I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.
tl;dr Honestly though, thanks for this excellent contribution! We're really excited to see all the interest in this project and we're glad GitHub can find some use for it. Also, I'm on board with these changes. Can't argue much with the raw speed improvements of the change to the hash function + file resize. I also completely agree with the API change to take length args. It is more robust, was overlooked, and we might as well make the breaking changes now while the project is young. We'll need to do some testing early this week and as @ploxiln mentioned there's a possible bug being investigated so we'll bring this in soon. Thanks and keep 'em coming!
Thanks! The md5 bottleneck has been known for some time, but we seemed to …

Justin Hines
✨ Yey ✨ Glad you like the changes! Sorry it took me a while to answer, I was watching SCIENCE.
I see what you mean. I brought Murmur mostly intact from the original C++ sources (just with minimal translation to make it build as C), but it would make things simpler if we were just using the same hash for generating the salts and the key hashes. I'll look into this. Regarding …
We think you should drop the x86 version; we're not concerned about absolute best 32-bit performance, and it'll be faster than md5 on x86 anyway.
```c
int i, num_salts = bloom->nfuncs / 4;

if (bloom->nfuncs % 4)
```
we should document this better in the README...

we use astyle on our C stuff, specifically this command line:

```
astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
    --fill-empty-lines --pad-header --indent-switches \
    --align-pointer=name --align-reference=name --pad-oper -n <file(s)>
```

the rest of your changes look fine, but this line would get the curly brace police all over it :)
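For illustration, the 1tbs-braced form of that line would look something like this — the if-body is elided in the diff above, so the `num_salts++` is an assumption about what it contains:

```c
/* hypothetical braced form under --style=1tbs; the body shown here
 * (rounding num_salts up for a partial group of 4) is assumed, since
 * the diff above cuts off after the condition */
if (bloom->nfuncs % 4) {
    num_salts++;
}
```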
also, we notably don't run astyle on code we've imported that isn't ours (like the md5/murmur files)
by the way, we use revision 353 from the astyle sourceforge svn repo; it has some fixes not present in the latest release. I'll definitely make note of this stuff in the README
now documented in the README
Micro-optimization: `% 4` is the same as `& 3`
@jinghao: Wouldn't any decent compiler backend transform integer arithmetic with constant parameters into the most efficient representation on the target automatically?
In case anyone had doubts, yes, gcc optimises trivial short computations like `x / 4` and `x % 4`, even without `-O2`. For demonstration:
src/dablooms.c:new_salts()

```c
void new_salts(counting_bloom_t *bloom)
{
    int div = bloom->nfuncs / 4;
    int mod = bloom->nfuncs % 4;
    ...
```
the assembly generated (`objdump -d build/test_dablooms | less`):

```
...
000000000040184c <new_salts>:
  40184c: 55                      push   %rbp
  40184d: 48 89 e5                mov    %rsp,%rbp
  401850: 48 81 ec a0 00 00 00    sub    $0xa0,%rsp
  401857: 48 89 bd 68 ff ff ff    mov    %rdi,-0x98(%rbp)
  40185e: 48 8b 85 68 ff ff ff    mov    -0x98(%rbp),%rax
  401865: 48 8b 40 28             mov    0x28(%rax),%rax
  401869: 48 c1 e8 02             shr    $0x2,%rax        /* ">> 2" */
  40186d: 89 45 fc                mov    %eax,-0x4(%rbp)
  401870: 48 8b 85 68 ff ff ff    mov    -0x98(%rbp),%rax
  401877: 48 8b 40 28             mov    0x28(%rax),%rax
  40187b: 83 e0 03                and    $0x3,%eax        /* "& 3" */
```
@vmg can you please rebase on current master (there are going to be conflicts in the test_libdablooms.c due to recently merged changes of mine, sorry), and squash into 3 commits:

…

thanks :)
Sorry for the delay, I've just landed in SF. I'll rebase the PR as soon as the jet lag allows. (...why is this on the front page of HN?)
```c
const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

for(i = -nblocks; i; i++) {
    uint32_t k1 = getblock(blocks,i*4+0);
```
Micro-optimization: `<< 2` is the same as `* 4`, and `(i << 2) | 1` is the same as `i*4 + 1`; same with 2, 3
This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at https://code.google.com/p/smhasher/issues/list. But then again, don't underestimate optimizing compilers: in my experience, most backends recognize simple peephole optimizations like this just fine.
This is the best pull request I have ever seen. You obviously know your stuff @vmg; the development community should be very grateful for your contributions.
Note that a Bloom filter only needs two hash functions: you can get an arbitrary number of hash values from a linear combination of the output of the two functions. This does not harm the accuracy of the Bloom filter compared to having many independent hash functions. See this paper for details: https://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
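For illustration (not code from this PR), a minimal C sketch of that trick — the i-th index is derived from just two base hash values:

```c
#include <stdint.h>

/* Kirsch-Mitzenmacher: g_i(x) = h1(x) + i * h2(x) mod m.
 * h1 and h2 stand for any two independent hash values, e.g. the two
 * 64-bit halves produced by a single MurmurHash3_x64_128 call. */
static uint64_t nth_hash(uint64_t h1, uint64_t h2, uint32_t i, uint64_t m)
{
    return (h1 + (uint64_t)i * h2) % m;
}
```

One 128-bit hash per key would then cover any `nfuncs`, instead of one seeded hash per group of four.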
"(...why is this in the frontpage of HN?)" Nipple touching and SCIENCE watching. |
We really liked this pull request, and it's now merged, albeit as #39. Thanks!
Oh shit. Sorry guys, I've spent the last week mostly sick, so this went totally over my head. Thanks tons for taking the time to squash and merge this, @ploxiln. ✨ I'm going to strike back with faster hashing thanks to linear combination. Stay tuned.
might you be talking about something similar to #41 - if so, we're already looking at that
@vmg what tool did you use to generate this:
That is Apple's Instruments, in profiling mode, for Mac OS X 10.7. ✨
best pull request ever. sorry for the necropost, but I was digging for examples for some of my devs and I had to comment.
@vmg This PR is referenced in the article "Effective pull requests and other good practices for teams using github" as an excellent pull request. I like your style btw.
HAI FRIENDLY ENGINEERS OF BITLY I COME IN PEACE
I wanted to play around with your awesome bloom library for some internal super-secret GitHub stuff (HAH), but as it turns out, I found it a tad lacking in raw performance, despite being all ballin' and stuff when it comes to memory usage.
I ran some instrumentation on the library, made some minor (major?) changes, and wanted to share them back. Here's the rationale:
"So I was like, let me trace this, yo"
This is the first instrumentation I ran with the original code (unmodified). The corpus is the `words` file that ships with Mac OS 10.7:

Clearly, the two main bottlenecks are:

- MD5 hashing
- `lseek`, when the code only uses `lseek` in one place... (??)

Everything else in the trace was noise, so I decided to go HAM on just these two things. The lowest hanging fruit is certainly MD5: it is not a general purpose hash function, so its use in a bloom filter is ill-advised.
Sex & the CityHash
My first choice for a replacement was Google's CityHash (ayuup, I drank the Google koolaid -- I'm a moron and deserve to be mocked). I left the original commit in the branch for reference.
This simple change halved the runtime, but traces were still showing way too much time spent in hashing. The cause? Well, the bloom filter requires a pretty wide `nfunc` size for most corpuses if you want reasonable error rates, but CityHash has only two hash modes: either 64 bit, or 128 bit. Neither of these modes is optimal for bloom filter operation, given the `nfunc` required for the `words` file of this synthetic benchmark. In practice, we end up performing too many calls to CityHash to fill the `nfunc` spectrum because of the small output size.

To top off this disaster, CityHash doesn't really have a native "seeded mode". The seed API performs a standard hash and then an extra iteration (??) on top of the result to mix in the seed, instead of seeding the standard hash initially.
...So I killed CityHash with fire.
Enter Murmur
MurmurHash has always been my favorite real-world hash function, and in retrospect I should have skipped City and gone straight for it.
It offers brilliant performance for all kinds of string values and is always linear with string sizes, without requiring special-casing for short strings. It also takes alignment issues into account.
To top it off, Murmur doesn't return the hash value on the stack/registers but writes it directly to a provided buffer. This makes it exceedingly easy to fill the `bloom->hashes` buffer with a lot of random data and perform the modularization incrementally (note that we have aligned the `hashes` buffer to 16 bytes to prevent corner-case overflow checks). This is simple and straightforward, and makes my nipples tingle. There are `n` salts, and each salt throws 128 bits. Wrap 'em and we're done here!
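A minimal sketch of that scheme (field and function names here are approximations, not the exact dablooms code):

```c
#include <stddef.h>
#include <stdint.h>

/* declared in the imported MurmurHash3 sources */
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

/* each seeded call writes 128 bits = four 32-bit hashes straight into
 * the (16-byte aligned) hashes buffer, so ceil(nfuncs / 4) calls fill
 * the whole spectrum; the caller then reduces each 32-bit slot modulo
 * the filter size */
static void fill_hashes(const uint32_t *salts, int num_salts,
                        const char *key, size_t key_len, uint32_t *hashes)
{
    int i;
    for (i = 0; i < num_salts; i++) {
        MurmurHash3_x64_128(key, (int)key_len, salts[i], hashes + 4 * i);
    }
}
```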
Enlarge your files
After dropping in an optimal hash function, the instrumentation showed a hilariously high percentage of time spent in the kernel performing `lseek`s. I wondered where it was coming from...

Apparently the code to resize a file on the filesystem was performing an absolute seek for every single byte by which the file had to be grown. This is... heuh... I don't know if this is for compatibility reasons, but the POSIX standard defines a very very handy `ftruncate` call (see the man-page excerpt quoted earlier in this thread). This works wonders on both Mac OS X and Linux, and lets the kernel fill the file efficiently with those pesky NULL bytes, even in highly fragmented filesystems. After replacing the `lseek` calls with a `ftruncate`, all kernel operations (including the `mmap`s) became noise in the instrumentation. Awesome!
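The change boils down to something like this (a sketch under assumed names, not the exact patch):

```c
#include <sys/types.h>
#include <unistd.h>

/* before: one absolute lseek() + write() per byte of growth;
 * after: a single ftruncate(), and the kernel zero-fills the new range */
static int resize_file(int fd, off_t new_size)
{
    if (ftruncate(fd, new_size) != 0) {
        return -1;
    }
    return 0;
}
```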
This is where we're at now
As far as I'm concerned, the instrumentation trace has been obliterated. What remains:

- `hash_func` is stalling with all the modulo operations (even though they have no interdeps and should be going simultaneously on the pipeline, I think...). There are no SIMD modulo instructions, so I don't see how to work around this.
- `strchr`, for splitting up the words in the dictionary file.
- `bitmap_check` and `bitmap_increment` are tiny and fast. Nothing to do here. :/

Also, binary strings
This is not performance related (at least not directly), but it totally bummed me that the API was requiring NULL-terminated strings, especially since I'm pretty sure you wrote this to be wrapped from dynamic languages, and all these languages incur a penalty when asked for a NULL-terminated string (see: Python string slices yo, that's some memory being duped all over the place for NULL-termination) instead of grabbing the raw buffer + its length.
I've changed the API accordingly, adding a `len` argument to all calls. Obviously, NULL-terminated strings can still be used by passing `strlen(string)` in the external API, instead of performing the measurement internally like before.
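In practice the calls look something like this (a sketch; I'm assuming the `counting_bloom_add` name and the `dablooms.h` header, so check the actual headers for exact signatures):

```c
#include <string.h>
#include "dablooms.h" /* assumed header name */

void add_examples(counting_bloom_t *bloom,
                  const char *word,                /* NULL-terminated */
                  const char *buf, size_t buf_len) /* raw buffer from a binding */
{
    /* C strings keep working; the caller measures the length once, explicitly */
    counting_bloom_add(bloom, word, strlen(word));

    /* a dynamic-language binding hands over buffer + length directly,
     * with no NULL-termination (and no defensive copy) required */
    counting_bloom_add(bloom, buf, buf_len);
}
```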
Final benchmarks
Averaged 8 runs for the original code; `words` is still the corpus.

Averaged 8 runs for the updated code, same corpus.
700% increase on this synthetic benchmark. For our specific corpus (bigger count, strings significantly larger than dictionary words), I get a 1300% increase. This is basically Murmur at work. Results may vary.
Hey you piece of shit did you break the filtering?
Don't think so. Murmur generates very high quality entropy, high enough to come close to MD5 for all measurements.
It's on my TODO list to perform some tests and see if there's a statistically significant variance in the amount of false positives between the two hash functions. Anecdotally, for the `words` dataset, MD5 was generating 1859 positives, while Murmur decreased that to 1815. THIS IS NOT SCIENTIFIC.

Common sense tells us that MD5, being cryptographically sound, should always stay ahead on pure entropy measurement, but the avalanching properties of Murmur are gorgeous. So I'm happy with this. 100% Pinkie Pie approved.
THAT'S IT
Ayup, as far as I'm concerned this now has acceptable performance to start building big stuff with it. I may look into making this even faster when I can play with more of our real-world data.
I understand these are very serious changes coming out of fucking nowhere, so I don't expect this to be merged straight away. Feel free to look at them with a magnifying glass, test them, see how they perform with your corpus (I assume they are links?), call me a moron and set my ass on 🔥... Anyway, you know how the song goes.