The 2009 GitHub Contest is winding down – less than 48 hours until the deadline to get your submissions in. If you haven’t pushed the source code for your entry yet (and a lot of you haven’t), please remember to do so soon. You know what’s on the line.
I would also like to write up an overview of the entries. If you would like to be featured, please add a description of what you’ve done, along with any further reading on your approaches, to your README and send me an email (scott at github) – even if you didn’t rank very high. Unfortunately, the dataset I put out there wasn’t perfect – most people found a good percentage of my removed results by adding the parents of forked repositories, which by itself gave a big boost. However, I’m not interested in writing up that kind of GitHub-specific trick – I want to know about the rest of the algorithm, the parts that would be useful to any dataset or website trying to do something similar. Please let me know what you’ve done, what you tried, and what worked well – I would like to share it with everyone and point to your code.
A great example of an entry that is both fantastically open and documented this way is Jeremy Barnes’ – it is a really amazing writeup on one of the best-performing entries out there.
If you haven’t noticed, the ’80s are back. Obviously no one consulted me on this decision. In honor of the ’80s’ brief (it had better be brief) return, we will be holding this Thursday’s drinkup at Double Dutch, 7:30pm.
The review page for Double Dutch is peppered with many encouraging nuggets of truth and wisdom including: ‘My mustache brings all the girls to the yard’ and my personal favorite, ‘I usually prefer the Thursday SCENARIO to LAYLOW for some of that ELECTRICRELAXATION. It’s ok if you LEFTYOURWALLET IN EL SEGUNDO because they take other forms of C.R.E.A.M.’ Which must mean something to someone.
I just hope they have enough liquor. And good lord I can’t wait to see you.
My Pro Git book is shipping pretty soon and I’ve been asked by Apress to give them a list of people to send review copies to. Since I really have no idea who I should send copies to, I thought I’d ask you.
If you are interested in reviewing Pro Git and want me to send you a review copy, please send me an email (scott at github) with your name, your blog or publication or user group or whatever you want to review it for, and your mailing address. I have somewhere around 10-15 to give out, so please let me know.
Thanks!
Update: Thanks everyone for sending me your info – I have more than enough now. I’ll try to get copies out to whomever I can.
Today is the day you have all been waiting for. The bi-weekly GitHub drinkup is on tonight at 7:30pm. While PJ and Scott are off on important business in NYC, Tom, Chris and I will be ready to party it up, talk some shop, pass out stickers, and do some important business (sans bull) of our own this evening at Elixir (3200 16th St. at Guerrero), one of the oldest saloons in San Francisco. Looking forward to seeing you there!
Continuous Integration is a fancy term for “run your project’s tests after someone pushes to the repository and notify interested parties if they fail.”
We’re currently in the process of revamping our test suite (which we’ll blog about in the future) and moving servers, so I thought it’d be a good time to re-evaluate our options.
Integrity
Integrity has grown a lot since its first release. It has a ton of features, great documentation, and nice notifiers (I wrote the Campfire notifier).
It also has a very attractive interface, is easy to configure, and works with multiple projects. And it’ll run anything – not just Ruby projects.
It’s not the easiest thing to install, though. There are a lot of dependencies and I never quite got it working on my latest install attempt. I hear it works better with Passenger than Thin (what I was using).
If you can get it working, it’s worth it. qrush’s report card is a great addition, too.
BuildBot
I installed it and tried it out – it’s pretty easy to use. And because it’s a generic builder you can also use it for non-test related tasks, like compiling stuff. It has a server BuildBot and worker BuildBots, which means you can scale it to run many concurrent tasks, even across machines.
For us it seemed overpowered, but I’ll definitely keep it in mind if we need hardcore lifting in the future.
RunCodeRun
RunCodeRun is Relevance’s hosted CI service. It supports both private and open source projects, but is Ruby-specific.
Unfortunately RCR doesn’t support Campfire notifications yet, as far as I can tell. We need ’em!
CI Joe
Because knowing is half the battle. CI Joe is a dead simple, Git-talkin’, Unix-lovin’, HTTP slingin’ continuous integration server we wrote to do one thing and do it well.
It uses your Git config and lets you extend it through Git hooks. A POST will trigger a build – which means it works great with GitHub. It supports HTTP auth so Internet pranksters can’t trigger your builds. It comes with Campfire support. And it’s language agnostic – as long as your test suite can be run from a Unix shell, CI Joe can handle it.
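Kicking off a build by hand can be as simple as a curl, for instance (host, port, and credentials here are illustrative):

curl -X POST -u chris:secret http://ci.example.com:4567/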
We use the Campfire notifier (I sound like a broken record, don’t I) and Joe’s HTTP basic auth feature. Our config looks like this:
$ cat .git/config
...
[campfire]
user = notifier
pass = secret
subdomain = github
room = The GitHub Dancy Party
[cijoe]
user = chris
pass = secret
runner = rake -s test:units
We also use Joe’s “after-reset” hook. We keep our database.yml file in Git, but the CI server needs its own database config settings. If Joe finds an executable “after-reset” Git hook it’ll run it after updating the repo and before running the tests. Ours looks like this:
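(A sketch of such a hook – the exact paths are illustrative, but the idea matches the description below:)

#!/bin/sh
# .git/hooks/after-reset
# The reset just restored the versioned config, so throw it away…
rm -f config/database.yml
# …and point the app at the unversioned copy kept in the clone root.
ln -nfs "$(pwd)/database.yml" config/database.yml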
As you can see, we keep our good database.yml unversioned in the CI clone’s root and just remove the versioned one after each reset. Joe runs a “git reset --hard”, which does not remove unversioned files – our custom database.yml won’t get wiped.
Better late than never, right? As we get ready to upgrade our servers I thought it’d be a good time to upgrade our deployment process. Currently pushing out a new version of GitHub takes upwards of 15 minutes. Ouch. My goal: one minute deploys (excluding server restart time).
We currently use Capistrano with a 400 line deploy.rb file. Engine Yard provides a handful of useful Cap tasks (in gem form) that we use along with many of the built-in features. We also use the fast_remote_cache deployment strategy and have written a handful (400 lines or so) of our own tasks to manage things like our service hooks or SVN importer.
As you may know, Capistrano keeps a releases directory where it creates timestamped versions of your app. All your daemons and processes then assume your app lives under a directory called current which is actually a symlink to the latest timestamped version of your app in releases. When you deploy a new version of your app, it’s put into a new timestamped directory under releases. After all the heavy lifting is done the current symlink is switched to it.
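On disk that arrangement looks something like this (timestamps illustrative):

current -> releases/20090822104500
releases/
  20090820091230/
  20090821113015/
  20090822104500/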
Which was really great. Before Git. So I went digging.
First I investigated Vlad the Deployer, the Capistrano alternative in Ruby. I like that it’s built on Rake but it seems to make the same assumptions as Capistrano. Basically both of these tools are modular and built in such a way that they work the same whether you’re using Subversion, Perforce, or Git. Which is great if you’re using SVN but unfortunate if you’re using Git.
For example, this is from Vlad’s included Git deployment strategy:
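(In spirit the strategy does something like the following – a paraphrase with illustrative variable names, not Vlad’s verbatim source:)

# blow away the old copy, then re-clone the entire repository
run "rm -rf #{app_dir} && git clone #{repository} #{app_dir}"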
When you deploy a new copy of your app, Vlad removes the existing copy and does a full clone to get a new version. Capistrano does something similar by default but has a bundled “remote_cache” strategy that is a bit smarter: it caches the Git repo and does a fetch then a reset. It then still has to copy the updated version of your app into a timestamped directory and switch the symlink, but it’s able to cut down on time spent pulling redundant objects. It even knows about the depth option.
The next thing I looked at was Heroku’s rush. It lets you drive servers (even clusters of them) using Ruby over SSH, which looked very promising. Maybe I’d write a little git-deploy script based on it.
Unfortunately for me, rush needs to be installed on every server you’re managing. It also needs a running instance of rushd. Which makes sense – it’s a super powerful library – but that wouldn’t work for deploying GitHub.
Fabric is a library I first heard about back in February. It’s like Capistrano or Vlad but with more emphasis on being a framework/tool for remote management of servers. Easy deployment scripts are just a side effect of that mentality.
It’s very powerful and after playing with it for a while I was extremely pleased. I’ll definitely be using it in all my Python projects. However, I wasn’t looking forward to porting all our custom Capistrano tasks to Python. Also, though I love Python, we’re mostly a Ruby shop and everyone needs to be able to add, debug, and modify our deploy scripts with ease.
Playing with Fabric did inspire me, though. Capistrano is basically a tool for remote server management, too, if you think about it. We may have outgrown its ideas about deployment but I can always write my own deployment code using Capistrano’s ssh and clustering capabilities. So I did.
It turned out to be pretty easy. First I created a config/deploy directory and started splitting up the deploy.rb into smaller chunks:
Then I pulled them in. Careful here: Capistrano overrides both load and require, so it’s probably best to just use load.
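The pulling-in can then be a one-liner at the top of deploy.rb:

# Capistrano overrides `load`, so use it (not `require`) to pull in the chunks
Dir["config/deploy/*.rb"].each { |file| load file }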
This separation kept the deploy.rb and each specific file small and focused.
Next I thought about how I’d do Git-based deployment. Not too different from Capistrano’s remote_cache, really. Just get rid of all the timestamp directories and have the current directory contain our clone of the Git repo. Do a fetch then reset to deploy. Rollback? No problem.
The best part is that because Engine Yard’s gemified tasks and our own code both call standard Capistrano tasks like deploy and deploy:update, we can just replace them and not change the dependent code.
Here’s what our new deploy.rb looks like. Well, the meat of it at least:
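(A simplified sketch of the idea – the branch, paths, and task options are illustrative, not the file verbatim:)

namespace :deploy do
  desc "Update the clone living in current_path: fetch, then hard reset"
  task :update_code, :except => { :no_release => true } do
    run "cd #{current_path} && git fetch origin && git reset --hard origin/master"
  end

  desc "Roll back by resetting to wherever the clone pointed before"
  task :rollback do
    run "cd #{current_path} && git reset --hard HEAD@{1}"
  end
end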
Great. I like this – very Gitty and simple. But copying and removing directories wasn’t the only slow part of our deploy process.
Every Capistrano task you run adds a bit of overhead. I don’t know exactly why, but I imagine each task opens a fresh SSH connection to the necessary servers. Maybe. Either way, the fewer tasks you run, the better.
We were running about eight symlink related tasks during each deploy. Config files and cache directories that only live on the server need to be symlinked into the app’s directory structure after the reset. Cutting these actions down to a single task made everything much, much faster.
Here’s our symlinks.rb:
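(A sketch with illustrative paths – the real file lists our actual configs and cache directories:)

namespace :symlinks do
  # shared-path source => app-relative target (examples)
  LINKS = {
    "config/database.yml" => "config/database.yml",
    "cache"               => "public/cache"
  }

  desc "Make all the symlinks in one single remote command"
  task :make, :roles => [:app, :web] do
    commands = LINKS.map do |from, to|
      "ln -nfs #{shared_path}/#{from} #{current_path}/#{to}"
    end
    run commands.join(" && ")
  end
end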
Finally, bundling CSS and JavaScript. I’d like to move us to Sprockets but we’re not using it yet and this adventure is all about speeding up our existing setup.
Since the early days we’ve been using Uladzislau Latynski’s jsmin.rb to minimize our JavaScript. Our Cap task looked something like this:
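(A reconstruction for illustration – not the original task, but the same shape:)

desc "Minimize the JavaScript locally, then push it to each server"
task :minimize_js, :roles => :web do
  js = Dir["public/javascripts/*.js"].sort.map { |f| File.read(f) }.join("\n")
  # jsmin.rb is a stdin/stdout filter, so pipe the bundle through it
  IO.popen("ruby lib/jsmin.rb > /tmp/bundled.min.js", "w") { |io| io.write(js) }
  # upload copies the result to every server in the role, one at a time
  upload "/tmp/bundled.min.js", "#{current_path}/public/javascripts/bundled.min.js"
end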
Spot the problem? We’re minimizing the JS locally, on every deploy, then uploading it to each server individually. We also do this same process for Gist’s JavaScript and the CSS (using YUI’s CSS compressor). So with N servers, this is basically happening 3N times on each deploy. Yowza.
Solution? Do the minimizing and bundling on the servers. The beefy, beefy servers:
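(Sketch – the rake task name is illustrative:)

desc "Bundle and minimize the JS and CSS on each server"
task :bundle, :roles => :web do
  run "cd #{current_path} && rake -s bundle:all"
end

Each box does its own minimizing in parallel, and nothing bulky travels over our local uplink.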
As long as the bundle Rake tasks don’t need to load the Rails environment (which ours don’t), this is much faster.
Conclusion
We moved to a more Git-like deployment setup, cut down the number of tasks we run, and moved bundling and minimizing JS and CSS from our localhost to the server. Did it help?
As I said before, a GitHub deploy can take 15 minutes (not counting server restarts). My goal was to drop it down to 1 minute. How’d we do?
$ time cap production deploy
* executing `production'
* executing `deploy'
triggering before callbacks for `deploy:update'
* executing `notify:campfire'
* executing `deploy:update'
* executing `deploy:update_code'
triggering after callbacks for `deploy:update_code'
* executing `symlinks:make'
* executing `deploy:bundle'
* executing `deploy:restart'
* executing `mongrel:restart'
* executing `deploy:cleanup'
real 0m14.361s
user 0m2.049s
sys 0m0.560s
Pretty cool stuff. For more projects you can check out the microformats DevCamp wiki. Want to hear about all the latest microformats news as it happens? Follow their blog.
While Comet may be all the rage, some of us are still stuck in web 2.0. And those of us who are stuck there use Ajax polling to see if there’s anything new on the server.
Here at GitHub we normally do this with memcached. The web browser polls a URL which checks a memcached key. If there’s no data cached, the request returns and polls again in a few seconds. If there is data, the request returns with it and the browser merrily goes about its business. On the other end our background workers stick the goods in memcached when they’re ready.
In this way we use memcached as a poor man’s message bus.
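The server half, as a Rails-flavored sketch with illustrative names:

def poll
  # background workers stick the finished goods under this key
  if data = Rails.cache.read("job:#{params[:id]}")
    render :text => data   # ready – the browser takes it from here
  else
    head 204               # nothing yet – the client should poll again
  end
end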
Yet there’s a problem with this: if after a few Ajax polls there’s no data, there probably won’t be for a while. Maybe the site is overloaded or the queue is backed up. In those circumstances the continued polling adds unwanted strain to the site. What to do?
The solution is to increase the amount of time you wait between polls. Really, it’s that simple. We wrote a little jQuery plugin to make this pattern even easier in our own JS. Here it is, from us to you:
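(A sketch of the plugin’s idea – the smartPoll name and its parameters are illustrative:)

(function ($) {
  // Poll `url`, waiting a little longer after each empty response.
  $.smartPoll = function (url, success, wait, increase) {
    wait     = wait     || 1000; // first delay, in milliseconds
    increase = increase || 1.5;  // back-off multiplier
    setTimeout(function () {
      $.get(url, function (data) {
        // the success callback returns true once it has what it needs
        if (!success(data)) $.smartPoll(url, success, wait * increase, increase);
      });
    }, wait);
  };
})(jQuery);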
Any time you see “Loading commit data…” or “Hardcore Archiving Action,” you’re seeing smart polling. Enjoy!
Today we’re announcing our 2009 GitHub Contest. Since the Netflix Prize is now over, we figured you guys needed something to do. Here is your chance to contribute to the open source canon, make GitHub better, and possibly win two of the best prizes probably ever offered by a contest: a bottle of Pappy Van Winkle and a large GitHub account for life! We would estimate the value here, but, honestly, they’re priceless. And hopefully you’ll have some fun along the way.
So, the problem is that we want to recommend repositories that you’ll love when you log into GitHub. How do we find the perfect projects for you? I wanted to just look at networks of what people were watching and figure out what you might like based on what your friends liked. In researching collaborative filtering and recommendation-system papers, I found little that is really helpful for this sort of problem, oddly, and very little open source code. Most papers I found online (for free, because I’m cheap – why aren’t all academic papers free and open, btw?) are based on explicit ratings (like the Netflix Prize – figuring out what you would rate something on a 1-to-X scale based on previous ratings), not on item-based collaborative filtering for binary implicit votes (like recommending new items based on past purchasing history), which seems way more useful to most websites to me.
Anyhow, so we figured perhaps you can do this better than we can. I extracted a dataset of all the repository watches in our database – close to half a million – and withheld a sample of them. I then created a test file listing the users I held watches back from. If you can write a program to analyze our dataset and best guess the watches we held back, you win our amazing prizes.
To enter the contest, check out our contest website. Basically you just put your guesses into a file named ‘results.txt’ and push it to a public GitHub project that has “https://contest.github.com” as a post-receive hook. On each push, our site will see if you’ve changed your ‘results.txt’ file then download and score it if you have. At the end of the contest, your source code has to be released under an OSI compatible license so nobody ever has to worry about this problem again. Whoever has the highest score at noon PST on Aug 30, 2009 wins. Good luck!
For about the last 8 months, I’ve been working on a side project. In November, Apress contacted me about writing a book about Git and I thought it would be a good idea. I may have slightly underestimated the amount of work that it would take, but a few days ago I put the content of the book online under a Creative Commons noncommercial 3.0 license. The book is titled “Pro Git” and you can read it or reference it online at https://progit.org.
The actual printed version will be shipping in another few weeks, but as Apress was kind enough to allow me to publish it under the CC license, you can take a look now. I hope it’s helpful to you in learning or teaching Git.
The full markdown content for the book, as well as all the images and the .graffle file I used to generate them, is on GitHub at progit/progit. If you’re interested in providing a translation under the CC license, please fork the project, copy the ‘en’ folder to the language code of your choice and start translating – I’ll put them online as they are done. Chinese, Portuguese and Ukrainian translations have already been started. Man I love GitHub.
I also encourage you to buy a copy if you use the online resource a lot. As a disclaimer, I do get royalties when you do, but I really do want this to be a commercial success so that more publishing companies and authors will release technical books under open licenses – it benefits the entire community, and I’m really glad Apress let me do it.
Oh yeah, the other cool thing is that the Pro Git website is a GitHub Pages site being generated with Jekyll.
Welcome to Rebase #26! If you’ve got an interesting project you’d like to see on the column feel free to shoot me a message. I’d love to see more themed Rebases, like the book edition. Perhaps we could have a JSON edition, a hardcore C edition, unknown language edition, and so on. I follow some simple guidelines that you can check out here too.
Featured Project
asi-http-request is the Steven Seagal of HTTP libraries for Objective-C. Drop this guy into your OS X or iPhone application and it’s guaranteed to kick ass. Well, at least your HTTP calls will. The library makes it easy to interact with RESTful services as well as submit multipart/form-data if you’re in need of it. It also has a boatload of other features including progress delegates, a streamlined interface for uploading files from disk, and background/queueing support. Take a gander at the docs here, including a nice look at what applications are using it. Fork away, punk.
Notably New Projects
deadweight deals with a common problem that many developers face: unused CSS rules. What do you do with them? Comment them out? Leave them for that annoying team member to deal with instead? This project takes the higher ground by analyzing your stylesheets and some given views to determine what selectors you can safely dispose of. You can even use Mechanize to submit forms and make sure you’re shedding your unnecessary CSS.
jquery-visualize is a really nice way to get simple graphs in your application that are both accessible (read: they degrade into tables) and really spiffy looking. It’s as simple as filling up a table with data and then calling $('table').visualize();. Of course, there are plenty of configuration options like colors, the type of graph, line weights, and more. Try out a demo or download it for yourself.
tokyo-recipes is a collection of Lua scripts that plug directly into Tokyo Cabinet, an extremely efficient and speedy key-value store. There are plenty of awesome recipes in this cookbook, including expiring data based on TTL, map/reduce, and even a simple high-low betting game. If you’re just getting started on writing your own Lua scripts for Tokyo Cabinet or are looking for some real examples of how you can use the plugins to your advantage, take a look at this repo.
weberl is a small Erlang webserver that’s based on web.py. It’s essentially a bare-bones web framework that doesn’t assume much, which is certainly ideal if you’re just getting off the ground or you don’t like too much baggage. This project has just started up and could certainly use the help of both experienced and greenhorn Erlang coders if you’re up for it. Go forth and clone!
Redisent is an interface to the Redis key-value store for PHP. Unlike memcached, Redis persists data, and now with this library you can easily hook your code into it. It also supports clustering, which allows you to hook up more than one key-value store and set aliases for each. Read up more about Redisent in this great blog post/tutorial for using it.
As of RubyGems 1.3.2, the index generation code supports incremental index updates. What this means is that instead of taking minutes to rebuild all of the indexes for GitHub’s thousands of gems, it takes just seconds to index the new gems.
So, your gem should show up in our index within 1-2 minutes now, assuming it builds correctly and our job queue isn’t backed up. We also have dropped support for legacy indexes, so anyone using a version of RubyGems prior to the 1.2.0 release needs to upgrade.