Resque is our Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later.
Background jobs can be any Ruby class or module that responds to perform. Your existing classes can easily be converted to background jobs, or you can create new classes specifically to do work. Or, you can do both.
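For example, a minimal job might look like this - a sketch in the spirit of the README, where the class, queue name, and Repository model are illustrative:

    # A job is any class that responds to perform; @queue names
    # the queue this job should be placed on.
    class Archive
      @queue = :file_serve

      def self.perform(repo_id)
        repo = Repository.find(repo_id) # hypothetical model
        repo.create_archive
      end
    end

    # Enqueue the job with its arguments.
    Resque.enqueue(Archive, 35)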
All the details are in the README. We've used it to process over 10m jobs since our move to Rackspace and are extremely happy with it.
But why another background library?
A Brief History of Background Jobs
We've used many different background job systems at GitHub. SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd. Each change was out of necessity: we were running into a limitation of the current system and needed to either fix it or move to something designed with that limitation in mind.
With SQS, the limitation was latency. We were a young site and heard stories on Amazon's forums of multiple-minute lag times between push and pop. That is, once you put something on a queue, you might not be able to get it back for quite a while. That scared us, so we moved on.
ActiveMessaging was next, but only briefly. We wanted something focused more on Ruby itself and less on libraries. That is, our jobs should be Ruby classes or objects, whatever makes sense for our app, and not subclasses of some framework's design.
BackgroundJob (bj) was a perfect compromise: you could process Ruby jobs or Rails jobs in the background. How you structured the jobs was largely up to you. It even included priority levels, which would let us make "repo create" and "fork" jobs run faster than the "warm some caches" jobs.
However, bj loaded the entire Rails environment for each job. Loading Rails is no small feat: it is CPU-expensive and takes a few seconds. So for a job that may take less than a second, you could incur 8-20s of added overhead, depending on how big your app is, how many dependencies it requires, and how bogged down your CPU is at the time.
DelayedJob (dj) fixed this problem: it is similar to bj, with a database-backed queue and priorities, but its workers are persistent. They only load Rails when started, then process jobs in a loop.
Jobs are just YAML-marshalled Ruby objects. With some magic you can turn any method call into a job to be processed later.
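(A sketch of that magic, using delayed_job's send_later from that era - the model and method here are hypothetical:)

    # Marshals the receiver and method name into a delayed_jobs row,
    # to be invoked later by a worker instead of during the request.
    Repository.find(42).send_later(:update_network_graph)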
Perfect. DJ lacked a few features we needed but we added them and contributed the changes back.
We used DJ very successfully for a few months before running into some issues. First: backed-up queues. DJ works great with small datasets, but once your site starts overloading and the queue backs up (to, say, 30,000 pending jobs), its queries become expensive. Creating jobs can take 2s+, and acquiring locks on jobs can take 2s+ as well. This means an added 2s per job created on each page load. On a page that fires off two jobs, you're at a baseline of 4s before doing anything else.
If your queue is backed up because your site is overloaded, this added overhead just makes the problem worse.
Solution: move to beanstalkd. beanstalkd is great because it's fast, supports multiple queues, supports priorities, and speaks YAML natively. A huge queue has constant time push and pop operations, unlike a database-backed queue.
beanstalkd also has experimental persistence - we need persistence.
However, we quickly missed DJ features: seeing failed jobs, seeing pending jobs (beanstalkd only lets you 'peek' ahead at the next pending job), manipulating the queue (e.g. running through and removing all jobs that were created by a bug or with a bad job name), etc. A database-backed queue gives you a lot of cool features. So we moved back to DJ - the tradeoff was worth it.
Second: if a worker gets stuck, or is processing a job that will take hours, DJ has facilities to release its lock and retry that job when another worker is looking for work. But the stuck worker, even though its job has been released, is still processing a job that you most likely want to abort or fail.
You want that worker to fail or restart. We added code so that, instead of simply retrying a job that failed due to timeout, other workers will a) fail that job permanently then b) restart the locked worker.
In a sense, all the workers were babysitting each other.
But what happens when all the workers are processing stuck or long jobs? Your queue quickly backs up.
What you really need is a manager: someone like monit or god who can watch workers and kill stale ones.
Also, your workers will probably grow in memory a lot during the course of their life. So you need to either make sure you never create too many objects or "leak" memory, or you need to kill them when they get too large (just like you do with your frontend web instances).
At this point we have workers processing jobs with god watching them and killing any that are a) bloated or b) stale.
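A god watch for that looks something like this sketch - the worker name, start command, and memory limit are assumptions, using god's standard memory_usage condition:

    God.watch do |w|
      w.name  = "dj-worker-0"
      w.start = "rake jobs:work"

      # restart bloated workers
      w.restart_if do |restart|
        restart.condition(:memory_usage) do |c|
          c.above = 300.megabytes
          c.times = 2
        end
      end
    end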
But how do we know all this is going on? How do we know what's sitting on the queue? As I mentioned earlier, we had a web interface which would show us pending items and try to infer how many workers were working. But that's not easy - how do you have a worker you just kill -9'd gracefully manage its own state? We added a process to inspect workers and add their info to memcached, which our web frontend would then read from.
But who monitors that process? And do we have one running on each server? This is quickly becoming very complicated.
Also, we have another problem: startup time. There's a multi-second startup cost when loading a Rails environment, not to mention the added CPU time. With lots of workers doing lots of jobs, being restarted on a non-trivial basis, that adds up.
It boils down to this: GitHub is a warzone. We are constantly overloaded and rely very, very heavily on our queue. If it's backed up, we need to know why. We need to know if we can fix it. We need workers to not get stuck and we need to know when they are stuck.
We need to see what the queue is doing. We need to see what jobs have failed. We need stats: how long are workers living, how many jobs are they processing, how many jobs have been processed total, how many errors have there been, are errors being repeated, did a deploy introduce a new one?
We need a background job system as serious as our web framework. I highly recommend DelayedJob to anyone whose site is not 50% background work.
But GitHub is 50% background work.
In Search of a Solution
In the Old Architecture, GitHub had one slice dedicated to processing background jobs. We ran 25 DJ workers on it and all they did was run jobs. It was known as our "utility" slice.
In the New Architecture, certain jobs needed to be run on certain machines. With our emphasis on sharding data and high availability, a single utility slice no longer fit the bill.
Both beanstalkd and bj supported named queues or "tags," but DelayedJob did not. Basically we needed a way to say "this job has a tag of X" and then, when starting workers, tell them to only be interested in jobs with a tag of X.
For example, our "archive" background job creates tarballs and zip files for download. It needs to be run on the machine which serves tarballs and zip files. We'd tag the archive job with "file-serve" and only run it on the file serving slice. We could then re-use this tag with other jobs that needed to only be run on the file serving slice.
We added this feature to DelayedJob but then realized it was an opportunity to re-evaluate our background job situation. Did someone else support this already? Was there a system which met our upcoming needs (distributed worker management - god/monit for workers on multiple machines along with visibility into the state)? Should we continue adding features to DelayedJob? Our fork had deviated from master and the merge (plus subsequent testing) was not going to be fun.
We made a list on paper of everything we needed and started re-evaluating a lot of the existing solutions: Kestrel, AMQP, beanstalkd (whose persistence still hadn't been rolled into an official release a year after being pushed to master).
Here's our list of needs:
- Persistence
- See what's pending
- Modify pending jobs in-place
- Tags
- Priorities
- Fast pushing and popping
- See what workers are doing
- See what workers have done
- See failed jobs
- Kill fat workers
- Kill stale workers
- Kill workers that are running too long
- Keep Rails loaded / persistent workers
- Distributed workers (run them on multiple machines)
- Workers can watch multiple (or all) tags
- Don't retry failed jobs
- Don't "release" failed jobs
Redis to the Rescue
Can you name a system with all of these features:
- Atomic, O(1) list push and pop
- Ability to paginate over lists without mutating them
- Queryable keyspace, high visibility
- Fast
- Easy to install - no dependencies
- Reliable Ruby client library
- Store arbitrary strings
- Support for integer counters
- Persistent
- Master-slave replication
- Network aware
I can. Redis.
If we let Redis handle the hard queue problems, we can focus on the hard worker problems: visibility, reliability, and stats.
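Concretely, each queue maps onto a Redis list of JSON payloads. A sketch with the redis-rb client - the queue name and payload here are illustrative:

    require 'redis'
    require 'json'

    redis = Redis.new

    # push: an atomic, O(1) append to the tail of the list
    redis.rpush "queue:file_serve", { :class => "Archive", :args => [35] }.to_json

    # paginate pending jobs without mutating the queue
    redis.lrange "queue:file_serve", 0, 9

    # pop: an atomic, O(1) take from the head of the list
    redis.lpop "queue:file_serve"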
And that's Resque.
With a web interface for monitoring workers, a parent/child forking model for responsiveness, swappable failure backends (so we can send exceptions to, say, Hoptoad), and the power of Redis, we've found Resque to be a perfect fit for our architecture and needs.
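For example, with resque's rake task loaded (require 'resque/tasks' in your Rakefile), workers bind to queues at startup, and the web interface ships as its own binary - queue names here are illustrative:

    # work only the file_serve queue
    $ QUEUE=file_serve rake resque:work

    # work several queues, in order of importance
    $ QUEUES=file_serve,warm_cache rake resque:work

    # fire up the Sinatra monitoring app
    $ resque-web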
We hope you enjoy it. We certainly do!



this seems pretty cool, thanks for making it available. And I love the picture showing 418 failed jobs out of 429 ;)
@riffraff Those are queue sizes - Archive has 1 pending item, Low has 10 pending items, and Failed has 418 pending items (which never drain).
So the failed # is the total failed you've seen, while the other numbers are "waiting" to be processed.
A healthy system will have all those numbers at 0 except for "failed," which will almost always be greater than 1.
BAD ASS
I needed a Redis-based queue system to process push notifications for an iPhone app, thanks ;)
Nice. I really appreciate your "history of background job systems" - I hope folks see this as a great reference guide.
No mention of RabbitMQ?
Always love reading this stuff coming from you guys. It is nice to see how to tackle problems that come up with your system instead of just giving up and thinking you'll find the "magical programming language of joy" that will solve all of your needs.
Awesome.
Cool! It's what I was starting to work on...thx for sharing it ;)
Very cool. We use beanstalkd and haven't needed persistence, so it has continued to work for us. We have been bitten by the inability to really inspect what is going on inside the queue before, and that seems not to be an issue with Resque. I will have to keep an eye on this, and if we run into any other issues with beanstalkd, we'll give Resque another look.
@acrosa, RabbitMQ is AMQP, which he did mention briefly, but it wouldn't come with the nice Sinatra management app. What I'd like to evaluate is how easy it is to integrate with my existing Ruby codebase...
I am continually impressed at the quality of these blog posts. Thanks a lot, guys.
I have been waiting for this blog post for the past few weeks. I had been going back and forth with a few people on how my ideal background worker setup would work and this pretty much nails it. I'm very excited for the potential of this project and look forward to contributing back when I can.
Did you look at gearman? https://gearman.org/
Maybe it's too generic for Rails, but it's mature and robust.
Thanks for this interesting post. Resque looks really good.
It's nice to see what you found lacking in beanstalkd. I do wish you'd made feature requests rather than just giving up, though. Most of the things you listed are not hard to add -- they've just never been requested.
And, just to clarify for the readers, beanstalkd has officially had persistence since version 1.4.
https://kr.github.com/beanstalkd/
Thanks to the Campfire integration, I've been well aware of how much of an issue this has been for you guys. I can only imagine the tremendous satisfaction of solving such a difficult problem so thoroughly. My own app also uses a fair amount of delayed_job and is only threatening to use more, so this is very exciting on several levels.
Regarding...
'Both beanstalkd and bj supported named queues or "tags," but DelayedJob did not. '
Delayed Job lets you set the priority of each job. Then you can run multiple workers and tell each one to process a different range of priorities. It's basically the equivalent of 'tags', except using integers instead of strings.
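(For instance, a sketch using collectiveidea delayed_job's rake task - the priority range is illustrative:)

    # this worker only picks up jobs with priorities 0 through 10
    $ MIN_PRIORITY=0 MAX_PRIORITY=10 rake jobs:work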
I am curious as to where AMQP fell down on your requirements. A lot of what you mentioned is based less on the queue system itself and more on the library/client. Nanite meets a lot of the requirements, and what you need could have been added. I mean, AMQP is what powers the stock exchange, and using topic exchanges is essentially the same as using tags. Just playing devil's advocate here.
Great piece, thanks defunkt. Is Redis able to handle queue backups (to, say, 30,000 pending jobs) in one data structure? Presumably you will not see that kind of backup as often as with a DB-backed queue, but I'm curious if you tested what Redis's limits were once it couldn't fit everything in memory.
@jblomo Yes, we saw 50k+ backups on our old architecture and Redis was fine. The only constraint should be your available RAM.
@defunkt, why did you choose the poll-and-sleep method for your workers instead of the wait-and-consume model that, for example, unicorn uses?
What a great brain dump on some of the challenges in queue processing. It starts getting complicated with all the what-ifs. Looking forward to giving Resque a try.
@zdzolton I know RabbitMQ is an implementation of the AMQ protocol; my question relates to what @BrianTheCoder says.
@chadrem Priority-as-tag is a hack. It only works until you need tags + priorities. Then it just becomes a mess.
awesome
The sinatra web interface is awesome.
At first glance, compared to DJ, Resque doesn't seem to support the ability to postpone a job in the future (the DJ :run_at option).
Am I wrong?
@weppos I don't believe it does. However, if you have issues with queues backing up, the run_at feature of DJ gets rendered pretty useless, as there is no guarantee the job will run at that time. If your queue is manageable and you need the run_at feature, you probably don't need Resque.
@defunkt what happened to rockqueue? :-D
@choonkeat RockQueue is GitHub's internal API for submitting jobs to Resque.
I was wondering too why Nanite isn't in the mix. It's by far the greatest solution I've found.
How do you suggest providing feedback to the user? For example, a progress bar for a CSV import, or showing the status of a job ('Examining file', 'Importing file', etc.).
What was the reason for not using Starling / Workling?
Great write-up, and Resque sounds very cool. I'll definitely check it out the next time I need to implement background jobs for a Ruby app. I've been playing with Redis recently and am very impressed with it. The thought that it would be perfect for a job queue had occurred to me as well.
On the code side I've always been a fan of Workling. I like the fact that it has a pluggable runner architecture, so you only need to write your job code once and can easily swap out a different runner (like BJ, Starling, or AMQP). In fact I implemented an SQS runner for Workling that worked pretty well for us. It seems like it should be easy enough to write a Resque runner. Maybe I'll take a stab at that at some point...
Nice work guys!
Does anyone have a capistrano recipe for this yet?
I'm still not 100% clear on how to get the workers started/restarted on a deploy. Especially if the workers are on another machine. Thanks!
Resque looks really great and matches exactly what I need, so thanks!
I'm struggling a bit to understand how I can tell a worker that Redis doesn't run on localhost but on another machine.
Any tips?
Found out - just after posting.
The trick is to create a resque:setup task in your Rakefile and hook the Redis config in there (the host and port here are just an example):
namespace :resque do
  task(:setup) { Resque.redis = 'redis.example.com:6379' } # point workers at the remote Redis
end
@thbar Please read Resque's README, your question is answered and explained there:
https://github.com/defunkt/resque#readme
@weppos: I really love Resque, but I miss the run_at feature from delayed_job. I'm adding a ticket to see if other people are interested in the feature. Thanks!
@weppos: ticket created if you want to vote for it: https://github.com/defunkt/resque/issues#issue/42
thanks
I was wondering: how do you manage failures? Do you use the Append Only Mode, or do you just rely on replication?
Thanks