A week ago I decided to finally get serious about putting gearman to use for search indexing. I had been batting the idea around in my head for a long time (too long, really) but figured I should just write the code and see what happens. It took less than a day to get a prototype working in our development environment, and the end result made me very happy.
Today, in our production deployment, when a sphinx cluster pulls new content to index, the master does all the work. It fetches the new and changed postings, massages them into the XML format that sphinx expects, invokes the indexer, and makes the new indexes available to the slaves. The second step is usually the most CPU intensive: turning the raw data into XML involves a lot of small tweaks and changes that are very specific to Craigslist.
What I did was turn that into a gearman client/worker pair. The client (or master) simply submits processing tasks and then waits for each of them to complete. The workers fetch the data from the master, transform it, and make the transformed data available. When each task completes, the master grabs the transformed data and informs the worker that it can delete the file.
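To make the flow concrete, here is a minimal sketch of that submit, wait, fetch, and delete cycle. It is not the actual code: it uses a thread pool from Python's standard library as a stand-in for gearman workers, the names `transform_batch`, `release`, and `run_master` are made up for illustration, and the XML is a simplified placeholder rather than the real format sphinx expects.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape


def transform_batch(batch):
    """Worker side (hypothetical): turn raw postings into XML, write a file."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for posting_id, title in batch:
            # Placeholder markup; the real transformation does much more.
            out.write('<document id="%d"><title>%s</title></document>\n'
                      % (posting_id, escape(title)))
    return path


def release(path):
    """Worker side (hypothetical): delete the file once the master has it."""
    os.remove(path)


def run_master(postings, batch_size=2, workers=4):
    """Master side: submit one task per batch, then collect each result."""
    batches = [postings[i:i + batch_size]
               for i in range(0, len(postings), batch_size)]
    chunks = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front so the workers run in parallel.
        futures = [pool.submit(transform_batch, b) for b in batches]
        # Wait for each task in submission order and grab its output.
        for future in futures:
            path = future.result()
            with open(path, encoding="utf-8") as f:
                chunks.append(f.read())
            release(path)  # tell the "worker" the file can go away
    return "".join(chunks)


if __name__ == "__main__":
    sample = [(1, "bike & helmet"), (2, "sofa"), (3, "desk")]
    print(run_master(sample, batch_size=2, workers=2))
```

In the real setup the workers live on other machines and the job server hands out the tasks, but the shape of the master loop is the same: fire off all the tasks, then pick up each result as it finishes.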
So instead of being stuck using only the 4 CPU cores on a single box, I can run 4 workers on each of 3 machines and get 12 CPU cores involved. The end result is that I have a solid foundation for a system that can easily scale to many machines. AH-HA! Linear scaling rocks! So does relatively seamless distributed computing.
As time allows I'll have to work on deploying this in production.