Clustering LiveJournal

Brad Fitzpatrick & lj_dev crew

Introduction

The problem is described here. The following solution is a rough draft; future refinements of this solution are posted below the introduction linked above.

The Plan; Revision 1

The ultimate ideal would be to have LiveJournal scale linearly with the number of servers we buy, and that's exactly what this plan aims to do.

The new plan is to have a bunch of independent clusters, each with a pool of ~5 web servers and at least 2 database servers. All of a user's data would be confined to one cluster. A new 'clusterid' column in the user table would specify which cluster that user is on. As new users join, we simply build a new cluster and start putting users there, keeping the old clusters running as they were. If people stop using the service, older clusters will free up, so we can put users there too.
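To make the routing concrete, here's a rough sketch in Python of how a web process might find the right cluster for a user. The host names and the get_controller_db() helper are hypothetical; the only piece actually from the plan is the user.clusterid column.

    import MySQLdb

    CLUSTER_DB_HOSTS = {                      # hypothetical clusterid -> master host map
        1: "db-cluster1.example.com",
        2: "db-cluster2.example.com",
    }

    def get_controller_db():
        # hypothetical connection to the controller master (holds the global user table)
        return MySQLdb.connect(host="db-controller.example.com", db="livejournal")

    def db_for_user(username):
        """Look up which cluster a user lives on, then connect to that cluster's master."""
        cur = get_controller_db().cursor()
        cur.execute("SELECT userid, clusterid FROM user WHERE user = %s", (username,))
        row = cur.fetchone()
        if row is None:
            raise ValueError("unknown user: %r" % username)
        userid, clusterid = row
        return userid, MySQLdb.connect(host=CLUSTER_DB_HOSTS[clusterid], db="livejournal")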

Before I go further, a picture I whipped up in Visio:

[diagram: mad art skills]

What's it all mean?

Time for another picture:

[graph: disk usage by table]

See that 1% slice for userinterests? That's the largest thing that'll be stored on the controller master. All the other larger stuff (and a bunch of the <1% slices, even) will only be stored on the cluster masters.

Notes

Maintaining globally unique AUTO values. Without everything in one logical database, the AUTO_INCREMENT columns would have no coordination, and multiple clusters could use the same unique IDs for posts, comments, pictures, memories, etc. To prevent this, and to make it easier to delete & move users (a move is a copy + delete), we need to change all these unique primary key auto columns into dual-column primary keys: (userid, autovalue). You just insert the given userid and a NULL for the autovalue, and each userid gets its own count. That means people will no longer have itemids with 9 digits and such... everybody's numbers will be unique and small, scoped to just them.
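For illustration, MyISAM already supports this directly: when the AUTO_INCREMENT column is the second part of a composite key, inserting NULL gives you the next value per prefix, i.e. per userid. A minimal sketch (the table and column names here are illustrative, not the final schema):

    import MySQLdb

    SCHEMA = """
    CREATE TABLE log (
        userid  INT UNSIGNED NOT NULL,
        jitemid INT UNSIGNED NOT NULL AUTO_INCREMENT,
        subject VARCHAR(255),
        PRIMARY KEY (userid, jitemid)
    ) TYPE=MyISAM
    """

    def new_item(db, userid, subject):
        cur = db.cursor()
        # NULL in the auto column -> next value *for this userid*, not a global counter
        cur.execute("INSERT INTO log (userid, jitemid, subject) VALUES (%s, NULL, %s)",
                    (userid, subject))
        return cur.lastrowid            # the small, per-user itemid just assigned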

Legacy ID snapshot. So as to not break links, we'll need a table that maps old unique IDs to the new (userid, autovalue) tuples. When one of these requests comes in, the backhander won't know which cluster to throw it to, so it'll throw it to any of them; that web node will look up the userid and send an HTTP redirect, so the next request (to, say, /talkread.bml?user=test&itemid=32) tells the backhander which cluster it belongs to.
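A rough sketch of that lookup-and-redirect (the oldids mapping table name is hypothetical):

    def redirect_legacy_itemid(db, old_itemid):
        """Map an old global itemid to the new (userid, itemid) pair and build a redirect."""
        cur = db.cursor()
        cur.execute("SELECT userid, newitemid FROM oldids WHERE olditemid = %s",
                    (old_itemid,))
        row = cur.fetchone()
        if row is None:
            return None                              # not a legacy ID; let it 404
        userid, newitemid = row
        cur.execute("SELECT user FROM user WHERE userid = %s", (userid,))
        (username,) = cur.fetchone()
        # Redirect to a user-qualified URL so the next request tells the backhander
        # exactly which cluster it belongs to.
        return "/talkread.bml?user=%s&itemid=%d" % (username, newitemid)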

Cache benefit. The coolest thing about this is that the caches on every webserver process and every database server will be a lot more useful. If MySQL needs to pull in 10 pages from disk to find 4 records, we no longer have to worry that all the other data on those pages is worthless: the other records now sitting in memory also belong to somebody on our cluster. And because we cap the number of users per cluster, we can ensure that a cluster's cache performance stays the same over time.

RPC between clusters. Obviously there's going to need to be communication between clusters. We started to play with this already, actually ... just wrap DB calls over HTTP. We'll need this for things like get_recent_items for the friends view and for fetching the text of everything.

It'll be important to request everything that'll be needed from each cluster in one round trip, so as to avoid the latency of serializing a bunch of requests. While the number of clusters is low (~3-6) we'll be able to get away with enumerating the clusters we need to do RPC with and doing the requests one by one. In the future it might be nice to parallelize these requests.

Actually, I'm probably worrying about this too much. We already do dozens of serialized DB calls in places. This'll be faster, since each DB call (albeit over HTTP) will execute faster due to better cache hits.
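To show what "DB calls over HTTP" might look like in practice, here's a sketch of the client side. The /intern/rpc URL and the JSON encoding are assumptions for illustration; the real endpoint and wire format are still open.

    import json
    import urllib.request

    def cluster_rpc(cluster_host, method, args):
        """Send one named call (method + args) to a cluster and return its decoded reply."""
        payload = json.dumps({"method": method, "args": args}).encode("utf-8")
        req = urllib.request.Request("http://%s/intern/rpc" % cluster_host,
                                     data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read())

    # e.g. grab the text for a whole batch of items in one round trip to one cluster:
    # rows = cluster_rpc("db-cluster2.example.com", "get_logtext",
    #                    {"items": [[1123, 5], [1123, 6], [2481, 2]]})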

Another concern you may be having: "But Brad, won't an increased number of users across all the clusters cause a bunch more RPC requests to each cluster, thus diminishing the quality of a cluster, which this whole plan was supposed to address?" Ah, clever you are. Yes, reads will increase to all clusters as total users and total clusters grow. But remember, reads are easy to load balance: all we have to do is buy more cluster db slaves and spread the reads evenly. So initially we can start with only 1 or 2 db slaves per cluster. Once our total number of clusters hits 5-6, it'd be wise to add another db slave to each cluster.

Friends view logic. The userusage table is replicated everywhere, so each cluster will know when their friends updated, and which clusters to do RPC on.
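Roughly, building a friends view then looks like this sketch: figure out which cluster each friend lives on and make one batched get_recent_items call per cluster. This reuses the cluster_rpc() and CLUSTER_DB_HOSTS sketches above; in practice the replicated userusage table would let us skip clusters where nobody has updated recently.

    from collections import defaultdict

    def friends_recent_items(controller_db, friend_userids, since):
        """Fan out to each cluster holding one of this user's friends, one RPC per cluster."""
        placeholders = ",".join(["%s"] * len(friend_userids))
        cur = controller_db.cursor()
        cur.execute("SELECT userid, clusterid FROM user WHERE userid IN (%s)" % placeholders,
                    list(friend_userids))
        by_cluster = defaultdict(list)
        for userid, clusterid in cur.fetchall():
            by_cluster[clusterid].append(userid)

        items = []
        for clusterid, userids in by_cluster.items():     # one batched request per cluster
            items.extend(cluster_rpc(CLUSTER_DB_HOSTS[clusterid], "get_recent_items",
                                     {"userids": userids, "since": since}))
        return items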

Backhander. The backhander will have to look in several places in the HTTP request to determine which cluster to throw it to:

Anything that doesn't match one of those rules, we'll need to make match, or we'll add other rules.
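For illustration only, here's the kind of matching the backhander would do. These patterns are made up, not the actual rule list:

    import re

    USER_PATTERNS = [
        re.compile(r"^/users/(\w+)"),        # e.g. /users/test/
        re.compile(r"^/~(\w+)"),             # e.g. /~test/
        re.compile(r"[?&]user=(\w+)"),       # e.g. /talkread.bml?user=test&itemid=32
    ]

    def username_from_request(uri):
        """Pull the username out of a request URI so we can route it to that user's cluster."""
        for pat in USER_PATTERNS:
            m = pat.search(uri)
            if m:
                return m.group(1)
        return None    # unknown shape: send it anywhere and let a redirect sort it out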

Stats aggregation. Each cluster will have to run its own statistics, and then something on mayor will have to aggregate those.
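Something like this sketch, assuming each cluster answers a (hypothetical) get_stats call via the cluster_rpc() helper above:

    from collections import Counter

    def aggregate_stats(cluster_hosts):
        """Roll per-cluster stat counts up into one site-wide total."""
        totals = Counter()
        for host in cluster_hosts:
            totals.update(cluster_rpc(host, "get_stats", {}))  # e.g. {"posts_today": 1234, ...}
        return dict(totals)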

recent_* tables. Recent tables could and should probably die.

Moving users between clusters. It'll be easy to move users between clusters. Now that all of a user's data is keyed by their userid, finding it and deleting it is easy. The only issue is locking, which isn't too hard: we can use GET_LOCK/RELEASE_LOCK on the controller master as a global mutex mechanism.
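A sketch of that move, with GET_LOCK()/RELEASE_LOCK() on the controller master as the mutex. copy_user_rows() and delete_user_rows() are just placeholders for the per-table copy/delete work:

    def move_user(controller_db, userid, src_db, dst_db, dst_clusterid):
        """Copy a user's rows to a new cluster, repoint them, then delete the old copy."""
        cur = controller_db.cursor()
        cur.execute("SELECT GET_LOCK(%s, 10)", ("moveuser-%d" % userid,))
        (got,) = cur.fetchone()
        if not got:
            raise RuntimeError("couldn't get the move lock for user %d" % userid)
        try:
            copy_user_rows(src_db, dst_db, userid)           # copy everything over
            cur.execute("UPDATE user SET clusterid = %s WHERE userid = %s",
                        (dst_clusterid, userid))             # flip the pointer
            delete_user_rows(src_db, userid)                 # then delete the old copy
        finally:
            cur.execute("SELECT RELEASE_LOCK(%s)", ("moveuser-%d" % userid,))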

When might it be necessary to move users? Imagine we're getting heavy growth and we must temporarily overload the newest cluster while we wait for parts from some slow vendor. Once the new cluster is up, we'll want to move users over. Or, consider that users stop using the site over time. It'd be nice to be able to move users from the busiest cluster over to the ones that are dropping in traffic due to users quitting. etc.., etc...

What's where? The major things, at least:
Cluster:     log, logprop, talk, logtext, talktext, userpicblob (userpic images)
Everywhere:  user, userusage, generic files, site images
This isn't meant as a definitive list. If you want to start hacking on some part (see the implementation plan below), then check with the code@ mailing list to see where the table/service in question will live.

Directory implications.
(22:39:01) Foobar: So here's a question that may or may not be relevant: how would the directory fit into this?
(22:39:30) Foobar: just query each cluster and aggregate and sort them after they all return?
(22:39:35) brad: basically, yeah
(22:39:53) Foobar: sounds potentially ugly, but I guess it's workable

What cluster do new users go to? The least loaded/newest cluster.
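The simplest possible version of "least loaded" is just counting users per cluster on the controller and picking the smallest; a capacity or weight column would work just as well. A sketch, reusing CLUSTER_DB_HOSTS from above:

    def pick_cluster_for_new_user(controller_db):
        """Pick the cluster with the fewest users to receive the next signup."""
        cur = controller_db.cursor()
        cur.execute("SELECT clusterid, COUNT(*) FROM user GROUP BY clusterid")
        counts = dict(cur.fetchall())
        return min(CLUSTER_DB_HOSTS, key=lambda cid: counts.get(cid, 0))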

Are there still free/paid servers? We could have a paid cluster, but it seems quite silly, since that paid cluster would have to RPC over to the free clusters. Plus, having a paid cluster would involve moving tons of people back and forth each day. So the plan is to NOT have separate clusters, and just make everything incredibly fast. If bandwidth gets too expensive, we could force free users to have an obligatory 30ms latency, which would still be about 3870ms better than what it is now. Please, don't complain about this ... we have no obligation to give away infinite CPU and bandwidth to everybody for free. We realize that free users constitute the overwhelming majority, so free service will always be good (that's our goal), but we want to be able to give paid users a little bit extra always, so if we artificially limit the speed of free access while lowering our costs, so be it.

But won't this bring down paid account sales, if free servers are fast enough? Perhaps, but who cares... having shitty service is a poor way to profit. We'll make up for the lost sales due to fast servers by offering all the features we've been saying we're going to do forever. Hell, a bunch of them are already 80% written, but we've been too busy keeping the site alive. Once the site is permanently alive we can focus on spending time writing fun stuff instead.

So now the BIG/ips don't do much, huh? Yeah, not quite as much. Right now we have a huge ruleset that gets run on the BIG/ip for each request. That'll be simplified quite a bit, and the mod_backhand code will do that work instead.

Implementation Plan

This is a lot of tedious work, but it's all very trivial. Luckily, though, it's highly parallelizable.

Conclusion

It's time to get serious. I'm sick of dumb hacks. All those dumb hacks were nice, and a large number of them will still be applicable and carry over and benefit us in the new code, but the root problem (dbs sucking over time) needs to be solved.

Please help me out with this. I can't wait until we can just buy 4-6 more machines and put a new cluster online, letting us grow without diminishing the quality of service for the other clusters. I can't wait until I can spend my time programming fun new features instead of just keeping the site alive.


Brad Fitzpatrick
Last modified: Mon Jan 21 19:34:50 PST 2002