Brad Fitzpatrick & lj_dev crew
LiveJournal was originally designed to be used by 10 people, myself and a few friends. Over time the design has been tweaked to let it scale further, but the basic design is still the same. We're fast approaching the time when there's nothing left to optimize, other than the architecture itself. That's what this page is about.
Currently, there is one master database, 5 slave databases, and a ton of web servers. A request comes in to the load balancer, which hands it to the "best" web server. Each web server runs tons of processes, each of which keeps a database connection open to the master and to a slave. All update operations go to the master. Read operations go to a slave, or fall through to the master if the slave is behind.
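To make the routing concrete, here's a minimal sketch of that read/write split in Python (the real code is Perl); the hostnames, the Conn class, and the lag threshold are all made up for illustration:

```python
import random
from dataclasses import dataclass

@dataclass
class Conn:
    host: str
    lag_seconds: float = 0.0  # pretend replication lag, for illustration

MASTER = Conn("db-master")                           # hypothetical hosts
SLAVES = [Conn("db-slave1", 1.0), Conn("db-slave2", 40.0)]

def conn_for_write() -> Conn:
    # Every update operation goes to the single master.
    return MASTER

def conn_for_read(max_lag: float = 5.0) -> Conn:
    # Prefer a slave, but fall through to the master if the slave's
    # replication is too far behind to serve a current read.
    slave = random.choice(SLAVES)
    return slave if slave.lag_seconds <= max_lag else MASTER
```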
The problem with this setup is that while we divide reads evenly between the slaves, each slave still has to do every write. Imagine each database server has t units of time available. Further imagine that a write takes 2 units of time and a read takes 1 unit, and say we're doing n reads and n writes. With s slaves, each slave needs 2n + (n/s) units of time: 2n to replay all n writes, plus n/s for its share of the reads. Adding more slaves shrinks the n/s term, but the real problem is that 2n keeps growing with n, taking away time those database servers could be serving read requests; once 2n alone exceeds t, no number of slaves can keep up.
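A quick back-of-the-envelope script shows the ceiling (the budget t and the slave counts are arbitrary numbers, just to show the trend):

```python
# Each slave must apply all n writes (2 units each) and serve its 1/s
# share of the n reads (1 unit each), so its load is 2n + n/s.
def slave_load(n: float, s: int) -> float:
    return 2 * n + n / s

t = 1_000_000  # each server's time budget, in the same abstract units
for s in (1, 5, 25, 1000):
    # Largest n a slave can keep up with: 2n + n/s <= t
    max_n = t / (2 + 1 / s)
    print(f"s={s:4d}  max n={max_n:,.0f}")
# No matter how big s gets, max n never passes t/2: writes are the wall.
```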
Worse, each slave won't have the disk capacity to hold the entire database. Even if it did, though, the bigger problem is that the machines' memory is finite: all our slaves top out at 2GB or 4GB of RAM (only our master can go up to 16GB). So as the on-disk size of the database grows while memory stays fixed, the cache hit rate drops incredibly fast. Once you're not hitting the cache, things start to suck with a quickness: a disk seek is orders of magnitude slower than a fetch from memory. Disks suck.
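As a crude illustration (assuming uniformly random access, which is pessimistic since real access skews toward recent data, but the trend holds), the best-case hit rate is roughly RAM size over on-disk size:

```python
def hit_rate(ram_gb: float, db_gb: float) -> float:
    # Under uniform access, roughly the fraction of the db that fits in RAM.
    return min(1.0, ram_gb / db_gb)

for db_gb in (2, 4, 8, 16, 32, 64):
    print(f"{db_gb:3d}GB on disk -> ~{hit_rate(4.0, db_gb):.0%} hits on a 4GB slave")
```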
Right now, we do some tricks to get by with the above architecture. For the largest tables, we replicate only a subset of the data from the master to the slaves. That's why we have the recent_logtext and recent_talktext tables. A cron job deletes everything older than 2 weeks from these tables every day. The web servers try the recent tables on the slave dbs first, then fall back to the full tables on the master.
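In sketch form (again Python rather than the real Perl, with a stand-in Db class instead of a real database handle), the fall-through looks something like this:

```python
from typing import Optional

class Db:
    """Stand-in for a real database handle; illustration only."""
    def __init__(self, rows: dict):
        self.rows = rows
    def get_text(self, table: str, talk_id: int) -> Optional[str]:
        return self.rows.get((table, talk_id))

def get_talk_text(slave: Db, master: Db, talk_id: int) -> str:
    # Try the trimmed recent_talktext table on a slave first...
    text = slave.get_text("recent_talktext", talk_id)
    if text is not None:
        return text
    # ...then fall back to the full table on the master.
    return master.get_text("talktext", talk_id)
```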
The next thing we did was set up one database that replicated nothing but the recent tables; all the web servers then had 3 db connections open: text slave, general slave, and master. This improved cache hits everywhere, since the dbs were now specialized. Even the general slaves improved, since they no longer had all that text getting in the way of selects, notably against the log table.
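Roughly, each web process ends up with a routing table like this (the hostnames and the helper are invented for the sketch):

```python
CONNS = {
    "master":  "db-master",         # all writes, plus fallback reads
    "general": "db-general-slave",  # everything else
    "text":    "db-text-slave",     # replicates only the recent_* text tables
}

TEXT_TABLES = {"recent_logtext", "recent_talktext"}

def pick_conn(tables: set, is_write: bool) -> str:
    # Writes always hit the master; reads go to whichever slave
    # specializes in the tables being queried.
    if is_write:
        return CONNS["master"]
    if tables & TEXT_TABLES:
        return CONNS["text"]
    return CONNS["general"]

# e.g. pick_conn({"recent_talktext"}, is_write=False) -> "db-text-slave"
```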
The plan has been modified over time as we've refined it.