Jeremy Zawodny runs a post from the MySQL mailing list, stating that High Availability is NOT Cheap. I agree on most of the points.
But there are two things I'd like to add: where you should start working on high availability, and why it is more worthwhile than you might think.
(the first part is quite technical, you might want to skip to the "is it worthwhile?" part below)
The quoted post first asks you about the acceptable outage rate. This is a really good question, and it is important that you ask it at the right level. Revisit the end-to-end argument. That is, if you only run a few applications on your system and you have control over them (as is typical in a website scenario), then you can reach a reasonably high availability for a relatively low price by getting the ends right instead of focusing on the low-level components: high availability of a 100% concurrent database server is pretty hard, but changing your application (the end in the argument) to handle outages sensibly is often much simpler, for the same or a higher degree of reliability! More specifically, you can answer the "how many nines do I need?" question separately for each component of the application, which lets you focus on the important problems without driving up costs for the unimportant ones.
Let me give you a (slightly simplified) example: the adserver of search.ch. The adserver is responsible for choosing and displaying the different forms of advertising present on search.ch. A lot of data is involved: the currently running campaigns and their rules for when, where and by whom a banner should be seen; the number of times each ad was already shown (to get the scheduling right); the history of banner views of the active users (so that we don't bore you with the same advertising all the time); and the logs of when, where and by whom each ad was seen and maybe clicked on. Worst of all, only a small fraction of this is write-once-read-many (which is really the simplest form of application to distribute, if you think about it).
As there are ads on all popular pages of search.ch and we don't want broken images or unnecessary delays, the availability requirement for this component is really high. Really? What is really important is that I can keep serving the right banners to the right places. Much less important is that I never lose track of a view or a click; if the system 'forgets' a few, we will automatically run the campaign a little longer.
The most critical part is easy to distribute: just replicate the campaigns/rules data set to every node in the adserver cluster. If one goes down, the others take over. All the write operations on the data set are spooled and then synchronized regularly; if a server goes down, the small pocket of information in its spool will arrive late or, in the worst case, is lost. In this case that is not severe. I guess the total amount of money we actually lost this way is in the order of a good espresso :-)
What we did was look at a rather complex problem at the database level (a lot of reading and writing, information that needs to go to every node, etc.) and, by asking the right questions at the right level, turn it into an almost embarrassingly parallel problem, gaining cheap high availability by routing around the hard problems.
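To make the spooling idea a bit more concrete, here is a minimal sketch in Python. It is not the actual search.ch code; the class name `ViewSpool`, the `flush_fn` callback and the campaign ids are all made up for illustration. Each node counts views locally and hands the batch to a central store every now and then. If the central store is unreachable, the batch stays in the spool and is retried later; if the node itself dies, only the counts still in its spool are lost.

```python
import threading
from collections import Counter


class ViewSpool:
    """Buffers banner-view counts on one node and flushes them in batches.

    Hypothetical sketch: flush_fn stands in for whatever sends a batch
    of counts to the central data store.
    """

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn
        self._pending = Counter()
        self._lock = threading.Lock()

    def record_view(self, campaign_id):
        # Cheap local write; serving a banner never waits on the central store.
        with self._lock:
            self._pending[campaign_id] += 1

    def flush(self):
        """Send pending counts to the central store; return how many were sent."""
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if not batch:
            return 0
        try:
            self._flush_fn(dict(batch))
        except Exception:
            # Central store unreachable: keep the counts and retry next time.
            with self._lock:
                self._pending.update(batch)
            return 0
        return sum(batch.values())
```

In a real setup, `flush()` would be called from a timer or a cron-like job on every node. The point of the design is that the read path (which banner to show) only touches the locally replicated campaign data, while the write path tolerates being late.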
Of course, redundant switches, load balancers and a bunch of PCs are needed, but all of this is commodity hardware and not really expensive. The devil is in the details here, too, so don't underestimate the time it takes to set this up, and pay attention to the details. But for most applications, you don't need mainframes if you control the ends and are willing to think through the problem spaces for all of these ends.
So is it worthwhile? This is the second thing I'd like to ask, and IMHO the answer is yes. But I'm not talking about five-nines availability here. I am talking about basic redundancy in your systems, to avoid most incidents that call for emergency action. And surprisingly, this will easily put you in the 99.95% range, where user errors (when administering or changing things) will top your reason-for-failure charts, along with IBM disks. The difference is in quality of life, when "disk crashed" or "server down" doesn't mean "RUN! NOW!" but "hey, I'll be in that region anyway tomorrow, I can drop by and check out what's going on". For everyone who works in such environments this remark is pretty obvious, and all the others will probably underestimate the effect. Some of our most popular services have to run on multiple machines anyway for performance reasons, but since we migrated all popular services, including those that could easily run on single machines, to the distributed architecture, our maintenance costs have decreased significantly! That is, all our applications (ends in the sense above) run distributed, and as soon as you get used to the idea, implementing things this way is only a small extra effort with a huge payback. That's why, in my opinion, thinking along the lines of high availability is always worthwhile.
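To put numbers on the nines, here is a small back-of-the-envelope sketch in Python. The function names are made up, and the redundancy formula assumes that machines fail independently of each other, which is optimistic (shared power, shared switches and user errors all break that assumption). With these formulas, 99.95% works out to roughly 4.4 hours of downtime per year, while five nines leaves only about 5.3 minutes.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Expected downtime per year for an availability such as 0.9995."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(single, nodes):
    """Availability of `nodes` interchangeable machines.

    Assumes independent failures: the service is down only when
    every node is down at the same time.
    """
    return 1.0 - (1.0 - single) ** nodes


if __name__ == "__main__":
    for a in (0.99, 0.9995, 0.99999):
        print(f"{a:.5f} -> {downtime_hours_per_year(a):.2f} hours/year")
    # Two machines at 99% each, under the independence assumption.
    print(f"2 x 0.99 -> {redundant_availability(0.99, 2):.4f}")
```

Under that (generous) independence assumption, two modest machines side by side already beat one very good machine, which is the arithmetic behind getting basic redundancy cheaply.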
BTW, the first comment in Jeremy's post (about the consultant with a chainsaw) reinforces my impression that consulting has a lot to do with show biz. Of the B-kind in this case :-)
Posted by seefeld at June 25, 2003 11:00