Setting up a MySQL database server on a RAID 5 array is a very big mistake.
In case you missed that:
Unless you are running mission-critical banking transactions that cannot, under any circumstance, be lost, lest a whole bunch of people die, and RAID 5 is somehow the only solution that works, do not use RAID 5 for any database server where you wish to keep a modicum of performance in the long term.
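The reason RAID 5 hurts a write-heavy database is the small-write parity penalty: each random write costs four disk operations (read old data, read old parity, write new data, write new parity), while a RAID 10 mirror costs two (one write per side of the pair). A back-of-the-envelope sketch, where the 150-IOPS-per-spindle figure is my assumption (roughly right for 10K RPM SAS drives of that era), not a measured number:

```python
# Rough random-write throughput for RAID 5 vs RAID 10.
# Assumption: ~150 random IOPS per spindle (typical 10K RPM SAS drive).
IOPS_PER_DISK = 150

def raid5_write_iops(disks):
    # Each small random write = read data + read parity + write data + write parity.
    return disks * IOPS_PER_DISK / 4

def raid10_write_iops(disks):
    # Each write lands on both halves of a mirrored pair.
    return disks * IOPS_PER_DISK / 2

print(raid5_write_iops(7))   # → 262.5 (7-disk RAID 5)
print(raid10_write_iops(8))  # → 600.0 (8-disk RAID 10)
```

Even with one more spindle burned on mirroring, the RAID 10 array sustains more than double the random writes, which is exactly the workload a constantly fed database generates.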
Really. I work on a, shall we say, very high-transaction marketing tracking service that processes over 242 million transactions a month. And when I say “transactions”, I mean hits. And while a hit is not, say, as intense as downloading an image file, they add up.
Some of it is discarded traffic from inactive sites, or refresh pages, but most of the data ends up being pumped into the database, constantly.
I considered myself somewhat of a MySQL aficionado until recently, when I actually bought High Performance MySQL by Schwartz, Zaitsev, Tkachenko, Zawodny, et al., which is worth every penny of the cover cost, for sure. The section on key buffers in MyISAM alone is worth its weight in gold. Other sections reinforced a lot of things I kind of knew but never really followed through on, along with a ton of material I never knew, the kind you would need to be a MySQL developer, or a full-time MySQL DBA, to understand well. The latter, I am afraid to say, is what I am becoming more and more.
In February 2008 we upgraded our server farm and purchased two honkin’ ProLiant DL380 G5 boxes loaded with memory and disk. Our system administration consultant, whom I’ll lovingly call “Jark”, set them up with RAID 5, seven drives to the array plus a hot spare, giving over 1.6 terabytes of storage, a vast increase over our prior 300 gigabytes. Fine and dandy. Everything ran smoothly for the first ten months.
Fast forward to December, when our client who receives the most traffic (10M page views per month) starts to slow everyone down. For about three days, our data (which is real-time, ahem) ran six to eight hours behind during the day, and caught up to only about four hours behind during the quiet morning periods.
I optimized, I profiled, I did everything I could to try and remedy the situation.
There’s a handy dandy little tool I have which shows how long it takes to pump a minute’s worth of data into the database. On a good day, it takes about 10 seconds. The week in question, it was taking 60 to 100 seconds. Those rocket scientists out there can see that this was a losing proposition.
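The tool itself is nothing exotic. Here is a hypothetical reconstruction of the idea, not the actual code: time the import of one minute's worth of rows, and compare the result to the 60 wall-clock seconds that minute represents (the `import_row` callback and the batch shape are stand-ins):

```python
import time

def timed_import(rows, import_row):
    """Load one minute's worth of rows; return wall-clock seconds taken."""
    start = time.perf_counter()
    for row in rows:
        import_row(row)
    return time.perf_counter() - start

# Hypothetical usage: a stand-in import function and a minute's batch of hits.
batch = [{"hit_id": i} for i in range(1000)]
elapsed = timed_import(batch, lambda row: None)

# If importing a minute of data takes more than 60 wall-clock seconds,
# the database can never catch up, no matter how long it runs.
print(f"{elapsed:.2f}s for one minute of data; losing ground: {elapsed > 60}")
```

The 60-second line is the whole diagnostic: under it you are draining the backlog, over it you are digging the hole deeper.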
And, to cap it all off, I read in High Performance MySQL that businesses which fail to scale tend to fail. Gulp.
Oddly, the traffic patterns hadn’t changed significantly from the prior month. True, December is a banner month for most merchants, and our customers are no different, but I was starting to sweat. Jark, the consultant who set up the boxes originally, offered his high hourly rate to diagnose and check out the situation, which I begrudgingly paid … to no avail.
In the meantime, I asked our development team if there was some software solution for the issue. They have, in the works, a modified data import engine which eliminates the bottleneck for slow sites. However, this code was two months out, sayeth the lead developer. Could we ship it in the next few days? Not without a SNAFU afterwards because we haven’t tested it all. Ok.
Finally, after a few no-op downtimes in which Jark accomplished nothing, apparently, I asked him about the RAID configuration, having remembered a post on a blog (or MySQL.com) somewhere about which RAID level to choose. I found a humorous (or not-so-humorous, depending on your mood) organization called BAARF, which I forwarded to him. His suggestion: “Maybe we should’ve done RAID 1.”
It was in reading this email that I believe sparks were flying from my back teeth.
Needless to say, Jark is no longer with us.
After painstakingly backing up the master database and reformatting and reinstalling the new slave system with RAID 10 (1+0), I brought the system back up on the newly formatted drives. When I pressed the magic button that starts the database feed, I watched the counter.
2 seconds. 3 seconds. 2 seconds. To import each minute.
On occasion, when things work too fast, I worry. It couldn’t be that fast. But the speed improvement was nothing short of miraculous.
The eight-hour delay was caught up in less than an hour.
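The numbers hang together, too. If importing one minute of data takes t seconds, then while you clear one data-minute of backlog, t/60 new data-minutes arrive, so a backlog only shrinks when t < 60. A quick sanity check, assuming ~2.5 s as the midpoint of the observed 2-3 seconds:

```python
def catch_up_seconds(backlog_minutes, secs_per_data_minute):
    """Wall-clock seconds to clear a backlog while new data keeps arriving.

    Each data-minute costs t seconds to import, during which t/60 new
    data-minutes accumulate, so the backlog shrinks only when t < 60.
    """
    t = secs_per_data_minute
    if t >= 60:
        return float("inf")  # falling ever further behind, as in December
    return backlog_minutes * t / (1 - t / 60)

# Eight-hour backlog at the post-RAID-10 rate of ~2.5 s per data-minute:
print(catch_up_seconds(8 * 60, 2.5) / 60)  # → ~20.9 minutes: "less than an hour"

# At the RAID 5 rate of 60-100 s per data-minute, the backlog never clears:
print(catch_up_seconds(8 * 60, 80))        # → inf
```

Which is why the fix looked miraculous: the same hardware went from mathematically unable to catch up to clearing eight hours of debt over a coffee break.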