Recent Site Stability
The last week has been a roller coaster of availability for our website. Now that things have settled down, I felt I should remind you guys that we do have an Official Web Development Group where we discuss the causes of outages and the solutions being implemented to address them. That is where I announce work being performed on the site first, and (usually) in more technical detail than people care to hear.
If you don't follow that group, here is a summary of the past week. On the morning of August 8th we implemented a significant overhaul of our database structure that necessitated a multi-hour outage (it turned out to be 12 hours). We also set up a secondary "slave" database server that holds a copy of the website data as it is being written. This allows us to perform backups without taking the site offline every day, and it gives us a spare database server in case the first one dies (more on this in a bit). The slave server is the old database server that used to run the site about two and a half years ago.
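For the curious, here is a minimal sketch of the idea behind backing up from a replica instead of the live server. It is a toy in-memory model, not our actual setup: the class names (`Primary`, `Replica`) and the sample keys are made up for illustration, and a real database server handles replication internally rather than through code like this.

```python
import copy


class Primary:
    """Toy primary database: applies writes and records them in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # every write, in the order it happened

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Toy replica: replays the primary's log to maintain its own copy."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, primary):
        # Replay only the entries we have not seen yet.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

    def backup(self):
        # The snapshot is taken from the replica's copy,
        # so the primary never has to stop serving the site.
        return copy.deepcopy(self.data)


primary = Primary()
replica = Replica()

primary.write("post:1", "hello")
primary.write("post:2", "world")
replica.catch_up(primary)

snapshot = replica.backup()
# The primary keeps accepting writes while the backup exists.
primary.write("post:3", "written during backup")
print(snapshot)
```

The snapshot holds only what the replica had replayed when the backup was taken. The later write stays on the primary until the replica catches up again.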
After the upgrades we encountered numerous stability problems with the site and began having problems with database tables becoming corrupt for seemingly no reason. After fighting it for a few days and investigating like mad, on Thursday morning we realized that one of the drives in the database's array was failing and was causing all of our trouble. We formulated a plan to take the site offline Thursday evening in a controlled manner to swap out the drives as painlessly as possible.
The drives had other plans.
Thursday afternoon around 5 PM the drives experienced catastrophic failure, so we had to swap them out and rebuild the database right then. Luckily I keep a spare set of hard drives in our cabinet for just this type of failure. We propped up the slave server to handle the website in the meantime...and it wasn't nearly powerful enough to perform the task. The site was unusable because the database server couldn't keep up with the traffic. Eventually, once most people were asleep and traffic was much lower, it could serve the site fine, but that wasn't a good solution. Once we rebuilt the drives in the master database we reimported all the data (again, about a 12-hour process), and finally on Friday around noon everything came back online. It has been OK for the most part since then.
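To illustrate why the spare server coped overnight but not during the day, here is a rough sketch of the kind of capacity check that would flag the problem ahead of time. Every number is hypothetical (these are not measurements from our servers), and `hours_standby_can_serve` is a made-up helper name.

```python
def hours_standby_can_serve(hourly_load, standby_capacity):
    """Return the hours of the day (0-23) whose load fits within the standby's capacity."""
    return [hour for hour, load in enumerate(hourly_load) if load <= standby_capacity]


# Hypothetical database queries per second for each hour of the day (index 0 = midnight).
hourly_load = [
    40, 30, 25, 20, 20, 30,        # overnight
    60, 90, 120, 140, 150, 150,    # morning
    155, 150, 145, 150, 160, 170,  # afternoon
    180, 175, 160, 130, 90, 60,    # evening
]

# Hypothetical sustained capacity of the older standby server.
standby_capacity = 50

ok_hours = hours_standby_can_serve(hourly_load, standby_capacity)
print(ok_hours)  # [0, 1, 2, 3, 4, 5]
```

With these made-up figures the standby only keeps up in the small hours, which matches what we saw: fine once most people were asleep, unusable the rest of the time.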
Now that the system is stable again, we are working on bug fixes and optimizations related to last weekend's work and hope to have the site running better than ever. We are also working on a distributed content delivery method to serve some of the more static data on the site in an effort to speed up load times.
I apologize for the outages this week and take full responsibility for them. The problems we encountered were unexpected; however, I think they are behind us for the most part. Moving forward, we will be implementing better systems to deal with these types of problems, to improve our uptime percentage and provide a better home for you on the internet. Thanks for sticking around. We wouldn't be here without you guys and your continued support.