Server W-29 in the Dallas data centre was offline from Friday into Saturday after hard drive problems.
The server suffered file system corruption on multiple disk partitions in the early hours (UTC) of Friday. We do not know what caused this, but suspect a faulty drive controller given that multiple physical drives were involved. Critical to the recovery process was the fact that the file system of the drive used for backups also suffered irreparable damage. This meant that we had to restore from our off-server backup at Amazon Web Services (AWS). While this has now proven to be a reliable backup system, the restore process was slow given the bandwidth limits imposed by AWS.
The backup we restored was made last weekend (depending on the order, your domain was backed up sometime between Friday 26 June and Sunday 29 June). This means new data (emails, website changes, database entries) added in the week since the backup was made is lost.
The majority of domains were back in service by early Saturday morning (UTC). The restoration process completed at 15:00 UTC, and final checks and corrections were finished by 18:00 UTC. In plain English, all services were down on Friday, and a small number for part of Saturday.
The silver lining to this dark cloud is that we were able to recover from what would otherwise have been a catastrophic server failure.
We feel your pain and regret the significant inconvenience all of this caused.
Original problem 02:20 UTC
Server is unreachable.
Update 03:20 UTC
Report back from the data centre indicates multiple file system problems. Technician is currently running file system checks; this may take a while…
Update 04:35 UTC
File system checks still in progress. Please bear with us.
Update 05:32 UTC
One of two file system checks has completed. The check still running is on a very large disk volume and may take several more hours to complete. We appreciate that our clients in South Africa are coming online right about now; we regret the inconvenience.
Update 11:00 UTC
Bad news. We were unable to recover two disk partitions: one critical (/var, which contains data) and the other containing the local backup sets. This leaves us with no other choice than to recover to a fresh server. In short, this means no service today. It will likely also mean restoring from an off-server backup that was made this past weekend, and the loss of any data added since. We have initiated the process and will post more information in a couple of hours once we have an idea of the timeline.
Update 12:00 UTC
Data centre indicates that the new server hardware will be in place ASAP. We will then proceed with server configuration and software installation (cPanel plus underlying services) before commencing the restoration from backup. This will unfortunately be one of those slow processes that “takes as long as it takes”.
Update 14:00 UTC
We are still waiting on the new hardware; not sure what the delay is. In the meantime, here are some questions we are receiving:
- ETA for normal operations? This amounts to a total server failure and it will be a lengthy process to recover. We cannot say with any certainty when service will be back to normal. Assuming the restoration from backup goes well, Saturday morning seems a realistic target. Nevertheless, this issue has our full attention and we are proceeding as fast as possible.
- Will I lose data? We have faith in our backups. But having lost the local hard drive we use for backups, we have to revert to the off-server backups we make every weekend. We will attempt to restore the backup set made this past weekend. This means any data (emails, website changes, logs) for the current week will be lost. This is unfortunate, but this is what it means to recover from disaster.
- I am expecting an important email; will I still receive it? Sending servers should queue undeliverable emails and retry delivery later. However, given the length of this outage, it is quite possible that some sending servers will give up trying and bounce messages back to the senders. We unfortunately have no control over this.
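For the technically curious, here is a rough sketch of why a long outage can turn queued emails into bounces. It is only an illustration: the retry interval and give-up time are hypothetical settings that each sending server chooses for itself, and the function name `delivery_outcome` is made up for this example.

```python
from datetime import timedelta


def delivery_outcome(outage, retry_interval, give_up_after):
    """Simulate one sending server retrying a message during an outage.

    All arguments are timedelta values. retry_interval and give_up_after
    are hypothetical sender-side settings, chosen purely for illustration.
    Returns ("delivered", attempts) or ("bounced", attempts).
    """
    if retry_interval <= timedelta(0):
        raise ValueError("retry_interval must be positive")
    elapsed = timedelta(0)
    attempts = 1  # the first attempt fails because the server is down
    while True:
        elapsed += retry_interval
        if elapsed > give_up_after:
            # The sender stops retrying and returns the message to its author.
            return ("bounced", attempts)
        attempts += 1
        if elapsed >= outage:
            # The server is reachable again, so this retry succeeds.
            return ("delivered", attempts)


if __name__ == "__main__":
    outage = timedelta(hours=30)  # illustrative outage length
    # A patient sender that keeps trying for five days.
    print(delivery_outcome(outage, timedelta(hours=1), timedelta(days=5)))
    # An impatient sender that gives up after one day.
    print(delivery_outcome(outage, timedelta(hours=1), timedelta(hours=24)))
```

With these made-up numbers the patient sender eventually delivers, while the impatient one bounces the message before the server is back. Which case applies depends entirely on the sender's configuration.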
One client commented “at least nobody died”, and another “at least it’s Friday”. Yes, this is not the end of the world but still painful for our affected clients and for us!
Update 17:15 UTC
We have received the shiny new server and are preparing it for action.
Update 20:10 UTC
Restoration from backup has started.
Update 23:00 UTC
We have restored about 1/4 of all accounts. Now that we feel we have the process under control, we will try to speed it up as much as Amazon Web Services S3 allows. Our aim is to be done by Saturday morning in South Africa.
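For anyone wondering how a target like this can be estimated, below is a minimal sketch that extrapolates the remaining time from progress so far. It assumes accounts restore at a constant rate, which is optimistic because accounts differ greatly in size and the transfer speed varies. The function name `estimate_remaining` is hypothetical, and the figures are simply the start time (20:10 UTC) and the 1/4 mark (23:00 UTC) from the updates above.

```python
from datetime import timedelta


def estimate_remaining(elapsed, fraction_done):
    """Linearly extrapolate the time still needed to finish a restore.

    elapsed is the time spent so far (timedelta); fraction_done is the
    share of accounts restored, between 0 (exclusive) and 1 (inclusive).
    Assumes a constant restore rate, which is a simplification.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    projected_total = elapsed / fraction_done
    return projected_total - elapsed


if __name__ == "__main__":
    # 20:10 UTC to 23:00 UTC is 2 hours 50 minutes for roughly a quarter.
    so_far = timedelta(hours=2, minutes=50)
    print(estimate_remaining(so_far, 0.25))
```

Treat the result as a best case rather than a promise; large accounts restored later can stretch the real timeline considerably.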
Update 03:20 UTC
Past the 1/2 mark with restoring.
Update 05:35 UTC
The 3/4 mark is in sight. Almost all accounts with usernames starting with “a” through “q” are back online. Our next status update will be when we are at 100% restored and have done some necessary housekeeping (e.g. correcting some IP address assignments in the Apache web server).
Update 15:00 UTC
The last account was restored minutes ago. We are doing some final tweaks and checks to ensure the restoration was complete and correct. Please let us know if you experience any problems with your website or email.
Update 18:35 UTC
All done. Please open a support ticket if you are unable to access your website or email.