Server WWW-26 RAID Maintenance – 9 to 18 October

The RAID array on Linux server WWW-26 is reporting errors. As a precautionary but essential measure, we are replacing both hard drives in the array.

The drive replacement will be done in two steps: replace one hard drive and rebuild the array, and then replace the second drive and rebuild the array again. The procedure will require the server to be offline for up to three hours.

We regret any inconvenience.

First maintenance window:
Thursday 9 October midnight to Friday 03:00 SAST
Thursday 9 October 22:00 to Friday 01:00 GMT
Thursday 9 October 18:00 to 21:00 EDT
Thursday 9 October 15:00 to 18:00 PDT
Update 00:45 GMT:
One of the two drives in the RAID array has been replaced. The server is back online and the array is rebuilding. Everything is looking fine at this time.
We are awaiting the schedule for the next maintenance window for replacing the second drive in the array.
Update Friday 16:20 GMT:
The RAID array has rebuilt (sigh of relief). The second maintenance window has now been set (see below). During this maintenance window we will replace the second drive in the RAID array and, if time allows, also replace the backup drive in the server:
Friday 10 October midnight to Saturday 03:00 SAST
Friday 10 October 22:00 to Saturday 01:00 GMT
Friday 10 October 18:00 to 21:00 EDT
Friday 10 October 15:00 to 18:00 PDT
Update Saturday 18:30 GMT:
The technicians were unable to complete the work on Friday. A third maintenance window has been set:
Saturday 11 October midnight to Sunday 03:00 SAST
Saturday 11 October 22:00 to Sunday 01:00 GMT
Saturday 11 October 18:00 to 21:00 EDT
Saturday 11 October 15:00 to 18:00 PDT
Update Monday 16:15 GMT:
The RAID successfully rebuilt on Saturday, but a striping error remained. We will replace one of the hard drives again today, hoping that it will address the problem. If problems still remain, we have other (more extensive) solutions to try. But let’s not go there yet.
Today’s maintenance window:
Monday 13 October midnight to Tuesday 03:00 SAST
Monday 13 October 22:00 to Tuesday 01:00 GMT
Monday 13 October 18:00 to 21:00 EDT
Monday 13 October 15:00 to 18:00 PDT
Update Tuesday 13:40 GMT:
The RAID successfully rebuilt during the day but revealed an error on the second drive. We know that this is bordering on the absurd. We will try once more to replace the second hard drive in the RAID array today:
Tuesday 14 October midnight to Wednesday 03:00 SAST
Tuesday 14 October 22:00 to Wednesday 01:00 GMT
Tuesday 14 October 18:00 to 21:00 EDT
Tuesday 14 October 15:00 to 18:00 PDT
Update Tuesday 20:15 GMT:
The hard drive has been replaced and we are awaiting a successful rebuild of the RAID array. The server should be online again within minutes.
Update Thursday 20:50 GMT:
The new hard drives are free of error (hooray!) but the RAID still reports a striping error. We will perform a “verify with fix” on the RAID array on Friday. This procedure is quite intensive, and we expect it to markedly slow down the server for a couple of hours. There is a very, very small risk of data loss; in the worst case we can restore from backup.
We have scheduled the start of the “verify with fix” as below; completion will be several hours later:
Friday 17 October midnight to Saturday 03:00 SAST
Friday 17 October 22:00 to Saturday 01:00 GMT
Friday 17 October 18:00 to 21:00 EDT
Friday 17 October 15:00 to 18:00 PDT
Update Saturday 23:45 GMT:
All our attempts to verify and fix the RAID array have failed; striping errors persist. This leaves us two options: reload the OS and restore sites from backup (many hours of downtime) or move everything onto a brand new server (some downtime). We will go with the second option to minimise interruption in service. We’ll post a separate announcement for this.