Today at 3:35pm a couple of critical servers went offline at the datacenter with no warning or explanation. Of course this has to happen at the end of the day on a Friday! After 5 seconds of bewilderment I called the NOC and had someone head out to the cage to take a look at what was going on. One of our power strips had just decided to shut off. Most boxes just generated alerts about losing a power supply, but 2 machines were older and did not have dual power supplies, so they simply went offline. We moved these boxes over to another power strip and waited for the facilities guys to come and diagnose the problem.
After removing the power strip’s locking plug from under the raised tiles and testing the power source, they determined that the issue was local to the power strip. Upon plugging it back in, the power strip started working again. The power engineer figured that the new power deviation we had installed last week may have knocked the plug loose, so we left the power strip in place and decided to wait.
An hour later the circuit went down again. This time our NOC people decided to replace the troubled power strip, and now everything appears stable. Flaky power makes me nervous, especially in a mission-critical environment.