We have been experiencing a very strange problem in our new office's server room across all the power outlets.

Specifically, when all the equipment is up and running (i.e. the air conditioning system, 2x rack-mounted servers, 5x 48-port PoE switches and the door access system, which has its backup batteries and main control circuits inside the server room), we occasionally see the servers and the door access system spontaneously reboot while the PoE switches lurch into a non-functional state for 20 minutes or more at a time. When this happens, all three systems are affected simultaneously. All three systems are on the same circuit.

The servers and switches are running on a UPS device and the card access system also has a backup battery of its own - so a simple momentary loss of power would not explain this as everything should just continue to run from the UPS without interruption. We've disconnected the UPS from the wall and have seen the servers continue to run, as expected - so the UPS seems to be working properly as far as power outages are concerned.

None of the circuit breakers have ever tripped or needed to be reset.

The air conditioning system is apparently on a separate circuit to the servers and network equipment; however, its power cables share a conduit with the power cables which run to wall outlets used by the servers etc. Could there be a risk of a voltage being induced from one circuit to the other when the AC switches on or off as they are parallel to each other for quite a few metres?

I talked to one of the electricians who was trying to work out what was happening and he said that, although the air conditioning unit is on a separate circuit to the servers and other systems, the two circuits actually share a common neutral, something he thought could potentially cause problems. Is this a normal configuration or would it be considered bad practice to have something like an AC unit share a neutral with sensitive equipment in a server room?

Currently, the problem has subsided of its own accord. The servers have stopped spontaneously rebooting and switches are back online but no real changes have been made, so the underlying problem is still there and likely to resurface sooner or later.

Given we are seeing multiple systems with separate battery backup units rebooting during these episodes, what possible explanations could there be besides power surges or spikes?

1 Answer

While not the direct "here's your issue" answer you were hoping for, here's my suggestion.

It appears that, while noble, your quest to find out what is wrong isn't one you're going to finish quickly on your own.

You can do as others have suggested: log anything you can and hope for a pattern to emerge.
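If you do go the logging route, something like the following could help spot a pattern. This is only a minimal sketch: it assumes you have written down (or scraped from logs) the times of the reboots and the times the air conditioning switched on or off. The function name `events_near`, the 30-second window and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def events_near(reboots, ac_switches, window=timedelta(seconds=30)):
    """Pair each reboot with any AC switching event within `window` of it.

    Both arguments are lists of datetime objects. The 30-second default
    is an arbitrary assumption; tune it to how precise your logs are.
    """
    matches = []
    for reboot in reboots:
        for ac_event in ac_switches:
            if abs(reboot - ac_event) <= window:
                matches.append((reboot, ac_event))
    return matches


# Hypothetical timestamps, for illustration only.
reboots = [datetime(2024, 1, 1, 10, 0, 5), datetime(2024, 1, 1, 15, 30, 0)]
ac_switches = [datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 12, 0, 0)]

for reboot, ac_event in events_near(reboots, ac_switches):
    print(f"reboot at {reboot} was within 30 s of AC event at {ac_event}")
```

If most reboots line up with an AC event, that is useful evidence to hand to the electrician. If none do, you have at least ruled out one theory.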

I like derobert's suggestion of hiring someone to measure the power quality...

HOWEVER, here's my actual suggestion which you've somewhat already done. Leave it to the electricians.

Seriously. A qualified electrician (even if you have to outsource it) should be able to tell you whether the root cause is electrical in nature and, if so, what it is. They can test each circuit to make sure it isn't overloaded (especially on spikes/startups), and they can make sure the wiring is adequate and the circuits are sized properly for what you are attaching to them.

Most of the time, IT won't have its own qualified electrician, and we often like to just "plug stuff in" without checking whether we are using the right circuits, balancing the load across circuits, etc.

If your UPS supports log gathering, I would turn it on, if for no other reason than to help prove the issue. Your UPS might not be high-end enough to compensate for the spikes/valleys properly (that is, quickly enough), but that doesn't make it the root cause. It sounds like an electrical issue to me. If you are running a nice on-line UPS and its logs show it compensating for the input voltage properly, then it would be weird for all of the IT equipment plugged into it and the card reader system to reboot at the same time.
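As a rough illustration of what you could do with such logs, here is a minimal sketch that flags input-voltage readings outside a tolerance band. Everything specific here is an assumption: the CSV layout (a `timestamp` column and an `input_voltage` column), the 230 V nominal value and the 10% tolerance are placeholders, not anything your UPS is known to produce. Check what your model actually exports and adjust.

```python
import csv
import io


def parse_log(text):
    """Parse CSV text with assumed 'timestamp' and 'input_voltage' columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["timestamp"], float(row["input_voltage"])) for row in reader]


def find_excursions(rows, nominal=230.0, tolerance=0.10):
    """Return the (timestamp, voltage) readings outside nominal +/- tolerance.

    The nominal voltage and tolerance defaults are assumptions; use the
    values that apply to your supply and your equipment.
    """
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return [(ts, volts) for ts, volts in rows if volts < low or volts > high]


# Hypothetical log excerpt, for illustration only.
sample = """timestamp,input_voltage
2024-01-01T10:00:00,231.5
2024-01-01T10:00:01,180.0
2024-01-01T10:00:02,260.0
"""

for ts, volts in find_excursions(parse_log(sample)):
    print(f"{ts}: input voltage {volts} V is outside the tolerance band")
```

A list of flagged readings with timestamps that match the reboots is the kind of evidence that makes the conversation with an electrician (and your boss) much easier.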

Talk to your boss and explain that the issue needs an expert electrician to diagnose. It's not fair to expect an electrician to set up BGP routing, and conversely you shouldn't expect a sysadmin to be a qualified electrician.

TheCleaner