We had an overnight air conditioning failure. We discovered that the temperature in the server room had reached about 110-115°F (43-46°C). We powered off everything that hadn't already shut down and had the A/C fixed.

Now that it's fixed, I'm concerned about the damage done by the extended exposure to the high temperature. I'd like to run a series of tests on all of our machines to ensure that none are damaged before we return to relying on them. My plan is as follows:

  • Run memtest86 to check if any DIMMs were damaged (I have already done this and found essentially no issues)
  • Run Prime95 to check if any CPUs are damaged (presumably this will come in the form of unexpected interrupts or hardware faults)
  • Run smartctl -a and badblocks on all disks and check output for any anomalies
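
For the disk step, here is a minimal sketch of how the `smartctl -a` output could be screened across many machines instead of being read by eye. It assumes ATA drives that print the usual attribute table. The `WATCHED` shortlist and the function names `flag_attributes` and `check_disk` are illustrative assumptions, not part of smartmontools.

```python
import subprocess
import sys

# Attributes whose non-zero raw value usually deserves a closer look.
# This shortlist is an assumption for illustration, not an exhaustive set.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def flag_attributes(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


def check_disk(device):
    """Run `smartctl -a` on one device (needs root) and flag watched attributes."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated
    # as a failure of this script.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return flag_attributes(result.stdout)


if __name__ == "__main__":
    # Example: sudo python3 check_smart.py /dev/sda /dev/sdb
    for dev in sys.argv[1:]:
        print(dev, check_disk(dev) or "no watched attributes flagged")
```

An empty result only means none of the watched counters are non-zero; it does not replace reading the full report or running badblocks.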

This list feels a little thin, and I'm not confident these will all properly exercise the hardware to ensure we won't run into any heat-induced issues in the future.

Is this battery of tests sufficient? Are there any others I should consider?

1 Answer

46.5 degrees Celsius.

Start not with a test, but by reading the documentation for your main servers.

You will likely find that this is well within their operating temperatures. No joke. Hardware is built for many purposes, and there are HOT places on earth. Do you really want to tell a guy in Texas on a really hot day that no, he NEEDS air conditioning?

Heck, just checking the specs of the servers I have:

https://supermicro.com/Aplus/system/1U/1123/AS-1123US-TR4.cfm

The operating temperature range is given as up to 95 degrees Fahrenheit. And CPUs are temperature throttled; if anything, they would have shut down.

You should rather check the disks for integrity and make sure the backups are OK. CPUs do not overheat and get damaged so easily. For the last 15 years or so, everyone has built in thermal throttling circuits. I have had a couple of CPU cooler failures, and they resulted in the CPU shutting down the motherboard FAST.
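
If you want to see the throttling for yourself while a stress test runs, here is a rough sketch. It assumes Linux on Intel x86, where the kernel usually exposes a per-core counter at `/sys/devices/system/cpu/cpuN/thermal_throttle/core_throttle_count`; other platforms may not have this file at all, in which case the result is simply empty. The function names are made up for illustration. Note that the counters reset on reboot, so they say nothing about the overnight incident itself.

```python
import time
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: core_throttle_count} for CPUs that expose the counter."""
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/core_throttle_count"
    for counter in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = counter.parent.parent.name  # e.g. "cpu0"
        try:
            counts[cpu_name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or malformed counter: skip this CPU.
            continue
    return counts


def newly_throttled(before, after):
    """Return {cpu_name: increase} for CPUs whose counter went up."""
    return {
        cpu: after[cpu] - before.get(cpu, 0)
        for cpu in after
        if after[cpu] > before.get(cpu, 0)
    }


if __name__ == "__main__":
    # Take a snapshot, let the stress test run for a minute, then compare.
    start = read_throttle_counts()
    time.sleep(60)
    print(newly_throttled(start, read_throttle_counts()) or "no new throttle events")
```

A counter that climbs under load in a properly cooled room points at a cooling problem (fan, heatsink, airflow), not at a damaged CPU.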

yagmoth555
TomTom