10

How can I make hard disk drives more durable? I work in a factory environment, and every few months the hard disk drives in some of the factory machines become corrupted, sometimes beyond repair. We have already installed SSDs in some of them, but that has not helped much. They also become corrupted on the factory machines that run heavy tasks.

Every time it happens, we restore from our backup image. We have already fitted all of the hard disk drives with shock absorbers to reduce shaking.

Is there any other option or preventive measure we could use? Perhaps adding some anti-magnetic material to prevent magnetic interference, or something else? If so, what material would we need?

Edit: The factory machines with heavy tasks that I mentioned above are basically machines that make car metal bodies and car frame moulds.

And when I say the disk is corrupted, I mean it is unreadable. The whole disk, not just program files or anything software-related. So it won't boot at all.

adadion

7 Answers

14

Quite likely, the SSD killer is electrical. We can't entirely rule out mechanical vibration, but SSDs are pretty robust mechanically. A simple rubber mounting would increase the resilience even further. Also make certain that both power and data cables have enough slack. Under tension, vibration might work them loose.

So, to address the electrical reliability, we have to consider two factors. Firstly, the heavy machinery may draw large currents from the power supply. This could cause voltage drops, which in turn can negatively affect the SSD. This is easily solved by an online UPS. Essentially, this type of UPS powers the computer from a battery, while the mains power is used to charge the battery.

A more unusual problem could be electromagnetic radiation. High-power machinery will have large currents running, often at 50 or 60 Hz. Unintentionally, this will cause cables to act as antennas. The big cables in the machine act as transmitters, and the cables to the SSD can act as receivers. The solution here is to have a proper Faraday cage, ideally grounded. That's why normal PC cases are made of metal; they work as Faraday cages, keeping EM radiation out in frequency bands between 50 Hz and several GHz.

MSalters
11

First off, electrical and magnetic problems are not as bad as vibration and air contamination. Moisture in the air plus dust or chemicals can corrode or short conductive paths quite easily, and in our installations these are the primary causes of failure when devices aren't properly protected.

The best option is just keeping all that's not necessary on the production floor off-site. Keep minimalist embedded controllers by the machines, keep the PCs in a neat office communicating with the controllers over LAN.

If that's not possible, you need sealed cases, with heat transfer elements if needed: airtight boxes that keep most of the moisture out, with some silica gel inside to absorb the rest. Neither dry dust nor moisture in clean air is a big problem on its own, but combined they quickly lead to oxidized contacts and other corrosion-related problems.

In my experience, EM disturbances are rarely powerful enough to cause any lasting damage. They may knock a device out, forcing a reboot, but a well-built device will recover from that. Power surges are a different matter; without a good surge protection you may see random damage of parts.

Finally, vibration. The vibration conducted by the floor is easily reduced to negligible levels with a sponge mat or similar. The vibration of a machine, in case the device is directly attached to the machine... there's little that can be done about it. There are damping systems, but they are only effective against certain vibration scales... really, just move that control box 2 meters away.

Also, the temperature range must be "within acceptable levels". You WILL see corruption on overheating devices, and moisture will condense on ones that are too cold. This is rarely a concern on a production floor, where too many machines depend on it, but once you seal the disk (intentionally, or unintentionally, e.g. through dust) you'll see overheating.

SF.
3

The manufacturer usually gives the estimated lifespan as a figure such as "MTBF = 2000hrs", but that applies to "normal conditions", and what you describe is not normal.
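To put an MTBF figure into perspective, it can be converted into a rough annualized failure rate. The sketch below assumes a constant failure rate (exponential model) and round-the-clock operation. Neither assumption comes from any datasheet, the function name is made up for illustration, and the 1,000,000-hour value is only a hypothetical comparison point.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: the drive is powered on 24/7


def annualized_failure_rate(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Probability that a drive fails within one year of operation.

    Assumes a constant failure rate, so the time to failure is
    exponentially distributed with mean `mtbf_hours`.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


if __name__ == "__main__":
    # 2000 is the figure quoted above; 1_000_000 is a hypothetical value.
    for mtbf in (2000, 1_000_000):
        print(f"MTBF {mtbf} h -> AFR {annualized_failure_rate(mtbf):.2%}")
```

Whatever the number comes out as, it only describes drives in "normal conditions", so a hostile environment can be expected to do worse.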

Why are the SSDs failing: physical damage or poor connections?

One anti-shock mounting used in the past was a mercury bath, but you probably won't be allowed that! You could make an oil bath version, though...

Solar Mike
3

Really a comment but too long:

I've dealt with PCs on the factory floor (woodworking), and they proved quite resilient.

Our initial setup, which was basically trouble-free: we mounted the PCs inside a cabinet whose front was clear plastic slats (think of what you sometimes see on a walk-in refrigerator or freezer). The original intent was to maintain a slight positive pressure with clean air, but this was never done and proved to be unnecessary.

Unfortunately, after that plenty of machines got installed with less care. The usual "failure" mode was thermal shutdown: take the cover off, blow the dust out, and it would work fine. These machines did prove more problematic, though, as the dust did some damage.

The main problems, however, came from their wires. We specced shielded cables but management went cheap on us: the building was wired with ordinary network wire, and later modifications were often made by electricians rather than computer guys. This caused a lot of interference and was probably responsible for the high failure rate of the network cards. (Really, now, a Cat-5 tossed over a 480V, 400?A main power bus??? Or even more extreme, a Y connection in a Cat-5--which actually worked, albeit with network error problems!) Don't put a computer on the same circuit as a heavy motor. Don't run any computer wires parallel to heavy power wires even if they're on separate circuits.

The initial machines were all diskless (not an option these days) and even after that everything of importance was stored on the network so if a machine did act up it could be swapped out very quickly--it took longer to carry the new machine to the station than to get it up and running in place of the problematic one.

Top lesson--don't let the electricians be anything but carefully-watched assistants when wiring things.

2

We are assuming that the problem is caused by shock or vibration. There can be several other causes, such as temperature, humidity, corrosion, chemicals, as pointed out by others.

One more approach would be to move the drives to a non-hostile place and extend the connection with cables. This may require your computers to run from external drives.

If you cannot move the external disks far away, you can still place them on a cushion (sponge-like material) to absorb shock.

If you still cannot escape the vibration or shock, look into replacing your computers with ruggedized ones. This will probably cost your company some money, but it's probably better than stopping production.

Gürkan Çetin
2

As others said, SSDs are resistant to vibration: unlike magnetic hard disk drives, they have no moving parts.

Both of these technologies are, however, vulnerable to electromagnetic fields, as others stated as well. Providing protection against that may help.

You should, however, also not discount other factors.

(A quick note: this list definitely isn't complete. Just look at the scope of the other answers - from voltage levels, to heat, to software - there is a LOT that can cause these issues. Unless you're confident with computing, you might want to hire someone to figure this out for you, because they might look at factors on-site that you didn't think of. That said, here are a few factors that you should also consider.)

  • There might be a problem with the cable, and such errors can be very subtle in that they just sporadically show up. Test the "defective" hard disks in a normal environment on a different PC, with different cables - to make sure they actually have physical damage.

  • It can be your memory as well. Unless you're using ECC memory, this can be difficult to identify. If your bits flip in memory, and that just happens to be where your program, the operating system or its drivers reside, then all bets are off. It might do nothing, it might crash, or it might just write garbage all over your disk.

  • It might not be a hardware issue at all. A software bug can also corrupt data. Having an exotic driver stack can make your system more prone to corrupting data.
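The first check in the list (does the "defective" disk really have unreadable areas?) can be roughed out in a few lines. This is a minimal sketch, assuming the disk, or an image of it, can be opened as an ordinary file path on the test PC. `scan_for_read_errors` and its parameters are hypothetical names, and a proper diagnostic tool will tell you much more.

```python
def scan_for_read_errors(path, chunk_size=1024 * 1024, max_errors=100):
    """Read `path` from start to end and note where reads fail.

    Returns (bytes_read, bad_offsets). Each entry in bad_offsets is the
    start of a chunk that raised an I/O error. Scanning stops early once
    `max_errors` failures are seen, so a vanished device cannot cause
    an endless loop.
    """
    bad_offsets = []
    bytes_read = 0
    offset = 0
    # buffering=0 keeps Python from hiding device errors behind a buffer.
    with open(path, "rb", buffering=0) as f:
        while len(bad_offsets) < max_errors:
            try:
                chunk = f.read(chunk_size)
            except OSError:
                # Record the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                f.seek(offset)
                continue
            if not chunk:
                break  # end of file or device
            bytes_read += len(chunk)
            offset += len(chunk)
    return bytes_read, bad_offsets
```

If a drive that "died" in the factory reads cleanly end to end on a normal PC with different cables, the drive itself is probably not what is broken. Reading a raw device instead of an image file usually needs administrator rights.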

Depending on what exactly the cause is (you need to determine that first!), we can recommend possible solutions. There are plenty of solutions - from isolation, to RAID, to checksumming file systems such as ZFS - but you need to determine the cause first.
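To illustrate the checksumming idea without switching file systems, here is a rough sketch that records a SHA-256 hash per file and later reports files whose contents no longer match. This is not how ZFS works internally, and the function names are made up for illustration. It only helps while the disk is still readable, for example to catch slow corruption before the drive stops booting.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root, manifest):
    """List files from `manifest` that are missing or have changed."""
    root = Path(root)
    changed = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            changed.append(rel)
    return changed
```

Build the manifest right after restoring from the backup image, store it somewhere off the machine, and compare periodically. Files that are supposed to stay constant (program binaries, for instance) showing up as changed would point to data corruption rather than a sudden mechanical failure.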

Aaa
2

In addition to the other answers: in the environment you mention, it's possible there is metallic dust in the air. When that gets into the computer, you can get electrical shorts. A sealed case (or ventilation with high-quality air filtering) can help if that's the case.

Hobbes