We had a situation at our customer's site that I'd like to understand better.

Here's what happened:

  • A library with LTO tape drives is connected to a Fibre Channel environment
  • Archiving software running on Windows Server 2008 is writing data to the tapes
  • At some point the tape was rewound without the software being aware of it, and the subsequent writes overwrote the data already on the tape
  • The situation was detected by comparing the expected position on the tape with the actual one

I don't have details about the equipment vendors.

It seems that a reset happened on the tape drive and caused the tape to rewind, but the reset was not reported as an error back to the driver and the software, so the software assumed that the write was successful.

I have read a lot of documentation to understand why this happened, but I can't draw any firm conclusions that would help the customer.

  • Can an FC HBA or switch retransmit the SCSI write on its own after a bus reset?
    • Is something like this configurable?
  • Did the FC HBA or switch ignore the reported Unit Attention?
  • Could the OS driver be to blame?
  • Is this vendor specific?

I'd be very thankful if someone could point me in the right direction.

matejk

1 Answer

This is a known problem with tape drives: they are trivially easy to rewind merely by looking sideways at the device (i.e. opening it in the wrong way, via the rewinding device, e.g. just to check status).

At least one major piece of UNIX backup software is so worried by this that it simply refuses to write to a tape a second time until that tape is ready to be erased; this is from the Amanda FAQ (which specifically mentions bus resets as a problem area):

Why does Amanda not append to a tape?

One run of Amanda = one (set of) tapes. Amanda opens the tape device once, writes all the images and filemarks, and closes the device once. Using that sequence, there is no possibility that other programs interrupt the sequence and rewind the tape, without Amanda noticing.

Doing "mt -f /dev/st0 status" could be enough, or even "amcheck daily". Also, an error like a scsi bus reset implies a rewind.

If Amanda would close and reopen the tape drive for each backup image, there is a window of vulnerability that the tape gets rewound accidentally, and the next image will overwrite all the good backups on the tape. And you wouldn't know unless you tried to restore from the tape.

When appending to a tape, there is the possibility that, between the time that Amanda positions to the last image (that already is not really trivial!), and opening the device for writing, a tape rewind happens, and in that case Amanda would happily erase ALL of the tape, containing possibly many days worth of backup.

Bacula similarly addresses the issue by never closing the tape device, so no one else can open it wrongly while a tape is loaded. But that doesn't get around the bus reset problem.
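For the bus reset case specifically, a SCSI device normally signals a reset through a Unit Attention condition on a later command. As a rough sketch of what noticing it could look like, assuming the application can see the raw fixed-format sense bytes at all (whether it can depends on the driver stack, which is not known here), the check might be written as below. The function names are hypothetical; the sense key and ASC values are the standard SCSI ones.

```python
# Minimal sketch: decide from fixed-format SCSI sense data whether the
# drive has reported a reset, in which case the tape position must be
# treated as unknown. Function names are illustrative, not a real API.

UNIT_ATTENTION = 0x06  # SCSI sense key: UNIT ATTENTION
ASC_RESET = 0x29       # ASC: power on, reset, or bus device reset occurred


def parse_fixed_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed-format sense data, else None."""
    if len(sense) < 14:
        return None
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        return None  # descriptor-format sense data is not handled in this sketch
    return sense[2] & 0x0F, sense[12], sense[13]


def position_lost(sense: bytes) -> bool:
    """True if the sense data reports a reset via Unit Attention."""
    parsed = parse_fixed_sense(sense)
    if parsed is None:
        return False
    key, asc, _ascq = parsed
    return key == UNIT_ATTENTION and asc == ASC_RESET


# Hand-built example: response code 0x70, sense key 0x06, ASC/ASCQ 0x29/0x00.
example = bytes([0x70, 0, 0x06, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x29, 0x00, 0, 0, 0, 0])
if position_lost(example):
    print("reset reported: re-verify tape position before writing")
```

If a layer below the application consumes the Unit Attention and retries the command silently, a check like this never fires, which would match the behaviour described in the question.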

Essentially, this is a problem, and it's a hard one. I might argue that your backup hardware should be sufficiently rock-solid that these resets don't happen often; if FC seems particularly prone to them, it's time to get a SAS tape drive instead, or at least to attach the tape device directly to the backup server in order to remove fibre switches etc. from the path. Other than that, I can't see how you can do much more than you have, since you caught the problem before the usual point, i.e. "our restores don't work, we're screwed".
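The position comparison that caught the problem in the question can be sketched as a guard in front of every write. This is only an illustration against an in-memory stand-in: `FakeTape`, `guarded_write` and `PositionMismatch` are hypothetical names, and on real hardware the position would have to come from the drive itself (for example via the SCSI READ POSITION command).

```python
# Minimal sketch: refuse to write when the drive's reported position
# differs from where the software believes it is. FakeTape is an
# in-memory stand-in, not a real tape API.


class PositionMismatch(Exception):
    """The drive is not where the software expects it to be."""


class FakeTape:
    def __init__(self):
        self.blocks = []
        self.position = 0

    def read_position(self):
        return self.position

    def write_block(self, data):
        # As on real tape, writing discards everything past the write point.
        del self.blocks[self.position:]
        self.blocks.append(data)
        self.position += 1

    def rewind(self):
        self.position = 0


def guarded_write(tape, expected_position, data):
    """Write one block only if the drive is at expected_position.

    Returns the next expected position.
    """
    actual = tape.read_position()
    if actual != expected_position:
        raise PositionMismatch(
            f"expected block {expected_position}, drive reports {actual}"
        )
    tape.write_block(data)
    return expected_position + 1


tape = FakeTape()
pos = 0
pos = guarded_write(tape, pos, b"image-1")
pos = guarded_write(tape, pos, b"image-2")

tape.rewind()  # simulates an unnoticed rewind, e.g. after a reset
try:
    guarded_write(tape, pos, b"image-3")
except PositionMismatch as exc:
    print("write refused:", exc)
```

A check like this narrows the window but does not close it: a rewind between the position query and the write would still go unnoticed.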

MadHatter