
The controller I'm presently working with is quite old, the HP Smart Array P400; in part I want to know how to deal with that controller, but I'm also interested in the general perspective -- if there are other/newer controllers that handle this better, how do they handle it? I'm looking ideally for OS-neutral solutions, but if that doesn't work, it's running VMware ESXi.

There are basically two settings for surface scan on this controller: high, or idle with a configurable delay in seconds.
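(For reference, the setting can be inspected and changed from the CLI. This is just a sketch, assuming HP's ssacli tool and a controller in slot 0 -- on a P400-era box the tool may be the older hpacucli/hpssacli with the same syntax:)

    # Show the current surface scan mode and delay for the controller
    # ("slot=0" is an assumption; "ssacli controller all show" lists the slots)
    ssacli controller slot=0 show detail | grep -i surface

    # The two available modes:
    ssacli controller slot=0 modify surfacescanmode=high
    ssacli controller slot=0 modify surfacescanmode=idle surfacescandelay=3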

For years it's been on idle with a 3 second delay. (Not sure why, this was probably the default.) However, I recently got concerned that this means it basically never runs the surface scan, since even during periods of very little actual use, ESXi sends "heartbeat" I/O more frequently than that, and most of the guest OSes also send little blips of one kind or another during idle time.

Figuring it's a bad idea to effectively have the controller never do a surface scan, I picked the only other option, "high".

There might be some kind of performance penalty here, but this array's workload is just system disks for the VMs, not data disks (I use ZFS on a plain HBA for that), so nobody's noticed thus far.

My concern is that now the drives never stop, period. I've had this setting for several days, and over those days there have been enough idle periods that the controller could probably have completed a full scan by now. I can do a ZFS scrub on a pool seven times larger, on lower-RPM drives, in less time. I've peeked at the server a number of times during idle periods, and not once have I seen it without the disk lights dancing around like a music video.

It seems like it has the scan on an infinite loop, without any kind of delay in between scans. Am I correct here?

This to me seems kind of ridiculous. I would have hoped that once the controller managed to get through a scan, it would stop for a few days at least before starting the next one. I really doubt sectors degrade quickly enough to justify constant scanning.

I'm worried that this is going to kill off drives way faster. These are 2.5" 10k SAS disks, 300GB and 600GB, in RAID 1+0. Is this a valid concern? I'm guessing this setting has increased total daily disk activity by at least ten times.

Now, disks constantly spin regardless of access, heads don't actually touch platters, and the actuator is moved by a contactless electromagnetic system. So I think the only big difference in wear-out would be on the actuator axis bearing, when the disk seeks. In principle that sounds pretty minor, but in practice it does seem that lots of seeks wear drives out faster.

I imagine this scan is accessing sectors sequentially, which, in and of itself, wouldn't involve tons of actuator movement. However, if the scan is being frequently interrupted by little idle accesses that need the heads to be somewhere else, that could amplify the back-and-forth significantly.

(I should perhaps look at migrating to SSDs, but in any case I don't want to kill off the magnetic disks already installed.)

To summarize, my questions are:

  • Is it actually going to scan continually?

  • Is there some way to make this scanning periodic instead of continuous? (If not on this controller, even on any different ones?)

  • Should I actually be worried about this wearing out the disks?

Deep Thought

5 Answers


Geez... That's a lot of effort.

Disks are consumable. If one fails, let it fail.
The HP SmartArray will tell you and you can replace the drive as intended.
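If you want to check on demand rather than wait for the alarm, a quick sketch (assuming the ssacli CLI and a controller in slot 0):

    # Show the state of every physical drive behind the controller;
    # failed and predictive-failure drives are flagged here.
    ssacli controller slot=0 physicaldrive all show status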

Replacement disks are cheap for that era of server (2007-2009), so you shouldn't overthink how these background processes work.

ewwhite

I would not use the high setting for extended periods because it can impact I/O performance.

From the HP Smart Array manual:

SurfaceScanMode

This parameter specifies the Surface Scan Mode with the following values: High—The surface scan enters a mode guaranteed to make progress despite the level of controller I/O.

In other words, the controller will not prioritize real I/O over the scan/scrub I/O. I suggest leaving the default medium setting: if your application accesses the disks constantly, it probably needs that performance.

If bit rot worries you, surface scan can sporadically be set to high (e.g., during one weekend each month), but, as suggested by others, I would not bother changing the default setting.
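If you want to automate that weekend-high idea, cron entries along these lines would do it. This is only a sketch: the ssacli path, controller slot, and first-weekend timing are assumptions, and on ESXi itself you would need the busybox crond rather than /etc/crontab:

    # /etc/crontab sketch (path, slot number, and schedule are assumptions).
    # First Friday of the month, 22:00: let the scan run at high priority.
    0 22 * * 5  root  [ "$(date +\%d)" -le 7 ] && /usr/sbin/ssacli controller slot=0 modify surfacescanmode=high
    # Every Monday, 06:00: drop back to idle with the 3-second delay (idempotent).
    0 6  * * 1  root  /usr/sbin/ssacli controller slot=0 modify surfacescanmode=idle surfacescandelay=3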

shodanshok

I don't have a proper answer, but I did dig up some info on this. HPE RAID controllers can be managed using HP Smart Storage Administrator (SSA). An older version of the Smart Array / Smart Storage Administrator manual, in the CLI utility section, lists an undocumented, unexplained setting called 'surfacescanschedule' next to the mode/delay settings that are present in the UI. It is gone from recent versions of the manual.

Additionally, if you go to HP SSA > Diagnose > View Diagnostic Report, it will show a very long and verbose listing of internal values for all storage components. The Smart Array RAID controller has a whole section called 'Surface Status'. It is difficult to comprehend and, as far as I know, there is no documentation for it, but some fields stand out:

  • Surface Analysis Pass Count - On a two-year-old, mostly idle server with 2 arrays and a 3-second delay, it shows 53 and 52. On a 13-year-old server with 2 arrays and a 15-second delay, it shows 826 and 457. Strange. Maybe because of different array sizes?

  • Surface Scan Period - 3600 on the new server, not present on the old server. No idea what it actually does.

My idea would be to either ask HPE support for clarification, or compare how fast the 'pass count' goes up in idle mode vs. high mode (see the sketch below). And regarding your concern: if leaving it on idle with a 3-second delay still makes the number go up over time, then it is still doing the checking despite what the idle timer suggests.
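A rough sketch of that comparison (the diag syntax differs between ssacli versions, and the field name is taken from the report above, so verify both with 'ssacli help diag'):

    # Dump the controller's diagnostic report once a day and log the pass
    # counters; compare how fast they climb in idle mode vs. high mode.
    ssacli controller slot=0 diag file=/tmp/adu-$(date +%F).zip
    unzip -p /tmp/adu-$(date +%F).zip | grep -a "Surface Analysis Pass Count"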


I had trouble finding a command for this, so here it is in case anybody else finds it hard:

    [bucko:~]# ssacli help SurfaceScanMode

    The following documentation pertains to your search:

    <target> modify [surfacescanmode=disable|idle|high|?]
       Sets the surface scan mode for the controller. Can be disable, high, or
       idle. If idle is specified, a surface scan delay value must also be
       specified. The target can be any valid controller.

       Examples: controller slot=1 modify surfacescanmode=high
                 controller slot=1 modify surfacescanmode=disable
                 controller slot=1 modify surfacescanmode=idle surfacescandelay=3
                 controller slot=1 modify parallelsurfacescancount=1

    [bucko:~]# ssacli controller slot=0 modify surfacescanmode=high
    [bucko:~]# ssacli controller slot=0 modify parallelsurfacescancount=1

    Error: Parallel surface scan is not supported. (oh, well)

    [bucko:~]# ssacli controller slot=0 modify surfacescandelay=?

    Available options are: 0 = disabled | [1..30] secs (current: 0, default: 3)

The point is: I've replaced all the physical drives (moved the extents with LVM to a different physical drive and then back), but this error/warning is persistent...

"Warning: Unrecoverable Media Errors Detected on Drives during previous Rebuild or Background Surface Analysis (ARM) scan. Errors will be fixed automatically when the sector(s) are overwritten. Backup and Restore are recommended."

I'll just do a dd to fill empty space with zeroes and see if that helps (it should).
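In case anyone wants to do the same, a sketch of the zero-fill; /mnt/vol is a placeholder for wherever the logical volume is mounted:

    # Write zeroes over all free space, flush to disk, then delete the file.
    # Overwriting the bad sectors lets the controller clear the media errors.
    dd if=/dev/zero of=/mnt/vol/zerofill bs=1M   # ends with "No space left on device"
    sync
    rm /mnt/vol/zerofill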

Addendum: Filling the empty space with zeroes blew the error away! It works! The warning is gone! :D

Koco

Even though this is an old thread, I must respond.

We had several issues because of this setting. We had it set to the default "idle 3 seconds", and because of that (we are running VMware hosts with direct-attached disks) a surface scan was never performed, due to the constant disk activity.

After some maintenance where I did a firmware upgrade and left the hosts in maintenance mode, disks started to fail on several of our hosts. Luckily it was only one disk per host on the RAID 5 systems, and one disk on the RAID 1/RAID 10 setups. We opened a ticket with HPE about this, and they advised us to set this setting to HIGH instead of IDLE; otherwise we would not detect drive issues and might lose data. We are running SAS disks. Maybe SATA (with S.M.A.R.T.) would behave differently.

TaZ