125

Background: physical server, about two years old, 7200-RPM SATA drives connected to a 3Ware RAID card, ext3 FS mounted noatime and data=ordered, not under crazy load, kernel 2.6.18-92.1.22.el5, uptime 545 days. Directory doesn't contain any subdirectories, just millions of small (~100 byte) files, with some larger (a few KB) ones.

We have a server that has gone a bit cuckoo over the course of the last few months, but we only noticed it the other day when it started being unable to write to a directory due to it containing too many files. Specifically, it started throwing this error in /var/log/messages:

ext3_dx_add_entry: Directory index full!

The disk in question has plenty of inodes remaining:

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda3            60719104 3465660 57253444    6% /

So I'm guessing that means we hit the limit of how many entries can be in the directory file itself. No idea how many files that would be, but it can't be more, as you can see, than three million or so. Not that that's good, mind you! But that's part one of my question: exactly what is that upper limit? Is it tunable? Before I get yelled at—I want to tune it down; this enormous directory caused all sorts of issues.

Anyway, we tracked down the issue in the code that was generating all of those files, and we've corrected it. Now I'm stuck with deleting the directory.

A few options here:

  1. rm -rf (dir)

    I tried this first. I gave up and killed it after it had run for a day and a half without any discernible impact.

  2. unlink(2) on the directory: Definitely worth consideration, but the question is whether it'd be faster to delete the files inside the directory via fsck than to delete via unlink(2). That is, one way or another, I've got to mark those inodes as unused. This assumes, of course, that I can tell fsck not to drop entries to the files in /lost+found; otherwise, I've just moved my problem. In addition to all the other concerns, after reading about this a bit more, it turns out I'd probably have to call some internal FS functions, as none of the unlink(2) variants I can find would allow me to just blithely delete a directory with entries in it. Pooh.
  3. while [ true ]; do ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null; done

    This is actually the shortened version; the real one I'm running, which just adds some progress-reporting and a clean stop when we run out of files to delete, is:

    export i=0;
    time ( while [ true ]; do
      ls -Uf | head -n 3 | grep -qF '.png' || break;
      ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null;
      export i=$(($i+10000));
      echo "$i...";
    done )

    This seems to be working rather well. As I write this, it has deleted 260,000 files in the past thirty minutes or so.

Now, for the questions:
  1. As mentioned above, is the per-directory entry limit tunable?
  2. Why did it take "real 7m9.561s / user 0m0.001s / sys 0m0.001s" to delete a single file which was the first one in the list returned by ls -U, and it took perhaps ten minutes to delete the first 10,000 entries with the command in #3, but now it's hauling along quite happily? For that matter, it deleted 260,000 in about thirty minutes, but it's now taken another fifteen minutes to delete 60,000 more. Why the huge swings in speed?
  3. Is there a better way to do this sort of thing? Not store millions of files in a directory; I know that's silly, and it wouldn't have happened on my watch. Googling the problem and looking through SF and SO offers a lot of variations on find that are not going to be significantly faster than my approach for several self-evident reasons. But does the delete-via-fsck idea have any legs? Or something else entirely? I'm eager to hear out-of-the-box (or inside-the-not-well-known-box) thinking.
Thanks for reading the small novel; feel free to ask questions and I'll be sure to respond. I'll also update the question with the final number of files and how long the delete script ran once I have that.

Final script output!:

2970000...
2980000...
2990000...
3000000...
3010000...

real    253m59.331s
user    0m6.061s
sys     5m4.019s

So, three million files deleted in a bit over four hours.

BMDan
  • 7,379

25 Answers

107

Update August 2021

This answer continues to attract a lot of attention, and I feel it's so woefully out of date that it's largely redundant now.

Doing a find ... -delete is most likely going to produce acceptable results in terms of performance.

The one area I felt might result in higher performance is tackling the 'removing' part of the problem instead of the 'listing' part.

I tried it and it didn't work. But I felt it was useful to explain what I did and why.

In today's newer kernels, through the io_uring subsystem (see man 2 io_uring_setup), it is actually possible to perform unlinks asynchronously -- meaning we can submit unlink requests without waiting or blocking to see the result.

This program basically reads a directory, submits hundreds of unlinks without waiting for the result, then reaps the results later once the system is done handling the request.

It tries to do what dentls did but uses IO uring. Can be compiled with gcc -o dentls2 dentls2.c -luring.

#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <errno.h>
#include <sched.h>

#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>

#include <linux/io_uring.h>
#include <liburing.h>

/* Try to keep the queue size to under two pages as internally its stored in
 * the kernel as contiguously ordered pages. Basically the bigger you make it
 * the higher order it becomes and the less likely you'll have the contiguous
 * pages to support it, despite not hitting any user limits.
 * This reduces an ENOMEM here by keeping the queue size as order 1.
 * Ring size internally is roughly 24 bytes per entry plus overheads I haven't
 * accounted for.
 */
#define QUEUE_SIZE 256

/* Globals to manage the queue */
static volatile int pending = 0;
static volatile int total_files = 0;

/* Probes the kernel uring implementation and checks if the action is
 * supported inside the kernel */
static void probe_uring(
    struct io_uring *ring)
{
    struct io_uring_probe *pb = NULL;

    pb = io_uring_get_probe_ring(ring);

    /* Can we perform IO uring unlink in this kernel ? */
    if (!io_uring_opcode_supported(pb, IORING_OP_UNLINKAT)) {
        free(pb);
        errno = ENOTSUP;
        err(EXIT_FAILURE, "Unable to configure uring");
    }

    free(pb);
}

/* Place an unlink call for the specified file/directory on the ring */
static int submit_unlink_request(
    int dfd,
    const char *fname,
    struct io_uring *ring)
{
    struct io_uring_sqe *sqe = NULL;
    char *fname_cpy = NULL;

    /* Fetch a free submission entry off the ring */
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        /* Submission queue full */
        return 0;

    fname_cpy = strdup(fname);
    pending++;

    /* Format the unlink call for submission */
    io_uring_prep_rw(IORING_OP_UNLINKAT, sqe, dfd, fname_cpy, 0, 0);
    sqe->unlink_flags = 0;

    /* Set the data to just be the filename. Useful for debugging
     * at a later point */
    io_uring_sqe_set_data(sqe, fname_cpy);

    return 1;
}

/* Submit the pending queue, then reap the queue
 * clearing up room on the completion queue */
static void consume_queue(
    struct io_uring *ring)
{
    char *fn;
    int i = 0, bad = 0;
    int rc;
    struct io_uring_cqe **cqes = NULL;

    if (pending < 0)
        abort();

    cqes = calloc(pending, sizeof(struct io_uring_cqe *));
    if (!cqes)
        err(EXIT_FAILURE, "Cannot find memory for CQE pointers");

    /* Notify about submitted entries from the queue (this is an async call) */
    io_uring_submit(ring);

    /* We can immediately take a peek to see if we've anything completed */
    rc = io_uring_peek_batch_cqe(ring, cqes, pending);

    /* Iterate the list of completed entries. Check nothing crazy happened */
    for (i = 0; i < rc; i++) {
        /* This returns the filename we set earlier */
        fn = io_uring_cqe_get_data(cqes[i]);

        /* Check the error code of the unlink calls */
        if (cqes[i]->res < 0) {
            errno = -cqes[i]->res;
            warn("Unlinking entry %s failed", fn);
            bad++;
        }

        /* Clear up our CQE */
        free(fn);
        io_uring_cqe_seen(ring, cqes[i]);
    }

    pending -= rc + bad;
    total_files += rc - bad;
    free(cqes);
}

/* Main start */
int main(
    const int argc,
    const char **argv)
{
    struct io_uring ring = {0};
    struct stat st = {0};
    DIR *target = NULL;
    int dfd;
    struct dirent *fn;

    /* Check initial arguments passed make sense */
    if (argc < 2)
        errx(EXIT_FAILURE, "Must pass a directory to remove files from.");

    /* Check path validity */
    if (lstat(argv[1], &st) < 0)
        err(EXIT_FAILURE, "Cannot access target directory");

    if (!S_ISDIR(st.st_mode))
        errx(EXIT_FAILURE, "Path specified must be a directory");

    /* Open the directory */
    target = opendir(argv[1]);
    if (!target)
        err(EXIT_FAILURE, "Opening the directory failed");
    dfd = dirfd(target);

    /* Create the initial uring for handling the file removals */
    if (io_uring_queue_init(QUEUE_SIZE, &ring, 0) < 0)
        err(EXIT_FAILURE, "Cannot initialize URING");

    /* Check the unlink action is supported */
    probe_uring(&ring);

    /* As of writing this code, GETDENTS doesn't have URING support,
     * but checking the kernel mailing list indicates it's in progress.
     * For now, we'll just do a layman's readdir(). These days there's no
     * actual difference between it and making the getdents() call ourselves.
     */
    while ((fn = readdir(target)) != NULL) {
        if (fn->d_type != DT_REG)
            /* Pay no attention to non-files */
            continue;

        /* Add to the queue until it's full, try to consume it
         * once it's full.
         */
        while (!submit_unlink_request(dfd, fn->d_name, &ring)) {
            /* When the queue becomes full, consume queued entries */
            consume_queue(&ring);
            /* This yield is here to give the uring a chance to
             * complete pending requests */
            sched_yield();
            continue;
        }
    }

    /* Out of files in the directory to list. Just clear the queue */
    while (pending) {
        consume_queue(&ring);
        sched_yield();
    }

    printf("Total files: %d\n", total_files);

    io_uring_queue_exit(&ring);
    closedir(target);
    exit(0);
}

The results were ironically the opposite of what I suspected, but why?

TMPFS with 4 million files

$ time ./dentls2 /tmp/many
Total files: 4000000

real    0m6.459s
user    0m0.360s
sys     0m24.224s

Using find:

$ time find /tmp/many -type f -delete

real    0m9.978s
user    0m1.872s
sys     0m6.617s

BTRFS with 10 million files

$ time ./dentls2 ./many
Total files: 10000000

real    10m25.749s
user    0m2.214s
sys     16m30.865s

Using find:

$ time find ./many -type f -delete

real    7m1.328s
user    0m9.209s
sys     4m42.000s

So it looks as if batched syscalls don't make an improvement in real time. The new dentls2 spends much more time working (four times as much) only to result in worse performance: a net loss in overall efficiency and worse latency. dentls2 is simply worse.

The cause is that io_uring spawns kernel dispatcher threads to do the unlink work internally, but the directory inode being worked on can only be modified by a single writer at a time.

Basically, by using the uring we're creating lots of little threads, but only one thread is allowed to delete from the directory. We've just created a bunch of contention and eliminated the advantage of doing batched IO.

Using eBPF you can measure the unlink frequencies and watch what causes the delays.

In the case of BTRFS, it's the kernel function btrfs_commit_inode_delayed_inode that acquires the lock when unlink is called.

With dentls2

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 18       |                                        |
       512 -> 1023       : 120      |                                        |
      1024 -> 2047       : 50982    |                                        |
      2048 -> 4095       : 2569467  |********************                    |
      4096 -> 8191       : 4936402  |****************************************|
      8192 -> 16383      : 1662380  |*************                           |
     16384 -> 32767      : 656883   |*****                                   |
     32768 -> 65535      : 85409    |                                        |
     65536 -> 131071     : 21715    |                                        |
    131072 -> 262143     : 9719     |                                        |
    262144 -> 524287     : 5981     |                                        |
    524288 -> 1048575    : 857      |                                        |
   1048576 -> 2097151    : 293      |                                        |
   2097152 -> 4194303    : 220      |                                        |
   4194304 -> 8388607    : 255      |                                        |
   8388608 -> 16777215   : 153      |                                        |
  16777216 -> 33554431   : 56       |                                        |
  33554432 -> 67108863   : 6        |                                        |
  67108864 -> 134217727  : 1        |                                        |

avg = 8533 nsecs, total: 85345432173 nsecs, count: 10000918

Using find ... -delete:

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 34       |                                        |
       512 -> 1023       : 95       |                                        |
      1024 -> 2047       : 1005784  |****                                    |
      2048 -> 4095       : 8110338  |****************************************|
      4096 -> 8191       : 672119   |***                                     |
      8192 -> 16383      : 158329   |                                        |
     16384 -> 32767      : 42338    |                                        |
     32768 -> 65535      : 4667     |                                        |
     65536 -> 131071     : 3597     |                                        |
    131072 -> 262143     : 2860     |                                        |
    262144 -> 524287     : 216      |                                        |
    524288 -> 1048575    : 22       |                                        |
   1048576 -> 2097151    : 6        |                                        |
   2097152 -> 4194303    : 3        |                                        |
   4194304 -> 8388607    : 5        |                                        |
   8388608 -> 16777215   : 3        |                                        |

avg = 3258 nsecs, total: 32585481993 nsecs, count: 10000416

You can see from the histogram that find spends 3258 nanoseconds on average in btrfs_commit_inode_delayed_inode but dentls2 spends 8533 nanoseconds in the function.

The histogram also shows that, overall, the io_uring threads spend at least twice as long waiting on the lock: the majority of its calls take 4096-8191 nanoseconds, versus the majority with find taking 2048-4095 nanoseconds.

find is single-threaded and isn't contending for the lock, whereas dentls2 is effectively multi-threaded (due to the uring), which produces lock contention; the resulting delays are reflected in the analysis.

Conclusion

All in all, on modern systems (as of writing this) there is less and less you can do in software to make this go faster than it already goes.

It used to be that by reading a large buffer from the disk you could collapse many expensive IO calls into one large sequential read, instead of the seeky IO that small getdents() buffers typically ended up producing.

In addition, thanks to other improvements, system calls now carry less overhead, and sequential/random IO access times have improved enough to eliminate the big IO bottlenecks we used to experience.

On my systems, this problem has become memory/CPU bound. There's a single-accessor problem on (at least) BTRFS which limits you to a single CPU's worth of unlinks per directory at a time. Trying to batch the IOs yields at best minor improvements, even in the ideal circumstance of using tmpfs, and is typically worse on a real-world filesystem.

To top it off, we really don't have this problem anymore -- gone are the days of 10 million files taking 4 hours to remove.

Just do something simple like find ... -delete. No amount of optimization I tried seemed to yield major performance improvements worth the coding (or analysis) effort over a simple default setup.
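
For reference, a typical invocation of that might look like the following (the path is a placeholder; -maxdepth 1 keeps find from descending into subdirectories, and -type f restricts the delete to regular files):

find /srv/bigdir -maxdepth 1 -type f -delete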


Original Answer

Whilst a major cause of this problem is ext3 performance with millions of files, the actual root cause of this problem is different.

When a directory needs to be listed, readdir() is called on it, which yields a list of files. readdir is a POSIX call, but the real Linux system call being used here is getdents, which lists directory entries by filling a buffer with entries.

The problem mainly comes down to the fact that readdir() uses a fixed 32 KB buffer to fetch directory entries. As a directory gets larger and larger (its size increases as files are added), ext3 gets slower and slower to fetch entries, and readdir's 32 KB buffer is only big enough to hold a fraction of the entries in the directory. This causes readdir to loop over and over, invoking the expensive system call again and again.

For example, on a test directory I created with over 2.6 million files inside, running "ls -1 | wc -l" shows a large strace output with many getdents system calls.

$ strace ls -1 | wc -l
brk(0x4949000)                          = 0x4949000
getdents(3, /* 1025 entries */, 32768)  = 32752
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1025 entries */, 32768)  = 32760
getdents(3, /* 1025 entries */, 32768)  = 32768
brk(0)                                  = 0x4949000
brk(0x496a000)                          = 0x496a000
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1026 entries */, 32768)  = 32760
...

Additionally, the time spent just listing this directory was significant.

$ time ls -1 | wc -l
2616044

real    0m20.609s
user    0m16.241s
sys     0m3.639s

The method to make this a more efficient process is to call getdents manually with a much larger buffer. This improves performance significantly.

Now, you're not supposed to call getdents directly, so no wrapper exists for normal use (check the man page for getdents!); however, you can invoke it yourself via syscall() and make your system call usage far more efficient.

This drastically reduces the time it takes to fetch these files. I wrote a program that does this.

/* I can be compiled with the command "gcc -o dentls dentls.c" */

#define _GNU_SOURCE

#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[256];
    char           d_type;
};

static int delete = 0;
char *path = NULL;

static void parse_config(
    int argc,
    char **argv)
{
    int option_idx = 0;
    static struct option loptions[] = {
        { "delete", no_argument, &delete, 1 },
        { "help", no_argument, NULL, 'h' },
        { 0, 0, 0, 0 }
    };

    while (1) {
        int c = getopt_long(argc, argv, "h", loptions, &option_idx);
        if (c < 0)
            break;

        switch (c) {
          case 0: {
              break;
          }

          case 'h': {
              printf("Usage: %s [--delete] DIRECTORY\n"
                     "List/Delete files in DIRECTORY.\n"
                     "Example %s --delete /var/spool/postfix/deferred\n",
                     argv[0], argv[0]);
              exit(0);
              break;
          }

          default:
              break;
        }
    }

    if (optind >= argc)
        errx(EXIT_FAILURE, "Must supply a valid directory\n");

    path = argv[optind];
}

int main(
    int argc,
    char **argv)
{
    parse_config(argc, argv);

    int totalfiles = 0;
    int dirfd = -1;
    int offset = 0;
    int bufcount = 0;
    void *buffer = NULL;
    char *d_type;
    struct linux_dirent *dent = NULL;
    struct stat dstat;

    /* Standard sanity checking stuff */
    if (access(path, R_OK) < 0)
        err(EXIT_FAILURE, "Could not access directory");

    if (lstat(path, &dstat) < 0)
        err(EXIT_FAILURE, "Unable to lstat path");

    if (!S_ISDIR(dstat.st_mode))
        errx(EXIT_FAILURE, "The path %s is not a directory.\n", path);

    /* Allocate a buffer of equal size to the directory to store dents */
    if ((buffer = calloc(dstat.st_size*3, 1)) == NULL)
        err(EXIT_FAILURE, "Buffer allocation failure");

    /* Open the directory */
    if ((dirfd = open(path, O_RDONLY)) < 0)
        err(EXIT_FAILURE, "Open error");

    /* Switch directories */
    fchdir(dirfd);

    if (delete) {
        printf("Deleting files in ");
        for (int i = 5; i > 0; i--) {
            printf("%d. . . ", i);
            fflush(stdout);
            sleep(1);
        }
        printf("\n");
    }

    while ((bufcount = syscall(SYS_getdents, dirfd, buffer, dstat.st_size*3)) > 0) {
        offset = 0;
        dent = buffer;
        while (offset < bufcount) {
            /* Don't print thisdir and parent dir */
            if (!((strcmp(".", dent->d_name) == 0) || (strcmp("..", dent->d_name) == 0))) {
                d_type = (char *)dent + dent->d_reclen - 1;
                /* Only print files */
                if (*d_type == DT_REG) {
                    printf("%s\n", dent->d_name);
                    if (delete) {
                        if (unlink(dent->d_name) < 0)
                            warn("Cannot delete file \"%s\"", dent->d_name);
                    }
                    totalfiles++;
                }
            }
            offset += dent->d_reclen;
            dent = buffer + offset;
        }
    }
    fprintf(stderr, "Total files: %d\n", totalfiles);
    close(dirfd);
    free(buffer);

    exit(0);
}

Whilst this does not combat the underlying fundamental problem (lots of files in a filesystem that performs poorly at handling them), it's likely to be much, much faster than many of the alternatives being posted.

As a final thought: one should remove the affected directory and recreate it afterwards. Directories only ever increase in size and can remain poorly performing even with only a few files inside, due to the size of the directory itself.

Edit: I've cleaned this up quite a bit. I added an option to let you delete from the command line at runtime, and removed a bunch of the treewalk stuff which, honestly, looking back was questionable at best and was also shown to produce memory corruption.

You can now do dentls --delete /my/path

New results, based on a directory with 1.82 million files.

Ideal ls, uncached

$ time ls -u1 data >/dev/null

real    0m44.948s
user    0m1.737s
sys     0m22.000s

Ideal ls, cached

$ time ls -u1 data >/dev/null

real    0m46.012s
user    0m1.746s
sys     0m21.805s

dentls, uncached

$ time ./dentls data >/dev/null
Total files: 1819292

real    0m1.608s
user    0m0.059s
sys     0m0.791s

dentls, cached

$ time ./dentls data >/dev/null
Total files: 1819292

real    0m0.771s
user    0m0.057s
sys     0m0.711s

Was kind of surprised this still works so well!

Matthew Ife
  • 24,261
37

The data=writeback mount option deserves to be tried, in order to avoid journaling of file data. This should be done only for the duration of the deletion; there is a risk, however, if the server is shut down or rebooted during the delete operation.

According to this page,

Some applications show very significant speed improvement when it is used. For example, speed improvements can be seen (...) when applications create and delete large volumes of small files.

The option is set either in fstab or during the mount operation, replacing data=ordered with data=writeback. The file system containing the files to be deleted has to be remounted.
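
As a sketch (the mount point is a placeholder, and this assumes the filesystem in question can be remounted in place; some kernels refuse to change the data mode on a live remount, in which case a full unmount/mount cycle is needed):

mount -o remount,noatime,data=writeback /data    # journal metadata only during the purge
# ... run the deletion ...
mount -o remount,noatime,data=ordered /data      # restore the original mode afterwards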

Déjà vu
  • 5,778
32

Would it be possible to backup all of the other files from this file system to a temporary storage location, reformat the partition, and then restore the files?
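
A rough sketch of that workflow, assuming the affected filesystem can be taken offline and using placeholder device and paths (the huge directory itself is deliberately excluded from the backup):

rsync -aHAX --exclude='/hugedir/' /mnt/victim/ /mnt/backup/   # exclude path is relative to the source root
umount /mnt/victim
mkfs.ext3 /dev/sdXN            # destroys everything on the partition
mount /dev/sdXN /mnt/victim
rsync -aHAX /mnt/backup/ /mnt/victim/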

jftuga
  • 5,831
13

TLDR: use rsync -a --delete emptyfolder/ x.

This question has 50k views, and quite a few answers, but nobody seems to have benchmarked all the different replies. There's one link to an external benchmark, but that one's over 7 years old and didn't look at the program provided in this answer: https://serverfault.com/a/328305/565293

Part of the difficulty here is that the time it takes to remove a file depends heavily on the disks in use and the file system. In my case, I ran the tests on a consumer SSD running BTRFS on Arch Linux (updated as of 2020-03), but I got the same ordering of results on a different distribution (Ubuntu 18.04), filesystem (ZFS), and drive type (HDD in a RAID10 configuration).

Test setup was identical for each run:

# setup
mkdir test && cd test && mkdir empty
# create 800000 files in a folder called x
mkdir x && cd x
seq 800000 | xargs touch
cd ..

Test results:

rm -rf x: 30.43s

find x/ -type f -delete: 29.79s

perl -e 'for(<*>){((stat)[9]<(unlink))}': 37.97s

rsync -a --delete empty/ x: 25.11s

(The following is the program from this answer, but modified to not print anything or wait before it deletes files.)

./dentls --delete x: 29.74s

The rsync version proved to be the winner every time I repeated the test, although by a pretty low margin. The perl command was slower than any other option on my systems.

Somewhat shockingly, the program from the top answer to this question proved to be no faster on my systems than a simple rm -rf. Let's dig into why that is.

First of all, the answer claims that the problem is that rm uses readdir, which has a fixed 32 KB buffer for getdents. This proved not to be the case on my Ubuntu 18.04 system, which used a buffer four times larger. On the Arch Linux system, it was using getdents64.
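
That claim is easy to check on your own machine; one way (a throwaway scratch directory and strace assumed) is to trace rm and look at the byte counts passed to getdents/getdents64:

mkdir /tmp/scratch && cd /tmp/scratch
seq 100000 | xargs touch                    # populate a scratch directory
cd / && strace -e trace=getdents,getdents64 rm -rf /tmp/scratch 2>&1 | head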

In addition, the answer misleadingly provides statistics giving its speed at listing the files in a large directory, but not removing them (which is what the question was about). It compares dentls to ls -u1, but a simple strace reveals that getdents is not the reason why ls -u1 is slow, at least not on my system (Ubuntu 18.04 with 1000000 files in a directory):

strace -c ls -u1 x >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.00    7.177356           7   1000000           lstat
  5.96    0.454913        1857       245           getdents
[snip]

This ls command makes a million calls to lstat, which slows the program way down. The getdents calls only add up to 0.455 seconds. How long do the getdents calls take in dentls on the same folder?

strace -c ./dentls x >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.91    0.489895       40825        12           getdents
[snip]

That's right! Even though dentls only makes 12 calls instead of 245, it actually takes the system longer to run these calls. So the explanation given in that answer is actually incorrect - at least for the two systems I've been able to test this on.

The same applies to rm and dentls --delete. Whereas rm takes 0.42s calling getdents, dentls takes 0.53s. In either case, the vast majority of the time is spent calling unlink!

So in short, don't expect to see massive speedups running dentls, unless your system is like the author's and has a lot of overhead on individual getdents calls. Maybe the glibc folks have considerably sped it up in the years since the answer was written, and it now takes a linear amount of time to respond for different buffer sizes. Or maybe the response time of getdents depends on the system architecture in some way that isn't obvious.

adamf
  • 231
12

There is no per-directory file limit in ext3, just the filesystem inode limit (I think there is a limit on the number of subdirectories, though).

You may still have problems after removing the files.

When a directory has millions of files, the directory entry itself becomes very large. The directory entry has to be scanned for every remove operation, and that takes various amounts of time for each file, depending on where its entry is located. Unfortunately even after all the files have been removed the directory entry retains its size. So further operations that require scanning the directory entry will still take a long time even if the directory is now empty. The only way to solve that problem is to rename the directory, create a new one with the old name, and transfer any remaining files to the new one. Then delete the renamed one.
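
A sketch of that rename-and-recreate dance, with placeholder paths (and assuming only a handful of files remain to be carried over):

mv /srv/app/spool /srv/app/spool.old            # move the bloated directory aside
mkdir /srv/app/spool                            # fresh, minimal-size directory
chown appuser: /srv/app/spool                   # restore ownership/permissions as needed
mv /srv/app/spool.old/<remaining files> /srv/app/spool/
rm -rf /srv/app/spool.old                       # delete the renamed directory at leisure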

12

I haven't benchmarked it, but this guy did:

rsync -a --delete ./emptyDirectory/ ./hugeDirectory/
Qtax
  • 53
Alix Axel
  • 2,843
4

find simply did not work for me, even after changing the ext3 fs's parameters as suggested by the users above. It consumed way too much memory. This PHP script did the trick: fast, insignificant CPU usage, insignificant memory usage:

<?php
$dir = '/directory/in/question';
if ($dh = opendir($dir)) {
    while (($file = readdir($dh)) !== false) {
        unlink($dir . '/' . $file);
    }
    closedir($dh);
}
?>

I posted a bug report regarding this trouble with find: http://savannah.gnu.org/bugs/?31961

Alexandre
  • 151
3

Make sure you do:

mount -o remount,rw,noatime,nodiratime /mountpoint

which should speed things up a bit as well.

karmawhore
  • 3,925
3

I recently faced a similar issue and was unable to get ring0's data=writeback suggestion to work (possibly due to the fact that the files are on my main partition). While researching workarounds I stumbled upon this:

tune2fs -O ^has_journal <device>

This will turn off journaling completely, regardless of the data option given to mount. I combined this with noatime, and the volume had dir_index set; it seemed to work pretty well. The delete actually finished without me needing to kill it, my system remained responsive, and it's now back up and running (with journaling back on) with no issues.
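
A sketch of the full cycle with placeholder device and mount point; tune2fs refuses to drop the journal while the filesystem is mounted read-write, and an fsck before re-adding it is prudent:

umount /mountpoint
tune2fs -O ^has_journal /dev/sdXN     # filesystem behaves like ext2 from here on
mount -o noatime /dev/sdXN /mountpoint
# ... delete the files ...
umount /mountpoint
e2fsck -f /dev/sdXN                   # check before giving the journal back
tune2fs -j /dev/sdXN                  # re-create the journal (same as -O has_journal)
mount /dev/sdXN /mountpoint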

3

A couple of years back, I found a directory with 16 million XML files in the / filesystem. Due to the criticality of the server, we used the following command, which took about 30 hours to finish:

perl -e 'for(<*>){((stat)[9]<(unlink))}'

It was an old 7200 rpm hdd, and despite the IO bottleneck and CPU spikes, the old webserver continued its service.

2

ls is a very slow command. Try:

find /dir_to_delete ! -iname "*.png" -type f -delete
bindbn
  • 5,321
2

Obviously not apples to apples here, but I set up a little test and did the following:

Created 100,000 512-byte files in a directory (dd and /dev/urandom in a loop); forgot to time it, but it took roughly 15 minutes to create those files.

Ran the following to delete said files:

ls -1 | wc -l && time find . -type f -delete

100000

real    0m4.208s
user    0m0.270s
sys     0m3.930s 

This is a Pentium 4 2.8GHz box (couple hundred GB IDE 7200 RPM I think; EXT3). Kernel 2.6.27.

gravyface
  • 13,987
2

Is dir_index set for the filesystem? (tune2fs -l | grep dir_index) If not, enable it. It's usually on for new RHEL.
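
With a placeholder device, the check and the fix look roughly like this; note that turning the feature on only affects directories created afterwards unless you also rebuild the indexes offline with e2fsck -D (see the answer further down):

tune2fs -l /dev/sdXN | grep dir_index     # is the feature enabled?
tune2fs -O dir_index /dev/sdXN            # enable it if not
e2fsck -fD /dev/sdXN                      # offline: rebuild/optimize existing directories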

sam
  • 21
1

My preferred option is the newfs approach, already suggested. The basic problem is, again as already noted, the linear scan to handle deletion is problematic.

rm -rf should be near optimal for a local filesystem (NFS would be different). But at millions of files, with 36 bytes per filename and 4 per inode (a guess, not having checked the value for ext3), that's 40 bytes times millions to be kept in RAM just for the directory.

At a guess, you're thrashing the filesystem metadata cache memory in Linux, so that blocks for one page of the directory file are being expunged while you're still using another part, only to hit that page of the cache again when the next file is deleted. Linux performance tuning isn't my area, but /proc/sys/{vm,fs}/ probably contain something relevant.
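
One knob in that area that may be relevant is vm.vfs_cache_pressure, which biases the kernel toward keeping or dropping dentry/inode caches relative to the page cache; whether it actually helps this workload is a guess on my part:

sysctl vm.vfs_cache_pressure          # default is typically 100
sysctl -w vm.vfs_cache_pressure=50    # values below 100 favour keeping dentries/inodes cached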

If you can afford downtime, you might consider turning on the dir_index feature. It switches the directory index from linear to something far more optimal for deletion in large directories (hashed b-trees). tune2fs -O dir_index ... followed by e2fsck -D would work. However, while I'm confident this would help before there are problems, I don't know how the conversion (e2fsck with the -D) performs when dealing with an existing v.large directory. Backups + suck-it-and-see.

Phil P
  • 3,110
1

Sometimes Perl can work wonders in cases like this. Have you already tried if a small script such as this could outperform bash and the basic shell commands?

#!/usr/bin/perl
opendir(ANNOYINGDIR, "/path/to/your/directory") or die "Cannot open directory: $!";
@files = grep(/\.png$/, readdir(ANNOYINGDIR));
closedir(ANNOYINGDIR);

for (@files) {
    printf "Deleting %s\n", $_;
    unlink "/path/to/your/directory/$_";
}

Or another, perhaps even faster, Perl approach:

#!/usr/bin/perl
unlink(glob("/path/to/your/directory/*.png")) or die("Could not delete files, this happened: $!");

EDIT: I just gave my Perl scripts a try. The more verbose one does something right. In my case I tried this on a virtual server with 256 MB of RAM and half a million files.

time find /test/directory | xargs rm results:

real    2m27.631s
user    0m1.088s
sys     0m13.229s

compared to

time perl -e 'opendir(FOO,"./"); @files = readdir(FOO); closedir(FOO); for (@files) { unlink $_; }'

real    0m59.042s
user    0m0.888s
sys     0m18.737s
1

From what I remember the deletion of inodes in ext filesystems is O(n^2), so the more files you delete the faster the rest will go.

There was one time I was faced with a similar problem (though my estimates suggested ~7 h of deletion time); in the end I went the route jftuga suggested in the first comment.

Alicja Kario
  • 6,449
0

I've written a tool specifically designed to delete directories as fast as possible called rmz: https://github.com/SUPERCILEX/fuc/blob/master/rmz/README.md

Extensive benchmarks are included which compare a bunch of different options: https://github.com/SUPERCILEX/fuc/tree/master/comparisons#remove

I've also gotten real world feedback that rmz generally lives up to its performance goals.

0

I would probably have whipped out a C compiler and done the moral equivalent of your script. That is, use opendir(3) to get a directory handle, then use readdir(3) to get the name of files, then tally up files as I unlink them and once in a while print "%d files deleted" (and possibly elapsed time or current time stamp).

I don't expect it to be noticeably faster than the shell-script version; it's just that I'm used to having to pull out the compiler now and again, either because there's no clean way of doing what I want from the shell or because, while doable in shell, it's unproductively slow that way.

Vatine
  • 5,560
0

You are likely running into rewrite issues with the directory. Try deleting the newest files first. Look at mount options that will defer writeback to disk.

For a progress bar try running something like rm -rv /mystuff 2>&1 | pv -brtl > /dev/null

BillThor
  • 28,293
  • 3
  • 39
  • 70
0

Well, this is not a real answer, but...

Would it be possible to convert the filesystem to ext4 and see if things change?

marcoc
  • 748
0

Alright, this has been covered in various ways in the rest of the thread, but I thought I would throw in my two cents. The performance culprit in your case is probably readdir. You get back a list of files that are not necessarily in any way sequential on disk, which causes disk access all over the place when you unlink. The files are small enough that the unlink operation probably doesn't jump around too much zeroing out the space. If you readdir and then sort by ascending inode, you will probably get better performance. So: readdir into RAM (sort by inode) -> unlink -> profit.

Inode order is a rough approximation here, I think, but based on your use case it might be fairly accurate.
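
A rough shell sketch of that idea, assuming GNU find and filenames without whitespace (true for the .png files here): list the entries with their inode numbers, sort numerically, then unlink in that order.

cd /path/to/hugedir &&
find . -maxdepth 1 -type f -printf '%i %f\n' \
    | sort -n \
    | awk '{print $2}' \
    | xargs -n 1000 rm -f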

MattyB
  • 1,013
0

Here is how I delete the millions of trace files that can sometimes gather on a large Oracle database server:

for i in /u*/app/*/diag/*/*/*/trace/*.tr? ; do rm $i; echo -n . ;  done

I find that this results in a fairly slow deletion that has low impact on server performance, usually something along the lines of an hour per million files on a "typical" 10,000 IOPS setup.

It will often take several minutes before the directories have been scanned, the initial file list generated and the first file is deleted. From there and on, a . is echoed for every file deleted.

The delay caused by echoing to the terminal has proven enough of a delay to prevent any significant load while deletion is progressing.

Roy
  • 4,596
-1

You could use 'xargs' parallelization features:

ls -1|xargs -P nb_concurrent_jobs -n nb_files_by_job rm -rf
Jeremy
  • 241
-2
ls|cut -c -4|sort|uniq|awk '{ print "rm -rf " $1 }' | sh -x
karmawhore
  • 3,925
-2

Actually, this one is a little better if the shell you use does command-line expansion:

ls|cut -c -4|sort|uniq|awk '{ print "echo " $1 ";rm -rf " $1 "*"}' |sh