
I have a huge number of files (mostly documents, ~80-90% PDFs, but also images, videos, web pages, audio, etc.): around 3.8 million files occupying ~7.8 TB of hard drive space on a 10 TB HDD.

I have tried a lot of duplicate-file removal software from the internet (for both Windows and Linux), but in vain.

Most of them take days to complete, some cannot finish because they run out of memory, some crash just when they are about to finish, and others never seem to complete at all.

So I decided to write my own C++ program that compiles and runs fine on both Linux and Windows, but there is a problem: it also takes a lot of time and never seems to finish. It works very well on a small number of files.

I am writing this topic because maybe there is something I could improve, optimize, or remove from my algorithm to make it much faster, while still being safe against collisions, that is, never deleting two different files because they were mistaken for duplicates.

Here is the algorithm (a simplified sketch of the first two steps is shown right after the list):

  • First of all, it lists all files recursively under the given path and groups them by size in a dictionary, where the key is the file size and the value is the set of file paths having that size.

  • Second, it removes the keys whose set contains just a single path, because such a file is unique.

  • Next, it computes the MD5 hash (which is faster) of the first 1024 bytes of each remaining file and stores the results in another dictionary, where the key is the pair (size, MD5 hash) and the value is the set of file paths that share that size and hash.

  • Next, it removes the keys that have only a single path in their set, as those files are unique.

  • Next, it computes the full SHA3-512 hash (using OpenSSL) of each entire file and stores the results in another dictionary.

  • Next, it once again removes the keys that have only a single path, as they are unique.

  • Finally, it proceeds to remove the duplicates found in the dictionary.
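
To make this concrete, here is a simplified, single-threaded sketch of the first two steps (group by size, then drop the unique sizes); the real program is multithreaded and the names here are only illustrative:

#include <cstdint>
#include <filesystem>
#include <iterator>
#include <set>
#include <string>
#include <system_error>
#include <unordered_map>

namespace fs = std::filesystem;

// Step 1: key = file size in bytes, value = set of paths having that size.
std::unordered_map<std::uintmax_t, std::set<std::string>>
group_by_size(const fs::path& root) {
    std::unordered_map<std::uintmax_t, std::set<std::string>> by_size;
    for (const auto& entry : fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied)) {
        std::error_code ec;
        if (!entry.is_regular_file(ec) || ec) continue;
        const auto size = entry.file_size(ec);
        if (!ec) by_size[size].insert(entry.path().string());
    }
    return by_size;
}

// Step 2: drop the sizes that occur only once: those files are unique.
void drop_unique_sizes(
    std::unordered_map<std::uintmax_t, std::set<std::string>>& by_size) {
    for (auto it = by_size.begin(); it != by_size.end(); )
        it = (it->second.size() < 2) ? by_size.erase(it) : std::next(it);
}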

Everything is optimized as far as I can tell, and everything is done using multithreading, but even so it takes a huge amount of time and never seems to complete.

What should I do to optimize it further?

YoYoYo

4 Answers

10

Disk access (opening files and reading their contents) is slow. You probably can't get around that, so however you look at it, this program is probably going to be running for hours if not days.

Given that, and given that you are writing it yourself for one-off use, you should write your program so that it can fail, be fixed, and then continue running from where it left off.

Instead of using dictionaries, write to a database.

Split it into separate programs or steps: list the files, add the MD5s, add the SHA3s, mark the duplicates, delete them. Each step goes back to the DB, so you can see how many files you are going to delete, how many share a hash, etc., and spot-check that your code is working correctly.

Sample DB table:

Files
- fullFileNameAndPath
- fileName
- md5
- sha3
- duplicate
- deleted

Pass 1

foreach(var file in directory) {
    if(file is a folder) { recurse }
    db.Files.Add(file.FullName, file.Path);
}

Pass 2

foreach(var file in db.GetNext100Files()) {
    var filedata = Disk.Open(file.fullFileName);
    var md5 = genMd5(filedata);
    db.UpdateFile(file.fullFileName, md5);
}

Pass 3

foreach(var file in db.GetFilesWithMatchingMD5()) {
    var filedata = Disk.Open(file.fullFileName);
    var sha3 = genSha3(filedata);
    db.UpdateFile(file.fullFileName, sha3);
}

Pass 4

foreach(var fileCollection in db.GetFilesWithMatchingMD5AndSha3()) {
    // pick one to keep
    foreach(var file in fileCollection.Except(oneToKeep)) {
        db.UpdateFile(file.fullFileName, duplicateIsTrue);
    }
}

Pass 5

foreach(var file in db.GetNext100Duplicates()) {
    Disk.Delete(file.fullFileName);
    db.UpdateFile(file.fullFileName, deleted);
}
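
For concreteness, a rough C++ sketch of what Pass 1 could look like with SQLite (this assumes the sqlite3 C API, adds a size column to the sample table above, and is an illustration rather than a drop-in implementation):

#include <cstdio>
#include <filesystem>
#include <string>
#include <system_error>
#include <sqlite3.h>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
    if (argc < 3) { std::fprintf(stderr, "usage: %s <db> <root>\n", argv[0]); return 1; }

    sqlite3* db = nullptr;
    if (sqlite3_open(argv[1], &db) != SQLITE_OK) return 1;

    // Same columns as the sample table, plus a size column for the grouping step.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS Files ("
        "  fullFileNameAndPath TEXT PRIMARY KEY, fileName TEXT, size INTEGER,"
        "  md5 TEXT, sha3 TEXT, duplicate INTEGER DEFAULT 0, deleted INTEGER DEFAULT 0);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR IGNORE INTO Files (fullFileNameAndPath, fileName, size) VALUES (?,?,?);",
        -1, &ins, nullptr);

    // Batch the inserts in a transaction; commit every N files instead if you
    // want to be able to resume after a crash.
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    for (const auto& e : fs::recursive_directory_iterator(
             argv[2], fs::directory_options::skip_permission_denied)) {
        std::error_code ec;
        if (!e.is_regular_file(ec) || ec) continue;
        const auto size = e.file_size(ec);
        if (ec) continue;
        const std::string full = e.path().string();
        const std::string name = e.path().filename().string();
        sqlite3_bind_text(ins, 1, full.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(ins, 2, name.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_int64(ins, 3, static_cast<sqlite3_int64>(size));
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    sqlite3_finalize(ins);
    sqlite3_close(db);
    return 0;
}

With everything in one table you can also answer the "how many files would I delete?" question with a plain SELECT COUNT(*) before anything is actually removed.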

Ewan
3

When you write a big, hopefully perfect, program, run it, and find that it takes too long, it is time to break the problem into pieces to see what is taking the time.

Start by recursively scanning all the files in the given path, retrieving the file size of each and doing nothing at all with it.

Edit your question to tell us how long this takes. Our suggestions will depend a lot on what the answer is.

Also give us an idea of how big the largest file is. That will have a strong influence on what algorithms might be suggested.

One final point (until you provide the information asked for): when you are scanning a directory and find a subdirectory entry, do not scan it right away. Simply add it to a list of “subdirectories that will need to be scanned” and carry on looking through the original directory. When you have finished looking through it, and only then, take the “will need to be scanned” list and scan the subdirectories named in it. The reason for doing it like this is that it avoids a lot of random seeking all over the disk.
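
A sketch of that scanning order, assuming std::filesystem (the helper name and the callback are mine, purely for illustration; symlinked directories are skipped for brevity):

#include <deque>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// List one directory completely, only queueing its subdirectories;
// the queued subdirectories are scanned afterwards, one at a time.
template <class OnFile>
void scan_one_directory_at_a_time(const fs::path& root, OnFile on_file) {
    std::deque<fs::path> pending{root};
    while (!pending.empty()) {
        const fs::path dir = std::move(pending.front());
        pending.pop_front();
        std::error_code ec;
        for (fs::directory_iterator it(dir, fs::directory_options::skip_permission_denied, ec), end;
             !ec && it != end; it.increment(ec)) {
            std::error_code tec;
            if (it->is_symlink(tec))            continue;                       // skip symlinks
            if (it->is_directory(tec))          pending.push_back(it->path());  // scan later
            else if (it->is_regular_file(tec))  on_file(*it);                   // e.g. record its size
        }
    }
}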

And one more point, depending on the results of your “pure scan” timings: do not waste time creating maps and filling them with file names the first time round. Your very first pass should simply note which file sizes occur more than once. Then you will scan the directories again and do something akin to what you have described… but only on files whose size is in the “more than once” list.
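
And a sketch of that memory-light first pass (again assuming std::filesystem); the second scan then only collects paths whose size appears in the returned set:

#include <cstdint>
#include <filesystem>
#include <system_error>
#include <unordered_map>
#include <unordered_set>

namespace fs = std::filesystem;

// First pass: only count how often each file size occurs; no paths are stored.
std::unordered_set<std::uintmax_t> sizes_seen_more_than_once(const fs::path& root) {
    std::unordered_map<std::uintmax_t, std::uint32_t> count;
    for (const auto& e : fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied)) {
        std::error_code ec;
        if (!e.is_regular_file(ec) || ec) continue;
        const auto size = e.file_size(ec);
        if (!ec) ++count[size];
    }
    std::unordered_set<std::uintmax_t> repeated;
    for (const auto& [size, n] : count)
        if (n > 1) repeated.insert(size);
    return repeated;   // second scan: only bother with files whose size is in here
}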

2

I would start with two (or three) maps/tables (a dedicated DB might help, maybe):

One for small files (keyed by their full content), optionally one with hard-link info for the candidate bigger files, and one for the bigger files (just their sizes).

The small/big threshold should be small enough that the file is likely to be resident (stored inline in the file system's metadata, for file systems supporting that), but at least as large as the prefix your first hash reads. Unless of course you want to ignore small files.

Next, handle the duplicate small files, handle the hard-link info, and finally drop the info about the known non-duplicate bigger files, leaving leaner data structures and saving memory.

Drill down into one cluster of candidate duplicate files after the other; that way you don't need to keep the "known duplicate" marker around, again keeping (or making) the data structures lean.

If you want to multi-thread, remember that one thread is probably plenty for hashing the data from one disk; you just need to always keep a read request in flight, using asynchronous I/O or a dedicated paired reader thread. More independent requests mostly add overhead; seeking on spinning disks in particular is costly. Also, multiple disks might sit behind a common link, which becomes a bottleneck if they are fast enough or numerous enough.
A dedicated thread for collecting the results might help, maybe, if you don't use a database which already does that for you. At least it means the data bounces between CPUs less.
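
A sketch of the "always keep one read in flight" idea for the full-file hash, here using std::async as the paired reader and OpenSSL's EVP interface for SHA3-512 (block size and names are arbitrary; a long-lived reader thread per disk would do the same job with less thread churn):

#include <cstdio>
#include <future>
#include <string>
#include <vector>
#include <openssl/evp.h>

// Hash a file with SHA3-512 while the next block is already being read:
// one reader (the async task) plus one hasher (the calling thread) per disk.
std::vector<unsigned char> sha3_512_file(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return {};

    auto read_block = [f] {                                    // the paired read
        std::vector<unsigned char> buf(1 << 20);               // 1 MiB, arbitrary
        buf.resize(std::fread(buf.data(), 1, buf.size(), f));
        return buf;
    };

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha3_512(), nullptr);

    auto pending = std::async(std::launch::async, read_block);  // first read in flight
    for (;;) {
        std::vector<unsigned char> block = pending.get();       // wait for the reader
        if (block.empty()) break;                                // EOF (or read error)
        pending = std::async(std::launch::async, read_block);    // next read in flight...
        EVP_DigestUpdate(ctx, block.data(), block.size());       // ...while this block is hashed
    }

    unsigned int len = 0;
    std::vector<unsigned char> digest(EVP_MAX_MD_SIZE);
    EVP_DigestFinal_ex(ctx, digest.data(), &len);
    digest.resize(len);

    EVP_MD_CTX_free(ctx);
    std::fclose(f);
    return digest;
}

The important property is just that the disk never sits idle while the CPU hashes; a bounded queue between one reader thread and one hasher thread gives the same effect.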

Deduplicator
1

Most operating systems have file system indices that can help speed up the job. C++ has its own way of determining file size, but I am not sure how optimized that is. The point is that you can get this metadata without actually opening the file, reading nothing (or very little) from the disk. If you do not want to go platform-specific, you can build the index yourself.

Using the index, you can then very easily get the size of each file, and group your files by size. There is no point checking files of different sizes for duplication, because they cannot be duplicates of each other.

Then, for each group of same-size files, you can compute any hash you want to determine duplication. The idea is to exclude non-duplicate items out of the gate, before embarking on the computationally intensive work.

Ccm