
I have a server hosting an intranet web site where one of the features will be the ability to upload files. The files will be saved in a restricted access folder and managed through the web back-end. To avoid name collisions, I plan on assigning UUIDs and storing the original filename along with the UUID in a database for future retrieval.

However, I do have 2 concerns:

  1. The possibility of duplicate files (at the actual byte level, not just by name), and
  2. Ensuring file integrity.

I thought that running some type of hash / checksum (MD5, SHA256, etc.) could address both concerns. I could store the hash and compare it against the file at a future date to verify it had not gotten corrupted, and if I found another file with the same hash, I would know the file was a true duplicate.
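For example, here is roughly what I have in mind (a minimal Python sketch; the chunked read is just to avoid pulling large uploads into memory at once):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# The hex digest would be stored alongside the UUID and original filename;
# recomputing and comparing it later catches corruption, and two files with
# the same digest are (almost certainly) byte-level duplicates.
```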

So my questions are:

  1. Are my concerns about file corruption unfounded?
  2. Is this a good strategy for identifying duplicate files?
Big_Al_Tx

1 Answer


1) File corruption is not common, and the underlying storage should detect and warn about such things, but yes, it's nice to double-check. Better yet, keep an off-site backup: http://en.wikipedia.org/wiki/Comparison_of_backup_software

2) If you're using hashes anyway, there is no need for other strategies. That said, there are things like rsync's move detection, which first compares all files by size (nice and fast), then hashes any files that share a size (if they aren't hashed already) and checks those hashes for uniqueness. Depending on the file content there are other options, such as git for text, or keeping the higher-quality copy for media.
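A rough sketch of that size-first, hash-second idea (Python and SHA-256 assumed; this shows the general approach, not rsync's actual implementation):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group files by size first, then hash only the files that share a size."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a byte-level duplicate
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(p)

    # Only hash groups with more than one member are true duplicates.
    return [group for group in by_hash.values() if len(group) > 1]
```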