82

I'm archiving data from one server to another. Initially I started an rsync job. It took two weeks just to build the file list for 5 TB of data, and another week to transfer 1 TB of data.

Then I had to kill the job, as we needed some downtime on the new server.

It's been agreed that we will tar it up, since we probably won't need to access it again. I was thinking of breaking it into 500 GB chunks. After tarring it, I was going to copy it across over ssh. I was using tar and pigz, but it is still too slow.

Is there a better way to do it? I think both servers are on Red Hat. The old server is ext4 and the new one is XFS.

File sizes range from a few KB to a few MB, and there are 24 million JPEGs in 5 TB. So I'm guessing around 60-80 million files for 15 TB.

Edit: After playing with rsync, nc, tar, mbuffer and pigz for a couple of days, it's clear the bottleneck is going to be the disk IO, as the data is striped across 500 SAS disks and there are around 250 million JPEGs. However, I've now learnt about all these nice tools that I can use in the future.

lbanz

12 Answers

68

I have had very good results using tar, pigz (parallel gzip) and nc.

Source machine:

tar -cf - -C /path/of/small/files . | pigz | nc -l 9876

Destination machine:

To extract:

nc source_machine_ip 9876 | pigz -d | tar -xf - -C /put/stuff/here

To keep archive:

nc source_machine_ip 9876 > smallstuff.tar.gz

If you want to see the transfer rate, just pipe through pv after pigz -d!
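
For example, on the destination machine (pv has to be installed there; same placeholder paths as above):

nc source_machine_ip 9876 | pigz -d | pv | tar -xf - -C /put/stuff/here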

h0tw1r3
21

I'd stick with the rsync solution. Modern (3.0.0+) rsync uses an incremental file list, so it does not have to build the full list before the transfer starts, and restarting it after a problem won't force you to redo the whole transfer. Splitting the transfer per top-level or second-level directory will optimize this even further. (I'd use rsync -a -P and add --compress if your network is slower than your drives.)
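
A minimal sketch of that approach, assuming the data lives under /path/of/small/files and using a hypothetical user and host for the new server:

# one rsync job per top-level directory; restartable per directory if something goes wrong
for dir in /path/of/small/files/*/; do
    rsync -a -P "$dir" user@newserver:/archive/"$(basename "$dir")"/
done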

Fox
15

Set up a VPN (if it's going over the internet), create a virtual drive of some format on the remote server (make it ext4), attach it to the local server over a block-level protocol like iSCSI, and use dd or another block-level tool to do the transfer (a rough sketch follows the list below). You can then copy the files off the virtual drive to the real (XFS) drive at your own convenience.

Two reasons:

  1. No filesystem overhead, which is the main performance culprit
  2. No seeking, you're looking at sequential read/write on both sides
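
A very rough sketch of the transfer step, assuming the new server already exports a file-backed LUN over iSCSI (set up with e.g. targetcli) and that the old server's data sits on a single logical volume; every device name and address below is a placeholder:

# on the old server: discover and attach the remote LUN
iscsiadm -m discovery -t sendtargets -p new_server_ip
iscsiadm -m node --login          # the LUN appears as e.g. /dev/sdz

# block-level copy of the source volume onto the remote LUN (sequential IO on both sides)
dd if=/dev/mapper/old_data_lv of=/dev/sdz bs=64M

# later, on the new server: mount the image file and copy the files
# onto the real XFS filesystem at your convenience
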
Giacomo1968
10

If the old server is being decommissioned and the files can be offline for a few minutes, it is often fastest to just pull the drives out of the old box, cable them into the new server, mount them (back online now) and copy the files to the new server's native disks.

3

Use mbuffer, and if it is on a secure network you can avoid the encryption step.
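
A minimal sketch using mbuffer's built-in network mode in place of nc; the port, paths and buffer size are placeholders:

# destination machine: listen, buffer, unpack
mbuffer -I 9876 -m 1G | tar -xf - -C /put/stuff/here

# source machine: stream the tar through a 1 GB buffer and send it
tar -cf - -C /path/of/small/files . | mbuffer -m 1G -O destination_ip:9876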

JamesRyan
3

(Many different answers can work. Here is another one.)

Generate the file list with find -type f (this should finish in a couple of hours), split it into small chunks, and transfer each chunk using rsync --files-from=....
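
A minimal sketch, with the chunk size and destination as placeholders; the list holds absolute paths, so the rsync source is /:

# build the list once, then split it into 1-million-line chunks
find /path/of/small/files -type f > /tmp/filelist
split -l 1000000 /tmp/filelist /tmp/chunk.

# transfer one chunk at a time; each run is independently restartable
for list in /tmp/chunk.*; do
    rsync -a --files-from="$list" / user@newserver:/archive/
done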

pts
3

Have you considered sneakernet? By that, I mean transferring everything onto the same drive, then physically moving that drive over.

About a month ago, Samsung unveiled a 16 TB drive (technically, it's 15.36 TB), which is also an SSD: http://www.theverge.com/2015/8/14/9153083/samsung-worlds-largest-hard-drive-16tb

I think this drive would just about do for this. You'd still have to copy all the files, but since you don't have network latency and can probably use SATA or a similarly fast interface, it should be quite a lot faster.

Nzall
2

If there is any chance of getting a good deduplication ratio, I would use something like borgbackup or Attic.

If not, check the netcat+tar+pbzip2 solution and adapt the compression options according to your hardware - check what the bottleneck is (CPU? network? IO?). pbzip2 will nicely span all CPUs, giving better performance.
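
A minimal borgbackup sketch, assuming borg is installed on both servers; the repository path on the new server is hypothetical. The pbzip2 variant is simply the nc pipeline from the top answer with pbzip2 swapped in for pigz:

# deduplicating archive, pushed over ssh
borg init --encryption=none user@newserver:/archive/images-repo
borg create --stats user@newserver:/archive/images-repo::images-initial /path/of/small/files

# netcat + tar + pbzip2 on the source machine
tar -cf - -C /path/of/small/files . | pbzip2 -c | nc -l 9876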

neutrinus
2

You are using Red Hat Linux, so this wouldn't apply, but as another option:

I've had great success using ZFS to hold millions of files as inodes aren't an issue.

If that were an option for you, you could then take snapshots and use zfs send for incremental updates. I've had a lot of success using this method both to transfer and to archive data.

ZFS is primarily a Solaris filesystem, but it can be found in illumos (an open source fork of Sun's OpenSolaris). I know there has also been some luck using ZFS under BSD and Linux (using FUSE?), but I have no experience trying that.
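
A minimal sketch, assuming both ends ran ZFS and using hypothetical pool, dataset and snapshot names:

# initial full send
zfs snapshot tank/images@base
zfs send tank/images@base | ssh newserver zfs receive backup/images

# later: send only what changed since the previous snapshot
zfs snapshot tank/images@next
zfs send -i tank/images@base tank/images@next | ssh newserver zfs receive backup/images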

1

Start an rsync daemon on the target machine. This will speed up the transfer process a lot.
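
A minimal sketch, with the module name and paths as placeholders:

# /etc/rsyncd.conf on the target machine
[archive]
    path = /put/stuff/here
    read only = false

# start the daemon on the target
rsync --daemon

# from the source machine, push over the rsync protocol (no ssh overhead)
rsync -a -P /path/of/small/files/ rsync://target_host/archive/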

MadHatter
0

Try juicesync?

juicesync local/path user@host:port:path --threads=50 

You can also use --worker and --manager modes to start more jobs.

Alternatively, you can try rclone (over sftp).
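
A possible rclone invocation, assuming an sftp remote named newserver has already been set up with rclone config; the flags and paths are placeholders:

rclone copy /path/of/small/files newserver:/archive/images --transfers=32 --checkers=16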

-1

You can do this with just tar and ssh, like this:

tar zcf - <your files> | ssh <destination host> "cat > <your_file>.tar.gz"

Or, if you want to keep individual files:

tar zcf - <your files> | ssh <destination host> "tar zxf -"