82

I'm archiving data from one server to another. Initially I started an rsync job. It took two weeks just to build the file list for 5 TB of data, and another week to transfer 1 TB of data.

Then I had to kill the job, as we needed some downtime on the new server.

It's been agreed that we will tar it up, since we probably won't need to access it again. I was thinking of breaking it into 500 GB chunks. After tarring it, I was going to copy it across over ssh. I was using tar and pigz, but it is still too slow.

Is there a better way to do it? I think both servers are on Red Hat. The old server is ext4 and the new one is XFS.

File sizes range from a few KB to a few MB, and there are 24 million JPEGs in 5 TB. So I'm guessing around 60-80 million files for 15 TB.

Edit: After playing with rsync, nc, tar, mbuffer and pigz for a couple of days, it's clear the bottleneck is going to be the disk IO, as the data is striped across 500 SAS disks and there are around 250 million JPEGs. However, I've now learnt about all these nice tools that I can use in the future.

lbanz

12 Answers

68

I have had very good results using tar, pigz (parallel gzip) and nc.

Source machine:

tar -cf - -C /path/of/small/files . | pigz | nc -l 9876

Destination machine:

To extract:

nc source_machine_ip 9876 | pigz -d | tar -xf - -C /put/stuff/here

To keep archive:

nc source_machine_ip 9876 > smallstuff.tar.gz

If you want to see the transfer rate, just pipe through pv after pigz -d!
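
For example, on the destination machine (pv has to be installed there; same placeholder paths as above):

nc source_machine_ip 9876 | pigz -d | pv | tar -xf - -C /put/stuff/here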

h0tw1r3
21

I'd stick with the rsync solution. Modern (3.0.0+) rsync uses an incremental file list, so it does not have to build the full list before the transfer starts, and restarting it after a problem won't force you to redo the whole transfer. Splitting the transfer per top-level or second-level directory will optimize this even further. (I'd use rsync -a -P and add --compress if your network is slower than your drives.)
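
A minimal sketch of that approach, assuming the data lives under /path/of/small/files and using a hypothetical user and host for the new server:

# one rsync job per top-level directory; restartable per directory if something goes wrong
for dir in /path/of/small/files/*/; do
    rsync -a -P "$dir" user@newserver:/archive/"$(basename "$dir")"/
done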

Fox
15

Set up a VPN (if it's going over the internet), create a virtual drive of some format on the remote server (make it ext4), attach it to the local server over a block-level protocol like iSCSI, and use dd or another block-level tool to do the transfer (a rough sketch follows the list below). You can then copy the files off the virtual drive to the real (XFS) drive at your own convenience.

Two reasons:

  1. No filesystem overhead, which is the main performance culprit
  2. No seeking, you're looking at sequential read/write on both sides
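
A very rough sketch of the transfer step, assuming the new server already exports a file-backed LUN over iSCSI (set up with e.g. targetcli) and that the old server's data sits on a single logical volume; every device name and address below is a placeholder:

# on the old server: discover and attach the remote LUN
iscsiadm -m discovery -t sendtargets -p new_server_ip
iscsiadm -m node --login          # the LUN appears as e.g. /dev/sdz

# block-level copy of the source volume onto the remote LUN (sequential IO on both sides)
dd if=/dev/mapper/old_data_lv of=/dev/sdz bs=64M

# later, on the new server: mount the image file and copy the files
# onto the real XFS filesystem at your convenience
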
Giacomo1968
10

If the old server is being decommissioned and the files can be offline for a few minutes, it is often fastest to just pull the drives out of the old box, cable them into the new server, mount them (back online now) and copy the files to the new server's native disks.

3

Use mbuffer, and if it is on a secure network you can avoid the encryption step.
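
A minimal sketch using mbuffer's built-in network mode in place of nc; the port, paths and buffer size are placeholders:

# destination machine: listen, buffer, unpack
mbuffer -I 9876 -m 1G | tar -xf - -C /put/stuff/here

# source machine: stream the tar through a 1 GB buffer and send it
tar -cf - -C /path/of/small/files . | mbuffer -m 1G -O destination_ip:9876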

JamesRyan
3

(Many different answers can work. Here is another one.)

Generate the file list with find -type f (this should finish in a couple of hours), split it into small chunks, and transfer each chunk using rsync --files-from=....
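
A minimal sketch, with the chunk size and destination as placeholders; the list holds absolute paths, so the rsync source is /:

# build the list once, then split it into 1-million-line chunks
find /path/of/small/files -type f > /tmp/filelist
split -l 1000000 /tmp/filelist /tmp/chunk.

# transfer one chunk at a time; each run is independently restartable
for list in /tmp/chunk.*; do
    rsync -a --files-from="$list" / user@newserver:/archive/
done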

pts
3

Have you considered sneakernet? By that, I mean transferring everything onto the same drive, then physically moving that drive over.

About a month ago, Samsung unveiled a 16 TB drive (technically, it's 15.36 TB), which is also an SSD: http://www.theverge.com/2015/8/14/9153083/samsung-worlds-largest-hard-drive-16tb

I think this drive would just about do for this. You'd still have to copy all the files, but since you don't have network latency and can probably use SATA or a similarly fast interface, it should be quite a lot faster.

Nzall
2

If there is any chance of getting a good deduplication ratio, I would use something like borgbackup or Attic.

If not, check the netcat+tar+pbzip2 solution and adapt the compression options according to your hardware - check what the bottleneck is (CPU? network? IO?). pbzip2 will nicely span all CPUs, giving better performance.
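
A minimal borgbackup sketch, assuming borg is installed on both servers; the repository path on the new server is hypothetical. The pbzip2 variant is simply the nc pipeline from the top answer with pbzip2 swapped in for pigz:

# deduplicating archive, pushed over ssh
borg init --encryption=none user@newserver:/archive/images-repo
borg create --stats user@newserver:/archive/images-repo::images-initial /path/of/small/files

# netcat + tar + pbzip2 on the source machine
tar -cf - -C /path/of/small/files . | pbzip2 -c | nc -l 9876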

neutrinus
2

You are using Red Hat Linux, so this wouldn't apply, but as another option:

I've had great success using ZFS to hold millions of files as inodes aren't an issue.

If that were an option for you, you could then take snapshots and use zfs send for incremental updates. I've had a lot of success using this method both to transfer and to archive data.

ZFS is primarily a Solaris filesystem, but it can be found in illumos (an open source fork of Sun's OpenSolaris). I know there has also been some luck using ZFS under BSD and Linux (using FUSE?), but I have no experience trying that.
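
A minimal sketch, assuming both ends ran ZFS and using hypothetical pool, dataset and snapshot names:

# initial full send
zfs snapshot tank/images@base
zfs send tank/images@base | ssh newserver zfs receive backup/images

# later: send only what changed since the previous snapshot
zfs snapshot tank/images@next
zfs send -i tank/images@base tank/images@next | ssh newserver zfs receive backup/images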

1

Start an rsync daemon on the target machine. This will speed up the transfer process a lot.
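
A minimal sketch, with the module name and paths as placeholders:

# /etc/rsyncd.conf on the target machine
[archive]
    path = /put/stuff/here
    read only = false

# start the daemon on the target
rsync --daemon

# from the source machine, push over the rsync protocol (no ssh overhead)
rsync -a -P /path/of/small/files/ rsync://target_host/archive/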

MadHatter
0

Try juicesync?

juicesync local/path user@host:port:path --threads=50 

You can also use --worker and --manager modes to start more jobs.

Alternatively, you can try rclone (over sftp).
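
A possible rclone invocation, assuming an sftp remote named newserver has already been set up with rclone config; the flags and paths are placeholders:

rclone copy /path/of/small/files newserver:/archive/images --transfers=32 --checkers=16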

-1

You can do this with just tar and ssh, like this:

tar zcf - <your files> | ssh <destination host> "cat > <your_file>.tar.gz"

Or, if you want to keep individual files:

tar zcf - <your files> | ssh <destination host> "tar zxf -"