16

I have a 100GB file that I want to split into 100 files of 1GB each (breaking at line breaks).

e.g.

split --bytes=1024M /path/to/input /path/to/output

For the 100 files generated, I want to apply gzip/zip to each of these files.

Is it possible to use a single command?

Ryan
  • 6,271

4 Answers

41

Use "--filter":

split --bytes=1024M --filter='gzip > $FILE.gz' /path/to/input /path/to/output
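Note that `--bytes` cuts at exact byte offsets, so a line can end up divided between two chunks. If the chunks should end on line breaks, as the question asks, GNU split's `--line-bytes` (`-C`) option is probably the closer fit. A minimal sketch, assuming GNU coreutils; the function name `split_gz` and its arguments are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: split INPUT into gzip-compressed chunks of at most
# SIZE bytes each, keeping whole lines together (requires GNU split).
split_gz() {
    local input=$1 prefix=$2 size=${3:-1024M}
    # --line-bytes packs as many complete lines as fit into SIZE bytes.
    # $FILE is set by split for each chunk, so it must stay single-quoted
    # here to keep the calling shell from expanding it.
    split --line-bytes="$size" --filter='gzip > "$FILE.gz"' "$input" "$prefix"
}
```

Called as `split_gz /path/to/input /path/to/output 1024M`, this should produce `/path/to/outputaa.gz`, `/path/to/outputab.gz`, and so on. A single line longer than the chunk size would still be broken across chunks.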

Skyhawk
  • 14,230
Peter
  • 411
0

A bash function to compress on the fly with pigz

function splitreads(){

# add this function to your .bashrc or similar
# split a large compressed read file into chunks of a fixed number of reads
# suffix is a three-digit counter starting with 000
# takes compressed input and compresses output with pigz
# keeps the read-in-pair suffix in outputs
# requires pigz to be installed (or replace pigz with gzip)

local usage="# splitreads <reads.fastq.gz> <reads per chunk; default 10000000>"
    if [ $# -lt 1 ]; then
        echo
        echo "${usage}"
        return 1
    fi

# threads for pigz (adapt to your needs)
local thr=8

local input=$1

# extract prefix and read number in pair
# this code is adapted to paired reads (e.g. sample_1.fastq.gz)
local base pref readn
base=$(basename "${input%.f*.gz}")
pref=$(basename "${input%_?.f*.gz}")
readn="${base#"${base%%_*}"}"

# 10M reads (4 lines each)
local binsize=$(( ${2:-10000000} * 4 ))

# split in bins of ${binsize} lines
echo "# splitting ${input} in chunks of $((binsize / 4)) reads"

# -d gives numeric suffixes and -a 3 makes them three digits wide;
# \$FILE stays unexpanded here so that split can set it for each chunk
zcat "${input}" \
  | split \
    -a 3 \
    -d \
    -l "${binsize}" \
    --additional-suffix="${readn}" \
    --filter="pigz -p ${thr} > \"\$FILE.fq.gz\"" \
    - "${pref}_"
}
splaisan
  • 101
0

If your version of split does not support --filter, a one-liner that chains the commands with && is as close as you can come.

cd /path/to/output && split --bytes=1024M /path/to/input/filename && gzip x*

gzip runs only if split succeeds, because of the && between them; the && between cd and split likewise makes sure the cd succeeded. Note that split, when given no output prefix, writes its x* files to the current directory, and gzip compresses them in place, which is why the command changes into the output directory first. You can create the directory first, if needed:

mkdir -p /path/to/output && cd /path/to/output && split --bytes=1024M /path/to/input/filename && gzip x*

To put it all back together:

gunzip /path/to/files/x* && cat /path/to/files/x* > /path/to/dest/filename
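If the chunks do not need to exist uncompressed on disk, the reassembly can be streamed instead. A sketch, assuming the chunks were produced by split followed by gzip as above, so that the shell glob lists them in their original order; `join_gz` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: rebuild DEST from gzip-compressed chunks given in order.
join_gz() {
    local dest=$1
    shift
    # gunzip -c writes each decompressed chunk to stdout in argument order
    # and leaves the .gz files untouched.
    gunzip -c "$@" > "$dest"
}
```

For example, `join_gz /path/to/dest/filename /path/to/files/x*.gz` would recreate the original file while keeping the compressed chunks in place.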
0

Using split with the -d option generates numeric suffixes:

split -d -b 2048m "myDump.dmp" "myDump.dmp.part-" && gzip myDump.dmp.part*

Files generated (after gzip has compressed each part):

    myDump.dmp.part-00.gz
    myDump.dmp.part-01.gz
    myDump.dmp.part-02.gz
    ...
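The numeric suffixes can also be combined with `--filter`, so that the uncompressed parts are never written to disk. A rough sketch, assuming GNU split; the function name `split_gz_numeric` and the default size are illustrative only:

```shell
#!/usr/bin/env bash
# Hypothetical helper: split INPUT into numbered, gzip-compressed parts
# named INPUT.part-00.gz, INPUT.part-01.gz, ... (requires GNU split).
split_gz_numeric() {
    local input=$1 size=${2:-2048M}
    # -d selects numeric suffixes; each chunk is piped straight into gzip,
    # and split sets $FILE to the name the chunk would otherwise have had.
    split -d -b "$size" --filter='gzip > "$FILE.gz"' "$input" "$input.part-"
}
```

For example, `split_gz_numeric myDump.dmp 2048M`. Unlike the two-step version, this needs no extra free space for the uncompressed parts.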
Iván
  • 1
  • 3