14

Say you have data with quantities in human-readable format, such as the output of du -h, and want to operate on those numbers further. For example, you want to pipe the data through grep and sum a subset of it. You do this ad hoc on many systems you've never seen before, with only minimal utilities available. You want suffix conversions for all the standard 10^n suffixes.

Is there a GNU/Linux utility to convert the suffixed numbers to plain numbers within a pipeline? Do you have a bash function written for this, or some easy-to-remember Perl, rather than a long chain of regex replacements or several sed steps?

38M     /var/crazyface/courses/200909-90147
2.7M    /var/crazyface/courses/200909-90157
1.1M    /var/crazyface/courses/200909-90159
385M    /var/crazyface/courses/200909-90161
1.3M    /var/crazyface/courses/200909-90169
376M    /var/crazyface/courses/200907-90171
8.0K    /var/crazyface/courses/200907-90173
668K    /var/crazyface/courses/200907-90175
564M    /var/crazyface/courses/200907-90178
4.0K    /var/crazyface/courses/200907-90179

| grep 200907 | <amazing suffix conversion> | awk '{s+=$1} END {print s}'



beans
  • 1,700

5 Answers

15

Based on my answer at one of the questions you linked to:

awk '{
    # the position of the suffix in this list gives the power of 1000
    ex = index("KMGTPEZY", substr($1, length($1)))
    # the numeric part is everything except the final suffix character
    # (substr is 1-based in awk)
    val = substr($1, 1, length($1) - 1)

    prod = val * 10^(ex * 3)

    sum += prod
}
END {print sum}'
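
Dropped into the pipeline from the question, it could look like this (a sketch; the 10^n factors match the SI suffixes asked for, not du's 1024-based output):

du -sh /var/crazyface/courses/* | grep 200907 | awk '{
    ex = index("KMGTPEZY", substr($1, length($1)))
    sum += substr($1, 1, length($1) - 1) * 10^(ex * 3)
}
END {print sum}'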

Another method sometimes used, operating on the bare size values:

sed 's/G/ * 1000 M/;s/M/ * 1000 K/;s/K/ * 1000/; s/$/ +\\/; $a0' | bc
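
To apply it to the du output, strip the path column first, for example (a sketch):

du -sh /var/crazyface/courses/* | grep 200907 | awk '{print $1}' |
  sed 's/G/ * 1000 M/;s/M/ * 1000 K/;s/K/ * 1000/; s/$/ +\\/; $a0' | bc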
5

Personally, I'd just not use the -h flag in the first place. The "human readable" version rounds off the numbers, which then get rounded again when you convert back, making them even less accurate. (For instance, 2.7MiB is 2831155.2 bytes. What did you do with the other 0.8 of a byte?!)

Otherwise, you can ask units to convert MiB/GiB/KiB to plain "B" and it will handle this, but you'd have to do something like the following (assuming your output is tab-separated; otherwise adjust the cut accordingly):

{your output} | cut -f1 '-d{tab}' | xargs -L 1 -I {} units -1t {}iB B | awk '{s+=$1}END{printf "%d\n",s}'
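
With the literal tab spelled out for cut, the whole thing might look like this (a sketch; the {}iB rewrite assumes IEC suffixes, as in the du -h output):

du -sh /var/crazyface/courses/* | grep 200907 |
  cut -f1 -d"$(printf '\t')" |
  xargs -L 1 -I {} units -1t '{}iB' B |
  awk '{s+=$1} END {printf "%d\n", s}'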
DerfK
  • 19,826
3

You can use Perl regular expressions to do this. For example:

my $value = 0;
if ($line =~ /^(\d+\.?\d*)(\D)\s/) {
    my $amplifier = 1;                 # no recognized suffix: treat as bytes
    $amplifier = 1024 if $2 eq 'K';
    $amplifier = 1024 * 1024 if $2 eq 'M';
    $amplifier = 1024 * 1024 * 1024 if $2 eq 'G';
    $value = $1 * $amplifier;
}

This is a simple snippet; consider it a starting point. I hope it helps!
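
For instance, the same idea as a one-liner that sums matching lines from stdin (a sketch along the same lines):

du -sh /var/crazyface/courses/* | grep 200907 | perl -ne '
    if (/(\d+\.?\d*)(\D)\s/) {
        my %amp = (K => 1024, M => 1024 ** 2, G => 1024 ** 3);
        $sum += $1 * ($amp{$2} // 1);
    }
    END { print "$sum\n" }'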

Khaled
  • 37,789
2
#!/bin/bash
VALUE=$1

# Each replacement cascades into the next: G becomes *1024m,
# m becomes *1024k, and k becomes *1024.
VALUE=${VALUE//[gG]/*1024m}
VALUE=${VALUE//[mM]/*1024k}
VALUE=${VALUE//[kK]/*1024}

# Integer sizes only: shell arithmetic cannot evaluate values like 2.7
if [[ $VALUE =~ ^[0-9]+(\*[0-9]+)*$ ]]; then
        echo "VALUE=$((VALUE))"
else
        echo "ERROR: size invalid, please enter a correct size"
fi
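
Saved as an executable script (the file name here is hypothetical), it behaves like this:

$ ./suffix-to-bytes.sh 2G
VALUE=2147483648
$ ./suffix-to-bytes.sh 2.7G
ERROR: size invalid, please enter a correct size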
Michael Hampton
  • 252,907
Sun
  • 129
1

Try numfmt from GNU Core Utilities (coreutils). It converts "raw" byte counts both to and from human-readable multiple-byte units, meaning multiples of either 1024 (IEC) or 1000 (SI).

Since you mention du (also from coreutils) and its --human-readable output, I assume your sample input data is in IEC format, rounded to at most one decimal place.

input.txt

38M     /var/crazyface/courses/200909-90147
2.7M    /var/crazyface/courses/200909-90157
1.1M    /var/crazyface/courses/200909-90159
385M    /var/crazyface/courses/200909-90161
1.3M    /var/crazyface/courses/200909-90169
376M    /var/crazyface/courses/200907-90171
8.0K    /var/crazyface/courses/200907-90173
668K    /var/crazyface/courses/200907-90175
564M    /var/crazyface/courses/200907-90178
4.0K    /var/crazyface/courses/200907-90179

Command

cat input.txt | numfmt --from 'iec' | tee output.txt

output.txt

39845888     /var/crazyface/courses/200909-90147
2831156    /var/crazyface/courses/200909-90157
1153434    /var/crazyface/courses/200909-90159
403701760    /var/crazyface/courses/200909-90161
1363149    /var/crazyface/courses/200909-90169
394264576    /var/crazyface/courses/200907-90171
8192    /var/crazyface/courses/200907-90173
684032    /var/crazyface/courses/200907-90175
591396864    /var/crazyface/courses/200907-90178
4096    /var/crazyface/courses/200907-90179

The change in size string length messes up the column formatting, which should not matter if you are parsing the output further.
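
Plugged into the pipeline sketched in the question, numfmt fills the <amazing suffix conversion> slot (a sketch, assuming IEC input as above):

du -h | grep 200907 | numfmt --from 'iec' | awk '{s+=$1} END {print s}'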


Further alternatives and suggestions

To look at the output, column from util-linux can format it back to a pretty table:

cat output.txt | column --table

For better precision, consider using du --bytes to get the (cumulative) byte size rather than the (cumulative) file system block size. It should match the (cumulative) file content/data size, while the (default) block size introduces an extra rounding error. For example, the last line in the sample set is listed with size 4.0K even though it is probably smaller than the 4096 bytes shown in the output after converting back to byte size.
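
For example, the same summation with exact byte counts (a sketch):

du --bytes --summarize /var/crazyface/courses/* | grep 200907 | awk '{s+=$1} END {print s}'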


Use du --threshold=size to conveniently filter output based on (cumulative) size. Use for example --threshold '+5M' to exclude "small" entries, or --threshold '-100M' to exclude "large" entries. The optional du --total argument will still include the size of skipped entries.
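
For example, keeping only entries of at least 5M while --total still counts everything (a sketch):

du --human-readable --total --threshold '+5M' /var/crazyface/courses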


With numfmt, the pretty-printing by du can be shifted to "post-processing", after using for example grep (GNU Grep) and a "simple" numeric sort (from coreutils):

du | grep '200907' | sort --numeric-sort | numfmt --to 'iec'

For name-based path exclusions, based on your input sample, you can skip grep and instead use du --exclude '200909-*' to keep only the 200907-* paths. This assumes only one or a few simple patterns to filter out.
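
For example (a sketch):

du --human-readable --total --exclude '200909-*' /var/crazyface/courses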


Alternatively, use the powerful path (and other attribute) filtering of find first, and then use xargs (both from GNU Findutils) to pass the matches as arguments to du:

find . -name '200907-*' | xargs --no-run-if-empty du --human-readable --total

find can also filter on small/large/exact file sizes with find -size, but take care to include a size suffix (such as c for bytes). Note that the optional du --total line may be printed repeatedly (once per xargs batch) if the matched entries exceed the argument-list limit.
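
For example, matching regular files larger than one mebibyte (a sketch; a bare number would mean 512-byte blocks):

find /var/crazyface/courses -type f -size +1M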


When piping paths between shell commands, it may be worth using \0/NUL/zero-terminated records instead of the default newline (\n) separator. Have a look at arguments such as --null/--zero-terminated/-print0 to avoid (some) issues where input/output paths contain unexpected "special" characters.
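
A null-terminated variant of the find pipeline above could look like this (a sketch):

find . -name '200907-*' -print0 | xargs --null --no-run-if-empty du --human-readable --total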