
I have FILE_A, which has over 300,000 lines, and FILE_B, which has over 30 million lines. I created a Bash script that greps for each line of FILE_A in FILE_B and writes the result of the grep to a new file.

This whole process is taking over 5 hours.

How can I improve the performance of my script?

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B is like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in Bash I have a while loop that reads the next line from FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write the matching line to result.txt.

while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt

5 Answers


Try using grep --file=FILE_A. It almost certainly loads the patterns into memory, meaning it will only scan FILE_B once.

grep -F --file=300KFile 30MFile

Leave out -m 1 here: with all the patterns in a single grep invocation it would stop after the first matching line in FILE_B overall, not after one match per pattern.
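If you want the whole job as a single command that writes straight to result.txt, something along these lines should work (-f is the short form of --file; setting LC_ALL=C is an optional extra, not part of the original answer, that often speeds up fixed-string matching in GNU grep):

LC_ALL=C grep -F -f 300KFile 30MFile > result.txt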

Here is a Perl answer for posterity. I routinely do this for matching 1M lines to 30-35M lines. It takes around 10 seconds to finish.

First, hash up FILE_A:

my %simple_hash;
open my $first_file, '<', 'FILE_A' or die "What have you done?! $!";
while (<$first_file>) {
  chomp;                 ## Watch out for Windows newlines
  $simple_hash{$_} = 1;  ## There may be an even faster way to define this
}
close $first_file;

Then, if your big file is delimited and you know which column to go after, check just for the existence of the hash key as you run down FILE_B, which is much, much faster than checking for equality or doing a regular-expression match:

open my $second_file, '<', 'FILE_B' or die "Oh no, not again.. $!";
while (<$second_file>) {
  my ($col1, undef) = split ',';
  if (exists($simple_hash{$col1})) {
    print $_;
  }
}
close $second_file;

If your larger target file isn't nicely parseable, then this script loses its value, since so much of its speed comes from not having to fire up the regular expression engine.
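Assuming the two snippets are saved together as one script -- call it match.pl, a name made up here -- running it and capturing the matches looks like this (the script prints matching FILE_B lines to standard output):

perl match.pl > result.txt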

Mintx

If you don't mind some more involved programming, consider using suffix trees (or a variant).

You can preprocess FILE_B using Ukkonen's algorithm in linear time. Then you query each line of FILE_A in time linear in its length and get all the line numbers that match (you might need to adapt the tree a tad), which you can write to a result file.

The whole procedure runs in time O(n + Nm), where n is the length of FILE_B, N is the number of lines in FILE_A, and m is the length of the longest line in FILE_A -- essentially linear runtime, which beats the quadratic time of your original approach by orders of magnitude.

Raphael

I came across the --mmap flag recently. I haven't had a chance to test it, but I'd be happy to hear about your findings. Here is the description from the man page:

--mmap If  possible, use the mmap(2) system call to read input, instead
      of the default read(2) system call.  In some situations,  --mmap
      yields  better performance.  However, --mmap can cause undefined
      behavior (including core dumps) if an input file  shrinks  while
      grep is operating, or if an I/O error occurs.

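For reference, this is roughly how the flag would combine with the --file approach from the first answer (assuming a grep old enough to still offer --mmap; newer GNU grep versions ignore or no longer accept it):

grep --mmap -F -f 300KFile 30MFile > result.txt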

Ramzi Kahil

Why don't you put those files in a database? Databases are really good at doing an efficient merge, hash, or nested-loop join like this, and they are really good at utilizing virtual memory.
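A rough sketch of the idea with the sqlite3 command-line shell, for example (the database file name, table names, and column names are all invented for illustration, and it assumes the layout shown in the question: one key per line in FILE_A, four comma-separated fields per line in FILE_B):

# load both files, index the small one, and let SQLite do the join
sqlite3 join.db <<'SQL'
CREATE TABLE patterns(p TEXT);
CREATE TABLE big(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import 300KFile patterns
.import 30MFile big
CREATE INDEX patterns_p ON patterns(p);
.output result.txt
SELECT big.* FROM big JOIN patterns ON big.c1 = patterns.p;
SQL

If FILE_A can list the same key more than once, deduplicate it (or use SELECT DISTINCT) so matching rows aren't written out repeatedly.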