
I have FILE_A, which has over 300,000 lines, and FILE_B, which has over 30 million lines. I created a Bash script that greps for each line of FILE_A in FILE_B and writes the result of the grep to a new file.

This whole process is taking over 5 hours.

How can I improve the performance of my script?

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B is like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in Bash I have a while loop that reads the next line from FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write the matching line to result.txt.

while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt

5 Answers


Try using grep --file=FILE_A. It almost certainly loads the patterns into memory, meaning it will only scan FILE_B once.

grep -F --file=300KFile 30MFile

Leave out -m 1 here: with all the patterns in a single grep invocation it would stop after the first matching line in FILE_B overall, not after one match per pattern.
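If you want the whole job as a single command that writes straight to result.txt, something along these lines should work (-f is the short form of --file; setting LC_ALL=C is an optional extra, not part of the original answer, that often speeds up fixed-string matching in GNU grep):

LC_ALL=C grep -F -f 300KFile 30MFile > result.txt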

Here is a Perl answer for posterity. I routinely do this for matching 1M lines to 30-35M lines. It takes around 10 seconds to finish.

First, hash up FILE_A:

my %simple_hash;
open my $first_file, '<', 'FILE_A' or die "What have you done?! $!";
while (<$first_file>) {
  chomp;                 ## Watch out for Windows newlines
  $simple_hash{$_} = 1;  ## There may be an even faster way to define this
}
close $first_file;

Then, if your big file is delimited and you know which column to go after, check just for the existence of the hash key as you run down FILE_B, which is much, much faster than checking for equality or doing a regular-expression match:

open my $second_file, '<', 'FILE_B' or die "Oh no, not again.. $!";
while (<$second_file>) {
  my ($col1, undef) = split ',';
  if (exists($simple_hash{$col1})) {
    print $_;
  }
}
close $second_file;

If your larger target file isn't nicely parseable, then this script loses its value, since so much of its speed comes from not having to fire up the regular expression engine.
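Assuming the two snippets are saved together as one script -- call it match.pl, a name made up here -- running it and capturing the matches looks like this (the script prints matching FILE_B lines to standard output):

perl match.pl > result.txt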

Mintx

If you don't mind some more involved programming, consider using suffix trees (or a variant).

You can preprocess FILE_B using Ukkonen's algorithm in linear time. Then you query each line of FILE_A in time linear in its length and get all the line numbers that match (you might need to adapt the tree a tad), which you can write to a result file.

The whole procedure runs in time O(n + Nm), where n is the length of FILE_B, N is the number of lines in FILE_A, and m is the length of the longest line in FILE_A -- essentially linear runtime, which beats the quadratic time of your original approach by orders of magnitude.

Raphael

I came across the --mmap flag recently. I haven't had a chance to test it, but I'd be happy to hear about your findings. Here is the description from the man page:

--mmap If  possible, use the mmap(2) system call to read input, instead
      of the default read(2) system call.  In some situations,  --mmap
      yields  better performance.  However, --mmap can cause undefined
      behavior (including core dumps) if an input file  shrinks  while
      grep is operating, or if an I/O error occurs.

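For reference, this is roughly how the flag would combine with the --file approach from the first answer (assuming a grep old enough to still offer --mmap; newer GNU grep versions ignore or no longer accept it):

grep --mmap -F -f 300KFile 30MFile > result.txt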

Ramzi Kahil

Why don't you put those files in a database? Databases are really good at doing an efficient merge, hash, or nested-loop join like this, and they are really good at utilizing virtual memory.
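A rough sketch of the idea with the sqlite3 command-line shell, for example (the database file name, table names, and column names are all invented for illustration, and it assumes the layout shown in the question: one key per line in FILE_A, four comma-separated fields per line in FILE_B):

# load both files, index the small one, and let SQLite do the join
sqlite3 join.db <<'SQL'
CREATE TABLE patterns(p TEXT);
CREATE TABLE big(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import 300KFile patterns
.import 30MFile big
CREATE INDEX patterns_p ON patterns(p);
.output result.txt
SELECT big.* FROM big JOIN patterns ON big.c1 = patterns.p;
SQL

If FILE_A can list the same key more than once, deduplicate it (or use SELECT DISTINCT) so matching rows aren't written out repeatedly.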