113

I guess everyone knows the useful Linux command-line utilities head and tail. head allows you to print the first X lines of a file, and tail does the same but prints the end of the file. What is a good command to print the middle of a file? Something like middle --start 10000000 --count 20 (print the 10’000’000th to the 10’000’010th lines).

I'm looking for something that will deal with large files efficiently. I tried tail -n 10000000 filename | head -n 10 and it's horrifically slow.

BenMorel
Boaz

10 Answers

148
sed -n '10000000,10000020p' filename

You might be able to speed that up a little like this:

sed -n '10000000,10000020p; 10000021q' filename

In those commands, the option -n causes sed to "suppress automatic printing of pattern space". The p command "print[s] the current pattern space" and the q command "Immediately quit[s] the sed script without processing any more input..." The quotes are from the sed man page.

By the way, your command

tail -n 10000000 filename | head -n 10

starts at the ten millionth line from the end of the file, while your "middle" command would seem to start at the ten millionth line from the beginning, which would be equivalent to:

head -n 10000010 filename | tail -n 10

The problem is that for unsorted files with variable length lines any process is going to have to go through the file counting newlines. There's no way to shortcut that.

If, however, the file is sorted (a log file with timestamps, for example) or has fixed-length lines, then you can seek into the file based on a byte position. In the log file example, you could do a binary search for a range of times, as my Python script here* does. In the case of a fixed-record-length file, it's really easy: you just seek linelength * (linenumber - 1) bytes into the file (see the sketch below).

* I keep meaning to post yet another update to that script. Maybe I'll get around to it one of these days.
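
For the fixed-record-length case, here is a minimal sketch (assuming every record, newline included, is exactly RECLEN bytes; RECLEN, START, and COUNT are made-up names and filename is a placeholder):

# jump straight to the byte offset of line START and copy COUNT records
RECLEN=80; START=10000000; COUNT=20
dd if=filename bs="$RECLEN" skip=$((START - 1)) count="$COUNT" 2>/dev/null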

Amir
44

I found the following use of sed:

sed -n '10000000,+20p'  filename

Hope it's useful to someone!
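
If the file is huge, you can probably (assuming GNU sed, which is where the addr,+N address form comes from) combine this with the q trick from the answer above so sed stops reading once the range has been printed:

# quit at line 10000021 so the rest of the file is never read
sed -n '10000000,+20p;10000021q' filename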

Dox
7

This is my first time posting here! Anyway, this one is easy. Let's say you want to pull line 8872 from your file called file.txt. Here is how you do it:

cat -n file.txt | grep '^ *8872\b'

(The \b word boundary keeps the pattern from also matching lines 88720, 88721, and so on.) Now the question is how to get the 20 lines after this. To accomplish this, you do:

cat -n file.txt | grep -A 20 '^ *8872\b'

For lines around or before the match, see the -B and -C flags in the grep manual.
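
For example, with the same made-up line number, -B prints leading context and -C prints context on both sides:

# 20 lines before the match, or 20 lines before and after it
cat -n file.txt | grep -B 20 '^ *8872\b'
cat -n file.txt | grep -C 20 '^ *8872\b'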

Dennis
2

Use the following command to get a particular range of lines:

awk 'NR < 1220974{next}1;NR==1513793{exit}' debug.log | tee -a test.log

Here debug.log is my file, which consists of lakhs of lines, and I used this to print lines 1220974 through 1513793 to a file called test.log. Hope it will be helpful for capturing a range of lines.

newbie13
2

Perl is king:

perl -ne 'print if ($. == 10000000 .. $. == 10000020)' filename
1

Dennis' sed answer is the way to go. But using just head & tail, under bash:

middle () { head -n $(( $1 + $2 )) | tail -n "$2"; }

This scans the first $1+$2 lines twice, so it is much worse than Dennis' answer. But you don't need to remember all those sed letters to use it.
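
A usage sketch, assuming the function reads the file on standard input (filename is just a placeholder):

# prints lines 10000001 through 10000010 of filename
middle 10000000 10 < filename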

1

A Ruby one-liner version:

ruby -pe 'next unless $. >= 10000000 && $. <= 10000020' < filename.txt

Maybe it will be useful to somebody. The sed solutions provided by Dennis and Dox are very nice, though, and they also seem faster.

shardan
0

For instance, this awk command will print the lines between 20 and 40 (exclusive):

awk '{if ((NR > 20) && (NR < 40)) print $0}' /etc/passwd

Hrvoje Špoljar
0

If you know the line numbers, say you want to get lines 1, 3, and 5 from a file such as /etc/passwd:

perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
Dagelf
-1

You can use 'nl'.

nl filename | grep <line_num>
sysadmin1138
Ajay