This post is motivated by a question on Stack Overflow titled Faster way to do “perl -anle ‘print $F[1]’ *manyfiles* > result” (‘cut’ fails). I cautiously mentioned that it might help to read the files in parallel. brian d foy pointed out that it might not help much, given that the task seems to be IO-bound. I admit I don’t understand much about filesystem caches, but I thought reading the input in parallel might utilize them better, so I decided to test whether that made sense. A preliminary check on Windows seemed to show that using Parallel::ForkManager with two processes reduced the run time by about 40%. But we all know Windows is a little funky when it comes to forking, so I rebooted into Linux and tried there.
The tests were run on my aging laptop with an ancient Core Duo processor and 2 GB of physical memory. (No, I still haven’t replaced it, although I do also use a newer Mac.) Both perls were 5.14.2. The Windows system was XP Professional SP3, and the Linux system was Arch Linux with the latest updates. I will only show results from the Linux runs below.
First, I generated ten files with 1,000,000 lines each using the following short script:
#!/usr/bin/env perl
use strict; use warnings;

for (1 .. 1_000_000) {
    my $str = '';
    if (0.2 > rand) {
        $str .= ' ' x rand(10);   # randomly indent about 20% of the lines
    }
    $str .= 'a' x 20 . ' ' . 'a' x 20;
    print $str, "\n";
}
I then used the following script to read all the files and print the second field of each line:
#!/usr/bin/env perl
use strict; use warnings;

use Parallel::ForkManager;

my ($maxproc) = @ARGV;
my @files = ('01' .. '10');

my $pm = Parallel::ForkManager->new($maxproc);

for my $file (@files) {
    my $pid = $pm->start and next;   # parent: move on to the next file

    my $ret = open my $h, '<', $file;
    unless ($ret) {
        warn "Cannot open '$file': $!";
        $pm->finish;                 # child exits here on failure
    }

    while (my $line = <$h>) {
        next unless $line =~ /^\s*\S+\s+(\S+)/;
        print "$1\n";
    }

    $pm->finish;
}
$pm->wait_all_children;
Here are the results:
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 1 > output
24.44user 0.93system 0:29.08elapsed 87%CPU
$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 2 > output
24.95user 0.91system 0:18.31elapsed 141%CPU
$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 4 > output
24.70user 0.88system 0:17.45elapsed 146%CPU
$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 1 > output
25.31user 0.95system 0:29.72elapsed 88%CPU
The results were consistent across the handful of runs I tried: with two processes, elapsed time dropped from about 29 seconds to about 18, a reduction of roughly 37%, even though total CPU time stayed essentially the same.
So, even if a task is IO-bound, it may pay to utilize all the cores you have.