Starting Point
Suppose you have some log-files, e.g. from different servers but covering the same incident, and you want to merge them in correct time order. All of them have time-stamps at the beginning of each line, looking like this:
Mar 22 12:00:00 host task ...
Mar 22 12:02:00 host task ...
The files are named file1, file2 and file3, already sorted according to time-stamp, and you want the output to look like this:
file1 | Mar 22 12:00:00 host1 taska ...
file2 | Mar 22 12:00:43 host2 taskb ...
file2 | Mar 22 12:00:58 host2 taskc ...
file1 | Mar 22 12:02:00 host1 taska ...
file3 | Mar 22 12:02:03 host3 taskb ...
The program logic is quite simple: open all three files and at each turn look at their next lines, selecting the one with the earliest time-stamp. As soon as one file is exhausted, remove it from the set and repeat until all files are processed.
Implementation
Let me present the implementation in three steps:
- making iterators
- using iterators
- adding convenience
Iterating over the Lines of a File
# generate iterator over lines of a file
sub lines($$) {
    my ($name, $shortname) = @_;
    open my $file, "<", $name or die "cannot open $name: $!\n";
    my $line = <$file>;
    sub {
        return $line unless @_;               # no argument: return current line
        return $shortname if $_[0] eq 'name'; # 'name': return the short-name
        return undef unless $_[0] eq 'next';  # reject any other request
        $line = <$file>;                      # 'next': advance to following line
        $line;
    }
}
An iterator is an object carrying the state of an iteration over some collection. Thus, it is first initialized, then repeatedly queried for its next element, until it is finally exhausted (infinite iterators are useful in other scenarios but not for the lines of a log-file and are therefore not discussed here). While Perl offers some object-oriented features, all the power needed to implement an iterator is at our disposal through the use of closures and anonymous functions. The lines function above takes the path-name and short-name of the file to be iterated over, opens the file, reads the first line and returns a new function. This function captures all variables which were in scope at the moment of its creation, including their current values and the ability to modify them. Care must be taken to avoid obscure side-effects when accessing variables which are still visible in other parts of the code, but our anonymous function is well-behaved as it accesses only the $line, $shortname and $file variables from the immediately enclosing function body: the captured instances are references to the variables of the single call to lines which created the anonymous function, so they cannot be accessed by anything else after lines has returned. The next invocation of lines will create completely fresh and decoupled variables with the same local names.
Now, what does the returned function actually do? The most primitive iterator would offer only one operation: return the next element or signal exhaustion. This iterator is a bit more capable. Called without arguments, it returns the most recently read line, as often as desired; called with 'name', it returns the short-name of the file; called with 'next', it performs the aforementioned iteration, reading and returning the following line (or undef at end of file). The operation is selected by the single optional parameter, and the guard clause in the middle rejects any other request.
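To make the protocol concrete, here is a small self-contained sketch of the same closure technique, iterating over an in-memory list instead of a file handle (the function name lines_from_list is made up for this illustration):

```perl
use strict;
use warnings;

# Illustrative closure-based iterator over a fixed list of lines.
sub lines_from_list {
    my ($shortname, @data) = @_;
    my $line = shift @data;
    sub {
        return $line unless @_;               # no argument: peek at current line
        return $shortname if $_[0] eq 'name'; # 'name': report the short-name
        return undef unless $_[0] eq 'next';  # reject invalid requests
        $line = shift @data;                  # 'next': advance the iteration
        $line;
    }
}

my $it = lines_from_list('demo', "a\n", "b\n");
print $it->('name'), "\n";  # demo
print $it->();              # a (peeking does not advance)
$it->('next');
print $it->();              # b
$it->('next');              # returns undef: exhausted
```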
Let me add one remark on the declaration of lines: In my previous post on custom shells, I used only the “old” function call syntax, always adding an ampersand at the calling site. This time, I'm using the “new-style” function declaration and calling syntax with prototypes. Using this syntax, it is possible to enable some compile-time argument checking, and it is possible to refine the calling semantics, as will be shown with the sortfiles function in the following.
Combining Iterators
The core of the merging algorithm is implemented using a sortfiles function and a loop. The mytime function will be explained later; for now it is sufficient to know that its result is an integer which is strictly monotonically increasing with advancing time-stamps.
# sort file iterators in ascending order of the next timestamp
# (declared first so that the prototype is in effect at the call site)
sub sortfiles(\@) {
    @{$_[0]} = sort { mytime($a->()) <=> mytime($b->()) } @{$_[0]};
}

# generate line iterators for all files
my @files = map { lines($_, $_) } @ARGV;

# poll file iterators until all are exhausted
while (@files) {
    sortfiles @files;
    printf "%s | %s", $files[0]->('name'), $files[0]->();
    # close file when exhausted
    $files[0]->('next') or shift @files;
}
First, a list of iterators is produced from the list of arguments provided by the user. The map function comes in handy for this purpose: it sets the variable $_ to each element of @ARGV in turn, evaluates the block given as its first argument, and concatenates the results into a new list; Perl's map is known as flatMap in most functional programming languages.
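The flattening behaviour is easy to demonstrate in isolation (a generic illustration, not part of the merging script):

```perl
use strict;
use warnings;

# Each evaluation of the block may yield a list of any length;
# map concatenates all of these lists into one flat result.
my @pairs = map { ($_, $_ * 10) } (1, 2, 3);
print "@pairs\n";    # 1 10 2 20 3 30

# a block yielding an empty list filters the element out entirely
my @evens = map { $_ % 2 ? () : $_ } (1, 2, 3, 4);
print "@evens\n";    # 2 4
```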
While the list of iterators is not empty, sort the iterators according to the time-stamp of their next line, print the line from the earliest iterator, advance that iterator and discard it if it is exhausted. Since Perl's sort has used a stable merge sort since version 5.8 (stability can also be requested explicitly with use sort 'stable'), i.e. elements which compare equal keep their order, log lines with the same time-stamp are printed grouped by log-file.
The sortfiles function does the real work, but as so often, Perl takes away the tedium: the sort function takes a code block as its first parameter, which must evaluate to an integer for any pair ($a, $b) passed in. Negative values mean $a<$b, positive values mean $a>$b, and zero means $a==$b, according to the ordering implemented by the code block. Perl provides the nice <=> operator for exactly this purpose, as it calculates exactly these values for the normal ordering of numbers. Its associate cmp does the same for the lexical ordering of strings. The values to be compared in my case are the numerical representations of the time-stamps contained in the next lines available from the iterators $a and $b, so I call the iterators without argument, pass the result to the time-stamp conversion function and then apply Perl's <=> operator.
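The difference between numeric and lexical comparison is worth a quick stand-alone demonstration (a generic example, not part of the script):

```perl
use strict;
use warnings;

# numeric comparison: 2 is less than 10
print 2 <=> 10, "\n";      # -1

# string comparison: "2" sorts after "10", since the character '2' is
# greater than '1'
print "2" cmp "10", "\n";  # 1

# hence sorting timestamps-as-numbers needs <=>, not cmp
my @sorted = sort { $a <=> $b } (12, 2, 101);
print "@sorted\n";         # 2 12 101
```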
Following up on my earlier remark on function declaration: Notice how the sortfiles function is called with a list argument, but the list is passed “by-reference” due to the declaration with prototype \@.
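A minimal sketch of this calling semantics (the rotate function is invented for the illustration; note that the prototype only takes effect if the declaration is seen before the call):

```perl
use strict;
use warnings;

# With a \@ prototype the caller writes a plain array name,
# but the function receives a reference and can modify the array in place.
sub rotate(\@) {
    my ($aref) = @_;
    push @$aref, shift @$aref;
}

my @queue = (1, 2, 3);
rotate @queue;         # no backslash needed at the call site
print "@queue\n";      # 2 3 1
```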
The only thing missing now is a function which extracts the time-stamp from a log line:
use Time::Local;

# mapping of month name abbreviations (Jan=>0, Feb=>1, ...)
my %months = do {
    my $count = 0;
    map { ($_ => $count++) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
};

# determine timestamp of log line, assumes format 'Jan 1 00:00:00'
sub mytime($) {
    my ($m, $d, $H, $M, $S) = $_[0] =~ /(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)/;
    $m = $months{$m};
    timelocal($S, $M, $H, $d, $m, 2000);
}
In the case under consideration, the month is given as an abbreviated string, so the %months definition constructs a hash table mapping the month name to a number (0–11 as per Time::Local). Again, I'm using the map function, this time demonstrating that the code block given as its first argument is actually a closure, meaning that it has full access to the $count variable of this invocation. qw(...) is a handy construct which builds a list from a string literal—delimited here by parentheses, but any other character would do—by splitting on white-space; in this context let me warmly recommend reading man perlop. Placing the code inside a do {...} block makes the $count variable local to that block: there is no need to pollute the lexical file-scope with this temporary name.
The mytime function uses a regular expression to find a suitable time-stamp anywhere on the input line, extracting the month, day, hour, minute and second fields. The month is converted to its numeric representation using the hash table constructed beforehand. The resulting tuple is finally passed to the timelocal function from Time::Local. As our exemplary log-files do not contain the year, I use something which is guaranteed to be stable—meaning independent of the time of execution of the script—according to the Time::Local Year Value Interpretation. I do realize that if the log-files span multiple years, things will go south, but this cannot reliably be fixed since the information is simply not contained within the input files.
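Putting the pieces together, here is a self-contained check that the conversion is indeed monotonic on the sample lines from the beginning (the same code as above, merely wrapped for stand-alone execution):

```perl
use strict;
use warnings;
use Time::Local;

my %months = do {
    my $count = 0;
    map { ($_ => $count++) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
};

sub mytime($) {
    my ($m, $d, $H, $M, $S) = $_[0] =~ /(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)/;
    timelocal($S, $M, $H, $d, $months{$m}, 2000);
}

my $t1 = mytime("Mar 22 12:00:00 host task ...");
my $t2 = mytime("Mar 22 12:02:00 host task ...");
print $t1 < $t2 ? "monotonic\n" : "broken\n";
```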
Making the Output Nicer
The above code snippets would already get the job done, but depending on the input file-names the output looks quite ragged. The first fix for this is to suppress common path prefixes:
# test whether all elements of a list satisfy a given predicate
sub forall(&@) {
    my ($cond, @list) = @_;
    @list == grep { $cond->() } @list;
}

# find longest common path prefix
my ($prefix) = $ARGV[0] =~ m,^(.*/),;
$prefix //= '';
while ($prefix ne '' && $prefix ne '/') {
    last if forall { /^\Q$prefix/ } @ARGV;
    $prefix =~ s,/[^/]*/$,/,;
}
my $prefixlen = length $prefix;
my %shortnames = do {
    if ($prefixlen > 1) {
        map { ($_ => '.../'.substr($_, $prefixlen)) } @ARGV;
    } else {
        map { ($_ => $_) } @ARGV;
    }
};

# generate line iterators for all files
my @files = map { lines($_, $shortnames{$_}) } @ARGV;
The forall function demonstrates another neat use of Perl's function prototypes: when the first parameter is declared as &, an anonymous function can be supplied without the sub keyword in front. This enables the construction of functions which work exactly like map and grep. forall is a convenience function known from functional programming languages which checks whether all elements of a list satisfy a given predicate.
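A quick stand-alone demonstration of the & prototype in action (the same forall as above, applied to made-up lists):

```perl
use strict;
use warnings;

# the & prototype lets callers pass a bare block, just like with grep
sub forall(&@) {
    my ($cond, @list) = @_;
    @list == grep { $cond->() } @list;
}

my $all_positive = forall { $_ > 0 } 1, 2, 3;
my $all_large    = forall { $_ > 1 } 1, 2, 3;
print $all_positive ? "all positive\n" : "not all positive\n";
print $all_large    ? "all > 1\n"      : "not all > 1\n";
```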
Finding the longest common path prefix is implemented by taking the part of the first file-name leading up to the last slash and checking whether all other file-names also start with this prefix, stripping the last directory component and repeating the test until the prefix is empty or a common prefix is found. One useful feature when working with file paths is that the regular expression operators are not limited to the slash as pattern delimiter: any character will do, and as a special case the four bracketing pairs need the corresponding opening and closing characters and allow proper nesting. I usually use a comma when working with file-names, as that rarely occurs within those on my systems. The %shortnames hash contains the rewriting rules from file-name to short-name, which are either the identity if no long-enough prefix was found, or replace the prefix by .../. Consequently, the %shortnames hash needs to be consulted when creating the file iterators in the last line of the snippet above.
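For illustration, the prefix logic above can be condensed into a function and run on made-up paths (the helper shorten and the example paths are invented for this sketch):

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the common-prefix stripping shown above.
sub shorten {
    my @names = @_;
    my ($prefix) = $names[0] =~ m,^(.*/),;
    $prefix //= '';
    while ($prefix ne '' && $prefix ne '/') {
        last if @names == grep { /^\Q$prefix/ } @names;
        $prefix =~ s,/[^/]*/$,/,;   # drop the last directory component
    }
    my $len = length $prefix;
    return map { $len > 1 ? '.../' . substr($_, $len) : $_ } @names;
}

print join("\n", shorten(qw(/var/log/web/access.log /var/log/db/error.log))), "\n";
# .../web/access.log
# .../db/error.log
```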
Another nice feature would be to remove the directory components from all file-names which are unique:
my %unique;
for (@ARGV) {
    m,([^/]+)$,;
    $unique{$1}++;
}
%unique = map {
    m,([^/]+)$,;
    $unique{$1} == 1 ? ($_ => $1) : ();
} @ARGV;
$shortnames{$_} = $unique{$_} for keys %unique;
This code—when inserted before the generation of the iterators—does the job. First, the %unique hash table is used to count the frequency of occurrence of each basename (as found by the regular expression). Then the hash table is used to filter the @ARGV array and construct from it a new hash containing the rewriting rules from the full file-name to the basename. This is then applied to the %shortnames hash in the last step.
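Again condensed into a function for illustration (the helper basenames and the example paths are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: replace a path by its basename
# whenever that basename occurs exactly once.
sub basenames {
    my @names = @_;
    my %count;
    $count{ (m,([^/]+)$,)[0] }++ for @names;
    return map { my ($b) = m,([^/]+)$,; $count{$b} == 1 ? $b : $_ } @names;
}

my @short = basenames(qw(a/sys.log b/app.log b/sys.log));
print "@short\n";    # a/sys.log app.log b/sys.log
```

Note how the ambiguous sys.log keeps its directory component while the unique app.log is shortened.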
The last missing piece is the proper formatting of the output lines to make them look more regular. All that is needed is to pad all (modified) file-names to the same length:
use List::Util qw(max);

# determine maximum shortname length
my $maxlen = max map { length $_->('name') } @files;

# poll file iterators until all are exhausted
while (@files) {
    sortfiles @files;
    printf "%-${maxlen}s | %s", $files[0]->('name'), $files[0]->();
    # close file when exhausted
    $files[0]->('next') or shift @files;
}
The List::Util::max function yields the needed padding width, and the $maxlen variable is interpolated into the format string to widen the file-name field; the - qualifier makes the output left-aligned within the field.
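The interpolated width can be seen in a two-line example (the names are made up):

```perl
use strict;
use warnings;

# interpolating the computed width pads all name fields equally
my $maxlen = 8;
printf "%-${maxlen}s | %s\n", "file1",    "Mar 22 12:00:00 ...";
printf "%-${maxlen}s | %s\n", "long.log", "Mar 22 12:00:43 ...";
# file1    | Mar 22 12:00:00 ...
# long.log | Mar 22 12:00:43 ...
```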
Summary
This tool could be enhanced in numerous ways, e.g. by providing parsing facilities for several time-stamp formats in mytime, or by adding command line options to control the treatment of the file-names printed in the output.
As usual, and again without any warranty whatsoever, a compilation of the code snippets into an executable Perl script can be found here.
Anecdote
While cleaning up this script for publication, I found a bug in Debian Lenny's Perl v5.10.0. My original plan was to use formline with a tailored PICTURE argument, but it seems that only a string constant can be supplied. Every other method I tried (construction using string concatenation in place, passing the constructed string from a variable, putting the string in a HERE document) led to an error message from glibc complaining about an invalid realloc call.