Starting Point
Suppose you have some log-files, e.g. from different servers but covering the same incident, and you want to merge them in correct time order. All of them have time-stamps at the beginning of each line, looking like this:
Mar 22 12:00:00 host task ...
Mar 22 12:02:00 host task ...
The files are named file1, file2 and file3, already sorted according to time-stamp, and you want the output to look like this:
file1 | Mar 22 12:00:00 host1 taska ...
file2 | Mar 22 12:00:43 host2 taskb ...
file2 | Mar 22 12:00:58 host2 taskc ...
file1 | Mar 22 12:02:00 host1 taska ...
file3 | Mar 22 12:02:03 host3 taskb ...
The program logic is quite simple: open all three files and at each turn look at their next lines, selecting the one with the earliest time-stamp. As soon as one file is exhausted, remove it from the set and repeat until all files are processed.
Implementation
Let me present the implementation in three steps:
- making iterators
- using iterators
- adding convenience
Iterating over the Lines of a File
# generate iterator over lines of a file
sub lines($$) {
    my ($name, $shortname) = @_;
    open my $file, "<", $name or die "cannot open $name: $!\n";
    my $line = <$file>;
    sub {
        return $line unless @_;               # no argument: return current line
        return $shortname if $_[0] eq 'name'; # 'name': return the short-name
        return undef unless $_[0] eq 'next';  # reject any other request
        $line = <$file>;                      # 'next': advance to following line
        $line;
    }
}
An iterator is an object carrying the state of an iteration over some collection. Thus, it is first initialized, then repeatedly queried for its next element, until it is finally exhausted (infinite iterators are useful in other scenarios but not for the lines of a log-file and are therefore not discussed here). While Perl offers some object-oriented features, all the power needed to implement an iterator is at our disposal through the use of closures and anonymous functions. The lines function above takes the path-name and short-name of the file to be iterated over, opens the file, reads the first line and returns a new function. This function captures all variables which were in scope at the moment of its creation, including their current values and the ability to modify them. Care must be taken to avoid obscure side-effects when accessing variables which are still visible in other parts of the code, but our anonymous function is well-behaved as it accesses only the $line, $shortname and $file variables from the immediately enclosing function body: the captured instances are references to the variables of the single call to lines which created the anonymous function, so they cannot be accessed by anything else after lines has returned. The next invocation of lines will create completely fresh and decoupled variables with the same local names.
Now, what does the returned function actually do? The most primitive iterator would offer only one operation: return the next element or signal exhaustion. This iterator is a bit more capable. Called without arguments, it returns the most recently read line, as often as desired; called with 'name', it returns the short-name of the file; called with 'next', it performs the aforementioned iteration, reading and returning the following line (or undef at end of file). The operation is selected by the single optional parameter, and the guard clause in the middle rejects any other request.
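To make the protocol concrete, here is a small self-contained sketch of the same closure technique, iterating over an in-memory list instead of a file handle (the function name lines_from_list is made up for this illustration):

```perl
use strict;
use warnings;

# Illustrative closure-based iterator over a fixed list of lines.
sub lines_from_list {
    my ($shortname, @data) = @_;
    my $line = shift @data;
    sub {
        return $line unless @_;               # no argument: peek at current line
        return $shortname if $_[0] eq 'name'; # 'name': report the short-name
        return undef unless $_[0] eq 'next';  # reject invalid requests
        $line = shift @data;                  # 'next': advance the iteration
        $line;
    }
}

my $it = lines_from_list('demo', "a\n", "b\n");
print $it->('name'), "\n";  # demo
print $it->();              # a (peeking does not advance)
$it->('next');
print $it->();              # b
$it->('next');              # returns undef: exhausted
```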
Let me add one remark on the declaration of lines: In my previous post on custom shells, I used only the “old” function call syntax, always adding an ampersand at the calling site. This time, I'm using the “new-style” function declaration and calling syntax with prototypes. Using this syntax, it is possible to enable some compile-time argument checking, and it is possible to refine the calling semantics, as will be shown with the sortfiles function in the following.
Combining Iterators
The core of the merging algorithm is implemented using a sortfiles function and a loop. The mytime function will be explained later; for now it is sufficient to know that its result is an integer which is strictly monotonically increasing with advancing time-stamps.
# sort file iterators in ascending order of the next timestamp
# (declared first so that the prototype is in effect at the call site)
sub sortfiles(\@) {
    @{$_[0]} = sort { mytime($a->()) <=> mytime($b->()) } @{$_[0]};
}

# generate line iterators for all files
my @files = map { lines($_, $_) } @ARGV;

# poll file iterators until all are exhausted
while (@files) {
    sortfiles @files;
    printf "%s | %s", $files[0]->('name'), $files[0]->();
    # close file when exhausted
    $files[0]->('next') or shift @files;
}
First, a list of iterators is produced from the list of arguments provided by the user. The map function comes in handy for this purpose: it sets the variable $_ to each element of @ARGV in turn, evaluates the block given as its first argument, and concatenates the results into a new list; Perl's map is known as flatMap in most functional programming languages.
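The flattening behaviour is easy to demonstrate in isolation (a generic illustration, not part of the merging script):

```perl
use strict;
use warnings;

# Each evaluation of the block may yield a list of any length;
# map concatenates all of these lists into one flat result.
my @pairs = map { ($_, $_ * 10) } (1, 2, 3);
print "@pairs\n";    # 1 10 2 20 3 30

# a block yielding an empty list filters the element out entirely
my @evens = map { $_ % 2 ? () : $_ } (1, 2, 3, 4);
print "@evens\n";    # 2 4
```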
While the list of iterators is not empty, sort the iterators according to the time-stamp of their next line, print the line from the earliest iterator, advance that iterator and discard it if it is exhausted. Since Perl's sort has used a stable merge sort since version 5.8 (stability can also be requested explicitly with use sort 'stable'), i.e. elements which compare equal keep their order, log lines with the same time-stamp are printed grouped by log-file.
The sortfiles function does the real work, but as so often, Perl takes away the tedium: the sort function takes a code block as its first parameter, which must evaluate to an integer for any pair ($a, $b) passed in. Negative values mean $a<$b, positive values mean $a>$b, and zero means $a==$b, according to the ordering implemented by the code block. Perl provides the nice <=> operator for exactly this purpose, as it calculates exactly these values for the normal ordering of numbers. Its associate cmp does the same for the lexical ordering of strings. The values to be compared in my case are the numerical representations of the time-stamps contained in the next lines available from the iterators $a and $b, so I call the iterators without argument, pass the result to the time-stamp conversion function and then apply Perl's <=> operator.
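The difference between numeric and lexical comparison is worth a quick stand-alone demonstration (a generic example, not part of the script):

```perl
use strict;
use warnings;

# numeric comparison: 2 is less than 10
print 2 <=> 10, "\n";      # -1

# string comparison: "2" sorts after "10", since the character '2' is
# greater than '1'
print "2" cmp "10", "\n";  # 1

# hence sorting timestamps-as-numbers needs <=>, not cmp
my @sorted = sort { $a <=> $b } (12, 2, 101);
print "@sorted\n";         # 2 12 101
```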
Following up on my earlier remark on function declaration: Notice how the sortfiles function is called with a list argument, but the list is passed “by-reference” due to the declaration with prototype \@.
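A minimal sketch of this calling semantics (the rotate function is invented for the illustration; note that the prototype only takes effect if the declaration is seen before the call):

```perl
use strict;
use warnings;

# With a \@ prototype the caller writes a plain array name,
# but the function receives a reference and can modify the array in place.
sub rotate(\@) {
    my ($aref) = @_;
    push @$aref, shift @$aref;
}

my @queue = (1, 2, 3);
rotate @queue;         # no backslash needed at the call site
print "@queue\n";      # 2 3 1
```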
The only thing missing now is a function which extracts the time-stamp from a log line:
use Time::Local;

# mapping of month name abbreviations (Jan=>0, Feb=>1, ...)
my %months = do {
    my $count = 0;
    map { ($_ => $count++) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
};

# determine timestamp of log line, assumes format 'Jan 1 00:00:00'
sub mytime($) {
    my ($m, $d, $H, $M, $S) = $_[0] =~ /(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)/;
    $m = $months{$m};
    timelocal($S, $M, $H, $d, $m, 2000);
}
In the case under consideration, the month is given as an abbreviated string, so the %months definition constructs a hash table mapping the month name to a number (0–11 as per Time::Local). Again, I'm using the map function, this time demonstrating that the code block given as its first argument is actually a closure, meaning that it has full access to the $count variable of this invocation. qw(...) is a handy construct which builds a list from a string literal—delimited here by parentheses, but any other character would do—by splitting on white-space; in this context let me warmly recommend reading man perlop. Placing the code inside a do {...} block makes the $count variable local to that block: there is no need to pollute the lexical file-scope with this temporary name.
The mytime function uses a regular expression to find a suitable time-stamp anywhere on the input line, extracting the month, day, hour, minute and second fields. The month is converted to its numeric representation using the hash table constructed beforehand. The resulting tuple is finally passed to the timelocal function from Time::Local. As our exemplary log-files do not contain the year, I use something which is guaranteed to be stable—meaning independent of the time of execution of the script—according to the Time::Local Year Value Interpretation. I do realize that if the log-files span multiple years, things will go south, but this cannot reliably be fixed since the information is simply not contained within the input files.
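Putting the pieces together, here is a self-contained check that the conversion is indeed monotonic on the sample lines from the beginning (the same code as above, merely wrapped for stand-alone execution):

```perl
use strict;
use warnings;
use Time::Local;

my %months = do {
    my $count = 0;
    map { ($_ => $count++) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
};

sub mytime($) {
    my ($m, $d, $H, $M, $S) = $_[0] =~ /(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)/;
    timelocal($S, $M, $H, $d, $months{$m}, 2000);
}

my $t1 = mytime("Mar 22 12:00:00 host task ...");
my $t2 = mytime("Mar 22 12:02:00 host task ...");
print $t1 < $t2 ? "monotonic\n" : "broken\n";
```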
Making the Output Nicer
The above code snippets would already get the job done, but depending on the input file-names the output looks quite ragged. The first fix for this is to suppress common path prefixes:
# test whether all elements of a list satisfy a given predicate
sub forall(&@) {
    my ($cond, @list) = @_;
    @list == grep { $cond->() } @list;
}

# find longest common path prefix
my ($prefix) = $ARGV[0] =~ m,^(.*/),;
$prefix //= '';
while ($prefix ne '' && $prefix ne '/') {
    last if forall { /^\Q$prefix/ } @ARGV;
    $prefix =~ s,/[^/]*/$,/,;
}
my $prefixlen = length $prefix;
my %shortnames = do {
    if ($prefixlen > 1) {
        map { ($_ => '.../'.substr($_, $prefixlen)) } @ARGV;
    } else {
        map { ($_ => $_) } @ARGV;
    }
};

# generate line iterators for all files
my @files = map { lines($_, $shortnames{$_}) } @ARGV;
The forall function demonstrates another neat use of Perl's function prototypes: when the first parameter is declared as &, an anonymous function can be supplied without the sub keyword in front. This enables the construction of functions which work exactly like map and grep. forall is a convenience function known from functional programming languages which checks whether all elements of a list satisfy a given predicate.
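A quick stand-alone demonstration of the & prototype in action (the same forall as above, applied to made-up lists):

```perl
use strict;
use warnings;

# the & prototype lets callers pass a bare block, just like with grep
sub forall(&@) {
    my ($cond, @list) = @_;
    @list == grep { $cond->() } @list;
}

my $all_positive = forall { $_ > 0 } 1, 2, 3;
my $all_large    = forall { $_ > 1 } 1, 2, 3;
print $all_positive ? "all positive\n" : "not all positive\n";
print $all_large    ? "all > 1\n"      : "not all > 1\n";
```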
Finding the longest common path prefix is implemented by taking the part of the first file-name leading up to the last slash and checking whether all other file-names also start with this prefix, stripping the last directory component and repeating the test until the prefix is empty or a common prefix is found. One useful feature when working with file paths is that the regular expression operators are not limited to the slash as pattern delimiter: any character will do, and as a special case the four bracketing pairs need the corresponding opening and closing characters and allow proper nesting. I usually use a comma when working with file-names, as that rarely occurs within those on my systems. The %shortnames hash contains the rewriting rules from file-name to short-name, which are either the identity if no long-enough prefix was found, or replace the prefix by .../. Consequently, the %shortnames hash needs to be consulted when creating the file iterators in the last line of the snippet above.
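For illustration, the prefix logic above can be condensed into a function and run on made-up paths (the helper shorten and the example paths are invented for this sketch):

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the common-prefix stripping shown above.
sub shorten {
    my @names = @_;
    my ($prefix) = $names[0] =~ m,^(.*/),;
    $prefix //= '';
    while ($prefix ne '' && $prefix ne '/') {
        last if @names == grep { /^\Q$prefix/ } @names;
        $prefix =~ s,/[^/]*/$,/,;   # drop the last directory component
    }
    my $len = length $prefix;
    return map { $len > 1 ? '.../' . substr($_, $len) : $_ } @names;
}

print join("\n", shorten(qw(/var/log/web/access.log /var/log/db/error.log))), "\n";
# .../web/access.log
# .../db/error.log
```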
Another nice feature would be to remove the directory components from all file-names which are unique:
my %unique;
for (@ARGV) {
    m,([^/]+)$,;
    $unique{$1}++;
}
%unique = map {
    m,([^/]+)$,;
    $unique{$1} == 1 ? ($_ => $1) : ();
} @ARGV;
$shortnames{$_} = $unique{$_} for keys %unique;
This code—when inserted before the generation of the iterators—does the job. First, the %unique hash table is used to count the frequency of occurrence of each basename (as found by the regular expression). Then the hash table is used to filter the @ARGV array and construct from it a new hash containing the rewriting rules from the full file-name to the basename. This is then applied to the %shortnames hash in the last step.
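Again condensed into a function for illustration (the helper basenames and the example paths are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: replace a path by its basename
# whenever that basename occurs exactly once.
sub basenames {
    my @names = @_;
    my %count;
    $count{ (m,([^/]+)$,)[0] }++ for @names;
    return map { my ($b) = m,([^/]+)$,; $count{$b} == 1 ? $b : $_ } @names;
}

my @short = basenames(qw(a/sys.log b/app.log b/sys.log));
print "@short\n";    # a/sys.log app.log b/sys.log
```

Note how the ambiguous sys.log keeps its directory component while the unique app.log is shortened.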
The last missing piece is the proper formatting of the output lines to make them look more regular. All that is needed is to pad all (modified) file-names to the same length:
use List::Util qw(max);

# determine maximum shortname length
my $maxlen = max map { length $_->('name') } @files;

# poll file iterators until all are exhausted
while (@files) {
    sortfiles @files;
    printf "%-${maxlen}s | %s", $files[0]->('name'), $files[0]->();
    # close file when exhausted
    $files[0]->('next') or shift @files;
}
The List::Util::max function yields the needed padding width, and the $maxlen variable is interpolated into the format string to widen the file-name field; the - qualifier makes the output left-aligned within the field.
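The interpolated width can be seen in a two-line example (the names are made up):

```perl
use strict;
use warnings;

# interpolating the computed width pads all name fields equally
my $maxlen = 8;
printf "%-${maxlen}s | %s\n", "file1",    "Mar 22 12:00:00 ...";
printf "%-${maxlen}s | %s\n", "long.log", "Mar 22 12:00:43 ...";
# file1    | Mar 22 12:00:00 ...
# long.log | Mar 22 12:00:43 ...
```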
Summary
This tool could be enhanced in numerous ways, e.g. by providing parsing facilities for several time-stamp formats in mytime, or by adding command line options to control the treatment of the file-names printed in the output.
As usual, and again without any warranty whatsoever, a compilation of the code snippets into an executable Perl script can be found here.
Anecdote
While cleaning up this script for publication, I found a bug in Debian Lenny's Perl v5.10.0. My original plan was to use formline with a tailored PICTURE argument, but it seems that only a string constant can be supplied. Every other method I tried (construction using string concatenation in place, passing the constructed string from a variable, putting the string in a HERE document) led to an error message from glibc complaining about an invalid realloc call.