Creating precise text filters for dates and numbers

KarstenHeinrich · February 2, 2017, 4:47pm

i-net PDFC is very powerful and precise when comparing two versions of the same document. This use case in particular, however, often produces unnecessary differences such as a changed revision number or print date. Although these are obviously false positives, i-net PDFC cannot ignore them automatically because their location and structure are defined by the author of the documents.

A common unnecessary difference in a yearly report would be, for instance:

The text ‘2016’ was replaced by ‘2017’

Creating a context based pattern

i-net PDFC has a built-in filter that can exclude text from the comparison either by an exact match or by a pattern.
Please read the help page for the regular expression filter first! It gives an introduction on how to develop a filter pattern. This FAQ entry shows how to improve such a pattern further.

A simple pattern to ignore the years would be for instance:

20[12]\d

Unfortunately, this filter will match inside any number that merely contains the pattern, such as 20202020. So a first improvement is to define the bounds of the element as well. We expect the characters on both sides to be ‘not a digit’, so we can use \D to define the bounds:

\D20[12]\d\D

This filter will only match numbers from 2010 to 2029. However, it still matches any such number anywhere in the document: a monetary value of $2025 would be matched and excluded by this filter as well. So we have to be stricter about the surroundings of the year.
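
The difference between the two patterns can be checked outside i-net PDFC with plain java.util.regex. This is only a sketch: the class name BoundedYearDemo and the sample strings are made up for illustration, and it assumes that the filter uses Java-style regular expression syntax and searches for the pattern anywhere in the text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of i-net PDFC.
class BoundedYearDemo {
    // Unbounded pattern: matches wherever these four characters occur.
    static final Pattern UNBOUNDED = Pattern.compile("20[12]\\d");
    // Bounded pattern: requires a non-digit directly before and after the year.
    static final Pattern BOUNDED = Pattern.compile("\\D20[12]\\d\\D");

    // True if the pattern is found anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(UNBOUNDED, "id 20202020 end")); // true
        System.out.println(found(BOUNDED, "id 20202020 end"));   // false
        System.out.println(found(BOUNDED, "total $2025 due"));   // true
    }
}
```

Note that \D has to consume a real character, so in this sketch a year standing at the very start or very end of the text is not matched by the bounded pattern.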

If, for instance, there is a footer line such as ‘business year 2017’, the pattern should take this context into account:

business\syear\s20[12]\d\D

Note: \s matches any whitespace character (space, tab, line break), not just a plain space.
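
A minimal sketch of how this context pattern behaves, again using java.util.regex rather than i-net PDFC itself. The class name ContextYearDemo, the helper firstMatch, and the sample strings are hypothetical. The example only shows which text the pattern matches; how i-net PDFC applies the match internally is not simulated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of i-net PDFC.
class ContextYearDemo {
    static final Pattern FOOTER_YEAR =
            Pattern.compile("business\\syear\\s20[12]\\d\\D");

    // Returns the first matched text, or null if the pattern does not occur.
    static String firstMatch(String text) {
        Matcher m = FOOTER_YEAR.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The keyword is part of the match, so it would be excluded too.
        System.out.println(firstMatch("Report for business year 2017.")); // business year 2017.
        // A year without the keyword in front of it is left alone.
        System.out.println(firstMatch("total $2025 due")); // null
        // \s also accepts a tab or a line break between the words.
        System.out.println(firstMatch("business\tyear\n2017 ") != null); // true
    }
}
```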

Checking context without matching

Now assume that ‘business year’ was changed to ‘financial year’ at some point. The pattern will no longer match because we used ‘business’ as a keyword. A simple ‘match all’ operator (which is .*?) would solve the issue, but it would also hide the change from ‘business’ to ‘financial’ year.

The solution is to use ‘look ahead’ or ‘look behind’. In our example we could use:

(?<=(?:business|financial)\syear)\s20[12]\d\D

The group (?<= … ) is a ‘look behind’: it checks that the given text stands directly in front of the match, but does not make it part of the match. As a result, the filter only matches the year if it comes after ‘business year’ or ‘financial year’. Only the number is excluded; the keyword text is still compared. To check for a certain context after the actual match, use the ‘look ahead’ group (?= … ).
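
The effect of the look-behind can be sketched with java.util.regex as follows. The class name LookaroundYearDemo, the helper firstMatch, and the sample strings are hypothetical. The alternation is wrapped in a non-capturing group (?: … ) so that \syear applies to both keywords. The second pattern, DIGITS_ONLY, is a variant that is not taken from the text above: it moves the whitespace into the look-behind and turns the trailing \D into a look-ahead, so that only the four digits are matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of i-net PDFC.
class LookaroundYearDemo {
    // Look-behind: the keyword is checked but is not part of the match.
    static final Pattern AFTER_KEYWORD =
            Pattern.compile("(?<=(?:business|financial)\\syear)\\s20[12]\\d\\D");
    // Assumed variant: look-behind plus look-ahead, matching the digits only.
    static final Pattern DIGITS_ONLY =
            Pattern.compile("(?<=(?:business|financial)\\syear\\s)20[12]\\d(?=\\D)");

    // Returns the first matched text, or null if the pattern does not occur.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Prints the match in brackets to make the whitespace visible.
        System.out.println("[" + firstMatch(AFTER_KEYWORD, "financial year 2017.") + "]"); // [ 2017.]
        System.out.println("[" + firstMatch(AFTER_KEYWORD, "business year 2016,") + "]");  // [ 2016,]
        System.out.println("[" + firstMatch(DIGITS_ONLY, "business year 2016,") + "]");    // [2016]
        System.out.println(firstMatch(AFTER_KEYWORD, "total $2025 due"));                  // null
    }
}
```

In both cases the words ‘business’ and ‘financial’ stay outside the match, so a change from one to the other would still be reported.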