WG15 Defect Report Ref: 9945-2-135
Topic: Regular Expressions


This is an approved interpretation of 9945-2:1993.

.

Last update: 1997-05-20


								9945-2-135

 _____________________________________________________________________________

	Topic:                  Regular Expressions
	Relevant Sections:      2.8.2

Defect Report:
-----------------------

	From: [email protected] (Andrew Josey), [email protected] (Dave Prosser)
	Date: Tue, 1 Aug 1995 14:04:30 +0100

This is a request for interpretation of ISO/IEC 9945-2:1993:
Interpetation Request:

The fourth paragraph in subclause 2.8.2 (page 77, lines 2791-2796) says:

 Consistent with the whole match being the longest of the leftmost matches,
 each subpattern, from left to right, shall match the longest possible
 string.  For this purpose, a null string shall be considered to be longer
 than no match at all.  For example, matching the BRE \(.*\).* against
 abcdef, the subexpression (\1) is abcdef, and matching the BRE \(a*\)
 against bc, the subexpression (\1) is the null string.

This description is unfortunately very dependent on the terms "subpattern"
and "subexpression" which are not clearly defined.  For BREs, the latter
term is defined as a BRE enclosed between \( and \), there is only a
strong implication that the corresponding construct for EREs is also a
"subexpression".  Of the term "subpattern" I cannot find any further use.

For example, when matching the BRE

	a.*\(.*b.*\(.*c.*\).*d.*\).*e.*\(.*f.*\).*g

against "aabbccddeeffgg" there are quite a few possible ways to match, and
thus what "\1", "\2", and "\3" are supposed to be is muddy.

Two other example BREs extend the issue to repeated subexpressions:

	\(.\(.\)*\)*
and
	\(\(..\)*\)*

Both of these will be matched against "xxxx".

Let's assume that the only significant matching differences are those
distinguishable through the <regex.h> API.  (Strictly speaking, back
references can use intermediate subexpression matches, but it is possible
to reduce each instance of a back reference as if it were the end of a
complete RE.  Therefore, we can ignore back references as being a special
case of our simplifying assumption.)  This means, for example, that it
doesn't matter how the RE ".*.*.*" is matched against "abcdef" since only
the start and end offset of the entire RE is available.

It may be worth noting here why there was no equivalent problem with
historic RE implementations.  With EREs, there was no historic API that
provided an ERE with any match information beyond the start and end of
the full RE.  Since EREs don't include back references, subexpression
grouping solely governed operator binding; there was no remnant of the
groups themselves in the automata constructed.  With BREs, grouping
solely provided "note here" marks; one could not apply repetition
operations to a subexpression.  This means that there was no ambiguity
about what \1 matched, for example, since there was always exactly one
match.  Given this more limited BRE definition, a simple polynomial
implementation could be written that had its first match guaranteed to
maximize earlier portions of the RE.  (These "portions" of the BRE were
also easy to identify since it was possible to enumerate all combinations
since the BRE operators could only be applied to one-character REs.)
It is the extensions to historic REs that were added by POSIX.2 that is
the origin of this RFI's concerns.

I'll describe two different ways to understand the quoted paragraph:
the "subexpression" and the "fencepost" models.  Both of these models
produce the same results for the two examples in the quoted text, so
I'll use my more complex BREs above for this paper.

		Subexpression Model

The "subexpression" model interprets the quoted paragraph as reapplying
the longest-match rule used for the match of the entire RE with its
implied "zeroth subexpression".  In other words, given two RE matches,
for subexpressions 1 through n (in that order), choose the match with
a longest subexpression i (if the two i-th subexpression match have the
same length, go on to subexpression i+1).  For the first example RE,
these REs, in order, are used to judge the best match:

 0.  a.*\(.*b.*\(.*c.*\).*d.*\).*e.*\(.*f.*\).*g
 1.       .*b.*\(.*c.*\).*d.*
 2.              .*c.*
 3.                                   .*f.*

which produces the following match against aabbccddeeffgg:

 a .* \( .* b .* \( .* c .* \) .* d .* \) .* e .* \( .* f .* \) .* g
 a       a  b       bc c d        d e        e       f  f g        g

The "subexpression" model has the nice property that the best match
for all groups are chosen in the same manner, including the entire RE
(once the left edge is found).  It sometimes produces behavior counter
to expectations because the initial part of the RE is given no priority.
(This is a potential compatibility issue for us.)  However, it produces
a best match choice that is easy to determine and understand.

For the other two example REs, since the last instance of a repeated
subexpression is to be reported and thus maximized, we have:

 0.  \(.\(.\)*\)*                   0.  \(\(..\)*\)*
 1.  \(.\(.\)*\)                    1.  \(\(..\)*\)
 2.     \(.\)                       2.    \(..\)

matching xxxx:

 \( . \( .     \)* \)*                \( \( ..   \)* \)*
    x   [x][x]x                            [xx]xx
 \1:[0,4]=xxxx, \2:[3,4]=x            \1:[0,4]=xxxx, \2:[2,4]=xx

in which [...] represents a previous match of the subexpression.  In
essence, the "*" on subexpression one has no effect except when the
former BRE is matched against an empty string.  The best choice is
straightforward to determine and is obvious.

		Fencepost Model

The "fencepost" model takes each begin and end grouping mark as the
"subpatterns" separators.  In this approach, the length matched by the
initial portion of the RE outside of any group is first maximized; the
length matched by the final portion is immaterial.

There are two variants of this model: the "flat fence" scheme in which
all fenceposts are at one level, and the "nested fence" scheme in which
the fenceposts within an inner subexpression are not "visible" in the
containing subexpression.  I believe (but present no proof) that these
two variants produce the same result for REs without repeated
subexpressions.  The former is easier to explain, the latter is easier
to implement with regmatch_t rm_so's and rm_eo's.

For the first example RE, these two schemes maximize the following
patterns and produce this match against the sample string.  The
parenthesized steps are those that are immaterial, but have been
included for completeness.

 flat fence
  0.  a.*\(.*b.*\(.*c.*\).*d.*\).*e.*\(.*f.*\).*g
  1.  a.*
  2.       .*b.*
  3.              .*c.*
  4.                     .*d.*
  5.                            .*e.*
  6.                                   .*f.*
 (7.)                                         .*g

 a .* \( .* b .* \( .* c .* \) .* d .* \) .* e .* \( .* f .* \) .* g
 a ab       b c        c d        d e        e f        f g        g

 nested fence
  0.  a.*\(.*b.*\(.*c.*\).*d.*\).*e.*\(.*f.*\).*g
  1.  a.*
  2.       .*b.*\(.*c.*\).*d.*
  3.                            .*e.*
  4.                                   .*f.*
 (5.)                                         .*g
  6.       .*b.*
  7.              .*c.*
 (8.)                    .*d.*

Note that order of the nested fence steps 3-5 and 6-8 are independent.
This makes it easier to believe the the two schemes produce the same
behavior.

Of the two models, the fencepost model most closely matches historic
implementations (as discussed above), and since the highest priority
is given to the initial part of the RE, it more often seems to behave
as expected when the REs are the same as those provided in historic
BRE implementations.

However, for the other two example REs, the fencepost model is muddier:
How does one maximize the length of separated portions of REs when the
very separators are potentially applied over and over?  Moreover, with
these two examples, the strict subpatterns as separated by grouping
marks ("", ".", and "..") are always a fixed length--what's to maximize
here?

The only approach that I've come up with is to imagine that the repeated
subexpression is sequentially duplicated in an abstract RE without any
change of subexpression number.  The choice now is which of the possible
abstract REs to use.  For \(.\(.\)*\)*, the abstract REs would be the
following eight choices.  (I've used digits to mark the subexpression
numbers just below the parenthesis characters and again a square-bracketed
string is an unreported previous subexpression match.)  The third line
reframes the abstract RE as it would be seen when the "fenceposts" from
the reported subexpressions only are visible.

1. \( . \(  . \) \(  . \) \( . \) \)
    1 x  2 [x] 2  2 [x] 2  2 x  2  1	\1:[0,4]=xxxx, \2:[3,4]=x
   \( .     .        .    \( . \) \)

2. \(  . \(  . \) \(  . \) \) \( . \)
    1 [x] 2 [x] 2  2 [x] 2  1  1 x  1	\1:[3,4]=x, \2:[-1,-1]
       .     .        .       \( . \)

3. \(  . \(  . \) \) \( . \( . \) \)
    1 [x] 2 [x] 2  1  1 x  2 x  2  1	\1:[2,4]=xx, \2:[3,4]=x
       .     .       \( . \( . \) \)

4. \(  . \(  . \) \) \(  . \) \( . \)
    1 [x] 2 [x] 2  1  1 [x] 1  1 x  1	\1:[3,4]=x, \2:[-1,-1]
      .     .           .    \( . \)

5. \(  . \) \( . \(  . \) \( . \) \)
    1 [x] 1  1 x  2 [x] 2  2 x  2  1	\1:[1,4]=xxx, \2:[3,4]=x
       .    \( .     .    \( . \) \)

6. \(  . \) \(  . \(  . \) \) \( . \)
    1 [x] 1  1 [x] 2 [x] 2  1  1 x  1	\1:[3,4]=x, \2:[-1,-1]
       .        .     .       \( . \)

7. \(  . \) \(  . \) \( . \( . \) \)
    1 [x] 1  1 [x] 1  1 x  2 x  2  1	\1:[2,4]=xx, \2:[3,4]=x
       .        .    \( . \( . \) \)

8. \(  . \) \(  . \) \(  . \) \( . \)
    1 [x] 1  1 [x] 1  1 [x] 1  1 x  1	\1:[3,4]=x, \2:[-1,-1]
       .        .        .    \( . \)

Of these eight, we see that there are four unique matches as far as
would be reported by the <regex.h> API:

A. \1:[0,4], \2:[3,4]	(choice 1)
B. \1:[1,4], \2:[3,4]	(choice 5)
C. \1:[2,4], \2:[3,4]	(choices 3 and 7)
D. \1:[3,4], \2:[-1,-1]	(choices 2, 4, 6, and 8)

When these four are examined, D is the one that maximizes the length of
the subpattern between the start of the RE and the first fencepost, even
though subexpression 2 fails to match.  I find this counterintuitive.
One variation possible would be to give higher precedence to the abstract
REs that include more subexpression matches (again ordered from 1 to n).
This would make C the best match, which is at least somewhat intuitive,
but justification from the POSIX.2 standard for giving precedence to
choices that include more subexpression matches is at best a stretch:
there is only a statement that a match of an empty string is preferred to
a match failure.

There are many more abstract RE choices for \(\(..\)*\)* because
subexpression 1 can match an empty string.  Although there can be an
indefinite number of empty string matches, the only place where one is
visible is as the final match of a subexpression.  So, we have the
following four abstract REs as the significant choices:

1. \( \(  .. \) \( .. \) \)
    1  2 [xx] 2  2 xx  2  1		\1:[0,4]=xxxx, \2:[2,4]=xx
   \(     ..    \( .. \) \)

2. \( \(  .. \) \(  .. \) \) \(\)
    1  2 [xx] 2  2 [xx] 2  1  1 1	\1:[4,4]=, \2:[-1,-1]
	  ..        ..       \(\)

3. \( \(  .. \) \) \( \( .. \) \)
    1  2 [xx] 2  1  1  2 xx  2  2	\1:[2,4]=xx, \2:[2,4]=xx
	  ..       \( \( .. \) \)

4. \( \(  .. \) \) \( \(  .. \) \) \(\)
    1  2 [xx] 2  1  1  2 [xx] 2  1  1 1	\1:[4,4]=, \2:[-1,-1]
	  ..              ..       \(\)

Which reduce down to three unique matches

A. \1:[0,4], \2:[2,4]
B. \1:[2,4], \2:[2,4]
C. \1:[4,4], \2:[-1,-1]

and again the choice that maximizes the initial subpattern is C, which
is counterintuitive since it fails to match subexpression 2.  If higher
precedence is given to one that includes a match for subexpression 2,
choice B wins.

In essence, for REs with repeated subexpressions, the fencepost model
tends to choose the match that minimizes the reported lengths of repeated
subexpressions.  My guess is that this result most likely does not match
the user's expectation.  This is unfortunate since it is otherwise the
choice compatible with historic RE implementations.

		Conclusion

The POSIX.2 standard must be amended to clarify the "how to choose the
best match from the set of leftmost-longest" paragraph.  The choices I
can think of are:

 1. Specify that the choice is implementation defined (or unspecified);
    a portable application cannot rely on any particular match other
    than that it will be one of the leftmost-longest.

 2. Choose one of the models I've discussed above.  Of the two, the
    subexpression model is easiest to understand and describe, but it
    easily makes a choice that differs from existing practice--often
    when the RE user was not aware that their RE could be matched in
    more than one way!

 3. Make the best match model be a two-way RE compile-time flag and
    expose it to the user via options for those utilities that employ
    user-provided REs.

 4. Choose some other model.  If this approach is taken, be sure that
    there it is completely understood.  As a minimum, something like
    what I've put together in the above is A good start.  I would also
    expect that an implementation of the model be explored, too.  I've
    implemented both of the models I described in our POSIX.2 RE library.

I can support any of the above, and am willing to help implement some
other model if the fourth choice above is to be explored.  Without some
change in this area we providers remain swinging at the end of multiple
tails--users who want things to work the way they always have, and the
test suites that are going to take a particular interpretation of the
quoted paragraph and wait for you to prove their tests to be wrong.

Dave Prosser [email protected] (201)443-6227 Rm E221, Novell, Florham Park, NJ




Interpretation response
------------------------

        Interpretations can not amend the standard.  This request
        expands on the issues raised in Interpretation #43, and in
        that interpretation it is pointed out that the standard is
        unclear on these issues, and no conformance distinction can
        be made between alternative implementations based on this
        This is being referred to the sponsor.
  

Rationale
-------------
None.

Forwarded to Interpretations group: Aug 9 1995
Recirculated for 30 day review: Oct 19 1995
Finalised: Nov 20 1995