.
Last update: 1997-05-20
9945-2-43.1 _____________________________________________________________________________ Topic: Regular Expressions - imprecise specification Relevant Sections: 2.8 Classification: Q12: Ambiguous. Q13: Defect. Defect Report: ----------------------- (from Andrew Hume Doug McIlroy) Issue D These questions address areas of imprecision in the specifications of REs. With regard to question [13], 9945-2 uses various terms roughly synonymously (leftmost, first, earliest) (subpattern, subexpression) for describing a left-to-right order across a line of text. It would be well if the text (and rationale) used one term consistently. ________________________________________ [12] Can a duplicated subexpression match the null string? If so, will the duplication be repeated until the expression does match the null string? Proposed Solution: A subexpression that can match a null string shall not be duplicated. Rationale: Although adjacent duplication symbols are illegal (for no apparent reason, particularly for EREs), \(x*\)* is a legal expression, in which the *s are not adjacent, that raises the question: how many times is the null string matched? Suppose that \(x*\)* were allowed. Does matching it to xxx yield \1 = xxx or \1 = null? The latter alterna- tive is consistent with the rule that a null string is longer than no match. By extending xxx with a null string instead of nothing, one gets a longer match. More null strings make even longer matches. To avert metaphysical questions about the last element of an infinite sequence, or an element following an infinite sequence, one could forbid null iterations. But this has an unsatisfying corollary that \(x*\)* wouldn't match the null string. Compromise positions might forbid null iterations after the first iteration or after the first null iteration. Such special pleading and the resultant implementation com- plexity is not worth the return. Lord and McIlroy have implemented a no-null-unless- first policy; Spencer has implemented repeat-until-null. Spencer's interpretation seems perhaps less strained, until you realize that it makes references ( \1, \2, etc) to con- tents of duplications useless because they will always be null or undefined. The proposed solution also outlaws EREs like (a?|b)*, (a?)\{2,2\}, and . ([a-z]*(|^))* The set of regular lan- guages is unchanged, however. There would still be legal equivalents to these and all other outlawed expressions. An alternative, to leave the meaning undefined, is unacceptable. Users will be unaware of the exact bounds of portability, and their biggest intellectual investments - the hardest expressions and sed scripts - are liable to be unportable. ________________________________________ [13] What is the meaning of the BRE \(\)? [2.8.5.2] Proposed Solution: Forbid this expression. It would be also acceptable, but rather less so, to define that the pattern \(\) matches the null string. Rationale: The pattern has no analog in EREs, no utility, and to our knowledge no basis in history. It should thus be banned. Otherwise, at least define what it means. WG15 response for 9945-2:1993 ----------------------------------- Q12 The standard does not require the successful matching of a null string after a successful match on a non null string for duplication symbols. However, the standard does not clearly say that you are not allowed to match null strings after a successful match. Therefore the standard in ambiguous. The definition of a back reference expression in section 2.8.3.3 does not specify which match of a subexpression followed by a duplication symbol is to be returned. We note however that the definition regcomp() defined in section B5.1 pg 728 344-346 indicate that the back reference expression will match the last string matched by the sub expression. The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor. Q13 Page 89 line 3263 requires this form to be accepted. But, the text 2.8.3 does not describe its meaning. Concerns about the wording of this part of the standard have been forwarded to the sponsor. Rationale for Interpretation: ----------------------------- None _____________________________________________________________________________