.
Last update: 1997-05-20
9945-2-42 _____________________________________________________________________________ Topic: Regular Expression Issues Relevant Sections: 2.8.3 Defect Report: ----------------------- (from Andrew Hume Doug McIlroy) RE Issues Issue C The scheme used for defining REs and their semantics is based on the historical notion that primitive REs match a single character. This was broken with the introduction of CEs, as they may be multiple characters in length. These were added through bracket expressions so as to cause as little impact to the rest of the specification. This had two side-effects. The first is interactions with the rest of the semantics. The second is that the his- torical and careful difference in semantics between BREs and EREs, which generally allowed BREs to be implemented with polynomial-time algorithms, was erased. Both BREs and EREs now in principle require algorithms with runtimes that are exponential in the size of the pattern. For example, on line 2860 (2.8.3.2), a bracket expression is defined to match a character or CE. Unfortunately, if a duplicated bracket expression contains multibyte CEs, then (exponential runtime) backtracking could be necessary to find the longest match. The main problem with 2.8.3.1 is that it is unclear how to interpret matching a character or CE against a string. How is the string interpreted? As a sequence of characters, a sequence of CEs, or as a mixture? ________________________________________ [9] On line 2860 (2.8.3.2), a bracket expression is defined to match a single CE. Yet, line 2890 allows matching any ``character or CE''. Proposed Solution: Change the phrase ``character or CE'' to ``CE''. Rationale: The wording suggests a nonexistent distinction. ________________________________________ [10] On line 2860 (2.8.3.2), a bracket expression is defined to match a single CE. Yet, on lines 2949-2952 and lines 2957-2969, frequent reference is made to ``characters''. Proposed Solution: Change lines 2949-2952 and lines 2957-2969 to refer only to CEs, and not to characters. Rationale: Same as previous rationale. ________________________________________ [11] Within 2.8.3.2, the term ``expression'' is used often without a qualifier, and in these cases its meaning is unclear. Examples include lines 2886 and 2891. Proposed Solution: The apparent intended meaning, namely CEs, should be used. WG15 response for 9945-2:1993 ----------------------------------- Part 9 Since the standard is unclear as to how strings are broken up into a series of collating elements (see interpretation #40) it is also unclear how to interpret matching a character or collating element against a string, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor. Part10 The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor. Concerns about the wording of this part of the standard have been forwarded to the sponsor. Part 11 The reference to "expressions" on page 80, line 2886 refers to the definition on page 79, lines 2864-2866, and conforming implementations must conform to this. Rationale for Interpretation: ----------------------------- None. Proposed resolution forwarded: Aug 11 1995 Finalized: Sept 12 1995