Last update: 1997-05-20
9945-2-60
_____________________________________________________________________________

Topic:                  regular expressions, 9945-2-92/INT #2
Relevant Sections:      B.5.2
Classification:         defect

Defect Report:
-----------------------

A recent interpretation 9945-2-92/INT #2 appears to be an incorrect
change in terms of the meaning of the words in POSIX.2. Could ISO/IEC
clarify the situation?

>WG15 response for 9945-2:1993
>-----------------------------------
>The subexpression representing the entire RE is to be included in the
>count represented in the re_nsub member.  No change in wording is
>necessary.

POSIX.2 is clear on line 285 on Page 727 that re_nsub contains the
number of PARENTHESIZED subexpressions, which is different from the
total number of subexpressions because pattern itself counts as a
subexpression (see line 337 on Page 728).

The interpretation given adds one to the value stored in re_nsub to
cover the subexpression which encompasses the whole expression but
which is not parenthesized.  We do not believe that this is correct.

The original interpretation request was as follows:

>  Topic:                 Regular expressions
>  Relevant Sections:     B.5.2
>
>Interpretation Request:
>-----------------------
>  In Section B.5.2 - Description {of C Binding for Regular
>  Expression Matching}, the standard states that the re_nsub
>  member of the regex_t structure represents the number of
>  parenthesized subexpressions found in pattern.  [Draft 12 of
>  ISO/IEC 9945-2:1993 (July 1992), p. 766, lines 329-331]
>
>  The standard then states that the pmatch argument
>
>       shall point to an array with at least nmatch
>       elements, and regexec() shall fill in the elements
>       of that array with offsets of the substrings of
>       string that correspond to the parenthesized
>       subexpressions of pattern: pmatch[i].rm_so shall
>       be the byte offset of the beginning and
>       pmatch[i].rm_eo shall be one greater than the byte
>       offset of the end of substring i.  (Subexpression
>       i begins at the ith matched open parenthesis,
>       counting from 1.)  Offsets in pmatch[0] shall
>       identify the substring that corresponds to the
>       entire regular expression.
>
>       [Ibid., p. 766-767, lines 339-346]
>
>  Thus, if pmatch[] contains nmatch elements, it can only hold
>  nmatch-1 parenthesized subexpressions of string, since
>  pmatch[0] represents the entire regular expression.
>
>  The standard also states that ``if there are more than
>  nmatch subexpressions in pattern (pattern itself counts as a
>  subexpression), then regexec() [...] shall record only the
>  first nmatch substrings.''  [Ibid., p. 767, lines 347-350]
>
>  Lines 347-350 appear to contradict lines 339-346; the latter
>  talks about parenthesized subexpressions, while the former
>  mentions plain subexpressions.  Is the intent of the
>  standard to allow the re_nsub member to include the
>  subexpression representing the entire regular expression in
>  the count (since it is considered a subexpression on page
>  767, lines 347-350), or does it only count explicitly
>  parenthesized subexpressions?  We believe this is the
>  easiest way to rectify the ambiguity.

There is no contradiction.  The two paragraphs are discussing two
different functions--regcomp and regexec.  It is VERY clear that the
value for re_nsub as set by regcomp is the number of actual groupings
present in the RE.  The second paragraph (discussing regexec) merely
makes it clear that pmatch[0] describes the entire RE matched, and
that nmatch must take that fact into account.  For example, if there
are two parenthesized REs, then one needs to have at least three
regmatch_t's to have all the sub matches recorded.

Let's assume for the moment that the entire RE counted as a
parenthesized RE; then re_nsub would be one higher than today.  It is
still the case that the entire RE's match is recorded in pmatch[0],
but it would ALSO have to be recorded in pmatch[1]!  This is because
the first subexpression has number 1, and it must be placed in
pmatch[1].

A new problem has also been noticed.  On looking at the POSIX.2
rationale, I note that Page 1040 lines 11926 and 11927 suggest that
nmatch should not be larger than re_nsub.  This statement seems to be
inaccurate, since nmatch should equal re_nsub+1 if all subexpression
data is to be captured.  It may, however, have influenced the
interpretation.

WG15 response for 9945-2:1993
-----------------------------------

The interpretation 9945-2-92/INT #2 is incorrect as noted above, and
has been withdrawn.  There is an error in the rationale (page 1040
lines 11926-11927); a future revision should change "re_nsub" to
"re_nsub+1".

Rationale for Interpretation:
-----------------------------

This is a "defect" situation and the previous interpretation has been
withdrawn.  It is expected that a future revision of the standard will
address the problem in the rationale.
_____________________________________________________________________________
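
The sizing rule discussed above (re_nsub counts only parenthesized
subexpressions, so nmatch must be re_nsub+1 to capture everything) can
be illustrated with a minimal sketch.  The helper name match_all() is
hypothetical and is not taken from POSIX.2 or from this record; the
use of REG_EXTENDED is likewise an assumption made for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a POSIX interface): compile `pattern` as an
 * extended RE, match it against `string`, and return a malloc'ed array
 * of re_nsub + 1 regmatch_t entries.
 *
 *   entry 0         the substring matched by the entire RE
 *   entry 1..nsub   the parenthesized subexpressions, counted from 1
 *
 * *nsub receives re_nsub as set by regcomp().  Returns NULL if the
 * pattern does not compile, allocation fails, or there is no match.
 * The caller frees the returned array.
 */
regmatch_t *match_all(const char *pattern, const char *string, size_t *nsub)
{
    regex_t re;
    regmatch_t *pm;
    size_t nmatch;

    assert(pattern != NULL && string != NULL && nsub != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    *nsub = re.re_nsub;          /* parenthesized subexpressions only */
    nmatch = re.re_nsub + 1;     /* one extra slot for pmatch[0]      */

    pm = malloc(nmatch * sizeof *pm);
    if (pm != NULL && regexec(&re, string, nmatch, pm, 0) != 0) {
        free(pm);
        pm = NULL;
    }

    regfree(&re);
    return pm;
}
```

With a pattern containing two parenthesized REs, such as "(a+)(b+)",
re_nsub is 2 and the array holds three regmatch_t entries, matching
the example given in the defect report.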