Bracket Expressions

Syntax of the builtin regular expression library

A bracket expression is a list of characters enclosed in `[]'. It normally matches any single character from the list (but see below). If the list begins with `^', it matches any single character (but see below) not from the rest of the list.
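As a rough illustration of plain and negated lists, the sketch below uses the C++ standard library's std::regex with its default ECMAScript grammar. This is an assumption made for the sake of a self-contained example: std::regex is not the builtin library described here, although these two simple forms behave the same way in both. The helper names matchesWhole and demoBracketList are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex (ECMAScript grammar) stands in for the builtin
// library; it is not the engine this page documents.
// Returns true if the whole of `text` matches `pattern`.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void demoBracketList() {
    assert(matchesWhole("[abc]", "b"));    // 'b' is in the list
    assert(!matchesWhole("[abc]", "d"));   // 'd' is not in the list
    assert(!matchesWhole("[abc]", "ab"));  // only a single character is matched
    assert(matchesWhole("[^abc]", "d"));   // leading '^' negates the list
    assert(!matchesWhole("[^abc]", "a"));  // 'a' is excluded by the negated list
}
```

Calling demoBracketList() from a test or from main runs the checks; a failed assertion aborts the program.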

If two characters in the list are separated by `-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint, so e.g. a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
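The range shorthand can be sketched in the same way. The example again assumes std::regex (ECMAScript grammar) as a stand-in for the builtin library and an ASCII-compatible character set; the helper names are hypothetical. It does not try to show the illegal a-c-e form, since engines differ in how they report it.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for the builtin library, and the
// execution character set is ASCII-compatible.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void demoRanges() {
    assert(matchesWhole("[0-9]", "7"));           // any decimal digit
    assert(!matchesWhole("[0-9]", "x"));          // a letter is outside the range
    assert(matchesWhole("[0-9a-f]+", "c0ffee"));  // two ranges in one list
    assert(!matchesWhole("[0-9a-f]+", "coffee")); // 'o' is outside both ranges
}
```

Where portability across collating sequences matters, a character class such as digit is the safer choice than a range.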

To include a literal ] or - in the list, the simplest method is to enclose it in `[.' and `.]' to make it a collating element (see below). Alternatively, make it the first character (following a possible `^'), or (AREs only) precede it with `\'. Alternatively, for `-', make it the last character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, make it a collating element or (AREs only) precede it with `\'. With the exception of these, some combinations using [ (see next paragraphs), and escapes, all other special characters lose their special significance within a bracket expression.
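Of the methods listed above, the backslash form is the easiest to try outside wxWidgets, because the ECMAScript grammar of std::regex happens to accept it as well. The sketch below is only an analogue under that assumption: it does not exercise the collating-element or first-character methods, which std::regex's default grammar treats differently. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex (ECMAScript grammar) stands in for an ARE engine;
// only the backslash method from the text is shown here.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void demoLiteralBracketAndDash() {
    // Raw string literals keep the backslashes readable.
    assert(matchesWhole(R"([\]\-])", "]"));    // escaped ']' is a list member
    assert(matchesWhole(R"([\]\-])", "-"));    // escaped '-' is a list member
    assert(!matchesWhole(R"([\]\-])", "\\"));  // the backslash itself is not
    assert(matchesWhole(R"([a\-z])", "-"));    // escaped '-' does not form a range
    assert(!matchesWhole(R"([a\-z])", "m"));   // so 'm' is not matched
}
```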

Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in `[.' and `.]' stands for the sequence of characters of that collating element.

wxWidgets: Currently no multi-character collating elements are defined. So in `[.X.]', X can either be a single character literal or the name of a character. For example, `[[.0.]-[.9.]]' and `[[.zero.]-[.nine.]]' are identical, and both mean the same as `[0-9]'. See Character Names.

Within a bracket expression, a collating element enclosed in `[=' and `=]' is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. An equivalence class may not be an endpoint of a range.

wxWidgets: Currently no equivalence classes are defined, so `[=X=]' stands for just the single character X. X can either be a single character literal or the name of a character, see Character Names.

Within a bracket expression, the name of a character class enclosed in `[:' and `:]' stands for the list of all characters (not all collating elements!) belonging to that class. Standard character classes are:

alpha    A letter.
upper    An upper-case letter.
lower    A lower-case letter.
digit    A decimal digit.
xdigit   A hexadecimal digit.
alnum    An alphanumeric (letter or digit).
print    An alphanumeric (same as alnum).
blank    A space or tab character.
space    A character producing white space in displayed text.
punct    A punctuation character.
graph    A character with a visible representation.
cntrl    A control character.

A character class may not be used as an endpoint of a range.

wxWidgets: In a non-Unicode build, these character classifications depend on the current locale, and correspond to the values returned by the ANSI C 'is' functions: isalpha, isupper, etc. In Unicode mode they are based on Unicode classifications, and are not affected by the current locale.
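The character classes can be tried out with a small sketch. It assumes std::regex (ECMAScript grammar) as a stand-in for the builtin library and the default "C" locale; std::regex takes its classifications from its own locale, which only loosely parallels the non-Unicode behaviour described above. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for the builtin library and the
// default "C" locale is in effect.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void demoCharacterClasses() {
    assert(matchesWhole("[[:digit:]]+", "2005"));             // decimal digits only
    assert(matchesWhole("[[:alpha:]][[:alnum:]]*", "x86"));   // letter, then alnums
    assert(!matchesWhole("[[:alpha:]][[:alnum:]]*", "86x"));  // must start with a letter
    assert(matchesWhole("[[:xdigit:]]+", "DEADbeef"));        // hexadecimal digits
    assert(matchesWhole("[[:upper:][:digit:]]+", "A1B2"));    // two classes in one list
    assert(matchesWhole("[^[:space:]]+", "no_spaces"));       // negated class
    assert(!matchesWhole("[^[:space:]]+", "has space"));      // contains a space
}
```

Note that the class name sits inside its own `[:' and `:]' pair, which in turn sits inside the brackets of the bracket expression.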

There are two special cases of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' are constraints, matching empty strings at the beginning and end of a word respectively. A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is an alnum character or an underscore (_). These special bracket expressions are deprecated; users of AREs should use constraint escapes instead (see Escapes below).
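The word definition above can be made concrete with a hand-written sketch that reports the offsets at which the beginning-of-word and end-of-word constraints would match. This is an illustration of the definition only, not how the library implements it; the functions isWordChar, wordStarts and wordEnds are hypothetical, and isalnum is evaluated in the current C locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A word character is an alnum character or an underscore.
bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Offsets where a word begins: a word character that is not preceded
// by another word character.
std::vector<std::size_t> wordStarts(const std::string& s) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isWordChar(s[i]) && (i == 0 || !isWordChar(s[i - 1]))) {
            out.push_back(i);
        }
    }
    return out;
}

// Offsets where a word ends: the position just past a word character
// that is not followed by another word character.
std::vector<std::size_t> wordEnds(const std::string& s) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isWordChar(s[i]) && (i + 1 == s.size() || !isWordChar(s[i + 1]))) {
            out.push_back(i + 1);
        }
    }
    return out;
}
```

For the input "foo_bar baz", wordStarts yields offsets 0 and 8 and wordEnds yields 7 and 11: the underscore joins foo and bar into a single word.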

ymasuda, November 19, 2005