Syntax of the builtin regular expression library
Escapes (AREs only), which begin with a \ followed by an alphanumeric character, come in several varieties: character entry, class shorthands, constraint escapes, and back references. A \ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In EREs, there are no escapes: outside a bracket expression, a \ followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression, \ is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
Character-entry escapes (AREs only) exist to make it easier to specify non-printing and otherwise inconvenient characters in REs:
\a | alert (bell) character, as in C |
\b | backspace, as in C |
\B | synonym for \ to help reduce backslash doubling in some applications where there are multiple levels of backslash processing |
\cX | (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero |
\e | the character whose collating-sequence name is `ESC', or failing that, the character with octal value 033 |
\f | formfeed, as in C |
\n | newline, as in C |
\r | carriage return, as in C |
\t | horizontal tab, as in C |
\uwxyz | (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering |
\Ustuvwxyz | (where stuvwxyz is exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode extension to 32 bits |
\v | vertical tab, as in C |
\xhhh | (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used) |
\0 | the character whose value is 0 |
\xy | (where xy is exactly two octal digits, and is not a back reference (see below)) the character whose octal value is 0xy |
\xyz | (where xyz is exactly three octal digits, and is not a back reference (see below)) the character whose octal value is 0xyz |
Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'. Octal digits are `0'-`7'.
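The arithmetic behind the control-character and numeric escapes can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the library's code: the function names are hypothetical, and the sketch assumes the resulting value fits in Python's code point range.

```python
def control_char(x: str) -> str:
    # \cX: keep the low-order 5 bits of X and zero all other bits.
    if len(x) != 1:
        raise ValueError("X must be a single character")
    return chr(ord(x) & 0x1F)


def hex_escape(digits: str) -> str:
    # \xhhh: a single character, however many hexadecimal digits are given.
    return chr(int(digits, 16))


def octal_escape(digits: str) -> str:
    # \xy or \xyz: exactly two or three octal digits.
    if len(digits) not in (2, 3) or any(d not in "01234567" for d in digits):
        raise ValueError("expected two or three octal digits")
    return chr(int(digits, 8))


print(repr(control_char("G")))    # '\x07', the alert character
print(repr(control_char("[")))    # '\x1b', i.e. octal 033
print(repr(hex_escape("41")))     # 'A'
print(repr(octal_escape("135")))  # ']'
```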
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII, but \135 does not terminate a bracket expression. Beware, however, that some applications (e.g., C compilers) interpret such sequences themselves before the regular-expression package gets to see them, which may require doubling (quadrupling, etc.) the `\'.
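Python string literals show the same effect as the C compilers mentioned above: the host language may consume the escape before any regular-expression package sees it. The variable names below are illustrative only.

```python
# What the host language would hand to a regular-expression package.
plain = "\135"     # Python itself turns \135 into the single character ]
doubled = "\\135"  # doubling the backslash keeps the escape intact
raw = r"\135"      # a raw string literal avoids the doubling

print(plain)                 # ]
print(doubled)               # \135
print(doubled == raw)        # True
print(len(plain), len(raw))  # 1 4
```

In the first case the package would receive a literal ], which does terminate a bracket expression; only the doubled or raw form delivers the four-character escape.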
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used character classes:
\d | [[:digit:]] |
\s | [[:space:]] |
\w | [[:alnum:]_] (note underscore) |
\D | [^[:digit:]] |
\S | [^[:space:]] |
\W | [^[:alnum:]_] (note underscore) |
Within bracket expressions, `\d', `\s', and `\w' lose their outer brackets, and `\D', `\S', and `\W' are illegal. (So, for example, [a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equivalent to [a-c^[:digit:]], is illegal.)
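The bracket-expression rule can be illustrated with a small Python sketch that rewrites the body of a bracket expression (the text between the enclosing [ and ]). The helper name and the table of expansions are assumptions made for this example; the real library performs this step internally while parsing.

```python
# Standalone expansions of the positive class shorthands.
SHORTHANDS = {
    "d": "[[:digit:]]",
    "s": "[[:space:]]",
    "w": "[[:alnum:]_]",
}
# Negated shorthands, which are rejected inside a bracket expression.
NEGATED = {"D", "S", "W"}


def expand_in_bracket(body: str) -> str:
    """Expand class shorthands inside the body of a bracket expression."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in NEGATED:
                raise ValueError(f"\\{nxt} is illegal inside a bracket expression")
            if nxt in SHORTHANDS:
                # Drop the outer brackets of the standalone expansion.
                out.append(SHORTHANDS[nxt][1:-1])
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


print("[" + expand_in_bracket(r"a-c\d") + "]")  # [a-c[:digit:]]
```

Other escapes are passed through unchanged; the sketch handles only the six class shorthands.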
A constraint escape (AREs only) is a constraint, matching the empty string if specific conditions are met, written as an escape:
\A | matches only at the beginning of the string (see Matching, below, for how this differs from `^') |
\m | matches only at the beginning of a word |
\M | matches only at the end of a word |
\y | matches only at the beginning or end of a word |
\Y | matches only at a point that is not the beginning or end of a word |
\Z | matches only at the end of the string (see Matching, below, for how this differs from `$') |
\m | (where m is a nonzero digit) a back reference, see below |
\mnn | (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference, see below |
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are illegal within bracket expressions.
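The word constraints can be pictured as tests on the characters on either side of a position. The following Python sketch assumes that a word character is an alphanumeric or an underscore; the function names are hypothetical, and the real engine may classify characters according to the locale.

```python
def is_word_char(ch: str) -> bool:
    # Assumption: word characters are alphanumerics plus underscore.
    return ch.isalnum() or ch == "_"


def at_word_start(text: str, pos: int) -> bool:
    # Beginning of a word: a word character follows and none precedes.
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return after and not before


def at_word_end(text: str, pos: int) -> bool:
    # End of a word: a word character precedes and none follows.
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before and not after


def at_word_edge(text: str, pos: int) -> bool:
    # Beginning or end of a word.
    return at_word_start(text, pos) or at_word_end(text, pos)


text = "ab cd"
positions = range(len(text) + 1)
print([p for p in positions if at_word_start(text, p)])     # [0, 3]
print([p for p in positions if at_word_end(text, p)])       # [2, 5]
print([p for p in positions if not at_word_edge(text, p)])  # [1, 4]
```

Each test consumes no characters, which is what makes these escapes constraints rather than ordinary atoms.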
A back reference (AREs only) matches the same string matched by the parenthesized subexpression specified by the number, so that (e.g.) ([bc])\1 matches bb or cc but not bc. The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions.
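Python's re module is a different engine from the one described here, but it behaves the same way for simple back references, so it can serve as a quick illustration. The second pattern is a made-up example that shows numbering by leading parenthesis and a non-capturing group written as (?:...).

```python
import re

pattern = re.compile(r"([bc])\1")
print(bool(pattern.fullmatch("bb")))  # True
print(bool(pattern.fullmatch("cc")))  # True
print(bool(pattern.fullmatch("bc")))  # False

# Groups are numbered by their opening parenthesis; (?:x) is not counted.
nested = re.compile(r"((a)(?:x)(b))\3")
m = nested.fullmatch("axbb")
print(m.group(1), m.group(2), m.group(3))  # axb a b
```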
There is an inherent historical ambiguity between octal character-entry escapes and back references, which is resolved by heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e. the number is in the legal range for a back reference), and otherwise is taken as octal.
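The disambiguation rules can be written down as a small decision function. This Python sketch is an assumption-laden illustration, not the library's parser: the function name and its inputs are hypothetical, and it neither validates that an octal escape contains only octal digits nor decides how many digits the escape consumes.

```python
def classify_digit_escape(digits: str, groups_so_far: int) -> str:
    """Classify the digits after a backslash as 'octal' or 'backref'.

    digits: the run of decimal digits following the backslash.
    groups_so_far: number of capturing subexpressions available at this point.
    """
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] == "0":
        return "octal"    # a leading zero always indicates an octal escape
    if len(digits) == 1:
        return "backref"  # a single non-zero digit is always a back reference
    if int(digits) <= groups_so_far:
        return "backref"  # in the legal range for a back reference
    return "octal"        # otherwise taken as octal


print(classify_digit_escape("012", 0))  # octal
print(classify_digit_escape("7", 0))    # backref
print(classify_digit_escape("12", 12))  # backref
print(classify_digit_escape("12", 3))   # octal
```

The last two calls show the heuristic at work: the same escape changes meaning depending on how many capturing subexpressions precede it.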
ymasuda, 19 November 2005