Escapes

Syntax of the builtin regular expression library

Escapes (AREs only), which begin with a $\backslash$ followed by an alphanumeric character, come in several varieties: character entry, class shorthands, constraint escapes, and back references. A $\backslash$ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In EREs, there are no escapes: outside a bracket expression, a $\backslash$ followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression, $\backslash$ is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify non-printing and otherwise inconvenient characters in REs:

$\backslash$a alert (bell) character, as in C
$\backslash$b backspace, as in C
$\backslash$B synonym for $\backslash$ to help reduce backslash doubling in some applications where there are multiple levels of backslash processing
$\backslash$cX (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero
$\backslash$e the character whose collating-sequence name is `ESC', or failing that, the character with octal value 033
$\backslash$f formfeed, as in C
$\backslash$n newline, as in C
$\backslash$r carriage return, as in C
$\backslash$t horizontal tab, as in C
$\backslash$uwxyz (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering
$\backslash$Ustuvwxyz (where stuvwxyz is exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode extension to 32 bits
$\backslash$v vertical tab, as in C
$\backslash$xhhh (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used).
$\backslash$0 the character whose value is 0
$\backslash$xy (where xy is exactly two octal digits, and is not a back reference (see below)) the character whose octal value is 0xy
$\backslash$xyz (where xyz is exactly three octal digits, and is not a back reference (see below)) the character whose octal value is 0xyz

Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'. Octal digits are `0'-`7'.
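
The bit rule given for $\backslash$cX, and the octal value quoted for $\backslash$e, can be checked with a little arithmetic. The following Python sketch is only an illustration: the helper name control_char is hypothetical, and the code does not call the regular-expression package itself.

```python
def control_char(x: str) -> str:
    """Return the character that the escape cX denotes under the rule above:
    keep the low-order 5 bits of X and clear all other bits."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    return chr(ord(x) & 0x1F)


# 'A' (0x41) and 'a' (0x61) share their low-order 5 bits, so both map to value 1.
print(ord(control_char("A")), ord(control_char("a")))

# '[' (0x5B) maps to 0x1B, which is octal 033, the fallback value listed for the e escape.
print(control_char("[") == chr(0o33))

# Hexadecimal digits are case-insensitive: 1b and 1B denote the same value.
print(int("1b", 16) == int("1B", 16) == 0o33)
```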

The character-entry escapes are always taken as ordinary characters. For example, $\backslash$135 is ] in ASCII, but $\backslash$135 does not terminate a bracket expression. Beware, however, that some applications (e.g., C compilers) interpret such sequences themselves before the regular-expression package gets to see them, which may require doubling (quadrupling, etc.) the `$\backslash$'.

Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used character classes:

$\backslash$d $[[:digit:]]$
$\backslash$s $[[:space:]]$
$\backslash$w $[[:alnum:]\_]$ (note underscore)
$\backslash$D $[^[:digit:]]$
$\backslash$S $[^[:space:]]$
$\backslash$W $[^[:alnum:]\_]$ (note underscore)

Within bracket expressions, `$\backslash$d', `$\backslash$s', and `$\backslash$w' lose their outer brackets, and `$\backslash$D', `$\backslash$S', and `$\backslash$W' are illegal. (So, for example, $[$a-c$\backslash$d$]$ is equivalent to $[a-c[:digit:]]$. Also, $[$a-c$\backslash$D$]$, which is equivalent to $[a-c^[:digit:]]$, is illegal.)
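
The expansion rule for class shorthands can be sketched as a small lookup. This Python fragment is a hypothetical model of the rule as stated above, not code taken from the regular-expression package; the names CLASS_BODY and expand_shorthand are assumptions made for the example.

```python
# Bracket-expression body for each lower-case shorthand (hypothetical table).
CLASS_BODY = {"d": "[:digit:]", "s": "[:space:]", "w": "[:alnum:]_"}


def expand_shorthand(letter: str, inside_brackets: bool) -> str:
    """Expand a class-shorthand letter (d, s, w, D, S, W) per the rules above."""
    body = CLASS_BODY.get(letter.lower())
    if body is None:
        raise ValueError("not a class shorthand: " + repr(letter))
    negated = letter.isupper()
    if inside_brackets:
        if negated:
            # The upper-case shorthands are illegal inside a bracket expression.
            raise ValueError("negated shorthand is illegal inside brackets")
        # Inside brackets the shorthand loses its outer brackets.
        return body
    return "[" + ("^" if negated else "") + body + "]"


# Outside brackets: the full bracket expression.
print(expand_shorthand("d", inside_brackets=False))   # [[:digit:]]
print(expand_shorthand("W", inside_brackets=False))   # [^[:alnum:]_]

# Inside brackets: only the body, so a-c plus the d shorthand becomes [a-c[:digit:]].
print("[a-c" + expand_shorthand("d", inside_brackets=True) + "]")
```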

A constraint escape (AREs only) is a constraint, matching the empty string if specific conditions are met, written as an escape:

$\backslash$A matches only at the beginning of the string (see Matching, below, for how this differs from `^')
$\backslash$m matches only at the beginning of a word
$\backslash$M matches only at the end of a word
$\backslash$y matches only at the beginning or end of a word
$\backslash$Y matches only at a point that is not the beginning or end of a word
$\backslash$Z matches only at the end of the string (see Matching, below, for how this differs from `$')
$\backslash$m (where m is a nonzero digit) a back reference, see below
$\backslash$mnn (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference, see below

A word is defined as in the specification of $[[:<:]]$ and $[[:>:]]$ above. Constraint escapes are illegal within bracket expressions.
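
For readers who want to experiment with the word constraints, the sketch below approximates them in Python. This is an assumption-laden analogy, not the ARE engine: Python's re module has no $\backslash$m, $\backslash$M, $\backslash$y, or $\backslash$Y, so the patterns use Python's own word-boundary escapes with lookarounds, and re.ASCII is passed so that a word character means an ASCII alphanumeric or underscore. The constant names are hypothetical.

```python
import re

# Assumed Python counterparts of the ARE constraint escapes (not ARE syntax).
WORD_START = r"\b(?=\w)"     # plays the role of the m escape: beginning of a word
WORD_END = r"\b(?<=\w)"      # plays the role of the M escape: end of a word
WORD_EDGE = r"\b"            # plays the role of the y escape: beginning or end
NOT_WORD_EDGE = r"\B"        # plays the role of the Y escape: neither


def positions(pattern: str, text: str) -> list:
    """Return every offset in text where the empty-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text, re.ASCII)]


text = "foo bar"
print(positions(WORD_START, text))     # [0, 4]
print(positions(WORD_END, text))       # [3, 7]
print(positions(WORD_EDGE, text))      # [0, 3, 4, 7]
print(positions(NOT_WORD_EDGE, text))  # [1, 2, 5, 6]
```

Each pattern matches the empty string, which is what makes these constraints rather than ordinary atoms: they consume no characters and only test the surrounding context.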

A back reference (AREs only) matches the same string matched by the parenthesized subexpression specified by the number, so that (e.g.) ($[bc]$)$\backslash$1 matches `bb' or `cc' but not `bc'. The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions.
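
The back-reference example can be tried out directly. The sketch below uses Python's re module, which is a different engine from the one documented here; it is shown only because this particular construct (a capturing group followed by a reference to group 1, and the non-capturing form) behaves the same way there.

```python
import re

# The group captures one character; the back reference must repeat that same text.
doubled = re.compile(r"([bc])\1")
for candidate in ("bb", "cc", "bc"):
    print(candidate, bool(doubled.fullmatch(candidate)))

# Non-capturing parentheses are not numbered, so group 1 is still ([bc]).
# If (?:a) were counted, the reference would demand a trailing "a" instead.
skip_noncapturing = re.compile(r"(?:a)([bc])\1")
print(bool(skip_noncapturing.fullmatch("abb")))  # True
print(bool(skip_noncapturing.fullmatch("aba")))  # False
```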

There is an inherent historical ambiguity between octal character-entry escapes and back references, which is resolved by heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e. the number is in the legal range for a back reference), and otherwise is taken as octal.
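
The three rules of this heuristic can be written out as a small decision function. This is a sketch of the rules as stated in the paragraph above, not the package's actual lexer: the function name and its parameters are hypothetical, it assumes the caller has already isolated the run of digits following the backslash, and it does not check that a sequence classified as octal contains only octal digits.

```python
def classify_digit_escape(digits: str, closed_groups: int) -> str:
    """Classify a backslash followed by digits as 'octal' or 'backref'.

    digits        -- the digit run following the backslash (assumed pre-extracted)
    closed_groups -- number of closing capturing parentheses seen so far
    """
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty run of decimal digits")
    if digits[0] == "0":
        # A leading zero always indicates an octal escape.
        return "octal"
    if len(digits) == 1:
        # A single non-zero digit is always taken as a back reference.
        return "backref"
    if int(digits) <= closed_groups:
        # Multi-digit and within the legal range for a back reference.
        return "backref"
    # Multi-digit, no leading zero, out of range: taken as octal.
    return "octal"


print(classify_digit_escape("07", 50))   # octal
print(classify_digit_escape("7", 0))     # backref
print(classify_digit_escape("12", 12))   # backref
print(classify_digit_escape("135", 2))   # octal
```

Note that the single-digit case returns a back reference even when no subexpression has been closed yet; whether such a reference is then rejected is outside what this sketch models.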

ymasuda, November 19, 2005 (Heisei 17)