You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.
The escape sequences described above (see section Escape Sequences) are valid inside a regexp. They are introduced by a `\', and are recognized and converted into the corresponding real characters as the very first step in processing regexps.
Here is a table of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves.
\
This suppresses the special meaning of a character when matching. For example, `\$' matches the character `$'.
^
This matches the beginning of a string. For example, `^@chapter' matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The `^' is known as an anchor, since it anchors the pattern to match only at the beginning of the string. It is important to realize that `^' does not match the beginning of a line embedded in a string. In this example the condition is not true:
if ("line1\nLINE 2" ~ /^L/) ...
$
This is similar to `^', but it matches only at the end of a string. For example, `p$' matches a record that ends with a `p'. The `$' is also an anchor, and also does not match the end of a line embedded in a string. In this example the condition is not true:
if ("line1\nLINE 2" ~ /1$/) ...
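The two conditions above can be checked directly. The following is a minimal sketch, assuming an `awk' is available on the search path; the variable names `s' and `anchors' are illustrative only. It also shows, for contrast, that both anchors do match at the true beginning and end of the string.

```shell
# Evaluate each anchored match against a string with an embedded newline.
# The match operator yields 1 for a match and 0 otherwise.
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    print (s ~ /^L/)   # 0: the L follows a newline, not the start of the string
    print (s ~ /1$/)   # 0: the 1 precedes a newline, not the end of the string
    print (s ~ /^l/)   # 1: the string really starts with l
    print (s ~ /2$/)   # 1: the string really ends with 2
}')
echo "$anchors"
```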
.
This matches any single character, including the newline character. For example, `.P' matches any single character followed by a `P' in a string. Using concatenation we can make a regular expression like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'. In strict POSIX mode (see section Command Line Options), `.' does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
[...]
This is called a character list. It matches any one of the characters that are enclosed in the square brackets. For example, `[MVX]' matches any one of the characters `M', `V', or `X' in a string. Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example, `[0-9]' matches any digit. Multiple ranges are allowed; for example, the list `[A-Za-z0-9]' is a common way to express the idea of "all alphanumeric characters."
To include one of the characters `\', `]', `-' or `^' in a character list, put a `\' in front of it. For example, `[d\]]' matches either `d' or `]'. This treatment of `\' in character lists is compatible with other awk implementations, and is also mandated by POSIX.
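As an illustration of character lists and ranges, here is a small sketch that runs three list patterns over a few made-up input lines. The sample lines and the `lists' variable are assumptions chosen for this example, not part of the original text.

```shell
# Each rule prints a label when its character list matches somewhere in the record.
lists=$(printf '%s\n' 'Mars' 'route 66' 'plain' '---' | awk '
    /[MVX]/       { print "MVX: " $0 }     # contains M, V, or X
    /[0-9]/       { print "digit: " $0 }   # contains at least one digit
    /[A-Za-z0-9]/ { print "alnum: " $0 }   # contains an ASCII letter or digit
')
echo "$lists"
```

The line `---' produces no output, since it contains none of the listed characters.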
The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a new feature introduced in the POSIX standard.
A character class is a special notation for describing
lists of characters that have a specific attribute, but where the
actual characters themselves can vary from country to country and/or
from character set to character set. For example, the notion of what
is an alphabetic character differs in the USA and in France.
A character class is only valid in a regexp inside the
brackets of a character list. Character classes consist of `[:',
a keyword denoting the class, and `:]'. Here are the character
classes defined by the POSIX standard.
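[:alnum:]
Alphanumeric characters.
[:alpha:]
Alphabetic characters.
[:blank:]
Space and tab characters.
[:cntrl:]
Control characters.
[:digit:]
Numeric characters.
[:graph:]
Characters that are both printable and visible. (A space is printable, but not visible, while an `a' is both.)
[:lower:]
Lower-case alphabetic characters.
[:print:]
Printable characters (characters that are not control characters).
[:punct:]
Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]
Space characters (such as space, tab, and formfeed, to name a few).
[:upper:]
Upper-case alphabetic characters.
[:xdigit:]
Characters that are hexadecimal digits.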
For example, before the POSIX standard, to match alphanumeric characters, you had to write `/[A-Za-z0-9]/'. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write `/[[:alnum:]]/', and this will match all the alphabetic and numeric characters in your character set.
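The following sketch shows character classes used inside bracket lists. It assumes the `awk' on the search path supports POSIX character classes (some older implementations, and gawk in compatibility mode, do not); the input lines and the `classes' variable are made up for the example.

```shell
# A character class is written inside the brackets of a list: [[:alnum:]], not [:alnum:].
classes=$(printf '%s\n' 'abc123' '!!!' 'a b' | awk '
    /[[:alnum:]]/ { print "alnum: " $0 }   # any letter or digit in the locale
    /[[:space:]]/ { print "space: " $0 }   # any whitespace character
    /[[:punct:]]/ { print "punct: " $0 }   # any punctuation character
')
echo "$classes"
```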
Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called collating elements) that are represented with more than one
character, as well as several characters that are equivalent for
collating, or sorting, purposes. (E.g., in French, a plain "e"
and a grave-accented "è" are equivalent.)
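Collating Symbols
A collating symbol is a multi-character collating element enclosed in `[.' and `.]'. For example, if `ch' is a collating element, then `[[.ch.]]' is a regexp that matches this collating element, while `[ch]' is a regexp that matches either `c' or `h'.
Equivalence Classes
An equivalence class is a locale-specific name for a list of characters that are equivalent. The name is enclosed in `[=' and `=]'. For example, if the name `e' represents all of "e", "é", and "è", then `[[=e=]]' is a regexp that matches any of `e', `é', or `è'.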
The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.
[^ ...]
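This is a complemented character list. The first character after the `[' must be a `^'. It matches any characters except those in the square brackets. For example, `[^0-9]' matches any character that is not a digit.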
|
This is the alternation operator; it is used to specify alternatives. For example, `^P|[0-9]' matches any string that matches either `^P' or `[0-9]'. This means it matches any string that starts with `P' or contains a digit. The alternation applies to the largest possible regexps on either side. In other words, `|' has the lowest precedence of all the regular expression operators.
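(...)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'. For example, `@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'. (These are Texinfo formatting control sequences.)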
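To see how the low precedence of `|' interacts with grouping, the sketch below compares `^P|[0-9]' with the parenthesized form `^(P|[0-9])'. The input lines and the variable names `input', `alt', and `grouped' are assumptions made for this illustration.

```shell
input=$(printf '%s\n' 'Paris' 'room 5' 'taPe' '5 rooms')

# Ungrouped: the anchor belongs only to the left alternative,
# so a digit anywhere in the record is enough to match.
alt=$(printf '%s\n' "$input" | awk '/^P|[0-9]/ { print }')

# Grouped: the anchor applies to both alternatives,
# so the record must start with P or with a digit.
grouped=$(printf '%s\n' "$input" | awk '/^(P|[0-9])/ { print }')

echo "$alt"
echo "$grouped"
```

Here `room 5' is selected only by the ungrouped pattern, because its digit is not at the beginning of the record.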
*
This symbol means that the preceding regular expression is to be repeated as many times as necessary to find a match. For example, `ph*' applies the `*' symbol to the preceding `h' and looks for matches of one `p' followed by any number of `h's. This will also match just `p' if no `h's are present. The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in `sample' containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of the parentheses by preceding them with backslashes.
+
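This symbol is similar to `*', but the preceding expression must be matched at least once. This means that `wh+y' would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example: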
awk '/\(c[ad]+r x\)/ { print }' sample
?
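This symbol is similar to `*', but the preceding expression can be matched either once or not at all. For example, `fe?d' will match `fed' and `fd', but nothing else.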
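The three repetition operators can be compared side by side. This is a small sketch using the `wh...y' strings from the text; the patterns are anchored with `^' and `$' so that each record must match as a whole, and the `reps' variable is illustrative only.

```shell
# Label each record with every repetition operator whose pattern it satisfies.
reps=$(printf '%s\n' 'wy' 'why' 'whhy' | awk '
    /^wh*y$/ { print "*: " $0 }   # zero or more h
    /^wh+y$/ { print "+: " $0 }   # one or more h
    /^wh?y$/ { print "?: " $0 }   # zero or one h
')
echo "$reps"
```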
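{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times.
`wh{3}y' matches `whhhy' but not `why' or `whhhhy'.
`wh{3,5}y' matches `whhhy', `whhhhy', or `whhhhhy', only.
`wh{2,}y' matches `whhy', `whhhy', and so on.
Interval expressions were not traditionally available in awk. They were added as part of the POSIX standard to make awk and egrep consistent with each other.
However, since old programs may use `{' and `}' in regexp constants, by default gawk does not match interval expressions in regexps. If either `--posix' or `--re-interval' is specified (see section Command Line Options), then interval expressions are allowed in regexps.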
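The sketch below tries an interval expression with the `--re-interval' option named above. It assumes gawk may or may not be installed, so it checks for the command first; the input lines, the `intervals' variable, and the fallback message are made up for this example.

```shell
# Run the interval example only when gawk is available on the search path.
if command -v gawk >/dev/null 2>&1; then
    # Keep only records made of w, then three to five h, then y.
    intervals=$(printf '%s\n' 'why' 'whhy' 'whhhy' 'whhhhhhy' |
        gawk --re-interval '/^wh{3,5}y$/ { print }')
else
    intervals="gawk not found"
fi
echo "$intervals"
```

Of the four sample lines, only `whhhy' has between three and five `h's; the last line has six.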
In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
If gawk is in compatibility mode (see section Command Line Options), character classes and interval expressions are not available in regular expressions.
The next section discusses the GNU-specific regexp operators, and provides more detail concerning how command line options affect the way gawk interprets the characters in regular expressions.