The GNU Awk User's Guide - Regexp Operators

Regular Expression Operators

You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.

The escape sequences described in section Escape Sequences are valid inside a regexp. They are introduced by a `\'. They are recognized and converted into the corresponding real characters as the very first step in processing regexps.

Here is a table of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves.

\

This is used to suppress the special meaning of a character when matching. For example:

\$

matches the character `$'.

^

This matches the beginning of a string. For example:

^@chapter

matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The `^' is known as an anchor, since it anchors the pattern to match only at the beginning of the string. It is important to realize that `^' does not match the beginning of a line embedded in a string. In this example the condition is not true:

if ("line1\nLINE 2" ~ /^L/) ...

$

This is similar to `^', but it matches only at the end of a string. For example:

p$

matches a record that ends with a `p'. The `$' is also an anchor, and also does not match the end of a line embedded in a string. In this example the condition is not true:

if ("line1\nLINE 2" ~ /1$/) ...
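The behavior of both anchors can be checked with a short test. The following is a minimal sketch, assuming an `awk' program is available on the search path; the variable name `anchors' and the two extra checks (`^l' and `2$') are illustrative additions, not part of the examples above.

```shell
# Test the two-line string from the examples above against four regexps.
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    # "^" anchors at the start of the whole string, not after the newline.
    print ((s ~ /^L/) ? "^L matches" : "^L does not match")
    # "$" anchors at the end of the whole string, not before the newline.
    print ((s ~ /1$/) ? "1$ matches" : "1$ does not match")
    # The same anchors do match at the true start and end of the string.
    print ((s ~ /^l/) ? "^l matches" : "^l does not match")
    print ((s ~ /2$/) ? "2$ matches" : "2$ does not match")
}')
printf '%s\n' "$anchors"
```

The first two lines of output report no match, because the `L' and the `1' sit next to the embedded newline rather than at the ends of the string.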

.

The period, or dot, matches any single character, including the newline character. For example:

.P

matches any single character followed by a `P' in a string. Using concatenation we can make a regular expression like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'. In strict POSIX mode (see section Command Line Options), `.' does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
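As a small illustration of `U.A', the sketch below filters four made-up input records; the records and the variable name `dot' are assumptions chosen for the example.

```shell
# Only records containing "U", any one character, then "A" are printed.
# "UA" has nothing between the letters, and "usa" is in the wrong case.
dot=$(printf '%s\n' 'USA' 'UA' 'U-A' 'usa' | awk '/U.A/ { print }')
printf '%s\n' "$dot"
```

This should print `USA' and `U-A' only.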

[...]

This is called a character list. It matches any one of the characters that are enclosed in the square brackets. For example:

[MVX]

matches any one of the characters `M', `V', or `X' in a string. Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:

[0-9]

matches any digit. Multiple ranges are allowed. E.g., the list [A-Za-z0-9] is a common way to express the idea of "all alphanumeric characters." To include one of the characters `\', `]', `-' or `^' in a character list, put a `\' in front of it. For example:

[d\]]

matches either `d' or `]'.

This treatment of `\' in character lists is compatible with other awk implementations, and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.

Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the USA and in France.

A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of `[:', a keyword denoting the class, and `:]'. Here are the character classes defined by the POSIX standard.

[:alnum:]: Alphanumeric characters.
[:alpha:]: Alphabetic characters.
[:blank:]: Space and tab characters.
[:cntrl:]: Control characters.
[:digit:]: Numeric characters.
[:graph:]: Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
[:lower:]: Lower-case alphabetic characters.
[:print:]: Printable characters (characters that are not control characters).
[:punct:]: Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]: Space characters (such as space, tab, and formfeed, to name a few).
[:upper:]: Upper-case alphabetic characters.
[:xdigit:]: Characters that are hexadecimal digits.

For example, before the POSIX standard, to match alphanumeric characters, you had to write /[A-Za-z0-9]/. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/, and this will match all the alphabetic and numeric characters in your character set.

Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character, as well as several characters that are equivalent for collating, or sorting, purposes. (E.g., in French, a plain "e" and a grave-accented "è" are equivalent.)

Collating Symbols: A collating symbol is a multi-character collating element enclosed in `[.' and `.]'. For example, if `ch' is a collating element, then [[.ch.]] is a regexp that matches this collating element, while [ch] is a regexp that matches either `c' or `h'.
Equivalence Classes: An equivalence class is a locale-specific name for a list of characters that are equivalent. The name is enclosed in `[=' and `=]'. For example, the name `e' might be used to represent all of "e," "è," and "é." In this case, [[=e=]] is a regexp that matches any of `e', `é', or `è'.

These features are very valuable in non-English speaking locales. Caution: The library functions that gawk uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes.
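Plain character lists and POSIX character classes can be compared side by side. The sketch below assumes an awk that recognizes POSIX character classes (some older implementations do not); the three input records and the variable name `lists' are made up for illustration.

```shell
# Each rule prints the record together with the list it matched.
lists=$(printf '%s\n' 'V8' '...' 'x_1' | awk '
    /[MVX]/       { print $0 ": matches [MVX]" }
    /[[:digit:]]/ { print $0 ": matches [[:digit:]]" }
')
printf '%s\n' "$lists"
```

The record `V8' satisfies both rules, `...' satisfies neither, and `x_1' contains a digit but none of `M', `V', or `X'.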

[^ ...]

This is a complemented character list. The first character after the `[' must be a `^'. It matches any character except those in the square brackets. For example:

[^0-9]

matches any character that is not a digit.
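Note that a record matches `[^0-9]' as soon as it contains one non-digit anywhere. A minimal sketch, using three hypothetical input records and an illustrative variable name `nondigit':

```shell
# "2024" consists only of digits, so it is the one record not printed.
nondigit=$(printf '%s\n' '2024' '20a4' '12 34' | awk '/[^0-9]/ { print }')
printf '%s\n' "$nondigit"
```

The space in `12 34' counts as a non-digit, so that record is printed along with `20a4'.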

|

This is the alternation operator, and it is used to specify alternatives. For example:

^P|[0-9]

matches any string that matches either `^P' or `[0-9]'. This means it matches any string that starts with `P' or contains a digit. The alternation applies to the largest possible regexps on either side. In other words, `|' has the lowest precedence of all the regular expression operators.
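Because `|' has the lowest precedence, the `^' in `^P|[0-9]' applies only to the `P' alternative. This can be seen in the following sketch, where the four input records and the variable name `alt' are assumptions made for the example.

```shell
# "Paris" starts with P; "a1" contains a digit (not at the start).
# "aP" has a P that is not at the start, and "xyz" has neither.
alt=$(printf '%s\n' 'Paris' 'a1' 'aP' 'xyz' | awk '/^P|[0-9]/ { print }')
printf '%s\n' "$alt"
```

Only `Paris' and `a1' should be printed.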

(...)

Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'. For example, `@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'. (These are Texinfo formatting control sequences.)
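The grouping example can be tried on a few Texinfo-like strings. In this sketch the input records (including the non-matching `@var{baz}' and `@code{}') and the variable name `grouped' are hypothetical.

```shell
# The group (samp|code) limits the alternation to the command name.
# [^}]+ requires at least one character between the braces.
grouped=$(printf '%s\n' '@code{foo}' '@samp{bar}' '@var{baz}' '@code{}' |
    awk '/@(samp|code)\{[^}]+\}/ { print }')
printf '%s\n' "$grouped"
```

Without the parentheses, the regexp would instead mean "`@samp', or `code' followed by a braced argument".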

*

This symbol means that the preceding regular expression is to be repeated as many times as necessary to find a match. For example:

ph*

applies the `*' symbol to the preceding `h' and looks for matches of one `p' followed by any number of `h's. This will also match just `p' if no `h's are present. The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:

awk '/\(c[ad][ad]*r x\)/ { print }' sample

prints every record in `sample' containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of the parentheses by preceding them with backslashes.

+

This symbol is similar to `*', but the preceding expression must be matched at least once. This means that:

wh+y

would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example:

awk '/\(c[ad]+r x\)/ { print }' sample
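To confirm that the `*' and `+' forms select the same records, both can be run over the same input. This is a sketch only: the five records below stand in for the file `sample', whose real contents are not shown here, and the variable names are illustrative.

```shell
# Hypothetical stand-in for the contents of the file "sample".
sample='(car x)
(cdr x)
(cadr x)
(cr x)
car x'

# The same input is filtered with the "*" form and the "+" form.
star=$(printf '%s\n' "$sample" | awk '/\(c[ad][ad]*r x\)/ { print }')
plus=$(printf '%s\n' "$sample" | awk '/\(c[ad]+r x\)/ { print }')

printf '%s\n' "$star"
[ "$star" = "$plus" ] && echo "both forms select the same records"
```

The record `(cr x)' is rejected because at least one `a' or `d' is required, and `car x' is rejected because the escaped parentheses must appear literally.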

?

This symbol is similar to `*', but the preceding expression can be matched either once or not at all. For example:

fe?d

will match `fed' and `fd', but nothing else.
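The three repetition operators `*', `+', and `?' differ only in how many repetitions they accept. The sketch below tabulates this for `wh*y', `wh+y', and `wh?y'; the `wh?y' case and the variable name `repeats' are additions made for the comparison.

```shell
# For each record, report whether it matches each of the three regexps.
repeats=$(printf '%s\n' 'wy' 'why' 'whhy' | awk '{
    printf "%s: * %s, + %s, ? %s\n", $0,
        (($0 ~ /wh*y/) ? "yes" : "no"),
        (($0 ~ /wh+y/) ? "yes" : "no"),
        (($0 ~ /wh?y/) ? "yes" : "no")
}')
printf '%s\n' "$repeats"
```

Only `*' accepts all three strings; `+' rejects `wy' (no `h'), and `?' rejects `whhy' (more than one `h').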

{n}

{n,}

{n,m}

One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times.

wh{3}y: matches `whhhy' but not `why' or `whhhhy'.
wh{3,5}y: matches `whhhy' or `whhhhy' or `whhhhhy', only.
wh{2,}y: matches `whhy' or `whhhy', and so on.

Interval expressions were not traditionally available in awk. They were added as part of the POSIX standard, to make awk and egrep consistent with each other. However, since old programs may use `{' and `}' in regexp constants, by default gawk does not match interval expressions in regexps. If either `--posix' or `--re-interval' is specified (see section Command Line Options), then interval expressions are allowed in regexps.

In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
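The effect of these precedence rules can be seen by comparing `ab+' with `(ab)+'. The following is a sketch with two made-up input records; the anchors are added so that each regexp must account for the whole record, and the variable name `prec' is illustrative.

```shell
# "+" binds tighter than concatenation, so ab+ means "a" then one or
# more "b"; the parentheses make (ab)+ repeat the whole pair instead.
prec=$(printf '%s\n' 'abbb' 'abab' | awk '{
    printf "%s: ab+ %s, (ab)+ %s\n", $0,
        (($0 ~ /^ab+$/)   ? "yes" : "no"),
        (($0 ~ /^(ab)+$/) ? "yes" : "no")
}')
printf '%s\n' "$prec"
```

Each record matches exactly one of the two regexps, which shows that the parentheses change what the `+' repeats.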

If gawk is in compatibility mode (see section Command Line Options), character classes and interval expressions are not available in regular expressions.

The next section discusses the GNU-specific regexp operators, and provides more detail concerning how command line options affect the way gawk interprets the characters in regular expressions.
