

Additional Regexp Operators Only in gawk

GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section, and are specific to gawk; they are not available in other awk implementations.

Most of the additional operators are for dealing with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (`_').

\w
This operator matches any word-constituent character, i.e., any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].
\W
This operator matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].
\<
This operator matches the empty string at the beginning of a word. For example, /\<away/ matches `away', but not `stowaway'.
\>
This operator matches the empty string at the end of a word. For example, /stow\>/ matches `stow', but not `stowaway'.
\y
This operator matches the empty string at either the beginning or the end of a word (the word boundary). For example, /\yballs?\y/ matches either `ball' or `balls' as a separate word, but not `ballast'.
\B
This operator matches the empty string within a word. In other words, `\B' matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches `crate', but it does not match `dirty rat'. `\B' is essentially the opposite of `\y'.

There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, the regexp library routines that gawk uses consider the entire string to be matched as the buffer.

For awk, since `^' and `$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities. They are provided for compatibility with other GNU software.

\`
This operator matches the empty string at the beginning of the buffer.
\'
This operator matches the empty string at the end of the buffer.

In other GNU software, the word boundary operator is `\b'. However, that conflicts with the awk language's definition of `\b' as backspace, so gawk uses a different letter.

An alternative method would have been to require two backslashes in the GNU operators, but this was deemed to be too confusing, and the current method of using `\y' for the GNU `\b' appears to be the lesser of two evils.

The various command line options (see section Command Line Options) control how gawk interprets characters in regexps.

No options
In the default case, gawk provides all the facilities of POSIX regexps and the GNU regexp operators described above. However, interval expressions are not supported.
--posix
Only POSIX regexps are supported; the GNU operators are not special (e.g., `\w' matches a literal `w'). Interval expressions are allowed.
--traditional
Traditional Unix awk regexps are matched. The GNU operators are not special, interval expressions are not available, and neither are the POSIX character classes ([[:alnum:]] and so on). Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
--re-interval
Allow interval expressions in regexps, even if `--traditional' has been provided.

