The GNU Awk User's Guide - Basic Field Splitting


The Basics of Field Separating

The field separator, which is either a single character or a regular expression, controls the way awk splits an input record into fields. awk scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.

In the examples below, we use the symbol "*" to represent spaces in the output.

If the field separator is `oo', then the following line:

moo goo gai pan

would be split into three fields: `m', `*g' and `*gai*pan'. Note the leading spaces in the values of the second and third fields.

The field separator is represented by the built-in variable FS. Shell programmers take note! awk does not use the name IFS, which is used by the POSIX-compatible shells (such as the Bourne shell, sh, or the GNU Bourne-Again Shell, Bash).
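
As a minimal sketch of the `oo' example above, the following shell command assigns FS in a BEGIN rule and prints every field inside square brackets, so that the leading spaces are easy to see. The bracket notation and the shell variable name out are illustrative choices only, not anything awk requires.

```shell
# Split the sample line on the separator "oo" and bracket each field.
out=$(printf 'moo goo gai pan\n' |
  awk 'BEGIN { FS = "oo" }
       { for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
printf '%s\n' "$out"   # [m][ g][ gai pan]
```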

You can change the value of FS in the awk program with the assignment operator, `=' (see section Assignment Expressions). Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator. To do this, use the special BEGIN pattern (see section The BEGIN and END Special Patterns). For example, here we set the value of FS to the string ",":

awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line,

John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this awk program extracts and prints the string `*29*Oak*St.'.
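
To try this from a shell, one possible sketch pipes the sample record into the program and brackets the result, which makes the leading space in the second field visible. The variable name out is just a placeholder for this illustration.

```shell
# Feed the sample record to the program and capture the second field.
out=$(printf 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139\n' |
  awk 'BEGIN { FS = "," } ; { print $2 }')
printf '[%s]\n' "$out"   # [ 29 Oak St.]
```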

Sometimes your input data will contain separator characters that don't separate fields the way you thought they would. For instance, the person's name in the example we just used might have a title or suffix attached, such as `John Q. Smith, LXIX'. From input containing such a name:

John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

the above program would extract `*LXIX', instead of `*29*Oak*St.'. If you were expecting the program to print the address, you would be surprised. The moral is: choose your data layout and separator characters carefully to prevent such problems.
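
The effect can be seen by running both records through the same separator and printing the field count, NF, next to the second field. This is only an illustrative sketch; the output format and the variable name out are assumptions made for the example.

```shell
# Compare the two records: the extra comma shifts every later field by one.
out=$(printf '%s\n' \
  'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' \
  'John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139' |
  awk 'BEGIN { FS = "," } { printf "NF=%d second=[%s]\n", NF, $2 }')
printf '%s\n' "$out"
# NF=4 second=[ 29 Oak St.]
# NF=5 second=[ LXIX]
```

A field count that differs from the expected one is often a useful hint that a record does not follow the assumed layout.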

Normally, fields are separated by whitespace sequences (spaces, tabs and newlines), not by single spaces: two spaces in a row do not delimit an empty field. The default value of the field separator FS is a string containing a single space, " ". If this value were interpreted in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. This does not happen because a single space as the value of FS is a special case: it is taken to specify the default manner of delimiting fields.
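
The following sketch contrasts the default behavior with what a literal reading of a single space would produce. To get the literal reading it sets FS to the one-character bracket expression "[ ]", a regular expression that matches exactly one space; that form is not covered in this section and is used here only for comparison. The sample line and the variable names are made up for the illustration.

```shell
# A line with leading, doubled and trailing spaces, plus one tab.
line=$(printf '  a  b\tc ')

# Default FS (" "): runs of whitespace separate fields, edges are ignored.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS = "[ ]": every single space separates, so empty fields appear.
literal_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ ]" } { print NF }')

printf 'default: %s, literal: %s\n' "$default_nf" "$literal_nf"
# default: 3, literal: 6
```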

If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character which does not follow these rules.
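
A short sketch of these rules, using an invented input line: with FS set to ",", the leading comma, the doubled comma and the trailing comma each produce an empty field. As before, the square brackets and the variable name out are only there to make the result visible.

```shell
# ",a,,b," holds five fields when split on ",": three of them are empty.
out=$(printf ',a,,b,\n' |
  awk 'BEGIN { FS = "," }
       { for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
printf '%s\n' "$out"   # [][a][][b][]
```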
