The GNU Awk User's Guide

Multiple-Line Records

In some databases, a single line cannot conveniently hold all the information in one entry. In such cases, you can use multi-line records.

The first step in doing this is to choose your data format: when records are not defined as single lines, how do you want to define them? What should separate records?

One technique is to use an unusual character or string to separate records. For example, you could use the formfeed character (written `\f' in awk, as in C) to separate them, making each record a page of the file. To do this, just set the variable RS to "\f" (a string containing the formfeed character). Any other character could equally well be used, as long as it won't be part of the data in a record.
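
As a minimal sketch of the formfeed technique, the following shell function sets RS to "\f" and prints each page with its record number. The function name `number_pages' and the sample input are made up for illustration.

```shell
# Hypothetical helper: print each formfeed-separated page with its record number.
number_pages() {
    awk 'BEGIN { RS = "\f" } { printf "page %d: %s\n", NR, $0 }'
}

# Made-up sample input: two pages, each terminated by a formfeed.
printf 'first page\fsecond page\f' | number_pages
```

Each formfeed ends one record, so the two pages arrive as records 1 and 2. Any newlines inside a page would stay part of that page's record.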

Another technique is to have blank lines separate records. By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines. If you set RS to the empty string, a record always ends at the first blank line encountered, and the next record doesn't start until the first non-blank line that follows. No matter how many blank lines appear in a row, they are considered one record separator.
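
The following sketch illustrates this behavior. The function name `list_records' and the sample input are assumptions made for the example; the input deliberately puts three blank lines between the first two records and one between the last two.

```shell
# Hypothetical helper: report how many lines each blank-line-separated record holds.
list_records() {
    awk 'BEGIN { RS = "" }
         { printf "record %d: %d line(s)\n", NR, split($0, lines, "\n") }'
}

# Made-up sample input: runs of blank lines of different lengths.
printf 'one\n\n\n\ntwo\nmore\n\nthree\n' | list_records
```

The run of three blank lines and the single blank line each act as one separator, so the program sees exactly three records.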

You can achieve the same effect as `RS = ""' by assigning the string "\n\n+" to RS. This regexp matches the newline at the end of the record and one or more blank lines after the record. In addition, a regular expression always matches the longest possible sequence when there is a choice (see section How Much Text Matches?). So here too, a run of blank lines of any length is treated as a single record separator.

There is an important difference between `RS = ""' and `RS = "\n\n+"'. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done (d.c.).
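
A small experiment can make the difference visible. This is a sketch that assumes gawk, or another awk that treats a multi-character RS as a regexp; the function names and the sample input are hypothetical. Each function prints the number of records read, followed by the length of the last record.

```shell
# Hypothetical helpers: print the record count and the length of the last record.
with_empty_rs() {
    awk 'BEGIN { RS = "" } { last = $0 } END { print NR, length(last) }'
}

with_regexp_rs() {
    awk 'BEGIN { RS = "\n\n+" } { last = $0 } END { print NR, length(last) }'
}

# Made-up sample input: two leading newlines, and no blank line after the last record.
printf '\n\nA\n\nB\n' | with_empty_rs
printf '\n\nA\n\nB\n' | with_regexp_rs
```

With `RS = ""' the leading newlines are skipped and the final newline is stripped, giving two records with a last record of length 1. With `RS = "\n\n+"' the leading newlines delimit an empty first record and the last record keeps its newline, giving three records with a last record of length 2.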

Now that the input is separated into records, the second step is to separate the fields in the record. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature: when RS is set to the empty string, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS.
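
As a brief illustration with the default value of FS, the sketch below counts the fields in each record. The function name `count_fields' and the sample input are made up for the example.

```shell
# Hypothetical helper: print the number of fields in each blank-line-separated record.
count_fields() {
    awk 'BEGIN { RS = "" } { print NF }'
}

# Made-up sample input: a two-line record followed by a one-line record.
printf 'Jane Doe\n123 Main Street\n\nJohn Smith\n' | count_fields
```

The first record spans two lines, yet its words are numbered as one sequence of five fields, because the newline between the lines separates fields just as the spaces do. The second record has two fields.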

The original motivation for this special exception was probably to provide useful behavior in the default case (i.e., when FS is equal to " "). This feature can be a problem if you really don't want the newline character to separate fields, since there is no way to prevent it. However, you can work around this by using the split function to break up the record manually (see section Built-in Functions for String Manipulation).
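
One possible form of that workaround is sketched below, assuming colon-separated data. The function name `split_on_colon', the array name `parts', the `<NL>' marker, and the sample input are all hypothetical. The program ignores the automatically split fields and calls split on the whole record, so only the colon divides it.

```shell
# Hypothetical helper: split each record on ":" only, leaving newlines inside the pieces.
split_on_colon() {
    awk 'BEGIN { RS = "" }
         {
             n = split($0, parts, ":")
             for (i = 1; i <= n; i++) {
                 piece = parts[i]
                 gsub(/\n/, "<NL>", piece)   # make embedded newlines visible
                 printf "%d: %s\n", i, piece
             }
         }'
}

# Made-up sample input: one record of two lines, with colons as the intended separator.
printf 'a:b\nc:d\n' | split_on_colon
```

The second piece still contains the newline (shown as `<NL>'), which shows that split with an explicit separator leaves the newline alone.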

Another way to separate fields is to put each field on a separate line: to do this, just set the variable FS to the string "\n". (This simple regular expression matches a single newline.)

A practical example of a data file organized this way might be a mailing list, where each entry is separated by blank lines. Suppose we have a mailing list in a file named `addresses' that looks like this:

Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321

...

A simple program to process this file would look like this:

# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN { RS = "" ; FS = "\n" }

{
      print "Name is:", $1
      print "Address is:", $2
      print "City and State are:", $3
      print ""
}

Running the program produces the following output:

$ awk -f addrs.awk addresses
-| Name is: Jane Doe
-| Address is: 123 Main Street
-| City and State are: Anywhere, SE 12345-6789
-| 
-| Name is: John Smith
-| Address is: 456 Tree-lined Avenue
-| City and State are: Smallville, MW 98765-4321
-| 
...

See section Printing Mailing Labels for a more realistic program that deals with address lists.

The following table summarizes how records are split, based on the value of RS. (`==' means "is equal to.")

RS == "\n": Records are separated by the newline character (`\n'). In effect, every line in the data file is a separate record, including blank lines. This is the default.
RS == any single character: Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records.
RS == "": Records are separated by runs of blank lines. The newline character always serves as a field separator, in addition to whatever value FS may have. Leading and trailing newlines in a file are ignored.
RS == regexp: Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records.
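
To illustrate the single-character case from the table, here is a short sketch that uses a semicolon as the record separator. The function name `show_records' and the sample input are made up; the brackets in the output only make empty records visible.

```shell
# Hypothetical helper: print every semicolon-separated record inside brackets.
show_records() {
    awk 'BEGIN { RS = ";" } { printf "%d:[%s]\n", NR, $0 }'
}

# Made-up sample input: two adjacent semicolons between "a" and "b".
printf 'a;;b;' | show_records
```

The two successive semicolons delimit an empty second record, so the input yields three records in all.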

In all cases, gawk sets RT to the input text that matched the value specified by RS.
