The GNU Awk User's Guide

Go to the first, previous, next, last section, table of contents.

Cutting Out Fields and Columns

The cut utility selects, or "cuts," either characters or fields from its standard input and sends them to its standard output. cut can cut out either a list of characters, or a list of fields. By default, fields are separated by tabs, but you may supply a command line option to change the field delimiter, i.e. the field separator character. cut's definition of fields is less general than awk's.

A common use of cut might be to pull out just the login name of logged-on users from the output of who. For example, the following pipeline generates a sorted, unique list of the logged on users:

who | cut -c1-8 | sort | uniq

The options for cut are:

-c list: Use list as the list of characters to cut out. Items within the list may be separated by commas, and ranges of characters can be separated with dashes. The list `1-8,15,22-35' specifies characters one through eight, 15, and 22 through 35.
-f list: Use list as the list of fields to cut out.
-d delim: Use delim as the field separator character instead of the tab character.
-s: Suppress printing of lines that do not contain the field delimiter.

The awk implementation of cut uses the getopt library function (see section Processing Command Line Options), and the join library function (see section Merging an Array Into a String).

The program begins with a comment describing the options and a usage function which prints out a usage message and exits. usage is called if invalid arguments are supplied.

# cut.awk --- implement cut in awk
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

# Options:
#    -f list        Cut fields
#    -d c           Field delimiter character
#    -c list        Cut characters
#
#    -s        Suppress lines without the delimiter character

function usage(    e1, e2)
{
    e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
    e2 = "usage: cut [-c list] [files...]"
    print e1 > "/dev/stderr"
    print e2 > "/dev/stderr"
    exit 1
}

The variables e1 and e2 are used so that the function fits nicely on the page.

Next comes a BEGIN rule that parses the command line options. It sets FS to a single tab character, since that is cut's default field separator. The output field separator is also set to be the same as the input field separator. Then getopt is used to step through the command line options. One or the other of the variables by_fields or by_chars is set to true, to indicate that processing should be done by fields or by characters respectively. When cutting by characters, the output field separator is set to the null string.

BEGIN    \
{
    FS = "\t"    # default
    OFS = FS
    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
        if (c == "f") {
            by_fields = 1
            fieldlist = Optarg
        } else if (c == "c") {
            by_chars = 1
            fieldlist = Optarg
            OFS = ""
        } else if (c == "d") {
            if (length(Optarg) > 1) {
                printf("Using first character of %s" \
                " for delimiter\n", Optarg) > "/dev/stderr"
                Optarg = substr(Optarg, 1, 1)
            }
            FS = Optarg
            OFS = FS
            if (FS == " ")    # defeat awk semantics
                FS = "[ ]"
        } else if (c == "s")
            suppress++
        else
            usage()
    }

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

Special care is taken when the field delimiter is a space. Using " " (a single space) for the value of FS is incorrect---awk would separate fields with runs of spaces, tabs and/or newlines, and we want them to be separated with individual spaces. Also, note that after getopt is through, we have to clear out all the elements of ARGV from one to Optind, so that awk will not try to process the command line options as file names.

After dealing with the command line options, the program verifies that the options make sense. Only one or the other of `-c' and `-f' should be used, and both require a field list. Then either set_fieldlist or set_charlist is called to pull apart the list of fields or characters.

    if (by_fields && by_chars)
        usage()

    if (by_fields == 0 && by_chars == 0)
        by_fields = 1    # default

    if (fieldlist == "") {
        print "cut: needs list for -c or -f" > "/dev/stderr"
        exit 1
    }

    if (by_fields)
        set_fieldlist()
    else
        set_charlist()
}

Here is set_fieldlist. It first splits the field list apart at the commas, into an array. Then, for each element of the array, it looks to see if it is actually a range, and if so splits it apart. The range is verified to make sure the first number is smaller than the second. Each number in the list is added to the flist array, which simply lists the fields that will be printed. Normal field splitting is used. The program lets awk handle the job of doing the field splitting.

function set_fieldlist(        n, m, i, j, k, f, g)
{
    n = split(fieldlist, f, ",")
    j = 1    # index in flist
    for (i = 1; i <= n; i++) {
        if (index(f[i], "-") != 0) { # a range
            m = split(f[i], g, "-")
            if (m != 2 || g[1] >= g[2]) {
                printf("bad field list: %s\n",
                                  f[i]) > "/dev/stderr"
                exit 1
            }
            for (k = g[1]; k <= g[2]; k++)
                flist[j++] = k
        } else
            flist[j++] = f[i]
    }
    nfields = j - 1
}

The set_charlist function is more complicated than set_fieldlist. The idea here is to use gawk's FIELDWIDTHS variable (see section Reading Fixed-width Data), which describes constant width input. When using a character list, that is exactly what we have.

Setting up FIELDWIDTHS is more complicated than simply listing the fields that need to be printed. We have to keep track of the fields to be printed, and also the intervening characters that have to be skipped. For example, suppose you wanted characters one through eight, 15, and 22 through 35. You would use `-c 1-8,15,22-35'. The necessary value for FIELDWIDTHS would be "8 6 1 6 14". This gives us five fields, and what should be printed are $1, $3, and $5. The intermediate fields are "filler," stuff in between the desired data.

flist lists the fields to be printed, and t tracks the complete field list, including filler fields.

function set_charlist(    field, i, j, f, g, t,
                          filler, last, len)
{
    field = 1   # count total fields
    n = split(fieldlist, f, ",")
    j = 1       # index in flist
    for (i = 1; i <= n; i++) {
        if (index(f[i], "-") != 0) { # range
            m = split(f[i], g, "-")
            if (m != 2 || g[1] >= g[2]) {
                printf("bad character list: %s\n",
                               f[i]) > "/dev/stderr"
                exit 1
            }
            len = g[2] - g[1] + 1
            if (g[1] > 1)  # compute length of filler
                filler = g[1] - last - 1
            else
                filler = 0
            if (filler)
                t[field++] = filler
            t[field++] = len  # length of field
            last = g[2]
            flist[j++] = field - 1
        } else {
            if (f[i] > 1)
                filler = f[i] - last - 1
            else
                filler = 0
            if (filler)
                t[field++] = filler
            t[field++] = 1
            last = f[i]
            flist[j++] = field - 1
        }
    }
    FIELDWIDTHS = join(t, 1, field - 1)
    nfields = j - 1
}

Here is the rule that actually processes the data. If the `-s' option was given, then suppress will be true. The first if statement makes sure that the input record does have the field separator. If cut is processing fields, suppress is true, and the field separator character is not in the record, then the record is skipped.

If the record is valid, then at this point, gawk has split the data into fields, either using the character in FS or using fixed-length fields and FIELDWIDTHS. The loop goes through the list of fields that should be printed. If the corresponding field has data in it, it is printed. If the next field also has data, then the separator character is written out in between the fields.

{
    if (by_fields && suppress && $0 !~ FS)
        next

    for (i = 1; i <= nfields; i++) {
        if ($flist[i] != "") {
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        }
    }
    print ""
}

This version of cut relies on gawk's FIELDWIDTHS variable to do the character-based cutting. While it would be possible in other awk implementations to use substr (see section Built-in Functions for String Manipulation), it would also be extremely painful to do so. The FIELDWIDTHS variable supplies an elegant solution to the problem of picking the input line apart by characters.

Go to the first, previous, next, last section, table of contents.