The GNU Awk User's Guide


Splitting a Large File Into Pieces

The split program splits large text files into smaller pieces. By default, the output files are named `xaa', `xab', and so on. Each file has 1000 lines in it, except possibly the last one, which holds whatever lines remain. To change the number of lines in each file, supply a number on the command line preceded by a minus sign, e.g., `-500' for files with 500 lines in them instead of 1000. To change the names of the output files to something like `myfileaa', `myfileab', and so on, supply an additional argument that specifies the file name prefix.

Here is a version of split in awk. It uses the ord and chr functions presented in section Translating Between Characters and Numbers.

The program first sets its defaults, and then tests to make sure there are not too many arguments. It then looks at each argument in turn. The first argument could be a minus sign followed by a number. If it is, it looks like a negative number, so negating it yields the positive count of lines. The data file name is skipped over, and the final argument is used as the prefix for the output file names. The option and the prefix are each replaced in ARGV with the empty string, so that awk does not try to open them as input files.

# split.awk --- do split in awk
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

# usage: split [-num] [file] [outname]

BEGIN {
    outfile = "x"    # default
    count = 1000
    if (ARGC > 4)
        usage()

    i = 1
    if (ARGV[i] ~ /^-[0-9]+$/) {
        count = -ARGV[i]
        ARGV[i] = ""
        i++
    }
    # test argv in case reading from stdin instead of file
    if (i in ARGV)
        i++    # skip data file name
    if (i in ARGV) {
        outfile = ARGV[i]
        ARGV[i] = ""
    }

    s1 = s2 = "a"
    out = (outfile s1 s2)
}
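
The option test and the negation can be tried on their own. The following is a minimal sketch, not part of split.awk: the shell function name parse_count is hypothetical, and the argument is passed in with -v rather than through ARGV to keep the example short.

```shell
# parse_count is a hypothetical helper, not part of split.awk.
# It prints the line count that the BEGIN rule would choose
# for a given first argument.
parse_count() {
    awk -v arg="$1" 'BEGIN {
        count = 1000          # default, as in split.awk
        if (arg ~ /^-[0-9]+$/)
            count = -arg      # "-500" is the number -500; negating gives 500
        print count
    }'
}

parse_count -500      # prints 500
parse_count myfile    # prints 1000
```

An argument that does not match the pattern, such as a file name, leaves the default of 1000 untouched.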

The next rule does most of the work. tcount (temporary count) tracks how many lines have been printed to the output file so far. If it is greater than count, it is time to close the current file and start a new one. s1 and s2 track the current suffixes for the file name. Normally, s2 simply moves to the next letter in the alphabet. If s2 is already `z', then s1 moves to the next letter and s2 starts over again at `a'. If they are both `z', the file is just too big.

{
    if (++tcount > count) {
        close(out)
        if (s2 == "z") {
            if (s1 == "z") {
                printf("split: %s is too large to split\n", \
                       FILENAME) > "/dev/stderr"
                exit 1
            }
            s1 = chr(ord(s1) + 1)
            s2 = "a"
        } else
            s2 = chr(ord(s2) + 1)
        out = (outfile s1 s2)
        tcount = 1
    }
    print > out
}
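
The suffix arithmetic in the rule above can be exercised in isolation. This sketch is an illustration only: the names next_suffix and next_letter are hypothetical, and next_letter uses index and substr on an alphabet string as a stand-in for chr(ord(c) + 1), so that the example runs without the ord and chr library functions.

```shell
# next_suffix is a hypothetical helper, not part of split.awk.
# Given a two-letter suffix, it prints the suffix that follows,
# or exits with status 1 when the suffix is already "zz".
next_suffix() {
    awk -v suf="$1" '
    # stand-in for chr(ord(c) + 1); handles lowercase letters only
    function next_letter(c)
    {
        return substr(letters, index(letters, c) + 1, 1)
    }
    BEGIN {
        letters = "abcdefghijklmnopqrstuvwxyz"
        s1 = substr(suf, 1, 1)
        s2 = substr(suf, 2, 1)
        if (s2 == "z") {
            if (s1 == "z")
                exit 1        # no names are left after "zz"
            s1 = next_letter(s1)
            s2 = "a"
        } else
            s2 = next_letter(s2)
        print s1 s2
    }'
}

next_suffix aa    # prints ab
next_suffix az    # prints ba
```

With two letters there are 26 * 26 = 676 possible suffixes, which is why the program gives up once both letters have reached `z'.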

The usage function simply prints an error message and exits.

function usage(   e)
{
    e = "usage: split [-num] [file] [outname]"
    print e > "/dev/stderr"
    exit 1
}

The variable e is used so that the function fits nicely on the page.
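
Because e appears as an extra parameter in the function definition, it is also local to the function. The following sketch shows the effect; the names show_local_e and usage_text are hypothetical and are not part of split.awk.

```shell
# Hypothetical demonstration, not part of split.awk: the extra
# parameter e is local to usage_text, so the global e is untouched.
show_local_e() {
    awk '
    function usage_text(   e)
    {
        e = "usage: split [-num] [file] [outname]"
        return e
    }
    BEGIN {
        e = "global value"
        print usage_text()    # prints the usage message
        print e               # prints: global value
    }'
}

show_local_e
```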

This program is a bit sloppy; it relies on awk to close the last file for it automatically, instead of doing it in an END rule.
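
A tidier version would close the last file itself. The sketch below shows the idea on a cut-down splitter; it is an illustration under stated assumptions, not the program above. The name split_small is hypothetical, the count and prefix are passed with -v, and the output files get numeric suffixes (PREFIX1, PREFIX2, and so on) to keep the example short.

```shell
# split_small is a hypothetical, cut-down splitter used only to
# show an explicit close in an END rule.
# Usage: split_small COUNT PREFIX < input
split_small() {
    awk -v count="$1" -v prefix="$2" '
    BEGIN { n = 1; out = prefix n }
    {
        if (++tcount > count) {
            close(out)        # finished with this piece
            out = prefix (++n)
            tcount = 1
        }
        print > out
    }
    END {
        close(out)            # close the last piece explicitly
    }'
}
```

For example, `printf 'a\nb\nc\n' | split_small 2 piece' writes two lines to `piece1' and one line to `piece2'. The same one-line END rule, using the variable out, should work for split.awk as well.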
