Go to the first, previous, next, last section, table of contents.


An Easy Way to Use Library Functions

Using library functions in awk can be very beneficial. It encourages code re-use and the writing of general functions. Programs are smaller, and therefore clearer. However, using library functions is only easy when writing awk programs; it is painful when running them, requiring multiple `-f' options. If gawk is unavailable, then so too is the AWKPATH environment variable and the ability to put awk functions into a library directory (see section Command Line Options).

It would be nice to be able to write programs like so:

# library functions
@include getopt.awk
@include join.awk
...

# main program
BEGIN {
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        ...
    ...
}

The following program, `igawk.sh', provides this service. It simulates gawk's searching of the AWKPATH variable, and also allows nested includes; i.e. a file that has been included with `@include' can contain further `@include' statements. igawk will make an effort to only include files once, so that nested includes don't accidentally include a library function twice.

igawk should behave externally just like gawk. This means it should accept all of gawk's command line arguments, including the ability to have multiple source files specified via `-f', and the ability to mix command line and library source files.

The program is written using the POSIX Shell (sh) command language. The way the program works is as follows:

  1. Loop through the arguments, saving anything that doesn't represent awk source code for later, when the expanded program is run.
  2. For any arguments that do represent awk text, put the arguments into a temporary file that will be expanded. There are two cases.
    1. Literal text, provided with `--source' or `--source='. This text is just echoed directly. The echo program will automatically supply a trailing newline.
    2. File names provided with `-f'. We use a neat trick, and echo `@include filename' into the temporary file. Since the file inclusion program will work the way gawk does, this will get the text of the file included into the program at the correct point.
  3. Run an awk program (naturally) over the temporary file to expand `@include' statements. The expanded program is placed in a second temporary file.
  4. Run the expanded program with gawk and any other original command line arguments that the user supplied (such as the data file names).

The initial part of the program turns on shell tracing if the first argument was `debug'. Otherwise, a shell trap statement arranges to clean up any temporary files on program exit or upon an interrupt.

The next part loops through all the command line arguments. There are several cases of interest.

--
This ends the arguments to igawk. Anything else should be passed on to the user's awk program without being evaluated.
-W
This indicates that the next option is specific to gawk. To make argument processing easier, the `-W' is appended to the front of the remaining arguments and the loop continues. (This is an sh programming trick. Don't worry about it if you are not familiar with sh.)
-v
-F
These are saved and passed on to gawk.
-f
--file
--file=
-Wfile=
The file name is saved to the temporary file `/tmp/ig.s.$$' with an `@include' statement. The sed utility is used to remove the leading option part of the argument (e.g., `--file=').
--source
--source=
-Wsource=
The source text is echoed into `/tmp/ig.s.$$'.
--version
--version
-Wversion
igawk prints its version number, and runs `gawk --version' to get the gawk version information, and then exits.

If none of `-f', `--file', `-Wfile', `--source', or `-Wsource', were supplied, then the first non-option argument should be the awk program. If there are no command line arguments left, igawk prints an error message and exits. Otherwise, the first argument is echoed into `/tmp/ig.s.$$'.

In any case, after the arguments have been processed, `/tmp/ig.s.$$' contains the complete text of the original awk program.

The `$$' in sh represents the current process ID number. It is often used in shell programs to generate unique temporary file names. This allows multiple users to run igawk without worrying that the temporary file names will clash.

Here's the program:

#! /bin/sh

# igawk --- like gawk but do @include processing
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# July 1993

if [ "$1" = debug ]
then
    set -x
    shift
else
    # cleanup on exit, hangup, interrupt, quit, termination
    trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
fi

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift; break;;

    -W)     shift
            set -- -W"$@"
            continue;;

    -[vF])  opts="$opts $1 '$2'"
            shift;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     echo @include "$2" >> /tmp/ig.s.$$
            shift;;

    -f*)    f=`echo "$1" | sed 's/-f//'`
            echo @include "$f" >> /tmp/ig.s.$$ ;;

    -?file=*)    # -Wfile or --file
            f=`echo "$1" | sed 's/-.file=//'`
            echo @include "$f" >> /tmp/ig.s.$$ ;;

    -?file)    # get arg, $2
            echo @include "$2" >> /tmp/ig.s.$$
            shift;;

    -?source=*)    # -Wsource or --source
            t=`echo "$1" | sed 's/-.source=//'`
            echo "$t" >> /tmp/ig.s.$$ ;;

    -?source)  # get arg, $2
            echo "$2" >> /tmp/ig.s.$$
            shift;;

    -?version)
            echo igawk: version 1.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*)    opts="$opts '$1'" ;;

    *)      break;;
    esac
    shift
done

if [ ! -s /tmp/ig.s.$$ ]
then
    if [ -z "$1" ]
    then
         echo igawk: no program! 1>&2
         exit 1
    else
        echo "$1" > /tmp/ig.s.$$
        shift
    fi
fi

# at this point, /tmp/ig.s.$$ has the program

The awk program to process `@include' directives reads through the program, one line at a time using getline (see section Explicit Input with getline). The input file names and `@include' statements are managed using a stack. As each `@include' is encountered, the current file name is "pushed" onto the stack, and the file named in the `@include' directive becomes the current file name. As each file is finished, the stack is "popped," and the previous input file becomes the current input file again. The process is started by making the original file the first one on the stack.

The pathto function does the work of finding the full path to a file. It simulates gawk's behavior when searching the AWKPATH environment variable (see section The AWKPATH Environment Variable). If a file name has a `/' in it, no path search is done. Otherwise, the file name is concatenated with the name of each directory in the path, and an attempt is made to open the generated file name. The only way in awk to test if a file can be read is to go ahead and try to read it with getline; that is what pathto does.(26) If the file can be read, it is closed, and the file name is returned.

gawk -- '
# process @include directives

function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}

The main program is contained inside one BEGIN rule. The first thing it does is set up the pathlist array that pathto uses. After splitting the path on `:', null elements are replaced with ".", which represents the current directory.

BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }

The stack is initialized with ARGV[1], which will be `/tmp/ig.s.$$'. The main loop comes next. Input lines are read in succession. Lines that do not start with `@include' are printed verbatim.

If the line does start with `@include', the file name is in $2. pathto is called to generate the full path. If it could not, then we print an error message and continue.

The next thing to check is if the file has been included already. The processed array is indexed by the full file name of each included file, and it tracks this information for us. If the file has been seen, a warning message is printed. Otherwise, the new file name is pushed onto the stack and processing continues.

Finally, when getline encounters the end of the input file, the file is closed and the stack is popped. When stackptr is less than zero, the program is done.

    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("igawk:%s:%d: cannot find %s\n", \
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            }
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath
            } else
                print $2, "included in", input[stackptr], \
                    "already included in", \
                    processed[fpath] > "/dev/stderr"
        }
        close(input[stackptr])
    }
}' /tmp/ig.s.$$ > /tmp/ig.e.$$

The last step is to call gawk with the expanded program and the original options and command line arguments that the user supplied. gawk's exit status is passed back on to igawk's calling program.

eval gawk -f /tmp/ig.e.$$ $opts -- "$@"

exit $?

This version of igawk represents my third attempt at this program. There are three key simplifications that made the program work better.

  1. Using `@include' even for the files named with `-f' makes building the initial collected awk program much simpler; all the `@include' processing can be done once.
  2. The pathto function doesn't try to save the line read with getline when testing for the file's accessibility. Trying to save this line for use with the main program complicates things considerably.
  3. Using a getline loop in the BEGIN rule does it all in one place. It is not necessary to call out to a separate loop for processing nested `@include' statements.

Also, this program illustrates that it is often worthwhile to combine sh and awk programming together. You can usually accomplish quite a lot, without having to resort to low-level programming in C or C++, and it is frequently easier to do certain kinds of string and argument manipulation using the shell than it is in awk.

Finally, igawk shows that it is not always necessary to add new features to a program; they can often be layered on top. With igawk, there is no real reason to build `@include' processing into gawk itself.

As an additional example of this, consider the idea of having two files in a directory in the search path.

`default.awk'
This file would contain a set of default library functions, such as getopt and assert.
`site.awk'
This file would contain library functions that are specific to a site or installation, i.e. locally developed functions. Having a separate file allows `default.awk' to change with new gawk releases, without requiring the system administrator to update it each time by adding the local functions.

One user suggested that gawk be modified to automatically read these files upon startup. Instead, it would be very simple to modify igawk to do this. Since igawk can process nested `@include' directives, `default.awk' could simply contain `@include' statements for the desired library functions.


Go to the first, previous, next, last section, table of contents.