Using library functions in awk can be very beneficial. It encourages
code re-use and the writing of general functions. Programs are
smaller, and therefore clearer.
However, using library functions is only easy when writing awk
programs; it is painful when running them, requiring multiple `-f'
options. If gawk is unavailable, then so too is the AWKPATH
environment variable and the ability to put awk functions into a
library directory (see section Command Line Options).
It would be nice to be able to write programs like so:

    # library functions
    @include getopt.awk
    @include join.awk
    ...

    # main program
    BEGIN {
        while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
            ...
        ...
    }

The following program, `igawk.sh', provides this service. It simulates
gawk's searching of the AWKPATH variable, and also allows nested
includes; i.e. a file that has been included with `@include' can
contain further `@include' statements. igawk will make an effort to
only include files once, so that nested includes don't accidentally
include a library function twice.
igawk should behave externally just like gawk. This means it should
accept all of gawk's command line arguments, including the ability to
have multiple source files specified via `-f', and the ability to mix
command line and library source files.
The program is written using the POSIX Shell (sh) command language.
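The way the program works is as follows:
  1. Loop through the arguments, saving anything that doesn't represent
     awk source code for later, when the expanded program is run.
  2. For any arguments that do represent awk text, put the arguments
     into a temporary file that will be expanded. There are two cases.
       a. Literal text, provided with `--source' or `--source='. This
          text is just echoed directly. The echo program will
          automatically supply a trailing newline.
       b. File names, provided with `-f'. An `@include' line naming
          the file is echoed into the temporary file. Since the file
          inclusion program works the way gawk does, this will get the
          text of the file included into the program at the correct
          point.
  3. Run an awk program (naturally) over the temporary file to expand
     `@include' statements. The expanded program is placed in a second
     temporary file.
  4. Run the expanded program with gawk and any other original command
     line arguments that the user supplied (such as the data file
     names).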
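The four steps above can be sketched in a few lines of sh. This is only an illustration under simplifying assumptions, not the igawk program itself: the names `workdir', `collected', `expanded' and `double.awk' are invented for the example, plain awk stands in for gawk, and the expansion handles a single level of `@include' with no path search.

```shell
#! /bin/sh
# Sketch of the four steps; every file name here is hypothetical.

workdir=/tmp/igsketch.$$
mkdir "$workdir" || exit 1
trap 'rm -rf "$workdir"' 0

collected=$workdir/collected.awk   # program text gathered from the arguments
expanded=$workdir/expanded.awk     # the same text after @include expansion

# A tiny library file to include.
echo 'function double(x) { return 2 * x }' > "$workdir/double.awk"

# Steps 1 and 2: a file name becomes an @include line,
# literal source text is echoed as it stands.
echo "@include $workdir/double.awk" > "$collected"
echo 'BEGIN { print double(21) }' >> "$collected"

# Step 3: replace each @include line with the text of the named file
# (one level only, no AWKPATH search).
awk '
$1 == "@include" {
    file = $2
    while ((getline line < file) > 0)
        print line
    close(file)
    next
}
{ print }' "$collected" > "$expanded"

# Step 4: run the expanded program.
result=`awk -f "$expanded"`
echo "$result"
```

Run as written, the sketch prints 42: the library function arrives in the expanded file ahead of the BEGIN rule that calls it.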
The initial part of the program turns on shell tracing if the first
argument was `debug'. Otherwise, a shell trap statement arranges to
clean up any temporary files on program exit or upon an interrupt.
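The cleanup arrangement can be tried in isolation. What follows is a minimal sketch, assuming a scratch file name (`/tmp/trapdemo.$$') and a function name (`cleanup_demo') made up for the illustration. The trap is set inside a subshell so that its effect can be observed once the subshell has exited.

```shell
#! /bin/sh
# Sketch: a trap that removes a scratch file when the shell exits.
# The file and function names are hypothetical.

scratch=/tmp/trapdemo.$$

cleanup_demo() {
    (
        # 0 is normal exit; 1, 2, 3 and 15 are hangup, interrupt,
        # quit and termination
        trap 'rm -f "$scratch"' 0 1 2 3 15
        echo "some data" > "$scratch"
    )
    # The subshell has exited, so its trap has already run.
    if [ -f "$scratch" ]
    then
        echo "still there"
    else
        echo "removed"
    fi
}

cleanup_demo
```

The single quotes around the trap command matter: they delay the expansion of the variable until the trap actually fires.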
The next part loops through all the command line arguments. There are
several cases of interest.
--
     This ends the arguments to igawk. Anything else should be passed
     on to the user's awk program without being evaluated.
-W
     This indicates that the next option is specific to gawk. To make
     argument processing easier, the `-W' is prepended to the
     remaining arguments and the loop continues. (This is an sh
     programming trick. Don't worry about it if you are not familiar
     with sh.)
-v
-F
     These are saved and passed on to gawk.
-f
--file
--file=
-Wfile=
     The file name is saved to the temporary file `/tmp/ig.s.$$' with
     an `@include' statement. The sed utility is used to remove the
     leading option part of the argument (e.g., `--file=').
--source
--source=
-Wsource=
     The source text is echoed into `/tmp/ig.s.$$'.
--version
-Wversion
     igawk prints its version number, runs `gawk --version' to get
     the gawk version information, and then exits.
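Two of the tricks in this list are easier to see in isolation. The sketch below wraps them in helper functions, `join_w' and `strip_option', which are invented for the illustration and are not part of igawk.

```shell
#! /bin/sh
# Sketch: the `-W' joining trick and the sed option stripping.
# join_w and strip_option are hypothetical helper names.

# Called with the arguments that remain once `-W' has been shifted off.
join_w() {
    set -- -W"$@"       # -W is glued onto the front of the first argument
    echo "$1"
}

# Remove a leading `--file=' or `-Wfile=' from an argument.
strip_option() {
    echo "$1" | sed 's/-.file=//'
}

join_w 'source=BEGIN { x = 1 }' datafile
strip_option --file=lib.awk
strip_option -Wfile=lib.awk
```

The first call prints `-Wsource=BEGIN { x = 1 }', so on its next pass the loop sees a single `-Wsource=' argument. Both strip_option calls print `lib.awk'.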
If none of `-f', `--file', `-Wfile', `--source', or `-Wsource' were
supplied, then the first non-option argument should be the awk
program. If there are no command line arguments left, igawk prints an
error message and exits. Otherwise, the first argument is echoed into
`/tmp/ig.s.$$'.
In any case, after the arguments have been processed, `/tmp/ig.s.$$'
contains the complete text of the original awk program.
The `$$' in sh represents the current process ID number. It is often
used in shell programs to generate unique temporary file names. This
allows multiple users to run igawk without worrying that the temporary
file names will clash.

    #! /bin/sh

    # igawk --- like gawk but do @include processing
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # July 1993

    if [ "$1" = debug ]
    then
        set -x
        shift
    else
        # cleanup on exit, hangup, interrupt, quit, termination
        trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
    fi

    while [ $# -ne 0 ] # loop over arguments
    do
        case $1 in
        --)     shift; break;;

        -W)     shift
                set -- -W"$@"
                continue;;

        -[vF])  opts="$opts $1 '$2'"
                shift;;

        -[vF]*) opts="$opts '$1'" ;;

        -f)     echo @include "$2" >> /tmp/ig.s.$$
                shift;;

        -f*)    f=`echo "$1" | sed 's/-f//'`
                echo @include "$f" >> /tmp/ig.s.$$ ;;

        -?file=*)    # -Wfile or --file
                f=`echo "$1" | sed 's/-.file=//'`
                echo @include "$f" >> /tmp/ig.s.$$ ;;

        -?file)      # get arg, $2
                echo @include "$2" >> /tmp/ig.s.$$
                shift;;

        -?source=*)  # -Wsource or --source
                t=`echo "$1" | sed 's/-.source=//'`
                echo "$t" >> /tmp/ig.s.$$ ;;

        -?source)    # get arg, $2
                echo "$2" >> /tmp/ig.s.$$
                shift;;

        -?version)
                echo igawk: version 1.0 1>&2
                gawk --version
                exit 0 ;;

        -[W-]*) opts="$opts '$1'" ;;

        *)      break;;
        esac

        shift
    done

    if [ ! -s /tmp/ig.s.$$ ]
    then
        if [ -z "$1" ]
        then
            echo igawk: no program! 1>&2
            exit 1
        else
            echo "$1" > /tmp/ig.s.$$
            shift
        fi
    fi

    # at this point, /tmp/ig.s.$$ has the program

The awk program to process `@include' directives reads through the
program, one line at a time using getline (see section Explicit Input
with getline).
The input file names and `@include' statements are managed using a
stack. As each `@include' is encountered, the current file name is
"pushed" onto the stack, and the file named in the `@include'
directive becomes
the current file name. As each file is finished, the stack is "popped,"
and the previous input file becomes the current input file again.
The process is started by making the original file the first one on the
stack.
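The push and pop operations need nothing more than an array and an index. Here is a minimal sketch, in which the function name `stack_demo' and the file names `main.awk' and `lib.awk' are assumptions made for the illustration:

```shell
#! /bin/sh
# Sketch: an awk array used as a stack of file names.
# stack_demo and the file names are hypothetical.

stack_demo() {
    awk 'BEGIN {
        sp = 0
        stack[sp] = "main.awk"          # the original file starts the stack
        stack[++sp] = "lib.awk"         # an @include was seen: push
        print "reading", stack[sp]
        sp--                            # end of lib.awk: pop
        print "back to", stack[sp]
    }'
}

stack_demo
```

The element at the top of the stack is always the file currently being read, so popping the index is all it takes to resume the previous file.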
The pathto function does the work of finding the full path to a file.
It simulates gawk's behavior when searching the AWKPATH environment
variable (see section The AWKPATH Environment Variable). If a file
name has a `/' in it, no path search is done. Otherwise, the file name
is concatenated with the name of each directory in the path, and an
attempt is made to open the generated file name. The only way in awk
to test if a file can be read is to go ahead and try to read it with
getline; that is what pathto does. If the file can be read, it is
closed, and the file name is returned.

    gawk -- '
    # process @include directives

    function pathto(file,    i, t, junk)
    {
        if (index(file, "/") != 0)
            return file

        for (i = 1; i <= ndirs; i++) {
            t = (pathlist[i] "/" file)
            if ((getline junk < t) > 0) {
                # found it
                close(t)
                return t
            }
        }
        return ""
    }

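The readability test at the heart of pathto can be tried on its own. The sketch below is an illustration only; the function name `can_read' and the scratch file name are assumptions, and plain awk stands in for gawk.

```shell
#! /bin/sh
# Sketch: testing whether a file can be read by trying getline on it.
# The function name can_read is hypothetical.

can_read() {
    awk -v file="$1" 'BEGIN {
        if ((getline junk < file) > 0) {
            close(file)
            print "yes"
        } else
            print "no"
    }'
}

sample=/tmp/canread.$$
echo "one line" > "$sample"
can_read "$sample"
can_read /no/such/dir/file.awk
rm -f "$sample"
```

The first call prints `yes' and the second prints `no'. A file that exists but is empty also yields `no', since getline returns 0 at end of file.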
The main program is contained inside one BEGIN rule. The first thing
it does is set up the pathlist array that pathto uses. After splitting
the path on `:', null elements are replaced with ".", which represents
the current directory.

    BEGIN {
        path = ENVIRON["AWKPATH"]
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."
        }

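To see what the splitting step produces, the same loop can be run over a sample path. This is a sketch only: the function name `split_path' and the directory names are assumptions, and the path is passed as an argument rather than read from ENVIRON.

```shell
#! /bin/sh
# Sketch: split a colon-separated path, turning null elements into ".".
# split_path and the directory names are hypothetical.

split_path() {
    awk -v path="$1" 'BEGIN {
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."
            print i ": " pathlist[i]
        }
    }'
}

split_path "/usr/local/share/awk::/home/me/awk"
```

The empty element between the two colons comes out as the second entry, `.', so the current directory is searched after the first directory and before the last.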
The stack is initialized with ARGV[1], which will be `/tmp/ig.s.$$'.
The main loop comes next. Input lines are read in succession. Lines
that do not start with `@include' are printed verbatim.
If the line does start with `@include', the file name is in $2.
pathto is called to generate the full path. If it cannot find the
file, then we print an error message and continue.
The next thing to check is if the file has been included already. The
processed array is indexed by the full file name of each included
file, and it tracks this information for us. If the file has been
seen, a warning message is printed. Otherwise, the new file name is
pushed onto the stack and processing continues.
Finally, when getline encounters the end of the input file, the file
is closed and the stack is popped. When stackptr is less than zero,
the program is done.

        stackptr = 0
        input[stackptr] = ARGV[1] # ARGV[1] is first file

        for (; stackptr >= 0; stackptr--) {
            while ((getline < input[stackptr]) > 0) {
                if (tolower($1) != "@include") {
                    print
                    continue
                }
                fpath = pathto($2)
                if (fpath == "") {
                    printf("igawk:%s:%d: cannot find %s\n", \
                        input[stackptr], FNR, $2) > "/dev/stderr"
                    continue
                }
                if (! (fpath in processed)) {
                    processed[fpath] = input[stackptr]
                    input[++stackptr] = fpath
                } else
                    print $2, "included in", input[stackptr], \
                        "already included in", \
                        processed[fpath] > "/dev/stderr"
            }
            close(input[stackptr])
        }
    }' /tmp/ig.s.$$ > /tmp/ig.e.$$

The last step is to call gawk with the expanded program and the
original options and command line arguments that the user supplied.
gawk's exit status is passed back on to igawk's calling program.

    eval gawk -f /tmp/ig.e.$$ $opts -- "$@"

    exit $?

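The reason for eval is that the saved options carry their own single quotes, which the shell only honors if the text is parsed a second time. A small sketch makes the difference visible; the helper `count_args' and the sample option are assumptions made for the illustration.

```shell
#! /bin/sh
# Sketch: why the saved options are re-read with eval.
# count_args is a hypothetical helper.

count_args() {
    echo $#
}

# What the argument loop builds for:  -v 'msg=hello world'
opts="-v 'msg=hello world'"

without_eval=`count_args $opts`       # quotes are ordinary characters here
with_eval=`eval count_args $opts`     # quotes are parsed again by eval

echo "without eval: $without_eval arguments"
echo "with eval:    $with_eval arguments"
```

Without eval the value is split at every space, giving three arguments; with eval the quoted text stays together, giving two. The scheme assumes that no saved argument itself contains a single quote.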
This version of igawk represents my third attempt at this program.
There are three key simplifications that made the program work better.
  1. Using `@include' even for the files named with `-f' makes building
     the initial collected awk program much simpler; all the
     `@include' processing can be done once.
  2. The pathto function doesn't try to save the line read with
     getline when testing for the file's accessibility. Trying to save
     this line for use with the main program complicates things
     considerably.
  3. Using a getline loop in the BEGIN rule does it all in one place.
     It is not necessary to call out to a separate loop for processing
     nested `@include' statements.
Also, this program illustrates that it is often worthwhile to combine
sh and awk programming. You can usually accomplish quite a lot,
without having to resort to low-level programming in C or C++, and it
is frequently easier to do certain kinds of string and argument
manipulation using the shell than it is in awk.
Finally, igawk shows that it is not always necessary to add new
features to a program; they can often be layered on top. With igawk,
there is no real reason to build `@include' processing into gawk
itself.
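As an additional example of this, consider the idea of having two
files in a directory in the search path.
`default.awk'
     This file would contain a set of default library functions, such
     as getopt and assert.
`site.awk'
     This file would contain library functions that are specific to a
     site or installation, i.e. locally developed functions. Having a
     separate file allows `default.awk' to change with new gawk
     releases, without requiring the system administrator to update it
     each time by adding the local functions.
One user suggested that gawk be modified to automatically read these
files upon startup. Instead, it would be very simple to modify igawk
to do this. Since igawk can process nested `@include' directives,
`default.awk' could simply contain `@include' statements for the
desired library functions.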