

Removing Duplicates from Unsorted Text

The uniq program (see section Printing Non-duplicated Lines of Text) removes duplicate lines from sorted data.
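
For sorted input, the usual pipeline looks like the following (the file name is only illustrative):

sort commands.txt | uniq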

Suppose, however, that you need to remove duplicate lines from a data file, while preserving the order in which the lines appear. A good example of this is a shell history file. The history file keeps a copy of all the commands you have entered, and it is not unusual to repeat a command several times in a row. Occasionally you might wish to compact the history by removing duplicate entries, yet it is desirable to maintain the order of the original commands.

This simple program does the job. It uses two arrays, data and lines. The data array is indexed by the text of each line; for each line, data[$0] is incremented.

If a particular line has not been seen before, then data[$0] is zero at the time of the test, since the ++ operator returns the old value before incrementing. In that case, the text of the line is stored in lines[count]. Each element of lines is a unique command, and the indices of lines indicate the order in which those lines were encountered. The END rule simply prints out the lines, in order.

# histsort.awk --- compact a shell history file
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

# Thanks to Byron Rakitzis for the general idea
{
    # first time this line has been seen: remember it, in order
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    # print the unique lines in the order they first appeared
    for (i = 1; i <= count; i++)
        print lines[i]
}
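
For example, assuming your shell keeps its history in a file named .history (the actual file name varies from shell to shell, so this name is only illustrative), you could compact it like this:

awk -f histsort.awk .history > .history.new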

This program also provides a foundation for generating other useful information. For example, using the following print statement in the END rule would indicate how often a particular command was used.

print data[lines[i]], lines[i]

This works because data[$0] was incremented each time a line was seen.
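
Putting that together, the complete END rule becomes the following; the pattern-action rule is unchanged:

END {
    # print each unique line preceded by its occurrence count
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}

Each line of output then begins with a count of how many times that command appeared in the input.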

