

The count-words-region Function

A word count command could count words in a line, paragraph, region, or buffer. What should the command cover? You could design the command to count the number of words in a complete buffer. However, the Emacs tradition encourages flexibility--you may want to count words in just a section, rather than all of a buffer. So it makes more sense to design the command to count the number of words in a region. Once you have a count-words-region command, you can, if you wish, count words in a whole buffer by marking it with C-x h (mark-whole-buffer).

Clearly, counting words is a repetitive act: starting from the beginning of the region, you count the first word, then the second word, then the third word, and so on, until you reach the end of the region. This means that word counting is ideally suited to recursion or to a while loop.

First, we will implement the word count command with a while loop, then with recursion. The command will, of course, be interactive.

The template for an interactive function definition is, as always:

(defun name-of-function (argument-list)
  "documentation..."
  (interactive-expression...)
  body...)

What we need to do is fill in the slots.

The name of the function should be self-explanatory and similar to the existing count-lines-region name. This makes the name easier to remember. count-words-region is a good choice.

The function counts words within a region. This means that the argument list must contain symbols that are bound to the two positions, the beginning and end of the region. These two positions can be called `beginning' and `end' respectively. The first line of the documentation should be a single sentence, since that is all that is printed as documentation by a command such as apropos. The interactive expression will be of the form `(interactive "r")', since that will cause Emacs to pass the beginning and end of the region to the function's argument list. All this is routine.
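As a small illustration of how these slots fit together, here is a sketch of a different, simpler command. The name `my-region-size' is hypothetical and is not part of the command we are building; the body only reports how many characters lie between the two positions that `(interactive "r")' passes in.

```elisp
;; Hypothetical illustration only: `my-region-size' is an invented name.
(defun my-region-size (beginning end)
  "Print the number of characters in the region."
  ;; "r" passes the two ends of the region, smaller position first.
  (interactive "r")
  (message "The region has %d characters." (- end beginning)))
```

Called from Lisp as (my-region-size 1 11), the function reports 10 characters; called with M-x, it receives the boundaries of the current region instead.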

The body of the function needs to be written so as to do three tasks: first to set up conditions under which the while loop can count words, second to run the while loop, and, third, to send a message to the user.

When a user calls count-words-region, point may be at the beginning or the end of the region. However, the counting process must start at the beginning of the region. This means we will want to put point there if it is not already there. Executing (goto-char beginning) ensures this. Of course, we will want to return point to its expected position when the function finishes its work. For this reason, the body must be enclosed in a save-excursion expression.
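The effect of save-excursion can be seen in a small sketch that works in a temporary buffer. The function name `my-point-after-excursion' and the sample text are assumptions made for this illustration.

```elisp
;; Hypothetical illustration only: the name and the text are invented.
(defun my-point-after-excursion ()
  "Return point after a `save-excursion' that moved it."
  (with-temp-buffer
    (insert "one two three")      ; point is left at the end, position 14
    (save-excursion
      (goto-char (point-min)))    ; inside, point moves to position 1
    (point)))                     ; outside, point is back at position 14
```

Evaluating (my-point-after-excursion) returns 14: the goto-char moved point only for the duration of the save-excursion.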

The central part of the body of the function consists of a while loop in which one expression jumps point forward word by word, and another expression counts those jumps. The true-or-false-test of the while loop should test true so long as point should jump forward, and false when point is at the end of the region.

We could use (forward-word 1) as the expression for moving point forward word by word, but it is easier to see what Emacs identifies as a `word' if we use a regular expression search.

A regular expression search that finds the pattern for which it is searching leaves point after the last character matched. This means that a succession of successful word searches will move point forward word by word.

As a practical matter, we want the regular expression search to jump over whitespace and punctuation between words as well as over the words themselves. A regexp that refuses to jump over interword whitespace would never jump more than one word! This means that the regexp should include the whitespace and punctuation that follows a word, if any, as well as the word itself. (A word may end a buffer and not have any following whitespace or punctuation, so that part of the regexp must be optional.)

Thus, what we want for the regexp is a pattern defining one or more word constituent characters followed, optionally, by one or more characters that are not word constituents. The regular expression for this is:

\w+\W*

The buffer's syntax table determines which characters are and are not word constituents. (See section `What Constitutes a Word or Symbol?' for more about syntax. See also section `The Syntax Table' in The GNU Emacs Manual and section `Syntax Tables' in The GNU Emacs Lisp Reference Manual.)

The search expression looks like this:

(re-search-forward "\\w+\\W*")
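To watch point move word by word, we can record where it lands after each successful search. The following is only a sketch: the name `my-positions-after-searches' is hypothetical, and the optional arguments nil and t given to re-search-forward are an addition made here so that a failed search returns nil and ends the loop rather than signaling an error.

```elisp
;; Hypothetical illustration only: `my-positions-after-searches' is an
;; invented name, and the search runs in a temporary buffer.
(defun my-positions-after-searches (text)
  "Return the positions of point after each successful search in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((positions nil))
      ;; Assumption for this sketch: nil means no search bound, and t
      ;; makes a failed search return nil instead of signaling an error.
      (while (re-search-forward "\\w+\\W*" nil t)
        (setq positions (cons (point) positions)))
      (nreverse positions))))
```

With the standard syntax table, (my-positions-after-searches "one, two three") returns (6 10 15): each search carries point past a word and past the punctuation and whitespace that follow it.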

(Note that paired backslashes precede the `w' and `W'. Within a string, a single backslash has special meaning to the Emacs Lisp interpreter. It indicates that the following character is interpreted differently than usual. For example, the two characters, `\n', stand for `newline', rather than for a backslash followed by `n'. Two backslashes in a row stand for an ordinary, `unspecial' backslash, which is what the regular expression needs to receive.)
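The difference between single and paired backslashes shows up in the length of the resulting strings. The expressions below are a small sketch that can be evaluated one at a time:

```elisp
(length "\n")     ; 1: a single newline character
(length "\\n")    ; 2: a backslash followed by `n'
(length "\\w+")   ; 3: a backslash, `w', and `+'
```

The last string is therefore the three-character regexp `\w+', which is what the search needs.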

We need a counter to count how many words there are; this variable must first be set to 0 and then incremented each time Emacs goes around the while loop. The incrementing expression is simply:

(setq count (1+ count))
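The setq matters here, because 1+ returns a new number without changing its argument. A minimal sketch, using a local variable named count as in the text:

```elisp
(let ((count 0))
  (1+ count)               ; returns 1, but leaves count at 0
  (setq count (1+ count))  ; count is now 1
  (setq count (1+ count))  ; count is now 2
  count)
```

The whole expression returns 2, not 3: only the two setq expressions changed the counter.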

Finally, we want to tell the user how many words there are in the region. The message function is intended for presenting this kind of information to the user. The message has to be phrased so that it reads properly regardless of how many words there are in the region: we don't want to say that "there are 1 words in the region". The conflict between singular and plural is ungrammatical. We can solve this problem by using a conditional expression that evaluates different messages depending on the number of words in the region. There are three possibilities: no words in the region, one word in the region, and more than one word. This means that the cond special form is appropriate.
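The three-way choice can be tried on its own, apart from the rest of the command. In this sketch the name `my-word-count-report' is hypothetical, and the function returns the sentence as a string instead of displaying it with message, so that the result is easy to inspect.

```elisp
;; Hypothetical illustration only: `my-word-count-report' is an
;; invented name.
(defun my-word-count-report (count)
  "Return a sentence reporting COUNT words."
  (cond ((zerop count) "The region does NOT have any words.")
        ((= 1 count) "The region has 1 word.")
        (t (format "The region has %d words." count))))
```

For example, (my-word-count-report 1) returns "The region has 1 word." and (my-word-count-report 5) returns "The region has 5 words."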

All this leads to the following function definition:

;;; First version; has bugs!
(defun count-words-region (beginning end)
  "Print number of words in the region.
Words are defined as at least one word-constituent
character followed by zero or more characters that
are not word-constituents.  The buffer's syntax
table determines which characters these are."
  (interactive "r")
  (message "Counting words in region ... ")

;;; 1. Set up appropriate conditions.
  (save-excursion
    (goto-char beginning)
    (let ((count 0))

;;; 2. Run the while loop.
      (while (< (point) end)
        (re-search-forward "\\w+\\W*")
        (setq count (1+ count)))

;;; 3. Send a message to the user.
      (cond ((zerop count)
             (message 
              "The region does NOT have any words."))
            ((= 1 count) 
             (message 
              "The region has 1 word."))
            (t 
             (message 
              "The region has %d words." count))))))

As written, the function works, but not in all circumstances.

