The GNU C Library - Finding Tokens in a String

Go to the first, previous, next, last section, table of contents.

Finding Tokens in a String

It's fairly common for programs to have a need to do some simple kinds of lexical analysis and parsing, such as splitting a command string up into tokens. You can do this with the strtok function, declared in the header file `string.h'.

Function: char * strtok (char *newstring, const char *delimiters)

A string can be split into tokens by making a series of calls to the function strtok.

The string to be split up is passed as the newstring argument on the first call only. The strtok function uses this to set up some internal state information. Subsequent calls to get additional tokens from the same string are indicated by passing a null pointer as the newstring argument. Calling strtok with another non-null newstring argument reinitializes the state information. It is guaranteed that no other library function ever calls strtok behind your back (which would mess up this internal state information).

The delimiters argument is a string that specifies a set of delimiters that may surround the token being extracted. All the initial characters that are members of this set are discarded. The first character that is not a member of this set of delimiters marks the beginning of the next token. The end of the token is found by looking for the next character that is a member of the delimiter set. This character in the original string newstring is overwritten by a null character, and the pointer to the beginning of the token in newstring is returned.

On the next call to strtok, the searching begins at the next character beyond the one that marked the end of the previous token. Note that the set of delimiters delimiters do not have to be the same on every call in a series of calls to strtok.

If the end of the string newstring is reached, or if the remainder of string consists only of delimiter characters, strtok returns a null pointer.

Warning: Since strtok alters the string it is parsing, you always copy the string to a temporary buffer before parsing it with strtok. If you allow strtok to modify a string that came from another part of your program, you are asking for trouble; that string may be part of a data structure that could be used for other purposes during the parsing, when alteration by strtok makes the data structure temporarily inaccurate.

The string that you are operating on might even be a constant. Then when strtok tries to modify it, your program will get a fatal signal for writing in read-only memory. See section Program Error Signals.

This is a special case of a general principle: if a part of a program does not have as its purpose the modification of a certain data structure, then it is error-prone to modify the data structure temporarily.

The function strtok is not reentrant. See section Signal Handling and Nonreentrant Functions, for a discussion of where and why reentrancy is important.

Here is a simple example showing the use of strtok.

#include <string.h>
#include <stddef.h>

...

char string[] = "words separated by spaces -- and, punctuation!";
const char delimiters[] = " .,;:!-";
char *token;

...

token = strtok (string, delimiters);  /* token => "words" */
token = strtok (NULL, delimiters);    /* token => "separated" */
token = strtok (NULL, delimiters);    /* token => "by" */
token = strtok (NULL, delimiters);    /* token => "spaces" */
token = strtok (NULL, delimiters);    /* token => "and" */
token = strtok (NULL, delimiters);    /* token => "punctuation" */
token = strtok (NULL, delimiters);    /* token => NULL */

The GNU C library contains two more functions for tokenizing a string which overcome the limitation of non-reentrancy.

Function: char * strtok_r (char *newstring, const char *delimiters, char **save_ptr)

Just like strtok this function splits the string into several tokens which can be accessed be successive calls to strtok_r. The difference is that the information about the next token is not set up in some internal state information. Instead the caller has to provide another argument save_ptr which is a pointer to a string pointer. Calling strtok_r with a null pointer for newstring and leaving save_ptr between the calls unchanged does the job without limiting reentrancy.

This function was proposed for POSIX.1b and can be found on many systems which support multi-threading.

Function: char * strsep (char **string_ptr, const char *delimiter)

A second reentrant approach is to avoid the additional first argument. The initialization of the moving pointer has to be done by the user. Successive calls of strsep move the pointer along the tokens separated by delimiter, returning the address of the next token and updating string_ptr to point to the beginning of the next token.

This function was introduced in 4.3BSD and therefore is widely available.

Here is how the above example looks like when strsep is used.

#include <string.h>
#include <stddef.h>

...

char string[] = "words separated by spaces -- and, punctuation!";
const char delimiters[] = " .,;:!-";
char *running;
char *token;

...

running = string;
token = strsep (&running, delimiters);    /* token => "words" */
token = strsep (&running, delimiters);    /* token => "separated" */
token = strsep (&running, delimiters);    /* token => "by" */
token = strsep (&running, delimiters);    /* token => "spaces" */
token = strsep (&running, delimiters);    /* token => "and" */
token = strsep (&running, delimiters);    /* token => "punctuation" */
token = strsep (&running, delimiters);    /* token => NULL */

Go to the first, previous, next, last section, table of contents.