Go to the first, previous, next, last section, table of contents.


Archiving Sparse Files

@UNREVISED

-S
--sparse
Handle sparse files efficiently.

This option causes all files to be put in the archive to be tested for sparseness, and handled specially if they are. The --sparse (-S) option is useful when many dbm files, for example, are being backed up. Using this option dramatically decreases the amount of space needed to store such a file.

In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it is an option needing to be specified on the command line with the creation or updating of an archive.

Files in the filesystem occasionally have "holes." A hole in a file is a section of the file's contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive longer than the original. To have tar attempt to recognize the holes in a file, use --sparse (-S). When you use the --sparse (-S) option, then, for any file using less disk space than would be expected from its length, tar searches the file for consecutive stretches of zeros. It then records in the archive for the file where the consecutive stretches of zeros are, and only archives the "real contents" of the file. On extraction (using --sparse (-S) is not needed on extraction) any such files have hols created wherever the continuous stretches of zeros were found. Thus, if you use --sparse (-S), tar archives won't take more space than the original.

A file is sparse if it contains blocks of zeros whose existence is recorded, but that have no space allocated on disk. When you specify the --sparse (-S) option in conjunction with the --create (-c) operation, tar tests all files for sparseness while archiving. If tar finds a file to be sparse, it uses a sparse representation of the file in the archive. See section How to Create Archives, for more information about creating archives.

--sparse (-S) is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.

Please Note: Always use --sparse (-S) when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use --sparse (-S) while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). @FIXME-xref{incremental when node name is set.}

tar ignores the --sparse (-S) option when reading an archive.

--sparse
-S
Files stored sparsely in the file system are represented sparsely in the archive. Use in conjunction with write operations.

However, users should be well aware that at archive creation time, GNU tar still has to read whole disk file to locate the holes, and so, even if sparse files use little space on disk and in the archive, they may sometimes require inordinate amount of time for reading and examining all-zero blocks of a file. Although it works, it's painfully slow for a large (sparse) file, even though the resulting tar archive may be small. (One user reports that dumping a `core' file of over 400 megabytes, but with only about 3 megabytes of actual data, took about 9 minutes on a Sun Sparstation ELC, with full CPU utilisation.)

This reading is required in all cases and is not related to the fact the --sparse (-S) option is used or not, so by merely not using the option, you are not saving time(6).

Programs like dump do not have to read the entire file; by examining the file system directly, they can determine in advance exactly where the holes are and thus avoid reading through them. The only data it need read are the actual allocated data blocks. GNU tar uses a more portable and straightforward archiving approach, it would be fairly difficult that it does otherwise. Elizabeth Zwicky writes to `comp.unix.internals', on 1990-12-10:

What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks. st_blocks at best tells you how many holes there are; it doesn't tell you where. Just as programs may, conceivably, care what st_blocks is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable).

I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.


Go to the first, previous, next, last section, table of contents.