Oracle8 Visual Information Retrieval Cartridge User's Guide
Release 1.0.1

A55255-02


2 Content-Based Retrieval Concepts

This chapter explains, at a high level, why and how to use content-based retrieval. It covers the following topics:

- Section 2.1, "Overview and Benefits"
- Section 2.2, "How Content-Based Retrieval Works"
- Section 2.3, "How Matching Works"
- Section 2.4, "Preparing or Selecting Images for Useful Matching"

2.1 Overview and Benefits

Inexpensive image-capture and storage technologies have allowed massive collections of digital images to be created. However, as a database grows, the difficulty of finding relevant images increases. Two general approaches to this problem have been developed, both of which use metadata for image retrieval:

- Descriptive attributes: text, numeric, and date columns, usually entered manually, that record the semantic significance of each image and are queried with conventional conditions.
- Content-based attributes: visual attributes such as color, texture, and structure, extracted automatically from the image data itself and queried by comparing images for visual similarity.

With Visual Information Retrieval Cartridge, you can combine both approaches in designing a table to accommodate images: use traditional text columns to describe the semantic significance of the image (for example, that the pictured automobile won a particular award, or that its engine has six or eight cylinders), and use the Visual Information Retrieval types for the image, to permit content-based queries based on intrinsic attributes of the image (for example, how closely its color and shape match a picture of a specific automobile).
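For example, a table for the automobile scenario might combine conventional descriptive columns with an image column. The following sketch is illustrative only: the table and column names are hypothetical, and the image column type is shown as ORDSYS.ORDVIRB, which is an assumption -- use the cartridge type name documented for your release.

CREATE TABLE Automobiles (
  auto_id     NUMBER PRIMARY KEY,
  model_name  VARCHAR2(40),      -- descriptive attributes, entered manually
  cylinders   NUMBER(2),
  award       VARCHAR2(80),
  photo       ORDSYS.ORDVIRB     -- cartridge image type (assumed name); holds
                                 --   the image data and its signature
);

With such a table, a query can combine ordinary predicates on cylinders or award with a content-based predicate on photo, as shown in Section 2.3.4.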

As an alternative to defining image-related attributes in columns separate from the image, a database designer could create a specialized composite data type that combines the Visual Information Retrieval Cartridge image type with the appropriate text, numeric, and date attributes.

The primary benefit of using content-based retrieval is a reduction in the time and effort required to obtain image-based information. When images are frequently added to or updated in massive databases, it is often impractical to require manual entry of every attribute that might be needed for queries; content-based retrieval provides the needed flexibility and practical value.

Content-based retrieval is useful in database applications where the query is semantically of the form, "find objects that look like this one." Examples include:

- Retail catalogs
- Fashion and fabric design
- Interior design
- Medical imaging (see Section 2.3.5)

For example, a web-based interface to a retail clothing catalog might allow users to search by traditional categories (such as style or a price range) and also by image properties (such as color or texture). Thus, a user might ask for formal shirts in a particular price range that are off-white with pin stripes. Similarly, fashion designers could use a database with images of fabric swatches, designs, concept sketches, and actual garments to facilitate their creative processes.

2.2 How Content-Based Retrieval Works

A content-based retrieval system processes the information contained in image data and creates an abstraction of its content in terms of visual attributes. Any query operations deal solely with this abstraction rather than with the image itself. Thus, every image inserted into the database is analyzed, and a compact representation of its content is stored in a feature vector, or signature.

The signature contains information about the following visual attributes:

- Global color: the distribution of colors within the entire image
- Local color: color distributions and the locations in the image where they occur
- Texture: the texture patterns in the image, such as graininess or smoothness
- Structure: the shapes, lines, and edges that appear in the image

Feature data for all these visual attributes is stored in the signature, whose size typically ranges from 1000 to 2000 bytes.

Images in the database can be retrieved by matching them with a comparison image. The comparison image can be any image: inside or outside the current database, a sketch, an algorithmically generated image, and so forth.

The matching process requires that signatures be generated for the comparison image and for each image to be compared with it. Because images are seldom identical, matching is based on a similarity-measuring function for the visual attributes and a weight for each attribute. The score is the relative distance between the two images with respect to a given attribute, with a smaller distance reflecting a closer match; the attribute scores are combined to determine the overall degree of similarity, as explained in Section 2.3.3.

2.2.1 Global Color and Local Color

Global color reflects the distribution of colors within the entire image, whereas local color reflects color distributions and where they occur in an image. To illustrate the difference between global color and local color, consider Figure 2-1.

Figure 2-1 Image Comparison: Global Color and Local Color

Image 1 and Image 2 are the same size and are filled with solid colors. In Image 1, the top left quarter (25%) is red, the bottom left quarter (25%) is blue, and the right half (50%) is yellow. In Image 2, the top right quarter (25%) is blue, the bottom right quarter (25%) is red, and the left half (50%) is yellow.

If the two images are compared first solely on global color and then solely on local color, the similarity results are as follows:

- Global color: the images are very similar (a low score), because each contains the same proportions of the same colors: 25% red, 25% blue, and 50% yellow.
- Local color: the images are much less similar (a higher score), because the colors they share do not occur in the same locations.

Thus, if you need to select images based on the dominant color or colors (for example, to find apartments with blue interiors), give greater relative weight to global color. If you need to find images with common colors in common locations (for example, red dominant in the upper portion to find sunsets), give greater relative weight to local color.
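For example, a cursor that emphasizes local color to find sunset-like images might look like the following, using the Similar( ) operator format shown in Section 2.3.4. The Pictures table, photo column, weight values, and threshold are illustrative, and sunset_sig is assumed to hold the signature of a comparison sunset image (computed with Analyze( )).

CURSOR getsunsets IS
  SELECT photo_id FROM Pictures WHERE
    ORDSYS.VIR.Similar(photo.ImgSignature, sunset_sig,
      'globalcolor="0.2", localcolor="1.0", texture="0.0", structure="0.0"',
      25) = 1;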

Figure 2-2 shows two images very close (score = 0.0) in global color. Figure 2-3 shows two images very close (score = 0.02461) in local color.

Figure 2-2 Images Very Similar in Global Color

Figure 2-3 Images Very Similar in Local Color

2.2.2 Texture and Structure

Texture is most useful for full images of textures, such as catalogs of wood grains, marble, sand, or stones. These images are generally hard to categorize using keywords alone because our vocabulary for textures is limited. Texture can be used effectively alone (without color) for pure textures, or combined with a small amount of global color for textures such as wood or fabrics. Figure 2-4 shows two similar fabric samples (score = 4.1).

Figure 2-4 Fabric Images with Similar Texture

Structure is not strictly confined to certain sizes or positions; however, objects of the same size and position produce a lower score (greater similarity) than objects of different sizes or positions. Structure is useful for capturing objects such as horizon lines in landscapes, rectangular structures in buildings, and organic structures like trees. It is especially useful for querying on simple shapes (such as circles, polygons, or diagonal lines), particularly when the query image is a hand-drawn sketch in which color is not important. Figure 2-5 shows two images very close (score = 0.61939) in structure.

Figure 2-5 Images with Very Similar Structure

2.2.3 Face Recognition

Visual Information Retrieval Cartridge supports face recognition software developed by Viisage Technology, Inc. This third-party software analyzes images of faces and generates a facial signature based on various unique biometric characteristics.

After you have generated facial signatures with the Viisage software, you can use the Visual Information Retrieval Cartridge Convert( ), Score( ), and Similar( ) operators to compare the images.

See the Viisage product documentation for more details.

2.3 How Matching Works

When you match images, you assign an importance measure, or weight, to each of the visual attributes, and the cartridge calculates a similarity measure for each visual attribute.

2.3.1 Weight

Each weight value can range from 0.0 (no importance) to 1.0 (highest importance). Each weight value reflects how sensitive the matching process should be to the degree of similarity or dissimilarity between two images. For example, if you want global color to be completely ignored in matching, assign a weight of 0.0 to global color; in this case, any similarity or difference between the color of the two images is totally irrelevant in matching. On the other hand, if global color is extremely important, assign it a weight near or equal to 1.0; this will cause any similarity or dissimilarity between the two images with respect to global color to contribute greatly to whether or not the two images match.
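For example, in the weight-string format used by the Similar( ) operator (shown in Section 2.3.4), ignoring global color entirely while emphasizing local color might be written as follows; the specific values are illustrative only:

'globalcolor="0.0", localcolor="1.0", texture="0.3", structure="0.2"'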

You should give at least one visual attribute a weight significantly greater than 0.0; otherwise, you may get too many matches. As an extreme example, if you assign a weight of 0.0 to all four visual attributes, then every image comparison results in a match, because the weighted sum of all attributes is zero and therefore less than or equal to any threshold value; see Section 2.3.3 for details of the calculation. Such a result, of course, defeats the purpose of content-based retrieval.

2.3.2 Score

The similarity measure for each visual attribute is calculated as the score or distance between the two images with respect to that attribute. The score can range from 0 (no difference) to 100 (maximum possible difference). Thus, the more similar two images are with respect to a visual attribute, the smaller the score will be for that attribute.

As an example of how distance is determined, assume that the dots in Figure 2-6 represent scores for three images with respect to two visual attributes, such as global color and structure, plotted along the X and Y axes of a graph.

Figure 2-6 Score and Distance Relationship

For matching, assume Image 1 is the comparison image, and Image 2 and Image 3 are each being compared with Image 1. With respect to the global color attribute plotted on the X axis, the distance between Image 1 and Image 2 is relatively small (for example, 15), whereas the distance between Image 1 and Image 3 is much greater (for example, 75). If the global color attribute is given more weight, then the fact that the two distance values differ by a great deal will probably be very important in determining whether or not Image 2 and Image 3 match Image 1. However, if global color is minimized and the structure attribute is emphasized instead, then Image 3 will match Image 1 better than Image 2 matches Image 1.

2.3.3 Similarity Calculation

In Section 2.3.2, Figure 2-6 showed a graph of only two of the attributes that Visual Information Retrieval Cartridge can consider. In reality, when images are matched, the degree of similarity depends on a weighted sum reflecting the weight and distance of all four of the visual attributes of the comparison image and the test image.

For example, assume that for the comparison image (Image 1) and one of the images being tested for matching (Image 2), Table 2-1 lists the relative distances between the two images for each attribute. Note that you would never see these individual numbers unless you computed four separate scores, each time highlighting one attribute and setting the others to zero.

Table 2-1 Distances for Visual Attributes Between Image 1 and Image 2

Visual Attribute    Distance
----------------    --------
Global color        15
Local color         90
Texture             5
Structure           50

In this example, the two images are most similar with respect to texture (distance = 5) and most different with respect to local color (distance = 90).

Assume that for the matching process, the following weights have been assigned to each visual attribute:

- Global color: 0.1
- Local color: 0.6
- Texture: 0.2
- Structure: 0.1

The weights are typically supplied in the range 0.0 to 1.0. Within this range, a weight of 1 indicates the strongest emphasis, and a weight of 0 means the attribute should be ignored. You can use a different range (such as 0 to 100), but be careful not to accidentally combine different ranges. The values you supply are automatically normalized so that the weights total 100 percent while maintaining the ratios you supplied; for example, weights of 20, 60, 10, and 10 on a 0-to-100 scale normalize to 0.2, 0.6, 0.1, and 0.1. In this example, the weights already total 1.0, so normalization does not change them.

The following formula is used to calculate the weighted sum of the distances, which is used to determine the degree of similarity between two images:

weighted_sum = global_color_weight * global_color_distance +
               local_color_weight  * local_color_distance +
               texture_weight      * texture_distance +
               structure_weight    * structure_distance

The degree of similarity between two images in this case is computed as:

0.1*gc_distance + 0.6*lc_distance + 0.2*tex_distance + 0.1*struc_distance

That is:

(0.1*15 + 0.6*90 + 0.2*5 + 0.1*50) = (1.5 + 54.0 + 1.0 + 5.0) = 61.5

To illustrate the effect of different weights in this case, assume that the weights for global color and local color were reversed. In this case, the degree of similarity between two images is computed as:

0.6*gc_distance + 0.1*lc_distance + 0.2*tex_distance + 0.1*struc_distance

That is:

(0.6*15 + 0.1*90 + 0.2*5 + 0.1*50) = (9.0 + 9.0 + 1.0 + 5.0) = 24.0

In this second case, the images are considered to be more similar than in the first case, because the overall score (24.0) is smaller than in the first case (61.5). Whether or not the two images are considered matching depends on the threshold value (explained in Section 2.3.4): if the weighted sum is less than or equal to the threshold, the images match; if the weighted sum is greater than the threshold, the images do not match.

In these two cases, the correct weight assignments depend on what you are looking for in the images. If local color is extremely important, then the first set of weights is a better choice than the second, because the first set grants greater significance to the disparity between these two specific images with respect to local color (weighted sum of 61.5 versus 24.0). Thus, with the first set of weights, these two images are less likely to match -- and a key goal of content-based retrieval is to eliminate uninteresting images so that you can focus on images containing what you are looking for.
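If you want to verify the arithmetic, the two weighted sums can be reproduced with a small PL/SQL block. Nothing in it is specific to the cartridge; the distances come from Table 2-1, and SET SERVEROUTPUT ON is assumed so the results are displayed.

DECLARE
  -- distances from Table 2-1
  gc_distance    NUMBER := 15;   -- global color
  lc_distance    NUMBER := 90;   -- local color
  tex_distance   NUMBER := 5;    -- texture
  struc_distance NUMBER := 50;   -- structure
BEGIN
  -- first case: local color emphasized (weights 0.1, 0.6, 0.2, 0.1)
  DBMS_OUTPUT.PUT_LINE('Case 1: ' ||
    (0.1*gc_distance + 0.6*lc_distance + 0.2*tex_distance + 0.1*struc_distance));
  -- second case: global color and local color weights reversed
  DBMS_OUTPUT.PUT_LINE('Case 2: ' ||
    (0.6*gc_distance + 0.1*lc_distance + 0.2*tex_distance + 0.1*struc_distance));
END;
/

The block prints 61.5 for the first case and 24 for the second, matching the calculations above.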

2.3.4 Threshold Value

When you match images, you assign a threshold value. If the weighted sum of the distances for the visual attributes is less than or equal to the threshold, the images match; if the weighted sum is greater than the threshold, the images do not match.

Using the examples in Section 2.3.3, if you assign a threshold of 60, the images do not match under the first set of weights (weighted sum of 61.5), but they do match under the second set (weighted sum of 24.0). If the threshold is 20, the images do not match in either case; and if the threshold is 61.5 or greater, the images match in both cases.

The following example shows a cursor (getphotos) that selects the photo_id, annotation, and photograph from the Pictures table where the threshold value is 20 for comparing photographs with a comparison image:

CURSOR getphotos IS
  SELECT photo_id, annotation, photo FROM Pictures WHERE
    ORDSYS.VIR.Similar(photo.ImgSignature, comparison_sig,
      'globalcolor="1.0", localcolor="0.7", texture="0.1", structure="0.9"',
      20) = 1;

Before the cursor executes, the Analyze( ) operator must be used to compute the signature of the comparison image (comparison_sig), and to compute signatures for each image in the table. Chapter 4 describes all the operators, including Analyze( ) and Similar( ).
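The following sketch shows how such a cursor might be used from PL/SQL. It assumes the Pictures table from the example above and that comparison_sig has already been populated with the Analyze( ) operator; the RAW(2000) size and the DBMS_OUTPUT reporting are illustrative assumptions only.

DECLARE
  comparison_sig RAW(2000);   -- signature of the comparison image; assumed size,
                              --   computed earlier with Analyze( )
  CURSOR getphotos IS
    SELECT photo_id, annotation, photo FROM Pictures WHERE
      ORDSYS.VIR.Similar(photo.ImgSignature, comparison_sig,
        'globalcolor="1.0", localcolor="0.7", texture="0.1", structure="0.9"',
        20) = 1;
BEGIN
  -- comparison_sig must hold a valid signature before the cursor is opened
  FOR r IN getphotos LOOP
    DBMS_OUTPUT.PUT_LINE('Match: photo ' || r.photo_id || ' - ' || r.annotation);
  END LOOP;
END;
/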

You will probably want to experiment with different weights for the visual attributes and different threshold values, to see which combinations retrieve the kinds and approximate number of matches you want.

2.3.5 Example: Medical X-Ray Screening

In cancer screenings, image comparisons can help to detect suspicious areas of x-ray images. X-rays of patients being screened are compared with one or more x-rays that show cancerous cells. For each comparison, the levels of magnification are the same, and the images are comparable in every respect possible (see Section 2.4 for suggestions on preparing images for comparison).

In a medical application, you may never want the computerized comparisons to interfere with the ability of a trained professional (in this case, the cytologist) to make judgments. Thus, you might consider one or both of the following uses of content-based matching to maximize the productivity of the screening process:

For the cancer screening process, you might consider the following guidelines in selecting weights for the visual attributes:

2.4 Preparing or Selecting Images for Useful Matching

The human mind is infinitely smarter than a computer in matching images. If we are near a street and want to identify all red automobiles, we can easily do so because our minds rapidly adjust for the following factors:

However, for a computer to find red automobiles (retrieving all red automobiles and few or no images that are not red or not automobiles), it is helpful if, in every image, the automobile occupies almost the entire frame, there are no extraneous elements (people, plants, decorations, and so on), and the automobiles point in the same direction. In this case, a match emphasizing global color and structure would produce useful results. However, if the pictures show automobiles in different locations, with different relative sizes in the image, pointing in different directions, and with different backgrounds, it will be difficult to perform content-based retrieval with these images.

The following are some suggestions for selecting or preparing images for comparison. The list is not exhaustive, but the basic principle to keep in mind is this: know what you are looking for, and use common sense.

If possible, crop and edit images in accordance with these suggestions before performing content-based retrieval.


Note:

Visual Information Retrieval Cartridge operates as a fuzzy search engine, and is not designed to do correlations. For example, the cartridge cannot find a face in a crowd, but if you crop an individual face from a picture of a crowd, you can then compare it to known images.

 



