Oracle® Ultra Search User's Guide 10g Release 1 (10.1), Part Number B10731-01
The Oracle Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns processor threads that fetch documents from various data sources. These documents are cached in the local file system. When the cache is full, the crawler indexes the cached files using Oracle Text. This index is used for querying.
Before you can use the crawler, you must set its operating parameters, such as the number of crawler threads, the crawler timeout threshold, the database connect string, and the default character set. To do so, use the Crawler Settings Page in the administration tool.
In addition to the Web access parameters, you can define specific data sources on the Sources page in the administration tool. You can define one or more of the following data sources:
Web sites
Database tables
Files
Mailing lists
Oracle Application Server Portal page groups
User-defined data sources (requires crawler agent)
If you are defining a user-defined data source to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, then you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Oracle Ultra Search crawler, which enqueues it for later crawling. For more information on defining a new data source type, see the User-Defined sub-tab in Sources page in the administration tool.
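The agent interface itself belongs to the Oracle Ultra Search crawler agent API and is not reproduced here. The following Java sketch only illustrates the general shape of an agent that collects document URLs and metadata from a repository; the class, method, and field names are hypothetical.

```java
// Illustrative sketch only: the names below are hypothetical and do not
// reflect the actual Oracle Ultra Search crawler agent API.
import java.util.ArrayList;
import java.util.List;

public class NotesRepositoryAgent {

    // Hypothetical record describing one document in the repository.
    public static class DocumentInfo {
        public final String url;     // access URL the crawler will fetch
        public final String title;   // document attribute mapped to a search attribute
        public final String author;

        public DocumentInfo(String url, String title, String author) {
            this.url = url;
            this.title = title;
            this.author = author;
        }
    }

    // Collect document URLs and metadata from the proprietary repository
    // so that the crawler can enqueue them for later crawling.
    public List<DocumentInfo> collectDocuments() {
        List<DocumentInfo> docs = new ArrayList<>();
        // A real agent would call the repository's own client API here
        // (for example, Lotus Notes or Documentum libraries).
        docs.add(new DocumentInfo("http://repository.example.com/doc/1001",
                                  "Quarterly Report", "jsmith"));
        return docs;
    }
}
```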
You can create synchronization schedules with one or more data sources attached to them. Synchronization schedules define the frequency at which the Oracle Ultra Search index is kept up to date with existing information in the associated data sources. To define a synchronization schedule, use the Sources page in the administration tool.
For some applications, for security reasons, the URL crawled is different from the one seen by the end user. For example, crawling on an internal Web site inside a firewall might be done without security checking, but when queried by the end user, a corresponding mirror URL outside the firewall must be used. This mirror URL is called the display URL.
By default, the display URL is treated as the access URL unless a separate access URL is provided. The display URL must be unique in a data source; so two different access URLs cannot have the same display URL.
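For example, the crawler might fetch documents through an internal host while end users see an external mirror (both host names below are placeholders):

Access URL:  http://internal.example.com/docs/report.html
Display URL: http://www.example.com/docs/report.html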
Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. The value is retrieved during the crawling process and then mapped to one of the search attributes and stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.
If the document is a Web page, the attribute can come from the HTTP header or it can be embedded inside the HTML in meta tags. Document attributes can be used for many things, including document management, access control, or version control. Different data sources can use attributes with different names for the same purpose; for example, "version" and "revision". They can also use the same attribute name for different purposes; for example, "language" might mean natural language in one data source but programming language in another.
Search attributes are created in three ways:
System-defined search attributes, such as title, author, description, subject, and mimetype
Search attributes created by the system administrator
Search attributes created by the crawler. (During crawling, the crawler agent maps the document attribute to a search attribute with the same name and data type. If not found, then the crawler creates a new search attribute with the same name and type as the document attribute defined in the crawler agent.)
The list of values (LOV) for a search attribute can help you specify a search query. If attribute LOV is available, then the crawler registers the LOV definition, which includes attribute value, attribute value display name, and its translation.
The first time the crawler runs, it must fetch Web pages, table rows, files, and so on based on the data source. It then adds the documents to the Oracle Ultra Search index. The crawling process for the schedule is broken into two phases: queuing and caching documents, and then indexing the cached documents.
Figure 7-1 and Figure 7-2 illustrate an instance of the crawling cycle in a sequence of nine steps. The example uses a Web data source, although the crawler can also crawl other data source types.
Figure 7-1 illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 to 5.
Figure 7-2 illustrates how the crawler caches Web pages. This figure corresponds to Steps 6 to 8.
The steps are the following:
1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. See Figure 7-1.
2. The crawler initiates multiple crawling threads.
3. A crawler thread removes the next URL in the queue.
4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
6. The crawler caches the HTML file in the local file system. See Figure 7-2.
7. The crawler registers the URL in the document table.
8. The crawler thread starts over by repeating Step 3.
Fetching a document, as shown in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
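The sketch below is not the Oracle Ultra Search implementation; under simplified assumptions it only illustrates how multiple threads can drain a shared URL queue, fetch pages, and enqueue newly discovered links. The step numbers in the comments refer to the list above.

```java
import java.util.Set;
import java.util.concurrent.*;

public class CrawlLoopSketch {
    // Shared URL queue, seeded with the seed URLs before the threads start (Step 1).
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
    // Stand-in for the document table used to discard duplicate links (Step 5).
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public void crawl(int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads); // Step 2
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = urlQueue.poll()) != null) {       // Step 3
                    String html = fetch(url);                    // Step 4
                    for (String link : extractLinks(html)) {     // Step 5
                        if (seen.add(link)) {
                            urlQueue.offer(link);
                        }
                    }
                    cacheAndRegister(url, html);                 // Steps 6 and 7
                }                                                // Step 8: loop back to Step 3
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholders for the real fetching, parsing, and caching logic.
    private String fetch(String url) { return ""; }
    private java.util.List<String> extractLinks(String html) { return java.util.List.of(); }
    private void cacheAndRegister(String url, String html) { }
}
```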
When the file system cache is full (default maximum size is 20 megabytes), document caching stops and indexing begins. In this phase, Oracle Ultra Search augments the Oracle Text index using the cached files referred to by the document table. See Figure 7-3.
After the initial crawl, a URL is crawled and re-indexed only if it has changed since the last crawl. The crawler determines whether a URL has changed using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.
To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.
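As an illustration of the If-Modified-Since check described above (not the crawler's own code), a conditional HTTP request can be issued as follows; the URL and timestamp are placeholders.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class IfModifiedSinceCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/page.html"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Timestamp of the last successful crawl, in milliseconds since the epoch.
        long lastCrawled = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        conn.setIfModifiedSince(lastCrawled);
        conn.setRequestMethod("GET");

        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: the page has not changed since the last crawl; skip it.
            System.out.println("Not modified, skipping");
        } else {
            // 200: the page changed (or the server ignored the header);
            // fall back to the checksum comparison described above.
            System.out.println("Fetch and compare checksum");
        }
    }
}
```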
The steps involved in data synchronization are the following:
1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
2. The crawler initiates multiple crawling threads.
3. Each crawler thread removes the next URL in the queue.
4. Each crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links.
5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksums are the same, then the page is discarded and the crawler thread returns to Step 3. Otherwise, the crawler moves to the next step.
6. Each crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded.
7. The crawler caches the document in the local file system. See Figure 7-2.
8. The crawler registers the URL in the document table.
9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over at Step 3.
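The checksum algorithm the crawler uses internally is not specified here; the sketch below assumes an MD5 digest purely to illustrate the comparison performed in Step 5.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumCompare {
    // Compute a digest of the page content (MD5 is only an example choice).
    static byte[] checksum(String pageContent) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(pageContent.getBytes(StandardCharsets.UTF_8));
    }

    // Returns true if the newly fetched page differs from the cached copy
    // and therefore must be re-cached and marked for re-indexing.
    static boolean hasChanged(String newPage, byte[] cachedChecksum) throws Exception {
        return !Arrays.equals(checksum(newPage), cachedChecksum);
    }
}
```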
Oracle Ultra Search provides the following mechanisms to control the scope of Web data source crawling:
URL boundary rule (domain rule and path rule)
robots.txt file and robots META tag
Crawling depth
URL Rewriter API
The URL boundary rule consists of domain rules and path rules. A domain rule specifies the set of Web sites allowed using a host name prefix or suffix. A path rule specifies the URL file path allowed or disallowed for a particular host. You can specify an inclusion or exclusion rule for both a domain rule and a path rule. Exclusion rules always override inclusion rules. Path rules are always host-specific.
For example, an inclusion domain ending with oracle.com limits the Oracle Ultra Search crawler to hosts belonging to Oracle worldwide. Anything ending with oracle.com is crawled, but http://www.oracle.com.tw is not crawled. If you change the inclusion domain to someurl.com with a new seed http://www.someurl.com, then all oracle.com URLs are dropped by the crawler.
An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both.
All URLs must pass domain rules before being checked for path rules. Path rules let you further restrict the crawling space. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include the path /host/doc and exclude the path /host/doc/private. Note that path rules are prefix-based.
Regular expression-based domain and path rules are not supported in the current release.
The following rules restrict the crawler to crawl only www.oracle.com and otn.oracle.com. Furthermore, only URLs under /products/database/ and /products/ias/, but not under /products/ias/web_cache/, are crawled.
Domain inclusion: www.oracle.com
Domain inclusion: otn.oracle.com
Path inclusion for otn.oracle.com: /products/database/ /products/ias/
Path exclusion for otn.oracle.com: /products/ias/web_cache/
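A minimal Java sketch of how such prefix-based rules might be evaluated is shown below. It is not the crawler's code: rule storage and matching are simplified, and the rules are hard-coded to mirror the example configuration above. Exclusion rules override inclusion rules, and path rules are applied only to the host they are defined for.

```java
import java.net.URL;
import java.util.List;

public class BoundaryRuleSketch {

    static final List<String> DOMAIN_INCLUSIONS = List.of("www.oracle.com", "otn.oracle.com");
    static final List<String> PATH_INCLUSIONS   = List.of("/products/database/", "/products/ias/");
    static final List<String> PATH_EXCLUSIONS   = List.of("/products/ias/web_cache/");

    static boolean allowed(String spec) throws Exception {
        URL url = new URL(spec);
        String host = url.getHost();

        // Domain rules are checked first: the host must match an included domain suffix.
        if (DOMAIN_INCLUSIONS.stream().noneMatch(host::endsWith)) {
            return false;
        }
        // Path rules are host-specific; in this example they apply only to otn.oracle.com.
        if (!host.equals("otn.oracle.com")) {
            return true;
        }
        String path = url.getPath();
        // Exclusion rules always override inclusion rules; matching is prefix-based.
        if (PATH_EXCLUSIONS.stream().anyMatch(path::startsWith)) {
            return false;
        }
        return PATH_INCLUSIONS.stream().anyMatch(path::startsWith);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allowed("http://otn.oracle.com/products/database/doc.html"));    // true
        System.out.println(allowed("http://otn.oracle.com/products/ias/web_cache/x.html")); // false
    }
}
```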
The robots.txt protocol is the webmaster's path rule for any spider or crawler that visits his or her Web site. (It is described in the document "A Standard for Robot Exclusion" at http://www.robotstxt.org/wc/norobots.html.) The following example /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/, /tmp/, or /foo.html:
# robots.txt for http://www.acme.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
By default, the Oracle Ultra Search crawler observes the robots.txt protocol, but it also allows the user to override it. If the Web site is under the user's control, a specific robots rule can be tailored for the crawler by specifying the Oracle Ultra Search crawler agent name "User-agent: Oracle Ultra Search". For example:
User-agent: Oracle Ultra Search
Disallow: /tmp/
The robots META tag instructs the crawler whether to index a Web page or follow the links within it. It is described in "HTML Author's Guide to the Robots META tag" (http://www.robotstxt.org/wc/meta-user.html).
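For example, a page that should not be indexed and whose links should not be followed can include the following tag in its HTML head:

<meta name="robots" content="noindex,nofollow">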
Crawling depth controls how deep the crawler follows a link starting from the given seed URL. Since crawling is multi-threaded, this is not a deterministic control, as there may be different routes to a particular page.
The crawling depth limit applies to all Web sites in a given Web data source.
You implement the URL Rewriter API as a Java class to perform link filtering or rewriting. Extracted links within a crawled Web page are passed to this module for checking. This gives you complete control over which links extracted from a Web page are allowed and which ones are discarded. See "Oracle Ultra Search URL Rewriter API" for details.
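The actual interface is described in "Oracle Ultra Search URL Rewriter API"; the class below is only a hypothetical sketch, with invented names, of what link filtering and rewriting can look like.

```java
// Hypothetical sketch; the real interface is described in
// "Oracle Ultra Search URL Rewriter API" and may differ.
public class SampleLinkFilter {

    // Return the URL the crawler should enqueue, or null to discard the link.
    public String rewrite(String extractedLink) {
        // Discard links to a hypothetical staging host.
        if (extractedLink.startsWith("http://staging.example.com/")) {
            return null;
        }
        // Rewrite internal host names to the externally visible mirror
        // (both host names are placeholders).
        return extractedLink.replace("http://internal.example.com/",
                                     "http://www.example.com/");
    }
}
```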
With regard to HTTP redirection, earlier Oracle Ultra Search releases (9.0.2, 9.0.3, and 9.2.0.4) applied the same boundary checking to a redirected URL. Thus, a redirected URL would be rejected if it was outside the boundary rule. If the redirected URL was to be crawled, you had to make sure it was covered by the boundary rule.
In 9.2.0.5, iAS 10g, and Oracle Database 10g, the redirected URL is always allowed if it is a temporary redirection (HTTP status 302 or 307). For a permanent redirection (status 301), the redirected URL is still subject to boundary rules.
HTTP meta tag redirection is always checked against boundary rules.
To increase crawling performance, set up the Oracle Ultra Search crawler to run on one or more computers separate from your database. These computers are called remote crawlers. However, each computer must share the log and mail archive directories with the database computer.
To configure a remote crawler, you must first install the Oracle Ultra Search middle tier on a computer other than the database host. During installation, the remote crawler is registered with the Oracle Ultra Search system, and a profile is created for the remote crawler. After installing the Oracle Ultra Search middle tier, you must log on to the Oracle Ultra Search administration tool and edit the remote crawler profile. You can then assign a remote crawler to a crawling schedule. To edit remote crawler profiles, use the Crawler Settings page in the administration tool.
The crawler uses a set of codes to indicate the crawling result of the crawled URL. Besides the standard HTTP status codes, it uses its own codes for non-HTTP related situations. Only URLs with status 200 will be indexed. Table 7-1 shows these URL status codes.
Table 7-1 Oracle Ultra Search URL Status Codes
Code | Explanation |
---|---|
200 | URL OK |
400 | bad request |
401 | authorization required |
402 | payment required |
403 | access forbidden |
404 | not found |
405 | method not allowed |
406 | not acceptable |
407 | proxy authentication required |
408 | request timeout |
409 | conflict |
410 | gone |
414 | request-URI too large |
500 | internal server error |
501 | not implemented |
502 | bad gateway |
503 | service unavailable |
504 | gateway timeout |
505 | HTTP version not supported |
902 | timeout reading document |
903 | filtering failed |
904 | IOException in processing URL |
906 | connection refused |
907 | socket bind exception |
908 | filter not available |
909 | duplicate document detected |
910 | duplicate document ignored |
911 | empty document |
951 | URL not indexed |
952 | URL crawled |
953 | meta tag redirection |
954 | HTTP redirection |
955 | blacklist URL |
956 | URL is not unique |
957 | sentry URL (URL as placeholder) |
958 | document read error |
959 | form login failed |
1001 | data type is not TEXT/HTML |
1002 | broken network datastream |
1003 | HTTP redirect location does not exist |
1004 | bad relative URL |
1005 | HTTP error |
1006 | error parsing HTTP header |
1007 | invalid URL table column name |
1008 | JDBC driver missing |
1009 | binary document reported as text document |
1010 | invalid display URL |