Oracle® Ultra Search User's Guide
10g Release 1 (10.1)

Part Number B10731-01

8 Understanding the Oracle Ultra Search Administration Tool

The Oracle Ultra Search administration tool lets you manage Oracle Ultra Search instances. This chapter guides you through the screens of the Oracle Ultra Search administration tool.

8.1 Oracle Ultra Search Administration Tool

The Oracle Ultra Search administration tool is a J2EE-compliant Web application. You can use it to manage Oracle Ultra Search instances. To use the administration tool, log on through any browser as a database user, an Enterprise Manager super-user, a Portal user, or an SSO user.


Note:

The Oracle Ultra Search administration tool and the Oracle Ultra Search query applications are part of the Oracle Ultra Search middle tier. However, the Oracle Ultra Search administration tool is independent from the Oracle Ultra Search query application. Therefore, they can be hosted on different computers to enhance security or scalability.

With the administration tool, you can do the following:

8.1.1 Setting Crawler Parameters

To configure the Oracle Ultra Search crawler, you must do the following:

  • Set crawler parameters, such as the crawler log file directory. To do so, use the Crawler Page.

  • Set Web access parameters, such as authentication and the proxy server. To do so, use the Web Access Page.

  • Define data sources. Data sources can be Web pages, database tables, files, email mailing lists, Oracle Sources (for example, Oracle Application Server Portals or federated sources), or user-defined data sources. You can assign one or more data sources to a crawler schedule. To define data sources, use the Sources Page. You can also set parameters for the source, such as domain inclusions or exclusions for Web sources or the display URL template or column for table sources.

  • Define synchronization schedules. The crawler uses the synchronization schedule to reconcile the Oracle Ultra Search index with current data source content. To define crawling schedules, use the Schedules Page.

8.1.2 Setting Query Options

Use query options to let users limit their searches. Searches can be limited to document attributes and data groups.

8.1.2.1 Attributes

Search attributes can be mapped to HTML metatags, table columns, document attributes, and email headers. Some attributes, such as author and description, are predefined and need no configuration. However, you can customize your own attributes. To set custom search attributes to expose to the query user, use the Attributes Page.

8.1.2.2 Data Groups

Data source groups are logical entities exposed to the search engine user. When entering a query, the search engine user is asked to select one or more data groups to search from. A data group consists of one or more data sources. To define data groups, use the Queries Page.

8.1.3 Online Help in Different Languages

Oracle Ultra Search provides context-sensitive online help, which can be viewed in different languages. You can change the language preferences in the Users Page.

8.2 Logging On to Oracle Ultra Search

The following users can log on to the Oracle Ultra Search administration tool:

To log on to the administration tool, point your Web browser to one of the following URLs:

Immediately after installation, the only users able to create and manage instances are the following:

After you are logged on as one of these special users, you can grant permission to other users, enabling them to create and manage Oracle Ultra Search instances. Using the Oracle Ultra Search administration tool, you can grant and revoke Oracle Ultra Search related permissions only to and from existing users. To add or delete users, use Oracle Internet Directory for single sign-on users or Oracle Enterprise Manager for local database users.


Note:

The Oracle Ultra Search product database dictionary is installed in the WKSYS schema.



8.3 Logging On and Managing Instances as SSO Users


Note:

Single Sign-On (SSO) is available only if the Oracle Identity Management infrastructure is installed.

8.3.1 Logging On to Oracle Ultra Search

When a single sign-on (SSO) user logs on to the SSO-protected Oracle Ultra Search administration tool, the user is first prompted with the SSO login screen. Enter the SSO user name and password. After the SSO server authenticates the user, the user sees a list of Oracle Ultra Search instances that they have the privilege to manage.

There are different URLs for different users. For example:

  • SSO users: http://<host>:<http port>/ultrasearch/admin_sso/index.jsp

  • Portal users: http://<host>:<http port>/pls/portal

  • Enterprise Manager users: http://<host>:<em port>/

8.3.2 Granting Privileges to SSO Users

You might need to grant super-user privileges, or privileges for managing an Oracle Ultra Search instance, to an SSO user. This process is slightly different, depending on whether Oracle Application Server Portal is running in hosted mode or non-hosted mode, as described in the following list:


Note:

An SSO user is uniquely identified by Oracle Ultra Search with an SSO-nickname/subscriber-nickname combination.

  • In non-hosted mode, the subscriber-nickname is not required when granting privileges to an SSO user. This is because there is exactly one subscriber in Oracle Application Server Portal in non-hosted mode.

  • In hosted mode, the subscriber-nickname is required when granting privileges to an SSO user. This is because there can be more than one subscriber in Oracle Application Server Portal, and two or more users with the same SSO-nickname (for example, PORTAL) could be distinct SSO users distinguished by their subscriber-nickname. When running in hosted mode, also note the following:

    • When granting permissions for the default subscriber user, always specify "DEFAULT COMPANY" for the subscriber-nickname, even if the actual nickname is different (for example, "ORACLE"). The actual nickname is not recognized by Oracle Ultra Search.

    • When logging in to SSO as the default subscriber user, leave the subscriber nickname blank. Alternatively, enter "DEFAULT COMPANY" instead of the actual subscriber nickname (for example, "ORACLE"), so that Oracle Ultra Search recognizes it.


      Note:

      At any point after installation, you can run an Oracle Application Server Portal script to alter the running mode from non-hosted to hosted. Whenever this is done, the Oracle Application Server Portal script invokes an Oracle Ultra Search script to inform Oracle Ultra Search of the change from non-hosted to hosted mode.


      See Also:

      Hosting Developer's Guide at http://otn.oracle.com/

8.4 Instances Page

After successfully logging on to the Oracle Ultra Search administration tool, you find yourself on the Instances Page. This page manages all Oracle Ultra Search instances in the local database. In the top left corner of the page, there are tabs for creating, selecting, editing, and deleting instances.

Before you can use the administration tool to configure crawling and indexing, you must create an Oracle Ultra Search instance. An Oracle Ultra Search instance is identified with a name and has its own crawling schedules and index. Only users granted super-user privileges can create Oracle Ultra Search instances.

8.4.1 Creating an Instance

To create an instance, click Create. You can create a regular instance or a read-only snapshot instance. Only users with super-user privileges can create new instances.


Note:

If you define the same data source within different Oracle Ultra Search instances, then there could be crawling conflicts for table data sources with logging enabled, email data sources, and some user-defined data sources.

8.4.1.1 Creating a Regular Instance

To create an instance, do the following:

  1. Prepare the database user.

    Every Oracle Ultra Search instance is based on a database user/schema with the WKUSER role.

    The database user you create to house the Oracle Ultra Search instance should be assigned a dedicated self-contained tablespace. This is important if you plan to ever create snapshot instances of this instance. To do this, create a new tablespace. Then, create a new database user whose default tablespace is the one you just created.



  2. Complete instance creation in the Oracle Ultra Search administration tool.

    From the main instance creation page, click Create Instance, and provide the following information:

    • Instance name

    • Database schema: this is the user name from step 1.

    • Schema password

    You can also enter the following optional index preferences:

    • Lexer

      Specify the name of the lexer you want to use for indexing. The lexer breaks text into tokens according to your language. These tokens are usually words. The default lexer is wksys.wk_lexer, as defined in the wk0pref.sql file. After the instance is created, the lexer can no longer be changed.

    • Stoplist

      Specify the name of a stoplist you want to use during indexing. The default stoplist is wksys.wk_stoplist, as defined in the wk0pref.sql file. Try to avoid modifying the stoplist after the instance has been created.

    • Storage

      Specify the name of the storage preference for the index of your instance. The default storage preference is wksys.wk_storage, as defined in the wk0pref.sql file. After the instance is created, the storage preference cannot be changed.



8.4.1.2 Creating a Snapshot Instance

A snapshot instance is a copy of another instance. Unlike a regular instance, a snapshot instance is read only; it does not synchronize its index to the search domain. After the master instance re-synchronizes to the search domain, the snapshot instance becomes out of date. At that point, you should delete the snapshot and create a new one.


Note:

The snapshot and its master instance cannot reside on the same database.

A snapshot instance is useful for the following purposes:

  • Query Processing

    Two Oracle Ultra Search instances can answer queries about the same search domain. Therefore, in a set amount of time, two instances can answer more queries about that domain than one instance. Because snapshot instances do not involve crawling and indexing, snapshot instance creation is fast and inexpensive. Thus, snapshot instances can improve scalability.

  • Backups

    If the master instance becomes corrupted, its snapshot can be transformed into a regular instance by editing the instance mode to updatable. Because the snapshot and its master instance cannot reside on the same database, a snapshot instance should be made updatable only to replace a corrupted master instance.

A snapshot instance does not inherit authentication from the master instance. Therefore, if you make a snapshot instance updatable, you must re-enter any authentication information needed to crawl the search domain.

To create a snapshot instance, do the following:

  1. Prepare the database user.

    As with regular instances, snapshot instances require a database user. This user must have been granted the WKUSER role.

  2. Copy the data from the master instance.

    This is done with the transportable tablespace mechanism, which does not allow renaming of tablespaces. Therefore, a snapshot instance cannot be created on the same database as its master.

    Identify the tablespace or the set of tablespaces that contain all the master instance data. Then, copy it, and plug it into the database user from step 1.

  3. Complete snapshot instance creation in the Oracle Ultra Search administration tool.

    From the main instance creation page, click Create Read-Only Snapshot Instance, and provide the following information:

    • Snapshot instance name

    • Snapshot schema name: this is the database user from step 1.

    • Snapshot schema password

    • Database link: this is the name of the database link to the database where the master instance lives.

    • Master instance name

  4. Enable the snapshot for secure searches.

    If the master instance of the snapshot is secure-search enabled, and if the destination database in which you are creating the snapshot supports secure-search enabled instances, then you must also run a PL/SQL procedure in the destination database.

    Running this procedure translates the IDs of the access control lists (ACLs) in the destination database, rendering them usable. Log on to the database as the WKSYS user. Invoke the procedure as follows:

    exec WK_ADM.USE_INSTANCE('instance_name'); 
    exec WK_ADM.TRANSLATE_ACL_IDS();
    

    where instance_name is the name of the snapshot instance.

    Make sure that both statements complete successfully without error.



8.4.2 Selecting an Instance

You can have multiple Oracle Ultra Search instances. For example, an organization could have separate Oracle Ultra Search instances for its marketing, human resources, and development portals. The administration tool requires you to specify an instance before it lets you make any instance-specific changes.

To select an instance, do the following:

  1. Click Select on the Instances Page.

  2. Select an instance from the pull-down menu.

  3. Click Apply.


    Note:

    Instances do not share data. Data sources, schedules, and indexes are specific to each instance.

8.4.3 Deleting an Instance

To delete an instance, do the following:

  1. Click Delete on the Instances Page.

  2. Select an instance from the pull-down menu.

  3. Click Apply.


    Note:

    To delete an Oracle Ultra Search instance, the user must have been granted super-user privileges.

8.4.4 Editing an Instance

To edit an instance, click Edit on the Instances Page.

You can change the instance mode (make the instance updatable) or change the instance password.

8.4.4.1 Instance Mode

You can change the instance mode to updatable or read only. Updatable instances synchronize themselves to the search domain on a set schedule, whereas read-only instances (snapshot instances) do not do any synchronization. To set the instance mode, select the box corresponding to the mode you want, and click Apply.

8.4.4.2 Schema Password

An Oracle Ultra Search instance must know the password of the database user in which it resides. The instance cannot get this information directly from the database. During instance creation, Oracle provides the database user password, and the instance caches this information.

If this database user password changes, then the password that the instance has cached must be updated. To do this, enter the new password and click Apply. After the new password is verified against the database, it replaces the cached password.

8.5 Crawler Page

The Oracle Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or email archives. Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.

With this page, you can do the following:

8.5.1 Configure the Settings


Crawler Threads

Specify the number of crawler threads to be spawned at run time.


Number of Processors

Specify the number of central processing units (CPUs) that exist on the server where the Oracle Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.


Automatic Language Detection

Not all documents retrieved by the Oracle Ultra Search crawler specify the language. For documents with no language specification, the Oracle Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.

The language recognizer is trained statistically using trigram data from documents in various languages (Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on the Latin-1 alphabet and on any language with a deterministic Unicode range of characters (Chinese, Japanese, Korean, and so on).

The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table data source. If it cannot determine the language, then it takes the following steps:

  1. If the language recognizer is not available or if it is unable to determine a language code, then the default language code is used.

  2. If the language recognizer is available, then the output from the recognizer is used.

This language code is populated in the LANG column of the wk$url and wk$doc tables. Multilexer is the only lexer used for Oracle Ultra Search. All document URLs are stored in wk$doc for indexing and wk$url for crawling.
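
The fallback order described above can be sketched in Python as follows. This is an illustration only, not Oracle Ultra Search code: the function name, its parameters, and the recognizer callable are all hypothetical.

```python
from typing import Callable, Optional


def resolve_language(declared: Optional[str],
                     recognizer: Optional[Callable[[str], Optional[str]]],
                     text: str,
                     default_language: str) -> str:
    """Pick a language code using the fallback order described in the text.

    declared         -- language taken from the HTTP content-language header
                        or the LANGUAGE column, if any (assumed input)
    recognizer       -- hypothetical language recognizer, or None if disabled
    default_language -- the crawler's default language setting
    """
    # A language declared by the source is used as is.
    if declared:
        return declared
    # Otherwise use the recognizer output, when a recognizer is available
    # and it manages to determine a language code.
    if recognizer is not None:
        detected = recognizer(text)
        if detected:
            return detected
    # In every other case, fall back to the default language code.
    return default_language
```

For example, a document with no declared language and a recognizer that cannot decide ends up with the default language code.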


Default Language

If automatic language detection is disabled, or if a Web document does not have a specified language, then the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.


Note:

This default language is used only if the crawler cannot determine the document language during crawling. Set language preference in the Users Page.

You can select a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:

  • Polish

  • Chinese

  • Hungarian

  • Norwegian

  • Romanian

  • Finnish

  • Japanese

  • Spanish

  • Slovak

  • English

  • Turkish

  • Danish

  • Swedish

  • Russian

  • German

  • Korean

  • Dutch

  • Italian

  • Greek

  • Portuguese

  • Czech

  • Hebrew

  • French

  • Arabic


Crawling Depth

A Web document could contain links to other Web documents, which could contain more links. This setting lets you specify the maximum number of nested links the crawler will follow.
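
As a rough illustration of a depth limit, the following Python sketch walks an in-memory link graph breadth first and stops following links at a maximum depth. It assumes the starting address counts as depth 0, which this guide does not state, and the function and parameter names are hypothetical.

```python
from collections import deque


def crawl_with_depth_limit(seed, links, max_depth):
    """Return {url: depth} for every page reachable within max_depth hops.

    links is a dict mapping a URL to the list of URLs it links to; it stands
    in for fetching and parsing real pages.
    """
    seen = {seed: 0}          # assumption: the starting address is depth 0
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue          # links on this page are beyond the limit
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen
```

With a chain of links a, b, c, d and a limit of 2, the sketch visits a, b, and c, and never reaches d.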


See Also:

"Tuning the Web Crawling Process" for more information on the importance of the crawling depth


Crawler Timeout Threshold

Specify the crawler timeout, in seconds. The crawler timeout threshold is used to force a timeout when the crawler cannot access a Web page.


Default Character Set

Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.
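
A minimal Python sketch of this kind of fallback, for illustration only. The function and its parameters are hypothetical and do not reflect how the crawler is implemented.

```python
from typing import Optional


def decode_document(raw: bytes,
                    declared_charset: Optional[str],
                    default_charset: str) -> str:
    """Decode document bytes, using the default when no charset is declared."""
    charset = declared_charset or default_charset
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising an exception.
    return raw.decode(charset, errors="replace")
```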


Cache Directory

Specify the absolute path of the cache directory. During crawling, documents are stored in the cache directory. Each time the cache reaches its preset size, crawling stops and indexing starts.

If you are crawling sensitive information, then make sure that you set the appropriate file system read permission on the cache directory.

You can choose whether or not to have the cache cleared after indexing.


Crawler Logging

Specify the following:

  • Level of detail: everything or only a summary

  • Crawler log file directory

  • Crawler log file language

The log file directory stores the crawler log files. The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler log file language is the language the crawler uses to generate the log file.

The crawler maintains multiple versions of its log file. The format of the log file name is:

i<instance_id>ds<data_source_id>.MMDDhhmm.log

where MM is the month, DD is the day of the month, hh is the launching hour in 24-hour format, and mm is the minutes. For example, if a schedule for data source 23 of instance 3 is launched at 10:00 p.m. on July 8, then the log file name is i3ds23.07082200.log. Each successive schedule launch has a unique log file name. If the total number of log files for a data source reaches the system-specified limit, then the oldest log file is deleted. The number of log files is a scheduler property and applies to all of the data sources assigned to the scheduler.
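
The naming pattern (the letter i, the instance ID, the letters ds, the data source ID, then the launch time as MMDDhhmm) can be reproduced with a short Python sketch. The function name is hypothetical and the year in the usage example is arbitrary, because the file name does not include a year.

```python
from datetime import datetime


def crawler_log_name(instance_id: int, data_source_id: int,
                     launched: datetime) -> str:
    """Build a log file name of the form i<instance>ds<source>.MMDDhhmm.log."""
    # %m = month, %d = day of month, %H = hour (24-hour clock), %M = minutes
    return f"i{instance_id}ds{data_source_id}.{launched:%m%d%H%M}.log"


# Data source 23 of instance 3, launched at 10:00 p.m. on July 8.
print(crawler_log_name(3, 23, datetime(2004, 7, 8, 22, 0)))
```

This prints i3ds23.07082200.log, matching the example in the text.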


Database Connect String

The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form of [hostname]:[port]:[sid] or in the form of a TNS keyword-value syntax; for example:

"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=1521)...))" 

You can update the JDBC connect string to a different format; for example, an LDAP format. However, you cannot change the JDBC connect string to point to a different database. The JDBC connect string must be set to the database where the middle tier points; that is, the middle tier and the JDBC connect string should point to the same database.

In a Real Application Clusters environment, the TNS keyword-value syntax should be used, because it allows connection to any node of the system. For example:

"(DESCRIPTION=(LOAD_BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))
(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001))(CONNECT_DATA=(SERVICE_NAME=sales.us.acme.com)))"
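
Because keyword-value descriptors are easy to mistype, it can help to generate them and check that the parentheses balance. The following Python sketch is an illustration under stated assumptions: the helper names are hypothetical, it covers only the keywords shown in the examples above, and the host, port, and SID in the usage note are invented.

```python
def simple_connect_string(host: str, port: int, sid: str) -> str:
    """Build the short hostname:port:sid form."""
    return f"{host}:{port}:{sid}"


def tns_descriptor(hosts, port: int, service_name: str,
                   load_balance: bool = True) -> str:
    """Build a keyword-value descriptor with one ADDRESS entry per host."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    balance = "(LOAD_BALANCE=yes)" if load_balance else ""
    return (f"(DESCRIPTION={balance}{addresses}"
            f"(CONNECT_DATA=(SERVICE_NAME={service_name})))")


def parens_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

For instance, tns_descriptor(["cls02a", "cls02b"], 3001, "sales.us.acme.com") yields a balanced descriptor with two ADDRESS entries, and simple_connect_string("dbhost", 1521, "orcl") yields the short form.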

8.5.2 Remote Crawler Profiles

Use this page to view and edit remote crawler profiles.

A remote crawler profile consists of all parameters needed to run the Oracle Ultra Search crawler on a computer other than the Oracle Ultra Search database host. To register a remote crawler, use the PL/SQL API wk_crw.register_remote_crawler. You can choose either RMI-based or JDBC-based remote crawling.

To configure the remote crawler, click Edit. Here is a list of configuration parameters that you can change for the remote crawler:

  • Cache file access mode. You have two options for the remote crawler to handle cache files:

    • Through a JDBC connection.

      In this case, the remote crawler will send cache files over the crawler's JDBC connection to the server's cache directory.

    • Through a mounted file system.

      If you choose this option, the cache file is saved in the remote crawler cache directory. The remote crawler cache directory must be mounted to the server-side crawler cache directory (specified on the Settings tab of the Crawler Page); otherwise, the documents cannot be indexed.


    See Also:

    For more on crawling with JDBC connections, see "Using the Remote Crawler"

  • Cache directory location (absolute path)

  • Crawler log file directory

  • Mail archive path

  • Number of crawler threads

  • Number of processors

  • Initial Java heap size (in megabytes)

  • Maximum Java heap size (in megabytes)

  • Java classpath

8.5.3 Crawler Statistics

Use this page to view the following crawler statistics:

8.5.3.1 Summary of Crawler Activity

This provides a general summary of crawler activity:

  • Aggregate crawler statistics

  • Total number of documents indexed

  • Crawler statistics by data source type

8.5.3.2 Detailed Crawler Statistics

This includes the following:

  • List of hosts crawled and indexed

  • Document distribution by depth

  • Document distribution by document type

  • Document distribution by data source type

8.5.3.3 Crawler Progress

This displays crawler progress for the past week. It shows the total number of documents indexed during the one week prior to the current time. The Time column rounds the current time to the nearest hour.

8.5.3.4 Problematic URLs

This lists errors encountered during the crawling process. It also lists the number of URLs that cause each error.

8.6 Web Access Page

Use this page to set up authentication and proxies.

8.6.1 Proxies

Specify a proxy server if the search space includes Web pages that reside outside your organization's firewall. Specifying a proxy server is optional. Currently, only the HTTP protocol is supported.


Note:

The crawler cannot use a proxy server that requires proxy authentication.

You can also set domain exceptions.

8.6.2 Authentication

Use this page to enter authentication information global to all data sources.


Note:

Data source-specific authentication takes precedence over this global authentication.

8.6.2.1 HTTP Authentication

Specify the user name and password for the host and realm for which HTTP authentication is required. Oracle Ultra Search supports both basic and digest authentication.

8.6.2.2 HTML Forms

Register HTML forms that you want the Oracle Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. You can register HTML forms manually or with the form registration wizard. If the HTML form contains JavaScript, then the wizard might fail, and you will need to use manual registration.


Note:

The Oracle Ultra Search crawler will choose the form to use based on the form's URL and the form name. URL parameters are not included during matching; thus, they are truncated during form registration.
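
To illustrate matching on the form URL without its parameters, the following Python sketch builds a lookup key from a form URL and form name. It assumes that "URL parameters" means the query string (and drops any fragment as well); the function name and example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def form_key(form_url: str, form_name: str):
    """Return a (URL without parameters, form name) pair for matching."""
    parts = urlsplit(form_url)
    # Keep scheme, host, and path; drop the query string and fragment.
    bare_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return (bare_url, form_name)
```

Under this assumption, http://www.example.com/login.jsp?site=a and http://www.example.com/login.jsp?site=b produce the same key for a form with the same name.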

8.7 Attributes Page

When your indexed documents contain metadata, such as author and date information, you can let users refine their searches based on this information. For example, users can search for all documents where the author attribute has a certain value.

The list of values (LOV) for a document attribute can help specify a search query. An attribute value can have a display name. For example, the attribute country might use the country code as the attribute value, but show the name of the country to the user. There could be multiple translations of the attribute display name.

To define a search attribute, use the Search Attributes subtab. Oracle Ultra Search provides some system-defined attributes, such as author and description. You can also define your own.

After defining search attributes, you must map between document attributes and global search attributes for data sources. To do so, use the Mappings subtab.


Note:

Oracle Ultra Search provides a command-line tool to load metadata, such as search attribute LOVs and display names, into an Oracle Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Oracle Ultra Search".

8.7.1 Search Attributes

Search attributes are attributes exposed to the query user. Oracle Ultra Search provides system-defined attributes, such as author and description. Oracle Ultra Search maintains a global list of search attributes. You can add, edit, or delete search attributes. You can also click Manage LOV to change the list of values (LOV) for the search attribute. There are two categories of attribute LOVs: one is global across all data sources, the other is data source-specific.

To define your own attribute, enter the name of the attribute in the text box; select string, date, or number; and click Add.

You can add or delete LOV entries and display names for search attributes. The display name is optional. If the display name is absent, then the LOV entry is used in the query screen.


Note:

LOV entries are represented only as strings. If an LOV entry is a date, then you must enter it in the format "DD-MM-YYYY".
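
As an illustration of the DD-MM-YYYY string form, this Python sketch converts between a date and such a string. The helper names are hypothetical, and the sketch does not interact with Oracle Ultra Search.

```python
from datetime import date, datetime


def lov_date_entry(value: date) -> str:
    """Format a date as a DD-MM-YYYY string."""
    return value.strftime("%d-%m-%Y")


def parse_lov_date(entry: str) -> date:
    """Parse a DD-MM-YYYY string; raises ValueError if it does not match."""
    return datetime.strptime(entry, "%d-%m-%Y").date()
```

For example, July 8, 2004 is entered as 08-07-2004.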

To change the LOV update policy, click Manage LOV for any attribute.

A data source-specific LOV can be updated in three ways:

  1. Update the LOV manually.

  2. The crawler agent can automatically update the LOV during the crawling process.

  3. New LOV entries can be automatically added by inspecting attribute values of incoming documents.


    Caution:

    If the update policy is agent-controlled, then the LOV and all translated values are erased during the next crawl.

8.7.2 Mappings

This section displays mapping information for all data sources. For user-defined data sources, mapping is done at the agent level, and document attributes are automatically mapped to search attributes with the same name initially. Document attributes and search attributes are mapped one-to-one. For each user-defined data source, you can edit the global search attribute to which the document attribute is mapped.

For Web, file, or table data sources, mappings are created manually when you create the data source. For user-defined data sources, mappings are automatically created on subsequent crawls.

Click Edit Mappings to change this mapping.

Editing the existing mapping is costly, because the crawler must recrawl all documents for this data source. Avoid this step, unless necessary.


Note:

There are no user-managed mappings for email sources. Instead, the mappings for emails are predefined. The "From" field of an email is intrinsically mapped to the Oracle Ultra Search author attribute. Likewise, the "Subject" field of an email is mapped to the Oracle Ultra Search subject attribute. The abstract of the email message is mapped to the description attribute.

8.8 Sources Page

A collection of documents is called a source. The data source is characterized by the properties of its location, such as a Web site or an email inbox. The Oracle Ultra Search crawler retrieves data from one or more data sources.

The different types of sources are Web sources, table sources, file sources, email sources, Oracle sources, and user-defined sources.

You can create as many data sources as you want. The following section explains how to create and edit data sources.

8.8.1 Web Sources

A Web source represents the content on a specific Web site. Web sources facilitate maintenance crawling of specific Web sites.

8.8.1.1 Creating Web Sources

To create a new Web source, do the following:

  1. Specify a name for the Web source and a starting address. This is the URL for the crawler to begin crawling. The starting address can be HTTP or HTTPS.

  2. Set URL boundary rules to refine the crawling space. You can include or exclude hosts or domains beginning with, ending with, or equal to a specific name.

    For example, an inclusion domain ending with oracle.com limits the Oracle Ultra Search crawler to hosts belonging to Oracle worldwide. Any host ending with oracle.com is crawled, but http://www.oracle.com.tw is not. If you change the inclusion domain to yahoo.com with a new seed "http://www.yahoo.com", then all oracle.com URLs are dropped by the crawler.

    An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both. Exclusion rules always override inclusion rules.

  3. Specify the types of documents the Oracle Ultra Search crawler should process for this source. HTML and plain text are default document types that the crawler always processes.

  4. Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for the host and realm for which authentication is required. Under HTML Forms, you can register HTML forms that you want the Oracle Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Click Register HTML Form to register authentication forms protecting the data source. Note: For the form URL to be crawled, you must verify that the URL is not excluded in the robots.txt file. If it is excluded, then you must disable robot exclusion for this data source. (By default, Oracle Ultra Search enables robot exclusion.)

  5. Choose either No ACL or Ultra Search ACL for the data source. When a user performs a search, the ACL (access control list) controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.

  6. Define, edit, or delete metatag mappings for your Web source. Metatags are descriptive tags in the HTML document header. One metatag can map to only one search attribute.

  7. Override the default crawler settings for each Web source. This step is optional. The parameters you can override are the crawling depth, the number of crawler threads, the language, the crawler timeout threshold, the character set, the maximum cookie size, the maximum number of cookies, and the maximum number of cookies for each host. You can also enable or disable robots exclusion, language detection, the UrlRewriter, indexing dynamic pages, HTTP cookies, and whether content of the cookie log file is shown. (You can also edit those in Edit Web Sources.)

    Robots exclusion lets you control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.foobar.com/, it checks for http://www.foobar.com/robots.txt. If it finds it, the crawler analyzes its contents to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.

    The URL Rewriter is a user-supplied Java module that implements the Oracle Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used.

    The UrlRewriter provides the following possible outcomes for links:

    • There is no change to the link. The crawler inserts it as it is.

    • The link is discarded. The crawler does not insert it.

    • A new display URL is returned, replacing the URL link for insertion.

    • A display URL and an access URL are returned. The display URL may or may not be identical to the URL link.

    The generated new URL link is subject to all existing host, path, and mimetype inclusion and exclusion rules.

    You must put the implemented rewriter class in a jar file and provide the class name and jar file name here.

    If Index Dynamic Page is set to Yes, then dynamic URLs are crawled and indexed. For data sources already crawled with this option, setting Index Dynamic Page to No and recrawling the data source removes all dynamic URLs from the index.

    Some dynamic pages appear as multiple search hits for the same page, and you may not want them all indexed. Other dynamic pages are each different and need to be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that change only in menu expansion, without affecting their contents, should not be indexed. Consider the following three URLs:

    http://itweb.oraclecorp.com/aboutit/network.101/npe/standards/naming_convention.html
    
    http://itweb.oraclecorp.com/aboutit/network.101/npe/standards/naming_convention.html?nsdnv=14z1
    
    http://itweb.oraclecorp.com/aboutit/network.101/npe/standards/naming_convention.html?nsdnv=14
    

    The question mark ('?') in the URL indicates that the rest of the string consists of input parameters. The duplicate hits are essentially the same page with different side menu expansion. Ideally, the same query should yield only one hit:

    http://itweb.oraclecorp.com/aboutit/network.101/npe/standards/naming_convention.html
    
    

    Dynamic page index control applies to the whole data source. So, if a Web site has both kinds of dynamic pages, you need to define them separately as two data sources in order to control the indexing of those dynamic pages.
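The three example URLs above differ only in their query strings. As a rough illustration (not a description of how the crawler itself works), the following Python sketch strips the query string from each URL and shows that all three reduce to the same static page. The helper name strip_query is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


BASE = ("http://itweb.oraclecorp.com/aboutit/network.101/npe/"
        "standards/naming_convention.html")
urls = [BASE, BASE + "?nsdnv=14z1", BASE + "?nsdnv=14"]

# Collapse the three URLs to their query-free form.
distinct_pages = {strip_query(u) for u in urls}
print(len(distinct_pages), "distinct page(s)")
```

A check like this can help you decide whether a site's dynamic URLs are menu variants of one page or genuinely different documents before you choose the Index Dynamic Page setting.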


8.8.2 Table Sources

A table source represents content in a database table or view. The database table or view can reside in the Oracle Ultra Search database instance or in a remote database. Oracle Ultra Search accesses remote databases using database links.

8.8.2.1 Creating Table Sources

To create a table source, click Create Table Source, and follow these steps:

  1. Specify a table source name, and the name of the database link, schema, and table. Click Locate Table.

  2. Specify settings for your table source, such as the default language and the primary key column. You can also specify the column where final content should be delivered, and the type of data stored in that column; for example, HTML, plain text, or binary. For information on default languages, see "Crawler Page".

  3. Verify the information about your table source.

  4. Decide whether or not to use the Oracle Ultra Search logging mechanism to optimize the crawling of table data sources. When logging is enabled, only newly updated documents are revisited during the crawling process. You can enable logging for Oracle tables, enable logging for non-Oracle tables, or disable the logging mechanism. If you enable logging, then you are prompted to create a log table and log triggers. Oracle SQL statements are provided for Oracle tables. If you are using non-Oracle tables, then you must manually create a log table and log triggers. Follow the examples provided to create the log table and log triggers. After you have created the table, enter the table name in Log Table Name.

  5. Map table columns to search attributes. Each table column can be mapped to exactly one search attribute. This lets the search engine seamlessly search data from the table source.

  6. Specify the display URL template or column for the table source. This step is optional. Oracle Ultra Search uses a default text viewer for table data sources. If you specify display URL, then Oracle Ultra Search uses the Web URL defined to display the table data retrieved. If display URL column is available, then Oracle Ultra Search uses the column to get the URL to display the table data source content. You can also specify display URL templates in the following format: http://hostname:port/path?parameter_name=$(key1) where key1 is the corresponding table's primary key column. For example, assume that you can use the following URL to query the bug number 1234567, and the bug number is the primary key of the table: http://bug:7777/pls/bug?rptno=1234567. You can set the table source display URL template to http://bug:7777/pls/bug?rptno=$(key1).

    The Table Column to Key Mappings section provides mapping information. Oracle Ultra Search supports table keys in STRING, NUMBER, or DATE type. If key1 is of NUMBER or DATE type, then you must specify the format model used by the Web site so that Oracle knows how to interpret the string. For example, the date format model for the string '11-Nov-1999' is 'DD-Mon-YYYY'. You can also map other table columns to Oracle Ultra Search attributes. Do not map the text column.

  7. Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered public and visible. Alternatively, you can specify to use Oracle Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.


    See Also:

    Oracle Database SQL Reference for more on format models
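The display URL template in step 6 amounts to a placeholder substitution. The following Python sketch illustrates the idea using the bug-number example from that step. The function name render_display_url is hypothetical, and the sketch assumes the key value is already formatted as the Web site expects.

```python
def render_display_url(template, keys):
    """Replace each $(name) placeholder with the matching key value."""
    url = template
    for name, value in keys.items():
        url = url.replace("$(" + name + ")", str(value))
    return url


template = "http://bug:7777/pls/bug?rptno=$(key1)"
display_url = render_display_url(template, {"key1": 1234567})
print(display_url)
```

For NUMBER or DATE keys, the string passed in would first have to be produced with the format model described above.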

8.8.2.2 Editing Table Sources

On the main Table Sources page, click Edit to change the name of the table source. You can change, add, or delete table column and search attribute mappings; change the display URL template or column; and view values of the table source settings.

8.8.2.3 Table Sources Comprised of More Than One Table

If a table source has more than one table, then a view joining the relevant tables must be created. Oracle Ultra Search then uses this view as the table source. For example, two tables with a master-detail relationship can be joined through a SELECT statement on the master table and a user-implemented PL/SQL function that concatenates the detail table rows.
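The master-detail view can be sketched as follows. This is only an illustration of the shape of such a view: it uses Python's built-in sqlite3 module as a stand-in for an Oracle database, and SQLite's group_concat takes the place of the user-implemented PL/SQL function. The table and column names (orders, order_lines, order_docs) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (order_id INTEGER, item TEXT);
    INSERT INTO orders VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO order_lines VALUES (1, 'bolt'), (1, 'nut'), (2, 'gear');

    -- One row per master record; detail rows are concatenated into
    -- a single text column that can serve as the content column.
    CREATE VIEW order_docs AS
        SELECT o.order_id,
               o.customer,
               (SELECT group_concat(l.item, ' ')
                  FROM order_lines l
                 WHERE l.order_id = o.order_id) AS content
          FROM orders o;
""")

rows = conn.execute(
    "SELECT order_id, customer, content FROM order_docs ORDER BY order_id"
).fetchall()
for row in rows:
    print(row)
```

The view exposes one row for each master record, keyed by the master table's primary key, which is what a table source expects.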

8.8.2.4 Limitations With Database Links

The following restrictions apply to base tables or views on a remote database that are accessed over a database link by the crawler.

  • If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:

    • A view is comprised of a join of one or more tables.

    • A view is based on a single table using a GROUP BY clause.

    The best way to know if a remote table or view can be safely crawled by Oracle Ultra Search is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus:

    SELECT MIN(ROWID) FROM table_name/view_name;
    
  • The base table or view cannot have text columns of type BFILE or RAW.

8.8.3 Email Sources

An email source derives its content from emails sent to a specific email address. When the Oracle Ultra Search crawler crawls an email source, it collects all emails that have the specific email address in any of the "To:" or "Cc:" email header fields.

The most popular application of an email source is where an email source represents all emails sent to a mailing list. In such a scenario, multiple email sources are defined where each email source represents an email list.

To crawl email sources, you need an IMAP account. At present, the Oracle Ultra Search crawler can only crawl one IMAP account. Therefore, all emails to be crawled must be found in the inbox of that IMAP account. For example, in the case of mailing lists, the IMAP account should be subscribed to all desired mailing lists. All new postings to the mailing lists are sent to the IMAP email account and subsequently crawled. The Oracle Ultra Search crawler is IMAP4 compliant.

When the Oracle Ultra Search crawler retrieves an email message, it deletes the email message from the IMAP server. Then, it converts the email message content to HTML and temporarily stores that HTML in the cache directory for indexing. Next, the Oracle Ultra Search crawler stores all retrieved messages in a directory known as the archive directory. The email files stored in this directory are displayed to the search end-user when referenced by a query hit.

To crawl email sources, you must specify the user name and password of the email account on the IMAP server. Also specify the IMAP server host name and the archive directory.

8.8.3.1 Creating Email Sources

To create email sources, you must enter an email address and a description. Optionally, you can specify email aliases and ACL policy. The description can be viewed by all search end-users, so you should specify a short but meaningful name. When you create (register) an email source, the name you use is the email of the mailing list. If the emails are not sent to one of the registered mailing lists, then those emails are not crawled.

You can specify email address aliases for an email source. Specifying an alias for an email source causes all emails sent to the main email address, as well as the alias address, to be gathered by the crawler. An alias is useful when two or more email addresses are logically the same. For example, an email source representing the distribution list list@company.com can have the alternate address list@my.company.com. If list@my.company.com is added to the alias list, then emails sent to that address are treated as if they were sent to list@company.com.
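The alias behavior described above can be sketched in Python as a simple membership test on the "To:" and "Cc:" recipients. The function name belongs_to_source is hypothetical, and the case-insensitive comparison is an assumption of this sketch, not documented crawler behavior.

```python
def belongs_to_source(recipients, address, aliases=()):
    """Return True if any To/Cc recipient is the source address or an alias."""
    accepted = {address.lower(), *(a.lower() for a in aliases)}
    return any(r.strip().lower() in accepted for r in recipients)


source = "list@company.com"
aliases = ["list@my.company.com"]

# Message addressed to the alias: treated as sent to list@company.com.
print(belongs_to_source(["list@my.company.com"], source, aliases))

# Without the alias registered, the same message is not collected.
print(belongs_to_source(["list@my.company.com"], source))
```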

Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source.

8.8.4 File Sources

A file source is the set of documents that can be accessed through the file protocol on the local machine.

To edit the name of a file source, click Edit.

8.8.4.1 Creating File Sources

To create a new file source, do the following:

  1. Specify a name for the file source and the default language.

  2. Designate files or directories to be crawled. If a URL represents a single file, then the Oracle Ultra Search crawler searches only that file. If a URL represents a directory, then the crawler recursively crawls all files and subdirectories in that directory.

  3. Specify inclusion and exclusion paths to modify the crawling space associated with this file source. This step is optional. An inclusion path limits the crawling space. An exclusion path lets you further define the crawling space. If neither path is specified, then crawling is limited to the underlying file system access privileges. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include path file://host/doc and exclude path file://host/doc/unwanted.

  4. Specify the types of documents the Oracle Ultra Search crawler should process for this file source. HTML and plain text are default document types that the crawler always processes.

  5. Oracle Ultra Search displays file data sources in text format. However, if you specify display URL for the file data source, then Oracle Ultra Search uses the URL to display the file data source.

    With display URL for file data sources, the URL uses network protocols, such as HTTP or HTTPS, to access the file data source. To generate display URL for the file data source, specify the prefix of the original file URL and the prefix of the display URL. Oracle Ultra Search replaces the prefix of the file URL with the prefix of the display URL.

    For example, if your file URL is file:///home/operation/doc/file.doc and the display URL is https://webhost/client/doc/file.doc, then you can set the file URL prefix to file:///home/operation and the display URL prefix to https://webhost/client.

  6. Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. Alternatively, you can specify using the Oracle Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
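The prefix replacement described in step 5 can be sketched in Python as follows, using the prefixes from that example. The function name to_display_url is hypothetical, and leaving URLs that do not match the prefix unchanged is an assumption of this sketch.

```python
def to_display_url(file_url, file_prefix, display_prefix):
    """Swap the file URL prefix for the display URL prefix."""
    if not file_url.startswith(file_prefix):
        # Assumption: URLs outside the prefix are returned unchanged.
        return file_url
    return display_prefix + file_url[len(file_prefix):]


display_url = to_display_url(
    "file:///home/operation/doc/file.doc",
    "file:///home/operation",
    "https://webhost/client",
)
print(display_url)
```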

8.8.5 Oracle Sources

You can create, edit, or delete Oracle sources. You can choose federated or Oracle Application Server Portal (crawlable) data sources. A federated source is a repository that maintains its own index. Oracle Ultra Search can issue a query, and the repository can return query results. Oracle Ultra Search also supports the crawling and indexing of Oracle Application Server Portal installations. This enables searching across multiple portal installations.

8.8.5.1 Oracle Portal Sources

Oracle Ultra Search can only crawl public Oracle AS Portal sources. See the Oracle Application Server Portal Configuration Guide for how to set up public pages.

To create Portal sources, you must first register your portal with Oracle Ultra Search. To register your portal:

  1. Provide a name and portal URL base. The portal name is used to identify this portal entry in the Oracle Portal List page. The URL base is the beginning portion of the portal homepage. This includes the host name, port number, and DAD. After it is created, the portal URL base cannot be updated. Click Register Portal. Oracle Ultra Search attempts to contact the Oracle Application Server Portal instance and retrieve information about it.

  2. Choose one or more page groups for indexing. A portal data source is created for each page group. Click Delete to remove existing portal data sources.

You can edit the types of documents the Oracle Ultra Search crawler should process for a portal source. HTML and plain text are default document types that the crawler always processes. To edit document types, click Edit for the portal source after it has been created.


See Also:

The Oracle Application Server Portal documentation.

8.8.5.2 Federated Sources

To create federated sources, specify the name and JNDI for the new data source. By default, no resource adapter is available.

To create a federated source, you must manually deploy the Oracle Ultra Search resource adapter, or searchlet. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system on behalf of a user. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.


See Also:

The JCA 1.0 spec from Javasoft for detailed information on resource adapters and Java Connector Architecture

8.8.5.2.1 Deploying and Binding the Oracle Ultra Search Searchlet

The Oracle Ultra Search searchlet enables queries against one Oracle Ultra Search instance. The Oracle Ultra Search searchlet is packaged as ultrasearch.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory.

To deploy the Oracle Ultra Search searchlet in OC4J standalone, use admin.jar as follows:

java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file
ultrasearch.rar -name UltraSearchSearchlet

At this point, ultrasearch.rar has been deployed in OC4J. However, it has not been instantiated to connect to any Oracle Ultra Search instance. The Oracle Ultra Search searchlet can be instantiated multiple times, to connect to several Oracle Ultra Search instances, by repeating the following steps. To instantiate the searchlet, you must specify configuration parameter values and the JNDI location to which the searchlet instance is bound. To do this, manually edit oc4j-ra.xml. This file is typically located under the $J2EE_HOME/application-deployments/default/UltraSearchSearchlet/ directory. The Oracle Ultra Search searchlet requires four configuration properties: connectionURL, userName, password, and instanceName. For example, to bind a searchlet under "eis/UltraSearch" to connect to the default instance 'wk_test' on machine 'dbhost', the following entry can be used:

<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
 <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
 <config-property name="userName" value="wk_test"/>
 <config-property name="password" value="wk_test"/>
 <config-property name="instanceName" value="wk_test"/>
</connector-factory>

After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.
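Because a misspelled property name in oc4j-ra.xml is easy to overlook, it can help to check the entry before restarting OC4J. The following Python sketch parses a connector-factory entry and reports which of the four required properties are missing. The function name missing_properties is hypothetical, and the sketch checks only property names, not whether the values are valid.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("connectionURL", "userName", "password", "instanceName")


def missing_properties(xml_text):
    """Map each connector-factory location to its missing required properties."""
    root = ET.fromstring(xml_text)
    report = {}
    # iter() also yields the root element itself when its tag matches.
    for factory in root.iter("connector-factory"):
        present = {p.get("name") for p in factory.findall("config-property")}
        report[factory.get("location")] = [n for n in REQUIRED if n not in present]
    return report


entry = """
<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
 <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
 <config-property name="userName" value="wk_test"/>
 <config-property name="password" value="wk_test"/>
 <config-property name="instanceName" value="wk_test"/>
</connector-factory>
"""
report = missing_properties(entry)
print(report)
```

An entry that is not well-formed XML, such as one with a missing quotation mark, causes ET.fromstring to raise a ParseError, which is also a useful signal before a restart.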

8.8.5.2.2 Deploying and Binding the Federator Searchlet

The Federator searchlet interacts with other searchlets to provide a single point of search against multiple repositories. For example, the Federator searchlet can invoke multiple Oracle Ultra Search searchlets to simultaneously query against multiple Oracle Ultra Search instances. In the same manner, the Federator searchlet can invoke searchlets for Oracle Files, Email, and so on. The Federator searchlet is configured and managed with the Oracle Ultra Search administration tool, under the Federated Sources tab. The Federator searchlet is packaged as federator.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory. The deployment procedure for federator.rar is similar to the deployment of the Oracle Ultra Search searchlet. To deploy the Federator searchlet in OC4J standalone, use admin.jar as follows:

java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file
federator.rar -name FederatorSearchlet

To instantiate the searchlet, the Federator searchlet requires four configuration properties: connectionURL, userName, password, and instanceName in the oc4j-ra.xml file. This file is typically located under the $J2EE_HOME/application-deployments/default/FederatorSearchlet/ directory. For example:

<connector-factory location="eis/Federator" connector-name="Federator Adapter">
 <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
 <config-property name="userName" value="wk_test"/>
 <config-property name="password" value="wk_test"/>
 <config-property name="instanceName" value="wk_test"/>
</connector-factory>

After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.

8.8.6 User-Defined Sources

Oracle Ultra Search lets you define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, which contain their own databases and interfaces.

For each new data source type, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Oracle Ultra Search crawler, which enqueues it for later crawling.

To define a new data source, you first define a data source type to represent it.

8.8.6.1 Creating User-Defined Data Source Types

To create, edit, or delete data source types, click Manage Source Types. To create a new type, click Create New Type.

  1. Specify data source type name, description, and crawler agent Java class file or jar file name. The crawler agent Java classpath is predefined at installation time. The agent collects the list of document URLs and associated metadata from the proprietary document source and returns it to the Oracle Ultra Search crawler, which enqueues the information for later crawling. The agent class file or jar file must be located under $ORACLE_HOME/ultrasearch/lib/agent/.

  2. Specify parameters for this data source type. If you add parameters, you must enter the parameter name and a description. Also, you must decide whether to encrypt the parameter value.

Edit data source type information by changing the data source type name, description, crawler agent Java class/jar file name, or parameters.

8.8.6.2 Creating User-Defined Sources

To create a user-defined data source, select the type, click Go, and follow these steps:

  1. Specify a name, default language, and parameter values for the data source. For information on default languages, see the Crawler Page.

  2. Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for host and realm for which authentication is required. Under HTML Forms, you can register HTML forms that you want the Oracle Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Click Register HTML Form to register authentication forms protecting the data source.

  3. Specify the ACL (access control list) policy for the data source: no ACL, repository-generated ACL, or Oracle Ultra Search ACL. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. For the Oracle Ultra Search ACL, you can add more than one group and user to the ACL for the data source.

  4. Specify mappings. This step is optional. Document attributes are automatically mapped directly to the search attribute with the same name during crawling. If you want document attributes to map to another search attribute, then you can specify it here. The crawler picks up attributes that have been returned by the crawler agent or specified here.

  5. Edit crawling parameters.

  6. Specify the document types that the crawler should process for this data source. By default, HTML and plain text are always processed.

You can edit user-defined data sources by changing the name, type, default language, or starting address.

8.9 Schedules Page

Use this page to schedule data synchronization and index optimization. Data synchronization means keeping the Oracle Ultra Search index up to date with all data sources. Index optimization means keeping the updated index optimized for best query performance.

8.9.1 Data Synchronization

The tables on this page display information about synchronization schedules. A synchronization schedule has one or more data sources assigned to it. The synchronization schedule frequency specifies when the assigned data sources should be synchronized. Schedules are sorted first by name. Within a synchronization schedule, individual data sources are listed and can be sorted by source name or source type.

8.9.1.1 Creating Synchronization Schedules

To create a new schedule, click Create New Schedule and follow these steps:

  1. Name the schedule.

  2. Pick a schedule frequency and determine whether the schedule should automatically accept all URLs for indexing or examine URLs before indexing. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing. You can also associate the schedule with a remote crawler profile.

    You can set the frequency to Manual Launch. In this case, the schedule remains in SCHEDULED status until you explicitly invoke data synchronization with the Execute Immediately button of the administration tool (see "Launching Synchronization Schedules").

  3. Assign data sources to the schedule. After a data source has been assigned to a group, it cannot be assigned to other groups.

8.9.1.2 Updating Schedules

Update the indexing option in the Update Schedule page.

8.9.1.3 Editing Synchronization Schedules

After a synchronization schedule has been defined, you can do the following in the Synchronization Schedules List:

  • To assign the schedule to either a crawler that runs on the database host or a remote crawler that runs on a separate host, click Hostname.

  • To change its frequency, click the schedule interval text.

  • To alter its status, click Status.

  • To delete it, click Delete.

  • To edit its name, data source assignments, recrawl policy, or crawling mode, click Edit. When the crawler retrieves a document, it checks to see if it has changed. By default, if the document has not changed, the crawler does not process it. In certain situations, you might want to force the crawler to reprocess all documents. Click Edit to edit schedules in the following ways:

    • Update schedule name. This step is optional. To change the schedule name, specify a name for the schedule, and click Update Schedule Name.

    • Assign data sources to schedule. To assign a data source, select one or more available sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To undo assignments of a data source, select one or more scheduled sources and click <<.

    • Update crawler recrawl policy. You can update the recrawl policy to the following:

      • Process Documents That Have Changed: This is maintenance crawling. Only documents that have changed are recrawled and indexed. For Web data sources, if there are new links in the updated document, then they are followed. For file data sources, new files are collected if their parent directory has changed.

      • Process All Documents: The crawler recrawls the data source. For example, suppose you want to crawl only text and HTML on a Web site. Later, you also want to crawl Microsoft Word and Adobe PDF documents. You must modify the document types for the source, edit the schedule to select Process All Documents, then rerun the schedule so that the crawler picks up PDF and doc document types for this data source. The crawler treats every document as if it has been changed, which means each document is fetched and processed again.

      Upon relaunching the schedule, the following rules determine which URLs will be recrawled:

      • If the previous crawl did not finish (for example, you stopped the crawling or the database tablespace was full), then the crawler only crawls URLs left in the URL queue. URLs already crawled are not touched on recrawl.

      • If the URL queue is empty but there is a new seed added since the last crawl, then the crawler only crawls the new seed.

      • If the URL queue is empty and there is no new seed URL, then the crawler recrawls all crawled URLs.

      Therefore, if you stop the crawler and set Index Dynamic Pages to No, this only affects the URLs in the queue yet to be crawled. The already crawled dynamic pages are removed from the index on the third recrawl when the queue is empty.


      Note:

      All crawled URLs are subject to crawler setting enforcement, not just newly crawled URLs.

    • Update crawling mode. You can update the crawling mode to the following:

      • Automatically accept all URLs for indexing: This mode crawls and indexes.

      • Examine URLs before indexing: This mode crawls only. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing.

      • Index only: This mode indexes only.

      The crawler behaves differently for the documents collected.

    Crawling mode and recrawl policy can be combined for six different combinations. For example, Process All Documents and Index Only forces reindexing existing documents in this data source, while Process Documents That Have Changed and Index Only re-indexes only changed documents.
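The six combinations are the product of the two recrawl policies and the three crawling modes listed above. The following Python sketch simply enumerates them; the variable names are hypothetical.

```python
from itertools import product

RECRAWL_POLICIES = (
    "Process Documents That Have Changed",
    "Process All Documents",
)
CRAWLING_MODES = (
    "Automatically accept all URLs for indexing",
    "Examine URLs before indexing",
    "Index only",
)

# Every recrawl policy can be paired with every crawling mode.
combinations = list(product(RECRAWL_POLICIES, CRAWLING_MODES))
for policy, mode in combinations:
    print(policy, "+", mode)
```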

8.9.1.4 Launching Synchronization Schedules

A schedule's synchronization frequency can be identical to another schedule's synchronization frequency. This gives you maximum flexibility in managing data source synchronization.

You can launch a synchronization schedule in the following ways:

  • Set a schedule frequency and wait for the predetermined launch time.

  • Run it immediately. To do so, click Status, then Execute Immediately.

  • Manually start the schedule.


    Note:

    Launching a synchronization schedule can take a very long time. If a schedule has been launched before, then the next time it is launched, all URLs that belong to the data source to be crawled by the schedule are updated and put into a queue. Depending on the number of URLs associated with that data source, the enqueue operation may take a long time. The administration tool displays the schedule state as 'Launching' the entire time.

The launch of a schedule does not perform any enqueue if the URL queue is not empty or if there is a new seed added since the last crawl. For example, if the user stopped the crawler earlier or if the crawler terminated because of insufficient Oracle tablespace, then the URL queue is not empty. So, on the next launch the crawler does not try to enqueue; instead it works on the existing URL queue until it is empty. In other words, enqueue is only performed when the queue is empty at launch time.
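The relaunch rules can be summarized as a three-way decision. The following Python sketch models only that decision, under the assumption that the queue, the new seeds, and the previously crawled URLs are available as simple lists. The function name urls_for_relaunch is hypothetical and does not correspond to any Oracle Ultra Search API.

```python
def urls_for_relaunch(queue, new_seeds, crawled):
    """Choose which URLs a relaunched schedule works on."""
    if queue:
        # Previous crawl did not finish: only the leftover queue is crawled.
        return list(queue)
    if new_seeds:
        # Queue is empty but a seed was added: only the new seed is crawled.
        return list(new_seeds)
    # Queue is empty and no new seed: all crawled URLs are enqueued again.
    return list(crawled)


crawled = ["http://a/1", "http://a/2"]
print(urls_for_relaunch(["http://a/3"], [], crawled))
print(urls_for_relaunch([], ["http://b/"], crawled))
print(urls_for_relaunch([], [], crawled))
```

Only the last case involves the potentially long enqueue operation mentioned in the note above.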

8.9.1.5 Synchronization Status and Crawler Progress

Click the link in the status column to see the synchronization schedule status. To see the crawling progress for any data source associated with this schedule, click Statistics.

If you decide to examine URLs before indexing for the schedule, then after you run the schedule, the schedule status is shown as "Indexing Pending".

In data harvesting mode, you should begin crawling first. After crawling is done, click Examine URL to examine document URLs and status, remove unwanted documents, and start indexing. After you click Begin Index, the schedule status changes through states such as launching, executing, and scheduled.

The crawling progress contains the following information:

  • Data source type

  • Data source name

  • Start time

  • Finish time

  • Elapsed time

  • Total indexing time

  • Total size of document data collected

  • Average document size

  • Average fetch throughput

It also contains the following statistics:

  • Documents to fetch

  • Documents fetched: This is the sum of Documents non-indexable, Document conversion failures, and Documents indexed.

  • Document fetch failures: This could be an Oracle HTTP Server timeout or another HTTP server error.

  • Documents rejected: The document is not within the URL boundary rule.

  • Documents discovered: This is the sum of Documents to fetch, Documents fetched, Document fetch failures, and Documents rejected.

  • Documents indexed

  • Documents non-indexable: This could be a file directory, a portal page that is a discovery node, or a robot metatag that specifies no index.

  • Document conversion failures: The binary file filter failed.
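The two derived statistics, Documents fetched and Documents discovered, follow directly from the other counters. The sketch below shows the arithmetic with made-up numbers. The class and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlStats:
    # Counters reported on the crawling progress page (hypothetical names).
    to_fetch: int
    fetch_failures: int
    rejected: int
    indexed: int
    non_indexable: int
    conversion_failures: int

    @property
    def fetched(self) -> int:
        # Documents fetched = non-indexable + conversion failures + indexed
        return self.non_indexable + self.conversion_failures + self.indexed

    @property
    def discovered(self) -> int:
        # Documents discovered = to fetch + fetched + fetch failures + rejected
        return self.to_fetch + self.fetched + self.fetch_failures + self.rejected


stats = CrawlStats(to_fetch=10, fetch_failures=2, rejected=5,
                   indexed=70, non_indexable=8, conversion_failures=5)
print(stats.fetched)     # 83
print(stats.discovered)  # 100
```

A check of this kind can help confirm that the numbers on the page are consistent, for example when comparing two crawls of the same data source.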

8.9.2 Index Optimization


Index Optimization

To ensure fast query results, the Oracle Ultra Search crawler maintains an active index of all documents crawled over all data sources. This page lets you schedule when the index is optimized. Optimize the index during hours of low usage.


Note:

Increasing the crawler cache directory size can reduce index fragmentation.


Index Optimization Schedule

You can specify the index optimization schedule frequency. Be sure to specify all required data for the option that you select. You can optimize the index immediately, or you can enable the schedule.


Optimization Process Duration

Specify a maximum duration for the index optimization process. The actual time taken for optimization does not exceed this limit, but it could be shorter. Specifying a longer optimization time results in a more optimized index. Alternatively, you can specify that the optimization continue until it is finished.

If your Oracle Ultra Search instance is secure-search enabled, then the index optimization process also triggers garbage collection of unused access control lists (ACLs).

8.10 Queries Page

This section lets you specify query-related settings, such as data source groups, URL submission, relevancy boosting, and query statistics.

8.10.1 Data Groups

Data groups are logical entities exposed to the search engine user. When entering a query, the user is asked to select one or more data groups from which to search.

A data group consists of one or more data sources. A data source can be assigned to multiple data groups. Data groups are sorted first by name. Within each data group, individual data sources are listed and can be sorted by source name or source type.

To create a new data source group, do the following:

  1. Specify a name for the group.

  2. Assign data sources to the group. To assign a Web or table data source to this data group, select one or more available Web sources or table sources and click >>. To unassign a Web or table data source, select one or more assigned sources and click <<.

  3. Click Finish.

8.10.2 URL Submission


URL Submission Methods

URL submission lets query users submit URLs. These URLs are added to the seed URL list and included in the Oracle Ultra Search crawler search space. You can allow or disallow query users to submit URLs.


URL Boundary Rules Checking

URLs are submitted to a specific Web data source. URL boundary rules checking ensures that submitted URLs comply with the URL boundary rules of the Web data source. You can allow or disallow URL boundary rules checking.
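As a rough illustration of boundary checking, the following Python sketch accepts a submitted URL only if its host falls within a set of inclusion domains and outside a set of exclusion domains. The function names, the domain-suffix matching, and the precedence of exclusions over inclusions are all assumptions made for this example. The actual rule syntax and evaluation order are defined by the Web data source and are not reproduced here.

```python
from urllib.parse import urlsplit


def within_boundary(url, include_domains, exclude_domains):
    """Return True if the URL's host matches an inclusion domain and no
    exclusion domain (assumed semantics, for illustration only)."""
    host = (urlsplit(url).hostname or "").lower()

    def matches(domain):
        domain = domain.lower().lstrip(".")
        # Match the domain itself or any subdomain of it.
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in exclude_domains):
        return False
    return any(matches(d) for d in include_domains)


def accept_submission(url, include_domains, exclude_domains, check_boundary=True):
    """Accept every URL when checking is disallowed; otherwise apply the rules."""
    if not check_boundary:
        return True
    return within_boundary(url, include_domains, exclude_domains)


rules = (["example.com"], ["internal.example.com"])
print(accept_submission("http://www.example.com/docs", *rules))      # True
print(accept_submission("http://internal.example.com/hr", *rules))   # False
print(accept_submission("http://other.org/", *rules))                # False
```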

8.10.3 Relevancy Boosting

Relevancy boosting lets administrators override the search results and influence the order in which documents are ranked in the query result list. You can use it to promote important documents to higher scores, which makes them easier to find.

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.


Locate by Search

To boost a URL, first locate a URL by performing a search. You can specify a host name to narrow the search. After you have located the URL, click Information to edit the query string and score for the document.


Manual URL Entry

If a document has not been crawled or indexed, then it cannot be found in a search. However, you can provide a URL and enter the relevancy boosting information with it. To do so, click Create, and enter the following:

  1. Specify the document URL. You must assign the URL to a data source. This document is indexed the next time it is crawled.

  2. Enter scores in the range of 1 to 100 for one or more query strings. When a user performs a search using the exact query string, the score applies for this URL.

The document is searchable after the document is loaded for the term. The document is also indexed the next time the schedule is run.

With manual URL entry, you can only assign URLs for Web data sources. Users get an error message on this page if no Web data source is defined.
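The effect of a boost can be pictured with a small Python sketch. It assumes that a boost replaces the engine's score for a URL only when the user's query matches the stored query string exactly, including case. The function names and the in-memory dictionary are hypothetical and do not reflect how Oracle Ultra Search stores or applies boosts internally.

```python
def add_boost(boosts, query, url, score):
    """Record a boost for an exact query string (score must be 1 to 100)."""
    if not 1 <= score <= 100:
        raise ValueError("score must be in the range 1 to 100")
    boosts[(query, url)] = score


def rank(query, results, boosts):
    """Re-rank (url, score) pairs, applying boosts for this exact query."""
    rescored = [(url, boosts.get((query, url), score)) for url, score in results]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


boosts = {}
add_boost(boosts, "benefits", "http://hr.example.com/benefits", 100)

results = [("http://news.example.com/", 80),
           ("http://hr.example.com/benefits", 40)]

# Exact match: the boosted URL moves to the top of the list.
print(rank("benefits", results, boosts))
# A different query string leaves the original order unchanged.
print(rank("benefits plan", results, boosts))
```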


Note:

Oracle Ultra Search provides a command-line tool to load metadata, such as document relevance boosting, into an Oracle Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Oracle Ultra Search".

8.10.4 Query Statistics


Enabling Query Statistics

This section lets you enable or disable the collection of query statistics. The logging of query statistics reduces query performance. Therefore, Oracle recommends that you disable the collection of query statistics during regular operation.


Note:

After you enable query statistics, the table that stores statistics data is truncated every Sunday at 1:00 A.M.


Viewing Statistics

If the collection of query statistics is enabled, you can click one of the following categories:


Daily Summary of Query Statistics

This summarizes all query activity on a daily basis. The statistics gathered are:

  • Average query time: the average time taken over all queries

  • Number of queries: the total number of queries made in the day

  • Number of hits: the average number of results returned by each query
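The three daily figures can be derived from a simple query log, as in the following Python sketch. The record layout (timestamp, query time in seconds, number of hits) and the function name are assumptions made for this example. They do not describe the actual statistics table.

```python
from collections import defaultdict
from datetime import datetime


def daily_summary(records):
    """Summarize (timestamp, seconds, hits) records per calendar day."""
    by_day = defaultdict(list)
    for timestamp, seconds, hits in records:
        by_day[timestamp.date()].append((seconds, hits))

    summary = {}
    for day, rows in by_day.items():
        count = len(rows)
        summary[day] = {
            "average_query_time": sum(s for s, _ in rows) / count,
            "number_of_queries": count,
            "average_hits": sum(h for _, h in rows) / count,
        }
    return summary


log = [
    (datetime(2024, 1, 1, 9, 0), 0.25, 10),
    (datetime(2024, 1, 1, 15, 30), 0.75, 30),
    (datetime(2024, 1, 2, 11, 0), 1.0, 0),
]
for day, row in sorted(daily_summary(log).items()):
    print(day, row)
```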


Top 50 Queries

This summarizes the 50 most frequent queries in the past 24 hours.

  • Query string: the query string

  • Average query time: the average time to return a result

  • Number of queries: the total number of queries in the past 24 hours

  • Number of hits: the average number of results returned by each query

  • Frequency: the number of queries divided by total number of queries over all query strings

  • Percentage of ineffective queries: the number of ineffective queries divided by total number of queries over all query strings


Top 50 Ineffective Queries

This summarizes the top 50 ineffective queries in the past 24 hours. Each row in the table describes statistics for a particular query string.

  • Query string: the query string

  • Number of queries: the total number of queries made in the past 24 hours

  • Percentage of ineffective queries: the number of ineffective queries divided by total number of queries for that string


Top 50 Failed Queries

This summarizes the top 50 queries that failed over the past 24 hours. A failed query is one where the search engine end-user did not locate any query results.

The columns are:

  • Query string: the query string

  • Number of queries: the total number of queries made in the past 24 hours

  • Frequency: the percentage occurrence of a failed query

  • Cumulative frequency: the cumulative percentage occurrence of all failed queries
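The frequency and cumulative frequency columns can be understood through the following Python sketch. It assumes that frequency is each query string's share of all failed queries in the period, expressed as a percentage, and that cumulative frequency is the running total of that share down the list. The function name and input format are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def failed_query_report(failed_log, limit=50):
    """Return (query, count, frequency %, cumulative %) rows, most frequent first."""
    counts = Counter(failed_log)
    total = sum(counts.values())
    rows = counts.most_common(limit)
    freqs = [100.0 * n / total for _, n in rows]
    return [(query, n, freq, cum)
            for (query, n), freq, cum in zip(rows, freqs, accumulate(freqs))]


failed = ["vpn setup", "vpn setup", "travel policy", "parking"]
for row in failed_query_report(failed):
    print(row)
# ('vpn setup', 2, 50.0, 50.0)
# ('travel policy', 1, 25.0, 75.0)
# ('parking', 1, 25.0, 100.0)
```

Reading the cumulative column from the top shows how many of the failed searches would be addressed by fixing only the first few query strings.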

8.10.5 Configuration

You can configure the query application and the federation engine with several parameters, including the maximum number of hits and enabling relevancy boosting.

8.11 Users Page

Use this page to manage Oracle Ultra Search administrative users. You can assign a user to manage an Oracle Ultra Search instance. You can also select a language preference.

8.11.1 Preferences

This section lets you set preference options for the Oracle Ultra Search administrator.

You can specify the date and time format. The pull-down menu lists the following languages:

  • English

  • Brazilian Portuguese

  • French

  • German

  • Italian

  • Japanese

  • Korean

  • Simplified Chinese

  • Spanish

  • Traditional Chinese

You can also select the number of rows to display on each page.

8.11.2 Super-Users

A user with super-user privileges can perform all administrative functions on all instances, including creating instances, dropping instances, and granting privileges. Only super-users can access this page.

Single sign-on (SSO) users can use a delegated administrative service (DAS) list of values to add another SSO user as a super-user. These users are authenticated by the SSO server before they are allowed access. Database users can add another database user as a super-user.

To grant super-user administrative privileges to another user, enter the user name of the user. Also specify whether the user should be allowed to grant super-user privileges to other users. Then click Add.

8.11.3 Privileges

Only instance owners, users that have been granted general administrative privileges on this instance, or super-users are allowed to access this page. Instance owners must have been granted the WKUSER role.

Single sign-on (SSO) users can use a delegated administrative service (DAS) list of values to add privileges to another SSO user. These users are authenticated by the SSO server before they are allowed access. Database users can add privileges to another database user.


Note:

Database users cannot grant privileges to SSO users, and SSO users cannot grant privileges to database users. The DAS list of values only shows SSO users.

Granting general administrative privileges to a user allows that user to modify general settings for this instance. To do this, enter the user name and specify whether the user should be allowed to grant administrative privileges to other users. Then click Add.

To remove one or more users from the list of administrators for this instance, select one or more user names from the list of current administrators and click Remove.


Note:

General administrative privileges do not include the ability to create or delete an instance. These privileges belong to super-users.

8.12 Globalization Page

Oracle Ultra Search lets you translate names to different languages. This page lets you enter multiple values for search attributes, list of values (LOV) display names, and data groups.

8.12.1 Search Attribute Name

This section lets you translate attribute display names to different languages. The pull-down menu lists the following languages:

  • English

  • Arabic

  • Brazilian Portuguese

  • Canadian French

  • Czech

  • Danish

  • Dutch

  • Finnish

  • French

  • German

  • Greek

  • Hebrew

  • Hungarian

  • Italian

  • Japanese

  • Korean

  • Latin American Spanish

  • Norwegian

  • Polish

  • Portuguese

  • Romanian

  • Russian

  • Simplified Chinese

  • Slovak

  • Spanish

  • Swedish

  • Thai

  • Traditional Chinese

  • Turkish

8.12.2 LOV Display Name

This section lets you translate LOV display names to different languages. Select a search attribute from the pull-down menu: author, description, mimetype, subject, or title. Select the LOV type, and then select the language from the pull-down menu.

8.12.3 Data Group Name

This section lets you translate data group display names to different languages. The pull-down menu lists the language options.