Understanding the Indexing Service
The
Indexing Service functions much as one would expect—it catalogs a set
of documents, enabling dynamic full-text searches using the search
function, a query form, or Microsoft Internet Explorer. Just as an index
in a book maps an important word to a page inside the book,
content indexing on a computer takes a word within a document and maps
it back to that document. Documents to be indexed can be specified in
catalogs and can include document properties as well as the actual text
in the document. After the Indexing Service is set up, no ongoing
maintenance is needed, and administration is required only when you need
to change a basic configuration. If you didn’t include the Indexing
Service in your original installation of Windows Server 2003, you can
add it through Add/Remove Programs in Control Panel.
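The book-index analogy can be made concrete with a minimal sketch of a word-to-document map. This is only an illustration of the idea, not how the Indexing Service is implemented; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index

def search(index, word):
    """Return the sorted names of documents that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical documents used only for this illustration.
docs = {
    "report.txt": "Quarterly sales report for the northern region",
    "memo.txt": "Sales meeting moved to Friday",
}
index = build_index(docs)
print(search(index, "sales"))   # ['memo.txt', 'report.txt']
print(search(index, "friday"))  # ['memo.txt']
```

A query then becomes a lookup in the map rather than a scan of every document, which is what makes full-text search over a large corpus practical.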
Note
By default, the Indexing Service is disabled in Windows Server 2003.
Defining Terms
When administering the
Indexing Service, you’ll encounter a number of terms that have a
special meaning when used in the Indexing Service context. Here are some
of the most common ones, with their definitions:
Catalog
A directory where all temporary (word lists) and persistent (shadow and
master) indexes and cached properties are stored for a particular
scope.
CiDaemon
A child process created by the Indexing Service (cisvc.exe). CiDaemon
works in the background, filtering documents for the Indexing Service.
Corpus
The entire collection of HTML pages and other documents indexed by the Indexing Service.
Filter
Part of a dynamic-link library (DLL) of filters, each designed to
extract textual information and properties from a specific type of
formatted document.
Master index
A persistent index that contains the indexed data for a large number of
documents. This is usually the largest persistent data structure. In an
ideal state, this is the only index present because all the indexed
data is stored in the master index and there are no shadow indexes or
word lists. A master index is created through a master merge.
Master merge
The process by which shadow indexes are combined with the current
master index into a single master index. Unlike shadow merges, this is
usually a fairly long process.
Persistent index
Data for an index that is stored on disk. Unlike word lists, which
exist only in memory, a persistent index survives shutdowns and
restarts. Persistent-index data is stored in a highly compressed format.
There are two types of persistent indexes: shadow indexes (also
referred to as saved indexes and temporary indexes) and master indexes.
Query
A request to search files for specific data.
Scan
The process by which files and directories are checked for modifications.
Scanning is performed against virtual roots that have been selected for
indexing.
Scope
The range of documents to be searched when executing a query. Physical
paths or virtual roots can specify scopes.
Shadow index (also known as saved index)
A persistent index created by merging word lists and occasionally other
shadow indexes into a single index. A catalog can have multiple shadow
indexes.
Shadow merge
The process by which word lists and shadow indexes are combined into a
single shadow index. A shadow merge is performed to free up memory used
by word lists and also to make the filtered data persistent.
Virtual root
An alias to a physical location on disk. The Indexing Service can index
any directory defined as a virtual root. The Indexing Service can be set
up to work with a central index but point to files on other servers.
Word list
When a document is indexed, the index information goes first to a small
temporary index, called a word list. Word lists are maintained in
memory until the Indexing Service combines them into the existing
indexes.
How Indexing Works
The Indexing Service
uses filters that can read certain types of documents, extract the text
and properties, and send that information to the indexing engine. The
filters included with Windows Server 2003 index the following kinds of
documents: text, HTML, Microsoft Office 95 and later, and Internet Mail
and News (provided that IIS is installed). The Indexing Service can use
other filters made available by software vendors. The vendor that
supplies the filter also supplies installation instructions.
After extracting
the text and properties, the Indexing Service determines the language
the document is written in and removes words that are on the language’s
exception list. The exception list contains prepositions, pronouns,
articles, and so forth, and is appropriately named Noise.xxx, where xxx
represents the language. Noise.xxx is in the System32 directory. Figure 1
shows a portion of the Noise.eng file, which contains the exception
list for American English. You can add words to or remove words from the
exception list using any text editor, such as Notepad.
After
words from the exception list are removed, the remaining words are
stored first in a word list in memory. At least once a day, the word
lists are combined to form temporary saved indexes, and later the
Indexing Service consolidates the temporary indexes into a single master
index.
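The stages described above (remove exception-list words, collect the rest in word lists, then merge into temporary and master indexes) can be sketched roughly as follows. This is a simplified model under stated assumptions: the noise-word set is a hypothetical subset, plain dictionaries stand in for the real compressed index structures, and the function names are invented for the illustration.

```python
from collections import defaultdict

# Hypothetical subset of an exception list; a real Noise.xxx file is longer.
NOISE_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def make_word_list(doc_name, text):
    """Filter one document into an in-memory word list, dropping noise words."""
    word_list = defaultdict(set)
    for word in text.lower().split():
        if word not in NOISE_WORDS:
            word_list[word].add(doc_name)
    return word_list

def merge(*indexes):
    """Combine several indexes into one (stands in for a shadow or master merge)."""
    merged = defaultdict(set)
    for index in indexes:
        for word, docs in index.items():
            merged[word] |= docs
    return merged

word_lists = [
    make_word_list("a.txt", "the history of the printing press"),
    make_word_list("b.txt", "a printing guide"),
]
shadow = merge(*word_lists)               # word lists -> temporary (shadow) index
master = merge(defaultdict(set), shadow)  # old master + shadow indexes -> new master
print(sorted(master["printing"]))  # ['a.txt', 'b.txt']
print("the" in master)             # False
```

Because noise words never reach the word list, they take up no space in any later index and cannot be matched by a query.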
Planning Your Indexing Service
When
designing an indexing site, the first question that arises is how much
storage space will be needed. Allocate disk space equal to at least
30 percent of the size of your corpus; 40 percent is better. During a
master merge, the Indexing Service can temporarily need up to 45 percent
of the corpus size.
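As a quick worked example, the percentages above can be turned into planning figures. The function name and the 20 GiB corpus are hypothetical; only the 30, 40, and 45 percent figures come from the text.

```python
def index_space_estimate(corpus_bytes):
    """Return planning figures (in bytes) using the percentages quoted above."""
    return {
        "minimum": corpus_bytes * 30 // 100,
        "preferred": corpus_bytes * 40 // 100,
        "master_merge_peak": corpus_bytes * 45 // 100,
    }

GIB = 1024 ** 3
estimate = index_space_estimate(20 * GIB)  # a hypothetical 20 GiB corpus
for label, size in estimate.items():
    print(f"{label}: {size / GIB:.1f} GiB")
```

For this corpus the sketch reports 6.0 GiB as the minimum, 8.0 GiB as the preferred allocation, and 9.0 GiB as the temporary peak during a master merge.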
Depending
on the filters used to index a group of documents, the actual size of
the indexes might be less than the standard 30 percent. For example, if
you write a filter for indexing large documents (such as large image
files), you can limit indexing to the first few hundred bytes (about all
you need to get the header information), thus reducing the amount of
space needed for the index.
Note
Because most
Indexing Service operations are read requests (searching the indexes,
returning the results, and then accessing the actual documents), disk
striping (RAID-0) or a RAID-5 array is a good way to reduce disk-bound
I/O operations.
Planning for future site
growth is essential. Moving documents to larger disks to overcome space
limitations can cause query errors until you are able to run a complete
reindex, which can take many hours. Another critical part of planning an
Indexing Service site is to make sure that plenty of memory is
available on the indexing machine. Table 1
shows the minimum memory required versus the recommended minimum amount
for different quantities of documents. As usual, the more memory you
have available, the better (and with the price of memory as low as it
is, consider 512 MB a minimum for any type of Windows Server 2003). With
large numbers of documents, a faster CPU also speeds up indexing and
searching.
Table 1. Memory requirements by number of documents indexed
Number of Documents | Minimum Memory | Recommended Memory |
---|---|---|
Fewer than 100,000 | 128 MB | 128 MB |
100,000 to 250,000 | 128 MB | 128 MB to 256 MB |
250,000 to 500,000 | 128 MB | 256 MB to 512 MB |
500,000 or more | 256 MB | 512 MB or more |
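If you want to apply Table 1 in a script, a lookup along the following lines is one possibility. The boundary values (100,000, 250,000, and 500,000) each appear in two rows of the table, so treating every upper bound as exclusive is an assumption made here, and the function name is hypothetical.

```python
# (upper bound on documents, minimum MB, recommended memory) from Table 1.
# Treating each upper bound as exclusive is an assumption of this sketch.
MEMORY_TABLE = [
    (100_000, 128, "128 MB"),
    (250_000, 128, "128 MB to 256 MB"),
    (500_000, 128, "256 MB to 512 MB"),
    (None, 256, "512 MB or more"),
]

def memory_guidance(document_count):
    """Return (minimum MB, recommended memory) for a given number of documents."""
    for upper, minimum_mb, recommended in MEMORY_TABLE:
        if upper is None or document_count < upper:
            return minimum_mb, recommended

print(memory_guidance(300_000))  # (128, '256 MB to 512 MB')
```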
Merging Indexes
The Indexing
Service automatically combines memory-resident word lists into
disk-resident temporary indexes and, once a day, merges all temporary
indexes into a master index. Depending on the number of temporary
indexes, merging can be a long process that uses much of the CPU’s
resources. Queries are slower during a merge, and other processes on the
computer are slower still.
By default, merges are
done at midnight local time. If this is unsuitable for your system, you
can change the time at which the master merge is performed. You can also
initiate a merge manually when a large number of documents in a catalog
are changed. This section describes how to perform these two tasks.
Setting the Time to Start a Master Merge
To change the operation’s schedule from the default time, follow these steps:
1. Run Regedit.exe.
2. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex.
3. In the rightmost pane of the Registry Editor window, double-click the MasterMergeTime value.
4. The DWORD Editor dialog box opens. In the Data box, type the number of minutes after midnight when a master merge should be initiated. Be sure to select Decimal from the Base options.
5. Click OK and close the Registry Editor.
Note
MasterMergeTime
has a valid range of values from 0 to 1439 minutes, though no error is
reported if you enter a larger value. The default is 0. When the
specified number of minutes after midnight has passed, the Indexing
Service initiates a master merge.
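Because the value is expressed in minutes after midnight and out-of-range entries are not reported, it can help to compute and check the number before entering it. The sketch below does that and, on Windows, can write the value with Python's standard winreg module. Both function names are hypothetical, the write requires administrator rights, and the source does not say whether the service must be restarted for a new value to take effect.

```python
def merge_minutes(hour, minute=0):
    """Convert a 24-hour clock time to minutes after midnight (0 to 1439)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("time must be between 00:00 and 23:59")
    return hour * 60 + minute

def set_master_merge_time(minutes):
    """Write MasterMergeTime as a DWORD. Windows only; needs administrator rights."""
    if not 0 <= minutes <= 1439:
        raise ValueError("MasterMergeTime must be between 0 and 1439")
    import winreg  # standard library, available only on Windows
    path = r"SYSTEM\CurrentControlSet\Control\ContentIndex"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "MasterMergeTime", 0, winreg.REG_DWORD, minutes)

print(merge_minutes(2, 30))  # 150, that is, a master merge at 2:30 A.M.
```

Validating the range in the script guards against the silent acceptance of values larger than 1439 that the note describes.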
Manually Merging Indexes
If
a large number of documents change in a short period, you might want to
perform a merge of the temporary indexes without waiting for the
scheduled master merge. To initiate a merge, follow these steps:
1. Open Computer Management, and select Indexing Service in the console tree.
2. Right-click the appropriate catalog, point to All Tasks on the shortcut menu, and choose Merge. (See Figure 2.)
3. You’re asked to confirm that you want to merge the catalog. Click Yes.
Setting Up an Indexing Console
If you
work with the Indexing Service frequently, it is convenient to set up a
Microsoft Management Console (MMC) with the Indexing Service snap-in. To
do so, follow these steps:
1. Choose Run from the Start menu. Type mmc, and press Enter.
2. Choose Add/Remove Snap-in from the File menu. Click Add.
3. In the Add Standalone Snap-In box, select Indexing Service and click Add. Select Local Computer.
4. Click Close and then OK, and you see an Indexing Service MMC like the one shown in Figure 3.
The
illustrations and examples in the following sections use the Indexing
Service MMC, but you can perform these tasks equally well through
Computer Management.