Content Collector

Business scenarios: Automatic classification of documents and email

Version 8.8.0
The business scenarios describe how a fictitious company called ExampleCo. Enterprises solves typical
problems with unorganized content by using IBM Content Classification to automatically classify documents
and email.
ExampleCo. Enterprises uses Content Classification to ensure that document retention and disposition policies
are enforced, unclassified email is automatically and intelligently archived, and storage and legal review costs
are reduced by filtering out data that is irrelevant to a pending legal case.
To classify content as it is added to a repository, ExampleCo. Enterprises installs IBM FileNet Content
Manager or IBM Content Manager to store and archive documents and emails, IBM Content Collector to
capture file system and email content, and IBM Content Classification to classify the content when it is
collected.
To declare certain documents as records so that they can be managed in accordance with record retention and
compliance requirements, ExampleCo. Enterprises also installs IBM Enterprise Records (formerly FileNet
Records Manager) to manage documents that IBM Content Classification automatically declares as records.
Business scenario: Organizing content in an ECM system to reduce future compliance costs
ExampleCo. Enterprises, a fictitious company, must classify content in an enterprise content
management system and ensure that document retention and disposition policies are enforced.
Business scenario: Filtering and organizing email

ExampleCo. Enterprises, a fictitious company, must classify and archive a backlog of unclassified
email and set up a system to regularly classify all new email.
Business scenario: Reacting to a legal matter

ExampleCo. Enterprises, a fictitious company, might be targeted in a lawsuit. The company needs to
ensure that potentially relevant documents are placed under the control of a records management
system.
Related concepts:
Integration with IBM Content Manager
Integration with IBM FileNet Content Manager and IBM Enterprise Records
Integration with IBM Content Collector
Business scenario: Organizing content in an ECM system to reduce future compliance costs
ExampleCo. Enterprises, a fictitious company, must classify content in an enterprise content management
system and ensure that document retention and disposition policies are enforced.
The ExampleCo. Enterprises corporate repository is built on IBM FileNet Content Manager. Currently,
content in the repository is not well organized, and it does not comply with records management policies.
To address this problem, Bob, an IT specialist, wants to ensure that all of the documents in the IBM FileNet
Content Manager content store are organized into a consistent set of folders and document classes. Anne, a
business analyst, wants to ensure that data in the repository is organized according to a corporate taxonomy.
To achieve this goal, Anne defines a new corporate taxonomy for assigning document properties and
classifying content into folders and document classes. She also defines records management policies (rules
for the retention and disposition of documents).
Bob is tasked with organizing content in the company repository by using the new taxonomy and records
management policies. He will work closely with Anne to reclassify content that already exists in the repository
by applying this new taxonomy. Bob and Anne must work together to ensure that important documents are
declared as records so that they can be managed according to the records management policies.
To intelligently automate this task, Bob decides to use IBM Content Classification. Bob is familiar with IBM
FileNet Content Manager but is new to using IBM Content Classification. He needs to configure a set of rules to
match the corporate taxonomy that Anne provides to him. The rules need to be easy to configure but powerful
enough to classify documents on the basis of both metadata (document properties) and content.
During the reclassification phase, information about the documents might need to be updated, such as the
document properties and the target folder or document class. Some documents might need to be declared as
records. With IBM Content Classification, Bob can use a single interface to configure the needed classification
rules. After he extracts sample content from IBM FileNet Content Manager for training and testing purposes, he
can configure rules that are based on keywords in the content. He can then test the rules with some of the
extracted content to see how the system determined the classifications. If necessary, Bob can fine-tune the
rules to ensure that the correct classification actions will be applied before he classifies the remaining content
in the repository.
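As a rough illustration of a rule that combines metadata with content, the following Python sketch assigns a target folder when the document properties and at least one content keyword match. It does not use the IBM Content Classification interface; the `Rule` class, the property names, and the folder names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: all listed properties must match and one keyword must occur."""
    target_folder: str
    required_properties: dict
    keywords: tuple

    def matches(self, properties: dict, content: str) -> bool:
        text = content.lower()
        props_ok = all(properties.get(k) == v for k, v in self.required_properties.items())
        return props_ok and any(word in text for word in self.keywords)


def classify(rules, properties, content, default="Unclassified"):
    """Return the target folder of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(properties, content):
            return rule.target_folder
    return default


# Illustrative taxonomy entries, not taken from the scenario.
RULES = [
    Rule("Contracts", {"Department": "Legal"}, ("agreement", "contract")),
    Rule("Invoices", {}, ("invoice",)),
]
```

Because the first matching rule wins, rule order matters in this sketch; testing rules against extracted sample content, as Bob does, is how such ordering problems would surface.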
Bob initially starts with a relevant subset of the data, which consists of approximately 100 million documents.
He expects to expand this subset to include a larger portion of the overall repository, from 300 million to 400
million documents. As the system usage expands, he expects that the combination of new documents,
Microsoft SharePoint files, and other content sources might add between 200 and 300 new documents daily.
Anne plans to use IBM Content Classification to review documents. If Anne disagrees with a classification
decision, she can reclassify a document by applying different classification criteria. By reviewing documents,
and either confirming the classification decision or reclassifying the content, Anne helps train the system and
improve accuracy over time.
Parent topic: Business scenarios: Automatic classification of documents and email
Related concepts:

Extracting content from IBM FileNet Content Manager
Building a knowledge base
Building a decision plan
Classification review
Related tasks:
Creating rules
Reclassifying documents

Business scenario: Reacting to a legal matter
ExampleCo. Enterprises, a fictitious company, might be targeted in a lawsuit. The company needs to ensure
that potentially relevant documents are placed under the control of a records management system.
Bob, an IT administrator for ExampleCo. Enterprises, is tasked with adding a large number of documents from
multiple file systems to the corporate data repository. To prepare for a potential legal dispute, the company
needs to declare the potentially relevant documents as records and assign appropriate retention and
disposition rules to each document. The documents and records must be stored in particular folders and
document classes in the case vault repository so that they are available for legal review. After the relevant
documents are declared as records and stored in the appropriate folders, the legal team will review the
documents by using IBM eDiscovery Manager and IBM eDiscovery Analyzer.
Bob selects IBM Enterprise Records for records management. Bob decides to use IBM Content
Collector with IBM Content Classification to automatically and intelligently classify documents and email and
declare them as records according to the company's records management policies.
To control storage and legal review costs, Bob needs to filter out irrelevant data such as company bulletins,
newsletters, personal email, and personal documents that have no relevance to the pending legal case. Bob
will work closely with Anne, a business analyst who has expertise in the company's knowledge management
hierarchy, to define the rules for determining which documents are potentially relevant and need to be retained.
Bob is already familiar with IBM Content Collector and has used it for similar purposes in the past, but it has
typically collected too much content, which increases legal review costs and time. He plans to work with Anne
to identify a set of representative documents that are pertinent to the case to use as a training set.
Working together, they configure classification rules in IBM Content Classification that are based on a list of
keywords provided to them by the legal team. The rules specify that the documents are to be declared as
records in IBM Enterprise Records and identify which file plan is to be used to manage the records.
Although a typical case might have approximately 50,000 to 200,000 potentially relevant documents, the
documents must be identified across departmental and enterprise repositories that can hold hundreds of
millions of documents. It is critical that Bob and Anne understand how IBM Content Classification filters
different documents and email so that they can ensure that all content that might be relevant is captured while
everything else is omitted. After classifying the training set, Bob and Anne can review how the decisions were
applied and adjust the rules as needed.
To ensure that content is classified when it is captured, Bob sets up a task route for IBM Content
Classification in IBM Content Collector. After the system is in production, Bob expects that an additional 1000
potentially relevant documents might be identified out of newly collected documents and email each week.
Anne plans to use IBM Content Classification to review the classification decisions. She can help train the
system by reclassifying content, and she can work with Bob to fine-tune the rules. For example, if irrelevant
documents are classified into the case vault because of the occurrence of some keyword, she might
recommend that a rule be changed so that documents are classified only when the keyword occurs in proximity
to another keyword.
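The proximity condition that Anne might recommend can be pictured with a small Python sketch. It is only a model of the idea, not the rule syntax of IBM Content Classification; the function name and the default gap of five words are assumptions.

```python
import re


def within_proximity(text: str, first: str, second: str, max_gap: int = 5) -> bool:
    """Return True if `first` and `second` occur within `max_gap` words of each other."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_positions for j in second_positions)
```

A document that mentions one keyword in passing no longer qualifies unless the second keyword appears nearby, which narrows what lands in the case vault.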
Related concepts:
eDiscovery Manager
Related tasks:
Creating rules
Business scenario: Filtering and organizing email
ExampleCo. Enterprises, a fictitious company, must classify and archive a backlog of unclassified email and set
up a system to regularly classify all new email.
ExampleCo. Enterprises is a European company that recently acquired a company based in the United States.
Bob, the IT administrator for the newly acquired company, is tasked with creating and maintaining the corporate
email archive.
To control ongoing storage costs and potential legal discovery costs, Bob must manage the email archive to
avoid adding irrelevant data. Additionally, because the two companies previously maintained separate email
systems, Bob must archive content from Lotus Domino and Microsoft Exchange. Bob needs to ensure that
the entire message content is archived, including attachments.
To satisfy all of these requirements, Bob selected IBM Content Collector and IBM Content Classification to
automatically and intelligently create and maintain the email archive. Bob needs to filter out irrelevant data,
such as company bulletins, newsletters, and personal email that has no business value (for example, notes that
discuss the outcome of a local team's sporting event).
Bob creates an IBM Content Classification decision plan to define a set of rules for automatically classifying
email and assigning the correct category value for the item in IBM Content Manager, such as Contracts,
Claims, or Human Resources. The rules need to filter out irrelevant email before it is archived and
automatically detect sensitive information in an email, such as social security numbers and credit card
information.
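To make the idea of detecting sensitive information concrete, here is a minimal Python sketch that looks for US-style social security numbers and for card numbers that pass the Luhn checksum. The patterns are illustrative assumptions and say nothing about how IBM Content Classification itself detects such data.

```python
import re

# Assumed formats: NNN-NN-NNNN for SSNs; 13 to 19 digits, optionally separated, for cards.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Apply the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Return candidate SSNs and Luhn-valid card numbers found in `text`."""
    cards = [m.group().strip() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
    return {"ssn": SSN_PATTERN.findall(text), "cards": cards}
```

The checksum step reduces false positives from long digit strings such as order numbers, at the cost of missing mistyped card numbers.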
Bob creates an IBM Content Classification knowledge base and then trains the classification system by using a
small set of user mailboxes to serve as a set of representative documents. After the system is trained, Bob
builds the archive by classifying all email that was transmitted over the past year. The initial archive contains
approximately 150 million emails.
After the initial archive is built, Bob configures a task route in IBM Content Collector that regularly
archives email and uses IBM Content Classification to classify the email first. Currently, an additional
250,000 to 500,000 emails are expected to be processed per day. The size of individual messages might
range from 20 KB to as large as 100 MB, depending on the length of the email and the number of documents
that are attached.
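The stated volumes translate into a processing rate and rough storage bounds. The following Python sketch works through the arithmetic, reading the daily figure as a range of 250,000 to 500,000 emails and using decimal units; the function names are hypothetical, and the two size extremes are bounds, not expected averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def emails_per_second(emails_per_day: int) -> float:
    """Average sustained rate needed to keep up with the daily volume."""
    return emails_per_day / SECONDS_PER_DAY


def daily_volume_gb(emails_per_day: int, size_kb: int) -> float:
    """Daily volume in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return emails_per_day * size_kb / 1_000_000


low_rate = emails_per_second(250_000)
high_rate = emails_per_second(500_000)
smallest = daily_volume_gb(250_000, 20)        # every message at 20 KB
largest = daily_volume_gb(500_000, 100_000)    # every message at 100 MB
```

The rate works out to roughly 3 to 6 emails per second on average, and the daily volume lies somewhere between 5 GB and 50,000 GB depending on the real mix of message sizes, which is why filtering irrelevant email before archiving matters.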
Anne is a business analyst who is highly familiar with the company's information repositories, the enterprise
taxonomy, and the data flow. Anne wants to use IBM Content Classification to review some documents to
ensure that the correct classification decisions are applied. It is critical that Bob and Anne understand how IBM
Content Classification filters email so that they can ensure that the right email is archived and that the email is
assigned to the correct categories.
Related tasks:
Creating rules
Add automatic content classification to your IBM FileNet P8
Step-by-step examples showing how to install, configure, and integrate

Concerned about records compliance and legal discovery effectiveness?
Challenged by maintaining consistent and reliable access to content items in
various repositories? IBM Classification Module Version 8.5 can help you to
address these issues. This new IBM Enterprise Content Management (ECM)
taxonomy management solution understands the full text of content, adapts to the
business needs, and automates the classification of content in your IBM FileNet
P8 systems. This article provides step-by-step instructions for performing the
seamless integration between IBM Classification Module and IBM FileNet P8. It
then shows you how to use IBM ECM Classification tools (Content Extractor,
Classifier, and Classification Review tool), along with the IBM Classification
Module server component and the Classification Workbench, to automate the
content classification in the integrated environment.
Xiaomei Wang (xiaomeiw@ca.ibm.com), ECM Partner Technical Enablement Team, IBM
Shirley Braley (Shirley_Braley@us.ibm.com), Advisory Software Engineer , IBM Corporation
31 January 2008
Also available in Russian

Introduction
In ECM, taxonomies ensure that content is accurately cataloged and easily accessible. Having
consistent and reliable access to unstructured content is the foundation to realizing the business
benefits of ECM, and all subsequent content-centric enterprise applications will realize their
return on investment (ROI) by leveraging this essential capability.
IBM Classification Module automates the process of categorizing your content by reading and
analyzing the full text of each document. It analyzes the entire document, discerns the topics in
the text, and then assigns it to a proper category. And it won't be fooled by noise in your
content, such as misspellings, abbreviations, shorthand, or technical terms. Moreover, the
IBM Classification Module adapts to the unique nature of your business by learning to identify
different categories from examples you provide to it. And as you provide feedback on its
performance, it adjusts in real time, immediately taking into account corrections you have made.
In this way, its accuracy keeps pace with your business, rapidly adjusting to changes as they
occur.
The IBM Classification Module integration for IBM FileNet P8 can be deployed on Windows,
AIX, Solaris, and Linux environments. The example in this article is specific to the integration
on a Microsoft Windows platform. However, the key concepts and information provided are
relevant to any platform.
Integration architecture overview

IBM Classification Module for IBM FileNet P8 is a system that automatically classifies new
content for insertion into IBM FileNet P8, or filters out content that doesn't fit the profile of
managed content. It can also reclassify and move existing P8 documents to the correct folders
or document classes. Unlike other classification systems that are based on rules only, the
Classification Module is based on a combination of text analysis and rules, and incorporates
real-time learning that adapts to changing business needs and becomes more accurate over
time.
Figure 1. Integration architecture overview
As shown in Figure 1, at the core of this solution is the IBM Classification Module product, which
has been in the marketplace for many years and has proven to be scalable and reliable in
demanding IT environments. It consists of three core components:
Classification Application Program Interface (API)

The Classification API provides C, Java, .NET, or COM client API libraries to
enable rapid development of various client applications that interact directly with
the Classification server component.
Classification server
Embedded with natural language processing capabilities, the Classification server
classifies free-form text by leveraging its Relationship Modeling Engine and a
predefined knowledge base (KB) in decision making.
Classification Workbench
Classification Workbench allows you to create and analyze a knowledge base,
evaluate the KB's performance using reports and graphical diagnostics, and work
with a collection of texts or messages known as a corpus for analysis, training,
and learning.
In addition, the Taxonomy Proposer is a new tool shipped with Classification
Workbench 8.5. It can assist users in creating a taxonomy starting from scratch
or from a partial one, where it uses custom clustering algorithms to analyze and
group similar documents together.
On top of the core product is the newly added integration asset for providing a taxonomy
automation solution to IBM FileNet P8. The integration for IBM FileNet P8 includes the following
components:
Classifier
Automatically classifies and filters out documents, and sets aside a configured
percentage of documents for audit and manual review.
Classification Review Tool

A browser-based application that enables online learning by manually confirming
or correcting automatic classification.
Content Extractor
A command-line tool that extracts sample content from the IBM FileNet P8
repository to train a KB and enable automatic classification.
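The Classifier's habit of setting aside a configured percentage of documents for review can be pictured as random sampling. The Python sketch below is one possible model; the article does not describe how the Classifier actually selects documents, so independent random selection and the function name are assumptions.

```python
import random


def split_for_review(document_ids, review_percentage, seed=None):
    """Randomly set aside roughly `review_percentage` percent of documents for manual review."""
    rng = random.Random(seed)
    review, automatic = [], []
    for doc_id in document_ids:
        # rng.random() is in [0, 1), so 0 selects nothing and 100 selects everything.
        if rng.random() * 100 < review_percentage:
            review.append(doc_id)
        else:
            automatic.append(doc_id)
    return review, automatic
```

With this approach the reviewed share only approximates the configured percentage on small batches, which is usually acceptable for auditing.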
Integration and classification workflow

The example in this article shows you how to first deploy IBM Classification Module integration
for IBM FileNet P8, and then perform the content classification in the integrated environment by
following the workflow displayed in Figure 2.
Figure 2. Integration and classification workflow
Phase 1. Install and configure IBM Classification Module integration software stack
This section uses step-by-step screen shots to illustrate how to install and configure IBM
Classification Module Integration software stack.
Task 1. Install IBM Classification Module and integration components for IBM FileNet P8
This example covers the installation of IBM Classification Module Integration software stack on
the Windows platform. It is assumed that you currently have a working FileNet P8 Version 3.5
server and are able to connect to it.
Before you begin, ensure that:
All system requirements are met.
You use an installation ID that has administrator privileges on the computer
where the Classification Module components are to be installed.
1. Run the Classification Module Version 8.5 installation wizard,
icm85_setup_win32.exe, and click Next.
Figure 1-1-1. Welcome screen
2. Enter the installation path, and click Next.
Figure 1-1-2. Directory path
3. Accept the license agreement, and click Next.
4. Choose the Custom installation type, and click Next.
Note: If you select the Basic installation type, the Classification server, client, and
workbench components will be installed, but the ECM Tools components will not
be.
Figure 1-1-3. Installation options
5. Select the installation components, and click Next. In this example, check
the options to include Classification Module server, client, workbench, and
integration components.
Figure 1-1-4. Feature selection
6. Select the option to install the administration and data server on this
computer.
If you have already installed the IBM Classification Module server on another
server, you can connect to it on a remote computer.
Figure 1-1-5. Administration and data server installation
7. Accept the default ports for the administration and data servers.
You can use other port numbers here, but make sure that those port numbers are
not used by any other processes on this computer.
Figure 1-1-6. Port selection
8. Install the listener component to handle client requests on this computer.
Ensure that the port is not used by other processes in your environment.
Figure 1-1-7. Listener option
9. In this example, install the Classification Review Tool into New Tomcat.
There are four options for installing the Classification Review Tool. To learn more
about each installation option, refer to the Integration for IBM FileNet P8 User's
Guide.
Figure 1-1-8. Classification Review Tool installation type
10. If you choose to install Tomcat 5.0 as part of the IBM Classification
Module installation, enter the following information:
o Directory where Tomcat is installed
o Home directory for Java 1.4.2
o Admin user name and password
o Port of the Tomcat server
The installation wizard installs the Tomcat server first, and then deploys the
Classification Review Tool war file into it.
Figure 1-1-9. Tomcat servlet container installation information
11. Review the installation summary information, and click Install.
12. After the IBM Classification Module components have been installed, you
are prompted to reboot the machine. You should do this before running the IBM
Classification Module server.
13. After the install completes, you should see a directory structure similar to
the one shown in Figure 1-1-10. Check the install_log.txt file for detailed
information on the installation.
Figure 1-1-10. IBM Classification Module directory structure
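Steps 7 and 8 ask you to confirm that the chosen ports are not used by other processes. One way to check this before running the wizard is a short Python sketch such as the following. It is not part of the IBM installer; the function name is hypothetical, and the check covers only the loopback address by default.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
    return True
```

A free port at check time can still be taken before the installer binds it, so treat the result as a hint rather than a guarantee.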
Task 2. Configure IBM FileNet P8 parameters and connectivity

You must perform the following configuration steps to prepare your IBM FileNet P8 object store
and enable communications before you classify documents:
Add IBM FileNet P8 properties (metadata).

Add properties to IBM FileNet P8 document class.
Add IBM FileNet P8 folders and document classes.
Configure connectivity between IBM Classification Module and IBM FileNet P8.
To add IBM FileNet P8 metadata:
1. Start the IBM FileNet Enterprise Manager.
For detailed instructions on starting and using the IBM FileNet Enterprise
Manager, see the IBM FileNet P8 documentation.
2. Create a Classification Module AddOn resource by importing the AddOn
file (F:\IBM\ICM\ECMTools\icm_prps_addon.xml) that contains Classification
Module properties into IBM FileNet P8.
Figure 1-2-1. Create new AddOn
3. Install the Classification Module AddOn.
a. In the IBM FileNet Enterprise Manager, right-click the object store
that you are using with Classification Module and select All tasks > Install
AddOn.
Figure 1-2-2. AddOn installation
b. Select the newly created AddOn, ICM AddOn, and click Install.
Figure 1-2-3. AddOn installation cont.
To add properties to IBM FileNet P8 document class:
1. In the IBM FileNet Enterprise Manager, select the object store you are
working with, right-click Document Class, and then select Properties >
Properties Definitions > Add.
2. Select all properties beginning with "ICM_" and add them to the base
class.
Figure 1-2-4. Add properties to IBM FileNet P8 document class
To add IBM FileNet P8 folders and document classes:
Important: The folder and document class names that you define here in the IBM FileNet P8 object
store must match the names specified in the ECM Classification tools (that is, Content Extractor,
Classifier, and Classification Review Tool). If you change a default name, be sure to update it in
both locations.
1. Create the folders in IBM FileNet P8. Use the default names listed below
or specify your own.
Figure 1-2-5. Folders for IBM FileNet P8
2. Create the document classes in IBM FileNet P8. Use the default names
listed below or specify your own.
The document classes are required only if you plan to classify documents by
document class.
Figure 1-2-6. Document classes in IBM FileNet P8
To configure connectivity between IBM Classification Module and IBM FileNet P8:
Configuring connectivity to IBM FileNet P8 depends on factors such as the
IBM FileNet P8 version, Java Web application, and client-server communication.
Configuring connectivity might be as simple as specifying the IBM FileNet P8 IP
address in the WcmApiConfig.properties file in F:\IBM\ICM\ECMTools\conf. Here
is an example of what this file would look like. Substitute the IP address and port
number of your FileNet P8 server in the first three lines. Consult your FileNet P8
administrator if you are unsure what the correct URL is.
RemoteServerUrl = http://9.148.198.23:8008/ApplicationEngine/xcmisasoap.dll
RemoteServerUploadUrl = http://9.148.198.23:8008/ApplicationEngine/doccontent.dll
RemoteServerDownloadUrl = http://9.148.198.23:8008/ApplicationEngine/doccontent.dll
CredentialsProtection = Clear
CryptoKeyFile = C:\\Program Files\\FileNet\\Authentication\\CryptoKeyFile.properties
CredentialsProtection/UserToken = Symmetric
CryptoKeyFile/UserToken = C:\\Program Files\\FileNet\\Authentication\\UTCryptoKeyFile.properties
Configuring connectivity might also involve additional configuration files and
communication library files (JARs).
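Before starting the tools, it can help to sanity-check the edited WcmApiConfig.properties file. The following Python sketch parses simple key = value lines and verifies that the three remote server URLs are HTTP URLs on the same host and port. The helper names are hypothetical, the parser ignores Java properties escaping rules, and the same-host check is an assumption drawn from the instruction to substitute one server's IP address and port in the first three lines.

```python
from urllib.parse import urlparse

URL_KEYS = ("RemoteServerUrl", "RemoteServerUploadUrl", "RemoteServerDownloadUrl")


def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines; blank lines and # or ! comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_remote_urls(props: dict) -> list:
    """Return a list of problems found in the three remote server URLs."""
    problems = []
    endpoints = set()
    for key in URL_KEYS:
        parsed = urlparse(props.get(key, ""))
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"{key} is missing or not an HTTP URL")
        else:
            endpoints.add((parsed.hostname, parsed.port))
    if len(endpoints) > 1:
        problems.append("the three URLs point at different hosts or ports")
    return problems
```

An empty result means the file passes these basic checks; it does not prove that the FileNet P8 server is reachable.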
Phase 2. Train the IBM Classification Module System with IBM FileNet P8 content
This section uses step-by-step screen shots to illustrate how to train an IBM Classification
Module system so that it learns how to classify content based on the sample content extracted
from IBM FileNet P8. Training consists of the three tasks below:
Extract content from FileNet P8
Create and analyze a knowledge base
Configure IBM Classification Module: KB and dictionary
Task 1. Extract content from FileNet P8

You use the Content Extractor tool to extract a set of sample documents from the FileNet P8 object
store.
Before you begin, ensure that IBM FileNet P8 is running.
1. Configure the Content Extractor by editing the Extractor.properties file. You
must set these properties before you run the Content Extractor.
2. Locate and edit the Extractor.properties file in the
F:\IBM\ICM\ECMTools\conf directory.
You can use the default values for most of the properties, but you might want to
change the following properties:
o logFile: The Content Extractor log file
logFile = logs/icm_p8_extractor.log
o XmlDirectory: The directory where XML output by the Content
Extractor is written. This directory must exist and be empty prior to the Content
Extractor run.
XmlDirectory = extractorOutput
Which documents to extract?
The criteria below are AND-ed together, that is, path AND doc class AND Date. If
you want the OR logic, do separate extraction runs.
o Path_1: From a specific FileNet P8 folder
Path_1 = QA/Demo
o IgnorePath_1: Exclude a specific FileNet P8 folder
IgnorePath_1 = QA/Demo/Accounting
o With_1: By document property
With_1 = DocumentClass=Technote
o Without_1: Excluded by document property
Without_1 = DocumentClass=TopSecret
o Date: Documents that were modified since a date
Date = 13-Jul-2007
How many documents to extract?
o FolderFraction: Fraction of eligible documents to extract from each folder
FolderFraction = 0.2
o FileMax: Maximum number of bytes to extract from a document
FileMax = 600000
3. From a command prompt, go to the ECMTools directory in the IBM
Classification Module installation path.
cd F:\IBM\ICM\ECMTools
4. Run the Content Extractor with the following command:
Extractor -u Administrator -p ihpdep -f conf\extractor.properties -v
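The way the extraction criteria combine can be modelled in a few lines of Python. This is a sketch of the AND logic described in step 2, not the Content Extractor's implementation; the function, the document dictionary layout, the inclusion of subfolders, and the inclusive date comparison are all assumptions.

```python
from datetime import date


def is_eligible(doc: dict, path: str, ignore_path: str, with_prop: tuple,
                without_prop: tuple, modified_since: date) -> bool:
    """Model of the AND-ed criteria: every condition must hold for a document to qualify."""
    in_path = doc["folder"] == path or doc["folder"].startswith(path + "/")
    ignored = doc["folder"] == ignore_path or doc["folder"].startswith(ignore_path + "/")
    has_prop = doc["properties"].get(with_prop[0]) == with_prop[1]
    lacks_prop = doc["properties"].get(without_prop[0]) != without_prop[1]
    recent = doc["modified"] >= modified_since
    return in_path and not ignored and has_prop and lacks_prop and recent


# Values mirror the example property settings shown in step 2.
CRITERIA = dict(
    path="QA/Demo",
    ignore_path="QA/Demo/Accounting",
    with_prop=("DocumentClass", "Technote"),
    without_prop=("DocumentClass", "TopSecret"),
    modified_since=date(2007, 7, 13),
)
```

Because all conditions are AND-ed, selecting documents from two unrelated folders requires two runs with different Path_1 values, as the article notes.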
Task 2. Create and analyze a KB

You use IBM Classification Workbench to create and analyze a KB using content extracted from
the FileNet P8 object store.
1. Launch IBM Classification Workbench by navigating to Start > Programs
> IBM Classification Module 8.5 > Classification Workbench.
2. On the Project Explorer that appears, click New.
Figure 2-2-1. Project Explorer
3. On the New Project screen, enter a project name, FocusPlusProject, and
click Next.
Figure 2-2-2. Add new project name
4. On the Import Corpus screen, under External formats, select XML
(eXtensible Markup Language), and click Next.
Figure 2-2-3. Import a corpus
5. Browse to the folder containing the XML files generated by the Content
Extractor. The folder path is the one defined by the Content Extractor
property, XmlDirectory. Click Next.
Figure 2-2-4. Import corpus location
6. Select the check box Scan XML data files for corpus fields before
importing the corpus, and click Finish.
Figure 2-2-5. XML import method
7. After importing the XML files, from the menu, select View > View Project
Details to display the project detail panel on the right side if it is not already displayed.
Figure 2-2-6. Display the project detail panel
8. On the Project Details Fields tab, right-click the Document Title field, and
select Edit Field. In the Corpus Field Properties window, set Type to string and
NLP Usage to Plain Text, and click OK.
Figure 2-2-7. Edit document title properties
9. Perform the same operations as above to edit the ICM_CONTENT field by
setting its Type to string and NLP Usage to Plain Text.
Figure 2-2-8. Edit ICM_CONTENT properties
10. On the Project Details Fields tab, right-click the ICM_folders field, and
select Use as Categories. Click OK in the pop-up message window.
Figure 2-2-9. Define ICM_folders field
11. On the Project Details Categories tab, review the list of newly created
categories.
Figure 2-2-10. List of newly created categories
12. Create and analyze a new KB:
a. Click KB Wizard on the toolbar to launch the Create, Analyze, and
Learn Wizard.
b. Click Next to display the first options window.
c. Select Create and analyze KB using active view, and
choose Create using even, analyze using odd.
d. Click Next to display the next options window, and keep the default
values.
e. Click Next to display the Match Fields window. By default, the Add
Match Field box is checked and the Number of matches to display field is set
to 5. Keep the default values in this example.
f. Click Finish to continue. The Status Information screen lets you
view the Create and Analyze processes as they progress.
g. When processing is complete, click Close.
h. View the results of the knowledge base analysis in the workspace.
i. Note the matching values. There are five matching columns that
show potential matching categories for every other item.
j. Note that the Classification Workbench has analyzed only the
odd-numbered items.
13. (Optional) Click Reports on the toolbar to open the View Reports window.
Check the following reports and graphs:
o Cumulative success
o KB data sheet
o Cumulative success graph
o Total precision vs. recall graph
Click OK to run the reports. View the reports that you have generated. These
reports provide a summary of information about the categories defined in your
KB. For more information on how to analyze and fine-tune a KB, refer
to Classification Workbench User's Guide.
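The "create using even, analyze using odd" option and the precision vs. recall report rest on two simple ideas: holding out half of the corpus, and comparing predicted categories with the known ones. The Python sketch below illustrates both. It is not Classification Workbench code; the function names are hypothetical, and numbering items from 1 is an assumption.

```python
def split_even_odd(items):
    """Even-numbered items build the KB; odd-numbered items are held out for analysis.

    Items are numbered from 1 here, which is an assumption about the Workbench.
    """
    create = [item for n, item in enumerate(items, start=1) if n % 2 == 0]
    analyze = [item for n, item in enumerate(items, start=1) if n % 2 == 1]
    return create, analyze


def precision_recall(expected, predicted, category):
    """Precision and recall for one category over paired expected/predicted labels."""
    true_pos = sum(1 for e, p in zip(expected, predicted) if e == category and p == category)
    predicted_pos = sum(1 for p in predicted if p == category)
    actual_pos = sum(1 for e in expected if e == category)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall
```

Analyzing only items that were not used to build the KB gives a more honest picture of how the KB will behave on new documents.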
Task 3. Configure IBM Classification Module: KB and dictionary

You register the newly created KB with the classification engine through the IBM Classification
Manager.
1. Launch the IBM Classification Manager by navigating to Start > Programs
> IBM Classification Module 8.5 > Classification Manager.
2. Enter the server listener URL, the same one that you defined during the
installation.
Figure 2-3-1. Server listener URL
3. In the console tree, select Knowledge Bases, and then click Add on the
toolbar to add a new KB.
Figure 2-3-2. Add a new knowledge base
4. In the Add Knowledge Base window, define the fields below, and click OK:
o Knowledge base name: FocusPlus
o Select Import statistics from file.
o Browse to the FocusPlusProject.kb file created by IBM
Classification Workbench. The kb file is located in the project directory
F:\IBM\ICM\Workbench\Classification Workbench\Projects_Unicode\FocusPlusProject.
o Select Access file from server.
o Select English as the supported language.
Figure 2-3-3. Define fields to create knowledge base
5. The IBM Classification Module receives text as a series of fields (also known as
name-value pairs). The dictionary defines the data type and the method of language
processing that is performed on each field. The IBM Classification Module uses the
dictionary entries to analyze and classify documents.
In the console tree, select Dictionary.
In IBM Classification Module 8.5, the Classification Module Dictionary provides
three predefined fields: Body, FileName, and Title, which are sufficient in this
example. If you need additional dictionary entries for other deployments, click
Add on the toolbar to add a new dictionary entry.
Figure 2-3-4. Classification Module Dictionary
6. If you add new dictionary entries, restart the KB so that it recognizes the
changes. In the console tree, select Knowledge Bases, and right-click the newly
added KB to restart it.
Figure 2-3-5. Restart knowledge base
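To make the idea of fields as name-value pairs concrete, the following sketch packages one document under the three predefined field names and checks for fields that have no dictionary entry. It is only an illustration: the helper names and the sample document are hypothetical, and the code does not call any IBM Classification Module API.

```python
# Field names taken from the predefined dictionary entries described above.
PREDEFINED_FIELDS = ("Body", "FileName", "Title")


def to_name_value_pairs(content, filename, title):
    """Package one document as name-value pairs (hypothetical helper)."""
    return {"Body": content, "FileName": filename, "Title": title}


def unknown_fields(pairs, dictionary=PREDEFINED_FIELDS):
    """Return the field names that have no matching dictionary entry."""
    return sorted(set(pairs) - set(dictionary))


# Made-up sample document for illustration only.
doc = to_name_value_pairs("Quarterly leave policy text", "leave_policy.doc", "Leave policy")
print(unknown_fields(doc))
# A field such as "Author" would first need its own dictionary entry.
print(unknown_fields({**doc, "Author": "J. Doe"}))
```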
Phase 3. Classify new content for IBM FileNet P8

This section illustrates how to use the Classifier to scan new documents in a specified directory
and automatically:
o Classify documents into the right folders in a FileNet P8 object store
o Reject documents that do not belong in a FileNet P8 object store
o Set aside a configured percentage of documents for manual review
Before you begin, ensure that the following conditions are met:
o The IBM Classification Module server and KB instance are running. You
can verify this through the Classification Manager.
o IBM FileNet P8 is running.
1. Launch the Classifier tool by navigating to Start > Programs > IBM
Classification Module 8.5 > Classifier, and log on with your IBM FileNet P8
user name and password. This FileNet P8 user ID must have read and write
permission on all folders in FileNet P8 that are required for classification.
2. On the General tab, set the following global properties:
Global FileNet P8 settings:
o FileNet P8 Object Store for all documents: QA
o Folder containing documents to be classified: ICM_Classifier_Input
o Folder to place documents after they are classified: ReviewToolInput
Figure 3-1. Define global FileNet P8 settings
Classification type:
Select the Folder check box.
Figure 3-2. Define classification type
Content to classify:
Select The file system with the directory F:\Test, where all the new documents
reside.
Figure 3-3. Define the content to classify
IBM Classification Module settings:
Set the IBM Classification Module listener URL to http://localhost:18087, which is
the same URL that you defined during installation.
Figure 3-4. Define the IBM Classification Module settings
3. On the Folder Classification tab, go to the Runtime Settings tab and set the
following properties:
Scoring:
o Classification Module knowledge base to use for scoring: FocusPlus
o Maximum number of folders to suggest: 5
o Don't suggest folders with scores less than: 0
Figure 3-5. Define scoring properties
Default folders:
o When a document doesn't match any folders, put it in: ICM_No_Matches
o When an error occurs while classifying a document, put it in: ICM_Classification_Error
Figure 3-6. Define default folder properties
4. On the Folder Classification tab, go to the Document Properties tab and set
the following properties:
Table 1. File system document properties
Document Property      Classification Module Field
Document Content       Body
Document Filename      FileName
Document Title         Title
Figure 3-7. Define the document properties
5. On the Folder Classification tab, go to the Auto-move tab and set the
following properties:
o Select the check box Allow Classifier to automatically move documents during
classification.
o Also select the check boxes Enable auto-classify and Enable auto-reject.
Figure 3-8. Define the auto-move properties
Auto-move thresholds:
o Select Use the same thresholds for all folders when determining whether to accept
or reject documents.
o Automatically classify documents when the score for the top folder is above: 60
o Automatically reject documents when the score for the top folder is below: 20
Figure 3-9. Define auto-move thresholds
Audit percentage:
o Percentage of auto-classified documents to be sent to the Classification Review
Tool for auditing: 50
o Percentage of auto-rejected documents to be sent to the Classification Review
Tool for auditing: 50
Note: You might want to set high audit percentages initially, and then lower them
as you gain confidence in the system's classification ability.
Figure 3-10. Define audit percentage
Auto-reject folder:
When a document is auto-rejected, put it in this folder: ICM_Auto_Rejected
Figure 3-11. Define the auto-reject folder
6. Click Save to save changes to the Classifier property file.
7. Click Start to classify documents. The Classifier scans all the files in the
input directory and does the following:
o Automatically classifies documents into the right folder in FileNet P8
o Automatically rejects documents that do not belong in FileNet P8
o Moves a configured percentage of documents into the Classification Review Tool
folder for manual review
Figure 3-12. Action buttons description
8. The monitoring area shows information such as how many documents were
auto-classified or auto-rejected, and their destination folders.
Figure 3-13. Monitoring area
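The scoring, threshold, and audit settings used in this phase can be read as a simple decision rule. The following sketch models that rule under stated assumptions; it is not the Classifier's actual implementation. The class and function names are hypothetical. It also assumes that documents scoring between the two thresholds go to manual review and that an audited document is sent to review instead of being moved automatically.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoMoveSettings:
    """Values from the example configuration; the class itself is hypothetical."""
    classify_above: float = 60.0
    reject_below: float = 20.0
    audit_classified_pct: float = 50.0
    audit_rejected_pct: float = 50.0


def suggest(scores, max_folders=5, min_score=0.0):
    """Return up to max_folders (folder, score) pairs, highest score first."""
    kept = [(folder, s) for folder, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:max_folders]


def route(top_score, settings, rng):
    """Return 'classify', 'reject', or 'review' for one document."""
    if top_score > settings.classify_above:
        decision, audit_pct = "classify", settings.audit_classified_pct
    elif top_score < settings.reject_below:
        decision, audit_pct = "reject", settings.audit_rejected_pct
    else:
        # Assumption: scores between the thresholds need a human decision.
        return "review"
    # rng.random() lies in [0, 1), so this samples roughly audit_pct percent.
    if rng.random() * 100 < audit_pct:
        return "review"
    return decision


settings = AutoMoveSettings()
rng = random.Random(0)  # seeded only to make the demonstration repeatable
for score in (85, 40, 10):
    print(score, route(score, settings, rng))
```

Starting with high audit percentages, as the note in step 5 suggests, sends more of the automatic decisions through the review branch.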
Phase 4. Review content classification

This section illustrates how to use the Classification Review Tool to manually confirm or correct
automatic classification.
Before you begin, ensure that the following conditions are met:
o The IBM Classification Module server and KB instance are running. You
can verify this through the Classification Manager.
o The Classification Review Tool Web application server is up and running.
In this example, start the Apache Tomcat Web application server.
o IBM FileNet P8 is running.

1. Configure the Classification Review Tool by editing the
ReviewTool.properties file either in the
F:\Apache\Tomcat5.0\webapps\ReviewTool\WEB-INF\classes\com\ibm\ICMP8\resources
directory, or in the folder specified in the ReviewTool.xml file in the
F:\Apache\Tomcat5.0\conf\Catalina\localhost directory.
You must set these properties correctly before you run the Classification Review
Tool. The location of the properties file depends on whether you are running the
Classification Review Tool with an Apache Tomcat installation or with IBM
WebSphere Application Server. You can use the default values for most of the
properties. If you need to update a property, detailed property descriptions
can be found in the Integration for IBM FileNet P8 User's Guide. In particular, the
Classification Review Tool uses some settings from the Classifier property file,
such as FileNet P8 folder locations.
2. reviewTool.classifier.propertyfile=${reviewTool.baseDir}/ECMTools/conf/Classifier.properties
3. Launch the Classification Review Tool by navigating to Start > Programs > IBM
Classification Module 8.5 > Classification Review Tool. In any environment, you can
also start the Classification Review Tool by entering the following Web address in
a browser:
http://Web_application_server_name or IP_address:port_number/ReviewTool/index.jsp
4. On the Sign in window, enter your IBM FileNet P8 user name and
password. This FileNet P8 user ID must have read and write permission on all
relevant FileNet P8 folders.
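The reviewTool.classifier.propertyfile value shown in the configuration steps refers to another property through a ${...} placeholder. The following sketch shows one way such a placeholder could be resolved. It is a simplified illustration and not the Review Tool's actual loader. The base directory value is hypothetical, and the parser ignores Java properties features such as line continuations and backslash escapes.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def resolve(value, props):
    """Replace each ${name} with props[name]; leave unknown names untouched."""
    return _PLACEHOLDER.sub(lambda m: props.get(m.group(1), m.group(0)), value)


# The base directory below is a made-up value for illustration only.
example = """
reviewTool.baseDir=F:/IBM/ICM
reviewTool.classifier.propertyfile=${reviewTool.baseDir}/ECMTools/conf/Classifier.properties
"""
props = parse_properties(example)
print(resolve(props["reviewTool.classifier.propertyfile"], props))
```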
Single document view
1. After you log in to the Classification Review Tool, you are in the single
document view by default. The documents in this view are taken from the
Classifier output folder ReviewToolInput. You can navigate and view documents
in the review queue one by one by clicking Next or Previous.
Figure 4-1. Classification Review Tool Single document view
2. After navigation, click Start Over to set the view back to the first
document.
3. For each document, the system explains why the document is in the review
queue. Click Show on the Additional Document Information bar.
Figure 4-2. Document information
4. Click the document type icon or the document name to review the
document content.
Figure 4-3. View document content
5. Review the folders in the Suggested and Selected lists for the given
document. The Suggested list displays the top classification folders that the IBM
Classification Module determined to be most appropriate for the document. The
Selected list displays the suggested folder with the highest relevancy score by
default.
Figure 4-4. Suggested and Selected folder lists
6. Depending on your evaluation of the document and the system's proposed
classification, you can accept the classification recommendation (by
clicking Apply Classification), delete or reclassify the document (by
clicking Delete Document or Send to Classifier), or send the document to the
Taxonomy Proposer for further processing (by clicking Mark as Unknown). In this
example, click the Apply Classification link to move the document to the
selected folder in the FileNet P8 object store. This action in turn provides feedback
that allows the system to classify more accurately in the future and to keep up
with the evolution of your taxonomy definition.
7. After the Apply Classification operation, the top of the browser displays the
second document in the review queue, and the number of documents in the
review queue has decreased from 78 to 77, along with the message, "The
document was successfully classified to folder(s) HR."
Figure 4-5. Successful document classification
Document list view
1. To take actions in bulk, use the document list view to process multiple
documents simultaneously.
2. Click View: Document list to switch to the document list view.
3. Review the folder that appears in the Move to column. This is the folder into
which the IBM Classification Module intends to classify the document.
4. Select the check boxes of the documents to which you want to apply a bulk
operation. Selecting the check box in the table title row selects all documents on
the page.
Figure 4-6. Classification Review Tool -- Document list view
5. After you select multiple documents, you can perform operations similar to
those in the single document view, except that there is no Send to Classifier
action in the document list view.
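The difference between the two views can be summarized as a set of allowed actions. The following sketch is a hypothetical model of the review queue, not the Review Tool's code. It assumes that every list-view action removes the processed documents from the queue, and the function and variable names are invented for illustration.

```python
SINGLE_VIEW_ACTIONS = frozenset(
    {"Apply Classification", "Delete Document", "Send to Classifier", "Mark as Unknown"}
)
# The document list view offers the same actions except Send to Classifier.
LIST_VIEW_ACTIONS = SINGLE_VIEW_ACTIONS - {"Send to Classifier"}


def apply_bulk(queue, selected, action):
    """Apply one list-view action to the selected documents.

    Returns the documents that remain in the queue, in their original order.
    """
    if action not in LIST_VIEW_ACTIONS:
        raise ValueError(f"{action!r} is not available in the document list view")
    selected = set(selected)
    return [doc for doc in queue if doc not in selected]


# Made-up document names for illustration only.
queue = ["memo.doc", "offer.pdf", "policy.doc"]
print(apply_bulk(queue, ["memo.doc", "policy.doc"], "Apply Classification"))
```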
Add new documents
1. You can add new documents one at a time to the IBM FileNet P8
repository, and accept or modify the classification that the IBM Classification
Module suggests for each document.
2. In the sidebar, click Add New Document.
3. Browse to the document that you want to add, and click Add.
Figure 4-7. Add document
4. The document information is displayed, and the document is classified.
Figure 4-8. Document successfully added
5. You can process this new document as you would process any document
in the review queue in the single document view.
Summary
This article introduced IBM Classification Module and IBM FileNet P8 integration steps in a
Windows environment. It first reviewed the integration architecture and workflow. It then used
step-by-step screen shots to illustrate how to use the IBM ECM Classification tools along with
the IBM Classification Module server component and the Classification Workbench to automate
the content classification in the integrated environment.
In summary, while ingesting new content into IBM FileNet P8, a typical workflow to automate
content classification in the integrated environment is:
o Train the IBM Classification Module system with IBM FileNet P8 content by
using the Content Extractor to extract sample content from a FileNet P8
repository, creating and analyzing a knowledge base in the Classification
Workbench, and registering the newly created KB with the classification engine
through the Classification Manager.
You can rapidly train the IBM Classification Module based on the categories
already created in your IBM FileNet P8 repository, as described above. If
you don't have a set of categories ready to go, the IBM Classification Module's
Taxonomy Proposer can divide an existing set of content into logical groupings,
recommending a new set of categories and corresponding names. You can learn
more about the Taxonomy Proposer by referring to the Integration for IBM FileNet
P8 User's Guide.
o Classify new content for IBM FileNet P8 by using the Classifier tool to
classify documents into the right folders or document classes in FileNet P8, reject
documents with low relevancy scores, and set aside a configured percentage of
documents for manual review or audit.
o Review content classification to ensure classification accuracy by using
the Classification Review Tool to manually confirm or correct automatic
classification.
Resources
Learn
IBM Classification Module Support Web site: Get the links to the latest IBM
Classification Module publications (product manuals, white papers, and
technotes), fix packs, and many other resources.
Learn more about the products described in this article from the following
publications:
o IBM Classification Module 8.5 Administrator's Guide
o IBM Classification Module 8.5 Integration for IBM FileNet P8 User's Guide
o IBM Classification Module 8.5 Classification Workbench User's Guide
IBM Classification Module V8.5 Information Center: Access topics in all
IBM Classification Module V8.5 manuals.
Requirements for IBM Classification Module Version 8.5: Get detailed
software requirements for IBM Classification Module, Version 8.5 and each
subsequent fix pack.
"Leverage taxonomies for enterprise search using IBM OmniFind, IBM
Classification Module, and SchemaLogic" (developerWorks, February 2007): A
guide to tapping the power of taxonomies for integrated search solutions using
IBM Classification Module and SchemaLogic applications.
Architecture area on developerWorks: Get the resources you need to
advance your skills in the architecture arena.
Browse the technology bookstore for books on these and other technical
topics.
developerWorks Information Management zone: Learn more about DB2.
Find technical documentation, how-to articles, education, downloads, product
information, and more.
Stay current with developerWorks technical events and webcasts.
Get products and technologies
Build your next development project with IBM trial software, available for
download directly from developerWorks.
Discuss
Check out developerWorks blogs and get involved in the developerWorks community.
