The business scenarios describe how a fictitious company called ExampleCo. Enterprises solves typical
problems with unorganized content by using IBM Content Classification to automatically classify documents
and email.
ExampleCo. Enterprises uses Content Classification to ensure that document retention and disposition policies
are enforced, unclassified email is automatically and intelligently archived, and storage and legal review costs
are reduced by filtering out data that is irrelevant to a pending legal case.
To classify content as it is added to a repository, ExampleCo. Enterprises installs IBM FileNet Content
Manager or IBM Content Manager to store and archive documents and emails, IBM Content Collector to
capture file system and email content, and IBM Content Classification to classify the content when it is
collected.
To declare certain documents as records so that they can be managed in accordance with record retention and
compliance requirements, ExampleCo. Enterprises also installs IBM Enterprise Records (formerly FileNet
Records Manager) to manage the documents that IBM Content Classification automatically declares as records.
Business scenario: Organizing content in an ECM system to reduce future compliance costs
ExampleCo. Enterprises, a fictitious company, must classify content in an enterprise content
management system and ensure that document retention and disposition policies are enforced.
Related concepts:
Integration with IBM Content Manager
Integration with IBM FileNet Content Manager and IBM Enterprise Records
Integration with IBM Content Collector
Version 8.8.0
ExampleCo. Enterprises, a fictitious company, must classify content in an enterprise content management
system and ensure that document retention and disposition policies are enforced.
The ExampleCo. Enterprises corporate repository is built on IBM FileNet Content Manager. Currently,
content in the repository is not well organized, and it does not comply with records management policies.
To address this problem, Bob, an IT specialist, wants to ensure that all of the documents in the IBM FileNet
Content Manager content store are organized into a consistent set of folders and document classes. Anne, a
business analyst, wants to ensure that data in the repository is organized according to a corporate taxonomy.
To achieve this goal, Anne defines a new corporate taxonomy for assigning document properties and
classifying content into folders and document classes. She also defines records management policies (rules
for the retention and disposition of documents).
Bob is tasked with organizing content in the company repository by using the new taxonomy and records
management policies. He will work closely with Anne to reclassify content that already exists in the repository
by applying this new taxonomy. Bob and Anne must work together to ensure that important documents are
declared as records so that they can be managed according to the records management policies.
To intelligently automate this task, Bob decides to use IBM Content Classification. Bob is familiar with IBM
FileNet Content Manager but is new to using IBM Content Classification. He needs to configure a set of rules to
match the corporate taxonomy that Anne provides to him. The rules need to be easy to configure but powerful
enough to classify documents on the basis of both metadata (document properties) and content.
During the reclassification phase, information about the documents might need to be updated, such as the
document properties and the target folder or document class. Some documents might need to be declared as
records. With IBM Content Classification, Bob can use a single interface to configure the needed classification
rules. After he extracts sample content from IBM FileNet Content Manager for training and testing purposes, he
can configure rules that are based on keywords in the content. He can then test the rules with some of the
extracted content to see how the system determines the classifications. If necessary, Bob can fine-tune the
rules to ensure that the correct classification actions will be applied before he classifies the remaining content
in the repository.
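A rule that looks at both metadata and content, as described above, can be sketched in a few lines of Python. This is only an illustration of the idea, not the IBM Content Classification rule syntax or API. The `Document` type, the `Department` property, the keywords, and the `Contracts` folder are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a repository document."""
    properties: dict = field(default_factory=dict)  # metadata, e.g. {"Department": "Legal"}
    content: str = ""                               # extracted full text


def matches_rule(doc, required_properties, keywords):
    """Return True when every required property matches and any keyword occurs in the content."""
    metadata_ok = all(doc.properties.get(k) == v for k, v in required_properties.items())
    text = doc.content.lower()
    content_ok = any(kw.lower() in text for kw in keywords)
    return metadata_ok and content_ok


def classify(doc):
    """Assign a target folder; the rule and the folder names are illustrative assumptions."""
    if matches_rule(doc, {"Department": "Legal"}, ["agreement", "contract"]):
        return "Contracts"
    return "Unclassified"
```

Testing such a rule against a handful of extracted sample documents, as Bob does, shows quickly whether the metadata condition or the keyword list is too narrow or too broad.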
Bob initially starts with a relevant subset of the data, which consists of approximately 100 million documents.
He expects to expand this subset to include a larger portion of the overall repository, from 300 million to 400
million documents. As the system usage expands, he expects that the combination of new documents,
Microsoft SharePoint files, and other content sources might add between 200 and 300 new documents daily.
Anne plans to use IBM Content Classification to review documents. If Anne disagrees with a classification
decision, she can reclassify a document by applying different classification criteria. By reviewing documents,
and either confirming the classification decision or reclassifying the content, Anne helps train the system and
improve accuracy over time.
Parent topic: Business scenarios: Automatic classification of documents and email
ExampleCo. Enterprises, a fictitious company, might be targeted in a lawsuit. The company needs to ensure
that potentially relevant documents are placed under the control of a records management system.
Bob, an IT administrator for ExampleCo. Enterprises, is tasked with adding a large number of documents from
multiple file systems to the corporate data repository. To prepare for a potential legal dispute, the company
needs to declare the potentially relevant documents as records and assign appropriate retention and
disposition rules to each document. The documents and records must be stored in particular folders and
document classes in the case vault repository so that they are available for legal review. After the relevant
documents are declared as records and stored in the appropriate folders, the legal team will review the
documents by using IBM eDiscovery Manager and IBM eDiscovery Analyzer.
Bob selects IBM Enterprise Records for records management. Bob decides to use IBM Content
Collector with IBM Content Classification to automatically and intelligently classify documents and email and
declare them as records according to the company's records management policies.
To control storage and legal review costs, Bob needs to filter out irrelevant data such as company bulletins,
newsletters, personal email, and personal documents that have no relevance to the pending legal case. Bob
will work closely with Anne, a business analyst who has expertise in the company's knowledge management
hierarchy, to define the rules for determining which documents are potentially relevant and need to be retained.
Bob is already familiar with IBM Content Collector and has used it for similar purposes in the past, but it has
typically collected too much content, which increases legal review costs and time. He plans to work with Anne
to identify a set of representative documents that are pertinent to the case to use as a training set.
Working together, they configure classification rules in IBM Content Classification that are based on a list of
keywords provided to them by the legal team. The rules specify that the documents are to be declared as
records in IBM Enterprise Records and identify which file plan is to be used to manage the records.
Although a typical case might have approximately 50,000 to 200,000 potentially relevant documents, the
documents must be identified across departmental and enterprise repositories that can hold hundreds of
millions of documents. It is critical that Bob and Anne understand how IBM Content Classification filters
different documents and email so that they can ensure that all content that might be relevant is captured while
everything else is omitted. After classifying the training set, Bob and Anne can review how the decisions were
applied and adjust the rules as needed.
To ensure that content is classified when it is captured, Bob sets up a task route for IBM Content
Classification in IBM Content Collector. After the system is in production, Bob expects that an additional 1000
potentially relevant documents might be identified among newly collected documents and email each week.
Anne plans to use IBM Content Classification to review the classification decisions. She can help train the
system by reclassifying content, and she can work with Bob to fine-tune the rules. For example, if irrelevant
documents are classified into the case vault because of the occurrence of some keyword, she might
recommend that a rule be changed so that documents are classified only when the keyword occurs in proximity
to another keyword.
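The proximity refinement that Anne might recommend can be illustrated with a short Python sketch. It is not the product's rule language. The function name, the word-based distance measure, and the default distance of 10 words are assumptions made for this example.

```python
import re


def keywords_in_proximity(text, first, second, max_distance=10):
    """Return True if `first` and `second` occur within `max_distance` words of each other."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Compare every occurrence of one keyword with every occurrence of the other.
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)
```

A document that mentions one keyword in a signature block and the other on an unrelated page would match a plain keyword rule but not this one, which is the kind of false positive the scenario describes.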
Parent topic: Business scenarios: Automatic classification of documents and email
Related concepts:
Business scenario: Organizing content in an ECM system to reduce future compliance costs
Business scenario: Filtering and organizing email
Integration with IBM Content Collector
Integration with IBM FileNet Content Manager and IBM Enterprise Records
Building a knowledge base
Building a decision plan
Classification review
eDiscovery Manager
Related tasks:
Creating rules
Reclassifying documents
ExampleCo. Enterprises, a fictitious company, must classify and archive a backlog of unclassified email and set
up a system to regularly classify all new email.
ExampleCo. Enterprises is a European company that recently acquired a company based in the United States.
Bob, the IT administrator for the newly acquired company, is tasked with creating and maintaining the corporate
email archive.
To control ongoing storage costs and potential legal discovery costs, Bob must manage the email archive to
avoid adding irrelevant data. Additionally, because the two companies previously maintained separate email
systems, Bob must archive content from Lotus Domino and Microsoft Exchange. Bob needs to ensure that
the entire message content is archived, including attachments.
To satisfy all of these requirements, Bob selected IBM Content Collector and IBM Content Classification to
automatically and intelligently create and maintain the email archive. Bob needs to filter out irrelevant data,
such as company bulletins, newsletters, and personal email that has no business value (for example, notes that
discuss the outcome of a local team's sporting event).
Bob creates an IBM Content Classification decision plan to define a set of rules for automatically classifying
email and assigning the correct category value for the item in IBM Content Manager, such as Contracts,
Claims, or Human Resources. The rules need to filter out irrelevant email before it is archived and
automatically detect sensitive information in an email, such as social security numbers and credit card
information.
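Detection of sensitive information of this kind is often approximated with pattern matching. The following Python sketch shows one possible approach and makes no claim about how IBM Content Classification implements it. The patterns assume US-style social security numbers (NNN-NN-NNNN) and card numbers of 13 to 19 digits, and the Luhn checksum is used to discard digit runs that cannot be valid card numbers.

```python
import re

# Assumed formats: US SSN written as 123-45-6789; card digits optionally
# separated by single spaces or hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:      # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_sensitive(text):
    """Return a list of the kinds of sensitive data found in the text."""
    findings = []
    if SSN_PATTERN.search(text):
        findings.append("ssn")
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            findings.append("credit_card")
            break
    return findings
```

Pattern matching alone produces false positives (for example, order numbers that happen to pass the checksum), so a flag like this is better treated as a reason for review than as a final decision.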
Bob creates an IBM Content Classification knowledge base and then trains the classification system by using a
small set of user mailboxes to serve as a set of representative documents. After the system is trained, Bob
builds the archive by classifying all email that was transmitted over the past year. The initial archive contains
approximately 150 million emails.
After the initial archives are built, Bob configures a task route in IBM Content Collector that regularly archives
email and uses IBM Content Classification to classify the email first. Currently, an additional 250,000 to
500,000 emails are expected to be processed per day. The size of individual messages might range from 20
KB to as large as 100 MB, depending on the length of the email and the number of documents that are
attached.
Anne is a business analyst who is highly familiar with the company's information repositories, the enterprise
taxonomy, and the data flow. Anne wants to use IBM Content Classification to review some documents to
ensure that the correct classification decisions are applied. It is critical that Bob and Anne understand how IBM
Content Classification filters email so that they can ensure that the right email is archived and that the email is
assigned to the correct categories.
Parent topic: Business scenarios: Automatic classification of documents and email
Related concepts:
Business scenario: Organizing content in an ECM system to reduce future compliance costs
Introduction
In ECM, taxonomies ensure that content is accurately cataloged and easily accessible. Having
consistent and reliable access to unstructured content is the foundation to realizing the business
benefits of ECM, and all subsequent content-centric enterprise applications will realize their
return on investment (ROI) by leveraging this essential capability.
IBM Classification Module automates the process of categorizing your content by reading and
analyzing the full text of each document. It analyzes the entire document, discerns the topics in
the text, and then assigns it to a proper category. And it won't be fooled by noise in your
content, such as misspellings, abbreviations, shorthand, or technical terms. Moreover, the
IBM Classification Module adapts to the unique nature of your business by learning to identify
different categories from examples you provide to it. And as you provide feedback on its
performance, it adjusts in real time, immediately taking into account corrections you have made.
In this way, its accuracy keeps pace with your business, rapidly adjusting to changes as they
occur.
The IBM Classification Module integration for IBM FileNet P8 can be deployed on Windows,
AIX, Solaris, and Linux environments. The example in this article is specific to the integration
on a Microsoft Windows platform. However, the key concepts and information provided are
relevant to any platform.
managed content. It can also reclassify and move existing P8 documents to the correct folders
or document classes. Unlike other classification systems that are based on rules only, the
Classification Module is based on a combination of text analysis and rules, and incorporates
real-time learning that adapts to changing business needs and becomes more accurate over
time.
Figure 1. Integration architecture overview
As shown in Figure 1, at the core of this solution is the IBM Classification Module product, which
has been on the market for many years and has proven to be scalable and reliable in
demanding IT environments. It consists of three core components as follows:
Classification server
Embedded with natural language processing capabilities, the Classification server
classifies free-form text by using its Relationship Modeling Engine and a
predefined knowledge base (KB) to make decisions.
Classification Workbench
Classification Workbench allows you to create and analyze a knowledge base,
evaluate the KB's performance using reports and graphical diagnostics, and work
with a collection of texts or messages known as a corpus for analysis, training,
and learning.
In addition, the Taxonomy Proposer is a new tool shipped with Classification
Workbench 8.5. It can assist users in creating a taxonomy starting from scratch
or from a partial one, where it uses custom clustering algorithms to analyze and
group similar documents together.
On top of the core product is the newly added integration asset for providing a taxonomy
automation solution to IBM FileNet P8. The integration for IBM FileNet P8 includes the following
components:
Classifier
Automatically classifies and filters out documents, and sets aside a configured
percentage of documents for audit and manual review.
Content Extractor
A command-line tool that extracts sample content from the IBM FileNet P8
repository to train a KB and enable automatic classification.
2.
3.
4.
5. Select the installation components, and click Next. In this example, check
the options to include Classification Module server, client, workbench, and
integration components.
6. Select the option to install the administration and data server on this
computer.
If you have already installed the IBM Classification Module server on another
server, you can connect to it on a remote computer.
Figure 1-1-5. Administration and data server installation
7. Accept the default ports for the administration and data servers.
You can use other port numbers here, but make sure that those port numbers are
not used by any other processes on this computer.
Figure 1-1-6. Port selection
8.
9. In this example, install the Classification Review Tool into New Tomcat.
There are four options for installing the Classification Review Tool. To learn more
about each installation option, refer to the Integration for IBM FileNet P8 User's
Guide.
Figure 1-1-8. Classification Review Tool installation type
10. When you wish to install Tomcat 5.0 as part of the IBM Classification
Module installation, enter the following information:
o Directory where Tomcat is installed
o Home directory for Java 1.4.2
o Admin user name and password
o Port of the Tomcat server
The installation wizard installs the Tomcat server first, and then deploys the
Classification Review Tool war file into it.
Figure 1-1-9. Tomcat servlet container installation information
11. Review the installation summary information, and click Install.
12. After the IBM Classification Module components have been installed, you
are prompted to reboot the machine. You should do this before running the IBM
Classification Module server.
13. After the install completes, you should see a directory structure similar to
the one shown in Figure 1-1-10. Check the install_log.txt file for detailed
information on the installation.
Figure 1-1-10. IBM Classification Module directory structure
1.
3.
b. Select the newly created AddOn, ICM AddOn, and click Install.
Figure 1-2-3. AddOn installation cont.
1. In the IBM FileNet Enterprise Manager, select the object store you are
working with, right-click Document Class, and then select Properties >
Properties Definitions > Add.
2. Select all properties beginning with "ICM_" and add them to the base
class.
Figure 1-2-4. Add properties to IBM FileNet P8 document class
1. Create the folders in IBM FileNet P8. Use the default names listed below
or specify your own.
Figure 1-2-5. Folders for IBM FileNet P8
2. Create the document classes in IBM FileNet P8. Use the default names
listed below or specify your own.
The document classes are required only if you plan to classify documents by
document class.
Figure 1-2-6. Document classes in IBM FileNet P8
To configure connectivity between IBM Classification Module and IBM FileNet P8:
RemoteServerUrl =http://9.148.198.23:8008/ApplicationEngine/xcmisasoap.dll
RemoteServerUploadUrl =http://9.148.198.23:8008/ApplicationEngine/doccontent.dll
RemoteServerDownloadUrl =http://9.148.198.23:8008/ApplicationEngine/doccontent.dll
CredentialsProtection =Clear
CredentialsProtection/UserToken = Symmetric
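Before running the tools against settings like these, it can help to check that the URL values are well formed. The Python sketch below is an illustrative helper, not part of the product. It assumes a plain `key = value` file layout with optional `#` comment lines and assumes that any setting whose name ends in `Url` must hold an absolute HTTP or HTTPS URL.

```python
from urllib.parse import urlparse


def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)   # split on the first '=' only
        settings[key.strip()] = value.strip()
    return settings


def invalid_urls(settings):
    """Return the names of *Url settings whose value is not an absolute http(s) URL."""
    bad = []
    for key, value in settings.items():
        if key.endswith("Url"):
            parsed = urlparse(value)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                bad.append(key)
    return bad
```

A check like this only confirms that the values are syntactically valid. It does not confirm that the Application Engine is reachable at the host and port given.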
1.
IgnorePath_1 = QA/Demo/Accounting
With_1 = DocumentClass=Technote
Without_1 = DocumentClass=TopSecret
Date = 13-Jul-2007
o
folder
FolderFraction = 0.2
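The effect of settings such as IgnorePath_1 and FolderFraction can be pictured with a small Python sketch. The semantics here are assumptions made for illustration: that IgnorePath_1 excludes a folder path and everything below it, and that FolderFraction = 0.2 means roughly 20% of the documents in each remaining folder are extracted as samples. The With, Without, and Date filters are not modeled, and the function is hypothetical rather than part of the Content Extractor.

```python
import random


def select_samples(documents, ignore_path, folder_fraction, seed=0):
    """Pick sample documents per folder.

    `documents` is a list of (folder_path, document_name) tuples.
    Folders under `ignore_path` are skipped (assumed meaning of IgnorePath_1);
    about `folder_fraction` of each remaining folder is kept (assumed meaning
    of FolderFraction), with at least one document per folder.
    """
    rng = random.Random(seed)
    by_folder = {}
    for folder, name in documents:
        if folder.startswith(ignore_path):
            continue
        by_folder.setdefault(folder, []).append(name)
    selected = {}
    for folder, names in by_folder.items():
        count = min(len(names), max(1, round(len(names) * folder_fraction)))
        selected[folder] = rng.sample(names, count)
    return selected
```

Sampling per folder rather than across the whole repository keeps small folders represented in the training set, which matters when each folder becomes a category.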
3.
4.
1.
3.
4.
5. Browse to the folder containing the XML files generated by the Content
Extractor. The folder path is the one defined by the Content Extractor
property XmlDirectory. Click Next.
Figure 2-2-4. Import corpus location
6. Select the check box "Scan XML data files for corpus fields before
importing the corpus", and click Finish.
Figure 2-2-5. XML import method
7. After importing the XML files, select View > View Project Details from the
menu to display the project detail panel on the right side if it is not already displayed.
Figure 2-2-6. Display the project detail panel
8. On the Project Details Fields tab, right-click the Document Title field, and
select Edit Field. In the Corpus Field Properties window, set Type to string and
NLP Usage to Plain Text, and click OK.
Figure 2-2-7. Edit document title properties
9.
10. On the Project Details Fields tab, right-click the ICM_folders field, and
select Use as Categories. Click OK in the pop-up message window.
Figure 2-2-9. Define ICM_folders field
11. On the Project Details Categories tab, review the list of newly created
categories.
Figure 2-2-10. List of newly created categories
12.
1.
13. (Optional) Click Reports on the toolbar to open the View Reports window.
Check the following reports and graphs:
1. Cumulative success
2. KB data sheet
3. Cumulative success graph
4. Total precision vs. recall graph
Click OK to run the reports. View the reports that you have generated. These
reports provide a summary of information about the categories defined in your
KB. For more information on how to analyze and fine-tune a KB, refer
to the Classification Workbench User's Guide.
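For readers unfamiliar with the terms in the precision vs. recall graph, the following Python sketch computes both measures for one category by using the standard definitions. It is a generic illustration. The Workbench reports may aggregate or weight the figures differently, and the function and the label lists are made up for this example.

```python
def precision_recall(predicted, actual, category):
    """Compute (precision, recall) for one category from parallel label lists.

    precision = correct predictions of the category / all predictions of it
    recall    = correct predictions of the category / all true members of it
    """
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == category and a == category)
    pred_pos = sum(1 for p in predicted if p == category)
    actual_pos = sum(1 for a in actual if a == category)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall
```

High precision with low recall suggests a category that is applied correctly but too rarely. The reverse suggests a category that captures too much.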
1.
3. In the console tree, select Knowledge Bases, and then click Add on the
toolbar to add a new KB.
4. In the Add Knowledge Base window, define the fields below, and click OK:
5.
6.
o The IBM Classification Module server and KB instance are running. You
can verify it through the Classification Manager.
o IBM FileNet P8 is running.
1. Launch the Classifier tool by navigating to Start > Programs > IBM
Classification Module 8.5 > Classifier, and log on with your IBM FileNet P8
user name and password. This FileNet P8 user ID must have read and write
permission on all folders in FileNet P8 that are required for classification.
2. On the General tab, set the following global properties:
Global FileNet P8 settings:
Classification type:
Select the check box Folder.
Content to classify:
Select The file system with the directory F:\Test, where all the new documents
reside.
3. On the Folder Classification tab, go to the Runtime Settings tab and set the
following properties:
Scoring:
Default folders:
4. On the Folder Classification tab, go to the Document Properties tab and set
the following properties:
Table 1. File system document properties
Document Property
Document Content      Body
Document Filename     FileName
Document Title        Title
5. On the Folder Classification tab, go to the Auto-move tab and set the
following properties:
Auto-move thresholds:
Audit percentage:
Auto-reject folder:
When a document is auto-rejected, put it in this folder: ICM_Auto_Rejected
6.
7.
o
P8
o
FileNet P8
o
8.
The IBM Classification Module server and KB instance are running. You
can verify it through the Classification Manager.
reviewTool.classifier.propertyfile=${reviewTool.baseDir}/ECMTools/conf/Classifier.properties
3.
http://Web_application_server_name or
IP_address:port_number/ReviewTool/index.jsp
4. On the Sign in window, enter your IBM FileNet P8 user name and
password. This FileNet P8 user ID must have read and write permission on all
relevant FileNet P8 folders.
1. After you log in to the Classification Review Tool, you are in the single
document view by default. The documents in this view are taken from the
Classifier output folder ReviewToolInput. You can navigate and view documents
in the review queue one by one by clicking Next or Previous.
Figure 4-1. Classification Review Tool Single document view
2. After navigation, click Start Over to set the view back to the first
document.
3. For each document, the system tells why the document is in the review
queue. Click Show on the Additional Document Information bar.
Figure 4-2. Document information
4. Click the document type icon or the document name to review the
document content.
5. Review the folders in the Suggested and Selected lists for the given
document. The Suggested list displays the top classification folders that the IBM
Classification Module determined to be most appropriate for the document. The
Selected list displays the suggested folder with the highest relevancy score by
default.
6.
1. If you would like to take actions in bulk, you can use the document list view
to process multiple documents simultaneously.
2. Click View: Document list, and switch to the document list view.
3. Review the folder that appears in the Move to column. This is how the IBM
Classification Module intends to classify the document.
4. Select the check boxes of the documents to which the bulk operation is to be
applied. Selecting the check box in the table title row selects all documents on the
page.
Figure 4-6. Classification Review Tool -- Document list view
5. Once multiple documents have been selected, you can perform operations
similar to those in the single document view, except that there is no Send to
Classifier action in the document list view.
Add new documents
1. You can add new documents one at a time to the IBM FileNet P8
repository, and accept or modify the classification that the IBM Classification
Module suggests for the document.
2. In the sidebar, click Add New Document.
3. Browse to the document you'd like to add, and click Add.
Figure 4-7. Add document
4.
5. You can process this new document as you would process any document
in the review queue in the single document view.
Summary
This article introduced IBM Classification Module and IBM FileNet P8 integration steps in a
Windows environment. It first reviewed the integration architecture and workflow. It then used
step-by-step screen shots to illustrate how to use the IBM ECM Classification tools along with
the IBM Classification Module server component and the Classification Workbench to automate
the content classification in the integrated environment.
In summary, while ingesting new content into IBM FileNet P8, a typical workflow to automate
content classification in the integrated environment is:
Train the IBM Classification Module system with IBM FileNet P8 content by
using the Content Extractor to extract sample content from a FileNet P8
repository, creating and analyzing a knowledge base in the Classification
Workbench, and registering the newly created KB with the classification engine
through the Classification Manager.
You can rapidly train the IBM Classification Module based on the categories
already created in your IBM FileNet P8 repository, as addressed above. But if
you don't have a set of categories ready to go, the IBM Classification Module's
taxonomy proposer can divide an existing set of content into logical groupings,
recommending a new set of categories and corresponding names. You can learn
more about the Taxonomy Proposer by referring to the Integration for IBM FileNet
P8 User's Guide.
Classify new content for IBM FileNet P8 by using the Classifier tool to
classify documents into the right folders or document classes in FileNet P8, reject
documents with low relevancy scores, and set aside a configured percentage of
documents for manual review or audit.
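The Classifier behavior summarized above (automatic moves, rejection of low-scoring documents, and a configured audit percentage) can be pictured as a simple routing decision. The Python sketch below is a conceptual model only. The two thresholds, the treatment of mid-range scores as manual review, and the function itself are assumptions for illustration, not the Classifier's actual logic or settings.

```python
import random


def route_document(score, move_threshold, reject_threshold, audit_fraction, rng=random):
    """Decide what happens to a classified document based on its relevancy score.

    Assumed model: scores below `reject_threshold` are auto-rejected, scores
    between the two thresholds go to manual review, and scores at or above
    `move_threshold` are auto-moved unless randomly set aside for audit.
    """
    if score < reject_threshold:
        return "auto_reject"      # e.g. placed in the ICM_Auto_Rejected folder
    if score < move_threshold:
        return "manual_review"
    if rng.random() < audit_fraction:
        return "audit"            # sampled from otherwise confident decisions
    return "auto_move"
```

Sampling the audit set from confident decisions is what lets a reviewer catch systematic errors that would otherwise be moved without anyone looking at them.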