Business Intelligence Systems
(9th Ed., Prentice Hall)
Chapter 7: Text and Web Mining
Learning Objectives
Opening Vignette: Mining Text for Security and Counterterrorism
What is MITRE?
Problem description
Proposed solution
Results
Answer and discuss the case questions
Copyright 2011 Pearson Education, Inc. Publishing as Prentice Hall
TRUE or FALSE? The nature of data is the difference between data mining and text mining.
Spam filtering
Email prioritization and categorization
Automatic response generation
Information extraction
Topic tracking
Summarization
Categorization
Clustering
Concept linking
Question answering
Unstructured or semistructured data
Corpus (and corpora)
Terms
Concepts
Stemming
Stop words (and include words)
Synonyms (and polysemes)
Tokenizing
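Tokenizing, stop-word removal, and stemming can be illustrated with a minimal sketch. The stop-word list and suffix rules below are assumptions made for illustration, not the ones used in the text; a real project would likely rely on an established stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (an assumption); real lists are longer and domain-specific.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}

# Naive suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The analysts are mining documents for hidden patterns"))
```

Note that the crude rules reduce "mining" to "min", which shows why stemming can merge or distort terms and why include-word lists and synonym handling matter.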
Term dictionary
Word frequency
Part-of-speech tagging
Morphology
Term-by-document matrix
Occurrence matrix
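A term-by-document (occurrence) matrix can be sketched in a few lines. The three-document corpus and the whitespace tokenization are assumptions for illustration; rows are terms, columns are documents, and each cell holds a raw word frequency.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical toy corpus.
corpus = [
    "text mining finds patterns",
    "web mining finds links",
    "text and web data",
]

terms, matrix = term_document_matrix(corpus)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

In practice the raw counts are often replaced by weighted values, and stop words are removed before the matrix is built so that it stays small.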
What is a patent?
NLP is
What is Understanding?
Challenges in NLP
Part-of-speech tagging
Text segmentation
Word sense disambiguation
Syntax ambiguity
Imperfect or irregular input
Speech acts
Dream of AI community
WordNet
Sentiment Analysis
Marketing applications
Security applications
ECHELON, OASIS
Deception detection
Academic applications
A difficult problem
If detection is limited to only text, then the problem is even more difficult
The study
Logistic regression: 67.28
Decision trees: 71.60
Neural networks: 73.46
Modifiers
Non-immediacy
Lexical Diversity
Spatio-Temporal
Quantity
Passive Voice
Uncertainty
Verb/Phrase Count
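Some of the cue categories above can be turned into numeric features for a classifier. The sketch below is only an illustration under stated assumptions: quantity is taken as the word count, lexical diversity as unique words divided by total words, and uncertainty as the number of hits against a small hypothetical word list. The study's actual feature definitions are not given here.

```python
import re

# Hypothetical hedging words, chosen for illustration only.
UNCERTAINTY_WORDS = {"maybe", "perhaps", "possibly", "might", "could"}


def cue_features(text):
    """Compute three simple text-based cues from a statement."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {
        # Quantity: total number of words.
        "quantity": total,
        # Lexical diversity: share of distinct words (0.0 for empty text).
        "lexical_diversity": len(set(words)) / total if total else 0.0,
        # Uncertainty: count of words found in the hedging list.
        "uncertainty": sum(w in UNCERTAINTY_WORDS for w in words),
    }


print(cue_features("Maybe I was there, maybe not"))
```

Feature vectors like this one could then be passed to a classifier such as logistic regression, a decision tree, or a neural network.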
Software/hardware limitations
Privacy issues
Linguistic limitations
Extract knowledge from available data sources (A0)
Context-specific knowledge
Domain expertise
Tools and techniques
Association
Trend Analysis
Journal | Year | Author(s) | Title | Vol/No | Pages | Keywords | Abstract
MISQ | 2005 | A. Malhotra, S. Gosain and O. A. El Sawy | Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation | 29/1
ISR | 1999 | D. Robey and M. C. Boudreau | Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
JMIS | 2001 | R. Aron and E. K. Clemons
RapidMiner
GATE
Spy-EM
[Figure: No of Articles by Year (1994-2005), one panel per cluster (Clusters 1-9)]
[Figure: No of Articles by Journal (ISR, JMIS, MISQ), one panel per cluster (Clusters 1-9)]
The Web
Web Mining
What is the process of discovering intrinsic relationships from Web data?
Web Mining.
Authoritative pages
Hubs
Hyperlink-induced topic search (HITS) algorithm
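The relationship between hubs and authoritative pages can be made concrete with a minimal sketch of the HITS iteration. The link graph below is a hypothetical example, and the fixed iteration count is an assumption; a fuller implementation would test for convergence and start from a query-specific subgraph.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: A and B both point to C, and C points to D.
graph = {"A": ["C"], "B": ["C"], "C": ["D"]}
hub_scores, auth_scores = hits(graph)
print("top authority:", max(auth_scores, key=auth_scores.get))
print("top hubs:", sorted(hub_scores, key=hub_scores.get, reverse=True)[:2])
```

A page becomes a strong authority when good hubs point to it, and a strong hub when it points to good authorities, which is why the two scores are updated in alternation.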
Clickstream data
Clickstream analysis
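A common first step in clickstream analysis is grouping raw clicks into visitor sessions. The sketch below assumes each record is a (user, timestamp in seconds, URL) tuple and that 30 minutes of inactivity ends a session; both the record layout and the cutoff are assumptions for illustration.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed cutoff


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, events in by_user.items():
        current = [events[0][1]]
        # Compare each click with the previous one from the same user.
        for (prev_ts, _), (ts, url) in zip(events, events[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Hypothetical clickstream data.
clicks = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
]
for user, pages in sessionize(clicks):
    print(user, pages)
```

Once sessions are available, measures such as pages per visit or the most frequent navigation paths can be computed from them.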
URL
angoss.com
ClickTracks: clicktracks.com
deepmetrix.com
Megaputer WebAnalyst: megaputer.com
microstrategy.com
sas.com
spss.com
WebTrends: webtrends.com
XML Miner: scientio.com
Questions / comments