Jing Luan, Ph.D., ITMC Director, Planning and Research, Cabrillo College October, 2001
In 45 minutes
Tiered Knowledge Management Model (TKMM)
Data Mining Overview: concept and use
Demonstration of Clementine
Data mining plan at your college
Data mining, statistics and OLAP
Q&A
Jing Luan, UCSF/SPSS, 2001 2
Tiers:
Knowledge Base
Knowledge Workers
Collaborative Working Environment (CWE)
Knowledge Mapping
Tacit Knowledge
Many data mining projects fail due to lack of understanding of these three tiers, particularly in data (feature) extraction in Tier One.
TIER THREE
Mining: Clementine, Enterprise Miner, Statistica, Mineset, Darwin, SpotFire
Classical statistics: SPSS, SAS, BMDP, SysStat
TIER TWO
Querying: BrioQuery, Business Objects, PowerPlay, Access, Foxpro
Online Data Processing: ASP, JSP, iHTML, XML
TIER ONE
Data Engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2
Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner
Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge
Guiding Principles
LRM (Learner Relationship Management)
Student Life Cycle
Student Clustering, student types
Data source and quality
CRISP-DM (all about a system)
The One-Percent Doctrine
Who is likely to respond to our new marketing strategy?
What factor garners the highest response?
Which type of marketing works better?
Chart labels: Savings ($); quota; 25%; 35%; 0, 40th percentile, 70th percentile.
If every percentage point = $2,500, savings = (70 * $2,500) - (40 * $2,500) = $175,000 - $100,000 = $75,000
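The savings arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `savings_between` and its parameter names are made up for illustration, and the only figures taken from the slide are the $2,500 per percentage point and the 40 and 70 points.

```python
def savings_between(high_point: float, low_point: float,
                    dollars_per_point: float = 2500.0) -> float:
    """Dollar difference between two percentage points.

    Hypothetical helper: each percentage point is assumed to be worth
    a fixed dollar amount, as stated on the slide.
    """
    return (high_point - low_point) * dollars_per_point


# Figures from the slide: 70 points vs. 40 points at $2,500 per point.
print(savings_between(70, 40))  # 75000.0
```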
o1 Persist
o2 Not-persist
$o_j = f\left(\sum_{i=1}^{n} o_i w_{ji}\right)$
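The node formula above (the output of node j is an activation function f applied to the weighted sum of the incoming outputs) can be sketched in Python. The slide does not say which f is used, so the logistic sigmoid here is an assumption, and `node_output` is a hypothetical name chosen for illustration.

```python
import math
from typing import Callable, Sequence


def sigmoid(x: float) -> float:
    """Logistic activation; an assumed choice for f, not stated on the slide."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs: Sequence[float], weights: Sequence[float],
                activation: Callable[[float], float] = sigmoid) -> float:
    """Compute o_j = f(sum_i o_i * w_ji) for a single node j."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(o * w for o, w in zip(inputs, weights))
    return activation(weighted_sum)


# With an identity activation the result is just the weighted sum.
print(node_output([1.0, 2.0], [3.0, 4.0], activation=lambda x: x))  # 11.0
```

In a network with the two output nodes shown on the slide (Persist and Not-persist), a function like this would be evaluated once per node, each with its own weights.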
Examining Data
A node is being executed (notice the red arrows denoting the flow of data).
Output (Boosting/Reduction)
Because there are always fewer graduates than students overall, Clementine can balance the dataset first.
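The idea of balancing by boosting (duplicating records of the smaller class) or reduction (discarding records of the larger class) can be sketched outside Clementine. The Python below is an illustration only, not Clementine's implementation; the function `balance`, the `mode` names, and the `graduated` field are all assumptions made for this example.

```python
import random
from collections import defaultdict


def balance(records, label_key, mode="boost", seed=0):
    """Return a class-balanced copy of `records` (a list of dicts).

    mode="boost":  randomly duplicate records of smaller classes up to
                   the size of the largest class.
    mode="reduce": randomly drop records of larger classes down to
                   the size of the smallest class.
    """
    if mode not in ("boost", "reduce"):
        raise ValueError("mode must be 'boost' or 'reduce'")
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[label_key]].append(record)
    sizes = [len(group) for group in groups.values()]
    target = max(sizes) if mode == "boost" else min(sizes)
    balanced = []
    for group in groups.values():
        if len(group) >= target:
            balanced.extend(rng.sample(group, target))
        else:
            extra = rng.choices(group, k=target - len(group))
            balanced.extend(group + extra)
    return balanced


# Hypothetical data: 2 graduates among 10 students.
students = [{"id": i, "graduated": i < 2} for i in range(10)]
boosted = balance(students, "graduated", mode="boost")
reduced = balance(students, "graduated", mode="reduce")
print(len(boosted), len(reduced))  # 16 4
```

Boosting keeps every original record at the cost of duplicates; reduction avoids duplicates but throws data away. Which one fits depends on how much data is available.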
These are the outputs of the neural network: overall accuracy and significance of features (left), and predicted number of policies using fresh data vs. known data (above).
Examining C5.0
Decision:
UNSUPERVISED
Purpose: for clustering and association
Models: Kohonen, K-means, TwoStep, GRI, etc.
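As one concrete illustration of the clustering models named above, here is a minimal K-means sketch in plain Python. It is not Clementine's implementation: seeding the centroids with the first k points and running a fixed number of iterations are simplifying assumptions, and the sample points are made up.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small K-means: returns (centroids, assignments).

    Centroids start as the first k points (an assumption made to keep
    the sketch deterministic); each pass assigns every point to its
    nearest centroid by Euclidean distance, then moves each centroid
    to the mean of its members.
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return centroids, assignments


# Two obvious groups of made-up points.
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = kmeans(data, k=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

No target field is supplied, which is what makes the method unsupervised: the groups come from the distances between records alone.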
Classification Estimation
Segmentation
Visualizing the Euclidean spatial relationships, trends, and patterns of your data
Description
But I Spent Years Learning Statistics!
But I Use OLAP For All My Work!
Statistics knowledge is very useful.
Data mining cannot replace statistics in a number of areas.
There are overlapping areas.
OLAP is the middle tier.
We must go beyond counting heads!
Chart labels (data mining, statistics and OLAP): Statistics; OLAP; C5.0, C&RT; Kohonen, K-means, TwoStep; Cluster Analysis; Cubes; Probability Density; Spatial Visualization; 2-3 dimension charts; Machine Learning/Artificial Intelligence; Mathematics; Unsupervised; Descriptive; ETL, SQL
Temporal/Trend
Contact
Jing Luan, Ph.D., ITMC
Director, Planning and Research
Cabrillo College
Email: jing@cabrillo.cc.ca.us
831.477.5656