Data Mining in HEd UCSF

Data Mining System in Higher Education & Persistence Clustering and Prediction
Jing Luan, Ph.D., ITMC Director, Planning and Research, Cabrillo College, October 2001
In 45 minutes
Tiered Knowledge Management Model (TKMM)
Data Mining Overview: concept and use
Demonstration of Clementine
Data Mining plan at your college
Data mining, statistics and OLAP
Q&A
Jing Luan, UCSF/SPSS, 2001
Tiered Knowledge Management Model (TKMM)

[Figure: TKMM diagram with tiers one, two and three shown for both explicit and tacit knowledge.]
Explicit Knowledge: Data Warehouses, Enterprise Resource Planning (ERP); Middleware, OLAP, Portals, CRM; Data Mining
Tacit Knowledge: Knowledge Base, Knowledge Workers, Collaborative Working Environment (CWE), Knowledge Mapping
TKMM: Explicit Knowledge Management

TIER THREE
Mining: Clementine, Enterprise Miner, Statistica, Mineset, Darwin, SpotFire
Classical statistics: SPSS, SAS, BMDP, SysStat
TIER TWO
Querying: BrioQuery, Business Objects, PowerPlay, Access, Foxpro
Online Data Processing: ASP, JSP, iHTML, XML
TIER ONE
Data Engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2
Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner
Many data mining projects fail due to a lack of understanding of these three tiers, particularly of data (feature) extraction in Tier One.
Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge
Guiding Principles
LRM (Learner Relationship Management)
Student Life Cycle
Student Clustering, student types
Data source and quality
CRISP-DM (all about a system)
The One-Percent Doctrine
Data Mining in Higher Ed

Alumni
Institutional Effectiveness
Marketing
Enrollment Management
Data Mining in Higher Ed - institutional effectiveness

What do we know about our students?
What factors contribute to learning?
Who is likely to fail or drop out?
What courses provide high FTES and use space better?
What are the course-taking patterns?
Data Mining in Higher Ed - enrollment management

Which groups prefer what services?
Which student is likely to drop out?
Where do our students come from?
Who is likely to return?
Data Mining in Higher Ed - marketing
Who is likely to respond to our new marketing strategy?
What factor garners the highest response?
Which type of marketing works better?
Data Mining in Higher Ed - alumni

What different types of alumni are there?
Who is likely to pledge for which amount and when?
Lift Chart: Gain Chart

[Figure: gain chart for a hypothetical database marketing campaign. Labels: Lift, quota, Savings ($); marked values 25% and 35%; horizontal axis from 0 through the 40th and 70th percentiles.]
If every percentage point = $2,500, savings = (70 * $2,500) - (40 * $2,500) = $175,000 - $100,000 = $75,000
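The savings arithmetic for this chart can be reproduced in a few lines. This is a minimal sketch: the function name `campaign_savings` and its parameter names are illustrative and do not come from the slides, and it assumes, as the slide does, a flat $2,500 per percentage point.

```python
# Assumption: each percentage point of the campaign costs a flat $2,500,
# as stated on the slide. All names here are illustrative.
COST_PER_POINT = 2_500


def campaign_savings(baseline_percentile, model_percentile,
                     cost_per_point=COST_PER_POINT):
    """Return the dollars saved by stopping at model_percentile
    instead of baseline_percentile."""
    baseline_cost = baseline_percentile * cost_per_point
    model_cost = model_percentile * cost_per_point
    return baseline_cost - model_cost


if __name__ == "__main__":
    # 70th percentile vs. 40th percentile, the two points marked on the chart.
    print(campaign_savings(70, 40))
```

Calling `campaign_savings(70, 40)` mirrors the slide: $175,000 minus $100,000.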
Artificial Neural Networks (ANN)

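Multi-layer perceptron (MLP): feed-forward, back-propagation
[Figure: MLP diagram.]
Inputs: x1 (# of Terms), x2 (GPA), x3 (Demographics), x4 (Courses), x5 (Fin Aid), ..., xn
Weights: w1, ..., w5, ..., wn
Outputs: o1 (Persist), o2 (Not-persist)
$o_j = f\left(\sum_{i=1}^{n} o_i w_{ji}\right)$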
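The unit equation on this slide (each output is an activation function applied to a weighted sum of the incoming values) can be sketched as a forward pass. This is only an illustration under stated assumptions: the slide does not name the activation function f, so a sigmoid is assumed here; the function names are hypothetical; and back-propagation training is not shown.

```python
import math


def sigmoid(z):
    """Assumed activation f; the slide does not specify which f is used."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights, activation=sigmoid):
    """Compute o_j = f(sum_i o_i * w_ji) for every unit j in one layer.

    weights[j][i] is the weight from incoming value i to unit j.
    """
    return [
        activation(sum(o_i * w_ji for o_i, w_ji in zip(inputs, row)))
        for row in weights
    ]


def forward(inputs, hidden_weights, output_weights):
    """Feed inputs through one hidden layer and then the output layer."""
    hidden = layer_output(inputs, hidden_weights)
    return layer_output(hidden, output_weights)


if __name__ == "__main__":
    # Illustrative, made-up values: five inputs, two hidden units,
    # two outputs (persist, not-persist).
    x = [0.4, 0.8, 0.1, 0.5, 1.0]
    w_hidden = [[0.2, -0.1, 0.4, 0.0, 0.3], [-0.3, 0.5, 0.1, 0.2, -0.2]]
    w_output = [[0.7, -0.4], [-0.7, 0.4]]
    print(forward(x, w_hidden, w_output))
```

Training would adjust the weights by propagating the prediction error backwards; the sketch only shows how a trained network turns inputs into the two output values.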
Decision Trees Rule Induction

Rule 1: If Income ≥ $55,000 and # of Children = 3, then multiple policies
Rule 2: If Income < $55,000 and single and Age < 30, then single policy
Information theory (entropy): $H(N) = -\sum_{i=1}^{n} P(n_i) \log_2 P(n_i)$
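The information measure that decision tree algorithms use to choose splits can be computed directly from class labels. The sketch below assumes the standard Shannon entropy form, with a leading minus sign and base-2 logarithm; the function name `entropy` and the sample labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: H = -sum(P(n) * log2(P(n))) over the classes."""
    total = len(labels)
    counts = Counter(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


if __name__ == "__main__":
    # Made-up labels: an even split is maximally uncertain for two classes.
    print(entropy(["persist", "persist", "not-persist", "not-persist"]))
    # A pure node has no uncertainty.
    print(entropy(["persist", "persist", "persist"]))
```

A tree learner prefers the split whose child nodes have the lowest weighted entropy, that is, the split that leaves the purest groups.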
The Use of Clementine

Real-time demonstration
Student persistence prediction
Examining Data
Clustering using TwoStep
Building Models for Persistence in Streams
A node is being executed (notice the red arrows denoting the flow of data).
Output (Boosting/Reduction)
Because there are always fewer graduates than students overall, Clementine can balance the dataset first.
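The idea of balancing a dataset by boosting or reducing can be sketched generically. This is not Clementine's Balance node and makes no claim about how that node works internally; it is a plain random resampling sketch in which "reduce" shrinks every class to the size of the smallest and "boost" duplicates records until every class matches the largest. The function name, the `mode` values and the record layout are assumptions made for illustration.

```python
import random


def balance(records, label_key, mode="reduce", seed=0):
    """Return a class-balanced copy of records (a list of dicts).

    mode="reduce": randomly drop records from larger classes.
    mode="boost":  randomly duplicate records in smaller classes.
    """
    if mode not in ("reduce", "boost"):
        raise ValueError("mode must be 'reduce' or 'boost'")
    if not records:
        return []
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[label_key], []).append(record)
    sizes = [len(group) for group in groups.values()]
    target = min(sizes) if mode == "reduce" else max(sizes)
    balanced = []
    for group in groups.values():
        if len(group) > target:
            # Reduction: keep a random subset of the larger class.
            balanced.extend(rng.sample(group, target))
        else:
            # Boosting: keep every record and add random duplicates.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


if __name__ == "__main__":
    # Made-up data: 8 non-graduates and 2 graduates.
    data = [{"id": i, "status": "student"} for i in range(8)]
    data += [{"id": 100 + i, "status": "graduate"} for i in range(2)]
    print(len(balance(data, "status", mode="reduce")))
    print(len(balance(data, "status", mode="boost")))
```

Reduction discards information from the larger class, while boosting repeats records from the smaller one; which is preferable depends on how much data is available.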
Seeing the Work of Neural Thinking
Graphic display showing an ANN learning the data.
Results of Neural Node
These are the outputs of the neural network: overall accuracy and significance of features (left), and the predicted number of policies using fresh data vs. known data (above).
Examining C5.0
The control panel of the C5.0 node (Expert).
Results of C5.0 Node
View the prediction by individual records (PNXT vs. $C-PNXT).
View the overall prediction accuracy.
Comparing C&RT and C5.0

Use the Analysis node to examine the difference in accuracy between C&RT and C5.0. See next slide.
Which One is Better: C&RT & C5.0

C5.0 has an accuracy rate of 66.3% and C&RT 63.7%. They agree 72% of the time.
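The two kinds of figures reported here, per-model accuracy and the rate at which two models agree, are simple to compute from prediction lists. The sketch below uses made-up toy predictions, not the data behind the 66.3%, 63.7% and 72% figures, and the function names are illustrative.

```python
def accuracy(predicted, actual):
    """Share of records where the prediction equals the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def agreement(predictions_a, predictions_b):
    """Share of records where two models make the same prediction,
    whether or not that prediction is correct."""
    matches = sum(a == b for a, b in zip(predictions_a, predictions_b))
    return matches / len(predictions_a)


if __name__ == "__main__":
    # Toy data: 1 = persist, 0 = not persist.
    known = [1, 1, 0, 0]
    model_a = [1, 0, 0, 0]
    model_b = [1, 1, 1, 0]
    print(accuracy(model_a, known), accuracy(model_b, known))
    print(agreement(model_a, model_b))
```

Two models can have similar accuracy and still disagree on many individual records, which is why the agreement rate is reported separately.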
Scoring New Data

Moment of truth. The most powerful feature of data mining is using learned rules to score (predict on) fresh data for business purposes. Shown here is the switch to a fresh dataset that Clementine has not seen before.
Using Models to Score New Data

[Figure: Test Set Results; Scored Results; Decision]
2 TYPES OF DATA MINING

SUPERVISED
Purpose: classification and estimation
Models: C5.0, C&RT, ANN, etc.
UNSUPERVISED
Purpose: clustering and association
Models: Kohonen, K-means, TwoStep, GRI, etc.
But data that is not pre-classified means data without a target.
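To make the unsupervised case concrete, here is a minimal one-dimensional k-means sketch: no target field is given, and the algorithm groups records purely by similarity. It is an illustration only, not the Clementine K-means or TwoStep node; the function name, the starting centers and the sample values (imagined as units completed per student) are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers.

    Each pass assigns every value to its nearest center, then moves each
    center to the mean of the values assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)),
                          key=lambda j: abs(value - centers[j]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
    return centers, clusters


if __name__ == "__main__":
    # Made-up values with two visible groups and no target label.
    units = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(kmeans_1d(units, centers=[0.0, 5.0]))
```

A supervised model such as C5.0 would instead need a known outcome for each record to learn from.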

Data Mining Tasks

Classification, Estimation: Predicting onto new data by using rules or patterns and behaviors
Segmentation: Understanding the groupings, trends, and characteristics of your customers
Description: Visualizing the Euclidean spatial relationships, trends, and patterns of your data
But I Spent Years Learning Statistics! But I Use OLAP For All My Work!
Statistics knowledge is very useful. Data mining cannot replace statistics in a number of areas. There are overlapping areas. OLAP is the middle tier. We must go beyond counting heads!
How Do Data Mining, Statistics and OLAP Compare

[Table: comparison of Data Mining, Statistics and OLAP.]
Data Mining: Neural Net; C5.0, C&RT; Kohonen, K-means, TwoStep; Machine Learning / Artificial Intelligence
Statistics: Regression, Structural Equation; PCA, Factor Analysis; Cluster Analysis; Probability Density; Mathematics
OLAP: Cubes; ETL, SQL
Other entries (column unclear): Spatial Visualization; 2-3 dimension charts; Unsupervised; Descriptive; Temporal/Trend
Evaluating Data Mining Software

Company stability and customer feedback
User Interface
Scalability (up and down)
Server/Client (real-time, KDD)
Modeling capacities
Learning Curve
Join a listserv, such as CLUG
Cost
Data Mining Plan at Your College

1. Determine business needs
2. Determine technology infrastructure and management support
3. Determine data source
4. Identify mining areas
5. Invite an expert to jump start
6. Pilot test mining results
7. CRISP-DM and real-time data mining, Knowledge Discovery in Databases (KDD)
Data Mining Skills Set

Driving Forces of DM:
Computer Storage
Algorithms
Knowledge Management
Translate to Skill set:
Data domain expert
Familiar w/ models
System level view of decision making
Who's Coming to Dinner?

Data mining workshop(s)
Contact
Jing Luan, Ph.D., ITMC
Director, Planning and Research
Cabrillo College
Email: jing@cabrillo.cc.ca.us
831.477.5656
