Data Mining System in Higher Education & Persistence Clustering and Prediction

Jing Luan, Ph.D., ITMC Director, Planning and Research, Cabrillo College
October 2001

In 45 minutes
Tiered Knowledge Management Model (TKMM)
Data Mining Overview: concept and use
Demonstration of Clementine
Data Mining plan at your college
Data mining, statistics and OLAP
Q&A
Jing Luan, UCSF/SPSS, 2001

Tiered Knowledge Management Model (TKMM)


[Diagram: three tiers (one, two, three) drawn for both explicit and tacit knowledge]

Data Mining, Middleware, OLAP, Portals, CRM

Tacit Knowledge: Knowledge Base, Knowledge Workers, Collaborative Working Environment (CWE), Knowledge Mapping

Explicit Knowledge: Data Warehouses, Enterprise Resource Planning (ERP)

TKMM: Explicit Knowledge Management


TIER THREE
Mining: Clementine, Enterprise Miner, Statistica, MineSet, Darwin, Spotfire
Classical statistics: SPSS, SAS, BMDP, SYSTAT

TIER TWO
Querying: BrioQuery, Business Objects, PowerPlay, Access, FoxPro
Online Data Processing: ASP, JSP, iHTML, XML

TIER ONE
Data Engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2
Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner

Many data mining projects fail due to a lack of understanding of these three tiers, particularly of data (feature) extraction in Tier One.

Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge

Guiding Principles
LRM (Learner Relationship Management)
Student Life Cycle
Student Clustering, student types
Data source and quality
CRISP-DM (all about a system)
The One-Percent Doctrine

Data Mining in Higher Ed


Alumni
Institutional Effectiveness
Marketing
Enrollment Management

Data Mining in Higher Ed -institutional effectiveness


What do we know about our students?
What factors contribute to learning?
Who is likely to fail or drop out?
What courses provide high FTES and use space better?
What are the course-taking patterns?

Data Mining in Higher Ed -enrollment management


Which groups prefer what services?
Which student is likely to drop out?
Where do our students come from?
Who is likely to return?

Data Mining in Higher Ed -marketing

Who is likely to respond to our new marketing strategy?
What factor garners the highest response?
Which type of marketing works better?

Data Mining in Higher Ed - alumni


What different types of alumni are there?
Who is likely to pledge for which amount and when?

Lift Chart: Gain Chart


Hypothetical database marketing campaign

[Chart: lift (25% and 35% marked) plotted against percentile (0, 40th, 70th), with the quota line and the savings ($) region annotated]

If every percentage point = $2,500, savings = (70 * $2,500) - (40 * $2,500) = $175,000 - $100,000 = $75,000
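The savings arithmetic on this slide can be reproduced with a short sketch. It assumes that cost scales linearly with the percentage of the list contacted, and that the quota is reached at the 70th percentile without a model and at the 40th percentile with one. That reading of the chart, and the function name `contact_cost`, are assumptions for illustration.

```python
def contact_cost(percent_contacted: float, cost_per_point: float = 2_500.0) -> float:
    """Cost of contacting the given percentage of the list (assumed linear)."""
    return percent_contacted * cost_per_point

# Assumed reading of the chart: quota met at 70% without a model, at 40% with one.
cost_without_model = contact_cost(70)
cost_with_model = contact_cost(40)
savings = cost_without_model - cost_with_model

print(f"${cost_without_model:,.0f} - ${cost_with_model:,.0f} = ${savings:,.0f}")
# prints: $175,000 - $100,000 = $75,000
```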

Artificial Neural Networks (ANN)


Multi-layer perceptron (MLP): feed forward, back propagation

Inputs: x1 # of Terms, x2 GPA, x3 Demographics, x4 Courses, x5 Fin Aid, ..., xj
Weights: w1, ..., w5
Outputs: o1 Persist, o2 Not-persist

$o_j = f\left(\sum_{i=1}^{n} o_i w_{ji}\right)$
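As a rough illustration of the unit formula on this slide (each output is an activation function applied to the weighted sum of the incoming values), here is a minimal feed-forward sketch. The sigmoid activation, the single hidden layer of two units, and every input and weight value are made-up assumptions. Back propagation (the training step) is not shown.

```python
import math

def sigmoid(z: float) -> float:
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights, f=sigmoid):
    """Compute o_j = f(sum_i o_i * w_ji) for every unit j; weights[j][i] is w_ji."""
    return [f(sum(o_i * w_ji for o_i, w_ji in zip(inputs, row))) for row in weights]

# Hypothetical, already-scaled inputs: terms, GPA, demographics, courses, fin aid.
x = [0.5, 0.8, 0.0, 0.6, 1.0]

# Made-up weights: two hidden units, then two output units (persist, not-persist).
w_hidden = [[0.2, 0.9, -0.1, 0.4, 0.3],
            [-0.5, -0.7, 0.2, -0.3, -0.1]]
w_output = [[1.5, -1.0],
            [-1.5, 1.0]]

hidden = layer_output(x, w_hidden)
persist, not_persist = layer_output(hidden, w_output)
print(f"persist={persist:.3f} not_persist={not_persist:.3f}")
```

In a trained network the weights would come from back propagation on historical student records rather than being typed in by hand.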

Decision Trees Rule Induction


Rule 1: If Income ≥ $55,000 and # of Children = 3, then multiple policies
Rule 2: If Income < $55,000 and single and Age < 30, then single policy

Information theory (entropy): $H(N) = -\sum_{i=1} P(n_i) \log_2 P(n_i)$
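The entropy measure that rule-induction algorithms use to choose splits can be computed directly from class counts. The sketch below is illustrative only: the function name and the sample labels are assumptions, and it is not the C5.0 implementation.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy in bits: sum of p * log2(1/p), which equals -sum of p * log2(p)."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)

evenly_split = ["persist"] * 5 + ["drop"] * 5   # hypothetical node, hardest to predict
pure = ["persist"] * 10                          # hypothetical node, nothing left to split

print(entropy(evenly_split))  # 1.0
print(entropy(pure))          # 0.0
```

A tree builder prefers the split whose child nodes have the lowest weighted entropy, that is, the split that leaves the purest groups.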

The Use of Clementine


Real-time demonstration

Student persistence prediction


Examining Data


Clustering using TwoStep


Building Models for Persistence in Streams

A node is being executed (notice the red arrows denoting the flow of data).

Output (Boosting/Reduction)
Because graduates are always a minority of all students, Clementine can balance the dataset first (boosting the smaller class or reducing the larger one).
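The idea of balancing by reduction can be sketched in a few lines: randomly sample the larger class down to the size of the smaller one. This is a generic undersampling sketch, not Clementine's own balancing logic. The function name, the field name `graduated`, and the data are hypothetical.

```python
import random

def balance(records, target, seed=0):
    """Reduce every class to the size of the smallest class by random sampling."""
    rng = random.Random(seed)
    groups = {}
    for rec in records:
        groups.setdefault(rec[target], []).append(rec)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 2 graduates among 10 students.
students = [{"id": i, "graduated": i < 2} for i in range(10)]
balanced = balance(students, "graduated")

print(len(balanced), sum(r["graduated"] for r in balanced))  # 4 2
```

Boosting works the other way around, duplicating records of the smaller class, which keeps all the data at the cost of repeated rows.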

Seeing the Work of Neural Thinking

Graphic display showing an ANN learning the data.

Results of Neural Node

These are the outputs of the Neural Network node: overall accuracy and significance of features (left), and the predicted number of policies using fresh data vs. known data (above).

Examining C5.0

The control panel of the C5.0 node (Expert).

Results of C5.0 Node

View the prediction by individual records (PNXT vs. $C-PNXT).

View the overall prediction accuracy.


Comparing C&RT and C5.0


Use the Analysis node to examine the difference in accuracy for C&RT and C5.0. See next slide.

Which One is Better: C&RT or C5.0?


C5.0 has an accuracy rate of 66.3% and C&RT 63.7%. They agree 72% of the time.
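The two figures reported here, per-model accuracy and the rate at which two models agree, are simple proportions. The sketch below shows how each could be computed. The labels are made up and far too few to mean anything, and the helper names are assumptions.

```python
def accuracy(predicted, actual):
    """Share of records where the prediction matches the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def agreement(pred_a, pred_b):
    """Share of records where two models make the same prediction."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)

# Made-up labels for five students (1 = persist, 0 = not persist).
actual    = [1, 0, 1, 1, 0]
c50_pred  = [1, 0, 0, 1, 0]
cart_pred = [1, 1, 0, 1, 0]

print(accuracy(c50_pred, actual))      # 0.8
print(accuracy(cart_pred, actual))     # 0.6
print(agreement(c50_pred, cart_pred))  # 0.8
```

Agreement is measured between the two models only, so two models can agree often while both being wrong on the same records.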


Scoring New Data


Moment of truth. The most powerful feature of data mining is using learned rules to predict (score) fresh data for business purposes. Shown here is the switch to a fresh dataset that Clementine has not seen before.

Using Models to Score New Data


[Screenshots: Test Set Results, Scored Results, Decision]

2 TYPES OF DATA MINING


SUPERVISED
Purpose: classification and estimation
Models: C5.0, C&RT, ANN, etc.

UNSUPERVISED
Purpose: clustering and association
Models: Kohonen, K-means, TwoStep, GRI, etc.

Data that is not pre-classified means data without a target.
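To make the unsupervised case concrete, here is a minimal one-dimensional K-means sketch: it groups values without ever seeing a target field. The starting centers, the fixed iteration count, and the sample data are assumptions for illustration. It is not the algorithm as implemented in Clementine.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on numbers; starting centers are spread over the sorted values."""
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep the old center if empty.
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical units-completed values; no target column is used.
units = [3, 4, 5, 12, 13, 15, 30, 31]
centers, clusters = kmeans_1d(units, k=3)

print(clusters)  # [[3, 4, 5], [12, 13, 15], [30, 31]]
```

A supervised model would instead need a known outcome for each record, such as persisted or not, to learn from.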

Data Mining Tasks


Classification, Estimation
Applying rules, patterns, and behaviors to predict on new data

Segmentation
Understanding the groupings, trends, and characteristics of your customer

Description
Visualizing the Euclidean spatial relationships, trends, and patterns of your data

But I Spent Years Learning Statistics! But I Use OLAP For All My Work!
Statistics knowledge is very useful.
Data mining cannot replace statistics in a number of areas.
There are overlapping areas.
OLAP is the middle tier.
We must go beyond counting heads!

How Do Data Mining, Statistics and OLAP Compare


[Comparison table; other labels on the slide: Unsupervised, Descriptive, Temporal/Trend]

Data Mining: Neural Net; C5.0, C&RT; Kohonen, K-means, TwoStep; Spatial Visualization; Machine Learning / Artificial Intelligence
Statistics: Regression, Structural Equation; PCA, Factor Analysis; Cluster Analysis, Probability Density; 2-3 dimension charts; Mathematics
OLAP: Cubes; 2-3 dimension charts; ETL, SQL

Evaluating Data Mining Software


Company stability and customer feedback
User Interface
Scalability (up and down)
Server/Client (real-time, KDD)
Modeling capacities
Learning Curve
Join a listserv, such as CLUG
Cost

Data Mining Plan at Your College


1. Determine business needs
2. Determine technology infrastructure and management support
3. Determine data source
4. Identify mining areas
5. Invite an expert to jump start
6. Pilot test mining results
7. CRISP-DM and real-time data mining, Knowledge Discovery in Databases (KDD)

Data Mining Skills Set


Driving Forces of DM:
Computer Storage
Algorithms
Knowledge Management

Translate to Skill-set:
Data domain expert
Familiar w/ models
System level view of decision making

Who's Coming to Dinner?


Data mining workshop(s)


Contact

Jing Luan, Ph.D., ITMC Director, Planning and Research
Cabrillo College
Email: jing@cabrillo.cc.ca.us
831.477.5656
