Introduction To Weka

Mark Hall
Pentaho Corporation, Suite 340, 5950 Hazeltine National Dr., Orlando, FL 32822, USA
Data Mining
WEKA - what is it?
WEKA UIs
Integration with Pentaho
Projects based on WEKA
Data Mining
A definition: extraction of implicit, previously unknown, and potentially useful information from data
Goal (business oriented): improve marketing, sales, and customer support operations
Who is likely to remain a loyal customer, and who is likely to jump ship?
What products should be marketed to which prospects?
What determines whether a person will respond to a certain offer?
How can I detect potential fraud?
Central idea: historical data contains information that will be useful in the future
Historical patterns provide useful insight and generalize to future situations
Data Mining: algorithms that automatically detect patterns and regularities in data
Strong patterns can be used to make predictions
Problem 1: many patterns are not interesting
Problem 2: patterns may be inexact (or even completely spurious) if data is garbled or missing
Techniques borrowed from statistics, computer science, and machine learning research
Compared to traditional statistics
Statistics is manual, user-driven, top-down: formulate a hypothesis, convert the hypothesis into a database query, test the significance of the results
Data mining automates the data interrogation
Data-driven, self-organizing, bottom-up
Automatic examination of a large number of hypotheses
Compared to OLAP
OLAP: data summarization, aggregation via addition, e.g. number of widgets sold in all ZIP codes in the country
Data mining: ratios, patterns and influences, e.g. factors influencing the sales of the widgets in those ZIP codes
DM can enhance OLAP: suggest dimensions for the cube, discretization etc.
Data Mining is a Process
[Figure: the data mining process. Select yields selected data; Preprocess yields preprocessed data; Transform yields transformed data; Mine yields extracted information; Analyze & Assimilate yields assimilated knowledge.]
What is WEKA?
Copyright: Martin Kramer (mkramer@wxs.nl)
Hamilton
WEKA: The Software

WEKA (Waikato Environment for Knowledge Analysis)
Funded by the NZ government for more than 10 years
Develop an open-source, state-of-the-art workbench of data mining tools
Explore fielded applications
Develop new fundamental methods
Became part of the Pentaho suite in 2006

Pentaho Data Mining (PDM)
Core Functionality
Support for the whole process of experimental data mining
Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and the result of learning
Tools and algorithms
69 data pre-processing tools
118 classification/regression algorithms
11 clustering algorithms
18 attribute/subset evaluators + 12 search algorithms for feature selection
6 algorithms for finding association rules
User Interfaces
Explorer - data exploration/visualization, model construction and export, preliminary evaluation
Experimenter - large-scale algorithm comparison with statistical tests for significant differences in performance
KnowledgeFlow - process model view of data mining, export of DM process
Architecture
Modular, object-oriented architecture
Packages for different types of algorithms: filters, classifiers, clusterers, associations, attribute selection etc.
Sub-packages group components by functionality or purpose, e.g. classifiers.bayes, filters.unsupervised.attribute
Public interface prescribed by abstract base class or interface for all types of algorithms
Algorithms are Java Beans
GUIs use introspection/reflection to dynamically generate editor dialogs at runtime
All components rely to a greater or lesser extent on a core top-level package
Classes and data structures for reading data sources; representing instances, sparse instances and attributes; and providing common utility methods
Additional interfaces that indicate extra functionality
Packages containing learning schemes have associated Evaluation classes

Routines for performing cross-validation, computing performance metrics, generating ROC curves etc.
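The cross-validation routine mentioned above can be illustrated with a minimal sketch in plain Python. This is not WEKA's Evaluation API; the function names (`k_fold_indices`, `cross_validate`) and the majority-class baseline learner are assumptions made for illustration only.

```python
import random

def k_fold_indices(n_instances, k, seed=1):
    """Split the indices 0..n_instances-1 into k disjoint test folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def cross_validate(instances, labels, k, train_fn, predict_fn, seed=1):
    """Return the fraction of correctly classified instances over all k folds."""
    correct = 0
    for test_idx in k_fold_indices(len(instances), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(instances)) if i not in held_out]
        # Train on everything outside the fold, test on the fold itself.
        model = train_fn([instances[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, instances[i]) == labels[i]
                       for i in test_idx)
    return correct / len(instances)

# Hypothetical baseline learner: always predict the most frequent training label.
def train_majority(train_x, train_y):
    return max(sorted(set(train_y)), key=train_y.count)

def predict_majority(model, x):
    return model

data = list(range(10))
classes = ["a"] * 8 + ["b"] * 2
accuracy = cross_validate(data, classes, 5, train_majority, predict_majority)
```

Each instance is used for testing exactly once, so the returned accuracy is an estimate over the whole data set rather than over a single held-out split.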
Explorer
Preprocess panel
Load data from various sources (file, SQL database, URL etc.)
Apply pre-processing filters to the data
Summary statistics & histograms
Classify panel
Apply classification and regression algorithms
Evaluate resulting models: numerically via statistical estimation, graphically through visualization (data and model)
Cluster panel
Apply clustering algorithms to the data
Visualize the outcome
Clusters that represent density estimates can be evaluated based on the statistical likelihood of the data
Associate panel
Learn association rules for market-basket type analysis
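As a rough illustration of what an association rule measures in market-basket analysis, the sketch below computes support and confidence for a rule over a handful of transactions. It is plain Python, not one of WEKA's association rule learners, and the basket data and function names are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

A rule learner searches for rules whose support and confidence exceed chosen thresholds; the sketch only scores a single rule that is given up front.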
Select attributes panel
Mix and match algorithms for evaluating the utility of attributes and sets of attributes with different search methods
Visualize panel
Color-coded scatter plot matrix of the data
Select, enlarge, zoom in etc.
Knowledge Flow
Define a data mining process
As in the Explorer, all of WEKA's algorithms are available
Data flows through the process from node to node
Accommodates both batch-based processing and data streams
Command line interface to WEKA can also train incremental classifiers on data streams
Fully multi-threaded
Accommodates multiple independent flows on the same layout
Knowledge Flow's Classifier step is multi-threaded

Build models for more than one cross-validation fold in parallel
Binary and XML-based persistence of flow layouts
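The idea of building models for several cross-validation folds at once can be sketched with Python's standard thread pool. This is only an analogy for the multi-threaded Classifier step, not its implementation; the "model" here is a hypothetical placeholder that predicts the mean of the training targets.

```python
from concurrent.futures import ThreadPoolExecutor

def build_and_score(fold):
    """Train on one fold's training targets and return the test mean squared error."""
    train, test = fold
    # Placeholder model: predict the mean of the training targets.
    prediction = sum(train) / len(train)
    return sum((y - prediction) ** 2 for y in test) / len(test)

def evaluate_folds_in_parallel(folds, max_workers=4):
    """Score every (train, test) fold concurrently, keeping fold order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input folds.
        return list(pool.map(build_and_score, folds))

# Two hypothetical folds, each a (training targets, test targets) pair.
folds = [
    ([1.0, 3.0], [4.0]),
    ([2.0, 2.0], [2.0, 4.0]),
]
errors = evaluate_folds_in_parallel(folds)
```

The folds are independent of each other, which is what makes this step easy to parallelize. In CPython, threads do not speed up CPU-bound training, so a process pool would be the closer equivalent for real workloads.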
Experimenter
Automate the process of determining the best method to use
The same task is an interactive process in the Explorer or Knowledge Flow
Run classification and regression algorithms on a corpus of data sets
Try different parameter settings
Collect performance statistics
Perform significance tests on the results
Raw output saved to files or databases
Analysis results can be exported as text, CSV, Gnuplot, LaTeX or HTML
Advanced users can distribute the processing load across multiple machines
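To make the significance-testing step concrete, here is a minimal sketch of a standard paired t-statistic computed over per-run scores of two algorithms. It is plain Python with made-up accuracy figures; the Experimenter's own tester may apply corrections that are not shown here.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the paired differences scores_a[i] - scores_b[i].

    Assumes both lists have the same length (at least 2) and that the
    differences are not all identical (otherwise the deviation is zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error; n - 1 degrees of freedom.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical percent-correct results for two schemes on the same four runs.
scheme_a = [90, 80, 85, 95]
scheme_b = [80, 75, 80, 85]
t_value = paired_t_statistic(scheme_a, scheme_b)
```

The resulting value is compared against the critical value of the t distribution with n - 1 degrees of freedom to decide whether the difference in performance is significant.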
Extensibility
Plugin mechanisms allow WEKA to be extended without modifying the classes in the WEKA distribution
New tabs can be added to the Explorer
New visualizations can be added to the pop-up menu in the Explorer's Classify panel
Classifier errors, predictions, trees and graphs
Knowledge Flow - Plugins tab

Drop a jar file into $HOME/.knowledgeFlow/plugins/<a plugin>/
Standards and Interoperability

Support for PMML import
Regression, general regression and neural network model types
More model types and support for export in future development releases
LibSVM/SVM-Light data format support
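For readers unfamiliar with the LibSVM/SVM-Light format, each line holds a label followed by sparse `index:value` pairs. The sketch below parses one such line in plain Python; it is not WEKA's loader, the function name is hypothetical, and it assumes a non-empty line without SVM-Light `qid:` tokens.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    # Drop a trailing '#' comment, then split on whitespace.
    parts = line.split("#", 1)[0].split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        # Only attributes with an explicit entry are stored (sparse representation).
        features[int(index)] = float(value)
    return label, features

label, features = parse_libsvm_line("+1 1:0.5 3:2 # example row")
```

Attributes that do not appear on a line are implicitly zero, which is why the format suits sparse data.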
Integration With Pentaho

Main point of integration is with Pentaho Data Integration (PDI), aka the Kettle project
PDI (Kettle) - streaming, engine-driven ETL tool
PDI can export data in ARFF format
High-volume, low memory consumption
WEKA-specific transformation steps
WekaScoring: score data using a pre-constructed WEKA model (classification, regression or clustering) or PMML model as part of an ETL transformation
KnowledgeFlow: execute arbitrary Knowledge Flow processes as part of an ETL transformation
Can be used to automatically refresh/rebuild a predictive model
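The ARFF format mentioned above is a plain-text file with a header declaring the relation and its attributes, followed by comma-separated data rows. The following is a minimal sketch of writing such a file in plain Python; it is not PDI's ARFF output step, the function name and sample data are hypothetical, and it assumes attribute names and values that need no quoting.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string 'numeric' or a list of the allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            # Nominal attributes list their values in braces.
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with '?'.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [(85, "no"), (None, "yes")],
)
```

Because the header carries the attribute types, a model built from one ARFF file can check that later data has the same structure before scoring it.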
Scoring as Part of an ETL process
Refreshing a predictive model
Projects Based on/Using WEKA

Open-source data mining systems
Konstanz Information Miner (KNIME) & RapidMiner provide WEKA plugins
R provides an interface to WEKA through the RWeka package
Scientific workflow environment

Kepler Weka project integrates all of WEKA's functionality into the Kepler open-source scientific workflow platform
Systems for natural language processing

GATE NLP workbench
Balie - language identification, tokenization, sentence boundary detection, named entity recognition
Kea - automatic keyphrase extraction
Text mining
Judge & IR Utilities - two systems that perform document categorization and clustering

Knowledge discovery in biology
BioWEKA - extension to WEKA for tasks in biology, bioinformatics and biochemistry
Epitopes Toolkit - platform based on WEKA for developing epitope prediction tools
maxdView & Mayday - visualization and analysis of microarray data
Distributed and parallel data mining

Weka-Parallel & GridWeka - distributed cross-validation, scoring and testing
FAEHIM & Weka4WS - make WEKA available as a web service
Connectionist and artificial immune system algorithms

Weka Classification Algorithms Project - several artificial neural networks and artificial immune system based algorithms
Impact
Has been downloaded more than 1.5 million times since being placed on SourceForge in April 2000
Current download rate of more than 20,000 per month
Large user base and active community
2750 people subscribed to the mailing list
