
What in the World is Apache Mahout?

T. Karthikeyan

Mahout architecture (layers, top to bottom):

Applications
Examples
Frequent Pattern Mining, Genetic, Classification, Clustering, Recommenders
Utilities: Lucene/Vectorizer
Math: Vectors/Matrices/SVD
Collections (primitives)
Apache Hadoop

Mahout Clustering

Algorithms:
K-Means
Fuzzy K-Means
Mean Shift
Canopy
Dirichlet
Spectral clustering (based on eigenvalues)
MinHash clustering
LDA-based clustering

Notion of similarity (distance measure):
Euclidean
Cosine
Tanimoto
Manhattan
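To make the four distance measures concrete, here is a minimal Python sketch of each one on dense vectors. It is an illustration only, not Mahout's implementation: the function names are hypothetical, and the code assumes equal-length, non-zero vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine of the angle between the vectors (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def tanimoto_distance(a, b):
    """1 - Tanimoto coefficient, dot / (|a|^2 + |b|^2 - dot); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom


# Hypothetical sample vectors, e.g. term weights of two short documents.
p, q = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
for measure in (euclidean, manhattan, cosine_distance, tanimoto_distance):
    print(measure.__name__, measure(p, q))
```

Cosine and Tanimoto ignore or dampen vector length, which is why they are a common choice for text vectors of very different document sizes.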

Clustering our own data

Convert the dataset to Hadoop SequenceFile format:
    ./mahout seqdirectory <options>

Convert the sequence files to sparse vector format:
    ./mahout seq2sparse <options>

Run the clustering driver class:
    ./mahout <kmeans/> <options>

Dump the cluster output:
    ./mahout clusterdump <options>

Clustering Examples

Using the Reuters dataset (SGML files):

    $ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
        -c UTF-8 -chunk 1
    $ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse
    $ bin/mahout kmeans -i reuters-sparse/tfidf-vectors \
        -c reuters-clusters \
        -o reuters-kmeans \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -cd 0.1 -x 10 -k 20 -ow
    $ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
        -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
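The k-means step in the example takes a number of clusters (-k), a maximum number of iterations (-x), and a convergence delta (-cd). The sketch below shows how those three parameters interact in a plain, single-machine k-means loop. It is not Mahout's MapReduce implementation: it uses Euclidean distance rather than cosine, seeds the centroids with the first k points (an assumption made to keep the run deterministic), and all names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=10, convergence_delta=0.1):
    """Return (centroids, assignment) for a list of equal-length points."""
    # Assumption: seed with the first k points instead of random sampling.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        # Stop once no centroid moved farther than the convergence delta.
        shift = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            break
    return centroids, assignment


# Hypothetical 2-D data with two obvious groups.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

A larger convergence delta stops earlier with coarser centroids; a smaller one lets the loop run until the iteration cap is reached.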

Mahout Classification

Algorithms implemented:
Naive Bayes
Complementary Naive Bayes
Random Forest
Logistic Regression (sequential algorithm)
Hidden Markov Models

Upcoming algorithms:
Support Vector Machines
Classification based on perceptron and winnow

Bayes, CBayes Classifier

Preprocessing: raw data into classifiable data

Bayes, CBayes Classifier Example

Using the 20 Newsgroups dataset:

    $ ./mahout prepare20newsgroups -p 20news-bydate-train -o 20news-train \
        -a org.apache.lucene.analysis.standard.StandardAnalyzer \
        -c UTF-8
    $ ./mahout trainclassifier -i 20news-train -o 20news-model \
        -type <cbayes,bayes> \
        -ng 1 -source hdfs
    $ ./mahout testclassifier -d 20news-test -m 20news-model \
        -type <cbayes,bayes> \
        -ng 1 -source hdfs

Output: confusion matrix

Logistic Regression

Input (CSV with a header row):

    "x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
    0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1

Train the model:

    $ ./mahout trainlogistic --input input.csv --output ./model \
        --target color --categories 2

Run the model on test data:

    $ ./mahout runlogistic --input test.csv --model ./model \
        --auc --confusion

Output:

    AUC = 0.97
    Confusion matrix:
             A      B
    A     24.0    2.0
    B      3.0   11.0
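A confusion matrix is easier to read once it is reduced to a single number. As a small worked example, the sketch below computes overall accuracy from the 2x2 matrix reported above: the diagonal holds the correctly classified rows, so accuracy is the diagonal sum divided by the total. The function name is hypothetical, and the result does not depend on whether rows or columns hold the actual classes.

```python
def accuracy(matrix):
    """Fraction of examples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Values taken from the runlogistic output shown in this section.
confusion = [[24.0, 2.0], [3.0, 11.0]]
print(accuracy(confusion))  # (24 + 11) / 40 = 0.875
```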

Random Forest

Input: ARFF or CSV

Generate a file descriptor for the dataset:

    $ $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
        org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
        -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Run the example:

    $ $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
        org.apache.mahout.df.mapreduce.BuildForest <options>

Use the decision forest to classify new data:

    $ $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
        org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>

Output: confusion matrix

Dimension Reduction

Algorithms implemented:
Singular Value Decomposition
Stochastic Singular Value Decomposition

Upcoming algorithms:
Principal Components Analysis
Independent Component Analysis
Gaussian Discriminant Analysis

Input: real-valued matrix

    0.12   0.8     0.123
    0.89   2.33    1.445
    4.12   2.123   3.12

    ./mahout <svd/ssvd> <options>

Output: eigenvectors
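To show what a singular vector is, the sketch below finds the largest singular value and its right singular vector of a small matrix by power iteration on A^T A. This is a stand-in chosen for brevity, not the solver behind Mahout's svd or ssvd jobs, and the function name and test matrix are hypothetical.

```python
import math


def top_singular(matrix, iterations=100):
    """Return (sigma, v): largest singular value and its right singular vector."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-length starting vector
    for _ in range(iterations):
        # u = A v
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # w = A^T u, i.e. one multiplication by A^T A
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space; nothing more to learn
        v = [x / norm for x in w]
    # sigma is the length of A v for the converged unit vector v
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    return sigma, v


# Hypothetical diagonal matrix: its singular values are simply 3 and 4.
sigma, v = top_singular([[3.0, 0.0], [0.0, 4.0]])
print(sigma, v)
```

Repeating the procedure after removing the found component yields further singular vectors, which is the set of directions a dimension-reduction step keeps.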

Frequent Pattern Mining

Algorithm: Parallel FP-Growth
Input: dat or CSV

Running Parallel FP-Growth:

    $ ./mahout fpg -i retail.dat -o patterns -k 50 -method mapreduce -regex '[\ ]' -s 2

Viewing the results:

    $ ./mahout seqdumper -s patterns/part-?-00000 -n 4
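The fpg run above splits each input line on spaces (-regex) and keeps patterns that occur at least twice (-s 2). The sketch below illustrates that notion of minimum support with brute-force counting of single items and pairs. It is deliberately not FP-Growth, which avoids this enumeration by building a prefix tree, and the transaction data and names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those meeting min_support."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore duplicates within one transaction
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical input in the same shape as a .dat file: one transaction per line,
# item ids separated by a single space.
lines = ["1 2 3", "1 2", "2 3", "1 4"]
transactions = [line.split(" ") for line in lines]
print(frequent_itemsets(transactions))
```

Item 4 and the pairs (1, 3) and (1, 4) appear only once, so they fall below the support threshold and are dropped.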

Recommenders / Collaborative Filtering

Algorithms:
Non-distributed recommenders ("Taste")
Distributed item-based collaborative filtering
Collaborative filtering using a parallel matrix factorization

Input is a text file with one preference per line: user,item,preference
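The user,item,preference format is simple enough to demonstrate end to end. The sketch below parses such lines and scores unseen items by item co-occurrence weighted with the user's own preferences, in the spirit of item-based collaborative filtering. It is a toy, not the Taste API or Mahout's distributed job; the sample data and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import permutations


def load_preferences(text):
    """Parse 'user,item,preference' lines into {user: {item: preference}}."""
    prefs = defaultdict(dict)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        user, item, pref = row
        prefs[int(user)][int(item)] = float(pref)
    return prefs


def cooccurrence(prefs):
    """Count how many users have a preference for both items of each pair."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(prefs, user, how_many=3):
    """Rank items the user has not seen by co-occurrence times preference."""
    counts = cooccurrence(prefs)
    seen = prefs[user]
    scores = defaultdict(float)
    for item, pref in seen.items():
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n * pref
    return sorted(scores.items(), key=lambda kv: -kv[1])[:how_many]


# Hypothetical ratings file contents.
SAMPLE = """1,101,5.0
1,102,3.0
2,101,2.0
2,102,2.5
2,103,5.0
3,101,4.0
3,103,4.5
"""
prefs = load_preferences(SAMPLE)
print(recommend(prefs, 1))
```

For user 1, item 103 co-occurs twice with item 101 (preference 5.0) and once with item 102 (preference 3.0), giving a score of 2 * 5.0 + 1 * 3.0 = 13.0.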

TASTE

Collaborative Filtering using a parallel matrix factorization

Input: rating matrix or CSV

Run distributed ALS-WR to factorize the rating matrix defined by the training set:

    $MAHOUT parallelALS --input TrainingSet --output out \
        --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065

Compute predictions against the probe set and measure the error:

    $MAHOUT evaluateFactorization --input TrainingSet --output op \
        --tempDir tmp1

Compute recommendations:

    $MAHOUT recommendfactorized --input userRatings --output recommendations \
        --numRecommendations 6 --maxRating 5

SUMMARY

ALGORITHMS                                             INPUT
All clustering algorithms, Bayes, CBayes classifier    Sparse vector
Logistic regression, Random forest, FP-Growth          CSV
Taste, Collaborative Filtering                         User, Item, Preference
SVD, SSVD                                              Matrix
