ISSN: 2455-5703
Abstract
Data mining is an interdisciplinary field that finds application everywhere, and its algorithms can be used to solve many day-to-day problems. Since RStudio is widely used by researchers across the globe, this paper implements the most widely used data mining algorithms for different case studies in the R programming language. Advanced sensing and computing technologies have enabled the collection of large amounts of complex data. Data mining techniques can be used to discover useful patterns in such data, which in turn can be used for classifying new data or for other purposes. An algorithm for processing large data sets must be scalable, and an algorithm for processing data with changing patterns must be capable of incrementally learning and updating data patterns as new data become available. Although data mining algorithms such as decision trees support incremental learning on data with mixed types, the scalability of these algorithms in handling large amounts of data is still unsatisfactory. The following algorithms were implemented in RStudio with complex data sets. Four algorithms are covered in this project: 1) Clustering, 2) Classification, 3) Apriori, and 4) Decision Tree. It is concluded that RStudio produced efficient results for implementing these algorithms.
Keywords- R, Data Mining, Clustering, Classification, Decision Tree, Apriori Algorithm, Data Sets
I. INTRODUCTION
RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio is written in the C++ programming language and uses the Qt framework for its graphical user interface, which includes rich code editing, debugging, testing, and profiling tools.
A. Clustering Algorithm
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest center.
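The procedure described above can be sketched in a few lines of R on the built-in iris measurements; this is a minimal illustration (the full clustering script appears in the Appendix):

```r
# k-means on the four numeric iris columns, with k = 3 fixed a priori
data(iris)
iris_num <- iris[, 1:4]            # drop the Species class label
set.seed(1)                        # k-means starts from random centers
fit <- kmeans(iris_num, centers = 3)
fit$centers                        # the k cluster centers
table(iris$Species, fit$cluster)   # compare clusters with the true species
```

Each observation is assigned to the nearest of the three centers, and the centers are then recomputed until the assignments stabilize.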
B. Classification Algorithm
Classification is one of the core data mining techniques. It analyzes a given data set, takes each instance, and assigns it to a particular class such that the classification error is minimized. It is used to extract models that describe important data classes within the given data set. Classification is a two-step process. In the first step, a model is created by applying a classification algorithm to a training data set. In the second step, the extracted model is tested against a predefined test data set to measure the trained model's performance and accuracy. Classification is thus the process of assigning a class label to data whose class label is unknown.
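The two-step process can be illustrated with a simple train/test split in R. This sketch uses rpart as the example classifier (the Appendix uses ctree from the party package instead); the 70/30 split ratio is an illustrative choice:

```r
library(rpart)                     # recursive-partitioning classifier
data(iris)
set.seed(1234)
idx   <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[idx, ]               # step 1: build the model on training data
test  <- iris[-idx, ]              # step 2: evaluate it on unseen test data
model <- rpart(Species ~ ., data = train, method = "class")
pred  <- predict(model, test, type = "class")
mean(pred == test$Species)         # accuracy on the test set
```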
C. Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by
identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item
sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association
rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
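A minimal association-rule example with the arules package follows; the five toy baskets and their item labels are hypothetical, and the support/confidence thresholds are illustrative (the Appendix applies the same calls to a real market-basket file):

```r
library(arules)
# a toy transactional database of five shopping baskets
baskets <- list(c("bread", "milk"),
                c("bread", "butter"),
                c("milk", "butter"),
                c("bread", "milk", "butter"),
                c("milk"))
trans <- as(baskets, "transactions")
# keep item sets appearing in at least 40% of baskets,
# and rules that hold at least 60% of the time
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.6))
inspect(rules)   # show the mined rules with support, confidence, and lift
```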
IV. CONCLUSION
The data mining algorithms were implemented efficiently in the R environment, making full use of its features. Large sets of data can be processed and manipulated in R, which is widely used for statistical analysis. Because the data are very large, errors are difficult for a user to notice manually; this system is able to achieve reliability by reducing human involvement in manipulating the data, and thereby reduces the risk of mistakes that humans make when manipulating larger data sets.
APPENDIX
A. Clustering Algorithm
library(datasets)
data(iris)
summary(iris)
set.seed(8953)
iris1 <- iris
iris1$Species <- NULL  # remove the class label before clustering
(kmeans.result <- kmeans(iris1, 3))
table(iris$Species, kmeans.result$cluster)
plot(iris1[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
library(fpc)
pamk.result <- pamk(iris1)
pamk.result$nc  # estimated number of clusters
table(pamk.result$pamobject$clustering, iris$Species)
layout(matrix(c(1, 2), 1, 2))  # show two plots side by side
plot(pamk.result$pamobject)
library(fpc)
iris2 <- iris[-5] # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
table(ds$cluster, iris$Species)
plot(ds, iris2[c(1, 4)])
plotcluster(iris2, ds$cluster)
B. Classification Algorithm
str(iris)
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data <- iris[ind == 2, ]
library(party)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data = train.data)
table(predict(iris_ctree), train.data$Species)
# print the fitted tree
print(iris_ctree)
plot(iris_ctree)
C. Decision Tree Algorithm
install.packages("caTools")
# Decision Tree Regression
# Importing the dataset
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Decision_Tree_Regression")
datasets = read.csv('Position_Salaries.csv')
dataset = datasets[2:3]
# Splitting the dataset into the Training set and Test set
# # install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
# rpart: recursive partitioning is a statistical method for multivariable
# analysis; it creates a decision tree that strives to correctly classify
# members of the population by splitting it into sub-populations based on
# several dichotomous independent variables.
library(rpart)
# Salary ~ . regresses the dependent variable on all remaining independent variables
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
# rpart.control: Various parameters that control aspects of the rpart fit.
# minsplit :the minimum number of observations that must exist in a node in order for a split to be attempted.
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred
D. Apriori Algorithm
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Apriori")
library(arules)
# read.transactions builds the sparse transaction matrix directly from the CSV
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
itemFrequencyPlot(dataset, topN=10)
rules=apriori(data=dataset,parameter=list(support=0.04,confidence=0.2))
# visualizing the results: top 10 rules sorted by lift
inspect(sort(rules, by = 'lift')[1:10])
q()