NEIGHBOUR ANALYSIS.
PROJECT PROPOSAL on
1.1 INTRODUCTION
Academic performance is the extent to which a student, lecturer (tutor), or institution has
achieved their short or long term educational goals. Grades in the form of cumulative grade
point average (GPA) and completion of educational degrees such as High School and
measured with the help of examinations and continuous assessments. Most factors that affect the
performance of students at the end of a degree program are present prior to university admission.
These individual factors can therefore be used to predict the academic performance of students
prior to entrance. Factors like entrance grades and demographic variables are most widely
analysed in related literature. Kabakchieva (2012) analysed about twenty (20) factors that
affect student performance in university degree examinations, including gender, birth year,
birth place, living place and country, type of previous education, profile and place of previous
education, total score from previous education, university admittance exam and achieved
score, total university score at the end of the first year, etc. Jia and Mareboyana (2013)
identified total credit hours (TCH) taken as one of the main factors that affect students’
retention ability. Acharya and Sinha (2014) also analysed a number of factors that affect
student performance which include gender, caste, religion, family size, board, state of origin,
family income, board marks, study hours per day, attendance, mid semester score, student’s
medium of study, type of secondary school attended and private tuition. This area of research
and modelling of large data sources. It is the broad process of finding valuable knowledge in
data and emphasizes the high-level application of particular data mining methods. The unifying
goal of the KDD process is to extract knowledge from data in the context of large databases. This
is done using data mining methods. Katare and Dubey (2017) described data mining as the
foundation of the KDD process, involving algorithms that search the data, build
up the model and discover previously unknown patterns. The KDD
process is cyclic, interactive, and consists of nine steps.
Data mining refers to the set of computational methods that extract important patterns from
raw data. In addition, the KDD process is concerned with handling very large data sets,
scaling algorithms for better performance, appropriate interpretation of retrieved information, and
Educational Data Mining is an emerging discipline, concerned with developing methods for
exploring the unique types of data that come from educational settings, and using those methods
to better understand students and the settings in which they are taught (Katare and Dubey,
2017). On one hand, the growth of both instrumental educational software and state
databases of student information has created large repositories of data reflecting how students
learn. On the other hand, the use of the Internet in education has created a new context known
computational approaches that combine data and theory to transform practice to benefit learners.
The ability to monitor the progress of students’ academic performance is a critical issue for the
academic community of higher learning. A system for analysing students’ results, based on
classification and using standard statistical algorithms to arrange their score data
according to their level of performance, is described. The major objective of this project
is to use classification algorithms (a family of data mining methods) for analysing and predicting
contain hidden information which, when the standard dataset is subjected to an appropriate data
mining technique, will predict student performance, thus providing information for future
improvements.
Many techniques have been applied to predicting students’ performance in institutions
using qualitative and quantitative factors. Data mining techniques that have been applied in this
educational data mining problem area include rule learners, decision trees, neural networks and
k-nearest neighbour (Kabakchieva, 2012), and logistic regression, naive Bayes and SVM (Jia and
Mareboyana, 2013). Agrawal and Mavani (2015) pointed out that present studies
show that the academic performance of students is primarily dependent on their past
performance. Results from their investigation confirm that past performances have indeed
In (Tanner and Toivonen, 2010), “Predicting and preventing student failure–using the k-
presented. The paper is motivated by the need to study the problem of predicting student
performance in an online course. The objective of the work is to identify, at an early stage of
the course, those students who have a high risk of failing. The work employs the k-nearest
neighbour method (KNN) and its many variants to solve the student performance prediction
problem. It presents extensive experimental results from a tutored online course on touch-typing,
consisting of 12 lessons and a final test, with a database of close to 15000 students. The
goal is to predict the success or failure of students automatically, based on their performance
in the first lessons; a tool could then alert teachers at an early stage to students
who are at risk of failing the course or dropping out of it. The results indicate that KNN
can predict student performance accurately, even after the very first lessons. The
authors conclude that early tests of skills can also be strong predictors of final scores in
other skill-based courses. The authors suggested that selected methods described in the paper
will be implemented as an early warning feature for teachers of the touch-typing course, so
that they can quickly focus their attention on the students who need help the most. However, the
Yadav and Pal (2012) presented a paper on “Data mining: A prediction for performance
improvement of engineering students using classification.” The main motivation of this paper
is the idea that classification methods like decision trees, Bayesian networks, etc. can be applied
to educational data for predicting students’ performance in examinations. This paper
carries out prediction for performance improvement of engineering students using the C4.5,
ID3 and CART decision tree algorithms. Decision tree algorithms were applied to the sourced
data. The data set used in this study was obtained from VBS Purvanchal University. The data
was collected through the enrolment form filled by the student at the time of admission,
which includes: demographic data, past performance data, address and contact number. Most
of the attributes reveal the past performance of the students. A few derived variables were
selected. The work was implemented in WEKA and 10-fold cross validation was used in
evaluation of the models. The decision trees predicted the number of students
who are likely to pass, fail or be promoted to the next year. The results provide steps to improve
the performance of the students who were predicted to fail or be promoted. The work also
identifies different factors which affect a student’s learning behaviour and performance during
their academic career. Frequently used decision tree classifiers are studied and experiments are
conducted to find the best classifier for predicting students’ performance in the first-year
engineering examination. From the classifiers’ accuracy it is clear that the true positive rate of
the model for the FAIL class is 0.786 for the ID3 and C4.5 decision trees. This work can be extended
algorithms is presented. The rationale behind the research work described in this paper is
based on the great potential that is seen in using data mining methods and techniques for
effective usage of university data. The objective is the development of data mining models
for predicting student performance, based on their personal, pre-university and university-
performance characteristics. The dataset includes data about students admitted to the
university in three consecutive years. The dataset file contains data about 10330 students that
have been enrolled as university students during the period between 2007 and 2009,
described by 20 parameters, including gender, birth year, birth place, living place and
country, type of previous education, profile and place of previous education, total score from
previous education, university admittance exam and achieved score, total university score at
the end of the first year, etc. The research work focuses on the development of data mining
models for predicting student performance by using four data mining algorithms for
classification – a Rule Learner, a Decision tree algorithm, a Neural network, and a K-Nearest
Neighbour method. It is a continuation of previous research, carried out with the same dataset
and with similar data mining algorithms, but for a different format of the predicted target
variable. The achieved results from the performed research, using a target variable with five
distinct values – Bad, Average, Good, Very Good, and Excellent, are previously published in
(Kabakchieva et al, 2011). In this current paper, the study is implemented for a binary target
applying the selected four data mining algorithms – OneR Rule Learner, Decision Tree,
Neural Network and K-Nearest Neighbour, on the available and carefully pre-processed
student data, reveal classification accuracy between 67.46% and 73.59%. A notable
contribution of this research is that the highest accuracy is achieved for the Neural Network
model (73.59%), followed by the Decision Tree model (72.74%) and the k-NN model
(70.49%). The Neural Network model predicts with higher accuracy the “Strong” class, while
the other three models perform better for the “Weak” class. The data attributes related to the
students’ University Admission Score and Number of Failures at the first-year university
exams are among the factors that most influence the classification process. The work was not
Jia and Mareboyana (2013) present “Machine Learning Algorithms and Predictive Models
for Undergraduate Student Retention.” The authors adopt the goal of a decision tree
learning algorithm, which is to make the resulting tree as small as possible, as stated in
(Mitchell, 1997). The paper studies undergraduate student retention at Historically Black
Colleges and Universities (HBCU) and explores the effectiveness of machine learning techniques
in determining the factors that influence student retention at an HBCU and in creating a retention
prediction model. The study used data queried from the Campus Solution database. The
six-year training data set contains 771 instances with 12 attributes. The last attribute is the target
attribute/variable holding binary values of yes or no. Decision tree, Logistic regression, Naive
Bayes, SVM and neural network (2-layer) are applied to the data. In this paper, the authors
have presented some results of undergraduate student retention using machine learning
algorithms classifying the student data. They made some improvements to the classification
algorithms such as Decision tree, Support Vector Machines (SVM), and neural networks
supported by the Weka software toolkit. The experiments revealed that the main factors that
influence student retention in the Historically Black Colleges and Universities (HBCU) are
the cumulative grade point average (GPA) and total credit hours (TCH) taken. There is a need
to extend the implementation using other machine learning tools like Python
In (Acharya and Sinha, 2014), a paper on Early prediction of students performance using
machine learning techniques is presented. In recent years Educational Data Mining (EDM)
has emerged as a new field of research due to the development of several statistical
approaches to explore data in educational context. One such application of EDM is early
prediction of student results. The objectives of this work are as follows: (i) To identify the
attributes and examination pattern of a set of students majoring in Computer Science in some
undergraduate colleges in Kolkata; (ii) To perform feature selection on this data set to
determine the optimal number of features; these features would then be used for
classification; (iii) To apply MLAs on this reduced feature set to identify the students who
may perform exceedingly well and more importantly those who may perform poorly; and (iv)
To perform a parametric comparison with the contemporary literature discussed. The data set
for this study is derived from a set of students majoring in Computer Science in some
undergraduate colleges in Kolkata. The training set contains 309 instances whereas the
testing set contains 104 instances. The dataset attributes used in this study include gender,
caste, religion, family size, board, state of origin, family income, board marks, study hours
per day, attendance, mid semester score, student’s medium of study, type of secondary school
attended, private tuition and finally the grade (the target variable). These attributes are used to
predict student grades as a 7-class problem in the semester-end examination: in particular,
grade “O” indicates that the student is outstanding and may be awarded a scholarship, whereas
grade “F” indicates that the student is poor and hence may need remedial attention. The work
experimented on Decision Trees (DTs), Bayesian Networks, Neural Networks and Support
Vector Machines. The work is implemented in WEKA. Results after detailed analysis showed
that DTs are the most convenient algorithm to generate the set of production rules.
Accordingly, C4.5 was used to generate the decision tree. F-Measure and Kappa Statistic
were used to determine the efficiency of the prediction algorithm. The average F-Measure value
for the training dataset was found to be 0.79, whereas for the testing dataset it was found to be
0.66. The latter value seems to be slightly low, perhaps because the number of
testing instances is only 104. The efficiency of prediction can be increased with other
Asif et al (2014), present “Predicting student academic performance at degree level: a case
study.” This paper is motivated by the two following questions: “Can we predict the
performance of students at the end of their degree in 4th year with a reasonable accuracy
using only their marks in High School Certificate (HSC), and in first and second year courses
at university, no socio-economic data, and utilizing one cohort to build the model and the
next cohort to test it?” and “Can we identify those courses in first and second years which are
effective predictors of students‘ performance at the end of the degree?” This paper presents a
case study on predicting performance of students at the end of a university degree at an early
stage of the degree program. The data of four academic cohorts comprising 347
undergraduate students are mined with different classifiers namely decision trees, neural
networks and naive Bayes. In this study the performance of a student at the end of the degree
is a class A, B, C, D or E, which represents the interval in which her/his final mark lies.
Intervals allow for differentiating between strong and weak students. The study differs from
other works in three aspects. First, using the conclusions of others, it limits the variables to
predict performance to marks only, no socio-economic data. Second, it takes one cohort to
build a model and the next cohort to evaluate it, thus allowing for some measurement of how
well findings generalize from one cohort to the next one. Third it is a longitudinal study as
four cohorts of students have been considered. The study shows that it is possible to predict
the graduation performance in 4th year at university using only pre-university marks and
marks of 1st and 2nd year courses, no socio-economic or demographic features, with a
reasonable accuracy and that the model established for one cohort generalizes to the
following cohort. Naïve Bayes showed an accuracy of 83.65% on Dataset II, which is better
than the one obtained in related works that have used socio-economic or demographic
features and pre-university mark. This suggests that including marks obtained in the first
investigation of the second research question led to the identification of 5 courses (2 courses of
1st year and 3 courses of 2nd year) to predict the graduation performance. Considering this set of 5
courses only, instead of the pre-university subjects and all 1st year and 2nd year courses, tends to
increase the accuracy of the decision trees. However, the five selected courses do not lead to
better accuracy with the naive Bayes and 1-NN classifiers, which gave the best accuracy in
the first place. Even if these five courses make it possible to find sensible courses that are
indicators of poor or strong performance in first and second year, more research is needed to
understand these limited findings. Future work is to study the progression of students during
the 4 years of their bachelor's degree and investigate whether typical developments can be identified.
In (Abdullah et al, 2014), “Students' Performance Prediction System using Multi Agent Data
Mining Technique” is presented. The authors reaffirmed the idea that high accuracy in predicting
students’ performance helps to identify low-performing students at the beginning of the
learning process, and that boosting techniques had been identified as a way
to achieve high prediction accuracy. In the paper, a students’ performance prediction system
using Multi Agent Data Mining is proposed to predict the performance of students based
on their data with high prediction accuracy and to provide help to low-performing students through
optimization rules. The work builds a Students’ Performance Prediction System that is
able to predict students’ performance based on their academic results with high accuracy.
Boosting Techniques. Students’ data were collected manually from EMES e-learning system
of the computer skills (CPIT100) course, KAU, Jeddah, Saudi Arabia of session 2012-13.
The experiment was conducted using the EMES e-learning system dataset. The dataset
contains 155 students. The dataset attributes are 9 which include student name, ID, number of
number of correct answers in all assignments, total login time the student
spent in the e-learning system, number of pages the student read and, finally, the student's
final grade class (which is the target class). The system is implemented in Java. The
ROC curve, the computational time to complete the classification tasks. The proposed system
has been implemented and evaluated by investigating the prediction accuracy of the Adaboost.M1
and LogitBoost ensemble classifier methods and of the C4.5 single classifier method. The
results show that using the SAMME Boosting technique improves the prediction accuracy and
outperforms the C4.5 single classifier and LogitBoost. The paper applied the Multi Agent Data
Mining Technique in the EMES e-learning system for students’ performance prediction and
optimization. There is a need to improve the true positive rate of the low class.
Mavani, 2015). One of the motivations of this study is to identify high-risk students, as well as
to identify features which affect the performance of students. In this paper, a neural network
semester 3 to semester 6 were used in the study. The algorithm is trained on a training set of
This was done 7 times, varying the training and test sets each time (k-fold cross validation).
Also, the importance of several different features is considered, in order to determine which
of these are correlated with student performance. From the results, it is found that the
variable with the second highest potential to affect students' performance is their living
location, and the third is the medium of teaching. The highest accuracy
reported for this model is 70%. This can be improved upon by utilizing more features and
increasing the data sample size. Also, this work was not compared with other works.
Mining Technique.” The authors identified that one way to achieve the highest level of quality in
an education system is by discovering knowledge from educational data, in order to study the main
attributes that may affect students’ performance. The discovered knowledge can be used
institutes to enhance their decision making process. In this work, data mining techniques are
used to predict student academic performance. The work investigates the educational domain
of data mining using graduate students’ data collected from various sources. The
work uses a feature extraction technique to pre-process the data, and a decision tree classifier
is used to classify the data. The proposed system concentrates on the development
of a tool for student performance prediction. Knowledge is discovered by pre-processing the data
and applying a suitable data mining method. The discovered knowledge is used to provide
and to improve students’ academic performance at the earliest. There was no reported
Improvement is presented (Muralidharan et al, 2017). The paper aims to enhance an
educational institution’s ranking, which is greatly influenced by its university results. These
results can be predicted beforehand and used in subsequent planning. The paper mainly
focuses on predicting a student’s university result from different attributes
using the k-nearest neighbour technique. The attributes used in the study are either
quantitative or qualitative. The quantitative attributes used are Internal Assessments,
The qualitative attributes include Subject feedback, Faculty feedback, and whether the
the historical data of students for more accurate prediction of results. This algorithm makes
use of the Euclidean distance formula to find the nearest record. Although different
classification algorithms have been used to predict students' results, no
algorithm has been applied to real-time educational datasets. In this work, the results of some
students belonging to an educational institution have been computed and shown. This
Katare and Dubey (2017) present a study of various techniques for predicting student
performance under educational data mining. The motivation for the paper is that performance
prediction is very useful for improving the teaching and learning process of educators and
learners. The survey focuses mostly on classification algorithms and their use in
evaluating student performance, and explores the application of various data mining techniques
to the educational data mining domain to predict student performance. The
paper discusses related work on evaluating the performance of students using different
analytical methods, and the techniques applied to datasets to find hidden
information and patterns in databases from educational environments. In EDM, the most frequently
used prediction technique is classification. Under classification, different methods such as
decision trees, neural networks, k-nearest neighbour, SVM and Bayes algorithms are used to predict
student performance. The performance of all these methods is evaluated based on
parameters like accuracy, precision and recall. The study is limited to the applicability of
five data mining techniques to EDM. Also, the techniques explored were not evaluated.
One major motivation which led to the introduction of similarity distance measure
calculation. The k-nearest neighbours’ algorithm is amongst the simplest of all machine learning
algorithms. An object is classified by a majority vote of its neighbours, with the object being assigned
to the class most common amongst its k nearest neighbours. k is a positive integer, typically small. If
k = 1, then the object is simply assigned to the class of its nearest neighbour. In binary (two class)
classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. The
same method can be used for regression, by simply assigning the property value for the object to be
the average of the values of its k nearest neighbours. It can be useful to weight the contributions of the
neighbours, so that the nearer neighbours contribute more to the average than the more distant ones.
The neighbours are taken from a set of objects for which the correct classification (or, in the case of
regression, the value of the property) is known. This can be thought of as the training set for the
algorithm, though no explicit training step is required. In order to identify neighbours, the objects are
represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean
distance, though other distance measures, such as the Manhattan distance, could in principle be used
instead. The k-nearest neighbour algorithm is sensitive to the local structure of the data. It is also
insensitive to outliers and makes no assumptions about data. It works well with numeric and
nominal values. kNN is an example of instance-based learning, where there is need to have
from educational data, to study the main attributes that may affect students’ performance.
The discovered knowledge can be used to offer helpful and constructive recommendations
to the academic planners in education institutes to enhance their decision making process, to
improve students’ academic performance and reduce the failure rate, and to better understand
students’ behaviour. Abdullah et al (2014) identified the prediction of students' academic
performance as one of the educational problems that are solved with data mining. Prediction of
students' performance is more beneficial for identifying the low academic performance
different factors like social, personal, psychological and other environmental variables. A
very promising tool to achieve this objective is the use of data mining. Data mining
techniques are used to discover hidden patterns and relationships in data, which is very much
Against this backdrop, this research is motivated by the need to:
a) extend the work in (Asif et al, 2014) by using pre-university, 1st and 2nd year courses
and
b) implement the model in (a) above using prior and present students’ university
A study of students’ performance prediction and educational data mining (EDM) will be
carried out. Techniques used in the prediction of students’ performance will be analysed, with
emphasis on recent data mining techniques like k-nearest neighbour. A dataset of computer science
students of the University of Ilorin that describes their demographic features and academic grades,
including but not limited to their pre-university (entrance examination), first-year and
second-year examinations, will be sourced from the appropriate management and other reviewed
literature. The dataset will contain n identified variables/features of about 1000
students. These variables/features will be the input variables to the proposed k-nearest
neighbour model. Six classes of the target variable, representing the five grades of a degree
programme outcome (first class, second class upper, second class lower, third class and pass)
together with fail, will be handled in this project. The methodology is subdivided into the
following stages:
i. Data Pre-processing
v. Calculation of the mean of the data points in a cluster and assignment of the centroid to the mean.
assumptions about data. It works well with numeric and nominal values. The k-nearest
neighbour is an instance-based learning method which carries out its classification based on a
similarity measure, like the Euclidean, Manhattan or Minkowski distance functions. All three
distance measures are defined for continuous variables; categorical variables call for a
different measure, such as the Hamming distance. Manhattan distance computes the absolute
differences between coordinates of pair
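$$D_{xv} = \sum_{m=1}^{n} \left| X_{im} - X_{jm} \right| \qquad (1)$$

$$D_{xv} = \left( \sum_{m=1}^{n} \left| X_{im} - X_{jm} \right|^{p} \right)^{1/p} \qquad (2)$$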
When p equals 2, equation (2) becomes the Euclidean distance. The Euclidean distance
computes the square root of the sum of squared differences between the co-ordinates of a pair of
objects. The Euclidean distance (Dxv) between two input vectors (Xi, Xj) is given by:
$$D_{xv} = \sqrt{\sum_{m=1}^{n} \left( X_{im} - X_{jm} \right)^{2}}, \quad m = 1, 2, \ldots, n \qquad (3)$$
X = {x1,x2,x3,……..,xn} (4)
V = {v1,v2,…….,vc} (5)
where X is the set of input feature value data points, V is set of centers of clusters and n is the
number of features.
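
As an illustration only, the distance measures in equations (1) to (3) can be sketched in a few lines of Python. The function names and the two example feature vectors below are hypothetical and are not part of the proposal; the sketch assumes that every feature is already numeric.

```python
from math import sqrt

def manhattan(xi, xj):
    """Equation (1): sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(xi, xj))

def minkowski(xi, xj, p):
    """Equation (2): p-th root of the sum of |difference| ** p."""
    return sum(abs(a - b) ** p for a, b in zip(xi, xj)) ** (1.0 / p)

def euclidean(xi, xj):
    """Equation (3): square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# Two hypothetical students described by three numeric features.
student_i = [3.0, 4.0, 0.0]
student_j = [0.0, 0.0, 0.0]

print(manhattan(student_i, student_j))     # 7.0
print(euclidean(student_i, student_j))     # 5.0
print(minkowski(student_i, student_j, 2))  # 5.0, since p = 2 gives the Euclidean distance
```

In practice the features would normally be rescaled to a common range first, otherwise a feature with large values (for example an entrance score out of 400) would dominate the distance.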
The Euclidean distance measure will be used in this study for the kNN classifier. For every
data point in the dataset, the Euclidean distance between an input data point and current point
is calculated. These distances are sorted in increasing order and k items with lowest distances
to the input data point are selected. The majority class among these items is found and the
classifier returns the majority class as the classification for the input point. The parameter k is
The steps of the kNN classifier algorithm are given in Listing 1:
Step3: calculate the distance between input vector X and the current point using
equation (3). The algorithm calculates the distance measurement for every data point
Step7: return the majority class as the prediction for the class of X
The first step in Listing 1 sets up the training data, which is two-thirds of the whole dataset.
Step 3 uses the similarity measure to calculate the Euclidean distance between the input data
point and every data point (instance) in the dataset.
The calculated Euclidean distances are sorted in increasing order. A predefined number, k, of
items with the lowest Euclidean distances to the input data point is taken and the majority class
amongst these items is found. This majority class is returned as the predicted class for the data
point.
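
The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation planned for the project: the function name `knn_predict`, the two features and the toy records are all hypothetical, and ties in the vote are resolved in favour of the class encountered first among the nearest neighbours.

```python
from collections import Counter
from math import sqrt

def euclidean(xi, xj):
    """Euclidean distance between two numeric feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_predict(train_X, train_y, x, k=3):
    """Return the majority class among the k training points nearest to x."""
    # Distance from x to every training point, paired with that point's label.
    distances = [(euclidean(point, x), label)
                 for point, label in zip(train_X, train_y)]
    # Sort in increasing order of distance and keep the k closest labels.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: [entrance score, first-year average] -> class label.
train_X = [[80, 75], [78, 70], [85, 82], [45, 40], [50, 48], [42, 38]]
train_y = ["pass", "pass", "pass", "fail", "fail", "fail"]

print(knn_predict(train_X, train_y, [79, 74], k=3))  # pass
print(knn_predict(train_X, train_y, [47, 43], k=3))  # fail
```

Because the sketch scans every training record for each prediction, its cost grows linearly with the size of the training set, which is acceptable for a dataset of about 1000 students.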
The performance of this study will be evaluated against other reviewed works based on true
positive, true negative, false positive, false negative rates and total accuracy of classification.
True positives are instances classified as positive which are actually positive. True negatives
are instances correctly classified as negative. False positives are instances classified as
positive which are actually negative. False negatives are instances classified as negative which
are actually positive.
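$$\mathrm{TPR} = \frac{TP}{P} \qquad (6)$$

$$\mathrm{TNR} = \frac{TN}{N} \qquad (7)$$

$$\mathrm{FPR} = \frac{FP}{N} \qquad (8)$$

$$\mathrm{FNR} = \frac{FN}{P} \qquad (9)$$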
where TPR, TNR, FPR and FNR denote the true positive, true negative, false positive and false
negative rates, P is the number of actual positive instances and N is the number of actual
negative instances. The total accuracy of classification combines all four basic counts: true
positives, true negatives, false positives and false negatives (TP, TN, FP, FN).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (10)$$
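
A short worked example may help to make the rates in equations (6) to (10) concrete. The sketch below assumes the usual definitions P = TP + FN and N = TN + FP; the function name and the four counts are hypothetical and do not come from any experiment.

```python
def classification_rates(tp, tn, fp, fn):
    """Compute the rates of equations (6)-(10) from the four basic counts."""
    p = tp + fn  # number of actual positive instances (assumed definition of P)
    n = tn + fp  # number of actual negative instances (assumed definition of N)
    return {
        "TPR": tp / p,
        "TNR": tn / n,
        "FPR": fp / n,
        "FNR": fn / p,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a binary pass/fail classifier on 100 students.
rates = classification_rates(tp=40, tn=45, fp=5, fn=10)
print(rates["TPR"])       # 0.8
print(rates["FNR"])       # 0.2
print(rates["accuracy"])  # 0.85
```

Note that TPR and FNR always sum to 1, as do TNR and FPR, so reporting one rate of each pair is sufficient.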
This project will establish an adequate analysis of students’ performance prediction model
using k-nearest neighbour which will serve as a veritable framework for educational