
STUDENT PERFORMANCE PREDICTION USING K-NEAREST

NEIGHBOUR ANALYSIS.

(CASE STUDY OF AL-HIKMAH UNIVERSITY, ILORIN)

PROJECT PROPOSAL on

K-NEAREST NEIGHBOUR ANALYSIS FOR STUDENT PERFORMANCE (CASE

STUDY OF UNIVERSITY OF ILORIN)

1.1 INTRODUCTION

Academic performance is the extent to which a student, lecturer (tutor), or institution has

achieved their short or long term educational goals. Grades in the form of cumulative grade

point average (GPA) and completion of educational degrees such as High School and

bachelor’s degrees represent academic achievement. This academic achievement is mostly

measured with the help of examinations and continuous assessments. Many of the factors that affect students' performance at the end of a degree programme are already present prior to university admission. Consequently, these individual factors can be used to predict students' academic performance at the point of entrance. Factors like entrance grades and demographic variables are the most widely

analysed in related literature. Kabakchieva (2012) analysed about twenty (20) factors that

affect student performance in university degree examinations, including gender, birth year,

birth place, living place and country, type of previous education, profile and place of previous

education, total score from previous education, university admittance exam and achieved

score, total university score at the end of the first year, etc. Jia and Mareboyana (2013)

identified total credit hours (TCH) taken as one of the main factors that affect students’

retention ability. Acharya and Sinha (2014) also analysed a number of factors that affect
student performance which include gender, caste, religion, family size, board, state of origin,

family income, board marks, study hours per day, attendance, mid semester score, student’s

medium of study, type of secondary school attended and private tuition. This area of research studies the various factors that affect the performance of students in an institution.

Knowledge Discovery in Databases (KDD) is the automatic, exploratory analysis and modelling of large data sources. It is the broad process of finding valuable knowledge in data and emphasizes the high-level application of particular data mining methods. The central goal of the KDD process is to extract knowledge from data in the context of large databases; this is done using data mining methods. Katare and Dubey (2017) described data mining as the foundation of the KDD process, involving the application of algorithms that search the data, build models and uncover previously unknown patterns. The KDD process is cyclic and interactive, and consists of nine steps. Data mining refers to the set of computational methods that extract important patterns from raw data. In addition, the KDD process is concerned with handling very large amounts of data, scaling algorithms for better performance, appropriate analysis of the retrieved information, and human interaction with the overall process.

Educational Data Mining (EDM) is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and with using those methods to better understand students and the settings in which they are taught (Katare and Dubey, 2017). On one hand, the growth of both instructional educational software and state databases of student information has created large repositories of data reflecting how students learn. On the other hand, the use of the Internet in education has created a new context, known as e-learning or web-based education, in which large amounts of information about teaching–learning interactions are continuously generated and made widely available. EDM uses these data repositories to better understand learners and learning, and to develop computational approaches that combine data and theory in order to improve practice for the benefit of learners.

The ability to monitor the progress of students' academic performance is a critical issue for institutions of higher learning. This proposal describes a scheme for analysing students' outcomes that is based on classification and uses standard statistical algorithms to group their score data according to the level of their performance. The major objective of this project is to use classification algorithms (data mining methods) to analyse and predict students' performance based on a set of parameters. Large databases in educational environments contain hidden information which, when the dataset is subjected to an appropriate data mining technique, can be used to predict student performance and thus provide information for future improvements.

Many techniques have been applied to predicting students' performance in institutions using qualitative and quantitative factors. Data mining techniques that have been applied to this educational data mining problem include rule learners, decision trees, neural networks and k-nearest neighbour (Kabakchieva, 2012), as well as logistic regression, naive Bayes and SVM (Jia and Mareboyana, 2013). Agrawal and Mavani (2015) pointed out that existing studies show that students' academic performance depends primarily on their past performance, and results from their investigation confirm that past performance indeed has a significant influence on current performance. A number of related works on students' performance prediction using data mining techniques are reviewed in the next section.

1.2 REVIEW OF RELATED WORKS

In (Tanner and Toivonen, 2010), “Predicting and preventing student failure–using the k-

nearest neighbour method to predict student performance in an online course environment” is

presented. The paper is spurred by the need to study the problem of predicting student

performance in an online course. The objective of this work is to identify at an early stage of
the course those students who have a high risk of failing. The work employs the k-nearest

neighbour method (KNN) and its many variants in solving student performance prediction

problem. The work presents extensive experimental results from a 12-lesson course on touch-

typing, with a database of close to 15000 students. The specific target environment of the

study is a tutored online course for touch-typing, consisting of 12 lessons and a final test. The

goal is to predict the success or failure of students automatically based on their performance during the first lessons; a tool could then alert teachers at an early stage to students

who are at risk of failing the course or dropping out from it. The results indicate that KNN

can predict student performance accurately, even after only the very first lessons. The authors conclude that early tests of skills can be strong predictors of final scores in other skill-based courses as well. The authors suggested that selected methods described in the paper

will be implemented as an early warning feature for teachers of the touch-typing course, so

they can quickly focus their attention on the students who need help the most. However, the

study did not incorporate demographic attributes in the feature selection.

Yadav and Pal (2012) presented a paper on “Data mining: A prediction for performance

improvement of engineering students using classification.” The main motivation of this paper

is the idea that classification methods such as decision trees and Bayesian networks can be applied

on the educational data for predicting the student’s performance in examination. This paper

carries out prediction for performance improvement of engineering students using the C4.5,

ID3 and CART decision tree algorithms. Decision tree algorithms were applied to the sourced

data. The data set used in this study was obtained from VBS Purvanchal University. The data

was collected through the enrolment form filled by the student at the time of admission,

which includes: demographic data, past performance data, address and contact number. Most

of the attributes reveal the past performance of the students. A few derived variables were

selected. The work was implemented in WEKA and 10-fold cross validation was used in
evaluation of the models. The outcome of the decision tree predicted the number of students

who are likely to pass, fail or be promoted to the next year. The results provide steps to improve the performance of the students who were predicted to fail or be promoted. This work also identifies different factors which affect a student's learning behaviour and performance during

academic career. Frequently used decision tree classifiers are studied and the experiments are

conducted to find the best classifier for predicting students' performance in the first-year engineering examination. From the classifiers' accuracy it is clear that the true positive rate of the

model for the FAIL class is 0.786 for ID3 and C4.5 decision trees. This work can be extended

to investigate other data mining techniques with multiple class classification.

In (Kabakchieva, 2012), performance prediction by using data mining classification

algorithms is presented. The rationale behind the research work described in this paper is

based on the great potential that is seen in using data mining methods and techniques for

effective usage of university data. The objective is the development of data mining models

for predicting student performance, based on their personal, pre-university and university-

performance characteristics. The dataset includes data about students admitted to the

university in three consecutive years. The dataset file contains data about 10330 students that

have been enrolled as university students during the period between 2007 and 2009,

described by 20 parameters, including gender, birth year, birth place, living place and

country, type of previous education, profile and place of previous education, total score from

previous education, university admittance exam and achieved score, total university score at

the end of the first year, etc. The research work focuses on the development of data mining

models for predicting student performance by using four data mining algorithms for

classification – a Rule Learner, a Decision tree algorithm, a Neural network, and a K-Nearest

Neighbour method. It is a continuation of previous research, carried out with the same dataset

and with similar data mining algorithms, but for a different format of the predicted target
variable. The achieved results from the performed research, using a target variable with five

distinct values – Bad, Average, Good, Very Good, and Excellent, are previously published in

(Kabakchieva et al, 2011). In this current paper, the study is implemented for a binary target

variable. The work is implemented in WEKA. The classification models, generated by

applying the selected four data mining algorithms – OneR Rule Learner, Decision Tree,

Neural Network and K-Nearest Neighbour, on the available and carefully pre-processed

student data, reveal classification accuracy between 67.46% and 73.59%. A notable

contribution of this research is that the highest accuracy is achieved for the Neural Network

model (73.59%), followed by the Decision Tree model (72.74%) and the k-NN model

(70.49%). The Neural Network model predicts with higher accuracy the “Strong” class, while

the other three models perform better for the “Weak” class. The data attributes related to the

students’ University Admission Score and Number of Failures at the first-year university

exams are among the factors that most influence the classification process. The work was not

evaluated with other previous works cited in the review.

Jia and Mareboyana (2013) present “Machine Learning Algorithms and Predictive Models

for Undergraduate Student Retention.” The authors note that the goal of a decision tree learning algorithm is to produce a resulting tree that is as small as possible, as stated in (Mitchell, 1997). The paper studies Historically Black Colleges and Universities (HBCU) undergraduate student retention and explores the effectiveness of machine learning techniques in determining the factors that influence student retention at an HBCU and in creating a retention predictive model. The study used data queried from the Campus Solution database. The six-

year training data set size is 771 instances with 12 attributes. The last attribute is the target

attribute/variable holding binary values of yes or no. Decision tree, Logistic regression, Naive

Bayes, SVM and neural network (2-layer) are applied to the data. In this paper, the authors

have presented some results of undergraduate student retention using machine learning
algorithms classifying the student data. They made some improvements to the classification

algorithms such as Decision tree, Support Vector Machines (SVM), and neural networks

supported by the Weka software toolkit. The experiments revealed that the main factors that

influence student retention in the Historically Black Colleges and Universities (HBCU) are

the cumulative grade point average (GPA) and total credit hours (TCH) taken. There is a need to extend the implementation using other machine learning tools such as Python and also to evaluate the work against related works.

In (Acharya and Sinha, 2014), a paper on early prediction of students' performance using machine learning techniques is presented. In recent years, Educational Data Mining (EDM)

has emerged as a new field of research due to the development of several statistical

approaches to explore data in educational context. One such application of EDM is early

prediction of student results. The objectives of this work are as follows: (i) To identify the

attributes and examination pattern of a set of students majoring in Computer Science in some

undergraduate colleges in Kolkata; (ii) To perform feature selection on this data set to

determine the optimal number of features; these features would then be used for

classification; (iii) To apply machine learning algorithms (MLAs) to this reduced feature set to identify the students who

may perform exceedingly well and more importantly those who may perform poorly; and (iv)

To perform a parametric comparison with the contemporary literature discussed. The data set

for this study is derived from a set of students majoring in Computer Science in some

undergraduate colleges in Kolkata. The training set contains 309 instances whereas the

testing set contains 104 instances. The dataset attributes used in this study include gender,

caste, religion, family size, board, state of origin, family income, board marks, study hours

per day, attendance, mid semester score, student’s medium of study, type of secondary school

attended, private tuition and finally the grade (the target variable). These attributes are used to

predict student grades as a 7-class problem in the semester-end examination: in particular, grade "O" indicates that the student is outstanding and may be awarded a scholarship, whereas grade "F" indicates that the student is performing poorly and hence may need remedial attention. The work

experimented on Decision Trees (DTs), Bayesian Networks, Neural Networks and Support

Vector Machines. The work is implemented in WEKA. Results after detailed analysis showed

that DTs are the most convenient algorithm to generate the set of production rules.

Accordingly, C4.5 was used to generate the decision tree. F-Measure and Kappa Statistic

were used to determine the efficiency of the prediction algorithm. Average F-Measure value

for the training dataset was found to be 0.79 whereas for the testing dataset was found to be

0.66. The latter value seems to be slightly low, perhaps due to the fact that the number of

testing instances is only 104. The efficiency of prediction can be increased with other

methods/techniques and bigger datasets.

Asif et al (2014), present “Predicting student academic performance at degree level: a case

study.” This paper is motivated by the two following questions: “Can we predict the

performance of students at the end of their degree in 4th year with a reasonable accuracy

using only their marks in High School Certificate (HSC), and in first and second year courses

at university, no socio-economic data, and utilizing one cohort to build the model and the

next cohort to test it?” and “Can we identify those courses in first and second years which are

effective predictors of students' performance at the end of the degree?” This paper presents a

case study on predicting performance of students at the end of a university degree at an early

stage of the degree program. The data of four academic cohorts comprising 347

undergraduate students are mined with different classifiers namely decision trees, neural

networks and naive Bayes. In this study, the performance of a student at the end of the degree is classified as A, B, C, D or E, representing the interval in which her/his final mark lies.

Intervals allow for differentiating between strong and weak students. The study differs from

other works in three aspects. First, using the conclusions of others, it limits the variables to
predict performance to marks only, no socio-economic data. Second, it takes one cohort to

build a model and the next cohort to evaluate it, thus allowing for some measurement of how

well findings generalize from one cohort to the next one. Third it is a longitudinal study as

four cohorts of students have been considered. The study shows that it is possible to predict

the graduation performance in 4th year at university using only pre-university marks and

marks of 1st and 2nd year courses, no socio-economic or demographic features, with a

reasonable accuracy and that the model established for one cohort generalizes to the

following cohort. Naïve Bayes showed an accuracy of 83.65% on Dataset II, which is better

than the one obtained in related works that have used socio-economic or demographic

features and pre-university mark. This suggests that including marks obtained in the first

semesters or year of university is important to obtain a reasonable accuracy. Also, the

investigation of the second research question led to the identification of 5 courses (2 first-year courses and 3 second-year courses) for predicting graduation performance. Considering only this set of 5 courses, instead of the pre-university subjects and all first- and second-year courses, tends to increase the accuracy of the decision trees. However, the five selected courses do not lead to

a better accuracy with the naive Bayes and 1-NN classifiers, which gave the best accuracy in

the first place. Even if these five courses make it possible to find sensible courses that are indicators of poor or strong performance in the first and second years, more research is needed to understand these limited findings. A future work is to study the progression of students during their 4 years of bachelor study and to investigate whether typical developments can be identified.

In (Abdullah et al, 2014), “Students' Performance Prediction System using Multi Agent Data

Mining Technique” is presented. The authors reaffirm the idea that high prediction accuracy of students' performance is helpful for identifying low-performing students at the beginning of the learning process, and boosting techniques had been identified as a way to achieve high prediction accuracy. In this paper, a students' performance prediction system using Multi Agent Data Mining is proposed to predict the performance of the students, based on their data, with high prediction accuracy and to provide help to the low-performing students through an optimization rule. The work builds a Students' Performance Prediction System that is able to predict the students' performance based on their academic results with high accuracy.

The algorithms experimented include C4.5, SAMME, ADABOOST M1, LOGITBOOST

Boosting Techniques. Students’ data were collected manually from EMES e-learning system

of the computer skills (CPIT100) course, KAU, Jeddah, Saudi Arabia of session 2012-13.

The experiment was conducted using the EMES e-learning system dataset. The dataset

contains 155 students. The dataset has 9 attributes, which include student name, ID, number of solved quizzes, number of correct quiz answers, number of submitted assignments, number of correct answers across all assignments, total login hours the student spent in the e-learning system, number of pages the student read and, finally, the student's final grade class (which is the target class). The system is implemented in JAVA. The

experiment measures classification accuracy, performance metrics, the classification error curve, the ROC curve and the computational time to complete the classification tasks. The proposed system has been implemented and evaluated by investigating the prediction accuracy of the AdaBoost.M1 and LogitBoost ensemble classifier methods and the C4.5 single-classifier method. The

results show that using SAMME Boosting technique improves the prediction accuracy and

outperformed the C4.5 single classifier and LogitBoost. The paper applied the Multi Agent Data Mining technique to the EMES e-learning system for students' performance prediction and optimization. There is a need to improve the true positive rate of the low class.

In (Agrawal and Mavani, 2015), “Student Performance Prediction using Machine Learning” is presented. One of the motivations of this study is to identify high-risk students, as well as

to identify features which affect the performance of students. In this paper, a neural network

model is proposed to predict the performance of students in an academic organization. The


marks of 80 B.E. I.T (Bachelor of Engineering, Information Technology) students from

semester 3 to semester 6 were used in the study. The algorithm is trained on a training set of

60 students, and tested on a cross-validation set of 10 students, to predict marks in 6 subjects.

This was done 7 times, varying the training and test sets each time (k-fold cross validation).

Also, the importance of several different features is considered, in order to determine which

of these are correlated with student performance. From the results, it is found that the second most influential variable for students' performance is their living location, and the third is the medium of teaching. The highest accuracy

reported by this model is 70%. This can be improved upon by utilizing more features and

increasing the data sample size. Also, this work was not evaluated against other works.

Vidyavathi (2015) presents “Predicting Academic Performance Of Students Using Data

Mining Technique.” The authors identify that one way to achieve the highest level of quality in an education system is to discover knowledge from educational data and to study the main attributes that may affect students' performance. The discovered knowledge can be used to offer helpful and constructive recommendations to academic planners in educational institutes to enhance their decision-making process. In this work, data mining techniques are

used to predict student academic performance. This work investigates the educational domain

of data mining using the graduate student’s data collected from the various sources. The

work applies a feature extraction technique to pre-process the data, and a decision tree classifier is used to classify the data. The proposed system concentrates on the development of a tool for student performance prediction. By pre-processing the data and applying a suitable data mining method, knowledge is discovered. The discovered knowledge is used to provide constructive recommendations to overcome the problem of low grades among graduate students and to improve students' academic performance as early as possible. There was no reported evaluation of the model's performance.


“Result Prediction Using K-Nearest Neighbor Algorithm for Student Performance Improvement” is presented in (Muralidharan et al, 2017). This paper aims to enhance an educational institution's ranking, which is greatly influenced by university results. These results can be predicted beforehand and used in subsequent planning. This paper mainly

focuses on the prediction of a student’s university result by making use of different attributes

utilizing k-nearest neighbour technique. The attributes used in this study are either

quantitative or qualitative type. The quantitative attributes used are Internal Assessments,

Attendance percentage, Number of On-Duties taken and Overall Assignments completed.

The qualitative attributes include Subject feedback, Faculty feedback, and whether the

student is a Day Scholar/Hosteller. Here, k-Nearest Neighbour algorithm is applied against

the historical data of students for more accurate prediction of results. This algorithm makes

use of the Euclidean distance formula to find the nearest record. Though there have been different classification algorithms for predicting the results of students, no algorithm had been applied to real-time educational datasets. In this work, the results of some students belonging to an educational institution have been computed and the results are shown. This

research is limited by the small data sample used in the study.

Katare and Dubey (2017) present a study of various techniques for predicting student

performance under educational data mining. The motivation for this paper is that performance prediction is very useful for improving the teaching and learning processes of educators and learners. This survey mostly focuses on classification algorithms and their utilization for

evaluating student performance. It explores the application of various data mining techniques to the educational data mining domain in order to predict student performance. The paper discusses related work on evaluating the performance of students using different analytical methods, and it describes the techniques applied to datasets to find hidden information and patterns in databases from educational environments. In EDM, the frequently
used prediction technique is classification. Under classification different methods like

Decision tree, Neural network, k-nearest neighbor, SVM, Bayes algorithm are used to predict

student performance. The performance of all these methods is evaluated based on the

parameters like accuracy, precision and recall. The study is limited to the applicability of only five data mining techniques to EDM. Also, the techniques explored were not evaluated.

1.2.1 K-Nearest Neighbour technique

One major motivation for adopting a similarity distance measure methodology in predicting students' performance is the simplicity of the distance calculation. The k-nearest neighbours algorithm is amongst the simplest of all machine learning

algorithms. An object is classified by a majority vote of its neighbours, with the object being assigned

to the class most common amongst its k nearest neighbours. k is a positive integer, typically small. If

k = 1, then the object is simply assigned to the class of its nearest neighbour. In binary (two class)

classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. The

same method can be used for regression, by simply assigning the property value for the object to be

the average of the values of its k nearest neighbours. It can be useful to weight the contributions of the

neighbours, so that the nearer neighbours contribute more to the average than the more distant ones.

The neighbours are taken from a set of objects for which the correct classification (or, in the case of

regression, the value of the property) is known. This can be thought of as the training set for the

algorithm, though no explicit training step is required. In order to identify neighbours, the objects are

represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean

distance, though other distance measures, such as the Manhattan distance could in principle be used

instead. The k-nearest neighbour algorithm is sensitive to the local structure of the data. It is also

insensitive to outliers and makes no assumptions about data. It works well with numeric and

nominal values. kNN is an example of instance-based learning, where there is need to have

instances of data close at hand to perform the machine learning algorithm.
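As a brief illustration of the voting scheme described above, the snippet below is a minimal sketch using the scikit-learn library in Python (an assumed tool, not one mandated by this proposal); the feature values and class labels are hypothetical placeholders for student records.

# Minimal sketch of k-NN majority voting, assuming scikit-learn is available.
# Feature values and class labels below are hypothetical placeholders.
from sklearn.neighbors import KNeighborsClassifier

# Each row: [entrance score, year-one average, year-two average] (illustrative only)
X_train = [[62, 58, 61], [45, 40, 42], [70, 74, 78], [50, 48, 47]]
y_train = ["second class upper", "pass", "first class", "second class lower"]

# k = 3 neighbours, Euclidean distance; weights="distance" makes nearer
# neighbours contribute more to the vote, as described above
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)

# A new record is assigned the class most common amongst its 3 nearest neighbours
print(knn.predict([[58, 55, 57]]))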

1.3 MOTIVATION OF RESEARCH PROJECT


One way to achieve the highest level of quality in an education system is to discover knowledge from educational data and to study the main attributes that may affect students' performance. The discovered knowledge can be used to offer helpful and constructive recommendations to academic planners in educational institutes to enhance their decision-making process, to improve students' academic performance and trim down the failure rate, and to better understand students' behaviour. Abdullah et al (2014) identified the prediction of students' academic performance as one of the educational problems that can be solved with data mining. Prediction of students' performance is beneficial for identifying students with low academic performance. Student retention is an indicator of academic performance and enrolment

management of the university. The ability to predict a student’s performance is very

important in educational environments. Students' academic performance depends on different factors such as social, personal, psychological and other environmental variables. A very promising tool for achieving this objective is data mining. Data mining techniques are used to discover hidden patterns and relationships in data, which is very helpful in decision-making. Currently, educational databases are growing rapidly because of the large amount of data stored in them.

Against this backdrop, this research is motivated by the need to:

a) extend the work in (Asif et al, 2014) by using pre-university, first-year and second-year course performance in students' performance prediction; and

b) address the limited data size reported in (Muralidharan et al, 2017).

1.4 OBJECTIVES OF PROPOSED PROJECT

The specific objectives of this work are to:

a) develop a k-nearest neighbour model for predicting students’ academic performance;

and
b) implement the model in (a) above using prior and present students’ university

performance and other demographic attributes.

1.5 PROPOSED METHODOLOGY OF PROJECT

A study of students' performance prediction and educational data mining (EDM) will be carried out. Techniques used in the prediction of students' performance will be analysed, with emphasis on recent data mining techniques such as k-nearest neighbour. A dataset of computer science students of the University of Ilorin describing their demographic features and academic grades, including but not limited to their pre-university (entrance examination), first-year and second-year examinations, will be sourced from the appropriate university management and other reviewed literature. The dataset will contain n identified variables/features for about 1000 students. These variables/features will be the input variables to the proposed k-nearest neighbour model. The target variable will take the five grades of a degree programme outcome (first class, second class upper, second class lower, third class and pass) together with fail as its classes in this project. The methodology is subdivided into the following stages:

i. Data pre-processing
ii. Selection of the parameter k (the number of neighbours)
iii. Calculation of the distance between the input data point and every training data point
iv. Selection of the k training points with the lowest distances to the input data point
v. Assignment of the input data point to the majority class among its k nearest neighbours.

KNN algorithm is a supervised learning algorithm. It is insensitive to outliers and makes no

assumptions about the data. It works well with numeric and nominal values. The k-nearest neighbour is an instance-based learner which carries out its classification based on a similarity measure, such as the Euclidean, Manhattan or Minkowski distance functions. The first two distance measures work well with continuous variables, while for categorical variables a measure such as the Hamming distance is more suitable. Manhattan distance computes the sum of the absolute differences between the coordinates of a pair of objects, while Minkowski distance is a generalized metric distance. The Manhattan and Minkowski distance functions are shown in equations (1) and (2) respectively.

D_{xv} = \sum_{m=1}^{n} \left| X_{im} - X_{jm} \right|        (1)

D_{xv} = \left( \sum_{m=1}^{n} \left| X_{im} - X_{jm} \right|^{p} \right)^{1/p}        (2)

When p equals 2, equation (2) becomes the Euclidean distance, given in equation (3). The Euclidean distance computes the square root of the sum of squared differences between the coordinates of a pair of objects. The Euclidean distance (D_{xv}) between two input vectors (X_i, X_j) is given by:

D_{xv} = \sqrt{\sum_{m=1}^{n} \left( X_{im} - X_{jm} \right)^{2}}        (3)

For a given dataset,

X = \{x_1, x_2, x_3, \ldots, x_n\}        (4)

V = \{v_1, v_2, \ldots, v_c\}        (5)

where X is the set of input feature data points, V is the set of cluster centres and n is the number of features.
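For concreteness, equations (1) to (3) can be written directly in code; the sketch below is plain Python, with x_i and x_j standing for two feature vectors of equal length (illustrative names, not part of the proposed system).

# Sketch of the distance measures in equations (1)-(3) for two equal-length
# feature vectors x_i and x_j (plain Python, no external dependencies).

def manhattan(x_i, x_j):
    # Equation (1): sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

def minkowski(x_i, x_j, p):
    # Equation (2): generalized metric; p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

def euclidean(x_i, x_j):
    # Equation (3): square root of the sum of squared coordinate differences
    return sum((a - b) ** 2 for a, b in zip(x_i, x_j)) ** 0.5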

The Euclidean distance measure will be used in this study for the kNN classifier. For every

data point in the dataset, the Euclidean distance between the input data point and the current point is calculated. These distances are sorted in increasing order and the k items with the lowest distances to the input data point are selected. The majority class among these items is found and the classifier returns it as the classification for the input point. The parameter k is tuned over k = 1, 3, 5, ..., o (where o is an odd number) to achieve optimal performance.
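One simple way to carry out this tuning is to score each candidate odd value of k by cross-validation and keep the best; the sketch below assumes scikit-learn and a hypothetical feature matrix X and label vector y already prepared from the student dataset.

# Sketch of tuning k over odd values with cross-validation, assuming scikit-learn
# and an already prepared feature matrix X and label vector y.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_odd_k(X, y, max_k=15, folds=10):
    best_k, best_score = 1, 0.0
    for k in range(1, max_k + 1, 2):  # k = 1, 3, 5, ..., max_k
        knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        score = cross_val_score(knn, X, y, cv=folds).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score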

The following steps of the kNN classifier algorithm are given in Listing 1:

Listing 1: k-NN Algorithm


Step1: Set up the training data and associated labels

Step2: For every point in our dataset:

Step3: calculate the distance between the input vector X and the current point using equation (3)

Step4: sort the distances in increasing order

Step5: take k items with lowest distances to input vector X

Step6: find the majority class among these items

Step7: return the majority class as the prediction for the class of X

The first step in Listing 1 sets up the training data, which is two-thirds of the whole dataset. Step 3 uses the similarity measure to calculate the Euclidean distance between the input vector and every data point (instance) in the dataset. The calculated Euclidean distances are sorted in increasing order. A predefined number, k, of items with the lowest Euclidean distances to the input data point is taken and the majority class amongst these items is found. This majority class is returned as the predicted class for the data point.
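The steps of Listing 1 can be rendered directly in code; the sketch below is a plain-Python illustration (Euclidean distance, sort, take k, majority vote) and assumes the two-thirds training split described above has already produced train_points and train_labels.

# Plain-Python sketch of Listing 1: classify one input vector against a labelled
# training set using Euclidean distance (equation 3) and majority voting.
from collections import Counter
from math import sqrt

def knn_classify(x_query, train_points, train_labels, k=3):
    # Steps 2-3: Euclidean distance from the input vector to every training point
    distances = []
    for point, label in zip(train_points, train_labels):
        d = sqrt(sum((a - b) ** 2 for a, b in zip(x_query, point)))
        distances.append((d, label))
    # Step 4: sort the distances in increasing order
    distances.sort(key=lambda pair: pair[0])
    # Step 5: take the k items with the lowest distances
    nearest_labels = [label for _, label in distances[:k]]
    # Steps 6-7: return the majority class among the k nearest neighbours
    return Counter(nearest_labels).most_common(1)[0][0]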

The performance of this study will be evaluated against other reviewed works based on true

positive, true negative, false positive, false negative rates and total accuracy of classification.

True positives are instances classified as positive which are actually positive. True negatives are instances correctly classified as negative. False positives are instances classified as positive but which are actually negative. False negatives are instances classified as negative but which are actually positive.

TPR = \frac{TP}{P}        (6)

TNR = \frac{TN}{N}        (7)

FPR = \frac{FP}{N}        (8)

FNR = \frac{FN}{P}        (9)

where TPR, TNR, FPR and FNR denote the true positive, true negative, false positive and false negative rates respectively, P is the total number of actual positive instances and N is the total number of actual negative instances. The total accuracy of classification combines all four basic counts: true positives, true negatives, false positives and false negatives (TP, TN, FP, FN).

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}        (10)

1.6 EXPECTED CONTRIBUTION TO KNOWLEDGE

This project will establish a students' performance prediction model using k-nearest neighbour analysis, which will serve as a veritable framework for educational administrators, university admission officers, as well as guidance and counselling personnel of institutions, for future improvements and sustainable academic curriculum design.
