Introduction
A breach of information security can affect not only a single user's work but also the economic development of companies, and even the national security of a country. One focus of breach research is unauthorised access attacks on computers, which were the second greatest source of financial loss according to the 2006 CSI/FBI Computer Crime and Security Survey. Attacks on computer systems can be undertaken at the network, system and user levels.
Most information security research undertaken in recent years is concerned with system- and network-level attacks; there is a lack of research on attacks at the user level. User-level attacks include an impostor or intruder taking over from the valid user, either at the start of a computer session or during the session. Depending on the risks in a particular environment, a single, initial authentication might be insufficient to guarantee security. It may also be necessary to perform continuous authentication to prevent user substitution after the initial authentication step. The impact of an intruder taking over during a session is the same as that of any false representation at the beginning of a session. Most current computer systems authorise the user at the start of a session and do not detect whether the current user is still the initial authorised user, a substitute user, or an intruder pretending to be a valid user. Therefore, a system that continuously checks the identity of the user throughout the session is necessary. Such a system is called a continuous authentication system.
The majority of existing continuous authentication systems are built around biometrics. These continuous biometric authentication systems (CBAS) are supplied by user traits and characteristics. There are two major forms of biometrics: those based on physiological attributes and those based on behavioural characteristics. The physiological type includes biometrics based on stable body traits, such as fingerprint, face, iris and hand, and these are considered to be more robust and secure. However, they are also considered to be more intrusive and expensive, and they require regular equipment replacement [86]. On the other hand, behavioural biometrics include learned movements such as handwritten signatures, keyboard dynamics (typing), mouse movements, gait and speech. Collecting these biometrics is less obtrusive and they do not require extra hardware.
Recently, keystroke dynamics has gained popularity as one of the main sources of behavioural biometrics for providing continuous user authentication. Keystroke dynamics is appealing for many reasons; in particular, keystroke data remains available after the authentication step at the start of the computer session.
1.1 Motivation
This thesis focuses on developing automatic analysis techniques for continuous
user authentication systems (that are based on keystroke dynamics) with the goal
of detecting an impostor or intruder that may take over a valid user session. The
main motivation of this research is that we need a flexible system that can
authenticate users and must not be dependent on a pre-defined typing model of a
user. This research is motivated by:
the need for new feature selection techniques that represent user typing behaviour, guaranteeing that frequently typed features are selected and that the selected features inherently reflect user typing behaviour.
3. To discover whether a pre-defined typing model of a user is necessary for
successful authentication.
4. To minimise the delay for an automatic CBAS to detect intruders.
typing samples in a user session belong to the same user? In order to address this sub-question, we examined four different variables that influence the accuracy of the threshold and directly affect the user samples used for authentication: distance type, number of keystrokes, feature type and number of features.
3. Can we automatically detect an impostor who takes over from a valid user during a computer session, and determine the amount of typing data needed for the system to detect the impostor? To answer this sub-question, we first need to answer questions 1 and 2, and use the answers to propose the automated system. For automated detection, a sliding window mechanism is used and the optimum size of the window is determined.
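The sliding-window mechanism can be sketched as follows. This is an illustrative simplification, not the system actually built in this thesis: the window size, threshold, and the toy mean-latency anomaly score are all assumptions made for the example.

```python
from collections import deque

def continuous_monitor(keystrokes, window_size, score_fn, threshold):
    """Slide a fixed-size window over a keystroke stream and return the
    index at which the window's anomaly score first exceeds the
    threshold, or None if no impostor is flagged."""
    window = deque(maxlen=window_size)
    for i, ks in enumerate(keystrokes):
        window.append(ks)
        if len(window) == window_size and score_fn(list(window)) > threshold:
            return i  # earliest detection point
    return None

# Toy usage: latencies jump from ~100 ms (valid user) to ~400 ms (impostor);
# the mean-latency score crosses the threshold shortly after the takeover.
stream = [100] * 50 + [400] * 50
alarm_at = continuous_monitor(stream, 10, lambda w: sum(w) / len(w), 250)
```

The window size trades off detection delay against false alarms, which is exactly why its optimum value must be determined experimentally.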
n-graphs that are typed with consistent time. The technique computes the standard deviation of the n-graphs, representing the variance from their average typing time, and then selects the n-graphs having the least variance.
that are typed with noticeably different times. The technique computes the standard deviation of the n-graphs among all users, representing the variance from their average typing time, and then selects the n-graphs having the largest variance.
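The consistent-time selection technique described above can be sketched as follows; the dictionary-of-timings input format and the function name are assumptions made for illustration.

```python
import statistics

def select_consistent_ngraphs(timings, k):
    """timings maps each n-graph to its observed typing times (ms).
    Compute the standard deviation of each n-graph's times (its variance
    from the average typing time) and keep the k least-variable ones."""
    spread = {g: statistics.pstdev(t) for g, t in timings.items() if len(t) > 1}
    return sorted(spread, key=spread.get)[:k]

# Example: "he" is typed most consistently, "th" next, "qu" varies widely.
observed = {"th": [100, 102, 98], "qu": [100, 300, 200], "he": [150, 150, 150]}
consistent = select_consistent_ngraphs(observed, 2)
```

Selecting across all users instead (the second technique) would compute the spread of each n-graph over the whole population and keep the largest-variance n-graphs, since those best separate users.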
The research results were published in:
Alsolami, Eesa, Boyd, Colin, Clark, Andrew and Ahmed, Irfan. User-representative feature selection for keystroke dynamics. In Sabrina De Capitani di Vimercati and Pierangela Samarati, editors, International Conference on Network and System Security, Università degli Studi di Milano, Milan, 2011.
3. A proposed user-independent threshold approach that can distinguish a user accurately without needing any predefined user typing model a priori. The threshold can be fixed across a whole set of users in order to authenticate users without requiring a pre-defined typing model for each user. The research results were published in:
Alsolami, Eesa, Boyd, Colin, Ahmed, Irfan, Nayak, Richi, and Marrington, Andrew. User-independent threshold for continuous user authentication based on keystroke dynamics. The Seventh International Conference on Internet Monitoring and Protection, May 27 - June 1, 2012, Stuttgart, Germany.
4. The design of an automatic system that is capable of authenticating users
Chapter 2
Background
The goal of this thesis, as described in chapter 1, is to design and develop techniques for detecting an impostor who may take over from the authenticated user during a computer session, using keystroke dynamics. This chapter provides an overview of the authentication concept and different types of authentication methods, focusing on typist authentication. The chapter also gives an overview of current anomaly detection techniques that can be applied to our research problem, with emphasis on the related techniques used in this thesis.
This chapter is organized as follows. Section 2.1 provides an overview of authentication methods. Section 2.2 discusses in detail the current typist authentication schemes, including static and continuous typist authentication. Section 2.3 provides an overview of current anomaly detection techniques. Section 2.4 presents previous research related to the work described in chapters 5 to 7. Later, in section 2.5, the research challenges associated with the analysis of keystroke dynamics for continuous user authentication are discussed. Finally, the chapter is summarized in section 2.6.
the object factors: Something the user has (e.g., ID card, security token, smart card, phone, or cell phone)
the knowledge factors: Something the user knows (e.g., a password, pass phrase, or personal identification number (PIN) and digital signature)
the inherent factors: Something the user is or does (e.g., fingerprint, retinal pattern, DNA sequence (there are assorted definitions of what is sufficient), signature, face, voice, unique bio-electric signals, or other biometric identifier).
As mentioned at the start of this section, the third class of positive identification factors is the inherent factors. Biometrics are based on inherent factors, and since our research focuses on biometric authentication, we limit our discussion to biometric authentication in the next sub-section.
In the next section, we describe in detail the typist authentication as one of the
behavioural biometric types.
The digraph time is the time interval between two subsequent key-presses; more generally, the n-graph time is the interval between the first and the last of n subsequent key-presses. It is sometimes called keystroke latency or inter-keystroke interval.
The keystroke duration is the time between the key-press and key-release
for a single key. This is sometimes known as the key-down time, dwell time
or hold time.
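The two timing measures above can be extracted from raw key events as sketched below; the `(key, press_ms, release_ms)` event format is an assumed representation of keyboard logger output.

```python
def timing_features(events):
    """events: list of (key, press_ms, release_ms) in typing order.
    Returns keystroke durations (dwell/hold times) and digraph times
    (press-to-press intervals, i.e. keystroke latencies)."""
    dwell = [(key, release - press) for key, press, release in events]
    latency = [(a[0] + b[0], b[1] - a[1])  # digraph label, press-to-press gap
               for a, b in zip(events, events[1:])]
    return dwell, latency

# "t" is held for 80 ms; 120 ms elapses between pressing "t" and pressing "h".
dwell, latency = timing_features([("t", 0, 80), ("h", 120, 190), ("e", 260, 330)])
```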
There are two main types of keystroke analysis, keystroke static and keystroke
dynamic (or continuous) analysis. Static keystroke analysis means that the
analysis is performed on the same predefined text for all the individuals under
observation. Most of the literature on keystroke analysis falls within this
category. The intended application of static analysis is at login time, in
combination with other traditional authentication methods.
Continuous analysis involves continuous monitoring of keystroke typing and is intended to be executed during the entire session, after the initial authentication step. Note that keystroke analysis performed after the initial authentication step deals with the typing rhythms of whatever is entered by the users; that is, the system must deal with free text. In the next two sub-sections we describe in more detail the existing schemes for both types of keystroke analysis.
2.2.1 Static Typist Authentication
Static authentication involves authenticating users through stable methods such as user name and password. Behavioural static authentication is a static authentication method that determines how the user acts and behaves with the authentication system; for example, how a user name and password are typed. This method is used as an additional authentication method and to overcome some limitations of traditional authentication methods. Keystroke dynamics and mouse dynamics are the main examples of behavioural static authentication.
Keystroke dynamics analyses the typing patterns of users. Using keystroke dynamics as an authentication method is derived from handwriting recognition, which analyses handwriting movements. Table 2.1 summarises a few techniques that will be discussed in this section. These techniques are evaluated using two measures: the false rejection rate (FRR), when the system incorrectly rejects an access attempt by an authorised user, and the false acceptance rate (FAR), when the system incorrectly accepts an access attempt by an unauthorised user.
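Given per-attempt similarity scores, the two error rates can be computed as below. The accept-when-distance-is-below-threshold convention is an assumption for illustration; some systems use similarity scores with the opposite direction.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Accept an attempt when its distance score is <= threshold.
    FAR: fraction of impostor attempts wrongly accepted.
    FRR: fraction of genuine attempts wrongly rejected."""
    far = sum(s <= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

# One genuine attempt (score 10) is rejected; no impostor is accepted.
far, frr = far_frr([1, 2, 3, 10], [4, 8, 9, 12], threshold=3)
```

Raising the threshold trades FRR for FAR, which is why both measures must be reported together.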
In 1980, Gaines et al. were the first to use keystroke dynamics as an authentication method. They conducted an experiment with six users. Each participant was asked to retype two samples, collected four months apart. Each sample contained three paragraphs of varying lengths. They used specific digraphs occurring in the paragraphs as features, analysing the collected keystroke latency timings. The five most frequent digraphs that appeared as distinguishing features were in, io, no, on and ul. They then compared latencies between the two sessions to see whether the average and mean values were the same in both sessions. The limitation of this experiment was that the data sample was too small to yield reliable results. Also, no automated classification algorithm was used to distinguish the participants, but the results were claimed to be very encouraging.
Umphress and Williams asked 17 participants to type two samples: a training sample of about 1400 characters and a testing sample of about 300 characters. They represented the features by grouping the characters into words and then calculated the times of the first six digraphs of each word. The classifier was based on statistical techniques, setting the condition that each digraph must fall within 0.5 standard deviations of its mean to be considered valid. Their system obtained an FRR of 12% and an FAR of 6%.

Reference                     FAR (%)  FRR (%)  Sample content    Method
Gaines et al. [29]            0.00     0.00     6000 characters   Manual
Umphress and Williams [100]   6.00     12.00    -                 Statistical
Leggett and Williams [57]     5.5      5.0      -                 Statistical
Joyce and Gupta [46]          16.36    0.25     -                 Statistical
Bleha et al. [11]             8.1      2.80     -                 Bayes
Brown and Rogers [12]         21.2     12.0     -                 Neural network
Obaidat and Sadoun [69, 70]   0.00     0.00     -                 Neural network
Furnell et al. [28]           26       15       4400 characters   Statistical
Bergadano et al. [10]         0.00     0.14     683 characters    Nearest Neighbour
Sang et al. [88]              0.2      0.1      -                 SVMs

Table 2.1: The accuracy of static typist authentication techniques.
Later, an experiment was conducted by Leggett and Williams, inviting 17 programmers to type approximately 1000 words. It was similar to the Gaines et al. experiment, but with the condition that a user was accepted if more than 60% of the comparisons were valid. The results demonstrated that the FAR was 5.5% and the FRR was 5%.
Joyce and Gupta recorded keystrokes during the log-in process as each user typed their user name, a password, and their last name. Thirty-three users participated in this experiment; each typed the strings eight times to build a historical profile and five times for testing. The classifier was based on a statistical approach, requiring each digraph to fall within 1.5 standard deviations of its reference mean to belong to a valid user. The results demonstrated that the FAR was 16.36% and the FRR was 0.25%.
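A minimal sketch of this style of statistical classifier follows, assuming per-digraph training latencies are available as a dictionary; the 1.5-standard-deviation band follows the Joyce and Gupta setup, and `valid_ratio` is a hypothetical helper name.

```python
import statistics

def build_profile(training):
    """training: {digraph: [latency samples in ms]} -> {digraph: (mean, std)}."""
    return {g: (statistics.mean(t), statistics.pstdev(t)) for g, t in training.items()}

def valid_ratio(profile, test_latencies, k=1.5):
    """Fraction of test digraphs falling within k standard deviations of
    their reference mean; a user is accepted if this ratio is high enough."""
    hits = total = 0
    for g, latency in test_latencies:
        if g in profile:
            mean, sd = profile[g]
            total += 1
            hits += abs(latency - mean) <= k * sd
    return hits / total if total else 0.0

profile = build_profile({"th": [100, 110, 90], "he": [200, 210, 190]})
ratio = valid_ratio(profile, [("th", 105), ("he", 300), ("th", 100)])
```

Here two of the three test digraphs fall inside their bands, so the acceptance decision reduces to thresholding this ratio.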
Bleha et al. used the same approach proposed by Joyce and Gupta, using digraph latencies as the feature to distinguish between samples of legitimate users and intruders. The experiment involved 14 participants as valid users and 25 as impostors to create profiles. The classification method was a Bayes classifier using the digraph times. Results show that the FAR was 8.1% and the FRR was 2.8%.
Brown and Rogers were the first to use keystroke duration as a feature to distinguish between the samples of authenticated users and impostors. The participants were divided into two groups (21 in the first group and 25 in the second), and they were asked to type their first and last names. A neural network was applied to classify the data, and results show a 0.0% false negative rate with a 12.0% FRR in the first group and a 21.2% FAR in the second group.
Furnell et al. used digraph latencies as the representative feature. Thirty users were invited to type the same 2200-character text twice in order to build their profiles. For intruder profiles, the users were asked to type two different texts of 574 and 389 characters. Digraph latencies were analysed statistically, and the results show that within the first 40 keystrokes of the testing sample the FRR was 15% and the FAR was 26%.
Obaidat and Sadoun used keystroke duration and latency together as features to distinguish between the samples of authenticated users and impostors. The experiment invited 15 users to type their names 225 times each day over a period of eight weeks to build their profiles. A neural network was used to classify the user samples. The results showed that both FAR and FRR were zero.
Bergadano et al. used a single text of 683 characters with 154 participants, and they considered typing errors and the intrinsic variability of typing as features that can distinguish users. They used the degree of disorder in trigraph latencies as a dissimilarity metric, with a statistical classification method to compute the average differences between the units in the array. The results show that the FAR was 4.0% and the FRR was 0.01%. The method in this experiment is suitable for authenticating users at log-in, but it is not applicable to continuous authentication because it requires predefined data.
Sang et al. conducted the same experiment as Obaidat and Sadoun [69, 70] (duration and latency together) but with a different technique. They used a support vector machine (SVM) to classify ten user profiles, and the results demonstrated that this technique was the best for classifying the user profile data, achieving more accurate results of 0.02% FAR and 0.1% FRR.
All of the previous techniques show that static typist authentication has been very successful in distinguishing different users effectively. They show that static typist authentication offers various features that can represent user typing behaviour and be used for user authentication. In the next section, we examine whether continuous typist authentication likewise offers features that can be used effectively for user authentication.
2.2.2 Continuous Typist Authentication
Continuous typist authentication using dynamic or free text applies when users are free to type whatever they want and keystroke analysis is performed on the available information. This setting is much closer to a real-world situation than using static text. The literature on keystroke analysis of free text is relatively limited. This section describes most continuous typist authentication techniques; they are summarised in Table 2.2.
Monrose and Rubin [64] conducted an experiment on 31 users by collecting typing data. Their experiment used Euclidean distance and added weights to digraphs, achieving about 23% correct classification in the best case (that is, using the weighted probability measure).

Reference                 FAR (%)  FRR (%)  Accuracy (%)  Sample content                                              Method
Monrose and Rubin [64]    -        -        23            -                                                           Euclidean distance and weighted probability
Dowland et al. [22]       -        -        50            Normal activity on computers running Windows NT             Statistical method
Dowland et al. [21]       -        -        60            Normal activity on specific applications such as Word       Statistical method
Bergadano et al. [9]      0.00     5.36     -             Two different texts, each 300 characters long               Distance measure
Nisenson et al. [68]      1.13     5.25     -             Task responses; each user typed 1866 to 2551 keystrokes     LZ78
Gunetti and Picardi [32]  3.17     0.03     -             Artificial emails; 15 samples per user of 700-900 keystrokes  Nearest Neighbour
Hu et al. [40]            3.17     0.03     -             19 users, each providing 5 free-text samples                k-nearest neighbour
Bertacchini et al. [8]    -        -        -             62 users typed 66 samples in Spanish                        k-medoids
Hempstalk et al. [38]     -        -        -             Real-world emails; 150 and 607 email samples                Gaussian density estimation
Janakiraman and Sim [44]  -        -        -             22 users, daily email activity, 30,000 to 2 million keystrokes per user  Based on a common list of fixed strings

Table 2.2: The accuracy of continuous typist authentication techniques.
Dowland et al. applied different data mining algorithms and statistical techniques to the data samples of four users to distinguish between authenticated users and impostors. The users were observed for some weeks during their normal activity on computers running Windows NT, meaning that there were no constraints on the users, who were free to use the computers in any way. User profiles were built from features based on the mean and standard deviation of digraph latency, and only digraphs typed less frequently by all the users in all samples were considered. To filter the user profile, two thresholds were applied: any times less than 40ms or greater than 750ms were discarded. The results demonstrated a 50% correct classification rate. The same experiment was refined by Dowland et al. to include application information for PowerPoint, Internet Explorer, Word, and Messenger. This experiment collected the data of eight users over three months, and the results demonstrated that the FRR was 40%.
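The latency-filtering step can be sketched as follows, with the 40ms and 750ms cut-offs taken from the experiment above; the input format and function name are assumptions for illustration.

```python
import statistics

def filtered_profile(latencies, low=40, high=750):
    """Discard digraph latencies outside [low, high] ms, then summarise
    each remaining digraph by the mean and standard deviation of its times."""
    profile = {}
    for g, times in latencies.items():
        kept = [t for t in times if low <= t <= high]
        if len(kept) > 1:  # need at least two samples to estimate a spread
            profile[g] = (statistics.mean(kept), statistics.pstdev(kept))
    return profile

# The 30 ms and 800 ms outliers are dropped; "er" has too few valid samples.
profile = filtered_profile({"th": [30, 100, 120, 800], "er": [900]})
```

Filtering like this removes pauses (the user stopping to think) and key bounces, both of which would otherwise distort the per-digraph statistics.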
Bergadano et al. calculated a new measure: the time between the depression of the first key and the depression of the second key for each pair of characters in a sequence. Forty users were invited to build historical profiles by typing two different text samples, each containing 300 characters; the participants produced 137 samples. Ninety new users were invited to build testing files by typing the second sample only. To classify an unknown sample, the mean distance was computed between the unknown sample and each sample of a user's profile, and also between the unknown sample and each user's profile. The authors applied a supervised learning scheme to improve the false negative rate, computing the mean and standard deviation between every sample in a user's profile and every sample in a different user's profile. Results demonstrated that the FRR was reduced to 5.36% and the FAR was zero.
A longer experiment by Dowland and Furnell collected about 3.5 million keystrokes from 35 users over three months. The sample content collected from users was based on global logging, which captures all possible typist behaviour.
Nisenson et al. collected free-text samples from five users acting as normal users and 30 users acting as attackers. The sample content was either an open answer to a question, copy-typing, or a block of free typing. Time differentials were calculated from the typing data and used as the user feature. Each normal user typed between 1866 and 2551 keystrokes. Attackers were asked to answer two open-ended questions and were required to type the specific sentence "To be or not to be. That is the question."; they were also allowed to type free text of between 597 and 660 keystrokes. These features were then used to train an LZ78-based classifier algorithm. The system attained an FRR of 5.25% and an FAR of 1.13%.
Gunetti and Picardi used free-text samples in their experiment, inviting 205 participants, and applied the same technique that Bergadano et al. proposed in their work on static typist authentication, discussed in the previous section. They created profiles for each user based on their typing characteristics in free text. They performed a series of experiments using the degree of disorder to measure the distance between the test sample and reference samples from every other user in their database. The samples are transformed into a list of n-graphs, sorted by their average times. To classify a new sample, it is compared with each existing sample in terms of both relative and absolute timing. Only digraphs that appear in both the reference and unknown samples are used for classification. The Gunetti and Picardi study achieves very high accuracy even when there are many registered users.
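The degree-of-disorder distance can be sketched as follows, assuming each sample is summarised as a map from n-graph to its average typing time; the normalisation by the maximum possible disorder follows the original formulation.

```python
def degree_of_disorder(sample_a, sample_b):
    """Order the n-graphs shared by two samples by average typing time in
    each sample, sum the absolute rank displacements between the two
    orderings, and normalise by the maximum possible disorder."""
    shared = set(sample_a) & set(sample_b)
    rank_a = {g: i for i, g in enumerate(sorted(shared, key=sample_a.get))}
    rank_b = {g: i for i, g in enumerate(sorted(shared, key=sample_b.get))}
    disorder = sum(abs(rank_a[g] - rank_b[g]) for g in shared)
    n = len(shared)
    max_disorder = (n * n - (n % 2)) / 2  # maximum total rank displacement
    return disorder / max_disorder if max_disorder else 0.0

# Identical orderings give 0.0; a fully reversed ordering gives 1.0.
same = degree_of_disorder({"th": 100, "he": 200, "er": 300},
                          {"th": 120, "he": 220, "er": 320})
reversed_ = degree_of_disorder({"th": 100, "he": 200, "er": 300},
                               {"th": 300, "he": 200, "er": 100})
```

Because only relative ordering matters, the measure tolerates a user typing uniformly faster or slower than in their reference samples, which is what makes it attractive for free text.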
Many researchers have used clustering algorithms to authenticate users. Hu et al. applied a technique similar to that proposed by Gunetti and Picardi. Nineteen users participated in this experiment, each providing five typing samples. Another 17 users provided 27 typing samples, which were used as impostor data. Typing environment conditions were not controlled during data collection. They proposed a k-nearest neighbour classification algorithm in which an input needs to be authenticated only against the limited set of user profiles within a cluster. The main difference between the algorithm proposed by Hu et al. and the method of Gunetti and Picardi (the GP method) is that the authentication process of the proposed algorithm operates within a cluster, while the GP method must go through the entire database. Also, for a user profile X, the k-nearest neighbour classification algorithm uses only its representative profile in the authentication process, while the GP method must compare with every sample of each user profile. They used the clustering algorithm to form a cluster for each user. First, each user provides several training samples, from which that user's profile is built; a representative user profile is built by averaging all such vectors from all the training samples provided. Second, the k-nearest neighbour method is applied to cluster the representative profiles based on the distance measure. Finally, the authentication algorithm for new text is executed only on the user's corresponding cluster. The success of the proposed algorithm depends on the threshold value, which in turn depends on the users registered in the system. Moreover, because the proposed algorithm classifies and authenticates only users who are already registered in the system, the system is less effective when new users interact with it. The experiment shows that the proposed k-nearest neighbour classification algorithm can achieve the same level of FAR and FRR performance as the Gunetti and Picardi (GP) method.
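The representative-profile idea can be sketched as follows: average each user's training feature vectors into a single profile and authenticate a new sample against the nearest representative. This is a simplification: it omits the clustering step that narrows the search, and the Euclidean distance is an assumed choice.

```python
import math

def representative_profile(training_vectors):
    """Average all of a user's training feature vectors into one profile."""
    n = len(training_vectors)
    return [sum(col) / n for col in zip(*training_vectors)]

def nearest_user(representatives, sample):
    """Return the user whose representative profile is closest to the
    sample (Euclidean distance); Hu et al. would first restrict the
    comparison to a single cluster of representatives."""
    def dist(profile):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile, sample)))
    return min(representatives, key=lambda user: dist(representatives[user]))

reps = {"alice": representative_profile([[100, 200], [102, 198]]),
        "bob": representative_profile([[300, 400], [298, 402]])}
claimed = nearest_user(reps, [101, 199])
```

Comparing against one averaged vector per user, rather than every stored sample, is exactly the saving that lets the approach scale better than the GP method.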
are still familiar with the keyboard, as they have used it for many years. The users came from different backgrounds, including Chinese, Indian and European origin, and all were fluent in English. Keystrokes were logged as users went about their daily activity of using email, surfing the web, creating documents, and so on. The data collected per user ranged from 30,000 keystrokes to 2 million keystrokes; in total, 9.5 million keystrokes were recorded across all users. However, they did not report their findings in their paper.
One of the main limitations of the Gunetti and Picardi approach is its high verification error rate, which causes a scalability issue. Gunetti and Picardi proposed a classical n-graph-based keystroke verification method (the GP method), which can achieve a low false acceptance rate (FAR). However, the GP method suffers from a high false rejection rate (FRR) and a severe scalability issue. Thus, GP is not a feasible solution for applications such as cloud computing, where scalability is a major concern. To overcome GP's shortcomings, Xi et al. developed a more recent keystroke dynamics scheme for user verification. To reduce the high FRR, they designed a new correlation measure using an n-graph equivalence feature (nGdv) that enables more accurate recognition of genuine users. Moreover, correlation-based hierarchical clustering was proposed to address the scalability issue. The experimental results show that nGdv-C can produce a much lower FRR while achieving almost the same level of FAR as the GP method.
All of the previous techniques show that continuous typist authentication has been successful, similarly to static typist authentication, in distinguishing users effectively. They show that features can be obtained from the typing data to represent user typing behaviour, and these features can be used for successful user authentication. However, the features extracted from the typing data in continuous typist authentication do not guarantee strong statistical significance, nor do they inherently incorporate user typing behaviour. Furthermore, one of the main limitations of the previous continuous typist authentication techniques is that they require the users' data to be available in advance. In principle, the requirement to collect users' data in advance restricts the systems to authenticating only known users whose typing samples have been modelled. In some cases, it is impractical or impossible to obtain a pre-defined typing model of all users in advance (before detection time). It should be possible to distinguish users without a pre-defined typing model, which would make the system more practical.
In the next section, we present some related pioneering works in both
supervised and unsupervised typist authentication. Supervised and unsupervised
methods will help to link the existing continuous typist authentication schemes
with the relevant setting environment or scenario.
However, an open-setting environment may apply when the profiles of the impostor and valid users are not available, such as in a computer-based TOEFL exam scenario.
2.3 Summary
In this chapter, past and current research into the analysis of users' typing data, or keystroke dynamics, was presented, with emphasis on continuous typist authentication schemes. The chapter also presented a discussion of current anomaly detection techniques. Two types of data analysis techniques have been widely used in keystroke dynamics, namely data mining and statistical analysis, and research applying these techniques to keystroke dynamics was reviewed. In addition, the challenges of the current research were presented. In the next chapter, we present our first contribution by proposing a generic model which will be the primary object of this thesis. The model can help in identifying and understanding the characteristics and requirements of each type of continuous typist authentication and of continuous authentication scenarios.
Chapter 3
The majority of existing CBAS are built around biometrics supplied by user traits and characteristics. There has been an ongoing pursuit of improving CBAS. Recently, efforts have been made to improve CBAS by either embedding intrusion detection into the CBAS itself or by adding a new biometric source tailored to the detection system. However, these attempts do not consider the different limitations that might affect the practicability of existing CBAS in real-world applications and continuous authentication scenarios. To our knowledge, no CBAS has been deployed in a real-world application; it seems reasonable to assume that this is because existing systems lack practicality. There are a number of issues associated with existing schemes which may prevent CBAS from being applied in real-world applications. These limitations include:
variability of the behavioural biometric between the training and testing data; and
variability of the behavioural biometric of the user from one context to another, for example, typing in email versus typing in a computer-based exam.
There are several scenarios and situations that require the application of a continuous user authentication approach, but these scenarios cover a variety of possible situations with different characteristics and requirements. Current CBAS may not consider these differences in requirements and characteristics. Thus, a scheme might be applied to the wrong domain or scenario, because previous CBAS might not have chosen accurate measurements for the relevant scenario or situation and may not consider the type of setting environment. Additionally, the time taken to detect the intruder might be too long. For example, choosing the right biometric source for the relevant application or system may not be considered by existing CBAS: one might rely on keystroke dynamics as the input for providing continuous authentication while most of a user's actions are based on mouse dynamics. As another example, selecting a suitable algorithm for the relevant setting environment might not be considered, such as when a CBAS uses a multi-class classification algorithm in an open-setting environment. Multi-class classification requires data to be available from both valid users and impostors before the profile of normal biometric data is generated, but it is impossible to collect data in advance from the possible impostors in an open-setting environment. In an open-setting environment the profile of the impostor is not available, and the profile of valid users may or may not be available. In a closed-setting environment the profiles of all users, including a possible impostor, are available. This could be one of the main problems preventing existing schemes from meeting the European standards for acceptable commercial biometrics: a false acceptance rate (FAR) of less than 0.001% and a false rejection rate (FRR) of less than 1%.
A generic model is proposed in this thesis that attempts to cover most continuous authentication (CA) scenarios and CBAS. The model is based on the detection capabilities of CA scenarios and CBAS, in order to better identify and understand the characteristics and requirements of each type of scenario and system. The model pursues two goals: the first is to describe the characteristics and attributes of existing CBAS, and the second is to describe the requirements of the different scenarios in which a CBAS can operate. We also identify the main issues and limitations of existing CBAS, observing that all of these issues relate to the training data. Finally, we consider a new application for CBAS that requires no training data, either from intruders or from valid users, in order to make CBAS more practical.
Figure 3.1: A typical CBAS, showing the sensor, feature extraction, detector, biometric database, and response unit (active and passive responses), together with the user and administrator.
There are six basic components in a typical CBAS (see Figure 3.1).
performs measurements for errors that may detect the intruder. To determine the
accuracy of the detector, two measurements of CBAS are popular: the false
acceptance rate (FAR) and the false rejection rate (FRR).
5. Biometric database - stores the biometric data and user actions as profiles;
this typically happens during the registration phase. The CBAS uses the
database for comparison with the live data in the verification phase.
6. Response unit - takes an appropriate response once an intruder or impostor
is detected. A CBAS has two common types of response: passive response or
active response.
An additional aspect of the CBAS model is the type of setting (or scenario). A
continuous authentication scenario might be conducted either in an open-setting
environment or a closed-setting environment. We defined the types of setting
environment in the previous chapter, Section 2.3. Each of the six basic CBAS
components is described in detail below.
3.2.1 Subjects
Subjects are initiators of activity on a target system, normally users, either
authorised or unauthorised [20]. An authorised user is allowed to access a system
by providing some form of identity and credentials, and is allowed to interact with
objects of the system during the session. An authorised user can be known to the
system, where the biometric data of that user is registered in the system as a
historical profile. On the other hand, an authorised user can be unknown to the
system, where no biometric data of that user is registered in the system in advance.
The second type of user is the unauthorised user, who does not have distinctive
characteristics or recognition factors and tries to claim the identity of an
authenticated user. The unauthorised user can be an adversary acting maliciously
towards the valid user, or a colluder invited by the valid user to complete an action
on the user's behalf. In the first case the victim of the attack is the end user, but in
the second case the victim is typically the system operator or the owner of the
application.
3.2.2 Sensor
The sensor is a device, such as a keyboard or camera, that collects biometric data
from the user, either physical or behavioural, and translates it into a signal that can
be read by an observer or an instrument. The system may have several sensors.
Sensors in this model transform raw biometric data into a form suitable for further
analysis by the detector. The location of the sensor module for collecting data,
which is the process of acquiring and preparing biometric data, can be centralised
or distributed. The data can be collected from many different sources in a
distributed fashion, or from a single point using the centralised approach. The aim
of data collection is to obtain biometric data to keep on record, and to make further
analysis of that data. The quality and nature of the raw data is significantly
affected when the sensor used during registration differs from the one used during
authentication [87].
The sensors are based on one or more physical traits or behavioral characteristics
[55] for uniquely recognising humans. The physical type includes biometrics based
on stable body traits, such as fingerprint, face, iris, and hand. The behavioural type
includes learned movements such as handwritten signature, keyboard dynamics
(typing), mouse movements, gait, and speech. The feature set is sensitive to several
factors including:
1. Change in the sensor used for acquiring the raw data between registration and
verification time. For example, using an optical sensor during registration and a
solid state capacitive sensor during verification in a fingerprint system.
2. Variations in the environment. For example, very cold weather might affect the typing speed.
3.2.3 Feature Extraction
Since each biometric source has different characteristics and attributes, a feature
selection technique suited to the relevant biometric source is required. In this
component, features that are representative of the user (or that reflect a model of
the user) are selected based on the type of biometric source. For each biometric
source, a set of representative features is generated and recorded as the input
required by the next component (the detector).
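For keystroke dynamics, one common choice of representative feature is the latency between consecutive key presses (digraph latency). The sketch below is illustrative and assumes a simplified (char, press_time_ms) event format, not the exact representation used in this thesis.

```python
# Hypothetical sketch of digraph-latency feature extraction from keystroke
# events. The (char, press_time_ms) event format is an assumption for
# illustration only.
from collections import defaultdict

def digraph_latencies(events):
    """Mean latency (ms) for each observed two-character sequence."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (c1, t1), (c2, t2) in zip(events, events[1:]):
        sums[c1 + c2] += t2 - t1
        counts[c1 + c2] += 1
    return {d: sums[d] / counts[d] for d in sums}

events = [("t", 0), ("h", 120), ("e", 250), ("t", 900), ("h", 1010)]
features = digraph_latencies(events)
print(features["th"])  # -> 115.0 (mean of the 120 ms and 110 ms latencies)
```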
3.2.4 Detector
The detector performs error detection that may lead to intruder detection, based on
the biometric data gathered by the CBAS sensor. The detector software might be
implemented on the client computer, either in the system (with no knowledge of
the application) or in the application itself by adding functionality to that
application. The detection algorithm might also run on a server. The detector is
generally the most complex component of the CBAS. It operates in two modes:
registration mode and identification/verification mode. The operation of each mode
consists of three stages. In the first stage, a data capturing process is conducted by
the sensor module, which captures all biometric data and converts the raw data
into a more organised and meaningful form. The data then passes to the data
processing stage, where feature extraction is conducted. These features are
accumulated over a pre-defined session period, and a number of algorithms are
applied to the data to produce, for example, a Mouse Dynamics Signature (MDS),
Fingerprint Signature (FPS), or Keystroke Dynamics Signature (KDS) for the user
being monitored. Finally, in registration mode, the generated signature passes to a
database and is stored as a reference signature for the enrolled user. In
verification mode, the final stage is the verification process: the signature
calculated during the data processing stage is compared to the reference signature
of the legitimate user.
There are two major families of detection algorithms in the surveyed systems:
those based on the historical profiles of all the users, and those based only on the
valid user's profile. The first is a type of machine learning algorithm that requires
biometric data about all users, both valid users and possible impostors, to build a
model for prediction. Several algorithms follow this method, such as the nearest
neighbour, linear classification and Euclidean distance approaches proposed by
Gunetti et al. This type of classification algorithm is suitable for a closed and
restricted setting; the environment should also be under access control to stop any
user not registered in the system. The system performs well when there are many
registered users and fails completely when there is only one.
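A minimal sketch of this first family follows, using a 1-nearest-neighbour rule over per-user feature vectors. The data (invented mean digraph latencies in ms) and the specific distance rule are illustrative assumptions, not Gunetti et al.'s exact method.

```python
# Illustrative multi-class detection: classify a live sample as the nearest
# registered user. All registered users' training data must be available in
# advance, which is why this family fits the closed setting.
import math

def nearest_user(profiles, sample):
    """Return the user whose training vector is closest (Euclidean) to sample."""
    best_user, best_dist = None, math.inf
    for user, vectors in profiles.items():
        for v in vectors:
            d = math.dist(v, sample)
            if d < best_dist:
                best_user, best_dist = user, d
    return best_user

profiles = {
    "alice": [[110.0, 95.0], [115.0, 98.0]],   # invented digraph latencies (ms)
    "bob":   [[180.0, 160.0], [175.0, 150.0]],
}
print(nearest_user(profiles, [112.0, 96.0]))  # -> alice
```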
The second family is a type of machine learning algorithm that requires biometric
data about only a single target class in order to build a model for prediction. This
type of classification algorithm tries to distinguish one class of objects from all
other possible objects by learning from a training set containing only the objects of
that class. This is different from, and more difficult than, the first classification
method, which distinguishes between two or more classes using a training set
containing objects from all of the classes. This family is suitable for the
open-setting environment, where any user can use the system.
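As an illustration of this second family, the one-class sketch below builds a model from the valid user's samples only and flags anything too far from their centroid. The data, the centroid-plus-radius model, and the slack factor are all invented for illustration; no surveyed system is claimed to work exactly this way.

```python
# Minimal one-class sketch: learn only from the valid user's samples, then
# treat a new sample as an impostor if it falls outside a distance radius.
import math

def fit_one_class(samples):
    """Model = (centroid, radius) learned from the valid user's data only."""
    n = len(samples)
    centroid = [sum(col) / n for col in zip(*samples)]
    # radius: largest training distance from the centroid, with 50% slack
    radius = max(math.dist(s, centroid) for s in samples) * 1.5
    return centroid, radius

def is_valid_user(model, sample):
    centroid, radius = model
    return math.dist(sample, centroid) <= radius

valid = [[110.0, 95.0], [115.0, 98.0], [112.0, 96.0]]  # invented latencies (ms)
model = fit_one_class(valid)
print(is_valid_user(model, [113.0, 97.0]))   # -> True
print(is_valid_user(model, [180.0, 160.0]))  # -> False
```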
The previous algorithms are used for decision-making in order to detect the
attacker in either online or offline mode. The online mode can be characterised as
real-time (or near real-time) detection, and the offline mode as retrospective (or
archival) detection. Online detection is necessary for detecting the intruder in real
time or near real time, whenever the system notices a change in the biometric data
of the authenticated user. The surveyed systems in this group can also be run in
retrospective (archival) mode. The offline mode is used when it is unnecessary to
detect the intruder in real time.
The location of the previous data processing can be centralised within a particular
location and/or group, or distributed across multiple computers. Data processing
describes the process of transforming the raw biometric data into a suitable form,
and of summarising and analysing it.
The accuracy of a single measurement is determined in this component, by
comparing the measurement against the true or accepted value. The false
acceptance rate, or FAR, measures the accuracy of the CBAS: it is the likelihood
that the biometric security system will incorrectly accept an access attempt by an
unauthorised user. A system's FAR is typically stated as the number of false
acceptances divided by the number of identification attempts. The false rejection
rate, or FRR, is the second measure of
Unauthorised user      | Victim type                                | Algorithm type
Adversary / colluder   | System operator / owner of the application | Multi-class classification
Adversary              | End user                                   | One-class classification
Adversary / colluder   | System operator / End user                 | Change point detection (potentially)

Table 3.2: The differences between the first, second classes and the new class.
mon characteristics and attributes from the generic model of CBAS. To date there
is no CBAS deployed in real-world applications, probably because the existing
systems lack practicality. We observed that the main limitations preventing CBAS
from being applicable in real-world applications are related to the training data:
the requirement for training data to be available in advance; the large number of
training samples required; the variability of the behavioural biometric between
the training and testing phases as time passes; and the variability of a user's
behavioural biometric from one context to another. Finally, the chapter considered
a new class of CBAS, associated (potentially) with change point detection
algorithms, that does not require training data from either intruders or valid users
and can overcome the identified limitations of existing CBAS.
In Chapter 7 we will consider this new class, which is not dependent on training
data, using the results of Chapters 5 and 6. The new class can overcome the
identified limitations associated with existing CBAS. The new scheme is capable of
distinguishing a user accurately without the need for any predefined user typing
model a priori.
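To illustrate how change point detection can avoid pre-collected training data, the CUSUM-style sketch below watches a live stream of typing delays and flags a sustained shift in the mean. The algorithm and its drift/threshold parameters are illustrative stand-ins for this class of methods, not the exact scheme developed in this thesis.

```python
# Illustrative CUSUM-style change point detector over typing delays (ms).
# The running mean is estimated online from the stream itself, so no
# training data from valid users or intruders is needed in advance.
def cusum_change_point(values, drift=10.0, threshold=60.0):
    """Return the index where a sustained mean shift is flagged, else None."""
    mean = values[0]
    g_pos = g_neg = 0.0
    for i, x in enumerate(values[1:], start=1):
        g_pos = max(0.0, g_pos + (x - mean) - drift)   # upward shift evidence
        g_neg = max(0.0, g_neg + (mean - x) - drift)   # downward shift evidence
        if g_pos > threshold or g_neg > threshold:
            return i
        mean += (x - mean) / (i + 1)                   # online mean update
    return None

# Valid user types at ~120 ms; an impostor (~220 ms) takes over at index 6:
stream = [118, 122, 119, 121, 120, 117, 223, 218, 225, 220, 219]
print(cusum_change_point(stream))  # -> 6
```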
Chapter 4
Find new features that represent the user's typing behaviour, which can
guarantee selecting frequently-typed features and inherently reflect user
Develop a new, flexible technique that authenticates users, and automate this
technique to continuously authenticate users.
detail the design and evaluation of two novel sliding window techniques for
analysis of the typing data, in order to detect in real time an impostor who
may take over from the genuine user during a computer session. The
proposed overlapping sliding window technique performed better than the
proposed non-overlapping sliding window technique in terms of speed of
detection. The user-independent threshold has been demonstrated
practically for detecting the impostor in near real time.
5. This thesis has presented a framework for the analysis of continuous
language. However, we do not know about the influence of other languages on the
user-independent threshold. Evaluation of other languages could be attempted to
obtain the user-independent threshold. It would be interesting to analyse the
effectiveness of the proposed techniques with other languages; this would help
answer questions about how well the user-independent threshold generalises to
other data sets.
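The overlapping and non-overlapping sliding window schemes mentioned above can be sketched as follows. The window size and step are illustrative choices, not the values used in the experiments; the point is that the overlapping variant produces decisions more often, which is why it detects an impostor faster.

```python
# Two windowing schemes over a stream of typing events: non-overlapping
# windows step by a full window, overlapping windows step by one event.
def windows(stream, size, step):
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, step)]

stream = list(range(10))                       # stand-in for typing events
non_overlapping = windows(stream, size=4, step=4)
overlapping = windows(stream, size=4, step=1)
print(len(non_overlapping))  # -> 2 decision points
print(len(overlapping))      # -> 7 decision points
```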
4.2.2 Application of the Proposed Technique to Different Biometric Sources
We used a keystroke dataset as the biometric source for the experiments in this
thesis on providing continuous user authentication. The techniques provided in this
thesis can be applied to other biometric sources, such as mouse dynamics and face
recognition. We found that a pre-defined typing model of a user is not necessary
for successful authentication. However, we do not know whether a predefined user
model is necessary for successful authentication with other biometric sources.
Evaluation of other biometric sources could be attempted to obtain the
user-independent threshold. It would be interesting to analyse the effectiveness of
the proposed techniques with other sources; this would help answer questions about
how well the user-independent threshold generalises to other biometric sources.
4.2.3
behaviour can be used to assist other security solutions, such as an aid to intrusion
detection systems.
4.3 Concluding Remarks
This research has highlighted the challenges of analysing user typing data to
distinguish between users so that it can be used for biometric authentication. New
methods for improving the analysis of user typing data, in order to detect an
impostor during a computer session, were proposed. The analysis techniques
provided in this thesis have been used successfully to select features that are
representative of user typing behaviour and to distinguish a user accurately
without the need for any predefined user typing model a priori.
Appendix A
Characteristics of users typing data
[Table A.1: characteristics of each user's typing data (per-user counts); columns include User 7, User 14 and User 15.]
Distribution Time
The following table presents the distribution of time in milliseconds for 10
different users from the dataset. From the table, we can easily see that the typing
delay differs for most of the users. For example, the average typing delay for a
sequence of two characters is significantly different between some users, such as
user 3, and others, including users 2, 4, 5, 7 and 8. Furthermore, it can be seen
clearly that most of the typing delays for sequences of two characters fall in the
101-200 ms category. However, this is not the case for some users, such as user 6,
for whom most two-character typing delays fall in the 201-300 ms category.
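The binning used in the table (100 ms categories such as 101-200 ms) can be sketched as follows, on invented delays:

```python
# Group two-character typing delays into 100 ms bins, mirroring the
# distribution table's categories (1-100, 101-200, 201-300, ...).
from collections import Counter

def bin_delays(delays_ms, width=100):
    """Map each delay to its bin label, e.g. 150 -> '101-200'."""
    def label(d):
        lo = ((d - 1) // width) * width + 1
        return f"{lo}-{lo + width - 1}"
    return Counter(label(d) for d in delays_ms)

delays = [150, 180, 120, 250, 95, 110]  # invented delays (ms)
print(bin_delays(delays))  # most fall in the 101-200 ms category
```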
Screenshots
References
[1]
[2]
M.B. Ahmad and T.S. Choi. Local threshold and boolean function based edge
detection. Consumer Electronics, IEEE Transactions on, 45(3):674-679,
1999.
[3]
[4]
[5]
L.C.F. Araujo, L.H.R. Sucupira Jr, M.G. Lizarraga, L.L. Ling, and J.B.T.
Yabu-Uti. User authentication through typing biometrics features. Signal
Processing, IEEE Transactions on, 53(2):851-855, 2005.
[6]
[7]
[8]
[9]
1994.
[18] M. Dash, K. Choi, P. Scheuermann, and H. Liu. Feature selection for
clustering - a filter solution. In Data Mining, 2002. ICDM 2002. Proceedings.
2002 IEEE International Conference on, pages 115-122. IEEE, 2002.