Paulo R. Galego Hernandes Jr. Luiz F. Carvalho Mario Lemes Proença Jr.
Abstract-Traffic monitoring is an important task for network administrators, who require tools to aid in the detection of changes in the network's routine. In this paper, we use a Digital Signature of Network Segment using Flow Analysis (DSNSF) as a technique to describe standard network behavior, aiming to support network management through traffic characterization. We collected a real data set from the State University of Londrina (UEL), using data flow attributes such as bits, packets and number of flows. Our novel model uses a Genetic Algorithm to optimize the process, which consists of organizing the data to display graphically a standard network behavior. To accomplish this task, we compared our novel model with another similar method, Ant Colony Optimization for Digital Signature (ACODS), evaluating these models to measure their accuracy.

Keywords-Characterization, Traffic Monitoring, Network Management, Genetic Algorithm, ACO, sFlow.

I. INTRODUCTION

Nowadays, one of the biggest challenges for network administrators is to manage bandwidth resources efficiently. There are a variety of tools and techniques to help these administrators identify anomalous behavior and prevent security incidents. Information such as an interface's traffic should be monitored through active network elements.

The Simple Network Management Protocol (SNMP) was used for many years to provide such information; however, administrators required more knowledge about their environments. With the use of data flows, they were able to obtain the required details. A flow record is defined by a connection between two hosts reporting fields in common, which could be the endpoint addresses, time, and volume of information transferred. This gives a more detailed view of the traffic and, due to the data reduction compared to SNMP, allows it to be used on large networks [1].

Apart from data flows, there are other tools used to support network managers, such as Network Intrusion Detection Systems (NIDS). These tools are normally based on attack signatures, represented as a set of rules frequently updated by the security community or companies. When a packet is matched against a signature, an alert is raised, indicating an attempted intrusion or misuse, which could be considered a network anomaly [2]. However, an attack signature can change, and a NIDS may allow an intruder to gain unauthorized access, or to bring down a server through a DDoS attack, until its signature database is updated.

In order to overcome this lack of security, there are models able to detect every change in the network routine by learning the standard behavior of an environment using traffic characterization. According to Zarpelão et al. [3], the anomaly detection techniques known as profile-based or statistical-based do not require any previous knowledge about the nature and properties of the anomalies to be detected. This is an advantage because these models can be used in different environments.

In this study, we use a genetic algorithm (GA), a powerful metaheuristic, to organize IP data flows in clusters. We used three features from a real data set: bits, packets and number of flows per second. These data were used to generate our Digital Signature of Network Segment using Flow Analysis (DSNSF), which characterizes network traffic through its behavior and predicts traffic usage, making it possible to identify anomalous traffic. We will call this method DSNSF-GA, and we will compare it with another approach based on DSNSF: Ant Colony Optimization for Digital Signature (ACODS).

This paper is organized as follows: Section II presents the related work. Section III explains the novel method, giving details of the DSNSF-GA generation. Section IV presents the results of our evaluation tests, and finally Section V concludes this paper.

II. RELATED WORK

Patcha and Park [4] classify network anomaly based systems in three groups: signature based, profile based and hybrid. A signature based system works similar
GA is a method that manipulates a population of potential problem solutions, trying to solve the problem using a coded representation of these solutions, which is equivalent to the genetic material (chromosomes) of individuals in nature. In GA, members of a population (the solutions) compete with each other to survive and reproduce, generating new solutions. Each individual is assigned a fitness value, which reflects how well adapted to the environment this individual is in comparison with others. As in nature, selection elects the fittest individuals. These are assigned to a genetic combination, also called crossover, which results in the exchange of genetic material (reproduction), generating a new population. The GA cycle is illustrated in Figure 1.

[...] dimensionality, i.e., the number of features to be processed. The collected flows are divided into 5-minute intervals, totaling 288 data sets throughout the day. The variable $x_{in}$ denotes the value of feature $n$ of flow $i$, and $c_{jn}$ stores the value of the center of cluster $j$ in dimension $n$.

We use the cluster centroids as the chromosome values. For the initial population, we generated chromosomes randomly, with values between the minimum and maximum flow data values. These chromosomes are used to generate new populations of individuals. The next step is to determine the fittest individuals in a population. What determines whether an individual is fit is the sum of the distances between all points and their respective cluster centroids, which we seek to minimize. If this total distance is lower for one individual than for another, the data inside its clusters are better organized, i.e., more points lie close to the central point of their cluster. In our approach, we used this total distance to determine the fittest individuals, which will procreate. A Roulette Wheel technique was used to determine the best chromosomes; it consists of giving each individual a slice of a circular wheel equal in area to the individual's fitness [22]. We spun the roulette as many times as there are individuals in the population. Those selected more often were the fittest, and will breed a new generation.
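The fitness measure and the Roulette Wheel selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each point is assigned to its nearest centroid, and, because a smaller total distance means a fitter individual, the wheel slice is assumed to be proportional to the inverse of the total distance.

```python
import math
import random


def total_distance(points, centroids):
    """Sum of Euclidean distances from each point to its nearest centroid.

    A lower value means the individual (a set of centroids) organizes
    the data better.
    """
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def roulette_select(population, points, spins, rng=random):
    """Spin the roulette wheel `spins` times and return the picked individuals.

    Assumption: slice size = 1 / (total distance + eps), so that a shorter
    total distance yields a larger slice of the wheel.
    """
    eps = 1e-12  # avoids division by zero for a perfect fit
    slices = [1.0 / (total_distance(points, ind) + eps) for ind in population]
    return rng.choices(population, weights=slices, k=spins)


if __name__ == "__main__":
    # Toy one-dimensional data (e.g. bits per second), two candidate individuals.
    points = [(1.0,), (2.0,), (10.0,), (11.0,)]
    good = [(1.5,), (10.5,)]
    bad = [(5.0,), (6.0,)]
    picks = roulette_select([good, bad], points, spins=10, rng=random.Random(0))
    print(sum(1 for p in picks if p == good), "of 10 spins chose the better individual")
```

Individuals that are drawn more often simply appear more often in the returned list, which matches the idea that the most frequently selected chromosomes breed the next generation.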
Figure 2: DSNSF-GA and ACODS for the 2nd of October. Panels: bits per second, packets per second and flows per second traffic of 10/02/2012 with the generated DSNSFs; x-axis: time (hours); series: current day, DSNSF-GA, ACODS.

To yield new generations, the crossover operator will [...] in nature, the fittest individuals have a greater probability of generating new offspring, who will then generate another, and so on. To generate a new progeny, two parents must combine their genetic material. Crossover is the key process in a genetic algorithm, because at this moment the exchange of genes occurs. When two parents reproduce, [...] we are finding the shorter total distance in a chromosome.

To start this process, we set an initial population of ρ individuals, and we choose τ = ρ/2, corresponding to the fittest in this population, to generate a new one. The latest individuals join the previous population to breed another one. The fittest are again elected by the Roulette technique, which chooses τ individuals to generate other children. Each one of these iterations is called a new generation.
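The generational cycle described above (a population of ρ individuals, τ = ρ/2 parents elected by the roulette, children joining the previous population) can be sketched as follows. Several details are assumptions made only for this illustration: the crossover operator is a hypothetical uniform exchange of centroids, since its exact form is not recoverable from the text; the joined population is truncated back to ρ individuals; and mutation applies the rule given as Eq. (2) in this paper with the stated 5% rate. The default values of `rho` and `generations` are arbitrary.

```python
import math
import random


def total_distance(points, centroids):
    """Sum of distances from each point to its nearest centroid (lower is fitter)."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def crossover(a, b, rng):
    """Hypothetical uniform crossover: each centroid is copied from one parent."""
    return [list(rng.choice(pair)) for pair in zip(a, b)]


def mutate(ind, rate, rng):
    """With probability `rate`, apply New = Old + delta * Old at a random point MP."""
    if rng.random() < rate:
        mp = rng.randrange(len(ind))   # mutation point MP
        delta = rng.random()           # random factor in [0, 1)
        ind[mp] = [v + delta * v for v in ind[mp]]
    return ind


def evolve(points, k=3, rho=20, generations=50, rate=0.05, seed=0):
    """Return the fittest set of k centroids found after `generations` iterations."""
    rng = random.Random(seed)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    # Initial chromosomes: random centroids between the min and max data values.
    population = [[[rng.uniform(lo[d], hi[d]) for d in range(dims)]
                   for _ in range(k)] for _ in range(rho)]
    tau = rho // 2
    for _ in range(generations):
        # Roulette wheel: slice assumed proportional to 1 / total distance.
        slices = [1.0 / (total_distance(points, ind) + 1e-12) for ind in population]
        parents = rng.choices(population, weights=slices, k=tau)
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                           rate, rng) for _ in range(tau)]
        # Children join the previous population; keeping the rho best is an assumption.
        joined = population + children
        population = sorted(joined, key=lambda ind: total_distance(points, ind))[:rho]
    return min(population, key=lambda ind: total_distance(points, ind))
```

Because the best individuals of the previous population are always retained in this sketch, the shortest total distance can never get worse from one generation to the next.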
To create our DSNSF-GA, we collected data from the State University of Londrina (UEL) during five weeks, using sFlow to generate IP flows. These data were separated into files, one per day. Every file has 86400 lines, each corresponding to the amount of bits, packets or flows per second. As we chose to generate the DSNSF-GA using data from five-minute intervals, we selected 300 lines from each file (as we are using three previous days) to generate a single point in the graphic. Those lines (data) were divided and later grouped into clusters according to Euclidean distance. Using the Silhouette method for interpretation and validation of clusters [23], the best results were reached using K = 3, where K is the number of clusters, and the GA was used to optimize the distribution of the data among the clusters.

Each chromosome also undergoes a mutation probability, which is a fixed number. Mutation allows the emergence and preservation of genetic variation in a population by introducing another genetic structure, modifying genes inside the chromosome. We established a mutation rate of 5%. When mutation occurs, we choose a mutation point, called MP, which is the point where the value changes. The new value is calculated using:

$$New_i = Old_i + (\delta \times Old_i) \qquad (2)$$

where $New_i$ is the new individual, $Old_i$ is the old individual, and $\delta$ is a random number, $0 < \delta < 1$, which determines whether mutation will occur. The new chromosome will be used to generate new offspring.

The best population will be acquired at the end of these processes, and from it we choose the best individual, which represents the shortest sum of the distances between each point in the cluster and its respective centroid. We then calculate the average of the three cluster centroids. This number represents a single point in the graphic, and this process is repeated 288 times, representing all 5-minute intervals during a day. By using data from three previous days to generate each point, we obtain a network signature of this day, the DSNSF-GA.

IV. TESTS AND RESULTS

For our purpose, we used only data between 06:00 and 24:00, since we collected data from a university whose working hours commence at 07:00 and end at 23:00. It is important to emphasize that the 12th of October is a national holiday in Brazil, and this is the reason for a different traffic behavior on that specific day. We decided to keep this day to demonstrate the ability of the methods to adapt to similar situations.

Furthermore, Figure 2 shows the observed traffic and the traffic predicted by both DSNSF-GA and ACODS for bits, packets and flows per second. The figure represents the traffic measured on 2nd October 2012, where the green line is the current day and two other lines are presented: one for DSNSF-GA and another for ACODS. As shown in this figure, both are able to characterize network traffic, presenting the usual observed traffic, displaying increases and decreases in usage following the same pattern, and also a greater use of network resources during the periods from [...]
Figure: Bits per second traffic of 10/12/2012 - DSNSF. x-axis: time (hours); y-axis: bits/s; series: current day, DSNSF-GA, ACODS.

[...] NMSE value is close to 0, then the forecasts are doing [...] the predictions are doing worse than the current traffic. The CC measures how suitable a model is, whether the growth and decrease trends of the traffic movement are followed by the DSNSF, resulting values varying from -1 to 1. A positive [...]

Table II: CC Table - Days between 8th to 12th of October

CC/Days    1     2     3     4     5     Average
GA-bits    0.79  0.87  0.79  0.89  0.66  0.80

Figure: NMSE, flows per second panel. x-axis: workdays from 1st to 12th of October (1, 2, 3, 4, 5, 8, 9, 10, 11, 12); y-axis: NMSE; series: ACODS, DSNSF-GA.
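As a rough illustration of the two evaluation metrics discussed in this section, the sketch below computes a correlation coefficient (CC) and a normalized mean square error (NMSE) between an observed traffic series and a DSNSF prediction. The exact formulas used in the paper are not recoverable from the surviving text, so both definitions here are assumptions: CC is taken to be the Pearson correlation, which ranges from -1 to 1, and NMSE is taken to be the mean squared error divided by the variance of the observed series. The function names are hypothetical.

```python
import math


def correlation_coefficient(observed, predicted):
    """Pearson correlation between two equally long series (assumed CC definition)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    # Undefined (division by zero) when either series is constant.
    return cov / (norm_o * norm_p)


def nmse(observed, predicted):
    """Mean squared error normalized by the variance of the observed series.

    Assumed definition: 0 means a perfect prediction, and values above 1 mean
    the prediction does worse than simply using the mean of the observed traffic.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    variance = sum((o - mean_o) ** 2 for o in observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return mse / variance
```

Under these assumed definitions, a signature that follows the rise and fall of the observed traffic gives a CC close to 1, and a signature that stays close to the observed values gives an NMSE close to 0.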
[...] of HTTP traffic. It was caused by students applying for one of the 53 new classes of postgraduate courses. This means that both methods are able to characterize network traffic, and their predictions were able to display the normal behavior, making it possible to identify deviations.

V. CONCLUSION AND FUTURE WORKS

In this paper, we presented a novel model, called DSNSF-GA, to characterize network traffic using flow attributes, such as bits, packets and number of flows per second, through a genetic algorithm. In addition, we compared our model with another method, ACODS, which is based on ant colony optimization. In terms of computational complexity, both methods use metaheuristic algorithms to find an optimal solution; both run for a number of iterations until a certain condition is reached, and both used the same value.

We used a real data set of traffic, collected from the State University of Londrina (UEL). In our tests, both methods were able to characterize network traffic efficiently and to predict the standard behavior of an environment. Although we have no threshold to distinguish anomalous from standard behavior, we observe in the tables positive values for NMSE and CC, which indicates satisfactory predictions. In future work, we intend to increase the number of attributes, linking all of them to create a multidimensional model that uses a genetic algorithm to generate our DSNSF-GA.

ACKNOWLEDGMENT

This work was supported by SETI/Fundação Araucária through financial support for the Betelgeuse Project. The authors also thank São Paulo State Technological College (Fatec Ourinhos).

REFERENCES

[1] B. Trammell and E. Boschi, “An introduction to IP flow information export (IPFIX),” IEEE Communications Magazine, vol. 49, no. 4, pp. 89–95, 2011.
[2] K. Salah and A. Kahtani, “Performance evaluation comparison of Snort NIDS under Linux and Windows Server,” J. Network and Computer Applications, vol. 33, no. 1, pp. 6–15, 2010.
[3] B. B. Zarpelão, L. d. S. Mendes, and M. L. Proença Jr., “Anomaly detection aiming pro-active management of computer network based on digital signature of network segment,” Journal of Network and Systems Management, vol. 15, no. 2, pp. 267–283, 2007.
[4] A. Patcha and J.-M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.
[5] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly detection: Methods, systems and tools,” IEEE Communications Surveys and Tutorials, vol. 16, no. 1, pp. 303–336, 2014.
[6] S. J. Nanda and G. Panda, “A survey on nature inspired metaheuristic algorithms for partitional clustering,” Swarm and Evolutionary Computation, vol. 16, no. 0, pp. 1–18, 2014.
[7] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1992.
[8] U. Maulik and S. Bandyopadhyay, “Genetic algorithm-based clustering technique,” Pattern Recognition, vol. 33, no. 9, pp. 1455–1465, 2000.
[9] M. Dorigo, G. D. Caro, and L. M. Gambardella, “Ant algorithms for discrete optimization,” Artificial Life, vol. 5, pp. 137–172, 1999.
[10] A. Lakhina, M. Crovella, and C. Diot, “Characterization of network-wide anomalies in traffic flows,” in Internet Measurement Conference, 2004, pp. 201–206.
[11] H. Rahmani, N. Sahli, and F. Kamoun, “DDoS flooding attack detection scheme based on f-divergence,” Computer Communications, vol. 35, no. 11, pp. 1380–1391, 2012.
[12] M. L. Proença Jr., B. B. Zarpelão, and L. S. Mendes, “Anomaly detection for network servers using digital signature of network segment,” in Proceedings - Advanced Industrial Conference on Telecommunications/Service Assurance with Partial and Intermittent Resources Conference/E-Learning on Telecommunications Workshop AICT/SAPIR/ELETE 2005, vol. 2005, 2005, pp. 290–295, doi:10.1109/AICT.2005.26.
[13] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. New York, NY, USA: Cambridge University Press, 2011.
[14] M. H. A. C. Adaniya, T. Abrão, and M. L. Proença Jr., “Anomaly detection using metaheuristic firefly harmonic clustering,” Journal of Networks, vol. 8, no. 1, 2013.
[15] M. L. Proença Jr., C. Coppelmans, M. Bottoli, and L. Souza Mendes, “Baseline to help with network management,” in e-Business and Telecommunication Networks. Springer Netherlands, 2006, pp. 158–166, doi:10.1007/1-4020-4761-4_12.
[16] B. B. Zarpelão, L. S. Mendes, M. L. Proença Jr., and J. J. P. C. Rodrigues, “Parameterized anomaly detection system with automatic configuration,” in Global Telecommunications Conference, 2009. GLOBECOM 2009. IEEE, Nov 2009, pp. 1–6, doi:10.1109/GLOCOM.2009.5426189.
[17] M. V. O. Assis, J. J. P. C. Rodrigues, and M. L. Proença Jr., “A seven-dimensional flow analysis to help autonomous network management,” Information Sciences, vol. 278, pp. 900–913, 2014, doi:10.1016/j.ins.2014.03.102.
[18] M. H. A. C. Adaniya, M. F. Lima, J. J. P. C. Rodrigues, T. Abrão, and M. L. Proença Jr., “Anomaly detection using DSNS and firefly harmonic clustering algorithm,” in Communications (ICC), 2012 IEEE International Conference on, June 2012, pp. 1183–1187, doi:10.1109/ICC.2012.6364088.
[19] P. Phaal, S. Panchen, and N. McKee, “InMon Corporation's sFlow: A method for monitoring traffic in switched and routed networks,” RFC 3176, Tech. Rep., 2001.
[20] L. F. Carvalho, A. M. Zacaron, M. H. A. da Costa Adaniya, and M. L. Proença Jr., “Ant colony optimization for creating digital signature of network segments using flow analysis,” in SCCC, 2012, pp. 171–180.
[21] L. F. Carvalho, J. J. P. C. Rodrigues, S. Barbon, and M. L. Proença Jr., “Using ant colony optimization metaheuristic and dynamic time warping for anomaly detection,” in Software, Telecommunications and Computer Networks (SoftCOM), 2013 21st International Conference on, Sept 2013, pp. 1–5.
[22] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA, USA: MIT Press, 1998.
[23] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, no. 0, pp. 53–65, 1987.
[24] K. Bansal, S. Vadhavkar, and A. Gupta, “Brief application description. Neural networks based forecasting techniques for inventory control applications,” Data Min. Knowl. Discov., vol. 2, no. 1, pp. 97–102, 1998.