
CogInfoCom 2013 • 4th IEEE International Conference on Cognitive Infocommunications • December 2-5, 2013, Budapest, Hungary

Advances and practice in Internet of Things


A case study

György Terdik

Zoltán Gál

Faculty of Informatics
University of Debrecen
Debrecen, Hungary
Terdik.Gyorgy@inf.unideb.hu

Center of Arts, Humanities and Sciences


University of Debrecen
Debrecen, Hungary
ZGal@unideb.hu
Abstract: Internet of Things (IoT) services and technologies
are becoming more and more predominant in the current and future
era of the Internet. Many devices and applications are developed
and connected to classical IP networks. As with any new
network service, testing and evaluation are currently carried out
in intranet environments. Neither the methods for measuring the
efficiency of energy usage nor the communication and
sustainability aspects of these services have been determined
exactly. Well-defined solutions based on statistical analysis are
needed to gather practical knowledge about sensor-based
network nodes. Errors in the collected data arise at different
levels of the process: during the measurement of a given signal
by the sensor, during the transmission of the collected data
through the sensor access network, and finally during the storage
of the received data. In the case of IoT, evaluating the hardware
and software resource usage of services and applications is a very
different task from that in classical wired or wireless data
networks. Because the data streams come from a large number of
sensors, the statistical analysis of several dozens or hundreds of
time series becomes a complex processing problem. Time series
affected by errors pose a pre-processing challenge, which depends
on several simultaneous aspects such as the measured signal, the
applied IoT communication technology and the data storage
mechanism. The statistical analysis methods for the time series
produced by the preliminary processing differ from classical data
network performance evaluation methods since, for instance, IoT
streams are grouped in epoch time periods. A huge number of
variables distributed in time and physical space, each describing
a particular aspect of the measured and/or controlled system,
implies complex event processing. This paper presents a case study
of a complex data set captured from a supercomputer system with
18 TFLOPS capacity and 1.5 thousand CPU cores.

Keywords: IoT, sensor, actuator, HPC, statistical analysis,
cluster analysis, principal components

INTRODUCTION
A huge volume of data is collected from systems
equipped with physical sensors. The integration of these
systems into the Internet requires serious consideration and
profound analysis of the existing communication and storage
mechanisms. Data collected from sensor systems have
characteristics that differ significantly from the data of classical
communication services. Because the sensors measure various
physical phenomena, continuous data are requested and
transmitted through the sink node of the sensor network.
Additional data are transmitted in response to asynchronous
events generated at the sensor controllers. The data transfer from
the sensors to the server entity is affected not only by the link
quality: other conditions at the sensor controller, such as the
battery energy level and the variable intensity of the routing and
compression tasks, can leave periodic requests unanswered.
Such missing data elements are an accepted phenomenon in
sensor-based acquisition systems.

Given these special features of the data coming from the
sensors, new processing steps need to be applied for the
analysis of the measured quantities. Pre-processing of
the collected data is considered a necessary step for data
streams of sensor origin. Missing elements in the epoch
time periods should be filled in based on the previous
response or series of responses. The effective completion
algorithm depends on the type of the measured
quantity. Missing data for slowly varying quantities, such as
the temperature of the physical environment, can be replaced
relatively easily, but rapidly varying quantities, such as the
instantaneous CPU load or memory usage, need extra support
from the hardware and/or firmware level of the analyzed system.

PRELIMINARY WORK
A good overview of the sensing task and sensor network
applications is given in [2], [3], grouped into military,
environmental, health, home and other commercial
applications. Factors that influence the sensor network are
fault tolerance, scalability, production cost, hardware
constraints, topology, environment, transmission media and
power consumption. The protocol stack of the sensor network
should be extended from the classical OSI layer-based model
to a multi-plane model. Different planes are needed for sensor
network systems: power management, mobility management
and task management.

978-1-4799-1546-0/13/$31.00 ©2013 IEEE

The evaluation of Berkeley Mica1 and Mica2 motes [10]
by statistical methods and with TASK (Tiny Application
Sensor Kit) can be found in [3], [11] and [27]. Performance
benchmarks of the wireless sensor network are evaluated for
this data set. A presentation of the TinyDB database system is
given in [22], [24].

A comparison of WSN (Wireless Sensor Network) and
classical wired LAN technologies is presented in [4] and
[27]. Key issues of state-of-the-art sensor technology are
described from different points of view: technology, operating
system, communication protocol, upper layer services and
applications. A good classification and characterization of
sensor networks is presented in [4].

Books related to WSNs from the information and signal
processing perspectives are cited as [5], [9], [13], [14] and
[23]. Data analysis and evaluation of systems with a relatively
high number of variables is carried out in [11].

An adaptive model is proposed for the prediction of the data
collected from the sensors. Energy saving and compensation
for missing data caused by communication errors and by the
lack of data from the sensor are important mechanisms for
sensor networks [12], [17].

Correlation of the data sampled from spatially distributed
sensors is presented in [15], where a partially ordered tree (POT)
structure is proposed for cluster readings. This structure has
better performance in terms of energy usage, network lifetime
and monitoring.

IOT DATA MEASUREMENT SCENARIO AND EFFICIENCY
EVALUATION

Complex environment and resource data set of supercomputer
at the University of Debrecen

The data set has been captured from the High Performance
Computer (HPC) with 128 compute nodes (CN) placed in two
racks. Each rack is organized in four individual rack units
(IRU). Sixteen servers are placed in each IRU.

Fig. 1. The structure of the analyzed HPC system

Each CN server has two CPUs and each CPU contains six
cores with IEEE 754 64-bit floating-point architecture. The total
number of CPU cores in the system is 1536 and each core has its
own 4 GB DRAM memory bank (DIMMs), providing 6 TB of RAM
for the whole HPC system. The non-blocking communication
system is QDR InfiniBand with dedicated 40 Gbit/s links
between the rack switch and the CN servers. The total
computation capacity of this Massive Parallel Processing
(MPP) system is 18.04 TFLOPS (tera floating-point operations
per second). The CN servers run the Red Hat Linux
operating system and the job scheduler is Sun
Grid Engine (see Figure 1).

At each CN server, 20 variables from physical and logical
sensors were captured for a week with an epoch time of
10 seconds. The number of epoch time periods during the
measurement was N=55341. The names and meanings of the
variables from logical and physical sensors are presented in
Table I and Table II, respectively.

TABLE I. LOGICAL VARIABLES
No.  Variable     Meaning
1.   Load_one     Reported system load, averaged over one minute
2.   Load_five    Reported system load, averaged over five minutes
3.   Proc_run     Number of running processes
4.   Proc_total   Number of total processes
5.   Pkts_in      Number of packets read from all non-loopback interfaces
6.   Pkts_out     Number of packets written to all non-loopback interfaces
7.   Bytes_in     Number of bytes read from all non-loopback interfaces
8.   Bytes_out    Number of bytes written to all non-loopback interfaces
9.   Mem_free     Memory free
10.  CPU_user     Percentage of CPU cycles spent in user mode
11.  CPU_system   Percentage of CPU cycles spent in non-user mode

The time series from physical and logical sensors were
collected with the Ganglia distributed monitoring system. The
client process ran on a dedicated computer. The
gmetad daemon ran on the front-end node and the
gmond daemons ran on each of the CN servers of
the HPC system.

TABLE II. PHYSICAL VARIABLES
No.  Variable          Meaning
12.  System_Temp       CN server temperature
13.  CPU1_Temp         CPU1 temperature
14.  CPU2_Temp         CPU2 temperature
15.  P1_DIMM1A_Temp    Memory module temperature
16.  P1_DIMM2A_Temp    Memory module temperature
17.  P1_DIMM3A_Temp    Memory module temperature
18.  P2_DIMM1A_Temp    Memory module temperature
19.  P2_DIMM2A_Temp    Memory module temperature
20.  P2_DIMM3A_Temp    Memory module temperature

The physical variables were collected through a dedicated
control network between the CN servers and the front-end
node. The logical variables were transmitted through the QDR
InfiniBand links. Because the sampling interval was long
(10 seconds), the transmission of the collected data through the
internal network created only a negligible extra load on the
links (a few kb/s on the 40 Gb/s links).
The XML-format requests, sent with a 10-second period, did not
always return all 20x128=2560 answers within the current epoch
time. Because of this, pre-processing had to be executed.
By pre-processing we mean interpolation of
the missing values for each variable. The sampled variables
measure slowly varying phenomena, so the missing values
were filled in by linear interpolation.
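
The interpolation step can be sketched as follows. This is a minimal illustration rather than the procedure actually used for the data set: it assumes that an unanswered request is stored as `None`, and the function name `fill_missing_linear`, the sample values and the handling of gaps at the start or end of a series (copying the nearest known value) are assumptions made for the example.

```python
def fill_missing_linear(series):
    """Return a copy of `series` with None entries filled by linear interpolation."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series contains no known values")
    filled = list(series)
    # Assumption: gaps before the first or after the last known sample
    # simply copy the nearest known value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: straight line between the two neighbouring known samples.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical temperature readings with three unanswered epochs.
cpu1_temp = [41.0, None, None, 44.0, 44.5, None]
print(fill_missing_linear(cpu1_temp))  # [41.0, 42.0, 43.0, 44.0, 44.5, 44.5]
```

A straight line between neighbouring samples is adequate only when the quantity changes slowly relative to the 10-second epoch, which is the condition stated above.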
The compute nodes are organized in three queues with
serial, parallel and testing execution functions,
respectively: Q.SERIAL (nodes 1, 12..14, 18, 31..32),
Q.PARALLEL (nodes 2..11, 15..30, 33..127), Q.TEST (node
128). Because of the transient state of the tasks in the parallel
queue, several compute nodes were not analysed (hosts 19, 48,
49, 50, 51, 52).
Analysis of the extreme events

The detection of extreme events during the
working time of the HPC system was done by introducing a
function $\mathrm{Corr}_{s,m_1,m_2}(\alpha)$, where $\alpha$ is a band-width
parameter. This parameterized function calculates
the correlation of two different variables of the same
compute node (CN). For this purpose, the elements of the time
series were filtered according to the width of a band
around the moving average value.

Fig. 2. Extreme and ordinary events

Let $V_{s,m}(t)$ be a variable from Table I or Table II, where $s = 1..128$ is
the server ID, $m = 1..20$ is the variable ID and $t = 1..N$ ($N = 55{,}341$)
is the epoch number.

$$S_\alpha\{V_{s,m}(t)\} = \{\, V_{s,m}(\tau) \mid V_{s,m}(\tau) < LV_{s,m} \ \text{or}\ V_{s,m}(\tau) > HV_{s,m} \,\} \tag{1}$$

where

$$LV_{s,m} = \mathrm{MOVAVG}(V_{s,m}(t)) - \alpha \cdot \mathrm{STD}(V_{s,m}(t)) \tag{2}$$

$$HV_{s,m} = \mathrm{MOVAVG}(V_{s,m}(t)) + \alpha \cdot \mathrm{STD}(V_{s,m}(t)) \tag{3}$$

It can easily be seen that $S_\alpha\{V_{s,m}(t)\}$ is a subset of $V_{s,m}(t)$
and contains only those elements of the time series $V_{s,m}(t)$ that lie
outside the filter band, whose width depends on the parameter
$\alpha$. It is well known that for Gaussian processes the majority of
the elements lie around the expected value, within a range set by the
standard deviation. Our operator is given by the following
formula:

$$\mathrm{Corr}_{s,m_1,m_2}(\alpha) = \mathrm{Corr}(S_\alpha\{V_{s,m_1}(t)\}, S_\alpha\{V_{s,m_2}(t)\}) \tag{4}$$
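
The band filter and the correlation of the extreme events can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the band-width parameter is called `alpha` and is assumed to scale the standard deviation, the moving average uses a centered window of hypothetical length `window`, and the two filtered series are paired on the epochs where both variables fall outside their own bands. All function names are made up for the example.

```python
from statistics import pstdev


def moving_average(x, window):
    """Centered moving average; the window shrinks at the edges of the series."""
    half = window // 2
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def extreme_mask(x, alpha, window=31):
    """True where x lies outside [MOVAVG - alpha*STD, MOVAVG + alpha*STD]."""
    centre = moving_average(x, window)
    half_width = alpha * pstdev(x)  # assumption: alpha scales the overall STD
    return [v < c - half_width or v > c + half_width for v, c in zip(x, centre)]


def pearson(a, b):
    """Pearson correlation; NaN if one of the inputs is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / (var_a * var_b) ** 0.5


def extreme_corr(x1, x2, alpha, window=31):
    """Correlation of two variables restricted to their common extreme epochs."""
    m1 = extreme_mask(x1, alpha, window)
    m2 = extreme_mask(x2, alpha, window)
    # Assumption: keep only epochs where BOTH variables are outside their band.
    pairs = [(a, b) for a, b, e1, e2 in zip(x1, x2, m1, m2) if e1 and e2]
    if len(pairs) < 2:
        return float("nan")
    a, b = zip(*pairs)
    return pearson(a, b)


# Synthetic series: a flat signal with one spike, and a scaled copy of it.
load = [0.0] * 100
load[50] = 10.0
procs = [2 * v + 1 for v in load]
print(sum(extreme_mask(load, alpha=1.0, window=5)))  # number of extreme epochs
print(extreme_corr(load, procs, alpha=1.0, window=5))
```

The correlation of the ordinary events would use the complementary mask (values inside the band) in the same way.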

The separation of the extreme events from the
ordinary events as a function of the band parameter can be seen in
Figure 2.

Fig. 3. Extreme and ordinary events correlations (CN#6)

Correlation values of the ordinary events are calculated on
the complementary subsets of the two variables. For a node of the
parallel queue (Host6), the outside correlation, the inside correlation
and the average of the two can be seen in Figure 3.

Extreme events of a compute node (host) $s$ can be detected
by calculating the parameterized correlation function
$\mathrm{Corr}_{s,m_1,m_2}(\alpha)$ of two different variables, using only the
values that lie outside the band $[LV_{s,m}, HV_{s,m}]$, whose width is
controlled by the parameter $\alpha$.

For the test queue (Host128) the same relation can be
detected, but for other variables (see Figure 4).


clusters and inside the cluster the order is alphabetical, see
Figure 5.

Fig. 5. Primary Clusters

This correlation matrix shows that the variables of cluster C1
correlate very strongly with each other (correlation higher than 0.8),
but at the same time their correlation with the variables in cluster C3
is high as well. We conclude that there should be another clustering
which explains the structure of the variables more carefully. It looks
reasonable that the variables of these primary clusters, i.e. these
groups of variables, are possibly contained in different larger
groups. The process of clustering needs some dissimilarity
measure. We define the dissimilarity measure between two
variables as $1 - \mathrm{corr}^2$ and we have constructed clusters by the
method of linkage under the assumption that the maximum number of
clusters was fixed at 7, see [28] for details.
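
A linkage clustering of variables with the dissimilarity $1 - \mathrm{corr}^2$ can be sketched as follows. The text does not state which linkage rule was used, so average linkage is an assumption here, as are the function names and the small synthetic data set; only the variable names are taken from the tables.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def cluster_variables(columns, max_clusters):
    """Agglomerative clustering of variables; `columns` maps name -> samples."""
    names = list(columns)
    # Dissimilarity between two variables: 1 - corr^2.
    d = {(p, q): 1.0 - pearson(columns[p], columns[q]) ** 2
         for p in names for q in names}
    clusters = [[name] for name in names]

    def distance(c1, c2):
        # Assumption: average linkage (mean pairwise dissimilarity).
        return sum(d[p, q] for p in c1 for q in c2) / (len(c1) * len(c2))

    # Merge the two closest clusters until at most max_clusters remain.
    while len(clusters) > max_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Synthetic samples, for illustration only.
samples = {
    "Load_one": [1, 2, 3, 4, 5, 6],
    "Proc_run": [2, 4, 6, 8, 10, 13],
    "CPU1_Temp": [5, 1, 4, 2, 6, 3],
    "CPU2_Temp": [10, 2, 8, 4, 12, 7],
}
print(cluster_variables(samples, max_clusters=2))
```

Because the dissimilarity uses the squared correlation, strongly negatively correlated variables are treated as similar, which suits the goal of finding redundant variables.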

Fig. 4. Extreme and ordinary events correlations (CN#128)

Variable clusterization analysis


Taking into consideration the architecture and the execution
mechanisms of the analysed HPC system, different variables
can be grouped into clusters. For each server we form the
correlation matrix of the variables. In each case the variables
show strong correlations. Since the correlation matrix has
ones on the diagonal and is symmetric, we exclude the upper half
and the diagonal from the contour plot, see below. Based
on theoretical consideration of the role of the variables, we
set up clusters in order to collect the variables into groups
having the highest correlations inside the group and lower
correlations between groups.
These clusters are the following:
C1: '5: Load_one', '6: Load_five', '7: Proc_run', '8: Proc_total'
C2: '9: Pkts_in', '10: Pkts_out', '11: Bytes_in', '12: Bytes_out'
C3: '22: System_Temp', '23: CPU1_Temp', '24: CPU2_Temp'
C4: '25: P1_DIMM1A_Temp', '26: P1_DIMM2A_Temp', '27: P1_DIMM3A_Temp',
'28: P2_DIMM1A_Temp', '29: P2_DIMM2A_Temp', '30: P2_DIMM3A_Temp'
C5: '13: Mem_free'
C6: '18: CPU_user'
C7: '19: CPU_system'
For instance, for server number 2, the correlation matrix
when the order of the variables follows the order of the

Fig. 6. Clusters according to linkage, serial group

The result for the servers in the serial group is the following:
C1: '5: Load_one', '6: Load_five', '7: Proc_run',
'8: Proc_total', '18: CPU_user', '23: CPU1_Temp',
'24: CPU2_Temp'
C2: '22: System_Temp', '28: P2_DIMM1A_Temp',
'29: P2_DIMM2A_Temp', '30: P2_DIMM3A_Temp'
C3: '25: P1_DIMM1A_Temp',
'26: P1_DIMM2A_Temp', '27: P1_DIMM3A_Temp'
C4: '9: Pkts_in', '10: Pkts_out'
C5: '12: Bytes_out'
C6: '11: Bytes_in'
C7: '13: Mem_free'
C8: '19: CPU_system'
The number of clusters was chosen to be 8 because we
could not find a cluster for the individual variables listed at
the end of the list, see Figure 6.

reason to look for principal components. Naturally, if a
cluster contains only one variable, this variable remains the
same. The average weights of the principal components are
the following:

Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
0.9140      0.8511      0.8931      0.8641      0.8713

Each of these weights shows that the reduction of the
dimension is justified.
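
One way to obtain such a weight for a single cluster is sketched below. It assumes that "weight" means the share of the total variance carried by the first principal component of the cluster's correlation matrix; the text does not define the term, so this reading, the function name `first_pc_weight` and the example matrices are assumptions. The dominant eigenvalue is found by plain power iteration.

```python
def first_pc_weight(corr, iterations=200):
    """Share of total variance carried by the first principal component.

    `corr` is a symmetric correlation matrix given as a list of rows.
    """
    n = len(corr)
    # Non-uniform start vector, so that it is unlikely to be orthogonal
    # to the dominant eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        # Power iteration: multiply by the matrix and renormalise.
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the dominant eigenvalue.
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(n) for j in range(n))
    # The trace of a correlation matrix equals the number of variables.
    return eigenvalue / n


# Hypothetical three-variable cluster with pairwise correlation 0.7.
cluster_corr = [
    [1.0, 0.7, 0.7],
    [0.7, 1.0, 0.7],
    [0.7, 0.7, 1.0],
]
print(round(first_pc_weight(cluster_corr), 4))  # 0.8
```

A weight close to 1 indicates that a single component captures nearly all of the variation in the cluster, so replacing the cluster by that component loses little information.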
Another interesting question we address is whether the ordering of
the parallel servers is significant. Let us fix the values of the
variables and consider vectors of order 120 according to the
servers. We test the randomness of the order against
non-randomness. The runs test shows that randomness has been
rejected in 41%, 48%, 72%, 99%, 44%, 41%, 4% and 52% of the
cases for the respective clusters. The remarkable case is
cluster 7, i.e. the variable '19: CPU_system': the
percentage of CPU cycles spent in non-user mode supports the
hypothesis that the order of the servers has no influence on the
values of that variable.
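
A runs test on one such server-ordered vector can be sketched as follows. The variant of the test is not specified in the text; the sketch assumes the Wald-Wolfowitz runs test around the median with the normal approximation, and the function name and example sequences are illustrative only.

```python
import math
from statistics import median


def runs_test_p_value(values):
    """Two-sided p-value of the Wald-Wolfowitz runs test around the median."""
    med = median(values)
    # Classify each value as above or below the median; ties are dropped.
    signs = [v > med for v in values if v != med]
    n1 = sum(signs)
    n2 = len(signs) - n1
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 3:
        raise ValueError("not enough values on both sides of the median")
    # A run is a maximal block of consecutive equal signs.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    z = (runs - mean) / math.sqrt(var)
    # Normal approximation: P(|Z| >= |z|).
    return math.erfc(abs(z) / math.sqrt(2.0))


# A steadily increasing vector has only two runs, so randomness is rejected.
trend = list(range(20))
print(runs_test_p_value(trend) < 0.05)  # True
```

Applying the function to each server-ordered vector and counting how often the p-value falls below the chosen significance level gives rejection percentages of the kind quoted above.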
CONCLUSIONS
A huge number of data series can be collected from the virtual
and physical sensors of an HPC system, creating a large amount of
data and making the analysis complex. The compute nodes of the
analysed HPC system are grouped into three different queues:
serial, parallel and test. The majority of the servers were
assigned to the parallel queue to run jobs in parallel. The test
queue, with only one compute node, behaves similarly to the
servers of the serial queue. By clustering the variables, a
representative variable can be selected for each cluster. By
definition, the clusters should have strongly correlated
elements. The method proposed for calculating and
characterizing the intra-cluster correlation among the elements
can be used for characterizing the inter-cluster correlation as
well. It was found that the clusterization method presented in
this paper gives strong correlation inside the clusters and weak
correlation among the representative variables. This method
reduces the number of variables that need to be sampled from
20 to 8 (one variable per cluster), which means a 60% decrease
in the management data collected from the sensors of an
HPC system.

Fig. 7. Clusters according to linkage, parallel group

Similarly, we fixed the number of clusters at 8 for the servers in
the parallel queue. The clustering for the parallel queue is
more difficult because the number of servers here
is 120. The list of the clusters is the following:
C1: '5: Load_one', '6: Load_five', '7: Proc_run', '8: Proc_total'
C2: '18: CPU_user', '23: CPU1_Temp', '24: CPU2_Temp'
C3: '22: System_Temp', '28: P2_DIMM1A_Temp',
'29: P2_DIMM2A_Temp', '30: P2_DIMM3A_Temp'
C4: '25: P1_DIMM1A_Temp',
'26: P1_DIMM2A_Temp', '27: P1_DIMM3A_Temp'
C5: '9: Pkts_in', '10: Pkts_out', '12: Bytes_out'
C6: '11: Bytes_in'
C7: '19: CPU_system'
C8: '12: Bytes_out'
For instance, server number 36 provides the following
correlation matrix according to the new clustering, see Figure 7.
After fixing these clusters we looked at the problem of
decreasing the number of variables involved in the
analysis of the system. It looks reasonable to reduce the
original dimension, given by the number of variables, to
new variables whose number equals the number of clusters. The
method for doing so is principal component analysis. For each
cluster and each server we calculated the principal components,
and the first of them has been substituted for the cluster.
Clusters 6-8 each contain one variable, hence there is no
ACKNOWLEDGMENT
This work was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001
(FIRST: Future Internet Research, Services and Technology) project.
The project has been supported by the European Union, co-financed
by the European Social Fund.
REFERENCES
[1] M. L. Massie, B. N. Chun, D. E. Culler, "The ganglia distributed monitoring system: design, implementation, and experience," Parallel Computing, Volume 30, Issue 7, July 2004, pp. 817-840.
[2] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, "Wireless Sensor Networks: a Survey," Computer Networks 38 (2002), Elsevier, pp. 393-422.
[3] John A. Stankovic, "Wireless Sensor Networks," Communications of the ACM, Wireless sensor networks (2004), Vol. 47 Issue 6.
[4] Jennifer Yick, Biswanath Mukherjee, Dipak Ghosal, "Wireless sensor network survey," Computer Networks 52 (2008), Elsevier, pp. 2292-2330.
[5] Feng Zhao, Leonidas J. Guibas, "Wireless Sensor Networks: An Information Processing Approach," Elsevier (2004).
[6] Sam Madden, Joe Hellerstein, Wei Hong, "TinyDB: In-Network Query Processing in TinyOS," Technical Report, TinyOS Document, Version 0.4, September 2003.
[7] Ameer Ahmed Abbasi, Mohamed Younis, "A Survey on Clustering Algorithms for Wireless Sensor Networks," Computer Communications 30 (2007), Elsevier, pp. 2826-2841.
[8] Jamal N. Al-Karaki, Ahmed E. Kamal, "Routing techniques in Wireless Sensor Networks: a Survey," IEEE Wireless Communications, 11(6):6-28, 2004.
[9] Waltenegus Dargie, Christian Poellabauer, "Fundamentals of Wireless Sensor Networks," Wiley Series on Wireless Communications and Mobile Computing, 2010.
[10] Intel Berkeley lab. http://db.csail.mit.edu/labdata/labdata.html.
[11] Ping Ji, Marcin Szczodrak, "A Multivariate Model for Data Cleansing in Sensor Networks," The Second Annual Conference of the International Technology Alliance (2008).
[12] Yann-Ael Le Borgne, Silvia Santini, Gianluca Bontempi, "Adaptive Model Selection for Time Series Prediction in Wireless Sensor Networks," Signal Processing, Volume 87 Issue 12, December 2007.
[13] Ameer Ahmed Abbasi, Mohamed Younis, "A survey on clustering algorithms for wireless sensor networks," Computer Communications, Volume 30 Issue 14-15, October 2007, pp. 2826-2841.
[14] Baljeet Malhotra, Ionis Nikolaidis, Mario A. Nascimento, "Aggregation Convergecast Scheduling in Wireless Sensor Networks," Kluwer Academic Publishers, Wireless Networks, Volume 17 Issue 2, February 2011.
[15] Yong Hyun Cho, Jihoon Son, Yon Dohn Chung, "POT: An Efficient Top-k Monitoring Method for Spatially Correlated Sensor Readings," ACM Proceedings of the 5th workshop on Data management for sensor networks, 2008.
[16] Chong Liu, Kui Wu, Jian Pei, "An Energy Efficient Data Collection Framework for Wireless Sensor Networks by Exploiting Spatiotemporal Correlation," IEEE Transactions on Parallel and Distributed Systems, Volume 18 Issue 7, July 2007, pp. 1010-1023.
[17] David Chu, Amol Deshpande, Joseph M. Hellerstein, Wei Hong, "Approximate Data Collection in Sensor Networks using Probabilistic Models," IEEE Computer Society ICDE'06 Proceedings of the 22nd International Conference on Data Engineering, 2006.
[18] Hejun Wu, Qiong Luo, Jianjun Li, Alexandros Labrinidis, "Quality aware query scheduling in wireless sensor networks," ACM DMSN'09 Proceedings of the Sixth International Workshop on Data Management for Sensor Networks, 2009.
[19] David Yates, Erich Nahum, Jim Kurose, "Data Quality and Query Cost in Wireless Sensor Networks," PERCOMW '07 Proceedings of the Fifth IEEE International Conference on Pervasive Computing and Communications Workshops, 2007.
[20] Su Ping, "Delay Measurement Time Synchronization for Wireless Sensor Networks," ACM Transactions on Sensor Networks (TOSN), Volume 3 Issue 2, June 2007.
[21] Carlos Guestrin, Peter Bodik, Romain Thibaux, Mark Paskin, Samuel Madden, "Distributed Regression: an Efficient Framework for Modeling Sensor Network Data," IPSN'04 - Information Processing in Sensor Networks, 2004.
[22] "TinyDB: A Declarative Database for Sensor Networks," http://telegraph.cs.berkeley.edu/tinydb/documentation.htm
[23] Ian F. Akyildiz, Mehmet Can Vuran, "Wireless Sensor Networks," John Wiley & Sons Ltd. (2010), ISBN: 978-0-470-03601-3
[24] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, "TinyDB: An Acquisitional Query Processing System for Sensor Networks," ACM Trans. Database Syst., Vol. 30, No. 1 (March 2005), pp. 122-173.
[25] Gy. Terdik, T. Gyires, "Internet Traffic Modeling with Lévy Flights," IEEE/ACM Transactions on Networking, Vol. 17, No. 1, pp. 120-129, February 2009.
[26] Z. Gal, "VoIP LAN/MAN Traffic Analysis for NGN QoS Management," Infocommunications Journal, Volume LXIV, pp. 22-29, 2009.
[27] Zoltán Gál, György Terdik, "On the Statistical Analysis of Wireless Sensor vs. Wired Data Network Traffics," Carpathian Journal of Electronic and Computer Engineering, Vol. 4, No. 1, 2011, ISSN 1844-9689, pp. 41-47.
[28] Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis (Vol. 344). Wiley.
