
THE INSTITUTE OF ELECTRICAL

AND ELECTRONICS ENGINEERS INC.


Region 8Europe, Middle East and Africa, POLAND SECTION

CIRCUITS AND SYSTEMS CHAPTER

SPa 2018
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications

Conference Proceedings
Poznan, 19th–21st September 2018

POZNAN UNIVERSITY OF TECHNOLOGY


FACULTY OF COMPUTING
INSTITUTE OF AUTOMATION AND ROBOTICS
DIVISION OF SIGNAL PROCESSING AND ELECTRONIC SYSTEMS
UL. JANA PAWŁA II 24, 60-965 POZNAN, POLAND
Phone: +48 61 647 5941 Fax: +48 61 647 5940
www.spaconference.org.pl www.ieee.put.poznan.pl
Scientific Committee
Prof. Adam Dąbrowski - Chairman
Prof. Ryszard Choraś
Prof. Andrzej Czyżewski
Prof. Anthony Davies
Prof. Patrick Dewilde
Prof. Andrzej Dobrucki
Prof. Andrzej Dziech
Prof. Ewa Hermanowicz
Prof. Mos Kaveh
Prof. Piotr Kleczkowski
Prof. Christian Kollmitzer
Prof. Bożena Kostek
Prof. Krzysztof Kozłowski
Prof. Rolf Kraemer
Prof. Zbigniew Kulka
Prof. Andrzej Materka
Prof. Józef Modelski
Prof. Brian C.J. Moore
Prof. George Moschytz
Prof. Andrzej Napieralski
Prof. Peter Noll
Prof. Maciej Ogorzałek
Prof. Stanisław Osowski
Prof. Aleksander Petrovsky
Prof. Kamisetty R. Rao
Prof. Thomas Sikora
Prof. Władysław Skarbek
Prof. Ryszard Tadeusiewicz
Prof. Ralph Urbansky
Prof. Joos Vandewalle
Prof. Heinrich T. Vierhaus
Prof. Ryszard Wojtyna
Prof. Jan Zarzycki
Prof. Tomasz Zieliński

Organizing Committee
Prof. Adam Dąbrowski - Chairman
Julian Balcerek - Secretary
Tomasz Marciniak - Publication Chair
Damian Cetnarowicz - Proceedings Editor
Paweł Pawłowski - Conference website
Małgorzata Piskorz - Financial Chair
Adam Konieczka
Piotr Kardyś
Tomasz Janiak
Szymon Drgas
Andrzej Meyer
Karol Piniarski
Andrzej Kubacki

ISBN-13 978-83-62065-31-8
IEEE Conference Record # 44815
Table of Contents
Program summary ...................................................................................................................................................... 6
General information ................................................................................................................................................... 10
TUTORIALS:
I. Bart M. ter Haar Romeny, Vision for Vision – Deep Learning in Retinal Image Analysis ........................... 12
II. Heinrich Theodor Vierhaus, Migrating Electronic Systems from Fault Tolerant Computing to Error
Resilience ....................................................................................................................................................... 13
III. Paweł Strumiłło, Electronic Systems and Interfaces Aiding the Visually Impaired ...................................... 14
IV. Adam Dąbrowski, Contemporary technologies and techniques for processing of human eye images .......... 15
SESSION 1: DSP Theory & Implementation 1
1. Grzegorz Szwoch, Suppression of distortions in signals received from Doppler sensor for vehicle speed
measurement ...................................................................................................................................................... 16
2. Marek Kulawiak, Programmatic Simulation of Laser Scanning Products ......................................................... 22
3. Fatih Serdar Sayin, Sertan Ozen, Ulvi Baspinar, Hand Gesture Recognition by Using sEMG Signals for
Human Machine Interaction Applications ............................................................................................. 27
4. Jerzy Fiołka, Preliminary investigation of the in-cylinder pressure signal using Teager energy operator ......... 31
SESSION 2: Image Processing 1
5. Faezeh Fallah, Bin Yang, Sven S. Walter, Fabian Bamberg, Hierarchical Feature-learning Graph-based
Segmentation of Fat-Water MR Images ................................................................................................ 37
6. Jalil Nourmohammadi-Khiarak, Samaneh Mazaheri, Rohollah Moosavi-Tayebi, Hamid Noorbakhsh-
Devlagh, Object Detection utilizing Modified Auto Encoder and Convolutional Neural Networks
................................................................................................................................................................ 43
7. Emre Canayaz, Veysel Gökhan Böcek, Comparison of Performance of Different Background
Subtraction Methods for Detection of Heavy Vehicles ......................................................................... 50
8. Grzegorz Sarwas, Sławomir Skoneczny, FSIFT based feature points for face hierarchical clustering .............. 55
SESSION 3: DSP Theory & Implementation 2
9. Krzysztof Krupa, Marcin Grochowina, Microprocessor implementation of the sound source location
process based on the correlation of signals ........................................................................................... 59
10. J. Kotus, Determination of the Vehicles Speed Using Acoustic Vector Sensor ................................................ 64
11. Michał Pielka, Paweł Janik, Małgorzata Aneta Janik, Zygmunt Wróbel, An adaptive transmission
algorithm for an inertial motion capture system in the aspect of energy saving ................................... 70
12. Marek Parfieniuk, Sang Yoon Park, A critique of some rough approximations of the DCT ............................. 76
13. Tomasz Grzywalski, Szymon Drgas, Application of recurrent U-net architecture to speech enhancement ...... 82
SESSION 4: Image Processing 2
14. Marcin Matłacz, Grzegorz Sarwas, Crowd counting using complex convolutional neural network ................. 88
15. Michał Bednarek, Krzysztof Walas, Simulated Local Deformation & Focal Length Optimisation For
Improved Template-Based 3D Reconstruction of Non-Rigid Objects .................................................. 93
16. S. Cygert, A. Czyżewski, Vehicle detector training with labels derived from background subtraction
algorithms in video surveillance ............................................................................................................ 98
17. Szymon Zaporowski, Joanna Gołębiewska, Bożena Kostek, Julia Piltz, Audio-visual aspect of the
Lombard effect and comparison with recordings depicting emotional states ........................................ 104
18. Ba chien Thai, Anissa Mokraoui, Basarab Matei, HDR Image Tone Mapping Approach based on Near
Optimal Separable Adaptive Lifting Scheme ........................................................................................ 108
SESSION 5: DSP Theory & Implementation 3
19. Adam Borowicz, On Using Quaternionic Rotations for Independent Component Analysis ............................... 114
20. Nick A. Petrovsky, Eugene V. Rybenkov, Alexander A. Petrovsky, Two-dimensional non-separable
quaternionic paraunitary filter banks ..................................................................................................... 120
21. Radu Matei, Elliptically-Shaped IIR Digital Filters Designed Using Frequency Transformations ................... 126

22. Yaprak Eminaga, Adem Coskun, Izzet Kale, IIR Wavelet Filter Banks for ECG Signal Denoising ................ 130
23. Paweł Pawłowski, Adam Pawlikowski, Rafał Długosz, Adam Dąbrowski, Programmable, switched-
capacitor finite impulse response filter realized in CMOS technology for education purposes ............ 134
SESSION 6: Image Processing 3
24. Piotr Janus, Tomasz Kryjak, Hardware implementation of the Gaussian Mixture Model foreground
object segmentation algorithm working with ultra-high resolution video stream in real-time .............. 140
25. Marcin Kociolek, Peter Bajcsy, Mary Brady, Antonio Cardone, Interpolation-Based Gray-Level Co-
Occurrence Matrix Computation for Texture Directionality Estimation ............................................... 146
26. Marcin Kociolek, Michal Strzelecki, Szymon Szymajda, On the influence of the image normalization
scheme on texture classification accuracy ............................................................................................. 152
27. Michał Bednarek, Krzysztof Walas, Spatial Transformations in Deep Neural Networks ................................. 158
28. Jakub Bednarek, Karol Piaskowski, Michał Bednarek, Methods of Enriching The Flow of Information in
The Real-Time Semantic Segmentation Using Deep Neural Networks ................................................ 163
SESSION 7: Biomedical & Biometric Apps. 1
29. Jakub Jurek, Marek Kociński, Andrzej Materka, Are Losnegård, Lars Reisætery, Ole J. Halvorsen,
Christian Beisland, Jarle Rørvik, Arvid Lundervold, Dictionary-based through-plane
interpolation of prostate cancer T2-weighted MR images ..................................................................... 168
30. Jakub Jurek, Mateusz Peleszy, Andrzej Wojciechowski, Artur Klepaczko, Marek Kociński, Andrzej
Materka, Are Losnegård, Lars Reisætery, Ole J. Halvorsen, Christian Beisland, Jarle Rørvik,
Arvid Lundervold, CRF-Based Clustering of Pharmacokinetic Curves from Dynamic Contrast-
Enhanced MR Images ........................................................................................................................... 174
31. Carlos Vinhais, Marek Kociński, Andrzej Materka, Centerline-Radius Polygonal-Mesh Modeling of
Bifurcated Blood Vessels in 3D Images using Conformal Mapping ..................................................... 180
32. Marcin Grochowina, Lucyna Leniowska, Design and implementation of a device supporting automatic
diagnosis of arteriovenous fistula .......................................................................................................... 186
33. Lukasz Kubus, Alexander Yastrebov, Katarzyna Poczeta, Magdalena Poterala, The use of fuzzy
cognitive maps in evaluation of prognosis of chronic heart failure patients ......................................... 191
SESSION 8: Audio Processing 1
34. Akira Ikuta, Hisako Orimoto, Fuzzy Bayesian Filter for Sound Environment by Considering Additive
Property of Energy Variable and Fuzzy Observation in Decibel Scale ................................................. 197
35. Cezary Wernik, Grzegorz Ulacha, Application of adaptive Golomb codes for lossless audio compression ..... 203
36. Karolina Marciniuk, Maciej Szczodrak, Andrzej Czyżewski, An application of acoustic sensors for the
monitoring of road traffic ...................................................................................................................... 208
37. Damian Koszewski, Bozena Kostek, Low-level audio descriptors-based analysis of music mixes from
different Digital Audio Workstations – case study ................................................................................ 213
38. Przemysław Falkowski-Gilski, Transmitting Alarm Information in DAB+ Broadcasting System ................... 217
SESSION 9: DSP Implementations
39. Piotr Kłosowski, Deep Learning for Natural Language Processing and Language Modelling .......................... 223
40. Wladyslaw Magiera, Urszula Libal, Statistical properties of signals approximated by orthogonal
polynomials and Schur parametrization ................................................................................................ 229
41. A.A.Kim, O.O.Lukovenkova, Yu.V.Marapulets, A.B.Tristanov, Using a sparse model to evaluate the
internal structure of impulse signals ...................................................................................................... 235
42. Tomasz Maka, Miroslaw Lazoryszczak, Detecting the Number of Speakers in Speech Mixtures by
Human and Machine ............................................................................................................................. 239
43. Janusz Rafałko, Marking the Allophones Boundaries Based on the DTW Algorithm ...................................... 245
44. Adam Konieczka, Ewelina Michałowicz, Karol Piniarski, Infrared thermal camera-based system for tram
drivers warning about hazardous situations ........................................................................................... 250
45. Adam Bykowski, Szymon Kupiński, Feature matching and ArUco markers application in mobile eye
tracking studies ...................................................................................................................................... 255
46. Marianna Parzych, Tomasz Marciniak, Adam Dąbrowski, Adaptive methods of time-dependent crowd
density distribution visualization ........................................................................................................... 261

47. Julian Balcerek, Mateusz Łuczak, Paweł Pawłowski, Adam Dąbrowski, Automatic recognition of image
details using stereovision and 2D algorithms ........................................................................................ 268
48. Zenon Kidoń, Jerzy Fiołka, Evaluation postural stability using complex-valued data Fourier analysis of
the follow-up posturographic trajectories .............................................................................................. 274
SESSION 10: Biomedical & Biometric Apps. 2
49. Artur Klepaczko, Martyna Muszelska, Eli Eikefjord, Jarle Rørvik, Arvid Lundervold, Automated
determination of arterial input function in DCE-MR images of the kidney .......................................... 280
50. Michał Strzelecki, Artur Klepaczko, Martyna Muszelska, Eli Eikefjord, Jarle Rørvik, Arvid Lundervold,
An artificial neural network for GFR estimation in the DCE-MRI studies of the kidneys ................... 286
51. Artur Klepaczko, Piotr Skulimowski, Michał Strzelecki, Ludomir Stefańczyk, Eli Eikefjord, Jarle
Rørvik, Arvid Lundervold, Numerical simulation of the b-SSFP sequence in MR perfusion-
weighted imaging of the kidney ............................................................................................................ 292
52. Baixiang Zhao, John Soraghan, Gaetano Di-caterina, Lykourgos Petropoulakis, Derek Grose, Trushali
Doshi, Automatic 3D segmentation of MRI data for detection of head and neck cancerous lymph
nodes ..................................................................................................................................................... 298
SESSION 11: Speech Processing
53. Hisako Orimoto, Akira Ikuta, Noise Cancellation Method for Speech Signal by Using an Extension Type
UKF ....................................................................................................................................................... 304
54. Marzena Mięsikowska, Speech Intelligibility in the presence of X4 Unmanned Aerial Vehicle ...................... 310
55. Hugo Cordeiro, Carlos Meneses, Low band continuous speech system for voice pathologies
identification .......................................................................................................................................... 315
56. Maxim Vashkevich, Elias Azarov, Alexander Petrovsky, Yuliya Rushkevich, Features extraction for the
automatic detection of ALS disease from acoustic speech signals ........................................................ 321
SESSION 12: Communication Apps.
57. Marek Kulawiak, Przemysław Falkowski-Gilski, Marcin Kulawiak, DAB+ Coverage Analysis: a New
Look at Network Planning using GIS Tools ......................................................................................... 327
58. Anatoliy Platonov, Ievgen Zaitsev, Perfect Low Power Narrowband Transmitters for Dense Wireless
Sensor Networks ................................................................................................................................... 332
59. Jan Wietrzykowski, Probabilistic reasoning for indoor positioning with sequences of WiFi fingerprints ........ 338
60. Paweł Pawłowski, Karol Piniarski, Adam Dąbrowski, Selection and tests of lossless and lossy video
codecs for advanced driver-assistance systems ..................................................................................... 344
SESSION 13: Audio Processing 2
61. Tatsiana Viarbitskaya, Andrzej Dobrucki, Audio processing with using Python language science
libraries .................................................................................................................................................. 350
62. Michał Łuczyński, Stefan Brachmański, Andrzej Dobrucki, Active elimination of tonal components in
acoustic signals ...................................................................................................................................... 355
63. Maciej Sabiniok, Stefan Brachmański, Analysis of application possibilities of Grey System Theory to
detection of acoustic feedback ............................................................................................................... 361
64. Krzysztof Sozański, Anna Sozańska, Low Frequency Loudspeaker Measurements Using An Anechoic
Acoustic Chamber ................................................................................................................................. 367

Index of Authors ....................................................................................................................................................... 373

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th - 21st, 2018, Poznań, POLAND

Program Summary
September 19th, 2018 (Wednesday)
9:00 OPENING (Room 201)
Chairman: Adam Dąbrowski
9:15-10:00 TUTORIAL I: Bart M. ter Haar Romeny, Vision for Vision – Deep Learning in Retinal Image Analysis
Chairman: Adam Dąbrowski
10:00-10:20 Coffee Break
10:20-11:40 SESSION 1: DSP Theory & Implementation 1 (Room 201), Chairman: Tomasz Marciniak
            SESSION 2: Image Processing 1 (Room 202), Chairman: Bart M. ter Haar Romeny
10:20 Room 201: Grzegorz Szwoch, Suppression of distortions in signals received from Doppler sensor for vehicle speed measurement
      Room 202: Faezeh Fallah, Bin Yang, Sven S. Walter, Fabian Bamberg, Hierarchical Feature-learning Graph-based Segmentation of Fat-Water MR Images
10:40 Room 201: Marek Kulawiak, Programmatic Simulation of Laser Scanning Products
      Room 202: Jalil Nourmohammadi-Khiarak, Samaneh Mazaheri, Rohollah Moosavi-Tayebi, Hamid Noorbakhsh-Devlagh, Object Detection utilizing Modified Auto Encoder and Convolutional Neural Networks
11:00 Room 201: Fatih Serdar Sayin, Sertan Ozen, Ulvi Baspinar, Hand Gesture Recognition by Using sEMG Signals for Human Machine Interaction Applications
      Room 202: Emre Canayaz, Veysel Gökhan Böcek, Comparison of Performance of Different Background Subtraction Methods for Detection of Heavy Vehicles
11:20 Room 201: Jerzy Fiołka, Preliminary investigation of the in-cylinder pressure signal using Teager energy operator
      Room 202: Grzegorz Sarwas, Sławomir Skoneczny, FSIFT based feature points for face hierarchical clustering
11:40-12:00 Coffee Break
12:00-14:00 SESSION 3: DSP Theory & Implementation 2 (Room 201), Chairman: Adam Dąbrowski
            SESSION 4: Image Processing 2 (Room 202), Chairman: Paweł Pawłowski
12:00 Room 201: Krzysztof Krupa, Marcin Grochowina, Microprocessor implementation of the sound source location process based on the correlation of signals
      Room 202: Marcin Matłacz, Grzegorz Sarwas, Crowd counting using complex convolutional neural network
12:20 Room 201: J. Kotus, Determination of the Vehicles Speed Using Acoustic Vector Sensor
      Room 202: Michał Bednarek, Krzysztof Walas, Simulated Local Deformation & Focal Length Optimisation For Improved Template-Based 3D Reconstruction of Non-Rigid Objects
12:40 Room 201: Michał Pielka, Paweł Janik, Małgorzata Aneta Janik, Zygmunt Wróbel, An adaptive transmission algorithm for an inertial motion capture system in the aspect of energy saving
      Room 202: S. Cygert, A. Czyżewski, Vehicle detector training with labels derived from background subtraction algorithms in video surveillance
13:00 Room 201: Marek Parfieniuk, Sang Yoon Park, A critique of some rough approximations of the DCT
      Room 202: Szymon Zaporowski, Joanna Gołębiewska, Bożena Kostek, Julia Piltz, Audio-visual aspect of the Lombard effect and comparison with recordings depicting emotional states
13:20 Room 201: Tomasz Grzywalski, Szymon Drgas, Application of recurrent U-net architecture to speech enhancement
      Room 202: Ba chien Thai, Anissa Mokraoui, Basarab Matei, HDR Image Tone Mapping Approach based on Near Optimal Separable Adaptive Lifting Scheme
14:00-15:00 Lunch
15:00 TUTORIAL II: Heinrich Theodor Vierhaus, Migrating Electronic Systems from Fault Tolerant
Computing to Error Resilience
Chairman: Adam Dąbrowski
15:50-17:30 SESSION 5: DSP Theory & Implementation 3 (Room 201), Chairman: Julian Balcerek
            SESSION 6: Image Processing 3 (Room 202), Chairman: Adam Konieczka
15:50 Room 201: Adam Borowicz, On Using Quaternionic Rotations for Independent Component Analysis
      Room 202: Piotr Janus, Tomasz Kryjak, Hardware implementation of the Gaussian Mixture Model foreground object segmentation algorithm working with ultra-high resolution video stream in real-time
16:10 Room 201: Nick A. Petrovsky, Eugene V. Rybenkov, Alexander A. Petrovsky, Two-dimensional non-separable quaternionic paraunitary filter banks
      Room 202: Marcin Kociolek, Peter Bajcsy, Mary Brady, Antonio Cardone, Interpolation-Based Gray-Level Co-Occurrence Matrix Computation for Texture Directionality Estimation
16:30 Room 201: Radu Matei, Elliptically-Shaped IIR Digital Filters Designed Using Frequency Transformations
      Room 202: Marcin Kociolek, Michal Strzelecki, Szymon Szymajda, On the influence of the image normalization scheme on texture classification accuracy
16:50 Room 201: Yaprak Eminaga, Adem Coskun, Izzet Kale, IIR Wavelet Filter Banks for ECG Signal Denoising
      Room 202: Michał Bednarek, Krzysztof Walas, Spatial Transformations in Deep Neural Networks
17:10 Room 201: Paweł Pawłowski, Adam Pawlikowski, Rafał Długosz, Adam Dąbrowski, Programmable, switched-capacitor finite impulse response filter realized in CMOS technology for education purposes
      Room 202: Jakub Bednarek, Karol Piaskowski, Michał Bednarek, Methods of Enriching The Flow of Information in The Real-Time Semantic Segmentation Using Deep Neural Networks
18:00 Welcome Party

September 20th, 2018 (Thursday)
9:00-9:45 TUTORIAL III: Paweł Strumiłło, Electronic Systems and Interfaces Aiding the Visually Impaired
Chairman: Adam Dąbrowski
9:45-10:00 Coffee Break
10:00-11:40 SESSION 7: Biomedical & Biometric Apps. 1 (Room 201), Chairman: Michał Strzelecki
            SESSION 8: Audio Processing 1 (Room 202), Chairman: Szymon Drgas
10:00 Room 201: Jakub Jurek, Marek Kociński, Andrzej Materka, Are Losnegård, Lars Reisætery, Ole J. Halvorsen, Christian Beisland, Jarle Rørvik, Arvid Lundervold, Dictionary-based through-plane interpolation of prostate cancer T2-weighted MR images
      Room 202: Akira Ikuta, Hisako Orimoto, Fuzzy Bayesian Filter for Sound Environment by Considering Additive Property of Energy Variable and Fuzzy Observation in Decibel Scale
10:20 Room 201: Jakub Jurek, Mateusz Peleszy, Andrzej Wojciechowski, Artur Klepaczko, Marek Kociński, Andrzej Materka, Are Losnegård, Lars Reisætery, Ole J. Halvorsen, Christian Beisland, Jarle Rørvik, Arvid Lundervold, CRF-Based Clustering of Pharmacokinetic Curves from Dynamic Contrast-Enhanced MR Images
      Room 202: Cezary Wernik, Grzegorz Ulacha, Application of adaptive Golomb codes for lossless audio compression
10:40 Room 201: Carlos Vinhais, Marek Kociński, Andrzej Materka, Centerline-Radius Polygonal-Mesh Modeling of Bifurcated Blood Vessels in 3D Images using Conformal Mapping
      Room 202: Karolina Marciniuk, Maciej Szczodrak, Andrzej Czyżewski, An application of acoustic sensors for the monitoring of road traffic
11:00 Room 201: Marcin Grochowina, Lucyna Leniowska, Design and implementation of a device supporting automatic diagnosis of arteriovenous fistula
      Room 202: Damian Koszewski, Bozena Kostek, Low-level audio descriptors-based analysis of music mixes from different Digital Audio Workstations – case study
11:20 Room 201: Lukasz Kubus, Alexander Yastrebov, Katarzyna Poczeta, Magdalena Poterala, The use of fuzzy cognitive maps in evaluation of prognosis of chronic heart failure patients
      Room 202: Przemysław Falkowski-Gilski, Transmitting Alarm Information in DAB+ Broadcasting System
11:40-12:00 Coffee Break
12:00-13:00 SESSION 9: DSP Implementations (Posters) Chairman: Damian Cetnarowicz
Piotr Kłosowski, Deep Learning for Natural Language Processing and Language Modelling
Wladyslaw Magiera, Urszula Libal, Statistical properties of signals approximated by orthogonal
polynomials and Schur parametrization
A.A.Kim, O.O.Lukovenkova, Yu.V.Marapulets, A.B.Tristanov, Using a sparse model to evaluate the
internal structure of impulse signals
Tomasz Maka, Miroslaw Lazoryszczak, Detecting the Number of Speakers in Speech Mixtures by Human
and Machine
Janusz Rafałko, Marking the Allophones Boundaries Based on the DTW Algorithm
Adam Konieczka, Ewelina Michałowicz, Karol Piniarski, Infrared thermal camera-based system for tram
drivers warning about hazardous situations
Adam Bykowski, Szymon Kupiński, Feature matching and ArUco markers application in mobile eye
tracking studies
Marianna Parzych, Tomasz Marciniak, Adam Dąbrowski, Adaptive methods of time-dependent crowd
density distribution visualization
Julian Balcerek, Mateusz Łuczak, Paweł Pawłowski, Adam Dąbrowski, Automatic recognition of image
details using stereovision and 2D algorithms
Zenon Kidoń, Jerzy Fiołka, Evaluation postural stability using complex-valued data Fourier analysis of the
follow-up posturographic trajectories
13:00-14:00 Lunch
14:30-18:00 Social Event
18:00 Banquet

September 21st, 2018 (Friday)
9:00-9:45 TUTORIAL IV: Adam Dąbrowski, Contemporary technologies and techniques for processing of human
eye images
Chairman: Michał Strzelecki
9:45-10:00 Coffee Break
10:00-11:40 SESSION 10: Biomedical & Biometric Apps. 2 (Room 201), Chairman: Paweł Strumiłło
            SESSION 11: Speech Processing (Room 202), Chairman: Krzysztof Sozański
10:00 Room 201: Artur Klepaczko, Martyna Muszelska, Eli Eikefjord, Jarle Rørvik, Arvid Lundervold, Automated determination of arterial input function in DCE-MR images of the kidney
      Room 202: Hisako Orimoto, Akira Ikuta, Noise Cancellation Method for Speech Signal by Using an Extension Type UKF
10:20 Room 201: Michał Strzelecki, Artur Klepaczko, Martyna Muszelska, Eli Eikefjord, Jarle Rørvik, Arvid Lundervold, An artificial neural network for GFR estimation in the DCE-MRI studies of the kidneys
      Room 202: Marzena Mięsikowska, Speech Intelligibility in the presence of X4 Unmanned Aerial Vehicle
10:40 Room 201: Artur Klepaczko, Piotr Skulimowski, Michał Strzelecki, Ludomir Stefańczyk, Eli Eikefjord, Jarle Rørvik, Arvid Lundervold, Numerical simulation of the b-SSFP sequence in MR perfusion-weighted imaging of the kidney
      Room 202: Hugo Cordeiro, Carlos Meneses, Low band continuous speech system for voice pathologies identification
11:00 Room 201: Baixiang Zhao, John Soraghan, Gaetano Di-caterina, Lykourgos Petropoulakis, Derek Grose, Trushali Doshi, Automatic 3D segmentation of MRI data for detection of head and neck cancerous lymph nodes
      Room 202: Maxim Vashkevich, Elias Azarov, Alexander Petrovsky, Yuliya Rushkevich, Features extraction for the automatic detection of ALS disease from acoustic speech signals
11:20-11:40 Coffee Break
11:40-13:00 SESSION 12: Communication Apps. (Room 201), Chairman: Alexander Petrovsky
            SESSION 13: Audio Processing 2 (Room 202), Chairman: Andrzej Meyer
11:40 Room 201: Marek Kulawiak, Przemysław Falkowski-Gilski, Marcin Kulawiak, DAB+ Coverage Analysis: a New Look at Network Planning using GIS Tools
      Room 202: Tatsiana Viarbitskaya, Andrzej Dobrucki, Audio processing with using Python language science libraries
12:00 Room 201: Anatoliy Platonov, Ievgen Zaitsev, Perfect Low Power Narrowband Transmitters for Dense Wireless Sensor Networks
      Room 202: Michał Łuczyński, Stefan Brachmański, Andrzej Dobrucki, Active elimination of tonal components in acoustic signals
12:20 Room 201: Jan Wietrzykowski, Probabilistic reasoning for indoor positioning with sequences of WiFi fingerprints
      Room 202: Maciej Sabiniok, Stefan Brachmański, Analysis of application possibilities of Grey System Theory to detection of acoustic feedback
12:40 Room 201: Paweł Pawłowski, Karol Piniarski, Adam Dąbrowski, Selection and tests of lossless and lossy video codecs for advanced driver-assistance systems
      Room 202: Krzysztof Sozański, Anna Sozańska, Low Frequency Loudspeaker Measurements Using An Anechoic Acoustic Chamber
13:00 CLOSING (Room 201)
13:15-14:00 Lunch

https://conferences.ieee.org/conferences_events/conferences/conferencedetails/44815

General information
The organizer of the 22nd IEEE SPA (Signal Processing Algorithms, Architectures, Arrangements, and Applications) Conference (19th–21st September 2018, Poznań, Poland) is the Polish IEEE Circuits and Systems (CAS) Chapter. The SPA Conference venue is the Center for Mechatronics, Biomechanics, and Nanoengineering at the Piotrowo Campus of the Poznan University of Technology.
The main goal of the IEEE SPA conferences is, as always, to present the newest achievements in the wide and interdisciplinary area of signal processing and to integrate researchers active in this and related fields of science and technology. The discussed problems cover a broad spectrum of topics, including signal theory, methods for processing multimedia (audio, image, video) and other signals (including biological and medical signals), various applications of signal processing, e.g. in biometrics, and finally realization techniques and technologies.
It is a great honor for all of us that the President of the Poznan University of Technology, Professor Tomasz Łodygowski, and the Dean of the Faculty of Computing, Professor Andrzej Jaszkiewicz, are opening our meeting.
Moreover, many distinguished scientists and experts are our guests and conference participants.
We will have four very interesting tutorial presentations which, as usual, cover a broad spectrum of major signal processing problems: starting with deep learning in retinal image analysis, continuing with fault-tolerant and error-resilient digital electronic systems, followed by a discussion of electronic systems and interfaces aiding visually impaired people, and finishing with contemporary technologies and techniques for processing of human eye images.
The first tutorial will be presented just after the official conference opening by Prof. Bart M.
ter Haar Romeny (Dept. of Biomedical Engineering, Eindhoven University of Technology, the
Netherlands).
The second tutorial prepared by Professor Heinrich T. Vierhaus (Computer Engineering
Group, Brandenburg University of Technology Cottbus, Germany) will be presented in the
afternoon of the first conference day.
The third tutorial prepared by Professor Paweł Strumiłło (Medical Electronics Division,
Institute of Electronics, Lodz University of Technology, Poland) is planned for the morning of
the second day.
Finally, in the morning of the third (last) conference day, we will have an opportunity to
participate in the fourth tutorial prepared by myself.
The regular papers will be presented in oral sessions held in two parallel streams, together with one poster session. The papers were accepted on the basis of at least three reviews per paper, prepared by the Scientific Committee members and by additionally invited reviewers. Many thanks to them for their time, effort, and assistance. The overall paper acceptance rate was ca. 70%.
I hope that the present IEEE SPA conference will prove to be a valuable and helpful scientific and professional event.
On behalf of the Scientific and Organizing Committees I would like to welcome all of you to
the meeting!

Poznan, September 2018

Professor Adam Dąbrowski


Chairman of the IEEE SPA Scientific and Organizing Committees

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.    September 19th-21st, 2018, Poznań, POLAND

Vision for Vision – Deep Learning in Retinal Image Analysis

Bart M. ter Haar Romeny


Department of Biomedical Engineering
Eindhoven University of Technology
The Netherlands

Abstract—Automated, fast and large-scale computer-aided diagnosis of medical images has become reality. The greatest
breakthrough is Deep Learning. It already has a huge impact in self-driving cars, industrial product inspection, surveillance,
robotics and translation services, and in the medical arena it is outperforming human experts in many domains.
However, it is still largely a black box. What can we learn from recent insights in the functionality, nanometer-scale
connectivity and self-organization of the human visual brain? We will discuss several recent breakthroughs in our
understanding of visual perception and visual deep learning.
We apply these techniques in the RetinaCheck project, a large screening / early warning project for eye damage due to diabetes.
In China, an alarming 11.6% of the population has now developed diabetes, due to genetic factors and rapid lifestyle changes. In
this project large amounts of retinal fundus images are acquired, and the e-cloud deep learning system successfully learns to
identify early biomarkers of retinal disease.
The circle is round: we can prevent blindness by learning from the visual system: vision for vision.

Bart Romeny (1952) is professor in Biomedical Image Analysis (BMIA) at Eindhoven University of Technology, the
Netherlands.
1979 MSc in Applied Physics, Delft University of Technology, NL.
1983 PhD in Physics and Life Sciences, Utrecht University, NL.
1989 Associate prof. Medical Imaging, Utrecht University NL.
2001 Professor Biomedical Image Analysis, Eindhoven University of Technology, NL, emeritus per 21-12-2017.
2013 Professor Biomedical Image Analysis, Northeastern University, Shenyang, China.
His research interests focus on automated computer-aided diagnosis, and quantitative medical image analysis, using brain-
inspired computing and brain network modeling. He pioneered the exploitation of multi-scale differential geometry in
medical image analysis. His interactive tutorial book is used worldwide. He currently leads the RetinaCheck project, a
large screening project for early warning for diabetes and diabetic retinopathy in Liaoning Province. He has developed
many sophisticated retinal image analysis applications with his team, in close collaboration with clinical partners and
industry.
He (co-)authored over 230 scientific papers (14418 citations, h-index 42). He is a reviewer and/or associate editor for a range
of journals and conferences, and a frequent keynote speaker at conferences and summer schools.
He was president of the Dutch Society for Clinical Physics, president of the Dutch Society of Biophysics and Biomedical
Engineering. He is currently president of the Dutch Society for Pattern Recognition and Image Processing, and board
member of the International Association for Pattern Recognition (IAPR).
He is Fellow of the European Alliance for Medical and Biological Engineering & Science (EAMBES), and senior member
of IEEE. He is recipient of the Chinese Liaoning Friendship Award in 2014. He is an enthusiastic and awarded teacher.


Migrating Electronic Systems from Fault Tolerant Computing to Error Resilience

Heinrich Theodor Vierhaus


Computer Engineering Group
Brandenburg University of Technology Cottbus
Germany

Abstract—Fault tolerant computing is a branch of technology which has developed continuously over decades since the late
1940s. Application was limited to areas such as ultra-reliable computers for banks, space flight, aviation, and nuclear power
stations. At that time, the extra hardware needed was a real problem that had to be accepted. Overhead often was beyond
triplication, whereby extra power was acceptable. Only since about the 1990s have electronic “embedded” sub-systems found
their way into new applications such as automotive systems and industrial control. Typically, a robust type of electronics was
employed which stayed away from smallest feature size technologies and lowest signal voltage swings. More recently, advanced
features implemented by automotive electronic systems such as ultra-fast image processing towards autonomous driving strictly
demand their implementation in nano-electronic technologies with a minimum feature size of 20 nm and below. Then, suddenly,
there are two demands which strongly interact. ICs in nano-technologies show a rising vulnerability to disturbing influences
such as particle radiation. Furthermore, the increasing stress due to scaling at constant supply voltage rather than the previous
scaling at constant field strength implies a higher level of inherent stress, resulting in wear-out and shorter lifetimes. Hence on-
line fault detection and subsequent error correction becomes necessary on a wide scale. But now the extra power needed for
fault tolerant computing becomes a nightmare, since extra power and extra heat promote aging effects strongly. Research has
gone two ways. First, methods of fault detection and error correction that get along with only a small extra power budget were
developed. Unfortunately, they are not as powerful, robust and universally applicable as, for example, triplication plus majority
vote (TMR), which consumes more than triple power. The second approach to the solution is the concept of “error resilience”.
It is based on the observation that, depending on function and application of a circuit, a limited number of faults may be
acceptable for some time without total system failure. If the overall system can become aware of its own fault status,
acceptance of some faults followed by a process of self-repair by re-organization during a time slot when the system is “at rest”
may be a partial solution. Then only those parts of a digital system, where single or multiple bit errors will damage the function
directly and critically, will need traditional “fast and hot” methods of error detection and error correction.

Heinrich Theodor Vierhaus received a diploma in electrical engineering from Ruhr-University Bochum (Germany) in
1975.
Lecturer for RF technology and electronic circuits with Dar-es-Salaam Technical College in Tanzania (East Africa) from
1975 to 1977.
Dr.-Ing. in EE from University of Siegen in 1983.
Senior researcher with GMD, the German National Research Institute for Information Technology from 1983 to 1996.
Professor for “Technische Informatik” (computer engineering) at Brandenburg University of Technology Cottbus since
1996.
He has authored or co-authored about 250 papers in the area of computer engineering, mainly related to IC test,
technology, dependability and fault correction. He also contributed to three books in the area as an editor and an author.
Since 2009, he has initiated and coordinated several projects on advanced education of doctoral students in the area of
dependable systems, based on an international network of universities and research institutes.


Electronic Systems and Interfaces Aiding the Visually Impaired

Paweł Strumiłło
Medical Electronics Division
Institute of Electronics
Lodz University of Technology
Poland

Abstract—Visual impairment is one of the most serious sensory disabilities. It deprives a human being of an active professional
and social life. EU reports indicate that for every 1000 European citizens, 4 are blind or suffer from serious visual impairment,
and this number is predicted to increase with time due to our ageing society.
In spite of numerous worldwide research efforts focusing on building innovative aids helping the blind, no single electronic
travel aid (ETA) solution has been widely accepted by the blind community. The aim of the tutorial is to review the current
state of the art in the field of electronic interfaces aiding the blind in independent travel, navigation and access to information.
Functional solutions and outcomes of recent research projects devoted to assistive technologies for the visually impaired will be
presented.

Paweł Strumiłło received the MSc, PhD and DSc degrees and currently holds the position of full-time university
professor at Lodz University of Technology (TUL), Poland. In 1991-1993 he was with the University of Strathclyde
(under the EU Copernicus programme) where he defended his PhD thesis. His current research interests include medical
electronics, processing of biosignals, soft computing methods and human-system interaction systems. He has published
more than 100 frequently cited technical articles, authored one and co-authored two books. He was a principal and co-
principal investigator in a number of Polish and European research projects aimed at developing ICT solutions and
electronic aids for persons with physical and sensory disabilities. He received a number of prizes and awards for
development of assistive technologies for visually impaired people (in cooperation with Orange Labs). Since 2015 he
has been the head of the Institute of Electronics at TUL. He is a Senior Member of the IEEE and a member of the
Biocybernetics and Biomedical Engineering Committee of the Polish Academy of Sciences.


Contemporary technologies and techniques for processing of human eye images

Adam Dąbrowski
Division of Signal Processing and Electronic Systems
Institute of Automation and Robotics
Faculty of Computing
Center of Mechatronics, Biomechanics, and Nanoengineering
Poznan University of Technology
Poland

Abstract—Imaging technologies and techniques of the human eye are used for both biometric and medical-diagnostic
applications. Among various types of eye images, the following can be distinguished: iris images, fundus images, and various
optical coherence tomography (OCT) scans. Contemporary processing approaches to all of these image types are reviewed and
analyzed together with a discussion of their applications. Advanced image processing methods and algorithms, including the
artificial intelligence approach, developed at the Division of Signal Processing and Electronic Systems of the Poznań University
of Technology for the considered applications, are presented. The proposed solutions are characterized by a good effectiveness
and accuracy in the support of appropriate biometric and clinical decisions.

Adam Dąbrowski received a Ph.D. in Electrical Engineering (Electronics) from the Poznan University of Technology,
Poznan, Poland in 1982. In 1989 he received the Habilitation degree in Telecommunications from the same university.
Since 1997 he has been a full professor in digital signal processing at the Faculty of Computing, Poznan University of
Technology, Poland, and Chief of the Division of Signal Processing and Electronic Systems. He was also professor at the
Adam Mickiewicz University, Poznan, Poland, Technische Universität Berlin, Germany, Universität Kaiserslautern,
Germany, and visiting professor at the Eidgenossische Technische Hochschule Zürich, Switzerland, Katholieke
Universiteit Leuven, Belgium and Ruhr-Universität Bochum, Germany. He was a Humboldt Foundation fellow at the
Ruhr-Universität Bochum, Germany (1984-1986).
His scientific interests concentrate on: digital signal processing (digital filters, signal separation, multidimensional
systems, wavelet transformation), processing of images, video and audio, multimedia and intelligent vision systems,
biometrics, and on processor architectures. He is author or co-author of 5 books and over 500 scientific and technical
publications. Among them he is one of the co-authors of "The Computer Engineering Handbook" (first edition in 2002,
second edition in 2008), a bestseller and one of the most frequently cited books of CRC Press, Boca Raton, USA.


Suppression of distortions in signals received from Doppler sensor for vehicle speed measurement

Grzegorz Szwoch
Department of Multimedia Systems
Gdańsk University of Technology, Narutowicza 11/12
80-233 Gdańsk, Poland
greg@sound.eti.pg.gda.pl

Abstract—Doppler sensors are commonly used for movement detection and speed measurement. However, electromagnetic interferences and imperfections in sensor construction result in degradation of the signal to noise ratio. As a result, detection of signals reflected from moving objects becomes problematic. The paper proposes an algorithm for reduction of distortions and noise in the signal received from a simple, dual-channel type of Doppler sensor. The proposed method is based on examining the phase relationship between the I/Q channels of the sensor signal. A weighting function is calculated in order to suppress the distortions while preserving the energy of the desired signal. Additionally, the proposed algorithm may select signals reflected by objects moving in a specific direction (e.g. towards the sensor). The processed signal may be further analyzed in order to detect the signal frequency and compute the object velocity. The results of the experiments show that the proposed approach yields a significant reduction of the level of noise and interferences, allowing for detection and tracking of signals reflected from moving objects.

Keywords—speed measurement; Doppler radar; noise suppression; traffic monitoring

I. INTRODUCTION

Traffic monitoring systems are important tools in the maintenance of road networks. In order to obtain a detailed dataset on road traffic, a network of sensors measuring traffic parameters (such as the number of vehicles, average speed, etc.) is necessary. Due to the large number of sensors in such systems, low cost devices should be used in the construction of monitoring stations. In order to ensure a sufficient level of accuracy, appropriate signal processing algorithms need to be employed. Radar sensors based on the Doppler effect are standard devices used for speed measurement. Low cost Doppler sensors, advertised as ‘motion sensors’, may be used for estimation of vehicle speed in the monitoring stations. While they are not as sophisticated as professional devices used e.g. for traffic law enforcement, they are able, with the help of digital signal processing algorithms, to provide sufficient accuracy of vehicle speed measurement for the purpose of collecting traffic data.

Various types of radar sensors are used for speed measurement [1]. Continuous wave (CW) sensors transmit a harmonic signal with constant frequency and amplitude. Only speed measurement is possible with these sensors. Pulse radars emit short impulses; they are often used in ranging applications. Newer frequency modulated continuous wave (FMCW) sensors [2] are able to measure range, speed and angle at the same time, at the cost of reduced resolution and maximum measured value. A variety of low cost sensors is available on the market. Alternatively, a custom sensor may be constructed. For example, Placentino et al. built a custom CW sensor and used it for measurement of vehicle speed and length [3]. Nguyen et al. used a pulse radar sensor and proposed an algorithm that calculates vehicle speed and length [4].

Processing signals from a Doppler sensor does not require significant processing power. Butterfield proposed a system composed of a CW motion detector, operational amplifiers, a sound card and a Raspberry Pi microcomputer, that is able to collect traffic data [5]. The processing algorithm may also be implemented on a low power digital signal processor.

Parameters of consumer motion sensors do not match those of professional equipment used e.g. by the police. The most important problem is the low signal-to-noise ratio, which makes the task of separating the useful signal from the background noise problematic. The usual approach is to compute a noise profile and to subtract this profile from the sensor signal. However, this method requires that signal parts containing only noise are used for computing and updating the profile. Determining whether a signal frame contains components reflected from moving objects is problematic. In many cases, the algorithm blindly takes a number of signal frames, computes a profile and uses it for noise suppression [6]. If a reflected signal is present in the analyzed frames, the computed profile is incorrect. Additionally, subtracting the noise profile from the sensor signal also suppresses the reflected components. Another problem is that noise profiling works only for stationary noise; it won’t help in case of noise resulting e.g. from wind blowing at the sensor. There is also a problem of electromagnetic interferences that may be present in the sensor signal and which are not successfully removed with the standard profile subtraction approach. Various advanced methods of noise suppression in a Doppler sensor were proposed. For example, Islam and Chong used a matched filter and wavelet analysis [7].

This paper proposes an alternative approach to suppression of noise and electromagnetic interferences in the Doppler sensor signal. The algorithm works on any CW sensor with a
dual channel, I/Q output. This approach is based on examining the phase relationship between the two output channels. Both stationary and non-stationary noise, as well as electromagnetic interferences, may be reduced, without affecting the reflected signal. Additionally, this method is able to select only signals reflected from vehicles moving either towards or away from the sensor, which greatly simplifies the speed estimation. Moreover, an improved method of calculating the noise profile is also described. The details are presented in the following Sections.

II. PRINCIPLE OF DOPPLER SENSOR OPERATION

A CW Doppler sensor transmits an electromagnetic wave with constant frequency and amplitude. Most CW sensors in Europe use the K band and transmit signals with the frequency f0 = 24.125 GHz. Signals reflected by obstacles are received by the sensor antennas and amplified. Signals reflected by static obstacles have the same frequency as the transmitted signal. Objects in motion that reflect the transmitted wave cause the Doppler effect. As a result, the frequency fr of the received signal is higher than f0 if the object approaches the sensor, and lower if the object moves away from the sensor. The transmitted and the received signals are multiplied in the mixer and low-pass filtered [1]. As a result, the frequency fd of the signal at the sensor output is equal to

fd = fr − f0 = (2/λ)·vr = (2·f0/c)·vr = S·vr    (1)

where λ is the transmitted wave length, f0 is the transmitted wave frequency, c is the speed of light, and vr is the radial velocity, i.e. the velocity vector component in the direction from the sensor to the moving object. Therefore, the frequency of the received wave is related to vr with a constant scale factor S, equal to ca. 44.71 for f0 = 24.125 GHz and speed expressed in kmph. It should be noted that the signal from the sensor output is within the audio frequencies, as vehicle speeds in the range up to 200 kmph result in frequencies up to ca. 9 kHz.

Sensors with a single-channel output do not allow for determining the direction of movement. For this purpose, many Doppler sensors provide a dual-channel output in I/Q format. The first channel (I, in-phase) is the same as in the single-channel sensor; the second channel (Q, quadrature) is shifted in phase by 90 degrees. If the object moves towards the sensor (fd > 0), the phase difference between channels Q and I is equal to 90 degrees. If the object moves away from the sensor (fd < 0), the sign in the Q channel is inverted and the phase difference Q-I is equal to -90 degrees. Therefore, it is possible to distinguish between the two directions of movement by examining the phase difference between the channels.

It should be noted that the sensor measures only the radial component of the velocity vector. The relation between the actual object speed v and the measured velocity vr is

vr = v · cos(α)    (2)

where α is the angle between the line of the object movement and the line connecting the object with the sensor (Fig. 1). As a result, as the object moves closer to the sensor, the measured speed decreases. This is called a ‘cosine effect’ [8].

Figure 1. The actual object speed v and the radial speed vr measured by the Doppler sensor

This effect has an important influence on signals reflected from vehicles moving close to the sensor. As a vehicle approaches the sensor, the difference in angle α between the front and the back of the vehicle increases. As a result, the difference in frequency of signals reflected from the front and the back of the car also increases, which results in widening of the spectrogram peak (Fig. 2). This effect may be utilized for estimation of vehicle length.

Figure 2. Recording from a Doppler sensor – signal reflected by a vehicle moving with a constant speed, the cosine effect is visible

III. WEIGHTING FUNCTION FOR NOISE SUPPRESSION

In an ideal dual-channel CW Doppler sensor, the phase difference between I/Q channels of signals reflected from moving vehicles would be either +90 or -90 degrees. In practice, due to noise and interferences, the observed values deviate from the ideal ones. For example, the datasheet of the sensor used in the experiments shows a possible range of ±30 degrees [9]. Nevertheless, this observation may be used in the noise suppression algorithm, which is designed as follows. A weighting function is computed from the phase difference signal. Such a function has values from 0 to 1, indicating a probability that a given signal component is the desired signal. The signal components reflected from moving objects are expected to have a phase difference close to ±90 degrees.

The distribution of the phase difference for the noise was obtained by selecting signal frames containing only noise from the recording, computing spectra of both channels, computing the interchannel phase difference and constructing a histogram. The result is shown in Fig. 3. It can be observed that for the stationary noise, the distribution of phase differences is Gaussian, with mean close to zero.
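The histogram construction described above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' script; the synthetic noise-only I/Q frames, the frame length and the bin count are assumptions:

```python
# Illustrative sketch (not the authors' code): Q-I phase-difference histogram
# for noise-only frames. Synthetic I/Q noise and the frame length are assumptions.
import numpy as np

def interchannel_phase_deg(xi, xq):
    """Per-bin Q-I phase difference in degrees, wrapped to [-180, 180)."""
    Xi = np.fft.rfft(xi)
    Xq = np.fft.rfft(xq)
    dphi = np.degrees(np.angle(Xq) - np.angle(Xi))
    return (dphi + 180.0) % 360.0 - 180.0

rng = np.random.default_rng(0)
# 100 synthetic noise-only I/Q frames of 2048 samples each
frames = [(rng.standard_normal(2048), rng.standard_normal(2048))
          for _ in range(100)]
dphi_all = np.concatenate([interchannel_phase_deg(xi, xq) for xi, xq in frames])
hist, edges = np.histogram(dphi_all, bins=72, range=(-180.0, 180.0))
```

For real sensor noise the histogram concentrates around zero phase difference, as Fig. 3 shows; the synthetic frames here only exercise the computation.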
Therefore, most of the noise components are concentrated around zero phase difference and they are separated in phase from the reflected signal components.

Figure 3. Histogram of phase differences in a noise recorded from the dual-channel Doppler sensor. Mean = -1.64, standard deviation = 38.15

The weighting function is calculated as follows. Signals xi(n) and xq(n) from the I/Q sensor output are digitized and transformed to the frequency domain, so that their spectral representations Xi(f) and Xq(f) are obtained. The phase difference function is computed as

∆ϕ(f) = arg(Xq(f)) − arg(Xi(f))    (3)

The function ∆ϕ(f) is normalized to the [-180, 180) degrees range by adding or subtracting 360 degrees where necessary. This function describes the Q-I phase difference for frequency components of the signal.

The weighting function w(f) for detection of all moving objects is computed as

w(f) = 1 − | |∆ϕ(f)| / 90 − 1 |    (4)

This function has values in the range from 0 to 1.

The function given by Eq. 4 changes linearly. In order to increase the attenuation of noise components and to preserve the amplitude of the reflected components, a sigmoid function is applied to the original weighting function:

w′(f) = 1 / (1 + e^(−γ·(w(f) − 0.5)))    (5)

where γ is the parameter which defines the ‘flatness’ of the curve at the extrema and the sharpness of the function in the middle section. Values of γ in the range 10-20 were used in the experiments. Similarly to image processing, this additional weighting increases the ‘contrast’ between 0 and ±90 degrees phase differences.

The additional advantage of the proposed method is that it is possible to modify the function w(f) in a way that only the objects moving towards the sensor are retained, and the objects moving away from the sensor are suppressed. Thanks to that, further analysis of vehicle speed detection is simplified. The weighting function for this case is

win(f) = 1 − | max(∆ϕ(f) / 90, 0) − 1 |    (6)

Fig. 4 shows plots of the weighting functions for both cases, without and with the sigmoid function, for all objects and only for objects approaching the sensor.

Figure 4. Weighting functions for all objects and only for objects moving towards the sensor, without using the sigmoid function (dashed lines) and with the sigmoid function: γ = 10 (dotted line) and γ = 20 (solid line)
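The weighting functions of Eqs. (4)-(7) and the sigmoid of Eq. (5) can be sketched as follows; the function names are mine, and the inner absolute value in the two-direction case reflects the stated 0-to-1 range:

```python
# Sketch of the weighting functions of Eqs. (4)-(7) and the sigmoid of Eq. (5);
# function names are assumptions, the inner abs() in w_all follows the 0-1 range.
import numpy as np

def w_all(dphi_deg):
    """Eq. (4): weight for objects moving in either direction; 1 at +/-90 deg, 0 at 0 deg."""
    return 1.0 - np.abs(np.abs(dphi_deg) / 90.0 - 1.0)

def w_towards(dphi_deg):
    """Eq. (6): retain only objects approaching the sensor (phase near +90 deg)."""
    return 1.0 - np.abs(np.maximum(dphi_deg / 90.0, 0.0) - 1.0)

def w_away(dphi_deg):
    """Eq. (7): retain only objects moving away from the sensor (phase near -90 deg)."""
    return 1.0 - np.abs(np.maximum(-dphi_deg / 90.0, 0.0) - 1.0)

def sigmoid_contrast(w, gamma=20.0):
    """Eq. (5): increase the 'contrast' of a weighting function; gamma of 10-20 was used."""
    return 1.0 / (1.0 + np.exp(-gamma * (w - 0.5)))

def suppress(amp_spectrum, dphi_deg, gamma=20.0):
    """Multiply the amplitude spectrum by the sigmoid-weighted one-direction function."""
    return amp_spectrum * sigmoid_contrast(w_towards(dphi_deg), gamma)
```

For a component with a +90 degree phase difference the weight is 1 (kept); at 0 degrees, where interferences and most noise sit, the weight is 0 (suppressed).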
Similarly, it is possible to select only objects moving away from the sensor, with the weighting function

wout(f) = 1 − | max(−∆ϕ(f) / 90, 0) − 1 |    (7)

Electromagnetic interferences received by the sensor antennas will be added to the reflected signal, so in most cases, one of the directions of movement (usually away from the sensor) will be distorted. However, interferences induced in the electronic circuits after the sensor output, or amplified by this circuit, are added to both channels, so their phase difference will be close to zero. These interferences may be suppressed with the proposed method, provided that their amplitude is comparable to the reflected signal amplitude.

After the weighting function is calculated, suppression of noise and interferences is performed by multiplying the amplitude or power spectrum of the signal frame by the weighting function. The result is used in detection of reflected signal components and for speed estimation.

IV. CALCULATION OF NOISE PROFILE

Subtraction of a noise profile is the standard approach to noise suppression. The problem is that only signal frames that do not contain components reflected from moving objects should be used for profile calculation. The weighting function proposed in the previous Section may be used to determine spectral bins containing noise or signal components. In order to compute a profile for a given spectral bin, N values have to be collected. The proposed procedure works as follows. For each signal frame, the amplitude spectrum and the phase difference processed by the weighting function (for all objects) are computed. For each frequency bin, the following operations are performed.

• If the weighted phase difference is above a threshold Tn, the bin contains a signal component and is not processed further.

• If the number k of noise samples collected for the bin is k < N, add the value of the spectral amplitude as another noise sample.

• If k = N, compute the noise estimate e(f) as the arithmetic mean of the collected samples.

• If k > N, update the noise estimate:

e(f) = (1 − α)·e(f) + α·X(f)    (8)

where α is the learning parameter and X(f) is the signal frame spectrum. The noise profile is completed if k ≥ N for all spectral bins. The learning parameter is a small number, e.g. α = 0.05.

V. EXPERIMENTS

The test system built for the experiments is based on a RSM2650 Doppler sensor operating with 24.125 GHz transmit frequency [9]. The output signal is processed with an amplifier. Initially, modules based on LM386 amplifiers were used. However, these modules introduced interferences into the analyzed signal, and the maximum gain (20×) was not sufficient. Therefore, they were replaced by a custom module based on a NE5532 amplifier (max gain 1000×). The amplified signal was digitized by a sound card and recorded on a computer. The recorded signals were processed offline with scripts written in Python. It was tested that the same scripts may be used for online analysis, running e.g. on a Raspberry Pi 3 microcomputer. The sampling frequency was 48 kHz; STFT analysis was performed with a Blackman window of length 2048 samples, with 75% overlapping.

The recordings were made on a city street (vehicles moving with an average speed of 30-50 kmph). The sensor was placed ca. 3 meters from the street, angled at ca. 45 degrees relative to the street. Fig. 5 shows the spectrogram of a fragment of the first recording, made with the original amplifier. Signal components reflected from moving vehicles are clearly visible. However, the noise level is high and there are constant frequency components at multiples of 1 kHz. The source of this interference was not found. It is possible that it is a result of USB frame synchronization, amplified in the circuit.

Figure 5. Spectrogram of a fragment of Recording 1

Fig. 6 shows the phase difference (Q-I) for spectral components; black and white colors represent -90 and +90 degree phase difference, respectively. It can be observed that signals reflected from moving vehicles are marked mostly with black or white color, depending on the direction of movement. Electromagnetic interferences have a phase difference close to zero (medium gray color). The phase difference of noise components is distributed as shown in Fig. 3.

Figure 6. Phase difference Q-I of the signal shown in Fig. 5. Black and white colors represent phase difference -90 and +90 deg, respectively

Fig. 7 shows the spectrogram of a processed signal, after the weighting function is computed (for both directions, γ = 20) and multiplied with the amplitude spectrum of the signal. It can be observed that the interference components were successfully suppressed and that the noise level was significantly reduced, as the contrast between the signal components and the noise background is increased.
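The per-bin profiling procedure of Section IV can be sketched as a small class; the class layout and the default parameter values are my assumptions, with Eq. (8) as the running update:

```python
# Sketch of the per-bin noise-profile procedure of Section IV; the class layout
# and defaults (N, alpha, Tn) are assumptions, Eq. (8) gives the running update.
import numpy as np

class NoiseProfiler:
    def __init__(self, n_bins, N=100, alpha=0.05, Tn=0.5):
        self.N = N                      # samples required per bin
        self.alpha = alpha              # learning parameter of Eq. (8)
        self.Tn = Tn                    # threshold on the weighted phase difference
        self.sums = np.zeros(n_bins)
        self.counts = np.zeros(n_bins, dtype=int)
        self.e = np.zeros(n_bins)       # noise estimate e(f)

    def update(self, amp, weighted_dphi):
        """Process one frame: amplitude spectrum and weighted phase difference per bin."""
        # Bins above the threshold hold a signal component and are skipped
        for f in np.flatnonzero(weighted_dphi <= self.Tn):
            if self.counts[f] < self.N:             # k < N: collect another noise sample
                self.sums[f] += amp[f]
                self.counts[f] += 1
                if self.counts[f] == self.N:        # k = N: arithmetic mean of samples
                    self.e[f] = self.sums[f] / self.N
            else:                                   # k > N: exponential update, Eq. (8)
                self.e[f] = (1.0 - self.alpha) * self.e[f] + self.alpha * amp[f]

    def complete(self):
        """The profile is complete once k >= N for every spectral bin."""
        return bool(np.all(self.counts >= self.N))
```

Unlike plain frame averaging, this profiler keeps accumulating even while vehicles pass, because signal-bearing bins are simply excluded from the update.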
background is increased. Fig. 8 shows the result of applying a
weighting function which selects only objects moving towards
the sensor. Signal components reflected from vehicles moving
away from the sensor are suppressed, the noise level is also
decreased in comparison to the previous experiment.

Figure 7. Spectrogram of a fragment of Recording 1, after noise suppression


with the proposed algorithm (weighting function for all vehicles)

Figure 9. Noise profiles computed from Recording 1, using various methods

Figure 8. Spectrogram of a fragment of Recording 1, after noise suppression


with the proposed algorithm (weighting function only for vehicles moving
towards the sensor)

Fig. 9 shows noise profiles computed from the same
recording. The standard profile is the amplitude spectrum
averaged over the first 100 frames (no reflected signals were
present in these frames). The ‘proposed’ profile was computed
using the method described in the previous section, with
N = 100. Both profiles are similar, but the second profile may
also be computed in the presence of reflected signals, as they
are excluded from the profiling. Electromagnetic interferences
are clearly visible in the profile plot. The third (‘processed’)
profile is obtained after noise suppression with the procedure
described earlier (the weighting function for both directions
was used). It can be seen that the noise level was reduced
significantly, by ca. 20 dB. Peaks from interferences, as well as
the direct component (resulting from signal reflections from
static obstacles), are suppressed, making further signal analysis
(speed estimation) easier.

Because of the electromagnetic interferences visible in the
recorded signals, the amplifier circuit was replaced with
another one. In the second recording, made with this setup,
interferences were significantly reduced, but the noise level
increased. Noise profiles calculated for this recording are
presented in Fig. 10. The proposed algorithm for noise
suppression reduced the overall noise level. The observed gain
in signal-to-noise ratio is higher than in the first recording.

Figure 10. Noise profiles computed from Recording 2, using various methods

Table 1 summarizes the results of the proposed algorithm in
terms of the measured noise level, for both tested recordings.
The average noise level was computed as the mean of the noise
profile values in the frequency range 500 Hz – 10 kHz. Each
recording was processed with a weighting function calculated
first for both directions (two-dir.), and then only for the
direction towards the sensor (one-dir.). Three cases were
tested: without the sigmoid function, and with the sigmoid
function and two different γ values. The obtained results
confirm that the proposed algorithm efficiently decreases the
noise level. This reduction is more prominent if only one
direction is selected for the analysis. Sigmoid weighting of the
phase difference enhances the noise suppression in most cases.
In one case (Recording 2, no sigmoid weighting) it was not
possible to complete the noise profile calculation (due to an
insufficient number of frames), but with the help of sigmoid
weighting, it was possible to calculate the profile.
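The profile computation and the band-averaged level reported in Table 1 can be sketched as follows. This is a minimal illustration of the standard averaged-spectrum profile; the naive DFT and the function names are my own choices, not the paper's code.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Magnitudes of a naive DFT, normalised by the frame length."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n)]

def noise_profile(frames):
    """Standard profile: amplitude spectrum averaged over noise-only frames."""
    spectra = [amplitude_spectrum(f) for f in frames]
    n_bins = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(n_bins)]

def mean_level_db(profile, fs, f_lo, f_hi):
    """Mean profile level in dB over a band, e.g. 500 Hz - 10 kHz."""
    n = len(profile)
    band = [20.0 * math.log10(profile[k] + 1e-12)   # small floor avoids log(0)
            for k in range(n // 2 + 1)
            if f_lo <= k * fs / n <= f_hi]
    return sum(band) / len(band)
```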
The observed noise levels (-84 dB to -76 dB) may seem
low, but the signal-to-noise ratio in a Doppler radar system
built from low cost components is very low. The observed
levels of signals reflected from moving vehicles were in the
range from -80 dB to -50 dB, depending on the distance to the
object and the object size. Therefore, increasing the gap
between the signal and the noise levels is vital for efficient
speed measurement. The standard approach based on noise
profile subtraction reduces the amplitude of both the signal and
the noise. The proposed method applies strong attenuation to
the noise while keeping the suppression of the signal level
reasonably low.

TABLE I. NOISE LEVEL [dB] MEASURED FOR THE SIGNAL (500 Hz – 10 kHz)
PROCESSED WITH THE PROPOSED ALGORITHM

                     Recording 1            Recording 2
Algorithm          two-dir.  one-dir.    two-dir.  one-dir.
Original signal     -84.1     -84.1       -75.8     -75.8
No sigmoid         -100.2    -118.6         —      -105.1
Sigmoid, γ = 10    -105.3    -118.9      -113.5    -114.9
Sigmoid, γ = 20    -111.7    -131.0      -139.6    -143.1

VI. CONCLUSION

Commonly used noise suppression methods, e.g. based on
subtracting a noise profile, are suboptimal for processing
Doppler sensor signals, as they do not utilize the phase
relationship between the I/Q channels and they do not provide
satisfactory efficiency in low signal-to-noise ratio conditions.
The proposed algorithm detects signal components that are
expected to represent signals reflected from moving objects,
based on the phase difference between the two channels. The
remaining components are considered to be noise and they are
efficiently suppressed. The proposed algorithm performs much
better than the profile subtraction approach in the presence of
electromagnetic interferences. An important advantage of the
proposed method is that objects moving in opposite directions
(e.g. on two lanes) may be separated from each other using the
appropriate weighting function. As a result, speed
measurement may be performed separately for each direction.
Moreover, selecting only one direction allows for more
efficient noise suppression. Additionally, if a noise profile is
required, the proposed algorithm is able to extract the profile
from a signal that may contain useful components, so it does
not require selection of noise-only signal frames.

The main intended application of the presented algorithm
is traffic monitoring networks composed of a large number of
monitoring devices. The presented noise suppression algorithm
is intended to be used in a pre-processing stage of an algorithm
for vehicle speed measurement. Thanks to digital signal
processing, it will be possible to construct a network of a large
number of simple, low cost sensors, providing an effective and
economic solution for traffic monitoring.

ACKNOWLEDGMENT

Project co-financed by the Polish National Centre for
Research and Development (NCBR) from the European
Regional Development Fund under the Operational Programme
Innovative Economy No. POIR.04.01.04-00-0089/16 entitled:
INZNAK – “Intelligent road signs”.

REFERENCES

[1] G. Brooker, “Sensors and Signals,” Chapter 14: Doppler Measurement,
Australian Centre for Field Robotics, University of Sydney, 2006.
[2] C. Iovescu and S. Rao, “The fundamentals of millimeter wave sensors,”
Texas Instruments, SPYY005. http://www.ti.com/lit/wp/spyy005/spyy005.pdf
[3] F. Placentino, F. Alimenti, A. Battistini, W. Bernardini, P. Mezzanotte,
V. Palazzari, et al., “Measurements of length and velocity of vehicles
with a low cost sensor radar Doppler operating at 24GHz,” 2nd
International Workshop on Advances in Sensors and Interface, Bari,
2007, pp. 1-5. doi: 10.1109/IWASI.2007.4420036
[4] V. C. Nguyen, D. K. Dinh, V. A. Le and V. D. Nguyen, “Length and
speed detection using microwave motion sensor,” 2014 Int. Conf.
Advanced Technologies for Communications (ATC 2014), Hanoi, 2014,
pp. 371-376. doi: 10.1109/ATC.2014.7043414
[5] T. Butterfield, “Building radar speed camera and traffic logger with
Raspberry Pi,” 2015.
http://blog.durablescope.com/post/BuildASpeedCameraAndTrafficLogger/
[6] RFbeam Microwave GmbH, “K-LD2 transceiver datasheet,”
https://www.rfbeam.ch/product?id=14
[7] M. S. Islam and U. Chong, “Noise reduction of continuous wave radar
and pulse radar using matched filter and wavelets,” EURASIP Journal
on Image and Video Processing, 2014. https://doi.org/10.1186/1687-5281-2014-43
[8] D. S. Sawicki, “Police radar handbook,” Chapter 2: The cosine effect,
CreateSpace Independent Publishing Platform, 2013.
[9] B+B Sensors, “RSM2650 Radar movement alarm unit – data sheet,”
http://www.produktinfo.conrad.com/datenblaetter/500000-524999/506343-da-01-en-RADARBEWEGUNGSM__MOD__STEREO_4_75__5_25V.pdf

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND

Programmatic Simulation of Laser Scanning Products

Marek Kulawiak
Department of Geoinformatics, Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
Gdańsk, Poland
Marek.Kulawiak@eti.pg.edu.pl

Abstract—The technology of laser scanning is widely used for
producing three-dimensional digital representations of
geographic features. The measurement results are usually
available in the form of 3D point clouds, which are often used as
a transitional data model in various remote sensing applications.
Unfortunately, while the costs of Light Detection And Ranging
scanners have dropped significantly in recent years, they are still
considered to be quite expensive for smaller institutions. In
consequence, the process of 3D point cloud acquisition remains a
difficult one, requiring investment not only in scanning
equipment, but also time to operate it and process the obtained
results. However, if the goal does not involve the 3D digitalization
of a particular object, but instead the point clouds are required
e.g. for testing reconstruction algorithms, in many cases such
input data can be successfully substituted with the results of a
simulated scanning process, which is far easier to accomplish.
This paper presents a programmatic simulator which generates
artificial scanning results from solid meshes provided by the user
and saves them in the form of point cloud datasets.

Keywords—Laser scanning; Simulation; LiDAR; 3D; Point
cloud;

I. INTRODUCTION

LiDAR (Light Detection And Ranging) is a remote sensing
method that measures distances by illuminating the target (such
as ground, a building or some other object) with light in the
form of a pulsed laser. By combining these light pulses with
other data, such as the position and orientation of the aircraft to
which the LiDAR device is attached, it is possible to generate
precise data which describes the shape of the measured
object [1]. The scanning results are usually available in the
form of three-dimensional point clouds, which are often used
as a transitional data model in various land remote-sensing
applications. These applications include the creation of 3D
topographic maps [2], modelling of various processes related to
the area of research in urban areas [3], as well as other 3D
terrain visualization systems offering the recreation of detailed
digital models.

Unfortunately, such data can be quite difficult to acquire,
given the need for expensive scanning equipment, as well as
significant amounts of time and effort required to operate it and
process the obtained results. However, for the purpose of
testing the output of various algorithms designed to work with
point clouds, in many cases such input data can be successfully
substituted with the results of a simulated scanning process,
which is far easier to accomplish. A good example of such a
scenario is when research is performed in order to evaluate the
quality of 3D meshes generated by various surface
reconstruction algorithms. Acquiring the input data needed for
this process is not an easy task, considering the fact that apart
from the scanning results, additional data in the form of
high-quality reference models representing the scanned objects
is needed in order to compare the reconstruction results with
more reliable material. Such reference models often need to be
obtained by different means, such as with the use of
photogrammetric methods or by manual mesh construction
based on other source material (e.g. building plans). This is
when simulating the process of surface scanning offers an
additional benefit, as the source 3D model can serve two
purposes at the same time: providing both the input point cloud
and a reliable reference material. The process of scanning the
surface of a single object (e.g. a building) and creating its
reference model from scratch can take several weeks of work.
On the other hand, the simulation for a given mesh can be
achieved within a single day.

II. RELATED WORK

Over the years, many different solutions for simulating
laser scanning of three-dimensional surfaces have been
proposed. One of the first known attempts at solving this
problem was made by Holmgren et al. [4], who simulated the
process of modeling the scanning angle effect when measuring
tree height and canopy closure in boreal forest with a laser
scanner. Fochesatto et al. [5] created a backscatter LiDAR
signal simulation available in the areas of space exploration.
Peinecke et al. [6] described a LiDAR simulation approach
which made heavy use of the computer's graphical processing
unit (GPU) by doing most computations with vertex and
fragment shaders in a solution created with OpenGL. Kukko
and Hyyppä [7] proposed a simulation approach for
small-footprint LiDAR processing, which combined both
spatial and radiometric components to produce waveform and
point cloud data. Hämmerle et al. [8] simulated full-waveform
terrestrial laser scanning and unmanned aerial vehicle platform
point clouds of a virtual forest plot and examined the results in
regard to their effect on understory vegetation height
parameters. Qin et al. [9] used the DART (Discrete Anisotropic
Radiative Transfer) software to simulate airborne waveform
LiDAR data to explore the effects of data acquisition
conditions on forest foliage profile retrieval.

As a side note, not every simulation project is intended for
large-scale data. For example, Danhof et al. [10] have
developed a Virtual-Reality 3D-Laser-Scan Simulation for
creating synthetic scans of CAD (Computer-Aided Design)

models. This solution utilizes the VR (Virtual Reality)
head-mounted Oculus Rift display and the Razer Hydra motion
controllers for simulating the use of common hand-held 3D
laser scanners.

While the number of attempts at simulating the process of
object laser scanning is quite large, the software programs used
during the past research were created for dedicated purposes
and are not available to the public. This paper presents a robust
programmatic simulator capable of generating artificial
scanning results from solid meshes provided by the user and
saving the simulation results in the form of point cloud
datasets. The proposed simulator uses well-known standard 3D
file formats for both input and output data.

III. REFERENCE DATA

The main source of actual LiDAR scanning data (used as
reference for generating artificial datasets) was provided by
the Polish Centre of Geodesic and Cartographic
Documentation (CODGIK). The data contain 3D point clouds
depicting various terrestrial objects located in Poland, with an
average resolution of around 19 points / m2. The datasets are
stored in a grid of files in the LAS format (compliant with
version 1.2 of the file format), which consist of [11]:

• The Public Header block, containing information
regarding file version, creation date, bounding box,
projection, number of records, header size and offset to
point data.

• The Variable Length Records block, containing
information such as data description and the
identification number of the author.

• The Point Data Records block, which stores the actual
results of LiDAR scanning.

• The Extended Variable Length Records block, which is
optional and available only in newer versions of the
LAS format (version 1.3 and later).

In the context of the carried out research, the most important
were the records containing information about the XYZ
coordinates of individual points and their classification. The
most significant classes described by the LAS format
specification are listed in Tab. I.

TABLE I. CLASS ATTRIBUTES OF POINTS DESCRIBED BY THE LAS FORMAT

Classification value    Meaning
0                       Created, never classified
1                       Unclassified (emerged in an undefined state)
2                       Ground
3                       Low vegetation
4                       Medium vegetation
5                       High vegetation
6                       Building
7                       Low Point (noise)
8                       Model Key-points
9                       Water

The provided data are organized in a grid which consists of
thousands of sectors (depicted in yellow in Fig. 1), each
representing an area of approximately 29 hectares.
Unfortunately, this method of data storage has a significant
disadvantage in that points representing the shape of a single
object can be scattered between several neighboring sectors
(e.g. when the object happens to be located on the boundaries
of two or more sectors). Because of this, the original LAS files
needed to be parsed, and their contents extracted into a format
that could be easily analyzed.

Figure 1. CODGIK data grid.

Fig. 2 shows a fragment of an extracted point set classified
as a building (with its classification value equal to “6”),
cropped to the extents of a sample reference model depicting
the building of Gdańsk University of Technology’s Faculty of
Mechanical Engineering Machine Laboratory. It can be easily
noticed that the point cloud data structure is quite irregular, as
the point distribution over the building’s surface is clearly
non-uniform. Apart from varying point density, including large
gaps in some areas, this data can be considered quite accurate,
as the distance from a single point to the actual surface
fragment it represents rarely exceeds several centimeters.

Figure 2. Point cloud data obtained by scanning the surface of Gdańsk
University of Technology’s Faculty of Mechanical Engineering Machine
Laboratory, placed on a reference solid mesh.
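The parse-and-extract step described above can be illustrated with a small sketch: point records from neighbouring sectors are merged and then cropped to the horizontal extents of a reference model, keeping only one class (6 = Building, per Tab. I). The tuple layout and function names are assumptions for illustration; real LAS records would be read with a dedicated parser or library.

```python
# Each point is modelled here as (x, y, z, classification); class 6 = Building.
BUILDING = 6

def merge_sectors(*sectors):
    """Concatenate point records read from neighbouring grid sectors."""
    merged = []
    for sector in sectors:
        merged.extend(sector)
    return merged

def crop_to_extents(points, min_xy, max_xy, wanted_class=BUILDING):
    """Keep points of one class inside the reference model's bounding box."""
    (x0, y0), (x1, y1) = min_xy, max_xy
    return [p for p in points
            if p[3] == wanted_class and x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
```

This mirrors the situation where a building straddles two sector files: both sectors are loaded, merged, and the building's points recovered in one set.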

IV. PROPOSED SOLUTION

The main purpose of the proposed simulator is to generate
point clouds which have a spatial structure similar to actual
LiDAR surface scanning data. Taking that into account, the
simulator is designed to imitate the process of acquiring
spatial data by a mobile multibeam scanner located at a fixed
height above the scanned object in three-dimensional space.

The simulator application was created in the C++ language
with the use of the OGRE [12] engine. OGRE is an open-
source graphics engine capable of rendering 3D context in real-
time with the use of either OpenGL or Direct3D. Due to its
cross-platform nature, the engine is available on all major
operating systems, including various desktop systems such as
Windows, Linux and macOS, as well as mobile-driven systems
like Android, iOS and Windows Phone. In the presented work,
OGRE is used for rendering the 3D context and the 2D
graphical interface, as well as for performing various
computations related to computer graphics, such as the
ray-plane intersection needed for collision detection.

The simulation consists of a virtual 3D scene containing a
real-world scale solid mesh representing an object (such as a
building) read from a Wavefront OBJ file. In the same
simulation space a scanner is placed above the mesh, which
emits a set of rays in multiple directions at a constant
frequency. The rays which collide with the bounding box of
the 3D object are used for computing intersections with the
object’s geometry. To better visualize this process, Fig. 3 shows
the results of a static scanner emitting 14 rays (depicted in
yellow) towards the Elizabeth Tower, where the 9 red points
represent the ray-plane intersection results.

The intersection computations are handled by internal
functions of the OGRE 3D engine. As the scanner position
changes, the intersection points are collected and finally saved
in the output file representing the results of a single simulation,
using the same file format in which the input mesh is stored.
The contents of the output files can be combined into a single
point cloud representing the final results obtained by a
simulated multibeam scanner moving along several different
tracks.

The following parameters are used during a single surface
scanning simulation:

• Number of rays emitted by the simulated scanner,

• Number of measurements, with a constant time
interval between consecutive measurements,

• Distance between scanner positions at which
consecutive measurements were made (it is assumed
that the movement speed of the scanner is constant
during the entire simulation process),

• Scanner heading,

• Maximum angle of ray deviation from the vertical
plane,

• Length of a vector representing the direction of a
single ray,

• Initial position of the scanner.

In the case of the sample surface scanning shown in Fig. 3,
the simulation used the following parameter values:

• Number of rays: 14

• Number of measurements: 1

• Scanner heading: 110º

• Maximum angle of ray deviation from the vertical
plane: 35º

• Ray vector length: 180 m

• Starting position (XYZ): (1 m, 110 m, 3 m)
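The simulator delegates its ray queries to OGRE's internal functions; as a self-contained stand-in, the classic Möller-Trumbore ray/triangle test below illustrates the kind of computation involved. This is a generic textbook routine, not code from the presented application.

```python
def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore intersection; returns the hit point or None."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    h = cross(direction, edge2)
    a = dot(edge1, h)
    if abs(a) < eps:          # ray parallel to the triangle plane
        return None
    f = 1.0 / a
    s = sub(origin, v0)
    u = f * dot(s, h)
    if u < 0.0 or u > 1.0:    # hit outside the triangle (barycentric u)
        return None
    q = cross(s, edge1)
    v = f * dot(direction, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = f * dot(edge2, q)
    if t <= eps:              # intersection behind the ray origin
        return None
    return tuple(o + t * d for o, d in zip(origin, direction))
```

Casting one such test per ray against every mesh triangle (or against triangles surviving a bounding-box pre-test, as the simulator does) yields the red intersection points of Fig. 3.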

Figure 3. Visualization of a simplified Elizabeth Tower’s surface scanning
simulation.

V. RESULTS

In this section, the results of simulating laser scanning of a
single object using different flight trajectories are presented
and discussed. The input data for this case consists of a highly
detailed model of the Elizabeth Tower, which was downloaded
from MyMiniFactory [13], scaled to real-world dimensions
(making it 96 meters tall) and converted to the Wavefront OBJ
format. Four simulations were performed in total, with the first
two imitating a multibeam laser scanner moving at a fixed
altitude of 150 meters above the ground level (54 meters above
the top of the Tower), and the other two imitating a scanner
moving along similar paths, but at the level of 300 meters
above the ground (204 meters above the top of the Tower). The
parameter values used for all four simulations are presented in
Tab. II. The input model was placed in such a way that the
center of its base was located at the origin of the simulated
space, with its coordinates equal to (0, 0, 0).

24
TABLE II. PARAMETER VALUES USED FOR SIMULATING LASER SCANNING
OF ELIZABETH TOWER'S SURFACE

Parameter name                          Value            Value
                                        (trajectory I)   (trajectory II)
Number of rays                          160              160
Number of measurements                  30               30
Distance between consecutive
measurements                            0.75 m           0.75 m
Scanner heading                         80º              170º
Maximum angle of ray deviation
from the vertical plane                 50º              50º
Ray vector length                       180 m            350 m
Starting position (XYZ) when
scanning 150 m above the ground         (-7, 150, 25)    (-20, 150, -8)
level [m]
Starting position (XYZ) when
scanning 300 m above the ground         (6, 300, 94)     (-94, 150, 6)
level [m]

As a result of the aforementioned simulations using the
parameters from Tab. II, four different point clouds were
created, as shown in Fig. 4. The output point sets presented in
Fig. 4 a) and Fig. 4 b) were obtained from the altitude of 150
meters, while the point clouds shown in Fig. 4 c) and Fig. 4 d)
were acquired from the altitude of 300 meters above the
ground.

Figure 4. The results of performing surface scanning simulations on the
surface of the Elizabeth Tower using different flight trajectories: a) trajectory I
(150 m above the ground), b) trajectory II (150 m above the ground),
c) trajectory I (300 m above the ground), d) trajectory II (300 m above the
ground).

The obtained point clouds were then combined into two
different datasets, merging the results obtained at the same
altitude. These results are shown in Fig. 5, compared to the
input mesh presented in Fig. 5 a). The dataset obtained by
simulated scannings performed 150 meters above the ground
(Fig. 5 b) consists of 2,090 points in total. It is clearly
noticeable that the upper parts of the model are represented in
greater detail than its lower parts, and it can also be seen that
the areas located directly below the Tower’s clocks have the
lowest point density. The second dataset represents the
simulated scanning results obtained 300 meters above the
ground (Fig. 5 c) and it consists of 1,127 points in total. Its
resolution is nearly half that of the previous dataset, but it
should also be pointed out that its spatial structure is more
uniform, with its points being almost evenly distributed over
the object’s space. This causes the gap below the Tower’s
clocks to be less apparent, but at the same time the object’s
silhouette is also harder to distinguish from the background.

Figure 5. Original model of the Elizabeth Tower (a) compared to resulting
point clouds obtained from combined simulations of scanning 150 m above
the ground (b) and 300 m above the ground (c).

The presented software has been validated with the use of
LiDAR point clouds provided by CODGIK for the area of
Tricity in Northern Poland. Fig. 6 presents the view of
St. Mary's Church in Gdańsk in the form of the original LiDAR
point cloud (Fig. 6 a) next to the results of modeling obtained
using the presented application (Fig. 6 b).

Figure 6. Sample LiDAR point cloud (a) compared to the results (b)
obtained by the presented simulator.
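The parameters in Tab. II fully determine the set of rays cast in one simulation. A sketch of one plausible parameterisation is shown below; the geometric conventions (Y up, scanner track in the horizontal XZ plane, ray fan perpendicular to the heading) and the function name are my assumptions, not the simulator's code.

```python
import math

def scan_rays(start, heading_deg, step, n_measurements, n_rays, max_dev_deg):
    """Generate (position, direction) pairs for a moving multibeam scanner.

    Positions advance along the heading by `step` per measurement; at each
    position, `n_rays` (> 1) rays fan across-track, evenly spaced within
    +/- max_dev_deg from the vertical."""
    heading = math.radians(heading_deg)
    rays = []
    for m in range(n_measurements):
        pos = (start[0] + m * step * math.cos(heading),
               start[1],
               start[2] + m * step * math.sin(heading))
        for r in range(n_rays):
            dev = math.radians(-max_dev_deg + 2.0 * max_dev_deg * r / (n_rays - 1))
            # fan lies in the plane perpendicular to the heading; Y points up
            direction = (-math.sin(dev) * math.sin(heading),
                         -math.cos(dev),
                         math.sin(dev) * math.cos(heading))
            rays.append((pos, direction))
    return rays
```

With the trajectory-I values (160 rays, 30 measurements, 0.75 m spacing), this produces 4,800 candidate rays per pass, each then tested against the mesh.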

Consequently, the presented software has been successfully
used to produce point clouds of accurate 3D models for the
purpose of testing and improvement of custom shape
reconstruction algorithms.

VI. CONCLUSIONS

The presented 3D programmatic simulator is capable of
generating point clouds with characteristics similar to actual
scanning data. The simulation can be performed for any solid
mesh constructed from triangles. The simulated data can be
used for the purpose of algorithm development, e.g. as input
for testing various algorithms operating on point clouds,
especially many different surface reconstruction methods. The
presented simulator has been successfully applied by the author
to the generation of input data and test cases for the
development of automatic surface reconstruction algorithms
dedicated to processing LiDAR data. The presented simulator
can also be used for planning actual multibeam laser surveys of
surfaces located in environments which are not likely to cause
significant interference with scanning equipment. Due to the
nature of LiDAR scanning simulation, no two algorithms are
likely to produce identical outputs, even when identically
configured. Thus, direct comparisons between the presented
work and existing solutions would be difficult. However, it is
apparent that the presented simulator stands out in the
following aspects: the support for open file formats which are
recognizable by any major 3D-graphics software, as well as the
potential of being easily deployed on many different systems
thanks to the multi-platform nature of the underlying software.
As such, the presented application constitutes a valuable
contribution to the field of point cloud shape reconstruction.

ACKNOWLEDGEMENT

The author would like to express his gratitude to the Polish
Centre of Geodesic and Cartographic Documentation
(CODGIK) for providing sample high-quality LiDAR scanning
results used as a valuable reference for the presented work.

REFERENCES

[1] N. Alberto, “Design and development of a generalized LiDAR point
cloud streaming framework over the web,” 2014.
[2] M. Kulawiak and M. Kulawiak, “Application of Web-GIS for
Dissemination and 3D Visualization of Large-Volume LiDAR Data,” in
The Rise of Big Spatial Data (pp. 1-12), Springer International
Publishing, 2017. DOI: 10.1007/978-3-319-45123-7_1
[3] A. Chybicki, M. Kulawiak, Z. Lubniewski, J. Dabrowski, M. Luba,
M. Moszynski and A. Stepnowski, “GIS for remote sensing, analysis
and visualisation of marine pollution and other marine ecosystem
components,” in Information Technology, 2008. IT 2008. 1st
International Conference on (pp. 1-4), IEEE, 2008. DOI:
10.1109/INFTECH.2008.4621628
[4] J. Holmgren, M. Nilsson and H. Olsson, “Simulating the effects of lidar
scanning angle for estimation of mean tree height and canopy closure,”
Canadian Journal of Remote Sensing, 29(5), 2003, pp. 623-632.
[5] J. Fochesatto, P. Ristori, P. Flamant, M. E. Machado, U. Singh and
E. Quel, “Backscatter LIDAR signal simulation applied to spacecraft
LIDAR instrument design,” Advances in Space Research, 34(10), 2004,
pp. 2227-2231.
[6] N. Peinecke, L. Thomas and B. R. Korn, “Lidar simulation using
graphics hardware acceleration,” Digital Avionics Systems Conference,
2008. DASC 2008. IEEE/AIAA 27th, IEEE, 2008.
[7] A. Kukko and J. Hyyppä, “Small-footprint laser scanning simulator for
system validation, error assessment, and algorithm development,”
Photogrammetric Engineering & Remote Sensing, 75(10), 2009,
pp. 1177-1189. DOI: 10.14358/PERS.75.10.1177
[8] M. Hämmerle, N. Lukač, K. C. Chen, Z. Koma, C. K. Wang, K. Anders
and B. Höfle, “Simulating Various Terrestrial and UAV LiDAR
Scanning Configurations for Understory Forest Structure Modelling,”
ISPRS Annals of Photogrammetry, Remote Sensing & Spatial
Information Sciences, 4, 2017.
[9] H. Qin, C. Wang, X. Xi, J. Tian and G. Zhou, “Simulating the Effects of
the Airborne Lidar Scanning Angle, Flying Altitude, and Pulse Density
for Forest Foliage Profile Retrieval,” Applied Sciences, 7(7), 712, 2017.
DOI: 10.3390/app7070712
[10] M. Danhof, T. Schneider, P. Laube and G. Umlauf, “A virtual-reality
3D-laser-scan simulation,” Proc. SINCOM’15, 2015, pp. 68-73.
[11] LAS Specification version 1.4, The American Society for
Photogrammetry & Remote Sensing, Maryland, USA. Available at:
http://www.asprs.org/a/society/committees/standards/LAS_1_4_r13.pdf
[Accessed on 30.04.2018]
[12] OGRE - Open Source 3D Graphics Engine. https://www.ogre3d.org/
[Accessed on 22.05.2018]
[13] BB3D. Big Ben, London. Scan The World. MyMiniFactory,
https://www.myminifactory.com/object/big-ben-london-2462 [Accessed
on 27.04.2018]


Hand Gesture Recognition by Using sEMG Signals
for Human Machine Interaction Applications
Fatih Serdar SAYIN
Marmara University, Technology Faculty,
Department of Electrical-Electronics Engineering
Istanbul / Turkey
fatih.sayin@marmara.edu.tr

Sertan OZEN
Marmara University, Institute of Pure and Applied Sciences,
Department of Electrical-Electronics Engineering
Istanbul / Turkey
ozensertan@gmail.com

Ulvi BASPINAR
Marmara University, Technology Faculty,
Department of Electrical-Electronics Engineering
Istanbul / Turkey
ubaspinar@marmara.edu.tr

Abstract—Cyber physical systems are gaining more place in daily
life, so interaction with machines is increasing. Hand gestures
are one of the tools for interaction with machines and
human-machine interfaces. Image processing, sensor based and
sEMG based methods are the most popular for hand gesture
recognition. sEMG based hand gesture recognition is chosen
especially for graphical controllers, hand rehabilitation software
development, manipulation of robotic devices, etc.
In this study, classification of 5 hand motions, which are hand
open, hand close, cylindrical grasp, lateral pinch (key grasp) and
index finger opening, has been realized. As a classifier, an
Artificial Neural Network (ANN) is used. The data used for
training and validation were recorded from five subjects by using
the MYO® armband. Mean absolute value, slope sign change,
waveform length, Willison amplitude and mean frequency
features are used for classification. Classification performances
were evaluated for all five subjects together and for each subject
separately. In the study, we achieved an 88.4 % mean
classification rate by using five subjects’ recordings.

Keywords—Hand Gesture Recognition, sEMG, ANN
Classification

I. INTRODUCTION

Human beings interact with their environment by touching,
holding, grasping, carrying, non-verbal communication, etc.;
hands are used while doing these activities. As virtual
environment (VE) applications become widespread, the
studies about hand gesture recognition for the VE are getting
more attention. Vision-based hand recognition techniques [1-3],
sensor based [4, 5] and surface electromyogram (sEMG) based
methods [6, 7] are among the most popular hand gesture
recognition techniques. All these techniques have their own
pros and cons.

sEMG based hand recognition techniques are one of the
well-studied areas. In parallel to the latest developments, sEMG
measurement systems are used with wearable technologies.
Detecting muscular activities by using wearable measurement
systems presents new possibilities in human-machine
interaction (HCI) [8, 9]. There are studies in the literature
about EMG based interactive rehabilitation applications in
virtual applications, and applications intended to decrease the
adaptation time of EMG controlled active prostheses for
first-time users [10].

It is very important to classify hand movements accurately
in HCI applications. In this study, classification of five hand
motions (hand open, hand close, forefinger opening,
cylindrical grip and key grasp) was realized. Classification
performances for a single user and for multiple users were
compared. The Myo Armband®, a new wearable device for
measuring muscular activity, was used for the measurements.

II. MATERIALS AND METHODS

In this study, the surface EMG signals collected from
forearm muscles with the help of the Myo Armband, an easy to
use bioinstrumentation device (from Thalmic Labs Inc.), were
analyzed using an Artificial Neural Network (ANN) and
different feature extraction algorithms. Thus, different types of
wrist and arm movements were determined via the EMG data.
The volunteers were asked to repeat the following movements
in order to build up the EMG dataset to be analyzed by the
ANN:

• Cylindrical Grasping

• Key Pinch

• Hand Opening

• Hand Closing

• Forefinger Opening

The age, height and weight information of the volunteers
participating in the study is given in Table I.

TABLE I. THE WEIGHT AND HEIGHT OF PARTICIPANTS OF THE EXPERIMENT

Gender   Age   Body Weight   Body Height
Female   23    54 kg         1 m 68 cm
Female   24    58 kg         1 m 63 cm
Female   26    52 kg         1 m 62 cm
Male     38    73 kg         1 m 79 cm
Male     25    90 kg         1 m 80 cm

A. Myo Armband Implementation and EMG Measurements

The Myo armband is a device developed by the Thalmic
Labs company; it is a wearable system that is placed just below
the elbow to interact with the armband. The Myo Armband is
equipped with 8 dry electrodes for sEMG sensing and a 9-axis
Inertial Measurement Unit (IMU) that ensures arm motion
detection. The device uses Bluetooth low energy technology
for wireless connections. The operating frequency of the Myo
Smart Arm Band is the standard Bluetooth operating range of
2.402-2.480 GHz, and it has a working power output between
-30 dBm and -4 dBm.

The usable energy of the sEMG signal is limited to the 0 to
500 Hz frequency range, with the dominant energy between
50 Hz and 150 Hz. The armband’s sampling rate is 200 Hz. It
has a filter that suppresses frequencies above 500 Hz and
below 20 Hz.

The sEMG signals measured by the Myo Armband were
transferred to the computer via a wireless connection. The Myo
driver software saves the sampled sEMG signal in a Microsoft
Excel (.xls) file format where the program is installed. When
the Myo armband is placed on the forearm, the product logo on
the device is considered as the reference point. Fig. 1 shows
the placement of the Myo Armband device on the forearm.

B. EMG Signal Feature Extraction Methods

The feature extraction process is used in order to
differentiate the features of EMG signals produced by different
muscles or muscle groups by using various methods. Utilizing
these properties, systems that mimic the biomechanical
properties of humans or animals can be developed. Feature
extraction operations fall into two main groups, focusing either
on the time-dependent changes of the relevant signals or on
their frequency-dependent properties. Based upon previous
studies [11, 12] we have selected five features, which are Mean
Absolute Value (MAV), Slope Sign Change (SSC), Waveform
Length (WL), Willison Amplitude (WAMP) and Mean
Frequency (MF), as shown in Fig. 3.

Fig. 3. Selected method types for EMG feature extraction (time domain:
MAV, SSC, WL, WAMP; frequency domain: MF)

The extracted feature samples from subject 1 for one channel


are shown in Table II.

TABLE II. FEATURE SAMPLES FROM SUBJECT I

Fig. 1. Myo Armband position on the forearm


Features Value
Name
Myo Armband has surface EMG measurement electrodes so
the measurements mostly include EMG signals just beneath the WAMP 65
skin and rarely from the deep muscles. The name of superficial MAV 9.0199
muscles where the measurement take place are given on
anatomical view in Figure 2. SSC 53
WL 3036
MF 159.1591 Hz

Time dependent approaches are more useful for real-time


measurement and control applications because the computation
time is less than that of spectral methods and the demand of
physical system resources are simple.
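The five selected features have standard definitions in the sEMG literature; as an illustration, a minimal NumPy sketch of the time- and frequency-domain computations might look as follows (the SSC and WAMP threshold values are assumptions, since the text does not specify them):

```python
import numpy as np

def mav(x):
    """Mean Absolute Value: average of the rectified signal."""
    return float(np.mean(np.abs(x)))

def ssc(x, threshold=0.0):
    """Slope Sign Change: number of slope sign changes.
    The threshold (an assumption here) suppresses noise-induced changes."""
    d1 = x[1:-1] - x[:-2]   # backward differences
    d2 = x[1:-1] - x[2:]    # forward differences
    return int(np.sum((d1 * d2) > threshold))

def wl(x):
    """Waveform Length: cumulative length of the waveform."""
    return float(np.sum(np.abs(np.diff(x))))

def wamp(x, threshold=0.0):
    """Willison Amplitude: count of consecutive-sample differences
    exceeding the threshold (threshold value is an assumption)."""
    return int(np.sum(np.abs(np.diff(x)) > threshold))

def mf(x, fs):
    """Mean Frequency: power-weighted average frequency of the spectrum."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# Example: one 200-sample channel window at the armband's 200 Hz rate
rng = np.random.default_rng(0)
channel = rng.standard_normal(200)
features = [mav(channel), ssc(channel), wl(channel),
            wamp(channel, 0.5), mf(channel, 200)]
```

With 8 channels, concatenating the five values per channel yields the 40-element feature vector used as the network input.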

Fig. 2. Superficial muscles of the forearm [13]

C. ANN Training and Test

The data to be used for artificial neural network training were selected from the raw sEMG signals measured by the Myo Armband. The datasets are generated by repeating the corresponding movements 20 times. The first 200 samples of muscle contraction data recorded from the 8 channels are used
for extracting valuable features from the recordings, and using these features the ANN was trained.

The designed ANN consists of three layers: an input layer, a hidden layer and an output layer. The ANN has 20 neurons in the hidden layer and 5 neurons in the output layer. We extracted 5 features from each channel, so we used 40 neurons in the input layer for the 8 channels.

III. RESULTS

sEMG data recorded from 5 subjects are used in order to compare the performance of an off-the-shelf device for single-user and multi-user hand gesture recognition applications. For training, 75% of the data are used, and the remaining 25% are used for testing the ANN. Single-user performance is tested on each subject separately by training 5 identical ANNs. Multi-user performance is tested by combining the data from all 5 subjects and applying the same procedure as in the single-user testing.

The results are presented in confusion matrix format. M1, M2, M3, M4 and M5 represent the hand opening, hand closing, key pinch, cylindrical grasping and forefinger opening gestures, respectively. The single-user classification performances are given in Table III. As seen in Table III, the lowest classification rate for a single user is 89.6% and the highest is 99.2%. The multi-user classification rates are given in Table IV. The average multi-user classification rate is 88.4%.

TABLE III. CLASSIFICATION RESULTS FOR SINGLE USER

Subject 1 — predicted classes (rows) vs. actual classes (columns)
      M1    M2    M3    M4    M5
M1    4.8   0.4   0     0.8   0
M2    0     3.8   0     0     0
M3    0     0     5     0.2   0
M4    0.2   0.8   0     4     0
M5    0     0     0     0     5
Per-class accuracy: 96%, 76%, 100%, 80%, 100%; average: 90.4%

Subject 2
      M1    M2    M3    M4    M5
M1    4.2   0     0     0     0
M2    0     5     0     0     0
M3    0     0     4.6   0.2   0
M4    0.4   0     0.2   3.6   0
M5    0.4   0     0.2   1.2   5
Per-class accuracy: 84%, 100%, 92%, 72%, 100%; average: 89.6%

Subject 3
      M1    M2    M3    M4    M5
M1    4.8   0     0.6   0     0
M2    0     5     0     0     0
M3    0.2   0     4.4   0.2   0
M4    0     0     0     4.8   0
M5    0     0     0     0     5
Per-class accuracy: 96%, 100%, 88%, 96%, 100%; average: 96%

Subject 4
      M1    M2    M3    M4    M5
M1    4.2   2     0.6   0     0
M2    0     2.6   0     0     0
M3    0.8   0.2   4     0     0
M4    0     0.2   0.4   5     0
M5    0     0     0     0     5
Per-class accuracy: 84%, 52%, 80%, 100%, 100%; average: 90.4%

Subject 5
      M1    M2    M3    M4    M5
M1    5     0     0.2   0     0
M2    0     5     0     0     0
M3    0     0     4.8   0     0
M4    0     0     0     5     0
M5    0     0     0     0     5
Per-class accuracy: 100%, 100%, 96%, 100%, 100%; average: 99.2%

When the classification rates of the individual hand gestures are compared, forefinger opening has the highest classification rate for all 5 subjects and hand closing has the lowest.

TABLE IV. CLASSIFICATION RESULTS FOR MULTI-USER

Predicted classes (rows) vs. actual classes (columns)
      M1     M2     M3     M4     M5
M1    21.1   1.8    0.8    1.7    0
M2    0.4    20.7   0.3    0.3    0.2
M3    1.5    0.6    23.3   2.1    0.1
M4    1.3    1.3    0.4    20.7   0
M5    0.7    0.6    0.2    0.3    24.7
Per-class accuracy: 84.4%, 82.8%, 93.2%, 82.8%, 98.8%; average: 88.4%

IV. CONCLUSION

In this study, the classification of 5 basic hand gestures is realized. The single-user classification rates for the 5 hand gestures are between 89.6% and 99.2%, while the average multi-user classification rate is 88.4%. The results show that single-user classification performance is above multi-user performance. Evaluated by gesture, forefinger opening has the highest classification success, while hand closing and cylindrical grasping have the lowest.

Commercially available products such as the Myo armband are suitable for human-machine interaction applications and interactive computer-based rehabilitation, but the study shows that better performance is achieved if the training process is applied for every single user.
ACKNOWLEDGEMENT

This study is supported by the Marmara University Independent Research Project Commission (M.U. BAPKO).

REFERENCES

[1] Z. Ren, J. Meng, and J. Yuan, "Depth camera based hand gesture recognition and its applications in human-computer-interaction," in Information, Communications and Signal Processing (ICICS), 2011 8th International Conference on, 2011, pp. 1-5: IEEE.
[2] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan, "Vision-based hand-gesture applications," Communications of the ACM, vol. 54, no. 2, pp. 60-71, 2011.
[3] S. S. Rautaray and A. Agrawal, "Vision based hand gesture recognition for human computer interaction: a survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 1-54, 2015.
[4] C. Zhu and W. Sheng, "Wearable sensor-based hand gesture and daily activity recognition for robot-assisted living," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 41, no. 3, pp. 569-573, 2011.
[5] G. Saggio, "A novel array of flex sensors for a goniometric glove," Sensors and Actuators A: Physical, vol. 205, pp. 119-125, 2014.
[6] M. R. Ahsan, M. I. Ibrahimy, and O. O. Khalifa, "Electromygraphy (EMG) signal based hand gesture recognition using artificial neural network (ANN)," in Mechatronics (ICOM), 2011 4th International Conference On, 2011, pp. 1-6: IEEE.
[7] J. Kim, S. Mastnik, and E. André, "EMG-based hand gesture recognition for realtime biosignal interfacing," in Proceedings of the 13th International Conference on Intelligent User Interfaces, 2008, pp. 30-39: ACM.
[8] I. Moon, M. Lee, J. Chu, and M. Mun, "Wearable EMG-based HCI for electric-powered wheelchair users with motor disabilities," in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, 2005, pp. 2649-2654: IEEE.
[9] A. Pantelopoulos and N. G. Bourbakis, "A survey on wearable sensor-based systems for health monitoring and prognosis," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 1, pp. 1-12, 2010.
[10] A. Soares, A. Andrade, E. Lamounier, and R. Carrijo, "The development of a virtual myoelectric prosthesis controlled by an EMG pattern recognition system based on neural networks," Journal of Intelligent Information Systems, vol. 21, no. 2, pp. 127-141, 2003.
[11] U. Baspinar, H. S. Varol, and V. Y. Senyurek, "Performance comparison of artificial neural network and Gaussian mixture model in classifying hand motions by using sEMG signals," Biocybernetics and Biomedical Engineering, vol. 33, no. 1, pp. 33-45, 2013.
[12] K. Englehart and B. Hudgins, "A robust, real-time control scheme for multifunction myoelectric control," IEEE Transactions on Biomedical Engineering, vol. 50, no. 7, pp. 848-854, 2003.
[13] Available: https://www.visiblebody.com/anatomy-and-physiology-apps/human-anatomy-atlas [Accessed 14-02-2018]
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

Preliminary investigation of the in-cylinder pressure signal using Teager energy operator

Jerzy Fiołka
Institute of Electronics
Silesian University of Technology
ul. Akademicka 16, 44-100 Gliwice, Poland
e-mail: jerzy.fiolka@polsl.pl

Abstract—The Teager energy operator (TEO) has been used in various areas, including speech analysis, image processing, machinery fault diagnostics and biomedical engineering. The operator provides a simple and efficient solution to the problem of estimating the instantaneous frequency and envelope of amplitude- and frequency-modulated (AM-FM) signals. Furthermore, this method is easy to implement and has a low computational complexity.
In the paper, the author proposes using the TEO in an automotive application to perform preliminary investigations of the in-cylinder pressure signal. Research in this area is important because it determines fuel consumption, engine durability as well as the emission of air pollutants. Detecting abnormal combustion in spark-ignition (SI) engines is possible by measuring and analysing the engine block vibrations, the ionisation current and the in-cylinder pressure. However, the fundamental variable that provides an in-depth insight into the combustion process is the pressure signal. By analysing the signal, a detailed study of the knock phenomenon can be performed, which is necessary to develop a reliable and efficient knock detection method. By applying the proposed technique, we are able to identify the basic parameters of the pressure trace, such as the starting frequency and the rate of frequency change. By knowing the values of the parameters for various engine operating conditions, the performance of a knock detection system can be improved.

Index Terms—digital signal processing, time-frequency analysis, Teager operator, empirical mode decomposition, automotive engineering.

I. INTRODUCTION

In 2017, about 73 million cars were produced worldwide. In order to maintain the current growth tendency, the automotive industry has been exhorted by governments and customers to produce smarter, cleaner and more fuel-efficient vehicles. Carbon dioxide emission limits for new cars have put pressure on manufacturers to build more efficient engines or to use alternative powertrain technologies. In the case of SI engines, the widely used method to maximise power and efficiency is to run an engine as close to the knock threshold as possible. In practice, this is realised by using a closed-loop control of the spark ignition timing, which is based on a real-time analysis of the knock sensor signal. For this reason, the performance of the control strategy depends on the accuracy of the knock detection.

Knocking combustion in SI engines is an unwanted mode of combustion and is still one of the main problems that plague SI engines. In the last few decades, the phenomenon has been studied intensively by various investigators [1]. The name "knock" comes from the characteristic metallic sound that occurs when the combustion in a cylinder does not proceed correctly (the names engine knock, knocking and knocking combustion are also often encountered). Engine knock arises from the auto-ignition of a portion of the fuel-air mixture ahead of the propagating flame front [2]. A rapid release of chemical energy causes the in-cylinder pressure to increase considerably, reaching its highest value and then oscillating with a decaying amplitude. Very often, this leads to changes in pressure and temperature that are beyond the design limits. Moreover, once abnormal combustion is induced, the shock waves that arise in the cylinder cause the engine block to vibrate [3], [2]. The pressure curves versus the crank angle of a non-knocking and a knocking cycle are shown in Fig. 1.

[Fig. 1: in-cylinder pressure [MPa] vs. crank angle [°CA] from −10 to 80 around TDC, comparing normal and knocking combustion]

Fig. 1. Cylinder pressure as a function of the crank angle for normal and knocking combustion

Knocking combustion reduces engine durability, power density, fuel consumption as well as emission performance. In addition, knocking combustion can lead to damage of the engine parts (e.g. piston crown melting, breakage of piston rings). The knock tendency of an engine can be reduced by using fuel with a high octane number, cooling the intake charge, reducing the engine load, enhancing heat transfer and by proper ignition timing control. The main knock suppression method, which has been implemented in the Engine Control Unit (ECU), is spark advance control. The process works as follows: retarding spark timing reduces the
end-gas pressure and temperature. Due to the decrease in the in-cylinder pressure and temperature, the knock tendency is also reduced. However, this deteriorates engine performance. Thus, one widely used method is to keep an engine as close to the knock threshold as possible. From the above considerations, it is evident that in order to meet future regulations on fuel economy as well as environmental issues, the development of knock detection methods is critically important.

The main goal of this paper is to provide an efficient solution to the problem of estimating the basic parameters of the pressure signal. In Section II, the methods to be used to detect knock are outlined. Section III gives an overview of the experimental investigations that must be performed in order to develop a reliable knock detection method. In Section IV, the energy operator that was developed by Teager [4] is shown to be effective for estimating the amplitude envelope and the instantaneous frequency of non-stationary signals. However, the application of the method is limited to mono-component, narrow-band signals. In order to satisfy these requirements, Section V discusses the application of empirical mode decomposition (EMD) to separate the pressure signal components. Section VI presents simulation results, including an analysis of the pressure signal in white noise. The conclusions are presented in Section VII.

II. KNOCK DETECTION METHODS

There are several methods that are used to detect abnormal combustion. Detection can be performed by measuring and analysing the engine block vibrations, the ionisation current and the in-cylinder pressure [5], [6]. The first method determines the occurrence of knock by analysing the vibrations of the engine. The vibrations that are excited by knocking combustion are measured by an accelerometer. Because this non-invasive sensor has many advantages, such as its durability and low cost, the method has become very popular for mass-produced vehicles. However, there are also many disadvantages. The frequencies of the induced vibrations depend on the shape and dimensions of the chamber and the speed of sound, which vary greatly depending on the engine operating conditions. In addition, the induced vibrations must be distinguished from mechanically induced vibrations, which can even arise during normal combustion. Due to a poor signal-to-noise ratio (SNR) and the above-mentioned problems, some detection systems are deactivated when the engine rotational speed exceeds some limit (e.g. 3000 RPM).

It has been observed that there is a correlation between the ion-sense current and the in-cylinder pressure. The application of the method requires modifications of the ignition system. In this case, a spark plug is used as an ionisation probe. By supplying a DC voltage after ignition and measuring the flowing current, information about the combustion process can be gathered. The fundamental drawback of the method is that the ionisation current is significantly influenced by the fuel properties, gas temperature, spark plug location, the air-fuel ratio and electrical noise. Due to the high computational complexity of the signal processing and low robustness, the application of the method is very limited [7], [8].

The third method is based on the direct measurement and analysis of the in-cylinder pressure, which is influenced by knocking combustion. Because the pressure signal provides excellent knock information, it is often used as a reference for the other detection schemes (e.g. the previously described methods). Although it has obvious advantages, the main drawback is the high cost of piezoelectric-quartz sensors. This limits the use of this technique to experimental research. In contrast to the piezoelectric-quartz devices, a fiber-optic sensor, which has a projected price of a few dollars, meets all of the requirements for car production applications [9]. Thus, it opens the possibility of the practical utilisation of this sensor in mass-produced knock detection systems.

III. PRELIMINARY INVESTIGATION OF THE PRESSURE SIGNAL

There are many signal processing methods that can be used to detect knocking combustion. The basic approach is based on detecting one or more resonant frequencies in the pressure trace. The number of signal components, the starting frequencies and the modulation scheme depend on many factors, such as the combustion chamber geometry and the engine operating point (rotational speed, load, etc.). What is also important is the fact that the pressure signal is affected by background noise resulting from the ignition, valve closing, etc. Thus, to increase the efficiency of detection, the signal must be filtered, after which the knock intensity (KI) can be calculated using a specific formula. The most commonly used definitions of KI are [5], [10]:

• the mean square value of the bandpass-filtered signal;
• the maximum peak-to-peak value of the bandpass-filtered pressure oscillations;
• the signal energy of the highpass- or bandpass-filtered pressure signal.

Therefore, it goes without saying that in order to develop a reliable knock detection method, the filter parameters must be known. To determine these parameters, experimental investigations must be performed using an engine test bed that is equipped with comprehensive controls and instrumentation. The pressure data are usually acquired for various engine operating conditions (load and rotational speed of the engine) by using a data acquisition system. As a result, a pressure signal database can be created. Then, based on the collected samples, a detailed study of the pressure signal can be performed.

Since the pressure trace exhibits time-varying spectra, an in-depth insight into the behaviour of the signal can be gained using a time-frequency analysis [11]. Among the possible representations, the short-time Fourier transform (STFT) is the one that is of main interest. However, it does not permit the correct estimation of the instantaneous frequency [12] and suffers from a trade-off between the time and frequency resolution [13]. This problem can be solved by using bilinear
distributions such as the Wigner, Zhao-Atlas-Marks or Choi-Williams distribution. However, when the analysed signal includes two or more components, the bilinear distributions suffer from cross-terms.

The main method that is used to define the "knock signature" is a supervised analysis of the time-frequency distributions of the pressure signals by a person. The disadvantage of this technique is that it is a time-consuming process, which requires true engagement. However, it provides in-depth insight into the behaviour of the pressure signal. In addition, in the case of multi-component signals, a visual analysis permits incorrect interpretations of the results (cross-term interferences) to be avoided.

In addition to these methods, many different knock detection techniques have been proposed in the literature, e.g. using wavelets [14], [15], [16], the Wigner-Ville spectrum [11], the fractional Fourier transform [17], etc. However, like the previously described approaches, those methods also require preliminary investigations of the pressure signal in order to find parameters such as the number of components, the starting frequency, the modulation scheme, etc.

In light of these considerations, the author proposes a method that would facilitate the process of the preliminary analysis of the pressure signal. The method uses a nonlinear energy-tracking signal operator to estimate the amplitude envelope and the instantaneous frequency of the analysed signals. The results that are obtained, coupled with the simplicity of the method, establish the usefulness of the energy operator in experimental research on the combustion process.

IV. TEAGER ENERGY OPERATOR

The Teager energy operator (also called the Teager-Kaiser energy operator) is defined in the continuous domain as [18]

  Ψc[x(t)] ≜ (dx(t)/dt)² − x(t)(d²x(t)/dt²) = [ẋ(t)]² − x(t)ẍ(t)    (1)

where ẋ = dx/dt.

The operator Ψc is useful for analysing signals with a time-varying amplitude and frequency. The AM-FM signal can be written in the form

  x(t) = a(t) cos(φ(t)) = a(t) cos(ωc t + ωm ∫_0^t q(τ) dτ + θ)    (2)

where a(t) is the time-varying amplitude, ωc is the carrier frequency, ωm is the maximum frequency deviation from ωc, q(t) is the frequency-modulating signal with |q(t)| ≤ 1, and θ is an arbitrary phase offset.

Thus, the real signal (2) can be viewed as a cosine of carrier frequency ωc with a time-varying amplitude and a time-varying instantaneous angular frequency defined as the derivative of φ(t):

  ωi(t) ≜ dφ(t)/dt = ωc + ωm q(t)    (3)

By applying the operator Ψc to the signal of the form (2), we obtain [19]

  Ψc[x(t)] = Ψc[a(t) cos(φ(t))] = (a(t)φ̇(t))² + (a²(t)φ̈(t)/2) sin(2φ(t)) + cos²(φ(t)) Ψc[a(t)]    (4)

Equation (4) can be simplified by assuming that a(t) and φ̇(t) do not vary too fast. Under these realistic conditions

  Ψc[a(t) cos(φ(t))] ≈ (a(t)ωi(t))²    (5)

This means that the TEO estimates the squared product of the amplitude a(t) and the instantaneous frequency ωi(t). Similarly, we can apply the operator to the signal derivative ẋ(t):

  Ψc[ẋ(t)] ≈ a²(t)ωi⁴(t)    (6)

By combining (5) and (6) we obtain

  ωi(t) ≈ √(Ψ[ẋ(t)] / Ψ[x(t)])    (7)

  |a(t)| ≈ Ψ[x(t)] / √(Ψ[ẋ(t)])    (8)

Equations (7) and (8) constitute the main part of the Continuous Energy Separation Algorithm (CESA). Therefore, the CESA provides the solution to the problem of estimating the instantaneous frequency ωi(t) and the amplitude envelope |a(t)| of the signal being analysed.

A. Discrete-time energy separation algorithm

In the discrete domain, the Teager energy operator is defined as [18]

  Ψd[x(n)] = x²(n) − x(n−1)x(n+1)    (9)

where x(n) is a sampled version of x(t) and n = 0, ±1, ±2, ...

Similar to the previously described operator, Ψd is the fundamental component of the discrete version of the CESA. Depending on the method that is used to approximate the derivatives, three basic versions of the Discrete Energy Separation Algorithm (DESA) are reported in the literature: DESA-1a, DESA-1 and DESA-2. Because it has the best performance, the DESA-1 algorithm is used in the presented work (an extensive numerical comparison of the above-mentioned algorithms can be found in [18]). In this case, the instantaneous frequency and envelope are defined as:

  Ωi(n) ≈ arccos(1 − (Ψd[y(n)] + Ψd[y(n+1)]) / (4Ψd[x(n)]))    (10)

  |a(n)| ≈ √( Ψd[x(n)] / (1 − (1 − (Ψd[y(n)] + Ψd[y(n+1)]) / (4Ψd[x(n)]))²) )    (11)

where

  y(n) = x(n) − x(n−1)    (12)

Moreover, we assume that 0 < Ωi(n) < π. To illustrate the operation of DESA-1, Fig. 2 shows the estimation of
the instantaneous frequency and envelope of an exponentially damped linear chirp sampled at 1 kHz with a starting and a final frequency of 200 Hz and 100 Hz, respectively.

[Fig. 2: three panels — (a) x(t) [kPa], (b) instantaneous frequency [Hz], (c) amplitude envelope [kPa], each plotted versus t [s]]

Fig. 2. (a) Exponentially-damped linear chirp. (b) Estimated instantaneous frequency. (c) Estimated amplitude envelope

V. EMPIRICAL MODE DECOMPOSITION

The application of the energy separation algorithm is limited to mono-component signals. In order to overcome this problem, several techniques have been proposed in the literature. In speech processing, in order to isolate individual resonances, bandpass filters that are tuned to the frequencies of speech formants are used. However, this approach requires a priori knowledge of the speech formant frequencies. A simple solution, which is based on a visual study of the spectral peaks or the time-frequency plane, is to set the filter parameters manually, but this is not feasible [18]. An alternative approach uses a bank of relatively closely spaced filters. In this case, a subset of filters is selected by calculating the energy across all bands and selecting the n highest values. As a result, the center frequencies of the speech formants can be determined [20].

In this work, the analysed multi-component signal is decomposed into individual components using an adaptive method called empirical mode decomposition (EMD). The method has been successfully used in a number of applications. Moreover, the usefulness of this approach for decomposing the in-cylinder pressure signal was confirmed in an earlier paper by the author [21]. EMD is an adaptive method that decomposes the signal being analysed into a series of intrinsic mode functions (IMFs), with each IMF being a mono-component function, and a residue. The non-stationary signal is then represented as [22]:

  x(t) = rN(t) + Σ_{i=1..N} ci(t)    (13)

where x(t) is the signal being analysed, rN(t) is a residue, ci(t) is the i-th intrinsic mode function and N is the number of IMFs.

An IMF must fulfill the following conditions [22]:

• the number of extrema and the number of zero crossings in the input data set must either be equal or differ by one at most;
• at any point, the mean value of the envelope that is defined by the local maxima and the envelope that is defined by the local minima is zero.

A residue, rN(t), can either be a constant, a monotonic mean trend or a curve that has only one extremum.

The steps of the EMD algorithm are as follows:

1) Identify the local extrema (maxima, minima) of the signal being analysed.
2) Connect the maxima using an interpolation function (a common choice is the cubic spline), thus creating the upper envelope, eUP(t), for the signal. Connect the minima with an interpolation function, thereby creating a lower envelope, eDOWN(t), for the signal.
3) Calculate the local mean

  m1(t) = (eUP(t) + eDOWN(t)) / 2    (14)

4) Subtract the local mean from the signal, i.e.

  h1(t) = x(t) − m1(t)    (15)

The function h1(t) that is obtained is the first proto-IMF (PIMF) (the procedure of extracting an IMF is called sifting). If the PIMF satisfies the definition of the IMF, proceed to step 6.
5) Repeat the sifting process. In the subsequent sifting processes, h1(t) is treated as the input data

  h11(t) = h1(t) − m11(t)    (16)

where m11(t) is the mean of the upper and lower envelopes of h1(t). Repeat the procedure up to k times. Then, h1k(t) is given by the equation

  h1k(t) = h1(k−1)(t) − m1k(t)    (17)

When h1k(t) meets the definition of the IMF, the procedure is terminated and we proceed to step 6. It is worth mentioning that this step can go on for many iterations. In practical implementations, the number of sifting steps required to produce an IMF is limited according to the stoppage criteria of the sifting [22].
6) The function that is obtained satisfies the definition of the IMF (or the stoppage criteria). This component is designated as the first IMF

  c1(t) = h1k(t)    (18)

7) Subtract the first IMF from the signal being analysed in order to produce the first residue

  r1(t) = x(t) − c1(t)    (19)

We treat r1(t) as a new data set and perform the sifting process to obtain c2(t). This procedure can be repeated for all of the subsequent residues.
8) EMD decomposition is completed when the last residual, rN(t), is less than a predetermined value or when
a residue becomes a monotonic function. Finally, the original signal x(t) is decomposed into the IMFs and one residue rN(t).

VI. RESULTS

A. Signal model

The pressure signal can be modelled as a sum of a large-amplitude, slowly varying component and a component that contains information about the knock signatures, s(t). Because the slowly varying component does not provide any information about the knock, it can be eliminated from the signal using high-pass filtering.

As was shown in [23], [24], within the knock window (the crank angle range starting from near the top dead center and ending about 70° after the top dead center), the dominant resonance modes exhibit identical modulation schemes. In this interval, s(t) can be modelled as a sum of exponentially damped linear chirps. Thus, in the presence of noise, the pressure signal being analysed is

  p(t) = n(t) + s(t) = n(t) + Σ_{k=1..K} ak(t) cos[φk(t)]    (20)

with

  ak(t) = Ak exp(−dk t)    (21)

where n(t) is background noise that is modelled as white Gaussian noise, K is the number of resonance frequencies, and Ak and dk are the initial amplitude and damping constant of the k-th resonance frequency, respectively.

Using the relation ω = 2πf, the instantaneous frequency can be written in a more convenient form

  fk(t) = γk + χk t    (22)

where fk(t) is the instantaneous cyclic frequency and γk and χk are the starting frequency and the rate of the frequency change of the k-th component, respectively. Moreover, the linear chirps are well concentrated in the time-frequency plane and do not overlap.

B. Noise

The main disadvantage of the TEO is its high sensitivity to noise. In order to overcome this problem, several methods have been studied in the literature [20], [25], [26]. An interesting approach was proposed in [25]. In this work, EMD was combined with a filtering that is based on the Savitzky-Golay (S-G) filter [27]. To obtain a denoised version of an individual signal component, each IMF is filtered separately. By applying this filter to the set of input samples, a least-squares polynomial approximation (smoothing) is performed, which increases the signal-to-noise ratio without greatly distorting the signal.

To overcome this problem, the proposed method employs only one S-G filter in order to increase the signal-to-noise ratio of the input signal. The filter parameters were selected to be optimally fitted to the analysed data experimentally. After the prefiltering step, the smoothed signal was processed using the DESA-1 algorithm.

As was assumed, the instantaneous frequency of the pressure signal components changes linearly with time (22). The starting frequency and the rate of frequency change are particularly important in developing a reliable knock detection method. Each of these variables can be obtained by applying a linear regression to the data set that is generated by (10).

C. Simulation results

The effectiveness of the proposed method was tested on noisy AM-FM signals that had been synthesised according to (20). The values of the parameters in the equation were chosen to be similar to the parameters of the actual data. The synthetic signal consisted of two components with the following parameter values:

• γ1 = 10 kHz, χ1 = −390.625 kHz/s, d1 = −135, A1 = 10
• γ2 = 6 kHz, χ2 = −390.625 kHz/s, d2 = −135, A2 = 10

The sampling frequency fc and the number of samples N are 50 kHz and 256, respectively. The calculations were performed for additive noise cases with SNR = 10 dB, 15 dB, 20 dB, 30 dB and 40 dB. For each SNR value, the simulation was run 100 times.

The DESA-1 algorithm was implemented in the Matlab environment. The EMD was carried out on the signal being analysed using freely available scripts [28]. The obtained decomposition consisted of four IMFs and a residue. The first knock resonance frequency (located at 10 kHz) was represented by IMF1, the second (located at 6 kHz) by IMF2. The energy of the IMFs with an index higher than two was significantly lower than the energy of the first two IMFs. This result proved that the EMD had effectively decomposed the synthetic signal into two IMFs. The parameters of the S-G filter (order = 7 and frame length = 11) were selected using the charts presented in [27].

The performance of the methods was measured by estimating the relative error, which was defined as:

  δγ = (|γ̂ − γ| / |γ|) · 100%    (23)

  δχ = (|χ̂ − χ| / |χ|) · 100%    (24)

where γ̂ is the estimated value and γ is the known value of the starting frequency. Similarly, χ̂ and χ denote the estimated and known values of the chirp rate, respectively.
The drawback of the above-mentioned approach is the The maximum value of the relative errors are presented in
fact, that the method assumes apriori knowledge about the Tab. I. The results that were obtained indicate that the proposed
parameters of the signal being analysed (e.g. the number of method is promising for evaluating the basic parameters of
frequency component, the frequency range of the individual the pressure signal, i.e. the starting frequency and the rate of
components etc.) to properly select the filter parameters. To the frequency change. These values are required in order to

35
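For illustration, the core of this experiment can be reproduced in a few lines. The following Python sketch (not the authors' Matlab implementation; the FFT-based Hilbert transform, the edge trimming, and the regression via `np.polyfit` are implementation choices of this sketch) generates one noiseless component according to (20)-(22) with the stated parameters of the first component, and recovers its starting frequency γ and chirp rate χ by a linear regression on the instantaneous frequency, evaluated as in (23) and (24).

```python
import numpy as np

# Synthetic pressure-signal component, following Eqs. (20)-(22):
# s(t) = A * exp(-d t) * cos(phi(t)), with f(t) = gamma + chi * t.
fs, N = 50e3, 256                      # sampling frequency and length from the paper
t = np.arange(N) / fs
gamma1, chi1, d1, A1 = 10e3, -390.625e3, -135.0, 10.0

phase = 2 * np.pi * (gamma1 * t + 0.5 * chi1 * t**2)   # integral of f(t)
s = A1 * np.exp(-d1 * t) * np.cos(phase)

# Analytic signal via an FFT-based Hilbert transform, then instantaneous
# frequency as the derivative of the unwrapped phase.
S = np.fft.fft(s)
h = np.zeros(N)
h[0], h[1:N // 2], h[N // 2] = 1, 2, 1                 # one-sided spectrum weights
analytic = np.fft.ifft(S * h)
f_inst = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)

# Linear regression on f(t) = gamma + chi * t; a few samples at each end
# are trimmed to avoid edge effects of the discrete Hilbert transform.
sl = slice(20, N - 21)
chi_hat, gamma_hat = np.polyfit(t[sl], f_inst[sl], 1)

delta_gamma = abs(gamma_hat - gamma1) / abs(gamma1) * 100   # Eq. (23)
delta_chi = abs(chi_hat - chi1) / abs(chi1) * 100           # Eq. (24)
```

In the noiseless case both relative errors stay small; the noisy multi-component case additionally requires the EMD and S-G filtering steps described above.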
develop a reliable detection method (the initial amplitude and the damping factor are not essential and, therefore, were not estimated).

                                  TABLE I
      THE RELATIVE ERROR OF THE ESTIMATED PARAMETERS FOR VARIOUS SNR

    SNR [dB]    IMF1                         IMF2
    10          δγ = 1.11 %, δχ = 16.01 %    δγ = 2.15 %, δχ = 18.33 %
    15          δγ = 0.98 %, δχ = 10.24 %    δγ = 1.32 %, δχ = 12.2 %
    20          δγ = 0.77 %, δχ = 8.04 %     δγ = 0.84 %, δχ = 7.24 %
    30          δγ = 0.63 %, δχ = 6.25 %     δγ = 0.67 %, δχ = 6.34 %
    40          δγ = 0.6 %,  δχ = 5.99 %     δγ = 0.63 %, δχ = 6.22 %

VII. CONCLUSIONS

The aim of this paper was to evaluate the feasibility and usefulness of the Teager energy operator to identify the basic parameters of the pressure signal during knocking combustion. The correct recognition of the "knock signature" is particularly important in the research and development of detection schemes. The main advantage of this approach is the fact that this method does not suffer from cross-terms, and therefore the results that are obtained are easy to interpret. When accurate information about the knock phenomenon is known, the performance of a knock detection system can be improved.

In the presented work, the individual resonance frequencies were extracted from the multi-component pressure signal using the well-known and widely used EMD algorithm. Then, by applying the DESA-1 algorithm to each IMF, the instantaneous frequency and amplitude envelope could be calculated efficiently. In order to improve the robustness of the method in a noisy environment, Savitzky-Golay filtering and linear regression were applied. As was demonstrated through the simulation experiments, the proposed method provides a precise estimation of the pressure signal parameters under various engine operating conditions. Moreover, the method has a low computational complexity (the computational aspects of the DESA-1 and the EMD algorithm are presented in [20], [22]).

Future work will be focused on improving the noise immunity of the algorithm by replacing the EMD with a new technique called ensemble empirical mode decomposition (EEMD) and by enhancing the signal prefiltering (e.g. wavelet filtering).

ACKNOWLEDGMENT

This work was supported by the Ministry of Science and Higher Education funding for statutory activities.

REFERENCES

[1] Z. Wang, H. Liu, and R. Reitz, "Knocking combustion in spark-ignition engines," Progress in Energy and Combustion Science, vol. 61, pp. 78-112, 2017.
[2] J. B. Heywood, Internal Combustion Engines Fundamentals. McGraw-Hill, 1998.
[3] G. Ferrari, Internal Combustion Engines. Esculapio, 2014.
[4] H. Teager, "Some observations on oral air flow during phonation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 599-601, 1980.
[5] F. Millo and C. Ferraro, "Knock in S.I. engines: A comparison between different techniques for detection and control," in SAE Technical Paper, 1998, pp. 25-42, 982477.
[6] Z. Xudong, W. Yang, X. Shuaiqing, Z. Yongsheng, T. Chengjun, X. Tao, and S. Mingzhi, "The engine knock analysis - an overview," Applied Energy, vol. 92, pp. 628-636, 2012.
[7] L. Peron, A. Charlet, P. Higelin, B. Moreau, and J. Burq, "Limitations of ionization current sensors and comparison with cylinder pressure sensors," in SAE Technical Paper, 2000, 2000-01-2830.
[8] J. Wagner, J. Keane, R. Koseluk, and W. Whitlock, "Engine knock detection: Products, tools, and emerging research," in SAE Technical Paper, 1998, 980522.
[9] M. T. Wlodarczyk, T. Poorman, L. Xia, J. Arnold, and T. Coleman, "Embedded fiber optic combustion-pressure sensors for automotive engines," in FISITA World Automotive Congress, 1998.
[10] R. Worret, S. Bernhardt, F. Schwarz, and U. Spicher, "Application of different cylinder pressure based knock detection methods in spark ignition engines," in SAE Technical Paper, 2002, 2002-01-1668.
[11] S. Carstens-Behrens and J. Bohme, "Applying time-frequency methods to pressure and structure-borne sound for combustion diagnosis," Signal Processing and its Applications, vol. 1, pp. 256-259, 2001.
[12] O. Boubal, "Knock detection in automobile engines," IEEE Instrum. Meas. Mag., vol. 3, pp. 24-28, 2000.
[13] S. Qian, Joint Time-Frequency Analysis: Methods and Applications. Prentice Hall, 1996.
[14] J. Fiolka, "A fast method for knock detection using wavelet transform," in Proceedings of the International Conference Mixed Design of Integrated Circuits and Systems, MIXDES 2006, Poland, 2006, pp. 621-626.
[15] C. Liu, Q. Gao, Y. Jin, and W. Yang, "Application of wavelet packet transform in the knock detection of gasoline engines," in Proc. of International Conf. on Image Analysis and Signal Processing, 2010.
[16] Z. Zhang and E. Tomota, "A new diagnostic method of knocking in a spark-ignition engine using the wavelet transform," in SAE Technical Paper, 2000, 2000-01-1801.
[17] J. Fiolka, "Application of the fractional Fourier transform in automotive system development: The problem of knock detection," in Conference Proceedings: Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2017, 2017, pp. 286-291.
[18] P. Maragos, T. Quatieri, and J. Kaiser, "Energy separation in signal modulations with applications to speech analysis," IEEE Trans. Signal Process., vol. 41, pp. 3024-3051, 1993.
[19] ——, "On amplitude and frequency demodulation using energy operators," IEEE Trans. Signal Process., vol. 41, pp. 1532-1550, 1993.
[20] E. Kvedalen, "Signal processing using the Teager energy operator and other nonlinear operators," Cand. Scient. thesis, University of Oslo, Department of Informatics, 2003.
[21] J. Fiolka, "Application of Hilbert-Huang transform to engine knock detection," in Proceedings of the 20th International Conference Mixed Design of Integrated Circuits and Systems, MIXDES 2013, Poland, 2013, pp. 457-461.
[22] N. Huang and S. Shen, The Hilbert-Huang Transform and its Applications. World Scientific Publishing Co, 2005.
[23] D. Konig, "Application of time-frequency analysis for optimum non-equidistant sampling of automotive signals captured at knock," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, 1996, pp. 2746-2749.
[24] J. Fiolka, "Knock detection in gasoline engines using time-frequency methods," Ph.D. dissertation, Silesian University of Technology, Gliwice, Poland, 2004.
[25] A. Bouchikhi, A. Boudraa, J. Cexus, and T. Chonavel, "Analysis of multicomponent LFM signals by Teager Huang-Hough transform," IEEE Trans. Aerosp. Electron. Syst., vol. 50, pp. 1222-1233, 2014.
[26] A. Bovik, P. Maragos, and T. Quatieri, "AM-FM energy detection and separation in noise using multiband energy operators," IEEE Trans. Signal Process., vol. 41, pp. 3245-3265, 1993.
[27] R. Schafer, "What is a Savitzky-Golay filter?" IEEE Signal Process. Mag., vol. 28, pp. 111-117, 2011.
[28] G. Rilling, P. Flandrin, and P. Goncalves, "On empirical mode decomposition and its algorithm," in Proceedings of the 6th IEEE/EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP '03), 2003.
Hierarchical Feature-learning Graph-based Segmentation of Fat-Water MR Images

Faezeh Fallah and Bin Yang
Institute of Signal Processing and System Theory, University of Stuttgart
Email: faezeh.fallah@iss.uni-stuttgart.de

Sven S. Walter and Fabian Bamberg
Department of Diagnostic and Interventional Radiology, University Clinic of Tübingen
Email: fabian.bamberg@med.uni-tuebingen.de
Abstract—In this paper, we proposed a deformation-/registration-free method for multilabel segmentation of fat-water MR images without the need for prior localization or geometry estimation. This method employed a multiresolution (hierarchical) feature- and prior-based Random Walker graph and a hierarchical conditional random field (HCRF). To incorporate both aspatial (intra-patch) and spatial (inter-patch neighborhood) information into the image segmentation, the proposed Random Walker graph was made of a multiresolution spatial and a multiresolution aspatial (prior-based) sub-graph. Edge weights and prior probabilities of this graph, as well as the energy terms of the HCRF, were determined by a hierarchical random decision forest classifier. This classifier was trained using multiscale local and contextual features extracted from fat-water (2-channel) magnetic resonance (MR) images. The proposed method was trained and evaluated for simultaneous volumetric segmentation of vertebral bodies and intervertebral discs on fat-water MR images. These evaluations revealed its comparable accuracy to the state-of-the-art while demanding fewer computations and less training data. The proposed method was, however, generic and extendible to segmenting any kind of tissue on other multichannel images.

I. INTRODUCTION

Machine-learning algorithms offered a vast range of methods for feature selection and multilabel image segmentation. However, they either needed a large amount of training data or relied only on aspatial (intra-patch) information [1]. The lack of an inherent facility for incorporating spatial (neighborhood) information has compromised their accuracy in most segmentation tasks and required post-correction steps to refine their segmentations [2]. In contrast, graph-based segmentation methods mainly relied on spatial (inter-patch) relationships to classify image patches. For feasible computations, these relationships were mostly modeled by conditional or Markov random fields. Among these approaches, the Random Walker algorithm segmented an image by modeling intensity similarities of its neighboring patches according to Markov chains and a Gaussian Markov field [3]. To this end, it built an undirected weighted graph, including some labeled vertices (seeds), from the image. The labeled seeds could be replaced by aspatially derived prior probabilities of the image patches [4]. Accordingly, this algorithm could incorporate both spatial and aspatial information into its segmentation process and provided a generic, reliable framework with an efficient computation enabled by its sparse system of linear equations. However, it only processed image intensities at a single resolution and used no feature-learning method. Image intensities were known not to be reliable representatives of the underlying classes. Their fluctuations, induced by noise, imaging apparatus, or preprocessing steps, could compromise the accuracy of the intensity-based segmentations. In segmenting medical images, these fluctuations could also be caused by pathological changes of the tissues or interactions between the tissues and the imaging system. This hindered reproducible segmentations, particularly in dealing with cohort images.

To reduce the complexity of segmentations and to enable a registration-/deformation-free localization in a large volumetric image, hierarchical feature-learning methods were proposed [2], [5]. These methods also enhanced the scale-invariance of the segmentations. That is, they improved the flexibility and robustness of the segmentations when sophisticated features were to be extracted from patches of unknown optimal size or when the patches' sizes should be adapted to the feature type or the object's size in an image.

In a hierarchical auto-context classifier [5], multiple classifiers were built for different spatial resolutions and their classifications were fused by a hierarchical conditional random field (HCRF) with regard to the hierarchical consistencies of the estimated labels. In [2], the same approach was taken, with the difference that only one multiresolution classifier was trained for all the resolutions. Thus it further reduced the complexity of the segmentations. Despite these advantages, no hierarchical method was used in Random Walker segmentation. This could be due to the edge detection mechanism of the Random Walker algorithm, which relied on intensity gradients. Coarser image patches had a lower sensitivity to intensity gradients than the finer ones. Thus their detections were not reliable initializers for the finer estimates.

In this paper, we proposed a hierarchical deformation-/registration-free method for segmenting fat-water (2-channel) magnetic resonance (MR) images without the need for prior localization or geometry estimation. This method employed a multiresolution feature-learning prior-based Random Walker algorithm and a HCRF. The Random Walker algorithm used a novel graph made of a multiresolution spatial and a multiresolution aspatial (prior-based) sub-graph. Edge weights of this graph represented either spatial (intra-resolution inter-patch) or aspatial (intra-resolution intra-patch) relationships. Over this graph, image patches were classified at different resolutions. These estimates were then fused by a HCRF with regard to the hierarchical (inter-resolution inter-patch) consistencies of the estimated labels. Also in this method, we derived the edge weights and the prior probabilities of the proposed graph as well as the energy terms of the HCRF by a hierarchical random decision forest classifier. To enhance the scale- and rotation-invariance of the segmentations, this classifier was trained using multiscale local and contextual features extracted from image patches at different resolutions. The proposed
Fig. 1: Flowchart of the proposed method applied to the simultaneous volumetric segmentation of VBs and IVDs on fat-water MR images.

method was trained and evaluated for the simultaneous volumetric segmentation of 10 thoracic (T3-T12) and 5 lumbar (L1-L5) vertebral bodies (VBs) and their intervertebral discs (IVDs) on fat-water MR images. A fat-water image comprised a volumetric fat image and its corresponding water image. The fat (water) images formed an MR channel.

II. MATERIALS AND METHODS

A. Framework of Automatic Segmentation

Fig. 1 shows the framework of the proposed method for multilabel segmentation of fat-water MR images. This method consisted of separate steps for training and testing. It used a multiresolution image pyramid to form multiresolution training and test data and to build a hierarchical random decision forest classifier, the proposed multiresolution image graph, and the HCRF. The image pyramid was made by extracting local and contextual features from fat-water patches at different spatial resolutions. During the training, the multiresolution training data were used to optimize the parameters of the hierarchical classifier. This training determined the most discriminant features for patch classifications at every resolution layer and the Gini impurities of the classified training data. During the test, the multiresolution test data were processed by the trained classifier. This assigned the class posterior probabilities and the Gini impurities of the training data to the test data. The most discriminant features and the assigned probabilities and impurities determined the edge weights of the proposed graph. The most discriminant features and the assigned impurities determined the energy terms of the HCRF. Over the graph, class posterior probabilities were computed for image patches at different resolutions. These probabilities were fused and regularized by the HCRF in order to estimate patch-wise labels with regard to their hierarchical consistencies.

B. Reference Labels

For training and testing, voxel-wise reference labels of all the fat-water images were needed. This labeling L(1) mapped the domain of all the fat-water images Ω ⊂ R³ to |L| = Nc classes contained in L = {C1, ..., CNc}.

C. Multiresolution Training and Test Data

According to [2], a multiresolution image pyramid, consisting of Nr resolution layers, was built from every fat-water image by extracting a feature vector f_j^(l) from every cubic patch s_j^(l) ∈ S^(l) of it at every resolution layer l ∈ {1, ..., Nr}. Here, S^(l) = ∪_j s_j^(l) was the subsampling of the image domain Ω ⊂ R³ with S^(1) = Ω. Patches at the l-th layer had a size of 2^{3(l−1)} voxels and the patch overlaps at the coarsest layer were 100 · (1 − 2¹/2^{Nr−1}) % in all directions. The image subsampling was done from coarse to fine by uniformly dividing every s_j^(l) ∈ S^(l) into 8 disjoint patches {s_i^(l−1) ∈ S^(l−1)}_{1≤i≤8} with s_j^(l) = ∪_{i=1}^{8} s_i^(l−1). The feature vector f_j^(l) contained local and contextual intra- and inter-channel features computed from the median of intensities, the average gradient magnitude, the average gradient orientation, and 42 angle-invariant Haralick features [6], and their mean and maximum differences over the 26-connected neighborhood of s_j^(l) ∈ S^(l) [2]. This vector contained 67 elements for l = 1 and 158 elements for l ≥ 2. Its elements were normalized to zero mean and unit variance to stabilize the penalized linear discriminants of the random forest classifier [2], [7]. Also, features based on the mean or maximum differences were rotation-invariant. This
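The coarse-to-fine patch subdivision described above can be sketched in a few lines. The following Python fragment is illustrative only (the helper `patch_children` and the voxel-coordinate layout are hypothetical, not the authors' code): a cubic patch at layer l has an edge of 2^{l−1} voxels, i.e. 2^{3(l−1)} voxels in total, and splits into 8 disjoint children at layer l−1.

```python
def patch_children(origin, l):
    """Return the 8 layer-(l-1) child patches of a layer-l cubic patch.

    `origin` is the (z, y, x) corner of the patch in voxel coordinates;
    a layer-l patch has an edge of 2**(l-1) voxels, so each child has
    an edge of 2**(l-2) voxels."""
    half = 2 ** (l - 2)                     # edge length of a child patch
    z, y, x = origin
    return [(z + dz * half, y + dy * half, x + dx * half)
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]

l = 3                                        # a 4x4x4-voxel patch at layer 3
voxels_per_patch = 2 ** (3 * (l - 1))        # 2**(3*(l-1)) = 64 voxels
children = patch_children((0, 0, 0), l)      # 8 disjoint 2x2x2 children
```

Together the 8 children tile the parent patch exactly, which is what lets the patch-wise labels and features be propagated consistently between adjacent resolution layers.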
Fig. 2: The 26-connected neighborhood of a spatial vertex, a portion of the sub-graphs forming G, and the parent-child relationships of patches in the HCRF.

way, from all the training fat-water images and their reference labels, multiresolution training data T = {T^(l)}_{l=1}^{Nr} with T^(l) = {t_j^(l) = (s_j^(l), h_j^(l), c_j^(l), f_j^(l))} were generated. These data associated each patch s_j^(l) ∈ S^(l) with its label histogram h_j^(l) ∈ H, its reference label c_j^(l) ∈ L, and its feature vector f_j^(l). The label histogram h_j^(l) ∈ H was used to extend the voxel-wise reference labels L(1) to the patch-wise reference labels L(l): S^(l) → L. Similarly, multiresolution test data D = {D^(l)}_{l=1}^{Nr}, with D^(l) = {d_j^(l) = (s_j^(l), f_j^(l))}, were generated from every test fat-water image.

D. Training of the Random Forest Classifier

The hierarchical random decision forest classifier involved Nt hierarchical binary decision trees. Every tree was composed of binary decision nodes at different resolution layers. Every node classified its received data into Nc classes by using a penalized multivariate linear discriminant [2]. The parameters of these discriminants were optimized by feeding a random subset of the coarsest training data T^(Nr) to the root nodes of the trees and recursively growing the trees from coarse to fine according to the multiresolution image pyramid. At every node m^(l) of the l-th tree layer, the above optimizations were done using N_d ≈ √dim(f_j^(l)) randomly sampled features of the received training data T_m^(l) ⊂ T^(l). After the training, in every m^(l), the indices of the sampled features, the optimized parameters of the linear discriminant, and the labels distribution

    z_{T_m^(l)}(c) = |{t_j^(l) ∈ T_m^(l) | c_j^(l) = c}|,  c ∈ L,

of T_m^(l) ⊂ T^(l) were saved to be used in the test phase.

E. Layer-wise Feature Selection

In every layer of the trained classifier, the indices of the most discriminant features were determined by considering the N_sig largest elements of the optimized projecting coefficients of the linear discriminants at this layer [2]. These features formed f_sig^(l) for the l-th layer to be used in the test phase.

F. The Proposed Multiresolution Image Graph

The prior-based Random Walker algorithm involved a spatial, G_s, and an aspatial (prior-based), G_p, sub-graph at one resolution [3], [4]. Edge weights of G_s were derived from a function of the gradient magnitude of the image [3], [8]. Edge weights of G_p were derived from some prior probabilities scaled by a regularization parameter [4].

Our proposed method for image segmentation expanded on [3], [4] by incorporating a multiresolution spatial {G_s^(l)}_{l=1}^{Nr} and a multiresolution aspatial (prior-based) {G_p^(l)}_{l=1}^{Nr} sub-graph into a unified graph and by fusing (regularizing) their estimates with a HCRF. Also, we proposed a novel approach for deriving the edge weights of the proposed graph and the energy terms of the HCRF. This approach relied on a hierarchical random decision forest classifier trained using multiscale local and contextual features of fat-water MR images.

The proposed graph was G = {{G_s^(l)}_{l=1}^{Nr}, {G_p^(l)}_{l=1}^{Nr}}. The sub-graph G_s^(l) = (V_s^(l), E_s^(l)), representing spatial (intra-resolution inter-patch) relationships, was 26-connected. Every vertex v_si^(l) ∈ V_s^(l) represented the patch s_i^(l) ∈ S^(l) of a test sample d_i^(l) = (s_i^(l), f_sig,i^(l)) ∈ D^(l). The graph edges were E_s^(l) = {e_si,j^(l) | s_i^(l) ∈ S^(l) is adjacent to s_j^(l) ∈ S^(l)}.

The sub-graph G_p^(l) = (V_s^(l) ∪ V_p^(l), E_p^(l)) represented aspatial (intra-resolution intra-patch) priors. The set V_p^(l) contained Nc vertices. Every vertex v_pi^(l) ∈ V_p^(l) represented one class and was connected to every v_sj^(l) ∈ V_s^(l) through an edge e_pi,j^(l). In this sub-graph, no edge connected {v_si^(l) ∈ V_s^(l)} and no edge connected {v_pi^(l) ∈ V_p^(l)}. That is, E_p^(l) = {e_pi,j^(l)}. The graph G was undirected. Fig. 2 shows a portion of its sub-graphs.

G. Edge Weights of {G_p^(l)}_{l=1}^{Nr}

In the test phase, the test data D = {D^(l)}_{l=1}^{Nr} were processed by the trained random forest classifier. Every test sample d_j^(l) = (s_j^(l), f_sig,j^(l)) ∈ D_m^(l) ⊂ D^(l) that reached m^(l) was classified according to the sampled features and the optimized discriminant saved at m^(l) during the training. Also, from z_{T_m^(l)}(c), c ∈ L, saved at m^(l), the class posterior probabilities {P_m^(l)(c)}_{c∈L} and the Gini impurity GI(T_m^(l)) of its processed training data T_m^(l) ⊂ T^(l) were computed as

    P_m^(l)(c) = z_{T_m^(l)}(c) / Σ_{c'=1}^{Nc} z_{T_m^(l)}(c'),  c ∈ L,          (1)

    GI(T_m^(l)) = 1 − Σ_{c=1}^{Nc} (P_m^(l)(c))².                                 (2)

The probabilities {P_m^(l)(c)}_{c=1}^{Nc} were considered as the priors for every d_j^(l) ∈ D_m^(l) ⊂ D^(l) that reached m^(l). Accordingly, the weight of
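The node statistics of Eqs. (1) and (2) follow directly from the label distribution z saved at a decision node. A minimal Python sketch (the label counts below are hypothetical, not data from the paper):

```python
from collections import Counter

def node_priors_and_gini(label_counts):
    """Class priors, Eq. (1), and Gini impurity, Eq. (2), from the
    label distribution z(c) saved at a decision node."""
    total = sum(label_counts.values())
    priors = {c: z / total for c, z in label_counts.items()}     # Eq. (1)
    gini = 1.0 - sum(p ** 2 for p in priors.values())            # Eq. (2)
    return priors, gini

# Hypothetical counts for the three classes used later in the paper.
z = Counter({"VBs": 50, "IVDs": 30, "BG": 20})
priors, gini = node_priors_and_gini(z)
# priors: 0.5 / 0.3 / 0.2, so gini = 1 - (0.25 + 0.09 + 0.04) = 0.62
```

A pure node (all samples of one class) gives GI = 0, while a uniform distribution over Nc classes gives the maximum GI = 1 − 1/Nc, which is why the impurity difference can serve as a confidence-aware weighting in the edge terms below.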
e_pi,j^(l) ∈ E_p^(l), which connected v_pi^(l) ∈ V_p^(l) with v_sj^(l) ∈ V_s^(l), for 1 ≤ i ≤ Nc, 1 ≤ j ≤ |D^(l)|, was

    wp_i,j^(l) = P_m^(l)(i) · λ_p^(l),  l ≥ 1,                                    (3)

where 0 ≤ λ_p^(l) ≤ 1 was a hyperparameter defining the contribution of the priors in the energy function of the prior-based Random Walker algorithm [4] applied to every G_s^(l) and G_p^(l).

H. Edge Weights of {G_s^(l)}_{l=1}^{Nr}

Based on robust statistics [9], [10], boundaries between piecewise constant regions in an image can be considered as outliers. Thus, the better the outlier detection/removal is, the better the edge detection would be. In [10], a Tukey's biweight function was shown to be superior in detecting object boundaries in an image. This function employed an outlier-removing parameter σt and relied on the intensity gradient of the image [10]. Thus, to enhance the edge detections, we derived the edge weights of {G_s^(l)}_{l=1}^{Nr} and the energy terms of the HCRF using this function. However, we adapted it to our hierarchical feature-learning method by

1) defining resolution-specific outlier removers {σ_t^(l)}_{l=1}^{Nr},
2) using differences of the most discriminant features of the compared patches,
3) weighting the Tukey's biweight function by the relative impurities of the compared patches.

The most discriminant features and the relative impurities were determined by the trained random forest classifier. We computed σ_t^(l) using the median absolute deviation of the feature differences over the 26-neighborhoods of all the training fat-water samples {t_j^(l) ∈ T^(l)} at the l-th layer as

    σ_t^(l) = √5 · 1.4826 · median_{T^(l)} ‖F^(l) − F_0^(l)‖_F,                   (4)

where ‖.‖_F denoted the Frobenius norm, median_{T^(l)} was the median over all the samples in T^(l), and F^(l) was the matrix of feature differences given by

    F^(l)_{|T^(l)|×26×N_sig} = (d_i,j^(l))_{1≤i≤|T^(l)|, 1≤j≤26},                 (5a)

    d_i,j^(l) = f_sig,i^(l) − f_sig,ij^(l),  dim(d_i,j^(l)) = N_sig,              (5b)

with f_sig,ij^(l) being the feature vector of the sample whose patch was located at the j-th direction of the 26-neighborhood of the patch of t_i^(l) ∈ T^(l) with feature vector f_sig,i^(l). F_0^(l) was

    F_0^(l)_{|T^(l)|×26×N_sig} = M^(l)_{26×N_sig} · 1_{|T^(l)|},                  (6a)

    M^(l)_{26×N_sig} = median_{T^(l)} (F^(l)_{|T^(l)|×26×N_sig}).                 (6b)

The constant √5 · 1.4826 stemmed from the minimum outlier-reducing point of the Tukey's biweight function and a zero-mean normal distribution assumed for the features [10]. Accordingly, the weight of e_si,j^(l) ∈ E_s^(l), which connected v_si^(l) ∈ V_s^(l) with v_sj^(l) ∈ V_s^(l), for 1 ≤ i, j ≤ |D^(l)|, was defined as

    ws_i,j^(l) = (1 − (x/σ_t^(l))²)² · e^{−RIs_ij^(l)}  if |x| ≤ σ_t^(l),  and 0 otherwise,   (7)

where x = ‖f_sig,i^(l) − f_sig,j^(l)‖₁ / dim(f_sig,i^(l)); ‖.‖₁ was the L1-norm and RIs_ij^(l) was the relative impurity of d_i^(l) ∈ D^(l) and d_j^(l) ∈ D^(l), associated with v_si^(l) ∈ V_s^(l) and v_sj^(l) ∈ V_s^(l), respectively. By assuming that d_i^(l) ∈ D^(l) (d_j^(l) ∈ D^(l)) was processed by the decision node m^(l) (n^(l)) of the trained classifier and was thus assigned GI(T_m^(l)) (GI(T_n^(l))), we defined

    RIs_ij^(l) = |GI(T_m^(l)) − GI(T_n^(l))|,  l ≥ 1.                             (8)

I. Multiresolution Segmentation of a Test Fat-Water Image

Over each graph, made from G_s^(l) and G_p^(l), the Random Walker equations [3], [4] were solved to obtain the class posterior probabilities {P̂_i^(l)(c), c ∈ L} for the test samples at the l-th resolution layer. These estimates were with regard to the spatial consistencies of the patches' classifications and the priors obtained from applying the trained classifier to the test fat-water image. However, they did not take the hierarchical consistencies of the classifications into account.

J. The Hierarchical Conditional Random Field (HCRF)

The probabilities {P̂_i^(l)(c), c ∈ L}_{l=1}^{Nr}, obtained over the proposed image graph, were fused and regularized by a HCRF with regard to the hierarchical consistencies of the classifications. To this end, an undirected graph G_h = (V_h, E_h) with vertices V_h = V_s^(1) ∪ V_s^(2) ∪ ... ∪ V_s^(Nr) and edges E_h = {e_hi,j^(l,l−1) | s_i^(l) ∈ S^(l) is the parent of s_j^(l−1) ∈ S^(l−1)} was built. The edge e_hi,j^(l,l−1) ∈ E_h connected the vertex v_hi^(l) ∈ V_h, associated with d_i^(l) ∈ D^(l), with the vertex v_hj^(l−1) ∈ V_h, associated with d_j^(l−1) ∈ D^(l−1), if the patch s_j^(l−1) ∈ S^(l−1) of d_j^(l−1) ∈ D^(l−1) was one of the 8 patches that composed s_i^(l) ∈ S^(l) of d_i^(l) ∈ D^(l) according to the multiresolution image pyramid. The edge e_hi,j^(l,l−1) ∈ E_h had no weight. However, an energy term wh_i,j^(l,l−1) was associated with it. Assuming that d_i^(l) ∈ D^(l) (d_j^(l−1) ∈ D^(l−1)) was processed by the node m^(l) (n^(l−1)) of the classifier and was thus assigned GI(T_m^(l)) (GI(T_n^(l−1))), we defined

    RIh_ij^(l) = |GI(T_m^(l)) − GI(T_n^(l−1))|,  l ≥ 2.                           (9)

Accordingly, the energy term wh_i,j^(l,l−1), for 1 ≤ i ≤ |D^(l)| and 1 ≤ j ≤ |D^(l−1)|, was defined as

    wh_i,j^(l,l−1) = th_i,j^(l,l−1) · (1 − δ(ĉ_i^(l), ĉ_j^(l−1))),  l ≥ 2,        (10a)

    th_i,j^(l,l−1) = (1 − (x/σ′_t^(l))²)² · e^{−RIh_ij^(l)}  if |x| ≤ σ′_t^(l),  and 0 otherwise,   (10b)

where x = ‖f_sig,i^(l) − f_sig,j^(l−1)‖₁ / dim(f_sig,i^(l)); δ(.) was the Kronecker function; ĉ_i^(l) ∈ L and ĉ_j^(l−1) ∈ L were the estimated labels for d_i^(l) ∈ D^(l) and d_j^(l−1) ∈ D^(l−1), respectively, and σ′_t^(l) was defined as

    σ′_t^(l) = min(σ_t^(l), σ_t^(l−1)),  l ≥ 2.                                   (11)

From these, an energy function E(ĉ) was derived as

    E(ĉ) = Σ_{l=1}^{Nr} Σ_{i=1}^{|D^(l)|} −log(P̂_i^(l)(ĉ_i^(l))) + λ_h · Σ_{l=2}^{Nr} Σ_{e_hi,j^(l,l−1) ∈ E_h} wh_i,j^(l,l−1),   (12)
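The impurity-weighted Tukey biweight of Eqs. (7) and (8) can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation; the function name and the example feature vectors and impurities are hypothetical:

```python
import numpy as np

def spatial_edge_weight(f_i, f_j, sigma_t, gini_i, gini_j):
    """Spatial edge weight of Eq. (7): a Tukey biweight of the normalized
    L1 feature difference, damped by exp(-RIs) with the relative Gini
    impurity RIs of Eq. (8)."""
    x = np.abs(f_i - f_j).sum() / f_i.size    # ||f_i - f_j||_1 / dim(f_i)
    if x > sigma_t:
        return 0.0                            # treated as an outlier (boundary)
    ris = abs(gini_i - gini_j)                # Eq. (8)
    return float((1.0 - (x / sigma_t) ** 2) ** 2 * np.exp(-ris))

# Hypothetical most-discriminant feature vectors of two neighboring patches.
f_i = np.array([0.2, 0.4, 0.1, 0.3])
f_j = np.array([0.2, 0.5, 0.1, 0.3])
w = spatial_edge_weight(f_i, f_j, sigma_t=0.5, gini_i=0.62, gini_j=0.60)
# Similar features and similar impurities give a weight close to 1;
# a feature difference beyond sigma_t gives weight 0 (a detected edge).
```

Identical patches processed by nodes of equal impurity get the maximum weight 1, so the random walk diffuses freely inside homogeneous regions, while large feature differences cut the edge entirely.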
(l)
where ĉ = {ĉi ∈ L}N
l=1 were the estimated labels and λh ≥ 0
r Name Definition
was a hyperparameter. This function was initialized by Dice (%)
2|AU∩GT|
X |AU|+|GT|
X
(l) (l) min d(a, g) + min d(a, g)
ĉ0 = {ĉi0 = arg max P̂i (c)}N
l=1 .
r
(13) a∈δ(AU)
g∈δ(GT)
g∈δ(GT)
a∈δ(AU)
c∈L MSSD (mm) card(AU∪GT)

∗(l) HSD (mm) max max min d(a, g), max min d(a, g)
The optimum labels ĉ∗ = {ĉj ∈ L}N l=1 were the
r g∈δ(GT) a∈δ(AU) a∈δ(AU) g∈δ(GT)

minimizer of E(ĉ). This minimization was done by an iterative TABLE I: Quantitative metrics comparing the automatically (AU) and the
primal-dual algorithm [11]. This method offered an efficient manually (GT) segmented adipose tissues. d(a, g) was the Euclidean distance
between a and g and δ denoted the surface of the segmented volume. Dice
linear programming for minimizing above NP-hard problem compared the volumes. MSSD and HSD compared the surfaces.
by exploiting information not only from the original random
field but also from its dual [11]. The automatically segmented VBs and IVDs were evaluated
using the quantitative metrics described in Table I.
K. Evaluation

The proposed segmentation method was evaluated for volumetric simultaneous segmentation of 10 thoracic and 5 lumbar vertebral bodies (VBs) and intervertebral discs (IVDs) on fat-water MR images. To this end, volumetric fat-water images of 60 asymptomatic volunteers (38 men and 22 women) were acquired by a 3D Dixon-VIBE sequence with an isotropic spatial resolution of 1.7 mm on a clinical 3T MR scanner. Reference voxel-wise labels of all the acquired images were manually obtained using the Medical Imaging Interaction Toolkit (MITK), release 2015.05 [12]. This formed L = {VBs, IVDs, BG}, with BG being the background class.

1) Preprocessing of the MR Images: Intensity nonuniformities were reduced in all the fat-water images using [13]. To reduce the complexity of feature extraction, all image intensities were linearly normalized to the range [0, 128]. The 60 fat-water images were divided into 30 training and 30 test fat-water images. A multiresolution image pyramid of N_r = 6 levels was built from every fat-water image. Based on this pyramid, the multiresolution training data T = {T^(l)}_{l=1}^{N_r} was formed from the training images and their reference labels. Also, the multiresolution test data D = {D^(l)}_{l=1}^{N_r} was formed from the test images.

2) Evaluation of the Proposed Method: First, the hierarchical classifier was trained using the multiresolution training data. This classifier had N_t = 40 decision trees with a maximum depth of D_t = 12 and layer-wise regularization parameters of λ_ld^(6) = 10, λ_ld^(5) = 10, λ_ld^(4) = 1, λ_ld^(3) = 10^{-1}, λ_ld^(2) = 10^{-2}, λ_ld^(1) = 10^{-3} for its penalized discriminants. Then the multiresolution test data of every test fat-water image was processed by this classifier to get {P_m^(l)(c), c ∈ L}_{l=1}^{N_r} and {z_{T_m^(l)}(c), c ∈ L}_{l=1}^{N_r} affected to its samples.

From the most discriminant features of the test data and the affected {P_m^(l)(c), c ∈ L}_{l=1}^{N_r} and {z_{T_m^(l)}(c), c ∈ L}_{l=1}^{N_r}, the proposed graph G was built with λ_p^(6) = 10^{-4}, λ_p^(5) = 10^{-3}, λ_p^(4) = 10^{-3}, λ_p^(3) = 10^{-2}, λ_p^(2) = 10^{-1}, λ_p^(1) = 1. Over this graph, the Random Walker equations [3], [4] were solved to obtain the probabilities {P̂_i^(l)(c), c ∈ L}_{l=1}^{N_r} for the test samples. Using these probabilities, the most discriminant features of the test data, and their affected {z_{T_m^(l)}(c), c ∈ L}_{l=1}^{N_r}, the energy function of the HCRF was derived with λ_h = 10. This function was minimized to obtain the dominant class labels of the test samples at all resolutions. The voxel-wise labels determined the masks of the automatically segmented VBs and IVDs for each test fat-water image.

L. Results

The automatically segmented VBs on the 30 test fat-water images achieved a Dice coefficient (Dice) = 93.2±1.8%, mean symmetric surface distance (MSSD) = 0.61±0.15 mm, and Hausdorff distance (HSD) = 3.37±0.9 mm. For the IVDs, it achieved Dice = 92.5±1.9%, MSSD = 1.02±0.3 mm, and HSD = 4.1±0.8 mm. These results compared favorably to the state-of-the-art for segmenting VBs/IVDs on MR images [2]. Fig. 3 shows the masks of the automatically segmented VBs and IVDs on mid-sagittal slices of 3 test images. Fig. 4 shows the box plots of the quantitative metrics, described in Table I, for the automatically segmented masks in comparison with the reference (manually segmented) masks.

III. DISCUSSION

In this paper, we proposed a hierarchical deformation-/registration-free method for multilabel segmentation of fat-water MR images. This method expanded on the prior-based Random Walker algorithm [3], [4] by proposing a novel multiresolution graph for incorporating spatial (inter-patch) and aspatial (intra-patch) information into the segmentation. It also enabled a low-complexity registration-/deformation-free localization of the targeted objects simultaneously with the segmentations. Additionally, the present work proposed a novel approach for deriving the edge weights of the proposed graph as well as the energy terms of the HCRF in order to incorporate a feature-learning method into a graph-based segmentation. To this end, the probabilistic estimates, the most discriminant features, and the label distributions determined by a hierarchical random forest classifier were used. Regarding the likelihood notion of the edge weights of a Random Walker graph, the proposed edge weighting mechanism could enhance the accuracy of the segmentations in comparison with the state-of-the-art [2]. Furthermore, the used contextual features not only allowed a fast localization but also measured the distribution of fatty/lean tissues around a lean/fat image patch. This was neglected in the previous methods [14].

IV. CONCLUSION

The proposed method can be easily extended to segment multichannel MR or multimodal images. Its future perspective would be the use of a constrained Random Walker algorithm [15] and advanced feature-learning methods for enhanced edge detection in challenging segmentation tasks.

ACKNOWLEDGMENT

The MR images of this study were acquired under financial support from the German Research Foundation (DFG) under grant number BA 4233/4-1.
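Dice, MSSD, and HSD as used in Table I can be computed directly from segmentation masks. The sketch below uses the standard definitions of these metrics (brute-force surface distances, suitable only for small toy inputs; the surface point sets are illustrative):

```python
import numpy as np

def dice(au, gt):
    """Volume overlap: 2*|AU & GT| / (|AU| + |GT|)."""
    inter = np.logical_and(au, gt).sum()
    return 2.0 * inter / (au.sum() + gt.sum())

def nearest_distances(a_pts, b_pts):
    # For every point of a_pts, distance to the closest point of b_pts.
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=2)
    return d.min(axis=1)

def mssd_hsd(au_surf, gt_surf):
    """Mean symmetric surface distance and Hausdorff distance."""
    d_ag = nearest_distances(au_surf, gt_surf)
    d_ga = nearest_distances(gt_surf, au_surf)
    mssd = (d_ag.sum() + d_ga.sum()) / (len(d_ag) + len(d_ga))
    hsd = max(d_ag.max(), d_ga.max())
    return mssd, hsd

# Toy "surfaces": two unit squares shifted by one voxel along x.
au_surf = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
gt_surf = au_surf + np.array([1.0, 0.0])
mssd, hsd = mssd_hsd(au_surf, gt_surf)
print(mssd, hsd)
```

For the shifted squares, half of the surface points coincide and half are one voxel apart, giving MSSD = 0.5 and HSD = 1.0.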
Fig. 3: Masks of the automatically segmented VBs and IVDs on mid-sagittal slices of the corresponding fat-water images.
Fig. 4: Box plots of the quantitative metrics comparing the automatically segmented VBs and IVDs with their manually segmented references. These
segmentations were done for 10 thoracic (T3–T12) and 5 lumbar (L1–L5) VBs and their IVDs.
REFERENCES

[1] F. Fallah, D. M. Tsanev, B. Yang, S. Walter, and F. Bamberg, "A novel objective function based on a generalized Kelly criterion for deep learning," in Proc IEEE Conf Signal Process Algorithms Archit Arrange Appl, Sept 2017, pp. 84-89.
[2] F. Fallah, B. Yang, S. S. Walter, and F. Bamberg, "A hierarchical ensemble classifier for multilabel segmentation of fat-water MR images," in Proc Eur Signal Process Conf, Sep. 2018.
[3] L. Grady, "Random walks for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1768-1783, 2006.
[4] L. Grady, "Multilabel random walker image segmentation using prior models," in Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit, 2005, pp. 763-770.
[5] V. Zografos, A. Valentinitsch, M. Rempfler, F. Tombari, and B. Menze, "Hierarchical multi-organ segmentation without registration in 3D abdominal CT images," in Proc Med Comput Vis, 2016, pp. 37-46.
[6] F. Han, H. Wang, G. Zhang, H. Han, B. Song, L. Li, W. Moore, H. Lu, H. Zhao, and Z. Liang, "Texture feature analysis for computer-aided diagnosis on pulmonary nodules," J Digit Imaging, vol. 28, no. 1, pp. 99-115, 2015.
[7] B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht, "On oblique random forests," in Proc Mach Learn Knowl Discov Databases, 2011, pp. 453-469.
[8] L. Grady and M.-P. Jolly, "Weights and topology: A study of the effects of graph construction on 3D image segmentation," in Proc Med Image Comput Comput Assist Interv, 2008, pp. 153-161.
[9] P. Rousseeuw and A. Leroy, Robust Regression and Outlier Detection, ser. Wiley Series in Probability and Mathematical Statistics. Wiley, 1987.
[10] M. J. Black, G. Sapiro, D. H. Marimont, and D. Heeger, "Robust anisotropic diffusion," IEEE Trans. Image Process., vol. 7, no. 3, pp. 421-432, 1998.
[11] N. Komodakis, G. Tziritas, and N. Paragios, "Performance vs computational efficiency for optimizing single and dynamic MRFs: Setting the state of the art with primal-dual strategies," Comput Vis Image Underst, vol. 112, no. 1, pp. 14-29, 2008, http://www.csd.uoc.gr/~komod/FastPD/.
[12] I. Wolf, M. Vetter, I. Wegner, T. Böttger, M. Nolden, M. Schöbinger, M. Hastenteufel, T. Kunert, and H. P. Meinzer, "The medical imaging interaction toolkit," Med Image Anal, vol. 9, no. 6, pp. 594-604, 2005.
[13] F. Fallah, J. Machann, P. Martirosian, F. Bamberg, F. Schick, and B. Yang, "Comparison of T1-weighted 2D TSE, 3D SPGR, and two-point 3D Dixon MRI for automated segmentation of visceral adipose tissue at 3 Tesla," Magn Reson Mater Phy, vol. 30, no. 2, pp. 139-151, 2017.
[14] Y. Zheng, D. Ai, P. Zhang, Y. Gao, L. Xia, S. Du, X. Sang, and J. Yang, "Feature learning based Random Walk for liver segmentation," PLOS ONE, vol. 11, no. 11, pp. 1-17, 2016.
[15] F. Fallah, B. Yang, and F. Bamberg, "Automatic atlas-guided constrained random walker algorithm for 3D segmentation of muscles on water magnetic resonance images," in Proc Eur Signal Process Conf, 2017, pp. 251-255.
Object Detection utilizing Modified Auto Encoder and Convolutional Neural Networks

Jalil Nourmohammadi-Khiarak (1,*), Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran, J.nourmohammadi92@ms.tabrizu.ac.ir
Samaneh Mazaheri (2), Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM), Selangor, Malaysia, samaneh.mazaheri@gmail.com
Rohollah Moosavi-Tayebi (3), Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM), Selangor, Malaysia
Hamid Noorbakhsh-Devlagh (4), Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran, h.noorbakhsh@eng.ui.ac.ir

Abstract— Deep learning models are widely used in the object detection area, including combinations of multiple non-linear data transformations. The objective is receiving brief and concise information for feature representations. Due to the high volume of processing data, object detection in videos has faced big challenges, such as massive calculation. To increase the object detection precision in videos, a hybrid method is proposed in this paper. Some modifications are applied to auto encoder neural networks for the compact and discriminative learning of object features. Furthermore, for object classification, the extracted features are first transferred to a convolutional neural network, and after feature convolution with the input pictures, they are classified. The proposed method has two main advantages over other unsupervised feature learning techniques. Firstly, as will be shown, features are detected with a much higher precision. Secondly, in the proposed method, the outcome is compact and additional unnecessary information is removed, while the existing unsupervised feature learning models mainly learn repeated and redundant information of the features. Experimental evaluation shows that the precision of feature detection improved by 1.5% on average in comparison with the state-of-the-art methods.

Keywords— Deep learning, Object Detection, Classification, Unsupervised Feature Learning.

I. INTRODUCTION

Object detection is one of the most significant applications in computer vision [1-4]. It is a difficult task, due to the remarkable amount of variation between images which belong to the same object category. Other factors, such as adverse viewpoint and scale, illumination, partial occlusions, and multiple instances complicate the object detection process further. In fact, it can be defined as the problem of finding the positions of all concerned objects in an image. More specifically, the goal is to find the bounding box for each object [5]. One common approach is to use a sliding window to scan the image exhaustively in scale-space and classify every window individually.

Most detectors are designed for static images, as they are trained from a large collection of labeled examples. When images are taken from a video frame in different conditions, the detector's performance will decline abruptly. Due to the wide variety of different environments, a general classifier that has been trained on a large dataset is desirable in a specific experimental environment. In general, creating a detector based on the appearance is time-consuming and difficult, because a large number of training samples should be collected and manually designed for different variations in the appearance of the object. Therefore, how to adapt a public detector to different visual conditions in a video is a challenging problem which needs more attention.

A lot of research [6-9] has been reported with the aim of improving object detection in video frames. Several authors [10-14] proposed to simultaneously improve detection by tracking. With such detectors, the detection precision declines greatly if noisy detections are used directly as initial values of the trackers.

In this paper, a method has been introduced to improve the accuracy of detection results in a video in offline mode, without routing information or using annotation from the video. To this aim, firstly, the negative and positive samples are extracted using a primary detector on the video frames. All identified visual samples will be used to improve the detector for different video conditions. Since feature selection plays an important role in detection, the classical hand-made features such as HOG (Histogram of Oriented Gradients) and SIFT (Scale Invariant Feature Transform) may not be fit enough for any kind of video. In a particular video, a method to display objects shares some of the same properties that can be used to separate them from non-objects. Consequently, unlike other suggested methods which are based on using hand-made features, the proposed technique extracts good features directly from the raw pixels of the video. A modified auto encoder is utilized in the feature learning part, which tries to extract features with informative data by learning the raw pixels' features in a better way; the most important issues in object detection are feature learning and classification.

This paper is organized as follows. Section 2 discusses the related works; Section 3 includes the details of the proposed method; in Section 4, an experimental evaluation is presented to compare the proposed method's performance with state-of-the-art related works. Finally, the conclusion is given in Section 5.
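The sliding-window scan mentioned in the introduction can be sketched in a few lines; the window size, stride, and mean-intensity "classifier" below are illustrative stand-ins, not the paper's detector:

```python
import numpy as np

def sliding_windows(img, win=(64, 32), step=16):
    """Yield (y, x, patch) for every window position over a 2D image."""
    H, W = img.shape
    wh, ww = win
    for y in range(0, H - wh + 1, step):
        for x in range(0, W - ww + 1, step):
            yield y, x, img[y:y + wh, x:x + ww]

# Toy image with one bright 64x32 "object"; score windows by mean intensity.
img = np.zeros((128, 64))
img[32:96, 16:48] = 1.0
scores = [(p.mean(), y, x) for y, x, p in sliding_windows(img)]
best = max(scores)  # highest-scoring window
print(best[1], best[2])  # prints: 32 16
```

In a real detector, the per-window score would come from a trained classifier, and the scan would be repeated over an image pyramid to cover scale-space.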
II. RELATED WORK

Classifying objects and selecting the appropriate box band are two fundamental challenges of any object detection method, which have attracted many researchers to this field [15]. Yuan et al. proposed a discriminative learning method for multi-view object detection [3]. A decision tree related to the leaf node is created in the discriminative patch training for every view with the Hough method, which has high distinction. As a result, a stable spatial distribution will be detected in every single tree. Afterwards, the discriminative patches are connected in different views to spread the correlation between two neighboring views. For the final detection, mean shift prediction of the sample is applied. Another object detection method is conducted for robotic applications, as Lu et al. proposed an efficient deep network for vision-based object detection [2]. Using a camera mounted on a robot to take pictures, first, a layer which includes online convolution and offline optimization is used for efficient production of object boundary boxes. Then, a detection layer is used, including a convolutional neural network utilizing a genetic algorithm with multiple populations and a multi-frame fusion method based on Tracking-Learning-Detection (TLD). To select the appropriate box band for the objects in a video, Zhong proposed a class-specific semantic approach to rank the recommended objects [4]. Specifically, it extracts features for each recommendation, including semantic segmentation, stereo, and field detection based on a Convolutional Neural Network (CNN), to grade them using trained class-specific weights and a Support Vector Machine (SVM).

Many researchers have proposed different algorithms for object detection by analyzing and mimicking the processes of the human brain. One of the related works is detection by tracking [16, 17], which uses the trajectory information to help improve detection results. In addition, improved detection can be used backward to enhance tracking. Recently, with the newly developed learning blocks based on artificial neural networks, the bio-inspired methods have been explored extensively and have achieved significant success in many competitions in the areas of speech [18] and object recognition [19].

Several works [20-27] have studied extracting useful features from images and videos, such as SIFT [23] and HOG [20]. They are designed to be scale- or rotation-invariant by quantizing the edges into histogram bins. Most of the proposed feature-learning or deep-learning methods have been demonstrated and implemented on 1D audio-based speech recognition [18] or 2D static image-based object recognition. Not many works have been done to solve the problems based on 3D video. As one of the reasons, since the input of a learning algorithm is usually the flattened image, when an image becomes larger, the input dimension will increase with the square of the image size (e.g., 1600 for a 40x40 image). So, the number of weights which need to be learned in the network will also increase.

This method maintains the analysis of features learned from the first-level input data. Duo provides a DPCA-SP2, which is an optimum projection matrix with diversity maximization of the Schatten p-norm based on the classes in the low-dimensional feature space, to extract the features of the picture. Using the Schatten p-norm and the label information of the training samples, DPCA-SP2 cannot extract the distinct features efficiently. In this regard, feature extraction methods are being used in face recognition applications. Yu et al. presented a feature extraction method in a multi-layer distinction in the high lighting [25]. First, large-scale features are transformed to several layers of high-lighting features in small scales as a non-linear combination, and the weight related to every layer is set as its importance and efficiency. Another method, a result from Yoo et al., provides feature extraction in a high-dimensional feature space in order to improve the distinction capability for face recognition [24]. In order to reduce the computational complexity, since the structure of the local information of a local dimension reduction is supervised, the distinction analysis of the symmetric linear discriminant is applied to the high-dimensional feature vector. In addition, previous works benefit from feature-learning or deep learning methods, including automatic extraction [29], fusion system development [30], Wavelet enhancement [31], component categorization [32, 33], and some surveys regarding segmentation and registration approaches [34, 35, 36].

In [37], Firat et al. randomly sampled patches from remote sensing images to train a single-layer sparse auto encoder, so that it can be used to learn the most efficient representation for the dataset. These representations appear to be Gabor filters in various orientations and parameters, color co-occurrence and color filters, and edge-detection filters. In [38], Garcia et al. defined a novel face detection approach based on a convolutional neural architecture to robustly detect highly variable face patterns, rotated up to ±20 degrees in the image plane and turned up to ±60 degrees, in complex real-world images.

Most of the methods for object detection are mainly designed and implemented based on a deep neural network. In [9], a multi-stage method called Regions with Convolutional Neural Networks (R-CNN) has been proposed for deep CNN training to categorize proposal regions for object detection. Object detection proceeds in several stages, including pre-training the CNN, CNN fine-tuning, SVM training, bounding box proposal, and bounding box regression. In [39], GoogLeNet was proposed as a 22-layer structure with the "inception" module to replace the CNN in R-CNN. Furthermore, to address the issue of accelerating the training of R-CNN, Fast R-CNN [6] was proposed. In Faster R-CNN [7], the proposed bounding box is generated by a region proposal network (RPN), and therefore the entire framework can be trained as a whole. However, these methods are used to detect objects in static images. If the mentioned methods are applied to videos, they may lose some positive samples, because the objects in each frame of the video may not be in their best position.

III. PROPOSED METHOD

In this research, a novel feature learning method is proposed using a deep neural network based on a sparse auto-encoder for object detection in videos. The proposed algorithm is described in two levels, in which the algorithm first learns a number of good and compressed features from the raw data. Afterwards, it uses the learned features in the classification and detection level for the purpose of detection.

Fig 1: Low-level schema of the proposed algorithm


A. Feature learning low level

The feature learning low level consists of four stages, which can be seen in Fig. 1.

First step, pre-processing: In the pre-processing step, a ZCA Whitening (Zero Components Analysis) algorithm is used to reduce the correlation of the data, as shown in part (a) of Fig. 1. The principal component analysis (PCA) technique has been used to reduce the data dimension. There is a closely related pre-processing step called whitening (or, in some other literature, sphering) which is crucial for some algorithms. On training images, the raw input is redundant, since adjacent pixel values are highly correlated. The goal of whitening is to make the input less redundant. More formally, it is desired that the learning algorithms consider a training input where:
(i) the features are less correlated with each other;
(ii) the features all have the same variance.
Concretely, if R is any orthogonal matrix satisfying R Rᵀ = Rᵀ R = I (less formally, if R is a rotation/reflection matrix), then R x_PCAwhite will also have identity covariance. In ZCA whitening, R = U is chosen, and also:

x_ZCAwhite = U x_PCAwhite    (1)

ZCA whitening is a form of data pre-processing technique that maps from x to x_ZCAwhite. This is also a rough model of how the biological eye (the retina) processes images. Specifically, as the human eye perceives images, most adjacent "pixels" in the eye will perceive very similar values, since adjacent parts of an image tend to be highly correlated in intensity. It is thus wasteful for the eye to transmit every pixel separately (via the optic nerve) to the brain. Instead, the retina performs a de-correlation operation which is similar to that performed by ZCA. This results in a less redundant representation of the input image, which is then transmitted to the brain. In the proposed algorithm, to reduce the pixel correlation, the ZCA Whitening algorithm is used to make the patch images available for better learning in the next phase.

Second step, feature learning using a modified auto encoder: An auto encoder neural network [40] is an unsupervised learning algorithm that applies back-propagation, setting the target values to be equal to the inputs, i.e., it uses y^(i) = x^(i). The algorithm is shown in Fig. 2 or part (b) of Fig. 1. The auto encoder tries to learn a function h_{W,b}(x) ≈ x; in other words, it tries to learn an approximation to the identity function, so as to output an x̂ that is similar to x.

Fig. 2. A model of Auto Encoder

The identity function seems to be a trivial function to try to learn, but by placing constraints on the network, such as limiting the number of hidden units, an interesting structure about the data can be discovered. In this research, to learn features from patch images, the auto-encoders try to reconstruct the data by minimizing the following loss function:

E_SAE = Σ_{i=1}^{N} ( | x^(i) − W₂ s(W₁ x^(i) + b₁) − b₂ |² + Z(W₁ x^(i) + b₁) )    (2)

where W₁ ∈ R^{N₁×D} denotes a weight matrix by which the visible nodes are mapped to hidden nodes, b₁ ∈ R^{N₁} is a hidden bias vector, and s(x) = 1/(1 + exp(−x)) represents a non-linear sigmoid function. In addition, the visible node is reconstructed from the hidden node by a weight matrix W₂ ∈ R^{N₁×D}. The input bias vector is denoted by b₂ ∈ R^{N₁}. Z is a regularization function. Afterwards, the loss function for the next level is defined as follows:

E_Gen = Σ_{i=1}^{N} | x^(i) − Wᵀ h^(i) |₂² + λ Σ_{i=1}^{N} || V (h^(i))² ||₁    (3)

where the data samples are represented by the index i. Square
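Equation (1) composes PCA whitening with the rotation back by U. A minimal NumPy sketch via the eigendecomposition of the data covariance (the regularizer `eps` and the synthetic data are illustrative, not the paper's settings):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X: x_ZCAwhite = U diag(1/sqrt(s+eps)) U^T x."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]              # sample covariance
    s, U = np.linalg.eigh(cov)                 # cov = U diag(s) U^T
    W = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T
    return Xc @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[:, 0] = X[:, 0] + 0.9 * X[:, 1]              # introduce pixel-like correlation
Xw = zca_whiten(X)
print(np.allclose(Xw.T @ Xw / Xw.shape[0], np.eye(8), atol=1e-2))  # prints True
```

Unlike pure PCA whitening, the extra multiplication by U keeps the whitened data as close as possible to the original pixel space, which is why ZCA is preferred for image patches.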
and square-root operations in (3) are element-wise. Part (c) of Fig. 1 shows a subspace-pooling matrix with groups of size two, which is denoted by V in (3). The value of V is constant and is chosen as 0.5 based on the experimental rounds.

Another objective function is added in order to incorporate the feature information obtained from (3) into learning the features. Reconstructing the input x and a classifier predicting the label c from the representation p can be used to learn the filters W. Between the actual label c ∈ {0, 1}^K and the predicted label c' ∈ {0, 1}^K, an average classification loss is computed by a discriminative objective function. The loss function is considered as a performance measure; moreover, an optimization problem is stated as:

E_Dis = Σ_{j=1}^{N_I} (1/N_p) Σ_{t=1}^{N_p} || softmax(Tᵀ p_t^(j)) − c^(j) ||₁    (4)

where:

softmax(a)_k = exp(a_k) / Σ_{k'} exp(a_{k'}),   k = 1, 2, …, K    (5)

Here c represents a binary vector which is permitted to have one element equal to 1 by the softmax unit, and T denotes the learned classifier weights. As the object and non-object may have similar local patches, the label c^(i) of x^(i) can be hard to obtain when the input x^(i) consists of local patches. The discriminative property may be retained by applying the loss function at the image level rather than at each local patch.

B. Feature learning high levels

High-level feature learning to detect objects includes two stages: a Convolutional Neural Network and Max-Pooling. At the first level, confidently labeled images are sampled and used to learn the features W. Local edges or color information are obtained by each feature in the first level. Nevertheless, it is expected to learn complex features by which the conjunction of edges may be captured. A CNN architecture [28], which uses auto encoders as sub-units, is employed to learn the higher-level features. Afterwards, max pooling is carried out over a certain neighborhood. Thus, local patches from locally-invariant multidimensional feature maps may be extracted and fed to another level implemented by auto encoders. The overall structure of a CNN with Max-Pooling is shown in Fig. 3, and a CNN with this hierarchical structure is used in this paper. Weight sharing can significantly reduce the number of free parameters of the CNN and thus increase the generalization of the training. Using a smaller trainable and segmented network to solve a complex problem provides a scalable architecture to implement a large network. The proposed structure reduces the training time and the number of trainable parameters while, at the same time, it increases the accuracy. Each convolutional filter of a CNN produces a matrix of hidden variables. The size of this matrix is often reduced using some form of pooling.

C. Feature Classification

Once the features have been fully represented, the obtained features are given to an RBF-SVM classifier algorithm to learn a safe margin. Non-linear support vector machines create a way for non-linear classification by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high dimensional; thus, though the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space. The RBF-SVM classifier used in this paper yields better results.

Fig 3. Structure of a CNN with Max-Pooling in the proposed method
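The softmax of (5) and an L1 classification loss in the spirit of (4) can be written compactly. The shapes and random values below are illustrative (N_p = 6 patches, K = 3 classes), not the paper's configuration:

```python
import numpy as np

def softmax(a):
    """Eq. (5): softmax over the last axis, shifted for numerical stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def discriminative_loss(P, T, c):
    """Eq. (4)-style loss for one image: mean L1 gap between softmax(T^T p)
    and the one-hot label c, averaged over the image's N_p patches."""
    logits = P @ T                 # (N_p, K): classifier response per patch
    pred = softmax(logits)
    return np.abs(pred - c).sum(axis=1).mean()

rng = np.random.default_rng(1)
P = rng.normal(size=(6, 10))       # 6 patch representations, 10-dimensional
T = rng.normal(size=(10, 3))       # learned classifier weights, K = 3
c = np.array([1.0, 0.0, 0.0])      # one-hot image-level label
print(discriminative_loss(P, T, c))
```

Applying the loss per image (averaging over its patches) rather than per patch is what preserves the discriminative property when individual patches are ambiguous.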

IV. EXPERIMENTAL EVALUATION

Three benchmark datasets are used for evaluation of the proposed method: the PNNL Parking Lot datasets [41], the INRIA Dataset [20], and TownCenter [42] for human detection. The datasets were annotated manually. In all the sequences, only the detector scores are used; the tracking results or any annotation from the video are not considered. More human objects are considered in the experiments, which results in a higher precision-recall curve. Detection is classified into two groups: first with positive samples and then with negative samples. All the samples are resized to 128*64 pixels. Three feature levels are learnt to represent the images. 8*8 pixel-wise patches from the input image are employed to learn the first two levels.

At each level, the quantity of filters is assumed NumberFeature = 400; moreover, the subspace size is considered to be 16. As a result, at each level, the quantity of feature maps is 100. The mixture of the three levels in Fig. 3 is the representation of the final image. On the learned image representation, an RBF-SVM classifier is trained using the positive and negative samples. The human detection performance is used to test the proposed method. To select the value of V for the low-level feature learning part, different values were tried and V = 0.5 showed the best average precision. Fig. 4 shows the average precision over the different runs.

Table 1. The average precision of various methods for the PNNL Parking Lot dataset

  Methods          | PNNL Parking Lot dataset
  Reference [15]   | 86.4
  HOG [20]         | 92.9
  Reference [43]   | 96.3
  Proposed method  | 97.1

Fig 4. The average precision of the proposed method over different V values

Fig. 5 shows the precision-recall curve of the human detection in the ParkingLot datasets. The red curve shows the standard detector [15] results, the blue curve shows [43], and the green curve shows the proposed results from the implementation on the PNNL Parking Lot dataset. The average precision values are listed in Table 1. As an advantage of the proposed method, it learns features in a compact way from the data itself, rather than using a mixture of hand-designed features.

Fig. 5. The precision-recall curve of three human detection ParkingLot datasets

Results on the INRIA Dataset are discussed in the following. To evaluate the proposed technique, the INRIA Person dataset is utilized. 1000 photos of the test datasets are assigned randomly. The INRIA Person Dataset contains 1774 positive and 1671 negative samples. All pedestrian samples are 128*64 pixels in size. The dataset is divided into two categories, training and testing. In this research, as in other works, the number of samples is set to 1,000, and the average precision is reported in Table 2.

Table 2. The average precision of various methods for the INRIA Dataset

  Methods          | INRIA Dataset
  Reference [44]   | 88.2
  Reference [5]    | 84.2
  Proposed method  | 91.4
Fig. 6. The precision-recall curves of three human detection methods on the TownCenter dataset.

Table 3. The average precision of various methods for the TownCenter dataset.

    Methods            TownCenter dataset
    Reference [15]     86.9
    HOG [20]           92.1
    Reference [43]     96.7
    Proposed method    97.5

The recall-precision curves are shown in Fig. 9. The red curve shows the standard detector [15], the blue curve represents the reference method [43], and the green curve shows the proposed method. Reviewing the recall-precision results, it can be seen that on the TownCenter dataset, when precision is low, recall drops as well, since only a few non-objects are detected as objects, as in the PNNL dataset. These results were obtained by varying the λ parameter and the radial basis function parameter of the RBF-SVM classifier: for the TownCenter dataset, λ was sampled randomly between roughly 0.002 and 0.07, and the radial basis function parameter from 1 to 5 in steps of 0.30. The mean precision on the TownCenter dataset is reported in Table 3, in which the proposed method is compared with the hand-designed HOG feature and some other methods.

V. CONCLUSION
In this paper, the problem of object detection in video is studied. Because of the lack of better learned features for video analysis in object detection problems, a hierarchical representation method using deep learning is presented. The proposed method consists of two levels, low and high. In the low-level part, feature hierarchies are learned directly from the raw pixels; in the high-level part, the extracted features are applied to a CNN for object detection. An RBF-SVM is used in the classification phase. The goal of the proposed method is to learn features in a hierarchical fashion. Learning the features automatically and hierarchically results in a system with a complex function that maps inputs to output data directly, independently of hand-designed features. The results of extensive experiments demonstrate the effectiveness of the proposed approach. Future work will investigate more exact feature extraction algorithms such as ResNet.

References

[1] X. Li, M. Ye, Y. Liu, F. Zhang, D. Liu, and S. Tang, "Accurate object detection using memory-based models in surveillance scenes," Pattern Recognition, vol. 67, pp. 73-84, 2017.
[2] K. Lu, X. An, J. Li, and H. He, "Efficient deep network for vision-based object detection in robotic applications," Neurocomputing, vol. 245, pp. 31-45, 2017.
[3] Z. Yuan, T. Lu, and C. L. Tan, "Learning discriminated and correlated patches for multi-view object detection using sparse coding," Pattern Recognition, vol. 69, pp. 26-38, 2017.
[4] Z. Zhong, M. Lei, D. Cao, J. Fan, and S. Li, "Class-specific object proposals re-ranking for object detection in automatic driving," Neurocomputing, vol. 242, pp. 187-194, 2017.
[5] W. Fang, J. Chen, C. Liang, X. Wang, Y. Nan, and R. Hu, "Object detection in low-resolution image via sparse representation," in International Conference on Multimedia Modeling, Springer, 2015, pp. 234-245.
[6] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[7] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[8] S. Gidaris and N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware CNN model," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1134-1142.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[10] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Robust tracking-by-detection using a detector confidence particle filter," in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 1515-1522.
[11] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1-8.
[12] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Online multiperson tracking-by-detection from a single, uncalibrated camera," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1820-1833, 2011.
[13] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," Video-Based Surveillance Systems, vol. 1, pp. 135-144, 2002.
[14] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in European Conference on Computer Vision, Springer, 2012, pp. 702-715.
[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.
[16] P. Chiranjeevi and S. Sengupta, "New fuzzy texture features for robust detection of moving objects," IEEE Signal Processing Letters, vol. 19, no. 10, pp. 603-606, 2012.
[17] G. Lee, R. Mallipeddi, G.-J. Jang, and M. Lee, "A genetic algorithm-based moving object detection for real-time traffic surveillance," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1619-1622, 2015.
[18] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.
[19] J. Bai and Y. Wu, "SAE-RNN deep learning for RGB-D based object recognition," in International Conference on Intelligent Computing, Springer, 2014, pp. 235-240.
[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886-893.
[21] H. Du, Z. Zhao, S. Wang, and Q. Hu, "Two-dimensional discriminant analysis based on Schatten p-norm for image feature extraction," Journal of Visual Communication and Image Representation, vol. 45, pp. 87-94, 2017.
[22] Z. Fan, D. Bi, L. He, M. Shiping, S. Gao, and C. Li, "Low-level structure feature extraction for image processing via stacked sparse denoising autoencoder," Neurocomputing, vol. 243, pp. 12-20, 2017.
[23] F.-C. Huang, S.-Y. Huang, J.-W. Ker, and Y.-C. Chen, "High-performance SIFT hardware accelerator for real-time image feature extraction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 3, pp. 340-351, 2012.
[24] C.-H. Yoo, S.-W. Kim, J.-Y. Jung, and S.-J. Ko, "High-dimensional feature extraction using bit-plane decomposition of local binary patterns for robust face recognition," Journal of Visual Communication and Image Representation, vol. 45, pp. 11-19, 2017.
[25] Y.-F. Yu, D.-Q. Dai, C.-X. Ren, and K.-K. Huang, "Discriminative multi-layer illumination-robust feature extraction for face recognition," Pattern Recognition, vol. 67, pp. 201-212, 2017.
[26] D. Bibicu, L. Moraru, and A. Biswas, "Thyroid nodule recognition based on feature selection and pixel classification methods," Journal of Digital Imaging, vol. 26, no. 1, pp. 119-128, 2013.
[27] J. Sachdeva, V. Kumar, I. Gupta, N. Khandelwal, and C. K. Ahuja, "Segmentation, feature extraction, and multiclass brain tumor classification," Journal of Digital Imaging, vol. 26, no. 6, pp. 1141-1150, 2013.
[28] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Advances in Neural Information Processing Systems, 2013, pp. 2553-2561.
[29] R. M. Tayebi et al., "A fast and accurate method for automatic coronary arterial tree extraction in angiograms," Journal of Computer Science, vol. 10, no. 10, p. 2060, 2014.
[30] S. Mazaheri, P. S. Sulaiman, R. Wirza, M. Z. Dimon, F. Khalid, and R. Moosavi Tayebi, "Hybrid pixel-based method for cardiac ultrasound fusion based on integration of PCA and DWT," Computational and Mathematical Methods in Medicine, vol. 2015, 2015.
[31] R. Moosavi Tayebi, R. Wirza, P. Suhaiza Binti Sulaiman, M. Zamrin Dimon, F. Khalid, and S. Mazaheri, "Using wavelet for X-ray angiography enhancement," in Proceedings of the International Conference on Agricultural, Ecological and Medical Sciences (AEMS), 2015.
[32] R. Moosavi Tayebi et al., "Cardiac components categorization and coronary artery enhancement in CT angiography," in Proceedings of the International Conference on Computer Assisted System in Health (CASH), 2014.
[33] R. M. Tayebi et al., "3D multimodal cardiac data reconstruction using angiography and computerized tomographic angiography registration," Journal of Cardiothoracic Surgery, vol. 10, no. 1, p. 58, 2015.
[34] S. Mazaheri et al., "Echocardiography image segmentation: A survey," in 2013 International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2013, pp. 327-332.
[35] R. M. Tayebi et al., "Coronary artery segmentation in angiograms with pattern recognition techniques: a survey," in 2013 International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2013, pp. 321-326.
[36] S. Mazaheri, P. S. B. Sulaiman, R. Wirza, M. Z. Dimon, F. Khalid, and R. M. Tayebi, "A review of ultrasound and computed tomography registration approaches," in 2014 International Conference on Computer Assisted System in Health (CASH), 2014, pp. 6-11.
[37] O. Fırat and F. T. Y. Vural, "Representation learning with convolutional sparse autoencoders for remote sensing," in 2013 21st Signal Processing and Communications Applications Conference (SIU), 2013, pp. 1-4.
[38] C. Garcia and M. Delakis, "Convolutional face finder: A neural architecture for fast and robust face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1423, 2004.
[39] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[40] A. Ng, "Sparse autoencoder," CS294A Lecture Notes, vol. 72, pp. 1-19, 2011.
[41] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, "Part-based multiple-person tracking with partial occlusion handling," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1815-1821.
[42] B. Benfold and I. Reid, "Stable multi-target tracking in real-time surveillance video," in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3457-3464.
[43] Y. Yang, G. Shu, and M. Shah, "Semi-supervised learning of feature hierarchies for object detection in a video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1650-1657.
[44] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, "Discriminatively trained deformable part models, release 5," 2012.
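The recall-precision evaluation summarized in Table 3 can be sketched generically as follows. This is a sketch, not the authors' code: the detection scores and ground-truth labels are hypothetical, and the paper's exact matching protocol is not reproduced.

```python
# Hypothetical scored detections: each score with a ground-truth label
# (1 = true object, 0 = non-object). Not the authors' evaluation code.

def precision_recall_curve(scores, labels):
    """Sort detections by descending score and accumulate (precision, recall)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    curve = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_pos))
    return curve

def average_precision(curve):
    """Approximate AP: sum precision weighted by each recall increment."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in curve:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

curve = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
print(round(average_precision(curve), 3))
```

Sweeping the score threshold (here implicitly, by ranking) traces out exactly the kind of recall-precision curve compared in Fig. 9.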
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Comparison of Performance of Different Background Subtraction Methods for Detection of Heavy Vehicles

Emre CANAYAZ
Marmara University
Vocational School of Technical Sciences
Biomedical Equipment Technology
Istanbul, Turkey
emre.canayaz@marmara.edu.tr

Veysel Gökhan BÖCEKÇİ
Marmara University
Technology Faculty
Electrical-Electronic Engineering Department
Istanbul, Turkey
vgbocekci@marmara.edu.tr

Abstract— The growing number of vehicles in urban and national road networks has created the need for effective monitoring and management of road traffic. In particular, detecting vehicles that break average speed limit rules, and detecting trespassing heavy vehicles, is essential for safe traffic flow. In the proposed study, the main goal was to detect heavy vehicles in surveillance videos using interframe difference, approximate median filtering and Gaussian mixture models for background subtraction, and to compare their performance. Moreover, after removing the background image from the original videos, morphological opening and blob analysis were applied to the binary image, and heavy vehicle detection was achieved using a minimum blob area for the objects detected in a frame. The different background subtraction methods produce varying results, and these results are discussed. Our results were consistent with performance comparison studies indicating that the Gaussian mixture model is a stable, real-time outdoor tracker under varying outdoor conditions.

Keywords: Heavy Vehicle Tracking System, Background Subtraction Methods, Video Processing

I. INTRODUCTION
The growing number of vehicles in urban and national road networks has created the need for effective monitoring and management of road traffic. The road network, currently nearly 3.7 million miles, is predicted to increase by 30% in the next decade. In particular, detecting vehicles that break average speed limit rules and trespassing heavy vehicles is essential for safe traffic flow. To handle this problematic situation, one option is to increase network capacity; the other is to improve efficiency by investing in Intelligent Transportation Systems (ITS) technology using visual processing tools and soft computing methods [1]. In many computer vision applications, such as video surveillance, traffic monitoring, crowd counting and people tracking, the first essential step is the detection and segmentation of the relevant objects in video streams [2]. Vision-based video monitoring systems offer many advantages: in addition to vehicle counts, a much broader set of traffic parameters, such as vehicle classifications, lane changes, etc., can be measured. Vehicle classification is essential for computing the percentages of vehicle classes that use state-aid streets and highways. Despite a varying amount of work on vehicle detection and tracking, relatively little has been done in the field of vehicle classification, the main reason being that vehicle classification is an inherently hard problem [3]. There are various studies based on different approaches for classifying vehicles [4-6].

The main methods for detecting objects, including vehicles, can be divided into three groups: the background difference method, the inter-frame difference method and the optical flow method [7]. The optical flow method has the most complex algorithm, which needs more time than the other approaches to process the data. The background subtraction method, on the other hand, is known for its extreme sensitivity to fluctuations of light. Finally, the frame difference method is relatively simple and easy to apply, but its results are not accurate enough, because background brightness changes cause errors [8]. Hence, to increase the frame difference method's performance, different edge detection algorithms should be applied to the related data [8-10].

    D_k(x, y) = 1  if |f_k(x, y) - f_{k-1}(x, y)| > T,  and 0 otherwise        (1)

Some studies have focused mainly on determining vehicle speed effectively on a particular highway or urban road under different weather conditions [11-13]. They integrated compressive sensing and background subtraction methods. The algorithm compressively samples a background image and the input image frame by frame to obtain the compressive measurements, then subtracts the measurements of the background image and the input image to test for the target vehicle's existence, and finally reconstructs the vehicle information precisely by the orthogonal matching pursuit algorithm. Since the background update was conducted only on the measurements, the computational cost declined considerably [9].
In the frame difference method (FD), a binary difference image is obtained by thresholding the corresponding pixel differences between frame k and frame k-1, as in Formula (1), where T is a fixed threshold. In the binary difference image, the one-valued pixels are considered foreground points and the zero-valued pixels are regarded as background points [14].

The median filter (APM) is designed by buffering N frames, calculating the median of these frames, and applying a threshold to detect the background of the video. This method is very effective, but many frames have to be stored to calculate the median frame. In median filtering, the previous N frames of the video are stored in a buffer and the background frame is calculated as the median of the buffered frames. The background frame is then subtracted from the current frame to find the foreground pixels [15, 16].

    p(x) = sum_{j=1}^{M} ω_j K_{σ_j}(x - µ_j)        (2)

For the GMM, a one-dimensional M-component Gaussian mixture model can be expressed as in Formula (2), where ω_j is the weight of the j-th component and K_σ(x - µ) is a Gaussian kernel

    K_σ(x - µ) = (1 / (σ sqrt(2π))) exp(-(x - µ)^2 / (2σ^2))        (3)

centered at mean µ with standard deviation σ, as in Formula (3) [17].

In their study comparing the performance of inter-frame difference, approximate median filtering and Gaussian mixture models (GMM) for background subtraction, Shahbaz et al. claimed that GMM is relatively slow, but that its sharpness and accuracy are far better than those of the other two methods [18]. Moreover, in their review, Xu et al. imply that the Gaussian mixture model does not need to store a set of input data during the running process; the Gaussian mixture model uses the mean value and covariance to model each pixel [16]. Thus each pixel has its own threshold, without the constraint of a unified global threshold. However, the number of Gaussians must be predetermined before operation [16]. Although there is a variety of background subtraction algorithms [17], the Gaussian mixture model is relatively fast and easily applicable among them. Although Sobral et al. [17] did not apply background subtraction to heavy vehicle detection, the Gaussian mixture model is one of the most useful models among those considered.

Given the lack of studies on vehicle classification, and especially on heavy vehicle detection, in our research we focused mainly on heavy vehicle detection in video obtained from two different surveillance cameras, for effective management of road traffic. For background subtraction we used three methods common in the literature and compared their performance: inter-frame difference, approximate median filtering and GMM. Although the results were compatible with previous studies suggesting that GMM is more effective for background subtraction, especially in our proposed study of heavy vehicle detection, the frame difference and median filtering background subtraction methods had their own superiority under some conditions.

II. METHODOLOGY
The main purpose of this study is to compare the performance of different background subtraction methods generally used to detect heavy vehicles in traffic videos. As indicated in the introduction, the frame difference, approximate median and Gaussian mixture background subtraction models were selected. All of the models were applied to the same traffic video on the MATLAB programming language platform, as presented in Figure 1. The same algorithm is applied to the same video frame by frame with each background subtraction method (Figure 1).

Figure 1. A general algorithm for applying background detection and subtraction methods (flowchart: define MATLAB VideoPlayer object and parameters; background detection and subtraction by 1. frame difference, 2. approximate median, or 3. Gaussian mixture; morphological opening; blob analysis; if an object is bigger than the minimum blob area, the object is detected).

First of all, to display the original and background-subtracted videos, a VideoPlayer object of the MATLAB Computer Vision System Toolbox, which can play a video or display image sequences, was defined. The target traffic video was also chosen in this step by VideoFileReader. The subtraction methods were then applied in the next stage, using predefined MATLAB commands for the Gaussian mixture model and MATLAB code for the other background subtraction models. The detected background pixels were removed from the original video and a cleared version was obtained [14].
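The frame-difference rule of Formula (1) and the approximate-median background update can be sketched as follows. This is a pure-Python sketch on tiny nested-list "frames", not the authors' MATLAB code; the incremental plus/minus-one update follows the approximate-median idea of McFarlane and Schofield [16] rather than the buffered median described above, and the frames and threshold are made up.

```python
# Tiny grayscale "frames" as nested lists of intensities (not real video data).

def frame_difference(prev, curr, threshold):
    """Formula (1): mask is 1 (foreground) where |curr - prev| > T, else 0."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(pr, cr)]
            for pr, cr in zip(prev, curr)]

def approximate_median_update(background, frame):
    """Nudge each background pixel by 1 toward the current frame; over time
    this converges to a running median estimate of the background."""
    return [[b + (1 if f > b else -1 if f < b else 0)
             for b, f in zip(br, fr)]
            for br, fr in zip(background, frame)]

prev = [[10, 10], [10, 10]]
curr = [[10, 90], [12, 10]]
print(frame_difference(prev, curr, 20))        # only the large change is flagged
print(approximate_median_update(prev, curr))   # background drifts toward curr
```

The GMM approach replaces the single background frame with per-pixel mixture parameters, which is why it needs no frame buffer.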
After that, morphological opening of the binary pixel data is performed to open up gaps between objects connected by thin bridges of pixels. For the opening process, the morphological structuring element was selected as a 'Disk' with a 1-pixel radius. The main reason to prefer the Disk element was its fast approximation reaction time, allowing heavy vehicles to be detected promptly. Moreover, the morphological opening process eliminated salt-and-pepper noise from each frame. In Figure 2, screenshots of the seventy-fifth frame of the original video and of the cleaned versions are presented.

Figure 2. Original frame and background subtracted versions of the original frame.

After forming the background-cleaned versions, each frame was blob-analyzed to calculate the area of each object in the frame. A blob gets a label if its number of pixels meets the minimum size specified by the user. The minimum blob area criterion for heavy vehicle detection was chosen based on a receiver operating characteristic (ROC) analysis of the methods. Blob analysis also provides the coordinates of the object; thus, when the minimum criterion was met, the algorithm drew a rectangle around the object in the original video, representing a heavy truck (Figure 3).

Figure 3. Detected heavy truck was framed by a rectangle according to blob analysis results.

III. RESULTS
To determine the performance of the different background subtraction methods, all of them were applied to the same traffic video. The test video's properties are presented in Table 1.

TABLE I. PROPERTIES OF THE TEST VIDEO AND DOWNLOAD LINK.

    Type of file:   final crop.mp4
    Size:           41.2 MB
    Frame rate:     30.000 fps
    Width:          640 pixels
    Height:         360 pixels
    Download link:  https://drive.google.com/open?id=1zIw85fVVlmHdc55JgK5cCV1NKfQugFnT

In the video there were 109 vehicles, 7 of them heavy and the other 102 not heavy. Different blob area size criteria were applied when detecting heavy vehicles in the cleaned version of the video described in the methodology section, to form a decision matrix according to heavy vehicle detection performance. In Table 2, Table 3 and Table 4, the performance analysis of varying blob area criteria for the different methods is presented. Moreover, ROC curves were obtained according to the performance criteria given below (Figure 4, Figure 5, Figure 6).

Figure 4. ROC curve of Gaussian mixture background subtraction model for different minimum blob area criteria.

Figure 5. ROC curve of frame difference background subtraction model for different minimum blob area criteria.
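The blob-analysis step above (label a blob only if its pixel count meets the user-specified minimum, then report its location) can be sketched as follows. This is a generic 4-connected flood-fill labeling, not MATLAB's blob analysis, and the mask and minimum area are made up.

```python
# Binary foreground mask as nested lists (1 = foreground pixel).

def blobs_at_least(mask, min_area):
    """Return bounding boxes (top, left, bottom, right) of connected blobs
    whose pixel count is at least min_area, using iterative flood fill."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, pixels = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(pixels) >= min_area:  # blob gets a label only above the minimum size
                    ys = [p[0] for p in pixels]
                    xs = [p[1] for p in pixels]
                    boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes

mask = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(blobs_at_least(mask, 2))   # the single-pixel blob is rejected
```

The returned bounding box plays the role of the rectangle drawn around a detected heavy truck in Figure 3.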

Figure 5. ROC curve of the approximate difference background subtraction model for different minimum blob area criteria.

True positives (TP) represent the number of heavy vehicles successfully detected in the video, and false positives (FP) the number of non-heavy vehicles classified as heavy vehicles. True negatives (TN) are the non-heavy vehicles correctly classified as such, and false negatives (FN) are the heavy vehicles that were not detected. For different Minimum Blob Area (MBA) values, the Specificity (SPFS), Accuracy (ACC), False Positive Rate (FPR) and Sensitivity (SENS) are presented in the tables below.

TABLE II. RESULTS OF GAUSSIAN MIXTURE BACKGROUND SUBTRACTION MODEL FOR DIFFERENT MINIMUM BLOB AREA

    MBA     TP  FP   TN   FN  SPFS   ACC    FPR    SENS
    37000   2   1    101  5   0.990  0.944  0.009  0.29
    36000   3   1    101  4   0.990  0.954  0.009  0.428
    35000   3   1    101  4   0.990  0.954  0.009  0.428
    34000   3   1    101  4   0.990  0.954  0.009  0.428
    33000   3   1    101  4   0.990  0.954  0.009  0.428
    32000   3   1    101  4   0.990  0.954  0.009  0.428
    31000   3   1    101  4   0.990  0.954  0.009  0.428
    30000   3   2    100  4   0.980  0.944  0.019  0.428
    27000   5   2    100  2   0.980  0.963  0.019  0.714
    25000*  7   3    99   0   0.970  0.972  0.029  1
    15000   5   20   28   0   0.583  0.622  0.416  1
    10000   5   25   23   0   0.479  0.528  0.520  1
    5000    5   48   0    0   0      0.090  1      1

TABLE III. RESULTS OF FRAME DIFFERENCE BACKGROUND SUBTRACTION MODEL FOR DIFFERENT MINIMUM BLOB AREA

    MBA     TP  FP   TN   FN  SPFS   ACC    FPR    SENS
    3300    0   0    109  7   1      0.939  0      0
    3100    1   0    109  6   1      0.948  0      0.142
    2900    3   0    109  4   1      0.965  0      0.428
    2700    4   0    109  3   1      0.974  0      0.571
    2500    5   0    109  2   1      0.982  0      0.714
    2300    5   0    109  2   1      0.928  0      0.714
    2100*   6   1    108  1   0.990  0.982  0.009  0.857
    1500    6   1    108  1   0.990  0.982  0.009  0.857
    1000    6   2    107  1   0.981  0.974  0.018  0.857
    500     6   31   78   1   0.715  0.724  0.284  0.857
    200     6   109  0    1   0      0.051  1      0.875

TABLE IV. RESULTS OF APPROXIMATE MEDIAN BACKGROUND SUBTRACTION MODEL FOR DIFFERENT MINIMUM BLOB AREA

    MBA     TP  FP   TN   FN  SPFS   ACC    FPR    SENS
    4000    3   1    108  4   0.990  0.956  0.009  0.429
    3500    5   1    108  2   0.991  0.974  0.009  0.714
    3000*   6   3    106  1   0.972  0.966  0.028  0.857
    2800    6   3    106  1   0.972  0.966  0.028  0.857
    2600    6   3    106  1   0.972  0.966  0.028  0.857
    2400    6   4    105  1   0.963  0.957  0.037  0.857
    2000    6   23   86   1   0.789  0.793  0.211  0.857
    1500    7   30   79   0   0.725  0.741  0.275  1
    1000    7   80   29   0   0.587  0.612  0.413  1
    500     7   80   29   0   0.266  0.310  0.734  1
    200     7   100  9    0   0.083  0.138  0.917  1
    100     7   102  7    0   0.064  0.121  0.936  1

IV. CONCLUSIONS
According to the data obtained from the video, the best performance of each subtraction method is marked with "*" in the tables given above. Although the accuracy of each marked best performance differs between the techniques, the methods have distinct strengths. For example, GMM managed to detect all of the heavy vehicles, at the cost of three false positives. Although FD and APM identified only 6 heavy vehicles, their FP counts were low. Therefore, we can conclude that GMM has better performance for detecting heavy vehicles. Concerning positive and negative predictive values, GMM scored better for identifying heavy vehicles, with a score of 0.88927 for the area under the ROC curve (AROC). Overall, concerning the result data presented, the Gaussian mixture model performed better than the other two subtraction methods. Therefore, among the proposed models, the Gaussian mixture gave the best performance for detecting heavy vehicles in automated video surveillance.

Figure 6. Positive and Negative Predicted Values of GMM, FD and APM background subtraction methods for the test video.
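The measures in Tables II-IV follow the standard confusion-matrix definitions. Below is a sketch checked against the starred GMM row (MBA 25000: TP=7, FP=3, TN=99, FN=0); note the tables appear to truncate rather than round some values (e.g. SPFS 99/102 prints as 0.971 when rounded, 0.970 in the table).

```python
# Confusion counts from the starred GMM row of Table II.

def detection_metrics(tp, fp, tn, fn):
    """Specificity, accuracy, false positive rate and sensitivity
    from raw confusion-matrix counts."""
    return {
        "SPFS": tn / (tn + fp),                  # specificity
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "FPR": fp / (fp + tn),                   # false positive rate = 1 - SPFS
        "SENS": tp / (tp + fn),                  # sensitivity (recall)
    }

m = detection_metrics(tp=7, fp=3, tn=99, fn=0)
print({k: round(v, 3) for k, v in m.items()})
# ACC 0.972, FPR 0.029 and SENS 1 match the starred row of Table II.
```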
Figure 7. The area under ROC curve scores for GMM, FD and APM background subtraction methods.

Our results coincide with previous performance comparison studies, which indicated that the Gaussian mixture model is a stable, real-time outdoor tracker under varying outdoor conditions. For future work, the proposed system can be combined with deep learning algorithms, or with other algorithms that work with soft computing methods, to obtain better results and a fast reaction time. The system should also be tested on a newly presented traffic dataset to demonstrate its performance [19].

ACKNOWLEDGMENT
This research was supported by the Marmara University Scientific Research Unit under project number FEN-D-8938.

REFERENCES

[1] Gutchess, D., et al., A background model initialization algorithm for video surveillance, in Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), 2001.
[2] Yao, L. and M. Ling, An improved mixture-of-Gaussians background model with frame difference and blob tracking in video stream. The Scientific World Journal, 2014. 2014: p. 9.
[3] Gupte, S., et al., Detection and classification of vehicles. IEEE Transactions on Intelligent Transportation Systems, 2002. 3(1): p. 37-47.
[4] Roxas, E.A., et al., Vehicle classification method using compound kernel functions, in 2018 IEEE International Conference on Applied System Invention (ICASI), 2018.
[5] Belen, J.J.P., et al., Vision based classification and speed estimation of vehicles using forward camera, in 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA), 2018.
[6] Sliwa, B., et al., A radio-fingerprinting-based vehicle classification system for intelligent traffic control in smart cities, in 2018 Annual IEEE International Systems Conference (SysCon), 2018.
[7] Wang, X., et al., Target detection algorithm based on improved Gaussian mixture model, in International Conference on Electrical, Computer Engineering and Electronics, 2015.
[8] Zhan, C., et al., An improved moving object detection algorithm based on frame difference and edge detection, in Fourth International Conference on Image and Graphics (ICIG 2007), 2007.
[9] Cao, Y., et al., A vehicle detection algorithm based on compressive sensing and background subtraction. AASRI Procedia, 2012. 1: p. 480-485.
[10] Fan, X., Y. Cheng, and Q. Fu, Moving target detection algorithm based on Susan edge detection and frame difference, in 2015 2nd International Conference on Information Science and Control Engineering (ICISCE), 2015.
[11] Donoho, D.L., Compressed sensing. IEEE Transactions on Information Theory, 2006. 52(4): p. 1289-1306.
[12] Baraniuk, R.G., Compressive sensing [Lecture Notes]. IEEE Signal Processing Magazine, 2007. 24(4): p. 118-121.
[13] Candès, E.J., J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006. 52(2): p. 489-509.
[14] Liu, H. and X. Hou, Moving detection research of background frame difference based on Gaussian model, in 2012 International Conference on Computer Science & Service System (CSSS), 2012.
[15] Arif, M., S. Daud, and S. Basalamah, People counting in an extremely dense crowd using blob size optimization. Life Science Journal, 2012. 9(3): p. 1663-1673.
[16] McFarlane, N.J.B. and C.P. Schofield, Segmentation and tracking of piglets in images. Machine Vision and Applications, 1995. 8(3): p. 187-193.
[17] Kristan, M., D. Skocaj, and A. Leonardis, Incremental learning with Gaussian mixture models, in Computer Vision Winter Workshop, 2008.
[18] Shahbaz, A., J. Hariyono, and K.-H. Jo, Evaluation of background subtraction algorithms for video surveillance, in 2015 21st Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), 2015.
[19] Luo, Z., et al., MIO-TCD: A new benchmark dataset for vehicle classification and localization. IEEE Transactions on Image Processing, 2018: p. 1-1.

FSIFT based feature points for face hierarchical clustering
Grzegorz Sarwas and Sławomir Skoneczny
Institute of Control and Industrial Electronics
Warsaw University of Technology
Koszykowa 75, 00-662 Warsaw, Poland
Email: {sarwasg, slaweks}@ee.pw.edu.pl

Abstract—In this paper a method for clustering face images in many cases the quality of human faces images is low and
based on fractional order SIFT algorithm (FSIFT) is presented. therefor it is necessary to increase their contrast [9].
This new approach is based on the dissimilarity matrix. This ma- In the presented paper, a known number of clusters have
trix is constructed by using descriptors calculated for keypoints
detected by FSIFT algorithm using derivatives of non integer been established. For image comparing purpose the algorithms
order. To proof and compared the quality of achieved results the based on keypoints detectors have been used. The solution
relative error ratio and the F-measure were applying. The final proposed by the authors of this paper based on a fractional
scores of experiments were compared with hierarchical clustering order differentiation keypoint detector, called FSIFT [10], have
methods based on SIFT and SURF detectors. been compared with the well known methods like SIFT [11]
and SURF [12].
I. I NTRODUCTION The paper is organized as follows. In Section II the proposed
approach is presented. The similarity matrix created using
In this paper, the problem of face clustering is discussed. features detectors, three different keypoint detectors as well
Clustering of several face images is a very important task in as hierarchical clustering method are described. In Section III
image processing area. The main purpose of this process is to experimental result are presented and discussed. In the last
provide meaningful partitions for given face image sets based section the article is summarized and concluded.
on its appearance similarity and otherness.
Precisely speaking the clustering problem is as follows: For II. P ROPOSED A PPROACH
the finite data set O = {o1 , o2 , . . . , oN } and a dissimilarity In presented research a concept of hierarchical face cluster-
index d which for a set O maps d : O × O → [0, ∞) we ing based on the solution suggested by Antonopoulos et al.
can group N faces to C clusters by using the low index of [4] is adapted. Their method consists of two steps:
dissimilarity. As a result of clustering process groups of face 1) Creation dissimilarity matrix D by using SIFT image
images should be received. Each group contains the faces that feature detector.
belong to the same identity, while dissimilar faces should be 2) Using a hierarchical average linkage algorithm on the
assigned to different clusters. dissimilarity matrix for faces clustering.
The problem of face clustering is encountered in many In this paper, the authors propose use of FSIFT feature point
different scenarios starting from social media trough video detector in order to create a dissimilarity matrix. The clustering
analysis or content indexing to biometric problems. We can effectiveness of this algorithm has been compared with the
find dozens of algorithms developed to solve this issue. There SURF and SIFT feature detectors used for building a matrix
are many disturbances which affect the correctness of the of dissimilarity.
clustering process. Some of them are related to the image
acquisition problems like the change of illumination, different A. Building the dissimilarity matrix using keypoint detectors
occlusion and pose. Changes of appearance caused by the aging A square and symmetric matrix D contains all pairs of
process, make-up or even facial expressions make the problem of differences between samples that should be grouped. The size
face matching a complicated task. Another important matter of this matrix is N × N , where N is the total number of
associated with clustering algorithms is an unknown number clustering face images. Each element Dij of dissimilarity
of clusters required by the most often encountered algorithms. matrix defines the dissimilarity between facial images Ai and
Nowadays there are many different techniques for face Aj . This dissimilarity is defined by the following formula [4]:
clustering. In [1] Wu et al. proposed the solution based on  
Hidden Markov Random Fields. Prince and Elder presented Dij = Dji = 100 (1 − Mij / min(Ki , Kj )), (1)
a Bayesian approach in [2]. In [3] Schroff et al. proposed
FaceNet for finding features for a face clustering. Another where Mij is the maximum number of keypoints matches
possible approach is based on local features. Solutions of between the pairs (Ai , Aj ), (Aj , Ai ), and Ki , Kj are the numbers
this kind can be found in [4], [5], [6], [7], [8]. However, of keypoints found in Ai and Aj respectively. Di,j is in the range of
55
[0, 100] and higher values of Di,j imply higher dissimilarity 3) FSIFT: FSIFT is the solution presented in [10]. Its main
between images Ai and Aj . The feature points are matched by idea was based on the SIFT algorithm but the computing of
using the Euclidean distance between feature vectors. Every DoG pyramid is replaced by calculating fractional derivative.
match is valid if the distance to the best match (the closest The fractional-order derivative can be interpreted as a gen-
neighbor) is less than 0.8 times that to the second closest match. eralization of the integer-order derivative. In this paper have
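Formula (1) together with the ratio test described above can be sketched in a few lines (an illustrative Python sketch, not the authors' implementation; `ratio_test_matches` and `dissimilarity` are hypothetical names, and the keypoint descriptors are assumed to be already extracted):

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Count descriptors of desc_a whose nearest neighbour in desc_b is
    closer than `ratio` times the second-nearest one (Euclidean distance)."""
    matches = 0
    for d in desc_a:
        dist = np.sort(np.linalg.norm(desc_b - d, axis=1))
        if dist[0] < ratio * dist[1]:
            matches += 1
    return matches

def dissimilarity(desc_i, desc_j, ratio=0.8):
    """Eq. (1): D_ij = 100 * (1 - M_ij / min(K_i, K_j)), with M_ij taken
    as the larger match count of the two matching directions."""
    m = max(ratio_test_matches(desc_i, desc_j, ratio),
            ratio_test_matches(desc_j, desc_i, ratio))
    return 100.0 * (1.0 - m / min(len(desc_i), len(desc_j)))
```

By construction the resulting value is symmetric, zero for identical descriptor sets, and bounded by 100.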
been used the Riemann-Liouville (R-L) formula for calculating
B. Keypoint detectors in time domain the fractional order differ-integral of f (x)
function expressed as:
In this subsection the detectors used for building the dissimi-
larity matrix are described: aDxα f (x) = (1/Γ(n − α)) (dn /dxn ) ∫ax f (τ )/(x − τ )α−n+1 dτ (2)
1) SIFT: The SIFT algorithm proposed by Lowe in 2004
consists of five main steps [11]:
1) Detection of scale-space extrema, function f (x) and for n ∈ N ∪ {0} we have:
2) Keypoint localization,
3) Orientation assignment, n − 1 < α ≤ n for α > 0;
4) Keypoint descriptor, n = 0 for α ≤ 0.
5) Keypoint matching. For α > 0 the result of this function is equivalent to the
Because it is impossible to detect keypoints with different fractional order derivative, for α < 0 to fractional order
scales by using the same window, in order to detect larger integral and for α = 0 to the function itself. This is why
keypoints we need larger windows. For this purpose, scale- above definition is called a differ-integral. The main property
space filtering based on the Laplacian of Gaussian (LoG) with of the operator a Dxα is its linearity for the integer-order
various values of the Gaussian standard deviation σ is calculated
for the image. The LoG with different values of the σ parameter
detects blobs of various sizes. However, the LoG operator has
quite large computational burden, so the SIFT method takes property concerning the Fourier transform of convolution can
advantage of the Difference of Gaussians, which is quite good be used for the evaluation of the Fourier transforms of the
approximation of the LoG. The Difference of Gaussian is Riemann-Liouville fractional integral and Fourier transforms
obtained as the difference of Gaussian blurring of an image of fractional derivatives.
with two different σ, where one is k times higher. This Another important property of the Fourier transform, which
process is done for different octaves of the image in the is often used in many engineering application, is the Fourier
Gaussian Pyramid. transforms of fractional derivatives of f (x). Namely, if
Once these DoG images are found, they are searched for local f (x), f (1) (x), . . . , f (n−1) (x) vanish for x → ±∞, then the
extrema over scale and space. For instance one pixel in an Fourier transform of the n-th derivative of f (x) is:
image is compared with its 8 neighbors as well as 9 pixels
in next scale and 9 pixels in previous scales. If it is a local F(f (n) (x)) = (jω)n F (ω). (3)
extreme, it is a potential keypoint. It basically means that the
keypoint is best represented in that scale. In case of fractional derivative first we evaluate the Fourier
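The scale-space extremum search described above can be illustrated as follows (a simplified sketch assuming SciPy; octave handling and keypoint refinement are omitted, and `dog_stack`/`local_extrema` are hypothetical names):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_stack(image, sigma0=1.6, k=2**0.5, levels=5):
    """Difference of Gaussians: blur with sigma0 * k**i and subtract
    consecutive levels (a cheap approximation of the LoG response)."""
    blurred = [gaussian_filter(image, sigma0 * k**i) for i in range(levels)]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(levels - 1)])

def local_extrema(dog, threshold=0.005):
    """Keep pixels whose |DoG| response exceeds all 26 neighbours
    (8 in the same scale, 9 in the scale above and 9 below) as well as
    a small contrast threshold."""
    mag = np.abs(dog)
    ext = (mag == maximum_filter(mag, size=3)) & (mag > threshold)
    ext[0], ext[-1] = False, False   # an extremum needs both neighbouring scales
    return np.argwhere(ext)          # rows of (scale, row, col)
```

For a single blurred bright blob, the strongest response appears at an interior scale matching the blob size, which is exactly the "best represented in that scale" behaviour described above.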
transform of the Riemann-Liouville fractional derivatives with
After potential keypoints’ locations have been found, in
the lower bound a = −∞ [13]:
the next phase, called Keypoint Localization, they should be
refined to get more accurate results. −∞Dxα f (x) = (1/Γ(n − α)) ∫−∞x f (n) (τ ) dτ /(x − τ )α+1−n (4)
Here the Taylor series expansion of scale space is utilized to = −∞Dxα−n f (n) (x)
get more accurate location of extrema. If the extreme intensity
is lower than a threshold value, then this particular point is
rejected. Moreover low-contrast keypoints and edge keypoints and we assume here that n − 1 < α < n and n ∈ N.
are eliminated so only strong interest points remain. After applying the Fourier transform to equation (4) we
2) SURF: SURF detects feature points in a slightly different arrived at:
way than SIFT. While the SIFT algorithm builds an image
pyramids, taking the difference of images filtered with Gaussians F(Dα f (x)) = (jω)α−n F(f (n) (x))
of increasing sigma values SURF creates a “stack” without = (jω)α−n (jω)n F (ω) (5)
2:1 down sampling for higher levels in the pyramid resulting
= (jω)α F (ω).
in images of the same resolution[9]. Due to the use of integral
images, SURF filters the stack using a box filter approximation Therefore the pair of fractional derivative in time domain and
of second-order Gaussian partial derivatives, since integral in the frequency domain is:
images allow the computation of rectangular box filters in near
constant time [3]. Dα f (x) ↔ (jω)α F (ω), α ∈ R+ . (6)
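Correspondence (6) can be checked numerically on a uniformly sampled periodic grid (an illustrative sketch; `fractional_derivative` is a hypothetical name). For α = 1 the result matches the ordinary derivative, and applying the half-derivative twice reproduces it as well:

```python
import numpy as np

def fractional_derivative(f, alpha, dx):
    """Spectral differ-integral via eq. (6): multiply F(omega) by
    (j*omega)**alpha. Valid for a uniformly sampled periodic signal.
    The DC term is left at zero, which is the exact kernel value for
    alpha > 0."""
    n = len(f)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    kernel = np.zeros(n, dtype=complex)
    nonzero = omega != 0
    kernel[nonzero] = (1j * omega[nonzero]) ** alpha
    return np.fft.ifft(kernel * np.fft.fft(f)).real
```

The principal branch of the complex power keeps the kernel conjugate-symmetric, so the output of a real input stays real.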
For any two-dimensional function g(x, y) absolutely inte- III. EXPERIMENTS
grable in (−∞, ∞)×(−∞, ∞) the corresponding 2-D Fourier In order to test and compare the operation of the FSIFT
transform is as follows [13]: algorithm for face images clustering the IMM Face Database
[15] has been used. This database contains 240 facial images
G(ω1 , ω2 ) = ∫∫ g(x, y)e−j(ω1 x+ω2 y) dxdy. (7)
of 40 different humans. For our experiments the full frontal
Therefore we can write the formula for fractional-order deriva- photos have been selected. Finally the experiments have been
tives as: conducted using the 78 images of 26 persons (each in 3
frontal views of different types). Strictly speaking these faces
Dxα g = F −1 ((jω1 )α G(ω1 , ω2 )) , in two different resolutions: 150 x 150 and 512 x 512 pixels
Dyα g = F −1 ((jω2 )α G(ω1 , ω2 )) , (8) have been used. Each photograph from the image database
taken originally on a dark green background was resampled
where F −1 denotes the inverse 2-D continuous Fourier into two images of different resolutions (150 x 150 and 512 x
transform operator. This formula has been extensively used 512 pixels) and transformed into gray-level images. In Fig. 1 the
in our FSIFT algorithm. examples of 36 frontal view images selected from IMM Face
The Fractional Scale-Invariant Feature Transform (FSIFT) Database are shown.
algorithm presented in 2017 is an approach to keypoint de-
tection where the step of computing the DoG pyramid has been
replaced by calculating fractional derivatives. This method can
be demonstrated in the following steps:
1) Build classical Gaussian Pyramid proposed by Adelson
and Burt in 1983 [14].
2) For each scale calculate the collection of images created
by fractional order derivative with the following orders
of α = 1.75, 1.85, 1.95, 2.05, 2.15, 2.25.
3) For derivative images with α order between 1.85 and
2.15 search for extrema over order and space. For
example, one pixel in an image is compared with its
8 neighbors as well as 9 pixels in next order image and
9 pixels in previous order image.
4) The interest points’ locations are updated for greater
accuracy by using interpolation based on a second-order
Taylor series.
5) Rejection of all extrema with low contrast.
6) Refining keypoint locations neglecting points that lie on
edges.
C. Hierarchical Clustering
Fig. 1. 36 samples of frontal view images.
Hierarchical clustering is one of the most popular methods of
object grouping that organizes data by creating a cluster tree or For clusterization we used Matlab 2018a hierarchical clus-
dendrogram. There exist two types of hierarchical clustering tering procedure (Linkage with 2 types of parameters: Average
methods: agglomerative and divisive. Euclidean and Ward). Two types of measures of clusterization
An agglomerative clustering method starts by treating each Euclidean and Ward). Two types of measures of clusterization
observation as a separate cluster. Then in every step the two quality were applied.
closest clusters are joined into a new one. This iterative algorithm To compare the performance of the algorithms,
runs until all data are combined into one cluster. the relative error Cerr of clusterization and the F measure
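The agglomerative procedure can be sketched with SciPy (an assumption, since the experiments use Matlab's linkage; note that SciPy formally expects Euclidean distances for the ward method, so feeding it an arbitrary dissimilarity matrix is a heuristic, and `cluster_faces` is a hypothetical name):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_faces(D, n_clusters, method="average"):
    """Agglomerative clustering of a precomputed dissimilarity matrix D
    (square, symmetric, zero diagonal) into a fixed number of clusters."""
    condensed = squareform(D, checks=False)   # upper triangle as a 1-D vector
    Z = linkage(condensed, method=method)     # e.g. "average" or "ward"
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

With a block-structured dissimilarity matrix, faces of the same identity end up with the same cluster label.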
A divisive clustering method starts by assuming that all data The relative error of clusterization is defined as:
belong to one class, and in each iterative step this group is Cerr = nicf /N, (9)
divided into smaller clusters.
In hierarchical clustering methods users usually decide on
where nicf is the number of incorrectly clusterized faces and
the level or scale of clustering that is most appropriate for the
N is the number of all faces in the dataset.
designed application. In the presented research an agglomerative
The F measure for clusterization has been described in [4].
hierarchical clustering method with a number of clusters as-
For clusters i and j the F measure is defined as follows
sumed based on our dataset has been used. As a result of the
[16]:
hierarchical clustering algorithm’s operation the dissimilarity Fi,j = 2 · (nij /nj ) · (nij /ni ) / (nij /nj + nij /ni ), (10)
matrix D is transformed into the previously defined number
of clusters.
where: on the possessed data set of size 150 x 150 pixels. In case of
• nij is the number of patterns from class j in cluster i, larger sizes of facial images (512 x 512) the best results were
• ni is the number of patterns that belong to cluster i, obtained by the SIFT detector; however, the results obtained
• nj is the number of patterns that belong to class j. by the FSIFT detector were only slightly worse. Comparing
For each class j, the F measure over the entire hierarchy is: two methods of cluster merging, the Ward’s minimum variance
approach was significantly better than the method based on
Fj = maxi Fij , (11) the average standard Euclidean distance. The results achieved
by using the FSIFT detector are very promising, especially for
Finally the F measure for the whole clustering hierarchy is defined face images of small resolution (faces taken from a movie
as: sequence). However, it should be noted here that for faces
F = Σj (nj /N ) · Fj . (12) in positions different than the full frontal view further
experiments are necessary.
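Equations (10)-(12) can be combined into a single evaluation routine (an illustrative sketch; `clustering_f_measure` is a hypothetical name):

```python
import numpy as np

def clustering_f_measure(true_classes, cluster_labels):
    """F measure of eqs. (10)-(12): for every (cluster i, class j) pair
    take the harmonic mean of recall n_ij/n_j and precision n_ij/n_i,
    maximise over clusters, and weight by the class size n_j/N."""
    true_classes = np.asarray(true_classes)
    cluster_labels = np.asarray(cluster_labels)
    N = len(true_classes)
    total = 0.0
    for j in np.unique(true_classes):
        n_j = np.sum(true_classes == j)
        best = 0.0
        for i in np.unique(cluster_labels):
            n_i = np.sum(cluster_labels == i)
            n_ij = np.sum((true_classes == j) & (cluster_labels == i))
            if n_ij:
                r, p = n_ij / n_j, n_ij / n_i
                best = max(best, 2.0 * r * p / (r + p))
        total += (n_j / N) * best
    return total
```

A perfect clustering yields F = 1, and any merging or splitting of identities lowers the score.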
The final results of the experiments conducted by the REFERENCES
authors are shown in Tables I and II. AVGSE denotes the [1] B. Wu, Y. Zhang, B. G. Hu, and Q. Ji, “Constrained clustering and its
average linkage with the standard Euclidean distance and Ward application to face clustering in videos,” in 2013 IEEE Conference on
denotes Ward’s minimum variance. For images of size 150 x 150 the best Computer Vision and Pattern Recognition, June 2013, pp. 3507–3514.
clustering results were obtained by the FSIFT keypoint detector [2] S. J. D. Prince and J. H. Elder, “Bayesian identity clustering,” in 2010
clustering results were obtained by FSIFT keypoint detector [2] S. J. D. Prince and J. H. Elder, “Bayesian identity clustering,” in 2010
Canadian Conference on Computer and Robot Vision, May 2010, pp.
proposed by the author of this paper. Among the 512 x 512 32–39.
images the SIFT detector was a little bit better than FSIFT. The [3] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified em-
agglomerative clustering method using Ward’s minimum bedding for face recognition and clustering,” in 2015 IEEE Conference
variance measure for cluster linking produced significantly on Computer Vision and Pattern Recognition (CVPR), June 2015, pp.
variance measure for cluster linking produced significantly 815–823.
better results than the agglomerative clustering method with [4] P. Antonopoulos, N. Nikolaidis, and I. Pitas, “Hierarchical face clus-
the standard Euclidean distance. tering using sift image features,” in 2007 IEEE Symposium on Compu-
tational Intelligence in Image and Signal Processing, April 2007, pp.
TABLE I 325–329.
ERROR MEASURES FOR IMAGES CLUSTERING 150 X 150 FACES [5] S. S. PM Panchal, SR Panchal, “A comparison of sift and surf,”
IJIRCCE, vol. 1, no. 2, pp. 323–327, 2013.
Method Cerr F-measure [6] S. Velusamy and P. Moogi, “A progressive method of face clustering
for mobile phone applications,” in 2016 Signal Processing: Algorithms,
FSIFT − AVGSE 14.1% 0.8783 Architectures, Arrangements, and Applications (SPA), Sept 2016, pp.
FSIFT − Ward 6.41% 0.9341 202–206.
SIFT − AVGSE 23.08% 0.8109 [7] W. Zhang, X. Wu, S. Zhang, and J. Yan, “Face hierarchical
SIFT − Ward 12.82% 0.9002 clustering with sift-based similarities,” in 2nd International Conference
SURF − AVGSE 25.64% 0.8214 on Computer Science and Technology. dpi-proceedings.com,
SURF − Ward 6.41% 0.9615 2017, query date: 2018-06-09. [Online]. Available: http://dpi-
proceedings.com/index.php/dtcse/article/view/12562
[8] W. Zhang, X. Wu, W. P. Zhu, and L. Yu, “Unsupervized image clus-
tering with sift-based soft-matching affinity propagation,” IEEE Signal
TABLE II
Processing Letters, vol. 24, no. 4, pp. 461–464, April 2017.
ERROR MEASURES FOR IMAGES CLUSTERING 512 X 512 FACES
[9] S. Skoneczny, “Morphological sharpening of color images,” Bulletin of
the Polish Academy of Sciences Technical Sciences, vol. 64, no. 1, pp.
Method Cerr F-measure
103–113, 2016.
FSIFT − AVGSE 1.2821% 0.9923 [10] G. Sarwas, S. Skoneczny, and G. Kurzejamski, “Fractional order method
FSIFT − Ward 1.2821% 0.9823 of image keypoints detection,” in 2017 Signal Processing: Algorithms,
SIFT − AVGSE 0% 1.0000 Architectures, Arrangements, and Applications (SPA), Sept 2017, pp.
SIFT − Ward 0% 1.0000 349–353.
SURF − AVGSE 10.2564% 0.9462 [11] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
SURF − Ward 5.1282% 0.9538 International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110,
2004.
[12] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-up robust
features (surf),” Computer Vision and Image Understanding, vol. 110,
IV. CONCLUSION
Multimedia.
In this paper the hierarchical clustering for full frontal [13] I. Podlubny, Fractional differential equations. Academic press, 1999.
face image grouping has been examined. The clustering [14] P. J. Burt and E. H. Adelson, “The laplacian pyramid as a compact
process is based on the dissimilarity matrix built by use of image code,” IEEE TRANSACTIONS ON COMMUNICATIONS, vol. 31,
pp. 532–540, 1983.
image feature points. These points have been detected by [15] M. B. Stegmann, B. K. Ersbøll, and R. Larsen, “FAME – a flexible
three different detectors: SURF, SIFT and FSIFT. For clus- appearance modelling environment,” IEEE Trans. on Medical Imaging,
terization purpose the agglomerative hierarchical clustering vol. 22, no. 10, pp. 1319–1331, 2003.
[16] B. Larsen and C. Aone, “Fast and effective text mining using
with two different linkage methods (average standard Euclidean linear-time document clustering,” in Proceedings of the Fifth ACM
distance and Ward’s minimum variance) have been applied. SIGKDD International Conference on Knowledge Discovery and Data
The experiments based on a frontal face data set have been Mining, ser. KDD ’99. New York, NY, USA: ACM, 1999, pp. 16–22.
[Online]. Available: http://doi.acm.org/10.1145/312129.312186
conducted. It has been shown that the algorithm using the [Online]. Available: http://doi.acm.org/10.1145/312129.312186
fractional order calculus achieved the best clustering results
Microprocessor implementation of the sound source
location process based on the correlation of signals
Krzysztof Krupa Marcin Grochowina
University of Rzeszow University of Rzeszow
Department of Mechatronics and Control Engineering Department of Mechatronics and Control Engineering
35-310 Rzeszow, Poland 35-310 Rzeszow, Poland
Email: kkrupa@ur.edu.pl Email: grochmar@ur.edu.pl
Abstract—Sound direction estimation can be used in many
different mechatronic systems, while the use of bare-metal
programming of microcontrollers allows for miniaturization and
broadening of the range of applications. The paper presents a mi-
croprocessor implementation of a system allowing to determine
the azimuth of a sound source. The device operates based on
the measurement of the phase shift of the signal arriving at two
spaced-apart microphones. An algorithm based on calculating
the correlation of sound signals using the FFT was
used in the research.
I. INTRODUCTION
The aim of the work was to create a device that would
allow to determine the azimuth for the source of sound. The
additional goal was to design a solution that would allow for
maximum miniaturization of the system and reduction of its Fig. 1. Microphone and sound source arrangement
costs, which is why the software was implemented in a bare
metal microcontroller system. The sound was recorded using
two microphones connected to the analog-digital converters Therefore, analyzing the phase shift of the sound signals
built into the microcontroller. The azimuth for the sound reaching the L and R microphones, we can only conclude that
source was calculated based on the cross-correlation of the the desired sound source lies on the curve H .
signals reaching the microphones. Using the simplified formula (2) [2] we can specify the
A. Background azimuth ϕ0 of the asymptote k of the hyperbola H:
The operation principle of the presented solution is gener- ϕ0 = arcsin(∆l/d), (2)
ally known. It is based on the measurement of the phase shift
of the acoustic signal reaching two microphones spaced suggesting the position of the source on the straight line k, for
apart from each other by a known distance. Most sources example, at point S 0 . The real azimuth for source S in relation
[2], [4] only state that the measurement is reliable if the sound to the point 0 of the reference system is however ϕ. The
source is far away from the microphone system, that is, in difference between the actual azimuth ϕ and the calculated
the far field. ϕ0 decreases as the distance of the source S from
The principle of operation of the method used is shown in the point 0 increases.
the Fig. 1. The precision of the results obtained is an implication of
The microphones, spaced d apart, are located at the points the physical limitations of the measurement system. When the
marked L and R. The sound source is located at point S. The sampling rate is fS , the sound speed is ν and ∆l ∈ [0; d],
sound reaches microphone L after covering a distance l, while it reaches we can specify at most
microphone R after covering a distance l + ∆l.
Keep in mind that for l ∈ R there are infinitely many points n = dfS /ν (3)
at distance l from L and l + ∆l from R, lying on the curve
described by the equation: different azimuths ϕ0 ∈ [0; π/2].
Additionally, due to the non-linearity of the arcsin function,
the accuracy of azimuth estimation will be better for angles
√((x − d/2)2 + y 2 ) − √((x + d/2)2 + y 2 ) − ∆l = 0. (1) close to 0 and much worse for angles close to π/2.
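Equations (1) and (2) can be checked numerically (an illustrative sketch with hypothetical helper names; the far-field remark corresponds to the estimate converging as the source distance grows):

```python
import numpy as np

# Microphones at L = (-d/2, 0) and R = (d/2, 0); with this sign
# convention a positive azimuth points toward the L side.
def path_difference(source, d):
    """Delta_l of eq. (1): the extra distance the wavefront travels to R."""
    x, y = source
    return np.hypot(x - d / 2.0, y) - np.hypot(x + d / 2.0, y)

def estimated_azimuth_deg(source, d):
    """Far-field estimate of eq. (2): phi' = arcsin(Delta_l / d), degrees."""
    s = np.clip(path_difference(source, d) / d, -1.0, 1.0)
    return np.degrees(np.arcsin(s))
```

For a source 100 m away the estimate is practically exact, while at 0.5 m it is already off by a fraction of a degree.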
B. Octave simulation transmission band 20 Hz–20 kHz and sensitivity 44 dB. The
To illustrate the operation of the device, a simulation has microphone signal is amplified by the MAX9814 circuit,
been performed using GNU Octave, version 4.0.0; the which is supplied with 3.3 V. This amplifier allows choosing among
results are shown in Figure 2. L and R are three gain values: 40, 50 and 60 dB. It is also equipped with
time diagrams of signals reaching two microphones. The third an automatic gain control system whose Attack/Release ratio
chart – Corr. represents the cross-correlation of these signals. is regulated at the level of three user-selected values - 1:500,
A record of 1024 samples in length is captured with a 1:2000 and 1:4000. The microphone and amplifier are elements of
sample rate of 440 kS/s. The sampling rate was selected the module produced by Adafruit.
experimentally. The priority was to get the best precision The signals from both microphones were analyzed with
(formula 3). a 32-bit STM32F446RE microcontroller equipped with a Cor-
The maximum correlation value is at position 855 of the tex M4 core and an arithmetic coprocessor, 512 kB FLASH
sample. Taking into account the total length of the sample, memory and 128 kB SRAM included in the NUCLEO-
it is possible to calculate that the R-signal is delayed by 169 F446RE set. This microcontroller is clocked with an internal
samples in relation to the L-signal. Assuming a sound speed RC oscillator with a frequency of 16 MHz (accuracy 1%) multi-
of 340 m/s, we can calculate that ∆l ≈ 13 cm and the source plied by means of a PLL circuit to a frequency of 180 MHz.
is closer to the L microphone. Two analog-digital converters with 12-bit resolution were used. The
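The simulated delay estimation can be reproduced with a short sketch (illustrative Python rather than the authors' Octave script; the sign convention matches the example above, where a peak at index 855 of 1024 corresponds to a 169-sample delay of the R-signal):

```python
import numpy as np

def delay_samples(sig_l, sig_r):
    """Circular cross-correlation via FFT: corr = ifft(fft(L) * conj(fft(R))).
    A correlation peak at index p means R trails L by (n - p) mod n samples;
    the result is wrapped to a signed lag in (-n/2, n/2]."""
    n = len(sig_l)
    corr = np.fft.ifft(np.fft.fft(sig_l) * np.conj(np.fft.fft(sig_r))).real
    lag = (n - int(np.argmax(corr))) % n
    return lag - n if lag > n // 2 else lag

def azimuth_deg(lag, fs, v, d):
    """phi = arcsin(t_d * v / d) with t_d = lag / fs, in degrees."""
    return np.degrees(np.arcsin(np.clip(lag / fs * v / d, -1.0, 1.0)))
```

With fs = 440 kS/s, v = 340 m/s and d = 0.145 m, a lag of 94 samples maps back to an azimuth of about 30 degrees.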
sampling rate was 440 kS/s. The process of analogue-to-
Fig. 2. Time course of the L and R signals and the correlation between them.

Fig. 3. Block diagram of the sample registration system.
digital conversion was initiated with one of the microcontroller
II. MATERIALS AND METHODS timers, which was clocked at 90 MHz. Because the conversion
was triggered when the timer was overloaded, the sound
The research was carried out in an anechoic chamber
sampling frequency was fully controlled by the user. The
of the Vibroacoustics Laboratory at the Department
STM32F446RE microcontroller enables simultaneous opera-
of Mechatronics and Control Engineering at
tion of two or three analogue-to-digital converters. Since two
the University of Rzeszów. The sound for which the azimuth
microphones were used, the Dual ADC mode option of the
was determined was recorded with two microphones 0.145 m
Regular simultaneous mode was activated.
apart. Tests were carried out in the far field; therefore the distance
between the sound source and the pair of microphones was much
larger than the distance between the microphones.
The acquisition parameters were as follows:
• sample rate: fS = 440 kS/s, B. Software
• speed of sound: ν = 340 m/s, In order to perform correlation calculations, CMSIS DSP
• distance between microphones: d = 14.5 cm, Software Library was used, which is a package of pre-
• Number of potential azimuth(s): n = 188 compiled libraries that can be used with, among others,
Using the formula (2) the measurement resolution has been SMT32 microcontrollers based on Cortex M0, M3, M4 and
calculated. For angles close to 0◦ , ∆ϕ0 = 0.295◦ , while for
angles close to 90◦ , ∆ϕ90 = 5.82◦ .
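These resolution figures can be reproduced approximately from formula (3) and the arcsin non-linearity (a sketch; small differences from the quoted values come from rounding of d and n):

```python
import numpy as np

fs, v, d = 440_000.0, 340.0, 0.145    # sample rate, speed of sound, mic spacing
n = d * fs / v                        # eq. (3): distinguishable lags per quadrant

# One-lag angular step near 0 deg and near 90 deg of the arcsin mapping.
step_0 = np.degrees(np.arcsin(1.0 / n))
step_90 = np.degrees(np.arcsin(1.0) - np.arcsin(1.0 - 1.0 / n))
```

The step near broadside stays around 0.3 degrees, while near the endfire direction it grows to several degrees, which matches the trend quoted above.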
for the possible use of an arithmetic co-processor (FPU).
A. Hardware As shown in Figure 4, buffer 1 and buffer 2 data were
To build the device we used a set of two electret micro- converted to the frequency domain. For buffer 2 in the field of
phones CMA-4544PF-W with omnidirectional characteristics, frequencies complex conjugates were calculated. The results
Fig. 4. Algorithm diagram: buf1, buf2 → (float)buf1, (float)buf2 → b1fft = fft(buf1), b2fft = fft(buf2) → b2fft = b2fftH → output = ifft(b1fft .* b2fft) → max(output) → td = index · Ts → ϕ = arcsin(td · Vs /d).

are presented in Table I, which shows the maximum, minimum and average values of the measurement result ϕ, the mode Mo, the median ϕ̃, the standard deviation S, the average error ε, the maximum negative error ε− and the maximum positive error ε+. The average error ε was calculated by subtracting the mean value of the measurements from the preset angle. The error ε− is calculated by subtracting the maximum value of the obtained results from the preset angle, while the error ε+ by subtracting the minimum value of the obtained measurements from the preset angle. These data are illustrated in a box chart presented in Fig. 5. An asterisk on this chart indicates the average value of the measurements. The value of the measurement error depends on the deviation of the sound source from the normal to the axis on which the microphones are placed. For angles from 0◦ to ±40◦ the variation does not exceed 2◦ .

Despite the fact that the resolution of the measurement is low for angles close to 90◦ , a relatively high accuracy can be obtained by averaging the results of many measurements.

B. Constant angle

The sound source was moved away from the microphone axis along a line at a 30◦ angle every 0.02 m and each time the value indicated by the device was read. The measurement
results are shown in Table II. On this basis, a box
graph was prepared (Fig. 6), where two areas can be observed.
obtained in the form of two complex matrices were subjected The first is the distance between the sound source and the
to complex multiplication and the inverse transformation of microphone axis from 0.02 to 0.1 m. For these distances ϕ0
Fourier was calculated. The index of the maximum value of differs significantly from ϕ. The second area is the distance
the result obtained after the inverse Fourier transform is the from 0.12 to 0.4 m and in this range ϕ0 = ϕ ± 5◦ . The most
value of the signal delay. precise results were obtained for distances above 0.24 m. In this
The following functions of the CMSIS DSP package were area, the measurement results differ from the actual angle by
used in the algorithm: less than 2◦ , which can be interpreted as measurement noise.
• arm_cfft_f32 - a function allowing calculation of the Fourier
transform on floating point numbers. C. Time of calculation in STM32
• arm_cmplx_mult_cmplx_f32 - a function that performs
Apart from the tests aimed at determining the precision of
complex multiplication of matrices. the measurement of azimuth to the source of sound, a study
• arm_max_f32 - a function that finds the
was also performed to determine the duration of individual
maximum value in the array and its index. computational operations. This was done by programming the
III. RESULTS AND DISCUSSION was high when performing particular calculations. The high
was high when performing particular calculations. The high
The device was tested to determine the accuracy of azimuth state time was read with the RIGOL DS1054Z oscilloscope.
angle measurement to the source of sound.
The studies were carried out for various angles with constant
radius and for various radii with constant angle. • time of complex multiplications: 205 · 10−6 s
Additionally, an analysis of the calculation time of the presented • time of inverse FFT: 1.725 · 10−3 s
method was performed. • time to find the maximum correlation index: 112.5·10−6 s
• time to find the maximum correlation index: 112.5·10−6 s
A. Constant radius The obtained results allow to conclude that the calculation
The sound source was placed at a constant distance of 0.5 m speed of the STM32F4 microcotroller is enough to determine
from the middle of the line section connecting the points where the azimuth for the source of sound in real time. It is possible
the microphones were located, and ϕ was changed (Fig. 1). as well. However, this statement needs to be verified, taking
The angle ϕ was changed from the value of −90◦ to the into account, for example, the compromise between accuracy
value of 90◦ every 10◦ . 55 measurements were taken for each into account, for example, the compromise between accuracy
sound source angle. The results of the analysis of these data and speed of measurement and cost of the instrument.
TABLE I
CONSTANT RADIUS r = 0.5 M. ANGLE CHANGED EVERY 10◦

ϕ -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90
max -83.74 -77.93 -68.36 -57.52 -48.78 -38.95 -28.21 -19.13 -8.65 0.30 10.52 19.45 31.41 40.14 49.72 61.14 70.10 88.04 90.00
ϕ -89.12 -79.28 -69.07 -59.06 -49.43 -39.32 -28.69 -19.30 -9.10 0.22 10.39 19.31 30.71 39.51 49.43 59.88 68.98 80.66 87.19
min -90.00 -83.74 -71.02 -60.51 -52.17 -40.14 -29.27 -19.78 -9.58 0.00 9.90 18.80 29.62 38.95 48.32 58.09 66.75 75.27 81.36
Mo -90.00 -77.93 -68.36 -58.68 -49.25 -38.95 -28.56 -19.13 -9.27 0.30 10.52 19.45 30.69 39.34 49.72 59.89 69.21 79.51 88.04
ϕ̃ -90.00 -79.51 -69.21 -58.68 -49.25 -39.34 -28.56 -19.13 -9.12 0.30 10.52 19.45 30.69 39.34 49.72 59.89 69.21 79.51 88.04
S 1.98 1.67 0.71 0.77 0.62 0.34 0.26 0.23 0.20 0.13 0.20 0.24 0.39 0.27 0.40 0.58 0.93 3.33 2.50
ε -0.88 -0.72 -0.93 -0.94 -0.57 -0.68 -1.31 -0.70 -0.90 -0.22 -0.39 0.69 -0.71 0.49 0.57 0.12 1.02 -0.66 2.81
ε− -6.26 -2.07 -1.64 -2.48 -1.22 -1.05 -1.79 -0.87 -1.35 -0.30 -0.52 0.55 -1.41 -0.14 0.28 -1.14 -0.10 -8.04 0
ε+ 0 3.74 1.02 0.51 2.17 0.14 -0.73 -0.22 -0.42 0 0.1 1.2 0.38 1.05 1.68 1.91 3.25 4.73 8.64
Fig. 5. Visualization of data contained in Table I
TABLE II
CONSTANT ANGLE ϕ = 30◦ , RADIUS IN RANGE 2–40 CM
r[cm] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
max 0.92 7.40 14.30 15.58 19.45 27.17 26.83 26.83 28.21 28.21 29.27 30.33 31.05 32.50 32.13 33.23 33.97 32.86 30.69 30.33
ϕ 0.03 6.89 13.41 14.83 18.50 25.72 25.45 25.51 27.04 27.18 28.48 29.50 30.48 31.10 30.68 31.30 31.73 31.55 29.44 28.62
min -1.23 6.47 12.72 13.99 17.18 24.10 24.43 23.76 26.14 26.48 27.87 28.56 29.98 29.98 29.27 30.33 30.69 30.33 27.87 27.17
Mo 0.00 6.78 13.35 14.94 18.48 25.79 25.45 25.45 27.17 27.17 28.56 29.62 30.69 30.69 30.69 31.41 31.41 31.41 28.92 28.21
ϕ̃ 0.00 6.78 13.35 14.94 18.48 25.79 25.45 25.45 27.17 27.17 28.56 29.62 30.51 31.05 30.69 31.41 31.41 31.41 29.62 28.56
S 0.45 0.26 0.38 0.40 0.49 0.64 0.61 0.56 0.47 0.34 0.28 0.45 0.27 0.59 0.63 0.67 0.54 0.46 0.63 0.71
ε -29.97 -23.11 -16.59 -15.17 -11.50 -4.28 -4.55 -4.49 -2.96 -2.82 -1.52 -0.50 0.48 1.10 0.68 1.30 1.73 1.55 -0.56 -1.38
ε− -29.08 -22.60 -15.70 -14.42 -10.55 -2.83 -3.17 -3.17 -1.79 -1.79 -0.73 0.33 1.05 2.50 2.13 3.23 3.97 2.86 0.69 0.33
ε+ -31.23 -23.53 -17.28 -16.01 -12.82 -5.90 -5.57 -6.24 -3.86 -3.52 -2.13 -1.44 -0.02 -0.02 -0.73 0.33 0.69 0.33 -2.13 -2.83
Fig. 6. Visualization of data contained in Table II
IV. CONCLUSION REFERENCES
The results of the research allow us to draw the following [1] J. Baszun, “Passive sound source localization system,” Zeszyty Naukowe
conclusions: Politechniki Białostockiej. Informatyka, pp. 5–16, 2011.
[2] E. Kornatowski, “Przestrzenna identyfikacja kierunku źródła dźwi˛eku,”
• Measurement accuracy obtained by the device depends Logistyka, no. 6, 2010.
on the distance of the sound source. [3] H. Li, T. Yosiara, Q. Zhao, T. Watanabe, and J. Huang, “A spatial
sound localization system for mobile robots,” in Instrumentation and
• The accuracy of the measurement increases with the
Measurement Technology Conference Proceedings, 2007. IMTC 2007.
distance of the sound source (Fig. 1). IEEE. IEEE, 2007, pp. 1–6.
• The accuracy of the measurement depends on the angle [4] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, “Robust sound
source localization using a microphone array on a mobile robot,” in
between the sound source and the normal to the line Intelligent Robots and Systems, 2003.(IROS 2003). Proceedings. 2003
section connecting the points where the microphones was IEEE/RSJ International Conference on, vol. 2. IEEE, 2003, pp. 1228–
located. 1233.
[5] J. Xia, W. Li, J. Cao, and T. Li, “Sound localization system design
• The spread of measurement results increases significantly
based on the teaching application,” in Information Technology and Career
when the angle is greater than 40◦ . Education: Proceedings of the 2014 International Conference on Infor-
• When the angle is greater than 40 , the average measure-
◦ mation Technology and Career Education (ICITCE 2014), Hong Kong,
9-10 October 2014. CRC Press, 2015, p. 175.
ment value is close to the actual value, so by averaging [6] STMicroelectronics, “Reference manual - stm32f446xx advanced arm - R
more measurements and rejecting the extremes a reliable based 32-bit mcus,” access: 23 may 2018. [Online]. Available:
measurement result can be obtained. www.st.com/resource/en/reference_manual/dm00135183.pdf
[7] STMicroelectronics, “Programming stm32f3 series, stm32f4 series,
• The calculation capabilities of the STM32F4 microcon-
stm32l4 series and stm32l4+ series cortex -m4
R programming manual,”
troller are sufficient to measure the azimuth to the sound access: 23 may 2018. [Online]. Available: www.st.com/resource/en/
source using two microphones. programming_manual/dm00046982.pdf
[8] A. Limited, “Cmsis dsp manual,” access: 23 may 2018. [Online].
Available: http://www.keil.com/pack/doc/CMSIS/DSP/html/index.html

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Determination of the Vehicles Speed Using Acoustic Vector Sensor
J. Kotus
Faculty of Electronics, Telecommunications, and Informatics, Gdańsk University of Technology
Narutowicza 11/12, Gdańsk, PL80233, POLAND
joseph@multimed.org

Abstract— The method for determining the speed of vehicles using an acoustic vector sensor and the sound intensity measurement technique is presented in the paper. First, the theoretical basis of the proposed method is explained. Next, the details of the developed algorithm of sound intensity processing, both in the time domain and in the frequency domain, are described. The optimization process of the method is also presented. Finally, the proposed measurement method was tested in real conditions. The obtained results confirm that the proposed method may complement the currently used vehicle speed measurement techniques.

Keywords— sound intensity; speed of vehicles; acoustic sensor;

I. INTRODUCTION

The speed of vehicles can be determined in various ways. Active systems, whose operation involves the use of radar or laser technology, are widely used. Passive systems are also used, for example induction loops or systems based on the usage of video cameras and image analysis techniques. There are also solutions based on the analysis of acoustic signals. The accuracy of the available solutions depends on the technique used, the measurement conditions and the traffic volume. The typical measurement accuracy for radar systems is around 3%. Systems based on laser technology are more accurate; for these systems, the accuracy of the speed determination is up to 1%. This paper presents the application of the sound intensity measurement technique for the purpose of vehicle speed determination.

Based on the author's previous experience related to the work on the application of the acoustic modality for the detection and localization of sound sources, described in detail in earlier works [1-4], it is postulated to use this modality for monitoring the speed of vehicles in road traffic. The implementation of this functionality requires a precise determination of the location of the sound source. This task was accomplished by analyzing the sound intensity signal. Sound intensity is a measure of the flow of acoustic energy in a sound field. More precisely, the sound intensity I is a vector quantity defined as the time average of the flow of sound energy through a unit area in a direction perpendicular to the area. The intensity in a certain direction is the product of the sound pressure (scalar) p(t) and the particle velocity (vector) component in that direction u(t). The time-averaged intensity I in a single direction is given by eq. (1) [5], [6]:

    I = (1/T) ∫_T p(t) u(t) dt    (1)

where: T is the time period over which the integration is performed.

Determination of the sound intensity value according to formula (1) requires the use of specialized and expensive sensors (an acoustic probe of the p-u type) [7]. The acoustic velocity component appearing in formula (1) can be determined on the basis of the difference in acoustic pressure measured at two points of space (p1 and p2), spaced by the distance Δr. In this case, the particle velocity is obtained through Euler's relation (2) [5], [6]:

    û_r(t) = −(1/(ρΔr)) ∫_{−∞}^{t} (p2(τ) − p1(τ)) dτ    (2)

where: ρ is the density of the medium and τ is a dummy time variable.

Formula (2) underlies the operation of the p-p intensity probe [5], [6]. The above considerations are valid for one plane. In order to determine the sound intensity vector describing the direction of sound in space (Direction of Arrival, DOA), it is necessary to use 3 pairs of microphones, mutually perpendicular to each other [5], [6]. The vector of sound intensity in the Cartesian system can be written as:

    I = Ix·ex + Iy·ey + Iz·ez    (3)

where: Ix, Iy, Iz – intensity components for the x, y and z directions.

Research described in this paper was conducted by means of the 3D sound intensity probe designed and realized in the Multimedia Systems Department of Gdańsk University of Technology. More details about this probe can be found in papers [8], [9]. The developed sensor in various spatial orientations is shown in Fig. 1. The microphones forming the individual p-p pairs are clearly visible.

Figure 1. 3D sound intensity probe in different orientations (a-c), all microphones are visible [8].
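As an illustration of eqs. (1)-(2), the p-p intensity estimate can be sketched in a few lines (a minimal numerical sketch under idealized conditions; the function name, default air density and sampling rate are our assumptions, not values from the paper):

```python
import numpy as np

def pp_intensity(p1, p2, dr, rho=1.204, fs=48000.0):
    """Estimate the time-averaged sound intensity along the axis of a
    p-p microphone pair, following eqs. (1)-(2).

    p1, p2 : sampled pressure signals [Pa] at two points spaced by dr [m]
    rho    : density of the medium [kg/m^3] (here: air at about 20 C)
    fs     : sampling frequency [Sa/s]
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Pressure at the midpoint of the pair (finite-difference approximation)
    p = 0.5 * (p1 + p2)
    # Euler's relation, eq. (2): u_r(t) = -1/(rho*dr) * integral of (p2 - p1);
    # the cumulative sum divided by fs approximates the time integral
    u = -np.cumsum(p2 - p1) / (rho * dr * fs)
    # Eq. (1): time average of the product p(t) * u(t)
    return float(np.mean(p * u))
```

For a plane wave travelling from microphone 1 to microphone 2 the estimate approaches the textbook plane-wave intensity p_rms²/(ρc), which is a convenient sanity check for such an implementation.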

Project co-financed by the Polish National Centre for Research and Development (NCBR) from the European Regional Development Fund under the Operational Programme Innovative Economy No. POIR.04.01.04-00-0089/16 entitled: INZNAK – “Intelligent road signs”.

II. MEASUREMENT OF SPEED OF VEHICLE USING ACOUSTIC METHOD

Recordings were made over expressway No. 501 (Armii Krajowej, Gdańsk, Poland). The geographical coordinates of the place of measurement are: 54.336499N, 18.589742E. The meteorological conditions were determined on the basis of data available from the meteorological station located at a distance of 2 [km] from the measuring point [10]. The air temperature was 11 [°C], relative humidity: 70%, pressure: 1007.5 [hPa], average wind speed: 0.5 [m/s], in gusts up to 4 [m/s], wind direction: north. The device was placed on a viaduct (Cedrowa). The measuring sensor was mounted on a tripod at a distance of 0.3 [m] beyond the protective balustrade. It was located at a height of 9 [m] above the road surface. During the recording of acoustic signals, an additional AV recording was also made. A mobile device, model D5803, was used for this purpose. Parameters of the recorded movie: image size: 1920x1080 pixels, video codec: H.264, recording length: 366 [sec.], number of image frames: 21706, frame rate: 59 [fps]; an audio stream was present, the AAC codec was applied. The mobile device was hand-held. The recorded movie clip was used to obtain the ground truth (G.T.) data of the speed of every vehicle. Thanks to this, a detailed comparison of the speed of vehicles obtained from acoustical analysis with the ground truth data was possible.

A. Ground Truth Data Preparation

As was mentioned before, the speed of each vehicle was determined on the basis of the recorded movie clip. For the speed calculation, a measuring zone in the image was determined. The length of the measuring zone was determined on the basis of horizontal markings; the segregation lines delimiting lanes were analyzed for this purpose. At the location of the measuring station, the P-1a [11] type segmentation line is visible on the road surface. The length of the measuring zone was determined for two periods of the segmentation line. This gives a measuring distance of 24 [m]. The measuring path length was checked using the GIS tools available on the webpage [12]. The correctness of the length of the measuring zone was confirmed in that way. Then, the position of the vehicle in subsequent frames of the film was analyzed. The number of frames in which the vehicle was present in the measuring zone was determined. Based on the number of frames, the travel time for a given vehicle was determined. The obtained data allows for the speed calculation for each vehicle.

The measurement equipment used during the experiment (picture a)) and the measurement zone determined for the ground truth calculation are depicted in Fig. 2. The red circle shown in picture a) indicates the sound intensity probe placed in the windscreen. Picture b) shows one frame from the film recorded using the mobile AV device. The area labeled G.T.M. means Ground Truth Measurement and was marked using the blue arrow. The length of road used for vehicle speed determination on the basis of sound intensity calculation is shown as well. In this case the area was labeled S.I.M., which means Sound Intensity Measurements, and was marked using the red arrow. The length of the measurement zone used in this type of vehicle speed determination is also shown in Fig. 2 b).

Figure 2. Position of the sound intensity probe (a). Selected frame from the AV movie used during GT determination (b). The blue arrow (G.T.M., 24 [m]) indicates the length of the measuring zone for AV analysis. The red arrow (S.I.M., 11.3 [m]) presents the measurement zone used for sound intensity calculation.

As a result of the GT analysis, the speed of 182 vehicles was determined. The uncertainty of determining the travel time was estimated as ±1 picture frame. The average speed calculation accuracy for the adopted GT method is ±1.85%. The prepared data will be used to validate the results of vehicle speed determined on the basis of the sound intensity technique.

B. Determination of the Speed of Vehicles Based on Sound Intensity Measurement

The technique for determination of the speed of vehicles based on sound intensity measurement is presented in detail in this section. The block diagram of the proposed algorithm is shown in Fig. 3. The source of acoustic data is the multi-channel sound intensity probe. The sound intensity probe provides 4 signals: the acoustic pressure signal and three particle velocity signals determined in three mutually perpendicular directions. The acoustic signals are fed into the analogue-to-digital converter. A multi-channel sound card connected to a PC computer using a USB port was used for this purpose. The first step in the analysis was to calculate the Fourier transform for each signal. The FFT analysis parameters were as follows: 1024-sample analysis window length, 50% overlap, Hann window.

Figure 3. Block diagram of the algorithm for determining the speed of vehicles based on the sound intensity calculation: buffering of the acoustic signals (p, x, y, z), 4 × FFT, amplitude and phase correction, filtration in the frequency domain, 4 × iFFT, S.I. calculation in the time domain, vehicle detection, speed estimation.
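The frequency-domain filtration stage of the Fig. 3 chain (windowed FFT frames, zeroing of bins outside the selected band, inverse FFT) can be sketched for a single channel as follows (a simplified illustration; the function name and the overlap-add resynthesis details are our assumptions, and the 3000-6000 Hz band is one of the FR values listed later in Table I):

```python
import numpy as np

def bandlimit_stft(x, fs=48000.0, f_lo=3000.0, f_hi=6000.0, nfft=1024):
    """Band-limit one signal in the frequency domain: Hann-windowed FFT
    frames with 50% overlap, zeroing of bins outside [f_lo, f_hi],
    inverse FFT and overlap-add resynthesis."""
    hop = nfft // 2                                  # 50% overlap
    win = np.hanning(nfft)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    keep = (freqs >= f_lo) & (freqs <= f_hi)         # bins inside the FR band
    y = np.zeros(len(x))
    for start in range(0, len(x) - nfft + 1, hop):
        spec = np.fft.rfft(x[start:start + nfft] * win)
        spec[~keep] = 0.0                            # discard out-of-band bins
        y[start:start + nfft] += np.fft.irfft(spec, nfft)
    return y   # signal edges are attenuated by the window ramp-in/out
```

Feeding in a mixture of a 1 kHz and a 4 kHz tone, the output keeps the 4 kHz component essentially intact while the out-of-band 1 kHz component is suppressed, which mirrors the role of this block before the time-domain intensity calculation.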

The amplitude and phase correction of the individual signals is applied in the next step. Details of this process are presented in [8]. Next, filtration of the signals in the frequency domain is performed. The parameters of this process determine the frequency range for which the values of the sound intensity components will be computed in the further part of the analysis. The frequency range has been marked as FR (frequency range). The filtered signals are converted back into the time domain using the inverse fast Fourier transform. The phase-corrected and amplitude-filtered signals are used for calculation of the sound intensity components in the time domain. This process is performed for signal frames of a given length. The length of the sound intensity component determination window, designated nSIW (sound intensity window), was also a parameter to be optimized. To increase the time resolution of the analysis, the overlap and moving average parameters were introduced. The meaning of these parameters and the values considered are presented in the algorithm optimization section.

The components of the sound intensity vector, available in the Cartesian system, are converted into a spherical system. The position of the moving sound source is in this case described by the azimuth angle and the elevation angle. The elevation angle is determined in the plane running along the road. It can take values from −90° to 90°. Positive elevation angle values are characteristic for an object located in front of the sensor. Negative values are observed for a sound source located behind the sensor. The azimuth angle defines a plane perpendicular to the road. The range of possible azimuth angle values is 0°-360°. Taking into account the known spatial orientation of the sensor, it is possible to determine the range of azimuth angle values which corresponds to the position of sound sources along the road. Due to the fact that the sensor was placed between the lanes, the location of sound sources (passing cars) for the azimuth angle should be in the range from 70° to 110°. The value 90° corresponds to the position exactly between the lanes.

A graphic illustration of the meaning of the azimuth and elevation angles is shown in Fig. 4. Implementation of the above requirements for the azimuth angle values can be interpreted as spatial filtration of the sound intensity vector. Thanks to this, it is possible to eliminate the influence of other sound sources operating in the vicinity of the acoustic sensor which are not vehicles moving along the considered road. Vehicle speed determination using the sound intensity technique is based on the following assumptions:

• a moving vehicle is considered as a mobile source of acoustic energy,
• the sound source in the measuring zone moves at a constant speed,
• the change in the elevation angle value is proportional to the speed of the moving sound source,
• changing the azimuth angle allows to confirm the position of the sound source in the area of the road; thanks to this it is possible to carry out measurements even in the presence of external interference.

Figure 4. Scheme for determining the speed of the sound source by means of an acoustic sensor (red dot, azimuth and elevation angles) placed at height hS above the roadway; the vehicle moving with speed V passes the points Tin(εin) and Tout(εout) separated by the distance S. Black rectangles symbolize the silhouette of the vehicle (sound source).

The above assumptions underlie the operation of the block marked as vehicle detection. The azimuth angle, the elevation angle and the current frame number for which the sound intensity components are determined are fed into the input of this block. During the conducted experiments additional, detailed criteria were introduced to increase the accuracy of the performed analysis. They are presented in section III of the paper. For an approaching sound source the elevation angle changes from about 90° (vehicle far in front of the sensor) to 0° at the moment when the sound source is directly under the sensor. When the vehicle moves away, the values of the elevation angle are negative. Analysis of the changes of the elevation angle in time enables proper detection of the vehicle. The azimuth angle value is also taken into consideration for this purpose. For proper vehicle detection the azimuth angle must be between 70° and 110°. A falling edge is detected based on the difference between the previous and the current elevation angle (Δε), which can be written as:

    Δε = ε(n−1) − ε(n)    (3)

where: n is the current frame and n−1 the previous frame of the sound intensity used for azimuth and elevation calculation.

A positive difference Δε means that the sound source is approaching the sensor. The time measurement begins when the elevation angle is within the range (25°, 55°). The initial value of the elevation angle is captured as well (the parameters Tin and εin are determined, see Fig. 4 for details). The time measurement is finished when the elevation angle falls below 10° or when the indicator Δε becomes negative. The values Tout and εout are then determined.

The angle values (εin and εout) are used for calculation of the distance travelled by the moving sound object using the formula:

    S = hS · (tan(εin) − tan(εout))    (4)

where: hS is the height of the sensor above the road surface.

The time values Tin and Tout are expressed as frame numbers. For that reason the time needed for the speed calculation is determined using the formula:

    T = (Tout − Tin) · nOverlap / sf    (5)

where: nOverlap – overlap indicator for sound intensity frames (see section III, Table I), sf – sampling frequency; in the considered configuration it was equal to 48000 [Sa/s]. Finally, the vehicle speed, expressed in [km/h], is calculated using the formula:
    V = (S / T) / 1000 · 3600    (6)

A graphical illustration of the operation of the described algorithm is shown in Fig. 5. A movement of a single vehicle was chosen for this purpose. The blue line, labeled S.GT (Speed Ground Truth), shows the reference speed and the period of time during which the vehicle was passing the G.T.M. zone (see section II A for details). The other series were prepared based on the results of the presented method. The green line is the elevation angle, the black line is the azimuth angle. The black trapezium, labeled S.M.R. (Speed Measurement Region), indicates the time range used for the speed calculation. The final speed value determined by means of the presented algorithm is presented using the red line, labeled S.M. (Speed Measured). We can also notice that both speed values, S.GT and S.M., are the same. It means that in the presented case the speed of the vehicle calculated using the acoustical sound intensity technique was determined properly.

Figure 5. Vehicle speed determined using the sound intensity technique. Other parameters, like the azimuth and elevation angles, are also presented.

III. ALGORITHM OPTIMIZATION

The block diagram of the algorithm for determining the speed of moving vehicles presented in the previous chapter, which operates on the basis of the determined direction of arrival, has numerous signal processing blocks. Modification of the operating parameters of the individual blocks of the algorithm has a significant impact on the obtained vehicle speed results. For this reason, an optimization process of the algorithm was performed. The main aim of this process was to find the best set of parameters. Evaluation of the optimization process was conducted by means of the root mean square error (RMSE), given by the formula:

    RMSE = √[ (1/nv) · Σ_{i=1}^{nv} (S.G.T.(i,c.n.) − S.Meas.(i,c.n.))² ]    (7)

where: S.G.T.(i,c.n.) – ground truth speed of the i-th vehicle for a given configuration number, S.Meas.(i,c.n.) – measured speed of the i-th vehicle for a given configuration number, nv – number of vehicles detected by the algorithm for a given configuration number.

The following parameters of the algorithm were modified during the optimization: the frequency range (FR) of the analysis of the sound intensity components in the frequency domain, the length of the frame for which the components of the sound intensity vector in the time domain were determined (nSIW), the overlap value (nOverlap), determining the interval in samples between two adjacent frames (for example, for nSIW = 4800 samples and nOverlap = 4 the interval between adjacent frames was 4800/4 = 1200 [Sa]). The last modified parameter was the length of the moving average (MA). The numerical values of the individual parameters considered in the optimization process are summarized in Table I.

TABLE I. VALUES OF DESCRIBED PARAMETERS THAT WERE TAKEN INTO CONSIDERATION DURING ALGORITHM OPTIMIZATION.

Parameter    | Considered values
FR [Hz]      | 3000-5000; 3000-6000; 6000-8000; 6000-9000
nSIW [n]     | 2048; 4096; 4800; 5500; 6000; 6500; 7000; 7500; 8000; 8500; 9000; 9600
nOverlap [n] | 1; 2; 4; 10; 20
MA [n]       | 1; 2; 3; 5

A total of 144 different combinations of individual parameter values were checked during the optimization process. The RMSE value was determined for every combination of the values of the considered parameters. The distribution of the RMSE indicator for the individual configurations is presented in Fig. 6. The minimum RMSE value for the best configuration set (RMSE = 17.4) is marked in Fig. 6 as well. The values for the best set of considered parameters were also highlighted in Table I. It should be emphasized that the optimization process was performed for real traffic. The traffic volume was irregular. It means that different situations took place on the road. In some cases not only one vehicle was crossing the measuring space at the same time; sometimes two or more vehicles with different speeds were observed in the measurement zone (such situations are illustrated in the section on the analysis of the results, see Fig. 7 for details). In order to objectify and automate the process of determining the RMSE indicator, the situation described above (more than one vehicle in the measurement area at the same time) was excluded from the optimization process.

Figure 6. RMSE indicator for the considered configurations; the best configuration, RMSE = 17.4, is marked.
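Putting eqs. (4)-(6) together, the speed estimate for one detected vehicle can be sketched as follows (a simplified illustration; the function and parameter names are ours, and `frame_interval` stands for the interval in samples between adjacent sound-intensity frames, e.g. nSIW/nOverlap = 4800/4 = 1200 [Sa] in the example above):

```python
import math

def vehicle_speed_kmh(eps_in_deg, eps_out_deg, t_in, t_out,
                      h_s=9.0, frame_interval=1200, sf=48000.0):
    """Speed of a detected vehicle from the captured elevation angles and
    frame indices (eqs. (4)-(6)). The elevation angle is measured so that
    90 deg corresponds to a vehicle far in front of the sensor and 0 deg
    to a vehicle directly under it.

    eps_in_deg, eps_out_deg : elevation angles at the start and end of the
                              measurement window [deg]
    t_in, t_out             : sound-intensity frame indices (Tin, Tout)
    h_s                     : sensor height above the road surface [m]
    frame_interval          : samples between adjacent intensity frames
    sf                      : sampling frequency [Sa/s]
    """
    # Eq. (4): distance travelled between the two captured angles
    s = h_s * (math.tan(math.radians(eps_in_deg)) -
               math.tan(math.radians(eps_out_deg)))
    # Eq. (5): elapsed time reconstructed from the frame indices
    t = (t_out - t_in) * frame_interval / sf
    # Eq. (6): conversion from m/s to km/h
    return s / t / 1000.0 * 3600.0
```

For instance, with the sensor 9 [m] above the road, a vehicle whose elevation angle drops from 45° to 10° over 12 frames (1200-sample spacing at 48 kSa/s, i.e. 0.3 s) travels about 7.41 m, which corresponds to roughly 89 km/h.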
IV. RESULTS

In this section a detailed analysis of the results of the algorithm obtained for the optimal set of configuration parameters is presented. The length of the recording was 366 [sec.]. During this time, 182 vehicles were observed. The reference speed was determined for every vehicle (see section II A for details). 119 cases were detected using the optimized algorithm. For these situations, the speed value was determined. In the next step, the calculated speed values were compared with the reference data. An example of high traffic volume is presented in Fig. 7. For such a complex situation the algorithm detected only one vehicle, because only for one car the speed could be determined on the basis of the elevation angle characteristic. Thanks to this, the speed of one of the vehicles was determined correctly, despite the presence of other sound sources. The other cars can be counted (four minimum values on the elevation angle characteristic can be noticed) but the vehicle detection module (see Fig. 3) does not receive a proper set of input data described by the elevation angle changes in time.

Figure 7. Complex situation, more than one vehicle in the measurement area.

Operation results obtained by means of the algorithm under typical conditions are shown in Fig. 8 and Fig. 9. Rare traffic is illustrated in Fig. 8. The vehicles crossed the measurement zone one by one at an interval of 5 [sec]. The first vehicle moved at a speed close to 130 [km/h], the second vehicle had a speed of about 100 [km/h]. In both cases the acoustic method worked properly. The speed of the vehicles was determined correctly. The shape of the characteristic of the azimuth angle presented in Fig. 8 should also be explained. It is clearly visible that at the moment of vehicle detection the azimuth angle values are within the given angle range (from 70° to 110°). At other time intervals the sound intensity vector indicates other sound sources outside the measurement zone. In this way, the importance of spatial filtration in the effective elimination of false positive detections (detection of a vehicle despite its physical absence) has been demonstrated. No false positive error was detected for the entire time interval considered. This means that the developed method is highly resistant to interference from external sound sources.

Figure 8. Speed measurement results for rare traffic.

The constant high traffic flow and the results obtained by the algorithm for such conditions are depicted in Fig. 9. A lot of vehicles were crossing the measurement area at very small time intervals. In some cases vehicles moved side by side (two lanes in the same direction). Despite such a high density of traffic, the algorithm properly detected the vehicles and correctly determined their speed. It is important to emphasize that only selected events were processed, because only for these events the speed measurement conditions are valid. On the basis of the presented examples of various traffic volumes it can be concluded that the algorithm for determining the vehicle speed based on acoustic analysis correctly determines the speed for incidental vehicles as well as in the case of a large traffic flow.

Figure 9. Results obtained for constant high traffic flow.

In the next part of the work, calculations of the average vehicle speed for the whole considered time interval were carried out: V.G.T. – speed based on the Ground Truth, V.M.S. – speed obtained by the presented algorithm. The results of this kind of analysis are presented in Table II. Two scenarios of determining the average speed were considered. The first case, marked in the table as scenario A, was obtained for all vehicles that passed through the measuring point during the observation. In the second case, marked as B, only those vehicles from the reference data set which were detected by the considered algorithm were used for the average speed calculation. Additionally, for each of the scenarios, the lower (V.G.T.-) and upper (V.G.T.+) speed values resulting from the uncertainty of specifying the reference speed were determined.

TABLE II. AVERAGE SPEED VALUES FOR REFERENCE DATA AND CALCULATED BY THE ACOUSTIC METHOD FOR THE ENTIRE ANALYSIS

Scenario | V.G.T.- | V.G.T. | V.G.T.+ | V.M.S. | V.Error
A        | 92.0    | 93.7   | 95.5    | 93.5   | 0.2
B        | 93.4    | 95.1   | 97.0    | 93.5   | 1.7
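The two accuracy figures used in this evaluation, the per-configuration RMSE of eq. (7) and the relative error of the average speeds reported in the V.Error column of Table II, can be reproduced with a few lines (function names are ours; V.Error is assumed here to be the relative deviation of V.M.S. from V.G.T., which matches the tabulated values):

```python
import math

def rmse(gt_speeds, measured_speeds):
    """Eq. (7): root mean square error over the nv vehicles detected
    for a given configuration."""
    nv = len(gt_speeds)
    return math.sqrt(sum((g - m) ** 2
                         for g, m in zip(gt_speeds, measured_speeds)) / nv)

def v_error_percent(v_ms, v_gt):
    """Relative deviation [%] of the measured average speed (V.M.S.)
    from the reference average speed (V.G.T.)."""
    return abs(v_ms - v_gt) / v_gt * 100.0
```

Under this reading, scenario A gives |93.5 − 93.7|/93.7 ≈ 0.2% and scenario B gives |93.5 − 95.1|/95.1 ≈ 1.7%, in line with Table II.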

Taking the obtained results into consideration, it should be noted that for case A the error of determining the average speed was 0.2%. This means very high accuracy of the proposed method. In the case of scenario B the error was 1.7%. The average speed determined using the developed method was in both cases within the uncertainty range of the reference data. The difference in the measurement error value between scenarios A and B results from the fact that in the first case the algorithm "samples" the speed of vehicles from a larger set. This causes that the individual differences in determining the speed of individual vehicles are not relevant for the determined average. In the second case we deal with a reference set matched to the detected acoustic events on the basis of which the speed was calculated. The increase in the average error is due to the greater impact of errors in determining the speed of individual vehicles.

Based on the collected research material, an additional analysis was carried out, which involved determining the average vehicle traffic speed using a moving average. It was assumed that the average speed would be determined for 20 cars. This corresponded to an averaging time of about 60 [sec.].

The calculations were performed both for the reference data and for the results determined by means of the described algorithm. The obtained results are shown in Fig. 10. The black line shows the average speed for the reference data (indicated with the symbol S.G.T., data matched - scenario B from Table II). The uncertainty range for the reference speed determined using the optical method is marked in gray (S.G.T.Error, see chapter II A). The red line (S.Meas.) indicates the changes of the average vehicle speed obtained using the acoustic method presented in the paper.

It is clearly visible that the shapes of both characteristics are consistent. The red line is in most cases located within the uncertainty range of the reference data set. The conformity of both lines was assessed objectively using the Pearson correlation coefficient. The calculated value of the correlation coefficient was equal to 0.92.

Figure 10. Vehicle speed calculated using a moving average.

V. CONCLUSIONS

The vehicle speed measurement system based on sound intensity analysis was presented in the paper. The essential parameters of the algorithm and the procedure for their optimization have been described in detail. The operation of the developed algorithm was checked using recordings obtained in real acoustic conditions. The determined vehicle speed values were compared with the reference data determined from the video recording. Based on the obtained results, it was found that the developed algorithm correctly determines the speed of vehicles, both in rare traffic (passage of a single vehicle) as well as in the situation of a large traffic flow. Additional averaging techniques, such as the moving average, enable accurate estimation of the average vehicle speed at a given measuring point.

Further research will focus on checking the accuracy of the presented system in various atmospheric conditions, in particular during rainfall causing a wet road surface. It is also planned to extend the functionality of the system with the ability to determine the conditions prevailing on the road (surface condition), the ability to count vehicles and to analyze movements independently for different lanes. It should be emphasized that the applied measuring method is based on passive speed measurement; the measuring device does not emit any sounding signal. The developed method has a high potential for application, in particular in traffic monitoring systems, hence it can be a viable alternative to currently used radar systems.

REFERENCES
[1] J. Kotus, “Multiple Sound Sources Localization in Free Field Using Acoustic Vector Sensor,” Multimedia Tools and Applications, vol. 74, no. 12, pp. 4235–4251, 2015, DOI: 10.1007/s11042-013-1549-y.
[2] J. Kotus, “Application of passive acoustic radar to automatic localization, tracking and classification of sound sources,” Information Technologies, vol. 18, pp. 111–116, 2010.
[3] K. Łopatka, J. Kotus, A. Czyżewski, “Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations,” Multimedia Tools and Applications, pp. 1–33, 2015, DOI: 10.1007/s11042-015-3105-4.
[4] J. Kotus, K. Łopatka, A. Czyżewski, “Detection and localization of selected acoustic events in acoustic field for smart surveillance applications,” Multimedia Tools and Applications, vol. 68, pp. 5–21, 2014, DOI: 10.1007/s11042-012-1183-0.
[5] F. Jacobsen, “Sound Intensity and its Measurement and Applications,” Acoustic Technology, Department of Electrical Engineering, Technical University of Denmark, 2011.
[6] F. Fahy, Sound Intensity, E & F.N. Spon, 1995.
[7] H.-E. de Bree, “The Microflown: an acoustic particle velocity sensor,” Acoustics Australia, vol. 31, no. 3, pp. 91–94, 2003.
[8] J. Kotus, A. Czyżewski, B. Kostek, “3D Acoustic Field Intensity Probe Design and Measurements,” Archives of Acoustics, vol. 41, no. 4, pp. 701–711, 2016, DOI: 10.1515/aoa-2016-0067.
[9] G. Szwoch, J. Kotus, “Detection of the incoming sound direction employing MEMS microphones and the DSP,” Multimedia Communications, Services and Security (MCSS), no. 785, pp. 186–198, Kraków, Poland, 16–17.11.2017, DOI: 10.1007/978-3-319-69911-0_15.
[10] http://pogoda-gdansk.pl/ (access: 10.06.2018)
[11] http://www.bwir.org/oznakowanie-poziome/linie-segregacyjne/ (access: 10.06.2018)
[12] http://gdanski.e-mapa.net/ (access: 10.07.2018)
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.    September 19th-21st, 2018, Poznań, POLAND

An adaptive transmission algorithm
for an inertial motion capture system
in the aspect of energy saving
Michał Pielka1, Paweł Janik2, Małgorzata Aneta Janik3, Zygmunt Wróbel4


University of Silesia in Katowice
Faculty of Computer Science and Materials Science
Sosnowiec, Poland
Email: michal.pielka@us.edu.pl 1, pawel.janik@us.edu.pl 2, malgorzata.janik@us.edu.pl 3, zygmunt.wrobel@us.edu.pl 4

Abstract—The article presents a tested, integrated architecture of a single sensor network node with an algorithm that has allowed power consumption to be reduced without limiting the ability to process data from a multi-axis MEMS sensor. In the experiment described in the article, the energy efficiency of the sensor module has been increased by 64%. By using an SoC chip, the printed circuit board (PCB) of the sensor module has been reduced to 26 mm x 16 mm. In turn, the use of the Modem Sleep mechanism has allowed the developed sensor module to be effectively powered by a battery with 175 mAh capacity whose dimensions do not exceed the size of the PCB.

Keywords: motion capture, signal processing, data transmission, inertial MoCap, sensor power efficiency

I. INTRODUCTION

Inertial motion controllers with a radio interface enable mobile data acquisition. These controllers, in the form of wearable solutions, are increasingly used in medical, rehabilitation, sports or entertainment applications. A certain limitation of inertial MoCap systems is the size of the sensor modules and their energy demand related to radio transmission. Currently, motion capture (MoCap) systems are quite common. Depending on the requirements and applications, they offer different measurement precision and different data analysis possibilities. MoCap systems can be divided into three main groups: optoelectronic, mechanical-electronic and inertial, the latter using MEMS sensors. Optoelectronic systems include both high-precision solutions using many fast cameras, e.g. Smart DX from BTS Bioengineering [1], and simple, cheap solutions based on infrastructure developed for games, e.g. Kinect from Microsoft [2]. This group of motion capture systems concerns stationary solutions, due to the need to deploy cameras and define the scene they monitor. In this context, mechanical-electronic and inertial systems based on Inertial Measurement Unit (IMU) sensors can be treated as an alternative. Solutions that use radio transmission allow for motion capture in a natural environment, limited only by the range of the transmitters. This group of solutions includes mechanical-electronic systems based on exoskeletons, allowing for body monitoring like Gypsy from Animazoo, or hand monitoring like Dexmo from Dexta Robotics.

Inertial systems play an important role among wearable solutions. Movement mapping in these systems is carried out by rotating the virtual bones of a computer model. For this purpose, sensors are placed, for example, on the human body, allowing for measurements of rotation and, additionally, of acceleration or magnetic field intensity. The mapping of the entire human body movement is usually carried out using a dozen or so sensors [3]. The inertial MoCap group is dominated by two solutions: sensor costumes and independent modules. Both solutions form a sensor network, but in the first case sensors are connected by wire to one or several radio modules, whereas the second solution is based on independent nodes of the sensor network, where each node has its own radio interface. The effectiveness of inertial systems is mainly related to their energy demand for the transmission of large amounts of data. Therefore, it is important to constantly improve the algorithmic methods and new structures of information systems that will allow at least a partial reduction of the earlier technological limitations.

II. RELATED WORK

Movement monitoring using IMU sensors is used in many aspects. Sensor systems make it possible to reproduce the movement of the entire human body [4] as well as of smaller biomechanical systems such as the hand [5]. Processing of data from sensors requires a sufficiently efficient hardware platform. In this context, an architecture treated as a reference is widely used, in which a dedicated, high-performance microcontroller for data processing is applied, as in XSens products [6, 7]. There are also known energy-saving motion controllers with a reduced structure using miniature modules with subgigahertz transmission [8]. Such a structure allows for a significant reduction in energy consumption, but at the same time requires the construction of a separate hardware infrastructure that couples the motion controller with the virtual environment. In this context, the implementation of a motion controller based on the Wi-Fi technology makes it possible to use standard routers and, consequently, to reduce the costs of the entire hardware base. The use of a Wi-Fi module with a shared
microcontroller is presented in [9]. However, this system
relates to a textile solution with one radio module. The issue
of energy saving in the aforementioned publications is related
to the use of energy-efficient interfaces. The problem of an
algorithmic approach to the reduction of energy consumption
by a single sensor network node is often solved by adaptive
sampling [10] or data compression [11]. The so-called
episodic sampling [12], which reduces the energy demand of
the node, is also used.

III. CONTRIBUTION
The article presents an interface for motion capture with an integrated structure, working in the Wi-Fi standard in the 2.4 GHz band. Using the microcontroller built into the radio module for data processing has made it possible to reduce the energy consumption and the dimensions of a single sensor module (SM). Thanks to this concept, a much smaller sensor module for real-time systems has been created. The SM is managed by an adaptive algorithm that allows for lossless data transmission and reduces the energy consumed by the radio interface. The algorithm makes it possible to sample trajectory changes with high frequency. However, depending on the physical activity
high frequency. However, depending on the physical activity Figure 1. a) Block diagram of the sensor module, b) Sensor module looking
of the sensor, it regulates the frequency of sending frames and from the MEMS system side, c) Sensor module looking from the Wi-Fi
their size. This approach does not result in the loss of the module side
quality of the monitored movement and does not limit the As a result, the presented module (SM) is currently one of the
smoothness of the image playback in real time. During slow smallest independent sensor network nodes for motion capture
movements or under resting conditions, the radio interface using the Wi-Fi technology.
limits frame transmission, but individual frames contain larger
portions of data. As a result, with slow changes in the B. Measurements Setup
trajectory, refreshing data at a lower frequency does not
reduce the fluidity of the reconstructed movement in a virtual In order to present the transmission process of Wi-Fi
environment. frames, the current consumption of the module (SM) with the
ESP8266 chip was registered. Frame transmission is
associated with an impulse increase in current consumption,
IV. PROCEDURE whereas the pulse width is related to the amount of data
contained in the frame. The value of the current consumed by
A. Hardware Setup SM was measured by an indirect method and was estimated
Low-cost and widely available components were used to based on the voltage drop across the resistor connected to the
build the sensor network. The simplification of the module supply circuit. The signal was recorded using the National
construction is one of the conditions for reducing production Instruments USB-6361 measuring card and LabView
costs. For this reason, a system was designed and tested in software. Measurements of the current and battery discharge
which the ESP07 module and the LSM9DS1 sensor were used were carried out with a stationary SM, as shown in the block
[13]. The ESP07 module is based on a single ESP8266 chip diagram in Fig. 2a. During measurements the sensor module
[14], which simultaneously supports the radio as well as SM transmitted predefined frames of variable length and at
provides resources of a 32 bit microcontroller. The Arduino different frequencies. In turn, to test the adaptive algorithm,
environment with libraries based on the Espressif SDK was the measurement stand presented schematically in Fig. 2b was
used to create the ESP8266 software. The system was used. The change in the frequency of frames transmitted by
programmed via the USB / UART converter with the FT232R the sensor module is related to the rate of its rotation. The
chip. To enable communication of the microcontroller with the sensor module was placed on the platform, which allowed for
sensor, the SPI (Serial Peripheral Interface) was used (which the registration of the change in transmission frequency of the
allows to significantly reduce the time needed for data transmission frames depending on the speed of rotation. The
transmission between the sensor and the microcontroller in platform was driven by a stepper motor whose rotational speed
relation to the I2C). This structure eliminates the need to use was controlled by an 8-bit microcontroller μC. The integrated
an additional microcontroller for processing data from the LM298 driver was used to control the motor. To study the
sensor. The functional structure of a single sensor module communication between the computer (server) and the sensor
(SM) is shown in Fig. 1a. SM electronics was made in the module, the Linksys WRT120N router was used. In turn, the
form of a two-sided PCB circuit sized approx. 26 mm x time between the reception of subsequent frames was
16 mm (Fig. 1b and Fig. 1c). SM was powered by a 3.7 V Li- measured using software written for the server.
polymer battery with a capacity of 175 mAh.
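The inter-frame timing measured on the server side can be sketched in a few lines. The sketch below is a hypothetical illustration (the paper does not publish its server software): the UDP transport, port number and receive buffer size are assumptions, and the arithmetic is kept in a separate pure helper so it can be checked independently of any network traffic.

```python
import socket
import time

def interarrival_times(timestamps):
    """Time differences between consecutive frame receptions [s]."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def collect_frame_timestamps(port=5005, n_frames=100):
    """Timestamp n_frames incoming UDP datagrams on the server side.
    UDP and the port number are illustrative assumptions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    stamps = []
    for _ in range(n_frames):
        sock.recvfrom(2048)              # frame payload is ignored here
        stamps.append(time.monotonic())  # monotonic clock for intervals
    return stamps
```

For frames transmitted at the 4 Hz base frequency, the computed intervals should cluster around 0.25 s.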
Figure 2. Block diagram of the system: a) for measuring the current during transmission, b) for analysing the dependence of frame transmission frequency on the angular velocity.

V. METHODS

A. Adaptive Transmission Algorithm

The logic diagram showing how the sensor module SM works is presented in Fig. 3. During the operation of the sensor module program, data from the LSM9DS1 (Sensor MEMS) system are continuously collected at the sampling frequency of the gyroscope and accelerometer of 238 Hz. Data are collected from all three sensors, i.e. the accelerometer, gyroscope and magnetometer, and represented by the vectors a, ω and m, respectively, described by (1):

  a = [ax, ay, az]
  ω = [ωx, ωy, ωz]    (1)
  m = [mx, my, mz]

The coefficients ax, ay and az of the vector a correspond to the acceleration with respect to the x, y and z axes of the sensor. Similarly, ωx, ωy, ωz correspond to the rotational speed around the x, y and z axes, whereas mx, my, mz correspond to the magnetic field induction values. These coefficients are double-byte signed integers. All the data are stored in the FIFO queue (FIFO Frame Buffer).

In addition, a, ω and m are also the input vectors of the Sensor Fusion Algorithm (Fig. 3). The Madgwick algorithm was used in the tested system [15]. The output of the algorithm is the quaternion q described by (2), where θ is the rotation angle around the unit vector v̂, and the coefficients w, x, y, z of the quaternion q are single-precision floating-point numbers (4 bytes) that are placed in the FIFO queue of the frame buffer:

  q = [w, x, y, z] = [cos(θ/2), sin(θ/2) v̂]    (2)

Figure 3. Logical diagram of the sensor module operation.

All the coefficients of the vectors a, ω, m and the quaternion q associated with the same incident of data collection from the LSM9DS1 sensor form an inseparable block, 34 bytes long. The data vector ω collected from the sensor is used to control the transmission frequency. On its basis, the orientation of the object in space is determined, whereas the vectors a and m, containing data from the accelerometer and magnetometer, are used only to correct gyroscope errors in the sensor fusion algorithm. The coefficient d is calculated based on the vector ω according to (3):

  d = ωx² + ωy² + ωz²    (3)

By using the coefficient d to determine the transmission frequency, it is possible to take into account the rotational speed of the sensor around all three axes. The adaptive transmission algorithm (implemented in the Transmission Frequency Controller block – Fig. 3) makes it possible to control the frequency of frame transmission depending on the rotational speed of the monitored object. Transmission control is carried out in such a way that the visualization of rotation in real time is smooth (e.g. in a virtual environment). With a slower orientation change, the frame transmission frequency can be reduced, which saves energy. At the same time, along with the reduction of the transmission frequency, the length of the teletransmission frame, into which the data blocks placed in the FIFO queue are entered, is extended. If the amount of data in the FIFO queue is larger than the capacity of one frame, then two or more frames are sent immediately one after another via the Transmission Manager block (Fig. 3). After sending the frame, the FIFO buffer is emptied. Four exemplary transmission frequencies were determined in the algorithm for testing purposes: 4 Hz (base frequency), 10 Hz, 30 Hz and 60 Hz. 4 Hz is the lowest adopted transmission frequency, and the increase in frequency is related to the increase in the value of the coefficient d. In addition, three threshold values TH1, TH2, TH3 (positive integers) are also input data of the algorithm.

B. Transmission control

Once the coefficient d exceeds the TH1 value, the transmission frequency increases to 10 Hz; after exceeding the TH2 value, to 30 Hz; and after exceeding the TH3 value, to 60 Hz. With a sensor sampling frequency of 238 Hz, frames sent at 4 Hz, 10 Hz, 30 Hz and 60 Hz will contain successively twice 30, once 24, once 8 and once 4 blocks of data. In addition, each transmission frame contains a four-byte frame number and a byte defining the
number of data blocks. The double frame transmitted by the sensor module at 4 Hz was treated as a single frame. Modem Sleep
mechanisms made available by the ESP8266 chipset
manufacturer were also used in the transmission control
process. Owing to this mechanism, the Wi-Fi transceiver of
the ESP8266 chipset is turned off when the data are not
transmitted. The described mechanism is controlled by the
chipset manufacturer's software. The Beacon Interval
parameter of the router was set at 100 ms, whereas DTIM was
set at 1.
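The control flow of Sections V.A and V.B can be summarized in a short sketch. It is an illustration, not the authors' code: the threshold values used in the test are placeholders (the paper only states that TH1, TH2, TH3 are positive integers), and the byte order and field ordering of the 34-byte block are assumptions; the sampling rate, frame rates and block counts follow the text.

```python
import struct

SAMPLE_RATE_HZ = 238  # gyroscope/accelerometer sampling frequency

def rotation_coefficient(wx, wy, wz):
    """Coefficient d of Eq. (3), computed from the gyroscope vector."""
    return wx * wx + wy * wy + wz * wz

def select_tx_frequency(d, th1, th2, th3):
    """Threshold logic of Section V.B: map d onto a frame rate [Hz]."""
    if d > th3:
        return 60
    if d > th2:
        return 30
    if d > th1:
        return 10
    return 4  # base frequency (slow movement or rest)

def blocks_per_frame(tx_hz):
    """Data blocks accumulated per frame: 238/60 -> 4, 238/30 -> 8,
    238/10 -> 24; at 4 Hz the pending blocks are split into two
    frames of 30 blocks each, so a single frame carries 30."""
    return 30 if tx_hz == 4 else round(SAMPLE_RATE_HZ / tx_hz)

def pack_block(a, w, m, q):
    """One 34-byte data block: nine signed 16-bit readings (a, omega, m)
    followed by the four float32 quaternion coefficients (w, x, y, z)."""
    return struct.pack("<9h4f", *a, *w, *m, *q)
```

Each transmitted frame additionally carries a four-byte frame number and a one-byte block count, as stated above.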

VI. RESULTS
The adaptive algorithm allows the energy demand of the radio module to be reduced by controlling the frame transmission frequency and using the Modem Sleep mechanism. Fig. 4 presents the results of concurrent measurements: the angular velocity changes ωP of the rotating platform, on which the frame transmission frequency fT depends. As the angular velocity rises, the transmission frequency of variable-length frames increases.

The energy demand of the radio module with the ESP8266 chipset will be estimated based on the current consumption characteristics. The module normally takes about 70 mA, which is indicated by the yellow line in Fig. 5 as a constant component of the signal. In turn, the data transmission process is associated with an impulse increase in power consumption by the radio module.

The arrows in Fig. 5 show four areas with an interval of 5 s, which are related to the frequencies defined in the presented adaptive algorithm. The sensor module transmitted frames of various lengths at the four selected frequencies. The red line marks the RMS curve representing the current consumption by the sensor module depending on the frame transmission frequency. Reduced current consumption (below the constant component) can be observed in the 4 Hz and 10 Hz areas, which is related, inter alia, to the functioning of the Modem Sleep mechanism. In the 30 Hz and 60 Hz areas, the current consumption during transmission is at a similar level, comparable to the constant component.

Transmission of frames at such frequencies is justified when faster refreshing of the position is required, e.g. in fast sports movements. At these frame transmission frequencies, the Modem Sleep mechanism does not allow for a significant reduction in the current consumption below the constant component for the module.

The impulse nature of current consumption (the variable component of the recorded signal) is also influenced by the length of transmitted frames. Fig. 6 shows a comparison of the shape of four peaks representing the transmission of frames containing twice 30, once 24, once 8 and once 4 data blocks. The curves have been scaled with respect to their maxima so that they graphically show the differences in their half widths (4 Hz: 0.355 and 0.365; 10 Hz: 0.334; 30 Hz: 0.195; 60 Hz: 0.190).

The use of the presented transmission algorithm allows the working time of the battery-supplied sensor module to be extended by up to 64% (discharge time at 4 Hz in relation to the discharge time at 60 Hz). The results of the battery discharge measurements are presented in Fig. 7.

Figure 4. Dependence of frame transmission frequency on the rotational speed of the sensor module.

Figure 5. Characteristics of current consumption by the sensor module at different frame transmission frequencies.

Figure 6. Scaled current impulse curves representing frame transmission at different frequencies fT.
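The RMS characteristic and the indirect current measurement described above can be reproduced with a simple sketch; the sliding-window length and shunt resistance used in the test are illustrative assumptions, not parameters from the paper.

```python
import math

def current_from_shunt(u_drop, r_shunt):
    """Indirect current measurement: I = U_R / R (Ohm's law) for the
    voltage drop U_R registered across the shunt resistor."""
    return u_drop / r_shunt

def sliding_rms(samples, window):
    """RMS of the current signal over a sliding window, one value per
    window position; analogous to the RMS curve drawn in Fig. 5."""
    return [math.sqrt(sum(x * x for x in samples[i:i + window]) / window)
            for i in range(len(samples) - window + 1)]
```

The constant component then appears as the floor of the RMS curve, while transmission bursts raise it in proportion to frame length and rate.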
Figure 7. Discharge curves of the battery supplying the sensor module at different frame transmission frequencies.

Figure 8. Dependence of the battery discharge time tU on the frame transmission frequency fT.

The battery was discharged until the power was cut off by the battery protection circuit. Fig. 8 shows the dependence of the discharge time tU of the sensor module battery on the frequency fT of the frames it transmitted. The empirical data were fitted with an exponential function providing a coefficient of determination r² = 0.99998, which indicates that the variability of the battery discharge time is practically completely explained by changes in the frame transmission frequency. The nature of the changes can be described by the equation presented in Fig. 8, where fT is the frame transmission frequency. The first part of this equation is associated with the area of lower frequencies (below 15 Hz), the second with higher frequencies (above 15 Hz). It can be noticed that the biggest changes in the battery discharge time occur below 15 Hz. The directional coefficient (-0.33) in this area is 18 times larger than in the area above 15 Hz (-0.02). Thus, energy saving is associated with the reduced frame transmission frequency and the effectiveness of the Modem Sleep mechanism in the area of low transmission frequencies.

VII. CONCLUSIONS

By reducing the frame transmission frequency and using the Modem Sleep mechanism, the presented algorithm allows for significant energy saving (up to 64%). As the frequency decreases, the energy demand of the sensor module decreases, which is particularly evident in the area of lower frequencies. Reducing the frequency from 6 Hz to 4 Hz extends the battery life by approx. 30 minutes, whereas reducing the frequency from 60 Hz to 40 Hz allows for a time saving of only 2.5 minutes. In turn, the integrated architecture enables the dimensions of the module to be reduced, which is particularly important in applications designed for children, e.g. in rehabilitation or diagnostics. By using the SoC chipset (ESP8266), the size of the sensor module electronics is 26 mm x 16 mm. It is currently one of the smallest MoCap modules with a Wi-Fi interface. The proposed algorithm for an integrated sensor module may be particularly useful in rehabilitation systems with real-time data visualization, where slow movements of patients are monitored.

The reduction in the frequency of refreshing the change of a single MoCap sensor rotation, e.g. to several Hz, is carried out simultaneously with the change of the length of transmitted frames. As a result, the presented algorithm enables energy saving without losing measurement precision. In turn, if there is a need to monitor dynamic movements, e.g. in sport, the algorithm will adjust the frequency of refreshing the position to the user's requirements (in order to obtain fluidity of mapping in real time) at the expense of energy saving.

REFERENCES
[1] R. Kabaciński, M. Kowalski, "Preliminary study on accuracy of step length measurement for CIE Exoskeleton", 2016 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 577-581.
[2] B. Galna, G. Barry, D. Jackson, D. Mhiripiri, P. Olivier, L. Rochester, "Accuracy of the Microsoft Kinect sensor for measuring movement in people with Parkinson's disease", Gait & Posture, vol. 39, issue 4, 2014, pp. 1062-1068.
[3] M. Kuo, P.-Y. Chiang, C.-C. J. Kuo, "Coding of Motion Capture Data via Temporal-Domain Sampling and Spatial-Domain Vector Quantization Techniques", Advances in Multimedia Information Processing – PCM, 2010, pp. 84-99.
[4] L. Guo, S. Xiong, "Accuracy of Base of Support Using an Inertial Sensor Based Motion Capture System", Sensors, 2017, 17(9), 2091.
[5] T. Mańkowski, J. Tomczyński, P. Kaczmarek, "CIE-DataGlove, a Multi-IMU System for Hand Posture Tracking", International Conference Automation ICA 2017, Advances in Intelligent Systems and Computing, vol. 550, pp. 268-276.
[6] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. P. Seidel, B. Rosenhahn, "Multisensor-fusion for 3d full-body human motion capture", 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 663-670.
[7] Xsens Motion Technologies, https://www.xsens.com, accessed 30 May 2018.
[8] F. Höflinger, J. Müller, M. Törk, L. M. Reindl, W. Burgard, "A wireless micro inertial measurement unit (IMU)", 2012 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pp. 2578-2583.
[9] A. Szczęsna, P. Skurowski, E. Lach, P. Pruszowski, D. Pęszor, M. Paszkuta, J. Słupik, K. Lebek, M. Janiak, A. Polański, K. Wojciechowski, "Inertial Motion Capture Costume Design Study", Sensors, 2017, 17, 612.
[10] Y. E. M. Hamouda, Ch. Phillips, "Metadata-Based Adaptive Sampling for Energy-Efficient Collaborative Target Tracking in Wireless Sensor Networks", 2010 IEEE 10th International Conference on Computer and Information Technology (CIT), pp. 313-320.
[11] S. J. Baek, G. de Veciana, X. Su, "Minimizing Energy Consumption in Large-Scale Sensor Networks Through Distributed Data Compression and Hierarchical Aggregation", IEEE Journal on Selected Areas in Communications, vol. 22, issue 6, 2004, pp. 1130-1140.
[12] L. K. Au, M. A. Batalin, T. Stathopoulos, A. A. Bui, W. J. Kaiser, "Episodic sampling: towards energy-efficient patient monitoring with wearable sensors", Engineering in Medicine and Biology Society (EMBC 2009), Annual International Conference of the IEEE, pp. 6901-6905.
[13] STMicroelectronics, LSM9DS1, Datasheet – production data, DocID025715 Rev. 3, 2015, http://www.st.com/, accessed 30 May 2018.
[14] Espressif Systems, ESP8266EX, Datasheet, Version 5.8, 2018, https://www.espressif.com, accessed 30 May 2018.
[15] S. Madgwick, "An efficient orientation filter for inertial and inertial/magnetic sensor arrays", Technical Report, x-io and University of Bristol, Bristol, UK, 30 April 2010.

A critique of some rough approximations of the DCT

Marek Parfieniuk
Faculty of Computer Science
Bialystok University of Technology
Wiejska 45a, 15-351 Bialystok, Poland
Email: m.parfieniuk@pb.edu.pl

Sang Yoon Park
Department of Electronic Engineering
Myongji University
Yongin 17058, Korea
Email: sypark@mju.ac.kr

Abstract—Recently, rough approximations of the Discrete Cosine Transform (DCT) have been proposed that can be implemented as multiplier-less, low-area, and low-power circuits. Promoters of such algorithms have considered simpler and simpler data-flow graphs, using fewer and fewer additions and bit-shifts compared to the finer approximations developed at the turn of the 20th and 21st centuries. However, they neglected to carefully check whether an approximation works like the original, and, from another point of view, they ignored well-known essential results of the theory and practice of image transforms. This paper shows that one of such solutions is not as perfect as advertised, and even seems to be useless, suffering from the inherent disadvantages of non-selective filters and non-smooth basis functions. We point out what is lacking in the published evaluations of the algorithm and analyse its properties, demonstrating that it behaves differently from the DCT and thus is suitable for neither image compression nor pattern recognition. In particular, we show that it poorly decorrelates samples of natural images, and unpleasant in-block artefacts appear in decoded pictures.

I. INTRODUCTION

The 8-point type-II Discrete Cosine Transform (DCT) is popular in image processing because of its functional and computational advantages. It allows for decomposing an image into subbands, which are related to different ranges of frequencies, or to various levels of detail. As the DCT determines subband filters whose number and characteristics well match the properties of natural images, it usually packs most of the energy, or information, carried by pixels into a small number of transform coefficients of large magnitudes. On the other hand, the DCT can be computed efficiently, using little auxiliary memory and few operations. For the purpose of data compression, subbands should be critically sampled, so separate blocks of 8 × 8 pixels can be processed, one block after another, using local memory. The transform of a block can be separated into two subsequent transforms of rows and columns, and the DCT of a vector can be computed using a fast algorithm.

A number of fast algorithms for computing the DCT have been developed over the years [1], the most successful of which are based on factorizing the DCT matrix into matrices that describe transforms of 2-element vectors. Arithmetic operations can be saved mainly because it is possible to use only a dozen or so subtransforms, and most of them are the rotation by π/4, which is related to merely one addition and subtraction. Only several rotations must be computed by using multiplications.

Formerly, fast algorithms were developed with the aim of minimizing the number of non-trivial rotations, by assuming that the DCT has to be computed exactly, and by allowing the use of floating-point multipliers [2]. The second generation of factorizations was aimed at finite-precision arithmetic and at replacing multipliers with bit-shifts and additions of integer numbers. Distributed arithmetic (DA) [3], the coordinate rotation digital computer (CORDIC) [4], lifting [5], and subexpression sharing [6] have been considered as means for computing multiplication-related subtransforms, in the domain of integers but accurately. However, the aforementioned techniques allow for omitting operations only at the price of approximating rotations. Therefore, finally, many authors proposed to get rid of exactly computing the DCT, and to look for its approximations characterized by trade-offs between accuracy and computational complexity.

The idea is reasonable and has allowed for developing nice multiplier-less transforms that approximate the original well, requiring few additions [5], [7]. However, recently, some authors have gone to an extreme: focusing attention only on computational efficiency, they have simplified data flows but neglected to carefully check whether the resulting solutions have properties similar to those of the DCT and are useful in the light of the theory and practice of image transforms and filter banks. Evaluations of whether a transform approximates the DCT, or even whether it is generally useful at all, have been oversimplified. Measures of matrix/image similarity and of decorrelation performance were computed mechanically, and the obtained values were accepted automatically, whatever they were, without carefully reviewing whether an approximation is meaningful and useful.

This dangerous trend seems to be overlooked by the community of researchers related to image compression, transforms, and filter banks. Questionable or even faulty results have been published by renowned journals [8]–[12], but no comments have been made. So, our paper seems to be the first attempt to signal the problem and to explain its essence.

We do this by reviewing two algorithms, which have been introduced in [9] and [8] and are advertised as especially successful, even as breakthroughs, therein and in several
subsequent publications by the same authors [10], [11], [13]. We show that one of the transforms under consideration has properties so different from those of the DCT that there is no reason to call it an approximation. What is even worse, the solution of [8] seems to be useless in both image compression and pattern recognition. Lacking frequency selectivity, it decomposes natural images into subbands differently from the original, while reconstructed images suffer from specific artefacts. As to the second transform, we show that it is indeed similar to the DCT, but it undoubtedly does not perform as well in image compression as more accurate approximations do, like the transform of the MPEG-4 AVC standard for video compression. Additions have been saved, but at a considerable cost: pixels are decorrelated less successfully, and non-smooth images are reconstructed from quantized transform coefficients.

We present the above-mentioned facts in a pragmatic, empirical way, neglecting mathematical rigour, so as to make our paper accessible to a wide audience, even to persons knowing only the essentials of DSP. Nevertheless, we refer to well-known essential concepts and results related to filters, transforms, image coding, and approximation in general. Namely, we study the characteristics of the filter banks related to the rough approximations of the DCT, and we review the properties of the corresponding basis functions. The discussion of one-dimensional transforms is supplemented with results of simple, but telling, experiments on images.

No similar study can be found in works on the rough approximations of the DCT, published neither by their inventors nor by others. This is surprising, as we merely follow some standards of experimental research on transforms and filter banks, and our point of view should be obvious to specialists in these domains. Our work can rather be instructive to persons more specialized in hardware implementations, or interested in applications of the DCT.

II. ROUGH APPROXIMATIONS OF THE DCT

The 8-point type-II DCT can be identified with multiplying a vector of 8 image samples by the following matrix

in memory. If an algorithm is used in image compression, then the output scaling can be combined with quantization. Therefore, the computational load of a fast algorithm is essentially determined by the numbers of non-trivial and trivial plane rotations, which require multipliers and adders, or only adders, respectively.

In older algorithms for efficiently computing the DCT, both these numbers were minimized, especially the former, and non-trivial subtransforms were implemented so as to exchange even more multiplications for additions/subtractions. Nevertheless, it was assumed that DCT coefficients are computed accurately, up to scaling. Then, solutions have appeared in which non-trivial transformations are approximated using additions and bit-shifts. Obviously, this requires accepting that DCT coefficients will be computed inaccurately, but researchers cared about keeping errors low.

Recently, algorithms have been developed whose authors were focused on maximizing computational efficiency, accepting that the DCT will be approximated roughly. In particular, in [9], one can find the transform described by the eye-pleasing matrix

C̃18+ = diag(1/√8, 1/√6, 1/√4, 1/√6, 1/√8, 1/√6, 1/√4, 1/√6) ×

    [ 1  1  1  1  1  1  1  1
      1  1  1  0  0 −1 −1 −1
      1  0  0 −1 −1  0  0  1
      1  0 −1 −1  1  1  0 −1    (2)
      1 −1 −1  1  1 −1 −1  1
      1 −1  0  1 −1  0  1 −1
      0 −1  1  0  0  1 −1  0
      0 −1  1 −1  1 −1  1  0 ]

while in [8], the transform is considered whose matrix is even sparser

C̃14+ = diag(1/√8, 1/√2, 1/√4, 1/√2, 1/√8, 1/√2, 1/√4, 1/√2) ×

    [ 1  1  1  1  1  1  1  1
      0  1  0  0  0  0 −1  0
      1  0  0 −1 −1  0  0  1
      1  0  0  0  0  0  0 −1    (3)
      1 −1 −1  1  1 −1 −1  1
      0  0  0  1 −1  0  0  0
      0 −1  1  0  0  1 −1  0
  0 0 1 0 0 −1 0 0
0.3536 0.3536 0.3536 0.3536 0.3536 0.3536 0.3536 0.3536
 0.4904 0.4157 0.2778 0.0975 −0.0975 −0.2778 −0.4157 −0.4904  Both matrices can be factorized into only a dozen or so
 0.4619 0.1913 −0.1913 −0.4619 −0.4619 −0.1913 0.1913 0.4619  rotations, see eg. [8]. so that, respectively, only 14 and 18 ad-
 0.4157 −0.0975 −0.4904 −0.2778 0.2778 0.4904 0.0975 −0.4157 
C=
 0.3536 −0.3536 −0.3536 0.3536 0.3536 −0.3536 −0.3536 0.3536
 (1)
 ditions/subtractions are necessary to compute the transforms.
 0.2778 −0.4904 0.0975 0.4157 −0.4157 −0.0975 0.4904 −0.2778  For (3), the corresponding data flow graph is shown in Fig. 1.
 
0.1913 −0.4619 0.4619 −0.1913 −0.1913 0.4619 −0.4619 0.1913 About ten similar transforms are known [8], [10], [12],
0.0975 −0.2778 0.4157 −0.4904 0.4904 −0.4157 0.2778 −0.0975
[14], [15], whose common property is that their matrices
However, it is not a good idea to compute the transform in this contain less or more zeros and can be factorized into 2-
way, as the straightforward matrix-by-vector multiplication is point transforms that can be implemented only additions and
related to as many as 64 multiplications and 56 additions. bit shifts. Roughly speaking, more zeros means that fewer
Much less demanding algorithms can be developed by fac- subtransforms, or fewer additions, are necessary to compute a
torizing C into a product of a permutation matrix, diagonal transform.
matrix of 8 scaling coefficients, and matrices that describe Authors of such solutions clearly assumed that a transform,
2D rotations, reflections, or shears, see eg. [7] and references less or more but always, approximates the DCT, provided that
therein. non-zero entries of its matrix conform the pattern (signs and
It costs almost nothing to permute results of arithmetic magnitudes) of (1) [10], [14]. In the following we show that
operations, as this consists only in particularly placing values this assumption is incorrect: even when a matrix seems to

Fig. 1. Data flow graph for computing the transform determined by (3). (The graph maps the inputs x0, ..., x7 to the outputs X0, ..., X7 using only additions/subtractions and output scalings by 1/√8, 1/√4, and 1/√2, so that X = C̃14+ x.)

be similar to the original, zero entries cause the transform to behave differently from the DCT.

III. ROUGH APPROXIMATIONS AS FILTER BANKS

A linear transform like the DCT is equivalent to passing a series of samples through a filter bank [16], [17]. The filters are of the FIR type, and their impulse responses, or coefficients, are defined by the rows of the transform matrix. The filtering is aimed at splitting a signal into low- to high-frequency components, or into a rough waveform and a series of finer and finer details.

The DCT determines the filters whose magnitude responses are shown in Fig. 2a (1st column). The filters are not very selective, but evidently 8 channels can be distinguished: one low-pass, one high-pass, and 6 band-pass ones. It is advantageous that higher-frequency filters do not capture much lower-frequency content of input signals. In the frequency range of 0 ... (1/8)ωs, which ideally should be captured by only H0(ω) and H1(ω), the responses H4(ω), ..., H7(ω) have side-lobes at -20 dB or below, with respect to their maximum magnitudes. The transition bands are wide, and the responses overlap much, which means that the forward transform produces subband signals with considerable aliasing-related distortions. The distortions are cancelled when the subbands are combined by the inverse transform, but this fails more or less when subband samples are quantized. Fortunately, the aliasing cancellation usually works satisfactorily, so that distortions in reconstructed images are acceptable. This is because only adjacent responses overlap much: aliasing distortions in low-frequency subbands do not depend much on those in high-frequency subbands.

Fig. 2b shows the magnitude responses of the filters related to (2), the 18-addition approximation of the DCT. It seems justified to say that this transform is similar to the original, as the plots generally conform to those in Fig. 2a. However, except for H0(ω) and H4(ω), the responses are evidently worse, or much less selective, than the original ones. H1(ω) and H2(ω) capture more high-frequency content of an input signal, while H5(ω), ..., H7(ω) capture a lot of low-frequency content. A related disadvantage is that the responses overlap more than those of the DCT, so the aliasing cancellation fails for slighter quantization, or the related distortions are more severe.

Fig. 2c clearly shows that the 14-addition transform determined by (3) has little in common with the DCT. The 0th and 4th filters are the same as for the DCT. The 2nd and 6th responses are the same as for the 18-adder approximation, worse than the desirable ones but similar. The remaining responses are of the comb type, with all lobes at the same level, capturing low- as well as high-frequency content. The 5th filter is high-pass even though the corresponding filter of the DCT is band-pass. Its selectivity is extremely poor, as it is the simplest half-band filter. Generally, the responses overlap much with that of the low-pass filter, even though this is undesirable.

As a result, for the 14-addition approximation, most of the transform coefficients cannot be associated with specific frequency bands, or with rough shapes and details of images. This is against the essential principles of transform-based image coding and pattern recognition. Additionally, as to coding, failures in aliasing cancellation are more probable, and the entire frequency range is affected by quantization of transform coefficients that are nominally higher-frequency.

No similar straightforward study of the magnitude responses can be found in the existing works on the rough approximations of the DCT. Their authors consider differences between the original and approximate characteristics, and show some equations, plots, and indicators, but this has been done in such a way as not to exhibit the disadvantages of the proposed transforms. They claim that the 14-addition algorithm is a meaningful approximation of the DCT, whereas we can reasonably conclude that it is oversimplified and useless.

IV. DISTRIBUTION OF IMAGE ENERGY AMONG SUBBANDS

The differences between the DCT and its rough approximations are more evident in results of image processing. By computing the 1D transform on the rows of a block of 8×8 pixels and then on the columns of the resulting matrix, we obtain the 8×8 matrix of coefficients of the 2D DCT. The corresponding coefficients of all blocks form a subband image. By adding their squares, the energy of this subband can be computed.

Subband energies can be illustrated using grayscale levels as shown in Figure 3, where higher-energy subbands are represented as brighter fields. It is clear that the DCT works differently from the rough approximations, especially from the 14-addition algorithm. The original packs image energy into low-frequency subbands more successfully, as for the oversimplified transforms, considerable energy is spread among high-frequency subbands. Moreover, the approximations distribute image energy much less gradually, or even non-monotonically, along frequency, which can be identified with the subband index.

In Figure 4, representative subband energies are illustrated using bars, arranged in accordance with the zig-zag order. This order is used in the JPEG and MPEG standards to rank transform coefficients from most to least significant. The differences between the DCT and the rough approximations are even more evident.

It is clear that the zig-zag order is completely unsuitable to the worse approximation, as it does not take into consideration

Fig. 2. Magnitude responses of filter banks corresponding to DCT and its oversimplified approximations. (Plots of 20 log10 |Hk(ω)| in dB for k = 0, ..., 7, versus normalized frequency ω/ωs from 0 to 0.5, in three columns: (a) DCT, (b) 18+ approximation, (c) 14+ approximation.)

that some higher-frequency subbands, especially the (7,0)th and (0,7)th ones, are usually characterized by energies similar to those of lower-frequency subbands, and thus are of similar importance for the quality of reconstructed images. From another point of view, the zig-zag order does not suit cases when the subband energy does not necessarily decrease with increasing frequency.

So, it was another unacceptable mistake by the authors of the 14-addition approximation to mechanically use the zig-zag order to evaluate how well their algorithm performs in image compression. They should have studied the distributions of subband energies and should have reordered the filters or arranged the transform coefficients in a custom order.

For comparison purposes, Fig. 4b shows a distribution of subband energies that result from the transform used in the H.264 (MPEG4 AVC) standard for video coding. More additions are necessary to compute this transform, but it approximates the DCT incomparably better than the rough algorithms. It is difficult to notice a difference between its plot and that of the original transform.

V. ARTEFACTS IN IMAGES DECODED USING ROUGH APPROXIMATIONS

From another point of view, the rows of a transform matrix determine discrete basis functions [16] that are used to represent a signal after the transform. The inverse transform approximates the signal using linear combinations of these functions. One of the main results of approximation theory and signal processing is that smooth bases are usually preferred [18]. Especially, it is well known that such bases are superior in image processing, as images are generally smooth, so that edges occur occasionally.

The DCT provides smooth basis functions, as the matrix rows in (1) contain values that increase/decrease gradually. The matrices in (2) and (3) have rows with jump-type changes of values. Except for the 0th and 4th ones, the basis functions are non-smooth, because they have resulted from too radical rounding of the values of (1). Especially in (3), more than half of the basis functions are impulsive because of series of zeros.

Figure 5a shows what happens if a smooth series of pixels is reconstructed using the non-smooth basis functions related to the rough approximations of the DCT. Oscillation- or step-like artefacts appear, which form artificial impulses or edges in smooth regions of an image, as demonstrated in Fig. 6. The reconstructed versions of the well-known "Barbara" image are unacceptable for both the 14- and 18-addition transforms, even though 10 of 64 (in the zig-zag order) subbands have been used, which corresponds to rather slight compression.

For the former transform, one could expect the poor quality, as the transform coefficients have been taken in an inappropriate order, so that some of the significant ones have been omitted. The non-smoothness of the basis functions only intensifies the distortions. As to the 18-addition transform, it is rather surprising that the distortions are so severe compared to those in the image reconstructed using the DCT. Even though the approximation seems to decorrelate data similarly to the original, it is unsuitable for decoding, because of the non-smooth basis functions. The distortions related to the non-smoothness are especially annoying because they occur in important regions, the face and hand, and can be easily noticed at the background of
gradual changes of intensity. In the image reconstructed using the DCT, distortions are less noticeable as they occur at the background of edges and patterns.

Fig. 5. Artefacts in 8-pixel series reconstructed from 5 of 8 transform coefficients. (Pixel-value plots of the test series and its reconstructions: Original, DCT, 18+ Approx., 14+ Approx.)

Fig. 3. Test images and energies of subbands that result from DCT and its approximations. (Grayscale maps of subband energy versus horizontal and vertical subband index, for the DCT, the 18+ approximation, and the 14+ approximation.)

Fig. 4. Energies of subbands that result from DCT and its approximations for the "Barbara" image. (Bars of subband energy versus coefficient/subband index in zig-zag order: (a) DCT, (b) H.264/MPEG4 AVC, (c) 18+ approximation, (d) 14+ approximation.)

The rough approximations evidently suffer from something like blocking, the main problem with DCT-based compression. But they can cause blocking inside 8×8 blocks, not at block boundaries, as in the case of the DCT. Therefore, it seems impossible to use a deblocking filter to mitigate the transform-related distortions, as done in the MPEG standards for video compression.

Moreover, the rough approximations do not reconstruct edges well, even though one could expect the opposite from non-smooth basis functions. As can be seen in Fig. 5(b) and (c), they tend to introduce oscillations and steps, or artificial edges, aside true ones. The DCT is known to reconstruct edges poorly, as it blurs them, but this effect is more acceptable than those of the approximations.

The artefacts specific to the rough approximations of the DCT are undoubtedly troublesome and deserve to be studied. However, we have tried in vain to find anything about their nature, significance, or even existence in the publications that promote these algorithms. Instead, e.g. in [13], decoded images are shown downsized, so it is difficult to notice the distortions. Errors have been measured using the MSE (Mean-Square Error), PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), etc., but the obtained values have been accepted whatever they are, without careful justification. In particular, the authors neglected low values of the UQI (Universal Quality Index), which indicate problems with quality.

VI. CONCLUSION

Most of the rough approximations of the DCT undoubtedly need to be reexamined, so as to carefully and objectively evaluate whether saving additions is worth facing problems with odd filters, a specific order of transform coefficients, and in-block artefacts. Instead, top journals have published a noticeable series of articles that unquestionably suffer from flaws related

Fig. 6. Artefacts in versions of the Barbara image reconstructed from 10 of 64 (8 × 8 in zig-zag order) transform coefficients per 8 × 8 block. (Panels: Original, DCT, 18+ Approx., 14+ Approx. Please observe the images on a screen, zoomed so as to have 1:1 correspondence between display dot and image sample.)

to algorithm evaluation and present as a breakthrough a transform that is useless in image coding. What is even worse, these publications have constituted a seemingly notable and well-grounded discipline of research and seem to present the state of the art in transforms.

It is a pity that something like this has happened after several decades of real progress in transforms and filter banks, which resulted in brilliant solutions like the DCT. There is a real risk that many scientists and engineers can be misled. It is very probable that conceptual and experimental results will be published that are faulty, as based on rough approximations of the DCT. On the other hand, good ideas and solutions could be criticized falsely, as a result of testing them with a rough approximation.

ACKNOWLEDGMENT

Marek Parfieniuk was supported by Bialystok University of Technology under Grant S/WI/3/2018. Sang Yoon Park was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (grant number: 2016R1D1A1B03933315).

REFERENCES

[1] V. Britanak, P. C. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. Amsterdam: Elsevier/Academic Press, 2007.
[2] Y. Arai, T. Agui, and M. Nakajima, "A fast DCT-SQ scheme for images," IEICE Transactions, vol. E-71, no. 11, pp. 1095–1097, Nov. 1988.
[3] Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 4, pp. 709–714, Apr. 2011.
[4] T.-T. Hoang, D.-H. Le, and C.-K. Pham, "Minimum adder-delay architecture of 8/16/32-point DCT based on fixed-rotation adaptive CORDIC," IEICE Electronics Express, vol. 10, no. 12, pp. 1–12, May 2018.
[5] J. Liang and T. D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Process., vol. 49, no. 12, pp. 3032–3044, Dec. 2001.
[6] M. Jridi, A. Alfalou, and P. Meher, "Optimized architecture using a novel subexpression elimination on Loeffler algorithm for DCT-based image compression," VLSI Design (Special Issue on VLSI Circuits, Systems, and Architectures for Advanced Image and Video Compression Standards), vol. 2012, p. 12, 2012, article no. 209208.
[7] M. Parfieniuk, M. Vashkevich, and A. Petrovsky, "Short-critical-path and structurally orthogonal scaled CORDIC-based approximations of the eight-point Discrete Cosine Transform," Circuits, Devices & Systems, IET, vol. 7, no. 3, pp. 150–158, 2013.
[8] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," Electronics Letters, vol. 48, no. 15, pp. 919–921, Jul. 2012.
[9] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," IEEE Signal Processing Letters, vol. 18, no. 10, pp. 579–582, Oct. 2011.
[10] R. Cintra, F. Bayer, and C. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," Signal Processing, vol. 99, pp. 201–214, 2014.
[11] A. Madanayake, R. J. Cintra, V. Dimitrov, F. Bayer, K. A. Wahid, S. Kulasekera, A. Edirisuriya, U. Potluri, S. Madishetty, and N. Rajapaksha, "Low-power VLSI architectures for DCT/DWT: Precision vs approximation for HD video, biomedical, and smart antenna applications," IEEE Circuits and Systems Magazine, vol. 15, no. 1, pp. 25–47, First Quarter 2015.
[12] R. J. Cintra, F. M. Bayer, V. A. Coutinho, S. Kulasekera, A. Madanayake, and A. Leite, "Energy-efficient 8-point DCT approximations: Theory and hardware architectures," Circuits, Systems, and Signal Processing, vol. 35, no. 11, pp. 4009–4029, Nov. 2016.
[13] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, "Improved 8-point approximate DCT for image and video compression requiring only 14 additions," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 6, pp. 1727–1740, Jun. 2014.
[14] S. Bouguezel, M. Ahmad, and M. Swamy, "A fast 8 x 8 transform for image compression," in Proc. 21st Int. Conf. Microelectronics, Marrakech, Morocco, 19–22 Dec. 2009, pp. 74–77.
[15] D. Puchala and K. Stokfiszewski, "Low-complexity approximation of 8-point Discrete Cosine Transform for image compression," J. of Applied Computer Science, vol. 20, no. 2, pp. 107–117, 2012.
[16] A. N. Akansu and R. A. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets. San Diego, CA: Academic Press, 1992.
[17] G. Strang and T. Q. Nguyen, Wavelets and Filter Banks. Wellesley, MA: Wellesley-Cambridge Press, 1996.
[18] T. Yokota, R. Zdunek, A. Cichocki, and Y. Yamashita, "Smooth nonnegative matrix and tensor factorizations for robust multi-way data analysis," Signal Processing, vol. 113, pp. 234–249, 2015.

SIGNaL PROCESSING: algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

Application of recurrent U-net architecture to speech enhancement

Tomasz Grzywalski
StethoMe™
Email: grzywalski@stethome.com

Szymon Drgas
Institute of Automation and Robotics, Poznan University of Technology
Email: szymon.drgas@put.poznan.pl

Abstract—In this paper a recurrent U-net neural architecture is proposed for speech enhancement. The mentioned neural network architecture is trained to provide a mapping between the spectrogram of noisy speech and the spectrograms of the isolated speech and noise. Key design choices are evaluated in experiments and discussed, including: the number of levels of the U-net, the presence/absence of recurrent layers, the presence/absence of max-pooling layers, as well as the upsampling algorithm used in the decoder part of the network.
I. INTRODUCTION
The single-channel speech enhancement problem is to reduce noise present in a single-channel recording of speech. This technique has many applications: it can be employed as a preprocessing stage in an automatic speech recognition system, or it can be used to improve the intelligibility of speech recorded in a noisy environment. This can be especially important for hearing aids, as noise can dramatically reduce speech intelligibility for hearing-impaired persons.
Early speech enhancement methods were based on an assumption that the noise is a stationary signal [1]. In many acoustical environments, however, the noise is nonstationary (e.g. babble noise). In this case, methods based on non-negative matrix factorization have been successfully used [2]. More recently, deep neural networks (DNNs) [3] have gained popularity for speech enhancement. DNNs are used as a nonlinear transformation of a noisy signal to a denoised one, or to a filtering mask that can be used to recover the speech.

In [4] the U-net architecture was proposed for medical image segmentation.

In this work a recurrent U-net architecture is proposed for speech enhancement. Several aspects of this architecture have been examined, including: the number of levels, and the presence/absence of recurrent layers and max-pooling. Moreover, several upsampling methods have been evaluated.

The paper is structured as follows: in Section II convolutive DNN approaches from the literature, applied to audio source separation, are presented. Next, in Section III the problem is formulated. This is followed by the description of the proposed neural network architecture in Section IV. The experimental setup, results, and discussion are in Section V. Finally, conclusions are listed in Section VI.

Fig. 1. Architecture of the proposed recurrent U-net.

II. RELATED WORK

Convolutional neural networks (CNNs) were applied to speech enhancement in [5], by combining convolutional and fully connected layers to estimate the mask. In this case, however, information about the denoised spectrogram is lost by max-pooling. In [6] a fully convolutional network (without fully-connected layers) was used. In this case the information lost by max-pooling also cannot be recovered.

The information loss caused by max-pooling layers was mitigated by using skip-connections in [7]. The authors of the mentioned work compared convolutional encoder-decoder (CED) networks with skip-connections to the redundant convolutional encoder-decoder (R-CED) networks proposed by them. In R-CED the max-pooling operation is not used; thus the encoder part maps the input to a higher-dimensional space, while the decoder performs the mapping from the higher-dimensional space back to the input space. The results of the comparison suggest that R-CED performs better than CED. The results strongly indicate that skip connections provide a significant improvement of speech enhancement quality. There are, however, several aspects that are not clear. First, the input to the convolutional network were spectrogram patches, containing 8 frames, which

corresponds to temporal fragments of 100 ms, which can be less than one phoneme. Second, skip connections are done by addition, in contrast to the concatenation used in most U-net architectures.

The U-net architecture for singing voice recognition was tested in [8]. In contrast to [7], the analyzed segment of a spectrogram spanned a longer time range (512 frames). Instead of the max-pooling operation, convolutions with a stride equal to two were performed. The decoder part was built using the transposed convolution technique for upsampling. The obtained results were compared to the network without skip connections, and it turned out that skip connections significantly improved the separation quality.

A. Upsampling methods

Upsampling is done in the decoder part of the U-network. Its purpose is to recreate the spatial dimension of the feature map from the previous level of the U-net architecture. The simplest choice for upsampling is transposed convolution, but as was shown in [9], it performs poorly in many applications. Instead, its authors propose several alternatives. For the purpose of our study we have selected the following upsampling methods:
1) Transposed convolution - it is done by interleaving the upsampled feature maps with zeros and then performing a convolution.
2) Depth to space - this method is composed of a convolution followed by a proper reshaping of its output. Each region of the feature maps that was reduced by the max-pooling operation to one value (in our case 2x2 regions) is reconstructed from the depth of the preceding feature map (from the 2*2=4 channels).
3) Bilinear upsampling (BU) - it is done by performing bilinear interpolation in one direction (in rows) and then in the other (columns).
4) Bilinear additive upsampling (BAU) - it is similar to bilinear upsampling, except that we extend the number of channels before the upsampling operation by 2*2=4 and then, after interpolation, add each 4 consecutive channels together; according to [9] this preserves the volume of information throughout the upsampling process, thus removing information-loss bottlenecks.
5) Bilinear additive upsampling with residual connections (R-BAU) - it is an extension of bilinear additive upsampling in which there is an additional residual connection that generates a high-resolution update to the upsampled feature map.

Fig. 2. Example of recurrent U-net architecture with four levels and max-pooling.

III. PROBLEM FORMULATION

In DNN-based speech enhancement, for a noisy utterance represented by a spectrogram matrix Y ∈ R^{B×N}, where B and N denote the number of frequency bands and the number of spectrogram frames respectively, a nonlinear function is considered

\[
(\hat{\mathbf{X}}, \hat{\mathbf{N}}) = f(\mathbf{Y}, \Theta),
\tag{1}
\]

where X̂ ∈ R^{B×N} and N̂ ∈ R^{B×N} are estimates of the clean speech and noise spectrogram matrices respectively, while Θ is a set of parameters (weights and biases) of the neural network denoted by the function f.

Given a training dataset T that comprises corresponding noisy speech, clean speech, and noise examples

\[
T = \{(\mathbf{Y}_i, \mathbf{X}_i, \mathbf{N}_i)\}_{i=1}^{I},
\tag{2}
\]

where i is the index of a training example and I denotes the number of training examples in the training dataset, the following optimization task is performed:

\[
\arg\min_{\Theta} \left( \alpha \,\| \mathbf{X} - \hat{\mathbf{X}} \|_1 + (1 - \alpha) \,\| \mathbf{N} - \hat{\mathbf{N}} \|_1 \right),
\tag{3}
\]

where ‖·‖₁ denotes the l1-norm and α is a weight allowing to control a tradeoff between the reconstruction errors of speech and noise.

Thus, the problem is to design the architecture of the neural network f(·, Θ) that, trained on a given dataset, will bring the best speech enhancement quality in terms of the signal-to-distortion ratio (SDR) on a test dataset.

IV. PROPOSED ARCHITECTURE

The proposed architecture is shown in Figure 1. The left side of the U-net is by convention called the encoder, while the right side is the decoder. There are many levels at which the encoder is connected to the decoder via skip connections. In the decoder part, feature maps from the skip connections are merged with the feature maps from the preceding layers of the decoder part. This can be done by a concatenation (which results in a higher number of feature map channels) or by an element-wise sum of the feature maps from the skip connection and the higher level (in

this case both tensors representing feature maps must have the same dimensions). In the last connection between the encoder and decoder (the lowest in Figure 1) the recurrent layers are included. Between the levels in the encoder, max-pooling can be done, which reduces the dimensions of the feature maps and increases the effective receptive field. In the decoder, in order to recover the dimensions, upsampling methods are employed (see Section II-A).
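The dimension handling just described (2x2 max-pooling in the encoder, upsampling in the decoder, and skip-connection merging by concatenation or element-wise sum) can be sketched in plain NumPy. This is an illustrative sketch, not the authors' implementation: the nearest-neighbour upsampling and the 16x16x48 feature-map size are assumptions made for brevity (the paper evaluates several upsampling methods, see Section II-A).

```python
import numpy as np

def max_pool_2x2(x):
    # 2x2 max-pooling over an (H, W, C) feature map; H and W must be even.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x2(x):
    # Nearest-neighbour 2x upsampling: recovers the pooled spatial dimensions.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge_skip(decoder_maps, skip_maps, mode="concat"):
    # 'concat' stacks along the channel axis (more channels);
    # 'sum' adds element-wise and requires identical tensor shapes.
    if mode == "concat":
        return np.concatenate([decoder_maps, skip_maps], axis=-1)
    return decoder_maps + skip_maps

x = np.random.rand(16, 16, 48)   # feature maps at some level (H, W, C)
skip = x                          # kept for the skip connection
down = max_pool_2x2(x)            # encoder step: (8, 8, 48)
up = upsample_2x2(down)           # decoder step: back to (16, 16, 48)
merged = merge_skip(up, skip)     # (16, 16, 96) with 'concat'
print(merged.shape)               # -> (16, 16, 96)
```

With 48-channel feature maps, concatenation yields a 96-channel input for the next decoder filtering, while the 'sum' variant keeps the channel count at 48.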
In Figure 2 an example of the proposed architecture is shown. This is a four-level recurrent U-net with max-pooling. In the boxes, the dimensions of the feature maps are provided. At the input to the neural network a spectrogram matrix is provided. Next, convolutions are performed. At each level, there are
two 2D convolutional layers, with nonlinearities and batch normalizations, as depicted in Figure 3. All convolutions use 'same' padding. At the first level of the proposed U-net, the first convolution is done with 48 filters with a 5x5 mask, while the second one with 48 filters with a 3x3 mask. For the rest of the levels, both convolutions are done with 48 filters with a 3x3 mask. The filtering at each level is followed by a 2x2 max-pooling operation. The decoding at each level consists of two filters with 3x3 masks, each followed by an ELU (exponential linear unit) [10] nonlinearity and batch normalization [11]. This is followed by upsampling. The upsampling methods are briefly described in Section II-A. The first filtering for levels where data from a skip connection are provided is applied to 96 channels, concatenated from the skip connection and from the preceding level. Finally, feature maps with the initial resolution (at the first level) are processed with 48 filters with a 3x3 mask, which are followed by the second filtering with 2 masks of size 1x1. This forms the two outputs of the network, corresponding to the estimated speech and the estimated noise.

Fig. 3. A sequence of elements of the proposed neural network in the convolutions block.

The recurrences at the last level of the network are done as shown in Figure 4. The input feature maps (i.e. from the last level of the encoder) are processed by two GRU (gated recurrent unit) recurrent layers [12] (with latent vector dimension equal to 24): one forward and one backward. It should be added that the weights of the GRU layers are shared across the frequency channels. The outputs of the GRU layers (i.e. sequences of 24-dimensional vectors for each frequency channel) are concatenated with a direct connection from the feature map from the encoder part. Finally, the 96 feature maps are processed by a convolutional layer with 48 filters. The outputs of this layer are provided to the decoder.

Fig. 4. Recurrences block.

The proposed architecture differs from those presented in Section II in a number of key elements. First of all, our solution uses max-pooling layers, but unlike in [5] and [6], our model includes skip connections that preserve local information from before the pooling, which helps to reconstruct fine details of the output spectrogram. The presence of recurrent layers is also something that differentiates our solution from the one presented in [8].

A. Effective receptive field

Given the architecture design described above, it is worth analyzing the effective receptive field of networks with different numbers of levels. The effective receptive field is an important feature of any network design, as it tells us how much context the network can take into consideration when making an estimation about each element of its output. With each new level, the effective receptive field of the proposed network increases roughly by a factor of two. By comparison, the proposed network's counterpart without max-pooling (and consequently upscaling) increases its receptive field only by a constant value. This of course does not take into consideration the recurrent layers, whose presence can potentially extend the effective receptive field of the network in the time axis to the whole length of the utterance. Table I contains a summary of the effective receptive fields of the proposed networks with depth-to-space upscaling for different numbers of levels (excluding the influence of recurrent layers).

TABLE I
EFFECTIVE RECEPTIVE FIELD OF PROPOSED U-NETS
speech signal. Additionally we tested variation of our network
with max pooling layers completely removed. This makes U-net levels With max pooling Without max pooling
our solution more similar to the R-CED presented in [7]. 1 9x9 9x9
2 18x18 13x13
However, our network includes recurrent layers and is trained 3 42x42 21x21
on whole utterances (up to six seconds of length), hence 4 94x94 29x29
our model has the theoretical ability to use the context of 5 190x190 37x37
whole utterance while making prediction for each element

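The growth pattern in Table I can be checked with the standard receptive-field recursion for a chain of convolution and pooling layers; the sketch below is our own illustration (it ignores the recurrent layers, which can extend the temporal context much further):

```python
def receptive_field(layers):
    """Receptive field of a chain of (kernel_size, stride) layers,
    using the recursion r <- r + (k - 1) * j and j <- j * s, where
    j is the cumulative stride ('jump') of the layers seen so far."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Two stacked 3x3 convolutions see a 5x5 input neighbourhood:
assert receptive_field([(3, 1), (3, 1)]) == 5
# A 2x2/stride-2 max-pooling doubles the contribution of every layer
# that follows it, which is why each added level (sitting behind one
# more pooling) grows the receptive field roughly geometrically:
assert receptive_field([(3, 1), (2, 2), (3, 1)]) == 8
```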
Fig. 5. Spectrograms of an exemplary utterance from TIMIT: (a) original signal, (b) and (d) mixed with babble and factory noises, (c) and (e) denoised versions

V. EXPERIMENTS

A. Data

The noisy speech examples were obtained by mixing TIMIT
[13] speech utterances with noise segments extracted from the
NOISEX-92 database. The training and test datasets contain
2000 and 192 utterances, respectively. Both speech and noise
signals were resampled to 8000 samples/s. The speech
enhancement quality was assessed for babble and factory noises,
mixed with the speech utterances at an SNR of 0 dB. In Figure 5,
spectrograms of an exemplary utterance with both factory and
babble noises are presented.

B. Feature extraction

From all audio signals the short-time Fourier transform was
computed. The frame step was 10 ms, while the frame length
was 25 ms. A Hann window was applied. The next step was
to calculate a 512-point FFT for each frame. Afterwards, a
64-channel mel-scale filterbank was applied to the magnitude of
the STFT. Finally, the logarithm was calculated.

C. Speech signal reconstruction and evaluation metric

The magnitude mel-spectrogram in log-scale of the clean
speech was estimated using the proposed neural networks.
After applying the exponential function, the mel-filtering was
inverted by means of the pseudoinverse of the matrix with the
characteristics of the mel-filters. Next, the result was combined
with the phase of the noisy speech. This allowed the speech
signal to be reconstructed.

In order to assess the quality of the separation for different
variants of the proposed architecture, we used the SDR
(signal-to-distortion ratio), calculated as defined in [14], i.e.

SDR = 10 log_10 ( ||P_x x̂||^2 / ||x̂ − P_x x̂||^2 ),   (4)

where x̂ denotes a vector representing the reconstructed speech
signal in the time domain, while P_x is the matrix of orthogonal
projection on the vector x. This vector contains the samples of the
original speech signal.

D. Parameters of the neural network architectures

In all tested variants of the neural network architectures, the
input and outputs were spectrogram matrices of size 64
frequency channels times 608 frames. The longest utterance had
608 spectrogram frames. The shorter recordings were placed
in the center of a 64 × 608 matrix. Thus, during training
each batch was an array of dimension 16 × 64 × 608, where
16 corresponds to the batch size.

As shown in equation 3, the optimized cost function was the
absolute difference between the predicted and actual speech signal.
Initially the squared error was also considered as the cost
function, but a preliminary set of experiments showed that the
absolute error outperforms the squared error by a significant
margin in all scenarios. Considering the whole batch, the cost
function was calculated only on the centered data (padded zeros
before and after each utterance were masked out and did not
contribute to the loss).

Since the network was trained to estimate speech as well
as noise, but only the speech estimate was eventually used,
a weighting mechanism was implemented to emphasize the
speech estimation over the noise estimation (parameter alpha
in equation 3). We found that the value 0.75 worked best in most
cases, and so it was used in all experiments.

E. Optimization parameters

All neural network parameters were optimized using the Adam
algorithm [15]. The initial learning rate was set to 0.001.
After completion of each epoch, it was multiplied by 0.99.
The data for neural network training, which consisted of 2000
utterances, was divided into a training set (90%) and a validation
set (10%). In order to train the proposed neural networks, 100
epochs were run. Each epoch consisted of weight updates
for batches containing 16 spectrograms. The best model was
selected based on the SDR obtained on the validation set. Validation
was performed after each training epoch; no early stopping
was used.

F. Results

1) Number of levels and recurrences: Tables II and III show
the influence of the number of levels in the U-net architecture
and the presence/absence of recurrent layers on the SDR obtained by
the model. Experiments were performed with max pooling,
concatenated skip connections and depth-to-space upsampling.
As can be seen, in all cases except one, adding recurrent
layers improved speech enhancement performance, but the
improvement was much greater for shallower U-nets. For
factory noise the optimal number of levels was 4, while for
the more complex babble noise, a deeper architecture performed
better.

TABLE II
INFLUENCE OF NUMBER OF LEVELS AND RECURRENCE ON SDR (dB) (FACTORY NOISE)

U-net levels (conv layers)   No recurrence   With recurrence
1 (4)                        7.11            7.61
2 (6)                        7.72            7.93
3 (10)                       7.91            8.01
4 (14)                       8.00            8.10
5 (18)                       8.09            8.02

TABLE III
INFLUENCE OF NUMBER OF LEVELS AND RECURRENCE ON SDR (dB) (BABBLE NOISE)

U-net levels (conv layers)   No recurrence   With recurrence
1 (4)                        5.16            5.76
2 (6)                        5.86            6.02
3 (10)                       6.10            6.29
4 (14)                       6.20            6.32
5 (18)                       6.30            6.40

2) Influence of max-pooling: To examine the extent of
the loss of information introduced by the max pooling layers,
we have conducted a set of experiments where we have
removed these layers from the network architecture. This
makes the architecture similar to the R-CED mentioned in Section
II. We also examined the influence of recurrent layers, which
should be able to compensate (to some extent) for the much
narrower effective receptive field of the network without max
pooling. The results are shown in Table IV. Experiments were
performed on 4-level U-nets with additive skip connections
and depth-to-space upsampling.

TABLE IV
INFLUENCE OF MAX-POOLING ON SDR (dB)

                          max-pooling   no max-pooling
Factory   recurrence      8.07          8.19
Factory   no recurrence   8.08          8.03
Babble    recurrence      6.26          6.28
Babble    no recurrence   6.30          6.04

3) Influence of upsampling methods: In this experiment we
have compared different ways to perform upsampling of feature
maps in the decoder part of the network. For the R-BAU,
the residual connections were implemented using transposed
convolution. The results are depicted in Table V. Experiments were
performed on 4-level U-nets with max pooling, recurrence and
concatenated skip connections. There is no clear winner among
the algorithms, but on average the best was R-BAU, which
achieved the highest SDR in factory noise and the second best
in babble noise. In general, the influence of the upsampling
algorithm on the network performance is not very significant.
All presented upsampling methods were stable during training
and did not require any tuning of meta-parameters.

TABLE V
INFLUENCE OF UPSAMPLING METHODS ON SDR (dB)

Upsampling method        Factory noise   Babble noise
transposed convolution   8.06            6.42
depth to space           8.10            6.32
BU                       8.03            6.36
BAU                      8.04            6.35
R-BAU                    8.12            6.40

VI. CONCLUSIONS

The following conclusions can be drawn from the performed
experiments:
1) The U-net architecture is effective at speech enhancement.
2) The optimal number of levels of the proposed recurrent
U-net architecture is 4 or 5, depending on the noise.
3) The recurrence improves speech enhancement, especially
for shallower networks.
4) Max pooling introduces a loss of information, but it is
needed to build a big enough receptive field of a convolutional
network. However, recurrent layers are effective
at extending the receptive field, so that max pooling is no
longer needed. Thus, on average the best combination
is to build a network without max pooling and use
recurrent layers to extend the receptive field.
5) For the networks with max-pooling, the applied upsampling
method does not influence the results significantly. It can
also be observed that bilinear upsampling without residual
connections does not perform very well; however,
adding residual connections improves the performance
and eventually outperforms all other methods by a small
margin.

ACKNOWLEDGMENT

This research was supported in part by the PL-Grid Infrastructure.

REFERENCES

[1] P. C. Loizou, Speech enhancement: theory and practice. CRC press, 2007.
[2] C. Févotte, E. Vincent, and A. Ozerov, "Single-channel audio source
separation with nmf: divergences, constraints and algorithms," in Audio
Source Separation. Springer, 2018, pp. 1-24.
[3] D. Wang and J. Chen, "Supervised speech separation based on deep
learning: An overview," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 2018.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks
for biomedical image segmentation," in International Conference on
Medical Image Computing and Computer-Assisted Intervention. Springer,
2015, pp. 234-241.
[5] L. Hui, M. Cai, C. Guo, L. He, W. Q. Zhang, and J. Liu, "Convolutional
maxout neural networks for speech separation," in 2015 IEEE International
Symposium on Signal Processing and Information Technology
(ISSPIT), Dec 2015, pp. 24-27.
[6] E. M. Grais and M. D. Plumbley, "Single channel audio source
separation using convolutional denoising autoencoders," arXiv preprint
arXiv:1703.08019, 2017.
[7] S. R. Park and J. Lee, "A fully convolutional neural network for speech
enhancement," arXiv preprint arXiv:1609.07132, 2016.
[8] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and
T. Weyde, "Singing voice separation with deep u-net convolutional
networks," 2017.

[9] Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L.-C. Chen,
A. Fathi, and J. Uijlings, “The devil is in the decoder,” arXiv preprint
arXiv:1707.05847, 2017.
[10] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate
deep network learning by exponential linear units (elus),” arXiv preprint
arXiv:1511.07289, 2015.
[11] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[12] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the
properties of neural machine translation: Encoder-decoder approaches,”
arXiv preprint arXiv:1409.1259, 2014.
[13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett,
and N. L. Dahlgren, “Darpa timit acoustic phonetic continuous speech
corpus cdrom,” 1993.
[14] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement
in blind audio source separation,” IEEE transactions on audio, speech,
and language processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND

Crowd counting using complex convolutional neural network
Marcin Matłacz
Warsaw University of Technology
Faculty of Electrical Engineering
Warsaw, Pl. Politechniki 1
Email: marcin@matlacz.eu

Grzegorz Sarwas
Institute of Control and Industrial Electronics
Warsaw, ul. Koszykowa 75
Email: sarwasg@ee.pw.edu.pl

Abstract—This paper is focused on the problem of counting
people in a crowd. For solving this issue, a complex-valued
convolutional neural network has been proposed. The network
training and evaluation have been performed using the ShanghaiTech
and UCF CC 50 datasets, respectively. The achieved results have been
compared with other algorithms for crowd counting based on
deep neural network architectures, mainly the "CrowdNet" algorithm.
The proposed model achieved better results than the equivalent
real-valued model.
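The core operation of such a network is convolution with a complex-valued kernel. As a hedged illustration (our own NumPy sketch, not the authors' Keras implementation), a complex convolution (A + iB) ∗ (x + iy) decomposes into four real convolutions, mirroring Eq. (5) in Section III:

```python
import numpy as np

def conv2d(w, a):
    """Plain real-valued 'valid' 2D correlation of kernel w over array a."""
    kh, kw = w.shape
    H, W = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(w * a[i:i + kh, j:j + kw])
    return out

def complex_conv2d(A, B, x, y):
    """Complex convolution G * h with G = A + iB and h = x + iy:
    real part A*x - B*y, imaginary part B*x + A*y."""
    return conv2d(A, x) - conv2d(B, y), conv2d(B, x) + conv2d(A, y)
```

In a real/imaginary channel representation, as used in complex-valued CNN implementations, this decomposition is what makes the operation expressible with standard real convolution layers.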

I. INTRODUCTION

Over the last decades, we can observe the intensified
development of urban monitoring systems. Each bigger metropolis
possesses plenty of IP cameras connected to a crisis management
center, which provides continuous observation of the obtained
recordings. Such a large number of streams does not allow
for unremitting analysis of all camera signals, therefore the
operators focus on observing especially endangered places.
The huge amount of video data requires the automation of
processes for identifying dangerous situations or analyzing the
pedestrian density. It is a difficult task due to serious occlusions,
distortion of the scene perspective and the varying distribution of
the crowd. Most state-of-the-art methods are based on a regression
model [1], [2], [3], [4], where the main aim is to find the
relation between low-level features and crowd counts. The
first works on this issue were based on manual selection of
patterns chosen based on domain knowledge. Some methods
[5] are based on classical features which are invariant to scale,
illumination and rotation, like SIFT [6]. The newest solutions
adopted deep neural network algorithms [7], [8], [9], [10],
whose popularity increased when pretrained, commonly
available models appeared.

In this paper, the authors propose a solution based on a
complex convolutional neural network. This network structure
was proposed by Guberman in 2016 [11]. Since that time,
a tendency for using complex numbers in deep learning
algorithms in identification systems can be observed [12],
[13], [14]. In this paper, the authors present the implementation of
a complex convolutional neural network for the crowd counting
process. Its operation and results have been compared with
traditional deep learning architectures based only on real
numbers.

Fig. 1. Sample of crowd image with annotations.

II. DATA FORMAT

There are two huge databases available online that allow one to
train and test algorithms for counting people in a crowd. The
first one is the UCF CC 50 crowd counting dataset, collected and
shared by Idrees et al. in 2013 [5]. This dataset contains 50
gray-scale photos of different sizes, spanning from 328x496 up
to 1024x1024 pixels, with 64K annotated humans and head
counts ranging from 94 to 4543.

The second dataset used for crowd counting is ShanghaiTech.
It contains 1198 color photos in RGB of sizes
spanning from 200x300 up to 1024x1024. In the whole dataset,
there are 330k annotated people, with headcounts ranging
from 9 to 3138 per picture.

Each photo in these datasets has an assigned file in "mat"
format containing the annotated people locations. Each annotation
describes the position of an individual person in an image.
Figure 1 presents a sample photo of a crowd with red dots located
at the coordinates from the annotation "mat" file.

The density maps have been computed using a Gaussian filter
with sigma estimated from the distance between annotations.
The formula for density map generation is expressed as:

D(x) = Σ_{i=1}^{N} δ(x − x_i) ∗ G(x),   (1)

where:

D(x) - density map value for pixel x,
δ - Dirac delta,
N - number of annotations,
x_i - annotation for the i-th pixel,
G() - Gaussian filter,
∗ - convolution operator.

In Fig. 2 the density map of the image shown in Fig. 1 is
presented, where brighter points mean places with a higher crowd
density, and darker points a lower one.

Fig. 2. Density map of image presented in Fig. 1.

III. COMPLEX NET

In this section, the novel neural network architecture for
crowd counting is introduced. Its structure is based on VGG-16
[15] without pretrained weights. The inspiration for the
presented paper was an article written by Trabelsi et al. [13]
and recent papers on the use of complex numbers in neural
networks [16], [17], [18], [11]. The proposed network is an
extension of the VGG-16 architecture. Instead of using one real matrix
for the weights representation, two matrices have been used: the
first for the real parts of the complex numbers and the second for
the imaginary parts. The network implementation used here was based
on the official implementation of "Deep Complex Networks"
presented in [13].

A. Building blocks of complex network

In the following, the core building blocks of a complex-valued
deep neural network are presented. Each block has a
mathematical outline following those proposed in [13].

Complex 2D Convolution
Assuming that G ∈ R^{m×n} is a weights matrix and h ∈ R^{m×n}
is an input matrix, then the two-dimensional convolution is
defined as:

G[m, n] ∗ h[m, n] = Σ_{k=−∞}^{∞} Σ_{l=−∞}^{∞} G[k, l] h[m − k, n − l].   (2)

In general, G and h can be complex (G ∈ C^{m×n} and
h ∈ C^{m×n}); then:

G = A + iB,  A ∈ R^{m×n}, B ∈ R^{m×n}   (3)

and

h = x + iy,  x ∈ R^{m×n}, y ∈ R^{m×n}.   (4)

Finally, the two-dimensional convolution for complex numbers
can be defined as:

G ∗ h = (A ∗ x − B ∗ y) + i(B ∗ x + A ∗ y).   (5)

Activation function
The following activation functions have been tested:

CReLU, defined as [17]:

CReLU(z) = ReLU(ℜ(z)) + i ReLU(ℑ(z)),   (6)

where:

ReLU(x) = max(0, x), x ∈ R.   (7)

modReLU, defined as [11]:

modReLU(z) = ReLU(|z| + b) e^{iθ_z}.   (8)

zReLU, defined as [18]:

zReLU(z) = z for θ_z ∈ [0, π/2], and 0 in other cases.   (9)

Batch Normalization
In the experiments, the batch normalization defined in [13] has been
used.

Average 2D Pooling
In the presented implementation, average pooling has been used
instead of max pooling, which is caused by the max operation
not being defined for complex numbers.

PixelShuffle
This method changes the shape of the input tensor from
[H, W, C·r²] into [H·r, W·r, C], where H - height, W -
width, C - number of channels, r - scaling ratio. This method
has been proposed in [19].

B. Architecture

The proposed network architecture is a simplified version
of "CrowdNet" [8], which is a neural network comprised of
a deep and a shallow network. The deep part of the network
uses an architecture similar to VGG-16 with weights trained
on ImageNet. The shallow part consists of 3 convolutional
layers. The shallow network detects big head blobs, while the
deep part is responsible for extracting crowd features from
an input image. The outputs produced by the two sub-networks
are concatenated and processed by a depthwise 1x1 convolution.
The produced output is upsampled to the size of the input image using
bilinear upsampling. The output is an estimation of the density
map.

In this paper the following modifications of the neural network
architecture are proposed:
• Building blocks are replaced by their complex counterparts.
• The shallow network has been discarded.
G = A + iB, A ∈ Rm×n , B ∈ Rm×n (3) • Shallow network has been discarded.

• In place of bilinear upsampling, PixelShuffle has been used.

Fig. 3. Network architecture

Figure 3 presents the architecture of the used complex network.
In Table I the detailed architecture of the complex VGG16:Model
block from Fig. 3 is shown. This table presents the layer layout of
the deep complex part. The used model contains 34 layers. Its
Input Layer takes 2 images of 224 × 224, and at the output
256 images of size 14 × 14 are returned. The total number
of parameters is 2,642,560, of which 2,640,800 are trainable and
1,760 non-trainable.

TABLE I
ARCHITECTURE OF DEEP COMPLEX PART

Layer (type)                              Output Shape     Param
Input Layer                               (224, 224, 2)    0
Complex conv2d (ComplexConv2D)            (224, 224, 64)   640
Activation (Activation)                   (224, 224, 64)   0
Complex conv2d (ComplexConv2D)            (224, 224, 64)   18496
Activation (Activation)                   (224, 224, 64)   0
Average pooling2d (Average)               (112, 112, 64)   0
Complex batch normalization (BatchNorm)   (112, 112, 64)   320
Complex conv2d (ComplexConv2D)            (112, 112, 128)  36992
Activation (Activation)                   (112, 112, 128)  0
Complex conv2d (ComplexConv2D)            (112, 112, 128)  73856
Activation (Activation)                   (112, 112, 128)  0
Average pooling2d (Average)               (56, 56, 128)    0
Complex batch normalization (BatchNorm)   (56, 56, 128)    640
Complex conv2d (ComplexConv2D)            (56, 56, 256)    147712
Activation (Activation)                   (56, 56, 256)    0
Complex conv2d (ComplexConv2D)            (56, 56, 256)    295168
Activation (Activation)                   (56, 56, 256)    0
Complex conv2d (ComplexConv2D)            (56, 56, 256)    295168
Activation (Activation)                   (56, 56, 256)    0
Average pooling2d (Average)               (28, 28, 256)    0
Complex batch normalization (BatchNorm)   (28, 28, 256)    1280
Complex conv2d (ComplexConv2D)            (28, 28, 256)    295168
Activation (Activation)                   (28, 28, 256)    0
Complex conv2d (ComplexConv2D)            (28, 28, 256)    295168
Activation (Activation)                   (28, 28, 256)    0
Complex conv2d (ComplexConv2D)            (28, 28, 256)    295168
Activation (Activation)                   (28, 28, 256)    0
Average pooling2d (Average)               (14, 14, 256)    0
Complex batch normalization (BatchNorm)   (14, 14, 256)    1280
Complex conv2d (ComplexConv2D)            (14, 14, 256)    295168
Activation (Activation)                   (14, 14, 256)    0
Complex conv2d (ComplexConv2D)            (14, 14, 256)    295168
Activation (Activation)                   (14, 14, 256)    0
Complex conv2d (ComplexConv2D)            (14, 14, 256)    295168

IV. EXPERIMENTS

The complex-valued network has been compared with
"CrowdNet" and other methods attempting to solve the crowd
counting problem, such as the state-of-the-art Switch-CNN.
For the experiments, the proposed network has been trained
on 233 patches created from the ShanghaiTech part A train data.
A validation set containing 166 patches has been created from
the ShanghaiTech part A test data. Each patch was of size
224 × 224 pixels. To prepare the patches, data augmentation has
been conducted, consisting of:
• multiscale pyramid,
• patch cropping,
• patch vertical flip,
• padding of images with an aspect ratio ineligible for slicing
into patches.

Finally, the patches have been sampled from the ShanghaiTech
dataset based on crowd density. A greater selection probability
has been given to patches containing more people. After the
first sampling of images containing people, an additional set,
of size equal to 15% of the previous one and consisting of patches
not containing any people, is sampled.

These steps are followed for both the training and validation sets.
The validation set is used for early stopping of training. The
network is trained using the Adam optimizer with a learning rate
equal to 1e-3. After 5 epochs without improvement, the learning
rate is lowered by 10% of the previous value. The network is
trained with the pixelwise loss function logcosh, defined as:

logcosh(x, y) = log(cosh(x − y)),   (10)

where:

cosh(x) = (e^x + e^{−x}) / 2.

Due to the low values in the density maps (of magnitude 1e-3),
the ground truth density maps are multiplied by 1e3. During the
evaluation, the output is divided by the same value.
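A minimal NumPy sketch of this loss and of the density-map rescaling (illustrative only; the helper names are ours, and the paper's model itself is trained in a deep learning framework):

```python
import numpy as np

SCALE = 1e3  # ground-truth density maps are multiplied by 1e3 before training

def logcosh_loss(pred, target):
    """Pixelwise log-cosh loss of Eq. (10), averaged over the map.
    log(cosh(d)) behaves like d^2/2 for small errors and like |d|
    for large ones, a smooth compromise between L2 and L1."""
    d = pred - target
    return float(np.mean(np.log(np.cosh(d))))

def count_from_output(net_output):
    """Undo the 1e3 scaling and integrate the density map into a head count."""
    return float(np.sum(net_output / SCALE))
```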
V. RESULTS

Figures 4 and 5 present a comparison of the estimated density
maps with the ground truth. Figure 4 presents the cases for which the
worst results have been obtained, and Fig. 5 presents the examples
with the lowest error. In each figure, the top row shows the
original photo in gray-scale, the middle row contains images with the
ground truth density map and the bottom row shows the estimated
density map. The visible grid artifact in the bottom row is a result
of estimating the density per patch, not for the whole image.

Fig. 4. Examples of images with highest error (from left): (estimation/ground truth) 4/92, 419/196, 1402/3390.

In Table II the results of the evaluation on the UCF CC 50 dataset
have been collected. The proposed "ComplexNet" architecture was
trained on the "ShanghaiTech" database, while the other networks
were trained and evaluated on UCF CC 50 using 5-fold
cross-validation.

For comparing the results, three metrics have been used: Mean
Absolute Error (MAE), Root Mean Squared Error (RMSE)
and Mean Absolute Percentage Error (MAPE). These metrics
have been defined as:

MAE = (1/N) Σ_{i=1}^{N} |y_i − y'_i|,   (11)

RMSE = sqrt( (1/N) Σ_{i=1}^{N} |y_i − y'_i|^2 ),   (12)

MAPE = (1/N) Σ_{i=1}^{N} |(y_i − y'_i) / y_i|,   (13)

where:
N - number of inputs,
y_i - number of people in input i,
y'_i - estimated number of people in input i.

Fig. 5. Examples of images with the best results (from left): (estimation/ground truth) 309/316, 653/673, 2395/2308.

TABLE II
COMPARISON OF RESULTS ON UCF CC 50

Method                     MAE     MAPE    RMSE
People Counting ... [20]   514.1   0.542   -
Switch-CNN [21]            318.1   -       439.2
CrowdNet [8]               452.5   -       -
ComplexNet                 448.5   0.407   703.9

The values presented for each algorithm in Table II are the results
reported by their authors. The complex-valued neural network
achieved a smaller mean absolute error than the corresponding
real-valued "CrowdNet". Due to the missing RMSE metric
for "CrowdNet", it is not possible to assess which network
has the higher coarse error. Comparing the obtained results with the
current state-of-the-art solution "Switch-CNN" confirms that
further experiments with more sophisticated architectures (like
ResNet) are needed.

VI. CONCLUSION

In this paper, the problem of counting people in a crowd
visual scene has been raised. For solving this issue, a complex
convolutional neural network has been proposed. The proposed
architecture is a simplified version of the "CrowdNet" architecture
with some modifications, such as using complex counterparts
of the classical building blocks, rejection of the shallow
part of the network and the use of PixelShuffle in place of
bilinear upsampling. Network training and evaluation have been
carried out using the ShanghaiTech dataset and UCF CC 50,
respectively.

The achieved results have been compared with other algorithms
for crowd counting based on deep neural network architectures,
especially "CrowdNet". The obtained results are very promising.
The proposed model, based on an innovative architecture built
on complex numbers, achieved better results than the equivalent
real-valued one. It confirms the legitimacy of using
the complex convolutional architecture for the crowd counting
process. In addition, it is worth emphasizing that receiving

such results was possible using a smaller amount of training
data (only 233 patches, compared to around 50292 patches, and
without pretraining), which confirms that a convolutional neural
network with complex numbers allows faster adaptation to
the training data than a network using only real numbers.

REFERENCES

[21] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural
network for crowd counting," in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017, pp. 4031-4039.
[1] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving
crowd monitoring: Counting people without people models or tracking,”
in 2008 IEEE Conference on Computer Vision and Pattern Recognition,
June 2008, pp. 1–7.
[2] K. Chen, S. Gong, T. Xiang, and C. C. Loy, “Cumulative attribute space
for age and crowd density estimation,” in 2013 IEEE Conference on
Computer Vision and Pattern Recognition, June 2013, pp. 2467–2474.
[3] K. Chen, C. C. Loy, S. Gong, and T. Xiang, “Feature mining for localised
crowd counting,” in In BMVC, 2012.
[4] L. Fiaschi, U. Koethe, R. Nair, and F. A. Hamprecht, “Learning to
count with regression forest and structured labels,” in Proceedings of
the 21st International Conference on Pattern Recognition (ICPR2012),
Nov 2012, pp. 2685–2688.
[5] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale
counting in extremely dense crowd images,” in 2013 IEEE Conference
on Computer Vision and Pattern Recognition, June 2013, pp. 2547–2554.
[6] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Proceedings of the Seventh IEEE International Conference on Computer
Vision, vol. 2, 1999, pp. 1150–1157 vol.2.
[7] K. Kang and X. Wang, “Fully convolutional neural networks for crowd
segmentation,” CoRR, vol. abs/1411.4464, 2014.
[8] L. Boominathan, S. S. S. Kruthiventi, and R. V. Babu, “Crowdnet: A
deep convolutional network for dense crowd counting,” in Proceedings
of the 2016 ACM on Multimedia Conference, ser. MM ’16, 2016, pp.
640–644.
[9] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang, “Multi-scale convolutional
neural networks for crowd counting,” in 2017 IEEE International
Conference on Image Processing (ICIP), Sept 2017, pp. 465–469.
[10] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting
via deep convolutional neural networks,” in 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 833–
841.
[11] N. Guberman, “On complex valued convolutional neural networks,”
arXiv preprint arXiv:1602.09046, 2016.
[12] C. A. Popa, “Complex-valued convolutional neural networks for real-
valued image classification,” in 2017 International Joint Conference on
Neural Networks (IJCNN), May 2017, pp. 816–822.
[13] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian,
J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep
complex networks," arXiv preprint arXiv:1705.09792, 2017.
[14] M. Wilmanski, C. Kreucher, and A. Hero, “Complex input convolutional
neural networks for wide angle sar atr,” in 2016 IEEE Global Conference
on Signal and Information Processing (GlobalSIP), Dec 2016, pp. 1037–
1041.
[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[16] O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for
convolutional neural networks,” in Proceedings of the 28th International
Conference on Neural Information Processing Systems - Volume 2, ser.
NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 2449–2457.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2969442.2969513
[17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in ICML, 2015.
[18] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent
neural networks,” in International Conference on Machine Learning,
2016, pp. 1120–1128.
[19] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop,
D. Rueckert, and Z. Wang, “Real-time single image and video super-
resolution using an efficient sub-pixel convolutional neural network,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 1874–1883.
[20] A. Bansal and K. S. Venkatesh, “People counting in high density crowds
from still images,” CoRR, vol. abs/1507.08445, 2015.


Simulated Local Deformation & Focal Length Optimisation
For Improved Template-Based 3D Reconstruction of Non-Rigid Objects
Michał Bednarek, Krzysztof Walas
Poznan University of Technology
Institute of Control, Robotics and Information Engineering, Poznan, Poland

Abstract—Non-rigid object 3D reconstruction is a complex problem, whose solution has many practical implications for graphics, robotics or augmented reality. One of the approaches to the problem is template-based, where a reference mesh, deformed over time, is used. To improve the quality of reconstruction of the non-rigid object, we implemented two novel concepts: Simulated Local Deformation (SLD) and focal length optimisation. In this paper, we combine them together. SLD successfully enlarges the number of correspondences, hence improving the reconstruction process. Additionally, 3D reconstruction also depends on the reprojection error. To minimise this measure, we propose to optimise the focal length simultaneously with the reconstruction process. As a result, we achieved improved reconstruction when compared to the state-of-the-art solution.

I. INTRODUCTION

3D reconstruction of deformable surfaces from monocular images is a challenging problem. One of the approaches relies on templates, understood here as a reference mesh of the 3D object recorded as if its surface were rigid. This rigid template is deformed, i.e. its vertices are displaced, according to the changes found in the sequence of 2D images recorded while performing a series of deformations of the non-rigid object. Such algorithms consist of two phases: (1) matching visual descriptors between a reference and an input image, which creates the input data (correspondences) for (2) the 3D reconstruction step, based on a mathematical model and solved in an optimisation process. The problem is highly under-constrained, with many ambiguities. On the one hand, getting the right correspondences might be problematic. On the other hand, as we are dealing with 2D projections, the correctness of the camera parameters is of paramount importance. To address these two issues, descriptors which are robust to out-of-plane rotations and an optimisation procedure which finds the best possible camera intrinsic parameters are needed.

There are known solutions which provide descriptors robust to in-plane rotations [1], [2], [3], [4]. They can successfully deal with the rigid movement of a tracked object, but not with out-of-plane rotations. Such descriptors will not provide a sufficient number of correspondences for template-based non-rigid 3D reconstruction methods, and, in the most deformed areas, the object will be poorly reconstructed. This clearly shows that a descriptor which provides more correspondences in contorted regions is needed. Moreover, to obtain a good 3D reconstruction from 2D images, precise calibration of the camera is needed. Nevertheless, there are approaches where an uncalibrated camera was used for performing structure from motion [5], [6]. However, to the best of the authors' knowledge, there are no previous works that tackle the problem of intrinsic camera parameter optimisation and reconstruction of non-rigid objects at the same time.

Our contributions are: simulated local deformation and focal length optimisation. The first one provides our algorithm with an enhanced set of correspondences between the reference and input frame. These new correspondences appear in the most deformed areas of tracked objects subjected to out-of-plane rotations. By performing a simulated camera viewpoint transformation process, we obtain auxiliary information which is stored as a stack of descriptors associated with each keypoint in the reference image. The representation of keypoints in the reference image is richer, which yields improved matching performance. The second improvement is related to camera parameter estimation. The non-rigid object 3D reconstruction process is driven by the mean reprojection error. We show in this paper that the minimisation of the reprojection error with respect to the camera focal length yields an improved quality of reconstruction.

In the remainder of the paper, we will review the related work. Then, a description and the main highlights of our methods will be provided. Afterwards, the results of applying simulated local deformation and focal length optimisation will be presented. Finally, concluding remarks with future work plans will be given.

II. RELATED WORK

In this section we will shortly review 3D non-rigid reconstruction methods and two research fields which provide algorithms used in this process.

3D Non-Rigid Reconstruction. One of the prominent works on this topic is described in [7]. A template-based approach using the Laplacian formalism was presented in paper [8]; in the current work we extend this method. Another template-based approach using a monocular camera was presented in [9], where Lagrangian multipliers were used. The problem of tracking the deformation of texture-less and occluded objects was tackled in [10]. Reconstruction of non-rigid shapes can also be performed in real-time. The latest work in computer vision focused on volume deformation is described in [11].

Another real-time approach to reconstructing dynamic scenes was presented in [12]. Furthermore, a fresh view on 3D reconstruction was described in [13]. In this paper, the authors proposed a method of visual tracking of a non-rigid object without using correspondences.

Visual Descriptors. The most popular descriptor is SIFT [1]. Besides this one, there are plenty of alternatives, such as SURF [2], which is supposed to be a speeded-up version of SIFT. An alternative approach is KAZE [14], where a nonlinear scale space is proposed. All the descriptors mentioned so far use a floating point representation, which makes the computation of the distance function costly. Much faster to compute is the Hamming distance used in binary descriptors. Directly as an alternative to SIFT and SURF, the authors of [3] presented the ORB descriptor. Other alternatives are BRISK [4] and a modified version of KAZE, AKAZE [15]. All of these descriptors provide some kind of robustness to image rotations and scale changes.

The problem of finding matches between scenes rotated in-plane was tackled by [16]. The authors proposed a method that performs simulated global deformation. Another approach was presented in [17], where a new local descriptor based on accumulated stability voting was introduced.

Camera Parameters Estimation. External camera parameter estimation is related to Non-Rigid Structure From Motion (NRSFM). In [5] the authors proposed an online dense NRSFM method which, in an unsupervised manner, estimates the external parameters. Deep learning approaches also successfully deal with the problem of obtaining the 6-DOF camera pose [6]. Optimisation of the internal camera parameters is strongly related to camera calibration. In [18] a method of recovering each internal camera parameter using a geometrical approach is presented. In one of the latest works [19], a method which successfully obtains the intrinsic camera matrix from images subjected to radial distortions is presented. Nevertheless, to the best of the authors' knowledge, there are no previous papers that describe the process of such optimisation during non-rigid object template-based 3D reconstruction.

1) Method: In this section, our enhancements of template-based non-rigid 3D object reconstruction are given. We build on top of the algorithm provided in [8] and our previous work on descriptors for non-rigid object reconstruction [20]. Our code is available online¹.

¹ https://github.com/mbed92/template-based-3d-reconstruction

Fig. 1. Enlarging the number of descriptors per keypoint using simulated local deformation. A stack of descriptors increases the possibility of finding a match between keypoints in the reference and input frames.

A. Simulated Local Deformation

Simulated local deformation is the concept of creating a stack of descriptors for each keypoint in the reference frame, i.e. the image of the non-deformed object. The workflow of this process is shown in Fig. 1. The keypoints are detected only once, on the non-deformed reference frame, so we know their exact positions, indexes, etc. Then, this reference image is subjected to perspective transformations, and the neighbourhood of every keypoint changes. In this new surrounding, the descriptor is computed and added to the stack associated with each particular keypoint.

The first stage of SLD is to perform rotations of the chessboard in 8 selected directions with a variable rate of tilt. With the camera axis perpendicular to the XY-plane and crossing it at the point (0,0), we obtained affine transformations for the chessboard pattern rotated around four different axes (the X-axis, the Y-axis, and the axes at 45° between X and Y). To ensure precise measurements of the chessboard pattern plane pose, the images were taken using a robotic arm: the board was held with the gripper.

The advantage of using the robotic arm is that we had accurate feedback on all rotations in SO(3). Additionally, when performing the experiments in real-world conditions, all the effects are taken into account when the rotation matrices are obtained. Initial work on SLD, with an in-depth analysis of the influence of the number of rotations on the number of matches, was presented in our previous work [21].

Having the proper transformation matrices, we could insert them into our framework, which is graphically described in Fig. 1 and given in an algorithmic form below.

1) Initialization step:
   a) Perform feature description on the reference image.
   b) Obtain a vector of transformation matrices using chessboard images at different angles.
   c) Create a stack of descriptors for each keypoint under simulated deformations.
2) Runtime step:
   a) Detect keypoints on the test image.
   b) Compute descriptors for the detected keypoints.
   c) Find matches with the non-deformed keypoints of the reference image.
   d) Store them in the goodMatches vector. Store rejected matches in the badMatches vector.
   e) Improve the matching step:
      i) Iterate over the badMatches vector.
      ii) For each unmatched keypoint in the test image, check the stack of descriptors of the corresponding keypoint in the reference image.
      iii) If a better match exists, check that its displacement in pixels is smaller than a certain threshold for eliminating outliers.
      iv) If the displacement is considered proper, enlarge the goodMatches vector by this match.
3) Result: an enlarged vector of matches between the two images. The additional matches are added in key areas of the object, i.e. places where significant deformations occurred.

B. Focal Length Optimisation

In this paper, we introduce a novel approach to internal camera parameter estimation, which is performed during the reconstruction process. In [8] the authors proposed a method of recovering a 3D shape which basically consists of two mesh optimisation steps: (1) an unconstrained one, which in a fixed number of iterations provides an estimate of the mesh, and (2) one with inequality constraints, yielding the final mesh. In each iteration of (1), the overall reprojection error is measured, i.e. the distance between the found position of an input keypoint and the same point after reprojection. To minimise this measure, we propose to optimise the focal length and check whether that additionally minimises this error. The key concept of the internal camera parameter optimisation process is shown in Fig. 2.

Fig. 2. Finding the optimal value of the focal length based on the reprojection error of each matched keypoint in the input image. Focal length optimisation using the SA algorithm decreases the overall reconstruction error, as shown in Table II.

For the optimisation, the Simulated Annealing (SA) metaheuristic [22] was used. The implementation of the algorithm for the task of focal length optimisation is as follows:

    Result: optimal value of the focal length f_optimum
    T ← T_start
    f ← f_random
    while T ≥ T_end do
        iter ← 0
        while iter ≤ iter_fixed do
            f ← f_random
            err ← err_newFocal
            if (err < err_minimum) or accepted then
                f_optimum ← f
                err_minimum ← err
            end
            iter ← iter + 1
        end
        T ← T * decreaseFactor
    end

Algorithm 1: Simulated annealing for focal length optimisation. In our method it was implemented in C++.

Additionally, using the presented optimisation technique, we constructed a system which can be focal-less. This means that there is no need to know the focal length in advance: it can be completely random. This is possible because simulated annealing approximates the global optimum.

III. RESULTS

All the results provided in this section are obtained using the dataset [23]. This allowed us to compare our improvements to the baseline solution provided with the code from [8]. First, we will focus on the improvements obtained through the use of simulated local deformation, and then we will move on to the outcomes of focal length optimisation.

A. Quantitative Results – Simulated Local Deformation

As can be observed in Table I, the mean error per frame (computed over the whole dataset of 187 frames) for our method is 3.4% smaller when compared to the baseline approach. Importantly, as can be observed in Fig. 3, for the frames with large deformations our system obtains more matches (Fig. 3a), which results in better optimisation, i.e. shape reconstruction. This can be assessed qualitatively in Fig. 3c, and quantitatively by inspecting Table I. The difference between the baseline approach and our improved system is equal to 26.8%.

TABLE I. Mean reconstruction error in [mm] for the paper dataset [23].

    Dataset          Mean – whole dataset    Frame 82
    Bare EPFL [8]    14.16                   19.13
    EPFL + SLD       13.68                   14.00

B. Quantitative Results – Focal Length Optimisation

The optimisation was performed for the whole dataset, but its influence is most clearly visible for the most deformed images, e.g. frame 082. By applying the optimisation of the focal length, with the camera parameters provided in the dataset files as the starting point, we can further improve the final outcome by an additional 1.1% when compared to our approach with SLD. Moreover, we have performed a test with a random initialization of the focal length parameter; in this case the result is worse than the best one achieved so far, but it is still better than the baseline.

C. Discussion

In the second part of the results section we have focused on a single frame, but this is without loss of generality, as the algorithm always performs the reconstruction against the reference frame. This means that any frame in a sequence undergoes full optimisation against the reference frame.
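For reference, the focal-length search of Algorithm 1 can be prototyped compactly outside the C++ implementation. The sketch below is a minimal Python rendering under stated assumptions: `reprojection_error` is a hypothetical stand-in for the mean reprojection error returned by the unconstrained mesh optimisation step (a toy quadratic with a known minimum is used here), and the "accepted" condition of Algorithm 1 is read as the standard Metropolis criterion.

```python
import math
import random

def simulated_annealing_focal(reprojection_error, f_range,
                              t_start=1.0, t_end=1e-3,
                              decrease_factor=0.9, iters_fixed=50, seed=0):
    """Random-search simulated annealing over the focal length.

    `reprojection_error` maps a candidate focal length to a scalar error;
    in the paper this would be the mean reprojection error of the
    unconstrained mesh optimisation step. Here it is an arbitrary callable.
    """
    rng = random.Random(seed)
    f_opt = rng.uniform(*f_range)            # f <- f_random
    err_min = reprojection_error(f_opt)
    t = t_start
    while t > t_end:                         # cooling schedule
        for _ in range(iters_fixed):         # fixed iterations per temperature
            f = rng.uniform(*f_range)        # propose a new random focal length
            err = reprojection_error(f)
            if err < err_min:
                take = True                  # always keep improvements
            else:
                # Metropolis rule: occasionally accept a worse candidate
                take = rng.random() < math.exp(-(err - err_min) / t)
            if take:
                f_opt, err_min = f, err
        t *= decrease_factor                 # T <- T * decreaseFactor
    return f_opt, err_min

# Toy stand-in error with its minimum at f = 800 (not the real pipeline).
f_best, err_best = simulated_annealing_focal(
    lambda f: (f - 800.0) ** 2, f_range=(100.0, 2000.0))
```

With the toy error the search lands near the known optimum; in the real pipeline each evaluation would trigger a reconstruction pass, which is why the number of iterations per temperature is kept fixed.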

Fig. 3. Example frame (082) from the sequence with the sheet of paper [23], where large deformations are visible: (a) the application of SLD added 47 new matches, which in turn decreased the error of the 3D reconstruction (see Table I); (b) the displacement of the added matches; the added correspondences are located in the most deformed areas of the tracked surface; in (c) the violet mesh is the ground truth obtained from a Kinect sensor, the green shape is the result of the EPFL algorithm with SLD implemented, and the orange one is the result of the bare EPFL algorithm (without our improvements). Images from dataset [23].

TABLE II. Reconstruction error for frame 82 in [mm].

    Dataset                       Frame 82
    Bare EPFL [8]                 19.13
    EPFL + SLD                    14.00
    EPFL + SLD + SA + F known     13.85
    EPFL + SLD + SA + F random    18.02

There are no additive improvements on a frame-by-frame basis. The behaviour of the algorithm on frames with large deformations is of paramount importance if we think of the possible applications, which could be: entertainment, robotic manipulation and surgical systems. In each of these cases, a difference in the reconstruction of a couple of millimetres means that the Augmented Reality system will have low visual quality, the robot will be unable to grasp the object, and the surgeon will miss the tissue when performing medical treatment.

IV. CONCLUSIONS

We have presented methods which improve the quality of reconstruction of non-rigid shapes and provided quantitative results in the form of the mean reconstruction error. Simulated local deformation provides an algorithm robust to out-of-plane rotations, which successfully enlarges the number of matches between the reference and input frames. Additionally, the focal length adjustment step, performed during the unconstrained optimisation, further improves the quality of 3D reconstruction. Both of the presented concepts bring new features to the non-rigid reconstruction field of study, and what they bring is likely to be of use to the wider community. The benefits of using our approach are most prominent for large deformations of the object observed by the camera. In future work, we will work on parallelising the SLD approach and on developing an improved version of the system which will allow us to perform fully camera-parameter-free reconstruction of non-rigid objects.

REFERENCES

[1] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int. J. of Computer Vis. (IJCV), pp. 91–110, 2004.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “SURF: Speeded Up Robust Features,” Computer Vis. and Image Underst. (CVIU), vol. 110, no. 3, pp. 346–359, 2008.
[3] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in IEEE Int. Conf. on Computer Vis. (ICCV), 2011, pp. 2564–2571.
[4] S. Leutenegger, M. Chli, and R. Siegwart, “BRISK: Binary Robust Invariant Scalable Keypoints,” in IEEE Int. Conf. on Computer Vis. (ICCV), 2011, pp. 2548–2555.
[5] K. Lebeda, S. Hadfield, and R. Bowden, “Direct-from-video: Unsupervised NRSfM,” in Proc. of the ECCV Workshop on Recovering 6D Object Pose Estimation, ser. LNCS, vol. 9915. Springer, October 2016, pp. 578–594.
[6] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proc. of the IEEE Conf. on Computer Vis. and Pattern Recognition, 2017.
[7] A. Bronstein, M. Bronstein, and R. Kimmel, Numerical Geometry of Non-Rigid Shapes. Springer Int. Publishing, 2009.
[8] D. T. Ngo, J. Östlund, and P. Fua, “Template-based monocular 3D shape recovery using Laplacian meshes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 172–187, 2016. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2015.2435739
[9] N. Haouchine and E. Cotin, “Template-based monocular 3D recovery of elastic shapes using Lagrangian multipliers,” 2017.
[10] D. T. Ngo, S. Park, A. Jorstad, A. Crivellaro, C. D. Yoo, and P. Fua, “Dense image registration and deformable surface reconstruction in presence of occlusions and minimal texture,” in 2015 IEEE Int. Conf. on Computer Vis., ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2273–2281. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2015.262
[11] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger, VolumeDeform: Real-Time Volumetric Non-rigid Reconstruction. Cham: Springer Int. Publishing, 2016, pp. 362–379.
[12] R. A. Newcombe, D. Fox, and S. M. Seitz, “DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time,” in IEEE Conf. on Computer Vis. and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 343–352. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2015.7298631
[13] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic, “KillingFusion: Non-rigid 3D reconstruction without correspondences,” in Proc. of the IEEE Conf. on Computer Vis. and Pattern Recognition, 2017.
[14] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “KAZE features,” in Proc. of the 12th European Conf. on Computer Vis. - Volume Part VI, ser. ECCV’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 214–227.
[15] P. F. Alcantarilla, J. Nuevo, and A. Bartoli, “Fast explicit diffusion for accelerated features in nonlinear scale spaces,” in Br. Mach. Vis. Conf. (BMVC), 2013.
[16] G. Yu and J. Morel, “ASIFT: An algorithm for fully affine invariant comparison,” IPOL J., vol. 1, 2011. [Online]. Available: http://dx.doi.org/10.5201/ipol.2011.my-asift
[17] T. Y. Yang, Y. Y. Lin, and Y. Y. Chuang, “Accumulated stability voting: A robust descriptor from descriptors of multiple scales,” in 2016 IEEE Conf. on Computer Vis. and Pattern Recognition (CVPR), Jun. 2016, pp. 327–335.
[18] R. Melo, M. Antunes, J. P. Barreto, G. Falcão, and N. Gonçalves, “Unsupervised intrinsic calibration from a single frame using a plumb-line approach,” in 2013 IEEE Int. Conf. on Computer Vis., Dec. 2013, pp. 537–544.

[19] M. Antunes, J. P. Barreto, D. Aouada, and B. Ottersten, “Unsupervised
vanishing point detection and camera calibration from a single Manhattan
image with radial distortion,” in Proc. of the IEEE Conf. on Computer
Vis. and Pattern Recognition, 2017.
[20] M. Bednarek, “Comparison of visual descriptors for 3d reconstruction
of non-rigid planar surfaces,” in Image Processing and Communications
Challenges 9, M. Choraś and R. S. Choraś, Eds. Cham: Springer
International Publishing, 2018, pp. 191–198.
[21] M. Bednarek and K. Walas, “Local descriptors robust to out-of-plane
rotations,” in 2017 Signal Processing: Algorithms, Architectures, Ar-
rangements, and Applications (SPA), Sept 2017, pp. 154–159.
[22] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by
simulated annealing,” SCIENCE, vol. 220, no. 4598, pp. 671–680, 1983.
[23] A. Varol, M. Salzmann, P. Fua, and R. Urtasun, “A
constrained latent variable model,” in 2012 IEEE Conf. on
Computer Vis. and Pattern Recognition, Providence, RI, USA,
June 16-21, 2012, 2012, pp. 2248–2255. [Online]. Available:
http://dx.doi.org/10.1109/CVPR.2012.6247934


Vehicle detector training with labels derived from
background subtraction algorithms in video surveillance
S. Cygert, A. Czyżewski
Faculty of Electronics, Telecommunications and Informatics
Multimedia Systems Department
Gdansk University of Technology
Gdansk, Poland
sebcyg@multimed.org

Abstract—Vehicle detection in video from a miniature stationary closed-circuit television (CCTV) camera is discussed in the paper. The camera is one of the components of the intelligent road sign being developed in a project concerning traffic control with the use of autonomous devices. Modern Convolutional Neural Network (CNN) based detectors need big data input, usually demanding manual labeling. In the presented research approach, the weakly-supervised learning paradigm is used for the training of a CNN-based detector, employing labels obtained automatically through the application of a video background subtraction algorithm. The proposed method is evaluated on the GRAM-RTM dataset with a CNN fine-tuned with labels from the background subtraction algorithm. Even though the obtained representation in the form of labels may include many false positives and negatives, a reliable vehicle detector was trained employing them. The presented results show that such a method can be applied to traffic surveillance systems.

Index Terms—vehicle detection, traffic monitoring system, background subtraction, convolutional neural network

I. INTRODUCTION

Vehicle detection is a crucial task in traffic monitoring systems. In this work we focus on surveillance based on image data obtained from a small stationary CCTV camera. Different approaches, e.g. the use of infrared thermal cameras, have also been employed for vehicle detection [1]; however, that topic is not covered by this paper. Here we analyse RGB image data because of the ubiquitous presence of CCTV cameras on roadsides, and also because this type of images can be used for other tasks (e.g. plate number recognition).

In intelligent transportation systems it is desirable that the analysis is performed in an embedded system connected to the camera, because in this case only the results of the analysis are sent to a server, which is important especially when the number of cameras is substantial. However, only limited computing power is available for the traffic analysis inside an intelligent road sign.

Project financed by the Polish National Centre for Research and Development (NCBR) from the European Regional Development Fund under the Operational Programme Innovative Economy No. POIR.04.01.04-00-0089/16, entitled: "INZNAK: Intelligent Road Signs with V2X Interface for Adaptive Traffic Controlling".

Many other challenges are found in traffic surveillance. Firstly, the system must work in different weather conditions, such as rain, snow, the presence of fog, etc. The second problem, even more challenging, is detecting vehicles during the night and in varying illumination conditions. Usually different algorithms are used for that purpose, e.g. those based on tracking headlights [2]. The last important challenge is detection during traffic congestion, with highly overlapping vehicles.

Classical approaches to this task include the use of background subtraction algorithms [3], [4]. These methods usually provide satisfactory results with a stationary camera, good illumination conditions and a smooth traffic flow without congestion. What is more important, no learning (entailing costly annotations) is required, and nowadays those algorithms may easily achieve real-time analysis, which makes them perfect for embedded systems. Unfortunately, those methods fail in many situations, like different weather conditions, night-time lighting or traffic congestion.

Compared to background subtraction, trainable classifiers have several advantages. Firstly, they are usually much better at handling occlusions between overlapping vehicles. Also, it is possible to generate synthetic examples using various techniques to improve the performance of the classifier. Support vector machines (SVM) have been widely used for that purpose, together with histogram of oriented gradients (HOG) features [5] or the more computationally efficient Haar features [6]. Before the deep learning era, the state-of-the-art object detection employed supervised training of deformable part models [7] based on HOG features.

Recent advances in deep learning resulted in many CNN-based detectors [8]–[10] which obtain impressive performance. However, these methods may pose some drawbacks. Namely, since they are based on deep learning models, they are extremely data hungry, so training them is a time-consuming and costly task. Secondly, this kind of model requires specialized GPUs to run. This may pose a big obstacle when one wants to perform the analysis using embedded systems or has a big number of video sources at one's disposal, as it is

the case of traffic monitoring. As a result, in spite of the impressive performance of CNN-based detectors, they had not been widely used for traffic surveillance systems until the recent development of significantly smaller CNN-based architectures, e.g. MobileNet [11] and SqueezeDet [12]. While the smallest version of YOLO [10] has around 45 million parameters, SqueezeDet has only around 2 million parameters and still allows for obtaining competitive results. Such a small model size makes it deployable to embedded systems like the NVIDIA Jetson.

The authors of SqueezeDet published a detection model trained on the KITTI [13] dataset. However, using such an off-the-shelf pretrained model might not be practical in the case of a CCTV application. Firstly, the model is trained on images recorded from a camera placed inside a car, while CCTV cameras operate on a view with a much different perspective. Also, the KITTI dataset was recorded with a resolution of 1240 × 376 pixels, which is not typical for CCTV cameras. In that case, it is usually necessary to manually collect a large number of labels for the training. However, quite often it is possible to obtain less accurate labels with much less effort.

In this work we propose the use of labels obtained from a background subtraction algorithm to train a neural network. This is possible because we use data from a stationary CCTV camera. A similar concept was used in the literature [14] for person detection. The biggest advantage of that framework is that the training occurs in the online mode, without the need for manual labeling. Even though such labels are sometimes inexact, with a big number of annotations it is still possible to train a reliable detector. Training a classifier on such labels falls into the category of weakly-supervised learning, where many labels are imperfect or missing.

Consequently, our work integrates classical computer vision algorithms with a modern deep learning approach. We show that it is possible to train a reliable vehicle detector using inaccurate labels from a background subtraction algorithm. The obtained neural network is much more stable than the background subtraction algorithm utilized for the training, and it performs better in some situations.

II. DATASET

In order to obtain feasible labels from the background subtraction algorithm, a labeled video recorded from a stationary camera was needed. Interestingly, there are not that many big datasets of the kind needed. Many modern datasets focus on autonomous driving, and thus the camera is installed inside a moving car, so it is impossible to use a background subtraction algorithm in that case. Therefore we decided to use the GRAM Road-Traffic Monitoring (GRAM-RTM) dataset [15]. The author of the cited paper also released the Traffic and Congestions (TRANCOS) dataset. However, it contains mostly congested traffic, hence using background subtraction algorithms would result in very noisy labels, because the algorithm tends to merge overlapping vehicles into a single one.

The GRAM dataset consists of 3 video sequences, from which we decided to use the one called M30. It contains 7520 images with 800×480 resolution, recorded on a highway on a sunny day at 30 fps. In Fig. 1 an example image from the dataset can be seen with annotations. In the top picture some illumination artifacts from the camera can be observed (which in turn will pose a difficulty for the background subtraction algorithm).

Regarding the annotations, a few rules are followed. First of all, only vehicles on the right roadway are annotated (a region of interest mask is provided for that purpose). Secondly, only cars that are fully visible are annotated. This will affect the subsequent results, since our algorithm (correctly) detects such vehicles, but it will contribute to wrong detections, reducing the precision (and the F-measure). Finally, tiny objects in the far end are not annotated at all.

Figure 1: Example of annotations from the GRAM dataset. The picture at the top is overlayed with the region of interest mask. The white cars at the bottom are not annotated, because they are not fully visible.

III. EXPERIMENTS

We follow a standard split of 60:20:20 between the training, validation and test sets. The first 60% of the images are used for the training and the following 20% for validation, whereas the last part is used for testing.

Background subtraction

A comprehensive analysis of background subtraction algorithms and their performance can be found in the literature [3]. Based on the results from that paper, we decided to use the Gaussian Mixture Model (GMM) [16], as that algorithm obtains good results and is computationally effective.

Firstly, we run the background subtraction algorithm in order to obtain a foreground mask. Then we apply morphological

operators using a small circle as structuring element to remove
noise: closing is used to remove gaps in areas, then opening in
order to remove small artifacts, and finally dilation is used to
make the object bolder. We show an obtained binary mask in
Fig. 2 on the left-hand side. We then apply a simple contour
finder to detect objects on the mask. Finally, all detected
bounding boxes that are smaller than 25 pixels are skipped to
remove too many noisy labels. Final detection result is showed
on the right-hand side in Fig. 2. Two vehicles at the far end are
merged into one object because their foreground masks overlap
with each other. We follow that procedure for the whole video
in order to obtain appropriate labels. Because we use adaptive
GMM algorithm, detections of the first 30 frames are ignored,
as at that stage GMM is being initialized.
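The labeling pipeline described above (adaptive background model, morphological clean-up with a small circular structuring element, contour/component extraction, and the 25-pixel size filter) can be sketched as follows. This is a minimal illustration only: it uses a single running Gaussian per pixel and SciPy morphology instead of the adaptive GMM of [16], and the parameter values below are our own assumptions, not the paper's settings.

```python
import numpy as np
from scipy import ndimage

def label_video(frames, k=2.5, lr=0.05, min_size=25, warmup=30):
    """Produce bounding boxes (x, y, w, h) for each frame after warm-up.

    Simplified stand-in for the paper's pipeline: one Gaussian per pixel
    instead of the adaptive GMM of [16]; background statistics are only
    updated at pixels currently classified as background."""
    mu = frames[0].astype(float)                # background mean per pixel
    var = np.full(mu.shape, 25.0)               # background variance per pixel
    circle = ndimage.generate_binary_structure(2, 1)  # small structuring element
    all_boxes = []
    for t, frame in enumerate(frames):
        f = frame.astype(float)
        fg = (f - mu) ** 2 > (k ** 2) * var     # foreground test
        bg = ~fg
        mu[bg] += lr * (f[bg] - mu[bg])         # adapt model on background only
        var[bg] += lr * ((f[bg] - mu[bg]) ** 2 - var[bg])
        if t < warmup:                          # skip frames while model initializes
            continue
        mask = ndimage.binary_closing(fg, circle)    # fill gaps inside objects
        mask = ndimage.binary_opening(mask, circle)  # remove small artifacts
        mask = ndimage.binary_dilation(mask, circle) # make objects bolder
        labeled, _ = ndimage.label(mask)        # connected components ~ contours
        boxes = []
        for rows, cols in ndimage.find_objects(labeled):
            w, h = cols.stop - cols.start, rows.stop - rows.start
            if w >= min_size and h >= min_size: # drop tiny, noisy boxes
                boxes.append((cols.start, rows.start, w, h))
        all_boxes.append(boxes)
    return all_boxes
```

Updating the model only at background pixels keeps a moving vehicle fully in the foreground; the price is that anything present in the very first frame is burnt into the model, which is why the warm-up and an empty initial scene matter in this toy version.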

Figure 2: Background subtraction results: foreground mask (on the left) and corresponding detections (on the right)

Figure 3: Example of inaccurate or faulty detections by the background subtraction algorithm.

In Fig. 3 we show examples of incorrect detections by the background subtraction algorithm. They are caused by many factors: illumination artifacts, overlapping vehicles, the shadow behind a car, or just random camera noise present in the image. Two observations should be noted. First, the algorithm does not detect small vehicles in the far end; however, this is not an important issue for vehicle counting or for speed estimation. Second, most of the detections are shifted slightly toward the right and the bottom part of the car, as many of the background subtraction labels also include the shadow of the car. There are some shadow removal algorithms [17], yet their application is not required in this case. It is also worth mentioning that some of these faulty detections could be removed by applying, for example, Kalman filtering [18]. However, in this work we wanted to test whether a neural network trained on inaccurate labels from a background subtraction algorithm would achieve a better performance than the algorithm itself.

SqueezeDet

Labels obtained in the previous step were used to train the neural network. For detection we have used SqueezeDet¹, which adopts a YOLO-inspired [10] architecture. First, each image is divided into a grid (50x30 in our case). Then each cell in the grid returns k confidence scores, where k is the number of predefined anchors. Each anchor defines a prior distribution of the bounding-box size; ideally, the anchors would fit the boxes encountered at test time. Because we propose an online detector, we compute the anchors using only the labels provided by the background subtraction algorithm on the training set. We follow the k-means procedure suggested in the literature [19], and we empirically choose k=6. We show the bounding-box size distributions from the ground-truth data and from the background subtraction labels in Fig. 4. The estimated boxes contain a lot of noise, especially when the width or the height of the object is big, which is caused mainly by illumination artifacts or by merging different vehicles. Despite that, the estimated anchor boxes look quite similar.

(a) Ground truth bounding boxes (b) Estimated bounding boxes
Figure 4: Comparison of bounding box size distributions between the ground truth and its estimation

After the configuration step, the training procedure follows. As mentioned previously, here we have used only the labels obtained from the background subtraction algorithm. Our SqueezeDet setup is very similar to the one used by the original authors in [12]. However, we do not differentiate between cars and motorbikes (because we do not have appropriate labels at our disposal). As a result, our loss function consists only of the sum of the bounding box regression part

¹ https://github.com/omni-us/squeezedet-keras
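The anchor computation described above can be sketched with a plain k-means over the (width, height) pairs of the background-subtraction labels. Note that this is an illustrative version under our own assumptions: it uses Euclidean distance and a deterministic spread initialization, whereas [19] and YOLO-style pipelines typically cluster with an IOU-based distance.

```python
import numpy as np

def anchor_kmeans(wh, k=6, iters=50):
    """Cluster (width, height) pairs into k anchor sizes with plain k-means."""
    wh = np.asarray(wh, dtype=float)
    # deterministic init: k points spread along the width axis
    order = np.argsort(wh[:, 0])
    centers = wh[order[np.linspace(0, len(wh) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        # distance of every box to every center, then nearest-center assignment
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = wh[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)  # recenter on cluster mean
    return centers
```

The returned k centers serve as the anchor priors; running this once on the ground-truth boxes and once on the background-subtraction boxes is how the two distributions in Fig. 4 can be compared.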

and the confidence score part, without the classification part used in the original paper. The final model uses only 7.5 MB to represent the feature vectors.
We used an initial learning rate of 0.01 and a batch size of 8. We start training with weights pretrained on the ImageNet dataset. Tests were performed on an Intel Core i5-6440HQ operating at 2.60 GHz. Each training epoch took around 45 minutes. In Fig. 5 one can see that the F-measure saturates after around 10 epochs, while the loss saturates after around 25 epochs. The F-measure on the validation set reaches 0.927. Note that this F-measure is computed against validation labels from the background subtraction algorithm, not against ground-truth data.

Figure 5: Loss and F-measure metrics obtained during training

IV. RESULTS AND DISCUSSION

The background subtraction algorithm and the trained SqueezeDet were evaluated on the remaining test images from the GRAM dataset. A detection is accepted when the IOU of the predicted bounding box with the ground-truth bounding box is bigger than 0.5. We skip all vehicles from the ground truth that are smaller than 25 pixels in width and height, since only really small and blurry cars appear at those sizes. For testing SqueezeDet we chose the model that had the best F-measure on the validation set; using the model with the lowest loss yields very similar results.

The obtained results are shown in Table I.

Table I: Numerical results of classification on the GRAM dataset

  Algorithm     Precision   Recall   F-measure   Speed (FPS)
  GMM           0.7997      0.8343   0.8167      73.9
  SqueezeDet    0.8058      0.8433   0.8241       6.4

It turned out that the SqueezeDet algorithm performs slightly better than the background subtraction algorithm it was trained with; however, the difference is very small.
It is very interesting to see what the differences in predictions between the two algorithms are. In Fig. 6 we show examples where SqueezeDet performs clearly better than the background subtraction algorithm. SqueezeDet easily handles all illumination (top row) and shadow (bottom row) artifacts, in which cases the background subtraction caused faulty detections. Also, it is important to note that the navy blue car at the bottom of the top picture, which is correctly detected by SqueezeDet, was not annotated in the ground truth (because it is not fully visible yet). The fact that not fully visible cars are not annotated in the ground truth, while in most cases they are correctly detected by our algorithms, significantly reduces the precision and F-measure values.

Figure 6: Examples of background subtraction detections (on the left) and corresponding SqueezeDet detections (on the right)

Failure examples of SqueezeDet detections are presented in Fig. 7. The neural network has problems with detecting some big vehicles. The big truck in the left picture was incorrectly detected as two overlapping vehicles. In the right picture we can see that vehicles in the far end were not annotated at all, or were not annotated precisely. Both cases are, however, very similar to the faulty results caused by the background subtraction algorithm depicted in Fig. 3.

Figure 7: Examples of faulty detection by SqueezeDet: double detection of the truck (left picture), and lack of annotations for vehicles at a bigger distance (both pictures).

To gain a deeper insight into the results, we performed two additional tests. First, we decreased the IOU threshold from 0.5 to 0.25. This step increased the F-measure to 0.881 for SqueezeDet and to 0.834 for the background subtraction

algorithm, which shows that the SqueezeDet detections are significantly closer to the ground truth. In the second test we removed from the ground truth all objects that were in the top quarter of the image, because objects in the far end are not detected. This increased recall to 0.963 for SqueezeDet and to 0.9534 for background subtraction, which shows that in the general case both algorithms detect almost all of the cars.
We also tested detection on the left roadway. Inference was run on the test set without using the region-of-interest mask applied earlier. Results are presented in Fig. 8. In general, most of the vehicles are detected; however, many of the bounding boxes are shifted, and a few false positives appear. We cannot perform a numerical evaluation here because of the lack of ground-truth data.

Figure 8: Examples of detection performed on whole images without the region-of-interest mask

We conclude that SqueezeDet is in general better in the vehicle detection task than the background subtraction algorithm it was trained on. It presents much more stable results, it is not affected by illumination artifacts, and it does not produce the noisy detections observed when the background subtraction algorithm is employed. The only points of failure for SqueezeDet are wrong detections for objects in the far end and for large vehicles at the bottom of the image. Evaluation on a much larger dataset is required; this would allow testing the trained algorithm in different weather and traffic conditions. Therefore, we have recently installed some cameras for monitoring the roads surrounding our campus.
In terms of processing speed, SqueezeDet is about an order of magnitude slower, reaching 6.4 FPS on the CPU (cf. Table I). In further experiments we shall evaluate the algorithm on an NVIDIA Jetson TX2 to test whether it is feasible for real-time applications.

V. CONCLUSIONS

In this work we have shown that it is possible to train a reliable neural-network-based vehicle detector using inaccurate labels obtained from a background subtraction algorithm. We have presented an online solution framework which reveals the potential of CCTV applications in embedded systems. Despite only a slight increase in accuracy of the neural net over background subtraction, the algorithm has several advantages. Firstly, the obtained detections are in general more stable, more resistant to illumination changes, and they contain less noise. Moreover, modern CCTV cameras have the capability to change the zoom or the direction they are facing. The proposed algorithm can instantly detect vehicles in the new position, while the adaptive background subtraction algorithm requires a few seconds to create a new background model. Finally, with a trainable algorithm it is possible to create many new synthetic training examples which may increase the accuracy of the SqueezeDet detector, e.g. by zooming images in order to increase the accuracy achieved on small objects, or by inpainting rain or fog into the images in order to increase the accuracy in bad weather conditions.

Future work

There are many ways to effectively extend the presented algorithm. In the first phase, it would be really interesting to see how a neural network trained on synthetically inpainted weather conditions (rain, snow, fog) performs. In case of success, this would present a significant advantage over the background subtraction algorithm.
Furthermore, it would be important to clean the labels created by the background subtraction algorithm. For example, employing a Kalman filter would certainly remove some noisy labels. Some weakly-supervised techniques could also be used; in particular, we could run a few background subtraction algorithms simultaneously and then aggregate the obtained labels using majority voting or more advanced techniques described in the literature [20].
Finally, we plan to test our technique in a real environment on a larger dataset. This could give us a unique opportunity to verify the efficiency of our algorithm in different weather and traffic congestion conditions.

REFERENCES

[1] Y. Iwasaki, “A method of robust moving vehicle detection for bad weather using an infrared thermography camera,” vol. 1, pp. 86–90, Aug 2008.
[2] M. Taha, H. H. Zayed, T. Nazmy, and M. Khalifa, “Day/night detector for vehicle tracking in traffic monitoring systems,” International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 10, no. 1, pp. 98–104, 2016.
[3] A. Sobral and A. Vacavant, “A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos,” Computer Vision and Image Understanding, vol. 122, pp. 4–21, 2014.
[4] N. A. Mandellos, I. Keramitsoglou, and C. T. Kiranoudis, “A background subtraction algorithm for detecting and tracking vehicles,” Expert Systems with Applications, vol. 38, no. 3, pp. 1619–1631, 2011.
[5] M. Cheon, W. Lee, C. Yoon, and M. Park, “Vision-based vehicle detection system with consideration of the detecting location,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1243–1252, Sept 2012.
[6] X. Wen, L. Shao, W. Fang, and Y. Xue, “Efficient feature selection and classification for vehicle detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 508–517, March 2015.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, Sept 2010.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: single shot multibox detector,” CoRR, vol. abs/1512.02325, 2015.
[9] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.
[10] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” CoRR, vol. abs/1612.08242, 2016.

[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural
networks for mobile vision applications,” CoRR, vol. abs/1704.04861,
2017.
[12] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
small, low power fully convolutional neural networks for real-time object
detection for autonomous driving,” CoRR, vol. abs/1612.01051, 2016.
[13] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous
driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on
Computer Vision and Pattern Recognition, June 2012, pp. 3354–3361.
[14] V. Nair and J. J. Clark, “An unsupervised, online learning framework
for moving object detection,” Proceedings of the 2004 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2004.
CVPR 2004., vol. 2, pp. II–II, 2004.
[15] R. Guerrero-Gomez-Olmedo, R. J. Lopez-Sastre, S. Maldonado-Bascon,
and A. Fernandez-Caballero, “Vehicle tracking by simultaneous detec-
tion and viewpoint estimation,” in IWINAC 2013, Part II, LNCS 7931,
2013, pp. 306–316.
[16] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density
estimation per image pixel for the task of background subtrac-
tion,” Pattern recognition letters, vol. 27, pp. 773–780, 1 2006,
10.1016/j.patrec.2005.11.005.
[17] Z. Zhu and X. Lu, “An accurate shadow removal method for vehicle
tracking,” in 2010 International Conference on Artificial Intelligence
and Computational Intelligence, vol. 2, Oct 2010, pp. 59–62.
[18] R. Rad and M. Jamzad, “Real time classification and tracking of multiple
vehicles in highways,” Pattern Recognition Letters, vol. 26, no. 10, pp.
1597 – 1607, 2005.
[19] K. Ashraf, B. Wu, F. N. Iandola, M. W. Moskewicz, and K. Keutzer,
“Shallow networks for high-accuracy road object-detection,” CoRR, vol.
abs/1606.01561, 2016.
[20] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label?
improving data quality and data mining using multiple, noisy labelers,”
in Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, ser. KDD ’08. New York, NY,
USA: ACM, 2008, pp. 614–622.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Audio-visual aspect of the Lombard effect and comparison with recordings depicting emotional states
Szymon Zaporowski
Multimedia Systems Department, Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
G. Narutowicza 11/12, Gdańsk, Poland
smck@multimed.org

Bożena Kostek
Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
G. Narutowicza 11/12, Gdańsk, Poland

Joanna Gołębiewska
Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
G. Narutowicza 11/12, Gdańsk, Poland

Julia Piltz
Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
G. Narutowicza 11/12, Gdańsk, Poland

Abstract—In this paper an analysis of audio-visual recordings of the Lombard effect is presented. First, the audio signal is analyzed, indicating the presence of this phenomenon in the recorded sessions. The principal aim, however, was to discuss problems related to extracting the differences caused by the Lombard effect that are present in the video, i.e. visible as the tension and work of facial muscles aligned with an increase in the intensity of the articulated speech signal. Also, a database of recordings depicting emotional states, available on the internet, was analyzed in order to compare and find a visual similarity between the Lombard effect and the sentiment contained in speech. The results are discussed and further plans are outlined.

Keywords: Lombard effect, audio and image processing, psychoacoustics

I. INTRODUCTION

The Lombard effect is a phenomenon known since 1909, when Etienne Lombard, a French otolaryngologist, discovered an unintended tendency to increase the volume of the voice to improve audibility in loud environments [1]. This effect is thoroughly described in the literature, but the analysis is in most cases limited to the acoustic side: observations of the speaker's increased volume, increased pace of speech, increased fundamental frequency, or increased duration of vowels along with the increase of sound intensity in the speaker's surroundings are routinely made [2][3][4][5]. The visual side of the Lombard effect, however, is not investigated systematically, although there exist papers that describe the visual aspect: the increase in muscle tone, the exaggerated accuracy of articulation, or the raising of eyebrows as a consequence of the Lombard speech [6]. One of the reasons for the lack of interest in the visual aspect of the effect may be the inability to recreate a real-life scenario in the experiments performed. Another aspect may be the need for high-quality cameras to accurately capture all visual features, or for Facial Motion Capture equipment to obtain even more accurate data [7]. However, it should be remembered that along with technological progress the availability of such technologies is increasing, and their low cost may trigger a renewed interest in this subject.

II. METHODOLOGY

The aim of this study was to show whether the scenarios used in the experiments allow for discerning the Lombard effect in both the audio and the visual recordings. Another goal was to check for visual effects of the Lombard phenomenon discernible in the video, extracting similarities and differences between the faces of the speakers recorded in the experiments and external video recordings with various sentiments. For this purpose, audio and video recordings of a group of speakers were made while they read a list of sentences and words prepared in advance. The first recording was made as the reference point, without noise. Then the speaker put on headphones to which noise was sent at a given volume. The whole statement was recorded, and then parameters typical for the Lombard effect were calculated, such as an increase in the sound level, a change in the fundamental tone, an increase in the formant frequencies F1 and F2, or an increase in the duration of vowels. Such an analysis was also performed for allophones, which are phonetic realizations of a phoneme in a language that do not contribute to distinctions of meaning. This was the basis for determining whether the Lombard effect occurred during the sessions.
Another part of the study was the analysis of the video recordings in terms of visible changes on the faces of the speakers when reading a sentence or a word while the level of noise was greatest. Facial muscle tension changes, pulsating temples, more pronounced lip movements or raised eyebrows were searched for. After finding such behaviors, the video recordings

were compared to the external database of recordings with sentiment in order to find similarities and differences in face movements and muscle tension.
A detailed description of how the recordings were made, the results obtained and their analyses are presented in the following sections.

III. CAPTURING THE LOMBARD EFFECT IN AUDIO

A. Audio recordings
To capture the Lombard effect in the audio-video recordings, two types of noise were used. The first one was babble speech, also known as the cocktail-party effect. The second one was pink noise. The levels ranged from 72.5 dB SPL to 83.8 dB SPL for pink noise, and 80 dB SPL for babble speech. Noise was played via an iPod. The pink noise levels correspond to 70% and 90% of the volume level of the generator used. Babble speech was played as a recording with the embedded player at maximum volume, which corresponds to 80 dB SPL.
During the recording sessions eight people were recorded, four men and four women, aged from 23 to 26. All speakers were of Polish origin. The list contained five sentences in Polish for each type of sentence: giving a command or warning (imperative), making a statement (declarative), asking a question (interrogative), or exclamatory ones. It consisted of messages and expressions that can be heard in the case of danger or other emergency situations, e.g. ‘Head towards the emergency exit!’ or ‘Is there a doctor among us?’. The list also consisted of 10 words which can be used in emergency situations, e.g. ‘fire extinguisher’, ‘medic’, ‘alarm’, etc. Expressions were recorded using a shotgun microphone, placed about 1 m from the speaker, pointed towards the speaker's mouth. All audio was recorded as 24-bit wave files with a sample rate of 48 kHz. The sound level was measured with a B&K handheld analyzer, model 2238 Mediator, placed 50 cm from the speaker's mouth.

B. Video recording
Video was recorded using a GoPro 5 camera at a resolution of 1280x720p and a frame rate of 240 frames per second. Those parameters were selected to capture the process of stretching the face muscles and other possible muscle movements resulting from the nature of the Lombard effect in the most precise way. The camera was placed about 2 meters from the speaker to capture the face and torso. Additional light was also used: a pair of directional lamps and a lamp directed at the face of the speaker to light the face during the recordings in order to observe the changing facial expressions.

C. Sentiment recording database
To compare the executed recordings with external video material, the Surrey Audio-Visual Expressed Emotion (SAVEE) database was chosen [8]. This database was recorded, as its authors declared, as a prerequisite for the development of an automatic emotion recognition system. It consists of recordings of four male actors and seven different emotions. The actors are between the ages of 27 and 31. The recorded emotions have been described psychologically and labeled with the following categories: anger, disgust, fear, happiness, sadness and surprise. A neutral emotion was also recorded.

IV. RESULTS

The following results were obtained for the acoustic side of the study. Table I presents the minimum and maximum values as well as averages for the speaker volume changes depending on the type of noise and its intensity.

TABLE I. CHANGE OF LOUDNESS FOR ACOUSTIC RECORDINGS

                                   Loudness [dB]
  Noise           [dB]    Min     Max     Aver. (sentences)   Aver. (words)
  No noise        -       60.3    63.9    62                  63
  Pink noise      72.5    62      68.4    65                  66
  Pink noise      83.8    65.6    70      67                  67.5
  Babble speech   80      64.6    69.6    66                  68

Table II contains some of the results obtained depending on the gender of the speaker. The abbreviation ‘Sp’ with the corresponding number indicates the number assigned to the speaker.

TABLE II. LOUDNESS OF SPEAKING DEPENDING ON GENDER (loudness in dB)

  Noise           [dB]    Sp2     Sp3    Sp5    Sp6
  No noise        -       54      56     56     58
  Pink noise      72.5    60      60     65     63
  Pink noise      83.8    63.5    65     71     66
  Babble speech   80      62      61     65     64

Table III shows a comparison of the mean value of the fundamental frequency F0 depending on the type of noise.

TABLE III. COMPARISON OF F0 DEPENDING ON THE NOISE TYPE

             No noise   Pink noise 72.5 [dB]   Pink noise 83.8 [dB]   Babble speech 80 [dB]
  F0 [Hz]    124.66     152.663                164.306                163.591

Table IV contains the changes in the F1 and F2 formant frequencies for the word ‘evacuation’. Speakers no. 1 to 3 are women; speakers no. 4 to 6 are men.

TABLE IV. CHANGES IN F1 AND F2 FORMANTS DEPENDING ON NOISE AND GENDER (frequency in Hz)

                               Sp1     Sp2     Sp3     Sp4     Sp5     Sp6
  No noise          F1 [Hz]    695     588     605     632     569     514
                    F2 [Hz]    1639    565     1506    1559    1334    1453
  Pink noise        F1 [Hz]    674     661     623     652     571     553
  72.5 [dB]         F2 [Hz]    1695    1560    1523    652     1315    1463

(Table IV, continued)
                               Sp1     Sp2     Sp3     Sp4     Sp5     Sp6
  Pink noise        F1 [Hz]    745     691     643     702     627     548
  83.8 [dB]         F2 [Hz]    1570    583     1636    1605    1430    1504
  Babble speech     F1 [Hz]    703     706     643     635     624     534
  80 [dB]           F2 [Hz]    1643    1548    1527    530     1588    1373

In Tables V and VI, changes in the F1, F2 and F3 formants are presented for the allophones /o/ and /a/ in the recorded Polish speech. Speakers 2-3 are women; speakers 5-6 are men. The allophones /a/ and /o/ were chosen due to the fact that the letters ‘a’ and ‘o’ are, correspondingly, the first and the third most frequently occurring vowels in the Polish language according to the National Corpus of Polish Speech [9]. Also, of the top ten vowels occurring in Polish according to the previously mentioned corpus, /a/ and /o/ have the most similar formant frequencies.

TABLE V. CHANGES IN F1, F2 AND F3 FORMANTS DEPENDING ON NOISE AND GENDER FOR THE /O/ ALLOPHONE

                            Sp2     Sp3     Sp5     Sp6
  No noise      F1 [Hz]     348     405     464     429
                F2 [Hz]     1435    1416    1238    1189
                F3 [Hz]     2544    2550    3112    2099
  Pink noise    F1 [Hz]     538     582     566     527
  83.8 [dB]     F2 [Hz]     1478    1429    1667    1353
                F3 [Hz]     2818    2493    3600    2272

TABLE VI. CHANGES IN F1, F2 AND F3 FORMANTS DEPENDING ON NOISE AND GENDER FOR THE /A/ ALLOPHONE

                            Sp2     Sp3     Sp5     Sp6
  No noise      F1 [Hz]     707     671     421     583
                F2 [Hz]     1426    1567    1690    1450
                F3 [Hz]     2708    2799    2633    2461
  Pink noise    F1 [Hz]     791     805     509     624
  83.8 [dB]     F2 [Hz]     1480    1603    1661    1127
                F3 [Hz]     2736    2743    2640    2137

Table VII shows the duration of the allophone /a/ depending on the intensity of the noise.

TABLE VII. DURATION OF THE ALLOPHONE /A/ DEPENDING ON NOISE INTENSITY

  Duration time [s]          Sp3      Sp4     Sp5      Sp6
  No noise                   0.044    0.05    0.028    0.06
  Pink noise 72.5 [dB]       0.035    0.06    0.038    0.07
  Pink noise 83.8 [dB]       0.031    0.08    0.042    0.1
  Babble speech 80 [dB]      -        0.07    0.04     0.09

Fig. 1 presents a comparison of one of the speakers in whom the Lombard effect may be discerned in the visual layer. On the top left, the image from the camera is shown when the speaker did not hear any noise; on the right, with a noise level of 83.8 dB. Fig. 2 shows the captured effect of raising the eyebrows when speaking a sentence while pink noise of 83.8 dB is heard, and the comparison between speaking without noise and with the maximum level of noise. It is important to mention that all volunteers agreed to show their image in the research paper and the privacy of these people has not been violated.
When eventually trying to relate these photos to the sentiment analysis, it seems that the expression recognized is that of concentration or anger.

Fig. 1. On the top left image, a speaker speaking without noise; then hearing noise at the max level.

Fig. 2. From the top left: a speaker raising eyebrows when hearing noise, a speaker speaking without noise, and a speaker hearing noise at the max level.
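The acoustic measures reported in Tables I-III (speech level and fundamental frequency F0) can be estimated along the following lines. This is a minimal numpy sketch under our own assumptions, not the authors' toolchain: the level is RMS relative to digital full scale (the paper reports calibrated SPL from a B&K meter), and F0 comes from a crude autocorrelation-peak search.

```python
import numpy as np

def level_db(x):
    """RMS level in dB relative to full scale (not calibrated dB SPL)."""
    rms = np.sqrt(np.mean(x * x))
    return 20 * np.log10(rms) if rms > 0 else float("-inf")

def f0_autocorr(x, fs, fmin=60.0, fmax=400.0):
    """Crude F0 estimate: strongest autocorrelation lag in the speech range."""
    x = np.asarray(x, dtype=float) - np.mean(x)   # remove DC offset
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)       # lag range for [fmin, fmax]
    lag = lo + int(np.argmax(ac[lo:hi]))          # best-matching period
    return fs / lag
```

Comparing these two numbers between the no-noise recording and the noisy recording of the same utterance is the basic operation behind the loudness and F0 deltas in the tables; a practical version would add framing, voicing detection and calibration.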

V. COMPARISON WITH THE SAVEE SENTIMENT DATABASE

As a result of the comparison of the recordings with the SAVEE database, several similarities were found. The main similarity is the exaggerated articulation in the recordings with the highest level of pink noise and in the anger recordings. One can see a similar way of tensing the facial muscles. In the case of the SAVEE database, however, it is an intentional operation resulting from acting: to develop an automatic emotion recognition system, it is required to clearly mark all manifestations of the selected emotions, both acoustically and visually. In the experiments carried out by the authors, on the other hand, one of the speakers, while uttering a sentence at the highest applied level of noise, began to raise his eyebrows unintentionally. As an example, in Fig. 3 one can see a speaker (from the SAVEE database) expressing anger and the same person with a neutral face expression. There may be some degree of similarity between the emotions shown in this database and speakers affected by the Lombard effect; however, it is difficult to connect them in a straightforward way.

Fig. 3. Speaker speaking with emotional anger (left side); on the right side, a neutral face expression.

VI. CONCLUSIONS AND DISCUSSION

Our findings coincide very well with previous results presented in the literature [2], even though different scenarios were proposed. A change in the intensity of speech uttered by the speaker was observed depending on the given external noise, with a resulting change in the fundamental frequency F0 as well as in the F1 and F2 formant frequencies, and an elongation of the vowel duration. Moreover, these effects for F1 and F2 are also visible in the allophone analyses. However, in some cases F2 is not greater when analyzing signals with the Lombard effect. This could be the effect of some measurement errors; nevertheless, those allophones were measured several times for each realization and similar results were obtained. F1 is always greater; for the /a/ allophone the differences are not as significant as for the allophone /o/.
It is interesting to note the relationship between gender and the Lombard effect. Based on the data collected, it may be concluded that men are more susceptible to this effect. This is also reflected in the video data. In the case of men, considerable tension of the facial muscles was observed during articulation at the highest levels of noise. Contrarily, when the females were speaking, there was only a slight change in the way the lips were pulled along with the muscles around the mouth. In the case of one of the men, the eyebrows were raised while saying a sentence. Similar effects with regard to the Lombard phenomenon were described in the literature [6]. The most important observed effect is the muscle tension visible on the forehead of one of the speakers. This tension is similar to some extent to those that can be seen in the SAVEE anger emotion recordings.
The authors see the possibility of using the Lombard effect in the future as a kind of amplification of the message when recording data sets for speech recognition from the movement of the lips, or for the recognition of emotions combined with a deep learning approach [10]. An interesting proposal in the context of the results obtained seems to be conducting studies focused on the impact of the visual side of the Lombard effect on recordings with sentiment. Another idea for a study is to further investigate the impact of the Lombard effect on females and males. For such an extension of the experiments, Facial Motion Capture may be utilized. Also, EEG helmets may bring some insight into the emotions expressed by the speakers when reading text in noise. A more thorough approach to this issue may use parameterization of an image region of interest for comparison, or a geometric grid placed on the speaker's face image to extract the exact spots where changes in muscle tension occur. Another interesting approach to the presented subject is to test the possibility of using the obtained audio results in near-end listening enhancement systems for speech intelligibility improvement in noisy environments. Such algorithms use sub-band non-linear amplification of the audio signal according to the noise spectral characteristics and the hearing thresholds of the listener [11].

Acknowledgments
Research partly sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874.

VII. REFERENCES

[1] E. Lombard, “Le signe de l'élévation de la voix (translated from French),” Ann. des Mal. l'oreille du larynx, vol. 37, no. 2, pp. 101–119, 1911.
[2] P. Kleczkowski, A. Żak, and A. Król-Nowak, “Lombard Effect in Polish Speech and its Comparison in English Speech,” vol. 42, no. 4, pp. 561–569, 2017.
[3] S. A. Zollinger and H. Brumm, “The Lombard effect,” Curr. Biol., vol. 21, no. 16, pp. R614–R615, 2011.
[4] J.-C. Junqua, S. Fincke, and K. Field, “The Lombard effect: a reflex to better communicate with others in noise,” 1999 IEEE Int. Conf. Acoust. Speech, Signal Process. Proceedings. ICASSP99 (Cat. No.99CH36258), pp. 2083–2086, vol. 4, 1999.
[5] A. S. Therrien, J. Lyons, and R. Balasubramaniam, “Sensory Attenuation of Self-Produced Feedback: The Lombard Effect Revisited,” PLoS One, vol. 7, no. 11, 2012.
[6] J. Kim, C. Davis, G. Vignali, and H. Hill, “A visual concomitant of the Lombard reflex,” Actes Audit. Speech Process., vol. 2005, pp. 17–21, 2005.
[7] D. Jachimski, A. Czyzewski, and T. Ciszewski, “A comparative study of English viseme recognition methods and algorithms,” Multimed. Tools Appl., 2017.
[8] S. Haq and P. J. B. Jackson, “Multimodal Emotion Recognition,” Mach. Audit., pp. 398–423.
[9] “National Corpus of Polish Speech.” [Online]. Available: http://nkjp.pl/. [Accessed: 10-May-2018].
[10] A. Hannun et al., “Deep Speech: Scaling up end-to-end speech recognition,” pp. 1–12, 2014.
[11] E. Azarov, M. Vashkevich, V. H., and A. P., “General-purpose listening enhancement based on subband non-linear amplification with psychoacoustic criterion,” Audio Eng. Soc. Conv. 138, 2015.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

HDR Image Tone Mapping Approach based on Near Optimal Separable Adaptive Lifting Scheme

Ba Chien Thai∗, Anissa Mokraoui∗ and Basarab Matei†
∗ L2TI, † LIPN, Institut Galilée, Université Paris 13 Sorbonne Paris Cité
99, Avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
{bachien.thai, anissa.mokraoui}@univ-paris13.fr, matei@lipn.univ-paris13.fr

Abstract—This paper proposes a Tone Mapping (TM) approach converting a High Dynamic Range (HDR) image into a Low Dynamic Range (LDR) image while preserving as much information of the HDR image as possible to ensure a good LDR image visual quality. This approach is based on a separable near optimal lifting scheme using a powerful adaptive prediction step. The latter relies on a linear weighted combination depending on the neighboring coefficients, extracting the relevant finest details in the HDR image at each resolution level. Moreover, the approximation and detail coefficients are modified according to the entropy of each subband. The pixel distribution of the coarse reconstructed LDR image is then adjusted according to a perceptual quantizer with respect to the human visual system using a piecewise linear function. Simulations provide good results, both in terms of visual quality and of the TMQI metric, compared to existing competitive TM approaches.

I. INTRODUCTION

The objective of a High Dynamic Range (HDR) image Tone Mapping (TM) approach is to find a trade-off between the relevant information (e.g. details, contrast, brightness...) to be preserved or discarded in the image, ensuring a good visual quality of the displayed image on Low Dynamic Range (LDR) devices that would be appreciated by observers.

A state of the art on HDR image TM approaches is fairly complete in [1], [2] and [3], where a classification is provided. Among the developed TM strategies, this paper quickly reviews those that caught our attention because of their performance. In [6], the TM approach reduces the HDR contrast while preserving the image details. This work uses an edge-preserving bilateral filter to decompose the HDR image into two layers: a base layer encoding large-scale variations and a detail one. Contrast is then reduced only in the first layer while the details are kept unchanged. An adaptive logarithmic mapping method of luminance values is presented in [7]. It concerns the adjustment of the logarithmic basis depending on the radiance of the pixels. In [8], a subband architecture based on an oversampled Haar pyramid representation is proposed. Subband coefficients are re-scaled according to a gain control function reducing the high frequency magnitudes and boosting low ones. In [9], a TM optimization approach using a histogram adjustment between linear mapping and the equalized histogram mapping is developed. A modification of this approach is made in [10], where revisited histogram equalization approaches are discussed. The latter considers both histogram equalization and human sensitivity to the light function.

In [11], a second generation of wavelets based on the edge content of the image, avoiding having pixels from both sides of an edge, is proposed. In [12], a separable non-linear multiresolution approach based on an essentially non-oscillatory interpolation strategy has been investigated. These approaches take into account the singularities (e.g. edge points) in their mathematical models, thus preserving the structural information of the HDR images. In [13], a non-separable non-linear multiresolution approach is proposed. The results provided in [11], [12] and [13] show that the decomposition of the HDR image on different resolution levels would seem to be a good strategy. However, the choice of the decomposition filters is extremely important and decisive in the extractive power of the approximation and detail information. To this end, this paper proposes a separable near optimal lifting scheme. The "Predict" operation is performed by a new adaptive prediction operation to extract the finest detail coefficients and to highlight the sharp transitions in the given HDR image.

The proposed HDR image TM approach has two main goals, namely the preservation of the details and the adjustment of the contrast in accordance with the LDR display devices. It is composed of four stages. The first one judiciously decomposes the HDR image into different resolution levels (section II-A). The second one concerns the weighting strategy of the detail and approximation coefficients (section II-B). The third one reconstructs the coarse LDR image (section II-C). Finally, the fourth one adjusts the contrast according to a perceptual linear quantizer (section II-D). Section III discusses the simulation results. Section IV concludes the paper.

II. PROPOSED HDR IMAGE TONE MAPPING APPROACH

This section concerns the first, second, third and fourth stages of the proposed HDR image TM approach (i.e. decomposition, weighting, reconstruction and contrast adjustment stages).

Before describing the approach, let us introduce some notations. The original HDR image, at the finest resolution level $J$, is assumed to be of size $N^J \times M^J$. The index $j$ refers to the resolution level (with $j = 0, \ldots, J-1$). Denote $l_{HDR}$ the HDR image luminance. In the rest of this paper, the HDR image luminance is considered in the logarithm domain since it is

well adapted to the human visual system. It is denoted $I^J$ and defined as follows: $I^J = \{I^J(x_n, y_m) = \log_{10}(l_{HDR}(x_n, y_m)) \text{ for } 1 \le n \le N^J \text{ and } 1 \le m \le M^J\}$, where $I^J(x_n, y_m)$ is the HDR logarithm luminance value of the pixel located at position $(x_n, y_m)$ on the image.

A. First stage: HDR image decomposition

The proposed HDR image decomposition is performed according to the forward process of a separable near optimal cell-average lifting scheme. This strategy is motivated by the fact that the relevant details are accurately predicted, since the coefficients of the filters are adapted locally to the image to be processed.

The decomposition goes from the finest resolution level $J$ to the coarsest resolution level $0$. At a given resolution level $j-1$ (with $0 \le j \le J-1$), the algorithm deals with the approximation coefficients denoted $I^j(x_n, y_k)$ (with $1 \le n \le N^j$ and $1 \le k \le M^j$) computed at resolution level $j$. For a given $n$ belonging to $[1, N^j]$, the algorithm starts by splitting the horizontal 1D-signal, i.e. $I^j(x_n, y_k)$ for $1 \le k \le M^j$, into a set of odd and even indexes as follows:

$$\{I^j(x_n, y_k) \text{ with } 1 \le k \le M^j\} := \{I^j(x_n, y_{2k-1}),\ I^j(x_n, y_{2k}) \text{ with } 1 \le k \le M^j/2\}. \qquad (1)$$

This process is then repeated for all $n$. Based on this split, the approximation coefficient located at position $(x_n, y_k)$, denoted $V^{j-1}(x_n, y_k)$, is computed on a Cell-Average (CA) scheme. For a given $n$, it is expressed as follows:

$$V^{j-1}(x_n, y_k) = \frac{I^j(x_n, y_{2k-1}) + I^j(x_n, y_{2k})}{2} \quad \text{for } 1 \le k \le M^j/2. \qquad (2)$$

This process is then repeated for all $n$.

The detail coefficient, denoted $d^{j-1}(x_n, y_{2k-1})$, is computed at odd indexes and is provided below for a given $n$:

$$d^{j-1}(x_n, y_{2k-1}) = \hat{I}^j(x_n, y_{2k-1}) - I^j(x_n, y_{2k-1}) \quad \text{for } 1 \le k \le M^j/2, \qquad (3)$$

where $\hat{I}^j(x_n, y_{2k-1})$ is the predicted logarithm luminance value at odd position $(x_n, y_{2k-1})$ and resolution level $j$. This predicted value is expressed as a linear weighted combination of the neighboring approximation coefficients at resolution level $j-1$:

$$\hat{I}^j(x_n, y_{2k-1}) = \sum_{i=0}^{2} u_i^{j-1}(x_n)\, V^{j-1}(x_n, y_{k+i-1}) \quad \text{for } 1 \le k \le M^j/2, \qquad (4)$$

where the weights $u_i^{j-1}(x_n)$ must preserve the initial 1D-signal average, which results in satisfying this condition:

$$\sum_{i=0}^{2} u_i^{j-1}(x_n) = 1. \qquad (5)$$

In what follows, the weights are rather denoted in a vector form $u^{j-1} = (u_0^{j-1}(x_n), u_1^{j-1}(x_n), u_2^{j-1}(x_n))^t$ to lighten the writing. These weights are deduced so that the Mean Squared Error (MSE) between $\hat{I}^j(x_n, y_{2k-1})$ and $I^j(x_n, y_{2k-1})$ is minimized:

$$\arg\min_{u_i^{j-1}} \left\| \hat{I}^j(x_n, y_{2k-1}) - I^j(x_n, y_{2k-1}) \right\|_2^2 \quad \text{for } 1 \le k \le M^j/2. \qquad (6)$$

This results in solving the following equation:

$$\Gamma^{j-1} \cdot u^{j-1} = r^{j-1}, \qquad (7)$$

where $u^{j-1}$ is the weight vector and $r^{j-1}$ is the cross-correlation vector $r^{j-1} = (r^{j-1}(0), r^{j-1}(1), r^{j-1}(2))^t$, where $r^{j-1}(i)$ represents the cross-correlation function between $V^{j-1}(x_n, y_{k+i-1})$ and $I^j(x_n, y_{2k-1})$ for $1 \le k \le M^j/2$. $\Gamma^{j-1}$ is the autocorrelation matrix defined as:

$$\Gamma^{j-1} = \begin{pmatrix} R^{j-1}(0) & R^{j-1}(1) & R^{j-1}(2) \\ R^{j-1}(-1) & R^{j-1}(0) & R^{j-1}(1) \\ R^{j-1}(-2) & R^{j-1}(-1) & R^{j-1}(0) \end{pmatrix}, \qquad (8)$$

where $R^{j-1}(i)$ is the autocorrelation of $V^{j-1}(x_n, y_k)$.

These weights, associated to the row $(x_n)$, are then deduced so that the partial derivatives of the MSE (given by equation (6)) with respect to $u_i^{j-1}$ (i.e. $i = 0, 1, 2$) are equal to zero:

$$u^{j-1}(x_n) = (\Gamma^{j-1})^{-1} \cdot r^{j-1}. \qquad (9)$$

The weight vectors are then stored to be used in the backward process of the adaptive lifting scheme to reconstruct the decomposed 1D-signal.

For the sake of convenience, the approximation and detail coefficients are organized as follows:

$$\{W^j(x_k, y_m) \text{ for } 1 \le k \le N^j,\ \forall m \in [1, M^j/2]\} := \{V^{j-1}(x_n, y_k) \text{ for } 1 \le k \le M^j/2,\ \forall n \in [1, N^j]\},$$
$$\{U^j(x_k, y_m) \text{ for } 1 \le k \le N^j,\ \forall m \in [1, M^j/2]\} := \{d^{j-1}(x_n, y_{2k-1}) \text{ for } 1 \le k \le M^j/2,\ \forall n \in [1, N^j]\}. \qquad (10)$$

The split, approximation and detail operations are applied on $W^j(x_k, y_m)$ and $U^j(x_k, y_m)$ for a given $m \in [1, M^j/2]$ (i.e. on the vertical direction). Note that the approximation step requires the prediction of $\hat{W}^j(x_{2k-1}, y_m)$ (respectively $\hat{U}^j(x_{2k-1}, y_m)$) based on a set of weights $v_i^{j-1}$ (respectively $w_i^{j-1}$). These weights need to be stored for the backward process of the lifting scheme to reconstruct the decomposed image. Finally, the approximation resolution level $I^j(x_n, y_m)$ is divided into 4 blocks $I^j := (I^{j-1}, d^{j-1}_{HL}, d^{j-1}_{LH}, d^{j-1}_{HH})$. The decomposition is thus iterated on $I^{j-1}$ until $j = 0$.

The finest HDR image $I^J$ is then represented by $3J + 1$ resolution levels $I^J := (I^0, d^0_{HL}, d^0_{LH}, d^0_{HH}, \ldots, d^{J-1}_{HL}, d^{J-1}_{LH}, d^{J-1}_{HH})$.

B. Second stage: Weighting strategy of the coefficients

To reduce the dynamic range of the HDR image, the TM approach proposes to weight the approximation and detail coefficients in an appropriate way before performing the backward process of the adaptive lifting scheme described in section II-C.

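The forward prediction/update pair of eqs. (2)-(5) and its inverse (section II-C) can be illustrated with a minimal 1-D sketch. The weights are fixed here (in the paper they are obtained per row by solving eq. (7)); border indexes are clamped, which is an assumption since the paper does not state its boundary handling, and the function names are illustrative, not the authors'.

```python
def forward_ca_lifting(I, u):
    """One forward step of the cell-average lifting scheme.

    I: even-length 1-D signal (one image row); u: three prediction
    weights summing to one, cf. eq. (5).
    """
    M2 = len(I) // 2
    odd = [I[2 * k] for k in range(M2)]       # samples at odd 1-based indexes
    even = [I[2 * k + 1] for k in range(M2)]  # samples at even 1-based indexes
    V = [(o + e) / 2.0 for o, e in zip(odd, even)]  # cell averages, eq. (2)

    def predict(k):
        # eq. (4): weighted combination of neighbouring cell averages;
        # indexes clamped at the borders (an assumption, not in the paper)
        return sum(u[i] * V[min(max(k + i - 1, 0), M2 - 1)] for i in range(3))

    d = [predict(k) - odd[k] for k in range(M2)]    # details, eq. (3)
    return V, d


def backward_ca_lifting(V, d, u):
    """Exact inverse: recover odd samples from the stored weights and
    details, then even samples from the cell averages."""
    M2 = len(V)

    def predict(k):
        return sum(u[i] * V[min(max(k + i - 1, 0), M2 - 1)] for i in range(3))

    odd = [predict(k) - d[k] for k in range(M2)]   # 1-D analogue of eq. (13)
    even = [2 * V[k] - odd[k] for k in range(M2)]  # 1-D analogue of eq. (15)
    out = []
    for o, e in zip(odd, even):
        out += [o, e]
    return out


if __name__ == "__main__":
    row = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
    u = [0.25, 0.5, 0.25]                 # any weights summing to one
    V, d = forward_ca_lifting(row, u)
    rec = backward_ca_lifting(V, d, u)
    assert all(abs(a - b) < 1e-12 for a, b in zip(row, rec))
```

Perfect reconstruction holds for any weights satisfying eq. (5), because the backward pass re-computes the same prediction from the same stored weights.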
Denote $N_l$ the number of resolution levels, equal to $J$; $E_a$ the entropy of the approximation coefficients at the coarsest resolution level (i.e. $j = 0$); and $E_d^j$ the entropy of the detail coefficients at resolution level $j$ (i.e. $d^j_{HL}$, $d^j_{LH}$ and $d^j_{HH}$), getting therefore $N_l + 1$ entropies $(E_a, E_d^0, E_d^1, \ldots, E_d^j, \ldots, E_d^{N_l-1})$. From these entropies, positive weights smaller than one are deduced as follows:

$$\alpha_a = \frac{\sum_{i=0}^{N_l-1} E_d^i}{E_a + \sum_{i=0}^{N_l-1} E_d^i} \quad \text{for } j = 0, \qquad \alpha_d^j = \frac{E_a + \sum_{i=0,\, i \ne j}^{N_l-1} E_d^i}{E_a + \sum_{i=0}^{N_l-1} E_d^i} \quad \text{for } j = 0, \ldots, N_l - 1. \qquad (11)$$

$\alpha_a$ (respectively $\alpha_d^j$) is the weight associated to the approximation (respectively detail) coefficients at resolution level $j = 0$ (respectively $j$).

The coefficients of the four coarsest resolution levels are first modified according to:

$$I'^0 = \alpha_a \times I^0, \quad d'^0_{HL} = \alpha_d^0 \times d^0_{HL}, \quad d'^0_{LH} = \alpha_d^0 \times d^0_{LH}, \quad d'^0_{HH} = \alpha_d^0 \times d^0_{HH}. \qquad (12)$$

The approximation subband, denoted $I'^1$, is then reconstructed (see section II-C) and the number of levels is reduced to $N_l - 1$. The $N_l$ entropies (associated to $3N_l - 2$ subbands) are calculated again to update the weights $\alpha_a$, $\alpha_d^1$ (equation (11) with $N_l = N_l - 1$). After that, these weights are applied on the coefficients $I'^1$, $d^1_{HL}$, $d^1_{LH}$, $d^1_{HH}$ to build $I'^2$ as explained above. This process is iterated until $N_l = 0$ to reconstruct the coarse tone mapped HDR image denoted $\tilde{I}^J_{LDR}$, called coarse LDR image.

C. Third stage: Reconstruction of the coarse LDR image

The reconstruction stage is carried out inversely to the decomposition stage. Assume that the backward algorithm of the adaptive lifting scheme has processed all resolution levels until $j - 1$. The next step consists in recovering the approximation coefficients $I'^j$ of size $N^j \times M^j$ using the four weighted blocks $I'^{j-1}$, $d'^{j-1}_{LH}$, $d'^{j-1}_{HL}$ and $d'^{j-1}_{HH}$. The algorithm first deals with the coefficients in a vertical direction using $I'^{j-1}$ and $d'^{j-1}_{LH}$ (denoted $d'^{j-1}$). At a given $m$, the approximation coefficient at odd position, denoted $W'^j(x_{2k-1}, y_m)$, is deduced:

$$W'^j(x_{2k-1}, y_m) = \hat{W}'^j(x_{2k-1}, y_m) - d'^{j-1}(x_k, y_m) \quad \text{for } 1 \le k \le N^j/2, \qquad (13)$$

where $\hat{W}'^j(x_{2k-1}, y_m)$ is the predicted coefficient, at odd position, deduced from the weighted combination $v_i^{j-1}$ (with $i = 0, 1, 2$) of the neighboring approximation coefficients $I'^{j-1}$ as:

$$\hat{W}'^j(x_{2k-1}, y_m) = \sum_{i=0}^{2} v_i^{j-1}(y_m)\, I'^{j-1}(x_{k+i-1}, y_m) \quad \text{for } 1 \le k \le N^j/2. \qquad (14)$$

The weights $v_i^{j-1}(y_m)$ are those computed and stored in the decomposition process. The approximation coefficient $W'^j(x_{2k}, y_m)$, at even position, is deduced in a CA scheme:

$$W'^j(x_{2k}, y_m) = 2\, I'^{j-1}(x_k, y_m) - W'^j(x_{2k-1}, y_m) \quad \text{for } 1 \le k \le N^j/2. \qquad (15)$$

The odd and even approximation coefficients are then merged to build a 1D-signal:

$$\{W'^j(x_k, y_m) \text{ for } 1 \le k \le N^j\} := \{W'^j(x_{2k-1}, y_m),\ W'^j(x_{2k}, y_m) \text{ for } 1 \le k \le N^j/2\}. \qquad (16)$$

This process is repeated for all $m$ to reconstruct $W'^j$ of size $N^j \times M^j/2$. These same steps are applied on $d'^{j-1}_{HL}$ and $d'^{j-1}_{HH}$ to generate the new block $U'^j$ of size $N^j \times M^j/2$. As in the decomposition strategy, $W'^j$ and $U'^j$ are renamed $V'^{j-1}$ and $d'^{j-1}$. The same steps, as described above, are then performed but according to a horizontal direction. The reconstruction is iterated to finally build the image called coarse LDR image, which is denoted $\tilde{I}^J_{LDR}$.

D. Fourth stage: Piecewise linear perceptual quantizer

This stage proposes to adjust locally the distribution of the coarse LDR image logarithm luminance $\tilde{I}^J_{LDR}$ according to the HVS to enhance the contrast using a piecewise linear function. This strategy is inspired from [19], which has been developed for compression purposes. However, modifications are made mainly to avoid the problem of empty bins of equal size.

To do so, the $\tilde{I}^J_{LDR}$ values are first sorted and classified into $B$ equal bins defined by cutting points denoted $c^i_{uLDR}$ (with $1 \le i \le B$). A non-uniform histogram equalization is also performed. The cutting points $c^i_{nuLDR}$ (with $1 \le i \le B$), defining the bounds of the $B$ non-uniform consecutive bins, are deduced. The lower bound (cutting point) of each bin is then adjusted as follows:

$$\tilde{l}^i_{LDR}(1) = c^i_{uLDR} + \beta\,(c^i_{nuLDR} - c^i_{uLDR}), \qquad (17)$$

where $\beta$ is a positive parameter smaller than 1.

Therefore the $\tilde{I}^J_{LDR}$ values are classified into $B$ non-uniform bins as follows:

$$\tilde{I}^J_{LDR} = \{\tilde{l}_{LDR}(k) \text{ for } k = 1, \ldots, N^J \times M^J\} = \{[\tilde{l}^1_{LDR}(1), \ldots, \tilde{l}^1_{LDR}(K_1)], \ldots, [\tilde{l}^i_{LDR}(1), \ldots, \tilde{l}^i_{LDR}(K_i)], \ldots, [\tilde{l}^B_{LDR}(1), \ldots, \tilde{l}^B_{LDR}(K_B)]\}, \qquad (18)$$

depending on the quantization level set, where $K_i$ is the number of values in the $i$-th bin (i.e. $1 \le i \le B$; $K_i > 0$) satisfying the relation $\sum_{i=1}^{B} K_i = N^J \times M^J$.

The "s-shape" TM perceptual curve, as discussed in [4] and [5], is modelled by a piecewise linear curve on each bin (see Fig. 1). Consider the $i$-th bin, defined by $[\tilde{l}^i_{LDR}(1), \ldots, \tilde{l}^i_{LDR}(K_i)]$; the coarse LDR values are then modeled as follows:

$$\hat{l}^i_{LDR}(k) = a^i\, \tilde{l}^i_{LDR}(k) + b^i \quad \text{with } k \in [1, K_i], \qquad (19)$$

where $a^i$ (with $a^i \ne 0$) and $b^i$ are two unknown parameters depending on the $i$-th bin. This equation, after some mathematical manipulations, can be rewritten as follows:

$$\tilde{l}^i_{LDR}(k) = \frac{\hat{l}^i_{LDR}(k) - \hat{l}^i_{LDR}(1)}{a^i} + \tilde{l}^i_{LDR}(1), \qquad (20)$$

where the unknown parameter $a^i$ is deduced so that the MSE between the coarse LDR value and its quantized version, denoted $l^i_{LDR}(k)$ (i.e. when $\hat{l}^i_{LDR}(k)$ has been subjected to a

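The entropy-driven weighting of eq. (11) gives every subband a weight equal to one minus the share of its own entropy in the total, so all weights lie in (0, 1) for positive entropies. A small sketch (the function names and sample numbers are mine, not the paper's):

```python
def subband_weights(E_a, E_d):
    """Weights of eq. (11).

    E_a: entropy of the coarsest approximation subband.
    E_d: list of detail-subband entropies [E_d^0, ..., E_d^(Nl-1)].
    """
    total = E_a + sum(E_d)
    alpha_a = sum(E_d) / total                    # numerator excludes E_a
    alpha_d = [(total - e) / total for e in E_d]  # excludes E_d^j for level j
    return alpha_a, alpha_d


def weight_level0(I0, d0_HL, d0_LH, d0_HH, alpha_a, alpha_d0):
    """eq. (12): scale the four coarsest subbands (given as flat lists)."""
    scale = lambda coeffs, a: [a * c for c in coeffs]
    return (scale(I0, alpha_a), scale(d0_HL, alpha_d0),
            scale(d0_LH, alpha_d0), scale(d0_HH, alpha_d0))


if __name__ == "__main__":
    alpha_a, alpha_d = subband_weights(4.0, [2.0, 1.0, 1.0])
    assert alpha_a == 0.5 and alpha_d == [0.75, 0.875, 0.875]
    assert all(0.0 < a < 1.0 for a in [alpha_a] + alpha_d)
```

After each reconstruction step the paper recomputes the entropies with one level fewer, so in a full implementation `subband_weights` would be called once per iteration of section II-B.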
ceiling process according to a scalar quantization), is minimized in the $i$-th bin:

$$\arg\min_{a^i} \left\| l^i_{LDR}(k) - \tilde{l}^i_{LDR}(k) \right\|_2^2. \qquad (21)$$

This equation is simplified as follows:

$$\arg\min_{a^i} \sum_{k=1}^{K_i} \left( \frac{l^i_{LDR}(k) - \hat{l}^i_{LDR}(k)}{a^i} \right)^2 p_i, \qquad (22)$$

where $p_i$ is the probability of the $\tilde{l}^i_{LDR}(k)$ value in the $i$-th bin, given by $p_i = K_i / \sum_{i=1}^{B} K_i$.

Extending equation (22) to all bins involves the computation of a Global MSE (GMSE) deduced as follows:

$$GMSE = \sum_{i=1}^{B} \sum_{k=1}^{K_i} \left( \frac{l^i_{LDR}(k) - \hat{l}^i_{LDR}(k)}{a^i} \right)^2 p_i. \qquad (23)$$

The variance of $(l^i_{LDR}(k) - \hat{l}^i_{LDR}(k))$ on each bin is assumed to be equal and is denoted $\xi$. Equation (23) is then simplified and becomes:

$$GMSE = \sum_{i=1}^{B} \frac{p_i}{(a^i)^2}\, \xi. \qquad (24)$$

Denote $l_{LDRmax}$ (respectively $l_{LDRmin}$) the maximum (respectively minimum) LDR luminance value. Introduce $\delta^i$ as the difference between the coarse LDR luminance in two consecutive bins:

$$\delta^i = \tilde{l}^{i+1}_{LDR}(1) - \tilde{l}^i_{LDR}(1). \qquad (25)$$

A constraint requiring that the sum of the projected heights equals the entire LDR range results in:

$$\sum_{i=1}^{B} a^i\, \delta^i = l_{LDRmax} - l_{LDRmin}. \qquad (26)$$

Therefore the optimization problem is written as follows:

$$\arg\min_{a^i} \sum_{i=1}^{B} \frac{p_i}{(a^i)^2}\, \xi \quad \text{s.t.} \quad \sum_{i=1}^{B} a^i\, \delta^i = l_{LDRmax} - l_{LDRmin}. \qquad (27)$$

This problem is solved analytically using the Lagrangian function. After some mathematical manipulations, the slope $a^i$ is deduced:

$$a^i = \frac{(l_{LDRmax} - l_{LDRmin}) \cdot (p_i)^{1/3}}{\sum_{i=1}^{B} \delta^i\, (p_i)^{1/3}}. \qquad (28)$$

Hence the unknown parameter $b^i$ is calculated (i.e. $b^i = \hat{l}^i_{LDR}(1) - a^i \times \tilde{l}^i_{LDR}(1)$) and the LDR mapped values are deduced according to equation (19). The global piecewise linear curve is continuous and strictly monotonically increasing according to the positive slopes (i.e. $a^i > 0$, or angles $0° < \arctan(a^i) < 90°$).

III. SIMULATION RESULTS

This section provides the performance of the proposed tone mapped HDR image. The tone mapped image quality is measured with the TMQI (Tone-Mapped image Quality Index) metric [18]. Simulations have been conducted under the Matlab environment using the HDR Toolbox [1] with 274 test HDR images. For lack of space, we only present the results obtained with 8 HDR images ("Anturium", "Bottle Small", "Office", "Oxford Church", "Memorial", "Light", "WardFlowers" and "StreetLamp") with different dynamic ranges (or contrast ratios) from 8 f-stops to 19 f-stops.

The proposed approach is compared to: (i) NONSEP ENO-CA [13]; (ii) SEP ENO-CA [12] with parameters α1 = 0.3, α2 = 0.7; (iii) Li TMO [8] with Haar multiscale; (iv) Fattal [11] using the RBW method with parameters α = 0.8, β = 0.3, γ = 0.8; (v) Duan [9] using β = 0.5; (vi) TMOs in the HDR Toolbox: Drago [7], Reinhard [14], Ward [15], Durand [6], Schlick [17] with the default parameters as given in the HDR Toolbox. The different parameters are chosen so as to give the best results in terms of the TMQI metric in all methods.

Table I provides the TMQI metrics. The proposed TM approach, namely "Proposed LJ", is deployed with B = 256, β = 0.25, $l_{LDRmax}$ = 255, $l_{LDRmin}$ = 0 and J = 1, ..., 5. Our approach is competitive with those developed in the literature. The more the number of resolution levels increases, the more the performance increases.

Fig. 2 compares the visual quality of the "Church" tone mapped image using the "Duan" method and our approach. The stained glass window at the church background presents better contrast and details with our approach, although the TMQI are identical. Fig. 3 compares the "Memorial" tone mapped image using the "Duan" and "Fattal" methods and our approach. The details on the tills (see Fig. 3) and the rosette (see Fig. 4) are better rendered by our approach.

Fig. 5 compares the visual quality of the "WardFlowers" tone mapped image using "Fattal" and our approach. Some details, on flowers and rocks, are lost in the "Fattal" tone mapped image compared to our approach. Moreover, our tone mapped image is of better contrast. A similar result is provided by Fig. 6, where the HDR "StreetLamp" image has been mapped using the "SEP ENO" method and our method. The brightness of our tone mapped image is better.

The performance of our approach is confirmed on more than 274 test HDR images, where the details and contrast are better represented than with other competitive methods.

IV. CONCLUSION

This paper proposed a new HDR image TM approach able at the same time to extract the relevant details and enhance the contrast of LDR images. This is essentially related to: (i) the forward process of the near optimal local adaptive cell-average lifting scheme, where the filter coefficients are locally adapted to the content; (ii) the weighting operation depending on the information of each subband; (iii) the adjustment of the coarse LDR image luminance distribution according to the perceptual piecewise linear function. Simulation results confirm the relevance of the proposed approach both in terms of the TMQI metric and the visual quality of the displayed image.

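The closed-form slopes of eq. (28) can be checked numerically: they are positive and satisfy the range constraint of eq. (26) exactly by construction. A sketch (the function name, default range and sample numbers are illustrative only):

```python
def piecewise_slopes(p, delta, l_min=0.0, l_max=255.0):
    """Slopes a^i of eq. (28): proportional to p_i^(1/3), normalized so
    that sum_i a^i * delta^i covers the whole LDR range (eq. (26)).

    p: per-bin probabilities p_i; delta: per-bin widths delta^i.
    """
    norm = sum(d * pb ** (1.0 / 3.0) for d, pb in zip(delta, p))
    return [(l_max - l_min) * pb ** (1.0 / 3.0) / norm for pb in p]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    delta = [10.0, 20.0, 30.0]
    a = piecewise_slopes(p, delta)
    assert all(ai > 0.0 for ai in a)                 # strictly increasing curve
    total = sum(ai * di for ai, di in zip(a, delta))
    assert abs(total - 255.0) < 1e-9                 # eq. (26) holds
```

With the slopes in hand, each bin's intercept follows from $b^i = \hat{l}^i_{LDR}(1) - a^i \tilde{l}^i_{LDR}(1)$ and the mapping itself is just eq. (19) applied bin by bin.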
Fig. 1: Piecewise linear curve modelization ("s-shape" curve).

Fig. 2: "Oxford Church" HDR test image (15.46 f-stops) - Up image: proposed (5 levels, TMQI=0.985); Down image: "Duan" (TMQI=0.986).

Fig. 3: "Memorial" HDR test image (18.38 f-stops) - Left image: proposed (5 levels, TMQI=0.951); Middle image: "Duan" (TMQI=0.935); Right image: "Fattal" (TMQI=0.927).

Fig. 4: LDR luminance "Rosette" zoom - Left image: proposed (5 levels, TMQI=0.951); Middle image: "Duan" (TMQI=0.935); Right image: "Fattal" (TMQI=0.927).

Fig. 5: "WardFlowers" HDR test image (14.01 f-stops) - Up image: proposed (5 levels, TMQI=0.930); Down image: "Fattal" (TMQI=0.875).

Fig. 6: "StreetLamp" HDR test image (13.83 f-stops) - Up image: proposed (5 levels, TMQI=0.911); Down image: "SEP ENO" (TMQI=0.855).

TABLE I: Tone Mapped Image Quality Index (TMQI)

TMOs          | Anturium | Bottle | Office | Church | Memorial | Light
DR f-stops    | 8.73     | 16.03  | 16.29  | 15.46  | 18.38    | 17.46
Drago [7]     | 0.874    | 0.801  | 0.800  | 0.814  | 0.800    | 0.800
Reinhard [14] | 0.778    | 0.807  | 0.826  | 0.789  | 0.791    | 0.794
Ward [15]     | 0.806    | 0.783  | 0.775  | 0.817  | 0.795    | 0.789
Durand [6]    | 0.811    | 0.892  | 0.825  | 0.929  | 0.814    | 0.800
Tumblin [16]  | 0.715    | 0.713  | 0.735  | 0.675  | 0.759    | 0.750
Schlick [17]  | 0.770    | 0.835  | 0.926  | 0.970  | 0.787    | 0.780
Duan [9]      | 0.964    | 0.916  | 0.955  | 0.986  | 0.935    | 0.969
Fattal [11]   | 0.889    | 0.928  | 0.943  | 0.889  | 0.927    | 0.971
Li [8]        | 0.964    | 0.954  | 0.854  | 0.877  | 0.834    | 0.888
SEP ENO [12]  | 0.896    | 0.934  | 0.943  | 0.895  | 0.932    | 0.970
NONSEP [13]   | 0.938    | 0.873  | 0.935  | 0.820  | 0.832    | 0.932
Proposed L1   | 0.946    | 0.863  | 0.934  | 0.954  | 0.918    | 0.954
Proposed L2   | 0.965    | 0.882  | 0.938  | 0.970  | 0.929    | 0.963
Proposed L3   | 0.978    | 0.904  | 0.945  | 0.981  | 0.941    | 0.966
Proposed L4   | 0.980    | 0.921  | 0.946  | 0.984  | 0.949    | 0.969
Proposed L5   | 0.982    | 0.933  | 0.948  | 0.985  | 0.951    | 0.969

REFERENCES

[1] Banterle, F., Artusi, A., Debattista, K., and Chalmers, A., Advanced High Dynamic Range Imaging: Theory and Practice, AK Peters (now CRC Press), ISBN: 978-156881-719-4 (2011).
[2] Dufaux, F., Le Callet, P., Mantiuk, R., and Mrak, M., High Dynamic Range Video, 1st Edition: From Acquisition, to Display and Applications, ISBN: 9780081004128 (April 2016).
[3] Reinhard, E., Heidrich, W., Debevec, P., Pattanaik, S., Ward, G., and Myszkowski, K., High Dynamic Range Imaging, 2nd Edition: Acquisition, Display, and Image-Based Lighting, ISBN: 9780123749147 (May 2010).
[4] Dowling, J. E., The Retina: An Approachable Part of the Brain, Cambridge, Belknap Press (1987).
[5] Geisler, W. S., Effects of Bleaching and Backgrounds on the Flash Response of the Cone System, Journal of Physiology, 312:413-434 (1981).
[6] Durand, F., and Dorsey, J., Fast Bilateral Filtering for the Display of High-dynamic-range Images, ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 21, pp. 257-266 (2002).
[7] Drago, F., Myszkowski, K., Annen, T., and Chiba, N., Adaptive Logarithmic Mapping for Displaying High Contrast Scenes, Computer Graphics Forum 22, pp. 419-426 (2003).
[8] Li, Y., Sharan, L., and Adelson, E. H., Compressing and Companding High Dynamic Range Images with Subband Architectures, ACM Trans. Graph. 24, pp. 836-844 (July 2005).
[9] Duan, J., Bressan, M., Dance, C., and Qiu, G., Tone-mapping High Dynamic Range Images by Novel Histogram Adjustment, Pattern Recognition, vol. 43, pp. 1847-1862 (2010).
[10] Husseis, A., Mokraoui, A., and Matei, B., Revisited Histogram Equalization as HDR Images Tone Mapping Operators, 17th IEEE International Symposium on Signal Processing and Information Technology, ISSPIT (December 2017).
[11] Fattal, R., Edge-Avoiding Wavelets and their Applications, ACM Trans. Graph. (2009).
[12] Thai, B. C., Mokraoui, A., and Matei, B., Performance Evaluation of High Dynamic Range Image Tone Mapping Operators Based on Separable Non-linear Multiresolution Families, 24th European Signal Processing Conference, pp. 1891-1895 (August 2016).
[13] Thai, B. C., Mokraoui, A., and Matei, B., Image Tone Mapping Approach Using Essentially Non-Oscillatory Bi-quadratic Interpolations Combined with a Weighting Coefficients Strategy, 17th IEEE International Symposium on Signal Processing and Information Technology, ISSPIT (December 2017).
[14] Reinhard, E., and Devlin, K., Dynamic Range Reduction Inspired by Photoreceptor Physiology, IEEE Transactions on Visualization and Computer Graphics 11, pp. 13-24 (2005).
[15] Ward, G., Rushmeier, H., and Piatko, C., A Visibility Matching Tone Reproduction Operator for High Dynamic Range Scenes, IEEE Transactions on Visualization and Computer Graphics 3, pp. 291-306 (1997).
[16] Tumblin, J., and Rushmeier, H., Tone Reproduction for Realistic Images, IEEE Comput. Graph. Appl., pp. 42-48 (1993).
[17] Schlick, C., Quantization Techniques for Visualization of High Dynamic Range Pictures, In Proceedings of the Fifth Eurographics Workshop on Rendering, pp. 7-18 (1994).
[18] Yeganeh, H., and Wang, Z., Objective Quality Assessment of Tone-mapped Images, IEEE Trans. on Image Processing, vol. 22, pp. 657-667 (February 2013).
[19] Mai, Z., Mansour, H., Nasiopoulos, P., and Ward, R., Visually-favorable Tone-mapping with High Compression Performance, Image Processing (ICIP), 17th IEEE International Conference on, pp. 1285-1288, ISSN 1522-4880 (2010).


On Using Quaternionic Rotations for Independent Component Analysis

Adam Borowicz
Bialystok University of Technology, Faculty of Computer Science
Wiejska str. 45A, Bialystok, Poland
Email: a.borowicz@pb.edu.pl

Abstract—Independent component analysis (ICA) is a popular technique for demixing multi-sensor data. In many approaches to the ICA, signals are decorrelated by whitening the data and then rotating the result. In this paper, we introduce a four-unit, symmetric algorithm based on a quaternionic factorization of the rotation matrix. It makes use of an isomorphism between quaternions and 4x4 orthogonal matrices. Unlike conventional techniques based on the Jacobi decomposition, our method exploits 4D rotations and uses a negentropy approximation as a contrast function. Compared to the widely used symmetric FastICA algorithm, the proposed method offers better separation quality in the presence of multiple Gaussian sources.

I. INTRODUCTION

Independent component analysis (ICA) is a method for transforming multivariate data into components that are as independent from each other as possible. Most frequently, it is based on the Central Limit Theorem and attempts to find directions in which some measure of non-Gaussianity is maximized. Usually, the following linear model [1] of the multi-sensor observation vector is assumed:

$$x = As, \qquad (1)$$

where $A \in \mathbb{R}^{n \times n}$ is an unknown mixing matrix and $s$ is a random vector of $n$ unknown signals that are mutually independent and zero-mean. It is also assumed that at most one of the sources is Gaussian. From a mathematical point of view, we are looking for some matrix $W$ such that

$$y = Wx = \Lambda P s, \qquad (2)$$

where $\Lambda$ and $P$ are scaling and permutation matrices, respectively. This means that the sources can be recovered only up to sign, scale and permutation [2].

In this study, we are interested in so-called orthogonal approaches [3], [4], [5]. These methods directly constrain the unmixed components to be uncorrelated, so that $E\{yy^T\} = I$. This constraint is usually enforced by prewhitening the data before rotating them:

$$y = Wx = B\, C_{xx}^{-1/2} x, \qquad (3)$$

where $B$ is constrained to be an orthogonal matrix. The whitening matrix can be computed as follows:

$$C_{xx}^{-1/2} = U_x \Lambda_x^{-1/2} U_x^T, \qquad (4)$$

where $U_x$ is the matrix of eigenvectors of $C_{xx} = E\{xx^T\}$, and $\Lambda_x$ is the diagonal matrix of corresponding eigenvalues.

Among orthogonal approaches, the FastICA method [3] is probably the most popular technique due to its simplicity and high efficiency. It is based on maximizing some non-linear contrast function that measures the non-Gaussianity of the whitened mixture. There are two versions of the FastICA algorithm: the one-unit algorithm and the symmetric algorithm. In the case of the one-unit approach [3] the components are estimated successively one by one, and after every iteration step a deflationary orthogonalization is performed using the Gram-Schmidt method. The major drawback of this approach is that the orthogonalization accumulates estimation errors. The symmetric version of the FastICA algorithm [6] estimates the components in parallel. In fact, this consists of parallel computation of the one-unit updates for each component, followed by a symmetric orthogonalization of the estimated demixing matrix after each iteration step. In its basic form, symmetric orthogonalization involves the computation of an inverse matrix square root, which can be computationally expensive for large $n$. Although simpler iterative methods can be used [3], this step presents some difficulties, especially when FastICA is implemented in hardware architectures [7]. Moreover, as we will show, the symmetric orthogonalization can introduce some distortions when the number of Gaussian sources is greater than one.

The ICA literature also contains examples of methods that do not require implicit orthogonalization, for instance Jacobi-related algorithms [4], [5]. In these approaches, the rotation matrix $B$ is considered to be a product of orthogonal matrices, specifically the Jacobi/Givens rotations. The ICA is carried out by minimizing a contrast function for each pair of components that corresponds to a given rotation plane. Thus, by sweeping the local optimization over all possible pairs, the global independent sources can be extracted. The most important advantage of the Jacobi algorithm is that the local optimization can be solved explicitly if the contrast function is given as a polynomial of 4th-order moments (for example, kurtosis). Unfortunately, the use of kurtosis as a general contrast function may be discouraged due to its poor asymptotic efficiency for super-Gaussian sources and its lack of robustness to outliers [8].

In this paper, we propose to use a quaternion-based factorization of the 4x4 orthogonal matrices that represent rotations in $\mathbb{R}^4$. Such matrices can be uniquely described by only two unit quaternions [9]. In this way, the ICA optimization task can be reformulated as finding quaternions that maximize some

114
measure of non-Gaussianity simultaneously in all (orthogonal) directions. We show that the conventional Givens rotations can be replaced by quaternion-based 4D rotations, which results in a novel 4-unit symmetric ICA algorithm. Contrary to conventional Jacobi-related algorithms, the proposed approach exploits a contrast function based on a negentropy approximation. When compared to the symmetric FastICA algorithm, the method offers better separation quality in the presence of Gaussian sources. Experiments show that our method is able to successfully extract non-Gaussian components even if the number of Gaussian sources is greater than one, in which case the FastICA method often fails.

II. QUATERNIONS AND 4D ROTATIONS

Most frequently, a quaternion Q ∈ H is represented in the rectangular form:

Q = q0 + iq1 + jq2 + kq3,   q0, q1, q2, q3 ∈ R. (5)

The real part of Q is q0 and the pure quaternion part is iq1 + jq2 + kq3. Similarly as for ordinary complex numbers, the conjugate of Q is given by Q̄ = q0 − iq1 − jq2 − kq3, and the norm |Q| is defined as

|Q|^2 = QQ̄ = Q̄Q = q0^2 + q1^2 + q2^2 + q3^2. (6)

The multiplication of quaternions is determined by the rules:

i^2 = j^2 = k^2 = ijk = −1. (7)

Since any quaternion Q can also be represented by the vector q = [q0 q1 q2 q3]^T, the set H can be identified with the vector space R^4.

Let (P, Q) be a pair of unit quaternions, i.e. |P| = |Q| = 1, and let W be a quaternion that represents an arbitrary point w ∈ R^4. Consider the real linear transformation from H to H that maps W to P W Q̄. It can be shown [9] that this transformation can be uniquely represented by a product of the 4x4 orthogonal matrices:

M+(P) = [ p0 −p1 −p2 −p3
          p1  p0 −p3  p2
          p2  p3  p0 −p1
          p3 −p2  p1  p0 ],  (8)

M−(Q) = [  q0  q1  q2  q3
          −q1  q0 −q3  q2
          −q2  q3  q0 −q1
          −q3 −q2  q1  q0 ].  (9)

In particular,

P W Q̄ ⇔ M+(P) M−(Q) w. (10)

In fact, the pairs (P, Q) and (−P, −Q) map to the same rotation of R^4. The possibility of using two parameters to describe a rotation is a special feature of the group of rotations in four dimensions.

III. ICA USING QUATERNIONIC ROTATIONS

In the conventional Jacobi algorithm, the independent components are estimated by recursively applying so-called sweep operations [4], [5]:

y_k = R_k y_{k−1},   k > 1, (11)
y_0 = C_xx^{-1/2} x, (12)

where R_k is an orthogonal matrix and k is the sweep number. Typically, the matrix R_k is a product of rotation matrices:

R_k = Π_{i=1}^{n(n−1)/2} G(p_i, q_i, θ_{k,i}), (13)

where

G(p, q, θ) = [ I_{p−1}   0      0          0      0
               0         cos θ  0          sin θ  0
               0         0      I_{q−p−1}  0      0
               0        −sin θ  0          cos θ  0
               0         0      0          0      I_{n−q} ]  (14)

is the Givens matrix that represents a rotation by the angle θ in the plane determined by the p and q axes. In this way, an n-dimensional ICA problem can be reduced to solving n(n−1)/2 one-dimensional subproblems. Each subproblem consists in searching for the angle that maximizes a cumulant-based contrast function [5] in the corresponding rotation plane. The data are rotated and the process is repeated cyclically, until all rotation angles converge to zero.

In this paper, we propose a simple 4-unit symmetric algorithm based on a two-step optimization task. The key idea is to replace the 2D Givens rotations with their 4D quaternion-based counterparts. In particular, six ordinary Givens rotations are replaced with two quaternionic rotations as follows:

R_k = M+(p_k) M−(q_k). (15)

Thus, in each sweep, two vectors p_k, q_k ∈ R^4 have to be estimated that maximize some measure of non-Gaussianity of the rotated data. We propose to optimize these parameters in two steps. First, we look for the vector q_k and rotate the data using the matrix M−(q_k). Then we look for p_k and rotate the result of the first step by using M+(p_k). These steps have to be repeated for every sweep until both rotation matrices are approximately equal to the identity transform.

It was shown in [10] that measures of non-Gaussianity based on some non-linear functions are often considerably more robust than cumulant-based approximations. Moreover, by properly choosing the non-linearity we are able to adjust the algorithm to work for practically any distribution of the source signals. The one-unit contrast function for measuring the non-Gaussianity of any random variable y in the negentropy-based ICA framework is given by [3]:

J_g(y) = [E{g(y)} − E{g(v)}]^2, (16)

where v is a standardized Gaussian variable, and g is any non-linear function. Please see [11] for examples of such functions.
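As a quick numerical check of (8)-(10), the sketch below (an illustration of ours, not code from the paper) builds M+(P) and M−(Q) for random unit quaternions and verifies both the equivalence with the quaternion product P W Q̄ and the orthogonality of the resulting 4-D rotation:

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions a = [a0,a1,a2,a3], b = [b0,b1,b2,b3], per (7)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([a0*b0 - a1*b1 - a2*b2 - a3*b3,
                     a0*b1 + a1*b0 + a2*b3 - a3*b2,
                     a0*b2 - a1*b3 + a2*b0 + a3*b1,
                     a0*b3 + a1*b2 - a2*b1 + a3*b0])

def M_plus(p):   # Eq. (8): maps w to the product P*w
    p0, p1, p2, p3 = p
    return np.array([[p0, -p1, -p2, -p3],
                     [p1,  p0, -p3,  p2],
                     [p2,  p3,  p0, -p1],
                     [p3, -p2,  p1,  p0]])

def M_minus(q):  # Eq. (9): maps w to the product w*conj(Q)
    q0, q1, q2, q3 = q
    return np.array([[ q0,  q1,  q2,  q3],
                     [-q1,  q0, -q3,  q2],
                     [-q2,  q3,  q0, -q1],
                     [-q3, -q2,  q1,  q0]])

rng = np.random.default_rng(1)
P = rng.normal(size=4); P /= np.linalg.norm(P)   # unit quaternions
Q = rng.normal(size=4); Q /= np.linalg.norm(Q)
w = rng.normal(size=4)

R = M_plus(P) @ M_minus(Q)                       # 4-D rotation, Eq. (10)
Q_conj = Q * np.array([1, -1, -1, -1])
assert np.allclose(R @ w, qmul(P, qmul(w, Q_conj)))  # P W Q-bar == M+(P) M-(Q) w
assert np.allclose(R @ R.T, np.eye(4))               # orthogonality
```

The same check with (−P, −Q) yields an identical matrix R, illustrating the sign ambiguity noted above.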
The random variable y is assumed to be zero-mean and unit-variance. Please note that the measure of (16) is always non-negative, and it equals zero if y is Gaussian.

As suggested in [3], a contrast function for several units can be obtained by simply maximizing the sum of the one-unit functions. Thus, by taking into account the unit-norm constraints (that enforce orthogonality), we define the following optimization problems:

q_k = arg max_u J−(u) = Σ_{i=1}^{4} E{ḡ([M−(u)]_i* y−)}^2,   s.t. ||u|| = 1, (17)

p_k = arg max_u J+(u) = Σ_{i=1}^{4} E{ḡ([M+(u)]_i* y+)}^2,   s.t. ||u|| = 1, (18)

where [A]_i* denotes the ith row of an arbitrary matrix A, ḡ(y) = g(y) − E{g(v)} is the centered version of the non-linear function g(y), and

y− = R_{k−1} x, (19)
y+ = M−(q_k) R_{k−1} x = M−(q_k) y−, (20)

are the rotated random vectors for the kth sweep. If we replace the vector x in (19) and (20) by its observations (stacked in column-wise order), we obtain the matrices Y−, Y+ ∈ R^{4×m}, respectively, where m denotes the number of samples. Thus, the expectations in (17) and (18) can be approximated by sums, which leads to the maximization of the contrast function:

J(u) ≈ Ĵ(u) = 1_m^T ḡ(M(u)Y)^T ḡ(M(u)Y) 1_m. (21)

Please note that the problems (17) and (18) have the same solution. Therefore, here and in the rest of the paper the subscripts '+' and '−' are dropped to simplify the notation.

In accordance with the Kuhn-Tucker conditions, any solution (local minimum or maximum) must satisfy the following vector equation:

∇_u Ĵ(u) + λu = 0, (22)

where λ is a Lagrange multiplier, and

∇_u Ĵ(u) = [∂Ĵ(u)/∂u_0 · · · ∂Ĵ(u)/∂u_3]^T (23)

denotes the gradient of Ĵ(u). The partial derivatives can be computed as follows:

∂Ĵ(u)/∂u_i = 2 a(u)^T b_i(u), (24)

with

a(u) = ḡ(M(u)Y) 1_m, (25)
b_i(u) = [ḡ′(M(u)Y) ∘ P_i Y] 1_m,   i = 0, ..., 3, (26)

where ∘ denotes the Hadamard product, ḡ′ is the first-order derivative of the function ḡ, and P_i = ∂M(u)/∂u_i are the partial derivatives of the quaternionic rotation matrix. These matrices take the form of simple permutation matrices containing only zeros and ones [9], so that the multiplication by Y can be avoided.

By assuming that the multiplier λ is known, one can try to solve (22) numerically using, for instance, the Newton-Raphson method, in a similar way as in the FastICA approach [3]. However, this can be a difficult task, since it involves determining the second-order derivatives (the Jacobian matrix) of the quadratic form of (21). Even if an explicit-form expression for this matrix were obtained, its computation (without substantial simplifications) could be very expensive. Also, it seems to be a challenge to find a reasonable approximation for the Jacobian matrix. A much simpler solution is to use a gradient method with an L2-norm constraint [12]. In this approach a pre-scaled gradient of Ĵ(u), say

g(u) = μ ∇_u Ĵ(u) / ||∇_u Ĵ(u)||, (27)

with μ being an empirically chosen scaling factor, is projected onto the hyperplane orthogonal to the unit vector u:

h_g(u) = g(u) − u u^T g(u). (28)

The set of all such vectors at a given point u is known as the tangent space of the constraint surface ||u|| = 1. Then, the solution vector is updated by using the following rule:

u* ← u cos θ + (h_g(u) / ||h_g(u)||) sin θ, (29)

which is equivalent to rotating u, the current solution vector, in the direction of h_g(u) by some angle θ along the geodesic (i.e. the curve on the constraint surface that connects two arbitrary points by an arc of shortest length). Following the work [12], in our experiments we arbitrarily set this angle to ||h_g(u)||. Please note that for constant μ, the norm of the tangent vector (28) mainly depends on the direction of the gradient. At the solution point, say u = u_opt, the gradient (23) is always perpendicular to the constraint surface, which means that ||h_g(u_opt)|| = 0.

In fact, during the early sweeps it is usually not necessary to compute the rotation angles with the highest accuracy, since the local optimizations can affect each other. For this reason, in every sweep, we perform only one iteration of the rule (29) per each optimization task (17), (18), assuming that

u = u_1 = [1, 0, 0, 0]^T. (30)

Since M±(u_1) = I, (25) and (26) can be simplified to

a(u_1) = ḡ(Y) 1_m, (31)
b_i(u_1) = [ḡ′(Y) ∘ P_i Y] 1_m,   i = 0, ..., 3. (32)

After applying (29), the data are rotated by using the matrix M+(p_k = u*) or M−(q_k = u*), depending on the current optimization problem being solved. As the stopping condition we propose

||h_g(q_k)|| + ||h_g(p_k)|| < ε, (33)

where ε is a sufficiently small positive scalar constant.
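One update of (27)-(32) can be sketched as follows. This is an illustrative reconstruction of ours (variable names are our own), assuming the 'TANH' non-linearity and data whitened as in (4) and (12); it exploits the fact that M−(u) is linear in u, so that P_i = ∂M−/∂u_i = M−(e_i):

```python
import numpy as np

def M_minus(q):  # Eq. (9); linear in q, hence P_i = M_minus(e_i)
    q0, q1, q2, q3 = q
    return np.array([[ q0,  q1,  q2,  q3],
                     [-q1,  q0, -q3,  q2],
                     [-q2,  q3,  q0, -q1],
                     [-q3, -q2,  q1,  q0]])

gbar  = lambda y: np.log(np.cosh(y)) - 0.3746   # centered 'TANH' nonlinearity
dgbar = lambda y: np.tanh(y)                    # its first derivative

# Whitened observations, Eqs. (4), (12): mix 4 Laplacian sources, then whiten.
rng = np.random.default_rng(0)
m = 1000
X = rng.uniform(-1, 1, (4, 4)) @ rng.laplace(size=(4, m))
X -= X.mean(axis=1, keepdims=True)
lam, U = np.linalg.eigh(X @ X.T / m)
Y = (U @ np.diag(lam ** -0.5) @ U.T) @ X        # sample E{YY^T} = I

# One gradient step at u1 = [1,0,0,0]^T, Eqs. (24) and (27)-(32).
ones = np.ones(m)
a = gbar(Y) @ ones                                                        # Eq. (31)
b = [(dgbar(Y) * (M_minus(np.eye(4)[i]) @ Y)) @ ones for i in range(4)]   # Eq. (32)
grad = np.array([2 * a @ bi for bi in b])                                 # Eq. (24)

mu = 5.0                                        # scaling factor used for 'TANH'
u = np.array([1.0, 0.0, 0.0, 0.0])              # Eq. (30)
g = mu * grad / np.linalg.norm(grad)            # Eq. (27)
h = g - u * (u @ g)                             # Eq. (28): tangent projection
theta = np.linalg.norm(h)                       # step angle, as suggested in [12]
u_new = np.cos(theta) * u + np.sin(theta) * h / theta   # Eq. (29)
# u_new stays on the constraint surface ||u|| = 1; the data would then be
# rotated by M-(q_k = u_new), and the procedure repeated for the M+ problem.
```

Because u and h_g(u)/||h_g(u)|| are orthonormal, the update (29) keeps the iterate exactly on the unit sphere, which is the point of the geodesic rule.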
Fig. 1. Demonstration of source extraction using the proposed method. (a) The original source signals. (b) The observed mixtures of the source signals. (c) Contrast function values versus sweep number. (d) The estimates of the original source signals.
It is well-known that gradient methods do not necessarily converge for large values of the scaling factor μ. On the other hand, for small values of μ the algorithm can converge very slowly. Therefore, in our simulations we started with a sufficiently large μ and, whenever instability was detected, i.e. Ĵ(u*) < Ĵ(u), the scaling factor μ was divided by two.

As an illustration, let us consider the four waveforms in Fig. 1. The original and randomly mixed signals are depicted in Fig. 1a and Fig. 1b, respectively. Figure 1d presents the source signals recovered by using the proposed approach. In order to illustrate the convergence characteristics, in Fig. 1c we also show the estimate of the contrast function Ĵ+(p_k) as a function of the iteration number k. It can be seen that the source signals have been successfully separated (up to permutation) in less than six sweeps.

IV. PERFORMANCE EVALUATION

The proposed method (denoted here as QICA) has been implemented in MATLAB. For comparative purposes we have also implemented the symmetric FastICA approach [13]. Convergence characteristics, source separation quality, and computation time were measured for several conditions using synthetic data. The experiments were conducted for the three non-linear functions that are frequently considered in the literature [3], [11]:

POW3: ḡ(y) = y^4 − 3,
TANH: ḡ(y) = log(cosh(y)) − 0.3746,
GAUS: ḡ(y) = −exp(−y^2/2) + 0.7071.

We have found empirically that the convergence speed can be affected by the particular choice of the non-linear function. In order to ameliorate this problem, we propose to use different initial values of the scaling factor μ for different non-linearities. In our experiments we set μ = 1.5, 5, and 3 for the 'POW3', 'TANH' and 'GAUS' functions, respectively.

Synthetic sources were generated for the following distributions: uniform, bpsk, Laplacian, and generalized Gaussian (GG(α) with parameter α = 3 and 0.5) [11]. The length of each source signal was m = 1000 samples. In the first scenario, we mixed four sources of the same non-Gaussian distribution. In order to evaluate the robustness of the compared algorithms to Gaussian sources, we also considered scenarios in which one or two sources were Gaussian and the others were non-Gaussian but of the same distribution. In all scenarios, the coefficients of the mixing matrix were generated from the uniform distribution. The performance indexes were averaged over 1000 random realizations of the sources and the mixing matrices.

A. Convergence and Separation Quality

Please note that measuring the convergence speed in terms of the iteration number, or of the average computation time taken by an algorithm to converge, makes little sense here, since the compared methods use different stop conditions. Therefore, we decided to disable the stop conditions in both methods and to perform simulations for a fixed number of iterations, measuring the separation quality in each iteration step.
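The subtracted constants in the three non-linearities above are the values of E{g(v)} for a standardized Gaussian v (E{v^4} = 3, E{log cosh v} ≈ 0.3746, and E{exp(−v^2/2)} = 1/√2 ≈ 0.7071), so that E{ḡ(v)} ≈ 0 and the contrast (16) vanishes on Gaussian data. A quick Monte-Carlo check of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000)   # standardized Gaussian variable

POW3 = lambda y: y ** 4 - 3
TANH = lambda y: np.log(np.cosh(y)) - 0.3746
GAUS = lambda y: -np.exp(-y ** 2 / 2) + 0.7071

# Each centered nonlinearity has (near-)zero mean under the Gaussian,
# which is why the one-unit contrast (16) is blind to Gaussian components.
means = [abs(g(v).mean()) for g in (POW3, TANH, GAUS)]
```

The residual means are dominated by Monte-Carlo error (largest for POW3, whose fourth-moment estimator has high variance).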
Fig. 2. Average source separation quality as a function of sweep/iteration number, obtained using QICA (circles) and FastICA (squares) for different non-linearities. Dashed lines: no Gaussian sources. Solid lines: 1 Gaussian source. Dotted lines: 2 Gaussian sources.
For this purpose the average signal mean square error (SMSE) [14] was used:

SMSE = (1/|S|) Σ_{(x1,x2)∈S} SMSE(x1, x2), (34)

where SMSE(i, j) = E{|s_i − α_j ŝ_j|^2}, and the coefficient α_j = E{s_i ŝ_j}/E{|ŝ_j|^2} represents the correlation between the ith source and the jth estimated component. The set S contains the unique pairs of indices of the sources and the corresponding estimated components. These pairs are chosen successively as (x1, x2) = arg min_{i,j} SMSE(i, j), and, once selected, are no longer taken into account in the pairing process.

Figure 2 presents the SMSE measures obtained in all scenarios, averaged over all distributions. More detailed results are presented in Tab. I-III. In the first scenario (no Gaussians), approximately after 20-40 iterations both methods converge to the solution, giving similar separation qualities. This is rather not surprising, since the optimization tasks of QICA and FastICA are basically equivalent. From Fig. 2 we also see that the proposed method converges only slightly slower than the FastICA approach. In the scenario where one Gaussian source was used (Tab. II), our algorithm provides substantially better separation quality in almost all cases. This observation is even more evident in the last scenario (Tab. III), where two Gaussians were used. It is clear that the performance of the FastICA approach can be deteriorated by the Gaussian sources. The proposed method seems to be more robust to Gaussian sources, since it gives similar or even better results as in the first scenario.

Note that if there are no Gaussian sources, then each local minimum or maximum of the one-unit contrast function corresponds to one independent component. As (16) is blind to Gaussian sources, in the presence of one or more Gaussians the number of local optima is usually lower than n, so that symmetric orthogonalization, as in the FastICA approach, may pose some stability issues. Theoretically, this problem could be mitigated by properly regularizing the inverse during the symmetric orthogonalization step.

TABLE I. SMSE (dB) for 4 components with the same distribution, after k = 40 iterations.

            Symmetric FastICA        QICA4
PDF        POW3   TANH   GAUSS    POW3   TANH   GAUSS
uniform   -29.0  -27.9  -27.8    -29.0  -27.9  -27.8
bpsk      -32.2  -32.1  -32.2    -32.2  -32.2  -32.2
GG(3.0)   -19.2  -19.4  -19.2    -19.3  -19.5  -19.4
Laplacian -21.0  -24.8  -25.1    -20.5  -24.7  -25.1
GG(0.5)   -24.2  -29.0  -29.6    -23.7  -28.9  -29.5

TABLE II. SMSE (dB) for 3 components with the same distribution and 1 Gaussian noise source, after k = 40 iterations.

            Symmetric FastICA        QICA4
PDF        POW3   TANH   GAUSS    POW3   TANH   GAUSS
uniform   -24.3  -25.4  -25.6    -29.0  -27.8  -27.7
bpsk      -28.6  -32.5  -32.9    -34.8  -35.0  -35.0
GG(3.0)   -15.2  -16.1  -16.3    -17.5  -17.9  -17.9
Laplacian -19.9  -23.2  -23.3    -20.0  -24.1  -24.5
GG(0.5)   -23.7  -28.6  -29.0    -23.3  -29.1  -29.8

TABLE III. SMSE (dB) for 2 components with the same distribution and 2 Gaussian noise sources, after k = 40 iterations.

            Symmetric FastICA        QICA4
PDF        POW3   TANH   GAUSS    POW3   TANH   GAUSS
uniform   -21.8  -23.7  -24.0    -28.8  -27.7  -27.6
bpsk      -26.3  -32.6  -33.4    -39.2  -41.7  -41.7
GG(3.0)   -13.3  -14.5  -14.5    -16.2  -17.0  -17.0
Laplacian -19.1  -21.1  -22.1    -19.6  -23.7  -24.1
GG(0.5)   -23.4  -28.6  -26.7    -23.2  -29.6  -30.6
Also, the dimensionality of the observation data could be reduced so that the dimension of the transformed data equals the number of non-Gaussian sources. Unfortunately, in both cases, the non-Gaussian subspace must be identified properly, which in general is not a trivial task. Unlike the FastICA method, the proposed approach is inherently orthogonal and directly maximizes the sum of the one-unit contrast functions. It can be verified that the data rotations are performed in the non-Gaussian subspace only. Thus, the method is able to successfully recover non-Gaussian components even if the observation mixtures contain multiple Gaussians.

B. Complexity

As can be seen in Fig. 2, both methods have similar convergence characteristics. However, it can be easily verified that QICA's iterations are more expensive than the iterations of the FastICA algorithm. In the case of the QICA approach, the two functions ḡ and ḡ′ have to be evaluated twice per sample, while the FastICA approach requires evaluating the functions g′ and g″ only once per sample [3]. Thus, by assuming that all these functions have a similar complexity, one QICA iteration is at least 2 times more expensive than a FastICA iteration. In order to validate this statement, we measured the average computation time of one iteration for both methods, for different sizes of input data. As can be seen in Fig. 3, the complexities of the evaluated methods are of the same order, since the computation times differ by a constant factor regardless of the data size, but on average one iteration of the QICA method is 2.3 times more expensive than that of the FastICA approach. Theoretically, the complexity of the proposed method can be reduced by taking into account the specific structure of the matrices M±(u). In this case the matrix-vector multiplication can be implemented using only 8 real multiplications [15].

Fig. 3. Average computation time of one iteration of QICA (circles) and FastICA (squares) versus size of input data.

V. CONCLUSIONS

Quaternionic factorization of 4x4 rotation matrices can be applied to the orthogonal ICA approach. By following the Jacobi-related estimation framework and by using a contrast function based on a negentropy approximation, we have developed a novel 4-unit symmetric ICA method. Compared to the symmetric FastICA approach, the proposed method is a bit slower and more computationally demanding. However, we have also shown through experiments that our method offers higher separation quality of non-Gaussian components if the observation mixtures contain one or more Gaussian sources.

Future work includes extending the proposed method to the case of n dimensions, improving its convergence speed, and developing a hardware implementation.

ACKNOWLEDGEMENT

This work was supported by Bialystok University of Technology under the grant S/WI/3/2018.

REFERENCES

[1] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, no. 3, pp. 287–314, 1994.
[2] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4, pp. 411–430, 2000.
[3] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626–634, 1999.
[4] J. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings F - Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[5] J. Cardoso, "High-order contrasts for independent component analysis," Neural Computation, vol. 11, no. 1, pp. 157–192, 1999.
[6] A. Hyvärinen, "The fixed-point algorithm and maximum likelihood estimation for independent component analysis," Neural Process. Lett., vol. 10, no. 1, pp. 1–5, 1999.
[7] M. Plauth, F. Feinbube, P. Tröger, and A. Polze, "Fast ICA on modern GPU architectures," in 2014 15th International Conference on Parallel and Distributed Computing, Applications and Technologies, Dec 2014, pp. 69–75.
[8] A. Hyvärinen, "One-unit contrast functions for independent component analysis: a statistical analysis," in Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE Signal Processing Society Workshop, Sep 1997, pp. 388–397.
[9] N. Mackey, "Hamilton and Jacobi meet again: Quaternions and the eigenvalue problem," SIAM Journal on Matrix Analysis and Applications, vol. 16, no. 2, pp. 421–435, 1995.
[10] A. Hyvärinen, "New approximations of differential entropy for independent component analysis and projection pursuit," in Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, 1997, pp. 273–279.
[11] P. Tichavsky, Z. Koldovsky, and E. Oja, "Performance analysis of the FastICA algorithm and Cramér-Rao bounds for linear independent component analysis," IEEE Transactions on Signal Processing, vol. 54, no. 4, pp. 1189–1203, 2006.
[12] S. Douglas, S. Amari, and S. Kung, "On gradient adaptation with unit-norm constraints," IEEE Transactions on Signal Processing, vol. 48, no. 6, pp. 1843–1847, 2000.
[13] "FastICA MATLAB package." [Online]. Available: www.cis.hut.fi/projects/ica/fastica
[14] V. Zarzoso and P. Comon, "Robust independent component analysis by iterative maximization of the kurtosis contrast with algebraic optimal step size," IEEE Transactions on Neural Networks, vol. 21, no. 2, pp. 248–261, 2010.
[15] T. Howel and J.-C. Lafon, "The complexity of quaternion product," Cornell University, Department of Computer Science, Tech. Rep. TR 75-245, 1975.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND
Two-dimensional non-separable quaternionic paraunitary filter banks

Nick A. Petrovsky∗, Eugene V. Rybenkov†, Alexander A. Petrovsky‡
Department of Computer Engineering,
Belarusian State University of Informatics and Radioelectronics
Minsk, Belarus
Email: {∗nick.petrovsky, †rybenkov, ‡palex}@bsuir.by
Abstract—This paper presents a novel technique of factorization for 2-D non-separable quaternionic paraunitary filter banks (2-D NSQ-PUFB). Two-dimensional factorization structures called "16in-16out" and "64in-64out", respectively for 4-channel and 8-channel Q-PUFBs based on the proposed technique, are shown. The given structures can be mapped to a parallel-pipeline processor architecture with a minimum latency of 2(N + 1) quaternion multiplication operations, where N is the transform order of the Q-PUFB. The latency of parallel-pipeline processing does not depend on the size of the original image, in contrast to the conventional 2-D transform. The coding gains CG_MD of 2-D non-separable Q-PUFBs for the isotropic autocorrelation function model with the correlation factor ρ = 0.95 are the following: CG_MD = 13.4 dB for the "16in-16out" structure and CG_MD = 15.6 dB for the "64in-64out" structure.

Keywords—quaternionic paraunitary filter banks, two-dimensional, non-separable

I. INTRODUCTION

With the rapid development of the Internet, wireless communication, and portable computing, the demands for and interest in digital multimedia (digitized speech, audio, image, video, computer graphics, and their combinations) are growing exponentially. A high-performance filter bank (FB) is typically at the heart of every state-of-the-art digital multimedia system [1]. FBs have been widely researched as a way to efficiently compress such signals. A uniform maximally decimated M-channel filter bank consists of parallel analysis filters H_k(z), synthesis filters F_k(z), and down-sampling and up-sampling operators. The equivalent polyphase representation of the given filter bank includes the polyphase matrices E(z) and R(z). The polyphase representation is formulated as follows [2]:

[H_0(z) H_1(z) ... H_{M−1}(z)]^T = E(z^M) e(z)^T,
[F_0(z) F_1(z) ... F_{M−1}(z)] = e(z) R(z^M),

where e(z) = [1 z^{−1} ... z^{−(M−1)}]; z^{−1} and T denote a delay element and matrix transposition. If E(z) is invertible, the synthesis polyphase matrix R(z) can be chosen as the inverse of E(z), i.e., perfect reconstruction is achieved. The obtained FB is called a perfect reconstruction FB or biorthogonal FB (BOFB). If E(z^{−1})^T E(z) = I and R(z) = E(z^{−1})^T, the FB belongs to a special class of perfect reconstruction filter banks called paraunitary FBs (PUFB). In this situation, the PUFB has the advantage that it guarantees the error energy in the reconstructed signal to be the average of the error energy in the subband signals. This property also allows us to use optimal bit-allocation algorithms [3]. PUFBs are the basis for orthogonal M-band wavelets. The human visual system is known to be sensitive to phase distortion. Since phase distortion can be avoided by applying filters with a linear-phase (LP) property, it is desirable that all filters composing FBs have the LP property when the system is applied to image processing. Hence, the LP and paraunitary properties of FBs are particularly significant for the subband coding of images. Several one-dimensional (1-D) linear-phase paraunitary filter banks (LP PUFBs) have been developed so far [4], [5]. New factorizations and structures regularly appear, offering new useful features, design flexibility and ease, or computational efficiency [6], [7].

In this context, the authors have recently [8] presented a new concept of a quaternionic building block applicable to many existing structures of FBs (Q-PUFB) and transforms, especially to the 4- and 8-channel ones commonly used in imaging applications. The main results were: structurally guaranteed perfect reconstruction (up to scaling) under rough coefficient quantization, reduced memory requirements, and good suitability for FPGA and VLSI implementations, making the disadvantage of increased computational complexity non-critical [9].

One-dimensional LP PUFBs can be applied to the construction of multidimensional separable systems. 2-D signals (images) are separately transformed along the vertical and horizontal directions. However, multidimensional signals are generally non-separable, and this approach does not exploit their characteristics effectively. 2-D non-separable FBs perform more efficiently for image coding than separable FBs, because non-separable FBs may have better frequency characteristics [10], [11]. Suzuki et al. proposed a lattice structure of 2-D non-separable perfect reconstruction FBs and showed its efficiency for lossy-to-lossless image coding [12].

Taking into account the advantages of the Q-PUFB, the aim of this contribution is to show a novel technique of factorization for 2-D non-separable quaternionic paraunitary filter banks (2D-NSQ-PUFB) and to evaluate their performance.
II. R EVIEW AND DEFINITIONS Theorem 1. The factorization of memory-efficient high-
A. Conventional 2-D transform throughput 2D separable transform is
In general case 1D transformation can be formulated as yn·n,1 = Θdiag · P · Θdiag · P · xn·n,1 = Θ̈n2 ,n2 · xn·n,1 , (7)
follows: yn,n = Θn,n ·xn,n , where Θn,n is conversion matrix,
whose size is n × n); yn,n is transform result n × n, xn,n is where Θdiag is the matrix with transform matrices Θn,n on the
input signal dimension n × n. The two-dimensional transform main diagonal (the number of matrices Θn,n is n); Θ̈n2 ,n2 =
based on the orthogonal transform Θn,n applied to 2D input = Θdiag · P · Θdiag · P is the 2D transformation matrix.
signal xn,n separately by column and row is expressed by
Proof. Two-dimensional transformation (1) can be rewritten
yn,n = Θn,n · xn,n · ΘTn,n . (1) as
Comparing the 1D and 2D transformations, we can note that T
yn,n = Θn,n · xn,n · ΘTn,n = Θn,n · Θn,n · xTn,n . (8)
the 2D transform is performed over a size signal n × n, it is
executed in blocks. Also, in order to perform a 2D transform, By applying (4-6) to (8), we obtain:
it is needed to obtain an intermediate result xn,n ΘTn,n , which  
requires additional memory of size n × n. Θn,n · · · 0
 .. .. ..  · P ×
B. Memory-efficient high-throughput 2-D FBs yn·n,1 =  . . .  (9)

As a trade-off between processing speed and hardware costs, 0 · · · Θn,n


 
a parallel-pipeline scheme for calculating the 2D transforma- Θn,n · · · 0
×  ... ..  · P · x
tion of the original image is known based on the separable  ..
. .  n·n,1 .
approach [12]. 0 ··· Θn,n
Definition 1. Given input signal xn,n can be transformed into
vector xn·n,1 as follows: Comparing equations (1) and (7), it may seem that the com-
plexity of the hardware implementation of the conversion has
  increased, but this is not so. The calculation of yn,n requires
x1,1 ··· x1,n performing (2 · n2 ) vector multiplication operations, which is
T  .. .. ..  (2)
[x1,1 . . . x1,n . . . xn,1 . . . xn,n ] ←tv
− . . .  equivalent for the case of yn·n,1 (if we neglect multiplication
xn,1 ··· xn,n by 0). Multiplication by the matrix P is the commutation of
the input signal, which does not incur additional hardware
where tv (xn,n ) denotes the forward transformation: xn·n,1 = costs. The transformation based on the separable approach (7)
= tv (xn·n,1 ). The transformation tv only performs a line-by- for the 2D separable M -channel FB (M = 4) is written as
line mapping of the matrix x_{n,n} into a vector x_{n·n,1}.

Definition 2. The inverse transform x_{n,n} = tv^{-1}(x_{n·n,1}) of the vector x_{n·n,1} is defined as:

\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,n} \end{bmatrix} \xleftarrow{tv} [x_{1,1} \ldots x_{1,n} \ldots x_{n,1} \ldots x_{n,n}]^T   (3)

Definition 3. The forward transform of the transposed matrix x_{n,n}^T is z_{n·n,1} = tv(x_{n,n}^T):

[x_{1,1} \ldots x_{n,1} \ldots x_{1,n} \ldots x_{n,n}]^T \xleftarrow{tv} \begin{bmatrix} x_{1,1} & \cdots & x_{n,1} \\ \vdots & \ddots & \vdots \\ x_{1,n} & \cdots & x_{n,n} \end{bmatrix}   (4)

On the basis of Definition 1, the vectors z_{n·n,1}, x_{n·n,1} and the matrix x_{n,n} are related as follows:

z_{n·n,1} = P · x_{n·n,1} = P · tv(x_{n,n}),   (5)

where P is a permutation matrix of size (n^2 × n^2). On the other hand, by Definitions 1 and 3, the transforms of the transposed matrix x_{n,n}^T and of the matrix x_{n,n} into a vector are connected as follows:

tv(x_{n,n}^T) = P · tv(x_{n,n}).   (6)

... as follows:

y_{M·M,1} = diag(E, E, E, E) · P · diag(E, E, E, E) · x_{M·M,1}.   (10)

The number of polyphase matrices E on the main diagonal of diag(E, E, E, E) is equal to the number of channels M of the FB, so the total number of one-dimensional FBs in a 2D FB is 2M.

The calculation of a 2D separable 4-band Q-FB, an example based on the quaternion multipliers [13], proceeds over the row blocks of pixels (4 × 4), n = 1 ... N/4 (see Fig. 1a and b), and then the column blocks of pixels (4 × 4), n = 1 ... N/4, are processed (see Fig. 1c). The 2D separable 4-band Q-FB is thus computed in N^2/16 cycles of block processing of the original image. Compared with the 2-band FB (N^2/4 processing cycles), the image throughput is increased 4 times.

Thus, the parallelism in Eq. (9) and in the calculation of the 2D 4-band Q-FB (10) is achieved by a new method of block scanning of the original image at the stage of processing the row and column blocks of pixels; the separable 2D transformation scheme (10) then allows the principle of parallel-pipeline processing to be applied to the transformation over the column blocks of pixels.
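As a concrete illustration of the vectorization in Definitions 1-3 and of the permutation identity (6), the row-scan transform and the matrix P can be sketched in a few lines of NumPy (the helper name `tv` mirrors the paper's notation; the code itself is our illustrative addition, not part of the original):

```python
import numpy as np

n = 4  # block size, matching the 4x4 pixel blocks of the 4-band Q-FB

def tv(X):
    """Row-by-row mapping of an n x n matrix into a length-n*n vector (Definition 1)."""
    return X.flatten(order="C")

# Permutation matrix P of size (n^2 x n^2) that converts the row scan of X
# into the row scan of X^T, i.e. tv(X^T) = P @ tv(X) -- Eq. (6).
idx = np.arange(n * n).reshape(n, n).T.flatten()
P = np.eye(n * n)[idx]

X = np.arange(n * n, dtype=float).reshape(n, n)
assert np.allclose(tv(X.T), P @ tv(X))    # Eq. (6)
assert np.allclose(P @ P, np.eye(n * n))  # transposing twice is the identity
```

Note that P @ P = I, which is why the same P appears on both sides of the separable form (10).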

Fig. 1: Steps of image processing in the parallel-pipeline processor of the 2D separable 4-band Q-FB: (a) the order of image scanning by pixel blocks; (b) the conversion coefficients of the four sub-bands LL, HL, LH, HH after the first stage of line-by-line processing; (c) the decomposition of the original image into 16 subband components.
III. 2-D NON-SEPARABLE QUATERNIONIC PARAUNITARY FILTER BANK (2D-NSQ-PUFB)

A. Structurally lossless lattice for Q-PUFB

As shown in [6], quaternions are especially suited to the parameterization of 4 × 4 orthogonal matrices. Namely, every matrix belonging to SO(4) can be represented as a product of left and right unit quaternions P and Q (|P| = 1 and |Q| = 1):

\forall R \in SO(4) \;\; \exists P, Q \in \text{unit quat.}: \quad R = M^{+}(P) \cdot M^{-}(Q) = M^{-}(Q) \cdot M^{+}(P),

which (contrary to Givens rotations) makes it possible to preserve their orthogonality in spite of coefficient quantization. A quaternionic critically sampled linear-phase PUFB with pairwise-mirror-image symmetric frequency responses (PMI LP PUFB) results from the factorization (E(z) is the paraunitary polyphase transfer matrix of the analysis filter bank) [6], [8]:

E(z) = G_{N-1} G_{N-2} \ldots G_1 E_0;   (11)

E_0 = \frac{1}{\sqrt{2}} \Phi_0 W \, \mathrm{diag}(I_{M/2}, J_{M/2}),   (12)

G_i = \frac{1}{2} \Phi_i W \Lambda(z) W, \quad i = \overline{1, N-1},   (13)

W = \begin{bmatrix} I_{M/2} & I_{M/2} \\ I_{M/2} & -I_{M/2} \end{bmatrix}; \quad \Lambda_M(z) = \mathrm{diag}(I_{M/2}, z^{-1} I_{M/2}),   (14)

where N is the order of the factorization; I_{M/2} and J_{M/2} denote the M/2 × M/2 identity and reversal matrices, respectively; \Gamma_{M/2} is a diagonal matrix whose elements are defined as \gamma_{mm} = (-1)^{m-1}, m = \overline{1, M-1}.

A 4-channel PMI LP Q-PUFB is realized according to the following factorization of the matrices \Phi_i and \Phi_{N-1} [6]:

\Phi_i = M^{+}(P_i),   (15)

\Phi_{N-1} = M^{+}(P_i) \cdot \mathrm{diag}(J_{M/2} \cdot \Gamma_{M/2}, I_{M/2}),   (16)

M^{+}(P) = \begin{bmatrix} p_1 & -p_2 & -p_3 & -p_4 \\ p_2 & p_1 & -p_4 & p_3 \\ p_3 & p_4 & p_1 & -p_2 \\ p_4 & -p_3 & p_2 & p_1 \end{bmatrix},   (17a)

M^{-}(Q) = \begin{bmatrix} q_1 & -q_2 & -q_3 & -q_4 \\ q_2 & q_1 & q_4 & -q_3 \\ q_3 & -q_4 & q_1 & q_2 \\ q_4 & q_3 & -q_2 & q_1 \end{bmatrix}.   (17b)

The matrices M^{+}(P) and M^{-}(Q) are the left and right 4 × 4 multiplication matrices, respectively: Qx = M^{+}(Q) x, xQ = M^{-}(Q) x; P = p_1 + p_2 i + p_3 j + p_4 k and Q = q_1 + q_2 i + q_3 j + q_4 k are unit quaternions, where the orthogonal imaginary units obey the following multiplicative rules: i^2 = j^2 = k^2 = ijk = -1, ij = -ji = k, jk = -kj = i, ki = -ik = j.

The corresponding factorization of the matrices \Phi_i and \Phi_{N-1} for an 8-channel PMI LP Q-PUFB is shown below [8]:

\Phi_i = \mathrm{diag}(\Gamma_{M/2}, I_{M/2}) \cdot \mathrm{diag}(M^{-}(Q_i), M^{-}(Q_i)) \cdot \mathrm{diag}(M^{+}(P_i), M^{+}(P_i)) \cdot \mathrm{diag}(\Gamma_{M/2}, I_{M/2})   (18)

y_{n,n} = \ldots \cdot \mathrm{diag}(I_{M/2}, J_{M/2}) \cdot W \cdot \Phi_0 \cdot \frac{1}{\sqrt{2}}\, x_{n,n}\, \frac{1}{\sqrt{2}}\, \Phi_0^T \cdot W^T \cdot \mathrm{diag}(I_{M/2}, J_{M/2}) \cdot \ldots   (?)

\Phi_{N-1} = \mathrm{diag}(\Gamma_{M/2}, I_{M/2}) \cdot \mathrm{diag}(M^{-}(Q_i), M^{-}(Q_i)) \cdot \mathrm{diag}(M^{+}(P_i), M^{+}(P_i)) \cdot \mathrm{diag}(\Gamma_{M/2}, I_{M/2})   (19)

B. Two-dimensional non-separable PMI LP Q-PUFB

When the factorization of the PMI LP Q-PUFB matrix E (8) is applied to a 2D input signal x_{n,n} in the horizontal and vertical directions, the output signal y_{n,n} is expressed as

y_{n,n} = E \cdot x_{n,n} \cdot E^T = G_{N-1} \ldots G_1 E_0\, x_{n,n}\, E_0^T G_1^T \ldots G_{N-1}^T.   (20)

According to Theorem 1 and the relations (8)-(9), the 2D non-separable PMI LP Q-PUFB is

y_{n \cdot n,1} = \ddot{E} \cdot x_{n \cdot n,1} = \ddot{G}_{N-1}(z) \ddot{G}_{N-2}(z) \cdot \ldots \cdot \ddot{G}_1(z) \cdot \ddot{E}_0 \cdot x_{n \cdot n,1},   (21)

where the double dot denotes the 2D-transformation matrix. The underscore in equation (?) shows the sequence of matrix replacements used to obtain equation (21). Eq. (20) means that the 2D implementation of G_k is performed after that of G_{k-1} (1 ≤ k ≤ N − 1), i.e., the matrices W, Λ(z), M^{+}(P) can be operated on separately. The resulting representations of the matrices W, Λ(z), M^{+}(P) as 2D transforms are

\ddot{W} = \mathrm{diag}(W_M, \ldots, W_M) \cdot P \cdot \mathrm{diag}(W_M, \ldots, W_M) \cdot P;   (22)

\ddot{\Lambda}(z) = \mathrm{diag}(\Lambda_M(z), \ldots, \Lambda_M(z)) \cdot P \cdot \mathrm{diag}(\Lambda_M(z), \ldots, \Lambda_M(z)) \cdot P.   (23)

In order to reduce the number of factorization steps in (21), the permutation matrices in (22) and (23) can be combined. This simplifies the structural solution of the 2D non-separable PMI LP Q-PUFB:

W_{4M} = \begin{bmatrix} I_{2M} & I_{2M} \\ I_{2M} & -I_{2M} \end{bmatrix} = P \cdot \mathrm{diag}(W_M, \ldots, W_M) \cdot P; \qquad \ddot{W} = \mathrm{diag}(W_M, \ldots, W_M) \cdot W_{4M},   (24)

\Lambda_{4M} = P \cdot \mathrm{diag}(\Lambda_M, \ldots, \Lambda_M) \cdot P; \qquad \ddot{\Lambda} = \mathrm{diag}(\Lambda_M, \ldots, \Lambda_M) \cdot \Lambda_{4M}.   (25)

For a 4-channel PMI LP Q-PUFB, the two-dimensional analogues of the matrices \Phi_i and \Phi_{N-1} are defined as follows:

\ddot{\Phi}_i = \mathrm{diag}(M^{+}(P_i), M^{+}(P_i), M^{+}(P_i), M^{+}(P_i)) \cdot P \cdot \mathrm{diag}(M^{+}(P_i), M^{+}(P_i), M^{+}(P_i), M^{+}(P_i)) \cdot P,   (26)

\ddot{\Phi}_{N-1} = \ddot{\Phi}_i \cdot \ddot{S}_i, \qquad \ddot{S} = \mathrm{diag}(S_1, S_1, S_1, S_1) \cdot P \cdot \mathrm{diag}(S_1, S_1, S_1, S_1) \cdot P, \qquad S_1 = \mathrm{diag}(J_{M/2} \cdot \Gamma_{M/2}, I_{M/2}).   (27)

Thus, the factorization components (21) of the 2-D non-separable 4-channel PMI LP Q-PUFB, whose prototype filter is given by the relations (11)-(16), are represented by the expressions (22)-(27). Similarly, the 2D non-separable factorization for the 8-channel PMI LP Q-PUFB can be found.

The structure of the critically sampled 2-D non-separable 4-channel PMI LP Q-PUFB for N = 1, in accordance with the factorizations (22)-(27), is shown in Fig. 2. This factorization structure of the 2D non-separable PMI LP Q-PUFB will be called "16in-16out"; for the two-dimensional 8-channel Q-PUFB it is "64in-64out". Rectangular blocks in the structure of the 2-D non-separable filter bank (see Fig. 2) denote quaternionic multipliers, and vertical lines are the places of data permutations in accordance with the permutation matrices P. The 2D signal is scanned in blocks of 16 samples in accordance with Fig. 1a. Memory is needed only for the input vector, and it amounts to 16 words. Performing one transformation produces one block of the output product (see Fig. 1c). The analysis of the 2-D non-separable PMI LP Q-PUFB circuit in Fig. 2 shows that it can be mapped onto parallel-pipeline processor structures with a minimum latency of 2(N + 1) quaternion multiplication operations, where N is the transformation order of the PMI LP Q-PUFB. It should be noted that the latency of parallel-pipeline processing does not depend on the size of the original image, in contrast to the conventional 2-D transform (1).
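One standard way to see the step from the separable form (20) to the single-operator form (21) is the vec/Kronecker identity for row-scanned blocks, vec(E X E^T) = (E ⊗ E) vec(X), so the 2D operator acts as one matrix on the vectorized block. A quick numerical check (illustrative only, with a random stand-in for E):

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(4, 4))  # stand-in for a 1D transform matrix
X = rng.normal(size=(4, 4))  # one 4x4 block of the image

Y = E @ X @ E.T          # separable 2D transform, as in Eq. (20)
E2d = np.kron(E, E)      # single 16x16 operator acting on the row-scanned block
assert np.allclose(E2d @ X.flatten(), Y.flatten())  # matches the form of Eq. (21)
```

With E 4 × 4 and the blocks row-scanned into length-16 vectors, the 2D operator is 16 × 16, which is exactly the "16in-16out" granularity of the structure in Fig. 2.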

Fig. 2: The structure of the critically sampled 2-D non-separable PMI LP Q-PUFB for N = 1

IV. DESIGN EXAMPLE

In this section, we present design examples of the proposed 2-D non-separable PMI LP Q-PUFB structures "16in-16out" and "64in-64out", show their magnitude responses, and report their efficiency measured in coding gain (CG_MD). These examples are designed using the routine 'fminunc' provided by the MATLAB Optimization Toolbox. For the design examples of the proposed FBs, the design objective function is chosen as the maximum coding gain for the isotropic autocorrelation function model with the correlation factor ρ = 0.95.

We design critically sampled 4-channel (12 taps) and 8-channel (24 taps) LP PMI Q-PUFBs for the factorization order N = 3 (performance measures: linear phase, coding gain (CG), stopband attenuation (SBE), and direct-current attenuation (DC Att.), which measures the deviation from one-regularity, i.e. DC leakage) [14]: (8 × 24) Q-PUFB: CG = 9.37 dB, SBE = -21.02 dB, DC Att. = -316.08 dB; (4 × 12) Q-PUFB: CG = 8.23 dB, SBE = -18.38 dB, DC Att. = -301.17 dB. For these filters, the factorization (21) was performed, resulting in two structures of the 2-D non-separable PMI LP Q-PUFB: "16in-16out" and "64in-64out". Design examples of the 2-D non-separable 4-channel PMI LP Q-PUFB with the rectangular decimation diag(4, 4) are shown in Fig. 3 and Fig. 4: each filter has 12 × 12 taps; Fig. 3 shows the basis images of the 16 analysis filters, and Fig. 4 shows the amplitude responses of the four analysis filters.

The coding gains CG_MD of the 2-D non-separable PMI LP Q-PUFBs for the isotropic autocorrelation function model with the correlation factor ρ = 0.95 are the following: CG_MD = 13.4 dB for the "16in-16out" structure and CG_MD = 15.6 dB for the "64in-64out" structure. The coding gain of the proposed 2-D non-separable PMI LP Q-PUFB ("16in-16out" structure) is almost two decibels higher than in [10], [11] (CG_MD = 11.55 dB).

Fig. 3: Basis images of 16 analysis filters
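For reference, the coding-gain figure of merit used as the optimization objective is commonly defined, for orthonormal critically sampled banks, as the ratio (in dB) between the arithmetic and the geometric mean of the subband variances. A minimal sketch with hypothetical variances (the helper and the numbers below are our illustration, not values from the paper):

```python
import numpy as np

def coding_gain_db(subband_vars):
    """Coding gain (dB): arithmetic over geometric mean of the subband variances,
    the usual figure of merit for orthonormal, critically sampled filter banks."""
    v = np.asarray(subband_vars, dtype=float)
    return 10.0 * np.log10(v.mean() / v.prod() ** (1.0 / v.size))

# Hypothetical variances of the 4 subbands of a strongly correlated input:
gain = coding_gain_db([12.0, 2.0, 1.5, 0.5])
assert gain > 0.0  # any unequal variance split yields a positive gain
```

The more the filter bank concentrates the energy of a ρ = 0.95 correlated source into few subbands, the larger this ratio, which is what the 'fminunc' search maximizes over the quaternion lattice parameters.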

Fig. 4: Amplitude responses of the four analysis filters

V. CONCLUSION

We devised a 2-D non-separable factorization technique for 4- and 8-channel PMI LP Q-PUFBs. The factorization structures of the 2-D NSQ-PUFB can be easily mapped onto the parallel-pipeline processor architecture.

ACKNOWLEDGMENT

This work was supported by the Belarusian Republican Foundation for Fundamental Research (project no. F18MV-016).

REFERENCES

[1] K. Rao and J. Hwang, Techniques and Standards for Image, Video, and Audio Coding. Prentice Hall, 1996.
[2] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[3] N. A. Petrovsky, "Optimal bit allocation in the paraunitary subband image coder based on the quaternion algebra," Doklady BGUIR, vol. 79, no. 1, pp. 72-77, 2014.
[4] G. Strang and T. Q. Nguyen, Wavelets and Filter Banks. Wellesley, MA: Wellesley-Cambridge Press, 1996.
[5] Y.-P. Lin and P. Vaidyanathan, "Linear phase cosine modulated maximally decimated filter banks with perfect reconstruction," IEEE Transactions on Signal Processing, vol. 43, no. 11, pp. 2525-2539, 1995.
[6] M. Parfieniuk and A. Petrovsky, "Quaternionic building block for paraunitary filter banks," in Proc. 12th European Signal Processing Conf. (EUSIPCO), Vienna, Austria, 6-10 Sep. 2004, pp. 1237-1240.
[7] M. Parfieniuk and A. Petrovsky, "Implementation perspectives of quaternionic component for paraunitary filter banks," in Proc. Int. TICSP Workshop on Spectral Methods and Multirate Signal Processing (SMMSP), Vienna, Austria, 11-12 Sep. 2004, pp. 151-158.
[8] M. Parfieniuk and A. Petrovsky, "Inherently lossless structures for eight- and six-channel linear-phase paraunitary filter banks based on quaternion multipliers," Signal Process., vol. 90, pp. 1755-1767, 2010.
[9] N. A. Petrovsky, A. V. Stankevich, and A. A. Petrovsky, "Pipelined block-lifting-based embedded processor for multiplying quaternions using distributed arithmetic," in 5th Mediterranean Conference on Embedded Computing (MECO), 2016, pp. 222-225.
[10] S. Muramatsu, A. Yamada, and H. Kiya, "A design method of multidimensional linear-phase paraunitary filter banks with a lattice structure," IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 690-700, 1995.
[11] T. Yoshida, S. Kyochi, and M. Ikehara, "A simplified lattice structure of two-dimensional generalized lapped orthogonal transform (2-D GenLOT) for image coding," in Proc. 17th IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 349-352.
[12] T. Suzuki and H. Kudo, "2D non-separable block-lifting structure and its application to M-channel perfect reconstruction filter banks for lossy-to-lossless image coding," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4943-4951, Aug. 2015.
[13] N. Petrovsky, A. Stankevich, and A. Petrovsky, "Design and high-performance hardware architecture for image coding using block-lifting-based quaternionic paraunitary filter banks," in 4th Mediterranean Conference on Embedded Computing (MECO), June 2015, pp. 193-198.
[14] N. A. Petrovsky, E. V. Rybenkov, and A. A. Petrovsky, "Design and implementation of reversible integer quaternionic paraunitary filter banks on adder-based distributed arithmetic," in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 2017, pp. 17-22.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Elliptically-Shaped IIR Digital Filters Designed


Using Frequency Transformations

Radu Matei
Faculty of Electronics, Telecommunications and Information Technology
“Gheorghe Asachi” Technical University
Iasi, Romania
Email: rmatei@etti.tuiasi.ro

Abstract—This paper presents an analytic design technique for 2D IIR filters with elliptical symmetry, which have useful applications in image processing. The design is based on efficient elliptic digital filters, regarded as 1D prototypes, to which specific complex frequency transformations are applied; this allows a factored form of the transfer function of the 2D elliptically-shaped filter to be obtained directly. The design procedure uses some accurate approximations, but no global optimization algorithm. Finally the 2D filter matrices are obtained. The filter is adjustable in the sense that its coefficients depend explicitly on the specified orientation and bandwidth. Another advantage is versatility, since the design need not be resumed each time from the start for various specifications. The designed 2D filters have an accurate elliptical shape with low distortions even close to the margins of the frequency plane and are efficient, of high selectivity and relatively low order.

Keywords—2D IIR filters; approximations; elliptically-shaped filters; frequency transformations

I. INTRODUCTION

The field of two-dimensional filters has developed greatly over the last three decades and various design methods have been proposed [1]. The currently-used design methods for 2D IIR filters rely on 1D prototypes, using spectral transformations from the s to the z plane via bilinear or Euler approximations, with the aim of obtaining a 2D filter with a desired frequency response [2]. A convenient and widely used tool for 2D FIR filter design is the McClellan transform [3], [4]. Anisotropic filters have also been studied extensively and used in interesting applications, like remote sensing for directional smoothing applied to weather images, texture segmentation and pattern recognition [5], [6]. In particular, FIR or IIR filters with elliptical symmetry are very useful in image processing, and various design methods were developed in early papers like [7]-[9]. A fast space-variant filtering using a Gaussian elliptic window is proposed in [10]. Interesting applications of elliptically-shaped filters were found in the biometric field, like pose robust human detection [11], iris recognition [12], palmprint identification [13], and fingerprint enhancement [14]. Other analytical design methods for directional filters, in particular elliptically shaped ones, were proposed by the author in [15]-[18]. Stability of 2D filters and stabilization methods are important and rather difficult issues, studied in papers like [19], [20]. We approach here the design of a class of 2D filters having an elliptical shape in the frequency plane. The design method is mainly analytical and uses approximations, but no numerical optimization algorithms. It is based on 1D low-pass (LP) prototype filters and frequency mappings. Several design examples using the proposed method are given.

II. LOW-PASS FILTER PROTOTYPES

In designing the 2D elliptically-symmetric filter we use an efficient 1D IIR filter prototype. The most efficient digital or continuous-time filter approximation for a specified steepness or selectivity is the elliptic filter, resulting in a lower order than other approximations, like Butterworth or Chebyshev.

Let us consider a digital elliptic filter with the specifications: filter order N = 6, peak-to-peak ripple R_P = 0.1 dB, minimum stop-band attenuation R_S = 36 dB and normalized pass-band edge frequency ω_P = 0.5 (the value 1 corresponding to half the sample rate). These specifications lead to the following transfer function in the complex frequency variable z:

H_P(z) = \frac{0.1139 z^6 + 0.29246 z^5 + 0.53483 z^4 + 0.61902 z^3 + 0.53483 z^2 + 0.29246 z + 0.1139}{z^6 - 0.34378 z^5 + 1.7013 z^4 - 0.54092 z^3 + 0.82266 z^2 - 0.19568 z + 0.086938}   (1)

which has the factored expression (where k = 0.11395):

H_P(z) = 0.11395 \cdot \frac{z^2 + 1.6501747 z + 1}{z^2 + 0.057838 z + 0.916863} \cdot \frac{z^2 + 0.627297 z + 1}{z^2 - 0.072837 z + 0.63421} \cdot \frac{z^2 + 0.289093 z + 1}{z^2 - 0.32878 z + 0.14951} = k \cdot H_{B1}(z) \cdot H_{B2}(z) \cdot H_{B3}(z)   (2)

Therefore, the 1D elliptic filter transfer function in z is factored into three bi-quad functions H_{B1}(z), H_{B2}(z) and H_{B3}(z). However, the factors of the numerator and denominator in (2) can be coupled in pairs in several ways. The magnitude of the transfer function (1) is displayed for ω ∈ [0, π] in Fig. 1, which shows a steep transition and very small ripple in the pass-band and stop-band.

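The factored form (2) can be cross-checked against the direct-form coefficients of (1) by multiplying out the bi-quad sections (a NumPy sketch added for illustration; the middle-coefficient signs in the denominators follow our reconstruction of the scanned formulas):

```python
import numpy as np

k = 0.11395
num_biquads = [[1, 1.6501747, 1], [1, 0.627297, 1], [1, 0.289093, 1]]
den_biquads = [[1, 0.057838, 0.916863], [1, -0.072837, 0.63421], [1, -0.32878, 0.14951]]

# Expand the product of the three bi-quads of Eq. (2):
num = k * np.polymul(np.polymul(num_biquads[0], num_biquads[1]), num_biquads[2])
den = np.polymul(np.polymul(den_biquads[0], den_biquads[1]), den_biquads[2])

# Direct-form coefficients as given in Eq. (1):
num_ref = [0.1139, 0.29246, 0.53483, 0.61902, 0.53483, 0.29246, 0.1139]
den_ref = [1, -0.34378, 1.7013, -0.54092, 0.82266, -0.19568, 0.086938]
assert np.allclose(num, num_ref, atol=1e-3)
assert np.allclose(den, den_ref, atol=1e-3)

# DC gain stays within the 0.1 dB passband ripple:
assert abs(num.sum() / den.sum()) > 0.98
```

The expansion reproduces (1) to within the printed rounding, which also confirms the all-positive numerator coefficients expected from zeros placed on the unit circle in the stop-band.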
Fig. 1. Magnitude of the elliptic filter frequency response

III. DESIGN OF LOW-PASS ELLIPTICALLY-SHAPED IIR FILTERS

In this paragraph a 2D LP filter with elliptical symmetry is obtained, starting from a usual digital prototype with a transfer function in the variable z. This 2D filter will be specified by imposing the values of the ellipse semi-axes, and the orientation is given by the angle of the large axis with respect to the ω_2 axis. Starting from the frequency response of the 1D filter given by (2), we derive a 2D elliptically-shaped filter using the frequency mapping ω^2 → E(ω_1, ω_2), where [15]:

E(\omega_1, \omega_2) = \omega_1^2 \left( \frac{\cos^2\varphi}{E^2} + \frac{\sin^2\varphi}{F^2} \right) + \omega_2^2 \left( \frac{\sin^2\varphi}{E^2} + \frac{\cos^2\varphi}{F^2} \right) + \omega_1 \omega_2 \sin(2\varphi) \left( \frac{1}{F^2} - \frac{1}{E^2} \right) = a_0 \omega_1^2 + b_0 \omega_2^2 + c_0 \omega_1 \omega_2   (3)

The elliptically-shaped filter can be considered as derived from a circular filter through the linear transformation:

\begin{bmatrix} \omega_1' \\ \omega_2' \end{bmatrix} = \begin{bmatrix} E & 0 \\ 0 & F \end{bmatrix} \begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix} \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}   (4)

where usually we consider E ≥ F; in (4), (ω_1, ω_2) are the current coordinates and (ω_1', ω_2') are the former (rotated) coordinates. Thus, the unit circle is stretched along the axes ω_1 and ω_2 with the factors E and F, then rotated counter-clockwise by an angle φ, becoming an oriented ellipse. Therefore, given a 1D prototype filter, we can obtain a corresponding 2D filter with elliptical support, specified by the parameters E, F and φ which impose the orientation and shape, using the mapping ω^2 → E(ω_1, ω_2), also written:

\omega^2 \to E(\omega_1, \omega_2) = a_0 \omega_1^2 + b_0 \omega_2^2 + c_0 \omega_1 \omega_2   (5)

Using in (5) the identity ω_1 ω_2 = 0.5 ((ω_1 + ω_2)^2 − ω_1^2 − ω_2^2), we find another expression for E(ω_1, ω_2):

\omega^2 \to E(\omega_1, \omega_2) = a \omega_1^2 + b \omega_2^2 + c (\omega_1 + \omega_2)^2   (6)

where

a = a_0 - 0.5 c_0 = p + q \cos(2\varphi) + q \sin(2\varphi)
b = b_0 - 0.5 c_0 = p - q \cos(2\varphi) + q \sin(2\varphi)   (7)
c = 0.5 c_0 = -q \sin(2\varphi)

Here we have used the notations:

p = \frac{1}{2}\left(\frac{1}{E^2} + \frac{1}{F^2}\right), \quad q = \frac{1}{2}\left(\frac{1}{E^2} - \frac{1}{F^2}\right)   (8)

The next step in designing the 2D elliptically-shaped filter is to apply the frequency mapping (6) to the digital prototype H_P(z) from (2). Thus we substitute z = exp(jω) by exp(j√(E(ω_1, ω_2))). In order to derive more efficient 2D filters, we use a first-order rational approximation for exp(jω) on the frequency range ω ∈ [0, π]:

\exp(j\omega) \cong F(\omega) = \frac{0.664635 - 0.212793\,\omega + j\,(0.896434 + 0.251929\,\omega)}{1 + 0.209288\,\omega + j\,(0.515203 - 0.268335\,\omega)}   (9)

This approximation is accurate enough for our purpose, and it practically halves the filter order, while maintaining the correct frequency response shape with very low distortions. The real and imaginary parts and their approximations are plotted comparatively in Fig. 2. The approximation (9) can be made scalable on the frequency axis, i.e. substituting the current variable ω by k·ω (k > 0), it remains valid for a certain range of the scaling parameter k (which means stretching for k < 1 or shrinking for k > 1). Therefore, in order to obtain a parametric filter, the above approximation is written as:

\exp(j k \omega) \cong \frac{0.664635 - 0.212793\,k\omega + j\,(0.896434 + 0.25193\,k\omega)}{1 + 0.209288\,k\omega + j\,(0.515203 - 0.268335\,k\omega)}   (10)

A generic bi-quad function H_{Bi}(z) in the variable z has the form:

H_{Bi}(z) = \frac{z^2 + v_1 z + v_0}{z^2 + u_1 z + u_0}   (11)

To obtain a 2D filter with elliptical symmetry, we simply make the substitution (frequency mapping) ω^2 → E(ω_1, ω_2), and the following frequency transformation results:

z \leftarrow \exp(j\,k\,E(\omega_1, \omega_2)) \cong \frac{0.664635 - 0.212793\,k\,E(\omega_1,\omega_2) + j\,(0.896434 + 0.25193\,k\,E(\omega_1,\omega_2))}{1 + 0.209288\,k\,E(\omega_1,\omega_2) + j\,(0.515203 - 0.268335\,k\,E(\omega_1,\omega_2))}   (12)

where E(ω_1, ω_2) is in turn replaced by its expression (6). Using the Chebyshev-Padé method we get the following rational trigonometric approximation for the square function ω^2:

\omega^2 \cong 2.35753 \cdot \frac{1 - 0.946216 \cos\omega}{1 + 0.46301 \cos\omega}   (13)

displayed in Fig. 3. As can be noticed, this is a very efficient and accurate approximation on ω ∈ [−π, π], having a small distortion only at the margins of the specified frequency range. Writing the approximation (13) for the variables ω_1 and ω_2 respectively, the expressions of ω_1^2 and ω_2^2 are then replaced into the mapping (12). Next we will obtain a matrix form of this mapping, for the parameter value k = 1.
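The chain from the oriented ellipse (3)-(4) to the coefficients a, b, c of (6)-(7) can be verified numerically by building the quadratic-form matrix directly from the rotation and the semi-axes (an illustrative sketch under our reconstruction of the scanned formulas; the identity checked at the end is exact algebra, independent of the sign conventions lost in the scan):

```python
import numpy as np

E, F, phi = 0.4, 0.2, np.pi / 8  # semi-axes and orientation, as in the design examples

# Quadratic form of the oriented ellipse: M = R(phi) diag(1/E^2, 1/F^2) R(phi)^T
c_, s_ = np.cos(phi), np.sin(phi)
R = np.array([[c_, -s_], [s_, c_]])
M = R @ np.diag([1 / E**2, 1 / F**2]) @ R.T

# Coefficients of E(w1, w2) = a0*w1^2 + b0*w2^2 + c0*w1*w2, as in Eq. (5)
a0, b0, c0 = M[0, 0], M[1, 1], 2 * M[0, 1]

# Rewritten with the identity w1*w2 = 0.5*((w1 + w2)^2 - w1^2 - w2^2), Eqs. (6)-(7)
a, b, c = a0 - 0.5 * c0, b0 - 0.5 * c0, 0.5 * c0

w1, w2 = 0.7, -0.3  # an arbitrary frequency point
lhs = a * w1**2 + b * w2**2 + c * (w1 + w2) ** 2
rhs = np.array([w1, w2]) @ M @ np.array([w1, w2])
assert np.isclose(lhs, rhs)
```

The rewriting (6) matters because each remaining squared term can then be replaced by the trigonometric approximation (13), which is what makes the later matrix form possible.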
The 5  5 matrices A1 , A2 and A3 are centrally-symmetric and
their elements have the values   2.237977 ,   0.287569 ,
  0.989016 ,   0.258214 ,   0.059778 ,   0.126352 .
B1 has the diagonally symmetric form:
 0 0 0.0248 0.1072 0.0248 
 0 0.1072 0.5702 0.5702 0.1072 

B1   0.0248 0.5702 1.0248 0.5702 0.0248  (18)
 
0.1072 0.5702 0.5702 0.1072 0 
(a) (b)  0.0248 0 
 0.1072 0.0248 0
Fig. 2. (a) Plot of the function cos  and the real part of the approximation
The numerator B( z 1 , z 2 ) and denominator A( z 1 , z 2 ) in (14)
Re  F ()  ; (b) Plot of the function sin  and the imaginary part of the
approximation, Im  F ()  .
are in fact the Discrete Space Fourier Transforms of the
corresponding matrices B and A with complex elements. Next,
substituting z in (11) by the 1D to 2D mapping (14), we find
the factor corresponding to the bi-quad function H Bi ( z ) :
P ( z1 , z2 )
H Bi ( z )  H Bi ( z1 , z2 ) 
Q( z1 , z2 )
(19)
PR ( z1 , z2 )  j  PI ( z1 , z2 ) z1  P  zT2
 
QR ( z1 , z2 )  j  QI ( z1 , z2 ) z1  Q  zT2
Applying this to all three bi-quad factors of the prototype (2),
we finally get the factored transfer function in z 1 and z 2 for

Fig. 3. The parabolic function (in blue) and its first-order rational
the entire 2D elliptically shaped filter.
trigonometric approximation (in red) Using the proposed method, we can also obtain 2D filters
with circular symmetry as a particular case. Indeed, if in (3)
Using identities cos 1  0.5  ( z1  z11 ) , cos 2  0.5  ( z2  z21 ) we set equal semi-axes E  F  1 , the mapping takes the
expressed in the complex variables z1  e j1 , z2  e j2 , the simpler form corresponding to a circularly symmetric filter:
mapping (12) may be finally written in the matrix form:    12   22 (20)
B( z 1 , z 2 ) z1  B  zT2 In Fig. 5, the frequency response magnitudes and contour
z  HC ( z1, z 2 )   (14) plots of two circular filters are shown, for the indicated
A( z 1 , z 2 ) z1  A  zT2
parameters, corresponding to a wider and a narrower filter.
where  is inner product and where the vectors z1 and z 2 are: The stability of these filters was not approached here, but
z1  1 z1 z12 z13 z14  ; z 2  1 z2 z22 z23 z24  (15) will be studied in further work on this topic. Generally the
stability problem for 2D filters is much more difficult than for
The matrices A and B with complex elements corresponding 1D filters. If the prototype filter is stable, and if the frequency
to the numerator and denominator are given by: transformations preserve stability, the derived 2D filters
A  A R  j  A I and B  BR  j  BI , where should also be stable. Various stability criteria exist [19], and
A R  0.664635  B1  0.212793  k   a  A1  b  A 2  c  A 3  also for some unstable filters, stabilization methods can be
applied [20]. As regards previous work on this field, this type
A I  0.896434  B1  0.25193  k   a  A1  b  A 2  c  A 3  of analytic design in the frequency domain from digital
B R  B1  0.209288  k   a  A1  b  A 2  c  A 3  (16) prototypes has not been approached previously. The derived
filters are more efficient (steeper transition for given filter
B I  0.515203  B1  0.268335  k   a  A1  b  A 2  c  A 3  order) than other elliptically shaped filters, for instance
The matrices AR , AI , BR , B I from (16) are linear obtained from zero-phase prototypes.
combinations of 5  5 matrices A1 , A2 , A3 and B1 , where: IV. DESIGN EXAMPLES
0 0    0 0    Some design examples are presented for elliptically-
0      0      shaped filters specified by the values of scale parameter k, the
 
A1         ; A 2       semi-axes E and F, and orientation angle  . The frequency
    response magnitudes and contour plots given in Fig.4 show a
    0     0
   0     0 
relatively accurate elliptical shape in the frequency plane, a
 0  0
(17) maximally flat top and a small ripple in the stop band. Since
0 0    the applications of elliptically-shaped filters in image
0     
 processing are relatively well known, simulation results were
A 3       not included here. The aim of this paper was limited to
 
    0 presenting this analytic design method and to highlight its
   0  advantages over a completely numerical optimization method.
 0

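Evaluating the 2D response (14) at a frequency point reduces to two bilinear forms in the monomial vectors (15); a generic sketch of that evaluation (the 5 × 5 matrices here are random placeholders standing in for the paper's A and B, which depend on the design parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))  # placeholder denominator matrix
B = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))  # placeholder numerator matrix

def Hc(w1, w2, B, A):
    """Eq. (14): (z1 . B . z2^T) / (z1 . A . z2^T) at the frequency point (w1, w2)."""
    z1 = np.exp(1j * w1 * np.arange(5))  # [1, z1, z1^2, z1^3, z1^4], Eq. (15)
    z2 = np.exp(1j * w2 * np.arange(5))
    return (z1 @ B @ z2) / (z1 @ A @ z2)

# Sanity check: with B = A the response is identically 1 at any frequency point.
assert np.isclose(Hc(0.3, -0.5, A, A), 1.0)
```

Sweeping (w1, w2) over a grid with such an evaluator is how frequency-response surfaces like those in Fig. 4 and Fig. 5 can be rendered from the filter matrices.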
V. CONCLUSION

An efficient analytic design method was proposed for 2D IIR elliptically-shaped filters with adjustable orientation and bandwidth. The designed 2D filters are parametric, since the transfer function depends on parameters giving the orientation angle and bandwidth. For given parameter values, the filter matrices result directly. One obvious advantage of the method is its versatility; the design need not be resumed every time from the start for various given specifications. The designed 2D filters have an accurate elliptical shape with low distortions even close to the margins of the frequency plane. Further research envisages an efficient implementation of these filters and testing them on various real-life images.

Fig. 4. Frequency response magnitudes and contour plots of the elliptically-shaped filter for the parameters: (a), (b) p = 0.1, φ = π/8, E = 0.4, F = 0.2; (c), (d) p = 0.1, φ = π/8, E = 0.6, F = 0.2; (e), (f) p = 0.1, φ = π/6, E = 0.6, F = 0.14

Fig. 5. Frequency response magnitude of the circular filter and contour plot for the parameter values: (a), (b) k = 0.1; (c), (d) k = 0.2

REFERENCES

[1] W. Lu, A. Antoniou, Two-Dimensional Digital Filters. CRC Press, 1992.
[2] L. Harn, B. A. Shenoi, "Design of stable two-dimensional IIR filters using digital spectral transformations," IEEE Trans. Circuits Syst., vol. CAS-33, pp. 483-490, May 1986.
[3] N. Nagamuthu, M. N. S. Swamy, "Analytical methods for the design of 2D circularly symmetric digital filters using McClellan transformation," in Proc. IEEE ISCAS 1989, vol. 2, pp. 1095-1098, 8-11 May 1989, Portland.
[4] C.-K. Chen, J.-H. Lee, "McClellan transform based design techniques for two-dimensional linear-phase FIR filters," IEEE Trans. Circuits Syst. I, vol. 41, no. 8, pp. 505-517, August 1994.
[5] V. Lakshmanan, "A separable filter for directional smoothing," IEEE Geoscience and Remote Sensing Letters, vol. 1, pp. 192-195, July 2004.
[6] Y. M. Lam, B. E. Shi, "Recursive anisotropic 2-D Gaussian filtering based on a triple-axis decomposition," IEEE Trans. Image Processing, vol. 16, no. 7, pp. 1925-1930, July 2007.
[7] D. Nguyen, M. Swamy, "Approximation design of 2-D digital filters with elliptical magnitude response of arbitrary orientation," IEEE Trans. Circuits Syst., vol. 33, no. 6, pp. 597-603, Jun 1986.
[8] S. A. Jackson, N. Ahuja, "Elliptical Gaussian filters," in Proc. 13th Int. Conf. on Pattern Recognition, Vienna, 25-29 Aug 1996.
[9] C. L. Keerthi, V. Ramachandran, "Study of elliptical symmetry in 2D IIR Butterworth digital filters," in Proc. IEEE 47th Midwest Symposium on Circuits & Systems (MWSCAS 2004), 25-28 July 2004, vol. 2, pp. 77-80.
[10] K. N. Chaudhury, A. Munoz-Barrutia, M. Unser, "Fast space-variant elliptical filtering using box splines," IEEE Trans. Image Processing, vol. 19, no. 9, pp. 2290-2306, Sept. 2010.
[11] S. H. Cho, D. Kim, T. Kim, D. Kim, "Pose robust human detection using multiple oriented 2D elliptical filters," in Proc. 1st ACM Workshop on Vision Networks for Behavior Analysis, Vancouver, 2008, pp. 9-16.
[12] J. M. Abdul-Jabbar, Z. N. Abdulkader, "Iris recognition using 2-D elliptical-support wavelet filter bank," in Proc. 3rd Int. Conf. on Image Processing Theory, Tools and Applications (IPTA), Istanbul, 15-18 Oct. 2012.
[13] A. Kong, D. Zhang, M. Kamel, "Palmprint identification using feature-level fusion," Pattern Recognition, vol. 39, no. 3, pp. 478-487, March 2006.
[14] Y. Zhang, "Fingerprint image enhancement based on elliptical shape Gabor filter," in Proc. 6th IEEE Int. Conf. Intelligent Systems, Sofia, 2012.
[15] R. Matei, P. Ungureanu, "Image processing using elliptically-shaped filters," in Proc. IEEE ISSCS 2009, Iasi, Romania, vol. 2, pp. 337-340.
[16] R. Matei, D. Matei, "Orientation-selective 2D recursive filter design based on frequency transformations," in Proc. IEEE EUROCON 2009, St. Petersburg, Russia, May 2009, pp. 1320-1327.
[17] R. Matei, D. Matei, "Design and applications of 2D directional filters based on frequency transformations," in Proc. 18th European Signal Processing Conference (EUSIPCO 2010), Aalborg, Denmark, pp. 1695-1699.
[18] R. Matei, "Analytical design methods for directional Gaussian 2D FIR filters," Multidimensional Systems and Signal Processing (Springer), vol. 27, no. 4, October 2016.
[19] B. T. O'Connor, T. S. Huang, "Stability of general two-dimensional recursive digital filters," IEEE Trans. Acoustics, Speech & Signal Processing, vol. 26, no. 6, pp. 550-560, 1978.
[20] E. I. Jury, V. R. Kolavennu, B. D. O. Anderson, "Stabilization of certain two-dimensional recursive digital filters," Proc. of the IEEE, vol. 65, no. 6, pp. 887-892, 1977.


IIR Wavelet Filter Banks for ECG Signal Denoising


Yaprak Eminaga, Adem Coskun, and Izzet Kale
Applied DSP and VLSI Research Group
University of Westminster, London, W1W 6UW, United Kingdom
Email: y.eminaga@my.westminster.ac.uk, a.coskun@westminster.ac.uk, kalei@westminster.ac.uk

Abstract—ElectroCardioGram (ECG) signals are widely used for diagnostic purposes. However, it is well known that these recordings are usually corrupted with different types of noise/artifacts which might lead to misdiagnosis of the patient. This paper presents the design and novel use of Infinite Impulse Response (IIR) filter based Discrete Wavelet Transform (DWT) for ECG denoising that can be employed in ambulatory health monitoring applications. The proposed system is evaluated and compared in terms of denoising performance as well as computational complexity with the conventional Finite Impulse Response (FIR) based DWT systems. For this purpose, raw ECG data from the MIT-BIH arrhythmia database are contaminated with synthetic noise and denoised with the aforementioned filter banks. The results from 100 Monte Carlo simulations demonstrate that the proposed filter banks provide better denoising performance with fewer arithmetic operations than those reported in the open literature.

Keywords—ECG denoising, Discrete Wavelet Transform, FIR wavelets, IIR wavelets.

I. INTRODUCTION

ECG signals are usually contaminated with various types of noise whose spectra overlap with the signal spectrum, so that conventional filtering techniques are insufficient to remove this noise. The DWT is a popular tool in the field of non-stationary signal processing that provides simultaneous time and frequency information, and it has been used to detect such overlapping noise. In the ECG denoising literature, a vast amount of research has employed FIR filter banks with various wavelet families, the most popular ones being the Daubechies wavelets (such as Haar, db2, and db4), Symmlets, and Coiflets [1]-[3]. On the other hand, IIR wavelet filter bank studies are less extensive and limited to image processing and compression applications [4], [5]. This paper presents the design of IIR DWT filter banks and their novel application to ECG signal denoising. To the best of the authors' knowledge, this is a first in the open literature, and the results show that the proposed IIR DWT filter banks achieve a higher output Signal-to-Noise Ratio (SNR) and a lower Mean Square Error (MSE) with reduced arithmetic operation complexity compared to the conventional FIR wavelets.

Section II provides brief information regarding the theory of IIR wavelet design, followed by the details of the designed IIR wavelets. Section IV introduces the wavelet thresholding technique employed and the noisy test data generated. A comparative analysis of the noise suppression performance of the proposed IIR and FIR wavelets for different noise scenarios is presented in Section V. Finally, Section VI presents the conclusions.

II. IIR WAVELET ANALYSIS FILTER BANK PROPERTIES

The analysis part of a two-channel Perfect Reconstruction (PR) IIR filter bank can be realized with a halfband lowpass and a halfband highpass filter, denoted by H0(z) and H1(z), respectively. These filters are based on the parallel connection of two real allpass filters [6], [7], and the 1-level transform matrix for the analysis filter bank is given in (1):

$$\mathbf{H}(z) = \begin{bmatrix} H_0(z) \\ H_1(z) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} A_0(z^2) + z^{-1}A_1(z^2) \\ A_0(z^2) - z^{-1}A_1(z^2) \end{bmatrix}, \quad (1)$$

where A0(z) and A1(z) are Mth order allpass filters with the general transfer function

$$A(z) = z^{-M}\,\frac{\sum_{m=0}^{M}\alpha_m z^{m}}{\sum_{m=0}^{M}\alpha_m z^{-m}}. \quad (2)$$

As can be observed from (1), H0(z) and H1(z) are power complementary filters, since they satisfy the following property:

$$|H_0(z)|^2 + |H_1(z)|^2 = 1 \quad (3)$$

The scaling and wavelet functions associated with the aforementioned filters can be obtained by iterating the filter bank J times on its lowpass branch, as shown in (4). This results in transfer functions Phi(z) and Psi(z) with lowpass and bandpass spectra, whose impulse responses are the scaling (phi(n)) and wavelet (psi(n)) functions, respectively:

$$\Phi(z) = \prod_{j=0}^{J-1} H_0\!\left(z^{2^{j}}\right), \qquad \Psi(z) = H_1\!\left(z^{2^{J-1}}\right)\prod_{j=0}^{J-2} H_0\!\left(z^{2^{j}}\right) \quad (4)$$

It is well known that the regularity of a wavelet defines the smoothness of the wavelet function and has a crucial effect in noise reduction applications. It is directly related to the wavelet's vanishing moments, i.e. the number of times the wavelet spectrum vanishes (goes to zero) at omega = 0, i.e. $\Psi(e^{j\omega})\big|_{\omega=0} = 0$ where $z = e^{j\omega}$. Thus, the aforementioned H0(z) and H1(z) need to be designed with an additional flatness condition, as shown in (5) [8]:

$$\left.\frac{\partial^{k} H_1(e^{j\omega})}{\partial\omega^{k}}\right|_{\omega=0} = \left.\frac{\partial^{k} H_0(e^{j\omega})}{\partial\omega^{k}}\right|_{\omega=\pi} = 0 \quad (5)$$

for k = 0, 1, ..., K-1, where K corresponds to the number of zeros of H1(z) at z = 1 and of H0(z) at z = -1, i.e. the Nyquist frequency. This design procedure can be reduced to the design of H0(z) alone due to the power complementary property given in (3). For a given filter order, a trade-off between frequency resolution and wavelet regularity exists. Therefore, it is critical to identify the needs of the application and select the best possible frequency selectivity for a given flatness condition [6].
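As a quick numerical sanity check, the allpass-based structure in (1) and the power complementarity property (3) can be verified on a frequency grid. The sketch below (Python with NumPy, not part of the paper) uses a zeroth-order A0(z) = 1 and an arbitrary first-order allpass section for A1(z); the coefficient 0.5 is purely illustrative and is not one of the designed ilet filters:

```python
import numpy as np

def allpass_freq(alpha, w):
    # First-order allpass A(z) = (alpha + z^-1) / (1 + alpha*z^-1)
    # evaluated on the unit circle z = exp(jw); |A(e^{jw})| = 1 for all w.
    z1 = np.exp(-1j * w)
    return (alpha + z1) / (1 + alpha * z1)

w = np.linspace(0, np.pi, 1024)
alpha = 0.5  # illustrative allpass coefficient (assumption, not a designed value)

A0 = np.ones_like(w, dtype=complex)   # A0(z) = 1 (zeroth order allpass)
A1 = allpass_freq(alpha, 2 * w)       # A1(z^2) on the unit circle

H0 = 0.5 * (A0 + np.exp(-1j * w) * A1)  # halfband lowpass branch of eq. (1)
H1 = 0.5 * (A0 - np.exp(-1j * w) * A1)  # halfband highpass branch

# Power complementarity, eq. (3): |H0|^2 + |H1|^2 = 1 at every frequency,
# which holds for ANY allpass pair because |A0| = |A1| = 1.
assert np.allclose(np.abs(H0) ** 2 + np.abs(H1) ** 2, 1.0)
```

The lowpass/highpass behaviour also follows directly: at omega = 0 the two allpass branches add in phase (|H0| = 1), while at the Nyquist frequency they cancel (|H0| = 0).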
III. IIR WAVELET ANALYSIS FILTER BANK DESIGN

In this study, the IIR wavelet design methodology introduced by Zhang et al. [6] is adopted for implementing IIR wavelet filters with 3 and 5 vanishing moments, referred to as ilet3 and ilet5, respectively, in the rest of this document. Both wavelet filters are designed as maximally flat filters in order to achieve the maximum number of zeros at the Nyquist frequency, leading to the maximum possible smoothness of the scaling and wavelet functions. The numbers of vanishing moments are selected in order to closely match the most commonly used wavelet basis functions in ECG denoising applications, including db2 and db4 with 2 and 4 vanishing moments, respectively. Recalling (1), H0(z) can be re-written as

$$H_0(z) = \frac{1}{2}A_0(z^2)\left[1 + z^{-1}U(z^2)\right] \quad (6)$$

where U(z) is an allpass filter with the general transfer function given in (2). For the ilet5 wavelet, U(z) is chosen to be a second order filter with real coefficients a2, a1, and a0 = 1. Thus, for M = 2, its transfer function is expressed as

$$U(z^2) = \frac{A_1(z^2)}{A_0(z^2)} = z^{-4}\,\frac{a_0 + a_1 z^{2} + a_2 z^{4}}{a_0 + a_1 z^{-2} + a_2 z^{-4}} \quad (7)$$

The frequency response of H0(z) is calculated by evaluating (6) on the unit circle, and the magnitude response is given by

$$\left|H_0(e^{j\omega})\right| = \cos\frac{\theta(\omega)}{2} \quad (8)$$

where theta(omega) is the phase response of $z^{-1}U(z^2)$. Therefore, for ilet5, theta(omega) and $|H_0(e^{j\omega})|$ are respectively computed as

$$\theta(\omega) = -2\tan^{-1}\left[\frac{\sin\frac{5\omega}{2} + a_1\sin\frac{\omega}{2} - a_2\sin\frac{3\omega}{2}}{\cos\frac{5\omega}{2} + a_1\cos\frac{\omega}{2} + a_2\cos\frac{3\omega}{2}}\right] \quad (9)$$

$$\left|H_0(e^{j\omega})\right| = \frac{\cos\frac{5\omega}{2} + a_1\cos\frac{\omega}{2} + a_2\cos\frac{3\omega}{2}}{\sqrt{3 + 2a_1(1+a_2)\cos(2\omega) + 2a_2\cos(4\omega)}} \quad (10)$$

As mentioned before, the smoothness of the wavelet function is determined by the number of zeros at the Nyquist frequency, which is computed by substituting the numerator of (10) into (5). Then, the filter coefficients a1 = 10 and a2 = 5 are calculated by solving the linear equations obtained. The same steps are applied for ilet3 with M = 1 and K = 3, which results in a1 = 3. Following (7), the poles of U(z) that lie inside the unit circle correspond to the poles of A1(z), and the poles outside the unit circle correspond to the zeros of A0(z). By assigning the poles correctly, two stable allpass filters are obtained. The magnitude responses and pole-zero locations of H0(z) and H1(z), and the corresponding impulse responses phi(n) and psi(n) for ilet3 and ilet5, are presented in Fig. 1 and Fig. 2, respectively.

Fig. 1. ilet3: (a) magnitude response and (b) pole-zero locations; ilet5: (c) magnitude response and (d) pole-zero locations.

Fig. 2. Scaling (phi(n)) and wavelet (psi(n)) functions of (a) ilet3 and (b) ilet5 after 8 iterations.

IV. METHOD

There are various types of noise, such as powerline interference, baseline wander, and muscle contraction artifacts, that are assumed to be additive and independent of the ECG signal, which is generally modelled as x_n(n) = x_c(n) + e(n), where x_n(n), x_c(n), and e(n) are the noisy ECG, clean ECG, and composite noise, respectively. Although powerline interference can be eliminated by a digital notch filter, the spectra of other noise sources overlap with the spectrum of the ECG signal. In such circumstances, wavelet thresholding can be employed, where the noisy signal is decomposed into several levels, denoised, and reconstructed [9]. For this study, the noisy ECG signal is decomposed into 7 levels and each of the detail coefficients (i.e. the outputs of H1(z) at each level) is thresholded using the soft thresholding method, in

which the threshold is computed using the Rigorous SURE
(Stein’s Unbiased Risk Estimator) criterion [10]. The baseline
wander is removed by nullifying the finest level approximation
coefficients (i.e. H0 (z) output at level 7) and the denoised
signal is reconstructed from the thresholded detail coefficients.
The thresholding method and threshold criterion were determined empirically: soft thresholding is well known for delivering smoother outputs, and the Rigorous SURE threshold selection scheme is known for successfully identifying the small details of a signal overlapped with noise. A good comparison of
different threshold selection and thresholding methods can be
found in [11].
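As a minimal illustration of the soft-thresholding step applied to the detail coefficients (the level-dependent Rigorous SURE threshold computation is omitted here; the fixed threshold value is purely illustrative):

```python
import numpy as np

def soft_threshold(d, t):
    # Soft thresholding: zero every coefficient with magnitude below t,
    # and shrink the remaining ones toward zero by t.
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

# Toy detail coefficients: a few large "signal" values embedded in small noise
rng = np.random.default_rng(0)
d = rng.normal(0.0, 0.1, 256)
d[[10, 50, 90]] = [2.0, -1.5, 1.0]

t = 0.3  # illustrative fixed threshold; the paper selects it per level via SURE
dt = soft_threshold(d, t)

assert np.all(dt[np.abs(d) <= t] == 0.0)  # small coefficients are removed
assert np.isclose(dt[10], 2.0 - t)        # large coefficients are shrunk by t
```

Soft thresholding (as opposed to hard thresholding) avoids the discontinuity at the threshold, which is why it tends to produce the smoother reconstructions mentioned above.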

A. Generated ECG Data and Synthetic Noise Sources

Four raw ECG records ('103', '105', '109', and '118') are randomly taken from the MIT-BIH arrhythmia database and resampled to 256 Hz. In order to obtain clean control data, preprocessing stages are applied, including notch and highpass filtering (cut-off frequency f_c = 0.5 Hz), to remove the 60 Hz powerline interference and the baseline wander, respectively. Then, the ElectroMyogram (EMG) interference (x_e(n)) is modelled as white Gaussian noise, whereas the baseline wander is modelled as an additive combination of deterministic and random data with frequency content below 1 Hz, as shown in (11):

$$x_{bw}(n) = \sum_{i=1}^{P}\sin\left(2\pi n\,\frac{f_i}{f_s}\right) + W(n) \quad (11)$$

where 0 < f_i <= 1 for i = 1, 2, ..., P, f_s is the sampling frequency, and W(n) is lowpass filtered (f_c = 1 Hz) white Gaussian noise. Thus, the composite noise is obtained as e(n) = A(x_e(n) + x_bw(n)), where A is the input noise scaling factor that is determined by the desired input SNR and is computed via

$$A = \sqrt{\frac{\sum_{n=1}^{N}|x_c(n)|^2}{\sum_{n=1}^{N}|e(n)|^2}\;10^{-\mathrm{SNR}/10}} \quad (12)$$

B. Quantitative Evaluation

The ECG signal denoising performance of ilet3 and ilet5 as well as the Haar, db2, db4, sym4, and coif2 wavelet filter banks is evaluated and compared by computing the SNR improvement and the MSE, which are obtained from

$$\mathrm{SNR}_{imp} = \frac{\sum_{n=1}^{N}|x_n(n) - x_c(n)|^2}{\sum_{n=1}^{N}|x_d(n) - x_c(n)|^2}, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}|x_d(n) - x_c(n)|^2 \quad (13)$$

where x_d(n) is the denoised ECG signal.

Fig. 3. Top to bottom: 10 seconds of clean record '103', EMG noise, baseline wander, noisy ECG, and denoised ECG.

V. RESULTS AND DISCUSSIONS

Four records ('103', '105', '109', and '118') are contaminated by adding the synthetically generated EMG and baseline wander with SNR ranging from -12 to 20 dB. Fig. 3 presents (top to bottom) the clean ECG record '103', the generated synthetic EMG and baseline wander, the noisy ECG contaminated with composite noise at SNR = 4 dB, and finally the ECG denoised with the ilet5 wavelet filter bank. For each data record and at each SNR, 100 Monte Carlo simulations are performed and the average SNR and MSE are computed. Results for the noisy record '103' are shown in Fig. 4. As can be observed, the ilet5 wavelet filter bank provides the highest SNR improvement and the lowest MSE when compared to the others, while ilet3 provides the second best results. In Table I, the average SNR improvement (in dB) figures obtained for the ilet3, ilet5, db4, and Haar wavelets are also presented for the four noisy ECG records. As expected, ilet5 provides the best results compared to the FIR wavelets, followed by ilet3, under both high and relatively low noise power. This is due to the better frequency selectivity achieved with the IIR wavelets despite having fewer vanishing moments, i.e. in the ilet3 case. Although the Haar wavelet is the simplest FIR wavelet filter, which makes it desirable for power limited applications, it achieves the lowest denoising performance. On the other hand, the ilet3 filter, with one distinct coefficient, provides better denoising performance, making it a favourable choice amongst the others. For applications where the denoising performance is critical and the power consumption can be compromised, the ilet5 filter can be employed, which uses only two distinct coefficients. In addition, Table II presents the average MSE results, where the ilet5 and ilet3 wavelets provide the two minimum MSE results. This is an indication of relatively smaller signal distortion after denoising, which is a significant factor for diagnostic applications. In terms of computational complexity, except for the Haar filter, the rest of the FIR filters employ 4, 8, and 12 rational coefficients. Thus, based on the selected filter structure, the arithmetic and storage complexity

Fig. 4. Average (a) SNR Improvement (dB) and (b) MSE after wavelet denoising with ilet3, ilet5, db2, db4, sym4, Haar and coif2 (record '103' + composite noise, plotted against the input SNR in dB).
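The noise scaling of (12) and the SNR improvement metric of (13) can be sketched as follows (Python with NumPy; a synthetic sinusoid stands in for a clean ECG record, and the ratio in (13) is converted to dB, which is how the tables report it):

```python
import numpy as np

def noise_scale(x_clean, e, snr_db):
    # Input-noise scaling factor A from eq. (12): scales the composite
    # noise e(n) so that x_clean + A*e has the requested input SNR (dB).
    return np.sqrt(np.sum(x_clean ** 2) / np.sum(e ** 2) * 10.0 ** (-snr_db / 10.0))

def snr_improvement_db(x_clean, x_noisy, x_denoised):
    # Eq. (13): input error energy over residual error energy, in dB.
    num = np.sum((x_noisy - x_clean) ** 2)
    den = np.sum((x_denoised - x_clean) ** 2)
    return 10.0 * np.log10(num / den)

rng = np.random.default_rng(1)
n = np.arange(2048)
x = np.sin(2 * np.pi * 7 * n / 256.0)   # stand-in for a clean ECG record
e = rng.normal(0.0, 1.0, n.size)        # stand-in for the composite noise

A = noise_scale(x, e, snr_db=4.0)
x_noisy = x + A * e

# Verify that the realized input SNR matches the requested 4 dB
snr_in = 10.0 * np.log10(np.sum(x ** 2) / np.sum((A * e) ** 2))
assert abs(snr_in - 4.0) < 1e-9
```

A denoiser that halves the residual noise amplitude, for example, yields an SNR improvement of 10*log10(4), i.e. about 6 dB, by this metric.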

TABLE I
SNR IMPROVEMENT (dB) AFTER WAVELET DENOISING

Record   Input SNR = -12 dB             Input SNR = 4 dB
         ilet3  ilet5  db4    Haar      ilet3  ilet5  db4   Haar
'103'    13.07  13.92  12.18  9.49      9.13   9.4    8.72  5.55
'105'    13.71  14.58  12.72  9.67      9.21   9.90   8.72  5.32
'109'    13.92  14.92  12.98  9.68      10.18  11.38  9.20  5.27
'118'    13.68  14.42  12.56  9.61      8.24   9.34   6.93  4.72

TABLE II
MSE AFTER WAVELET DENOISING

Record   Input SNR = -12 dB             Input SNR = 4 dB
         ilet3  ilet5  db4    Haar      ilet3   ilet5   db4     Haar
'103'    0.07   0.06   0.09   0.17      0.0046  0.0044  0.005   0.011
'105'    0.06   0.05   0.07   0.16      0.0045  0.0038  0.0049  0.011
'109'    0.12   0.098  0.15   0.17      0.0074  0.0056  0.0086  0.022
'118'    0.11   0.096  0.14   0.29      0.0101  0.0078  0.0129  0.022

will always be higher for the FIR wavelets in comparison to the IIR wavelets. Also, it is a well known fact that for fixed-point implementations FIR filters are more sensitive to coefficient quantization and require higher word-lengths compared to allpass based halfband polyphase IIR filters, further increasing the system complexity.

VI. CONCLUSIONS

In this paper, the novel use of IIR wavelet filter banks for ECG signal denoising is presented. For this purpose, two maximally flat and stable IIR wavelet filters, ilet3 and ilet5, are designed. Both filters are computationally efficient: ilet3 and ilet5 employ one and two distinct coefficients, respectively, which can be implemented with simple shift and add operations without using costly multipliers. A comparative analysis of ECG signal denoising based on the aforementioned IIR wavelet filters and state-of-the-art FIR wavelet filters is carried out. The denoising performance of all filter banks is evaluated through the generation of synthetic noisy signals and compared by means of the SNR improvement and the output MSE. The results obtained demonstrate that the IIR wavelets achieve the best ECG denoising performance with the least signal distortion amongst the others, with fewer arithmetic operations. This study demonstrates that IIR wavelets can be included in more sophisticated denoising applications in portable devices due to their better frequency selectivity with fewer arithmetic operations.

ACKNOWLEDGMENT

The authors wish to thank the University of Westminster Faculty of Science and Technology for the PhD Studentship.

REFERENCES

[1] Y. Eminaga, A. Coskun, and I. Kale, "Multiplier Free Implementation of 8-tap Daubechies Wavelet Filters for Biomedical Applications," in 2017 New Generation of CAS (NGCAS). IEEE, 2017, pp. 129-132.
[2] S. Nagai, D. Anzai, and J. Wang, "Motion artefact removals for wearable ECG using stationary wavelet transform," Healthcare Technology Letters, vol. 4, no. 4, p. 138, 2017.
[3] P. Shemi and E. Shareena, "Analysis of ECG signal denoising using discrete wavelet transform," in Engineering and Technology (ICETECH), 2016 IEEE International Conference on. IEEE, 2016, pp. 713-718.
[4] X. Zhang, W. Wang, T. Yoshikawa, and Y. Takei, "Design of IIR orthogonal wavelet filter banks using lifting scheme," IEEE Transactions on Signal Processing, vol. 54, no. 7, pp. 2616-2624, 2006.
[5] J. M. Abdul-Jabbar and R. W. Hmad, "Allpass-based design, multiplierless realization and implementation of IIR wavelet filter banks with approximate linear phase," in Innovation in Information & Communication Technology (ISIICT), 2011 Fourth International Symposium on. IEEE, 2011, pp. 118-123.
[6] X. Zhang and T. Yoshikawa, "Design of orthonormal IIR wavelet filter banks using allpass filters," Signal Processing, vol. 78, no. 1, pp. 91-100, 1999.
[7] S. Damjanovic and L. Milic, "Examples of orthonormal wavelet transforms implemented with IIR filter pairs," Proc. SMMSP 2005, Riga, Latvia, pp. 19-27, 2005.
[8] C. Herley and M. Vetterli, "Wavelets and recursive filter banks," IEEE Transactions on Signal Processing, vol. 41, no. 8, pp. 2536-2556, 1993.
[9] J. Gao, H. Sultan, J. Hu, and W.-W. Tung, "Denoising nonlinear time series by adaptive filtering and wavelet shrinkage: a comparison," IEEE Signal Processing Letters, vol. 17, no. 3, pp. 237-240, 2010.
[10] D. L. Donoho and I. M. Johnstone, "Adapting to unknown smoothness via wavelet shrinkage," Journal of the American Statistical Association, vol. 90, no. 432, pp. 1200-1224, 1995.
[11] S. R. Messer, J. Agzarian, and D. Abbott, "Optimal wavelet denoising for phonocardiograms," Microelectronics Journal, vol. 32, no. 12, pp. 931-941, 2001.


Programmable, switched-capacitor finite impulse response filter realized in CMOS technology for education purposes
Paweł Pawłowski∗, Adam Pawlikowski∗, Rafał Długosz†,‡, Adam Dąbrowski∗
∗ Poznan University of Technology

Faculty of Computing, Division of Signal Processing and Electronic Systems


Piotrowo 3a, 60-965 Poznań, Poland
e-mail: pawel.pawlowski@put.poznan.pl
† UTP University of Science and Technology

Faculty of Telecommunication, Computer Science and Electrical Engineering


Kaliskiego 7, 85-796, Bydgoszcz, Poland
e-mail: rafal.dlugosz@gmail.com
‡ Aptive Poland S.A.

Podgórki Tynieckie 2, 30-399 Kraków, Poland

Abstract—The paper reports comprehensive laboratory tests of a mixed analog-digital, application specific integrated circuit (ASIC). The realized chip is a programmable device. It contains such components as an operational amplifier, a sample-and-hold (S&H) element, a programmable array of capacitors, a multiphase clock generator and a programmable switched capacitor (SC) delay line. All these blocks may be used separately or may be coupled together into a finite impulse response (FIR) filter with a reconfigurable frequency response. Since the filter coefficients may be either positive or negative, both lowpass and highpass frequency responses may be obtained. The chip has been designed in the AMS CMOS 0.35 µm technology and occupies an area of 0.5 mm². It was designed for educational purposes. Programming and testing of the chip were performed with a computer-controlled interface prepared in the National Instruments LabVIEW environment. The presented solutions allow for conducting various laboratory exercises.

Index Terms—Switched capacitor circuits, Programmable circuits, CMOS technology, Education

I. INTRODUCTION

Analog discrete time filters can in some cases be an alternative to similar digital solutions. In a typical real-signal processing scheme, the analog signal, after rough anti-aliasing filtering, is converted to a digital form. Then it is subjected to more accurate processing, using fully digital blocks.

The use of more selective filtering on the analog side can, in certain situations, simplify the structure and requirements of the analog-to-digital converter (ADC), which is the next component in the signal processing chain. Discrete time filters of this type, such as finite impulse response (FIR) and infinite impulse response (IIR) filters, can be implemented in either the voltage or the current technique. In this work, we present a solution based on the first one – the switched capacitor (SC) technique [3].

Typical FIR and IIR filters are composed of a delay line or delay lines, blocks that multiply the signal samples stored in the delay lines by the filter coefficients, and a summing block. In the analog approach, all these operations have to be performed on analog signals. In SC FIR filters, signal samples are stored as voltages on capacitors. Multiplication is carried out through appropriate ratios between capacitances. The summation is realized using a capacitive summing system composed of an operational amplifier and capacitors.

The SC FIR filter designed and presented in this work is an example of such solutions. This project is a continuation of the authors' previous works in this area. In previous approaches, however, we did not use programmable solutions. This resulted from the application of the realized projects in a GSM base station, in which proper attenuation in the stop band of the frequency response was one of the key features [4]. While introducing programmable solutions, additional configuration switches have to be used to enable connecting or disconnecting particular blocks. Such switches introduce inaccuracies, which are particularly visible in the case of analog circuits. For example, the capacitance of the capacitors used in the switches may affect the capacitance of the coefficient capacitors (CCs), thus modifying the frequency response of the filter.

II. IMPLEMENTATION ISSUES OF FIR FILTERS

FIR filters are very commonly used in various areas of technology. The main elements of these filters are the delay line, which stores the consecutive samples of the input signal, the block of filter coefficients, and the summing component. Signal samples from the delay line are multiplied by the coefficients, which play the role of weights. At the next stage, the results of the multiplication operations are summed and the result of this operation becomes the output of the filter.

Depending on the application, the implementation of these filters may be different. Most commonly, filters of this type are realized in software. This is due to the simplicity

Fig. 1. Even-Odd SC FIR filter: (a) a general structure; (b, c) even and odd delay elements, respectively; (d) coefficient capacitors (an example case for even delay elements).

Fig. 2. Structure of the realized programmable filter. Main blocks include: (A) Even-Odd delay line, (B) multiphase clock generator, (C) programmable filter coefficients, (D) programmable capacitor, (E) operational amplifier, (F) I/O and control block, (G) address decoder, (H) memory block.

of such implementations and the ease of modifying the filter structure. It also enables the implementation of a filter with any frequency response, as the coefficients may be unambiguously realized with any precision. In this case, however, the problem is the serial execution of particular operations, such as multiplications, summation of signal samples, and rewriting of samples in the delay line. This is visible, in particular, in high order filters that contain several dozen or several hundred coefficients.

An alternative solution may be a hardware implementation of the FIR filters (and likewise for IIR filters) [1][3][19]. With an appropriate approach, the operations described above may be performed in parallel. Such possibilities are offered, for example, by filters implemented in field programmable gate arrays (FPGAs) or realized as a specialized chip (ASIC – application specific integrated circuit). An example of the latter are the analog SC FIR filters which are the subject of this work. Parallel signal processing takes place here at all data processing stages, although the details depend on the hardware architecture of the filter used. The delay line may be constructed in such a way that all sampling operations between the delay elements are performed in parallel. After the samples are rewritten, they are multiplied by the filter coefficients, which is also performed in parallel for all signal samples. Finally, the summing operation on the multiplication products is also performed in parallel.

SC FIR filters can be implemented in various ways [10][11][19]. Since they are analog solutions, the main problems in this case are the distortions resulting from copying samples between the capacitors in the delay line and between this line and the block of the filter coefficients. Rewriting errors can strongly affect the frequency response of the filter. Discrepancies are usually visible in the stopband, which may suffer from reduced attenuation in comparison with theoretical values. Another problem is the linear dependence between the capacitance values in the multiplier block and the values of the filter coefficients. With more selective filters, due to the large spread between coefficients, this may lead to very large capacitors. This impacts the chip area and the data processing rate. With an appropriate approach, the described problems may be strongly limited. One of the possibilities is to decompose the filter transmittance into several sections connected in series. In this case the spread between coefficients is strongly reduced. We applied this solution successfully in our previous projects.

III. AN OVERVIEW OF THE FILTER STRUCTURE

The family of the SC FIR filters offers various structures. The main issue is the number of the operations of rewriting of the signal samples, which impacts the frequency response of the filter. In some approaches (e.g. the Gillingham filter), the number of these operations equals the filter order. However, such structures require a relatively simple controlling clock [4]. There are also solutions in which the number of the rewriting operations does not exceed three, regardless of the filter order

(e.g. circular memory and rotator filters). In this approach, however, the control clock is much more complex, as the number of clock phases linearly depends on the order of the filter [7]. The development of microelectronics and the consequent miniaturization, however, mean that the implementation of even complex clock generators is not a big problem at present [18].

In this work we present a transistor level implementation of an even-odd SC FIR filter, which is a compromise solution between the two mentioned features. The delay line in such filters allows for reducing the number of rewriting operations of the samples, while the clock generator still features a relatively simple structure. The diagram of the filter and its particular components is shown in Fig. 1.

We have implemented even-odd SC FIR filters in several CMOS technologies [4][5][6][8]. The first filter of this type was implemented in the CMOS 0.8 µm technology, for an application in GSM telephony [4]. In this case the filter coefficients had fixed values, as the main objective was to obtain a good match with the theoretical frequency response.

In this work we present another filter from this family. The structure of the designed chip, realized in the CMOS 0.35 µm technology, is shown in Figure 2. The silicon area equals 0.5 mm² (700×700 µm). The main components, shown in Figure 2, are as follows [5]:

A. Delay line composed of two (even and odd) delay elements. The delay components may be set up to store two or three samples, so that a 4th or 6th order FIR filter may be obtained. This approach increases the educational possibilities of the chip.
B. Multiphase, reconfigurable controlling clock generator, with the number of clock phases dependent on the configuration of the delay elements (4 or 6 phases).
C. Programmable matrix of the coefficient capacitors (CCs) and configuration switches (transmission gates). The capacitors are composed of the so-called unit capacitors (UCs). Each CC contains three sections with 1, 2 and 4 UCs, respectively. The configuration switches allow the CCs to be used as positive or negative filter coefficients. As a result, the CCs offer values of -7, -6, ..., 0, 1, 2, ..., 7 UCs. The described features allow the frequency response of the filter to be programmed.
D. Capacitor used in the feedback loop of the operational amplifier (OA) in the summing block.
E. Two stage OA, used in the summing block.
F. Control block used to configure the chip, so that different tests may be performed.
G. Address decoder.
H. Memory block that stores the configuration settings of the chip.

The chip is controlled and tested via 15 external pins divided into three groups: (i) digital, control and programming signals (8 pins), (ii) analog I/O pins used to test the filter performance (3 pins), (iii) power supply lines (4 pins).

IV. MEASUREMENT SETUP

The chip offers a wide spectrum of tests and experiments due to the possibility of programming it in many ways. The configuration of the chip is stored in the internal RAM memory, so the programming should be performed after each power start-up, using a special programming sequence. The programming sequence, besides configuring the chip and setting the values of the filter coefficients and capacitors, allows for testing the entire filter and also the individual components of the chip. The programming sequence is presented in Table I and the working modes of the chip in Table II [5]. It is worth noticing that the addresses in the programming sequence change in the Gray code (only one bit is changed in one programming step). Therefore the programming is self-clocked (it does not need any additional clock signal).

TABLE I
CHIP PROGRAMMING SEQUENCE

Address  State
0000     Neutral state
1000     Mode of the chip
1001     Mode of the filter (delay elements order R = 1 / R = 2)
0001     Filter coefficient: 1
0011     Filter coefficient: 3
0010     Filter coefficient: 2
0110     Filter coefficient: 6
0100     Filter coefficient: 4
0101     Filter coefficient: 5
0111     Filter coefficient: 7
1110     Output capacitor (after this MSB b5 = 1)
1110/    Output capacitor MSB b5 = 0
1101     Output capacitor MSB b5 = 1
1100     External clock mode

TABLE II
CONFIGURATION MODES OF THE CHIP

Mode  Description
1     All internal blocks are connected in the SC FIR filter
2     Measurement of the delay line
3     Measurement of the transmission gate
4     Measurement of the output capacitor
5     Measurement of the S&H delay element
6     Measurement of the OA
7     Measurement of the clock phases: n1, p1, n2
8     Measurement of the clock phases: n1, n2, n3
9     Measurement of the clock phases: p1, p2, p3
10    Measurement of the clock phases: n12, n23, n34
11    Measurement of the clock phases: n135, n246, p246
12    Measurement of the clock phases: p12, p23, p34

The chip, to work correctly, needs both digital and analog signals. The analog signals are the input and output signals, e.g. of the filter, so the analog interface must produce the input signals and measure the output signals. The digital interface must program the filter and produce the clock signals. For the internal clock mode the chip needs two clock signals, for the external clock mode 4 or 6 (depending on the filter working mode). Because the programming is performed only once after the power-up, it may be performed at a relatively low speed. The clock signals, on the contrary, are much faster and should be delivered all

Fig. 3. Interface of the LabVIEW program used to test the designed chip.

the time and be very accurate. This is a common property of the SC structures: the quality of signal processing strongly depends on the accuracy of the clock signals. Additionally, it is possible to tune the SC filters by changing the frequency of the clock signals.

The interface for controlling the chip was prepared with the use of two platforms: National Instruments CompactDAQ with the LabVIEW environment and the Altium NanoBoard NB2 FPGA evaluation board.

The software prepared in the NI LabVIEW 2015 environment [16] (presented in Fig. 3) offers programming of the chip (Fig. 3, frames 2 and 3), generation of the chip input signals (Fig. 3, frame 4), gathering of the chip output signals, visualization (virtual oscilloscopes), and automatic measurement of all processed signals (Fig. 3, frames 1 and 5).

The analog signals are produced via the NI9263 digital-to-analog converter (DAC) module. This module has 4 channels, 16-bit resolution, a 100 kS/s/ch sampling rate, and a ±10 V voltage range [14]. Using the software, the user can modify the amplitude and frequency of the generated sinusoidal source signals (Fig. 3, frame 4). Two of the analog output channels are used to control the powering of the chip, where the user can digitally set the DC voltages (Fig. 3, frame 1). Actually, the chip is powered via the DAC-controlled L2722 1 A output current power amplifier from STMicroelectronics [20]. Beside the controlling of the supply voltages, the software measures

digital level translators [9]. Because the powering voltage can vary, the digital signals must be controlled to be equal to or lower than the powering voltage. All NI cards are placed in the CompactDAQ chassis NI cDAQ-9172, which is connected to the PC through a USB port [15].

Multiphase clock signals are precisely generated by the second platform, namely the Altium NanoBoard NB2 with a Xilinx Spartan3 FPGA chip [2]. To produce properly synchronized complementary clock signals we used an FPGA-based programmable divider, which is clocked by the board source clock with a controlled frequency from 6 MHz up to 200 MHz [2]. The divider produces clock signals, which are connected to the tested chip. These signals have an exact 50% pulse width with a frequency range from 300 Hz to 10 kHz. In fact, the chip can work with clock frequencies up to several MHz, but according to the sampling rates of the analog interface part, signals with frequencies higher than 50 kHz may not be correctly sampled (in the SC circuits the analog signals are processed with sampling clock frequencies and this chopping process is visible on the output analog signals, c.f. Fig. 4 (a), (b) and Fig. 5 (a), (b)).

V. MEASUREMENT RESULTS

In this section we present selected results of the laboratory tests of the chip [17]. We verified the correctness of the operation of particular filter components, the programming abilities and the behavior of the filter for different transfer
the supply current and calculates the supply power. The output functions.
signals, supply voltages and currents are acquired by NI9215 Fig. 4 demonstrates performance of the filter in the lowpass
analog-to-digital converter (ADC). The NI9215 is 4-channel mode, for the sampling frequency of 1 kHz. The consecutive
converter with 100kS/s/ch sampling rate and ±10V output filter coefficients were set to 0; 1; 3; 7; 3; 1; 0. Diagrams
voltage range [13]. (a) and (b) present time domain results for the signal in
The signals for programming of the chip are generated in bandpass (20 Hz) and stoband (450 Hz), respectively. The
the LabVIEW software and interfaced to the chip via NI9401 overall frequency response of the low-pass filter is illustrated
digital 8-channel I/O TTL-level card [12] and two 74LVX125 in diagram (c), for two modes of the filter operation (c.f. Table

137
(a) (a)

(b) (b)

(c)
Fig. 4. Performance of the filter in the lowpass mode: (a, b) in time domain
for passband and stopband, respectively, (c) frequency response.
(c)
Fig. 5. Performance of the filter in the bandpass mode: (a, b) in time domain
for passband and stopband, respectively, (c) frequency response.
I Mode of the filter). In the first mode the filter uses delay
elements with order R = 1 (it has more, but less complicated
delay elements). In the second mode the filter uses delay
elements with order R = 2 (it has less, but more complicated
delay elements). Please notice that the filter order is just 6, so
the length of the passband is relatively high and the attenuation
is not higher than 25 dB. The frequency response is in average
symmetric due to the digital nature of SC FIR filter structure. (a) (b)
In both cases the bandwidth is in the frequency range 10–
250 Hz. The attenuation in the stopband equals 25 dB and
20 dB in the 1st and 2nd mode, respectively. For the signal
frequencies above 500 Hz, a typical reflection of the frequency
response is visible.
Similar results for the bandpass mode are shown in Fig. 5. (c) (d)
In this filter, the consecutive filter coefficients were set to 3;
5; 7; 7; 5; 3. Fig. 6. Clock signals.
Fig. 6 illustrates selected measurement results of the con-
trolling clock. One of the main issues during the design of
this block was to obtain a very precise crossing point of just in the middle of voltage range. As it can be seen in
two adjacent clock phases. Crossing near to zero voltage is Fig. 6, the crossing is just in the middle, regardless the clock
essential from the point of view of filter behavior. The clock frequency. This ensures the best performance of the filter.
generator works properly. The results on particular diagrams
VI. C ONCLUSIONS
of Fig. 6 are shown for different sampling frequencies. The
most important for the SC circuits is a synchronization of The experiments shown that the filter programming proce-
complementary phases: the falling and rising edges must cross dure and filter modes work correctly. The chip can be used

for laboratory experiments in various university electronics courses, starting with the basic course (investigation and measurement of basic elements and building blocks) and finishing with advanced courses devoted to the design of mixed analog-digital ASICs, to the realization and programming of adaptive and programmable FPGA structures, or to the design of control and telecommunication systems, etc., in which the entire SC FIR filter will perform the signal processing.

REFERENCES

[1] S. Al-Khammasi et al., Hardware-based FIR filter implementations for ECG signal denoising: A monitoring framework from industrial electronics perspective, 2016 Annual Connecticut Conference on Industrial Electronics, Technology & Automation (CT-IETA), pp. 1–6, 2016
[2] Altium Ltd., Technical Reference Manual for Altium's Desktop NanoBoard NB2DSK01, 2008
[3] A. Dąbrowski, Multirate and Multiphase Switched-capacitor Circuits, Chapman & Hall, London, 1997
[4] A. Dąbrowski, R. Długosz, P. Pawłowski, Integrated CMOS GSM Baseband Channel Selecting Filters Realized Using Switched Capacitor Finite Impulse Response Technique, Microelectronics and Reliability, Vol. 46, Issues 5–6, pp. 949–958, 2006
[5] R. Długosz, P. Pawłowski, A. Dąbrowski, Laboratory of mixed analog-digital integrated circuits (REASON – EDUCHIP Project), EDUCHIP special session on 12th International Conference Mixed Design of Integrated Circuits and Systems, Kraków, 22–25 June 2005, pp. 851–856, 2005
[6] R. Długosz, P. Pawłowski, A. Dąbrowski, Operational amplifier for switched capacitor systems realized in various CMOS technologies, Elektronika – konstrukcje, technologie, zastosowania, miesięcznik n-t, Wyd. Sigma NOT, 3/2010, pp. 67–70, 2010
[7] R. Długosz, P. Pawłowski, A. Dąbrowski, Multiphase clock generators with controlled clock impulse width for programmable high order rotator SC FIR filters realized in 0.35 µm CMOS technology, VLSI Circuits and Systems II, Pts 1 and 2, Vol. 5837, pp. 1056–1063, DOI: 10.1117/12.608490, 2005
[8] R. Długosz, P. Pawłowski, A. Dąbrowski, Design and optimization of operational amplifiers for SC systems – a comparative study in CMOS 0.18 µm, 0.35 µm, and 0.8 µm technologies, SPA 2009: Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, pp. 36–39, 2009
[9] Fairchild Semiconductor Corp., 74LVX125 Low Voltage Quad Buffer with 3-STATE Outputs, https://www.fairchildsemi.com/datasheets/74/74LVX125.pdf, 2008
[10] S. Z. Lulec, D. A. Johns, A. Liscidini, A 150-µW 3rd-order Butterworth passive-switched-capacitor filter with 92 dB SFDR, 2017 Symposium on VLSI Circuits, pp. C142–C143, 2017
[11] D. Mandal, P. Mandal, T. K. Bhattacharyya, Spur reducing architecture of frequency synthesiser using switched capacitors, IET Circuits, Devices & Systems, Vol. 8, No. 4, pp. 237–245, DOI: 10.1049/iet-cds.2013.0200, 2014
[12] National Instruments, NI9401 8-Channel, TTL Digital Input/Output Module, Operating instructions and specifications, 374068A-02, 2015
[13] National Instruments, NI 9215 4-Channel, ±10 V, 16-Bit Simultaneous Analog Input Module, Operating instructions and specifications, 373779A-02, 2016
[14] National Instruments, NI 9263 4 AO, ±10 V, 16 Bit, 100 kS/s/ch Simultaneous Analog Output Module Datasheet, 373781b-02, 2017
[15] National Instruments, NI cDAQ- User Guide and Specifications, 371747F-01, 2008
[16] National Instruments, LabVIEW environment, http://www.ni.com, 2018
[17] A. Pawlikowski, Automated testing stand for programmable integrated circuits using LabVIEW environment, Master thesis, Poznan University of Technology, in Polish, not published, 2015
[18] P. Pawłowski, A. Dąbrowski, T. Marciniak, Prototyping of measurement setups for testing baseband communication chips, Int. Conf. Mixed Design of Integrated Circuits and Systems MIXDES, Poznań, Poland, 2008, pp. 499–504, 2008
[19] N. Singh, P. P. Bansod, Switched-capacitor filter design for ECG application using 180 nm CMOS technology, 2017 International Conference on Recent Innovations in Signal Processing and Embedded Systems (RISE), pp. 439–443, 2017
[20] STMicroelectronics, L2720/2/4 Low drop dual power operational amplifiers, http://www.st.com/web/en/resource/technical/document/datasheet/CD00000055.pdf, 2003
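As an editorial cross-check of the lowpass measurements reported in Fig. 4 above, the ideal response implied by the programmed coefficient set 0; 1; 3; 7; 3; 1; 0 can be evaluated numerically. The sketch below (plain Python; `fir_response` is an illustrative helper, not part of the described measurement setup) only confirms the qualitative behaviour: at the 1 kHz sampling rate the 20 Hz test tone sees a much larger gain than the 450 Hz one.

```python
import cmath

def fir_response(coeffs, f, fs):
    """Magnitude of the FIR transfer function at frequency f (Hz) for
    sampling rate fs (Hz): |sum_n h[n] * exp(-j*2*pi*f*n/fs)|."""
    w = 2 * cmath.pi * f / fs
    return abs(sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(coeffs)))

# Coefficient set programmed into the chip for the lowpass test (Fig. 4)
lowpass = [0, 1, 3, 7, 3, 1, 0]
fs = 1000.0  # 1 kHz sampling frequency

gain_pass = fir_response(lowpass, 20.0, fs)   # passband test tone
gain_stop = fir_response(lowpass, 450.0, fs)  # stopband test tone
print(gain_pass > gain_stop)  # True
```

Because the tap set is symmetric, the implied filter is linear-phase, which is consistent with the symmetric frequency response noted in Section V.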


Hardware implementation of the Gaussian Mixture Model foreground object segmentation algorithm working with ultra-high resolution video stream in real-time

Piotr Janus, AGH University of Science and Technology, Krakow, Poland, E-mail: piojanus@agh.edu.pl
Tomasz Kryjak, Member IEEE, AGH University of Science and Technology, Krakow, Poland, E-mail: tomasz.kryjak@agh.edu.pl

Abstract—In this paper a hardware implementation of the Gaussian Mixture Model algorithm for background modelling and foreground object segmentation is presented. The proposed vision system is able to handle a video stream with resolution up to 4K (3840 × 2160 pixels) at 60 frames per second. Moreover, the constraints caused by the memory bandwidth limit are also discussed and a few different solutions to tackle this issue have been considered. The designed modules have been verified on the ZCU 102 development board with a Xilinx Zynq UltraScale+ MPSoC device. Additionally, the computing performance and power consumption have been estimated.

I. INTRODUCTION

Foreground object segmentation is one of the most important elements of many advanced vision systems. It is a key component of many object detection and tracking systems (humans, cars), abandoned luggage detection, forbidden zone protection (i.e. border control, nuclear power plant areas, airports, etc.) and, finally, broadly understood human behaviour analysis systems.

The simplest group of foreground object detection algorithms is based on subtracting subsequent frames of a video sequence. More advanced approaches involve so-called background modelling. For each pixel, a dedicated model is assigned that describes the background appearance in a given location. Then, depending on the used algorithm, the new pixel value is compared to the background model and classified (as foreground, background, and sometimes also shadow). The model is updated to incorporate changes in the scene, like slow or fast light variations and the movement of objects belonging to the background (i.e. a moved chair).

One of the simplest background modelling methods is the moving average, which is an extended version of subtracting the current frame from a static background model (i.e. a photo of an empty scene). The new value of a background pixel is computed based on a weighted average of the previous and current pixel values. A quite similar approach is realized in the median approximation method, also called sigma-delta.

Gaussian Mixture Models (GMM, also called Mixture of Gaussians – MOG) is one of the most commonly used background modelling algorithms. It was presented for the first time in paper [1]. Its popularity is proved by the large number of articles presenting its properties and possibilities of improvement [2]. Moreover, this algorithm is included in the popular open source library for image processing and analysis OpenCV, and in Matlab software (Computer Vision System Toolbox).

There are also many other algorithms used for background modelling – an excellent review can be found in [3]. This group includes KDE (non-parametric Kernel Density Estimation) [4] or FTSG (Flux Tensor with Split Gaussian Models) [5] – also using Gaussian distributions. Moreover, frequently used are the ViBE (Visual Background Extractor) [6] and PBAS (Pixel Based Adaptive Segmenter) [7] approaches, in which the background model is built from samples (i.e. pixel values) rather than a statistical model. Recently, foreground object segmentation methods using deep convolutional neural networks have become popular [8].

The dynamic development of vision sensors has made it possible to use cameras with very high spatial resolution. This allows one to obtain images with very good quality and a larger field of view, and ultimately improves the effectiveness of the considered vision system (e.g. more accurate pedestrian detection). Currently, the most common are three resolutions: High Definition (HD: 1280 × 720), Full High Definition (FHD: 1920 × 1080) and, recently, Ultra High Definition (UHD or 4K: 3840 × 2160). It should be noted that the resolution directly affects the amount of data to be processed and the size of the background model to be stored.

Foreground object segmentation is an element of widely understood advanced video surveillance systems (AVSS). In these applications, it is usually required to perform the calculations in real time, i.e. on an ongoing basis for data acquired from the video sensor. For a 4K video stream, the choice of a computing platform becomes particularly important. It should be characterized by adequate computing performance, as well as energy efficiency and the possibility of subsequent modification and updating of the application. The designer can choose from the following platforms: general purpose processors (GPP), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) and heterogeneous programmable systems on chip (SoC), which are composed of an ARM processor, reprogrammable logic and also a GPU (e.g. Zynq UltraScale+ from Xilinx). The first two, as will be shown in the further part of the article, do not have sufficient computing power to perform 4K processing in real time. Also, they cannot be considered energy efficient. ASICs are very expensive to produce in small series and they do not support modification of the solution. In the considered context, they could be used, for example, as a co-processor for a particular version of the GMM algorithm. According to the authors of this article, FPGA devices, and especially the programmable SoCs available for several years (e.g. Zynq and Zynq UltraScale+ from Xilinx or similar Intel/Altera solutions), constitute a very interesting platform for implementing algorithms and video systems, especially for a smart-camera design (video processing and recognition performed just after acquisition).

The main contributions of this paper are:
• analysis of the possibility of reducing the size of the background model for the Gaussian Mixture Model algorithm,
• evaluation of the proposed solution using an NVidia GPU device and CUDA,
• an FPGA hardware implementation of the GMM algorithm with 4K@60fps real-time support (according to the authors' knowledge, this is the first real-time implementation of GMM in reprogrammable logic for this resolution),
• a custom AXI4 memory controller with support for a 4K video stream.

The remainder of this paper is organized as follows. In Section II previous works related to hardware implementations of GMM algorithms are briefly discussed. Section III presents the details of a 4K video stream. In Section IV the GMM algorithm is described, and the introduced adaptation to 4K and its evaluation are presented. The designed hardware system is discussed in Section V. The paper ends with conclusions and future research directions.

II. PREVIOUS WORK

Over the years, several hardware implementations of the GMM algorithm have been proposed, using both FPGAs and GPUs (CUDA and OpenCL platforms). The short review presented below is limited only to selected solutions that can process images with at least FHD resolution (1920 × 1080).

In the work [9] a GMM implementation on a Virtex 5 FPGA is presented. It is able to process a 1920 × 1080 @ 24 fps video stream in real time. In addition, the authors focused on post-processing to eliminate noise and segmentation errors. Further works by the same authors are presented in the article [10]. The use of a Virtex 6 FPGA made it possible to process 1920 × 1080 @ 91 fps.

The authors of [11] proposed a GPU implementation. The presented video system worked in real time for resolutions up to 1920 × 1080 @ 30 fps. A 9600GT GPU device was used. It is worth noting that at that time (2010) it was not a high-end computing unit, but rather a fairly moderate device. Hardware acceleration of the GMM algorithm using a GPU has also been described in [12], [13].

III. REAL-TIME 4K VIDEO STREAM IN FPGA

A colour video stream with a resolution of 3840 × 2160 at 60 frames per second means a data flow of approx. 1424 MB/s. Its transmission in the format of 1 pixel (24 bit) per clock cycle (ppc) requires a frequency of approx. 500 MHz (this clock is usually referred to as the pixel clock). At the same time, the presence of the so-called horizontal and vertical blanking fields causes the actual pixel frequency to be slightly less than 600 MHz. This is close to the "limit" value for currently available FPGA devices or the programmable logic of SoC devices. Admittedly, selected elements, such as block memories (BRAMs) and hardware multipliers (DSPs), are, according to the manufacturer's declaration, able to work with even higher frequencies (this depends, among others, on the version of the chip (speed grade), the supply voltage and the type of operation). However, in practice, for more complex logic, achieving such frequencies can be very difficult, since the delay associated with the connection resources also has to be considered. Moreover, working with higher frequencies results in larger energy consumption.

Due to the above described 4K signal parameters and the limitations of the reconfigurable resources, it is not possible to use the well known 1 pixel per clock (1 ppc) format. Instead, the 2 ppc or 4 ppc format is used, which allows one to lower the pixel clock to 300 MHz or 150 MHz, respectively. This has the following consequences for the designed hardware modules. In the case of point operations (like colourspace conversion or LUTs), and also the majority of background modelling approaches, where no pixel context analysis is needed, it is necessary to multiply the computational resources – for example, for 2 ppc, two independent GMM modules should be used. For contextual operations, the matter is more complex, because a sufficiently large context should be gathered and two or four operations carried out at the same time.

Another difficulty is the limited bandwidth of the RAM in which the background model is stored. In the case of the development board used in this research (ZCU 102 from Xilinx) with a Zynq UltraScale+ device, the maximum obtained throughput is, respectively, 128 bits per clock cycle in the 2 ppc variant and 256 bits in the 4 ppc format (simultaneous read and write). These values are very important when designing a foreground object segmentation system.

IV. ALGORITHMS AND EVALUATION

A. Gaussian Mixture Models

Gaussian Mixture Models is one of the most commonly used methods for background modelling. In this approach each pixel is represented by k Gaussian distributions characterized by three parameters (ω, µ, σ²). ω is the normalized weight (range 0–1) of the Gaussian distribution. µ is the vector of mean values of the colour components of a particular pixel. In the case of the RGB colour space it can be

defined as the vector of three numbers (r_mean, g_mean, b_mean). For the grayscale space it is a single number. Finally, σ² is the variance of a given Gaussian distribution – a single value is used for each colour component. Usually it is assumed that the RGB components are independent, which allows one to use 3 values instead of a covariance matrix. It should be noticed that a lot of varying implementations of the GMM algorithm have been proposed so far (cf. [2]). In this work, a version based on the open source image processing library OpenCV was implemented.

The background model is initialized while processing the first frame of the video sequence. The same initial weight and variance are assigned to each Gaussian distribution, while the vector of mean values is initialized with the pixel values. The algorithm itself is built up of several steps. Firstly, sorting of the Gaussian distributions with respect to weight in descending order is performed.

Then the current pixel (x) is tested against each Gaussian distribution. For match estimation the Mahalanobis distance formula is applied:

d(x, \mu) = \sqrt{(x - \mu) \cdot (x - \mu)^T}    (1)

A pixel is classified as matching the Gaussian if the computed distance is lower than an established threshold. With respect to Equation (2), usually the triple value of the standard deviation is used:

d(x, \mu) < 3 \cdot \sigma    (2)

The next step is pixel classification based on the match test. According to Equation (3), the first B Gaussian distributions, whose cumulative weight exceeds a constant threshold T, are considered as background; otherwise they represent foreground. The default value of this parameter is 0.9 (the same as in the OpenCV implementation).

B = \arg\min_b \left( \sum_{i=0}^{b} \omega_i > T \right)    (3)

The final step is the model update. The following formulas are applied:

\omega_{i+1} = \omega_i + \alpha (M - \omega_i)    (4)

\mu_{i+1} = \mu_i + \frac{\alpha}{\omega_i} M (x - \mu_i)    (5)

\sigma_{i+1} = \sigma_i + \frac{\alpha}{\omega_i} M \left( (x - \mu_i) \cdot (x - \mu_i)^T \right)    (6)

where α represents the learning speed, while M equals 1 for the first Gaussian distribution that passed the match test, and otherwise it is 0. Moreover, the value of the variance is upper constrained. In the case of distributions which do not match the pixel value, only the weight value is updated (decreased). If instead none of the Gaussian distributions match the pixel, then a new Gaussian is added (the same parameters as in the initialization phase are used). The distribution with the lowest weight is replaced by the new one. Finally, the weights have to be normalized to the range 0–1. Assuming that the number of Gaussian distributions is k = 3 and each parameter is 12 bits long, the total model size is 180 bits for each pixel.

B. Adaptation to 4K

Considering the properties of a 4K signal described in Section III, a new implementation capable of handling such a stream shall meet the constraints introduced by the used hardware platform. The main issues are the throughput between the programmable logic and the DDR4 memory and the necessity of processing two or four pixels per clock cycle. In order to solve both of them, the following simplifications and modifications reducing the background model size could be applied:
• using a grayscale image instead of RGB,
• using a common background model for neighbouring pixels.
These solutions have been implemented on a GPU device using CUDA for evaluation purposes. They have been compared with the GMM implementation available in OpenCV.

The first of the aforementioned approaches allows one to significantly reduce the size of the background model by resizing the length of the mean values vector from three to just one dimension and reducing the precision of each parameter. The maximum RAM throughput is 256 bits per 4K pixel clock. With respect to this value, the bit size of each parameter was selected as follows: weight – 6 bits, mean – 8 bits, variance – 7 bits. Assuming k = 3 and 4 pixels per clock cycle, it gives 252 bits for each read/write operation. Therefore the total size of the background model will meet the transfer throughput requirement.

The next approach assumes the use of a common background model for neighbouring pixels. It may be profitable due to the specificity of a 4K vision signal. In this case, each pixel processed in the same clock cycle will have a common background model. The idea is to evaluate each pixel independently, then merge the classification results and use them for the update decision. The update procedure is applied to those Gaussian distributions which match at least one pixel. When more than one pixel matches the same distribution, the average value is used during this procedure. Moreover, the distribution with the lowest weight is replaced if at least one pixel does not match any distribution. The approach with the average value is also used here. For instance, if none of the Gaussian distributions match more than one pixel, the average value of the pixels is used during re-initialization.

C. Evaluation

The hardware implementation was preceded by tests on a GPU device. The CUDA platform was used for evaluation. During the simulation tests, ultra-high resolution video sequences were examined (they were recorded by the authors with a 4K camera). The GMM implementation available in the OpenCV library was used as a reference model (with default parameters – 5 Gaussian distributions). It is worth noticing that, while adjusting the algorithm parameters, the hardware constraints described in Section III have to be considered as well. However, the results obtained on the target hardware might be slightly deteriorated compared to the simulation tests

TABLE I
EVALUATION OF THE PROPOSED APPROACH ON A GPU

            TP     TN     FP     FN
Variant 1   4%     90%    1%     5%
Variant 2   2%     89%    2%     7%
Variant 3   4%     85%    6%     5%
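The percentages of Table I can be reproduced by a per-pixel comparison of a variant's binary foreground mask against the OpenCV reference mask. A minimal sketch (plain Python; `mask_metrics` and the tiny example masks are illustrative names, not code from the described system):

```python
def mask_metrics(result, reference):
    """Compare a binary foreground mask against a reference mask.

    result, reference: 2-D lists of 0/1 values (1 = foreground).
    Returns TP, TN, FP, FN as fractions of all pixels, following the
    convention of Table I (reference = OpenCV segmentation).
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    total = 0
    for row_res, row_ref in zip(result, reference):
        for res, ref in zip(row_res, row_ref):
            total += 1
            if res and ref:
                counts["TP"] += 1   # foreground in both masks
            elif not res and not ref:
                counts["TN"] += 1   # background in both masks
            elif res:
                counts["FP"] += 1   # foreground only in the tested mask
            else:
                counts["FN"] += 1   # foreground only in the reference
    return {name: n / total for name, n in counts.items()}
```

By construction the four fractions sum to 1, matching the convention of Table I, where TP + TN + FP + FN covers all pixels of a frame.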

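The per-pixel processing of Section IV-A (descending sort by weight, the match test of Eqs. (1)-(2), the background test of Eq. (3), and the update of Eqs. (4)-(6)) can be sketched for the grayscale case as follows. This is an illustrative floating-point model; the function name, the learning speed ALPHA, the initial variance, and the variance bound VAR_MAX are assumptions, not values from the paper, and no fixed-point effects of the hardware are modelled:

```python
import math

ALPHA = 0.01     # learning speed alpha (assumed value)
T = 0.9          # background weight threshold (default from the paper)
VAR_MAX = 500.0  # upper constraint on the variance (assumed value)

def gmm_step(x, model):
    """One GMM iteration for a grayscale pixel x.

    model is a list of [weight, mean, variance] triples; it is modified
    in place. Returns True if x is classified as background.
    """
    model.sort(key=lambda g: g[0], reverse=True)  # sort by weight, descending

    # Match test, scalar form of Eqs. (1)-(2): |x - mu| < 3 * sigma.
    matched = next((g for g in model
                    if abs(x - g[1]) < 3.0 * math.sqrt(g[2])), None)

    # Background test, Eq. (3): the first B distributions whose cumulative
    # weight exceeds T represent the background.
    background = False
    cumulative = 0.0
    for g in model:
        cumulative += g[0]
        if g is matched:
            background = True
        if cumulative > T:
            break

    if matched is None:
        # No distribution matched: replace the weakest one (re-initialization).
        model[-1] = [ALPHA, float(x), 100.0]
    else:
        for g in model:
            m = 1.0 if g is matched else 0.0
            g[0] += ALPHA * (m - g[0])                   # Eq. (4)
            if m:
                d = x - g[1]
                g[1] += ALPHA / g[0] * d                 # Eq. (5)
                g[2] = min(g[2] + ALPHA / g[0] * d * d,  # Eq. (6), with the
                           VAR_MAX)                      # upper constraint
    total = sum(g[0] for g in model)                     # normalize weights
    for g in model:
        g[0] /= total
    return background
```

In the hardware version described in Section V, the same steps are split into the sorting, match test, and model updater blocks, instantiated once per pixel processed in a clock cycle.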
due to the fixed point arithmetic. Taking all of this into account, the following modifications have been tested:
• grayscale input, 3 Gaussian distributions, independent background model for each pixel,
• grayscale input, 6 Gaussian distributions, common background model for four neighbouring pixels,
• RGB input, 3 Gaussian distributions, common background model for four neighbouring pixels.

The obtained results are presented in Table I. For evaluation purposes the following performance metrics, computed with respect to the OpenCV implementation, are used:
• True Positive (TP) – percentage of pixels correctly classified as foreground,
• True Negative (TN) – percentage of pixels correctly classified as background,
• False Positive (FP) – percentage of pixels incorrectly classified as foreground (classified as background by the reference algorithm),
• False Negative (FN) – percentage of pixels incorrectly classified as background (classified as foreground by the reference algorithm).

The obtained results indicate that using a simplified (smaller) background model does not have too much impact on the final results, nor does using the grayscale colourspace instead of RGB. Depending on the test case, the percentage of wrongly classified pixels varies from 6% to 11%. Moreover, it could be noticed that it is better to use a smaller number of Gaussian distributions than a common background model for several pixels.

In addition to the accuracy evaluation, the general performance of the GPU implementation was also measured. Two different graphics cards were tested: an NVIDIA GeForce GTX 1050m (mobile version) and a GTX 1080 (desktop version). The performance of these devices is about 2.5 and 5 frames per second, respectively (for 4K video). It is worth considering the power consumption as well. During the performed tests the average GPU resource utilization was about 50–60%, while the measured power usage was more than 70 W.

V. HARDWARE IMPLEMENTATION

In the experiments, the ZCU 102 evaluation board from Xilinx was used. It is equipped with a Zynq UltraScale+ MPSoC (Multiprocessor System on Chip) device (XCZU9EG-2FFVB1156). This choice was dictated by the planned further work on hardware-software vision systems, which would include foreground object detection, e.g. abandoned luggage detection and person re-identification. It should be noted that for the considered application more suitable boards could be used, especially custom ones with high bandwidth to external RAM (connected to the PL). Therefore, the presented results should be treated as an example (proof of concept), and the implementation parameters should always be adapted to a specific hardware platform.

Fig. 1. System architecture

The block diagram of the designed system is presented in Figure 1. The particular modules are discussed in the following subsections.

A. HDMI RX/TX

The video signal from an HDMI 2.0 source (a PC computer in the experiments) is received by the HDMI RX module and, after processing, sent by the HDMI TX to the display – a 4K monitor. Additionally, it is necessary to use the Video PHY Controller module. All modules are supplied by Xilinx. Moreover, they are configured via the AXI interface from the processor system (C++ application).

B. AXI memory controller

In the project a custom AXI4 memory controller has been used. This solution turned out to be more adequate than a dedicated VDMA module (Video Direct Memory Access). The basic assumption is the availability of the data assigned to a given pixel (e.g. the background model) exactly at the moment in which the pixel appears at the input of the processing module. In other words, the video stream from the camera and the data from the external memory must be synchronized. Only then does the foreground segmentation module work correctly. Due to the lack of a specific duration of a single transaction for a dynamic memory (and this type of memory is currently used on most boards with FPGA/Zynq SoC devices) and the specific nature of the physical RAM controller in the used device, it was necessary to use a dedicated architecture that ensured compliance with the described requirements.

It is based on FIFO buffers and two state machines (separate for write (SM WR) and read (SM RD)) and is presented in Fig. 2. Writing is controlled by the fifo_wr_en signal. The data is transmitted to a "long" FIFO (FIFO WR) with a width equal to that of the supported AXI data bus – in the considered case 128 bits. Then, in the FIFO TWR module, these data are grouped into packages of BURST × 128 bits. Different values of this parameter were tested (4, 8, 16) and finally the value 8

Fig. 2. Scheme of the used memory controller

was used. If the data is ready, the address is set in the state machine along with the necessary control signals and the transfer is executed.

Reading is quite similar. If there is enough space in the FIFO RD, the state machine sets the proper AXI parameters, so that subsequent BURST × 128-bit words are received from the RAM. Then, if the fifo_rd_en signal is asserted, the data is passed to a computation module. Furthermore, the FIFO RD should be pre-filled with data from memory prior to the actual processing.

The hardware platform used in the experiment is equipped with two external RAM memories. The first, DDR4, is connected to the processor system. Its bus is 64 bits wide and clocked at 1066 MHz. This means the maximum theoretical transfer is at the level of 16256 MB/s. The second memory is connected to the reprogrammable logic. Its bus has 16 bits and is clocked at 1333 MHz. This means a maximum transfer rate of 5210 MB/s. The described system uses the first memory. The second one will be used in further research.

A single AXI stream has a width of 128 bits (due to separate buses for writing and reading – 256 bits). Only the data for visible pixels are saved in memory (blanking fields are omitted). Assuming a 150 MHz clock (corresponding to the 4 ppc format), the transfer rate is about 4578 MB/s. This means that the first memory should provide a transfer for a word with a width of approximately 450 bits and the second of 145 bits. However, these values are theoretical and do not include the AXI bus, controller and RAM overheads.

In the experiments, it was possible to obtain for the PS memory a stable transfer at the level of two 128-bit channels (128-bit write + 128-bit read). This is approximately 57% of the maximum value stated above. There have been many attempts to connect another channel: 128-, 64- and 32-bit. Moreover, different values of the BURST parameter, as well as the FIFO sizes, were considered. Unfortunately, in no other case it was

Fig. 3. GMM architecture – variant 1 (separate background models)

Fig. 4. GMM architecture – variant 2 (common background model)

image and a common background model for all pixels in each clock cycle (2 or 4 pixels share the same background model). In Figure 3 the scheme of the first variant is presented. Since there is a dedicated background model for each pixel per clock cycle, two or four (depending on the video signal settings) instances of each block are used. The scheme of the second variant is shown in Figure 4. In this case, the colourspace conversion is optional because both RGB and grayscale images can be handled. Moreover, there is a single instance of the sorting module and the model updater, because a common background model is used. The update procedure is applied only to the Gaussian distributions which passed the match test.

The conversion from RGB to grayscale is performed at the beginning – the rgb2grayscale module. In parallel to the colourspace conversion, the background model is read via the AXI memory controller and the Gaussian distributions are sorted with respect to weight. Then the input pixel is compared to the background model in the match test block. The classification result is passed to the output and finally the background model is updated.

The operation of sorting the Gaussian distributions uses a simple bubble sort algorithm. The key issue is to design the module in such a way that the latency is independent of the input data. This module must also be capable of processing any number of Gaussian distributions. Firstly, for K input values, K − 1 compare operations are performed. In the next step this procedure is repeated for the other K − 1 values, giving K − 2
possible to obtain a stable operation. This issue will be the compare operations. Finally, after K − 1 iterations, values are
subject of further research. sorted in descending order. The total latency depends on input
data size and equals 2K − 3.
C. GMM implementation The module responsible for matching of Gaussian distri-
Following the description from Section IV-B, two hard- butions to the current pixel value is simple to implement in
ware implementations were prepared. The first one processes hardware, since only multiplication and add operations are
grayscale images and provides a dedicated background model needed. The model update process is more complex, because
for each pixel. The second approach uses RGB or grayscale equations (5) and (4) require a hardware divider to be used as

TABLE II
RESOURCE UTILIZATION

          Video    RAM CTRL   GMM v1   GMM v2   System
LUT       32025    3786       15622    4943     18.7/14.9%
FF        39037    6559       22090    5289     12.3/9.3%
BRAM      6        52         0        0        6.3%
DSP       3        0          140      35       5.7/1.5%

well.

D. Implementation Results

The FPGA resource utilization of the proposed system is presented in Table II. The amount of resources (LUTs, flip-flops, block RAMs and DSP blocks) used by particular sub-modules and the entire system is shown. Its analysis indicates that the video system (HDMI RX/TX) is quite complex. The RAM AXI controller uses a few BRAM modules for FIFOs. The GMM v1 uses more than 3 times as many resources as GMM v2. In the last column (System), both variants are summarized – the percentage of available resources on the ZCU 102 board is presented. In both cases enough resources for implementing further modules are available.

The whole running system on the ZCU 102 uses more than 22 W. The Vivado xPower tool estimates about 4.7 W for the device (values are almost the same for both variants). The estimated computing performance of the system equals 32.8 and 20.7 GOPS (giga-operations per second – arithmetical, logical, etc.), respectively. Therefore, efficiency factors of 6.98 GOPS/W and 4.40 GOPS/W were obtained.

VI. CONCLUSION

In this paper a hardware implementation of the Gaussian Mixture Models background modelling algorithm with two different ways of organizing the background model was presented. To the best of the authors' knowledge, this is the first reported FPGA implementation of this algorithm which is able to process a 4K video stream in real time. It was verified on the ZCU 102 board, equipped with a Zynq UltraScale+ MPSoC device. In addition, the same algorithm was implemented and evaluated on a GPU using the CUDA platform. The FPGA implementation outperforms the GPU version in terms of both performance and energy consumption. In the case of the GPU implementation there is no memory bandwidth limit, so better accuracy can be obtained. On the other hand, compared to the FPGA, the GPU computing performance is rather low; even mid/high-end GPUs are not capable of handling 4K video stream processing in real time. The designed module could be used in an MPSoC-based smart camera for a UHD video surveillance system. Compared to the other hardware implementations, the authors achieved significantly better performance while maintaining the accuracy of the algorithm.

The proposed system could be improved in many ways. The achieved RAM throughput is only 57% of the declared maximum (theoretical) transfer speed. Further attempts to improve the transfer rate will be conducted. Moreover, the issue of constrained memory bandwidth has already been discussed in paper [14], where the authors proposed the use of a lossless compression algorithm to reduce the size of the background model. Such an approach shall also be considered for the GMM algorithm.

ACKNOWLEDGMENT

The work presented in this paper was supported by the National Science Centre project no. 2016/23/D/ST6/01389. The authors would like to thank Mateusz Komorkiewicz for help in designing the RAM controller.

REFERENCES

[1] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, vol. 2, 1999.
[2] T. Bouwmans, F. El Baf, and B. Vachon, "Background Modeling using Mixture of Gaussians for Foreground Detection - A Survey," Recent Patents on Computer Science, vol. 1, no. 3, pp. 219–237, Nov. 2008. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00338206
[3] T. Bouwmans, "Traditional and recent approaches in background modeling for foreground detection: An overview," Computer Science Review, vol. 11-12, pp. 31–66, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1574013714000033
[4] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proceedings of the IEEE, vol. 90, no. 7, pp. 1151–1163, 2002.
[5] R. Wang, F. Bunyak, G. Seetharaman, and K. Palaniappan, "Static and moving object detection using flux tensor with split Gaussian models," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 420–424, 2014.
[6] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, June 2011. [Online]. Available: http://www.telecom.ulg.ac.be/research/vibe
[7] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 38–43.
[8] H. Yousif, Z. He, and R. Kays, "Object segmentation in the deep neural network feature domain from highly cluttered natural scenes," in 2017 IEEE International Conference on Image Processing (ICIP), Sept 2017, pp. 3095–3099.
[9] M. Genovese and E. Napoli, "FPGA-based architecture for real time segmentation and denoising of HD video," Journal of Real-Time Image Processing, vol. 8, no. 4, pp. 389–401, Dec 2013. [Online]. Available: https://doi.org/10.1007/s11554-011-0238-1
[10] ——, "ASIC and FPGA implementation of the Gaussian mixture model algorithm for real-time segmentation of high definition video," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 537–547, March 2014.
[11] V. Pham, P. Vo, V. T. Hung, and L. H. Bac, "GPU implementation of extended Gaussian mixture model for background subtraction," in 2010 IEEE RIVF International Conference on Computing Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), Nov 2010, pp. 1–4.
[12] S. Popa, D. Crookes, and P. Miller, "Hardware acceleration of background modeling in the compressed domain," IEEE Transactions on Information Forensics and Security, vol. 8, no. 10, pp. 1562–1574, Oct 2013.
[13] C. Zhang, H. Tabkhi, and G. Schirner, "A GPU-based algorithm-specific optimization for high-performance background subtraction," in 2014 43rd International Conference on Parallel Processing, Sept 2014, pp. 182–191.
[14] K. Piszczek, P. Janus, and T. Kryjak, "The use of HACP+SBT lossless compression in optimizing memory bandwidth requirement for hardware implementation of background modelling algorithms," ARC 2018: Applied Reconfigurable Computing. Architectures, Tools, and Applications, pp. 379–391, 2018.
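The data-independent sorting scheme used in Section C for ordering the Gaussian distributions can be modelled in software. The sketch below is our illustrative Python model, not the authors' FPGA design: every compare-exchange in the bubble-sort schedule is executed unconditionally, so the number of operations depends only on the number of distributions K and never on their values, which is the property required for a constant-latency hardware module.

```python
def sort_gaussians_desc(weights):
    """Data-independent bubble sort (descending).

    Illustrative software model of a fixed-latency sorting module:
    pass p performs K - 1 - p compare-exchange steps, and every step
    executes regardless of the data, so the operation count is always
    K * (K - 1) / 2 for K inputs.
    """
    v = list(weights)
    k = len(v)
    ops = 0
    for p in range(k - 1):              # K - 1 passes in total
        for i in range(k - 1 - p):      # fixed schedule within a pass
            ops += 1                    # counted unconditionally
            if v[i] < v[i + 1]:         # keep the larger weight first
                v[i], v[i + 1] = v[i + 1], v[i]
    return v, ops
```

For K = 5 the schedule always performs 10 compare-exchange operations, whatever the input; in hardware, overlapping consecutive passes in a pipeline yields the 2K − 3 cycle latency quoted in the text.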

Interpolation-Based Gray-Level Co-Occurrence Matrix Computation for Texture Directionality Estimation

Marcin Kociolek
Institute of Electronics
Lodz University of Technology
ul. Wolczanska 211/215, 90-924 Lodz, Poland
Email: marcin.kociolek@p.lodz.pl

Peter Bajcsy, Mary Brady and Antonio Cardone
National Institute of Standards and Technology
Software and Systems Division
100 Bureau Drive, 20899 Gaithersburg, MD, USA
Emails: peter.bajcsy@nist.gov, mary.brady@nist.gov, antonio.cardone@nist.gov

Abstract—A novel interpolation-based model for the computation of the Gray Level Co-occurrence Matrix (GLCM) is presented. The model enables GLCM computation for any real-valued angles and offsets, as opposed to the traditional, lattice-based model. A texture directionality estimation algorithm is defined using the GLCM-derived correlation feature. The robustness of the algorithm with respect to image blur and additive Gaussian noise is evaluated. It is concluded that directionality estimation is robust to image blur and low noise levels. For high noise levels, the mean error increases but remains bounded. The performance of the directionality estimation algorithm is illustrated on fluorescence microscopy images of fibroblast cells. The algorithm was implemented in C++ and the source code is available in an openly accessible repository.

I. INTRODUCTION

Gray Level Co-occurrence Matrix (GLCM) computations are frequently used to capture second-order statistics of image textures [1], [2], [3]. GLCMs are calculated over a selected image region by counting the number of co-occurring intensity pairs. Locations of the co-occurring pixels under consideration are defined by fixed angle and offset (distance) values. Given an image, the values of angle and offset are constrained by a lattice consisting of the integer row and column locations of image pixels. The lattice constraints introduce uncertainty in the GLCM computation, since only specific angle-offset pairs correspond to the image lattice.

The gap between lattice-constrained and real-valued computations of GLCM motivates our work. A summary of a few software packages with GLCM computation is provided in Table I. They are constrained by the lattice and somewhat limited to specific offset values and direction angles, according to the seminal GLCM paper [4]. However, GLCM computations should be applicable for any direction of interest, which motivates the need for any real-valued angle and offset.

TABLE I
A SUMMARY OF GLCM ANGLE AND OFFSET PARAMETERS ALLOWED IN VARIOUS SOFTWARE PACKAGES SUPPORTING GLCM COMPUTATIONS (LC - LATTICE CONSTRAINED)

Software                   Available angles    Available offsets
QMaZda [5]                 0, 45, 90, 135      1, 2, 3, 4, 5
MATLAB [6]                 0, 45, 90, 135      Any (LC)
ImageJ/FIJI plugin [7]     0, 45, 90, 135      Any (LC)
WIPP [8]                   Any (LC)            Any (LC)
Pythonxy/scikit-img [9]    Any (LC)            Any (LC)

From an application perspective, texture directionality estimation techniques are extremely useful in cell biology. For example, the assembly mechanisms of stress fibers in response to the mechanical characteristics of the extracellular matrix are related to cardiovascular diseases [10], [11]. However, microscopy images of cells are frequently being obfuscated by cell clutter. In this context, the robustness of directionality estimation with respect to noise and image blur needs to be assessed.

The aforementioned motivations and needs can be summarized into two objectives. The first objective is to enable the computation of GLCM-derived features over any real-valued angles and offsets that are not constrained to an image lattice. The second objective is to assess the performance of GLCM-derived features on texture directionality estimation in terms of robustness to blur and noise.

After a brief overview of related work in Section II, the above objectives are addressed by defining an interpolation-based GLCM computation that can operate on any real-valued angle and offset pair (Section III-A), designing a method for estimating texture directionality from GLCM-derived features (Section III-B), and evaluating its robustness with respect to image noise and blur on synthetic images (Section IV). The method is compared with the Fiji implementation of a Fourier Transformation (FT)-based directionality estimation technique [12] (Section IV-D), and it is demonstrated on fluorescence microscopy images of fibroblast cells (Section IV-E). Section V discusses this work and suggests future directions.

II. PREVIOUS WORK

Texture directionality estimation has been explored based on Radon [13], Mojette [14], and Fourier transforms [15], or on the auto-covariance function [16]. However, such approaches require a mapping between the visual perception of
directionality and its characteristics in the transformed space. GLCM is more directly linked to human visual perception [17]. Past work on seismic image data [18] and on the characterization of collagen fibers [19] shows that GLCM-derived homogeneity and energy features are maximal when computed along the perceived texture directionality. This property of GLCM-derived features was also pointed out in [20]. However, GLCM computations are traditionally limited to a small set of direction and offset pairs, since the pair values defining the locations of the co-occurring pixels belong to the lattice consisting of the integer row and column locations of image pixels.

III. METHODS

In this section, we define an interpolation-based GLCM computation. Then, a texture directionality estimation algorithm is designed using the interpolation-based GLCM.

A. Interpolation-based GLCM Computation

The mathematical notations introduced by Haralick [4] are used here. Let us define an image I that is a mapping from a lattice of pixels to integer values of gray level intensities. The mapping assigns a set of quantized (or binned) gray levels G = {0, 1, 2, …, N_G} to each lattice point defined by L_Y = {0, 1, 2, …, N_Y} × L_X = {0, 1, 2, …, N_X}. Thus, an image I is defined as a mapping I : L_Y × L_X → G. According to [4], the lattice-based GLCM entry at the i-th row and j-th column C(i, j, d, α) is defined for angles α = 0°, 45°, 90° and 135° at a distance (i.e., offset) d as follows:

$$C(i,j,d,\alpha) = \#\big\{\, \big((k,l),(m,n)\big) \in (L_Y \times L_X) \times (L_Y \times L_X) \;\big|\; I(k,l)=i,\ I(m,n)=j,\ |k-m|=V_1(\alpha),\ |l-n|=V_2(\alpha) \,\big\} \qquad (1)$$

where the symbol # refers to the number of elements in the set, and the four angles α = 0°, 45°, 90° and 135° correspond to the following four combinations of the V_1, V_2 values:

{V_1(0°) = 0;    V_2(0°) = d}
{V_1(45°) = d;   V_2(45°) = −d}
{V_1(90°) = d;   V_2(90°) = 0}
{V_1(135°) = −d; V_2(135°) = −d}

applied to the lattice location coordinates k, l, m and n. The GLCM is of size N_G × N_G. The GLCMs in [4] are defined to be symmetric, which implies C(i, j, d, α) = C(j, i, d, α). The quantization (binning) into N_G gray levels is carried out as in Eq. (2):

$$G\big(I(i,j)\big) = \left[\, N_G \cdot \frac{I(i,j) - \min_{(k,l)\in I} I(k,l)}{\max_{(k,l)\in I} I(k,l) - \min_{(k,l)\in I} I(k,l)} \,\right] \in \{0, 1, \ldots, N_G\} \qquad (2)$$

The operations min and max are performed over the entire image I, and G is the quantized gray level of an intensity I(i, j).

Interpolation-based rather than lattice-based GLCM computations are introduced here. The general element of the interpolation-based GLCM at the i-th row and j-th column is defined below for a given interpolation model Model, real-valued angle α and offset d:

$$C(i,j,d,\alpha,Model) = \#\big\{\, \big((k,l),(m,n)\big) \in (L_Y \times L_X) \times (R_Y \times R_X) \;\big|\; m = k + d\cos(\alpha),\ n = l + d\sin(\alpha),\ I(k,l) = i \ \&\ InterpolatedI(m,n,Model) = j \,\big\} \qquad (3)$$

where the symbol # refers to the number of elements in the set and InterpolatedI(m, n, Model) denotes the interpolated intensity for the real-valued location (m, n) ∈ (R_Y × R_X). The bilinear interpolation model is used in this work. The main idea behind the interpolation-based GLCM computation is that, by definition, lattice constraints limit the direction/offset pairs that can be considered. In the proposed approach one of the two points needed for the GLCM calculation is in the image lattice and the second one is interpolated (Fig. 1).

Fig. 1. Instance of interpolation-based GLCM computation. Locations in blue represent a reference point on the lattice and an interpolated point located at distance d = 3 and angle α = 65°. The intensity value at the latter point is obtained by bilinear interpolation over the 4 closest neighbors (in red).

B. Texture directionality estimation algorithm based on GLCM-derived features

Four GLCM features, originally implemented in MATLAB software [21], are considered in this paper: homogeneity, contrast, correlation, and energy. Those features are modified below to take into account the interpolation-based GLCM, by simply substituting C(i, j, d, α) with C(i, j, d, α, model):

$$\text{homogeneity} = \sum_{i,j\in[1,G]} \frac{C(i,j,d,\alpha,model)}{1+|i-j|}$$
$$\text{contrast} = \sum_{i,j\in[1,G]} |i-j|^2\, C(i,j,d,\alpha,model)$$
$$\text{correlation} = \sum_{i,j\in[1,G]} \frac{(i-\mu_i)(j-\mu_j)\, C(i,j,d,\alpha,model)}{\sigma_i \sigma_j}$$
$$\text{energy} = \sum_{i,j\in[1,G]} C(i,j,d,\alpha,model)^2 \qquad (4)$$

The characteristics of the GLCM features in Eq. (4) with respect to texture directionality were tested for a set of fixed offset values on synthetic images consisting of bars of various thicknesses and orientations. A total of 1320 tests were performed. GLCM features were generally maximal or
minimal along the bar direction and in overall agreement with the previous studies [18], [19] and [20]. Specifically, correlation was maximal in 99% of the cases, contrast was minimal in 99%, homogeneity was maximal in 95%, and energy was maximal in 68%. According to these results, correlation and contrast seem to be the most reliable features for texture directionality estimation. Correlation is used for the texture directionality estimation algorithm defined in this paper. Fig. 2 shows an instance of the above-mentioned tests, where GLCM features are plotted for 4 offset values and over 181 direction angles α ∈ {0°, 1°, …, 180°}. In this case, three GLCM features reach their maximum or minimum value in correspondence with the actual texture directionality, which is α = 101°. Only GLCM energy does not follow this behavior.

Fig. 2. The four texture features derived from the interpolation-based GLCM are plotted as a function of the computation angle, for four offset values (4, 8, 12, 16). The GLCM is computed on a synthetic texture containing 8-pixel wide bars, 8-pixel spaced, oriented at 101° with respect to the vertical axis (inset).

The algorithm for texture directionality estimation is defined as follows. Observe that there are no restrictions on the size and shape of the processed image region. Let A and D be, respectively, the sets of angles and offsets to use for the interpolation-based GLCM computation. Let F(α, d, Model) be the GLCM correlation feature from Eq. (4), where α and d belong to the sets A and D, and Model represents the interpolation model. Then, the TEXT DIR DETECT algorithm consists of the steps below:
1) For each d ∈ D, compute the GLCM-based correlation F(d = const, α) over every α ∈ A.
2) Find the angle α_max(d) corresponding to the maximum value of F(d, α) for a given offset d.
3) Find the highest occurrence of angle ᾱ among the α_max(d) for all d ∈ D. The angle ᾱ represents the detected texture directionality.

The outputs of Steps 2 and 3 of the above algorithm can be formally defined as α_max(d) = arg max_{α∈A} F(d, α) and ᾱ = Mo{α_max(d)}_{d∈D}, where Mo is the mode of the set in parentheses over all d ∈ D. Note that if no unique mode is found then texture directionality is not reported, which is a valid outcome of the algorithm. The described TEXT DIR DETECT algorithm was implemented in C++ using the OpenCV image processing library, and the source code is available in the GitHub repository [22].

IV. PERFORMANCE TESTS AND EXAMPLES OF APPLICATIONS

A. Definition of synthetic images and procedure for tests

The robustness of our texture directionality estimation algorithm with respect to noise and image blur was evaluated on synthetic texture images with a depth of 16 bits per pixel (BPP), with dimensions of 512 × 512 pixels and with known directionality, which were obtained as follows. Initially, evenly spaced vertical bars were built, with thickness and spacing equal to 8, 12 and 16 pixels, with background intensity equal to 16384 (1/4 of the full scale), and with foreground intensity equal to 49151 (3/4 of the full scale). Next, the bars were rotated by angles 0°, 1°, …, 90°, counterclockwise with respect to the vertical direction. Please observe that, in general, a given rotation involves the interpolation of the foreground intensity values, which leads to multiple grayscale values for the rotated image. This procedure yielded 273 images (91 directions by 3 bar thicknesses). In order to evaluate the robustness of the texture directionality estimation algorithm to noise, the synthetic images were perturbed using Gaussian PDF noise with zero average and standard deviation σ = {2000, 4000, …, 20000}. On the other hand, in order to evaluate the robustness of the texture directionality estimation algorithm to blur, the synthetic images were perturbed by filtering with square averaging kernels of size {3 × 3, 5 × 5, …, 21 × 21} pixels. Instances of the obtained synthetic images are shown in Fig. 3 in pseudo colors, so that the intensity values can be seen more clearly.

Based on the above, a total of 3003 synthetic images was obtained after noise perturbation (3 bar thicknesses × 91 directions × 11 noise standard deviation levels), and the same number of synthetic images was obtained after blur perturbation (3 bar thicknesses × 91 directions × 11 blur kernel sizes). On each synthetic image, 121 circular regions of interest (ROIs) with a diameter of 61 pixels, regularly spaced and partially overlapping to fully cover each synthetic image, were used to perform the directionality estimation. Circular ROIs were chosen because they are not expected to bias in any way the directionality detection, unlike square or rectangular ROIs. Ultimately, our tests yield a total of 11011 directionality estimates (121 ROIs × 91 directions) for each bar thickness and perturbation level. Our texture directionality estimation algorithm was applied to the synthetic images described above. The directionality estimation was carried out using 1° sampling, and the GLCM was computed using 16 intensity bins. Two metrics were employed to evaluate the algorithm performance: the mean value and the maximum value of the directionality detection error (DDE). The DDE is defined as the absolute value of the difference between the estimated directionality ᾱ and the known directionality α of a synthetic texture image, as formally shown in Equation (5).

DDE = |ᾱ − α|    (5)
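The interpolation-based GLCM and the directionality estimation steps above can be sketched in plain Python. This is our illustrative re-implementation, not the authors' C++/OpenCV code [22]; it assumes the input image already holds integer gray levels in {0, …, n_levels − 1} and rounds the bilinearly interpolated intensity to the nearest level before counting co-occurrences.

```python
import math
from collections import Counter

def bilinear(img, y, x):
    """Bilinearly interpolated intensity at a real-valued location (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    fy, fx = y - y0, x - x0
    def px(r, c):                      # clamp indices to the image lattice
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]
    return ((1 - fy) * (1 - fx) * px(y0, x0) + (1 - fy) * fx * px(y0, x0 + 1)
            + fy * (1 - fx) * px(y0 + 1, x0) + fy * fx * px(y0 + 1, x0 + 1))

def glcm_interp(img, d, alpha_deg, n_levels):
    """Symmetric interpolation-based GLCM in the spirit of Eq. (3):
    one point of each pair lies on the lattice, the other is interpolated."""
    h, w = len(img), len(img[0])
    a = math.radians(alpha_deg)
    dy, dx = d * math.sin(a), d * math.cos(a)
    C = [[0.0] * n_levels for _ in range(n_levels)]
    for k in range(h):
        for l in range(w):
            m, n = k + dy, l + dx
            if 0 <= m <= h - 1 and 0 <= n <= w - 1:
                i = img[k][l]
                j = min(max(int(round(bilinear(img, m, n))), 0), n_levels - 1)
                C[i][j] += 1
                C[j][i] += 1           # symmetric GLCM, as in [4]
    total = sum(map(sum, C)) or 1.0
    return [[c / total for c in row] for row in C]

def correlation(C):
    """GLCM correlation feature from Eq. (4)."""
    g = range(len(C))
    mu_i = sum(i * C[i][j] for i in g for j in g)
    mu_j = sum(j * C[i][j] for i in g for j in g)
    var_i = sum((i - mu_i) ** 2 * C[i][j] for i in g for j in g)
    var_j = sum((j - mu_j) ** 2 * C[i][j] for i in g for j in g)
    s = math.sqrt(var_i * var_j)
    return 1.0 if s == 0 else sum(
        (i - mu_i) * (j - mu_j) * C[i][j] for i in g for j in g) / s

def text_dir_detect(img, angles, offsets, n_levels):
    """Steps 1-3 above: per-offset argmax of correlation, then the mode
    over offsets; returns None when no unique mode exists (in which case
    the directionality is not reported)."""
    best = [max(angles, key=lambda a: correlation(glcm_interp(img, dd, a, n_levels)))
            for dd in offsets]
    ranked = Counter(best).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

For example, on a 16 × 16 image of vertical one-pixel stripes (img[r][c] = c % 4), text_dir_detect(img, range(0, 180, 15), [1, 2], 4) reports 90, i.e. the direction along which the intensity is constant under the angle convention of Eq. (3), in line with the observation that correlation is maximal along the perceived texture direction.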
Fig. 3. Instances of synthetic images used for the evaluation of robustness to blur and noise. On each image, the circular ROIs and the corresponding detected directionalities are shown. The top row shows images consisting of bars with thickness of 8 pixels (a), 12 pixels (b), and 16 pixels (c). The middle row shows images consisting of 12-pixel thick bars after applying Gaussian noise with standard deviation equal to 6000 (d), 12000 (e) and 18000 (f). The bottom row shows images consisting of 12-pixel thick bars after applying blur with averaging kernels of pixel size 7×7 (g), 13×13 (h) and 19×19 (i).

B. Robustness of directionality estimation to Gaussian noise

The results of the tests of our texture directionality estimation algorithm with respect to its robustness to Gaussian noise are presented here. In Fig. 4, the average and maximum DDE are plotted against the noise standard deviation. Generally speaking, the directionality estimation algorithm seems sensitive to noise. In fact, there is no error only for noise standard deviations below 4000, which is below the lowest binning level (please observe that, since 0–65535 is the intensity value range, which is grouped into 16 bins, each bin represents 4096 consecutive intensity values). Above 4000, the estimation error seems to grow proportionally to the noise for both the average and the maximum DDE plot. Larger errors for thicker bars are due to a lower incidence of the edges in the ROI, since they mainly carry the directionality information. Despite the decrease of the directionality estimation accuracy with increasing Gaussian noise, it is important to observe that the average DDE values remain limited to 1.23° for the considered range of noise standard deviations. This is an acceptable error level, considering that 1-degree sampling was used for the directionality in the GLCM computation.

Fig. 4. Average (left) and maximum (right) Directionality Detection Error as a function of the standard deviation of zero-mean Gaussian noise. Curves correspond to bar thicknesses 8, 12 and 16.

C. Robustness of directionality estimation to image blur

The results of the tests of our texture directionality estimation algorithm with respect to its robustness to blur are presented here. In Fig. 5, the average and maximum DDE are plotted against the averaging filter kernel size. The directionality estimation algorithm does not seem to be significantly affected by image blur. The average DDE for the bars of thickness 12 and 16 does not exceed 0.01°, and the maximum DDE does not exceed 3°. For bars of thickness 8, the average DDE does not exceed 0.01° with averaging kernel sizes up to 11×11 pixels. On the other hand, with kernel sizes of 13×13 and up, the average DDE starts to grow without exceeding 0.14°. The maximum DDE for bars of thickness 8 is limited to 4° with averaging kernel sizes up to 19×19, and it jumps to 26° with an averaging kernel size of 21×21 pixels (the largest tested averaging kernel size). The only significant error increase was observed for bars of thickness 8, and it is due to the fact that such bar thickness is smaller than the size of the averaging kernel. It should be noted that the observed low sensitivity to image blur is a very important property of the proposed texture directionality estimation algorithm, since image blur is one of the commonly used methods for noise reduction in images. In the future, the robustness to other types of distortions, such as intensity nonuniformity [23], will be tested.

Fig. 5. Average (left) and maximum (right) Directionality Detection Error as a function of the averaging kernel size. Curves correspond to bar thicknesses 8, 12 and 16.

D. Comparison with Fourier Transformation-based directionality estimation

The performance of our GLCM-based texture directionality estimation algorithm was compared to the well-established and widely used Fourier Transformation (FT)-based directionality estimation technique implemented in Fiji [7]. The comparison was based on the robustness of the directionality estimation with respect to image blur and Gaussian noise. The two algorithms were compared using the same synthetic images described earlier. The only difference is that, since the FT-based directionality estimation technique works only with rectangular ROIs, square ROIs with an edge size of 61 pixels were used rather than circular ROIs. In order to obtain the same total number of ROIs, the centers of the square ROIs used for testing the FT-based texture directionality estimation
algorithm were defined so that they correspond to the centers of the circular ROIs used for testing the GLCM-based texture directionality estimation algorithm.

In Fig. 6, the average and maximum DDE are plotted against the noise standard deviation for the FT-based directionality estimation algorithm. Based on the plots, the FT-based directionality estimation algorithm seems to be sensitive to Gaussian noise as well. The average DDE is not equal to zero even in the absence of noise, and it increases nonlinearly with increasing noise standard deviation. The average DDE does not exceed 0.5° for noise standard deviation values below 16000, while for higher noise standard deviation values it grows more rapidly. The maximum DDE behaves similarly, as it does not exceed 3.6° for noise standard deviation values below 16000, and then it rapidly grows. In comparison to the GLCM-based directionality estimation algorithm, the FT-based one is slightly less sensitive to Gaussian noise for noise standard deviation values below 16000.

Fig. 6. Average (left) and maximum (right) Directionality Detection Error as a function of the standard deviation of zero-mean Gaussian noise for the Fourier-based method. Curves correspond to bar thicknesses 8, 12 and 16.

In Fig. 7, the average and maximum DDE are plotted against the averaging filter kernel size (i.e., blur level) for the FT-based directionality estimation algorithm. The obtained data show that the FT-based directionality estimation algorithm is sensitive to blur as well. The average DDE is not equal to zero even in

far more sensitive to blur.

Fig. 7. Average (left) and maximum (right) Directionality Detection Error as a function of the averaging kernel size for the Fourier-based method. Curves correspond to bar thicknesses 8, 12 and 16.

E. Example of applications

The texture directionality estimation algorithm was applied to microscopy images of actin fibers in fibroblast cells. In order to detect the locally varying directionality of texture in these images, each image has to be partitioned. The texture directionality in the images in Fig. 8 was estimated using circular overlapping tiles (41-pixel diameter). The local direction was superimposed on the images as black lines. The resulting directionality histograms are shown in polar coordinates.

Fig. 8. Texture directionality estimation for three images of actin fibers in fibroblast cells. The tiled images are shown in pseudo colors, each tile containing the estimated directionality line with length proportional to DDC (top row); the corresponding polar histogram representing the directionality distribution is also shown for each image (bottom row).

In the leftmost cell image from Fig. 8, two vertically oriented actin bundles are clearly noticeable. In fact, the largest bin of the corresponding polar histogram is very close to the vertical direction, and about 33% of the estimated texture directionalities are within 5° from the vertical direction. The center cell image from Fig. 8 contains several actin bundles with different directions. In fact, the corresponding polar histogram contains three dominating direction groups and not only one as before. The three direction groups are roughly centered at 45°, 90° and 135°, and about 50% of the estimated texture directionalities are within 10° from these three directions. Finally, one main actin bundle is clearly noticeable in the rightmost cell image from Fig. 8, along with several smaller actin fibers distributed across the cell. The largest bin in the corresponding polar histogram represents the estimated orientation of the main actin bundle (≈ 140°), whereas a few smaller bins also appear. The standard deviation of the estimated actin directionality is quite large in this case (≈ 40°). In general, the standard
absence of blur, but this time its increasing with the size of deviations of directionality estimates can be related to the
the averaging kernel size is not fully monotonic. For averaging directional distribution of actin during various cell states and
kernel sizes below 13 × 13 pixels, the average DDE does not hence to a quantitative imaging method for cell biology.
exceed 3.8◦ , while for larger averaging kernel sizes it grows In conclusion, the above examples show that our texture
rapidly. A similar behavior was found for the maximum DDE, directionality estimation algorithm enables quantitative char-
which for averaging kernel sizes up to 7 × 7 pixels does not acterization of texture directionality. Quantitative directionality
exceed 10◦ . Even in this case, larger averaging kernel sizes estimation for cell images can find application in several
results in a more rapid growth of the maximum DDE, not biological studies, such as the ones focusing on the character-
necessarily monotonic. Clearly, in comparison to the GLCM- ization of the arrangement of the proteins in the cytoskeleton
based directionality estimation algorithm the FT-based one is of cells [10] and [11].
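As a simplified illustration of this idea (not the paper's exact implementation, which uses interpolation to support real-valued angle and offset pairs), the direction of a tile can be estimated by maximizing the correlation of co-occurring pixel pairs over a set of lattice-rounded offsets; all function and parameter names below are illustrative:

```python
import numpy as np

def pair_views(a, dr, dc):
    """Return the two overlapping views of `a` whose pixels co-occur at offset (dr, dc)."""
    H, W = a.shape
    r0, c0 = max(0, -dr), max(0, -dc)
    r1, c1 = min(H, H - dr), min(W, W - dc)
    x = a[r0:r1, c0:c1]
    y = a[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    return x.ravel().astype(float), y.ravel().astype(float)

def dominant_direction(tile, distances=(1, 2, 3), angles_deg=(0, 45, 90, 135)):
    """Angle (in degrees) maximizing the mean pair correlation over the given offsets."""
    scores = []
    for ang in angles_deg:
        th = np.deg2rad(ang)
        corrs = []
        for d in distances:
            # Round the displacement to the pixel lattice (the paper's method
            # instead interpolates, allowing arbitrary real-valued offsets).
            dr, dc = int(round(d * np.sin(th))), int(round(d * np.cos(th)))
            x, y = pair_views(tile, dr, dc)
            corrs.append(np.corrcoef(x, y)[0, 1])
        scores.append(np.mean(corrs))
    return angles_deg[int(np.argmax(scores))]
```

Applying `dominant_direction` to every tile of a partitioned image, and histogramming the returned angles, yields the polar directionality histograms discussed above.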

V. CONCLUSIONS

An interpolation-based GLCM computation technique was introduced in this paper. The technique allows GLCM computation for virtually any real-valued angle-offset pair, as opposed to current lattice-constrained GLCM computations. Based on the interpolation-based GLCM computation, a texture directionality estimation algorithm was introduced by finding the maximum of GLCM-derived correlation values over all angles and offsets.

The proposed GLCM-based directionality estimation algorithm was tested on synthetic images with known texture directionality, using progressive blur and additive Gaussian noise with increasing values of standard deviation. Results show robustness to both image blur and Gaussian noise. Our algorithm was also compared with the well-established FT-based directionality estimation algorithm, showing comparable performance with respect to noise, and much better performance with respect to blur. It is important to observe that our GLCM-based algorithm can be used with arbitrarily shaped regions of interest, unlike the FT-based one.

Our algorithm was also demonstrated on microscopy images of fibroblast cells. Results show a good match with the perceived texture directionality, and a quantitative characterization of the texture directionality distribution with immediate interpretation is provided using polar histograms.

VI. DISCLAIMER

Commercial products are identified in this document in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the products identified are necessarily the best available for the purpose.

ACKNOWLEDGMENT

We would like to thank Dr. Kiran Bhadriraju from the Nanoscale Metrology group in the Physical Measurements Laboratory at NIST and his colleagues from the Material Measurements Laboratory at NIST for sharing the fibroblast cell images that were used in this paper.

REFERENCES
[1] E. O. Olaniyi, A. A. Adekunle, T. Odekuoye, and A. Khashman, "Automatic system for grading banana using glcm texture feature extraction and neural network arbitrations," Journal of Food Process Engineering, vol. 40, no. 6, 2017.
[2] J. Neumann, U. Heilmeier, G. Joseph, F. Hofmann, W. Ashmeik, A. Gersing, N. Chanchek, B. Schwaiger, M. Nevitt, C. McCulloch et al., "Texture analysis of t2 maps of the cartilage indicates differences in knee cartilage matrix in subjects with type 2 diabetes: data from the osteoarthritis initiative," Osteoarthritis and Cartilage, vol. 25, pp. S73–S74, 2017.
[3] K. Buch, B. Li, M. Qureshi, H. Kuno, S. Anderson, and O. Sakai, "Quantitative assessment of variation in ct parameters on texture features: pilot study using a nonanatomic phantom," American Journal of Neuroradiology, vol. 38, no. 5, pp. 981–985, 2017.
[4] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural Features for Image Classification," IEEE Transactions on Systems, Man, and Cybernetics, vol. 3, no. 6, pp. 610–621, nov 1973. [Online]. Available: http://ieeexplore.ieee.org/document/4309314/
[5] P. M. Szczypiński, A. Klepaczko, and M. Kociołek, "Qmazda - software tools for image analysis and pattern recognition," in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), 2017. IEEE, 2017, pp. 217–221.
[6] The MathWorks, Inc., "Create gray-level co-occurrence matrix from image - MATLAB graycomatrix." [Online]. Available: https://www.mathworks.com/help/images/ref/graycomatrix.html
[7] J. E. Cabrera, "Texture Analyzer, plugin for ImageJ/Fiji," 2006. [Online]. Available: https://imagej.nih.gov/ij/plugins/texture.html
[8] P. Bajcsy, J. Chalfoun, and M. Simon, "Functionality of web image processing pipeline," in Web Microanalysis of Big Image Data. Springer, 2018, pp. 17–40.
[9] The scikit-image development team, "Calculate the grey-level co-occurrence matrix." [Online]. Available: http://scikit-image.org/docs/dev/api/skimage.feature.html#skimage.feature.greycomatrix
[10] P. Hotulainen and P. Lappalainen, "Stress fibers are generated by two distinct actin assembly mechanisms in motile cells," The Journal of Cell Biology, vol. 173, no. 3, 2006.
[11] M. Théry, A. Pépin, E. Dressaire, Y. Chen, and M. Bornens, "Cell distribution of stress fibres in response to the geometry of the adhesive environment," Cell Motility and the Cytoskeleton, vol. 63, no. 6, pp. 341–355, jun 2006. [Online]. Available: http://doi.wiley.com/10.1002/cm.20126
[12] J.-Y. Tinevez, "Directionality (Fiji)," 2010. [Online]. Available: https://imagej.net/Directionality
[13] K. Jafari-Khouzani and H. Soltanian-Zadeh, "Radon transform orientation estimation for rotation invariant texture analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 1004–1008, jun 2005. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/15945146
[14] P. Peng Jia, J. Junyu Dong, L. Lin Qi, and F. Autrusseau, "Directionality measurement and illumination estimation of 3D surface textures by using mojette transform," in 2008 19th International Conference on Pattern Recognition. IEEE, dec 2008, pp. 1–4. [Online]. Available: http://ieeexplore.ieee.org/document/4761389/
[15] D. Feng, L. Chunlin, X. Cheng, and S. Wei, "Research of spectrum measurement of texture image," in World Automation Congress 2012, 2012, pp. 163–165.
[16] R. Mester, "Orientation estimation: Conventional techniques and a new non-differential approach," in 2000 10th European Signal Processing Conference, 2000, pp. 3–6. [Online]. Available: http://ieeexplore.ieee.org/document/7075718/
[17] B. Julesz, "Experiments in the visual perception of texture," Scientific American, vol. 232, no. 4, pp. 34–43, apr 1975. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/1114309
[18] W. Lu, "Adaptive noise attenuation of seismic image using singular value decomposition and texture direction detection," in Proceedings. International Conference on Image Processing, vol. 2. IEEE, 2002, pp. 465–468. [Online]. Available: http://ieeexplore.ieee.org/document/1039988/
[19] W. Hu, H. Li, C. Wang, S. Gou, and L. Fu, "Characterization of collagen fibers by means of texture analysis of second harmonic generation images using orientation-dependent gray level co-occurrence matrix method," Journal of Biomedical Optics, vol. 17, no. 2, p. 026007, feb 2012. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/22463039
[20] K. R. Castleman, Digital Image Processing. New Jersey: Prentice Hall, 1996.
[21] The MathWorks Inc., "Properties of gray-level co-occurrence matrix - MATLAB graycoprops." [Online]. Available: http://www.mathworks.com/help/images/ref/graycoprops.html
[22] M. Kociolek and A. Cardone, "Haralick Based Directionality Map - source code repository," 2016. [Online]. Available: https://github.com/marcinkociolek/HaralickBasedDirectionalityMap
[23] A. Materka and M. Strzelecki, "On the importance of MRI nonuniformity correction for texture analysis," in 2013 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Sept 2013, pp. 118–123.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND

On the influence of the image normalization scheme


on texture classification accuracy
Marcin Kociolek, Michal Strzelecki, Szymon Szymajda
Institute of Electronics
Lodz University of Technology
ul. Wolczanska 211/215, 90-924 Lodz, Poland
Email: marcin.kociolek@p.lodz.pl

Abstract—Texture can be a very rich source of information about the image. Texture analysis finds applications, among other things, in biomedical imaging. One of the widely used methods of texture analysis is the Gray Level Co-occurrence Matrix (GLCM). Texture analysis using the GLCM method is most often carried out in several stages: determination of areas of interest, normalization, calculation of the GLCM, extraction of features, and finally, the classification. Values of the GLCM based features depend on the choice of the normalization method, which was examined in this work. The normalization is necessary, since acquired images often suffer from noise and intensity artifacts. Certainly, the normalization will not eliminate these two effects; however, it was demonstrated that its application improves texture analysis accuracy. The aim of the work was to analyze the influence of different normalization methods on the discriminating ability of features estimated from the GLCM. The analysis was performed both for Brodatz textures and real magnetic resonance data. Brodatz textures were corrupted by three types of distortion: intensity nonuniformity, Gaussian noise and Rician noise. Three types of normalizations were tested: min − max, 1 − 99% and ±3σ.

I. INTRODUCTION

Texture analysis is a very rich source of information about images, especially biomedical ones, where texture describes properties of visualized tissues and organs. Texture analysis is already a well developed and reliable technique used for quantitative analysis of a wide range of biomedical images acquired by various modalities, like CT [1], MRI [2], ultrasound [3], and optical [4]. It is no wonder that this is a topic widely considered in scientific papers and publications. The Google Scholar search engine returns approximately 2,200,000 search results related to this subject, of which about 17,000 publications are dated from 2017. It shows that this subject is still often addressed by researchers. One of the most widely used methods of texture analysis, despite its long history, is the Gray Level Co-occurrence Matrix (GLCM), which enables extraction of many features [5]. Texture analysis using the GLCM method is most often carried out in several stages: determination of areas of interest, choice of normalization, determination of the GLCM, extraction of features, and finally, the classification. The obtained values depend on the choice of the normalization method and the number of bits coding the brightness levels of pixels, which will be examined in this work. The image preprocessing step (normalization) is necessary, since the acquired images often suffer from noise [6] and intensity artifacts [7]. Certainly, normalization will not eliminate these two above mentioned effects; however, it was demonstrated in chapter three of [8] that its application improves texture analysis accuracy. In [9] it was shown that, without normalization, for additive noise the classification accuracy increases if the number of quantization levels decreases. The aim of this work is to analyze the influence of different normalization methods, along with the number of brightness levels, on the discriminating ability of features estimated from the GLC matrix. The investigated normalization methods are those implemented in the MaZda [10] software, a very useful tool for texture analysis, and its follower, the QMaZda [11] multiplatform software. The analysis was performed both for Brodatz textures and real magnetic resonance data.

II. NORMALIZATION SCHEMES

Normalization in image processing consists of extending the image histogram to the entire available brightness range. Usually this is the first preprocessing step performed during analysis of digital images. In the MaZda program, the normalization is performed simultaneously with the reduction of the number of brightness values in the image. This is equivalent to reducing the number of bits coding the brightness. The normalizations are described by (1):

I_NORM(x, y) = { 2^k − 1 for N(x, y) > 2^k − 1;  N(x, y) for 0 ≤ N(x, y) ≤ 2^k − 1;  0 for N(x, y) < 0 }   (1)

where:
N(x, y) = round_to_int( (I(x, y) − minNorm) / (maxNorm − minNorm) · (2^k − 1) ),
minNorm - minimum normalized value,
maxNorm - maximum normalized value,
k - number of image bits per pixel after normalization.

The following image normalization techniques were investigated in this study.

A. min-max normalization

If the range of values in the image histogram is focused in a certain range of values, min − max type normalization can be used. The minNorm and the maxNorm from equation (1) are taken directly as the minimum and maximum brightness levels appearing in the image histogram. An example of a

histogram for which the min − max normalization can be used is shown in Fig. 1.

Fig. 1. Image histogram limited to a certain gray level range

Fig. 3. Image histogram with most values concentrated in a certain range, but with some outliers
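The normalization of (1), together with the minNorm/maxNorm choices used by the schemes discussed in this section, can be sketched in NumPy as follows (a minimal sketch: the exact integer rounding used by MaZda is not specified here, and the percentile call approximates arg(cumulative histogram = 1%)):

```python
import numpy as np

def normalize_quantize(img, min_norm, max_norm, k):
    """Eq. (1): map [min_norm, max_norm] to k-bit integers, clipping outliers."""
    n = np.rint((img.astype(float) - min_norm) / (max_norm - min_norm) * (2 ** k - 1))
    return np.clip(n, 0, 2 ** k - 1).astype(np.uint16)

def norm_bounds(img, scheme):
    """minNorm/maxNorm for the three schemes considered in this paper."""
    if scheme == "min-max":
        return img.min(), img.max()
    if scheme == "3sigma":
        mu, sigma = img.mean(), img.std()
        return mu - 3 * sigma, mu + 3 * sigma
    if scheme == "1-99%":
        # Percentiles approximate the cumulative-histogram thresholds.
        return np.percentile(img, 1), np.percentile(img, 99)
    raise ValueError(scheme)
```

For example, `normalize_quantize(img, *norm_bounds(img, "3sigma"), k=3)` reduces an image to the 8 gray levels used at the 3-bit quantization setting.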

B. ±3σ normalization
Another popular method of normalization is to bring the
brightness of the image to within ±3σ of the average image
intensity. In cases where the image histogram takes the form
resembling a Gaussian distribution, the range of intensity
values taken for normalization is minNorm = µ − 3σ to
maxNorm = µ + 3σ, where: µ - calculated average pixel
value in the image, and σ - standard deviation of brightness
levels. An example of an image histogram for which ±3σ
normalization can be performed is presented in Fig. 2.

Fig. 2. Gaussian-shaped image histogram

C. 1% − 99% normalization

In the case when the majority of elements of the histogram are focused in a certain range of values, but there are also elements outside this range, normalization of type 1% − 99% can be applied. The range of intensities for normalization can be defined as minNorm = arg(cumulative histogram = 1%) to maxNorm = arg(cumulative histogram = 99%). Fig. 3 shows an example image histogram suitable for such normalization.

Fig. 4. Test images from the Paul Brodatz album [12], [13]. From top left: grass, bark, straw, herringbone weave, woolen cloth, pressed calf leather, beach sand, water, wood grain, raffia, brick wall, plastic bubbles.

III. MATERIALS AND METHODS

For the analysis, 12 images representing Brodatz textures [12] were considered (Fig. 4). These are 8-bit encoded (512 × 512) images downloaded from [13]. To make the analysis results more realistic, the images were corrupted by means of the following distortions:
– background brightness nonuniformity,
– additive Gaussian noise with mean = 0 and two standard deviation levels, σ = 25.6 and σ = 51.2,
– Rician noise [14] with two noise levels, s = 0.1 and s = 0.2, respectively.

As the image nonuniformity artifact, variable contrast was used. The variable contrast was defined based on a linear function:

g(x, y) = c1 x + c2 y + c3   (2)

where c1, c2, c3 are the respective model parameters. The resulting images were obtained as the result of multiplication of the original ones with the function (2). Sample degraded texture images are shown in Fig. 5.

The Brodatz textures were sampled by means of 64 non-overlapping square regions of interest (ROIs) with sizes 59 × 59 pixels, evenly distributed over the images.
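The degradations above can be reproduced approximately as follows (a sketch: scaling the Rician noise level s by the maximum image intensity is our assumption, since the paper does not state the reference level, and `rng` is any NumPy random generator):

```python
import numpy as np

def contrast_nonuniformity(img, c1, c2, c3):
    """Multiply the image by the linear ramp g(x, y) = c1*x + c2*y + c3 of Eq. (2)."""
    y, x = np.indices(img.shape)  # y: row index, x: column index
    return img.astype(float) * (c1 * x + c2 * y + c3)

def add_rician(img, s, rng):
    """Rician noise: the magnitude of the image plus complex Gaussian noise [14]."""
    sigma = s * img.max()  # assumed scaling of the noise level parameter s
    real = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    imag = rng.normal(0.0, sigma, img.shape)
    return np.hypot(real, imag)
```

Additive Gaussian noise is simply `img + rng.normal(0, sigma, img.shape)`; unlike it, the Rician model is signal-dependent, which is why the two distortions behave differently in the experiments below.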

Fig. 6. Sample MR image with 8 texture classes marked: original image (a) and image with superimposed ROIs (b).

Fig. 5. Bark texture and its degraded versions: original (a), nonuniformity artefact added (b), Gaussian noise σ = 25.6 (c), Gaussian noise σ = 51.2 (d), Rician noise with s = 0.1 (e), Rician noise with s = 0.2 (f).

Six physical phantoms were manufactured in the Medical Physics Department, University of Dundee, Scotland. They were in the form of glass tubes filled with reticulated foam with different porosities and glass bubbles with different diameters. The foam and bubbles were stuffed with agarose gel that possesses a relatively long magnetic resonance T2 response [15]. The tubes were sealed properly to prevent the water included in the gel from evaporating. The tubes were fixed in the MR scanner close to the patient's head. A series of magnetic resonance images of the phantoms was recorded using a Siemens Magnetom 1.5-Tesla scanner at the German Cancer Research Center, Heidelberg, Germany. The images represent cross-sections of the tubes, along with the patient's brain cross-section, taken at a constant number of image pixels (512 × 512) and a slice thickness equal to 2 mm. As a result, 6 different texture classes representing phantoms and two additional classes representing brain tissue were obtained, with 7 samples in each class. An example of the analyzed MR textured images is presented in Fig. 6.

Gray Level Co-occurrence matrix-based features were extracted by means of the QMaZda [11] software. We considered GLCMs built for 4 directions (0◦, 45◦, 90◦ and 135◦) and 5 distances/offsets (1, 2, ..., 5). For each GLCM, 11 features were extracted:
– angular second moment (energy),
– contrast,
– correlation,
– sum of squares,
– inverse difference moment,
– sum average,
– sum variance,
– sum entropy,
– entropy,
– differential variance,
– differential entropy.
This gives a total of 220 textural features (4 directions · 5 offsets · 11 features).

All these features were calculated without normalizing the image and for 3 normalization methods:
– min − max,
– 1% − 99%,
– ±3σ,
and for 6 intensity quantization levels (8, 16, 32, 64, 128, 256), which correspond to 3, 4, 5, 6, 7 and 8 bits coding the intensity, respectively.

For each combination of normalization and quantization level a classification test was performed. For this test all feature values were normalized according to their mean and standard deviation. Then, for each pair of considered classes, a binary linear classifier utilizing a subset of three features was trained. Such training was performed for multiple combinations of three features and the classifier with the minimal error was chosen. For the classification of K classes, (K² − K)/2 classifiers were trained. Finally, a classification test for all classes was performed, resulting in the confusion matrix (Fig. 7). A sample under consideration was assigned to a given class if all (K − 1) classifiers for this class agreed. Otherwise no decision was reported. Such a situation is also treated as a classification error ("no decision" column in Fig. 7). All steps of the linear discriminant analysis were performed by means of the QMaZda [11] software package.

Fig. 7. Example confusion matrix for the classification test performed for images distorted by means of nonuniformity (no normalization, 3 bits per pixel). For ease of reading, zero values were removed from the table. The last column shows the number of samples for which the classifiers did not agree, so no decision was taken. (Correctly classified samples per class, out of 64: grass 63, bark 61, straw 63, herringbone weave 64, woolen cloth 64, pressed calf leather 64, beach sand 63, water 64, wood grain 64, raffia 59, brick wall 64, plastic bubbles 63; the remaining samples were misclassified or undecided.)
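The pairwise voting rule described in the Methods section, which produces the "no decision" column of Fig. 7, can be sketched as follows (illustrative names; the trained binary linear classifiers are abstracted into a dictionary of pairwise winners):

```python
def assign_class(pair_vote, n_classes):
    """One-vs-one voting: a sample is assigned to class c only if all (K - 1)
    pairwise classifiers involving c pick c; otherwise the result is
    "no decision" (None), counted as a classification error.

    pair_vote[(i, j)] (with i < j) is the class predicted by the binary
    classifier trained for the pair (i, j) on the current sample."""
    for c in range(n_classes):
        duels = (pair_vote[tuple(sorted((c, other)))]
                 for other in range(n_classes) if other != c)
        if all(winner == c for winner in duels):
            return c
    return None  # no class won all of its duels: "no decision"
```

Note that at most one class can win all of its (K − 1) duels, so the first match found can be returned immediately.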
IV. RESULTS

The linear discrimination procedure was repeated for each distorted image collection, for the above discussed normalization schemes and six quantization levels. Fig. 8 shows the classification error for original Brodatz textures and textures distorted by the nonuniformity artifact, plotted against the number of bits coding the intensity levels after normalization and binning.

Fig. 8. The classification error for original (non-distorted) Brodatz textures (a) and textures distorted by nonuniformity (b), plotted against the number of bits coding intensity levels after normalization and binning.

As can easily be seen, GLCM based features constitute an appropriate tool for classification of the selected Brodatz textures. The error is minimal, and the best choice for normalization is the ±3σ method, which gives an error equal to 0 for any tested quantization level. Adding the nonuniformity distortion to the original images shows the importance of the normalization. The minimal classification error for this case can be observed for the ±3σ and min − max normalizations.

Fig. 9 shows the classification error for Brodatz textures distorted by means of Gaussian noise: σ = 25.6 and σ = 51.2. Low level Gaussian noise gives a relatively low classification error. The best results can be obtained by means of ±3σ normalization, which gives zero error for any tested number of quantization levels. Increasing the noise standard deviation (σ = 51.2) causes the classification error to rise. For this type of distortion, the best normalization scheme was ±3σ. For Gaussian noise, decreasing the number of quantization levels caused a decrease in the classification error.

Fig. 9. The classification error for Brodatz textures distorted by means of Gaussian noise: σ = 25.6 (a) and σ = 51.2 (b).

Fig. 10 shows the classification error for Brodatz textures distorted by means of two levels of Rician noise: s = 0.1 and s = 0.2. For the Rician noise the best normalization method is ±3σ. For this type of distortion the classification error is practically insensitive to the number of quantization levels. After increasing the noise parameter (s = 0.2) the classification error rises. Fig. 11 shows classification results for phantoms and two tissues from MRI images. The normalization is essential for

classification tasks of phantom and tissue classes on our test MRI images. Without normalization the classification error reaches over 80%, while ±3σ and 1 − 99% allow for errorless classification for any number of quantization levels; min − max normalization requires at least 32 intensity bins for classification with zero error.

Fig. 10. The classification error for Brodatz textures distorted by means of two levels of Rician noise: s = 0.1 (a) and s = 0.2 (b).

Fig. 11. The classification error for phantoms and two tissues on MRI images.

V. DISCUSSION AND CONCLUSIONS

It can be observed from the obtained results that for original images and lower Gaussian or Rician noise, classification of Brodatz textures is almost perfect with or without any type of normalization. This is explained by the good image quality and the rather distinguishable textures they represent. The same conclusion is valid for the nonuniform background, which does not influence classification results. When the noise variance increases, the application of standardization becomes important. It can be observed in Fig. 8 that the classification error decreases when normalization is applied (except the min − max one), obtaining minimal values for the ±3σ scheme. This is due to the fact that this scheme is designed for images with a Gaussian-like histogram distribution; thus it works well for the analyzed distorted images, helping to eliminate the influence of the background nonuniformity. The min − max normalization does not significantly reduce the classification error, since it provides a linear histogram stretching only; thus in the presence of noise this is not sufficient to eliminate its influence. Additionally, reduction of the quantization levels reduces the classification error, since it acts similarly to an averaging filter and thus partly removes the additive Gaussian noise.

Rician noise is not an additive distortion. This noise depends on the data itself and it modifies the data in a way that the histogram of the modified image becomes a Rician distribution. Such a distribution depends strongly on the pixel gray level affected by this noise. For higher values of the s parameter, Rician noise degrades the texture in a way that increases the classification error. The ±3σ normalization improves the classification over other methods. This is caused by the fact that the Rician distribution somewhat resembles the Gaussian distribution for the higher gray level values that dominate in the analyzed textures. This time the number of quantization levels has limited influence on the classification results.

The same effect of the various normalization schemes is observed when experiments are performed on noiseless textures with a brightness artifact. In this case even min − max performs well, since histogram stretching is sufficient to reduce this artifact.

For MRI phantoms, the application of normalization is crucial. Such images are originally 12-bit coded, but the GLCM is estimated for 8 bits (or less), since such an analysis scheme is implemented in MaZda. Thus, without normalization, image gray levels are quantized to the 8-bit range. Since the gray level variability of the phantom ROIs (shown in Fig. 6b) is low, such quantization will lead to further reduction of the gray level ranges. Thus, the texture that represents such phantoms becomes undistinguishable. Any normalization scheme will stretch the gray level distribution of the analyzed phantom ROIs, preserving in consequence the texture properties. This explains the significant reduction of the classification error between no normalization and the normalization schemes.

Summarizing, from the conducted tests, normalization improves classification utilizing GLCM based features. The ±3σ normalization appears to be a good choice to mitigate different types of image distortion, assuming that a Gaussian-like image gray level distribution is preserved. Further tests with combinations of various distortion techniques, along with medical data acquired by different imaging modalities, are needed.
R EFERENCES
[1] J. Thevenot, J. Hirvasniemi, M. Finnilä, P. Pulkkinen, V. Kuhn, T. Link,
F. Eckstein, T. Jämsä, and S. Saarakkala, “Trabecular homogeneity index
derived from plain radiograph to evaluate bone quality,” Journal of bone
and mineral research, vol. 28, no. 12, pp. 2584–2591, 2013.
[2] A. Larroza, V. Bodı́, and D. Moratal, “Texture analysis in magnetic
resonance imaging: Review and considerations for future applications,”
in Assessment of Cellular and Organ Function and Dysfunction using
Direct and Derived MRI Methodologies. InTech, 2016.
[3] L. Chrzanowski, J. Drozdz, M. Strzelecki, M. Krzeminska-Pakula,
K. Jedrzejewski, and J. Kasprzak, “Application of neural networks
for the analysis of histological and ultrasonic aortic wall appearance-
an invitro tissue characterization study,” Ultrasound in Medicine and
Biology, vol. 34, no. 2008, pp. 103–113, 2008.
[4] K. Kropidłowski, M. Kociołek, M. Strzelecki, and D. Czubiński, “Model
based approach for melanoma segmentation,” in International Confer-
ence on Computer Vision and Graphics. Springer, 2014, pp. 347–355.
[5] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features
for Image Classification,” IEEE Transactions on Systems, Man, and
Cybernetics, vol. 3, no. 6, pp. 610–621, nov 1973. [Online]. Available:
http://ieeexplore.ieee.org/document/4309314/
[6] S. Aja-Fernández, A. Tristán-Vega, and C. Alberola-López, “Noise
estimation in single-and multiple-coil magnetic resonance data based
on statistical models,” Magnetic resonance imaging, vol. 27, no. 10, pp.
1397–1409, 2009.
[7] A. Materka and M. Strzelecki, “On the importance of mri nonuniformity
correction for texture analysis,” in 2013 Signal Processing: Algorithms,
Architectures, Arrangements, and Applications (SPA), Sept 2013, pp.
118–123.
[8] M. Strzelecki and A. Materka, Tekstura obrazów biomedycznych.
Warszawa: Wydawnictwo Naukowe PWN.
[9] M. Strzelecki, M. Kociołek, and A. Materka, “On the influence of image
features wordlength reduction on texture classification,” in International
Conference on Information Technologies in Biomedicine. Springer,
2018, pp. 15–26.
[10] P. M. Szczypinski, M. Strzelecki, and A. Materka, “Mazda - a software
for texture analysis,” in 2007 International Symposium on Information
Technology Convergence (ISITC 2007), Nov 2007, pp. 245–249.
[11] P. M. Szczypiński, A. Klepaczko, and M. Kociołek, “Qmazda - software
tools for image analysis and pattern recognition,” in Signal Processing:
Algorithms, Architectures, Arrangements, and Applications (SPA), 2017.
IEEE, 2017, pp. 217–221.
[12] P. Brodatz, Textures: A Photographic Album for Artists and Designers.
New York, NY: Dover Publications, 1966.
[13] University of Southern California Signal and Image Processing Institute,
“The usc-sipi image database, volume 1: Textures.” [Online]. Available:
http://sipi.usc.edu/database/database.php?volume=textures
[14] H. Gudbjartsson and S. Patz, “The Rician distribution of noisy MRI data,”
Magnetic Resonance in Medicine, vol. 34, no. 6, pp. 910–914, 1995.
[15] R. Lerski, J. De Wilde, D. Boyce, and J. Ridgway, Quality control in
magnetic resonance imaging. Institute of Physics and Engineering in
Medicine, 1998.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Spatial Transformations in Deep Neural Networks


Michał Bednarek, Krzysztof Walas
Poznan University of Technology
Institute of Control, Robotics and Information Engineering, Poznan, Poland

Abstract—Convolutional Neural Networks (CNNs) have brought exceptionally significant improvements in the performance of a variety of visual tasks, such as object classification, semantic segmentation or linear regression. However, these powerful neural models suffer from a lack of spatial invariance. In this paper, we introduce an end-to-end system that is able to learn such invariance, including in-plane and out-of-plane rotations. We performed extensive experiments on variations of the widely known MNIST dataset, which consist of images subjected to deformations. Our comparative results show that we can successfully improve the classification score by implementing the so-called Spatial Transformer module.

I. INTRODUCTION

The human perception system has a baked-in ability to be robust to a variety of deformations, e.g. perspective changes, rotations, scaling or non-rigid deformations. It means that we as humans can successfully recognise and visualise objects subjected to such effects, even if they are seen in a highly deformed form (e.g. a folded T-shirt is obviously still classified as a T-shirt). However, many computer vision systems lack such invariance, and this problem has become a field of extensive research.

In recent years, the most popular approach to solving many computer vision tasks has been to use Convolutional Neural Networks, which provide excellent results in many visual applications. Nevertheless, the design of CNNs does not provide proper deformation handling. A common approach to overcome this issue is to perform data augmentation [1]: simply increase the volume of the training data and train the neural network to give the same results for many versions of the deformed input. In our work, we propose a solution which provides invariance to such transformations without this technique.

In this paper we propose to use in our system the Spatial Transformer Network [2], which operates on the so-called Thin-Plate Splines [3] interpolation method used to disentangle non-rigid deformations. We perform the learning process in a supervised manner. Our main contribution is to show that such an approach allows us to obtain an increase of classification accuracy by 48% when compared to the system without a rectification module, tested on the same dataset. The second contribution is the dataset itself, prepared by us and used in the final stage of our experiments – RD-MNIST (Rotated & Deformed MNIST), described in detail in Section IV-A.

In our paper, we first review other papers related to ours. Then we describe our system with the Spatial Transformer Network and Thin-Plate Splines, and provide information about the datasets used in our work. Next, we document our experiments together with the results. Finally, we provide the reader with conclusions and future work plans.

II. RELATED WORK

In this section, we discuss algorithms for transformation capturing used in the Computer Vision community and the influence of CNNs on modelling transformation-invariant representations. Finally, we provide an overview of end-to-end systems based on deep neural networks that are designed to manage data subjected to deformations.

Handcrafted methods. One of the most popular algorithms for feature extraction is the floating-point descriptor called SIFT [4]. Its features were designed to represent an image independently of scale changes and in-plane rotations. It was the baseline for other algorithms like SURF [5] or KAZE [6]. Another approach was presented by the authors of DaLI [7], which provided additional invariance to illumination changes. However, robustness against out-of-plane rotations was still an issue. The authors of ASIFT [8] successfully tackled this problem and proposed a global descriptor invariant to such transformations. All of the mentioned descriptors operate on floating-point numbers, which increases the computational expense. A common approach to decrease it is to design the feature extractor to operate in the space of binary numbers. Plenty of binary descriptors operating on the Hamming distance have been developed; in this group we have ORB [9], BRISK [10] or – the binary version of KAZE – AKAZE [11].

Machine Learning. Descriptors based on artificial neural networks can outperform traditional, handcrafted methods in terms of accuracy and robustness to deformations. A common approach is to use Siamese Networks [12] – a technique based on comparing the responses of two identical neural branches processing input images. Based on this model, the authors of [13] proposed a method that performs feature extraction. Another approach is to perform end-to-end computations in a neural network manner, covering the entire pipeline: feature detection, extraction and matching. On top of that assumption, the descriptor of [14] was proposed. Even though learned descriptors can outperform state-of-the-art descriptors in the quality of description, they are not entirely free from significant drawbacks. One of them is the high computational cost, given that they need GPU units and extensive datasets for training; neural network algorithms are also significantly slower due to this fact.

Systems. As already mentioned, a standard technique to manage input data distortions is to increase system robustness by data augmentation [1]. However, it requires enlarging the model, which may not be desirable because of limited hardware resources and computation time. Jaderberg et al. [2] proposed the Spatial Transformer Network (STN), which can perform transformations of input data such as scaling, cropping, rotation or even non-rigid deformation. Non-rigid deformation interpolation can be performed inside this module by so-called Thin-Plate Splines (TPS) [3], extensively used in our work too. On top of the STN, the authors of [15] created a system that recognises irregular text. Another approach was proposed by [16]: their model is able to perform custom rotations on input images or 3D objects. An improved pooling mechanism, called TI-Pooling, was presented in [17] and lets researchers achieve invariance under deformations.

III. MODEL

In this section, we introduce our model, which consists of two main modules. The first is the Spatial Transformer Network [2], required for performing the thin-plate-spline interpolation of deformed images. The second module is a hand-written digit classifier trained on the MNIST dataset. An overview of our end-to-end system is presented in Figure 1.

Fig. 1. Proposed architecture. The STN can learn transformations of the images to support classification. We split our experiments into two paths. In our experiments, we verified that the STN could equip us with invariance with respect to in-plane rotations and non-rigid deformations, using rotated and deformed versions of the MNIST dataset.

A. Spatial Transformer

The authors of [2] presented a learnable module that improves the invariance properties of CNNs by performing spatial transformations on input data or on feature maps inside the neural network. Geometric transformations are made using TPS smooth interpolation, which lets us transform one grid of points into another. First of all, a fixed number of source points – called landmarks – is specified by the submodule of the STN called the localisation network (its structure is shown in Figure 5). The target point grid is an evenly distributed pattern of points, so its position is constant. The grid generator takes both grids and produces the transformation – the sampling grid. Both the sampling grid and the input image are directed to a bilinear sampler that creates the rectified image.

B. Thin-Plate Splines

Our grids are just discrete points in the image space that are used to compute the transformation between the two of them. However, they are relatively sparse in comparison with the full image. To enforce smoothness of the resulting images, we have used the thin-plate-spline interpolation method. Behind the name, there is a metallurgical inspiration: bending a thin metal plate that is constrained not to move at specified points of the grid. A few reasons motivate the choice of splines for the task of image rectification. One of them is that these polynomial functions are known to have very simple structures locally, but globally are flexible and smooth, which is desirable in terms of deformation tracking. Also, the result of TPS is a smooth surface that is differentiable. This feature is very important in deep learning due to the back-propagation training algorithm [18]. From the mathematical point of view, TPS interpolation is the two-dimensional analogue of the cubic spline in one dimension.

The objective of this method is to find the unknown function f(x): ℝ² → ℝ², which is known only at distinct locations, using the TPS interpolant w(x) that describes the smooth transformation between two grids of points. The interpolant is represented by a function that satisfies condition (1):

w(x_i) = f(x_i)  for i = 0, 1, ..., N    (1)

A general concept of image rectification using two grids of points is depicted in Figure 2. Our system has to find such a transformation between the two grids of points that ensures the best rectification for the further, specified task. The classical approach is to express the interpolant with a Radial Basis Function (RBF) φ such that φ: [0, ∞) → ℝ. This is a real-valued function dependent only on the distance between points lying in the two grids. In our article, the RBF is expressed as in equation (2):

φ(r) = r² ln r,  where r_i = |x_i − y_i|, i = 1, 2, ..., N    (2)

C. MNIST Classifier

To verify that the STN is able to successfully manage in-plane and local out-of-plane rotations, we prepared our MNIST classifier. The CNNs injected at the end of the entire system are equivariant to translations by design, but not to
Fig. 2. The interpolant is used to find the unknown function which rules the non-rigid deformation. For the image of a deformed object (left image), the localisation network tries to find landmarks, depicted as blue points. The regular grid needed to rectify the image is known, and it is presented in the right picture. TPS interpolation requires the transformation between these two grids. Then the rest of the pixels from the left image are expressed in terms of the designated transformation.
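As an illustrative sketch of the machinery in equation (2) and in the figure above (our own example, not the authors' code; the names `tps_phi` and `tps_kernel` are assumptions), the TPS radial basis φ(r) = r² ln r and the kernel matrix it induces between two landmark grids can be written in a few lines of numpy:

```python
import numpy as np

def tps_phi(r):
    """TPS radial basis phi(r) = r^2 * ln(r), with phi(0) = 0 by continuity."""
    out = np.zeros_like(r, dtype=float)
    nz = r > 0
    out[nz] = (r[nz] ** 2) * np.log(r[nz])
    return out

def tps_kernel(src, dst):
    """Matrix of phi(|src_i - dst_j|) for two (N, 2) landmark point sets."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return tps_phi(d)
```

Fitting the full interpolant w(x) additionally solves a small linear system with an affine part; the sketch only shows the basis evaluation that the interpolation is built on.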

in-plane rotations and non-rigid deformations. Our target is to propose a model that brings invariance to these transformations. The classifier is trained entirely separately from the STN, on the standard version of the MNIST dataset. The aim is to obtain a minimal representation of the chosen, undistorted images in the feature space. When the image of a deformed object is provided at the input, translational invariance is obtained directly from the CNN architecture, by sliding replicas of the filter windows over the entire image. However, rotational and non-rigid-deformation invariance is provided by the STN. Our architecture is depicted in Figure 1.

IV. EXPERIMENTS

In this section, we provide information about the datasets used in the experiments and about the tests that constitute the evaluation of the proposed end-to-end system.

A. Deformed & Rotated MNIST Dataset

First, we train our model on the rot-MNIST¹ dataset to check if it can bring some rotation invariance to the existing model – the classifier. Next, we show that to the rotation invariance we can add deformation invariance, which is our main contribution. To verify that, we also used our own dataset – Rotated & Deformed MNIST (RD-MNIST) – which consists of 2600 training samples, 260 validation samples and 560 test samples. We imposed MNIST digits on the cloth model tossed over a cup presented in [19], using Blender. The cloth was subjected to randomly chosen rotations about the X, Y and Z axes, with values of ±5°, ±5° and ±180°, respectively.

¹http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations [access: 04-2018]

B. Classification

Our first target was to prepare a model that brings in-plane rotation invariance to our system. In this step we used a baseline classifier trained to the level of ~90% accuracy, whose architecture is depicted in Figure 5. The training dataset for this model was the standard version of MNIST, so it was not trained to capture any in-plane rotations. It is worth mentioning here that achieving the highest possible classification accuracy is not the point of our paper. In our experiments, we focus on managing spatial transformations, expressed in the improvements of accuracy, rather than on the result itself.

Having the trained classifier, we added the STN module before it to check whether the TPS interpolation brings any rotation invariance. Examples are presented in Figure 4. It is a known fact that CNNs are sensitive to such transformations, and in-plane rotations will surely degrade the classification score. The implementation of the Spatial Transformer with thin-plate splines successfully let us improve the classification result on the rot-MNIST dataset.

Fig. 3. Examples of rectified images coming out of the STN module, used as input for the adjacent classifier model. The left image in each pair is the output of our system, which successfully manages non-rigid deformations and improves the classification score.

The next step was to train our system on RD-MNIST and to verify that the STN can be used to capture not only rigid transformations but also out-of-plane local rotations. Exemplary results are depicted in Figure 3. Usage of the STN significantly increased the classification accuracy on the RD-MNIST dataset to ~60%, which is a ~48% boost in comparison to testing the bare classifier without the STN module on the same dataset. This result proves that our implementation of the STN gives us not only in-plane rotation invariance; there is also a reason to think about non-rigid-deformation invariance.

From the results presented in Table I we can draw the conclusion that the STN performs best on in-plane rotations, because the result on rot-MNIST is equal to that of the classifier trained on regular MNIST, so full invariance was
Fig. 4. Results of using the STN in the task of digit classification on the rotated MNIST dataset. The left image in each pair is the output of our system, and the right image depicts the input. The results show that the STN module can adequately adapt to input data rotated in-plane, although this type of transformation did not appear in the MNIST classifier training stage.
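The rectified outputs above are produced by the bilinear sampler described in Section III-A. As a minimal, hedged illustration (our own sketch under simplifying assumptions, not the authors' implementation), bilinear sampling of a 2-D image at arbitrary real-valued grid coordinates can be written as:

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample a 2-D image at real-valued (y, x) locations.

    img: (H, W) array; ys, xs: arrays of sampling coordinates.
    Coordinates are clipped to the image borders.
    """
    h, w = img.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    # interpolate along x on the two neighbouring rows, then along y
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

In an STN, the sampling grid produced by the grid generator supplies the `ys`/`xs` coordinates for every output pixel; the same operation is differentiable with respect to the grid, which is what makes end-to-end training possible.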

provided. The decrease in our score between the bare classifier tested on rot-MNIST and the one tested on RD-MNIST shows that out-of-plane rotations are clearly more complicated. Still, the boost in classification accuracy when using the STN classifier is significant.

TABLE I
RESULTS OF TESTING OUR MODELS ON VARIATIONS OF THE MNIST DATASET. THE STN SUCCESSFULLY IMPROVES THE CLASSIFICATION SCORE.

Model            Dataset     Accuracy
Classifier       MNIST       90%
Classifier       rot-MNIST   27%
STN classifier   rot-MNIST   90%
Classifier       RD-MNIST    12%
STN classifier   RD-MNIST    60%

V. IMPLEMENTATION DETAILS

Each model used in our experiments is depicted in Figure 5. Convolutional (conv) and transposed convolutional (deconv) layers are described according to the following scheme:

(de)conv: kernel, stride, number of filters

Each layer has an ELU activation function to avoid the vanishing gradient. Furthermore, batch normalisation and max-pooling were used after each layer (if not specified otherwise). Fully-connected layers are described by the number of neurons in the layer and the activation function. Training was performed on an NVIDIA TITAN Xp graphics card using the TensorFlow framework.

Fig. 5. Neural network architectures used in our experiments. Each of them was a base model to which the STN had to adapt to improve its performance.

Very important is the initialisation of the landmark grid pattern. The weights in the last fully-connected layer of the localisation network had to be manually set to zeros to avoid a failure in convergence; the initial values of the landmarks are then set as the biases. Otherwise, the optimisation problem of finding the best positions of the landmarks in the input becomes too complicated and the model cannot find a solution. The authors of [15] also confirmed these findings. In our implementation, we used K = 12 landmarks for the target grid – the localisation network produces 24 XY-coordinate values that are then reshaped into a 4x3 grid. The last layer of the localisation network has the tanh(·) activation function, to produce values in the range (-1, 1) that fit the normalised XY-coordinates of the image.

VI. CONCLUSIONS

In this work, we proposed a method of interpolation of non-rigid deformations using the thin-plate-splines method implemented in pair with the Spatial Transformer. We carried out tests with variations of the MNIST dataset – rotated and deformed. As a result, we successfully managed to overcome the difficulties arising from non-rigid deformations and improved the classification score on RD-MNIST by 48%. We hope that, using the STN, in the future we can connect the translation of grid points in the image space with the translations of the corresponding 3D points, which will let us track non-rigid deformations.

REFERENCES

[1] D. A. van Dyk and X.-L. Meng, “The art of data augmentation,” 2001.
[2] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 2017–2025. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969442.2969465
[3] J. Duchon, “Splines minimizing rotation-invariant semi-norms in Sobolev spaces,” in Constructive Theory of Functions of Several Variables, W. Schempp and K. Zeller, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1977, pp. 85–100.
[4] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision (IJCV), pp. 91–110, 2004.
[5] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346–359, 2008.
[6] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “KAZE features,” in Proceedings of the 12th European Conference on Computer Vision - Volume Part VI, ser. ECCV’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 214–227.
[7] E. Simo-Serra, C. Torras, and F. Moreno-Noguer, “DaLI: Deformation and Light Invariant Descriptor,” International Journal of Computer Vision (IJCV), pp. 1–19, 2015.
[8] G. Yu and J. Morel, “ASIFT: an algorithm for fully affine invariant comparison,” IPOL Journal, vol. 1, 2011. [Online]. Available: http://dx.doi.org/10.5201/ipol.2011.my-asift

[9] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient
alternative to SIFT or SURF,” in IEEE International Conference on
Computer Vision (ICCV), 2011, pp. 2564–2571.
[10] S. Leutenegger, M. Chli, and R. Siegwart, “BRISK: Binary Robust
invariant scalable keypoints,” in IEEE International Conference on
Computer Vision (ICCV), 2011, pp. 2548–2555.
[11] P. F. Alcantarilla, J. Nuevo, and A. Bartoli, “Fast explicit diffusion for
accelerated features in nonlinear scale spaces,” in British Machine Vision
Conf. (BMVC), 2013.
[12] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature
Verification using a ”Siamese” Time Delay Neural Network,” in
Advances in Neural Information Processing Systems 6, J. D. Cowan,
G. Tesauro, and J. Alspector, Eds. Morgan-Kaufmann, 1994, pp.
737–744. [Online]. Available: http://papers.nips.cc/paper/769-signature-
verification-using-a-siamese-time-delay-neural-network.pdf
[13] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-
Noguer, “Discriminative learning of deep convolutional feature point
descriptors,” in 2015 IEEE International Conference on Computer Vision
(ICCV), Dec. 2015, pp. 118–126.
[14] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “LIFT: Learned Invariant
Feature Transform,” in Proceedings of the European Conference on
Computer Vision, 2016.
[15] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust scene text
recognition with automatic rectification,” 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 4168–4176,
2016.
[16] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow,
“Interpretable transformations with encoder-decoder networks,” in IEEE
International Conference on Computer Vision, ICCV 2017, Venice,
Italy, October 22-29, 2017, 2017, pp. 5737–5746. [Online]. Available:
https://doi.org/10.1109/ICCV.2017.611
[17] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys, “TI-
POOLING: transformation-invariant pooling for feature learning in
convolutional neural networks,” in CVPR. IEEE Computer Society,
2016, pp. 289–297.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Parallel distributed
processing: Explorations in the microstructure of cognition, vol. 1,”
D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group,
Eds. Cambridge, MA, USA: MIT Press, 1986, ch. Learning Internal
Representations by Error Propagation, pp. 318–362. [Online]. Available:
http://dl.acm.org/citation.cfm?id=104279.104293
[19] R. White, K. Crane, and D. Forsyth, “Capturing and animating occluded
cloth,” in ACM Transactions on Graphics (SIGGRAPH), 2007.


Methods of Enriching The Flow of Information in The Real-Time Semantic Segmentation Using Deep Neural Networks
Jakub Bednarek∗ , Karol Piaskowski∗ , Michał Bednarek†
Poznan University of Technology
∗ Institute of Computer Science
† Institute of Control, Robotics and Information Engineering

Poznań, Poland

Abstract—Semantic segmentation is one of the visual tasks that gained a significant boost in performance in recent years due to the popularisation of Convolutional Neural Networks (CNNs). In this paper, we address the problem of losing information while changing the size of input images during the training of neural models. Moreover, our methods of downsampling and upsampling can be easily injected into current autoencoder models. We show that, without any significant changes in the model architecture, it is possible to noticeably improve the IoU metric. On the popular Cityscapes benchmark, our model achieves an almost 2.5% boost in segmentation accuracy in comparison to the widely known ERF model. Additionally, for real-time usage, we run our network on a GPU comparable to the NVIDIA Jetson TX2, which lets us implement our algorithm in autonomous vehicles.

I. Introduction

Semantic segmentation is one of the most challenging tasks in autonomous robots and vehicles. It aims to label the image in a pixel-wise manner in such a way that each pixel is associated with only one label from a predefined set of terms, like grass, road, sky, etc. A common approach to such labelling is to use traditional methods, like Conditional Random Fields or Markov Random Fields [1].

To divide an image into semantically separate segments, one needs to provide enough information to the algorithm. Simple colour segmentation is not sufficient, due to the infinite number of classes that share a common colour. A truly successful algorithm needs to incorporate additional data, like the spatial relationship of regions or the possibility of objects being close to each other. A very challenging problem in semantic segmentation is the loss of information during data processing that reduces the data resolution. Our main contribution is to overcome this issue by introducing a novel approach to downsampling and upsampling in autoencoders. In this paper, we propose neural-network-driven methods for enriching information in the data flow to tackle the aforementioned tasks. We introduce the attention upsample and wide downsample blocks, whose purpose is to help produce a better image representation. Our methods can be perceived as modifications of the upsampling and downsampling parts of every autoencoder - therefore they can be easily injected into most autoencoders.

Usually, CNN autoencoders suffer from a lack of information - they focus only on predefined levels of image granularity without any reference to a wider context. On the contrary, a human usually easily overcomes a problem of insufficient information. For example, when one cannot clearly state what one is looking at, one can simply zoom the image in, instantly yielding much more information, but in a narrower context. The proposed attention blocks take their inspiration from the human vision system. Our attention upsample block works in a similar way - it looks at the current representation and decides where it should be enriched. On the other side of the autoencoder architecture, we propose Wide Downsample blocks. The single convolutional kernel tries to encode as wide a field of view as possible in a single value. Our motivation was to save information from the neighbourhood of a pixel in the image, assuring that after each downsampling the most representative information will be reproduced at least once among neighbouring positions of the new tensor.

In our paper, we show that for the ERF model [2] significant improvements can be made without a noticeable drop in efficiency, by adding only a few thousand parameters. Finally, our modified ERF achieved 26.01 FPS with a 0.550 Intersection over Union (IoU) metric on images with a resolution of 240x480 px.

The paper is organised as follows. First of all, we review other works related to ours. Then the proposed components are described in detail. Afterwards, results and experiments are documented. Finally, we provide the reader with conclusions and future work plans.

II. Related Work

Along with the dynamic growth of computing capabilities caused by rapid Graphics Processing Unit development, Convolutional Neural Networks became one of the most powerful tools in a wide range of tasks, from image classification to instance segmentation [3], [4], [5], [6]. One of the first works to use CNNs in the semantic segmentation task is [7], where so-called Fully Convolutional Networks, learned end-to-end, performed segmentation. The authors used the [8], [9] or [4] CNN architectures pre-trained on other tasks and fine-tuned

Fig. 1: Comparison of the input image, the ground-truth labelling, and various configurations of the proposed components.
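Configurations like those compared above are scored with the Intersection over Union (IoU) metric reported throughout this paper. A minimal per-class IoU computation in numpy (our own illustrative sketch with an assumed name `iou_per_class`, not the authors' evaluation code) looks like:

```python
import numpy as np

def iou_per_class(pred, target, num_classes):
    """Per-class Intersection over Union for integer label maps.

    pred, target: arrays of class ids with the same shape.
    Returns an array of length num_classes (NaN where a class is absent).
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious
```

The mean of the non-NaN entries gives a mean-IoU figure of the kind quoted in the abstract.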

using transfer learning. The authors of [10] used the U-Net architecture, with direct connections between encoder and decoder, for biomedical semantic segmentation, achieving state-of-the-art results.

Due to the high complexity of the problem of semantic segmentation of a scene, traditional methods are not suitable for any real-time usage. However, multiple works aim to overcome this issue - many researchers have already shown that classical probabilistic methods can be sped up by utilising highly efficient GPUs. The authors of ENet [11], as well as ERF [2], made models that are able to run in real time on embedded devices. Current state-of-the-art models are based on ResNet [12], which uses so-called residual layers to overcome the problem of the vanishing gradient. By applying pyramid-like feature extraction and attention upsampling, the authors of [6] achieved state-of-the-art results on popular benchmarks like [13] or [14]. In [15] and [16], a probabilistic model (a fully-connected CRF) was implemented at the final layers of the network, producing better semantic segmentation. Another approach was presented in [16], where the authors also showed that it can be executed in real time.

III. Proposed Components

In this section, we introduce components designed for enriching the information flow, which can be easily adapted to current autoencoder architectures. The most important motivation for constructing both methods was the need for methods of enriching the information flow in both the downsampling and upsampling parts of a model. Transposed convolution or bilinear upsampling try to expand information encoded in a single value into larger information clusters, which is a challenging task. We propose an enriched upsampling method which also takes into account data from the encoder that was not affected by lossy methods. Most of the information is lost when data is compressed into smaller fields (known as kernels) using pooling methods or simple convolution. To enrich the information flow while downsampling, we propose an extension method.

Fig. 2: An illustration of the Attention Up-Sample module showing the information flow inside the component. The input and output have shapes equal to the input and output shapes from the ERF architecture. All data processing is defined internally.

Firstly, we describe the attention-based upsample method for enriching information in the decoder by adding back information that is lost during the feed-forward pass through the encoder. Then we present the wide downsample block, inspired by the ERFNet downsample methods.

A. Attention Upsample

We designed the attention upsample block in the decoder part of the model as a probability map, detecting places in the upsampled data where the information should be enriched with the corresponding data from the encoder with the same shape. The schematic view is presented in Figure 2. Our starting point is the output of the last layer (before the upsampling block), denoted as X. The corresponding origin data from the encoder, with the same shape as the upsampled data, is denoted as Z. Let us signify W_i(X) as some convolution operation with a trainable filter tensor W_i and input data X. Since we do not consider specific filter tensors, we replace all convolutional filters by W. Firstly, X is processed in the same way as in the ERF model, by applying a transposed convolution operation

with a filter (3x3) and a stride of 2. The produced output is two times bigger in width and height. In the next step, the origin and upsampled data are added together, forming the input for the attention upsample block:

X′ = f(W(X))
X″ = X′ + Z

X″ is the attention upsample input and f is some activation function. Then the attention core operation is computed, producing the probability map A - the places where the information should be enriched - and the complementary probability map A′ - the places where the information is missing:

A = σ(W(X″))
A′ = 1 − A

In equation III-A, σ is a sigmoid activation function. A′ can be computed simply as complementary to A by applying

B. Wide Downsample

with a shape corresponding to the used filter, a following depthwise convolution - proposed by us - expands the field of view encoded in a single value. Assuming the width and height of the convolution filter in the ERFNet is set to 3, the downsample encodes in a single value information about only a 3x3 field. On the other hand, by applying a depthwise convolution with a stride equal to one and a 3x3 kernel, we expand the information about the corresponding fields and finally produce a field of view with a size of 5x5. Since a single depthwise convolution is a computationally cheap operation, improved results were achieved without any significant drop in the network efficiency.

Denote X as the input of a downsample block. We can signify the mechanism that we named Wide Downsample as in Equation III-B, where concat is a simple stacking of the filters of tensors with the same shape:

Y′ = MaxPool2D(X)
Y″ = W_depthwise(f(W(X)))
Y = concat(Y′, Y″)

The scheme of the Wide Downsample is provided in Figure 3.

Fig. 3: Scheme of the Wide Downsample block. The module is a modification of the downsample method used in the ERF network, with an additional depthwise convolution - representing neighbour feature weighting - after the convolutional downsample. The output of the module is defined as the concatenation of the max-pooling product and the data downsampled by convolutions.

IV. Experiments

A. Semantic Segmentation

This section presents an evaluation of our proposed components. First of all, we replaced simple downsampling and/or upsampling with the Wide Downsample and Attention Upsample respectively (see Figure 4). As the baseline, we used the ERFNet. Then, we performed the training stage in the following configurations: only Wide Downsample, only Attention Upsample, using both proposed methods, and without any modification as the baseline. Training was performed on the Cityscapes dataset [13]. Images were rescaled to match the shape of 240x480 px. Because of the real-time target of the semantic segmentation, there was no need for training and evaluating the model on any bigger resolutions. Training takes 200 epochs and evaluation was based on the best 200 achieved results. Size
1 − A function. From now on, upsampled data can be enriched
of the batch was equal to 10 since bigger mini-batches did not
with origin data by ”clearing” data in the information missing
produce better results. As an optimizer, we have Adam with
places and inputting there corresponding data from origin
learning rate equals to 0.0005. During training and evaluation,
data. Such an operation can be easily computed as in the
we calculated Intersection over Union (IoU) and used it as a
equation III-A, where ∗ is an element-wise multiplication of
credible metric for comparison with others configurations. IoU
two tensors with the same shape and Y is an output of the
was calculated for classes and categories shown in tables I and
Attention Upsample block.
II respectively.
The ERF with proposed modifications achieved better re-
Y = A ∗ X 0 + A0 ∗ Z sults in almost every field and has significantly beaten the
ERFNet. Usage of Wide Downsample caused significant im-
B. Wide Downsample provements of average IoU score. Also, the attention modifica-
One of the most challenging problems in autoencoder-like tion produced better results than bare ERF. Noticeably, despite
neural networks is the information lost during downsampling. the significant improvements achieved by using proposed mod-
We propose an improvement of widely known downsampling ifications, using both methods simultaneously produced worst
method used in the ERFNet by applying a weighting of results than baseline. It can be caused by the tender between
neighbouring values in each output filter separately. Since a encoding very wide field of view in one value and enrich
single convolution encodes the information from a window upsampled data with such rich information.
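The Attention Upsample merge described in Section III-A reduces to a per-element convex combination of the upsampled data X' and the encoder data Z. Below is a minimal NumPy sketch, not the authors' TensorFlow code: as a simplifying assumption, the trainable convolution W is mocked by a fixed 1x1 channel-mixing matrix.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def attention_upsample(x_up, z, w_att):
    """Merge upsampled decoder data x_up (X') with encoder data z (Z),
    both of shape H x W x C.  w_att mocks the trainable convolution W
    as a 1x1 (C x C) channel-mixing matrix."""
    x2 = x_up + z                    # X'' = X' + Z
    a = sigmoid(x2 @ w_att)          # A   = sigma(W(X''))
    return a * x_up + (1.0 - a) * z  # Y   = A * X' + A' * Z, with A' = 1 - A

rng = np.random.default_rng(0)
x_up = rng.normal(size=(4, 4, 8))     # upsampled decoder features X'
z = rng.normal(size=(4, 4, 8))        # matching encoder features Z
w_att = 0.1 * rng.normal(size=(8, 8)) # hypothetical attention weights
y = attention_upsample(x_up, z, w_att)
```

Because A = σ(·) lies in (0, 1), every value of Y falls elementwise between the corresponding values of X' and Z, which is what lets the block "clear" missing information and refill it from the encoder.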

165
Fig. 4: The scheme illustrates an example of usage of our proposed modifications inside the ERFNet. All 3 downsample methods originally used in the ERF are replaced by our Wide Downsample. On the other hand, the Attention Upsample is used only for the first two upsample blocks, since the last upsample would have to use an image as input, which cannot be a good representation for the labelling task itself.
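The Wide Downsample block of Fig. 3 can be sketched in the same spirit: a max-pooling branch concatenated with a strided convolution followed by the depthwise convolution. This is a plain-NumPy illustration with randomly initialised filters, not the trained TensorFlow implementation.

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2 on an H x W x C tensor."""
    h, w, c = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def conv3x3_stride2(x, k):
    """3x3 convolution, stride 2, zero padding 1.  k: (3, 3, C_in, C_out)."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    h, w = x.shape[0] // 2, x.shape[1] // 2
    out = np.zeros((h, w, k.shape[3]))
    for i in range(h):
        for j in range(w):
            patch = xp[2 * i:2 * i + 3, 2 * j:2 * j + 3]   # 3x3xC_in window
            out[i, j] = np.tensordot(patch, k, axes=([0, 1, 2], [0, 1, 2]))
    return out

def depthwise3x3(x, k):
    """3x3 depthwise convolution, stride 1, zero padding 1.  k: (3, 3, C)."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    h, w, _ = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.einsum('abc,abc->c', xp[i:i + 3, j:j + 3], k)
    return out

def wide_downsample(x, k_conv, k_dw):
    y_pool = maxpool2x2(x)                                         # Y'  = MaxPool2D(X)
    y_conv = depthwise3x3(np.maximum(conv3x3_stride2(x, k_conv), 0), k_dw)
                                                                   # Y'' = W_dw(f(W(X))), f = ReLU
    return np.concatenate([y_pool, y_conv], axis=-1)               # Y = concat(Y', Y'')

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8, 4))
y = wide_downsample(x, 0.1 * rng.normal(size=(3, 3, 4, 6)),
                    0.1 * rng.normal(size=(3, 3, 6)))
```

For an 8x8x4 input with 6 convolutional output channels, the result has shape 4x4x10: the 4 pooled channels stacked with the 6 convolved-and-reweighted ones.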

Class            ERF    Wide   Attention  Wide + Attention
Road             0.954  0.955  0.955      0.949
Sidewalk         0.685  0.688  0.674      0.664
Building         0.847  0.852  0.843      0.833
Wall             0.310  0.300  0.264      0.249
Fence            0.305  0.312  0.284      0.310
Pole             0.415  0.415  0.417      0.399
Traffic Light    0.342  0.391  0.388      0.337
Traffic Sign     0.465  0.487  0.455      0.418
Vegetation       0.864  0.870  0.865      0.854
Terrain          0.526  0.541  0.500      0.505
Sky              0.872  0.890  0.886      0.895
Person           0.574  0.586  0.570      0.551
Rider            0.315  0.322  0.261      0.300
Car              0.844  0.854  0.854      0.845
Truck            0.386  0.369  0.428      0.283
Bus              0.506  0.571  0.532      0.511
Train            0.080  0.330  0.340      0.026
Motorcycle       0.190  0.184  0.174      0.262
Bicycle          0.516  0.536  0.511      0.514
Score Average    0.526  0.550  0.537      0.511

TABLE I: Presented results were produced by running each configuration on an NVIDIA Titan Xp and picking the best result of 200 epochs. Input image resolution was set to 480 width and 240 height. Best results were highlighted in bold. Each value is an Intersection over Union metric score for the predefined classes. The configuration with Wide Downsample achieved the best results, overcoming the ERF baseline with an IoU score of 0.550.

Category         ERF    Wide   Attention  Wide + Attention
Flat             0.970  0.970  0.969      0.968
Construction     0.850  0.856  0.844      0.833
Object           0.459  0.469  0.460      0.438
Nature           0.866  0.873  0.866      0.856
Sky              0.872  0.890  0.886      0.895
Human            0.607  0.617  0.607      0.593
Vehicle          0.830  0.838  0.835      0.826
Score Average    0.779  0.788  0.781      0.773

TABLE II: Results of each run configuration on the Intersection over Union metric for the predefined categories. The results presented in Table I and here were calculated on the same trained models.

B. Performance

Since the goal was improving the results of a model that is able to perform the semantic segmentation task on a small GPU, we used an NVIDIA GTX 860m GPU for evaluation, because of its comparable performance with the NVIDIA Jetson TX2. Training was performed on an NVIDIA Titan Xp.

Configuration    FPS (480 x 240)  FPS (1024 x 512)  New parameters
ERF              26.57            5.83              0 (0‰)
Wide             26.01            5.79              1250 (0.7‰)
Attention        25.57            5.66              4432 (2.3‰)
Wide Attention   25.31            5.58              5682 (3.0‰)

TABLE III: Summary of performance results gained by all tested configurations with input image resolutions of 480x240 and 1024x512 pixels (width x height). The table also contains the number of new parameters used in each configuration in comparison to the baseline ERF model; in brackets is the share of new parameters in the entire network.

We have shown that our system is able to significantly improve results of a challenging semantic segmentation task by

166
adding only a few thousand parameters to the baseline model. The best results were achieved with the 1250 additional weights (the baseline model has almost 2 million parameters, so the increase amounts to only 0.7‰ of the total number of parameters). At the same time, no significant drop in performance was noticed. The best model achieved 26.01 FPS, which enables us to run the model in real-time applications. As the experimental results show, our solution is scalable to higher resolutions: ERF with Wide Downsample achieves 5.79 FPS when processing images at a resolution of 1024x512 px.

C. Implementation Details

Denote the training set as pairs of input RGB images and labelled target images, Z = {(Xi, Yi)}, i = 1, 2, ..., N, where N is the number of training examples. The semantic segmentation task can then be defined as attaching to each input pixel xi ∈ X a label f(xi) = yi, where yi ∈ {0, 1, 2, ..., n − 1} and n is the number of classes. To solve the semantic segmentation task, we can treat each pixel separately and apply the standard cross-entropy function, known from classification, to the network predictions. Target labels are converted into one-hot encoded vectors. Because of the unbalanced data, a weighting of the loss function for different classes was performed, based on histograms of labels in the training set. Models were implemented using the TensorFlow library [17] in version 1.8.

V. Conclusions

In this work, we proposed methods for improving IoU metric results which can be easily adopted into current autoencoder architectures. Firstly, we made an effort to produce a better image representation. Then, we implemented our methods in an efficient way on the GPU. The number of parameters used in the proposed methods does not significantly affect model inference time. The results presented in this paper show that there exist many possibilities to enrich the information flow in neural networks. Current models that run in real time can be easily boosted. For future work, we plan to adjust the downsampling and upsampling methods to use their full capabilities and to achieve the best results by applying our modifications to the current state-of-the-art models.

References

[1] X. He, R. S. Zemel, and M. A. Carreira-Perpiñán, "Multiscale conditional random fields for image labeling," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, ser. CVPR'04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 695–703. [Online]. Available: http://dl.acm.org/citation.cfm?id=1896300.1896400
[2] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, Jan 2018.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," 2014.
[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," 2013.
[6] H. Li, P. Xiong, J. An, and L. Wang, "Pyramid attention network for semantic segmentation," 2018.
[7] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS'12. USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014.
[10] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," 2015.
[11] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," 2016.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results," http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[15] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected crfs with gaussian edge potentials," 2012.
[16] M. T. T. Teichmann and R. Cipolla, "Convolutional crfs for semantic segmentation," 2018.
[17] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

167
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.    September 19th-21st, 2018, Poznań, POLAND

Dictionary-based through-plane interpolation of prostate cancer T2-weighted MR images
Jakub Jurek, Marek Kociński, Andrzej Materka
Lodz University of Technology, Lodz, Poland
Email: jakubjurekmail@gmail.com

Are Losnegård∗†, Lars Reisæter†, Ole J. Halvorsen∗†, Christian Beisland∗†, Jarle Rørvik∗†, Arvid Lundervold∗†
∗ University of Bergen, Bergen, Norway
† Haukeland University Hospital, Bergen, Norway

Abstract—T2-weighted magnetic resonance images (T2W MRI) of prostate cancer are usually acquired with a large slice thickness compared to the in-plane voxel dimensions and to the minimal significant malignant prostate tumour size. This causes a negative partial volume effect, decreasing the precision of tumour volumetry and complicating 3D texture analysis of the images. At the same time, three orthogonal, anisotropic acquisitions with overlapping fields of view are often acquired to allow insight into the prostate from different anatomical planes. It is desirable to reconstruct an isotropic prostate T2W image computationally, using the 3 orthogonal volumes, instead of directly acquiring a high-resolution MR image, which typically requires elongated scanning time, with higher cost, less patient comfort and a lower signal-to-noise ratio. In our previous work, we followed the above rationale, applying a Markov-Random-Field (MRF) based combination of 3 orthogonal T2W images of the prostate. Our initial results were, however, biased by the quality of the input orthogonal images. These were first preprocessed using spline interpolation to yield the same voxel dimensions and later registered. In this paper, we apply a dictionary learning approach to interpolation in order to increase the resolution of a coronal T2W MRI image. We compose a low-resolution dictionary from the original axial image, calculate its sparse representation by Orthogonal Matching Pursuit and finally derive the high-resolution dictionary to improve the original coronal image. We assess the improvement in visual image quality as satisfying and propose further studies.

Index Terms—Magnetic resonance imaging; prostate cancer; superresolution image reconstruction; dictionary learning; K-SVD

I. INTRODUCTION

The magnetic resonance (MR) image formation process is inherently affected by noise, blur and various artifacts. At the end of the imaging process, a stack of slices constituting a 3D volume is formed. These slices are 2D images reflecting a cross section of the body and having a thickness usually equal to several millimetres. The subjective quality of an image may depend on the resolution of its 2D slices (in-plane), which is typically greater than the slice thickness (through-plane resolution). The ratio of the through-plane to in-plane resolutions of the volume elements (voxels) can reach 5 or more, being the source of voxel anisotropy. Anisotropic voxels are not cubes but cuboids. Image details which do not lie in the imaging plane may then be distorted or lost due to the slice thickness. This has especially strong implications in cancer diagnosis. For example, in the study of prostate cancer, a significant tumour traditionally has a size of at least 5 mm in any dimension. This implies that the smallest, early cancers might be missed or not fully visualized by MR images whose slices are too thick compared to the tumour dimensions. A large slice thickness also complicates 3D texture analysis and volumetry.

Slice thickness can be reduced by tighter spatial sampling. However, this makes imaging less comfortable for the patient and more expensive due to the increased acquisition time. As a consequence, during a long scanning time patients are more likely to move, introducing motion artifacts into the final images. Moreover, acquired high-resolution images are typically noisier than thick-slice images.

Another way to increase the through-plane resolution of anisotropic MR images is offered by various reconstruction and interpolation methods. In reconstruction, usually several images of the same object are combined and additional information is retrieved. Many works have been published in this field [1]–[9]. Reconstruction methods aim at reverting the imaging process, understood as a sequence of downsampling, blurring and noise operations leading to MR image formation. Since this inverse problem is ill-posed, regularization is needed to find optimal solutions.

Interpolation methods [10] are used for various reasons in medical imaging. They can also be used to find missing values when the neighbours are known. However, even the state-of-the-art interpolation methods cannot restore the actual values with perfect accuracy.

Our group has recently published some preliminary results on high resolution T2-weighted (T2W) prostate cancer image reconstruction [11], using the maximum-a-posteriori setting and Markov Random Fields (MAP-MRF) for problem modelling. We found that the latter approach might be useful in restoring isotropic images of the prostate; however, preprocessing steps like interpolation should be improved. Recently, dictionary learning and sparse representation were proposed for upsampling of MR images of the brain [12]. We find this interpolation/reconstruction scheme very promising and propose a modification of the method, applied to T2W images of prostate cancer. In our future work, we plan to merge the MAP-MRF reconstruction with dictionary-based reconstruction. Here, we present our very first results from the application of dictionary learning to resolve anisotropic prostate cancer T2W images.

168
TABLE I: ACQUISITION PARAMETERS OF T2W IMAGES

Plane     RT/ET [ms]  ST [mm]  Gap [mm]  Matrix   FOV [mm]  AT [min]
Sagittal  3030/98     4        0.8       320x256  200x200   3:06
Coronal   3000/98     4        0.4       320x256  200x200   4:05
Axial     4840/84     3        0.6       320x256  200x200   4:18

RT - repetition time; ET - echo time; ST - slice thickness; FOV - field of view; AT - acquisition time.

TABLE II: SUMMARY OF THE RECONSTRUCTION ALGORITHM

1. Collect a HR image
2. Simulate the imaging process on the HR image
3. Upsample the simulated LR image back to HR resolution using interpolation
4. Extract LR patches from the simulated LR image and filter them
5. Reduce the dimensionality of the LR patches by PCA
6. Extract HR patches at valid locations and process them
7. Initialize the LR dictionary, e.g. using extracted patches
8. Run i iterations of OMP-KSVD for dictionary training
9. Compute the HR dictionary using the pseudo-inverse and the LR sparse representation
10. Collect a LR image
11. Follow Steps 3-5 with the LR image
12. Use the LR dictionary to sparse-code the patches using OMP
13. Use the HR dictionary with the sparse representation to get the reconstructed HR patches
14. Combine the patches to obtain the HR image
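The sparse-coding core of Steps 8 and 12 above is Orthogonal Matching Pursuit. A minimal NumPy version of OMP (a didactic sketch, not the code used in the paper) greedily selects the atom most correlated with the residual and refits the coefficients on the growing support by least squares:

```python
import numpy as np

def omp(D, x, T):
    """Orthogonal Matching Pursuit: approximate x ~= D @ w with at most
    T non-zero coefficients.  D is f x K with unit-norm columns (atoms)."""
    residual = x.astype(float)
    support = []
    w = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(T):
        # greedily pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit of the coefficients on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    w[support] = coef
    return w

# A signal that truly is a 2-sparse combination of atoms is recovered
# with T = 2 (up to numerical precision) for this well-conditioned case.
rng = np.random.default_rng(42)
D = rng.normal(size=(100, 40))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 17]
w = omp(D, x, T=2)
```

The exact-sparsity constraint ||wi||_0 = T of problem (1) corresponds to running exactly T greedy iterations here; a practical variant would also stop early once the residual norm falls below a tolerance.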

II. MATERIALS AND METHODS

A. MR images

Multiparametric MRI (mpMRI) of 1 patient with diagnosed prostate cancer was selected from a larger patient cohort, acquired using a 1.5 T whole-body MRI scanner (Avanto, Siemens Medical Systems, Erlangen, Germany) [13]. An integrated endorectal and pelvic phased-array coil (MR Innerva, Medrad, Pittsburgh, PA, USA) was used to enhance image quality. The MR dataset of each patient consisted of images from three techniques: T2W, diffusion-weighted and dynamic contrast-enhanced imaging. For the purpose of this project, we only selected the T2W images, which were acquired in three orthogonal planes: transverse (axial), sagittal and coronal. An experienced radiologist performed prostate segmentation on the T2W axial images. For a summary of the image parameters, see Table I. In our experiment we used the axial image in the training phase and the coronal image in the reconstruction phase. An example slice taken from these two volumes is presented in Fig. 1.

Fig. 1. An axial T2W slice of the prostate in the midgland region. Axial volume (a) and coronal volume with visible thick slices (b).

B. Dictionary-based image reconstruction

The idea behind dictionary learning is to find a matrix called the dictionary, D ∈ R^(f×K), whose columns are called atoms. The K atoms of a dictionary are "words" of length f features and are the basic units that will be linearly combined to reconstruct the output image from the given feature set. The optimal linear combination of atoms is found in the sparse approximation process, whose output is another matrix called the (sparse) representation, W ∈ R^(K×n), storing the weights of atoms for each of the n data vectors. When D and W are known or learned, the image can be reconstructed as I = D × W ∈ R^(f×n).

The problem of finding DL and WL corresponding to a low resolution (LR, thick-slice) image IL is defined as follows:

  DL, WL = argmin_{DL, WL} ||IL − DL WL||^2   s.t.   ∀i ||wi||_0 = T      (1)

In problem (1), T is the sparsity measure, being the total number of non-zero entries in each column of WL. To solve the problem, a good option is to employ the following strategy. When DL is known or initialized, sparse approximation methods (e.g. Orthogonal Matching Pursuit, OMP) are used to find a good estimate of WL. Then, with WL fixed, the dictionary is updated. The update is performed using the K-SVD algorithm. Alternating OMP and K-SVD leads to improved DL and WL and minimizes the constrained ||IL − DL WL||^2 term. For a detailed description of K-SVD and OMP, see [14], [15].

C. Training set construction

Reconstruction of a high resolution (HR) image from its LR version is possible under the assumption that both the LR and HR versions share the same sparse representation and that images (including MRI) are intrinsically pattern-redundant. For the proof of correctness of the first assumption, we refer the reader to [12], [16].

The LR image IL is a consequence of the imaging process, which can be modelled as IL = DPIH + N, where D reflects the downsampling operation (slice selection), P reflects blurring (due to the Point Spread Function (PSF) of the imaging system) and N is the noise. Since finding the exact solution by simply inverting the imaging model leads to a large number of possible solutions, the problem has to be regularized. In dictionary learning, the problem is regularized by using examples from the training data. First, however, the training set has to be constructed. It requires manipulations of the training HR images, such as imaging simulation and filtering, ending with pairs of HR and LR examples.

1) LR image simulation: The two assumed operations, downsampling and blurring, can be easily simulated from a HR image, providing that the parameters are known (such as

169
the slice thickness or the shape of the PSF). This step reflects the information loss during imaging. Next, for technical reasons, the simulated LR image is upsampled back to the original size of IH, using the operator U, yielding ILU. We neglect the noise effect in our application.

2) Extraction of LR patches: The examples sampled from the HR and LR images are small image patches of a specific size. Since interpolation does not bring any new knowledge into the dataset, the patches are extracted from ILU and IH only at certain locations k, i.e. at the interpolation nodes. The LR patches are first extracted from ILU and then high-pass filtered to obtain patch features. Due to their large number, dimensionality reduction by PCA is likely to improve the efficiency of the algorithm, while preserving the majority of the variance in the dataset. The resulting f features (principal components) are used for training.

3) Extraction of HR patches: When a thick-slice image is acquired, the signal from tissues within the scope of a slice is averaged, thus low-pass filtered. High frequency information is lost, and this is the type of information that we want to restore. Therefore, instead of using the HR patches directly, we process them in order to extract the high frequency content. This is achieved by PH = IH − ILU. Thus, the final HR patches contain purely the image content lost in the imaging process, and such will be the nature of the later reconstructed image patches.

4) Training images: In practical applications, there may be very little data available for training. Even when only one image is involved, reconstruction is still possible; such a task is referred to as single-image scale-up. In-plane slices of a thick-slice acquisition can be used to train a dictionary which is later applied to a LR image created from the out-of-plane slices.

Alternatively, when orthogonal MR acquisitions are available, one volume can be used to construct a dictionary and later reconstruct patches from the same imaging plane in the other images. Such an approach was adopted in this paper.

D. Dictionary learning

For dictionary learning, as mentioned previously, one can use K-SVD and OMP. The parameters in this process are 1) the number of iterations i of alternating OMP and K-SVD; 2) the number T of non-zero weights per example in the representation matrix W; and 3) the number of atoms K used to construct the dictionary. This scheme is used to find the dictionary representing the feature set derived from the LR image; thus it solves problem (1) where IL is substituted with PL, a matrix of features obtained by PCA.

The dictionary has to be initialized as some matrix of the desired size f × K. After i iterations, the LR dictionary DL is obtained, as well as the corresponding representation WL (in accordance with the previous assumption, WL = WH). To find the HR dictionary DH, we used a closed-form expression for the pseudoinverse which is suitable for sparse matrices and speeds up computations, DH = PH WL^T (R^+ ((R^T)^+ WL WL^T)), where R is the upper triangular matrix from the QR decomposition of the sparse matrix WL WL^T.

E. Image reconstruction

The HR version of a LR test image can be reconstructed as PH = DH WH = DH WL. The dictionary DH is known from the training phase, so the reconstruction step mainly involves the calculation of the sparse representation of the test image, WLt = WHt. Thus, OMP is applied, using the trained LR dictionary DL, on the LR patches extracted from the test image. To obtain the latter, one has to perform the whole preprocessing pipeline as in the training phase. The test image is upsampled, filtered and then subjected to PCA with the same projection matrix. When WHt is obtained, all the necessary information is known and PHt, the high resolution patches of the test image, can be calculated.

Since the reconstructed patches are, in accordance with the training set formulation, just the HR content lost in the imaging process, the true image content involving both the HR and LR information is obtained by adding the reconstruction to the test image patch by patch. The summed patches are then placed back at the locations k within the test image space from where they were extracted. If the patches overlap, which is mostly the case, the intensity values overlapping at a particular voxel are averaged.

F. Upsampling coronal prostate cancer images

In order to prove the utility of the presented method for reconstruction of prostate images, we performed the following experiment. First, we cropped the axial image to a small region around the prostate using the prostate segmentation. Later, we simulated the degradations introduced by coronal imaging on the axial image by 2D Gaussian blurring of the slices (σ = 1) and downsampling by a factor of 7. The downsampled image was then upsampled using spline interpolation by the same factor.

Next, we constructed a training set consisting of pairs of corresponding 5 × 5 HR and LR patches. 5310 patches were extracted in total. Features of the LR patches were obtained by removing a Gaussian-blurred (with σ = 1) version of each patch from itself. Such a kind of high-pass filtering yielded the highest quality in a pilot experiment, among others including gradient, Sobel and Laplace kernels. After PCA, 231 features were kept out of 1225 to use in dictionary learning.

Following the training set construction, we trained a LR dictionary and its sparse representation. We allowed 3 non-zeros per patch representation, while the number of atoms was equal to the number of training patches. The dictionary was initialized using the LR patches. We only performed 1 iteration of OMP and K-SVD, due to the similarity of the training image and the image to be reconstructed later. The trained representation in these conditions is nearly perfect (overfitting, though desired due to the mentioned similarity). We finished the training phase by calculating the HR dictionary directly from the LR representation and the HR patches, using the pseudo-inverse as given above.

To reconstruct the axial resolution in the coronal image, we first rotated the latter image to get axial slices and then upsampled it by a factor of 7 in the axial plane using 2D spline interpolation. 5 × 5 patches were then extracted and

170
features were calculated using the same filter as in training. Next, the features were projected using the projection matrix obtained by PCA in the training phase. The feature vectors were then sparsely coded using the trained LR dictionary, and the obtained LR representation was multiplied by the HR dictionary to calculate the reconstructed HR patches of the coronal image. The HR patches were then added to the non-filtered upsampled test image patches and placed back into the image grid at the locations from which they were sampled, averaging the overlapping regions.

III. RESULTS

The results of our experiment are presented in Fig. 2. In the absence of suitable ground truth HR images, we based our assessment on a visual comparison of the interpolated axial slices from the coronal image, the reconstructed HR axial slices and the original axial slices from the axial image. We present the reconstruction results for 15 slices of the coronal image, matching the 15 original axial slices. We determined the corresponding slices and XY ranges using real-world voxel coordinates.

IV. DISCUSSION AND CONCLUSIONS

As seen in Fig. 2, the quality of the axial view in the coronal T2W acquisition has subjectively improved with respect to the spline interpolation results. Some details lost in the imaging process, which were not retrieved by spline interpolation, are present in the dictionary-based reconstruction. For example, the round-shaped seminal vesicles in Fig. 2 (n-o) were reconstructed from the very blurry interpolation result. However, we identified several issues that will be relevant in our further studies.

First of all, we constructed the dictionary purely from existing single-image patches and performed only one iteration of OMP and K-SVD. This reduces the time of computations and seems acceptable for the task of reconstructing a corrupted image of exactly the same object (the axial and coronal volumes of the same prostate). However, this dictionary would

desired resolutions.

The difference in the in-plane voxel dimensions between the reconstructed image and the original axial slices is also one of the reasons why a quantitative comparison of the images was impossible. We therefore limited our analysis to visual examination. Since the similarity between the reconstructed axial slices of the coronal image and the original axial slices dramatically increased in terms of this visual examination compared to spline interpolation, we consider our approach relevant.

In our future work, we will continue using the ideas of dictionary learning for the reconstruction of MR images. The presented method could also be adapted to other imaging techniques used for prostate imaging, like diffusion-weighted or dynamic contrast-enhanced imaging. It is also applicable to all situations where orthogonal acquisitions are available.

REFERENCES

[1] H. Greenspan, G. Oz, N. Kiryati, and S. Peled, "Mri inter-slice reconstruction using super-resolution," Magnetic resonance imaging, vol. 20, no. 5, pp. 437–446, 2002.
[2] E. Carmi, S. Liu, N. Alon, A. Fiat, and D. Fiat, "Resolution enhancement in mri," Magnetic resonance imaging, vol. 24, no. 2, pp. 133–154, 2006.
[3] F. Rousseau, O. A. Glenn, B. Iordanova, C. Rodriguez-Carranza, D. B. Vigneron, J. A. Barkovich, and C. Studholme, "Registration-based approach for reconstruction of high-resolution in utero fetal MR brain images," Academic Radiology, vol. 13, pp. 1072–1081, Sep 2006.
[4] A. Souza and R. Senn, "Model-based super-resolution for mri," in Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, pp. 430–434, IEEE, 2008.
[5] A. Gholipour, J. A. Estroff, M. Sahin, S. P. Prabhu, and S. K. Warfield, "Maximum a posteriori estimation of isotropic high-resolution volumetric mri from orthogonal thick-slice scans," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 109–116, Springer, 2010.
[6] F. Rousseau, K. Kim, C. Studholme, M. Koob, and J.-L. Dietemann, "On super-resolution for fetal brain mri," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 355–362, Springer, 2010.
[7] J. Woo, E. Murano, M. Stone, and J. Prince, "Reconstruction of high-resolution tongue volumes from MRI," IEEE Transactions on Biomedical Engineering, vol. 59, no. 12, pp. 3511–3524, 2012.
[8] E. Van Reeth, I. W. K. Tham, C. H. Tan, and C. L. Poh, "Super-resolution in magnetic resonance imaging: A review," Concepts in
Magnetic Resonance Part A, vol. 40A, no. 6, pp. 306–325, 2012.
probably be unsuitable for other T2W images of the prostate [9] A. Gholipour, O. Afacan, I. Aganj, B. Scherrer, S. P. Prabhu, M. Sahin,
and reconstruction of those would require new training. Al- and S. K. Warfield, “Super-resolution reconstruction in frequency, image,
ternatively, one could use multiple axial images from various and wavelet domains to reduce through-plane partial voluming in mri,”
Medical physics, vol. 42, no. 12, pp. 6919–6932, 2015.
prostates to train a more general dictionary. [10] T. M. Lehmann, C. Gonner, and K. Spitzer, “Survey: Interpolation
Secondly, in the preliminary experiment presented in this methods in medical image processing,” IEEE transactions on medical
paper we only used a single filter to derive image features. It is imaging, vol. 18, no. 11, pp. 1049–1075, 1999.
desired to test a range of high-pass filters for the reconstruction [11] J. Jurek, M. Kociński, A. Materka, A. Losnegård, L. Reisæter, O. J.
Halvorsen, C. Beisland, A. Lundervold, et al., “Reconstruction of high-
task. It is also tempting to extend the applicability of the resolution t2w mr images of the prostate using maximum a posteriori
method to 3D image patches. approach and markov random field regularization,” in Signal Processing:
Thirdly, the method is limited to integer upsampling factors, Algorithms, Architectures, Arrangements, and Applications (SPA), 2017,
pp. 96–99, IEEE, 2017.
which implies that only closest voxel size to full isotropy can [12] Y. Jia, Z. He, A. Gholipour, and S. K. Warfield, “Single anisotropic
be achieved. In our case, with the slice thickness equal to 3-d mr image upsampling via overcomplete dictionary trained from in-
4.4 mm, the upsampling factor of 7 leads to the final reso- plane high resolution slices,” IEEE journal of biomedical and health
informatics, vol. 20, no. 6, pp. 1552–1561, 2016.
lution in the reconstructed direction of approximately 0.628 [13] L. A. Reisæter, J. J. Fütterer, O. J. Halvorsen, Y. Nygård, M. Biermann,
mm (compared to the resolution of 0.625x0.625 mm in the E. Andersen, K. Gravdal, S. Haukaas, J. A. Monssen, H. J. Huisman,
other directions). Thus, the reconstructed voxels are not fully, L. A. Akslen, C. Beisland, and J. Rørvik, “1.5-t multiparametric mri
using pi-rads: a region by region analysis to localize the index-tumor
however close to isotropic, up to several µm. Nonetheless, of prostate cancer in patients undergoing prostatectomy.,” Acta Radiol,
this discrepancy can be larger for some slices thicknesses and vol. 56, pp. 500–511, May 2014.

[Figure 2: panels (a)-(o)]
Fig. 2. Reconstruction of the axial slices from the coronal image. The presented results show upsampling by 2D spline interpolation (upper rows), our method
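The patch-placement step described above (sparse codes multiplied by the HR dictionary, added to the unfiltered upsampled patches, and averaged over overlapping regions) can be sketched as follows. This is a hypothetical NumPy rendering with our own function and variable names; the OMP and K-SVD stages are assumed to have already produced the sparse codes.

```python
import numpy as np

def reconstruct_from_patches(codes, hr_dict, base_patches, positions,
                             image_shape, patch_size):
    """Place HR detail patches back into the image grid, averaging overlaps.

    codes:        (n_atoms, n_patches) sparse LR coefficients (e.g. from OMP)
    hr_dict:      (patch_size**2, n_atoms) trained HR dictionary
    base_patches: (patch_size**2, n_patches) unfiltered upsampled patches
    positions:    list of (row, col) top-left corners of each patch
    """
    # HR patch = dictionary reconstruction of the detail + interpolated base
    patches = hr_dict @ codes + base_patches
    acc = np.zeros(image_shape)
    weight = np.zeros(image_shape)
    for k, (r, c) in enumerate(positions):
        patch = patches[:, k].reshape(patch_size, patch_size)
        acc[r:r + patch_size, c:c + patch_size] += patch
        weight[r:r + patch_size, c:c + patch_size] += 1.0
    weight[weight == 0] = 1.0  # avoid division by zero outside sampled patches
    return acc / weight
```

Overlapping patch contributions are averaged by the accumulated weight map, which is the behaviour described in the text.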
[14] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
[15] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Signals, Systems and Computers, 1993. Conference Record of The Twenty-Seventh Asilomar Conference on, pp. 40-44, IEEE, 1993.
[16] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces, pp. 711-730, Springer, 2010.

SPa 2018: SIGNaL PROCESSING algorithms, architectures, arrangements, and applications
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

CRF-Based Clustering of Pharmacokinetic Curves from Dynamic Contrast-Enhanced MR Images

Jakub Jurek∗, Mateusz Pelesz†, Andrzej Wojciechowski†, Artur Klepaczko∗, Marek Kociński∗, Andrzej Materka∗,
Are Losnegård‡§, Lars Reisæter§, Ole J. Halvorsen‡§, Christian Beisland‡§, Jarle Rørvik‡§, Arvid Lundervold‡§
∗ Lodz University of Technology, Lodz, Poland
† MSWiA Hospital, Lodz, Poland
‡ University of Bergen, Bergen, Norway
§ Haukeland University Hospital, Bergen, Norway
Email: jakubjurekmail@gmail.com

Abstract—Traditionally, analysis of Dynamic Contrast-Enhanced Magnetic Resonance Images (DCE MRI) requires pharmacokinetic modelling to derive quantitative physiological parameters of the tissue. Modelling, however, is a complex task, and many competing models of contrast agent kinetics and tissue structure have been proposed. Alternatively, raw DCE data can be analysed to find correlations with pathology in the tissue or other desired effects, for example by clustering. In this paper, we propose a new method for DCE MRI timeseries clustering. We model the data space as a Conditional Random Field (CRF) and optimize the objective function in order to find cluster labels for all timeseries. The method is unsupervised and fully automatic. We also propose a strategy to speed up the clustering process using Support Vector Machines. We demonstrate the utility of our method on two distinct problems: prostate cancer localization and healthy kidney compartment segmentation.

Index Terms—CAD, prostate cancer, kidney segmentation, unsupervised learning, clustering, DCE.

I. INTRODUCTION

DCE allows visualizing and examining perfusion in biological tissues. This examination provides a 4D dataset in which an image voxel can be analysed in the time domain. Thus, for each 3D voxel, a timeseries illustrating the changes of intensity is available. This intensity variation is nonlinearly correlated with the concentration of the injected contrast agent. DCE is widely used in modern medicine, with emerging new applications such as prostate cancer diagnosis, kidney segmentation, or noninvasive assessment of the kidney glomerular filtration rate.

DCE analysis traditionally required pharmacokinetic modelling [1], [2] to derive quantitative measures with a physiological background, such as Ktrans, kep or Ve [3]. Since modelling is a demanding task and several models describing tissue perfusion have been proposed, some researchers turned to the use of model-free parameters or heuristic criteria [4], [5]. In cancer studies, timeseries are often believed to belong to three curve types: normal, suspicious or malignant. Therefore, it is reasonable to consider timeseries clustering in order to group them according to their similarity, omitting the need to extract either model-based or model-free parameters [6].

Timeseries clustering can also be adapted to problems other than cancer detection, since curve shape can vary also within normal tissue. An example is kidney compartment segmentation, where the cortex and medulla exhibit different perfusion. Kidney compartment segmentation was tackled in many previous works, among others by clustering [7], [8].

Based on the above, we believe that the DCE timeseries clustering problem is general and might find application in the study of function or disease of many organs. We propose our own method to solve this problem, adapted with modifications from the domain of gene expression data [9]. The method is based on the concept of Conditional Random Fields (CRFs) [10], a useful tool in image analysis, which has proven its performance e.g. in image segmentation [11].

In our method, we assign voxel labels by optimizing a cost function, i.e. maximizing the label probability having observed the DCE data. To speed up convergence, we sample only a random subset of the available examples for clustering. Next, we train a Support Vector Machine using the cluster labels and the sampled data. We use the classifier to label the remaining examples. We perform initial tests of the clustering method on single cases from prostate cancer and healthy kidney imaging.

II. MATERIALS AND METHODS

A. Imaging data

1) Prostate cancer: We used T2W and DCE TWIST images of one patient from a larger cohort [12], acquired at Haukeland University Hospital, Bergen, Norway. The images were acquired using a 1.5 T whole-body MRI scanner (Avanto, Siemens Medical Systems, Erlangen, Germany). An integrated endorectal and pelvic phased-array coil (MR Innerva, Medrad, Pittsburgh, PA, USA) was used to enhance image quality. Gadoterate meglumine 0.195 ml/kg (Dotarem, Guerbet SA, Aulnay-sous-Bois, France) was used as a contrast agent in DCE MRI. The prostate cancer DCE MRI timeseries consist of 67 volumes acquired with a temporal resolution of 6.16 s. A summary of the imaging parameters is presented in Table I.

Manual segmentation of the prostate was performed by an experienced radiologist (L.R.), including subsegmentation into anatomical zones. For later evaluation purposes, the radiologist performed a PI-RADS-based detection [13] of the tumours in multiparametric MRI (Fig. 1a).

Prostate specimens were obtained after radical prostatectomy (RP) and prepared as described in our previous

TABLE I
SUMMARY OF MR ACQUISITION PARAMETERS [12].

Technique    | Acquisition plane | TR/TE [msec] | Slice thickness [mm] | No. of slices | No. of volumes | Matrix  | FOV [mm] | Acquisition time [min:sec]
T2W          | Axial             | 4840/84      | 3.6                  | 24            | 1              | 320x256 | 200x200  | 4:18
DCE TWIST    | Axial             | 4.24/1.66    | 3                    | 30            | 67             | 512x512 | 192x138  | 6:58
DCE 3D FLASH | Sagittal          | 2.36/0.8     | 3                    | 30            | 74             | 192x192 | 425x425  | app. 6:00

TWIST - time-resolved interleaved stochastic trajectories; FLASH - fast low-angle shot.

study [14]. The prostate was outlined and the tumours were delineated by a pathologist (O.J.H.). Tumour labels from the midgland were transferred from the histology slides to T2W MRI using the method presented in [15].

2) Kidneys: A kidney DCE image of one patient was selected from a cohort of 10 volunteering patients. The images were acquired at Haukeland University Hospital, Bergen, Norway, by a 32-channel 1.5 T whole-body scanner (Magnetom Avanto, Siemens Medical Systems, Erlangen, Germany). A standard six-channel body matrix coil and a table-mounted six-channel spine matrix coil were used for signal reception. Coronal-oblique DCE data were continuously acquired using a spoiled gradient recalled 3D FLASH pulse sequence (Table I). A bolus injection of 0.025 mmol/kg of GdDOTA was administered at 3 mL/s into the antecubital vein using an automated power injector, followed by a 20 ml saline flush. Breathing instructions were given by a CD player. Five seconds after the injection of the contrast agent, the subjects were instructed to hold their breath for 26 seconds for motion-free first-pass perfusion. Subsequent instructions on 15-second breath holds and 25-second free breathing were given during the continuous scan. The 4D kidney DCE dataset consists of 74 volumes acquired with a temporal resolution of 2.3 s. The kidneys were segmented in consensus by two radiologists (M.P., A.W.) into the cortex and medulla regions on the 15th frame (Fig. 1b).

B. Conditional Random Fields formulation

Consider a set of observations (timeseries) X, a set of labels Y and a set of sites (voxel locations) S, where X = {x(i)}, Y = {y(i)}, i ∈ S = {1, 2, ..., n}. We model the CRF as a graph such that X and Y are vertices and edges exist only between interdependent vertices. The edges are associated with potentials U. We also define a clique to be a set of mutually connected vertices. With those definitions, the relations in our graph are as follows. From the Bayes rule of probability:

P(y|x) = p(x|y)P(y) / p(x).   (1)

Since, once the observations are made, the term p(x) is a constant, it follows that

P(y|x) ∝ p(x|y)P(y).   (2)

Within Bayesian estimation, the objective in timeseries clustering is to minimize the cost function associated with selecting a particular label set Y for given observations X. Minimizing the cost is equivalent to the maximum a posteriori (MAP) approach, where the objective is to find the optimal labels as:

Y* = argmax(P(y|x)).   (3)

CRFs model the posterior conditional probability P(y|x) directly, as

P(y|x) = (1/Z) exp( -Σ_{i∈S} U1(yi|x) - Σ_{i∈S} Σ_{i'∈Ni} U2(yi, yi'|x) )   (4)

where Z is the normalizing constant called the partition function, U1 and U2 are the potentials associated with the 1-element and pairwise cliques in the graph, and Ni is some neighbourhood system of site i. For our purposes, we consider only maximal, pairwise interactions, so the term involving U1 is removed from the formula. Substituting (4) into (3) and applying the negative logarithm, we obtain:

Y* = argmax(P(y|x)) ≡ argmin(-log(P(y|x))) = argmin( log(Z) + Σ_{i∈S} Σ_{i'∈Ni} U2(yi, yi'|x) ) ≡ argmin(U2(y|x)).   (5)

Since Z is a positive constant, it can be omitted. The optimization problem thus reduces to the minimization of the pairwise clique potential function U2.

C. Preprocessing of DCE timeseries

DCE timeseries usually present a wide variability of shape and are noisy. This is due to the pharmacokinetic properties of the tissue, T1 differences, patient or organ movement during the scan, and the imaging process. Since the goal is to cluster the timeseries according to some pharmacokinetic pattern, all components resulting from other processes should be filtered out. In order to achieve that, we applied various preprocessing steps, as summarized in Table II. First, the frames (subsequent volumes of the timeseries) of DCE should be registered so that organ and patient movement is corrected. Then, after extracting the timeseries from a ROI in a 4-D DCE volume to a vector τ, the injection timepoint at which the enhancement starts should be estimated. Next, the timeseries might have different baselines due to the influence of T1-weighting. To reduce that effect, the curves are set to a zero baseline by calculating the median value of the pre-injection signal and subtracting it from the whole timeseries. Following baseline correction, imaging noise should be removed by filtering. Finally, differentiation is performed to transform the timeseries: x(i) = τ(i) - τ(i-1), i = 2, 3, 4, ..., T.

D. Conditional Random Fields and clustering

As mentioned previously, our CRF-based approach is concluded by the minimization of the pairwise potentials U2(y|x) between the graph vertices. In our implementation, the vertices

x are the preprocessed timeseries and y are their labels. The algorithm outputs a label for each timeseries in an unsupervised manner, based on the minimization of potentials, requiring only a randomly initialized label set and some data-driven knowledge about the dataset (D(i)), as explained below.

[Figure 1: (a) Prostate DCE, (b) Kidney DCE]
Fig. 1. Exemplary slices from sagittal kidney DCE and axial prostate DCE. Blue contours delineate the organs; red contours denote prostate cancer (a) or renal medulla (b).

TABLE II
GENERAL OVERVIEW OF THE DCE MRI PREPROCESSING STEPS AND THE CLUSTERING ALGORITHM

Preprocessing:
1. Register DCE frames to suppress organ motion
2. Extract timeseries from the ROI
3. Correct the baseline using the pre-injection signal (optional)
4. Filter the timeseries to reduce noise
5. Compute the derivative signal x of each timeseries
6. Sample a random subset of x for clustering

Clustering:
1. Estimate the local distance threshold D(i)
2. Initialize the label set Y randomly
3. Do until convergence, for each x(i):
   3a. Form the voting pool Ni
   3b. Assign the optimal label from the set {yi, yi', i' ∈ Ni} ((6), (7))

The procedure (see Table II) starts by initializing a random labelling of the observations, i.e. each observation is allowed to form a singleton cluster. As the algorithm converges, the clusters merge. Therefore, the presented algorithm can be viewed as a type of agglomerative clustering.

The algorithm operates at two iteration levels. We call them 'inner' (over the observations x(i)) and 'outer' iterations (repeated over the whole dataset, run until convergence). Within the inner iterations, we used a local iterative optimization strategy, subsequently solving the criterion yi = argmin(U2(yi, yi'|x)).

Following the approach from [9], we extended the notion of a neighbourhood (Ni) of the timeseries x(i) to the notion of a voting pool, which includes observations randomly chosen from the dataset. In such a setting, the CRF is dynamic, and edges between the vertices of the graph can emerge and disappear as the algorithm progresses. This is possible because the spatial distribution of the observations (the spatial neighbourhood) in the image is considered irrelevant.

The voting pool is constructed of three subsets of observations: recommended (MS - most similar), deprecated (MD - most different) and random. In the first outer iteration (when initializing), all observations in the voting pool are randomly selected. In the following outer iterations, only the random subset remains to be selected at random, while the MS and MD observations serve as the memory of the previously seen observations that were 'quite similar' or 'quite different' from the current observation x(i). The MS and MD observations are identified (initialized or updated) by sorting the observations according to a similarity criterion d(i,i'), which we chose to be the Euclidean distance.

Local iterative minimization of the pairwise potential is achieved by applying (6) and (7) within each inner iteration.

U2(yi, yi'|x) = Σ_{i'∈Ni} W(i,i')   (6)

W(i,i') = -(D(i) - d(i,i'))  if yi = yi'
W(i,i') =   D(i) - d(i,i')   if yi ≠ yi'   (7)

D(i) is a variable providing the mentioned data-driven knowledge about the dataset and is the main parameter of the clustering algorithm. It steers the decisions made in each iteration regarding the label assignment and is therefore crucial for the results. D(i) determines the threshold distance that differentiates between intra- and extra-cluster distances in the dataset. In [9], an objective, global threshold D was used. We considered this inappropriate, since the density of the given data distribution might vary, e.g. when the true clusters have different variances. Instead, we applied an important modification and used a subjective, local D(i), determined for each timeseries individually. To calculate D(i), the Euclidean distance of the timeseries x(i) to all other timeseries in the dataset was found. We calculated D(i) simply as the mean of those distances.

E. Speeding up the algorithm

To make the algorithm more efficient, one can use organ masks as a region of interest (ROI). Even with this substantial reduction of the size of the input dataset, the number of timeseries in the ROI can reach tens of thousands, depending on the image resolution and organ size. Such a substantial amount of observations slows down both the computations within the CRF and the convergence of the algorithm. To further improve the running time, we employed a semi-supervised scheme. We first sampled a smaller random subset of the available observations (a fraction of the observations from each slice) and ran the unsupervised CRF-based algorithm as described above. Next, we used the learned labels in a supervised setting in a multiclass Support Vector Machine classifier. The classifier is then applied to the remaining observations to find their labels. Since the training samples are drawn from the same distribution (for each patient case individually), the risk of overfitting or bias is negligible.

F. Implementation details

The radiologists used ITK-SNAP [16] for the assessment of the DCE and T2W images as well as for

[Figure 2: (a) Prostate cancer, (b) Benign prostate tissue, (c) Renal cortex, (d) Renal medulla, (e) Renal pelvis]
Fig. 2. Plots showing the (filtered) curves belonging to each cluster in the prostate cancer localization experiment (a-b) and in the kidney compartment segmentation experiment (c-e). The captions reflect the expected physiological interpretation of each cluster. The mean of the curves is plotted in thick black. Note the differences in curve shape, which could be used to interpret cluster meaning.

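The clustering core of Section II-D can be sketched as follows: a simplified, hypothetical Python rendering of the local threshold D(i) and of one inner-iteration pass applying the label-update rule of equations (6)-(7). The paper's implementation was in MATLAB; here the MS/MD bookkeeping of the voting pool is omitted and the whole pool is drawn at random, so this is an illustration of the rule, not the authors' code.

```python
import numpy as np

def local_threshold(data):
    """D(i): mean Euclidean distance from each timeseries to all others.
    Returns the full distance matrix and the per-observation threshold."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # (n, n) pairwise distances
    n = len(data)
    return dist, dist.sum(axis=1) / (n - 1)      # d(i,i)=0 drops out of the mean

def inner_iteration(data, labels, dist, D, pool_size, rng):
    """One pass over all observations: assign each x(i) the label that
    minimizes the pairwise potential U2 over a random voting pool (eqs. 6-7)."""
    n = len(data)
    for i in range(n):
        pool = rng.choice([j for j in range(n) if j != i],
                          size=min(pool_size, n - 1), replace=False)
        best_label, best_cost = labels[i], np.inf
        # candidate labels: the current one plus those found in the pool
        for cand in {labels[i], *labels[pool]}:
            # W(i,i') = -(D(i)-d) when the labels agree, +(D(i)-d) otherwise
            agree = labels[pool] == cand
            w = np.where(agree, -(D[i] - dist[i, pool]), D[i] - dist[i, pool])
            cost = w.sum()
            if cost < best_cost:
                best_label, best_cost = cand, cost
        labels[i] = best_label
    return labels
```

Starting from a singleton labelling (one label per observation) and repeating `inner_iteration` as the outer loop until the labels stop changing reproduces the agglomerative behaviour described above: clusters merge as observations adopt the labels of similar pool members.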
segmentation. The algorithms were implemented in MATLAB R2015a using standard, built-in libraries for image and signal processing. NIfTI images were loaded using a package by Jimmy Shen available at MathWorks (https://www.mathworks.com/matlabcentral/fileexchange/8797-tools-for-nifti-and-analyze-image). The experiments were performed on an Intel Core i3-4000M CPU at 2.40 GHz with 16 GB of memory.

III. RESULTS

A. Prostate cancer localization

First, we applied the algorithm to the prostate cancer dataset. During preprocessing, we corrected the baseline and then removed the pre-injection signal from the timeseries (as a redundant part) using a manually identified enhancement start. The remaining signal was filtered using the Savitzky-Golay low-pass filter with frames of length 21 and fifth-order polynomials to substantially smooth the signal. We empirically checked different numbers of MS and MD examples and sizes of the voting pool, finally selecting MD = 100, MS = 100 and a voting pool size of 250 in the final experiment. Approximately 5% of the 34115 timeseries, yielding 1676, were selected for clustering, sampled randomly from each prostate slice.

The preprocessing, clustering and SVM training/classification steps took approximately 17, 69 and 3 seconds, respectively. The algorithm converged with 2 clusters (see the plots of the timeseries belonging to each cluster in Fig. 2a-b).

To check how well the clusters match the tumour annotations, we calculated the sensitivity (SNS), specificity (SPC) and the Dice coefficient (DSC) for all clusters in the following anatomical regions: the whole prostate gland (WG), the peripheral zone (PZ) and the transition zone (TZ). These metrics were computed in all regions both for the tumour annotations drawn by a radiologist on MR and for those drawn by a pathologist, as described above. Note that the histopathological annotations are only valid for the midgland, corresponding to slices (c)-(l) in Fig. 4.

B. Kidney compartment segmentation

In the second experiment, we applied the method to the kidney compartment segmentation task. We first tried to use exactly the same parameters as for prostate cancer localization, but the obtained results were unsatisfactory. We found the reason within the preprocessing steps. In the final setting, we corrected the baseline using the early part of the curve (9 timepoints), since we found extracting the exact enhancement start not applicable to all curve types of the kidney. We also did not remove the pre-injection signal, for the same reason. For filtering, we used two Savitzky-Golay filters: the first was applied to the first half of the curves with a frame width of 3 and first-order polynomials, and the second was applied to the second half of the curves with the same polynomials but a wider frame, with a width of 15. This is justified further in the Discussion section.

We selected MD = 100, MS = 100 and a voting pool size of 250. Approximately 5% of the 10229 timeseries, yielding 496, were selected for clustering, sampled randomly from each slice showing a kidney.

The preprocessing, clustering and SVM training/classification steps took approximately 7, 8 and 2 seconds, respectively. The algorithm converged with 3 clusters (Fig. 2c-d).

For evaluation, we used the same methodology as in the prostate cancer localization experiment, but considered two regions, the cortex and the medulla. The cluster labels mapped to the image space are shown in Fig. 5.

IV. DISCUSSION

In this paper, we successfully adapted a CRF-based clustering method to the problem of DCE timeseries clustering. In the prostate cancer experiment, our assumption was that the perfusion patterns in the prostate follow the three curve types for normal, suspicious and malignant tissue. However, our algorithm converged at two clusters, which we believe to reflect the normal and malignant curve types. Fig. 2a-b shows that this is indeed the case.
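The preprocessing variants described in the two experiments above can be sketched as follows: baseline correction from the early part of the curve, the kidney-specific two-filter Savitzky-Golay scheme (narrow window on the fast-changing first half, wide window on the slowly varying second half), and the final differentiation x(i) = τ(i) - τ(i-1). The function names and helper structure are our own; window widths and polynomial orders follow the kidney setting stated above.

```python
import numpy as np
from scipy.signal import savgol_filter

def filter_kidney_curve(tau):
    """Smooth a DCE timeseries with two Savitzky-Golay filters:
    a narrow window (3) on the first half, where the enhancement changes
    quickly, and a wider window (15) on the slowly varying second half."""
    tau = np.asarray(tau, dtype=float)
    half = len(tau) // 2
    first = savgol_filter(tau[:half], window_length=3, polyorder=1)
    second = savgol_filter(tau[half:], window_length=15, polyorder=1)
    return np.concatenate([first, second])

def preprocess(tau, n_baseline=9):
    """Baseline-correct using the median of the first n_baseline points,
    filter, then differentiate: x(i) = tau(i) - tau(i-1)."""
    tau = np.asarray(tau, dtype=float)
    tau = tau - np.median(tau[:n_baseline])
    smooth = filter_kidney_curve(tau)
    return np.diff(smooth)
```

Note that a Savitzky-Golay window must be odd and wider than the polynomial order, which the widths 3 and 15 with first-order polynomials satisfy.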

TABLE III
SNS, SPC AND DSC VALUES IN PROSTATE CANCER LOCALIZATION. WG - WHOLE GLAND, PZ - PERIPHERAL ZONE, TZ - TRANSITION ZONE

                Radiological ground truth        Histopathological ground truth
     Cluster   SNS       SPC       DSC          SNS       SPC       DSC
WG   1         65.8252   97.6968   0.5364       63.3209   97.5121   0.4909
     2         34.1748    2.3032   0.0209       36.6791    2.4879   0.0191
PZ   1         65.8252   94.9205   0.5974       82.6861   94.3430   0.6037
     2         34.1748    5.0795   0.0592       17.3139    5.6570   0.0211
TZ   1         N/A       99.0684   N/A          0         99.0107   0
     2         N/A        0.9316   N/A          100        0.9893   0.0185

TABLE IV
SNS, SPC AND DSC VALUES IN KIDNEY COMPARTMENT SEGMENTATION. M - MEDULLA, C - CORTEX

                Radiological ground truth
     Cluster   SNS       SPC       DSC
M    1         39.2668   31.4767   0.2204
     2         60.3259   69.7196   0.4709
     3          0.4073   98.8037   0.0078
C    1         68.5233   60.7332   0.7575
     2         30.2804   39.6741   0.4055
     3          1.1963   99.5927   0.0236
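The SNS, SPC and DSC values reported in Tables III and IV compare a binary cluster mask against a binary ground-truth mask. A small sketch of how such overlap metrics can be computed (our own helper, not the paper's code; SNS and SPC in percent, DSC as a fraction, matching the tables):

```python
import numpy as np

def overlap_metrics(cluster_mask, truth_mask):
    """Sensitivity, specificity and Dice coefficient of a binary cluster
    mask against a binary ground-truth mask (both boolean arrays)."""
    c = np.asarray(cluster_mask, dtype=bool)
    g = np.asarray(truth_mask, dtype=bool)
    tp = np.sum(c & g)    # true positives
    fn = np.sum(~c & g)   # false negatives
    fp = np.sum(c & ~g)   # false positives
    tn = np.sum(~c & ~g)  # true negatives
    sns = 100.0 * tp / (tp + fn) if tp + fn else float('nan')
    spc = 100.0 * tn / (tn + fp) if tn + fp else float('nan')
    dsc = 2.0 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else float('nan')
    return sns, spc, dsc
```

The NaN fallbacks correspond to the N/A entries in Table III, which arise when a region contains no positive ground-truth voxels for a given cluster.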

[Figure 3]
Fig. 3. Clustering of 5% of the patient data mapped to the MR space. Blue contours - tumours found by a radiologist, red - tumour delineations registered from histology slides. Three slices of the prostate image are shown, corresponding to Fig. 4d-f.

The obtained Dice coefficients are, objectively, not very high. The best values were obtained when the evaluation was performed within the PZ of the prostate. The specificity was quite high, reaching up to 97%, at the cost of a lower sensitivity of about 65%. This means that we managed to detect all tumours with quite good accuracy, but also with some false positive detections, as in Fig. 4f. In the TZ, where the pathologist identified a tumour missed by the radiologist (Fig. 4e), our initial results are poor. The TZ tumour was completely missed (however, also by the radiologist). Possible reasons are wrong parameter tuning, promoting clustering solutions with fewer clusters (allowing a higher intracluster variance), or properties of the particular DCE image. At the same time, DCE is known to have a higher predictive value in the PZ, since in the TZ malignant timeseries can be similar to the timeseries of benign prostatic hyperplasia.

The presented approach can still be useful as a part of a computer-aided diagnosis system for prostate cancer, for example serving as an initial filter for tumour candidates. In the future, we will study the application of this method in such a setting. However, a scheme for interpreting the cluster meaning should first be established, for example using heuristic criteria on the curve shape.

We also demonstrated how the algorithm can be used to cluster the timeseries of a healthy kidney in order to segment the anatomical compartments: the cortex and the medulla. We skipped the renal pelvis region in the evaluation (it is incorporated into the cortex region) due to the lack of a valid ground truth. Even so, our algorithm managed to identify some of the curves typical of the pelvis (Fig. 2e), while mistaking the others, which is clearly visible (Fig. 2c-d).

We reached a moderately good DSC for the cortex segmentation of about 0.75. For the medulla, the DSC was lower. The reason for this is false negative samples (Fig. 5i) or an imprecise ROI (multiple areas on the border of the ROI). Many timeseries of the renal pelvis were clustered together with the medullar ones (Fig. 5h-i), which might indicate that subclustering of the latter could lead to the segmentation of that third region as well.

In the kidney segmentation experiment, we applied separate filtering to the two halves of the timeseries. This improved our results compared with the trials in which a single filter was applied to the whole curve; the difference might be caused by the noise of the renal pelvic timeseries, which enhance the latest, in the second half of the DCE exam.

The results of both experiments only indirectly inform about the performance of the method in the clustering task. The labels used in the evaluation yield either anatomical or pathology-related information, but no information about the specific curve types. In order to evaluate the efficiency of our algorithm in the clustering of noisy timeseries of a certain shape, we shall use DCE simulation. We shall also study the correlation of the clusters with quantitative perfusion parameters learned from the timeseries.

The preprocessing steps should be investigated more thoroughly to be as universal as possible. Also, some steps, e.g. baseline correction, might be redundant, since differentiation of the timeseries provides the same result implicitly. Omitting that step would also cancel the need to identify the start of the enhancement, which is demanding.

V. CONCLUSIONS

Our two experiments have shown that even though the clustering algorithm is fully automatic, the preprocessing steps might need to be adapted to the specific application, using knowledge or expectations about the curve types. After tuning, the algorithm allowed us to obtain moderate results both for prostate cancer localization and for kidney compartment segmentation.

This paper was partially supported by the Polish National Science Centre grant no. UMO-2014/15/B/ST7/05227.

REFERENCES

[1] P. S. Tofts and A. G. Kermode, "Measurement of the blood-brain barrier permeability and leakage space using dynamic MR imaging. 1. Fundamental concepts," Magn Reson Med, vol. 17, pp. 357-367, Feb 1991.

[Figure 4: panels (a)-(m)]
Fig. 4. SVM classification results mapped to the MR space. Blue contours show tumours found by a radiologist, red - tumour delineations registered from histology slides.

[Figure 5: panels (a)-(q)]
Fig. 5. SVM classification results mapped to the MR space. Red contours reflect the extent of the medulla within the kidney (blue contour). The region outside the medulla is the renal cortex.
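The semi-supervised speed-up of Section II-E (cluster a random subset, then train a classifier on the learned labels and use it to label the rest) can be sketched as follows. For brevity, a 1-nearest-neighbour classifier stands in here for the multiclass SVM used in the paper; the function name is our own.

```python
import numpy as np

def label_remaining(sampled, sampled_labels, remaining):
    """Assign each remaining timeseries the label of its nearest
    (Euclidean) neighbour among the clustered subset."""
    out = np.empty(len(remaining), dtype=sampled_labels.dtype)
    for k, x in enumerate(remaining):
        d = np.sqrt(((sampled - x) ** 2).sum(axis=1))  # distances to subset
        out[k] = sampled_labels[np.argmin(d)]
    return out
```

Since the subset and the remaining observations are drawn from the same patient, a classifier trained on the subset generalizes to the rest, which is the assumption the paper makes for the SVM stage.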

[2] P. S. Tofts, G. Brix, D. L. Buckley, J. L. Evelhoch, E. Henderson, M. V. Knopp, H. B. Larsson, T. Y. Lee, N. A. Mayr, G. J. Parker, R. E. Port, J. Taylor, and R. M. Weisskoff, "Estimating kinetic parameters from dynamic contrast-enhanced T(1)-weighted MRI of a diffusable tracer: standardized quantities and symbols," J Magn Reson Imaging, vol. 10, pp. 223–232, Sep 1999.
[3] P. C. Vos, T. Hambrock, C. A. Hulsbergen-van de Kaa, J. J. Fütterer, J. O. Barentsz, and H. J. Huisman, "Computerized analysis of prostate lesions in the peripheral zone using dynamic contrast enhanced MRI," Med Phys, vol. 35, pp. 888–899, Mar 2008.
[4] N. F. Haq, P. Kozlowski, E. C. Jones, S. D. Chang, S. L. Goldenberg, and M. Moradi, "A data-driven approach to prostate cancer detection from dynamic contrast enhanced MRI," Comput Med Imaging Graph, vol. 41, pp. 37–45, Apr 2015.
[5] A. Fabijańska, "A novel approach for quantification of time–intensity curves in a DCE-MRI image series with an application to prostate cancer," Computers in Biology and Medicine, vol. 73, pp. 119–130, 2016.
[6] G. Tartare, D. Hamad, M. Azahaf, P. Puech, and N. Betrouni, "Spectral clustering applied for dynamic contrast-enhanced MR analysis of time-intensity curves," Comput Med Imaging Graph, vol. 38, pp. 702–713, Dec 2014.
[7] F. G. Zöllner, E. Svarstad, A. Z. Munthe-Kaas, L. R. Schad, A. Lundervold, and J. Rørvik, "Assessment of kidney volumes from MRI: acquisition and segmentation techniques," American Journal of Roentgenology, vol. 199, no. 5, pp. 1060–1069, 2012.
[8] X. Yang, H. Le Minh, T. Cheng, K. H. Sung, and W. Liu, "Automatic segmentation of renal compartments in DCE-MRI images," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 3–11, Springer, 2015.
[9] C.-T. Li, Y. Yuan, and R. Wilson, "An unsupervised conditional random fields approach for clustering gene expression time series," Bioinformatics, vol. 24, no. 21, pp. 2467–2473, 2008.
[10] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," 2001.
[11] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, pp. 109–117, 2011.
[12] L. A. Reisæter, J. J. Fütterer, O. J. Halvorsen, Y. Nygård, M. Biermann, E. Andersen, K. Gravdal, S. Haukaas, J. A. Monssen, H. J. Huisman, L. A. Akslen, C. Beisland, and J. Rørvik, "1.5-T multiparametric MRI using PI-RADS: a region by region analysis to localize the index-tumor of prostate cancer in patients undergoing prostatectomy," Acta Radiol, vol. 56, pp. 500–511, May 2014.
[13] J. O. Barentsz, J. Richenberg, R. Clements, P. Choyke, S. Verma, G. Villeirs, O. Rouviere, V. Logager, J. J. Fütterer, and the European Society of Urogenital Radiology, "ESUR prostate MR guidelines 2012," Eur Radiol, vol. 22, pp. 746–757, Apr 2012.
[14] J. Jurek, M. Kociński, A. Materka, A. Losnegård, L. Reisæter, O. J. Halvorsen, C. Beisland, J. Rørvik, and A. Lundervold, "Rule-based data-driven approach for computer aided diagnosis of the peripheral zone prostate cancer from multiparametric MRI: Proof of concept," in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 90–95, 2017.
[15] A. Losnegård, L. Reisæter, O. J. Halvorsen, C. Beisland, A. Castilho, L. P. Muren, J. Rørvik, and A. Lundervold, "Intensity-based volumetric registration of magnetic resonance images and whole-mount sections of the prostate," Computerized Medical Imaging and Graphics, pp. 24–30, 2017.
[16] P. A. Yushkevich, J. Piven, H. Cody Hazlett, R. Gimpel Smith, S. Ho, J. C. Gee, and G. Gerig, "User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability," Neuroimage, vol. 31, no. 3, pp. 1116–1128, 2006.


Centerline-Radius Polygonal-Mesh Modeling of Bifurcated Blood Vessels in 3D Images using Conformal Mapping

Carlos Vinhais
ISEP School of Engineering, Polytechnic Institute of Porto, Porto, Portugal
Email: cav@isep.ipp.pt

Marek Kociński and Andrzej Materka
Institute of Electronics, Lodz University of Technology, Lodz, Poland
Email: marek.kocinski@p.lodz.pl

Abstract—Accurate modeling of the human vascular tree from 3D computed tomography (CTA) or magnetic resonance (MRA) angiograms is required for visualization, diagnosis of vascular diseases, and computational fluid dynamics (CFD) blood flow simulations. This work describes an automated algorithm for constructing the polygonal mesh of blood vessels from such images. Each vascular segment is modeled as a tubular object, and a thin plate spline transform is used to generate the corresponding surface from its centerline-radius representation. A novel approach for generating the polygonal mesh of bifurcating vessels based on conformal mapping is presented. A mathematical description of the methodology is also provided. The model is improved by computing local intensity features with subvoxel accuracy, to slightly deform the mesh of the vascular tree for fine-tuning. The proposed algorithm was successfully tested on a 3D synthetic image containing randomly generated vascular branches. Experimental results, confirmed on a real-world Time of Flight MRA, demonstrate that our methodology is consistent and capable of generating high-quality triangulated meshes of vascular trees, suitable for further CFD simulations. Compared to common techniques, conformal mapping proved to be a simple and effective mathematical approach for polygonal mesh modeling of bifurcating vessels.

I. INTRODUCTION

Accurate modeling of the human vascular tree from 3D magnetic resonance (MRA) and computed tomography (CTA) angiograms is required for visualization and diagnosis of vascular diseases, but also for computational fluid dynamics (CFD) blood flow simulations.

An extensive state-of-the-art review of 3D image segmentation methods and techniques is presented in [1]. Most of these works are aimed at discrete image segmentation (e.g. using the Dice coefficient as the measure of algorithm performance); relatively little work has been done on establishing the vessel wall location with subvoxel accuracy. The latter approach is especially important in the case of thin branches of the vessel tree and is difficult to implement in practice, since image noise and artifacts critically affect the modeling accuracy. Of the two popular methods of vessel tree segmentation – Level-Set based [2] and centerline-radius [3] – the second one proved to be better suited to subvoxel-accuracy modeling, as it utilizes the knowledge about the vessel shape included in images [4], [5], [6].

Geometric modeling of a vascular tree on the basis of medical images is a complex task. The main problem is the limited resolution of the image (limited size of voxels) relative to the diameter of vessel branches, and the fact that this diameter spans a wide range of values – from millimeters to micrometers. Other disturbing factors are artifacts and inevitable image noise. It was shown that the centerline-radius based approach is robust to these factors [4]. It allows subvoxel-accuracy modeling of cylindrical vessel tree segments, which in most cases have a circular cross-section with the radius varying along the centerline. This robustness is achieved by utilizing the knowledge about the tubular shape of the branches and the course of the centerline in 3D space, approximated by a smooth function.

A vessel tree is composed of a large number of such cylindrical segments, connected together at bifurcations. There are few publications on mathematical modeling of vessel bifurcations, and the existing models are rather complex [7], [8], [9]. A simple approach to bifurcation modeling, based on joining the triangular meshes which approximate the cylindrical segments, is presented in [5]. It was shown in that work that mesh-based modeling produces plausible results; however, the technique is difficult to automate and the bifurcation modeling accuracy needs to be improved.

This work describes an automated algorithm for constructing the polygonal mesh of bifurcating blood vessels from 3D images with subvoxel accuracy. The proposed algorithm is based on nonlinear coordinate transformations, including thin plate spline transforms and conformal mapping. A mathematical description of the proposed methodology is given in Sec. II. The algorithm was tested and validated on two sets of image data – a synthesized vessel tree of known geometrical properties [10] and a real-world brain magnetic resonance angiogram. The obtained results are presented and discussed in Sec. III. A conclusion is provided in Sec. IV.

Fig. 1. Maximum Intensity Projection (MIP) of: (a) synthetic 3D image and (b) 3D Time of Flight MRA angiogram.

II. MATERIALS AND METHODS

A. Data Sets

The proposed blood vessel modeling algorithm was tested on a 3D synthetic image that contains connected cylinders of different diameters. To model such a numerical phantom, a computer simulator of tree growth was designed and implemented using the Karch method [10]. The size of the 3D image (number of voxels) is 256 × 256 × 256, with isotropic voxel spacing of 1.0 mm. A Maximum Intensity Projection (MIP) of the 3D synthetic image is shown in Fig. 1(a).

To evaluate the robustness of the algorithm when applied to real medical images, a 3D Time of Flight MR brain angiogram (ToF-MRA) was considered in this study. The image was acquired for a healthy volunteer with a 3T experimental MRI system, under approval of the ethical committee of Friedrich Schiller University, Jena, Germany. The image was saved in NIFTI format, with a size of 346 × 448 × 319 voxels, and isotropic spacing of 0.49 mm. High resolution was obtained at the expense of a long acquisition time (16 min.). The coronal MIP of the ToF-MRA image is shown in Fig. 1(b), highlighting elongated regions of flowing blood in brain arteries.

B. Vessel Centerline Extraction

The polygonal mesh construction of a given vessel is based on its centerline-radius representation. The centerline extraction algorithm used in this study is described in previous works [4], [11]; its main steps are as follows: 1) a multi-scale Hessian-based vessel filter [12] was first applied to the 3D input images. To account for the varying diameter along a blood vessel, several filtered images with different kernels were computed [13]. The maximum intensity found for each input voxel among all processed images was taken as the vesselness filter output; 2) the resulting vesselness image was thresholded and centerlines of the vascular tree were initialized automatically as a binary skeleton; 3) all n-furcations (points of bifurcation) of the skeleton were detected, and the skeleton was parsed to generate its nonfurcating segments. Each segment represents a tubular object of the blood vessel system. Only sufficiently long skeletal objects, containing more than a few voxels each, were preserved to maintain representation of elongated tubular structures; 4) spline interpolation was applied to each segment, to generate points and corresponding tangent vectors along its centerline; 5) blood vessel radii were finally estimated from image intensity profiles obtained from each point toward the vessel wall (along different angular directions), by minimizing the root-mean-square fitting errors from a parameterized edge-blurring function [4].

C. Centerline-Radius Representation

The centerline of each tubular object, or vessel, is now represented by the set C of K consecutive points, denoted as:

C = {Ck, tk, ρk},  (1)

where Ck = (xk, yk, zk), k = 0, ..., K−1, denotes the Cartesian coordinates of each point Ck of the centerline, tk is the vector tangent to the centerline and ρk is the estimated mean radius of the vessel cross-section at Ck. The total length L and maximum radius ρmax of the vessel are given by (2) and (3), respectively:

L = Σ_{k=1}^{K−1} dk = Σ_{k=1}^{K−1} d(Ck, Ck−1),  (2)

where dk is the Euclidean distance between two consecutive points Ck and Ck−1 of the centerline, and

ρmax = max_k {ρk}.  (3)

Given the centerline-radius representation (1) of a blood vessel, such a tubular object can therefore be modeled as a regular cylinder, of height H = L and constant radius ρ = ρmax. In a Cartesian coordinate system (model space) defined by the unit vectors (ex, ey, ez), the cylindrical model is defined by a set of K points lying on its axis, e.g. coincident with the Z-axis:

Z = {Zk, ez, ρmax},  (4)

where points Zk = (0, 0, zk), k = 0, ..., K−1, are given by z0 = 0, for k = 0, and zk = zk−1 + dk, for k > 0. From (2), the condition H = L is satisfied.

D. Vessel Mesh Construction

A base mesh of the vessel model, M(ves), is automatically generated by patching a structured grid of N × M points with triangulated quadrilateral patches. The circumferential resolution of the mesh is controlled by N, the number of patches in a single cross-section. The surface is tiled from the first to the last of M cross-sections. The size of a single patch is ds, given by ds = 2πρmax/N, and M, the number of patches along the longitudinal axis of the cylindrical model, is such that M·ds = H. An example of a base mesh with N = 24 is illustrated in Fig. 2(a).

The polygonal mesh of the vessel can be constructed from the corresponding model by means of a nonlinear warp transform, defined by a set of source and target landmarks. Any vertex on the model mesh M(ves) close to a source landmark will be moved to a place close to the corresponding target landmark. The vertices in between are interpolated smoothly

using Bookstein's Thin Plate Spline (TPS) algorithm [14]. Here, the TPS transform describing the nonlinear warp is defined by two sets of control points, evenly spaced on K circles with centers Ck and Zk. The set of K × Q source landmarks, S = {Skq}, k = 0, ..., K−1, q = 0, ..., Q−1, is computed from the cylinder axis Z of the vessel model defined by (4),

Skq = Zk + ρmax (cos θq ex + sin θq ey),  (5)

where θq = 2πq/Q, as shown in Fig. 2(b) for Q = 6. Similarly, the set of K × Q corresponding target landmarks, T = {Tkq}, is computed from the vessel centerline C as

Tkq = Ck + ρk (cos θq nk + sin θq bk),  (6)

where nk and bk are the normal and binormal unit vectors, with nk × bk = tk, of the Frenet-Serret frame defined at point Ck of the vessel centerline, as illustrated in Fig. 2(c).

The polygonal mesh of the vessel, V, is finally constructed by transforming each vertex of the model mesh M(ves), using the TPS transform fTPS(S, T) defined by the sets of source S and target T control points given by (5) and (6), respectively:

V = fTPS(S, T; M(ves)).  (7)

The result of the vessel mesh construction algorithm using the TPS transformation (7) is shown in Fig. 2(d).

Fig. 2. Construction of the polygonal mesh of a vessel (vascular segment): (a) cylinder model of a vessel and (b) associated source landmarks; (c) corresponding target landmarks and (d) polygonal mesh of the vessel generated after TPS transform.

E. Bifurcation Mesh Construction

A novel approach for generating the polygonal mesh of bifurcations based on conformal mapping is now presented. Let Nb be the number of branches of a bifurcation. Each single branch b, b = 0, ..., Nb−1, is modeled (as described in Sec. II-D) with a cylinder of height Hb (the length of branch b) and radius ρmax = 1, ∀b. A single branch is connected to an endpoint (either the first or the last) of a segment in the vessel tree. Therefore, Hb is the distance between the point of bifurcation and the endpoint of that segment. For each branch b, a base mesh Mb is also automatically generated.

Fig. 3. Model construction of a planar bifurcation using conformal mapping. (a) Model of a branch and (b) branch after conformal mapping. Bifurcation model with 3 conformal branches of (c) equal and (d) different lengths.

Consider the base mesh for b = 0, M0, shown in Fig. 3(a). The axis of the cylindrical model is coincident with the X-axis. All vertices of mesh M0, expressed in polar coordinates (r, φ) in the plane XY, are now transformed according to the following conformal mapping, defined by the parameter α:

x̃ = r^α cos(αφ),  (8)

ỹ = r^α sin(αφ),  (9)

where (x̃, ỹ) are the Cartesian coordinates of the mapped vertex. The resulting mesh, M̃0, hereinafter referred to as a conformal branch, is illustrated in Fig. 3(b) with α = 1/Nb = 1/3. Actually, the conformal map defined by (8) and (9) can be expressed in a very simple way, in terms of the complex-valued function f(ξ) = ξ^α of a complex variable ξ = x + jy = re^{jφ} defined in the complex plane XY, with x̃ = Re{f(ξ)} and ỹ = Im{f(ξ)}.

The conformal transformation of a given branch b is therefore expressed as:

M̃b = fCM(α; Mb).  (10)

Assuming, w.l.o.g., Nb = 3, models M̃b of conformal branches b = 1, 2 are constructed with the same transformation (10), and rotated around the Z axis by angles of 120° and 240°, respectively. All conformal branches are then appended together, and duplicated cells (fully overlapped) and vertices are deleted to obtain a smooth and continuous geometrical

model M(bif) of the bifurcation:

M(bif) = ∪_{b=0}^{Nb−1} M̃b.  (11)

Examples of bifurcation models generated by conformal mapping and rotational extrusion are shown in Fig. 3(c) and 3(d). The polygonal mesh of the bifurcation, B, is finally obtained by transforming each vertex of the model M(bif), using a nonlinear transformation similar to (7):

B = fTPS(S, T; M(bif)),  (12)

where, in this case, the TPS transform fTPS(S, T) is defined by source landmarks S, which include the vertices of the tiling patches of the last cross-section of all branches of the bifurcation (see Fig. 4(a)), and target landmarks T, containing the vertices of patches of the end cross-section of the corresponding bifurcating vessels (see Fig. 4(b)). The result of the bifurcation mesh construction algorithm using the TPS transformation (12) is shown in Fig. 4(c).

Fig. 4. Polygonal mesh construction of a bifurcation from 3 connected vessels. (a) Bifurcation model and associated source landmarks, (b) corresponding target landmarks and (c) bifurcation mesh generated after TPS transform.

Similar TPS transforms are applied for connecting all bifurcation models to the corresponding adjacent vascular segments. A single, watertight and continuous polygonal mesh of the entire vessel tree, M(tree), is therefore obtained after merging all generated objects.

F. Mesh Optimization

To account for the noncircularity of the cross-sections of real vessels and adjust all bifurcation meshes to their actual anatomical shape, the modeling algorithm is improved by slightly deforming the mesh M(tree) for fine-tuning. For this purpose, an iterative region-based deformable model was implemented to move and deform the mesh under the influence of 1) an external force, to attract all vertices of M(tree) toward vessel boundaries, and 2) an internal force, designed to keep the mesh smooth during the deformation [1].

Fig. 5. Polygonal mesh optimization: (a) surface normal at vertex vi; (b) mesh deformation along the normal direction; (c) mesh relaxation.

1) Mesh Deformation: At a given iteration t, all vertices vi of the mesh are moved to new locations ṽi, along the surface normal ni (see Fig. 5(a)), according to:

ṽi(t) = vi(t) + ω1 · Ei(t) ni(t),  (13)

where ω1 is the external force weight. In this work, a localized region energy formulation was used; the image intensity-based term Ei is computed by estimating local statistics at vi with subvoxel accuracy,

Ei = κ (I(vi) − Ī(vi))²,  (14)

where κ is a normalization constant to account for the modality of the 3D input image, I(vi) is the intensity value estimated at vertex vi (via bicubic interpolation), and Ī(vi) is the average image intensity inside and outside M(tree):

Ī(vi) = (1/Ns) Σ_s I(vi + s·ni).  (15)

In (15), s defines subvoxel locations of Ns equally spaced points ps = vi + s·ni along the normal direction, as shown in Fig. 5(b), within the sampling range [−smax, smax].

2) Mesh Relaxation: To keep the mesh smooth during the deformation, vertex coordinates are adjusted using Laplacian smoothing [15], [16], [17]. For each vertex ṽi, a topological and geometric analysis is performed to determine which vertices are connected to ṽi. The coordinates of the vertex are then modified according to an average of the connected vertices ṽj, as shown in Fig. 5(c). A relaxation factor ω2 is used to control the amount of displacement of each vertex ṽi:

vi(t+1) = ṽi(t) + ω2 · δṽi(t),  (16)

where δṽi are the δ-coordinates of ṽi [15], i.e. the difference between the absolute coordinates of ṽi and the center of mass of its Nj immediate neighbors in the mesh,

δṽi = (1/Nj) Σ_j (ṽj − ṽi).  (17)

The process repeats for each vertex in a single iteration t. After both deformation (13) and relaxation (16) have been applied, the surface normal ni at each vertex vi is updated for the next iteration t + 1. The optimization process is repeated for a fixed number of iterations Nt.

III. RESULTS AND DISCUSSION

The proposed method for polygonal mesh modeling of blood vessels was applied to the data sets described in Sec. II-A. Algorithms were implemented in the Python programming language, incorporating the Visualization Toolkit [18] accessed through Python wrappers.
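Since the implementation is in Python, the TPS warp fTPS(S, T) of Eq. (7) can be illustrated with a short NumPy sketch. This is a hypothetical reimplementation for illustration only (the function name `tps_transform` and the choice of the 3D biharmonic kernel U(r) = r, in the spirit of Bookstein [14], are assumptions here, not the authors' code, which uses VTK):

```python
import numpy as np

def tps_transform(source, target, points):
    """Thin-plate-spline warp: interpolate the landmark displacements
    (source -> target) and apply the resulting smooth map to `points`.
    Assumes the 3D biharmonic kernel U(r) = r plus an affine part."""
    n = len(source)
    # kernel matrix K_ij = U(|s_i - s_j|) and affine design matrix P = [1, x, y, z]
    K = np.linalg.norm(source[:, None] - source[None, :], axis=-1)
    P = np.hstack([np.ones((n, 1)), source])
    # bordered system [[K, P], [P^T, 0]] [W; a] = [T; 0]
    A = np.zeros((n + 4, n + 4))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.vstack([target, np.zeros((4, 3))])
    coef = np.linalg.solve(A, b)            # kernel weights W (n rows), affine a (4 rows)
    # evaluate the warp at the query points
    U = np.linalg.norm(points[:, None] - source[None, :], axis=-1)
    Q = np.hstack([np.ones((len(points), 1)), points])
    return U @ coef[:n] + Q @ coef[n:]
```

By construction the warp reproduces the landmarks exactly (source points map onto target points) and interpolates smoothly in between, which is the behaviour required of fTPS in Eqs. (7) and (12).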

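The conformal branch construction of Sec. II-E (Eqs. 8-10) is likewise compact when written with complex arithmetic, f(ξ) = ξ^α. The helper names below (`conformal_branch`, `planar_bifurcation`) are illustrative assumptions, not the authors' API, and the merging of duplicated vertices after appending the branches is omitted for brevity:

```python
import numpy as np

def conformal_branch(verts, alpha):
    """Apply the conformal map (x, y) -> (x~, y~) of Eqs. (8)-(9) to mesh
    vertices, via f(xi) = xi**alpha with xi = x + jy; z is left unchanged."""
    xi = verts[:, 0] + 1j * verts[:, 1]            # complex plane XY
    f = xi ** alpha                                # r**alpha * exp(j*alpha*phi)
    return np.column_stack([f.real, f.imag, verts[:, 2]])

def planar_bifurcation(branch_verts, n_b=3):
    """Build an n_b-branch planar bifurcation model: conformally map one
    branch (axis along X) with alpha = 1/n_b, then rotate copies by
    multiples of 2*pi/n_b around the Z axis (Sec. II-E for n_b = 3)."""
    mapped = conformal_branch(branch_verts, 1.0 / n_b)
    out = []
    for b in range(n_b):
        t = 2 * np.pi * b / n_b
        rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                        [np.sin(t),  np.cos(t), 0.0],
                        [0.0, 0.0, 1.0]])
        out.append(mapped @ rot.T)
    # in practice, fully overlapped cells/vertices along the seams
    # would now be merged, as described in the text
    return np.vstack(out)
```

For a unit-radius vertex at angle φ = π/2, for instance, the map with α = 1/3 compresses the angle to π/6 while keeping r^α = 1, which is exactly the wedge-shaped opening visible in Fig. 3(b).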
Fig. 6. Polygonal meshes generated from the 3D synthetic image. Top: vessel tree; Bottom: example of bifurcations, generated with N = 24 (left) and N = 48 (right).

Fig. 7. Polygonal meshes generated from the 3D ToF-MRA brain image. Top: part of the vessel tree; Bottom: bifurcation connecting basilar and vertebral arteries, generated with N = 24 (left) and N = 48 (right).

TABLE I
DEFORMABLE MODEL PARAMETERS

Parameter              Symbol  Value
External force weight  ω1      10.0
Interpolated points    Ns      6
Sampling range         smax    3
Relaxation factor      ω2      0.25
Iterations             Nt      100

A. Polygonal Meshes

All vessel centerlines of 15 segments connected to 7 bifurcations were successfully extracted from the 3D synthetic input image. Polygonal meshes of the tubular segments and bifurcations of the random tree were automatically generated as described in this paper. Results are displayed in Fig. 6. For N = 24 tiling patches per cross-section, a triangulated mesh with 31650 cells and 15926 vertices of the entire vessel tree was obtained. Actually, the resolution of the meshes is dictated only by the parameter N. The meshes of 2 bifurcations of the random tree, with N = 24 and N = 48, are also shown in Fig. 6 for comparison purposes. Fig. 7 illustrates the modeling of blood vessels from the 3D ToF-MRA brain image. Here, the polygonal mesh of part of the vessel tree was constructed with N = 24. Results of generating the bifurcation connecting the segments of the basilar, left and right vertebral arteries are also shown, with resolutions N = 24 and N = 48.

B. Mesh Optimization

A region-based deformable model was used to improve the accuracy of the modeling algorithm, by considering the mesh of a vessel tree as moving under image foreground and background region constraints. The region-based formulation was adopted to overcome some of the edge-based deformable model issues [1] that arise when dealing with noisy images or non-uniform intensity.

Results of the polygonal mesh optimization algorithm are illustrated in Fig. 8, with the parameters listed in Tab. I for both the 3D synthetic and ToF-MRA images. As shown in Fig. 8(b), Laplacian smoothing tends to relax the mesh, making the cells better shaped and the vertices more evenly distributed. On the other hand, the Laplacian operation reduces high-frequency information in the geometry of the mesh. With excessive smoothing, important details may be lost and the surface may shrink towards the centerline.

Meshes were fitted to the actual image intensity by estimating local statistics at each vertex at the appropriate locations in the voxel grid. Subvoxel precision was achieved by bicubic interpolation, retrieving small details of the vessel boundaries, as shown in Fig. 8(d). To judge the feasibility of this fine-tuning, image intensity values were interpolated at all vertex locations, before and after optimization. Histograms of these distributions are plotted in Fig. 9. A significant decrease of the standard deviation σ is observed after optimization: from 33 to 1 for the 3D synthetic image (Fig. 9(a)), and from 156 to 56 for the 3D ToF-MRA image (Fig. 9(b)). In fact, the deformation of a mesh tends to move all its vertices towards vessel boundaries, where image intensities are expected to be similar.

Although the obtained experimental results are promising, mesh optimization requires further investigation. The choice of the deformable model parameters and their effect on accuracy is still under study.
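For reference, one pass of the deformation/relaxation scheme of Sec. II-F (Eqs. 13-17), with the Table I values as defaults, could look as follows in NumPy. This is an illustrative sketch under simplifying assumptions (the surface normals are held fixed instead of being recomputed after every iteration, and the normalization constant `kappa` is an arbitrary choice), not the authors' implementation:

```python
import numpy as np

def optimize_mesh(verts, normals, adjacency, intensity_at,
                  omega1=10.0, omega2=0.25, kappa=1e-4,
                  s_max=3.0, n_s=6, n_iter=100):
    """Iterative mesh deformation (Eqs. 13-15) and Laplacian relaxation
    (Eqs. 16-17). `intensity_at` samples the 3D image at (N, 3) world
    coordinates; `adjacency[i]` lists the neighbours of vertex i."""
    verts = np.asarray(verts, dtype=float).copy()
    s = np.linspace(-s_max, s_max, n_s)        # Ns offsets in [-smax, smax] (Eq. 15)
    for _ in range(n_iter):
        # Eq. 15: mean intensity sampled along the normal of each vertex
        probes = verts[:, None, :] + s[None, :, None] * normals[:, None, :]
        i_bar = intensity_at(probes.reshape(-1, 3)).reshape(-1, n_s).mean(axis=1)
        # Eq. 14: localized region energy term E_i
        energy = kappa * (intensity_at(verts) - i_bar) ** 2
        # Eq. 13: move each vertex along its surface normal
        verts += omega1 * energy[:, None] * normals
        # Eq. 17: delta-coordinates (centroid of the neighbours minus the vertex)
        delta = np.stack([verts[nbrs].mean(axis=0) - verts[i]
                          for i, nbrs in enumerate(adjacency)])
        # Eq. 16: relaxation scaled by omega2
        verts += omega2 * delta
    return verts
```

With a flat intensity field the energy term vanishes and only the Laplacian step acts, pulling every vertex a fraction ω2 of the way toward the centroid of its neighbours; this is the shrinking behaviour discussed above for excessive smoothing.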

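The Fig. 9 evaluation (interpolating image intensities at all vertex locations and comparing their spread before and after optimization) needs only a few lines. The helper below is a hypothetical sketch using SciPy's spline interpolation as a stand-in for the bicubic scheme described above; the function name and the world-to-voxel convention are assumptions:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def vertex_intensity_stats(image, verts, spacing=1.0, order=3):
    """Sample a 3D image (indexed [z, y, x]) at possibly subvoxel vertex
    positions given in world (x, y, z) coordinates, and return the mean
    and standard deviation of the interpolated intensities."""
    # convert world coordinates to fractional voxel indices, (z, y, x) rows
    idx = (np.asarray(verts, dtype=float) / spacing)[:, ::-1].T
    vals = map_coordinates(np.asarray(image, dtype=float), idx,
                           order=order, mode='nearest')
    return vals.mean(), vals.std()
```

A drop in the returned standard deviation after optimization indicates that the vertices have settled on near-iso-intensity vessel boundaries, matching the behaviour reported above (e.g. σ falling from 33 to 1 for the synthetic image).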
Fig. 8. Polygonal mesh optimization. Mesh of bifurcating vessels generated from the 3D synthetic image: (a) before and (b) after optimization. Mesh of the basilar/vertebral arteries bifurcation generated from the 3D ToF-MRA brain image: (c) before and (d) after optimization.

Fig. 9. Distribution of intensity values interpolated at vertex locations, before (red color) and after (blue color) mesh optimization, of polygonal meshes generated from: (a) the 3D synthetic image (before deformation: µ = 65, σ = 33; after: µ = 65, σ = 1) and (b) the 3D ToF-MRA image (before: µ = 389, σ = 156; after: µ = 253, σ = 56).

IV. CONCLUSION

This paper presents a novel methodology for generating polygonal meshes of blood vessels based on conformal mapping and the TPS transform. The feasibility of this approach was demonstrated on a 3D synthetic image. Preliminary results obtained on a Time of Flight MRA volume demonstrate that our method is consistent and capable of generating high-quality triangulated meshes of real-world vascular trees, suitable for further CFD simulations. Conformal mapping proved to be a simple and effective mathematical approach for polygonal mesh modeling of bifurcating vessels.

ACKNOWLEDGMENT

The authors would like to thank Dr. Andreas Deistung and Prof. Jürgen Reichenbach for kindly providing the ToF-MRA image, and Mr. K. Kropidłowski for 3D printing our models.

REFERENCES

[1] S. Moccia, E. De Momi, S. El Hadji, and L. S. Mattos, "Blood vessel segmentation algorithms – Review of methods, datasets and evaluation metrics," Computer Methods and Programs in Biomedicine, vol. 158, pp. 71–91, 2018.
[2] T. Wozniak, M. Strzelecki, A. Majos, and L. Stefanczyk, "3D vascular tree segmentation using a multiscale vesselness function and a level set approach," Biocybernetics and Biomedical Engineering, vol. 37, no. 1, pp. 66–77, 2017.
[3] D. Lesage, E. D. Angelini, I. Bloch, and G. Funka-Lea, "A review of 3D vessel lumen segmentation techniques: Models, features and extraction schemes," Medical Image Analysis, vol. 13, pp. 819–845, 2009.
[4] A. Materka, M. Kociński, J. Blumenfeld, A. Klepaczko, A. Deistung, B. Serres, and J. R. Reichenbach, "Automated modeling of tubular blood vessels in 3D MR angiography images," in 2015 9th International Symposium on Image and Signal Processing and Analysis (ISPA), Sept 2015, pp. 54–59.
[5] M. Kociński, A. Materka, A. Deistung, and J. Reichenbach, "Centerline-based surface modeling of blood-vessel trees in cerebral 3D MRA," in 2016 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Sept 2016, pp. 85–90.
[6] M. Kociński, A. Materka, A. Deistung, J. Reichenbach, and A. Lundervold, "Towards multi-scale personalized modeling of brain vasculature based on magnetic resonance image processing," in 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), May 2017, pp. 1–5.
[7] T. Heistracher and W. Hofmann, "Physiologically Realistic Models of Bronchial Airway Bifurcations," J. Aerosol Sci., vol. 26, no. 3, pp. 497–509, 1995.
[8] F. Yuan, Y. Chi, S. Huang, and J. Liu, "Modeling n-Furcated Liver Vessels From a 3-D Segmented Volume Using Hole-Making and Subdivision Methods," IEEE Transactions on Biomedical Engineering, vol. 59, no. 2, pp. 552–561, Feb 2012.
[9] X. Han, R. Bibb, and R. Harris, "Design of Bifurcation Junctions in Artificial Vascular Vessels Additively Manufactured for Skin Tissue Engineering," Journal of Visual Languages and Computing, vol. 28, pp. 238–249, 2015.
[10] M. Kociński, A. Klepaczko, A. Materka, M. Chekenya, and A. Lundervold, "3D image texture analysis of simulated and real-world vascular trees," Computer Methods and Programs in Biomedicine, vol. 107, pp. 140–154, 2012.
[11] J. Blumenfeld, M. Kociński, and A. Materka, "A centerline-based algorithm for estimation of blood vessels radii from 3D raster images," in 2015 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Sept 2015, pp. 38–43.
[12] A. F. Frangi, W. J. Niessen, P. J. Nederkoorn, J. Bakker, W. P. Mali, and M. A. Viergever, "Quantitative analysis of vascular morphology from 3D MR angiograms: In vitro and in vivo results," Magnetic Resonance in Medicine, vol. 45, pp. 311–322, 2001.
[13] A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, "Multiscale vessel enhancement filtering," in Medical Image Computing and Computer-Assisted Intervention – MICCAI'98, 1998, pp. 130–137.
[14] F. L. Bookstein, "Principal warps: thin-plate splines and the decomposition of deformations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, Jun 1989.
[15] O. Sorkine, "Laplacian Mesh Processing," in Eurographics 2005 – State of the Art Reports. The Eurographics Association, 2005.
[16] A. Nealen, T. Igarashi, O. Sorkine, and M. Alexa, "Laplacian Mesh Optimization," in Proceedings of ACM GRAPHITE. ACM, 2006, pp. 381–389.
[17] Y. Ohtake, A. Belyaev, and I. Bogaevski, "Mesh regularization and adaptive smoothing," Computer-Aided Design, vol. 33, pp. 789–800, 2001.
[18] W. Schroeder, K. Martin, and B. Lorensen, The Visualization Toolkit, 4th ed. Kitware, 2006.


Design and implementation of a device supporting automatic diagnosis of arteriovenous fistula

Marcin Grochowina
University of Rzeszów, al. Rejtana 16, 35-310 Rzeszów, Poland
Email: gromar@ur.edu.pl

Lucyna Leniowska
University of Rzeszów, al. Rejtana 16, 35-310 Rzeszów, Poland
Email: lleniow@ur.edu.pl

Abstract—The article presents an innovative solution enabling automatic assessment of the condition of an arteriovenous fistula on the basis of the sound emitted by blood flowing inside the fistula. The software and hardware tools included in the designed and manufactured prototype of the device are discussed. The methods currently used in medical practice, based on ultrasound imaging techniques, require specialised equipment operated by medical staff. Routine auscultation with a stethoscope is a method burdened with the subjectivism of the doctor and does not allow the results to be compared with historical examinations. The advantage of the presented solution is the possibility of obtaining a quick, reliable and objective diagnosis. The examination may be carried out by unqualified personnel and allows the results of historical examinations to be collected in order to follow the trend of changes in the fistula state.

I. INTRODUCTION

Maintenance of the arteriovenous fistula is a key issue from the patient's point of view. The progressive pathologisation process may eventually lead to a condition that prevents dialysis, which entails temporarily performing it through central venous access¹. In the long term, it requires surgical intervention to reopen the deformed fistula or to create a new one.

¹ A catheter inserted through a subcutaneous blood vessel into a large central vein, typically placed in the subclavian vein.

Current ad hoc fistula auscultation with a stethoscope suffers from subjective judgment and does not allow the history of the examinations to be recorded. An ultrasound examination is performed only in cases of justified suspicion or failure, and therefore often too late to take non-invasive remedial measures.

Providing patients with a tool to monitor the current state of the fistula will make it possible to determine the trend of changes within it and to detect pathological states in advance. In addition, eliminating medical staff from the daily monitoring of the fistula will reduce the cost of the procedure and increase its availability. As a result, the safety of patients with a created fistula will increase and the cost of treating possible complications will decrease. The psychological and physical comfort of patients is also important.

The first mentions of numerical analysis of the acoustic signal emitted by the arteriovenous fistula date from the mid-1980s [1]. In his research, however, Sekhar limited himself to the conclusion that the frequency spectrum of a sound whose source is a pathologically altered fistula differs significantly from the spectrum of the normal fistula signal.

A similar relationship in the time domain was reported by Bosman [2], who analyzed the impact of the fistula state on the intensity of the sound emitted by it. However, he did not propose technical methods that would enable such analysis to be carried out automatically.

The turn of the century and the accompanying significant progress in the field of pattern recognition broadened the front of research on the state of the fistula. Many researchers took up this problem using a wide range of available methods for extracting diagnostic features from the acoustic signal and classification algorithms. The basis of all these studies is the relationship, discovered by Sekhar and Bosman and confirmed in later work, between the lumen of the fistula cross-section and the parameters of the blood stream flowing through it: the size and speed of the flow, its character (laminar or turbulent) and the venous pressure all change [3], [4].

The most frequently indicated relationship between the frequency spectrum of the examined signal and the fistula state is an increase in the amplitude of components with frequencies above 200–300 Hz for pathological cases.

It should be emphasized that most of the research concerns only a binary evaluation of the fistula. Such an approach does not make it possible to detect pathology at an early stage of its formation, but only to find an advanced stage of the disease. Few studies indicate the possibility of multi-level or percentage determination of the degree of pathologization [5], [6], [7], which in consequence may allow tracking the trend of changes in time.

II. MATERIALS AND METHODS

The research material being the basis for the study was collected from 38 patients dialyzed at the dialysis station of the Municipal Hospital No. 2 in Rzeszów. The sampling rate was set to 8 kS/s at a resolution of 16 bit/sample. In total, across several recording series in 2016 and 2017, 156 recordings were obtained, from which 2670 feature vectors were extracted.

Approval of the local Bioethical Commission of the Regional Medical Chamber in Rzeszów was obtained for the tests (No. 17/B/2016).

Fig. 1: Division of the recorded signal into fragments corresponding to single heart beats: a) registered material several heartbeats long, requiring division into shorter fragments; b) a fragment too long for FFT analysis, shortened to 8192 samples; c) a fragment shorter than 8192 samples, padded to the required length with zero-valued samples
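The division illustrated in Fig. 1 can be sketched as follows. This is a minimal illustration, not the device code: the moving-RMS envelope, the window length, the minimum beat spacing and the function names are assumptions of this sketch.

```python
import numpy as np

def split_beats(x, fs=8000, win=400, min_gap=0.4):
    """Split a fistula recording into single-heart-beat fragments.

    The envelope is estimated with a moving RMS window and the signal
    is cut at local minima of that envelope, as illustrated in Fig. 1.
    The window length and the minimum beat spacing are illustrative
    assumptions, not parameters taken from the device.
    """
    env = np.sqrt(np.convolve(x ** 2, np.ones(win) / win, mode="same"))
    gap = int(min_gap * fs)            # shortest admissible beat length
    cuts = [0]
    i = gap
    while i + gap < len(x):
        j = i + int(np.argmin(env[i:i + gap]))   # local envelope minimum
        cuts.append(j)
        i = j + gap
    cuts.append(len(x))
    return [x[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

def to_8192(fragment):
    """Truncate or zero-pad a fragment to 8192 samples (Fig. 1 b, c)."""
    out = np.zeros(8192)
    n = min(len(fragment), 8192)
    out[:n] = fragment[:n]
    return out
```

The fragments returned by `split_beats` partition the recording without gaps, so no material is lost before feature extraction.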

TABLE I: Features extracted in the frequency domain based on the distribution of the bandwidth according to the third-octave scale

Feature    f_d [Hz]   f_0 [Hz]   f_g [Hz]
fft20         18         20         22
fft25         23         25         27
fft31         28         31.5       34
fft40         35         40         44
fft50         45         50         56
fft63         57         63         70
fft80         71         80         89
fft100        90        100        112
fft125       113        125        140
fft160       141        160        180
fft200       181        200        224
fft250       225        250        280
fft315       281        315        354
fft400       355        400        450
fft500       451        500        561
fft630       562        630        707

Fig. 2: Head for sound acquisition (labels in the original drawing: microphone, sound, decompression)

A. Head for sound acquisition

Registration of a high-quality sound signal emitted by the fistula is a priority conditioning the validity of further research. The basic tools for audio recording of fistulas are electronic stethoscopes [5], [8], [9], [10].

The necessity of having, in the device analyzing the fistula state, a head that is easy to use for an unqualified patient, cheaper, and repeatable in terms of measurement quality resulted in the development of our own solution [11] (Fig. 2).

B. Registered sound

The recorded signal is divided into fragments corresponding to single cycles of the heart. This process is illustrated in Fig. 1. The indicated splitting points are local minima of the signal envelope, and each fragment obtained in this way was used to create one feature vector.

From the obtained material, 16 features were extracted, determining the energy content of the signal in bands whose position and width were set according to the third-octave scale [12] in the range from 18 Hz to 707 Hz (Table I).

C. Sets of features

The case space is divided into 6 classes marked with the labels A, B, C, D, E and F, denoting cases of various fistula pathologies. The best cases were labeled A, the worst F. Assignment of individual patients to specific classes was made based on the subjective assessment of the medical staff of the dialysis station and on the results of ultrasound imaging.

After removal of singular points from the set of feature vectors and standardization, the standard deviations for each class, the coordinates of the class centers of gravity and their mutual distances were calculated. The results are summarized in Tab. II and Tab. III.

TABLE II: Distances between the centers of gravity of classes A, B, C, D, E and F

       B       C       D       E       F
A    13.87   20.24   25.43   30.17   33.48
B            13.17   18.38   23.82   32.55
C                     9.57   16.08   25.61
D                            10.91   17.96
E                                    11.14
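The 16 band-energy features of Table I can be computed from the FFT of a single fragment, for example as in the sketch below. This is a minimal illustration: the power-spectrum summation, the function name and the use of NumPy are assumptions of this sketch (the device itself performs this step in Octave m-code).

```python
import numpy as np

# Band edges (f_d, f_g) in Hz taken from Table I.
BANDS = [(18, 22), (23, 27), (28, 34), (35, 44), (45, 56), (57, 70),
         (71, 89), (90, 112), (113, 140), (141, 180), (181, 224),
         (225, 280), (281, 354), (355, 450), (451, 561), (562, 707)]

def band_features(fragment, fs=8000):
    """Energy content of one heart-beat fragment in the 16 bands.

    The fragment is zero-padded or truncated to 8192 samples before the
    FFT, as in Fig. 1; the energy of each band is the sum of the power
    spectrum over the bins falling inside that band.
    """
    x = np.zeros(8192)
    n = min(len(fragment), 8192)
    x[:n] = fragment[:n]
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(8192, d=1.0 / fs)       # bin frequencies, Hz
    return np.array([power[(freqs >= lo) & (freqs <= hi)].sum()
                     for lo, hi in BANDS])

# A pure 200 Hz test tone concentrates its energy in the fft200 band.
t = np.arange(8192) / 8000.0
features = band_features(np.sin(2 * np.pi * 200 * t))
```

With an 8 kHz sample rate and 8192-point FFT the bin spacing is below 1 Hz, so even the narrowest band (18–22 Hz) covers several bins.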
TABLE III: Standard deviations of classes A, B, C, D, E and F

class       A       B       C       D       E       F
std.dev.   8.95   16.40   13.91   13.07   11.62    6.29

TABLE IV: Confusion matrix for a set of 8 features – values expressed in %

         A      B      C      D      E      F
A →   88.2   11.8    0.0    0.0    0.0    0.0
B →   11.1   64.4   24.2    0.3    0.0    0.0
C →    0.0   24.3   64.6   11.1    0.0    0.0
D →    0.0    0.0    6.3   79.5   14.2    0.0
E →    0.0    0.0    0.0   10.0   80.7    9.3
F →    0.0    0.0    0.0    0.0    7.9   92.1

Fig. 3: Simplified visualization of the relative positions of classes A, B, C, D, E and F

According to the principle that data should be visualized whenever possible in order to understand them better, a simplified visualization of the class distribution in the case space was made. Because imagining a 16-dimensional space can cause some hardship to the human mind, projections of the class images onto a plane were made, as shown in Fig. 3. Class centers of gravity are marked with black points, while colored circles (in the Manhattan metric) depict the standard deviation within each class. The image was created in two stages. First, four groups of three classes each (ABC, BCD, CDE and DEF) were projected, and then they were combined in such a way that the distances between each class and its direct neighbors are an exact mapping of the values from Tab. II. Distances to further classes are only approximate.

III. RESULTS AND DISCUSSION

A. Classification system

In the selection of the classification algorithm and its working parameters, a k-NN classifier based on the Manhattan metric and distance-weighted voting was chosen. The selection was made on the basis of previously conducted research [13], [14], [15].

The quality of the classification was assessed on the basis of the collected research material, using the confusion matrix (Tab. IV).

Fig. 4: Percentage values of vector assignments to classes with approximated normal distribution curves for a set of 8 features

An important piece of information following from Tab. IV is the nature of the errors in assigning the examined vectors to classes. The classified elements are mostly assigned correctly to the expected class, while erroneous assignments are limited to the neighboring classes. This phenomenon can be described as an inter-class leak.

Going further along this trail and knowing that:
• the distribution is normal,
• each time during the examination a signal containing at least a dozen or so heartbeats is collected (a dozen or so vectors being the basis for the diagnosis),
the fistula state can be determined with more precision than the distance between the classes in the training set.

B. Hardware

The prototype hardware basis was a microcomputer designed for embedded applications from the Raspberry Pi series. Version 3, based on a 64-bit 4-core ARM processor equipped with a floating-point unit, was selected. The block diagram of the hardware layer is shown in Fig. 5.

Fig. 5: The structure of the hardware layer

The head for sound acquisition from an arteriovenous fistula was built on the basis of a USB C-Media sound card using the CM-108 chip. The microphone preamplifier with the MAX9814 circuit allows the gain to be selected in the range of 40, 50 or 60 dB and has built-in automatic gain control. The basis of the graphical user interface is a 7" LCD display with a resolution of 800×480, equipped with a resistive touch panel.

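The classification scheme described above, k-NN with the Manhattan metric and distance-weighted voting, can be sketched as follows. This is a minimal illustration, not the production classifier (which runs inside WEKA); the value of k, the inverse-distance weighting and the toy data are assumptions of this sketch.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=5):
    """k-NN with the Manhattan metric and distance-weighted voting.

    Each of the k nearest neighbours votes for its class with weight
    1/d, so closer vectors influence the diagnosis more strongly.
    """
    d = np.abs(train_X - x).sum(axis=1)          # Manhattan (L1) distances
    nearest = np.argsort(d)[:k]
    votes = {}
    for i in nearest:
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (d[i] + 1e-9)
    return max(votes, key=votes.get)

# Toy example: two well-separated classes 'A' (best) and 'F' (worst).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y = ['A', 'A', 'F', 'F']
label = knn_predict(X, y, np.array([0.2, 0.1]), k=3)
```

Because erroneous assignments concentrate in neighboring classes (the inter-class leak of Tab. IV), weighting votes by inverse distance tends to pull borderline vectors toward the nearer of the two adjacent classes.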
C. Software

A modular structure of the system was assumed, based on a master module responsible for data flow and control, as well as executive modules responsible for individual elementary operations, not necessarily made using the same techniques and programming languages as the main module (Fig. 6).

Fig. 6: The structure of the software layer (modules in the original diagram: the NefDiag master module with control and visualisation, decimation and standardization blocks, the ALSA, Octave and WEKA back-ends, and the training set)

The master module, responsible for the control flow and data distribution and also acting as the user interface, was written in C++ using the Qt library. The modules that perform the tasks of sound acquisition and data analysis use ready-made solutions available in the form of programming libraries or computing environments, as well as code fragments and scripts written during the research process by the author of this work.

Sound acquisition is carried out using the ALSA (Advanced Linux Sound Architecture) library, which provides communication with the hardware layer, data buffering and error handling. Due to the hardware limitations of the sound card used (sample rate 44 100 Hz or 48 000 Hz), decimation of the signal to the desired sample rate of 8 kHz is necessary.

Then the signal is passed to an external application in m-code running under the control of the Octave package, where it is divided into fragments corresponding to single heart beats, the FFT is calculated and a set of diagnostic features is derived from it. After standardization, the data is transferred to the WEKA package, in which the classification is performed based on the training set stored in an external file.

The diagnosis is presented in the form of a bar chart on the LCD display.

D. The prototype evaluation

The device was enclosed in a dedicated housing made with the rapid prototyping method using a 3D printer. The part responsible for A/D processing, including the microphone, preamplifier and sound card, was separated into its own case and connected to the device with a USB cable (Fig. 7).

Fig. 7: The physical implementation of the NefDiag device

Initial tests with the participation of patients confirmed the correct operation of the device.

IV. CONCLUSION

The specificity of the considered case, in which technical and medical problems interfere, does not allow for an unambiguous and objective evaluation of the obtained results before clinical trials are conducted. From a technical point of view, however, the quality of the solution can be estimated on the basis of statistical analyses and classification quality indicators. It was found that the level of the quality indicators of the classification was satisfactory, and the usefulness of the noise component contained in the input signal was demonstrated, thanks to which it is possible to increase the resolution of the obtained results. The developed solution, implemented in a microprocessor system equipped with a dedicated head, can after preliminary tests be passed for testing in operational conditions. These tests will ultimately confirm the usefulness and effectiveness of the developed solution.

REFERENCES

[1] L. N. Sekhar and J. F. Wasserman, "Noninvasive detection of intracranial vascular lesions using an electronic stethoscope," Journal of Neurosurgery, vol. 60, no. 3, pp. 553–559, 1984.
[2] P. J. Bosman, F. Boereboom, C. J. Bakker, W. Mali, B. C. Eikelboom, P. J. Blankestijn, and H. A. Koomans, "Access flow measurements in hemodialysis patients: in vivo validation of an ultrasound dilution technique," Journal of the American Society of Nephrology, vol. 7, no. 6, pp. 966–969, 1996.
[3] K. Konner, B. Nonnast-Daniel, and E. Ritz, "The arteriovenous fistula," Journal of the American Society of Nephrology, vol. 14, no. 6, pp. 1669–1680, 2003.
[4] L. Kumbar, J. Karim, and A. Besarab, "Surveillance and monitoring of dialysis access," International Journal of Nephrology, vol. 2012, 2011.
[5] H. Mansy, S. Hoxie, N. Patel, and R. Sandler, "Computerised analysis of auscultatory sounds associated with vascular patency of haemodialysis access," Medical and Biological Engineering and Computing, vol. 43, no. 1, pp. 56–62, 2005.
[6] W.-L. Chen, C.-D. Kan, and C.-H. Lin, "Arteriovenous shunt stenosis evaluation using a fractional-order fuzzy Petri net based screening system for long-term hemodialysis patients," Journal of Biomedical Science and Engineering, vol. 7, no. 5, p. 258, 2014.
[7] K. Roth, I. Kauppinen, P. A. Esquef, and V. Valimaki, "Frequency warped Burg's method for AR-modeling," in Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. IEEE, 2003, pp. 5–8.
[8] Y.-N. Wang, C.-Y. Chan, and S.-J. Chou, "The detection of arteriovenous fistula stenosis for hemodialysis based on wavelet transform," International Journal of Advanced Computer Science, vol. 1, no. 1, pp. 16–22, 2011.

[9] P. Malindretos, C. Liaskos, P. Bamidis, I. Chryssogonidis, A. Lasaridis, and P. Nikolaidis, "Computer assisted sound analysis of arteriovenous fistula in hemodialysis patients," The International Journal of Artificial Organs, vol. 37, no. 2, pp. 173–176, 2014.
[10] L. Rousselot, Acoustical Monitoring of Model System for Vascular Access in Haemodialysis, 2014.
[11] M. Grochowina and L. Leniowska, "Analiza parametrów akustycznych prototypu głowicy do akwizycji sygnału z przetoki tętniczo-żylnej," Mechanika w Medycynie, pp. 63–72, 2014.
[12] M. Kirpluk, "Podstawy akustyki," Warszawa, NTLMK, 2012.
[13] M. Grochowina and L. Leniowska, "Dobór cech diagnostycznych dla klasyfikatora SVM w zadaniu klasyfikacji stanu przetoki tętniczo-żylnej na podstawie sygnału akustycznego," Acta Bio-Optica et Informatica Medica. Inżynieria Biomedyczna, vol. 22, no. 4, 2016.
[14] M. Grochowina and L. Leniowska, "Comparison of SVM and k-NN classifiers in the estimation of the state of the arteriovenous fistula problem," in Federated Conference on Computer Science and Information Systems (FedCSIS), 2015. IEEE, 2015, pp. 249–254.
[15] M. Grochowina and L. Leniowska, "The new method of the selection of features for the k-NN classifier in the arteriovenous fistula state estimation," in Federated Conference on Computer Science and Information Systems (FedCSIS), 2016. IEEE, 2016, pp. 281–285.

The use of fuzzy cognitive maps in evaluation of prognosis of chronic heart failure patients

Lukasz Kubus, Alexander Yastrebov, Katarzyna Poczeta
Kielce University of Technology, al. Tysiąclecia Państwa Polskiego 7, 25-314 Kielce, Poland
Email: l.kubus, a.jastriebow, k.piotrowska@tu.kielce.pl

Magdalena Poterala
Department of Cardiology, Masovian Specialized Hospital in Radom, Poland

Leszek Gromadzinski
II Department of Cardiology and Internal Medicine, School of Medicine, Collegium Medicum, University of Warmia and Mazury in Olsztyn, Poland

Abstract—Fuzzy cognitive map (FCM) is an effective tool for modeling decision support systems. It describes the analyzed problem in the form of key concepts and causal connections between them. The aim of this paper is to use the fuzzy cognitive map in evaluation of prognosis for patients with chronic heart failure. The developed evolutionary algorithm for fuzzy cognitive maps learning and medical data of consecutive chronic heart failure patients were used to select the most significant concepts and determine the relationships between them.

I. INTRODUCTION

A fuzzy cognitive map (FCM) is an effective tool for modeling decision support systems. It allows the visualization of the analyzed problem in the form of the key concepts and causal relationships between them [6]. Fuzzy cognitive maps have the ability to model complex systems; they can model medical processes and help experts in predicting, differentiating and treating various diseases [1]. In [9], the FCM model was used to represent the cause-effect relationships within medical data. A fuzzy cognitive map was also used to discriminate the diagnoses of alterations in urinary elimination, according to the nursing terminology of NANDA International [3]. In [11], an approach based on the soft computing technique of FCMs was proposed to determine the success of the radiation therapy process.

This paper is devoted to the use of fuzzy cognitive maps in the context of medical data analysis. The results of an evaluation of prognosis for patients with chronic heart failure (CHF) are presented. The aim of the study is to determine the impact of clinical and echocardiographic parameters on 3-year survival. The developed evolutionary algorithm for fuzzy cognitive maps learning was used to select the most significant concepts and determine the relationships between them.

The paper is organized as follows. In Section II, the analyzed medical problem and the available data of chronic heart failure patients are described. Section III describes the proposed approach based on fuzzy cognitive maps and an evolutionary learning algorithm. Section IV presents the results of the use of fuzzy cognitive maps with the developed algorithm in evaluation of prognosis for patients with CHF. Section V contains a summary of the paper.

II. PROBLEM DESCRIPTION

The analyzed medical data come from a retrospective study [4]. It contains the data of 95 consecutive chronic heart failure patients, including 56 women and 39 men at the mean age of 70.5±9.8 years, admitted to hospital because of cardiac angina or pulmonary oedema. Patients received standard CHF medical treatment (ACE inhibitors, loop diuretics, spironolactone, and beta-blockers). Patients with acute coronary syndromes were excluded from the study. Chronic heart failure is a syndrome connected with various risk factors [4]. Each patient was described by 26 clinical and echocardiographic attributes presented in Table I. The aim of this analysis is to select the most significant attributes and determine the relationships between them with the use of fuzzy cognitive maps. Prognosis means the number of months of survival of the patient during the 3-year observation.

III. THE DEVELOPED APPROACH

The medical data were analyzed with the use of a fuzzy cognitive map and the developed learning algorithm.

A. Fuzzy Cognitive Maps

Fuzzy cognitive maps are graph structures. Nodes are variable concepts (data attributes) and links mean causal relationships between them [6]. The model is described by the vector X and the connection matrix W. Each element w_{j,i} of the matrix W determines the weight of the relationship between concepts. A positive weight of the connection w_{j,i} means that node X_j causally increases node X_i. A negative weight of the connection w_{j,i} means that node X_j causally decreases node X_i. Some of the concepts can be determined as the output (decision) concepts. In this analysis prognosis was chosen as the output concept.

Fuzzy cognitive maps can be initialized based on expert knowledge or using learning algorithms and historical data.

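The sign semantics of the connection matrix W described above can be illustrated with a single map response, as computed later in eq. (1). This is a minimal sketch: the logistic transformation F with unit slope and the toy weights are assumptions of this illustration, not parameters of the learned model.

```python
import numpy as np

def fcm_response(W, X):
    """One FCM response: X_i = F(sum over j != i of w_ji * X_j).

    W[j, i] is the weight of the connection from concept j to concept i;
    F is a logistic transformation function (unit slope assumed here).
    """
    A = W.copy()
    np.fill_diagonal(A, 0.0)            # no self-influence (j != i)
    s = A.T @ X                          # weighted sum of incoming links
    return 1.0 / (1.0 + np.exp(-s))     # logistic F

# Two concepts: X_0 increases X_1 (w_01 = 0.8), X_1 decreases X_0 (w_10 = -0.4).
W = np.array([[0.0, 0.8],
              [-0.4, 0.0]])
X = np.array([1.0, 0.5])
out = fcm_response(W, X)
```

With these toy weights, concept 1 receives the positive stimulus 0.8 · 1.0 and so ends up above 0.5, while concept 0 receives the negative stimulus −0.4 · 0.5 and ends up below 0.5, matching the increase/decrease semantics of positive and negative weights.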
TABLE I
CLINICAL AND ECHOCARDIOGRAPHIC ATTRIBUTES

Name                                          Values
Gender                                        0 (male), 1 (female)
Age                                           [40 – 99]
Age index                                     0 (age ≤ 60), 1 (age > 60)
NYHA classification                           {0, 1, 2, 3, 3.5, 4}
Arterial hypertension (HA)                    1 (yes), 0 (no)
Diabetes mellitus (DM)                        1 (yes), 0 (no)
Myocardial infarction (MI)                    1 (yes), 0 (no)
Coronary artery disease (CAD)                 1 (yes), 0 (no)
Alcohol                                       1 (yes), 0 (no)
Overpressure                                  1 (yes), 0 (no)
FA                                            1 (yes), 0 (no)
SR                                            1 (yes), 0 (no)
Sodium                                        [134, 146]
Start serum creatinine concentration          [0.7, 5.2]
Last serum creatinine concentration           [0.7, 5.5]
Right ventricular systolic pressure (RVSP)    [25, 92]
RVSP class                                    {1, 2}
Right ventricular diastolic dimension (RVDD)  [1.6, 4.4]
Left ventricular diastolic dimension (LVDD)   [4, 8.2]
EF size                                       [25, 70]
Aspirin                                       1 (yes), 0 (no)
Sintrom                                       1 (yes), 0 (no)
Standard heparin                              1 (yes), 0 (no)
Heparin injections                            1 (yes), 0 (no)
Previously hospitalized                       1 (yes), 0 (no)
Pneumonia                                     1 (yes), 0 (no)
Prognosis                                     [0, 36]

The aim of the developed evolutionary algorithm for fuzzy cognitive maps learning is [12]:
• to select the most significant concepts,
• to determine the weights of the relationships between them,
• to evaluate prognosis for patients with chronic heart failure.

Two approaches were proposed: the static approach and the dynamic approach. The static approach allows determination of the relationships only between the input concepts and the output concept (prognosis). The dynamic approach enables determination of the mutual relationships between all concepts in the analyzed FCM model.

B. Static approach

The aim of this approach is to select the concepts that affect the prognosis the most. The response of the system is calculated as follows:

X_i = F( \sum_{j=1, j≠i}^{n} w_{j,i} · X_j )    (1)

where X_i is the value of the output concept (prognosis), n is the number of concepts, w_{j,i} is the weight of the relationship between the j-th concept and the i-th concept, taking on values from the range [−1, 1], and F(x) is a logistic transformation function.

The main steps of the static approach are described below.

STEP 1. Initialize population.
The population of individuals is randomly initialized. Each individual is described by the vector of the weights of the relationships W′ and the vector of the states of the concepts C.

W′ = [w_{1,2}, ..., w_{1,n}, w_{2,1}, w_{2,3}, ..., w_{2,n}, ..., w_{n,n−1}]^T    (2)

where w_{j,i} ∈ [−1, 1] is the weight of the relationship between the j-th and the i-th concept, i, j = 1, 2, ..., n, and n is the number of concepts.

C = [c_1, c_2, ..., c_n]^T,   c_i ∈ {AS, IAS, AAS}    (3)

where c_i is the state of the i-th concept and n is the number of concepts.

Each concept can have one of three states: active (AS), inactive (IAS) and always active (AAS). The output concept prognosis is always active.

During initialization, the elements of the W′ vector are initialized with random values from the interval [−1, 1]. The state of each node is active for all individuals in the initial population.

STEP 2. Select key concepts.
Each individual in the population is decoded into the candidate FCM. In this approach, the state of each concept can be changed with a certain probability described by the parameter a.

STEP 3. Evaluate population.
The fitness function evaluates the candidate FCM based on the data error and is described as follows:

fitness(J) = −J    (4)

where J is the objective function calculated for the output concept:

J = \sum_{p=1}^{P_l} |Z_p − X_p|    (5)

where X_p is the value of the output concept for the p-th record, Z_p is the desired value for the p-th record, p = 1, 2, ..., P_l, and P_l is the number of learning records (patients).

STEP 4. Check stop condition.
If the number of the current generation is greater than the maximum number of generations, the learning process is stopped; go to STEP 7.

STEP 5. Select new population.
A roulette-wheel selection with dynamic linear scaling of the fitness function and the elite strategy were applied [8].

STEP 6. Apply genetic operators.
A uniform crossover and non-uniform mutation were used [8], [13].

STEP 7. Choose the best individual and calculate evaluation criteria.
The two most often used criteria are calculated to evaluate the FCM model:

1) Learning (initial) error, allowing the calculation of the similarity between the input learning data and the data generated by the FCM model:

J_l = (1/P_l) \sum_{p=1}^{P_l} |Z_p − X_p|    (6)

where X_p is the value of the decision concept for the p-th record, Z_p is the desired value of the decision concept for the p-th record, p = 1, 2, ..., P_l, and P_l is the number of learning records (patients).

2) Testing (behavior) error, evaluating the similarity between the input testing data and the data generated by the FCM model:

J_t = (1/P_t) \sum_{p=1}^{P_t} |Z_p − X_p|    (7)

where X_p is the value of the decision concept for the p-th record, Z_p is the desired value of the decision concept for the p-th record, p = 1, 2, ..., P_t, and P_t is the number of testing records (patients).

C. Dynamic approach

The aim of this approach is to determine the most significant relationships between all concepts in the analyzed FCM model. The response of the map can be calculated as follows:

X_i(t+1) = F( \sum_{j=1, j≠i}^{n} w_{j,i} · X_j(t) )    (8)

where X_i(t) is the value of the i-th concept, i = 1, 2, ..., n, n is the number of concepts, t = 0, 1, 2, ..., T, T is the number of records, w_{j,i} is the weight of the connection between the j-th concept and the i-th concept, and F(x) is a logistic transformation function.

The main steps of the dynamic approach are described below.

STEP 1. Initialize population.
The population of individuals is randomly initialized as in the static approach.

STEP 2. Select key concepts.
Each individual in the population is decoded into the candidate FCM and the selected graph theory metrics are calculated. The most significant concepts are selected based on the degree of the node and the total value of the node [2], [15].

1) Key concepts are selected based on the degree of the node (DEG). The degree of the node (9) means its significance, calculated based on the number of concepts it affects [2], [15]:

deg_i = ( \sum_{j=1, j≠i}^{n} θ(w_{i,j}) ) / (n − 1),   θ(w_{i,j}) = 1 for w_{i,j} ≠ 0, and 0 for w_{i,j} = 0    (9)

where n is the number of concepts; w_{j,i} is the weight of the connection between the j-th and the i-th concept; i, j = 1, 2, ..., n.
The state of each node is modified according to the following formula:

if deg_i ≤ a then c_i = IAS, else c_i = AS    (10)

where i = 1, 2, ..., n, n is the number of concepts; a is a parameter selected experimentally, a > 0 (a = 0.2).

2) Key concepts are selected based on the total value of a node (VAL). The total value of a node is calculated based on the sum of the weights of all incoming links and the sum of the weights of all outgoing connections [2], [15]:

val_i = ( \sum_{j=1, j≠i}^{n} |w_{i,j}| + \sum_{j=1, j≠i}^{n} |w_{j,i}| ) / ( \sum_{k=1}^{n} \sum_{j=1, k≠j}^{n} |w_{k,j}| )    (11)

where val_i is the total value of the node, n is the number of concepts; w_{j,i} is the weight of the relationship between the j-th and the i-th concept; i, j = 1, 2, ..., n.
The state of each node is modified as follows:

if val_i ≤ a then c_i = IAS, else c_i = AS    (12)

where i = 1, 2, ..., n, n is the number of concepts; a is a parameter selected experimentally, a > 0 (a = 0.1).

STEP 3. Evaluate population.
The fitness function is described as follows:

fitness(J) = −J    (13)

where J is the objective function (the learning data error) calculated for all concepts in the FCM model:

J = \sum_{t=1}^{T_l} \sum_{i=1}^{n} |Z_i(t) − X_i(t)|    (14)

where i = 1, ..., n, n is the number of key concepts, X_i(t) is the value of the i-th concept at iteration t of the candidate FCM, Z_i(t) is the desired value of the i-th concept at iteration t, t = 0, 1, 2, ..., T_l, and T_l is the input data length.

STEP 4. Check stop condition.
If the number of the current generation is greater than the maximum number of generations, the learning process is stopped; go to STEP 8.

STEP 5. Select new population.
A roulette-wheel selection with dynamic linear scaling of the fitness function and the elite strategy were applied [8].

STEP 6. Apply genetic operators.
A uniform crossover and non-uniform mutation were used [8], [13].

STEP 7. Analyze population.
The values of weights from the interval [−0.2, 0.2] are rounded down to 0, as suggested in [13]. Next, the potential solutions are analyzed according to the developed approach [7]. The values of the total influence between concepts p_{j,i} are calculated. If the value of p_{j,i} is in the interval [−0.2, 0.2], the weight value w_{j,i} is rounded down to 0. Go to STEP 2.

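The two node metrics used in STEP 2, the degree (9) with the threshold rule (10) and the total value (11) with the rule (12), can be sketched as follows. This is a minimal illustration; the matrix convention W[i, j] = weight of the connection from concept i to concept j and the toy weights are assumptions of this sketch.

```python
import numpy as np

def node_degree(W, i):
    """Degree of node i, eq. (9): the fraction of the other concepts
    that node i affects through a non-zero outgoing weight."""
    n = W.shape[0]
    affected = [j for j in range(n) if j != i and W[i, j] != 0]
    return len(affected) / (n - 1)

def node_value(W, i):
    """Total value of node i, eq. (11): the sum of the absolute weights
    of its incoming and outgoing connections, normalized by the total
    weight mass of the map."""
    n = W.shape[0]
    inc = sum(abs(W[j, i]) for j in range(n) if j != i)
    out = sum(abs(W[i, j]) for j in range(n) if j != i)
    total = sum(abs(W[k, j]) for k in range(n) for j in range(n) if k != j)
    return (inc + out) / total

# Toy 3-concept map: 0 -> 1 (0.5), 1 -> 2 (-0.5); concept 2 affects nothing.
W = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
```

In this toy map, node 2 has degree 0, so with a = 0.2 the rule (10) would mark it inactive (IAS), while nodes 0 and 1 remain active (AS).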
STEP 8. Choose the best individual and calculate evaluation criteria.
Two criteria are calculated to evaluate the FCM model:

1) Learning (initial) error, allowing the calculation of the similarity between the input learning data and the data generated by the FCM model:

J_l = (1/T_l) \sum_{t=1}^{T_l} |Z_i(t) − X_i(t)|    (15)

where i = 1, ..., n, n is the number of key concepts, X_i(t) is the value of the i-th concept at iteration t of the candidate FCM, Z_i(t) is the desired value of the i-th concept at iteration t, t = 0, 1, 2, ..., T_l, and T_l is the learning data length.

2) Testing (behavior) error, evaluating the similarity between the input testing data and the data generated by the FCM model:

J_t = (1/T_t) \sum_{t=1}^{T_t} |Z_i(t) − X_i(t)|    (16)

where i = 1, ..., n, n is the number of key concepts, X_i(t) is the value of the i-th concept at iteration t of the candidate FCM, Z_i(t) is the value of the i-th concept at iteration t of the input model, t = 1, 2, ..., T_t, and T_t is the testing data length.

Fig. 1. The sample FCM model for the static approach (concepts in the original diagram: Age index, Sintrom, Heparin injections, Previously hospitalized, Alcohol, RVDD, NYHA, MI, LVDD and RVSP, connected to Prognosis)

IV. EXPERIMENTS

The main aim of the analysis is to select the most significant concepts and find the strength of the relationships between them. The following learning parameters were used:
• crossover probability: 0.75,
• mutation probability: 0.03,
• population size: 100,
• the number of elite individuals: 10,
• the maximum number of generations: 100.

A. Static approach

In this study, 2000 models were built. We selected the 55 best models from them and these models became the subject of a meta analysis. Data of 71 randomly chosen patients were used as the learning records; the remaining 24 records were used in the testing process. Table II presents the most significant concepts selected with the use of the proposed approach, where:
r – ratio of occurrence of the concept in the analyzed models,
rp – ratio of occurrence of the concept with a positive weight value,
rn – ratio of occurrence of the concept with a negative weight value,
wp – an average value of the positive weights,
wn – an average value of the negative weights,
stdp – standard deviation of the positive weights,
stdn – standard deviation of the negative weights.

Figure 1 illustrates the most significant concepts (with r ≥ 0.7) and relationships for the static approach. The models obtained the learning error J_l equal to 0.13 ± 0.001 and the testing error J_t equal to 0.41 ± 0.025.

B. Dynamic approach

In this study, 2000 models were built. We selected the 200 best models from them and these models became the subject of a meta analysis. Data of 71 randomly chosen patients were used as the learning records; the remaining 24 records were used in the testing process. Table III presents the most significant relationships between concepts selected with the use of the proposed approach, where:
r – ratio of occurrence of the relationships between concepts in the analyzed models,
rp – ratio of occurrence of the relationships with a positive weight value,
rn – ratio of occurrence of the relationships with a negative weight value,
wp – an average value of the positive weights,
wn – an average value of the negative weights,
stdp – standard deviation of the positive weights,
stdn – standard deviation of the negative weights.

Figure 2 illustrates the most significant concepts and relationships (with r ≥ 0.11) for the dynamic approach. The models for the selection of key concepts based on the degree of the node obtained the learning error J_l equal to 0.27 ± 0.028 and the testing error J_t equal to 0.38 ± 0.008. The models for the selection of key concepts based on the total value of the node obtained the learning error J_l equal to 0.30 ± 0.027 and the testing error J_t equal to 0.37 ± 0.005.

C. Discussion of results

Despite the fact that the analyzed population was a group of 95 patients with consecutive chronic heart failure, the results

TABLE II
SAMPLE RESULTS OF A META ANALYSIS FOR THE STATIC APPROACH

No. Concept r rp rn wp stdp wn stdn


1. Age index 1 0 1 -0.74 0.18
2. Sintrom 1 1 0 -0.85 0.14
3. Heparin injections 1 0 1 0.92 0.08
4. Previously hospitalized 1 1 0 0.84 0.12
5. Alcohol 0.91 0 0.91 -0.59 0.24
6. RVDD 0.85 0.83 0.02 0.70 0.28
7. NYHA 0.81 0.81 0 0.79 0.19 -0.27 0.21
8. MI 0.81 0.72 0.09 0.67 0.25
9. LVDD 0.72 0.70 0.02 0.81 0.19
10. RVSP 0.70 0.05 0.65 0.26 0.39 -0.70 0.29
11. Age 0.69 0.04 0.65 0.48 0.51 -0.79 0.23
12. RVSP class 0.67 0.02 0.65 0.01 -0.60 0.24
13. EF 0.66 0.07 0.59 0.06 0.07 -0.62 0.25
14. FA 0.61 0.61 0 0.69 0.29
15. SR 0.59 0.54 0.05 0.63 0.27 -0.59 0.31
16. Start creatinine 0.59 0.20 0.39 0.57 0.22 -0.13 0.09
17. CAD 0.52 0.52 0 0.39 0.15
18. Gender 0.50 0.04 0.46 0.10 0.10 -0.46 0.17
19. Last creatinine 0.46 0.31 0.15 0.58 0.26 -0.60 0.30
20. Pneumonia 0.39 0.26 0.13 0.17 0.09 -0.07 0.04
21. HA 0.37 0.06 0.31 0.06 0.06 -0.38 0.18
22. DM 0.33 0.15 0.18 0.22 0.10 -0.38 0.22
23. Sodium 0.26 0.17 0.09 0.32 0.20 -0.51 0.30
24. Standard heparin 0.26 0.20 0.06 0.47 0.29 -0.38 0.24
25. Aspirin 0.15 0.06 0.09 0.07 0.06 -0.08 0.02

TABLE III
SAMPLE RESULTS OF A META ANALYSIS FOR THE DYNAMIC APPROACH

Concept i Concept j r rp rn wp stdp wn stdn


Prognosis Previously hospitalized 0.57 0.57 0 0.85 0.13 0 0
Previously hospitalized Standard heparin 0.45 0 0.45 0 0 -0.86 0.13
Previously hospitalized Prognosis 0.4 0.4 0 0.83 0.14 0 0
Aspirin Previously hospitalized 0.37 0.37 0 0.89 0.08 0 0
Previously hospitalized Aspirin 0.35 0.35 0 0.8 0.16 0 0
Pneumonia Previously hospitalized 0.33 0.33 0 0.88 0.14 0 0
Previously hospitalized Heparin injections 0.28 0 0.28 0 0 -0.78 0.18
LVDD Previously hospitalized 0.25 0.25 0 0.79 0.16 0 0
Prognosis Standard heparin 0.24 0 0.24 0 0 -0.75 0.2
RVDD Previously hospitalized 0.23 0.23 0 0.86 0.13 0 0
Sintrom Previously hospitalized 0.22 0.22 0 0.87 0.13 0 0
Age index Previously hospitalized 0.2 0.2 0 0.82 0.17 0 0
EF Previously hospitalized 0.2 0.2 0 0.88 0.13 0 0
Aspirin Prognosis 0.17 0.17 0 0.84 0.12 0 0
Heparin injections Previously hospitalized 0.16 0.16 0 0.81 0.16 0 0
Sintrom Prognosis 0.14 0.14 0 0.74 0.24 0 0
Prognosis Aspirin 0.12 0.12 0 0.78 0.2 0 0
Prognosis Heparin injections 0.12 0 0.12 0 0 -0.89 0.07
Age index Prognosis 0.11 0.11 0 0.82 0.16 0 0
Pneumonia Prognosis 0.11 0.11 0 0.74 0.23 0 0
LVDD Heparin injections. 0.11 0 0.11 0 0 -0.83 0.13
Pneumonia Heparin injections 0.11 0 0.11 0 0 -0.89 0.14
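The per-relationship statistics reported in Tables II and III (r, rp, rn, wp, stdp, wn, stdn) can be recovered from the weight that a given pair of concepts receives in each of the selected models. A sketch under the assumption that a relationship "occurs" in a model whenever its learned weight is non-zero (the paper's exact occurrence criterion is not restated here):

```python
def meta_stats(weights_ij):
    """Given the learned weights w_ij of one concept pair across the selected
    models, return (r, rp, rn, wp, stdp, wn, stdn) as used in Tables II/III."""
    n = len(weights_ij)
    pos = [w for w in weights_ij if w > 0]
    neg = [w for w in weights_ij if w < 0]

    def mean(v):
        return sum(v) / len(v) if v else 0.0

    def std(v):
        m = mean(v)
        return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5 if v else 0.0

    r = (len(pos) + len(neg)) / n          # relationship occurs at all
    rp, rn = len(pos) / n, len(neg) / n    # ... with positive / negative sign
    return r, rp, rn, mean(pos), std(pos), mean(neg), std(neg)

# e.g. weights of one relationship across five selected models
stats = meta_stats([0.8, 0.9, 0.0, -0.7, 0.85])
```

For the toy input above the relationship occurs in 4 of 5 models (r = 0.8), three times with positive sign and once with negative sign.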

of the analysis present interesting relationships, which classically used statistical methods did not include, for example, improving the prognosis of patients in the 3-year follow-up with Sintrom. This probably results from the fact that the entire study group are patients with multiple co-morbidities, burdened with an increased risk of stroke in the case of silent, previously undetected atrial fibrillation, hence the protective role of Sintrom. The analysis also confirms the negative impact on the prognosis of recognized risk factors such as age index, EF size or elevated RVSP.

It is difficult to refer in this work to the relationship between the history of myocardial infarction MI, NYHA class, the size of echocardiographic parameters of the left and right ventricular function as LVDD and RVDD, and the prognosis due to a small group of patients. Surprising dependencies may also result from mutual relationships between particular concepts assessed by dynamic models.

V. CONCLUSION

This paper is devoted to the use of fuzzy cognitive maps in the context of medical data analysis. Results of the simulation analysis of evaluation of prognosis for patients with chronic heart failure (CHF) based on the FCM model and evolutionary

Fig. 2. The sample FCM model for the dynamic approach

learning algorithm were presented. The static and dynamic approach were proposed to select the most significant concepts and determine the relationships between them. The results of the analysis present interesting relationships, e.g. positive impact on the prognosis of Sintrom and negative impact of EF size or right ventricular systolic pressure.

Despite some limitations of this work, it gives hope that in the field of scientific research in medicine there is a chance to use a new data analysis tool, which is the method of fuzzy cognitive maps. We plan to expand the analyzed dataset with new patient data and compare the results with other state-of-the-art approaches for data analysis and features selection.

REFERENCES

[1] Amirkhani, A., Papageorgiou, E.I., Mohseni, A., Mosavi, M.R., A review of fuzzy cognitive maps in medicine: Taxonomy, methods, and applications. Computer Methods and Programs in Biomedicine 142, pp. 129–145, 2017.
[2] Christoforou, A., Andreou, A.S., A framework for static and dynamic analysis of multi-layer fuzzy cognitive maps. Neurocomputing 232, 133–145, 2017.
[3] Lopes, M.H.B.M., Ortega, N.R.S., Silveira, P.S.P., Massad, E., Higa, R., Marin, H.F., Fuzzy cognitive map in differential diagnosis of alterations in urinary elimination: A nursing approach. International Journal of Medical Informatics 82(3), pp. 201–208, 2013.
[4] Gromadziński, L., Targoński, R., Impact of clinical and echocardiographic parameters assessed during acute decompensation of chronic heart failure on 3-year survival. Kardiologia Polska 64(9), pp. 951–956, 2006.
[5] Jastriebow, A., Poczęta, K., Analysis of multi-step algorithms for cognitive maps learning. Bulletin of the Polish Academy of Sciences: Technical Sciences 62(4), 735–741, 2014.
[6] Kosko, B., Fuzzy cognitive maps. International Journal of Man-Machine Studies 24(1), 65–75, 1986.
[7] Kubuś, Ł., Poczęta, K., Yastrebov, A., A New Learning Approach for Fuzzy Cognitive Maps based on System Performance Indicators. 2016 IEEE International Conference on Fuzzy Systems, Vancouver, Canada, 1398–1404, 2016.
[8] Michalewicz, Z., Genetic algorithms + data structures = evolution programs. Springer-Verlag, New York, 1996.
[9] Papageorgiou, E.I., Papandrianos, N.I., Karagianni, G., Kyriazopoulos, G.C., Sfyras, D., A fuzzy cognitive map based tool for prediction of infectious diseases. 2009 IEEE International Conference on Fuzzy Systems, Jeju Island, pp. 2094–2099, 2009.
[10] Papageorgiou, E.I., Poczeta, K., A two-stage model for time series prediction based on fuzzy cognitive maps and neural networks. Neurocomputing 232, 113–121, 2017.
[11] Papageorgiou, E.I., Stylios, C.D., Groumpos, P.P., An Integrated Two-Level Hierarchical System for Decision Making in Radiation Therapy Based on Fuzzy Cognitive Maps. IEEE Transactions on Biomedical Engineering 50(12), 1326–1339, 2003.
[12] Poczeta, K., Kubuś, Ł., Yastrebov, A., An Evolutionary Algorithm Based on Graph Theory Metrics for Fuzzy Cognitive Maps Learning. In: Martín-Vide, C., Neruda, R., Vega-Rodríguez, M. (eds) Theory and Practice of Natural Computing. TPNC 2017. Lecture Notes in Computer Science 10687, Springer, Cham, 137–149, 2017.
[13] Stach, W., Kurgan, L., Pedrycz, W., Reformat, M., Genetic learning of fuzzy cognitive maps. Fuzzy Sets and Systems 153(3), 371–401, 2005.
[14] Stach, W., Pedrycz, W., Kurgan, L.A., Learning of fuzzy cognitive maps using density estimate. IEEE Transactions on Systems, Man, and Cybernetics, Part B 42(3), 900–912, 2012.
[15] Wilson, R.J., An Introduction to Graph Theory. Pearson Education, India, 1970.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th–21st, 2018, Poznań, POLAND

Fuzzy Bayesian Filter for Sound Environment by Considering Additive Property of Energy Variable and Fuzzy Observation in Decibel Scale

Akira Ikuta
Department of Management Information Systems
Prefectural University of Hiroshima
Hiroshima, Japan
ikuta@pu-hiroshima.ac.jp

Hisako Orimoto
Department of Management Information Systems
Prefectural University of Hiroshima
Hiroshima, Japan
orimoto@pu-hiroshima.ac.jp

Abstract— In the measurement and evaluation of actual random signal in a sound environment, the observed data often contain the fuzziness due to several causes. Furthermore, there exists usually a background noise in addition to the objective specific signal, and it is often that the specific signal partly or completely is buried in the background noise. In this paper, a fuzzy Bayesian filter for estimating a specific signal, based on the observed data containing the fuzziness, and the effects of a background noise with non-Gaussian type is proposed. More specifically, after paying attention to the energy variables satisfying the additive property of the specific signal and background noise, by introducing a new type of membership function suitable for the energy variable and the observation in decibel scale, a state estimation method is theoretically derived. The proposed theory is applied to the actual estimation problem of the sound environment, and its usefulness is experimentally verified.

Keywords- State estimation, Probability measure of fuzzy events, Fuzzy Bayesian filter, Sound environment

I. INTRODUCTION

In the sound environment related to the mutual effects on physical phenomenon and human response, most of the actual observed data show a complex fluctuation pattern differing from a standard Gaussian distribution, and furthermore, these often contain the fuzziness due to the existence of confidence limitation in measuring instruments, permissible error in experimental data, and the variety of human response to phenomena. Furthermore, there exists usually a background noise in addition to the objective specific signal, and it is often that the specific signal partly or completely is buried in the background noise. Therefore, the fluctuation waveform of the specific signal must be momentarily estimated as precisely as possible, based on the observed data contaminated by the background noise, in order to evaluate the sound environment.

Many standard estimation methods proposed previously in the field of stochastic system theory have not considered positively the fuzziness in the observation data under the restriction of Gaussian type fluctuation (in most cases, zero mean), for the simplification of theory [1]-[4]. Especially, since the specific signal in real sound environment is usually measured by using the instruments with mean squaring operation, the sound environment very often may be considered on an energy scale as a system with a signal of non-zero mean. So, it becomes essentially a big problem to apply the conventional state estimation methods to the present situation without any improvement to them.

Though several researches on fuzzy Bayesian inference have been proposed [5, 6], these are confined to the static approaches to estimate the parameters based on Bayesian statistics. Dynamical methods to estimate momentarily the fluctuating signal based on the successive observation of fuzzy data in sound environment have not been proposed.

In this paper, a fuzzy Bayesian filter for estimating a specific signal, based on the observed data containing the fuzziness, and the effects of an external noise with non-Gaussian type is proposed in a recursive form suitable for use with a digital computer. More specifically, after paying attention to the energy variable (e.g., sound intensity) for a specific signal in a sound environment, which exhibits complex probability distribution forms, by introducing a new type of membership function for the energy variable and the observation in decibel scale, a state estimation method is theoretically derived. A lognormal distribution is suitable to represent the energy variable, which fluctuates only within the positive region. The proposed fuzzy Bayesian filter positively utilizes the additive property of energy variables in the estimation algorithm. The proposed theory is applied to the actual estimation problem of the sound environment, and its usefulness is experimentally verified.

II. FORMULATION OF FUZZY OBSERVATION UNDER EXISTENCE OF BACKGROUND NOISE

A sound environment system with energy variables (e.g., sound intensity) exhibiting a non-Gaussian distribution is considered. Let the specific signal energy at a discrete time k be x_k, and express the dynamical model for the specific signal as:

x_{k+1} = F x_k + G u_k ,   (1)

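The dynamical model of Eq. (1), together with the additive background-noise observation y_k = x_k + v_k of Eq. (2), can be simulated directly. A minimal sketch — the parameter values F, G and the exponential noise statistics below are illustrative assumptions, not those of the paper's experiment:

```python
import random

def simulate(F=0.9, G=0.1, steps=200, seed=0):
    """x_{k+1} = F x_k + G u_k  (Eq. 1);  y_k = x_k + v_k  (Eq. 2).
    All quantities are non-negative energies (e.g. sound intensity)."""
    rng = random.Random(seed)
    x = 1.0e-4                             # initial signal energy [W/m^2]
    xs, ys = [], []
    for _ in range(steps):
        u = rng.expovariate(1.0 / 1.0e-4)  # random input energy u_k >= 0
        v = rng.expovariate(1.0 / 2.5e-5)  # background-noise energy v_k >= 0
        x = F * x + G * u
        xs.append(x)
        ys.append(x + v)                   # total energy seen before fuzziness
    return xs, ys

xs, ys = simulate()
```

Because the state and noises are energies, x_k stays strictly positive and y_k never falls below x_k, which is exactly the additive property the filter exploits.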
where u_k denotes the random input energy (e.g., sound intensity) with known statistics, and x_k and u_k are uncorrelated with each other. Furthermore, F and G are unknown system parameters and can be estimated by use of the system identification method [7] when these parameters cannot be determined on the basis of the physical mechanism of the system.

The observed data in the actual sound environment often contain the fuzziness due to several causes, for example, the permissible error of the accuracy in measurements, the quantized error in the digitization of observation data, and the variety of human response to the physical stimulus. Therefore, in addition to the inevitable background noise, the effects of the fuzziness contained in the observed data have to be first considered in order to derive a state estimation method for the specific signal. The observation equation can be formulated by dividing it into two types of operation from a functional viewpoint:

i) The additive property of the energy variable, under the existence of background noise:

y_k = x_k + v_k .   (2)

We assume that the statistics of the background noise energy v_k are known in advance.

ii) The fuzzy observation in decibel scale z_k obtained from y_k: the fuzziness of z_k is characterized by the membership function \mu_{z_k}(y_k).

III. STATE ESTIMATION BASED ON FUZZY OBSERVATION

In order to derive an estimation algorithm for a specific signal energy x_k, based on the successive observations of fuzzy data z_k in decibel scale, we focus our attention on Bayes' theorem [4, 8]:

P(x_k | Z_k) = P(x_k, z_k | Z_{k-1}) / P(z_k | Z_{k-1}) ,   (3)

where Z_k (= (z_1, z_2, ..., z_k)) is a set of observation data up to a time k. After applying the probability measure of fuzzy events [9] to the right side of Eq. (3), and expanding it in a general form of the statistical orthogonal expansion series [10], the conditional probability density function P(x_k | Z_k) can be expressed as:

P(x_k | Z_k) = \int_0^\infty \mu_{z_k}(y_k) P(x_k, y_k | Z_{k-1}) dy_k / \int_0^\infty \mu_{z_k}(y_k) P(y_k | Z_{k-1}) dy_k
            = \sum_{m=0}^\infty \sum_{n=0}^\infty A_{mn} P_0(x_k | Z_{k-1}) \varphi_m^{(1)}(x_k) I_n(z_k) / \sum_{n=0}^\infty A_{0n} I_n(z_k)   (4)

with

I_n(z_k) = \int_0^\infty \mu_{z_k}(y_k) P_0(y_k | Z_{k-1}) \varphi_n^{(2)}(y_k) dy_k ,   (5)

A_{mn} = < \varphi_m^{(1)}(x_k) \varphi_n^{(2)}(y_k) | Z_{k-1} > ,   (6)

where < · > denotes the averaging operation with respect to the random variables. The functions \varphi_m^{(1)}(x_k) and \varphi_n^{(2)}(y_k) are the orthogonal polynomials of degrees m and n with weighting functions P_0(x_k | Z_{k-1}) and P_0(y_k | Z_{k-1}), which can be artificially chosen as the probability density functions describing the dominant parts of P(x_k | Z_{k-1}) and P(y_k | Z_{k-1}). These two functions must satisfy the following orthonormal relationships:

\int_0^\infty \varphi_m^{(1)}(x_k) \varphi_{m'}^{(1)}(x_k) P_0(x_k | Z_{k-1}) dx_k = \delta_{mm'} ,   (7)

\int_0^\infty \varphi_n^{(2)}(y_k) \varphi_{n'}^{(2)}(y_k) P_0(y_k | Z_{k-1}) dy_k = \delta_{nn'} ,   (8)

where \delta_{mm'} is the Kronecker delta.

Based on Eq. (4), and using the orthonormal relationship of Eq. (7), the recurrence algorithm for estimating an arbitrary N-th order polynomial type function f_N(x_k) of the specific signal can be derived as follows:

\hat{f}_N(x_k) = < f_N(x_k) | Z_k > = \int_0^\infty f_N(x_k) P(x_k | Z_k) dx_k
             = \sum_{m=0}^N \sum_{n=0}^\infty A_{mn} C_{Nm} I_n(z_k) / \sum_{n=0}^\infty A_{0n} I_n(z_k) ,   (9)

where C_{Nm} is the expansion coefficient determined by the equality:

f_N(x_k) = \sum_{m=0}^N C_{Nm} \varphi_m^{(1)}(x_k) .   (10)

In order to make the general theory for the estimation algorithm more concrete, the well-known log-normal distribution is adopted as P_0(x_k | Z_{k-1}) and P_0(y_k | Z_{k-1}), because this probability density function is defined within the positive region and is suitable to the energy variables:

P_0(x_k | Z_{k-1}) = P_L(x_k; \mu_{x_k}, \sigma_{x_k}^2) ,   (11)

P_0(y_k | Z_{k-1}) = P_L(y_k; \mu_{y_k}, \sigma_{y_k}^2)   (12)

with

P_L(x; \mu, \sigma^2) = (1 / (\sqrt{2\pi\sigma^2} x)) exp{ -(ln x - \mu)^2 / (2\sigma^2) } ,
 xk | Z k 1  2 ( xk0 , xk0 ) ( xk0 , x1k )  ( xk0 , xkm )
 x k  ln( ),
 xk2 | Z k 1     
Dm ( xk )  m 1 m , (20)
( xkm 1 , xk0 ) ( xkm 1 , x1k )  ( xk , xk )
xk2
 | Z k 1 
 x2k  ln( ), xk0 x1k  xkm
 xk | Z k 1  2

 y k | Z k 1  2 ( yk0 , yk0 ) ( yk0 , y1k )  ( yk0 , ykn )


 y k  ln( ),
 y k2 | Z k 1     
En ( y k )  . (21)
( ykn 1, yk0 ) ( ykn 1, y1k )  ( ykn 1 , ykn )
 yk2 | Z k 1 
 y2k  ln( ). (13) yk0 y1k  ykn
 yk | Z k 1  2
Furthermore, by using Eq. (2), two conditional expectations of
Then, the orthonormal functions with two weighting
y k in Eq. (13) can be expressed as:
probability density functions in Eqs. (11) and (12) can be given
in the following expressions:  yk | Z k 1  xk*   vk  , (22)
2
m n
 yk2 | Z k 1  k  xk*  2 xk*  vk    vk2  (23)
 m(1) ( xk )   (mi1) xki ,  n( 2) ( y k )   (nj2) y kj . (14) with
i 0 j 0
xk*  xk | Z k 1  , (24)
The coefficients  (mi
1)
and (nj2) are determined by using * 2
k  ( xk  xk ) | Z k 1  . (25)
Schmidt’s orthonormalization method [11], as follows: As the membership function  zk ( y k ) , the following
(mi
1)
 (Gm 1Gm ) 1 / 2 Gm(i ) , (nj2)  ( S n1S n ) 1 / 2 S n( j ) , (15) expression suitable for the log-normal distribution is adopted.

 zk ( y k )  exp{ (ln y k  z k ) 2 } , (26)


where
where  ( 0) is a parameter. Accordingly, Eq. (5) can be
( xk0 , xk0 ) ( xk0 , x1k )  ( xk0 , xkm )
given by
( x1k , xk0 ) ( x1k , x1k )  ( x1k , xkm )
Gm  , (16)
    S yk  1 (ln y k  M yk ) 2
I n ( z k )  e B( zk ) 0 exp{ }
( xkm , xk0 ) ( xkm , x k1)  ( xkm , xkm )  yk 2S yk y k 2 S yk

n
( yk0 , yk0 ) ( yk0 , yk0 )  ( yk0 , ykm )   (nj2) y kj dy k , (27)
j 0
( y1k , yk0 ) ( y1k , y1k )  ( y1k , ykm )
Sn  . (17)
    1 (  y k  2 y2k z k ) 2
B( z k )   { y2k  2 y2k zk2  }
( ykn , yk0 ) ( ykn , y1k )  ( ykn , ykn ) 2 y2k 1  2 y2k 
Each element of Gm and S n can be calculated by using (28)
Eqs.(11) and (12), as follows:
with
( xki , xkj )  0 P0 ( xk | Z k 1 ) xki xkj dxk
 y k  2 y2k z k
(i  j ) 2  x2k M yk  ,
 exp{(i  j )  xk  }, (18) 1  2 y2k 
2
( yki , ykj )  0 P0 ( yk | Z k 1 ) yki ykj dyk  y2k
Sk  , (29)
(i  j ) 2  y2k 1  2 y2k 
 exp{(i  j )  yk  }. (19)
2 where the fuzzy data z k are reflected in B( zk ) and M y k .
G m(i ) and Sn( j ) in Eq. (15) are respectively cofactors for Furthermore, by expanding yki in Eq. (27) as
( m  1, i  1) th and ( n  1, j  1) th elements of the following
j
determinants Dm ( xk ) and En ( yk ) . y kj   d ji  i( 2 ) ( y k ) , (30)
i 0

where  i( 2) ( yk ) is the orthonormal polynomial having log- noise and quantized roughly with 1 dB and 2 dB widths as
examples of fuzzy observation, the fluctuation wave form of
normal distribution with the parameters M yk and S k as the
the road traffic noise was estimated. The statistics of the
weighting function, Eq. (27) can be calculated by considering specific signal and the background noise used in the
the orthonormal condition of  i( 2) ( y k ) , as follows: experiment are shown in Tables 1 and 2.
Figures 1 and 2 show some of the estimation results of the
S yk n fluctuation wave form of the specific signal. In this estimation,
I n ( z k )  e B( zk ) ( 2)
 nj d j 0 . (31) the finite number of expansion coefficients Amn (m, n  2) is
 yk j 0
used for the simplification of the estimation algorithm. In
Therefore, the estimates for mean and variance can be obtained these figures, the horizontal axis shows the discrete time k , of
as follows: the estimation process, and the vertical axis expresses the
sound level taking a logarithmic transformation of energy-
xˆ k  xk | Z k 
scaled variables ( 10 log10 ( xˆk / 1012 ) [dB]), because the actual

 { A0 n C10  A1n C11}I n ( z k ) sound environment usually is evaluated on dB scale connected
 n 0
, (32) with human effects. For comparison, the estimation results
 calculated using the method without considering any
 A0 n I n ( z k ) membership function are also shown in these figures as a
n 0
compared method. The proposed method considering the
Pk  ( xk  xˆ k ) 2 | Z k  membership function shows more accurate estimation than the
results based on the method without considering membership
 function.
 { A0 n C 20  A1n C 21  A2 n C 22 }I n ( z k )
n0 Since Kalman’s filtering theory is widely used in the field
 (33)
 of stochastic system, the extended Kalman filter [2] is also
 A0 n I n ( z k ) applied to the fuzzy observation data as a trail by introducing
n 0
the following observation model.
with zk  10 log10{( xk  vk ) / 1012}   k [dB], (37)
(1)
10 1 where  k denotes the quantized noise. A uniform distribution
C10   , C11   ,
(1)
11 (1)
11 within [ q / 2, q / 2] ( q : the quantized width) is assumed as
(1) (1) (1) (1) (1)
the probability distribution of  k .
10 10 21  11 20
C 20  xˆ k2 2 (1)
xˆ k  (1) (1)
, Some of the comparisons between the proposed method
11 11 22 and the extended Kalman filter on the estimation are shown in
Figs. 3 and 4. The results estimated by the proposed method
2 (211) 1 considering the membership function show good agreement
C 21   xˆ k  , C 22   (1) . (34)
(1)
11 (1) (1)
11 22 22 with the true values. On the other hand, there are great
discrepancies between the estimates based on the extended
Finally, by considering Eq. (1), the prediction step which is Kalman filter and the true values, particularly in the estimation
of the lower level values of the fluctuation with the estimation
essential to perform the recurrence estimation can be given by
error more than 5 dB.
xk* 1  xk 1 | Z k  Fxˆ k  G  uk  , (35) Furthermore, some of the estimated results by using our
previously reported method [12] based on the observation
k 1  ( xk 1  xk*1 ) 2 | Z k  model considering the additive property of energy variables
and the quantized observation in decibel scale:
 F 2 Pk  G 2  (uk   uk ) 2  . (36) yk  10 log10{( xk  vk ) / 1012 } [dB], (38)
By replacing k with k  1 , the recurrence estimation can be zk  Q ( yk )  g ( xk  vk ) [dB], (39)
achieved. are shown in Figs. 5 and 6. The function Q() denote a
nonlinear function expressing the quantization mechanism and
IV. APPLICATION TO SOUND ENVIRONMENT g () denotes a nonlinear function combining the nonlinearity
In order to examine the practical usefulness of the of decibel observation with quantized observation mechanism.
proposed fuzzy Bayesian filter, the proposed method is It is obvious that the proposed method shows better estimation
applied to the actual sound environmental data. The road than the previous method.
traffic noise is adopted as an example of a specific signal with
a complex fluctuation form. Applying the proposed estimation
method to actually observed data contaminated by background

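The decibel conversion and the coarse quantization used to generate the fuzzy observations can be sketched directly from Eqs. (37)-(39); the 10^{-12} W/m^2 reference intensity follows the paper, while the round-to-nearest-step rule below is an assumption about the quantizer Q(·):

```python
import math

I0 = 1.0e-12  # reference intensity [W/m^2]

def to_db(energy):
    """y_k = 10 log10((x_k + v_k) / 10^-12) [dB]  (Eq. 38)."""
    return 10.0 * math.log10(energy / I0)

def quantize_db(level_db, q=1.0):
    """z_k = Q(y_k): round the decibel level to the nearest q-dB step,
    so the quantization error stays within [-q/2, q/2]."""
    return q * round(level_db / q)

y = to_db(3.0e-4)           # a road-traffic-scale intensity, about 84.8 dB
z1 = quantize_db(y, q=1.0)  # 1 dB width
z2 = quantize_db(y, q=2.0)  # 2 dB width
```

With q = 1 dB the observation deviates from y_k by at most 0.5 dB, with q = 2 dB by at most 1 dB, matching the uniform [-q/2, q/2] error assumed for eta_k in Eq. (37).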
Figure 1. Comparison between the proposed method and a compared method without considering membership function in the estimation results for Data 1 based on the quantized observation data with 1 dB width.

Figure 2. Comparison between the proposed method and a compared method without considering membership function in the estimation results for Data 1 based on the quantized observation data with 2 dB width.

Figure 3. Comparison between the proposed method and the extended Kalman filter [2] in the estimation results for Data 1 based on the quantized observation data with 1 dB width.

Figure 4. Comparison between the proposed method and the extended Kalman filter [2] in the estimation results for Data 1 based on the quantized observation data with 2 dB width.

Figure 5. Comparison between the proposed method and our previously reported method [12] without considering membership function in the estimation results for Data 1 based on the quantized observation data with 1 dB width.

Figure 6. Comparison between the proposed method and our previously reported method [12] without considering membership function in the estimation results for Data 1 based on the quantized observation data with 2 dB width.
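As a numerical illustration of the membership weighting in Eq. (26) and the resulting parameters of Eq. (29): multiplying the Gaussian-in-ln(y) membership function into the lognormal prior again gives a lognormal form with shifted parameters M_{y_k} and S_{y_k}. A quick sketch (beta and the prior parameters are arbitrary illustrative values):

```python
import math

def membership(y, z, beta):
    """mu_zk(y) = exp{-beta (ln y - z)^2}  (Eq. 26)."""
    return math.exp(-beta * (math.log(y) - z) ** 2)

def posterior_params(mu, sigma2, z, beta):
    """M_yk and S_yk of Eq. (29): mean and variance in ln(y) after
    multiplying the membership function into the lognormal prior."""
    denom = 1.0 + 2.0 * beta * sigma2
    M = (mu + 2.0 * beta * sigma2 * z) / denom
    S = sigma2 / denom
    return M, S

M, S = posterior_params(mu=0.0, sigma2=1.0, z=2.0, beta=0.5)
```

As beta grows the fuzzy observation dominates (M approaches z_k and S shrinks toward zero); as beta approaches zero the prior parameters are recovered, which matches the role of beta as the sharpness of the fuzzy data.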
TABLE 1. MEAN AND STANDARD DEVIATION (SD) OF THE SPECIFIC SIGNAL (in W/m^2)
Data       | Data 1      | Data 2      | Data 3      | Data 4      | Data 5
Mean Value | 2.23 x 10^-4 | 3.44 x 10^-4 | 3.25 x 10^-4 | 3.82 x 10^-4 | 3.71 x 10^-4
SD         | 1.47 x 10^-4 | 2.13 x 10^-4 | 2.65 x 10^-4 | 3.19 x 10^-4 | 3.56 x 10^-4

TABLE 2. MEAN AND STANDARD DEVIATION (SD) OF THE BACKGROUND NOISE (in W/m^2)
Data       | Data 1      | Data 2      | Data 3      | Data 4      | Data 5
Mean Value | 2.50 x 10^-4 | 2.49 x 10^-4 | 2.44 x 10^-4 | 2.47 x 10^-4 | 2.48 x 10^-4
SD         | 1.08 x 10^-5 | 9.61 x 10^-6 | 9.49 x 10^-6 | 1.05 x 10^-5 | 9.26 x 10^-6

TABLE 3. COMPARISON BETWEEN THE PROPOSED METHOD AND THREE COMPARED METHODS FOR ROOT-MEAN SQUARED ERROR OF THE ESTIMATION BASED ON THE QUANTIZED OBSERVATION DATA WITH 1 [dB] WIDTH (in dB)
Method                                       | Data 1 | Data 2 | Data 3 | Data 4 | Data 5
Proposed Method                              | 0.954  | 0.734  | 0.723  | 1.04   | 1.12
Compared Method without Membership Function  | 1.15   | 0.820  | 0.780  | 1.10   | 1.20
Extended Kalman Filter [2]                   | 1.67   | 1.09   | 1.19   | 1.26   | 2.26
Our Previous Method [12]                     | 1.21   | 0.989  | 1.19   | 1.22   | 1.46

TABLE 4. COMPARISON BETWEEN THE PROPOSED METHOD AND THREE COMPARED METHODS FOR ROOT-MEAN SQUARED ERROR OF THE ESTIMATION BASED ON THE QUANTIZED OBSERVATION DATA WITH 2 [dB] WIDTH (in dB)
Method                                       | Data 1 | Data 2 | Data 3 | Data 4 | Data 5
Proposed Method                              | 1.58   | 0.820  | 0.780  | 1.10   | 1.20
Compared Method without Membership Function  | 2.10   | 1.67   | 1.72   | 2.73   | 2.37
Extended Kalman Filter [2]                   | 2.96   | 1.75   | 1.86   | 2.22   | 3.36
Our Previous Method [12]                     | 2.06   | 1.73   | 2.09   | 2.08   | 2.67

The squared sums of the estimation error are shown in Tables 3 and 4. These results clearly show the effectiveness of the proposed method for application to the observation of fuzzy data.

V. CONCLUSION

In this paper, based on the observed data containing fuzziness after contamination by the background noise, a Bayesian filter for estimating the fluctuation wave form of a specific signal has been proposed by paying our attention to the energy variable satisfying the additive property of the specific signal and the background noise. The proposed estimation method has been realized by introducing a new type of membership function suitable for the evaluation in decibel scale and by applying the probability measure of fuzzy events. The proposed method has been applied to the actual estimation problem of the sound environment, and it has been experimentally verified that a better result has certainly been obtained than the result employing three kinds of compared methods without considering any membership function.

ACKNOWLEDGMENT

The authors are grateful to Mr. Keita Takahashi for his help during this study. This work was supported in part by fund from the Grant-in-Aid for Scientific Research No. 15K06116 from the Ministry of Education, Culture, Sports, Science and Technology-Japan.

REFERENCES

[1] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME Series D, Journal of Basic Engineering, vol. 82, no. 1, pp. 35-45, 1960.
[2] H. J. Kushner, "Approximations to optimal nonlinear filters," IEEE Transactions on Automatic Control, vol. AC-12, no. 5, pp. 546-556, 1967.
[3] S. J. Julier, "The scaled unscented transformation," Proceedings of the American Control Conference, vol. 6, pp. 4555-4559, 2002.
[4] J. V. Candy, "Bayesian Signal Processing – Classical, Modern, and Particle Filtering Methods," John Wiley & Sons, 2008.
[5] S. Fruhwirth-Schnatter, "On fuzzy Bayesian inference," Fuzzy Sets and Systems, vol. 60, pp. 41-58, 1993.
[6] M. M. Rajabi and B. Ataie-Ashtiani, "Efficient fuzzy Bayesian inference algorithms for incorporating expert knowledge in parameter estimation," Journal of Hydrology, vol. 536, pp. 255-272, 2016.
[7] P. Eykhoff, "System Identification: Parameter and State Estimation," John Wiley & Sons, 1974.
[8] A. Ikuta, M. O. Tokhi and M. Ohta, "A cancellation method of background noise for a sound environment system with unknown structure," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E84-A, no. 2, pp. 457-466, 2001.
[9] L. A. Zadeh, "Probability measures of fuzzy events," Journal of Mathematical Analysis and Applications, vol. 23, pp. 421-427, 1968.
[10] M. Ohta and T. Koizumi, "General statistical treatment of response of a non-linear rectifying device to a stationary random input," IEEE Transactions on Information Theory, vol. 14, no. 4, pp. 595-598, 1968.
[11] M. Ohta and H. Yamada, "New methodological trials of dynamical state estimation for the noise and vibration environment system – establishment of general theory and its application to urban noise problems," Acustica, vol. 55, no. 4, pp. 199-212, 1984.
[12] A. Ikuta, "A Bayesian filter for sound environment system with quantized observation," Proceedings of INTER-NOISE 2016.


Application of adaptive Golomb codes for lossless audio compression

Cezary Wernik
Faculty of Computer Science and Information Technology
West Pomeranian University of Technology Szczecin
Poland
cwernik@wi.zut.edu.pl

Grzegorz Ulacha
Faculty of Computer Science and Information Technology
West Pomeranian University of Technology Szczecin
Poland
gulacha@wi.zut.edu.pl

Abstract—This paper presents the advantages of the Golomb code family, using audio signal coding as an example: low computational complexity, high efficiency, and the flexibility to adapt to local changes in the probability distribution of the coded data. The efficiency of a Golomb coder with forward adaptation is compared with three versions of a backward-adaptation coder. Attempts have also been made to address the imperfect fit of the encoded audio data distribution to the one-sided geometric distribution, for which the Golomb code is the optimal code.

Keywords—adaptive Golomb code, lossless audio compression.

I. INTRODUCTION

The Golomb code family is a specific variant of the Huffman code [13] for sources with an infinite number of source symbols. Golomb codes belong to the highly efficient methods of encoding geometrically distributed data, and in lossless encoding of images or audio the data distribution is similar to the geometric distribution.

This work focuses on the analysis of effective lossless coding of audio data of different classes (human speech, classical music, entertainment, electronic music, etc.). Important applications of lossless audio compression include archiving of recordings, the ability to record high-quality sound on commercial media (e.g. DVD, Blu-ray), as well as selling songs in online music stores to more demanding customers who will not be satisfied with MP3 quality [7]. In addition, the lossless mode is often required at the stage of studio music processing, in advertising materials, and in the production of radio and television programs and films (post-production [1]), where lossy coding is avoided because each iteration of sound editing could introduce additional, cumulative distortions.

In modern compression methods, two steps are usually used: decomposition of the data and compression with one of the efficient entropy methods. Among the entropy compression methods, the most effective are arithmetic coding and Huffman coding [13], together with its simple implementation variants such as the Rice [10] and Golomb [5] codes. In the case of audio data coding, the Gilbert-Moore block code, which is a kind of arithmetic code, is also used [9].

Publications in the field of lossless audio compression are mainly focused on the data decomposition stage, in which prediction methods are used in most cases. The next section of this work is devoted to the basics of audio data modeling. Sections III and IV focus on effective and fast coding of prediction errors, showing the advantages of the Golomb code in the version with forward adaptation and in three versions with backward adaptation. A comparison of the obtained results with other known solutions is also presented, and further directions of research are indicated.

II. BASICS OF AUDIO DATA MODELING

In the first years of the 21st century, many effective proposals of the MPEG-4 Lossless Audio Coding standard were developed. However, one cannot overlook the fact that a branch of amateur solutions, whose algorithms are not fully presented in scientific publications, is developing independently. For example, OptimFrog [19] and Monkey's Audio [20] are among the top-performing programs for lossless audio data compression. In modern compression methods two stages are usually used, decomposition and compression with one of the efficient entropy methods, where data modeling takes place at the decomposition stage to improve the coding efficiency.

There are two basic types of modeling. The first is the use of linear prediction [4], [11] or nonlinear prediction (e.g. using neural networks [8]). The second is the use of transformations such as the DCT (MPEG-4 SLS [17]) or wavelets, but the current state of knowledge allows us to assess that this type gives slightly lower efficiency in the case of lossless coding.

For this reason, in lossless audio codecs a typical linear predictor of order r is used for modeling, which computes the predicted value of the currently coded sample x(n) based on r previous signal samples. It has the form:

    x̂(n) = Σ_{i=1}^{r} w_i · x(n − i),    (1)

where the elements x(n − i) are the values of the samples preceding the currently coded x(n), while the elements w_i are the prediction coefficients [13]. Using the linear predictor allows encoding only the prediction errors e(n), the differences between the actual and predicted values, which are most often small values oscillating near zero:
    e(n) = x(n) − ⌊x̂(n) + δ⌋,    (2)

where δ = 0 or 0.5, depending on the need to adapt the distribution to the characteristics of the code (details are given in Section III). In this way we obtain a difference signal in which the distribution of the errors e(n) is similar to a two-sided geometric distribution. This allows effective coding using one of the static or adaptive entropy methods, such as the Golomb code.

To calculate the prediction coefficients, in practice the Levinson-Durbin autocorrelation method [18] is used most frequently; thanks to some simplifications it does not require computing an inverse matrix, but iteratively calculates the model coefficients for successive orders of prediction. Rejecting the assumption about the stationarity of the audio signal, the autocovariance method should be used, in which the vector of prediction coefficients w = [w_1, w_2, ..., w_r]^T is determined from a matrix equation [13]. Both of these approaches ensure the minimum mean square error (MMSE).

In this work the autocovariance method was used to determine the predictive models, assuming forward adaptation, which requires saving these models in the file header. It has been proposed to use a division into blocks of up to 20 seconds, each block being divided into frames with a length of N samples that have their own predictive model. The fact that there are significant dependencies between the channels is also used. Therefore, it is advantageous to use prediction models of order r using samples of both channels, the left marked as x_L(n − i) and the right as x_R(n − j):

    x̂_L(n) = Σ_{i=1}^{r_L} a_i^(L) · x_L(n − i) + Σ_{j=1}^{r_R} b_j^(L) · x_R(n − j),    (3)

    x̂_R(n) = Σ_{j=1}^{r_L} b_j^(R) · x_R(n − j) + Σ_{i=0}^{r_R − 1} a_i^(R) · x_L(n − i).    (4)

The vector of prediction coefficients for the left channel is w_L = [a_1^(L), a_2^(L), ..., a_{r_L}^(L), b_1^(L), b_2^(L), ..., b_{r_R}^(L)]^T, whereas for the right channel the vector of coefficients is defined as w_R = [b_1^(R), b_2^(R), ..., b_{r_L}^(R), a_0^(R), a_1^(R), ..., a_{r_R − 1}^(R)]^T. There are two formulas because, when coding (decoding) the value of the right-channel sample x_R(n) second, we already have access to the current sample of the left channel, x_L(n). Thus, the selection of which channel is coded first can have an impact on the resulting bit average. It should be clearly noted that in both cases r_L applies to the set of samples of the currently coded channel, while r_R is the number of samples of the opposing channel; furthermore, r = r_L + r_R. The frame size N = 2^11 and {r_L, r_R} = {11, 8} were selected experimentally.

For the analysis, 16 dozen-second fragments of recordings were used (stereo, 16-bit samples, 44 100 samples per second): various genres of music as well as male and female speech recordings available in the database [21].

III. APPLICATION OF THE GOLOMB CODE TO SOURCES WITH A GEOMETRIC DISTRIBUTION

In 1966, Solomon W. Golomb presented in [5] the basic assumptions and examples of the code family which came to be described as Golomb codes. This is a specific version of the Huffman code [13], a prefix code for a source with an infinite symbol alphabet. It is used to represent non-negative integers i whose probability of occurrence is consistent with a one-sided geometric distribution G(i) = (1 − p) · p^i.

Two facts related to the prediction errors created in the modeling stage should be noted here. First, the range of the prediction error values e(n) doubles in relation to the range of the x(n) samples of the encoded sound (e.g. for 16-bit samples, prediction errors assume values from the interval ⟨−65535; 65535⟩). This is therefore a 17-bit range, although it is possible to narrow it back to the original 16 bits without noticeable changes in the distribution of such a signal. In addition, in order to code these prediction errors using the Golomb code, it is necessary to map the values of e(n) into the range of non-negative numbers. In the codec proposed in this work, both of these operations are performed (in a way similar to that presented in [9]) using the algorithm presented in Fig. 1.

    for coder:
        e(n) := x(n) − x̂(n)
        if e(n) < −2^15 then e(n) := e(n) + 2^16
        else if e(n) ≥ 2^15 then e(n) := e(n) − 2^16
        if e(n) > 0 then ē(n) := 2 · e(n) − 1
        else ē(n) := −2 · e(n)

    for decoder:
        if ē(n) mod 2 = 0 then e(n) := −ē(n) / 2
        else e(n) := (ē(n) + 1) / 2
        x(n) := x̂(n) + e(n)
        if x(n) < −2^15 then x(n) := x(n) + 2^16
        else if x(n) ≥ 2^15 then x(n) := x(n) − 2^16

Figure 1. The algorithm of projection to non-negative numbers

The resulting numbers ē(n) (hereinafter referred to as modified prediction errors) lie in the range ⟨0; 65535⟩, so that they can be coded using the Golomb code.

The Golomb code is characterized by better efficiency compared with the frequently used Rice code. The main advantage of the Golomb code is that it does not require the use of code tables and is relatively simple to implement in hardware. A Golomb code word consists of two parts: the prefix, being the number of the group uG (in the form of a unary code), and a part written using the phased-in binary code [12] (the Huffman code version for a source of m equiprobable symbols), in which the number vG (the number
of an element in a given group) is coded. The group with number uG consists of m elements of the source alphabet with the consecutive numbers from uG·m to uG·m + m − 1. Setting the parameter k = ⌈log2 m⌉ means that in each group the first l = 2^k − m values of vG are coded with k − 1 bits, and the remaining m − l values are coded in the form of the number vG + l by means of k bits [13].

When creating a Golomb code word for a value ē(n), the group number should first be calculated by the formula:

    uG = ⌊ē(n) / m⌋,    (5)

and next the number of the element in the group (the remainder of the division by m):

    vG = ē(n) − uG · m.    (6)

The group number m (also called the Golomb code order) is selected by matching the data to the G(i) distribution depending on the value of the parameter p. This translates into the dependence p^m ≈ 1/2. For p ≥ 0.5, the theoretical Golomb code efficiency (measured as the ratio of the entropy to the bit average L_avg) for data consistent with the geometric distribution is greater than 95.942%. By carefully determining the value of m, the Golomb code becomes highly elastic (the Rice code is limited, because it only allows values of m that are powers of two).

Assuming local stationarity of the distribution characteristics in a frame with a length of N_G, we can assume that the expected value of the modified prediction errors is inversely proportional to 1 − p. Therefore, p is calculated from the formula:

    p = S / (S + 1),

where the expected value S is:

    S = (1/N_G) · Σ_{i=1}^{N_G} ē(i).    (7)

Having the value of p, we can determine the group number m using the formula presented, among other works, in [2], [11] (at γ = 0):

    m = ⌈−log10(1 + p) / log10(p) + γ⌉.    (8)

An experimental correction value γ = 0.41 was introduced into formula (8), which resulted in a slight decrease in the bit average (for the whole test base). However, there is a need to extend the scope of the experimental measurements by significantly increasing the size of the test base for a better selection of the value of γ.

At γ = 0 there is a linear relationship between the values m and S, amounting to m = (ln 2) · S [14]. After considering γ = 0.41 and the fact that the parameter m is an integer, the approximated formula m ≈ 0.693147 · S + 0.563636 was determined for 2 < m < 2^13 (designating m with an accuracy of ±1 relative to formula (8)).

For small values of S there is a problem with obtaining high Golomb code efficiency. At the smallest value m = 1, the expected value of the prediction errors needs to be 0, because the value e(n) = 0 has been assigned the shortest possible code word "1" (see Table I). At this point the two-sided geometric distribution reaches its maximum when δ = 0.5 in formula (2). For higher values of m, the value δ = 0 is used, thanks to which the symmetry axis of the two-sided geometric distribution is found at e(n) = 0.5. This allows obtaining the symmetry of the probability pairs P(e(n) = 0) = P(e(n) = 1), P(e(n) = −1) = P(e(n) = 2), etc., which is particularly advantageous at m = 2 and any even value of m (see Table I).

According to formula (8) (at γ = 0), the smallest value m = 1 is optimal for p ≤ (√5 − 1)/2, which corresponds to S ≤ (3 + √5)/2. As the geometric distribution is only an approximation of the actual distribution of the e(n) errors, the threshold value S = 1.5, below which m = 1 is used, was selected experimentally (thus extending the range of S for which m = 2 is used to S ∈ ⟨1.5; (3 + √5)/2⟩). After taking γ = 0.41 into account, we obtain the value m = 1 for S < 1.5 and m = 2 for S ∈ ⟨1.5; 3.48⟩. For higher values of S we use formula (8).

TABLE I. CODE WORDS TABLE OF THE GOLOMB CODE FAMILY

    e(n)   ē(n)   m = 2   m = 1
     0      0     10      1
     1      1     11      01
    −1      2     010     001
     2      3     011     0001
    −2      4     0010    00001
     3      5     0011    000001

It should be noted that locally the sequence of prediction errors has varying dynamics. This fact can be used by dividing the coded file into frames with a length of N_G samples and determining an individual value of the parameter S for each of them. Fig. 2 shows the variability of the S values for the first 200 frames (each with a length of 2^9 samples) of the left channel of the female_speech test file.

Using the forward adaptive approach, the individual m_i associated with the i-th frame should be written in the header. By specifying that m < 2^13, 13 bits are enough to save this parameter. The smaller the frame size, the more accurately the m parameter is adjusted to the local probability distribution. On the other hand, the need to send the individual m_i values to the
decoder determines the header size, which increases the bit average by 13/N_G bits per sample.

[Fig. 2 in the original is a plot of S (vertical axis, 0 to 4000) against the frame number (horizontal axis, 0 to 200).]

Figure 2. Variability of the expected value S for the first 200 frames of the left channel of the female_speech test file

For the compromise constant value N_G = 2^9, the best bit average for the entire test base was obtained. The bit average can be reduced further by encoding the m_i values using the phased-in binary code, and additionally by performing the bit average measurement several times for different N_G values from the set {256, 384, 512, 768, 1024, 1536, 2048}. The use of these two improvements allowed shortening the bit average (for the entire test base) from 9.201 to 9.183. The bit averages for the individual test files are presented in the second column of Table II (column 3 contains the size of the Golomb frame N_G).

In addition to forward adaptation, it is possible to determine the m value based on a number of recently coded (decoded) modified errors ē(n − i). This approach is called backward adaptation, and the m value can be set individually even for each consecutively encoded value ē(n). It does not require saving any header information about m.

IV. METHODS FOR CALCULATING THE EXPECTED VALUE S FOR DETERMINING THE GROUP NUMBER m OF THE GOLOMB CODE

An experiment consisting in removing the header information about the m_i parameters from the files coded with the forward method allowed determining the bit average for the test base at the level of 9.166, which is the reference point for the three proposals described below for determining the S value from which the current m_n value can be calculated. The obtained results indicate that backward adaptation has additional advantages that go beyond the savings resulting from not having to write the m_i values in the header.

The first approach (indicated in Table II as method I) consists in replacing formula (7) with the arithmetic mean of the t most recently encoded (decoded) modified errors ē(n − i):

    S_n = (1/t) · Σ_{i=1}^{t} ē(n − i).    (9)

The value of t depends on the characteristics of the encoded file, related to the dynamics of the error variation in the signal; therefore it cannot be assumed that there is some universal optimum value. For the compromise value t = 42, the bit average for the entire test base was 9.167. The coding efficiency can be increased by using a weighted average (method II), but this requires doubling the compromise value of t. Using formula (10) at t = 85, a bit average of 9.155 was obtained:

    S_n = ( Σ_{i=1}^{t} (1/i) · ē(n − i) ) / ( Σ_{i=1}^{t} 1/i ).    (10)

There is also a third approach (method III), based on a forgetting factor μ, in which the computational complexity is similar to the first approach, with a compression efficiency similar to method II. This solution uses an auxiliary sum S^(n), which is adapted using the formula:

    S^(n) = ē(n − 1) + μ · S^(n−1),    (11)

whereas the value S_n = (1 − μ) · S^(n) is used. Also in this case it is difficult to find a universal optimal value of μ. For μ = 0.952 a bit average of 9.156 was obtained.

TABLE II. BIT AVERAGE FOR DIFFERENT KINDS OF ADAPTIVE GOLOMB CODE

    File name         Forward    N_G    Method I   Method II   Method III
    ATrain             7.893      512    7.883      7.873       7.876
    BeautySlept        9.639     1024    9.634      9.631       9.633
    chanchan          10.238      384   10.225     10.180      10.184
    death2             6.425      512    6.405      6.393       6.390
    experiencia       11.468      512   11.451     11.427      11.432
    female_speech      4.965      512    4.947      4.941       4.939
    FloorEssence      10.027      512   10.000      9.974       9.973
    ItCouldBeSweet     8.690      512    8.676      8.669       8.669
    Layla             10.315      512   10.287     10.279      10.281
    LifeShatters      11.210     2048   11.206     11.201      11.203
    macabre           10.007     1024   10.004      9.999      10.002
    male_speech        5.085      512    5.060      5.054       5.050
    SinceAlways       10.816     1024   10.801     10.794      10.796
    thear1            11.776     1024   11.769     11.763      11.767
    TomsDiner          7.613      512    7.595      7.589       7.587
    velvet            10.764      256   10.724     10.714      10.716
    Base bit average   9.183       -     9.167      9.155       9.156

V. SUMMARY AND RESULTS

In this paper a method of lossless audio compression using predictive modeling with division into frames and Golomb coding is presented. Several approaches to the adaptive
calculation of the m parameter, which determines the approximate local probability distribution of the coded prediction errors, were analyzed. It has been shown that there are some differences between the geometric distribution and the one occurring for real data.

For the best adaptive Golomb method proposed here, the bit average for the test base turned out to be better by about 15.39% than that of the universal data archiving tool RAR 5.0, and 12.26% better than the dedicated SHORTEN 3.6.1 solution. Still, the results are worse than those of the best published solutions, such as MP4-ALS-RM23, as presented in Table III.

TABLE III. COMPARISON OF CODERS FOR THE BASE OF 16 AUDIO FILES [21] (BIT AVERAGE)

    File name         RAR     Shorten   WavPack   Our pro-   MP4-ALS-   Monkey's   OptimFrog
                              3.6.1               position   RM23       Audio
    ATrain            9.552    8.627     7.792     7.873      7.441      7.232      7.172
    BeautySlept      10.771   10.712     9.825     9.631      8.826      8.305      7.499
    chanchan         11.066   10.859    10.032    10.180      9.938      9.886      9.783
    death2            8.825    7.226     6.620     6.393      5.930      6.660      5.454
    experiencia      12.331   12.285    11.252    11.427     11.029     10.992     10.924
    female_speech     8.730    7.551     5.204     4.941      5.085      4.710      4.504
    FloorEssence     11.560   11.476     9.922     9.974      9.750      9.509      9.418
    ItCouldBeSweet   11.510   11.578     8.859     8.669      8.577      8.396      8.321
    Layla            11.207   10.868    10.202    10.279      9.885      9.691      9.581
    LifeShatters     11.823   12.166    11.052    11.201     10.874     10.836     10.822
    macabre          10.695   10.549     9.928     9.999      9.275      9.076      9.040
    male_speech       8.737    7.566     5.333     5.054      5.221      4.813      4.265
    SinceAlways      12.265   12.186    10.749    10.794     10.539     10.473     10.419
    thear1           12.285   12.561    11.635    11.763     11.504     11.425     11.411
    TomsDiner        10.113    9.709     8.087     7.589      7.423      7.268      7.117
    velvet           11.643   11.082    10.843    10.714     10.508     10.212     10.002
    Base bit average 10.820   10.438     9.208     9.155      8.863      8.718      8.483

VI. CONCLUSION

Our algorithm is similar in construction to MP4-ALS-RM23, but by building subsequent steps of the algorithm we fill gaps that were omitted in MP4, which will increase the overall efficiency in the future. The algorithms better than our proposal presented in Table III have many switches, which were chosen to give the most effective compression, whereas our proposal has a static configuration chosen ad hoc. Using the approach proposed here together with some ideas used in the work [15] (the introduction of a cascaded connection of successive blocks minimizing prediction errors) will, in further research, allow a significant improvement of the efficiency of the encoder with predictive forward adaptation. It is also expected to introduce a better technique than the MMSE selection of the prediction coefficients, as well as a more accurate individual selection of the prediction orders {r_L, r_R} in each frame [3], [16]. In addition, an algorithm for dividing each frame into sub-frames will be implemented in a similar way as in MPEG-4 ALS.

REFERENCES

[1] S. Andriani, G. Calvagno, T. Erseghe, G. A. Mian, M. Durigon, R. Rinaldo, M. Knee, P. Walland, M. Koppetz, "Comparison of lossy to lossless compression techniques for digital cinema," Proceedings of the International Conference on Image Processing ICIP'04, 24-27 Oct. 2004, vol. 1, pp. 513-516.
[2] V. Bhaskaran, K. Konstantinides, "Image and video compression standards - algorithms and architectures," Kluwer Academic Publishers, United Kingdom, 1997.
[3] F. Ghido, I. Tabus, "Sparse modeling for lossless audio compression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, Jan. 2013, pp. 14-28.
[4] C. D. Giurcaneau, I. Tabus, J. Astola, "Adaptive context based sequential prediction for lossless audio compression," Proceedings of the IX European Signal Processing Conference EUSIPCO 1998, Rhodes, Greece, Sept. 1998, vol. 4, pp. 2349-2352.
[5] S. W. Golomb, "Run-length encodings," IEEE Transactions on Information Theory, July 1966, vol. 12, pp. 399-401.
[6] H. Huang, P. Fränti, D. Huang, S. Rahardja, "Cascaded RLS-LMS prediction in MPEG-4 lossless audio coding," IEEE Transactions on Audio, Speech, and Language Processing, March 2008, vol. 16, no. 3, pp. 554-562.
[7] T. Liebchen, Y. A. Reznik, "Improved forward-adaptive prediction for MPEG-4 audio lossless coding," 118th AES Convention, 28-31 May 2005, Barcelona, Spain, pp. 1-10.
[8] E. Ravelli, P. Gournay, R. Lefebvre, "A two-stage MLP+NLMS lossless coder for stereo audio," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), Toulouse, France, 14-19 May 2006, vol. 5, pp. V-177-180.
[9] Y. A. Reznik, "Coding of prediction residual in MPEG-4 standard for lossless audio coding (MPEG-4 ALS)," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal, Quebec, Canada, 17-21 May 2004, vol. 3, pp. III-1024-1027.
[10] R. F. Rice, "Some practical universal noiseless coding techniques," Jet Propulsion Laboratory, JPL Publication 79-22, Pasadena, CA, March 1979.
[11] T. Robinson, "SHORTEN: Simple lossless and near-lossless waveform compression," Cambridge Univ. Eng. Dept., Cambridge, UK, Tech. Rep. 156, 1994, pp. 1-17.
[12] D. Salomon, Data Compression. The Complete Reference, 3rd edition, Springer-Verlag, New York, 2004.
[13] K. Sayood, Introduction to Data Compression, 2nd edition, Morgan Kaufmann Publishers, San Francisco, 2002.
[14] R. Sugiura, Y. Kamamoto, N. Harada, T. Moriya, "Optimal Golomb-Rice code extension for lossless coding of low-entropy exponentially distributed sources," IEEE Transactions on Information Theory, April 2018, vol. 64, no. 4, pp. 3153-3161.
[15] G. Ulacha, R. Stasiński, "Entropy coder for audio signals," International Journal of Electronics and Telecommunications, vol. 61, no. 2, 2015, pp. 219-224.
[16] Y. Yang, C. D. Giurcaneau, I. Tabus, "An application of the piecewise autoregressive model in lossless audio coding," Proceedings of the 7th Nordic Signal Processing Symposium (NORSIG 2006), Reykjavik, Iceland, 7-9 June 2006, pp. 326-329.
[17] R. Yu, S. Rahardja, C. C. Ko, H. Huang, "Improving coding efficiency for MPEG-4 audio scalable lossless coding," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), Philadelphia, PA, USA, 18-23 March 2005, vol. 3, pp. III-169-172.
[18] Y. You, Audio Coding. Theory and Applications, 1st edition, Springer, New York, 2010.
[19] http://www.losslessaudio.org/
[20] http://www.monkeysaudio.com/
[21] http://www.rarewares.org/test_samples/
[22] http://www.wavpack.com/
An application of acoustic sensors for the monitoring of road traffic

Karolina Marciniuk, Maciej Szczodrak, Andrzej Czyżewski
Multimedia Systems Department,
Faculty of Electronics, Telecommunications and Informatics,
Gdańsk University of Technology
Narutowicza 11/12, 80-233 Gdańsk
Email: karmarci(at:)multimed.org

Abstract—Assessment of road traffic parameters for the de- B. Proposed solution


veloped intelligent speed limit setting decision system constitutes
In our approach, a simple device for traffic monitoring based
the subject addressed in the paper. Current traffic conditions
providing vital data source for the calculation of the locally fitted on easily available and affordable components was prepared.
speed limits are assessed employing an economical embedded The main part of the device is the Arduino Mega 2560
platform placed at the roadside. The use of the developed development board with 16 MHz clock speed, 8 KB SRAM,
platform employing a low-powered processing unit with a set of and 4 KB EEPROM [6]. The software was written in C++
microphones, an accelerometer and some other sensors, for the
programming language. The advantage of such device is that
estimation of the essential road traffic parameters is presented
in the paper. Acoustical signal processing-based vehicle counting it can be powered via laptop USB port or from a portable
attempts were made, and an acceleration sensor was used in order source, i.e. powerbank. The detection of the vehicle is provided
to detect the heavy vehicles pass-bys. Obtained results based on mainly by the sound sensors. The additional information about
the measurements were discussed in the paper. Evaluation of the approximate vehicle type is provided through the vibrations
proposed methods is provided.
received by the accelerometer and by the gyroscope module.
The measurement device diagram is shown in Fig. 1.
I. I NTRODUCTION
C
The aim of this study is to examine the suitability and MI
L
accuracy of commonly available portable acoustic sensors and
accelerometers for the purpose of automatic traffic counting DETECTION LED
(ATC) and automatic vehicle classification (AVC).

ACCELEROMETER
A. Overview of available solutions
ATmega2560 with
Sensors typically used in ITS (Intelligent Transportation GYROSCOPE
SD card
Systems) are based on optical (Video Image Processing al-
gorithms [1], [2]), microwave (mostly Doppler systems) and MI
C
~ R
electromagnetic technologies (typically inductive lops) [3]. ~
Apart from traffic monitoring and control applications, sensors
are used for traffic law enforcement. PC
The technical design of car detectors and physical phe-
nomena exploited in them determine their placement and the
Fig. 1. A scheme of the measuring device used for traffic volume survey
necessary installation method. The inductive loop detectors,
for example, are installed in the pavement causing a faster
wearing of the pavement entailing temporary closure of the II. M EASUREMENT METHODS
road for repairs. Microwave sensors belong to the group of The system bases its action on two physical phenomena
active sensors emitting a probing signal to detect the vehi- arising as a result of a passage of the vehicle. The first one
cle [4]. Professional, heavy duty solutions are often expensive is a sound emission (engine, exhaust, tires) and the second is
in maintenance. connected with physical dimensions and interaction with air
Road sensors are used also for an incident management masses movement. The preliminary experiments were carried
such as incident detection, traveler information and emergency out on the city street with moderate traffic. The setup of the
management services. In case of a car accident, fast and equipment used for the field measurements is presented in
efficient response of emergency services is crucial to the health Fig. 2. Additional video camera was recording image and
and life saving [5]. audio as the source of reference data.
[Fig. 2 in the original is a photograph of the measurement setup, with the video camera, the accelerometer module (axes x, y, z marked in the top view) and the microphones labeled.]

Fig. 2. Measurement setup

A. Automatic traffic counting

Based on sound monitoring, the device counts vehicles as sound events. Each vehicle passage generates noise; in the case of speeds up to 30-40 km/h the main source of noise is the engine and the exhaust system, while at speeds typically found on highways the dominant road noise is the rolling sound of the vehicle resulting from the interaction of the tires with the road surface [7]. The publication [8] concerns the previous work of the authors related to the investigation of the possibility of using typical noise level measures, as well as high-resolution audio recording, for that purpose. Taking into account the cost of the device, its performance and resistance to weather conditions, it was decided to check the usability of sensors based on cheaper electronic components.

The device has been equipped with two analog microphones in the ORTF configuration. This stereo configuration allows for determining the direction of arrival of a vehicle depending on the moment of its detection by the individual channels (L or R). With one-way roads, only one microphone is used. There are two ways of exploiting the device, depending on the data processing methods.

The first one, all-in-the-box, assumes an autonomous action of tracking the sound levels and reacting to exceeding the reaction threshold signifying the appearance of a vehicle (Eq. 1). Because of the limitations related to Arduino board programming, an automatically adapting threshold is difficult to implement. Therefore, the device may not respond properly to some patterns of the pulsating rhythm of the traffic and to certain changes in its intensity during the day (influencing acoustic conditions), which can, in turn, cause some false positive or false negative errors.

Depending on the selected parameters (reaction threshold and sampling time), the vehicle is counted when the microphone closer to the direction of travel detects an event above the threshold. Exceeding the threshold and counting the vehicle is indicated by the lighting up of the LED diode. The second method assumes treating the device only as a recorder: data are acquired to local storage and can be later processed offline, i.e. according to ITS standards, in blocks of 10-15 minutes, depending on the chosen traffic management technique.

The first test of audio detection using the proposed platform was conducted under controlled conditions in an anechoic chamber. A high quality recording was reproduced using Genelec speakers placed 4 m away from the Arduino microphones. Two sampling ratios were tested and the outcome is presented in Fig. 3. The most significant moment is between the 20th and 23rd second, where two vehicles were physically present; the 125 ms sampling ratio revealed that situation precisely, whereas 250 ms sampling did not allow for distinguishing the two vehicles.

Fig. 3. Influence of audio sampling on detection accuracy. Tests were conducted in a controlled environment, with previously recorded high quality audio reproduced in an anechoic chamber

The sampling ratio should be chosen according to the distance from the road, the angle of the microphones and the typical speed of the vehicles. The time that a vehicle spends between the microphone gate (2·α) can be measured using equations 2 and 3. The equations show the individual time frames for each road lane: t_close for the first lane and t_far for the second lane of the road from the sensor's point of view. The data are saved synchronously on the SD card for the purpose of a more
detailed off-line analysis.

    D(i) = 1 if SoundLevel ≥ ReactionThreshold
    D(i) = 0 if SoundLevel < ReactionThreshold    (1)

    t_far = 2 · S_far / V
    t_close = 2 · S_close / V    (2)

where V is the typical vehicle speed in m/s, and S_far and S_close can be calculated using Eq. 3 and Fig. 4:

    S_far = tan α · (D_re + (3/2) · D_cr)
    S_close = tan α · (D_re + (1/2) · D_cr)    (3)

where S_far and S_close are half of the distance that a vehicle spends between the microphones, D_re is the distance between the sensor and the road edge, and D_cr is the lane width, as shown in Fig. 4.

[Fig. 4 in the original is a top-view diagram of the road with the gate half-angle α and the distances S_far, S_close, D_re and D_cr marked.]

Fig. 4. Setup of the measuring equipment and the required distances needed for measurement calibration

B. Heavy vehicle classification

Simple sound level analysis may not be efficient for vehicle classification. Recent research focuses on using detectors such as accelerometers for vehicle sensing and classification [9], [10], [11]. Some experiments report using a sensor mounted in the road pavement, which is costly, since it requires some construction work.

An attempt was made to detect heavy vehicle pass-bys employing an accelerometer mounted at the roadside (attached to a road sign). The higher and bigger the vehicle is, the greater the impact of the induced wind and vibrations received by the road infrastructure. The measurement device

During the measurement phase, M_A(i) is calculated using Eq. 5. The experimentally obtained threshold value is then applied to detect vibrations induced by heavy vehicles:

    M_A(i) = √( (A_X(i) − A_Xref)² + (A_Y(i) − A_Yref)² + (A_Z(i) − A_Zref)² )    (5)

III. RESULTS

A. Audio

Independently of the adjustment of the threshold taking into account the geometric variables of the road, the vehicle counting result has been re-calculated by modifying the detection threshold at the data processing stage. This allows for adding an adaptation of the system's response threshold. The obtained results for a selected time frame are presented in Tab. I. Figure 5 shows the results of the sound measurements using different methods. In the case of recording with the Arduino using the stereo technique, the driving position of the vehicle is clearly visible.

TABLE I. COUNTING RESULTS FROM ANALOG AUDIO SENSORS VS. REAL DATA AND HIGH RESOLUTION AUDIO EQUIPMENT (30 S OF RECORDING, REAL DATA = 26 VEHICLES) WITH DIFFERENT THRESHOLD VALUES (TH)

    Starting TH   Sampling   No of vehicles
                             Arduino   High def. audio
    30            50 ms      24        22
    30            125 ms     13        26
    30            250 ms     7         24
    20            50 ms      24        26
    20            125 ms     13        26
    20            250 ms     7         24
    10            50 ms      35        31
    10            125 ms     31        24
    10            250 ms     21        22
was LSM6DS33 chip which provides 3D digital accelerometer
and a 3D digital gyroscope. A full-scale acceleration range was
equal ±2g.
The algorithm begins with initializing accelerometer sensor
configuration registers. Then a calibration process in a period
of Tc (typically several seconds), sets the reference mag-
nitude MAref (i) calculated form accelerometer components
(AXref (i), AY ref (i), AZref (i)) using Eq. 4:
Fig. 5. Comparison of Arduino microphone sensors in ORTF configuration
q and real data acquired from camera observation
MAref (i) = AXref (i)2 + AYref (i)2 + AZref (i)2 (4)
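The gate geometry and the two detection rules above (Eqs. 1–5) can be sketched in a few lines. This is an illustrative sketch only, not the authors' firmware; the numeric values in the example are assumptions and do not come from the paper.

```python
import math

def gate_geometry(alpha_deg, d_re, d_cr):
    """Half-distances Sfar and Sclose of the microphone gate (Eq. 3).

    alpha_deg - microphone angle alpha from Fig. 4, in degrees
    d_re      - distance between the sensor and the road edge [m]
    d_cr      - lane width [m]
    """
    t = math.tan(math.radians(alpha_deg))
    s_far = t * (d_re + 1.5 * d_cr)    # second (far) lane
    s_close = t * (d_re + 0.5 * d_cr)  # first (close) lane
    return s_far, s_close

def gate_times(v, s_far, s_close):
    """Time a vehicle spends inside the gate for each lane (Eq. 2)."""
    return 2.0 * s_far / v, 2.0 * s_close / v

def detect(sound_level, reaction_threshold):
    """Per-sample detection rule D(i) of Eq. 1."""
    return 1 if sound_level >= reaction_threshold else 0

def accel_magnitude(ax, ay, az, ax_ref, ay_ref, az_ref):
    """Reference-compensated acceleration magnitude MA(i) of Eq. 5."""
    return math.sqrt((ax - ax_ref) ** 2
                     + (ay - ay_ref) ** 2
                     + (az - az_ref) ** 2)

# Illustrative setup (assumed values): 45-degree gate, sensor 3 m from
# the road edge, 3.5 m lanes, vehicles at 50 km/h.
s_far, s_close = gate_geometry(45.0, 3.0, 3.5)   # about 8.25 m and 4.75 m
t_far, t_close = gate_times(50.0 / 3.6, s_far, s_close)
```

With the calibration magnitude of Eq. 4 obtained the same way from the reference components, a heavy-vehicle pass-by would be flagged whenever `accel_magnitude(...)` exceeds the experimentally chosen threshold.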

210
B. Accelerometer

Due to the characteristics of the experiment location, only vehicles passing in the lane nearest to the road sign with the mounted accelerometer and gyroscope were taken into account. Vehicles were counted and classified manually, based on visual observations. Table II presents all types of vehicles passing by the investigated road in each measurement session (S1–S4).

TABLE II
VEHICLE CLASS TYPES MET AT THE EXPERIMENT LOCATION IN THE MEASUREMENT SESSIONS S1–S4

class name         class number   S1   S2   S3   S4
car                2              56   61   55   44
minivan, suv       3              10    9   18   14
bus, small truck   4               3    4    1    1
truck              5               1    -    1    1
bike               1               1    -    -    4

The results of the analysis of vibrations detected by the sensor mounted at the road sign, for measurement session S1, are depicted in Fig. 6. The vehicle class number relates to the one presented in Tab. II. The reference data, which describe the vehicle class, are juxtaposed with the values of the acceleration magnitude MA(i) exceeding the threshold. The truck pass-by is detected correctly; however, a false detection is observed at a car pass-by.

Fig. 6. Heavy vehicle classification results: reference vehicle classes and thresholded MA(i) (acceleration above threshold) plotted over time [s].

IV. CONCLUSIONS

The hardware can work as a simple traffic detector using audio processing algorithms of low complexity. Any additional calculations and observations have to be conducted in the off-line mode, because of the low computing power of the applied hardware. Therefore, the investigated device can be used mainly for data collection purposes. Implementation of an all-in-the-box scenario requires a more complex development board application.

The accuracy of the developed device depends on the registration parameters, namely on the sampling ratio and, for the live analysis, on the adopted threshold level. In the case of a steady, evenly distributed traffic flow, the proposed all-in-the-box solution is sufficient for working with a satisfactory efficiency. A problem may arise with heavy traffic, where the noise of single vehicles passing by cannot be distinguished. The inclusion of additional data coming from an accelerometer sensor may improve the detection of vehicles. For example, in a case when the sound level is high and the vibration sensor did not indicate any heavy vehicle presence, the excessive noise was induced by cars or motorcycles only.

Moreover, a problem related to the employment of such a simple device is the absence of a real-time clock and calendar data. Therefore, the synchronization of the relative sampling time instances with real time is difficult, thus an external clock module should be used. The next measurement sessions will include a GPS sensor providing the exact time base.

The vibration analysis method applied to the detection of heavy vehicles is promising, especially as it may work in quasi real-time on the presented device. However, some drawbacks should be considered, such as susceptibility to wind-induced vibrations of the road signs. Therefore, extended experiments will be made to collect data in order to develop more robust classification methods.

ACKNOWLEDGMENT

Research was subsidized by the Polish National Centre for Research and Development (NCBR) from the European Regional Development Fund under the Operational Programme Innovative Economy No. POIR.04.01.04-00-0089/16, entitled: "INZNAK Intelligent road signs...".

REFERENCES

[1] P. Dalka and A. Czyżewski, "Vehicle classification based on soft computing algorithms," in International Conference on Rough Sets and Current Trends in Computing. Springer, 2010, pp. 70–79.
[2] H. Jang, H.-J. Yang, D.-S. Jeong, and H. Lee, "Object classification using CNN for video traffic detection system," in Frontiers of Computer Vision (FCV), 2015 21st Korea-Japan Joint Workshop on. IEEE, 2015, pp. 1–4.
[3] S. S. M. Ali, B. George, L. Vanajakshi, and J. Venkatraman, "A multiple inductive loop vehicle detection system for heterogeneous and lane-less traffic," IEEE Transactions on Instrumentation and Measurement, vol. 61, no. 5, pp. 1353–1360, 2012.
[4] "Road network operations and intelligent transport systems. A guide for practitioners," Tech. Rep., 2018. [Online]. Available: http://www.mdpi.com/1424-8220/17/12/2817
[5] S. Shaheen and R. Finson, "Intelligent transportation systems," Dec. 2013.
[6] Arduino Mega 2560, ePro Labs, 2016. [Online]. Available: https://wiki.eprolabs.com/index.php?title=Arduino_Mega_2560
[7] P. Mitchell, Speed and Road Traffic Noise. The role that lower speed could play in cutting noise from traffic, ser. Blue Book, No. 4, UK Noise Association Std., December 2009. [Online]. Available: http://www.ukna.org.uk/uploads/4/1/4/5/41458009/speed_and_road_traffic_noise.pdf
[8] K. Marciniuk, B. Kostek, and A. Czyżewski, "Traffic noise analysis applied to automatic vehicle counting and classification," in International Conference on Multimedia Communications, Services and Security. Springer, 2017, pp. 110–123.
[9] W. Ma, D. Xing, A. McKee, R. Bajwa, C. Flores, B. Fuller, and P. Varaiya, "A wireless accelerometer-based automatic vehicle classification prototype system," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 1, pp. 104–111, Feb. 2014.
[10] W. Balid, H. Tafish, and H. H. Refai, "Development of portable wireless sensor network system for real-time traffic surveillance," in 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Sept. 2015, pp. 1630–1637.
[11] Z. Ye, L. Wang, W. Xu, Z. Gao, and G. Yan, "Monitoring traffic information with a developed acceleration sensing node," Sensors, vol. 17, no. 12, 2017. [Online]. Available: http://www.mdpi.com/1424-8220/17/12/2817

212
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Low-level audio descriptors-based analysis of music mixes from different Digital Audio Workstations – case study

Damian Koszewski, Bozena Kostek


Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics,
Multimedia Systems Department and Audio Acoustics Laboratory
Narutowicza 11/12, 80-233 Gdansk, Poland
{damkosze, bozenka}@sound.eti.pg.gda.pl

Abstract—The aim of this paper is two-fold. Firstly, we attempt to check whether objective, low-level audio descriptors may serve as a comparison tool in music mix evaluation performed using different Digital Audio Workstations (DAWs). Secondly, we seek to answer whether differences in music mixes are objectively discernible when several sound processing engines of DAWs are used. The same tracks of a song exported from different Digital Audio Workstations constitute the basis for this research study. Several song mixes are built of 24 individual tracks with no added effects, employing both commercial and non-commercial DAWs. Then, a set of time- and frequency-domain audio descriptors is calculated to find similarities and differences between the music mixes. Informal listening tests are conducted to answer to what extent experts are able to evaluate differences in these mixes. Then the data are analyzed to show that in most cases similar results are obtained regardless of the DAW employed.

Keywords: Signal processing, audio descriptors, Digital Audio Workstation, automatic mixing

I. INTRODUCTION

The advent of digital audio and high-speed global communication has revolutionized the way people produce, distribute and consume music. It has become possible for individuals to make near-professional recordings in their own home using a laptop and a microphone. While the technology has progressed enormously, a significant amount of time, skill and experience operating a Digital Audio Workstation (DAW) is necessary to produce high-quality results [1]. This also refers to the automatic mixing and mastering notions [2][3][4][5], which are nowadays seen more frequently, especially in game- or live-related audio areas [6][7]. There are many DAWs available on the market, both commercial and non-commercial. In theory, one can achieve the same final effect with any of them. However, every brand of DAW has an individual algorithm for summing up the tracks during bouncing. There are some references in the literature on the usability of DAWs, but they most often analyze differences imposed by DAW workflow, functionalities and ergonomics, and not the signal processing quality [8]. Such analyses are performed based on gathering the users' experience or respondents' answers to enquiries. One may discern two types of DAWs: DSP-based and host-based ones. DSP-based DAWs in general use a dedicated DSP (digital signal processing) board of high quality (i.e. 24 bit, 96 kHz) for audio processing. Such systems are capable of providing the functionality of a 48-track recording studio. Contrarily, host-based DAWs employ the host computer for all tasks, including audio processing, but they have a smaller capacity with regard to the recording/mastering studio [5]. They are also more operating-system-dependent than DSP-based DAWs [9]. In general, the signal processing quality of DAWs may be discussed in terms of latencies, but these data are not fully available. Overall, latencies lower than 20 milliseconds should be available to the user [10], but this is not always the case [9]. In this paper, the authors try to identify similarities and differences between those algorithms that may be called sound processing engines, with the use of low-level descriptor analysis derived from the exported tracks. For this purpose, 13 different DAWs were used to export a chorus from a song built of 24 individual tracks with no added effects. All DAWs were installed on the same computer, to eliminate hardware dissimilarity.

First, MPEG-7 low-level descriptors are recalled. Following this, the digital signal processing and the recording techniques used for the studied song are shortly described. In the next section, the analysis of the most diverse and varying descriptors is presented. The analysis was performed with the use of the MIRToolbox 1.7 script [11] for MatLab. Finally, conclusions are presented, referring to the experts' informal evaluation.

II. MPEG-7 LOW LEVEL DESCRIPTORS

Audio files and the audio channel of video files can be described with temporal, spectral, cepstral, and perceptual audio descriptors. The most common audio descriptors are as follows [12]:

 Temporal audio descriptors: the energy envelope descriptor represents the root mean square of the mean energy of the audio signal, which is also suitable for silence detection. The zero crossing rate descriptor represents the number of times the signal amplitude undergoes a change of sign (Fig. 1), which is used for

213
differentiating between periodic and noise-like signals, such as to determine whether the audio content is speech or music. The temporal waveform moment descriptors represent characteristics of the waveform shape, including temporal centroid, width, asymmetry, and flatness. The amplitude modulation descriptor describes the tremolo of a sustained sound, or the graininess or roughness of a sound. The autocorrelation coefficient descriptor represents the spectral distribution of the audio signal over time, which is suitable for musical instrument recognition.

Figure 1. Visual representation of the zero crossing rate descriptor [11].

 Spectral audio descriptors: the spectral moment descriptors correspond to the core spectral shape characteristics, such as spectral centroid (Fig. 2), spectral width, spectral asymmetry, and spectral flatness, which are useful for determining sound brightness, music genre, and categorizing music by mood [4][5][6]. The spectral decrease descriptor refers to the average rate of the spectral decrease with frequency. The spectral roll-off descriptor represents the frequency under which a predefined percentage (usually 85–99%) of the total spectral energy is present. The spectral irregularity descriptor describes the amplitude difference between adjacent harmonics. The descriptors of formant parameters represent the spectral peaks of the sound spectrum of voice, and are suitable for phoneme and vowel identification.

Figure 2. Geometric center of distribution [11].

 Cepstral audio descriptors: cepstral features are used for speech and speaker recognition and music modeling. The most common cepstral descriptors are the mel-frequency cepstral coefficient descriptors, which approximate the psychological sensation of the height of pure sounds, and are calculated using the inverse discrete cosine transform of the energy in predefined frequency bands.

 Perceptual audio descriptors: the loudness descriptor represents the impression of sound intensity. The sharpness descriptor, which corresponds to the spectral centroid, is typically estimated using a weighted centroid of specific loudness. The perceptual spread (Fig. 3) descriptor characterizes the timbral width of sounds, and is calculated as the relative difference between the specific loudness and the total loudness.

Figure 3. Second central moment [11].

 Dedicated audio descriptors [14]: the odd-even harmonic energy ratio descriptor represents the energy proportion carried by odd and even harmonics. The descriptors of the octave band signal intensities represent the power distribution of the different harmonics of music. The attack duration descriptor represents how quickly a sound reaches full volume after it is activated, and is used for sound identification. The harmonic-noise ratio descriptor represents the ratio between the energy of the harmonic component and the noise component, and enables the estimation of the amount of noise in the sound. The fundamental frequency descriptor, also known as the pitch descriptor, represents the inverse of the period of the periodic sound [12][13][15].

The descriptors employed in this paper are calculated using the MIRToolbox 1.7 script for Matlab [11]. All descriptors have been determined for every bounced music piece from the 13 different DAWs.

III. RECORDINGS AND DIGITAL SIGNAL PROCESSING

A. Recordings

The analyzed case study concerns a song that was recorded live at the Sunset Sound Recorders Studio 3 by Warren Huart [16], with only vocals overdubbed. It consists of 24 tracks: drums, percussion, bass, guitar, Hammond, lead vocal and background vocals. The input list and the microphones used are shown in Table I. All channels were captured in mono. Even though this is an example of mixing a single music piece, it should be noted that there are several musical instruments (including vocals) of different pitch and level range recorded using 24 channels, thus the results may be treated as more general.

TABLE I. INPUT LIST

Instrument          Microphone            Notes
Kick in             AKG D112              -
Kick out            Neumann 47FET         -
Snare top           Shure sm57            -
Snare bottom        Shure sm57            -
Snare shell         AKG C414 XLS          -
Hihat               AKG C414 XLS          -

214
TABLE I. INPUT LIST (CONTINUED)

Instrument          Microphone            Notes
OH L                Neumann TLM 67        XY configuration
OH R                Neumann TLM 67        XY configuration
Room                Neumann TLM 67        Far away in the room
Tambourine          AKG C414 XLS          -
Bass                Sennheiser MD 421-II  -
Bass                -                     DI
Columns             Neumann TLM 67        -
Talkback            AKG C414 XLS          -
Hammond Low         Sennheiser MD 421-II  Bottom of the instrument
Hammond High        Shure sm57            Top of the instrument
Guitar 1            Shure sm57            Z-amp
Guitar 2            -                     DI
Lead vocal          Neumann U87           Pop-filter
Background vocals   Neumann U87           Pop-filter

B. Digital signal processing

The delivered tracks were edited, synchronized and imported into the Pro Tools 12 HD software. Then, the initial mix was made – all tracks were matched by their levels so that no clipping was obtained on the main out. An 8-second sample from the chorus was extracted, and every channel was exported as an individual track for further analysis. This way, 24 separate channels were obtained and then imported into the different DAWs. One stereo file was exported from each, with no panning changes and no effects added. The following DAWs were employed: Ableton 9, Ableton 10, Adobe Audition, Cubase 9, Digital Performer, FL Studio, Bitwig 1, Logic Pro X, Reaper, Reason, Pro Tools 12 HD, Samplitude and Studio One. In the following section, the authors present the results of analyses conducted in the Matlab environment. Because of the large amount of data, the presented results are limited to the 7 most dissimilar DAWs and the 7 descriptors whose returned values differed the most.

IV. MATLAB-BASED ANALYSIS

All of the files have been loaded into Matlab and the "MIRfeatures" [11] analysis has been performed. First, however, the file structure was analyzed, and it was observed that some of the DAWs (Cubase, Bitwig) add a 0 character at the end of the wave file (Fig. 4). The DAW Reason cuts some of the last signal samples while exporting, whereas all of the other DAWs export the resulting file with the same number of samples.

Figure 4. Adding a 0 character at the end of the file.

The analysis brought a very large amount of data, because of the large number of analyzed descriptors, DAWs, and music samples. While analyzing and interpreting these results, the most major differences were observed for the following descriptors: zero-crossing rate, spectral spread and spectral centroid. The results obtained in the case of those descriptors for the chosen DAWs are presented in Figs. 5–7.

Figure 5. Zero-crossing rate descriptor differences.

Figure 6. Spectral spread descriptor differences.

Figure 7. Spectral centroid descriptor differences.

What is interesting, when calculating the spectral spread descriptor, it was observed that the DAW Ableton shows differences between versions 9 and 10. These are temporal audio

215
descriptors. They are the most crucial descriptors in terms of the perception of instruments' sounds, and they apply to the whole segment of the signal.

The statistical analysis of the descriptors for which differences in values were visible has shown that these differences are statistically insignificant. The Student's t-test has been applied for independent samples, and the p-value of statistical significance has been calculated and then compared with the test significance level α. In the experiment, α equals 0.05. The obtained p-values are at the level of 0.840. Therefore, this confirms the hypothesis of the equivalence of the exported DAW files that was formulated in this study. The results have been calculated with the help of Matlab for each descriptor. Possible differences in descriptor calculations may also be the result of using different algorithms of signal mixing in different DAWs. As observed in Table II, the differences between descriptor values are very small.

TABLE II. RESULTS OF CHOSEN DESCRIPTORS

Descriptor            Cubase    Bitwig    Ableton   Logic     Reaper
Spectral skewness     2.4782    2.4780    2.4782    2.4781    2.4782
Spectral kurtosis     10.0508   10.0508   10.0492   10.0508   10.0508
Spectral flatness     0.2069    0.20377   0.20691   0.20691   0.2069
Entropy of spectrum   0.8831    0.8881    0.8831    0.8831    0.8831

Figure 8. Different pan-law of the Reaper DAW.

V. CONCLUSIONS

During the experiments, some informal subjective tests were also conducted. They have shown that experts could not discern between the recordings and did not hear differences between the music samples processed by the DAWs, except in one case. Because of the different pan-law of the Reaper DAW, the exported mix sample was quieter by 3 dB (Fig. 8) compared to the others, which was noticed by the experts. Samples have also been compared in pairs, inverting the phase in one of them. The result for most pairs was silence, but the DAWs Cubase, Bitwig, Reaper, and Reason, because of added (or cut) samples, have shown some artifacts at the end of the audio.

The analysis of the descriptors allows concluding that the employed DAWs have different algorithms of mixing samples in the export process. In most cases, this does not affect the final result, and the exported files do not show any major differences in the perceived sound. The created excerpts and their analysis confirm that by using free or commercial software a desired and satisfactory result can be achieved, and the result of the mix mostly depends on the sound engineer's skills. This conclusion opens the pathway to researching software that automatically mixes tracks, because such software would cooperate rather flawlessly with any available (including free-to-use) DAWs. This aspect is especially important for mixing audio for games.

REFERENCES

[1] E. Perez-Gonzales and J. Reiss, "Automatic gain and fader control for live mixing," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009, pp. 1–4.
[2] S. Hafezi and J. D. Reiss, "Autonomous multitrack equalization based on masking reduction," J. Audio Eng. Soc., vol. 63, no. 5, pp. 312–323, May 2015.
[3] B. De Man, K. McNally, and J. D. Reiss, "Perceptual evaluation and analysis of reverberation in multitrack music production," J. Audio Eng. Soc., vol. 65, no. 1/2, 2017, doi: https://doi.org/10.17743/jaes.2016.0062
[4] E. Perez Gonzalez and J. D. Reiss, "Automatic gain and fader control for live mixing," IEEE WASPAA, Oct. 2009, https://doi.org/10.1109/ASPAA.2009.5346498
[5] A. Wilson and B. Fazenda, "Variation in multitrack mixes: Analysis of low-level audio signal features," J. Audio Eng. Soc., vol. 64, no. 7/8, pp. 466–473, July 2016.
[6] S. Hafezi and J. D. Reiss, "Autonomous multitrack equalization based on masking reduction," J. Audio Eng. Soc., vol. 63, pp. 312–323, May 2015, https://doi.org/10.17743/jaes.2015.0021
[7] P. Hoffmann and B. Kostek, "Bass enhancement settings in portable devices based on music genre recognition," J. Audio Eng. Soc., vol. 63, no. 12, pp. 980–989, 2015, http://dx.doi.org/10.17743/jaes.2015.0087
[8] http://www.resoundsound.com/two-daws (accessed June 2018).
[9] M. Jurewicz, Self, 24 bit 96 kHz Digital Audio Workstation using high performance Be Operating System on a multiprocessor Intel machine (http://www.espace-cubase.org/anglais/beosforaudio.pdf, accessed June 2018).
[10] https://www.ee.columbia.edu/~dpwe/pubs/MuEKR11-spmus.pdf (accessed June 2018).
[11] O. Lartillot, "MIRToolbox 1.7 User's Manual," University of Oslo, Department of Musicology, 2017.
[12] R. Koenen and F. Pereira, "MPEG-7: A standardised description of audiovisual content," in Signal Processing: Image Communication, vol. 16, pp. 5–13, Elsevier, 2000.
[13] B. S. Manjunath, P. Salembier, and T. Sikora, "Introduction to MPEG-7 audio – an overview," J. Audio Eng. Soc., vol. 49, pp. 589–594, July/August 2001.
[14] B. Kostek, Perception-Based Data Processing in Acoustics. Applications to Music Information Retrieval and Psychophysiology of Hearing, Series on Cognitive Technologies, Springer Verlag, Berlin, Heidelberg, New York, 2005.
[15] O. Lartillot, T. Eerola, P. Toiviainen, and J. Fornari, "Multi-feature modeling of pulse clarity: Design, validation, and optimization," International Conference on Music Information Retrieval, Philadelphia, 2008.
[16] W. Huart, https://www.producelikeapro.com (accessed June 2018).
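The descriptor comparison of Section IV and the pairwise phase-inversion (null) test from the conclusions can be sketched in a few lines. This is an illustrative, stdlib-only sketch (naive DFT, short synthetic tone) and not the paper's MIRToolbox pipeline; the 440 Hz tone merely stands in for the real 24-track exports.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change (ZCR descriptor)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the positive frequencies (fine for short frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectral_centroid_and_spread(frame, sr):
    """Spectral centroid (first moment) and spread (second central moment)."""
    mags = magnitude_spectrum(frame)
    freqs = [k * sr / len(frame) for k in range(len(mags))]
    total = sum(mags)
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    spread = math.sqrt(sum((f - centroid) ** 2 * m
                           for f, m in zip(freqs, mags)) / total)
    return centroid, spread

def null_test(a, b):
    """Peak absolute difference of two exports; 0.0 means bit-identical mixes."""
    return max(abs(x - y) for x, y in zip(a, b))

# Two "exports" of the same 440 Hz tone; the second is 3 dB quieter,
# mimicking the Reaper pan-law case from the conclusions.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * i / sr) for i in range(256)]
quieter = [x * 10 ** (-3 / 20) for x in tone]
print(null_test(tone, tone))            # 0.0 - identical exports cancel
print(null_test(tone, quieter) > 0.0)   # True - a level offset does not
```

Running the descriptor functions on each DAW's export and comparing the resulting values per descriptor corresponds to the per-descriptor comparison summarized in Table II.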

216
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Transmitting Alarm Information in DAB+ Broadcasting System

Przemysław Falkowski-Gilski
Faculty of Electronics, Telecommunications and Informatics
Gdansk University of Technology
Gdansk, Poland
przemyslaw.falkowski@eti.pg.edu.pl

Abstract—The main goal of digital broadcasting is to deliver high-quality content with the lowest possible bitrate. This paper is focused on transmitting alarm information, such as emergency warning and alerting, in the DAB+ (Digital Audio Broadcasting plus) broadcasting system. These additional services should be available at the lowest possible bitrate, in order to provide a clear and understandable voice message to people. Furthermore, additional information should not put stress on the ensemble management process, nor affect full-time audio services.

Keywords-audio coding, broadcasting, DAB+, digital audio broadcasting, SMOC

I. INTRODUCTION

In recent years, digital audio has become an important source of information in the modern world of broadcasting systems. In linear representation, digital audio files require a lot of memory or bandwidth for storage and transmission purposes, respectively.

The central goal in any multimedia communication is to deliver quality content with the lowest possible bitrate. By quality, we mean the perceived fidelity of the received content against the original content. The lowest possible bitrate depends on two disparate concepts: entropy and perception. Entropy measures the quantity of information, but not all information is perceptible. Human perception of aural data varies from person to person. It is based on different notions of good sound quality, musical skills, age, hearing defects and many other factors.

Currently, the majority of DAB+ broadcasters focus on implementing additional services such as:

 DLS (Dynamic Label Segment) – text information of length up to 128 characters.

 SLS (Slideshow) – sequences of still pictures, whose order and presentation time are generated by the broadcaster. This service has the biggest potential to increase advertising revenue.

 EPG (Electronic Programme Guide) – a schedule very similar to the one in TV, which helps the user to find, select and listen to a desired radio station.

However, the DAB+ broadcasting system also has the TPEG (Transport Protocol Experts Group) protocol for traffic or travel information, used to inform about road conditions and traffic jams. It can provide messages in the form of text, synthesized speech or graphics. Further information on DAB+ may be found in [1].

In order to provide clear and understandable voice information in the DAB+ broadcasting system, one needs to know how many bits are sufficient to convey quality content. This paper is focused on determining the lowest required bitrate necessary to deliver such a message to the general public.

II. QUALITY IN BROADCASTING SERVICES

The key issue in digital broadcasting is to provide high-quality audio services over varying bandwidth conditions and heterogeneous networks. Bandwidth fluctuations occur very often during transmission, even in a single connected network session. In this case packets may be lost or delayed, which is not acceptable for real-time applications.

This may cause degradation in quality, either in network QoS (Quality of Service) or in user QoE (Quality of Experience). The former describes a set of parameters, such as bandwidth, end-to-end delay, stream synchronization, ordered delivery of data and error rate. The latter describes requirements for the subjective perception of multimedia data at the user side.

The use of network communication imposes serious restrictions, including bandwidth limitations associated with available bitrates. In the last two decades, many research efforts have been devoted to the problem of audio compression. Two different compression categories have been of particular interest, namely high-performance and low-bitrate audio coding. Additional information may be found in [2][3].

High-performance audio coding aims to achieve audio quality as high as possible at a certain bitrate. On the other hand, for applications such as streaming or broadcasting, audio coding at the lowest possible bitrate is of major interest. In this sense, the reduction of data imposed by lossy compression algorithms should preserve some quality of the signal. It is worth mentioning that the mere reduction of the required storage size of an audio file should not be the main reason for applying any lossy compression algorithm.

217
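Entropy, as recalled in the introduction, bounds the lowest bitrate achievable by lossless coding. A minimal, illustrative sketch (ours, not part of the paper) of the first-order entropy of a quantized signal:

```python
import math
from collections import Counter

def first_order_entropy(samples):
    """Shannon entropy (bits/sample) of a discrete sample sequence.

    This is a lower bound on the average bitrate of any memoryless
    lossless code for the sequence's symbol distribution.
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A constant signal carries no information: 0 bits/sample.
assert first_order_entropy([42] * 100) == 0.0

# Uniformly distributed 8-bit samples need the full 8 bits/sample.
uniform = list(range(256)) * 4
print(first_order_entropy(uniform))  # 8.0
```

Real audio sits between these extremes, which is why lossless coders gain something but not enough for broadcasting bitrates; going below the entropy bound requires the lossy, perception-driven methods discussed next.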
III. AUDIO CODING

In the case of audio coding, lossless compression always ensures the highest possible quality, in which the objective redundancy in the multimedia content is the only reason for compression. Of course, there is a limit to the lowest possible bitrate with perfect decompression. Nevertheless, this limit may sometimes be hard to determine. Therefore, any experiment designed to investigate the perceived subjective user quality must take into account a set of parameters, due to the diversity and complexity of content [4].

One of the main goals of audio compression technology is to preserve as much sound quality as possible, even at very low bitrates. Of course, the higher the bitrate, the higher the quality. However, this increase in quality does not follow a linear scale. In every coding algorithm there is always a break point beyond which a further increase in bitrate does not imply a further rise in perceived quality. This break point is highly dependent on the type of transmitted content, including speech and music signals.

A. Perceptual Coding at Low Bitrates

Perceptual audio coding aims to reduce the bitrate required to transmit an audio signal while minimizing the perceptual distortion between the original and encoded versions. For musical audio, much of the effort has concentrated on generic transform coders that split the signal into an adaptive number of sub-bands and time frames, in order to quantize them separately. Perceptual audio compression utilizes the idea of auditory masking to hide coding distortion under the auditory masking thresholds, obtained from mathematical models of the human ear. However, at low bitrates, the coding noise is significant and cannot be masked by the audio content.

Currently, a broad range of public and private broadcasters are interested in content delivery using either terrestrial broadcasting or Internet streaming technology. The pursuit of low-bitrate but high-quality services is growing. When analyzing the market, one can notice that the MPEG-4 (Moving Picture Experts Group) codec has gained a huge market share and is now the leading standard. However, in some cases, too low bitrates may lead to annoying artefacts in the audio signal.

Existing coders, such as MPEG-4 AAC (Advanced Audio Coding) utilized in DAB+, provide near-transparent quality down to 50 kbps for mono signals but generate “birdies” artifacts under 15 kbps, caused by sound components appearing and disappearing successively [5]. Parametric coders achieve better quality at low bitrates by representing the signal as a sum of sound atoms whose structure is better adapted to musical audio. For instance, sinusoidal coders decompose the signal into a set of sinusoidal tracks, transients and background noise, which are encoded separately. This improves the quality around 10 kbps, but other kinds of artifacts appear at lower bitrates [6].

B. MPEG-4 AAC

To quantize audio in a perceptually lossless manner, all of the codecs in the MPEG-4 family contain models of the human hearing system that use block-based FFT (Fast Fourier Transform) to calculate signal masking parameters [7]. On the basis of these parameters, quantization noise is added only to time-frequency intervals in which it is masked by the audio itself and is thus made inaudible. Prior to quantization, the signal is processed by an MDCT (Modified Discrete Cosine Transform) having a 50% overlap and outputting either 1024 or 128 coefficients at each time shift, depending upon pre-echo considerations [8][9].

Frequency domain coding schemes such as AAC are based on three main steps:

1) time/frequency conversion;
2) subsequent quantization stage, in which the quantization error is controlled using information from a psychoacoustic model;
3) encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded using code tables.

This results in a source-controlled variable-rate codec, which adapts to the input signal statistics, as well as to the characteristics of human perception.

To further reduce the bitrate, AAC combines an LC (Low Complexity) core in the low frequency band with a parametric coding approach for the high frequency band, called SBR (Spectral Band Replication), and a joint stereo coding tool, PS (Parametric Stereo) [10].

The PS tool is capable of representing stereo signals by a mono downmix and corresponding sets of inter-channel level, phase and correlation parameters. The principle of SBR is based on the fact that the high frequencies of an audio signal can be extrapolated from the low frequencies; reconstruction by means of transposition allows the high frequency portion to be coded with very low overhead.

IV. TRANSMITTING ALARM INFORMATION

One of the key responsibilities of any government is to communicate and disseminate safety information and warnings to the general public in case of an emergency. Traditionally, warnings are issued through a broadcast approach using communication channels such as TV and radio.

The key advantage of broadcast technology is its high coverage and resilience to network overload. This enables people to receive information anytime and anywhere, both indoors and outdoors. Such a warning or alerting system, covering e.g. traffic information on highways, airports and train stations, as well as weather conditions, would be most welcome. This system should be integrated with existing digital broadcasting services and used as a complementary technology when needed. This paper examines opportunities and challenges related to delivering low-bitrate alarm information alongside regular broadcasts.

A. Network and Service Planning

The catastrophic consequences of emergencies and disasters are universally recognized, bearing devastating human, economic and environmental losses. Previous events mentioned in [11] highlighted the critical importance of

preparedness, early detection and warning communications, in order to mitigate losses of life, property and ecological damage.

Multiple redundant communication channels are required for the dissemination of critical safety information and warning messages to people. These may include incoming danger and risk of disasters. Multiple communication channels increase the effectiveness of emergency warnings by extending their reach, so that if one fails, others may get through. Furthermore, multiple means of delivering emergency information can also serve as a way of confirmation reinforcement: when people receive news of an unexpected event, they often seek confirmation from other sources.

Traditional communication channels such as door-to-door notification, signage, sirens, loudspeakers, radio, television, the fixed phone network and the Internet can only fill the information function of warnings and convey them in a passive manner. The public needs to tune to specific channels of communication. Door-to-door notification and the fixed phone network can actively notify and warn people of impending danger; however, their coverage is limited to a local scale and thus is not operationally effective. These different methods of communicating warning and safety information are not equally effective at providing an alert in different physical and social settings.

B. Warning Systems

To function effectively, warning systems must be supported by robust communication networks comprised of reliable infrastructure and effectual interactions between key stakeholders, decision makers and the general public. The range of information communication technologies used by emergency services authorities to disseminate warnings has conventionally involved one-way, top-down communication flows [12]. Technologies such as sirens, radio and television are commonly used to deliver warnings and safety information. In the past years, new media and communication technologies have emerged as viable tools for delivering warning messages and safety information.

Mobile telecommunication or cellular network technologies have become ubiquitous in today’s society. The widespread adoption of mobile and portable devices presents an opportunity to provide personalized support information during emergencies and disasters, based on the current position of a particular user.

Services for which it is crucial to know the position of a user or mobile terminal are referred to as LBS (Location-Based Services). LBS provide information that has been created, compiled, selected and filtered taking into consideration the current location of a person or device. They can appear in conjunction with conventional services like telephony and related value-added features. The attractiveness of LBS results from the fact that participants do not have to enter location information manually; they are automatically pinpointed and tracked. Furthermore, knowledge about the location of a handheld device, combined with a user profile, can significantly improve network planning, traffic analysis and radio resource management.

C. Positioning Technologies

Positioning methods can be divided according to the way in which the signal received by a device is measured. In a wireless system the location of a mobile device is either determined by the terminal itself (terminal-based approach) or calculated by the network using signal metrics from the UE (User Equipment) (network-based approach). There are of course hybrid methods involving measurements and position estimation as a combination of both network and handheld device (non-conventional approach) [13].

Mobile telephone warning systems have been embraced and used effectively around the world as complementary systems to the conventional, well-established warning channels. The most practical mobile telecommunication technologies that can be utilized for mobile emergency alert information services are the common SMS (Short Message Service) text messages and CBS (Cell Broadcast Service) services.

With CBS, uniform text messages are sent point-to-area to potentially all users within a specific geographic area defined by the serving cell towers. Consequently, CBS is not susceptible to network overload. With SMS, the text message is sent point-to-point to a specific predefined set of phone numbers. Unlike CBS, this channel is therefore an individually addressable channel, i.e. the recipients are well-known. However, SMS is vulnerable to network overload and message delivery failure if a huge number of SMS messages and/or phone calls are initiated simultaneously. Delays can occur and may result in delivery failure, especially during times of emergency. Nonetheless, SMS is a well-established and widely accepted protocol for communication. The benefits of SMS are that it supports delivery confirmation and has a store-and-forward mechanism: whenever a network failure occurs, the message is stored in the SMSC (Short Message Service Centre) and delivered when the recipient becomes available.

D. Alarm Information in Broadcasting Systems

In a broadcasting system, any alarm information concerning traffic on highways, at airports and train stations, as well as weather conditions, could be distributed using the Cell-ID (Cell Identification) method.

Cell-ID is considered to be the simplest network-based positioning method. In any wireless network the whole area of operation is divided into hexagonal cells, each with a unique identification code. When a mobile device, for instance a DAB+ radio receiver, moves within this honeycomb structure, it connects with the transmitter that provides the signal. However, the area of coverage, strictly linked with the number of connected receivers, is limited to the area of the serving transmitter. It is worth mentioning that in wireless cellular systems, the size of a cell in urban areas can vary from 1 to 3 km, whereas in suburban or rural areas it ranges from 10 up to 30 km. In broadcasting systems such as DAB+, a single transmitter most often covers a large city or metropolitan area.
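The Cell-ID targeting idea can be illustrated with a toy sketch. All names below (CELL_DIRECTORY, the cell codes, the area labels) are hypothetical, invented purely for illustration, and do not correspond to any real network data:

```python
# Toy directory of serving cells; every identifier here is invented for
# illustration only. Urban cells are small (1-3 km), rural ones large (10-30 km).
CELL_DIRECTORY = {
    "cell-0001": {"area": "city-centre", "radius_km": 2},
    "cell-0002": {"area": "city-east", "radius_km": 3},
    "cell-1001": {"area": "rural-north", "radius_km": 25},
}

def cells_for_alert(affected_areas):
    """Pick the cell IDs whose transmitters serve the affected areas,
    so an alarm message can be sent point-to-area (as with CBS)."""
    return [cell_id for cell_id, info in CELL_DIRECTORY.items()
            if info["area"] in affected_areas]

# A weather warning limited to the city centre targets a single urban cell:
print(cells_for_alert({"city-centre"}))
```

The point of the sketch is only that Cell-ID targeting reduces to a lookup from affected area to serving cells, so the alarm reaches just the receivers connected to those transmitters.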

V. ABOUT THE TEST

The DAB+ broadcasting system is one of the most popular standards for delivering broadcast content to consumers. It has broad capabilities of regionalization, as well as multiplex configuration. The most crucial factor is the efficient use of available resources within a single multiplex ensemble, for both economic and technical reasons [14].

As shown in a previous study [15], audio content transmitted over different radio programs does vary, depending on the profile of a radio program as well as the time of the day, which is clearly visible in the time schedule of a particular service. The profile of a radio program, as well as musical genres, are categories labeling pieces of music, which in broadcasting services refers to content transmitted over particular radio programs.

Using the SMOC (Speech, Music, Other, Commercial) method, a new adaptive bitrate assignment and resource allocation method for managing multiplex resources of the DAB+ broadcasting system makes it possible to reduce the size of the required bandwidth. These bandwidth savings can range up to ¼ of the multiplex ensemble.

As indicated, any saved resources could be used to further increase the quality of currently offered services or to introduce yet another full-time, part-time or periodical program. In this scenario, a portion of the saved resources would be used to transmit alarm information in the form of speech signals. These can include short or cyclic messages transmitted to the general public.

A. Signal Samples

For the purpose of this test, a set of signal samples has been selected. The files were sourced from the ITU (International Telecommunication Union) [16]. The description of the selected files from Annex B (Speech and Noise Signals Clause B) is shown in Table I.

TABLE I. AUDIO TEST SIGNALS

File name    Language          Duration [s]
AE_female1   American English  8
AE_female2   American English  8
AE_male1     American English  8
AE_male2     American English  7
PL_female1   Polish            7
PL_female2   Polish            7
PL_male1     Polish            7
PL_male2     Polish            7

From the list of available languages, both male and female signal samples in Polish and American English were utilized. All sourced materials were WAV 16-bit PCM (Pulse Code Modulation) files with the sampling frequency set to 32 kHz. Each file was then processed using the HE-AAC v1 coding algorithm (SBR tool active). The bitrate of the degraded signal sample ranged from 8 to 64 kbps and was limited to 5 values, that is: 8, 16, 32, 48 and 64 kbps. The sampling frequency was set to 48 kHz, as in other DAB+ audio services.

Thanks to SBR, the reconstruction of the high band is further improved by transmitting guiding information such as the spectral envelope of the original input signal or additional information to compensate for potentially missing high frequency components.

B. Subjective Quality Assessment Study

Tests were carried out in turns according to [17], one participant after another, with a short break between the assessment of Polish and American English audio samples. All 16 participants, aged between 20-25 years, took a training phase before starting the essential study, in order to get acquainted with the test equipment and become familiar with the test scenario. The subjective quality was assessed using Beyerdynamic Custom One Pro headphones. Further information on testing audio signals may be found in [18]. During the test, each sample was evaluated on a 5-step MOS (Mean Opinion Score) scale in two ways, namely: intelligibility of the voice message, and its overall quality.

The results of this study concerning intelligibility of the voice message are shown in Fig. 1-2.

Figure 1. Intelligibility – American English

Figure 2. Intelligibility – Polish
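The 5-step MOS evaluation reduces, per sample and bitrate, to averaging the listeners' category ratings. A minimal sketch of that computation (the ratings below are hypothetical, invented for illustration, not the study's actual data):

```python
from math import sqrt
from statistics import mean, stdev

def mos(scores):
    """Mean Opinion Score on the 1-5 scale, with a rough 95% confidence
    half-width assuming approximately normal rating noise."""
    m = mean(scores)
    half_width = 1.96 * stdev(scores) / sqrt(len(scores))
    return m, half_width

# Hypothetical ratings from 16 listeners for one sample at one bitrate:
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 4, 5, 3, 4, 4, 5, 4, 4]
m, hw = mos(ratings)
print(f"MOS = {m:.2f} +/- {hw:.2f}")
```

Reporting the confidence half-width alongside the mean makes it easier to judge whether, say, a 0.3-point MOS difference between two bitrates is meaningful for a 16-listener panel.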

Those concerning the overall quality of the audio material are shown in Fig. 3-4.

Figure 3. Overall quality – American English

Figure 4. Overall quality – Polish

According to the obtained results, when considering the intelligibility criterion, a bitrate of 16 kbps was sufficient to deliver a clear and easily understandable voice message. However, when analyzing the overall quality criterion, a MOS score of above 4.0 was given for signal samples coded at 32 kbps and higher.

As observed, MOS scores can vary based on cultural or language issues, the number of listeners, or even test conditions. That is why the range of tested audio samples is usually limited, depending on the interest in a specific research topic. Additionally, as indicated by the listeners, sentences spoken by a male lector seemed more appealing.

Furthermore, the spoken language of the presented voice messages, whether Polish or American English, did not have a significant impact on the obtained results. Each participant indicated Polish as their mother tongue, whereas English was the second language of choice, in which they most often communicate abroad. It would be interesting to investigate the impact of other languages spoken in the European community, since DAB+ is pointed out by many international organizations and associations as the leading broadcasting standard.

VI. SUMMARY

Studies show that in the case of terrestrial broadcasting and online streaming services, speech signals are perceived as of high quality for a bitrate of 48-64 kbps, whereas for music signals the required bitrate ranges from 96-128 kbps [19-22]. However, this bitrate is necessary for regular commercial broadcast services. When transmitting alarm information in the form of voice messages, i.e. concerning road, railway or air traffic, or weather conditions, quality can be lowered, as the intelligibility of a speech signal is much more relevant than its overall subjective quality. According to the obtained results, a clear and understandable voice message can be provided at bitrates of 32 kbps (high overall quality), or even 16 kbps (high intelligibility).

Current audio coding schemes, such as MPEG-4 AAC utilized in a number of digital broadcasting systems as well as online streaming platforms, provide subjectively high-quality services for both speech and music signals. In the case of low bitrate audio coding, especially lossy compression of speech signals, perceptual audio coding can greatly reduce the required bandwidth or storage space simply by removing the irrelevancy of the signal. Thanks to the SBR tool, the high frequency band is reconstructed from replicated low frequency signal portions, controlled by parameter sets containing level, noise and tonality parameters. For any audio signal, this means lossless coding to the extent that the distortion after decompression is imperceptible to the listener.

The amount of multimedia distributed over networks continues to grow, as real-time audio broadcasting and streaming services become more and more popular. However, efficient bandwidth and resource management related to handling bit streams still remains a challenge.

REFERENCES

[1] P. Gilski and J. Stefański, “Can the Digital Surpass the Analog: DAB+ Possibilities, Limitations and User Expectations,” Intl. J. Elec. Tele., vol. 62, no. 4, pp. 353–361, 2016.
[2] M. Leszczuk, L. Janowski, P. Romaniak, and Z. Papir, “Assessing Quality of Experience for High Definition Video Streaming under Diverse Packet Loss Patterns,” Signal Process. Image, vol. 28, no. 8, pp. 903–916, 2013.
[3] T. Uhl and S. Paulsen, “The new, parameterized VT Model for Determining Quality in the Video-telephony Service,” B. Pol. Acad. Sci. Tech., vol. 62, no. 3, pp. 431–437, 2014.
[4] S. Chen, R. Hu, and N. Xiong, “A Multimedia Application: Spatial Perceptual Entropy of Multichannel Audio Signals,” EURASIP J. Wirel. Comm., vol. 2010, pp. 1–13, 2010.
[5] M. Erne, “Perceptual audio coders ‘what to listen for’,” in Proc. 111th AES Convention, USA, 2001.
[6] E. Vincent and M. D. Plumbley, “A Prototype System for Object Coding of Musical Audio,” in Proc. of IEEE WASPAA, USA, 2005.
[7] P. Noll, “MPEG digital audio coding,” IEEE Signal Process. Mag., vol. 14, no. 5, pp. 59–81, 1997.
[8] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, no. 4, pp. 451–515, 2000.
[9] C. D. Creusere, “Understanding Perceptual Distortion in MPEG Scalable Audio Coding,” IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 422–431, 2005.
[10] M. Wolters, K. Kjörling, D. Homm, and H. Purnhagen, “A closer look into MPEG-4 High Efficiency AAC,” in Proc. 115th AES Convention, USA, 2003.

[11] S. Choy, J. Handmer, J. Whittaker, Y. Shinohara, T. Hatori, and N. Kohtake, “Application of satellite navigation system for emergency warning and alerting,” Comput. Environ. Urban, vol. 58, pp. 12–18, 2016.
[12] D. J. Parker and J. Handmer, “The role of unofficial flood warning systems,” J. Conting. Crisis Man., vol. 6, pp. 45–60, 1998.
[13] P. Gilski and J. Stefański, “Survey of Radio Navigation Systems,” Intl. J. Elec. Tele., vol. 61, no. 1, pp. 43–48, 2015.
[14] A. B. Dobrucki and P. Kozłowski, “Evaluation of the quality of audio signals transmitted by the telecommunication channels,” Przeg. Tel., vol. 6, pp. 235–241, 2010.
[15] P. Gilski, “Adaptive Multiplex Resource Allocation Method for DAB+ Broadcast System,” in Proc. 21st SPA Conference, Poland, pp. 337–342, 2017.
[16] ITU-T Test Signals for Telecommunication Systems, https://www.itu.int/net/itu-t/sigdb/genaudio/AudioForm-g.aspx?val=10000501 [Access: 13.04.2018].
[17] ITU Recommendation BS.1284, General methods for the subjective assessment of sound quality, Geneva, Switzerland, 2003.
[18] S. Brachmański, Selected issues in assessing the quality of speech transmission [in Polish: Wybrane zagadnienia oceny jakosci transmisji sygnału mowy]. Wrocław, Poland: Oficyna Wydawnicza Politechniki Wrocławskiej, 2015.
[19] S. Brachmański and M. J. Kin, “Assessment of speech quality in Digital Audio Broadcasting (DAB+) system,” in Proc. 134th AES Convention, Italy, 2013.
[20] M. J. Kin, “Subjective evaluation of sound quality of musical recordings transmitted via DAB+ system,” in Proc. 134th AES Convention, Italy, 2013.
[21] P. Gilski and J. Stefański, “Subjective and Objective Comparative Study of DAB+ Broadcast System,” Arch. Acoust., vol. 42, no. 1, pp. 3–11, 2017.
[22] P. Gilski, “DAB vs DAB+ Radio Broadcasting: a Subjective Comparative Study,” Arch. Acoust., vol. 42, no. 4, pp. 157–165, 2017.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

Deep Learning for Natural Language Processing and Language Modelling
Piotr Kłosowski
Faculty of Automatic Control, Electronics and Computer Science
Silesian University of Technology
Akademicka 16, 44-100 Gliwice, Poland
pklosowski@polsl.pl

Abstract—The article presents an example of a practical application of deep learning methods for language processing and modelling. The development of statistical language models helps to predict a sequence of recognized words and phonemes, and can be used for improving speech processing and speech recognition. However, currently the field of language modelling is shifting from statistical language modelling methods to neural networks and deep learning methods. Therefore, one of the methods of effective language modelling with the use of deep learning techniques is presented in this paper. The presented results concern the modelling of the Polish language, but the achieved research results and conclusions can also be applied to language modelling for other languages.

Index Terms—deep learning, machine learning, language analysis, language modelling, language processing, speech recognition.

I. INTRODUCTION

Deep learning for natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding and natural language generation [1]–[3].

The main objective of this research on natural language processing and modelling is improving speech processing and recognition for Polish [4], [5]. In addition, research studies have been conducted in the field of the properties of Polish phonemes [6]–[8], speech recognition based on them [9], speaker recognition [10]–[14] and new applications of speech processing (e.g. speech translation) [15]. Particularly good speech recognition performance is achieved through the use of statistical language modelling methods [16]–[20].

Currently, however, the field of natural language processing is shifting from statistical methods to neural network methods [2]. There are still many difficult problems to solve in natural language processing. Nevertheless, deep learning methods achieve state-of-the-art results for some specific language problems [1]. Not only is the performance of deep learning models on benchmark problems interesting; the fact is that a single model can learn the meaning of a word and perform language tasks, eliminating the need for specialized and manual methods. This paper presents the most interesting natural language processing tasks, such as language modelling, in which deep learning methods achieve some progress [1].

II. DEEP LEARNING APPLICATION

Over the last few years deep learning has been applied to hundreds of problems, ranging from computer vision to natural language processing. In many cases deep learning outperformed previous work. Deep learning is heavily used both in academia to study intelligence and in industry to build intelligent systems that assist humans in various tasks. There are many different applications and the list presented below is in no way exhaustive [1], [2]. Sample deep learning applications are: text classification [3], caption generation [21]–[23], machine translation [3], [15], [24], document summarization [25]–[27], question answering [3], [28]–[31], speech recognition [1], [3], [18]–[20], [32]–[34], and language modelling [18]–[20]. The problem area of this article focuses on the use of deep learning methods for language modelling for speech recognition.

Speech recognition is the problem of understanding what was said. The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker [1]. Given an utterance as audio data, the model must produce human-readable text. Given the automatic nature of the process, the problem may also be called automatic speech recognition (ASR). In the literature many examples of deep learning applications for speech recognition can be found [32]–[34]. More and more often, language modelling is used to create the text output that is conditioned on the audio data [3], [18]–[20].

Language modelling is really a subtask of more interesting natural language problems, specifically those that condition the language model on some other input. The main problem of language modelling is to predict the next word given the previous words [18]–[20]. The task is fundamental to speech or optical character recognition, and is also used for spelling correction, handwriting recognition, and statistical machine translation [3].
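The next-word-prediction task can be made concrete with a toy counting model. The following is a minimal, hypothetical sketch (a smoothed bigram counter, deliberately simpler than the deep learning models this paper is about; all names and the tiny corpus are invented for illustration):

```python
import math
from collections import Counter

def train_bigram(tokens, alpha=1.0):
    """Toy bigram model P(w_i | w_{i-1}) with add-alpha smoothing."""
    vocab = sorted(set(tokens))
    context = Counter(tokens[:-1])                  # how often each word is a context
    pairs = Counter(zip(tokens[:-1], tokens[1:]))   # (previous word, next word) counts
    def prob(prev, word):
        return (pairs[(prev, word)] + alpha) / (context[prev] + alpha * len(vocab))
    return prob, vocab

def perplexity(prob, tokens):
    """Perplexity 2**H, where H is the average negative log2-probability
    of each word given its predecessor; lower means a better model."""
    log_p = sum(math.log2(prob(p, w)) for p, w in zip(tokens[:-1], tokens[1:]))
    return 2.0 ** (-log_p / (len(tokens) - 1))

# A toy corpus invented for illustration:
corpus = "the cat sat on the mat the cat slept".split()
prob, vocab = train_bigram(corpus)
print(perplexity(prob, corpus))
```

A baseline like this is useful as a point of comparison: neural language models are preferred precisely because they generalize across similar words instead of relying on exact counts.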

III. LANGUAGE MODELLING PROBLEM DEFINITION

Language modelling can be useful for various speech and language processing applications, including automatic speech recognition [16]. Given the acoustic signal A, it is today common practice to introduce the acoustic model P(A|W) and the language model P(W) when searching for the best word sequence Ŵ.

The probability P(W) of occurrence of a sequence W of n words w_i can be decomposed as [16]:

P(W) = \prod_{i=1}^{n} P(w_i | w_1, \ldots, w_{i-1})    (1)

where P(w_i | w_1, ..., w_{i-1}) is the conditional probability that w_i will occur, given the previous word sequence w_1, ..., w_{i-1}. Unfortunately, it is impossible to compute conditional word probabilities P(w_i | w_1, ..., w_{i-1}) for all words and all sequence lengths in a given language. Even if sequences are limited to moderate values of i, there would not be enough data to estimate all conditional probabilities reliably. The conditional probability can be approximated by estimating the probability only on the preceding N−1 words, as defined by the following formula:

P_N(W) = \prod_{i=1}^{n} P(w_i | w_{i-N+1}, \ldots, w_{i-1})    (2)

This approximation is commonly referred to as an N-gram language model [16]. The most popular solutions published in the literature relate to the application of N-gram language models to word-based speech recognition tasks [35]–[38].

One approach to evaluating a language model is the derivative measure of entropy known as perplexity (PP) [16]. Given a language model P(W), where W is a word sequence with n words, the entropy of the language model can be defined as [39]:

H(W) = -\frac{1}{n} \log_2 P(W)    (3)

Note that as n approaches infinity, the entropy approaches the asymptotic entropy of the source defined by the measure P(W). This means that the typical length of the sequence must approach infinity, which is of course impossible. Thus, the entropy H(W) should be estimated on a sufficiently large value of n. The perplexity PP(W) of the language model is then defined as [16]:

PP(W) = 2^{H(W)}    (4)

Perplexity can be considered a measure of how many different, equally most probable words can on average follow any given word. Lower perplexities represent better language models, although this simply means that they ‘model language better’, rather than necessarily work better in speech recognition systems - perplexity is only loosely correlated with performance in a speech recognition system, since it has no ability to note the relevance of acoustically similar or dissimilar words. However, perplexity is not a definitive way of determining the usefulness of a language model. A model with low perplexity on a test set may not work equally well in a real-world application whose data may not be drawn from the same distribution as the test set. However, in the absence of more efficient means to evaluate language models, perplexity is a useful metric for comparing them.

IV. DEEP LEARNING APPLICATION FOR LANGUAGE MODELLING

A language model can predict the probability of the next word in a sequence, based on the words already observed in the sequence. This kind of language modelling is called word-based language modelling. The experiment presented in this paper is an example of developing a statistical language model using deep learning methods. The neural network (NN) model used and described here is a preferred method for developing statistical language models, because such models can use a distributed representation, where different words with similar meanings have similar representations, and because they can use a large context of recently observed words when making predictions. Deep neural networks usually have many hidden layers inside and are generally recurrent (RNN). The block diagram of a typical deep neural network with a hidden layer is presented in Fig. 1.

Fig. 1. Block diagram of typical deep neural network with hidden layer

The deep recurrent neural network (RNN) used in the experiment is composed of Long Short-Term Memory units and is often called an LSTM network. LSTMs are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter and Schmidhuber in 1997 [40], and were refined and popularized by many people in later works. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behaviour, not something they struggle to learn.

A. Deep Learning Development Environment

The language model presented in this paper was developed using software created in the Python programming language. The author used the Anaconda development environment [41] for this purpose and the neural network development tools available for the Python

programming language named TensorFlow and Keras [42], [43]. Anaconda is software that facilitates the installation of some of the Linux operating system distributions, especially systems derived from Linux. Anaconda is written in C and partly in Python, and uses the PyGTK package as a graphical environment [41]. TensorFlow was developed by the Google Brain team for internal Google use. It was released under the Apache 2.0 open source license on November 9, 2015 [42].

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Keras focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer. In 2017, Google’s TensorFlow team decided to support Keras in TensorFlow’s core library. Chollet explained that Keras was conceived to be an interface rather than a standalone machine-learning framework. It offers a higher-level, more intuitive set of abstractions that make it easy to develop deep learning models regardless of the computational backend used.

B. Deep Learning Software for Word-Based Language Modelling in Polish

The software created in the Python programming language with the high-level neural network API of Keras for word-based language modelling in Polish performs the following tasks:

1) training data preparation,
2) definition of the recurrent neural network for word-based

• normalize all words to lowercase to reduce the vocabulary size, because a smaller vocabulary results in a smaller model that trains faster.

A sample fragment of the training text file before and after preparation is presented in Listings 1 and 2.

Listing 1. A sample fragment of the training text file before preparation, containing the original text of Adam Mickiewicz’s epopee entitled "Pan Tadeusz" [44]

Litwo! Ojczyzno moja! ty jesteś jak zdrowie.
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
Panno Święta, co jasnej bronisz Częstochowy
I w Ostrej świecisz Bramie! Ty, co gród zamkowy
Nowogródzki ochraniasz z jego wiernym ludem!
Jak mnie dziecko do zdrowia powróciłaś cudem
(Gdy od płaczącej matki pod Twoją opiekę
Ofiarowany, martwą podniosłem powiekę
I zaraz mogłem pieszo do Twych świątyń progu
Iść za wrócone życie podziękować Bogu),
Tak nas powrócisz cudem na Ojczyzny łono.

Listing 2. A sample fragment of the training text file after preparation, containing the processed text of Adam Mickiewicz’s epopee entitled "Pan Tadeusz" [44]

litwo ojczyzno moja ty jesteś jak zdrowie ile cię trzeba cenić ten tylko się dowie kto cię stracił dziś piękność twą w całej ozdobie widzę i opisuję bo tęsknię po tobie panno święta co jasnej bronisz częstochowy i w ostrej świecisz bramie ty co gród zamkowy nowogródzki ochraniasz z jego wiernym ludem jak mnie dziecko do zdrowia powróciłaś cudem gdy od płaczącej matki pod twoją opiekę ofiarowany martwą podniosłem powiekę i zaraz mogłem pieszo do twych świątyń progu iść za wrócone życie podziękować bogu tak nas powrócisz cudem na ojczyzny łono

The technical parameters of the prepared training text are presented in Table I.

TABLE I
T ECHNICAL PARAMETERS OF PREPARED TRAINING TEXT
language modelling,
3) compiling and training of the language model, Parameter Value
4) analysis of the language modelling results. Total tokens 68249
Unique tokens 18688
The individual software tasks are described in the following Total sequences 68198
subsections. Text file size 455 851 [bytes]

C. Training Data Preparation


The training text consist just only 68249 words (total
As a language model training data in Polish, it was used tokens) in the clean text and a vocabulary of just 18688 words
text of Adam Mickiewicz’s epopee entitled "Pan Tadeusz" (unique tokens). This is smallish and models fit on this data
[44]. It turned out to be useful and necessary preparing the should be manageable on modest hardware. The training text
training source data for language modelling. It was necessary allows you to build 68198 sequences consisting of 50 words
to prepare the text properly by removing all unnecessary (tokens).
characters and to transform the raw text into a sequence of
tokens or words that can be used as a source to train the model. D. Definition of the Recurrent Neural Network for Word-
Based on reviewing the raw text, below are some specific Based Language Modelling
operations that need be perform to clean the text: The word-based language model will be statistical and will
• replace illegal characters with a white space so we can predict the probability of each word that receive as the text
split words better, input sequence. The predicted word will be entered as input to
• split words based on white space, generate the next word. The key design decision is the length
• remove all punctuation from words to reduce the vocab- of the input sequences. They must be long enough to allow
ulary size, the model to know the context of words to predict. This entry
• remove all words that are not alphabetic to remove length will also specify the length of the starting text used
standalone punctuation tokens, to generate new sequences when using the language model.

225
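As an illustration of the preparation steps described above — cleaning the raw text and building 50-word input sequences — a minimal sketch in plain Python follows (the function names `clean_text` and `build_sequences` are illustrative and are not taken from the author's software):

```python
import string

def clean_text(raw):
    """Split raw text on white space into lowercase, purely alphabetic tokens."""
    table = str.maketrans('', '', string.punctuation)
    tokens = (t.translate(table) for t in raw.split())   # strip punctuation glued to words
    return [t.lower() for t in tokens if t.isalpha()]    # drop non-alphabetic tokens

def build_sequences(tokens, length=51):
    """Sliding windows of `length` tokens: 50 context words plus one target word each."""
    return [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

tokens = clean_text("Litwo! Ojczyzno moja! ty jesteś jak zdrowie...")
print(tokens[:3])  # → ['litwo', 'ojczyzno', 'moja']
```

When applied to the full corpus, each 51-token window supplies 50 input words and the following word as the prediction target.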
There is no correct answer. With enough time and resources, we could study the ability of the language model to learn at different sizes of input sequences. Instead, the author chose the length of 50 words for the input sequences, somewhat experimentally.

Consequently, the developed deep learning RNN model has 50 inputs and 18688 outputs. The author used two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results, but the demand for computing power grows accordingly. A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector of the size of the vocabulary, with a probability for each word in the vocabulary. Technical details of the developed RNN language model are presented in Table II and its block diagram in Fig. 2.

TABLE II
TECHNICAL DETAILS OF DEVELOPED RNN LANGUAGE MODEL

Layer (type)              Output shape      Number of parameters
embedding_1 (Embedding)   (None, 50, 50)    934 400
lstm_1 (LSTM)             (None, 50, 100)   60 400
lstm_2 (LSTM)             (None, 100)       80 400
dense_1 (Dense)           (None, 100)       10 100
dense_2 (Dense)           (None, 18688)     1 887 488
Total params: 2 972 788
Trainable params: 2 972 788
Non-trainable params: 0

Fig. 2. Block diagram of the developed RNN language model: embedding_1_input (InputLayer, output (None, 50)) → embedding_1 (Embedding, output (None, 50, 50)) → lstm_1 (LSTM, output (None, 50, 100)) → lstm_2 (LSTM, output (None, 100)) → dense_1 (Dense, output (None, 100)) → dense_2 (Dense, output (None, 18688)).

E. Compiling and Training of the Language Model

Next, the model is compiled, specifying the categorical cross-entropy loss needed to fit the model. Technically, the model is learning a multi-class classification, and this is the suitable loss function for this type of problem. Finally, the model is fit on the data for 500 training epochs with a modest batch size of 256 to speed things up. Training may take over 30 hours on modern hardware without GPUs. It is possible to speed it up with a larger batch size and/or fewer training epochs. Measurement of the model training speed was performed on a PC workstation based on an Intel Core i7-4770K CPU @ 3.50 GHz. The computational power of the CPU used was around 130 GIPS. The model accuracy and loss values reported during the language model training process are presented in Figs. 3 and 4. Additionally, the model accuracy, loss values and processing time reported during the model training process are presented in Table III.

Fig. 3. Model accuracy reported during the language model training process (training accuracy over 500 epochs).

Fig. 4. Model loss reported during the language model training process (training loss over 500 epochs).

F. Language Modelling Results

The developed language model was used to predict and generate a sequence of words based on a context of the preceding 50 words. New machine-generated sequences of words have the same statistical properties as the source text [44].
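The generation loop just described — take a 50-word seed, predict the most probable next word, append it, and slide the window — can be sketched independently of any particular network. Here `predict_next` is a stand-in for a call into the trained Keras model (encode the window with the tokenizer, call `model.predict`, take the argmax over the vocabulary); it is an assumption of this sketch, not the paper's code:

```python
def generate(seed_tokens, predict_next, n_words=150, context=50):
    """Generate n_words tokens, each conditioned on the preceding `context` tokens."""
    out = list(seed_tokens)
    for _ in range(n_words):
        window = out[-context:]           # the model sees only the last `context` words
        out.append(predict_next(window))  # next word chosen by the language model
    return out[len(seed_tokens):]         # return only the newly generated words
```

With a 50-word seed and n_words=150, this produces machine-generated continuations of the kind shown in the listings below.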
TABLE III
LANGUAGE MODEL TRAINING PROCESS DETAILS

Training epochs   Model accuracy   Model loss   Training time (approx.)
100               0.402            2.779        5h43m30s
200               0.723            1.194        12h51m46s
300               0.851            0.614        16h32m54s
400               0.906            0.387        1d 0h34m28s
500               0.936            0.265        1d 6h24m38s

A sample fragment of the input file, containing the preceding context of the text, is presented in Listing 3 [44].

Listing 3. A sample fragment of the input file, containing the text (50 words) of Adam Mickiewicz's epic poem "Pan Tadeusz" [44]

litwo ojczyzno moja ty jesteś jak zdrowie ile cię trzeba
cenić ten tylko się dowie kto cię stracił dziś piękność twą
w całej ozdobie widzę i opisuję bo tęsknię po tobie panno
święta co jasnej bronisz częstochowy i w ostrej świecisz
bramie ty co gród zamkowy nowogródzki ochraniasz z jego
wiernym

A sample fragment of the output file, containing machine-generated text (consisting of 150 words) produced with the developed language model, based on the preceding context of 50 words shown in Listing 3, is presented in Listing 4.

Listing 4. A sample fragment of the output file, containing machine-generated text, based on the previous text context

ludem jak mnie dziecko do zdrowia powróciłaś cudem gdy od
płaczącej matki pod twoją opiekę ofiarowany martwą podniosłem
powiekę i zaraz mogłem pieszo do twych świątyń progu iść za
wrócone życie podziękować bogu tak nas powrócisz cudem na
ojczyzny łono tymczasem przenoś moję duszę utęsknioną do
tych pagórków leśnych do tych łąk zielonych szeroko nad błę­
kitnym niemnem rozciągnionych do tych pól malowanych zbożem
rozmaitem wyzłacanych pszenicą posrebrzanych żytem gdzie
bursztynowy świerzop gryka jak śnieg biała gdzie panieńskim
rumieńcem dzięcielina pała a wszystko przepasane jakby wstęgą
miedzą zieloną na niej z rzadka ciche grusze siedzą śród
takich pól przed laty nad brzegiem ruczaju na pagórku
niewielkim we brzozowym gaju stał dwór szlachecki z drzewa
lecz podmurowany świeciły się z daleka pobielane ściany tym
bielsze że odbite od ciemnej zieleni topoli co go bronią od
wiatrów romans izby zrobiłem jak domowym mnie stojąc i
skrzydła kto ręce obwieszczę u nas wujem bogata

For comparison, the original text following the fragment shown in Listing 3 is presented in Listing 5 [44].

Listing 5. A sample fragment of the original text of Adam Mickiewicz's epic poem "Pan Tadeusz" [44]

ludem jak mnie dziecko do zdrowia powróciłaś cudem gdy od
płaczącej matki pod twoją opiekę ofiarowany martwą podniosłem
powiekę i zaraz mogłem pieszo do twych świątyń progu iść za
wrócone życie podziękować bogu tak nas powrócisz cudem na
ojczyzny łono tymczasem przenoś moję duszę utęsknioną do
tych pagórków leśnych do tych łąk zielonych szeroko nad błę­
kitnym niemnem rozciągnionych do tych pól malowanych zbożem
rozmaitem wyzłacanych pszenicą posrebrzanych żytem gdzie
bursztynowy świerzop gryka jak śnieg biała gdzie panieńskim
rumieńcem dzięcielina pała a wszystko przepasane jakby wstęgą
miedzą zieloną na niej z rzadka ciche grusze siedzą śród
takich pól przed laty nad brzegiem ruczaju na pagórku
niewielkim we brzozowym gaju stał dwór szlachecki z drzewa
lecz podmurowany świeciły się z daleka pobielane ściany tym
bielsze że odbite od ciemnej zieleni topoli co go bronią od
wiatrów jesieni dóm mieszkalny niewielki lecz zewsząd chę­
dogi i stodołę miał wielką i przy niej trzy stogi użątku co
pod ...

Perplexity values and average numbers of correctly predicted words in machine-generated text, depending on the model training level (training epochs), are presented in Table IV as the developed language model evaluation results.

TABLE IV
LANGUAGE MODEL EVALUATION RESULTS

Training epochs   Perplexity value   Average number of correctly predicted words
100               6.86               5
200               2.29               9
300               1.53               21
400               1.31               113
500               1.20               134

V. CONCLUSION

The article presents an example of a practical application of deep learning methods for language processing and modelling. It was shown how to develop a word-based language model using a recurrent neural network, as applied in deep machine learning techniques. The sample experiment presented in the paper also shows: how to prepare text for developing a word-based language model; how to design and train a neural language model with a learned embedding and an LSTM hidden layer; and how to use the learned language model to generate new text with statistical properties similar to those of the source text.

The developed language model, based on deep learning methods, can be used to generate new sequences of text that have the same statistical properties as the source text. This is not practical, at least not for this example, but it gives a concrete example of what the language model has learned with the use of deep learning techniques and how it works.

Application of a relevant language corpus as training text data for deep language model learning can be used in practice to predict spoken words and thus improve the speech recognition process. The results presented in this article concern the modelling of the Polish language, but the achieved research results and conclusions can also be applied to language modelling for other languages.

ACKNOWLEDGEMENTS

This work was supported by the Polish Ministry of Science and Higher Education funding for statutory activities.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[2] Y. Goldberg, "A primer on neural network models for natural language processing," CoRR, vol. abs/1510.00726, 2015.
[3] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[4] P. Kłosowski, "Speech Processing Application Based on Phonetics and Phonology of the Polish Language," in Computer Networks (Kwiecien, A and Gaj, P and Stera, P, ed.), vol. 79 of Communications in Computer and Information Science, (Germany), pp. 236–244, Springer-Verlag Berlin, 2010. 17th International Conference Computer Networks, Ustron, Poland, Jun 15-19, 2010.
[5] P. Kłosowski, "Improving speech processing based on phonetics and phonology of Polish language," Przegląd Elektrotechniczny, vol. 89, no. 8/2013, pp. 303–307, 2013.
[6] J. Izydorczyk and P. Kłosowski, "Acoustic properties of Polish vowels," Bulletin of the Polish Academy of Sciences – Technical Sciences, vol. 47, no. 1, pp. 29–37, 1999.
[7] J. Izydorczyk and P. Kłosowski, "Base acoustic properties of Polish speech," in International Conference Programmable Devices and Systems PDS2001 IFAC Workshop, Gliwice, November 22nd-23rd, 2001, pp. 61–66, IFAC, 2001.
[8] P. Kłosowski, "Algorithm and implementation of automatic phonemic transcription for Polish," in Proceedings of 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, September 21-23, 2016, Poznań, Poland, pp. 298–303, 2016.
[9] P. Kłosowski, A. Dustor, J. Izydorczyk, J. Kotas, and J. Slimok, "Speech Recognition Based on Open Source Speech Processing Software," in Computer Networks, CN 2014 (Kwiecien, A and Gaj, P and Stera, P, ed.), vol. 431 of Communications in Computer and Information Science, (Germany), pp. 308–317, Springer-Verlag Berlin, 2014. 21st International Science Conference on Computer Networks (CN), Brunow, Poland, Jun 23-27, 2014.
[10] A. Dustor and P. Kłosowski, "Biometric Voice Identification Based on Fuzzy Kernel Classifier," in Computer Networks, CN 2013 (Kwiecien, A and Gaj, P and Stera, P, ed.), vol. 370 of Communications in Computer and Information Science, (Germany), pp. 456–465, Springer-Verlag Berlin, 2013. 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, Jun 17-21, 2013.
[11] A. Dustor, P. Kłosowski, and J. Izydorczyk, "Influence of Feature Dimensionality and Model Complexity on Speaker Verification Performance," in Computer Networks, CN 2014 (Kwiecien, A and Gaj, P and Stera, P, ed.), vol. 431 of Communications in Computer and Information Science, (Germany), pp. 177–186, Springer-Verlag Berlin, 2014. 21st International Science Conference on Computer Networks (CN), Brunow, Poland, Jun 23-27, 2014.
[12] A. Dustor, P. Kłosowski, and J. Izydorczyk, "Speaker recognition system with good generalization properties," in 2014 International Conference on Multimedia Computing and Systems (ICMCS), pp. 206–210, IEEE, 2014. International Conference on Multimedia Computing and Systems (ICMCS), Marrakech, Morocco, Apr 14-16, 2014.
[13] P. Kłosowski, A. Dustor, and J. Izydorczyk, "Speaker verification performance evaluation based on open source speech processing software and TIMIT speech corpus," in Computer Networks, CN 2015 (Gaj, P and Kwiecien, A and Stera, P, ed.), vol. 522 of Communications in Computer and Information Science, (Germany), pp. 400–409, Springer-Verlag Berlin, 2015. 22nd International Conference on Computer Networks (CN), Brunow, Poland, Jun 16-19, 2015.
[14] A. Dustor, P. Kłosowski, J. Izydorczyk, and R. Kopanski, "Influence of Corpus Size on Speaker Verification," in Computer Networks, CN 2015 (Gaj, P and Kwiecien, A and Stera, P, ed.), vol. 522 of Communications in Computer and Information Science, (Germany), pp. 242–249, Springer-Verlag Berlin, 2015. 22nd International Conference on Computer Networks (CN), Brunow, Poland, Jun 16-19, 2015.
[15] P. Kłosowski and A. Dustor, "Automatic Speech Segmentation for Automatic Speech Translation," in Computer Networks, CN 2013 (Kwiecien, A and Gaj, P and Stera, P, ed.), vol. 370 of Communications in Computer and Information Science, (Germany), pp. 466–475, Springer-Verlag Berlin, 2013. 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, Jun 17-21, 2013.
[16] F. Jelinek, Statistical Methods for Speech Recognition. Language, Speech, & Communication: A Bradford Book, USA: MIT Press, 1997.
[17] J. R. Bellegarda and C. Monz, "State of the art in statistical methods for language and speech processing," Computer Speech and Language, vol. 35, pp. 163–184, Jan 2016.
[18] P. Kłosowski, "Statistical analysis of Polish language corpus for speech recognition application," in Proceedings of 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, September 21-23, 2016, Poznań, Poland, pp. 304–309, 2016.
[19] P. Kłosowski, "Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2017, no. 1, p. 5, 2017.
[20] P. Kłosowski, "Polish language modelling for speech recognition application," in Proceedings of the 21st IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, September 20-22, 2017, Poznan, Poland, pp. 313–318, 2017.
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," Computing Research Repository (CoRR), vol. abs/1502.03044, 2015.
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," Computing Research Repository (CoRR), vol. abs/1411.4555, 2014.
[23] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence – video to text," Computing Research Repository (CoRR), vol. abs/1505.00487, 2015.
[24] M. Auli, M. Galley, C. Quirk, and G. Zweig, "Joint language and translation modeling with recurrent neural networks," in Microsoft Research, (Seattle, Washington), October 2013.
[25] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," Computing Research Repository (CoRR), vol. abs/1509.00685, 2015.
[26] R. Nallapati, B. Xiang, and B. Zhou, "Sequence-to-sequence RNNs for text summarization," Computing Research Repository (CoRR), vol. abs/1602.06023, 2016.
[27] J. Cheng and M. Lapata, "Neural summarization by extracting sentences and words," Computing Research Repository (CoRR), vol. abs/1603.07252, 2016.
[28] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Elsevier Science, 2011.
[29] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, "Teaching machines to read and comprehend," in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1693–1701, Curran Associates, Inc., 2015.
[30] L. Dong, F. Wei, M. Zhou, and K. Xu, "Question answering over Freebase with multi-column convolutional neural networks," in ACL, 2015.
[31] L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman, "Deep learning for answer sentence selection," Computing Research Repository (CoRR), vol. abs/1412.1632, 2014.
[32] A. Graves, S. Fernández, and F. Gomez, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the International Conference on Machine Learning, ICML 2006, pp. 369–376, 2006.
[33] A. Graves, A. Mohamed, and G. E. Hinton, "Speech recognition with deep recurrent neural networks," Computing Research Repository (CoRR), vol. abs/1303.5778, 2013.
[34] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in Interspeech 2013, ISCA, August 2013.
[35] S. Takahashi and T. Morimoto, "N-gram Language Model Based on Multi-Word Expressions in Web Documents for Speech Recognition and Closed-Captioning," in 2012 International Conference on Asian Language Processing (IALP 2012) (Xiong, D and Castelli, E and Dong, M and Yen, PTN, ed.), pp. 225–228, 2012.
[36] A. Hatami, A. Akbari, and B. Nasersharif, "N-gram Adaptation Using Dirichlet Class Language Model Based on Part-of-Speech for Speech Recognition," in 2013 21st Iranian Conference on Electrical Engineering (ICEE), 2013.
[37] M. Bahrani, H. Sameti, N. Hafezi, and S. Momtazi, "New word clustering method for building n-gram language models in continuous speech recognition systems," in New Frontiers in Applied Artificial Intelligence (Nguyen, NT and Borzemski, L and Grzech, A and Ali, M, ed.), vol. 5027 of Lecture Notes in Artificial Intelligence, pp. 286–293, 2008.
[38] B. Rapp, "N-gram language models for Polish language. Basic concepts and applications in automatic speech recognition systems," in 2008 International Multiconference on Computer Science and Information Technology (IMCSIT), Vols 1 and 2 (Ganzha, M and Paprzycki, M and PelechPilichowski, T, ed.), pp. 295–298, 2008.
[39] T. Cover and J. Thomas, Wiley Series in Telecommunications: Elements of Information Theory. USA: John Wiley and Sons, 1991.
[40] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, 12 1997.
[41] Anaconda, "What is Anaconda," 2018. https://www.anaconda.com/what-is-anaconda/.
[42] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2015.
[43] F. Chollet et al., "Keras." https://github.com/fchollet/keras, 2015.
[44] A. Mickiewicz, Pan Tadeusz, czyli Ostatni zajazd na Litwie: historja szlachecka z r. 1811 i 1812, we dwunastu księgach, wierszem. Wydawn. Zakładu Narodowego im. Ossolińskich, 1834.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

Statistical properties of signals approximated by orthogonal polynomials and Schur parametrization

Wladyslaw Magiera∗ and Urszula Libal†
Signal Processing Systems Department
Wroclaw University of Science and Technology
Wroclaw, Poland
Email: ∗ wladyslaw.magiera@pwr.edu.pl, † urszula.libal@pwr.edu.pl

Abstract—In the paper, we investigate the reconstruction of statistical properties of signals approximated in various orthogonal bases. The approximation of signals is performed in various polynomial bases and by the Schur parametrization algorithm. To compare the quality of the remodeled signals in different bases, we use the mean square error criterion for the power spectral density. The correlation function, and the power spectral density derived from it, is sufficient to describe the statistical properties of a signal. The numerical experiments were performed using benchmark signals. The tests were executed for different polynomial degrees and different orders of Schur innovation filtering. Our purpose was to find which parametrization method requires fewer parameters.

Fig. 1. Schema of innovations filtering (i.e. Schur parametrization) and modeling of a signal, using only a set of Schur coefficients.

I. INTRODUCTION
The basic task of signal parametrization, closely related to many practical problems, is the compression of information. One of the most popular applications of signal parametrization is the transmission of compressed speech signals with the use of the LPC method (Linear Predictive Coding, [16], [11]) in telecommunication systems such as GSM. The LPC method is generally used for speech analysis and resynthesis. In the first step, the signal is coded by a set of parameters; on the other side of the transmission channel, it is decoded based on the obtained parameters.

The motivation behind the paper was the comparison of different methods of parametrization of signals in various orthogonal bases: obtained by innovations filtering according to the Schur algorithm [13], [23], [5], and by families of orthogonal polynomials, such as Laguerre, Legendre, Chebyshev of the first and second kind, and trigonometric polynomials [1], [17], [19], [15], [20]. For each mentioned method, a set of parameters representing the signal in a given finite base is calculated. The general idea of coding and decoding is presented in Figure 1 for Schur parametrization by a set of Schur coefficients {ρ1, ρ2, . . . , ρN}, and in Figure 2 for approximation in a finite polynomial base by a set of representation coefficients {a0, a1, . . . , aN}.

The paper focuses on statistical properties of the approximated signals, such as the power spectral density. We investigate two main groups of parametrization methods: Schur parametrization, presented in Section II-A, and representation in orthogonal polynomial bases, recalled in Section II-B. The considerations apply only to stationary signals. For the numerical experiments, we used sinusoidal signals – stationary and non-stationary – as described in detail in Section III-A. The results of the experiments are presented in Section III-B. Conclusions can be found in Section IV.

II. METHODS

In this section, we describe two methods used for analysis (representation by a set of parameters) and synthesis (modeling using the set of parameters) of stationary signals. The first procedure is based on linear Schur parametrization and modeling, the second on signal representation in polynomial bases.

A. Linear Schur parametrization of signals

For a finite set of linearly independent observations [st, st−1, . . . , st−n] of a wide-sense stationary signal s(t), we consider the linear estimate

ŝn(t) = αn,1 · st−1 + . . . + αn,n · st−n    (1)

associated with the forward prediction error

εn(t) = s(t) − ŝn(t)    (2)

and minimizing the least-squares error Rn = E[ε²n(t)]. If we look at the time axis from the backward direction, then we can define the backward estimator

s̆n(t) = βn · st + . . . + β1 · st−n+1.    (3)

The backward prediction error is

υn(t) = s(t − n) − s̆n(t).    (4)
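A toy numerical illustration of the forward and backward prediction errors of eqs. (1)–(4), assuming a first-order predictor (n = 1) whose coefficient matches a synthetic AR(1) signal (the signal and all names here are constructed for this sketch, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
s = np.zeros(T)
w = rng.standard_normal(T)
for t in range(1, T):
    s[t] = 0.8 * s[t - 1] + w[t]       # wide-sense stationary AR(1) signal

alpha = 0.8                             # first-order forward predictor, eq. (1)
eps = s[1:] - alpha * s[:-1]            # forward prediction error, eq. (2)
ups = s[:-1] - alpha * s[1:]            # backward prediction error, eqs. (3)-(4)

# Prediction removes the redundant (predictable) part of the signal:
print(np.var(eps) < np.var(s))  # → True
```

The forward error here is (close to) the driving white noise, which is the sense in which an innovations filter "whitens" the signal.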
The normalized versions of the prediction errors will be denoted as

en(t) = εn(t) / ||εn(t)||    (5)

and

rn(t) = υn(t) / ||υn(t)||.    (6)

For wide-sense stationary signals with a Toeplitz covariance matrix, the optimum linear innovations filter can be computed by the Levinson algorithm [4]. A numerically better version of the procedure is the Schur algorithm [23], [5], efficiently performing the linear innovations transformation, usually called the Schur parametrization. The classical Schur algorithm, associated with stationary processes, has been generalized to the orthogonal parametrization of second-order nonstationary processes [3], [6], [7]. There exists also a nonlinear realization [28], [30], [14] of the innovations procedure, but the algorithm is more complex.

1) Innovations filtering: Innovations filtering [7] of order N can be presented as a block-chain realization of N sections:

[eN(t); rN(t)] = θN θN−1 · · · θ1 [e0(t); r0(t)],    (7)

where

θn = θ(ρn) [1 0; 0 z]    (8)

for n = 1, 2, . . . , N, with the delay operator denoted as z. At each section, we perform a hyperbolic rotation (an elementary J-orthogonal operation)

[en(t); rn(t)] = θ(ρn) [en−1(t); rn−1(t − 1)],    (9)

where

θ(ρn) = (1 − ρ²n)^(−1/2) [1 ρn; ρn 1].    (10)

The backward prediction errors {r0, r1, . . . , rN} form an orthonormal base. If we take ρn = tgh ϕ, we obtain the evident form of the hyperbolic rotation

θ(ρn) = [cosh ϕ sinh ϕ; sinh ϕ cosh ϕ].    (11)

The procedure is J-orthogonal, because the matrix θ(ρn) satisfies

θ(ρn) J θᵀ(ρn) = J = [1 0; 0 −1].    (12)

The block-chain filter needs to be excited with the centered and normalized input signal which we want to parametrize. The innovations filtering produces a set of N Schur coefficients

{ρ1, ρ2, . . . , ρN},    (13)

where N is the order of the innovations filter.

2) Modeling: The modeling filter algorithm [3], [6] can be obtained by the so-called 'arrow-reversal' method, in which the direction of flow of the signals in the 'upper wire' of the innovations filter is reversed. The orthogonal realization of the modeling filter uses filter parameters that are exactly the Schur coefficients extracted during the parametrization procedure, as shown in Figure 1. The modeling filter of order N is a block-chain realization of N sections:

[e0(t); rN(t)] = σN σN−1 · · · σ1 [eN(t); r0(t)],    (14)

where

σn = σ(ρn) [1 0; 0 z]    (15)

for n = 1, 2, . . . , N. The particular calculation at each section has the following form

[en−1(t); rn(t)] = σ(ρn) [en(t); rn−1(t − 1)],    (16)

where

σ(ρn) = [(1 − ρ²n)^(1/2) −ρn; ρn (1 − ρ²n)^(1/2)].    (17)

The transformation is also called a circular rotation. If we assume ρn = sin ψ, then

σ(ρn) = [cos ψ −sin ψ; sin ψ cos ψ].    (18)

The matrix σ(ρn) is orthogonal, i.e.

σ(ρn) σᵀ(ρn) = I.    (19)

The modeling filter transforms the input white noise into the output time series s̃, which is stochastically equivalent, in the second-order sense, to the original (parametrized) signal s.

B. Orthogonal polynomials and approximation of signals

Approximation of a signal in an orthogonal polynomial base leads to a new representation [21], [22] of the signal in the form of a set of parameters. In our considerations, we restrict ourselves to real-number representations. The general formula for the approximation s̃ of a signal s in a finite family of orthogonal polynomials {Bn(x)}, n = 0, . . . , N, is

s̃(x) = Σ(n=0..N) an Bn(x)    (20)

where the representation coefficients an are real numbers, i.e. an ∈ R. For all polynomials considered in this work, we performed simulations in the interval [0, 1]. The polynomials are called orthogonal if, for any n and m with n ≠ m, the inner product of the polynomials Bn and Bm is zero, i.e.

⟨Bn, Bm⟩ = ∫R Bn(x) Bm(x) W(x) dx = 0,    (21)

where W(x) is a proper weight function [24] on the corresponding support. Many orthogonal polynomials [26], [2], [8] – such as the Laguerre, Legendre or Chebyshev polynomials – are defined by recursion relations [10], [27], which are recalled in the following paragraphs.
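The two elementary rotations above can be checked numerically; the following sketch (written for this text, not the authors' implementation) verifies the J-orthogonality of θ(ρ) from eq. (12) and the orthogonality of σ(ρ) from eq. (19):

```python
import numpy as np

def theta(rho):
    """Hyperbolic rotation of the innovations filter, eq. (10)."""
    return (1.0 - rho**2) ** -0.5 * np.array([[1.0, rho], [rho, 1.0]])

def sigma(rho):
    """Circular rotation of the modeling filter, eq. (17)."""
    c = (1.0 - rho**2) ** 0.5
    return np.array([[c, -rho], [rho, c]])

J = np.diag([1.0, -1.0])
for rho in (-0.7, 0.0, 0.3):
    assert np.allclose(theta(rho) @ J @ theta(rho).T, J)      # eq. (12)
    assert np.allclose(sigma(rho) @ sigma(rho).T, np.eye(2))  # eq. (19)
```

Both checks hold for any |ρ| < 1, which is the admissible range of the Schur coefficients.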
1) Laguerre polynomials: Laguerre polynomials Ln [1] can
be obtained by the following recurrence relation:
L0 (x) = 1,
L1 (x) =1 − x, (22)
2n + 1 − x n
Ln+1 (x) = Ln (x) − Ln−1 (x).
n+1 n+1
They are orthogonal with respect to the inner product
Z ∞
< Ln , Lm >= Ln (x)Lm (x)ex dx, (23)
0
where the weight function is Fig. 2. Schema of approximation in polynomial base and modeling of a
signal, with the use of a set of representation coefficients.
x
W (x) = e . (24)
2) Legendre polynomials: Legendre polynomials Pn [17]
are defined by Bonnet’s recursion [25], [9] formula:
P0 (x) = 1,
P1 (x) = x, (25)
2n + 1 n
Pn+1 (x) = xPn (x) − Pn−1 (x).
n+1 n+1
Legendre polynomials are orthogonal with respect to the L2
inner product on the interval 1 ≤ x ≤ 1, without any scaling
by weight function as for Laguerre polynomials.
3) Chebyshev polynomials: Chebyshev polynomials of the first kind [19], [15] are usually denoted by Tn and fulfill the following recurrence:

    T0(x) = 1,
    T1(x) = x,                                            (26)
    T_{n+1}(x) = 2x Tn(x) − T_{n−1}(x).

Chebyshev polynomials of the second kind Un [19], [18] are defined by the similar recurrence:

    U0(x) = 1,
    U1(x) = 2x,                                           (27)
    U_{n+1}(x) = 2x Un(x) − U_{n−1}(x).

Chebyshev polynomials of the first kind are orthogonal with respect to the inner product

    ⟨Bn, Bm⟩ = ∫_{−1}^{1} Bn(x) Bm(x) (1/√(1 − x²)) dx,  (28)

where the weight function is

    W(x) = 1/√(1 − x²),                                   (29)

while the polynomials of the second kind are orthogonal with the weight W(x) = √(1 − x²).

4) Trigonometric polynomials: Real trigonometric polynomials [20] of degree N are finite linear combinations of the functions sin(nx) and cos(nx), for 0 ≤ n ≤ N. The representation coefficients {an, bn}_{n=0}^{N} are real numbers and the approximation of a signal s has the following form

    s̃(x) = a0 + Σ_{n=1}^{N} an cos(nx) + Σ_{n=1}^{N} bn sin(nx),   (30)

where x ∈ R.

Fig. 3. Benchmark signal 1: 'bumps'.

Fig. 4. Power spectral density of the signal 'bumps'.

III. SIMULATIONS

To examine the quality of remodeling of the signals, parametrized in various orthogonal bases, we performed an experiment for two benchmark signals, which are described in Section III-A. For each signal (in both versions: parametrized and remodeled), we calculate the power spectral density (PSD) and the mean square error (MSE) between the original PSD and the PSD of the remodeled signal. The results are presented in Section III-B.

A. Signals

1) "Bumps" signal: We use a benchmark signal – called 'bumps' – which is implemented in the MATLAB environment and can be generated by the command wnoise(2,10). Its trajectory in the time domain can be found in Figure 3 and its power spectral density is shown in Figure 4.
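The recurrences (22)–(27) and the expansion (20) can be sketched numerically. The snippet below is an illustration under our own assumptions (a toy smooth signal, a degree-10 Chebyshev base rescaled to [0, 1], and least-squares coefficients), not the authors' implementation:

```python
import numpy as np

# Illustrative sketch: evaluate the Chebyshev recurrence (26) and fit the
# expansion (20) by least squares on [0, 1]. Signal, degree and rescaling
# are assumptions made for this example only.
def chebyshev_T(t, N):
    """Values of T_0 .. T_N at points t, via T_{n+1} = 2t*T_n - T_{n-1}."""
    P = [np.ones_like(t), t]
    for n in range(1, N):
        P.append(2 * t * P[n] - P[n - 1])
    return np.array(P[:N + 1])

x = np.linspace(0.0, 1.0, 256)
s = np.sin(4 * x) + 0.3 * np.cos(7 * x)        # smooth, low-frequency signal

B = chebyshev_T(2 * x - 1, 10)                 # base rescaled from [-1, 1] to [0, 1]
a, *_ = np.linalg.lstsq(B.T, s, rcond=None)    # representation coefficients a_n
s_tilde = a @ B                                # reconstruction s~(x) as in (20)
err = np.max(np.abs(s - s_tilde))
print(f"max reconstruction error: {err:.2e}")
```

For such a smooth signal the coefficients decay rapidly, so a low degree already gives a reconstruction error far below the 10⁻³ threshold used in the tables below.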

TABLE I
MSE BETWEEN PSD OF THE ORIGINAL 'BUMPS' SIGNAL AND THE REMODELED SIGNAL

No. of parameters     1      5      10     20     50
Schur               0.055  0.060  0.065  0.054  0.042
Laguerre            0.088  0.076  0.072  0.054  0.062
Legendre            0.088  0.076  0.059  0.030  0.001
Chebyshev 1st       0.088  0.076  0.059  0.034  0.001
Chebyshev 2nd       0.088  0.076  0.059  0.034  0.001
Trigonometric       0.119  0.073  0.041  0.002  0.000

Fig. 5. Benchmark signal 2: sinusoidal signal.

TABLE II
MSE BETWEEN PSD OF THE ORIGINAL SINUSOIDAL SIGNAL AND THE REMODELED SIGNAL

No. of parameters     1      5      10     20     50
Schur               0.041  0.051  0.046  0.046  0.047
Laguerre            0.064  0.033  0.037  0.037  0.034
Legendre            0.064  0.033  0.034  0.017  0.000
Chebyshev 1st       0.064  0.033  0.034  0.017  0.000
Chebyshev 2nd       0.064  0.033  0.034  0.021  0.000
Trigonometric       0.088  0.038  0.017  0.000  0.000
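The MSE criterion behind Tables I and II compares two PSD estimates. A minimal sketch of such a comparison (using a plain normalized periodogram and toy signals, since the paper does not specify its PSD estimator) could look as follows:

```python
import numpy as np

# Sketch of the evaluation criterion behind Tables I and II: MSE between
# PSD estimates of an original and a remodeled signal. The estimator and
# the toy signals below are assumptions for illustration.
def psd(x):
    """Normalized periodogram of a real-valued signal."""
    X = np.fft.rfft(x - np.mean(x))
    p = np.abs(X) ** 2
    return p / np.sum(p)

t = np.arange(1024) / 1024.0
rng = np.random.default_rng(1)
original = np.sin(2 * np.pi * 5 * t)
# stand-in for a remodeled signal: same spectral content, different trajectory
remodeled = np.sin(2 * np.pi * 5 * t + 0.2) + 0.05 * rng.standard_normal(t.size)

mse = np.mean((psd(original) - psd(remodeled)) ** 2)
print(f"MSE between PSDs: {mse:.2e}")
```

Because the criterion compares PSDs rather than trajectories, a remodeled signal with a different time course but matching statistical properties still scores a low MSE, which is exactly the situation for Schur modeling with a white-noise excitation.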

Fig. 6. Power spectral density of the sinusoidal signal.

2) Sinusoidal signal: For the second test we used a sum of sinusoidal signals. There are three components, with frequencies 2 Hz, 5 Hz and 10 Hz. Its trajectory is shown in Figure 5 and its power spectral density in Figure 6.

B. Results

1) 'Bumps' signal: The signal 'bumps' is parametrized by the Schur algorithm and by Laguerre, Legendre, Chebyshev 1st and 2nd kind, and trigonometric polynomials. In each case, the order of the Schur algorithm and the maximal degree of the polynomials used during the parametrization are chosen so as to obtain the same number of parameters for each method. In this way, it is possible to compare the various approaches even though they provide different numbers of parameters for a given degree n (e.g. n+1 for the Schur parametrization and 2n+1 for trigonometric polynomials).

The obtained parameters are used for reconstruction of the signal. In the case of the Schur algorithm, we input a white noise signal as the excitation. If the innovation signal eN (see Figure 1) were applied as the input signal for the modeling filter, then the trajectories of the modeled and parametrized signals would be equal, i.e. s̃(t) = s(t) for all time samples t. The disadvantage of this approach is that it requires not only the obtained parameters but also the whole signal eN, which means storing the full length of the signal and makes the parametrization pointless. If the modeling filter is instead fed with a white noise signal, we obtain a different trajectory, but with statistical properties close to those of the original signal. For polynomial bases, the remodeling of the signal is much easier, because the reconstructed signal is a linear combination of the representation parameters multiplied by the corresponding polynomials.

Figure 7 shows the power spectral density of the reconstructed signal 'bumps' for the Schur algorithm and the different polynomial bases. In this case, the number of parameters obtained by each method was set to 10. Figures 8 and 9 show the power spectral densities obtained for 20 and 50 parameters. The numerical results are also shown in Table I.

2) Sinusoidal signal: For the sum of sinusoidal signals, we have calculated the mean square error (MSE) between the power spectral density of the original signal and those of the remodeled signals, for the different polynomial bases and the Schur algorithm and for different numbers of parameters n = 1, 5, 10, 20, 50. The results are shown in Table II. It is quite obvious that the higher the number of parameters, the better the reconstruction of the original signal. The value 0.000 means that the difference between the power spectral densities (original and reconstructed) is smaller than 10⁻³. In the case of the Schur algorithm, even a very small number of parameters (around 3) is sufficient to reconstruct the statistical properties of the original sinusoidal signal, owing to its uncomplicated PSD, and taking a higher order of the filter does not provide any better solution.

IV. CONCLUSIONS

The analyzed signals are mainly characterized by low frequencies. For higher frequencies, polynomial approximation of lower degrees does not provide satisfactory results, unlike parameterization and modeling by the Schur algorithm. For this reason, we focused on low-frequency signals.

All polynomial algorithms proved to be convergent, i.e. with the increase of the polynomial degree we always observed a

Fig. 7. PSD of the reconstructed signal 'bumps' using a) Schur parametrization, b) Laguerre, c) Legendre, d) Chebyshev 1st kind, e) Chebyshev 2nd kind and f) trigonometric polynomials – for the number of parameters set to n = 10.

Fig. 8. PSD of the reconstructed signal 'bumps' using a) Schur parametrization, b) Laguerre, c) Legendre, d) Chebyshev 1st kind, e) Chebyshev 2nd kind and f) trigonometric polynomials – for the number of parameters set to n = 20.

decrease in the value of the mean-square error for the power spectral density estimates.

In all the tests, the algorithm using the approximation with trigonometric polynomials performed best. However, it should be emphasized that this is related to the predominance of low frequencies in the analyzed signals, for which a good fit to the sinusoidal components is obtained quite fast. The test results for signals with higher frequencies (not shown in this paper) show a limited applicability of polynomial approximation to fast-changing signals.

In general, lower values of the error for PSD reconstruction were obtained for signals with fewer frequency components – see the lower MSE values for the sinusoidal signal in Table II compared to the 'bumps' signal, characterized by a more complex spectrum, in Table I.

The worst results were obtained for the Schur parameterization and for approximation by the Laguerre polynomials, which is clearly visible for high degrees (N = 20, 50).

In Figures 7, 8 and 9, the reconstructed PSD is shown. For the Schur algorithm, a white noise signal should be used in the modeling part to reconstruct the parametrized signal. During the tests we used a signal which is close to a theoretical white noise signal, but it has some amplitude fluctuations in the PSD which are unavoidable for a finite number of samples. This phenomenon is visible in part a) of Figures 7, 8 and 9. For the polynomial algorithms these disturbances do not appear.

Fig. 9. PSD of the reconstructed signal 'bumps' using a) Schur parametrization, b) Laguerre, c) Legendre, d) Chebyshev 1st kind, e) Chebyshev 2nd kind and f) trigonometric polynomials – for the number of parameters set to n = 50.

REFERENCES

[1] W.A. Al-Salam, Operational representations for Laguerre and other polynomials, Duke Math. J. 31(1), pp. 127–142, 1964.
[2] T. Chihara, An Introduction to Orthogonal Polynomials, Gordon and Breach, New York, 1978.
[3] E.F.A. Deprettere, S.C. Lie, Generalized Schur-Darlington Algorithms for Lattice-Structured Matrix Inversion and Stochastic Modelling, Techn. Rept., Delft Univ. Techn., 1980.
[4] P. Dewilde, A.C. Vieira, T. Kailath, On a Generalized Szegö-Levinson Realization Algorithm for Optimal Linear Predictors Based on a Network Synthesis Approach, IEEE Trans. on Circuits and Systems, vol. CAS-25, No. 9, September 1978, pp. 663–675.
[5] P. Dewilde, H. Dym, Schur Recursions, Error Formulas and Convergence of Rational Estimators for Stationary Stochastic Sequences, IEEE Trans. on Information Theory, vol. IT-27(4), July 1981, pp. 446–461.
[6] P. Dewilde, E.F.A. Deprettere, Approximative Inversion of Positive Matrices with Applications to Modelling, in: Modelling, Robustness and Sensitivity Reduction in Control Systems, NATO ASI Series, Vol. F34, R.F. Curtain (ed.), Springer-Verlag, Berlin-Heidelberg, 1987, pp. 212–238.
[7] P. Dewilde, E.F.A. Deprettere, The Generalized Schur Algorithm: Approximation and Hierarchy, in: Operator Theory: Advances and Applications, vol. 29, Birkhäuser Verlag, Basel, 1988, pp. 97–116.
[8] D. Jackson, Fourier Series and Orthogonal Polynomials, Dover, New York, 2004.
[9] D.P. Jarrett, E.A.P. Habets, P.A. Naylor, Theory and Applications of Spherical Microphone Array Processing, Springer, 2016.
[10] W. Koepf, D. Schmersau, Recurrence equations and their classical orthogonal polynomial solutions, Orthogonal systems and applications, Appl. Math. Comput., Vol. 128, pp. 303–327, 2002.
[11] P. Kroon, E.F.A. Deprettere, R.J. Sluyter, Multi-pulse excitation linear-predictive speech coder, US Patent 4932061, 1990.
[12] D.T.L. Lee, M. Morf, B. Friedlander, Recursive Least-Squares Ladder Estimation Algorithms, IEEE Trans. on Circuits and Systems, Vol. CAS-28, pp. 467–481, 1981.
[13] N. Levinson, The Wiener rms error criterium in filter design and prediction, Journal of Mathematical Physics, Vol. 25, pp. 261–278, 1947.
[14] U. Libal, A. Wielgus, W. Magiera, Nonlinear Orthogonal Parametrization and Modeling of Higher-Order Non-Gaussian Time-Series, 2017 Signal Processing Symposium (SPSympo), Jachranka, pp. 1–6, 2017.
[15] Y. Liu, Application of the Chebyshev polynomial in solving Fredholm integral equations, Mathematical and Computer Modelling 50(3-4), pp. 465–469, 2009.
[16] J. Makhoul, Linear Prediction: A tutorial review, Proc. IEEE, vol. 63, pp. 561–580, 1975.
[17] K. Maleknejad, K. Nouri, M. Yousefi, Discussion on convergence of Legendre polynomial for numerical solution of integral equations, Applied Mathematics and Computation, Vol. 193, pp. 335–339, 2007.
[18] J.C. Mason, Chebyshev polynomials of the second, third and fourth kinds in approximation, indefinite integration, and integral transforms, Journal of Computational and Applied Mathematics 49(1-3), pp. 169–178, 1993.
[19] J.C. Mason, D.C. Handscomb, Chebyshev Polynomials, Chapman & Hall / CRC, 2003.
[20] M. Powell, Approximation Theory and Methods, Cambridge University Press, 1981.
[21] W. Rudin, Real and Complex Analysis, 3rd ed., McGraw-Hill, New York, 1987.
[22] W. Rudin, Functional Analysis, 2nd ed., McGraw-Hill, New York, 1991.
[23] I. Schur, On Power Series Which Are Bounded in the Interior of the Unit Circle I, in: I. Schur Methods in Operator Theory and Signal Processing, I. Gohberg (Ed.), Operator Theory: Advances and Applications, vol. 18, Birkhäuser-Verlag, 1986, pp. 31–60.
[24] M.-R. Skrzipek, Orthogonal polynomials for modified weight functions, Journal of Computational and Applied Mathematics 41(3), pp. 331–346, 1992.
[25] V. Spokoiny, T. Dickhaus, Basics of Modern Mathematical Statistics, Springer, 2014.
[26] G. Szegö, Orthogonal Polynomials, 4th ed., Amer. Math. Soc. Colloq. Publ., vol. 23, Amer. Math. Soc., Providence, RI, 1975.
[27] L. Verde-Star, Recurrence coefficients and difference equations of classical discrete orthogonal and q-orthogonal polynomial sequences, Linear Algebra and its Applications, Vol. 440, pp. 293–306, 2014.
[28] J. Zarzycki, P. Dewilde, The Nonlinear Nonstationary Schur Algorithm, Proc. Workshop on Advanced Algorithms and Their Realizations, Chateaux de Bonas (M. Verhaegen, Ed.), Paper V3, 1991.
[29] J. Zarzycki, Orthogonal digital filtering of stochastic signals (in Polish), WNT, Warsaw, 1998.
[30] J. Zarzycki, A. Wielgus, U. Libal, Nonlinear Schur-Type Orthogonal Transformations of Higher-Order Stochastic Processes: An Overview of Current Topics, 2017 Signal Processing Symposium (SPSympo), Jachranka, pp. 1–5, 2017.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND

Using a sparse model to evaluate the internal structure


of impulse signals

A.A. Kim, O.O. Lukovenkova, Yu.V. Marapulets, A.B. Tristanov


Laboratory of Acoustic Research
IKIR FEB RAS
Paratunka, Kamchatka, Russia

Abstract — Impulse nature signals generated by complex geophysical systems require special methods to study their internal structure. These signals are characterized by a short duration of impulses and the variability of their structure. The use of classical spectral and time-frequency methods raises great difficulties. The authors propose a model of an impulse signal based on a sparse approximation, and an algorithm for identifying the model. The algorithm is a modified matching pursuit algorithm using a physically based system of functions (dictionary). The study of the modeling results consists in estimating the time-frequency characteristics of the model components. The paper gives an example of the model application to geoacoustic emission signals of a seismically active region (Kamchatka peninsula). The proposed model and the approaches to its investigation can be used for a wide range of impulse nature signals.

Keywords — sparse approximation, adaptive matching pursuit, geoacoustic signal model.

INTRODUCTION

Studies of natural environments to prevent natural and man-made disasters require the use of a rich arsenal of prediction algorithms. These algorithms are sensitive to the quality of the preparation of the initial data, i.e. the construction of a feature space. Geophysical time series can be divided conditionally into several groups: slowly changing, rapidly changing (oscillating) and impulsive. The first two groups of signals can be successfully analyzed, in terms of identifying typical and anomalous behaviors, by various non-stationary methods of classical analysis. In turn, impulse signals require a special approach. This is caused by an interest in the study of the fine structure of separate impulses, especially in their mutual dynamics.

The signals of geoacoustic emission (GAE) are a good example of impulse signals [1]. In this paper the authors considered the GAE signals registered in the seismically active region on the Kamchatka peninsula (Russia). The main goal of studying GAE signals is to identify the features of the impulse morphology in order to explain the processes of their generation. Each GAE pulse is an additive mixture of elementary geoacoustic impulses. A key feature of geophysical impulse signals is their variability and the complexity of their morphological structure.

This paper is devoted to an approach to modeling impulse signals based on sparse approximation. This approach consists of a compact representation of the signal as a sum of elements (atoms) from a redundant dictionary of functions. The sparse approximation has several significant advantages in describing signals in comparison with classical methods of time-frequency analysis. The dictionary is a set of functions that describes the internal structure of the signals, and it is selected on the basis of their impulse nature. Thus, for the investigation of GAE signals the authors used a combined dictionary consisting of Gauss and Berlage functions [2].

The rest of the paper has the following structure. Section 2 is devoted to the concept of sparse approximation and to the algorithms of its construction. Section 3 presents a model of a geoacoustic signal, as an example of an impulse signal. Section 4 presents the results of the study of the model proposed in Section 3. Finally, the conclusions are presented.

SIGNAL SPARSE APPROXIMATION

The methods of sparse approximation have been studied for more than a hundred years. Temlyakov V.N. [3] and Tropp J.A. [4] note that one of the first mentions of the so-called "m-term approximation" is in the work of Schmidt E., "Zur Theorie der linearen und nichtlinearen Integralgleichungen" [5], published in 1908. In the past decade, work has been actively carried out to systematize and generalize the research, for example, in the works of Mallat S. and Zhang Z. [6], Sturm B.L. [7], Gribonval R. [8], Donoho D.L. [9], Elad M. [10] et al.

The problem of representing a discrete signal S ∈ R^N by a linear combination with a small number of elements can be formulated as follows. Let there be given a redundant dictionary of discrete functions, which is a matrix D of size N × M with M ≫ N. It is required to find a vector of coefficients a such that S = Da and ‖a‖₀ → min, where ‖·‖₀ is the pseudo-norm l₀, equal to the number of nonzero elements of the vector. This problem is called sparse approximation. The columns of the matrix D are called atoms.

Note that the computational problem of sparse approximation in this formulation has a high complexity, and the optimization of the pseudo-norm l₀ can be replaced by optimization with respect to a norm different from l₀ in (1), while retaining the advantages and quality of the initial approximation.
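A minimal sketch of the greedy matching pursuit idea of Mallat and Zhang [6], which the AMP algorithm [11]–[13] modifies with a refinement step, may clarify the formulation. The random dictionary and the synthetic signal below are our assumptions, not the authors' GAE setup:

```python
import numpy as np

# Greedy matching pursuit sketch: at each step, pick the dictionary column
# most correlated with the residual and subtract the matched component.
def matching_pursuit(S, D, K):
    """Greedily pick at most K atoms (columns of D) to approximate S."""
    D = D / np.linalg.norm(D, axis=0)       # unit-norm atoms
    residual = S.astype(float).copy()
    a = np.zeros(D.shape[1])
    for _ in range(K):
        corr = D.T @ residual               # correlation of residual with atoms
        k = int(np.argmax(np.abs(corr)))    # best-matching atom
        a[k] += corr[k]
        residual -= corr[k] * D[:, k]       # subtract the matched component
    return a, residual

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))          # redundant dictionary: M = 256 >> N = 64
S = 2.0 * D[:, 10] / np.linalg.norm(D[:, 10]) - D[:, 99] / np.linalg.norm(D[:, 99])
a, r = matching_pursuit(S, D, K=5)
print(np.count_nonzero(a), np.linalg.norm(r) / np.linalg.norm(S))
```

The residual norm decreases at every iteration, and the coefficient vector a stays K-sparse by construction, which is the practical substitute for the exact l₀ minimization.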

The research was supported by RSF, project No.18-11-00087

    ‖S − Da‖₂ → min,  subject to  ‖a‖₀ ≤ K,              (1)

where K is the limitation on l₀ and ‖·‖₂ is the Euclidean norm.

The authors proposed a modification of the matching pursuit algorithm called adaptive matching pursuit (AMP) [11]–[13]. This algorithm allows expanding the original dictionary by means of a refinement procedure. A block diagram of the refinement procedure is shown in Fig. 1.

MODEL OF THE GEOACOUSTIC EMISSION SIGNAL

The real signal of geoacoustic emission can be represented, firstly, as the sum of two components: the actual geoacoustic impulse and a noise component. The presence of noise is due to the features of registration and the complexity of the environment in which the observation takes place. In turn, the geoacoustic pulse is a collection of elementary pulses. Thus, the model of a geoacoustic emission signal is a linear combination of functions:

    x(t) = Σ_{i=1}^{N1} ai gi(t, p) + Σ_{j=N1+1}^{N1+N2} bj gj(t, p) + RN(t),
    ‖RN(t)‖₂ → min,                                       (2)
    N1 + N2 → min,

where gi(t, p) are the atoms approximating the impulse and gj(t, p) are the atoms approximating the parasitic component of the impulse. The value of N1 characterizes the complexity of the impulse structure, and the value of N2 is the number of atoms which represent the noise of the impulse.

Below, in Fig. 2, examples of visualization, based on the Wigner-Ville transform, of a sparse representation of real geoacoustic emission signals are shown [14].
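For illustration, plausible shapes of the two atom families of the combined dictionary [2] can be sketched as below. The parameter names (f, t0, sigma, n, alpha) and their values are assumptions for illustration; the authors' exact parametrization p is not reproduced here:

```python
import numpy as np

# Illustrative atom shapes for the combined Gauss/Berlage dictionary [2].
# All parameter names and values are assumptions made for this sketch.
def gauss_atom(t, f, t0, sigma):
    """Gaussian-envelope atom: an oscillation localized around t0."""
    return np.exp(-((t - t0) ** 2) / (2 * sigma ** 2)) * np.sin(2 * np.pi * f * t)

def berlage_atom(t, f, n, alpha):
    """Berlage pulse: t^n * exp(-alpha*t) envelope, typical of GAE impulses."""
    return t ** n * np.exp(-alpha * t) * np.sin(2 * np.pi * f * t)

t = np.linspace(0.0, 0.008, 400)   # 8 ms window, as in the experiments section
g = gauss_atom(t, f=12e3, t0=0.004, sigma=0.001)
b = berlage_atom(t, f=12e3, n=2, alpha=1500.0)
print(g.shape, b.shape)
```

The sharp onset of the Berlage envelope matches the abrupt beginning of elementary geoacoustic pulses, while the symmetric Gaussian envelope captures smoother oscillatory components, which motivates combining the two families in one dictionary.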

Figure 1. The refinement procedure of AMP (Fig. from [11])

The step λ and the number of refinement iterations are


selected depending on the required accuracy and refinement
rate. The refined atom is added to the source dictionary,
expanding it.
The advantages of a sparse approximation in comparison
with classical approaches are follows. First, since the number of
nonzero approximation elements is small, the main advantage of
this method is the reduction in the dimension of the feature space
of the signal. This fact becomes a particularly obvious advantage
when comparing a sparse representation, for example, with a Figure 2. Example of signal representation, a) visualisation in the time
Short Fourier Transform or with a Continuous Wavelet domain, b) visualisation in the time-frequency domain
Transform, because these transforms are extremely redundant.
Redundancy of representation is an important limitation when EXPERIMENTS
applying methods of classification and clustering. The use of
physically substantiated dictionaries of functions opens the way The set of atomic parameters of each individual impulse
to the interpretation of the signal analysis results at a characterizes its internal structure. During the experiment, 2000
qualitatively new level, for example, for the task of identifying characteristic GAE impulses (length of 8 msec, filling frequency
signal sources and their characteristics. of 10-15 kHz) were analyzed. Further, statistical time-frequency
characteristics were considered–the number of atoms providing
a given accuracy, the distribution of distances between atoms,
and the distribution of atom frequencies (Fig.3-6).

The number of atoms determines the complexity of the impulse. It is seen that on average a single geoacoustic impulse is described by 4 atoms with an error of less than 5%.

Figure 3. Distribution of the number of atoms in signal representations.

The frequency filling of atoms is one of the most important characteristics, associated with the properties of the shear deformations that gave rise to the elementary geoacoustic impulses [15]. Fig. 4 shows the frequency distribution. It is seen that the distribution is multimodal, and this fact allows us to assume a mixture of distributions and, as a result, the multiscale nature of the generated displacement sources. An estimation of the source sizes according to J. Brune's formula showed that they are from 0.05 to 0.15 m, which is confirmed in [16], [17] (Fig. 5).

Figure 4. Distribution of atom frequencies (kHz).

Figure 5. Distribution of the source size l (meters) calculated by J. Brune's formula.

Finally, the distribution of distances between atoms within the framework of a single pulse is shown. This characteristic allows us to estimate the possible aggregate size of the region that generated the impulse.

Figure 6. Distribution of the ratio Δ of the distance between atoms to the length of an impulse (in percent).

DISCUSSION

In summary, we note that sparse approximation is one of the most promising methods for estimating impulse signals. The resulting models are physically interpretable. The results shown for real geoacoustic signals are consistent with the theoretical assumptions and the actual results of observations.

An important development of the proposed approaches is their integration into software systems for analyzing large data sets based on machine learning methods. Primarily this is due to the large volume of recorded experimental materials and the need to solve the problems of clustering and classifying both individual pulses and their sequences. The proposed approaches can be used in numerous applications.

REFERENCES
[1] Alina A. Kim, A. B. Tristanov, Y. V. Marapulets, and O. O.
Lukovenkova, Methods of registration and frequency-time analysis of
geoacoustic emission signals. Vladivostok: Dalnauka, 2017 (in Russian).
[2] Y. V. Marapulets, A. B. Tristanov, O. O. Lukovenkova, and A. A.
Afanaseva, “The sparse approximation with combined dictionary of the
acoustic signals,” in 2014 International Conference on Computer
Technologies in Physical and Engineering Applications, ICCTPEA 2014
- Proceedings, 2014, pp. 102–103.
[3] V. N. Temlyakov, “Nonlinear methods of approximation,” Found.
Comput. Math., vol. 3, pp. 33–107, 2003.

[4] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit," Signal Processing, vol. 86, no. 3, pp. 572–588, Mar. 2006.
[5] E. Schmidt, "Zur Theorie der linearen und nichtlinearen Integralgleichungen. III. Teil," Mathematische Annalen, vol. 65, pp. 370–399, 1908.
[6] S. Mallat and Z. Zhang, "Matching Pursuit with time-frequency dictionaries," Aug. 1993.
[7] B. L. Sturm, "Sparse Approximation and Atomic Decomposition: Considering Atom Interactions in Evaluating and Building Signal Representations," Ph.D. dissertation, University of California, Santa Barbara, 2009.
[8] R. Gribonval, "Harmonic decomposition of audio signals with matching pursuit," IEEE Trans. Signal Process., vol. 51, no. 1, pp. 101–111, 2003.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic Decomposition by Basis Pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, p. 33, 1998.
[10] M. Elad, Sparse and Redundant Representations. New York, NY: Springer, 2010.
[11] A. Kim, O. Lukovenkova, Y. Marapulets, and A. Tristanov, "Parallel adaptive sparse approximation methods for analysis of geoacoustic pulses," E3S Web Conf., vol. 20, p. 02003, Oct. 2017.
[12] A. A. Kim, O. O. Lukovenkova, Y. V. Marapulets, and A. B. Tristanov, "Parallel computations for real-time implementation of adaptive sparse approximation methods," in Proceedings of 2017 20th IEEE International Conference on Soft Computing and Measurements, SCM 2017, 2017.
[13] A. A. Kim, O. O. Lukovenkova, Y. V. Marapulets, and A. B. Tristanov, "Modeling of signals of pulse origin on the basis of sparse approximation scheme," in Proceedings of International Conference on Soft Computing and Measurements, SCM 2015, 2015, pp. 240–242.
[14] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, pp. 3397–3415, 1993.
[15] O. O. Lukovenkova, Y. V. Marapulets, and A. B. Tristanov, "Estimate of displacement-type sources scales of geoacoustic emission according to sparse representation of signal," in Proceedings of the 19th International Conference on Soft Computing and Measurements, SCM 2016, 2016.
[16] Y. V. Marapulets, O. P. Rulenko, M. A. Mishchenko, and B. M. Shevtsov, "Relationship of high-frequency geoacoustic emission and electric field in the atmosphere in seismotectonic process," Dokl. Earth Sci., vol. 431, no. 1, pp. 361–364, Mar. 2010.
[17] B. M. Shevtsov, Y. V. Marapulets, and A. O. Shcherbina, "Directionality of surface high-frequency geoacoustic emission during deformational disturbances," Dokl. Earth Sci., vol. 430, no. 1, pp. 67–70, Jan. 2010.


Detecting the Number of Speakers in Speech


Mixtures by Human and Machine
Tomasz Maka, Miroslaw Lazoryszczak

Faculty of Computer Science and Information Technology


West Pomeranian University of Technology, Szczecin
Zolnierska 52, 71-210, Szczecin, Poland
{tmaka, mlazoryszczak}@wi.zut.edu.pl

Abstract—The problem of estimating sound sources and their properties in an acoustic scene plays an important role in many voice-based interaction systems. The interference between sources can deteriorate system performance significantly. The paper presents comparison results of objective and subjective methods applied to the process of identifying the number of speakers in speech mixtures. The audio data set used for the computational and subjective tests consists of a number of utterances spoken by from two up to seven simultaneous speakers. In order to determine the number of speakers, two approaches are applied to the speech mixtures: the first uses spectrogram factorization with the NMF (non-negative matrix factorization) algorithm, the other is based on perceptual evaluation by a group of listeners. Both techniques are compared in terms of classification accuracy.

I. INTRODUCTION

Modern human-machine interfaces use multimodal communication channels. One of them, becoming more and more common, is based on audio signal analysis. There is a significant difference between ideal and real environments. Such environmental conditions play the most important role in several practical implementations. Especially voice-based systems are vulnerable to numerous distortions and are sensitive to many factors. Although it is desirable to process a signal with satisfactory parameters, this situation is often uncommon. The issue of estimating the number of simultaneous speakers can be an important component of real-world speech recognition systems. Knowing the number of speakers talking in parallel, several tasks can be performed to extract or enhance one of them. Such a subsystem for automatic determination of the number of speakers can be a part of human–machine voice interaction systems, to resolve whether the real input signal needs additional processing before being referred to the next stage of analysis. Another use can be multimedia indexing-retrieval systems, where the number of speakers is simply one of the attributes or can be an important parameter of the input at the classification stage.

There are many approaches to solving the problem of simultaneous speaker estimation. Some of them cope with the main task using multi-channel audio acquisition. Multi-channel processing provides more information about the number of speakers. Such approaches are more common but require additional components for data acquisition and computing resources. For example, in [1] the spatial separation of microphones was crucial for time delay estimation from all microphones. The author shows that the vocal tract excitation remains unchanged in two recorded signals. An important requirement of such a scheme is that the distances from each speaker to each microphone should be different. A similar approach is presented in [2], where a pair of microphones is used at the acquisition step. Then the cross-correlation with Bessel coefficients was applied to multi-speaker signals.

In [3], the authors present several techniques for determining the speaker count: model-based approaches are used to represent the properties of signals with different numbers of speakers, and signal-based ones are used for spectral peak detection, similarity measurement between peaks, and clustering of the similarity matrix. In some works the number of speakers is determined by using a microphone array, as in [4], while in others, like [5], it is the result of an intermediate stage in the process of speaker tracking. Among single-channel methods several approaches also exist. In [6] the author observes the presence of a modulation pattern in each speech utterance, which should contain a peak around a selected frequency. According to the author, the peak decreases while the number of simultaneous speakers increases. In turn, the solution presented in [7] utilizes a smartphone audio interface for signal acquisition. The analysis exploits a pitch estimation process to detect the speech signal. In parallel, MFCC features are computed and used with a selected distance metric to differentiate the speakers. The gender of the speakers is indirectly utilized using pitch detection in the case of similar MFCC features, which helps distinguish the total number of speakers.

In our previous work [8], a technique based on spectral peak tracking and attributes calculated from the peak histogram was proposed. The method uses spectral peaks estimated by the linear prediction-based spectral envelope of the source signal. Selected features have been computed from the histogram at different frequency bands. Finally, the statistical properties of the resultant features have been used to discover the relationship with the number of speakers, although they are not clear and unambiguous in every case. In this study we have compared the accuracy of estimating the number of speakers in simultaneously spoken sentences by a group of listeners and using an approach based on non-negative matrix factorization of the spectrogram.

TABLE I
NUMBER OF SPEAKERS DETECTION ACCURACY (%) FOR INDIVIDUAL LISTENERS

Number                                Listeners
of speakers        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
2                 92 100  83  92  58 100 100  92 100  92  92  25  83 100  67
2,3               63  96  54  88  58  83  88  79  75  71  71  33  88  79  58
2,3,4 (S1)        53  75  39  67  50  69  64  75  58  58  53  33  69  61  47
2,3,4,5 (S2)      44  60  29  50  42  54  52  63  44  48  42  33  58  46  40
2,3,4,5,6 (S3)    37  48  23  40  37  43  42  53  35  40  33  27  48  37  32
2,3,4,5,6,7 (S4)  34  45  22  38  34  41  39  50  33  38  31  25  45  34  30

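Per-listener accuracies of the kind reported in Table I can be derived from raw listener responses as follows (a hypothetical sketch; the response data below is invented for illustration, and the helper name is ours, not the authors'):

```python
import numpy as np

def accuracy_per_set(true_counts, responses, max_count):
    """Accuracy [%] over the mixtures whose true speaker count lies in
    2..max_count, i.e. over the set containing classes M2..M(max_count)."""
    true_counts = np.asarray(true_counts)
    responses = np.asarray(responses)
    mask = (true_counts >= 2) & (true_counts <= max_count)
    return 100.0 * np.mean(responses[mask] == true_counts[mask])

# Invented single-listener data: true speaker counts and the listener's answers
true_counts = [2, 2, 3, 3, 4, 4, 5, 5]
responses   = [2, 2, 3, 2, 3, 4, 4, 5]

print(accuracy_per_set(true_counts, responses, 3))  # over mixtures M2..M3 -> 75.0
print(accuracy_per_set(true_counts, responses, 5))  # over mixtures M2..M5 -> 62.5
```

The same masking scheme yields the per-row entries of Table I when applied to each listener's full response sheet.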
TABLE II
CONFUSION MATRIX FOR THE BEST CASE LISTENER

       M2     M3     M4     M5    M6
M2    100      0      0      0     0
M3   16.7     75    8.3      0     0
M4     25   58.3   16.7      0     0
M5      0   41.7   41.7   16.7     0
M6      0   41.7   41.7   16.7     0

TABLE III
CONFUSION MATRIX FOR THE WORST CASE LISTENER

       M2     M3     M4    M5    M6
M2   83.3   16.7      0     0     0
M3     75     25      0     0     0
M4   16.7     75    8.3     0     0
M5   33.3   33.3   33.3     0     0
M6     25     50     25     0     0

II. HUMAN ASSESSMENT
Let us assume that Mn denotes a sound mixture consisting of n simultaneously spoken utterances. We have created four sets S1, S2, S3 and S4 in our classification chain. Each set contains mixtures defined as Sk = {M2, . . . , Mk+3} with k+2 classes, where k = 1, 2, 3, 4. In this way, the consecutive sets include mixtures of utterances spoken by 2, 3, 4, 5, 6 and 7 speakers.

A. Database
A dedicated audio database was designed for human assessment as well as for machine learning. Details of the source components of the database have been presented in [8]; however, the mixtures of the utterances performed by individual speakers have been prepared for the evaluation purposes of this paper.
The source audio set consists of twelve different voices, including six male and six female speakers. The sentences were originally recorded in Polish at 16-bit resolution and a 44100 Hz sampling rate under the same acoustical conditions. Each recording was individually normalized before the mixing stage. We decided to limit the duration of each mixture to 7 seconds for the machine learning and human recognition phases; excerpts of this length are adequate for both goals. Finally, all examples were resampled to 22050 Hz.
Each mixture was built by first randomly selecting a male or female voice. The beginning of each 7-second excerpt within the selected recording was also randomly determined. The whole dataset consists of three groups determined by the speakers' gender: male, female and mixed, where each group includes four fragments. For the mixed-gender groups, the speaker's gender was selected alternately to preserve the balance between the numbers of female and male voices. The general quantitative structure of the audio dataset for the human evaluation purpose is presented in Table I.

B. Evaluation
The audio dataset for human evaluation purposes consists of three groups with a total of 64 examples. All mixtures, sorted in random order, were combined into one file. Individual excerpts in the resulting file were separated by the lector's announcements containing the number of the next example. In this form, they were presented to the listeners for evaluation. The total duration of the final test file is 10 minutes and 13 seconds.
The evaluation was performed by fifteen listeners of various ages and listening experience. Two participants were professionals who deal with sound engineering. Although the selected research [9], based on a recommendation regarding subjective assessment of sound quality [10], proposes a minimum of twenty listeners, the task of determining the number of speakers differs slightly from a simple speech quality assessment.

C. Results
Overall results of the human assessment for all 64 sound excerpts are presented as box plot diagrams in Fig. 1 (the first half of all evaluated examples) and Fig. 2 (the second half of all evaluated examples). The horizontal axis shows the individual audio mixtures grouped by the number of simultaneous speakers, while the vertical axis represents the number of speakers determined by all participants of the evaluation. The obtained results show that listeners tend to indicate a lower number than the actual number of speakers, especially for examples containing more than four speakers. The case of six speakers was indicated in only nine responses, while the total number of responses in this group was 192. It is also interesting that in only one case was the number of speakers recognized as equal to 7, which was a wrong assignment. It can be assumed that wrong answers may have been caused by a lack of concentration or inattention of the listeners; this is particularly indicated by incorrect answers for samples containing two speakers. Examples of the effectiveness of the

[Box plot: determined number of speakers vs. mixture number]
Fig. 1. Human evaluation results for the first half of 64 mixtures.

[Box plot: determined number of speakers vs. mixture number]
Fig. 2. Human evaluation results for the second half of 64 mixtures.

recognition of the number of simultaneous speakers by the listeners for the best and worst cases are collected in Table II and Table III, respectively. It should be mentioned that the case shown in Table III represents the lowest efficiency in the class of examples comparable with machine classification, but not the worst case overall.

III. MACHINE CLASSIFICATION

A. Parametrization stage
In the feature extraction stage, each mixture signal m(t) was converted into a magnitude spectrogram S using the short-time Fourier transform:

Sm×n = |STFT{m(t)}|^2 .    (1)

Because the spectrogram can be treated as a non-negative matrix (S ∈ R+), we have performed its decomposition into frequency patterns and activation coefficients using the NMF algorithm [11]. The factorization task is to find non-negative matrices W ∈ R+ and H ∈ R+ such that their product is close to the matrix S:

S ≈ Wm×r · Hr×n ,    (2)

where r denotes the rank of the approximation.
Non-negative matrix factorization is an optimization problem defined as follows:

min_{W,H} D(S|WH)  subject to  W ≥ 0, H ≥ 0,    (3)

where D is the Kullback–Leibler divergence function [11]:

D(S|WH) = Σ_{i,j} ( Sij · log( Sij / (WH)ij ) − Sij + (WH)ij ) .    (4)

To optimize the divergence function D, multiplicative update rules are applied until convergence. The estimation algorithm of
TABLE IV
CONFUSION MATRIX FOR HUMAN EVALUATION
(each cell lists the values for sets S1 / S2 / S3 / S4; "-" means the class is absent from the set)

        M2                        M3                      M4                      M5                   M6           M7
M2   100.0/100.0/100.0/100.0   0/0/0/0                 0/0/0/0                 -/0/0/0              -/-/0/0      -/-/-/0
M3   16.7/16.7/16.7/16.7       75.0/75.0/75.0/75.0     8.3/8.3/8.3/8.3         -/0/0/0              -/-/0/0      -/-/-/0
M4   25.0/25.0/25.0/25.0       58.3/58.3/58.3/58.3     16.7/16.7/16.7/16.7     -/0/0/0              -/-/0/0      -/-/-/0
M5   -/0/0/0                   -/41.7/41.7/41.7        -/41.7/41.7/41.7        -/16.7/16.7/16.7     -/-/0/0      -/-/-/0
M6   -/-/0/0                   -/-/41.7/41.7           -/-/41.7/41.7           -/-/16.7/16.7        -/-/0/0      -/-/-/0
M7   -/-/-/0                   -/-/-/0                 -/-/-/50.0              -/-/-/50.0           -/-/-/0      -/-/-/0

the matrices W and H can be summarized in the following form:

W = W ◦ [ (S / (WH)) · H^T ] / [ 1 · H^T ] ,
H = H ◦ [ W^T · (S / (WH)) ] / [ W^T · 1 ] ,    (5)

where 1 is a matrix of ones, ◦ is the symbol of the element-wise product, and the divisions are also element-wise.
The frequency components in the spectrogram are strictly dependent on the gender of the speaker and the anatomical structure of the speech apparatus. Therefore, in order to determine the number of speakers, we have decided to pay attention to the variability of the activation coefficients. As can be seen in Fig. 4, these coefficients can provide discriminative properties. To visualise the H matrix, a standard grey palette was used, where the darker colours denote higher activation values. The feature extraction scheme is depicted in Fig. 3. To obtain a more compact representation of the activation matrix, we have performed a pooling operation on the rows of H. The pooling operation [12] returns a value Φ of a selected rectangular neighborhood and can provide small invariance in the data. In our approach, each row of the matrix H was divided into z parts, as shown in Fig. 3. After executing the average/max pooling operation, the matrix is unfolded into the final representation as a feature vector.

Fig. 3. Feature extraction scheme based on non-negative spectrogram factorization.

B. Experimental setup
The acoustic data exploited in the machine classification uses the same audio recordings as in the previous evaluation. The whole set, containing 96 samples, was split in the ratio 70/30 into train and test subsets. Having calculated feature vectors for each sound mixture, we have employed a set of 15 popular classifiers [13] to perform the classification.
In the experiments, we have used configurations created on the basis of the following parameters:
- NMF rank (r) = 2, 3, 4, 5, 6, 7, 8, 16, 32, 50,
- 43 configurations of 15 various classifiers,
- maximum/average pooling (Φ) of the activation matrix H,
- row pooling size z = 2, 4, 8, 16.
For every combination, the classifier with the best classification accuracy was selected, giving a set of 320 discriminative configurations.

C. Results
The highest classification accuracy was obtained for the support vector machine (SVM) classifier (and its nu-SVC and C-SVM variants) with sequential minimal optimisation [14] in the training phase. The best results (from 56.25% to 83.3%) are given in Table V together with the configurations for all sets. The rank of the NMF decomposition was equal to 5 for the highest scores, with average pooling (z = 8). The data mapping in the SVM classifier is performed by using kernels [15]. According to Table V, the results depend on the type of kernel. Therefore, we have determined the configuration of the feature extraction phase and the properties of the SVM kernel that maximize the accuracy over all classes. The final structure of the machine classification is described by the following parameters:
- NMF rank r = 5,
- average pooling with z = 8,
- C-SVM classifier with a 4th-order polynomial kernel.
Results obtained in this case for classes S1–S4 are presented in the confusion matrix shown in Table VI.

IV. DISCUSSION
There are several observations from the subjective and objective test results presented in this paper. First of all, despite initial assumptions, the comparison of the two domains (human assessment and machine classification) turned out to be ambiguous. This does not exclude the possibility of formulating some comparative conclusions. The essence of objective methods
Fig. 4. Example of activation coefficients (H) obtained for NMF decomposition (r = 5) using the mixtures of 2, 3, 4, 5, 6 and 7 simultaneous mixed-gender
speakers.
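The KL-divergence NMF with the multiplicative updates of Eqs. (2)–(5) can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; the matrix sizes, iteration count and random "spectrogram" are placeholders):

```python
import numpy as np

def kl_nmf(S, r, n_iter=200, eps=1e-9):
    """Factorize a non-negative matrix S (m x n) into W (m x r) and H (r x n)
    by minimizing the Kullback-Leibler divergence D(S|WH) with the
    multiplicative update rules of Eq. (5)."""
    m, n = S.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    ones = np.ones_like(S)
    for _ in range(n_iter):
        # H <- H o (W^T (S / WH)) / (W^T 1)
        H *= (W.T @ (S / (W @ H + eps))) / (W.T @ ones)
        # W <- W o ((S / WH) H^T) / (1 H^T)
        W *= ((S / (W @ H + eps)) @ H.T) / (ones @ H.T)
    return W, H

# Toy "magnitude spectrogram": 40 frequency bins x 100 time frames
S = np.random.default_rng(1).random((40, 100))
W, H = kl_nmf(S, r=5)
print(W.shape, H.shape)  # (40, 5) (5, 100)
```

The columns of W play the role of the frequency patterns and the rows of H the activation coefficients discussed in Section III-A.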

TABLE V
THE BEST CLASSIFICATION ACCURACY FOR SEVERAL CONFIGURATIONS

Set   NMF rank   Pooling type   Feature vector size   Classifier / Kernel          Accuracy
S1    5          average        40 (z=8)              nu-SVC / sigmoid             83.3 %
S2    5          average        40 (z=8)              C-SVM / polynomial           66.7 %
S3    5          average        40 (z=8)              nu-SVC / sigmoid             60 %
S4    5          maximum        20 (z=4)              C-SVM / Pearson universal    56.25 %
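The row-wise pooling of the activation matrix H into a fixed-length feature vector, as described in Section III-A, can be sketched as follows (a hypothetical sketch; the function name and the handling of a non-divisible remainder are our assumptions):

```python
import numpy as np

def pool_activations(H, z=8, mode="average"):
    """Divide each row of the activation matrix H into z equal parts,
    pool each part (average or max), and unfold the result into a
    feature vector of length r * z."""
    r, n = H.shape
    n_trim = (n // z) * z                  # drop a remainder so the parts are equal
    parts = H[:, :n_trim].reshape(r, z, n_trim // z)
    pooled = parts.mean(axis=2) if mode == "average" else parts.max(axis=2)
    return pooled.reshape(-1)

# Activation matrix for NMF rank r = 5 over 100 frames (random stand-in)
H = np.random.default_rng(0).random((5, 100))
features = pool_activations(H, z=8, mode="average")
print(features.shape)  # (40,)
```

With r = 5 and z = 8 this yields the feature vector size 40 listed for sets S1–S3 in Table V.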

TABLE VI
CONFUSION MATRIX FOR ALL SETS
(each cell lists the values for sets S1 / S2 / S3 / S4; "-" means the class is absent from the set)

        M2                      M3                      M4                      M5                   M6                  M7
M2   66.7/66.7/66.7/66.7     33.3/16.7/33.3/33.3     0/16.7/0/0              -/0/0/0              -/-/0/0             -/-/-/0
M3   0/0/0/0                 83.3/66.7/66.7/66.7     16.7/16.7/16.7/16.7     -/16.7/16.7/16.7     -/-/0/0             -/-/-/0
M4   0/0/0/0                 33.3/66.7/16.7/16.7     66.7/33.3/33.3/33.3     -/50/33.3/33.3       -/-/16.7/16.7       -/-/-/0
M5   -/0/0/0                 -/0/0/0                 -/0/0/0                 -/100/33.3/33.3      -/-/66.7/66.7       -/-/-/0
M6   -/-/0/0                 -/-/16.7/16.7           -/-/16.7/16.7           -/-/0/0              -/-/66.7/66.7       -/-/-/0
M7   -/-/-/0                 -/-/-/0                 -/-/-/0                 -/-/-/0              -/-/-/33.3          -/-/-/0
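The final classifier configuration (C-SVM with a 4th-order polynomial kernel and a 70/30 split) can be reproduced, for example, with scikit-learn (a sketch under the assumptions stated in the comments; synthetic data stands in for the real feature vectors):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in feature vectors: 96 mixtures x 40 features (NMF rank r = 5, z = 8);
# labels = number of simultaneous speakers (here 2..5, i.e. the classes of set S2)
X = rng.random((96, 40))
y = rng.integers(2, 6, size=96)

# 70/30 train/test split as in the paper's experimental setup
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C-SVM with a 4th-order polynomial kernel
clf = SVC(kernel="poly", degree=4, C=1.0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

On the random stand-in data the accuracy is of course near chance; only the configuration, not the result, mirrors the paper.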

lies in multiple passes of one or several classifiers with modified parameters, while human evaluation consists in a single listening session for each utterance. This is also a consequence of the fact that the number of examples is not exactly the same for the subjective and objective tests. The next issue is that only some cases of the human assessment were selected for direct comparison. This is due to the fact that the listeners were not limited in terms of the maximum number of speakers. Therefore, they often pointed out a greater number of speakers than the actual one within a given class. Nevertheless, one of the main goals was to find an objective method for the estimation of the number of simultaneous speakers.

[Bar chart: accuracy [%] of human vs. machine classification for sets S1–S4]
Fig. 5. Classification accuracy for the C-SVM classifier with a 4th-order polynomial kernel using NMF rank r = 5 and average pooling (z = 8) versus human classification accuracy.

A comparison of human assessment and machine classification for the best obtained cases is depicted in Fig. 5. It shows that the results of the automatic estimation procedure are clearly better than the human evaluation. Although there is no competition between the human and machine methods, there are opportunities for improving both approaches. This includes the idea of extending the human evaluation process. The solution may consist in placing multiple repetitions of the same utterances
in the final evaluation audio file and evaluating the same audio file several times by the same listener. In both the machine and human approaches, the impact of gender on the recognition results was not examined; however, there is no guarantee that the accuracy of recognition would change significantly.

V. CONCLUSIONS
We have presented subjective and objective schemes for estimating the number of speakers in mixtures containing from two to seven simultaneously spoken sentences. The subjective test included a group of listeners with various auditory experience, whereas the objective tests were performed using an approach based on the activation matrix obtained in the process of spectrogram factorization. Within the scope of the possible comparison, better results were obtained with the automatic method. Such effects encourage further work both on improving the automatic methods of speaker recognition and on searching for connections between the mechanism of human hearing and systems based on objective solutions.

REFERENCES
[1] R. Swamy, K. Murty, and B. Yegnanarayana, "Determining number of speakers from multispeaker speech signals using excitation source information," IEEE Signal Processing Letters, vol. 14, no. 7, pp. 481–484, 2007.
[2] P. A. Kumar, L. Balakrishna, C. Prakash, and S. Gangashetty, "Bessel features for estimating number of speakers from multispeaker speech signals," in Proc. of the 18th International Conference on Systems, Signals and Image Processing (IWSSIP), Sarajevo, 2011, pp. 1–4.
[3] U. Rafi and R. Bardeli, "Harmonic cues for number of simultaneous speakers estimation," in Audio Engineering Society Conference: 53rd International Conference: Semantic Audio, Jan 2014.
[4] E. Zwyssig, S. Renals, and M. Lincoln, "Determining the number of speakers in a meeting using microphone array features," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012, pp. 4765–4768.
[5] A. Masnadi-Shirazi and B. Rao, "Cartesian tracking of unknown time-varying number of speakers using distributed microphone pairs," in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), Marrakech, September 2013, pp. 1–5.
[6] T. Arai, "Estimating number of speakers by the modulation characteristics of speech," in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2, 2003, pp. 197–200.
[7] C. Xu, S. Li, G. Liu, Y. Zhang, E. Miluzzo, Y.-F. Chen, J. Li, and B. Firner, "Crowd++: Unsupervised speaker count with smartphones," in ACM UbiComp, 2013.
[8] T. Maka and M. Lazoryszczak, "Influence of simultaneous spoken sentences on the properties of spectral peaks," in International Conference on Signal Processing: Algorithms, Architectures, Arrangements, and Applications SPA'15, Poznan, Poland, Sep 2015, pp. 87–92.
[9] P. C. Loizou, "Speech quality assessment," in Multimedia Analysis, Processing and Communications, W. Lin, D. Tao, J. Kacprzyk, Z. Li, E. Izquierdo, and H. Wang, Eds., Studies in Computational Intelligence, vol. 346. Springer, Berlin, Heidelberg, 2011.
[10] International Telecommunication Union – Radiocommunication Sector, Recommendation BS.562-3, Subjective assessment of sound quality, 1990.
[11] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, October 1999.
[12] Y.-L. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun, "Ask the locals: multi-way local pooling for image recognition," in IEEE International Conference on Computer Vision (ICCV 2011). IEEE, 2011, pp. 2651–2658.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag New York, 2009.
[14] J. C. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. The MIT Press, 1998, pp. 185–208.
[15] C. M. Bishop, Pattern Recognition and Machine Learning, 1st ed. Springer, 2006.
Marking the Allophones Boundaries Based on the DTW Algorithm
Janusz Rafałko
1) Warsaw University of Technology, Faculty of Mathematics and Information Science, Warsaw, Poland
2) Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland
janusz.rafalko@pg.edu.pl

Abstract—The paper presents an approach to marking the boundaries of allophones in the speech signal based on the Dynamic Time Warping (DTW) algorithm. Setting and marking allophone boundaries in continuous speech is a difficult issue due to the mutual influence of adjacent phonemes on each other. On the one hand, it is this neighbourhood that creates the variants of phonemes, that is, allophones; on the other hand, it means that the border between allophones is in some cases very difficult to determine. Nowadays, this task is carried out manually in cooperation with specialists in the field of phonetics. The presented approach allows a system to be built that is able to automate this process. The aim of the work currently carried out by the author is a method that facilitates the processing of training material for the development of multimodal speech recognition systems. For this purpose, the difficult problem of marking the boundaries of allophones is solved in this report based on the Polish dictionary, in the context of the creation of allophone bases for speech synthesis. This is done because it simplifies organizing critical listening and subjective evaluation of the received allophones by a large group of Polish native speakers (73 people). Strengthening the method will allow it to be used for the extraction of allophones for the needs of the developed system of automatic transcription of English speech and for its notation according to the IPA standard. The analysed continuous speech is combined in the DTW algorithm with a synthesized speech signal. The comparison of both signals is performed not in the time domain, as in the classical DTW, but in the frequency domain. This allows the statement that the phonetic content of both signals is compared. The paper describes the process of marking the boundaries of allophones for the Polish language; however, after appropriate modifications, this approach can be used to determine the allophone boundaries in other languages, especially English.

Keywords—allophoneme analysis, speech processing, Dynamic Time Warping

I. INTRODUCTION
The problem of marking the borders of allophones, but also of other acoustic units such as syllables, in a continuous speech signal is a very important issue in speech technology. It is related to such tasks as, for example, speech recognition and transcription, speech synthesis, or learning a foreign language. Speech recognition and speech transcription are described, for example, in [4], [8], [9]. Various approaches to speech synthesis are described in [7], [12], [13]. The study [11] shows the basic approach of a concatenative TTS (Text-to-Speech) system based on allophones in the context of multilingual synthesis. In the case of concatenative synthesis, the correct delimitation of acoustic units in a continuous speech signal is important, as it is associated with the creation of the bases of acoustic units used to synthesize speech. Another area of application for the appropriate marking of allophone boundaries is foreign language learning. Using it to mark allophones in English has high development potential because, according to David Crystal, author of e.g. "The Cambridge Encyclopedia of the English Language" and the book "English as a Global Language" [5], non-native English speakers nowadays outnumber native speakers three to one. Algorithms related to language processing, and thus the delimitation of allophones, can be very helpful here in learning the correct pronunciation, e.g. in e-learning.

The approach to marking the limits of allophones presented in this article was used to create allophone bases for concatenative speech synthesis based on allophones. The main goal in speech synthesis is the creation of speech which is as similar as possible to the voice of a living person. This imposes the task of keeping the personal characteristics of the voice, accent properties, phonetic articulation, and prosodic properties. A method which allows the reproduction of personal speech characteristics is the concatenative method. It uses small, natural acoustic units, such as allophones, from which the speech is synthesized. The individual features of the human voice are included in these natural acoustic units. In order to synthesize the voice of a particular person, an allophone database must be created. Nowadays such databases are created manually or semi-automatically, which is very time consuming and takes at least several months of work. This requires recording an appropriate voice sample and then processing it: from such a natural speech signal, the appropriate acoustic units should be cut out and saved to the base. All these stages require specialist knowledge. Automatic methods of creating such bases are presented, e.g., in [1], [10]. The approach presented in this article is a new one that will allow automatic marking of allophones in continuous speech, which is the basis for the automatic creation of an allophone base. This will make it easy to create the voice base of any person, living or not, if we have a sufficiently large sample of that voice.
II. AUTOMATIC ALLOPHONES DATABASES CREATION TECHNOLOGY
The general scheme of the automatic allophone databases creation technology is shown in Fig. 1.

Figure 1. The scheme of allophone databases creation

It can be divided into two stages. The first stage of this technology includes the preparation of the text and acoustic corpora and the manual segmentation of the speech signal in order to obtain the reference base. The second stage is the automatic marking and cutting of the allophones from the natural speech signal. This stage includes three steps:
1) Synthesis of the speech of the text corpus
2) Automatic marking and segmentation of the natural signal
3) Creating the final allophone base
The first step is the synthesis of the same signal as the natural one. It does not have to be a complete synthesis with intonation; it is enough to concatenate the allophones together. In this way we obtain a synthesized signal in which we know the boundaries of the allophones. Fig. 2 presents the synthesized, reference Polish word "standardowych" (Eng. standard) together with the allophone boundaries marked.

Figure 2. Synthesized marked word "standardowych"

The next step is to automate the process of marking and cutting out the acoustic units. In this system, the automatic segmentation of the speech signal is based on the Dynamic Time Warping (DTW) algorithm, which relies on dynamic programming [2], [3], [6]. After marking the boundaries in the natural speech signal, cutting out the units is a simple process. In this way we obtain the set of all allophones. The last step is the creation of the final allophone base from this set.

III. AUTOMATIC SEGMENTATION OF ALLOPHONES

A. Marking the allophones boundaries
The automatic segmentation of the speech signal is based on the DTW algorithm; however, in contrast to the classical DTW, it operates not on the signal in the time domain but in the frequency domain. In the algorithm we align two signals: the natural speech signal and a synthesized speech signal in which we know the boundaries of the allophones. The reference (synthesized) and the natural speech signal are divided into frames, which may overlap, and the Fourier transform (FFT) is calculated in each frame. The first step in the DTW method is calculating the local distance matrix. This matrix is calculated from the spectral feature vectors in every frame:

c(n, m) = ||S(n), E(m)|| = Σ_{k=1}^{K} |S(n, k) − E(m, k)| ,    (1)

where:
S(n) – synthesized signal spectrum in the n-th frame
E(m) – natural signal spectrum in the m-th frame
K – the length of the spectral feature vector

Each synthesized signal frame is compared with a natural signal frame and, on the basis of the signal spectra in the frames, the distance between the vectors is calculated. The next step is the calculation of the global distance matrix and the alignment path, shown in Fig. 3.

Figure 3. Matrix of global distances

This figure shows the speech signals, using frames with a length of 256 samples, the Hamming window and 50% overlap. The metric used is the Manhattan metric. The signals are presented in the form of spectrograms. At the bottom, horizontally, there is the spectrogram of the natural speech signal of the Polish word "gałązka" (Eng. twig), pronunciation: gawONska (in the IPA, the International Phonetic Alphabet). On the left side, vertically, there is the spectrogram of the synthesized version of the same word. The global distance matrix is graphically presented in the centre of the picture. The blue colour denotes small distance values between signal frames, and the red one denotes a long distance, that is, signals with different frequencies.

Figure 5. Allophone "g" cut manually and automatically
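The local distance matrix of Eq. (1) with the Manhattan metric, the global (accumulated) distance matrix, and the alignment path can be sketched in NumPy as follows (a minimal illustration, not the author's implementation; the random frame "spectra" and sizes are placeholders):

```python
import numpy as np

def dtw_align(S, E):
    """DTW between synthesized frame spectra S (N x K) and natural frame
    spectra E (M x K) using the Manhattan (city-block) metric of Eq. (1)."""
    N, M = S.shape[0], E.shape[0]
    # Local distance matrix: c(n, m) = sum_k |S(n, k) - E(m, k)|
    c = np.abs(S[:, None, :] - E[None, :, :]).sum(axis=2)
    # Global distance matrix via dynamic programming
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = c[n - 1, m - 1] + min(D[n - 1, m],      # insertion
                                            D[n, m - 1],      # deletion
                                            D[n - 1, m - 1])  # match
    # Backtrack the alignment path from (N, M) down to (1, 1)
    path, n, m = [], N, M
    while n > 0 and m > 0:
        path.append((n - 1, m - 1))
        step = np.argmin([D[n - 1, m - 1], D[n - 1, m], D[n, m - 1]])
        if step == 0:
            n, m = n - 1, m - 1
        elif step == 1:
            n -= 1
        else:
            m -= 1
    return D[N, M], path[::-1]

# Toy magnitude spectra: 30 synthesized frames, 40 natural frames, K = 129 bins
rng = np.random.default_rng(0)
cost, path = dtw_align(rng.random((30, 129)), rng.random((40, 129)))
print(cost, path[0], path[-1])  # the path runs from (0, 0) to (29, 39)
```

Transferring the known allophone boundaries would then amount to reading off, for each boundary frame of the synthesized signal, the natural-signal frame paired with it on the alignment path.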

We can also see the allophone borders of the natural speech signal, determined on the basis of the alignment path, and the boundaries of the allophones from the synthesized signal. This is the essence of automating the marking and cutting of the units. We do not compare the natural signal to the reference one; rather, using the alignment path, we assign the allophone boundaries. After marking the boundaries in the natural speech signal, cutting out the units is a simple process. In this way we obtain the set of all the allophones which are located in the acoustic corpus. An example of marked boundaries in a natural word is shown in Fig. 4. The upper spectrogram represents the reference synthesized word, the bottom one the natural word with the marked allophone borders.

Figure 4. Specified borders of allophones in the synthesized signal (upper), and marked borders in natural speech signal (lower)

The following Fig. 5 and Fig. 6 show the time diagrams of the first two allophones cut out from the word "gałązka" in an automatic way. As can be seen, the boundaries of these allophones are set correctly. The upper graphs in the figures show the reference units, while the lower ones show the units cut automatically. The time scale is preserved.

Figure 6. Allophone "a" cut manually and automatically

B. Selection of the allophones
In concatenative synthesis, for which the prepared bases are designated, we need only one instance of each allophone in the base. In order to create an acoustic database, it is necessary to analyze the received units and mark those phonetic units in which the errors introduced during reading or automatic segmentation are too big. The reason why these errors arise is the phenomena of reduction and simplification of phonemes in natural speech, leading to the almost complete disappearance of phonemes, with the result that the phonetic content of the synthesized speech does not coincide with the natural speech, e.g. jabłko → jabwkO (in formal language), japkO (in colloquial speech) – Eng. apple. The second reason is the inaccurate setting of the markers in the natural signal during automatic boundary marking. The operation of finding wrongly marked allophone boundaries is based on a comparison of the reference acoustic characteristics from the reference base with those of the natural speech segments obtained from the segmentation process. When the differences between them are higher than a threshold, such a segment will not be able to provide the minimum level of quality necessary for the synthesized speech and should be rejected. This operation is performed by testing the time parameters, i.e. the duration of the allophone, and the acoustic-phonetic parameters, i.e. the cost of matching in the segmentation algorithm.
The duration of the tested units Tt obtained in the segmentation process is compared with the duration of the reference units Tr used in the synthesis module according to formula (2):

δt = |Tt − Tr| / Tr > α    (2)

If the relative error of the unit duration is greater than the threshold α, the element is rejected. During experiments on different collections it has been determined that if the error exceeds 80% the allophone is not suitable for the final base. Such allophones constitute about 9% of the total number in the sample set.
The second parameter in this operation is the cost of
Table I
matching both units, referenced and tested, - in this case, T HE RESULTS OF THE SUBJECTIVE TESTS
the cost of matching in DTW algorithm. Because the local
distance defining the similarity of these units in the frequency
domain, this cost can be a measure of the phonetic accuracy of
tested allophone unit. This cost is the sum of local distances
within the alignment path determined for those units. This
cost, similarly as a unit duration, may differ for particular
units, so we had to develop a relative factor. However, the
cost of matching the reference unit is equal zero. That is why,
the average value of matching cost for all instances of a given
entity was taken into account. Formula (3) shows this ratio:
|CP − CP ave |
δc = >β (3)
CP ave
where:
CP – matching cost of allophone unit
CP ave – average matching cost of all instances of the unit

In this case, it was experimentally determined for varied sets that an error greater than 50 % disqualifies such a cut unit from use in the output base. This concerns about 3.5 % of the total number of units. As a result of these operations, all pieces for which δt > α or δc > β are excluded from the allophonic database. As the experiments showed, the best results of marking allophones with wrongly assigned boundaries are obtained when α = 0.8 and β = 0.5. The presented results show that wrongly marked boundaries in the speech signal amount to about 12.5 %.

IV. EVALUATION OF RECEIVED DATABASES

In the case of synthesized speech, the most important evaluation criteria, i.e., articulation, transparency, and effortlessness, are subjective ones. In this study, allophonic databases were created and used to synthesize speech; therefore, the studies rely on a comparison of the speech obtained from the automatic databases derived from the presented system with the reference databases obtained manually. The most commonly used subjective speech testing methods are:

• ACR (Absolute Category Rating) – the method of absolute speech quality evaluation
• DCR (Degradation Category Rating) – the method describing the level of speech quality degradation
• the logatome articulation research method

The ACR and DCR methods are described and recommended by the International Telecommunication Union (ITU) to assess the quality of speech signal transmission in analog and digital telecommunications channels and speech encoding systems [14]. The logatome method is described in the Polish Standard PN-90/T-05100 [15]. The presented results were obtained on two databases created in an automatic way: the database of a professional radio speaker ("Speaker") and of the voice of this study's author ("Janusz"). The study involved 73 people, mostly students. The results are summarized in Table I.

The table presents all calculated evaluation parameters. The MOS (Mean Opinion Score) and DMOS (Degradation Mean Opinion Score) parameters assume numerical values from 1 to 5, whereas the logatome sharpness ratio is given in percentages.

In the ACR method the absolute quality of the presented voice samples is determined, without the use of a reference signal. Then the MOS parameter is calculated – the averaged listeners' opinion, which characterizes the sound quality. The participants listen to the recorded speech and then evaluate it on a scale from 1 to 5. In this case, two scales recommended by the ITU were used: the listening quality scale and the auditory effort scale. The evaluation of the automatically obtained databases was performed in comparison with the reference databases obtained manually. This means that the listeners, who evaluate the synthesized voice, evaluate samples created on the basis of the automatic database as well as samples from the reference database. The order of the samples was random.

As the table shows, in every case the assessment was a little below 4, which means that the synthesized speech quality was good and allowed the speech to be understood without difficulty, with a light intensity of attention. The standard deviation calculated from the sample stays a little above 0.5, which means that there were no large differences between listeners. The reference databases fared better in this evaluation, which is understandable. The "Janusz" database was evaluated a little lower than the "Speaker" database. Importantly, the "Janusz" database was created on the basis of another speaker's voice, and yet it practically does not differ from the reference base.

The DCR method is used for examining the so-called speech degradation degree. The measurement is based on the comparison of the natural signal with the examined one, determining its degradation on a five-point scale from "the difference is imperceptible" to "the difference is audible and clearly perceptible". On the basis of the obtained results, the DMOS

coefficient was determined. In the case of this coefficient, we can see that the "Janusz" database, both standard and automatic, received a higher score.

The logatome articulation method determines the percentage of logatomes correctly received by the listeners in relation to the total number of logatomes presented. Logatome recognition is the result of hearing all of its component phonemes, not of association with a known word. Analyzing the results, we can see that the logatome sharpness is better for both voices in the automatic databases. In the case of the speaker's voice, the average difference is approx. 5 %, and in the case of the "Janusz" databases this difference is about 15 %. For every type of logatome, recognition is better when the units were created using the automatic databases.

V. CONCLUSION

In this work, a new approach to the automatic marking of allophone boundaries in a continuous speech signal has been described. The performed studies showed that the quality of allophonic databases created automatically is comparable with the quality of those created manually. This means that they can be used in concatenative speech synthesizers. The presented algorithms have been tested for the Polish language; however, after appropriate modifications, they can be adapted to other languages, in particular English. Moreover, the application of automatic delimitation of allophone boundaries is not limited to the creation of allophone bases for the concatenative speech synthesis presented in this article. Further applications include, for example, algorithms supporting e-learning language teaching or speech recognition algorithms.

The difficult problem of marking allophone boundaries is solved in this report for the Polish dictionary because of the simplified possibility of organizing critical listening and subjective evaluation of the received allophones by a large group of Polish native speakers (73 people). The current aim of the author's work is a method that facilitates the processing of training material for the development of multimodal speech recognition systems. Strengthening the method will allow it to be used for the extraction of allophones for the developed system of automatic transcription of English speech and for its notation according to the IPA standard.

ACKNOWLEDGMENT

Research sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874.

REFERENCES

[1] G. Almpanidis, C. Kotropoulos, Automatic Phonemic Segmentation Using the Bayesian Information Criterion with Generalised Gamma Priors, Proceedings of EUSIPCO, 2007
[2] R. Bellman, S. Dreyfus, Applied Dynamic Programming, Princeton University Press, 1971
[3] R. Bellman, Dynamic Programming, Dover Publications, 2003
[4] L. Besacier, E. Barnard, A. Karpov, T. Schultz, Automatic speech recognition for under-resourced languages: A survey, Speech Communication, 56: 85–100, 2014
[5] D. Crystal, English as a Global Language, Cambridge University Press, 2nd edition, 2003
[6] J. Deller, J. Proakis, J. Hansen, Discrete-Time Processing of Speech Signals, Prentice Hall, 1987
[7] T. Dutoit, An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, 1997
[8] W. Mattheyses, V. Verhelst, Audiovisual speech synthesis: An overview of the state-of-the-art, Speech Communication, 66: 182–217, 2015
[9] D. O'Shaughnessy, Automatic speech recognition: History, methods and challenges, Pattern Recognition, 41, 10: 2965–2979, 2008
[10] K. Szklanny, D. Oliver, Creation and analysis of a Polish speech database for use in unit selection speech synthesis, LREC Conference, Genova, 2006
[11] E. Szpilewski, B. Piórkowska, J. Rafałko, B. Lobanov, V. Kiselov, L. Tsirulnik, Polish TTS in Multi-Voice Slavonic Languages Speech Synthesis System, SPECOM'2004 Proceedings, 9th International Conference Speech and Computer, pp. 565–570, Saint-Petersburg, Russia, 2004
[12] P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, 2009
[13] J. van Santen, R. Sproat, J. Olive, J. Hirschberg, Progress in Speech Synthesis, Springer Verlag, New York, 1997
[14] ITU-T Recommendation P.800, Methods for subjective determination of transmission quality, 1996
[15] PN-90/T-05100, Analogowe łańcuchy telefoniczne. Wymagania i metody pomiaru wyrazistości logatomowej, Warszawa, 1993


Infrared thermal camera-based system for tram drivers warning about hazardous situations
Adam Konieczka, Ewelina Michałowicz, Karol Piniarski
Poznan University of Technology, Institute of Automation and Robotics,
Division of Signal Processing and Electronic Systems, Poznan, Poland
adam.konieczka@put.poznan.pl, ewelina.michalowicz@student.put.poznan.pl, karol.piniarski@put.poznan.pl

Abstract — In this paper, we propose a new thermal camera-based system for tram drivers. It aims to increase the safety of tram traffic at night. The proposed solution uses a standard vision camera and a thermal camera. Firstly, it processes the acquired images in order to detect the tram tracks. Secondly, it detects people or obstacles on the tracks and generates warnings for the driver. This solution has been tested in static conditions using a standard-gauge tram. The achieved results prove that this prototype system can effectively warn of dangerous situations, especially in dark places.

Keywords - track extraction, obstacle recognition, tram vehicle, safety driving support, thermovision

I. INTRODUCTION

In recent years, many works have been carried out regarding the detection of threats that occur in built-up areas [1, 2]. Most systems use cameras that work in the field of visible light, but systems using thermal imaging cameras are also becoming more common.

Thermovision, thanks to technological progress and the falling costs of devices, has been applied in new areas such as medicine, monitoring of the condition of buildings or machines, and intelligent transportation systems. The most common fields of use of thermovision are night-time imaging, e.g. automotive night vision systems [3], long-distance detection of living beings, e.g. surveillance and border protection [4], and CCTV monitoring of vulnerable areas like airports, military zones, and nuclear power plants [5].

Thermal cameras have several important advantages over normal vision: they see at night and in fog, when the normal visibility range drastically decreases. This is due to the fact that thermal cameras capture infrared radiation (heat) with wavelengths in the range of 3–30 μm emitted by objects [6]. In practice, living beings with a temperature different from the surroundings become more distinctive.

Thermal cameras also have a large detection range. It can be up to 300 m [7] (professional detectors, e.g. in border or military monitoring, reach a 20 km detection distance, but are expensive and require special cooling). Moreover, they cannot be dazzled by a street light or car headlight.

On the other hand, thermal cameras are still more expensive than standard vision (normal) cameras and have lower resolution. The thermal spectrum is also more difficult for a user to interpret, due to the low number of details.

While traffic at night is significantly smaller, more than half of pedestrian deaths take place at this time (51%) [8]. The main source of this is poor visibility.

Nowadays, automotive companies offer many solutions that increase the safety of night traffic. One of them is night vision systems that assist the driver on roads and protect against accidents with pedestrians [3]. The first such system, called NightVision, was offered by BMW in their 7 series cars. The system has the ability to detect pedestrians and large animals at a distance of 100 m. Another system is Night View Assist Plus offered by Mercedes-Benz. In its newest version the thermal camera is supported by a near-infrared camera. The evolution of night vision systems on the automotive market is presented in [3, 7].

Many researchers present their approaches to pedestrian detection in thermal spectrum images. In general, they are based on shape and temperature contrast information. In [9] a thermal stereo vision system for pedestrian detection is proposed. It is based on three different underlying approaches: warm area detection, edge-based detection, and v-disparity computation. A decision process is performed using head morphological and thermal characteristics. In [10] a pedestrian detection method based on a monocular thermal camera is presented. An adaptive local dual-threshold segmentation algorithm for candidate regions is used together with histograms of oriented gradients and a modified support vector machine classifier. In [11] an approach for pedestrian detection and tracking based on shape features is proposed. The temperature contrast between the background and target pedestrians is used for the detection of pedestrians, and the transition of the score between adjacent frames for tracking.

There are often dangerous situations which involve trams and pedestrians. It can be noted that the term "tram" is used in Europe, while the same vehicle is also known as a "streetcar" or "trolley-car" in other countries. In [12] a real-time tram track detection method is proposed. It uses a standard vision long-distance camera mounted on a tram. Experiments have shown that this system is effective in relatively good lighting conditions and it can be used during obstacle recognition. Standard vision cameras are also used in the obstacle recognition systems described in [13, 14]. The first one can detect static and moving obstacles, and warn the driver. The detected objects

are classified into four categories: "Man", "Bicycle", "Car", and "The other". Unfortunately, the experiments were performed in good light conditions or in a laboratory. The second system also supports tram drivers using front view images of trams. This method detects obstacles on or around the tracks from quiescent images from a standard vision camera. Moreover, the detection results depend on the surrounding lighting environment, and experiments under poor lighting conditions were not conducted.

Tram lines are located not only in highly urbanized areas. Tram transport is available also in suburban areas, which are not always fenced and illuminated at night. Pedestrians sometimes cross the tracks in dark places. However, to the best of our knowledge, there have been no propositions of solutions that use a standard vision camera and a thermal camera for warning tram drivers at night. It can be noted that a manufacturer of thermal imaging cameras suggests the use of these cameras for the detection of people on metro, tram, or railway tracks [15].

Therefore, in this paper, we propose a night vision system that can detect some possibly dangerous situations and alert the tram driver.

This paper is organized as follows. After the introduction in this section, the problem of early detection of threats is considered in Section 2. In Section 3 the architecture of our system is presented. Experimental results are described in Section 4. The last section contains final remarks.

II. PROBLEM DEFINITION

The full weight of a modern 32-meter tram can reach up to approximately 59 tons [16, 17]. Friction between metal wheels (such trams can have only 12 wheels) and the rail is much smaller than in the case of cars. For this reason, the tram stopping distance is much longer than for cars. In addition, the stopping distance is extended on wet tracks, on tracks covered with leaves, or on sloping tracks. The maximum permissible stopping distance of an unloaded tram from a speed of 30 km/h on a straight, horizontal, and dry track must not be greater than shown in Table 1.

TABLE I. THE MAXIMUM PERMISSIBLE STOPPING DISTANCE OF AN UNLOADED TRAM FROM A SPEED OF 30 KM/H [17]

                                 Maximum stopping distance [m]
Year of the tram production      emergency braking    normal braking
before 31.12.1963                17.3                 43.4
01.01.1964–01.01.2000            17.3                 31.5
02.01.2000–01.01.2002            13.3                 28.9
02.01.2002–01.01.2005            12.4                 26.7
after 01.01.2005                 11.5                 24.8

It is worth noting that, for example, among the normally used trams of the Public Transport Corporation MPK Poznań, 4 trams were produced before 1963, 271 from 1964 to 2000, and 151 after 2000 [18].

It should be taken into account that the tram driver's reaction time and the braking system response time are approx. 1 second [19]. The typical stopping distance of a laden tram in normal conditions is significantly longer than indicated in Table 1.

On many sections of tracks, the speed of trams can be as high as 70 km/h (for example on the Poznań Fast Tram (PST) route, which is unlit at night) [20].

Tests carried out by communication companies (popular 105N (805N) type trams manufactured before 2000 were used) also show the length of the stopping distance at higher speeds. Table 2 presents the actual results obtained for unladen trams on straight and dry tracks. These results take into account the tram driver's reaction time and the braking system response time.

TABLE II. THE ACTUAL STOPPING DISTANCE OF TRAMS AT DIFFERENT SPEEDS

                        Measured stopping distance [m]
Speed [km/h]  Source    emergency braking    normal braking
20            [19]      13.2                 -
              [21]      14.5                 19.0
30            [21]      21.0                 31.0
40            [19]      41.9                 -
              [21]      39.0                 43.0
50            [19]      62.9/80.0*           -
* with full load

Regulation [17] requires that the low beams of trams illuminate the track at a distance of 40 m with good air transparency. The high beams should illuminate the track at a distance of 100 m in the same conditions. However, the possibility of using the high beams is limited due to the possibility of blinding other drivers [22]. Therefore, under poor lighting conditions, even emergency stopping of the tram can be problematic.

As has already been mentioned, it happens that dark-dressed people pass through tracks in prohibited and unlit places. Sometimes they even lie on the tracks (e.g. after drinking alcohol). Wild animals such as roe-deer and wild boars also enter the tracks. Tram drivers often encounter such situations while driving trams.

It is worth mentioning that a tram's passengers are not fastened with seat belts and often travel in a standing position. Therefore, in case of sudden braking, they can fall over and suffer injuries.

Our proposed experimental system can alert tram drivers about hazards on the tracks at large distances. We assume that the thermal image should not be displayed to the tram driver, in order not to distract him. Therefore, the achieved image must be processed. Information about the detected threat should be given in the form of an acoustic or visual signal (e.g. a blinking control lamp on the control panel of the tram).

III. SYSTEM ARCHITECTURE

The proposed system uses two cameras: a thermal camera and a normal camera. Both cameras record the same scene in front of the tram. The achieved frames are converted to the resolution of 320×240 pixels for further processing. Image processing

(in the test version of our system) is carried out using the Matlab programming environment. The processing steps are schematically presented in Fig. 1.

Figure 1. General scheme of the threat detection algorithm

Firstly, the track in front of the tram is detected in order to select the area (ROI – region of interest) where the threats will be detected. For this purpose frames from the normal camera are used, because the track is poorly visible in the thermal frames. If the frame is too dark for ROI detection, the ROI is set as a default area.

Secondly, threats are detected using the thermal camera and normal camera frames. If some threats are detected in the ROI, an alert for the tram driver is generated.

We assume that the proposed system should be activated above a certain speed of the tram. Therefore, false alarms will not be generated when the tram is at a tram stop or before an intersection with a road.

A. Track detection

The detailed diagram of the track detection algorithm is shown in Fig. 2. At first, a frame from the normal camera is converted to grayscale (Fig. 3a). Then, optionally, gamma correction may be applied with an experimentally set factor equal to 0.4. This step is needed when the mean brightness level of the right bottom side of the frame is in the range 25–72 (the range of possible pixel values in the frames is from 0 to 255). The influence of the gamma correction step on track detection effectiveness is presented in Table 3 (in the next section).

Figure 2. Diagram of the track detection algorithm

Then the Prewitt operator (with a 3×3 pixel kernel) is used for vertical edge detection. The achieved image is binarized with a threshold equal to 15. Small groups of pixels (less than 10) are removed using a morphological opening (Fig. 3b).

The Hough transform is used for the track detection. It detects straight lines (or almost straight lines – for curved tracks) inclined at angles between -40° and 10° with respect to the vertical direction. From the detected lines, those crossing the bottom edge of the image between the 130th and 260th pixel (counting from the left side) are selected. The two inner lines are selected as the rails. The detected rails are marked (in Fig. 3c in green).

Finally, the ROI is selected. It is limited by the bottom edge of the image and by a horizontal line at the determined height of the image (where the tracks become invisible). The side edges of the ROI enclose an area two times wider than the rails. As a result, the ROI includes the area which, in fact, the tram occupies (typically the width of trams does not exceed 2.4 m, and the track gauge is 1.435 m). The achieved ROI is presented in Fig. 3d.

Figure 3. Steps of track detection with gamma equal to 0.8: a) original grayscale image, b) image after the small objects removal step, c) track detection step, d) track area (ROI) selection step
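The edge-filtering part of the track detection chain can be sketched in NumPy. This is an illustrative sketch, not the authors' Matlab implementation: the Prewitt kernel and the binarization threshold of 15 follow the text, while `rail_columns` is a deliberately naive stand-in for the Hough-transform step (it merely brackets the strongest edge columns crossing the bottom of the frame between the 130th and 260th pixel).

```python
import numpy as np

def vertical_edges(gray, threshold=15):
    """Prewitt 3x3 vertical-edge response, binarized with the threshold 15
    used in the paper: right neighbours minus left neighbours, summed
    over three rows."""
    g = gray.astype(np.float32)
    resp = np.zeros_like(g)
    resp[1:-1, 1:-1] = (
        (g[:-2, 2:] + g[1:-1, 2:] + g[2:, 2:]) -
        (g[:-2, :-2] + g[1:-1, :-2] + g[2:, :-2])
    )
    return np.abs(resp) > threshold

def rail_columns(edge_mask, lo=130, hi=260):
    """Naive stand-in for the Hough step: report the outermost columns with
    consistent vertical edges near the bottom edge, between pixels lo and hi."""
    votes = edge_mask[-20:, lo:hi].sum(axis=0)   # edge votes per column
    idx = np.flatnonzero(votes > 10)             # columns with stable edges
    if idx.size < 2:
        return None
    return int(lo + idx[0]), int(lo + idx[-1])
```

In the paper the Hough transform additionally constrains the line angles (between -40° and 10° from vertical) and then keeps the two inner lines as the rails; the helper above only brackets the candidate region.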

When the ROI is not detected, a default ROI covering the area of a straight track is assumed, because this system is especially needed in places where trams can go very fast.

B. Detection of potential threat


The threat detection algorithm uses normal frames (Fig. 4a) and thermal frames (Fig. 4c) simultaneously. Both frames are thresholded using the extended-maxima transform [23] for better extraction of interesting objects, like obstacles or tracks. In the next step, small objects are removed as in the track detection algorithm (Fig. 4b and 4d, respectively).
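The masking logic of this subsection (a candidate threat is an area present in the thermal frame but absent from the normal frame, then checked against the ROI) can be sketched in NumPy. This is an illustrative sketch, not the authors' Matlab code: a plain global threshold stands in for the extended-maxima transform, and the threshold and minimum-area values are assumptions.

```python
import numpy as np

def threat_mask(thermal, normal, t_thr=128, n_thr=128):
    """Candidate threats: pixels bright (warm) in the thermal frame that
    have no counterpart in the normal frame.  The global thresholds are
    illustrative stand-ins for the extended-maxima transform."""
    warm = thermal > t_thr       # binary thermal image (cf. Fig. 4d)
    visible = normal > n_thr     # binary normal image (cf. Fig. 4b)
    return warm & ~visible       # present in the thermal image only (cf. Fig. 4e)

def alarm(threat, roi_mask, min_pixels=10):
    """Generate an alarm only when enough threat pixels fall inside the ROI
    (min_pixels is an assumed minimum-area parameter)."""
    return int((threat & roi_mask).sum()) >= min_pixels
```

If the same object appears in both binary images, `threat_mask` removes it, which reproduces the rule that no alarm is raised for objects already well visible to the driver.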

Figure 4. Detection of potential threat (a lying man on the track): a) normal image, b) binary normal image, c) thermal image, d) binary thermal image, e) logical product of images b and d, f) areas of the detected threat

Then the logical function is applied to the achieved images. The areas that occur in the thermal image and do not occur in the normal image are recognized as threats (Fig. 4e). When a threat area occurs in the ROI (Fig. 4f), the alarm for the tram driver is generated. The alarm will not be generated if the same object occurs in both images simultaneously. In such a situation it is assumed that this object is also well visible to the driver.

IV. EXPERIMENTAL RESULTS

In order to achieve the thermal recordings under real conditions, the historic tram type N from MPK Poznań was used. In this tram, cameras could be easily and temporarily installed (Fig. 5). For the tests, the normal camera Sony HDR-XR200VE and the thermal camera FLIR Ex4 (with a resolution of 80×60 pixels) [24] were used. Recordings were made on a route of over 30 km at night.

Figure 5. Cameras mounted at the front of the tram

The influence of gamma correction on the effectiveness of track detection was examined on the recordings achieved from the normal camera. The experiment was carried out on 199 frames (pixel value range 0–255) achieved at a low level of lighting (the mean brightness level of 146 frames was under 50). The obtained results are presented in Table 3.

TABLE III. EFFECTIVENESS OF TRACK DETECTION AT DIFFERENT LEVELS OF GAMMA AND BRIGHTNESS

Mean brightness     Effectiveness of track detection [%]
level of frames     gamma=0.4    gamma=0.5    gamma=0.8    without gamma correction
<25                 0            0            0            0
25–50               9.23         9.23         4.62         3.08
50–72               26.92        26.92        26.92        26.92
>72                 25.93        25.93        22.22        29.63

The low lighting level negatively affected the quality of all frames, because they were very noisy. Therefore, the visibility of the rails was poor. Table 3 shows that the gamma correction should be performed with a factor of about 0.4–0.5 for frames with a low level of brightness (less than 72) and should not be performed for brighter frames. The low effectiveness of track detection is probably caused by the high noise level in the image. The above results also indicate the need to use a higher-quality camera.

Table 4 presents the effectiveness of dark-dressed pedestrian detection at different distances from the tram. This experiment was carried out on the unlit part of the PST route, while the tram was stationary (due to safety reasons). Experiments of human detection from each of the tested distances were conducted 10 times. In five cases, the examined person was inside the ROI, and in the other five cases, the person was outside the ROI. The small number of replicates of the experiment resulted from the short time available; it was not possible to delay the remaining trams.

The results presented in Table 4 prove that even from a distance of 45 m people detection was not difficult. The pedestrians were not visible to the normal camera from this distance. The tram driver and the other participants of this experiment also did not see the pedestrians from a distance of

more than 35 m. For the distance of 53 m, 40% incorrect detections of the pedestrian were obtained. This was probably due to the low resolution of the thermal camera. For comparison, Fig. 6 shows the thermal images achieved from the distances of 38 and 53 m.

TABLE IV. EFFECTIVENESS OF PEDESTRIAN DETECTION

                  Number of detections
Distance [m]   true positive   true negative   false positive   false negative
23             5               5               0                0
30             5               5               0                0
38             5               4               1                0
45             4               5               0                1
53             2               4               1                3

Figure 6. Thermal images achieved during the experiment from the distance of: a) 38 m, b) 53 m

V. CONCLUDING REMARKS

In this paper, a new thermal camera-based system for tram drivers is proposed. It aims to increase the safety of tram traffic at night. The achieved results prove that even the use of a relatively cheap low-resolution thermal camera and a standard camera can significantly improve the visibility of certain hazards, such as people on the tracks. The use of the proposed vision system can reduce the number of sudden braking events of trams and positively affect the safety of travel.

The authors plan further work on this system. They intend to repeat the experiments using better quality cameras. The authors would also like to attempt detecting pedestrians crossing tracks while driving a tram in real situations.

In order to optimize and accelerate the algorithm execution, it can be transferred to another programming environment, for example, to the Python language.

ACKNOWLEDGMENT

We would like to thank the Public Transport Corporation MPK Poznań Sp. z o.o. for providing the tram for the experiment.

REFERENCES

[1] J. Balcerek, A. Chmielewska, A. Dąbrowski, D. Jackowski, A. Konieczka, T. Marciniak, P. Pawłowski, "Recognition of threats in urban areas by means of the analysis of video sequences," in Proc. of Multimedia Commun., Services and Security (MCSS), Kraków, 2010, pp. 41–48.
[2] J. Balcerek, A. Konieczka, T. Marciniak, A. Dąbrowski, K. Maćkowiak, K. Piniarski, "Video processing approach for supporting pedestrians in vehicle detection," in Signal Process.: Algorithms, Architectures, Arrangements, and Appl. (SPA), Poznan, 2014, pp. 100–103.
[3] Y. Luo, J. Remillard, D. Hoetzer, "Pedestrian Detection in Near-Infrared Night Vision System," in IEEE Intelligent Vehicles Symp., 2010, pp. 51–58.
[4] A. Leykin and R. Hammoud, "Robust Multi-Pedestrian Tracking in Thermal-Visible Surveillance Videos," in Conf. on Computer Vision and Pattern Recognition Workshop (CVPRW'06), 2006, pp. 136–136.
[5] D. Pahud and M. Hubbuch, "Measured Thermal Performances of the Energy Pile System of the Dock Midfield at Zürich Airport," in Proc. European Geothermal Congress, Unterhaching, 2007, pp. 1–7.
[6] F. Jahard, D.A. Fish, A.A. Rio, C.P. Thompson, "Far/Near Infrared Adapted Pyramid-Based Fusion for Automotive Night Vision," in Proc. on Int. Conf. on Image Process. and its Appl., vol. 8, 1997, pp. 886–890.
[7] K. Piniarski, P. Pawłowski and A. Dąbrowski, "Pedestrian detection by video processing in automotive night vision system," in Signal Process.: Algorithms, Architectures, Arrangements, and Appl. (SPA), Poznan, 2014, pp. 104–109.
[8] J.F. Pace, J. Sanmartín, P. Thomas, A. Kirk, et al., "Traffic Safety Basic Facts 2012 – Pedestrians," Deliverable D3.9 Assembly of Basic Fact Sheets and Annu. Statistical Rep., DaCoTA EU Road Safety Project, 2012, pp. 1–12. Available: http://www.dacota-project.eu/BFS%202011.html.
[9] M. Bertozzi, A. Broggi, A. Lasagni and M. Rose, "Infrared stereo vision-based pedestrian detection," in Proc. IEEE Intelligent Vehicles Symp., 2005, pp. 24–29.
[10] Q. Liu, J. Zhuang and S. Kong, "Detection of Pedestrians at Night Time Using Learning-based Method and Head Validation," in IEEE Int. Imaging Syst. and Techn. (IST) Conf., 2012, pp. 398–402.
[11] D.E. Kim and D.S. Kwon, "Pedestrian detection and tracking in thermal images using shape features," in 12th Int. Conf. on Ubiquitous Robots and Ambient Intelligence (URAI), Goyang, 2015, pp. 22–25.
[12] C. Wu, Y. Wang, C. Yan, "Robust Tramway Detection in Challenging Urban Rail Transit Scenes," in CCF Chinese Conference on Computer Vision 2017: Computer Vision, part I, CCIS 771, pp. 242–257.
[13] H. Miyayama, T. Ohya, T. Katori, T. Izumi, "Obstacle recognition from forward view images from trams," WIT Transactions on State of the Art in Science and Engineering, vol. 45, pp. 137–147, WIT Press, 2010.
[14] T. Katori, T. Izumi, "Obstacle detection on a track from a quiescent image of a tram's front view," WIT Transactions on The Built Environment, vol. 127, pp. 135–146, WIT Press, 2012.
[15] "Safety and efficiency in public transportation. Avoid accidents & infrastructure damage." Available: http://www.flir.com.au/traffic/display/?id=67430.
[16] Solaris Bus & Coach S.A., "Kierunek Tramino. Katalog produktowy 2017/2018," 09.2017. Available: https://www.solarisbus.com/public/assets/content/pojazdy/katalog/Tramino_PL.pdf.
[17] "Rozporządzenie Ministra Infrastruktury z dnia 2 marca 2011 r. w sprawie warunków technicznych tramwajów i trolejbusów oraz zakresu ich niezbędnego wyposażenia," Dziennik Ustaw Rzeczypospolitej Polskiej, no. 65, item 344, 2011.
[18] "Informacja MPK Poznań Sp. z o.o. o stanie taboru komunikacyjnego na dzień 1 stycznia 2018 r.," internal document.
[19] MPK Poznań, "Uwaga! Czołg na szynach." Available: http://www.mpk.poznan.pl/bezpieczenstwo/uzytkownicy-drog/190-torowisko-rozejrzyj-sie/551-uwaga-czlog-na-szynach.
[20] MPK Poznań, "Schemat sieci tramwajowej. Ograniczenia prędkości od 20 stycznia 2018 r.," internal document of MPK Poznań Sp. z o.o.
[21] K. Wach, "Jak hamuje tramwaj," InfoTram, 14.10.2011. Available: http://infotram.pl/jak-hamuje-tramwaj-_more_54019.html.
[22] "Ustawa z dnia 20 czerwca 1997 r. – Prawo o ruchu drogowym," Dziennik Ustaw Rzeczypospolitej Polskiej, no. 98, item 602, 1997.
[23] P. Soille, Morphological Image Analysis: Principles and Applications, Springer-Verlag, 1999, pp. 170–171.
[24] "FLIR Ex4 Infrared Thermal Camera." Available: https://www.flircameras.co.uk/flir-ex4-infrared-thermal-camera.html.


Feature matching and ArUco markers application in mobile eye tracking studies
Adam Bykowski, Szymon Kupiński
Poznan Supercomputing and Networking Center, Poznan, Poland

Abstract—This paper presents eye tracking glasses data anal-


ysis automation techniques, utilizing image processing. Two
separate techniques will be described. One method is used to
automate mapping of point-of-regard to a static reference image
using a feature matching algorithm AKAZE. The second method
utilizes ArUco markers for mapping of point-of-regard to a
screencast from a mobile device. The described methods are used
to aggregate experiment statistical data for future analysis and
presentation in forms like heatmaps or gaze plots. Algorithms
are implemented in Python 3.6 and OpenCV library.
I. INTRODUCTION

Eye tracking allows one to determine the point of eye focus on the basis of observation of the eye image and eye movements. In many areas, such as psychology, marketing and design, it is recognized that eye movement and gaze point help to track observer attention [1]. Until recently, experiments in the field of eye tracking enabled only the use of two-dimensional stimuli presented in a specially prepared laboratory on a computer screen with a remote eye tracker. Over the past few years, costs have decreased and the quality of head-mounted eye tracking systems has improved. Easy-to-use eye tracking glasses appeared (SMI, Tobii, PupilLabs, iMotions), allowing the user to perform eye tracking studies away from the computer in almost natural conditions for the test subjects [2].

Fig. 1. Tobii Pro Glasses 2 eye-tracking glasses [7].

Eye tracking glasses are built into a glasses frame. They are equipped with illuminating infrared light-emitting diodes (IR LEDs) and cameras recording the user's eyes. Besides, a high-resolution camera dedicated to scene recording is placed at the front of the glasses.

In eye-tracking studies utilizing mobile eye tracking glasses, participants are able to look around and move their heads freely, at their own pace, when looking at a stimulus. Therefore, in order to process specific types of experiment analysis, a certain mapping procedure becomes necessary. Typically, a procedure called semantic gaze mapping is performed. Semantic gaze mapping consists of mapping coordinates of events (for example, individual samples or visual fixations) registered by eye tracking glasses onto a common static reference image. So far, this routine has often been performed manually for the events of every participant in the experiment, using dedicated software, for example, SMI BeGaze. When the software returns all detected events, the researcher has to click the location of every single detected event in the reference image. The procedure is necessary in order to produce aggregated statistical data or visualizations like, for example, a heat map. However, performing this task is time-consuming and tedious, and its quality depends on the researcher's tidiness.

No free and/or open-source solutions capable of automatic gaze mapping have been found to date. However, there are commercial solutions such as Tobii Pro Lab [3] or iMotions Eye Tracking Software [4], which can perform semantic gaze mapping on a set of static reference images. To the best of our knowledge, there is no commercial solution capable of automatic mapping to a video screencast.

The goal of this work is the implementation of an automatic semantic gaze mapping algorithm onto a static reference image or video screencast, using feature matching techniques and fiducial markers. Two different solutions are presented. The first solution utilizes the AKAZE [5] feature matching algorithm; the second one uses ArUco [6] markers. The main effect of the mapping process is a list of coordinates expressed in the reference image (reference video) coordinate system with corresponding timestamp data, saved as a CSV (comma-separated values) file.

Algorithms were implemented in Python 3.6.2. They utilize external libraries such as OpenCV (version 3.3.1+contrib), numpy (1.13.3) and matplotlib (2.1.0). The AKAZE feature matching algorithm and ArUco marker functions are included in OpenCV.

II. MAPPING USING FEATURE MATCHING

The presented solution automates semantic gaze mapping onto a static reference image by utilizing the AKAZE [5] feature matching algorithm. Feature matching can be described as establishing a set of correspondences across many views of a particular scene.

The image features may be specific, "interesting" structures in the image, such as points, edges or objects. In our case, an image feature consists of a keypoint and a descriptor. The keypoint (interest point, local image feature, salient feature) is a highly distinctive location in the image, with an ability to preserve its characteristic properties after alteration by geometric transformations like rotations, translations, scaling, etc. The descriptor is, depending on the used algorithm, a vector of numerical or binary values describing keypoint properties.

A. Input data

Recordings were collected with Tobii Pro Glasses 2 eye tracking glasses. The collected data consisted of two one-minute videos acquired from two participants. The recordings were made outdoors, in two separate locations with a view of an upland landscape. In the videos, participants were told to freely look around at the landscape in front of them (an angle of about 180 degrees). The video from the front camera has a resolution of 1920 × 1080 pixels and the frame rate equals 30 frames per second. Gaze data was registered with a frequency of 50 samples per second.

In addition, high-resolution panoramic pictures of the landscape were taken; they are hereinafter referred to as "reference images".

Fig. 2. Reference image 1.

Fig. 3. Reference image 2.

Fig. 4. A sample view from the eye tracking glasses front camera; images from this camera are compared with the reference image (in this particular case with reference image 1).

B. Methods

A chosen static reference image (onto which coordinates will be mapped) is loaded. The AKAZE object responsible for feature detection is created using the cv2.AKAZE_create() function. This function takes as an argument a threshold value which adjusts the number of points that will be detected. Then the algorithm searches for keypoints in the image and assigns descriptors to them using the detectAndCompute() function from the AKAZE class, which returns the detected keypoints and their descriptors. In the experiments we set the threshold value to 0.0004, which in consequence gave us about 2500 and 3700 keypoints for the two panoramic images (with resolutions 3177 × 720 and 2304 × 599 px, respectively).

Then the following steps are performed for each individual frame from the eye tracking glasses video. At first, the current frame is loaded and the algorithm searches for keypoints in the image and assigns descriptors to them, just like for the reference image.

Then the detected features on the reference image and the current video frame are compared and matched. This aims to localize the same image details in both images. In order to do this, a brute-force matcher object is created with the function cv2.BFMatcher(cv2.NORM_HAMMING). Descriptor distance measurement is done using the Hamming distance, since AKAZE descriptors are binary. Then, for each descriptor from one set, the two best matches are searched with the matcher object method knnMatch() with parameter k=2. The two best matches are needed to compute the ratio criterion described in [8]. The ratio criterion assumes a match is correct if the quotient of the best match similarity to the second-best match similarity is below a certain threshold. Best effects occur when this value is set to about 0.7-0.8 [8]. This criterion prevents matching descriptors when there is no obvious best match. If the criterion is satisfied, two features are considered a valid match.

Having matched pairs of features, the homography estimation is performed. The goal of this stage is to find the homography matrix (with the form presented in equation (1)), which projects keypoints from the frame image to the reference image with minimal reprojection error. The process also takes into account the possible existence of outliers (incorrectly matched pairs). The process of estimating the homography may utilize methods such as RANSAC [9] or the Least Median of Squares (LMedS) [10] to estimate the best homography. When using RANSAC one has to determine a threshold value (the maximum allowed reprojection error to treat a point pair as an inlier), while using LMedS does not require any parameters. In the process, the RANSAC threshold value was set to 20 pixels.

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (1)

Fig. 5. Histogram of distance errors for the first video mapping.

Fig. 6. Histogram of distance errors for the second video mapping.

The obtained homography matrix is then used to compute the coordinates of the gaze point in the reference image coordinate system.

The process is repeated for all frames from the eye tracking glasses video. Finally, all mapped coordinates are exported to a specified CSV (comma-separated values) file and can be analyzed by other software.

C. Test results

The implemented algorithm's mapping accuracy was tested against commercial software. As the ground truth, the results from the automatic mapping done in the Tobii Pro Lab software were accepted.

In the used eye-tracking glasses recordings, 95% and 96% of samples were considered valid data. As stated in the Tobii Pro Lab User Manual [11], "It is very unusual to get 100%, as some samples will always be missing due to the participant blinking. Blinking usually causes around 5-10% data loss during a recording", so the used videos may be considered valid testing material.

Tobii's software was able to process 77.4% and 82.6% of the frames from the two videos; AKAZE was able to process all frames (including the frames processed by Tobii). For each frame where both AKAZE's and Tobii's mapped coordinates were available, the distance between them was computed using equation (2):

d = \sqrt{(x_a - x_t)^2 + (y_a - y_t)^2} \qquad (2)

where x_a, y_a and x_t, y_t are the coordinates mapped by AKAZE and Tobii's software, respectively.

Figures 5 and 6 show histograms of the distance errors for the first and second tested video, respectively. The horizontal axis shows the differences in mapping in pixels, while the vertical axis shows the frequency of a particular distance. The mean, median and standard deviation are shown in Table I.

TABLE I
ALGORITHM'S MAPPING ACCURACY, COMPARED TO MAPPING BY TOBII PRO LAB

Video | Reference image size [pixels] | Distance error [pixels]: mean / median / st. dev.
1     | 3177 × 720                    | 30.27 / 17.20 / 39.09
2     | 2304 × 599                    | 28.83 / 10.63 / 44.04

D. Discussion

The mean and median errors obtained from the tests are relatively small compared to the resolutions of the reference images (about 1% of the image width), so the mapped data preserves meaningful information in a general sense and can be used as a reliable source for further analysis. However, some samples are mapped with greater error values; these singular samples may cause some fixation and saccade detection to be incorrect (because the distance between consecutive samples is not preserved).

It should be mentioned that the inerrancy of Tobii's software mapping result as ground truth cannot be fully assured. In order to eliminate this issue, a comparison with other software and/or manual mapping may be conducted to provide a more credible evaluation of the algorithm's performance.

In general, many factors influence the overall quality of the mapping procedure, such as the quality (resolution) of the reference image, its "detail richness" (helpful for feature distinctiveness), similar light conditions and perspective, and obviously the correctness of the raw data acquired by the eye tracker. Considering these factors, it is not that simple to estimate the overall algorithm's efficiency; probably a broader range of tests (including different conditions) should be conducted for a complete analysis of the algorithm's performance.

AKAZE was chosen for its ready-to-use OpenCV implementation, satisfying performance and lack of a licensing fee, but alternatives such as SIFT [8], SURF [12], BRIEF [13], ORB [14], BRISK [15], KAZE [16] and others may also be considered in practical applications, depending on the user's requirements (such as speed, accuracy or licensing fees).
Assuming that head movement is continuous and relatively slow, it may also be worth involving information about the homography in previous frames when computing the homography for the current frame. However, this approach limits the possibility of full parallelization of the algorithm.
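The evaluation above boils down to per-frame Euclidean distances (equation (2)) and the summary statistics reported in Table I. A minimal sketch, where the function name and the input layout (one (x, y) pair per frame from each tool) are illustrative assumptions:

```python
import numpy as np

def distance_errors(akaze_xy, tobii_xy):
    """Per-frame Euclidean distance between AKAZE-mapped and Tobii-mapped
    coordinates (equation (2)), plus the Table I summary statistics."""
    a = np.asarray(akaze_xy, dtype=float)
    t = np.asarray(tobii_xy, dtype=float)
    d = np.hypot(a[:, 0] - t[:, 0], a[:, 1] - t[:, 1])
    return d, {"mean": d.mean(), "median": np.median(d), "st. dev.": d.std()}
```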

III. MAPPING USING ARUCO MARKERS


A. Introduction
Eye tracking experiments can make use of mobile devices, for example, smartphones or tablets. The location of the device screen in the camera video is not known a priori and changes in time (when the device is held by the user). The issue is to successfully locate the device's screen borders in order to express the point-of-regard location in the device's screen coordinate system.

There are some differences as opposed to mapping onto a static reference image. Depending on the application, using the aforementioned method would require changing the reference image (in this case the current device's screen) with every action from the user, or preparing a static reference image of the whole web page (which may be impossible for dynamically changing content). Furthermore, the quality of the video from the front eye-tracking glasses camera is often not sufficient to recognize the content of the screen (especially when the device's screen brightness is high relative to the environment - then it appears as a white rectangle).

At present, different approaches are known. [17] presented the utilization of markers placed on the screen, where the device is in a fixed position. [18] introduced the use of adaptive color detection techniques in order to track a colorful smartphone case.

Fig. 8. Left image: a view from the eye tracking glasses front camera; a sample ArUco marker is fixed above the smartphone; the rectangle presents the algorithm's estimation of the screen position and the circle indicates the gaze point. Right image: a synchronized frame from the recorded screencast.

Fig. 9. The position of the coordinate system origin, relative to the device's screen and the ArUco marker.

Fig. 7. Sample ArUco marker.

The presented solution utilizes ArUco [6] markers for the device's screen location. According to the documentation, "ArUco is an OpenSource library for camera pose estimation using squared markers".

B. Input data

The input data consists of eye-tracking glasses recordings, eye tracking sample data and mobile device screencasts. Also, physical ArUco markers were used. Layouts of markers were generated with the function cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50). They were converted to images with the function cv2.aruco.drawMarker().

C. Methods

The experiment requires the presence of the printed marker near the device, possibly in a plane with the screen. For smartphone screen tracking we decided to use a printed marker fixed above the screen.

Correct placement of the marker has a huge impact on correct coordinate mapping in the next steps. In our tests, we tried to use a marker with a size equal to the screen width in order to simplify calibration (horizontal displacement equal to almost zero).

The choice of the marker's size and number is crucial. A larger marker provides a smaller position estimation error, whereas a greater number of markers improves the reliability of the system (if certain markers cannot be detected due to occlusion) and allows using an averaged position estimation.

Importantly, the current implementation of ArUco marker detection in OpenCV detects markers only when they are completely visible. It is suggested to use several smaller markers instead of one large marker if it is not certain that the marker will be completely visible throughout the whole recording.

As an initialization, it is necessary to express the marker's corner locations in pixels, with the origin located, for example, in the screen's top-left corner.

The eye tracking glasses video and the mobile device screencast are synchronized manually and processed to have an equal frame rate.

The following steps are done for each pair of frames. The algorithm utilizes the cv2.aruco.detectMarkers() function from the OpenCV library in order to search for markers in the eye tracking glasses video frame; it returns a vector of detected marker corners, a vector of marker identifiers, and a vector of marker candidates (detected squares without a valid code). Combining the location of the corners in the current frame with the marker calibration information, it is possible to compute the homography (as for mapping with AKAZE) between the corners. The computed homography allows for localizing the screen borders as well as mapping the gaze point.

D. Test results

In order to estimate the quality of mapping with ArUco markers, the following test was performed. A user wearing Tobii Pro Glasses 2 eye-tracking glasses was asked to find some information using a web browser on his/her smartphone. The device had an ArUco marker attached to the top. The marker's location relative to the screen was measured and used as the algorithm's calibration parameter. The eye tracking glasses video was recorded for a minute and then analyzed. 60 frames from the video (one frame every second) were taken. For each frame, three screen corners (top-left, top-right and bottom-left) were manually pointed and their coordinates were saved. Three corners were used instead of four because, during natural smartphone usage, the thumb often occluded the fourth corner, so it was not possible to accurately point its location. On these frames, automatic ArUco marker detection was also performed. The detected marker corners were combined with the calibration data to estimate a homography matrix, which allowed computing the locations of the device's screen corners. The manually and automatically extracted point locations were compared to each other and the distance between them was computed for each pair.

The results are shown in Fig. 10. As can be seen on the histogram, the top-left and top-right corners were mapped with similar accuracy, with less than 15 pixels of error. The detection of the bottom-left corner has significantly worse accuracy.

Fig. 10. Histogram of error distances for the three screen corners.

E. Discussion

The top corners were mapped with a significantly lower error than the bottom one. The main reason for this observation is that the bottom-left corner is the mapped corner furthest from the marker; the further the point is from the marker, the greater the average error (the mapping error is proportional to the distance to the marker). This fact also applies to the mapping of the gaze point.

The key disadvantage of the current implementation is the fact that the marker has to be fully present in the video to be detected. In this type of scenario it is possible to hold the smartphone in a way that makes the marker only partially visible, which makes automatic marker detection impossible. Using many smaller markers may be considered as a solution, so that it is harder to occlude all markers at once.

An important factor is the correct placement of the marker. It should be placed in a plane with the device screen. Also, an accurate measurement of the relative distance between the marker and the screen is necessary to properly estimate the screen.

Another issue, especially important in eye tracking studies, is that placing a relatively large marker next to the device may be distracting for the users. In consequence, the marker's presence may influence using the device in a natural way and the returned gaze position data may be biased.

Summing up the above-mentioned disadvantages, for experiments involving smartphones, in some cases it may be preferable to use mapping with a colored case (as described in [18]) instead of a fiducial marker, as this approach does not involve calibration issues and appears more natural.

IV. CONCLUSION

Mobile eye trackers allow for quite natural interaction with the environment; however, this comes together with the challenge of collecting gaze data in reference to the participant's perspective instead of a common reference. The presented work shows methods for overcoming this problem and allows the aggregation of gaze data without restricting the head movements of the participants. The implemented algorithms are capable of working with any eye tracking glasses which provide a front camera video and corresponding gaze coordinate samples.

Remapped coordinates can be utilized to create visualizations which could not be made without the semantic gaze mapping stage, such as a heat map, or to provide statistics for predefined AOIs (areas of interest) on the reference image.

The presented solutions, as well as the indicated disadvantages and propositions of improvements, may serve as a support for further research in the mobile eye tracking field.

V. ACKNOWLEDGMENTS

The authors extend their gratitude to Ilona Potocka and Mateusz Rogowski from the Faculty of Geographical and Geological Sciences of the Adam Mickiewicz University for joint eye-tracking studies and their assistance in data collection.

REFERENCES

[1] A. T. Duchowski, Eye Tracking Methodology: Theory and Practice, 2nd ed. Springer, 2007. [Online]. Available: https://doi.org/10.1007/978-1-84628-609-4
[2] K. M. Evans, R. Jacobs, J. A. Tarduno, and J. Pelz, "Collecting and analyzing eye-tracking data in outdoor environments," vol. 5, 2012.
[3] "Tobii Pro Lab software." [Online]. Available: https://www.tobiipro.com/product-listing/tobii-pro-lab/
[4] "iMotions Eye Tracking Software." [Online]. Available: https://imotions.com/eye-tracking/
[5] P. F. Alcantarilla, J. Nuevo, and A. Bartoli, "Fast explicit diffusion for accelerated features in nonlinear scale spaces," in British Machine Vision Conference (BMVC), 2013.
[6] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez, "Automatic generation and detection of highly reliable fiducial markers under occlusion," Pattern Recognition, vol. 47, no. 6, pp. 2280–2292, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320314000235
[7] "Tobii Pro Glasses 2." [Online]. Available: https://www.tobiipro.com
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004. [Online]. Available: https://doi.org/10.1023/B:VISI.0000029664.99615.94
[9] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," in Readings in Computer Vision, M. A. Fischler and O. Firschein, Eds. San Francisco, CA: Morgan Kaufmann, 1987, pp. 726–740. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B9780080515816500702
[10] D. Simpson, "Introduction to Rousseeuw (1984) Least median of squares regression," in Breakthroughs in Statistics. Springer, 1997, pp. 433–461.
[11] "Tobii Pro Lab User Manual v.1.86 - en-US." [Online]. Available: https://www.tobiipro.com/siteassets/tobii-pro/user-manuals/Tobii-Pro-Lab-User-Manual/?v=1.86
[12] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1077314207001555
[13] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Computer Vision – ECCV 2010, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Berlin, Heidelberg: Springer, 2010, pp. 778–792.
[14] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 International Conference on Computer Vision, Nov. 2011, pp. 2564–2571.
[15] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in 2011 International Conference on Computer Vision, Nov. 2011, pp. 2548–2555.
[16] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, "KAZE features," in Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds. Berlin, Heidelberg: Springer, 2012, pp. 214–227.
[17] I. Giannopoulos, P. Kiefer, and M. Raubal, "GeoGazemarks: Providing gaze history for the orientation on small display maps," in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI '12). New York, NY, USA: ACM, 2012, pp. 165–172. [Online]. Available: http://doi.acm.org/10.1145/2388676.2388711
[18] L. Paletta, H. Neuschmied, M. Schwarz, G. Lodron, M. Pszeida, S. Ladstätter, and P. Luley, "Smartphone eye tracking toolbox: Accurate gaze recovery on mobile displays," in Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA '14). New York, NY, USA: ACM, 2014, pp. 367–368. [Online]. Available: http://doi.acm.org/10.1145/2578153.2628813


Adaptive methods of time-dependent crowd density
distribution visualization

Marianna Parzych, Tomasz Marciniak and Adam Dąbrowski
Institute of Automation and Robotics
Division of Signal Processing and Electronic Systems
Poznań University of Technology, Poland
Email: {marianna.parzych, tomasz.marciniak, adam.dabrowski}@put.poznan.pl

Abstract—The paper presents an analysis of crowd density visualization methods. The generated density maps take into account changes in time. Three methods have been implemented and tested. The first one uses motion detection based on background subtraction. The second one is based on the analysis of BLOBs (binary large objects). The third method uses interest points, i.e., points in the image that can be used to track the movement of an object. The tests were performed using the PETS2009 video sequence database. The obtained maps were evaluated and the time consumption was estimated.

I. INTRODUCTION

Determination of the size and density of a crowd in public spaces becomes increasingly important in video monitoring systems. It can be used as a part of larger surveillance systems: information about the density of the crowd may assist in the detection, tracking and analysis of the movement of persons. Density estimation is one of the basic problems of crowd analysis, while the density of a crowd is an important feature of the studied scene.

Conventional solutions for crowd density determination require the use of manual analysis - physical people counting by monitoring operators. Such methods are cumbersome and expensive, and thus impossible to operate in a large number of rooms over a long time. Manual counting of people on the basis of crowded images requires a lot of concentration and is very tedious. During the research, it was proven that after analysing tens of images, the level of focus and the effectiveness of manual counting drastically drop, while assessing changes in crowd density over just a one-day-long period may require hundreds or thousands of frames to be analysed.

An interesting and attractive option is the use of a CCTV (closed circuit television) system, since most commercial spaces are equipped with video monitoring. Manual observation can be supported by advanced systems for automatic analysis of video sequences. Automatic crowd density estimation is an issue studied by research teams who have presented various approaches to the topic [1], [2]. However, despite the fact that automated solutions are being developed, they estimate only instantaneous density values. The literature review has shown no work analysing changes in crowd density over time, which is particularly interesting because such solutions could be the most useful and have implementation potential.

II. RELATED WORK

This section presents solutions for crowd density estimation. In terms of the methods used, the solutions can be divided into five categories: pixel-based approaches, texture analysis, appearance-based approaches, points-of-interest grouping, and 3D reconstruction.

• Pixel-based approaches:
Pixel-based methods rely on very local features, such as the analysis of individual pixels obtained by background subtraction or edge detection. This approach has been used in [3], [4], and [5], where a manually established linear relationship between the number of pedestrian-classified pixels and the number of people was proposed. The authors of [6], [7] have used edge detection for crowd density estimation. Because the features are very low-level, this class allows for less accurate results, so it focuses on crowd density or people number estimation rather than on counting individuals. In contrast to detectors or trackers, pixel analysis allows for privacy preservation and maintaining the total anonymity of the people in the studied scene.
• Texture analysis:
Crowd density estimation can also be made by image processing based on texture analysis techniques. Despite the fact that this class uses features of a higher level compared to approaches based on pixels, it is mainly used to estimate the number of people in the scene rather than to identify individuals. The solutions using texture analysis are based on the assumption that images of low density crowds tend to present coarse textures, while images of very dense crowds tend to present fine textures. Different texture analysis features have been used in such systems: the grey-level dependence matrix (GLDM) [8], [9], the Minkowski dimension (MD) and the translation invariant orthonormal Chebyshev moments (TIOCM) [10], and the local binary pattern (LBP) [11].
• Appearance-based approaches:
Models that rely on appearance-level analysis try to identify individuals in a scene, based on the way a person appears in video footage. Some frameworks aim to detect the whole silhouette. However, there are occlusions in a crowded scene and the most occlusion-resistant part

of the human body is the head or the upper-body (head
and shoulder). The most popular descriptors for head or
upper-body detection are Haar-like features ([12], [13],
[14]) and histograms of oriented gradients (HoG) ([15],
[16], [17]). Appearance-based models tend to produce
more accurate results than pixel-level or texture analysis,
but identifying individuals is reliable mostly in lower-
density crowds.
• Points-of-interest grouping:
Solutions based on the detection and grouping of so-called points-of-interest, in contrast to the appearance-based ones, work well in dense crowds because it is easier to provide resistance to occlusion. The number of detected points and the information about their motion are used to estimate the number of people in the crowd. Different teams of researchers have recognized different types of points as best describing the human silhouette: Harris corners [18], speeded up robust features (SURF) [19], the scale-invariant feature transform (SIFT) [20], and features from accelerated segment test (FAST) [21].
• 3D reconstruction:
Solutions based on 3D reconstruction [22], [23] and depth analysis [24] are not very popular. During the review, only a few examples were found. This is probably due to the fact that existing CCTV systems are not usually equipped with devices that allow obtaining depth information. For this reason, the possibility of using such solutions in practice is limited.

Fig. 1: Density map showing the most popular shop areas

Fig. 2: The general processing scheme
Most of the solutions estimate only the density of the whole crowd present in the studied scene. Implementations that also examine the distribution of people in the scene (i.e. the studied area) are considerably rare ([9], [17], [21]). In this case, it is not only necessary to estimate the density of the crowd, but also to determine how it varies in different areas of the studied scene.

Interestingly, the literature review has shown no work analysing changes in crowd density over time. All discussed solutions estimate only instantaneous density values.

III. GENERAL APPROACH

Our method for estimation of crowd density is based on an alternative approach [27]. The lack of solutions enabling the observation of changes in the density of the crowd and its distribution over time was noted. Therefore, we decided to use information about the location and motion of people to generate maps that show the crowd density per time unit, averaging the instantaneous density over the individual frames. The developed maps were to enable observation of the crowd distribution in the tested room. In this case, an additional issue is the development of effective means of visualization, which allows intuitive interpretation of the results.

A. Density Maps for Crowd Analysis

A density map is a graphical representation of data, where each color corresponds to the value assigned to it. The use of colors facilitates the interpretation of numerical data, which is why many different color scales are used, adapting them to a given task. The most common is the scale in which cool colors (blue, green) correspond to small numbers, and warm colors (red, yellow) are assigned to the largest values.

Density maps are used in many different fields. Developers create maps of websites depicting the areas most frequently clicked or viewed the longest by users. Marketing also reaches for this solution in order to present the results of research, for example, the areas of advertising on which users focused their eyes the most.

Vision systems use density maps to visualize store areas with greater crowd density (Fig. 1 [26]) and for the visualization of car or human traffic.

B. Proposed Solution

It was decided to visualize changes in crowd density using density maps. In order to ensure the privacy of people in the studied area, it was decided to use simple features based on motion analysis in the recording to generate the density maps. Features are accumulated in the accumulation matrix over the selected period τ. The accumulation is made with an adaptive resolution depending on the size and type of features, and it is variably scaled to improve the dynamic range of the result [28]. Based on the obtained accumulation matrices, the set of density maps showing the changes in crowd distribution and density is generated. The processing steps are described in detail in the following section. The general processing scheme is shown in Fig. 2.
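The accumulate-scale-colormap pipeline described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function names and the toy 4 × 4 mask are ours, and in practice the foreground mask would come from a GMM background subtractor (e.g. OpenCV's createBackgroundSubtractorMOG2) and the final coloring from cv2.applyColorMap.

```python
import numpy as np

def accumulate(acc, fg_mask):
    """Add a binary foreground mask (1 = moving pixel) into the 16-bit
    accumulation matrix, as in the motion-detection variant."""
    acc += fg_mask.astype(np.uint16)
    return acc

def render_map(acc, gamma=0.5):
    """Scale the 16-bit accumulation matrix to an 8-bit image with gamma
    contrast shaping; a color map (e.g. cv2.applyColorMap with
    COLORMAP_JET) would then turn this into the final density map."""
    rng = acc.max() - acc.min()
    norm = (acc - acc.min()) / (rng if rng else 1)  # values in [0, 1]
    return np.round(255.0 * norm ** gamma).astype(np.uint8)

# Toy usage: two frames with motion only in the top-left region
acc = np.zeros((4, 4), dtype=np.uint16)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :2] = 1
for _ in range(2):
    acc = accumulate(acc, mask)
img = render_map(acc, gamma=0.5)
```

Calling render_map once every τ frames reproduces the idea of one density map per time unit; a γ exponent below 1 expands the low-count range, which matters because most accumulation values are small.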
IV. DENSITY DISTRIBUTION ANALYSIS
A. Density Map Generation based on Motion Detection
The first of the proposed solutions for the visualization of crowd density is based on motion detection using background subtraction. The chosen background estimation method is based on a Gaussian mixture model (GMM) with shadow detection [29]. A morphological operation is used to remove noise from the received foreground masks. The next step is to create an accumulation matrix by sequential addition of the foreground masks.
The accumulation image obtained in this way is difficult to interpret. Depending on the number of people and the amount of motion in the recording, as well as on the value of τ determining the length of the time interval for which the map is generated, the values in the accumulation matrix may not be sufficiently differentiated. It was therefore decided to use contrast enhancement. Since not all values of the accumulation image are equally significant for the interpretation of the results, linear contrast enhancement is not optimal. It is required to apply and adjust the γ parameter, which describes the shape of the curve relating the values before and after the enhancement. A color map is applied to the prepared accumulation image in order to obtain an intuitive map of the crowd density and its distribution in the examined space. The processing steps are presented in Algorithm 1.

Algorithm 1 Density map generation - motion detection
  while current frame index i < τ do
    load the current input frame
    perform motion detection
    update the accumulation matrix:
      for each cell (row, column) in accumulation matrix M do
        if foreground mask f(row, column) ≠ 0 then
          M(row, column) ← M(row, column) + 1
        end if
      end for
  end while
  generate density map:
    scale 16-bit accumulation matrix to 8-bit image
    perform contrast adjustment with selected γ parameter
    apply the color map

B. Density Map Generation based on BLOB Detection
Another solution for the time-dependent crowd-density visualization is based on the analysis of BLOBs (binary large objects). The motion detection from the previous section is extended with BLOB detection. The accumulation matrix is then updated with a variable resolution, selected adaptively based on the characteristics of a given BLOB. An intuitive assumption was made that the larger the area, the more people the BLOB contains; such a BLOB therefore increases the crowd density more than a small-area BLOB. The accumulation matrix is thus increased not by a fixed value, as was the case for the method based on motion detection, but by a value d equal to the number of BLOB perimeter pixels. The accumulation matrix update resolution is BLOB width by BLOB height.
The density map is then generated from the obtained accumulation image by the application of the color map. The processing steps of the density map generation based on BLOB detection are shown in Algorithm 2.

Algorithm 2 Density map generation - BLOB detection
  while current frame index i < τ do
    load the current input frame
    perform motion detection
    perform BLOB detection
    for each BLOB detected do
      update the accumulation matrix:
        d ← BLOB perimeter pixels number
        update resolution (rows, cols) ← BLOB width, BLOB height
        for each cell (row, column) in accumulation matrix M(rows, cols) do
          if BLOB b(row, column) ≠ 0 then
            M(row, column) ← M(row, column) + d
          end if
        end for
    end for
  end while
  generate density map:
    scale 16-bit accumulation matrix to 8-bit image
    apply the color map

C. Density Map Generation with the use of Interest Points
The methods described above are based on the analysis of the whole image and the background mask. A third method of density map generation was developed, in which the whole frame is not analyzed, but only the characteristic points detected in it, called interest points. Interest points are points in the image that characterize an object and can be used to track the movement of this object. There are many methods to detect good features (points) to track; in this work it was decided to use the corners detected by the Shi-Tomasi method [30]. Depending on the recording parameters, the point detection has to be repeated after a chosen number of frames. The motion of the points is then analysed with an optical flow algorithm based on the well-known Lucas-Kanade implementation [31]. The coordinates of the moving points are stored and the accumulation matrix is updated for every point. The adaptive resolution of the accumulation is adjusted based on the motion values. The accumulation matrix update resolution is r by r, where r is described by the formula

r = √(u² + v²) / 2

where u is the value of the point motion along the x axis, and v is the value of the point motion along the y axis. It was assumed that the smaller the value of the motion, the more time a person spent in a given place. This means that it had a large impact
on the time-dependent crowd density, so the corresponding cells of the accumulation matrix are increased by a value d inversely proportional to the amount of movement of the point:

d = 100 · (1/r)

The obtained accumulation image does not allow for a visually satisfying map, therefore additional operations are used to enhance it. First, the matrix is multiplied by a scalar value k. Next, it is dilated with an ellipse structural element of a size proportional to the frame size. Then a simple low-pass filter with a 2 × 2 px mask is used to blur the image. An opening operation is used to remove noise and artifacts. The last step is the application of the color map. The processing steps of the crowd density map generation are presented in Algorithm 3.

Algorithm 3 Density map generation - interest points
  while current frame index i < τ do
    load the current input frame
    if i mod β = 0 then
      perform Shi-Tomasi corner detection
    end if
    perform points motion analysis with optical flow (u, v)
    for each point detected do
      update the accumulation matrix:
        r ← √(u² + v²) / 2
        d ← 100 · (1/r)
        update resolution (rows, cols) ← point center coordinates ± r
        for each cell (row, column) in accumulation matrix M(rows, cols) do
          M(row, column) ← M(row, column) + d
        end for
    end for
  end while
  generate density map:
    scale 16-bit accumulation matrix to 8-bit image
    multiply obtained image k times
    perform dilation operation
    apply blur
    perform opening operation to remove noise
    apply the color map

V. RESULTS
The developed methods were tested using the PETS2009 database [32]. Example maps were generated on the basis of 3 recordings with different parameters. The recording parameters are shown in Table I.
The selected data belong to the set "People Tracking". It was decided to use it instead of the dedicated set "Person Count and Density Estimation" because it better shows the effects of the visualization. Recordings from the "Person Count and Density Estimation" set show clustered groups of people moving along fixed paths. Most of the frames from these recordings present a large group of people occupying the whole road area - the location of the crowd does not change during the recording, so the data are not suitable for testing visualizations showing changes in the crowd distribution in the studied area. Recordings from the "People Tracking" set present groups of people of various sizes, moving freely within the whole frame, so they are better suited as test material for the developed solution. Density maps obtained with the use of the three proposed methods are shown in Fig. 3.
It can be seen that although the used recordings present groups of people with different densities (see Table I), the range of map colors is always the same. This is due to the fact that the maps represent the areas with the smallest and largest relative crowd density. The methods are used for visualization, not to accurately determine the number of people in the recording or to classify the crowd density.
VI. TIME PARAMETERS OF PROPOSED METHODS
Tests were carried out to check the time parameters of the developed algorithms. Density maps based on recordings from the PETS database were generated. The recording resolution was modified to check its effect on the processing speed. The original resolution of the recordings is 768 × 576 pixels; tests were carried out on recordings scaled by the factors 0.3, 0.6, 1, and 1.3. The results are presented in Table II.
The time needed to generate the maps from the accumulation matrix is in each case a few seconds. It can also be noticed that there is no dependence between the resolution of the images and the time needed to process them. Although this time seems long, it should be remembered that the maps are generated not after every frame of the recording, but once every τ frames. During the tests, the τ parameter was equal to the number of frames in the recording, so the maps were generated only once for each input video. For this reason, limiting the time needed to generate the maps is not a critical issue.
The average processing time of a single frame (feature detection and accumulation matrix update) is correlated with the image resolution: for each method, the time needed for the analysis increases with increasing resolution. It is easy to speed up the algorithms by reducing the recording resolution, which does not adversely affect the results of their operation (see Fig. 4).
Another way to limit the processing time is to manipulate the β parameter of the interest-points method. This parameter determines how often the points should be updated. In the case of the PETS database it was β = 10, because due to the distance and angle of the camera people often came in and out of the frame. However, for recordings with a different view the parameter can take much larger values. Figure 5 shows the effect of the parameter β on the average processing time for one frame (Recording 1 with the original 768 × 576 px resolution).
VII. CONCLUSION
Depending on the method used, the resulting images, i.e. the density maps, have different characters. In the case of the method based on the interest points, the maps present the most details. Nevertheless, the two other methods are also attractive,
TABLE I: Descriptions of test recordings from PETS2009

| Recording number | Set | Time | Frames number (τ) | View | People number | Density level | Description |
| 1 | People Tracking | 12:34 | 794 | 001 | 3-9 | low | Single people moving in different directions on roads and grass. They pass, meet and separate, stop and move. |
| 2 | People Tracking | 14:41 | 239 | 001 | 3-41 | high | Single people first moving, then standing on road and grass. Large group of people moving on the road in one direction. |
| 3 | People Tracking | 14:55 | 435 | 001 | 10-35 | high | Single people moving in different directions on roads and grass. Groups moving on the road, first in one, then in another direction. |
Fig. 3: Sample visualization results: (a)-(c) Recording 1, (d)-(f) Recording 2, (g)-(i) Recording 3; in each row, from left: motion detection based map, BLOB detection based map, interest points based map
as they allow a quick density assessment. The processing times depend on the image resolution and are similar for large resolutions. The performed analyses showed that the resolution can be changed without deterioration of the resulting image quality.
ACKNOWLEDGMENT
This work was supported by the DS/2018 funds.
REFERENCES
[1] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin and Li-Qun Xu, Crowd analysis: a survey, Machine Vision and Applications, vol. 19, iss. 5-6, pp. 345-357, 2008.
[2] J. Silveira Jr, S. R. Musse and C. R. Jung, Crowd Analysis Using Computer Vision Techniques: A survey, IEEE Signal Processing Magazine, vol. 27, iss. 5, pp. 66-77, 2010.
[3] A. Davies, J. Yin and S. Velastin, Crowd monitoring using image processing, IEE Electronics & Communication Engineering Journal, vol. 7, no. 1, pp. 37-47, 1995.
TABLE II: Time parameters of proposed methods (columns 2-4: average processing time of one frame [ms]; columns 5-7: average time of density map generation [ms], performed once every τ frames)

| Resolution of input [px] | Motion detection | BLOB detection | Interest points | Motion detection | BLOB detection | Interest points |
| 230 × 172 | 23 | 23 | 11 | 5145 | 3515 | 2844 |
| 460 × 345 | 84 | 84 | 57 | 5224 | 1277 | 1620 |
| 768 × 576 | 222 | 215 | 225 | 9700 | 4487 | 3829 |
| 998 × 748 | 388 | 389 | 410 | 4391 | 1302 | 1429 |
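The per-frame timing reported in Table II can be reproduced with a simple wall-clock protocol; below is a minimal sketch in which the helper name and the toy thresholding workload are ours, standing in for the actual feature detection and accumulation-matrix update.

```python
import time

import numpy as np

def mean_frame_time_ms(process, frames):
    # Average per-frame processing time in milliseconds over a sequence,
    # i.e. the per-method quantity listed in Table II.
    t0 = time.perf_counter()
    for frame in frames:
        process(frame)
    return 1000.0 * (time.perf_counter() - t0) / len(frames)

def toy_update(frame):
    # Stand-in workload: binarize the frame as a fake foreground mask
    # and reduce it, so the cost grows with the frame resolution.
    return (frame > 128).astype(np.uint16).sum()

# Frame sizes corresponding to scale factors 0.3 and 1.0 of 768 x 576 px
small = [np.random.randint(0, 256, (172, 230), dtype=np.uint8) for _ in range(20)]
full = [np.random.randint(0, 256, (576, 768), dtype=np.uint8) for _ in range(20)]
ms_small = mean_frame_time_ms(toy_update, small)
ms_full = mean_frame_time_ms(toy_update, full)
```

On this toy workload the full-resolution frames should take noticeably longer per frame, mirroring the trend in the left half of Table II.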
Fig. 4: Visualization results for Recording 1 with 0.33 of the original resolution (230 × 172 px): (a) motion detection based map, (b) BLOB detection based map, (c) interest points based map

Fig. 5: Diagram showing the relation between the processing time and the β parameter
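The per-point accumulation rule of the interest-points method (r = √(u² + v²)/2, d = 100 · (1/r), update over a ±r neighbourhood, with points re-detected every β frames) can be sketched as follows. This is a minimal NumPy sketch with our own function name, a float accumulator, and simple border clipping; the paper itself uses a 16-bit matrix, Shi-Tomasi corners (e.g. cv2.goodFeaturesToTrack) and Lucas-Kanade flow (e.g. cv2.calcOpticalFlowPyrLK).

```python
import numpy as np

def update_interest_point(acc, x, y, u, v):
    # One accumulation update for a tracked point at (x, y) that moved
    # by (u, v) pixels between frames.
    r = np.hypot(u, v) / 2.0          # update radius: r = sqrt(u^2 + v^2) / 2
    if r == 0.0:                      # stationary point: skip (guard is ours)
        return acc
    d = 100.0 * (1.0 / r)             # slow points contribute more
    ri = int(np.ceil(r))
    h, w = acc.shape                  # clip the +/- r window to the image
    acc[max(0, y - ri):min(h, y + ri + 1),
        max(0, x - ri):min(w, x + ri + 1)] += d
    return acc

acc = np.zeros((10, 10), dtype=np.float64)
acc = update_interest_point(acc, x=5, y=5, u=1.2, v=0.9)  # slowly moving point
```

Raising the re-detection interval β reduces how often the more expensive corner detection runs, which is the trade-off illustrated in Fig. 5.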

[4] J. Yin, S. Velastin and A. Davies, Image processing techniques for crowd density estimation using a reference image, Proceedings of 2nd Asia-Pacific Conference on Computer Vision, pp. 122-129, 1995.
[5] R. Ma, L. Li, W. Huang and Q. Tian, On pixel counts based crowd density estimation for visual surveillance, Conference on Cybernetics and Intelligent Systems, pp. 170-173, 2003.
[6] P. Karpagavalli and A. Ramprasad, Estimating the Density of the People and counting the number of People in a Crowd Environment for Human Safety, International Conference on Communication and Signal Processing, pp. 663-667, 2013.
[7] C. Regazzoni, A. Tesei and V. Murino, A real-time vision system for crowd monitoring, International Conference on Industrial Electronics, Control, and Instrumentation, vol. 3, pp. 1860-1864, 1993.
[8] A. Marana, S. Velastin, L. da Costa and R. Lotufo, Estimation of crowd density using image processing, IEE Colloquium on Image Processing for Security Applications (Digest No: 1997/074), pp. 11/1-11/8, 1997.
[9] X. Wu, G. Liang, K. Lee and Y. Xu, Crowd Density Estimation Using Texture Analysis and Learning, International Conference on Robotics and Biomimetics, pp. 214-219, 2006.
[10] H. Rahmalan, M. Nixon and J. Carter, On Crowd Density Estimation for Surveillance, IET Conference on Crime and Security, pp. 540-545, 2006.
[11] H. Fradi, Xuran Zhao and J. L. Dugelay, Crowd density analysis using subspace learning on local binary pattern, International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1-6, 2013.
[12] I. Ali and M. Dailey, Multiple Human Tracking in High-Density Crowds, Advanced Concepts for Intelligent Vision Systems: 11th International Conference, pp. 540-549, 2009.
[13] R. Xu, Y. Guan and Y. Huang, Multiple Human Detection and Tracking Based on Head Detection for Real-time Video Surveillance, Multimedia Tools Appl., vol. 74, no. 3, pp. 729-742, 2015.
[14] M. Jones and J. Snow, Pedestrian detection using boosted features over many frames, International Conference on Pattern Recognition, pp. 1-4, 2008.
[15] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.
[16] M. Patzold, R. Evangelio and T. Sikora, Counting people in crowded environments by fusion of shape and motion information, International Conference on Advanced Video and Signal Based Surveillance, pp. 157-164, 2010.
[17] M. Rodriguez, I. Laptev, J. Sivic and J.-Y. Audibert, Density-aware person detection and tracking in crowds, International Conference on Computer Vision, pp. 2423-2430, 2011.
[18] A. Albiol, M. Silla, A. Albiol and J. Mossi, Video analysis using corner motion statistics, IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 31-38, 2009.
[19] S. Riachi, W. Karam and H. Greige, An improved real-time method
for counting people in crowded scenes based on a statistical approach,
International Conference on Informatics in Control, Automation and
Robotics (ICINCO), pp. 203-212, 2014.
[20] H. Fradi and J. Dugelay, People counting system in crowded scenes based on feature regression, 20th European Signal Processing Conference (EUSIPCO 2012), pp. 136-140, 2012.
[21] H. Fradi and J.-L. Dugelay, Crowd Density Map Estimation Based on
Feature Tracks, IEEE 15th International Workshop on Multimedia Signal
Processing (MMSP), pp. 40-45, 2013.
[22] P. Gardziński, K. Kowalak, Ł. Kamiński and S. Maćkowiak, Crowd Den-
sity Estimation Based on Voxel Model in Multi-View Surveillance Systems,
International Conference on Systems, Signals and Image Processing, pp.
216-219, 2015.
[23] C. Stahlschmidt, A. Gavriilidis and A. Kummert, Density Measurements
from a Top-View Position using a Time-of-Flight Camera, Proceedings of
the 8th International Workshop on Multidimensional Systems, pp. 1-6,
2013.
[24] G. Xiong, X. Wu, J. Cheng, Y. Chen, Y. Ou and Y. Liu, Crowd
density estimation based on image potential energy model, International
Conference on Robotics and Biomimetics, pp. 538-543, 2011.
[25] Heat maps for eyetracking [online]
http://symetria.pl/blog/artykuly/mapy-cieplneheat-maps-w-badaniu-
eye-tracking-przydatne-wyniki-czy-zawsze/ access on: 16.05.2018.
[26] M. Parzych, A. Chmielewska, T. Marciniak, A. Dąbrowski, A. Chrostowska, M. Klincewicz, Automatic people density maps generation with use of movement detection analysis, International Conference on Human System Interaction, pp. 26-31, 2013.
[27] M. Parzych, A. Chmielewska, T. Marciniak, A. Dąbrowski, New approach to traffic density estimation based on indoor and outdoor scenes from CCTV, Foundations of Computing and Decision Sciences, vol. 40, no. 2, pp. 119-132, 2015.
[28] U. Meyer-Gawron, Generation of density maps for moving objects in
video recordings (Master thesis), 2017.
[29] P. KaewTraKulPong and R. Bowden, An improved adaptive background mixture model for real-time tracking with shadow detection, Proc. 2nd European Workshop on Advanced Video-Based Surveillance Systems, pp. 1-5, 2001.
[30] J. Shi and C. Tomasi, Good Features to Track, Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 593-600,
1994.
[31] B. Lucas, T. Kanade, An Iterative Image Registration Technique with an
Application to Stereo Vision, Proc. of 7th International Joint Conference
on Artificial Intelligence (IJCAI), pp. 674-679, 1981.
[32] PETS2009 Benchmark Data, [online]
http://www.cvg.reading.ac.uk/PETS2009/a.html
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th-21st, 2018, Poznań, POLAND

Automatic recognition of image details using stereovision and 2D algorithms

Julian Balcerek, Mateusz Łuczak, Paweł Pawłowski, Adam Dąbrowski

Division of Signal Processing and Electronic Systems, Institute of Automation and Robotics,
Poznan University of Technology, Poland
{julian.balcerek, pawel.pawlowski, adam.dabrowski}@put.poznan.pl

Abstract—This paper presents a utilization of stereovision in automotive applications and video monitoring. The two video channels of a stereovision system offer a new quality of recognition of image details. Even with popular object recognition algorithms, stereovision significantly improves the analysis plausibility. Preliminary experiments with a specially prepared 3D video test database show an improvement of the recognition effectiveness by 10-30% in comparison to a single video channel.

Keywords—stereovision; automotive; video monitoring; image processing; detail recognition; advanced driver assistance systems; object detection.

I. INTRODUCTION
Stereovision is typically understood as the acquisition and processing of images from two cameras located on the same horizontal axis. It has been used in many areas for over 100 years. As opposed to monovision, two image channels provide additional information which introduces a three-dimensional (3D) effect. Stereovision is typically used for the estimation of the distance from the camera to objects in the scene. To perform it, the corresponding points in both stereo images must be found. The distance between these points gives information about the depth (typically presented as the so-called depth map).
Besides the 3D effect and the distance estimation, stereovision can also be utilized to help recognize chosen image details. The detection algorithms use the disparity information contained in the depth map, or both the depth map and one or two original images. In this work we consider the use of stereovision in two significant areas of modern technology: the automotive industry and video monitoring.
Stereovision in the automotive industry is typically used in the so-called advanced driver assistance system (ADAS). ADAS is equipped with human-machine interfaces which help drivers to improve traffic safety and to get better convenience of traveling. Up to now, several car manufacturers have introduced solutions based on two cameras on the market. ADAS supported by a stereo camera are supposed to be a part of autonomous cars [1]. Stereo cameras are typically mounted behind the front windscreen at the level of the driver's eyes (e.g. on both sides of the rear-view mirror). Additionally, placing the cameras behind the windscreen protects them from damage and dirt.
Since 2013 the EyeSight technology has been available in cars from Fuji Heavy Industries, Ltd. (now formally known as Subaru Corporation). This system uses a stereo camera with a distance between the lenses equal to 350 mm. It can detect and distinguish vehicles, motorcycles, bicycles, pedestrians, and lane markers. It is specially tuned for collision avoidance, pre-collision steering assistance and adaptive cruise control. The driver is assisted with audible warnings, indicator lights, and finally by braking and steering wheel control. The system offers many scenarios of reaction, e.g. if an object of interest in front of the car is detected by the anticipatory system, the vehicle is slowed down or brought to a stop. If it is detected that the driver has failed to take evasive action, the brakes are applied automatically. If insufficient braking input of the driver is detected, the braking force is automatically increased. If the vehicle is not moving while the vehicle in front has taken off (e.g. on a green light), a warning sound and a flashing light are produced. It is also detected if the vehicle unintendedly sways in the lane while the speed is 60 km/h or more, as well as if the vehicle begins to stray into the next lane at high speed. The presented functions are activated when the speed difference between the car and a leading car or an object is less than 50 km/h (35 km/h for a pedestrian). However, some unfavorable conditions, like bad weather, may disturb the correct behavior of the system [2].
A similar system, called the stereo multi-purpose camera (SMPC), has been mounted since 2013 in cars by Mercedes-Benz (a division of the company Daimler AG). Two cameras with a distance between the lenses equal to 220 mm provide a 3D view of the area up to 50 meters in front of the vehicle [3]. The introduced video processing algorithms can detect and carry out spatial classification of vehicles that are driving ahead, oncoming or crossing, as well as of pedestrians and traffic signs. Stereovision is also used in various assistance systems: in the Brake Assist system (BAS Plus) with Cross-Traffic Assist, in the advanced cruise control system (Distronic Plus) with Steering Assist, and in the active body control system (Magic Body Control) utilizing the road profile measured by a road-sensing system (Road Surface Scan). BAS Plus with Cross-Traffic
Assist are based on the amalgamated data from the two channels of the stereo camera and the radar sensor system. They can help to avoid not only rear-end collisions with vehicles directly in front, but also imminent crashes with cross traffic at crossroads. If a hazardous situation is detected, visual and acoustic warnings are activated. In case the driver presses the brake pedal too tentatively, the brake fluid pressure is automatically boosted for effective emergency braking. The Cross-Traffic Assist is active at speeds up to 72 km/h. In the Distronic Plus with Steering Assist, the basic radar-based function of keeping the vehicle at the desired distance from another vehicle in front is enhanced by the addition of the Steering Assist, which helps the driver to stay centered in the lane by generating the appropriate steering torque when travelling on a straight road or in gentle bends. Lane markings and vehicles driving ahead are recognized using the stereo camera, and this information is relayed to the electric steering assistance system. The lane is kept at speeds up to 200 km/h, even under adverse weather conditions. While driving at slow speeds, e.g. in congested traffic, the vehicle ahead can be used by the Steering Assist as the means of orientation, even when no clear lane markings are visible [4]. The Magic Body Control with Road Surface Scan is used to scan the road surface using the stereo camera. Additionally, for better convenience of traveling, the shock damping at each wheel may be adjusted to absorb imperfections of the road detected by the 3D vision [5].
In 2016 Suzuki Motor Corporation introduced to the market the Dual Camera Brake Support (DCBS). In this system the distance between the camera lenses is 160 mm. Based on the calculated distance between objects and the car, as well as their shapes and sizes, the system detects and distinguishes vehicles, pedestrians, and lanes on the road [6].
Also in 2016, Daihatsu Motor Co., Ltd. introduced the stereo camera driving assistance system called Smart Assist III. The distance between the right and left lenses of the stereo camera is much smaller and equals 80 mm. Thus, the size of the stereo camera is reduced and it may be used in so-called kei-cars (mini cars). This also proves that even a relatively small lens spacing, close to the spacing of human eyes (65 mm), can support ADAS. The Smart Assist III system enables avoiding vehicle-to-vehicle and vehicle-to-pedestrian collisions. If the relative speed to the pedestrian is in the range of 4-30 km/h, the collision avoidance function is performed. When the relative speed is too high to avoid the collision (i.e. 30-50 km/h), the system reduces accident damage [7].
Usage of stereo cameras in automotive applications is a significant field of scientific research [8-13]. Unfortunately, detailed technical descriptions of the commercial systems are unavailable, with the single exception of Daimler AG [5].
In the second area of interest, i.e. video monitoring, the application of stereovision is still developed to a lesser extent. Stereovision in surveillance systems allows estimating the 3D coordinates of the objects of interest. The positioning can be performed using fixed reference points with known 3D coordinates [14]. Sequences recorded in two video channels provide much more information, primarily related to the volume of the object, than images made with monoscopic techniques [15].
An example of a more specific application in video monitoring is the attempt at gait recognition by silhouette extraction of a moving person with the use of a stereo camera and the calculated depth map [16]. A 3D vector reflects the changes of the moving silhouette in time [17]. Lines of the same intensity allow creating a 3D body model [18]. It should be emphasized that, to obtain quality results, the algorithms require a good match between both stereo images [19].
The paper is organized as follows. After the introduction in this section, the authors' stereovision approach to 2D recognition tasks is presented in Section II. Fundamentals and shortcomings of stereovision, important for further analyses, are discussed in Section III. Selected video processing algorithms are presented in Section IV. Section V describes the prepared database of stereo images. In Section VI results of experiments on recognition of image details are presented. The last Section VII contains final remarks.
II. STEREOVISION APPROACH TO 2D RECOGNITION TASKS
It is important to notice some limitations of stereovision. When the view has many plans, i.e. many objects occur at various distances from the camera, an object in one image may be invisible or partially obscured; therefore it is impossible to calculate the full depth map using the dependences between the two views or to correctly recognize all objects using typical image processing algorithms. In further considerations, we assume just such hard cases. Moreover, in these cases, simple 2D-to-3D conversion methods dedicated to observation and monitoring systems cannot be directly adapted: they fill the information gaps in less important areas, such as the background and peripheral objects, and do not operate on important objects of interest [20-22]. On the other hand, we cannot reject the two-channel information, at least during preprocessing, because we can ipso facto reject significant information. Moreover, in the literature there are many more algorithms for processing monoscopic than stereoscopic images.
Taking into account all the above facts, we propose an application of selected typical monoscopic algorithms to stereovision images for selected cases from an urban environment. Such an approach is somewhere between plain 2D image processing and stereovision. The authors' approach, called "double 2D recognition", is shown graphically in Fig. 1 in relation to other, standard approaches. In fact, the system can evaluate the quality of the acquired images and choose the working mode. In cases with occluded objects it may work in the double 2D mode, otherwise in the plain 3D recognition mode. Choosing one of the processing modes may use the region-based multiple description coding scheme (RB-MDC) for the multiview video and the depth. In this algorithm a virtual right view is synthesized from the left view and the depth. Based on the virtual right view, the original right view is classified into one of three classes: disoccluded (the pixels in the right view that cannot be rendered from the left view), illumination-affected (the regions in the right view that can be rendered from the left view but only with low quality), or the remaining regions (which can be rendered from the left view with a sufficient level of quality) [23]. However, the solution of the switching problem is not in the scope of this paper.

Figure 1. Approaches to detail recognition based on monoscopic and stereoscopic cameras

It should also be noted that the presented idea of double 2D recognition comes from the analysis of two images by a human. The analogy is that a man sees and analyzes all the details from the two eye views, even those visible from the perspective of only one individual eye.
III. FUNDAMENTALS AND SHORTCOMINGS OF STEREOVISION
Typical algorithms of stereovision image processing are based on depth maps. The depth is calculated based on the differences between the images presenting the two views: the left and the right view. In the beginning, the images must be matched to find the corresponding points. Then, the differences in the horizontal coordinate of the same point of an object in the scene between the left and right views are taken into account. A graphical interpretation of the distance calculation using the differences in coordinates for a given point on the object is shown in Fig. 2a. Point cl is the viewpoint (center) of the left camera and point cr is the viewpoint (center) of the right camera, while tx is the distance between the cameras. A point p of a scene object, at the distance Z from the camera plane, is projected onto the image planes of the two cameras with the focal length f at pixels with horizontal coordinate shifts xl for the left image and xr for the right image [22, 24].
From the affine transformation, the following formula is derived [22]:

xl − xr = tx · f / Z    (1)

The expression (1) can be transformed to the form

Z = tx · f / (xl − xr)    (2)

where the distance Z is calculated using the difference between the views, which is equal to xl − xr.
For the depth map preparation using the known metric values of the distances Z from the camera lens to the 3D scene objects, the following expression is used

v = round( ((Z − Zmax) / (Zmin − Zmax)) · (2^N − 1) ),  v ∈ ⟨0, 2^N − 1⟩    (3)

where Zmax is the distance from the cameras to the point of the farthest background plane and Zmin is the distance from the cameras to the point of the nearest foreground. The resulting v is the pixel depth value of the point of interest in the depth map [25, 26].

Figure 2. Parallel configuration of cameras used for the depth map generation with: (a) unrestricted two-camera view of a point of the scene, (b) the view of a point of the scene restricted by a foreground object for one camera; based on illustrations from [22, 24]

Finally, using formula (2), the formula (3) can be rewritten to the form:
    v = round[ (2^N − 1) (t_x f / (x_l − x_r) − Z_max) / (Z_min − Z_max) ],  v ∈ [0, 2^N − 1]   (4)

Finally, the depth map is the set of all pixel depth values v. Typically, further processing, e.g. the recognition of image details, is based on the depth map.

However, there are cases when, despite the use of stereovision, we do not have full information about the object of interest from both video channels. Such a situation occurs when an additional foreground object obscures the view of the point of the scene object, as illustrated with the gray region in Fig. 2b. Thus, the value of x_l cannot be precisely determined. Using the information from the second video channel, it is only known that

    x_l ∈ ⟨x_l − d_1, x_l + d_2⟩                                       (5)

Therefore it is not possible to precisely determine the distance Z to the point of the scene:

    Z ∈ ⟨ t_x f / (x_l + d_2 − x_r), t_x f / (x_l − d_1 − x_r) ⟩       (6)

As a result, the depth map would be inaccurate.

Otherwise, we may assume that x_l = x_r and substitute this into expression (2). The disparity would then be zero, so the denominator of (2) vanishes and the distance Z becomes undefined. This would also lead to significant errors in the depth map.

In real cases, the obscuring object may appear in either the right or the left video channel. Therefore, the two-channel information cannot be rejected. Finally, thanks to stereovision, the field of view is wider and important information about the point of the scene object is still preserved in at least one channel. In consequence, popular 2D algorithms (those which do not require the depth map and operate on one image) may be used.

IV. TESTED ALGORITHMS

For the study of the proposed "double 2D" method, several 2D image processing recognition algorithms have been selected and tested in automotive and video monitoring applications. In the automotive case, the authors chose recognition of traffic lights and of text information from road boards; in the video monitoring case, license plate recognition and human face detection were considered.

The first tested algorithm, for the recognition of traffic lights, operates in the following manner. First, the lower part of the image area, where there are no traffic lights, is cut off. Then, each image of the stereo pair (left and right) is binarized using threshold values for the red, green and blue color components. At the next stage, morphological closing is performed on the binary image. It uses a disk-shaped structuring element in order to preserve the circular shape of the traffic light object. Finally, circles corresponding to traffic lights are detected using the circular Hough transform. These circles have radii in a specified range. The parameters for detection were chosen experimentally; detailed parameters of the algorithm were chosen separately for the red and the green light.

The second tested algorithm, for character recognition from road information boards, belongs to the group of optical character recognition (OCR) algorithms [27]. First, candidate text regions are detected using the maximally stable extremal regions (MSER) method. Then, non-text regions are removed using geometric properties and the stroke width variation. At the next stage, the text regions are merged for the final detection. After localizing the text regions, the text is recognized using an OCR algorithm [27].

In the case of video monitoring, the authors tested license plate recognition and human face detection.

The license plate recognition uses the same algorithm as described above, but with different tuning parameters. During the merging of text regions, the expansion amount of the bounding boxes for license plates has a smaller value than for the information boards. The OCR treats the text in the image as a single block of text instead of automatically determining the layout and the reading order of text blocks. Moreover, the language to recognize was not specified as Polish (as it was in the recognition of the road information boards).

Human face detection is based on the Viola-Jones object detection algorithm [28]. Detection is performed using the sliding window technique, where the size of the window varies in order to detect objects at different scales. A cascade classifier is used to decide whether the window contains the object of interest. As the result, bounding box values are obtained.

V. STEREOSCOPIC DATABASE

In order to conduct the experiments, an appropriate database of test images was prepared. A stereoscopic digital camera FujiFilm Real 3D W3, with a distance between the lenses equal to 75 mm, was used for image acquisition. This distance is close to that of one of the newest systems commercially used in light vehicles, described in the introduction.

The images were taken on the campus of the Poznan University of Technology, Poland, and in the vicinity. The database consists of 74 stereovision test images. It includes 23 cases for tests of the street light recognition, 15 for tests of OCR on the road information boards, 18 for the license plate recognition, and 18 for the human face detection.

The test images with traffic lights were taken from the interior of a vehicle. The camera was mounted behind the windshield of the vehicle. The recordings were made in heavy traffic. In most cases another vehicle that was close obscured the view of the traffic lights. Thus, in particular views (left or right), traffic lights were fully or partially obscured. If the vehicle is in a queue at an intersection or stuck in a traffic jam, it is hard or even impossible to change the position of the car for a better view. Thus, the information could not be obtained from subsequent frames. Only the
second channel of the stereo image allows access to the additional information (cf. the visibility of the red light in Fig. 3a).

The test images with road boards were prepared in two ways. The first group was taken from the vehicle interior. The second group was taken from the pedestrian perspective. In all cases, in only one of the views the information boards were partially obscured (cf. Fig. 3b).

Figure 3. Illustrative stereo images (left and right) from the prepared database for (a) street light recognition, (b) character recognition from road information boards, (c) license plate recognition, and (d) human detection based on the face

The test images with license plates were taken from the monitoring system camera in the parking lot and near the entry barriers. In one view, the plate was partially obscured. For example, in Fig. 3c we can see a situation when the second car is behind the first one and partially covers its plate. Such a situation could happen when the driver of the second vehicle maintains a small distance to prevent reading the plate of the first car.

The test images for human face detection were also taken by the monitoring system camera. We included cases where the person's face was partially invisible in one view. For example, a person was leaning out to avoid being seen by the camera (cf. Fig. 3d).

VI. EXPERIMENTAL RESULTS

The experiments were carried out on the prepared database of stereovision images using the selected algorithms. The algorithms which operate on 2D images were tested separately on the left and the right view of the stereo pair. The aim of this part of the experiments was to compare the recognition efficiency on the left and the right view only. The results are presented in Tab. I.

TABLE I. RESULTS OF RECOGNITION FOR STREET LIGHTS (1–2), OCR ON ROAD INFORMATION BOARDS (3–4), LICENSE PLATES (5), AND HUMAN FACE DETECTION (6)

Parameter                                                    | Left view only | Right view only | Better view (left or right) | Worse view (left or right)
1) Efficiency of recognition for all street lights           | 33.3 %         | 37.0 %          | 50.0 %                      | 20.4 %
2) Recognition efficiency of obscured traffic lights         | 20.8 %         | 29.2 %          | 50.0 %                      | 0.0 %
3) Letter recognition efficiency for road information boards | 53.6 %         | 34.1 %          | 59.5 %                      | 28.3 %
4) Word recognition efficiency for road information boards   | 31.7 %         | 21.8 %          | 41.6 %                      | 11.9 %
5) Recognition efficiency of characters for license plates   | 43.7 %         | 47.6 %          | 57.9 %                      | 33.3 %
6) Efficiency of human face detection                        | 38.9 %         | 94.4 %          | 100 %                       | 33.3 %

In Table I the recognition efficiency for the left and the right image view only, and for the better and the worse view of the stereo pair, is presented. "Better view" means that the recognition efficiency is better for this view (e.g. left) than for the other (e.g. right).

The relatively low efficiency of the street light recognition arises from the distance to the lights and their obscuration. In addition to the obscured lights, there were other, uncovered lights. The results include two cases where all the lights were covered (Tab. I, rows 1 and 2). The detection efficiency for all lights using stereoscopic images was greater by at least 13 %, and for the obscured lights by at least 21 %, in comparison to the fixed views. This demonstrates the better efficiency of stereo recognition compared with a single view (left or right).

In the case of OCR on the road information boards, thanks to stereovision, the effectiveness is at least 6 % higher for letters and 10 % higher for words. For the recognition of characters from the license plates, the effectiveness is at least 10 % higher. Human face detection is at least 5 % better with the stereovision acquisition.

All these results show the predominance of stereovision over the individual views of the image. One could look for better parameters and algorithms of recognition, but this was not the goal of the experiments. It should be noted that differences in recognition may come not only from the obscuration of a single channel but also from the camera limits, e.g. blinding of a particular lens, and other, very often automatically adjusted settings, like e.g. focus.

VII. CONCLUDING REMARKS

Double 2D recognition, which was proposed in this paper, uses the stereo pair of images and is dedicated to cases where some objects obscure the region of interest. Because, in such cases, the typical stereovision algorithms may not produce a correct depth map, the monoscopic 2D algorithms were used to detect image details. The double-lens acquisition of the video is
important because, in the case of two channels, even if the depth maps cannot be correctly generated, they bring more information than one channel. It was shown that the usage of two channels for recognition tasks increases the effectiveness of the detail recognition in automotive applications and in video monitoring. Thus, the authors extended the possibilities of applying the monoscopic algorithms and proposed additional areas of using stereovision cameras.

The authors plan to further develop the prepared database and to conduct experiments with other scenarios and algorithms.

REFERENCES

[1] E. Johannesson, "Integrated Stereovision for an Autonomous Ground Vehicle," SURF Report, Lund Institute of Technology, 2005.
[2] "Subaru EyeSight," official web pages of Subaru Corporation, https://www.subaru.com/engineering/eyesight.html and http://www.subaru.com.mt/tec_eyesight.html, access: 25.05.2018.
[3] "Mercedes-Benz Intelligent Drive: The intelligent car," official web page of Daimler AG, http://media.daimler.com/marsMediaSite/en/instance/ko/Mercedes-Benz-Intelligent-Drive-The-intelligent-car.xhtml?oid=9904196, 22.10.2013, access: 25.05.2018.
[4] J. Clark, "Top 20 Mercedes-Benz assistance programs," eMercedesBenz online magazine, http://www.emercedesbenz.com/autos/mercedes-benz/s-class/top-20-mercedes-benz-assistance-programs/attachment/mercedes-benz-technical-terms-12c1235_24/, 19.11.2012, access: 25.05.2018.
[5] U. Franke, S. Gehrig, "How Cars Learned to See," Daimler AG Research & Development, Böblingen, Photogrammetric Week, D. Fritsch (Ed.), Wichmann/VDE Verlag, Berlin & Offenbach, pp. 3–10, 2013.
[6] Official web page of Suzuki Motor Corporation, https://www.suzuki.nl/over-suzuki/techniek/dual-camera-brake-support/, access: 26.05.2018.
[7] "Daihatsu pioneers new stereo camera ADAS system," Secured by Design (SBD) Automotive information service "Safe Car News," 12.12.2016, http://safecarnews.com/daihatsu-pioneers-new-stereo-camera-adas-system/, access: 25.05.2018.
[8] N. Ventroux, R. Schmit, F. Pasquet, P.-E. Viel, S. Guyetant, "Stereovision-based 3D obstacle detection for automotive safety driving assistance," Proc. of the 12th International IEEE Conference on Intelligent Transportation Systems, St. Louis, MO, USA, pp. 394−399, 3–7 October 2009.
[9] M.A. Garcia-Garrido et al., "Complete Vision-Based Traffic Sign Recognition Supported by an I2V Communication System," Sensors, pp. 1148–1169, 2012.
[10] S. Cavegn, S. Nebiker, "Automated 3D road sign mapping with stereovision based mobile mapping exploiting disparity information from dense stereo matching," International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XXXIX-B4, XXII ISPRS Congress, Melbourne, Australia, 25 August–1 September 2012.
[11] S. Cafiso, A. Di Graziano, G. Pappalardo, "In-vehicle stereo vision system for identification of traffic conflicts between bus and pedestrian," Journal of Traffic and Transportation Engineering (English edition), no. 4(1), pp. 3−13, 2017.
[12] Y. Huang, "Obstacle Detection in Urban Traffic Using Stereovision," Proc. of the 8th International IEEE Conference on Intelligent Transportation Systems, Vienna, Austria, pp. 633–638, 13–16 September 2005.
[13] A. Fregin, J. Müller, K. Dietmayer, "Three ways of using stereo vision for traffic light recognition," IEEE Intelligent Vehicles Symposium, Los Angeles, CA, USA, 11–14 June 2017.
[14] D. Stepanov, I. Tishchenko, "The concept of video surveillance system based on the principles of stereo vision," 18th Conference of Open Innovations Association and Seminar on Information Security and Protection of Information Technology (FRUCT-ISPIT), St. Petersburg, Russia, 18–22 April 2016.
[15] S. Sivapalan, D. Chen, S. Denman, S. Sridharan, C. Fookes, "Gait energy volumes and frontal gait recognition using depth images," International Joint Conference on Biometrics (IJCB), pp. 1–6, 2011.
[16] D. Cetnarowicz, J. Balcerek, A. Konieczka, "Classification of gait energy image (in Polish)," XIV Krajowa Konferencja Elektroniki, Darłowo, Poland, pp. 597−602, 8−12 June 2015.
[17] H. Liu, Y. Cao, Z. Wang, "A novel algorithm of gait recognition," International Conference on Wireless Communications & Signal Processing (WCSP), pp. 1–5, 2009.
[18] A.N.M. Gomatain, S. Sasi, "Gait recognition based on isoluminance line and 3D template matching," Proc. of 2005 International Conference on Intelligent Sensing and Information Processing, pp. 156–160, 2005.
[19] H. Liu, Y. Cao, Z. Wang, "Automatic gait recognition from a distance," Chinese Control and Decision Conference (CCDC), pp. 2777–2782, 2010.
[20] J. Balcerek, A. Konieczka, A. Dąbrowski, T. Marciniak, "Binary depth map generation and color component hole filling for 3D effects in monitoring systems," Proc. of Signal Processing − Algorithms, Architectures, Arrangements and Applications SPA'2011, IEEE International Conference, Poznań, Poland, pp. 138–143, 29–30 September 2011.
[21] A. Dąbrowski, J. Balcerek, A. Konieczka, "An Approach to Adjustment and Reduction of the Number of Controlling Parameters for Simple 2D to 3D Image Conversion Schemes," Proc. of New Trends in Audio and Video / Signal Processing − Algorithms, Architectures, Arrangements and Applications NTAV/SPA'2012, IEEE International Conference, Łódź, Poland, pp. 227–231, 27–29 September 2012.
[22] J. Balcerek, A. Dąbrowski, A. Konieczka, "Stereovision option for monitoring systems − a method based on perception control of depth," Proc. of Signal Processing − Algorithms, Architectures, Arrangements and Applications SPA'2013, IEEE International Conference, Poznań, Poland, pp. 226–230, 26–28 September 2013.
[23] C. Lin, Y. Zhao, J. Xiao, T. Tillo, "Region-Based Multiple Description Coding for Multiview Video Plus Depth Video," IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1209–1223, May 2018.
[24] Z. Liang, W.J. Tam, "Stereoscopic Image Generation Based on Depth Images for 3D TV," IEEE Transactions on Broadcasting, vol. 51, no. 2, pp. 191–199, 2005.
[25] J. Balcerek, A. Dąbrowski, A. Konieczka, "Simple efficient techniques for creating effective 3D impressions from 2D original images," Proc. of New Trends in Audio and Video / Signal Processing − Algorithms, Architectures, Arrangements and Applications NTAV/SPA'2008, IEEE International Conference, Poznań, Poland, pp. 219–224, 25–27 September 2008.
[26] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," Proc. of SPIE 5291, Stereoscopic Displays and Virtual Reality Systems XI, San Jose, CA, USA, pp. 93–104, 18 January 2004.
[27] MathWorks, "Automatically Detect and Recognize Text in Natural Images," https://www.mathworks.com/help/vision/examples/automatically-detect-and-recognize-text-in-natural-images.html, access: 09.06.2018.
[28] MathWorks, "vision.CascadeObjectDetector System object," https://www.mathworks.com/help/vision/ref/vision.cascadeobjectdetector-system-object.html, access: 09.06.2018.

Paper prepared with the DS 2018 project financial means.


Evaluation of postural stability using complex-valued data Fourier analysis of the follow-up posturographic trajectories

Zenon Kidoń, Jerzy Fiołka
Institute of Electronics
Silesian University of Technology
Akademicka 16, 44-100 Gliwice, Poland
e-mail: zenon.kidon@polsl.pl

Abstract—Posturography is a non-invasive diagnostic method that is used to assess an individual's ability to maintain balance. The main types of posturography are static, dynamic and follow-up posturography. During a follow-up posturography examination, a patient stands on a force platform that is equipped with a monitor. The individual is supposed to balance the body in such a way that the Center of Pressure (CoP) point (presented on a computer screen) coincides with the circle trajectory of a moving visual stimulus. The response to such stimulation provides valuable information about the ability of an individual to maintain their balance. The paper proposes a method, which is based on a Fourier analysis of complex-valued data, that is used to parameterise the posturographic trajectory that is obtained during a follow-up examination. The usefulness of the presented postural stability measures was confirmed by tests that were performed in a group of 30 individuals.

Keywords—postural stability, follow-up posturography, harmonic analysis, signal processing.

I. INTRODUCTION

Posturography is a non-invasive diagnostic method that is used to assess an individual's ability to maintain their balance [1] [2]. Postural stabilisation is a dynamic process that involves multisensory inputs and is the result of the continuous incorporation of the motor commands that are formulated by the central nervous system. The ability to stand erect is predominantly regulated by visual, proprioceptive and vestibular afferent information [3].

An individual's ability to maintain their balance in the standing position comprises a number of micromoves that are performed in different directions. In static posturography, these micromoves (sway) are recorded using a posturograph to provide the trajectory of the projection of the body's center of gravity on the surface of the force platform [4]. A static posturography examination typically lasts from 30 to 120 seconds [5]. The relationship between balance disorders and the properties of posturographic trajectories has been investigated by many authors [6] [7]. The parameters of the trajectory, such as the sway path, the sway area and the mean sway, have commonly been used for this purpose [8].

During a dynamic posturography examination, a patient is affected by external stimuli that disturb their postural equilibrium mechanically [2] [9]. Specially designed stands with servo motors that enable the movements or rotation of the force platform are used in this type of diagnostics. Because this method is rather expensive, it is not very common.

A follow-up posturography with visual feedback combines the advantages of both the static and the dynamic approaches. During the follow-up posturography examination, a patient stands on a static force platform that is equipped with a monitor (Fig. 1). The individual is supposed to balance their body in such a way that the Center of Pressure (CoP) point (presented on the computer screen) coincides with the circle trajectory of the moving visual stimulus [10]. The response to such stimulation provides valuable information about the dynamic aspect of the individual's ability to maintain their balance. A follow-up posturography can also be used to assess body posture symmetry, which is very important, e.g. during a patient's rehabilitation after a total hip arthroplasty [11] [12].

Figure 1. A follow-up posturography stand [10].

When a patient sways their body while trying to follow the circular clockwise (or counter-clockwise) motion of the visual stimulus as closely as possible, their body's CoP point precedes the stimulus point from time to time. As a result, the individual moves their body back slightly to correct this situation. This behaviour produces a follow-up trajectory that has loops in
which the CoP point moves in the opposite direction to the one in which the stimulus point does (Fig. 2). It can be assumed that the larger the loops are, the lower the postural stability is, similarly to the case of static posturography trajectories.

Figure 2. Example of the CoP trajectory and the stimulus trajectory (dashed line) that are registered during a follow-up posturography examination.

The paper proposes a method, which is based on a Fourier analysis of complex-valued data, that can be used to parametrise the posturographic trajectory that is obtained during a follow-up posturography examination. Moreover, the application of the complex plane method enables the reconstruction of the trajectory from the expansion coefficients with an assumed level of specificity. This property is particularly important in the context of a visual analysis of medical records, because it simplifies the evaluation of the shapes that are registered.

II. POSTUROGRAPHIC TRAJECTORY DECOMPOSITION

The posturographic signal can be represented in the complex plane by a complex function of a real variable

    z(t) = x(t) + i·y(t)                                               (1)

where x(t) and y(t) are the sway of the CoP point in the medial-lateral and the anterior-posterior direction, respectively.

Generally, although the start and end points of the follow-up trajectory do not have exactly the same coordinates, they are situated close to one another. Taking into account that the posturographic trajectories are irregular, the trajectory curve can be closed without a considerable change in its overall shape. Under this assumption, the posturographic trajectory comprises a closed curve. Thus, for all t

    z(t + T_p) = z(t)                                                  (2)

Without loss of generality, we can assume that the fundamental period T_p = 1. Thus, z(1) = z(0), and z(t') for t' ∈ (−∞, ∞) is defined and periodic with a unit period. Hence, z(t) can conveniently be represented as a Fourier series [13].

The function (1) can be expanded in a Fourier series

    z(t) = Σ_{k=−∞}^{+∞} c_k exp(i2πkt/T_p) = Σ_{k=−∞}^{+∞} c_k exp(i2πkt)   (3)

where c_k are the Fourier coefficients given by

    c_k = (1/T_p) ∫_0^1 z(t) exp(−i2πkt/T_p) dt = ∫_0^1 z(t) exp(−i2πkt) dt   (4)

In the case of complex signals, the conjugate symmetry c_k = c*_{−k} does not hold. Thus, the magnitudes of the Fourier coefficients are not even functions of k (and the angles of the Fourier coefficients are also not odd functions).

To gain a better understanding of the relationship between the expansion coefficients and the geometry of the stabilographic trajectory, let us consider (3) and (4) [14] [15]. Assuming in (4) that k = 0, it is clear that c_0 is simply the average of the points. Thus, it represents the position of the curve centroid

    c_0 = ∫_0^1 z(t) dt                                                (5)

As a result, equation (3) can be written as

    z(t) = c_0 + Σ_{k=1}^{+∞} z_k(t)                                   (6)

The contribution made to the Fourier series by the k-th harmonic component is given by

    z_k(t) = c_{−k} exp(−i2πkt) + c_k exp(i2πkt)                       (7)

Generally, for a given value of k, (7) represents an ellipse in the complex plane. When c_k or c_{−k} is equal to zero, we get a circle. In the case |c_k| = |c_{−k}|, the ellipse collapses into a straight line segment.

The parameters of the ellipses, i.e. the lengths of the semi-minor and semi-major axes, depend on the values of the expansion coefficients c_k and c_{−k}. This can be explained by the following reasoning. Using the basic relations

    |z_k(t)|² = z_k(t) · z̄_k(t)                                        (8)
    cos φ = 0.5 (e^{−iφ} + e^{iφ})                                     (9)

the square of the modulus can be written as

    |z_k(t)|² = |c_{−k}|² + 2|c_{−k}||c_k| cos(4πkt + φ_k − φ_{−k}) + |c_k|²   (10)

where φ_k = arg(c_k).

The maximum value of (10), which is equal to

    |z_k(t_M)|² = |c_{−k}|² + 2|c_{−k}||c_k| + |c_k|² = [|c_{−k}| + |c_k|]²   (11)

is attained for

    t = t_M = (1/(2πk)) · (φ_{−k} − φ_k)/2                             (12)
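The relations (10)–(12) can be checked numerically. The snippet below is a quick illustrative sketch (not part of the paper), assuming NumPy and arbitrary made-up test values for the coefficients c_k and c_{−k}: it samples |z_k(t)| from (7) over one period and compares its maximum with |c_{−k}| + |c_k|, attained at the instant t_M from (12).

```python
import numpy as np

# Numerical check of (10)-(12): for z_k(t) = c_{-k} e^{-i2*pi*k*t} + c_k e^{i2*pi*k*t},
# the maximum of |z_k(t)| is |c_{-k}| + |c_k|, attained at t_M given by (12).
k = 2
c_k = 0.8 * np.exp(1j * 0.7)      # arbitrary test coefficient c_k
c_mk = 1.5 * np.exp(-1j * 0.4)    # arbitrary test coefficient c_{-k}

t = np.linspace(0.0, 1.0, 200001)
z_k = c_mk * np.exp(-2j * np.pi * k * t) + c_k * np.exp(2j * np.pi * k * t)

semi_major = abs(c_mk) + abs(c_k)                              # (11)/(13)
t_M = (np.angle(c_mk) - np.angle(c_k)) / 2 / (2 * np.pi * k)   # equation (12)
z_at_tM = c_mk * np.exp(-2j * np.pi * k * t_M) + c_k * np.exp(2j * np.pi * k * t_M)

print(np.abs(z_k).max())   # ≈ 2.3, i.e. |c_{-k}| + |c_k|
print(abs(z_at_tM))        # ≈ 2.3 as well: the maximum is attained at t_M
```

At t = t_M both terms of (7) have the same phase (φ_k + φ_{−k})/2, so their moduli add, which is exactly the semi-major axis length of (13).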
Therefore, the length of the semi-major axis can be derived as

    |z_k(t_M)| = |c_{−k}| + |c_k|                                      (13)

By finding the minimum of (10), we get the length of the semi-minor axis. Hence, for |c_{−k}| > |c_k|

    |z_k(t_m)| = |c_{−k}| − |c_k|                                      (14)

or

    |z_k(t_m)| = |c_k| − |c_{−k}|                                      (15)

if |c_k| > |c_{−k}|, where t_m denotes the instant of the minimum.

In addition to the derived lengths, the angle between the semi-major axis and the real axis is also an important parameter. Substituting (12) into (7) yields

    z_k(t_M) = [|c_{−k}| + |c_k|] exp(i (φ_k + φ_{−k})/2)              (16)

The argument of the complex number that is given by (16) is the inclination α:

    arg{z_k(t_M)} = α = (φ_k + φ_{−k})/2                               (17)

An ellipse on the complex plane with the indicated parameters is shown in Fig. 3.

Figure 3. An ellipse on the complex plane.

The last issue is the direction in which the individual ellipses are traced out. The coordinates of a given point are a function of the parameter t (in the case of the posturographic trajectory this is the time). When the value of the parameter t increases and |c_{−k}| > |c_k|, the ellipse is traced out clockwise. If |c_k| > |c_{−k}|, the direction is the opposite (anticlockwise). Moreover, the number of circulations of the ellipse depends on the value of k. Assuming that t varies from 0 to 1, the ellipse for k = 1 (the first harmonic) is traced out once, and the ellipse for z_k is traced out k times.

A. Discrete-time implementation

In practice, the posturographic trajectory is not a continuous function of time, but rather a sequence of points that are obtained by the uniform sampling of the posturographic signals. By applying the trapezium rule to the integral (4), the expansion coefficients can be derived as

    c_k = (1/N) Σ_{m=0}^{N−1} z_m exp(−i2πkm/N)                        (18)

where N is the number of points and z_m is a complex number that represents the m-th point of the posturographic trajectory.

In the posturographic examinations that were performed, the signal was registered in a time interval of 60 s with a sampling rate of 50 samples per second. Thus, the number N in (18) is equal to 1500.

B. Filtering of the posturographic trajectory

The presented analysis allows the posturographic trajectory to be represented, at a certain level of roughness, by a specified number of summed harmonics H. Fig. 4 shows plots of the partial sums (6) with the first ten terms (H=10) and the first fifty terms (H=50). Because the magnitudes of the individual harmonics decrease rapidly, the posturographic
trajectory can be approximated using a partial sum with only several terms. As a result, this method can simplify the analysis of the medical records of patients by filtering or displaying the trajectories with an assumed level of detail. As a consequence, a comparison of the main geometric properties of the trajectories (e.g. size, shape) is more effective.

III. ED PARAMETER

Many different parameterisation methods that are based on the displacement of the CoP point have been proposed in the literature [16] [17]. The parameters which are commonly used in practice describe the overall "size" of the trajectory. The most popular "classical" parameters are the length of the trajectory, the area under the unrolled trajectory and the average CoP point deviation from the center of the trajectory [8]. Another possibility is to use the Fourier transform of the AP and/or ML component to quantify an individual's postural stability [7]. Moreover, in some experimental studies, the stabilographic signal is analysed from a fractal point of view [7].

In the paper, a new ED (Energy Distribution) parameter is introduced to quantify an individual's postural stability. This parameter, in contrast to the classical approach, describes a dynamic aspect of body sway. The value of ED is calculated based on a follow-up posturographic trajectory that is registered for a circular stimulus that moves clockwise at a constant angular velocity [10]. The proposed ED parameter is defined as

    ED = ( Σ_{k'} |c_k| ) / ( Σ_{|k|<H, k≠0} |c_k| )                   (19)

where c_k are the expansion coefficients given by (18), H is the number of harmonics and k' is the summation index that is restricted to those values of k for which

    |c_k| > |c_{−k}|                                                   (20)

Equation (20) implies that the ellipse is traced out anticlockwise. Thus, the ED parameter is the ratio of the energy that is contained in the ellipses traced out opposite to the stimulus to the total energy of the trajectory. As a result, the ED parameter quantifies the postural stability by measuring the ability of the individual to preserve the correct direction of movement that is invoked by the movement of the stimulus.

IV. EXPERIMENTS

In order to verify the applicability of the proposed parameterisation method, the static and follow-up posturography examination results for 30 individuals were compared. Because postural stability is very important for elderly people, the individuals were 37–76 years old (19 women and 11 men). The static examination lasted 30 seconds and the following trajectory parameters were taken into account [12]:

Figure 4. The filtered trajectory for H=10 (a) and H=50 (b).

• Length of the Trajectory – LT (mm):

    LT = Σ_{i=2}^{N} l(i)                                              (21)

    l(i) = √( [x(i) − x(i−1)]² + [y(i) − y(i−1)]² )                    (22)

where: x(i), y(i) – coordinates of the CoP; l(i) – distance between two consecutive points of the posturographic trajectory; N – number of points comprising the trajectory.
• Area under the unrolled Trajectory – AT (mm^2):

AT = \sum_{i=2}^{N} p(i)   (23)

p(i) = \sqrt{ob(i) \cdot [ob(i) - r(i-1)] \cdot [ob(i) - r(i)] \cdot [ob(i) - l(i)]}   (24)

ob(i) = \frac{l(i) + r(i) + r(i-1)}{2}   (25)

where: r(i) – distance between the CoP location (x(i), y(i)) at a
given moment in time (i) and the center of the trajectory
(x0, y0).

• Average CoP Deviation from the center of the
Trajectory – DT (mm):

DT = \frac{\sum_{i=1}^{N} r(i)}{N}   (26)
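For reference, the three static parameters above can be computed directly from the CoP samples. This is a sketch only; it assumes the trajectory center (x0, y0) in Eq. (26) is the mean CoP position, which the text does not specify:

```python
import numpy as np

def classical_parameters(x, y):
    """LT, AT and DT of Eqs. (21)-(26) for a CoP trajectory in mm.
    A sketch; the trajectory center (x0, y0) is taken as the mean CoP."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x0, y0 = x.mean(), y.mean()
    # Eq. (22): distance between consecutive CoP points
    l = np.hypot(np.diff(x), np.diff(y))
    LT = l.sum()                                       # Eq. (21)
    # r(i): distance of each CoP point from the trajectory center
    r = np.hypot(x - x0, y - y0)
    DT = r.mean()                                      # Eq. (26)
    # Eqs. (24)-(25): Heron's formula for the triangle spanned by two
    # consecutive CoP points and the center (ob is the semi-perimeter);
    # clip guards against tiny negative products from round-off
    ob = (l + r[1:] + r[:-1]) / 2.0
    p = np.sqrt(np.clip(ob * (ob - r[:-1]) * (ob - r[1:]) * (ob - l),
                        0.0, None))
    AT = p.sum()                                       # Eq. (23)
    return LT, AT, DT
```

For a uniformly sampled circular trajectory of radius 10 mm, LT and AT approach the circumference and area of the circle, and DT equals the radius.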

The follow-up posturography examination lasted 60
seconds and the trajectory was registered for one cycle of the
stimulus moving at a constant circular speed in the clockwise
direction. Then, the proposed ED parameter was calculated for
the collected data.

STATISTICA software was used for the statistical analysis.
In order to evaluate the normality of the distributions, the
Shapiro-Wilk test was performed. The Pearson correlation
coefficients were computed for the pairs of parameters in order
to evaluate the relationships between the parameters being
analysed. The statistical significance level was established at
p < 0.05.

Fig. 5 presents a scatter plot of the ED and DT parameters
with the regression line for all of the patients (a) and after
eliminating one outlier point (b). The significant deviation of
the eliminated point from the others was probably due to a
temporary disturbance of the individual being examined.
Eliminating the outlier point resulted in a change in the
correlation between the ED and DT parameters from 0.564 to
0.763. The mutual correlations of the proposed parameter with
the other two above-mentioned parameters are 0.533 (ED and
AT) and 0.4559 (ED and LT) – Tab. 1.

Figure 5. A scatter plot of the data of the ED and DT parameters with the
regression line for all of the patients (a) and after eliminating one outlier
point (b).

TABLE 1. MUTUAL CORRELATIONS BETWEEN THE INDIVIDUAL PARAMETERS
(Pearson's correlation coefficient; all values are statistically significant for p < 0.05)

  Parameter    DT       LT       AT
  ED           0.7633   0.4559   0.533
  DT           –        0.6669   0.8259
  LT           –        –        0.8782

The values show a very strong mutual correlation between
the classical static measures (from 0.6669 for DT and LT, up to
0.8782 for AT and LT). This proves that these parameters
quantify the same aspect of human postural stability, which is
directly related to the "size" of a static trajectory.

The values also indicate that the ED parameter is correlated
with them to a high (DT) or moderate (AT and LT) degree. This
confirms that the ED parameter provides valuable diagnostic
information, which is currently not taken into account during a
static examination, and that by evaluating those aspects of
postural stability, an individual can receive the appropriate
treatment.

V. CONCLUSIONS

The application of follow-up posturography in clinical
research is very promising because it provides additional
unique information about the human balance control system.
By describing the dynamic aspects of body sway, we are able
to quantify the postural responses to visual stimuli that are
generated by a computer and displayed on a monitor. The use
of the proposed method in the clinical context would enable

meaningful information to be extracted from posturographic
data and to present the results in the form of numerical values
that are calculated according to the presented formulae.

The results that were obtained prove the applicability of the
presented approach in evaluating human postural stability. This
is due to the fact that the proposed method quantifies certain
aspects of postural balance control which are not accessible
during a static examination.

Moreover, reconstructing the trajectory at some level of
specificity simplifies the evaluation of its shape, which can be
useful in preliminary investigations.

The aim of further studies will be to evaluate the
applicability of the presented methods on a statistically
significant group of patients. Moreover, special attention will
be paid to using the ED parameter in diagnosing patients who
suffer from Parkinson's disease.

ACKNOWLEDGMENT

This work was supported by the Ministry of Science and
Higher Education funding for statutory activities.

REFERENCES
[1] D. A. Winter, "Human balance and posture control during standing and walking," Gait & Posture, vol. 3, pp. 193-241, 1995.
[2] H. Chaudhry, B. Bukiet, Z. Ji, T. Findley, "Measurement of balance in computer posturography. Comparison of methods - A brief review," Journal of Bodywork & Movement Therapies, vol. 15, no. 4, pp. 82-91, 2011.
[3] R. J. Peterka, "Sensorimotor integration in human postural control," J. Neurophysiol., vol. 88, pp. 1097-1118, 2002.
[4] M. Duarte, S. M. S. F. Freitas, "Revision of posturography based on force plate for balance evaluation," Rev Bras Fisioter, vol. 14, no. 3, pp. 183-192, 2010.
[5] K. Le Clair, C. Riach, "Postural stability measures: what to measure and how long," Clinical Biomechanics, vol. 11, no. 3, pp. 176-178, 1996.
[6] M. Salavati, M. R. Hadian, M. Mazaheri, "Test-retest reliability of center of pressure measures of postural stability during quiet standing in a group with musculoskeletal disorders consisting of low back pain, anterior cruciate ligament injury and functional ankle instability," Gait and Posture, vol. 29, pp. 460-464, 2009.
[7] L. Baratto, P. G. Morasso, C. Re, G. Spada, "A new look at posturographic analysis in the clinical context: sway-density vs. other parameterization techniques," Motor Control, vol. 6, no. 3, pp. 246-270, 2002.
[8] L. Rocchi, L. Chiari, A. Cappello, "Feature selection of stabilometric parameters based on principal component analysis," Medical & Biological Engineering & Computing, vol. 42, pp. 71-79, 2004.
[9] B. R. Bloem, J. E. Visser, J. H. J. Allum, "Handbook of Clinical Neurophysiology," Elsevier, pp. 295-336, 2003.
[10] Z. Kidon, J. Fiolka, "Harmonic analysis of a complex-valued stabilographic signal," Przeglad Elektrotechniczny, vol. 93, pp. 185-188, 2017.
[11] K. Pethe-Kania, J. A. Opara, D. Kania, Z. Kidoń, T. Lukaszewicz, "The follow-up posturography in rehabilitation after total hip arthroplasty," Acta of Bioengineering and Biomechanics, vol. 19, pp. 97-104, 2017.
[12] T. Lukaszewicz, D. Kania, Z. Kidon, K. Pethe-Kania, "Posturographic methods for body posture symmetry assessment," Bulletin of the Polish Academy of Sciences - Technical Sciences, vol. 63, pp. 907-917, 2015.
[13] P. Kumar, E. Foufoula-Georgiou, "Fourier domain shape analysis methods: A brief review and an illustrative application to rainfall area evolution," Water Resources Research, vol. 26, pp. 2219-2227, 1990.
[14] J. McGarva, G. Mullineux, "Harmonic representation of closed curves," Appl. Math. Modelling, vol. 17, pp. 213-218, 1993.
[15] Ch. Dalitz, Ch. Brandt, S. Goebbels, D. Kolanus, "Fourier descriptors for broken shapes," EURASIP Journal on Advances in Signal Processing, vol. 2013, pp. 1-11, 2013.
[16] J. Fiolka, Z. Kidon, "Method for stabilogram characterization using angular-segment function," Bulletin of the Polish Academy of Sciences - Technical Sciences, vol. 61, pp. 391-397, 2013.
[17] T. Lukaszewicz, Z. Kidon, D. Kania, K. Pethe-Kania, "Postural symmetry evaluation using wavelet correlation coefficients calculated for the follow-up posturographic trajectories," Elektronika ir Elektrotechnika, vol. 22, pp. 84-88, 2016.


Automated determination of arterial input function in
DCE-MR images of the kidney

Artur Klepaczko¹, Martyna Muszelska¹, Eli Eikefjord², Jarle Rørvik³, Arvid Lundervold⁴

¹ Lodz University of Technology, Institute of Electronics, Lodz, Poland; aklepaczko@p.lodz.pl
² Haukeland University Hospital, Department of Radiology, Bergen, Norway
³ University of Bergen, Department of Clinical Medicine, Bergen, Norway
⁴ University of Bergen, Department of Biomedicine, Bergen, Norway; arvid.lundervold@biomed.uib.no

Abstract—This paper concerns the problem of estimating
renal perfusion based on Dynamic Contrast Enhanced MRI.
Quantification of perfusion parameters is possible by means of
pharmacokinetic modeling. Several mathematical formulations
of PK models have been proposed. In any case, it is important
to determine the so-called arterial input function, i.e. the
time-course of the contrast agent bolus in a main feeding
artery; in the case of the kidney this is the descending aorta.
Usually, determination of the AIF is performed manually. We
propose an automatic procedure to determine the AIF, thus
reducing the involvement of a human observer in the image
processing pipeline. Our proposed method uses a combination
of image processing and machine learning algorithms, firstly
to identify all voxels potentially belonging to the descending
aorta and secondly to select those voxels which are free from
the inflow artifact. The tests of our method performed for 10
DCE-MRI datasets show its effectiveness in terms of the
resulting perfusion parameter measurements.

Keywords—perfusion-weighted imaging; arterial input
function; pharmacokinetic modeling

I. INTRODUCTION

Monitoring the state of kidney functioning is becoming an
increasingly important task of modern medical diagnostics.
The growing rate of patients with various renal diseases in the
developed countries is – besides other factors – linked to the
aging society phenomenon. Failure in renal operation may
impair physiological homeostasis, leading to improper
management of electrolytes, acid-base balance perturbation, or
deregulation of arterial blood pressure. Kidneys are
responsible for blood filtration and the removal of water-soluble
waste products of metabolism as well as surplus glucose and
other organic substances. The kidney diseases include, among
others, acute kidney injury (AKI), chronic kidney disease
(CKD) and various cancers (e.g. renal cell carcinoma). The
treatment depends on the pathological condition and may
require life-long dialysis, kidney removal or transplantation. In
any case, exact knowledge of kidney performance is needed
for proper diagnosis, treatment and follow-up prognosis.
Especially in the case of CKD, caused e.g. by longstanding
hypertension or diabetes mellitus, it is important to
continuously and precisely monitor renal function, since early
detection of the disease allows prevention of its development
to the end stage [1].

Dynamic contrast-enhanced magnetic resonance imaging
(DCE-MRI) is one of the available techniques for the
assessment of kidney functioning, as it allows its perfusion to
be estimated. The procedure requires intravenous
administration of a contrast agent (CA) in the form of a bolus,
whose temporal concentration through the arterial system and
capillary bed is measured. Tissue perfusion as such is a source
of viability in preserving cell nutrition, humoral
communication, and the elimination of waste products during
lifetime. Thus, the local blood flow and perfusion play a
fundamental role in the function of kidneys, especially those
affected by the above-listed diseases.

The standard method used in the estimation of renal
perfusion based on DCE-MR images involves pharmacokinetic
(PK) modeling of the tracer agent transfer from the arterial
compartment into the extravascular extracellular space.
Numerous PK models have been proposed in the literature [2],
which vary in 1) the number of considered compartments,
2) the level of incorporating compartment-specific impulse
residue functions, 3) having or not having the capability to
reflect the tracer-agent leakage from capillaries in cancerous
cells, and 4) the way a given model accounts for the delay and
broadening of the peak region in the arterial input function.

In any case, the crucial step in fitting a PK model to DCE-MR
data is the determination of the arterial input function (AIF),
i.e. the time-course of the gadolinium-based tracer agent
concentration in the feeding artery which supplies an organ of
interest with blood. It is recognized that a standard
methodology of AIF estimation is required, so that perfusion
measurements are repeatable and comparable. Currently such
standards do not exist, although some recommendations have
been formulated [3]. Manual delineation of the regions of
interest, where the AIF is expected to manifest, turns out to be
significantly observer-dependent [4]. It is recommended to
position the AIF region at a location distal to the blood inlet, so
that the observed intensity change depends exclusively on CA
concentration and is not affected by the inflow artifact. Also,
the size of the regions of interest placed over a feeding artery
to a large extent affects the calculated GFR values [5].
Eventually, as already noted, more appropriate measurements
are obtained if, instead of a single AIF for the whole organ

This paper was supported by the Polish National Science Centre grant no.
UMO-2014/15/B/ST7/05227.

under study, a series of input functions are determined in
proximal locations to the voxels representing a perfused tissue.

The issue of automated positioning of regions allowing for
AIF estimation is addressed e.g. in [6], where the problem is
approached using independent component analysis. Other
solutions include blind source separation and deconvolution
using singular value decomposition [7, 8].

However, the above-mentioned attempts do not provide a
consistent solution which would offer observer-independent,
automated determination of the AIF region while
simultaneously minimizing the inflow artifact. This paper
addresses this problem and proposes a method which relies on
a combination of image processing and machine learning
techniques. The algorithm first performs vessel enhancement
and segmentation to identify all voxels belonging to the
abdominal aorta in a given DCE-MR image data set. Secondly,
the selected aorta voxels are partitioned in an unsupervised
way. The latter operation leads to automatic determination of
an AIF region, whose intensity predominantly depends upon
tracer kinetics behavior. The effectiveness of our proposed
algorithm is presented in relation to a data set composed of
10 real DCE-MRI examinations performed for 5 healthy
volunteers. Although the method is independent of the
assumed PK model, we verify its performance using the
two-compartment filtration Tofts model [9].

II. MATERIALS AND METHODS

A. DCE-MRI examinations

The experiments reported in this study were conducted on
a data set of 10 DCE-MRI examinations collected for 5
healthy volunteers [10]. All volunteers signed written
informed consent to take part in the study in agreement with
the local board for bioethical affairs. Every patient was
examined twice, seven days apart. Hence, the whole data set
was composed of two parts denoted as Volume 1 and
Volume 2. The examinations were performed on a 32-channel
1.5 T scanner (Siemens Magnetom Avanto) using 6-channel
body matrix and table-mounted 6-channel receiver coils. The
3D FLASH spoiled gradient recalled acquisition sequence was
configured with the following parameters: TE=0.8 ms,
TR=2.36 ms, FA=20°, parallel imaging factor=3, in-plane
resolution=2.2 x 2.2 mm², slice thickness=3 mm, acquisition
matrix=192x192, number of slices=30. The dynamic sequence
consisted of 74 volumes acquired at 2.3-second intervals. The
Gadolinium-based CA (0.025 mmol/kg of Gd-DOTA) was
injected intravenously at a flow rate of 3 mL/s.

Five seconds after injection, the patients were instructed to
hold their breath for about 26 seconds to enable motion-free
image acquisition during the first-pass perfusion. Afterwards,
breath-hold periods of 15 seconds were interleaved with
26-second free-breathing time slots, during which the motion
artifact could manifest more strongly in the acquired data.
Therefore, the collected images, before any processing, were
submitted to an image registration procedure. This step was
accomplished using the b-spline deformable registration
algorithm implemented in the 3D Slicer Plastimatch module
[11]. The algorithm fits a so-called moving volume to a fixed
one using a sequence of stages which vary in the scale of
image primitives considered in the matching transformation.
We used 4 scales parameterized by the image subsampling
rate and the grid size (cf. Table I). For every subject, we
identified one time frame as the fixed volume and all the
remaining 73 frames were matched to that reference. The
fixed volume was chosen as the one corresponding to the
kidney cortex perfusion phase, i.e. the time when the contrast
between the cortex and medulla is maximized.

TABLE I. PARAMETER SETTING OF THE B-SPLINE DEFORMABLE
REGISTRATION ALGORITHM

  Stage   Image subsampling rate [voxel]   Grid size [mm]   Maximum iterations
  0       (5,5,3)                          –                –
  1       (4,4,2)                          100              50
  2       (2,2,1)                          50               –
  3       (2,2,1)                          25               –

In addition to image acquisition, each subject underwent a
blood test to enable measurement of the glomerular filtration
rate (GFR) – a standard clinical examination routinely
performed in kidney diagnostics. The measurement was
accomplished by injecting 5 mL of iohexol (300 mg I/mL;
Omnipaque 300, GE Healthcare), followed by a venous blood
sample obtained after 4 hours.

B. Segmentation of the DCE-MR images

The first step of the algorithm aims at delineation of the
abdominal aorta region. We accomplish this task through
vessel segmentation in one time frame of the DCE-MR image

Fig. 1. Subsequent steps of delineation of the abdominal aorta in a DCE-MRI
sequence (from left: input frame, low-pass filtering, vessel enhancement,
flood fill).

series. To select the appropriate time frame, we plot the
temporal course of the mean image intensity in its
bottom-central region, where the bifurcation of the aorta is
expected to be located. The maximum of this time-course
corresponds to the frame when the contrast agent fills the
whole aorta lumen, so that it can be clearly distinguished at its
entire length.

Next, the image is filtered with a low-pass Gaussian
kernel (σ=3) and submitted to the vessel enhancement routine.
In this step we computed a multi-scale vesselness function in
the form proposed by Sato [12], using 5 scales of radius
standard deviation in the range from 0.5 to 2.5 mm. As a result,
the filtered image contains only the elongated bright
structures. However, apart from the aorta, other tissues and
smaller vessels may still be visible. In order to remove them,
a flood fill operation is executed with the lower threshold set
to an empirically found value equal to 12.5% of the maximum
image intensity. The upper bound is set to the maximum, as we
found the aorta to be the brightest structure visible in the time
frames selected for segmentation. Similarly, the seed point of
the flood fill operation was determined as the brightest point
in the central part of the image along the coronal direction.
The described strategy is visualized in Fig. 1 as a sequence of
example images – the results of the algorithm steps.

C. AIF region identification

Theoretically, the best AIF location would be situated as
close as possible to the perfused organ. In the case of kidneys,
these are the renal arteries, which may play the role of the
blood suppliers. However, in the DCE-MR images, only short
inlet sections of the renal arteries are visible. The image
resolution is too coarse relative to the diameters of renal
vessels in the distal sections (< 2 mm). In consequence, the
corresponding voxels convey little and potentially biased
(because of the noise and the partial volume effect)
information about the contrast agent concentration time-course.
Therefore, in practice, the AIF is usually determined based on
the signal measured in the main abdominal aorta.

Unfortunately, even then the positioning of the AIF region
is not straightforward. Due to the axial orientation of the
acquisition matrix and the high blood flow velocity, the upper
parts of the aorta are strongly affected by the inflow artifact.
Thus, the signal in such locations is enhanced not only by the
CA bolus, but also by the flux of fresh, non-saturated
hydrogen protons flowing rapidly deep into the imaging slab
in the direction perpendicular to the slice plane. As a result,
measurement of the true contrast agent concentration in these
voxels is significantly hampered and its estimates mislead the
subsequent computation of perfusion parameters. It is
common that the inflow artifact is present even at the level of
the kidneys. Therefore, the AIF ROI is routinely placed below
the renal artery inlets, e.g. close to the bifurcation of the aorta.
On the other hand, the further away the AIF region is located,
the less precise the perfusion measurements are. This problem
is linked to the temporal delay between the CA bolus passage
through the aorta and the kidney perfusion phase. Thus, it is
desirable to include in the AIF ROI image voxels lying no
further than is truly required to avoid the inflow effect
problem.

In response to the above-described challenges, we propose
to automatically select the AIF ROI in two phases. First, the
image voxel time-courses segmented previously as aorta
locations are partitioned by the k-means clustering algorithm.
At this stage we arbitrarily assume that there may be k=5
various groups of tracer agent concentration time curves. In
our experiments this number appeared to be a reasonable
trade-off between the need to finely discriminate different
sections of the abdominal aorta on the one side, and the
demand that each cluster represents a significant number of
image voxels on the other side. Secondly, we analyze the
time-courses of each cluster center. The ultimately chosen AIF
region corresponds to the cluster whose mean signal
time-course averaged over the first few time frames is the
lowest. The reasoning which underlies such a solution follows
the observation that initially, in the absence of the contrast
agent, high signal intensity from the aorta may result only
from the blood inflow effect. The low signal corresponds to
the bottom-most aorta region, where the moving blood spins
are fully saturated and produce a weak signal.

D. Algorithm validation using the Tofts model

We evaluate our proposed algorithm by comparing the
image-derived GFR estimates with the gold standard values,
i.e. iohexol-based measurements. For that purpose we have
implemented in a computer program the extended
2-compartment filtration model described in [9]. This
pharmacokinetic model will be summarized now.

Firstly, the signal intensity (SI) time curves must be
converted into concentration curves. The conversion formula
is derived from the signal equation of the spoiled gradient
echo sequence:

SI = \frac{M_0 \sin\alpha \left(1 - e^{-TR/T_1}\right)}{1 - \cos\alpha \; e^{-TR/T_1}}   (1)

where M_0 is the equilibrium magnetization, α is the flip angle
(FA), TR is the repetition time, and T_1 is the longitudinal
relaxation time constant. TR and FA are MRI acquisition
sequence parameters selected by the operator. T_1 is a
tissue-characteristic constant, significantly shortened by the
contrast agent. Specifically, the contrast agent alters the tissue
property called the relaxation rate. The latter is related to T_1
simply as:

R_1 = 1/T_1.   (2)

Moreover, since the concentration of the contrast agent
bolus is time-dependent, we have

R_1(t) = R_{10} + r_1 C(t),   (3)

where R_{10} is the longitudinal relaxation rate of the tissue
before bolus arrival, C(t) is the time-dependent concentration
of the contrast agent in a given tissue, and r_1 is the
T_1-relaxivity of that space. Hence, we can rewrite Eq. (1) as

S_0 = \frac{M_0 \sin\alpha \left(1 - e^{-R_{10} TR}\right)}{1 - \cos\alpha \; e^{-R_{10} TR}}   (4)

for the pre-bolus phase, or as

S(t) = \frac{M_0 \sin\alpha \left(1 - e^{-R_1(t) TR}\right)}{1 - \cos\alpha \; e^{-R_1(t) TR}}   (5)

in the perfusion and uptake phases. S(t) and S_0 can be
determined from the image as the mean signal of an arterial
input or a kidney region of interest. S_0 is the time-averaged
signal intensity taken over the period from the beginning of the
examination until the bolus arrival time (BAT). The ratio of
both signal expressions leads to an M_0-free formula

\frac{S_0}{S(t)} = \frac{\left(1 - e^{-R_{10} TR}\right)\left(1 - \cos\alpha \; e^{-R_1(t) TR}\right)}{\left(1 - \cos\alpha \; e^{-R_{10} TR}\right)\left(1 - e^{-R_1(t) TR}\right)}.   (6)

In this study we assume that T_{10,blood} = 1.4 s and
T_{10,kidney} = 1.2 s. Eq. (6) resolves to

R_1(t) = -\frac{1}{TR} \ln \frac{1 - a(t)\,b}{1 - a(t)\,b \cos\alpha}   (7)

where

a(t) = \frac{S(t)}{S_0}   (8)

and

b = \frac{1 - e^{-R_{10} TR}}{1 - \cos\alpha \; e^{-R_{10} TR}}.   (9)

Finally, from Eq. (3) we have the desired formula for the
signal-to-concentration transform

C(t) = \frac{R_1(t) - R_{10}}{r_1}.   (10)

Eq. (10) applied to the arterial signal gives the CA
concentration (AIF) in blood, denoted as C_b^{art}(t). In order to
obtain the concentration in blood plasma, C_p^{art}(t), it has to be
divided by the hematocrit-dependent ratio as

C_p^{art}(t) = \frac{C_b^{art}(t)}{1 - Hct_{large}},   (11)

where Hct_{large} is the hematocrit value in large blood vessels.

For any linear and stationary tissue with the initial condition
C(0)=0, the CA concentration in the intravascular (IV) space
of that tissue is related to the arterial input function by a
convolution with a vascular impulse response function
(VIRF). In other words, the peak concentration of CA in blood
plasma is delayed and broadened in the tissue, and these two
effects are modeled by a VIRF. Hence, the intravascular
plasma concentration in the kidney is given by

C_p^{kid}(t) = C_p^{art}(t) \otimes g(t) = \int_0^t C_p^{art}(t - \tau)\, g(\tau)\, d\tau   (12)

with

\int_0^\infty g(t)\, dt = 1.   (13)

In general, g(t) may have different forms. In our study we
used the delayed exponential function:

g(t) = \begin{cases} 0, & t < \Delta \\ \frac{1}{T_g}\, e^{-\frac{t - \Delta}{T_g}}, & t \ge \Delta \end{cases}   (14)

where T_g denotes the exponential decay time constant,
whereas Δ is the delay before the CA appears in the capillary
Fig. 2. Example segmentation results (a) and their corresponding intensity
time-curves (b). Manual AIF ROI (red solid line) annotated over the
automatically found ROI (c), where the inflow artifact (white arrow) is
absent (d).

bed. The contrast agent transfers from the IV space into the
extravascular extracellular space (EV) with the rate constant
K^{trans}. Therefore, the CA concentration in the kidney tissue
can be modeled as

C_k(t) = K^{trans} \int_0^t \left(C_p^{art} \otimes g\right)(\tau)\, d\tau + v_p\, C_p^{kid}(t).   (15)

In order to fit the above model to the image data, there are four
free parameters which have to be adjusted: K^{trans}, T_g, Δ, and
v_p – the plasma volume fraction. The image-derived GFR
value is the product of K^{trans} and the volume of the kidney
region of interest.

III. RESULTS

A. Experimental setting

The AIF regions were automatically determined for each
of the 10 DCE image series using the above-described method.
The only operation which was performed manually was the
positioning of the bounding boxes at the bottom-central and
central parts of the image, where we expected to locate the
aorta bifurcation and a seed point for the flood-fill procedure
(see Sect. II.B). The boxes were determined only once and
remained unchanged for all 10 examinations.

In parallel to the automatic aorta segmentation, we
performed its manual delineation in every data set. The
manually delineated regions were used to evaluate the quality
of the automatic segmentation results. In essence, we computed
the Dice coefficient between the manual and automatic
segmentations. This evaluation is necessary to assess the
reliability of the subsequently performed pharmacokinetic
modeling of renal perfusion. Additionally, we also manually
placed ROIs over the feasible AIF locations. Consequently, we
calculated the GFRs using two alternative approaches to the
determination of the arterial input function: the proposed
observer-independent method and the conventional one, based
on a subjective observer decision.

The above-described Tofts model was fitted to the DCE
data in the uptake phase, that is, when the contrast agent is
filtered both in the cortex and the medulla. Accordingly, the
analysis was narrowed to the appropriate regions of the
kidneys, annotated manually.

Eventually, following the recommendations expressed in
other studies devoted to PK modeling of renal perfusion [9,
13], the time-courses of image intensities (both in the aorta and
in the perfused organs) were interpolated and oversampled to
achieve a finer temporal resolution. Thus, the time interval of
2.3 s in the original time series is reduced to 0.4 s in the
oversampled domain. The interpolation was accomplished
using the cubic b-splines method.

B. AIF region determination

Figure 2 presents the results of aorta segmentation in one
example dataset along with the annotated (color-coded)
clusters extracted by the k-means algorithm. The relatively
high Dice coefficients (mean = 0.88, minimum = 0.76,
maximum = 0.92) confirm the credibility of the performed
automatic segmentations in terms of their good
correspondence with the manual delineations of the aorta. In
addition, the time-courses of the individual clusters are shown.
The AIF regions are chosen based on the analysis of the mean
cluster intensities up to the bolus arrival time. In all cases, these
regions are located in the neighborhood of the aorta
bifurcation. As shown in the bottom of Fig. 2, the inflow
artifacts in these locations are absent (in contrast to the inlet
sections marked with the white arrow). The solid red border
lines indicate the manually placed AIF ROIs.

TABLE II. ESTIMATED VALUES OF GFR FOR THE SUBJECTS INCLUDED IN
THE STUDY [mL/min/1.73 m²]

  Subject   Study   Iohexol GFR   Automatically determined AIF   Manually determined AIF
  1         1       107           94                             103
  1         2       107           117                            66
  2         1       98            94                             54
  2         2       98            63                             45
  3         1       90            52                             73
  3         2       90            116                            115
  4         1       93            85                             78
  4         2       93            105                            104
  5         1       94            45                             45
  5         2       94            81                             108

Fig. 3. Fitted model curves to the image data for subject 1 (study 2) in the left
(top) and right (bottom) kidney.
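As a side note to the model of Sect. II.D, the signal-to-concentration conversion of Eqs. (2) and (7)-(10) can be sketched in a few lines. This is a sketch only, not the authors' implementation; the relaxivity value r1 passed to it is an assumed constant, not taken from the paper:

```python
import numpy as np

def signal_to_concentration(S, S0, TR, alpha_deg, T10, r1):
    """Invert the spoiled gradient echo signal to CA concentration.

    S         : signal time-course S(t) of an arterial or kidney ROI
    S0        : pre-bolus signal (Eq. 4), time-averaged up to the BAT
    TR, T10   : repetition time and pre-contrast T1, in seconds
    alpha_deg : flip angle in degrees
    r1        : T1-relaxivity of the CA [L/(mmol*s)] (assumed value)
    """
    a = np.asarray(S, dtype=float) / S0            # Eq. (8): a(t) = S(t)/S0
    R10 = 1.0 / T10                                # Eq. (2)
    cos_a = np.cos(np.deg2rad(alpha_deg))
    E0 = np.exp(-R10 * TR)
    b = (1.0 - E0) / (1.0 - cos_a * E0)            # Eq. (9)
    # Eq. (7): R1(t) = -(1/TR) ln[(1 - a b) / (1 - a b cos(alpha))]
    R1 = -np.log((1.0 - a * b) / (1.0 - a * b * cos_a)) / TR
    return (R1 - R10) / r1                         # Eq. (10)
```

Because Eq. (7) is the exact algebraic inverse of Eq. (5), a signal synthesized for a known concentration with the acquisition settings of Sect. II.A (TR = 2.36 ms, FA = 20°) and T_{10,blood} = 1.4 s is recovered to numerical precision.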

C. Perfusion parameters estimation

Table II collects the GFR values calculated for all 5
patients along with the iohexol-based measurements. The
image-derived GFR ratios were estimated as the product of the
K^{trans} constant found during the curve fitting procedure in the
uptake phase and the volume of the corresponding region of
interest. These products were determined separately for the
left and right kidneys and aggregated. The comparison of the
obtained ratios reveals that the fitted curves (see Fig. 3) of the
tracer agent kinetics represent the filtration characteristics of
the examined kidneys correctly. The observed discrepancy
between the image- and blood-based GFR estimates is not
statistically significant at the 95% confidence level (p-value =
0.19); moreover, it does not deviate more from the gold
standard measurement than if, instead of the automatically
determined AIF, manual ROIs were used. As shown in the
Bland-Altman plots (Fig. 4), the mean differences along all 10
studies, as well as the corresponding 95% agreement intervals,
are close for both approaches.

IV. DISCUSSION

The calculated GFR values indicate that the proposed
method of automatic determination of the AIF region exhibits
small but noticeably higher reliability than the
observer-dependent strategy. The automated approach
produces GFR estimates which are biased from the ground
truth values by 11.2 mL/min/1.73 m² on average (σ=22.3),
whereas the results obtained with the manually annotated AIF
regions give a mean difference equal to 17.3 mL/min/1.73 m²
(σ=27.0). The larger scattering of the manually derived GFRs
leads also to broader intervals of agreement. Interestingly, the
automatic procedure tends to overestimate the filtration
characteristics in the higher range of true GFR rates (> 90
mL/min/1.73 m²). On the other hand, the manual selection of
AIF regions more frequently underestimates the renal
performance in the lower range of GFR values. This

aorta bifurcation or candidate seed voxels for the flood-fill
operation to be located. All other actions are fully automatic,
thus giving a chance for enhanced objectivity of the analysis.

REFERENCES
[1] S. Li, F.G. Zöllner, A.D. Merrem, Y. Peng, J. Rørvik, A. Lundervold, L.R. Schad, "Wavelet-based segmentation of renal compartments in DCE-MRI of human kidney: Initial results in patients and healthy volunteers," Computerized Medical Imaging and Graphics, 36:108-118, 2012.
[2] R. Bammer (Ed.), MR and CT Perfusion and Pharmacokinetic Imaging: Clinical Applications and Theory, Wolters Kluwer, Philadelphia, 2016.
[3] M.O. Leach, et al., "Imaging vascular function for early stage clinical trials using dynamic contrast-enhanced magnetic resonance imaging," European Radiology, 22:1451-1464, 2012.
[4] U.I. Attenberger, et al., "MR-based semi-automated quantification of renal functional parameters with a two-compartment model - An interobserver analysis," European Journal of Radiology, 65:59-65, 2008.
[5] M. Cutajar, et al., "The importance of AIF ROI selection in DCE-MRI renography: Reproducibility and variability of renal perfusion and filtration," European Journal of Radiology, 74:e154-e160, 2010.
[6] F. Calamante, et al., "Defining a local arterial input function for perfusion MRI using independent component analysis," Magnetic Resonance in Medicine, 52(4):789-797, 2004.
[7] M.R. Smith, et al., "Removing the effect of SVD algorithmic artifacts present in quantitative MR perfusion studies," Magnetic Resonance in Medicine, 51(3):631-634, 2004.
[8] S.L. Keeling, et al., "Deconvolution for DCE-MRI using an exponential approximation basis," Medical Image Analysis, 13(1):80-90, 2009.
[9] P.S. Tofts, et al., "Precise measurement of renal filtration and vascular parameters using a two-compartment model for dynamic contrast-enhanced MRI of the kidney gives realistic normal values," European Radiology, 22:1320-1330, 2012.
[10] E. Eikefjord, et al., "Use of 3D DCE-MRI for the Estimation of Renal Perfusion and Glomerular Filtration Rate: An Intrasubject Comparison of FLASH and KWIC With a Comprehensive Framework for Evaluation," American Journal of Roentgenology, 204:W273-W281, 2015.
[11] A. Fedorov, R. Beichel, J. Kalpathy-Cramer, J. Finet, J-C. Fillion-Robin, S. Pujol, C. Bauer, D. Jennings, F.M. Fennessy, M. Sonka, J. Buatti, S.R. Aylward, J.V. Miller, S. Pieper, R. Kikinis, "3D Slicer as an Image
Computing Platform for the Quantitative Imaging Network.” Magnetic
observation should be further confirmed in the experiments Resonance Imaging, 30(9):1323–41, 2012.
with higher number of patients. [12] Y. Sato, et al. „3D multi-scale line filler for segmentation and
visualization of curvilinear structures in medical images”. In J. Troccaz,
Overall assessment of the obtained results support the E. Grimson, R. Mösges (eds.), Proc. CVRMed-MRCAS’97, LNCS, pp.
proposed concept of AIF region identification. It remarkably 213–222, 1997.
reduces the need for an observer participating in image post- [13] S.P. Sourbron, et al., “MRI-Measurement of Perfusion and Glomerular
processing. The only involvement that is required is reduced to Filtration inthe Human Kidney With a Separable Compartment Model”.
roughly delineation of subvolumes, where one could expect the Invastigative Radiology, 43(1):40–48, 2008.

Fig. 4. Comparison of methods for GFR quantification. The Bland-Altman plots illustrate agreement between iohexol-based measurements and image-derived GFR ratios using either automatically or manually determined AIF region.
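The agreement statistics visualized in plots like Fig. 4 (mean difference and 95% limits of agreement) follow a standard recipe. Below is a minimal generic sketch, not code from the study, with the limits taken as the mean ± 1.96 standard deviations of the paired differences:

```python
import math

def bland_altman_limits(method_a, method_b):
    """Mean bias and 95% limits of agreement between two paired
    series of measurements (Bland-Altman analysis)."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    bias = sum(diffs) / n
    # sample standard deviation of the differences
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Applied to the image-derived and blood-derived GFR series, the returned interval corresponds to the horizontal limit lines drawn in the Bland-Altman plots.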


An artificial neural network for GFR estimation in the DCE-MRI studies of the kidneys

Michał Strzelecki1, Artur Klepaczko1, Martyna Muszelska1, Eli Eikefjord2, Jarle Rørvik3, Arvid Lundervold4

1 Lodz University of Technology, Institute of Electronics, Lodz, Poland (aklepaczko@p.lodz.pl)
2 Haukeland University Hospital, Department of Radiology, Bergen, Norway
3 University of Bergen, Department of Clinical Medicine, Bergen, Norway
4 University of Bergen, Department of Biomedicine, Bergen, Norway (arvid.lundervold@biomed.uib.no)

Abstract—Dynamic contrast-enhanced magnetic resonance imaging is a diagnostic method directed at estimation of renal performance. Analysis of the image intensity time-courses in the renal cortex and parenchyma enables quantification of the kidney filtration characteristics. A standard approach used for that purpose involves fitting a pharmacokinetic model to image data and optimizing a set of model parameters. It is essentially a multi-objective and non-linear optimization problem. Standard methods applied in such scenarios include non-linear least-squares (NLS) algorithms, such as the Levenberg-Marquardt or Trust Region Reflective methods. The major disadvantage of these classical approaches is the requirement for determining the starting point of the optimization, whose final result may be only a local minimum of the objective function. In contrast, artificial neural networks (ANN) are trained on a large range of parameter combinations, potentially covering the whole solution space. Thus, they appear particularly useful in fitting complex, non-linear, multi-parametric relationships to the observed noisy data and offer a greater ability to detect all possible interactions between predictor variables without the need for explicit statistical formulation. In this paper we compare the ANN and NLS approaches in application to measuring perfusion based on DCE-MR images. The experiments performed on a dataset containing 10 dynamic image series collected for 5 healthy volunteers proved superior performance of the neural networks over classical methods in terms of quantifying true perfusion parameters, robustness to noise and varying imaging conditions.

Keywords—dynamic contrast-enhanced MRI; artificial neural networks; pharmacokinetic modeling; parameter estimation

I. INTRODUCTION

Medical diagnosis of the kidneys' state routinely involves estimation of the glomerular filtration rate (GFR), which quantifies the efficiency of renal tissue perfusion. By definition, GFR is the ratio of blood flow rate between the glomerular capillary bed and the so-called Bowman's capsule. However, these flow rates cannot be measured directly. Thus, in a standard clinical procedure, the GFR value is determined by measuring the level of creatinine in blood, and repeating the examination after a 24-hour period. Nowadays, a more frequent approach utilizes the iohexol clearance test, which requires blood sample acquisition 4 hours after injection of the tracer [1]. Other techniques are based on ad-hoc formulae, which approximate the estimated GFR (eGFR) based on several factors, including plasma creatinine, albumin concentration and correction weights for age, race, and sex [2].

Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is yet another method for GFR assessment. A gadolinium-based contrast agent (CA) bolus is intravenously administered to a patient's blood system. Then, a dynamic sequence of images is acquired in a range of time frames covering the bolus passage through the kidney. The MR signal intensity in the region penetrated by the contrast agent varies in time, as the paramagnetic tracer alters the relaxation properties of the tissue. Note that the measured signal is a non-linear function of the tracer concentration, and this non-linearity disallows direct conversion of image intensity to CA concentration. Special considerations must be taken into account before such a conversion can be applied, including measurement, or otherwise presumption, of the pre-bolus T1 relaxation rate of the kidney. Eventually, the tissue CA concentration time-courses must be fit to a specific pharmacokinetic model. It is usually parameterized by several factors, such as blood flow rate, vessel permeability, blood and tissue volume fractions, and the contrast agent transfer constant Ktrans, which characterizes blood transport from the intravascular to the extravascular compartment. Through multiplication of Ktrans and the kidney volume one obtains the estimate of the GFR ratio. As can be seen, image-based derivation of GFR depends on multiple observed and calculated variables, and involves several steps, each introducing a bias from the measured quantity. On the other hand, this approach is advantageous as it enables estimating GFR separately for the left and right kidney, simultaneously offering limited invasiveness.

Parameters of a pharmacokinetic model are normally determined in a non-linear least-squares (NLS) curve-fitting procedure. Two popular choices include the Levenberg-Marquardt method for the unbounded scenario and the Trust Region Reflective technique when optimizing with constraints. Both methods are iterative in nature and strongly depend on the initial guess of parameter values, often resulting in local minima of the objective function. Moreover, the computations must be repeated for every patient. Thus, the

This paper was supported by the Polish National Science Centre grant no.
UMO-2014/15/B/ST7/05227.

model can hardly exploit general biophysical characteristics of
tissue perfusion. In light of these observations, artificial neural
networks (ANNs) emerge as an attractive alternative as they
demonstrate particular usefulness in fitting complex, non-
linear, multi-parametric relationships to the observed noisy
data. ANNs are trained on:
a. a wide range of response variables, i.e. feasible
parameter values, potentially covering whole solution
space, and
b. their corresponding CA concentration time-courses
calculated using mathematical equations defined by
the pharmacokinetic model.
Consequently, a priori knowledge about the underlying mechanism of signal generation is incorporated into the network [3]. It thus ensures a trade-off between the ability to determine patient-specific parameter estimates and model generalization, making the network robust to noise, outliers and other measurement errors.
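As a concrete point of reference for the NLS baseline discussed above, the Trust Region Reflective method is available through SciPy's `least_squares` routine. The sketch below fits a toy saturating-exponential curve rather than the renal PK model; all names and values are illustrative:

```python
import numpy as np
from scipy.optimize import least_squares

# synthetic, noise-free data from a toy model y = a * (1 - exp(-b t))
t = np.linspace(0.0, 10.0, 50)
a_true, b_true = 2.0, 0.5
y = a_true * (1.0 - np.exp(-b_true * t))

def residuals(p):
    """Difference between the model prediction and the observations."""
    return p[0] * (1.0 - np.exp(-p[1] * t)) - y

# 'trf' selects the Trust Region Reflective method; the presence of
# bounds makes it the natural choice over Levenberg-Marquardt
fit = least_squares(residuals, x0=[1.0, 1.0],
                    bounds=([0.0, 0.0], [10.0, 5.0]), method="trf")
```

Note that the result depends on the starting point `x0`; this sensitivity to initialization is exactly the drawback that motivates the ANN approach.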
The remainder of the paper is organized as follows. In the
Materials and Methods section we provide an overall
description of the DCE-MRI technique and the details of the
dataset we worked on. Afterwards, we present our proposed
architecture of the artificial neural network constructed to
determine the parameters of a pharmacokinetic model used in
our study. The Results section consists of a comparison
between the GFR ratios calculated using the Trust Region
Reflective algorithm and the designed neural network-based solution. In the final section, we discuss the obtained results.

Fig. 1. Conceptual design of the MLP for estimation of the unknown kidney model parameters θ, based on observation vector samples C_t(θ) acquired as a response to the stimulus C_p^art. θ̂ represents the estimated model parameters.

II. MATERIALS AND METHODS

A. Dynamic Contrast-Enhanced MR Imaging

DCE-MRI is essentially a protocol for acquiring a series of T1-weighted volumetric images [4]. The measured signal is enhanced as the gadolinium-based contrast agent bolus alters the longitudinal relaxivity of the tissue of interest. The image acquisition phase must be followed by a post-processing stage in order to obtain diagnostically relevant information. There are two subcategories of approaches that accomplish this goal, classified as semiquantitative and quantitative [5]. The former characterize perfusion based on shape attributes of the signal intensity time curves. The derived metrics include, e.g., absolute and relative signal enhancement, area under the curve, or the late-to-early contrast enhancement ratio. However, these parameters are hard to interpret in terms of the underlying biological processes. On the other hand, quantitative approaches apply a pharmacokinetic (PK) model to encapsulate the relationship between two contrast agent concentration time courses: in the tissue and in the supplying artery. The latter is therefore referred to as the arterial input function (AIF). A given PK model postulates a specific dependence of the CA concentration in the tissue on the AIF, and this link is found by adjusting the set of model parameters to fit the postulated curve to the observed image data. Most importantly, the adjusted parameters have a clear physiological interpretation and can thus serve diagnostic purposes.

In this study, the two-compartment filtration model proposed by Tofts [6] is used. According to this model, the CA concentration in the kidney is determined as

C_t(t) = K^{trans} \int_0^t C_p^{art}(\tau) \otimes g(\tau) \, d\tau + v_p C_p(t),    (1)

with

C_p(t) = C_p^{art}(t) \otimes g(t),    (2)

where ⊗ is the convolution operator, C_p^{art} denotes the arterial input function, K^{trans} is the volume transfer constant, which directly leads to GFR estimation, v_p is the plasma volume fraction, and g(t) denotes the so-called vascular impulse response function (VIRF), which models the rate and delay of CA transportation from a supplying artery to the tissue. We assumed the shape of the VIRF to be defined by the delayed exponential function

g(t) = \begin{cases} 0, & t < \Delta \\ (1/T_g)\, e^{-(t-\Delta)/T_g}, & t \ge \Delta \end{cases}    (3)

with

\int_0^{\infty} g(t)\, dt = 1.    (4)

Hence, T_g (the exponential decay time constant) and Δ (the delay) are two additional model parameters. The detailed formulation of the model, as well as the recipe for T1-weighted signal-to-concentration conversion, can be found in [6]. Note, however, that the first term in Eq. (1) describes the blood transfer from the intravascular to the extravascular extracellular space, whereas the second term describes the concentration in the intravascular compartment.
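A discrete-time sketch of the forward model of Eqs. (1)-(4) is given below. It is our illustrative implementation (function names are ours), with K^trans expressed in the same time unit as the sampling interval:

```python
import math

def virf(t, delay, tg):
    """Delayed-exponential vascular impulse response, Eq. (3)."""
    return 0.0 if t < delay else math.exp(-(t - delay) / tg) / tg

def tissue_concentration(aif, dt, ktrans, vp, delay, tg):
    """Two-compartment filtration model, Eqs. (1)-(2).

    aif -- sampled arterial input function C_p^art
    dt  -- sampling interval (same time unit as ktrans, delay, tg)
    Returns the tissue concentration curve C_t at the same instants.
    """
    n = len(aif)
    g = [virf(i * dt, delay, tg) for i in range(n)]
    # C_p = C_p^art convolved with the VIRF (discrete form of Eq. (2))
    cp = [dt * sum(aif[j] * g[i - j] for j in range(i + 1)) for i in range(n)]
    ct, integral = [], 0.0
    for i in range(n):
        integral += cp[i] * dt                    # running filtration integral
        ct.append(ktrans * integral + vp * cp[i])  # Eq. (1)
    return ct
```

With a constant AIF, for instance, the curve rises monotonically, the tubular term accumulating on top of the vascular one.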

B. Subjects

The experiments reported in this study were conducted on a dataset of 10 DCE-MRI examinations collected for 5 healthy volunteers [7]. All volunteers signed written informed consent to take part in the study, in agreement with the local board for bioethical affairs. Every patient was examined twice, seven days apart. Hence, the whole data set was composed of two parts, denoted as Study 1 and Study 2. The examinations were performed on a 32-channel 1.5 T scanner (Siemens Magnetom Avanto) using 6-channel body matrix and table-mounted 6-channel receiver coils. The 3D FLASH spoiled gradient recalled acquisition sequence was configured with the following parameters: TE = 0.8 ms, TR = 2.36 ms, FA = 20°, parallel imaging factor = 3, in-plane resolution = 2.2 × 2.2 mm2, slice thickness = 3 mm, acquisition matrix = 192×192, number of slices = 30. The dynamic sequence consisted of 74 volumetric frames acquired at 2.3-second intervals. The contrast agent (0.025 mmol/kg of Gd-DOTA) was injected intravenously at a flow rate of 3 mL/s.

Five seconds after injection, the patients were instructed to hold their breath for 26 seconds to enable motion-free image acquisition during the first-pass perfusion. Afterwards, breath-hold periods of 15 seconds were interleaved with 26-second free-breathing slots, during which motion artifacts could manifest more strongly in the acquired data. Hence, the images were registered in the time domain, that is, all volumes in the series were transformed to maximize their matching with one reference frame set apart. In each data set the reference volume was selected individually within the early renal perfusion phase (at approx. 10 s after bolus arrival), where the contrast between renal cortex and medulla allows their clear discrimination. For the registration procedure we employed a b-spline deformable algorithm, as implemented in the Plastimatch utility of the 3D Slicer software.

Apart from the DCE-based measurements, GFR values were estimated using two standard clinical procedures. Firstly, the iohexol clearance test was conducted by injecting 300 mg I/mL of iohexol (Omnipaque 300, GE Healthcare), followed by obtaining a venous blood sample after 4 hours. Secondly, the eGFR was calculated based on the serum creatinine level, using the Modification of Diet in Renal Disease equation [2].

C. Artificial Neural Network Design

The concept of neural network-based PK model parameter identification is shown in Fig. 1. In our implementation, both input and hidden layers were fully connected; for clarity, only part of the neuron connections are shown in Fig. 1. The observation vector C_t(t) in the tissue of interest forms the input to a feed-forward multi-layer perceptron (MLP). The MLP produces estimates θ̂ = h(W, C_t) of the unknown parameters at its output, where W is a network weight vector. Thus, the MLP approximates the mapping θ = H(C_t) from the observation space to the parameter space. It is assumed that this mapping exists and is unique. The MLP learns the mapping in the training phase, performed under the control of the error back-propagation algorithm. The aim of this process is to minimize the mean-square approximation error over a set of K examples θ_k ∈ Θ, k = 1, ..., K, by adjusting the weights W.

Hence, as shown in Fig. 2, there are two modes of operation of the network. In the training mode, the switch is set in the lower position to connect the model output to the input of the MLP. Parameter vectors are generated such that θ ∈ Θ. For each given vector θ, the model response C_t(θ) is:
1. calculated (based on the model equations),
2. probed in the temporal domain (forming a vector C_t(θ) = [C_t1(θ), C_t2(θ), ..., C_tN(θ)], where N is the number of time frames),
3. finally, applied to the input of the MLP.

Fig. 2. Principle of kidney parameter estimation by means of an MLP

Fig. 3. Kidney model stimulated by the AIF

In the next phase the network weights W are fixed and the switch in Fig. 2 is set in the upper position. It corresponds to the recall mode of the MLP operation, leading to the arrangement shown in Fig. 1. Thus, the actually measured CA concentration C_t(t) forms the input to the neural network in this mode. The designed network produces parameter estimates at its output in a relatively short time. This time is equal to the delay related to signal propagation from the ANN input to the output and can be made very short by employing a hardware-implemented ANN. Thus, high speed is an advantageous feature of the technique. Contrary to NLS parameter estimation, no iterative calculations are performed in the recall mode.

The crucial step in ANN design is to collect a representative set of training examples, which should follow the assumed pharmacokinetic model and optimally cover the entire range of possible parameter values. The amount of collected clinical data is too small to satisfy these conditions. Hence, the problem can be best solved through numerical simulation. The idea of training set preparation is illustrated in Fig. 3, where it is explicitly assumed that the CA concentration, i.e. the model response, uniquely depends on the given model parameters θ = [K^trans, v_p, T_g, Δ] and the arterial input function.

The training observations should result from random sampling of the 4-dimensional space Θ. We assumed that K^trans and v_p were drawn from normal and uniform distributions, respectively, with either the mean and standard deviation or the range adjusted to physiologically reasonable values found in the literature (see Table I). We repeated this sampling 100 times for each of the two distributions and formed all possible pairs of these parameters. Then, for each pair we picked one Δ value from a uniform distribution. Eventually, the decay constant T_g of the VIRF was determined indirectly, since the feasible range of its values was not available during our study. However, it is related to the delay variable through the following equation:

MRT = Δ + T_g,    (5)

where MRT denotes the mean residence time, a parameter describing the duration of tracer presence in the tissue-supplying arteries. Possible values of MRT are known and reported in Table I. Thus, we uniformly sampled the MRT range and calculated T_g accordingly.

TABLE I. SAMPLING SETTINGS OF THE PHARMACOKINETIC MODEL PARAMETERS

Parameter       | Min  | Max | Mean | SD  | Random distribution
K^trans [min-1] | 0    | -   | 0.25 | 0.1 | Normal
v_p [-]         | 0.2  | 0.8 | -    | -   | Uniform
Δ [s]           | 1    | 3.5 | -    | -   | Uniform
MRT [s]         | -    | -   | 5.5  | 0.7 | Normal
T_g [s]         | 0.02 | -   | -    | -   | Normal

Having sampled the model parameters, the corresponding observation time-courses can be generated. This step is accomplished by stimulating the model with an arterial input function C_p^art(t). Hence, in order to obtain a model response, we need a set of possible AIFs to choose from. For the needs of this study, we prepared five AIF time courses based on the available DCE examinations from Study 1. The procedure involved delineation of AIF regions inside the abdominal aorta of a given subject. As recommended elsewhere [8], we placed these regions close to the aorta bifurcation into the iliac arteries to minimize the inflow artifact. Next, mean image intensities within the AIF regions were converted into concentration time courses. Eventually, for each pair of parameters we randomly picked one AIF time-course and calculated the model response according to Eq. (1). This step completes the process of the training set construction.

III. RESULTS

A. Evaluation of the simulated dataset

The designed neural network was implemented in Python using the Keras library [9]. Specifically, we used the so-called Sequential API to construct a 3-layer network with 34 and 17 units in the hidden layers, both with rectified linear activation functions. The output layer consisted of four neurons with linear activations, each responsible for the approximation of one PK model parameter. All layers were configured as Dense, that is, each neuron in a given layer was connected with all neurons in the previous layer. The chosen optimization method was stochastic gradient descent and the loss metric was the mean squared error between the predicted and target model responses. The simulated concentration time courses were split into training and test data sets, the latter containing 30% of the simulated tracer time courses. Moreover, 30% of the training vectors were used during optimization as a validation set.
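The training-set sampling described above (distributions as in Table I, with T_g derived from MRT through Eq. (5)) can be sketched as follows. The function name is ours, and clipping the normal draws at the tabulated minima is an assumption:

```python
import random

def sample_parameters(n=100):
    """Draw PK parameter vectors theta = (Ktrans, vp, delay, Tg)
    following the distributions of Table I, pairing all
    (Ktrans, vp) combinations as described in the text."""
    ktrans = [max(0.0, random.gauss(0.25, 0.1)) for _ in range(n)]
    vp = [random.uniform(0.2, 0.8) for _ in range(n)]
    thetas = []
    for k in ktrans:
        for v in vp:  # all possible (Ktrans, vp) pairs
            delay = random.uniform(1.0, 3.5)
            mrt = random.gauss(5.5, 0.7)   # mean residence time
            tg = max(0.02, mrt - delay)    # Eq. (5): MRT = delay + Tg
            thetas.append((k, v, delay, tg))
    return thetas
```

Each sampled vector is then paired with one of the five measured AIFs and pushed through Eq. (1) to obtain the corresponding training observation.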

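Following the description in Section III-A, the network can be assembled with the Keras Sequential API roughly as below; the input length of N = 74 time frames is taken from the acquisition protocol, and the remaining hyperparameters are as reported in the text:

```python
from tensorflow import keras

N = 74  # number of time frames in the dynamic series

model = keras.Sequential([
    keras.Input(shape=(N,)),                      # sampled C_t curve
    keras.layers.Dense(34, activation="relu"),
    keras.layers.Dense(17, activation="relu"),
    keras.layers.Dense(4, activation="linear"),   # Ktrans, vp, delay, Tg
])
model.compile(optimizer="sgd", loss="mse")
```

Training then amounts to a call such as `model.fit(curves, thetas, epochs=200, validation_split=0.3)` on the simulated data.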


Fig. 4. Regression results for 200 randomly selected test samples.

The training procedure was run for 200 epochs. This number, adjusted empirically, proved sufficient to achieve high prediction accuracy on both the validation and test sets. Increasing the number of epochs to 1000 resulted in network overfitting and poorer performance on the test set. The scatter plots shown in Fig. 4 present 200 randomly selected pairs of network-predicted and actual values for each of the four parameters. Apparently, the network predictions lie close to the reference levels, which is quantitatively confirmed by the relatively high R2 scores (as annotated in the figure legends).

B. GFR estimation results

In the recall phase we fed the network with image-derived concentration time courses. In this experiment we were primarily focused on K^trans estimation, since this parameter directly leads to GFR calculation. Thus, for the needs of GFR estimation, in every data set we first manually delineated kidney regions of interest embracing the renal cortex and medulla. This enabled determination of the kidney volumes and the mean intensity time courses required as input to the signal-to-concentration conversion procedure. Finally, the K^trans estimates found for a given subject, as reproduced by the network, were multiplied by the corresponding left and right kidney volumes and aggregated to obtain the GFR ratios. The results are collected in Table II. A comparison of the two methods for image-based GFR estimation (NLS curve fitting and ANN regression) against the reference blood test-based approaches is quantitatively performed using the Bland-Altman plots shown in Fig. 6.

TABLE II. ESTIMATED VALUES OF GFR FOR THE SUBJECTS INCLUDED IN THE STUDY [ML/MIN/1.73 M2]

Subject | Study | Iohexol GFR | MDRD GFR | ANN estimation | NLS estimation
1       | 1     | 107         | 94       | 125            | 103
1       | 2     | 107         | 117      | 108            | 66
2       | 1     | 98          | 94       | 120            | 54
2       | 2     | 98          | 63       | 120            | 45
3       | 1     | 90          | 52       | 68             | 73
3       | 2     | 90          | 116      | 81             | 115
4       | 1     | 93          | 85       | 93             | 78
4       | 2     | 93          | 105      | 94             | 104
5       | 1     | 94          | 45       | 85             | 45
5       | 2     | 94          | 81       | 91             | 108

IV. DISCUSSION

First, it must be noted that the network achieved high accuracy in approximating the true model parameters. The obtained R2 values calculated on the test dataset were close to unity for each of the four estimated variables. Most importantly, this result was achieved for a series of various arterial input functions determined for different subjects. It shows that the network succeeded in learning a general pattern governing the relationship between the PK model responses and its parameters.
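The aggregation step described above reduces to multiplying each kidney's K^trans by its volume and summing. A trivial sketch (variable names are ours), valid when K^trans is expressed per minute and volumes in millilitres:

```python
def gfr_estimate(ktrans_left, volume_left_ml, ktrans_right, volume_right_ml):
    """Whole-organ GFR [ml/min] as the aggregated product of the
    volume transfer constant [1/min] and the kidney volume [ml]."""
    return ktrans_left * volume_left_ml + ktrans_right * volume_right_ml
```

For reporting, the result is conventionally normalized to a body surface area of 1.73 m2.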

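The signal-to-concentration conversion step relies, in general, on the spoiled gradient-echo signal equation together with the linear relaxivity assumption R1(t) = R1,0 + r1 C(t). The sketch below illustrates this generic inversion; it is not the exact recipe of [6], and all names are ours:

```python
import math

def concentration_from_signal(s, s0, r1_0, relaxivity, tr, flip_deg):
    """Invert the SPGR signal equation for tracer concentration,
    assuming R1(t) = R1_0 + r1 * C(t) (linear relaxivity model).

    s, s0      -- post- and pre-contrast signal intensities
    r1_0       -- pre-contrast longitudinal relaxation rate [1/s]
    relaxivity -- contrast agent relaxivity r1 [1/(mM*s)]
    tr         -- repetition time [s]
    flip_deg   -- flip angle [degrees]
    """
    c = math.cos(math.radians(flip_deg))
    e10 = math.exp(-tr * r1_0)
    # scale the pre-contrast saturation factor by the signal ratio
    a = (s / s0) * (1.0 - e10) / (1.0 - c * e10)
    e1 = (1.0 - a) / (1.0 - c * a)       # solve (1-E1)/(1-c*E1) = a
    r1 = -math.log(e1) / tr              # post-contrast relaxation rate
    return (r1 - r1_0) / relaxivity
```

Applied voxel- or ROI-wise to the dynamic series, this maps the mean intensity time courses onto the concentration curves required by Eq. (1).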
Fig. 6. Comparison of image-based GFR estimations (upper row) using Non-linear Least Squares (NLS) curve fitting and the designed Artificial Neural Network
against clinical standards (bottom row) – Iohexol clearance test and Modification of Diet in Renal Disease equation.
Secondly, the GFR estimates produced by the neural network appear more reliable than the values determined using the Trust Region Reflective algorithm for curve fitting. The mean difference between the ANN-based GFR estimates and the measurements using the iohexol clearance test equates to -2.1 ml/min/1.73 m2 (SD = 13.8), and it amounts to 10.5 ml/min/1.73 m2 (SD = 17.2) when compared to the MDRD method. The respective means obtained for the curve-fitting procedure are equal to 17.3 (SD = 27.0) and 29.9 (SD = 29.6) ml/min/1.73 m2. The higher values of the standard deviation in the two latter cases lead also to broader intervals of agreement. Better results for the ANN approach were obtained even though the training process included AIF data from Study 1 only. However, GFR values were equally well approximated for Study 2, which additionally proves that the network gained generalization capability. Note that the measurements estimated using the MDRD equation also deviate from the iohexol clearance test. Thus, the obtained results must also be assessed in light of the uncertainty between the reference values themselves.

In conclusion, model parameter estimation can be efficiently realized using a multi-layer feed-forward neural network. This approach can outperform strategies based on non-linear least-squares optimization. In particular, the designed network achieved reasonable accuracy in estimating the parameters of the two-compartment filtration model of renal perfusion. These observations require further validation using more subjects and other pharmacokinetic models.

REFERENCES

[1] P. Delanaye, "Iohexol plasma clearance for measuring glomerular filtration rate in clinical practice and research: a review. Part 1: How to measure glomerular filtration rate with iohexol?" Clinical Kidney Journal, 9(5):682–699, 2016.
[2] J.R. Zabell, G. Larson, J. Koffel, D. Li, J.K. Anderson, C.J. Weight, "Use of Modification of Diet in Renal Disease for estimating glomerular filtration rate in the urological literature." Journal of Endourology, 30(8):930–933, 2016.
[3] A. Materka, M. Strzelecki, "Parametric testing of mixed-signal circuits by ANN processing of transient responses." Journal of Electronic Testing: Theory and Applications, 9:187–202, 1996.
[4] R. Bammer (Ed.), MR and CT Perfusion and Pharmacokinetic Imaging: Clinical Applications and Theory. Wolters Kluwer, Philadelphia, 2016.
[5] A. Jackson, K.-L. Li, X. Zhu, "Semi-quantitative parameter analysis of DCE-MRI revisited: Monte Carlo simulation, clinical comparisons, and clinical validation of measurement errors in patients with type 2 neurofibromatosis." PLoS ONE, 9(3):e90300, 2014.
[6] P.S. Tofts, et al., "Precise measurement of renal filtration and vascular parameters using a two-compartment model for dynamic contrast-enhanced MRI of the kidney gives realistic normal values." European Radiology, 22:1320–1330, 2012.
[7] E. Eikefjord, et al., "Use of 3D DCE-MRI for the estimation of renal perfusion and glomerular filtration rate: An intrasubject comparison of FLASH and KWIC with a comprehensive framework for evaluation." American Journal of Roentgenology, 204:W273–W281, 2015.
[8] S.P. Sourbron, H.J. Michaely, M.F. Reiser, S.O. Schoenberg, "MRI-measurement of perfusion and glomerular filtration rate in the human kidney with a separable compartment model." Investigative Radiology, 43(1):40–48, 2008.
[9] F. Chollet, Keras repository, https://github.com/fchollet/keras. Accessed May 2018.


Numerical simulation of the b-SSFP sequence in MR perfusion-weighted imaging of the kidney

Artur Klepaczko1, Piotr Skulimowski1, Michał Strzelecki1, Ludomir Stefańczyk (Medical University of Lodz, Department of Diagnostic Imaging, Lodz, Poland, ludomir.stefanczyk@umed.lodz.pl), Eli Eikefjord2, Jarle Rørvik3, Arvid Lundervold4

1 Lodz University of Technology, Institute of Electronics, Lodz, Poland (aklepaczko@p.lodz.pl)
2 Haukeland University Hospital, Department of Radiology, Bergen, Norway
3 University of Bergen, Department of Clinical Medicine, Bergen, Norway
4 University of Bergen, Department of Biomedicine, Bergen, Norway (arvid.lundervold@biomed.uib.no)

Abstract—Magnetic resonance (MR) simulation is one of the performance is an indicator of physiological homeostasis,
possible approaches to test and develop new imaging protocols. It acid-base balance and arterial blood pressure..
can assist in fast, on-demand verification of various hypotheses The major advantage over gold standard approaches
concerning the impact of different physical and/or technical offered by DCE-MRI is its ability to capture renal
factors on image appearance. In this paper, we perform
numerical simulation of dynamic contrast-enhanced MR
performance separately for the left and right kidney along with
imaging. In particular, we present the implementation of the so- giving the insight into their anatomy. The latter feature enables
called balanced steady state free precession sequence and show volume and size measurements as well as detection of any
its application in the synthesis of DCE-MR images mimicking abnormal structures such as tumors or cysts. However, current
perfusion-weighted examinations of the kidney. To this end, we methods routinely employed in clinics for measuring kidney
designed a simplified digital phantom of renal parenchyma perfusion characteristics involve creatinine or iohexol
comprising of kidney cortex and medulla. The phantom was clearance tests to determine the glomerular filtration rate
constructed based on manual segmentation of a real high-resolution CT image of the abdomen. The contrast agent kinetics was incorporated into the model by assigning a time-varying T1 relaxation time to the kidney tissue segments. The relevant T1 time courses were determined based on analysis of real DCE-MR studies. Eventually, the practical aspects of the designed simulator are illustrated in an example application, where selected image-derived perfusion characteristics are referred to physiological parameters of the kidney.

Keywords—dynamic contrast-enhanced MRI; MRI simulation; pharmacokinetic modeling; semi-quantitative parameter estimation

I. INTRODUCTION

Dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) is a diagnostic method for estimating tissue perfusion – the process of delivering blood to the capillary bed of a given body organ. Essentially, DCE-MRI produces a series of gradient echo images acquired after administration of a Gadolinium-based contrast agent (CA) [1]. As the contrast agent bolus passes through the circulatory system, the relaxation rates of the penetrated tissues alter and thus the generated MR signal varies with time. Through analysis of image intensity time-courses one may derive quantitative perfusion estimates which reflect the tissue health state. This paper focuses on perfusion of the kidney – an organ which is particularly crucial due to its multilateral role in the functioning of the body. Kidneys are responsible for blood filtration and the removal of water-soluble waste products of metabolism, surplus glucose and other organic substances. Renal function is typically quantified by the glomerular filtration rate (GFR). Alternatively, GFR can be approximated using various ad-hoc equations that combine creatinine level, age, gender and race (e.g. Modification of Diet in Renal Disease, MDRD) [2]. The reason for image-based diagnostics being a supporting rather than a reference tool is the difficulty in absolute quantification of perfusion indicators.

The general approach to calculating GFR based on a DCE examination consists in fitting a pharmacokinetic model to the observed image intensity time curves [1, 3]. The fitting procedure determines a set of model parameters, such as the blood volume fraction vb or the filtration rate Ktrans, which characterizes blood transfer from the intravascular to the extravascular extracellular space. Multiplying Ktrans by the kidney parenchyma volume directly leads to GFR estimation. Absolute quantification of these parameters requires a priori knowledge of the exact values of the blood and tissue longitudinal relaxation constants. Although methods for estimating subject-specific T1 and T2 rates exist, they are heavily time-consuming and involve multiple scans of a patient using large ranges of repetition times. Therefore, in practice the T1 and – when necessary – T2 values of the blood and tissues in question are often assumed equal to typical values published in the literature.

An alternative approach, which avoids any assumptions concerning relaxation times, attempts to characterize perfusion indirectly by deriving metrics of the signal intensity curves. The evaluated descriptors embrace e.g. the maximum enhancement ratio, time-to-peak enhancement, and area under the curve [4]. Although these metrics can be applied in a differentiation study (such as case vs. control, a longitudinal

This paper was supported by the Polish National Science Centre grant no. UMO-2014/15/B/ST7/05227.
Fig. 1. CT data set used in the study for segmentation of kidney arteries – axial, sagittal and coronal slices (bottom row) and volume reconstruction (top panel). Renal artery stenosis is visible in the left kidney (blue arrow). Image visualization performed using the 3D Slicer software.
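The semi-quantitative descriptors mentioned in the introduction (maximum enhancement ratio, time-to-peak, area under the curve) can be computed directly from an intensity time course. A minimal NumPy sketch; the synthetic curve, the five-frame baseline and the fact that the AUC here spans the whole curve (rather than a bolus-arrival-to-flush-out window) are simplifying assumptions, not values from the paper:

```python
import numpy as np

def semi_quantitative_metrics(t, s, n_baseline=5):
    """MER / TTP / AUC for a DCE intensity curve (illustrative definitions).

    The first n_baseline pre-contrast frames define the baseline (an
    assumption); the AUC is taken over the whole curve for brevity.
    """
    s0 = s[:n_baseline].mean()                   # pre-contrast baseline
    mer = 100.0 * (s.max() - s0) / s0            # maximum enhancement ratio [%]
    ttp = t[np.argmax(s)] - t[0]                 # time-to-peak [s]
    ds = s - s0                                  # enhancement above baseline
    auc = 0.5 * np.sum((ds[1:] + ds[:-1]) * np.diff(t))  # trapezoidal AUC
    return mer, ttp, auc

# toy enhancement curve sampled every 2.3 s (synthetic, not patient data)
t = np.arange(0.0, 120.0, 2.3)
s = 100.0 + 40.0 * np.maximum(t - 10.0, 0.0) * np.exp(-(t - 10.0) / 20.0)
print(semi_quantitative_metrics(t, s))
```

The same three numbers are what Section III later tabulates against the assumed GFR settings.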
trial), they are hardly interpretable in terms of a specific patient's renal performance. Note also that the relationship between the CA concentration in tissue and the measured MR signal is in general nonlinear.

The assumed workflow decomposes into the following series of steps: 1) construction of a digital phantom of the kidney represented by a mesh of tissue-mimicking particles; 2) determination of the temporal variability of the T1 and T2 relaxation times for the two kidney functional parts – cortex and medulla – based on a presumed pharmacokinetic model and GFR value; 3) simulation of the DCE-MRI series using the balanced steady-state free precession (b-SSFP) acquisition sequence; 4) evaluation of the relationship between the assumed GFR and the calculated image-based semi-quantitative perfusion metrics. In the following, we describe this workflow in detail.

II. MATERIALS AND METHODS

A. Digital phantom of the kidney

The digital phantom used in this study was constructed based on kidney segmentation in a real high-resolution computed tomography (CT) image. The contrast-enhanced angiogram of the abdomen was acquired for one male patient with diagnosed renal artery stenosis in the left kidney (cf. Fig. 1). The data set was anonymized and processed after collecting a written informed consent from the patient and approval from the local ethics committee (approval no. RNN/132/17/KE). The in-plane resolution of the image was equal to 0.703 × 0.703 mm2 with a matrix size of 512 × 512 pixels. The number of axial slices was equal to 571, and the spacing between slices, as well as the slice thickness, was set to 0.625 mm. In the case of the left kidney, the obstruction of the renal artery resulted in remarkable shrinkage of the organ parenchyma. Hence, for the purpose of subsequent MRI simulations, the phantom constructed for the right kidney was used. However, in this section we present the segmentation results obtained for both kidneys.

The segmentation was accomplished with the help of the 3D Slicer software [5]. First, the image was thresholded, so that only voxels whose intensity fell into the range 900–1400 HU remained. This operation removed almost all bone structures and undesired soft tissues. Next, 50 seed points were placed over each kidney region and distributed through various cross-sections. Delineation of the kidney parenchyma was performed using Slicer's Robust Statistics Segmenter module, which implements a variant of the active contour method. In order to achieve segmentation of the entire parenchyma, we adjusted the intensity homogeneity to 0.8, the boundary smoothness to 0.6 and the approximate volume to 200 ml, leaving the rest of the parameters at their default values.
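The intensity-window step described above is easy to mirror on any volume. A toy sketch (the array below is random stand-in data, not the study's CT angiogram; only the 900–1400 HU window is taken from the text):

```python
import numpy as np

# Stand-in for the HU thresholding step: keep only voxels whose intensity
# falls in the 900-1400 HU window used above. Random demo data, not real CT.
rng = np.random.default_rng(0)
ct_volume = rng.integers(0, 2000, size=(8, 16, 16))   # fake HU values
mask = (ct_volume >= 900) & (ct_volume <= 1400)       # window of interest
print(int(mask.sum()), "of", ct_volume.size, "voxels retained")
```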
Fig 2. CT image segmentation results: a) renal parenchyma, b) renal cortex, c) visualization of all renal components.

After manual correction of remnant segmentation islands we obtained the kidney regions visualized in Fig. 2–a. In the next step, the kidney pyramids (medulla) and the pelvis were identified semi-automatically using the level-tracing tool from the Editor module (Fig. 2–b). By subtracting the pelvis and medulla regions from the previously found parenchymal segment, we found the voxels belonging to the cortex (Fig. 2–c). Next, the segments defined on the original image raster grid were transformed into their corresponding surface models. For that purpose we used the ray-casting algorithm available in the Volume rendering module and generated sets of triangular faces forming polyhedron objects, ready to be saved in the STL file format. The STL objects were imported into the MATLAB environment and the encoded polyhedrons were filled with the tissue-modeling particles. The custom script responsible for filling allows determination of the particle spacing, individually for each spatial dimension. We assumed an isotropic particle distribution and adjusted the spacing such that every particle occupied 1 mm3 of the tissue. This adjustment ensures an acceptable trade-off between the need to keep the computational complexity low and the demand for maintaining magnetic spin variability within a target image voxel.

B. Modeling of the tracer kinetics

For the needs of MR simulation, every tissue component must be assigned its magnetic properties, i.e. the proton density and the T1 and T2 relaxation times. As stated above, when the gadolinium-based contrast agent mixes with blood or enters the tissue, it alters the characteristics of the surrounding medium. Specifically, the longitudinal relaxation rate changes according to the equation:

R1(t) = R10 + r1 C(t),    (1)

where R10 is the longitudinal relaxation rate of the tissue before CA bolus arrival, C(t) is the time-dependent concentration of the contrast agent in a given tissue, and r1 is the T1-relaxivity of the agent. Note that the same relation links the pre- and post-contrast transverse relaxation rates. The relaxation rates are inversely proportional to the relaxation times, hence:

Ti(t) = 1/Ri(t),    (2)

where i = 1, 2 denotes either the longitudinal or the transverse relaxation process. Knowledge of the time-varying relaxation times is essential for simulating the dynamic MR signal. Since the relaxivities r1 and r2, as well as the typical T10 values in blood and in a normal kidney, are known [6-8] (see Table 1), solving (2) requires determination of C(t) both in the cortex and in the medulla components.

In a simulation study, the CA concentration time course could be prescribed any physiologically feasible trace. Therefore, in this paper, C(t) is calculated by assuming that the CA passage is governed by the classic 2-compartment filtration model of renal perfusion [3]. This pharmacokinetic model can be expressed as

Ct(t) = Ktrans ∫0^t (Cp^art ⊗ g)(τ) dτ + vp Cp^kid(t),    (3)

where Cp^art denotes the arterial input function (AIF), i.e. the CA concentration in the artery supplying the organ with blood, and Cp^kid is the CA concentration in the intravascular space of the kidney tissue:

Cp^kid(t) = (Cp^art ⊗ g)(t) = ∫0^t Cp^art(t − τ) g(τ) dτ.    (4)

The vascular impulse response function (VIRF), denoted as g(t), controls the delay of the CA bolus delivery from the supplying artery to the tissue, as well as the rate of its dispersion within the tissue. Mathematically, the VIRF can be formulated using a delayed exponential function:

g(t) = 0                        for t < ∆,
g(t) = (1/Tg) e^(−(t−∆)/Tg)     for t ≥ ∆.    (5)
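The model of Eqs. (3)–(5) is straightforward to evaluate numerically with a discrete convolution. A sketch using the Table I cortex settings; the AIF shape below is a synthetic stand-in for the measured aortic curve, and the time grid is an arbitrary choice:

```python
import numpy as np

# 2-compartment filtration model, Eqs. (3)-(5): tissue concentration is
# K_trans times the running integral of the AIF convolved with the
# delayed-exponential VIRF, plus a vascular term v_p * C_p^kid.
dt = 0.1                                   # time step [s] (assumption)
t = np.arange(0.0, 120.0, dt)

def virf(t, delay=3.4, tg=1.0):
    """Delayed exponential VIRF g(t), Eq. (5); Table I cortex values."""
    return np.where(t >= delay, np.exp(-(t - delay) / tg) / tg, 0.0)

# synthetic AIF: zero until bolus arrival at t = 10 s, then a smooth bump
aif = np.maximum(t - 10.0, 0.0) * np.exp(-(t - 10.0) / 15.0)

g = virf(t)
cp_kid = np.convolve(aif, g)[: t.size] * dt            # Eq. (4)
ktrans = 0.19 / 60.0                                   # 0.19 min^-1 -> s^-1
vp = 0.2
ct = ktrans * np.cumsum(cp_kid) * dt + vp * cp_kid     # Eq. (3)
print(float(ct.max()))
```

Note the conversion of Ktrans from min^-1 (Table I) to s^-1 so that units match the second-based time grid.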
Hence, the model is fully defined by 4 parameters: Ktrans – the blood transfer or filtration rate, vp – the plasma volume fraction, Tg – the exponential decay time constant, and ∆ – the delay before the CA appears in the capillary bed. In the performed simulations, Ktrans was varied with respect to the kidney phantom volume to obtain different GFR values. The other three parameters were adjusted to fixed values, typical for a healthy kidney.

TABLE I. PARAMETER SETTINGS IN THE EXPERIMENTS

Parameter                  | Study 1 | Study 2 | Study 3 | Study 4 | Study 5
Ktrans cortex [min-1]      | 0.19    | 0.20    | 0.22    | 0.25    | 0.28
Ktrans medulla [min-1]     | 0.19    | 0.19    | 0.22    | 0.22    | 0.25
vp                         | 0.2 (all studies)
Tg [sec]                   | 1 (all studies)
∆ cortex [sec]             | 3.4 (all studies)
∆ medulla [sec]            | 5.4 (all studies)
r1,GdDOTA [sec-1 mM-1]     | 3.6
r2,GdDOTA [sec-1 mM-1]     | 4.3
T10 blood [sec]            | 1.429
T10 cortex [sec]           | 1.442
T10 medulla [sec]          | 1.261

In order to complete the derivation of C(t) one must determine the AIF time course. We accomplished this task by measuring the signal intensity in the abdominal aorta in a set of real DCE-MRI examinations acquired for 10 healthy volunteers [9]. All subjects signed a written informed consent to take part in the study in agreement with the local board for bioethical affairs. The examinations were performed on a 32-channel 1.5 T scanner (Siemens Magnetom Avanto) using 6-channel body matrix and table-mounted 6-channel receiver coils. The series of 74 volumes were acquired at 2.3-second intervals with the following parameters: TE = 0.8 ms, TR = 2.36 ms, FA = 20°, parallel imaging factor = 3, in-plane resolution = 2.2 × 2.2 mm2, slice thickness = 3 mm, acquisition matrix = 192 × 192, number of slices = 30. The Gadolinium-based CA (0.025 mmol/kg of GdDOTA) was injected intravenously at a flow rate of 3 mL/s. The dataset was acquired using the spoiled gradient recalled echo sequence. In such a sequence, the signal equation is defined as

SI = M0 sin α (1 − e^(−TR/T1)) / (1 − cos α e^(−TR/T1)),    (6)

where M0 is the equilibrium magnetization, α is the flip angle (FA) and TR is the repetition time. By taking (6) for the pre-bolus phase and calculating the ratio with a post-contrast time-varying enhanced signal S(t), the unknown magnetization term cancels out:

S0/S(t) = [(1 − e^(−R10·TR)) (1 − cos α e^(−R1(t)·TR))] / [(1 − cos α e^(−R10·TR)) (1 − e^(−R1(t)·TR))].    (7)

The quotient on the left side of Eq. (7) can be calculated based on the observed mean image intensities in a region of interest placed over the abdominal aorta in the respective patient examination. Then, Eq. (7) can be transformed to:

R1(t) = −(1/TR) ln[(1 − a(t)b) / (1 − a(t)b cos α)],    (8)

where

a(t) = S(t)/S0,    (9)

b = (1 − e^(−R10·TR)) / (1 − cos α e^(−R10·TR)).    (10)

In order to evaluate the expression for b in (10), the pre-bolus longitudinal relaxation rate must be somehow assumed. The literature values of T10 for arterial blood at a 1.5 T field strength fall into the range of 1429 to 1531 msec [7]. From Eq. (1) we obtain the desired formula for the arterial input function:

Cp^art(t) = (R1(t) − R10) / r1.    (11)

Finally, C(t), T1(t) and T2(t) can be calculated using Eqs. (1)–(3). Table 1 summarizes the assumed model parameters for the cortex and medulla components along with all relevant simulation constants. Note that the VIRF delay constant for the medulla is larger than for the cortex, to reflect the natural succession of the CA bolus passage, which first perfuses the glomeruli in the cortex and only then traverses into the peritubular capillaries located in the medulla.

C. Simulation of the b-SSFP sequence

In the designed DCE simulator every tissue particle is assigned a magnetization vector, whose temporal evolution is controlled by the Bloch equation [10]. Specifically, our system uses an analytical solution of the Bloch equation in the form:

Mp(t + δt) = Rotz(θg) Rotz(θi) R_relax^12 Mp(t) + R_relax^1 M0,    (12)

where Mp denotes the magnetization vector of a particle p, δt is the simulation time step, and Rotz is the rotation operator, which evolves Mp around the z-axis due to the phase-encoding gradient (θg) and the field inhomogeneity effect (θi). The equilibrium magnetization M0 scales the signal magnitude with respect to several factors, including the Planck and Boltzmann constants, the gyromagnetic ratio, as well as the tissue temperature and proton density. With the other factors being equal, M0 is determined exclusively by the proton density.

The transverse and longitudinal relaxation effects are controlled by the matrices R_relax^12 and R_relax^1:

R_relax^12 = diag(e^(−δt/T2), e^(−δt/T2), e^(−δt/T1)),    (13a)
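The signal-ratio inversion of Eqs. (8)–(10) can be checked numerically: synthesising S(t) from a known R1(t) with the spoiled-GRE equation and inverting it should recover the concentration via Eq. (11). TR, FA and the relaxation constants follow the values quoted above; the concentration samples are arbitrary test inputs:

```python
import numpy as np

TR = 2.36e-3              # repetition time [s], from the protocol above
alpha = np.deg2rad(20.0)  # flip angle FA = 20 deg
R10 = 1.0 / 1.429         # pre-contrast blood R1 [1/s] (T10 = 1.429 s)
r1 = 3.6                  # GdDOTA T1-relaxivity [1/(s*mM)]

def spgr(R1, M0=1.0):
    """Spoiled-GRE signal of Eq. (6)."""
    E = np.exp(-TR * R1)
    return M0 * np.sin(alpha) * (1.0 - E) / (1.0 - np.cos(alpha) * E)

# forward step: synthesise signals for assumed concentrations (arbitrary mM)
C_true = np.array([0.0, 0.5, 1.0, 2.0])
s = spgr(R10 + r1 * C_true)

# inverse step: Eqs. (9), (10), (8) and (11)
a = s / spgr(R10)                                                          # Eq. (9)
b = (1.0 - np.exp(-TR * R10)) / (1.0 - np.cos(alpha) * np.exp(-TR * R10))  # Eq. (10)
R1_rec = -np.log((1.0 - a * b) / (1.0 - a * b * np.cos(alpha))) / TR       # Eq. (8)
C_rec = (R1_rec - R10) / r1                                                # Eq. (11)
print(np.round(C_rec, 6))
```

The unknown magnetization M0 cancels in the ratio a(t), which is exactly why the inversion needs only an assumed T10.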
R_relax^1 = [0, 0, 1 − e^(−δt/T1)]^T.    (13b)

Equation (12) covers the evolution of the magnetization vectors during free precession and with the field gradients switched on. In the excitation phase, the solution to the Bloch equation reduces to:

Mp(t + δt) = R_RF(α_eff) Mp(t),    (14)

which is valid under the assumption that the excitation pulse is much shorter than the relaxation times. R_RF rotates Mp towards the transverse plane by an effective angle α_eff, which encapsulates the off-resonance effects altering the presumed flip angle [11].

The imaging protocol consists of subsequent stages, starting from the application of the RF excitation pulse, and then continuing through the phase encoding, frequency encoding and readout gradients, separated by free precession phases. The measurement sequence is controlled by the echo and repetition times TE and TR, and the flip angle θ. During the signal acquisition phase, the resultant transverse magnetization of all particles is sampled and aggregated in the so-called k-space:

s(t) = Σ_{p=1..np} Mp,x(t) + j Σ_{p=1..np} Mp,y(t),    (15)

where np denotes the total number of particles in a given tissue component. At the end of each repetition cycle one line of k-space is filled in. Finally, the image is reconstructed by an FFT transform of the measured k-space.

The organization of the acquisition events, starting from an RF pulse until signal acquisition, defines the type of a sequence. In the case of the steady-state free precession (SSFP) method, the gradient area in every direction (that is slice selection, phase encoding and frequency encoding) must remain constant between TR cycles. Moreover, if the gradient area is kept zero, then the resultant signal is the coherent sum of the free induction decay and the stimulated echo, and the sequence is referred to as balanced steady-state free precession (b-SSFP). A thorough explanation of this technique can be found in [10]. Here, it must be underlined that the main advantage of b-SSFP is its ability to maintain a relatively large magnetization under very short repetition times. On the other hand, the resulting signal depends on both the T1 and T2 tissue properties and thus b-SSFP images are said to have T1/T2 contrast.

III. RESULTS

In the experimental part, we performed 5 DCE imaging simulations for various settings of the Ktrans parameter, as listed in Table 1. All acquisitions were accomplished under the same MRI protocol: TR = 4.3 msec, TE = 2.3 msec, FA = 25°, acquisition matrix = 64 × 64, in-plane resolution = 2 × 2 mm2, slice thickness = 3 mm, number of slices = 36.

Figure 3–a presents an example of an image-derived arterial input function and the calculated CA concentrations for a given setting of the Ktrans parameters in the kidney cortex and medulla. The corresponding time courses of the longitudinal relaxation times are visualized in Fig. 3–b, whereas examples of the simulated images are shown in Fig. 4. For each simulated series, we calculated the semi-quantitative properties of the mean signal intensity time-courses in the respective regions of interest (medulla or cortex). The collected metrics included the maximum enhancement ratio (MER), expressed as the percentage signal increase relative to the pre-contrast baseline, the time-to-peak (TTP) and the area under the curve (AUC), restricted to the interval between the bolus arrival (approx. 10-15 sec after injection) and the CA flush-out time (usually 80–90 sec of the examination). Interpretation of these quantities is illustrated in Fig. 5 for an example signal time course obtained for the cortex component. The measurement results are presented in Table 2, where the Ktrans parameters were transformed to the diagnostically comprehensible GFR values.

Fig 3. AIF and CA concentration time courses in the kidney for experimental Study 1 (a). Longitudinal relaxation times shown in (b).

Fig 4. Examples of the simulated DCE-MRI images (frame 14). Sagittal, coronal, and axial cross-sections of the digital kidney phantom.
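A single free-precession step of Eq. (12) with the relaxation operators of Eqs. (13a) and (13b) can be written compactly. The cortex T1 comes from Table I; the T2 value, the 1 ms step and the two rotation angles below are illustrative assumptions:

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis (phase-encoding / off-resonance phase)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def bloch_step(M, dt, T1, T2, theta_g, theta_i, M0=1.0):
    """One free-precession step of Eq. (12) using Eqs. (13a) and (13b)."""
    R12 = np.diag([np.exp(-dt / T2), np.exp(-dt / T2), np.exp(-dt / T1)])
    R1v = np.array([0.0, 0.0, 1.0 - np.exp(-dt / T1)])
    return rot_z(theta_g) @ rot_z(theta_i) @ (R12 @ M) + R1v * M0

# magnetization tipped fully into the transverse plane, then one 1 ms step
# (T1 = 1.442 s is the Table I cortex value; T2 = 0.05 s is an assumption)
M = np.array([1.0, 0.0, 0.0])
M_next = bloch_step(M, dt=1e-3, T1=1.442, T2=0.05, theta_g=0.1, theta_i=0.02)
print(M_next)
```

The transverse magnitude shrinks by exp(−δt/T2) regardless of the z-rotations, while the longitudinal component recovers towards M0, mirroring the structure of Eq. (12).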
TABLE II. CALCULATED SEMI-QUANTITATIVE METRICS

Study no. | GFR [ml/min] | MER [%] | TTP [sec] | AUC [a.u.]
1         | 36.6         | 13.3    | 21.1      | 168
2         | 38.5         | 13.3    | 21.2      | 187
3         | 42.3         | 15.4    | 21.4      | 207
4         | 48.1         | 15.4    | 21.8      | 245
5         | 53.9         | 17.4    | 22.3      | 284

IV. DISCUSSION

The obtained results show that there exists a direct correspondence between the semi-quantitative parameters of the perfusion-weighted image intensity time-courses and the underlying physiological tissue properties. The observed relationship is presumably stronger than in real-life conditions due to a series of simplifications introduced at various layers of the model, such as the physiological behavior of the tissue or the MR imaging sequence. However, the proposed simulation framework still has the potential of revealing important dependencies between the CA time-curve characteristics and the true tracer kinetics. For example, in our experiments, the TTP parameter turned out to be completely independent of the GFR setting, while the AUC exhibited the highest sensitivity in this respect. Naturally, more simulations should be performed, and a larger variability of the PK model parameters taken into account, in order to provide reliable verification of any hypotheses concerning the feasibility of using the semi-quantitative approach in diagnostics of the kidney state.

Fig. 5. Mean signal intensity time course for the cortex component obtained for the simulated DCE-MRI image (study 1).

REFERENCES

[1] R. Bammer (Ed.), MR and CT Perfusion and Pharmacokinetic Imaging. Clinical Applications and Theory, Wolters Kluwer, Philadelphia, 2016.
[2] J.R. Zabell, G. Larson, J. Koffel, D. Li, J.K. Anderson, C.J. Weight, "Use of Modification of Diet in Renal Disease for estimating glomerular filtration rate in the urological literature," Journal of Endourology, 30(8):930–933, 2016.
[3] P.S. Tofts, et al., "Precise measurement of renal filtration and vascular parameters using a two-compartment model for dynamic contrast-enhanced MRI of the kidney gives realistic normal values," European Radiology, 22:1320–1330, 2012.
[4] A. Jackson, K.-L. Li, X. Zhu, "Semi-quantitative parameter analysis of DCE-MRI revisited: Monte Carlo simulation, clinical comparisons, and clinical validation of measurement errors in patients with type 2 neurofibromatosis," PLoS ONE, 9(3):e90300, 2014.
[5] A. Fedorov, R. Beichel, J. Kalpathy-Cramer, J. Finet, J.-C. Fillion-Robin, S. Pujol, C. Bauer, D. Jennings, F.M. Fennessy, M. Sonka, J. Buatti, S.R. Aylward, J.V. Miller, S. Pieper, R. Kikinis, "3D Slicer as an Image Computing Platform for the Quantitative Imaging Network," Magnetic Resonance Imaging, 30(9):1323–1341, 2012.
[6] M. Rohrer, H. Bauer, J. Mintorovitch, M. Requardt, H.-J. Weinmann, "Comparison of Magnetic Properties of MRI Contrast Media Solutions at Different Magnetic Field Strengths," Investigative Radiology, 40(11):715–724, 2005.
[7] X. Zhang, E.T. Petersen, E. Ghariq, J.B. de Vis, A.G. Webb, W.M. Teeuwisse, J. Hendrikse, M.J.P. van Osch, "In Vivo Blood T1 Measurements at 1.5 T, 3 T, and 7 T," Magnetic Resonance in Medicine, 70:1082–1086, 2013.
[8] Y. Huang, E.A. Sadowski, N.S. Artz, S. Seo, A. Djamali, T.M. Grist, S.B. Fain, "Measurement and Comparison of T1 Relaxation Times in Native and Transplanted Kidney Cortex and Medulla," Journal of Magnetic Resonance Imaging, 33(5):1241–1247, 2011.
[9] E. Eikefjord et al., "Use of 3D DCE-MRI for the Estimation of Renal Perfusion and Glomerular Filtration Rate: An Intrasubject Comparison of FLASH and KWIC With a Comprehensive Framework for Evaluation," American Journal of Roentgenology, 204:W273–W281, 2015.
[10] M.A. Bernstein, K.F. King, X.J. Zhou, Handbook of MRI Pulse Sequences, Elsevier Academic Press, 2004.
[11] Z. Cao, S. Oh, C.T. Sica, J.M. McGarrity, T. Horan, W. Luo, C.M. Collins, "Bloch-Based MRI System Simulator Considering Realistic Electromagnetic Fields for Calculation of Signal, Noise, and Specific Absorption Rate," Magnetic Resonance in Medicine, 73:237–247, 2014.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th–21st, 2018, Poznań, POLAND

Automatic 3D segmentation of MRI data for detection of head and neck cancerous lymph nodes

Baixiang Zhao, John Soraghan, Gaetano Di Caterina, and Lykourgos Petropoulakis
Centre for Signal and Image Processing, University of Strathclyde, Glasgow, United Kingdom
baixiang.zhao@strath.ac.uk, j.soraghan@strath.ac.uk, gaetano.di-caterina@strath.ac.uk, l.petropoulakis@strath.ac.uk

Derek Grose
Beatson West of Scotland Cancer Centre, Glasgow, United Kingdom
Derek.Grose@ggc.scot.nhs.uk

Trushali Doshi
Micrima Limited, Bristol, UK
trushalid@gmail.com
Abstract—A novel algorithm for automatic 3D segmentation of magnetic resonance imaging (MRI) data for detection of head and neck cancerous lymph nodes (LN) is presented in this paper. The proposed algorithm pre-processes the MRI data slices to enhance quality and reduce artefacts. A modified fuzzy c-mean process is performed through all slices, followed by a probability map which refines the clustering results, to detect the approximate position of cancerous lymph nodes. Fourier interpolation is applied to create an isotropic 3D MRI volume. A new 3D level set method segments the tumour from the interpolated MRI volume. The proposed algorithm is tested on synthetic and real MRI data. The results show that the novel cancerous lymph node 3D volume extraction algorithm achieves a Dice similarity score of over 0.9 on synthetic data and 0.7 on real MRI data. The F-measure is 0.92 on synthetic data and 0.75 on real data.

Keywords—MRI data, head and neck cancer, modified fuzzy c-mean, probability map, 3D level set method (LSM)

I. INTRODUCTION

Radiotherapy, along with surgery, provides the main option for curative cancer treatment. The delineation of cancerous lymph nodes can help determine the clinical target volume (CTV) [1]. The CTV is the exact tumour volume with a margin for subclinical microscopic spread and affected lymph nodes. Definition of this region is fundamental for accurate and effective radiation treatment planning. Development of automated delineation methods can help reduce the inter- and intra-variabilities of manual tumour delineation, providing objective and reliable assistance to clinical oncologists to reduce workload and improve radiation treatment [2].

Fig. 1(a) shows a T1-weighted gadolinium-enhanced head and neck MR image with cancerous lymph nodes. It is known that cancerous lymph nodes have fuzzy boundaries and that they are not significantly distinct from neighbouring tissues. Furthermore, as seen in Fig. 1(b), artefacts of MRI data, such as uneven illumination, are present. All these make automatic segmentation of cancerous areas a very challenging task.

A variety of algorithms have been proposed for cancerous lymph node segmentation, such as training-based approaches [3], graph cut techniques [4], and deformable models [5, 6]. These previous works mostly segment lymph nodes from CT images in 2D, and some approaches rely on manual segmentation. On the contrary, the novel work in this paper addresses the automatic delineation of cancerous lymph nodes from a series of MRI slices, using a 3D level set method (LSM).

Fig. 1. (a) A T1-weighted Gadolinium-enhanced head and neck MR image example with cancerous lymph nodes; (b) An image series from one MRI dataset.

This paper presents a new fully automatic algorithm for 3D segmentation of the cancerous lymph node volume from T1-weighted Gadolinium-enhanced MRI data. The challenges of this work include segmenting tumour regions with fuzzy boundaries and non-uniform intensities, while avoiding adjacent anatomical structures. It is essential to determine the position of the lymph nodes, and the intensity range of the cancerous lymph node area is very important. This novel algorithm is validated on both synthetic and real MRI data from the Beatson West of Scotland Cancer Centre, in Glasgow.

The remainder of the paper is organised as follows. Section 2 describes the new automatic cancerous lymph node 3D segmentation algorithm. Section 3 demonstrates the experimental results on both synthetic and real MRI datasets. The last section summarises the paper.

*Research supported by Beatson Cancer Charity.
II. AUTOMATIC HEAD AND NECK ABNORMAL LYMPH NODE 3D SEGMENTATION

The proposed head and neck cancerous lymph node segmentation is shown in Fig. 2. The algorithm contains two main parts: a) image pre-processing, and b) cancerous lymph node 3D segmentation.

Fig. 2. Flowchart of automatic cancerous lymph nodes segmentation: MRI data → image pre-processing (image enhancement, bias field correction, intensity standardisation, Fourier interpolation) → cancerous lymph nodes 3D segmentation (lymph nodes detection, 3D LSM evolution, post-processing).

A. Image pre-processing

In this work, multiple pre-processing techniques are applied to the MRI data for artefact removal and image enhancement. Background noise is minimised using morphological opening and a majority operation [7], which can remove small noisy regions while preserving edges in the image. The images are enhanced using a background-brightness-preserving contrast enhancement technique [8]. There are two types of intensity variations in MRI data. The first type is intensity inhomogeneity (IIH) (also named the bias field) [9] on a single slice. In this paper, IIH is estimated and corrected based on the techniques described in [10]. The second type of intensity variation occurs between slices, when a certain measured intensity cannot be associated with a specific tissue class [9] across all slices. In this work the effect of this intensity variation is reduced using an intensity standardisation between slices [7]. The MRI data used in this work has anisotropic voxels, while 3D LSM only works well on isotropic voxels. The original voxels are converted to isotropic voxels through Fourier interpolation, which was introduced in [11]. The volume for LSM segmentation is reconstructed in 3D using both real and interpolated slices.

B. Cancerous lymph nodes detection

Results of the level set evolution rely on the initialisation and the target intensity range setting. The initialisation provides a start point (initial seed) for the level set method, and sets the size of the initial seed. The target intensity range is a rough estimate of the range of pixel intensities of the cancerous lymph node region. The detection of lymph nodes helps determine the start point of the level set evolution, and the intensity range of the cancerous lymph node area. The steps for lymph node detection are shown in Fig. 3.

Fig. 3. Workflow for lymph nodes detection: throat detection → modified FCM → probability map.

Cancerous lymph node detection is performed slice by slice. For each MRI slice, the throat is detected by two fuzzy rules [2]. Then a modified fuzzy c-mean (MFCM) algorithm [12] utilises the intensity and spatial information of pixels to organise them into five clusters. Clustering into five categories is based on the assumption that a pre-processed head and neck MRI slice consists of four main tissue types (fatty tissue, cancer tissue, normal tissue, and normal muscle tissue) and the background. Based on prior biomedical knowledge from clinicians, head and neck cancerous regions (tumour and cancerous LN) are normally located around the throat region, and they usually have the first or second brightest intensity among all tissues.

Fig. 4. (a) A pre-processed MRI slice with the detected throat region (in red); (b) Modified fuzzy c-mean: the top left is the original image, the other five are the five clusters, ordered from left to right in the first row and then left to right in the second row, with the intensity of each cluster getting brighter.

Fig. 4(a) illustrates a typical throat detection result. Based on the throat location and the intensities of pixels, the MFCM groups the pixels into five clusters. The cancerous lymph nodes are in the first or second brightest clusters (bottom right and bottom middle in Fig. 4(b)). Regions of interest are taken from these clusters and combined with the edge information of the original image to separate large regions. Morphological opening is used to further separate regions, and small regions are removed.

Fig. 5. (a) A pre-processed MRI slice; (b) Clustering result after removing small noisy regions.

The resulting regions are shown in Fig. 5(b), where the lymph node is identifiable.

To roughly detect the position of the cancerous lymph node, a probability map, W, of lymph node locations is proposed to
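The clustering stage can be illustrated with a plain fuzzy c-means iteration. This is the standard intensity-only FCM that the modified (spatially regularised) algorithm of [12] builds upon, applied to toy 1-D intensities; it is a baseline sketch, not the MFCM itself:

```python
import numpy as np

def fcm(x, c=5, m=2.0, iters=50):
    """Standard fuzzy c-means on 1-D intensities (baseline sketch only --
    not the spatially modified MFCM of [12])."""
    centers = np.quantile(x, np.linspace(0.05, 0.95, c))   # spread-out init
    for _ in range(iters):
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9   # point-center distances
        u = d ** (-2.0 / (m - 1.0))                        # membership update
        u /= u.sum(axis=0)                                 # memberships sum to 1
        centers = (u**m @ x) / (u**m).sum(axis=1)          # center update
    return centers, u

# toy intensities drawn from five well-separated "tissue classes"
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(mu, 2.0, 200) for mu in (10, 60, 110, 160, 210)])
centers, u = fcm(x)
print(np.sort(centers).round(1))
```

With well-separated modes the five recovered centers land near the class means; the MFCM additionally weighs each pixel's neighbourhood, which this sketch omits.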
identify a region from Fig. 5(b). This region has the highest
probability of overlapping with lymph node.

W  Wlocation  Wsize  Weccentricity 

Eq. 1 shows the components of the probability map W.


There are three parts on the right hand side of (1). The location
weight, Wlocation, is the weight factor derived from the location
of regions compared to throat. The second component is Wsize,
i.e. the ‘size weight’. This factor is set for de-weighting large
regions. If a region is not in an expected lymph node position
but has a large area, this region is likely to be a false positive in
the lymph node detection scheme. Thus, Wsize is used, based on (a) (b)
the idea that the location of region centre should be more
Fig. 6. Probability map and demonstration of thresholds. Red lines with texts
emphasised during the detection. This means that the location show position of a-c, and blue lines with texts show position of g-h. Red
of a region’s centre is far more important than other points regions in (a) and (b) show positions of throat region. (a)(b) only display one
inside this region. The last component of the probability map is Weccentricity. This is derived from the region's eccentricity, according to the prior knowledge that cancerous lymph nodes are mostly circular in shape. In (1), α, β and θ are parameters controlling the contribution of the three parts, and are set so that α + β + θ = 1.

The construction of Wlocation may be written as:

Wlocation = wc ∧ wr   (2)

with

wc = 0,  for |Cn − y| < a
wc = (|Cn − y| − a)/(b − a),  for a ≤ |Cn − y| < b
wc = 1 − (|Cn − y| − b)/(c − b),  for b ≤ |Cn − y| < c
wc = 0,  otherwise

wr = 1 − m·|Rn − x|/g,  for |Rn − x| ≤ g
wr = (1 − m)·(1 − (|Rn − x| − g)/(h − g)),  for g < |Rn − x| ≤ h
wr = 0,  otherwise

where Rn and Cn are the row and column coordinates, wc is the weight derived from the column position, and wr is derived from the row position. Parameters x and y are the centroid coordinates of the throat region. Parameters a-c, g and h are adaptive thresholds which are determined based on whether a pixel is in a close-to-throat region or an away-from-throat region, as illustrated in Fig. 6(a) and (b). The parameter m is a control coefficient which ranges from 0 to 1. It is adaptively set based on the ratio between the throat width and the image width.

The intersection symbol in (2) denotes the fuzzy AND operator. The values of a-h are adaptive to the throat region's size, and they are automatically set based on the throat region's width and height. For example, if on an MRI slice the throat height is less than 0.1 of the image height, a will be set as 0.8 of the throat height. Fig. 6 shows probability maps of two MRI slices. It also displays the positions of a-h in these two maps. In (2), a-c are the horizontal coordinates of the red lines a-c in Fig. 6; g-h in (2) are the vertical coordinates of the blue lines g-h in Fig. 6. Pixels located between lines a and b or lines b and c will have different location weights. Fig. 6 displays a-h on one side of each slice; a-h are symmetric on the other side of the same slice.

The value of Wlocation is set according to the prior knowledge that the head and neck lymph nodes are located at the two sides of the throat region but not closely adjacent to the throat.

Assuming that a region has n points, then the size weight of each region may be calculated as follows:

Wsize = Σ(i=1 to n) 1/((xi − xc)² + (yi − yc)² + eps)   (3)

where xi and yi are the coordinates of the ith point; xc and yc are the coordinates of the region centre; eps is a small positive number (e.g. machine epsilon) to ensure non-division by zero. In this case, if a region has a large size but its centre is actually located at an undesired area, this region will not be picked by the algorithm. Thus false positives will be less likely generated in the detection section.

Eccentricity is calculated as follows:

Weccentricity = C/A   (4)

where C is the distance from the region centre to the focus along the major axis of the ellipse, and A is the length of the major axis. When the eccentricity is close to 0, the region is likely to be a circle; when it is close to 1, the region is likely to be elongated.

From the probability map W for the processed MRI slice Fig. 5(a), shown in Fig. 7(a), it can be seen that the two sides of the throat (but not adjacent to the throat) have the highest probability. After overlapping the probability map with the regions in Fig. 5(b), the region which has the highest probability to be a lymph node is determined. This is displayed in Fig. 7(b) with a red mark.

In each slice, one region is detected to be most likely overlapped with a lymph node area. Fig. 9(a) shows the detection results of all slices in a 3D view. If a slice has no cancerous lymph nodes, a false positive region may be detected, such as the left bottom blob shown in Fig. 9(a). Thus, a further processing step (Fig. 8) is added to remove false positives, and also determine the 3D location of cancerous lymph nodes.
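As an illustration only (this is not the authors' code; the thresholds a, b, c, g, h and the coefficient m below are arbitrary placeholder values), the piecewise location weights and their fuzzy-AND combination described above can be sketched as:

```python
import numpy as np

def w_column(d, a, b, c):
    # d = |Cn - y|: zero on [0, a), linear rise on [a, b),
    # linear fall on [b, c), zero elsewhere
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    rise = (d >= a) & (d < b)
    fall = (d >= b) & (d < c)
    w[rise] = (d[rise] - a) / (b - a)
    w[fall] = 1.0 - (d[fall] - b) / (c - b)
    return w

def w_row(d, g, h, m):
    # d = |Rn - x|: gentle decay (controlled by m) on [0, g],
    # steeper decay from (1 - m) down to 0 on (g, h], zero elsewhere
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    near = d <= g
    mid = (d > g) & (d <= h)
    w[near] = 1.0 - m * d[near] / g
    w[mid] = (1.0 - m) * (1.0 - (d[mid] - g) / (h - g))
    return w

def w_location(col_dist, row_dist, a, b, c, g, h, m):
    # fuzzy AND of the two partial weights = pointwise minimum
    return np.minimum(w_column(col_dist, a, b, c), w_row(row_dist, g, h, m))
```

In the paper the thresholds adapt to the throat region's width and height per slice; here they are fixed constants purely to make the ramp shapes concrete.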

Fig. 7. (a) Probability map of Fig. 5(a); (b) Regions and detected rough lymph nodes position marked with a red point.

Fig. 8 shows the section of the proposed algorithm which removes the false positives from the lymph node rough detection process, and thus determines the 3D position of lymph nodes. Firstly, the detected regions are sorted based on their radii. Second, the region with the largest radius R and centre (X, Y) is taken; other regions whose centres are inside (X + R·cosθ, Y + R·sinθ) are assigned to the same group. Then the process is repeated on those regions which have not been grouped, until all regions are grouped. The search process is performed slice by slice. After this process, all regions are clustered into groups. The group which contains the majority of regions is kept, and regions in other groups are discarded; thus the regions overlapped with the cancerous lymph node are all detected. The lymph node centre can be calculated by taking the average of the regions' coordinates.

[Flowchart in Fig. 8: Ungrouped regions → Find largest region with radius Ri and centre (Xi, Yi) → Other regions whose centres are inside (Xi + Ri·cosθ, Yi + Ri·sinθ) are put into group i → All regions are grouped? No: repeat / Yes: End]

Fig. 8. Work flow for grouping detected regions based on their horizontal location.

C. Lymph nodes segmentation with 3D LSM

The initial seed for the 3D level set function is set at the centre of the detected lymph node. The horizontal size of the initial seed is 0.05 of the interpolated image stack's horizontal size. The height of the seed is one third of the rough lymph node height. An example of the initial seed is shown in Fig. 9(b).

In this work, the speed function F used for the 3D level set evolution is [13]:

F = α(ε − |I(x, y, z) − T|) + (1 − α)κ = Fext + Fint   (5)

where κ is the average curvature of the evolving curve; T is the mean intensity of the detected lymph nodes region; ε is the standard deviation of all pixels inside the detected lymph nodes region, and α is a weighting factor, 0 < α < 1. The right-hand side of (5) comprises an external force Fext, which drives the curve to the boundary, and an internal force Fint, which keeps the segmentation result smooth. This level set function is based on the intensities of pixels and the curvature of the evolving curve.

Fig. 9. (a) Regions detected by probability map; (b) Initial seed for 3D LSM.

Some visual results from the 3D LSM segmentation are shown in Fig. 10. Fig. 10(a-c) shows the evolution process and (d) is the 3D LSM result. It can be seen from Fig. 10(c) that there are unsmooth surfaces. These are due to weak boundaries of the lymph nodes region, and some adjacent tissues that have similar intensities. Thus, post-processing is applied. The post-processing uses 3D morphological operations to remove the unsmooth parts. The 3D morphological operation applied comprises, first, 3D erosion to remove noise and separate 3D objects; then the largest connected component is picked among the 3D objects; finally, 3D dilation is applied to compensate the volume loss in erosion. The final segmentation result is shown in Fig. 10(d).

Fig. 10. (a) shows the LSM evolution result after 200 iterations, (b) shows the result after 300 iterations, and (c) is the final LSM result (after 500 iterations). (d) shows the 3D segmentation after post-processing.

III. EXPERIMENTAL RESULTS

The new algorithm was implemented in Matlab, running on a PC with 16G RAM and a 3.2GHz Intel(R) Core(TM) i7-8700 CPU. Experiments were conducted on both synthetic and real MRI datasets.

A. Experiments on synthetic data

A synthetic MRI dataset was generated using a modified Shepp and Logan head phantom function from [14]. In this work, the generated synthetic data has 8 slices, and three configurable parameters to add artefacts, including contrast

reduction, bias field, and noise. In our experiment, the added artefacts include Rician noise [14] with a standard deviation of 10, bias field (IIH) generated by a cos function [14], and decreased LN-to-background contrast with a ratio of 0.3.

Results on the synthetic dataset are measured by the Dice similarity coefficient (DSC). The DSC measures the similarity between two samples A and B; it can be calculated as given in [15]:

DSC(A, B) = 2|A ∩ B| / (|A| + |B|)   (6)

From Fig. 11, the experiments using synthetic data show that the proposed algorithm has tolerance to artefacts, and the automatic segmentation is highly overlapped with the ground truth.

Fig. 11. Simulation on synthetic datasets, with red contours the automatic segmentation and yellow contours the ground truth: (a) result from the bottom slice, (b) result from the middle slice, (c) result from the top slice, (d) automatically extracted 3D LN.

Fig. 12 shows that the DSCs between the automatic segmentation (A in (6)) and the ground truth (B in (6)) are around 90%, and the mean DSC is 91%. The F-measure of the automatic segmentation result is 0.92. This synthetic dataset has 8 slices, and the overall processing time was 152 seconds.

Fig. 12. Dice similarity score for each slice of a synthetic MRI dataset. The x-axis shows the slice number, and the y-axis shows the DSC score.

B. Experiments on real data

The proposed algorithm was tested on 5 real datasets (each one has on average 10 slices) from the Beatson West of Scotland Cancer Centre, in Glasgow. The results in this section demonstrate some extracted 3D cancerous lymph nodes and their contours on 2D slices. Also, the automatic segmentations are compared with a consensus tumour outline. The consensus tumour outlines on 2D axial slices were formed by clinicians from the Beatson West of Scotland Cancer Centre.

Fig. 13 to Fig. 15 show examples of extracted 3D cancerous lymph nodes, and also extracted 2D contours with the annotated consensus outline. It can be seen in Fig. 13-15 that the cancerous lymph nodes in Datasets 1-3 are well extracted. Fig. 13(a)(b)(c) shows that even when lymph node boundaries are very similar to adjacent tissues, the algorithm can still track the lymph node and segment the cancerous region. Furthermore, Fig. 13-15 show that the proposed algorithm provides similar segmentation results as compared to the consensus outline.

Fig. 13. Segmented cancerous lymph nodes from Dataset 1. Red contours are the automatic segmentation, and yellow contours are gold standards (consensus manual outline): (a) 2D contours of the top slice, (b) 2D contours of the middle slice, (c) 2D contours of the bottom slice, (d) 3D visualisation of the lymph node.

Fig. 14. Segmented cancerous lymph nodes from Dataset 2. Red contours are the automatic segmentation, and yellow contours are gold standards (consensus manual outline): (a) 2D contours of the top slice, (b) 2D contours of the middle slice, (c) 2D contours of the bottom slice, (d) 3D visualisation of the lymph node.

Fig. 15. Segmented cancerous lymph nodes from Dataset 3. Red contours are the automatic segmentation, and yellow contours are gold standards (consensus manual outline): (a) 2D contours of the top slice, (b) 2D contours of the middle slice, (c) 2D contours of the bottom slice, (d) 3D visualisation of the lymph node.

The proposed algorithm was quantitatively assessed using the Dice similarity coefficient (DSC) and the F-measure. DSC measures the overlapping rate between two areas. Fig. 16 shows the DSC between the automatic segmentation and the consensus outlines.

As the boxplot in Fig. 16 illustrates, the medians of the DSCs on the 5 datasets are all above 60%. From left to right in Fig. 16, the first dataset has the highest DSC of 90%, datasets 3 and 4 have DSCs around 80%, dataset 2 has a DSC around 70%, and dataset 5 has the lowest DSC of 60%. The mean DSC over the five datasets is
70%. The average false negative rate is 0.0025, and the average false positive rate is 0.2023.

As the bar graph in Fig. 17 illustrates, three datasets have F-measure scores around 0.8 and higher, dataset 2 has an F-measure score of about 0.7, and dataset 5 has the lowest F-measure score of 0.54. The mean F-measure score over the 5 datasets is 0.75. Each dataset has 8-10 MRI slices; the average processing time of the proposed algorithm on each dataset is 250 seconds, and the processing time on each slice is 30 seconds.

Fig. 16. Dice similarity coefficient on 5 head and neck MRI datasets.

Fig. 17. F-measure on 5 head and neck MRI datasets.

IV. CONCLUSION

This paper presented a new algorithm for automatic detection, 3D segmentation, and visualisation of cancerous lymph nodes from T1-weighted Gadolinium-enhanced head and neck MR images. The proposed method was demonstrated to work well on both synthetic and real MRI datasets. The results on synthetic data show that this method is tolerant to artefacts. The results on real data show that this algorithm can segment most tumour regions and can produce smooth surfaces.

In the future, this method will be tested on more MRI datasets. Also, a quantitative study will be undertaken on more annotated data. The modification of de-noising methods, and the involvement of a shape-constrained 3D level set function, are also objectives for future study.

ACKNOWLEDGMENT

The authors would like to acknowledge a grant from Beatson Cancer Charity for their financial support of this study.

REFERENCES

[1] N. G. Burnet, S. J. Thomas, K. E. Burton, and S. J. Jefferies, "Defining the tumour and target volumes for radiotherapy," Cancer Imaging, vol. 4, pp. 153-161, 2004.
[2] T. Doshi, J. Soraghan, L. Petropoulakis, G. Di Caterina, D. Grose, K. MacKenzie, and C. Wilson, "Automatic pharynx and larynx cancer segmentation framework (PLCSF) on contrast enhanced MR images," Biomedical Signal Processing and Control, vol. 33, pp. 178-188, 2017.
[3] J. Feulner, S. K. Zhou, M. Huber, J. Hornegger, D. Comaniciu, and A. Cavallaro, "Lymph node detection in 3-D chest CT using a spatial prior probability," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2926-2932.
[4] Y. Wang and R. Beichel, "Graph-based segmentation of lymph nodes in CT data," in Advances in Visual Computing, Berlin, Heidelberg, 2010, pp. 312-321.
[5] J. Yan, T.-g. Zhuang, B. Zhao, and L. H. Schwartz, "Lymph node segmentation from CT images using fast marching method," Computerized Medical Imaging and Graphics, vol. 28, pp. 33-38, 2004.
[6] J.-Y. Zhou, W. Fang, K.-L. Chan, V. F. H. Chong, and J. B. K. Khoo, "Extraction of metastatic lymph nodes from MR images using two deformable model-based approaches," Journal of Digital Imaging, vol. 20, pp. 336-346, 2007.
[7] R. C. Gonzalez, Digital Image Processing. Prentice Hall, 2016.
[8] T. Tan, K. Sim, and C. P. Tso, "Image enhancement using background brightness preserving histogram equalisation," Electronics Letters, vol. 48, pp. 155-157, 2012.
[9] F. Jager and J. Hornegger, "Nonrigid registration of joint histograms for intensity standardization in magnetic resonance imaging," IEEE Transactions on Medical Imaging, vol. 28, pp. 137-150, 2009.
[10] O. Salvado, C. Hillenbrand, Z. Shaoxiang, and D. L. Wilson, "Method to correct intensity inhomogeneity in MR images for atherosclerosis characterization," IEEE Transactions on Medical Imaging, vol. 25, pp. 539-552, 2006.
[11] S. Campbell, T. Doshi, J. Soraghan, L. Petropoulakis, G. D. Caterina, D. Grose, and K. MacKenzie, "3-dimensional throat region segmentation from MRI data based on Fourier interpolation and 3-dimensional level set methods," in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2015, pp. 2419-2422.
[12] T. Doshi, J. Soraghan, D. Grose, K. MacKenzie, and L. Petropoulakis, "Modified fuzzy c-means clustering for automatic tongue base tumour extraction from MRI data," in 2014 22nd European Signal Processing Conference (EUSIPCO), 2014, pp. 2460-2464.
[13] A. E. Lefohn, J. M. Kniss, C. D. Hansen, and R. T. Whitaker, "A streaming narrow-band algorithm: interactive computation and visualization of level sets," IEEE Transactions on Visualization and Computer Graphics, vol. 10, pp. 422-433, 2004.
[14] R. V. D. Walle, H. H. Barrett, K. J. Myers, M. I. Aitbach, B. Desplanques, A. F. Gmitro, J. Cornelis, and I. Lemahieu, "Reconstruction of MR images from data acquired on a general nonregular grid by pseudoinverse calculation," IEEE Transactions on Medical Imaging, vol. 19, pp. 1160-1167, 2000.
[15] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biol. Skr., vol. 5, pp. 1-34, 1948.

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th-21st, 2018, Poznań, POLAND

Noise Cancellation Method for Speech Signal by Using an Extension Type UKF
Hisako Orimoto, Akira Ikuta
Prefectural University of Hiroshima
Hiroshima, Japan
Email: orimoto@pu-hiroshima.ac.jp

Abstract—Numerous noise suppression methods for speech signals have been developed up to now. In this paper, a new method to suppress noise in speech signals is proposed by use of an extension type Unscented Kalman filter (UKF). A method considering non-Gaussian noise is proposed theoretically by introducing an expansion expression of Bayes' theorem and considering nonlinear correlation information between the speech signal and the observation data. Specifically, by selecting appropriately the sample points and the weight coefficients, an estimation algorithm of the speech signal for a nonlinear system is derived on the basis of the conditional probability distribution. Moreover, expansion coefficients in the estimation algorithm are realized by considering the higher order correlation information. Improvement of the precision of the estimation is expected by considering the non-Gaussian property. The effectiveness of the proposed method is confirmed by applying it to speech signals contaminated by noises.

I. INTRODUCTION

Speech recognition systems have been applied in various fields due to the recent development of digital signal processing techniques. For example, these systems are applied to inspection and maintenance operations in industrial factories and to recording and reporting routines at construction sites, etc., where hand-writing is difficult. For speech recognition in such actual circumstances, some countermeasure methods for surrounding noises are indispensable.

In such a noise suppression task for speech signals, many algorithms applying the Kalman filter have been proposed up to now [1],[2]. However, this filter assumes a linear model as the system and observation equations under the condition of white and Gaussian noise. The actual noises show complex fluctuation forms with non-Gaussian and non-white properties. On the other hand, the extended Kalman filter (EKF) [3] can be applied to nonlinear systems. However, a linearized approximation model for the nonlinear systems is necessary. Therefore, there are many restrictions to apply it to actual sound environment systems. From the above viewpoint, in our previously reported study, a noise suppression algorithm for actual speech signals, without requirement of the assumption of Gaussian white noise, has been proposed [4]. Since the calculation of the expansion coefficients in the previous algorithm was very complicated, a simplified method is required. The UKF [5], applying the Unscented Transformation (UT) to the mean, variance and covariance of input/output, has been proposed. This filter need not linearize the nonlinear systems. However, the restriction of Gaussian noise is required for the application to actual sound environments.

In this paper, a new noise suppression algorithm for actual speech signals contaminated by noise is theoretically proposed in the extension type UKF considering non-Gaussian noise. More specifically, first, nonlinear functions among the time series of the speech signal and noisy observation are adopted to express the system and observation characteristics. Next, the sample points and the weighting coefficients satisfying the certain requirement are introduced. The estimation algorithm for the speech signal considering non-Gaussian noise and linear and nonlinear correlation information is derived by use of an expansion expression of Bayes' theorem. Furthermore, by applying the proposed algorithm to speech signals with several kinds of noises, its effectiveness is experimentally confirmed.

II. THEORY

A. Modeling of sound environment system

The proposed method considers that the system noise wk and the observation noise vk are non-Gaussian. In a complex sound environment, the system noise wk independent of the speech signal xk, the observation yk, and the observation noise vk at a discrete time k are considered. When the observation equation includes unknown parameters ak and bk, these parameters have to be estimated with the specific signal xk simultaneously.

xk+1 = fk(xk) + wk   (1)
yk = hk(xk, ak) + bk vk   (2)

B. Extension type UKF by considering non-Gaussian noise

The UKF assumes Gaussian noises wk and vk. However, the background noise in the actual sound environment shows arbitrary fluctuation of non-Gaussian type. The proposed method considers non-Gaussian noise. The σ points x∗(i)k, a∗(i)k, b∗(i)k and v(i)k are obtained as sample points. The superscript (i) shows the ith σ point. The σ points have to be chosen so as to obtain the same mean and variance as the original variables.

x∗(0)k = x∗k,
x∗(1)k = x∗k + √((1 + λ)Γxk),
x∗(2)k = x∗k − √((1 + λ)Γxk),
a∗(0)k = a∗k,
a∗(1)k = a∗k + √((1 + λ)Γak),
a∗(2)k = a∗k − √((1 + λ)Γak),
b∗(0)k = b∗k,
b∗(1)k = b∗k + √((1 + λ)Γbk),
b∗(2)k = b∗k − √((1 + λ)Γbk),
v(0)k = ⟨vk⟩,
v(1)k = ⟨vk⟩ + √((1 + λ)Γvk),
v(2)k = ⟨vk⟩ − √((1 + λ)Γvk),
W(0) = λ/(1 + λ), W(1) = 1/(2(1 + λ)), W(2) = 1/(2(1 + λ)),   (3)

x∗k = ⟨xk | Yk−1⟩, Γxk = ⟨(xk − x∗k)² | Yk−1⟩,
a∗k = ⟨ak | Yk−1⟩, Γak = ⟨(ak − a∗k)² | Yk−1⟩,
b∗k = ⟨bk | Yk−1⟩, Γbk = ⟨(bk − b∗k)² | Yk−1⟩,

where λ is a regulation parameter. Here, the weighting coefficients W(i) have to satisfy the normalization condition

Σ(i=0 to 2) W(i) = 1.   (4)

Next, the σ points x∗(i)k, a∗(i)k and b∗(i)k are substituted into (2):

y∗(i)k = hk(x∗(i)k, a∗(i)k) + b∗(i)k v(i)k   (i = 0, 1, 2).   (5)

The conditional joint probability distribution of the specific signal xk and the unknown parameters ak and bk is expressed by using an expansion expression of Bayes' theorem [6], as follows:

P(xk, ak, bk | Yk) = P(xk, ak, bk, yk | Yk−1) / P(yk | Yk−1)
= P0(xk | Yk−1) P0(ak | Yk−1) P0(bk | Yk−1)
  · Σ(l=0 to ∞) Σ(m=0 to ∞) Σ(n=0 to ∞) Σ(s=0 to ∞) Almns φ(1)l(xk) φ(2)m(ak) φ(3)n(bk) φ(4)s(yk)
  / Σ(s=0 to ∞) A000s φ(4)s(yk),
Yk = {y1, y2, ..., yk},   (6)

with

Almns = ⟨φ(1)l(xk) φ(2)m(ak) φ(3)n(bk) φ(4)s(yk) | Yk−1⟩.   (7)

The above functions φ(1)l(xk), φ(2)m(ak), φ(3)n(bk) and φ(4)s(yk) are orthonormal polynomials of degrees l, m, n and s with weighting functions P0(xk | Yk−1), P0(ak | Yk−1), P0(bk | Yk−1) and P0(yk | Yk−1), which can be chosen as the probability functions describing the dominant part of the actual fluctuation or as well-known standard probability distributions. As the weighting functions, the Gaussian distribution is adopted:

P0(xk | Yk−1) = N(xk; x∗k, Γxk),   (8)
P0(ak | Yk−1) = N(ak; a∗k, Γak),   (9)
P0(bk | Yk−1) = N(bk; b∗k, Γbk),   (10)
P0(yk | Yk−1) = N(yk; y∗k, Ωk),   (11)

y∗k = ⟨yk | Yk−1⟩, Ωk = ⟨(yk − y∗k)² | Yk−1⟩,

with

N(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)).   (12)

The orthonormal polynomials [6] with the four weighting probability distributions in (8)-(11) are expressed as Hermite polynomials:

φ(1)l(xk) = (1/√(l!)) Hl((xk − x∗k)/√Γxk),   (13)
φ(2)m(ak) = (1/√(m!)) Hm((ak − a∗k)/√Γak),   (14)
φ(3)n(bk) = (1/√(n!)) Hn((bk − b∗k)/√Γbk),   (15)
φ(4)s(yk) = (1/√(s!)) Hs((yk − y∗k)/√Ωk).   (16)

From (6), the estimates of xk, ak and bk can be expressed as follows:

x̂k = ⟨xk | Yk⟩ = Σ(l=0 to 1) Σ(s=0 to ∞) Al00s c1l φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (17)
Pxk = ⟨(xk − x̂k)² | Yk⟩ = Σ(l=0 to 2) Σ(s=0 to ∞) Al00s c2l φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (18)
âk = ⟨ak | Yk⟩ = Σ(m=0 to 1) Σ(s=0 to ∞) A0m0s d1m φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (19)
Pak = ⟨(ak − âk)² | Yk⟩ = Σ(m=0 to 2) Σ(s=0 to ∞) A0m0s d2m φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (20)

b̂k = ⟨bk | Yk⟩ = Σ(n=0 to 1) Σ(s=0 to ∞) A00ns e1n φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (21)
Pbk = ⟨(bk − b̂k)² | Yk⟩ = Σ(n=0 to 2) Σ(s=0 to ∞) A00ns e2n φ(4)s(yk) / Σ(s=0 to ∞) A000s φ(4)s(yk),   (22)

where c1l, c2l, d1m, d2m, e1n and e2n are coefficients satisfying the following relationships:

xk = Σ(l=0 to 1) c1l φ(1)l(xk),
(xk − x̂k)² = Σ(l=0 to 2) c2l φ(1)l(xk),
ak = Σ(m=0 to 1) d1m φ(2)m(ak),
(ak − âk)² = Σ(m=0 to 2) d2m φ(2)m(ak),
bk = Σ(n=0 to 1) e1n φ(3)n(bk),
(bk − b̂k)² = Σ(n=0 to 2) e2n φ(3)n(bk).   (23)

Each expansion coefficient Almns defined by (7) is obtained specifically. The approximation of the expansion coefficients is calculated by substituting the σ points x∗(i)k and y∗(i)k into the conditional expectations of xk and yk:

A0001 = A0002 = A1000 = A2000 = 0,
A1001 = (1/(√Γxk √Ωk)) ⟨(xk − x∗k)(yk − y∗k) | Yk−1⟩
      ≃ (1/(√Γxk √Ωk)) Σ(i=0 to 2) W(i) (x∗(i)k − x∗k)(y∗(i)k − y∗k),
A2001 = (1/(√2 Γxk √Ωk)) ⟨{(xk − x∗k)² − Γxk}(yk − y∗k) | Yk−1⟩
      ≃ (1/(√2 Γxk √Ωk)) Σ(i=0 to 2) W(i) {(x∗(i)k − x∗k)² − Γxk}(y∗(i)k − y∗k),
A1002 = (1/(√2 √Γxk Ωk)) ⟨(xk − x∗k){(yk − y∗k)² − Ωk} | Yk−1⟩
      ≃ (1/(√2 √Γxk Ωk)) Σ(i=0 to 2) W(i) (x∗(i)k − x∗k){(y∗(i)k − y∗k)² − Ωk},
A2002 = (1/(2Γxk Ωk)) ⟨{(xk − x∗k)² − Γxk}{(yk − y∗k)² − Ωk} | Yk−1⟩
      ≃ (1/(2Γxk Ωk)) Σ(i=0 to 2) W(i) {(x∗(i)k − x∗k)² − Γxk}{(y∗(i)k − y∗k)² − Ωk}.   (24)

Furthermore, the expansion coefficients A0101, A0102, A0201, A0202, A0011, A0012, A0021 and A0022 are calculated in the same manner:

A0101 = (1/(√Γak √Ωk)) Σ(i=0 to 2) W(i) (a∗(i)k − a∗k)(y∗(i)k − y∗k),
A0201 = (1/(√2 Γak √Ωk)) Σ(i=0 to 2) W(i) {(a∗(i)k − a∗k)² − Γak}(y∗(i)k − y∗k),
A0102 = (1/(√2 √Γak Ωk)) Σ(i=0 to 2) W(i) (a∗(i)k − a∗k){(y∗(i)k − y∗k)² − Ωk},
A0202 = (1/(2Γak Ωk)) Σ(i=0 to 2) W(i) {(a∗(i)k − a∗k)² − Γak}{(y∗(i)k − y∗k)² − Ωk},
A0011 = (1/(√Γbk √Ωk)) Σ(i=0 to 2) W(i) (b∗(i)k − b∗k)(y∗(i)k − y∗k),
A0021 = (1/(√2 Γbk √Ωk)) Σ(i=0 to 2) W(i) {(b∗(i)k − b∗k)² − Γbk}(y∗(i)k − y∗k),
A0012 = (1/(√2 √Γbk Ωk)) Σ(i=0 to 2) W(i) (b∗(i)k − b∗k){(y∗(i)k − y∗k)² − Ωk},
A0022 = (1/(2Γbk Ωk)) Σ(i=0 to 2) W(i) {(b∗(i)k − b∗k)² − Γbk}{(y∗(i)k − y∗k)² − Ωk}.   (25)

New σ points are obtained by using the estimates, as follows:

x̂(0)k = x̂k,
x̂(1)k = x̂k + √((1 + λ)Pxk),
x̂(2)k = x̂k − √((1 + λ)Pxk),
W(0) = λ/(1 + λ), W(1) = 1/(2(1 + λ)), W(2) = 1/(2(1 + λ)).   (26)

The σ points in (26) are transformed by using the nonlinear system model in (1), as follows:

x∗(i)k+1 = fk(x̂(i)k) + ⟨wk⟩.   (27)

Finally, the predictions are expressed by adopting the σ points in (27):

x∗k+1 = Σ(i=0 to 2) W(i) x∗(i)k+1,   (28)
Γxk+1 = Σ(i=0 to 2) W(i) (x∗(i)k+1 − x∗k+1)² + ⟨(wk − ⟨wk⟩)²⟩.   (29)

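As an aside, the σ-point construction used in (3) and (26) can be checked numerically for a scalar variable. The sketch below is illustrative only (λ is an arbitrary regulation parameter here); it verifies that the weighted σ points reproduce the mean and variance of the original variable, as required:

```python
import numpy as np

# Scalar sigma points as in (3)/(26): one point at the mean and a
# symmetric pair at mean +/- sqrt((1 + lam) * var), with weights
# lam/(1 + lam) and 1/(2(1 + lam)) for each of the pair.
def sigma_points(mean, var, lam):
    spread = np.sqrt((1.0 + lam) * var)
    points = np.array([mean, mean + spread, mean - spread])
    weights = np.array([lam / (1.0 + lam),
                        1.0 / (2.0 * (1.0 + lam)),
                        1.0 / (2.0 * (1.0 + lam))])
    return points, weights

pts, w = sigma_points(2.0, 4.0, lam=1.0)
mean = np.sum(w * pts)               # weighted mean of the sigma points
var = np.sum(w * (pts - mean) ** 2)  # weighted variance of the sigma points
```

With these weights the normalization condition (4) holds by construction, and the weighted mean and variance equal the original 2.0 and 4.0 regardless of λ.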
Since the parameters ak and bk are constants, the time transition models ak+1 = ak and bk+1 = bk are introduced for the recursive estimation. By using these relationships, the predictions are given as follows:

a∗k+1 = âk, Γak+1 = Pak,   (30)
b∗k+1 = b̂k, Γbk+1 = Pbk.   (31)

The state estimation algorithm with the expansion coefficients Almns, reflecting linear and nonlinear correlation information and the statistics of non-Gaussian noise, is completed.
III. EXPERIMENT
Fig. 1. Original female speech signal.
In order to confirm the effectiveness of the proposed adaptive noise suppression algorithm, it is applied to speech signals. More specifically, we estimated the speech signal based on the observation contaminated with additive noises. The following two kinds of noises were adopted: (a) white noise, (b) colored noise generated by the AR(1) model: vk+1 = −0.5vk + ek, where ek is a white noise.
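The colored noise in case (b) follows directly from its AR(1) recursion; a minimal generation sketch (length and seed are arbitrary choices, not from the paper):

```python
import numpy as np

# Colored noise via the AR(1) recursion v[k+1] = -0.5 v[k] + e[k],
# where e[k] is white Gaussian noise.
rng = np.random.default_rng(0)
n = 10000
e = rng.standard_normal(n)
v = np.zeros(n)
for k in range(n - 1):
    v[k + 1] = -0.5 * v[k] + e[k]

# For a long realization, the lag-1 autocorrelation of an AR(1)
# process with coefficient -0.5 is close to -0.5.
rho1 = np.corrcoef(v[:-1], v[1:])[0, 1]
```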
The observation equation model with a background noise of unknown statistics can be expressed as follows:

yk = ak xk + bk vk.   (32)

Using the system equation xk+1 = F xk + G wk, x∗k+1 and Γxk+1 can be expressed as follows:

x∗k+1 = F x̂k + G⟨wk⟩,   (33)
Γxk+1 = F² Pxk + G² ⟨(wk − ⟨wk⟩)²⟩,   (34)

where F and G are calculated from the time correlation information of xk and xk+1:

F = ⟨xk+1 xk⟩ / ⟨xk²⟩, G = √((1 − F²)⟨xk²⟩).   (35)

Fig. 2. Observed female speech signal contaminated by a white noise with an amplitude of 2 times.
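The moment-based fit of F and G in (35) can be sketched as follows (illustrative only; a synthetic first-order autoregressive signal with known coefficients stands in for the speech signal, and unit-variance white wk is assumed):

```python
import numpy as np

# Fit x[k+1] = F x[k] + G w[k] from data, following (35):
# F from the lag-1 correlation, G from the variance balance.
def fit_fg(x):
    F = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
    G = np.sqrt(max((1.0 - F ** 2) * np.mean(x ** 2), 0.0))
    return F, G

# Synthetic signal with known F = 0.8, G = 0.6 (stationary variance 1).
rng = np.random.default_rng(1)
x = np.zeros(20000)
for k in range(len(x) - 1):
    x[k + 1] = 0.8 * x[k] + 0.6 * rng.standard_normal()

F, G = fit_fg(x)
```

For a long realization the recovered F and G are close to the generating values, since ⟨xk+1 xk⟩/⟨xk²⟩ estimates the AR coefficient and (1 − F²)⟨xk²⟩ the driving-noise power.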
Fig. 3. Estimated female speech signal by use of the proposed method.

The data were created by mixing noise with speech signals on a computer. Figure 1 shows the original female speech signal. Figure 2 shows the speech signal contaminated by noise with an amplitude 2 times larger than the original signal. The speech signal estimated by use of the proposed method is shown in Figure 3. The estimated result by the EKF is shown in Figure 4. Furthermore, the same experiment was done using a male speech signal. Figure 5 shows the original male speech signal. The results of applying the proposed method and the EKF to the observed male speech signal (shown in Figure 6) contaminated by a colored noise with an amplitude 2 times larger than the signal are shown in Figures 7 and 8. In addition, both methods were applied to the male speech signal contaminated by a white noise with an amplitude of 5 times (shown in Figure 9). The estimation results are shown in Figures 10 and 11.

Furthermore, the estimation RMS (root mean square) error and a performance evaluation index defined by 10log10(Σxk²/Σ(xk − x̂k)²) [dB] are shown in Tables I, II, III and IV. From these tables, the improved effectiveness of the proposed adaptive algorithm may be clearly noticed in comparison with the EKF.

IV. CONCLUSION

In this paper, a new adaptive noise suppression algorithm for speech signals has been proposed, which is applicable to actual environments with non-Gaussian and non-white noises. The proposed method considered σ points of not only xk but also the unknown parameters ak, bk and the noise vk. Moreover, this study has proposed a method including the higher order correlation information between xk and yk. Our algorithm has been realized by introducing a conditional probability expression as the system model and utilizing Bayes' theorem as the fundamental principle of estimation. Application of our algorithm has been made to real speech signals contaminated

Fig. 4. Estimated female speech signal by use of EKF.

Fig. 5. Original male speech signal.

Fig. 6. Observed male speech signal contaminated by a white noise with amplitude of 2 times.

Fig. 7. Estimated male speech signal by use of the proposed method, based on Fig. 6.

Fig. 8. Estimated male speech signal by use of EKF, based on Fig. 6.

Fig. 9. Observed male speech signal contaminated by a white noise with amplitude of 5 times.

by noises. As a result, it has been revealed by experiments that better estimation results may be obtained by our algorithm as compared with the EKF. The proposed approach is quite different from the traditional standard techniques. However, we are still at an early stage of development, and a number of practical problems are yet to be investigated in the future. These include: (i) application to a diverse range of speech signals in actual noise environments, (ii) extension to cases with multiple noise sources, and (iii) finding an optimal number of expansion terms for the expansion-based probability expressions adopted.

REFERENCES

[1] M. Gabrea, E. Grivel, and M. Najim, "A single microphone Kalman filter-based noise canceller," IEEE Signal Process. Lett., vol. 6, no. 3, pp. 55-57, 1999.
[2] N. Tanabe, T. Furukawa, and S. Tsuji, "Robust noise suppression algorithm with the Kalman filter theory for white and colored disturbance," IEICE Trans. Fundamentals, vol. E91-A, no. 3, pp. 818-829, 2008.
[3] H. J. Kushner, "Approximations to optimal nonlinear filters," IEEE Trans. on Automatic Control, vol. 12, no. 5, pp. 546-556, 1967.
[4] A. Ikuta and H. Orimoto, "Adaptive noise suppression algorithm for speech signal based on stochastic system theory," IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, vol. E94-A, no. 8, pp. 1618-1627, 2011.
[5] S. Julier and J. Uhlmann, "Unscented filtering and nonlinear estimation," Proceedings of the IEEE, vol. 92, no. 3, pp. 401-421, 2004.
Fig. 10. Estimated male speech signal by use of the proposed method, based on Fig. 9.

Fig. 11. Estimated male speech signal by use of EKF, based on Fig. 9.

[6] M. Ohta and H. Yamada, "New methodological trials of dynamical state estimation for the noise and vibration environmental system," Acustica, vol. 55, no. 4, pp. 199-212, 1984.

TABLE I
COMPARISON OF ESTIMATION ERROR BETWEEN EXTENSION TYPE UKF AND EKF BASED ON THE FEMALE SPEECH SIGNAL CONTAMINATED BY WHITE NOISE (dB)

RMS Error
SN Ratio   Proposed Method   EKF
1/1        0.0120            0.02666
1/2        0.0148            0.03199
1/3        0.0195            0.03221
1/4        0.0213            0.03226
1/5        0.0239            0.03277
1/10       0.0289            0.03242

Performance Evaluation Index
SN Ratio   Proposed Method   EKF
1/1        8.6265            1.7170
1/2        6.8470            0.1335
1/3        4.4249            0.07401
1/4        3.6543            0.06154
1/5        2.6530            -0.07569
1/10       1.0952            0.01754

TABLE II
COMPARISON OF ESTIMATION ERROR BETWEEN EXTENSION TYPE UKF AND EKF BASED ON THE FEMALE SPEECH SIGNAL CONTAMINATED BY PINK NOISE (dB)

RMS Error
SN Ratio   Proposed Method   EKF
1/1        0.0130            0.01702
1/2        0.0169            0.03271
1/3        0.0235            0.03246
1/4        0.0272            0.03249
1/5        0.0274            0.03249
1/10       0.0309            0.03249

Performance Evaluation Index
SN Ratio   Proposed Method   EKF
1/1        7.9764            5.616
1/2        4.3926            -0.05917
1/3        2.7951            -0.00004190
1/4        1.5367            -0.00003369
1/5        1.4773            -0.00003125
1/10       0.0449            0.00003842

TABLE III
COMPARISON OF ESTIMATION ERROR BETWEEN EXTENSION TYPE UKF AND EKF BASED ON THE MALE SPEECH SIGNAL CONTAMINATED BY WHITE NOISE (dB)

RMS Error
SN Ratio   Proposed Method   EKF
1/1        0.0129            0.06084
1/2        0.0311            0.06097
1/3        0.0322            0.06105
1/4        0.0376            0.06098
1/5        0.0406            0.06100
1/10       0.0505            0.06105

Performance Evaluation Index
SN Ratio   Proposed Method   EKF
1/1        13.8456           0.02787
1/2        6.177             0.01029
1/3        5.88              -0.002299
1/4        4.5325            0.008438
1/5        3.8666            0.005441
1/10       1.9664            -0.002300

TABLE IV
COMPARISON OF ESTIMATION ERROR BETWEEN EXTENSION TYPE UKF AND EKF BASED ON THE MALE SPEECH SIGNAL CONTAMINATED BY PINK NOISE (dB)

RMS Error
SN Ratio   Proposed Method   EKF
1/1        0.0231            0.05338
1/2        0.024             0.06083
1/3        0.0305            0.06105
1/4        0.0361            0.06104
1/5        0.0404            0.06104
1/10       0.051             0.06104

Performance Evaluation Index
SN Ratio   Proposed Method   EKF
1/1        8.7592            1.165
1/2        8.4074            0.02899
1/3        6.3519            -0.001615
1/4        4.8775            -0.00003818
1/5        3.9057            -0.00003191
1/10       1.8808            -0.00001910

SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Speech Intelligibility in the presence of X4 Unmanned Aerial Vehicle

Marzena Mięsikowska
Faculty of Mechatronics and Machine Engineering
Kielce University of Technology
Kielce, Poland
marzena@tu.kielce.pl

Abstract—The main purpose of this work was to obtain background sound levels and speech intelligibility as well as to evaluate the classification of speech commands in the presence of an unmanned aerial vehicle (UAV) equipped with four rotating propellers. Speech intelligibility was assessed using the speech interference level (SIL) parameter according to ISO 9921. The UAV background sound levels were recorded in laboratory conditions using a Norsonic 140 sound analyzer in the absence of the UAV and in the presence of the UAV. The classification of the speech commands /left, right, up, down, forward, backward, start, stop/ recorded with an Olympus LS-11 was evaluated in laboratory conditions based on Mel-frequency cepstral coefficients and discriminant function analysis. The UAV was hovering at 1.5 m during the recordings. The A-weighted sound level obtained in the presence of the UAV was 70.5 dB(A). The speech intelligibility rating was poor in the presence of the UAV. Discriminant analysis based on Mel-frequency cepstral coefficients showed very successful classification of speech commands, equal to 100%. The evaluated speech intelligibility did not exclude verbal communication with the UAV. The successful classification of speech commands in the presence of the UAV can enable the control of the UAV using voice commands and general communication with the UAV using speech.

Keywords-Unmanned Aerial Vehicle; speech commands; sound levels; speech intelligibility; discriminant analysis

I. INTRODUCTION
Automatic speech recognition (ASR) systems allow humans to communicate with devices and computers by using voice commands. ASR systems can be applied in intelligent houses, vehicle cabins, and speech therapy [1]. By voice commands, one can control computers, multimedia players, and mobile phones. Application of an ASR system on an unmanned aerial vehicle (UAV) can be very attractive [2]. Such an option could enable a pilot to control a UAV by voice commands [3, 4]. This solution adds mobility to the pilot. It can be called a free-hands remote control option that allows the pilot to perform certain tasks with the UAV by using voice commands recognized by the ASR system. However, the speech signal is disturbed in this case by the noise of the UAV. A critical feature of ASR systems is the accuracy of recognition. The recognition accuracy can degrade in the presence of noise, when speech is corrupted by background noise not present during training. Searching for coefficients and classifiers that increase the recognition accuracy in noisy conditions is very important and is still being investigated [5, 6, 7, 8]. ASR systems work best in noise-free conditions. In the presence of any ambient noise, the accuracy of ASR systems can be significantly reduced. Many solutions have been proposed to improve recognition accuracy in noisy environments. The first approach focuses on parameterization methods that are resistant to noise or minimize the effect of noise. Other approaches are based on the adaptation of clean models to the noisy recognition environment (contaminating the models), on transforming noisy speech into clean speech (the noise is removed or reduced from the representation of the speech), or on the implementation of audio-visual speech recognition methods based on lip detection.
Evaluation of environmental conditions using acoustic methods allows a preliminary assessment of speech intelligibility and of the ability to conduct verbal communication under given conditions. Information such as the strength of environmental signals and how they affect verbal communication could be necessary to activate the ASR system. If the speech intelligibility is insufficient in given ambient conditions, then a solution could be adopted to enable communication with the ASR system. Speech intelligibility can be assessed using the International Standard ISO 9921, which specifies the requirements for the performance of speech communication for verbal alert and danger signals, information messages, and speech communication in general [9]. The speech interference level (SIL) is one of the parameters that offer a method to predict and assess speech intelligibility in cases of direct communication.
This study aimed to obtain background sound levels and speech intelligibility, and to evaluate the classification of speech commands in the presence of the UAV in laboratory conditions. The research was aimed at determining the possibility of verbal communication in the presence of the UAV and assessing the initial classification of voice commands.

II. METHODS

A. UAV used in the experiment
The UAV used in this experiment is presented in Figure 1.

Figure 1. The UAV used in the experiment

The UAV is a DJI Mavic Pro. It can move at a maximum speed of 65 km/h (40 mph). The maximum ceiling achieved by the drone is 5000 m. The maximum flight time is 27 minutes. During the experiment the drone was hovering at a height of 1.5 m in the laboratory.

B. Background sound levels
The background sound levels were measured using a Norsonic 140 sound analyzer in laboratory conditions, in the absence of the UAV and in the presence of the UAV hovering at 1.5 m at a distance of 2 m from the recording equipment, as presented in Figure 2.

Figure 2. Measurement in the laboratory

C. Speech intelligibility
Speech intelligibility was evaluated on the basis of the speech interference level (SIL) parameter according to standard ISO 9921 [9]. The SIL parameter is calculated according to (1):

SIL = LS,A,L – LSIL    (1)

where LSIL is the arithmetic mean of the sound-pressure levels in four bands with central frequencies 500 Hz, 1 kHz, 2 kHz, and 4 kHz, and LS,A,L is the level of the speech signal determined by the vocal effort of the speaker. The distance is L = 2 m. The SIL is expressed in dB. If the SIL value exceeds 21 dB, the speech intelligibility is excellent. If the SIL value ranges between 15 and 21 dB, the speech intelligibility is good. If the SIL value ranges between 10 and 15 dB, the speech intelligibility is fair. If the SIL value ranges between 3 and 10 dB, the speech intelligibility is poor. If the SIL value does not exceed 3 dB, the speech intelligibility is bad.
The SIL offers a simple method to predict or assess speech intelligibility in cases of direct communication in a noisy environment. It takes into account a simple average of the noise spectrum, the vocal effort of the speaker, and the distance between the speaker and the listener.

D. Speech commands recordings
The following speech commands were recorded in laboratory conditions using an Olympus LS-11 digital recorder: /forward, backward, up, down, left, right, start, stop/. Each command was spoken three times by a woman speaker aged 38. The commands were spoken at a distance of 0.3 m from the microphone in the presence of the UAV. The microphone was at a distance of 2 m from the UAV. The UAV was hovering at 1.5 m while the commands were spoken, as in Figure 2.

E. Mel-frequency cepstral coefficients (MFCC)
Twelve Mel-frequency cepstral coefficients were extracted from the speech recordings. MFCC were used because of their widespread use in speech recognition and because, combined with discriminant function analysis, they proved a very effective classification method in this experiment when analyzing speech signals recorded in the presence of the UAV.

F. Discriminant function analysis
Discriminant analysis was used to investigate the significant differences between speech commands using the 12 MFCC. The speech commands were taken as the grouping variable and the MFCC were taken as independent variables. Discriminant analysis included the discrimination stage and the classification stage, and was performed using STATISTICA software [10]. In the discrimination stage, the maximum number of discriminant functions computed was equal to the number of groups minus one. Canonical analysis was performed, which determined the successive functions and canonical roots. The standardized coefficients were obtained in each discriminant function. The larger the standardized
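The SIL computation in (1) and the ISO 9921 rating scale quoted above can be sketched as follows. This is a minimal illustration; the function names and the example levels are hypothetical, only the formula and the thresholds follow the text.

```python
# Sketch of the SIL calculation from (1) and the ISO 9921 rating bands
# quoted above. Function names and the example numbers are illustrative.

def sil(speech_level_db, noise_band_levels_db):
    """SIL = L_S,A,L - L_SIL, where L_SIL is the arithmetic mean of the
    noise sound-pressure levels in the 500 Hz, 1 kHz, 2 kHz and 4 kHz bands."""
    l_sil = sum(noise_band_levels_db) / len(noise_band_levels_db)
    return speech_level_db - l_sil

def sil_rating(sil_db):
    """Map a SIL value in dB to the intelligibility rating of ISO 9921."""
    if sil_db > 21:
        return "excellent"
    if sil_db > 15:
        return "good"
    if sil_db > 10:
        return "fair"
    if sil_db > 3:
        return "poor"
    return "bad"

# SIL values reported later in Table 1 map onto the rating scale like this:
print(sil_rating(35.36))  # excellent
print(sil_rating(7.03))   # poor
```

The boundary handling (strict "greater than" at each threshold) is an assumption; the standard's exact treatment of values falling on a band edge is not stated in the text above.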

coefficients, the greater the contribution of the variable to the discrimination between groups. Chi-square tests with successive roots removed were performed. The coefficient of canonical correlation (canonical-R) is the measure of the association between the i-th canonical discriminant function and the group. The canonical-R ranges between 0 (no association) and 1 (very high association). The Wilks'-Lambda statistic is used to determine the statistical significance of discrimination and ranges between 0 (excellent discrimination) and 1 (no discrimination).
The classification stage proceeded after determining the variables that discriminate the speech groups. Due to the eight speech groups, eight classification functions were created according to (2):

Ki(h) = ci0 + wi1 mfcc1 + wi2 mfcc2 + … + wi12 mfcc12    (2)

where h is the speech command considered as a group /backward, down, forward, left, right, start, stop, up/; the subscript i denotes the respective group; ci0 is a constant for the i'th group; wij is the weight for the j'th variable in the computation of the classification score for the i'th group; and mfccj is the observed mel-cepstral value for the respective case. The classification functions can be used to determine to which group each case most likely belongs: the case is classified as belonging to the group for which it has the highest classification score, i.e. for which Ki(h) assumes the highest value. The classification matrix shows the number of cases that were correctly classified and those that were misclassified.

III. RESULTS
In this experiment, the following results were obtained:

A. UAV background sound levels
In Figure 3, the background sound levels recorded with the NOR140 sound analyzer in laboratory conditions in the absence of the UAV (blue line) and in the presence of the UAV (red line) are presented.

Figure 3. The background levels in a laboratory room in the absence of the UAV (blue line) and in the presence of the UAV (red line)

The A-weighted sound level was 70.5 dB(A) in the presence of the UAV and 28.3 dB(A) in the absence of the UAV in the laboratory. The characteristic peaks of the UAV appeared at 10 Hz, 16 Hz, 100 Hz, 200 Hz, 400 Hz, 1 kHz, 1.6 kHz, 4 kHz, and 6.3 kHz.

B. Speech intelligibility
Speech intelligibility based on SIL in laboratory conditions in the presence of the UAV and in the absence of the UAV is presented in Table 1.

TABLE I. THE BACKGROUND LEVELS IN DB(A), LSIL, SIL, AND SPEECH INTELLIGIBILITY IN THE PRESENCE OF THE UAV HOVERING AT 1.5 M, AND IN THE ABSENCE OF THE UAV

                    LAeq [dB(A)]   LSIL [dB]   SIL [dB]   Intelligibility rating
Laboratory              28.3         12.63       35.36     Excellent
UAV in Laboratory       70.5         58.95        7.03     Poor

Speech intelligibility was excellent in the absence of the UAV. Speech intelligibility was poor in the presence of the UAV.

C. Time-frequency analysis of speech recordings influenced by the UAV noise
In Figure 4, the time-frequency analysis of the /forward/ speech command spoken in laboratory conditions in the absence of the UAV and in the presence of the UAV is presented.

Figure 4. The time-frequency analysis of the /forward/ speech command: a) in the absence of the UAV, b) in the presence of the UAV hovering at 1.5 m in laboratory conditions
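The time-frequency analysis shown in Figure 4 is a short-time spectral decomposition. A minimal magnitude-spectrogram sketch follows; the window length and hop size here are illustrative, not the parameters used to produce the figure.

```python
import numpy as np

def spectrogram(x, fs, win_len=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> real FFT per frame.
    Returns (frames x bins) magnitudes and the bin centre frequencies in Hz."""
    w = np.hanning(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[i:i + win_len] * w for i in starts])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    return mags, freqs

# Example: a 1 kHz tone at fs = 8 kHz concentrates energy in the 1 kHz bin.
fs = 8000
t = np.arange(fs) / fs
mags, freqs = spectrogram(np.sin(2 * np.pi * 1000 * t), fs)
print(freqs[np.argmax(mags.mean(axis=0))])  # 1000.0
```

On a recording such as the /forward/ command, stable horizontal ridges in `mags` would correspond to the UAV's tonal peaks, while the time-varying bands correspond to the speech.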

Comparing both figures, apart from the frequency bands that represent the speech command, frequency bands in Figure 4b that might represent the UAV can also be observed.

D. Discriminant function analysis of speech recordings influenced by the UAV noise
Discriminant function analysis was performed based on the 12 MFCC as independent variables and the speech commands as the grouping variable. The analysis showed significant main effects used in the model (Wilks' Lambda: 0.0000146, approx. F(84,38) = 2.373046, p < 0.0018). Seven discriminant functions (Root1, Root2, Root3, Root4, Root5, Root6, and Root7) were created. Chi-square tests with successive roots removed, performed in the canonical stage, are presented in Table 2.

TABLE II. CHI-SQUARE TESTS WITH SUCCESSIVE ROOTS REMOVED

Roots removed   Canonical R   Wilks'-Lambda   Chi-Square   p-value
0               0.977         0.0000          144.77       0.0000
1               0.962         0.0003          104.76       0.0017
2               0.952         0.0043           70.93       0.0274
3               0.916         0.0453           40.24       0.2882
4               0.768         0.2801           16.54       0.8675
5               0.420         0.6828            4.96       0.9864
6               0.414         0.8288            2.44       0.8750

According to Table 2, the chi-square tests with successive roots removed showed the significance of all created discriminant functions used in the model (R=0.976, Wilks'-Lambda=0.000015, p<0.000043). The removal of the first discriminant function still showed a high canonical value R between groups and discriminant functions (R=0.962). The removal of the second through to the seventh discriminant functions also showed high canonical values of R.
After performing the canonical stage and deriving discriminant functions with the 12 MFCC features that discriminate mostly between groups, the classification stage proceeded. The coefficients of the classification functions obtained for the groups are presented in Table 3.
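As a consistency check, the Wilks'-Lambda column of Table 2 can be reproduced from the canonical correlations alone, since Lambda with k roots removed is the product of (1 - R_i^2) over the remaining roots, and the chi-square values follow from Bartlett's statistic -[N - 1 - (p + g)/2] ln(Lambda). This reconstruction uses the standard formulas with N = 24 cases (8 commands x 3 repetitions), p = 12 variables and g = 8 groups; these formulas are not stated in the paper itself.

```python
import math

# Canonical correlations from Table 2, roots 1..7.
R = [0.977, 0.962, 0.952, 0.916, 0.768, 0.420, 0.414]

# Wilks' Lambda with k roots removed: product of (1 - R_i^2) over the
# remaining roots.
lam = [math.prod(1 - r * r for r in R[k:]) for k in range(len(R))]

# Bartlett's chi-square: -(N - 1 - (p + g)/2) * ln(Lambda), with N = 24
# cases, p = 12 MFCC variables, g = 8 command groups.
N, p, g = 24, 12, 8
chi2 = [-(N - 1 - (p + g) / 2) * math.log(l) for l in lam]

print(round(lam[0], 7))   # ~0.0000143, cf. 0.0000146 in Table 2
print(round(chi2[0], 1))  # ~145.0, cf. 144.77 in Table 2
```

The small discrepancies come only from the three-decimal rounding of the canonical correlations in Table 2, which supports the internal consistency of the reported values.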

TABLE III. THE COEFFICIENTS OF CLASSIFICATION FUNCTIONS

ci     K(backward)  K(down)    K(forward)  K(left)    K(right)   K(start)   K(stop)    K(up)
wi1    -50.28       -31.38     -103.52     -81.96     -25.63     -49.29     -114.17    -22.74
wi2    -898.82      -892.42    -831.91     -832.68    -874.20    -850.39    -836.51    -862.49
wi3    3282.10      3139.49    3330.68     3283.14    3107.74    3114.25    3318.72    3058.11
wi4    -3060.29     -2963.03   -3124.39    -3009.91   -2971.47   -2972.72   -3123.85   -2913.48
wi5    866.01       896.98     855.99      981.82     931.33     975.05     906.27     898.03
wi6    -716.33      -754.44    -742.47     -751.61    -778.64    -798.38    -780.08    -764.65
wi7    -857.03      -700.27    -1132.27    -1116.30   -768.36    -750.36    -863.18    -818.06
wi8    -429.81      -464.12    -347.53     -279.94    -464.33    -430.56    -448.34    -426.98
wi9    -991.68      -935.53    -791.66     -987.58    -882.12    -910.04    -862.60    -801.63
wi10   1014.54      1050.14    888.98      1096.50    1003.50    1180.07    1307.87    779.03
wi11   2245.99      2025.78    2261.41     2119.32    2021.85    1818.36    1817.61    2156.13
wi12   -3217.06     -2963.04   -3331.12    -3133.77   -2966.85   -2802.02   -2938.82   -3043.67
ci0    -8345.23     -7859.11   -8456.19    -8090.76   -7837.81   -7754.81   -8366.67   -7611.10

Results of classification using the classification functions K(h) for the speech command groups are presented in Table 4. The value three (3) in Table 4 means that, for the three considered records, three were correctly classified with the considered group using the respective classification function K(h). The value zero (0) means that no record was classified as belonging to the considered group using the function K(h).
According to Table 4, the classification stage was successful (100%) for each group. Discriminant analysis showed significant differences between different speech commands influenced by the UAV noise. There were no significant differences between the same commands influenced by the UAV noise.
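The classification rule behind (2) and Tables 3-4 amounts to a linear score per group followed by an arg-max. A minimal sketch is shown below; the two groups and their weights are made-up toy values, not the Table 3 coefficients.

```python
# Toy illustration of the linear classification functions in (2):
# K_i(h) = c_i0 + sum_j w_ij * mfcc_j, case assigned to the highest score.
# The groups and weights here are hypothetical, not taken from Table 3.

CLASSIFIERS = {
    "up":   {"c0": -1.0, "w": [2.0, -0.5, 1.5]},
    "down": {"c0": -2.0, "w": [0.5, 1.0, -1.0]},
}

def score(group, mfcc):
    p = CLASSIFIERS[group]
    return p["c0"] + sum(w * x for w, x in zip(p["w"], mfcc))

def classify(mfcc):
    """Assign the case to the group whose classification function is largest."""
    return max(CLASSIFIERS, key=lambda grp: score(grp, mfcc))

print(classify([1.0, 0.0, 1.0]))   # up
print(classify([0.0, 2.0, -1.0]))  # down
```

With the actual Table 3 coefficients, the same arg-max over eight K(h) scores of a 12-dimensional MFCC vector reproduces the assignments counted in the Table 4 classification matrix.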

TABLE IV. THE CLASSIFICATION MATRIX

Group Percent K(backward) K(down) K(forward) K(left) K(right) K(start) K(stop) K(up)
backward 100% 3 0 0 0 0 0 0 0
down 100% 0 3 0 0 0 0 0 0
forward 100% 0 0 3 0 0 0 0 0
left 100% 0 0 0 3 0 0 0 0
right 100% 0 0 0 0 3 0 0 0
start 100% 0 0 0 0 0 3 0 0
stop 100% 0 0 0 0 0 0 3 0
up 100% 0 0 0 0 0 0 0 3
Total 100% 3 3 3 3 3 3 3 3

IV. DISCUSSION

A. UAV background sound levels
The A-weighted sound levels were 70.5 dB(A) in the presence of the UAV and 28.3 dB(A) in the absence of the UAV. The characteristic peaks of the UAV appeared at 10 Hz, 16 Hz, 100 Hz, 200 Hz, 400 Hz, 1 kHz, 1.6 kHz, 4 kHz, and 6.3 kHz.

B. Speech intelligibility
Speech intelligibility was excellent in the absence of the UAV and poor in the presence of the UAV in laboratory conditions. The presence of the UAV does not exclude the possibility of verbal communication with the UAV.

C. Time-frequency analysis of speech recordings influenced by the UAV noise
Time-frequency analysis showed frequency bands characteristic of the speech commands as well as frequency bands that can be characteristic of the UAV. This means that it may be possible to extract speech recordings from the UAV noise.

D. Discriminant function analysis of speech recordings influenced by the UAV noise
Discriminant analysis based on the 12 MFCC showed significant differences between speech command groups. The classification was very accurate (100% correct classification), which can confirm the possibility of verbal communication with the UAV. All commands were correctly classified and assigned to the appropriate speech group. Discriminant analysis and MFCC coped perfectly with the correct classification of the voice commands.

V. CONCLUSIONS
This study aimed to obtain background sound levels and speech intelligibility as well as to evaluate the classification of speech commands in the presence of the UAV in laboratory conditions. The research was aimed at determining the possibility of verbal communication in the presence of the UAV and assessing the initial classification of voice commands.
The A-weighted sound level of the UAV hovering at 1.5 m in laboratory conditions was 70.5 dB(A). The characteristic peaks of the UAV appeared at 10 Hz, 16 Hz, 100 Hz, 200 Hz, 400 Hz, 1 kHz, 1.6 kHz, 4 kHz, and 6.3 kHz. Speech intelligibility was excellent in the absence of the UAV and poor in the presence of the UAV. Time-frequency analysis showed characteristic bands for the speech and characteristic bands for the UAV. Discriminant analysis showed significant differences between speech command groups. The classification was very successful, at 100% accuracy.

REFERENCES
[1] M. Mięsikowska and E. de Ruiter, "Automatic recognition of voice commands in a car cabin," Measurement, Automation, Monitoring, vol. 60, no. 8, pp. 652-654, 2014.
[2] M. Draper, G. Calhoun, H. Ruff, D. Williamson, and T. Barry, "Manual versus speech input for unmanned aerial vehicle control station operations," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 47, no. 1, pp. 109-113, Los Angeles, CA: SAGE Publications, 2003.
[3] S. Bold, B. Sosorbaram, B.-E. Batsukh, and S. Ro Lee, "Autonomous Vision Based Facial and Voice Recognition on the Unmanned Aerial Vehicle," International Journal on Recent and Innovation Trends in Computing and Communication, vol. 4, no. 2, pp. 243-249, February 2016.
[4] S. S. Anand and R. Mathiyazaghan, "Design and Fabrication of Voice Controlled Unmanned Aerial Vehicle," Journal of Aeronautics & Aerospace Engineering, vol. 5, no. 2, April 2016. DOI: 10.4172/2168-9792.100016
[5] A. M. Prodeus, "Performance measures of noise reduction algorithms in voice control channels of UAVs," 2015 IEEE International Conference Actual Problems of Unmanned Aerial Vehicles Developments (APUAVD), Kiev, 2015, pp. 189-192.
[6] A. Revathi and C. Jeyalakshmi, "Robust speech recognition in noisy environment using perceptual features and adaptive filters," 2017 2nd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, 2017, pp. 692-696.
[7] S. Ondas, J. Juhar, M. Pleva, A. Cizmar, and R. Holcer, "Service Robot SCORPIO with Robust Speech Interface," International Journal of Advanced Robotic Systems, vol. 10, no. 1, pp. 1-11, 2013.
[8] Z. L. Zha et al., "Robust speech recognition combining cepstral and articulatory features," 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, 2017, pp. 1401-1405.
[9] ISO/IEC 9921 – Assessment of Speech Intelligibility.
[10] Discriminant function analysis – StatSoft Electronic Documentation: http://www.statsoft.com/textbook/discriminant-function-analysis


Low band continuous speech system for voice pathologies identification

Hugo Cordeiro
Dept. of Electronic, Telecommunications and Computers, ISEL-IPL
Lisbon, Portugal
hcordeiro@deetc.isel.ipl.pt

Carlos Meneses
Dept. of Electronic, Telecommunications and Computers, ISEL-IPL
Lisbon, Portugal
cmeneses@deetc.isel.ipl.pt

Abstract — This paper describes the impact of signal bandwidth reduction on the identification of voice pathologies. The implemented systems evaluate the identification of 3 classes: healthy subjects, subjects diagnosed with physiological larynx pathologies, and subjects diagnosed with neuromuscular larynx pathologies. Continuous speech signals are down-sampled to 4 kHz and the extracted spectral parameters are applied to a GMM classifier. No significant change in accuracy occurs, making it possible to conclude that the low frequencies contain sufficient information to allow the classification of pathologies. A second objective is to test the effects of suppressing the voice activity detection and of increasing the analysis window length. In both cases the accuracy increases. In conclusion, a pathological voice identification system based on signals sampled at 4 kHz, without voice activity detection and with an analysis window length of 40 ms, is proposed, achieving 81.8% accuracy. The proposed system also has the advantage of reducing the storage memory and the processing time.

Keywords: Voice pathologies identification, Low band speech analysis, Spectral parameters, Voice activity detection.

I. INTRODUCTION
Pathological voice identification is the process of distinguishing between subjects with and without voice pathologies. Voice pathologies identification is the process of distinguishing subjects by the type of pathology, which can include pathological voice identification. The use of speech in these tasks, as a non-invasive, easy and quick method, is useful in screening situations or as a complementary method for the diagnosis of voice pathologies.
Speech analysis for identifying pathological voices typically takes the acoustic or the spectral approach. Acoustic analysis methods provide perturbation measures such as jitter and shimmer [1], [2]. However, in [3], it is suggested that perturbation analysis may not be applicable to aperiodic signals. In recent years, different techniques based on non-linear dynamic models have been applied to explore the dynamic behaviour of biomedical signals, including voice signals [4], [5]. Research shows that acoustic measures, such as jitter, shimmer and harmonic-to-noise ratio (HNR), are associated with deviation from normal voice quality.
For pathological voice identification, spectral parameters such as the energy spectrum [6], mel-frequency cepstral coefficients (MFCC) [7]–[9] and formant analysis [10], [11] achieve an accuracy rate over 90%. One of the advantages of this approach is that pitch estimation, a difficult task when dealing with voice disorders [8], is not required. Typically, the sample used for analysis is the sustained vowel /a/, since it is produced with the vocal tract completely open and is correlated with the electroglottograph [12]. In [10] formants are used as parameters for detecting voice pathologies. The authors concluded that patients change the vocal tract in different ways to compensate for the vocal fold handicap.
For pathological voice identification, the use of continuous speech has been shown to outperform the use of the more traditional vowel /a/ [13]–[15]. In these studies, it is verified that the spectral richness present in continuous speech contributes to a better detection of the pathologies. However, studies with continuous speech have been limited by the lack of databases, the MEEI database [16] being one of the few that contains continuous speech files that are long enough to produce reliable research.
Independently of the type of parameters or phonetic content, the sampling frequency is normally above 20 kHz, leading to signal bandwidths above 10 kHz. However, studies [11], [17] in the context of selecting spectral parameters for pathological voice identification with the vowel /a/ show that it is possible to detect pathological voices with information contained below 2 kHz. The objective of this work is to extend former work in voice pathologies identification with continuous speech sampled at 25 kHz [14], to test whether the use of lower band speech signals sampled at only 4 kHz is also enough and does not reduce the final accuracy. A second objective is to test the effects of suppressing the voice activity detection (VAD) and of increasing the length of the analysis window. For unhealthy voices, the VAD, developed for healthy voices, removes a considerable portion of signal from unhealthy subjects that can contain valuable information. The increase of the analysis window length increases the frequency resolution although it loses resolution in time.
The remainder of this article is organised as follows: section II presents the related work about spectral analysis and voice pathologies identification. Section III describes the

objectives and experimental setup. Section IV presents the results and discussion. Conclusions are presented in section V.

II. RELATED WORK

A. Spectral Analysis
The spectral envelope peaks with small bandwidth carry formant information. In [11], to verify whether there is a change in the formants for unhealthy voices, the spectral envelope was estimated by spectral analysis with a 30th order linear prediction filter.
For a sustained /a/ produced by healthy subjects, the harmonics with the highest energy are those corresponding to the first formant (F1), corresponding to the first peak of the spectral envelope (P1), as can be seen in Fig. 1a). The frequency of the first formant, for the vowel /a/, is typically above 550 Hz. A first peak in the spectral envelope may also occur to model the first harmonics, but with energy lower than that of the first formant and with a bandwidth that tends to increase. Due to the vibration of the vocal folds, modelled by two poles at the origin, a spectral tilt occurs that diminishes the energy of the high frequencies. Being a voiced phoneme, the signal is not noisy and exhibits a harmonic structure.
However, for the corpora extracted from the database of the Bioengineering Group of the Engineering School of São Paulo University (DBSP), first used in [5], for all the 31 unhealthy subjects the first two harmonics have a higher energy compared to the higher-order harmonics, creating a first peak (P1) with an energy higher than the first formant (F1), as can be seen in Fig. 1b). These two harmonics, whose amplitudes are designated in the literature by H1 and H2, are the object of several studies [18], [19]. These studies reveal that the difference between these two amplitudes is related with breathy voices, since the second harmonic has less energy than the first harmonic, which is not the case in healthy voices. Recall that breathy voices are present in many of the pathologies. For the pathologic voice identification task, with only this parameter, an accuracy of 100% in the DBSP database and 74% in the MEEI database is achieved [11], [17].
With the evolution of the disease, vocal folds tend to increase aperiodicity and the signal becomes noisier, especially at higher frequencies. The spectral tilt decreases and the energy in the high frequencies increases, which impairs the modelling of the first two harmonics, as shown in Fig. 1c). Combining the first peak with aperiodicity and noise parameters [17], the accuracy increases to 94.2%.
This analysis reveals that there is significant information at low frequencies that allows discrimination of pathological voices and healthy voices.

B. Voice pathologies identification using Continuous Speech
In [13], a continuous speech signal was introduced into the identification of laryngeal pathologies and compared with the traditional sustained vowel /a/ at a 25 kHz sampling frequency. A three-class classification system was implemented to distinguish between healthy subjects and those suffering from paralysis or nodules/edemas.

Figure 1 – Spectrum and Spectral envelope. a) Healthy subject, P1=F1>550 Hz, low noise, high spectral tilt. b) Unhealthy subjects, P1<550 Hz, low noise, high spectral tilt. c) Unhealthy subject, P1=F1>550 Hz, high noise, low spectral tilt.
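The spectral-envelope estimate described above comes from linear prediction. The sketch below implements the Levinson-Durbin recursion, one standard way to solve the linear-prediction normal equations (the paper only states that a 30th order filter was used, not the solver); it is demonstrated on a synthetic first-order signal with a low prediction order rather than real voice data.

```python
import random

def autocorr(x, order):
    """Biased autocorrelation estimates r[0..order]."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) / n for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for prediction coefficients a (a[0] = 1)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k      # prediction error update
    return a

# Synthetic AR(1) signal x[n] = 0.9 x[n-1] + e[n]; LPC should recover ~0.9.
random.seed(1)
x, prev = [], 0.0
for _ in range(20000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    x.append(prev)

a = levinson_durbin(autocorr(x, 2), 2)
print(round(-a[1], 2))  # close to 0.9
```

At order 30, the same recursion yields the all-pole envelope 1/|A(e^jw)| whose first peak P1 is compared against the first formant F1 in the analysis above.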

Two classifiers were implemented: one based on a support vector machine (SVM) and the other based on a Gaussian mixture model (GMM), both using MFCC parameters. The parameters were extracted from the speech signal with a 20 ms window length and 10 ms overlap, after pre-processing the signal with a VAD to remove silences. For continuous speech, the GMM system achieved a 74% accuracy rate while the SVM system obtained a 72% accuracy rate. For the sustained vowel /a/, the accuracy rates achieved by the GMM and SVM systems were 66% and 69% respectively. Continuous speech produces better results for the two systems, with the GMM classifier outperforming the SVM classifier.
In addition to the traditional MFCC parameter, new formant parameters were also evaluated for the classifiers described in [14]. The first of these parameters is Line Spectral Frequencies (LSF), which contains direct information about the formant frequencies and bandwidths and represents the vocal tract. The other parameter is Mel Line Spectral Frequencies (MLSF) [20], an LSF-based parameter that contains perceptual information, with a mel-filter bank applied to the spectrum. The MFCC, MLSF and LSF parameters were tested for several classifiers using the sustained vowel /a/ and continuous speech to identify healthy subjects and those with unilateral vocal fold paralysis or nodules/edemas. The best accuracy rate, 77.9%, was obtained by a GMM classifier using the MLSF parameter extracted from continuous speech. LSF, without perceptual information, achieved a 76% accuracy rate.
To improve the results of the single system, hierarchical classification and system combination were implemented [15]. This approach demonstrates a significant benefit in the classification of larynx pathologies, with hierarchical classification and system combination achieving an 84.4% accuracy rate, which represents an improvement of 9% compared with the best single system presented in [14]. For pathological voice identification, the accuracy is 98.7% and the accuracy of identification between the two pathologic classes is 81.3%.

III. OBJECTIVES AND EXPERIMENTAL SETUP
From the spectral analysis of the previous section with the sustained vowel /a/, it has been proven that pathological voice identification is feasible using parameters present at very low frequencies. Also, from the previous section, voice pathologies identification is feasible with parameters extracted from continuous speech sampled at 25 kHz, that is, a 12.5 kHz bandwidth.
The main purpose of this work is to merge these two research paths and verify if the information present in the low frequencies is also valid for voice pathologies identification.

considerable portion of signal from unhealthy subjects, which can contain valuable information. The increase of the analysis window length increases the frequency resolution although it loses time resolution.
The same three-class classification system used in [13] is considered: physiological larynx pathologies (PLP), with 59 samples (vocal fold nodules and edemas), neuromuscular larynx pathologies (NLP), with 59 samples (unilateral vocal fold paralysis), and healthy voices, with 36 samples. This is a MEEI database subset that was chosen to maximize the number of unhealthy subjects. In total there are 118 unhealthy subjects divided equally by the two classes. Unilateral vocal fold paralysis is the most common larynx pathology in the MEEI database. Edemas is the second most common pathology and can be a preliminary disease for nodules [15], which is why they were merged into one class. Three subjects are diagnosed as having these two pathologies simultaneously.
The same GMM system presented in [14] is used as the baseline system. The input signals correspond to continuous speech sampled at 25 kHz. Recall that the GMM classifier achieved better results than the SVM classifier [14] and continuous speech has better results than the sustained vowel /a/. The window length was set to 20 ms with 10 ms overlap. A VAD [21] was used to remove silences. The extracted parameters correspond to MFCC, MLSF or LSF. Different parameter orders were tested and the one that obtains the best accuracy is chosen. For the MFCC parameter, delta-MFCC was added to the main parameter. For the MLSF and LSF parameters, DLSF (the difference between LSF coefficients in the same window) and DMLSF were added to the main parameter. Energy and delta-energy were also combined with the main parameters. All the data was normalized to zero mean and unit variance. A GMM classifier with 32 mixtures is used. The GMM classifiers were trained with 75% of the data and tested with the remaining 25%. The k-fold cross-validation method [22] was used, with k equal to 4. Subjects in the train set are not repeated in the test set. In total 4 systems were created to rotate the train and test sets and evaluate all the data set. Each test set of the healthy class contains the data of 9 subjects. For each unhealthy class test, three sets contain the data of 15 subjects and one contains the data of 14 subjects. The classification is frame based, and each subject is classified into the class to which most of the frames were assigned.
The different systems were evaluated comparing the overall accuracy (ACC) (1), the class sensitivity or True Positive Rate (TPR) (2) and the class precision or Positive Predictive Rate (PPV) (3). These measures are estimated using the True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN) approach. As the systems have three
The reduction of the bandwidth has the additional advantage to classes, when a class is considered Positive the other two
reduce the processing time, since there are less samples to classes are considered Negative.
process for the same analysis window length. ACC = ( TP + TN ) / ( TP + TN + FP + FN ) (1)
A second objective is to test the effect of removing the TPR = TP / ( TP + FN ) (2)
VAD and increase the analysis window length. For unhealthy
PPV = TP / ( TP + FP ) (3)
voices the VAD, developed for healthy voices, removes a

317
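Under the one-vs-rest convention just described (each class in turn is Positive, the two others Negative), the per-class measures (2)-(3), the overall accuracy, and the frame-majority subject decision can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names are ours.

```python
import numpy as np

def class_tpr_ppv(y_true, y_pred, positive):
    """Per-class sensitivity (2) and precision (3) in the one-vs-rest
    scheme: the chosen class is Positive, the other two are Negative."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp / (tp + fn), tp / (tp + fp)   # TPR, PPV

def overall_accuracy(y_true, y_pred):
    """Fraction of correctly classified subjects."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def subject_decision(frame_classes):
    """Frame-based classification: the subject is assigned to the class
    to which most of its frames were assigned."""
    values, counts = np.unique(frame_classes, return_counts=True)
    return values[np.argmax(counts)]

# tiny example with 3 classes (0 = healthy, 1 = PLP, 2 = NLP)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
tpr1, ppv1 = class_tpr_ppv(y_true, y_pred, positive=1)
```

The per-class ACC of eq. (1) would additionally count the True Negatives; the overall accuracy above is the single-number summary compared across systems in the tables that follow.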
To verify whether the band reduction has an impact on the accuracy, the signals were low-pass filtered at 2 kHz and down-sampled to 4 kHz. The VAD was also removed to assess its effect. In terms of processing time, if on the one hand the processing time is reduced by removing the VAD, on the other hand there are also more windows to process. To verify the impact of the analysis window length, the speech signal was extracted with a 40 ms window length and 20 ms overlap. This also decreases the time resolution to the limit at which quasi-stationarity can still be assumed in continuous speech signals.

IV. RESULTS AND DISCUSSION

The results of the baseline system are the same as those reported in [14] and are presented in Table I for the three parameters under study (MFCC, MLSF, LSF). The parameter order, presented in the last row, is the one that obtains the best accuracy. It is noted that the TPR for healthy subjects is very high, mainly for the MLSF and LSF parameters (97.2%).

Table I – System results with 25 kHz sampling frequency.

                    MFCC   MLSF   LSF
  Healthy TPR [%]   91.6   97.2   97.2
  Healthy PPV [%]   89.1   85.4   92.1
  PLP TPR [%]       76.2   78.0   76.3
  PLP PPV [%]       66.2   73.0   69.2
  NLP TPR [%]       61.0   66.1   62.7
  NLP PPV [%]       73.4   78.0   72.5
  ACC [%]           74.0   77.9   76.0
  Parameter order   12     12     12

Table II – System results with 4 kHz sampling frequency.

                    MFCC   MLSF   LSF
  Healthy TPR [%]   97.2   97.2   97.2
  Healthy PPV [%]   83.3   81.4   87.5
  PLP TPR [%]       74.6   71.2   78.0
  PLP PPV [%]       67.7   77.1   69.7
  NLP TPR [%]       59.3   71.2   61.0
  NLP PPV [%]       74.5   76.6   75.0
  ACC [%]           74.0   77.3   76.0
  Parameter order   10     6      12

The main objective is to verify whether the bandwidth reduction has an impact on the accuracy. The signals are down-sampled to 4 kHz. Different parameter orders were tested and only the best results are presented in Table II. The impact on the accuracy rate is null for the MFCC and LSF parameters. The biggest improvement is 5.6% in the TPR of healthy voices for the MFCC parameter. With the MLSF parameter there is a small accuracy decrease of 0.6%. In this case, there is an increase of 5.1% in the TPR for neuromuscular larynx pathologies at the expense of a decrease of 6.8% for the physiological larynx pathologies. This demonstrates, as expected, that the information relevant to the identification of voice pathologies is in the very low frequencies. Although not the main objective of this work, it should be noted that decreasing the sampling frequency from 25 kHz to 4 kHz also has the advantage that the signals need 6.25 times less storage memory and have 6.25 times fewer samples to be processed, decreasing the algorithm processing time.

The impact of VAD removal was tested and the results are presented in Tables III and IV, for 25 kHz and 4 kHz sampling frequency respectively. Improvements are verified in all the systems, with the exception of the LSF parameter at 25 kHz sampling frequency, where the accuracy remains the same. For 25 kHz sampling frequency there is an increase of the TPR for healthy voices and neuromuscular larynx pathologies for all the parameters.

Table III – System results with 25 kHz sampling frequency without VAD.

                    MFCC   MLSF   LSF
  Healthy TPR [%]   97.2   100    100
  Healthy PPV [%]   92.1   90.0   94.7
  PLP TPR [%]       74.6   79.7   71.2
  PLP PPV [%]       68.8   75.8   67.7
  NLP TPR [%]       64.4   69.5   64.4
  NLP PPV [%]       74.5   78.8   70.9
  ACC [%]           76.0   80.5   76.0
  Parameter order   12     12     12

Table IV – System results with 4 kHz sampling frequency without VAD.

                    MFCC   MLSF   LSF
  Healthy TPR [%]   100    97.2   97.2
  Healthy PPV [%]   78.3   85.4   87.5
  PLP TPR [%]       74.6   78.0   81.4
  PLP PPV [%]       73.3   78.0   76.2
  NLP TPR [%]       66.1   69.5   67.8
  NLP PPV [%]       81.3   75.9   78.4
  ACC [%]           77.3   79.2   79.9
  Parameter order   12     8      12

Table V – System results with 4 kHz sampling frequency without VAD and 40/20 ms window.

                    MFCC   MLSF   LSF
  Healthy TPR [%]   97.2   97.2   97.2
  Healthy PPV [%]   83.3   85.4   85.4
  PLP TPR [%]       79.7   81.4   78.0
  PLP PPV [%]       73.4   81.4   75.4
  NLP TPR [%]       67.8   72.9   69.5
  NLP PPV [%]       83.3   79.6   78.8
  ACC [%]           79.2   81.8   79.2
  Parameter order   12     6      14

To assess the impact of increasing the analysis window length, the system was tested without VAD and with 4 kHz sampling frequency. The analysis window has a length of 40 ms with 20 ms overlap. Results are presented in Table V. The accuracy increases for the perceptual parameters: by 1.9% for MFCC and by 2.6% for MLSF. Also, the TPR increases for both pathology classes. This can be explained by the improved mel-filter performance due to the increase in frequency resolution from 50 Hz to 25 Hz.
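The band reduction described above (anti-aliasing low-pass at 2 kHz followed by down-sampling from 25 kHz to 4 kHz) can be sketched as follows. The paper does not state which tools were used; this sketch assumes NumPy/SciPy, where `resample_poly` applies the anti-aliasing FIR filter internally before the rate change.

```python
import numpy as np
from scipy.signal import resample_poly

def band_reduce(x):
    """Down-sample 25 kHz speech to 4 kHz (rational ratio 4/25).
    resample_poly low-pass filters at the new Nyquist frequency (2 kHz)
    before rate conversion, matching the procedure in the text."""
    return resample_poly(x, up=4, down=25)

# one second of a 1 kHz tone plus a 6 kHz component that must be removed
fs = 25000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
y = band_reduce(x)   # 4000 output samples: 6.25 times fewer than the input
```

The 6.25-fold reduction in sample count is exactly where the storage and processing-time savings quoted in the text come from.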
For the LSF parameter, without perceptual information, the accuracy decreases by 0.7%. The TPR increases for the neuromuscular larynx pathologies for all the parameters.

It should also be noted that, in the tested situation, doubling the size of the analysis window halves the number of windows, decreasing the processing time of the algorithm.

Fig. 2 presents a graphical comparison between all the systems. The best accuracy value for all the tests presented corresponds to 81.8%, obtained with the MLSF parameter, without VAD, sampled at 4 kHz with a 40/20 ms window.

Figure 2 - Systems Comparison.

Table VI – Comparison of performance among classes with pathologies.

               MLSF GMM          MLSF GMM          System
               fs = 25 kHz       fs = 4 kHz        Combination
               with VAD,         without VAD,
               20/10 ms window   40/20 ms window
  PLP TPR [%]  79.7              83.1              83.1
  PLP PPV [%]  71.2              79.0              80.3
  NLP TPR [%]  67.8              78.0              79.7
  NLP PPV [%]  77.0              82.1              82.4
  ACC [%]      73.4              80.5              81.3

Analysing only the results between the two classes of pathologies, suppressing the healthy subjects class, it is verified that the results obtained are slightly below those obtained by the more complex combination of systems presented in [15], with a difference of only 0.8% in the identification between the two pathological classes. This difference is due to a slight decrease in the accuracy rate of subjects with neuromuscular pathologies, as presented in Table VI. These results show a significant increase in accuracy compared to the baseline system, with emphasis on the neuromuscular pathologies, where the rate of correctness increased by 10.2% in absolute values.

V. CONCLUSION

This paper presented a low-band speech analysis system for voice pathology identification with continuous speech signals. The accuracy is the same as that obtained with signals at the most common 25 kHz sampling frequency, from which it was concluded that the essential information for this task is below 2 kHz and the signals can be sampled at 4 kHz.

A second objective was to test the effect of suppressing the VAD and of increasing the analysis window length. The VAD removes a considerable portion of the signal from unhealthy subjects, which can contain valuable information. The increase of the analysis window length increases the frequency resolution, which influences the mel-filter in perceptual filtering. Results show an improvement in accuracy in both cases, due mainly to the improvement in the recognition of subjects with pathologies. The best accuracy is 81.8%, achieved with the MLSF parameter, signals sampled at 4 kHz, without VAD and an analysis window of 40 ms with 20 ms overlap.

The reduction of the bandwidth has the additional advantage of reducing the storage memory and the processing time, since there are fewer samples to be processed for the same analysis window length. The increase of the analysis window also brings a decrease in the processing time of the algorithm.

References
[1] S. Iwata, "Periodicities of pitch perturbations in normal and pathological larynges," Laryngoscope, vol. 82, pp. 87–96, 1972.
[2] J. P. Teixeira and A. Gonçalves, "Algorithm for Jitter and Shimmer Measurement in Pathologic Voices," Procedia Comput. Sci., vol. 100, pp. 271–279, 2016.
[3] I. Titze, Workshop on Acoustic Voice Analysis: Summary Statement. National Center for Voice and Speech, 1995.
[4] J. J. Jiang, Y. Zhang, J. MacCallum, A. Sprecher, and L. Zhou, "Objective acoustic analysis of pathological voices from patients with vocal nodules and polyps," Folia Phoniatr. Logop., vol. 61, no. 6, pp. 342–349, 2009.
[5] P. R. Scalassara, M. E. Dajer, C. D. Maciel, R. C. Guido, and J. C. Pereira, "Relative entropy measures applied to healthy and pathological voice characterization," Appl. Math. Comput., vol. 207, no. 1, pp. 95–108, 2009.
[6] K. Shama, A. Krishna, and N. U. Cholayya, "Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology," EURASIP J. Adv. Signal Process., vol. 2007, 2007.
[7] R. Fraile, N. Sáenz-Lechón, J. I. Godino-Llorente, V. Osma-Ruiz, and C. Fredouille, "Automatic detection of laryngeal pathologies in records of sustained vowels by means of mel-frequency cepstral coefficient parameters and differentiation of patients by sex," Folia Phoniatr. Logop., vol. 61, no. 3, pp. 146–152, 2009.
[8] J. D. Arias-Londoño, J. I. Godino-Llorente, M. Markaki, and Y. Stylianou, "On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices," Logoped. Phoniatr. Vocol., vol. 36, no. 2, pp. 60–69, 2011.
[9] V. Majidnezhad, "A novel hybrid of genetic algorithm and ANN for developing a high efficient method for vocal fold pathology diagnosis," EURASIP J. Audio, Speech, Music Process., vol. 2015, no. 1, p. 3, 2015.
[10] J. W. Lee, H. G. Kang, J. Y. Choi, and Y. I. Son, "An investigation of vocal tract characteristics for acoustic discrimination of pathological voices," Biomed Res. Int., vol. 2013, 2013.
[11] H. T. Cordeiro, J. M. Fonseca, and C. M. Ribeiro, "LPC Spectrum First Peak Analysis for Voice Pathology Detection," Procedia Technol., vol. 9, pp. 1104–1111, 2013.
[12] M. N. Vieira, F. R. McInnes, and M. A. Jack, "On the influence of laryngeal pathologies on acoustic and electroglottographic jitter measures," J. Acoust. Soc. Am., vol. 111, no. 2, pp. 1045–1055, 2002.
[13] H. Cordeiro, C. Meneses, and J. Fonseca, "Continuous speech classification systems for voice pathologies identification," vol. 450, 2015.
[14] H. Cordeiro, J. Fonseca, I. Guimarães, and C. Meneses, "Voice Pathologies Identification Speech: signals, features and classifiers evaluation," 19th Conf. SPA 2015 Signal Process. Algorithms, Archit. Arrange. Appl., 2015.
[15] H. Cordeiro, J. Fonseca, I. Guimarães, and C. Meneses, "Hierarchical Classification and System Combination for Automatically Identifying Physiological and Neuromuscular Laryngeal Pathologies," J. Voice, vol. 31, no. 3, pp. 384.e9–384.e14, 2017.
[16] Massachusetts Eye and Ear Infirmary Voice Disorders Database (Version 1.03 CD-ROM). Kay Elemetrics Corp., Lincoln Park, NJ, 1994.
[17] H. Cordeiro, J. Fonseca, and C. Meneses, "Spectral envelope and periodic component in classification trees for pathological voice diagnostic," Conf. Proc. IEEE Eng. Med. Biol. Soc., vol. 2014, 2014.
[18] R. Wayland and A. Jongman, "Acoustic correlates of breathy and clear vowels: The case of Khmer," J. Phon., vol. 31, no. 2, pp. 181–201, 2003.
[19] B. R. Gerratt, J. Kreiman, and M. Garellek, "Comparing Measures of Voice Quality From Sustained Phonation and Continuous Speech," Am. J. Speech-Language Pathol., vol. 59, no. 5, pp. 994–1001, 2016.
[20] H. Cordeiro and C. M. Ribeiro, "Speaker characterization with MLSFs," in IEEE Odyssey 2006: Workshop on Speaker and Language Recognition, 2006.
[21] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon, "An Improved Endpoint Detector for Isolated Word Recognition," IEEE Trans. Acoust., vol. ASSP-29, no. 4, pp. 777–785, 1981.
[22] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in International Joint Conference on Artificial Intelligence, 1995, vol. 14, no. 12, pp. 1137–1143.
Features extraction for the automatic detection of ALS disease from acoustic speech signals

Maxim Vashkevich, Elias Azarov and Alexander Petrovsky
Department of Computer Engineering, Belarusian State University of Informatics and Radioelectronics
6, P.Brovky str., 220013, Minsk, Belarus
Email: {vashkevich, azarov, palex}@bsuir.by

Yuliya Rushkevich
Republican Research and Clinical Center of Neurology and Neurosurgery, Minsk, Belarus
Email: rushkevich@tut.by
Abstract—The paper presents features for the detection of pathological changes in the acoustic speech signal for the diagnosis of the bulbar form of Amyotrophic Lateral Sclerosis (ALS). We collected records of the running speech test from 48 people, 26 with ALS. The proposed features are based on a joint analysis of different vowels. The harmonic structure of the vowels is also taken into consideration. We also present the rationale of the vowel selection for the calculation of the proposed features. Applying these features to the classification task using linear discriminant analysis (LDA) leads to an overall correct classification performance of 88.0%.
Index Terms—speech analysis, formants, ALS.

I. INTRODUCTION

Perceptible changes in speech are inherent to many neurological diseases. Bulbar motor changes (i.e. difficulty with speech or swallowing) are the first symptoms in approximately 30% of persons with amyotrophic lateral sclerosis (ALS) [1]. In most cases the detection of speech motor involvement in ALS is currently based on the subjective assessment of clinicians' auditory perceptions. However, auditory-perceptual judgment as a tool for classifying speech disorders is susceptible to a variety of sources of error and bias [2]. Some symptoms of speech motor changes in ALS cannot be easily detected without instrumentation [3], especially at the beginning of the disease. In turn, late detection of voice pathology can lead to late detection of ALS. Advanced assessment strategies for speech motor changes are needed for early disease detection and for optimizing the efficacy of therapeutic ALS drug trials [4].

One of the problems of ALS detection is that there is no standardized speech diagnostic procedure. Many vocal tests have been proposed for detecting neurological diseases. Some of them include sustained phonations [5, 6], where the patient is instructed to produce a single vowel and hold its pitch as constant as possible, for as long as possible. In running speech tests patients are instructed to speak a standard sentence that is constructed to contain a representative sample of linguistic units [7]. Another approach is to use rapid repetitions of syllables, which is referred to as a diadochokinetic task (DDK). During this speaking test patients are asked to produce the maximum number of syllables (e.g., "tah" or "pah") as rapidly and accurately as possible in a single breath [1].

Currently, for detecting neurological diseases, vocal tests based on different time, frequency or time-frequency features are used. The features are extracted from the speech signal using either linear or non-linear processing techniques. Features based on linear processing include F0 (the fundamental frequency of vocal oscillation), absolute sound pressure level, harmonics-to-noise ratio (HNR) [8], jitter (the degree of variation of F0 from cycle to cycle), shimmer (the degree of variation in speech amplitude from cycle to cycle) [7], and Mel-frequency cepstral coefficients (MFCC) [9]. The main drawback of the mentioned features is that many of them are not specifically designed for the detection of voice disorders, and therefore their performance is limited. Also, they are well suited to the sustained phonation test but not to the running speech test.

More recently, several new measurement methods have been proposed to assess dysphonic symptoms in speech [6]. Those methods are based on nonlinear time series analysis; the most popular among them are detrended fluctuation analysis (DFA) and recurrence period density entropy (RPDE) [7]. The drawback of the nonlinear measurement methods is their much higher sensitivity to noise and other environmental factors.

As a rule, all extracted feature vectors with corresponding labels are used to obtain a classifier based on supervised learning. Linear discriminant analysis (LDA) along with the support vector machine (SVM) are the most frequently used classification tools in tasks of neurological disease diagnosis [6, 9, 10].

The aim of this work is to present new features that are based on linear processing techniques and designed specially for detecting dysphonic/dysarthric speech pathology of patients with ALS. The proposed features are robust to uncontrolled variation in the acoustic environment, have a clear rationale and can possibly be used in telemedicine systems. Using the proposed features, a classifier for the diagnosis of ALS based on LDA has been obtained.

II. FOUNDATION OF VOWELS SELECTION

ALS results in changes of the stimulation of the muscles in general and the muscles of the tongue in particular. The position of the tongue's surface is manipulated by large and powerful muscles in its root, which move it within the mouth [11]. From the speech production point of view, the important aspects are the horizontal position of the tongue surface (front ⇔ back) and the vertical position (high ⇔ low).
Fig. 1 shows a schematic characterization of some Russian vowels in terms of relative tongue positions.

Fig. 1: Relative tongue positions of Russian vowels (in International Phonetic Alphabet (IPA) representation) [11]

For detecting symptoms of the bulbar form of ALS from the acoustic speech signal, it is advisable to select the vowels /æ/ and /i/, since their pronunciation requires considerable activity of the tongue muscles.

In our experiments we use running speech as a more realistic test of impairment in actual everyday life. Records of counting from 1 to 10 (in Russian) were used as material for the experiments. For the analysis we have selected close-in-time fragments of the speech signal containing the vowels /æ/ and /i/ (as a rule, the sounds were selected from the words "odin", "dvæ", "tri"). An example of the formant structure of the vowels /æ/ and /i/ produced by a healthy person is shown in Fig. 2.

Fig. 2: Formant structure of the vowels /æ/ and /i/ (healthy person)

Visual analysis of the envelopes in Fig. 2 shows that the formants are significantly spaced in the frequency domain and are arranged in the following order: F_i(1) < F_a(1) < F_a(2) < F_i(2). As a rule, pathological changes in speech are perceived aurally; therefore it is meaningful to use the psychoacoustically motivated Bark scale to improve the correlation between perceptual and acoustic data [2]. As will be shown in the following, convergence of the formants can mean the existence of pathological abnormalities. In this regard, the Bark scale allows the distances between the first and between the second formants of the vowels /æ/ and /i/ to be treated uniformly.

III. FEATURES FOR AUTOMATIC DETECTION

A. Distance between envelopes

Joint analysis of the envelopes of the vowels /æ/ and /i/ of persons with ALS has revealed an increased similarity between the shapes of these envelopes. A typical example of envelopes with a high degree of similarity is given in Fig. 3.

Fig. 3: Similarity of the envelopes of the sounds /æ/ and /i/ (for a patient with ALS)

To quantify the differences between the envelopes of the vowels /æ/ and /i/ it is suggested to use the l1-norm distance measure

d_1(E_i, E_a) = \sum_{k=1}^{P} |E_i(k) - E_a(k)|,    (1)

where E_i(k) is the envelope of the vowel /i/, E_a(k) the envelope of the vowel /æ/, and P the number of points in the Bark frequency domain in which the envelope is defined.

B. Mutual location of the formant frequencies

As mentioned in Section II, the formants of the vowels /æ/ and /i/ have a fixed order in the normal case. However, in patients with ALS the mutual location of the formant frequencies can be violated. Fig. 4 shows an example of the envelopes of the vowels /æ/ and /i/ pronounced by a patient with ALS (the voice disorder was perceivable).

Fig. 4: Abnormal mutual location of the formant frequencies in a patient with ALS

In the case when the normal order is not violated, there is often a significant convergence of the formant frequencies, as shown in Fig. 5.

To quantify the degree of violation of the mutual formant structure of the vowels /æ/ and /i/ the feature fmt_err(F_i, F_a) is proposed (see eq. (2)). This expression returns a value in the range [0, 2]. fmt_err is equal to 2 when the normal mutual formant structure is violated (i.e. either F_i(1) > F_a(1) or F_a(2) > F_i(2)).
\[
\mathrm{fmt\_err}(F_i, F_a) =
\begin{cases}
2, & \text{if } F_i(1) > F_a(1) \text{ or } F_a(2) > F_i(2)\\[2pt]
2 - \frac{F_a(1)-F_i(1)}{2} - \frac{F_i(2)-F_a(2)}{2}, & \text{if } F_a(1)-F_i(1) < 2 \text{ and } F_i(2)-F_a(2) < 2\\[2pt]
1 - \frac{F_a(1)-F_i(1)}{2}, & \text{if } F_a(1)-F_i(1) < 2\\[2pt]
1 - \frac{F_i(2)-F_a(2)}{2}, & \text{if } F_i(2)-F_a(2) < 2\\[2pt]
0, & \text{otherwise}
\end{cases}
\tag{2}
\]

Fig. 5: Convergence of the formant frequencies of the vowels /æ/ and /i/ (patient with ALS)

For a normal mutual formant location fmt_err returns 0. It has been noticed that the distance between the formants of the vowels /æ/ and /i/ for a healthy person is more than 2 Bark; therefore the degree of convergence of the formants is estimated by the function fmt_err(F_i, F_a) in cases when the distance between the first formants and/or between the second formants of the vowels /æ/ and /i/ is less than 2 Bark.

C. Difference in the amplitudes of the harmonics

Analysis of the harmonic structure of the vowel /æ/ in persons with ALS has revealed that dysphonic disorders affect the first three harmonic components. Fig. 6 and Fig. 7 show some representative examples.

Fig. 6: First three harmonics of the vowel /æ/ (ALS patient). The difference between the amplitudes of the 1st and 3rd harmonics is 31 dB

Fig. 7: First three harmonics of the vowel /æ/ (ALS patient). The difference between the amplitudes of the 2nd and 3rd harmonics is 12 dB

To quantify the degree of deviation in the amplitude structure of the harmonics of the vowel /æ/ the following measure is proposed:

harm_diff(A_1, A_2, A_3) = max(A_1, A_2) - A_3,    (3)

where A_i is the amplitude of the i-th harmonic in dB.

Fig. 8: Scheme of speech signal analysis (x[n] → LP analysis → envelope calculation → transformation to Bark scale → envelope E(k) and formant estimation F(1), F(2); harmonic analysis → amplitude difference calculation → harm_diff)

IV. CLASSIFICATION

A. Scheme for features extraction

The general scheme of features extraction for automatic detection of bulbar ALS is given in Fig. 8. For the analysis, segments of the speech signal with a duration of 150-200 msec containing the vowels /æ/ and /i/ were selected. LP analysis is done using traditional algorithms, while harmonic analysis is performed using the technique described in [12]. For each pair of vowels from the dataset the features (1), (2) and (3) are extracted and concatenated into the vector x = [d_1(E_i, E_a)  fmt_err(F_i, F_a)  harm_diff(A_1, A_2, A_3)]^T.

B. Linear discriminant analysis

In order to discriminate between the two classes of normal and pathological cases, linear discriminant analysis (LDA) with the Fisher criterion was used [13]. The idea of LDA lies in the search for such a hyperplane w in the feature space that the projection of all training vectors onto it minimizes the within-class variation and maximizes the between-class variation:

w = \arg\max_{w} \frac{w S_B w^T}{w S_W w^T},    (4)
where S_B is the between-class scatter matrix and S_W the within-class scatter matrix. In turn, these matrices are calculated as follows:

S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T,    (5)

S_W = \sum_{j=1}^{2} \sum_{x} (x - \mu_j)(x - \mu_j)^T,    (6)

where \mu_1 is the mean feature vector for healthy people and \mu_2 the mean feature vector for people with ALS. The solution of (4) can be found via the generalized eigenvalue problem

S_B w = \lambda S_W w,    (7)

where the maximum eigenvalue \lambda and its associated eigenvector give the quantity of interest and the projection basis. A more detailed description of LDA is given in [13].

Fig. 9: Probability density of the distance between envelopes d_1(E_i, E_a)
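Taken together, the features (1)-(3) and the Fisher-criterion projection (4)-(7) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: envelopes are assumed to be already sampled on a common Bark-scale grid, and formant frequencies are assumed to be given in Bark.

```python
import numpy as np

def d1(E_i, E_a):
    """Eq. (1): l1 distance between Bark-scale envelopes of /i/ and /ae/."""
    return np.sum(np.abs(np.asarray(E_i) - np.asarray(E_a)))

def fmt_err(F_i, F_a):
    """Eq. (2): violation/convergence of the mutual formant structure;
    F_i and F_a hold the first two formants of each vowel (in Bark)."""
    if F_i[0] > F_a[0] or F_a[1] > F_i[1]:
        return 2.0                        # normal formant order violated
    gap1 = F_a[0] - F_i[0]                # distance between first formants
    gap2 = F_i[1] - F_a[1]                # distance between second formants
    if gap1 < 2 and gap2 < 2:
        return 2.0 - gap1 / 2 - gap2 / 2
    if gap1 < 2:
        return 1.0 - gap1 / 2
    if gap2 < 2:
        return 1.0 - gap2 / 2
    return 0.0

def harm_diff(A1, A2, A3):
    """Eq. (3): amplitude deviation of the first three harmonics (dB)."""
    return max(A1, A2) - A3

def fisher_lda(X1, X2):
    """Eqs. (4)-(7): projection direction w from the eigendecomposition
    of S_W^{-1} S_B; X1, X2 are (n_samples, n_features) class matrices."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    SB = np.outer(mu1 - mu2, mu1 - mu2)                          # eq. (5)
    SW = sum(np.cov(X.T, bias=True) * len(X) for X in (X1, X2))  # eq. (6)
    evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))        # eq. (7)
    return np.real(evecs[:, np.argmax(np.real(evals))])
```

Note how the case order in `fmt_err` matters: the combined penalty applies only when both formant pairs converge, otherwise a single-pair penalty is used, exactly as in eq. (2).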
V. EXPERIMENTAL RESULTS

A. Data collection

To validate the proposed new features, real-world clinical samples were used. The speech recording of Russian-speaking patients with ALS was carried out in the Republican Research and Clinical Center of Neurology and Neurosurgery (Minsk, Belarus). A total of 48 speakers were recorded, with 22 healthy speakers (15 males, 7 females) and 26 speakers (14 males, 12 females) having been diagnosed with ALS. The average age in the healthy group was 36.3 years (SD 9.5, Min 22, Max 81) and the average age in the ALS group was 56.5 years (SD 10.5, Min 36, Max 82). The samples were recorded at 44.1 kHz using a smartphone with a standard headset and stored as 16-bit uncompressed PCM files.

For classification purposes, 106 pairs of vowels /æ/ and /i/ were manually pre-segmented prior to feature extraction (61 – healthy, 45 – pathology).

B. Statistical analysis of features

In order to gain a preliminary understanding of the statistical properties of the features, we computed their distributions estimated using a Gaussian kernel density.

Figure 9 shows the density function for the distance between envelopes d_1(). This feature shows a considerable distinction between healthy controls and people with ALS.

The probability density of the fmt_err feature is shown in Fig. 10. These results show that there are violations of the mutual location of the formant frequencies for some samples from the healthy control group. However, this could appear due to inaccuracy of the formant frequency detection algorithm.

Fig. 10: Probability density for fmt_err(F_i, F_a)

The feature harm_diff also shows a good separation between the healthy and pathological groups (Fig. 11).

For comparison, we have computed the density function for the widely used HNR feature (Fig. 12). Although HNR is quite effective for the sustained phonation test [7], it is not so good when used for the analysis of short vowels (<200 ms).

C. Classification results

Using the collected base of 106 train samples, LDA was performed based on the following steps:
− calculation of the between-class scatter matrix S_B using (5);
− calculation of the within-class scatter matrix S_W using (6);
− solving eq. (7) by calculating the matrix S_W^{-1} S_B and performing its eigenvalue decomposition. To maximize Fisher's criterion (4), the projection hyperplane w is determined by the eigenvector associated with the maximum eigenvalue λ.

Classification is performed using the following equation:

p = sign(w^T x − b),    (8)

where b is the decision boundary; if p = −1 the vector x is classified as healthy, and if p = 1 the vector x is classified as pathology.

In Fig. 13 the kernel density function for the projection of all train vectors on the hyperplane is shown. The overall classification accuracy is equal to 88.0%, true positive 90.5% and true negative 84.6%.

VI. CONCLUSION

The paper presents several new features that can be calculated from a running speech test for ALS diagnosis. The new features are based on 1) analysis of the envelopes of the vowels /æ/ and /i/ and 2) analysis of the mutual formant structure of the vowels /æ/ and /i/.
The vowels /æ/ and /i/ were selected as the most suitable because their pronunciation requires considerable work of the tongue muscles (the bulbar symptoms of ALS include tongue atrophy). One more feature is based on the analysis of the harmonic structure of the vowel /æ/; the statistical analysis has shown that for pathological cases the difference between the first two and the third harmonic amplitudes is larger than in the healthy control group. Usage of the presented features with an LDA-based classifier allows an overall classification accuracy of 88% to be achieved. Further work is necessary to improve the classification result.

Fig. 11: Probability density for the difference between harmonic amplitudes harm_diff

Fig. 12: Probability density for HNR computed for the vowel /æ/ taken from the running speech test

Fig. 13: Probability density function for the projection of all train vectors on the hyperplane w

ACKNOWLEDGMENT

This work was supported by the Belarusian Fundamental Research Fund (F17U003).

REFERENCES

[1] T. Spangler, N. V. Vinodchandran, A. Samal, and J. R. Green, "Fractal features for automatic detection of dysarthria," in 2017 IEEE EMBS International Conference on Biomedical Health Informatics (BHI), Feb 2017, pp. 437–440.
[2] R. D. Kent, "Hearing and believing: some limits to the auditory-perceptual assessment of speech and voice disorders," American Journal of Speech-Language Pathology, vol. 5, no. 3, pp. 7–23, 1996.
[3] Y. Yunusova, J. S. Rosenthal, J. R. Green, S. Shellikeri, P. Rong, J. Wang, and L. Zinman, "Detection of bulbar ALS using a comprehensive speech assessment battery," in Models and Analysis of Vocal Emissions for Biomedical Applications: 8th International Workshop, Dec 2013, pp. 217–220.
[4] J. R. Green, Y. Yunusova, M. S. Kuruvilla, J. Wang, G. L. Pattee, L. Synhorst, L. Zinman, and J. D. Berry, "Bulbar and speech motor assessment in ALS: challenges and future directions," Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 14, no. 7-8, pp. 494–500, 2013.
[5] R. J. Baken and R. F. Orlikoff, Clinical Measurement of Speech and Voice, 2nd ed. San Diego: Singular Thomson Learning, 2000.
[6] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, "Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease," IEEE Transactions on Biomedical Engineering, vol. 59, no. 5, pp. 1264–1271, May 2012.
[7] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, "Suitability of dysphonia measurements for telemonitoring of Parkinson's disease," IEEE Transactions on Biomedical Engineering, vol. 56, no. 4, pp. 1015–1022, April 2009.
[8] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Institute of Phonetic Sciences, University of Amsterdam, Proceedings 17, 1993, pp. 97–110.
[9] A. Benba, A. Jilbab, and A. Hammouch, "Discriminating between patients with Parkinson's and neurological diseases using cepstral analysis," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 24, no. 10, pp. 1100–1108, Oct 2016.
[10] M. Little, P. McSharry, I. Moroz, and S. Roberts, "Nonlinear, biophysically-informed speech pathology detection," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, vol. 2, May 2006, pp. II-II.
[11] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: a guide to theory, algorithm, and system
development. Upper Saddle River, New Jersey, USA:
Prentice Hall PTR, 2001.
[12] A. Petrovsky and E. Azarov, “Instantaneous harmonic
analysis: Techniques and applications to speech sig-
nal processing,” in Speech and Computer, A. Ronzhin,
R. Potapova, and V. Delic, Eds. Springer International
Publishing, 2014, pp. 24–33.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements
of Statistical Learning, ser. Springer Series in Statistics.
New York, NY, USA: Springer New York Inc., 2001.

326
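The LDA-based classification summarized in the conclusions above can be sketched in code. The following Python fragment is purely illustrative: the feature values, class means and threshold rule are synthetic assumptions, not the authors' data or implementation. It derives a Fisher discriminant direction w, projects feature vectors onto it, and thresholds the projection, mirroring the projection-on-hyperplane-w view of Fig. 13.

```python
import numpy as np

def lda_direction(class_a, class_b):
    """Fisher discriminant direction w for two classes (rows = feature vectors)."""
    mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
    # Pooled within-class scatter (sum of the two sample covariances)
    s_w = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)
    w = np.linalg.solve(s_w, mu_b - mu_a)
    return w / np.linalg.norm(w)

def classify(x, w, threshold):
    """Project a feature vector on w and threshold the scalar projection."""
    return int(x @ w > threshold)

# Toy example with synthetic 2D features (e.g. a harmonic-amplitude
# difference and HNR); the separation here is invented for illustration.
rng = np.random.default_rng(0)
healthy = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
pathological = rng.normal([2.0, 1.0], 0.3, size=(50, 2))

w = lda_direction(healthy, pathological)
# Midpoint of the two projected class means as a simple decision threshold
thr = 0.5 * (healthy @ w).mean() + 0.5 * (pathological @ w).mean()
acc = np.mean([classify(x, w, thr) for x in pathological])
```

On these well-separated synthetic classes the thresholded projection labels nearly all pathological samples correctly; on the real features the paper reports an overall accuracy of 88%.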
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC., September 19th-21st, 2018, Poznań, POLAND

DAB+ Coverage Analysis: a New Look at Network Planning using GIS Tools

Marek Kulawiak, Przemysław Falkowski-Gilski, Marcin Kulawiak
Faculty of Electronics, Telecommunications and Informatics
Gdansk University of Technology
Gdansk, Poland
{marek.kulawiak, przemyslaw.falkowski, marcin.kulawiak}@eti.pg.edu.pl

Abstract—For many years, the matter of designing a transmitter network optimized for best signal coverage has been a subject of intense research. In the last decade, numerous researchers and institutions have used GIS and spatial analysis tools for network planning, especially transmitter location. Currently, many existing systems operate in a strictly two-dimensional manner, not taking into account the three-dimensional nature of the analyzed landscape. Moreover, systems that utilize a digital terrain model in their analyses also resort to a basic viewshed model. Due to the recent adoption of mobile LiDAR scanners, the amount of three-dimensional terrain information, including the location and height of natural and man-made objects, is growing rapidly. In this paper, we propose a novel web-based GIS tool for three-dimensional simulation and analysis of radio signal coverage, dedicated to DAB+ terrestrial broadcasting network planning.

Keywords—DAB+, digital audio broadcasting, GIS, modeling, signal coverage, network planning

I. INTRODUCTION

The subject of designing a transmitter network for best signal coverage is a complex one. In order to fulfill all necessary requirements, the planner needs to consider several issues, including the geographical location of individual transmitters, their orientation and signal strength. The topology of the analyzed area is also a key factor, and becomes even more important with increasing population density in some regions. All of the aforementioned elements share a geographical context, and as such are best investigated using a Geographic Information System (GIS).

Over the last two decades, GIS has been an important aid in the planning and analysis of radio network coverage in many places around the world [1]. In the early years of the new century, ArcGIS Digital Terrain Model (DTM) viewshed analysis was used to identify potential clients for broadband WiMAX Internet providers in isolated and remote regions of Canada [2]. Four years later, several radio propagation models, including Okumura-Hata and Free-Space Path Loss (FSPL), were implemented in GRASS GIS for simulation and analysis of terrestrial network radio coverage [3]. In the same year, a similar tool was presented for WiMAX networks [4]. In the following year, GIS-based radio coverage mapping was used to analyze and identify problematic cells in a transmitter network [5]. Three years later, DTM, vegetation and building location data were processed in GIS to identify hotspots for cellular radiation [6]. More recently, GIS software and Non-Linear Programming (NLP) have been used to devise a set of criteria for the placement of cellular network base stations [7], as well as for monitoring a mobile network operator's coverage area [8].

While the attempts at GIS-assisted radio network modeling and analysis have been numerous and varied, the majority of existing systems operate in a strictly two-dimensional manner, without taking into account the three-dimensional nature of the signal's interaction with terrain topology. Even those systems that do use a DTM tool in their analyses usually perform simple viewshed processing and combine the results with a two-dimensional path loss model.

In the light of the recent adoption of mobile LiDAR scanners, the amount of three-dimensional terrain information, including the location and height of buildings and forests, has been growing rapidly [9]. In this context, the paper proposes a novel web-based GIS tool for three-dimensional simulation and analysis of radio signal coverage, dedicated to the investigation of DAB+ (Digital Audio Broadcasting plus) transmitter networks. Additional information concerning the possibilities and limitations related to the DAB+ broadcasting system may be found in [10].

II. NETWORK PLANNING

A predefined location is considered to be covered by a signal if there is at least one facility sited within a preset threshold distance. If more than one facility satisfies this criterion, it is assumed that one of these facilities, most often the closest one, will serve the consumer. The remaining ones will have no relation to the user demand. However, there are cases in which this multiple coverage has either pros or cons, depending on the network deployment scenario.

Terrestrial broadcasting systems, such as DAB+, as well as other digital broadcasting standards, provide examples of situations in which overlapping coverage occurs, for instance when multiple signals originate at several transmitters and reach the same receiver. This situation may have either a positive or negative effect on the reception quality at the user side. It can be the cause of degradation in service availability, reliability, and of course user experience. Additional information concerning subjective and objective quality
assessment in the DAB+ broadcasting system, including both Quality of Service (QoS) and Quality of Experience (QoE) aspects, may be found in [11][12].

Moreover, whenever multiple transmitters are required, either because the region is large or there are gaps to be filled, some overlap between the coverage areas of neighboring transmitters is unavoidable. Otherwise, a large area is left unserved and becomes a whitespace. In the case of digital terrestrial broadcasting systems, multiple-transmitter networks can be designed to operate in either Multiple-Frequency Network (MFN) or Single-Frequency Network (SFN) mode [13]. Currently, there exist commercial broadcasting coverage planning tools, e.g. CHIRplus from LStelcom [14] and FRANSY from IRT, Munich [15].

A. MFN and SFN Network Mode

In the MFN mode, transmitters that broadcast the same information or program (in the case of DAB+ this refers to a single multiplex) use different frequencies. On the other hand, in SFN mode the whole region or country is covered by transmitters using a single frequency, achieving a more efficient use of the spectrum and bandwidth resources. However, in single-frequency networks, a user device will receive more than one signal over the same frequency. Unless all received signals carry identical information and are perfectly time-synchronized, they will either add up or interfere with each other. On the receiver side, the strongest signal will be considered the primary signal. This means that its transmitter will be the main signal source, while the remaining ones will either contribute to a better reception or interfere with the main transmitter.

This is the case for all technologies that utilize Coded Orthogonal Frequency Division Multiplexing (COFDM), i.e. terrestrial Digital Video Broadcasting (DVB) and DAB/DAB+, the leading broadcasting systems in Europe [16]. COFDM was designed with multiple signals in mind, and if two signals carrying the same information are received within a preset time delay of each other, known as the guard interval, they add up in a useful manner. However, if the delay between the two signals exceeds the guard interval, they interfere with each other. The guard interval can be adjusted for a single-frequency network: the larger it is, the more robust the system is to multiple signals. However, the amount of data that can be distributed becomes smaller. Naturally, a smaller amount of data means lower capacity, which leads to a smaller number of services available in a single multiplex. In consequence, the number of high-quality audio services will also be lower. Additional information on transmission quality measurements in DAB+ may be found in [17].

Finally, the delay between two received signals depends entirely on the relative positions of the two transmitters with respect to the user device. This factor can be somewhat controlled by choosing appropriate locations for the broadcasting transmitters. It must be mentioned that echoes of the different signals, e.g. scattered on hills and buildings, also have a similar effect. However, the level of echoes is usually far lower than the level of direct signals originating at other transmitters. Furthermore, an echo-controlling mechanism is necessary in both single and multiple frequency networks.

B. DAB+ Coverage Analysis

In the DAB+ broadcast network planning process, the technical parameters of the transmitters are defined in order to cover a predefined percentage of territory or population with the signal. The most important technical parameters are:

• locations of the transmitters,
• effective radiated power,
• antenna radiation pattern,
• operating frequency of the transmitter.

It should also be noted that bandwidth is the most precious resource. Frequency resources are limited, based on national and international agreements, and should be allocated, distributed and managed in an appropriate way. The matter of resource allocation in the DAB+ broadcasting system is discussed in [18].

The frequency allocation table of a particular broadcast service is divided into radio channels with a specific bandwidth, depending on the type of service. In Poland, similarly to other European countries, the terrestrial DAB+ standard utilizes frequency resources in band III, that is, between 174 and 240 MHz. When a specific channel is assigned to a transmitter, the level of interference from neighboring transmitters should be taken into account. During the planning process, two major requirements should be considered:

• maximum possible coverage, especially in the SFN scenario,
• reduction of interference outside the serving area, especially in the MFN scenario.

The first condition is defined by the minimum electric field strength necessary for proper quality, related to signal decoding by the user equipment. This minimum usable field strength depends on the carrier-to-noise power ratio C/N at the input of the receiver, which in turn depends on the signal modulation and the transmitting frequency.

The second condition defines the minimum carrier-to-interference ratio C/I necessary to receive the signal and perform proper decoding. This minimum C/I ratio is called the protection ratio. If the interference level is above the signal level at the receiver input, proper reception will not be possible [19].

III. MATERIALS & METHODS

The development of a novel tool for signal coverage simulation and analysis requires several components, including an appropriate model, software environment and methodology. This section briefly describes those key elements of the developed system.
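The guard-interval condition described in Section II.A can be illustrated numerically. The sketch below is a toy under stated assumptions: the 246 µs value approximates the guard interval of DAB transmission mode I, and the function names are invented for illustration. Two SFN signals combine usefully only when their path difference, divided by the speed of light, fits within the guard interval.

```python
# Speed of light [m/s] and an approximate guard interval for DAB mode I [s]
C = 299_792_458.0
GUARD_INTERVAL_S = 246e-6

def path_difference_m(d1_m: float, d2_m: float) -> float:
    """Difference in propagation path length from two transmitters to one receiver."""
    return abs(d1_m - d2_m)

def signals_combine_constructively(d1_m: float, d2_m: float,
                                   guard_s: float = GUARD_INTERVAL_S) -> bool:
    """True if the relative delay of the two same-content signals fits in the guard interval."""
    delay_s = path_difference_m(d1_m, d2_m) / C
    return delay_s <= guard_s

# Maximum tolerable path difference for this guard interval, in km
# (roughly 74 km for 246 us, which is why SFN transmitter spacing matters)
max_path_km = C * GUARD_INTERVAL_S / 1000.0
```

A receiver 10 km from one transmitter and 50 km from another still sees both signals add up; at 5 km and 90 km the 85 km path difference exceeds the guard interval and the second signal becomes interference.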
A. GIS Tools

Geographic Information Systems are used for the creation, integration, editing, analysis and visualization of spatial data in the form of thematic maps. Spatial analyses are usually performed using Desktop GIS, and results thereof are disseminated using Internet GIS, also known as Web-GIS [20]. Over the years, there have been several attempts at providing desktop-level analytic functionality in Web-GIS [21][22]. However, three-dimensional spatial analysis has so far not been achieved in Web-GIS without the need for platform-specific browser plugins.

As far as browser-independent implementation of three-dimensional Web-GIS is concerned, the current state of the art in the field is represented by the Cesium JavaScript library. Cesium provides four-dimensional (3D space and time) visualization of geographic data from many sources, including Web Map Service (WMS), Tile Map Service (TMS), OpenStreetMap, Bing Maps and others. Three-dimensional data may be loaded along with textures and animations from COLLADA and glTF formats, as well as using the open 3D Tiles protocol. For every object, the library provides methods to change its transparency, brightness, contrast, color and saturation. Standard functionality also encompasses drawing of the atmosphere, celestial objects such as the sun, moon or stars, water, as well as dynamic lighting and shading. Navigation is possible using keyboard, mouse and touch input. Because Cesium uses the open WebGL standard of web browser 3D graphics acceleration, it can operate within any compatible web browser. In consequence, a web application built with Cesium can be accessed and utilized on any desktop-class computer or mobile device.

All of the above makes Cesium a powerful tool for constructing 3D GIS applications. However, the most important feature of the library is that it is completely Free and Open Source Software (FOSS), which makes it easy to adapt for many different purposes [23]. As such, it has been selected as the foundation of the presented system.

B. DAB+ Serving Area

The DAB+ broadcasting system has four transmission modes, and the different variants have an impact on the network planning process. Regardless of the mode, coverage can be estimated for different reception conditions, namely:

• fixed reception,
• portable outdoor reception,
• portable indoor reception,
• mobile reception.

When planning a broadcast network, two factors have to be taken into account: the first is covering the territory, and the other is the number of inhabitants that have to be covered with a certain signal level. In this scenario, we focus on the network coverage of the Tricity metropolitan area (Pomeranian Voivodeship) located in Northern Poland.

In the SFN scenario, when a consumer device receives a signal from more than one transmitter, the strongest transmitter is considered the main signal source. In this case, the second and following transmitters can either contribute to a good reception or act as a source of interference. This of course depends on the technology and location of a particular transmitter. Accordingly, facilities should be located in such a way that coverage overlapping is avoided to prevent interference, or that overlapping signals are combined constructively for enhanced coverage.

C. Methodology

Traditional signal coverage mapping methods involve two-dimensional path loss modeling combined with a viewshed analysis based on a three-dimensional DTM. As a result, the produced maps are 2D by nature. In contrast, the proposed solution involves creating a fully 3D mapping and analysis workflow. In order to provide 3D mapping and analysis of radio signal coverage, the Okumura-Hata radio wave propagation model [24] has been implemented in a 3D Web-based GIS system. The system uses the Cesium library to provide a fully 3D mapping and analysis environment. A comparison of the standard approach to signal coverage mapping (A) with the proposed methodology (B) is presented in Fig. 1.

Figure 1. Comparison of the standard signal coverage mapping methodology (A) and the proposed approach (B).

As shown in Fig. 1, although both signal coverage mapping methods use similar data and models, the proposed approach maintains a fully three-dimensional pipeline. As a result, signal modeling can be performed in three dimensions, allowing for a more precise analysis of coverage in the context of terrain topology. For instance, this allows identifying local obstacles that can negatively affect the signal, causing scattering, fading, multipath propagation, etc.

In order to provide a real-world implementation scenario, a particular geographic environment was required. For this purpose, the area of the Tricity, a metropolitan complex located in Northern Poland, has been selected. The dedicated operating DAB+ transmitter is located in Gdańsk Chwaszczyno. The serving area can be considered a mixture of urban and suburban environments. The parameters of the terrestrial broadcasting network are shown in Table 1.
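The Okumura-Hata propagation model mentioned in the methodology can be sketched as follows. The formula below is the classic urban formulation as commonly given in the literature, not necessarily the exact variant implemented by the authors; the antenna heights and distance are illustrative assumptions, and only the 176.640 MHz carrier is taken from the transmitter parameters.

```python
import math

def okumura_hata_path_loss_db(f_mhz: float, h_b_m: float,
                              h_m_m: float, d_km: float) -> float:
    """Median urban path loss [dB] per the classic Okumura-Hata formulation
    (valid roughly for 150-1500 MHz, base antenna 30-200 m,
    mobile antenna 1-10 m, distance 1-20 km)."""
    # Mobile-antenna height correction for a small/medium city
    a_hm = ((1.1 * math.log10(f_mhz) - 0.7) * h_m_m
            - (1.56 * math.log10(f_mhz) - 0.8))
    return (69.55 + 26.16 * math.log10(f_mhz)
            - 13.82 * math.log10(h_b_m) - a_hm
            + (44.9 - 6.55 * math.log10(h_b_m)) * math.log10(d_km))

# Illustrative values only: channel 5B carrier (176.640 MHz), an assumed
# 100 m base antenna, a 1.5 m receiver antenna, and a 10 km range.
loss = okumura_hata_path_loss_db(176.640, 100.0, 1.5, 10.0)
```

With these assumed heights the model predicts a median loss of roughly 130 dB at 10 km, and the loss grows with distance, which is what the coverage zones of the following figures encode.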
TABLE I. DAB+ GDAŃSK CHWASZCZYNO TRANSMITTER PARAMETERS

MUX     | Channel | Frequency [MHz] | Power [kW] | Polarization
DAB-GDA | 5B      | 176.640         | 10         | H

According to the Polish Central Statistical Office (GUS) [25], the Tricity region has approximately 800,000 inhabitants. It can be estimated that the serving region of the DAB+ transmitter located in Chwaszczyno delivers the digital radio signal to over 1 million inhabitants. This clearly indicates that the topic of network and coverage planning is a matter of great importance, taking into consideration the number of users and the related revenue. The theoretical coverage of the serving transmitter tower, according to the Okumura-Hata model, is shown in Fig. 2.

Figure 2. Analyzed area in Tricity, Northern Poland.

It can be assumed that the estimated coverage area represents the majority of the Pomeranian Voivodeship.

IV. RESULTS & DISCUSSION

The proposed system has been set to analyze the radio signal coverage of the DAB+ terrestrial network in the Tricity region. Fig. 3 presents the 3D representation of the theoretical range of the transmitter tower in Chwaszczyno.

Figure 3. 3D representation of the theoretical range of the transmitter tower in Chwaszczyno.

The implementation of the signal attenuation model in the 3D Web-GIS system enables straightforward visualization and analysis of potential signal quality issues. Fig. 4 shows the areas of the Pomeranian region where signal reception is estimated to be excellent (marked in green), good (marked in yellow) and fair (marked in red).

Figure 4. Modeled signal quality zones: excellent (marked in green), good (marked in yellow) and fair (marked in red).

The three-dimensional nature of the presented solution enables a more detailed analysis of the impact of terrain topology on signal reception. For instance, utilizing 3D radio signal modeling allows identifying local white spaces or regions with Non-Line-of-Sight (NLOS) conditions, where the digital signal may be blocked by both natural and man-made objects, such as hills and buildings.

Figure 5. Area of potential signal loss (marked in magenta) created by occlusion of the transmitter tower (shown on the horizon) by hills in the City of Gdynia, Leszczynki district.

Fig. 5 presents the results of 3D signal occlusion modeling in the City of Gdynia, Leszczynki district. Occlusion of the transmitter tower (shown on the horizon) by hills creates an area of potential signal loss (marked in magenta).

V. CONCLUSIONS

This paper presents a novel method for 3D modeling and mapping of radio signal coverage using GIS tools. The proposed solution, designed to analyze possible DAB+ transmitter configurations, has been evaluated in the area of Tricity in Northern Poland. The presented results have shown that a fully three-dimensional simulation method enables a more thorough investigation of effects like the structural occlusion of municipal features, where signal obstruction may be analyzed individually for particular objects. Moreover, the presented system also makes it possible to investigate issues such as the impact of terrain topology on signal reception. The latter is
particularly important, as it could not be directly investigated using traditional methods.

Proper modeling of potential signal coverage issues is an important subject for broadcasters, operators, as well as network providers, who want to keep the total costs of deploying a broadcast network as low as possible. From a theoretical perspective, the network QoS and user QoE will depend on the number and nature of the received signals, e.g. the number of neighboring transmitters, relative delays, error rate, etc. Geographical factors, such as terrain topology and population density, will also have a great impact on the signal quality. Network coverage estimations can certainly improve the planning and maintenance phase of any private or public broadcaster. Further studies may include different serving areas or reception conditions, particularly indoor and outdoor, as well as mobile scenarios. As the demand for more reliable services continues to grow, there will be more demand for computer-aided GIS simulation tools for network coverage analysis.

As shown in the presented results, the integration of 3D radio signal modeling with a dedicated 3D GIS system constitutes a novel tool for enhanced mapping and analysis of signal coverage. While the system has been tested on DAB+ transmitters, it could potentially be adapted for the investigation of other types of terrestrial broadcasting networks, including radio and television.

REFERENCES

[1] C. Fry, "GIS in Telecommunications," in Geographical Information Systems, pp. 819–826, 1999.
[2] M. Sawada, D. Cossette, B. Wellar, and T. Kurt, "Analysis of the urban/rural broadband divide in Canada: Using GIS in planning terrestrial wireless deployment," Government Information Quarterly, vol. 23, no. 3-4, pp. 454–479, 2006. DOI: 10.1016/j.giq.2006.08.003
[3] A. Hrovat, I. Ozimek, A. Vilhar, T. Celcer, I. Saje, and T. Javornik, "Radio coverage calculations of terrestrial wireless networks using an open-source GRASS system," WSEAS Transactions on Communications, vol. 9, no. 10, pp. 646–657, 2010.
[4] Y. Ahmed, W. Mughni, and P. Akhtar, "Development of a GIS Tool for WiMax Network Planning," in Proc. ICIET Conference, pp. 1–5, 2010. DOI: 10.1109/ICIET.2010.5625735
[5] L. Lan, X. Gou, Y. Xie, and M. Wu, "Intelligent GSM Cell Coverage Analysis System Based on GIS," Journal of Computers, vol. 6, no. 5, pp. 897–904, 2011. DOI: 10.4304/jcp.6.5.897-904
[6] M. Sangeetha, B. M. Purushothaman, and S. Suresh Babu, "Estimating cell phone signal intensity and identifying Radiation Hotspot Area for Tirunel Veli Taluk using RS and GIS," Intl. J. Res. Eng. Technol., vol. 3, no. 2, pp. 412–418, 2014. DOI: 10.1.1.676.275
[7] S. A. Karulkar and J. Y. Oh, "Optimal Placement of Base Station for Cellular Network Expansion," Issues in Information Systems, vol. 17, no. 2, 2016.
[8] A. A. Sorokin and A. A. Gorunov, "Multifunction measuring system for monitoring of coverage area of mobile network operator," in Proc. SIBCON Conference, pp. 1–8, 2016. DOI: 10.1109/SIBCON.2016.7491691
[9] M. Kulawiak and Z. Lubniewski, "Processing of LiDAR and Multibeam Sonar Point Cloud Data for 3D Surface and Object Shape Reconstruction," in Proc. BGC Congress, pp. 187–190, 2016. DOI: 10.1109/BGC.Geomatics.2016.41
[10] P. Gilski and J. Stefański, "Can the Digital Surpass the Analog: DAB+ Possibilities, Limitations and User Expectations," Intl. J. Elec. Tele., vol. 62, no. 4, pp. 353–361, 2016. DOI: 10.1515/eletel-2016-0049
[11] P. Gilski, "DAB vs DAB+ Radio Broadcasting: a Subjective Comparative Study," Arch. Acoust., vol. 42, no. 4, pp. 157–165, 2017. DOI: 10.1515/aoa-2017-0074
[12] K. Ulovec and M. Smutny, "Perceived Audio Quality Analysis in Digital Audio Broadcasting Plus System Based on PEAQ," Radioengineering, vol. 27, no. 1, pp. 342–352, 2018. DOI: 10.13164/re.2018.0342
[13] V. Marianov and H. A. Eiselt, "Transmitter location for maximum coverage and constructive–destructive interference management," Comput. Oper. Res., vol. 39, pp. 1441–1449, 2012. DOI: 10.1016/j.cor.2011.08.015
[14] CHIRplus, Available at: https://www.lstelcom.com/en/solutions-in/radio-communications/telecom/ [Accessed: 28.05.2018]
[15] FRANSY, Available at: https://www.irt.de/produkte-und-services/produkte/fransy/ [Accessed: 28.05.2018]
[16] European Broadcast Union, Available at: https://www.ebu.ch/home [Accessed: 28.05.2018]
[17] P. Gilski and J. Stefański, "Transmission Quality Measurements in DAB+ Broadcast System," Metrol. Meas. Syst., vol. 24, no. 4, pp. 675–683, 2017. DOI: 10.1515/mms-2017-0050
[18] P. Gilski, "Adaptive Multiplex Resource Allocation Method for DAB+ Broadcast System," in Proc. 21st SPA Conference, pp. 337–342, 2017. DOI: 10.23919/SPA.2017.8166889
[19] M. Gosta, D. Vlahović, and B. Zovko-Cihlar, "Interference conditions on Croatian coast in DVB-T planning," in Proc. 15th IWSSIP Conference, 2008. DOI: 10.1109/IWSSIP.2008.4604405
[20] P. Longley, Geographic Information Systems and Science. John Wiley & Sons, 2005.
[21] M. Kulawiak, Z. Lubniewski, K. Bikonis, and A. Stepnowski, "Geographical information system for analysis of critical infrastructures and their hazards due to terrorism, man-originated catastrophes and natural disasters for the city of Gdansk," in Information Fusion and Geographic Information Systems, Springer, Berlin, Heidelberg, pp. 251–262, 2009. DOI: 10.1007/978-3-642-00304-2_17
[22] A. Dawidowicz and M. Kulawiak, "The potential of Web-GIS and geovisual analytics in the context of marine cadastre," Surv. Rev., pp. 1–12, 2017. DOI: 10.1080/00396265.2017.1328331
[23] M. Kulawiak and M. Kulawiak, "Application of Web-GIS for dissemination and 3D visualization of large-volume LIDAR data," in Lecture Notes in Geoinformation and Cartography, The Rise of Big Spatial Data, Springer, pp. 1–12, 2016. DOI: 10.1007/978-3-319-45123-7_1
[24] ITU Recommendation P.1546, Method for point-to-area predictions for terrestrial services in the frequency range 30 MHz to 3 000 MHz, Geneva, Switzerland, 2013.
[25] GUS Statistics Poland, Available at: http://demografia.stat.gov.pl/bazademografia/Tables.aspx [Accessed: 28.05.2018]
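The structural occlusion analysis discussed in Sections III-IV reduces, at its core, to a line-of-sight test against a terrain profile. The following sketch is a simplified illustration under assumed conditions (uniform sample spacing, straight-ray geometry, invented heights), not the Cesium-based implementation used in the paper.

```python
def line_of_sight_clear(profile_m, tx_height_m, rx_height_m):
    """Check whether a straight ray from the transmitter antenna (above the
    first profile sample) to the receiver antenna (above the last) clears
    every terrain sample in between. `profile_m` holds ground elevations [m]
    at equal horizontal spacing."""
    n = len(profile_m)
    start = profile_m[0] + tx_height_m
    end = profile_m[-1] + rx_height_m
    for i in range(1, n - 1):
        # Height of the ray at sample i by linear interpolation
        ray_height = start + (end - start) * i / (n - 1)
        if profile_m[i] >= ray_height:
            return False  # terrain blocks the ray: NLOS, potential signal loss
    return True

# Illustrative profiles: flat ground versus a 120 m hill midway between
# a 60 m transmitter tower and a 10 m receiver mast.
flat = [10.0] * 11
hilly = [10, 20, 40, 80, 120, 120, 90, 50, 30, 15, 10]
```

Running the test over every receiver cell of a raster produces exactly the kind of magenta "potential signal loss" zones shown for the Leszczynki district in Fig. 5.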

Perfect Low Power Narrowband Transmitters for Dense Wireless Sensor Networks

Anatoliy Platonov
Warsaw University of Technology, Warsaw, Poland
e-mail: plat@ise.pw.edu.pl

Ievgen Zaitsev
Warsaw University of Technology, Warsaw, Poland
e-mail: ievgen.zaitsev@gmail.com

Abstract—This paper presents the backgrounds of approach  Limited energy resources (batteries or not too efficient
to optimization and design of software defined adaptive feedback renewable sources) of EN create severe limitations on the
communication systems (AFCS) for applications at the physical power consumption of EN and their transmitters.
(PHY) layer of wireless sensor networks. A particular feature of
AFCS is they transmit the signals from digital or analog sensors  Apart from reliable delivering signals, EN transmitters of
to the base stations (BS) using pulse-amplitude (PAM) modula- dense WSN should most efficiently utilize the bandwidth
tors adaptively adjusted by the controls formed in BS, no coding. of their channels to decrease inter-channel interference.
Absence of coders permits to derive optimal transmission-recep-
tion algorithms determining the way of optimal AFCS design. Thus, the most appropriate EN transmitters for WSN ap-
Adaptive properties of the systems permit to transmit data to BS plications are to be low power narrowband devices reliably
perfectly, i.e. with energy and spectral efficiencies attaining transmitting signals and capable to work during months or
Shannon’s limits. The not used before adequate measures of years depending on the network destination. The great rate of
AFCS performance are discussed and used for investigation of transmission is not necessary.
designed prototype of optimal AFCS functioning. Optimal AFCS
may become a perspective class of high efficient narrowband low These requirements are contradictory, and existing com-
energy communication channels for the wireless sensor networks. munication systems (CS) cannot fully address the problem. It
looks some paradoxically, but the reason is coding of signals,
Index Terms—adaptive transmission; feedback, optimization, currently basic and commonly used principle of signals trans-
performance, limit energy-spectral efficiency, capacity.
mission. As it was shown in [8]-[12], coding does not allow
formulation of full mathematical models describing transfor-
I. INTRODUCTION mations of input signals from the input to output of CS, and
In the last years, there grows a number of publications in- counting the influence of transmitter - receiver parameters and
dicating very little ability to improve the performance of the physical (PHY) layer of wireless networks other than by increasing the output power of the transmitters [1]-[2]. The Shannon limit is being approached to within 1 dB, and further improvement critically complicates the transmitting units of end nodes [2]. The need for extreme energy efficiency, as well as for new wireless networking paradigms, is noted in [3],[4]. In [5], the authors directly ask whether research at the PHY layer of networks is still relevant. The severity of the problem has additionally increased with the beginning of work on the development of heterogeneous 5G networks [6], in particular of dense wireless sensor networks (WSN).

In this paper, we consider possibilities to improve the performance of the PHY layer of WSN using a new principle of signal transmission and the optimal transmission–reception algorithms developed in works [7]-[8]. The proposed principle of transmission takes into account the following particularities of (non-mesh topology) WSN:

• The end nodes (EN) of the networks communicate with the base stations (BS) over forward and feedback channels.

• Every EN transmitter delivers to the BS a small amount of information that does not require a great rate of transmission.

channel noises on the quality of transmission. This blocks the possibility of developing a systematic approach to analytical optimization of CS with coding, which does not permit defining the requirements to EN transmitters and BS that ensure the limit performance of data transmission, nor determining this limit. On the other hand, in the years 1955-1970 coding was not considered the single principle of signal transmission. A great number of excellent works on analytical optimization of CS with feedback channels (FCS) were published ([13]-[18] and others). A common result of this research was confirmation that optimal FCS may transmit signals without coding "perfectly", i.e. with negligibly small errors, at a bit rate equal to the channel capacity. Later research also showed that "perfect" FCS transmit signals with maximal energy [J/bit] and spectral [bit/s/Hz] efficiency. Moreover, EN transmitters of FCS have a substantially simpler construction than transmitters of CS due to the absence of coders, which are replaced by a pulse-amplitude (PAM) modulator adjusted over the feedback channel. The results of FCS optimization (the transmission–reception algorithm) also determined the way to design a perfect FCS. The task seemed to be solved, but none of the results of this research has been implemented until now. The main reason was (and remains) the commonly used linear model of the forward (EN) transmitter.
Every transmitter has a limited output range that, in the absence of additional constraints, inevitably causes its saturation in the first or subsequent cycles of signal transmission, which breaks the transmission. Linear models of the forward transmitters omit this fact. The second reason was the scenario-dependent performance of FCS: a system optimal in a given scenario works non-optimally in another.

Research [7]-[12] allowed removal of these problems and development of an analytical approach to FCS optimization permitting derivation of transmission–reception algorithms applicable in practice. A scalar version of the algorithm was used for designing the prototype of an optimal adaptive FCS (AFCS) [19],[20]. Experimental study of the prototype confirmed its functioning as a perfect CS. The paper presents a discussion of the principles of perfect AFCS functioning, design and testing from the point of view of information theory, as well as of possibilities of their implementation as software defined CS.

[Fig. 1. Block diagram of AFCS. Forward EN transmitter: S&H → Σ → adaptive modulator M1 → Ch1; base station (BS): DM1 → DSP → x̂n, with the feedback path z⁻¹ → T2 → Ch2 → R2 returning the current estimate to Σ.]

[Fig. 2. Static transition function of the forward transmitter (PAM modulator): output A0·Mk·ek for |ek| ≤ 1/Mk, saturating at ±A0 outside this range.]

II. PRINCIPLE OF TRANSMISSION AND CHARACTERISTICS OF AFCS

According to the note of C. Shannon [21], an adequate performance criterion for AFCS transmitting signals from analog sources is the mean distance between the original $x$ and recovered $\hat{x}$ signals, that is, the mean square error (MSE) $P_k = E[(x - \hat{x}_k)^2]$ of transmission. A general block diagram of AFCS is presented in Fig. 1. It is assumed that the signals formed by the sensors are normally distributed narrowband processes or random values with known mean value $x_0$ and variance $\sigma_0^2$, and that the forward and feedback channel noises $\xi_k, \eta_k$ are additive white Gaussian noises (AWGN).

A. General principle of AFCS transmission

The input signals $x_t$ are sampled in the sample and hold unit (S&H) of the EN transmitter (see Fig. 1), and each sample $x^{(m)}$, $m = 1, 2, \ldots$, is transmitted in the same way, independently from the previous ones. In this case, analysis of AFCS operation can be reduced to transmission of a single sample, which permits omitting the index $m$ in the notation $x^{(m)}$. Each sample is held at the first input of the subtracting unit Σ during the time permitting its transmission in $n$ cycles. The duration of the cycles $\Delta t_0$ depends on the distance between EN and BS and on the rate of their signal processing units, and determines the minimal bandwidth $2F_0 = 1/\Delta t_0$ of the channels. In each cycle $k = 1, \ldots, n$, the subtractor Σ forms a difference signal $e_k = x - \hat{x}^*_k$, where $\hat{x}^*_k = \hat{x}_{k-1} + \nu_k$ describes the estimate $\hat{x}_{k-1}$ of the sample computed in the previous cycle by the digital signal processing unit (DSP) of the BS, which stores it and sends it to the EN over the feedback channel T2–Ch2–R2. The variable $\nu_k$ describes the errors in the received estimate $\hat{x}_{k-1}$ caused by the feedback AWGN $\eta_k$; its variance $\sigma_\nu^2$ is known and satisfies the inequality $\sigma_\nu^2 \ll \sigma_0^2$. Thus, $e_k$ are the errors $x - \hat{x}_{k-1}$ of the estimates of the sample after $k$ cycles of transmission, distorted by the feedback noise $\nu_k$.

The values $e_k$ are routed to the input of the adjusted PAM modulator M1, whose modulation depth is set, for each $k$, to the corresponding value $M_k$. Omitting high frequency components, the signals emitted by the EN transmitter can be written in the form:

$$ y_k = A_0 \begin{cases} M_k e_k , & \text{for } |e_k| \le 1/M_k \\ \pm 1 , & \text{for } |e_k| > 1/M_k \end{cases} \qquad (1) $$

where $A_0$ is the amplitude of the carrier (see also Fig. 2). The received demodulated signal is described by the relationship:

$$ \bar{y}_k = \frac{A}{A_0} y_k + \xi_k , \qquad (2) $$

where $A = A_0 \gamma / r$ is the amplitude of the demodulated signal, $\gamma$ describes the path losses, and $r$ is the distance between the forward transmitter and the BS. The signals $\bar{y}_k$ are routed to the DSP unit, which computes the next estimate $\hat{x}_k$ according to the equation (see e.g. [7],[8]):

$$ \hat{x}_k = \hat{x}_{k-1} + L_k \bar{y}_k , \qquad (\hat{x}_0 = x_0) \qquad (3) $$

where $L_k$ is a free parameter determining the rate of convergence of algorithm (3), and $\hat{x}_k = \hat{x}(\bar{y}_1^k)$, where $\bar{y}_1^k = (\bar{y}_1, \ldots, \bar{y}_k)$ is the sequence of signals delivered to the BS.

The new estimate $\hat{x}_k$ is stored in the memory of the DSP unit and transmitted over the feedback channel to the EN transmitter. The received signal $\hat{x}^*_{k+1} = \hat{x}_k + \nu_{k+1}$ is routed to the subtractor Σ, which forms the new signal $e_{k+1} = x - \hat{x}_k - \nu_{k+1}$. Simultaneously, the DSP unit sets the parameters $M_k$ and $L_k$ of the adaptive modulator and of the estimating algorithm (3) to the values $M_{k+1}$ and $L_{k+1}$, and the next cycle of transmission begins. After $n$ cycles, the final estimate $\hat{x}_n$ is routed to the addressee, and the AFCS begins transmission of the next sample.

B. Optimal transmission-reception algorithm

Models (1)-(3) of the EN transmitter and of the digital part of the BS receiver allow formulation of a full mathematical model of the transmission process from the beginning to the end, as well as of the MSE of the estimates $P_k$, directly dependent on the parameters influencing the quality of transmission. The latter permits finding the optimal gains $M_k, L_k$ which minimize, for each $k = 1, \ldots, n$, the values of the MSE $P_k$, as well as finding these values. Furthermore, this information is sufficient for designing an optimal AFCS which transmits signals to the BS with maximal energy and spectral efficiencies, attaining the Shannon limit.
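As a numerical illustration of the loop (1)-(3), the sketch below Monte-Carlo-simulates the transmission of independent Gaussian samples over n cycles, using the optimal gain schedule (4),(6),(7) discussed in the sequel. This is a hedged sketch, not the prototype's code: amplitudes are normalized (A = 1 at the receiver), the forward-noise level is set from the output SNR Q² following (8), and all numerical values are arbitrary.

```python
import math
import random

def simulate_afcs(n_cycles=8, n_samples=20000, Q2=3.0, alpha=4.0,
                  sigma0=1.0, x0=0.0, sigma_nu=0.0, seed=1):
    """Monte-Carlo sketch of the per-sample AFCS loop (1)-(3) with the
    optimal gain schedule (4),(6),(7).  Amplitudes are normalized
    (A = 1), and the forward-noise std sigma_xi is set from the SNR via
    Q^2 = A^2 / (alpha^2 * sigma_xi^2), following (8).  Illustrative."""
    rng = random.Random(seed)
    A = 1.0
    sigma_xi = A / (alpha * math.sqrt(Q2))
    # Deterministic schedule M_k, L_k and theoretical MSE P_k (eqs (4),(6),(7)).
    M, L, P_theory = [], [], []
    P = sigma0 ** 2                                   # P_0 = sigma_0^2
    for _ in range(n_cycles):
        Mk = 1.0 / (alpha * math.sqrt(sigma_nu ** 2 + P))
        den = sigma_xi ** 2 + A ** 2 * Mk ** 2 * (sigma_nu ** 2 + P)
        Lk = A * Mk * P / den
        P = (sigma_xi ** 2 + A ** 2 * Mk ** 2 * sigma_nu ** 2) * P / den
        M.append(Mk); L.append(Lk); P_theory.append(P)
    # Monte-Carlo transmission of independent Gaussian samples.
    err2 = [0.0] * n_cycles
    for _ in range(n_samples):
        x = rng.gauss(x0, sigma0)
        xhat = x0                                     # x_hat_0 = x_0
        for k in range(n_cycles):
            e = x - xhat + rng.gauss(0.0, sigma_nu)   # feedback-distorted error
            y = A * max(-1.0, min(1.0, M[k] * e))     # saturating PAM modulator (1)
            ybar = y + rng.gauss(0.0, sigma_xi)       # demodulated signal (2)
            xhat = xhat + L[k] * ybar                 # estimate update (3)
            err2[k] += (x - xhat) ** 2
    return [s / n_samples for s in err2], P_theory

mse_empirical, mse_theory = simulate_afcs()
```

With ideal feedback (sigma_nu = 0), the theoretical recursion reduces to P_k = P_{k-1}/(1+Q²), so the empirical MSE should fall by a factor of 1+Q² per cycle, in line with (17); the rare saturations (probability μ fixed by α) perturb this only negligibly.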
In the formulated conditions, the solution of this task (the optimal transmission–reception algorithm) is as follows ([7]-[11]).

The parameters $M_k$ of the EN transmitter (1) should be set, for each $k = 1, \ldots, n$, to the maximal permissible values:

$$ M_k^{max} = \frac{1}{\alpha \sqrt{\sigma_\nu^2 + P_{k-1}^{min}}} , \qquad M_1^{max} = \frac{1}{\alpha \sigma_0} , \qquad (4) $$

where the parameter $\alpha$ (saturation factor) determines the permissible probability $\mu$ of EN transmitter saturation, connected with $\alpha$ by the equation [7]:

$$ \Phi(\alpha) = \frac{1}{\sqrt{2\pi}} \int_0^{\alpha} e^{-x^2/2} \, dx = \frac{1-\mu}{2} . \qquad (5) $$

The parameters $L_k$ of the digital receiver of the BS (algorithm (3)) should be set, for each $k = 1, \ldots, n$, to the values:

$$ L_k^{opt} = \frac{A M_k P_{k-1}^{min}}{\sigma_\xi^2 + A^2 M_k^2 (\sigma_\nu^2 + P_{k-1}^{min})} = \frac{1}{A M_k^{opt}} \left( 1 - \frac{P_k^{min}}{P_{k-1}^{min}} \right) . \qquad (6) $$

The values of the parameters (4),(6) directly depend on the limit values of the MSE determined by the recursive Riccati-type equation (the indices "max" and "opt" in $M_k, L_k$ are further omitted):

$$ P_k^{min} = \frac{(\sigma_\xi^2 + A^2 M_k^2 \sigma_\nu^2) P_{k-1}^{min}}{\sigma_\xi^2 + A^2 M_k^2 (\sigma_\nu^2 + P_{k-1}^{min})} = (1+Q^2)^{-1} \left( 1 + Q^2 \frac{\sigma_\nu^2}{\sigma_\nu^2 + P_{k-1}^{min}} \right) P_{k-1}^{min} , \qquad (P_0 = \sigma_0^2) . \qquad (7) $$

The variable $Q^2$ in (7) describes the signal-to-noise ratio (SNR) at the forward channel output:

$$ Q^2 = \frac{W^{sign}}{N_\xi F_0} = \left( \frac{A}{\alpha} \right)^2 \frac{1}{N_\xi F_0} = \left( \frac{A_0 \gamma}{\alpha r} \right)^2 \frac{1}{N_\xi F_0} = SNR_{out}^{DM1} , \qquad (8) $$

where $N_\xi / 2$ is the double-sided spectral power density of the AWGN $\xi_k$ in the forward channel.

It is worth noting that the appearance of the parameter $\alpha$ in relationships (4),(6)-(8) removes the main cause of the failures in applying the previous research. Setting $\alpha$ to the value satisfying formula (5) fixes the probability $Pr_k^{sat}$ (mean frequency) of possible saturations of the EN transmitter at a given small level [22], e.g. $\mu = 10^{-4}$–$10^{-7}$:

$$ Pr_k^{sat} = \Pr\left( M_k |e_k| > 1 \mid \bar{y}_1^{k-1} \right) = 1 - \int_{-1/M_k}^{1/M_k} p(e_k \mid \bar{y}_1^{k-1}) \, de_k = \mu , \qquad (9) $$

which practically eliminates the appearance of saturations. Another problem — the dependence of the gains $M_k, L_k$ and of the MSE $P_k$ on the conditions of AFCS application — is discussed in Sect. IV.

C. Main measures of AFCS performance

The results of works [7]-[11] showed that the MSE of transmission $P_k$ (the accuracy of signal recovery) is the basic measure of AFCS performance and determines the information characteristics of the systems. Another measure of AFCS performance, not used until now, is the saturation factor $\alpha$ (or the probability of saturation $\mu$) determining the percentage of completely distorted samples. The parameter $\alpha$ is set by the designers according to the requirements to the system, is constant, and does not depend on the conditions of AFCS application.

The next widely used measure of system performance is the rate of transmission measured in bit/s. Unlike CS with coding, AFCS allow accurate evaluation of the bit rate provided by the overall system, which may differ from the bit rate of transmission over its internal physical channel [7]-[11]. This becomes possible due to the possibility of computing the prior $H(X)$ and posterior $H(X \mid \hat{X}_n)$ entropies of the input signals, and the mean amount of information $I(X, \hat{X}_n) = H(X) - H(X \mid \hat{X}_n)$. Taking into account that the duration of the sample transmission is $T_n = n \Delta t_0 = n / 2F_0$, one may determine the mean bit rate of transmission provided by the AFCS considered as a generalized communication channel:

$$ R_n^{AFCS} = \frac{I(X, \hat{X}_n)}{T_n} = F_n \log_2 \frac{\sigma_0^2}{P_n} = \frac{F_0}{n} \log_2 \frac{\sigma_0^2}{P_n} \ \text{[bit/s]}. \qquad (10) $$

In turn, the iterative AFCS transmission permits introduction of a current bit rate, more informative than (10) and useful for researchers, determined by the relationship:

$$ \Delta R_k^{AFCS} = \frac{I(X, \hat{X}_k) - I(X, \hat{X}_{k-1})}{\Delta t_0} = F_0 \log_2 \frac{P_{k-1}}{P_k} \ \text{[bit/s]}, \qquad (11) $$

which describes the increment of information in the sequentially computed estimates $\hat{x}_k$, $(k = 1, \ldots, n)$ of the transmitted samples. Formulas (10),(11) define the no less useful characteristics of AFCS — the mean and current spectral efficiencies of transmission (in [bit/s/Hz]):

$$ \frac{R_n^{AFCS}}{F_0} = \frac{1}{n} \log_2 \frac{\sigma_0^2}{P_n} ; \qquad \frac{\Delta R_k^{AFCS}}{F_0} = \log_2 \frac{P_{k-1}}{P_k} . \qquad (12) $$

The next measure important for research and applications is the energy efficiency of transmission, defined as the ratio of the "energy per bit" $E_n^{bit\,AFCS}$ (the energy spent on transmission of one bit) to the spectral power density $N_\xi$ of the AWGN $\xi_k$:

$$ \frac{E_n^{bit\,AFCS}}{N_\xi} = \frac{W_n^{sign}}{N_\xi R_n} = \frac{n Q_n^2}{\log_2 (\sigma_0^2 / P_n)} . \qquad (13) $$

One should notice that the presented measures (except (11)) can also be used for evaluation of the performance of both AFCS and arbitrary CS: it is sufficient to transmit a testing sequence of samples $(x^{(1)}, \ldots, x^{(M)})$, to store the estimates $\hat{x}_k^{(m)}$, and to compute the corresponding values of the MSE using the formula:

$$ \hat{P}_k = \frac{1}{M} \sum_{m=1}^{M} \left[ x^{(m)} - \hat{x}_k^{(m)} \right]^2 ; \qquad (k = 1, \ldots, n) . \qquad (14) $$

For non-iterative CS with coding $n = 1$, and substitution of $\hat{P}_1$ instead of $P_n$ into formulas (10),(12),(13) gives the values of the information characteristics of the CS with coding.

III. DIFFERENCES AND CLOSENESS OF LIMIT PERFORMANCE CHARACTERISTICS OF CS AND AFCS

A. Differences in definitions of AFCS and CS capacities

The capacity of a CS is determined as the maximal amount of information delivered to the addressee per second or per transmitted signal, which can be discrete or continuous, i.e. it is the maximal available bit rate of transmission over the system.
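The interplay of formulas (4)-(10) can be checked numerically. The sketch below solves (5) for the saturation factor α by bisection, iterates the second form of the Riccati-type recursion (7), and evaluates the mean bit rate (10). It is a minimal illustration under the paper's assumptions (Gaussian source, AWGN channels); the function names are ours, not the authors'.

```python
import math

def saturation_factor(mu):
    """Solve eq. (5) for the saturation factor alpha given the permissible
    saturation probability mu: Phi(alpha) = (1 - mu)/2, with Phi the
    one-sided standard-normal integral.  Plain bisection; illustrative."""
    def phi(a):
        return 0.5 * math.erf(a / math.sqrt(2.0))
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(mid) < (1.0 - mu) / 2.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def riccati_mse(n, Q2, sigma0=1.0, sigma_nu=0.0):
    """Iterate the second form of recursion (7):
    P_k = (1+Q2)^(-1) * (1 + Q2*s_nu^2/(s_nu^2 + P_{k-1})) * P_{k-1},
    starting from P_0 = sigma0^2.  Returns [P_1, ..., P_n]."""
    P = sigma0 ** 2
    out = []
    for _ in range(n):
        if sigma_nu == 0.0:                       # ideal feedback
            P = P / (1.0 + Q2)
        else:
            P = P * (1.0 + Q2 * sigma_nu ** 2 / (sigma_nu ** 2 + P)) / (1.0 + Q2)
        out.append(P)
    return out

def mean_bit_rate(F0, n, sigma0, Pn):
    """Mean bit rate (10): R = (F0/n) * log2(sigma0^2 / P_n)  [bit/s]."""
    return (F0 / n) * math.log2(sigma0 ** 2 / Pn)

alpha_1e4 = saturation_factor(1e-4)   # alpha for mu = 10^-4 (about 3.89)
```

With ideal feedback the recursion collapses to the geometric decay of (17), while a nonzero feedback-noise variance makes the MSE stall near σν², consistent with the limited validity interval discussed for (17)-(20).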
In the classic approach, CS are considered as a linear physical channel transmitting sequences of discrete signals corresponding to the code sequences formed in the source and channel coders. Only the basic characteristics of the system are considered: the power $W^{sign}$ of the received Gaussian signal, the double-sided channel bandwidth $2F_0$, and the spectral power density $N_\xi/2$ of the channel AWGN $\xi_k$. Then, the capacity of the channel (i.e. of the CS) is defined as the bit rate maximal over the set of prior distributions of the input signals (Shannon's formula [21]):

$$ C^{CS} = \max_{p(x)} \frac{I(X, \hat{X})}{\Delta t_0} = F_0 \log_2 \left( 1 + \frac{W^{sign}}{F_0 N_\xi} \right) = F_0 \log_2 (1 + Q^2) . \qquad (15) $$

The task of the researchers is to find an isomorphic mapping of the original set of discrete input signals into the set of code words which can be transmitted over the given channel with minimal errors (bit error rate, BER) and with a maximal rate as close as possible to the capacity (15). Optimization of the modulators and physical receivers is carried out almost independently, and its goal is additional improvement of the code transmission.

In AFCS theory, this task is reversed. Namely, a definite class of signal sources (described by a prior distribution) is given. The task is to determine the method of implementing the modulator and receiver of the AFCS enabling transmission of the input signals with minimal errors without coding. Works [7]-[11] and [13]-[18] show that optimal utilization of the feedback channel permits development of the corresponding modulator and receiver, i.e. an AFCS transmitting signals in real time with minimal MSE $P_n^{min}$ without coding. Therefore, substitution of $P_n^{min}$ into (10) gives the expression determining the maximal bit rate at the AFCS output, that is, the capacity of the system:

$$ C_n^{AFCS} = \max_{M_1^n; L_1^n} R_n^{AFCS} = F_n \log_2 \frac{\sigma_0^2}{P_n^{min}} = \frac{F_0}{n} \log_2 \frac{\sigma_0^2}{P_n^{min}} . \qquad (16) $$

One should note that transmission of the input signals with a bit rate greater than (16) is impossible.

Remark. Formula (16) is well known in rate-distortion theory (e.g. [14],[17],[18]), and $P_n^{min}$ determines the minimal mean distortion ("rate distortion function") of the received signals, which can be achieved using long codes approaching the rate of transmission to the channel capacity. As noticed above, coding does not allow formulation of a model of the transmission process counting all the basic factors disturbing the transmission of signals, which substantially weakens the constructiveness of rate-distortion theory in application to CS with coding. In turn, AFCS theory permits construction of such models and determination of the lower boundary of the signal distortion, as well as the way of interaction and the parameters of the modulator and receiver permitting the system to attain this boundary. This enables systematic design of perfect AFCS best answering the requirements to the communication channels of WSN.

B. Limit characteristics of optimal AFCS with ideal feedback

Application of high quality feedback channels, e.g. employing efficient error correcting codes and powerful feedback transmitters, permits considering them as ideal, noiseless channels ($\sigma_\nu^2 = 0$). In this case, the MSE equation (7) gives the following relationship ([7]-[11], see also [13]-[16]):

$$ P_n^{min} = \sigma_0^2 (1 + Q^2)^{-n} . \qquad (17) $$

Substitution of (17) into (16) determines the capacity of the AFCS, constant independently of the number of cycles:

$$ C_n^{AFCS} = F_0 \log_2 (1 + Q^2) . \qquad (18) $$

The identity of formulas (15) and (18) proves the equivalence of the externally different expressions for the channel capacity. This also confirms the possibility of using formulas (10),(14) for evaluation and comparison of the bit rates of different CS with coding. The expressions (12),(13) for the limit energy and spectral efficiencies of the optimal AFCS with ideal feedback also take the form of the canonic relationships of information theory:

$$ \frac{C_n^{AFCS}}{F_0} = \log_2 (1 + Q^2) \ \text{[bit/s/Hz]} ; \qquad (19) $$

$$ \frac{E_n^{bit\,AFCS}}{N_\xi} = \frac{Q^2}{\log_2 (1 + Q^2)} = \left( 2^{C^{AFCS}/F_0} - 1 \right) \frac{F_0}{C^{AFCS}} . \qquad (20) $$

One should stress that for an AFCS with noisy feedback, formulas (17)-(20) are valid only in the initial interval of transmission $k = 1, \ldots, n^*$, where $P_n^{min} > \sigma_\nu^2$ and $n^*$ is the solution of the equation $P_n^{min} = \sigma_\nu^2$ ([22],[7]). Longer transmission degrades the capacity and the power–spectral efficiency of the system.

C. Measurement of MSE and information characteristics

The principle of MSE measurement was briefly presented at the end of Sect. II; a wider discussion can be found, e.g., in [7],[9]. Therefore, below we give only the minimal information necessary for understanding the results of the experiments with the prototype of the AFCS discussed in the next section.

According to (10)-(12), the MSE $P_k$ determines the bit rate and spectral efficiency of the AFCS, and to evaluate these values it is sufficient to measure the values $\hat{P}_k$ using relationship (14) and substitute the result into the corresponding formulas. For computations, it is more convenient to use the MSE expressed in decibels:

$$ MSE_n \text{[dB]} = 10 \log_{10} \left( \frac{\hat{P}_n}{\sigma_0^2} \right) . \qquad (21) $$

Substitution of (21) into (10) gives the relationship:

$$ \frac{R_n^{AFCS}}{F_0} = \frac{3.32}{n} \log_{10} \frac{\sigma_0^2}{\hat{P}_n} = -\frac{0.332}{n} \, MSE_n \text{[dB]} . \qquad (22) $$

For the optimal AFCS with ideal feedback, the limit energy efficiency $E_n^{bit\,AFCS}/N_\xi$ can be evaluated by substituting the measured value $R_n^{AFCS}/F_0 = C_n^{AFCS}/F_0$ into the right side of (20). If the AFCS works non-optimally, then (see (13)) the "energy per bit" $E_n^{bit\,AFCS}$ and the energy efficiency $E_n^{bit\,AFCS}/N_\xi$ depend not only on $R^{AFCS}/F_0$ but also on the SNR $Q^2$.

D. Maintaining the optimal mode of transmission

The SNR $Q^2$, the MSE $P_k$, and the parameters $M_k, L_k$ of the optimal AFCS all depend on the scenario of application (the distance between EN and BS, the environment, the level of noise, etc.).
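The closed-form limit relations above are easy to cross-check numerically: with the ideal-feedback MSE (17) substituted into (16), the capacity per unit bandwidth is independent of n and coincides with the classic formula (15)/(18), and the two forms of the energy efficiency (20) agree. A small sketch (the function names are illustrative):

```python
import math

def capacity_per_hz(Q2):
    """Classic spectral efficiency at capacity, eqs (15)/(18)/(19):
    C/F0 = log2(1 + Q^2)."""
    return math.log2(1.0 + Q2)

def energy_per_bit_over_N(Q2):
    """Limit energy efficiency (20), linear scale: Q^2 / log2(1 + Q^2)."""
    return Q2 / math.log2(1.0 + Q2)

def capacity_per_hz_via_mse(n, Q2):
    """Eq. (16) with the ideal-feedback MSE (17) substituted:
    (1/n) * log2(sigma0^2 / P_n) with P_n = sigma0^2 (1+Q2)^(-n).
    The result does not depend on n, reproducing (18)."""
    return (1.0 / n) * math.log2((1.0 + Q2) ** n)
```

For Q² = 7 this gives C/F0 = 3 bit/s/Hz regardless of the number of cycles, and both forms of (20) evaluate to 7/3.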
[Fig. 3. Extended block diagram of the connection of the main units of the EN transmitter and the base station of the AFCS prototype. EN side: adaptive modulator with DAC and digitally controlled potentiometer (DCP), RF amplifier with controlled gain (−30…+20 dB), multimodal carrier signal generator, 868 MHz FSK feedback receiver R2, microcontroller STM32. BS side: bandpass filter and RF amplifier (−50…0 dBm), RF power meter, frequency meter, ADC, DSP, 868 MHz FSK feedback transmitter T2, microcontroller STM32 with UART output of x̂n.]

[Fig. 4. PCB module of the EN transmitter integrated with the sensor; highlighted: 1 – inclinometer, 2 – digitally controlled potentiometers, 3 – RF amplifier and modulator, 4 – microcontroller, 5 – feedback receiver R2.]

For this reason, a system optimal in the given conditions will transmit signals in another scenario non-optimally, with greater MSE and lower bit rate and efficiency. This problem also appears in the evaluation of the performance of CS with coding.

Additional research showed that AFCS allow the elaboration of algorithms adaptively adjusting the parameters $M_k, L_k$ $(k = 1, \ldots, n)$ to the values maintaining the optimal regime of transmission independently of the scenario of AFCS application. These algorithms utilize the possibility of measuring the MSE $P_k$ and its reaction to deviations of the parameters $M_k, L_k$ from the optimal values (4),(6). Auto-adjustment of the parameters is provided by a permanent exchange of information between the EN and BS during transmission of a testing sequence of signals stored in the memory of the EN microcontroller and the DSP unit of the BS. The accuracy and energy–spectral efficiency of the adjusted system can be greater or smaller depending on the scenario of application, but attain the limit or close-to-limit values. This effect is unrealizable for CS with coding: codes have no parameters permitting their adjustment to the characteristics of the surroundings, and only partial solutions are possible (e.g. switching the codes or the power of the EN transmitters depending on the distance between the EN and BS).

IV. EXPERIMENTAL VERIFICATION OF RESULTS

The prototype of the optimal AFCS was designed as the communication channel between a wireless sensor (inclinometer) and the BS. The structure of the prototype and the PCB module of the EN transmitter are shown in Figs. 3 and 4, respectively. The starting point for the design was the optimal transmission–reception algorithm (1)-(8). The forward EN transmitter (left side of Fig. 3) was realized using a serial digitally controlled RF amplifier preceded by the adaptive PAM modulator. Regulation of the modulation depth was provided by the EN microcontroller and the digitally controlled potentiometer (DCP) using signals formed in the BS and delivered to the EN over the feedback. The feedback channel was realized using a serial digital receiver and transmitter (power 27 dBm) ensuring practically ideal delivery of the signals to the EN.

Figs. 5a-b present the results of two series of experiments carried out inside the faculty building at distances of 30 and 50 meters between the EN and BS (in a straight line). Each series included 5 experiments (transmission of sequences of random samples) repeated with a 1 minute time interval. The aim of the experiments was a study of the functioning of the prototype with the 10 dBm power EN transmitter and ceramic helical mini-antennas. The changes of the values of $MSE_k$[dB] in sequential cycles of the sample transmission, as well as the energy and spectral efficiency of the prototype, were measured.

The results of these (as well as other) experiments showed that, in the first 3-6 cycles, the values $MSE_k$[dB] decrease linearly, that is, on the linear scale, the MSE $P_n$ decreases exponentially. As shown above, this is possible only if the AFCS transmits signals perfectly. The latter proves that in the initial cycles the prototype transmits signals as a perfect system. After 3 cycles, $MSE_k$[dB] of the signals transmitted at the distances of 30 and 50 meters attains values of the order of −40 dB and −30 dB, respectively. In these cycles, the spectral efficiency of the prototype attains the limit value $C^{AFCS}/F_0$ and can be evaluated using formula (22), and the energy efficiency using formula (20). The measured values of the efficiencies are given in Tab. 1. The results of the calculations are illustrated in Fig. 6 on the "energy–spectral efficiency" plane widely used in CS research.

TABLE 1. MEASURED CHARACTERISTICS OF THE PROTOTYPE

  Distance (m)               |  30   |  50
  MSE_n[dB]/n                |  −14  |  −11
  C^AFCS/F0 [bit/s/Hz]       |  4.68 |  3.62
  E_n^bit AFCS / N_ξ         |  5.26 |  3.12
  E_n^bit AFCS / N_ξ [dB]    |  7.2  |  4.96

The plots also show that longer transmission of the samples decreases $MSE_k$[dB] further, but from values of the order of −40 dB fluctuations and a deviation from the initially optimal regime of transmission appear, which requires additional study.

V. DISCUSSION OF RESULTS AND CONCLUSIONS

The results of the work confirm the correctness of the theoretical basis used for the design of the optimal ("perfect") AFCS transmitting signals without coding with accuracy and energy–spectral efficiency attaining the Shannon limits.
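The entries of Table 1 are mutually consistent with formulas (20) and (22), which can be verified directly. The sketch below recomputes the spectral and energy efficiencies from the reported per-cycle MSE; the tolerances account for the rounding of the published values:

```python
import math

def spectral_eff_from_mse_db(mse_db_per_cycle):
    """Eq. (22): R/F0 = -(1/n) * MSE_n[dB] / (10 log10(2)),
    with the argument given as MSE in dB per cycle."""
    return -mse_db_per_cycle / (10.0 * math.log10(2.0))

def energy_eff_from_spectral_eff(c_over_f0):
    """Right-hand form of eq. (20): E/N = (2^(C/F0) - 1) / (C/F0)."""
    return (2.0 ** c_over_f0 - 1.0) / c_over_f0

# Values reported in Table 1 (distances in meters).
table = {30: {"mse_db_per_cycle": -14.0, "C/F0": 4.68, "E/N": 5.26, "E/N dB": 7.2},
         50: {"mse_db_per_cycle": -11.0, "C/F0": 3.62, "E/N": 3.12, "E/N dB": 4.96}}
```

For example, −14 dB per cycle gives C/F0 ≈ 4.65 bit/s/Hz (4.68 in the table, within the rounding of the measured MSE), and substituting 4.68 into (20) reproduces E/N ≈ 5.26, i.e. about 7.2 dB.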
[Fig. 5. Changes of MSE_k [dB] as a function of the number of cycles in the experiments with the prototype of AFCS repeated with a 1 minute time interval: a) transmission at the distance of 30 meters, b) transmission at 50 meters (power of the EN transmitter 10 mW, ceramic antennas).]

[Fig. 6. Energy–spectral efficiency of transmission provided by the AFCS prototype at the distances of 30 and 50 meters on the energy–spectral plane (R_b/F_0 [bit/s/Hz] versus E^bit/N_ξ [dB]; the Shannon boundary separates the regions R > C and R < C).]

Comparison of the measures of AFCS and DCS performance discussed in Sect. II, III with the measures currently used in CS theory showed their full coincidence in the cycles where feedback errors do not influence the transmission of signals. Moreover, the MSE-based measures can be used for adequate evaluation and comparison of the performance of arbitrary CS transmitting analog signals. In turn, the results of Sect. III expand the field of constructive applications of rate-distortion theory. Additional research showed that the optimal AFCS can also be used for top-efficient transmission of short code blocks.

The optimal use of the feedback and software imparts to AFCS a number of qualitatively new properties and functionalities unachievable for CS with a single forward channel. Designing the prototype also showed the possibility of realizing AFCS as a software defined radio (SDR) system with substantially extended functionalities compared to known SDR transceivers. Namely, the optimal AFCS:

• Transmit the signals with theoretically achievable accuracy, optimally utilize their energy and spectral resources, and allow regulation of the accuracy of transmission and of the energy consumption of the EN by the controls formed in the BS and delivered to the EN transmitter over the feedback channel.

• May adaptively adjust their parameters to changes of the AFCS location and of the characteristics of the environment (including non-stationary noises) in a way maintaining the most efficient mode of transmission.

• Are narrowband systems: most widely used sensors register processes whose baseband [0, F] does not exceed several kHz, and a forward channel bandwidth of 10-100 kHz is sufficient for the data transmission ($F_0 \approx nF$).

These and other advantages of the optimal AFCS permit considering them as a potentially promising class of channels for the PHY layer of WSN, as well as for other applications.

REFERENCES

[1] N. Hunn, Essentials of Short-Range Wireless, Cambridge, UK, 2010.
[2] D. J. Costello, Jr., G. D. Forney, Jr., "Channel coding: the road to channel capacity," Proc. of the IEEE, vol. 95, no. 6, 2007, pp. 1150-1178.
[3] A. Goldsmith, "The road ahead for wireless technology: dreams and challenges," ISWCS-2016, Sept. 2016, Poznan, Poland.
[4] G. Y. Li, C. Xiong, C.-Y. Yang, S. Q. Zhang, Y. Chen, S.-G. Xu, "Energy-efficient wireless communications: tutorial, survey, and open issues," IEEE Wireless Comm. Mag., vol. 18, no. 6, 2011, pp. 28-35.
[5] M. Dohler, R. W. Heath, A. Lozano, C. B. Papadias, R. A. Valenzuela, "Is the PHY layer dead?", IEEE Comm. Magazine, vol. 49, no. 4, 2011, pp. 159-166.
[6] A. Gatherer, "The death of 5G?", Ed. in Chief of IEEE ComSoc Technology News, thematic series of IEEE Xplore e-papers, 2015-2016.
[7] A. Platonov, "Optimization of adaptive communication systems with feedback channels," Proc. of IEEE Wireless Comm. and Networking Conf. WCNC, Budapest, 2009, pp. 93-96.
[8] A. Platonov, "Capacity and power-bandwidth efficiency of wireless adaptive feedback communication systems," IEEE Comm. Letters, vol. 16, no. 5, 2012, pp. 573-576.
[9] A. Platonov, Ie. Zaitsev, "New approach to improvement and measurement of the performance of PHY layer links of WSN," IEEE Trans. on Instrum. and Measurement, vol. 63, no. 11, 2014, pp. 2539-2547.
[10] A. Platonov, "Information theory: two theories in one," Proc. SPIE, vol. 8903, 2013, pp. 8903G-1-16.
[11] A. Platonov, "Analog feedback communication systems: continuation of researches," Przegląd Telekomunikacyjny, no. 8-9, 2013, pp. 1164-1169.
[12] A. Platonov, "Concurrent software-hardware optimisation of adaptive estimating, identifying and filtering systems," Kybernetes: Int. Journ. of Systems & Cybernetics (Emerald), vol. 37, no. 5, 2008, pp. 590-607.
[13] T. J. Goblick, "Theoretical limitations on the transmission of data from analog sources," IEEE Trans. on Inf. Theory, vol. 11, no. 4, 1965, pp. 558-567.
[14] T. Kailath, "An application of Shannon's rate-distortion theory to analog communication over feedback channels," Proc. IEEE, vol. 55, no. 6, 1967, pp. 1102-1103.
[15] J. P. M. Schalkwijk, L. I. Bluestein, "Transmission of analog wave-forms through channels with feedback," IEEE Trans. on Inf. Theory, vol. 13, no. 4, 1967, pp. 617-619.
[16] J. K. Omura, "Optimal transmission of analog data for channels with feedback," IEEE Trans. on Inf. Theory, vol. 14, no. 1, 1968, pp. 38-43.
[17] R. G. Gallager, Information Theory and Reliable Communication, J. Wiley, New York, 1968.
[18] S. Butman, "Rate distortion over band-limited feedback channels," IEEE Trans. on Inf. Theory, vol. 16, no. 1, 1971, pp. 110-112.
[19] A. Platonov, Ie. Zaitsev, B. Jeleński, L. Opalski, J. Piekarski, H. Chaciński, "Prototype of analog feedback communication system: first results of experimental study," IEEE Conf. on Comm. and Networking (BlackSeaCom), Varna, Bulgaria, June 2016, pp. 1-4, IEEE Xplore.
[20] A. Platonov, Ie. Zaitsev, L. Opalski, "Theoretical basis, principles of design and experimental study of the prototype of 'perfect' AFCS transmitting signals without coding," Proc. of SPIE, vol. 10445, 2017, pp. 104451F-1-14.
[21] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, 1948, pp. 379-423, 623-656.
[22] A. Platonov, "Optimal identification of regression-type processes under adaptively controlled observation," IEEE Trans. on Sign. Proc., vol. 42, no. 9, Sept. 1994, pp. 2280-2291.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19th–21st, 2018, Poznań, POLAND

Probabilistic reasoning for indoor positioning with sequences of WiFi fingerprints
Jan Wietrzykowski
Institute of Control, Robotics and Information Engineering
Poznan University of Technology
Email: jan.wietrzykowski@put.poznan.pl

Abstract—The paper tackles the problem of indoor personal positioning using sensors available in modern mobile devices, such as smartphones or tablets. Like many of the state-of-the-art approaches, the proposed method utilizes WiFi fingerprints to find the user's position in a predefined map of WiFi signals. However, we improve the approach to WiFi-based positioning by considering probabilistic dependencies between the neighboring fingerprints in a sequence of consecutive WiFi scans. The algorithm uses linear-chain Conditional Random Fields to infer the most probable sequence of the user's positions, which makes it possible to find a consistent trajectory. Due to the use of probabilistic reasoning in a wider spatial context, the algorithm considers a number of possible positions and resolves ambiguities stemming from noisy WiFi measurements. We tested the approach using data collected in one of the buildings of Poznan University of Technology with a regular smartphone.

I. INTRODUCTION

Mobile devices, such as smartphones or tablets, have become ubiquitous and assist us in everyday activities. They are small, affordable, and equipped with various sensors that make it possible to perceive the environment and monitor the user's activity. One of the most promising new applications is personal indoor navigation that guides a user to a selected target location. Although the problem is generally solved for outdoor environments, thanks to global navigation satellite systems (GNSS), it remains an open issue for indoor areas.

In indoor areas the GNSS signals are unavailable; therefore, a demand for other sources of positioning information emerges. Among many approaches, two main groups are easily distinguishable: those requiring additional, dedicated infrastructure and those relying on existing devices. An example of the methods with additional infrastructure is localization using RFID/BLE (radio-frequency identification/Bluetooth Low Energy) beacons [1]. The beacons are placed in the environment and act as reference points for the device that is localizing itself. Unfortunately, such infrastructure requires constant maintenance, which generates additional costs. Therefore, we focus on the latter group of methods, which rely on devices already present in the environment and on the internal sensors of a smartphone or tablet.

One of the most useful sources of information about the user's position in an indoor environment is the WiFi adapter. The adapter scans the frequencies dedicated to WiFi communication and returns the list of available access points along with the signal strength of each access point. By analyzing this information, it is possible to determine the current location. The most successful methods are based on WiFi fingerprinting, because direct modeling of the signal distributions is often intractable for indoor environments. Buildings have internal structures, such as walls and ceilings, that reflect signals; moreover, the arrangement of many objects is subject to frequent changes. This complicates the models that could be used and causes the computations to become outdated quickly [2]. In the fingerprinting method, on the other hand, the designer does not even have to know the number of access points present in the environment, nor their positions.

An example of the WiFi fingerprinting approach is WKNN [3] (weighted k-nearest neighbors), which selects the k most similar scans from the database of map scans and computes the position as a weighted average of their locations. Despite being simple, the approach performs well and has only a few parameters to tune. The downside is that it requires a substantial number of database scans to work correctly. Moreover, there is a limit to the precision that can be achieved using only a single scan. As mentioned earlier, the WiFi signal distribution is affected by a number of factors that alter it, which makes it impossible to have a precise model or a database of precise measurements. Additionally, the low precision often causes a "jumping" effect if the localization outcome is viewed in real time by the user, because the transitions between consecutive poses are not restrained. Despite the low precision, it is often sufficient for a user to know an approximate position to reach a specified target. This shifts the problem we want to solve from continuous localization with centimeter-level accuracy, as in vision-based Simultaneous Localization and Mapping (SLAM), to a discrete task of determining whereabouts.

To improve the accuracy and robustness of the localization system, it is necessary to include more information. Fortunately, in practical applications, the problem of positioning a user without any knowledge of the previous measurements is rather uncommon. Various filtering techniques are applied to track the position of an agent using a sequence of measurements. However, solutions based on Extended Kalman Filtering [4] are not well suited for imprecise WiFi measurements due to the linearization errors, while Particle Filtering [5] requires a large number of particles and is therefore computationally expensive, which makes it unsuitable for such systems as smartphones. There were also attempts to adapt recent SLAM techniques, such as factor graph optimization in continuous space for fusing data from WiFi scans and Pedestrian Dead Reckoning [6], or even observations of QR codes [7]. However, highly uncertain WiFi measurements are often not properly handled by gradient-based optimization

As an inference with continuous random variables imposes many practical problems, we divide the space of the possible user positions into a rectangular grid to facilitate a discrete representation of these positions. This enabled us to use a well-
resulting in a need for relatively dense maps of WiFi scans, established inference mechanism of Belief Propagation [11]
that in turn are laborious in preparation. Although the map with messages represented as probability tables. By using
generation procedure can be automatized, this requires another Belief Propagation with linear-chain models, we are able
localization system, such as visual SLAM, to be integrated to obtain exact results, that is, the computed assignment
with the WiFi surveying procedure [8]. If only a sparse map of corresponds to a global maximum of joint probability for
the WiFi scans is available, a more suitable approach is to use the given model. It is a major advantage over the continuous
the idea of Markov Models with Viterbi Tracking [9], which models used in most SLAM systems, where the optimization
originates from robot localization algorithms in their early is performed using gradient-descent methods, that are prone
development [10]. Yet, a proper model and its parametrization to get stuck in local minima. Moreover, using exponential
are of pivotal importance, which makes an automatic param- family for probability distribution representation, simplifies the
eter learning procedure recommended. model, while preserving the ability to represent an arbitrary
Therefore, we propose a novel method for determining user distribution if the functions used to express factors represent
position using a sequence of WiFi scans by performing a sufficient statistics [12].
probabilistic inference within linear-chain Conditional Ran-
dom Fields. The main advantages of our method (Sec. II) A. Single scan matching
include: To produce an a priori distribution of the user location,
• By employing a sequence of scans we exploit the con- based on a single WiFi scan, it is necessary to find locations
tinuous nature of user position to find the most probable that could have similar readouts. The scan S is a list of access
trajectory. points in a range of the WiFi adapter, each characterized by
• Using a discretized space of possible locations enables us its BSSID (unique MAC address), SSID (network name) and
to apply Belief Propagation inference mechanism, which received signal strength RSS (expressed in dBm). As modeling
is not susceptible to getting stuck in local minima, thus the WiFi signals propagation in an indoor environment is
finding the globally most probable assignment. complex and inaccurate due to signal reflections and constant
• Using Conditional Random Fields with probability distri- changes in the environment [13], we resign from fitting the
bution in a form of exponential family limits the number current scan to a model of signal distribution. Instead, we
of parameters to tune by enabling automatic estimation compare the scan with a database of map scans. This database
of crucial parameters. is a previously prepared set of scans, where each scan has an
• We utilize the knowledge of the building structure by associated location where it was taken at. We find the most
restraining positions to be within an area marked as similar scans in the database and, basing on their locations
allowed. and similarity of scans, define possible locations of the current
scan.
We evaluate the solution using real-scenario data collected One of the most important issues in this method is a
in Mechatronics Centre (MC) building of Poznan University measure of similarity between two scans. We use two criteria
of Technology (Sec. III) by comparing it to state-of-the- to characterize the similarity: common network ratio and an
art WKNN method. The experiments provide us insights average L2 distance of signal strengths. The common access
regarding the applied probabilistic model, which we conclude points ratio c(Si , Sj ) of a scan Si with a scan Sj is a ratio
in Sec. IV. of the number of access points with the same BSSID in
both scans to the number of all access points in the scan
II. P ROBABILISTIC REASONING
Si . To consider two scans as similar, both ratios c(Si , Sj )
Our approach performs probabilistic inference to find the and c(Sj , Si ) have to exceed 0.6. When this criterion is met,
most likely assignment of positions to a sequence of consecu- we compute distance dL2 (Si , Sj ) between signal strengths of
tive WiFi scans. Each WiFi scan is matched against a database access points with the same BSSID:
of map scans, which produces an a priori distribution of v
u
position probability for this scan. Those a priori distributions u1 X X
dL2 (Si , Sj ) = t 1bi =bj (si − sj )2 , (1)
are combined within the probabilistic model with constraints n
(si ,bi )∈Si (sj ,bj )∈Sj
stemming from measured user displacements between scan
moments. The constraints try to maintain distances between where (si , bi ) is a signal strength and BSSID for scan Si ,
consecutive positions as close as possible to the distances 1bi =bj is an indicator function equal to 1 when the condition
computed using a stepometer. By using a sequence of scans, bi = bj is met and 0 otherwise, and n is a number of common
we widen the context that is taken into consideration during access points.
the process of defining the user position. A schematic overview The a priori distribution map is finally computed using a
of the method is depicted in Fig. 1. Gaussian kernel placed in a location computed by the WKNN

339
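The single-scan matching described above can be sketched in a few lines of code. This is our own illustrative reimplementation, not the authors' code: a scan is modeled as a dictionary mapping BSSID to RSS, and all function names are ours.

```python
import math

def common_ratio(si, sj):
    """c(Si, Sj): share of access points of scan si (keyed by BSSID) also present in sj."""
    return len(set(si) & set(sj)) / len(si)

def d_l2(si, sj):
    """Average L2 distance of RSS values over the n access points common to both scans (Eq. 1)."""
    common = set(si) & set(sj)
    return math.sqrt(sum((si[b] - sj[b]) ** 2 for b in common) / len(common))

def similar_scans(query, database, threshold=0.6):
    """Return (location, distance) pairs for database scans similar to the query scan.

    `database` is a list of (location, scan) pairs; both common-ratio criteria
    must exceed the threshold, as required in the text above.
    """
    return [(loc, d_l2(query, scan))
            for loc, scan in database
            if common_ratio(query, scan) > threshold and common_ratio(scan, query) > threshold]
```

A scan here is simply a dict such as {"00:11:22:33:44:55": -47, ...}; the a priori map would then place a weighted Gaussian kernel at the WKNN estimate obtained from the returned locations and distances.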
The kernel is weighted by the average error e_i (computed using the d_{L2} distance) of the k most similar map scans. This way, we obtain higher probability in the vicinity of similar scans and completely ignore map scans that do not have enough common networks. The equation for the distribution is as follows:

\tilde{p}(x_i | S_i) = \exp\left( \frac{-e_i}{\sigma_e^2} \right) \exp\left( \frac{-|x_i - x_w|}{\sigma_w^2} \right),    (2)

where x_w is the location computed by the WKNN algorithm, \sigma_e^2 is a parameter controlling the influence of the average error, and \sigma_w^2 is a parameter controlling the influence of the position distance. The likelihood table is then computed by approximating the distribution inside a grid cell by a constant value equal to the value of the distribution in the center of the cell.

Fig. 1. Schematic overview of the proposed method. The factor graph of the proposed model is composed of random variables (white circles) representing positions of WiFi scans and two types of factors (black squares): unary factors that drive positions to agree with the a priori, WiFi-based probability distribution maps, and pairwise factors that drive positions to agree with stepometer measurements.

B. Probabilistic model and inference

To describe the problem in terms of probability, we construct a probabilistic graphical model (depicted in the central part of Fig. 1), where the probability is expressed as a product of factors. Each WiFi scan from the investigated trajectory has a random variable associated with it, which describes the position where the scan took place. Therefore, the model has the form of a linear chain, where each random variable is related by a factor to the previous one and the next one. The factors are divided into two types: unary factors and pairwise factors. Unary factors exploit the a priori knowledge: the higher the probability of the position, the higher their value, thereby driving positions to agree with the probability distribution maps. In turn, the pairwise factors ensure that the displacement between consecutive locations is consistent with the stepometer measurement. The smaller the difference, the higher the factor value. Thus, the equation for the joint probability has the following form:

p(x) = \frac{1}{Z} \exp\left( \sum_{S_i \in T} \theta_n f(x_i) + \sum_{(S_i, S_j) \in E} \theta_e g(x_i, x_j) \right),    (3)

f(x_i) = \log(\tilde{p}(x_i | S_i)),    (4)

g(x_i, x_j) = \frac{-(|x_i - x_j| - d_{ij})^2}{\sigma_d^2},    (5)

where T is the set of trajectory scans, E is the set of edges connecting consecutive scans, x_i is the position of scan S_i and d_{ij} is the distance computed by the stepometer. To improve the localization precision, we force positions to belong to areas that are marked as allowed, i.e. areas that the user is able to reach and that are in the scope of the localization system (Fig. 2). We further discard positions that have very low probability according to the a priori map, because it is highly unlikely that the true position is among them. This effectively limits the number of possible values of the random variables, which is also beneficial in terms of computing time.

To find the most probable configuration of the random variables, and therefore locations for each scan, with respect to the model given by (3), it is necessary to perform inference. The inference process computes the Maximum A Posteriori (MAP) assignment of the random variables:

x^* = \arg\max_x p(x),    (6)

and we chose the Belief Propagation algorithm for this task, which is based on passing messages between factors and random variables. Fortunately, for models without graph loops, this algorithm is exact and requires only 2N message passes, where N is the number of connections in the graph.
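For a linear chain, the max-product variant of this message passing coincides with Viterbi-style dynamic programming over the grid cells. The following sketch is our own illustration (not the authors' code): `log_unary[t][k]` stands in for the log unary factor θ_n f(·) of scan t at grid cell k, and `log_pair` for the log pairwise factor θ_e g(·,·).

```python
def chain_map(log_unary, log_pair):
    """MAP assignment for a linear chain via max-product (log-domain) message passing.

    log_unary: list of T lists; log_unary[t][k] = log unary factor of scan t at cell k.
    log_pair:  function (k_prev, k) -> log pairwise factor between consecutive scans.
    Returns the jointly most probable sequence of cell indices.
    """
    T, K = len(log_unary), len(log_unary[0])
    # forward pass: best log-score of any prefix ending at cell k, with backpointers
    score = list(log_unary[0])
    back = []
    for t in range(1, T):
        new_score, ptrs = [], []
        for k in range(K):
            best_prev = max(range(K), key=lambda p: score[p] + log_pair(p, k))
            ptrs.append(best_prev)
            new_score.append(score[best_prev] + log_pair(best_prev, k) + log_unary[t][k])
        score, back = new_score, back + [ptrs]
    # backward pass: follow backpointers from the best final cell
    path = [max(range(K), key=lambda k: score[k])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

With strong pairwise penalties the chain sticks to one cell; with weak penalties the unary (WiFi) evidence dominates, which mirrors the role of the θ parameters discussed below.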
Fig. 2. Positions during the inference are restricted to the allowed area (marked green).

One of the pivotal matters during the inference is the values of the parameters θ. Those parameters control the influence of the different types of factors and therefore influence the inference results. In the proposed method, we estimate the values of the parameters using the Maximum Likelihood Estimation (MLE) procedure. Given a set of labeled examples, we compute the values that maximize the likelihood. The likelihood, assuming that the examples are statistically independent and identically distributed, is a product of the joint probabilities of the examples:

L(\theta : D) = p(D | \theta) = \prod_{m \in D} p(x_m | \theta),    (7)

where D is the set of labeled examples, and x_m is the vector with values for each random variable in example m. Having all examples fully labeled, the problem of maximizing (7) is concave [11], which facilitates the use of gradient-ascent methods. In our work, we harnessed the L-BFGS algorithm, as it is proven to perform well for this class of problems [11], and we used the logarithm of (7), as it can be decomposed easily for gradient computation.

III. EXPERIMENTS AND RESULTS

To evaluate the proposed solution, we conducted experiments in real-life scenarios using data from a tablet held by a user while traversing a test trajectory. We acquired a map of scans in the Mechatronics Centre building of Poznan University of Technology, and then six different trajectories within this building. The trajectories were split into a training part, used during parameter estimation, and a testing part, used to evaluate the performance of the solution. To automate the data gathering procedure, we used a dedicated application that records WiFi scans along with the positions at which they were taken and accelerometer readouts. The positions are determined by comparing the timestamps of the scans with the timestamps of passing the vertices of the planned trajectory. The user has to draw the trajectory, composed of straight segments, in advance and then tap the screen each time he reaches the end of a segment. By assuming that the velocity along a segment is constant, we are able to compute the positions of the scans. Although these positions are only approximate, they are sufficiently accurate for the purpose of evaluating WiFi-based localization systems, because the typical errors of such systems are considerably larger.

The parameters of the model were obtained using the procedure described in Sec. II-B. Please note that, although there are several parameters (\sigma_e, \sigma_w, and \sigma_d) in the considered problem, only the ratio of \sigma_e to \sigma_w is significant. The reason is the form of the functions f and g, which depend linearly on the values of those parameters. Therefore, the expression \theta_n f(x_i) is expanded as:

\theta_n f(x_i) = \frac{\theta_n}{\sigma_e^2} (-e_i) + \frac{\theta_n}{\sigma_w^2} (-|x_i - x_w|),    (8)

and the expression \theta_e g(x_i, x_j) as:

\theta_e g(x_i, x_j) = \frac{\theta_e}{\sigma_d^2} \left( -(|x_i - x_j| - d_{ij})^2 \right).    (9)

Thus, the estimation of \theta_n and \theta_e makes the values of \theta_n / \sigma_e^2, \theta_e / \sigma_d^2, and \theta_n / \sigma_w^2 constant, eventually leaving the ratio of \sigma_e to \sigma_w as the only parameter to set by hand. We use three separate parameters due to numerical stability, which is better when the values of the functions are moderate.

We compared our solution against WKNN, which is a golden standard among WiFi-fingerprinting methods. We show two localization cases: the online case, where the current position is calculated by inferring the positions for the last l scans, and the offline case, where the positions of all scans in the trajectory are inferred at once. The online case is especially suitable for navigating the user in real time, while the offline case is useful when gathering crowd-sourced maps.

The results for both cases are gathered in Tab. I and visualized in Fig. 3 and Fig. 4. We have chosen the value l = 10, as scans further apart than 10 rarely affect each other during inference, and the inference time of approximately 190 ms on a Core i5-3230M is within the acceptable range for real-time localization. The grid size for all experiments was 1 m and k = 6 for WKNN. The charts in Fig. 5 and Fig. 6 present the Cumulative Distribution Function (CDF) of the error. The results indicate that the proposed approach improves the localization precision compared to the state-of-the-art WKNN method and is especially beneficial in the offline case.

TABLE I
RESULTS OF POSITIONING FOR 3 TEST TRAJECTORIES

                          average error [m]
                online, l = 10         offline
trajectory      our       WKNN         our       WKNN
traj1           2.16      2.84         1.65      3.19
traj2           3.84      3.86         2.24      3.81
traj3           3.20      3.49         1.99      3.45

IV. CONCLUSION

We have proposed, implemented and evaluated a novel method for solving the indoor localization problem using WiFi fingerprinting. The method models the problem as a linear-chain Conditional Random Field and performs probabilistic
inference to assign a position to each WiFi scan. Due to the discretization of the user's position space, it is possible to use the efficient Belief Propagation inference algorithm, which produces exact and globally optimal results. We exploit the continuous nature of the user position by restraining the distances between consecutive scans using stepometer predictions.

Fig. 3. A visualization of the positioning results for traj3 for the online version of the proposed method (blue trajectory) with l = 10, compared to the WKNN method (magenta trajectory) and the ground truth trajectory (green). Red dots represent map database scans.

Fig. 4. A visualization of the positioning results for traj3 for the offline version of the proposed method (blue trajectory), compared to the WKNN method (magenta trajectory) and the ground truth trajectory (green). Red dots represent map database scans.

Fig. 5. A Cumulative Distribution Function (CDF) of the error for all trajectories for the online version of the proposed method with l = 10, compared to the WKNN method.

Fig. 6. A Cumulative Distribution Function (CDF) of the error for all trajectories for the offline version of the proposed method, compared to the WKNN method.

Our solution improves the localization results compared to the WKNN method. The improvement is clearly visible in the offline case, where the positions for the whole trajectory are inferred at once. As we suppose, the main reason for such results is the way the displacement constraints work. The constraints, imposed by pairwise factors, are effective when the inference procedure chooses from the map a distant
scan located along the direction of the actual user movement. In this case, the pairwise factors keep the user position close to the actual one by limiting the possible displacement resulting from the similarity of distant WiFi scans. But when the matching to the a priori map encourages the inference procedure to choose a scan located along the trajectory in the direction opposite to the actual motion direction, the constraint is barely active, because the displacement can be made consistent with the measured one by back and forth movements. The effect of the displacement constraints can be compared to the mechanical analogy of constraints imposed by an elastic string, which is effective when stretched (pulling), but does nothing when squeezed (pushing).

The above-mentioned behavior is the main focus of our future research. We plan to replace the displacement constraints with constraints stemming from full Pedestrian Dead Reckoning, which also involves orientation estimates. This should enable us to make better predictions of the user movement.

REFERENCES

[1] P. Martin, B.-J. Ho, N. Grupen, S. Muñoz, and M. Srivastava, "An iBeacon primer for indoor localization: Demo abstract," in 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, ser. BuildSys '14. ACM, 2014, pp. 190–191.
[2] S. Xia, Y. Liu, G. Yuan, M. Zhu, and Z. Wang, "Indoor fingerprint positioning based on Wi-Fi: an overview," ISPRS Int. J. Geo-Information, vol. 6, p. 135, 2017.
[3] Z. Guowei, X. Zhan, and L. Dan, "Research and improvement on indoor localization based on RSSI fingerprint database and K-nearest neighbor points," in 2013 International Conference on Communications, Circuits and Systems (ICCCAS), vol. 2, Nov 2013, pp. 68–71.
[4] B. Dong and T. Burgess, "Adaptive Kalman filter for indoor navigation," in 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Oct 2016, pp. 1–4.
[5] S. Knauth and A. Koukofikis, "Smartphone positioning in large environments by sensor data fusion, particle filter and FCWC," in 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Oct 2016, pp. 1–5.
[6] M. Nowicki and P. Skrzypczyński, "Indoor navigation with a smartphone fusing inertial and WiFi data via factor graph optimization," in Mobile Computing, Applications, and Services, S. Sigg, P. Nurmi, and F. Salim, Eds. Cham: Springer International Publishing, 2015, pp. 280–298.
[7] M. Nowicki, M. Rostkowska, and P. Skrzypczyński, "Indoor navigation using QR codes and WiFi signals with an implementation on mobile platform," in 2016 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Sept 2016, pp. 156–161.
[8] P. B. and J. Bartoszek, "On automatic metric radio map generation for the purpose of WiFi navigation," Journal of Automation, Mobile Robotics & Intelligent Systems, vol. 11, no. 3, pp. 62–73, 2017.
[9] D. Han, S. hoon Jung, and S. Lee, "A sensor fusion method for Wi-Fi-based indoor positioning," ICT Express, vol. 2, no. 2, pp. 71–74, 2016.
[10] D. Fox, W. Burgard, and S. Thrun, "Markov localization for mobile robots in dynamic environments," J. Artif. Int. Res., vol. 11, no. 1, pp. 391–427, Jul. 1999.
[11] C. Sutton and A. McCallum, "An introduction to conditional random fields," Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2011.
[12] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[13] E. S. Lohan, K. Koski, J. Talvitie, and L. Ukkonen, "WLAN and RFID propagation channels for hybrid indoor positioning," in International Conference on Localization and GNSS 2014 (ICL-GNSS 2014), Helsinki, 2014, pp. 1–6.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.
September 19th–21st, 2018, Poznań, POLAND

Selection and tests of lossless and lossy video codecs for advanced driver-assistance systems

Paweł Pawłowski, Karol Piniarski, Adam Dąbrowski
Poznan University of Technology
Department of Computing, Division of Signal Processing and Electronic Systems
Poznań, Poland
{pawel.pawlowski; karol.piniarski; adam.dabrowski}@put.poznan.pl
Abstract— In this paper, tests of lossy and lossless video codecs selected for application in advanced driver-assistance systems (ADAS) are presented.

Keywords: advanced driver-assistance systems, automotive, lossless and lossy video compression, codec

I. INTRODUCTION

Advanced driver-assistance systems (ADAS) become more and more important in the current automotive industry. On the one hand, they help drivers in many situations, e.g. assistance during parking, avoiding lane departure when the driver becomes tired, reverse driving, road navigation and many others. On the other hand, from the perspective of the designer, they produce a lot of data, especially coming from visual sensors like cameras. Processing, transmitting and logging these huge amounts of data is a real problem. Therefore, a new area of application for video codecs (short for coders/decoders) appears.

In the targeted use of video codecs, i.e. to support active safety technologies that will be used in ADAS and in autonomous driving systems, the dominant (if not the only) image analysis technique will be automatic processing.

The most commonly used video codecs are lossy codecs. This is due to the requirements for high compression rates, especially in network transmission or television. Previously developed lossy codecs, intended primarily for human recipients, are optimized to achieve the highest but subjectively assessed quality. Although metrics for objective image quality assessment (e.g. PSNR – peak signal-to-noise ratio or SSIM – structural similarity image metric [1]) are known, this does not change the fact that the loss introduced by codecs is very often understood as "perceived loss". The "perceived loss" assumption means that the codec is subjectively lossless when no changes in the image are noticed by a human. Unfortunately, there are many cases when even huge data loss is not perceived by the human visual system. This is especially noticeable in details, quickly changing scenes and objects moving at high speed. In fact, in ADAS applications, these errors, although invisible to humans, can be critical.

In the literature, this problem is slowly being recognized. The authors of [2] present the video compression standards H.264 and H.265 and show an overview of applications in which automatic analysis of compressed images is performed. The authors of paper [3] present a method of increasing the degree of compression while maintaining a constant indicator of automatic detection of objects. Work [4] concerns the impact of video compression on the effectiveness of object detection. An overview of special coders is presented, which are used to compress images on which automatic detection of objects is performed. Promising results are presented in [5]. It discusses the impact of compression on the effectiveness of object detection. The influence of the compression level on the precision of detection is presented for the JPEG standard. The detectors use artificial neural networks and an SVM (support vector machine) classifier. The authors noticed that the precision decreased relatively slowly, despite the ever-increasing compression ratio. This may mean that lossy compression, but only with intra-prediction, may be less detrimental to automatic detectors than inter-prediction. In paper [6], the authors examine the effect of compression on automatic face recognition using Haar descriptors. Experiments show that in the context of face detection, JPEG-XR and JPEG2000 codecs have an advantage over the typical JPEG encoder, especially when images are compressed at high compression rates. It also turns out that a small initial JPEG compression does not affect the performance of the algorithms.

Although the ADAS application designer, as a user of a video codec, usually has no influence on the development of the codec, on how the codec processes the image and on how it is optimized, he or she can choose the best codec for the given application. Additionally, by setting the parameters which control the codec, it is possible to significantly affect its behavior and therefore ultimately improve the quality of the final result.

In this paper we test a set of lossless and lossy video codecs on videos that are typical for ADAS applications. With these tests we try to show a method for selecting the best codec.

II. METHODS OF COMPRESSION OF STILL IMAGES AND VIDEO STREAMS

Compression of images and video streams strongly exploits features of images, like the similarity of neighboring points in images and the similarity of neighboring images in video sequences on the one hand, and the perceptual limitations of the human vision system on the other hand.
human vision system on the other hand. For example, a person precise calculation of the so-called motion vectors with an
can see a lot of detail in static images, when the observation accuracy of up to 1/8 pixel, contextual compression, the use of
time is relatively long, but loses the ability to see details of advanced indirect transformations.
fast-moving objects. In addition, the human has much less We should notice, the lossless codecs can not delete
precision in spatial color discrimination than the luminance information at all, so the use of the DCT transformation is not
discrimination. These phenomena are used by image coders advisable. Instead, variable-length coding techniques of
and video coders. Therefore, they use much different prediction error words are used. These include: entropy coding
compression algorithms than e.g. popular file compressors, (e.g. Huffman coder), arithmetic coders, string coding (RLE,
like zip, rar, 7z. run-length encoders, VLC – variable-length coders). Some of
The basic mechanism of the compression is a prediction. By them use quite large context models [8]. In lossless image
storing the reference point value, the value of subsequent codecs, only intra-image prediction is made, while in video
points can be predicted based on previous points. Analyzing codecs, inter-image prediction can also be used. In some
the image from top to bottom, from left to right, the prediction advanced solutions, several prediction variants are calculated
schemes must only take into account already known points. and the best one is chosen. Unfortunately, this means an
The prediction is made by both the encoder and the decoder, additional necessity to inform the receiver about the selection
and only the reference point and prediction errors are of the predictor, and therefore the need to send another control
transmitted between the encoder and the decoder. data. In fact, lossless codecs, unlike lossy ones, remain not so
The image format is also very important, especially in color complicated, and therefore much faster. They are typically
images. Typically, the coder transform strongly correlated used during the so-called video stream “ripping encoding”,
RGB (red, green, blue) components into other spaces, e.g. where no loss of data is allowed.
YUV (Y – the luma component, the brightness, U, V – the
chrominance, color components). It leads to the compression III. CONTROL OF VIDEO CODER
of less correlated data, obtaining higher quality and higher The user of video codec, by setting the parameters which
compression rates. control the coder, is able to significantly affect the behavior of
In lossless coders, to reduce the computational complexity, the coder and therefore may fit it to the application needs.
very simple predictors are used, which utilize similarity of Moreover by proper settings it is possible to ultimately
neighboring points. For example, the predicted point can be improve the quality of the compressed video.
computed as a simple copy of neighboring point, or a mean or One of the most important settings of video coder is a
a median of neighboring points. In addition to intra prediction parameter which determine the size of the output stream [9,
(i.e. inside the still image or one video frame), inter-prediction 10].
(prediction of points based on previous and / or subsequent Setting the size of the stream plays an important role in the
picture frames) is used. The more information the predictor video coding, although there is no a normative tool for any
will take into account, the more accurately it predicts the next video encoding standard [9]. In the transmission of a video
point (block), so the prediction errors to be encoded are smaller. However, the complication of the predictor comes at the expense of the computational complexity of both the encoder and the decoder.

In the lossy coders, the most common method is to calculate the DCT (discrete cosine transform, used because of its symmetry) for the so-called macroblock. Typically, the macroblock is a block of 16x16 pixels; nowadays, as image resolutions are much higher than a decade before, the macroblocks are growing – the latest codecs use blocks that reach the size of 64x64 pixels [7]. The calculation of the DCT makes it possible to sort the data according to importance: from the so-called constant component, which represents the average brightness or color of the block, to the data containing the smallest details (higher spatial frequencies). After the DCT calculation, a quantization of the transform coefficients is performed, which eliminates the "less important" data. The greater the required degree of compression, the more data should be deleted.

In the latest generations of lossy codecs, there are many other techniques that improve the degree of compression and the quality, which significantly increases the codec's complexity and compression time. New solutions include, for example, dividing large blocks into smaller ones with irregular shapes,

signal (e.g., on television), a stream control must ensure that the encoded bit-stream can be successfully transmitted and make full use of the limited bandwidth of the channel.

We introduced above that lossy compression is based mainly on the selected quantization of the signal, reducing the amount of data without a significant diminishing of the perceived quality. Finally, adjustment of the output stream is obtained by adjusting the quantizer parameter: the less accurately we represent the data, or the more we reduce their amount, the smaller the output stream will be.

Due to the variability of the video content and, in consequence, of the data rate, video encoders can be classified into two categories: constant bit rate (CBR) and variable bit rate (VBR) encoders.

In the case of CBR applications, designers focus mainly on how to improve the accuracy of the matching between the assumed bit rate and the actual, instantaneous bit rate, while maintaining the restrictions on the size of the delay and of the buffer used.

In VBR applications, we rather avoid fluctuations of the picture quality, even for the variable content of natural scenes, while variability of the output stream is allowed. To minimize the spread of the stream size, data buffering is used. In this technique a FIFO (first-in first-out) buffer is used,

when at the output of the buffer the stream size variation is lower than at the input of the buffer. In the extreme case, the VBR stream can be converted to the CBR form, but this requires the use of a large buffer. A large buffer is not only a hardware cost, but also a long delay between the input and output signals, which in real-time systems is a big disadvantage and sometimes is unacceptable [9].

The size of the video stream is often adjusted in an adaptive way, i.e. by matching it to the channel capacity between the sender and the recipient, or to the computing efficiency, screen resolution, or other requirements of the receiver.

In fact, in the considered ADAS applications, the problem of the channel between the sender and the recipient is not critical, while the problem of image quality control is very important. The analyses presented below focus on the impact of the stream size settings and their control on the image quality, with particular emphasis on the "rate control" of the x264 (Advanced Video Coding, AVC class), x265 and VP9 (High Efficiency Video Coding, HEVC class) codecs, based on the paper [11].

IV. EXTERNAL TESTS OF VIDEO CODECS

A. Lossless codecs
To the best knowledge of the authors, there is not much scientific work involved in testing the performance of lossless codecs. More comparative works are presented on the Internet in the form of so-called benchmarks.

In 2007, the MSU Video group presented a comparison of the performance of over a dozen lossless codecs [12]. The following codecs were tested: Alpary, ArithTuv, AVIzlib, CamStudio GZIP, CorePNG, FastCodec, FFV1, Huffyuv, Lagarithm, LOCO, LZO, MSU Lab, PICvideo, Snow, x264, YULS. The purpose of the study was to determine which codec achieves the highest compression ratio. The research was carried out on 9 various standard video sequences. The tests were performed with the settings for the maximum compression ratio (Max) or the maximum coding speed (Fast). The highest compression ratio is offered by the YULS and FFV1 encoders (about 3.3). The YULS is very slow (it reaches about 1 4CIF_fps), while the FFV1 reaches about 6 4CIF_fps (1 4CIF_fps means that the encoder compresses one 4CIF resolution frame, i.e. 704x480 pixels, in one second). The fastest encoders are Lagarith and HuffYUV (about 2x faster than FFV1), but they have worse compression ratios (about 2.7 and 2.1, respectively). A generally lossy encoder, i.e. x264, was also tested, but set in a lossless mode. Although it is very complicated, it reaches a compression ratio equal to 2.75 with 5 4CIF_fps. This shows the need for a different approach to lossy and lossless compression (a universal solution does not achieve as good results as a dedicated one).

Another list of lossless codecs for video sequences was prepared in [13] in 2016. The tested codecs are: Alparysoft, FastCodec, Huffyuv, H.264, H.265, Lagarith, MSU, MLC, Toponoky, and UtVideo. In the results, the MLC encoder surpasses the other codecs in terms of both compression (3.73 for 24-bit RGB) and speed, but, as the authors note, the video stream produced by MLC is not supported by most video players. The widely implemented H.264 codec gave relatively good performance but a much higher coding cost [13]. Unfortunately, the newest x265 and VP9 codecs are not included in the comparison.

A very interesting, often updated benchmark, which presents the latest lossless image and video coders, is [14]. Among the video codecs, the best results are achieved by the x264, x265, VP9 and FFV1 codecs, of which the last one is the fastest, despite using only one core of the processor. From the comparisons it can be concluded that codecs designed for lossy compression have a much worse performance in the lossless mode [14].

We also found tests of lossless image (intra-frame) encoders like WebP, JPEG-LS, FLIF, BPG [14, 15]. With 8-bit RGB images they reach compression ratios between 2.3 and 3.0. These results show that single-image coding (without the use of inter-prediction) results in a lower compression ratio than coding with inter-prediction, even in the lossless mode. In the case of a comparison of lossy image and video coders, the difference is even higher. Therefore, in this paper we omit the analysis of lossy still-image encoders.

Another extensive comparison of lossless or near-lossless codecs can also be found in [16] and in [17] (for medical signals).

B. Lossy codecs
In the case of lossy video codecs, the most up-to-date and extensive reports with comparisons of video codecs are prepared and published by the Graphics & Media Lab Video Group [18]. The report [19] by this group compared the coding quality of the newly created HEVC-class (high efficiency video coding) encoders, like AV1, uAVS2, Kingsoft HEVC, nj265, VP9, and x265, with coders of the previous standards, like nj264 and x264, using objective evaluation methods and high-definition video sequences. The authors of report [19] strongly state that the AV1 codec offers the best coding quality in all test sequences. The second place is occupied by the VP9 encoder and the KingSoft HEVC encoder (depending on the test sequence). In the case of setting x265 to high quality (2- and 3-pass "placebo" presets), the situation slightly changes: the x265 is slightly better than the average of VP9, with the 2-pass placebo x265 giving almost the same quality as VP9 (in some sequences it outperforms VP9, in some VP9 is better than the 2-pass placebo x265).

Analyzing the coding rate, it turns out that the AVS2 encoder offers a higher coding rate compared to the other encoders. The AV1 encoder has an extremely low performance – it is 2500-3000 times slower than its competitors. The x265 placebo presets (2 and 3 passes) offer a 10-15 times lower speed than the competitors. In the "ripping encoding" setting, which provides a very high quality of the compressed video, the speed of compression of many tested codecs is very low (below 1 fps). An interesting observation is that codecs producing a larger output stream operate slower, although the decrease of the speed of operation (in fps) with the increase of the output stream is relatively small.

Unfortunately, in real-time applications (like ADAS), there is no time and no possibility for multiple (2 or 3) passes of the

encoder. (In fact, the trend is noticeable: the popular 2-pass mode has now been extended to a 3-pass mode.)

Analyzing the quality results in [19], three codecs: AV1, x265 and VP9 achieve the best, very similar image quality, with a slight advantage of AV1. However, the VP9 codec is the most balanced with respect to quality versus speed. With the same quality, the older H.264 codec has a significantly lower compression ratio.

It is also worth noting that many modern image sensors offer more than 8 bits per pixel component, e.g. 12 or 14 bits, but all of the found tests were carried out on 8-bit data. In fact, some codecs support a 12- or even 16-bit representation (e.g. FFV1), but in the literature there are no analyses regarding image compression in this type of image representation.

V. AUTHORS' TESTS OF VIDEO CODECS

A. Dataset of video sequences
For the codec testing, the TME (Toyota Motor Europe) Motorway data set was used (with the consent of its authors) [20]. The TME Motorway recording database consists of 28 sequences with a total length of about 27 minutes (30000 frames) with vehicle descriptions (ground truth). The annotations were generated semi-automatically using laser scanner data. In this work, no annotation data was used. The database contains sequences from journeys made on motorways in Italy. The recordings include variable motion situations, different numbers of lanes, intersections and different lighting conditions. The recording database is characterized by the following details: image acquisition: stereovision (left and right camera), 20 fps, resolution 1024x768 (color components without Bayer filter interpolation, losslessly stored as monochrome images), FOV 32° (horizontal), additional information on the color coding from the Bayer filter.

Three sequences from the TME database were selected for testing the compression techniques: tme08, tme11, tme32. The shortest sequence lasted about 60 seconds, the longest 80 seconds [20].

B. Test platforms
The tests of the lossless codecs were performed on a PC with the following configuration (PC1): Dell Precision T1500, Intel Core i7-870 2.93 GHz processor (computing index equal to 5362 according to cpubenchmark.net), 64-bit Windows OS, 16 GB RAM, GPU Nvidia Quadro FX 580, SATA II HDD (measured read and write operations ca. 100 MB/s).

The tests of the lossy codecs were performed on a PC with a slightly better configuration (PC2): Dell Optiplex, Intel Core i7-3770 3.4 GHz processor (computing index equal to 9302 according to cpubenchmark.net), 64-bit Windows OS, 8 GB RAM, GPU Nvidia GeForce GT 640, SATA III HDD (measured read and write operations ca. 200 MB/s).

Due to the large size of the test sequences, the uncompressed data and the data after compression were respectively read from and written to the disk.

C. Experiments on lossless codecs
During the experiments, the lossless codecs were tested using the PC1 test platform. Three sequences from the TME database were selected for the testing: tme08, tme11 and tme32. The measurements were made using the VirtualDub software (all codecs except VP9, FFV1 and x265, which were tested using the ffmpeg software). All codecs were set to maximize the compression ratio.

Figure 1 presents the compression ratio and Fig. 2 depicts the input (uncompressed) video throughput [MB/s] of the codecs [21]. In Figs. 1 and 2, the following codename abbreviations have been used: Alp (Alparysoft), FastC (Fast Codec), huff (HuffYUV), lagh (Lagarith), mlc (MLC), top (Toponoky), utv (Ut Video), ffv1 (FFV1), ffv1_ffmpeg (FFV1 interfaced by the ffmpeg software), x264 (implementation of the H.264 standard), x265_ffmpeg (implementation of the H.265 standard, interfaced by the ffmpeg software, lossless LL_VS options), VP9_ffmpeg (VP9, interfaced by the ffmpeg software, lossless LLMT_RT options), msu (MSU). The AV1 codec (currently a very inefficient implementation in terms of the compression speed) as well as Daala and Thor (both at the development stage) have not been tested.

Fig. 1 shows that 7 out of the 12 tested codecs achieve an average compression ratio higher than 3, while 4 of them, i.e. MLC, FFV1, x264 and VP9, exceed the compression ratio value of 3.3. For individual sequences (in most cases the sequence tme32 was best compressed), additionally the Lagarith and x265 codecs achieved compression ratios higher than 3.3.

Taking into account the speed of compression (cf. Fig. 2), the codecs Fast Codec, HuffYUV, Ut Video and Lagarith achieve the highest bit rates. It is worth noticing that these results are consistent with the reports presented in [21]. The Lagarith and Ut Video codecs are the fastest, reaching 169.2 MB/s and 167.2 MB/s of the input (uncompressed) video stream throughput, respectively. This result was certainly influenced by the limitations of the read and write speed of the HDD interfaced with SATA II.

Also, the use of the CPU by the particular codecs was examined. It was noted that only x264, x265 and VP9 were able to use the CPU in 100%. For the other codecs which use multithreading, i.e. Lagarith, HuffYUV, MLC, Ut Video, the CPU usage did not reach 100%, changing from 33% to 65%. This means that the limitation of the speed of operation was not the CPU, but the disk read and write operations.

Considering the overall performance of a codec, hence both the compression ratio and the compression speed (input video throughput), there is no unambiguous winner. The Lagarith and Ut Video codecs are the fastest, but they do not offer the highest compression ratios. Taking into account the codecs that offer average compression ratios of more than 3, the FFV1 codec (the version interfaced with the ffmpeg software) is the fastest codec.
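The measurement loop behind Section C reduces to two mechanical steps: run an encoder in its lossless mode and relate the output size to the input size. A minimal sketch follows; the file names and helper names are our own illustration (not the authors' scripts), while the encoder switches are ffmpeg's documented lossless options for FFV1, libx264 and libvpx-vp9:

```python
def lossless_cmd(codec, src, dst):
    # encoder switches for ffmpeg's lossless modes
    opts = {
        "ffv1": ["-c:v", "ffv1"],                        # FFV1 is lossless by design
        "x264": ["-c:v", "libx264", "-qp", "0"],         # x264 lossless mode: QP = 0
        "vp9":  ["-c:v", "libvpx-vp9", "-lossless", "1"],
    }[codec]
    return ["ffmpeg", "-i", src] + opts + [dst]

def compression_ratio(raw_bytes, encoded_bytes):
    # how many times the encoded stream is smaller than the raw input
    return raw_bytes / encoded_bytes

cmd = lossless_cmd("x264", "tme08.avi", "tme08_x264.mkv")
# e.g. 33 GB of raw frames shrunk to 10 GB gives the ratio 3.3 discussed above
ratio = compression_ratio(33_000_000_000, 10_000_000_000)
```

The command list can be handed to subprocess.run, and the ratio is then computed from the resulting file sizes on disk.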

Figure 1. Compression ratio of lossless codecs (bar chart, one bar per tested codec, vertical scale 0-4)

Figure 2. Input (uncompressed) video throughput [MB/s] of lossless codecs (bar chart, logarithmic vertical scale 0.1-100 MB/s, one bar per tested codec)

D. Experiments on lossy codecs
The tests of the lossy codecs were carried out using the library (console application) ffmpeg 4.0 and the PC2 test platform. Using the literature results presented in Sect. IV, three codecs were selected and tested in detail: x264, x265 and VP9. They are the most promising in terms of performance and compression, the most advanced, and at the same time they have fast and stable implementations. Please note that the newest AV1 codec has not been tested due to its very inefficient (trial) implementation. In one of the reports, the AV1 codec, in its very early implementation, is about 2500-3000 times slower than the competitive HEVC codecs [19].

The main goal of the study was to obtain information about the performance of the individual codecs (the compression ratio and the speed of operation) without a detailed examination of the quality of the compressed image.

For the testing, just like for the lossless codecs, three video sequences from the TME database were selected: tme08, tme11, tme32. The tested codecs were set to maintain a constant quality, i.e. the constant quantization parameter (CQP) rate control. The QP was changed and the tests were carried out for QP = 5, 10 and 20. Unfortunately, although all three codecs offered this parameter setting, it could not be assumed that they achieved the same image quality with the same QP settings. No work has been found in the literature which would suggest what value of QP should be set for the individual codecs to achieve a comparable compressed image quality. The parameter QP can be set for x264 and x265 in the range from 0 to 61, while for VP9 from 0 to 63. We found in the study [19] that for the same compression rates the newer codecs (x265, VP9; HEVC class) offer a much higher perceived and objective (according to SSIM [1]) image quality compared to x264 (AVC class).

Each of the codecs has a lot of detailed settings. It is almost impossible to examine all of their combinations, but one can certainly find the key ones. For example, for the VP9 codec, setting the -row-mt 1 parameter allowed for an acceleration of the compression from 10% to 50%, depending on the remaining codec settings. As a result, the average CPU utilization was at 80-90%. For the x264 and x265 codecs, the CPU utilization was practically 100% all the time. The x264 and x265 codecs were tested for two different performance settings, i.e. the presets "medium" and "ultrafast", while the VP9 codec was tested for three different performance settings: "-deadline good -speed 2", "-deadline realtime -speed 5" and, for the maximum speed, "-deadline realtime -speed 8". For all of these settings the quality should be the same, but the compression ratio changes according to the relationship: faster processing means a larger file, which means a lower compression ratio.

One of the frequently used presets is the so-called "placebo". This setting should give the smallest picture distortion. However, as shown in the study [22], the "placebo" setting, compared to the "veryslow" setting, improves the image quality by only about 1%, but at the expense of a significant increase of the compression time. The next preset, namely "veryslow", gives a quality about 3% higher than "slower", "slower" about 5% higher than "slow", and "slow" about 5-10% higher than the "medium" setting. Finally, the "veryslow" setting was used for the x264 and x265 codecs. An extensive guide on the x264 and x265 codec settings can be found in [11].

For the VP9 codec, similarly, the almost-best setting was chosen, i.e. "-deadline good", because, as the tests show, the "best" setting definitely slows down the coding process. In paper [23], the authors state that the "best" setting gives only a marginal improvement of the quality over the "-deadline good -speed 0" setting.

The compression ratios of the tested lossy codecs for the three various settings of the quantization parameter are presented in Fig. 3, while the input (uncompressed) video throughput [MB/s] is shown in Fig. 4. The general relationship is that the higher the QP, the higher the compression ratio and the processing speed. The codecs VP9, x264 and x265, for single sequences, depending on the settings, reach for the tested QP parameter values (5, 10, 20) compression ratios from 5.7 to 91.3. The highest compression ratio of the codecs was obtained for the QP = 20 parameter setting, i.e. for the lowest image quality. Changes in the settings of the individual codecs can lead to a more than 10-fold acceleration of the compression with only about a 10% decrease in

the compression rate. Therefore, it is necessary to precisely select the control parameters. The x264 codec is comparable in the compression speed to the VP9 codec, but with similar degrees of compression it achieves a much lower image quality (the problem was discussed above). The VP9 codec is much more efficient than the x265 codec and therefore seems to be the best for the ADAS applications.

Figure 3. Compression ratio of lossy codecs for various settings of the quantization parameter (bar chart, vertical scale 0-70, groups for QP = 5, 10, 20; encoder settings: x264 medium, x264 ultrafast, x265 medium, x265 ultrafast, VP9 good speed 2, VP9 realtime speed 5, VP9 realtime speed 8)

Figure 4. Input (uncompressed) video throughput [MB/s] of lossy codecs for various settings of the quantization parameter (bar chart, vertical scale 0-160 MB/s, same groups and encoder settings as in Fig. 3)

VI. CONCLUSIONS

We prepared a review and presented a selection and tests of lossless and lossy video codecs for ADAS applications.

In the case of the lossless codecs, the required parameters are the compression ratio and the encoding speed. As a result of the analyses and tests carried out, the best balanced solution is the FFV1 codec.

In the case of the lossy codecs intended for ADAS, the quality of the compressed video stream must also be taken into account. The highest image quality is offered by the latest HEVC codecs. Although, in general, the HEVC codecs are more complex than the AVC ones and therefore compress more slowly, the VP9 codec, belonging to the HEVC group, offers an encoding speed close to the x264 encoder (the AVC group).

Preparation of this paper was supported with the means of the DS 2018 project.

REFERENCES
[1] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, Image Quality Assessment: From Error Visibility to Structural Similarity, IEEE Transactions on Image Processing, Vol. 13, No. 4, 2004.
[2] R. V. Babu, M. Tom, and P. Wadekar, A survey on compressed domain video analysis techniques, Multimedia Tools Appl., Vol. 75, No. 2, pp. 1043-1078, 2016.
[3] H. Choi, I. V. Bajic, High efficiency compression for object detection, CoRR abs/1710.11151, 2017.
[4] L. Kong, R. Dai, Temporal-Fluctuation-Reduced Video Encoding for Object Detection in Wireless Surveillance Systems, 2016 IEEE Int. Symposium on Multimedia (ISM), San Jose, CA, pp. 126-132, 2016.
[5] A. Marsetic, Z. Kokalj, K. Ostir, The effect of lossy image compression on object based image classification - WorldView-2 case study, ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVIII-4/W19, pp. 187-192, 2011.
[6] P. Elmer, A. Lupp, S. Sprenger, R. Thaler, A. Uhl, Exploring Compression Impact on Face Detection Using Haar-like Features, Scandinavian Conference on Image Analysis, pp. 53-64, 2015.
[7] VP9, WebM Project, https://www.webmproject.org/vp9/, access 11.05.2018.
[8] Codec FFV1, https://www.ffmpeg.org/~michael/ffv1.html, access 11.05.2018.
[9] Z. Chen, K. N. Ngan, Recent advances in rate control for video coding, Signal Processing: Image Communication, Vol. 22, No. 1, pp. 19-38, 2007.
[10] Z. Wu, S. Xie, K. Zhang, R. Wu, Rate Control in Video Coding, in Recent Advances on Video Coding, edited by Javier Del Ser Lorente, www.intechopen.com, pp. 80-116, 2011.
[11] W. Robitza, Understanding Rate Control Modes (x264, x265, vpx), Mar 1, 2017, http://slhck.info/video/2017/03/01/rate-control.html, access 11.05.2018.
[12] D. Vatolin, I. Seleznev, M. Smirnov, Lossless Video Codecs Comparison '2007, MSU Graphics & Media Lab (Video Group), http://www.compression.ru/video/codec_comparison/lossless_codecs_2007_en.html, access 11.05.2018.
[13] D. Li, V. Bui, L. C. Chang, Performance Comparison of State-of-Art Lossless Video Compression Methods, 2016 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, pp. 386-391, 2016.
[14] Squeezechart, the world's largest lossless data compression benchmark, http://www.squeezechart.com, access 11.05.2018.
[15] Lossless Photo Compression Benchmark, http://imagecompression.info/gralic/LPCB.html, access 11.05.2018.
[16] Comparison of video codecs, Wikipedia, https://en.wikipedia.org/wiki/Comparison_of_video_codecs, access 11.05.2018.
[17] V. Bui, L. C. Chang, D. Li, L. Y. Hsu, M. Y. Chen, Comparison of Lossless Video and Image Compression Codecs for Medical Computed Tomography Datasets, 2016 IEEE International Conference on Big Data, pp. 3960-3962, 2016.
[18] CS MSU Graphics & Media Lab Video Group Home Page, http://www.compression.ru, access 11.05.2018.
[19] HEVC Video Codecs Comparison 2017, Part 5: High-quality encoders (inc. VP9, AV1), http://www.compression.ru/video/codec_comparison/hevc_2017/, access 11.05.2018.
[20] TME Motorway dataset, http://cmp.felk.cvut.cz/data/motorway, access 11.05.2018.
[21] T. Daede, A. Norkin, I. Brailovskiy, Video Codec Testing and Quality Measurement, https://tools.ietf.org/id/draft-ietf-netvc-testing-06.html, October 30, 2017, access 11.05.2018.
[22] H.264 Video Encoding Guide, https://trac.ffmpeg.org/wiki/Encode/H.264, access 11.05.2018.
[23] VP9 Bitrate Modes in Detail, Google Inc., Aug. 2017, https://developers.google.com/media/vp9/bitrate-modes/, access 11.05.2018.


Audio processing using Python language science libraries

Tatsiana Viarbitskaya
Department of Acoustics and Multimedia
Wroclaw University of Science and Technology
Wroclaw, Poland
tatsiana.viarbitskaya@pwr.edu.pl

Andrzej Dobrucki
Department of Acoustics and Multimedia
Wroclaw University of Science and Technology
Wroclaw, Poland
andrzej.dobrucki@pwr.edu.pl

Abstract – The topic of the article is the recognition of instruments and playing techniques of music for the detection and correction of errors in a given music sample. It shows how to obtain the characteristics of a recorded sound and also how to compare the amplitudes and frequencies of the same piece of music played by different persons and with various instruments. For this aim, signal processing algorithms are used which are available in standard Python libraries such as "numpy" or "scipy". The key idea of the processing is the detection of errors while preserving the playing technique and the individual style of the player.

Keywords: signal processing, sound recognition, signal analysis, data recognition, wavelet transform, transform parameters, violin playing techniques.

I. INTRODUCTION

At the beginning, a few words of history: Python 3.0, the initial version of the new branch of the language, was released in December 2008. After two years it was decided to complete the standalone program, and in July 2010 Python 2.7 appeared. The two versions of the same language were developed at the same time, and python.org decided to stop the maintenance of the old Python 2.7 version in 2020 and to focus on evaluating and developing the newest branch of this scripting language.

At the time of writing, the latest Python release is 3.7, from June 2018.

Over time, more and more applications, calculations, signal processing tasks and even big data analyses are written in this user-friendly scripting language.

Python has a lot of advantages, among them:
- it is open-source software,
- it is easily adaptable to each operating system: CPython (the C-based implementation), Jython (Java-based), IronPython (C#-based), etc.,
- it is an intuitive and user-friendly language,
- it supports multiple programming paradigms, such as object-oriented, imperative, functional and procedural,
- it includes several science-oriented libraries which are strongly developing,
- it supports memory management,
- it supports multithreading operations,
- it has a large and comprehensive standard library.

Python also has disadvantages which affect the audio signal processing area:
- there are no fixed-width integer types like in C++ (since C++11), but using Numpy it is possible to provide them in the code,
- mutable default arguments,
- the built-in data types and structures cannot be redefined,
- default numeric type casting,
- the usage of the global interpreter lock (GIL) mechanism.

This paper shows how to use the Python science libraries in audio processing procedures and programs, and also defines their possibilities. [1]

II. PYTHON ECOSYSTEM

The Python ecosystem contains such areas as:
- Core: the interpreter, the programming language and the standard library,
- IDEs: Pyzo, PyCharm, Spyder, Jupyter,
- Libraries: Numpy, Scipy, Matplotlib, etc.,
which are compiled separately from each other; only the interface is available.
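The first disadvantage listed above (no fixed-width integer types) and the Numpy workaround mentioned there can be illustrated directly; the values below are our own example:

```python
import numpy

# plain Python integers are arbitrary-precision...
big = 2 ** 64 + 1

# ...while Numpy supplies C-style fixed-width types
a = numpy.int16(32767)   # 2 bytes, like int16_t in C++11
b = numpy.int32(100000)  # 4 bytes, like int32_t

widths = (a.itemsize, b.itemsize)
```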

The ecosystem is presented below:

Fig. 1. Visualization of the Python ecosystem idea [2]

Such an organization is perfect for data processing: one can import a whole library, or just a single module when it is needed. From the program realization point of view, each library should be installed from the Python environment using, for example, the Python package index:

python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose

Fig. 2. Example of libraries installation in Python

Such an organization makes it possible to introduce any changes in a science library module without any investigation of the core, the make files and the interfaces of the main core component. [1, 3]

III. LIBRARIES – SHORT SUMMARY

All the Python libraries used here are free and open-source libraries for scientific and technical computing. A short introduction to each of them is given below.

A. Scipy
Python's open-source library with possibilities for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other computation tasks used in science and engineering. [4]

B. Numpy
Python's special module for matrix computations. At the beginning it was a part of Scipy, but from version Python 2.5, and for the whole Python 3.x line, it was separated into a single module to optimize the source code and the run-time results of the computations in the program. [5]

C. Matplotlib
This Python module is used for the visualization of the results of scipy and numpy. Experienced users of signal processing will see the similarity with the MATLAB graphic tool. [6]

IV. PROGRAM REALIZATION

The advantage of applying the Python language to the realization of audio processing problems becomes evident when the task is to compare two signals on the amplitude and frequency side. The algorithm for the program is the following:
1. Calculation of the Fourier transform with all given arguments,
2. Calculation of the module, which corresponds to the amplitude of the signal,
3. Calculation of the signal phase from each array of the Fourier transform,
4. Calculation of the list of frequencies which are present in the signal,
5. Calculation of the modules' difference,
6. Calculation of the Fast Fourier Transform for the different signals with the same phase,
7. Calculation of the Inverse Fourier Transform,
8. Return of the difference values corresponding to the difference in frequencies and amplitudes between the signals.

The program realization is given below:

import numpy

def get_difference_in_frequency(samples_1, samples_2, sample_time=1.0):
    # calculation of the fft of the given arguments
    fft_1 = numpy.fft.fft(samples_1)
    fft_2 = numpy.fft.fft(samples_2)

    # calculation of the modules, which correspond to the amplitudes
    # of the signals
    fft_abs_1 = numpy.absolute(fft_1)
    fft_abs_2 = numpy.absolute(fft_2)

    # calculation of the signal phase from the fft
    fft_angle = numpy.angle(fft_1)

    # calculation of the list of frequencies, which are present in the signal
    freqs = numpy.fft.fftfreq(len(samples_1), sample_time)

    # calculation of the modules' difference
    fft_abs_diff = []
    for i in range(0, min(len(fft_1), len(fft_2))):
        fft_abs_diff.append(fft_abs_2[i] - fft_abs_1[i])

    fft_diff = [mag * numpy.exp(ph * 1j) for (mag, ph) in
                zip(fft_abs_diff, fft_angle)]

    # calculation of the inverse fft
    samples_diff = numpy.fft.ifft(fft_diff)
    # change of the values' representation - from complex to integer
    return [int(numpy.real(samples_diff[i]))
            for i in range(0, len(samples_diff))]
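A self-contained sanity check of steps 1, 2 and 5 on synthetic data can be run in a few lines; the tones, the sampling rate and the variable names below are our own example, not the violin recordings used later:

```python
import numpy

fs = 1000.0                                        # sampling rate [Hz]
t = numpy.arange(0, 1, 1 / fs)
a = numpy.sin(2 * numpy.pi * 50 * t)               # reference signal: a 50 Hz tone
b = a + 0.5 * numpy.sin(2 * numpy.pi * 120 * t)    # the same tone plus an error

# FFT, modules and the modules' difference (steps 1, 2 and 5)
mag_diff = numpy.absolute(numpy.fft.fft(b)) - numpy.absolute(numpy.fft.fft(a))
freqs = numpy.fft.fftfreq(len(t), 1 / fs)

# the difference spectrum peaks at the added component, +/-120 Hz
peak_hz = abs(freqs[int(numpy.argmax(mag_diff))])
```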

And in the further part of the program the function should be called with the proper arguments:

samples_diff = get_difference_in_frequency(samples_1, samples_2)

It was shown that the program realization contains only about ten lines.

It is also possible to hear the difference by exporting the results to a .wav file, using a conversion from integers to a bytes' representation. The programmer should pay attention, during the program realization, to the conversion between integers and the bytes' representation in the Python language, because the structure of the file package depends on the sample width in the audio/video files. The Python documentation describes only the standard ones: [7]

TABLE I. DATA TYPE CASTING IN PYTHON

Format | C type | Python type | Standard size
x | pad byte | no value | -
B, b, c | unsigned char, signed char, char | integer, integer, bytes of length 1 | 1
H, h | unsigned short, short | integer | 2
I, i, L, l | unsigned int, int, unsigned long, long | integer | 4
q, Q | long long, unsigned long long | integer | 8
e | (7) | float | 2
f | float | float | 4
d | double | float | 8

For a non-standard sample width, like 3, one needs to write one's own implementation in the source code or in the Python numpy module. An example of a realization for a sample width equal to three is given below:

import struct

def bytes_to_samples(bytes, sample_width):
    # sample_width = 1: unsigned char (B),
    # sample_width = 2: unsigned short (H),
    # sample_width = 4: unsigned int (I)

    # '<' --> endianness of the used architecture
    integer_codes = {1: "<B", 2: "<H", 4: "<I", 3: "<I"}

    samples = []
    for i in range(0, len(bytes), sample_width):
        sample = bytes[i:i + sample_width]
        if sample_width == 3:
            # pad the 3-byte sample with a zero high byte, so that
            # it can be unpacked as a 4-byte unsigned int
            sample = sample + b'\x00'
        unpacked_int = struct.unpack(integer_codes[sample_width], sample)[0]
        samples.append(unpacked_int)
    return samples

As shown above, one of the advantages of using the Python language for audio signal processing is the possibility to implement one's own functions on top of the existing libraries; the disadvantage is the need to know the signal's data.

V. ACHIEVED RESULTS

For analyzing the work of the program, a violin piece of music was chosen, with the amplitude changed in three places, a false frequency in the middle of the playing, and another frequency added at the end.

The achieved results for the amplitude differences:

Fig. 3. Time dependence of the amplitude difference between two signals

Fig. 4. Time dependence of the amplitude difference between two signals after filtration of the lower frequencies

The sign of the calculation shows how the amplitude has been changed – louder or quieter. There is no necessity to put the program listing in the paper.
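Returning to the byte conversion shown earlier: the 3-byte branch of bytes_to_samples relies on a little-endian padding trick that can be checked in isolation (the sample value below is our own example):

```python
import struct

# appending a zero high byte turns a 3-byte little-endian sample
# into a 4-byte unsigned int with the same numeric value
raw = (500000).to_bytes(3, "little")     # one 24-bit sample
value = struct.unpack("<I", raw + b"\x00")[0]
```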

Achieved results for frequencies:

Fig. 5. Spectrum of frequencies difference between two signals

Fig. 6. Spectrum of the frequency difference between the two signals after filtration

As for the amplitude, the program reports the added, extracted and differing frequencies between the two analyzed signals. Again, it is not necessary to show the listing of the program in the paper.

VI. ELAPSED TIME BY DIFFERENT LANGUAGE REALIZATION

The other important factor in programming and signal processing realization is the elapsed time of the program. As written before, the elapsed time of the Python realization is worse than that of the equivalent C++ realization, but no data, or only very raw data, is given in the literature about a comparison with MATLAB.
Please note that the elapsed time also depends on the hardware of the computing machine; with different computers the results could differ, but the trend across the languages stays the same [8].
For the calculations the following computer was used:
• 64-bit architecture
• 8 GB of RAM memory
• Intel Core i7 (6700HQ) processor
The comparison of the different program realizations, using the default libraries of Python, C++ and Matlab, is presented in the figure below:

Fig. 7. Elapsed time comparison of the program realization in the different programming languages: Python, C++ and MATLAB

The table with the measured values:

TABLE II. TIME FOR THE PROGRAM DETECTING LOCAL ERRORS (AMPLITUDE AND FREQUENCY) IN RECORDED SOUND, REALIZED IN DIFFERENT PROGRAMMING LANGUAGES: PYTHON, MATLAB, C++

Sample number   Python [s]     Matlab [s]   C++ [s]
 1              7.414574521    11.175571    6.81355
 2              8.199937738     9.905346    7.95859
 3              7.971058602    10.765689    7.84302
 4              8.095232282    11.066577    7.841715
 5              8.22793889      9.853975    7.714038
 6              8.337927106     9.81997     7.645216
 7              8.17966754     10.731375    7.253004
 8              8.133809619    10.861479    7.232316
 9              8.046256933    11.148632    7.865466
10              8.170780637    10.713528    7.980349

For the simpler program – finding the proper tuning for the instruments – the time trend looks the same:
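The per-sample timings collected in these comparisons can be gathered with a small harness like the following (an illustrative sketch; the paper does not show its timing code – MATLAB's tic/toc and C++'s std::chrono would play the same role in the other realizations):

```python
import time

def measure(func, repeats=10):
    """Run func `repeats` times and collect the elapsed wall-clock
    time of each run, one value per table row."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        func()
        times.append(time.perf_counter() - start)
    return times

def workload():
    # toy stand-in for the error-detection program
    return sum(i * i for i in range(100_000))

samples = measure(workload)
print(len(samples))  # 10 timing samples, one per run
```

time.perf_counter is preferable to time.time here because it is monotonic and offers the highest available resolution for interval measurement.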
Fig. 8. Elapsed time comparison of the simple program realization in the different programming languages: Python, C++ and MATLAB

The table with the values is given below:

TABLE III. TIME FOR THE PROGRAM OF TUNING DETECTION IN RECORDED SOUND, REALIZED IN DIFFERENT PROGRAMMING LANGUAGES: PYTHON, MATLAB, C++

Sample number   Python [s]   Matlab [s]   C++ [s]
 1              0.659046     0.723243     0.636577
 2              0.658998     0.753899     0.625782
 3              0.658998     0.758025     0.624338
 4              0.662764     0.695496     0.651307
 5              0.640861     0.683522     0.625929
 6              0.709416     0.820221     0.670634
 7              0.664872     0.706117     0.650437
 8              0.674194     0.745767     0.649144
 9              0.670955     0.694655     0.662659
10              0.633083     0.647797     0.627932

In both examples the longest elapsed time was that of MATLAB and the shortest that of C++. The elapsed times of the Python realizations are closer to the C++ results than to the MATLAB ones.

VII. CONCLUSION

Audio processing is growing rapidly, as are all multimedia trends. It is necessary to develop open-source libraries for scientific and engineering computations. The best option at the moment seems to be the Python language with scientific open-source modules such as numpy and scipy, which can be customized by the user, as was shown in this paper.
Also, the graphical Python module matplotlib, with its support for 2D and 3D visualization, is a good replacement for commercial products.
Programs written in Python are intuitive, user-friendly and simple in realization.
The elapsed time of the Python realizations is close to that of high-level compiled programming languages; C++ was chosen as the example.
Among all these advantages, there are still some so far unsolved problems which could make it problematic to use Python:
- It strongly depends on the hardware realization
- In comparison with MATLAB, there is no Simulink-equivalent tool
- The work on pure Python libraries for big data computations is still ongoing.

VIII. REFERENCES

[1] Available: https://www.python.org/
[2] PYZO organization. Available: http://www.pyzo.org/python_vs_matlab.html
[3] W. J. Chun, Core Python Applications Programming, 2012.
[4] E. Bressert, SciPy and NumPy, 2012.
[5] T. E. Oliphant, Guide to NumPy: Market-Determined, Temporary, Distribution-Restriction (MDTDR), 2006.
[6] S. Vaingast, "Beginning Python Visualization: Crafting Visual Transformation Scripts," Springer, p. 384, 2009.
[7] "Binary Data Services," Python 3 documentation, 2018. Available: https://docs.python.org/3/library/binary.html
[8] A. C. Mueller, S. Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly, 2016.
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Active elimination of tonal components in acoustic signals

Michał Łuczyński, Stefan Brachmański, Andrzej Dobrucki
Chair of Acoustics and Multimedia, Wroclaw University of Science and Technology, Wrocław, Poland
michal.luczynski@pwr.edu.pl, stefan.brachmanski@pwr.edu.pl, andrzej.dobrucki@pwr.edu.pl

Abstract— In this paper, algorithms for eliminating tonal components from acoustic signals are presented. The tonal components are periodic signals with randomly modulated amplitude, phase and frequency. Active elimination of a single tonal component is based on adding a synthesized component with opposite polarity. To do this effectively, it is necessary to meet the criteria for adjusting the values of the parameters of both components. Otherwise, the level reduction may be insufficient, or the level of the component may even increase. It is important to define acceptable error values for which the reduction criteria are met. The problem defined in this way allows indication of the limitations of the method and creation of a mathematical model to predict the effectiveness for any case in the scope of the analyzed events. As part of the experiment, signals with one or more tonal components are analyzed. The obtained results of the applied algorithm are promising and show the desirability of continuing the work. In special conditions, the developed algorithm allows reduction of tonal components to values below the acoustic background.

Keywords – acoustic signal processing, active tone cancellation, tonal signal, tonal components

I. INTRODUCTION

A. Active noise canceling

The idea of active noise reduction has been known for many years [1], and there are many different algorithms for implementing it [2,3]. The basic idea relies on adding a signal with opposite polarity to the signal. The result of such action is wave interference, reducing the level of the resultant signal. Initially, popular solutions were based on feedback loops. Currently, the most highly developed solutions are based on feedforward loops and adaptive filters. In both cases, broadband processing is assumed [4,5]. These tests present solutions with high noise reduction efficiency for narrowband components (tonal components), e.g. for transformer noise [6]. There are also published papers indicating the possibility of using sound synthesis to generate a canceling signal. The presented tests are carried out for selected groups of signals, such as transformer noise [7] or the signal in an ambulance [8]. The indicated limitation of active noise reduction using synthesis is the necessity of getting to know the primary signal earlier.

The authors of this work are considering the possibility of creating an algorithm for the active elimination of tonal components in previously unknown signals. In order to achieve this, an algorithm that allows the detection of tonal components in the signal is being analyzed. This algorithm also allows determination of their parameters (amplitude, phase, frequency), synthesis of the canceling signals, and their addition to the primary signal. Because of the algorithm (eliminating tonal components), the level of the resultant signal (output signal) is reduced. This approach to active noise reduction is justified by the large group of signals that can be described as tonal signals, for which such an algorithm could be used.

B. Tonal components

Among the signals that can be classified as tonal ones [9,10] are, for example, some noises (transformer, compressor, saw and motor noise), acoustic feedback, mains hum, the speech signal, and the sound of musical instruments. An example of a tonal noise spectrum is shown in Figure 1.

Figure 1 Tonal signal with two detected tonal components
Signals of a tonal character are signals containing tonal components. The tonal components are narrowband (or periodic) components with randomly modulated amplitude, phase and frequency. To detect tonal components in a signal, the frequency should be detected first, followed by the amplitude and phase. One of the ways to determine the frequency is narrowband FFT analysis [11]. In narrowband FFT analysis, accurate determination of the frequency requires an increased length of the FFT block (an increased length of the time window). However, when the frequency varies, additional errors in frequency detection may occur. Despite various algorithms increasing the accuracy of tone detection, there are also other problems, mainly in the detection of low frequencies.

II. DESCRIPTION OF THE ALGORITHM

The used algorithm is based on active noise reduction algorithms using tonal synthesis. The algorithm consists of the following stages:
• Recording of the input signal, which contains a tonal component or components. The signal is divided into frames of 2,500 samples.
• Transformation of each frame using the FFT.
• Detection, based on the FFT, of the frequency, amplitude and phase of the tonal component.
• Synthesis of the tonal components with the phase shifted by 180 degrees with respect to the detected signal.
• Addition of the synthesized sound to the input signal.
• As a result, the tonal components of the input signal are eliminated (completely or partially).

Upon detection, each detected tone is processed separately (parallel processing). Then, after synthesizing the tones, they are added to the primary (input) signal. The block diagram of the algorithm for a single frame is shown in Figure 2.

Figure 2 Block diagram of the algorithm

III. DESCRIPTION OF THE EXPERIMENT

A. Tested signals

The signals used for the experiment can be divided into two groups: synthesized (digitally generated) and real (a recording of the fan sound in a ventilation duct). Among the synthesized signals there are pure tones, harmonic and nonharmonic multitones [12], and noise tones (narrowband noise). All samples are no less than 10 seconds in duration.

1) Pure tones: 11 Hz, 22 Hz, 101 Hz, 1 kHz, 10 kHz
The frequency of 22 Hz was chosen because it is the lowest FFT frequency for a sampling frequency of 44.1 kHz and 2500 samples (the conditions of the experiment). Because of this, there may be a problem with frequency detection, as with the 11 Hz frequency, which is an octave below the limit frequency. 101 Hz was chosen because it is not a divisor of the sampling frequency. 10 kHz is about ¼ of the sampling frequency, so the shape of the waveform differs from the sinusoidal shape.

2) Multitones with harmonic frequencies
a) 22 Hz, 44 Hz, 66 Hz
The FFT limit frequency (22 Hz) and two harmonics. During the analysis of this multitone, there may be problems with frequency detection due to the presence of the 22 Hz component.
b) 100 Hz, 200 Hz, 300 Hz, 400 Hz, 500 Hz, 600 Hz, 700 Hz
Harmonic frequencies, which often occur in real signals. The difference between the frequencies is large enough, so there should not be problems with the detection.

3) Multitones with nonharmonic frequencies
a) 99 Hz, 100 Hz, 101 Hz
Three frequencies very close to each other. There will probably be a problem with frequency detection. The difference between the frequencies is small enough to treat them as one frequency with a resultant phase. A possible effect is lower level reduction.
b) 990 Hz, 1000 Hz, 1010 Hz
Three frequencies very close to each other. There will probably be a problem with frequency detection. The differences between the frequencies are large enough that the reduction by the tested algorithm may fail.
c) 151 Hz, 251 Hz, 353 Hz, 457 Hz, 557 Hz, 653 Hz, 751 Hz
Subsequent non-harmonic frequencies. Probably no problems with the detection. Can be compared to the multitone with harmonic frequencies.

All component frequencies of the multitones have the same amplitude and an initial phase of 0 degrees. The amplitudes have been selected so that the sum of the components in no case exceeds the level of 0 dBFS.

4) Noise tones
White noise with a notch filter applied [13] (a signal that sounds like a pure tone but is not a sinusoidal signal). The research was carried out for subsequent values of the filter's quality factor (Q), so that the subsequent signals would more and more resemble a pure tone while remaining narrowband noise.
a) Center frequency 1 kHz, quality factor Q=10
b) Center frequency 1 kHz, quality factor Q=50
c) Center frequency 1 kHz, quality factor Q=100
d) Center frequency 1 kHz, quality factor Q=200
e) Center frequency 1 kHz, quality factor Q=500

5) Real signal – low-frequency tonal component of a recorded fan, frequency about 48 Hz
The noise of the fan was recorded using the Superlux ECM-999 measurement microphone [14]. The microphone was placed on the wall of the ventilation duct. The position of the microphone was adjusted perceptually. This means that the sound recorded in this position will most likely resemble the real sound. The measurements were carried out based on the norm concerning the measurement of the noise of fans in ventilation ducts [15].

IV. RESULTS

1) Pure tones
The effect of eliminating the tonal components at 11 Hz and 22 Hz was not obtained due to problems with tone detection. For the 101 Hz, 1 kHz and 10 kHz frequencies, complete elimination of the components was achieved.

2) Multitones with harmonic frequencies
a) 22 Hz, 44 Hz, 66 Hz
Due to the 22 Hz component in the spectrum of the output signal, the detection of the tonal components was burdened with a large error. The elimination of the components was also unsuccessful.
b) 100 Hz, 200 Hz, 300 Hz, 400 Hz, 500 Hz, 600 Hz, 700 Hz
The tonal components have been eliminated. In the output signal spectrum, artifacts related to window mismatches are observed. The results are shown in Figure 3.

3) Multitones with nonharmonic frequencies
a) 99 Hz, 100 Hz, 101 Hz
Detection of the 3 frequency components was not possible due to too small differences in the frequency of each component. In the case of the detection of one frequency, whose phase is the resultant of the phases of all three components, the elimination partially succeeded. The results are presented in Figure 4. A case of a periodic frequency detection error is observable, due to large differences in the phase of the individual components.
b) 990 Hz, 1000 Hz, 1010 Hz
In this case, the differences between the frequencies were too large to be treated as one tone (as in case 2a) and too small to be able to detect 3 separate frequencies. The results are shown in Figure 5.
c) 151 Hz, 251 Hz, 353 Hz, 457 Hz, 557 Hz, 653 Hz, 751 Hz
The detection and elimination went well, as in the case of 1b.

4) Noise tones
a) Center frequency 1 kHz, quality factor Q=10
The level of the output signal increased by 0.8 dB. The bandwidth was too large to treat the signal as a tone.
b) Center frequency 1 kHz, quality factor Q=50
As in 3a, the signal was not narrow enough. In this case, the level of the output signal was 2.1 dB higher than the input signal.
c) Center frequency 1 kHz, quality factor Q=100
In this case, the level of the output signal decreased by 2.9 dB.
d) Center frequency 1 kHz, quality factor Q=200
The level of the output signal decreased by 5.2 dB.
e) Center frequency 1 kHz, quality factor Q=500
The level of the output signal decreased by 12.1 dB. The results are shown in Figure 6.

5) Real signal – low-frequency tonal component of the recorded fan, frequency about 48 Hz
The frequency of the tonal component varied over time, but the elimination of the component was successful. The level of the analyzed component decreased to the background noise level (the total elimination condition). The results are shown in Figure 7.

All figures show 5 graphs: A – average spectrum of the input signal (FFT size 65536), B – average spectrum of the output signal (FFT size 65536), C – an example part of the input signal (time domain), D – an example part of the output signal, E – frequency detected by the algorithm in the following frames.
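The stages listed in Section II can be sketched for a single frame as follows (an illustrative sketch, not the authors' implementation: it handles only the single strongest FFT peak and assumes the tone falls exactly on an FFT bin, i.e. an integer number of periods per frame):

```python
import numpy as np

def eliminate_dominant_tone(frame, fs):
    """One pass of the Section II pipeline for a single frame:
    FFT -> frequency/amplitude/phase of the strongest peak ->
    synthesis with a 180-degree phase shift -> addition to the input."""
    n = len(frame)
    spectrum = np.fft.rfft(frame)
    k = np.argmax(np.abs(spectrum[1:])) + 1   # strongest non-DC bin
    freq = k * fs / n
    amp = 2 * np.abs(spectrum[k]) / n
    phase = np.angle(spectrum[k])
    t = np.arange(n) / fs
    # a 180-degree phase shift gives the opposite-polarity canceling tone
    canceling = amp * np.cos(2 * np.pi * freq * t + phase + np.pi)
    return frame + canceling

fs = 44100
t = np.arange(2500) / fs                      # 2,500-sample frame, as in the paper
tone = 0.8 * np.cos(2 * np.pi * 441 * t)      # 441 Hz = exactly 25 periods per frame
out = eliminate_dominant_tone(tone, fs)
print(np.max(np.abs(out)))  # residual close to zero: the tone is canceled
```

For real signals the detected frequency rarely falls exactly on a bin, which is one source of the window-mismatch artifacts and detection errors reported in the results above.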
Figure 3 Signal components: 100, 200, 300, 400, 500, 600, 700 Hz; number of detected tones: 7

Figure 4 Signal components: 99, 100, 101 Hz; number of detected tones: 1

Figure 5 Signal components: 990, 1000, 1010 Hz; number of detected tones: 1

Figure 6 Signal components: noise tone, center frequency 1000 Hz, quality factor Q=500 (-26.1 dB, -38.2 dB)

Figure 7 Noise of fan, eliminated frequency: about 48 Hz
V. CONCLUSIONS

As part of the experiment, the operation of the tonal component elimination algorithm was verified for different acoustic signals. For some signals the tonal components were fully eliminated; for some the elimination was unsatisfactory. The main reason for the elimination failures were problems with the detection of the tonal components' frequencies. Thanks to the conducted experiment, data were collected which allow the creation of general guidelines for, and limitations of, the presented method of active noise reduction.
As part of further research, the influence of the time window length (frame length) on the elimination efficiency will be investigated. Different shapes of the time window will also be checked.

REFERENCES

[1] Patent US 2043416 A, Paul Lueg - Process of silencing sound oscillations
[2] Sen M. Kuo, Dennis R. Morgan, Active Noise Control: A Tutorial Review, Proceedings of the IEEE, Volume 87, Issue 6, Jun 1999, pp. 943-973, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=763310 (19.07.2018)
[3] M. Pawełczyk, Application-Oriented Design of Active Noise Control Systems, Academic Publishing House EXIT, Warsaw 2013
[4] M. C. J. Trinder, P. A. Nelson, Active Noise Control In Finite Length Ducts, Journal of Sound and Vibration (1983), 89, 1, pp. 95-10, http://resource.isvr.soton.ac.uk/staff/pubs/PubPDFs/Pub3304.pdf (19.07.2018)
[5] S. Cecchi, A. Terenzi, P. Peretti, F. Bettarelli, Real Time Implementation of an Active Noise Control for Snoring Reduction, 144th Convention of the Audio Engineering Society, May 23-26, Milan, Italy
[6] P. Górski, W. M. Zawieska, System aktywnej redukcji hałasu o przebiegu okresowym, Bezpieczeństwo Pracy 9/2002
[7] X. Qiu, X. Li, Y. Ai, C. H. Hansen, A waveform synthesis algorithm for active control of transformer noise: implementation, Applied Acoustics 63 (2002) 467-479
[8] L. Morzyński, P. Górski, Sygnalizator ostrzegawczy w pojazdach uprzywilejowanych zintegrowany z systemem aktywnej redukcji hałasu, Bezpieczeństwo Pracy 7-8/2008
[9] J. Wiciak, Badania i ocena uciążliwości hałasu niskoczęstotliwościowego w czterokondygnacyjnym budynku mieszkalnym, Diagnostyka 36/2005, ISSN: 1641-6414
[10] Z. Engel, G. Makarewicz, L. Morzyński, W. M. Zawieska, Metody Aktywne Redukcji Hałasu, chapter 8.9 (p. 113), Centralny Instytut Ochrony Pracy, Warszawa 2001
[11] J. O. Smith III, Spectral Audio Signal Processing, 2011, ISBN 978-0-9745607-3-1
[12] Using Multitones in Audio Test, Audio Precision, https://www.ap.com/technical-library/using-multitones-in-audio-test/, created on 2008-07-01
[13] M. Mojiri, M. Karimi-Ghartemani, A. Bakshai, Time-Domain Signal Analysis Using Adaptive Notch Filter, IEEE Transactions on Signal Processing, Vol. 55
[14] Superlux ECM-999 Professional Precision Instrument Microphone, datasheet
[15] ISO 7235:1991, Acoustics - Measurement procedures for ducted silencers - Insertion loss, flow noise and total pressure loss
SIGNaL PROCESSING
algorithms, architectures, arrangements, and applications
SPa 2018
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 19 th -21st, 2018, Poznań, POLAND

Analysis of application possibilities of Grey System Theory to detection of acoustic feedback

Maciej Sabiniok, Stefan Brachmański
Wroclaw University of Science and Technology, Faculty of Electronics, Chair of Acoustic and Multimedia, Wroclaw, Poland
maciej.sabiniok@aes.pwr.edu.pl, stefan.brachmanski@pwr.edu.pl

Abstract - Notch-filter-based howling suppression is one of the most popular gain reduction methods of dealing with the acoustic feedback problem. The main goal of this paper is to analyze the possibilities of using the grey prediction model GM(1,1) in order to accelerate the feedback detection process of the algorithm. Computer-based comparative simulations of the algorithm containing the prediction model in the detection stage and without it were performed. Simulations were performed for different prediction orders, numbers of predicted samples and analysis window lengths. The comparison and evaluation were carried out for different source signals. Music, speech and noise signals were used.

Keywords – acoustic feedback suppression; feedback detection; grey prediction; grey model

I. INTRODUCTION

Acoustic coupling is present in most of the electroacoustic systems designed to amplify and reproduce a captured microphone signal. The signal received by the microphone is amplified and, after being played back by the loudspeaker, it can be fed back, creating a closed-loop system. The effects of acoustic coupling cause significant deterioration of the sound quality and affect the maximum stable gain (MSG) of the system. Audible artefacts in the form of ringing and howling effects can occur even if the system gain remains below the MSG.

Various methods of dealing with the acoustic feedback problem [1] [2] [6] [7] [12] can be distinguished depending on how they work [1] [2] [17]. Notch-filter-based howling suppression (NHS) is one of the most commonly used acoustic feedback control methods. It belongs to the family of gain reduction methods, and it works by inserting into the signal path a narrowband notch filter around the frequency at which howling is detected. The state-of-the-art methods use a two-stage algorithm which consists of a detection algorithm analyzing the microphone signal and a bank of notch filters controlled by the set of parameters obtained in the detection stage. The detection stage is the crucial part of all NHS methods and ideally should provide fast and reliable detection of the howling component in the microphone signal. In the real world, using a combination of spectral and temporal features to determine the occurrence of a howling component in the microphone signal requires time. Due to the detection time, the nature of most NHS methods is reactive, and howling can be perceived before it is detected. The goal of this paper is to analyze the possible usage of the first order grey prediction model GM(1,1) in the detection stage. The grey prediction model is part of the Grey System Theory, which deals with uncertain systems. It is suitable for forecasting data characterized by exponential growth.

The authors compared the detection performance, in terms of detection time, of the algorithm containing the prediction model in combination with existing detection criteria to the performance of the algorithm based only on the selected criteria. To compare the detection process of both algorithms, computer-based simulations were performed. Fig. 1 schematically shows the simulated system, being a single-channel electroacoustic system consisting of one microphone and one loudspeaker. The source signal p1(t) is captured by the microphone, producing signal m(t). The microphone signal is then amplified by the electroacoustic forward path G and, after being converted by the loudspeaker and modified by the acoustic feedback path F, signal p2(t) is added to p1(t). Based on the simulation results, the mean detection times for different prediction parameters and source signals were calculated.

Figure 1. Simulated electroacoustic signal path.

II. HOWLING DETECTION

Howling detection in general takes place in the frequency domain. The microphone signal is analyzed using the DFT in frames of length M. Typically the frames overlap by M – P samples, where P is the frame hop size. The size of P is chosen as a compromise between computational complexity and the time between feedback detection and inserting the notch filter into the electroacoustic forward path. In all cases the frame overlap is 50% (P=M/2), as a good
balance [17]. Each frame is defined as the vector m(t) of M samples at the time t (1). On the vector of samples the DFT (2) is performed to obtain the microphone signal spectrum estimation.

$\mathbf{m}(t) = [\,y(t+P-M)\ \dots\ y(t+P-1)\,]^T$    (1)

$M(\omega_k, t) = \sum_{n=0}^{M-1} w(t_n)\, m(t_n)\, e^{-j \omega_k t_n}$    (2)

Effectively the algorithm is based on the spectrogram of the microphone signal. An example of a microphone signal disclosing instability is shown in Fig. 2.

Figure 2. Example of the microphone signal spectrum disclosing instability

Based on the estimated short-time microphone signal spectrum, a peak picking algorithm usually identifies a predetermined number of spectral peaks which are candidate howling components. Fig. 3 shows the short-time spectrum of the signal presented in Fig. 2 at the time equal to five seconds. For readability, the spectrum is shown in the range from 300 Hz to 5 kHz. The red circles on the graph represent the spectral peaks determined by the peak picking algorithm.

Figure 3. Short-time spectrum of the signal presented in Fig. 2. Red circles correspond to the spectral peaks determined by the peak picking algorithm

To distinguish the howling components from the tonal components of the source signal, a number of spectral and temporal features have been proposed [9]. To determine the presence of the instability, one or more spectral and temporal features can be used simultaneously. The authors used a combination of spectral and temporal features: the peak-to-harmonic power ratio (PHPR) (3) and the interframe peak magnitude persistence (IPMP) (4), evaluated in [15] [16]. The PHPR and IPMP features can be determined as follows:

• PHPR – determines the ratio between the power of the spectral component being a howling candidate and the power of its mth harmonics and subharmonics:

$PHPR(\omega_i, t, m)\,[\mathrm{dB}] = 10 \log_{10} \frac{|M(\omega_i,t)|^2}{|M(m\omega_i,t)|^2}$    (3)

• IPMP – determines in how many past frames the given candidate howling frequency remains in the set of howling candidates:

$IPMP(\omega_i, t) = \frac{\sum_{j=0}^{Q_M-1} [\,\omega_i \in D_\omega(t-jP)\,]}{Q_M}$    (4)

The PHPR feature is based on the assumption that the sinusoidal howling components in the signal do not have the harmonic content of speech or music. Feedback is detected if PHPR ≥ TPHPR for all m ∈ M. The IPMP feature assumes that the howling component lasts longer in time than the tonal components of the source signal. Howling can be detected if IPMP ≥ TIPMP.

As shown in [16], there is no benefit from the usage of the 0.5th and 1.5th subharmonics used in [17]. The authors used m ∈ {2,3,4} in the determination of the PHPR feature. TPHPR = 30 dB and TIPMP = 3/5 were chosen, as in [17]. Feedback is detected if both criteria are fulfilled for the given thresholds. When instability occurs, the howling component frequencies remain in the howling candidate set through a substantial number of analyzed frames. Example values of the PHPR feature for the howling candidates presented in Fig. 3, allowing determination of the howling components, are given in Tab. 1. The frequencies in the table are sorted in order of magnitude. The frequencies highlighted in green exceed TPHPR. It can be seen that more than one frequency in this particular time frame fulfils the PHPR detection criterion. This example shows the sense of testing more than one feature in parallel.

Table 1. Example values of the PHPR feature for the howling candidates shown in Fig. 3

f [Hz]   517    409    1217   818    2024   3628   4037   4436   2821   4834
m = 2    42.4   37.8   21.5   19.8   44.0   49.6   56.2   59.5   32.6   37.8
m = 3    67.4   10.5    9.7   22.9   33.7   40.7   48.7   46.5   21.8   31.9
m = 4    65.9   14.7   25.8   24.4    8.4   28.6   38.4   36.0   10.4   29.8

Fig. 4 shows the block diagram of the described detection process. The microphone signal is represented by m(t); it is divided into overlapping frames m(t), which are windowed and transformed into the frequency domain using the DFT. The peak picking algorithm determines the candidate howling components collected in the set Dω(t); then, based on this set and the currently and previously calculated short-time frequency representations of the microphone signal, the spectral and temporal features are calculated.

Figure 4. Howling detection algorithm block diagram
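The two features can be transcribed directly from definitions (3) and (4), with the thresholds quoted above (TPHPR = 30 dB for m in {2,3,4}, TIPMP = 3/5). The toy spectrum and candidate sets below are invented for demonstration:

```python
import numpy as np

def phpr_db(spectrum, i, m):
    """Peak-to-mth-harmonic power ratio (3) for candidate bin i, in dB."""
    return 10 * np.log10(np.abs(spectrum[i]) ** 2 / np.abs(spectrum[m * i]) ** 2)

def ipmp(candidate_bin, past_candidate_sets):
    """Fraction of the last Q_M frames in which the bin stayed a howling
    candidate, i.e. feature (4)."""
    hits = sum(candidate_bin in s for s in past_candidate_sets)
    return hits / len(past_candidate_sets)

# toy spectrum: a strong sinusoidal peak at bin 100, weak everywhere else
spectrum = np.full(1024, 1e-3)
spectrum[100] = 1.0

# candidate sets D_w(t - jP) for the last Q_M = 5 frames
history = [{100}, {100, 205}, {100}, {17}, {100}]

is_howling = (
    all(phpr_db(spectrum, 100, m) >= 30 for m in (2, 3, 4))
    and ipmp(100, history) >= 3 / 5
)
print(is_howling)  # True: both criteria are fulfilled for bin 100
```

Checking both criteria in parallel is exactly the point made with Table 1 above: a high PHPR alone can be fulfilled by more than one frequency, so the persistence over frames is verified as well.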
III. GREY SYSTEM THEORY

The Grey System Theory was formulated in 1982 by Deng Julong in Systems & Control Letters with his paper "Control problems of grey systems". Besides probability, fuzzy systems and rough sets, the theory is one of the methodologies for dealing with systems which have uncertain or incomplete information [11]. As systems with incomplete or uncertain information about their behavior or parameters are very common in the real world, the Grey System Theory became interdisciplinary [3] [4] [8]. Recently the first attempts to apply the theory in acoustics have appeared [10]. In the case of sound reinforcement systems which require feedback control, there is no information about the incoming signal, and the acoustic feedback can vary in time.

One of the main elements of the Grey System Theory framework is grey prediction. As opposed to stochastic methods, grey prediction allows determination of the system behavior based on a small number of samples, at least four. The simplest grey prediction model, called GM(1,1) [5], is described by the first order, single variable differential equation (5). As the solution of a first order differential equation is an exponential function, this implies its usefulness for data characterized by exponential growth [14].

$x_k^{(0)} + a z_k^{(1)} = b, \quad k = 1,2,\dots,n$    (5)

$x_k^{(0)}$ are the k-th, nonnegative elements of the series of original data $X^{(0)}$. From the original data the generation series $X^{(1)}$ is obtained by the accumulation operation specified as follows:

$x_k^{(1)} = \sum_{i=0}^{k} x_i^{(0)}$    (6)

where $x_k^{(1)}$ are the k-th elements of the $X^{(1)}$ series. $z_k^{(1)}$ are calculated based on the generation series $X^{(1)}$ as follows:

$z_k^{(1)} = 0.5\,x_k^{(1)} + 0.5\,x_{k-1}^{(1)}$    (7)

Assuming nonnegative values of the $X^{(0)}$ series elements and defining Y and B as

$Y = \begin{bmatrix} x_2^{(0)} \\ x_3^{(0)} \\ \vdots \\ x_n^{(0)} \end{bmatrix}, \quad B = \begin{bmatrix} -z_2^{(1)} & 1 \\ -z_3^{(1)} & 1 \\ \vdots & \vdots \\ -z_n^{(1)} & 1 \end{bmatrix}$    (8)

the coefficients of the equation can be calculated using the Least Squares method:

$[a\ \ b]^T = (B^T B)^{-1} B^T Y$    (9)

The solution of the equation is given as

$\hat{x}_{k+1}^{(1)} = \left[ x_1^{(1)} - \frac{b}{a} \right] e^{-ak} + \frac{b}{a}, \quad k = 1,2,\dots,n$    (10)

$\hat{x}_{k+1}^{(1)}$ allows prediction of the future values of the original series $X^{(0)}$ as follows:

$\hat{x}_{k+1}^{(0)} = \hat{x}_{k+1}^{(1)} - \hat{x}_{k}^{(1)}$    (11)

Exponential growth in amplitude is characteristic of the howling frequency [13] in an unstable system. The use of frequency analysis allows avoiding the problem of negative samples without introducing a fixed component to the signal, which would decrease the available dynamic range. Additionally, frequency analysis allows tracking and even predicting both the spectral features of the signal and the temporal behavior of each frequency bin individually. These elements allow presuming that grey prediction, as it is suitable for exponentially growing data series, can accelerate the acoustic feedback detection process in an algorithm which uses the existing spectral and temporal features of the signal defined earlier (3)(4).

IV. NEW ALGORITHM

The proposed algorithm is based on the detection explained in section II. First, the algorithm analyzes the given spectral and temporal features (3) (4) as before. Second, if the algorithm does not detect a howling component, it reaches forward in time. For a selected howling candidate, the magnitudes of the spectral bins in K future frames are predicted using the GM(1,1) prediction model from the L-1 past samples and the current sample of the spectral bin magnitudes. L is the number of input samples of the prediction algorithm for each frequency bin corresponding to a candidate howling component. The same detection criteria as for the unmodified algorithm are then evaluated in order to detect howling components in the microphone signal. The algorithm block diagram is shown in Fig. 5.

Figure 5. Block diagram of the proposed modified howling detection algorithm containing the grey prediction model.

V. SIMULATION

In the simulation, the electroacoustic system from Fig. 1 was modeled. The source signals were scaled to an rms value of sound pressure corresponding to 60 dB SPL. In the next step these signals were processed with a microphone sensitivity of 1.85 mV/Pa, which is the sensitivity of the Shure SM58 microphone. Further, the signal was amplified in the electroacoustic forward path G and, after being transformed to units of sound pressure by a loudspeaker efficiency equal to 95 dB SPL (1 W / 1 m), its convolution with the room impulse response, p2(t), was added back to the microphone input p1(t). The room impulse response originally used in [16] was chosen. Additionally, the microphone signal m(t) was sent to the detection algorithm. The electroacoustic forward path gain was controlled by the detection stage, which gives the information about detected feedback. Initially the gain was set slightly below the MSG. The gain gradually increased over a time of 50 ms until it reached a value exceeding the MSG and then remained constant. This situation leads to instability of the system, and feedback slowly builds up in time as the gain is only fraction
of decibels above MSG. When the algorithm detected feedback
the gain was gradually decreased over 50 ms to the

363
value of 5 decibels below the MSG, to get the system out of instability. After stabilizing the system, the whole procedure is repeated from the beginning. This arrangement allowed a substantial number of detection realizations to be executed within a long test signal, without inserting filters into the signal path, which would change the frequency response of the electroacoustic forward path. Gradually increasing and decreasing the gain avoids the artefacts caused by quick changes in signal amplitude. Fig. 6 presents the simulation arrangement.

Figure 6. Simulation arrangement for the evaluation of howling detection algorithms.

The work focuses on the detection time of the algorithm. In each realization, the time counting starts when the electroacoustic forward path gain exceeds the MSG. From that time, the system is unstable and howling starts building up. At the end of the test signal, the algorithms return the mean detection time and the number of detections. The rate at which the amplitude of the howling component rises strongly depends on the forward gain above the MSG; therefore, an absolute detection time cannot be defined. However, because of the repeatable conditions, the relative performance, in terms of detection time, of the algorithm with different parameters can be compared.

The simulations were performed for both algorithms. The simulations with the algorithm without the grey prediction model in the detection stage were taken as the reference. To obtain a reasonable number of realizations of the detection process for all test cases, test signals with a length of one minute have been used. As source signals, music, speech and noise-like signals were used. For the music, a fragment of the Partita No. 2 in D minor (Allemande) for solo violin by J. S. Bach was used, as in [16]. For the speech and noise-like signals, a mix of ITU-T Rec. P.501 (01/2012) test signals was used. As the signals from ITU-T are delivered at a sampling frequency of 32 kHz, the room impulse response was resampled to be consistent with the test signals.

Different window lengths M, prediction orders L and numbers of predicted samples K were used. The prediction order refers to the number of last values of the short-time signal spectrum used in prediction for each frequency bin. Simulations with window lengths of M=4096, M=2048 and M=1024 were performed for each source signal and both algorithms. For the prediction order in the modified algorithm, the orders L=4, L=5, L=6 and L=7 were used for every window length. Numbers of predicted samples K=1, K=2, K=3 and K=4 were simulated for every combination of window length and prediction order.

VI. SIMULATION RESULTS

The following figures show the simulation results for the mean howling detection time of the tested algorithms. Figs. 7, 8 and 9 present the results obtained for the music signal, Figs. 10, 11 and 12 show the results for the noise signal, and Figs. 13, 14 and 15 contain the simulation results for the speech signal. The value labeled "ref" on the graphs corresponds to the values obtained in the simulation for the unmodified algorithm, which is the reference for the remaining results.

Figure 7. Simulation results for the music sample and analysis window length M=1024.

Figure 8. Simulation results for the music sample and analysis window length M=2048.

Figure 9. Simulation results for the music sample and analysis window length M=4096.
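The grey prediction step at the core of the modified algorithm, Eqs. (5)-(11), can be sketched in a few lines. This is a minimal illustration of GM(1,1) forecasting applied to an exponentially growing magnitude series, not the paper's implementation; the function name and the sample data are invented for the example.

```python
import numpy as np

def gm11_forecast(x0, k_ahead=1):
    """GM(1,1) forecast following Eqs. (5)-(11): accumulation (AGO),
    background series z, least-squares fit of (a, b), and inverse
    accumulation to recover predictions of the original series."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    assert n >= 4, "GM(1,1) needs at least four samples"
    x1 = np.cumsum(x0)                         # Eq. (6): accumulated series X(1)
    z = 0.5 * x1[1:] + 0.5 * x1[:-1]           # Eq. (7): background values z_k
    B = np.column_stack([-z, np.ones(n - 1)])  # Eq. (8)
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]  # Eq. (9): least squares
    k = np.arange(n, n + k_ahead)              # indices of the future samples
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a          # Eq. (10)
    x1_hat_prev = (x0[0] - b / a) * np.exp(-a * (k - 1)) + b / a
    return x1_hat - x1_hat_prev                # Eq. (11): inverse accumulation

# An exponentially growing series, as a building-up howling magnitude would be:
series = [1.0, 1.5, 2.25, 3.375]               # ratio 1.5 per step
print(gm11_forecast(series, k_ahead=2))        # ~[4.90, 7.31]; true continuation is [5.06, 7.59]
```

The small underestimate comes from the continuous-time exponential model approximating the discrete series through the trapezoidal background values of Eq. (7).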

364
Figure 10. Simulation results for the noise sample and analysis window length M=1024.

Figure 11. Simulation results for the noise sample and analysis window length M=2048.

Figure 12. Simulation results for the noise sample and analysis window length M=4096.

Figure 13. Simulation results for the speech sample and analysis window length M=1024.

Figure 14. Simulation results for the speech sample and analysis window length M=2048.

Figure 15. Simulation results for the speech sample and analysis window length M=4096.

As can be seen from the obtained results, in all cases it is possible to find a configuration of the proposed algorithm for which the mean acoustic feedback detection time is lower than the reference value of the unmodified algorithm. The largest influence on the detection time was noted for the music signal: for the best configurations, the mean detection time is half of the reference value (Figs. 7-9). What can also be noted is a decreased performance for higher prediction orders in most cases.

In general, a lower prediction order of four to five samples of the short-time spectrum gives better prediction results. For higher orders, too much averaging can take place, causing a decrease in performance.

VII. CONCLUSION AND FUTURE WORK

The detection time should be as low as possible in order to minimize the audible artefacts caused by the instability of the system. In this paper, a first attempt at dealing with the feedback problem with the help of Grey System Theory was presented, with positive results. The proposed algorithm allows the howling detection process to be accelerated in combination with the existing discrimination features.

The reliability of the algorithm was not taken into account in the evaluation. It should be assessed to confirm the usefulness of the algorithm in feedback detection. The chosen features used in the detection algorithm may also be suboptimal. This should be considered in further work related to the use of Grey System Theory and the grey prediction model in howling detection.

365
VIII. REFERENCES

[1] A. Favrot, C. Faller, "Adaptive Equalizer for Acoustic Feedback Control", J. Audio Eng. Soc., vol. 61, no. 12, Dec. 2013, pp. 1015-1021.
[2] A. F. Rocha, A. J. S. Ferreira, "An Accurate Method of Detection and Cancellation of Multiple Acoustic Feedbacks", AES 118th Convention, Barcelona, Spain, May 28-31, 2005.
[3] Cz. Cempel, "Grey systems theory - new methodology of analysis and evaluation of complex systems", Zeszyty Naukowe Politechniki Poznańskiej, 63, 2014.
[4] D. Julong, "Introduction to grey system theory", The Journal of Grey System, 1 (1989), pp. 1-24.
[5] E. Kayacan, B. Ulutas, O. Kaynak, "Grey system theory-based models in time series prediction", Expert Systems with Applications, vol. 37 (2010), pp. 1784-1789.
[6] G. Rombouts, T. van Waterschoot, K. Struyve, M. Moonen, "Acoustic feedback cancellation for long acoustic paths using a nonstationary source model", IEEE Transactions on Signal Processing, vol. 54, no. 9, Sep. 2006, pp. 3426-3434.
[7] G. Rombouts, T. van Waterschoot, M. Moonen, "Robust and Efficient Implementation of the PEM-AFROW Algorithm for Acoustic Feedback Cancellation", J. Audio Eng. Soc., vol. 55, no. 11, 2007.
[8] J. Gu, N. Vichare, B. Ayyub, M. Pecht, "Application of grey prediction model for failure prognostics of electronics", International Journal of Performability Engineering, vol. 6, no. 5, Sep. 2010, pp. 435-442.
[9] J. Lee, S. H. Choi, "Low complexity howling detection based on statistical analysis of temporal spectra", International Journal of Multimedia and Ubiquitous Engineering, vol. 8, no. 5 (2013), pp. 83-92.
[10] M. Lewandowski, G. Makarewicz, "Application of prediction algorithm to eliminate delays in audio signal processing systems with automatic parameters adjustment", XVII International.
[11] M. Lu, "Grey system: theory, methods, applications and challenges", Leverhulme Trust Workshop on Grey Systems and Applications, 22-24 September 2015, Bucharest, Romania.
[12] M. R. Schroeder, "Improvement of acoustic-feedback stability by frequency shifting", The Journal of the Acoustical Society of America, 36, 1718 (1964).
[13] N. Osmanovic, V. E. Clarke, E. Velandia, "An in-flight low latency acoustic feedback cancellation algorithm", AES 123rd Convention, New York, NY, USA, October 5-8, 2007.
[14] Q. Li, Y. Lin, "Review paper: A briefing to grey systems theory", Journal of Systems Science and Information, vol. 2, no. 2, Apr. 2014, pp. 178-192.
[15] S. Braun, "Evaluation of various algorithms to detect acoustic feedback", Toningenieur Projekt, Graz, February 19, 2012.
[16] T. van Waterschoot, M. Moonen, "Comparative evaluation of howling detection criteria in notch-filter-based howling suppression", J. Audio Eng. Soc., vol. 58, no. 11, Nov. 2010, pp. 923-940.
[17] T. van Waterschoot, M. Moonen, "Fifty years of acoustic feedback control: state of the art and future challenges", Proc. IEEE, vol. 99, no. 2, Feb. 2011, pp. 288-327.

366
SPA 2018: Signal Processing Algorithms, Architectures, Arrangements, and Applications, September 19th-21st, 2018, Poznań, Poland

Low Frequency Loudspeaker Measurements Using An Anechoic Acoustic Chamber

Krzysztof SOZAŃSKI, Institute of Electrical Engineering, University of Zielona Góra, Zielona Góra, Poland, K.Sozanski@ieee.uz.zgora.pl
Anna SOZAŃSKA, University of Cambridge, Cambridge, Great Britain

Abstract — This paper discusses the problem of low frequency measurements of loudspeakers. This can be achieved by performing outdoor measurements (in open air), by using a near-field technique, or by using a large anechoic acoustic chamber. The frequency characteristics of an exemplary loudspeaker are obtained and compared using these three methods. This article also describes the anechoic acoustic chamber built in the Science and Technology Park of the University of Zielona Gora.

Keywords — anechoic acoustic chamber, acoustic measurements, loudspeaker, loudspeaker measurements.

I. INTRODUCTION

The process of converting an electrical signal into an acoustic wave is very complex and hence very difficult to describe mathematically or to simulate. Furthermore, loudspeakers are characterized by a great variability, and their operation depends on the size and shape of the enclosure and on the crossover parameters. Therefore, during the design process it is not possible to rely exclusively on calculations and simulations, and verification of the results with measurements is needed [2-9]. The block diagram of a loudspeaker measurement system is depicted in Fig. 1.

Fig. 1. Block diagram of the loudspeaker measurement system: measurement system, power amplifier, loudspeaker under test, and microphone at 1 m, controlled by a personal computer.

Typically, good measuring equipment is very expensive, with the measurement equipment that is most convenient to use manufactured by companies such as Audio Precision, Brüel & Kjær, etc. However, it is possible to use more affordable solutions, such as the Clio system from Audiomatica.

During measurements it is necessary to ensure a low noise level and to eliminate the influence of waves reflected from the environment (such as walls, ceiling, floor, etc.). It is necessary to conduct all measurements in similar conditions, such as in a free field. This can be accomplished by measuring in the open air, by the time gating technique with additional near-field measurements for low frequency (performed in an ordinary room or an acoustic chamber), or in an anechoic acoustic chamber. The block diagram of a typical procedure for measuring loudspeaker frequency characteristics is depicted in Fig. 2.

Fig. 2. Simplified block diagram of the procedure for loudspeaker measurements: a LogChirp or MLS signal, time gating, FFT, and calculation of the impulse response A(ω), Φ(ω).

The measurements in this paper are conducted on an exemplary sample, a two-way small loudspeaker box designed and built in our laboratory in the Institute of Electrical Engineering (shown in Fig. 3). It is a 9.5-litre vented box with the tweeter DT100 and midrange/woofer CSC145G, both from Peerless. The three different measurement techniques are applied to this loudspeaker system.

Fig. 3. Tested loudspeaker box.

367
Acoustic measurements have been obtained using the computer controlled system Clio from Audiomatica [1]. Thanks to its advanced signal processing methods, using Clio it is possible to measure the frequency characteristics of acoustic systems without an anechoic chamber.

II. OPEN-AIR LOUDSPEAKER MEASUREMENTS

Open-air measurements are very similar to free-field measurements. The open-air measurement setup is shown in Fig. 4. The loudspeaker system is placed on a special stand at height h from the ground. In order to avoid the influence of acoustic noise, the measurements should be performed in a quiet environment. Another disruptive factor is the wind.

Fig. 4. Schematic diagram of the system for open-air measurements.

The measurements are made using a LogChirp signal and time gating for canceling reflected acoustic waves; only the direct response from the loudspeaker is taken into account. It is worth noting that the energy reflected from the ground arrives at the test microphone later than the direct wave. The ground reflection geometry is shown in Fig. 5.

Fig. 5. The ground reflection geometry.

Using the time gating technique it is possible to cancel the influence of the reflected energy. The time gate can be calculated by the formula

Δt = (2·sqrt(h² + l1²/4) − l1) / ν,   (1)

where: l1 – direct wave distance, h – distance to the ground, ν – speed of sound.

The lowest measured frequency is equal to

f_min = 1/Δt.   (2)

For typical measuring conditions, where l1 = 1 m and ν = 343 m/s (for a temperature of 20°C), the lowest measured frequency can be calculated by the formula

f_min = ν / (2·sqrt(h² + l1²/4) − l1) = 343 / (2·sqrt(h² + 0.25) − 1).   (3)

The relationship between the distance to the ground h and the lowest measured frequency f_min is depicted in Fig. 6.

Fig. 6. The relationship between the lowest measured frequency f_min and the distance to the ground h.

The impulse response of the tested loudspeaker system for open-air measurements is shown in Fig. 7. The loudspeaker impulse response can be divided into three regions: the delay region, the meter-on region and the reflection region. In the analysis only the meter-on region is used, and the remaining regions are filled with zeros. In this case, for h = 2.88 m, Δt = 14.1 ms and f_min = 72 Hz. The measurements were taken in the institute's backyard. The frequency response determined on the basis of this impulse response is shown in Fig. 8. As can be seen, in order to obtain a full frequency response for low frequencies, the height of the position of the tested loudspeaker system should be increased. However, the height of the position is often limited due to technical reasons. It should also be noted that open-air measurements are dependent on weather and external conditions and cannot always be carried out satisfactorily.

Fig. 7. The impulse response of the tested loudspeaker system (delay region, meter-on region, floor bounce).
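Equations (1)-(3), together with the wedge cutoff formula (4) used later for the anechoic chamber, are easy to check numerically. A small sketch (the function names are ours, not the paper's):

```python
import math

def time_gate(h, l1=1.0, v=343.0):
    """Eq. (1): path-length difference between the ground reflection and
    the direct wave, converted to a time gate Delta-t in seconds."""
    return (2.0 * math.sqrt(h**2 + l1**2 / 4.0) - l1) / v

def f_min_gating(h, l1=1.0, v=343.0):
    """Eqs. (2)-(3): lowest measurable frequency for a given stand height h."""
    return 1.0 / time_gate(h, l1, v)

def f_min_wedge(h_wedge, v=343.0):
    """Eq. (4): lower cutoff of an anechoic chamber with wedge height h."""
    return v / (4.0 * h_wedge)

print(round(time_gate(2.88) * 1e3, 1))   # 14.1 ms, as in the paper
print(round(f_min_gating(2.88)))         # ~71 Hz (the paper quotes 72 Hz)
print(round(f_min_wedge(1.2)))           # ~71 Hz for the 1.2 m wedges
```

The two limits come out nearly equal here: a 2.88 m stand in the open air and 1.2 m wedges in the chamber both give a lower limit of roughly 71 Hz.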

368
Fig. 8. The frequency response of the tested loudspeaker system measured in open air.

III. ACOUSTIC CHAMBER LOUDSPEAKER MEASUREMENTS USING THE NEAR-FIELD TECHNIQUE

The near-field technique can be used in any room. However, it is best to use a room with noise insulation. Here, the author uses an acoustic chamber with acoustic insulation. The walls and ceiling of the chamber are covered with a 14 cm layer of mineral wool, as shown in Fig. 9. The floor is also covered with acoustic isolation. Figure 10 shows the chamber during a measurement of a loudspeaker system. This acoustic chamber has dimensions of 3x3x2.5 m. The chamber is located in the Institute of Electrical Engineering at the University of Zielona Gora. The author was also involved in the design and construction of the chamber, and it is now under his supervision.

Fig. 9. Acoustic isolation of the wall of the acoustic chamber (40 mm and 100 mm mineral wool boards on 25 mm plasterboard).

A chamber of this kind only suppresses medium and high frequency acoustic waves; low frequency acoustic waves are weakly suppressed. Energy reflected from nearby walls, the floor and the ceiling arrives at the test microphone later than the direct wave. In this chamber it is possible to achieve a Δt of no more than 2.9 ms, so the lowest measured frequency is around 350 Hz.

The solution for measuring lower frequencies is measurement in the near field. The near-field measurement method is used for low frequencies, and then the near-field and far-field (1 m) measurements can be combined to give the entire frequency characteristics [6].

Fig. 10. Loudspeaker measurements in the acoustic chamber in the Institute of Electrical Engineering.

This evaluation method requires great care in order to obtain reliable results. Such loudspeaker measurement problems are described fairly well by D'Appolito [4, 5]. In the case of the tested loudspeaker system, two additional measurements were made, one close to the speaker's membrane and the other close to the bass-reflex hole. The results of these measurements are shown in Fig. 11.

Fig. 11. The frequency response of the tested loudspeaker system measured in the acoustic chamber: blue - far-field, red - membrane near-field, yellow - bass-reflex near-field.

It should be noted that such measurements are difficult for more complex loudspeaker systems, such as acoustic dipoles, etc.

IV. THE MEASUREMENTS IN AN ANECHOIC ACOUSTIC CHAMBER

Currently, the author has access to a much better acoustic chamber, in which wave reflections are greatly dampened. It is an anechoic acoustic chamber in the Electroacoustics Laboratory in the Science and Technology Park of the University of Zielona Gora in Nowy Kisielin. The chamber design is shown in Fig. 12. The acoustic chamber is made of reinforced concrete; it is a cube with dimensions of 10x10x10 m. It is isolated from the surroundings and positioned on a vibration-free mat that insulates it from the foundation.
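The near-field/far-field combination described in Section III can be sketched as a simple splice: level-align the near-field curve to the far-field curve at a chosen splice frequency and use it below that point. This is a generic illustration (the splice frequency, the frequency grid and the data are invented), not the exact procedure of [6]:

```python
import numpy as np

def splice_responses(freqs, nf_db, ff_db, f_splice=350.0):
    """Combine a near-field magnitude response (valid at low frequencies)
    with a far-field one (valid above the room/chamber limit) into one curve.
    The near-field curve is offset so the two match at the splice frequency,
    then used below it; the far-field curve is used above it."""
    freqs = np.asarray(freqs, dtype=float)
    nf_db = np.asarray(nf_db, dtype=float)
    ff_db = np.asarray(ff_db, dtype=float)
    i = int(np.argmin(np.abs(freqs - f_splice)))  # index nearest the splice point
    offset = ff_db[i] - nf_db[i]                  # level-align at the splice point
    return np.where(freqs < f_splice, nf_db + offset, ff_db)

freqs = [100.0, 200.0, 350.0, 1000.0, 5000.0]
nf = [80.0, 82.0, 83.0, 70.0, 50.0]   # near-field, arbitrary absolute level
ff = [40.0, 55.0, 63.0, 65.0, 64.0]   # far-field, reflections corrupt it below 350 Hz
print(splice_responses(freqs, nf, ff))  # [60. 62. 63. 65. 64.]
```

In practice a smooth crossfade around the splice frequency, and the port/membrane summation discussed by D'Appolito [4, 5], would replace this hard switch.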

369
Fig. 12. Simplified view of the chamber design: reinforced concrete wall (400 mm), anechoic chamber of 10x10x10 m, room with measurement equipment, doors with acoustic isolation, metal grid floor, and foundations with vibration and noise isolation.

Fig. 13. The view of the acoustic anechoic chamber with the acoustically transparent floor-grid: microphone positioning arm, door with wedges, and turntable.

All the walls, the ceiling and the floor of the chamber are covered with wedges absorbing acoustic energy. The height of the wedge is 1.2 m. The wedges are made of mineral wool. The chamber has an almost acoustically transparent floor-grid. The tensioned cable floor is used to permit walking above the wedges. The view of the chamber is shown in Fig. 13. The chamber is equipped with a computer controlled microphone positioning system. The system consists of a turntable, a microphone positioning arm and a computer controlled unit based on modules from National Instruments. Additionally, the chamber is equipped with the following measurement systems: the APx525 Audio Analyzer from Audio Precision [11], the sound and vibration meters SVAN 979 and SVAN 958A from Svantek [10], and other equipment.

370
The chamber design is based on Beranek's wedge structure [3]. A very important design factor is the wedge height, as it is directly correlated to the wedge's lower cutoff frequency. The minimal damping frequency of an anechoic acoustic chamber is

f_min = ν / (4h),   (4)

where: h – height of the wedge, ν – speed of sound.

In this particular case, the chamber minimum frequency is equal to 71 Hz.

The frequency characteristic of the tested loudspeaker system obtained using the LogChirp signal is shown in Fig. 14. As shown in the graph, the entire frequency response of the tested system was obtained in one measurement. Such measurements are more accurate in comparison to near-field measurements. This is particularly important for more complex loudspeaker systems, for example acoustic dipole systems, etc.

Fig. 14. The frequency response of the tested loudspeaker system measured in the acoustic anechoic chamber using the LogChirp signal.

An anechoic acoustic chamber also allows one to perform measurements by the classical method using sinusoidal signals. The tested loudspeaker was stimulated using successive sinusoidal signals in a range of 20...20,000 Hz. The obtained frequency characteristics are shown in Fig. 15.

Fig. 15. The frequency response of the tested loudspeaker system measured in the acoustic anechoic chamber using sinusoidal signals.

Additionally, an anechoic acoustic chamber allows one to use signals which cannot be freely used when reflections occur. Figure 16 shows the frequency characteristics of the investigated loudspeaker in response to a multitone sinusoidal signal.

Fig. 16. The frequency response of the tested loudspeaker system measured in the acoustic anechoic chamber using a multitone signal.

A unique feature of the anechoic acoustic chamber is the possibility of performing measurements in response to low frequency sinusoidal signals. For example, this allows one to measure the non-linear distortions introduced by the amplifier-loudspeaker system for different output powers. Fig. 17 depicts the response of the investigated loudspeaker system and audio power amplifier to a single sinusoidal signal of 60 Hz.

Fig. 17. The frequency response of the tested loudspeaker system measured in the acoustic anechoic chamber using a 60 Hz single sinusoidal signal.

The author was also involved in the design and construction of the chamber, and he is currently the head of the Electroacoustics Laboratory. Thanks to the chamber and the measurement instruments installed in it, it is possible to perform very precise acoustic measurements, especially in the low frequency range.

V. CONCLUSION

In the author's opinion, in spite of the significant developments in the field of digital acoustic measurements, acoustic anechoic chambers continue to be a really important research tool. Advanced digital measurements can be used for measurements under

371
normal circumstances; however, it is critical to be able to verify such results in an acoustic anechoic chamber. Therefore, contrary to many opinions, the author believes that acoustic anechoic chambers do not belong to the past and deserve a rightful place as valuable research tools.

REFERENCES

[1] Audiomatica, CLIO Software Release 10 User's Manual, available at http://www.audiomatica.com, 2018.
[2] J. E. Benson, Theory and Design of Loudspeaker Enclosures, Proc. IREE, Australia, vol. 30, September 1969, p. 261.
[3] L. Beranek, H. Sleeper, The Design and Construction of Anechoic Sound Chambers, The Journal of the Acoustical Society of America, 1946.
[4] J. A. D'Appolito, Testing Loudspeakers, Audio Amateur Press, 1998.
[5] J. A. D'Appolito, Measuring Loudspeaker Low-Frequency Response, available at http://www.audiomatica.com, 2018.
[6] D. B. Keele, Low-Frequency Loudspeaker Assessment by Near-Field Sound Pressure Measurement, J. Audio Engineering Society, vol. 22, April 1974, pp. 154-162.
[7] R. H. Small, Simplified Loudspeaker Measurements at Low Frequencies, J. Audio Engineering Society, vol. 20, Jan/Feb 1972, pp. 28-33.
[8] K. Sozanski, Digital Signal Processing in Power Electronics Control Circuits, second edition, Springer-Verlag, London, 2017.
[9] Floyd E. Toole, Audio-Science in the Service of Art, available at http://www.harman.com/about_harman/technology_leadership.aspx, 2018.
[10] Svantek, available at https://www.svantek.com/lang-en/products/, 2018.
[11] Audio Precision, available at https://www.ap.com/analyzers-accessories/apx52x/, 2018.

372
Index of Authors
Azarov Elias ............................... 321 Kim A.A. .................................... 235
Bajcsy Peter ............................... 146 Klepaczko Artur ......................... 174, 280, 286, 292
Balcerek Julian ........................... 268 Kłosowski Piotr .......................... 223
Bamberg Fabian ......................... 37 Kociński Marek .......................... 168, 174, 180
Baspinar Ulvi ............................. 27 Kociolek Marcin ......................... 146, 152
Bednarek Jakub .......................... 163 Konieczka Adam ........................ 250
Bednarek Michał ........................ 93, 158, 163 Kostek Bożena ........................... 104, 213
Beisland Christian ...................... 168, 174 Koszewski Damian ..................... 213
B cek e e khan .............. 50 Kotus J. ....................................... 64
Borowicz Adam ......................... 114 Krupa Krzysztof ......................... 59
Brachmański Stefan ................... 355, 361 Kryjak Tomasz ........................... 140
Brady Mary ................................ 146 Kubus Lukasz ............................. 191
Bykowski Adam ......................... 255 Kulawiak Marcin ........................ 22, 327
Canayaz Emre ............................ 50 Kupiń ki Sz mon ....................... 255
Cordeiro Hugo ........................... 315 Lazoryszczak Miroslaw .............. 239
Coskun Adem ............................. 130 Leniowska Lucyna ..................... 186
Cygert S. .................................... 98 Libal Urszula .............................. 229
Czyżewski Andrzej .................... 98, 208 Losnegård Are ............................ 168, 174
Dąbrowski Adam ............... 15, 134, 261, 268, 344 Łuczak Mateusz ......................... 268
Di Caterina Gaetano ................... 298 Łuczyński Michał ....................... 355
Długosz Rafał ............................. 134 Lukovenkova O.O. ..................... 235
Dobrucki Andrzej ....................... 350, 355 Lundervold Arvid .............. 168, 174, 280, 286, 292
Doshi Trushali ............................ 298 Magiera Wladyslaw .................... 229
Drgas Szymon ............................ 82 Maka Tomasz ............................. 239
Eikefjord Eli ............................ 280, 286, 292 Marapulets Yu.V. ....................... 235
Eminaga Yaprak ......................... 130 Marciniak Tomasz ...................... 261
Falkowski-Gilski Przemysław ... 217, 327 Magiera Wladyslaw .................... 229
Fallah Faezeh ............................. 37 Matei Basarab ............................. 108
Fiołka Jerzy ................................ 31, 274 Matei Radu ................................. 126
Gołębiewska Joanna ................... 104 Materka Andrzej ......................... 168, 174, 180
Grochowina Marcin ................... 59, 186 Matłacz Marcin .......................... 88
Grose Derek ............................... 298 Mazaheri Samaneh ..................... 43
Grzywalski Tomasz .................... 82 Meneses Carlos .......................... 315
Halvorsen Ole J. ......................... 168, 174 Michałowicz Ewelina ................. 250
Ikuta Akira ................................. 197, 304 Mięsikowska Marzena ................ 310
Janik Małgorzata Aneta ............. 70 Mokraoui Anissa ........................ 108
Janik Paweł ................................ 70 Moosavi-Tayebi Rohollah .......... 43
Janus Piotr .................................. 140 Muszelska Martyna .................... 280, 286
Jurek Jakub ................................. 168, 174 Noorbakhsh-Devlagh Hamid ...... 43
Kale Izzet ................................... 130 Nourmohammadi-Khiarak Jalil .. 43
Kidoń Zenon .............................. 274 Orimoto Hisako .......................... 197, 304

373
Ozen Sertan ................................ 27 Soraghan John ............................ 298
Parfieniuk Marek ........................ 76 Sozańska Anna ........................... 367
Park Sang Yoon ......................... 76 Sozański Krzysztof ..................... 367
Parzych Marianna ...................... 261 Stefańczyk Ludomir ................... 292
Pawlikowski Adam .................... 134 Strumiłło Paweł .......................... 14
Pawłowski Paweł ....................... 134, 268, 344 Strzelecki Michal ....................... 152, 286, 292
Peleszy Mateusz ......................... 174 Szczodrak Maciej ....................... 208
Petropoulakis Lykourgos ........... 298 Szwoch Grzegorz ....................... 16
Petrovsky Alexander .................. 120, 321 Szymajda Szymon ...................... 152
Petrovsky Nick A. ...................... 120 Thai Ba chien ............................. 108
Piaskowski Karol ....................... 163 Tristanov A.B. ............................ 235
Pielka Michał ............................. 70 Ulacha Grzegorz ......................... 203
Piltz Julia .................................... 104 Vashkevich Maxim .................... 321
Piniarski Karol ........................... 250, 344 Viarbitskaya Tatsiana ................. 350
Skulimowski Piotr ....................... 292 Vierhaus Heinrich Theodor ........ 13
Platonov Anatoliy ...................... 332 Vinhais Carlos ............................ 180
Poczeta Katarzyna ...................... 191 Walas Krzysztof ......................... 93, 158
Poterala Magdalena .................... 191 Walter Sven S. ............................ 37
Rafałko Janusz ........................... 245 Wernik Cezary ........................... 203
Reisæter Lars ........................... 168, 174 Wietrzykowski Jan ..................... 338
Romeny Bart M. ter Haar ........... 12 Wojciechowski Andrzej ............. 174
Rørvik Jarle ..................... 168, 174, 280, 286, 292 Wróbel Zygmunt ........................ 70
Rushkevich Yuliya ..................... 321 Yang Bin .................................... 37
Rybenkov Eugene V. ................. 120 Yastrebov Alexander .................. 191
Sabiniok Maciej ......................... 361 Zaitsev Ievgen ............................ 332
Sarwas Grzegorz ........................ 55, 88 Zaporowski Szymon ................... 104
Sayin Fatih Serdar ...................... 27 Zhao Baixiang ............................ 298
Skoneczny Sławomir .................. 55

374
