
INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
DEPARTMENT OF ELECTRICAL ENGINEERING

Project No. 8

Urban Sound Classification using Neural Networks

EEN-300 | B.Tech. Project | December 26, 2018

Name of the students:

S.no  Name                                   Enrollment no.  Contact no.   Email-id
1     Andiboyina Mourya Chakradhar Nagesh    14115016        9557658383    mouryanagesh96@gmail.com
2     Gugulothu Nikhil                       14115043        7060467034    naiknikhil12355@gmail.com
3     Manoj Kumar Meena                      14115071        9610266037    manojiit37@gmail.com

Project Supervisor’s details:

S.no  Name                      Designation                       Email-id
1     Jeevanand Seshadrinath    Assistant Professor, EED, IITR    jeevafee@iitr.ac.in

Abstract:
We are all exposed to many different kinds of sounds in our daily lives. In this
project we study several neural network architectures by using them to classify
urban sounds.

Objectives:
• Build neural networks that can classify different kinds of urban sounds.
• Compare the performance of each neural network on the classification task.

Introduction
Unlike speech and music signals, urban sounds are usually unstructured.
They include real-life noises generated by human activities, ranging from
transportation to leisure. Automatic urban sound classification can identify
the noise source, benefiting urban livability through applications such as noise
control, audio surveillance, soundscape assessment and acoustic environment planning.

Historically, compared with the large body of work on automatic speech
recognition and music information retrieval, studies of urban sound event
analysis are relatively limited. To identify a dominant sound event within a
chaotic mixture, most past studies focus on finding an efficient sound feature
that is representative of the target sound class, even when the sound is
partially masked by other sources. Following this approach, early work on
urban sound classification relied on manually engineered features that had been
widely used in audio and speech classification, such as Mel-frequency cepstral
coefficients (MFCCs), MPEG-7 audio low-level features, or other customised
temporal and spectral acoustic features. Combined with a general-purpose
classifier (such as an SVM, k-NN, GMM or random forest), these features are
then mapped to their corresponding sound classes. Recently, it has been
frequently reported that features automatically learned by a neural network are
more powerful than hand-crafted features on various classification problems. In
particular, in the field of image classification, supervised and unsupervised
feature learning with convolutional neural networks (ConvNets) has been actively
employed and has achieved unprecedented classification accuracy.

Dataset:
We need a labelled dataset that we can feed into a machine learning algorithm.
Fortunately, researchers have published an urban sound dataset. It contains 8,732
labelled sound clips (up to 4 seconds each) from ten classes: air conditioner,
car horn, children playing, dog bark, drilling, engine idling, gunshot,
jackhammer, siren, and street music. By default, the dataset is divided into 10
folds. The sound files are provided in .wav format; if you have files in another
format such as .mp3, it is better to convert them to .wav first, because .mp3 is
a lossy compression format.
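
A minimal conversion sketch, assuming the pydub package (which relies on ffmpeg)
and a hypothetical file name:

```python
from pydub import AudioSegment  # requires ffmpeg to decode .mp3 files

# 'car_horn.mp3' is a hypothetical input file used only for illustration.
clip = AudioSegment.from_mp3('car_horn.mp3')
clip.export('car_horn.wav', format='wav')  # write an uncompressed .wav copy
```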

Data Handling:

As with all unstructured data formats, audio data requires a couple of
preprocessing steps before it can be presented for analysis. The first step is to
load the data into a machine-understandable format. For this, we simply take
values at fixed time steps. For example, in a 2-second audio file we might
extract a value every half second. This is called sampling of the audio data, and
the rate at which it is sampled is called the sampling rate. Another way of
representing audio data is to convert it into a different domain, namely the
frequency domain. When we sample audio data in the time domain, we need many
more data points to represent the whole signal, and the sampling rate should be
as high as possible. If, on the other hand, we represent the audio data in the
frequency domain, much less computational space is required.
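
A minimal loading sketch, assuming the librosa and numpy libraries and a
hypothetical clip name 'dog_bark.wav'; it reads the clip as time-domain samples
and also computes a frequency-domain view with the short-time Fourier transform:

```python
import numpy as np
import librosa

# Load the clip; librosa resamples to 22,050 Hz by default.
# 'dog_bark.wav' is a hypothetical file name used only for illustration.
signal, sr = librosa.load('dog_bark.wav', sr=22050)
print(f'{len(signal)} samples at {sr} Hz ({len(signal) / sr:.2f} s of audio)')

# Frequency-domain representation via the short-time Fourier transform.
spectrum = np.abs(librosa.stft(signal))
print('STFT magnitude shape (frequency bins x frames):', spectrum.shape)
```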

Feature Extraction:
To extract useful features from the sound data, we use the Librosa library. It
provides several methods for extracting different features from the sound clips,
with a default sampling rate of 22,050 Hz. We use the methods listed below
(a short code sketch follows the list):

• melspectrogram: compute a Mel-scaled power spectrogram
• mfcc: compute Mel-frequency cepstral coefficients (MFCCs)
• chroma_stft: compute a chromagram from a waveform or power spectrogram
• spectral_contrast: compute spectral contrast
• tonnetz: compute the tonal centroid features (tonnetz)
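
One way to combine these Librosa calls for a single clip is sketched below; the
function names come from Librosa's feature module, while averaging each feature
over time into a single vector is our own assumption, following common practice:

```python
import numpy as np
import librosa

def extract_features(path):
    """Extract the five feature sets listed above and average each over time."""
    signal, sr = librosa.load(path)            # default sampling rate: 22,050 Hz
    stft = np.abs(librosa.stft(signal))

    mfccs = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(
        y=librosa.effects.harmonic(signal), sr=sr), axis=1)

    # Concatenate into one fixed-length feature vector per clip
    # (40 + 12 + 128 + 7 + 6 = 193 values).
    return np.hstack([mfccs, chroma, mel, contrast, tonnetz])
```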

Using the librosa package for feature extraction and the matplotlib package for
plotting, we can visually differentiate different kinds of sounds from images
such as their spectrograms.
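
A minimal plotting sketch, assuming matplotlib and librosa.display, with a
hypothetical clip name 'siren.wav':

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# 'siren.wav' is a hypothetical example clip.
signal, sr = librosa.load('siren.wav')

# Log-scaled Mel spectrogram: a compact, image-like view of the sound.
mel = librosa.feature.melspectrogram(y=signal, sr=sr)
librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                         sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()
plt.show()
```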

Classification using Multilayer Perceptron Model:

After extracting the feature data from the sounds using the librosa library, we
label the input data with a one-hot encoder. Once that is done, we randomly split
the data into training and test sets, with a little under 70% of the clips used
for training. We then trained a neural network initialised with random weights:
the first hidden layer has 280 units with the tanh activation function, its
output is fed forward to a second hidden layer of 300 units with the sigmoid
activation function, and the output layer uses the softmax function. We used a
gradient descent optimizer with a learning rate of 0.01 to minimise the
cross-entropy cost function, −Σ target · log(output). A plot of cost versus
iterations was used to monitor training.
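
The report does not state the framework used for training; the sketch below
reproduces the described network in Keras, taking the split fraction, layer
sizes, activations and learning rate from the text above, while the epoch count
and the 193-dimensional input (from the earlier feature-extraction sketch) are
assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# 'features' is assumed to be an (n_clips, 193) array built with extract_features(),
# and 'labels' an (n_clips,) array of integer class ids in the range 0-9.
labels_onehot = keras.utils.to_categorical(labels, num_classes=10)  # one-hot encoding
X_train, X_test, y_train, y_test = train_test_split(
    features, labels_onehot, train_size=0.7, random_state=42)

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(280, activation='tanh'),     # first hidden layer
    keras.layers.Dense(300, activation='sigmoid'),  # second hidden layer
    keras.layers.Dense(10, activation='softmax'),   # ten urban sound classes
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss='categorical_crossentropy',      # -sum(target * log(output))
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=50,
                    validation_data=(X_test, y_test))
```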

Classification using Convolutional Neural Network:


Convolutional neural networks (CNNs) are very effective for image classification
because they retain most of the relevant features while reducing the computation
required. In the case of images, the RGB matrices are fed into different channels
of the network. For sound data, the MFCC coefficients are fed into the CNN
instead: the coefficients are reshaped into a 128 x 128 matrix and fed as input
to convolutional layers with 24 channels.
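
A minimal Keras sketch of such a network; the 128 x 128 single-channel input and
the 24-filter first convolutional layer follow the description above, while the
kernel sizes, pooling, dropout and dense layer are our own assumptions:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(128, 128, 1)),                    # reshaped MFCC "image"
    keras.layers.Conv2D(24, kernel_size=(5, 5), activation='relu'),
    keras.layers.MaxPooling2D(pool_size=(4, 2)),
    keras.layers.Conv2D(48, kernel_size=(5, 5), activation='relu'),
    keras.layers.MaxPooling2D(pool_size=(4, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax'),        # ten urban sound classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```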

Figure: Architecture of the convolutional neural network.

Future Work and Goals:
Now that we are able to create a neural network, we will try to classify the
sounds using different kinds of neural networks and compare their performance.
We may also try to create a new neural network architecture by combining the MLP
and CNN architectures and analyse its performance.

Conclusion:
We successfully classified the data using the multilayer perceptron model and the
convolutional neural network. Due to the huge amount of data needed to train the
CNN, we could not achieve the same results with the MLP because of performance
issues.

References:
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore,
M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss and
K. Wilson, "CNN Architectures for Large-Scale Audio Classification," Google,
Inc., New York, NY, and Mountain View, CA, USA.

https://www.kaggle.com/pavansanagapati/urban-sound-classification

https://towardsdatascience.com/urban-sound-classification-part-1-99137c6335f9

