
Voice Activation Using Speaker Recognition for Controlling Humanoid Robot

1st Dyah Ayu Anggreini Tuasikal, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia, anggreiniayu@students.itb.ac.id

2nd Hanif Fakhrurroja, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia, hani002@lipi.go.id

3rd Carmadi Machbub, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia, carmadi@lskk.ee.itb.ac.id

Abstract—Voice activation and speaker recognition are needed in many applications today. Speaker recognition is the process of automatically recognizing who is speaking based on the voice signal, and it is generally required in systems that involve security and privacy. One application, presented in this paper, is activation and security in controlling humanoid robots. The voice recording process uses Kinect 2.0. The first step in the recognition process is feature extraction; this paper uses Mel Frequency Cepstrum Coefficients (MFCC) for feature extraction and Dynamic Time Warping (DTW) as the feature matching technique. The test was performed by 5 different speakers, with 2 types of phrases ("aktifkan", which means activate, and "hello slim"), and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different phrases has an average accuracy of 91.5%. At the next difficulty level, testing the recording distance, accuracy decreased from 97.5% to 85% to 65%.

Keywords— Speaker Recognition, Dynamic Time Warping (DTW), Mel Frequency Cepstrum Coefficient (MFCC), Kinect 2.0, Humanoid Robot, Bioloid GP.

I. INTRODUCTION

Speaker recognition is the process of automatically recognizing who is speaking based on the individual information contained in the voice signal. Speaker recognition allows a speaker's voice to be used to verify their identity and control access [1]. Speaker recognition has two important functions, identification and verification. Speaker identification is the process of determining a speaker's identity by comparing the speaker's voice features with the features of every speaker in the database, while speaker verification is the process of accepting or rejecting a claimed identity that is already known, based on data previously entered in the database [2]. The two main modules in speaker recognition are feature extraction and feature matching. The first step is feature extraction using Mel Frequency Cepstrum Coefficients (MFCC). In the feature matching step, the most popular method, dynamic time warping (DTW), is used to measure the similarity between two time series that may vary in timing and speed [3].

Speaker recognition research has been carried out with several methods, one aspect of which is the feature extraction technique used. The most commonly used methods for voice feature extraction are Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Modified Mel Frequency Cepstral Coefficients (MMFCC), Bark Frequency Cepstrum Coefficients (BFCC), and Revised Perceptual Linear Prediction (RPLP). In a study conducted by [4] comparing these five methods, MFCC achieved 99.87% accuracy for speaker recognition. Several studies using the Dynamic Time Warping (DTW) method have also shown that combining MFCC and DTW works well for text-dependent speaker verification [5], that using the two algorithms together can improve voice recognition performance [6], and that MFCC with DTW achieved a 92% success rate for a 0.25 threshold [7]. This paper shows that the MFCC and DTW methods, which have been widely used by previous researchers for speaker verification, can work well together. Speaker recognition is implemented for voice activation of the Bioloid GP robot so that it can receive voice commands: the Dynamic Time Warping (DTW) method is used for the speaker recognition process, and the MFCC method for voice feature extraction. The recording process uses Kinect 2.0, whose captured audio has noise resistance [8]. We test the accuracy of the system using two different spoken phrases and different recording distances between the sensor and the speaker.

II. THEORY

A. Feature Extraction – Mel Frequency Cepstrum Coefficient (MFCC)

MFCC is a popular feature extraction technique for voice signals. The main purpose of MFCC is to imitate the perception of human hearing, which does not perceive frequencies above 1 kHz linearly. MFCC is based on the variation of the human ear's critical bandwidth with frequency: it uses two types of filter spacing, linear at low frequencies below 1000 Hz and logarithmic above 1000 Hz [9]. The block diagram in Fig. 1 summarizes the processes involved in MFCC.


Fig. 1. Block Diagram of the MFCC Process


1) Pre-emphasis Filtering
This filter maintains the high frequencies of the spectrum, which are generally reduced during sound production. The purpose of pre-emphasis filtering is to reduce the noise ratio of the signal, thus improving signal quality and balancing the spectrum of voiced sound.

Y[n] = X[n] − aX[n − 1]    (1)

where Y[n] is the pre-emphasized signal, X[n] is the signal before pre-emphasis, and a is a constant with 0.9 ≤ a ≤ 1.0 [5].
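As an illustration, a minimal sketch of this filter in Python (assuming NumPy; the coefficient value 0.97 is an assumption within the stated 0.9–1.0 range):

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Equation (1): y[n] = x[n] - a*x[n-1]."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])
```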

2) Frame Blocking

The frame blocking function divides the signal into multiple frames. Voice signals must be processed in short segments (short frames) because they change continuously due to the articulation shifts of the vocal cords. For signal processing, a frame length between 10 and 30 ms is commonly used [10]. The windowing process parameters relate to the width of the window, the distance between windows, and the shape of the window, which determine the frame size (M) and frame shift (N). The frame blocking process is illustrated in Fig. 2.

Fig. 2. Frame Blocking Process
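The following is a minimal frame blocking sketch in Python; `frame_blocking` is a hypothetical helper, and the frame size and shift (in samples) are chosen by the caller, e.g. the 2048-sample frames used in Section III:

```python
import numpy as np

def frame_blocking(x, frame_size, frame_shift):
    """Split signal x into overlapping frames of frame_size samples, frame_shift apart."""
    num_frames = max(1, int(np.ceil((len(x) - frame_size) / frame_shift)) + 1)
    # Zero-pad the tail so the last frame is complete.
    pad_len = max(0, (num_frames - 1) * frame_shift + frame_size - len(x))
    padded = np.concatenate([np.asarray(x, dtype=float), np.zeros(pad_len)])
    return np.stack([padded[i * frame_shift : i * frame_shift + frame_size]
                     for i in range(num_frames)])
```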
3) Windowing

The next process is windowing, whose purpose is to reduce the signal discontinuities introduced by the frame blocking process at the beginning and end of each frame.

The window is defined as W(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame. The windowing process can be calculated as

Y_n = X_n × W_n    (2)

where Y_n is the windowed signal for sample n, X_n is the value of sample n, and W_n is the window value. The type of window used is the Hamming window [6]. The Hamming window equation is

W_n = 0.54 − 0.46 cos(2πn / (M − 1))    (3)

where n = 0, 1, …, M − 1 and M is the frame length.

Fig. 3. Hamming Window
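A sketch of the windowing step in Python, applying equations (2) and (3) to the frames produced by frame blocking:

```python
import numpy as np

def apply_hamming(frames):
    """Multiply each frame by the Hamming window of equation (3)."""
    M = frames.shape[1]                                # frame length
    n = np.arange(M)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (M - 1))  # equation (3)
    return frames * w                                  # equation (2): Y_n = X_n * W_n
```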


4) Fast Fourier Transform (FFT)
The FFT converts the sampled sound signal in each frame (N samples) from the time domain to the frequency domain. The signal in the frame is assumed to be periodic when the FFT is applied to it. The FFT is the fast algorithm for implementing the DFT [11]. The FFT equation is

X_n = Σ_{k=0}^{N−1} X_k e^{−2πjkn/N},  n = 0, 1, 2, …, N − 1    (4)

where X_k is the aperiodic input sequence and N is the number of samples.
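A sketch of this step in Python; taking the power of each frequency bin after the FFT is a common follow-up assumed here, not spelled out in equation (4):

```python
import numpy as np

def power_spectrum(frames, nfft=2048):
    """Convert windowed frames to the frequency domain, as in equation (4)."""
    spectrum = np.fft.rfft(frames, n=nfft)     # FFT of each frame (real input)
    return (np.abs(spectrum) ** 2) / nfft      # power of each frequency bin
```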

5) Mel Filterbank
The human ear is not equally sensitive to all frequency bands because of its distinctive shape; it becomes less sensitive at frequencies above approximately 1000 Hz. The mel filterbank is therefore used to account for this. The mel filterbank graph is shown in Fig. 4. The mel filterbank equation is

F(Mel) = 2595 log10(1 + f / 700)    (5)

where f is the frequency in Hz.
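Below is a sketch of equation (5) together with one common triangular-filter construction; the triangle shapes are an assumption (the paper only shows the filterbank graph), while the 20 filters, 2048-point FFT, and 48 kHz sampling rate come from Section III:

```python
import numpy as np

def hz_to_mel(f):
    """Equation (5): frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of equation (5)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters=20, nfft=2048, fs=48000):
    """Triangular filters spaced evenly on the mel scale up to fs/2."""
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        # Rising and falling edges of each triangular filter.
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank
```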

Fig. 4. Mel Filterbank (filter magnitude versus frequency in Hz)

6) Discrete Cosine Transform (DCT)
The final step of the MFCC feature extraction process is the DCT, which produces the desired feature vector. This step takes only the cosine part of the complex exponential of the Fourier transform applied to the discrete signal function.

F(k) = Σ_{n=0}^{N_f − 1} f(n) · cos(πnk / (2N))    (6)

where F(k) is the discrete cosine signal function and f(n) the discrete signal function. The DCT results are purely real, without imaginary parts, which simplifies the calculation; with the DCT process, the magnitude value is the magnitude of the DCT result itself, regardless of phase [6].
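A sketch of this final step in Python, matching the reconstructed equation (6); taking the log of the filterbank energies first and keeping 13 coefficients are assumptions borrowed from common MFCC practice, not stated in the paper:

```python
import numpy as np

def cepstral_coefficients(log_mel_energies, num_ceps=13):
    """Equation (6): DCT of the (log) mel filterbank outputs."""
    N = log_mel_energies.shape[-1]
    n = np.arange(N)
    # One cosine basis vector per cepstral coefficient k.
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), n) / (2.0 * N))
    return log_mel_energies @ basis.T
```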

B. Dynamic Time Warping (DTW)

Dynamic Time Warping utilizes a dynamic programming technique that is quite popular in speech signal processing. The method is used to calculate the distance between two time series. The basic principle is to allow a range of 'steps' in the space of (time frames in the sample, time frames in the template) and to find the path through that space that best matches the two time series. It can be used both to determine the similarity between two time series and to find corresponding regions between them. A common difficulty in speaker recognition is that recordings differ in duration even when the spoken word or phrase is the same; this method is needed to overcome that [12]. The advantage of the method is that it can calculate the distance between two vectors of different lengths [3]. How well the template matches the sample sound is determined by the total 'similarity cost' (the result of pattern-matching the two voices): the total similarity cost obtained with this algorithm indicates how much the sample and template have in common, and the best-matching template is then selected. The DTW distance between two vectors is calculated along the optimal warping path between them. Matching with the DTW method is illustrated in Fig. 5 [13].


Fig. 5. Illustration of Matching Two Time Series with the DTW Method

The technique used in DTW is dynamic programming. The DTW distance is calculated with the following equations. Given two data series Q and C, of lengths m and n respectively, with

Q = q_1, q_2, …, q_m    (7)

C = c_1, c_2, …, c_n    (8)

to obtain the similarity of the two series using the DTW method, an m × n matrix is formed whose element (i, j) contains the distance value

d(q_i, c_j) = (q_i − c_j)²    (9)

Next, the warping path, i.e., the path with the lowest cost, is determined. The criteria for determining warping paths are as follows [6].
1) Boundary Condition


So that the processed data runs from the beginning to the end, the warping path must start at the starting point and finish at the end point of the data sets.
2) Monotonic Condition
To maintain the time-ordering of the series, the path must move forward in time, avoiding loops.
3) Continuity Condition
The path must not jump to distant data points.
After the warping path is obtained, the DTW matrix is built by calculating the accumulated distance with the following equation.

D(i, j) = d(i, j) + min{ D(i − 1, j), D(i − 1, j − 1), D(i, j − 1) }    (10)
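A minimal sketch of equations (9) and (10) in Python; the per-frame distance here generalizes equation (9) to feature vectors (squared Euclidean distance), an assumption that fits MFCC frames:

```python
import numpy as np

def dtw_distance(Q, C):
    """Accumulated DTW cost between feature sequences Q (m x d) and C (n x d)."""
    m, n = len(Q), len(C)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.sum((Q[i - 1] - C[j - 1]) ** 2)    # equation (9)
            D[i, j] = d + min(D[i - 1, j],            # equation (10)
                              D[i - 1, j - 1],
                              D[i, j - 1])
    return D[m, n]
```

The boundary condition is enforced by anchoring D[0, 0] and reading the result at D[m, n]; the monotonic and continuity conditions are enforced by the three allowed predecessor cells in equation (10).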

III. SYSTEM DESIGN AND IMPLEMENTATION

Speaker recognition is used to activate the Bioloid GP robot so that, in the next stage, it can receive voice commands in real time. The speaker recognition system configuration is shown in Fig. 6.


Fig. 6. Speaker Recognition System Configuration

The Bioloid GP robot becomes active only with its owner's voice. In the early stages, a voice recording process was conducted with five different speakers, three women and two men, to test the accuracy of the system. The voice data to be processed comes from the recorded audio of these 5 speakers. The Bioloid GP robot owner and the 4 other speakers each recorded training data 20 times, and were tested sequentially with the one-word command "aktifkan" (meaning activate), with the two-word command "hello slim", and at recording distances of 0.5 m, 2 m, and 4 m. The recording process uses the Kinect 2.0 4-microphone array with a 24-bit analog-to-digital converter (ADC) at a 48 kHz sampling frequency, processed through Visual Studio 2017, with a duration of 2 seconds for every utterance. The sound is then feature-extracted using MFCC.

At the MFCC stage, the recorded sound is divided into frames of up to 2048 samples each. The processed sound is first converted into the frequency domain using the FFT before passing through the filter stage. At this stage, 20 mel filterbanks are used for the warping, and for the next stage the cepstrum is formed by converting back to the time domain with the DCT. The feature extraction results are then used in the matching process with the DTW method, which is performed according to the flow chart in Fig. 7.
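Under the stated parameters, the whole extraction chain could be tied together as in the hypothetical sketch below, reusing the helper functions sketched in Section II; the 50% frame shift (1024 samples) is an assumption, as the paper does not report the frame overlap:

```python
import numpy as np

def mfcc(signal, fs=48000, frame_size=2048, frame_shift=1024, num_filters=20):
    """End-to-end MFCC sketch: pre-emphasis -> framing -> Hamming -> FFT -> mel -> DCT."""
    x = pre_emphasis(np.asarray(signal, dtype=float))
    frames = apply_hamming(frame_blocking(x, frame_size, frame_shift))
    power = power_spectrum(frames, nfft=frame_size)
    energies = power @ mel_filterbank(num_filters, frame_size, fs).T
    # Small epsilon keeps the log finite for silent frames.
    return cepstral_coefficients(np.log(energies + 1e-10))
```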


Fig. 7. Flow Chart of the Dynamic Time Warping Method

A. Serial Communication Design

Serial communication in this research is used to connect the microcontroller with other devices in an embedded system. The serial port pins on the microcontroller are RxD and TxD: RxD receives data from a computer or other equipment, while TxD sends data to a computer or other equipment. Communication between the PC and the Arduino uses full-duplex serial communication with 2 data lines, 1 transmit (TX pin) and 1 receive (RX pin); however, Dynamixel motors require only 1 data line to communicate. To connect the Arduino Mega to the Dynamixel motors, another interface is required: the IC 74LS241N acts as a serial data multiplexer, so one communication line can be used to communicate with more than one Dynamixel.

B. Design of Bioloid GP Robot Movement

The movement of the Bioloid GP robot, with 18 DOF, was designed using the Robo Plus application, which provides several features for creating the desired movements. The responses to be implemented on the Bioloid GP robot are shown in Table I.

TABLE I. LIST OF IMPLEMENTED COMMANDS

Response              | Robot Movement
Speaker Verified      | Robot stands up and raises its hands to the right and left
Speaker Not Verified  | Robot stands and does not move

The robot movement design can be previewed through the simulated movements programmed in the Robo Plus application. The simulation results for each speaker recognition response are shown in Fig. 8.


Fig. 8. Robot Position for Speaker Recognition Response

After designing the robot movement in simulation, it is implemented directly on the robot using the servo angle positions from Robo Plus, which are then embedded in the Arduino Mega. The Bioloid GP robot consists of Dynamixel AX-12A and AX-18A motors, which are driven by the Arduino Mega.

IV. EXPERIMENTAL RESULTS

This experiment is done by extracting the voice characteristics of a spoken word using MFCC feature extraction. After the features to be compared are obtained, they are processed by DTW to verify the speaker. The speaker matching process is performed by storing the feature extraction of a single speaker as a reference for comparison with that reference speaker and the other four speakers. Speaker recognition with DTW produces a similarity cost used to distinguish one speaker's voice from the other four. The system verifies the speaker if the compared sound matches the previously stored sound; otherwise, if the sound differs from the reference, the system refuses it as not verified. The testing process of the speaker recognition implementation on the Bioloid GP robot is shown in Fig. 9, and the Bioloid GP robot then responds according to Table I. The results of the robot movement implementation are shown in Fig. 10.
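The accept/reject decision can be sketched as a simple threshold on the DTW similarity cost (using the dtw_distance sketch from Section II); the explicit threshold comparison and its value are assumptions, since the paper does not report the decision rule in detail:

```python
def verify_speaker(test_features, owner_template, threshold):
    """Accept the speaker if the DTW cost to the stored owner template is low enough."""
    cost = dtw_distance(test_features, owner_template)
    return cost <= threshold   # True: verified, robot activates; False: not verified
```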



Fig. 9. Testing process of speaker recognition implementation on Bioloid GP robot


Fig. 10. Implementation of robot motion response during speaker verification

The five speakers differ in age, gender, and accent in pronouncing the words, as shown in Figure 11 and Figure 12.


Fig. 11. Speech Signal from “Aktifkan” Pronunciation



Fig. 12. Speech Signal from “Hello Slim” Pronunciation

The speaker recognition test was conducted with 5 speakers, each performing 40 pronunciation tests of the single word "aktifkan" and the phrase "hello slim" to test the accuracy of the system, for a total of 200 tests. Experimental results for the five speakers are shown in Fig. 13.


Fig. 13. Level of Accuracy from Single Word and Two Words Data Testing

As the results show, speaker recognition can be used to control robots through voice activation: the robot becomes active when commanded by speaker 1 and stays inactive when commanded by the other four speakers, as intended. From the graph in Fig. 13, the word "aktifkan" is recognized better than the phrase "hello slim". Accuracy was 97.5% for speaker 1, 95% for speaker 2, 100% for speaker 3, 80% for speaker 4, and 85% for speaker 5, with an average accuracy of 91.5%. According to the experimental results, the average accuracy rate is 93% for the single word and 90% for the two words. Test results for varying the distance between Kinect 2.0 and the speaker are shown in Fig. 14; in this test, speaker 1 spoke 20 times for each of the 2 phrases at all three recording distances, for a total of 120 tests.


Fig. 14. Level of Accuracy from Recording Distance Data Testing

As Fig. 14 shows, the recording distance parameter affects the speaker recognition results. A 0.5 m recording distance between the Kinect and the speaker gives an accuracy of 97.5%; accuracy decreases to 85% at a distance of 2 m, and to 65% at a distance of 4 m, giving an average accuracy of 82.5% across the recording distance tests. The accuracy of speaker recognition decreases as the recording distance increases: a larger distance reduces the amplitude of the captured signal, which makes the feature extraction process less accurate and can interfere with speaker verification in the DTW method.

V. CONCLUSION

Voice activation using speaker recognition to control the Bioloid GP robot with the MFCC and DTW methods can be implemented well on humanoid robots. The test was performed by 5 different speakers, with 2 types of phrases ("aktifkan" and "hello slim"), and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different phrases has an average accuracy of 91.5%. For the recording distance tests, accuracy decreased from 97.5% to 85% to 65%, because the increased distance between the sensor and the speaker affects the amplitude of the captured signal. The values of the MFCC parameters used affect the success rate of matching by DTW. The experimental results show that speaker recognition to control the Bioloid GP robot can be achieved with DTW, and that the number of words spoken and the recording distance affect recognition accuracy.


ACKNOWLEDGMENT

This work was supported by Program of Post Graduate Team Research 2018 from The Ministry of Research, Technology and Higher Education, Republic of Indonesia.

REFERENCES

[1] A. R. G, "Real Time Speaker Recognition Using MFCC and VQ," National Institute of Technology, Rourkela, 2008.
[2] M. Limkar, "Speaker Recognition using VQ and DTW," Int. Conf. Adv. Commun. Comput. Technol., pp. 18–20, 2012.
[3] D. Vashisht, S. Sharma, and L. Dogra, "Design of MFCC and DTW for Robust Speaker Recognition," Int. J. Electr. Electron. Eng., vol. 2, no. 3, pp. 12–17, 2015.
[4] M. G. Sumithra and A. K. Devika, "A study on feature extraction techniques for text independent speaker identification," 2012 Int. Conf. Comput. Commun. Informatics, pp. 1–5, 2012.
[5] K. B. Joshi and V. V. Patil, "Text-dependent Speaker Recognition and Verification using Mel Frequency Cepstral Coefficient and Dynamic Time Warping," Int. J. Electron. Commun. Technol., vol. 7109, pp. 150–154, 2015.
[6] L. Muda, M. Begam, and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," J. Comput., Mar. 2010.
[7] S. Verma, T. Gulati, and R. Lamba, "Recognizing Voice for Numerics Using MFCC and DTW," Int. J. Appl. Innov. Eng. Manag., vol. 2, no. 5, pp. 127–130, 2013.
[8] M. H. Tambunan, Martin, H. Fakhruroja, and C. Machbub, "Indonesian Speech Recognition Grammar Using Kinect 2.0 for Controlling Humanoid Robot," Int. Conf. Signals Syst., pp. 59–63, 2018.
[9] A. Bala, "Voice command recognition system based on MFCC and DTW," Int. J. Eng. Sci. Technol., Dec. 2010.
[10] R. Hasan, M. Jamil, G. Rabbani, and S. Rahman, "Speaker Identification Using Mel Frequency Cepstral Coefficients," 3rd Int. Conf. Electr. Comput. Eng. (ICECE 2004), pp. 28–30, Dec. 2004.
[11] D. Handaya, H. Fakhruroja, E. M. I. Hidayat, and C. Machbub, "Comparison of Indonesian speaker recognition using Vector Quantization and Hidden Markov Model for unclear pronunciation problem," in 2016 6th Int. Conf. on System Engineering and Technology (ICSET), 2016, pp. 39–45.
[12] B. Priya and S. Kaur, "Comparative Study of Male and Female Voices Using MFCC and DTW Algorithm," Int. J. Adv. Res. Electron. Commun. Eng., vol. 3, no. 8, pp. 2–5, 2014.
[13] A. Mueen and E. Keogh, "Extracting Optimal Performance from Dynamic Time Warping," Int. Conf. Knowl. Discov. Data Min., pp. 2129–2130, 2016.