Singer and Instrument Recognition from Video Song Using Support Vector Machine
B. Srinivas, D. Sudheer Babu, K. Santosh Jhansi
Abstract: Singer and instrument identification from video song (SIIV) is an interesting research topic in the fast-growing music era and has a wide range of applications in the field of Human Computer Interaction (HCI). The objective of automatic singer and instrument recognition is to extract, characterize and recognize information about the singer and instrument identity. Initially the song is extracted from the video; then singer features such as Mel Frequency Cepstrum Coefficients (MFCC) and Mel Energy Spectrum Dynamic Coefficients (MEDC) are extracted from the audio file. A Support Vector Machine (SVM) is used as the classifier to classify the varieties of instruments used in the music, such as Harmonium, Sitar, Sarod, Tanpura, Santoor, Veena and Tabla. The SVM is trained and then used to classify the singer's voice and the instruments. The system gives 98.75% classification accuracy for the gender-independent case, 96.73% for male singers and 100% for female singers. It also gives 95.46% accuracy for instrument classification.

Index Terms: Feature extraction, feature selection, SVM, singer classification, instrument classification.

1. INTRODUCTION
Music has become part and parcel of the daily profile of life. Everybody wishes to connect to music, voluntarily or involuntarily, for various reasons: maintaining composure and relaxation through a mundane lifestyle, fun, passion, a travel companion, a stress buster and so on. The advent of technology has opened up the possibility of automating selection based on one's choice and judgment of the singer and the musical instruments used, which can be a huge favor in terms of analysis of tone, rhythm, beat and synchronization. The singer's voice contains numerous discriminative features which can be used as measurements to identify the singer, while feature selection and extraction help in recognizing the instruments, thereby making music lovable and understandable even to a stranger to the music world and hence improving the scope of music, listeners, careers, technique and passion. The flexibility of knowing the instruments used and the respective singer gives justified credit not only to the singer but also to the instruments used, thereby portraying their beauty and preserving their culture.

The history of music is without doubt the face of rhythms which tune a person's mind from beats to peace, making it the choice of every man. It has been scientifically shown that music, voice, or a combination of both have an intricate effect on the mind and hence on the body, due to which selected musical instruments and vocal songs are used in many medical, psychological, nervous and religious treatments. If keenly observed, one can notice that since music occupies such a place in psychological and various other forms of treatment, the choice of the musical instrument and the respective vocals varies with the purpose of the application, making it crucial to identify the effect of instruments and vocals. This kind of requirement further instigates the necessity of separating and identifying voice and instruments. In places where ancient history still lives, such as China and Japan, much of the heritage that is still practiced, one famous example being Kung Fu, has musical backgrounds intended to relax and rejuvenate the mind and body; the instruments used there are nowadays widely in use and under research, as they hold promising potential for many applications. Hence, considering that music is a part of the historic past, the present and a promising future, it is always a treat to contribute technical assistance to make it more experimental and reachable to the needs and desires of changing times. This aspect makes it challenging as well as competitive, and any contribution with a tint of innovation goes a long way in a vast field like music. This paper is an attempt to open a new door to enhance the use of music. The objective of automatic singer and instrument recognition is to extract, characterize and recognize information about the singer and instrument identity.

2. LITERATURE SURVEY
Singer and instrument recognition systems have been among the most researched areas because of their promising potential in various applications. In a singer and instrument recognition system the number of instruments is limited, while there are millions of singers. On the other hand, the frequency range of the human voice is very limited compared to that of instruments. Nonetheless, a single voice can produce a much greater variety of outcomes than a single instrument.

B. Srinivas is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. D. Sudheer Babu is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. K. Santosh Jhansi is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh.

Certain techniques describe methods of identifying a singer's voice from monophonic music that includes the sounds of various musical instruments, based on auditory features: SRC is used to classify the song, MFCC, MEDC and GTCC are used for feature extraction, and a GMM models the singer's voice; however, this long procedure merely improves the music retrieval system and has little potential to handle variations in music [1], [4]. Singer identification techniques were further improved by using cepstrum transformation techniques, but that involves reducing the interference of background music, limiting the analysis to the singer's voice alone [11], [13].

In the context of achieving the purpose of such a system with an SVM architecture under supervised machine learning, chosen for its very good classification ability, the objective of SVM learning is to find a classification function that approximates the unknown input on the strict basis of a set of measurements, each consisting of an input pattern and its corresponding target. To allow nonlinear classification functions, the training points are mapped from the input space to a feature space through a nonlinear mapping, and a simple linear hyperplane is used to separate the points in that feature space. Through certain efficient mathematical properties of these nonlinear mappings, SVMs avoid working explicitly in the feature space, thereby retaining the advantages of the linear approach despite the existence of nonlinear separating functions. When the vectors are not linearly separable, the algorithm maps them into a high-dimensional space, called the feature space, where they are almost linearly separable. This is typically achieved through a kernel function that computes the dot product of the images of two examples in the feature space. The success and preference of SVMs is due to good theoretical results guaranteeing that the hypothesis obtained from the training data minimizes a bound on the error on future test data. The resulting classifier assigns class labels to testing instances for which the values of the predictor features are known but the class label is unknown. Machine learning has several applications, the most significant quality of which is the ability to establish relationships between multiple features, easing the process of finding solutions to certain problems; it can often be successfully applied to such classification problems, improving the efficiency of systems and the design of machines. Every instance in a dataset used by a machine learning algorithm is represented by a set of features, which may be continuous, categorical or binary. If instances are given with known labels and testing is confined to the level of the trained set, the learning is called supervised learning. These properties give a promising scope to the use of SVMs in singer and voice identification systems.
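For reference, a standard textbook formulation of the kernel-based SVM decision function discussed above (not taken from this paper) can be written as follows, where x_i are the training patterns, y_i in {-1, +1} their labels, alpha_i the learned multipliers, b the bias, and gamma the RBF kernel width (the "g" reported later in Tables 1 and 2):

f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b \right), \qquad K(x_i, x) = \exp\left( -\gamma \, \lVert x_i - x \rVert^{2} \right)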

Different dimensions of music retrieval have been explored. Raga identification of Carnatic music was achieved by analyzing the input polyphonic music signal and passing it through a signal-separation algorithm to separate the instrument and the vocal signal. After extracting the vocal signal, segmentation is performed using a segmentation algorithm, followed by determining the singer in order to know the singer's fundamental frequency. The frequency components of the signal are then determined and mapped into the swara sequence, thereby determining the raga of the particular song; however, this approach is limited to Carnatic music only.

As progress was made, a technique was developed to reduce the computational load and time of feature extraction: Zernike moments, which act as shape descriptors of the speech signal, are extracted and used as features. To extract Zernike moments, the one-dimensional audio signal is converted into a two-dimensional image file. Various feature selection and ranking algorithms, such as the t-test and Chi-square, whose performance is evaluated through the accuracy of an SVM classifier, are then used to select the important features of the speech signal; this improves accuracy but takes a step back in combined classification performance.

This was followed by a clustering technique for better classification, which clusters songs in a music collection that are of the same singer or of singers with similar voices; a number of models are trained for different types of singers' voices in the database [4], [6]. Songs in the music collection are then matched with these models and assigned to the corresponding clusters, and a new cluster and its model are built when a song does not match well with any of the existing models. To retrieve songs that are similar to a query song in terms of the singer's voice, a model is first built for the singer's voice in the query song. Next, testing samples are extracted from songs in the database and matched with the voice model. The songs may then be ranked according to the likelihood values, and the songs at the top of the ranking list are considered the most similar to the query song in terms of the singer's voice.

Vibrato is a kind of tremulous effect imparted to a vocal or instrumental tone for added expressiveness, carrying only a subtle variation in pitch; it corresponds to a periodic fluctuation of the fundamental frequency, and every singer develops a vibrato that personalizes his or her singing style. While exploring the acoustic features that reflect vibrato information in order to identify singers of popular music, one attempt started with an enhanced vocal detection method that selects vocal segments with high confidence. From the selected vocal segments, cepstral coefficients that reflect the vibrato characteristics are computed; the coefficients are derived using band-pass filters, such as parabolic and cascaded band-pass filters, after which they are spread according to the octave frequency scale. The intent of this classifier formulation was to make use of high-level musical knowledge of song structure in singer modeling. Singer identification was validated on a database containing a set of popular songs from commercially available CD recordings of a given set of singers [14]. This approach is reported to achieve a good accuracy rate, with an average error rate of only 16.2% in segment-level identification.

Another approach, focused on extracting melodies from music, and vocal melodies from polyphonic audio in particular, works by short-term processing.

A timbral distance between each pitch contour and the space of the human voice is measured so that any vocal pitch contour can be isolated and preserved in its uniqueness. Computation of the timbral distance is based on an acoustic-phonetic parametrization of human voiced sound. Long-term processing organizes the short-term procedures so that the relatively reliable melody segments are determined first. When this experiment was tested on vocal excerpts from the ADC 2004 dataset, the system achieved an overall transcription accuracy of 77%.

Considering that the above attempts are a drop in the ocean of efforts to explore various identification and classification techniques in support of dynamic everyday applications, one can observe that a long road has been traveled toward automatic musical instrument recognition systems. Different techniques have been used with promising scope, achieving good results. Most systems have worked on independent pieces of isolated notes, mostly borrowed from a single common source, with the restriction that the notes are confined to a very small pitch range. Polyphonic recognition has also been a point of research, with certain thorough attempts, although the number of instruments involved is very limited. The studies using isolated tones and monophonic phrases are the most relevant to our scope. This paper describes a method of identifying the singer's voice from monophonic music that includes the sounds of various musical instruments, based on auditory features, thereby obtaining speech and musical instrument information individually and efficiently.

3. PROPOSED SYSTEM

Speaking of the vocal part of the system, the structure of the vocal tract varies from person to person, and hence the voice information available in the speech signal can be used to identify the speaker, making the voice a biometric identity of the speaker. One of the major advantages of using the voice for identification is authentication. The system involves two phases: training and testing. Training is the process of loading and familiarizing the system with the voice characteristics of the speakers who are part of the respective cluster involved in the task. Testing is the actual phase in which an unidentified input is given to the system for identification or recognition, where the quality and applicability of the concept is tested. The proposed singer and instrument identification from video song (SIIV) system therefore proceeds in two phases, training and testing. In the training phase, SIIV takes a video as input, which is a mixture of video, vocals and musical accompaniment. The training phase is illustrated in Figure 1, where the system is trained with the voice of the singer and the instruments. First, the audio and video are separated. Once the separation is successful, the audio is considered and the required features are extracted from it using MFCC, which is capable of speaker, or in this case singer, identification. MFCC involves several stages such as pre-processing, framing, windowing, FFT and mel filter-bank analysis, which together perform the feature extraction. Feature extraction is one of the primary aids to classification and identification.

Fig. 1. Block diagram of the training phase: the input video song is separated into audio and video, features are extracted from the audio, and the singer/instrument references are created or updated.

In the processing stage, the sampled audio signals are converted into a set of feature vectors which characterize the properties of the speaker, such as the pitch or speech frequency. The process of audio and video separation is mandatorily performed in the training as well as the testing phase and can hence be termed signal processing or front-end processing. A reduction of the feature data is achieved by modeling the distributions of the feature vectors, which are finally updated in the training database. The same process is simultaneously carried out to capture the musical instruments involved in the song by dividing the continuous song into its respective parts based on frequency or other factors.
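The following is a minimal sketch of the training phase described above, assuming the audio has already been separated from the video into WAV clips, that the melcepst routine documented in Section 3.1 (from the VOICEBOX toolbox, an assumption about the source of that routine) is on the MATLAB path, and that fitcecoc/templateSVM from the Statistics and Machine Learning Toolbox stand in for the SVM classifier. The file names, label vector and parameter values are illustrative only, not the authors' exact code.

% Hypothetical training sketch (illustrative, not the authors' implementation).
trainFiles  = {'singer1_clip.wav', 'singer2_clip.wav', 'sitar_clip.wav'};  % assumed clip names
trainLabels = [1; 2; 3];                                                   % assumed class labels

trainFeat = [];
for k = 1:numel(trainFiles)
    [s, fs] = audioread(trainFiles{k});           % read the separated audio clip
    s = mean(s, 2);                               % mix stereo down to mono
    c = melcepst(s, fs, 'e0dD');                  % MFCCs + log energy + 0th coef + deltas
    trainFeat(k, :) = [mean(c, 1), std(c, 0, 1)]; % one fixed-length vector per clip
end

% RBF-kernel SVM with a one-vs-one multi-class wrapper; the box constraint and
% kernel width play the role of the C and g values reported in Tables 1 and 2.
svmTemplate = templateSVM('KernelFunction', 'rbf', 'BoxConstraint', 8, 'KernelScale', 'auto');
model = fitcecoc(trainFeat, trainLabels, 'Learners', svmTemplate);
save('siiv_model.mat', 'model');                  % reference database for the testing phase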

Fig. 2. Block diagram of the testing phase: the test video song is separated into audio and video, features are extracted from the audio and compared against the existing references in a similarity assessment, and the resulting score is passed through the decision logic to produce the final decision.

Fig. 2 shows the testing phase of the system, in which the system verifies the singer's voice and the instrumental match. In both the training and testing phases, the input video song is separated into three components: video, vocal and instrumental segments. The combination of voice and instruments that was observed and learned by the system in the training phase is now put to the test of application, whose primary requirement is to distinguish between the voice and the instruments so that they can be validated against the trained voices and instruments. The vocal and the instrumental (non-vocal) parts are segregated from the song in order to locate and separate the voice and the instruments according to their regions.

The first step in the testing phase is similar to that of the training phase: signal processing is performed, after which the extracted audio features and the detected instrumentals are provided as input to the next step. Similarity assessment is performed after feature extraction, using the reference voices and instruments from the training phase. When a match is found, the match is scored and sent for further processing by the decision logic. The decision logic leads to a concluded decision as to which voice and instrument among the trained set were used, since the system uses an SVM architecture with supervised learning, where the output is a member of the trained input. Once the decision is made, the segregated instruments and voice are identified. Hence only an efficient combination of the training and testing phases can give an effective output, since the similarity assessment in the testing phase depends on the compatibility of the trained input with the extracted features and the feature labeling that follows feature extraction.

For an effective result, the system is first trained with a set of the necessary instrumentals, vocals and their combinations, in line with the supervised learning mechanism of the SVM technique. The video input is then given to the trained system, which separates the audio and the video; a further important distinction, the extraction of the vocal and instrumental parts from the audio, is achieved through feature extraction followed by vocal/instrumental separation. The extracted features are then labeled based on various factors ranging from pitch to other biometric and instrumental facets. The process continues by applying the SVM technique under supervised learning, where the classification of the vocals and instrumentals is performed with reference to the trained set, and the segregated audio and video are obtained from the similarity assessment and decision logic, giving the separated vocals and instruments as output. However, it must be kept in mind that the separation of audio and video and the feature extraction are carried out in both the training and testing phases, due to which the compatibility factor is of high importance for the system's success.
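A minimal sketch of the testing phase follows, under the same assumptions as the training sketch above (melcepst on the path, the model trained and saved earlier, and an illustrative test file name):

% Hypothetical testing sketch (illustrative, not the authors' implementation).
load('siiv_model.mat', 'model');                  % references created in the training phase

[s, fs] = audioread('test_clip.wav');             % audio already separated from the test video
s = mean(s, 2);                                   % mono
c = melcepst(s, fs, 'e0dD');                      % same front-end processing as in training
testFeat = [mean(c, 1), std(c, 0, 1)];            % same fixed-length summary vector

% Similarity assessment and decision logic: the per-class scores act as the
% match score, and the predicted label is the final decision.
[decision, negLoss] = predict(model, testFeat);
fprintf('Predicted class: %d (score %.3f)\n', decision, max(negLoss));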

The three modules in this paper are:
1. Extraction of features from different audio songs using MFCC.
2. Classification of the music and song using the SVM classifier.
3. Determining the singer and the instruments.

3.1 Feature Extraction

In the singer and instrument identification system, features are extracted using Mel-frequency cepstral coefficients (MFCCs). Figure 1 illustrates the overall system. It is believed that prosodic features (relating to the rhythmic aspect of language), such as rhythm, the beat and the differing lengths of continuous sounds in a song, are primary indicators of the singer, and that fundamental frequency, energy and formant frequencies with their amplitudes are potentially effective parameters for distinguishing certain vocal states. In this study, five groups of short-term features were extracted, relating to the fundamental frequency (F0), energy, the first four formant frequencies (F1 to F4) and two Mel Frequency Cepstrum Coefficients (MFCC1, MFCC2). For wave-file extraction it is important to determine the amplitude and pitch, since the waveform is scaled to the range [-1, +1] and sub-bands are taken from the song. In this paper five instruments were used, and after songs of both genders were observed, the instruments were categorized with class numbers (0 to 4).

MELCEPST calculates the mel cepstrum of a signal:
  C = melcepst(S, FS, W, NC, P, N, INC, FL, FH)
Simple use:
  C = melcepst(s, fs)          % mel cepstrum with 12 coefficients, 256-sample frames
  C = melcepst(s, fs, 'e0dD')  % also include log energy, the 0th cepstral coefficient,
                               % delta and delta-delta coefficients
Inputs:
  S   : singing signal
  FS  : sample rate in Hz (default 11025)
  W   : any sensible combination of the following flags:
        R  rectangular window in time domain
        N  Hanning window in time domain
        M  Hamming window in time domain (default)
        t  triangular shaped filters in mel domain (default)
        n  Hanning shaped filters in mel domain
        m  Hamming shaped filters in mel domain
        p  filters act in the power domain
        a  filters act in the absolute magnitude domain (default)
        0  include the 0th order cepstral coefficient
        e  include log energy
        d  include delta coefficients (dc/dt)
        D  include delta-delta coefficients (d^2c/dt^2)
        z  highest and lowest filters taper down to zero (default)
        y  lowest filter remains at 1 down to 0 frequency and highest filter remains
           at 1 up to the Nyquist frequency (if 'ty' or 'ny' is specified, the total
           power in the FFT is preserved)
  NC  : number of cepstral coefficients excluding the 0th coefficient (default 12)
  P   : number of filters in the filterbank (default floor(3*log(fs)))
  N   : length of the frame (default: power of 2 less than 30 ms)
  INC : frame increment (default n/2)
  FL  : low end of the lowest filter as a fraction of fs (default 0)
  FH  : high end of the highest filter as a fraction of fs (default 0.5)
Outputs:
  C : mel cepstrum output, one frame per row. Log energy, if requested, is the first
      element of each row, followed by the delta and then the delta-delta coefficients.
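As an illustration of the short-term features listed above, the following sketch computes the frame energy, a simple autocorrelation-based F0 estimate and the first two mel cepstral coefficients; it is only an assumed realisation of the described features (the formants F1 to F4 would additionally require, for example, LPC analysis) and is not the authors' code. The clip name is hypothetical.

% Hypothetical short-term feature sketch (illustrative only).
[s, fs] = audioread('song_clip.wav');             % assumed clip name
s = mean(s, 2);
s = s / max(abs(s));                              % scale the waveform to [-1, +1]

frameLen = round(0.025 * fs);                     % 25 ms analysis frame
frame = s(1:frameLen) .* hamming(frameLen);       % windowed first frame

energy = sum(frame .^ 2);                         % short-term energy

% Crude F0 estimate from the autocorrelation peak in the 80-400 Hz range.
r = xcorr(frame);
r = r(frameLen:end);                              % keep non-negative lags; r(1) is lag 0
lags = round(fs/400):round(fs/80);                % candidate pitch lags
[~, idx] = max(r(lags + 1));
F0 = fs / lags(idx);

% First two mel cepstral coefficients of the whole clip (VOICEBOX melcepst).
c = melcepst(s, fs);
mfcc12 = mean(c(:, 1:2), 1);                      % MFCC1 and MFCC2, averaged over frames

fprintf('energy=%.4f  F0=%.1f Hz  MFCC1=%.3f  MFCC2=%.3f\n', energy, F0, mfcc12);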

4. RESULTS
In our proposed system, various samples have been collected, such as Harmonium, Santoor, Veena, Sitar, Tabla and Tanpura. First we converted the audio samples to wave-file format, extracted features from the collected samples and trained our system thoroughly. Fig. 3 shows the amplitude of the collected samples. We collected different video songs and extracted audio segments of 25 to 35 seconds from them. Fig. 4 shows the amplitude of a test sample for a song.

TABLE 1. INSTRUMENT IDENTIFICATION WITH ACCURACY

Instrument Name    Accuracy (%)   C    g
Harmonium          98.54          8    0.5013
Santoor            97.56          8    0.7124
Saraswati Veena    97.49          8    0.4195
Sitar              99.55          8    0.8731
Tabla              98.12          8    0.7483
Tanpura            98.34          8    0.7546
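The per-class C and g values in Tables 1 and 2 are characteristic of an RBF-kernel parameter search. The sketch below shows one way such values could be obtained with cross-validated fitcsvm, mapping g to MATLAB's KernelScale as 1/sqrt(g); this is an assumption about the tuning procedure, not a description of the authors' actual method, and the variables feat and isTarget are placeholders.

% Hypothetical grid search over C and g for one binary (one-vs-rest) problem.
% Assumes feat (N x D feature matrix) and isTarget (N x 1 logical labels) exist.
Cgrid = 2 .^ (-1:2:15);                           % e.g. 0.5 ... 32768
Ggrid = 2 .^ (-5:2:3);                            % e.g. 0.03125 ... 8
bestErr = inf;
for C = Cgrid
    for g = Ggrid
        mdl = fitcsvm(feat, isTarget, ...
            'KernelFunction', 'rbf', ...
            'BoxConstraint', C, ...
            'KernelScale', 1/sqrt(g), ...         % so exp(-g*||x-z||^2) matches the tables
            'KFold', 5);                          % 5-fold cross-validation
        err = kfoldLoss(mdl);
        if err < bestErr
            bestErr = err; bestC = C; bestG = g;
        end
    end
end
fprintf('best C = %g, best g = %g, CV accuracy = %.2f%%\n', bestC, bestG, 100*(1-bestErr));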

We assigned a class to each instrument during the training phase and tested our system with songs. Fig. 5 shows the instruments identified based on these classes.

Fig. 5. Instrument detection based on the match.
A typical support vector machine achieves multi-class classification by combining binary classifiers [17], [18]. The training set contains multiple classes against which the test set is matched, and at times it is possible to detect multiple classes. Table 2 shows singer identification from the test data with good accuracy.

TABLE 2. SINGER IDENTIFICATION WITH ACCURACY

Singer #   Accuracy (%)   C       Gamma
1          99.05          32768   2
2          98.02          32768   8
3          98.35          32768   2
4          98.34          32768   8
5          97.45          32768   8
6          98.32          32768   8

Fig. 3. The collected instrument samples (Harmonium, Santoor, Saraswati Veena, Sitar, Tabla, Tanpura) and their respective amplitudes.

Fig. 4. Amplitude of a test song sample.

We have carried out large-scale experiments at both the instrument and the singer level. Table 1 shows the instrument identification results obtained when we tested the system with the collected samples; it demonstrates instrument detection with good accuracy.

In a song it is possible that there is more than one singer. Fig. 6 shows singer identification from audio extracted from a video song: classes were assigned to it, the system was trained and then tested with the singers' samples, and the singers were identified even where they appear in an interleaved fashion.

Fig. 6. Singer identification based on the match.

5. CONCLUSION
The singer and instrument identification system was applied to various video songs. First we separated the audio from the video song and extracted features using MFCC and MEDC; we then trained our system with the classes, tested it with samples and observed the results from the SVM. We obtained accuracies of up to 99.55% for instrument identification and 99.05% for singer identification.

References

[1] W. H. Tsai and H. M. Wang, "Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals," IEEE Trans. Speech Audio Process., vol. 14, no. 1, pp. 333-341, Jan. 2006.
[2] A. L. Berenzweig, D. P. W. Ellis, and S. Lawrence, "Using voice segments to improve artist classification of music," in Proc. AES 22nd Int. Conf. Virtual, Synthetic and Entertainment Audio, 2002.
[3] G. J. Brown and D. L. Wang, "Separation of speech by computational auditory scene analysis," in Speech Enhancement, J. Benesty, S. Makino, and J. Chen, Eds. New York: Springer, 2005, pp. 371-402.
[4] W. Chou and L. Gu, "Robust singing detection in speech/music discriminator design," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2001, pp. 865-868.
[5] P. Divenyi, Ed., Speech Separation by Humans and Machines. Norwell, MA: Kluwer Academic, 2005.
[6] M. Goto, "A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Commun., vol. 43, no. 4, pp. 311-329, 2004.
[7] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee, "Classification of general audio data for content-based retrieval," Pattern Recognition Lett., vol. 22, pp. 533-544, 2002.
[8] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 684-697, May 1999.
[9] B. Whitman, G. Flake, and S. Lawrence, "Artist detection in music with Minnowmatch," in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2001.
[10] A. Berenzweig, D. P. W. Ellis, and S. Lawrence, "Using voice segments to improve artist classification of music," in Proc. Int. Conf. Virtual, Synthetic, and Entertainment Audio, 2002.
[11] T. Zhang, "Automatic singer identification," in Proc. IEEE Int. Conf. Multimedia Expo, 2003.
[12] W. H. Tsai, D. Rodgers, and H. M. Wang, "Blind clustering of popular music recordings based on singer voice characteristics," Computer Music Journal, vol. 28, no. 3, pp. 68-78, 2004.
[13] N. C. Maddage, C. Xu, and Y. Wang, "Singer identification based on vocal and instrumental models," in Proc. Int. Conf. Pattern Recognition, 2004.
[14] J. Shen, B. Cui, J. Shepherd, and K. L. Tan, "Towards efficient automated singer identification in large music databases," in Proc. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2006.
[15] W. H. Tsai, S. J. Liao, and C. Lai, "Automatic identification of simultaneous singers in duet recordings," in Proc. Int. Conf. Music Information Retrieval, 2008.
[16] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, 1998.
[17] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 209-215, Jan. 2003.
[18] P. Ding, Z. Chen, Y. Liu, and B. Xu, "Asymmetrical support vector machines and applications in speech processing," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., vol. 1, May 2002, pp. I-73-I-76.

B. Srinivas received the M.Tech. in Computer Science and Engineering in 2008 from Acharya Nagarjuna University. He has two and a half years of industry experience and four years of teaching experience. He is currently employed as an Assistant Professor in the CSE department, MVGR College of Engineering.

D. Sudheer Babu received the B.Tech. in Computer Science and Engineering.

K. Santosh Jhansi received the B.Tech. in Computer Science and Engineering and the M.Tech. in Software Engineering. She is currently working as an Assistant Professor at MVGR College of Engineering and has four years of teaching experience.
