Nilesh M. Patil
Research Scholar, Pacific Academy of Higher Education and Research, Udaipur;
Assistant Professor, Fr. CRCE, Mumbai, India
Email: [email protected]

Milind U. Nemade
Professor, Electronics Engineering, K J Somaiya Institute of Engineering and Information Technology, Sion, Mumbai, India.

Email: [email protected]
The volume of audio data on public networks such as the Internet is increasing tremendously every day, making it harder to access that data and creating a need for efficient indexing and annotation mechanisms. Non-stationarities and discontinuities present in audio signals increase the difficulty of segmentation and classification. Another challenging task is the extraction and selection of optimal features from the audio signal. Application areas of audio classification and retrieval systems include speaker recognition, gender classification, music genre classification, environment sound classification, etc. This paper proposes a machine learning approach based on neural networks which performs audio pre-processing, segmentation, feature extraction, classification and retrieval of audio signals from a dataset. We found that the FPNN classifier gives better accuracy, F1-score and Kappa coefficient values compared to the SVM, k-NN and PNN classifiers.

Keywords- SVM, k-NN, PNN, FPNN, Sensitivity, Specificity, Accuracy, Recall, Precision, Jaccard Coefficient, Dice Coefficient, Kappa Coefficient
INTRODUCTION
In this paper, an approach for audio pre-processing using an averaging filter, followed by segmentation, feature extraction, neural network-based classification and retrieval of audio signals, is proposed. In audio signal processing applications, segmentation and classification play a vital role. Audio segmentation 1 is an essential pre-processing step used in various applications like audio archive management, surveillance, medical applications, the entertainment industry, etc. However, the traditional segmentation techniques are quite simple: the decoder-based approach only places boundaries at silence locations, while the metric-based approach sets its threshold value empirically. Hence, there is a need to perform segmentation in a more efficient manner. The feature extraction 2 process gives meaningful information about the audio signal and helps in classifying those audio signals. Feature extraction can be done in the time domain, frequency domain and coefficient domain. But with the increase in the number of audio signals in the dataset, the computational complexity of traditional feature extraction approaches also increases. Audio classification is the popular approach used to index the audio signals in a dataset. Traditional audio classification techniques consider only a single similarity measure without considering the perceptual similarity of audio signals.

In our proposed work, we have utilized an averaging filter to filter the audio signals and reduce Gaussian noise. A pitch extraction method is used for segmentation of the filtered audio signal; in this approach, we compute the short-time zero-crossing rate (ZCR), short-time energy (STE), spectral flux and spectral skewness. The segmentation process divides the audio signal into voiced and unvoiced segments. In the time domain we extract features like root mean square (RMS), ZCR and silence ratio. In the frequency domain, features like bandwidth, spectrogram, spectral centroid and pitch are extracted. In the coefficient domain we consider mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). A multi-label and multi-level classification method is used to classify the audio signal as a music signal, speech signal or environment sound. In the proposed system, the Fuzzy Probabilistic Neural Network (FPNN) provides better classification accuracy compared to the Support Vector Machine (SVM), k-Nearest Neighbour (k-NN) and Probabilistic Neural Network (PNN).

Muthumari Arumugam and Mala Kaliappan 3 used a PNN classifier to classify the audio signals from the GTZAN dataset and achieved an accuracy of 96.2%. Trisiladevi C. Nagavi, Anusha S.B. and Poornima S.P. 4 achieved an accuracy of 85% with a processing time of 2-3 minutes using a Sort-Merge technique for classification based on acoustic features of the audio signal. R. Christopher Praveen Kumar, S. Suguna and J. Becky Alfreda 5 highlighted weighted MFCC (WMFCC) on the GTZAN dataset and obtained a precision of 96.40% with better recall values. Muhammad M. Al-Maathidi and Francis F. Li 6 achieved an accuracy of 78.1% using MFCC and a supervised neural network to classify audio signals into music, speech and other noise. Srinivasa Murthy Y. and Koolagudi S.G. 7 used MFCC and an artificial neural network (ANN) to classify Telugu audio clips into vocal regions with an accuracy of 85.89% and non-vocal regions with an accuracy of 88.52%. Xueyuan Zhang, Zhousheng Su, et al. 8 applied a spectral decomposition technique on the Digital Juice Sound FX Library and BBC Sound Effects Library to achieve an accuracy of 84.1%. Haque M.A. and Kim J.M. 9 performed classification using a correlation intensive fuzzy c-means (CIFCM) algorithm and an SVM classifier on compressed audio signals, achieving an accuracy of 89.53%. Geiger J.T., Schuller B. and Rigoll G. 10 performed classification on park, restaurant and tube station audio signals using an SVM classifier and achieved an accuracy of 73%. Dhanalakshmi P., Palanivel S. and Ramalingam V. 11 used an auto-associative neural network model (AANN) and a Gaussian Mixture Model (GMM) for classification of environment sounds, machine noise, etc., achieving an accuracy of 93.1% for AANN and 92.9% for GMM. Matthew Riley, Eric Heinen and Joydeep Ghosh 12 performed vector quantization on a dataset of 4000 songs and achieved an accuracy of 90%. Dong-Chul Park 13 used a Centroid Neural Network (CNN) for classification of 2663 audio signals, achieving an accuracy of 75.62%.
Saadia Zahid, Fawad Hussain 14 applied bagged SVM and ANN on GTZAN dataset to obtain 98% classification accuracy. Poonam Mahana and Gurbhej Singh 15 performed classification of audio signals using k-NN and SVM with accuracy of 74.6% and 90% respectively. G. Tzanetakis, P. Cook 16 classified music signals into 10 genres using GMM and k-NN with 61% accuracy. Riccardo Miotto, Gert Lanckriet 17 used Dirichlet Mixture Model (DMM) and SVM to classify audio signals from CAL500 dataset and achieved precision of 0.475, recall 0.235 and f-score 0.285. Mohammad A. Haque and Jong-Myon Kim 18 applied fuzzy c-means algorithm on GTZAN dataset for classifying music signals with an accuracy of 84.17%. Shweta Vijay Dharbade, P.S. Deshpande 19 used local discriminant bases to classify audio signals as artificial sounds, natural sounds, instrumental music and speech with an accuracy of 95%, 93%, 97% and 95% respectively. Babu Kaji Baniya, Deepak Ghimire, et al 20 used timbral texture, rhythmic content features, MFCC, and extreme learning machine (ELM) on GTZAN dataset and achieved accuracy of 85.15%. Kesavan Namboothiri and Anju L. 21 applied dynamic time warping (DTW) method with SVM classifier to classify audio signals from MARSYAS web to obtain an accuracy of 96.2%. Feng Rong 22 used SVM with Gaussian Kernel and obtained a classification accuracy of 87.6% for general sounds and 86.3% for audio scenes. Gursimran Kaur and Neha Mohan 23 used MFCC feature vectors to classify music signal using SVM and back-propagation neural network (BPNN) classifiers achieving an accuracy of 83% and 93% respectively. Toni Hirvonen 24 used MFCC features with zero-phase component analysis to classify speech and music signals obtaining an accuracy of 95%. Malay Singh, et al 25 applied fuzzy logic and knowledge-base filtering on audio signals from CMUSphinx4 library with classification accuracy of 70%.

In this section, we explain the proposed system in detail. The block diagram of the proposed system is shown in Figure 1 below. For creating a dataset of input audio signals, we have taken into consideration the GTZAN dataset from Marsyasweb 26, which consists of 1000 music signals spanning 10 different genres and 64 speech signals. Each audio track in the GTZAN dataset is a 16-bit, 30-second, 22050 Hz mono file in .au format. For environment sound classification we have taken 200 audio clips in 10 different classes, each 5 seconds long in .wav format, from the ESC-50 dataset 27.

Figure 1 Block Diagram of Proposed System
A moving average filter is used for filtering the audio signal. It is a simple low-pass finite impulse response (FIR) filter which smooths an array of sampled data. It takes M samples of input at a time, averages those M samples, and produces a single output point 28. In our proposed system we take 16 samples at a time for averaging in the sliding window. The averaging filter normalizes the amplitude of the audio signal and reduces the Gaussian noise present. Figures 2 and 3 below show the plot of an input audio signal from the dataset and the signal after filtering.
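As an illustration, the 16-point moving average can be sketched in a few lines of NumPy. The window length default and the use of `np.convolve` with `mode='same'` are our choices for this sketch, not implementation details taken from the paper:

```python
import numpy as np

def moving_average(signal, M=16):
    """Smooth a sampled signal with an M-point moving average (a simple FIR low-pass)."""
    window = np.ones(M) / M                          # M equal weights summing to 1
    return np.convolve(signal, window, mode='same')  # keep the input length

# A constant level plus Gaussian noise is smoothed back toward the constant
rng = np.random.default_rng(0)
noisy = 1.0 + 0.1 * rng.standard_normal(1000)
smooth = moving_average(noisy)
```

Averaging M samples reduces the variance of white noise by roughly a factor of M, which is why the filtered plot looks smoother than the input.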

Figure 2 Input Audio Signal

Figure 3 Filtered Audio Signal
The filtered signal is given to the segmentation process, which divides it into homogeneous voiced and unvoiced segments. For segmentation, we use a pitch extraction method wherein ZCR and STE serve as time-domain features and spectral flux and spectral skewness as frequency-domain features. The zero-crossing count indicates the frequency at which the energy is concentrated in the signal spectrum. Voiced speech is generated by excitation of the vocal tract by the periodic flow of air at the glottis and usually shows a low zero-crossing count, whereas unvoiced speech is produced by a constriction of the vocal tract narrow enough to cause turbulent airflow, which results in noise and shows a high zero-crossing count. The energy of speech is another parameter for classifying the voiced/unvoiced parts: the voiced part of speech has high energy because of its periodicity, while the unvoiced part has low energy.
Zero Crossing Rate
Zero-crossing rate is a measure of the number of times, in a given time interval or frame, that the amplitude of the audio signal passes through a value of zero 29. Figure 4 below shows the zero crossings in the filtered audio signal. A reasonable generalization is that if the zero-crossing rate is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech signal is voiced. The zero-crossing rate is calculated as follows:
Z_n = Σ_{m=−∞}^{∞} |sgn(y(m)) − sgn(y(m−1))| · w(n−m)   (1)

sgn(y(m)) = 1, if y(m) ≥ 0; −1, if y(m) < 0   (2)

w(n) = 1/(2N), if 0 ≤ n ≤ N−1; 0, otherwise   (3)
and y(m) is the time domain signal for frame m.
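A minimal NumPy sketch of the frame-level ZCR computation follows. The 8 kHz sampling rate and the two test tones are illustrative choices, not values from the paper:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Average number of sign changes per sample in a frame,
    treating a zero sample as positive as in Eq. (2)."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))

t = np.arange(0, 1, 1 / 8000)            # 1 s at 8 kHz
low = np.sin(2 * np.pi * 100 * t)        # low-frequency, "voiced-like" tone
high = np.sin(2 * np.pi * 1800 * t)      # high-frequency, "unvoiced-like" tone
```

As expected, the high-frequency tone yields a much larger ZCR than the low-frequency one, mirroring the unvoiced/voiced distinction described above.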

Short Time Energy
The amplitude of the audio signal varies with respect to time. Generally, the amplitude of unvoiced segments is lower than that of voiced segments 29. The energy of the signal gives a representation that reflects these amplitude variations. Short-time energy can be defined as:

E_n = Σ_{m=−∞}^{∞} [y(m) · w(n−m)]²   (4)
The choice of the window determines the nature of the short-time energy representation. In our proposed system, we used a rectangular window. Figure 5 below shows the short-time energy of the filtered audio signal.
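A corresponding sketch for short-time energy with a rectangular window follows; the frame length of 256 samples is an illustrative choice:

```python
import numpy as np

def short_time_energy(signal, frame_len=256):
    """Sum of squared samples per non-overlapping frame; a rectangular
    window simply truncates, so no explicit window multiply is needed."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

t = np.arange(0, 1, 1 / 8000)
loud = 1.0 * np.sin(2 * np.pi * 200 * t)     # high-amplitude, voiced-like
quiet = 0.1 * np.sin(2 * np.pi * 200 * t)    # low-amplitude, unvoiced-like
```

The high-amplitude signal produces a much larger per-frame energy, which is the basis of the voiced/unvoiced decision above.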

Figure 4 Zero Crossings in Filtered Signal    Figure 5 Short-time Energy of Filtered Signal
Spectral Flux
The average variation of the spectrum between two adjacent frames in a given clip is called spectral flux (SF). The spectral flux of a speech signal is higher than that of a music signal due to the alternation of voiced and unvoiced sounds at the syllable rate 30. Spectral flux is a good feature for discriminating among speech, environment sound and music: the spectral flux of environmental sounds is among the highest and changes more dramatically than that of speech or music. The spectral flux is calculated as follows:
SF = 1/((N−1)(K−1)) · Σ_{n=1}^{N−1} Σ_{k=1}^{K−1} [log(A(n,k) + δ) − log(A(n−1,k) + δ)]²   (5)

where A(n,k) is the discrete Fourier transform of the nth frame of the input signal:

A(n,k) = | Σ_{m=−∞}^{∞} x(m) · w(nL − m) · e^{−j(2π/K)km} |   (6)

Here x(m) is the original audio data, w(m) is the window function, L is the window length, K is the order of the Discrete Fourier Transform (DFT), N is the total number of frames, and δ is a very small constant that avoids taking the logarithm of zero. Figure 6 below shows the plot of the spectral flux of the pre-processed audio signal.
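The following sketch computes a log-magnitude spectral flux over non-overlapping frames. The frame length, the small constant, and the test signals are our assumptions for illustration:

```python
import numpy as np

def spectral_flux(signal, frame_len=512, delta=1e-10):
    """Mean squared difference of log spectral magnitudes between
    adjacent frames; delta guards against log(0)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_mags = np.log(np.abs(np.fft.rfft(frames, axis=1)) + delta)
    return np.mean(np.diff(log_mags, axis=0) ** 2)

rng = np.random.default_rng(1)
steady = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stable spectrum
noise = rng.standard_normal(8000)                          # rapidly changing spectrum
```

The noise signal, whose spectrum changes from frame to frame, shows a much higher flux than the steady tone, in line with the ordering of environment sound, speech and music described above.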

Figure 6 Spectral Flux
Spectral Skewness
Skewness is based on both pitch and energy. It measures the average amount of high-frequency content present in a signal; high frequencies are obtained based on the pitch of the tone. A high skewness indicates more energy in the higher and lower parts of the spectrum 31.
Figure 7 below shows the voiced segment of the filtered audio input signal.

Figure 7 Voiced Segment of Filtered Audio signal
The next step is feature extraction. The features are extracted in three different domains, namely the time domain, frequency domain and coefficient domain. In the time domain we estimate the root mean square (RMS), zero-crossing rate, silence ratio, and the pitch saliency ratio using autocorrelation.
Root Mean Square
The RMS represents the square root of the average power of the audio signal over a given period of time. It is calculated as follows:
RMS_j = sqrt( (1/N) Σ_{m=1}^{N} x_j²(m) )   (7)
where x_j(m), for m = 1, 2, ..., N, is the jth frame of the windowed audio signal and N is the number of samples in each frame.

Silence Ratio
It is the ratio of the number of silent frames (determined by a preset threshold) to the total number of frames. Here, a frame is said to be silent if its RMS is less than 10% of the mean RMS of the file. It is calculated as follows:
SR = Number of silence frames / Total number of frames   (8)
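A sketch of frame RMS and the resulting silence ratio, using the 10%-of-mean-RMS rule above; the frame length of 256 samples is an illustrative choice:

```python
import numpy as np

def frame_rms(signal, frame_len=256):
    """RMS value of each non-overlapping frame (Eq. 7)."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def silence_ratio(signal, frame_len=256, threshold=0.10):
    """Fraction of frames whose RMS falls below threshold * mean RMS (Eq. 8)."""
    rms = frame_rms(signal, frame_len)
    return np.mean(rms < threshold * np.mean(rms))

# First half: a tone; second half: silence -> silence ratio close to 0.5
sig = np.concatenate([np.sin(2 * np.pi * 200 * np.arange(4096) / 8000),
                      np.zeros(4096)])
```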
In the frequency domain we estimate features like bandwidth, spectrogram, frequency centroid, spectral centroid and pitch.
Bandwidth
Bandwidth represents the frequency range up to which the signal holds information. The bandwidth of a signal is often equal to, but never higher than, half the sampling rate. Figure 8 below shows the bandwidth plot of the pre-processed audio signal. It is calculated using the formula:
B_j = sqrt( ∫₀^{ω₀} (ω − ω_c)² |X_j(ω)|² dω / ∫₀^{ω₀} |X_j(ω)|² dω )   (9)

where X_j(ω) is the spectrum of the jth frame, ω₀ is half the sampling frequency and ω_c is the frequency centroid.

Figure 8 Bandwidth of the pre-processed Signal
The spectrogram splits the signal into overlapping segments, windows each segment with a Hamming window and forms the output from their zero-padded, N-point discrete Fourier transforms. It is a three-dimensional representation: the x-axis represents timing information, the y-axis shows the frequency components present in the speech signal, and the darkness indicates the energy present at that frequency. The dark bands in the spectrogram represent the resonances of the vocal tract system for the given sound unit. These resonances, also called formant frequencies, represent the high-energy portions in the frequency spectrum of a speech signal. The shape of the dark bands indicates how the vocal tract shape changes from one sound unit to the next. Figure 9 below shows the spectrogram of the pre-processed audio signal.

Figure 9 Spectrogram of audio signal
Frequency Centroid
The balancing point of the frequency spectrum is represented by the frequency centroid. It relates to the brightness of the signal and is calculated using the formula:

ω_c = ∫₀^{ω₀} ω |X_j(ω)|² dω / ∫₀^{ω₀} |X_j(ω)|² dω   (10)
Spectral Centroid
The centroid deals with the sound sharpness, i.e., the high-frequency components of the spectrum. The spectral centroid is calculated using the formula:

SC = Σ_k f_k |X(k)| / Σ_k |X(k)|   (11)
where f_k is the frequency at bin k and |X(k)| is the spectral magnitude at that bin.
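A sketch of the spectral centroid of a single frame; the sampling rate and the two test tones are illustrative:

```python
import numpy as np

def spectral_centroid(frame, sr=8000):
    """Magnitude-weighted mean frequency of a frame's spectrum."""
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    return np.sum(freqs * mags) / np.sum(mags)

t = np.arange(1024) / 8000
dull = np.sin(2 * np.pi * 200 * t)     # energy near 200 Hz
bright = np.sin(2 * np.pi * 3000 * t)  # energy near 3 kHz: a "sharper" sound
```

The tone with its energy at high frequency has the larger centroid, matching the sharpness interpretation above.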

Pitch
Pitch refers to the fundamental period of a human speech waveform. We compute the pitch by finding the time lag with the largest autocorrelation energy.

Salience of Pitch
The ratio of the first pitch peak value to the zero-lag value of the autocorrelation is called the pitch saliency ratio. It is defined by the ratio Φ_j(P)/Φ_j(0), where

Φ_j(P) = Σ_{m=−∞}^{∞} x_j(m) x_j(m−P),   Φ_j(0) = Σ_{m=−∞}^{∞} x_j²(m)   (12)
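A sketch of autocorrelation-based pitch and pitch saliency; the lag search range (corresponding to 50-500 Hz) and the test tone are illustrative assumptions:

```python
import numpy as np

def pitch_and_saliency(frame, sr=8000, fmin=50, fmax=500):
    """Pitch from the lag of the largest autocorrelation peak, and the
    saliency ratio phi(P)/phi(0) as in Eq. (12)."""
    phi = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # lags 0..N-1
    lo, hi = sr // fmax, sr // fmin       # lag range for fmax..fmin
    lag = lo + np.argmax(phi[lo:hi])
    return sr / lag, phi[lag] / phi[0]

frame = np.sin(2 * np.pi * 200 * np.arange(2048) / 8000)  # 200 Hz voiced-like frame
pitch, saliency = pitch_and_saliency(frame)
```

For this strongly periodic frame the saliency is close to 1; for noise-like unvoiced frames it would be much lower, which is what makes it a useful voiced/unvoiced feature.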
In the coefficient domain, we have estimated the mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC).

MFCCs are the coefficients obtained in the mel-frequency cepstrum, where the frequency bands are equally spaced on the mel scale. They are computed from the FFT power coefficients. We adopt the first 12 orders of coefficients, out of which the first three are used in building the fuzzy inference system (FIS) for classification. Figure 10 shows the plot of MFCC coefficients before liftering. The relation between frequency and the mel scale is expressed as follows:

Mel(f) = 2595 · log₁₀(1 + f/700)   (13)
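The standard mel relation Mel(f) = 2595·log10(1 + f/700) and its inverse can be written directly:

```python
import numpy as np

def hz_to_mel(f):
    """Mel-scale value for a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

A well-known property of this curve is that 1000 Hz maps to roughly 1000 mel, with the spacing growing logarithmically above that; this is what makes the mel filterbank denser at low frequencies.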

Figure 10 MFCC Coefficients before Liftering Figure 11 LPC Coefficients of Voiced Segment
The LPC coefficients are a short-time measure of the speech signal. They describe the signal as the output of an all-pole filter. The plot of the LPC coefficients of a voiced segment is shown in figure 11 above.
Figure 12 below summarizes the values of the features extracted for classification.

Figure 12 Features Extracted
The next step in the proposed system is the classification process. Here, we classify the audio signals into three main classes, viz. music, speech and environment sound. Further, the music signals are classified into 10 genres: Blues, Classical, Country, Disco, HipHop, Jazz, Metal, Pop, Reggae and Rock. Speech signals are further categorized by gender (male or female), and the age of the speaker is also estimated 32. Environment sounds are likewise categorized into 10 classes: Chirping Birds, Church Bell Ring, Clapping, Crying Babies, Door-wood-knock, Frog, Glass Breaking, Keyboard Typing, Train and Water Drops. For classification, we designed a FIS taking into consideration features like the mean of the spectrogram, the mean and standard deviation of the pitch saliency ratio, the mean of pitch, the mean of zero crossings, and the means of the 1st and 3rd MFCC coefficients for the fuzzy logic rules. We considered four different classifiers for classification of audio signals: SVM, k-NN, PNN and FPNN.

SVM is a supervised machine learning algorithm that classifies data by finding the linear decision boundary (hyperplane) separating the datapoints of one class from those of the other class, as shown in figure 13 below. Classification consists of two steps: training and testing. In the training phase, the SVM receives the extracted audio features as input and constructs a hyperplane; in the testing phase, this hyperplane is used for classification. We adopted the linear kernel function of the SVM. In figure 13 below, the labels 0 and 1 indicate the two classes.

Figure 13 SVM Hyperplane Figure 14 k-NN Classification
k-NN categorizes objects based on the classes of their nearest neighbours in the dataset. We use the Euclidean distance to find the nearest neighbours and take k = 3. Figure 14 above shows the k-NN classification.
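A minimal NumPy sketch of this 3-nearest-neighbour rule with Euclidean distance; the toy 2-D feature vectors below are illustrative, not real audio features:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Majority vote among the k Euclidean-nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Toy feature vectors: class 0 clustered near the origin, class 1 near (1, 1)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
```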
PNN is a feed-forward neural network widely used in classification and pattern recognition problems. A PNN is an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multi-layered feed-forward network with four layers: input layer, pattern layer, summation layer and output layer, as shown in figure 15 below. Figure 16 shows how the PNN divides the input space into three classes.

Figure 15 Architecture of PNN Classifier    Figure 16 PNN Classification into 3 Classes
In the PNN algorithm, a Parzen window with a non-parametric kernel function is used to approximate the probability density function (PDF) of each class. Using the PDF of each class, we estimate the class probabilities of a new input, and Bayes' rule is then employed to allocate to the new input the class with the highest posterior probability. By this method, the probability of misclassification is minimized. The FPNN is a combination of probabilistic and generalized regression neural networks, and is capable of classifying data on the basis of a fuzzy decision on the membership of a particular observation to a certain class. The Fuzzy Probabilistic Neural Network (FPNN), shown in figure 16, is a four-layered structure consisting of an input layer, a prototype/pattern layer, a summation layer and an output layer. The input layer receives an n-dimensional vector x(k) for classification; it does not perform any computation and simply distributes the input to the neurons in the next layer. The first hidden layer, called the prototype layer, contains as many neurons as there are training samples, with gbell activation functions; their synaptic weights in the input-to-prototype connections are determined by the components of the training patterns. The second hidden layer, called the summation layer, consists of m+1 elementary summing nodes, with each of the first m nodes representing an individual class; all these m nodes receive the outputs of the prototype layer.
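The Parzen-window density estimate at the heart of the PNN can be sketched as follows. Gaussian kernels, equal priors and the smoothing width sigma are our illustrative assumptions; the paper's FPNN additionally uses gbell membership functions, which are not reproduced here:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=0.3):
    """Estimate each class PDF with one Gaussian kernel per training
    sample (pattern layer), average them (summation layer), and pick
    the class with the highest density (output layer, equal priors)."""
    best_class, best_score = None, -1.0
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d2 = np.sum((Xc - x) ** 2, axis=1)          # squared distances to patterns
        score = np.mean(np.exp(-d2 / (2 * sigma ** 2)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
```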

Figure 16 Architecture of FPNN
The output of the classification process is shown in figure 17 below. The input signal (numbered 60 in the dataset, in .au format) is classified as a Music Signal, and the genre of that music signal is classified as Blues. Similarly, an input signal can be classified as a Speech Signal, giving its gender and age, as shown in figure 18. In the case of speech, we considered a speech file from the dataset as the input signal, which is classified as a Female Voice with age between 10-25.

Figure 17 Classification GUI for Music Signal

Figure 18 Classification GUI for Speech Signal
In a similar manner, we classified the 1140.wav signal as an Environment Sound, specifically Clapping, as shown in figure 19 below.

Figure 19 Classification GUI for Environment Sound
The next step is the retrieval process. Here, we calculate the Euclidean distance between feature vectors to compare the input audio signal with the audio files in the dataset. This step essentially performs an optimization to obtain the 10 audio files whose feature vectors most closely match those of the input audio file; the input audio file itself should appear among the retrieved results. In figure 17 above, the lower part shows the retrieved result, which contains the number of the input audio file.
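The retrieval step can be sketched as a nearest-neighbour search over stored feature vectors; the 100x8 random feature matrix below stands in for the real per-file feature vectors:

```python
import numpy as np

def retrieve_top_k(query_vec, feature_matrix, k=10):
    """Indices of the k dataset files whose feature vectors are
    closest (Euclidean distance) to the query's feature vector."""
    dists = np.linalg.norm(feature_matrix - query_vec, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(2)
features = rng.standard_normal((100, 8))      # one 8-dim feature vector per file
top = retrieve_top_k(features[60], features)  # query with file 60's own features
```

Because the query file's own vector is at distance zero, it always appears first in the retrieved list, which is the sanity check described above.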

This section presents the performance evaluation and comparative analysis of the different classifiers. Figure 20 below shows the confusion matrix of each classifier. A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The performance of the approach is evaluated using metrics such as Precision, Recall, Sensitivity, Specificity, Jaccard Coefficient, Dice Coefficient, Kappa Coefficient and Accuracy.

Precision is defined as the ratio of the number of correct results to the number of predicted results:

Precision = TP / (TP + FP)   (14)

where true positive (TP) is the number of audio signals correctly classified as music, speech or environment sound and false positive (FP) is the number of audio signals incorrectly classified, for example, music signal classified as speech signal.
Recall is defined as the ratio of the number of correct results to the number of results that should have been returned.

Recall = TP / (TP + FN)   (15)
where false negative (FN) is the number of music signals incorrectly classified as non-musical signal.

Sensitivity is defined as the true positive recognition rate.

Sensitivity (%) = TP / (TP + FN) × 100   (16)
Specificity is defined as the true negative recognition rate.

Specificity (%) = TN / (TN + FP) × 100   (17)
where true negative (TN) is the number of audio signals correctly identified as not belonging to the class in question, for example, a non-music signal correctly classified as non-music.

Jaccard Coefficient
The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which are shared and which are distinct. It is a measure of similarity for the two sets of data, ranging from 0% to 100%. In terms of classification counts it can be written as:

Jaccard Coefficient = TP / (TP + FP + FN)   (18)

Dice Coefficient
The harmonic mean of precision and recall is called the Dice coefficient, also known as the F1-score. A higher F1-score indicates improved efficiency in segmentation and classification.

Dice Coefficient (F1-Score) = (2 × Recall × Precision) / (Recall + Precision)   (19)
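Precision, recall and the F1-score follow directly from the TP/FP/FN counts; the counts below are made-up numbers for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and their harmonic mean (F1/Dice)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)  # p = 0.9, r = 0.75
```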
Kappa Coefficient
The Kappa statistic is a metric based on the difference between the observed accuracy and the expected accuracy that random chance alone would be expected to yield 33. It is calculated as:

Kappa Coefficient = (Oa − Ea) / (1 − Ea)   (20)

where Oa is the observed accuracy and Ea is the expected accuracy. An interpretation of the statistic is given as:
<0: Less than chance agreement
0.01 – 0.20: Slight agreement
0.21 – 0.40: Fair agreement
0.41 – 0.60: Moderate agreement
0.61 – 0.80: Substantial agreement
0.81 – 0.99: Almost perfect agreement
The assessments are to be adapted according to the context. The kappa statistic is used not only to evaluate a single classifier, but also to compare classifiers amongst themselves. In addition, it accounts for random chance, which generally makes it less misleading than accuracy alone.
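Computing kappa from a confusion matrix makes the observed/expected distinction concrete; the 2-class matrix below is a made-up example, not a result from the paper:

```python
import numpy as np

def kappa_from_confusion(cm):
    """Cohen's kappa: (Oa - Ea) / (1 - Ea), where Oa is the trace of the
    normalized matrix and Ea comes from the row/column marginals."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    observed = np.trace(cm) / total
    expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    return (observed - expected) / (1 - expected)

# Rows = true class, columns = predicted class
kappa = kappa_from_confusion([[45, 5], [10, 40]])  # Oa = 0.85, Ea = 0.50
```

Here kappa works out to 0.7, which falls in the 0.61-0.80 "substantial agreement" band above.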

Accuracy is defined as the ratio of number of correctly classified results to the number of the classified results. The performance of the classifier is determined based on the number of samples that are correctly and incorrectly predicted by the classifier.

Accuracy (%) = (Number of correctly classified results / Total number of classified results) × 100   (21)

Figure 20 Confusion Matrix of Classifiers
Figure 21 below shows the performance metric values obtained after executing the system for an input audio signal.

Figure 21 Performance Metrics of Classifiers
In this paper, we presented an approach for classification and retrieval of pre-processed audio signals. We used the SVM, k-NN, PNN and FPNN classifiers for classification of audio signals and found that the FPNN achieves the highest accuracy, 93.6759%. We used the GTZAN dataset for music and speech signals and the ESC-50 dataset for environment sounds. The music signals are classified into 10 genres: Blues, Classical, Country, Disco, HipHop, Jazz, Metal, Pop, Reggae and Rock. For speech signals, we identified the speaker's gender and age. Environment sounds are classified into 10 labels: Chirping Birds, Church Bell Ring, Clapping, Crying Babies, Door-wood-knock, Frog, Glass Breaking, Keyboard Typing, Train and Water Drops. The F1-score and Kappa coefficient values are also highest for the FPNN classifier.

REFERENCES
Castán, D., Tavarez, D., et al., “Albayzín-2014 Evaluation: Audio Segmentation and Classification in Broadcast News Domains”, Springer EURASIP Journal on Audio, Speech, and Music Processing, Vol. 33, December 2015, pp. 1-15.
Ludeña-Choez, J. and Gallardo-Antolín, A., “Feature Extraction Based on the High-Pass Filtering of Audio Signals for Acoustic Event Classification”, Elsevier Journal of Computer Speech & Language, Vol. 30, Issue 1, March 2015, pp. 32-42.

Muthumari, A. and Mala, K., “An Efficient Approach for Segmentation, Feature Extraction and Classification of Audio Signals”, Journal of Circuits and Systems, Vol. 7, April 2016, pp. 255-279.

Trisiladevi C. Nagavi, Anusha S.B., Monisha P., Poornima S.P., “Content Based Audio Retrieval with MFCC Feature Extraction, Clustering and Sort-Merge Techniques”, Proceedings of IEEE 4th International Conference on Computing, Communications and Networking Technologies, July 2013, pp. 1-6.
R. Christopher Praveen Kumar, S. Suguna, J. Becky Elfreda, “Audio Retrieval based on Cepstral Feature”, International Journal of Computer Applications, ISSN: 0975-8887, Vol. 107, No. 17, December 2014, pp.28-33.

Muhammad M. Al-Maathidi, Francis F. Li, “NNET Based Audio Content Classification and Indexing System”, International Journal of Digital Information and Wireless Communications (IJDIWC), ISSN: 2225-658X, Vol. 2, Issue 4, 2012, pp. 335-347.

Srinivasa Murthy Y. and Koolagudi, S.G., “Classification of Vocal and Non-Vocal Regions from Audio Songs Using Spectral Features and Pitch Variations”, Proceedings of IEEE 28th Canadian Conference on Electrical and Computer Engineering (CCECE), Halifax, May 2015, pp. 1271-1276.
Zhang X., Su Z., Lin P., He Q. and Yang J., “An Audio Feature Extraction Scheme Based on Spectral Decomposition”, Proceedings of IEEE International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, July 2014, pp. 730-733.

Haque M.A. and Kim J.M., “An Enhanced Fuzzy C-Means Algorithm for Audio Segmentation and Classification”, Springer International Journal of Multimedia Tools and Applications, Vol. 63, Issue 2, March 2013, pp.485-500.
Geiger J.T., Schuller B. and Rigoll G., “Large-Scale Audio Feature Extraction and SVM for Acoustic Scene Classification”, Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, October 2013, pp. 1- 4.
Dhanalakshmi P., Palanivel S. and Ramalingam V., “Classification of Audio Signals Using AANN and GMM”, Applied Soft Computing Elsevier Journal, Vol. 11, Issue 1, January 2011, pp. 716-723.

Matthew Riley, Eric Heinen and Joydeep Ghosh, “A Text Retrieval Approach to Content-Based Audio Retrieval”, Proceedings of ISMIR 9th International Conference on Music Information Retrieval, September 2008, pp. 295-300.

Dong-Chul Park, “Content-based retrieval of audio data using a Centroid Neural Network”, Proceedings of IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), South Korea, December 2010, pp. 394 – 398.

Saadia Zahid, Fawad Hussain, Muhammad Rashid, Muhammad Haroon Yousaf, and Hafiz Adnan Habib, “Optimized Audio Classification and Segmentation Algorithm by Using Ensemble Methods”, Hindawi Publishing Corporation’s Journal on Mathematical Problems in Engineering, Vol. 2015, Article ID 209814, April 2015, pp. 1 – 11.

Poonam Mahana, Gurbhej Singh, “Comparative Analysis of Machine Learning Algorithms for Audio Signals Classification”, International Journal of Computer Science and Network Security (IJCSNS), Vol. 15, Issue 6, June 2015, pp. 49 – 55.

G. Tzanetakis, P. Cook, “Musical Genre Classification of Audio Signals”, IEEE Transactions on Speech and Audio Processing, Vol. 10, Issue 5, July 2002.

Riccardo Miotto, Gert Lanckriet, “A Generative Context Model for Semantic Music Annotation and Retrieval”, IEEE Transactions on Audio, Speech, and Language Processing, Volume: 20, Issue: 4, May 2012, pp. 1096-1108.

Mohammad A. Haque, Jong-Myon Kim, “An analysis of content-based classification of audio signals using a fuzzy c-means algorithm”, Springer Journal of Multimedia Tools and Applications, Volume 63, Issue 1, March 2013, pp 77–92.

Shweta Vijay Dhabarde, P.S. Deshpande, “Feature Extraction and Classification of Audio Signal Using Local Discriminant Bases”, International Journal of Industrial Electronics and Electrical Engineering, ISSN: 2347-6982, Vol. 3, Issue 5, May 2015, pp. 51-54.

Babu Kaji Baniya, Deepak Ghimire, Joonwhoan Lee, “Automatic Music Genre Classification Using Timbral Texture and Rhythmic Content Features”, ICACT Transactions on Advanced Communications Technology (TACT) Vol. 3, Issue 3, May 2014, pp. 434-443.

Kesavan Namboothiri T. and Anju L., “Efficient Audio Retrieval Using SVM and DTW Techniques”, International Journal of Emerging Technology in Computer Science & Electronics (IJETCSE), Vol. 23, Issue 2, June 2016.

Feng Rong, “Audio Classification Method Based on Machine Learning”, IEEE Proceedings of International Conference on Intelligent Transportation, Big Data & Smart City, 2016, pp. 81-84.

Gursimran Kour and Neha Mehan, “Music Genre Classification using MFCC, SVM and BPNN”, International Journal of Computer Applications, Vol. 112, No. 6, February 2015.

Toni Hirvonen, “Speech/Music Classification of Short Audio Segments”, IEEE Proceedings of International Symposium on Multimedia, 2014, pp. 135-138.

Malay Singh, Uma Shanker Tiwary, Tanveer J. Siddiqui, “A Speech retrieval System based on Fuzzy logic and Knowledge-base filtering”, IEEE Proceedings of International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT), November 2013, pp. 46-50.

GTZAN Dataset: Marsyasweb.
Steven W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing, pp. 277-284.

Sunitha R, “Separation of Unvoiced and Voiced Speech using Zero Crossing Rate and Short Time Energy”, International Journal of Advanced Computing and Electronics Technology (IJACET), Vol. 4, No.1, ISSN: 2394-3416, 2017, pp. 6-9.

R. Thiruvengatanadhan, P. Dhanalakshmi, P. Suresh Kumar, “Speech/Music Classification using SVM”, International Journal of Computer Applications Vol. 65, No.6, ISSN: 0975 – 8887, March 2013, pp. 36-41.

S. Radha Krishna and R. Rajeswara Rao, “SVM based Emotion Recognition using Spectral Features and PCA”, International Journal of Pure and Applied Mathematics, Vol. 114, No. 9, ISSN: 1314-3395, 2017, pp. 227-235.