Results and Discussions
We recorded a Marathi database of the numerals zero to nine, with the aim of implementing a numeral-based password system and similar everyday applications. Twenty samples of each word were recorded from different speakers, and each sample was normalized by dividing it by its maximum value. The samples were then compared using Dynamic Time Warping (DTW). Of the 20 samples recorded for each word, 16 were used for training (as DTW reference templates) and the remaining 4 were held out for testing. In this project, the speech recognition software was developed using the MFCC and DTW algorithms.
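The normalization and the 16/4 split described above can be sketched in a few lines. This is an illustrative Python sketch, not the MATLAB code used in the project. The function names are hypothetical, and two details are assumptions: the divisor is taken to be the maximum absolute amplitude, and the first 16 recordings are taken as the training set.

```python
def peak_normalize(samples):
    """Scale a signal so that its largest absolute amplitude becomes 1."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)  # silent recording: nothing to scale
    return [x / peak for x in samples]


def split_train_test(recordings, n_train=16):
    """Keep the first n_train recordings as templates, hold out the rest."""
    return recordings[:n_train], recordings[n_train:]


# Toy example: 20 short "recordings" of one word.
recordings = [[0.5, -2.0, 1.0] for _ in range(20)]
normalized = [peak_normalize(r) for r in recordings]
train, test = split_train_test(normalized)
print(len(train), len(test))
```

Dividing by the peak removes differences in recording level between speakers, so that the later distance comparison depends on the shape of the signal rather than on its loudness.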
A reference file was created for each of the pre-recorded speech signals. When a microphone input signal was applied, its MFCC coefficients were compared with the MFCC coefficients of the pre-recorded speech using the DTW algorithm. The DTW output scores identify the pre-recorded signal that is nearest to the input.
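The comparison step can be illustrated with a minimal DTW sketch in Python. It is a textbook formulation under stated assumptions, not the authors' MATLAB implementation. It assumes a Euclidean distance between feature frames and the basic step pattern (insertion, deletion, match). The names `dtw_distance` and `recognize` are hypothetical, and the one-dimensional "features" stand in for real MFCC vectors.

```python
import math


def dtw_distance(test, ref):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning test[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(test[i - 1], ref[j - 1])  # frame-to-frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated
                                 cost[i][j - 1],      # reference frame repeated
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]


def recognize(test, templates):
    """Return the label of the template with the smallest DTW distance."""
    best_label, _ = min(templates, key=lambda t: dtw_distance(test, t[1]))
    return best_label


# Toy templates with one-dimensional features (placeholders for MFCC frames).
templates = [
    ("shunya", [[0.0], [1.0], [2.0]]),
    ("ek", [[5.0], [5.0], [5.0]]),
]
spoken = [[0.0], [0.0], [1.0], [2.0]]  # a slower utterance of the first pattern
print(recognize(spoken, templates))
```

Because the alignment may repeat frames, the slower utterance still matches its template at zero cost, which is the property that makes DTW suitable for speech spoken at varying rates.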
Finally, the software output was displayed on the MATLAB output screen. The software displays the correct numeral once the applied microphone signal has been compared with the pre-recorded and online signals. The results for some of the features extracted from the recorded database of Marathi numerals zero to nine are shown in the figures below.
Fig. 6.1 Mel Frequency Cepstrum Coefficients of shunya.
Fig. 6.2 Mel Frequency Cepstrum Coefficients of pach.
Graphical User Interface (GUI) of the system
We have created a GUI for the numeral recognition system.
The DTW 0-9 Digit Recognizer has several command buttons, such as record, open, play and recognize, and it displays the opened wave file. In this digit recognizer, the open button reads a pre-recorded numeral and the record button records an online numeral spoken by the speaker.
The pre-recorded or online numeral spoken by the speaker can be played back and then recognized using DTW for feature matching. The recognizer matches the input against the templates on the basis of the minimum warping distance between the various numerals. The template with the closest match is chosen as the recognized numeral and is displayed on the GUI.
Fig. 6.3 GUI of DTW Digit Recognizer.
Fig. 6.4 GUI of opened wave file.
Fig. 6.5 GUI of pattern matching of shunya.
Fig. 6.6 GUI of recognized numeral shunya.
Fig. 6.7 GUI of pattern matching of ek.
Fig. 6.8 GUI of recognized numeral ek.
Fig. 6.9 GUI of pattern matching of saha.
Fig. 6.10 GUI of recognized numeral saha.
Fig. 6.11 GUI of pattern matching of nau.
Fig. 6.12 GUI of recognized numeral nau.
6.2 Testing and Results
6.2.1 Testing with pre-recorded samples
Of the 20 samples recorded for each word, 16 were used for training. We tested the program's accuracy with the 4 unused samples of each word. A total of 20 samples were tested (4 samples each for the 5 words), and the program yielded the right result for all 20 samples. Thus, we obtained 100% accuracy with pre-recorded samples.
6.2.2 Real-time testing
For real-time testing, we recorded a sample using the microphone and ran the program directly on this sample. A total of 30 samples were tested, of which 24 gave the right result. This gives an accuracy of about 80% with real-time samples.
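The two accuracy figures quoted above follow directly from the counts of correct results. The short Python sketch below reproduces the arithmetic; the helper name `accuracy_percent` is hypothetical and is not part of the original software.

```python
def accuracy_percent(correct, total):
    """Percentage of test samples that were recognized correctly."""
    return 100.0 * correct / total


# Counts taken from the text: pre-recorded and real-time tests.
print(accuracy_percent(20, 20))
print(accuracy_percent(24, 30))
```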
6.2.3 Results
Case 1: Speaker independent (20 templates per digit: 10 male, 10 female)
The implemented system was tested on 100 samples of each word, spoken by 50 different speakers who each contributed 2 samples of every digit. The test results are given in Table 6.1.
Table 6.1 Accuracy of the Speaker Independent Test Results.
DIGIT         0   1   2   3   4   5   6   7   8   9
% ACCURACY   87  88  82  78  79  84  85  81  78  87
Case 2: Speaker Dependent (one template per digit).
The implemented system was tested on 10 samples of each word spoken by a single speaker. The results are given in Table 6.2.
Table 6.2 Accuracy of the Speaker Dependent Test Results.
DIGIT         0   1   2   3   4   5   6   7   8   9
% ACCURACY   90  91  84  90  87  88  92  84  86  92
It is observed that the accuracy for the pre-recorded samples is higher than that for the real-time test samples. We have also observed that the accuracy for the speaker-dependent samples is higher than that for the speaker-independent samples.
Table 6.3 Confusion Matrix of the MFCC & DTW Recognition.
            ek   don  teen  char  pach  saha   sat  aath   nau  shunya  Avg. %
ek           1     1     1     4     1     1     1     1     1       0      80
don          2     2     2     2     3     2     2     2     2       2      90
teen         3     3     3     3     9     3     3     2     2       2      80
char         4     4     5     4     4     4     4     6     4       4      80
pach         5     5     5     5     5     5     5     5     5       3      90
saha         6     6     6     6     1     6     6     4     6       6      80
sat          7     7     8     7     7     7     7     7     7       7      90
aath         2     8     8     8     8     7     8     8     8       8      80
nau          9     9     4     9     9     5     9     9     9       9      80
shunya       0     0     0     0     5     0     0     2     0       0      80
Table 6.4 Confusion Matrix of the MFCC & HMM Recognition.
            ek   don  teen  char  pach  saha   sat  aath   nau  shunya  Avg. %
ek           1     1     1     1     1     1     1     3     1       1      90
don          2     2     2     2     2     2     2     2     2       5      90
teen         3     3     3     3     3     3     1     3     3       3      90
char         4     3     4     4     4     4     4     4     8       4      80
pach         5     5     5     5     5     5     5     5     5       5     100
saha         6     6     6     8     6     6     6     6     6       6      90
sat          7     7     7     7     7     7     7     5     7       7      90
aath         8     8     8     8     8     8     8     8     5       8      90
nau          9     9     9     9     9     7     9     9     9       9      90
shunya       0     0     0     7     0     0     0     5     0       0      80
Table 6.5 Comparison of Digit Recognition Accuracy Test Results.
Numeral      DTW Accuracy   HMM Accuracy
ek                     80             90
don                    90             90
teen                   80             90
char                   80             80
pach                   90            100
saha                   80             90
sat                    90             90
aath                   80             90
nau                    80             90
shunya                 80             80
Average %             83%            89%
Experimentally, it is observed that recognition accuracy is better for HMM than for DTW, but the training procedure for DTW is much simpler and faster than that for HMM.
Fig. 6.13 Recognition accuracy of the DTW & HMM.
Recognition of numerals takes longer with HMM than with DTW, because HMM has to go through many states, iterations and further mathematical modeling. DTW is therefore preferred over HMM for real-time applications.
Conclusions and Future Scopes
7.1 Conclusions
Despite the advances accomplished over the last decades, automatic speech recognition (ASR) is still a challenging and difficult task [1].
Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modeling the human auditory perception system, are used as the feature extraction technique. The nonlinear sequence alignment known as Dynamic Time Warping (DTW) is used as the feature matching technique. Since the voice signal tends to have a varying temporal rate, this alignment is important for good performance. This paper proposes that higher recognition rates can be achieved using MFCC features with DTW, which is well suited to numeral utterances of varying duration. MFCC analysis provides a better recognition rate than LPC because it operates on a logarithmic scale that resembles the human auditory system, whereas LPC has uniform resolution over the frequency plane.
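The logarithmic frequency resolution mentioned above can be illustrated with a widely used approximation of the mel scale, m = 2595 log10(1 + f/700). The source does not state which mel formula the authors used, so both the formula and the function names in this Python sketch are assumptions made for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (common 2595/700 approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal 1000 Hz steps cover fewer and fewer mels as frequency increases,
# so a mel-spaced filter bank is denser at low frequencies.
for lo, hi in [(0, 1000), (3000, 4000), (7000, 8000)]:
    print(lo, hi, round(hz_to_mel(hi) - hz_to_mel(lo), 1))
```

A uniform scale would give the same width to each of the three bands, whereas the mel widths shrink with frequency; this is the contrast with LPC drawn in the paragraph above.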
Feature extraction is followed by pattern recognition. Since the voice signal tends to have a varying temporal rate, DTW is one of the methods that provide a non-linear alignment between two voice signals. Another method, HMM, which models the words statistically, is also presented. Experimentally, it is observed that recognition accuracy is better for HMM than for DTW, but the training procedure for DTW is much simpler and faster than that for HMM. Recognition of numerals takes longer with HMM than with DTW, because HMM has to go through many states, iterations and further mathematical modeling, so DTW is preferred over HMM for real-time applications. DTW is a cost-minimization matching technique in which a test signal is stretched or compressed according to a reference template. The accuracy for the pre-recorded samples is higher than that for the real-time test samples.
We have also observed that the accuracy for the speaker-dependent samples is higher than that for the speaker-independent samples.
7.2 Future Scopes
One of the key areas where future work can be concentrated is large-vocabulary generation and improving the robustness of speech recognition performance [2].
Another key area of research is focused on an opportunity rather than a problem. This research attempts to take advantage of the fact that in many applications there is a large quantity of speech data available, up to millions of hours. It is too expensive to have humans transcribe such large quantities of speech, so the research focus is on developing new machine learning methods that can effectively utilize large quantities of unlabeled data. A further goal is a better understanding of human capabilities and the use of this understanding to improve machine recognition performance. Future work could also be directed towards online speech summarization, since the majority of speech summarization research has focused on extracting the most informative dialogue acts from recorded, archived data. Finally, future work could aim at minimizing the time required for recognition of numerals using HMM.