Classification of Javanese Script Hanacaraka Voice Using Mel Frequency Cepstral Coefficients (MFCC) and Selection of Dominant Weight Features

— This study investigates the sound of Hanacaraka syllables in Javanese to select the frame feature that best supports checking the sound of a reading. Selecting the right frame feature matters in speech recognition because certain frames carry the accuracy at their dominant weight, so the frame with the best accuracy must be identified. The Mel Frequency Cepstral Coefficient (MFCC) is a common and widely used feature extraction model, with a reported accuracy of 50% to 60%. This research applies MFCC together with Dominant Weight feature selection to the sound of the Javanese script Hanacaraka, producing frames and cepstral coefficients as the extracted features. The cepstral coefficients range from 0 to 23, i.e., 24 coefficients, while the captured frames run from 0 to 10, i.e., eleven frames. A total of 300 voice recordings, from both male and female speakers, were sampled and tested. The recordings were made at 44,100 Hz, 16-bit stereo. The accuracy results show that the MFCC method with selection of the ninth frame reaches 86%, higher than the other frames.


INTRODUCTION
Speech recognition has been widely applied across languages. Several models have been developed with a language approach, either through grammar or through other methods that combine feature extraction with matching in a speech recognition system. Voice recognition through vocabulary, both linking and stopping the reading, follows flexible rules; the rules are flexible because they depend on the length of the reader's breath and the rhythm of the reading. The rules needed to shape such a language model therefore become numerous and very complex. The Mel Frequency Cepstral Coefficient (MFCC) is a common and widely used feature extraction model. This research on the sound of the Javanese Hanacaraka selects the right features, choosing the best frame feature for checking the sound of a Javanese script reading. Selecting the right frame feature is needed in speech recognition because certain frames carry the accuracy at their dominant weight, so the frame with the best accuracy must be matched. Before the feature selection stage, feature extraction is first carried out with MFCC.

A. Feature Extraction Method
The voice recognition method that uses feature extraction is significant because the feature extraction results strongly affect matching and pattern recognition. Research using feature extraction methods includes Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) [1]. Both methods have weaknesses and advantages in the features they produce. The MFCC method has an accuracy of 50% to 60%, while LPC reaches only 45% to 50%, so the non-linear MFCC method is more accurate than the linear LPC approach.
MFCC has weaknesses, including sensitivity to low frequencies, environmental noise, nearly similar sound patterns, and classification [2]. Meanwhile, MFCC has advantages: it captures the voice characteristics that matter for recognition, captures critical information in the voice, produces minimal data without losing information, and approximates the response of human hearing [3]. In addition, feature extraction using MFCC is widely used for speech recognition because it is more precise in various conditions [4]. Feature extraction using Linear Predictive Coding (LPC) has weaknesses, including noise, changing speech frequencies, and classification [5]; its advantage is autocorrelation [1], [6].
Research on sound feature extraction using both MFCC and LPC shares the same weaknesses, including noise, nearly similar speech frequencies, frequently changing frequencies, and classification. The weakness of these two methods was also noted by [1]: feature extraction using MFCC or LPC alone is not suitable for recognizing huge numbers of sounds, so classification is needed.
Based on the weaknesses and strengths of the two methods, the researchers prefer feature extraction using MFCC because its accuracy is better than LPC's [1], [7], [8]. The accuracy of MFCC feature extraction was between 58% and 75% [7]. In addition, according to [9], the LPC method is more suitable for linear computation, whereas the human voice is essentially non-linear.
Another language-related study, on hijaiyah letter recognition, by Bethaningtyas [13] used MFCC, comparing 3, 6, 9, and 12 channels of the training data model and the deviation values. Another study on hijaiyah letters by Heriyanto [14] used average energy and wave deviation as a comparison. Meanwhile, research on hijaiyah letter phonemes by Subali et al. [15], using the LPC and DTW methods, produced the formant frequencies of the speakers' pronunciation; DTW has the advantage of autocorrelation.
Other research modified MFCC in the windowing stage [16]. Further research that modified MFCC produced acoustic signal analysis with the stages of pre-emphasis, frame blocking, Hamming windowing, Fast Fourier Transform, Mel Filterbank, Discrete Cosine Transform (DCT), delta energy, and delta spectrum [17].

B. Matching Speech Recognition
Matching research on speech recognition uses different methods with different outputs, including neural networks [18], the Hidden Markov Model (HMM) [4], and Dynamic Time Warping (DTW) [20].
Speech recognition research using MFCC feature extraction has been carried out widely, including in applied language fields. Research on Arabic speech recognition by [4] states that Mel Frequency Cepstral Coefficient (MFCC) extraction produces features used to score the conformity of Indonesian speakers to native speakers, classified by matching with the Hidden Markov Model (HMM).
MFCC is applied in other language fields, including Indonesian, by identifying speech signals into vocabulary, resulting in Phoneme and Syllable Models and segmentation [10]. Similar research was conducted by [11] using the Mel Frequency Cepstral Coefficient (MFCC) and Hidden Markov Model (HMM), which can recognize phoneme segmentation in Indonesian. Similar research on phonemes was also carried out by Cahyarini [12], who identified speech pauses between phonemes.
Voice recognition using the DTW method was carried out to calculate the distance between two time-series data [20]. This method has the advantage of calculating the distance between two data vectors of different lengths, i.e., finding the smallest matching distance between the voices of novice speakers and expert speakers [9].
DTW is a non-linear sequence-alignment algorithm used to measure the similarity of patterns in time-varying data series and is more realistic for matching. DTW has a weakness in accuracy, namely wildly varied results [21], and its accuracy still only equals that of HMM [4]. Meanwhile, the HMM method, based on [11], has a weakness in robustness.
Another speech recognition method, the Neural Network (NN), has advantages in learning systems, knowledge acquisition, classification, and pattern generalization [18]. According to [22], NN has a weakness in the training process, which requires a long time with a large amount of data. Likewise, [7], identifying utterances of the numbers one to nine, found that training with massive data requires very long processing times.

II. METHOD
A. Mel Frequency Cepstral Coefficients (MFCC)
The research method is divided into two major parts: MFCC feature extraction and Dominant Weight Normalization. The steps in each part can be seen in Fig. 1. The MFCC method was first introduced by Davis and Mermelstein around 1980. MFCC is a method that performs quite well in the field of speech recognition [23] and is the feature extraction most widely used in speaker recognition and speech recognition.
MFCC is a feature extraction that produces features distinguishing one sound from another through the cepstral coefficient parameters [1]. MFCC feature extraction converts sound waves into several parameters, such as the cepstral coefficients representing an audio file [4]. In addition, MFCC produces feature vectors, converting the voice signal into several vectors for speech feature recognition [20].

B. Pre-emphasis
According to Tokunbo [24], pre-emphasis is an early-stage process and very simple to carry out. The signal often suffers noise interference, and pre-emphasis improves the Signal to Noise Ratio (SNR). Pre-emphasis aims to preserve good signal quality in the high-frequency part while remaining in the time domain [25]. Pre-emphasis, according to [26], uses an α value between 0 and 1, i.e., 0.9 ≤ α ≤ 1.0, as in (1):

y(n) = s(n) − α · s(n − 1)        (1)

In this case, y(n) is the pre-emphasized signal, s(n) is the signal before pre-emphasis, n is the serial number of the signal sample, and α is the pre-emphasis filter constant between 0.9 and 1.0. The nth sample in pre-emphasis is taken along the reading of one syllable, lasting one to two seconds.
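The pre-emphasis step can be sketched in a few lines of NumPy; the filter constant α = 0.9 and the toy input below are illustrative values, not taken from the paper's data.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.95):
    # First-order high-pass filter: y(n) = s(n) - alpha * s(n-1).
    signal = np.asarray(signal, dtype=float)
    # Keep the first sample unchanged, then subtract the weighted previous sample.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Illustrative toy signal; a real input would be one syllable of recorded speech.
s = np.array([1.0, 2.0, 3.0, 4.0])
y = pre_emphasis(s, alpha=0.9)  # -> [1.0, 1.1, 1.2, 1.3]
```

The output has the same length as the input, so each syllable recording stays aligned with the frames cut from it in the next stage.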

C. Frame Blocking
The frame blocking process blocks the signal into frames of N samples, shifted by M samples, so that M = N/2 with M < N. Figure 1 shows an illustration of frame blocking [1]. The width of a frame is denoted by N, the shift between frames by M, and the overlap width is the difference N − M.
The typical frame length is between 20 and 40 milliseconds [4]. Frames are taken as long as possible to obtain good frequency resolution, and as short as possible to obtain the best time-domain resolution. Frame blocking is calculated with (2):

x_i(n) = s(M · i + n),  n = 0, 1, …, N − 1,  i = 0, 1, …, L − 1        (2)

In this case, x_i(n) is the result of frame blocking, N is the number of samples per frame, M is the frame shift, L is the number of frames, and s is the pre-emphasized signal. Figure 2 shows the first frame of the sound signal.
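A minimal frame blocking sketch, assuming a shift of M = N/2 (50% overlap); the frame sizes below are toy values, not the paper's 44.1 kHz settings.

```python
import numpy as np

def frame_blocking(signal, frame_len, frame_shift):
    # Split a 1-D signal into overlapping frames: frame i covers samples
    # [i*frame_shift, i*frame_shift + frame_len); overlap = frame_len - frame_shift.
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        frames[i] = signal[i * frame_shift : i * frame_shift + frame_len]
    return frames

# Toy example: N = 4 samples per frame, shift M = N/2 = 2 (50% overlap).
x = np.arange(10.0)
frames = frame_blocking(x, frame_len=4, frame_shift=2)
```

With a 10-sample input this yields four frames, each sharing its last two samples with the start of the next frame.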

D. Windowing
Windowing aims to reduce discontinuity effects at the edges of the frames generated by the frame blocking process. Commonly used windows are the Rectangular, Hamming, and Hanning windows [4]. Of the three, the researcher uses the Hanning window because it is smoother than the others [27]. The window function is given by (3):

w(n) = 0.5 − 0.5 cos(2πn / (N − 1)),  n = 0, 1, …, N − 1        (3)

In this case, w(n) is the Hanning window function and N is the length of the frame.
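The Hanning window can be sketched directly from its definition; the 8-sample frame is an illustrative size only.

```python
import numpy as np

def hanning_window(frame):
    # Hanning window of equation (3): w(n) = 0.5 - 0.5*cos(2*pi*n/(N-1)).
    N = len(frame)
    n = np.arange(N)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))
    return frame * w

# Applying the window to a constant frame shows the taper to zero at both edges.
frame = np.ones(8)
windowed = hanning_window(frame)  # equals numpy.hanning(8)
```

Because the window tapers to zero at both ends, the discontinuities at frame edges are suppressed before the FFT stage.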

E. Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT) is a fast algorithm, developed by Cooley and Tukey, for implementing the Discrete Fourier Transform (DFT), which converts digital signals from the time domain to the frequency domain [1].
Direct DFT computation is too slow and inefficient, and the FFT computes it efficiently; Proakis and Manolakis [26] state that the FFT method is an efficient way to calculate the DFT. The Discrete Fourier Transform (DFT) uses (5):

d[k] = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N},  k = 0, 1, …, N − 1        (5)

In this case, d[k] is the result of the DFT calculation, x(n) is the result of windowing, N is the number of samples to be processed, and k is the discrete frequency variable. The Fast Fourier Transform decomposes the signal into sinusoids with real and imaginary parts, using (6):

T(m) = Σ_{n=0}^{N−1} x(n) e^{−j2πmn/N}        (6)

In this case, T(m) is the result of the Fast Fourier Transform calculation, x(n) is the nth windowing result, n is the signal serial number, and m is the frequency index (1, 2, …, N). Figure 3 shows the results of the FFT processing; the FFT works in the frequency domain and generates a spectrum.
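The magnitude spectrum of one frame can be computed with NumPy's FFT; the 1 kHz tone and 8 kHz sampling rate below are illustrative test settings, not the paper's recording parameters.

```python
import numpy as np

# Magnitude spectrum of one frame via the FFT. With illustrative settings:
# a 1 kHz tone sampled at 8 kHz over N = 64 samples lands exactly on
# FFT bin k = f * N / fs = 1000 * 64 / 8000 = 8.
fs, N = 8000, 64
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 1000 * t)
spectrum = np.abs(np.fft.rfft(frame))  # magnitudes for bins 0 .. N/2
peak_bin = int(np.argmax(spectrum))    # -> 8
```

`rfft` returns only the non-negative-frequency half of the spectrum (N/2 + 1 bins), which is all the filterbank stage needs.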

F. Mel Frequency Wrapping (MFW)
Mel Frequency Wrapping (MFW) is a filter, in the form of a filter bank, used to determine the energy of particular frequency bands in a sound signal [19], [20]. According to Laha [28], MFW converts the linear frequency scale into the mel frequency scale.
The filterbank has a frequency response through triangular filters whose size is determined by their spacing and constant frequency intervals. The output of the filter is known as the mel spectrum, computed with (7):

Y[i] = Σ_{j=1}^{G} T[j] H_i[j]        (7)

In this case, Y[i] is the result of the MFW calculation for the ith filter, G is the number of magnitude spectrum points, T[j] is the result of the FFT, H_i[j] is the filterbank coefficient at frequency j (1 ≤ i ≤ E), and E is the number of channels in the filterbank. The mel-scale mapping used by MFW is (8):

mel(f) = 2595 log₁₀(1 + f / 700)        (8)

In this case, mel(f) is the frequency on the MFW (mel) scale and f is the frequency in Hz. MFW produces a mel spectrum. The mel frequency scale is linear at frequencies below 1,000 Hz and logarithmic at frequencies above 1,000 Hz [20].
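The mel conversion of equation (8) and a triangular filterbank can be sketched as follows. The even spacing of filter centers on the mel scale is a common construction assumed here, since the paper does not give its exact filter spacing; the filter count and FFT length are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    # Equation (8): mel(f) = 2595 * log10(1 + f/700).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_filterbank(num_filters, fft_len, fs):
    # Triangular filters with centers evenly spaced on the mel scale
    # (an assumed, widely used construction).
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), num_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)  # inverse of hz_to_mel
    bins = np.floor((fft_len + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((num_filters, fft_len // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                      # rising edge
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                     # falling edge
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank(num_filters=20, fft_len=512, fs=44100)
```

Multiplying the FFT magnitude spectrum by this matrix (one row per channel) yields the mel spectrum of equation (7).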

G. Discrete Cosine Transform (DCT)
DCT, according to Smith [29], is a relative of the Fourier transform that decomposes a signal into cosine waves. DCT has been widely used in sound and image processing, for example in JPEG compression. The concept of DCT is similar to the inverse Fourier transform, and DCT is close to Principal Component Analysis (PCA), a classical statistical method widely used in data analysis and compression.
DCT can be regarded as replacing the inverse Fourier transform in the MFCC feature extraction process [20]. The Discrete Cosine Transform (DCT) is a member of the class of sinusoidal unitary transforms [30]. DCT converts the mel spectrum into cepstral coefficients to improve recognition quality, using (9):

c_j = Σ_{i=1}^{K} log(Y[i]) cos(j (i − 0.5) π / K)        (9)

In this case, c_j is the jth cepstral coefficient, Y[i] is the output of the filterbank process at index i, K is the number of filterbank channels, and j runs over the expected number of coefficients. The DCT process produces the mel cepstrum.
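A short sketch of the DCT stage; the log-energy DCT form below is the standard MFCC formulation, assumed here to correspond to equation (9), and the random mel energies are purely illustrative input.

```python
import numpy as np

def mel_to_cepstrum(mel_energies, num_ceps):
    # Standard MFCC DCT of the log mel energies:
    # c_j = sum_{i=1}^{K} log(Y_i) * cos(j * (i - 0.5) * pi / K).
    Y = np.asarray(mel_energies, dtype=float)
    K = len(Y)
    i = np.arange(1, K + 1)
    return np.array([np.sum(np.log(Y) * np.cos(j * (i - 0.5) * np.pi / K))
                     for j in range(num_ceps)])

# 24 cepstral coefficients (indices 0..23), as used in this study;
# positive dummy mel energies stand in for a real filterbank output.
mel_energies = np.abs(np.random.default_rng(0).normal(size=20)) + 1.0
ceps = mel_to_cepstrum(mel_energies, num_ceps=24)
```

Note that the 0th coefficient reduces to the sum of the log energies, since cos(0) = 1 for every term.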
H. Cepstral Liftering
Cepstral liftering smooths the cepstral coefficients with a lifter window. In this case, w(n) is the window function applied to the cepstral features, c is the cepstral coefficient, and n is the index of the cepstral coefficients. The cepstral liftering process results in frames and cepstral coefficients, which are then passed to feature selection. Feature selection is described in the feature selection model section.

I. Selection of Dominant Weight Normalization Feature
The MFCC feature extraction results are the frames and cepstral coefficients, which strongly influence speech matching and recognition. Feature selection is carried out using the dominant weight normalization model. The model has six stages, namely determining the threshold, making the range, filtering, eliminating duplicate weights, normalizing the weights, and taking the dominant weight, resulting in a table of features [31], followed by conformity testing. Figure 4 shows the feature selection, which starts with MFCC feature extraction; the extracted features are then matched by testing the dominant weight normalization algorithm.
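The six stages listed above can be sketched as a pipeline over the per-frame cepstral coefficients. This is only an illustrative interpretation: the paper does not specify the exact thresholding, range, or weighting rules, so every concrete choice below (threshold value, number of ranges, how a "dominant weight" is taken) is an assumption.

```python
import numpy as np

def dominant_weight_features(frames, threshold=0.0, num_ranges=10):
    # Hedged sketch of the six-stage dominant-weight selection; all rules
    # here are illustrative assumptions, not the paper's algorithm.
    # frames: 2-D array, shape (num_frames, num_cepstral_coefficients).
    features = []
    for frame in np.asarray(frames, dtype=float):
        kept = frame[frame > threshold]                 # 1) threshold
        if kept.size < 2:                               # degenerate frame guard
            features.append(1.0 if kept.size else 0.0)
            continue
        edges = np.linspace(kept.min(), kept.max(), num_ranges + 1)  # 2) range
        weights = np.digitize(kept, edges)              # 3) filter into ranges
        weights = np.unique(weights)                    # 4) drop duplicate weights
        norm = weights / weights.sum()                  # 5) normalize weights
        features.append(float(norm.max()))              # 6) dominant weight
    return np.array(features)

# Eleven frames of 24 coefficients, matching the shapes used in this study;
# the random values are dummy data in place of real MFCC output.
frames = np.random.default_rng(1).normal(size=(11, 24))
feat = dominant_weight_features(frames)
```

The result is one normalized dominant weight per frame, from which the best-scoring frame (the ninth, in the paper's experiments) can be chosen.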

III. RESULTS
Tests on selecting the right features were carried out on the number of cepstral coefficients and the number of frames. The feature extraction results using MFCC produced frames and cepstral coefficients, with eleven frames and 24 cepstral coefficients, as seen in Fig. 5 and Fig. 6. Figure 6 shows the MFCC results for the utterance "hanacaraka", consisting of eleven frames and twenty-four cepstral coefficients. The proposed feature selection has six stages, namely determining the threshold, creating the range, filtering, eliminating duplicate weights, normalizing the weights, and taking the dominant weight [31], [32]. All these steps are applied in the dominant weight normalization feature selection algorithm to produce a feature table. Table 1 shows the feature extraction results using MFCC for the utterances "ha, na, ca, ra, ka". The extracted features consist of eleven frames and twenty-four cepstral coefficients. The frames and cepstral coefficients are then passed to feature selection, which selects frame features with the Dominant Weight Normalization algorithm. Figure 6 shows the dominant weight normalization algorithm, which starts with the first and second steps of taking the speech sound; the third step carries out the thresholding; the fourth step and onward carry out the range, filtering, elimination of duplicate weights, and normalization of the dominant weights.

B. Data Collection
Sound sampling was carried out on 300 recorded voices and tested on 200 recorded voices. The cepstral coefficients range from 0 to 23, i.e., 24 coefficients, and the frames from 0 to 10. The recordings were made at 44,100 Hz, 16-bit stereo, of both male and female voices. Tables 2 and 3 show the voice sampling: each spoken syllable of the reading was recorded in 37 samplings, both male and female, for a total of 296 voice samples. The threshold of the existing feature table is checked with the similarity range and filtering, sequential calculations, and Uniformity of Pattern Conformity calculations using the equations in the algorithm. Speech checking is performed to select the correct reference and the right features [31], [32]. Figure 7 shows the algorithm for checking the sound suitability of Javanese script speech, taking the features from MFCC feature extraction in the form of frame parameters and cepstral coefficients. The conformity checking algorithm proceeds through range checking, filtering of each frame, sequential multiplication, and a final calculation of Pattern Uniformity Conformity. The results for the frames and cepstral coefficients can be seen in Table 4.
Based on Table 4, the best frame feature is the 9th frame, with an average value of 85%, better than the other frames.