Convolutional Neural Networks Based on Raspberry Pi for a Prototype of Vocal Cord Abnormalities Identification

This study aims to build a Raspberry Pi-based device prototype for identifying vocal cord abnormalities. The prototype classifies images into seven classes, i.e., cysts, granulomas, nodules, normal, papilloma, paralysis, and no vocal cords. Classification is performed with a deep learning algorithm, specifically a Convolutional Neural Network (CNN). In building the CNN model, we used a statistical method to form the model training scenarios and modified the AlexNet architecture by optimizing its parameters. With the optimized parameters, the model obtained 95.35% accuracy in the test scenario. The CNN model was then implemented on the Raspberry Pi, where the test results showed 79.75% accuracy.


I. INTRODUCTION
High and low tones are needed in communication and singing. These tones depend on the tension of the vocal cords [1]. The vocal cords are the narrowest part of the breathing apparatus [2]. When this area is abnormal, the patient shows symptoms and reports complaints, which health workers then use to make a diagnosis. Diagnosis is one of the most important steps in examining a disease. Diagnostic errors can cause mishandling that increases the chance of death [3]. For vocal cord cases, the examination is carried out using a laryngoscopy device, a stroboscopy device, or both. Stroboscopy is an examination of the condition of the vocal cords, such as their anatomy, function, and biomechanics [4]. The observer views the vocal cords using a camera passed through the nose into the throat and makes a diagnosis from the result. However, observers may give different opinions during the examination depending on their capabilities and experience. Digital image processing can help the observer determine the condition of the examined vocal cords as well as any abnormalities. Digital images are formed by a collection of dots called pixels (picture elements) [5]. A 2D digital image is described by an [m, n] matrix [6]. In a previous study, Bima created a system to help observers detect vocal cord abnormalities using the Moore Neighbor Tracing image processing method [7]. It obtained an accuracy of 85.83% on 120 tested images. Only four classes of vocal cord abnormalities were tested, i.e., paralysis, papilloma, granuloma, and nodules/cyst. However, the system still has some deficiencies, such as requiring a lot of user assistance to obtain the diagnostic results, e.g., rotation, fitting, and multiple grayscale settings. Other studies created similar systems using multiple image processing methods [8]-[10].
These studies could classify vocal cord conditions and abnormalities into six classes, i.e., normal, nodules, cyst, granulomas, papilloma, and paralysis. They applied the Chan-Vese algorithm to obtain the glottis segmentation area automatically, so that the system could run without further initialization. However, all three studies could be run only on a Personal Computer (PC) with MATLAB installed. Also, they could not run in real time.  [12]. The best CNN models and architectures that they evaluated were ResNet and Inception. However, these could only detect the location of the vocal cords and tracheal rings, not classify vocal cord abnormalities.
Based on this background, this study aims to create a Raspberry Pi-based device prototype for identifying vocal cord abnormalities. Raspberry Pi was chosen for its portability. The devised system uses deep learning with the convolutional neural network (CNN) method and can perform identification in real time. CNNs are commonly used in image recognition and pattern detection [13]. A CNN learns both how to extract image features and how to classify them [14]. With this method, manual parameter settings are no longer needed to obtain prediction results, which makes the software more convenient to use. This prototype is expected to help doctors diagnose vocal cord conditions, especially vocal cord abnormalities, and to support the future development of medical technology.

A. Vocal Cord Abnormalities
Vocal cord abnormalities such as cysts, granulomas, nodules, papilloma, and paralysis are structural disorders caused by some larynx lesions. The characteristic observed while examining the patient is the physical form of the vocal cords in the image. Fig.1 shows images of vocal cord abnormalities as well as a normal one [7].

B. Prototype Design
The devised prototype system has three main components, i.e., a camera, a Raspberry Pi, and a display monitor. The camera's primary function is image acquisition. The Raspberry Pi then receives the image from the camera and performs image processing. The classification result is then shown on the display monitor. The prototype design is shown in Fig.2.

C. Data Preparation and Pre-processing
We obtained the image data from previous work [7], 120 images in total. First, the images are preprocessed to bring them to a common standard. The preprocessing steps are cropping, grayscale conversion, and resizing to 64 × 64 pixels. Training a CNN model requires a large amount of data, so data augmentation was performed to enlarge the dataset [15]. The augmentation techniques chosen are width shift, height shift, zoom range, horizontal flip, and rotation range. These techniques transform the data without losing any critical information. In total, we obtained 7100 images from the augmentation process. Validation is a crucial step when forming a model so that the model can generalize to new data. One validation technique is hold-out, which splits the data into three parts, i.e., a training dataset, a validation dataset, and a test dataset. The training and validation datasets are used during the training process, while the test dataset is used to test the model obtained from training. The data are allocated to the training, validation, and test datasets in an 80:10:10 ratio.
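The augmentation and hold-out steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the function names `horizontal_flip`, `width_shift`, and `hold_out_split` are hypothetical, images are assumed to be 2D lists of grayscale values, and the rounding used for the 80:10:10 split is an assumption, since the text does not state it.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D grayscale image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def width_shift(image, shift, fill=0):
    """Shift every row `shift` pixels to the right (left if negative).

    Pixels that move out of the frame are dropped and the vacated
    positions are padded with `fill`.
    """
    width = len(image[0])
    shift = max(-width, min(width, shift))  # clamp to the image width
    shifted = []
    for row in image:
        if shift >= 0:
            new_row = [fill] * shift + list(row[:width - shift])
        else:
            new_row = list(row[-shift:]) + [fill] * (-shift)
        shifted.append(new_row)
    return shifted


def hold_out_split(items, percents=(80, 10, 10), seed=0):
    """Shuffle `items` and split them into train/validation/test parts.

    Integer percentages avoid floating-point rounding surprises; any
    remainder goes to the test part (an assumption of this sketch).
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Each augmented copy keeps the label of its source image, which is why such transforms can enlarge a small labelled set. In practice, a library routine with interpolation would replace the simple padding used here.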

D. CNN Architecture
The architecture of a CNN affects model performance. Two models were chosen to be trained with the dataset because of their history in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The first model is LeNet-5, which was developed by Yann LeCun. This architecture was chosen as one option for the network architecture. LeNet-5 is shown in Fig.3(a). The second model is AlexNet, the winner of ILSVRC 2012, with an error of 16.4% on 1000-class image classification [16]. It has eight layers. Because the full AlexNet is too heavy to run on real-time video on a Raspberry Pi, we modified the architecture with a 1/16 reduction of the depth of each convolution layer. We also changed the input to a grayscale dimension (shown in Fig.3(b)). This modified AlexNet is hereafter called our proposed model or modified AlexNet.
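To give a feel for how much a 1/16 depth reduction shrinks the convolution stack, the sketch below counts convolution weights before and after the reduction. It rests on several assumptions that are not taken from the paper: the reduction is read as dividing each layer's filter count by 16, the filter counts and kernel sizes are those of the original AlexNet, grouped convolutions are ignored, and the helper names are hypothetical.

```python
# Filter counts and kernel sizes of the five convolution layers in the
# original AlexNet. The paper's modified model may use other kernel sizes.
ALEXNET_FILTERS = [96, 256, 384, 384, 256]
ALEXNET_KERNELS = [11, 5, 3, 3, 3]


def reduce_filters(filters, factor=16):
    """Divide every filter count by `factor`, keeping at least one filter."""
    return [max(1, f // factor) for f in filters]


def conv_param_count(filters, kernels, in_channels):
    """Count weights and biases of a plain (ungrouped) convolution stack."""
    total = 0
    for out_channels, k in zip(filters, kernels):
        total += k * k * in_channels * out_channels + out_channels
        in_channels = out_channels  # next layer reads this layer's output
    return total


# Original stack on RGB input versus reduced stack on grayscale input.
original_params = conv_param_count(ALEXNET_FILTERS, ALEXNET_KERNELS, 3)
reduced_filters = reduce_filters(ALEXNET_FILTERS)
reduced_params = conv_param_count(reduced_filters, ALEXNET_KERNELS, 1)
```

Because the weight count of a convolution layer scales with the product of its input and output channels, dividing both by 16 reduces most layers by roughly a factor of 256, which is what makes real-time inference on a small board plausible.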

E. Training Model
Training parameters also affect model performance. Several parameters have a significant effect on the model, i.e., input dimensions, epoch, batch size, iterations, dropout, and learning rate. Some parameter exploration is needed to find the optimum parameters for the training process and the CNN architecture. In this study, the parameters varied are input dimensions, pooling layer type, convolution kernel size, learning rate, dropout, and epoch. Training is performed by initializing the value of each parameter and then varying each one gradually; the best value of each parameter is chosen.
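The tuning procedure described above, starting from initial values and varying one parameter at a time, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `tune_one_at_a_time` and `toy_score` are hypothetical names, and `toy_score` is a stand-in that peaks at fixed values by construction rather than a real training run.

```python
def tune_one_at_a_time(initial, candidates, evaluate):
    """Vary one parameter at a time and keep the best value found so far.

    `initial` maps parameter names to starting values, `candidates` maps
    names to the values to try, and `evaluate` returns a score where
    higher is better (for example, validation accuracy).
    """
    best = dict(initial)
    best_score = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best)
            trial[name] = value
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_score(params):
    """Toy stand-in for 'train the model and return validation accuracy'."""
    return -abs(params["learning_rate"] - 0.001) - abs(params["kernel_size"] - 5)


initial = {"learning_rate": 0.01, "kernel_size": 3}
candidates = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "kernel_size": [3, 5, 7],
}
best_params, best_score = tune_one_at_a_time(initial, candidates, toy_score)
```

This search needs far fewer training runs than a full grid, at the cost of missing interactions between parameters.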

F. Test Scenarios
After obtaining the optimum model from training, the model is tested using the test dataset. For the test process, the prototype is already integrated. A tablet screen is additionally prepared to display the test dataset. The test process starts by showing a test image on the tablet screen. The camera then captures the vocal cord image shown on the tablet screen. The Raspberry Pi receives the input image, processes it using the proposed CNN model, and produces a prediction or classification result, which is shown on the display monitor. Fig.4 shows the flowchart of the CNN model test scenario.
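The capture, classify, and display flow of the test scenario can be outlined with plain callables. Everything below is a hypothetical sketch: the camera, preprocessing, model, and display are passed in as functions so that no specific camera or deep learning API is assumed, and the order of `CLASS_NAMES` is an assumption, not the authors' label encoding.

```python
# Seven classes named in the paper; the index order here is assumed.
CLASS_NAMES = ["cyst", "granuloma", "nodule", "normal",
               "papilloma", "paralysis", "none"]


def predict_label(scores, class_names=CLASS_NAMES):
    """Return the class name with the highest score and that score."""
    if len(scores) != len(class_names):
        raise ValueError("expected one score per class")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return class_names[best_index], scores[best_index]


def run_test_scenario(capture, preprocess, model, display):
    """One pass of the flow: camera frame -> preprocessing -> CNN -> monitor.

    All four arguments are callables supplied by the caller.
    """
    frame = capture()
    scores = model(preprocess(frame))
    label, confidence = predict_label(scores)
    display(label, confidence)
    return label
```

On the prototype, `capture` would wrap the camera pointed at the tablet screen and `model` would wrap the trained CNN. In a unit test they can be simple stubs.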

A. CNN Architecture Evaluation
Our proposed model (modified AlexNet) and LeNet-5 are trained with initial parameters to find the best architecture. The initial parameters are epoch, dropout, learning rate, convolution kernel size, and max pooling type. Both models are evaluated by their accuracy and loss over the epochs. Fig.5 shows the training results of the modified AlexNet and LeNet-5. Fig.5(a) shows that, for both modified AlexNet and LeNet-5, the difference between the loss/error value and the validation loss value is not significant. This means that neither model overfits. LeNet-5 has a shorter training time. However, modified AlexNet has a lower loss value than LeNet-5. Also, Fig.5(b) shows that the accuracy of modified AlexNet is higher than that of LeNet-5 (see Table 1). Thus, it can be said that the modified AlexNet performs better than LeNet-5. The modified AlexNet can explore more complex and detailed features (low-level features) because it has a deeper architecture than LeNet-5. Therefore, modified AlexNet is chosen as the CNN architecture applied in this study.

B. Results of Training Model
Training is carried out on a laptop with the data acquired from previous work [7]; the data distribution is explained in Subsection II.C. Statistical methods were applied to tune the parameters and find their optimum values. This section presents the results of the training experiments.

a) Input Dimensions
Input dimensions can affect model accuracy because they determine the amount of input information. If the input carries only a small amount of information, some vital information may be lost. However, too much information increases the computational cost and makes the model hard to run on the Raspberry Pi.
We varied the input dimensions among 28 × 28, 32 × 32, and 64 × 64 pixels. The results in Table 2 show that the model performs best with 64 × 64 pixel inputs, because this setting has the highest validation accuracy and the lowest loss and validation loss. A larger size takes more time to compute because it needs more layers to explore the image. However, a smaller input image may lose some information and yield lower accuracy.

b) Pooling Layer
The pooling layer reduces the data dimensions. It makes the model less sensitive to noise and variations. There are two methods for this process: taking the average or the maximum value of the kernel window. We found that, in this case study, the average method provides the best result. However, the average method provides the same result in other study cases. The details are shown in Table 3.

c) Convolution Kernel Size
The convolution kernel size affects the number of learnable parameters. Commonly applied kernel sizes are 3 × 3, 5 × 5, and 7 × 7. The results in Table 4 show that the 5 × 5 and 7 × 7 kernel sizes have similar validation accuracy, which is higher than that of 3 × 3. Also, the smallest kernel size has the highest loss. This is because smaller filters may collect more information and can distinguish low-level features, yet they require a deeper architecture, while a larger kernel size has a wide area of observation and finds it hard to differentiate detailed characteristics. Thus, we chose 5 × 5 as the convolution kernel size.

d) Learning Rate
The learning rate controls how strongly the weights respond to each update. A higher learning rate may reach the convergence point in a shorter time. Nevertheless, if the value is too large, the weight changes become too responsive to the error value and the training does not reach the convergence point. Conversely, a lower learning rate takes longer to reach the convergence point but has a higher probability of reaching it.
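The trade-off between large and small learning rates can be seen on a one-dimensional toy problem. The sketch below minimizes f(w) = w² with plain gradient descent. It is not the paper's training loop, and the learning rates used in the example are chosen for this toy function only, so they do not correspond to the values tested in the study.

```python
def gradient_descent(learning_rate, steps=50, w0=1.0):
    """Minimize f(w) = w**2 from `w0` and return the final weight.

    The gradient is 2*w, so each step multiplies w by (1 - 2*learning_rate).
    The iteration converges to the minimum at 0 only if that factor has
    magnitude below 1.
    """
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= learning_rate * grad
    return w


too_large = gradient_descent(1.1)    # factor -1.2: oscillates and diverges
moderate = gradient_descent(0.1)     # factor 0.8: converges quickly
too_small = gradient_descent(0.001)  # factor 0.998: barely moves in 50 steps
```

The too-small rate would still converge given many more steps, which mirrors the observation that a lower learning rate needs more epochs.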
We varied the learning rate among 0.01, 0.001, and 0.0001 and found that the stable learning rate value was 0.001 (shown in Table 5). A learning rate of 0.01 failed to reach the convergence point because the value is too high, while a learning rate of 0.0001 needs more epochs to reach the convergence point. Therefore, we chose 0.001 as the learning rate.

e) Dropout
As a regularization technique, dropout deactivates some neurons, thus reducing overfitting of the trained model. In the initial parameters, we used 0.03 as the dropout value. By varying this value, we found that the highest accuracy was obtained by the model trained without dropout (shown in Table 6). However, we chose 0.03 as the dropout value because its difference between loss and validation loss is smaller than without dropout.

f) Epoch
As shown in Table 7, the highest validation accuracy is reached when the epoch value is 50. Too few epochs leave the weights not yet optimal, so the model may not classify correctly, while too many epochs merely cause the weights to memorize the training data, so the model may not recognize the characteristics of the test dataset correctly.
After tuning the parameters, we obtained the optimum parameters: input dimensions of 64 × 64 pixels, an average pooling layer, a convolution kernel size of 5 × 5, a learning rate of 0.001, no dropout, and 50 epochs. The training performance of the model before and after optimizing the parameters is shown in Table 8. The model with optimized parameters obtained 95.35% accuracy and performs better than the model before optimization.
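As an illustration of the dropout mechanism discussed in this subsection, the following sketch applies inverted dropout to a list of activations during training. The function name `dropout` and the use of a seeded `random.Random` are assumptions made for the example. The paper does not state which framework or dropout variant the authors used.

```python
import random


def dropout(activations, rate, rng=None):
    """Apply inverted dropout to a list of activations (training time only).

    Each activation is set to zero with probability `rate`. The survivors
    are scaled by 1 / (1 - rate) so that the expected sum is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]


# With a rate of 0.03, about 3% of the units are silenced on each pass.
sample = dropout([1.0] * 1000, rate=0.03, rng=random.Random(0))
dropped = sum(1 for v in sample if v == 0.0)
```

At inference time dropout is switched off, so the deployed model uses all neurons.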

C. Results of CNN Model Test Scenario in Raspberry Pi
After the model was trained and its parameters optimized, it was deployed to the Raspberry Pi and tested using the prototype. The confusion matrix of the model tested on the Raspberry Pi is shown in Table 9. It shows that nodules are hard to predict, while granuloma is easier to predict correctly with the CNN model. According to the table, the CNN model predicted 567 of the 711 test images correctly. The accuracy is calculated as follows:
Accuracy = (567 / 711) × 100% ≈ 79.75%
Thus, the accuracy is 79.75%. The test scenario was performed with the built prototype, as shown in Fig.6. Examples of classification results are shown in Fig.7.

IV. DISCUSSION
The test accuracy reaches only 79.75% because the built prototype acquires its data from the tablet screen. In that process, considerable noise arises from the environment, such as the brightness of the ambient light.
Specifically, the CNN model performance for each class is shown in Table 3. An accuracy of 79.75% means that roughly four out of five test images are predicted correctly. Furthermore, the performance of the CNN model in predicting each class can be reviewed from the recall and precision values.
In the table, the recall value is higher than the precision value for the paralysis, cyst, and none classes. For example, for the none class, the model classifies all none images correctly, even though images of other classes are also classified as none. An illustration can be seen in Fig.8(a). Conversely, the precision value is higher than the recall value for the normal and papilloma classes. For example, for the normal class, the model classifies only part of the normal images as normal. An illustration can be seen in Fig.8(b).
When the recall and precision values are both high, the classification result is good, as for the granuloma class, whereas the recall and precision values for nodules are both low, which shows that the modified CNN model studied here is not suitable for classifying nodules.
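The relationship between recall and precision discussed above can be made concrete with a small sketch that derives both from a confusion matrix. The matrix `TOY_CONFUSION` is invented for illustration and is not the paper's Table 9. Rows are assumed to hold the actual classes and columns the predicted classes, and the function names are hypothetical.

```python
def per_class_metrics(confusion, class_names):
    """Return {class: (precision, recall)} from a square confusion matrix.

    confusion[i][j] counts samples of actual class i predicted as class j.
    """
    metrics = {}
    for i, name in enumerate(class_names):
        true_positive = confusion[i][i]
        actual_total = sum(confusion[i])
        predicted_total = sum(row[i] for row in confusion)
        recall = true_positive / actual_total if actual_total else 0.0
        precision = true_positive / predicted_total if predicted_total else 0.0
        metrics[name] = (precision, recall)
    return metrics


def accuracy(confusion):
    """Fraction of all samples that lie on the matrix diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical three-class example, not measured data.
TOY_CLASSES = ["none", "normal", "nodule"]
TOY_CONFUSION = [
    [10, 0, 0],  # every "none" sample is found: recall 1.0
    [2, 6, 2],   # only part of "normal" is found, but never falsely predicted
    [3, 0, 7],
]
toy_metrics = per_class_metrics(TOY_CONFUSION, TOY_CLASSES)
```

In this toy matrix the "none" class has recall above precision because samples of other classes are also labelled as none, while "normal" has precision above recall because only part of the normal samples are recognized. These are the two patterns described for Fig.8(a) and Fig.8(b).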

V. CONCLUSION
This study has developed a CNN model that is applied to a prototype using Raspberry Pi. Of the two models considered, LeNet-5 and modified AlexNet, modified AlexNet was chosen because its loss and validation loss values are small and show no overfitting. We also optimized the training parameters of the modified AlexNet model. The model trained with optimized parameters achieved 95.35% accuracy, while the CNN model on the Raspberry Pi achieved 79.75%. This accuracy is lower than the training performance because of environmental noise when acquiring images with the prototype.

ACKNOWLEDGMENT
Thanks to Dustira Hospital Cimahi, which kindly lent the data for this study.