Accuracy Analysis of K-Nearest Neighbor and Naïve Bayes Algorithm in the Diagnosis of Breast Cancer

— In the medical field, there are many records of disease sufferers, including data on breast cancer. The process of extracting previously unknown information from data is known as data mining. Data mining uses pattern recognition techniques, such as statistics and mathematics, to find patterns in old data or cases. One of the prominent roles of data mining is classification. A classification dataset contains one objective attribute, also called the label attribute. The value of this attribute is predicted for new data based on the other attributes of past data. The number of attributes can affect the performance of an algorithm. If the classification process is inaccurate, the researcher needs to double-check each previous stage to look for errors. The best algorithm for one data type is not necessarily suitable for another data type. For this reason, the K-Nearest Neighbor and Naïve Bayes algorithms are used as a solution to this problem. The research method used was to prepare data from the breast cancer dataset, conduct training and testing on the data, and then perform a comparative analysis. The research target is to produce the best algorithm for classifying breast cancer so that, given a patient's existing parameters, it can be predicted whether the breast cancer is malignant or benign. This pattern can be used as a diagnostic measure for earlier detection and is expected to reduce the mortality rate from breast cancer. The comparison yields an accuracy of 95.79% for K-Nearest Neighbor and 93.39% for Naïve Bayes.


I. INTRODUCTION
Classification is widely used to make decisions based on new knowledge gained from processing past data with algorithms. A classification dataset contains one objective attribute, also called the label attribute. The value of this attribute is predicted for new data based on the other attributes of past data. The number of attributes can affect the performance of an algorithm. If the classification process is inaccurate, the researcher needs to double-check each previous stage to look for errors. Data types significantly affect the performance and accuracy of an algorithm, and the best algorithm for one data type is not necessarily suitable for another. In general, the malignancy level of breast cancer is assessed by means of what is called a prognosis. The prognosis is the medical team's "best guess" in determining whether a patient will be cured of breast cancer or not. Apart from prognosis, another approach is bioinformatics using data mining techniques, since these have been shown to detect breast cancer's malignancy level [1]. As information technology advances, especially in artificial intelligence, machine learning techniques are being introduced to improve automatic detection capabilities. With the help of such a system, the possibility of misdiagnosis by medical professionals can be avoided, and medical data can be checked in less time and in more detail [2].
Several data mining methods that are widely used for classification include the K-Nearest Neighbor and Naïve Bayes algorithms. The K-Nearest Neighbor method classifies objects based on the learning data closest to the object, according to the number of closest neighbors, or the value of 'k'. Meanwhile, the Naïve Bayes method performs classification based on probability and the Bayesian theorem, with the assumption that each variable X is independent. At present, K-NN and Naïve Bayes have been widely applied to real-world problems [3], in areas such as detection of tumor types using Naïve Bayes [4], classification of kidney stones using K-NN [5], prediction of heart disease using Naïve Bayes [6], Naïve Bayes classification for predicting colon cancer [7], sentiment analysis on Twitter using Naïve Bayes [8], detection of abnormal behavior using Naïve Bayes [9], and an improved KNN text classification algorithm based on K-Medoids and rough sets [10]. This study aims to classify breast cancer so that, given a patient's existing parameters, it can be predicted whether the breast cancer is malignant or benign. This pattern can be used as a diagnostic measure for earlier detection and is expected to reduce the mortality rate from breast cancer.

A. System Description
The system is designed to classify cancer data using the K-NN and Naïve Bayes algorithms. The disease is divided into two classes, namely benign and malignant. The process applied by the system is divided into three stages: pre-processing, classifier design, and post-processing. For more details, the system flow can be seen in Fig. 1. The pre-processing stage starts with the data collection process. The collected data are then grouped based on their influence on each class. After that, the data are normalized and assigned to the appropriate class. Afterward, the data are split into two groups: 80% for training and 20% for testing.
After the pre-processing stage is complete, the data is then entered into each classifier as knowledge. The classifier then learns from the data that has been entered and evaluated. If any of the specified attributes have not been trained, the system training process will be repeated with a different structure and function.
The third stage is the post-processing stage, where the classification results are displayed in a form that is easier to understand. The system will display whether the cancer is benign or malignant [11].

B. K-Nearest Neighbor Classifier
The K-NN algorithm classifies an object using the learning data closest to that object.

a) Data Normalization
Data normalization is carried out so that there is a balance of data across each attribute used. Z-score normalization is a normalization method based on the mean (average value) and standard deviation. This method is useful if the actual minimum and maximum values of the data are unknown [12]. The Z-score is a measure of the deviation of a data point from its average value, measured in standard deviation units. If the value is above the average, the Z-score will be positive, while if the value is below the average, the Z-score will be negative. The Z-score is also called the standard value. The benefit of standardizing raw scores or observed values from a normal distribution into Z-scores is that it allows us to calculate the probability of a score occurring within the normal distribution, and also to compare two scores coming from different populations.
It should be noted that the Z-score is only helpful or meaningful if it is calculated for observations from a normal distribution. The standard normal distribution is a normal distribution with a mean of zero (0) and a standard deviation of one (1). To find the Z-score or standard value, we need to know the mean and standard deviation of the population. The Z-score is calculated by subtracting the population mean from the observed value (raw score) and then dividing by the standard deviation:

z = (x − μ) / σ (1)

Where:
z : Z-score (standard value)
x : observed value (raw score)
μ : mean
σ : standard deviation

b) Calculation of Proximity between Test Data and Training Data
The distance between the new data and training data 1 is calculated using the Euclidean distance:

d(x, y) = √( Σ (xₖ − yₖ)² ), k = 1, 2, …, n (2)

Where:
d(x, y) : the distance between test data x and training data y
xₖ : the value of attribute k from the test data (x)
yₖ : the value of attribute k from the training data (y)

After the distance or dissimilarity (d) is calculated, it is converted into a similarity (s) on the interval 0 to 1 (s ∈ [0, 1]).
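The Z-score normalization described above can be sketched in a few lines of Python. The radius values below are hypothetical, chosen only to illustrate the calculation; the population standard deviation (dividing by n) is used here.

```python
import math

def z_score_normalize(values):
    """Normalize raw scores to Z-scores: z = (x - mean) / std."""
    mean = sum(values) / len(values)
    # Population standard deviation (divide by n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Hypothetical radius measurements for illustration only
radii = [14.0, 18.0, 12.0, 20.0]
z = z_score_normalize(radii)
print([round(v, 3) for v in z])  # values above the mean come out positive
```

As the text notes, values above the mean map to positive Z-scores and values below it to negative ones, and the normalized attribute always has mean 0.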
c) Cross-Validation
Cross-validation is a simple statistical technique. The standard number of folds for estimating the error rate of a model is 10-fold cross-validation [13]. Cross-validation is used to find the best parameters of a model [14]. This is done by measuring the amount of error on the testing data. In cross-validation, the data is divided into k samples of equal size; k − 1 of these subsets are used as training data and the one remaining subset as testing data, rotating through all k subsets. This is often called k-fold cross-validation.

d) Confusion Matrix
Confusion Matrix is a table used to evaluate the performance of a classification model. It shows the number of correctly predicted data and incorrectly predicted data compared with the actual facts. Table 1 shows the Confusion Matrix [15].
a: the number of data predicted by the system as benign whose actual diagnosis is benign.
b: the number of data predicted by the system as malignant whose actual diagnosis is benign.
c: the number of data predicted by the system as benign whose actual diagnosis is malignant.
d: the number of data predicted by the system as malignant whose actual diagnosis is malignant.
There are several terms based on Table 1.
- True Positive (TP) is positive data correctly indicated by the model. The TP value can be calculated using (4).
- False Positive (FP) is positive data incorrectly indicated by the model. The FP value can be calculated using (5).
- True Negative (TN) is negative data correctly indicated by the model. The TN value can be calculated using (6).
- False Negative (FN) is negative data incorrectly indicated by the model.
Measurement of accuracy is a step to prove the level of performance of an algorithm on the dataset used. In this research, a confusion matrix is used as the performance measurement tool for the classification algorithms. The confusion matrix compares the classification results with the actual data against the total amount of data. The final result of this matrix is the level of accuracy in percent (%). This accuracy level will later be used as the researchers' reference for the classification algorithm's performance. The confusion matrix contains a comparison of the classification labels with the actual labels. From Table 1, the accuracy of an algorithm model can be calculated using (7):

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100% (7)

C. Naïve Bayes Classifier
Classification using the Naïve Bayes algorithm is a classification method based on the Bayes Theorem, assuming the parameters are mutually independent. Bayes' theorem provides a way to calculate the probability of a parameter's value using the values of other parameters.
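The accuracy calculation in (7) can be sketched directly from the four confusion matrix cells. The counts below are purely hypothetical, for illustration; they are not the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions, as a percentage,
    following (7): (TP + TN) / (TP + TN + FP + FN) * 100%."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion matrix counts for illustration only
print(round(accuracy(tp=90, tn=50, fp=5, fn=5), 2))  # 140 correct of 150
```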

Calculation of Probability and Classification of Test Data
- After the data is divided into training data and test data, the standard deviation and mean are calculated for each target parameter class (Diagnosis) for each attribute.
- The standard deviation and mean per target parameter class (Diagnosis) per attribute are then used to classify test data 1.
- Naïve Bayes classification calculates the probability of the Diagnosis parameter value based on the values of the other parameters. The probability is calculated using the Gaussian Naïve Bayes formula:

P(x | c) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

where μ and σ are the mean and standard deviation of the attribute for class c.
- After each attribute's probability is calculated, it is multiplied into the probability of the Diagnosis value.

D. Testing Design
In this study, the test measures the system's classification accuracy using the K-NN algorithm and the Naïve Bayes algorithm. Accuracy is measured using the k-fold cross-validation method. The accuracy results of the two algorithms are then analyzed for comparison. In addition to the accuracy results, a comparison is also made of the length of time each algorithm takes to perform the classification process on the prepared test data. The test scheme is shown in Fig. 2.

A. Cancer Data Compilation Process
The data used as training data and test data are data about breast cancer. There are 455 training data, consisting of 284 benign cancer cases and 171 malignant cancer cases.

B. Process of Arranging Attributes
The attributes used for classification are radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, based on the data obtained. The values stored for each attribute are the measurement average (mean), the standard error of the measurement (se), and the worst (largest) value (worst).

C. K-NN Classifier Process
The K-NN classification process is done by comparing the similarity between the test data and the training data owned by the system. Training cases whose values are more similar to the test data are collected as candidate solutions. As many cases as the value of k are collected, so the k cases with the highest similarity values are used as the solution set. The diagnosis class with the highest frequency among them is taken and displayed as the system's solution [11]. Examples of cases in the training data are shown in Table 2. Users classify breast cancer data by entering it into the system as the test data shown in Table 3. The process of classifying the test data in Table 3 using the training data in Table 2 is divided into several steps, namely calculating proximity, sorting by the highest proximity, and determining the solution as the result of the classification.

a) Proximity Calculation Process
The distance between the new data and training data 1 is calculated using the Euclidean distance. After the closeness between the new data and training data 1 has been calculated, the similarity result is saved for later comparison with the closeness between the new data and the other training data. The closeness between the new data and each of the other training data is calculated in the same way as for training data 1.
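The proximity calculation can be sketched as below. The two-attribute vectors are hypothetical, and the conversion s = 1 / (1 + d) is one common way to map a distance into a similarity on [0, 1]; the paper states the conversion but not the exact formula, so this is an assumption.

```python
import math

def euclidean_distance(x, y):
    """d(x, y) = sqrt of the sum over attributes of (x_k - y_k)^2"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(d):
    """Assumed conversion of dissimilarity d into a similarity in [0, 1]."""
    return 1.0 / (1.0 + d)

# Hypothetical test datum and training datum with two attributes
test_data = [3.0, 4.0]
training_data_1 = [0.0, 0.0]
d = euclidean_distance(test_data, training_data_1)
print(d, round(similarity(d), 3))  # 5.0 0.167
```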

b) The Highest Similarity Sorting
After the closeness between the new data and all the training data has been calculated, the next step is to sort the training data by its proximity to the new data, from highest to lowest. The sorted proximity values for the new data against the training data are shown in Table 4. Based on the proximity calculations in Table 4, the k closest neighbors are taken, with k = 4, so the closest neighbors used in the next stage are Training Data … Training Data 5. The k closest neighbor data are shown in Table 5.
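The sorting and majority-vote steps can be sketched as follows. The similarity values and labels are hypothetical, not the values of Tables 4 and 5.

```python
from collections import Counter

def knn_classify(neighbors, k):
    """neighbors: list of (similarity, label) pairs. Sort by similarity
    descending, keep the top k, and return the majority class among them."""
    top_k = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Hypothetical similarities between the test datum and five training data
neighbors = [(0.91, "B"), (0.88, "M"), (0.95, "B"), (0.62, "M"), (0.90, "B")]
print(knn_classify(neighbors, k=4))  # three of the top four vote "B"
```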

D. Naïve Bayes Classifier Process
The Naïve Bayes classification process is carried out by calculating the highest probability using a formula based on the Bayes Theorem. Because the available cancer data is continuous, the formula used to calculate the probability is the Gaussian Naïve Bayes Formula.

a) Data Processing
The sample data used as training data and test data for the Naïve Bayes classification are shown in Tables 6 and 7. The data are randomly divided into 80% training data and 20% test data.

b) Probability Calculation and Test Data Classification
After the data is divided into training data and test data, the standard deviation and mean are calculated for each target parameter class (Diagnosis) for each attribute. The standard deviation and mean values are shown in Table 8 and Table 9.
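The per-class statistics step can be sketched as below. The tiny training set is hypothetical, standing in for the values of Tables 8 and 9; the population standard deviation (dividing by n) is assumed.

```python
import math

def class_statistics(rows):
    """rows: list of (label, attribute_values). Returns, per class,
    a (mean, std) pair for every attribute, as Naive Bayes training requires."""
    grouped = {}
    for label, values in rows:
        grouped.setdefault(label, []).append(values)
    stats = {}
    for label, samples in grouped.items():
        n = len(samples)
        per_attr = []
        for k in range(len(samples[0])):
            col = [s[k] for s in samples]
            mean = sum(col) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
            per_attr.append((mean, std))
        stats[label] = per_attr
    return stats

# Tiny hypothetical training set: (diagnosis, [radius, texture])
rows = [("B", [12.0, 18.0]), ("B", [14.0, 20.0]),
        ("M", [18.0, 22.0]), ("M", [20.0, 24.0])]
print(class_statistics(rows))
```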

c) Classification Process
After the standard deviation and mean per target parameter class (Diagnosis) per attribute have been obtained, they are used to classify test data 1. Naïve Bayes classification calculates the probability of the Diagnosis parameter value based on the values of the other parameters. The probability is calculated using the Gaussian Naïve Bayes formula.

e) Testing K-NN Classifier
The test carried out on the system is the k-fold cross-validation test with k = 10, using data that has been previously randomized, with details of 280 benign cancer cases and 170 malignant cancer cases. The randomized data is divided into 10 folds, each fold containing 45 data. The division of data into folds is shown in Table 11. The k-fold cross-validation test was carried out on the K-NN classifier by dividing the data as in Table 10. The test results using k = 10 for k-fold cross-validation are shown in Table 12.

f) Testing Naïve Bayes Classifier
The testing of the Naïve Bayes classifier uses k-fold cross-validation with the data shown in Table 11. The test is carried out by entering the test data one by one into the system and then recording the classification result and the system's running time for performing the classification. The test results are shown in Table 13.

IV. DISCUSSION
Based on the accuracy calculation using k-fold cross-validation with k = 10, the K-Nearest Neighbor algorithm's average accuracy is 95.79%. The average accuracy for k = 10 is better than for k = 7, with an accuracy of 95.64%, and k = 5, with an accuracy of 95.38%, with the confusion matrix shown in Table 14. Based on the confusion matrix in Table 14, it can be seen that the system correctly classifies 279 benign cancers and 159 malignant cancers.

In addition, the true condition values can be seen in Table 15. True Positive (TP) is the count of correct classifications for each class. False Positive (FP) is the count of classifications where the actual data belongs to another class but is classified into class A; for example, the original data is class B (benign) but the classification result is M (malignant). True Negative (TN) is the count of classifications into other classes whose original data is indeed another class. False Negative (FN) is the count of classifications where the original data is class A but the classification result is not class A; for example, the original data is class M but the classification result states class B. These values for each class can be seen in Table 15.

Based on the confusion matrix in Table 16, it can be seen that the system correctly classifies 276 benign cancers and 152 malignant cancers. In addition, the true condition values showing the TP, TN, FP, and FN of each class can be seen in Table 17. A comparison of the K-Nearest Neighbor classifier results with Naïve Bayes is shown in Table 18.

V. CONCLUSION
The test results show that the system using the K-NN classification method was able to classify 438 data correctly, while Naïve Bayes correctly classified 428 data, using the k-fold cross-validation test with k = 10. This shows that the K-NN classification method has better accuracy than the Naïve Bayes classification method on the data used. The K-NN method achieves higher accuracy because the Naïve Bayes algorithm is a parametric algorithm that assumes each attribute of the data is independent, a property that is rare in the real world. However, the average time required by the K-NN method to perform a classification is much longer, because the K-NN algorithm calculates the distance between each training datum and the test data, whereas the Naïve Bayes algorithm only needs to calculate the standard deviation and mean once for all test data.