Classification Based on Configuration Objects by Using Procrustes Analysis

Abstract: Classification is a data mining task that predicts which group an object belongs to. The prediction can be performed by using similarity measures, classification trees, or regression. Procrustes analysis, on the other hand, is a technique for matching two configurations that has been applied to outlier detection. Based on that result, Procrustes has the potential to tackle the misclassification problem when outliers are regarded as misclassified objects. Therefore, this paper proposes the Procrustes classification algorithm (PrCA) and the Procrustes nearest neighbor classification algorithm (PNNCA). The results of these algorithms were compared with those of classical classification algorithms, namely k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), AdaBoost (AB), Random Forest (RF), Logistic Regression (LR), and Ridge Regression (RR). The datasets used were iris, cancer, liver, seeds, and wine. The minimum and maximum accuracy values obtained by PrCA were 0.610 and 0.925, while those of PNNCA were 0.610 and 0.963. PrCA was generally better than k-NN, SVM, and AB, while PNNCA was generally better than k-NN, SVM, AB, and RF. Based on these results, PrCA and PNNCA merit consideration as a new approach to classification.


I. INTRODUCTION
Classification is one of the most popular topics in data mining [1]. Classification methods have been applied in several fields. For example, in food and agriculture, classification has been used to evaluate food quality [2] and to predict soil fertility [3] [4]. It has also been used for diagnosing disease [5] [6]. In forecasting, it can predict the weather [7] and the failure of electrical devices [8]. In addition, classification has been used to assess the performance of employees [9] and students [10].
The principal task of a classification method is to predict which of the available groups an object belongs to. The prediction can be performed by using similarity measures, classification trees, or regression. Similarity measures are used by k-Nearest Neighbor (k-NN) [11] and Support Vector Machine (SVM) [12]. Classification trees are used by AdaBoost (AB) [13] and Random Forest (RF) [14]. Regression approaches are used by Logistic Regression (LR) [15] and Ridge Regression (RR) [16].
Procrustes analysis, on the other hand, is a technique for matching two configurations. It formulates the least-squares problem of fitting a configuration Y to a configuration X by using an orthogonal matrix Q [17]. The first formulation was the ordinary Procrustes analysis (OPA). Procrustes was later developed into the Full Procrustes Mean (FPM) [18] and the Goodness-of-fit of Procrustes (GoFP) [19]. Procrustes has recently been applied in several studies, namely to select variables [20], measure the quality of biplot analysis [21] [22], measure the quality of data imputation [23] [24], detect outliers [25], and solve the shape clustering problem [26]. Based on the outlier detection result, Procrustes has the potential to tackle misclassification problems when outliers are regarded as misclassified objects. Therefore, this paper carries out the classification process by using Procrustes. This is a new strategy in which configurations are used as the basis for classification. Two Procrustes algorithms are proposed in this paper, the Procrustes classification algorithm (PrCA) and the Procrustes nearest neighbor classification algorithm (PNNCA). The concept of k-NN classification is involved in PNNCA. The classification results of the proposed algorithms are compared with those of classical classification methods, namely k-NN, SVM, AB, RF, LR, and RR. The datasets used in this paper are iris, cancer, liver, seeds, and wine.
This paper differs from previous work in that it involves Procrustes in the classification process itself. It is hoped that this contributes meaningful knowledge, especially about the classification process. The paper is arranged as follows. Section II gives a brief history of Procrustes analysis. Section III describes the research method used in this paper. Section IV presents the results, and Section V discusses them. The conclusion is given in the last section.

II. A BRIEF HISTORY OF PROCRUSTES ANALYSIS
In Greek mythology, Procrustes was a bandit who tortured his guests to make them fit his bed perfectly, either by stretching their limbs or by cutting them off. In mathematics, Procrustes refers to a technique of matching two configurations and producing a measure of the match. The configurations are matrices of the same size. Suppose $X$ is an $n$-by-$p$ configuration matrix and $Y$ is an $m$-by-$q$ configuration matrix. If $n = m$ and $q < p$, then $Y$ is prepared for optimal matching to $X$ by adding an $m$-by-$k$ matrix, where $k = p - q$. Similarly, if $m < n$ and $p = q$, then an $l$-by-$p$ matrix must be added to $Y$, where $l = n - m$ [22]. To measure the difference between $X$ and $Y$, Procrustes uses the sum of the squared distances $E(X, Y)$, given by Equation 1.
$$E(X, Y) = \sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij} - y_{ij})^2 = \operatorname{trace}\big((X - Y)'(X - Y)\big) \tag{1}$$
Geometrically, Procrustes works to minimize $E(X, Y)$ by using a series of Euclidean similarity transformations, namely translation, rotation, and dilation. The optimal translation in Procrustes is $X_T = X - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'X$ (and likewise $Y_T$ for $Y$), where $n$ is the number of rows and $\mathbf{1}_n$ denotes the $n$-by-1 vector having each component equal to 1. The optimal rotation was derived by Ten Berge [27] using the complete form of the singular value decomposition (CFSVD) of $X_T'Y_T$, i.e., $X_T'Y_T = U\Sigma V'$, where $\Sigma = \operatorname{diag}(\sigma_i)$ is a real diagonal matrix and $U$ and $V$ are orthogonal matrices. The CFSVD yields the solution $Q = VU'$, which is the optimal rotation matrix. The optimal dilation is given by the scalar $c = \operatorname{trace}(X_T'Y_TQ) / \operatorname{trace}(Y_T'Y_T)$. By using the optimal transformations described above, the ordinary Procrustes analysis (OPA) is given by Equation 2 [18].
The full Procrustes mean (FPM) is a technique for obtaining the mean of configuration matrices of similar shapes [26]. FPM does not give a measure of the match, so it is not considered further here. The optimal ordering of the transformations was given by Bakhtiar and Siswadi as translation-rotation-dilation (TRD), as stated in the following theorem [17].

Theorem 1.
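Given two $n$-by-$p$ matrices $X$ and $Y$, the Procrustes measure between $X$ and $Y$ after the optimal translation-rotation-dilation (TRD) ordering is given by Equation 3.
$$E_{TRD}(X, Y) = \|X_T\|^2 - \frac{\operatorname{trace}^2(X_T'Y_TQ)}{\|Y_T\|^2} \tag{3}$$
where $X_T$ and $Y_T$ are the translated (column-centered) matrices, $Q$ is the optimal rotation matrix, and $\|\cdot\|$ is the Frobenius norm.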
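As an illustration of the translation, rotation, and dilation steps and of the TRD measure in Theorem 1, the following is a minimal NumPy sketch. It is not the authors' Matlab implementation. The function name `procrustes_trd` and the example matrices are hypothetical, and the sketch assumes that `X` and `Y` already have the same size and that `Y` is not a single repeated point.

```python
import numpy as np


def procrustes_trd(X, Y):
    """Fit Y to X by optimal translation, rotation, and dilation.

    Returns the residual sum of squares, the rotation Q, and the scale c.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Translation: subtract column means so both centroids sit at the origin.
    Xt = X - X.mean(axis=0)
    Yt = Y - Y.mean(axis=0)
    # Rotation: the SVD of Xt'Yt = U diag(s) V' gives Q = V U'.
    U, s, Vh = np.linalg.svd(Xt.T @ Yt)
    Q = Vh.T @ U.T
    # Dilation: c = trace(Xt'Yt Q) / trace(Yt'Yt); the numerator equals sum(s).
    c = s.sum() / np.sum(Yt ** 2)
    residual = np.sum((Xt - c * Yt @ Q) ** 2)
    return residual, Q, c


# Hypothetical example: Y is a rotated, doubled, and shifted copy of X.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Y = 2.0 * X @ R + np.array([3.0, -1.0])
residual, Q, c = procrustes_trd(X, Y)
```

Because `Y` in this example differs from `X` only by a similarity transformation, the residual is zero up to rounding and the recovered scale `c` is 0.5. Swapping the arguments for two unrelated configurations generally gives a different value, which is the asymmetry addressed next.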
The Procrustes measure given by $E(X, Y)$ or $E_{TRD}(X, Y)$ does not have the symmetry property, since in general $E(X, Y) \neq E(Y, X)$. To address this, Bakhtiar and Siswadi [19] obtained the symmetry property by adding another transformation, namely normalization, as stated in the following theorem.

Theorem 2.
Given two $n$-by-$p$ matrices $X$ and $Y$, the Procrustes measure between $X$ and $Y$ after the optimal translation-normalization-rotation-dilation ordering satisfies the symmetry property and is given by Equation 4.
$$E_{TNRD}(X, Y) = 1 - \Big(\sum_{i=1}^{r}\sigma_i\Big)^2 \tag{4}$$
where $r$ and $\sigma_i$ are the rank and the singular values of $\bar{X}'\bar{Y}$ or $\bar{Y}'\bar{X}$, and $\bar{X}$ and $\bar{Y}$ are the matrices after the normalization process, obtained by $\bar{X} = X_T / \|X_T\|$ and $\bar{Y} = Y_T / \|Y_T\|$.
Proof. The complete proof is given in [19].
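The symmetric measure of Theorem 2 can be illustrated with a short NumPy sketch. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the norm is taken to be the Frobenius norm, and GoFP is assumed here to be one minus the normalized measure, so that values near 1 indicate a close match.

```python
import numpy as np


def procrustes_symmetric(X, Y):
    """Measure after translation, normalization, rotation, and dilation."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Translation: center the columns of both configurations.
    Xt = X - X.mean(axis=0)
    Yt = Y - Y.mean(axis=0)
    # Normalization: scale to unit Frobenius norm
    # (fails if a configuration is a single repeated point).
    Xb = Xt / np.linalg.norm(Xt)
    Yb = Yt / np.linalg.norm(Yt)
    # Xb'Yb and Yb'Xb share the same singular values, hence the symmetry.
    s = np.linalg.svd(Xb.T @ Yb, compute_uv=False)
    return 1.0 - s.sum() ** 2


def gofp(X, Y):
    """Assumed goodness of fit: one minus the symmetric measure."""
    return 1.0 - procrustes_symmetric(X, Y)


# Hypothetical configurations used only for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Z = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 3.0], [4.0, 1.0]])
same_both_ways = np.isclose(procrustes_symmetric(X, Z),
                            procrustes_symmetric(Z, X))
```

The measure lies between 0 and 1 up to rounding, because the sum of the singular values of $\bar{X}'\bar{Y}$ cannot exceed $\|\bar{X}\|\,\|\bar{Y}\| = 1$.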

III. RESEARCH METHOD
A. Procrustes Classification Algorithm
The basic idea of the Procrustes classification algorithm (PrCA) is that a group's configuration changes when the testing data enters it. If the testing data is entered into a particular available group, the configuration of that group changes. For the group in which the change is largest, it can be assumed that the testing data would be misclassified. This concept is visualized in Fig. 2.
From Fig. 2, it is intuitively clear that the configurations of $X$ and $Y$ are different, where $X$ denotes the initial configuration of a group and $Y$ its configuration after the testing data has entered. A dissimilarity measure of these configurations can be obtained by using the GoFP. If the GoFP is close to 1, the difference between $X$ and $Y$ is tiny. This means that the entry of the testing data does not change the initial configuration significantly, so the testing data can be assumed to be part of the group. Conversely, if the GoFP is close to 0, the difference between $X$ and $Y$ is large. This means that the entry of the testing data changes the initial configuration significantly, so the testing data can be assumed not to be part of the group. A problem arises when the GoFP is to be calculated. Suppose that $X$ is an $n$-by-$p$ matrix, so that $Y$ is an $(n+1)$-by-$p$ matrix. Because $X$ and $Y$ differ in size, the GoFP cannot be calculated. To solve this problem, one object must be added to $X$ in a way that does not change the configuration of $X$ drastically. One solution is to select a particular object derived from $X$. In this paper, the prototype of $X$ is selected. Based on these considerations, the Procrustes classification algorithm is proposed with the following steps.

Suppose that $X_i$ is the $n$-by-$p$ matrix of the $i$th group ($i = 1, 2, \ldots, g$) and $y$ is the testing data.
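The idea described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: the prototype of a group is taken to be its mean vector, GoFP is taken to be one minus the normalized Procrustes measure of Theorem 2, and the names `gofp` and `prca_predict` are hypothetical.

```python
import numpy as np


def gofp(A, B):
    """Assumed GoFP: squared sum of singular values of the normalized A'B."""
    At = A - A.mean(axis=0)
    Bt = B - B.mean(axis=0)
    Ab = At / np.linalg.norm(At)
    Bb = Bt / np.linalg.norm(Bt)
    s = np.linalg.svd(Ab.T @ Bb, compute_uv=False)
    return s.sum() ** 2


def prca_predict(groups, y):
    """Assign y to the group whose configuration it changes the least.

    groups maps a label to an (n_i, p) array of training objects.
    """
    y = np.asarray(y, dtype=float)
    scores = {}
    for label, X in groups.items():
        X = np.asarray(X, dtype=float)
        prototype = X.mean(axis=0)       # assumption: prototype = group mean
        A = np.vstack([X, prototype])    # group padded with its prototype
        B = np.vstack([X, y])            # group after the testing data enters
        scores[label] = gofp(A, B)
    # The largest GoFP corresponds to the smallest change in configuration.
    return max(scores, key=scores.get), scores


# Hypothetical two-group example.
groups = {
    "a": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "b": np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]]),
}
label, scores = prca_predict(groups, [0.5, 0.6])
```

In this toy example the testing object lies inside group "a", so adding it barely changes that group's configuration, while it acts as an outlier in group "b" and lowers that group's GoFP sharply.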

B. Procrustes Nearest Neighbor Classification Algorithm
The basic idea of the Procrustes nearest neighbor classification algorithm (PNNCA) is to add to $X$ the object from $X$ that is closest to the testing data. In PrCA, $X$ is augmented with its prototype so that it can be optimally matched to $Y$. Here, $X$ is instead augmented with another object from $X$. The k-NN algorithm works by looking at the nearest neighbors. Each object in every group is a neighbor of the testing data, and the testing data is classified into the group that contains its nearest neighbor. It follows that each object in a given group is a neighbor of the testing data, and one of them has the smallest distance to the testing data, say $x^*$ with $x^* \in X$. If $x^*$ is added to $X$, then $X$ has the same size as $Y$, and the added object has been chosen by using the concept of k-NN.
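A corresponding sketch of PNNCA replaces the prototype with the group member nearest to the testing data. As before, this is a hedged NumPy illustration and not the authors' implementation: Euclidean distance is assumed for the nearest-neighbor search, GoFP is assumed to be one minus the normalized Procrustes measure, and the names `gofp` and `pnnca_predict` are hypothetical.

```python
import numpy as np


def gofp(A, B):
    """Assumed GoFP: squared sum of singular values of the normalized A'B."""
    At = A - A.mean(axis=0)
    Bt = B - B.mean(axis=0)
    Ab = At / np.linalg.norm(At)
    Bb = Bt / np.linalg.norm(Bt)
    s = np.linalg.svd(Ab.T @ Bb, compute_uv=False)
    return s.sum() ** 2


def pnnca_predict(groups, y):
    """Like PrCA, but pad each group with its member nearest to y."""
    y = np.asarray(y, dtype=float)
    scores = {}
    for label, X in groups.items():
        X = np.asarray(X, dtype=float)
        # Assumption: Euclidean distance selects the nearest neighbor.
        nearest = X[np.argmin(np.linalg.norm(X - y, axis=1))]
        A = np.vstack([X, nearest])      # group padded with nearest member
        B = np.vstack([X, y])            # group after the testing data enters
        scores[label] = gofp(A, B)
    return max(scores, key=scores.get), scores


# Hypothetical two-group example.
groups = {
    "a": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "b": np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]]),
}
label, scores = pnnca_predict(groups, [0.5, 0.6])
```

The only difference from the PrCA sketch is the padding row, which makes it easy to compare how the choice of the added object affects the GoFP of each group.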

C. Research Flow
The research flow used in this paper consists of four main steps: preprocessing the data, running the classification process, computing the accuracy of the classification results, and comparing the classification results. The data used are the iris, cancer, liver, seeds, and wine datasets obtained from the UCI website. In preprocessing, the data are standardized by using the z-score because some features have different units. In the classification process, the algorithms used are PrCA, PNNCA, k-NN, SVM, AB, RF, LR, and RR. The testing data are obtained by using k-fold cross-validation ($k = 10$), which randomly divides the data into ten parts [28]. Each part serves as the testing data in turn, and the remaining parts serve as the training data. The accuracy of an algorithm is the average of the accuracies over all testing data, each computed by using (5).

$$\text{Accuracy} = \frac{\text{number of true classifications}}{\text{number of processed data}} \tag{5}$$
The classification results of each algorithm are compared to get the best result. In simple terms, the research flow is shown in Fig. 3.
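The evaluation loop described above can be sketched in plain Python. This is an illustrative sketch only: the helper names and the `fit_predict` callable (which takes training objects, training labels, and testing objects and returns predicted labels) are hypothetical, the interleaved fold assignment after shuffling is one possible way to split the data randomly, and the number of objects is assumed to be at least `k`.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Number of true classifications divided by number of processed data."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cross_validated_accuracy(X, y, fit_predict, k=10, seed=0):
    """Average the accuracy over k folds, each used once as testing data."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(accuracy([y[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)
```

Any of the classifiers compared in this paper could be wrapped as a `fit_predict` callable and passed to `cross_validated_accuracy`.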

IV. RESULTS
A. The Data Used
The data used in this research are secondary data obtained from the UCI website, namely iris, cancer, liver, seeds, and wine. All of these data are quantitative. A description of the data is given in Table 1. Table 2 shows the minimum and maximum feature variance ($\sigma^2$) of each dataset. From the table, the cancer, liver, seeds, and wine datasets have significantly different $\sigma^2_{\min}$ and $\sigma^2_{\max}$, which indicates that these datasets have features with different units. Such differences affect the classification results, because features with a large variance are more influential. To overcome this, these data are first standardized by using the z-score. In contrast, the iris dataset does not have significantly different $\sigma^2_{\min}$ and $\sigma^2_{\max}$. Its features have the same units, so it does not need to be standardized.
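The variance check and the standardization can be sketched in plain Python as follows. The function names and the sample rows are hypothetical, and the sketch assumes the sample standard deviation (divisor $n - 1$) and no constant features, since a zero standard deviation would cause a division by zero.

```python
from statistics import mean, stdev, variance


def variance_range(rows):
    """Return the smallest and largest per-feature sample variance."""
    columns = list(zip(*rows))
    variances = [variance(col) for col in columns]
    return min(variances), max(variances)


def z_score(rows):
    """Standardize every feature to zero mean and unit sample variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # assumption: no constant feature
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


# Hypothetical data with two features on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
var_min, var_max = variance_range(rows)
standardized = z_score(rows)
```

A large gap between `var_min` and `var_max` signals features on different scales. After `z_score`, every feature has the same variance, so no single feature dominates a distance-based or configuration-based comparison.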

B. The Classification Results
The accuracy results are obtained by running each algorithm for 100 iterations on each dataset. This is done to observe the convergence of the accuracy results by using (6), in which the accuracy result of the $n$th iteration is used to compute the convergence of the accuracy [29]. The computation in this paper uses Matlab.
Figure 4 shows the accuracy convergence of the PrCA, PNNCA, k-NN, SVM, AB, RR, RF, and LR algorithms on the iris dataset. The chart shows that the accuracy of every algorithm is satisfactory because the minimum accuracy is above 0.800. The average accuracy of each algorithm is given in Table 3, which shows that PNNCA is the best algorithm and that the accuracy results of all algorithms are quite similar.
Figure 5 shows the convergence of the accuracy results of each algorithm on the cancer dataset. The AB result is visibly below 0.500, which is not satisfactory because correct classifications are fewer than incorrect ones. All other algorithms perform well enough, since correct classifications dominate. Table 4 gives the average accuracy of each algorithm. It shows that the results of PrCA and PNNCA are quite similar, differing by only 0.0102. Both are good enough because correct classifications outnumber incorrect ones. However, neither is the best algorithm on the cancer dataset.
Figure 6 shows the convergence of the accuracy of all algorithms on the liver dataset. Again, one algorithm, SVM, has accuracy below 0.500, while the other algorithms are good enough because correct classifications dominate. Table 5 gives the average accuracy of each algorithm on the liver dataset over 100 repetitions. From the table, PrCA and PNNCA are better than k-NN and SVM, and PNNCA is clearly better than PrCA. Although PrCA and PNNCA are not the best, their results are good enough because most classifications are correct.
The convergence of the accuracy results on the seeds dataset, shown in Fig. 7, indicates that all algorithms except AdaBoost (AB) are satisfactory because their accuracy is above 0.800. By comparison, the AB result is only good enough, with accuracy slightly above 0.550. The precise results are given in Table 6. They show that the PrCA and PNNCA results are satisfactory because their accuracy values are above 0.900. Moreover, PNNCA is the best algorithm and PrCA is the second best on the seeds dataset.
Figure 8 suggests that, on the wine dataset, the SVM result is about 0.500 and the k-NN result is about 0.600, while the other algorithms have accuracy of about 0.800 or above. The exact values are given in Table 7, which lists the average accuracy of each algorithm. The SVM result is 0.5004, so it is just good enough. The PrCA and PNNCA results are satisfactory on the wine dataset because they are above 0.800, although one of them is not the best algorithm. PNNCA is also better than PrCA.

V. DISCUSSION
The results presented above provide several findings about how the classification algorithms compare. First, PrCA is better than k-NN on the liver, seeds, and wine datasets, while PNNCA is better than k-NN on the iris, liver, seeds, and wine datasets. Second, PrCA is better than SVM on the cancer, liver, seeds, and wine datasets, while PNNCA is better than SVM on all datasets used. Third, PrCA is better than AB on the cancer, seeds, and wine datasets, while PNNCA is better than AB on the iris, cancer, seeds, and wine datasets. Fourth, PrCA and PNNCA are better than RR on the same datasets, namely iris and seeds. Fifth, PrCA is better than RF on the iris and seeds datasets, while PNNCA is better than RF on the iris, liver, and seeds datasets.
Finally, PrCA is better than LR only on the seeds dataset, while PNNCA is better than LR on the iris and seeds datasets. These facts show that PrCA predominantly outperforms k-NN, SVM, and AB, while PNNCA predominantly outperforms k-NN, SVM, AB, and RF. They also show that the object selected for the optimal matching affects the classification results, with PNNCA better than PrCA in this case. In general, the results of PNNCA and PrCA suggest that, in a classification process based on a similarity measure, involving all objects in the group is more advantageous than involving only several objects from the group.

VI. CONCLUSION
This paper discussed two classification algorithms based on Procrustes, namely PrCA and PNNCA. The results of PrCA are quite similar to those of PNNCA, but PNNCA is better than PrCA on all datasets used. PrCA outperformed three of the six comparison algorithms, namely k-NN, SVM, and AB. PNNCA outperformed four of them, namely k-NN, SVM, AB, and RF. Based on these results, PrCA and PNNCA merit consideration as a new approach to classification.