Indonesian news classification application with named entity recognition approach

Many internet users now search for news through search engines, which return vast amounts of information, so relevant articles become increasingly difficult to identify when the set of available news changes quickly and dynamically. Extracting news information is therefore necessary to display the core content of each article. The problem is particularly acute in Indonesian, whose varied noun phrase entities are handled with shallow parsing or grammatical induction. Text classification also continues to face a feature representation difficulty when analyzing a formal lexical grammar, interpreting semantics, and extracting information. Named entity recognition (NER) can help overcome this because it extracts news entities in depth, starting from proper nouns in text documents, and is widely used in information retrieval, machine translation, question answering, and automatic summarization. This study proposes an NER approach for building an application that classifies Indonesian news content. The NER approach differs from others, which may be rule-based, dictionary-matching-based, machine-learning-based, or a combination of these. Moreover, NER can be extended to recognize different entities as requirements dictate. This study uses Design-Based Research, whose process comprises (1) pre-implementation, (2) design, (3) implementation and revision, and (4) reflection and evaluation. The application was developed in Python with the Streamlit, BeautifulSoup, news, and spaCy libraries. Accuracy testing of the application yields an F1-Score of 89.69 % for all entities consisting


I. INTRODUCTION
Digital news portals are the most important news sources for internet users [1]. This is indicated by the interest in reading news on traditional web media or commercial portals, which is higher than for newer news media such as Internet BBS [2]. People now prefer to search for news through search engines, whose results combine several news portals according to the news keywords [3]. However, because search engines can return countless combinations of information, finding the expected information sometimes takes considerable time [4]. Moreover, relevant news is even harder to identify when the number of news articles that appear changes swiftly and dynamically [5]. Because storage is updated as often as every few milliseconds, the volume of real-time data is enormous [6].
An information extraction process is therefore needed so that the core information of a news article can be displayed quickly. The difficulty is that Indonesian has varied noun phrase entities handled with shallow parsing or grammatical induction [7], so machine learning techniques are required to classify text documents into the correct classes [8]. Named entity recognition (NER) addresses this issue by extracting the core ideas of a text in an ontology [9]. NER classifies proper nouns in text documents and is widely used in information retrieval, machine translation, question answering, and automatic summarization [10]. The NER process uses bags of words (BoW) that cover locations, people, and organizations or institutions [11]. In addition, NER can extract each news entity in depth, which supports the detection of fake news [12].
Jurnal Infotel, Vol. 15, No. 2, May 2023 https://doi.org/10.20895/infotel.v15i2.909
This study proposes the NER method for developing an application that classifies Indonesian news information. NER is an information extraction method that detects and classifies particular entities in a text [13]. It typically extracts the names of people, organizations, and locations from a document's text, but it can be expanded to identify other entity types as needed. Research on NER has been carried out in languages other than English and Indonesian, such as Chinese [14], [15], Arabic [16], Indian [17], Turkish [18], and Khmer [19]. Previous research also mentions using NER as a text analysis resource for Uyghur [20]. Many areas of Natural Language Processing rely on NER. Babych and Hartley [21] stated that NER could improve Machine Translation performance. Other research topics in Natural Language Processing, such as quote attribution in Elson and McKeown [22], use NER to detect the author of a quote. NER is also used in the reference approach to detect entities in text documents [23].

II. RESEARCH METHOD
Text classification assigns an input text document to one of a set of pre-defined classes using a machine learning algorithm [24]. Automatic text classification with machine learning classifiers such as naive Bayes, support vector machines, neural networks, and decision trees has improved in performance in recent years [25]. However, it still faces a feature representation challenge when analyzing a formal lexical grammar, interpreting semantics, and extracting information [26]. The NER technique differs from the others, which may be rule-based, dictionary-matching-based, machine-learning-based, or a mix [26]. Moreover, NER can be expanded to identify different entities as needs dictate [9].
The NER algorithm for classifying Indonesian news is applied using the design-based research (DBR) approach. DBR iteratively designs, implements, evaluates, and improves problem-specific interventions while taking limited resources and technology into account [27]. Overall, the stages of the DBR approach are (1) pre-implementation, (2) design, (3) implementation and revision, and (4) reflection and evaluation [28]. The complete stages of the research are shown in Fig. 1.
Finding news data involves looking up Indonesian news topics with the Google News library and scraping the news articles. The application interface is packaged as a website. Python and Streamlit are the platforms used in this experiment. Finally, the model's precision, recall, and F1-Score are tested.

III. RESULT
This section discusses identifying news topics, scraping news content, classification using NER, and the evaluation of the custom NER model.

A. Identify News Topics
The Google News library, or news library in Python, is used to look up news topics. Its features make it possible to identify the most popular news headlines within a specific time frame and in a specified language [29]. This research covers only news in the Indonesian language. The results span a wide range of topics, including world, nation, technology, entertainment, sports, science, and health, as shown in Fig. 2.

B. Scraping News Content
After a news topic is chosen, the top three stories currently trending in Indonesia are shown with their titles and URLs (Fig. 3). Web scraping is then used to retrieve the content of each article, as in Fig. 4. Web scraping simplifies data collection from platforms with potentially limited access [30]. One tool for this is BeautifulSoup, a Python package created by Leonard Richardson and many other programmers that extracts structured data from web pages by parsing XML and HTML [31].
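The scraping step can be illustrated with a minimal sketch. To stay dependency-free it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, so it is not the application's actual code. The class name `ParagraphExtractor`, the helper `extract_article_text`, and the sample HTML are hypothetical, and real news pages would likely need site-specific selectors.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, ignoring <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = []      # text chunks of the paragraph being read
        self._in_p = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buffer.append(data)


def extract_article_text(html):
    """Return the article body as newline-separated paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hypothetical page content standing in for a downloaded news article.
SAMPLE_HTML = """
<html><body>
  <h1>Judul berita</h1>
  <p>Presiden meresmikan <b>bendungan</b> baru di Jawa Tengah.</p>
  <script>var x = "<p>iklan</p>";</script>
  <p>   </p>
  <p>Acara berlangsung pada hari Senin.</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article_text(SAMPLE_HTML))
```

Keeping only paragraph text is a deliberate simplification: headlines, navigation, and scripts are discarded so that the later NER step sees mostly article prose.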

C. Classification using NER
The named entity recognition (NER) technique is used to classify news articles. NER is a natural language processing task that extracts specified entity words or phrases from unstructured text data and categorizes them (entity type, time type, and number type) [32]. The task typically targets well-known entities such as people, organizations, and dates [33]. This study employs a custom NER model created with version 3 of the spaCy framework. spaCy allows the model to operate on words and subwords rather than tokens. The research entities include place, figure, day, date, and organization, as in
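A custom NER model needs labeled examples. The sketch below builds annotations as character-offset spans, `(start, end, label)`, which is the form spaCy accepts for entity annotations. The label strings (`FIGURE`, `PLACE`, `DAY`, `DATE`), the helper `annotate`, and the sample sentence are assumptions chosen to mirror the entity types listed above; they are not taken from the study's dataset.

```python
def annotate(text, phrases):
    """Build a (text, {"entities": [(start, end, label), ...]}) pair.

    `phrases` maps a surface string to its label. Each phrase is located
    with str.find, so only its first occurrence is annotated.
    """
    spans = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found: {phrase!r}")
        spans.append((start, start + len(phrase), label))

    spans.sort()
    # Entity spans must not overlap: each span has to start at or after
    # the end of the previous one.
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping entity spans")
    return text, {"entities": spans}


# Hypothetical training sentence with one entity per illustrative label.
SENTENCE = "Joko Widodo mengunjungi Surakarta pada Senin, 2 Mei 2022."
EXAMPLE = annotate(
    SENTENCE,
    {
        "Joko Widodo": "FIGURE",
        "Surakarta": "PLACE",
        "Senin": "DAY",
        "2 Mei 2022": "DATE",
    },
)

if __name__ == "__main__":
    text, annotations = EXAMPLE
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Locating phrases by string search is only a convenience for the illustration; a real annotation workflow would record offsets from a labeling tool so that repeated mentions can be marked individually.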

D. Evaluation of Custom NER model
The NER model was tested using the spaCy library, and its performance was evaluated by calculating precision, recall, and F1-Score. Table 1 reports the evaluation metrics for all entities.
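For reference, the three metrics can be computed from gold and predicted entity spans as in the sketch below. It assumes the common exact-match convention, in which a prediction counts as correct only when its start, end, and label all agree with a gold span. The function name `precision_recall_f1` and the sample spans are hypothetical and do not reproduce the study's data or results.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match.

    precision = TP / number of predicted spans
    recall    = TP / number of gold spans
    F1        = 2 * precision * recall / (precision + recall)
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start, end, label) character offsets in one text.
GOLD = [
    (0, 11, "FIGURE"),
    (24, 33, "PLACE"),
    (39, 44, "DAY"),
    (46, 56, "DATE"),
    (60, 85, "ORGANIZATION"),
]
PREDICTED = [
    (0, 11, "FIGURE"),
    (24, 33, "ORGANIZATION"),  # correct span but wrong label: not a match
    (39, 44, "DAY"),
    (46, 56, "DATE"),
]

if __name__ == "__main__":
    p, r, f1 = precision_recall_f1(GOLD, PREDICTED)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it penalizes a model that scores well on one metric at the expense of the other.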

IV. DISCUSSION
This study's findings are likely to serve as a foundation for subsequent research on the use of NER in Indonesian. The model obtained an F1-Score of 86.96 %, which tends to be better than the results of previous studies by Al-Ash et al. [34] at 76 % and Wintaka et al. [35] at 84.11 %. This is based on the fact that Indonesian has an extensive vocabulary, so the number of entity classes for word categorization is significant. As a natural language processing package, spaCy enables the creation of custom NER models with entities that meet particular specifications. Developing a custom NER model, however, requires more datasets and word-labeling annotations.

V. CONCLUSION
This Indonesian news classification application was developed in Python with the Streamlit, BeautifulSoup, news, and spaCy libraries. Streamlit supports building the GUI as a web application. BeautifulSoup is used to scrape the content of trending news found through the news library. The classification model was custom developed with the spaCy v3 library and covers the place, figure, day, date, and organization entities. The model evaluation gives an F1-Score of 86.96 % for all entities. To improve model accuracy, further research can add feature selection to the data preprocessing process.

ACKNOWLEDGMENT
This research was sponsored through the internal research program of Universitas Duta Bangsa Surakarta in 2022.