Performance comparison of cache replacement algorithms on various internet traffic

Internet users tend to abandon a website and look for alternatives when its response time is slow. For cloud network managers, implementing a caching strategy on the edge network can help lighten the workload of databases and application servers. The caching strategy stores frequently accessed data objects in cache memory so that later access to the same data becomes faster. Cache replacement is the main mechanism of the caching strategy. Seven well-performing cache replacement algorithms are commonly used, namely LRU, LFU, LFUDA, GDS, GDSF, SIZE, and FIFO. Each algorithm was developed for the particular internet traffic patterns it was meant to handle. Therefore, no single cache replacement algorithm can be said to be superior to the others. This paper presents a simulation-based performance comparison of the seven cache replacement algorithms on various internet traffic extracted from the public IRcache dataset. The results indicate that Hit Ratio (HR) performance is strongly influenced by cache size and by the numbers of cacheable and unique requests. The fewer unique requests occur, the higher the HR obtained. The LRU algorithm shows very good HR performance under normal internet conditions. However, when the access impulse phenomenon occurs, the GDSF algorithm is superior in HR under limited cache memory capacity. The simulation results show that GDSF reaches a 50.75% HR while LRU reaches only 49.17% when access anomalies occur.


I. INTRODUCTION
The rapid development of the internet contributes to a nation's economic growth by making information easier to explore and adding to the insights of economic actors [1], [2]. Internet technology also causes shifts in individual lifestyles that can improve the population's welfare and strengthen the institutionalization of village communities [3], [4]. The number of internet and digital technology users has increased since the COVID-19 pandemic [5], [6], especially e-learning and e-banking users [7], [8]. In addition, data transactions on the internet are growing as 5G technology, edge computing, and the Internet of Things (IoT) evolve [9], [10].
Internet users tend to leave a website and look for alternatives if it has a slow response time [11], [12]. Response time is the time a browser takes to fully load a web page, measured from the moment the link is first clicked [13], [14]. Response time must be considered by internet-based application developers and cloud network infrastructure managers because reducing it improves the experience and comfort of their users [15], [16].
For internet-based application developers, one solution to the response time problem is to scale out applications using cache memory or an In-memory Database (IMDB) [17], [18]. For cloud network managers, implementing caching strategies on the edge network can help ease the workload of databases and application servers [19], [20]. Caching strategies can also save bandwidth and reduce network latency [21], [22]. The caching strategy is carried out by storing frequently accessed data objects at specific locations [23], [24]. Through this strategy, access to the same data becomes faster because the data has been stored before [25]. The right caching strategy can reduce repeated requests for the same data to the origin server and thereby speed up the response time on the client side.
Jurnal Infotel Vol. 15
Caching strategy is a relatively broad field of research because it is widely used in areas such as cloud networks [19], [26], browser applications [27], [28], operating systems [29], embedded systems [30], mobile networks [31], [32], and telecommunications [20]. The primary mechanism of a caching strategy is cache replacement, which changes the content of cached data in cache memory because of its limited capacity [33], [34]. Researchers have developed several cache replacement algorithms, such as Least Recently Used (LRU), Least Frequently Used (LFU), Least Frequently Used with Dynamic Aging (LFUDA), Greedy-Dual Size (GDS), Greedy-Dual Size Frequency (GDSF), SIZE, and First-in-First-out (FIFO). These seven algorithms are generally used for benchmarking in caching strategy research. In practice, some of these algorithms have also been adopted in the caching systems of proxy servers such as Squid and Varnish.
The cache replacement algorithm must be selected carefully so that it does not reduce the application's performance or cause traffic congestion [35]. Researchers developed each cache replacement mechanism to suit the internet traffic pattern they encountered [36]. Therefore, no specific cache replacement algorithm can be considered superior to the others [37], [38].
This paper aims to compare the performance of the seven caching algorithms on different internet access patterns. The internet traffic pattern tends to be random, following Zipf's law [39], [40]. However, in certain conditions, impulse access can occur, like a viral phenomenon that rises quickly and then gradually returns to normal [41]. Therefore, this paper can serve as a reference for caching strategy researchers who develop new cache replacement algorithms based on the internet access patterns they face.
For a better understanding, the rest of this paper is organized as follows. In section II, we provide the research method, followed by the result in section III. Section IV presents the discussion of the result. Finally, we provide the conclusion in section V.

II. RESEARCH METHOD
This section discusses research scenarios, cache replacement algorithms, IRcache dataset, and Hit Ratio (HR).

A. Research Scenarios
The contribution of this paper is to present simulation results comparing the HR performance of seven cache replacement algorithms on various internet traffic using the IRcache dataset, which consists of four sub-datasets. Each sub-dataset consists of 666 records. The types of internet traffic conditions are characterized by statistics on total requests, cacheable bytes, cacheable requests, and unique requests, as presented in Table 1.
Seven algorithms were run to simulate the cache replacement mechanism on the four IRcache sub-datasets. The performance of each algorithm is evaluated using the HR, the percentage of data requests served successfully by the cache memory out of the total data requests. The seven cache replacement algorithms, the IRcache dataset, and the HR are described in more detail in section II-B, section II-C, and section II-D, respectively.
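The evaluation procedure can be pictured as a trace replay. The sketch below is a minimal illustration, not the simulator used in this paper: it replays a list of (object, size) requests through an LRU cache with a byte capacity and reports the HR. The function name `simulate_lru`, the toy trace, and the choice not to cache objects larger than the whole cache are assumptions made for the example.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity_bytes):
    """Replay (object_id, size) requests through an LRU cache; return HR in percent."""
    cache = OrderedDict()  # object_id -> size, ordered from least to most recently used
    used = 0               # bytes currently stored
    hits = 0
    for obj, size in trace:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue                # assumption: oversized objects are never cached
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict least recently used
            used -= evicted_size
        cache[obj] = size
        used += size
    return 100.0 * hits / len(trace) if trace else 0.0


# Hypothetical toy trace: three objects of 10 bytes each.
trace = [("a", 10), ("b", 10), ("a", 10), ("c", 10), ("b", 10), ("a", 10)]
small = simulate_lru(trace, capacity_bytes=20)
large = simulate_lru(trace, capacity_bytes=30)
print(f"HR with 20 bytes: {small:.1f}%, HR with 30 bytes: {large:.1f}%")
```

On this toy trace the 20-byte cache serves one of six requests from memory (about 16.7%), while the 30-byte cache serves three of six (50.0%), which is the kind of cache-size effect the simulation measures.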

B. Cache Replacement Algorithms
LFU and LRU are the best-known cache replacement algorithms with good HR performance. When new data must be stored in a full cache memory, the LFU algorithm first deletes the cached data with the smallest access count, while the LRU algorithm deletes the least recently used cached data. LRU performance is strongly influenced by access recency, while LFU is influenced by the access count, that is, the number of accesses to cached data. Arlitt et al. [42] then extended LFU by adding an aging factor called Dynamic Aging. Eq. (1) combines the aging factor (L) with the access count in the key-value variable K(g). The variable F(g) is the access count of cached data g; the cached data with the lowest K(g) is deleted to make room for new cached data entering the cache memory.
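As a reference point, the LFUDA key value is commonly written as below. This is the usual form from the literature with the cost term fixed at 1, stated here as an assumption rather than as a quotation of Eq. (1); K(g), F(g), and L are the symbols used in the text.

```latex
K(g) = F(g) + L \tag{1}
```

In this form L starts at 0 and is raised to the key value of each evicted object, so objects that were popular long ago gradually lose their advantage over recently cached ones.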
The SIZE algorithm will erase the cached data with the largest size, while FIFO will erase the cached data in the order in which it first entered the cache memory.
Table 2 (recovered rows): Hierarchy data: the hierarchy of the data request. 10. User ID: the user ID.
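The eviction rules of LRU, LFU, FIFO, and SIZE differ only in which attribute of a cached entry they compare. The following sketch makes that explicit; the `Entry` fields and the `choose_victim` helper are hypothetical names introduced for illustration, and the three sample entries are made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    size: int         # object size in bytes
    inserted_at: int  # logical time the object entered the cache
    last_access: int  # logical time of the most recent access
    hits: int         # access count


def choose_victim(entries, policy):
    """Return the entry that the given policy would evict first."""
    selectors = {
        "LRU": lambda e: e.last_access,   # least recently used
        "LFU": lambda e: e.hits,          # smallest access count
        "FIFO": lambda e: e.inserted_at,  # earliest arrival
        "SIZE": lambda e: -e.size,        # largest object
    }
    return min(entries, key=selectors[policy])


entries = [
    Entry("a", size=500, inserted_at=1, last_access=9, hits=4),
    Entry("b", size=900, inserted_at=2, last_access=3, hits=7),
    Entry("c", size=100, inserted_at=5, last_access=6, hits=1),
]
for policy in ("LRU", "LFU", "FIFO", "SIZE"):
    print(policy, "evicts", choose_victim(entries, policy).key)
```

With these sample entries LRU and SIZE both evict "b", LFU evicts "c", and FIFO evicts "a", which shows how the same cache content leads to different replacement decisions.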
The LFU, LRU, SIZE, and FIFO algorithms do not consider network costs or other costs that affect caching decisions. Therefore, Cao et al. [43] proposed a new key-value calculation that includes a cost-aware variable. Eq. (2) is used to calculate the K(g) value of the GDS algorithm.
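In the Greedy-Dual Size literature the key value is usually written in the form below. It is given here as the standard formulation and as an assumption about the exact notation of Eq. (2).

```latex
K(g) = L + \frac{C(g)}{S(g)} \tag{2}
```

For a fixed cost, a larger object receives a smaller key value and is therefore evicted earlier.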
Parameter L is a running clock for the queue: its value starts from 0 and is updated to the minimum K value in the priority queue, that is, the key value of the evicted document. Parameter S(g) is the size of document g. Parameter C(g) is the cost associated with bringing document g into the cache.
Later work found that the access count variable is also necessary, so that caching decisions can weigh both cost awareness and popularity, as measured by the access count. Cherkasova et al. [44] therefore proposed GDSF to improve the performance of GDS. Eq. (3) calculates the key value of the GDSF.
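The GDSF key value is commonly written as below, extending the GDS key with the access count. This is the standard form from the literature, given as an assumption about the notation of Eq. (3).

```latex
K(g) = L + F(g) \cdot \frac{C(g)}{S(g)} \tag{3}
```

With F(g) = 1 for every document, this reduces to the GDS key value.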
Parameters L, C(g), and S(g) are the same as in the GDS algorithm. Parameter F(g) indicates the number of accesses to document g.
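To make the interaction of L, F(g), C(g), and S(g) concrete, here is a minimal GDSF sketch. It assumes the common key value K(g) = L + F(g) * C(g) / S(g), a constant cost C(g) = 1, and a linear scan for the minimum key instead of a priority queue. The class name `GDSFCache` and its `request` method are hypothetical, not taken from this paper or from any proxy server.

```python
class GDSFCache:
    """Byte-bounded cache that evicts the entry with the smallest key value K."""

    def __init__(self, capacity_bytes, cost=lambda size: 1.0):
        self.capacity = capacity_bytes
        self.cost = cost   # C(g); a constant cost of 1 is an assumption
        self.used = 0      # bytes currently stored
        self.L = 0.0       # running clock, starts at 0
        self.entries = {}  # g -> [size S(g), access count F(g), key value K(g)]

    def _key_value(self, size, freq):
        return self.L + freq * self.cost(size) / size

    def request(self, g, size):
        """Serve one request; return True on a cache hit, False on a miss."""
        if g in self.entries:
            entry = self.entries[g]
            entry[1] += 1                                   # one more access
            entry[2] = self._key_value(entry[0], entry[1])  # refresh K(g)
            return True
        if size > self.capacity:
            return False  # assumption: oversized objects are never cached
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][2])
            self.L = self.entries[victim][2]  # L becomes the evicted minimum K
            self.used -= self.entries[victim][0]
            del self.entries[victim]
        self.entries[g] = [size, 1, self._key_value(size, 1)]
        self.used += size
        return False


cache = GDSFCache(capacity_bytes=100)
trace = [("a", 40), ("b", 60), ("a", 40), ("c", 50), ("a", 40)]
hits = [cache.request(g, size) for g, size in trace]
print(hits, sorted(cache.entries))
```

In this toy trace the large, once-requested object "b" has the smallest key value and is evicted when "c" arrives, while the smaller and more frequently requested "a" stays cached.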

C. IRcache Dataset
This paper used the IRcache dataset to synthesize various internet traffic. The IRcache dataset was first released in 1999 by Alex Rousskov [45] and was later managed by the National Laboratory for Applied Network Research (NLANR) in the United States. The IRcache dataset is a real-world proxy trace drawn from proxy servers spread across five United States cities, namely Urbana-Champaign (UC), Boulder (BO2), Silicon Valley (SV), San Diego (SD), and New York (NY).
The IRcache dataset is well suited to simulating the cache replacement mechanism in the data caching process in cache memory. It also represents the characteristics of internet traffic in actual conditions. In the last five years, the IRcache dataset has still been used by Ibrahim et al. [46] in cache replacement research and by Li et al. [47] in content caching optimization research. Table 2 presents the properties contained in the IRcache dataset.

D. Hit Ratio
HR is the percentage of data requests successfully served by cache memory (r_i) compared to the total data requests it receives (N). One cache hit indicates that one data request was successfully served by cache memory; in that case r_i is 1. The opposite of a cache hit is a cache miss, where a request cannot be served by cache memory and is forwarded to the origin server. When the requested data has been obtained from the origin server, it is immediately stored in the cache memory. After one cache hit and one cache miss, the number of cache hits is one, but the total number of data requests (N) is two. Therefore, the resulting HR is 1/2 × 100% = 50% in this condition. Eq. (4) is used to calculate the HR value.
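Written out with the symbols above, the HR definition takes the form below. This is a reconstruction from the description in the text, where r_i is 1 for a cache hit and 0 for a cache miss; it should be read as an assumption about the notation of Eq. (4).

```latex
HR = \frac{\sum_{i=1}^{N} r_i}{N} \times 100\% \tag{4}
```

For the example of one hit and one miss, r_1 = 1, r_2 = 0, and N = 2:

```latex
HR = \frac{1 + 0}{2} \times 100\% = 50\%
```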
III. RESULT
Table 3 shows the average HR performance of the seven cache replacement algorithms run on the four IRcache sub-datasets. Based on Table 3, the highest HR performance is achieved in the NY dataset. In this dataset, the minimum HR performance is 48.41%, obtained by the FIFO algorithm, while the maximum HR performance is 50.75%, obtained by the GDSF algorithm. The SV dataset has the lowest HR performance. In this dataset, the maximum HR performance is 7.43%, obtained by the GDSF algorithm, while the minimum HR performance is 6.40%, obtained by the SIZE algorithm. The best HR values in Table 3 are marked in bold print, while the worst performances are marked in italic print.
The HR performance of the GDSF algorithm was superior in two datasets, namely SV and NY, while the LRU algorithm was superior in the BO2 and UC datasets. Table 2 and Table 3 show the relationship between the request statistics and HR. The fewer cacheable and unique requests occur, the greater the resulting HR performance. This is shown by the HR performance in the NY dataset, which is superior to that of the other three datasets. In addition, the correlation between Table 2 and Table 3 also shows that the more unique requests occur, the smaller the resulting HR performance. A larger number of unique requests lowers the probability that cached data is kept in the cache memory.
Fig. 1 to Fig. 4 show the relationship between cache size and the resulting HR performance. HR performance increases with cache size. Cache replacement occurs more often when the cache is small. As a result, few cacheable requests can be stored, so the chance of a cache hit is smaller. Rare cache hits lead to a low HR.
Based on Fig. 1 to Fig. 4, the GDSF algorithm achieves the highest HR in the NY dataset at 50.75%, while the FIFO algorithm has the lowest HR at 47.90%. In the UC dataset, the LRU algorithm is the best with an HR of 32.60%, while LFU is the worst with an HR of 23.78%. In the SV dataset, the highest HR was achieved by the GDSF algorithm at 7.43%, while the lowest value was obtained by LFU at 5.57%. Finally, the simulation results on the BO2 dataset show that the LRU algorithm is the best with a 22.21% HR, while the SIZE algorithm is the worst with a 17.42% HR.
The SIZE algorithm gives higher priority to small cached data in occupying cache memory. The intention is that more cached data can be stored. However, the simulation results show that this strategy is not effective in increasing the HR performance.
Fig. 1 to Fig. 4 also show that the NY dataset is unfavorable for access recency-based cache replacement algorithms such as LRU. The LRU algorithm achieves strong HR performance in three datasets but not in the NY dataset. Based on Fig. 1 to Fig. 4, key-value and access frequency-based algorithms such as GDSF, LFU, and LFUDA obtained better HR performance than the other four algorithms. Caching decisions based on access counts on datasets with low unique request statistics will produce more cache hits because each cached data has the same caching probability. The opposite holds for the SV dataset, which has a large number of unique requests. There, each cached data has a small caching probability in a cache memory of minimal size because its position in the cache memory may be quickly taken by other cached data.
Based on the HR performance in Fig. 1 to Fig. 4, the access pattern that occurs in the NY dataset is an anomaly. Its small number of cacheable and unique requests indicates that impulse access occurred at a particular time and touched only a small portion of the cached data. This is the main reason why the cacheable and unique request values in the NY dataset are the smallest. On closer inspection, this impulse access phenomenon only occurs briefly. Its effect can be absorbed by providing a larger cache memory. Fig. 1 to Fig. 4 show that with a 150 KB cache size, all seven cache replacement algorithms reach the same high HR (59.91%). This condition was not found in the other three datasets, namely UC, BO2, and SV.

IV. DISCUSSION
Proxy servers generally use cache replacement algorithms to organize popular web-object storage in cache memory so that subsequent web-object requests can be served more quickly. In typical internet traffic conditions and during impulse access, key-value-based and cost-aware algorithms such as GDSF can maintain good HR performance, as shown in the simulation of the NY and SV datasets. However, the LRU algorithm can be relied on in regular internet traffic, as shown by the HR performance results in the BO2 and UC datasets. Some researchers have extended the GDSF algorithm into WGDSF [48] or WSCRP [34] with quite convincing HR performance results. The LRU algorithm has also been developed further using metaheuristic optimization methods [49] or machine learning [50]. Both show better HR performance than the standard LRU version.
In addition to being used in proxy servers, the cache replacement algorithms discussed in this paper can also be adapted to perform caching on the database server. Complex query requests with joins across multiple tables can be stored in the memory buffer to speed up the same query request later. The Cachematic framework [51] caches query results by building an abstract syntax tree to identify the "where" and "join" clauses in each target query submitted by the user. The cache replacement mechanism can also be used directly in in-memory databases, such as Redis [52] or Memcached [53], to store small-sized data objects that users often access.

V. CONCLUSION
The HR performance is strongly influenced by cache size and by the numbers of cacheable and unique requests. The fewer the unique requests, the greater the HR performance obtained. The LRU algorithm shows excellent HR performance for cache replacement in normal internet conditions. However, when there is an impulse access phenomenon, the GDSF algorithm achieves a superior HR on limited cache memory capacity. The GDSF algorithm is more flexible in handling internet traffic both under normal conditions and when access anomalies occur.
However, the LRU algorithm can be relied on in regular internet traffic, as shown by the HR performance results in the BO2 and UC datasets. The SIZE algorithm gives higher priority to small cached data in occupying cache memory. The intention is that more cached data can be stored. However, the simulation results show that this strategy is not effective in increasing the HR performance. GDSF, LFU, and LFUDA obtained better HR performance than the other four algorithms on datasets with a high number of unique requests.