Sparse data applications have irregular access patterns that stymie modern memory architectures. Although hyper-sparse workloads have received considerable attention in the past, moderately-sparse workloads prevalent in machine learning applications, graph processing and HPC have not. Where the former can bypass the cache hierarchy, the latter fit in the cache. This article makes the observation that intelligent, near-processor cache management can improve bandwidth utilization for data-irregular accesses, thereby accelerating moderately-sparse workloads. We propose SortCache, a processor-centric approach to accelerating sparse workloads by introducing accelerators that leverage the on-chip cache subsystem, with minimal programmer intervention.
With the rapid growth of quantitative trading in the investment field, quantitative trading platforms are becoming an important tool for the many investors who participate in it. On such a platform, the time taken to return results when backtesting against historical data is a key factor in user experience, and cache management is a critical link in reducing data access time. Prior work on cache management has produced many useful results, but a quantitative trading platform has special demands: (1) users' accesses to time-series data overlap, and (2) the platform uses a wide variety of caching devices with heterogeneous performance. To address these demands, we propose a cache management approach adapted to quantitative trading platforms. It merges overlapping data in the cache to save space and places data across multi-level caching devices in a way driven by user experience. Our extensive experiments show that the proposed approach can improve user experience by more than 50% compared with the benchmark algorithms.
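The idea of merging overlapping cached time-series data can be illustrated with a minimal sketch. It is not the approach evaluated in the abstract: it assumes each cached entry is described only by a half-open time range `(start, end)`, and the names `merge_cached_ranges` and `cached_span` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a cached entry covers the half-open
# time range [start, end), e.g. in trading days since some epoch.
Interval = Tuple[int, int]


def merge_cached_ranges(ranges: List[Interval]) -> List[Interval]:
    """Coalesce overlapping or touching ranges into disjoint ranges."""
    merged: List[Interval] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def cached_span(ranges: List[Interval]) -> int:
    """Total number of time units stored, counting duplicates."""
    return sum(end - start for start, end in ranges)


requests = [(0, 10), (5, 15), (20, 30)]  # three backtests, two overlapping
merged = merge_cached_ranges(requests)
print(merged)                                      # [(0, 15), (20, 30)]
print(cached_span(requests), cached_span(merged))  # 30 25
```

In this toy case the two overlapping backtest windows share five time units, so storing the merged range saves that much cache space. Placement across heterogeneous caching devices is a separate decision that the sketch does not model.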
Information-Centric Networking (ICN) architectures have been proposed to overcome the problems of the current Internet architecture. One of their main strengths is in-network caching. ICN performance depends on the efficiency of the adopted caching strategy, which manages content in the network and decides where to cache it. The major issue facing caching strategies in ICN architectures is the strategic selection of the cache routers that store data along its delivery path. A good selection reduces congestion, shortens the distance between consumers and the requested data, improves latency, and alleviates the load on the servers. In this paper, we propose NECS, a new efficient caching strategy for Named Data Networking (NDN), the most promising of the ICN architectures. The proposed strategy reduces traffic redundancy, eliminates useless replication of content, and improves response time for users through the strategic positioning of cache routers. We evaluate the proposed strategy and compare it with three other NDN caching strategies using the ndnSIM network simulator. The simulations yield interesting and convincing results.
GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the bottleneck of GPU performance. In a GPU, threads are grouped into warps for scheduling and execution. The L1 data caches have little capacity, and multiple warps share one small cache, which causes heavy cache contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction is classified into one of three types according to its locality: data used only once (streaming locality), data accessed multiple times within the same warp (intra-warp locality), and data accessed by different warps (inter-warp locality). According to the locality of the load instruction, LCM applies cache bypass to streaming-locality requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and works with LWS to alleviate cache contention. Together, LCM and LWS effectively improve cache performance and thereby overall GPU performance. In our experimental evaluation, LCM and LWS achieve an average performance improvement of 26% over the baseline GPU.
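The three locality types can be illustrated with a small offline classifier over a memory trace. This is only a sketch under stated assumptions, not the hardware mechanism of LCM: the trace format `(load_pc, warp_id, cache_line)`, the function name `classify_loads`, and the precedence rule (inter-warp over intra-warp over streaming) are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical trace record: (load_pc, warp_id, cache_line)
Access = Tuple[int, int, int]


def classify_loads(trace: Iterable[Access]) -> Dict[int, str]:
    """Label each load instruction by the locality of the lines it touches."""
    # counts[pc][line][warp] = number of times that warp touched that line
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for pc, warp, line in trace:
        counts[pc][line][warp] += 1

    labels: Dict[int, str] = {}
    for pc, lines in counts.items():
        if any(len(warps) > 1 for warps in lines.values()):
            # Some line is shared by different warps.
            labels[pc] = "inter-warp"
        elif any(n > 1 for warps in lines.values() for n in warps.values()):
            # Some line is reused, but only within a single warp.
            labels[pc] = "intra-warp"
        else:
            # Every line is touched exactly once: candidate for bypass.
            labels[pc] = "streaming"
    return labels


trace = [
    (0x10, 0, 100), (0x10, 1, 101),  # each line used once
    (0x20, 0, 200), (0x20, 0, 200),  # same warp reuses a line
    (0x30, 0, 300), (0x30, 1, 300),  # two warps share a line
]
print(classify_loads(trace))
```

Under this labeling, loads tagged as streaming would be the ones eligible for cache bypass, while inter-warp loads would be candidates for request coalescing across warps.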