Muhammad Hassan: Enhancing Processor Performance: Approaches for Memory Characterization, Efficient Dynamic Instruction Prefetching, and Optimized Instruction Caching
- Date: 31 May 2024, 13:00
- Location: 80127, Ångströmslaboratoriet, Lägerhyddsvägen 1, Uppsala
- Type: Thesis defence
- Thesis author: Muhammad Hassan
- External reviewer: Boris Grot
- Supervisor: David Black-Schaffer
- Research subject: Computer Science
- DiVA
Abstract
Low-latency access to both data and instructions is paramount for processor performance. However, memory speed has lagged behind processor speed and is now a dominant bottleneck in execution. While both data and instruction misses cause performance losses, data misses can be overlapped with other useful work, whereas instruction misses stall the front-end of the processor, leading to greater performance loss than data misses.
Memory access characterization is important for designing memory hierarchies. While many works have characterized the memory behaviour of the SPEC benchmarks, the results have either been tied to a specific micro-architecture or ignored the time-based behaviour of the benchmarks. In this thesis, we remove a majority of the micro-architectural features to characterize the intrinsic memory behaviour of the SPEC benchmarks and use this to understand how the workloads behave with various cache sizes and prefetching. In order to simplify the analysis of complex time-based results, we propose the use of MPKI bins, which divide the execution into distinct ranges of MPKI (misses per kilo-instruction). Using MPKI bins, we demonstrate that short memory-bound phases cause a significant percentage of the overall cache misses.
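As a rough illustration of the MPKI-bin idea, the following minimal Python sketch groups fixed-size execution windows by their MPKI and reports, per bin, the share of execution and the share of all misses. The function name `mpki_bin_summary`, the window size, and the bin edges are assumptions made for this example; they are not taken from the thesis.

```python
from bisect import bisect_right


def mpki_bin_summary(misses_per_window, window_instructions, bin_edges):
    """Summarise a miss trace by MPKI bin (hypothetical helper).

    misses_per_window: cache misses counted in each fixed-size window.
    window_instructions: instructions per window (assumed constant).
    bin_edges: ascending MPKI thresholds; n edges define n + 1 bins.

    Returns one (share_of_windows, share_of_misses) tuple per bin.
    """
    if not misses_per_window:
        raise ValueError("need at least one window")

    n_bins = len(bin_edges) + 1
    windows = [0] * n_bins
    misses = [0] * n_bins

    for m in misses_per_window:
        mpki = m * 1000 / window_instructions
        b = bisect_right(bin_edges, mpki)  # index of the bin this window falls in
        windows[b] += 1
        misses[b] += m

    total_windows = len(misses_per_window)
    total_misses = sum(misses_per_window)
    return [
        (
            windows[b] / total_windows,
            misses[b] / total_misses if total_misses else 0.0,
        )
        for b in range(n_bins)
    ]


# Made-up trace: nine quiet windows and one short memory-bound window.
trace = [100] * 9 + [5000]
for share_time, share_misses in mpki_bin_summary(trace, 100_000, [1, 10, 30]):
    print(f"{share_time:.2f} of execution, {share_misses:.2f} of misses")
```

In this made-up trace, the single window in the highest bin covers 10% of the execution but accounts for roughly 85% of the misses, which is the kind of pattern the bins are meant to expose.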
For instructions, the growing instruction footprints of server workloads are causing significant performance losses due to front-end stalls that cannot be overlapped or hidden by out-of-order execution. The second part of this thesis develops a technique to enable dedicated instruction prefetchers without the area cost of separate metadata storage structures. We propose to re-purpose the branch target buffer (BTB) to store prefetcher metadata, based on the insight that benchmarks that require a dedicated instruction prefetcher can tolerate increased BTB misses. Going further, we propose L2 instruction bypassing, based on the insight that decreased L2 data misses deliver more benefit than the slight instruction latency reduction of having instructions in the L2. We show that L2 instruction bypass delivers more performance than a dedicated instruction prefetcher and instruction-focused replacement policies.
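The intuition behind L2 instruction bypassing can be sketched with a toy model. The Python class below is a hypothetical, heavily simplified L2 (fully associative, LRU replacement, line-granularity addresses) in which instruction fills may skip allocation. It is an assumption-laden illustration of the general idea, not the mechanism evaluated in the thesis.

```python
from collections import OrderedDict


class L2WithInstructionBypass:
    """Toy fully associative LRU cache; all names here are illustrative."""

    def __init__(self, capacity_lines, bypass_instructions=True):
        self.capacity = capacity_lines
        self.bypass = bypass_instructions
        # Keys are line addresses, ordered from least to most recently used.
        self.lines = OrderedDict()

    def access(self, line_addr, is_instruction):
        """Return True on a hit, False on a miss."""
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)  # refresh LRU position
            return True
        # On a miss, allocate the line unless it is a bypassed instruction fill.
        if not (self.bypass and is_instruction):
            self.lines[line_addr] = None
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
        return False


def count_hits(bypass):
    """Replay a tiny made-up trace and count hits in the toy L2."""
    l2 = L2WithInstructionBypass(capacity_lines=2, bypass_instructions=bypass)
    trace = [(0xA, False), (0xB, False), (0x100, True), (0xA, False), (0xB, False)]
    return sum(l2.access(addr, is_instr) for addr, is_instr in trace)
```

With this made-up trace and a two-line cache, `count_hits(True)` yields two data hits, while `count_hits(False)` yields none, because the instruction line evicts data that is about to be reused. Real caches are set-associative and interact with prefetchers and the L1-I, so the sketch only conveys the direction of the effect.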