Our research has been driven by two trends: (1) the increasing use of instruction-level parallelism (ILP) in recent processors, and (2) the increasing shift in high performance computing from scientific and engineering applications to commercial database and media processing applications. The following describes our work in four related areas:
instruction-level parallelism in shared-memory multiprocessors,
exploiting instruction-level parallelism for memory system performance,
architectures for emerging applications in databases and media processing, and
fast and accurate performance evaluation of shared-memory multiprocessors.
Instruction-Level Parallelism in Shared-Memory Multiprocessors
Much of the recent increase in the performance of processors has come from instruction-level parallelism or ILP, which exploits parallelism within the sequential instruction stream of a processor. Current processors exploit ILP through complex techniques such as multiple issue, out-of-order issue, non-blocking loads, and speculation. Most previous evaluation studies of shared-memory systems, however, have assumed a very simple model of the processor. Our work has been the first to explore how to exploit ILP features in shared-memory systems, beginning with an extensive analysis of how ILP affects current shared-memory performance. For a range of applications on current and near-future systems, our key results are:
ILP features are effective in improving CPU performance but are not as effective in improving memory system performance. The key reason is that the applications do not generally expose multiple read misses within the space of a hardware instruction window. The processor, therefore, cannot overlap multiple read misses to hide their latencies behind each other; read miss latencies are too long to be hidden behind other computation [HPCA'97, IEEE TOC'99].
Traditional memory system optimizations such as software prefetching and relaxed memory consistency models are insufficient to hide memory latency in shared-memory systems with ILP processors [ISCA'97, IEEE TOC'99].
ILP features can be used to significantly narrow the hardware performance gap between sequential consistency and relaxed consistency models; however, with current optimizations, a significant gap remains for some applications due to write latency [ASPLOS'96, SPAA'97, Proc. of the IEEE'99].
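The first result above can be illustrated with a deliberately crude model, offered only as a sketch and not as the methodology of the cited papers. It assumes that a read miss overlaps with an earlier miss only if the two lie within one instruction window of each other, and that each group of overlapped misses exposes one full miss latency. The function name, the trace positions, the window size, and the latency are all hypothetical.

```python
def read_stall_cycles(miss_positions, window, latency):
    """Estimate exposed read-miss stall time under a crude overlap model.

    miss_positions: instruction indices of read misses (hypothetical trace).
    window: instruction window size, in instructions.
    latency: cycles per read miss.
    A miss overlaps with the leading miss of its group if it lies fewer than
    `window` instructions after it; each group exposes one full latency.
    """
    groups = 0
    leader = None
    for pos in sorted(miss_positions):
        if leader is None or pos - leader >= window:
            groups += 1      # this miss cannot overlap; it starts a new group
            leader = pos
    return groups * latency


spread = [0, 100, 200, 300]   # misses farther apart than the window
adjacent = [0, 4, 8, 12]      # the same number of misses, close together

print(read_stall_cycles(spread, window=64, latency=200))    # 800
print(read_stall_cycles(adjacent, window=64, latency=200))  # 200
```

Under these assumptions the same four misses cost four latencies when spread out and one latency when they fall in the same window, which is the effect the result describes.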
Exploiting Instruction-Level Parallelism for Memory System Performance
Using the above results, we have developed new techniques to exploit ILP to improve memory system performance in both uniprocessors and multiprocessors.
The above results motivate an optimization that moves multiple read misses closer to each other so they appear within the space of a hardware instruction window and can be overlapped with each other [HPCA'97, IEEE TOC'99]. We have developed a compiler algorithm to support this clustering of read misses without conflicting with previous locality optimizations (which often result in moving read misses apart) [JILP'00 (shorter version in MICRO'99)]. This optimization gives significant performance benefits on both uniprocessors and multiprocessors, both in simulation and on a real machine.
We have shown that the above read miss clustering technique combines well with the commonly used technique of software prefetching for tolerating memory latency [PACT'01]. Although on the surface the two techniques appear to target similar latencies, the combination works better than either technique alone, with each technique helping to overcome the limitations of the other. Again, our results cover uniprocessors and shared-memory multiprocessors, both in simulation and on a real system.
We have developed a new technique that uses speculative retirement to further narrow the hardware performance gap between sequential consistency and relaxed memory consistency models [SPAA'97].
We have also developed a memory-side prefetching technique to hide latency incurred by inherently serial accesses to linked data structures (LDS) [TR'01]. A programmable prefetch engine sits close to memory and traverses LDS independently from the processor. The prefetch engine can run ahead of the processor because of its low-latency, high-bandwidth path to memory. This allows the prefetch engine to initiate data transfers earlier than the processor and pipeline multiple such transfers over the network. We evaluate this technique on a system with processor-in-memory (PIM) chips. We find this technique provides significant performance benefits, but a combination of both memory-side and processor-side prefetching performs best.
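The benefit of traversing a linked data structure close to memory can be sketched with a toy timing model. This is an illustration under stated assumptions, not the evaluation in [TR'01]: the function names and all latency values are hypothetical, and the model ignores the cost of starting the engine, bandwidth limits, and cache effects. It assumes only that the processor must wait a full round trip before it learns each next pointer, while an engine next to memory follows pointers at local-memory latency and ships each node to the processor as soon as it is read.

```python
def processor_side_time(n_nodes, round_trip, work):
    """Cycles to traverse a list when the processor chases the pointers.

    Each node's address is known only after the previous node returns, so
    every access pays the full round trip before its work can start.
    """
    return n_nodes * (round_trip + work)


def memory_side_time(n_nodes, local_latency, one_way, work):
    """Cycles to traverse a list when an engine near memory chases the pointers.

    The engine reads node i after (i + 1) local accesses and sends it to the
    processor, which takes `one_way` cycles; transfers are pipelined.
    """
    finish = 0
    for i in range(n_nodes):
        arrival = (i + 1) * local_latency + one_way
        finish = max(finish, arrival) + work   # wait for the data, then compute
    return finish


# Hypothetical parameters: 100 nodes, 200-cycle round trip, 40-cycle local
# access, 100-cycle one-way transfer, 20 cycles of work per node.
print(processor_side_time(100, round_trip=200, work=20))                 # 22000
print(memory_side_time(100, local_latency=40, one_way=100, work=20))     # 4120
```

In this model the serial chain of round trips is replaced by a chain of much shorter local accesses, and the network latency is paid roughly once instead of once per node.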
Architectures for Emerging Applications in Databases and Media Processing
Until recently, high performance computing was primarily driven by scientific and engineering applications. In the future, however, database and media processing applications will be among the largest consumers of computing cycles. In collaboration with researchers from Compaq Western Research Laboratory, we have performed the first comprehensive simulation study of on-line transaction processing (OLTP) and decision support system (DSS) applications on shared-memory systems with state-of-the-art processor designs [ASPLOS'98]. Our most significant results were that the OLTP workload is dominated by instruction cache and migratory data misses; however, its performance could be significantly improved by using simple stream buffers to address instruction cache misses, and a combination of prefetching and producer-initiated communication [HPCA'97] to address migratory data misses.
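The stream buffer idea mentioned above can be sketched as follows. This is a minimal functional model, not the configuration studied in [ASPLOS'98]: the cache is a hypothetical fully associative FIFO, the trace is synthetic straight-line code rather than an OLTP workload, and prefetch timing is ignored, so a line found at the head of the buffer is simply counted as not missing.

```python
from collections import deque


def count_misses(line_trace, cache_lines, depth):
    """Count demand fetches from memory for a trace of instruction line numbers.

    cache_lines: capacity of a (hypothetical) fully associative FIFO cache.
    depth: number of sequential lines the stream buffer holds; 0 disables it.
    """
    cache = deque()    # resident lines, oldest first
    stream = deque()   # prefetched lines, in sequential order
    misses = 0
    for line in line_trace:
        if line in cache:
            continue
        if stream and stream[0] == line:
            # The head of the stream buffer supplies the line;
            # prefetch the next sequential line to keep the buffer full.
            stream.popleft()
            stream.append(line + depth)
        else:
            # Miss in both: fetch from memory and restart the stream
            # at the lines that follow the missing one.
            misses += 1
            stream = deque(range(line + 1, line + 1 + depth))
        cache.append(line)
        if len(cache) > cache_lines:
            cache.popleft()
    return misses


# Synthetic trace: a 64-line code region executed three times,
# with a cache that holds only 16 lines.
trace = list(range(64)) * 3
print(count_misses(trace, cache_lines=16, depth=0))  # 192
print(count_misses(trace, cache_lines=16, depth=4))  # 3
```

On this trace the code footprint exceeds the cache, so every line misses without the buffer, while with it only the first line of each pass goes to memory on demand.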
We also started work on general-purpose architectures for media processing applications, in collaboration with Dr. Jouppi of Compaq Western Research Laboratory. Our first paper in this area characterizes how several media processing applications use various features of general-purpose processors, including media instruction set extensions and software prefetching [ISCA'99]. Our second paper in this area proposes reconfigurable caches to allow the cache SRAM arrays to be dynamically divided into partitions that can be used for different processor activities [ISCA'00]. These studies have led to a larger project in the area of architectures for multimedia and communications applications.
Fast and Accurate Performance Evaluation of Shared-Memory Multiprocessors
Sound performance evaluation methodology is essential for credible computer architecture research. We have shown that commonly used shared-memory simulators based on simple approximations to model ILP processors give unacceptably large and application-dependent errors (over 100% error in some cases) [HPCA'97]. For higher accuracy, we have developed a detailed simulator, RSIM, which has been licensed by over 600 users worldwide [TCCA newsletter'97].
The detail in RSIM, however, makes it ten times slower than previous simpler simulators, resulting in a significant tradeoff between accuracy and simulation speed. The previous simpler simulators are therefore still widely used for their higher speed. To improve this speed vs. accuracy tradeoff, we have developed a new simulation technique that is almost as accurate as RSIM and on average only 2.7 times slower than previous commonly used fast simulators [HPCA'99]. These results force a reconsideration of the simulation methodology for shared-memory multiprocessors.
We have also collaborated with researchers from the University of Wisconsin on an analytic model for ILP shared-memory multiprocessors that is highly accurate for many cases and can be solved in a few seconds [ISCA'98]. To address the problem of estimating parameter values for this model, we designed a very fast simulator called FastILP that provides the key ILP-related parameters (but not execution time) at a speed almost two orders of magnitude faster than RSIM.
We are currently investigating additional techniques to further improve the speed of both shared-memory and uniprocessor simulation with ILP processors.
Miscellaneous
We have also worked on producer-initiated communication (remote writes) in combination with consumer-initiated communication (prefetching) [HPCA'97] and, in collaboration with the Rice TreadMarks group, performed a comparison of lazy release consistency and entry consistency software DSM architectures [HPCA'96].