RSIM Project Overview

Our research has been driven by two trends: (1) the increasing use of instruction-level parallelism (ILP) in recent processors, and (2) the ongoing shift in high performance computing from scientific and engineering applications to commercial database and media processing applications. The following sections describe our work in four related areas:

Instruction-Level Parallelism in Shared-Memory Multiprocessors

Much of the recent increase in processor performance has come from instruction-level parallelism (ILP), which exploits parallelism within the sequential instruction stream of a processor. Current processors exploit ILP through complex techniques such as multiple issue, out-of-order issue, non-blocking loads, and speculation. Most previous evaluation studies of shared-memory systems, however, have assumed a very simple model of the processor. Our work has been the first to explore how to exploit ILP features in shared-memory systems, beginning with an extensive analysis of how ILP affects the performance of current and near-future shared-memory systems for a range of applications.
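
To illustrate why these ILP features change memory system behavior (a simplified sketch for this overview, not code from the studies), consider the two loops below. An out-of-order processor with non-blocking loads can overlap the independent misses in the first loop, but must serialize the dependent, pointer-chasing loads in the second; a simple blocking-load processor model treats both cases the same way.

    #include <cstddef>
    #include <vector>

    // Independent loads: a processor with non-blocking loads and out-of-order
    // issue can have several of these misses outstanding at once, so their
    // latencies overlap.
    long sum_array(const std::vector<long>& a) {
        long sum = 0;
        for (std::size_t i = 0; i < a.size(); ++i)
            sum += a[i];                    // no load depends on another load
        return sum;
    }

    // Dependent loads: each load needs the address produced by the previous
    // one, so even an aggressive ILP processor waits out each miss in turn.
    struct Node { long value; Node* next; };

    long sum_list(const Node* n) {
        long sum = 0;
        for (; n != nullptr; n = n->next)   // each load depends on the prior load
            sum += n->value;
        return sum;
    }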

Exploiting Instruction-Level Parallelism for Memory System Performance

Building on this analysis, we have developed new techniques that exploit ILP to improve memory system performance in both uniprocessors and multiprocessors.
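
The sketch below illustrates the general flavor of such techniques: restructuring code so that several independent misses sit close enough together for an out-of-order processor with non-blocking loads to overlap them. The particular transformation, cluster size, and padded record layout shown are illustrative assumptions rather than the specific techniques from this work.

    #include <cstddef>

    // Illustrative only: assumes each Record occupies its own cache line, so
    // a[i] and a[i+1] miss independently.
    struct Record { double key; char pad[56]; };           // padded to 64 bytes

    inline double work(double x) { return x * x + 1.0; }   // stands in for real per-element work

    // Baseline: when work() is expensive, the next element's load sits far
    // away in the instruction stream, so its miss is rarely overlapped with
    // the current one.
    double baseline(const Record* a, std::size_t n) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            acc += work(a[i].key);
        return acc;
    }

    // Restructured: a small group of independent loads is issued back to back,
    // letting the processor overlap their miss latencies before the dependent
    // computation begins.
    double clustered(const Record* a, std::size_t n) {
        double acc = 0.0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            double k0 = a[i].key, k1 = a[i + 1].key,
                   k2 = a[i + 2].key, k3 = a[i + 3].key;    // four independent misses
            acc += work(k0) + work(k1) + work(k2) + work(k3);
        }
        for (; i < n; ++i)                                  // remainder
            acc += work(a[i].key);
        return acc;
    }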

Architectures for Emerging Applications in Databases and Media Processing

Until recently, high performance computing was primarily driven by scientific and engineering applications. In the future, however, database and media processing applications will be among the largest consumers of computing cycles. In collaboration with researchers from Compaq Western Research Laboratory, we have performed the first comprehensive simulation study of on-line transaction processing (OLTP) and decision support system (DSS) applications on shared-memory systems with state-of-the-art processor designs [ASPLOS'98]. Our most significant finding was that the OLTP workload is dominated by instruction cache misses and migratory data misses; its performance can be significantly improved, however, by using simple stream buffers to address the instruction cache misses and a combination of prefetching and producer-initiated communication [HPCA'97] to address the migratory data misses.
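
As a rough sketch of the kind of hardware involved (the buffer depth, allocation policy, and interface below are assumptions for illustration, not details from the paper), a sequential stream buffer can be modeled as a small FIFO of prefetched line addresses that is probed whenever the instruction cache misses:

    #include <cstdint>
    #include <deque>

    // Minimal, simulator-style model of a sequential stream buffer.
    class StreamBuffer {
    public:
        StreamBuffer(unsigned depth, unsigned line_bytes)
            : depth_(depth), line_(line_bytes) {}

        // Called on an instruction cache miss to address `addr`.
        // Returns true if the stream buffer can supply the line.
        bool access(std::uint64_t addr) {
            std::uint64_t line_addr = addr - addr % line_;
            if (!fifo_.empty() && fifo_.front() == line_addr) {
                fifo_.pop_front();              // hit: hand the line to the cache
                prefetch_next();                // keep prefetching sequential lines
                return true;
            }
            fifo_.clear();                      // miss: start a new stream
            next_ = line_addr + line_;          // beginning after the missing line
            while (fifo_.size() < depth_) prefetch_next();
            return false;
        }

    private:
        void prefetch_next() {
            if (fifo_.size() < depth_) { fifo_.push_back(next_); next_ += line_; }
        }

        unsigned depth_, line_;
        std::uint64_t next_ = 0;
        std::deque<std::uint64_t> fifo_;
    };

The appeal of this structure is that it is small and sits off the processor's critical path, yet it captures the largely sequential instruction reference stream that dominates OLTP instruction cache misses.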

We have also started work on general-purpose architectures for media processing applications, in collaboration with Dr. Jouppi of Compaq Western Research Laboratory. Our first paper in this area characterizes how several media processing applications use various features of general-purpose processors, including media instruction set extensions and software prefetching [ISCA'99]. Our second paper proposes reconfigurable caches, which allow the cache SRAM arrays to be dynamically divided into partitions that can be used for different processor activities [ISCA'00]. These studies have led to a larger project on architectures for multimedia and communications applications.
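
To give a concrete flavor of the partitioning idea (the way-based split, sizes, and two-partition organization below are illustrative assumptions rather than the exact mechanism in the paper), a set-associative SRAM array can be divided by assigning each partition a contiguous range of ways and probing only that range on a lookup:

    #include <cstdint>

    // Illustrative way-partitioned lookup: the ways of a set-associative SRAM
    // array are divided at run time between two uses (e.g., conventional cache
    // data vs. another processor activity).
    struct WayPartitionedCache {
        static constexpr unsigned kSets = 256, kWays = 4, kLine = 64;
        std::uint64_t tags[kSets][kWays] = {};
        bool valid[kSets][kWays] = {};
        unsigned ways_for_partition0 = 2;        // remaining ways belong to partition 1

        bool lookup(std::uint64_t addr, int partition) const {
            unsigned set = (addr / kLine) % kSets;
            std::uint64_t tag = addr / kLine / kSets;
            unsigned begin = (partition == 0) ? 0 : ways_for_partition0;
            unsigned end   = (partition == 0) ? ways_for_partition0 : kWays;
            for (unsigned w = begin; w < end; ++w)   // probe only this partition's ways
                if (valid[set][w] && tags[set][w] == tag) return true;
            return false;
        }
    };

Repartitioning then amounts to changing ways_for_partition0 at run time; a real design must also decide what happens to blocks held in ways that switch partitions, for example by flushing or lazily invalidating them.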

Fast and Accurate Performance Evaluation of Shared-Memory Multiprocessors

Sound performance evaluation methodology is essential for credible computer architecture research. We have shown that commonly used shared-memory simulators based on simple approximations to model ILP processors give unacceptably large and application-dependent errors (over 100% error in some cases) [HPCA'97]. For higher accuracy, we have developed a detailed simulator, RSIM, which has been licensed to over 600 users worldwide [TCCA newsletter'97].
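
A toy calculation (the latencies below are assumptions, not measurements from the study) shows how a simple approximation can go this far wrong: a model that stalls the processor on every miss charges the miss latencies serially, while an ILP processor with non-blocking loads overlaps independent misses.

    #include <iostream>

    int main() {
        // Assumed toy numbers: four independent cache misses of 100 cycles
        // each, plus 50 cycles of computation.
        const int misses = 4, miss_latency = 100, compute = 50;

        // Simple "stall on every miss" approximation: latencies add up.
        int simple_estimate = compute + misses * miss_latency;    // 450 cycles

        // ILP processor with non-blocking loads: the independent misses
        // overlap, so memory time is roughly one miss latency, not four.
        int ilp_estimate = compute + miss_latency;                // 150 cycles

        std::cout << "simple approximation: " << simple_estimate << " cycles\n"
                  << "overlap-aware estimate: " << ilp_estimate << " cycles\n";
        return 0;
    }

In this toy case the simple approximation overestimates execution time by 200%, and the size of the error depends entirely on how much miss overlap the application exposes, which is why the errors are application-dependent.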

The detail in RSIM, however, makes it ten times slower than the previous, simpler simulators, resulting in a significant tradeoff between accuracy and speed; the simpler simulators therefore remain widely used. To improve this tradeoff, we have developed a new simulation technique that is almost as accurate as RSIM and, on average, only 2.7 times slower than the commonly used fast simulators [HPCA'99]. These results force a reconsideration of the simulation methodology for shared-memory multiprocessors.

We have also collaborated with researchers from the University of Wisconsin on an analytic model for ILP shared-memory multiprocessors that is highly accurate for many cases and can be solved in a few seconds [ISCA'98]. To address the problem of estimating parameter values for this model, we designed a very fast simulator called FastILP that provides the key ILP-related parameters (but not execution time) at a speed almost two orders of magnitude faster than RSIM.

We are currently investigating additional techniques to further improve the speed of both shared-memory and uniprocessor simulation with ILP processors.

Miscellaneous

We have also worked on producer-initiated communication (remote writes) in combination with consumer-initiated communication (prefetching) [HPCA'97]. In addition, in collaboration with the Rice TreadMarks group, we have compared lazy release consistency and entry consistency software DSM architectures [HPCA'96].
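
As a sketch of the two mechanisms (remote_write_hint below is a hypothetical placeholder for hardware support, not a real API; __builtin_prefetch is the GCC/Clang prefetch intrinsic), consumer-initiated communication has the consumer pull data it is about to use, while producer-initiated communication has the producer push freshly written data toward its consumer:

    #include <atomic>

    struct Shared {
        double data[64];
        std::atomic<bool> ready{false};
    };

    // Hypothetical primitive: ask the memory system to forward this line to
    // its consumer (stands in for a remote-write mechanism; placeholder only).
    inline void remote_write_hint(const void* /*addr*/) {}

    void producer(Shared& s) {
        for (int i = 0; i < 64; ++i) s.data[i] = i * 0.5;
        remote_write_hint(s.data);                       // producer-initiated: push the data
        s.ready.store(true, std::memory_order_release);  // then publish the flag
    }

    double consumer(Shared& s) {
        while (!s.ready.load(std::memory_order_acquire)) {}   // wait for the flag
        // Consumer-initiated: prefetch the lines before they are used; in real
        // code, independent work would follow the prefetches to hide latency.
        __builtin_prefetch(&s.data[0]);
        __builtin_prefetch(&s.data[8]);                        // next line, assuming 64-byte lines
        double sum = 0.0;
        for (int i = 0; i < 64; ++i) sum += s.data[i];
        return sum;
    }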

