Scalable Specialization
Sarita Adve's Research Group
University of Illinois at Urbana-Champaign
 

In nearly all compute domains, architectures are increasingly relying on parallelism and hardware specialization to exceed the limits of single-core performance. GPUs, FPGAs, and other specialized accelerators are being incorporated into systems ranging from mobile devices to supercomputers to data centers, presenting new challenges and opportunities to both hardware and software designers.

One of the biggest challenges in this area is efficient data movement. Growth in compute throughput has far outpaced growth in memory throughput for specialized devices, while Amdahl's Law has motivated a push to deliver efficiency gains for workloads with irregular sharing and fine-grain synchronization. Conventional heterogeneous programming paradigms, which offload coarse-grain tasks to accelerators and require communication through memory (or worse, explicit off-chip data transfers), cannot meet these new requirements.

This project focuses on developing architectures and programming interfaces that can flexibly, efficiently, and simply accommodate these rapidly evolving memory demands. Our cross-layer research includes innovations in:

- coherence protocols and consistency models for heterogeneous systems;
- novel application-customized coherent storage structures for specialized devices;
- hardware and software scheduling strategies for heterogeneous workloads;
- coherent data movement and automatic generation of efficient compute designs in configurable hardware (e.g., FPGAs);
- a novel hardware-software interface that redefines the notion of an ISA for heterogeneous computing;
- applications themselves, to make them amenable to future heterogeneous systems.


Scalable Heterogeneous Compute Architecture

Recent results

Our recent work has questioned conventional wisdom in industry on coherence protocols and consistency models for heterogeneous systems. Industry efforts to improve efficiency for emerging heterogeneous workloads (especially on GPUs) have tended to add complexity to the programming model through non-coherent scratchpads, scoped synchronization, and relaxed atomics. Each of these mechanisms gains efficiency at the cost of reduced programmability and limited usability.


Building on our past work on the DeNovo project and the data-race-free memory model, we have shown that it is possible to achieve high GPU cache efficiency without burdening the programmer with added complexity. The coherent stash architecture offers the efficiency benefits of a directly addressed scratchpad while also offering the programmability and global addressability of a coherent cache (ISCA 2015). We have proposed and evaluated the use of the DeNovo coherence protocol for GPU caches (MICRO 2015), and found that high cache reuse is possible in the presence of frequent synchronization without relying on scoped synchronization. Finally, we have defined DRF-Relaxed (ISCA 2017), which formalizes safe use cases of atomic relaxation and offers simple SC-centric semantics for programs which use them. Our work has been recognized as honorable mentions for IEEE Micro Top Picks, the Kuck PhD thesis award, keynote and other invited talks, and has led to multiple industry collaborations.