Heterogeneous Computing: A Unified HW-SW Approach  
University of Illinois at Urbana-Champaign

Device specialization is a natural path to power efficiency. A significant amount of work has already gone into compute specialization, and future systems will contain some collection of heterogeneous compute elements, a trend that has already begun in modern embedded systems. However, current approaches suffer from many inefficiencies that prevent significant performance and energy savings from being attained. We believe that a unified hardware-software co-design approach is needed to remove these inefficiencies. In this project, we focus on two areas:

1. Transferring data efficiently through the memory system.
2. Enabling programmers to use diverse heterogeneous hardware without losing portability, and enabling more effective software-hardware interfaces.

Memory system:

The memory system that connects these diverse compute elements has not received as much attention as the compute units themselves. In current embedded systems, each compute element has its own memory resources. These memory resources are only loosely integrated with each other, which results in unnecessary data copying and movement. This in turn wastes energy and prevents fine-grained tasks from being off-loaded to specialized units that may be able to perform them more efficiently. Furthermore, as process technology continues to scale down, memory accesses are expected to become the dominant consumer of energy. Thus, finding efficient methods to transfer data between compute elements is essential.

We believe that a major source of wasted energy in modern memory systems is their largely software-oblivious design. By utilizing information from the software, we can transfer data through the system more efficiently. This insight derives from our ongoing and prior work in the DeNovo project.

Interface mechanism - a typed virtual instruction set:

Programming applications for hardware that uses diverse combinations of computing elements is extremely challenging. These challenges arise from three root causes: (1) diverse parallelism models; (2) diverse memory architectures; and (3) diverse hardware instruction set semantics. To use the full range of available hardware and maximize performance and energy efficiency, the programming environment needs to provide common abstractions for all the hardware compute units available in a heterogeneous system. These abstractions are required not only at the source-code level but also at the object-code level, so that object code is portable across devices from the same or different manufacturers. This allows an application vendor to ship a single software version across a broad range of devices.

We believe that these issues are best addressed using a language-neutral, virtual instruction set layer that abstracts away most of the low-level details of hardware, an approach we call Virtual Instruction Set Computing or VISC. Our system organization is shown in Figure 1. The key point is that the only software components that can "see" the hardware details are the translators (i.e., compiler back ends), system-level and application-level schedulers, a minimal set of other low-level OS components and some device drivers. The rest of the software stack, including source-level language implementations, application libraries, and middleware, lives above the virtual ISA and is portable across different heterogeneous system configurations. Unlike previous VISC systems, our virtual instruction set design abstracts away and unifies the diverse forms of parallelism in hardware (using a combination of only two models of parallelism). It also provides abstractions for memory and communication, allowing back-end translators to generate code for efficient data movement across compute units. These abstractions enable programmers to write efficient software applications that are portable across a diverse range of hardware configurations. Moreover, we are exploiting the flexible translator-hardware communication in VISC systems to enable the novel memory system designs described above.
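The layering described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the project's actual virtual instruction set: the names `VMap`, `VReduce`, `ScalarTranslator`, and `LaneTranslator` are hypothetical and were chosen only for this example. The point it shows is that a program written once against virtual instructions stays unchanged, while each translator (back end) decides how to execute it on its own kind of hardware.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


# Hypothetical virtual instruction: apply `fn` to every element.
# Illustrative stand-in only, not the project's real virtual ISA.
@dataclass(frozen=True)
class VMap:
    fn: Callable[[float], float]


# Hypothetical virtual instruction: combine all elements with `fn`.
# `fn` is assumed to be associative so back ends may regroup the work.
@dataclass(frozen=True)
class VReduce:
    fn: Callable[[float, float], float]
    init: float


class ScalarTranslator:
    """Back end for an imagined scalar core: one element at a time."""

    def run(self, program: Sequence[object], data: Sequence[float]) -> List[float]:
        values = list(data)
        for instr in program:
            if isinstance(instr, VMap):
                values = [instr.fn(v) for v in values]
            elif isinstance(instr, VReduce):
                acc = instr.init
                for v in values:
                    acc = instr.fn(acc, v)
                values = [acc]
            else:
                raise TypeError(f"unknown virtual instruction: {instr!r}")
        return values


class LaneTranslator:
    """Back end for an imagined wide unit: works on fixed-width chunks."""

    def __init__(self, lanes: int = 4) -> None:
        self.lanes = lanes

    def run(self, program: Sequence[object], data: Sequence[float]) -> List[float]:
        values = list(data)
        for instr in program:
            if isinstance(instr, VMap):
                out: List[float] = []
                for i in range(0, len(values), self.lanes):
                    out.extend(instr.fn(v) for v in values[i:i + self.lanes])
                values = out
            elif isinstance(instr, VReduce):
                # Reduce each chunk to a partial result, then combine partials.
                partials: List[float] = []
                for i in range(0, len(values), self.lanes):
                    acc = values[i]
                    for v in values[i + 1:i + self.lanes]:
                        acc = instr.fn(acc, v)
                    partials.append(acc)
                acc = instr.init
                for p in partials:
                    acc = instr.fn(acc, p)
                values = [acc]
            else:
                raise TypeError(f"unknown virtual instruction: {instr!r}")
        return values


if __name__ == "__main__":
    # The "portable" program: sum of squares, written once.
    program = [VMap(lambda x: x * x), VReduce(lambda a, b: a + b, 0.0)]
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(ScalarTranslator().run(program, data))
    print(LaneTranslator(lanes=4).run(program, data))
```

In this sketch only the translator classes know how work is grouped for their target; the program itself carries no device detail, which mirrors the division of responsibility described above.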

Figure 1: System Organization

2012 Qualcomm Innovation Fellowship