Overview
The SWAT (SoftWare
Anomaly Treatment) project at Illinois has developed a novel and
extremely low-cost design paradigm for affordable hardware reliability
across all segments of the computing market.
Devices in shipped
chips are increasingly expected to fail for many reasons, threatening
future improvements in functionality and performance of computing
systems. At the same time,
computers are becoming pervasive and society increasingly depends on
their reliable operation. Previous solutions for hardware resiliency
required significant redundancy and are too expensive to be widely
deployed. In contrast, SWAT enables reliable operation at near-zero
cost, making reliability affordable and pervasive.

Instead of heavyweight
always-on redundant computing, SWAT employs extremely low-cost monitors
that look for anomalous software behavior as symptoms of hardware
faults. In the relatively rare event of such symptom detection, SWAT
employs a sophisticated operation to rescue and recover the system from
the impact of the fault. Since this rescue happens relatively rarely, it
can be done largely in software at very low cost. This is analogous to
the deployment of the Special Weapons And Tactics (S.W.A.T.) team that
remains latent in the common case, and is called for only in high-risk
situations. A SWAT-enabled system is thus equipped with very simple,
low-cost symptom detectors, a specialized diagnosis procedure that can
identify the source of the problem, and a recovery mechanism that can
seamlessly recover the system to a fault-free state.

The project has
attracted significant interest from industry. Last year, the
Semiconductor Research Corporation (SRC), a consortium of semiconductor
companies, provided funding for prototyping SWAT as a step towards
validation and potential transfer of the technology to industry.
Additionally, a SWAT student will be spending several months at Intel to
demonstrate SWAT for systems and fault models relevant to Intel. In
recognition of the potential of the work, another student received a
fellowship from Intel and a scholarship from IBM. We were also able to
win a Computing Innovations fellowship grant through which a female
postdoctoral student has now joined the SWAT team.

In summary, SWAT is
addressing a critical challenge, its novel solution strategy is
attracting much industry attention, and it has the potential to be a
game-changing innovation not only for the microprocessor industry but
also for the computing industry at large.

SWAT Framework Components
SWAT has the following framework components:
1. Detection: SWAT detects hardware faults by monitoring software
misbehavior.
2. Recovery: SWAT relies on a checkpointing mechanism to recover the
state of the system. On detection, the system is rolled back to a
pristine state.
3. Diagnosis: After a fault is detected, a diagnosis
procedure is invoked that replays the fault-activating trace
repeatedly on the faulty hardware to identify the source of the
fault.
4. Repair: Once the source of the fault is diagnosed, an
appropriate repair action is performed, depending on the availability
of redundant hardware components.
All of these components are controlled by a flexible firmware layer.
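
The four components above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not SWAT's actual firmware: the names ToyMachine, SymptomDetected, and firmware_run are hypothetical, the "hardware" is two interchangeable units called alu0 and alu1, the injected fault simply corrupts a result, and the symptom monitor is a check for a value that this workload can never legitimately produce.

```python
class SymptomDetected(Exception):
    """A low-cost monitor saw anomalous software behavior (toy stand-in)."""


class ToyMachine:
    """Hypothetical machine: an accumulator served by two redundant units."""

    def __init__(self, faulty_unit=None):
        self.enabled = ["alu0", "alu1"]  # assumed redundant hardware units
        self.faulty_unit = faulty_unit   # injected fault, hidden from firmware
        self.state = 0

    def execute(self, op, unit):
        result = self.state + op
        if unit == self.faulty_unit:
            result = -1                  # the fault corrupts the result
        if result < 0:                   # symptom: impossible for this workload
            raise SymptomDetected()
        self.state = result

    def run(self, trace, units):
        # Spread the operations of the trace round-robin over the given units.
        for i, op in enumerate(trace):
            self.execute(op, units[i % len(units)])


def firmware_run(machine, trace):
    """Run a trace of non-negative ints; on a symptom, recover, diagnose, repair."""
    log = []
    checkpoint = machine.state           # checkpoint taken before execution
    try:
        machine.run(trace, machine.enabled)
        return machine.state, log        # common case: no symptom, no overhead
    except SymptomDetected:
        log.append("detected")           # 1. detection

    machine.state = checkpoint           # 2. recovery: roll back
    log.append("rolled back")

    suspects = []
    for unit in machine.enabled:         # 3. diagnosis: replay on each unit
        machine.state = checkpoint
        try:
            machine.run(trace, [unit])
        except SymptomDetected:
            suspects.append(unit)
    machine.state = checkpoint
    log.append(f"diagnosed {suspects}")

    healthy = [u for u in machine.enabled if u not in suspects]
    if not healthy:
        raise RuntimeError("no redundant unit available for repair")
    machine.enabled = healthy            # 4. repair: disable the faulty unit
    log.append(f"disabled {suspects}")

    machine.run(trace, machine.enabled)  # re-execute on the repaired machine
    return machine.state, log
```

Calling firmware_run(ToyMachine(faulty_unit="alu1"), [1, 2, 3, 4]) exercises all four steps, while ToyMachine() with no injected fault takes only the fast path and returns an empty log. The point of the sketch is the control flow: the expensive replay and repair logic runs only after a symptom fires, which is why it can live in software.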