Communications of the ACM

ACM TechNews

Supercomputers Face Growing Resilience Problems

IBM Blue Gene/L supercomputer

Credit: CNET

North Carolina State University (NCSU) researchers have developed RedMPI, software that runs in conjunction with the Message Passing Interface (MPI), a library for splitting applications across multiple servers so the different parts of the program can be executed in parallel. The researchers say RedMPI could be a solution to the growing vulnerability of high-performance computing systems.

As supercomputers grow more powerful, they also grow more vulnerable to failure because they contain more components. NCSU's David Fiala says the problem will only get worse as the industry moves toward exascale systems. He says that to account for the additional hardware required for exascale computing, system reliability will need to improve 100-fold just to keep the same mean time between failures that today's supercomputers provide.
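The arithmetic behind that scaling argument can be sketched in a few lines of Python. This is a back-of-the-envelope model, not Fiala's own calculation: it assumes components fail independently at a constant rate, so that system MTBF is roughly component MTBF divided by component count. The function name and all the numbers below are illustrative assumptions, not figures from the article.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """Approximate system MTBF, assuming independent components
    with constant failure rates (a simplifying assumption)."""
    return component_mtbf_hours / n_components


# Hypothetical baseline: 10,000 components, each with a 1,000,000-hour MTBF.
today = system_mtbf_hours(1_000_000, 10_000)

# Hypothetical larger machine: 100 times as many components, same parts.
bigger_same_parts = system_mtbf_hours(1_000_000, 1_000_000)

# Same larger machine, but with parts that are 100 times more reliable.
bigger_better_parts = system_mtbf_hours(100_000_000, 1_000_000)

print(today, bigger_same_parts, bigger_better_parts)
```

Under these assumptions, multiplying the component count by 100 divides the system MTBF by 100, and only a matching 100-fold gain in per-component reliability restores the original figure.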

RedMPI addresses silent data corruption by running multiple copies of a program simultaneously and comparing their answers. It intercepts every MPI message that an application sends and distributes copies of the message to the program's clones. If different clones calculate different answers, the numbers can be recalculated on the fly.
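The comparison step can be illustrated with a small Python sketch. It is not RedMPI's implementation and does not use MPI at all: it simply takes the payload that each replica would have sent for the same logical message and applies a majority vote, which is one common way to compare redundant results. The function name `compare_replicas` and the sample payloads are hypothetical.

```python
from collections import Counter


def compare_replicas(messages):
    """Compare the payloads produced by each replica for one logical send.

    Returns (agreed_value, indexes_of_disagreeing_replicas).
    Raises ValueError when no strict majority exists, meaning corruption
    was detected but cannot be resolved by voting alone.
    """
    counts = Counter(messages)
    value, votes = counts.most_common(1)[0]
    if votes == len(messages):
        return value, []                      # all replicas agree
    if votes <= len(messages) // 2:
        raise ValueError("replicas disagree and no majority exists")
    bad = [i for i, m in enumerate(messages) if m != value]
    return value, bad                         # outvoted replicas are suspect


# Three replicas, one silently corrupted: the majority value wins.
print(compare_replicas([b"42", b"43", b"42"]))
```

In this sketch, two replicas are enough to detect a mismatch but not to say which copy is wrong, while three allow the odd one out to be identified and its result recomputed.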

From IDG News Service


Abstracts Copyright © 2012 Information Inc., Bethesda, Maryland, USA

