Design features, failure modes, and forensic analysis in the GFDL workflow management system

Time 07/14/14 03:45PM-04:15PM

Room GC402

At NOAA/GFDL, various applications and libraries are used in conjunction with the FMS Runtime Environment (FRE) to
provide a seamless end-to-end workflow management system encompassing multiple computing sites and a centralized
analysis system. The workflow involves many discrete steps, including configuration and running of climate models on
supercomputers, data transfers to analysis sites, and subsequent postprocessing and visualization. Given such a complex
runtime environment, fault resilience, reliability, and robustness are key design criteria. Understanding failure modes
and designing automatic recovery systems require individuals who are proficient in every component of the
workflow, including the specifics of the various hardware infrastructures. Further complicating any such forensic
analysis is the requirement within the climate community for absolute reproducibility at any stage of the workflow
during a simulation.

This talk will describe the FRE system and present the scales of computational and data volumes it is built to address.
We will discuss the most common failure modes and the design features intended to build resilience against them, and
present the various checks currently needed to guarantee data integrity, as well as details of past forensic analyses
that have been undertaken. Additionally, the talk will outline future scientific plans and how these
plans will alter the data integrity discussion.


Speaker Rusty Benson NOAA (National Oceanic & Atmospheric Administration, Washington, D.C.)

