Trua: Efficient Task Replication for Flexible User-defined Availability in Scientific Grids

Zhe Zhang, Brian Bockelman, Derek Weitzel, Xinkai Zhang, Hamid Vakilzadian, David Swanson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Failure is inevitable in scientific computing. As scientific applications and facilities have increased in scale over the last decades, finding the root cause of a failure can be very complex or at times nearly impossible. Different scientific computing customers have varying availability demands as well as differing willingness to pay for availability. In contrast to existing solutions that try to provide ever higher availability in scientific grids, we propose a model called Task Replication for User-defined Availability (Trua). Trua provides flexible, user-defined availability in scientific grids, allowing customers to express their desired availability to computational providers. Trua differs from existing task replication approaches in two ways. First, it relies on historical failure information collected from the virtual layer of the scientific grid. The reliability model for the failures can be represented with a bimodal Johnson distribution, which differs from any existing distribution. Second, it adopts an anomaly detector to filter out anomalous failures; it additionally adopts novel selection algorithms to mitigate the effects of temporal and spatial correlations among failures without knowing their root cause. We apply Trua to real-world traces collected from the Open Science Grid (OSG). Our results show that Trua can successfully meet user-defined availability demands.
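
As a rough illustration of the replication idea summarized above (not the selection algorithm from the paper): if each replica of a task fails independently with probability p, then at least one of n replicas succeeds with probability 1 - p^n. The Python sketch below finds the smallest n that meets a user-defined availability target. The function name `replicas_needed` and the independence assumption are hypothetical and chosen only for illustration; the abstract notes that real failures are temporally and spatially correlated, so this simple count should be read as an idealized baseline.

```python
def replicas_needed(p_fail: float, target_availability: float) -> int:
    """Smallest n such that 1 - p_fail**n >= target_availability.

    Assumes (for illustration only) that replicas fail independently,
    each with probability p_fail. This is not the Trua model itself.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 <= target_availability < 1.0:
        raise ValueError("target_availability must be in [0, 1)")

    n = 1
    # Add replicas until the chance that at least one succeeds meets the target.
    while 1.0 - p_fail ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical inputs: a 20% per-replica failure rate and a 99% target.
    print(replicas_needed(0.2, 0.99))
```

A user with a lower availability demand would get fewer replicas from the same calculation, which is the sense in which availability becomes a user-chosen trade-off against resource use.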

Original language: English (US)
Title of host publication: Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
Editors: Laurent Lefevre, Carlos A. Varela, George Pallis, Adel N. Toosi, Omer Rana, Rajkumar Buyya
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 360-369
Number of pages: 10
ISBN (Electronic): 9781728160955
State: Published - May 2020
Event: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020 - Melbourne, Australia
Duration: May 11 2020 → May 14 2020

Publication series

Name: Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020

Conference

Conference: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
Country/Territory: Australia
City: Melbourne
Period: 5/11/20 → 5/14/20

Keywords

  • Task replication
  • anomaly
  • availability
  • bathtub curve
  • failure
  • lifetime distribution
  • scientific grids
  • system utilization
  • valley

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
