Pixida: Optimizing Data Parallel Jobs in Wide-area Data Analytics

Konstantinos Kloudas, Margarida Mamede, Nuno Preguiça, Rodrigo Rodrigues

Research output: Contribution to journal (Article)

45 Citations (Scopus)

Abstract

In the era of global-scale services, big data analytical queries are often required to process datasets that span multiple data centers (DCs). In this setting, cross-DC bandwidth is often the scarcest, most volatile, and/or most expensive resource. However, current widely deployed big data analytics frameworks make no attempt to minimize the traffic traversing these links.
In this paper, we present PIXIDA, a scheduler that aims to minimize data movement across resource-constrained links. To achieve this, we introduce a new abstraction called SILO, which is key to modeling PIXIDA's scheduling goals as a graph partitioning problem. Furthermore, we show that existing graph partitioning problem formulations do not map to how big data jobs work, causing their solutions to miss opportunities for avoiding data movement. To address this, we formulate a new graph partitioning problem and propose a novel algorithm to solve it. We integrated PIXIDA in Spark, and our experiments show that, when compared to existing schedulers, PIXIDA achieves a significant traffic reduction of up to ~9× on the aforementioned links.
Original language: English
Pages (from-to): 72-83
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 9
Issue number: 2
Publication status: Published - Oct 2015

Keywords

  • Algorithms
  • Graph theory
  • Scheduling
