Data Analytics in the Cloud with Flexible MapReduce Workflows

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

9 Citations (Scopus)

Abstract

Data analytic applications are characterized by large data sets that are subject to a series of processing phases. Some of these phases are executed sequentially, but others can be executed concurrently or in parallel on clusters, grids or clouds. The MapReduce programming model has been applied to process large data sets in cluster and cloud environments. Developing an application using MapReduce requires installing, configuring and accessing specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon cloud. It would be desirable to provide more flexibility in adjusting such configurations according to the application characteristics. Furthermore, the composition of the multiple phases of a data analytic application requires the specification of all the phases and their orchestration. The original MapReduce model and environment lack flexible support for such configuration and composition. Recognizing that scientific workflows have been successfully applied to modeling complex applications, this paper describes our experiments on implementing MapReduce as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). A text mining data analytic application is modeled as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. As in typical MapReduce environments, the end user only needs to define the application algorithms for input data processing and for the map and reduce functions. In the paper we present experimental results from using the AWARD framework to execute MapReduce workflows deployed over multiple Amazon EC2 (Elastic Compute Cloud) instances.
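As a rough illustration of the three pieces the abstract says the end user defines (input data processing, a map function and a reduce function), the following minimal Python sketch counts words in a small text. It is not the AWARD API or the authors' implementation: the names split_input, map_fn, reduce_fn and run_mapreduce are hypothetical, and everything runs sequentially in one process rather than as workflow nodes on EC2 instances.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def split_input(text: str) -> list[str]:
    """Input processing (hypothetical): one record per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


def map_fn(record: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the record."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce: sum all counts emitted for the same word."""
    return key, sum(values)


def run_mapreduce(text: str) -> dict[str, int]:
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in split_input(text):
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

In a real deployment the grouping step and the distribution of map and reduce tasks would be handled by the framework; only the three user-defined functions would change from one application to another.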
Original language: Unknown
Title of host publication: 2012 4th IEEE International Conference on Cloud Computing Technology and Science
Pages: 1-8
Publication status: Published - 1 Jan 2012
Event: 2012 4th IEEE International Conference on Cloud Computing Technology and Science
Duration: 1 Jan 2012 → …

