Runtime Modifications of Spark Data Processing Pipelines

Conference paper
Distributed data processing systems are the standard means for large-scale data analysis in the Big Data field. These systems are based on processing pipelines in which the processing is done by a composition of multiple elements or steps. In current distributed data processing systems, the code and parameters that make up the pipeline are fixed at design time, before the application starts processing any data. Any change that has to be applied to the pipeline after it has started requires the entire pipeline to be restarted. When a system needs to be operational 24/7 or has to respond in a timely fashion, such restarts and the resulting downtime are not acceptable. Instead, the processing system should operate autonomously: it should continuously take in changes from its environment and adjust its processing steps, parameters, etc. on the fly. In this paper, we address this problem by allowing changes to be made to a processing pipeline without restarting it. We focus on two aspects of the problem: switching to another data source that is used as input, and changing the functional code and variables within the elements of a pipeline. Our system is built on top of Apache Spark, a framework widely used for distributed data processing.
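To make the second aspect concrete, the following is a minimal sketch (not the paper's actual mechanism, whose details are not given in this record) of how functional code inside a running Spark pipeline can be changed without a restart. It assumes Spark 3.x Structured Streaming; the object name RuntimeSwapSketch, the reference currentFn, and the use of the built-in "rate" source are illustrative choices. The idea is that foreachBatch runs on the driver at the start of every micro-batch, so driver-side state that is re-read there takes effect on the next batch while the query keeps running.

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.{DataFrame, SparkSession}

object RuntimeSwapSketch {
  // Mutable reference holding the current per-record function. A control
  // thread (e.g., reacting to an external change request) can replace it
  // while the streaming query keeps running.
  val currentFn = new AtomicReference[Int => Int]((x: Int) => x * 2)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("runtime-modification-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy streaming source; a real pipeline would read from e.g. Kafka.
    val input = spark.readStream.format("rate").load()

    val query = input.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Re-read the function on the driver for every micro-batch and
        // capture it as a local val, so only the function itself (not the
        // AtomicReference) is serialized into the task closure. An update
        // to currentFn takes effect on the next batch without a restart.
        val fn = currentFn.get()
        batch.select($"value".cast("int").as("value"))
          .map(row => fn(row.getInt(0)))
          .write.format("console").save()
      }
      .start()

    // Simulate an external change request arriving at runtime.
    Thread.sleep(10000)
    currentFn.set((x: Int) => x + 100)

    query.awaitTermination()
  }
}
```

This per-batch re-evaluation trick only covers changing functions and parameters; switching the input data source of a running query is not possible in stock Spark without stopping and restarting the query, which is exactly the kind of limitation the paper targets.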
TNO Identifier
782452
ISBN
9781538619391
Article nr.
8064052
Source title
4th IEEE International Conference on Cloud and Autonomic Computing, ICCAC 2017, 18-22 September 2017, Tucson, AZ, USA
Collation
12 p.
Pages
34-45