Runtime Modifications of Spark Data Processing Pipelines

Conference paper
Distributed data processing systems are the standard means for large-scale data analysis in the Big Data field. These systems are based on processing pipelines in which the processing is done by a composition of multiple elements or steps. In current distributed data processing systems, the code and parameters that make up the pipeline are fixed at design time, before the application starts processing any data. Any change that has to be applied to the pipeline after it has started requires the entire pipeline to be restarted. When a system needs to be operational 24/7 or has to respond in a timely fashion, such restarts and the resulting downtime are not acceptable. Instead, the processing system should operate autonomously: it should continuously take in changes from its environment and adjust its processing steps, parameters, etc. on the fly. In this paper, we address this problem by allowing changes to be made to a processing pipeline without restarting it. We focus on two aspects of the problem: switching to another data source that is used as input, and changing the functional code and variables within the elements of a pipeline. Our system is built on top of Apache Spark, a framework widely used for distributed data processing.
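To make the second aspect concrete, the following is a minimal sketch (not the paper's actual mechanism, whose details are not given in this record) of how functional code inside a running Spark pipeline can be changed without a restart. It assumes Spark 3.x Structured Streaming; the object name RuntimeSwapSketch, the reference currentFn, and the use of the built-in "rate" source are illustrative choices. The idea is that foreachBatch runs on the driver at the start of every micro-batch, so driver-side state that is re-read there takes effect on the next batch while the query keeps running.

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.{DataFrame, SparkSession}

object RuntimeSwapSketch {
  // Mutable reference holding the current per-record function. A control
  // thread (e.g., reacting to an external change request) can replace it
  // while the streaming query keeps running.
  val currentFn = new AtomicReference[Int => Int]((x: Int) => x * 2)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("runtime-modification-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy streaming source; a real pipeline would read from e.g. Kafka.
    val input = spark.readStream.format("rate").load()

    val query = input.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Re-read the function on the driver for every micro-batch and
        // capture it as a local val, so only the function itself (not the
        // AtomicReference) is serialized into the task closure. An update
        // to currentFn takes effect on the next batch without a restart.
        val fn = currentFn.get()
        batch.select($"value".cast("int").as("value"))
          .map(row => fn(row.getInt(0)))
          .write.format("console").save()
      }
      .start()

    // Simulate an external change request arriving at runtime.
    Thread.sleep(10000)
    currentFn.set((x: Int) => x + 100)

    query.awaitTermination()
  }
}
```

This per-batch re-evaluation trick only covers changing functions and parameters; switching the input data source of a running query is not possible in stock Spark without stopping and restarting the query, which is exactly the kind of limitation the paper targets.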
TNO Identifier
782452
ISBN
9781538619391
Article nr.
8064052
Source title
4th IEEE International Conference on Cloud and Autonomic Computing, ICCAC 2017, 18-22 September 2017, Tucson, AZ, USA
Collation
12 p.
Pages
34-45