Please use this identifier to cite or link to this item: http://dspace.lib.uom.gr/handle/2159/25038
Author: Τανταλάκη, Νικολέτα
Title: Parallel and distributed processing of big data streams and scheduling algorithms
Alternative Titles: Παράλληλη και κατανεμημένη επεξεργασία ροών Μεγάλων Δεδομένων και αλγόριθμοι χρονοδρομολόγησης
Date Issued: 2021
Department: University of Macedonia. Department of Applied Informatics (ΕΠ)
Supervisor: Σουραβλάς, Σταύρος
Abstract: Nowadays, we are witnessing the development of the so-called Internet of Things (IoT), where devices collect data and exploit interconnectivity to transmit it for processing in the cloud. Data streams are expanding worldwide, resulting in a growing need to handle these large amounts of continuously arriving data efficiently and in a timely manner. Cloud computing technology, with superior computational power and high reliability, emerges as a promising solution for the challenges posed by data stream processing. In-memory computing is used to meet performance-related requirements such as latency and throughput, which are extremely important in Data Stream Processing (DSP) applications. Several technologies have emerged specifically to address the challenges of processing high-volume, real-time data by exploiting on-the-fly computations. Distributed Stream Processing Systems (DSPSs) assign applications' processing tasks to the available resources and route streaming data between them. Efficient scheduling of processing tasks can reduce application latencies and eliminate network congestion. However, the built-in scheduling techniques of DSPSs are far from optimal. In this thesis, we address the task scheduling problem, which determines which tasks are allocated to which resources and controls the order of job execution. An overview of the available DSPSs is presented and a classification of the existing scheduling policies is provided. This reveals useful information about the matters to consider when designing an effective scheduling policy. Then, a general formulation of the task scheduling problem is presented and a matrix-based, linear scheme is provided. Unlike existing research efforts, which rarely consider memory utilization in their analysis, the derived scheme operates in a memory-efficient and well-balanced manner.
It takes advantage of pipelines to efficiently handle applications that require heavy (all-to-all) communication between tasks assigned to pairs of components. The scheme proposed in this thesis is static. However, when it comes to streams of data, the input load usually fluctuates drastically over time. Dynamic schemes use run-time adaptations and task re-scheduling to handle possible changes in the cluster, but this usually results in significant downtime and performance degradation. Rather than re-configuring the task allocation online, the proposed scheme handles queue waiting times efficiently and tries to maintain a stable and robust configuration by balancing load between the cluster's nodes. Of course, an adaptive version of this approach would increase its performance, so this extension is left for future work. For concreteness, the approach is illustrated based on Apache Storm semantics. The performance evaluation demonstrates the importance of constraining the required buffer space and achieving load balance to improve the system's performance and overcome the challenges of running DSP applications. The proposed scheme was compared to two state-of-the-art strategies: the default Storm scheduler and R-Storm. It was found to outperform both in terms of throughput, achieving an average improvement of 25%-45% under various scenarios, mainly as a result of reduced buffering (≈45% less memory). At the end of this work, the contribution of real-time data processing to an application field is presented. The field of agriculture faces difficult challenges due to the numerous technological transformations used to increase productivity and product quality. In precision agriculture, a key component is the use of IoT and various devices such as sensors, control systems, robotics and autonomous vehicles that produce high-velocity data streams.
Current advances in IoT and cloud computing have led to the development of new applications that have great potential in precision agriculture. However, several challenges arise as open research fields, and future directions are revealed.
Keywords: Parallel processing
Distributed processing
Big data
Scheduling
Cloud computing
Information: The library holds a copy of the thesis in printed form.
Includes bibliographical references (pp. 93-109)
Thesis (Doctoral)--University of Macedonia, Thessaloniki, 2021.
003/2021
Rights: Attribution - NonCommercial - ShareAlike 4.0 International
Appears in Collections: Department of Applied Informatics (Δ)

Files in This Item:
File: TantalakiNiκoletaPhD2021.pdf
Size: 5.8 MB
Format: Adobe PDF


This item is licensed under a Creative Commons License.