Please use this identifier to cite or link to this item: http://dspace.lib.uom.gr/handle/2159/26449
Author: Τσόγκας, Βασίλειος
Title: Parallelizing entity resolution methods for big data
Alternative Titles: Παραλληλοποίηση μεθόδων διευθέτησης οντοτήτων για μεγάλα δεδομένα
Date Issued: 2022
Department: Postgraduate Specialization Program in Applied Informatics
Supervisor: Κολωνιάρη, Γεωργία
Abstract: Entity Resolution (ER) is the process of identifying records that represent the same real-world entity, within a single dataset or across different datasets. ER has existed for many years and has evolved constantly to keep pace with developments in technology and in the field of data management. Over the years, various techniques, such as blocking, filtering, and matching, have been used in implementing the ER process to improve its performance and effectiveness. However, ER faces new challenges in the current age of big data analytics, since traditional data-handling methods have not proved very efficient. ER must therefore evolve further to adapt to the modern world of Big Data analytics. In this work we study the ER process and how it is divided into stages, and present popular methods used in each stage. We focus on Blocking techniques, specifically on Improved Suffix Array Blocking with Bloom Filters. After implementing this method serially, we study how to parallelize it using Apache Spark. We conduct comparative experiments between serial and parallel execution, present the results, and examine the significant improvement in efficiency when the process is executed in parallel. Our conclusions indicate that ER methods, if applied in a distributed manner, are capable of handling Big Data.
Keywords: Entity Resolution
Big Data
Blocking
Filtering
Inverted Index
Suffix Array Blocking
Bloom Filter
Parallel execution
Apache Spark
Scala
Java
Information: Master's thesis--University of Macedonia, Thessaloniki, 2022.
Appears in Collections: Postgraduate Program in Applied Informatics (M)

Files in This Item:
File: TsogkasVasileiosMsc2022.pdf
Size: 1.21 MB
Format: Adobe PDF


Items in Psepheda are protected by copyright, with all rights reserved, unless otherwise indicated.