Please use this identifier to cite or link to this item: http://dspace.lib.uom.gr/handle/2159/26449
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Κολωνιάρη, Γεωργία | el
dc.contributor.author | Τσόγκας, Βασίλειος | el
dc.date.accessioned | 2022-02-28T07:38:39Z | -
dc.date.available | 2022-02-28T07:38:39Z | -
dc.date.issued | 2022 | el
dc.identifier.uri | http://dspace.lib.uom.gr/handle/2159/26449 | -
dc.description | Διπλωματική εργασία--Πανεπιστήμιο Μακεδονίας, Θεσσαλονίκη, 2022. [Master's thesis--University of Macedonia, Thessaloniki, 2022.] | el
dc.description.abstract | Entity Resolution (ER) is the process of locating records that represent the same real-world entity, within a single dataset or across different datasets. ER has existed for several years and has evolved constantly, since it has to keep pace with developments in technology and in the field of data management. Over the years, various techniques, such as blocking, filtering, and matching, have been used to implement the ER process and improve its performance and effectiveness. However, ER faces new challenges in the current age of big data analytics, since traditional methods of handling data have not proved very efficient. Hence, ER must evolve further in order to adapt to the modern world of Big Data analytics. In this work we study the ER process and how it is divided into stages, and we present popular methods used in each stage. We focus on blocking techniques, and specifically on Improved Suffix Array Blocking with Bloom Filters. After implementing this method serially, we study how to parallelize it using Apache Spark. We conduct comparative experiments between serial and parallel execution, present the results, and examine the significant improvement in efficiency when the process is executed in parallel. Our conclusions indicate that ER methods, if applied in a distributed manner, are capable of handling Big Data. | en
dc.format.extent | 52 | el
dc.language.iso | en | en
dc.publisher | Πανεπιστήμιο Μακεδονίας [University of Macedonia] | el
dc.subject | Entity Resolution | en
dc.subject | Big Data | en
dc.subject | Blocking | en
dc.subject | Filtering | en
dc.subject | Inverted Index | en
dc.subject | Suffix Array Blocking | en
dc.subject | Bloom Filter | en
dc.subject | Parallel execution | en
dc.subject | Apache Spark | en
dc.subject | Scala | en
dc.subject | Java | en
dc.title | Parallelizing entity resolution methods for big data | en
dc.title.alternative | Παραλληλοποίηση μεθόδων διευθέτησης οντοτήτων για μεγάλα δεδομένα | el
dc.type | Electronic Thesis or Dissertation | en
dc.type | Text | en
dc.contributor.department | Πρόγραμμα Μεταπτυχιακών Σπουδών Ειδίκευσης στην Εφαρμοσμένη Πληροφορική [Postgraduate Specialization Program in Applied Informatics] | el
Appears in Collections: Π.Μ.Σ. στην Εφαρμοσμένη Πληροφορική (M) [Postgraduate Program in Applied Informatics (M)]

Files in This Item:
File | Description | Size | Format
TsogkasVasileiosMsc2022.pdf | | 1.21 MB | Adobe PDF


Items in Psepheda are protected by copyright, with all rights reserved, unless otherwise indicated.