Please use this identifier to cite or link to this item:
http://dspace.lib.uom.gr/handle/2159/27904
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ευαγγελίδης, Γεώργιος | el |
dc.contributor.author | Παπανικολάου, Μερκούριος | el |
dc.date.accessioned | 2022-11-22T11:25:30Z | - |
dc.date.available | 2022-11-22T11:25:30Z | - |
dc.date.issued | 2022 | el |
dc.identifier.uri | http://dspace.lib.uom.gr/handle/2159/27904 | - |
dc.description | Master's thesis--University of Macedonia, Thessaloniki, 2022. | el |
dc.description.abstract | The k-Nearest Neighbor (k-NN) classifier is one of the most widely used supervised learning algorithms. Its popularity is mainly due to its simplicity, effectiveness, ease of implementation, and the ability to add new data to the training set at any time. However, a major drawback of the algorithm is that its performance depends mainly on the parameter k, i.e. the number of nearest neighbors examined in order to classify an unlabeled instance. In most cases k is a fixed value, independent of the data distribution. Since there is no general rule for choosing k, as it depends on the training dataset, the most frequently used technique for determining the "best" k is cross-validation (see the first sketch after the metadata table). A large k value results in a noise-tolerant classifier, as its search area is large; a small k value results in a noise-sensitive classifier, as the search area is limited. Selecting a single fixed k for the whole dataset therefore ignores its particular characteristics, such as data distribution, class separation, imbalanced classes, sparse and dense neighborhoods, and noisy subspaces. In recent years, a great deal of research has been conducted to address this drawback, leading to various k-NN variants that mainly combine the classifier with other techniques for determining k. In the present study, a thorough literature review is conducted to summarize the achievements made to date in this field. This procedure yielded a pool of twenty-eight (28) publications covering the period from 1986 to 2020 (median year 2009). These studies are presented in this work, with a description of the techniques used for dynamic k determination. For each study, several indicators are recorded: the technique used for k selection, the level at which k is selected, the number of datasets used in the experiments, whether statistical tests were conducted, the total number of citations the study has received, and its average citations per year. In addition, a new variant of the k-NN algorithm is proposed, extending a previous work found in the literature. The approach is a k-free k-NN variation, in the sense that the user does not select the parameter; instead, it is selected dynamically, depending on the region in which each unlabeled data point lies. The algorithm belongs to the group of studies that exploit prototype and clustering techniques to represent the initial dataset. Through a recursive process, homogeneous clusters are created, each represented by a single representative. Moreover, a new term is introduced, the Sphere of Influence (SoI), an index indicating the size of each created cluster. This index, in combination with the depth indicator (d), provides useful information about the subspace in which each representative lies. Finally, heuristics are proposed to exploit the information provided by SoI and d and convert it into a k value unique to each unlabeled instance (see the second sketch after the metadata table). Extensive experiments were conducted on thirty (30) datasets for all proposed heuristics; some of these datasets contained artificial noise to test the proposed algorithm under real-life conditions. The experiments showed that the proposed algorithm is very competitive, in terms of accuracy, with several commonly used k-NN variations. Moreover, the Wilcoxon statistical test was used to detect statistically significant differences. | en |
dc.format.extent | 53 | el |
dc.language.iso | en | en |
dc.publisher | University of Macedonia | el |
dc.rights | Attribution-NonCommercial 4.0 International | el |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | en |
dc.subject | Supervised machine learning | en |
dc.subject | Classification | en |
dc.subject | k-nearest neighbor (k-NN) | en |
dc.subject | Dynamic k value determination | en |
dc.subject | Heuristics | en |
dc.title | Dynamic / Adaptive determination of parameter k in k-NN classifier | en |
dc.type | Electronic Thesis or Dissertation | en |
dc.type | Text | en |
dc.contributor.department | Master's Program in Artificial Intelligence and Data Analytics | el |
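
The abstract above cites cross-validation as the standard way to fix a single k for the whole dataset. The following is a minimal sketch of that baseline procedure using scikit-learn; the dataset, fold count, and candidate k values are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of the baseline the abstract describes: choosing one fixed k
# by cross-validation. Dataset and candidate k values are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in (1, 3, 5, 7, 9, 11)
}
best_k = max(scores, key=scores.get)
print(f"best fixed k = {best_k} (mean accuracy {scores[best_k]:.3f})")
```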
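The abstract also outlines the proposed approach: recursively build class-homogeneous clusters, record each cluster's representative, its Sphere of Influence (SoI, a cluster-size index), and its depth (d), then map (SoI, d) to a per-instance k. The sketch below is only a hypothetical illustration of that general idea, not the thesis's actual algorithm: the bisecting k-means split, the cluster radius used as a stand-in for SoI, and the (SoI, d) → k heuristic are all assumptions.

```python
# Hypothetical sketch of dynamic, per-instance k selection in the spirit of
# the abstract. The split rule, the radius standing in for the Sphere of
# Influence (SoI), and the (SoI, d) -> k heuristic are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def build_clusters(X, y, depth=0):
    """Recursively bisect the data until each cluster is class-homogeneous.

    Returns (representative, soi, depth) tuples, where soi is the cluster
    radius (largest distance from the centroid) and depth counts the splits.
    """
    if len(np.unique(y)) == 1 or len(X) < 2:
        rep = X.mean(axis=0)
        return [(rep, np.linalg.norm(X - rep, axis=1).max(), depth)]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    if len(np.unique(labels)) == 1:  # duplicate points: cannot split further
        rep = X.mean(axis=0)
        return [(rep, np.linalg.norm(X - rep, axis=1).max(), depth)]
    out = []
    for c in np.unique(labels):
        mask = labels == c
        out += build_clusters(X[mask], y[mask], depth + 1)
    return out

def dynamic_k(x, clusters, k_max=15):
    """Map the nearest representative's (SoI, d) to a k: wide, shallow
    clusters get a large k; tight, deeply split (likely noisy) regions
    get a small one. The exact mapping is an illustrative assumption."""
    reps = np.array([c[0] for c in clusters])
    sois = np.array([c[1] for c in clusters])
    i = int(np.argmin(np.linalg.norm(reps - x, axis=1)))
    soi, d = clusters[i][1], clusters[i][2]
    scale = soi / (sois.max() + 1e-12)
    return int(np.clip(round(1 + scale * k_max - d), 1, k_max))

X, y = load_iris(return_X_y=True)   # illustrative dataset
clusters = build_clusters(X, y)
q = X[0]                            # pretend this is an unlabeled instance
k = dynamic_k(q, clusters)
print("chosen k:", k)
print("prediction:", KNeighborsClassifier(n_neighbors=k).fit(X, y).predict([q]))
```

The point of such a mapping, as the abstract describes, is that an unlabeled instance falling in a wide, shallow region is classified with many neighbors, while one in a tight, deeply split region uses few, so no global k ever has to be chosen by the user.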
Appears in Collections: | MSc in Artificial Intelligence and Data Analytics (M) |
Files in This Item:
File | Description | Size | Format
---|---|---|---
PapanikolaouMerkouriosMSc2022.pdf | | 772.64 kB | Adobe PDF
This item is licensed under a Creative Commons License