DEVELOPMENT OF ADVANCED DATA SAMPLING SCHEMES TO ALLEVIATE CLASS IMBALANCE PROBLEM IN DATA MINING CLASSIFICATION ALGORITHMS

FOLORUNSO, SAKINAT OLUWABUKONLA

Please use this identifier to cite or link to this item: http://ir.library.ui.edu.ng/handle/123456789/4228

Full metadata record

DC Field	Value	Language
dc.contributor.author	FOLORUNSO, SAKINAT OLUWABUKONLA	-
dc.date.accessioned	2019-02-07T13:32:55Z	-
dc.date.available	2019-02-07T13:32:55Z	-
dc.date.issued	2015-09	-
dc.identifier.uri	http://ir.library.ui.edu.ng/handle/123456789/4228	-
dc.description	A thesis in the Department of Computer Science, submitted to the Faculty of Science in partial fulfilment of the requirements for the degree of DOCTOR OF PHILOSOPHY of the UNIVERSITY OF IBADAN	en_US
dc.description.abstract	Classification is the process of finding a set of models that distinguish data classes in order to predict unknown class labels in data mining. The class imbalance problem occurs when standard classifiers are biased towards the majority class while the minority class is ignored. Existing classifiers tend to maximise overall prediction accuracy and minimise error at the expense of the minority class. However, research has shown that the misclassification cost of the minority class is higher and should not be ignored, since it is the class of interest. This work was therefore designed to develop advanced data sampling schemes that improve classification performance on imbalanced datasets, with a view to increasing the recall of the minority class. Synthetic Minority Oversampling Technique (SMOTE) was extended to SMOTE+300% and combined with existing under-sampling schemes: Random Under-Sampling (RUS), Neighbourhood Cleaning Rule (NCL), Wilson’s Edited Nearest Neighbour (ENN) and Condensed Nearest Neighbour (CNN). Five advanced data sampling schemes, SMOTE300ENN, SMOTE300RUS, SMOTE300NCL, SMOTENCL and SMOTERUS, were coded in Java and implemented in WEKA, a data mining tool, used as an Application Programming Interface. The existing and developed schemes were applied to 886 Diabetes Mellitus (DM), 1,163 Senior Secondary School Certificate Result (SSSCR) and 786 Contraceptive Methods (CM) datasets. The datasets were collected in Ilesha and Ibadan, Nigeria. Their performances were determined with different classification algorithms using Receiver Operating Characteristics (ROC), recall of the minority class and performance gain metrics. Friedman’s Test at p = 0.05 was used to analyse these schemes against the classification algorithms.
The ROC metric revealed that the mean rank values for the DM, SSSCR and CM datasets treated with the advanced schemes ranged from 6.9-13.8, 3.8-12.8 and 6.6-13.5, respectively, compared with 3.4-7.8, 2.6-12.6 and 2.8-7.9, respectively, for the existing schemes. These results signify improved classification performance. The recall mean rank values for the DM, SSSCR and CM datasets treated with the advanced schemes ranged from 9.4-13.0, 6.3-14.0 and 7.3-13.6, respectively, compared with 2.0-7.5, 2.5-8.9 and 2.1-7.4, respectively, for the existing schemes. These results show increased detection of the minority class. Performance gains by the advanced schemes over the original datasets (DM, SSCE and CM) were: SMOTE300ENN (27.1%), SMOTE300RUS (11.6%), SMOTE300NCL (15.5%), SMOTENCL (8.3%) and SMOTERUS (7.3%). A significant difference was observed among all the schemes. The higher the mean rank value and performance gain, the better the scheme. The SMOTE300ENN scheme gave the highest ROC and recall values in the three datasets, which were 13.8, 12.8, 12.3 and 13.0, 14.0, 13.6, respectively. The developed Synthetic Minority Oversampling Technique 300 Wilson’s Edited Nearest Neighbour scheme significantly improved classification performance and increased the recall of the minority class over the existing schemes using the same datasets. It is therefore recommended for classification of imbalanced datasets. Keywords: Imbalanced dataset, Receiver operating characteristics, Data reduction techniques Word count: 445	en_US
dc.language.iso	en	en_US
dc.subject	Imbalanced dataset	en_US
dc.subject	Receiver operating characteristics	en_US
dc.subject	Data reduction techniques	en_US
dc.title	DEVELOPMENT OF ADVANCED DATA SAMPLING SCHEMES TO ALLEVIATE CLASS IMBALANCE PROBLEM IN DATA MINING CLASSIFICATION ALGORITHMS	en_US
dc.type	Thesis	en_US
Appears in Collections:	Scholarly works
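
As a rough illustration of the SMOTE300ENN idea named in the abstract above, the following is a minimal Python sketch and not the thesis's Java/WEKA implementation. It assumes that "300%" means three synthetic points per minority sample (the usual SMOTE convention), uses Euclidean distance on numeric features, and picks the neighbour counts (k = 5 for SMOTE, k = 3 for ENN) arbitrarily. All function and parameter names are hypothetical.

```python
import math
import random


def nearest_indices(i, points, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def smote(minority, percent=300, k=5, seed=0):
    """Return percent/100 synthetic points per minority sample.

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbours, chosen at random.
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, p in enumerate(minority):
        neighbours = nearest_indices(i, minority, k)
        for _ in range(per_sample):
            q = minority[rng.choice(neighbours)]
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


def enn(samples, labels, k=3):
    """Wilson's editing: keep a sample only if more than half of its
    k nearest neighbours share its label."""
    kept_samples, kept_labels = [], []
    for i, y in enumerate(labels):
        agree = sum(1 for j in nearest_indices(i, samples, k) if labels[j] == y)
        if 2 * agree > k:
            kept_samples.append(samples[i])
            kept_labels.append(y)
    return kept_samples, kept_labels


def smote_enn(majority, minority, percent=300, k_smote=5, k_enn=3, seed=0):
    """Oversample the minority class with SMOTE, then clean the pooled data with ENN.

    Labels are an assumption of this sketch: 0 = majority, 1 = minority.
    """
    minority_all = list(minority) + smote(minority, percent, k_smote, seed)
    samples = list(majority) + minority_all
    labels = [0] * len(majority) + [1] * len(minority_all)
    return enn(samples, labels, k_enn)
```

In this sketch ENN is applied to both classes after oversampling, so it can remove noisy majority samples as well as poorly placed synthetic ones. Whether the thesis edits both classes or only one is not stated in this record.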

Files in This Item:

File	Description	Size	Format
ui_thesis_DEVELOPMENT OF ADVANCED DATA SAMPLING SCHEMES TO ALLEVIATE CLASS IMBALANCE PROBLEM IN DATA MINING CLASSIFICATION ALGORITHMS.pdf		4.75 MB	Adobe PDF

