Data Classification Using Various Learning Algorithms
Chapter One
Aim and Objectives
The aim of this research is to investigate the extent to which dimensionality reduction techniques preserve classification.
The objectives of the research are as follows:
- Implementation of fifteen dimensionality reduction techniques and their application to the weather and student datasets, as well as the ionosphere dataset obtained from the UCI Machine Learning Repository [23].
- Implementation of the perceptron classification algorithm and its use in classifying the data points of a two-class dataset. The perceptron is also applied to the datasets reduced from this two-class dataset using the techniques above, and comparisons are made to determine the extent to which the reduction methods preserve the classification of the original dataset.
- Implementation of the k-Nearest Neighbors classification algorithm and comparison of how well the dimensionality reduction techniques preserve the classification of a dataset under the k-Nearest Neighbors and perceptron classification algorithms.
- Use of confusion matrices to show the extent to which each dimensionality reduction method preserves the classification of the original datasets, and comparison of the methods against one another.
CHAPTER TWO
LITERATURE REVIEW
This chapter gives a review of literature related to dimensionality reduction, machine learning, and the application of dimensionality reduction in the machine learning domain, with a bias towards the perceptron and K-Nearest Neighbors learning algorithms.
Dimensionality Reduction
Dimensionality reduction is defined as the mapping of high-dimensional data to low-dimensional data, such that the result obtained by analyzing the reduced dataset is a good approximation of the result obtained by analyzing the original high-dimensional data [24]. Owing to the challenges faced in analyzing the available large pool of data, the robustness and importance of dimensionality reduction have been widely emphasized in the literature.
The importance of dimensionality reduction is stressed in [25], where the authors proposed four novel dimensionality reduction techniques: the New Top-Down, New Bottom-Up, Variance–New Top-Down Hybrid, and Variance–New Bottom-Up Hybrid approaches, and used them alongside other existing techniques to reduce images. The authors observed that most of the approaches belonging to the first category (each attribute in the reduced dataset is a linear combination of the attributes of the original dataset) are inefficient at image preservation, while techniques belonging to the second category (the set of attributes in the reduced dataset is a subset of the set of attributes in the original dataset) are reasonably efficient at image preservation. After these observations, the authors applied several queries to the reduced image to discover certain features of the original image. They observed that the features of the reduced image correspond accurately to the attributes of the original image.
The authors of [1] identified some schemes used to reduce the number of features in high-dimensional datasets in order to improve machine learning algorithms. The authors explained the concept of critical dimensions, which is the minimum number of features required for prediction (with high accuracy) in classification algorithms. They presented four dimensionality reduction schemes with their advantages and disadvantages. This provides researchers with the necessary information and direction when choosing a reduction scheme for a given dataset type.
In a recent study [5], the authors proposed a new approach to reducing the dimensionality of data. In this approach, to reduce the original n attributes to m attributes, m attributes are randomly selected from the n attributes to form the reduced dataset. This approach proved to be slightly better than some of the most popular dimensionality reduction techniques, such as Random Projection and Principal Component Analysis, at preserving the K-means clustering of the original datasets. Thus, dimensionality reduction is a very fertile area of research and a strong tool for preprocessing high-dimensional data.
The authors in [26] proposed a biologically inspired dimensionality reduction method, modeled on the behavior of ants, called the Ant Colony Optimization-Selection algorithm. Using five microarray datasets of very high dimensionality, the authors showed that the proposed algorithm selects the more important genes from the high-dimensional data, based on some parameters, with excellent classification accuracy.
Machine Learning
Tom Mitchell, in his classic book on machine learning, says "The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience" [9]. Put simply, machine learning is a scientific field in which computer systems automatically and intelligently learn their computation and improve on it through experience; its algorithms can be classified into supervised and unsupervised learning. Machine learning techniques have been, and are still, used to solve many complex real-world problems. There are many classification techniques; since no classifier is considered strictly better than the others [27], the perceptron and K-Nearest Neighbors classifiers are chosen for the purpose of this research.
The perceptron is an Artificial Neural Network that mimics, or tries to simulate, the activities of the brain with regard to information processing [14]. It takes a weighted sum of its inputs and, if the sum is greater than some threshold value, outputs one; otherwise it outputs zero (or -1). The perceptron is made up of a summation processor, which takes the dot product of the inputs and the weights, and an activation function, also known as a threshold, which uses a step function to determine the output of the perceptron [16].
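The weighted-sum-and-threshold rule, together with the standard perceptron learning update, can be sketched in a few lines. The thesis implementation is in MATLAB; the Python fragment below, with illustrative names such as `train_perceptron`, is only a minimal sketch of the textbook algorithm, not the code used in this work:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a simple perceptron on labels in {0, 1}.
    The bias is folded in as an extra constant input of 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if np.dot(w, xi) > 0 else 0    # step activation
            w += lr * (target - pred) * xi          # update only on error
    return w

def predict(w, X):
    """Apply the learned weight vector to new points."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)
```

On a linearly separable problem such as logical AND, this update rule converges after a few epochs, which is the behavior the perceptron convergence theorem guarantees.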
Notable work in the domain of machine learning can be seen in [28], where the authors proposed a single-layer Artificial Neural Network for classifying cancer patients using Chebyshev, Trigonometric, and Legendre expansion techniques. The authors concluded that Discrete Cosine Transform feature-reduction-based Neural Network classifiers perform better than the other classifiers used for the purpose of that research.
Another classification algorithm is the K-Nearest Neighbors algorithm, whose principle is that the label of any given instance is predicted from the label most common among its k nearest neighbors [15]. In one study [29], the authors proposed a new method for attribute selection, which selects attributes that complement each other, and tested it on a real dataset. Using two classes from the dataset, the authors found that the proposed technique selects subsets of attributes that yield a classification accuracy higher than that obtained by using the entire set of attributes, or even the subset of attributes identified by CART. The classification technique used for that investigation is the K-Nearest Neighbors classification algorithm, and a confusion matrix is used to measure the overall accuracy of the classifier.
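The majority-vote principle of K-Nearest Neighbors can be sketched directly. This is an illustrative Python fragment, not the thesis's MATLAB code; the function name and the choice of Euclidean distance are assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Predict the label of point x as the label most common among
    its k nearest training points, under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Because no model is fitted in advance (all work happens at query time), K-Nearest Neighbors is the "lazy" learner referred to later in this thesis, in contrast to the "eager" perceptron.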
CHAPTER THREE
METHODOLOGY
This chapter gives a detailed description of all the algorithms used in achieving the aim and objectives of this thesis. MATLAB is used for the analysis and implementations in this thesis.
Dimensionality Reduction Techniques
This section gives a description of the dimensionality reduction techniques implemented in this thesis.
The New Random Approach
This is a technique suggested by [5]. With this technique, to reduce a data set D of dimensionality d to one of dimensionality k, a set Sk is formed consisting of k numbers selected at random from the set S shown in equation (3.1).
S = {x ∈ ℕ | 1 ≤ x ≤ d}        (3.1)
Then, our reduced set, DR, is shown in equation (3.2).
DR = D(:, Sk)        (3.2)
That is, DR is a data set having the same number of rows as D, and if Ai is the ith attribute of DR, then Ai is the jth attribute of D, where j is the ith element of Sk.
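Equations (3.1) and (3.2) amount to sampling k column indices from {1, ..., d} and keeping those columns. A minimal sketch is given below, in Python rather than the thesis's MATLAB, using 0-based indices where the equations are 1-based; treating Sk as a set of distinct indices (`replace=False`) is an assumption read from the phrase "a set Sk":

```python
import numpy as np

def new_random_reduce(D, k, seed=None):
    """Reduce D (rows x d attributes) to k attributes: draw a random
    index set Sk as in eq. (3.1) and take DR = D(:, Sk) as in eq. (3.2)."""
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    Sk = rng.choice(d, size=k, replace=False)   # k distinct indices in 0..d-1
    return D[:, Sk], Sk
```

The reduced data set keeps every row of D unchanged; only the selection of columns is random, which is what makes the method so cheap compared with projection-based techniques.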
Modified New Random Approach
This technique is a modification of the new random approach, proposed by [36]. To reduce a data set D of dimensionality p to one of dimensionality k, this improved approach uses a more efficient method to generate the random numbers, i.e. the results are less random. The algorithm is given in Algorithm 3.1.
CHAPTER FOUR
RESULTS AND DISCUSSION
The Perceptron Preservation
In this section, the results of the comparison among the dimensionality reduction techniques (explained in chapter three of this thesis) for the perceptron classification preservation of the weather, student, and ionosphere datasets are presented. To obtain these results, the training set of each of the three datasets, for both the original and reduced versions, is used to train the perceptron; the weight vector obtained from the training phase is then used to classify the test sets of both the original and reduced datasets. The obtained results are shown in tables 4.1, 4.2 and 4.3.
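The comparison described above, tabulating how the classifier labels the test points on the reduced data relative to the original data, can be sketched with a small confusion-matrix helper. The function names below are illustrative, not the thesis code, and a two-class problem is assumed by default:

```python
import numpy as np

def confusion_matrix(labels_orig, labels_reduced, n_classes=2):
    """Cross-tabulate two label vectors: entry (i, j) counts test points
    labeled i on the original data and j on the reduced data."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for a, b in zip(labels_orig, labels_reduced):
        M[a, b] += 1
    return M

def preservation_rate(M):
    """Fraction of test points classified identically before and after
    reduction: the diagonal mass of the confusion matrix."""
    return np.trace(M) / M.sum()
```

A preservation rate of 1.0 would mean the reduction technique left the classifier's decisions on the test set entirely unchanged; lower values quantify how much classification the reduction loses.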
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
Summary
In this thesis, we started by pointing out the challenges faced in extracting useful information from the available large pool of data, which grows at an alarming rate. Dimensionality reduction was introduced as a method that provides a compact representation of original high-dimensional data, making it a very powerful tool and an invaluable preprocessing step that facilitates the application of many machine learning algorithms.
After that, a review was done of literature related to the subject of this thesis. As pointed out in the review of related work, dimensionality reduction has been applied in several domains, including machine learning. The methodology used in achieving the objectives of this research was then explained in detail. This includes a detailed explanation of the methods involved: fifteen dimensionality reduction techniques, two classification algorithms (the perceptron and K-Nearest Neighbors), and the confusion matrix. The results of the achieved objectives, which were presented in the fourth chapter, revealed the extent to which dimensionality reduction techniques preserve the perceptron and K-Nearest Neighbors classification.
Next, the confusion matrix was used to show the extent to which these fifteen dimensionality reduction techniques, compared against each other, preserve the perceptron and K-Nearest Neighbors classification.
Conclusion
The aim of this thesis, as stated in chapter 1, is to investigate the extent to which dimensionality reduction techniques preserve classification. This investigation revealed that the dimensionality reduction techniques implemented in this thesis seem to perform much better at preserving K-Nearest Neighbors classification than at preserving the classification of the original datasets using the perceptron. In general, the dimensionality reduction techniques prove to be very efficient at preserving the classification of both the lazy and eager learners used in this investigation.
Recommendation
The classification preservation of dimensionality reduction methods on more sophisticated classifiers, such as support vector machines and decision trees, would be interesting and worth investigating.
REFERENCES
- Sharma and K. Saroha, "Study of dimension reduction methodologies in data mining," in International Conference on Computing, Communication and Automation, 2015, pp. 133–137.
- K. Fodor, "A survey of dimension reduction techniques," Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, no. 1, pp. 1–18, 2002.
- Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003.
- S. Nsang, I. Diaz, and A. Ralescu, "Ensemble Clustering based on Heterogeneous Dimensionality Reduction Methods and Context-dependent Similarity Measures," Int. J. Adv. Sci. Technol., vol. 64, pp. 101–118, 2014.
- S. Nsang, A. Maikori, F. Oguntoyinbo and H. Yusuf, "A New Random Approach To Dimensionality Reduction," in Int'l Conf. on Advances in Big Data Analytics (ABDA'15), 2014, vol. 60, no. 6, pp. 2114–2142.
- H. Deshmukh, T. Ghorpade, and P. Padiya, "Improving classification using preprocessing and machine learning algorithms on NSL-KDD dataset," in Proceedings - 2015 International Conference on Communication, Information and Computing Technology, ICCICT 2015, 2015.
- Kalamaras, "A novel approach for multimodal graph dimensionality reduction," Imperial College London, 2015.
- Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, and I. Chouvarda, "Machine Learning and Data Mining Methods in Diabetes Research," Comput. Struct. Biotechnol. J., vol. 15, pp. 104–116, 2017.