Data Classification Using Various Learning Algorithms
Chapter One
Aim and Objectives
The aim of this research is to investigate the extent to which dimensionality reduction techniques preserve classification.
The objectives of the research are as follows:
- Implementation of fifteen dimensionality reduction techniques and their application to the weather and student datasets, as well as the ionosphere dataset obtained from the UCI Machine Learning Repository [23].
- Implementation of the perceptron classification algorithm and its use in classifying the data points of a two-class dataset. The perceptron is also applied to the datasets reduced from this two-class dataset using the techniques above, and comparisons are made to determine the extent to which the reduction methods preserve the classification of the original dataset.
- Implementation of the k-Nearest Neighbors classification algorithm and comparison of how well the dimensionality reduction techniques preserve the classification of a dataset under the k-Nearest Neighbors and perceptron classification algorithms.
- Use of confusion matrices to show the extent to which each dimensionality reduction method preserves the classification of the original datasets, and comparison of the methods against one another.
CHAPTER TWO
LITERATURE REVIEW
This chapter gives a review of literature related to dimensionality reduction, machine learning, and the application of dimensionality reduction in the machine learning domain, with a bias towards the perceptron and K-Nearest Neighbors learning algorithms.
Dimensionality Reduction
Dimensionality reduction is defined as the mapping of high-dimensional data to low-dimensional data, such that the result obtained by analyzing the reduced dataset is a good approximation of the result obtained by analyzing the original high-dimensional data [24]. Owing to the challenges faced in analyzing the available large pool of data, the robustness and importance of dimensionality reduction have been widely emphasized in the literature.
The importance of dimensionality reduction is stressed in [25], where the authors proposed four novel dimensionality reduction techniques: the New Top-Down, New Bottom-Up, Variance–New Top-Down Hybrid, and Variance–New Bottom-Up Hybrid approaches, and used them alongside other existing techniques to reduce images. The authors observed that most of the approaches belonging to the first category (each attribute in the reduced dataset is a linear combination of the attributes of the original dataset) are inefficient at image preservation, while techniques belonging to the second category (the set of attributes in the reduced dataset is a subset of the set of attributes in the original dataset) are reasonably efficient at image preservation. After these observations, the authors applied several queries to the reduced image to discover certain features of the original image. They observed that the features of the reduced image correspond accurately to the attributes of the original image.
The authors of [1] identified some schemes used to reduce the number of features in high-dimensional datasets in order to improve machine learning algorithms. The authors explained the concept of critical dimensions, which is the minimum number of features required for prediction (with high accuracy) in classification algorithms. They presented four dimensionality reduction schemes with their advantages and disadvantages. This provides researchers with the necessary information and direction when choosing a reduction scheme for a given dataset type.
In a recent study [5], the authors proposed a new approach to reducing the dimensionality of data. In this approach, to reduce the original n attributes to m attributes, m attributes are randomly selected from the n attributes to form the reduced dataset. This approach proved to be slightly better than some of the most popular dimensionality reduction techniques, such as Random Projection and Principal Component Analysis, at preserving the K-means clustering of the original datasets. Thus, dimensionality reduction is a very fertile area of research and a strong tool for preprocessing high-dimensional data.
The authors in [26] proposed a biologically inspired dimensionality reduction method, modeled on the behavior of ants, called the Ant Colony Optimization-Selection algorithm. Using five microarray datasets of very high dimensionality, the authors showed that the proposed algorithm selects the more important genes from the high-dimensional data, based on some parameters, with excellent classification accuracy.
Machine Learning
Tom Mitchell, in his classic book on machine learning, says "The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience" [9]. Put simply, machine learning is a scientific field in which computer systems automatically and intelligently learn their computation and improve on it through experience; its algorithms can be classified into supervised and unsupervised learning. Machine learning techniques have been, and are still, used to solve many complex real-world problems. There are many classification techniques; since no classifier is considered strictly better than the others [27], the perceptron and K-Nearest Neighbors classifiers are chosen for the purpose of this research.
The perceptron is an Artificial Neural Network that mimics, or tries to simulate, the activities of the brain with regard to information processing [14]. It takes a weighted sum of its inputs and, if the sum is greater than some threshold value, outputs one; otherwise it outputs zero (or -1). The perceptron is made up of a summation processor, which takes the dot product of the inputs and the weights, and an activation function, also known as a threshold, which uses a step function to determine the output of the perceptron [16].
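The weighted-sum-and-threshold rule, together with the standard perceptron learning update, can be sketched in a few lines. The thesis implementation is in MATLAB; the Python fragment below, with illustrative names such as `train_perceptron`, is only a minimal sketch of the textbook algorithm, not the code used in this work:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a simple perceptron on labels in {0, 1}.
    The bias is folded in as an extra constant input of 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if np.dot(w, xi) > 0 else 0    # step activation
            w += lr * (target - pred) * xi          # update only on error
    return w

def predict(w, X):
    """Apply the learned weight vector to new points."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)
```

On a linearly separable problem such as logical AND, this update rule converges after a few epochs, which is the behavior the perceptron convergence theorem guarantees.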
Notable work in the domain of machine learning can be seen in [28], where the authors proposed a single-layer Artificial Neural Network for classifying cancer patients using Chebyshev, Trigonometric, and Legendre expansion techniques. The authors concluded that Discrete Cosine Transform feature-reduction-based Neural Network classifiers perform better than the other classifiers used for the purpose of that research.
Another classification algorithm is the K-Nearest Neighbors algorithm, whose principle is that the label of any given instance is predicted from the label most common among its k nearest neighbors [15]. In one study [29], the authors proposed a new method for attribute selection, which selects attributes that complement each other, and tested it on a real dataset. Using two classes from the dataset, the authors found that the proposed technique selects subsets of attributes that yield a classification accuracy higher than that obtained by using the entire set of attributes, or even the subset of attributes identified by CART. The classification technique used for that investigation is the K-Nearest Neighbors classification algorithm, and a confusion matrix is used to measure the overall accuracy of the classifier.
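The majority-vote principle of K-Nearest Neighbors can be sketched directly. This is an illustrative Python fragment, not the thesis's MATLAB code; the function name and the choice of Euclidean distance are assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Predict the label of point x as the label most common among
    its k nearest training points, under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Because no model is fitted in advance (all work happens at query time), K-Nearest Neighbors is the "lazy" learner referred to later in this thesis, in contrast to the "eager" perceptron.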
CHAPTER THREE
METHODOLOGY
This chapter gives a detailed description of all the algorithms used in achieving the aim and objectives of this thesis. MATLAB is used for the analysis and implementations in this thesis.
Dimensionality Reduction Techniques
This section gives a description of the dimensionality reduction techniques implemented in this thesis.
The New Random Approach
This is a technique suggested by [5]. With this technique, to reduce a data set D of dimensionality d to one of dimensionality k, a set Sk is formed consisting of k numbers selected at random from the set S shown in equation (3.1).
S = {x ∈ ℕ | 1 ≤ x ≤ d}        (3.1)
Then, our reduced set, DR, is shown in equation (3.2).
DR = D(:, Sk)        (3.2)
That is, DR is a data set having the same number of rows as D, and if Ai is the ith attribute of DR, then Ai is the jth attribute of D, where j is the ith element of Sk.
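Equations (3.1) and (3.2) amount to sampling k column indices from {1, ..., d} and keeping those columns. A minimal sketch is given below, in Python rather than the thesis's MATLAB, using 0-based indices where the equations are 1-based; treating Sk as a set of distinct indices (`replace=False`) is an assumption read from the phrase "a set Sk":

```python
import numpy as np

def new_random_reduce(D, k, seed=None):
    """Reduce D (rows x d attributes) to k attributes: draw a random
    index set Sk as in eq. (3.1) and take DR = D(:, Sk) as in eq. (3.2)."""
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    Sk = rng.choice(d, size=k, replace=False)   # k distinct indices in 0..d-1
    return D[:, Sk], Sk
```

The reduced data set keeps every row of D unchanged; only the selection of columns is random, which is what makes the method so cheap compared with projection-based techniques.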
Modified New Random Approach
This technique is a modification of the new random approach, proposed by [36]. To reduce a data set D of dimensionality p to one of dimensionality k, this improved approach uses a more efficient method to generate the random numbers, i.e. the results are less random. The algorithm is given in Algorithm 3.1.
CHAPTER FOUR
RESULTS AND DISCUSSION
The Perceptron Preservation
In this section, the results of the comparison among the dimensionality reduction techniques (explained in chapter three of this thesis) for the perceptron classification preservation of the weather, student, and ionosphere datasets are presented. To obtain these results, the training set of each of the three datasets, for both the original and reduced versions, is used to train the perceptron; the weight vector obtained from the training phase is then used to classify the test sets of both the original and reduced datasets. The obtained results are shown in tables 4.1, 4.2 and 4.3.
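The comparison described above, tabulating how the classifier labels the test points on the reduced data relative to the original data, can be sketched with a small confusion-matrix helper. The function names below are illustrative, not the thesis code, and a two-class problem is assumed by default:

```python
import numpy as np

def confusion_matrix(labels_orig, labels_reduced, n_classes=2):
    """Cross-tabulate two label vectors: entry (i, j) counts test points
    labeled i on the original data and j on the reduced data."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for a, b in zip(labels_orig, labels_reduced):
        M[a, b] += 1
    return M

def preservation_rate(M):
    """Fraction of test points classified identically before and after
    reduction: the diagonal mass of the confusion matrix."""
    return np.trace(M) / M.sum()
```

A preservation rate of 1.0 would mean the reduction technique left the classifier's decisions on the test set entirely unchanged; lower values quantify how much classification the reduction loses.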
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
Summary
In this thesis, we started by pointing out the challenges faced in extracting useful information from the available large pool of data, which grows at an alarming rate. Dimensionality reduction was introduced as a method that provides a compact representation of original high-dimensional data, making it a very powerful tool and an invaluable preprocessing step that facilitates the application of many machine learning algorithms.
After that, a review was done of literature related to the subject of this thesis. As pointed out in the review of related work, dimensionality reduction has been applied in several domains, including machine learning. The methodology used in achieving the objectives of this research was then explained in detail. This includes a detailed explanation of the methods involved: fifteen dimensionality reduction techniques, two classification algorithms (the perceptron and K-Nearest Neighbors), and the confusion matrix. The results of the achieved objectives, which were presented in the fourth chapter, revealed the extent to which dimensionality reduction techniques preserve the perceptron and K-Nearest Neighbors classification.
Next, the confusion matrix was used to show the extent to which these fifteen dimensionality reduction techniques, compared against each other, preserve the perceptron and K-Nearest Neighbors classification.
Conclusion
The aim of this thesis, as stated in chapter 1, is to investigate the extent to which dimensionality reduction techniques preserve classification. This investigation revealed that the dimensionality reduction techniques implemented in this thesis seem to perform much better at preserving K-Nearest Neighbors classification than at preserving the classification of the original datasets using the perceptron. In general, the dimensionality reduction techniques prove to be very efficient at preserving the classification of both the lazy and eager learners used in this investigation.
Recommendation
The classification preservation of dimensionality reduction methods on more sophisticated classifiers, such as support vector machines and decision trees, would be interesting and worth investigating.
REFERENCES
- Sharma and K. Saroha, "Study of dimension reduction methodologies in data mining," in International Conference on Computing, Communication and Automation, 2015, pp. 133–137.
- K. Fodor, "A survey of dimension reduction techniques," Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, no. 1, pp. 1–18, 2002.
- Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003.
- S. Nsang, I. Diaz, and A. Ralescu, "Ensemble Clustering based on Heterogeneous Dimensionality Reduction Methods and Context-dependent Similarity Measures," Int. J. Adv. Sci. Technol., vol. 64, pp. 101–118, 2014.
- S. Nsang, A. Maikori, F. Oguntoyinbo and H. Yusuf, "A New Random Approach To Dimensionality Reduction," in Int'l Conf. on Advances in Big Data Analytics (ABDA'15), 2014, vol. 60, no. 6, pp. 2114–2142.
- H. Deshmukh, T. Ghorpade, and P. Padiya, "Improving classification using preprocessing and machine learning algorithms on NSL-KDD dataset," in Proceedings - 2015 International Conference on Communication, Information and Computing Technology, ICCICT 2015, 2015.
- Kalamaras, "A novel approach for multimodal graph dimensionality reduction," Imperial College London, 2015.
- Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, and I. Chouvarda, "Machine Learning and Data Mining Methods in Diabetes Research," Comput. Struct. Biotechnol. J., vol. 15, pp. 104–116, 2017.