Categorization of Data Using Hierarchical Clustering

CHAPTER ONE

OVERVIEW OF THE STUDY

Given a data set of n points in a high-dimensional space, it is often helpful to project it onto a lower-dimensional space without great distortion. This process is called dimensionality reduction. Essentially, dimensionality reduction reduces the number of variables under consideration so that the relevant information is retained while the amount of data shrinks.

Dimensionality reduction helps to reduce the runtime of algorithms whose runtime depends on the dimension of the working space. It also broadens the choice of methods available for data processing, and it provides complexity control, which helps avoid overfitting the training data.

Dimensionality reduction can be applied in several domains, including text data, image data, nearest neighbor search, and clustering and classification. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning. Classification, on the other hand, is a method of supervised learning: the task of the supervised learner is to predict the value of the target function for any valid input after having seen a number of training examples (i.e. pairs of input and target output). This project focuses on the categorization of data using hierarchical clustering.

CHAPTER TWO

HIERARCHICAL CLUSTERING

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

With hierarchical clustering, the clusters are arranged in a hierarchy.

We begin with n clusters consisting of one point each. The two nearest clusters are then combined repeatedly until a termination criterion has been reached. The termination condition could be chosen as, for example, a desired number of clusters or a minimum distance between clusters.
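The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and the names agglomerative_clustering, num_clusters and stop_distance are chosen here for illustration only.

```python
import math
from itertools import combinations


def cluster_distance(a, b, points):
    """Single-linkage distance (an assumption): closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerative_clustering(points, num_clusters=1, stop_distance=math.inf):
    """Merge the two nearest clusters until a termination criterion is met.

    Stops when only `num_clusters` clusters remain, or when the nearest
    pair of clusters is farther apart than `stop_distance`.
    Returns a list of clusters, each a list of indices into `points`.
    """
    # Begin with n clusters consisting of one point each.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Search all pairs of clusters for the nearest pair.
        best_pair, best_dist = None, math.inf
        for a, b in combinations(range(len(clusters)), 2):
            d = cluster_distance(clusters[a], clusters[b], points)
            if d < best_dist:
                best_pair, best_dist = (a, b), d
        if best_dist > stop_distance:
            break  # remaining clusters are all far enough apart
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative_clustering(points, num_clusters=3))
```

Searching every pair at every step keeps the sketch short but is slow for large n; library implementations cache the distances between clusters instead of recomputing them.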

 

CHAPTER THREE

DIMENSIONALITY REDUCTION TECHNIQUES

RANDOM PROJECTIONS (RP)

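A dataset of M dimensions is reduced to an L-dimensional dataset (where L << M) using a random M x L matrix R. If the original dataset is written as an N x M matrix X with one row per data point, the reduced dataset is given by the N x L product XR.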
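As a rough sketch of a random projection, the snippet below multiplies an N x M data matrix by a random M x L matrix R. Several things here are assumptions not stated in the text: the data is held as a list of rows, the entries of R are drawn from a Gaussian with mean 0 and scaled by 1/sqrt(L), and the helper names random_projection_matrix and project are hypothetical. Other choices for the entries of R are possible.

```python
import math
import random


def random_projection_matrix(m, l, seed=0):
    """Build a random M x L matrix R (Gaussian entries are an assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(l)  # assumed scaling, keeps distances comparable
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(l)] for _ in range(m)]


def project(data, r):
    """Multiply data (N x M) by r (M x L), returning an N x L list of rows."""
    num_cols = len(r[0])
    return [
        [sum(x * r_row[k] for x, r_row in zip(row, r)) for k in range(num_cols)]
        for row in data
    ]


# Two points in 6 dimensions reduced to 2 dimensions.
data = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
r = random_projection_matrix(m=6, l=2)
reduced = project(data, r)
print(len(reduced), len(reduced[0]))
```

Because R is generated without looking at the data, building it costs only M x L random draws, which is consistent with random projection being a fast technique.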

CHAPTER FOUR

IMPLEMENTATION

We use the student data set to implement the six dimensionality reduction algorithms discussed earlier. The hierarchical clustering algorithm is first run on the original data set and the result is set aside. In this research, we group the data into 20 clusters only. The data set is then reduced to 10 and 12 columns using each of the dimensionality reduction techniques, and the hierarchical clustering algorithm is run on the reduced data. The clusterings of the reduced data and of the original data are compared using the Rand index. The dimensionality reduction techniques are also compared by runtime, inter-point distance preservation and variance preservation.
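The comparison step can be illustrated with a small Rand index computation. This sketch assumes each clustering is given as a list of cluster labels, one per data point in the same order; that representation and the function name rand_index are assumptions made here. The Rand index is the fraction of point pairs on which the two clusterings agree, that is, pairs placed together in both or apart in both.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        pairs += 1
    # With fewer than two points there are no pairs to disagree on.
    return agreements / pairs if pairs else 1.0


# Label names do not matter, only which points are grouped together.
original = [0, 0, 1, 1]
reduced = [1, 1, 0, 0]
print(rand_index(original, reduced))  # 1.0
```

A value of 1.0 means the clustering of the reduced data groups the points exactly as the clustering of the original data does.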

CHAPTER FIVE

CONCLUSION

In conclusion, dimensionality reduction is a good way to reduce data so that it can be hierarchically clustered faster. It saves time and memory. Hierarchical clustering is useful in several aspects of our daily lives (e.g. school results and medical records), and dimensionality reduction can improve the speed at which it is implemented.

As seen in the table above, all reduction techniques fully preserve the hierarchical clustering of the data set. The fastest reduction technique is Random Projections, and it also preserves much of the inter-point distance and variance. Principal Component Analysis is also fast, and it preserves inter-point distances and variance best.

REFERENCES 

  1. Achlioptas. Database-friendly random projections. In Proc. ACM Symp. on the Principles of Database Systems, pages 274–281, 2001.
  2. Augustine Nsang. Novel Approaches to Dimensionality Reduction and Applications.