Abstract:

As a multivariate data mining technique, cluster analysis has been identified as a key analytical tool in diverse fields. This paper intends to provide researchers and practitioners with a thorough yet simple account of cluster analysis through its theoretical background. As the methodology, textbooks on cluster analysis and research articles that employ cluster analysis as their main structural analysis were reviewed. The review indicates that researchers, like practitioners, need to pay greater attention to guidelines concerning the use and reporting of cluster analysis.

Key words: Cluster analysis, distance, multivariate, data mining, algorithms

1. INTRODUCTION

Cluster analysis is an exploratory analysis technique that can be used to identify structures within data. A cluster can be defined as a collection of data objects (Everitt et al, 2011). More precisely, objects within the same cluster are similar to one another and dissimilar to the objects in other clusters (Leonard & Peter, 2009). Thus cluster analysis can be defined as finding similarities between data according to specific characteristics in the data set and grouping similar data into clusters (Charles, 2004). Segmentation analysis and taxonomy analysis are two other terms used to denote cluster analysis (Manly, 2005). This statistical method tries to identify homogeneous clusters of cases when the grouping is not known in advance. As a multivariate data mining technique, cluster analysis makes no distinction between dependent and independent variables; it reduces complex multivariate data to smaller, meaningful subsets. More specifically, the technique comprises a number of different algorithms and methods that can be used to group objects of a similar kind into separate categories (Kaufman et al, 1990). The main goal of cluster analysis is to organize observed data into a meaningful structure.

As a data mining tool, cluster analysis is used to discover concealed structures or relationships within data (Michaud, 1997). It also gives a quick overview of the data through effective visualization, which can help in reviewing data quality. Cluster analysis is therefore a valuable analytical tool in many subject areas, mainly the social sciences, health science (Cattle, 1943), finance and marketing, engineering and planning, and insurance. Accordingly, applied cluster analysis is growing significantly as a field of research. The main purpose of this study is therefore to discuss the theoretical background and provide researchers with a sound understanding of cluster analysis.

2. BACKGROUND OF CLUSTER ANALYSIS

Cluster analysis is a multivariate statistical procedure generally applied for data mining in fields such as pattern recognition, machine learning, image analysis, bioinformatics, information retrieval, marketing and business analysis, social behavioral analysis and data comparison (Charles, 2004). Cluster analysis has its origins in anthropology, dating back more than eighty years to the work of Driver and Kroeber in 1932, and was introduced to psychology by Zubin in 1938 (Blashfield, 1976).

3. TYPES OF CLUSTERING

The main objective of clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance (Michaud, 1997). Accordingly, one significant concept in cluster analysis is the type of cluster.

If the clusters in the population are well separated, practically any clustering method will perform well. A well-separated cluster is a set of points such that any point in the cluster is closer to every other point in the cluster than to any point not in the cluster.

The center-based cluster type is commonly used in the field of biology (Blashfield & Aldenderfer, 1978). The center of the cluster is often a centroid, the average of all the points in the cluster, which serves as its most representative point and can be used for comparison with other clusters. Center-based clusters are very efficient for clustering large databases.
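As a minimal sketch of the center-based idea, the following Python code computes a centroid and assigns a point to its nearest center. The function names `centroid` and `nearest_center` and the sample points are illustrative assumptions, not taken from the reviewed literature.

```python
import math

def centroid(points):
    """Coordinate-wise average of a list of equal-length points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest_center(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

# Hypothetical sample cluster, for illustration only.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (6.0, 6.0)]
print(centroid(cluster))                                   # (2.5, 2.75)
print(nearest_center((2.0, 2.0), [(1.0, 1.0), (8.0, 8.0)]))  # 0
```

Because only one center per cluster has to be stored and compared, this representation stays cheap as the number of points grows.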

Density-based clustering gives another type of cluster. Here a cluster is a dense region of points that is separated from other regions of high density by regions of low density (Gan et al, 2007). This type is used when outliers are present and the clusters are irregular in shape.

A contiguous cluster is a set of points such that a point in the cluster is nearer to one or more other points in the cluster than to any point not in the cluster (Kaufman et al, 1990). A further type is the conceptual cluster, which is defined by a shared property. Clusters of this type are capable of generating a hierarchical category structure.

Clusters defined by an objective function form another notable type. Here the aim is to find the clusters that minimize or maximize an objective function (Anderberg, 1973). In principle, all possible ways of dividing the points into clusters are enumerated, and the goodness of each potential set of clusters is evaluated using the given objective function (Parsons, 2004).
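The enumeration described above can be sketched for a very small data set. The example assumes one particular objective function, the within-cluster sum of squared errors (SSE), and exactly two clusters; both choices, the function names and the sample points are illustrative assumptions. Exhaustive search grows exponentially with the number of points, so this is only practical for tiny inputs.

```python
from itertools import product

def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for pts in clusters:
        center = [sum(c) / len(pts) for c in zip(*pts)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in pts)
    return total

def best_two_way_split(points):
    """Try every assignment of points to two non-empty clusters; keep the lowest SSE."""
    best = None
    for labels in product((0, 1), repeat=len(points)):
        if len(set(labels)) < 2:  # skip assignments that leave one cluster empty
            continue
        groups = [[p for p, lab in zip(points, labels) if lab == k] for k in (0, 1)]
        score = sse(groups)
        if best is None or score < best[0]:
            best = (score, groups)
    return best

# Hypothetical sample data, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
score, groups = best_two_way_split(points)
print(score, groups)
```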

A good clustering method will produce high-quality clusters. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. It is also measured by the method's ability to discover the hidden patterns in the data.

4. DISTANCE BETWEEN CLUSTERS

The distance between clusters can be defined, as in between-groups linkage, as the average distance between all the data points of the clusters (Michaud, 1997). Any such distance is built on a distance metric. A distance metric is defined as a function d that takes as arguments two points x and y in an n-dimensional space R^n and satisfies the properties of symmetry, positivity and the triangle inequality (Kaufman et al, 1990). The triangle inequality reflects the fact that the distance between two points should be measured along the shortest route. The distance between x and y can be computed in the following ways.
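The three metric properties can be checked numerically on a sample of points. The helper below, `is_metric_on`, is a hypothetical name introduced for illustration. A check on a finite sample can only refute the properties, never prove them in general.

```python
import math
from itertools import product

def is_metric_on(d, points, tol=1e-12):
    """Check symmetry, positivity and the triangle inequality for d on `points`."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or abs(d(x, y) - d(y, x)) > tol:
            return False                      # negative or asymmetric
        if (d(x, y) <= tol) != (x == y):
            return False                      # zero distance only for identical points
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False                      # a detour was shorter than the direct route
    return True

sample = [(0.0,), (1.0,), (2.0,)]
print(is_metric_on(math.dist, sample))                          # True
print(is_metric_on(lambda x, y: math.dist(x, y) ** 2, sample))  # False
```

The second call fails because the squared Euclidean distance violates the triangle inequality: on this sample, 4 > 1 + 1.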

4.1 EUCLIDEAN DISTANCE

The Euclidean distance takes into account both the direction and the magnitude of the vectors. The Euclidean distance between two n-dimensional vectors x=(x_1,x_2,…,x_n) and y=(y_1,y_2,…,y_n) is

$$d_E(x,y)=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_n-y_n)^2}$$

$$d_E(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$

The squared Euclidean distance between two n-dimensional vectors is (Charles, 2004)

$$d_{E^2}(x,y)=(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_n-y_n)^2$$

$$d_{E^2}(x,y)=\sum_{i=1}^{n}(x_i-y_i)^2$$

The squared Euclidean distance tends to give more weight to outliers because of the lack of the square root (Kaufman et al, 1990). The standardized Euclidean distance between two n-dimensional vectors is

$$d_{SE}(x,y)=\sqrt{\frac{1}{s_1^2}(x_1-y_1)^2+\cdots+\frac{1}{s_n^2}(x_n-y_n)^2}$$

$$d_{SE}(x,y)=\sqrt{\sum_{i=1}^{n}\frac{1}{s_i^2}(x_i-y_i)^2}$$

This distance weights each dimension by a quantity inversely proportional to the amount of variability along that dimension, where s_i^2 is the variance of the i-th dimension.
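The three Euclidean variants above can be sketched in plain Python as follows. The function names are illustrative, and the per-dimension variances are assumed to be supplied by the caller.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Same sum without the square root, so large differences weigh more."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def standardized_euclidean(x, y, variances):
    """Each squared difference is divided by that dimension's variance s_i^2."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))          # 5.0
print(squared_euclidean(x, y))  # 25.0
print(standardized_euclidean(x, y, (9.0, 16.0)))
```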

4.2 MANHATTAN DISTANCE

The Manhattan distance measures distance along directions that are parallel to the coordinate axes.

$$d_M(x,y)=|x_1-y_1|+|x_2-y_2|+\cdots+|x_n-y_n|$$

$$d_M(x,y)=\sum_{i=1}^{n}|x_i-y_i|$$

where |x_i-y_i| represents the absolute value of the difference between x_i and y_i.
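A minimal Python sketch of this sum of absolute differences, with an illustrative function name:

```python
def manhattan(x, y):
    """Sum of absolute coordinate differences (distance along the axes)."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan((1.0, 2.0), (4.0, 6.0)))  # 7.0
```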

4.3 CHEBYCHEV DISTANCE

The Chebychev distance simply picks the largest difference between any two corresponding coordinates. Its main purpose is to reveal whether there is any large difference between two n-dimensional vectors in at least one dimension.

$$d_{\max}(x,y)=\max_i |x_i-y_i|$$

This distance is very sensitive to outlying measurements and robust to small amounts of noise (Binder, 1978).
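The following sketch, with an illustrative function name and made-up vectors, shows how a single outlying coordinate determines the result while small differences in the other coordinates have no effect:

```python
def chebychev(x, y):
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(x, y))

print(chebychev((1.0, 2.0), (4.0, 6.0)))              # 4.0
print(chebychev((1.0, 2.0, 3.0), (1.1, 2.1, 50.0)))   # 47.0, set by the outlier alone
```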

4.4 COSINE SIMILARITY

Cosine similarity takes into account only the angle between the vectors and discards their magnitude. The cosine similarity between two n-dimensional vectors is

$$d_{\theta}(x,y)=\cos(\theta)=\frac{x\cdot y}{\|x\|\,\|y\|}$$

where $\|x\|=\sqrt{\sum_{i=1}^{n}x_i^2}$
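A short Python sketch of this ratio is given below. The function name and the decision to raise an error for a zero vector, where the angle is undefined, are assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))  # close to 1: same direction, different length
```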

4.5 CORRELATION DISTANCE

The Pearson correlation distance is based on how closely two vectors follow a linear relationship and takes values between 0 and 2 (Blashfield & Aldenderfer, 1978). The distance between two n-dimensional vectors is

$$d_R(x,y)=1-\rho_{xy}$$

where $\rho_{xy}$ is the Pearson correlation coefficient of the vectors x and y.
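The range from 0 to 2 follows from the correlation coefficient lying between -1 and 1. A minimal sketch, with illustrative function names and assuming neither vector is constant:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # assumes neither vector is constant

def correlation_distance(x, y):
    """One minus the Pearson correlation: 0 for perfect positive, 2 for perfect negative."""
    return 1.0 - pearson(x, y)

print(correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 0
print(correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # close to 2
```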

4.6 MAHALANOBIS DISTANCE

The Mahalanobis distance generalizes the classical Euclidean distance. The distance between two n-dimensional vectors is

$$d_{M1}(x,y)=\sqrt{(x-y)^T S^{-1}(x-y)}$$

where S is any n×n positive definite matrix and (x-y)^T is the transpose of (x-y). Usually S is the covariance matrix of the data set. If the space-warping matrix S is taken to be the identity matrix, the distance reduces to the classical Euclidean distance (Blashfield & Aldenderfer, 1978).
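The distance in this subsection, in which the difference vector is weighted by the inverse of a covariance-type matrix, is commonly known as the Mahalanobis distance. The sketch below uses only the standard library; the helper `solve` (Gaussian elimination) and the function name `mahalanobis` are illustrative assumptions, and the matrix is assumed to be positive definite. Solving a linear system avoids forming the inverse matrix explicitly.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def mahalanobis(x, y, s):
    """sqrt((x - y)^T S^{-1} (x - y)) for a positive definite matrix s."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(s, diff)  # z = S^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))

identity = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis((0.0, 0.0), (3.0, 4.0), identity))  # 5.0, the Euclidean distance
```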

4.7 MINKOWSKI DISTANCE

The Minkowski distance is a generalization of the Euclidean and Manhattan distances. The distance between two n-dimensional vectors is

$$d_{Mk}(x,y)=\left\{|x_1-y_1|^m+\cdots+|x_n-y_n|^m\right\}^{1/m}$$

$$d_{Mk}(x,y)=\left\{\sum_{i=1}^{n}|x_i-y_i|^m\right\}^{1/m}$$

where $x^{1/m}=\sqrt[m]{x}$. Note that for m=1 the distance reduces to the Manhattan distance, the simple sum of the absolute differences. For m=2 the Minkowski distance reduces to the Euclidean distance (Binder, 1978).
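The two special cases can be verified with a short sketch; the function name is illustrative and m is assumed to be at least 1.

```python
def minkowski(x, y, m):
    """m-th root of the summed m-th powers of the absolute differences (m >= 1)."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, 1))  # 7.0, the Manhattan distance
print(minkowski(x, y, 2))  # 5.0, the Euclidean distance
```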

In practice, variables of different types are often mixed together. For that purpose, any of the distances identified above can be modified by applying a weighting scheme that reflects the variance of each variable. Generally it is necessary to standardize or normalize the objects before computing distances (Anderberg, 1973). Clustering depends heavily on the distance metric: in theory, changing the distance metric may affect the number and membership of the clusters as well as the relationships between them.
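One common way to standardize, shown here as a sketch rather than a prescribed procedure, is to convert each variable to z-scores so that every column has mean 0 and standard deviation 1. The function name and sample rows are illustrative, and every column is assumed to have non-zero variance.

```python
import statistics

def standardize(rows):
    """Replace each value by (value - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # assumes no constant column
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Two variables on very different scales (hypothetical values).
rows = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
for row in standardize(rows):
    print(row)
```

After this step both variables contribute on the same scale, so the second column no longer dominates the distance simply because of its larger units.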

5. REQUIREMENTS OF CLUSTERING IN DATA MINING

The variables used in cluster analysis can be classified as interval-scaled, binary, nominal, ordinal, ratio or mixed-type variables. The kind of cluster analysis to be carried out is selected mainly on the basis of past research, the theoretical approach, the hypothesis being tested and the judgement of the researcher.

Among the requirements of clustering in data mining, the first to consider is scalability. Many clustering algorithms work well on small data sets containing relatively few data objects. Therefore, highly scalable clustering algorithms are needed to deal with large databases (Blashfield & Aldenderfer, 1978).

The ability to deal with different types of attributes is another major requirement. Algorithms should be applicable to any kind of data, such as interval-based, categorical and binary data.

Discovery of clusters with arbitrary shape is an additional requirement. The clustering algorithm should be capable of detecting clusters of arbitrary shape.

Handling high dimensionality is a key requirement. The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.

The ability to deal with noisy data concerns the sensitivity of an algorithm to the data (Charles, 2004). Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data, which may lead to clusters of poor quality.

The clustering results should be interpretable, comparable and usable. In addition, many algorithms require users to provide domain knowledge in the form of input parameters.

Beyond these major requirements, the ability to handle dynamic data, insensitivity to the order of input records and the incorporation of user-specified constraints should also be considered as requirements of clustering in data mining.

6. CLUSTER APPROACHES

Different clustering approaches can be used for analysis in different fields. Partitioning algorithms are a very popular clustering approach (Parsons et al, 2004).
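As an illustration of the partitioning approach, the sketch below implements a basic k-means loop. k-means is chosen here only as a well-known example of a partitioning algorithm; the text above does not single it out. The function name, the fixed number of iterations and the caller-supplied starting centers are assumptions made to keep the example deterministic.

```python
import math

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    groups = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[k]
            for k, g in enumerate(groups)
        ]
    return centers, groups

# Hypothetical sample data and starting centers, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, groups = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [(1.25, 1.5), (8.5, 8.75)]
print(groups)
```

The result of such a procedure generally depends on the starting centers, which is one reason the choice of initial values and of the number of clusters needs to be reported.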

Hierarchical algorithms create a hierarchical decomposition of the data set using some criterion. The density-based approach, one of the major approaches, is based on connectivity and density functions. The grid-based approach is based on a multiple-level structure. In the model-based approach, a model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model (Blashfield & Aldenderfer, 1978). The frequent-pattern-based approach is based on the analysis of frequent patterns. Finally, the user-constraint-based approach clusters the data by considering user-specified or application-specific constraints.

7. CONCLUSIONS

Clustering is the grouping of similar data into the same groups. The greater the similarity within a group and the greater the difference between groups, the more distinct the clustering. The core objectives of cluster analysis are taxonomy description, data interpretation through hypothesis generation and, finally, the identification of relationships. A cluster analysis is designed by first formulating the problem and then selecting a proper distance measure and an appropriate clustering procedure. After deciding on the number of clusters, the researcher should interpret the cluster profiles and assess the validity of the clustering.