Microarrays are a research tool for studying how an organism adapts to environmental changes, how it interacts with other organisms, and how its various cellular components interact with one another.

The gene expression values and ratios are analyzed using statistical tests or clustering methods. The statistical methods are either inferential or descriptive, depending on the nature of the microarray data.

Cluster analysis is now widely used to analyze gene expression matrices. The principal goal of clustering is to group objects that are similar in nature. Running a clustering procedure is straightforward; defining what counts as similar is the harder task. Similarity is usually defined through a distance metric, so the outcome of a clustering technique depends on the distance metric used. The technique had its origin in the construction of phylogenetic trees.

Clustering can be applied either to individual genes, based on their expression values across experiments, or to experiments (samples), based on their expression profiles across genes.

Purpose of clustering

Clustering techniques are applied to microarray data to

• Identify groups of genes which are co-regulated.
• Identify temporal or spatial gene expression patterns in the given expression matrices.
• Arrange a set of genes in a linear order to infer significant information.
• Detect experimental artifacts, errors and incomplete hybridizations.
• Provide a quality check for the grouping methods adopted for data analysis.
• Identify new classes of biological samples (e.g. finding tumor subtypes).

Basis of Cluster Analysis

Cluster analysis rests on two components:
1. A distance metric, which quantifies how similar or dissimilar two data points are.
2. A clustering algorithm, which uses those distances to group objects with similar properties. In a good clustering, the distances between points within a cluster are small and the distances between clusters are large.
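
The within-cluster versus between-cluster criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two clusters are made-up toy points, straight-line (Euclidean) distance is assumed as the measure, and the helper names mean_within_distance and mean_between_distance are hypothetical.

```python
import math
from itertools import combinations, product


def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_within_distance(cluster):
    """Average distance over all pairs inside one cluster (needs >= 2 points)."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_between_distance(cluster_a, cluster_b):
    """Average distance over all pairs with one point in each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


# Toy data (assumed for illustration): two well-separated groups.
cluster_a = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8)]
cluster_b = [(5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]

print("within A :", mean_within_distance(cluster_a))
print("within B :", mean_within_distance(cluster_b))
print("between  :", mean_between_distance(cluster_a, cluster_b))
```

For a grouping that fits the criterion, both within-cluster averages come out much smaller than the between-cluster average.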

Distance metrics

A distance metric is a function d(x, y) that takes two points x and y, each a vector in an n-dimensional space, and returns a single real number. A distance metric should satisfy three conditions.

1. The distance should be symmetric between the points x and y.
In essence, d(x, y) should be equal to d(y, x).
2. The distance between x and y should be a non-negative real number.
d(x, y) ≥ 0.
3. The distance should satisfy the triangle inequality: the direct distance between x and y should never exceed the sum of the distances from x and from y to any third point z.
d(x, y) ≤ d(x, z) + d(y, z)
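
The three conditions can be checked by brute force on a small set of points. The sketch below is illustrative only: the function name satisfies_metric_conditions, the tolerance and the sample points are assumptions. It also shows that the squared Euclidean distance, although symmetric and non-negative, can violate the third condition.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_conditions(dist, points, tol=1e-9):
    """Check symmetry, non-negativity and the triangle inequality
    for every combination of the given points (hypothetical helper)."""
    for x, y in product(points, repeat=2):
        if abs(dist(x, y) - dist(y, x)) > tol:   # condition 1: symmetry
            return False
        if dist(x, y) < 0:                       # condition 2: non-negativity
            return False
    for x, y, z in product(points, repeat=3):
        # condition 3: d(x, y) <= d(x, z) + d(y, z)
        if dist(x, y) > dist(x, z) + dist(y, z) + tol:
            return False
    return True


# Sample points assumed for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

print(satisfies_metric_conditions(euclidean, points))
print(satisfies_metric_conditions(squared_euclidean, points))
```

With these points the squared Euclidean distance fails because d((0,0), (2,0)) = 4 is larger than d((0,0), (1,0)) + d((2,0), (1,0)) = 2. A check on a finite sample can only disprove the conditions, never prove them in general.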

Three distance measures are widely used in microarray data clustering: Euclidean distance, Manhattan distance and correlation distance.

The choice of distance measure depends on the data and on the purpose of the analysis. To be more precise, it depends on the kind of similarity that you would like to identify.

Correlation distance measures the general trends and relative differences between the data points, whereas Euclidean and Manhattan distances measure absolute differences between the points. Of the latter two, Manhattan distance is more robust against outliers because it does not square the individual differences.
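
The difference between the three measures can be seen on a small example. The sketch below assumes that correlation distance is defined as 1 - r, where r is the Pearson correlation (other definitions, such as 1 - |r|, are also in use); the three gene profiles are invented for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson r (0 = same trend, 2 = opposite trend)."""
    return 1.0 - pearson(x, y)


# Invented expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same trend as gene_a, ten times the level
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend, similar level

for name, other in (("a vs b", gene_b), ("a vs c", gene_c)):
    print(name,
          "euclidean:", euclidean(gene_a, other),
          "manhattan:", manhattan(gene_a, other),
          "correlation:", correlation_distance(gene_a, other))
```

By Euclidean or Manhattan distance, gene_a is closer to gene_c (similar absolute levels), while by correlation distance it is closer to gene_b (same trend). Which answer is the useful one depends on the similarity you want to detect.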

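The expression values can be standardized by subtracting the mean expression value from each data point and dividing by the standard deviation. After standardization, Euclidean and correlation distances give equivalent results: the squared Euclidean distance between two standardized profiles is proportional to 1 - r, where r is their Pearson correlation.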
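
The equivalence after standardization can be verified numerically. The sketch below assumes the population standard deviation (dividing by n); under that assumption the squared Euclidean distance between two standardized profiles of length n equals 2n(1 - r), where r is their Pearson correlation. With the sample standard deviation (dividing by n - 1) the factor becomes 2(n - 1). The two profiles are invented for illustration.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented expression profiles.
gene_a = [2.0, 4.0, 3.0, 8.0, 6.0]
gene_b = [1.0, 3.0, 5.0, 4.0, 9.0]

n = len(gene_a)
z_a, z_b = standardize(gene_a), standardize(gene_b)

lhs = squared_euclidean(z_a, z_b)              # distance after standardization
rhs = 2 * n * (1 - pearson(gene_a, gene_b))    # 2n times (1 - r)

print(lhs, rhs)  # the two values agree up to rounding error
```

Because one quantity is a fixed multiple of the other, ranking pairs of standardized profiles by Euclidean distance gives the same order as ranking them by 1 - r.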

Inter cluster distances

The single linkage method measures the distance between the closest neighbors of two clusters. The distance from each point of one cluster to each point of the other cluster is calculated, and the minimum of these is taken.

The complete linkage method does the reverse. It measures the distance between the farthest neighbors of two clusters, taking the maximum of the pairwise distances.

The centroid linkage method calculates the distance between two clusters as the square of the Euclidean distance between the means (centroids) of the clusters. This method is more robust to outliers than the two methods above.

The average linkage method measures the average distance between the members of two clusters. The distance from each member of one cluster to every member of the other cluster is calculated, and the average of these values is taken.
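
The four inter-cluster distances described above can be written as short functions. This is a minimal sketch, assuming Euclidean distance between individual points and two invented clusters; the function names are illustrative rather than taken from any particular package.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p, q in product(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p, q in product(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances between the two clusters."""
    total = sum(euclidean(p, q) for p, q in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dims))


def centroid_linkage(cluster_a, cluster_b):
    """Squared Euclidean distance between the two cluster centroids."""
    return euclidean(centroid(cluster_a), centroid(cluster_b)) ** 2


# Invented clusters for illustration.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

print("single  :", single_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
print("average :", average_linkage(cluster_a, cluster_b))
print("centroid:", centroid_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single linkage value is never larger than the average linkage value, which in turn is never larger than the complete linkage value. The centroid value is on a squared scale, so it is not directly comparable with the other three.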

The choice of linkage method affects the efficiency and complexity of the clustering. Single and complete linkage require fewer calculations. However, the accuracy is very low with the single linkage method. By comparison, centroid and average linkage produce more accurate clusters but require more complex computations. The average linkage and complete linkage methods are more widely used in microarray data analysis.
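
The effect of the linkage choice can be seen with a naive bottom-up (agglomerative) procedure that repeatedly merges the two closest clusters. The text above does not specify an algorithm, so the procedure, the function name agglomerate and the one-dimensional data are assumptions made purely for illustration; the implementation favors clarity over efficiency.

```python
from itertools import combinations, product


def single(cluster_a, cluster_b):
    """Single linkage on 1-D values: smallest pairwise gap."""
    return min(abs(p - q) for p, q in product(cluster_a, cluster_b))


def complete(cluster_a, cluster_b):
    """Complete linkage on 1-D values: largest pairwise gap."""
    return max(abs(p - q) for p, q in product(cluster_a, cluster_b))


def agglomerate(points, linkage, k):
    """Start with one cluster per point and merge the closest pair of
    clusters (according to `linkage`) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented 1-D values whose gaps grow slowly from left to right.
points = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]

single_result = agglomerate(points, single, 2)
complete_result = agglomerate(points, complete, 2)

print("single  :", single_result)
print("complete:", complete_result)
```

On this data single linkage keeps attaching the nearest remaining point to one growing cluster and ends with a five-point cluster plus a lone point, whereas complete linkage ends with a four-point and a two-point cluster. This chaining tendency is one commonly cited reason why single linkage can give less useful clusters.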
