Microarray Data: To Cluster Or Not? | Purpose of Clustering

BY: Sandhya Anand | Category: Bioinformatics | Submitted: 2011-01-28 10:16:21
Article Summary: "Clustering is the most popular method in gene expression analysis methods. This is a kind of descriptive statistics in which data is explored to define the similarity (dissimilarity) between the objects..."

A microarray is a research tool for studying how an organism adapts to environmental changes, how it interacts with other organisms, and how its various cellular components interact in fine detail.

The gene expression values and ratios are analyzed using statistical tests or clustering methods. Statistical methods are either inferential or descriptive, depending on the nature of the microarray data.

Cluster analysis is now widely used to analyze gene expression matrices. The principal goal of clustering is to group objects that are similar in nature. Running a clustering procedure is straightforward; defining similarity is the hard part. Similarity is usually quantified with a distance metric, and the clusters obtained depend on the metric chosen. The technique had its origin in the construction of phylogenetic trees.

Clustering can be applied either to individual genes, grouping those with similar expression values across experiments, or to experimental samples, grouping those with similar expression data.

Purpose of clustering

Clustering techniques are applied to microarray data to

• Identify groups of genes which are co-regulated.
• Identify temporal or spatial gene expression patterns in the given expression matrices.
• Arrange a set of genes in a linear order to infer significant information.
• Detect experimental artifacts, errors and incomplete hybridizations.
• Provide a quality check for the grouping methods adopted for data analysis.
• Identify new classes of biological samples (e.g. finding tumor subtypes).

Basis of Cluster Analysis

Clustering generally rests on two components:
1. A distance metric, which quantifies how similar or dissimilar the data points are so that objects can be grouped.
2. A clustering algorithm, the procedure that groups objects with similar properties. Within-cluster distances between data points should be small and between-cluster distances should be large.
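The within-cluster versus between-cluster criterion can be illustrated with a minimal sketch. The two clusters below are made-up two-condition expression profiles, and the helper names (mean_within_distance, mean_between_distance) are hypothetical; Euclidean distance is assumed only for the sake of the example.

```python
import math
from itertools import combinations, product


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_within_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(x, y) for x, y in pairs) / len(pairs)


def mean_between_distance(cluster_a, cluster_b):
    """Average distance over all pairs drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(x, y) for x, y in pairs) / len(pairs)


# Made-up expression profiles (two conditions per gene), for illustration only.
cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

within = max(mean_within_distance(cluster_1), mean_within_distance(cluster_2))
between = mean_between_distance(cluster_1, cluster_2)

# A well-separated grouping has small within-cluster and large
# between-cluster distances.
print("within:", within, "between:", between)
```

A grouping for which the within value approaches the between value would suggest that the clusters are not well separated under the chosen metric.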

Distance metrics

A distance metric is a function d(x, y) that assigns a real value to a pair of points x and y, each of which is a vector in an n-dimensional space. A distance metric should satisfy three conditions.

1. The distance should be symmetric between the points x and y.
d (x, y) = d (y, x)
2. The distance between x and y should be a non-negative real number.
d (x, y) ≥ 0
3. The distance should satisfy the triangle inequality: the direct distance between x and y can never exceed the sum of their distances to an arbitrary third point z.
d (x, y) ≤ d (x, z) + d (y, z)
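The three conditions can be checked numerically on a handful of points. The sketch below is illustrative only: the function name satisfies_metric_axioms and the sample profiles are assumptions, and passing the check on a few points is evidence, not a proof. Squared Euclidean distance is included as a contrast because it fails the third condition.

```python
import itertools
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared straight-line distance (not a true metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(dist, points, tol=1e-9):
    """Check symmetry, non-negativity and the triangle inequality
    for every combination of the given points."""
    for x, y in itertools.product(points, repeat=2):
        if dist(x, y) < 0:                        # condition 2
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:    # condition 1
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(y, z) + tol:   # condition 3
            return False
    return True


# Made-up three-condition expression profiles, for illustration only.
profiles = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 1.0, 2.0)]

print(satisfies_metric_axioms(euclidean, profiles))
print(satisfies_metric_axioms(squared_euclidean, profiles))
```

With these points the squared distance from the first to the third profile is 12, which exceeds the sum 3 + 3 of the two intermediate squared distances, so the triangle inequality is violated.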

Three distance measures are widely used in microarray data clustering: Euclidean distance, Manhattan distance and correlation distance.

The distance measure to use depends on the application and on how the data will be analyzed. More precisely, it depends on the kind of similarity that you would like to identify.

Correlation distance is a measure of the general trends and relative differences between the data points. Euclidean and Manhattan distances measure absolute differences between the points. Of these two, Manhattan distance is more robust against outliers.
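A minimal sketch of the three measures follows. It assumes that correlation distance is defined as one minus the Pearson correlation coefficient, which is a common convention but not the only one; the two gene profiles are made up for illustration.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation."""
    return 1.0 - pearson(x, y)


# Two made-up genes with the same trend but a tenfold difference in level.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print("Euclidean:  ", euclidean(gene_a, gene_b))
print("Manhattan:  ", manhattan(gene_a, gene_b))
print("Correlation:", correlation_distance(gene_a, gene_b))
```

The two profiles are far apart in absolute terms (Euclidean and Manhattan distances are large) yet their correlation distance is essentially zero, because they rise and fall together.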

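The expression values can be standardized by subtracting the mean expression value from each data point and dividing by the standard deviation. After standardization, Euclidean and correlation distances give equivalent results: the squared Euclidean distance between two standardized profiles is proportional to one minus their correlation, so both measures rank pairs of profiles in the same order.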
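The relationship between the two measures after standardization can be verified numerically. The sketch below assumes that each profile is standardized on its own using the population standard deviation (dividing by n, not n - 1) and that correlation distance means one minus the Pearson correlation r; under those assumptions the squared Euclidean distance equals 2n(1 - r) for profiles of length n. The profiles are made up.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up expression profiles across five conditions.
x = [2.0, 4.0, 4.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 6.0, 5.0]
n = len(x)

zx, zy = standardize(x), standardize(y)
squared_euclidean = sum((a - b) ** 2 for a, b in zip(zx, zy))
r = pearson(x, y)

# The two printed values agree up to floating-point rounding.
print(squared_euclidean, 2 * n * (1 - r))
```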

Inter cluster distances

The single linkage method measures the distance between the closest neighbors of the two clusters. The distance from each data point of one cluster to each data point of the other cluster is calculated, and the minimum of these is taken.

The complete linkage method does the reverse. It takes the distance between the farthest neighbors of the two clusters, that is, the maximum of the pairwise distances between them.

The centroid linkage method calculates the distance between the two clusters as the square of the Euclidean distance between the means (centroids) of the clusters. Because it relies on cluster means and not on individual extreme points, this method is more robust to outliers.

The average linkage method measures the average distance between the members of the two clusters. The distance from each member of one cluster to every member of the other cluster is calculated, and the average of these values is taken.
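The four inter-cluster distances described above can be sketched as follows. The function names and the two small clusters are illustrative assumptions; Euclidean distance is used for the pairwise distances, and centroid linkage follows the squared-Euclidean definition given in the text.

```python
import math
from itertools import product


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(cluster_a, cluster_b):
    """All distances from members of one cluster to members of the other."""
    return [euclidean(x, y) for x, y in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest neighbors."""
    return min(pairwise(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest neighbors."""
    return max(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all member-to-member distances."""
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]


def centroid_linkage(cluster_a, cluster_b):
    """Squared Euclidean distance between the two centroids."""
    return euclidean(centroid(cluster_a), centroid(cluster_b)) ** 2


# Made-up two-condition expression profiles, for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

for name, linkage in [("single", single_linkage),
                      ("complete", complete_linkage),
                      ("average", average_linkage),
                      ("centroid", centroid_linkage)]:
    print(name, linkage(cluster_a, cluster_b))
```

For these clusters the closest pair is 3 apart and the farthest pair is 5 apart, so single and complete linkage bracket the average linkage value.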

The choice of linkage method affects clustering efficiency and complexity. Single and complete linkage require fewer calculations; however, accuracy is very low with the single linkage method. By comparison, centroid and average linkage produce more accurate clusters but require more complex computations. The average linkage and complete linkage methods are more widely used in microarray data analysis.
