Microarray Data: To Cluster Or Not? | Purpose of Clustering
BY: Sandhya Anand | Category: Bioinformatics | Submitted: 2011-01-28 10:16:21
Article Summary: "Clustering is the most popular method in gene expression analysis methods. This is a kind of descriptive statistics in which data is explored to define the similarity (dissimilarity) between the objects..."
A microarray is a research tool for studying how an organism adapts to environmental changes and interacts with other organisms, and for examining the finer details of interactions between cellular components.
The gene expression values and ratios are analyzed using statistical tests or clustering methods. Statistical methods are either inferential or descriptive, depending on the nature of the microarray data.
Cluster analysis is now widely used to analyze gene expression matrices. The principal goal of clustering is to group objects that are similar in nature. Running a clustering procedure is easy; defining similarity is the hard part. Similarity is usually quantified with a distance metric, and the result of clustering depends on the metric used. The technique had its origin in the construction of phylogenetic trees.
Clustering can be applied either to genes, grouping those with similar expression values across experiments, or to experiments, grouping those with similar expression data.
Purpose of clustering
Clustering techniques are applied to microarray data to:
• Identify groups of genes which are co-regulated.
• Identify temporal or spatial gene expression patterns in the given expression matrices.
• Arrange a set of genes in a linear order to infer significant information.
• Detect experimental artifacts, errors and incomplete hybridizations.
• Provide a quality check for the grouping methods adopted for data analysis.
• Identify new classes of biological samples (e.g. finding tumor subtypes).
Basis of Cluster Analysis
Cluster analysis rests on two components:
1. Distance metrics, which quantify the similarity or dissimilarity between objects.
2. Clustering algorithms, which use those distances to group objects with similar properties. In a good clustering, the distances between data points within a cluster are small and the distances between clusters are large.
A distance metric is a function d(x, y) that assigns a real number to a pair of points x and y in an n-dimensional space (here, vectors of n expression values). A distance metric should satisfy three conditions.
1. The distance should be symmetric in the points x and y:
d(x, y) = d(y, x)
2. The distance between x and y should be a non-negative real number:
d(x, y) ≥ 0
3. The distance should satisfy the triangle inequality: the direct distance from x to y should never exceed the distance travelled by going through an arbitrary third point z:
d(x, y) ≤ d(x, z) + d(z, y)
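The three conditions can be checked numerically for a candidate distance function. The sketch below is illustrative only: the names `euclidean` and `satisfies_metric_conditions` and the sample profiles are assumptions, not part of the original article, and passing the check on a handful of points does not prove that a function is a metric.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def satisfies_metric_conditions(dist, points, tol=1e-12):
    """Check the three conditions on every pair and triple of sample points."""
    for x, y in itertools.product(points, repeat=2):
        if abs(dist(x, y) - dist(y, x)) > tol:   # condition 1: symmetry
            return False
        if dist(x, y) < 0:                       # condition 2: non-negativity
            return False
    for x, y, z in itertools.product(points, repeat=3):
        # condition 3: triangle inequality (small tolerance for rounding)
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            return False
    return True

# Hypothetical expression profiles: three genes measured in three samples.
profiles = [(1.0, 2.0, 3.0), (2.0, 0.5, 1.5), (4.0, 4.0, 0.0)]
print(satisfies_metric_conditions(euclidean, profiles))
```

A function that fails one of the conditions is rejected by the same check. For example, the squared Euclidean distance on the one-dimensional points 0, 1 and 2 violates the triangle inequality, because 4 > 1 + 1.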
Three distance measures are widely used in microarray data clustering: Euclidean distance, Manhattan distance and correlation distance.
The choice of distance measure depends on the application and the type of analysis. More precisely, it depends on the kind of similarity you would like to identify.
Correlation distance is a measure of the general trends and relative differences between the data points. Euclidean and Manhattan distances measure absolute differences between the points. Of these two, Manhattan distance is more robust against outliers.
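The difference between absolute and trend-based measures can be seen on a small example. This is a minimal sketch, not the author's implementation: it assumes that correlation distance is defined as one minus the Pearson correlation coefficient (one common convention), and the two gene profiles are made up for illustration.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_distance(x, y):
    """Assumed definition: 1 - r, near 0 for identical trends, near 2 for opposite ones."""
    return 1.0 - pearson(x, y)

# Hypothetical profiles: same trend, but gene_b has ten times the amplitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print("Euclidean:  ", euclidean(gene_a, gene_b))
print("Manhattan:  ", manhattan(gene_a, gene_b))
print("Correlation:", correlation_distance(gene_a, gene_b))
```

The two profiles are far apart in absolute terms (Manhattan distance 90) but have a correlation distance of approximately zero, because they rise and fall together.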
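The expression values can be standardized by subtracting the mean of each expression profile and dividing by its standard deviation. After standardization, Euclidean and correlation distances give equivalent results, because the squared Euclidean distance between two standardized profiles is proportional to one minus their correlation coefficient.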
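The effect of standardization can be verified numerically. The sketch below assumes that each profile is standardized with its own mean and population standard deviation, and that correlation distance means 1 - r; under those assumptions the squared Euclidean distance between two standardized profiles of length n equals 2n(1 - r). The profiles and function names are hypothetical.

```python
import math

def standardize(profile):
    """Subtract the profile mean and divide by its population standard deviation."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    return [(v - mean) / sd for v in profile]

def squared_euclidean(x, y):
    """Sum of squared differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical expression profiles of two genes across four samples.
gene_a = [2.0, 4.0, 5.0, 9.0]
gene_b = [1.0, 3.0, 2.0, 6.0]
n = len(gene_a)

lhs = squared_euclidean(standardize(gene_a), standardize(gene_b))
rhs = 2 * n * (1 - pearson(gene_a, gene_b))
print(lhs, rhs)  # the two values agree up to rounding error
```

Because the relationship is monotonic, both measures rank pairs of standardized profiles in the same order, which is the sense in which they are equivalent for clustering.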
Inter cluster distances
The single linkage method measures the distance between the closest neighbors of two clusters. The distance from each data point of one cluster to each data point of the other cluster is calculated, and the minimum of these is taken.
The complete linkage method does the reverse: it takes the distance between the farthest neighbors of the two clusters, that is, the maximum of the pairwise distances.
The centroid linkage method calculates the distance between two clusters as the squared Euclidean distance between the means (centroids) of the clusters. This method is more robust to outliers.
The average linkage method measures the average distance between the members of the two clusters. The distance from each member of one cluster to every member of the other cluster is calculated, and the average value is taken.
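The four inter-cluster distances described above can be written down directly. This is a minimal sketch using Euclidean distance between points; the function names and the two small clusters are assumptions chosen for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Mean of all pairwise distances between the two clusters."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]

def centroid_linkage(c1, c2):
    """Squared Euclidean distance between the cluster centroids."""
    return euclidean(centroid(c1), centroid(c2)) ** 2

# Two hypothetical clusters of two-dimensional points.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, fn(cluster_1, cluster_2))
```

For these clusters, the closest pair is (0, 0) and (3, 0), so the single linkage distance is 3.0, while the centroids (0, 1) and (4, 0) give a centroid linkage value of about 17.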
The choice of linkage method affects clustering efficiency and complexity. Single and complete linkage require fewer calculations. However, accuracy is very low with the single linkage method. By comparison, centroid and average linkage produce more accurate clusters but require more complex computations. Average linkage and complete linkage are more widely used in microarray data analysis.
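One common place where a linkage method is plugged in is agglomerative (bottom-up) hierarchical clustering, which the article does not describe step by step. The sketch below is a naive illustration under that assumption, not an efficient or authoritative implementation: it starts with every point in its own cluster and repeatedly merges the two clusters with the smallest average linkage distance. The function names and data points are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2):
    """Mean of all pairwise distances between the two clusters."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters, linkage=average_linkage):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical two-dimensional profiles forming a low group and a high group.
profiles = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (0.9, 1.1)]
result = agglomerate(profiles, n_clusters=2)
for cluster in result:
    print(cluster)
```

Passing a different function as `linkage` (for example a single or complete linkage function with the same signature) changes which clusters are merged at each step, which is how the choice of linkage method shapes the final grouping.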
Copyright © 2010 biotecharticles.com - Do not copy articles from this website.