Why and How of Normalization in Microarray Data Analysis

Article Summary:

Normalization is the process used in microarray data analysis to correct the measurement errors and bias introduced during data acquisition. Microarray data are usually handled in log-transformed form, since the log-transformed values approximately follow a Gaussian distribution. The process therefore contributes more towards error correction...

In microarray work, the term normalization refers to the first step of data analysis. It should not be confused with normalization in the statistical sense, where the purpose is to transform data so that it follows a normal (Gaussian) distribution. Log-transformed microarray data already approximate a Gaussian distribution, so no modification is needed for that purpose.

Normalization of microarray data aims to correct systematic measurement errors and bias in the observed data. These may be introduced by several factors, such as differences in probe labeling, the concentration of the target DNA/RNA sequence, the efficiency of hybridization, and instrumental noise from scanners or printers. Normalization allows data to be compared against a common reference.

The raw measurement read from the scanner consists of the true signal plus an error component. The error in turn has two parts: bias and variance.

Variance is often spread fairly evenly across the data. Its sources include instrument-related noise and biological variation, such as differences between tissue samples or strains of mice. Bias is the tendency of the experimental system to err consistently in one direction; for example, the Cy5 and Cy3 dyes differ in their binding behavior. The effect of variance can be reduced through experimental replicates, both technical and biological, and accounted for with appropriate statistical tests. The effect of bias is what normalization is meant to remove.

Normalization is usually applied to the log ratio of the two dye intensities, R (red channel) and G (green channel). The log ratio is M = log2(R/G) = log2(R) - log2(G).
The average log intensity at each spot is calculated as
A = 1/2 log2(RG) = 1/2 (log2(R) + log2(G))
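
The two quantities M and A defined above can be computed per spot as in the following minimal sketch. The function name ma_values and the example intensities are illustrative assumptions, and background correction is assumed to have been done already.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green channel intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # log ratio: M = log2(R/G)
    a = 0.5 * (log_r + log_g)  # average log intensity: A = 1/2 log2(RG)
    return m, a

# Hypothetical spot: red channel four times brighter than green.
m, a = ma_values(1024.0, 256.0)
print(m, a)
```

Working with log2 values makes a two-fold increase (M = 1) and a two-fold decrease (M = -1) symmetric around zero, which is what the later normalization steps rely on.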

Normalization methods can be classified as linear or nonlinear. Linear normalization is applied either to a selected set of genes or globally to all genes. It suits data of consistent quality, and constructing a database with this approach is simple. However, it assumes that the error factor is uniform across all genes and that the mRNA content is the same in the cells being compared.

Nonlinear normalization is more precise for data at extreme values, but requires a reference gene set. Its drawback is that it can yield too many false positives and inflate the apparent statistical power even when the data are statistically poor. Both approaches aim to bring each image in the microarray data to the same average brightness, using either simple or more complex statistics.
The simplest normalization procedure divides the expression ratios by their mean value. The mean can be taken over a single microarray slide as a whole, or the slide can be subdivided into sectors that are treated as sub-matrices. With this method, low-intensity spots show much higher variability than brighter ones, because their signal carries a larger share of background and machine variation. The smaller the absolute amount of RNA available, the greater this effect. Dividing by the mean also removes the average level from the data, so any overall shift in expression is excluded from the analysis.
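
As a rough illustration of dividing expression ratios by the mean, either over a whole slide or sector by sector, a sketch might look like this. The function names, the sector labels and the sample ratios are assumptions made for the example, not part of any particular package.

```python
from statistics import mean

def normalize_by_mean(ratios):
    """Divide each expression ratio (R/G) by the mean ratio of the slide."""
    centre = mean(ratios)
    if centre <= 0:
        raise ValueError("mean ratio must be positive")
    return [r / centre for r in ratios]

def normalize_by_sector(ratios, sectors):
    """Same idea, but each spot is divided by the mean of its own sector."""
    groups = {}
    for r, s in zip(ratios, sectors):
        groups.setdefault(s, []).append(r)
    sector_mean = {s: mean(values) for s, values in groups.items()}
    return [r / sector_mean[s] for r, s in zip(ratios, sectors)]

# Hypothetical ratios for four spots on one slide.
whole_slide = normalize_by_mean([0.5, 1.0, 2.0, 4.5])
# The same kind of data split into two sectors (labels 0 and 1).
by_sector = normalize_by_sector([1.0, 3.0, 2.0, 6.0], [0, 0, 1, 1])
print(whole_slide, by_sector)
```

After either call the normalized ratios average to 1 within the unit used (slide or sector), which is the sense in which the images are brought to the same average brightness.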

In the simplest model, with a replicate pair of slides, the M value for each spot is the average over the two slides. In a dye-swap experiment, the M values of one slide must first be multiplied by -1; the average then gives the fitted M value on which normalization is performed.
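
The averaging of two replicate slides, with the sign flip for a dye swap, can be sketched as follows. The function name average_replicates and the sample M values are hypothetical; the spots are assumed to be listed in the same order on both slides.

```python
def average_replicates(m_slide1, m_slide2, dye_swap=False):
    """Average per-spot M values over two slides.

    With dye_swap=True the second slide's M values are multiplied by -1
    before averaging, because its dyes were assigned the other way round.
    """
    if len(m_slide1) != len(m_slide2):
        raise ValueError("both slides must contain the same spots")
    sign = -1.0 if dye_swap else 1.0
    return [(m1 + sign * m2) / 2.0 for m1, m2 in zip(m_slide1, m_slide2)]

# Hypothetical M values for two spots measured on a dye-swap pair.
fitted_m = average_replicates([1.0, -2.0], [-0.5, 1.0], dye_swap=True)
print(fitted_m)
```

In a dye-swap pair a real expression change appears with opposite signs on the two slides, while a dye-specific bias keeps the same sign, so the flipped average tends to keep the former and cancel the latter.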

Linear models can perform normalization within a slide, between paired slides, and between two slides on which the same type of hybridization is used. They can be based on all genes, on housekeeping genes alone, or on control genes.

Setting the median of the M values to zero is another type of normalization; it assumes that expression changes are symmetric across all genes on a slide. Locally weighted regression (loess) methods are more sophisticated. There are two types: print-tip group loess and global loess.
In print-tip group normalization, each M value is normalized by subtracting the locally fitted average for its print-tip group. The local regressions are linear and use a span of 0.4.
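
A simplified sketch of this idea is given below: within each print-tip group a local linear regression of M on A is fitted with a span of 0.4, and the fitted value is subtracted from M. The function names and the toy data are assumptions, and the fit is a plain tricube-weighted loess without the robustness iterations that production implementations normally add.

```python
import math

def loess_fit(x, y, span=0.4):
    """Fit a local linear regression at every x and return the fitted values."""
    n = len(x)
    k = min(n, max(2, math.ceil(span * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12  # bandwidth: k-th smallest distance
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1.0 - min(1.0, d / h) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def print_tip_loess(m, a, tip_group, span=0.4):
    """Subtract a separate loess curve of M on A within each print-tip group."""
    groups = {}
    for i, g in enumerate(tip_group):
        groups.setdefault(g, []).append(i)
    normalized = list(m)
    for idx in groups.values():
        fit = loess_fit([a[i] for i in idx], [m[i] for i in idx], span)
        for i, f in zip(idx, fit):
            normalized[i] = m[i] - f
    return normalized

# Toy data: two print-tip groups, each with its own intensity-dependent trend.
a = [6.0 + 0.5 * i for i in range(10)] * 2
tips = [0] * 10 + [1] * 10
m = [0.5 * ai - 3.0 for ai in a[:10]] + [2.0 - 0.25 * ai for ai in a[10:]]
normalized = print_tip_loess(m, a, tips)
```

Passing a single group label for every spot turns the same function into a global loess, since one curve is then fitted to the whole slide.
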
Global loess normalization does not account for differences between subarrays, and is therefore useful when spatial variation in the M-A plots is negligible. The process can also incorporate weighting by spot quality and scale normalization. More elaborate procedures, such as those based on the median absolute deviation, are used in special cases. The choice of normalization method depends on the purpose of the research and the quality of the data.
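
One way the median absolute deviation (MAD) can enter is as the spread estimate in scale normalization between slides. The sketch below rescales the M values of each slide so that all slides share the same MAD, here the geometric mean of the original MADs. That target, the function names and the sample values are assumptions made for illustration, not a prescription from the text above.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation of a list of M values."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def scale_normalize(slides):
    """Rescale each slide's M values so that every slide has the same MAD."""
    mads = [mad(m) for m in slides]
    if any(s == 0 for s in mads):
        raise ValueError("a slide with zero MAD cannot be rescaled")
    # Geometric mean of the MADs, used as the common target spread (assumption).
    target = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[v * target / s for v in m] for m, s in zip(slides, mads)]

# Hypothetical M values: the second slide is four times more spread out.
slides = [[-2.0, -1.0, 0.0, 1.0, 2.0], [-8.0, -4.0, 0.0, 4.0, 8.0]]
scaled = scale_normalize(slides)
print([mad(m) for m in scaled])
```

The MAD is used here instead of the standard deviation because it is less affected by the few spots with very large M values, which are often the differentially expressed genes of interest.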
