Microarray data have been used extensively in genomic studies. However, the volume of data is large, which makes it difficult to settle on a common reference design and analysis procedure, unlike for conventional biological experimental data.

The steps used in microarray data analysis are summarized below.

1. Generation of data

Microarray data form a matrix of expression values derived from the hybridization of cDNA probes with the target. Each value in the matrix measures hybridization as the intensity of the fluorescence emitted by the Cy5 and Cy3 dyes. Besides the intensity of hybridization, the emitted light is affected by the overall intensity of the light used in scanning, the dye effect, and the background emission.

Scanning of the hybridized arrays yields the expression values. In most cases, the scanner software performs these calculations itself and provides the normalized log ratios along with the unprocessed data.

Scanning involves three steps:
1. Gridding separates the microarray spots by assigning image coordinates to each spot.
2. Segmentation separates the foreground and background pixels in a microarray spot.
3. Intensity extraction calculates the average foreground and background intensities for each individual spot of the array.

After gridding, each microarray spot is marked as a circle. The target median is the median value of all the pixels inside the circle. A square is then marked around the circle, and the median of the pixels that lie outside the circle but inside the square gives the background median. The area is defined as the number of pixels inside the circle whose intensity is above that of the background pixels outside the circle but inside the square.
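
The extraction described above can be sketched as follows. This is only a minimal illustration, not the method of any particular scanner: the function name `spot_statistics`, the list-of-lists pixel patch, and the reading of "area" as the count of circle pixels brighter than the background median are all assumptions made for the example.

```python
from statistics import median

def spot_statistics(patch, center, radius):
    """Return (target_median, background_median, area) for one spot.

    patch  : square 2-D list of pixel intensities (the box around the spot)
    center : (row, col) of the circle centre, in patch coordinates
    radius : circle radius in pixels
    """
    center_row, center_col = center
    inside, outside = [], []
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            # Pixels within the circle are foreground; the rest of the
            # square box is treated as local background.
            if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2:
                inside.append(value)
            else:
                outside.append(value)
    target_median = median(inside)
    background_median = median(outside)
    # Assumption: "area" counts circle pixels above the background median.
    area = sum(1 for value in inside if value > background_median)
    return target_median, background_median, area

# Hypothetical 5x5 patch: a bright 3x3 spot on a dim background.
patch = [
    [10, 12, 11, 10, 12],
    [11, 50, 60, 55, 10],
    [12, 58, 90, 62, 11],
    [10, 52, 61, 57, 12],
    [11, 10, 12, 11, 10],
]
target, background, area = spot_statistics(patch, center=(2, 2), radius=1.5)
print(target, background, area)
```

Using medians rather than means keeps a single very bright pixel, such as the value 90 in the patch above, from dominating the estimate for the spot.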

From these values, the integrated intensity is calculated as
Integrated intensity = (Target median - Background median) * Area

The intensities of Cy3 and Cy5 are calculated in this way, and the log2 ratio of the two is taken for further analysis.
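
As a small worked sketch of the formula above, the following computes the integrated intensity for each channel and then the log2 ratio. The input numbers are made up for illustration, and the direction of the ratio (Cy5 over Cy3) is an assumption, since the text does not state which channel is the numerator.

```python
import math

def integrated_intensity(target_median, background_median, area):
    """Integrated intensity = (target median - background median) * area."""
    return (target_median - background_median) * area

def log2_ratio(cy5, cy3):
    """log2(Cy5 / Cy3); returns None when either intensity is not positive."""
    if cy5 <= 0 or cy3 <= 0:
        # A spot at or below background has no defined log ratio.
        return None
    return math.log2(cy5 / cy3)

# Hypothetical per-channel values for a single spot.
cy5 = integrated_intensity(target_median=58, background_median=11, area=9)
cy3 = integrated_intensity(target_median=35, background_median=11.5, area=9)
ratio = log2_ratio(cy5, cy3)
print(cy5, cy3, ratio)
```

Returning `None` for non-positive intensities is one possible design choice; in practice such spots are often flagged or filtered before further analysis.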

2. Data preprocessing

In its logarithmic form, microarray data approximately follow a Gaussian (normal) distribution. Preprocessing is therefore aimed essentially at eliminating experimental bias and errors. The MA plot is used to inspect the data and to decide on the normalization procedure.
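
The text does not define the MA plot, so the sketch below assumes the usual two-colour convention: M is the log2 ratio of the two channels and A is the average log2 intensity. The function name `ma_values` and the sample intensities are hypothetical.

```python
import math

def ma_values(cy5, cy3):
    """Return the M and A coordinates for paired channel intensities.

    M = log2(Cy5 / Cy3)          (log ratio, plotted on the y axis)
    A = 0.5 * log2(Cy5 * Cy3)    (mean log intensity, plotted on the x axis)
    """
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]
    return m, a

# Hypothetical background-corrected intensities for three spots.
cy5 = [400.0, 1600.0, 250.0]
cy3 = [100.0, 1600.0, 1000.0]
m, a = ma_values(cy5, cy3)
for m_value, a_value in zip(m, a):
    print(round(a_value, 3), round(m_value, 3))
```

Plotting M against A shows whether the log ratios drift away from zero at low or high intensities, which is the kind of pattern that would suggest an intensity-dependent normalization.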

Scaling, in which the data are scaled up, is also employed in some cases; it is suitable for low expression values. The major goals of preprocessing are:
1. To separate changes in gene expression due to biologically relevant variation from the total variation.
2. To remove the effects of technical artifacts due to DNA/oligonucleotide deposits on the slide.
3. To remove errors due to equipment, such as the scanners used in quantifying expression.
4. To remove variations in RNA quality due to extraction procedures.
5. To remove variations due to washing of the slides after hybridization.
6. To remove errors in signal reading due to calibration errors and the type of scanner used for measurement.
The normalization steps used in microarray preprocessing generally assume that most genes do not change in expression in an experiment, so that only a small fraction of genes differ between the samples.

3. Design of the experiment

Ideally this should be done before the collection of data. However, due to the increasing availability of free databases of microarray data, acquisition of data has become easier and analysis procedures usually start from these databases.
The experimental design can be a simple control vs. experiment design or a more complicated one.

a. Between subject designs:
These designs have two groups, control and experiment, with 'n' subjects in each group, and are analyzed with simple statistical tests such as the t-test.
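
For one gene, a between subject comparison can be sketched with a two-sample t statistic. The text only says "t-test", so the pooled-variance (Student's) form used here is an assumption, and the expression values are invented for the example. Only the statistic and degrees of freedom are computed; a p-value would come from the t distribution.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(control, experiment):
    """Pooled-variance (Student's) t statistic for two independent groups.

    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(control), len(experiment)
    dof = n1 + n2 - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(control)
                  + (n2 - 1) * variance(experiment)) / dof
    t = (mean(experiment) - mean(control)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, dof

# Hypothetical log2 expression values of one gene in each group.
control = [1.0, 2.0, 3.0]
experiment = [3.0, 4.0, 5.0]
t_stat, dof = two_sample_t(control, experiment)
print(t_stat, dof)
```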

b. Within subject designs:
This design takes into account the intrasubject variations within the groups. An example is a design that compares data from the same patients before and after drug intake.
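
For the before/after example, one common choice is a paired t statistic computed on the per-subject differences. The text does not name a test for this design, so this is only an illustrative sketch, with hypothetical values and a hypothetical function name `paired_t`.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic on per-subject differences (after - before).

    Returns (t, degrees_of_freedom).
    """
    if len(before) != len(after):
        raise ValueError("each subject needs one 'before' and one 'after' value")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # The standard error uses the spread of the differences, so
    # subject-to-subject baseline variation cancels out.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical log2 expression of one gene in four patients.
before = [2.0, 3.0, 4.0, 5.0]
after = [3.0, 5.0, 5.0, 7.0]
t_stat, dof = paired_t(before, after)
print(t_stat, dof)
```

Because each subject serves as its own control, the large baseline differences between the patients above do not inflate the error term as they would in a between subject test.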

c. Factorial designs:
This is similar to the between and/or within subject designs, but also considers factors such as age, gender, or any other qualitative factor that can be used to group the data in the control and experiment groups.
The within subject factorial designs use data from subjects in different groups, categorized by the above-mentioned factors, along with another continuous factor, time.
Mixed factorial designs combine the above, with a choice of the factors under consideration.

d. Reference designs:
These designs require a reference sample. There can be one or many experiment groups, and the comparison can be either one-way or two-way.

e. Balanced designs / designs without reference:
Here there is no specific reference, and the comparison is between the groups. The number of relationships to be tested depends on the number of groups, and the analysis is usually more complicated. Such designs are useful where the genes are expressed at low levels or show little variation, so that a meaningful comparison with a reference is not possible.
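
To illustrate how the number of relationships grows with the number of groups, the sketch below enumerates every pairwise comparison. It assumes that all pairs of groups are compared, which the text does not state explicitly; the group names are placeholders.

```python
from itertools import combinations

def pairwise_comparisons(groups):
    """List every unordered pair of groups to be compared directly."""
    return list(combinations(groups, 2))

# Hypothetical groups in a design without a reference sample.
groups = ["A", "B", "C", "D"]
pairs = pairwise_comparisons(groups)
for first, second in pairs:
    print(first, "vs", second)
print(len(pairs), "comparisons")
```

With k groups there are k * (k - 1) / 2 such pairs, compared with only k comparisons against a single common reference.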

The choice of the statistical test used to establish significance, however, depends largely on the nature of the data and the experimental design. The factors that decide this choice, and the statistics used in microarray data analysis, are detailed in the next section.
