Inferential statistics

Inferential statistics use parametric or nonparametric tests to assign statistical significance to the observed gene expression values. This type of analysis is used for research designs that involve formulating a hypothesis, and it is well suited to finding up-regulated genes. The analysis is guided by the hypothesis under test.

• The expression ratios of control and experiment data are calculated and normalized.

• Scaling is applied for data having low expression values.

• Set the significance threshold for the p value. With a threshold of 0.05, about 500 of the 10000 genes under analysis are expected to show apparently significant variation by random chance alone, even if none of them is truly differentially expressed.

• The goal is to establish the differential variation of the targeted gene in control vs. experiment data.

• The null hypothesis assumes no difference between the control and experimental data, and the alternative hypothesis assumes a difference. If the p value exceeds the significance level of 0.05, the null hypothesis is not rejected (it is retained, not proven).

• A test statistic is computed from the data to obtain the p value, on the basis of which the null hypothesis is either rejected or retained.

• The simplest inferential test is the t-test, which derives the p value from the mean expression values and standard deviations of the two groups.

• A factor which accounts for the noise is also incorporated into the test for better precision.
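The steps above can be sketched for a single gene in plain Python. This is a minimal illustration, not the procedure of any particular microarray package: the expression values are made up, the function names are hypothetical, and the p value is estimated by shuffling the group labels (a permutation test) rather than by looking up the t distribution, which the standard library does not provide.

```python
import math
import random
import statistics

def expected_false_positives(n_genes, alpha):
    """Genes expected to pass the threshold by chance if none truly differs."""
    return n_genes * alpha

def welch_t(control, experiment):
    """Welch's t statistic from the group means and sample variances."""
    m1, m2 = statistics.mean(control), statistics.mean(experiment)
    v1, v2 = statistics.variance(control), statistics.variance(experiment)
    se = math.sqrt(v1 / len(control) + v2 / len(experiment))
    return (m2 - m1) / se

def permutation_p_value(control, experiment, n_perm=2000, seed=0):
    """Two-sided p value: share of label shufflings with |t| >= observed |t|."""
    rng = random.Random(seed)
    observed = abs(welch_t(control, experiment))
    pooled = list(control) + list(experiment)
    n1 = len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up log2 expression values for one gene (assumption, for illustration only).
control = [7.1, 6.9, 7.3, 7.0, 7.2]
experiment = [8.4, 8.9, 8.6, 8.8, 8.5]

chance_hits = expected_false_positives(10000, 0.05)
t_stat = welch_t(control, experiment)
p_value = permutation_p_value(control, experiment)
reject_null = p_value < 0.05

print(f"expected chance hits: {chance_hits:.0f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

In a real analysis the same test would be repeated for every gene, which is why the number of chance hits at a 0.05 threshold matters.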

**Types of inferential statistical tests**

Inferential tests are either parametric or nonparametric, depending on the nature of the data. Parametric tests are employed for data that follow the normal (Gaussian) distribution. Nonparametric tests are employed when the data cannot be assumed to follow it. In practice, this distinction often depends less on the data than on the choice of the investigator, since many test statistics approach a normal distribution as the sample size becomes very large.

The following general rules guide the choice of statistical test. In each case the parametric test is named first and its nonparametric counterpart second.

1. To compare one group to a reference value: one-sample t test and Wilcoxon signed-rank test.

2. To compare two paired groups: paired t test and Wilcoxon signed-rank test.

3. To compare two unpaired groups: unpaired t test and Mann-Whitney test.

4. To compare three or more unrelated groups: one-way ANOVA and Kruskal-Wallis test.

5. To compare three or more related groups: repeated-measures ANOVA and Friedman test.
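The five rules above amount to a lookup keyed by the study design and by whether the data are treated as parametric. The sketch below encodes them as a dictionary. The design labels and the `choose_test` helper are hypothetical names introduced here for illustration.

```python
# (design, parametric?) -> name of the test suggested by the rules above.
# The design labels are illustrative assumptions, not standard terminology.
TEST_TABLE = {
    ("one_vs_reference", True): "one-sample t test",
    ("one_vs_reference", False): "Wilcoxon signed-rank test",
    ("two_paired", True): "paired t test",
    ("two_paired", False): "Wilcoxon signed-rank test",
    ("two_unpaired", True): "unpaired t test",
    ("two_unpaired", False): "Mann-Whitney test",
    ("three_plus_unrelated", True): "one-way ANOVA",
    ("three_plus_unrelated", False): "Kruskal-Wallis test",
    ("three_plus_related", True): "repeated-measures ANOVA",
    ("three_plus_related", False): "Friedman test",
}

def choose_test(design, parametric):
    """Return the test name for a study design and data type."""
    return TEST_TABLE[(design, parametric)]

# Example: control vs. treated arrays from different samples, normality in doubt.
print(choose_test("two_unpaired", parametric=False))
```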

**Descriptive statistical tests**

This is an exploratory approach in which expression profiles are compared and visualized to find the extent of their similarity. The similarity between genes is expressed as a distance metric, which can be one of:

• Euclidean distance

• Pearson correlation coefficient or

• Manhattan distance.

The choice of the distance to be measured depends on the area of application and the type of similarity between the genes you would like to find.

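Each gene's expression profile is treated as a vector, that is, a point in a multidimensional space. Euclidean distance is the square root of the sum of squared differences between the components of two vectors, and Manhattan distance is the sum of their absolute differences. Manhattan distance is more robust to the presence of outliers. Standardization can be applied before any of the three measures. After each profile is standardized, Euclidean and correlation-based distances become equivalent, because the squared Euclidean distance is then proportional to one minus the Pearson correlation.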
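A small sketch of the three measures, using only the Python standard library, also makes the relation between Euclidean and correlation distance concrete. The two gene profiles are made-up values, and the function names are chosen here for illustration. When each profile is standardized to mean 0 and population standard deviation 1, the squared Euclidean distance between profiles of length n equals 2n(1 - r), where r is the Pearson correlation.

```python
import math
import statistics

def euclidean(x, y):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def standardize(x):
    """Rescale a profile to mean 0 and population standard deviation 1."""
    m, s = statistics.fmean(x), statistics.pstdev(x)
    return [(a - m) / s for a in x]

# Made-up expression profiles across four conditions (assumption).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.5, 8.0]

r = pearson(gene_a, gene_b)
za, zb = standardize(gene_a), standardize(gene_b)
squared_dist = euclidean(za, zb) ** 2
from_correlation = 2 * len(gene_a) * (1 - r)

print(f"Euclidean: {euclidean(gene_a, gene_b):.3f}")
print(f"Manhattan: {manhattan(gene_a, gene_b):.3f}")
print(f"Pearson r: {r:.4f}")
print(f"standardized squared distance: {squared_dist:.6f}")
print(f"2n(1 - r):                     {from_correlation:.6f}")
```

The two profiles differ greatly in scale, so their raw Euclidean distance is large, yet they are almost perfectly correlated. Which behavior is wanted depends on the kind of similarity being sought.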

Alternatively, statistical algorithms can be used to find the similarity and to group similar objects or data. Descriptive methods are either supervised (classification) or unsupervised (clustering). Unsupervised methods include hierarchical clustering, K-means clustering, and Self Organizing Maps.

• Hierarchical clustering links genes based on their similarity and builds a tree (dendrogram), much like a pedigree chart, in which the target gene can be located.

• Self Organizing Maps (SOM) typically start from randomly initialized reference vectors and use a measure of similarity, such as correlation, to split the genes into subgroups. This is an example of machine learning, in which statistical programs are employed to cluster similar genes.

• Principal Component Analysis (PCA) is another method, in which every gene is considered as a dimension (vector). The aim is to find a small number of new dimensions, the principal components, the first of which best represents the variation in the data.
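To illustrate how hierarchical clustering builds its tree, the sketch below repeatedly merges the two closest clusters until one remains. It is a minimal version under stated assumptions: Euclidean distance, single linkage (the distance between two clusters is that of their closest members), made-up profiles, and hypothetical gene names. Real tools offer other linkage rules and far faster implementations.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]  # start with one cluster per gene
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up profiles: geneA/geneB rise across conditions, geneC/geneD fall.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 1.0, 0.5],
    "geneD": [5.2, 0.9, 0.4],
}

merges = single_linkage(profiles)
for left, right, dist in merges:
    print(f"merge {left} + {right} at distance {dist:.3f}")
```

The order of the merges and the distances at which they occur are exactly what a dendrogram draws: early, short-distance merges join the most similar genes.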

Supervised methods include use of linear discriminants, decision trees and Support Vector Machines to classify similar genes into groups for analysis.

All these methods are based on the principles of homogeneity and separation: items that are similar are grouped together, and items that are dissimilar are placed in separate clusters. Although these descriptive methods remove the need for a common reference, their results do not carry the statistical significance that inferential tests provide. The choice of method, however, depends on the purpose and the nature of the data.

**About Author / Additional Info:** Part 1: http://www.biotecharticles.com/Bioinformatics-Article/Steps-in-Microarray-Data-Analysis-Part-I-558.html