Gene and genome duplication events are regarded as driving forces in evolution, together with other processes such as genetic drift and mutation. However, scientists face a bottleneck in deciphering their importance, because genetic diversity and genome duplication events show no apparent correlation.

There are several approaches to studying the pattern of these duplication events. The most promising, however, is the use of bioinformatics tools.

Detecting gene duplication involves finding orthologs and paralogs and constructing gene trees and species trees. For genome duplication, evidence needs to be analyzed both within a species and across different species. Detecting accelerated divergence and measuring positive selection help to characterize how duplicate genes evolve.

Determining orthologs and paralogs

Orthologous genes arise by speciation and typically retain the same function. Paralogs, on the other hand, arise by duplication. After a gene duplication event, these genes often acquire new functions.

Comparative genomic approaches depend on identifying orthology from conserved sequences. Evolutionary genomics makes use of complete maps of the whole genome and compares sequences, including those that have undergone duplication.

Methods of determining orthologs

1. Pairwise sequence comparison

This method identifies regions of similarity between sequences. The similarity may indicate a functional, structural or evolutionary relationship between them. As the name indicates, pairwise comparison works on two sequences at a time, here taken from different species, and can be used to identify orthologs. Multiple sequence alignment (MSA), by contrast, is used to infer evolutionary relationships among three or more sequences and is useful for finding homologous sequences. Bidirectional BLAST and FASTA take a heuristic approach, whereas the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms use dynamic programming to find an optimal alignment. These methods focus on one-to-one comparison and identification of orthologs. However, BLAST is less effective at detecting homology between distantly related species.
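The idea of pairwise comparison followed by a bidirectional best-hit test can be illustrated with a minimal sketch. It is not how BLAST works internally: it scores every pair with a plain Needleman-Wunsch global alignment and then keeps the pairs that are each other's best hit. The function names, the scoring values (match 1, mismatch -1, gap -2) and the toy sequences are all assumptions chosen for illustration.

```python
from itertools import product


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b by dynamic programming."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def reciprocal_best_hits(species_a, species_b, score=needleman_wunsch_score):
    """Return (gene_a, gene_b) pairs that are each other's best-scoring hit."""
    scores = {
        (ga, gb): score(sa, sb)
        for (ga, sa), (gb, sb) in product(species_a.items(), species_b.items())
    }
    best_for_a = {ga: max(species_b, key=lambda gb: scores[(ga, gb)]) for ga in species_a}
    best_for_b = {gb: max(species_a, key=lambda ga: scores[(ga, gb)]) for gb in species_b}
    return sorted((ga, gb) for ga, gb in best_for_a.items() if best_for_b[gb] == ga)


# Hypothetical toy sequences; real analyses use protein or full-length gene sequences.
species_a = {"a1": "ACGTACGT", "a2": "TTGGCCAA"}
species_b = {"b1": "ACGTACGA", "b2": "TTGGCCTA", "b3": "ACGAACGA"}

pairs = reciprocal_best_hits(species_a, species_b)
print(pairs)
```

In this toy data, b3 resembles a1 but is not a1's best hit, so it is left out of the candidate ortholog pairs. This is the kind of extra copy a paralog would produce.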

2. Hit clustering methods

Hit clustering methods detect clusters of similar sequences either from pairwise hits or from cluster graphs. Methods that combine algorithms, such as the Recursive and the Markov Cluster (MCL) algorithms, have been developed for better detection. Such clustering is not ideal for analyzing large data sets from the same genome, since the resulting clusters lack functional coherence.
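As a rough illustration of clustering from pairwise hits, the sketch below groups sequences that are linked by hits above a score cutoff. This is simple single-linkage clustering with a union-find structure, not the MCL algorithm, which is considerably more involved. The function name, the cutoff of 50 and the hit list are hypothetical.

```python
def cluster_hits(hits, min_score=50.0):
    """Group sequence IDs into clusters linked by hits scoring >= min_score."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if it is new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, score in hits:
        root_a, root_b = find(a), find(b)  # registers both IDs, even for weak hits
        if score >= min_score:
            parent[root_a] = root_b

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical pairwise hits: (sequence 1, sequence 2, similarity score).
hits = [
    ("g1", "g2", 120.0),
    ("g2", "g3", 95.0),
    ("g3", "g4", 20.0),
    ("g4", "g5", 80.0),
    ("g6", "g1", 10.0),
]

clusters = cluster_hits(hits)
print(clusters)
```

Single linkage merges clusters through any one strong hit, so a single spurious hit can join unrelated families. That is one reason graph-based methods such as MCL are preferred in practice.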

3. Synteny

Synteny refers to the presence of two or more genes on the same chromosome in a species. Determining orthologous sequences using synteny relies on conserved sequences: regions of conserved synteny share at least one ortholog among the different species. However, the number of orthologs found this way does not provide accurate data on genomic distances, since the total number of chromosomes varies among species.
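A minimal sketch of a conserved-synteny check follows, assuming the chromosome of each gene and a list of candidate ortholog pairs are already known. It reports chromosome pairs that share at least two orthologous genes. The function name, the `min_genes` threshold and the gene and chromosome labels are hypothetical, and gene order within a chromosome is ignored here.

```python
from collections import defaultdict


def conserved_synteny(location_a, location_b, orthologs, min_genes=2):
    """Map (chromosome in A, chromosome in B) to the ortholog pairs they share."""
    groups = defaultdict(list)
    for gene_a, gene_b in orthologs:
        # Skip pairs whose chromosomal location is unknown in either species.
        if gene_a in location_a and gene_b in location_b:
            groups[(location_a[gene_a], location_b[gene_b])].append((gene_a, gene_b))
    return {
        chroms: sorted(pairs)
        for chroms, pairs in groups.items()
        if len(pairs) >= min_genes
    }


# Hypothetical gene locations in two species and candidate ortholog pairs.
location_a = {"a1": "chr1", "a2": "chr1", "a3": "chr2", "a4": "chr2"}
location_b = {"b1": "chrX", "b2": "chrX", "b3": "chrX", "b4": "chrY"}
orthologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

blocks = conserved_synteny(location_a, location_b, orthologs)
print(blocks)
```

Only chr1 and chrX share two orthologous genes in this toy data. The genes on chr2 are split between chrX and chrY in the second species, so they do not count as a conserved group.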


4. Phylogenetic approaches

Phylogenetic approaches construct phylogenetic trees based on sequence similarity. The tree can also be built from a distance matrix between sequences, restriction data or allele data. Orthologous sequences cluster close to each other in the tree. The approach is best suited to specific gene families; a genome-wide analysis can give inaccurate results because of convergent evolution.
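One common way to read orthologs and paralogs off a gene tree is the species-overlap rule, sketched below under simplifying assumptions. An internal node is treated as a duplication if the same species appears on both sides of it, and as a speciation otherwise. The tree is given as nested two-element tuples, and leaf names are assumed to follow a hypothetical "species_gene" pattern. Tree construction itself is not shown.

```python
from itertools import product


def species_of(leaf):
    """Species label of a leaf named 'species_gene' (naming is an assumption)."""
    return leaf.split("_", 1)[0]


def leaves(node):
    """All leaf names below a node; a leaf is a string, an internal node a pair."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, orthologs=None, paralogs=None):
    """Split gene pairs into orthologs and paralogs using the species-overlap rule."""
    if orthologs is None:
        orthologs, paralogs = [], []
    if isinstance(node, str):
        return orthologs, paralogs
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    overlap = {species_of(x) for x in left_leaves} & {species_of(x) for x in right_leaves}
    # Shared species on both sides implies a duplication node, otherwise speciation.
    target = paralogs if overlap else orthologs
    target.extend(tuple(sorted(p)) for p in product(left_leaves, right_leaves))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


# Hypothetical gene tree: a duplication (A/B) before the human-mouse split.
tree = ((("human_A", "mouse_A"), ("human_B", "mouse_B")), "fly_AB")

orthologs, paralogs = classify_pairs(tree)
print(sorted(orthologs))
print(sorted(paralogs))
```

Here human_A and mouse_A are separated by a speciation node and are reported as orthologs, while human_A and mouse_B trace back to the duplication and are reported as paralogs.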


Problems in determination of orthologs

However, several constraints complicate genome-wide studies to detect orthology.
a. The number of genes to be analyzed is huge and includes many paralogous gene families that subfunctionalized before species divergence. This makes the correlation almost nonlinear.
b. Gene duplication events do not occur at steady rates. Evolutionary history shows evidence of abundant duplication and even loss of parts of the existing genome. Similarly, protein families sometimes expand, and genes acquire new functions or become inactivated.
c. False positives can arise when unrelated proteins contain matching sequences that are not due to common ancestry.
d. Noise in genomic data is unavoidable in most cases.
e. Mutation rates can vary between related or similar genes and species. This can result in unpredictable gene divergence.
f. Pseudogenes and incorrect or incomplete gene models also pose serious problems for the analysis.

Limitations
Methods for analyzing orthologs face serious challenges as the data grow, especially in genome-wide analyses. Combining two or more approaches has been promising. Determining gene orthology is a necessary prerequisite for data mining and analysis of genomic data. Several tools, such as PSI-BLAST, COG and INPARANOID, make use of pairwise sequence comparison. Newer approaches such as SynPhyl integrate syntenic and phylogenetic methods and look promising for analysis at a genome-wide scale.
