Analysing sequence data
The primary data of sequencing projects are DNA sequences, which become really valuable only through their annotation. Several layers of analysis with bioinformatics tools are necessary to get from a raw DNA sequence to an annotated protein sequence:
- establish the correct order of sequence contigs to obtain one continuous sequence;
- find the translation and transcription initiation sites, find promoter sites, define open reading frames (ORFs);
- find splice sites, introns, exons;
- translate the DNA sequence into a protein sequence, searching all six frames;
- compare the DNA sequence to known protein sequences in order to verify exons etc. with homologous sequences.
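The translation step in the list above can be illustrated with a minimal sketch. It assumes the standard genetic code, ignores introns, and treats every stretch from a methionine to the next stop codon as a candidate ORF; the function names and the `min_codons` parameter are illustrative choices and are not taken from any particular annotation tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# no organism-specific or mitochondrial code is needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_orfs(dna, min_codons=2):
    """Return (frame, protein) for each Met-to-stop stretch in any frame."""
    orfs = []
    for frame, protein in six_frame_translations(dna).items():
        # The piece after the last stop codon is not terminated, so skip it.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_codons:
                orfs.append((frame, segment[start:]))
    return orfs


if __name__ == "__main__":
    for frame, protein in find_orfs("CCATGGCTTAAGG"):
        print(frame, protein)
```

In practice a length threshold far larger than two codons would be used, and eukaryotic genes additionally require the splice-site analysis mentioned in the list.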
Some fully automated annotation systems (e.g., GENEQUIZ) have been developed; they combine a multitude of different programs and methods.
The protein sequences are further analysed to predict function. The function can often be inferred if a sequence of a homologous protein with known function can be found. Homology searches are the predominant bioinformatics application, and very efficient search methods have been developed. Distinguishing orthologous from paralogous sequences, although often difficult, facilitates functional annotation when whole genomes are compared. Several methods detect glycosylation, myristoylation and other modification sites, and the prediction of signal peptides in the amino acid sequence gives valuable information about the subcellular location of a protein.
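The idea behind a homology search can be sketched with a local alignment score. The example below uses plain Smith-Waterman dynamic programming with a fixed match/mismatch/gap scheme, which is an assumption made for brevity: production search tools rely on faster heuristics and on substitution matrices, neither of which is shown here. The function names and the toy database are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_hits(query, database):
    """Sort database entries (name -> sequence) by descending score."""
    scored = ((smith_waterman_score(query, seq), name)
              for name, seq in database.items())
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    toy_db = {"hit": "GGACGTGG", "miss": "TTTTTTTT"}  # hypothetical data
    for score, name in rank_hits("ACGT", toy_db):
        print(name, score)
```

A raw score alone does not establish homology; real searches also estimate the statistical significance of each hit.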
The ultimate goal of sequence annotation is to arrive at a complete functional description of all genes of an organism. However, function is an ill-defined concept. Thus, the simplified idea of "one gene - one protein - one structure - one function" cannot take into account proteins that have multiple functions depending on context (e.g., subcellular location and the presence of cofactors). Well-known cases of "moonlighting" proteins are the lens crystallins and phosphoglucose isomerase. Currently, work on ontologies is under way to explicitly define a vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
Families of similar sequences contain information on sequence evolution in the form of specific conservation patterns at all sequence positions. Multiple sequence alignments are useful for
- building sequence profiles or Hidden Markov Models to perform more sensitive homology searches. A sequence profile contains information about the variability of every sequence position. Sequence profile searches have become readily available through the introduction of PSI-BLAST;
- improving structure prediction methods (secondary structure prediction);
- studying evolutionary aspects, by the construction of phylogenetic trees from the pairwise differences between sequences: for example, the classification based on small-subunit (16S) ribosomal RNA sequences established the archaea as a separate kingdom;
- determining active site residues, and residues specific for subfamilies;
- predicting protein-protein interactions;
- analysing single nucleotide polymorphisms to hunt for genetic sources of diseases.
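How a multiple sequence alignment yields position-specific information can be shown with a small sketch. It assumes the alignment is given as equal-length strings with "-" for gaps, builds a simple frequency profile per column, and uses Shannon entropy as one possible conservation measure; the function names are illustrative, and real profile methods add sequence weighting and pseudocounts that are omitted here.

```python
from collections import Counter
from math import log2


def column_profile(alignment):
    """Per-column residue frequencies of an alignment, gaps ignored."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile


def column_entropy(freqs):
    """Shannon entropy in bits; 0 means a fully conserved column."""
    return sum(-p * log2(p) for p in freqs.values())


def conserved_positions(alignment, max_entropy=0.0):
    """Indices of non-empty columns whose entropy is at most max_entropy."""
    return [
        i for i, freqs in enumerate(column_profile(alignment))
        if freqs and column_entropy(freqs) <= max_entropy
    ]


if __name__ == "__main__":
    toy_alignment = ["MKVL", "MRVL", "MKIL"]  # hypothetical aligned family
    print(conserved_positions(toy_alignment))
```

Fully conserved columns are candidates for functionally important residues such as active sites, while columns conserved only within a subset of the sequences may point to subfamily-specific residues.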
Many complete genomes of microorganisms and a few of eukaryotes are available. By analysis of entire genome sequences a wealth of additional information can be obtained. The complete genomic sequence contains not only all protein sequences but also sequences regulating gene expression. A comparison of the genomes of genetically close organisms reveals genes responsible for specific properties of the organisms (e.g., infectivity). Protein interactions can be predicted from conservation of gene order or operon organisation in different genomes. The detection of gene fusion and gene fission events (i.e., one protein is split into two in another genome) also helps to deduce protein interactions.
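The gene fusion argument can be made concrete with a toy sketch. It assumes that each protein has already been assigned to one or more homologous families (for example by a prior homology search), represented here as a dictionary from protein name to a set of family labels. That representation, the function name, and all names in the example are hypothetical.

```python
from itertools import combinations


def predict_interactions_from_fusions(genome_a, genome_b):
    """Pairs of single-family proteins in genome_a whose two families
    occur together in one (fused) protein of genome_b."""
    fused = [fams for fams in genome_b.values() if len(fams) > 1]
    single = {
        name: next(iter(fams))
        for name, fams in genome_a.items() if len(fams) == 1
    }
    predictions = set()
    for (p, fam_p), (q, fam_q) in combinations(sorted(single.items()), 2):
        if fam_p != fam_q and any(
            fam_p in fams and fam_q in fams for fams in fused
        ):
            predictions.add((p, q))
    return predictions


if __name__ == "__main__":
    # Hypothetical family assignments for two genomes.
    genome_a = {"protA1": {"F1"}, "protA2": {"F2"}, "protA3": {"F3"}}
    genome_b = {"protB1": {"F1", "F2"}, "protB2": {"F3"}}
    print(predict_interactions_from_fusions(genome_a, genome_b))
```

Such a prediction is only a hypothesis: a fusion in one genome suggests, but does not prove, that the separate proteins interact in another.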