Analysing sequence data
The primary data of sequencing projects are DNA sequences, which become really valuable only through their annotation. Several layers of analysis with bioinformatics tools are needed to get from a raw DNA sequence to an annotated protein sequence:
- establish the correct order of sequence contigs to obtain one continuous sequence;
- find the translation and transcription initiation sites, find promoter sites, define open reading frames (ORFs);
- find splice sites, introns, exons;
- translate the DNA sequence into a protein sequence, searching all six frames;
- compare the DNA sequence to known protein sequences in order to verify exons etc. against homologous sequences.
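The translation step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it assumes the standard genetic code, an unspliced sequence, and the hypothetical helper names `six_frame_translation` and `longest_orf`, none of which come from the text.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch that starts at 'M' and runs to the next stop ('*')
    or to the end of the translated frame."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


frames = six_frame_translation("ATGGCCTAA")
orfs = {name: longest_orf(protein) for name, protein in frames.items()}
```

Real annotation pipelines additionally model splice sites, alternative start codons and organism-specific genetic codes, which this sketch deliberately ignores.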
The protein sequences are further analysed to predict function. The function can often be inferred if a homologous protein sequence with known function can be found. Homology searches are the predominant bioinformatics application, and very efficient search methods have been developed [3]. Distinguishing orthologous from paralogous sequences, although often difficult, facilitates functional annotation when whole genomes are compared. Several methods detect glycosylation, myristoylation and other modification sites, and the prediction of signal peptides in the amino acid sequence gives valuable information about the subcellular location of a protein.
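As a rough illustration of what a homology search scores, the sketch below computes a Smith-Waterman local alignment score between two sequences. It is the exact dynamic-programming method that fast heuristic search tools approximate, not the algorithm of any particular tool. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for the example; real protein searches use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Sort (name, sequence) pairs by decreasing local alignment score."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database]
    return sorted(scored, reverse=True)


hits = rank_database("ACGT", [("unrelated", "TTTT"), ("identical", "ACGT")])
```

A high score only suggests homology; judging whether a hit is significant requires a statistical model of the scores expected by chance, which is outside this sketch.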
Families of similar sequences contain information on sequence evolution in the form of specific conservation patterns at all sequence positions. Multiple sequence alignments are useful for:
- building sequence profiles or hidden Markov models to perform more sensitive homology searches. A sequence profile contains information about the variability of every sequence position, which also improves structure prediction methods (secondary structure prediction). Sequence profile searches have become readily available through the introduction of PSI-BLAST [3];
- studying evolutionary aspects by constructing phylogenetic trees from the pairwise differences between sequences: for example, the classification based on small-subunit ribosomal RNA (16S rRNA) established the archaea as a separate kingdom;
- determining active site residues, and residues specific for subfamilies;
- predicting protein-protein interactions;
- analysing single nucleotide polymorphisms to hunt for genetic sources of diseases.
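Two of the uses listed above, per-position variability and pairwise differences, can be sketched directly from an alignment. The code below is a minimal example under stated assumptions: the alignment is a list of equal-length strings with `-` for gaps, the toy sequences are invented for illustration, and the function names are hypothetical. Practical profiles also add pseudocounts and sequence weighting, which are omitted here.

```python
import math
from collections import Counter


def column_profile(alignment):
    """Residue frequencies for every alignment column; gaps are ignored."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append(
            {res: n / total for res, n in counts.items()} if total else {}
        )
    return profile


def column_entropy(frequencies):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    return sum(-p * math.log2(p) for p in frequencies.values())


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over columns without a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
alignment = ["MKVLA", "MKILA", "MRVL-"]
profile = column_profile(alignment)
entropies = [column_entropy(col) for col in profile]
distances = {
    (i, j): p_distance(alignment[i], alignment[j])
    for i in range(len(alignment))
    for j in range(i + 1, len(alignment))
}
```

Columns with low entropy are candidates for functionally important residues, and a matrix of pairwise distances such as `distances` is the usual input to distance-based tree-building methods.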