Terms in Bioinformatics
Two proteins with related folds but unrelated sequences are called analogous. During evolution, analogous proteins independently developed the same fold.
In the biosciences, a databank (or data bank) is a structured set of raw data, most notably DNA sequences from sequencing projects (e.g. the EMBL and GenBank databases).
A database (or data base) is a collection of data that is organized so that its contents can easily be accessed, managed, and modified by a computer. The most prevalent type of database is the relational database, which organizes the data in tables; multiple relations can be mathematically defined between the rows and columns of each table to yield the desired information. An object-oriented database stores data in the form of objects which are organized in hierarchical classes that may inherit properties from classes higher in the tree structure.
In the biosciences, a database is a curated repository of raw data containing annotations, further analysis, and links to other databases. Examples of databases are the SWISSPROT database for annotated protein sequences and the FlyBase database of genetic and molecular data for Drosophila melanogaster.
In general, dynamic programming is an algorithmic scheme for solving discrete optimization problems that have overlapping subproblems. In a dynamic programming algorithm, the definition of the function that is optimized is extended as the computation proceeds. The solution is constructed by progressing from simpler to more complex cases, thereby solving each subproblem before it is needed by any other subproblem.
In particular, the algorithm for finding optimal alignments is an example of dynamic programming.
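As an illustration, the following minimal Python sketch computes the score of an optimal global alignment of two sequences by dynamic programming (the Needleman-Wunsch scheme). The function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example, not values prescribed by this glossary.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Simplest subproblems: aligning a prefix against the empty sequence.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell is built from three smaller, already solved subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

The table is filled from shorter prefixes to longer ones, so every subproblem is solved once and is available before a larger subproblem needs it.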
In molecular dynamics and molecular mechanics calculations, the intra- and intermolecular interactions of a molecule are calculated from a simplified empirical parametrization called a force field. The parameters of a force field include atom masses and charges as well as terms for dihedral angles, improper angles, van der Waals and electrostatic interactions, etc.
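To make the idea of an empirical parametrization concrete, the sketch below evaluates two nonbonded terms for a single pair of atoms: a Lennard-Jones 12-6 term for the van der Waals interaction and a Coulomb term for the electrostatic interaction. These functional forms are one common choice and are an assumption here; the function names, the unit system (kcal/mol, Angstrom, elementary charges) and the approximate value of the Coulomb constant are likewise illustrative.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2); assumed unit system.
COULOMB_CONSTANT = 332.06

def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy at distance r (well depth epsilon, size sigma)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)

def coulomb(r, q1, q2):
    """Electrostatic energy of two point charges q1 and q2 at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / r

def nonbonded_energy(pos1, pos2, q1, q2, epsilon, sigma):
    """Sum of both nonbonded terms for one atom pair at the given coordinates."""
    r = math.dist(pos1, pos2)
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)
```

A full force field would add bonded terms (bonds, angles, dihedrals, impropers) and sum over all atom pairs; this sketch covers only the pairwise nonbonded part.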
The genome is the gene complement of an organism. A genome sequence comprises the information of the entire genetic material of an organism.
Genomics, Functional Genomics, Structural Genomics
The goal of Genomics is to determine the complete DNA sequence of all the genetic material contained in an organism's genome.
Functional genomics (sometimes referred to as functional proteomics) aims at determining the function of the proteome (the protein complement encoded by an organism's entire genome). It expands the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion, using large-scale experimental methodologies combined with statistical analysis of the results.
Structural Genomics is the systematic effort to gain a complete structural description of a defined set of molecules, ultimately for an organism's entire proteome. Structural genomics projects apply X-ray crystallography and NMR spectroscopy in a high-throughput manner.
Hidden Markov Model
A Hidden Markov Model (HMM) is a general probabilistic model for sequences of symbols. In a Markov chain, the probability of each symbol depends only on the preceding one. In a hidden Markov model, the chain runs over states that are not observed directly; each state emits symbols with its own probabilities. Hidden Markov models are widely used in bioinformatics, most notably to replace sequence profiles in the calculation of sequence alignments.
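The following sketch shows how the probability of an observed sequence can be computed under a small HMM with the forward algorithm. The two hidden states (AT-rich and GC-rich regions) and all probabilities are made-up values for illustration only, and the function assumes a non-empty observation sequence.

```python
# Hypothetical two-state model; every number below is an illustrative assumption.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def forward_probability(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Probability of the observed symbols, summed over all hidden state paths."""
    # alpha[s] = P(symbols seen so far, current hidden state is s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

For example, forward_probability("AC") sums the contributions of all four possible state paths without enumerating them explicitly.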
Two proteins with related folds and related sequences are called homologous. Commonly, homologous proteins are further divided into orthologous and paralogous proteins. While orthologous proteins diverged from a common ancestral gene through speciation, paralogous proteins arose by gene duplication within a genome.
A neural network is a computer algorithm to solve non-linear optimisation problems. The algorithm was derived in analogy to the way the densely interconnected, parallel structure of the brain processes information.
The word ontology has a long history in philosophy, in which it refers to the study of being as such. In information science, an ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships among them.
Open Reading Frame (ORF)
An open reading frame contains a series of codons (base triplets) coding for amino acids without any termination codons. There are six potential reading frames of an unidentified sequence: three on each strand, one for each possible starting position of the first codon.
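The six reading frames can be enumerated with a short Python sketch. It follows the definition above, treating any run of codons without a termination codon as open; requiring a start codon, as many gene finders do, is deliberately left out. The function names are chosen for this example, and the stop codons are those of the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Return the six reading frames, each as a list of complete codons."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames

def longest_open_stretch(codons):
    """Length in codons of the longest run without a termination codon."""
    best = current = 0
    for codon in codons:
        current = 0 if codon in STOP_CODONS else current + 1
        best = max(best, current)
    return best
```

Calling longest_open_stretch on each of the six frames gives a simple way to rank the frames of an unidentified sequence by their longest open stretch.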
Protein Folding Problem
Proteins fold on a time scale from µs to s. Starting from a random coil conformation, proteins can find their stable fold quickly although the number of possible conformations is astronomically high. The Protein Folding Problem is to predict the folding and the final structure of a protein solely from its sequence.
The Protein Structure Prediction Problem refers to the combinatorial problem of calculating the three-dimensional structure of a protein from its sequence alone. It is one of the biggest challenges in structural bioinformatics.
The Proteome is the protein complement expressed by a genome. While the genome is static, the proteome continually changes in response to external and internal events.
Proteomics aims at quantifying the expression levels of the complete protein complement (the proteome) in a cell at any given time. While proteomics research was initially focussed on two-dimensional gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes the function of large sets of proteins. It is thus often used as a synonym for functional genomics.
A contig consists of a set of gel readings from a sequencing project that are related to one another by overlap of their sequences. The gel readings of a contig can be combined to form a contiguous consensus sequence whose length is called the length of the contig.
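How overlapping readings combine into a consensus sequence can be sketched as follows. This is a strongly simplified illustration: it assumes error-free readings that are already given in left-to-right order and on the same strand, which real assembly programs cannot assume. The function names and the minimum overlap of three bases are arbitrary choices for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge_reads(reads, min_overlap=3):
    """Merge ordered, overlapping readings into one contiguous sequence."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_overlap)
        if k == 0:
            raise ValueError("readings do not overlap; they belong to different contigs")
        contig += read[k:]  # append only the part that extends the contig
    return contig
```

The length of the returned string corresponds to the length of the contig as defined above.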
A sequence profile represents certain features in a set of aligned sequences. In particular, it gives position-dependent weights for all 20 amino acids as well as for insertion and deletion events at any sequence position.
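A minimal sketch of such a profile is given below. It uses plain relative frequencies per alignment column as weights and treats the gap character as a stand-in for insertion and deletion events; practical profiles typically use log-odds scores, pseudocounts and sequence weighting, none of which is attempted here. The function name and the gap symbol are assumptions for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, alphabet=AMINO_ACIDS, gap="-"):
    """Per-column relative frequencies of each residue and of the gap symbol."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        total = len(column)
        weights = {aa: counts.get(aa, 0) / total for aa in alphabet}
        weights[gap] = counts.get(gap, 0) / total
        profile.append(weights)
    return profile
```

Each entry of the returned list is a dictionary of 21 weights (20 amino acids plus the gap) for one alignment position.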
Single Nucleotide Polymorphism
Single Nucleotide Polymorphisms (SNPs) are single base pair positions in genomic DNA at which normal individuals in a given population show different sequence alternatives (alleles) with the least frequent allele having an abundance of 1% or greater. SNPs occur once every 100 to 300 bases and are hence the most common genetic variations.
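The frequency criterion in this definition can be written as a small check. The sketch takes the alleles observed at one position across a sample of a population and applies the 1% threshold literally to the least frequent allele; the function name and the input format are assumptions for illustration.

```python
from collections import Counter

def is_snp(alleles, threshold=0.01):
    """True if the position is polymorphic and its rarest allele meets the threshold."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return False  # only one allele observed: no polymorphism
    least_frequent = min(counts.values()) / sum(counts.values())
    return least_frequent >= threshold
```

For instance, a position where 1 of 100 sampled chromosomes carries G and the rest carry A just meets the criterion, whereas 1 of 1000 does not.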
Threading techniques try to match a target sequence to a library of known three-dimensional structures by "threading" the target sequence over the known coordinates. In this manner, threading tries to predict the three-dimensional structure starting from a given protein sequence. It is sometimes successful when comparisons based on sequences or sequence profiles alone fail because the sequence similarity is too low.
The Turing machine is one of the key abstractions used in modern computability theory. It is a mathematical model of a device that changes its internal state and reads from, writes on, and moves along a potentially infinite tape, all in accordance with its present state and the symbol it currently reads. The model of the Turing machine played an important role in the conception of the modern digital computer.
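The model can be made concrete with a small simulator. In this sketch the tape is a dictionary that grows on demand, and the transition table maps a pair (state, symbol read) to a triple (symbol to write, head movement, next state). The state names "start" and "halt", the blank symbol "_" and the step limit are conventions assumed for the example; the sample machine simply inverts a string of bits.

```python
def run_turing_machine(rules, tape, state="start", blank="_", max_steps=10_000):
    """Run a single-tape machine until it reaches the state 'halt'."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = cells.get(head, blank)
        # The next action depends only on the present state and the symbol read.
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
        steps += 1
    low, high = min(cells, default=0), max(cells, default=0)
    return "".join(cells.get(i, blank) for i in range(low, high + 1)).strip(blank)

# Example machine (hypothetical): flip every bit, halt at the first blank.
INVERT_BITS = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
```

Running run_turing_machine(INVERT_BITS, "1011") moves the head once across the input and returns the tape contents with surrounding blanks removed.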