Organizing biological knowledge in databases
Raw biological data are stored in public databanks (such as GenBank or EMBL for primary DNA sequences). Data can be submitted and accessed via the World Wide Web. Protein sequence databanks such as TrEMBL provide the most likely translation of all coding sequences in the EMBL databank. Sequence data are the most prominent, but other kinds of data are stored as well, e.g. yeast two-hybrid screens, expression arrays, systematic gene knock-out experiments, and metabolic pathways.
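To illustrate what "translation of a coding sequence" means in practice, the following minimal sketch translates a DNA coding sequence into a protein sequence using the standard genetic code. It is not the pipeline used to build TrEMBL; the function name `translate` and the example sequence are assumptions made for illustration only, and real databank entries additionally require handling of alternative genetic codes, ambiguous bases, and reading-frame annotation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence (hypothetical helper).

    Reads the sequence in frame from its first base, stops at the first
    stop codon, ignores an incomplete trailing codon, and writes 'X' for
    any codon that contains a character other than A, C, G or T.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up example sequence: ATG GCC AAA TGA
    print(translate("ATGGCCAAATGA"))
```

A translation produced this way is only "most likely" in the sense of the text above: it depends entirely on the coding region and reading frame annotated in the nucleotide entry being correct.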
The stored data need to be accessed in a meaningful way, and often the contents of several databanks or databases have to be accessed simultaneously and correlated with each other. Special query and retrieval systems have been developed to facilitate this task (such as the Sequence Retrieval System (SRS) and the Entrez system). The optimal design of interoperating database systems remains an unsolved problem. Beyond storing data, databases provide additional functionality such as access to sequence homology searches and links to other databases and analysis results. For example, SWISSPROT contains verified protein sequences together with additional annotations describing the function of each protein. Protein 3D structures are stored in dedicated databases (for example, the Protein Data Bank, now primarily curated and developed by the Research Collaboratory for Structural Bioinformatics). Organism-specific databases have also been developed (such as ACEDB, A C. elegans DataBase, for the C. elegans genome, and FLYBASE for D. melanogaster). Errors in databanks and databases (mostly errors in annotation) are a major problem, in particular because they propagate easily through links.
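The way a single annotation error can spread through links can be sketched as a graph traversal. The example below assumes a hypothetical record of which entry took its annotation from which other entry (the mapping `derived_from` and all entry names are invented for illustration and do not correspond to any real databank schema) and collects every entry that directly or indirectly inherits from one erroneous entry.

```python
from collections import deque


def affected_entries(derived_from, bad_entry):
    """Return all entries whose annotation traces back to bad_entry.

    derived_from maps an entry identifier to the list of entries its
    annotation was copied or inferred from (a hypothetical structure).
    """
    # Invert the mapping: source entry -> entries that copied from it.
    copied_by = {}
    for entry, sources in derived_from.items():
        for source in sources:
            copied_by.setdefault(source, []).append(entry)

    # Breadth-first search outward from the erroneous entry.
    affected = set()
    queue = deque([bad_entry])
    while queue:
        current = queue.popleft()
        for dependant in copied_by.get(current, []):
            if dependant != bad_entry and dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected


if __name__ == "__main__":
    # Invented example: B was annotated from A, C from B, D from E.
    links = {"B": ["A"], "C": ["B"], "D": ["E"]}
    print(sorted(affected_entries(links, "A")))
```

In this toy example an error in entry A reaches C even though C never referred to A directly, which is why annotation errors are hard to contain once entries are linked.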
Databases of scientific literature (such as PUBMED and MEDLINE) also provide additional functionality, e.g. they can search for similar articles based on word-usage analysis. Text recognition systems are being developed that automatically extract knowledge about protein function, notably on protein-protein interactions, from the abstracts of scientific articles.
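One simple form of word-usage analysis is to compare the word counts of two texts with cosine similarity. The sketch below is a generic illustration under that assumption, not the method actually used by PUBMED or MEDLINE; the function names and the example abstracts are invented. Production systems typically also weight words by how rare they are and remove very common words.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphanumeric words in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the word-count vectors of two texts.

    Returns a value between 0.0 (no shared words) and 1.0 (identical
    word usage); returns 0.0 if either text contains no words.
    """
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented example abstracts.
    query = "kinase binds the receptor protein"
    related = "the receptor protein is phosphorylated by a kinase"
    unrelated = "genome assembly of a soil bacterium"
    print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
```

Ranking all articles by such a score against a query article gives a crude "similar articles" list; it captures shared vocabulary only, not meaning.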