********************************************************************************
UniProt Release 2.0 Notes
05-July-2004
********************************************************************************
CONTENTS
1. Introduction
2. Database description
3. Current release contents
4. Description of changes made to UniProt since release 1.0
5. Forthcoming changes
6. How to link to UniProt entries
7. Feedback
8. Acknowledgments
9. Terms of use
1. INTRODUCTION
The UniProt consortium--European Bioinformatics Institute (EBI), Swiss
Institute of Bioinformatics (SIB) and Protein Information Resource
(PIR)--is pleased to announce UniProt Release 2.0 (05-July-2004).
UniProt is updated every two weeks and can be accessed online for
searches or download at http://www.uniprot.org.
2. DATABASE DESCRIPTION
UniProt is a centralized resource for protein sequences and functional
information. UniProt was created by joining together the information
from Swiss-Prot, TrEMBL and PIR. UniProt comprises three
components, each optimized for different uses:
a. The UniProt Knowledgebase (UniProt) is the central access point for
extensive curated protein information, including function,
classification, and cross-references. The UniProt Knowledgebase contains
two major elements: a section containing manually-annotated records,
based on information from the literature and curator-evaluated
computational analysis (referred to as UniProt/Swiss-Prot); and a
section containing computationally-analyzed records awaiting manual
annotation (referred to as UniProt/TrEMBL). PIR-PSD entries not found in
Swiss-Prot or TrEMBL were incorporated into the UniProt Knowledgebase,
and bi-directional cross-references between these and Swiss-Prot or
TrEMBL records were created to allow easy tracking. By design, the
Knowledgebase is non-redundant, with the goal of representing all known
information regarding a particular protein. The UniProt Knowledgebase
aims to describe in a single record all the protein products derived
from a certain gene (or genes if the translation from different genes in
a genome leads to indistinguishable proteins) from a certain species.
The UniProt Knowledgebase represents a carefully selected subset of the
sequences found in UniParc (see below). The UniProt Knowledgebase
provides extensive cross-references to external data collections, such
as the corresponding nucleotide entries in DDBJ/EMBL/GenBank, 2D-PAGE
data, protein structure databases, protein domain and family
characterization databases, post-translational modification databases,
species-specific data collections, and disease databases. As a result of
this extensive cross-referencing, the Knowledgebase serves as a de facto
hub for biomolecular information about any given protein. Each entry in
the Swiss-Prot section of the UniProt Knowledgebase is thoroughly
analyzed and annotated. Literature-based curation is used to extract
experimental data, which is then added to the entry. The experimental
information is supplemented by manually confirmed results from various
sequence analysis programs. The annotation includes a description of the
properties of the protein, such as its function, any known
post-translational modifications, domains, catalytic or other sites,
secondary and quaternary structure, similarities to other proteins,
diseases caused by mutations in the protein, pathways in which the
protein is involved, sequence conflicts, and variants. Detailed
information is available in the UniProt Knowledgebase user manual
(http://us.expasy.org/sprot/userman.html), and in the UniProt/Swiss-Prot
release notes (http://expasy.org/sprot/relnotes/) and the UniProt/TrEMBL
release notes (ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt).
b. The UniProt Non-redundant Reference databases (UniRef) combine
closely related sequences into a single record to accelerate sequence
searches. While merging in the Knowledgebase is restricted to
curator-assisted inclusion of reliable and stable sequence data for a
single species, UniRef100 merges sequences automatically across
different species and includes all UniProt Knowledgebase records. It
also includes those UniParc records that represent sequences deemed
over-represented and excluded from the Knowledgebase such as
DDBJ/EMBL/GenBank WGS (Whole Genome Shotgun) coding sequence
translations, Ensembl protein translations from various organisms and
IPI data. The production of UniRef100 begins with the clustering of all
records by sequence identity. Identical sequences and sub-fragments are
presented as a single UniRef100 entry, containing the accession numbers
of all merged entries, the protein sequence, and links to the
corresponding Knowledgebase and archival records. UniRef90 and UniRef50
are built from UniRef100 and are intended to provide non-redundant
sequence collections for the scientific community to use in performing
faster homology searches. All records having >90% or >50% identity are
merged together into a single UniRef90 or UniRef50 entry, respectively.
c. UniProt Archive (UniParc) is a comprehensive repository of all
publicly available protein sequences, consisting only of unique
identifiers and sequence. While most protein sequence data is derived
from the translation of DDBJ/EMBL/GenBank nucleotide sequences, a large
amount of primary protein sequence data resulting from the direct
sequencing of proteins is submitted directly to other sources, including
Swiss-Prot, TrEMBL, and PIR-PSD; in addition, a large number of protein
sequences are found in patent applications, as well as in entries from
the Protein Data Bank (PDB). Given the wide variety of primary sources
and variation in the degree and quality of annotation, UniParc was
created; it is designed to capture all available protein sequence data
from sources such as DDBJ/EMBL/GenBank, UniProt/Swiss-Prot,
UniProt/TrEMBL, PIR-PSD, Ensembl, International Protein Index (IPI),
PDB, RefSeq, FlyBase, WormBase, H-Inv, TROME, European Patent Office,
United States Patent and Trademark Office and Japan Patent Office. This
combination of sources makes UniParc the most comprehensive, publicly
accessible, non-redundant protein sequence database available. UniParc
represents each protein sequence once and only once, assigning it a
unique UniParc identifier. UniParc cross-references the accession
numbers of the source databases, providing sequence versions that are
incremented in the usual fashion. Status flags are used to indicate the
status of the entry in the original source database, with "active"
indicating that the entry is still present in the source database and
"obsolete" indicating that the entry no longer exists in the source
database. UniParc's intended use is to track the current status and
history of all protein sequences. Sequence similarity search is the most
reliable way to retrieve a record. UniParc records carry no annotation, but this
information can be found in the UniProt Knowledgebase.
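The bookkeeping described for UniParc above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the "UPI"-prefixed identifier format, the made-up accession numbers, and the rule that a version is incremented when a source accession is reloaded with a different sequence are all hypothetical and are not taken from the production system.

```python
from dataclasses import dataclass, field


@dataclass
class CrossRef:
    database: str            # source database name, e.g. "PIR-PSD"
    accession: str           # accession number in that source
    version: int = 1         # sequence version for this accession
    status: str = "active"   # "active" or "obsolete"


@dataclass
class Archive:
    ids_by_sequence: dict = field(default_factory=dict)  # sequence -> identifier
    xrefs_by_id: dict = field(default_factory=dict)      # identifier -> [CrossRef]
    latest: dict = field(default_factory=dict)           # (db, acc) -> (identifier, CrossRef)

    def load(self, database, accession, sequence):
        """Register a source record and return the identifier of its sequence."""
        # Each distinct sequence is stored once and only once.
        uid = self.ids_by_sequence.get(sequence)
        if uid is None:
            # Hypothetical identifier format: prefix plus a hexadecimal counter.
            uid = f"UPI{len(self.ids_by_sequence) + 1:010X}"
            self.ids_by_sequence[sequence] = uid
            self.xrefs_by_id[uid] = []

        key = (database, accession)
        version = 1
        previous = self.latest.get(key)
        if previous is not None:
            prev_uid, prev_ref = previous
            if prev_uid == uid:
                prev_ref.status = "active"  # same sequence reloaded
                return uid
            # The accession now carries a different sequence: flag the old
            # cross-reference and increment the version (assumed rule).
            prev_ref.status = "obsolete"
            version = prev_ref.version + 1

        ref = CrossRef(database, accession, version)
        self.xrefs_by_id[uid].append(ref)
        self.latest[key] = (uid, ref)
        return uid

    def retire(self, database, accession):
        """Flag a record that no longer exists in its source database."""
        _, ref = self.latest[(database, accession)]
        ref.status = "obsolete"


# Made-up accessions and sequences, for illustration only.
archive = Archive()
a = archive.load("UniProt/Swiss-Prot", "P00001", "MKTAYIAK")
b = archive.load("PIR-PSD", "A00001", "MKTAYIAK")    # same sequence, same identifier
c = archive.load("PIR-PSD", "A00001", "MKTAYIAKQR")  # sequence changed at the source
```

In this sketch `a` and `b` are the same identifier because the two sources supplied the same sequence, while the third call creates a new identifier and leaves the earlier PIR-PSD cross-reference flagged as obsolete.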
Additional information about UniProt databases can be obtained from
http://www.uniprot.org/database/DBDescription.shtml
3. CURRENT RELEASE CONTENTS
-------------------------------------------------------------
UniProt Release 2.0
-------------------------------------------------------------
Database     Entries
-------------------------------------------------------------
UniProt      1,487,788 (UniProt/Swiss-Prot 44.0: 153,871;
             UniProt/TrEMBL 27.0: 1,333,917)
UniRef100    1,306,318
UniRef90     816,857
UniRef50     465,394
UniParc      3,863,370
-------------------------------------------------------------
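The figures in the table can be cross-checked with a short Python snippet: the Knowledgebase total should equal the sum of its two sections, and the UniRef sizes give a rough idea of how much each clustering level reduces the collection. The counts are copied from the table; the variable and function names are illustrative only.

```python
# Entry counts copied from the Release 2.0 table.
counts = {
    "UniProt/Swiss-Prot": 153_871,
    "UniProt/TrEMBL": 1_333_917,
    "UniProt": 1_487_788,
    "UniRef100": 1_306_318,
    "UniRef90": 816_857,
    "UniRef50": 465_394,
    "UniParc": 3_863_370,
}

# The Knowledgebase total should be the sum of its two sections.
kb_total = counts["UniProt/Swiss-Prot"] + counts["UniProt/TrEMBL"]
sections_match = kb_total == counts["UniProt"]


def relative_size(database, reference="UniRef100"):
    """Return the entry count of `database` as a fraction of `reference`."""
    return counts[database] / counts[reference]


print("Knowledgebase sections add up:", sections_match)
for name in ("UniRef90", "UniRef50"):
    print(f"{name} holds {relative_size(name):.1%} of the UniRef100 entry count")
```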
4. DESCRIPTION OF CHANGES MADE TO UNIPROT SINCE RELEASE 1.0
UniProt Knowledgebase - Please read the UniProt/Swiss-Prot and
UniProt/TrEMBL release notes (http://expasy.org/sprot/relnotes/ and
ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt) and the recent
changes webpage (http://expasy.org/sprot/relnotes/sp_news.html).
UniRef - The current UniRef100 database combines identical sequences and
sub-fragments from any organism into a single UniRef entry. Prior to
Release 1.8, these sequences were combined only if they were derived
from the same species. The new DTD is available at
ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/.
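The effect of this change can be sketched in Python. The function below groups records whose sequences are identical to, or sub-fragments of, a longer sequence; the `same_species_only` flag switches between the earlier behavior and the current one. The record layout, the function name, the made-up accessions, and the greedy longest-first strategy are illustrative assumptions, not the production clustering procedure.

```python
def merge_identical(records, same_species_only=False):
    """Group (accession, species, sequence) records into clusters.

    A record joins an existing cluster when its sequence is identical to, or
    a contiguous sub-fragment of, that cluster's representative sequence.
    """
    clusters = []  # each cluster: {"sequence", "species", "members"}
    # Visit longer sequences first so that fragments meet their parent.
    ordered = sorted(records, key=lambda r: (-len(r[2]), r[0]))
    for accession, species, sequence in ordered:
        for cluster in clusters:
            if same_species_only and cluster["species"] != species:
                continue
            # The substring test covers identical sequences and sub-fragments.
            if sequence in cluster["sequence"]:
                cluster["members"].append(accession)
                break
        else:
            # No suitable cluster found: this record starts a new one.
            clusters.append(
                {"sequence": sequence, "species": species, "members": [accession]}
            )
    return clusters


# Made-up records, for illustration only.
records = [
    ("ACC1", "Homo sapiens", "MKTAYIAKQRQISFVK"),
    ("ACC2", "Mus musculus", "MKTAYIAKQRQISFVK"),  # identical, other species
    ("ACC3", "Mus musculus", "AYIAKQRQ"),          # sub-fragment
    ("ACC4", "Homo sapiens", "MSDNELLK"),          # unrelated
]
across = merge_identical(records)                          # current behavior
within = merge_identical(records, same_species_only=True)  # earlier behavior
```

With these toy records, merging across species yields two clusters, whereas restricting merges to a single species yields three, because the human and mouse copies of the same sequence stay apart.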
5. FORTHCOMING CHANGES
You can read about forthcoming changes at
http://expasy.org/sprot/relnotes/sp_soon.html.
6. HOW TO LINK TO UNIPROT ENTRIES
A detailed description of how to link to UniProt entries can be found at
http://www.uniprot.org/support/docs/linkUniProt.html
7. FEEDBACK
We are constantly trying to improve our database in terms of accuracy
and representation and hence we consider your feedback
(http://www.uniprot.org/support/feedback.shtml) extremely valuable.
Please contact us if you have any questions
(http://www.uniprot.org/support/helpdesk.shtml) or comments
(http://www.uniprot.org/support/feedback.shtml). You can also subscribe
(http://www.uniprot.org/support/alerts.shtml) to e-mail alerts for the
latest information on UniProt databases.
8. ACKNOWLEDGMENTS
UniProt is supported mainly by the National Institutes of Health (NIH)
grant U01 HG02712. Minor support for the EBI's involvement in UniProt
comes from the two European Union contracts BioBabel (QLRT-2000-00981)
and TEMBLOR (QLRI-2001-00015) and from the NIH grant R01 HG02273.
Swiss-Prot activities at the SIB are supported by the Swiss Federal
Government through the Federal Office of Education and Science. PIR
activities are also supported by the National Science Foundation (NSF)
grants DBI-0138188 and ITR-0205470.
9. TERMS OF USE
UniProt is available for both commercial and non-commercial use. Please
see http://www.uniprot.org/terms.shtml for details.