Release 74.0 of GenBank is now available by anonymous FTP from
'ncbi.nlm.nih.gov' in the directory 'ncbi-genbank'. The taxonomic
divisions into which the entries are organized have changed: the
'nlm' and 'organelle' divisions have been removed and an 'est'
division has been created. See the README file below for details.
Dennis Benson
NCBI
***********************************************************************
NCBI-GenBank Flat File Release 74.0
Release Date: December 15, 1992
Close-of-Data: November 17, 1992
The files in this directory comprise NCBI-GenBank Release 74.0, in the
GenBank flatfile format. These files are also available on the
"NCBI-GenBank: Flat File Format" CD-ROM.
At the NLM, sequence records are created by specialized indexers in the
Division of Library Operations. Over 325,000 articles per year from 3400
journals are scanned for sequence data. They are supplemented by journals
in plant and veterinary sciences through a collaboration with the National
Agricultural Library. These records join the direct submission data
stream from LANL and submissions from the European Molecular Biology
Laboratory (EMBL) Data Library and the DNA Data Bank of Japan (DDBJ)
for incorporation within a relational database, NCBI-GenBank.
The data in the LANL direct-submission stream were first parsed into the
Abstract Syntax Notation 1 (ASN.1) format, then, along with all the other
records in the relational database, converted to the GenBank flat file format.
The ASN.1 form of the data is incorporated into the NCBI "Entrez: Sequences"
CD-ROM, and is available, as is the flat file data, by anonymous FTP to:
'ncbi.nlm.nih.gov'.
For additional information see the file 'gbrel.txt' in this directory.
======================================================================
Important Changes in Release 74.0
As announced in the Release Notes of NCBI-GenBank 73.1, the entries
that had appeared in the 'gbnlm.seq' data file have been moved to their
appropriate taxonomic division, and a new data file called 'gbest.seq'
has been created for Expressed Sequence Tag (EST) sequences.
In addition, the entries in the Organelle division now appear in their
appropriate taxonomic division; the data file 'gborg.seq' will no longer
appear in releases of NCBI-GenBank.
======================================================================
LOCUS name duplicates
Because NCBI handles discrete update streams from LANL, DDBJ, and
EMBL, it is difficult to guarantee the uniqueness of LOCUS names
among all sequence entries. For Release 74.0, we were unable to
resolve five duplicate names before close-of-data:
  Division PRI:  HUMDNAJ:     X62421 and D13388
                 HUMWT1:      X51630 and D13624
  Division ROD:  MUSLAMC:     X14170 and D13181
  Division VRT:  QULTROPOM4:  X54379 and X54280
                 QULTROPOM5:  X54479 and X54281
We have decided to allow the LOCUS name redundancy rather than filter
out the sequences involved.
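Because LOCUS names may repeat, software that indexes entries by LOCUS
name alone can silently lose records. The following Python sketch shows
one way to list the duplicated names in an uncompressed flat file. It
assumes only that each entry starts with a line beginning with "LOCUS"
and carries a line beginning with "ACCESSION"; the function name is
hypothetical and is not part of any NCBI software.

```python
from collections import defaultdict


def find_duplicate_locus_names(lines):
    """Return {locus_name: [accessions]} for LOCUS names used by more than one entry.

    Assumption: every entry starts with a line beginning "LOCUS", and the
    first ACCESSION line that follows it belongs to that entry.
    """
    entries = []  # one [locus_name, primary_accession_or_None] pair per entry
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                entries.append([fields[1], None])
        elif line.startswith("ACCESSION") and entries and entries[-1][1] is None:
            fields = line.split()
            if len(fields) > 1:
                # Only the first (primary) accession number is recorded.
                entries[-1][1] = fields[1]

    accessions_by_locus = defaultdict(list)
    for locus_name, accession in entries:
        accessions_by_locus[locus_name].append(accession)

    return {
        name: accessions
        for name, accessions in accessions_by_locus.items()
        if len(accessions) > 1
    }
```

A reader that must keep every entry would more safely index on the
accession number, or on the (LOCUS name, accession number) pair, than on
the LOCUS name alone.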
======================================================================
Genpept
Due to requests for the protein translations appearing in the
GenBank flat file feature table, we are making the sequences available
in FASTA format. Each record in the file 'genpept.fasta.Z' consists of a
definition line with locus and descriptive information, followed by the
protein sequence.
These are protein sequences which appeared in entries from Release 74.0.
CAUTION: The format of the definition line is expected to change within
the next few months, so please be aware of the risk of writing software
which parses the current format. Changes will include the addition of
accession numbers to the definition line.
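Given the caution above, one defensive approach is to read the file
without interpreting the definition line at all. The Python sketch below
assumes only the general FASTA convention that a record begins with a
line starting with ">", and it returns the rest of that line as an
opaque string. The function name is hypothetical, and the file is
assumed to have been uncompressed first.

```python
def read_fasta(lines):
    """Yield (definition_line, sequence) pairs from an iterable of text lines.

    The definition line is returned whole, without the leading '>'.
    No attempt is made to split it into fields, since its layout is
    expected to change.
    """
    definition = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if definition is not None:
                yield definition, "".join(chunks)
            definition = line[1:]
            chunks = []
        elif definition is not None and line.strip():
            # Sequence data may span several lines; join them.
            chunks.append(line.strip())
    if definition is not None:
        yield definition, "".join(chunks)
```

Code that needs the locus name or other fields can then isolate that
parsing in one small function, which is the only part to revise when the
definition line changes.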
===============================================================================
ncbi-genbank/daily
This subdirectory contains the Cumulative Update (CU) for all new or updated
entries since close-of-data for Release 74.0.
The flatfile CU is generated nightly by:
a) Collecting direct submissions from LANL; creating a non-redundant
flatfile from the submissions; parsing the result into ASN.1; and then
regenerating a new flatfile from the ASN.1.
b) Outputting, in ASN.1 format, all entries in the NCBI Backbone database
that have been added or updated since November 17, 1992, and then
converting the ASN.1 to flatfile format.
c) Combining the results of (a) and (b) into a single file called
gbcu.flat.Z .
File gpcu.fasta.Z contains the protein translations appearing in
gbcu.flat.Z, in FASTA format (see the description about genpept.fasta.Z,
above).
CAUTION: The format of the definition line is expected to change within
the next few months, so please be aware of the risk of writing software
which parses the current format. Changes will include the addition of
accession numbers to the definition line.
NOTE: During the first day or two after a new release is posted in the
ncbi-genbank directory, the CU in this directory could be empty. As soon
as an update occurs in either the NCBI Backbone database or the LANL
direct-submission stream, gbcu.flat.Z will reappear, containing all new
or updated entries since close-of-data for the new release.
===========================================================================
ncbi-genbank/daily-nc
This subdirectory contains individual files for each day's new or updated
entries since close-of-data for Release 74.0.
File names for these Non-Cumulative Updates (NCU) are of the form
ncMMDD.flat.Z, where MMDD represents Month-Day.
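As an illustration of the naming scheme, the Python sketch below builds
the expected NCU file names for a range of dates. It assumes that month
and day are each written as two zero-padded digits, which the MMDD
pattern suggests but this file does not state outright. Both function
names are hypothetical.

```python
import datetime


def ncu_file_name(day):
    """Return the NCU file name for one date, e.g. nc1215.flat.Z for December 15.

    Assumption: month and day are zero-padded to two digits each.
    """
    return "nc{:02d}{:02d}.flat.Z".format(day.month, day.day)


def ncu_file_names(first_day, last_day):
    """Return the NCU file names for every date from first_day to last_day inclusive."""
    names = []
    day = first_day
    while day <= last_day:
        names.append(ncu_file_name(day))
        day += datetime.timedelta(days=1)
    return names
```

The names carry no year, so they do not sort chronologically across a
year boundary, and a file may not exist for every date in the range.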
A flatfile NCU is generated by:
a) Parsing the flatfile version of a single day's direct submissions
to LANL into ASN.1, and then regenerating a flatfile from the ASN.1.
b) Outputting, in ASN.1 format, all entries in the NCBI Backbone database
that have been added or updated on that same day, and then converting
the ASN.1 to flatfile format.
c) Combining the results of (a) and (b) into a single flatfile.
A file called "Last.File" in this directory contains the name of the
most recently generated flatfile NCU.
Entries undergoing successive updates on different days will be present in
more than one NCU file. However, a single NCU will not contain multiple
versions of an entry updated more than once on a single day.
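Since an entry can appear in several NCU files, a program that applies
them in sequence needs to keep only the most recent version of each
entry. The Python sketch below does this for uncompressed NCU files
supplied in chronological order. It assumes that each entry starts with
a "LOCUS" line and ends with a "//" line, and that the primary accession
number identifies the same entry from one update to the next. That
keying choice is an assumption of this sketch, not a documented rule,
and all function names are hypothetical.

```python
def split_entries(lines):
    """Yield each flat file entry as a list of lines, from LOCUS through '//'."""
    entry = None
    for line in lines:
        if line.startswith("LOCUS"):
            entry = [line]
        elif entry is not None:
            entry.append(line)
            if line.startswith("//"):
                yield entry
                entry = None


def primary_accession(entry):
    """Return the first accession number on the entry's ACCESSION line, or None."""
    for line in entry:
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                return fields[1]
    return None


def latest_entries(ncu_files_in_date_order):
    """Return {accession: entry_lines}, keeping the newest version of each entry.

    ncu_files_in_date_order is an iterable of line iterables, oldest first.
    Assumption: the primary accession identifies an entry across updates.
    Entries without an ACCESSION line are skipped.
    """
    latest = {}
    for lines in ncu_files_in_date_order:
        for entry in split_entries(lines):
            key = primary_accession(entry)
            if key is not None:
                latest[key] = entry  # a later file replaces an earlier version
    return latest
```

The accession number is used as the key rather than the LOCUS name
because LOCUS names are not guaranteed to be unique.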
===========================================================================
If you have any further questions, please contact:
National Center for Biotechnology Information
National Library of Medicine, 38A, 8N805
8600 Rockville Pike
Bethesda, MD 20894
USA
Voice: (301) 496-2475
Fax: (301) 480-9241
The electronic mail address is: info@ncbi.nlm.nih.gov