Greetings GenBank Users,
The release notes for October's GenBank 198.0 included an announcement
of a new accession format for CON-division WGS scaffold records. It
is included below for your reference.
This new accession format (in which the accessions for WGS scaffolds
are very similar to the accessions of the WGS contigs from which they
are constructed) will initially be used for WGS projects that:
a) Have a very large number of contigs (typically, greater than 1 million);
b) Have a correspondingly large number of scaffolds; and
c) Are completely unannotated, at both the contig and scaffold level.
The first WGS project with these properties is ALWZ02.
The contigs for ALWZ02 have been available on the NCBI FTP site since
mid-June 2013, in the genbank/wgs directory. There are 43 pairs of
GenBank flatfile and nucleotide FASTA files. For example:
-rw-r--r-- 1 gbupdate gbproces 147855124 Jun 27 16:24 wgs.ALWZ.1.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 229952261 Jun 27 16:29 wgs.ALWZ.1.gbff.gz
....
-rw-r--r-- 1 gbupdate gbproces 121080718 Jun 27 16:28 wgs.ALWZ.43.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 172868580 Jun 27 16:32 wgs.ALWZ.43.gbff.gz
As of December 4, 2013, there is a similar set of files for
the ALWZ02 scaffolds:
-rw-r--r-- 1 gbupdate gbproces 147881072 Dec 4 10:45 wgs.ALWZ.scflds.1.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 28173600 Dec 4 10:52 wgs.ALWZ.scflds.1.gbff.gz
....
-rw-r--r-- 1 gbupdate gbproces 117159578 Dec 4 10:51 wgs.ALWZ.scflds.48.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces 1202661 Dec 4 10:53 wgs.ALWZ.scflds.48.gbff.gz
There are cognate sets of files for the ASN.1 version of the ALWZ WGS
project, in the ncbi-asn1/wgs directory of the NCBI FTP site.
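For users who script their downloads, the contig and scaffold files can
be told apart by name alone. The following Python sketch is inferred only
from the file names listed above; the pattern, the WgsFile type and the
classify_wgs_file function are illustrative names, not part of any NCBI
tool, and other projects may use file names this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the listing above, for example
# wgs.ALWZ.1.fsa_nt.gz (contigs) and wgs.ALWZ.scflds.48.gbff.gz (scaffolds).
# Assumption: the project code is always four uppercase letters.
WGS_FILE_RE = re.compile(
    r"^wgs\.(?P<project>[A-Z]{4})\."
    r"(?:(?P<scflds>scflds)\.)?"
    r"(?P<part>\d+)\.(?P<kind>fsa_nt|gbff)\.gz$"
)


class WgsFile(NamedTuple):
    project: str       # four-letter WGS project code, e.g. "ALWZ"
    is_scaffold: bool  # True when the name carries the "scflds" component
    part: int          # file number within the set
    kind: str          # "fsa_nt" (nucleotide FASTA) or "gbff" (flatfile)


def classify_wgs_file(name: str) -> Optional[WgsFile]:
    """Return the parsed components of a WGS file name, or None if the
    name does not follow the pattern assumed above."""
    m = WGS_FILE_RE.match(name)
    if m is None:
        return None
    return WgsFile(
        project=m.group("project"),
        is_scaffold=m.group("scflds") is not None,
        part=int(m.group("part")),
        kind=m.group("kind"),
    )


if __name__ == "__main__":
    for name in ("wgs.ALWZ.1.fsa_nt.gz", "wgs.ALWZ.scflds.48.gbff.gz"):
        print(name, classify_wgs_file(name))
```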
Here's an excerpt of the flatfile for the first ALWZ scaffold, which
illustrates the new accession number format:
LOCUS       ALWZ02S0000001           701 bp    DNA     linear   CON 14-JUN-2013
DEFINITION  Picea glauca scaffold316, whole genome shotgun sequence.
ACCESSION   ALWZ02S0000001 ALWZ000000000
VERSION     ALWZ02S0000001.1
DBLINK      BioProject: PRJNA83435
KEYWORDS    WGS.
SOURCE      Picea glauca (white spruce)
So, for WGS projects that meet criteria (a) through (c) above, the
comprehensive WGS FTP areas will now contain data for both contigs
*and* scaffolds, and the scaffold records will use the new
accession format.
NOTE: Assembly-Version 03 of the ALWZ WGS project is being processed
now, so all of the ALWZ files at the NCBI FTP site are likely to be
updated within the next few weeks.
Regards,
Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
=================================================================
1.4 Upcoming Changes
1.4.1 New accession format for CON-division WGS scaffold records
WGS scaffolds that are constructed from WGS contigs currently
make use of a '2+6' accession number format, with two leading
alphabetic characters followed by six digits. Here is an example
of a WGS-master record that references two different ranges of
scaffold accession numbers:
http://www.ncbi.nlm.nih.gov/nuccore/AABR00000000
LOCUS       AABR06000000          112651 rc    DNA     linear   ROD 16-MAR-2012
DEFINITION  Rattus norvegicus strain BN/SsNHsdMCW, whole genome shotgun
            sequencing project.
ACCESSION   AABR00000000
VERSION     AABR00000000.6  GI:380236478
DBLINK      BioProject: PRJNA10629
KEYWORDS    WGS.
SOURCE      Rattus norvegicus (Norway rat)
  ORGANISM  Rattus norvegicus
....
....
WGS         AABR06000001-AABR06112651
WGS_SCAFLD  CM000072-CM000092
WGS_SCAFLD  JH612139-JH620698
//
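As an illustration of how the WGS and WGS_SCAFLD lines in a master
record like the one above might be read by a script, here is a minimal
Python sketch. The function name wgs_ranges and the line pattern are
assumptions made for this example; in particular, treating a line with a
single accession as a one-element range is a guess, not documented
flatfile behavior.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "WGS_SCAFLD  JH612139-JH620698".
# Assumption: accessions consist only of uppercase letters and digits.
RANGE_LINE_RE = re.compile(
    r"^(WGS_SCAFLD|WGS)\s+([A-Z0-9]+)(?:-([A-Z0-9]+))?\s*$"
)

EXCERPT = """\
KEYWORDS    WGS.
WGS         AABR06000001-AABR06112651
WGS_SCAFLD  CM000072-CM000092
WGS_SCAFLD  JH612139-JH620698
//
"""


def wgs_ranges(flatfile_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (first, last) accession pairs keyed by line type."""
    ranges: Dict[str, List[Tuple[str, str]]] = {}
    for line in flatfile_text.splitlines():
        m = RANGE_LINE_RE.match(line.strip())
        if m is None:
            continue  # not a WGS or WGS_SCAFLD range line
        key, first, last = m.groups()
        # Assumed: a lone accession stands for a range of one.
        ranges.setdefault(key, []).append((first, last or first))
    return ranges


if __name__ == "__main__":
    for key, pairs in wgs_ranges(EXCERPT).items():
        print(key, pairs)
```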
Many WGS projects have a large number of chromosome-specific scaffolds
(such as the JH accession range), and a much smaller number of scaffolds
that represent the entirety of the chromosomes (such as the CM
accession range). Because of the former, we are consuming '2+6' prefixes,
like JH, at an unsustainable rate.
So we plan to introduce a new accession format for WGS scaffolds which
mirrors the format of the underlying WGS contigs:
   4-letter WGS project code
   2-digit assembly-version number
   "S" (for 'scaffold')
   Six or seven digits
So in the above example, the set of 'JH' scaffolds could make use of
accession numbers such as AABR06S000001 and AABR06S112651:
WGS         AABR06000001-AABR06112651
WGS_SCAFLD  CM000072-CM000092
WGS_SCAFLD  AABR06S000001-AABR06S112651
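Software that validates accession numbers may need to recognize the new
format alongside the existing '2+6' format. The Python sketch below
encodes the four components listed above as a regular expression. The
names ScaffoldAccession, parse_scaffold_accession and is_legacy_2_plus_6
are made up for this example, and stripping a trailing ".N" sequence
version before matching is a convenience assumed here.

```python
import re
from typing import NamedTuple, Optional

# New format: 4 letters + 2 digits + "S" + six or seven digits.
NEW_SCAFFOLD_RE = re.compile(r"^([A-Z]{4})(\d{2})S(\d{6,7})$")
# Existing '2+6' format: two letters followed by six digits.
LEGACY_RE = re.compile(r"^[A-Z]{2}\d{6}$")


class ScaffoldAccession(NamedTuple):
    project: str           # four-letter WGS project code
    assembly_version: int  # two-digit assembly-version number
    serial: int            # numeric value of the six or seven digits


def _strip_version(accession: str) -> str:
    """Drop a trailing '.N' sequence version, if one is present."""
    return accession.strip().partition(".")[0]


def parse_scaffold_accession(accession: str) -> Optional[ScaffoldAccession]:
    """Parse a new-format scaffold accession; return None otherwise."""
    m = NEW_SCAFFOLD_RE.match(_strip_version(accession))
    if m is None:
        return None
    project, version, serial = m.groups()
    return ScaffoldAccession(project, int(version), int(serial))


def is_legacy_2_plus_6(accession: str) -> bool:
    """True for accessions in the existing '2+6' format, such as JH612139."""
    return LEGACY_RE.match(_strip_version(accession)) is not None


if __name__ == "__main__":
    for acc in ("AABR06S000001", "ALWZ02S0000001.1", "JH612139"):
        print(acc, parse_scaffold_accession(acc), is_legacy_2_plus_6(acc))
```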
We do not currently plan to replace existing '2+6' accessions with
the new '4+2+S+6/7' accessions. However, as of the December 2013
GenBank release, the new format will begin to appear for newly processed
WGS sequencing projects.