Accession numbers

José R. Valverde txomsy at cnb.uam.es
Thu Jan 15 12:37:11 EST 1998

Is it just my silliness, or do I have a real problem with accession numbers?

    I've noticed this problem especially with TREMBL updates: when I build
the indexes, _ALL_ the sequences are reported as having no accession number
(AN), although they really do have one.

    Trying various possibilities I found the following:
    	- the presence of the underscore in the AN prevents it from being accepted
	  and precludes generation of a usable database. .HEADER, .SEQ and
	  .REF files are generated, but the other files are 0-length. The database
	  is practically useless.
	- removing the underscore leaves many of the sequences without an
	  accession number (10946 out of 46574), but allows generation of the
	  databases. Most of the seqs won't be fetchable by AN.
	- removing the "-" and everything after it up to the ";" makes GCG
	  report no AN for only 8 sequences, but most of them are still
	  unavailable by AN: now there are many sequences with duplicate
	  ANs, and of each set of duplicates only the last one can be retrieved.
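
    As a rough illustration, the two edits from the last two experiments can
be sketched in Python as below. This is only a sketch of the idea, not the exact
commands used: the function names are made up for illustration, the example AC
line is a hypothetical value, and it assumes the ANs sit on EMBL-style "AC"
lines with each AN terminated by ";".

```python
import re


def strip_underscore(line):
    """Remove underscores from an AC line; leave every other line alone.

    Assumption: accession numbers live on lines starting with "AC ".
    """
    if not line.startswith("AC "):
        return line
    return line.replace("_", "")


def strip_hyphen_suffix(line):
    """Remove each "-" and everything after it up to (not including) the ";".

    Only AC lines are touched, so hyphens elsewhere in the entry survive.
    """
    if not line.startswith("AC "):
        return line
    return re.sub(r"-[^;]*", "", line)
```

With a made-up line such as "AC   X_12345-1;", strip_underscore gives
"AC   X12345-1;" and strip_hyphen_suffix gives "AC   X_12345;". Two ANs that
differ only after the "-" collapse to the same string, which would be
consistent with the duplicates seen in the last experiment.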

    From these experiments, it looks like GCG refuses to accept the underscore
and also refuses any AN that has more than 8 characters or contains a "-".
In case of duplicates it seems to return only the last sequence.
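
    To see how many entries would be hit by these suspected rules before
building the indexes, something like the following Python sketch could be run
over the flat file. It encodes only my guesses above (no "_", no "-", at most
8 characters), not any documented GCG behaviour. The function names are
invented for illustration, and it assumes that only the first AN on the first
"AC" line of each entry matters and that entries end with "//".

```python
from collections import Counter


def primary_accessions(lines):
    """Yield the first accession number of each entry in EMBL-style text."""
    seen_ac = False
    for line in lines:
        if line.startswith("//"):          # end of entry
            seen_ac = False
        elif line.startswith("AC ") and not seen_ac:
            fields = [f.strip() for f in line[2:].split(";") if f.strip()]
            if fields:
                seen_ac = True
                yield fields[0]


def suspect_reasons(an, max_len=8):
    """List which of the suspected restrictions an AN would violate."""
    reasons = []
    if "_" in an:
        reasons.append("underscore")
    if "-" in an:
        reasons.append("hyphen")
    if len(an) > max_len:
        reasons.append("too long")
    return reasons


def report(lines):
    """Return (suspect ANs with reasons, sorted list of duplicated ANs)."""
    counts = Counter(primary_accessions(lines))
    suspects = {}
    for an in counts:
        reasons = suspect_reasons(an)
        if reasons:
            suspects[an] = reasons
    duplicates = sorted(an for an, n in counts.items() if n > 1)
    return suspects, duplicates
```

Passing an open file handle for the database (for example tr_upd.dat) to
report() returns the ANs that break one of the suspected rules, plus the ANs
that occur in more than one entry.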

    All this was done using "embltogcg -rel=4.1 -year=1998 -month=1 -day=14 \
-protein -dir=. -ignorenames tr_upd.dat" to build the databases on a brand new
GCG v9.1. Other databases also gave messages like "seq xxx has not an AN"
for some odd sequences, so I guess there are more inaccessible sequences

    I had a quick look at the specs of EMBL, SWISSPROT and TREMBL and couldn't
find any reference to a length limit or a restricted character composition of the
accession number.

    So, the question is: am I doing something wrong? How are other people
solving the problem?


Jose R. Valverde

More information about the Info-gcg mailing list
