Is it just my silliness, or do I have a real problem with accession
numbers?
I've noticed this problem especially with TREMBL updates: when I build
the indexes, _ALL_ the sequences are reported as having no accession number,
although they really do have one.
Trying various possibilities I found the following:
- presence of the underscore in the AN prevents it from being accepted
and precludes generation of a usable database. .HEADER, .SEQ and
.REF files are generated, but the other files are 0-length. The database
is practically useless.
- removal of the underscore leaves many of the sequences without an
accession number (10946 out of 46574), but allows generation of the
databases. Most of the seqs won't be fetchable by AN.
- removal of the "-" and everything after it up to the ";" makes GCG
report no AN for only 8 sequences, but most of them are still
unavailable by AN: now there are many sequences with duplicate
ANs, and of these only the last one can be retrieved.
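For anyone who wants to reproduce the last two workarounds, here is a rough Python sketch of the edits I mean. It assumes the usual EMBL-style flat-file layout (a two-letter line code, then the data starting at column 6, with accession numbers terminated by ";"). The function names and the output file name are made up for illustration, and this is not a recommended fix, since as noted above it leaves duplicate ANs behind.

```python
import re

def clean_ac_line(line, drop_hyphen_suffix=True, drop_underscore=True):
    """Apply the workarounds to an AC line; return any other line unchanged."""
    if not line.startswith("AC "):
        return line
    # Assumption: line code in columns 1-2, data from column 6 onwards.
    tag, body = line[:5], line[5:]
    if drop_hyphen_suffix:
        # Remove "-" and everything after it, up to (not including) the ";".
        body = re.sub(r"-[^;\n]*", "", body)
    if drop_underscore:
        body = body.replace("_", "")
    return tag + body

def clean_file(src_path, dst_path):
    """Copy a flat file, rewriting only its AC lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(clean_ac_line(line))
```

Something like clean_file("tr_upd.dat", "tr_upd.clean.dat") would then produce a copy to feed to embltogcg (the second file name is just an example).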
Experimenting with it, it looks like GCG refuses to accept the underscore
and, at the same time, refuses any AN with more than 8 characters or a "-".
In case of duplicates it only seems to give the latest sequence.
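To get an idea of how many entries are affected before building anything, a small audit script along these lines might help. It is only a sketch: it assumes EMBL-style entries terminated by "//", looks only at the first AN of each entry, and the 8-character limit is what I inferred from my experiments, not something taken from the GCG documentation. The function names are my own.

```python
from collections import Counter

MAX_LEN = 8  # limit inferred from the experiments above, not from GCG docs

def primary_accessions(lines):
    """Yield the first accession number of each entry in an EMBL-style file."""
    found_in_entry = False
    for line in lines:
        if line.startswith("//"):          # end of entry
            found_in_entry = False
        elif line.startswith("AC ") and not found_in_entry:
            first = line[5:].split(";")[0].strip()
            if first:
                found_in_entry = True
                yield first

def audit(lines, max_len=MAX_LEN):
    """Group accession numbers by the kind of problem they might cause."""
    counts = Counter(primary_accessions(lines))
    return {
        "underscore": [a for a in counts if "_" in a],
        "hyphen": [a for a in counts if "-" in a],
        "too_long": [a for a in counts if len(a) > max_len],
        "duplicate": [a for a, n in counts.items() if n > 1],
    }
```

Running audit(open("tr_upd.dat")) returns a dictionary with one list of suspicious ANs per category.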
This was done using "embltogcg -rel=4.1 -year=1998 -month=1 -day=14 \
-protein -dir=. -ignorenames tr_upd.dat" to build the databases on a brand new
GCG v9.1 installation. Other databases also gave "seq xxx has not an AN"
messages for some odd sequences, so I guess there are more inaccessible
sequences around.
I had a quick look at the specs of EMBL, SWISSPROT and TREMBL and couldn't
find any reference to a length limit or a restricted character set for the
AN.
So, the question is: am I doing something wrong? How are other people
solving the problem?
jr
--
Jose R. Valverde
EMBnet/CNB