Has anybody successfully used dbifasta to index the GCG-supplied
genpept database directly?
The first problem was that the header lines looked like:
>BAA35036 GB:AB001396 E2 region [Hepatitis C virus] (ver 1)
and dbifasta listed no matching format. But it turned out there was a
gcgaccid format in the dbifasta code; it just wasn't in the acd file.
So I added it to the acd file, and it showed up in the menu, but it
wouldn't run (it would just start and stop with no warning or error
messages).
Having faced this sort of header problem about a million times before,
I used fastamungheader
( ftp://saf.bio.caltech.edu/pub/software/molbio/fastamungheader.c )
to rewrite genpept.seq into a supported header format, with all the
lines having exactly the same length:
>GB:AB001396 BAA35036 E2 region [Hepatitis C virus] (ver 1).
and used the gcgidacc switch. That ran to completion, generating along
the way around a zillion lines like:
This is a warning: Duplicate ID skipped: Z99759
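For anyone who wants to try the same header swap without the C tool, here is a minimal Python sketch. It is not fastamungheader and makes no attempt to reproduce its options or to pad lines to a fixed length. It only assumes the layout shown above (">ACC DB:ID description") and moves the DB:ID token to the front. The function names are made up for this example.

```python
def swap_gcg_header(line):
    """Turn '>ACC DB:ID rest' into '>DB:ID ACC rest'.

    Assumes the genpept header layout quoted in this post; any line that
    does not look like that is returned unchanged.
    """
    if not line.startswith(">"):
        return line  # sequence data, leave alone
    parts = line[1:].rstrip("\n").split(None, 2)
    if len(parts) < 2 or ":" not in parts[1]:
        return line  # not the layout assumed here
    acc, db_id = parts[0], parts[1]
    rest = parts[2] if len(parts) > 2 else ""
    new = ">" + db_id + " " + acc
    if rest:
        new += " " + rest
    # Preserve the original line ending, if there was one.
    return new + ("\n" if line.endswith("\n") else "")


def swap_headers(lines):
    """Apply swap_gcg_header to every line of an iterable (e.g. a file)."""
    for line in lines:
        yield swap_gcg_header(line)
```

Fed the header from this post, swap_gcg_header returns ">GB:AB001396 BAA35036 E2 region [Hepatitis C virus] (ver 1)". Sequence lines pass through untouched.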
Then I put back the original genpept.seq file, not knowing what the
GCG software might or might not be expecting in the FASTA header.
After that sequences could be retrieved with:
# seqret -sequence genpept:BAA35036 -filter
>BAA35036 GB:AB001396 E2 region [Hepatitis C virus] (ver 1)
RTNVMGGAAAITTRGFVSLFTLINSQR
but not
# seqret -sequence genpept:AB001396
Reads and writes (returns) sequences
^C
(after giving up waiting for a prompt to reappear), versus:
# seqret -sequence genpept:wombat
Reads and writes (returns) sequences
An error has been found: EMBLCD Entry failed
An error has been found: Database 'genpept' : access method 'emblcd' failed
An error has been found: option -sequence: Unable to read sequence 'genpept:wombat'
There is a serious problem: seqret terminated: Bad value for option and no prompt
That is, specify nonsense and it blows up instantly, which is fine.
Specify a GenBank ID number and it goes bonkers. I can just picture
somebody using w2h and specifying a genpept:ID combo and locking the
server until the cows come home. To prevent that I'm deleting the
indices for now.
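A possible stopgap short of deleting the indices would be to run lookups under a time limit, so that a query that never returns cannot tie up the machine. The sketch below uses only Python's standard subprocess module. The run_with_timeout name and the 30 second limit in the usage note are assumptions for illustration, and nothing here is part of EMBOSS or w2h.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd (a list of arguments) with a time limit.

    Returns (True, stdout_text) if the command finished in time, or
    (False, "") if it had to be killed. stdin is closed so the child
    cannot sit waiting for interactive input.
    """
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return False, ""
    return True, done.stdout
```

For example, run_with_timeout(["seqret", "-sequence", "genpept:AB001396", "-filter"], 30) would give up after 30 seconds instead of hanging. Whether that particular query hangs on another installation is something I can't promise.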
The /usr/local/share/EMBOSS/emboss.default file has this for genpept:
DB genpept [
method: gcg
format: fasta
dir: $emboss_db_dir/gcggenpept
file: *.seq
# optional parameters
type: P
release: 122.0
indexdir: $emboss_index_dir/gcggenpept
]
What are you folks doing with genpept? Right now it looks like the
only safe thing to do is to index on the one field and ignore the
other.
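To decide which field is safe to index on, it may help to check how often each header token repeats. The sketch below counts the first and second tokens of every ">" line. My guess, and it is only a guess, is that the "Duplicate ID skipped" warnings come from several protein entries sharing one GB: identifier, and this would confirm or refute that. The function names are hypothetical.

```python
from collections import Counter


def count_header_fields(lines):
    """Count occurrences of the first and second token on each '>' line.

    Returns two Counters: one keyed by the first token, one keyed by
    the second. Headers with fewer than two tokens are skipped.
    """
    first, second = Counter(), Counter()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if len(parts) >= 2:
                first[parts[0]] += 1
                second[parts[1]] += 1
    return first, second


def duplicates(counter):
    """Return only the keys that were seen more than once."""
    return {key: n for key, n in counter.items() if n > 1}
```

Passing an open genpept.seq file handle to count_header_fields and then calling duplicates on each result shows which of the two fields, if either, is unique across the file.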
Thanks,
David Mathog
mathog at caltech.edu