Hi,
Given a (long) list of Swiss-Prot accession numbers, I want to
retrieve, by script:
a. all their sequences in fasta format.
b. all their matching DNA coding sequences in fasta format.
I know how to do a.
(at http://www.expasy.ch/sprot/sprot-retrieve-list.html
or through the Swiss-Prot flat file)
I also see that each _individual_ Swiss-Prot protein entry is linked to
EMBL / GenBank / DDBJ for what they refer to as its "NOT_ANNOTATED_CDS"
(the first field under "Cross-references").
But how can I automate this DNA CDS retrieval?
I can imagine two ways (but don't know how to perform either):
a. Find a Swiss-Prot flat file containing both side by side.
b. 1. Convert each Swiss-Prot accession into its GenBank accession
(e.g. P02906 -> X13380).
2. Retrieve the "NOT_ANNOTATED_CDS" sequences by their GenBank accession
numbers.
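
To make step b.1 concrete, here is a rough Python sketch of what I imagine. It rests on an assumption I have not been able to confirm on my copy: that the flat file carries cross-reference lines of the form "DR   EMBL; <accession>; ..." and that each entry ends with "//". The function name embl_xrefs is made up for illustration.

```python
def embl_xrefs(flatfile_text):
    """Map each entry's primary accession to the EMBL/GenBank accessions
    found on its DR lines.

    ASSUMPTION: entries look like
        AC   P02906;
        DR   EMBL; X13380; ...
        //
    If the flat file has no such DR lines, every list comes back empty.
    """
    mapping = {}
    primary = None   # primary accession of the entry being read
    xrefs = []       # EMBL accessions collected for that entry

    for line in flatfile_text.splitlines():
        if line.startswith("AC   ") and primary is None:
            # The first accession on the first AC line is the primary one.
            primary = line[5:].split(";")[0].strip()
        elif line.startswith("DR   EMBL;"):
            fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
            if len(fields) > 1 and fields[1]:
                xrefs.append(fields[1])
        elif line.startswith("//"):
            # End of entry: store what was collected and reset.
            if primary is not None:
                mapping[primary] = xrefs
            primary, xrefs = None, []

    return mapping
```

Filtering the result down to my own list of accessions would then be a plain dictionary lookup.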
While b. seems more realistic:
- I don't know how to perform step 1
for a large group of Swiss-Prot accession numbers
(the Swiss-Prot flat file does not seem to contain the GenBank cross-reference).
- For step 2, the best I could find, given a list of GenBank accession numbers,
is how to retrieve <= 50 at a time ( http://www.ebi.ac.uk/cgi-bin/emblfetch ),
but my lists are much longer...
(and I'm not even sure that web page returns only the "NOT_ANNOTATED_CDS")
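
One workaround for the limit of 50 might be to cut the long list into batches and submit them one after another. A minimal Python sketch of the batching part only; the helper name batches is made up, and how to submit each batch to emblfetch from a script is exactly the part I don't know, so it is left out.

```python
def batches(accessions, size=50):
    """Yield consecutive slices of at most `size` accessions.

    ASSUMPTION: 50 is the per-request limit of the retrieval page;
    change `size` if the limit turns out to be different.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(accessions), size):
        yield accessions[start:start + size]
```

Each batch could then be joined into whatever input format the retrieval page expects and sent as one request.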
Help?
Thanks,
-Gill