Computer readable FASTA output

Ewan Birney birney at molbiol.ox.ac.uk
Fri Oct 20 04:02:54 EST 1995

>>>LCBO prolactin precursor - bovine
>; n1: 229
>; initn:  442
>; init1:  314
>; opt: 501
>; z-score: 600.7
>; expect: 1.5e-27
>; smith-waterman: 501
>; ident: 0.365 
>; overlap: 222
>; start_seq1: 1
>; stop_seq1: 224
>; start_seq2: 1
>; stop_seq2: 229
>>musplf ..
>>LCBO ..
>>>LCPG prolactin precursor - pig                     (229 aa)

This looks great - I agree with Keith's point about adding // as a delimiter
to each entry.

What are the .. points at the end of the sequence? Is this for better
GCG parsing (???) or indicative of more text for each entry?

Can I also suggest that the tags use start_query and start_hit rather than
start_seq1 and start_seq2? That would make seq1 and seq2 mean more to people
reading it.

Naturally - this suggests making a "standardised" format for database searches,
which would be:

; tag: item
; tag: item
Query_sequence alignment
Hit_sequence alignment
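
As a rough sketch of how a record in this layout might be read (Python; the
name parse_record, the stop_query and stop_hit tags, and the example values are
assumptions for illustration, not part of any agreed format):

```python
def parse_record(lines):
    """Split one record into a {tag: item} dict and its alignment lines.

    Assumes every tag line starts with ';' and has the form '; tag: item',
    and that any other non-blank line belongs to the alignment.
    """
    tags = {}
    alignment = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(";"):
            tag, sep, item = line[1:].partition(":")
            if not sep:
                raise ValueError(f"tag line without ':': {line!r}")
            tags[tag.strip()] = item.strip()
        else:
            alignment.append(line)
    return tags, alignment


# Hypothetical record; start_query/start_hit follow the naming suggested here.
record = """\
; start_query: 1
; stop_query: 224
; start_hit: 1
; stop_hit: 229
Query_sequence alignment
Hit_sequence alignment
""".splitlines()

tags, alignment = parse_record(record)
```

Items are kept as strings here; converting them to numbers would depend on
which tags end up being agreed.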

It would be trivial for me to get SearchWise to chuck this sort of thing out
(except.... what do you do with TFASTA/Protein query vs DNA sequence hits?)

Can I suggest one thing in the parsing: that items are either a single word
or, for strings, delimited by " ". Do you want to build in a line-overrun
system (some sort of backslash)?
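
One possible reading of that rule, as a sketch only (Python; it assumes a
trailing backslash means "continued on the next line" and that strings are
wrapped in double quotes, neither of which has been agreed; the function names
and the description tag are made up for the example):

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    joined, pending = [], ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash, keep the rest
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)  # dangling continuation on the last line
    return joined


def parse_item(text):
    """Return an item that is either one word or a double-quoted string."""
    text = text.strip()
    if text.startswith('"'):
        if len(text) < 2 or not text.endswith('"'):
            raise ValueError(f"unterminated string: {text!r}")
        return text[1:-1]
    if len(text.split()) != 1:
        raise ValueError(f"unquoted item must be one word: {text!r}")
    return text


# Hypothetical input: a quoted description split over two lines.
lines = [
    '; description: "prolactin precursor \\',
    '- bovine"',
    "; opt: 501",
]
merged = join_continuations(lines)
```

Joining continuation lines before splitting on the colon keeps the tag parsing
itself unchanged.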

And should we have an agreed set of tags (e.g. start_query, start_hit)?

This is a good suggestion, though.


birney at molbiol.ox.ac.uk

More information about the Bio-soft mailing list
