>>>LCBO prolactin precursor - bovine
>; n1: 229
>; initn: 442
>; init1: 314
>; opt: 501
>; z-score: 600.7
>; expect: 1.5e-27
>; smith-waterman: 501
>; ident: 0.365
>; overlap: 222
>; start_seq1: 1
>; stop_seq1: 224
>; start_seq2: 1
>; stop_seq2: 229
>>musplf ..
> MLPSLIQPCSWILLLLLVNSSLLWKNVASFPMCAMRNGRCFMSFEDTFE
>LAGSLSHNISIEVSELFTEFEKHYSNVSGLRDKSPMRCNTSFLPTPENKE
>QARLTHYSALLKSGAMILDAWESPLDDLVSELSTIKNVPDIIISKATDIK
>KKINAVRNGVNALMSTMLQNGDEEKKNPAWF....LQSDNEDARIHSLYG
>MISCLDNDFKKVDIYLNVLKCYMLKIDNC
>>LCBO ..
>MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFD
>RAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKE
>QAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIE
>EENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYN
>LLHCLRRDSSKIDTYLKLLNCRIIYNNNC
>>>LCPG prolactin precursor - pig (229 aa)
This looks great - I agree with Keith's point about adding a // delimiter
to each entry.
What are the .. points at the end of the sequence? Is this for better
GCG parsing (???) or indicative of more text for each entry?
Can I also suggest that the tags be
start_query:
stop_query:
and
start_hit:
stop_hit:
which would make seq1 and seq2 mean more to people reading it.
Naturally - this suggests making a "standardised" format for database searches
which would be
>>Hit_name
; tag: item
; tag: item
;
>query_name
Query_sequence alignment
>hit_name
Hit_sequence alignment
//
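To make the layout above concrete, here is a rough Python sketch of a reader for it. This is only an illustration under assumptions: the function name parse_hits and the dict layout are made up for this example, the sample sequences are truncated to the first 20 residues of the alignment quoted above, and the reader tolerates a missing // by starting a new entry at the next >> line.

```python
def parse_hits(text):
    """Parse entries in the proposed '>>hit / ; tag: item / >name / //' layout."""
    hits = []
    hit = None      # entry currently being read
    current = None  # sequence block currently being filled
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line == "//":
            # End-of-entry delimiter.
            if hit is not None:
                hits.append(hit)
            hit, current = None, None
        elif line.startswith(">>"):
            # A new hit; flush the previous one if it was not closed by //.
            if hit is not None:
                hits.append(hit)
            hit = {"name": line[2:].strip(), "tags": {}, "seqs": []}
            current = None
        elif hit is None:
            continue  # ignore anything outside an entry
        elif line.startswith(";"):
            body = line[1:].strip()
            if body:  # a bare ';' just closes the tag block
                tag, _, item = body.partition(":")
                hit["tags"][tag.strip()] = item.strip()
        elif line.startswith(">"):
            current = {"name": line[1:].strip(), "seq": ""}
            hit["seqs"].append(current)
        elif current is not None:
            current["seq"] += line.strip()
    if hit is not None:
        hits.append(hit)
    return hits


# Hypothetical sample entry: tag values and truncated sequences taken from
# the quoted output, using the suggested start_query / start_hit tag names.
SAMPLE = """
>>LCBO prolactin precursor - bovine
; opt: 501
; ident: 0.365
; start_query: 1
; start_hit: 1
;
>musplf
MLPSLIQPCS
WILLLLLVNS
>LCBO
MDSKGSSQKG
SRLLLLLVVS
//
"""

if __name__ == "__main__":
    for h in parse_hits(SAMPLE):
        print(h["name"], h["tags"], [s["name"] for s in h["seqs"]])
```

The first > block after the tags is taken to be the query and the second the hit, as in the layout above; the gap characters in the aligned sequences are kept exactly as written.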
It would be trivial for me to get SearchWise to chuck this sort of thing out
(except.... what do you do with TFASTA/Protein query vs DNA sequence hits?)
Can I suggest one thing for the parsing: that items are either
one word or " " delimited strings. Do you want to build in a line-overrun
system (some sort of backslash)?
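As a sketch of what that item rule could look like in practice (Python, purely illustrative): the helper names join_continuations and parse_tag_line are invented here, and the backslash-at-end-of-line convention is just one possible answer to the line-overrun question, not an agreed part of the format.

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    merged, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash, keep reading
        else:
            merged.append(pending + line)
            pending = ""
    if pending:
        merged.append(pending)  # trailing backslash on the last line
    return merged


def parse_tag_line(line):
    """Split '; tag: item', where item is one word or a "quoted string"."""
    body = line.lstrip(";").strip()
    tag, sep, item = body.partition(":")
    tag, item = tag.strip(), item.strip()
    if not sep or not tag:
        raise ValueError("expected '; tag: item', got %r" % line)
    if item.startswith('"'):
        if len(item) < 2 or not item.endswith('"'):
            raise ValueError("unterminated string in %r" % line)
        return tag, item[1:-1]
    if len(item.split()) != 1:
        raise ValueError("unquoted item must be a single word: %r" % line)
    return tag, item


if __name__ == "__main__":
    raw = [
        "; start_query: 1",
        '; description: "prolactin precursor \\',
        '- bovine"',
    ]
    for line in join_continuations(raw):
        print(parse_tag_line(line))
```

Rejecting unquoted multi-word items keeps the rule strict: anything with a space in it has to be quoted, so a parser never has to guess where an item ends.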
And should we have an agreed set of tags (e.g. start_query, start_hit)?
This is a good suggestion though
ewan
birney at molbiol.ox.ac.uk