proy at rsvs.ulaval.ca (Paul Roy) writes:
> Dear GCGers:
> I recently downloaded the FASTA format flat file TB.dbs from the
> Sanger center database. In running FROMFASTA I noticed that there were
> several occurrences of the same name given to two or more adjacent
> sequences of different lengths, presumably non-assembled sequences from
> the same cosmid. In Unix systems this results in all but the last
> sequence being lost. Does anyone have a work-around for this other than
> hand editing the titles in the 4 meg flat file?
The TB.dbs file is not very GCG-friendly (or should that be the other
way around?)
Finished cosmids just have the cosmid name:
>MTCY164
The unfinished cosmids in the file have a consecutive series of sequences:
>Cosmid=cY432; Contig ID=01311; Length=8195; Status=Unfinished
>Cosmid=cY432; Contig ID=00864; Length=29989; Status=Unfinished
... and so on.
The easiest way is to adjust the unfinished names with a Perl script
so they end with "_1", "_2" and so on.
% sangerfix.pl TB.dbs > TB.fix
% cat sangerfix.pl
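#!/usr/local/bin/perl
# Rename unfinished-cosmid contigs to MT<cosmid>_1, MT<cosmid>_2, ...
# Reads FASTA from the named file(s) or stdin, writes to stdout.
$cosmid = "xxxx";    # cosmid seen on the previous contig header
$i = 0;              # contig counter within the current cosmid
while (<>) {
    if (/^>Cosmid=(\w+);/) {
        # Restart the counter whenever a new cosmid begins
        if ($cosmid ne $1) { $cosmid = $1; $i = 0 }
        ++$i;
        print ">MT$cosmid", "_$i\n";
    }
    else { print }    # finished-cosmid headers and sequence lines pass through
}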
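
Before running FROMFASTA on the fixed file, it may be worth checking that no name still occurs more than once. A minimal sketch, assuming a POSIX shell with sed, sort and uniq, and assuming the sequence name is the first whitespace-delimited token after the ">" (the helper name dup_names is hypothetical, not part of GCG or the Sanger files):

```shell
# dup_names FILE: print each FASTA header name that occurs more than once.
# Assumption: the "name" is the first whitespace-delimited token after ">".
dup_names() {
    # Keep header lines only, strip the ">" and everything after the name,
    # then let sort + uniq -d report the names that repeat.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort | uniq -d
}
```

Calling "dup_names TB.fix" should print nothing if every name is unique; any name it does print would still collide.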
--
----------------------------------------------------------------------
Peter Rice | Informatics Division, The Sanger Centre,
E-mail: pmr at sanger.ac.uk | Wellcome Trust Genome Campus,
Tel: (44) 1223 494967 | Hinxton, Cambridge, CB10 1SA, England
Fax: (44) 1223 494919 | URL: http://www.sanger.ac.uk/Users/pmr/