M. tuberculosis genome database

Peter Rice pmr at sanger.ac.uk
Thu May 22 10:08:14 EST 1997

proy at rsvs.ulaval.ca (Paul Roy) writes:
> Dear GCGers:
>      I recently downloaded the FASTA format flat file  TB.dbs  from the
> Sanger center database.  In running FROMFASTA I noticed that there were
> several occurrences of the same name given to two or more adjacent
> sequences of different lengths, presumably non-assembled sequences from
> the same cosmid.  In Unix systems this results in all but the last
> sequence being lost.  Does anyone have a work-around for this other than
> hand editing the titles in the 4 meg flat file?

The TB.dbs file is not very GCG-friendly (or should that be the other
way around?)

Finished cosmids just have the cosmid name:


The unfinished cosmids in the file have a consecutive series of sequences:

>Cosmid=cY432; Contig ID=01311; Length=8195; Status=Unfinished

>Cosmid=cY432; Contig ID=00864; Length=29989; Status=Unfinished

... and so on.

The easiest work-around is a Perl script that renames the unfinished
entries so that each contig of a cosmid gets a unique name ending in
"_1", "_2" and so on.

% sangerfix.pl TB.dbs > TB.fix

% cat sangerfix.pl


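while (<>) {
    if (/^>Cosmid=(\w+);/) {
        # Restart the counter whenever a new cosmid begins
        if ($cosmid ne $1) {$cosmid = $1; $i = 0}
        $i++;
        print ">MT$cosmid", "_$i\n";
    }
    else {print}    # sequence lines and other headers pass through unchanged
}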
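
Before running FROMFASTA again it may be worth checking that no two
entries still share a name. The following is only a sketch: the script
name checknames.pl and the sub duplicate_names are made up for this
example, and it assumes that an entry's name is the first word after ">"
on its header line, which may not be exactly how FROMFASTA derives names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the entry names that occur more than once, in order of first
# appearance. ASSUMPTION: a name is the first word after ">" on a header line.
sub duplicate_names {
    my @lines = @_;
    my (%count, @order);
    for my $line (@lines) {
        next unless $line =~ /^>(\S+)/;          # skip sequence lines
        push @order, $1 unless $count{$1}++;     # remember first sighting
    }
    return grep { $count{$_} > 1 } @order;
}

# Only read input when files are named on the command line,
# e.g.  checknames.pl TB.fix
if (@ARGV) {
    my @dups = duplicate_names(<>);
    print "$_\n" for @dups;
    print scalar(@dups), " duplicated name(s)\n";
}
```

If the renaming worked, the last line of output should report zero
duplicated names for the fixed file.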

Peter Rice                | Informatics Division, The Sanger Centre,
E-mail: pmr at sanger.ac.uk  | Wellcome Trust Genome Campus,
Tel: (44) 1223 494967     | Hinxton, Cambridge, CB10 1SA, England
Fax: (44) 1223 494919     | URL: http://www.sanger.ac.uk/Users/pmr/
