M. tuberculosis genome database

Peter Rice pmr at sanger.ac.uk
Thu May 22 10:08:14 EST 1997


proy at rsvs.ulaval.ca (Paul Roy) writes:
> Dear GCGers:
>      I recently downloaded the FASTA format flat file  TB.dbs  from the
> Sanger center database.  In running FROMFASTA I noticed that there were
> several occurrences of the same name given to two or more adjacent
> sequences of different lengths, presumably non-assembled sequences from
> the same cosmid.  In Unix systems this results in all but the last
> sequence being lost.  Does anyone have a work-around for this other than
> hand editing the titles in the 4 meg flat file?

The TB.dbs file is not very GCG-friendly (or should that be the other
way around?)

Finished cosmids just have the cosmid name:

>MTCY164

The unfinished cosmids in the file appear as a consecutive series of
contigs that all carry the same cosmid name:

>Cosmid=cY432; Contig ID=01311; Length=8195; Status=Unfinished

>Cosmid=cY432; Contig ID=00864; Length=29989; Status=Unfinished

... and so on.

The easiest way is to adjust the unfinished names with a Perl script
so they end with "_1", "_2" and so on. The script writes to standard
output, so redirect it into the new file:

% sangerfix.pl TB.dbs > TB.fix

% cat sangerfix.pl
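#!/usr/local/bin/perl
# Rename unfinished-cosmid entries so each contig gets a unique name:
# ">Cosmid=cY432; ..." becomes ">MTcY432_1", ">MTcY432_2" and so on.
# Reads the file(s) named on the command line, writes to standard output.

$cosmid = "xxxx";    # placeholder that matches no real cosmid name
$i = 0;              # contig counter within the current cosmid

while (<>) {
    if (/^>Cosmid=(\w+);/) {
        # A new cosmid name restarts the numbering.
        if ($cosmid ne $1) { $cosmid = $1; $i = 0 }
        ++$i;
        print ">MT${cosmid}_$i\n";
    }
    else { print }    # finished cosmids and sequence lines pass through
}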
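
To check whether a file still contains repeated entry names, something
like the following shell function may help. It is only a sketch: the
name dup_names is made up for illustration, and it assumes the entry
name is the first whitespace-delimited word of each ">" header line,
which may not match exactly how FROMFASTA derives names.

```shell
# List FASTA entry names that occur more than once in a file.
# Assumption: the entry name is the first whitespace-delimited word
# of each header line (the lines starting with ">").
dup_names() {
    awk '/^>/ { print $1 }' "$1" | sort | uniq -d
}
```

Running "dup_names TB.dbs" should list the repeated "Cosmid=" names,
and "dup_names TB.fix" should print nothing if the renaming worked.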


-- 
----------------------------------------------------------------------
Peter Rice                | Informatics Division, The Sanger Centre,
E-mail: pmr at sanger.ac.uk  | Wellcome Trust Genome Campus,
Tel: (44) 1223 494967     | Hinxton, Cambridge, CB10 1SA, England
Fax: (44) 1223 494919     | URL: http://www.sanger.ac.uk/Users/pmr/



More information about the Info-gcg mailing list
