IUBio

fasta file writing

Reinhard Doelz doelz at comp.bioz.unibas.ch
Thu Aug 17 14:07:41 EST 1995


Peter Rice (pmr at sanger.ac.uk) wrote:

: On a related note, I would like to be able to put some extra text on the
: ">" line of FASTA format output - with the aim of using SRS to generate a
: database subset for blast indexing and searching. Then (for example) the
: accession number and definition could appear in the blast search results.

The accession number is a bit more effort than the description itself, 
which is what we've done earlier by modifying the corresponding routines.

You might use SeqWriteFasta in seq.c as a template and call it MyWriteFasta.

INT4 MyWriteFasta (SEQo *seq, char *nam, INT4 (*print)(char*,...))
...

  print (">%s %60.60s\n", nam == NULL ? seq->name : nam, seq->cmnt);
                                                         ^^^^^^^^^^ 
...                                                see seq.h - this is the seq.
                                                   description.

The only problem is that seq.h declares cmnt as a pointer rather an array, 
which is ok in most places but nicer if it is treated like 'nam'. 
In particular,as it treats the cmt as string in the SeqWrite Routine but 
is not populatedi because this is commented out.
I have made it read SEQxXCMNT characters which is OK in all examples I tested. 
We use home-produced GCG format which does not have the LEN: field. The 
stuff we use is rather 'NBRF' format which is constant for the past years. 
Unfortunately, Thure tests for LEN: as he assumes that GCG format is produced 
by GCG tools, and this made our names disappear entirely in the SeqWriteFasta
routine. I changed that, and the cmnt change from pointer to array  
made it necessary to change seq.h , too: 

40,41c40
< /*  char  *cmnt;  */   /* sequence description */
<   char  cmnt[SEQxXCMNT];     /* sequence description */
---
>   char  *cmnt;     /* sequence description */

And, as SeqWrite is static as internal function you must change 
this and declare the prototype non-static in both seq.h and seq.c 
(this is not in the diffs). 
 
in seq.c I had: 

 bioa.embnet.unibas.ch > diff seq.c seq.c.orig
156c156
<     strcpy(tmp->cmnt, "");
---
>     tmp->cmnt = NULL;
191,192c191,192
<     /* if (tmp->cmnt)
<       free(tmp->cmnt); */
---
>     if (tmp->cmnt)
>       free(tmp->cmnt);
911c911,912
<   strncpy ((*seq)->cmnt, file->ln,SEQxXCMNT);
---
>   strcpy ((*seq)->cmnt, file->ln);
>
1327,1330c1328
<   } else { /* RD */
<      sscanf (file->ln, "%s", (*seq)->name);
<    }
<
---
>   }
1340c1338
<   strncpy ((*seq)->cmnt, file->ln,SEQxXCMNT); /* was commented out, copied in again,
 RD */
---
>      /*  strcpy (seq->cmnt, file->ln); */
 bioa.embnet.unibas.ch >



To make it work you could use the following as example. 
I use the code as listed below, and have made SeqWrite a non- static function 
by adding it to seq.h and commenting 'static' in seq.c properly. 
This example writes all sequences of swissprot with a particular keyword
in the desired fasta format.


 bioz.embnet.unibas.ch >  ./test | more
166 entries written to set "q"
>ACTB_DICDI  CALCIUM-REGULATED ACTIN BUNDLING PROTEIN.
MAETKVAPNLTGIEQTKAGQSFTEKLSAEAMEFFCNVAKLPFSQQAVHFL
NAYWAEVSKEAEFIYSVGWETIKYADMHCKGIQLVFKYDEGNDLDFDIAL
YFYEQLCKFCEDPKNKNYATTYPISQPQMLTALKRKQELREKVDVNFDGR
VSFLEYLLYQYKDFANPADFCTRSMNHDEHPEIKKARLALEEVNKRIRAY
EEEKARLTEESKIPGVKGLGATNMLAQIDSGPLKEQLNFALISAEAAVRT
ASKKYGGAAYSGGAGDAGAGSSAGAIWWMNRDLEEKKKRYGPQKK
>ARP_EUGGR   CALCIUM-BINDING ACIDIC-REPEAT PROTEIN PRECURSOR (ARP).
MSHLWCWLFLVLCLACLVLSIEAKDSDGDGLLDVDEINVYFTDPYNADSD
QDGLTDGLEVNRHQTHPQDKDTDDDSIGDGVEVNNLGTNPKDPDSDDDGL
TDGAEVNLYRTDPLDADSTTTGCPMGGGAEVRHRPQNGDTDDDGLTDGAE
VNVHRTNPQDGDSDDDGLSDGAEVNTYHSNPKDGDSDDDGVSDGAEVNPK
LKDSDGDGLTDEEEIKLYRTDPFCADSDFDGLLDGEEVKVHKTNPLDGDS
DDDGLGDGAEVTHFNTNPLDADSDNDGLDDGEEINVHGTDPEDPDSDNDG
LNDGDEVNVYNTDPEEDDSDEDGVCDGAEVNVHHTNPKDEDSDNDGIPDG
AEINTHKTDPNDEDSDDDGIADGAEVTLTDSDGDGLPDEDEVALYNTNPA
NADSDYDGLTDGAEVKRYQSNPLDKDTDDDGLGDGVEVTVGTDPHDATVT
TTGSRTAVEINVHGSDPNDEDTDDDGLTDGAEVNLHRTDPEDADTDDDGL
TDGAEVNTYR
... 


etc. The source is: 


#include <stdio.h>
#include "srs.h"


INT4 MyWriteFasta (SEQo *seq, char *nam, INT4 (*print)(char*,...))
{
  INT4 lncnt = 0;

  if (!print)
    print = (INT4 (*)(char*,...)) printf;

  if (!seq) return 0;
  print (">%s %60.60s\n", nam == NULL ? seq->name : nam, seq->cmnt);
  lncnt = SeqWrite (seq->seq, seq->len, 50, 50, 0, print); 
  return lncnt+1; 
       
}


int main ()
 {
   SETo     *set;
   IDoENTRY id;
   ENTRYo   *entry;
   int      n, entryN;
   SEQo      *seq=NULL;
   
   SrsEnv ();
   LibOpen ("srswin");
   
   ParDefStr ("fieldList", "ID");
   ParDefStr ("fieldList", "Definition");
   
   if (QryDo ("[swissprot-definition:calcium*]","Q")) {
     set = SetGet ("Q"); 
     entryN = SetSize ("Q");
     
     for (n=1;  n <= entryN; n++) {
       SetGetID (set, n, &id); 
       entry = EntryOpen (&id);
       SlbGetSequence (entry, &seq);
       MyWriteFasta (seq, NULL, (int (*)(char*,...)) printf); 
       SeqClr (&seq);
       
       EntryClose (&entry);
     }
   }
 }
     

The makefile I used was 
 bioz.embnet.unibas.ch > cat Makefile

CC = cc  -g
EXE = /biox6/srs/SRS/srs4_06/bin/osf
SRC = /biox6/srs/SRS/srs4_06/src
INCLUDE = -I$(SRC) -I$(SRSEXE)
LIBS =
LOADFLAGS = $(LIBS) $(SRSEXE)/libsrs.a

default: test2

test: test.c
        $(CC) test.c $(INCLUDE) $(LOADFLAGS) -o test




Maybe this helps.
Regards
Reinhard Doelz
BioComputing Basel


 
-- 
 R.Doelz         Klingelbergstr.70| Tel. x41 61 267 2247  Fax x41 61 267 2078|
 Biocomputing        CH 4056 Basel| electronic Mail    doelz at ubaclu.unibas.ch|
 Biozentrum der Universitaet Basel|-------------- Switzerland ---------------|
<a href=http://beta.embnet.unibas.ch/>EMBnet Switzerland:info at ch.embnet.org</a> 




More information about the Bio-srs mailing list

Send comments to us at biosci-help [At] net.bio.net