Reformatting EMBL sequences for GCG

Charles Bailey bailey at hmivax.humgen.upenn.edu
Tue Jul 13 16:45:00 EST 1993

In article <930713.143326.6518 at medinfo.rochester.edu>, charles at MEDINFO.ROCHESTER.EDU ("Charles A. Alexander") writes:
> Hi GCGers
> Does anybody know how to reformat EMBL sequences for GCG?  These sequences were 
> retrieved from the NCBI site.  I used Fromembl but I get the following error:

EMBL sequences returned by the NCBI mail server are not in EMBL format.  I'm
not sure what to call the header format, but I suspect that it's a result of
reading an ASN.1 record to generate the reply document.

> *** ERROR in CopyToSQLine
>      File ends before the sequence is found! ***
> Is it an error on my part?  Or are the files formatted differently (sans //)
> at the NCBI?  GCG expects to read // at the end.

No error of yours.  Actually, the problem is not a missing // - FromEMBL will
just read to the end of the file and quit happily - but the missing 'SQ' string
on the last line of the header, which causes CopyToSQLine to keep reading
through the entire file without finding what it thinks is the end of the

> Any clues?

The following patch to FromEMBL will allow it to handle sequences in the format
returned by NCBI.  Note that the header of the resulting GCG sequence file will
still be in the format used in the NCBI document, so if you have any tools
which read the header to obtain information about the sequence (e.g. parse
feature table entries), they will not work on the converted sequence.

*** fromembl.for
--- fromembl_new.for
*** 180,186
  	    Call StrTruncate(OutLine)
  	    Call StrAerate(OutLine)
  	    Call WriteString(OutFile, OutLine)
! 	    If ( StrFind('SQ',Token).eq.1 ) then
  	      Call WriteString(OutFile, ' ')
  	    End If
--- 180,187 -----
  	    Call StrTruncate(OutLine)
  	    Call StrAerate(OutLine)
  	    Call WriteString(OutFile, OutLine)
! 	    If ( ( StrFind('SQ',Token).eq.1 ) .or. 
!      &       ( StrFind('SEQUENCE',Token).eq.1 ) ) then
  	      Call WriteString(OutFile, ' ')
  	    End If
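
For anyone who wants to see the logic of the change outside the GCG sources,
here is a rough sketch in Python.  It is not the GCG code: the function name
header_end_index and the sample records are made up for illustration, and it
assumes that StrFind(...).eq.1 means "the token begins with this string" and
that the comparison is case-sensitive.

```python
def header_end_index(lines, markers=("SQ", "SEQUENCE")):
    """Return the index of the line that ends the header, or None.

    A line ends the header when its first whitespace-delimited token
    begins with one of the marker strings (case-sensitive here, which
    is an assumption about the original routine).
    """
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0].startswith(markers):
            return index
    return None  # corresponds to "File ends before the sequence is found!"


# Hypothetical records, invented for this example only.
embl_style = ["ID   X00001", "SQ   Sequence 12 BP;", "     acgtacgtacgt", "//"]
ncbi_style = ["LOCUS    X00001", "SEQUENCE  12 bp", "acgtacgtacgt"]

# Original test: only 'SQ' is recognized.
old_embl = header_end_index(embl_style, markers=("SQ",))
old_ncbi = header_end_index(ncbi_style, markers=("SQ",))

# Patched test: 'SQ' or 'SEQUENCE' is recognized.
new_ncbi = header_end_index(ncbi_style)
```

With only 'SQ' as a marker the search runs off the end of the second record,
which is the situation that produces the CopyToSQLine error; with 'SEQUENCE'
added, the header end is found in both records.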

Note: For those unfamiliar with the patch format, it is a context diff suitable
for input into the Wall patch(1) tool.  This utility is found on many Unix
systems, and is available for VMS as well.  If you don't have Wall patch on your
system, you can either make the changes to GenSource: FromEMBL.For by hand
(just change the line preceded by ! in the first block to the lines preceded by
! in the second block), or you can get the C source code for the Wall patch
tool (with some extensions) by anonymous ftp here, as file
Anon_Root:[Util.VMS.SysMaint]Wall_Patch_Diff.Zip.  This is a ZIP archive; UnZIP
for VMS is available in Anon_Root:[Util.VMS.Archive]UnZip.Exe.

Further instructions on incorporating changes into GCG source code can be found
in the System Support Manual.

I hope this helps.  If anyone has any problems with this patch, please don't
hesitate to drop me a line.  Good luck.

                    Charles Bailey

!             Dept. of Genetics / Howard Hughes Medical Institute
! University of Pennsylvania School of Medicine  Rm. 430 Clinical Research Bldg.
!     422 Curie Blvd.  Philadelphia, PA 19104 USA      Tel. (215) 898-1699
!          Internet: bailey at genetics.upenn.edu  (IN
