importing raw sequence files

Reinhard Doelz doelz at comp.bioz.unibas.ch
Thu Feb 8 06:47:02 EST 1996

Chris Barry (chbarry at mackiller.llnl.gov) wrote:
: Could someone tell me how to import a raw sequence file on GCG? Thanks

That depends on which 'raw' format you have. 

Initialize GCG as usual and type the command 'genmanual exchange'. 

The following is an excerpt from the BioCompanion, Chapter 5, page 38 or 
similar, depending on your version. This text assumes that you are on 
a UNIX system, but exchanging the $ and % prompt characters should be all 
that is needed elsewhere (the BioCompanion is also available as a VMS 
version, site-tailored, or in electronic form). 

Maybe this helps. 


Import of Sequences to the GCG Package

To use sequence data on the computer, you need to know what a sequence format 
is. After you have transferred a sequence file to your computer, you may need 
to reformat the sequence to work with a given sequence analysis package. This 
section explains most of the solutions using the GCG package. 

Sequence Formats

Briefly, a sequence format is a convention which defines what part of a data 
file is interpreted as sequence and what part as additional data. Depending on 
the software package used for sequence analysis, some of these additional data
are of importance for processing. E.g., the GCG sequence format defines the 
type of the sequence data (protein or DNA). Other elements set the date, or log
a line containing the length of the file. Therefore, a given sequence format is
difficult to maintain in a normal text editor, and, usually, computer programs
dedicated to sequence editing will deal with the details. 

Plain Text Sequence Format 

The plain text sequence format is typically generated by word processors 
(saved as a text file with line breaks) or by electronic sources such as mail 
messages. A plain text file should contain only sequence data and, therefore, 
may need editing to strip all additional data. 
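
If the plain text was pasted from a numbered listing, the position numbers 
and blanks can be stripped with a few lines of Python. This is only a sketch 
and not part of GCG: the function name is made up, and it assumes that every 
letter is sequence and everything else is not, so comment lines (mail headers 
and the like) still have to be deleted by hand first. 

```python
def strip_non_sequence(text):
    """Keep only letters; drop digits, blanks, and punctuation."""
    # Assumption: every letter in the input is a sequence symbol.
    return "".join(ch for ch in text if ch.isalpha())


if __name__ == "__main__":
    # Hypothetical input: a sequence pasted with position numbers.
    pasted = "  1 acgtac gtacg\n 11 ttga\n"
    print(strip_non_sequence(pasted))
```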

Sequence Formats Ready to Use with Sequence Analysis Packages 

Sequence formats ready to use with sequence analysis packages are either 
generated within a sequence analysis package, e.g., 

     GCG Sequence Analysis Package: 
               Comments and sequence are separated by a single line
               ending with two periods (..). 

     Sequence Analysis Packages from IntelliGenetics, Inc.: IG suite, 
               PCGene, etc.
               Characteristic: Comment lines start with a semicolon (;). 

     Other Sequence Analysis Packages: These can be either commercial 
              (e.g., DNA*) or from the public domain (e.g., PEARSON format, 
              also called FASTA format). Ask your local program consultant 
              if you work with these packages and need to convert one format
              into another (see below). 

or come from the original databases, either from a local installation or 
via network retrieval tools such as electronic mail or the World-Wide Web. 
Examples: 

     EMBL or SWISSPROT sequence database: entries start with two 
     characters on each line. 

ID  (entry code)  
... (other fields) ...  
SQ  (then the sequence)   

     GENBANK sequence database: entries start with words in a tabulated text. 

LOCUS     (entry code)  
..........(other fields) ...  
ORIGIN ...(then the sequence)   

     PIR International sequence database: entries start with a > character. 

>P1; (entry code)  
... (one line of text) ...   
(sequence, finished by a *)  
(possibly, more text)   
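
The markers listed above are enough for a rough guess at the format of a 
file. The following Python sketch is an illustration only, not a GCG tool. 
The function name and the returned labels are made up, and the check for a 
Pearson/FASTA file (a > line without the PIR-style ';' in fourth position) is 
an assumption that is not taken from the text above. 

```python
def guess_format(text):
    """Guess a sequence format from the markers described above."""
    # Ignore blank lines and trailing blanks.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    first = lines[0]
    if first.startswith("ID "):        # EMBL / SWISSPROT two-letter code
        return "embl"
    if first.startswith("LOCUS"):      # GENBANK keyword
        return "genbank"
    if first.startswith(">"):
        # PIR entries look like '>P1;code'; anything else is assumed FASTA.
        return "pir" if first[3:4] == ";" else "fasta"
    if first.startswith(";"):          # IntelliGenetics comment line
        return "ig"
    if any(ln.endswith("..") for ln in lines):
        return "gcg"                   # GCG divider line ends with '..'
    return "plain"
```

A result of "plain" only means that none of the markers was found; it does 
not prove that the file holds nothing but sequence. 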

Reformatting Sequences 

Refer to the section "Transfer of Data" for details on how to copy data from 
and to other computers. 

Reformatting from other Packages 

The program readseq (PD; Author: Gilbert) is very useful for interconverting all 
kinds of sequence formats. Alternatively, try one of the programs of the GCG 
package. To get information about GCG's reformatting programs, use 

% genmanual sequence_exchange 

The following selection of programs should cover most of your needs. 

NOTE: When reformatting a sequence, the name of the original sequence is 
adopted: the file name you used is replaced by the name of the 
corresponding sequence in the originating database. E.g., if you have used 
the file name 'test.seq' in an export from electronic mail, WWW, ENTREZ, 
or similar, and the entry obtained from EMBL is M12345, the reformatting 
will result in a file called 'm12345.embl' and will not retain the file name 
used. 

from GENBANK (NCBI)              

% fromgenbank 

from EMBL (EBI)               

% fromembl 

from the IG suite package   

% fromig 

from programs of PIR (e.g., ATLAS)  

% frompir 

from ASCII files (e.g., electronic mail, or STADEN package)  

% fromstaden 

If errors occur (because lines are too long), first use 

% chopup 
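
What breaking up over-long lines amounts to can be sketched in Python. This 
is not GCG's chopup and makes no claim about how that program works; the 
function name and the width of 60 characters are assumptions chosen for 
illustration. 

```python
def chop_lines(text, width=60):
    """Split every line longer than `width` into pieces of at most `width`."""
    out = []
    for line in text.splitlines():
        if not line:
            out.append(line)           # keep empty lines as they are
            continue
        for start in range(0, len(line), width):
            out.append(line[start:start + width])
    return "\n".join(out) + "\n"
```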

Reformatting from Established GCG Sequences 

The program 'reformat' allows you to convert from and to various GCG-type 
formats and also helps if sequences are corrupted (checksum changed). 
To get information on this program, use 

% genhelp reformat 


% reformat -check 

(The sequence of exercise 1 must be treated this way). 

Reformatting from "Unknowns" 

A plain text file (only sequence data) is a good place to start. Use your 
text editor to create such a file. To convert the file to the GCG sequence 
format, put two periods (..) at the beginning of the text. Then, use 

% reformat 

to obtain the final GCG-type format. 
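
The editing step can also be scripted. A minimal Python sketch, assuming the 
input holds sequence data only; the function names and file paths are 
invented for illustration, and the result still has to be run through 
reformat as described above. 

```python
def add_gcg_divider(sequence_text):
    """Return the text with a '..' line placed in front of it."""
    # Drop leading/trailing blank lines so '..' is directly followed by data.
    body = sequence_text.strip()
    return "..\n" + body + "\n"


def convert_file(in_path, out_path):
    """Read a plain text sequence file and write it with the '..' divider."""
    with open(in_path) as src:
        text = src.read()
    with open(out_path, "w") as dst:
        dst.write(add_gcg_divider(text))
```

For example, convert_file("raw.txt", "raw.seq") would write a file that 
starts with the two periods and is ready for '% reformat'. 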


Reinhard Doelz, Basel, Switzerland
