Chris Barry (chbarry at mackiller.llnl.gov) wrote:
: Could someone tell me how to import a raw sequence file on GCG? Thanks
That depends on what 'raw' format you have.
Initialize GCG as usual and type the command 'genmanual exchange'.
The following is an excerpt from the BioCompanion, Chapter 5, page 38 or
similar, depending on your version. This text assumes that you are on
a UNIX system; on VMS, reading $ for the % prompt will certainly help (the
BioCompanion is also available as a VMS version, site-tailored, or in
electronic form).
Maybe this helps.
Regards
Reinhard
Import of Sequences to the GCG Package
=======================================
To use sequence data on the computer, you need to know what a sequence format
is. After you have transferred a sequence file to your computer, you may need
to reformat the sequence to work with a given sequence analysis package. This
section explains most of the solutions using the GCG package.
Sequence Formats
Briefly, a sequence format is a convention which defines what part of a data
file is interpreted as sequence and what part as additional data. Depending on
the software package used for sequence analysis, some of these additional data
are of importance for processing. E.g., the GCG sequence format defines the
type of the sequence data (protein or DNA). Other elements set the date, or log
a line containing the length of the file. Therefore, a given sequence format is
difficult to maintain in a normal text editor, and, usually, computer programs
dedicated to sequence editing will deal with the details.
Plain Text Sequence Format
The plain text sequence format is typically generated by word processors
(saved as text file with line breaks) or by electronic sources such as mail
messages. A plain text file should contain only sequence data and, therefore,
may need editing to strip all additional data.
Sequence Formats Ready to Use with Sequence Analysis Packages
Sequence formats ready to use with sequence analysis packages are either
generated within a sequence analysis package, e.g.,
GCG Sequence Analysis Package:
Comments and sequence are separated by a single line
ending with two periods (..).
Sequence Analysis Packages from IntelliGenetics, Inc.: IG suite,
PCGene, etc.
Characteristic: Comment lines start with a semicolon (;).
Other Sequence Analysis Packages: These can be either commercial
(e.g., DNA*) or from the public domain (e.g., PEARSON format,
also called FASTA format). Ask your local program consultant
if you work with these packages and need to convert one format
into another (see below).
or come from the original databases. This can be either from a local
installation, or by network retrieval tools, such as electronic mail or
World-Wide Web. Examples:
EMBL or SWISSPROT sequence database: each line of an entry starts with
a two-character code.
ID (entry code)
... (other fields) ...
SQ (then the sequence)
//
GENBANK sequence database: entries start with words in a tabulated text.
LOCUS (entry code)
..........(other fields) ...
ORIGIN ...(then the sequence)
//
PIR International sequence database: entries start with a > character
>P1; (entry code)
... (one line of text) ...
(sequence, finished by a *)
(possibly, more text)
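The markers listed above are enough for a rough guess at what kind of file
you are holding. The following Python sketch is only an illustration: the
function name guess_format is made up here, the checks are heuristics taken
from the descriptions above, and none of it is part of GCG or readseq.

```python
def guess_format(text):
    """Guess a sequence format from the markers described above (heuristic)."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    # GCG: comments and sequence are separated by a line ending with '..'.
    # Checked first because a GCG file may keep the header of its source entry.
    if any(ln.endswith("..") for ln in lines):
        return "GCG"
    first = lines[0]
    if first.startswith("ID "):
        return "EMBL/SWISSPROT"      # two-character line codes: ID ... SQ ... //
    if first.startswith("LOCUS"):
        return "GENBANK"             # keywords: LOCUS ... ORIGIN ... //
    if len(first) >= 4 and first[0] == ">" and first[3] == ";":
        return "PIR"                 # e.g. '>P1;' followed by the entry code
    if first.startswith(">"):
        return "FASTA"               # PEARSON format
    if first.startswith(";"):
        return "IG"                  # comment lines start with a semicolon
    return "plain text or unknown"


if __name__ == "__main__":
    print(guess_format(">P1;ABC\none line of text\nMKV*"))
```

A guess like this only helps to pick the matching from... program; it does
not check that the entry is complete or valid.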
Reformatting Sequences
Refer to the section "Transfer of Data" for details on how to copy data from
and to other computers.
Reformatting from other Packages
The program readseq (public domain; author: Gilbert) is very useful to
interconvert all kinds of sequence formats. Alternatively, try one of the
programs of the GCG package. To get information about GCG's reformatting
programs, use
% genmanual sequence_exchange
The following selection of programs should cover most of your needs.
NOTE: When reformatting a sequence, the sequence name of the original
sequence is adopted. The original file name is replaced by the name of the
corresponding sequence in the originating database; e.g., if you have used
the file name 'test.seq' in an export from electronic mail, WWW, ENTREZ,
or similar, and the entry obtained from EMBL is M12345, the reformatting
will result in a file called 'm12345.embl' and not retain the file name used
before.
from GENBANK (NCBI)
% fromgenbank
from EMBL (EBI)
% fromembl
from the IG suite package
% fromig
from programs of PIR (e.g., ATLAS)
% frompir
from ASCII files (e.g., electronic mail, or STADEN package)
% fromstaden
if errors occur (because lines are too long), first use
% chopup
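The NOTE above on file names can be surprising in practice, so here is a
small Python sketch that predicts the output name for an EMBL entry. It
assumes that the entry code is the second field of the ID line and that the
name is simply lowercased with '.embl' appended, as in the 'm12345.embl'
example; the function name is hypothetical, and the exact rule used by
fromembl may differ.

```python
def predicted_gcg_name(entry_text):
    """Predict the file name created when reformatting an EMBL entry.

    Assumption: the entry code is the second field of the ID line.
    Returns None if no usable ID line is found.
    """
    for line in entry_text.splitlines():
        if line.startswith("ID"):
            fields = line.split()
            if len(fields) >= 2:
                # Drop a trailing ';' in case the code is written 'M12345;'
                return fields[1].rstrip(";").lower() + ".embl"
    return None


if __name__ == "__main__":
    entry = "ID   M12345     standard; DNA; 120 BP.\nSQ   Sequence\n//"
    print(predicted_gcg_name(entry))
```

Knowing the name in advance makes it easier to find the new file after the
conversion, since 'test.seq' will not be reused.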
Reformatting from Established GCG Sequences
The program 'reformat' allows you to convert between the various GCG-type
formats and also helps if sequences are corrupted (checksum changed).
To get information on this program, use
% genhelp reformat
or
% reformat -check
(The sequence of exercise 1 must be treated this way).
Reformatting from "Unknowns"
A plain text file (only sequence data) is a good place to start. Use your
text editor to create such a file. To convert the file to the GCG sequence
format, put two periods (..) at the beginning of the text. Then, use
% reformat
to obtain the final GCG-type format.
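If the "unknown" file is full of position numbers and blanks, the editing
step can also be scripted. The Python sketch below is a minimal example
under stated assumptions: it keeps only letters (so gap or stop symbols
would be lost), wraps the sequence at 60 characters, and puts the two
periods on a line of their own at the top. The function name and the line
width are arbitrary choices made here, not GCG requirements.

```python
def prepare_for_reformat(raw_text, width=60):
    """Turn pasted sequence text into a plain file that starts with '..'.

    Assumption: every letter belongs to the sequence; digits, blanks and
    punctuation are additional data and are dropped.
    """
    residues = "".join(ch for ch in raw_text if ch.isalpha())
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    # Two periods at the beginning of the text, then the bare sequence
    return "..\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    pasted = "  1 acgt acgt\n 11 ttga\n"
    print(prepare_for_reformat(pasted), end="")
```

Save the output to a file and run reformat on that file as described above.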
===============
--
Reinhard Doelz, Basel, Switzerland