
Randomised sequences

Brian Fristensky frist at cc.umanitoba.ca
Thu Feb 27 05:29:36 EST 2003




Hans Stenvien wrote:
> I have data sets consisting of several thousand unique coding sequences. For
> each sequence, I would like to generate a number of randomised sequences
> with the same length and nucleotide composition. Does anyone know of
> appropriate online service/software allowing me to input all of my sequences
> from one file? It would be great if one file is generated containing all the
> random sequences for all original sequences. I have so far only found online
> services/softwares enabling me to shuffle nucleotide order in one sequence
> at a time (i.e. manual input of each sequence). I am afraid I am not able to
> do any programming myself.
> 
> Any help is appreciated.
> 
> Best regards,
> Hans
> 
> 

The shuffle program from XYLEM will do that. The manual page is appended
below. XYLEM can be downloaded from:

http://home.cc.umanitoba.ca/%7Epsgendb/XYLEM.html

Binaries are available for Solaris and Linux, and source
code should compile readily on other platforms, since
the code is pretty simple.

-- 
======================================
Brian Fristensky (ON SABBATICAL til July 1, 2003)
Department of Plant Science
University of Manitoba
Winnipeg, MB R3T 2N2  CANADA
frist at cc.umanitoba.ca
Sabbatical phone: 204-474-6724
Voicemail:        204-474-6085
Home phone:       204-261-3960
FAX:              204-474-7528
http://home.cc.umanitoba.ca/~frist/
===========================================================

The most unforgiveable sin of all is being right too soon.

===========================================================

      shuffle.doc                                           update 3 Feb 94

      SYNOPSIS
            shuffle -sn [-wn -on]

      DESCRIPTION
           Shuffles sequences locally. See Lipman DJ, Wilbur WJ, Smith TF
           and Waterman MS (1984) On the statistical significance of nucleic
           acid similarities. Nucl. Acids Res. 12:215-226. 

           -sn    n is a random integer between 0 and 32767. This number
                  must be provided for each run.

           -wn    n is an integer, indicating the width of the window for
                  random localization. If w exceeds the length of a sequence,
                  or is negative, the entire sequence is scrambled as a single
                  window. This is also the case if w is not specified.

           -on    n is an integer, indicating the number of nucleotides
                  overlap between adjacent windows. It should never exceed
                  the window size.  o defaults to 0 if not specified.

           If w and o are specified, overlapping windows of w nucleotides
           are shuffled, thus preserving the local characteristic base
           composition. Windows overlap by o nucleotides. 

           If w and o are not specified, each sequence is shuffled globally,
           thus preserving the overall base composition, but not the local
           variations in composition.
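
The difference between windowed and global shuffling can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the XYLEM source: the function name `shuffle_windows`, its arguments, and the choice to step through the sequence by (window - overlap) positions are all assumptions made for the example.

```python
import random


def shuffle_windows(seq, seed, window=None, overlap=0):
    """Shuffle seq in overlapping windows; hypothetical sketch, not XYLEM code."""
    # Mirror the documented seed range for -s.
    if not 0 <= seed <= 32767:
        raise ValueError("seed must be between 0 and 32767")
    rng = random.Random(seed)
    chars = list(seq)
    n = len(chars)

    # No window, a negative (or zero) window, or a window at least as long
    # as the sequence: treat the whole sequence as a single window.
    if window is None or window <= 0 or window >= n:
        rng.shuffle(chars)
        return "".join(chars)

    # Assumption: an overlap equal to the window would never advance,
    # so it is rejected here.
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than the window")

    step = window - overlap
    for start in range(0, n, step):
        block = chars[start:start + window]
        rng.shuffle(block)                    # permute this window only
        chars[start:start + window] = block   # write it back in place
    return "".join(chars)


if __name__ == "__main__":
    original = "ctgagtagctagtcctaaatagttagtccatagtactagtacgggtcgtt"
    print(shuffle_windows(original, 9873, window=12, overlap=3))
    print(shuffle_windows(original, 9873))    # global shuffle
```

Because every step only permutes characters that are already present, the overall composition of the output always matches the input; a small window additionally keeps locally AT-rich or GC-rich stretches roughly where they were.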

           Any number of sequences may be processed from a single input
           file.  In Pearson-format files, each new sequence begins with a
           '>' comment line, indicating the name and a short description of
           the sequence.
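
For readers who want to see what processing a multi-sequence Pearson-format file involves, here is a minimal Python sketch. It is not part of XYLEM. The names `read_pearson` and `shuffled_records`, the `copies` parameter, and the "[shuffled copy i]" label appended to each name are hypothetical, and only a global shuffle is shown.

```python
import random


def read_pearson(lines):
    """Yield (header, sequence) pairs from Pearson-format lines."""
    name, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line.strip())
    if name is not None:
        yield name, "".join(parts)


def shuffled_records(lines, seed, copies=1):
    """Yield `copies` globally shuffled versions of every input sequence."""
    rng = random.Random(seed)
    for name, seq in read_pearson(lines):
        for i in range(1, copies + 1):
            chars = list(seq)
            rng.shuffle(chars)  # same length and composition as the original
            yield f"{name} [shuffled copy {i}]", "".join(chars)


if __name__ == "__main__":
    example = [">seq1 first example\n", "acgtacgt\n", "ttga\n",
               ">seq2 second example\n", "ggccaatt\n"]
    for header, sequence in shuffled_records(example, seed=9873, copies=2):
        print(">" + header)
        print(sequence)
```

All shuffled copies for all input sequences come out in one stream, which can be redirected to a single output file.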

           No distinction is made between protein and nucleic acid sequences.
           That is, shuffle will read any of the following characters as
           sequence:

           T,U,C,A,G,N,R,Y,M,W,S,K,D,H,V,B,L,Z,F,P,E,I,Q,X,*,-

           where '*' is the result of translating a stop codon, and '-'
           is a gap generated during sequence alignment. Lowercase is
           also accepted.
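
Before running a large dataset, it may help to check that the input contains only the characters listed above. The Python sketch below does that. The helper name `unexpected_characters` is hypothetical, and how shuffle itself treats characters outside the list is not stated here, so the sketch only reports them.

```python
# Characters listed in the documentation as valid sequence symbols.
ACCEPTED = set("TUCAGNRYMWSKDHVBLZFPEIQX*-")


def unexpected_characters(seq):
    """Return the sorted characters in seq that are not in the accepted list.

    Lowercase letters are compared after conversion to uppercase, since
    lowercase is also accepted.
    """
    return sorted({ch for ch in seq if ch.upper() not in ACCEPTED})


if __name__ == "__main__":
    print(unexpected_characters("acgtn-*"))
    print(unexpected_characters("ACGTJ1o"))
```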

      EXAMPLE
           A sample output file is shown below. Note that the first two
           lines of output are comment lines, listing the version of the
           program and the parameters used in the run.

           >SHUFFLE                   VERSION 11/ 8/93
           >RANDOM SEED:     9873          WINDOW:   12          OVERLAP:   3
           >BAZFAZ - Borborigmus azerbi F-actin-zeta gene
           ctgagtagctagtcctaaatagttagtccatagtactagtacgggtcgtt
           cacccttgggcagtg.....(etc.)

      AUTHOR
        Dr. Brian Fristensky
        Dept. of Plant Science
        University of Manitoba
        Winnipeg, MB  Canada  R3T 2N2
        Phone: 204-474-6085
        FAX: 204-474-7528
        frist at cc.umanitoba.ca

      REFERENCE
        Fristensky, B. (1993) Feature expressions: creating and manipulating
        sequence datasets. Nucleic Acids Research 21:5997-6003.




