Making a "true" EST consensus -- 2

Dr. Rob Miller rmiller at house.med.und.ac.za
Tue Dec 24 07:11:45 EST 1996

Many thanks to those of you who suggested we try the current
databases of EST consensuses -- we're trying to improve on them!

We use UniGene and BodyMap to benchmark our clusters.  Our aim is to
create a database of EST consensus sequences that can be searched by
sequence rather than by starting from tissue or clone information.  We
would very much like someone who searches with the complete sequence to
find the correct EST consensuses (and eventually get a nice alignment
with a big gap between the 5' and 3' fragments in the database).  We
also need to be able to find the right 3' region for a hit on an
associated 5' consensus, but the database submission format doesn't
appear to handle specific linkages for multiple alignments together
with non-specific linkages for clone-related fragments.  We believe the
best approach will be to submit `artificially linked' 3' and 5'
consensus sequences where appropriate, but we are unsure what format
the linker region should take, given the variety of alignment software
and algorithms out there.
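
As a rough illustration of finding the right 3' region for a hit on the
associated 5' consensus, the sketch below splits an artificially linked
record at its linker and reports which part a hit position falls in.
Everything here is an assumption made for illustration: the function
names `split_linked` and `locate_hit`, the choice of N as the linker
character, the minimum run length of 10, and the made-up example record.
A real consensus may itself contain runs of N, so the threshold would
need care.

```python
import re

def split_linked(seq, linker_char="N", min_run=10):
    """Split an artificially linked consensus at its first linker run.

    Returns (five_prime, three_prime, linker_start, linker_end),
    or None if no run of at least min_run linker characters is found.
    """
    match = re.search(re.escape(linker_char) + "{%d,}" % min_run, seq)
    if match is None:
        return None
    return seq[:match.start()], seq[match.end():], match.start(), match.end()

def locate_hit(seq, hit_pos, linker_char="N", min_run=10):
    """Report which part of the linked record a 0-based hit position falls in."""
    parts = split_linked(seq, linker_char, min_run)
    if parts is None:
        return "unlinked"
    _, _, start, end = parts
    if hit_pos < start:
        return "5'"
    if hit_pos < end:
        return "linker"
    return "3'"

# Made-up example record (hypothetical data): 20 bases, 15 N's, 20 bases.
record = "ACGT" * 5 + "N" * 15 + "TTGCA" * 4
five, three, start, end = split_linked(record)
print(len(five), len(three), start, end)
print(locate_hit(record, 5), locate_hit(record, 40))
```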

We're still interested in any hints on this (or preferences from those
of you whose software may have to deal with it in coming years! :-)

                        Merriest of Christmases to you all,


rmiller at house.med.und.ac.za


Sorry, the SANBI address info below won't work over Christmas.  Happy
thoughts to you if you can direct a copy of your reply to
rmiller at house.med.und.ac.za, as I don't always have good experiences
with international news servers :-)

Dr. Rob Miller wrote:

 Hi there,

  Got some nucleotide sequence alignment/search/database questions for
you :

  How do we link 3' EST to 5' EST fragments from the same clone in order
 to make the linked consensus useful for subsequent searching, alignment
 and/or translation?

 We're developing a set of EST consensus sequences to submit to a
public database, and naturally we'd like these to be of the greatest
utility possible.  We are thinking about the most useful format for the

  What is the best way to link data for ESTs which come from the same
 clone -- a way that will preferably result in gaps inserted in the
 linker region when someone comes along and searches the database with
 the sequence of the full clone?

 Specifically, we'll be creating artificial consensus sequences from
two EST consensuses, e.g. a 5' EST AAAAAAAAAAAAAA and a 3' EST

 So our questions are:

   * What are the ramifications of

      using NNN's (unassigned) :


      or using ----'s (gap) :

            AAAAAAAAAAAAA-----------------ZZZZZZZZZZZZZZZZ  ???

      between the two sequences ?

    * how many characters would be ideal ?

    * what else could be used ?
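
  To make the two options concrete, here is a minimal sketch of how the
  linked records might be built.  The function name `link_consensus`,
  the linker length of 17, and the placeholder sequences are all
  assumptions for illustration only; the sketch does not say which
  linker character or length alignment software will handle best, which
  is exactly the open question.

```python
def link_consensus(five_prime, three_prime, linker_char="N", linker_len=17):
    """Join a 5' and a 3' consensus with a run of one linker character.

    linker_char and linker_len are illustrative defaults, not a
    recommendation.
    """
    if len(linker_char) != 1:
        raise ValueError("linker_char must be a single character")
    if linker_len < 1:
        raise ValueError("linker_len must be at least 1")
    return five_prime + linker_char * linker_len + three_prime

# Placeholder sequences in the style of the message (not real bases).
five = "A" * 14
three = "Z" * 16

linked_n = link_consensus(five, three, linker_char="N")    # unassigned bases
linked_gap = link_consensus(five, three, linker_char="-")  # gap characters

print(linked_n)
print(linked_gap)
```

  Keeping the linker character and length as parameters makes it cheap
  to generate several variants of the same record and try each against
  different search and alignment programs.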

  We invite any helpful comments, and feel free to e-mail a copy of
 your reply to

to make certain we see it.

                                 thanks in advance,


Robert T. Miller, Ph.D.                         
rmiller at house.med.und.ac.za

Manager - Durban Satellite - South African National Bioinformatics

Faculty of Medicine / Dept of Virology / University of Natal 
Private Bag 7 / Congella 4013 / Durban / South Africa 
phone +27 (031) 3603743                     fax +27 (031) 3603744 or

