Software to read whole genomes?

Sean Eddy eddy at wol.wustl.edu
Wed Oct 1 11:09:01 EST 1997

In article <60re0s$227 at gap.cco.caltech.edu> mathog at seqaxp.bio.caltech.edu writes:
  >The EST database is not consistent with respect to clone orientation.
  >To illustrate this point, I picked a single clone at random from near the
  >beginning of GB_EST1, with reasonable confidence that it would not conform
  >to the format you describe above...

Since your example wasn't a WashU EST, I can't help you there.  I did
say that I was describing the format of the WashU Genbank files. I'm
not arguing that the public EST data isn't flawed; only that at WashU,
they already try to provide the info you need, in as parsable a form as
possible. Since there are automated scripts generating the Genbank
files for any given WashU EST project, you're pretty much guaranteed a
high degree of consistency in the format of these usually "free text" fields.

  >Going back to the example that you cite, which admittedly does contain the
  >direction information, the .r1/.s1 notation is nice, but it is in an
  >unparseable format coded into the definition field.  By unparseable, I mean
  >just that a generalized program that reads Genbank data fields will not be
  >able to trivially determine forward/reverse, since this information is
  >contained in a nonstandard format *for the database as a whole* within
  >another field. 

Somewhat agreed. Talk to NCBI, I guess. LaDeana does the best she can,
given the restrictions of Genbank format. There aren't fields for this
stuff. I don't see why it's not possible to parse it, though; I sure
do. If it's a WashU EST, look at the first word in the DEFINITION
line, and look for .r1 for a 3' read and .s1 for a 5' read. A quick
Perl script. You can't do all the ESTs that way, only WashU ones, but
WashU ESTs are the bulk of the public data.
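
For anyone who wants a starting point, here is a minimal sketch of that
check, written in Python rather than Perl. It assumes the read tag sits at
the very end of the first word on the DEFINITION line, and it takes the
.r1/.s1 meaning straight from the paragraph above. The function name, the
lookup table, and the sample record are made up for illustration; check
them against real WashU records before relying on the result.

```python
import re

# Suffix meanings as stated in the post above (assumption: verify on real data).
READ_END_BY_SUFFIX = {"r1": "3'", "s1": "5'"}


def washu_read_end(record_text):
    """Return "3'" or "5'" for one GenBank flat-file record, else None.

    Only the first word after the DEFINITION keyword is inspected, and only
    a trailing .r1 or .s1 tag is recognized. Records that do not follow the
    WashU naming convention yield None.
    """
    for line in record_text.splitlines():
        if line.startswith("DEFINITION"):
            words = line.split()
            if len(words) < 2:
                return None  # DEFINITION keyword with nothing after it
            match = re.search(r"\.(r1|s1)$", words[1])
            if match is None:
                return None  # first word carries no recognized read tag
            return READ_END_BY_SUFFIX[match.group(1)]
    return None  # record has no DEFINITION line


# Made-up record, for illustration only.
example = "LOCUS       X00000\nDEFINITION  ab12c03.s1 made-up library cDNA clone\n"
print(washu_read_end(example))
```

Returning None for anything that does not match keeps non-WashU entries
from being silently assigned a direction.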

  > [stuff about reversed clone orientations]

I think you might be confused about the difference between a read
direction and a clone direction, or else you're wanting data that we
simply don't have. I don't think we can provide adequately reliable
clone orientation data. A few libraries aren't even directionally
cloned or dT-primed. LaDeana tells you in the file which end was read
(.r1 vs. .s1), tells you if it was a directionally cloned library,
clips the vector, determines a high-quality stop from PHRED's base
calls, and tells you if her BLAST QC scripts detected an anomalous
similarity on the wrong strand, probably but not necessarily
indicating that the clone orientation is reversed. What can she do for
reversed clones that lack a telltale BLAST hit? Conversely, what about
mRNA transcripts that overlap a gene on the other strand, and thus
give you a BLAST hit on the "wrong" strand?  She can't be sure if a 3'
read really corresponds to a 3' end of an mRNA, so she doesn't
annotate clone orientation, and I agree with her.

  >The take home lesson is that not all EST entries contain direction
  >information, or contain that information in the same format, and even if
  >they have that information, the orientation may be questionable, and there
  >is no indication of the reliability of the information presented.  None of 
  >this matters much if you are working with 10-20 ESTs by hand, but if you 
  >are trying to process thousands of them, well, have fun.

Agreed. Constructive suggestions are welcome on how to improve the
"product" coming out of the WashU EST group; I'll pass them along to
the folks here. General comments about the EST data format in Genbank
and dbEST as a whole should probably be directed to NCBI instead.

It's important to remember what ESTs are, though -- high-throughput
raw data generated from biologically crude sources. Artifacts from a
variety of sources are inherent in the process. This isn't genome
sequencing, where you get a consistent assembly and a high-quality
consensus. We find ESTs just as difficult to analyze. The community
sees the same information we have.


- Sean Eddy, Ph.D. 
- Dept. of Genetics, Washington University School of Medicine
- 660 S. Euclid Box 8232, St. Louis MO 63110, USA 
- mailto://eddy@genetics.wustl.edu http://genome.wustl.edu/eddy
