In article <u9lo0fyb7a.fsf at wol.wustl.edu>, eddy at wol.wustl.edu (Sean Eddy) writes:
> > - the orientation of the read is annotated redundantly in parsable form
> > in the DEFINITION field:
> > zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens clone 755863 3'
> > i.e.:
> > <clone plate location>.[s,r]1 <library> <clone ID> [5,3]'
> > where an s1 is a 5' read; r1 is a 3' read.
>oops. The last line is correct; but the previous line is
>wrong/misleading. The .r1 or .s1 indicates the direction of the read.
The EST database is not consistent with respect to clone orientation.
To illustrate this point, I picked a single clone at random from near the
beginning of GB_EST1, with reasonable confidence that it would not conform
to the format you describe above, and fair confidence that it would not
contain orientation information at all. (This is based on prior personal
experience.) Sure enough, the randomly selected entry U21463 is:
LOCUS       HSU21463      623 bp    mRNA            EST       31-MAR-1995
DEFINITION  Human partial cDNA sequence with CCA repeat region, T3 end of clone
            JL74.
ACCESSION   U21463
NID         g732488
KEYWORDS    EST.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 623)
  AUTHORS   Longshore,J.W.
  TITLE     Isolation and Characterization of Human Brain Genes with CCA
            trinucleotide repeats
  JOURNAL   Am. J. Hum. Genet. 55, A264 (1994)
REFERENCE   2  (bases 1 to 623)
  AUTHORS   Han,J., Hsu,C., Zhu,Z., Longshore,J.W. and Finley,W.H.
  TITLE     Over-representation of the disease associated (CAG) and (CGG)
            repeats in the human genome
  JOURNAL   Nucleic Acids Res. 22 (9), 1735-1740 (1994)
  MEDLINE   94261446
REFERENCE   3  (bases 1 to 623)
  AUTHORS   Longshore,J.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-FEB-1995) John W. Longshore, Laboratory of Medical
            Genetics, University of Alabama, 1720 7th Ave. S., Sparks 442,
            Birmingham, AL 35294-0017, USA
FEATURES             Location/Qualifiers
     source          1..623
                     /organism="Homo sapiens"
                     /note="T3 end of clone"
                     /clone_lib="Stratagene catalog #936205"
                     /clone="JL74"
                     /tissue_type="hippocamapus"
                     /sex="female"
                     /dev_stage="2 year old"
Here, deducing the orientation will require a trip to the library: "T3 end
of clone" -> which end that is depends on the vector -> the vector is not
named except via "Stratagene catalog #936205".
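
To make the chain of dependencies concrete, here is a rough Python sketch of what a program would have to do with this entry. The function names and the PROMOTER_AT_5_PRIME table are my own illustrative assumptions, not part of any GenBank tool, and the table is deliberately left empty because the entry itself supplies nothing to fill it with.

```python
import re

# Source-feature qualifiers copied from entry U21463.
SOURCE_FEATURE = '''
/organism="Homo sapiens"
/note="T3 end of clone"
/clone_lib="Stratagene catalog #936205"
/clone="JL74"
'''

# Hypothetical lookup: library name -> promoter that flanks the 5' end of
# the insert. Empty on purpose: the entry names no vector, so any value
# here would have to come from the catalog, not from the database.
PROMOTER_AT_5_PRIME = {}


def parse_qualifiers(text):
    """Return a dict of /key="value" qualifiers found in a feature block."""
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))


def read_end(qualifiers):
    """Return "5'" or "3'", or None when the orientation cannot be resolved."""
    match = re.search(r"\b(T3|T7|SP6)\b", qualifiers.get("note", ""))
    if match is None:
        return None  # no promoter mentioned at all
    five_prime = PROMOTER_AT_5_PRIME.get(qualifiers.get("clone_lib"))
    if five_prime is None:
        return None  # promoter known, vector layout unknown
    return "5'" if match.group(1) == five_prime else "3'"


quals = parse_qualifiers(SOURCE_FEATURE)
print(quals["clone_lib"], "->", read_end(quals))
```

The sketch finds "T3" easily enough, but it can only answer None until somebody looks up the vector by hand and adds it to the table.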
Going back to the example that you cite, which admittedly does contain the
direction information, the .r1/.s1 notation is nice, but it is in an
unparseable format coded into the definition field. By unparseable, I mean
just that a generalized program that reads GenBank data fields will not be
able to trivially determine forward/reverse, since this information is
contained in a nonstandard format *for the database as a whole* within
another field.
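
For what it is worth, a special-case parser for that one convention is easy to sketch in Python. The regular expression below is my own guess at the pattern, based only on the single DEFINITION line quoted above (including the literal word "clone" before the clone ID), and the function name is made up for illustration. It reports the .s1/.r1 suffix and the trailing 5'/3' separately and makes no attempt to reconcile them.

```python
import re

# Assumed pattern: <plate>.[s|r]1 <free text> clone <clone ID> [5|3]'
DEFINITION_RE = re.compile(
    r"^(?P<plate>\S+)\.(?P<suffix>[sr])1\s+"  # e.g. zv37h04.s1
    r"(?P<library>.+?)\s+"                    # free text up to "clone"
    r"clone\s+(?P<clone_id>\S+)\s+"           # e.g. clone 755863
    r"(?P<end>[53])'\s*$"                     # trailing 5' or 3'
)


def parse_definition(definition):
    """Return the named fields as a dict, or None if the line does not fit."""
    match = DEFINITION_RE.match(definition.strip())
    return match.groupdict() if match else None


examples = [
    "zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens clone 755863 3'",
    "Human partial cDNA sequence with CCA repeat region, T3 end of clone JL74.",
]
for line in examples:
    print(parse_definition(line))
```

The first line parses; the U21463 DEFINITION comes back as None, and that is the problem: the code works for one submitter's convention and says nothing useful about the rest of the database.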
In any case, the biologically relevant direction information is found on
the mRNA line, which says (as it should):
     mRNA            complement(<1..>413)
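
By contrast, a feature location like this one is in a standard format, so a general program can read it. A minimal Python sketch follows; it handles only a single span with optional complement() and the < and > partial markers, not join() or other operators, and the function name is my own.

```python
import re

# Single-span locations only: a..b or complement(a..b), with optional < and >.
LOCATION_RE = re.compile(r"^(complement\()?<?(\d+)\.\.>?(\d+)(\))?$")


def parse_simple_location(location):
    """Return (start, end, strand) for a single-span location, else None."""
    match = LOCATION_RE.match(location.strip())
    if match is None:
        return None
    opened = match.group(1) is not None
    closed = match.group(4) is not None
    if opened != closed:
        return None  # unbalanced parentheses
    strand = "-" if opened else "+"
    return int(match.group(2)), int(match.group(3)), strand


print(parse_simple_location("complement(<1..>413)"))
print(parse_simple_location("1..623"))
```

Anything the sketch does not recognize comes back as None rather than as a guess.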
It is good that you picked this example to make your point, because here we
have a clone that is pretty clearly inserted in the reverse direction, as
judged by the direction of similarity found to another sequence.
Specifically, the homologies found are (GCG BESTFIT):
  .s1 (len 413)         CAMK (len 1793)
    413 -> 4     <==>   1385 -> 1793    Percent Similarity: 96.822
  .r1 (len 617)
    391 -> 617   <==>   1346 -> 1572    Percent Similarity: 98.238
This is why the mRNA line is reversed from the .s1/.r1 lines. Can you name
the piece of software that could have read this entry, noted the s1/r1 and
mRNA reversal, and accounted for it? I doubt such a beast exists, since
the explanation is buried in yet another unparseable field, COMMENT, as:
    Possible reversed clone: similarity on wrong strand
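
The strand of each match can at least be read off the alignment coordinates mechanically. The Python sketch below does that for the two BESTFIT hits given earlier, and adds a crude text search of COMMENT. Both function names are my own, the search phrase is just the wording seen in this one entry, and the code reports only the relative strand of read and CAMK sequence without deciding which one is the sense strand.

```python
def match_strand(query_start, query_end, subject_start, subject_end):
    """Return "+" if query and subject run the same way, "-" otherwise."""
    query_forward = query_end >= query_start
    subject_forward = subject_end >= subject_start
    return "+" if query_forward == subject_forward else "-"


def flagged_reversed(comment):
    """Crude free-text check; assumes the phrase used in this one entry."""
    return "reversed clone" in comment.lower()


# Coordinates copied from the BESTFIT results: (read start, read end,
# CAMK start, CAMK end).
hits = {
    ".s1": (413, 4, 1385, 1793),
    ".r1": (391, 617, 1346, 1572),
}
for read, coords in hits.items():
    print(read, match_strand(*coords))

print(flagged_reversed("Possible reversed clone: similarity on wrong strand"))
```

The .s1 read matches on the opposite strand and the .r1 read on the same strand. The COMMENT check works only as long as every submitter happens to use the same phrase.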
The take-home lesson is that not all EST entries contain direction
information, or contain that information in the same format, and even if
they have that information, the orientation may be questionable, and there
is no indication of the reliability of the information presented. None of
this matters much if you are working with 10-20 ESTs by hand, but if you
are trying to process thousands of them, well, have fun.
Regards,
David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech