Joakim Westberg wrote:
>> Hi!
>> I am looking for a standard procedure for calculating the readlength of sequencing traces. Phred20 seems to be one way of evaluating the reads. Does anyone know how to calculate the readlength of a sequencing trace at Phred20? What does Phred20 exactly mean?
The phred quality scores give the probability that a base call is in
error. The scale is logarithmic, and every 10 points is a factor of 10:
a phred Q=20 means that there is a 1/100 chance of the base being
wrong, and a base with phred=40 has a 1/10,000 chance of being wrong.
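To make the scale concrete, here is a minimal sketch of the conversion, assuming the usual definition Q = -10 * log10(P). The function names are mine, for illustration only; they are not part of phred.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its phred score Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

# Every 10 points on the scale is one more factor of 10 in accuracy.
for q in (10, 20, 30, 40):
    odds = round(1 / phred_to_error_prob(q))
    print(f"Q={q}: about 1 in {odds} chance the base is wrong")
```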
>One explanation I have got is that all bases are counted that have a quality value of 20 or more with the Phred program. That results in a much shorter readlength than manual evaluation of the readlength.
This is true. We determine the number of bases with a Phred q>20 with a
program called qrep, which you run after you have run phred. It gives an
analysis of all the gels that have been phreded in a project. I use it
to monitor QC daily (find all the files with a creation date less than
one day old, phred them, and then run qrep, all as a cron job). I
believe that qrep is available from Brent Ewing for academic
institutions.
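If you do not have qrep, the basic count is easy to sketch yourself. The following is not qrep and does not reproduce its report; it only assumes that the quality values are in a FASTA-style text file (a ">" header line per read, followed by space-separated integer scores) and that "phred20" means a score of 20 or more. The function names and the example data are made up for illustration.

```python
def parse_qual(text):
    """Parse FASTA-style quality text into {read_name: [scores]}."""
    reads = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            reads[name] = []
        elif name is not None:
            reads[name].extend(int(tok) for tok in line.split())
    return reads

def count_q20(scores, threshold=20):
    """Number of bases at or above the threshold (assumed to be Q >= 20)."""
    return sum(1 for q in scores if q >= threshold)

# Made-up example data, not real trace output.
example = """>read1
4 6 12 25 30 41 38 22 19 8
>read2
20 20 15 33
"""

for read_name, scores in parse_qual(example).items():
    print(read_name, count_q20(scores), "of", len(scores), "bases at Q>=20")
```

Note that this counts every qualifying base in the read, whether or not the bases are next to each other.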
As far as read length goes, the number of alignable bases is always much
larger than the q>20 value (the q>20 bases are not contiguous anyway).
After you run phrap (the assembly program), the phrap.out file gives
average confirmed and trimmed read lengths for the reads used to make
contigs, and these are almost twice as long as the q>20 values by the
time the project is finished (e.g. avg q>20 is 500, confirmed trimmed
read length is 900). There are other ways of determining trimmed read
lengths that are more or less stringent, e.g. trim back from both ends
until you reach a region where there are <2 N's within 50 bases (the
same sequences mentioned above would probably have a length trimmed in
this way of 650 to 750 bases).
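The N-based trimming rule can be sketched roughly as follows. This is one possible reading of the rule (slide a 50-base window in from each end until it contains fewer than 2 N's), not the code of any particular program; the function name and parameters are hypothetical.

```python
def trim_by_n_window(seq, window=50, max_n=1):
    """Return (start, end) so that seq[start:end] is the trimmed read.

    Trims in from each end until a window of `window` bases contains
    at most `max_n` N's. Returns (0, 0) if no such window exists.
    """
    seq = seq.upper()

    def window_ok(i):
        return seq[i:i + window].count("N") <= max_n

    # Walk in from the left end.
    start = 0
    while start + window <= len(seq) and not window_ok(start):
        start += 1
    if start + window > len(seq):
        return (0, 0)  # read too short or too noisy everywhere

    # Walk in from the right end; the left window is known to be good,
    # so this loop always stops at or before it.
    end = len(seq)
    while end - window > start and not window_ok(end - window):
        end -= 1
    return (start, end)

# Made-up read: noisy ends around 100 clean bases.
read = "NNNAN" + "ACGT" * 25 + "NNANN"
start, end = trim_by_n_window(read)
print("trimmed read length:", end - start)
```

Because the rule allows one N per window, a single N can survive at either end of the trimmed read; a stricter variant would set max_n=0.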
The take-home point is that there are a number of ways to measure the
trimmed read length. As long as you are internally consistent it should
not matter which method you use.
EXCEPT when it comes to determining the cost of sequencing. Then, there
needs to be some standard that all labs use so costs can be compared and
processes corrected if they result in too much spending. It used to be
that we could determine costs simply by taking the amount of money put
into the project (including overhead with depreciation on instruments
over 5 years) and dividing that by the number of bases submitted to
GenBank in one year. That gives the cost per finished base. Now that
the emphasis has moved away from finishing to generating 5X redundant
sequence, there is a lack of a good standard for determining costs.
There was a meeting at the NIH about this last year, and from my
recollection the only standard that had a consensus was reporting the
cost per phred20 base produced.
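The arithmetic behind that figure is simple. Here is a sketch, assuming straight-line depreciation of instruments over 5 years and treating a phred20 base as any base with a score of 20 or more. All names and dollar amounts are hypothetical placeholders, not real lab figures.

```python
def annual_instrument_cost(purchase_price, years=5):
    """Straight-line depreciation: an equal share of the price each year."""
    return purchase_price / years

def cost_per_q20_base(total_annual_cost, q20_bases_produced):
    """Total yearly spending divided by the phred20 bases produced that year."""
    if q20_bases_produced <= 0:
        raise ValueError("need a positive number of phred20 bases")
    return total_annual_cost / q20_bases_produced

# Hypothetical numbers, for illustration only.
instruments = annual_instrument_cost(500_000)   # 100,000 per year
running_costs = 900_000                         # reagents, staff, overhead
total = instruments + running_costs

print("cost per phred20 base:", cost_per_q20_base(total, 50_000_000))
```

The same division with finished bases submitted in place of phred20 bases gives the older cost-per-finished-base figure.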
Sorry about venting, but I hope that helps.
srlasky
--
Stephen R. Lasky, Ph.D. #
University of Washington #
Department of Molecular Biotechnology #
srlasky at u.washington.edu #
#########################################