Brugia malayi EST cluster database

Mark Blaxter mark.blaxter at ed.ac.uk
Wed Aug 22 04:49:25 EST 2001

The Brugia malayi clustered EST dataset on the web

The Filarial Genome Project, sponsored by the World Health 
Organisation/TDR, the UK Medical Research Council, New England 
Biolabs, The Edna McConnell Clark Foundation and the Wellcome Trust, 
has been sequencing expressed sequence tags (ESTs) from Brugia 
malayi for several years. The 23,000 ESTs have been the subject of 
ongoing analyses, and we are pleased to announce the availability of 
the latest version of our clustered database.

The Brugia malayi EST dataset has been clustered into groups of 
sequences which are thought to derive from a single gene. This clustering 
process has been carried out here in Edinburgh using in-house 
software based on the BLAST algorithm. Each cluster is identified by 
a unique ID starting with 'BMC' (for Brugia malayi cluster) followed 
by 5 digits. The ID numbers are consistent with previous publications 
on the B.malayi dataset and will remain intact following any future 
rebuilds. The sequences from each cluster are then used to build a 
consensus sequence which is used for further analysis. The results of 
this clustering analysis, along with some basic annotation, are now 
available on NEMBASE 
(http://nema.cap.ed.ac.uk/nematodeESTs/nembase.html), a 
nematode-specific database resource available for searching via the 
World Wide Web. There is also a Brugia-specific cluster search web 
page which you may find easier to use at : 

At present there are four ways in which the database can be searched :

		A) By Accession Number or Cluster ID
		B) By simple keyword searching of blast output
		C) By sequence similarity
		D) By stage expression

A) By cluster ID or Accession number of a constituent EST sequence. 
On the search page there is a small text box in which you can enter 
the ID of the cluster you are interested in (if already known) or the 
accession number of an EST. Enter the appropriate ID and click on the 
'go' button. You will be taken straight to a page detailing the 
relevant cluster. [Please note that, at present, the ribosomal 
RNA-derived ESTs are NOT included, and thus if you enter a ribosomal 
RNA EST accession number, no answer will be returned. This omission 
will be rectified in the near future.]
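
As a minimal sketch of the ID format described above ('BMC' followed 
by 5 digits), the following Python check tells a cluster ID apart from 
any other query before it is typed into the search box. The function 
name and the tolerance of lower case and surrounding spaces are 
assumptions for illustration; they are not part of NEMBASE.

```python
import re

# 'BMC' followed by exactly five digits, as described in the announcement.
CLUSTER_ID = re.compile(r"BMC[0-9]{5}")


def is_cluster_id(text):
    """Return True if text looks like a Brugia malayi cluster ID.

    Assumption: surrounding whitespace and letter case are ignored.
    """
    return CLUSTER_ID.fullmatch(text.strip().upper()) is not None


if __name__ == "__main__":
    for query in ("BMC00123", "bmc00123 ", "BMC123", "AA123456"):
        kind = "cluster ID" if is_cluster_id(query) else "not a cluster ID"
        print(f"{query!r}: {kind}")
```

Anything that fails this check would have to be entered as an EST 
accession number instead.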

B) By BLAST annotation. After the consensus sequences were created 
for each cluster, they were used to perform three separate blast 
searches [blastn against the non-redundant DNA database, blastx 
against the non-redundant protein database and blastn against the EST 
database (dbEST)]. Results from these blasts are stored in NEMBASE 
and may be searched by the use of simple keywords.
Simply enter the word you are interested in into the box marked 
'annotation text' and click on the submit button. After a few moments 
you will be directed to a page listing the clusters whose blast 
results match your search keyword (e.g. if you enter 'globin', you 
will be given a list of all the clusters in which the word 'globin' 
appeared in the blast output). This list may be ordered either by 
their relative abundance (number of sequences in the cluster) or by 
their relative blast probability (e) value. In addition you may 
specify a minimum blast probability to ensure that the blast hit is 
'real'. The list shows the cluster ID, the number of sequences within 
the cluster and the three top blast hits against each of the 
databases. By clicking on the cluster ID you will be taken to the 
page detailing that cluster.
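
The ordering and cut-off options described above can be sketched as 
follows in Python, for anyone post-processing a saved list of results. 
The record layout, the field names and the example values are 
hypothetical; this is not the code behind the web page.

```python
from typing import NamedTuple


class ClusterHit(NamedTuple):
    # Hypothetical record for one line of the result list.
    cluster_id: str
    n_sequences: int   # abundance: number of sequences in the cluster
    e_value: float     # blast e value of the hit matching the keyword


def select_hits(hits, max_e_value=None, order_by="abundance"):
    """Filter hits by an e value cut-off, then order them.

    order_by is 'abundance' (largest clusters first) or
    'e_value' (most significant hits first).
    """
    kept = [h for h in hits if max_e_value is None or h.e_value <= max_e_value]
    if order_by == "abundance":
        return sorted(kept, key=lambda h: h.n_sequences, reverse=True)
    if order_by == "e_value":
        return sorted(kept, key=lambda h: h.e_value)
    raise ValueError(f"unknown ordering: {order_by!r}")


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    hits = [
        ClusterHit("BMC00001", 3, 1e-40),
        ClusterHit("BMC00002", 12, 1e-3),
        ClusterHit("BMC00003", 7, 1e-20),
    ]
    for hit in select_hits(hits, max_e_value=1e-5):
        print(hit.cluster_id, hit.n_sequences, hit.e_value)
```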

C) By sequence similarity. If you are interested in finding the 
clusters which most closely match a sequence you are interested in, 
you may use the local BLAST facility. Simply cut and paste your 
sequence into the large box of the search by sequence similarity 
section, select any appropriate options and click on the submit 
button. After a few moments you will see a page detailing the BLAST 
output. The graphic at the top indicates the relative position and 
score of each 'hit' against your sequence. Clicking on the cluster ID 
in this graphic will take you to the alignment of that cluster 
against your input sequence. Clicking on the cluster ID by the 
alignment view will take you to the page detailing that cluster.

D) By stage expression profile. You may be interested in clusters 
containing sequences which are expressed only at particular stages or 
at particular levels of abundance. This search mode is accessed via a 
separate page. Click the link from the Brugia page and you will see a 
form which enables you to enter a profile into the boxes and retrieve 
a list of clusters which satisfy that profile. Valid arguments are 
numbers to indicate that exactly that number of sequences is found in 
the cluster, or you can use the '>' and '<' symbols to specify a 
minimum or maximum number of sequences that have to be present 
respectively. For example if you wanted clusters which were 
relatively highly expressed in microfilaria but were not found to be 
expressed in adults, you might enter '>5' in the MF box and '0' in 
the 'Total Adults' box. Pressing submit would then retrieve all 
clusters which contained more than 5 sequences from MF libraries but 
no sequences from any adult library. At this step you have two 
choices for output:

1: Normal list of clusters (as above)
2: A graphic (PhyloView) showing the relative phylogenetic 
distribution of blast similarity matches of the clusters with three 
datasets
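
The profile syntax described above (a bare number, or a number 
preceded by '>' or '<') can be sketched in Python as follows. The 
function names and the dictionary layout are assumptions for 
illustration, and boxes left empty are assumed to place no constraint. 
Following the worked example, '>5' is read as "more than 5".

```python
import re

# An optional '>' or '<' followed by a whole number, e.g. '0', '>5', '<3'.
_CRITERION = re.compile(r"\s*([<>]?)\s*([0-9]+)\s*")


def parse_criterion(text):
    """Turn one box entry into a test on a sequence count."""
    match = _CRITERION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid profile entry: {text!r}")
    operator, number = match.group(1), int(match.group(2))
    if operator == ">":
        return lambda count: count > number
    if operator == "<":
        return lambda count: count < number
    return lambda count: count == number


def matches_profile(counts, profile):
    """True if a cluster's per-stage counts satisfy every entry.

    counts maps stage name to number of sequences (missing means 0);
    profile maps stage name to a box entry such as '>5'.
    """
    return all(
        parse_criterion(entry)(counts.get(stage, 0))
        for stage, entry in profile.items()
    )


if __name__ == "__main__":
    profile = {"MF": ">5", "Total Adults": "0"}
    print(matches_profile({"MF": 8, "Total Adults": 0}, profile))
    print(matches_profile({"MF": 8, "Total Adults": 2}, profile))
```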

To use the PhyloView option, simply select the appropriate button and 
choose the three datasets from the lists before clicking the submit 
button. After a few moments you will see a graphic appear with a list 
of clusters beneath it. The graphic is an interactive Java 
application which allows you to zoom in and around the triangle 
representing phylogenetic phase space. Within the triangle are 
coloured squares. Each represents a unique cluster. The relative 
position of the square to the three vertices represents the relative 
phylogenetic distribution of blast similarity matches of that cluster 
to the three organisms chosen on the previous page. The colour of the 
square indicates the highest Blast score obtained against the three 
datasets. Clicking on each square reveals its cluster number. 
Holding down the <ctrl> key while clicking on a square will launch a 
new web window detailing that cluster. [Please note that the full 
functionality of the PhyloView is only available in either Netscape 
4.7 and above or Internet Explorer 5 and above]
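
One way to picture how a square is placed within the triangle is a 
weighted average of the three vertices, with each weight proportional 
to the cluster's blast score against that dataset. The Python sketch 
below illustrates this idea only. The announcement does not state how 
PhyloView computes positions, so the weighting scheme, the function 
name and the triangle coordinates are all assumptions.

```python
import math

# Corners of an equilateral triangle with unit sides, one per dataset.
VERTICES = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))


def triangle_position(scores):
    """Map three non-negative scores to an (x, y) point in the triangle.

    Assumption: each vertex pulls the point in proportion to its score.
    Returns None when all three scores are zero (nothing to place).
    """
    if len(scores) != 3 or any(s < 0 for s in scores):
        raise ValueError("expected three non-negative scores")
    total = sum(scores)
    if total == 0:
        return None
    weights = [s / total for s in scores]
    x = sum(w * vx for w, (vx, _) in zip(weights, VERTICES))
    y = sum(w * vy for w, (_, vy) in zip(weights, VERTICES))
    return (x, y)


if __name__ == "__main__":
    print(triangle_position((120, 0, 0)))    # sits on the first vertex
    print(triangle_position((80, 80, 80)))   # sits at the centre
```

Under this scheme a cluster matching only one dataset lies on that 
dataset's vertex, and a cluster matching all three equally lies at the 
centre of the triangle.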

What the Cluster View Shows :
The detailed view of each cluster contains the following information :
A brief summary of the cluster indicating its index number, the 
number and types of sequences belonging to the cluster, the number of 
contigs predicted for the cluster by the assembly program (different 
contigs represent either alternative splices or different alleles) 
and the libraries represented by the cluster. For sequence types :
Blue = EST, Red = cDNA, Magenta = genomic DNA, Green = GSS
To the right of the summary table is the precomputed blast 
information available on the cluster. By moving the mouse over the 
appropriate button, different hits will be displayed within the text 
window. Clicking on the buttons launches a new window showing the 
BLAST output. Note that similarity scores more significant than e^-99 
are reported as 0.
The text "no significant hits" indicates that no hits with a 
similarity score more significant than e^-5 were obtained, while 
"no hits" indicates that no blast hits were found for this cluster.
Below the header information is data pertaining to each contig - the 
contig number, length of the sequence, number of ESTs which make up 
this particular contig and a list of these ESTs (coloured by type - 
see above). Click on the EST name to retrieve the GenBank entry.
Under the contig information is a simple graphic indicating the 
position of the sequences relative to the contig. The sequences are 
coloured according to quality/alignment information. Gold indicates 
sequence of high quality; purple indicates lower quality sequence 
used in creating the consensus sequence. GenBank entries can again be 
retrieved by clicking on each sequence within the graphic. Finally, 
below the graphic the cluster consensus sequence is given. The BLAST 
button below this takes you to our in-house BLAST server, 
automatically pasting in the consensus sequence.
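
The reporting rules for the precomputed blast information described 
above can be summarised in a short Python sketch. It assumes that 
'e^-99' and 'e^-5' denote 1e-99 and 1e-5, and the function name and 
output format are hypothetical.

```python
# Assumption: the thresholds quoted in the text are powers of ten.
E_REPORTED_AS_ZERO = 1e-99
E_SIGNIFICANT = 1e-5


def summarise_hits(e_values):
    """Return the text shown for a cluster, given its hits' e values."""
    if not e_values:
        return "no hits"
    best = min(e_values)  # the smallest e value is the most significant
    if best >= E_SIGNIFICANT:
        return "no significant hits"
    if best < E_REPORTED_AS_ZERO:
        return "0"
    return f"{best:.0e}"


if __name__ == "__main__":
    for hits in ([], [0.5, 1e-3], [1e-120, 1e-10], [1e-20]):
        print(hits, "->", summarise_hits(hits))
```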

Analyses using this database should reference Parkinson, J., C. 
Whitton, D. Guiliano, J. Daub and M.L. Blaxter. 2001. 200,000 
nematode ESTs on the net. Trends in Parasitology. 17: 394-396.

If you have any problems or questions do not hesitate to contact John 
Parkinson at john.parkinson at ed.ac.uk.

Mark Blaxter, David Guiliano and John Parkinson
Edinburgh 21/Aug/2001

Dr. Mark Blaxter   email  Mark.Blaxter at ed.ac.uk
Reader in Nematode Genetics
Institute of Cell, Animal and Population Biology
Ashworth laboratories, Room 311
King's Buildings, University of Edinburgh,
West Mains Road, EDINBURGH  EH9 3JT, UK
phone: (+44) 131 650 6760  **NEW** Fax :...650 7489
see   http://www.nematodes.org

          ~ may all beings be happy ~