> I am looking for an automated (or even semi-automated)
> method for generating a multiple sequence alignment of something
> like 6,000 sequences, all of them > 80% identical to one another.
> I wish to align the envelope gene (or portions thereof)
> which have been sequenced from the Human Immunodeficiency Virus
> type 1 or types 1 and 2. A BLAST search against the nr dataset
> provided by NCBI reveals that there are several thousand HIV
> env sequences in the database today.
> If I cannot find a tool already suitable for this, I'd
> like advice on building a program (perhaps using ASN.1 code from
> the NCBI Software Developers Toolkit) that will build a massive
> multiple sequence alignment, given a query sequence (I plan to
> use a "consensus sequence" from an alignment of 50 HIV env genes
> from diverse subtypes) and the GenBank/EMBL database.
> My first thought is to use a tool such as FASTA to
> obtain information about each sequence from GenBank (Is
> it highly similar to HIV env? If so, what region of it
> aligns with what region of my query) and then use that information
> as a starting point for the multiple sequence alignment.
> Any thoughts or help will be greatly appreciated.
>
The best method I can suggest is an HMM method. You could
build an HMM from a representative subset and then use the HMM
to build the larger alignment. The two main HMM packages are
HMMer - http://genome.wustl.edu/eddy/hmm.html
and
SAM - http://www.cse.ucsc.edu/research/compbio/sam.html
Given the high level of similarity, there should be no problem using
these tools.
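
As a rough illustration of the "representative subset" step, here is a
minimal Python sketch. It is not part of HMMer or SAM: the function names
and the 0.95 cutoff are hypothetical, and difflib's ratio is only a crude
stand-in for a real pairwise identity score. It greedily keeps a sequence
only if it is not too similar to anything already kept.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for pairwise identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(seqs: dict[str, str], max_similarity: float = 0.95) -> list[str]:
    """Greedy selection: keep a sequence only if it is below the
    similarity cutoff against every sequence kept so far."""
    kept: list[str] = []
    for name, seq in seqs.items():
        if all(similarity(seq, seqs[k]) < max_similarity for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    toy = {
        "seq_a": "ACGTACGTAC",
        "seq_b": "ACGTACGTAC",  # identical to seq_a, so it is dropped
        "seq_c": "TTTTGGGGCC",
    }
    print(pick_representatives(toy))
```

The kept sequences would then be aligned with an ordinary multiple
alignment program, and that seed alignment used to build the HMM. The
all-against-kept comparison is slow for full-length env genes, so for
thousands of sequences a dedicated clustering tool would probably be a
better choice.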
ewan