Hi Austin,
I just have a plain list of 200 proteins, without data from the
experiment. I need to cluster the proteins by their inherent
characteristics (function, ancestry). I used the protein database on
the NCBI website to get the sequences. Now I want to take all 200
sequences and get some measure of how similar each one is to every
other. I figure this would require some specific software that would
allow me to enter all the proteins and see how they're related. I found
ProtoNet, but it seems you can only enter one protein and explore its
specific cluster. Are there any other tools for this I might not be
aware of?
I'm sorry to keep asking you questions like this -- just referring me
to a website that explains this would be greatly appreciated.
Thank you,
Rex
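
For illustration only, here is a minimal sketch of what an all-against-all
"distance" between sequences could look like. It does not use alignment
scores, which are what dedicated sequence-comparison tools compute. It
swaps in a much cruder, alignment-free stand-in: the Jaccard distance
between the sets of 3-letter substrings (k-mers) of two sequences. The
sequence names and strings below are hypothetical toy data, and the
function names are made up for this example.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Return 1 - (shared k-mers / all k-mers) for two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=3):
    """Build a symmetric name -> name -> distance table."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = jaccard_distance(seqs[a], seqs[b], k)
        dist[a][b] = dist[b][a] = d
    return dist


# Hypothetical toy sequences (not real proteins).
toy = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYIAKQL",
    "protC": "GGSWPLLHHV",
}
matrix = distance_matrix(toy)
```

A table like `matrix` is the kind of input a hierarchical clustering
routine expects. For real proteins, distances derived from pairwise
alignments would presumably be more meaningful than this k-mer shortcut.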
Austin P. So (Hae Jin) wrote:
> Rex Eastbourne wrote:
> > Thanks again for replying. The k-means algorithm should be a snap. But
> > how do I convert the proteins, which are in the format
> > "UPSP_SLDJK_HUMAN_P12182" to vectors that can be handled by the
> > mathematical algorithm (i.e. what is the "distance" between two
> > proteins)? Is there already a program that does this? (I understand
> > there's something on the NCBI's website.)
>
> So, if I understand the format of the data:
>
> 1. "UPSP_SLDJK_HUMAN_P12182" is just a name...say it is a row id.
> 2. with that name (i.e. in each row), you will have a series of data
> points, each data point corresponding to the amount of protein found in
> patient X (technically you don't have to know if they have the disease
> or not).
> 3. each column (i.e. patient data) will therefore be a
> (multidimensional) data vector, with each protein being an "axis".
>
>           patient1  patient2  patient3  patient4
> protein1         1        50        49         3
> protein2         2        35        30         1
> protein3        30        20        20        31
>
> In this way you can apply (hierarchical) k-means clustering on the
> column "vectors".
>
> Note that you may not get anything either since ultimately your analysis
> is only as good as your data...
>
> Austin
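
The layout described in the quoted message (proteins as rows, patients
as columns, clustering on the column vectors) can be sketched roughly as
follows. This is plain k-means, not a hierarchical variant, written in
pure Python on the toy table from the message. Choosing k = 2 and
seeding the centroids with the first k patients are assumptions made so
that the example is small and deterministic. The function names are
made up for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (labels, centroids)."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


# Rows are proteins, columns are patients (toy table from the message).
patients = ["patient1", "patient2", "patient3", "patient4"]
table = [
    [1, 50, 49, 3],    # protein1
    [2, 35, 30, 1],    # protein2
    [30, 20, 20, 31],  # protein3
]

# Transpose so that each patient becomes one vector with a protein per axis.
columns = [list(col) for col in zip(*table)]

labels, centroids = kmeans(columns, k=2)
clusters = dict(zip(patients, labels))
```

On this toy table, `clusters` groups patient1 with patient4 and patient2
with patient3. Clustering the proteins themselves this way would mean
running the same routine on the rows instead of the columns, which still
requires per-patient measurements.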