Rex Eastbourne wrote:
> Thanks again for replying. The k-means algorithm should be a snap. But
> how do I convert the proteins, which are in the format
> "UPSP_SLDJK_HUMAN_P12182" to vectors that can be handled by the
> mathematical algorithm (i.e. what is the "distance" between two
> proteins)? Is there already a program that does this? (I understand
> there's something on the NCBI's website.)
So, if I understand the format of the data:
1. "UPSP_SLDJK_HUMAN_P12182" is just a name; treat it as a row ID.
2. For each name (i.e. in each row), you will have a series of data
points, each data point corresponding to the amount of that protein
found in one patient (technically you don't have to know whether the
patient has the disease or not).
3. each column (i.e. patient data) will therefore be a
(multidimensional) data vector, with each protein being an "axis".
          patient1  patient2  patient3  patient4
protein1         1        50        49         3
protein2         2        35        30         1
protein3        30        20        20        31
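In this way you can apply (hierarchical) k-means clustering to the
column "vectors". The "distance" is then just an ordinary distance
(e.g. Euclidean) between two columns, so the protein names never have
to be converted into anything.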
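
To make the idea concrete, here is a minimal sketch in plain Python. It
assumes Euclidean distance and a bare-bones k-means that seeds its
centroids with the first k columns; both are simplifications chosen for
illustration, and the names (distance, kmeans, vectors) are made up for
this example. The numbers are the toy values from the table above, not
real measurements.

```python
import math

# Toy data from the table above: rows are proteins, columns are patients.
patients = ["patient1", "patient2", "patient3", "patient4"]
table = [
    [1, 50, 49, 3],    # protein1
    [2, 35, 30, 1],    # protein2
    [30, 20, 20, 31],  # protein3
]

# Transpose so that each patient becomes one vector,
# with one coordinate ("axis") per protein.
vectors = [list(column) for column in zip(*table)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=100):
    """Basic k-means; seeding with the first k points is a simplification."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(vectors, k=2)
for name, label in zip(patients, labels):
    print(name, "-> cluster", label)
```

On this toy table, patient1 and patient4 end up in one cluster and
patient2 and patient3 in the other. With real data you would probably
want a better seeding scheme and some normalisation of the protein
amounts before clustering.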
Note that you may still not get any meaningful clusters, since
ultimately your analysis is only as good as your data...
Austin