I am new to data mining, so please excuse my ignorance. Let's assume:
- I have created a cluster model
- it identified 3 clusters (a, b, c)
- each record consists of 15 columns
- I am collecting new records (15 variables) in real time
What I would like to do is plot these new records programmatically as I collect them in real time. I assume each new record will belong to one of these three clusters. I believe we can find the cluster a new record belongs to with 'SELECT Cluster()....' and its distance from the center of that cluster with ClusterDistance(). To plot this in a 2-dimensional space I need (x, y).
ClusterDistance() could be Y, but what would X be?
Thanks.
Cluster() returns the cluster that is most likely to contain an input case (in your situation, the new record). You can also use ClusterProbability() to get the probability that a case belongs to a particular cluster. This basically serves as the inverse of the cluster distance you are talking about, and it works with more general data (both numeric and discrete). Moreover, you can use PredictHistogram(…) to return a histogram of the likelihood of the input case belonging to each of the model’s clusters. You can also use PredictCaseLikelihood(…) to return a measure from 0 to 1 that indicates how likely the input case is to exist given the model learned by the algorithm.
For your reference, we have a live sample, The Art of Clustering, that demonstrates how to use all these features to render 2D data points according to clustering results. I hope this sample will be of help to your project.
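As a rough illustration of how per-cluster probabilities could be turned into an (x, y) position, here is a minimal Python sketch. It is not the code of the live sample. The anchor coordinates in CLUSTER_XY and the function name place_case are made up for this example, and the probabilities are assumed to have already been fetched from the model (for instance from the histogram described above).

```python
import math

# Hypothetical 2D anchor positions for clusters a, b, c (chosen arbitrarily;
# any layout of the cluster "locations" would do).
CLUSTER_XY = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.5, math.sqrt(3) / 2)}


def place_case(probabilities, cluster_xy=CLUSTER_XY):
    """Return (x, y, best_cluster) for one case.

    probabilities: dict mapping cluster name -> probability that the case
    belongs to that cluster (assumed to come from the mining model).
    The position is the probability-weighted average of the cluster anchors,
    so a case that clearly belongs to one cluster lands on that anchor.
    """
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    x = sum(p * cluster_xy[c][0] for c, p in probabilities.items()) / total
    y = sum(p * cluster_xy[c][1] for c, p in probabilities.items()) / total
    # Color the point by the most likely cluster.
    best = max(probabilities, key=probabilities.get)
    return x, y, best
```

Calling place_case({"a": 0.2, "b": 0.8, "c": 0.0}) puts the point on the line between anchors a and b, closer to b, and labels it b.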
Good luck,
|||Your x and y are whatever you choose them to be. For example, the way our cluster diagram works is to plot cluster locations on a 2D plane by arbitrarily laying them out and using a "point-charge" approach to move the clusters around until they converge (or we get tired....). If you were to use such a method for identifying cluster "locations" in 2D space, you could then use ClusterDistance() (which is 1-ClusterProbability) for each case vs. each cluster to approximate where the case would land in the 2D space. You could then color the case by the most likely cluster, and you would have a diagram that looks similar to Yimin's Art of Clustering example, but for cluster models with an arbitrary number of dimensions.
|||Thanks Wu. I have one more question.
Let's assume the new record (all float columns) belongs to cluster A but is significantly far from the center of the cluster.
One or more columns may have caused this record to be far from the center of the cluster.
Is there any way to find the most significant columns in this record that caused it?
Thanks.
|||This is exactly how we implemented the outlier detection in the Data Mining Add-ins for Excel. Code that shows how to do this is at http://www.sqlserverdatamining.com/DMCommunity/LiveSamples/46.aspx
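For readers who cannot reach the sample, here is a generic Python sketch of one way to rank columns by how much they pull a record away from its cluster center. It is not the add-in's actual code. It assumes you already have the cluster's per-column mean and standard deviation, and the function name column_contributions is made up for this example.

```python
def column_contributions(record, center, stdev):
    """Rank columns by how much they push a record away from a cluster center.

    record, center, stdev: dicts keyed by column name. `center` and `stdev`
    are assumed to be the cluster's per-column mean and standard deviation.
    Returns a list of (column, z_score, share_of_squared_distance) tuples,
    sorted so the most significant column comes first.
    """
    z = {}
    for col, value in record.items():
        s = stdev[col]
        # A zero-spread column cannot be standardized; treat it as neutral.
        z[col] = (value - center[col]) / s if s > 0 else 0.0
    total = sum(v * v for v in z.values())
    ranked = sorted(z.items(), key=lambda kv: kv[1] ** 2, reverse=True)
    return [(col, v, (v * v / total) if total > 0 else 0.0) for col, v in ranked]
```

With all 15 float columns passed in, the first few entries of the result are the columns that contribute most to the standardized distance from the center.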