Hi,
I couldn't use the 2 Dimensional Cluster Test - it freezed Excel on 2
computers
But I analized the link to article on clustering and it confirmed what I
had thought. E.g. if I had a cluster with 10 cases - NODE_SUPPORT
would be 10. If only 5 of them contained an element, then the SUPPORT in
the NODE_DISTRIBUTION would be 5. But whatever the NODE_SUPPORT is, it
cannot be lower than the SUPPORT in the NODE_DISTRIBUTION - am I correct?
>> The ClusterProbability results are clear in k-means if my query covers
>> existing cases. It has difficulties in cases built as extensions of
>> historical ones - it is specific for this algorithm, isnt' it?
In my data cases look like:
P1, P2 for ID=1,2,3,12
P1 for ID=4,5,12
P2, P3 for ID=6,7
P3 for ID=8,9,10
In this set I don't have cases like P1,P2,P3 - this is what I called an
"extension" - of cases P1,P2 or P2,P3.
According to the query on .cases, the data were groupped into 3 clusters:
Cluster1 - cases containing only P1 (ID=4,5,12)
Cluster2 - cases containing only P2+P3, P3 (ID=6,7,8,9,10)
Cluster3 - cases containing only P1+P2(ID=1,2,3,12)
And from my point of view a matrix of cluster profiles should look like
(I hope it is visible):
| Cluster1 | Cluster2 | Cluster3
-------------------------------------------------------------------------------
P1 | 100%, support=3 | 0%, support=0 |100%, support=4
-------------------------------------------------------------------------------
P2 | 0%, support=0 | 40%, support=2 |100%, support=4
-------------------------------------------------------------------------------
P3 | 0%, support=0 | 100%, support=5 |0%, support=0
And from Microsoft Cluster Viewer:
| Cluster1 | Cluster2 | Cluster3
-------------------------------------------------------------------------------
P1 | 100%, support=6 | 0%, support=0 |100%, support=1
-------------------------------------------------------------------------------
P2 | 50%, support=3 | 40%, support=2 |100%, support=1
-------------------------------------------------------------------------------
P3 | 0%, support=0 | 100%, support=5 |0%, support=0
I can't understand the 50% for P2 in Cluster1 and the differences in
support.
I also compared prediction results for EM and k-means MMs for all
combinations of P1, P2 and P3. I queried Cluster() and
ClusterProbability().
For example: for P1 ClusterProbability() assigned 50% for Cluster1 and
Cluster3 and Cluster() returned Cluster1. For P1+P3 ClusterProbability()
assigned 33% for each Cluster and Cluster() returned Cluster1. In my
opinion in such cases it returns the cluster with lower ID.
I enclose the data I used for training my MM:
Create table:
USE [Mgr]
CREATE TABLE [dbo].[Test1](
[ID] [nchar](10),
[Product] [nchar](10),
CONSTRAINT [PK_Test1] PRIMARY KEY CLUSTERED
(
[ID] ASC,
[Product] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]
Insert:
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('1', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('1', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('2', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('2', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('3', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('3', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('4', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('5', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('6', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('6', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('7', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('7', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('8', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('9', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('10', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('11', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('11', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('12', 'P1');
I hope you can explain the differences now.
Or do you suggest to move the whole thread to sqlserverdatamining?
Ver.
>> Stay informed about: Clustering wrong results in SQL 2005