Welcome to dbFreaks.com!
FAQFAQ    SearchSearch      ProfileProfile    Private MessagesPrivate Messages   Log inLog in

Clustering wrong results in SQL 2005

 
   Database Help (Home) -> Datamining RSS
Next:  model from an OLAP cube, too many cases evaluated..  
Author Message
Vereenka

External


Since: Feb 05, 2008
Posts: 3



(Msg. 1) Posted: Tue Feb 05, 2008 12:23 pm
Post subject: Clustering wrong results in SQL 2005
Archived from groups: microsoft>public>sqlserver>datamining (more info?)

Hi,

I'm testing clustering model with simple data and I found weird wrong
characteritics.
The mining model:

CREATE MINING MODEL [Test1 MM] (

[ID] TEXT KEY,

[Test1] TABLE PREDICT

(

[Product] TEXT KEY

)

) Using Microsoft_Clustering


Sample data are organized in a table with 2 columns: ID and Product and
both
columns are PK.
ID=1, 2, 3, 11 are related with products P1 and P2, ID=4, 5 and 12 are
related only with product P1, ID=6,7 are related with products P2 and P3
and
ID=8, 9 and 10 are related with product P3.

After training, I got 3 clusters and browsed them with DMX query:
SELECT FLATTENED * FROM [Test1 MM].CASES
WHERE IsInNode('001') (also 002, 003)

According to the results, cluster 1 contains cases for ID=4, 5 and 12 -
containing product P1, cluster 2 - cases for ID=6, 7, 8, 9, 10 and cluster
3 - cases for ID=1, 2, 3 and 11.
So it seems that cluster 1 contains only P1, cluster 3 - P1 and P2 50%:50%
and cluster 2 - P2 and P3 (more)

It seemed OK with manual calculations so I switched to Mining Model Viewer
and checked out Cluster Profiles tab. And according to the tab, cluster 1
contains not only P1, but also P2. Statistics for cluster 2 and 3 are
correct, but what about the cluster 1? Why the MM Viewer shows product
P2 in
the cluster, which is not returned in the query above?

Could you please help?

Thanks a lot!

Ver

 >> Stay informed about: Clustering wrong results in SQL 2005 
Back to top
Login to vote
Dejan Sarka

External


Since: Mar 18, 2004
Posts: 103



(Msg. 2) Posted: Mon Feb 11, 2008 8:02 am
Post subject: Re: Clustering wrong results in SQL 2005 [Login to view extended thread Info.]
Archived from groups: per prev. post (more info?)

Hi!

Could you check whether you get the same result with hard clustering, i.e.
with K-means algorithm? The default Expectation-Maximization algorithm, also
known as soft clustering, actually assigns each case to each cluster with
some probability, while K-means assigns each case to one cluster with
probability 1. Maybe you can check probabilities for each cluster with teh
ClusterProbability DMX function?

--
Dejan Sarka
http://blogs.solidq.com/EN/dsarka/default.aspx

"Vereenka" wrote in message

> Hi,
>
> I'm testing clustering model with simple data and I found weird wrong
> characteritics.
> The mining model:
>
> CREATE MINING MODEL [Test1 MM] (
>
> [ID] TEXT KEY,
>
> [Test1] TABLE PREDICT
>
> (
>
> [Product] TEXT KEY
>
> )
>
> ) Using Microsoft_Clustering
>
>
> Sample data are organized in a table with 2 columns: ID and Product and
> both
> columns are PK.
> ID=1, 2, 3, 11 are related with products P1 and P2, ID=4, 5 and 12 are
> related only with product P1, ID=6,7 are related with products P2 and P3
> and
> ID=8, 9 and 10 are related with product P3.
>
> After training, I got 3 clusters and browsed them with DMX query:
> SELECT FLATTENED * FROM [Test1 MM].CASES
> WHERE IsInNode('001') (also 002, 003)
>
> According to the results, cluster 1 contains cases for ID=4, 5 and 12 -
> containing product P1, cluster 2 - cases for ID=6, 7, 8, 9, 10 and cluster
> 3 - cases for ID=1, 2, 3 and 11.
> So it seems that cluster 1 contains only P1, cluster 3 - P1 and P2 50%:50%
> and cluster 2 - P2 and P3 (more)
>
> It seemed OK with manual calculations so I switched to Mining Model Viewer
> and checked out Cluster Profiles tab. And according to the tab, cluster 1
> contains not only P1, but also P2. Statistics for cluster 2 and 3 are
> correct, but what about the cluster 1? Why the MM Viewer shows product P2
> in
> the cluster, which is not returned in the query above?
>
> Could you please help?
>
> Thanks a lot!
>
> Ver

 >> Stay informed about: Clustering wrong results in SQL 2005 
Back to top
Login to vote
Vereenka

External


Since: Feb 05, 2008
Posts: 3



(Msg. 3) Posted: Wed Feb 13, 2008 12:45 am
Post subject: Re: Clustering wrong results in SQL 2005 [Login to view extended thread Info.]
Archived from groups: per prev. post (more info?)

Dejan Sarka pisze:
> Hi!
>
> Could you check whether you get the same result with hard clustering,
> i.e. with K-means algorithm? The default Expectation-Maximization
> algorithm, also known as soft clustering, actually assigns each case to
> each cluster with some probability, while K-means assigns each case to
> one cluster with probability 1. Maybe you can check probabilities for
> each cluster with teh ClusterProbability DMX function?
>

Hi,

I've built 3 additional models to cover all clustering methods.
The results are the same for each pair: EM algorithms and k-means
algorithms.

In each model I still get the 50% probability of P2 in cluster 1 - in
NODE_DISTRIBUTION, also for k-means. The other Ps are assigned to 1
cluster with probability 1, just as you write Wink

Maybe the SUPPORT in NODE_DISTRIBUTION could be helpful - I noticed it
was higher than in NODE_SUPPORT but I don't know how it was calculated.
The support for the whole node is equal to number of cases assigned to
the cluster but what about the SUPPORT in NODE_DISTRIBUTION? E.g. I
received SUPPORT=6 for P1=Existing while the number of cases in this
cluster was only 3. Also support for P3=Missing was 6 while the total
number of cases containing P3 was only 5.

The ClusterProbability results are clear in k-means if my query covers
existing cases. It has difficulties in cases built as extensions of
historical ones - it is specific for this algorithm, isnt' it?
In case of EM, if ClusterProbability is equal for clusters, Cluster()
returns the one with lower node id (number in node name).

What would you suggest? Clustering function is promising but I can't
load real data until I understand "the P2" effect Wink

Ver.
 >> Stay informed about: Clustering wrong results in SQL 2005 
Back to top
Login to vote
Dejan Sarka

External


Since: Mar 18, 2004
Posts: 103



(Msg. 4) Posted: Wed Feb 13, 2008 8:15 am
Post subject: Re: Clustering wrong results in SQL 2005 [Login to view extended thread Info.]
Archived from groups: per prev. post (more info?)

> In each model I still get the 50% probability of P2 in cluster 1 - in
> NODE_DISTRIBUTION, also for k-means. The other Ps are assigned to 1
> cluster with probability 1, just as you write Wink
>
> Maybe the SUPPORT in NODE_DISTRIBUTION could be helpful - I noticed it was
> higher than in NODE_SUPPORT but I don't know how it was calculated. The
> support for the whole node is equal to number of cases assigned to the
> cluster but what about the SUPPORT in NODE_DISTRIBUTION? E.g. I received
> SUPPORT=6 for P1=Existing while the number of cases in this cluster was
> only 3. Also support for P3=Missing was 6 while the total number of cases
> containing P3 was only 5.

The NODE_SUPPORT contains the hard support for the current cluster, i.e. the
number of cases assigned to a cluster. The SUPPORT in the NODE_DISTRIBUTION
is the support for specific state / value in a cluster; you should read it
together with the PROBABILITY column in NODE_DISTRIBUTION. You can find
more about content of clustering models at
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=46.

> The ClusterProbability results are clear in k-means if my query covers
> existing cases. It has difficulties in cases built as extensions of
> historical ones - it is specific for this algorithm, isnt' it?

I do not get exactly what you mean here. If you mean with "cases built as
extensions of historical ones" on possibility to incrementally train a
model, then it is not a problem of Clustering only - in SQL Server 2005, you
can only reprocess the complete model, there is no incremental processing
for any algorithm.

> In case of EM, if ClusterProbability is equal for clusters, Cluster()
> returns the one with lower node id (number in node name).
>
> What would you suggest? Clustering function is promising but I can't load
> real data until I understand "the P2" effect Wink

I would suggest you to download the "2 Dimensional Cluster Test.xls" Excel
file from
http://www.sqlserverdatamining.com/ssdm/Home/Downloads/tabid/60/Default.aspx.
Note you will have to register to the site first. This site is maintained by
SQL Server Data Mining team. This Excel file helps you playing with
different parameters of the Clustering algorithm and observe the effects
graphically. BTW, you can also post a question on that site, once you are
registered, and these guys really know how the algorithms work, as they
programmed them:-)
In addition, could you post source table DDL and INSERTS for our data? Then
I could play with it as well, any maybe together we could get the
comprehension?

--
Dejan Sarka
http://blogs.solidq.com/EN/dsarka/default.aspx
 >> Stay informed about: Clustering wrong results in SQL 2005 
Back to top
Login to vote
Vereenka

External


Since: Feb 05, 2008
Posts: 3



(Msg. 5) Posted: Fri Feb 15, 2008 8:07 pm
Post subject: Re: Clustering wrong results in SQL 2005 [Login to view extended thread Info.]
Archived from groups: per prev. post (more info?)

Hi,

I couldn't use the 2 Dimensional Cluster Test - it freezed Excel on 2
computers Wink
But I analized the link to article on clustering and it confirmed what I
had thought. E.g. if I had a cluster with 10 cases - NODE_SUPPORT
would be 10. If only 5 of them contained an element, then the SUPPORT in
the NODE_DISTRIBUTION would be 5. But whatever the NODE_SUPPORT is, it
cannot be lower than the SUPPORT in the NODE_DISTRIBUTION - am I correct?

>> The ClusterProbability results are clear in k-means if my query covers
>> existing cases. It has difficulties in cases built as extensions of
>> historical ones - it is specific for this algorithm, isnt' it?

In my data cases look like:
P1, P2 for ID=1,2,3,12
P1 for ID=4,5,12
P2, P3 for ID=6,7
P3 for ID=8,9,10
In this set I don't have cases like P1,P2,P3 - this is what I called an
"extension" - of cases P1,P2 or P2,P3.

According to the query on .cases, the data were groupped into 3 clusters:
Cluster1 - cases containing only P1 (ID=4,5,12)
Cluster2 - cases containing only P2+P3, P3 (ID=6,7,8,9,10)
Cluster3 - cases containing only P1+P2(ID=1,2,3,12)

And from my point of view a matrix of cluster profiles should look like
(I hope it is visible):

| Cluster1 | Cluster2 | Cluster3
-------------------------------------------------------------------------------
P1 | 100%, support=3 | 0%, support=0 |100%, support=4
-------------------------------------------------------------------------------
P2 | 0%, support=0 | 40%, support=2 |100%, support=4
-------------------------------------------------------------------------------
P3 | 0%, support=0 | 100%, support=5 |0%, support=0

And from Microsoft Cluster Viewer:

| Cluster1 | Cluster2 | Cluster3
-------------------------------------------------------------------------------
P1 | 100%, support=6 | 0%, support=0 |100%, support=1
-------------------------------------------------------------------------------
P2 | 50%, support=3 | 40%, support=2 |100%, support=1
-------------------------------------------------------------------------------
P3 | 0%, support=0 | 100%, support=5 |0%, support=0

I can't understand the 50% for P2 in Cluster1 and the differences in
support.

I also compared prediction results for EM and k-means MMs for all
combinations of P1, P2 and P3. I queried Cluster() and
ClusterProbability().
For example: for P1 ClusterProbability() assigned 50% for Cluster1 and
Cluster3 and Cluster() returned Cluster1. For P1+P3 ClusterProbability()
assigned 33% for each Cluster and Cluster() returned Cluster1. In my
opinion in such cases it returns the cluster with lower ID.

I enclose the data I used for training my MM:
Create table:
USE [Mgr]
CREATE TABLE [dbo].[Test1](
[ID] [nchar](10),
[Product] [nchar](10),
CONSTRAINT [PK_Test1] PRIMARY KEY CLUSTERED
(
[ID] ASC,
[Product] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]

Insert:
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('1', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('1', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('2', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('2', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('3', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('3', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('4', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('5', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('6', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('6', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('7', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('7', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('8', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('9', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('10', 'P3');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('11', 'P1');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('11', 'P2');
INSERT INTO [Mgr].[dbo].[Test1] VALUES ('12', 'P1');

I hope you can explain the differences now.
Or do you suggest to move the whole thread to sqlserverdatamining?

Ver.
 >> Stay informed about: Clustering wrong results in SQL 2005 
Back to top
Login to vote
Display posts from previous:   
Related Topics:
clustering - does anybody know how the clustering heuristics to best determine the number of clusters works? how good is it? can i change it? thanks Gil http://msdn2.microsoft.com/en-us/library/ms174879.aspx CLUSTER_COUNT: Specifies the approximate number of..

How to do updates upon the query results - Hello, I've run a complex SELECT statement combined of the several tables, which returns me: MemberID MemberName CurrentWeightAvailableOn010108 NeedsToHaveWeightOnDate Modifier(NeedsToHaveWeight-CurrentWeightAvailable) 1 John Doe ...

Performance issues SQL Server 2005 Data Mining - Hello I am trying to run a clustering model (building mode) against a 15.000.000 records table consisting in 45 fields (categorical and numeric). I have to kill the process after 5 hours because never ends. I have no problem with other..

Collation Problem any help - Hello all : My problem is like the following I have an online databse with a latin collation, this database is using ntext and nvarchar values that was with our previous admin ok the problem is in the data is stored should use another collation (the...

Reprocessing a mining model - I'm currently developing a user registration prototype. The aim is to classify certain types of users. However, after the registration I query a decision tree model in Analysis Services with DMX and I want to send the user data into the warehouse to....
   Database Help (Home) -> Datamining All times are: Pacific Time (US & Canada)
Page 1 of 1

 
You can post new topics in this forum
You can reply to topics in this forum
You can edit your posts in this forum
You can delete your posts in this forum
You can vote in polls in this forum



[ Contact us | Terms of Service/Privacy Policy ]