SonS/MDSonS Toolbox
Correspondence address: Jose María Martínez
Electronic Engineering Dpt. University of Valencia
Universities Ave, s/n
E46100 Burjassot (Valencia) SPAIN
email: jose.maria.martinez (at) uv.es

SonS/MDSonS Toolbox: Introduction

This Matlab toolbox consists of two methods, Sectors on Sectors (SonS) and Multidimensional Sectors on Sectors (MDSonS). They are novel visualization techniques for improving the interpretation of several data mining algorithms. In this case, the methods are applied to visualizing the results of hierarchical clustering, which makes it possible to extract all the existing relationships among centroid attributes at any hierarchy level. However, it is also possible to use them with other data mining algorithms: a) Growing Hierarchical Self-Organizing Maps (GHSOM), where they make it possible to visualize, simultaneously and compactly, the data information at each hierarchy level; b) classification trees, in which SonS is used to represent the input data information for each class present in each terminal node. Download the SonS and MDSonS toolbox from this link.

Sectors on Sectors (SonS)

Sectors on Sectors (SonS) is a visualization method that extracts visual information from data groups by representing the number of instances in each group, the value of the centroid of each group, and the existing relationships among the several groups and variables. The method is based on the well-known pie chart visualization. Each cluster is represented by a slice of a circle (a pie sector). The arc length of each pie sector is proportional to the number of patterns included in the cluster. By means of new divisions within each pie sector and a color bar with the same number of labels as attributes, the existing relationships among the centroids' attributes of the different clusters can be inferred. Figure 1 represents the three steps followed to create the SonS visualization, which are as follows:

  1. Division of the circle into several sectors depending on the number of clusters: First of all, the circle is divided into several pie segments or sectors, one per cluster. The arc length of each sector is proportional to the number of patterns included in the cluster. The number of patterns belonging to each cluster is shown within parentheses. In this way, the significance of each cluster is easily recognizable (Figure 1, left). 
  2. Division of the pie sectors depending on the number and the value of attributes: After the first step, each sector is divided into as many subsectors as there are variables in the problem. The inner part corresponds to the first variable and, going outwards, the next variables appear. Each of these parts varies its radius. This radius corresponds to the relative value of each variable with respect to the sum of all of them (each variable is scaled between [0, 1] before carrying out the clustering in order to avoid a biased model. Moreover, the scaling makes the relevance of each variable [represented by the size of the radius] independent of its range; that is, the relevance is not higher just because the variable has a larger range. The use of scaled variables guarantees that the radius reflects the relevance of the variable within the cluster). That is, let X be the centroid corresponding to one cluster, X = {x1, x2, ..., xN}. Then, the radius of each subsector (corresponding to each centroid attribute) is calculated as:

        ri = xi / (x1 + x2 + ... + xN)

     In this way, the bigger the radius corresponding to a variable, the higher its weight and therefore the more relevant the feature. This is a good method to identify the relevance of each variable within each cluster in a straightforward way (Figure 1, middle).
  3. Color coding for identifying the real value of features: Attached to the graph, there is a color bar with the same number of labels as variables (one label per variable). The mean value of the variables of each class (normally, the centroid) is encoded by means of colors (this is automatically extensible to other measures such as the median, which is a much more adequate prototype measure in the presence of outliers, for instance; therefore, SonS is not restricted to a particular prototype measure). The color of the first feature (inner subsector) is given by the first column label, the second feature by the second column label, and so on. In this way, it is possible to know the exact value of each variable for each cluster centroid (Figure 1, right). 
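The radius computation in step 2 can be sketched numerically. The following snippet is an illustrative Python/NumPy translation, not part of the Matlab toolbox; the function name is our own:

```python
import numpy as np

def sons_radii(centroid, r_max=1.0):
    """Relative radius of each SonS subsector for one cluster centroid.

    centroid: 1-D array of [0, 1]-scaled centroid attributes.
    Returns one radius per attribute, proportional to that attribute's
    share of the sum of all attributes (bigger radius = more relevant).
    """
    centroid = np.asarray(centroid, dtype=float)
    return r_max * centroid / centroid.sum()

# A hypothetical centroid with three scaled attributes:
radii = sons_radii([0.2, 0.3, 0.5])
```

Because the radii are fractions of the attribute sum, they always add up to the outer radius of the sector, so the subsectors of every cluster fill the same total radius regardless of the centroid's absolute values.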

Multidimensional Sectors on Sectors (MDSonS)

This method is an improvement of the SonS visualization technique, called Multidimensional Sectors on Sectors (MDSonS). The visualization differs from the one described above because of the need to accommodate the information provided by Multidimensional Scaling (MDS). In MDSonS, each cluster is represented by a circle. The area of each circle is proportional to the number of patterns included in the cluster, and the distance among circles is proportional to the distance among clusters. By means of the slices of each circle (pie sectors) and a color bar with the same number of labels as attributes, the existing relationships among centroids' attributes at any hierarchy level can be extracted. Once the structure of the clustering has been decided (number of clusters in each hierarchical level), the visualization graph is produced in three steps for each hierarchy level (see Fig. 2). These three steps should be taken starting from the first hierarchy level, and are described as follows:

  1. Representation of the different clusters and their size: First of all, as many circles as clusters are drawn. The area of each circle is proportional to the number of patterns included in the cluster, and the distance among circles is proportional to the distance among the clusters' centroids. The distances among centroids are computed by MDS, which represents the similarity (or dissimilarity) between pairs of objects in a multidimensional space as distances between points in a low-dimensional space. The number of patterns belonging to each cluster is shown within parentheses. In this way, the significance of each cluster and the distances among them are easily recognizable (Figure 2, top left). 
  2. Division of the circles depending on the number and the value of attributes: Once the data is divided into clusters and the size of each one is known, the value of the attributes (or features) of each cluster centroid is analyzed. For this task, the circle corresponding to each cluster is divided into several sectors, one per variable. The first variable is the one that starts at the vertical line at the top middle of the circle, and the rest of the variables appear sequentially counterclockwise. The arc length of each sector corresponds to the relative value of each variable with respect to the sum of all of them (as in SonS, each variable is scaled between [0, 1] before carrying out the clustering in order to avoid a biased model). In this way, the bigger the arc length of a given variable, the more relevant the variable. With this method, the relevance of each variable, within one cluster or across all of them, can be identified in a straightforward way (Figure 2, top right).
  3. Color coding for identifying the real value of features: Attached to the graph, there is a color bar with the same number of labels as variables (one label per variable). The first column label gives the value of the first feature, the second column label gives the value of the second feature, and so on. In this way, it is possible to know the exact value of each variable for each cluster centroid (Figure 2, bottom). 
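Steps 1 and 2 above can be sketched as follows. This is an illustrative Python/NumPy translation under our own assumptions (the classical-MDS helper and the sample centroids are not from the toolbox): circle centres come from MDS on the inter-centroid distances, and the per-variable arc fractions come from each attribute's share of its centroid's sum.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical MDS: embed a symmetric distance matrix D into `dims` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dims]          # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Hypothetical [0, 1]-scaled centroids, one row per cluster:
centroids = np.array([[0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1],
                      [0.3, 0.4, 0.3]])

# Step 1: place one circle per cluster at MDS coordinates of centroid distances.
D = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
positions = classical_mds(D)

# Step 2: arc length fraction of each variable's sector within its circle.
arc_fracs = centroids / centroids.sum(axis=1, keepdims=True)
```

Each row of `arc_fracs` sums to 1, so multiplying by 2*pi gives the sector angles, while the 2-D `positions` preserve the inter-centroid distances as closely as a plane allows.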


The description for the first hierarchy level extends to the rest of the levels. For instance, in the data set analyzed in Fig. 2, for the second hierarchy level it can be observed that from each circle (corresponding to each cluster) in level 1, a new graph with new cluster centroid values emerges.

The main advantage of the proposed visualization technique is that it is possible to observe relationships among different variables in the same cluster and among the same variables in different clusters, at the different levels of the hierarchy; but especially, what is remarkable in comparison with SonS is the representation of the distances among cluster centroids.

Example of Sectors on Sectors (SonS) and Multidimensional Sectors on Sectors (MDSonS) visualization methods


Introduction

In order to see how the functions corresponding to the SonS and MDSonS methods work, we first carry out a hierarchical clustering. It is worth mentioning that these two methods can be used to visualize both hierarchical clustering and classical clustering. This example consists of 3 Matlab files:

Clustering program (clustering.m)

This program performs a hierarchical clustering of a synthetic data set consisting of 3 clouds of points (first hierarchy) that can be divided into 9 (three new clusters for each one of the former 3). The results of the clustering are saved in order to draw them with the SonS method (see example.m).

VISUALIZING THE DATA SET:

clear all
close all
clc

Here we load the synthetic data set created to show this example.

load datos

Next, we visualize the data in 3D. This step is included in order to better understand the SonS and MDSonS methods: we can check the data in 3D and compare it with the representations provided by these methods, but it should be borne in mind that these methods are focused on visualizing high-dimensional data sets.

%----------------------------------
%        3D DATA VISUALIZATION
%----------------------------------
plot3(datos(:,1),datos(:,2),datos(:,3),'.')
grid
title('Data representation in 3D')


xlabel('X')
ylabel('Y')
zlabel('Z')

Here we normalize the data. You will need the SOM Toolbox (http://www.cis.hut.fi/somtoolbox/) to use the next functions. If you do not have it, you can comment out these lines and normalize in another way.

%----------------------------------
%          NORMALIZATION
%----------------------------------

sData = som_data_struct(datos);
sData = som_normalize(sData,'range');
datos_norm=sData.data;
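If the SOM Toolbox is not available, the 'range' normalization above is simply per-column min-max scaling to [0, 1]. A minimal sketch of the equivalent operation (shown in Python/NumPy purely for illustration; it is not the SOM Toolbox code):

```python
import numpy as np

def range_normalize(X):
    """Min-max scale each column of X to [0, 1] (the 'range' normalization)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo), lo, hi

def range_denormalize(Xn, lo, hi):
    """Invert the scaling, e.g. to recover centroids in the original units."""
    return Xn * (hi - lo) + lo

# Toy data: two variables with very different ranges.
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 20.0]])
Xn, lo, hi = range_normalize(X)
```

Keeping `lo` and `hi` around plays the same role as the `comp_norm` field of the SOM Toolbox data struct: it lets the computed centroids be mapped back to the original units later.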

FIRST HIERARCHY:

Next, we carry out the clustering for the first hierarchical level. Firstly, we build the dendrogram. In the dendrogram, we can identify 3 main clusters corresponding to the 3 main clouds of points.

%-------------------------------------
%          DENDROGRAM
%-------------------------------------

Y=pdist(datos_norm,'euclidean');
Z=linkage(Y,'average');

figure
[H, T] = dendrogram(Z,'colorthreshold','default');
ylabel('Distance')
set(H,'LineWidth',2)
title('Dendrogram')

Now we carry out the clustering algorithm:

%-------------------------------------
%              CLUSTERING
%-------------------------------------

N1=3;%number of clusters in level1
components=3;%number of variables
centroids1=zeros(N1,components);

labels1 = cluster(Z,'MaxClust',N1);% we carry out the clustering

%We calculate the centroids of each cluster.
for n=1:N1
    indices=find(labels1==n);
    centroids1_norm(n,:)=sum(datos_norm(indices,:))/length(indices);
end

Now we denormalize the computed centroids. As previously mentioned, you will need the SOM Toolbox (http://www.cis.hut.fi/somtoolbox/), but you can also denormalize in another way.

%----------------------------------
%           DENORMALIZATION
%----------------------------------
%you will need the som toolbox (if not, you can comment out these lines and denormalize in another way)
datos = som_denormalize(sData);
datos=datos.data;
centroids1=som_denormalize(centroids1_norm,sData.comp_norm);

Next, we represent the clustering results of the first hierarchical level. The different clusters detected by the algorithm are plotted in different colors, and the centroid of each cluster is marked with a red dot.

%-------------------------------------
%        CLUSTERING VISUALIZATION
%-------------------------------------
figure
indices1a=find(labels1==1);
plot3(datos(indices1a,1),datos(indices1a,2),datos(indices1a,3),'b.')
hold on

indices1b=find(labels1==2);
plot3(datos(indices1b,1),datos(indices1b,2),datos(indices1b,3),'k.')
grid

indices1c=find(labels1==3);
plot3(datos(indices1c,1),datos(indices1c,2),datos(indices1c,3),'g.')

plot3(centroids1(:,1),centroids1(:,2),centroids1(:,3),'.r','MarkerSize',15)
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Centroids')
title('Clustering solution. 1st Hierarchy')

SECOND HIERARCHY:

Next, we carry out the clustering for the second hierarchical level. As in the previous case, firstly, we build the dendrogram. In the dendrogram, we can see that each former cluster is now divided into 3 new clusters, giving 9 clusters in total.

%-------------------------------------
%          DENDROGRAM
%-------------------------------------

figure
[H, T] = dendrogram(Z,'colorthreshold',0.2);
ylabel('Distance')
set(H,'LineWidth',2)
title('Dendrogram')

Now we carry out the clustering algorithm:

%-------------------------------------
%              CLUSTERING
%-------------------------------------


N2=9;%number of clusters in level 2
components=3;
centroids2=zeros(N2,components);

labels2 = cluster(Z,'MaxClust',N2);

for n=1:N2
    indices=find(labels2==n);
    centroids2_norm(n,:)=sum(datos_norm(indices,:))/length(indices);
end

Now we denormalize the computed centroids in the same way as before.

%----------------------------------
%           DENORMALIZATION
%----------------------------------
%you will need the som toolbox (if not, you can comment out these lines and denormalize in another way)
datos = som_denormalize(sData);
datos=datos.data;
centroids2=som_denormalize(centroids2_norm,sData.comp_norm);

Finally, we represent the clustering results of the second hierarchical level. The different clusters detected by the algorithm are plotted in different colors.

%-------------------------------------
%        CLUSTERING VISUALIZATION
%-------------------------------------

figure
indices1a=find(labels2==1);
plot3(datos(indices1a,1),datos(indices1a,2),datos(indices1a,3),'b.')
hold on

indices1b=find(labels2==2);
plot3(datos(indices1b,1),datos(indices1b,2),datos(indices1b,3),'g.')
grid

indices1c=find(labels2==3);
plot3(datos(indices1c,1),datos(indices1c,2),datos(indices1c,3),'r.')

indices1d=find(labels2==4);
plot3(datos(indices1d,1),datos(indices1d,2),datos(indices1d,3),'c.')
grid

indices1e=find(labels2==5);
plot3(datos(indices1e,1),datos(indices1e,2),datos(indices1e,3),'m.')

indices1f=find(labels2==6);
plot3(datos(indices1f,1),datos(indices1f,2),datos(indices1f,3),'y.')

indices1g=find(labels2==7);
plot3(datos(indices1g,1),datos(indices1g,2),datos(indices1g,3),'k.')
grid

indices1h=find(labels2==8);
plot3(datos(indices1h,1),datos(indices1h,2),datos(indices1h,3),'Color',[.6 0.5 0.5],'Marker','.','LineStyle','none')

indices1i=find(labels2==9);
plot3(datos(indices1i,1),datos(indices1i,2),datos(indices1i,3),'Color',[.6 0 0],'Marker','.','LineStyle','none')


plot3(centroids2(:,1),centroids2(:,2),centroids2(:,3),'.r','MarkerSize',15)
hold off

legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8','Cluster 9')
title('Clustering solution. 2nd Hierarchy')

Here, we rename the variables for better understanding.

%-------------------------------------
%           RENAMING
%-------------------------------------

centroids_1st_hierarchy=centroids1;
centroids_2nd_hierarchy=centroids2;

labels_1st_hierarchy=labels1;
labels_2nd_hierarchy=labels2;

After running the "clustering.m" file, the data needed to build the graphs is saved (clustering centroids and labels of the first and second hierarchy levels).

%-------------------------------------
%       SAVING CENTROIDS AND LABELS
%-------------------------------------

save labels labels_1st_hierarchy labels_2nd_hierarchy
save centroids centroids_1st_hierarchy centroids_2nd_hierarchy
save sData sData

Visualization of SonS and MDSonS (example.m)

As mentioned in the introduction, this program calls the sons.m and mdsons.m functions to visualize the results of the clustering. These functions represent only one hierarchical level, so example.m contains several calls to these functions in order to represent the different hierarchies.

Here we load the needed data:

load centroids
load labels
load sData


num_clusters=3;

FIRST HIERARCHY:

In this part of the program, we calculate the number of patterns in each cluster. The variable "patterns_in_clusters_1st_hierarchy" is a vector (each column corresponds to a cluster) that contains the number of patterns in each of the clusters.

%------------------------------------------------------------------
% CALCULATION OF NUMBER OF PATTERNS IN EACH CLUSTER OF 1ST HIERARCHY
%------------------------------------------------------------------

 for n=1:num_clusters
        aux=find(labels_1st_hierarchy==n);
        patterns_in_clusters_1st_hierarchy(1,n)=length(aux);
 end

Now, the SonS and MDSonS functions are called in order to represent the graphs. Both functions need two input variables: a vector with the number of patterns in each cluster and a matrix with the cluster centroids (one row per cluster).

Optionally, a cell array where each cell contains the label of a cluster can also be passed.

Here we represent the SonS method:

%----------------------------------------------------
%   SonS function : Here we build the graph
%----------------------------------------------------
figure
sons([patterns_in_clusters_1st_hierarchy],centroids_1st_hierarchy,{'C1','C2','C3'});
title({'SonS Method: 1st HIERARCHY';'';''})
set(gcf,'Position',[0 0 500 300])%size of the matlab window

Here we represent the MDSonS method:

%----------------------------------------------------
%   MDSonS function : Here we build the graph
%----------------------------------------------------
figure
mdsons([patterns_in_clusters_1st_hierarchy],centroids_1st_hierarchy,{'C1','C2','C3'});
title({'MDSonS Method: 1st HIERARCHY';'';''})
set(gcf,'Position',[0 0 700 500])%size of the matlab window


SECOND HIERARCHY:

Now we are going to represent the second hierarchy. From each of the previous sectors or clusters (C1, C2 and C3) we now have new clusters (three new clusters for each one of the formers).

This code refers to the first graph (corresponding to C1 in the 1st hierarchy) in the second hierarchy.

In this part of the program, we calculate the number of patterns in each new cluster (from the former C1) in the second hierarchy. The variable "patterns_in_clusters" is a vector (each column corresponds to a cluster) that contains the number of patterns in each of the clusters.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%    1ST GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

num_clusters=3;
cont=0;
for n=[5 4 8]% This graph includes the 4th, 5th and 8th clusters of
             % the second hierarchy (the 3 clusters that Cluster 1 in
             % the 1st hierarchy consists of)

        indices=find(labels_2nd_hierarchy==n);
        cont=cont+1;
        patterns_in_clusters(1,cont)=length(indices);

end

Now, the SonS and MDSonS functions are called in order to represent the corresponding graph.

%------------------------------------------------------------
% SonS function : Here we build the 1st graph of 2nd hierarchy
%------------------------------------------------------------
figure
h1=gcf; % returns the handle of the current figure.
set(h1,'Position',[0 0 1050 500])%size of the matlab window
subplot(131)
sons(patterns_in_clusters,centroids_2nd_hierarchy([5 4 8],:),{'C1.1','C1.2','C1.3'});
title ({'SonS Method: 2nd HIERARCHY';'';'C1';''},'FontWeight','bold')


%------------------------------------------------------------
% MDSonS function : Here we build the 1st graph of 2nd hierarchy
%------------------------------------------------------------
figure
h2=gcf; % returns the handle of the current figure.
set(h2,'Position',[0 0 1350 800])%size of the matlab window
subplot(131)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([5 4 8],:),{'C1.1','C1.2','C1.3'});
title ({'MDSonS Method: 2nd HIERARCHY';'';'C1';''},'FontWeight','bold')

In this part of the program, we calculate the number of patterns in each new cluster (from the former C2) in the second hierarchy.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%    2ND GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

num_clusters=3;
cont=0;
for n=[6 9 7] % This graph includes the 6th, 9th and 7th clusters of
              % the second hierarchy (the 3 clusters that Cluster 2 in
              % the 1st hierarchy consists of)
        indices=find(labels_2nd_hierarchy==n);
        cont=cont+1;
        patterns_in_clusters(1,cont)=length(indices);
        %The variable "patterns_in_clusters" is a vector (each column corresponds
        %to each cluster) that contains the number of patterns in each one of the clusters.
end

Now, the SonS and MDSonS functions are called in order to represent the second graph (corresponding to C2 in the 1st hierarchy) in the second hierarchy.

%------------------------------------------------------------
% SonS function : Here we build the 2nd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h1) %makes h1 the current figure
subplot(132)
sons(patterns_in_clusters,centroids_2nd_hierarchy([6 9 7],:),{'C2.1','C2.2','C2.3'});
title ({'SonS Method: 2nd HIERARCHY';'';'C2';''},'FontWeight','bold')


%------------------------------------------------------------
% MDSonS function : Here we build the 2nd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h2) %makes h2 the current figure
subplot(132)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([6 9 7],:),{'C2.1','C2.2','C2.3'});
title ({'MDSonS Method: 2nd HIERARCHY';'';'C2';''},'FontWeight','bold')

As in previous cases, here we calculate the number of patterns in each new cluster (from the former C3) in the second hierarchy.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%    3RD GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

num_clusters=3;
cont=0;
for n=[1 2 3] % This graph includes the 1st, 2nd and 3rd clusters of
              % the second hierarchy (the 3 clusters that Cluster 3 in
              % the 1st hierarchy consists of)
        indices=find(labels_2nd_hierarchy==n);
        cont=cont+1;
        patterns_in_clusters(1,cont)=length(indices);
        %The variable "patterns_in_clusters" is a vector (each column corresponds
        %to each cluster) that contains the number of patterns in each cluster.
end

Finally, the SonS and MDSonS functions are called in order to represent the third graph (corresponding to C3 in the 1st hierarchy) in the second hierarchy.

%------------------------------------------------------------
% SonS function : Here we build the 3rd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h1) %makes h1 the current figure
subplot(133)
sons(patterns_in_clusters,centroids_2nd_hierarchy([1 2 3],:),{'C3.1','C3.2','C3.3'});
title ({'C3';''},'FontWeight','bold')

%------------------------------------------------------------
% MDSonS function : Here we build the 3rd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h2) %makes h2 the current figure
subplot(133)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([1 2 3],:),{'C3.1','C3.2','C3.3'});
title ({'C3';''},'FontWeight','bold')