Correspondence address: Jose María Martínez

Electronic Engineering Dept., University of Valencia

Universities Ave, s/n

E46100 Burjassot (Valencia) SPAIN

email: jose.maria.martinez (at) uv.es

**SonS/MDSonS Toolbox: Introduction**

This Matlab toolbox consists of two methods, Sectors on Sectors (SonS)
and Multidimensional Sectors on Sectors (MDSonS). They are novel
visualization techniques for improving the interpretation of several data mining algorithms. In this case, the methods are applied to
visualize the results of hierarchical clustering, making it possible
to extract all the existing relationships among centroid attributes at
any hierarchy level. However, they can also be used with other
data mining algorithms: a) Growing Hierarchical Self-Organizing Maps
(GHSOM), where they make it possible to visualize the data
information at each hierarchy level simultaneously and compactly; b) classification trees,
in which SonS is used to represent the input data information for each class present in each terminal node. **Download the SonS and MDSonS toolbox in this link**.

**Sectors on Sectors (SonS)**

Sectors on Sectors (SonS) is a visualization method that extracts visual
information from data groups by representing the number of instances in
each group, the value of the centroid of each group and the
existing relationships among the several groups and variables. The
method is based on the well-known pie chart visualization. Each cluster
is represented by a slice of a circle (a pie sector). The arc length of
each pie sector is proportional to the number of patterns included in
the cluster. By means of new divisions within each pie sector and a color bar with
the same number of labels as attributes, the existing relationships
among the centroids' attributes of the different clusters can be
inferred. Figure 1 represents the three steps followed to create the
SonS visualization, which are as follows:

1. **Division of one circle into several sectors depending on the number of clusters:** First of all, the circle is divided into several pie segments or sectors, one per cluster. The arc length of each sector is proportional to the number of patterns included in the cluster. The number of patterns belonging to each cluster is shown within parentheses. In this way, the significance of each cluster is easily recognizable (Figure 1, left).
2. **Division of the pie sectors depending on the number and the value of attributes:** After the first step, each sector is divided into as many subsectors as there are variables in the problem. The inner part corresponds to the first variable and, going outwards, the next variables appear. Each of these parts varies its radius. This radius corresponds to the relative value of each variable with respect to the sum of all of them (each variable is scaled to [0, 1] before carrying out the clustering in order to avoid a biased model; moreover, the scaling makes the relevance of each variable [represented by the size of the radius] independent of its range, that is, the relevance is not higher just because the variable has a wider range; the use of scaled variables guarantees that the radius reflects the relevance of the variable within the cluster). That is, let X be the centroid corresponding to one cluster, X = {x_{1}, x_{2}, . . . , x_{N}}. Then the radius of each subsector (corresponding to each centroid attribute) is calculated as r_{i} = x_{i} / (x_{1} + x_{2} + . . . + x_{N}). In this way, the bigger the radius corresponding to a variable, the higher its weight and, therefore, the more relevant the feature. This is a good method to identify the relevance of each variable within each cluster in a straightforward way (Figure 1, middle).
3. **Color coding for identifying the real value of features:** Attached to the graph, there is a color bar with the same number of labels as variables (one label per variable). The mean value of the variables of each class (normally, the centroid) is codified by means of colors (this is automatically extensible to other measures such as the median, which is a much more adequate prototype measure in the presence of outliers, for instance; therefore, SonS is not restricted to the use of a particular prototype measure). The color value of the first feature (inner subsector) is given by the first column label, that of the second feature by the second column label, and so on. In this way, it is possible to know the exact value of each variable for each cluster centroid (Figure 1, right).
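As an illustrative sketch only (in Python, outside the MATLAB toolbox; the helper name `sons_radii` is ours), the radius rule of step 2 can be written as:

```
# Illustrative sketch, not part of the toolbox: the subsector radius of
# each attribute is its relative value with respect to the sum of all
# attributes of the (already scaled) centroid.

def sons_radii(centroid):
    """Relative radius of each subsector of one cluster's pie sector."""
    total = sum(centroid)
    return [x / total for x in centroid]

# A hypothetical centroid with three variables scaled to [0, 1]:
radii = sons_radii([0.2, 0.3, 0.5])
# The largest radius (here, the third variable's) marks the most relevant feature.
```

The radii always sum to one, so a subsector's radius can be read directly as the variable's share of relevance within the cluster.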

**Multidimensional Sectors on Sectors (MDSonS)**

This method, called Multidimensional Sectors on Sectors (MDSonS), is an
improvement of the SonS visualization technique. The visualization
differs from the one proposed before owing to the need to accommodate
the information provided by Multidimensional Scaling (MDS) in the new
visualization. In MDSonS, each cluster is represented by a circle. The
area of each circle is proportional to the number of patterns included
in the cluster, and the distance among circles is proportional to the
distance among clusters. By means of the slices of each circle (pie
sectors) and a color bar with the same number of labels as
attributes, the existing relationships among centroids'
attributes at any hierarchy level can be extracted. Once the
structure of the clustering has been decided (number of clusters at each
hierarchical level), the visualization graph is produced in three steps for
each hierarchy level (see Fig. 2). These three steps should be taken
starting from the first hierarchy level, and are described as follows:

1. **Representation of the different clusters and their size:** First of all, as many circles as clusters are drawn. The area of each circle is proportional to the number of patterns included in the cluster, and the distance among circles is proportional to the distance among the clusters' centroids. The distances among centroids are computed by MDS, which represents the similarity (or dissimilarity) between pairs of objects in a multidimensional space as distances between points in a low-dimensional space. The number of patterns belonging to each cluster is shown within parentheses. In this way, the significance of each cluster and the distances among them are easily recognizable (Figure 2, top left).
2. **Division of the circles depending on the number and the value of attributes:** Once the data is divided into clusters and the size of each one is known, the value of the attributes (or features) of each cluster centroid is analyzed. For this task, the circle corresponding to each cluster is divided into several sectors, one per variable. The first variable is the one that starts at the vertical line at the top middle of the circle, and the rest of the variables appear sequentially counterclockwise. The arc length of each sector corresponds to the relative value of each variable with respect to the sum of all of them (as in SonS, each variable is scaled to [0, 1] before carrying out the clustering in order to avoid a biased model). In this way, the bigger the arc length of a given variable, the more relevant the variable. With this method, the relevance of each variable, within one cluster or across all of them, can be identified in a straightforward way (Figure 2, top right).
3. **Color coding for identifying the real value of features:** Attached to the graph, there is a color bar with the same number of labels as variables (one label per variable). The first column label gives the value of the first feature, the second column label that of the second feature, and so on. In this way, it is possible to know the exact value of each variable for each cluster centroid (Figure 2, bottom).
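A minimal Python sketch (ours, not part of the MATLAB toolbox) of the two geometric rules above: sector angles proportional to the relative attribute values, and circle areas proportional to cluster sizes:

```
import math

# Sketch only, not toolbox code: geometric rules behind an MDSonS circle.

def mdsons_angles(centroid):
    """Angular width (radians) of each variable's sector in one circle."""
    total = sum(centroid)
    return [2 * math.pi * x / total for x in centroid]

def circle_radii(counts):
    """Radii (relative to the biggest cluster) such that each circle's
    AREA is proportional to the number of patterns in the cluster."""
    biggest = max(counts)
    return [math.sqrt(c / biggest) for c in counts]

angles = mdsons_angles([0.25, 0.25, 0.5])  # sectors span the full circle
radii = circle_radii([100, 25])            # a quarter of the patterns -> half the radius
```

The square root in `circle_radii` is the key design point: making the *area*, not the radius, proportional to the cluster size avoids visually exaggerating large clusters.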

The description for the first hierarchy level can be extended to the rest of the levels. For instance, in the data set analyzed in Fig. 2, for the second hierarchy level it can be observed that from each circle (corresponding to each cluster) in level 1, a new graph with new values of cluster centroids emerges.
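To make the drill-down across levels concrete, here is a small Python sketch (ours, not toolbox code; the helper `sublevel_clusters` is hypothetical) that recovers which second-level clusters each first-level cluster splits into, given per-pattern labels at both levels:

```
# Sketch only: map each first-level cluster to the set of second-level
# clusters its patterns fall into.

def sublevel_clusters(labels1, labels2):
    mapping = {}
    for l1, l2 in zip(labels1, labels2):
        mapping.setdefault(l1, set()).add(l2)
    return mapping

# Hypothetical labels for 6 patterns: 2 clusters at level 1, 4 at level 2.
m = sublevel_clusters([1, 1, 1, 2, 2, 2], [1, 1, 2, 3, 4, 4])
# m == {1: {1, 2}, 2: {3, 4}}: cluster 1 splits into clusters 1 and 2,
# cluster 2 into clusters 3 and 4.
```

This is the same bookkeeping done by hand in `example.m`, where each first-level cluster's graph is built from an explicit list of second-level cluster indices.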

The main advantage of the proposed visualization technique is that it is
possible to observe relationships among different variables in the same
cluster and among the same variables in different clusters, at the
different levels of the hierarchy; but what is especially remarkable in
comparison to SonS is the representation of the distances among cluster
centroids.

# Example of Sectors on Sectors (SonS) and Multidimensional Sectors on Sectors (MDSonS) visualization methods

## Introduction

In order to see how the functions corresponding to the SonS and MDSonS methods work, we first carry out a hierarchical
clustering. It is worth mentioning that these two methods can be used to visualize both hierarchical clustering and classical
clustering. This example consists of 3 `Matlab` files:

- **Clustering**: The program carries out a hierarchical clustering. It saves the centroids and the labels of the data of the different hierarchies (the labels indicate the belonging to each cluster).
- **Example**: This program calls the `sons.m` and `mdsons.m` functions for visualizing the results of the clustering. These functions represent only one hierarchical level, so in `example.m` there are different calls to these functions in order to represent the different hierarchies.
- **SonS**: The `sons.m` function carries out the visualization of the SonS method.
- **MDSonS**: The `mdsons.m` function carries out the visualization of the MDSonS method.

## Clustering program (`clustering.m`)

This program carries out a hierarchical clustering of a synthetic data set which consists of 3 clouds of points (first hierarchy)
that can be divided into 9 (three new ones for each of the former 3). The results of the clustering are saved in order to draw
them with the SonS method (see `example.m`).

**VISUALIZING THE DATA SET**:

```
clear all
close all
clc
```

Here we load the synthetic data set created to show this example.

```
load datos
```

Next, we visualize the data in 3D. This example is presented in order to better understand the SonS and MDSonS methods: we can check the data in 3D and compare it with the representations provided by these methods. It should be borne in mind, however, that these methods are aimed at visualizing high-dimensional data sets.

```
%----------------------------------
% 3D DATA VISUALIZATION
%----------------------------------
plot3(datos(:,1),datos(:,2),datos(:,3),'.')
grid
title('Data representation in 3D')
xlabel('X')
ylabel('Y')
zlabel('Z')
```

Here we normalize the data. You will need the SOM Toolbox (http://www.cis.hut.fi/somtoolbox/) to use the next functions. If you do not have it, you can comment these lines out and normalize in another way.

```
%----------------------------------
% NORMALIZATION
%----------------------------------
sData = som_data_struct(datos);
sData = som_normalize(sData,'range');
datos_norm=sData.data;
```

**FIRST HIERARCHY**:

Next, we carry out the clustering for the first hierarchical level. Firstly, we build the dendrogram, in which we can identify 3 main clusters corresponding to the 3 main clouds of points.

```
%-------------------------------------
% DENDROGRAM
%-------------------------------------
Y=pdist(datos_norm,'euclidean');
Z=linkage(Y,'average');
figure
[H, T] = dendrogram(Z,'colorthreshold','default');
ylabel('Distance')
set(H,'LineWidth',2)
title('Dendrogram')
```

Now we carry out the clustering algorithm:

```
%-------------------------------------
% CLUSTERING
%-------------------------------------
N1=3;                                % number of clusters in level 1
components=3;                        % number of variables
centroids1_norm=zeros(N1,components);
labels1 = cluster(Z,'MaxClust',N1);  % we carry out the clustering
% We calculate the centroid of each cluster.
for n=1:N1
    indices=find(labels1==n);
    centroids1_norm(n,:)=sum(datos_norm(indices,:))/length(indices);
end
```

Now we denormalize the computed centroids. As previously mentioned, you will need the SOM Toolbox (http://www.cis.hut.fi/somtoolbox/), but you can also denormalize in another way.

```
%----------------------------------
% DENORMALIZATION
%----------------------------------
% You will need the SOM Toolbox (if not, you can comment these lines
% out and denormalize in another way).
datos = som_denormalize(sData);
datos=datos.data;
centroids1=som_denormalize(centroids1_norm,sData.comp_norm);
```

Next, we represent the clustering results of the first hierarchical level. The different clusters detected by the algorithm are shown in different colors, and the centroid of each cluster is marked with a red dot.

```
%-------------------------------------
% CLUSTERING VISUALIZATION
%-------------------------------------
figure
indices1a=find(labels1==1);
plot3(datos(indices1a,1),datos(indices1a,2),datos(indices1a,3),'b.')
hold on
indices1b=find(labels1==2);
plot3(datos(indices1b,1),datos(indices1b,2),datos(indices1b,3),'k.')
grid
indices1c=find(labels1==3);
plot3(datos(indices1c,1),datos(indices1c,2),datos(indices1c,3),'g.')
plot3(centroids1(:,1),centroids1(:,2),centroids1(:,3),'.r','MarkerSize',15)
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Centroids')
title('Clustering solution. 1st Hierarchy')
```

**SECOND HIERARCHY**:

Next, we carry out the clustering for the second hierarchical level. As in the previous case, firstly we build the dendrogram, in which we can see that each former cluster is now divided into 3 new clusters, giving 9 clusters.

```
%-------------------------------------
% DENDROGRAM
%-------------------------------------
figure
[H, T] = dendrogram(Z,'colorthreshold',0.2);
ylabel('Distance')
set(H,'LineWidth',2)
title('Dendrogram')
```

Now we carry out the clustering algorithm:

```
%-------------------------------------
% CLUSTERING
%-------------------------------------
N2=9;                                % number of clusters in level 2
components=3;
centroids2_norm=zeros(N2,components);
labels2 = cluster(Z,'MaxClust',N2);
for n=1:N2
    indices=find(labels2==n);
    centroids2_norm(n,:)=sum(datos_norm(indices,:))/length(indices);
end
```

Now we denormalize the computed centroids in the same way as before.

```
%----------------------------------
% DENORMALIZATION
%----------------------------------
% You will need the SOM Toolbox (if not, you can comment these lines
% out and denormalize in another way).
datos = som_denormalize(sData);
datos=datos.data;
centroids2=som_denormalize(centroids2_norm,sData.comp_norm);
```

Finally, we represent the clustering results of the second hierarchical level. The different clusters detected by the algorithm are shown in different colors.

```
%-------------------------------------
% CLUSTERING VISUALIZATION
%-------------------------------------
figure
indices1a=find(labels2==1);
plot3(datos(indices1a,1),datos(indices1a,2),datos(indices1a,3),'b.')
hold on
indices1b=find(labels2==2);
plot3(datos(indices1b,1),datos(indices1b,2),datos(indices1b,3),'g.')
grid
indices1c=find(labels2==3);
plot3(datos(indices1c,1),datos(indices1c,2),datos(indices1c,3),'r.')
indices1d=find(labels2==4);
plot3(datos(indices1d,1),datos(indices1d,2),datos(indices1d,3),'c.')
grid
indices1e=find(labels2==5);
plot3(datos(indices1e,1),datos(indices1e,2),datos(indices1e,3),'m.')
indices1f=find(labels2==6);
plot3(datos(indices1f,1),datos(indices1f,2),datos(indices1f,3),'y.')
indices1g=find(labels2==7);
plot3(datos(indices1g,1),datos(indices1g,2),datos(indices1g,3),'k.')
grid
indices1h=find(labels2==8);
plot3(datos(indices1h,1),datos(indices1h,2),datos(indices1h,3),'Color',[.6 0.5 0.5],'Marker','.','LineStyle','none')
indices1i=find(labels2==9);
plot3(datos(indices1i,1),datos(indices1i,2),datos(indices1i,3),'Color',[.6 0 0],'Marker','.','LineStyle','none')
plot3(centroids2(:,1),centroids2(:,2),centroids2(:,3),'.r','MarkerSize',15)
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8','Cluster 9')
title('Clustering solution. 2nd Hierarchy')
```

Here we rename the variables for better readability.

```
%-------------------------------------
% RENAMING
%-------------------------------------
centroids_1st_hierarchy=centroids1;
centroids_2nd_hierarchy=centroids2;
labels_1st_hierarchy=labels1;
labels_2nd_hierarchy=labels2;
```

After running the `clustering.m` file, the data needed to build the graphs is saved (clustering centroids and labels of the first and second hierarchy levels).

```
%-------------------------------------
% SAVING CENTROIDS AND LABELS
%-------------------------------------
save labels labels_1st_hierarchy labels_2nd_hierarchy
save centroids centroids_1st_hierarchy centroids_2nd_hierarchy
save sData sData
```

## Visualization of SonS and MDSonS (`example.m`)

As mentioned in the introduction, this program calls the `sons.m` and `mdsons.m` functions for visualizing the results of the clustering. These functions represent only one hierarchical level, so in `example.m` there are different calls of these functions in order to represent the different hierarchies.

Here we load the data needed:

```
load centroids
load labels
load sData
num_clusters=3;
```

**FIRST HIERARCHY**:

In this part of the program, we calculate the number of patterns in each cluster. The variable `patterns_in_clusters` is a vector (each column corresponds to one cluster) that contains the number of patterns in each of the clusters.

```
%------------------------------------------------------------------
% CALCULATION OF NUMBER OF PATTERNS IN EACH CLUSTER OF 1ST HIERARCHY
%------------------------------------------------------------------
for n=1:num_clusters
    aux=find(labels_1st_hierarchy==n);
    patterns_in_clusters_1st_hierarchy(1,n)=length(aux);
end
```

Now, the SonS and MDSonS functions are called in order to represent the graphs. Both functions need two input variables:

- **A vector** (with length equal to the number of clusters), containing the number of patterns included in each cluster (the first column corresponds to the first cluster and so on).
- **A matrix** of centroids, with one row per cluster centroid and one column per variable.

Optionally, a cell array can also be included where each cell corresponds to the label of each cluster.

Here we represent the SonS method:

```
%----------------------------------------------------
% SonS function : Here we build the graph
%----------------------------------------------------
figure
sons([patterns_in_clusters_1st_hierarchy],centroids_1st_hierarchy,{'C1','C2','C3'});
title({'SonS Method: 1st HIERARCHY';'';''})
set(gcf,'Position',[0 0 500 300]) % size of the Matlab window
```

Here we represent the MDSonS method:

```
%----------------------------------------------------
% MDSonS function : Here we build the graph
%----------------------------------------------------
figure
mdsons([patterns_in_clusters_1st_hierarchy],centroids_1st_hierarchy,{'C1','C2','C3'});
title({'MDSonS Method: 1st HIERARCHY';'';''})
set(gcf,'Position',[0 0 700 500]) % size of the Matlab window
```

**SECOND HIERARCHY**:

Now we are going to represent the second hierarchy. For that, from each of the previous sectors or clusters (C1, C2 and C3) we obtain new clusters (three new clusters for each of the former ones).

This code refers to the first graph (corresponding to C1 in the 1st hierarchy) in the second hierarchy.

In this part of the program, we calculate the number of patterns in each new cluster (from the former C1) in the second hierarchy. The variable `patterns_in_clusters` is a vector (each column corresponds to one cluster) that contains the number of patterns in each of the clusters.

```
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 1ST GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
num_clusters=3;
cont=0;
for n=[5 4 8] % This graph includes the 4th, 5th and 8th clusters of
              % the second hierarchy (the 3 clusters into which Cluster 1
              % of the 1st hierarchy is divided)
    indices=find(labels_2nd_hierarchy==n);
    cont=cont+1;
    patterns_in_clusters(1,cont)=length(indices);
end
```

Now, the SonS and MDSonS functions are called in order to represent the corresponding graph.

```
%------------------------------------------------------------
% SonS function : Here we build the 1st graph of 2nd hierarchy
%------------------------------------------------------------
figure
h1=gcf; % returns the handle of the current figure
set(h1,'Position',[0 0 1050 500]) % size of the Matlab window
subplot(131)
sons(patterns_in_clusters,centroids_2nd_hierarchy([5 4 8],:),{'C1.1','C1.2','C1.3'});
title({'SonS Method: 2nd HIERARCHY';'';'C1';''},'FontWeight','bold')
%------------------------------------------------------------
% MDSonS function : Here we build the 1st graph of 2nd hierarchy
%------------------------------------------------------------
figure
h2=gcf; % returns the handle of the current figure
set(h2,'Position',[0 0 1350 800]) % size of the Matlab window
subplot(131)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([5 4 8],:),{'C1.1','C1.2','C1.3'});
title({'MDSonS Method: 2nd HIERARCHY';'';'C1';''},'FontWeight','bold')
```

In this part of the program, we calculate the number of patterns in each new cluster (from the former C2) in the second hierarchy.

```
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 2ND GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
num_clusters=3;
cont=0;
for n=[6 9 7] % This graph includes the 6th, 9th and 7th clusters of
              % the second hierarchy (the 3 clusters into which Cluster 2
              % of the 1st hierarchy is divided)
    indices=find(labels_2nd_hierarchy==n);
    cont=cont+1;
    % The variable "patterns_in_clusters" is a vector (each column
    % corresponds to one cluster) with the number of patterns per cluster.
    patterns_in_clusters(1,cont)=length(indices);
end
```

Now, the SonS and MDSonS functions are called in order to represent the second graph (corresponding to C2 in the 1st hierarchy) in the second hierarchy.

```
%------------------------------------------------------------
% SonS function : Here we build the 2nd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h1) % makes h1 the current figure
subplot(132)
sons(patterns_in_clusters,centroids_2nd_hierarchy([6 9 7],:),{'C2.1','C2.2','C2.3'});
title({'SonS Method: 2nd HIERARCHY';'';'C2';''},'FontWeight','bold')
%------------------------------------------------------------
% MDSonS function : Here we build the 2nd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h2) % makes h2 the current figure
subplot(132)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([6 9 7],:),{'C2.1','C2.2','C2.3'});
title({'MDSonS Method: 2nd HIERARCHY';'';'C2';''},'FontWeight','bold')
```

As in previous cases, here we calculate the number of patterns in each new cluster (from the former C3) in the second hierarchy.

```
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 3RD GRAPH IN SECOND HIERARCHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
num_clusters=3;
cont=0;
for n=[1 2 3] % This graph includes the 1st, 2nd and 3rd clusters of
              % the second hierarchy (the 3 clusters into which Cluster 3
              % of the 1st hierarchy is divided)
    indices=find(labels_2nd_hierarchy==n);
    cont=cont+1;
    % The variable "patterns_in_clusters" is a vector (each column
    % corresponds to one cluster) with the number of patterns per cluster.
    patterns_in_clusters(1,cont)=length(indices);
end
```

Finally, the SonS and MDSonS functions are called in order to represent the third graph (corresponding to C3 in the 1st hierarchy) in the second hierarchy.

```
%------------------------------------------------------------
% SonS function : Here we build the 3rd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h1) % makes h1 the current figure
subplot(133)
sons(patterns_in_clusters,centroids_2nd_hierarchy([1 2 3],:),{'C3.1','C3.2','C3.3'});
title({'C3';''},'FontWeight','bold')
%------------------------------------------------------------
% MDSonS function : Here we build the 3rd graph of 2nd hierarchy
%------------------------------------------------------------
figure(h2) % makes h2 the current figure
subplot(133)
mdsons(patterns_in_clusters,centroids_2nd_hierarchy([1 2 3],:),{'C3.1','C3.2','C3.3'});
title({'C3';''},'FontWeight','bold')
```