1 Introduction

DBS is a flexible and robust clustering framework that consists of three independent modules [Thrun/Ultsch, 2020]. The first module is the parameter-free projection method Pswarm, which exploits the concepts of self-organization and emergence, game theory, swarm intelligence and symmetry considerations [Thrun/Ultsch, 2020]. The second module is a parameter-free high-dimensional data visualization technique, which generates projected points on a topographic map with hypsometric colors, called the generalized U-matrix. The third module is a clustering method with no sensitive parameters. The clustering can be verified by the visualization and vice versa. The term DBS refers to the method as a whole [Thrun/Ultsch, 2020]. For further details, see [Thrun/Ultsch, 2020].

2 First Example: Automatic approach

Here an example is presented using the automatic approach without any user interaction with shiny. If you want to verify your clustering result externally, you can use Heatmap or SilhouettePlot of the R package DataVisualizations on CRAN.

2.1 First Module: Projection of high-dimensional Data

First generate a two-dimensional projection, the [1:n,1:n] distance matrix of n cases has to be defined by the user.

library(DatabionicSwarm)
library(GeneralizedUmatrix)
data('Hepta')
InputDistances = as.matrix(dist(Hepta$Data))
projection = Pswarm(InputDistances)

2.2 Second Module: Generalized Umatrix

Here the Generalized Umatrix is calculated using a simplified emergent self-organizing map algorithm [Thrun/Ultsch, 2020b]. The output is a list (genUmatrixList) of several elements. Then, the visualization of Generalized Umatrix is can be shown by a 3D landscape called topographic map with hypsometric tints using the output of this list named genUmatrixList.

Hypsometric tints are surface colors that represent ranges of elevation. For the 3D landscape the contour lines are combined with a specific color scale. The color scale is chosen to display various valleys, ridges, and basins: blue colors indicate small distances (sea level), green and brown colors indicate middle distances (low hills), and shades of white colors indicate vast distances (high mountains covered with snow and ice).

Seven valleys are shown resulting in seven main clusters. The resulting visualization will be toroidal meaning that the left borders cyclically connects to the right border (and bottom to top). It means there are no “real” borders in this visualizations. Instead, the visualization is “continuous”. This can be visualized using the ‘Tiled=TRUE’ option of ‘plotTopographicMap’.

Note, that the ‘NoLevels’ option is only set to load this vignette faster and should normally not be set manually. It describes the number contour lines placed relative to the hypsometric tints. All visualizations here are small and a low dpi is set in knitr in order to load the vignette faster.

genUmatrixList = GeneratePswarmVisualization(
  Data = Hepta$Data, 
  projection$ProjectedPoints, 
  projection$LC,
  Parallel=FALSE)#CRAN guidelines do not allow =TRUE in vignette

GeneralizedUmatrix::plotTopographicMap(
  genUmatrixList$Umatrix, 
  genUmatrixList$Bestmatches, 
  NoLevels = 10)

2.3 Third Module: Automatic Clustering

The number of clusters can be derived from dendrogram (PlotIt=TRUE) or the visualization. Therefore we choose the seven valleys as the number of clusters. The function DBSclustering has one parameter to be set. Typically, the default setting “StructureType = TRUE” works fine. However, for density-based structures sometimes StructureType = FALSE of the function ‘DBSclustering’ yields better results. Please verify with the visualization or the Dendrogram. For the Dendrogram choose PlotIt=TRUE in the function ‘DBSclustering’. In the case of “BestmatchingUnits”, the parameter “LC” defines the size of the grid with Lines and columns where the position (0,0) lies in the left upper corner. In the case of “ProjectedPoints”, the point (0,0) lies in the left bottom corner. The transformation is normally done automatically. However, sometimes the user wishes to skip the visualization and use projected points directly. Then “LC” can be changed accordingly to LC[c(2,1)]. Seldom, there could be a rounding error leading to an error catch. In such a case try LC+1.

Cls = DBSclustering(k = 7,
                    DataOrDistance = Hepta$Data,
                    BestMatches = genUmatrixList$Bestmatches,
                    LC = genUmatrixList$LC,
                    PlotIt = FALSE)
GeneralizedUmatrix::plotTopographicMap(genUmatrixList$Umatrix,
                                       genUmatrixList$Bestmatches,
                                       Cls,
                                       NoLevels = 10)

#Second Example: Interactive Approach In this example, we show how to improve an automatic clustering accordingly to the topographic map using a shiny interface. The examples are not run in Rmarkdown, because CRAN wants to check the results regularly and this example is above their time limit.

2.4 First Module: Projection of High-dimensional Data

First, we generate a 2d projection with instant visualization of annealing steps (PlotIt=TRUE). This shows the non-linear process of concentrating on global structures first and later on local structures. Such an approach enables to entangle non-linear high-dimensional structures. If the user does not define the DistanceMatrix, it is automatically set to Euclidean because the data itself can be the input for ‘Pswarm’.

library(DatabionicSwarm)
data('Lsun3D')
projection = Pswarm(Lsun3D$Data,
                    Cls = Lsun3D$Cls,
                    PlotIt = T,
                    Silent = T)

2.5 Second Module: Generalized Umatrix

If Non-Euclidean Distances are used, Please Use SammonsMapping from the ProjectionBasedClustering package with the correct OutputDimension to generate a new data matrix from the distances (see SheppardDiagram of DataVisualization Package or KruskalStress). Here the Generalized Umatrix is calculated using a simplified emergent self-organizing map algorithm. Then the topographic map is visualized based on the information of the Generalized Umatrix.

library(DatabionicSwarm)
library(GeneralizedUmatrix)
genUmatrixList = GeneratePswarmVisualization(
  Data = Lsun3D$Data,
  projection$ProjectedPoints,
  projection$LC)

GeneralizedUmatrix::plotTopographicMap(
  genUmatrixList$Umatrix, 
  genUmatrixList$Bestmatches)

2.6 Third Module: Interactive Clustering

The number of clusters can be derived from dendrogram (PlotIt=TRUE) or the visualization. In this example, outliers should be marked manually in the visualization after the process of automatic clustering. Therefore we choose the three central valleys as the number of clusters. Often, it helps to generate first the shape of an island out of the continuous topographic map because then you already have the most prominent mountains marked as the borders of the visualizations. Then you can improve the clustering by redefining valleys interactively or marking outliers lying in vulcanos. It is strongly suggested to verify such a clustering externally, e.g. Heatmap or some unsupervised index.

library(DatabionicSwarm)
library(GeneralizedUmatrix)
Cls = DBSclustering(
  k = 3,
  Lsun3D$Data,
  genUmatrixList$Bestmatches,
  genUmatrixList$LC,
  PlotIt = FALSE)

GeneralizedUmatrix::plotTopographicMap(
  genUmatrixList$Umatrix, 
  genUmatrixList$Bestmatches,
  Cls)

2.7 Generating the Shape of an Island out of the Topograpahic Map

To generate the 3D landscape in the shape of an island from the toroidal topographic map visualization you may cut your island interactively around high mountain ranges.

library(DatabionicSwarm)
library(ProjectionBasedClustering)
library(GeneralizedUmatrix)

Imx = ProjectionBasedClustering::interactiveGeneralizedUmatrixIsland(
  genUmatrixList$Umatrix, 
  genUmatrixList$Bestmatches, 
  Cls)

GeneralizedUmatrix::plotTopographicMap(
  genUmatrixList$Umatrix,
  genUmatrixList$Bestmatches,
  Cls = Cls,
  Imx = Imx)

2.8 Manually Improving the Clustering Using the Topograpahic Map

In this example, the four outliers can be marked manually with mouse clicks using the shiny interface.

library(ProjectionBasedClustering)
Cls2 = ProjectionBasedClustering::interactiveClustering(
  genUmatrixList$Umatrix,
  genUmatrixList$Bestmatches,
  Cls)

3 Quality Measures of Projection and Clustering

3.1 Gabriel Classification Error

Using insights of graph theory, Gabriel classification error calculates for each projected point the direct neighborhood based on the Gabriel graph [Thrun et al.,2023] . Every direct Connection is weighted with the high-dimensional distance between the two corresponding data points and sorted per neighborhood by these weights. In the next step all sorted projected points points of the direct neighborhood of each projected points acquire new weights according to the harmonic function. Then, the prior classification is used to check which points do not belong to these direct neighborhoods of projected points. The weights of these points are summed up. A lower value indicates a better two-dimensional projection of the high-dimensional Input space. A higher value indicates a worse two-dimensional projection of the high-dimensional Input space.

library(DatabionicSwarm)
data('Lsun3D')
InputDistances = as.matrix(dist(Lsun3D$Data))
projection = Pswarm(InputDistances)

if(requireNamespace("DRquality"))  
  DRquality::GabrielClassificationError(
    Lsun3D$Data,
    projection$ProjectedPoints,
    LC=projection$LC,#optional for toroidal projections
    Lsun3D$Cls,PlotIt=TRUE)

You can also compare various projections method to a common baseline together:

if(requireNamespace("DRquality"))
  GCE_pswarm = DRquality::GabrielClassificationError(
    Lsun3D$Data,
    projection$ProjectedPoints,
    Lsun3D$Cls)$GCE

if(requireNamespace("ProjectionBasedClustering")  
  baselineproj = ProjectionBasedClustering::NeRV(
    Lsun3D$Data)
  
if(requireNamespace("DRquality"))
  GCE_nerv = DRquality::GabrielClassificationError(
    Lsun3D$Data,
    baselineproj,
    Lsun3D$Cls)$GCE
  
RelativeDifference(GCE_pswarm,GCE_nerv)

This has the advantage of an clear range of \([-2,2]\). Further Details can be read in the conference presentation attached to [Thrun/Ultsch,2018] on ResearchGate.

3.2 Clustering Accuracy

The accuracy is defined as follows: \(Accuracy=\frac{No. of true positives}{No.of cases}\) The number of true positives is the number of labeled data points for which the label defined by a prior classification is identical to the label defined after the clustering process. The best of all permutation of labels of the clustering algorithm regarding the accuracy is chosen, because the labels are arbitrarily defined by any algorithm. See details in conference presentation attached to [Thrun et al.,2018] on ResearchGate or in [Thrun et al.,2023].

if(requireNamespace("FCPS"))
    FCPS::ClusterAccuracy(Lsun3D$Cls,Cls)

#References [Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Journal of Artificial Intelligence, Vol. in press, pp. doi 10.1016/j.artint.2020.103237, 2020.

[Thrun/Ultsch, 2020b] Thrun, M. C., & Ultsch, A.: Uncovering High-Dimensional Structures of Projections from Dimensionality Reduction Methods, MethodsX, Vol. 7, pp. 101093, DOI 10.1016/j.mex.2020.101093, 2020.

[Thrun, 2018] Thrun, M. C.: Projection Based Clustering through Self-Organization and Swarm Intelligence, doctoral dissertation 2017, Springer, Heidelberg, ISBN: 978-3-658-20539-3, https://doi.org/10.1007/978-3-658-20540-9, 2018.

[Thrun/Ultsch,2018] Thrun, M. C., & Ultsch, A. : Investigating Quality measurements of projections for the Evaluation of Distance and Density-based Structures of High-Dimensional Data, Proc. European Conference on Data Analysis (ECDA), Paderborn, Germany. 2018a.

[Thrun et al.,2018] Thrun, M. C., Pape, F., & Ultsch, A. : Benchmarking Cluster Analysis Methods using PDE-Optimized Violin Plots, Proc. European Conference on Data Analysis (ECDA), Paderborn, Germany, 2018.

[Thrun et al.,2023] Thrun, M. C., Märte, J., & Stier, Q.: Analyzing Quality Measurements for Dimensionality Reduction, Machine Learning and Knowledge Extraction (MAKE), Vol. 5(3), pp. 1076-1118, DOI: 10.3390/make5030056, 2023.

mirror server hosted at Truenetwork, Russian Federation.