Workflow

  1. Loading necessary libraries and data
  • Loading libraries
  • Print citations
  • Loading necessary data
  1. Investigation of two-step clustering
  • Cluster tendency (columns)
    • Calculating Hopkins index
    • Hierarchical clustering for indicators
    • Apply multidimensional scaling to indicators
  • Cluster tendency (rows)
    • Calculating Hopkins index
    • Hierarchical clustering for indicators
    • Apply multidimensional scaling to indicators
  1. Model reduction
  • Principal component analysis
    • Determination of PC R (resources)
    • Determination of PC E (environment)
    • Determination of PC C (connectivity)
    • Determination of PC O (output)
    • Correlation test for principal components
  • Factor analysis
  1. Data reduction
  • Organize the factors into 3 clusters
  • Organize the factors into 4 clusters
  1. Determining partial rankings
  • Correlation between rankings for 3 clusters
  • Correlation between rankings for 4 clusters
  1. Summary and conclusion

1. Loading necessary libraries and data

Citations

## Wright K (2021). _corrgram: Plot a Correlogram_. R package version
## 1.14, <URL: https://CRAN.R-project.org/package=corrgram>.
## Wei T, Simko V (2021). _R package "corrplot": Visualization of a
## Correlation Matrix_. (Version 0.88), <URL:
## https://github.com/taiyun/corrplot>.
## Revelle W (2021). _psych: Procedures for Psychological, Psychometric,
## and Personality Research_.  Northwestern University, Evanston,
## Illinois. R package version 2.1.3, <URL:
## https://CRAN.R-project.org/package=psych>.
## Kassambara A, Mundt F (2020). _factoextra: Extract and Visualize the
## Results of Multivariate Data Analyses_. R package version 1.0.7, <URL:
## https://CRAN.R-project.org/package=factoextra>.
## Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
## Fourth edition. Springer, New York. ISBN 0-387-95457-0, <URL:
## https://www.stats.ox.ac.uk/pub/MASS4/>.
## Ferraro M, Giordani P, Serafini A (2019). "fclust: An R Package for
## Fuzzy Clustering." _The R Journal_, *11*. <URL:
## https://journal.r-project.org/archive/2019/RJ-2019-017/RJ-2019-017.pdf>.
## Wright K, YiLan L, RuTong Z (2021). _clustertend: Check the Clustering
## Tendency_. R package version 1.5, <URL:
## https://CRAN.R-project.org/package=clustertend>.
## Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2021). _cluster:
## Cluster Analysis Basics and Extensions_. R package version 2.1.2 - For
## new features, see the 'Changelog' file (in the package source), <URL:
## https://CRAN.R-project.org/package=cluster>.
## Hennig C (2020). _fpc: Flexible Procedures for Clustering_. R package
## version 2.2-9, <URL: https://CRAN.R-project.org/package=fpc>.
## Scrucca L, Fop M, Murphy TB, Raftery AE (2016). "mclust 5: clustering,
## classification and density estimation using Gaussian finite mixture
## models." _The R Journal_, *8*(1), 289-317. <URL:
## https://doi.org/10.32614/RJ-2016-021>.
## Rosseel Y (2012). "lavaan: An R Package for Structural Equation
## Modeling." _Journal of Statistical Software_, *48*(2), 1-36. <URL:
## https://www.jstatsoft.org/v48/i02/>.
## Epskamp S (2019). _semPlot: Path Diagrams and Visual Analysis of
## Various SEM Packages' Output_. R package version 1.1.2, <URL:
## https://CRAN.R-project.org/package=semPlot>.
## Harrell Jr FE, Dupont wcfC, others. m (2021). _Hmisc: Harrell
## Miscellaneous_. R package version 4.5-0, <URL:
## https://CRAN.R-project.org/package=Hmisc>.
## Hahsler M, Buchta C, Hornik K (2020). _seriation: Infrastructure for
## Ordering Objects Using Seriation_. R package version 1.2-9, <URL:
## https://CRAN.R-project.org/package=seriation>.
## 
## Hahsler M, Hornik K, Buchta C (2008). "Getting things in order: An
## introduction to the R package seriation." _Journal of Statistical
## Software_, *25*(3), 1-34. ISSN 1548-7660, doi: 10.18637/jss.v025.i03
## (URL: https://doi.org/10.18637/jss.v025.i03), <URL:
## https://www.jstatsoft.org/v25/i03/>.
## Wickham H (2007). "Reshaping Data with the reshape Package." _Journal
## of Statistical Software_, *21*(12), 1-20. <URL:
## http://www.jstatsoft.org/v21/i12/>.
## Wickham H, François R, Henry L, Müller K (2021). _dplyr: A Grammar of
## Data Manipulation_. R package version 1.0.6, <URL:
## https://CRAN.R-project.org/package=dplyr>.
## Xie Y (2021). _knitr: A General-Purpose Package for Dynamic Report
## Generation in R_. R package version 1.33, <URL:
## https://yihui.org/knitr/>.
## 
## Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
## Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL:
## https://yihui.org/knitr/>.
## 
## Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
## R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
## Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595,
## <URL: http://www.crcpress.com/product/isbn/9781466561595>.
## Becker OScbRA, Minka ARWRvbRBEbTP, Deckmyn. A (2018). _maps: Draw
## Geographical Maps_. R package version 3.3.0, <URL:
## https://CRAN.R-project.org/package=maps>.
## Becker OScbRA, Brownrigg. ARWRvbR (2018). _mapdata: Extra Map
## Databases_. R package version 2.3.0, <URL:
## https://CRAN.R-project.org/package=mapdata>.
## Bivand R, Lewin-Koh N (2021). _maptools: Tools for Handling Spatial
## Objects_. R package version 1.1-1, <URL:
## https://CRAN.R-project.org/package=maptools>.
## Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J,
## McPherson J, Dipert A, Borges B (2021). _shiny: Web Application
## Framework for R_. R package version 1.6.0, <URL:
## https://CRAN.R-project.org/package=shiny>.
## Cheng J, Sievert C, Chang W, Xie Y, Allen J (2021). _htmltools: Tools
## for HTML_. R package version 0.5.1.1, <URL:
## https://CRAN.R-project.org/package=htmltools>.
## Murdoch D, Adler D (2021). _rgl: 3D Visualization Using OpenGL_. R
## package version 0.106.8, <URL: https://CRAN.R-project.org/package=rgl>.
## Murdoch D (2016). _rglwidget: 'rgl' in 'htmlwidgets' Framework_. R
## package version 0.2.1, <URL:
## https://CRAN.R-project.org/package=rglwidget>.
## Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
## Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL:
## https://ggplot2.tidyverse.org>.
## Slowikowski K (2021). _ggrepel: Automatically Position Non-Overlapping
## Text Labels with 'ggplot2'_. R package version 0.9.1, <URL:
## https://CRAN.R-project.org/package=ggrepel>.
## de Vries A, Ripley BD (2020). _ggdendro: Create Dendrograms and Tree
## Diagrams Using 'ggplot2'_. R package version 0.1.22, <URL:
## https://CRAN.R-project.org/package=ggdendro>.
## Cheng J, Galili T (2018). _d3heatmap: Interactive Heat Maps Using
## 'htmlwidgets' and 'D3.js'_. R package version 0.6.1.2, <URL:
## https://CRAN.R-project.org/package=d3heatmap>.
## Csardi G, Nepusz T (2006). "The igraph software package for complex
## network research." _InterJournal_, *Complex Systems*, 1695. <URL:
## https://igraph.org>.
## Xiao N (2018). _ggsci: Scientific Journal and Sci-Fi Themed Color
## Palettes for 'ggplot2'_. R package version 2.9, <URL:
## https://CRAN.R-project.org/package=ggsci>.
## Sievert C (2020). _Interactive Web-Based Data Visualization with R,
## plotly, and shiny_. Chapman and Hall/CRC. ISBN 9781138331457, <URL:
## https://plotly-r.com>.

2. Two-step clustering

As a first step, we examine whether the data are suitable for clustering. Clustering should be checked for both variables and countries.

Cluster tendency of indicators

Calculating Hopkins indicator

The Hopkins index (H) helps to control clustering. If the Hopkins index (H) is greater than 0.5, it is not worth clustering.

When clustering the indicators, we choose Spearman correlation as a similarity indicator and 1-Spearman correlation as a distance indicator, because the rank correlation also shows nonlinear relationships.

## [1] "H:=  0.311994877382919"

Since the Hopkins index is less than 0.5, the indicators can be clustered.

The next step is to plot cluster tendency in order to determine the preliminary number of clusters.

**Figure 1. ** Cluster tendency of indicators

Figure 1. Cluster tendency of indicators

The Figure 1 shows, that 2-4 clusters (which are the groups of indicators here) can be specified, but this should be confirmed by further studies.

Hierarchical clustering for indicators

The next step is to use hierarchical clustering for the indicators. The result of the clustering is shown in Figure 2.

**Figure 2.** Hierarchical clustering for the indicators

Figure 2. Hierarchical clustering for the indicators

Hierarchical clustering also makes 2-4 clusters (in this case a group of indicators) probable. Displaying the distance matrix provides additional help in determining the number of clusters. Figure 3 shows the distance matrix for the indicators, where the blue cells represent values close to -1 and the dark red cells represent values close to 1.

**Figure 3** Distance matrix for the indicators

Figure 3 Distance matrix for the indicators

Van Mechelen et al. (1993) seggested the following criteria for clustering:

  1. Within-cluster dissimilarities should be small.

  2. Between-cluster dissimilarities should be large.

  3. Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or, if appropriate, by linear, time series or spatial process models.

  4. Members of a cluster should be well represented by its centroid.

  5. The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).

  6. Clusters should be stable.

  7. Clusters should correspond to connected areas in data space with high density.

  8. The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).

  9. It should be possible to characterize the clusters using a small number of variables.

  10. Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.

  11. Features should be approximately independent within clusters.

  12. All clusters should have roughly the same size.

  13. The number of clusters should be low.

Cluster statistics are calculated for 2-4 clusters.

## $n
## [1] 24
## 
## $cluster.number
## [1] 2
## 
## $cluster.size
## [1] 21  3
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.2498799 0.7726897
## 
## $average.distance
## [1] 0.5610211 0.6223547
## 
## $median.distance
## [1] 0.5672502 0.6203722
## 
## $separation
## [1] 0.68497 0.68497
## 
## $average.toother
## [1] 1.015888 1.015888
## 
## $separation.matrix
##         [,1]    [,2]
## [1,] 0.00000 0.68497
## [2,] 0.68497 0.00000
## 
## $ave.between.matrix
##          [,1]     [,2]
## [1,] 0.000000 1.015888
## [2,] 1.015888 0.000000
## 
## $average.between
## [1] 1.015888
## 
## $average.within
## [1] 0.5686878
## 
## $n.between
## [1] 63
## 
## $n.within
## [1] 213
## 
## $max.diameter
## [1] 1.24988
## 
## $min.separation
## [1] 0.68497
## 
## $within.cluster.ss
## [1] 4.111321
## 
## $clus.avg.silwidths
##         1         2 
## 0.4384823 0.3867175 
## 
## $avg.silwidth
## [1] 0.4320117
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.6564944
## 
## $dunn
## [1] 0.5480287
## 
## $dunn2
## [1] 1.63233
## 
## $entropy
## [1] 0.3767702
## 
## $wb.ratio
## [1] 0.5597936
## 
## $ch
## [1] 10.44002
## 
## $cwidegap
## [1] 0.5204666 0.6203722
## 
## $widestgap
## [1] 0.6203722
## 
## $sindex
## [1] 0.68497
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL
## $n
## [1] 24
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 15  3  6
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.0665562 0.7726897 0.9575082
## 
## $average.distance
## [1] 0.4433945 0.6223547 0.7098425
## 
## $median.distance
## [1] 0.4366579 0.6203722 0.6949694
## 
## $separation
## [1] 0.2946191 0.6849700 0.2946191
## 
## $average.toother
## [1] 0.7970161 1.0158882 0.7187454
## 
## $separation.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 0.7845616 0.2946191
## [2,] 0.7845616 0.0000000 0.6849700
## [3,] 0.2946191 0.6849700 0.0000000
## 
## $ave.between.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 1.0441516 0.6734484
## [2,] 1.0441516 0.0000000 0.9452299
## [3,] 0.6734484 0.9452299 0.0000000
## 
## $average.between
## [1] 0.8144531
## 
## $average.within
## [1] 0.5323766
## 
## $n.between
## [1] 153
## 
## $n.within
## [1] 123
## 
## $max.diameter
## [1] 1.066556
## 
## $min.separation
## [1] 0.2946191
## 
## $within.cluster.ss
## [1] 3.425579
## 
## $clus.avg.silwidths
##           1           2           3 
##  0.34829660  0.33917439 -0.05553832 
## 
## $avg.silwidth
## [1] 0.2461976
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5722906
## 
## $dunn
## [1] 0.276234
## 
## $dunn2
## [1] 0.9487293
## 
## $entropy
## [1] 0.9002561
## 
## $wb.ratio
## [1] 0.6536614
## 
## $ch
## [1] 8.082114
## 
## $cwidegap
## [1] 0.4325824 0.6203722 0.5920308
## 
## $widestgap
## [1] 0.6203722
## 
## $sindex
## [1] 0.2946191
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL
## $n
## [1] 24
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1]  5 10  3  6
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.0665562 0.4296074 0.7726897 0.9575082
## 
## $average.distance
## [1] 0.6794132 0.2371534 0.6223547 0.7098425
## 
## $median.distance
## [1] 0.7161004 0.2366260 0.6203722 0.6949694
## 
## $separation
## [1] 0.2746400 0.2746400 0.6849700 0.2946191
## 
## $average.toother
## [1] 0.7358357 0.6848112 1.0158882 0.7187454
## 
## $separation.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.2746400 0.7845616 0.5761566
## [2,] 0.2746400 0.0000000 0.8014765 0.2946191
## [3,] 0.7845616 0.8014765 0.0000000 0.6849700
## [4,] 0.5761566 0.2946191 0.6849700 0.0000000
## 
## $ave.between.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.5818078 1.0129007 0.8540163
## [2,] 0.5818078 0.0000000 1.0597770 0.5831645
## [3,] 1.0129007 1.0597770 0.0000000 0.9452299
## [4,] 0.8540163 0.5831645 0.9452299 0.0000000
## 
## $average.between
## [1] 0.7571513
## 
## $average.within
## [1] 0.4956133
## 
## $n.between
## [1] 203
## 
## $n.within
## [1] 73
## 
## $max.diameter
## [1] 1.066556
## 
## $min.separation
## [1] 0.27464
## 
## $within.cluster.ss
## [1] 3.01833
## 
## $clus.avg.silwidths
##          1          2          3          4 
## -0.1273645  0.5697175  0.3391744 -0.1851225 
## 
## $avg.silwidth
## [1] 0.2069642
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5264653
## 
## $dunn
## [1] 0.2575017
## 
## $dunn2
## [1] 0.8196294
## 
## $entropy
## [1] 1.298077
## 
## $wb.ratio
## [1] 0.6545763
## 
## $ch
## [1] 6.723373
## 
## $cwidegap
## [1] 0.6839779 0.2432582 0.6203722 0.5920308
## 
## $widestgap
## [1] 0.6839779
## 
## $sindex
## [1] 0.27464
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Unfortunately, based on the above results, it is not possible to decide how many (2, 3 or 4) clusters can be formed on the indicators. So the interpretability of the clusters will decide this problem.

Multidimensional scaling for indicators

To examine the interpretability of the clusters, the next step is to use multidimensional scaling (MDS) for the indicators. First, we form two clusters, the results of which are shown in Figure 4.

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged
**Figure 4.** Multidimensional scaling for indicators considering 2 clusters

Figure 4. Multidimensional scaling for indicators considering 2 clusters

As shown in Figure 4, the two clusters are difficult to interpret. The first cluster included 19 indicators, while the second included only 3 indicators.

Figure 5 shows the result of multidimensional scaling in the case of specifying 3 clusters.

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged
**Figure 5.** Multidimensional scaling for indicators considering 3 clusters

Figure 5. Multidimensional scaling for indicators considering 3 clusters

Based on Figure 5, the 3 clusters (i.e. 3 groups of indicators) are also difficult to interpret. There were 15 indicators in the first cluster, 3 in the second cluster and 6 in the third cluster. However, the indicators measuring the relationships (C1, C2, C3 and C4) have already been clustered.

In the next step, we will look at how to group the indicators if we create 4 clusters. The result is shown in Figure 6.

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged
**Figure 6.** Multidimensional scaling for indicators considering 3 clusters

Figure 6. Multidimensional scaling for indicators considering 3 clusters

If four clusters are specified, 5 indicators will be added to the first cluster, 10 to the second, 3 to the third, and 6 to the fourth. Purely only the variable group of the relationships containing the variables C1-C4 appears in the fourth cluster, in the case of the other clusters the composition of the indicators is very mixed. Four clusters (groups of variables) are also difficult to interpret, so it is necessary to reduce the number of variables, which can be done, for example, by principal component analysis.

Cluster tendency of countries

Calculating Hopkins indicator

The Hopkins index (H) helps to control clustering. If the Hopkins index (H) is greater than 0.5, it is not worth clustering. In the case of clustering rows (countries), distance is the Euclidean distance.

## [1] "H:= 0.220381143328058"

Since the Hopkins index is less than 0.5, countries (i.e., rows) can be clustered.

The next step is to plot the clustering in order to determine the preliminary number of clusters.

**Figure 7.** cluster tendency of countries

Figure 7. cluster tendency of countries

Based on Figure 7, 3-4 clusters (ie a group of countries) are likely, but this needs to be confirmed by further studies.

Hierarchical clustering of countries

The next step is to apply hierarchical clustering for countries. The result of the clustering is shown in Figure 8.

**8. ábra** A hierarchikus klaszterezés eredménye az országok esetében

8. ábra A hierarchikus klaszterezés eredménye az országok esetében

Hierarchical clustering makes the existence of 3-4 clusters (here a group of countries) probable.

Displaying the distance matrix provides additional help in determining the number of clusters. Figure 9 shows the distance matrix for countries, with blue cells representing values close to -1 and dark red cells representing values close to 1.

**9. ábra** Távolságmátrix az országok esetében

9. ábra Távolságmátrix az országok esetében

Based on the distance matrix shown in Figure 9, 3-4 clusters are probable. In the next step, we compare the cases where we create 3 or 4 clusters for the countries.

## $n
## [1] 50
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 21 15 14
## 
## $min.cluster.size
## [1] 14
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 172.7304 187.5483 131.6234
## 
## $average.distance
## [1]  96.74970 114.40191  94.56826
## 
## $median.distance
## [1]  92.82482 111.94521  98.24285
## 
## $separation
## [1] 57.08949 64.51798 57.08949
## 
## $average.toother
## [1] 165.3137 174.3910 131.3474
## 
## $separation.matrix
##           [,1]      [,2]     [,3]
## [1,]   0.00000 136.35107 57.08949
## [2,] 136.35107   0.00000 64.51798
## [3,]  57.08949  64.51798  0.00000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]
## [1,]   0.0000 200.0512 128.0951
## [2,] 200.0512   0.0000 135.9007
## [3,] 128.0951 135.9007   0.0000
## 
## $average.between
## [1] 157.7719
## 
## $average.within
## [1] 101.4346
## 
## $n.between
## [1] 819
## 
## $n.within
## [1] 406
## 
## $max.diameter
## [1] 187.5483
## 
## $min.separation
## [1] 57.08949
## 
## $within.cluster.ss
## [1] 258714.8
## 
## $clus.avg.silwidths
##         1         2         3 
## 0.2440727 0.1473629 0.1832766 
## 
## $avg.silwidth
## [1] 0.1980369
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.566333
## 
## $dunn
## [1] 0.3043989
## 
## $dunn2
## [1] 1.119693
## 
## $entropy
## [1] 1.081972
## 
## $wb.ratio
## [1] 0.6429189
## 
## $ch
## [1] 24.42036
## 
## $cwidegap
## [1]  98.27018 125.24363 102.04286
## 
## $widestgap
## [1] 125.2436
## 
## $sindex
## [1] 61.91874
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL
## $n
## [1] 50
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 21 11 14  4
## 
## $min.cluster.size
## [1] 4
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 172.7304 148.9883 131.6234 185.2957
## 
## $average.distance
## [1]  96.74970 106.13002  94.56826 128.13876
## 
## $median.distance
## [1]  92.82482 108.07670  98.24285 135.48427
## 
## $separation
## [1] 57.08949 60.81710 57.08949 60.81710
## 
## $average.toother
## [1] 165.3137 158.4569 131.3474 186.9005
## 
## $separation.matrix
##           [,1]      [,2]      [,3]     [,4]
## [1,]   0.00000 136.35107  57.08949 201.0139
## [2,] 136.35107   0.00000  64.51798  60.8171
## [3,]  57.08949  64.51798   0.00000 102.1068
## [4,] 201.01385  60.81710 102.10676   0.0000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,]   0.0000 187.6245 128.0951 234.2247
## [2,] 187.6245   0.0000 124.8736 122.8686
## [3,] 128.0951 124.8736   0.0000 166.2251
## [4,] 234.2247 122.8686 166.2251   0.0000
## 
## $average.between
## [1] 155.9924
## 
## $average.within
## [1] 100.7137
## 
## $n.between
## [1] 863
## 
## $n.within
## [1] 362
## 
## $max.diameter
## [1] 185.2957
## 
## $min.separation
## [1] 57.08949
## 
## $within.cluster.ss
## [1] 247876.6
## 
## $clus.avg.silwidths
##           1           2           3           4 
##  0.24407268  0.07194809  0.15158446 -0.03399656 
## 
## $avg.silwidth
## [1] 0.158063
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5576146
## 
## $dunn
## [1] 0.3080993
## 
## $dunn2
## [1] 0.9588711
## 
## $entropy
## [1] 1.255947
## 
## $wb.ratio
## [1] 0.6456321
## 
## $ch
## [1] 17.30098
## 
## $cwidegap
## [1]  98.27018 106.99219 102.04286 174.91549
## 
## $widestgap
## [1] 174.9155
## 
## $sindex
## [1] 60.06623
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Nevertheless, based on the results, it is not possible to decide how many (3 or 4) clusters can be specified for the countries. Then the number of clusters should be decided by their interpretability.

Multidimensional scalig for countries

To examine the interpretability of clusters, the next step is to use multidimensional scaling for countries. First, we specify three clusters (i.e. three groups of countries), the results of which are shown in Figure 10.

## initial  value 0.728168 
## final  value 0.727938 
## converged
**Figure 10.** Multidimensional scaling for countries considering 3 clusters

Figure 10. Multidimensional scaling for countries considering 3 clusters

Figure 10 shows that the first cluster included 21 countries, the second 15 countries and 14 countries.

In the next step, we will look at how to specify 4 clusters. The result is shown in Figure 11.

## initial  value 0.728168 
## final  value 0.727938 
## converged
**Figure 11.** Multidimensional scaling for countries considering 4 clusters

Figure 11. Multidimensional scaling for countries considering 4 clusters

In the case of specifying 4 clusters, 21 countries were included in the first cluster, 11 in the second, 14 in the third, and 4 in the fourth. Compared to the previous case (when we specified three clusters), the countries in the first and in the third clusters did not change. The fourth cluster included the countries (Denmark, Sweden, Switzerland, United States) that belonged to the second cluster in the previous case. The principles formulated by Van Mechelen (1993) are mostly fulfilled by 4 clusters.

The results of cluster tendency: It is clear that clustering can be made by both variables and countries (see Hopkins index). However, the resulting clusters are quite difficult to interpret. The groups of variables contain mixed indicators, so model reduction is needed before grouping countries.

#3. Model reduction

Principal Component Analysis

The first step is to determine the number of principal components / factors. The Figure 12 shows the scree plot, which can be used to specify the number of latent variables (i.e. principal components).

**Figure 12 scree plot** Könyökdiagram

Figure 12 scree plot Könyökdiagram

Based on Figure 12, maximum 5 principal component (PC), while maximum 3 factor should be kept.

U21 contains 4 main groups of indicators: resources (R1 - R5), environment (E1 - E4), connectivity (C1 - C6) and outputs (O1 - O9). In the case of principal component analysis, it is worth starting from these predefined groups of indicators.

R (resources) PC

First, we define the principal component R, which contains the indicators R1-R5, that describe the resources of each country.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Rs))
## Overall MSA =  0.65
## MSA for each item = 
##   R1   R2   R3   R4   R5 
## 0.75 0.54 0.66 0.68 0.60
## 
## Factor analysis with Call: principal(r = Rs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  1.28 
## The number of observations was  50  with Chi Square =  58.63  with prob <  2.3e-11 
## 
## The root mean square of the residuals (RMSA) is  0.15
##        R1        R2        R3        R4        R5 
## 0.4804492 0.3481106 0.7782275 0.8074245 0.8522280
## [1] "Variance rate= 0.653287948596533"

Since the KMO value is greater than 0.5, principal component / factor analysis can be performed. All communality is greater than 0.25, which means that the original variables fit the principal component correctly. Furthermore, the variance ratio is greater than 0.5, which means that at least 50% of the information remains even after model reduction. Based on these, the R principal component created above is appropriate and can be used in further analyzes.

E (environment) PC

Next, we specify the principal component E, which contains the indicators E1-E4, that describe the environmental indicators of each country.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA =  0.55
## MSA for each item = 
##   E1   E2   E3   E4 
## 0.61 0.56 0.53 0.54
## 
## Factor analysis with Call: principal(r = Es, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.28 
## The number of observations was  50  with Chi Square =  13.06  with prob <  0.0015 
## 
## The root mean square of the residuals (RMSA) is  0.21
##        E1        E2        E3        E4 
## 0.3235923 0.1889422 0.5982086 0.6393162
## [1] "Variance rate= 0.43751482380739"

ased on the above results, it can be seen that the communality of E2 is less than 0.25 (0.189), so E2 does not fit the PC properly. However, if we exclude E2 in the next step, the communality value of E1 will be decreased less to 0.25. Most indicators can be saved, if E3 will be excluded, instead of E2. Thus, in the next step, we omit E3.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA =  0.55
## MSA for each item = 
##   E1   E2   E4 
## 0.53 0.54 0.61
## 
## Factor analysis with Call: principal(r = Es, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 0  and the objective function was  0.22 
## The number of observations was  50  with Chi Square =  10.42  with prob <  NA 
## 
## The root mean square of the residuals (RMSA) is  0.25
##        E1        E2        E4 
## 0.6167214 0.5175482 0.2970137
## [1] "Variance rate= 0.477094428557052"

After excluding E3, the communality value is greater than 0.25 for all variables. The KMO value is greater than 0.5. However, the value of the variance ratio is less than 0.5 (0.477), which means that at least 50% of the information is lost after the model reduction.

C (connenctions) PC

Next, we specify the principal component C, which contains the indicators C1-C6, that describe the connectivity and communications indicators of each country.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA =  0.67
## MSA for each item = 
##   C1   C2   C3   C4   C5   C6 
## 0.57 0.66 0.37 0.70 0.82 0.78
## 
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 9  and the objective function was  0.53 
## The number of observations was  50  with Chi Square =  24.07  with prob <  0.0042 
## 
## The root mean square of the residuals (RMSA) is  0.14
##        C1        C2        C3        C4        C5        C6 
## 0.4456108 0.5144444 0.1290487 0.6247154 0.6170022 0.5594279
## [1] "Variance rate= 0.481708220917479"

The communality value of C3 is lower than 0.25, therefore, this variable should be excluded in the next step.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA =  0.79
## MSA for each item = 
##   C1   C2   C4   C5   C6 
## 0.78 0.79 0.80 0.79 0.80
## 
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  0.23 
## The number of observations was  50  with Chi Square =  10.55  with prob <  0.061 
## 
## The root mean square of the residuals (RMSA) is  0.13
##        C1        C2        C4        C5        C6 
## 0.5120676 0.5011992 0.5971696 0.6207606 0.5728886
## [1] "Variance rate= 0.560817123312952"

Both the KMO and the variance rate is greater than 0.5.

O (outputs) PC

At the last step, we specify the principal component O, which contains the indicators O1-O9, that describe the research output indicators of each country.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.73
## MSA for each item = 
##   O1   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64
## 
## Factor analysis with Call: principal(r = Os, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 27  and the objective function was  3.28 
## The number of observations was  50  with Chi Square =  145.84  with prob <  2.9e-18 
## 
## The root mean square of the residuals (RMSA) is  0.14
##           O1           O2           O3           O4           O5           O6 
## 0.1066000549 0.8603383885 0.7882600079 0.7698125877 0.6266131079 0.3970218624 
##           O7           O8           O9 
## 0.5495702936 0.7077575306 0.0005444639
## [1] "Variance rate= 0.560817123312952"

Based on the above results, for both O1 and O9, the communality is lower than 0.25, however, the variance ratio is adequate (0.56). There are two options: use one principal component or two factors.

OPT1 - Find one principal component

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.79
## MSA for each item = 
##   O2   O3   O4   O5   O6   O7   O8 
## 0.74 0.79 0.81 0.75 0.82 0.78 0.85
## 
## Factor analysis with Call: principal(r = Os, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 14  and the objective function was  1.62 
## The number of observations was  50  with Chi Square =  73.02  with prob <  5.4e-10 
## 
## The root mean square of the residuals (RMSA) is  0.1
##        O2        O3        O4        O5        O6        O7        O8 
## 0.8860009 0.7906704 0.7907994 0.5663786 0.4074405 0.5578915 0.7229660
## [1] "Variance rate= 0.560817123312952"

The value of KMO is greater than 0.5, for all indicators the communality is greater than 0.25, the variance rate is 0.56, i.e. at least 50% of the original information remains even after model reduction. However, two important indicators (O1 and O9) were dropped from the model. The O1 indicator is the number of articles published by higher education institutions, while the O9 is the proportion of unemployed graduates. These indicators are also worthwhile and would need to be preserved after model reduction.

OPT2 - Find two (or more) factors

A második lehetőség, hogy több mutatót őrizzünk meg, 2 faktor használata. 2 faktorra vonatkozó eredmények lentebb olvashatók.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.73
## MSA for each item = 
##   O1   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64
## 
## Factor analysis with Call: principal(r = Os, nfactors = 2)
## 
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 19  and the objective function was  1.87 
## The number of observations was  50  with Chi Square =  81.98  with prob <  8.4e-10 
## 
## The root mean square of the residuals (RMSA) is  0.11
##        O1        O2        O3        O4        O5        O6        O7        O8 
## 0.8438830 0.9223750 0.7955037 0.8210897 0.8665949 0.4008533 0.5506741 0.7255365 
##        O9 
## 0.3083455
## 
## Loadings:
##    RC1    RC2   
## O1  0.138  0.908
## O2  0.959       
## O3  0.886  0.104
## O4  0.905       
## O5  0.670  0.646
## O6  0.629       
## O7  0.732  0.124
## O8  0.850       
## O9         0.547
## 
##                  RC1   RC2
## SS loadings    4.656 1.579
## Proportion Var 0.517 0.175
## Cumulative Var 0.517 0.693

In the case of two (principal) factors, akk communality values are greater than 0.25; however, the correlation between factor 1 and O5, and the factor 2 and O5 is very similar (|cor(RC1,O5)|<2|cor(RC2,O5)| or |cor(RC1,O5)|<|cor(RC2,O5)|+0.25). Therefore O5 cannot be matched to any factor. In this case we say that O5 is a common indicator, which has to be excluded from the model.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.78
## MSA for each item = 
##   O1   O2   O3   O4   O6   O7   O8   O9 
## 0.39 0.75 0.83 0.78 0.79 0.80 0.84 0.55
## 
## Factor analysis with Call: principal(r = Os, nfactors = 2)
## 
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 13  and the objective function was  1.21 
## The number of observations was  50  with Chi Square =  53.62  with prob <  7.1e-07 
## 
## The root mean square of the residuals (RMSA) is  0.1
##        O1        O2        O3        O4        O6        O7        O8        O9 
## 0.3429981 0.9340280 0.7946173 0.8450090 0.5310990 0.5954249 0.7515102 0.7198578
## 
## Loadings:
##    RC1    RC2   
## O1  0.131  0.571
## O2  0.966       
## O3  0.891       
## O4  0.911 -0.123
## O6  0.615  0.390
## O7  0.716  0.288
## O8  0.848  0.180
## O9 -0.111  0.841
## 
##                  RC1   RC2
## SS loadings    4.195 1.319
## Proportion Var 0.524 0.165
## Cumulative Var 0.524 0.689

Both the KMO (0.78) and the variance rate (0.689) is greater than 0.5. All communality values are greater than 0.25. And there is no common indicator.

The first factor contains the folowing indicators:

  • \(O_2\): Total number of articles that are produced by higher education institutions per capita
  • \(O_3\): Average impact of articles, as measured by citations in 2014 of articles that were published in previous years using the Karolinska Institute normalized impact factor
  • \(O_4\): Depth of world-class universities in a country. This is calculated as an average of the institutions’ score of a country that is listed in the top 500 of the Shanghai ranking, divided by the country’s population
  • \(O_6\): Enrollment in tertiary education as a % of the eligible population, which is defined as the 5-year age group after secondary education
  • \(O_7\): % of the population aged 25-64 with a tertiary qualification
  • \(O_8\): Number of researchers (full-time equivalent) in the nation per population

The second factor contains the following indicators:

  • \(O_1\): Total number of journal articles that are produced by higher education institutions
  • \(O_9\): Unemployment rates among tertiary-educated aged 25-64 years compared with unemployment rates for those with only upper-secondary or post-secondary non-tertiary education

Thus, compared to the specification of the principal component \(O\), both the \(O_1\) and \(O_9\) indicators in the second factor were saved.

Since we specified principal components all for \(R\), \(C\), \(E\), \(O\) indicators, the components can even be correlated. Therefore, in the next step, the correlation between them have to be checked.

Correlation analysis for PCs

Figure 13 shows the results of the correlation analysis. The Figure 13 shows that there are strong correlations among principal components \(R\), \(O\), and \(C\).

##           R         E         C         O
## R 1.0000000 0.1659944 0.7354010 0.8207528
## E 0.1659944 1.0000000 0.2930795 0.2203930
## C 0.7354010 0.2930795 1.0000000 0.8482095
## O 0.8207528 0.2203930 0.8482095 1.0000000
**Figure 13.** Correlation between principal components

Figure 13. Correlation between principal components

Correlation analysis is also performed among the PCs \(R\), \(E\), and \(C\) and two specified factors, such as \(O_{fact_{1}}\), and \(O_{fact_{2}}\). The result of the correlation analysis is shown in Figure 14.

##                  R         E          C      O_fact1       O_fact2
## R        1.0000000 0.1659944  0.7354010 8.350790e-01 -5.855570e-02
## E        0.1659944 1.0000000  0.2930795 2.054947e-01  1.568645e-01
## C        0.7354010 0.2930795  1.0000000 8.631607e-01 -1.107934e-01
## O_fact1  0.8350790 0.2054947  0.8631607 1.000000e+00  1.713747e-16
## O_fact2 -0.0585557 0.1568645 -0.1107934 1.713747e-16  1.000000e+00
**Figure 14.** Correlations between PCs and factors

Figure 14. Correlations between PCs and factors

Based on Figure 14 latent variables R, O_fact1, and C are strongly correlated each other. Correlation analysis specifies three latent variable groups: (1) \(R\), \(C\), \(O_{fact_{1}}\), (2) \(O_{fact_{2}}\), (3) \(E\).

Factor analysis

After principal component analysis, factor analysis (FA) is performed on the indicators. As a first step, we examine the applicability of factor analysis, which is aided by scree-plots and calculating the KMO value.

**Figure 15.** scree plot

Figure 15. scree plot

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21_filtered))
## Overall MSA =  0.73
## MSA for each item = 
##   R1   R2   R3   R4   R5   E1   E2   E3   E4   C1   C2   C3   C4   C5   C6   O1 
## 0.70 0.51 0.82 0.75 0.79 0.56 0.35 0.69 0.82 0.76 0.63 0.37 0.83 0.83 0.76 0.36 
##   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.80 0.80 0.79 0.75 0.73 0.66 0.74 0.30

Figure 15 suggests us to specify 3 independent groups of variables (see the same results in Figure 12.)

## 
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 207  and the objective function was  11.54 
## The number of observations was  50  with Chi Square =  440.6  with prob <  5.8e-19 
## 
## The root mean square of the residuals (RMSA) is  0.09
##        R1        R2        R3        R4        R5        E1        E2        E3 
## 0.5654944 0.5743389 0.8261170 0.7944043 0.8819861 0.3833381 0.1206436 0.5076963 
##        E4        C1        C2        C3        C4        C5        C6        O1 
## 0.5389281 0.3994398 0.7234222 0.2496209 0.5761145 0.6911836 0.6506483 0.5851133 
##        O2        O3        O4        O5        O6        O7        O8        O9 
## 0.9083897 0.8677446 0.8564676 0.7466957 0.3476795 0.5736386 0.7121123 0.3973431
## 
## Loadings:
##    RC1    RC2    RC3   
## R1  0.397 -0.639       
## R2  0.439 -0.239 -0.570
## R3  0.906              
## R4  0.849 -0.265       
## R5  0.914 -0.211       
## E1  0.158  0.191  0.567
## E2         0.147  0.302
## E3  0.337  0.593  0.204
## E4  0.516  0.512  0.106
## C1  0.576         0.255
## C2  0.541 -0.159  0.637
## C3  0.154 -0.111  0.462
## C4  0.669  0.311  0.178
## C5  0.815 -0.153       
## C6  0.787  0.163       
## O1  0.227  0.495 -0.537
## O2  0.942         0.138
## O3  0.920         0.140
## O4  0.908         0.175
## O5  0.731  0.305 -0.345
## O6  0.526  0.208 -0.165
## O7  0.644  0.101 -0.386
## O8  0.836              
## O9         0.626       
## 
##                   RC1   RC2   RC3
## SS loadings    10.059 2.231 2.188
## Proportion Var  0.419 0.093 0.091
## Cumulative Var  0.419 0.512 0.603
**Figure 16.** Biplot of 3 factors

Figure 16. Biplot of 3 factors

Figure 16 shows the properties of the three factors. As can be seen from the above results, there are also low communality (e.g. \(C_3\)) and common indicators (e.g. \(R_2\)). In order to get a comprehensive picture of the relative position of the factors and the indicators belonging to each factor, it is expedient to plot the initial state of the factors (\(RC_1\), \(RC_2\), \(RC_3\)) in 3D. This 3D figure is shown in Figure 17. By rotating the figure, you can see how each indicator is grouped, to which factors it belongs. Consistent with the above results, it can be seen here that the first factor includes, for example, \(R_3\), \(R_4\) and \(R_5\).

## null 
##    1

Figure 17. Factors and indicators in 3D

Excluding common and low communality value indicators the result is shown in the Figure 18..

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21s))
## Overall MSA =  0.81
## MSA for each item = 
##   R3   R4   R5   E1   E2   C1   C5   C6   O1   O4   O8   O9 
## 0.79 0.83 0.80 0.40 0.47 0.93 0.85 0.83 0.41 0.85 0.87 0.49
## 
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 33  and the objective function was  2.21 
## The number of observations was  50  with Chi Square =  93.17  with prob <  1.2e-07 
## 
## The root mean square of the residuals (RMSA) is  0.08
##        R3        R4        R5        E1        E2        C1        C5        C6 
## 0.8262154 0.8290226 0.9132030 0.6894734 0.5995707 0.4921809 0.6855794 0.6979789 
##        O1        O4        O8        O9 
## 0.6054390 0.8427117 0.7250600 0.6423689
## 
## Loadings:
##    RC1    RC2    RC3   
## R3  0.888         0.190
## R4  0.905              
## R5  0.954              
## E1  0.134  0.815       
## E2 -0.126  0.764       
## C1  0.616  0.202 -0.269
## C5  0.824              
## C6  0.796 -0.161  0.197
## O1  0.148 -0.114  0.755
## O4  0.906  0.136       
## O8  0.829         0.169
## O9 -0.108  0.296  0.737
## 
##                  RC1   RC2   RC3
## SS loadings    5.787 1.450 1.312
## Proportion Var 0.482 0.121 0.109
## Cumulative Var 0.482 0.603 0.712
**Figure 18.** shows the final factor structure

Figure 18. shows the final factor structure

Figure 19 shows the final factor structure in 3D.

Figure 19 final factor structure in 3D

The 3D figure above illustrates the three distinct groups of indicators.

Figure 20 shows the structure equation model, which graphically shows which indicators belong to which factor. In the figure, the thickness and strength of the arrows illustrate the extent to which each indicator contributes to a given factor.

The list of indicators, which belong to the factor 1:

  • \(R_3\): Annual exp. per student (full-time equivalent) by tertiary education institutions in USD, PPP
  • \(R_4\): Exp. in tertiary education institutions for R&D as a % of GDP
  • \(R_5\): Exp. in tertiary education institutions for R&D per head of population at USD, PPP
  • \(C_1\): % of international students in tertiary education
  • \(C_5\): Responses to question “Knowledge transfer is highly developed between companies and universities”, which was asked of business executives in the annual survey by IMD World Development Centre, Switzerland
  • \(C_6\): % of university research publications that are co-authored with industry researchers
  • \(O_4\): Depth of world-class universities in a country. This is calculated as an average of the institutions’ score of a country that is listed in the top 500 of the Shanghai ranking, divided by the country’s population
  • \(O_8\): Number of researchers (full-time equivalent) in the nation per population

The list of indicators, which belong to the factor 2:

  • \(E_1\): % of female students in tertiary education
  • \(E_2\): % of female academic staff in tertiary institutions

The list of indicators, which belong to the factor 3:

  • \(O_1\): Total number of journal articles that are produced by higher education institutions
  • \(O_9\): Unemployment rates among tertiary-educated aged 25-64 years compared with unemployment rates for those with only upper-secondary or post-secondary non-tertiary education

The first factor (\(RC_1\)) contains both expenditure-type, relationship-type and output-type indicators. The second factor (\(RC_2\)) shows the role of women in higher education, and the third factor (\(RC_3\)) contains output indicators.

**Figure 20.** The final structure ecuation model

Figure 20. The final structure ecuation model

The summary of the evaluatiion of dimension reduciton: If the goal is to specify independent groups of variables, then 3 groups of variables can be defined with both principal component and factor analysis. Of these, two “pure” groups of variables contain only 2-2 indicators, while the third mixed group forms a group of R, C, O indicators. A serious problem is that only 12 of the 24 indicators are preserved.

4. Data reduction

Specifiying 3 clusters on 3 factors

Based on the preliminary clustering study, 3-4 clusters were probable for countries (rows). Thus, as a first step, we specify three clusters.

## $n
## [1] 50
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 18 20 12
## 
## $min.cluster.size
## [1] 12
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 2.180009 5.704993 4.494681
## 
## $average.distance
## [1] 1.146773 1.610924 1.901035
## 
## $median.distance
## [1] 1.108597 1.356117 1.479310
## 
## $separation
## [1] 0.7804456 0.7390499 0.7390499
## 
## $average.toother
## [1] 2.479894 2.470991 2.636661
## 
## $separation.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 0.7804456 0.8288624
## [2,] 0.7804456 0.0000000 0.7390499
## [3,] 0.8288624 0.7390499 0.0000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]
## [1,] 0.000000 2.373189 2.657735
## [2,] 2.373189 0.000000 2.617694
## [3,] 2.657735 2.617694 0.000000
## 
## $average.between
## [1] 2.520423
## 
## $average.within
## [1] 1.513456
## 
## $n.between
## [1] 816
## 
## $n.within
## [1] 409
## 
## $max.diameter
## [1] 5.704993
## 
## $min.separation
## [1] 0.7390499
## 
## $within.cluster.ss
## [1] 77.04899
## 
## $clus.avg.silwidths
##         1         2         3 
## 0.4922942 0.3117005 0.2117167 
## 
## $avg.silwidth
## [1] 0.3527181
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.4333483
## 
## $dunn
## [1] 0.1295444
## 
## $dunn2
## [1] 1.248367
## 
## $entropy
## [1] 1.076819
## 
## $wb.ratio
## [1] 0.6004771
## 
## $ch
## [1] 21.33511
## 
## $cwidegap
## [1] 0.9296514 2.7278667 2.2474481
## 
## $widestgap
## [1] 2.727867
## 
## $sindex
## [1] 0.764972
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

**Figure 21*) shows the results of clustering based on 3 factors

18 countries belong to the first,20 countries belong to the second, and 12 countries belong to the third cluster.

Figure 21. the results of three clusters

Figure 22. shows which countries belong to which clusters. These countries are separated align with the three factors.

**Figure 22.** The specified three clusters on the map

Figure 22. The specified three clusters on the map

Specifiying 4 clusters on 3 factors

Next step is to specify four clusters on 3 factors.

## $n
## [1] 50
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 22  4 17  7
## 
## $min.cluster.size
## [1] 4
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 2.581561 3.503032 2.180009 3.714822
## 
## $average.distance
## [1] 1.228931 2.257800 1.134808 1.939406
## 
## $median.distance
## [1] 1.237384 2.276079 1.096520 1.818473
## 
## $separation
## [1] 0.3953279 0.6144058 0.3953279 0.7669507
## 
## $average.toother
## [1] 2.389488 3.094787 2.484270 3.014312
## 
## $separation.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.6144058 0.3953279 0.7669507
## [2,] 0.6144058 0.0000000 1.1363751 2.1025880
## [3,] 0.3953279 1.1363751 0.0000000 1.3234252
## [4,] 0.7669507 2.1025880 1.3234252 0.0000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,] 0.000000 2.807635 2.142717 2.749849
## [2,] 2.807635 0.000000 3.167920 3.819656
## [3,] 2.142717 3.167920 0.000000 3.167064
## [4,] 2.749849 3.819656 3.167064 0.000000
## 
## $average.between
## [1] 2.612725
## 
## $average.within
## [1] 1.378705
## 
## $n.between
## [1] 831
## 
## $n.within
## [1] 394
## 
## $max.diameter
## [1] 3.714822
## 
## $min.separation
## [1] 0.3953279
## 
## $within.cluster.ss
## [1] 53.58231
## 
## $clus.avg.silwidths
##         1         2         3         4 
## 0.3877404 0.1005759 0.4489660 0.2472150 
## 
## $avg.silwidth
## [1] 0.3659104
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5644199
## 
## $dunn
## [1] 0.1064191
## 
## $dunn2
## [1] 0.9490284
## 
## $entropy
## [1] 1.205341
## 
## $wb.ratio
## [1] 0.5276886
## 
## $ch
## [1] 26.73279
## 
## $cwidegap
## [1] 0.8288624 2.7278667 0.9296514 2.2474481
## 
## $widestgap
## [1] 2.727867
## 
## $sindex
## [1] 0.5028333
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Figure 23 shows, that the specified four clusters are well separated. There are 22 countries in the first cluster, 4 in the second, 17 in the third and 7 in the fourth.

Figure 23. specified four clusters on three factors

Figure 24 shows the map of clustered countries.

**Figure 24.** The specified four clusters on the map

Figure 24. The specified four clusters on the map

5. Specifying partial rankings within specified clusters.

When specifying the partial rankings (within clusters of countries), we should either use the calculated factor weights or the original weights used by U21.

Firt, we calculate the original weights used by U21.

In the next step, the partial rankings are calculated, which (see Table 1). In the last two columns of the Table 1, variables km3 and km4 indicate cluster memberships for clusters 3 and 4, respectively. U21_overall_Rank contains the rankings of the original U21 ranking, FA_Rank the ranking based on factor scores, and U21_Rank_selected is the rankings that only contain indicators, which corresponds to the factors.

Table 1 Partial rankings
ID U21_overall_Rank FA_weights FA_Rank Score_selected U21_Rank_selected km3 km4
Argentina 1 41 3.716323e-02 39 53.20702 16 2 1
Australia 2 9 1.226081e-01 20 46.85638 23 1 3
Austria 3 12 1.205554e-01 22 77.76420 6 1 3
Belgium 4 13 2.847575e-01 6 41.03496 27 1 3
Brazil 5 38 1.626992e-01 14 41.07081 26 2 1
Bulgaria 6 40 5.989465e-02 30 22.48808 47 2 1
Canada 7 4 2.566109e-01 7 77.78374 5 1 3
Chile 8 33 1.120821e-02 43 20.57509 48 3 1
China 9 34 6.702021e-02 28 39.96442 29 2 2
Croatia 10 44 6.752780e-02 27 26.79678 42 2 1
Czech Republic 11 26 1.535241e-01 15 61.93087 9 2 1
Denmark 12 3 1.276323e-01 17 56.33062 12 1 3
Finland 13 5 4.592789e-01 2 95.87306 2 1 3
France 14 18 2.193017e-01 9 40.09535 28 1 3
Germany 15 14 4.278289e-01 3 73.57370 7 2 2
Greece 16 32 3.890460e-02 38 28.52945 41 3 1
Hong Kong 17 15 1.276280e-01 18 50.23970 21 1 3
Hungary 18 29 1.664197e-01 12 31.49285 38 2 2
India 19 50 1.537641e-03 47 37.16362 31 3 4
Indonesia 20 48 1.284827e-03 48 17.93802 50 3 4
Iran 21 49 4.994809e-03 45 35.34617 33 3 4
Ireland 22 17 2.919132e-01 5 41.66170 25 1 3
Israel 23 19 1.652540e-01 13 54.70782 13 1 3
Italy 24 27 4.908183e-02 34 32.20203 36 3 1
Japan 25 20 1.944635e-04 49 61.59416 10 3 4
Korea, Rep. (South) 26 21 1.681845e-05 50 31.96663 37 3 4
Malaysia 27 28 5.352892e-02 31 50.55125 20 2 1
Mexico 28 46 9.292853e-03 44 20.50180 49 3 1
Netherlands 29 7 1.669516e-01 11 68.99285 8 1 3
New Zealand 30 16 7.233421e-02 26 39.72090 30 1 3
Norway 31 11 2.270396e-01 8 78.31016 4 1 3
Poland 32 31 8.456104e-02 24 25.39432 43 2 1
Portugal 33 24 1.182719e-01 23 52.54867 18 1 1
Romania 34 39 4.490066e-02 35 24.88180 44 2 1
Russia 35 36 7.997149e-02 25 44.89472 24 2 1
Saudi Arabia 36 30 2.028512e-02 41 32.57981 35 3 1
Serbia 37 34 5.163543e-02 33 53.96639 14 2 1
Singapore 38 10 5.246536e-02 32 48.01649 22 1 3
Slovakia 39 37 1.219388e-01 21 50.87108 19 2 1
Slovenia 40 25 1.228653e-01 19 30.92769 39 2 1
South Africa 41 45 4.384428e-02 36 29.91428 40 2 1
Spain 42 23 1.420620e-01 16 32.97279 34 2 1
Sweden 43 2 2.060980e-01 10 100.00000 1 1 3
Switzerland 44 6 6.600255e-02 29 52.70262 17 1 3
Taiwan 45 22 3.518628e-02 40 58.63325 11 3 4
Thailand 46 42 4.127924e-02 37 24.36351 45 2 1
Turkey 47 47 3.550801e-03 46 35.56118 32 3 4
Ukraine 48 43 1.884380e-02 42 23.44289 46 2 1
United Kingdom 49 8 3.468806e-01 4 80.62078 3 1 3
United States 50 1 4.693511e-01 1 53.27870 15 2 2

In the next step, the rank correlations between the rankings are calculated.

Rank correlations, in the case of 3 clusters

Figures 25-28 show rank correlations between rankings. Figure 25 shows rank correlations for all countries, while Figures 26-28 show rank correlations separately for the clusters 1-3.

If we examine all countries (see Figure 25), there is a close positive relationship between the rankings of the original U21 ranking (U21_overall_Rank) and the ranking we calculated, which contains only the indicators corresponding to each factor (U21_Rank_selected). If we examine only the rankings of the countries belonging to clusters 1 and 2, a similar result is obtained (see Figures 26-27).

**Figure 25.** Partial rankings fo all countries

Figure 25. Partial rankings fo all countries

**Figure 26.** Partial correlactions between calculated and the original rankings on cluster 1

Figure 26. Partial correlactions between calculated and the original rankings on cluster 1

**Figure 27.** Partial correlactions between calculated and the original rankings on cluster 2

Figure 27. Partial correlactions between calculated and the original rankings on cluster 2

**Figure 28.** Partial correlactions between calculated and the original rankings on cluster 3

Figure 28. Partial correlactions between calculated and the original rankings on cluster 3

Rank correlations, in the case of 4 clusters

Rank correlations are also calculated when four clusters are created. For 1, 2, and 3 clusters, there are moderately strong relationship between the original U21 rankings and calculated rankings.

**Figure 29.** Partial correlactions between calculated and the original rankings on cluster 1

Figure 29. Partial correlactions between calculated and the original rankings on cluster 1

**Figure 30.** Partial correlactions between calculated and the original rankings on cluster 2

Figure 30. Partial correlactions between calculated and the original rankings on cluster 2

**Figure 31.** Partial correlactions between calculated and the original rankings on cluster 3

Figure 31. Partial correlactions between calculated and the original rankings on cluster 3

**Figure 32.** Partial correlactions between calculated and the original rankings on cluster 4

Figure 32. Partial correlactions between calculated and the original rankings on cluster 4

6. Summary and conclusions

The results of the two-step clustering are summarized below.

Figure 33 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.

Figure 33. Heat map of U21 indicators grouped by the set of tree clusters

Figure 34 shows the countries belonging to the three clusters, as well as the indicators that belong to the three factors. Indicators that do not belong to any factor are displayed with a dark red cell. It can be seen that half (12) of the indicators were dropped during the analysis.

Figure 34. selected countries and indicators

Figure 35. Heat map of U21 indicators grouped by the set of four clusters

Figure 35 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.

Figure 36. shows the four clusters specified and the indicators that belong to the three factors.

Figure 36. selected countries and indicators

Summary: Applying two-step clustering is questionable, first for the low sample. The method itself does not allow clusters to overlap, however, it may perform well in some factors in one country and poorly in others. As a result of clustering, we also obtained very small clusters, which are difficult to analyze and interpret.

The additional files attached to the study (including this html) are free to use for scientific purposes. Please cite the original paper

Further references

Van Mechelen, I., J. Hampton, R. S. Michalski, and P. Theuns (1993). Categories and Concepts - Theoretical Views and Inductive Data Analysis. Academic Press, London.