Workflow

Loading necessary libraries and data

Loading libraries
Print citations
Loading necessary data

Investigation of two-step clustering

Cluster tendency (columns)
- Calculating Hopkins index
- Hierarchical clustering for indicators
- Apply multidimensional scaling to indicators
Cluster tendency (rows)
- Calculating Hopkins index
- Hierarchical clustering for indicators
- Apply multidimensional scaling to indicators

Model reduction

Principal component analysis
- Determination of PC R (resources)
- Determination of PC E (environment)
- Determination of PC C (connectivity)
- Determination of PC O (output)
- Correlation test for principal components
Factor analysis

Data reduction

Organize the factors into 3 clusters
Organize the factors into 4 clusters

Determining partial rankings

Correlation between rankings for 3 clusters
Correlation between rankings for 4 clusters

Summary and conclusion

1. Loading necessary libraries and data

Loading libraries

# Correlation analysis
library(corrgram) # Correlogram
library(corrplot) # Correlation plots
source("http://www.sthda.com/upload/rquery_cormat.r") # Correlation matrix representation

# Dimension reduction methods
library(psych) # Factor Analysis, Principal Component Analysis
library(factoextra) # Distance metrics
library(MASS) # Multi-dimensional Scaling

# Cluster analysis
library(fclust) # Cluster analysis
library(clustertend) # Hopkins coefficients
library(cluster) # Cluster plots
library(fpc) # Cluster statistics
library(mclust) # Multi-step clustering

# Complex models
library(lavaan) # CFA, SEM models
library(semPlot) # Plot models

# Further statistical libraries
library(Hmisc) # Misceleaus statistical functions
library(seriation) # Seriation

# Data manipulation
library(reshape2) # Data transformation
library("dplyr", character.only = TRUE) # Data manipulation
library(knitr) # Formatted tables

# Maps
library(maps) # Mapping fuctions
library(mapdata) # Mapping data
library(maptools) # Mapping tools

# Interactive 3D figures

library(shiny) # Interactive figures 
library(htmltools) # HTML widgets
library(rgl) #3D shaping
library(rglwidget) #3D shaping
library(ggplot2) # Complex figures
library(ggrepel) # Complex figures
library(ggdendro) # Dendogram
library(d3heatmap) # Heat map
library(igraph) # Graph representation
library(ggsci) # Simpson palette
library(plotly) # Plotly - interactive plot

Citations

# Correlation analysis
base::print(citation("corrgram"),style="text") # Correlogram

## Wright K (2021). _corrgram: Plot a Correlogram_. R package version
## 1.14, <URL: https://CRAN.R-project.org/package=corrgram>.

base::print(citation("corrplot"),style="text") # Correlation plots

## Wei T, Simko V (2021). _R package "corrplot": Visualization of a
## Correlation Matrix_. (Version 0.88), <URL:
## https://github.com/taiyun/corrplot>.

# Dimension reduction methods
base::print(citation("psych"),style="text") # Factor Analysis, Principal Component Analysis

## Revelle W (2021). _psych: Procedures for Psychological, Psychometric,
## and Personality Research_.  Northwestern University, Evanston,
## Illinois. R package version 2.1.3, <URL:
## https://CRAN.R-project.org/package=psych>.

base::print(citation("factoextra"),style="text") # Distance metrics

## Kassambara A, Mundt F (2020). _factoextra: Extract and Visualize the
## Results of Multivariate Data Analyses_. R package version 1.0.7, <URL:
## https://CRAN.R-project.org/package=factoextra>.

base::print(citation("MASS"),style="text") # Multi-dimensional Scaling

## Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
## Fourth edition. Springer, New York. ISBN 0-387-95457-0, <URL:
## https://www.stats.ox.ac.uk/pub/MASS4/>.

# Cluster analysis
base::print(citation("fclust"),style="text") # Cluster analysis

## Ferraro M, Giordani P, Serafini A (2019). "fclust: An R Package for
## Fuzzy Clustering." _The R Journal_, *11*. <URL:
## https://journal.r-project.org/archive/2019/RJ-2019-017/RJ-2019-017.pdf>.

base::print(citation("clustertend"),style="text") # Hopkins coefficients

## Wright K, YiLan L, RuTong Z (2021). _clustertend: Check the Clustering
## Tendency_. R package version 1.5, <URL:
## https://CRAN.R-project.org/package=clustertend>.

base::print(citation("cluster"),style="text") # Cluster plots

## Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2021). _cluster:
## Cluster Analysis Basics and Extensions_. R package version 2.1.2 - For
## new features, see the 'Changelog' file (in the package source), <URL:
## https://CRAN.R-project.org/package=cluster>.

base::print(citation("fpc"),style="text") # Cluster statistics

## Hennig C (2020). _fpc: Flexible Procedures for Clustering_. R package
## version 2.2-9, <URL: https://CRAN.R-project.org/package=fpc>.

base::print(citation("mclust"),style="text") # Multi-step clustering

## Scrucca L, Fop M, Murphy TB, Raftery AE (2016). "mclust 5: clustering,
## classification and density estimation using Gaussian finite mixture
## models." _The R Journal_, *8*(1), 289-317. <URL:
## https://doi.org/10.32614/RJ-2016-021>.

# Complex models
base::print(citation("lavaan"),style="text") # CFA, SEM models

## Rosseel Y (2012). "lavaan: An R Package for Structural Equation
## Modeling." _Journal of Statistical Software_, *48*(2), 1-36. <URL:
## https://www.jstatsoft.org/v48/i02/>.

base::print(citation("semPlot"),style="text") # Plot models

## Epskamp S (2019). _semPlot: Path Diagrams and Visual Analysis of
## Various SEM Packages' Output_. R package version 1.1.2, <URL:
## https://CRAN.R-project.org/package=semPlot>.

# Further statistical libraries
base::print(citation("Hmisc"),style="text") # Misceleaus statistical functions

## Harrell Jr FE, Dupont wcfC, others. m (2021). _Hmisc: Harrell
## Miscellaneous_. R package version 4.5-0, <URL:
## https://CRAN.R-project.org/package=Hmisc>.

base::print(citation("seriation"),style="text") # Seriation

## Hahsler M, Buchta C, Hornik K (2020). _seriation: Infrastructure for
## Ordering Objects Using Seriation_. R package version 1.2-9, <URL:
## https://CRAN.R-project.org/package=seriation>.
## 
## Hahsler M, Hornik K, Buchta C (2008). "Getting things in order: An
## introduction to the R package seriation." _Journal of Statistical
## Software_, *25*(3), 1-34. ISSN 1548-7660, doi: 10.18637/jss.v025.i03
## (URL: https://doi.org/10.18637/jss.v025.i03), <URL:
## https://www.jstatsoft.org/v25/i03/>.

# Data manipulation
base::print(citation("reshape2"),style="text") # Data transformation

## Wickham H (2007). "Reshaping Data with the reshape Package." _Journal
## of Statistical Software_, *21*(12), 1-20. <URL:
## http://www.jstatsoft.org/v21/i12/>.

base::print(citation("dplyr"),style="text") # Data manipulation

## Wickham H, François R, Henry L, Müller K (2021). _dplyr: A Grammar of
## Data Manipulation_. R package version 1.0.6, <URL:
## https://CRAN.R-project.org/package=dplyr>.

base::print(citation("knitr"),style="text") # Formatted tables

## Xie Y (2021). _knitr: A General-Purpose Package for Dynamic Report
## Generation in R_. R package version 1.33, <URL:
## https://yihui.org/knitr/>.
## 
## Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
## Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL:
## https://yihui.org/knitr/>.
## 
## Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
## R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
## Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595,
## <URL: http://www.crcpress.com/product/isbn/9781466561595>.

# Maps
base::print(citation("maps"),style="text") # Mapping fuctions

## Becker OScbRA, Minka ARWRvbRBEbTP, Deckmyn. A (2018). _maps: Draw
## Geographical Maps_. R package version 3.3.0, <URL:
## https://CRAN.R-project.org/package=maps>.

base::print(citation("mapdata"),style="text") # Mapping data

## Becker OScbRA, Brownrigg. ARWRvbR (2018). _mapdata: Extra Map
## Databases_. R package version 2.3.0, <URL:
## https://CRAN.R-project.org/package=mapdata>.

base::print(citation("maptools"),style="text") # Mapping tools

## Bivand R, Lewin-Koh N (2021). _maptools: Tools for Handling Spatial
## Objects_. R package version 1.1-1, <URL:
## https://CRAN.R-project.org/package=maptools>.

# Interactive 3D figures

base::print(citation("shiny"),style="text") # Interactive figures

## Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J,
## McPherson J, Dipert A, Borges B (2021). _shiny: Web Application
## Framework for R_. R package version 1.6.0, <URL:
## https://CRAN.R-project.org/package=shiny>.

base::print(citation("htmltools"),style="text") # HTML widgets

## Cheng J, Sievert C, Chang W, Xie Y, Allen J (2021). _htmltools: Tools
## for HTML_. R package version 0.5.1.1, <URL:
## https://CRAN.R-project.org/package=htmltools>.

base::print(citation("rgl"),style="text") #3D shaping

## Murdoch D, Adler D (2021). _rgl: 3D Visualization Using OpenGL_. R
## package version 0.106.8, <URL: https://CRAN.R-project.org/package=rgl>.

base::print(citation("rglwidget"),style="text") #3D shaping

## Murdoch D (2016). _rglwidget: 'rgl' in 'htmlwidgets' Framework_. R
## package version 0.2.1, <URL:
## https://CRAN.R-project.org/package=rglwidget>.

base::print(citation("ggplot2"),style="text") # Complex figures

## Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
## Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL:
## https://ggplot2.tidyverse.org>.

base::print(citation("ggrepel"),style="text") # Complex figures

## Slowikowski K (2021). _ggrepel: Automatically Position Non-Overlapping
## Text Labels with 'ggplot2'_. R package version 0.9.1, <URL:
## https://CRAN.R-project.org/package=ggrepel>.

base::print(citation("ggdendro"),style="text") # Dendogram

## de Vries A, Ripley BD (2020). _ggdendro: Create Dendrograms and Tree
## Diagrams Using 'ggplot2'_. R package version 0.1.22, <URL:
## https://CRAN.R-project.org/package=ggdendro>.

base::print(citation("d3heatmap"),style="text") # Heat map

## Cheng J, Galili T (2018). _d3heatmap: Interactive Heat Maps Using
## 'htmlwidgets' and 'D3.js'_. R package version 0.6.1.2, <URL:
## https://CRAN.R-project.org/package=d3heatmap>.

base::print(citation("igraph"),style="text") # Graph representation

## Csardi G, Nepusz T (2006). "The igraph software package for complex
## network research." _InterJournal_, *Complex Systems*, 1695. <URL:
## https://igraph.org>.

base::print(citation("ggsci"),style="text") # Simpson palette

## Xiao N (2018). _ggsci: Scientific Journal and Sci-Fi Themed Color
## Palettes for 'ggplot2'_. R package version 2.9, <URL:
## https://CRAN.R-project.org/package=ggsci>.

base::print(citation("plotly"),style="text") # Plotly - interactive plot

## Sievert C (2020). _Interactive Web-Based Data Visualization with R,
## plotly, and shiny_. Chapman and Hall/CRC. ISBN 9781138331457, <URL:
## https://plotly-r.com>.

Loading necessary data

The original U21 data is used.

load("U21_2014.RData")
rownames(U21_rank) <- rownames(U21_filtered)
orig_mtx <- as.matrix(U21_filtered)
#U21_filtered: 50 x 24 dataframe, amely az eredeti 2014-es U21-es adatokat tartalmazza
#U21_rank: 50 x 1 mátrix (vektor) U21 rangsor
#orig_mtx: 50 x 24 mátrix, eredeti U21 indikátorok az 50 országra

2. Two-step clustering

As a first step, we examine whether the data are suitable for clustering. Clustering should be checked for both variables and countries.

Cluster tendency of indicators

Calculating Hopkins indicator

The Hopkins index (H) helps to control clustering. If the Hopkins index (H) is greater than 0.5, it is not worth clustering.

When clustering the indicators, we choose Spearman correlation as a similarity indicator and 1-Spearman correlation as a distance indicator, because the rank correlation also shows nonlinear relationships.

vars <- U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
rownames(vars)<-rownames(U21_filtered)
countries<-t(vars)
colnames(countries) <- rownames(U21_filtered)

 dist.m<-1-cor(vars,method="spearman")
paste("H:= ",hopkins(dist.m, n=nrow(dist.m)-1, byrow = F, header = F))

## [1] "H:=  0.311994877382919"

Since the Hopkins index is less than 0.5, the indicators can be clustered.

The next step is to plot cluster tendency in order to determine the preliminary number of clusters.

# Determine cluster tendency
clustend <- get_clust_tendency(scale(dist.m), nrow(dist.m)-1)
# Plot cluster tendency
clustend$plot + 
  scale_fill_gradient(low = "yellow", high = "blue")

Figure 1. Cluster tendency of indicators

The Figure 1 shows, that 2-4 clusters (which are the groups of indicators here) can be specified, but this should be confirmed by further studies.

Hierarchical clustering for indicators

The next step is to use hierarchical clustering for the indicators. The result of the clustering is shown in Figure 2.

hc <- hclust(dist(dist.m), method = "complete")
hcdata <- dendro_data(hc, type="rectangle")
o.cols <- hcdata$labels$label
ggplot(hcdata$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = hcdata$labels, aes(x, y, label = label), 
            hjust = 1.3, vjust=0.5, size = 4) +
  labs(x="", y="", title="Hierarchical clustering for the indicators") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  coord_flip()

Figure 2. Hierarchical clustering for the indicators

Hierarchical clustering also makes 2-4 clusters (in this case a group of indicators) probable. Displaying the distance matrix provides additional help in determining the number of clusters. Figure 3 shows the distance matrix for the indicators, where the blue cells represent values close to -1 and the dark red cells represent values close to 1.

cormat <- rquery.cormat(dist.m, graphType="heatmap")

Figure 3 Distance matrix for the indicators

Van Mechelen et al. (1993) seggested the following criteria for clustering:

Within-cluster dissimilarities should be small.
Between-cluster dissimilarities should be large.
Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or, if appropriate, by linear, time series or spatial process models.
Members of a cluster should be well represented by its centroid.
The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
Clusters should be stable.
Clusters should correspond to connected areas in data space with high density.
The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
It should be possible to characterize the clusters using a small number of variables.
Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
Features should be approximately independent within clusters.
All clusters should have roughly the same size.
The number of clusters should be low.

Cluster statistics are calculated for 2-4 clusters.

cluster.stats(dist.m,cutree(hc, k=2))

## $n
## [1] 24
## 
## $cluster.number
## [1] 2
## 
## $cluster.size
## [1] 21  3
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.2498799 0.7726897
## 
## $average.distance
## [1] 0.5610211 0.6223547
## 
## $median.distance
## [1] 0.5672502 0.6203722
## 
## $separation
## [1] 0.68497 0.68497
## 
## $average.toother
## [1] 1.015888 1.015888
## 
## $separation.matrix
##         [,1]    [,2]
## [1,] 0.00000 0.68497
## [2,] 0.68497 0.00000
## 
## $ave.between.matrix
##          [,1]     [,2]
## [1,] 0.000000 1.015888
## [2,] 1.015888 0.000000
## 
## $average.between
## [1] 1.015888
## 
## $average.within
## [1] 0.5686878
## 
## $n.between
## [1] 63
## 
## $n.within
## [1] 213
## 
## $max.diameter
## [1] 1.24988
## 
## $min.separation
## [1] 0.68497
## 
## $within.cluster.ss
## [1] 4.111321
## 
## $clus.avg.silwidths
##         1         2 
## 0.4384823 0.3867175 
## 
## $avg.silwidth
## [1] 0.4320117
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.6564944
## 
## $dunn
## [1] 0.5480287
## 
## $dunn2
## [1] 1.63233
## 
## $entropy
## [1] 0.3767702
## 
## $wb.ratio
## [1] 0.5597936
## 
## $ch
## [1] 10.44002
## 
## $cwidegap
## [1] 0.5204666 0.6203722
## 
## $widestgap
## [1] 0.6203722
## 
## $sindex
## [1] 0.68497
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

cluster.stats(dist.m,cutree(hc, k=3))

## $n
## [1] 24
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 15  3  6
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.0665562 0.7726897 0.9575082
## 
## $average.distance
## [1] 0.4433945 0.6223547 0.7098425
## 
## $median.distance
## [1] 0.4366579 0.6203722 0.6949694
## 
## $separation
## [1] 0.2946191 0.6849700 0.2946191
## 
## $average.toother
## [1] 0.7970161 1.0158882 0.7187454
## 
## $separation.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 0.7845616 0.2946191
## [2,] 0.7845616 0.0000000 0.6849700
## [3,] 0.2946191 0.6849700 0.0000000
## 
## $ave.between.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 1.0441516 0.6734484
## [2,] 1.0441516 0.0000000 0.9452299
## [3,] 0.6734484 0.9452299 0.0000000
## 
## $average.between
## [1] 0.8144531
## 
## $average.within
## [1] 0.5323766
## 
## $n.between
## [1] 153
## 
## $n.within
## [1] 123
## 
## $max.diameter
## [1] 1.066556
## 
## $min.separation
## [1] 0.2946191
## 
## $within.cluster.ss
## [1] 3.425579
## 
## $clus.avg.silwidths
##           1           2           3 
##  0.34829660  0.33917439 -0.05553832 
## 
## $avg.silwidth
## [1] 0.2461976
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5722906
## 
## $dunn
## [1] 0.276234
## 
## $dunn2
## [1] 0.9487293
## 
## $entropy
## [1] 0.9002561
## 
## $wb.ratio
## [1] 0.6536614
## 
## $ch
## [1] 8.082114
## 
## $cwidegap
## [1] 0.4325824 0.6203722 0.5920308
## 
## $widestgap
## [1] 0.6203722
## 
## $sindex
## [1] 0.2946191
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

cluster.stats(dist.m,cutree(hc, k=4))

## $n
## [1] 24
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1]  5 10  3  6
## 
## $min.cluster.size
## [1] 3
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.0665562 0.4296074 0.7726897 0.9575082
## 
## $average.distance
## [1] 0.6794132 0.2371534 0.6223547 0.7098425
## 
## $median.distance
## [1] 0.7161004 0.2366260 0.6203722 0.6949694
## 
## $separation
## [1] 0.2746400 0.2746400 0.6849700 0.2946191
## 
## $average.toother
## [1] 0.7358357 0.6848112 1.0158882 0.7187454
## 
## $separation.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.2746400 0.7845616 0.5761566
## [2,] 0.2746400 0.0000000 0.8014765 0.2946191
## [3,] 0.7845616 0.8014765 0.0000000 0.6849700
## [4,] 0.5761566 0.2946191 0.6849700 0.0000000
## 
## $ave.between.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.5818078 1.0129007 0.8540163
## [2,] 0.5818078 0.0000000 1.0597770 0.5831645
## [3,] 1.0129007 1.0597770 0.0000000 0.9452299
## [4,] 0.8540163 0.5831645 0.9452299 0.0000000
## 
## $average.between
## [1] 0.7571513
## 
## $average.within
## [1] 0.4956133
## 
## $n.between
## [1] 203
## 
## $n.within
## [1] 73
## 
## $max.diameter
## [1] 1.066556
## 
## $min.separation
## [1] 0.27464
## 
## $within.cluster.ss
## [1] 3.01833
## 
## $clus.avg.silwidths
##          1          2          3          4 
## -0.1273645  0.5697175  0.3391744 -0.1851225 
## 
## $avg.silwidth
## [1] 0.2069642
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5264653
## 
## $dunn
## [1] 0.2575017
## 
## $dunn2
## [1] 0.8196294
## 
## $entropy
## [1] 1.298077
## 
## $wb.ratio
## [1] 0.6545763
## 
## $ch
## [1] 6.723373
## 
## $cwidegap
## [1] 0.6839779 0.2432582 0.6203722 0.5920308
## 
## $widestgap
## [1] 0.6839779
## 
## $sindex
## [1] 0.27464
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Unfortunately, based on the above results, it is not possible to decide how many (2, 3 or 4) clusters can be formed on the indicators. So the interpretability of the clusters will decide this problem.

Multidimensional scaling for indicators

To examine the interpretability of the clusters, the next step is to use multidimensional scaling (MDS) for the indicators. First, we form two clusters, the results of which are shown in Figure 4.

mds.isomds <- isoMDS(dist(dist.m), tol=1e-3, k=10, maxit = 200)

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged

plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=2)
ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text_repel(aes(colour=factor(cut)), size=4) + 
  scale_colour_discrete(name = "Clusters") +
  labs(x="", y="", title="Multidimensional scaling for indicators considering 2 clusters") + theme_bw()

Figure 4. Multidimensional scaling for indicators considering 2 clusters

As shown in Figure 4, the two clusters are difficult to interpret. The first cluster included 19 indicators, while the second included only 3 indicators.

Figure 5 shows the result of multidimensional scaling in the case of specifying 3 clusters.

mds.isomds <- isoMDS(dist(dist.m), tol=1e-3, k=10, maxit = 200)

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged

plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=3)
ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text_repel(aes(colour=factor(cut)), size=4) + 
  scale_colour_discrete(name = "Clusters") +
  labs(x="", y="", title="Multidimensional scaling for indicators considering 3 clusters") + theme_bw()

Figure 5. Multidimensional scaling for indicators considering 3 clusters

Based on Figure 5, the 3 clusters (i.e. 3 groups of indicators) are also difficult to interpret. There were 15 indicators in the first cluster, 3 in the second cluster and 6 in the third cluster. However, the indicators measuring the relationships (C1, C2, C3 and C4) have already been clustered.

In the next step, we will look at how to group the indicators if we create 4 clusters. The result is shown in Figure 6.

mds.isomds <- isoMDS(dist(dist.m), tol=1e-3, k=10, maxit = 200)

## initial  value 0.234620 
## iter   5 value 0.102313
## iter  10 value 0.079063
## iter  15 value 0.058780
## iter  20 value 0.035771
## iter  25 value 0.022966
## iter  30 value 0.013233
## final  value 0.009420 
## converged

plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=4)
ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text_repel(aes(colour=factor(cut)), size=4) + 
  scale_colour_discrete(name = "Clusters") +
  labs(x="", y="", title="Multidimensional scaling for indicators considering 4 clusters") + theme_bw()

Figure 6. Multidimensional scaling for indicators considering 3 clusters

If four clusters are specified, 5 indicators will be added to the first cluster, 10 to the second, 3 to the third, and 6 to the fourth. Purely only the variable group of the relationships containing the variables C1-C4 appears in the fourth cluster, in the case of the other clusters the composition of the indicators is very mixed. Four clusters (groups of variables) are also difficult to interpret, so it is necessary to reduce the number of variables, which can be done, for example, by principal component analysis.

Cluster tendency of countries

Calculating Hopkins indicator

The Hopkins index (H) helps to control clustering. If the Hopkins index (H) is greater than 0.5, it is not worth clustering. In the case of clustering rows (countries), distance is the Euclidean distance.

vars <- U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
rownames(vars)<-rownames(U21_filtered)
countries<-t(vars)
colnames(countries) <- rownames(U21_filtered)
dist.c<-as.matrix(dist(vars))
paste("H:=",hopkins(dist.c, n=nrow(dist.c)-1, byrow = F, header = F))

## [1] "H:= 0.220381143328058"

Since the Hopkins index is less than 0.5, countries (i.e., rows) can be clustered.

The next step is to plot the clustering in order to determine the preliminary number of clusters.

# Cluster tendency
clustend <- get_clust_tendency(scale(dist.c), nrow(dist.c)-1)
# Plot the cluster tendency
clustend$plot + 
  scale_fill_gradient(low = "yellow", high = "blue")

Figure 7. cluster tendency of countries

Based on Figure 7, 3-4 clusters (ie a group of countries) are likely, but this needs to be confirmed by further studies.

Hierarchical clustering of countries

The next step is to apply hierarchical clustering for countries. The result of the clustering is shown in Figure 8.

hc <- hclust(dist(dist.c), method = "complete")
hcdata <- dendro_data(hc, type="rectangle")
o.rows <- hcdata$labels$label

ggplot(hcdata$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = hcdata$labels, aes(x, y, label = label), 
            hjust = 1, vjust=0.7, size = 3) +
  labs(x="", y="", title="Hierarchical clustering for countries") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  coord_flip()

8. ábra A hierarchikus klaszterezés eredménye az országok esetében

Hierarchical clustering makes the existence of 3-4 clusters (here a group of countries) probable.

Displaying the distance matrix provides additional help in determining the number of clusters. Figure 9 shows the distance matrix for countries, with blue cells representing values close to -1 and dark red cells representing values close to 1.

cormat <- rquery.cormat(dist.c, graphType="heatmap")

9. ábra Távolságmátrix az országok esetében

Based on the distance matrix shown in Figure 9, 3-4 clusters are probable. In the next step, we compare the cases where we create 3 or 4 clusters for the countries.

cluster.stats(dist.c,cutree(hc, k=3))

## $n
## [1] 50
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 21 15 14
## 
## $min.cluster.size
## [1] 14
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 172.7304 187.5483 131.6234
## 
## $average.distance
## [1]  96.74970 114.40191  94.56826
## 
## $median.distance
## [1]  92.82482 111.94521  98.24285
## 
## $separation
## [1] 57.08949 64.51798 57.08949
## 
## $average.toother
## [1] 165.3137 174.3910 131.3474
## 
## $separation.matrix
##           [,1]      [,2]     [,3]
## [1,]   0.00000 136.35107 57.08949
## [2,] 136.35107   0.00000 64.51798
## [3,]  57.08949  64.51798  0.00000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]
## [1,]   0.0000 200.0512 128.0951
## [2,] 200.0512   0.0000 135.9007
## [3,] 128.0951 135.9007   0.0000
## 
## $average.between
## [1] 157.7719
## 
## $average.within
## [1] 101.4346
## 
## $n.between
## [1] 819
## 
## $n.within
## [1] 406
## 
## $max.diameter
## [1] 187.5483
## 
## $min.separation
## [1] 57.08949
## 
## $within.cluster.ss
## [1] 258714.8
## 
## $clus.avg.silwidths
##         1         2         3 
## 0.2440727 0.1473629 0.1832766 
## 
## $avg.silwidth
## [1] 0.1980369
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.566333
## 
## $dunn
## [1] 0.3043989
## 
## $dunn2
## [1] 1.119693
## 
## $entropy
## [1] 1.081972
## 
## $wb.ratio
## [1] 0.6429189
## 
## $ch
## [1] 24.42036
## 
## $cwidegap
## [1]  98.27018 125.24363 102.04286
## 
## $widestgap
## [1] 125.2436
## 
## $sindex
## [1] 61.91874
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

cluster.stats(dist.c,cutree(hc, k=4))

## $n
## [1] 50
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 21 11 14  4
## 
## $min.cluster.size
## [1] 4
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 172.7304 148.9883 131.6234 185.2957
## 
## $average.distance
## [1]  96.74970 106.13002  94.56826 128.13876
## 
## $median.distance
## [1]  92.82482 108.07670  98.24285 135.48427
## 
## $separation
## [1] 57.08949 60.81710 57.08949 60.81710
## 
## $average.toother
## [1] 165.3137 158.4569 131.3474 186.9005
## 
## $separation.matrix
##           [,1]      [,2]      [,3]     [,4]
## [1,]   0.00000 136.35107  57.08949 201.0139
## [2,] 136.35107   0.00000  64.51798  60.8171
## [3,]  57.08949  64.51798   0.00000 102.1068
## [4,] 201.01385  60.81710 102.10676   0.0000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,]   0.0000 187.6245 128.0951 234.2247
## [2,] 187.6245   0.0000 124.8736 122.8686
## [3,] 128.0951 124.8736   0.0000 166.2251
## [4,] 234.2247 122.8686 166.2251   0.0000
## 
## $average.between
## [1] 155.9924
## 
## $average.within
## [1] 100.7137
## 
## $n.between
## [1] 863
## 
## $n.within
## [1] 362
## 
## $max.diameter
## [1] 185.2957
## 
## $min.separation
## [1] 57.08949
## 
## $within.cluster.ss
## [1] 247876.6
## 
## $clus.avg.silwidths
##           1           2           3           4 
##  0.24407268  0.07194809  0.15158446 -0.03399656 
## 
## $avg.silwidth
## [1] 0.158063
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5576146
## 
## $dunn
## [1] 0.3080993
## 
## $dunn2
## [1] 0.9588711
## 
## $entropy
## [1] 1.255947
## 
## $wb.ratio
## [1] 0.6456321
## 
## $ch
## [1] 17.30098
## 
## $cwidegap
## [1]  98.27018 106.99219 102.04286 174.91549
## 
## $widestgap
## [1] 174.9155
## 
## $sindex
## [1] 60.06623
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Nevertheless, based on the results, it is not possible to decide how many (3 or 4) clusters can be specified for the countries. Then the number of clusters should be decided by their interpretability.

Multidimensional scalig for countries

To examine the interpretability of clusters, the next step is to use multidimensional scaling for countries. First, we specify three clusters (i.e. three groups of countries), the results of which are shown in Figure 10.

mds.isomds <- isoMDS(dist(dist.c), tol=1e-3, k=10, maxit = 200)

## initial  value 0.728168 
## final  value 0.727938 
## converged

plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=3)

ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text_repel(aes(colour=factor(cut)), size=4) + 
  scale_colour_discrete(name = "Clusters") +
  labs(x="", y="", title="Multidimensional scaling for countries considering 3 clusters") + theme_bw()

Figure 10. Multidimensional scaling for countries considering 3 clusters

Figure 10 shows that the first cluster included 21 countries, the second 15 countries and 14 countries.

In the next step, we will look at how to specify 4 clusters. The result is shown in Figure 11.

mds.isomds <- isoMDS(dist(dist.c), tol=1e-3, k=10, maxit = 200)

## initial  value 0.728168 
## final  value 0.727938 
## converged

plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=4)

ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text_repel(aes(colour=factor(cut)), size=4) + 
  scale_colour_discrete(name = "Clusters") +
  labs(x="", y="", title="Multidimensional scaling for countries considering 4 clusters") + theme_bw()

Figure 11. Multidimensional scaling for countries considering 4 clusters

In the case of specifying 4 clusters, 21 countries were included in the first cluster, 11 in the second, 14 in the third, and 4 in the fourth. Compared to the previous case (when we specified three clusters), the countries in the first and in the third clusters did not change. The fourth cluster included the countries (Denmark, Sweden, Switzerland, United States) that belonged to the second cluster in the previous case. The principles formulated by Van Mechelen (1993) are mostly fulfilled by 4 clusters.

The results of cluster tendency: It is clear that clustering can be made by both variables and countries (see Hopkins index). However, the resulting clusters are quite difficult to interpret. The groups of variables contain mixed indicators, so model reduction is needed before grouping countries.

#3. Model reduction

Principal Component Analysis

The first step is to determine the number of principal components / factors. The Figure 12 shows the scree plot, which can be used to specify the number of latent variables (i.e. principal components).

scree(U21_filtered)

Figure 12 scree plot Könyökdiagram

Based on Figure 12, maximum 5 principal component (PC), while maximum 3 factor should be kept.

U21 contains 4 main groups of indicators: resources (R1 - R5), environment (E1 - E4), connectivity (C1 - C6) and outputs (O1 - O9). In the case of principal component analysis, it is worth starting from these predefined groups of indicators.

R (resources) PC

First, we define the principal component R, which contains the indicators R1-R5, that describe the resources of each country.

Rs<-U21_filtered[,c("R1","R2","R3","R4","R5")]
KMO(cor(Rs))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Rs))
## Overall MSA =  0.65
## MSA for each item = 
##   R1   R2   R3   R4   R5 
## 0.75 0.54 0.66 0.68 0.60

fitR <- principal(Rs,nfactors=1)
summary(fitR) # print variance accounted for

## 
## Factor analysis with Call: principal(r = Rs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  1.28 
## The number of observations was  50  with Chi Square =  58.63  with prob <  2.3e-11 
## 
## The root mean square of the residuals (RMSA) is  0.15

fitR$communality

##        R1        R2        R3        R4        R5 
## 0.4804492 0.3481106 0.7782275 0.8074245 0.8522280

pcaR <- prcomp(Rs, scale = TRUE)
eigs <- pcaR$sdev^2
varR <- eigs[1] / sum(eigs)
paste("Variance rate=",varR)

## [1] "Variance rate= 0.653287948596533"

Since the KMO value is greater than 0.5, principal component / factor analysis can be performed. All communality is greater than 0.25, which means that the original variables fit the principal component correctly. Furthermore, the variance ratio is greater than 0.5, which means that at least 50% of the information remains even after model reduction. Based on these, the R principal component created above is appropriate and can be used in further analyzes.

E (environment) PC

Next, we specify the principal component E, which contains the indicators E1-E4, that describe the environmental indicators of each country.

Es<-U21_filtered[,c("E1","E2","E3","E4")]
KMO(cor(Es))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA =  0.55
## MSA for each item = 
##   E1   E2   E3   E4 
## 0.61 0.56 0.53 0.54

fitE <- principal(Es,nfactors=1)
summary(fitE) # print variance accounted for

## 
## Factor analysis with Call: principal(r = Es, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.28 
## The number of observations was  50  with Chi Square =  13.06  with prob <  0.0015 
## 
## The root mean square of the residuals (RMSA) is  0.21

fitE$communality

##        E1        E2        E3        E4 
## 0.3235923 0.1889422 0.5982086 0.6393162

pcaE <- prcomp(Es, scale = TRUE)
eigs <- pcaE$sdev^2
varE <- eigs[1] / sum(eigs)
paste("Variance rate=",varE)

## [1] "Variance rate= 0.43751482380739"

ased on the above results, it can be seen that the communality of E2 is less than 0.25 (0.189), so E2 does not fit the PC properly. However, if we exclude E2 in the next step, the communality value of E1 will be decreased less to 0.25. Most indicators can be saved, if E3 will be excluded, instead of E2. Thus, in the next step, we omit E3.

Es<-U21_filtered[,c("E1","E2","E4")]
KMO(cor(Es))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA =  0.55
## MSA for each item = 
##   E1   E2   E4 
## 0.53 0.54 0.61

fitE <- principal(Es,nfactors=1)
summary(fitE)

## 
## Factor analysis with Call: principal(r = Es, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 0  and the objective function was  0.22 
## The number of observations was  50  with Chi Square =  10.42  with prob <  NA 
## 
## The root mean square of the residuals (RMSA) is  0.25

fitE$communality

##        E1        E2        E4 
## 0.6167214 0.5175482 0.2970137

pcaE <- prcomp(Es, scale = TRUE)
eigs <- pcaE$sdev^2
varE <- eigs[1] / sum(eigs)
paste("Variance rate=",varE)

## [1] "Variance rate= 0.477094428557052"

After excluding E3, the communality value is greater than 0.25 for all variables. The KMO value is greater than 0.5. However, the value of the variance ratio is less than 0.5 (0.477), which means that at least 50% of the information is lost after the model reduction.

C (connenctions) PC

Next, we specify the principal component C, which contains the indicators C1-C6, that describe the connectivity and communications indicators of each country.

Cs<-U21_filtered[,c("C1","C2","C3","C4","C5","C6")]
KMO(cor(Cs))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA =  0.67
## MSA for each item = 
##   C1   C2   C3   C4   C5   C6 
## 0.57 0.66 0.37 0.70 0.82 0.78

fitC <- principal(Cs,nfactors=1)
summary(fitC)

## 
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 9  and the objective function was  0.53 
## The number of observations was  50  with Chi Square =  24.07  with prob <  0.0042 
## 
## The root mean square of the residuals (RMSA) is  0.14

fitC$communality

##        C1        C2        C3        C4        C5        C6 
## 0.4456108 0.5144444 0.1290487 0.6247154 0.6170022 0.5594279

pcaC <- prcomp(Cs, scale = TRUE)
eigs <- pcaC$sdev^2
varC <- eigs[1] / sum(eigs)
paste("Variance rate=",varC)

## [1] "Variance rate= 0.481708220917479"

The communality value of C3 is lower than 0.25, therefore, this variable should be excluded in the next step.

Cs<-U21_filtered[,c("C1","C2","C4","C5","C6")]
KMO(cor(Cs))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA =  0.79
## MSA for each item = 
##   C1   C2   C4   C5   C6 
## 0.78 0.79 0.80 0.79 0.80

fitC <- principal(Cs,nfactors=1)
summary(fitC)

## 
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  0.23 
## The number of observations was  50  with Chi Square =  10.55  with prob <  0.061 
## 
## The root mean square of the residuals (RMSA) is  0.13

fitC$communality

##        C1        C2        C4        C5        C6 
## 0.5120676 0.5011992 0.5971696 0.6207606 0.5728886

pcaC <- prcomp(Cs, scale = TRUE)
eigs <- pcaC$sdev^2
varC <- eigs[1] / sum(eigs)
paste("Variance rate=",varC)

## [1] "Variance rate= 0.560817123312952"

Both the KMO and the variance rate is greater than 0.5.

O (outputs) PC

At the last step, we specify the principal component O, which contains the indicators O1-O9, that describe the research output indicators of each country.

Os<-U21_filtered[,c("O1","O2","O3","O4","O5","O6","O7","O8","O9")]
KMO(cor(Os))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.73
## MSA for each item = 
##   O1   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64

fitO <- principal(Os,nfactors=1)
summary(fitO)

## 
## Factor analysis with Call: principal(r = Os, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 27  and the objective function was  3.28 
## The number of observations was  50  with Chi Square =  145.84  with prob <  2.9e-18 
## 
## The root mean square of the residuals (RMSA) is  0.14

fitO$communality

##           O1           O2           O3           O4           O5           O6 
## 0.1066000549 0.8603383885 0.7882600079 0.7698125877 0.6266131079 0.3970218624 
##           O7           O8           O9 
## 0.5495702936 0.7077575306 0.0005444639

pcaO <- prcomp(Os, scale = TRUE)
eigs <- pcaC$sdev^2
varO <- eigs[1] / sum(eigs)
paste("Variance rate=",varO)

## [1] "Variance rate= 0.560817123312952"

Based on the above results, for both O1 and O9, the communality is lower than 0.25, however, the variance ratio is adequate (0.56). There are two options: use one principal component or two factors.

OPT1 - Find one principal component

Os<-U21_filtered[,c("O2","O3","O4","O5","O6","O7","O8")]
KMO(cor(Os))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.79
## MSA for each item = 
##   O2   O3   O4   O5   O6   O7   O8 
## 0.74 0.79 0.81 0.75 0.82 0.78 0.85

fitO <- principal(Os,nfactors=1)
summary(fitO)

## 
## Factor analysis with Call: principal(r = Os, nfactors = 1)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 14  and the objective function was  1.62 
## The number of observations was  50  with Chi Square =  73.02  with prob <  5.4e-10 
## 
## The root mean square of the residuals (RMSA) is  0.1

fitO$communality

##        O2        O3        O4        O5        O6        O7        O8 
## 0.8860009 0.7906704 0.7907994 0.5663786 0.4074405 0.5578915 0.7229660

pcaO <- prcomp(Os, scale = TRUE)
eigs <- pcaC$sdev^2
varO <- eigs[1] / sum(eigs)
paste("Variance rate=",varO)

## [1] "Variance rate= 0.560817123312952"

The value of KMO is greater than 0.5, for all indicators the communality is greater than 0.25, the variance rate is 0.56, i.e. at least 50% of the original information remains even after model reduction. However, two important indicators (O1 and O9) were dropped from the model. The O1 indicator is the number of articles published by higher education institutions, while the O9 is the proportion of unemployed graduates. These indicators are also worthwhile and would need to be preserved after model reduction.

OPT2 - Find two (or more) factors

A második lehetőség, hogy több mutatót őrizzünk meg, 2 faktor használata. 2 faktorra vonatkozó eredmények lentebb olvashatók.

Os<-U21_filtered[,c("O1","O2","O3","O4","O5","O6","O7","O8","O9")]
KMO(cor(Os))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.73
## MSA for each item = 
##   O1   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64

fitO2 <- principal(Os,nfactors=2)
summary(fitO2)

## 
## Factor analysis with Call: principal(r = Os, nfactors = 2)
## 
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 19  and the objective function was  1.87 
## The number of observations was  50  with Chi Square =  81.98  with prob <  8.4e-10 
## 
## The root mean square of the residuals (RMSA) is  0.11

fitO2$communality

##        O1        O2        O3        O4        O5        O6        O7        O8 
## 0.8438830 0.9223750 0.7955037 0.8210897 0.8665949 0.4008533 0.5506741 0.7255365 
##        O9 
## 0.3083455

fitO2$loadings

## 
## Loadings:
##    RC1    RC2   
## O1  0.138  0.908
## O2  0.959       
## O3  0.886  0.104
## O4  0.905       
## O5  0.670  0.646
## O6  0.629       
## O7  0.732  0.124
## O8  0.850       
## O9         0.547
## 
##                  RC1   RC2
## SS loadings    4.656 1.579
## Proportion Var 0.517 0.175
## Cumulative Var 0.517 0.693

In the case of two (principal) factors, akk communality values are greater than 0.25; however, the correlation between factor 1 and O5, and the factor 2 and O5 is very similar (|cor(RC1,O5)|<2|cor(RC2,O5)| or |cor(RC1,O5)|<|cor(RC2,O5)|+0.25). Therefore O5 cannot be matched to any factor. In this case we say that O5 is a common indicator, which has to be excluded from the model.

Os<-U21_filtered[,c("O1","O2","O3","O4","O6","O7","O8","O9")]
KMO(cor(Os))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA =  0.78
## MSA for each item = 
##   O1   O2   O3   O4   O6   O7   O8   O9 
## 0.39 0.75 0.83 0.78 0.79 0.80 0.84 0.55

fitO2 <- principal(Os,nfactors=2)
summary(fitO2)

## 
## Factor analysis with Call: principal(r = Os, nfactors = 2)
## 
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 13  and the objective function was  1.21 
## The number of observations was  50  with Chi Square =  53.62  with prob <  7.1e-07 
## 
## The root mean square of the residuals (RMSA) is  0.1

fitO2$communality

##        O1        O2        O3        O4        O6        O7        O8        O9 
## 0.3429981 0.9340280 0.7946173 0.8450090 0.5310990 0.5954249 0.7515102 0.7198578

fitO2$loadings

## 
## Loadings:
##    RC1    RC2   
## O1  0.131  0.571
## O2  0.966       
## O3  0.891       
## O4  0.911 -0.123
## O6  0.615  0.390
## O7  0.716  0.288
## O8  0.848  0.180
## O9 -0.111  0.841
## 
##                  RC1   RC2
## SS loadings    4.195 1.319
## Proportion Var 0.524 0.165
## Cumulative Var 0.524 0.689

Both the KMO (0.78) and the variance rate (0.689) is greater than 0.5. All communality values are greater than 0.25. And there is no common indicator.

The first factor contains the folowing indicators:

\(O_2\): Total number of articles that are produced by higher education institutions per capita
\(O_3\): Average impact of articles, as measured by citations in 2014 of articles that were published in previous years using the Karolinska Institute normalized impact factor
\(O_4\): Depth of world-class universities in a country. This is calculated as an average of the institutions’ score of a country that is listed in the top 500 of the Shanghai ranking, divided by the country’s population
\(O_6\): Enrollment in tertiary education as a % of the eligible population, which is defined as the 5-year age group after secondary education
\(O_7\): % of the population aged 25-64 with a tertiary qualification
\(O_8\): Number of researchers (full-time equivalent) in the nation per population

The second factor contains the following indicators:

\(O_1\): Total number of journal articles that are produced by higher education institutions
\(O_9\): Unemployment rates among tertiary-educated aged 25-64 years compared with unemployment rates for those with only upper-secondary or post-secondary non-tertiary education

Thus, compared to the specification of the principal component \(O\), both the \(O_1\) and \(O_9\) indicators in the second factor were saved.

Since we specified principal components all for \(R\), \(C\), \(E\), \(O\) indicators, the components can even be correlated. Therefore, in the next step, the correlation between them have to be checked.

Correlation analysis for PCs

Figure 13 shows the results of the correlation analysis. The Figure 13 shows that there are strong correlations among principal components \(R\), \(O\), and \(C\).

factors<-cbind(fitR$scores,fitE$scores,fitC$scores,fitO$scores)
colnames(factors) <- c("R","E","C","O")
cor(factors)

##           R         E         C         O
## R 1.0000000 0.1659944 0.7354010 0.8207528
## E 0.1659944 1.0000000 0.2930795 0.2203930
## C 0.7354010 0.2930795 1.0000000 0.8482095
## O 0.8207528 0.2203930 0.8482095 1.0000000

corrgram(factors,order = TRUE,main="Correlation between principal components",lower.panel=panel.conf)

Figure 13. Correlation between principal components

Correlation analysis is also performed among the PCs \(R\), \(E\), and \(C\) and two specified factors, such as \(O_{fact_{1}}\), and \(O_{fact_{2}}\). The result of the correlation analysis is shown in Figure 14.

factors<-cbind(fitR$scores,fitE$scores,fitC$scores,fitO2$scores)
colnames(factors) <- c("R","E","C","O_fact1","O_fact2")
cor(factors)

##                  R         E          C      O_fact1       O_fact2
## R        1.0000000 0.1659944  0.7354010 8.350790e-01 -5.855570e-02
## E        0.1659944 1.0000000  0.2930795 2.054947e-01  1.568645e-01
## C        0.7354010 0.2930795  1.0000000 8.631607e-01 -1.107934e-01
## O_fact1  0.8350790 0.2054947  0.8631607 1.000000e+00  1.713747e-16
## O_fact2 -0.0585557 0.1568645 -0.1107934 1.713747e-16  1.000000e+00

corrgram(factors,order = TRUE,main="Correlations between PCs and factors",lower.panel=panel.conf)

Figure 14. Correlations between PCs and factors

Based on Figure 14 latent variables R, O_fact1, and C are strongly correlated each other. Correlation analysis specifies three latent variable groups: (1) \(R\), \(C\), \(O_{fact_{1}}\), (2) \(O_{fact_{2}}\), (3) \(E\).

Factor analysis

After principal component analysis, factor analysis (FA) is performed on the indicators. As a first step, we examine the applicability of factor analysis, which is aided by scree-plots and calculating the KMO value.

scree(U21_filtered)

Figure 15. scree plot

KMO(cor(U21_filtered))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21_filtered))
## Overall MSA =  0.73
## MSA for each item = 
##   R1   R2   R3   R4   R5   E1   E2   E3   E4   C1   C2   C3   C4   C5   C6   O1 
## 0.70 0.51 0.82 0.75 0.79 0.56 0.35 0.69 0.82 0.76 0.63 0.37 0.83 0.83 0.76 0.36 
##   O2   O3   O4   O5   O6   O7   O8   O9 
## 0.80 0.80 0.79 0.75 0.73 0.66 0.74 0.30

Figure 15 suggests us to specify 3 independent groups of variables (see the same results in Figure 12.)

U21s<-U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2", "E3", "E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
fit <- principal(U21s,nfactors=3,rotate = "varimax")
summary(fit) # print variance accounted for

## 
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 207  and the objective function was  11.54 
## The number of observations was  50  with Chi Square =  440.6  with prob <  5.8e-19 
## 
## The root mean square of the residuals (RMSA) is  0.09

fit$communality

##        R1        R2        R3        R4        R5        E1        E2        E3 
## 0.5654944 0.5743389 0.8261170 0.7944043 0.8819861 0.3833381 0.1206436 0.5076963 
##        E4        C1        C2        C3        C4        C5        C6        O1 
## 0.5389281 0.3994398 0.7234222 0.2496209 0.5761145 0.6911836 0.6506483 0.5851133 
##        O2        O3        O4        O5        O6        O7        O8        O9 
## 0.9083897 0.8677446 0.8564676 0.7466957 0.3476795 0.5736386 0.7121123 0.3973431

fit$loadings

## 
## Loadings:
##    RC1    RC2    RC3   
## R1  0.397 -0.639       
## R2  0.439 -0.239 -0.570
## R3  0.906              
## R4  0.849 -0.265       
## R5  0.914 -0.211       
## E1  0.158  0.191  0.567
## E2         0.147  0.302
## E3  0.337  0.593  0.204
## E4  0.516  0.512  0.106
## C1  0.576         0.255
## C2  0.541 -0.159  0.637
## C3  0.154 -0.111  0.462
## C4  0.669  0.311  0.178
## C5  0.815 -0.153       
## C6  0.787  0.163       
## O1  0.227  0.495 -0.537
## O2  0.942         0.138
## O3  0.920         0.140
## O4  0.908         0.175
## O5  0.731  0.305 -0.345
## O6  0.526  0.208 -0.165
## O7  0.644  0.101 -0.386
## O8  0.836              
## O9         0.626       
## 
##                   RC1   RC2   RC3
## SS loadings    10.059 2.231 2.188
## Proportion Var  0.419 0.093 0.091
## Cumulative Var  0.419 0.512 0.603

biplot(fit)

Figure 16. Biplot of 3 factors

Figure 16 shows the properties of the three factors. As can be seen from the above results, there are also low communality (e.g. \(C_3\)) and common indicators (e.g. \(R_2\)). In order to get a comprehensive picture of the relative position of the factors and the indicators belonging to each factor, it is expedient to plot the initial state of the factors (\(RC_1\), \(RC_2\), \(RC_3\)) in 3D. This 3D figure is shown in Figure 17. By rotating the figure, you can see how each indicator is grouped, to which factors it belongs. Consistent with the above results, it can be seen here that the first factor includes, for example, \(R_3\), \(R_4\) and \(R_5\).

save <- options(rgl.useNULL = TRUE)
open3d()

## null 
##    1

options(save)
plot3d(fit$loadings)
text3d(fit$loadings,texts = as.vector(rownames(fit$loadings)))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
  coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
rglwidget()

Figure 17. Factors and indicators in 3D

Excluding common and low communality value indicators the result is shown in the Figure 18..

columns<-c("R3","R4","R5","E1","E2","C1","C5","C6","O1","O4","O8","O9")
U21s<-U21_filtered[,columns]
KMO(cor(U21s))

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21s))
## Overall MSA =  0.81
## MSA for each item = 
##   R3   R4   R5   E1   E2   C1   C5   C6   O1   O4   O8   O9 
## 0.79 0.83 0.80 0.40 0.47 0.93 0.85 0.83 0.41 0.85 0.87 0.49

fit <- principal(U21s,nfactors=3,rotate = "varimax")
summary(fit)

## 
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 33  and the objective function was  2.21 
## The number of observations was  50  with Chi Square =  93.17  with prob <  1.2e-07 
## 
## The root mean square of the residuals (RMSA) is  0.08

fit$communality

##        R3        R4        R5        E1        E2        C1        C5        C6 
## 0.8262154 0.8290226 0.9132030 0.6894734 0.5995707 0.4921809 0.6855794 0.6979789 
##        O1        O4        O8        O9 
## 0.6054390 0.8427117 0.7250600 0.6423689

fit$loadings

## 
## Loadings:
##    RC1    RC2    RC3   
## R3  0.888         0.190
## R4  0.905              
## R5  0.954              
## E1  0.134  0.815       
## E2 -0.126  0.764       
## C1  0.616  0.202 -0.269
## C5  0.824              
## C6  0.796 -0.161  0.197
## O1  0.148 -0.114  0.755
## O4  0.906  0.136       
## O8  0.829         0.169
## O9 -0.108  0.296  0.737
## 
##                  RC1   RC2   RC3
## SS loadings    5.787 1.450 1.312
## Proportion Var 0.482 0.121 0.109
## Cumulative Var 0.482 0.603 0.712

biplot(fit)

Figure 18. shows the final factor structure

Figure 19 shows the final factor structure in 3D.

plot3d(fit$loadings)
text3d(fit$loadings,texts = as.vector(rownames(fit$loadings)))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
  coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
rglwidget()

Figure 19 final factor structure in 3D

The 3D figure above illustrates the three distinct groups of indicators.

Figure 20 shows the structure equation model, which graphically shows which indicators belong to which factor. In the figure, the thickness and strength of the arrows illustrate the extent to which each indicator contributes to a given factor.

The list of indicators, which belong to the factor 1:

\(R_3\): Annual exp. per student (full-time equivalent) by tertiary education institutions in USD, PPP
\(R_4\): Exp. in tertiary education institutions for R&D as a % of GDP
\(R_5\): Exp. in tertiary education institutions for R&D per head of population at USD, PPP
\(C_1\): % of international students in tertiary education
\(C_5\): Responses to question “Knowledge transfer is highly developed between companies and universities”, which was asked of business executives in the annual survey by IMD World Development Centre, Switzerland
\(C_6\): % of university research publications that are co-authored with industry researchers
\(O_4\): Depth of world-class universities in a country. This is calculated as an average of the institutions’ score of a country that is listed in the top 500 of the Shanghai ranking, divided by the country’s population
\(O_8\): Number of researchers (full-time equivalent) in the nation per population

The list of indicators, which belong to the factor 2:

\(E_1\): % of female students in tertiary education
\(E_2\): % of female academic staff in tertiary institutions

The list of indicators, which belong to the factor 3:

\(O_1\): Total number of journal articles that are produced by higher education institutions
\(O_9\): Unemployment rates among tertiary-educated aged 25-64 years compared with unemployment rates for those with only upper-secondary or post-secondary non-tertiary education

The first factor (\(RC_1\)) contains both expenditure-type, relationship-type and output-type indicators. The second factor (\(RC_2\)) shows the role of women in higher education, and the third factor (\(RC_3\)) contains output indicators.

model='RC1=~R3+R4+R5+C1+C5+C6+O4+O8
       RC2=~E1+E2
       RC3=~O1+O9
       RC1~~0*RC2 + 0*RC3
       RC2~~0*RC3'
fitsem <- cfa(model,U21_filtered,std.lv = TRUE)
semPaths(fitsem,residuals=FALSE,"std",edge.color = "black",color="yellow")

Figure 20. The final structure ecuation model

The summary of the evaluatiion of dimension reduciton: If the goal is to specify independent groups of variables, then 3 groups of variables can be defined with both principal component and factor analysis. Of these, two “pure” groups of variables contain only 2-2 indicators, while the third mixed group forms a group of R, C, O indicators. A serious problem is that only 12 of the 24 indicators are preserved.

4. Data reduction

Specifiying 3 clusters on 3 factors

Based on the preliminary clustering study, 3-4 clusters were probable for countries (rows). Thus, as a first step, we specify three clusters.

set.seed(2)
km3 <- kmeans(fit$scores,3)$cluster
cluster.stats(dist(fit$scores),km3)

## $n
## [1] 50
## 
## $cluster.number
## [1] 3
## 
## $cluster.size
## [1] 18 20 12
## 
## $min.cluster.size
## [1] 12
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 2.180009 5.704993 4.494681
## 
## $average.distance
## [1] 1.146773 1.610924 1.901035
## 
## $median.distance
## [1] 1.108597 1.356117 1.479310
## 
## $separation
## [1] 0.7804456 0.7390499 0.7390499
## 
## $average.toother
## [1] 2.479894 2.470991 2.636661
## 
## $separation.matrix
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 0.7804456 0.8288624
## [2,] 0.7804456 0.0000000 0.7390499
## [3,] 0.8288624 0.7390499 0.0000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]
## [1,] 0.000000 2.373189 2.657735
## [2,] 2.373189 0.000000 2.617694
## [3,] 2.657735 2.617694 0.000000
## 
## $average.between
## [1] 2.520423
## 
## $average.within
## [1] 1.513456
## 
## $n.between
## [1] 816
## 
## $n.within
## [1] 409
## 
## $max.diameter
## [1] 5.704993
## 
## $min.separation
## [1] 0.7390499
## 
## $within.cluster.ss
## [1] 77.04899
## 
## $clus.avg.silwidths
##         1         2         3 
## 0.4922942 0.3117005 0.2117167 
## 
## $avg.silwidth
## [1] 0.3527181
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.4333483
## 
## $dunn
## [1] 0.1295444
## 
## $dunn2
## [1] 1.248367
## 
## $entropy
## [1] 1.076819
## 
## $wb.ratio
## [1] 0.6004771
## 
## $ch
## [1] 21.33511
## 
## $cwidegap
## [1] 0.9296514 2.7278667 2.2474481
## 
## $widestgap
## [1] 2.727867
## 
## $sindex
## [1] 0.764972
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

**Figure 21*) shows the results of clustering based on 3 factors

18 countries belong to the first,20 countries belong to the second, and 12 countries belong to the third cluster.

  plot3d(fit$loadings)
text3d(fit$loadings,texts = rownames(fit$loadings))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
  coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)

text3d(fit$scores,texts = rownames(U21_filtered),col=km3)
rglwidget()

Figure 21. the results of three clusters

Figure 22. shows which countries belong to which clusters. These countries are separated align with the three factors.

data(wrld_simpl)
countries_km1 = wrld_simpl@data$NAME %in% c("Australia","Austria","Belgium","Canada","Denmark","Finland","France","Hon Kong","Ireland","Israel","Netherlands","New Zealand","Norway","Portugal","Singapore","Sweden","Switzerland","United Kingdom")
countries_km2 = wrld_simpl@data$NAME %in% c("Argentina","Brazil","Bulgaria","China", "Croatia", "Czech Republic", "Germany", "Hungary","Malaysia","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","South Africa","Spain","Thailand","Ukraine","United States")
countries_km3 = wrld_simpl@data$NAME %in% c("Chile", "Greece", "India", "Indonesia", "Iran", "Italy", "Japan", "Mexico", "Saudi Arabia", "Taiwan", "Turkey", "Korea, Republic of")
countries_km2 = countries_km2 *2
countries_km3 = countries_km3 *3
countries_map = countries_km1+countries_km2+countries_km3+1
plot(wrld_simpl, col = c(gray(.80), "red", "blue", "green")[countries_map])

Figure 22. The specified three clusters on the map

Specifiying 4 clusters on 3 factors

Next step is to specify four clusters on 3 factors.

set.seed(2)
km4 <- kmeans(fit$scores,4)$cluster
cluster.stats(dist(fit$scores),km4)

## $n
## [1] 50
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 22  4 17  7
## 
## $min.cluster.size
## [1] 4
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 2.581561 3.503032 2.180009 3.714822
## 
## $average.distance
## [1] 1.228931 2.257800 1.134808 1.939406
## 
## $median.distance
## [1] 1.237384 2.276079 1.096520 1.818473
## 
## $separation
## [1] 0.3953279 0.6144058 0.3953279 0.7669507
## 
## $average.toother
## [1] 2.389488 3.094787 2.484270 3.014312
## 
## $separation.matrix
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.6144058 0.3953279 0.7669507
## [2,] 0.6144058 0.0000000 1.1363751 2.1025880
## [3,] 0.3953279 1.1363751 0.0000000 1.3234252
## [4,] 0.7669507 2.1025880 1.3234252 0.0000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,] 0.000000 2.807635 2.142717 2.749849
## [2,] 2.807635 0.000000 3.167920 3.819656
## [3,] 2.142717 3.167920 0.000000 3.167064
## [4,] 2.749849 3.819656 3.167064 0.000000
## 
## $average.between
## [1] 2.612725
## 
## $average.within
## [1] 1.378705
## 
## $n.between
## [1] 831
## 
## $n.within
## [1] 394
## 
## $max.diameter
## [1] 3.714822
## 
## $min.separation
## [1] 0.3953279
## 
## $within.cluster.ss
## [1] 53.58231
## 
## $clus.avg.silwidths
##         1         2         3         4 
## 0.3877404 0.1005759 0.4489660 0.2472150 
## 
## $avg.silwidth
## [1] 0.3659104
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.5644199
## 
## $dunn
## [1] 0.1064191
## 
## $dunn2
## [1] 0.9490284
## 
## $entropy
## [1] 1.205341
## 
## $wb.ratio
## [1] 0.5276886
## 
## $ch
## [1] 26.73279
## 
## $cwidegap
## [1] 0.8288624 2.7278667 0.9296514 2.2474481
## 
## $widestgap
## [1] 2.727867
## 
## $sindex
## [1] 0.5028333
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Figure 23 shows, that the specified four clusters are well separated. There are 22 countries in the first cluster, 4 in the second, 17 in the third and 7 in the fourth.

  plot3d(fit$loadings)
text3d(fit$loadings,texts = rownames(fit$loadings))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
  coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)

text3d(fit$scores,texts = rownames(U21_filtered),col=km4)
rglwidget()

Figure 23. specified four clusters on three factors

Figure 24 shows the map of clustered countries.

library(maptools)
data(wrld_simpl)
countries_km1 = wrld_simpl@data$NAME %in% c("Argentina","Brazil","Bulgaria","Chile","Croatia","Czech Republic","Greece", "Italy","Malaysia","Mexico","Poland","Portugal","Romania","Russia","Saudi Arabia","Serbia","Slovakia","Slovenia","South Africa","Spain","Thailand","Ukraine")
countries_km2 = wrld_simpl@data$NAME %in% c("China","Germany","Hungary","United States")
countries_km3 = wrld_simpl@data$NAME %in% c("Australia","Austria","Belgium","Canada","Denmark","Finland","France","Hong Kong", "Ireland","Israel","Netherlands","New Zealand","Norway","Singapore","Sweden","Switzerland","United Kingdom")
countries_km4 = wrld_simpl@data$NAME %in% c("India","Indonesia","Iran","Japan","Taiwan","Turkey","Korea, Republic of")
countries_km2 = countries_km2 *2
countries_km3 = countries_km3 *3
countries_km4 = countries_km4 *4
countries_map = countries_km1+countries_km2+countries_km3+countries_km4+1
plot(wrld_simpl, col = c(gray(.80), "red", "blue", "green", "orange")[countries_map])

Figure 24. The specified four clusters on the map

5. Specifying partial rankings within specified clusters.

When specifying the partial rankings (within clusters of countries), we should either use the calculated factor weights or the original weights used by U21.

Firt, we calculate the original weights used by U21.

weights=matrix(0,1,24)
colnames(weights)<-c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")
weights[1,] <- c(0.05,0.05,0.05,0.025,0.025,0.02,0.02,0.02,0.14,0.04,0.04,0.02,0.02,0.04,0.04,0.13,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03)
rownames(weights)<-c("weights")

In the next step, the partial rankings are calculated, which (see Table 1). In the last two columns of the Table 1, variables km3 and km4 indicate cluster memberships for clusters 3 and 4, respectively. U21_overall_Rank contains the rankings of the original U21 ranking, FA_Rank the ranking based on factor scores, and U21_Rank_selected is the rankings that only contain indicators, which corresponds to the factors.

base<-as.data.frame(1:nrow(U21_rank))
colnames(base)<-paste("ID")
#base$countries<-rownames(U21_rank)
base$U21_overall_Rank<-U21_rank
base$FA_weights <- as.matrix(pnorm(fit$scores[,1])*pnorm(fit$scores[,2])*pnorm(fit$scores[,3]))
base$FA_Rank<-rank(-base$FA_weights)
base$Score_selected <- rowSums(U21_filtered[,columns]*weights[,columns])*100/max(rowSums(U21_filtered[,columns]*weights[,columns]))
base$U21_Rank_selected<-rank(-base$Score_selected)
base$km3 <-km3
base$km4 <-km4
rownames(base)<-rownames(U21_rank)
kable(base, caption="**Table 1** Partial rankings")

**Table 1** Partial rankings
	ID	U21_overall_Rank	FA_weights	FA_Rank	Score_selected	U21_Rank_selected	km3	km4
Argentina	1	41	3.716323e-02	39	53.20702	16	2	1
Australia	2	9	1.226081e-01	20	46.85638	23	1	3
Austria	3	12	1.205554e-01	22	77.76420	6	1	3
Belgium	4	13	2.847575e-01	6	41.03496	27	1	3
Brazil	5	38	1.626992e-01	14	41.07081	26	2	1
Bulgaria	6	40	5.989465e-02	30	22.48808	47	2	1
Canada	7	4	2.566109e-01	7	77.78374	5	1	3
Chile	8	33	1.120821e-02	43	20.57509	48	3	1
China	9	34	6.702021e-02	28	39.96442	29	2	2
Croatia	10	44	6.752780e-02	27	26.79678	42	2	1
Czech Republic	11	26	1.535241e-01	15	61.93087	9	2	1
Denmark	12	3	1.276323e-01	17	56.33062	12	1	3
Finland	13	5	4.592789e-01	2	95.87306	2	1	3
France	14	18	2.193017e-01	9	40.09535	28	1	3
Germany	15	14	4.278289e-01	3	73.57370	7	2	2
Greece	16	32	3.890460e-02	38	28.52945	41	3	1
Hong Kong	17	15	1.276280e-01	18	50.23970	21	1	3
Hungary	18	29	1.664197e-01	12	31.49285	38	2	2
India	19	50	1.537641e-03	47	37.16362	31	3	4
Indonesia	20	48	1.284827e-03	48	17.93802	50	3	4
Iran	21	49	4.994809e-03	45	35.34617	33	3	4
Ireland	22	17	2.919132e-01	5	41.66170	25	1	3
Israel	23	19	1.652540e-01	13	54.70782	13	1	3
Italy	24	27	4.908183e-02	34	32.20203	36	3	1
Japan	25	20	1.944635e-04	49	61.59416	10	3	4
Korea, Rep. (South)	26	21	1.681845e-05	50	31.96663	37	3	4
Malaysia	27	28	5.352892e-02	31	50.55125	20	2	1
Mexico	28	46	9.292853e-03	44	20.50180	49	3	1
Netherlands	29	7	1.669516e-01	11	68.99285	8	1	3
New Zealand	30	16	7.233421e-02	26	39.72090	30	1	3
Norway	31	11	2.270396e-01	8	78.31016	4	1	3
Poland	32	31	8.456104e-02	24	25.39432	43	2	1
Portugal	33	24	1.182719e-01	23	52.54867	18	1	1
Romania	34	39	4.490066e-02	35	24.88180	44	2	1
Russia	35	36	7.997149e-02	25	44.89472	24	2	1
Saudi Arabia	36	30	2.028512e-02	41	32.57981	35	3	1
Serbia	37	34	5.163543e-02	33	53.96639	14	2	1
Singapore	38	10	5.246536e-02	32	48.01649	22	1	3
Slovakia	39	37	1.219388e-01	21	50.87108	19	2	1
Slovenia	40	25	1.228653e-01	19	30.92769	39	2	1
South Africa	41	45	4.384428e-02	36	29.91428	40	2	1
Spain	42	23	1.420620e-01	16	32.97279	34	2	1
Sweden	43	2	2.060980e-01	10	100.00000	1	1	3
Switzerland	44	6	6.600255e-02	29	52.70262	17	1	3
Taiwan	45	22	3.518628e-02	40	58.63325	11	3	4
Thailand	46	42	4.127924e-02	37	24.36351	45	2	1
Turkey	47	47	3.550801e-03	46	35.56118	32	3	4
Ukraine	48	43	1.884380e-02	42	23.44289	46	2	1
United Kingdom	49	8	3.468806e-01	4	80.62078	3	1	3
United States	50	1	4.693511e-01	1	53.27870	15	2	2

In the next step, the rank correlations between the rankings are calculated.

Rank correlations, in the case of 3 clusters

Figures 25-28 show rank correlations between rankings. Figure 25 shows rank correlations for all countries, while Figures 26-28 show rank correlations separately for the clusters 1-3.

If we examine all countries (see Figure 25), there is a close positive relationship between the rankings of the original U21 ranking (U21_overall_Rank) and the ranking we calculated, which contains only the indicators corresponding to each factor (U21_Rank_selected). If we examine only the rankings of the countries belonging to clusters 1 and 2, a similar result is obtained (see Figures 26-27).

par(mfrow=c(2,2))
RANKS<-c("U21_overall_Rank","FA_Rank","U21_Rank_selected")
corrgram(base[,RANKS],cor.method = "spearman",order = TRUE,main="For all countries",lower.panel=panel.cor)

Figure 25. Partial rankings fo all countries

corrgram(base[km3==1,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km3==1)," countries from cluster 1"),lower.panel=panel.cor)

Figure 26. Partial correlactions between calculated and the original rankings on cluster 1

corrgram(base[km3==2,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km3==2)," countries from cluster 2"),lower.panel=panel.cor)

Figure 27. Partial correlactions between calculated and the original rankings on cluster 2

corrgram(base[km3==3,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km3==3)," countries from cluster 3"),lower.panel=panel.cor)

Figure 28. Partial correlactions between calculated and the original rankings on cluster 3

Rank correlations, in the case of 4 clusters

Rank correlations are also calculated when four clusters are created. For 1, 2, and 3 clusters, there are moderately strong relationship between the original U21 rankings and calculated rankings.

par(mfrow=c(2,2))
RANKS<-c("U21_overall_Rank","FA_Rank","U21_Rank_selected")
corrgram(base[km4==1,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km4==1)," countries from cluster 1"),lower.panel=panel.cor)

Figure 29. Partial correlactions between calculated and the original rankings on cluster 1

corrgram(base[km4==2,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km4==2)," countries from cluster 2"),lower.panel=panel.cor)

Figure 30. Partial correlactions between calculated and the original rankings on cluster 2

corrgram(base[km4==3,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km4==3)," countries from cluster 3"),lower.panel=panel.cor)

Figure 31. Partial correlactions between calculated and the original rankings on cluster 3

corrgram(base[km4==4,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km4==4)," countries from cluster 4"),lower.panel=panel.cor)

Figure 32. Partial correlactions between calculated and the original rankings on cluster 4

6. Summary and conclusions

The results of the two-step clustering are summarized below.

MTX3<-as.data.frame(matrix(0,nrow(U21_filtered),ncol(U21_filtered)))
MTX4<-as.data.frame(matrix(0,nrow(U21_filtered),ncol(U21_filtered)))
rownames(MTX3)<-rownames(U21_filtered)
colnames(MTX3)<-colnames(U21_filtered)
rownames(MTX4)<-rownames(U21_filtered)
colnames(MTX4)<-colnames(U21_filtered)
MTX3[,columns]<-km3
MTX4[,columns]<-km4
cols<-c('0'="#FFFFFF",'1'="#99FF66",'2'="#66FF33",'3'="#33CC00",'4'="#009900")
SER3<-c(seriate(dist(as.matrix(MTX3))),seriate(dist(t(as.matrix(MTX3)))))
SER4<-c(seriate(dist(as.matrix(MTX4))),seriate(dist(t(as.matrix(MTX4)))))
ORDERED3<-MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)]
ORDORIG3<-U21_filtered[get_order(SER3,dim=1),get_order(SER3,dim=2)]
ORDERED4<-MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)]
ORDORIG4<-U21_filtered[get_order(SER4,dim=1),get_order(SER4,dim=2)]
rownames(ORDERED3)<-rownames(MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)])
rownames(ORDERED4)<-rownames(MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)])
rownames(ORDORIG3)<-rownames(MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)])
rownames(ORDORIG4)<-rownames(MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)])
par(mfrow=c(2,2))

Figure 33 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.

p <- plot_ly(x=colnames(ORDORIG3), y=rownames(ORDORIG3), z = as.matrix(ORDORIG3), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p

Figure 33. Heat map of U21 indicators grouped by the set of tree clusters

Figure 34 shows the countries belonging to the three clusters, as well as the indicators that belong to the three factors. Indicators that do not belong to any factor are displayed with a dark red cell. It can be seen that half (12) of the indicators were dropped during the analysis.

p <- plot_ly(x=colnames(ORDERED3), y=rownames(ORDERED3), z = as.matrix(ORDERED3), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p

Figure 34. selected countries and indicators

p <- plot_ly(x=colnames(ORDORIG4), y=rownames(ORDORIG4), z = as.matrix(ORDORIG4), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p

Figure 35. Heat map of U21 indicators grouped by the set of four clusters

Figure 35 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.

Figure 36. shows the four clusters specified and the indicators that belong to the three factors.

p <- plot_ly(x=colnames(ORDERED4), y=rownames(ORDERED4), z = as.matrix(ORDERED4), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p

Figure 36. selected countries and indicators

Summary: Applying two-step clustering is questionable, first for the low sample. The method itself does not allow clusters to overlap, however, it may perform well in some factors in one country and poorly in others. As a result of clustering, we also obtained very small clusters, which are difficult to analyze and interpret.

The additional files attached to the study (including this html) are free to use for scientific purposes. Please cite the original paper

Original paper

Kosztyán, Z. T., Banász, Z., Csányi, V. V., & Telcs, A. (2019). Rankings or leagues or rankings on leagues?-Ranking in fair reference groups. Tertiary Education and Management, 25(4), 289-310.

Further references

Van Mechelen, I., J. Hampton, R. S. Michalski, and P. Theuns (1993). Categories and Concepts - Theoretical Views and Inductive Data Analysis. Academic Press, London.

Partial ranking with two-step clustering

1. Loading necessary libraries and data

Loading libraries

Citations

Loading necessary data

2. Two-step clustering

Cluster tendency of indicators

Calculating Hopkins indicator

Hierarchical clustering for indicators

Multidimensional scaling for indicators

Cluster tendency of countries

Calculating Hopkins indicator

Hierarchical clustering of countries

Multidimensional scalig for countries

Principal Component Analysis

R (resources) PC

E (environment) PC

C (connenctions) PC

O (outputs) PC

OPT1 - Find one principal component

OPT2 - Find two (or more) factors

Correlation analysis for PCs

Factor analysis

4. Data reduction

Specifiying 3 clusters on 3 factors

Specifiying 4 clusters on 3 factors

5. Specifying partial rankings within specified clusters.

Rank correlations, in the case of 3 clusters

Rank correlations, in the case of 4 clusters

6. Summary and conclusions

Original paper

Further references