Workflow
# Correlation analysis
library(corrgram) # Correlogram
library(corrplot) # Correlation plots
source("http://www.sthda.com/upload/rquery_cormat.r") # Correlation matrix representation
# Dimension reduction methods
library(psych) # Factor Analysis, Principal Component Analysis
library(factoextra) # Distance metrics
library(MASS) # Multi-dimensional Scaling
# Cluster analysis
library(fclust) # Cluster analysis
library(clustertend) # Hopkins coefficients
library(cluster) # Cluster plots
library(fpc) # Cluster statistics
library(mclust) # Multi-step clustering
# Complex models
library(lavaan) # CFA, SEM models
library(semPlot) # Plot models
# Further statistical libraries
library(Hmisc) # Miscellaneous statistical functions
library(seriation) # Seriation
# Data manipulation
library(reshape2) # Data transformation
library("dplyr", character.only = TRUE) # Data manipulation
library(knitr) # Formatted tables
# Maps
library(maps) # Mapping functions
library(mapdata) # Mapping data
library(maptools) # Mapping tools
# Interactive 3D figures
library(shiny) # Interactive figures
library(htmltools) # HTML widgets
library(rgl) #3D shaping
library(rglwidget) #3D shaping
library(ggplot2) # Complex figures
library(ggrepel) # Complex figures
library(ggdendro) # Dendrogram
library(d3heatmap) # Heat map
library(igraph) # Graph representation
library(ggsci) # Simpson palette
library(plotly) # Plotly - interactive plot
## Wright K (2021). _corrgram: Plot a Correlogram_. R package version
## 1.14, <URL: https://CRAN.R-project.org/package=corrgram>.
## Wei T, Simko V (2021). _R package "corrplot": Visualization of a
## Correlation Matrix_. (Version 0.88), <URL:
## https://github.com/taiyun/corrplot>.
# Dimension reduction methods
base::print(citation("psych"),style="text") # Factor Analysis, Principal Component Analysis
## Revelle W (2021). _psych: Procedures for Psychological, Psychometric,
## and Personality Research_. Northwestern University, Evanston,
## Illinois. R package version 2.1.3, <URL:
## https://CRAN.R-project.org/package=psych>.
## Kassambara A, Mundt F (2020). _factoextra: Extract and Visualize the
## Results of Multivariate Data Analyses_. R package version 1.0.7, <URL:
## https://CRAN.R-project.org/package=factoextra>.
## Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
## Fourth edition. Springer, New York. ISBN 0-387-95457-0, <URL:
## https://www.stats.ox.ac.uk/pub/MASS4/>.
## Ferraro M, Giordani P, Serafini A (2019). "fclust: An R Package for
## Fuzzy Clustering." _The R Journal_, *11*. <URL:
## https://journal.r-project.org/archive/2019/RJ-2019-017/RJ-2019-017.pdf>.
## Wright K, YiLan L, RuTong Z (2021). _clustertend: Check the Clustering
## Tendency_. R package version 1.5, <URL:
## https://CRAN.R-project.org/package=clustertend>.
## Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2021). _cluster:
## Cluster Analysis Basics and Extensions_. R package version 2.1.2 - For
## new features, see the 'Changelog' file (in the package source), <URL:
## https://CRAN.R-project.org/package=cluster>.
## Hennig C (2020). _fpc: Flexible Procedures for Clustering_. R package
## version 2.2-9, <URL: https://CRAN.R-project.org/package=fpc>.
## Scrucca L, Fop M, Murphy TB, Raftery AE (2016). "mclust 5: clustering,
## classification and density estimation using Gaussian finite mixture
## models." _The R Journal_, *8*(1), 289-317. <URL:
## https://doi.org/10.32614/RJ-2016-021>.
## Rosseel Y (2012). "lavaan: An R Package for Structural Equation
## Modeling." _Journal of Statistical Software_, *48*(2), 1-36. <URL:
## https://www.jstatsoft.org/v48/i02/>.
## Epskamp S (2019). _semPlot: Path Diagrams and Visual Analysis of
## Various SEM Packages' Output_. R package version 1.1.2, <URL:
## https://CRAN.R-project.org/package=semPlot>.
# Further statistical libraries
base::print(citation("Hmisc"),style="text") # Misceleaus statistical functions
## Harrell Jr FE, Dupont C, et al. (2021). _Hmisc: Harrell
## Miscellaneous_. R package version 4.5-0, <URL:
## https://CRAN.R-project.org/package=Hmisc>.
## Hahsler M, Buchta C, Hornik K (2020). _seriation: Infrastructure for
## Ordering Objects Using Seriation_. R package version 1.2-9, <URL:
## https://CRAN.R-project.org/package=seriation>.
##
## Hahsler M, Hornik K, Buchta C (2008). "Getting things in order: An
## introduction to the R package seriation." _Journal of Statistical
## Software_, *25*(3), 1-34. ISSN 1548-7660, doi: 10.18637/jss.v025.i03
## (URL: https://doi.org/10.18637/jss.v025.i03), <URL:
## https://www.jstatsoft.org/v25/i03/>.
## Wickham H (2007). "Reshaping Data with the reshape Package." _Journal
## of Statistical Software_, *21*(12), 1-20. <URL:
## http://www.jstatsoft.org/v21/i12/>.
## Wickham H, François R, Henry L, Müller K (2021). _dplyr: A Grammar of
## Data Manipulation_. R package version 1.0.6, <URL:
## https://CRAN.R-project.org/package=dplyr>.
## Xie Y (2021). _knitr: A General-Purpose Package for Dynamic Report
## Generation in R_. R package version 1.33, <URL:
## https://yihui.org/knitr/>.
##
## Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
## Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL:
## https://yihui.org/knitr/>.
##
## Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
## R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
## Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595,
## <URL: http://www.crcpress.com/product/isbn/9781466561595>.
## Becker RA (original S code), Wilks AR, Brownrigg R (R version), Minka TP,
## Deckmyn A (enhancements) (2018). _maps: Draw Geographical Maps_. R package
## version 3.3.0, <URL: https://CRAN.R-project.org/package=maps>.
## Becker RA (original S code), Wilks AR, Brownrigg R (R version) (2018).
## _mapdata: Extra Map Databases_. R package version 2.3.0, <URL:
## https://CRAN.R-project.org/package=mapdata>.
## Bivand R, Lewin-Koh N (2021). _maptools: Tools for Handling Spatial
## Objects_. R package version 1.1-1, <URL:
## https://CRAN.R-project.org/package=maptools>.
## Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J,
## McPherson J, Dipert A, Borges B (2021). _shiny: Web Application
## Framework for R_. R package version 1.6.0, <URL:
## https://CRAN.R-project.org/package=shiny>.
## Cheng J, Sievert C, Chang W, Xie Y, Allen J (2021). _htmltools: Tools
## for HTML_. R package version 0.5.1.1, <URL:
## https://CRAN.R-project.org/package=htmltools>.
## Murdoch D, Adler D (2021). _rgl: 3D Visualization Using OpenGL_. R
## package version 0.106.8, <URL: https://CRAN.R-project.org/package=rgl>.
## Murdoch D (2016). _rglwidget: 'rgl' in 'htmlwidgets' Framework_. R
## package version 0.2.1, <URL:
## https://CRAN.R-project.org/package=rglwidget>.
## Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
## Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL:
## https://ggplot2.tidyverse.org>.
## Slowikowski K (2021). _ggrepel: Automatically Position Non-Overlapping
## Text Labels with 'ggplot2'_. R package version 0.9.1, <URL:
## https://CRAN.R-project.org/package=ggrepel>.
## de Vries A, Ripley BD (2020). _ggdendro: Create Dendrograms and Tree
## Diagrams Using 'ggplot2'_. R package version 0.1.22, <URL:
## https://CRAN.R-project.org/package=ggdendro>.
## Cheng J, Galili T (2018). _d3heatmap: Interactive Heat Maps Using
## 'htmlwidgets' and 'D3.js'_. R package version 0.6.1.2, <URL:
## https://CRAN.R-project.org/package=d3heatmap>.
## Csardi G, Nepusz T (2006). "The igraph software package for complex
## network research." _InterJournal_, *Complex Systems*, 1695. <URL:
## https://igraph.org>.
## Xiao N (2018). _ggsci: Scientific Journal and Sci-Fi Themed Color
## Palettes for 'ggplot2'_. R package version 2.9, <URL:
## https://CRAN.R-project.org/package=ggsci>.
## Sievert C (2020). _Interactive Web-Based Data Visualization with R,
## plotly, and shiny_. Chapman and Hall/CRC. ISBN 9781138331457, <URL:
## https://plotly-r.com>.
The original U21 data is used.
load("U21_2014.RData")
rownames(U21_rank) <- rownames(U21_filtered)
orig_mtx <- as.matrix(U21_filtered)
# U21_filtered: 50 x 24 data frame containing the original 2014 U21 data
# U21_rank: 50 x 1 matrix (vector) with the U21 ranking
# orig_mtx: 50 x 24 matrix, the original U21 indicators for the 50 countries
As a first step, we examine whether the data are suitable for clustering; the clustering tendency should be checked both for the variables and for the countries.
The Hopkins statistic (H) helps to assess the clustering tendency: if H is greater than 0.5, the data are not worth clustering.
When clustering the indicators, we choose the Spearman correlation as the similarity measure and 1 − Spearman correlation as the distance measure, because rank correlation also captures nonlinear (monotonic) relationships.
vars <- U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
rownames(vars)<-rownames(U21_filtered)
countries<-t(vars)
colnames(countries) <- rownames(U21_filtered)
dist.m<-1-cor(vars,method="spearman")
paste("H:= ",hopkins(dist.m, n=nrow(dist.m)-1, byrow = F, header = F))
## [1] "H:= 0.311994877382919"
Since the Hopkins index is less than 0.5, the indicators can be clustered.
The next step is to plot cluster tendency in order to determine the preliminary number of clusters.
# Determine cluster tendency
clustend <- get_clust_tendency(scale(dist.m), nrow(dist.m)-1)
# Plot cluster tendency
clustend$plot +
scale_fill_gradient(low = "yellow", high = "blue")
Figure 1 shows that 2-4 clusters (which here are groups of indicators) can be specified, but this should be confirmed by further analyses.
The next step is to use hierarchical clustering for the indicators. The result of the clustering is shown in Figure 2.
hc <- hclust(dist(dist.m), method = "complete")
hcdata <- dendro_data(hc, type="rectangle")
o.cols <- hcdata$labels$label
ggplot(hcdata$segments) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = hcdata$labels, aes(x, y, label = label),
hjust = 1.3, vjust=0.5, size = 4) +
labs(x="", y="", title="Hierarchical clustering for the indicators") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
coord_flip()
Hierarchical clustering also makes 2-4 clusters (in this case, groups of indicators) probable. Displaying the distance matrix provides additional help in determining the number of clusters. Figure 3 shows the distance matrix of the indicators, where blue cells represent small distances and dark red cells represent large distances.
Van Mechelen et al. (1993) suggested the following criteria for clustering:
Within-cluster dissimilarities should be small.
Between-cluster dissimilarities should be large.
Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or, if appropriate, by linear, time series or spatial process models.
Members of a cluster should be well represented by its centroid.
The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
Clusters should be stable.
Clusters should correspond to connected areas in data space with high density.
The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
It should be possible to characterize the clusters using a small number of variables.
Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
Features should be approximately independent within clusters.
All clusters should have roughly the same size.
The number of clusters should be low.
Cluster statistics are calculated for 2-4 clusters.
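The statistics below are presumably produced with cluster.stats() from the fpc package; a minimal sketch, assuming the tree hc and the distance matrix dist.m defined above (the actual chunk is not shown):
# Evaluate the k = 2, 3 and 4 partitions of the indicator dendrogram
for (k in 2:4) {
print(cluster.stats(as.dist(dist.m), cutree(hc, k = k)))
}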
## $n
## [1] 24
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 21 3
##
## $min.cluster.size
## [1] 3
##
## $noisen
## [1] 0
##
## $diameter
## [1] 1.2498799 0.7726897
##
## $average.distance
## [1] 0.5610211 0.6223547
##
## $median.distance
## [1] 0.5672502 0.6203722
##
## $separation
## [1] 0.68497 0.68497
##
## $average.toother
## [1] 1.015888 1.015888
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.00000 0.68497
## [2,] 0.68497 0.00000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 1.015888
## [2,] 1.015888 0.000000
##
## $average.between
## [1] 1.015888
##
## $average.within
## [1] 0.5686878
##
## $n.between
## [1] 63
##
## $n.within
## [1] 213
##
## $max.diameter
## [1] 1.24988
##
## $min.separation
## [1] 0.68497
##
## $within.cluster.ss
## [1] 4.111321
##
## $clus.avg.silwidths
## 1 2
## 0.4384823 0.3867175
##
## $avg.silwidth
## [1] 0.4320117
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.6564944
##
## $dunn
## [1] 0.5480287
##
## $dunn2
## [1] 1.63233
##
## $entropy
## [1] 0.3767702
##
## $wb.ratio
## [1] 0.5597936
##
## $ch
## [1] 10.44002
##
## $cwidegap
## [1] 0.5204666 0.6203722
##
## $widestgap
## [1] 0.6203722
##
## $sindex
## [1] 0.68497
##
## $corrected.rand
## NULL
##
## $vi
## NULL
## $n
## [1] 24
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 15 3 6
##
## $min.cluster.size
## [1] 3
##
## $noisen
## [1] 0
##
## $diameter
## [1] 1.0665562 0.7726897 0.9575082
##
## $average.distance
## [1] 0.4433945 0.6223547 0.7098425
##
## $median.distance
## [1] 0.4366579 0.6203722 0.6949694
##
## $separation
## [1] 0.2946191 0.6849700 0.2946191
##
## $average.toother
## [1] 0.7970161 1.0158882 0.7187454
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.0000000 0.7845616 0.2946191
## [2,] 0.7845616 0.0000000 0.6849700
## [3,] 0.2946191 0.6849700 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.0000000 1.0441516 0.6734484
## [2,] 1.0441516 0.0000000 0.9452299
## [3,] 0.6734484 0.9452299 0.0000000
##
## $average.between
## [1] 0.8144531
##
## $average.within
## [1] 0.5323766
##
## $n.between
## [1] 153
##
## $n.within
## [1] 123
##
## $max.diameter
## [1] 1.066556
##
## $min.separation
## [1] 0.2946191
##
## $within.cluster.ss
## [1] 3.425579
##
## $clus.avg.silwidths
## 1 2 3
## 0.34829660 0.33917439 -0.05553832
##
## $avg.silwidth
## [1] 0.2461976
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5722906
##
## $dunn
## [1] 0.276234
##
## $dunn2
## [1] 0.9487293
##
## $entropy
## [1] 0.9002561
##
## $wb.ratio
## [1] 0.6536614
##
## $ch
## [1] 8.082114
##
## $cwidegap
## [1] 0.4325824 0.6203722 0.5920308
##
## $widestgap
## [1] 0.6203722
##
## $sindex
## [1] 0.2946191
##
## $corrected.rand
## NULL
##
## $vi
## NULL
## $n
## [1] 24
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 5 10 3 6
##
## $min.cluster.size
## [1] 3
##
## $noisen
## [1] 0
##
## $diameter
## [1] 1.0665562 0.4296074 0.7726897 0.9575082
##
## $average.distance
## [1] 0.6794132 0.2371534 0.6223547 0.7098425
##
## $median.distance
## [1] 0.7161004 0.2366260 0.6203722 0.6949694
##
## $separation
## [1] 0.2746400 0.2746400 0.6849700 0.2946191
##
## $average.toother
## [1] 0.7358357 0.6848112 1.0158882 0.7187454
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.2746400 0.7845616 0.5761566
## [2,] 0.2746400 0.0000000 0.8014765 0.2946191
## [3,] 0.7845616 0.8014765 0.0000000 0.6849700
## [4,] 0.5761566 0.2946191 0.6849700 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.5818078 1.0129007 0.8540163
## [2,] 0.5818078 0.0000000 1.0597770 0.5831645
## [3,] 1.0129007 1.0597770 0.0000000 0.9452299
## [4,] 0.8540163 0.5831645 0.9452299 0.0000000
##
## $average.between
## [1] 0.7571513
##
## $average.within
## [1] 0.4956133
##
## $n.between
## [1] 203
##
## $n.within
## [1] 73
##
## $max.diameter
## [1] 1.066556
##
## $min.separation
## [1] 0.27464
##
## $within.cluster.ss
## [1] 3.01833
##
## $clus.avg.silwidths
## 1 2 3 4
## -0.1273645 0.5697175 0.3391744 -0.1851225
##
## $avg.silwidth
## [1] 0.2069642
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5264653
##
## $dunn
## [1] 0.2575017
##
## $dunn2
## [1] 0.8196294
##
## $entropy
## [1] 1.298077
##
## $wb.ratio
## [1] 0.6545763
##
## $ch
## [1] 6.723373
##
## $cwidegap
## [1] 0.6839779 0.2432582 0.6203722 0.5920308
##
## $widestgap
## [1] 0.6839779
##
## $sindex
## [1] 0.27464
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Unfortunately, based on the above results, it is not possible to decide how many clusters (2, 3 or 4) should be formed from the indicators, so the interpretability of the clusters will decide the question.
To examine the interpretability of the clusters, the next step is to use multidimensional scaling (MDS) for the indicators. First, we form two clusters, the results of which are shown in Figure 4.
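The stress trace below presumably comes from non-metric multidimensional scaling with isoMDS() from the MASS package; a sketch, assuming the distance matrix dist.m defined above:
mds.isomds <- isoMDS(as.dist(dist.m), k = 2) # non-metric MDS of the indicators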
## initial value 0.234620
## iter 5 value 0.102313
## iter 10 value 0.079063
## iter 15 value 0.058780
## iter 20 value 0.035771
## iter 25 value 0.022966
## iter 30 value 0.013233
## final value 0.009420
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=2)
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text_repel(aes(colour=factor(cut)), size=4) +
scale_colour_discrete(name = "Clusters") +
labs(x="", y="", title="Multidimensional scaling for indicators considering 2 clusters") + theme_bw()
As shown in Figure 4, the two clusters are difficult to interpret: the first cluster contains 21 indicators, while the second contains only 3.
Figure 5 shows the result of multidimensional scaling in the case of specifying 3 clusters.
## initial value 0.234620
## iter 5 value 0.102313
## iter 10 value 0.079063
## iter 15 value 0.058780
## iter 20 value 0.035771
## iter 25 value 0.022966
## iter 30 value 0.013233
## final value 0.009420
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=3)
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text_repel(aes(colour=factor(cut)), size=4) +
scale_colour_discrete(name = "Clusters") +
labs(x="", y="", title="Multidimensional scaling for indicators considering 3 clusters") + theme_bw()
Based on Figure 5, the 3 clusters (i.e. 3 groups of indicators) are also difficult to interpret: there are 15 indicators in the first cluster, 3 in the second and 6 in the third. However, the indicators measuring the relationships (C1, C2, C3 and C4) already cluster together.
In the next step, we will look at how to group the indicators if we create 4 clusters. The result is shown in Figure 6.
## initial value 0.234620
## iter 5 value 0.102313
## iter 10 value 0.079063
## iter 15 value 0.058780
## iter 20 value 0.035771
## iter 25 value 0.022966
## iter 30 value 0.013233
## final value 0.009420
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=4)
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text_repel(aes(colour=factor(cut)), size=4) +
scale_colour_discrete(name = "Clusters") +
labs(x="", y="", title="Multidimensional scaling for indicators considering 4 clusters") + theme_bw()
If four clusters are specified, 5 indicators fall into the first cluster, 10 into the second, 3 into the third, and 6 into the fourth. Only the relationship group containing the variables C1-C4 appears as a pure cluster (the fourth); in the other clusters the composition of the indicators is very mixed. Four clusters (groups of variables) are therefore also difficult to interpret, so the number of variables needs to be reduced, which can be done, for example, by principal component analysis.
The Hopkins statistic (H) again helps to assess the clustering tendency: if H is greater than 0.5, the data are not worth clustering. In the case of clustering the rows (countries), the distance is the Euclidean distance.
vars <- U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
rownames(vars)<-rownames(U21_filtered)
countries<-t(vars)
colnames(countries) <- rownames(U21_filtered)
dist.c<-as.matrix(dist(vars))
paste("H:=",hopkins(dist.c, n=nrow(dist.c)-1, byrow = F, header = F))
## [1] "H:= 0.220381143328058"
Since the Hopkins index is less than 0.5, countries (i.e., rows) can be clustered.
The next step is to plot the cluster tendency in order to determine the preliminary number of clusters.
# Cluster tendency
clustend <- get_clust_tendency(scale(dist.c), nrow(dist.c)-1)
# Plot the cluster tendency
clustend$plot +
scale_fill_gradient(low = "yellow", high = "blue")
Based on Figure 7, 3-4 clusters (i.e. groups of countries) are likely, but this needs to be confirmed by further analyses.
The next step is to apply hierarchical clustering for countries. The result of the clustering is shown in Figure 8.
hc <- hclust(dist(dist.c), method = "complete")
hcdata <- dendro_data(hc, type="rectangle")
o.rows <- hcdata$labels$label
ggplot(hcdata$segments) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = hcdata$labels, aes(x, y, label = label),
hjust = 1, vjust=0.7, size = 3) +
labs(x="", y="", title="Hierarchical clustering for countries") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
coord_flip()
Hierarchical clustering makes the existence of 3-4 clusters (here a group of countries) probable.
Displaying the distance matrix provides additional help in determining the number of clusters. Figure 9 shows the distance matrix for the countries, with blue cells representing small distances and dark red cells representing large distances.
Based on the distance matrix shown in Figure 9, 3-4 clusters are probable. In the next step, we compare the cases where we create 3 or 4 clusters for the countries.
## $n
## [1] 50
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 21 15 14
##
## $min.cluster.size
## [1] 14
##
## $noisen
## [1] 0
##
## $diameter
## [1] 172.7304 187.5483 131.6234
##
## $average.distance
## [1] 96.74970 114.40191 94.56826
##
## $median.distance
## [1] 92.82482 111.94521 98.24285
##
## $separation
## [1] 57.08949 64.51798 57.08949
##
## $average.toother
## [1] 165.3137 174.3910 131.3474
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.00000 136.35107 57.08949
## [2,] 136.35107 0.00000 64.51798
## [3,] 57.08949 64.51798 0.00000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.0000 200.0512 128.0951
## [2,] 200.0512 0.0000 135.9007
## [3,] 128.0951 135.9007 0.0000
##
## $average.between
## [1] 157.7719
##
## $average.within
## [1] 101.4346
##
## $n.between
## [1] 819
##
## $n.within
## [1] 406
##
## $max.diameter
## [1] 187.5483
##
## $min.separation
## [1] 57.08949
##
## $within.cluster.ss
## [1] 258714.8
##
## $clus.avg.silwidths
## 1 2 3
## 0.2440727 0.1473629 0.1832766
##
## $avg.silwidth
## [1] 0.1980369
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.566333
##
## $dunn
## [1] 0.3043989
##
## $dunn2
## [1] 1.119693
##
## $entropy
## [1] 1.081972
##
## $wb.ratio
## [1] 0.6429189
##
## $ch
## [1] 24.42036
##
## $cwidegap
## [1] 98.27018 125.24363 102.04286
##
## $widestgap
## [1] 125.2436
##
## $sindex
## [1] 61.91874
##
## $corrected.rand
## NULL
##
## $vi
## NULL
## $n
## [1] 50
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 21 11 14 4
##
## $min.cluster.size
## [1] 4
##
## $noisen
## [1] 0
##
## $diameter
## [1] 172.7304 148.9883 131.6234 185.2957
##
## $average.distance
## [1] 96.74970 106.13002 94.56826 128.13876
##
## $median.distance
## [1] 92.82482 108.07670 98.24285 135.48427
##
## $separation
## [1] 57.08949 60.81710 57.08949 60.81710
##
## $average.toother
## [1] 165.3137 158.4569 131.3474 186.9005
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.00000 136.35107 57.08949 201.0139
## [2,] 136.35107 0.00000 64.51798 60.8171
## [3,] 57.08949 64.51798 0.00000 102.1068
## [4,] 201.01385 60.81710 102.10676 0.0000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000 187.6245 128.0951 234.2247
## [2,] 187.6245 0.0000 124.8736 122.8686
## [3,] 128.0951 124.8736 0.0000 166.2251
## [4,] 234.2247 122.8686 166.2251 0.0000
##
## $average.between
## [1] 155.9924
##
## $average.within
## [1] 100.7137
##
## $n.between
## [1] 863
##
## $n.within
## [1] 362
##
## $max.diameter
## [1] 185.2957
##
## $min.separation
## [1] 57.08949
##
## $within.cluster.ss
## [1] 247876.6
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.24407268 0.07194809 0.15158446 -0.03399656
##
## $avg.silwidth
## [1] 0.158063
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5576146
##
## $dunn
## [1] 0.3080993
##
## $dunn2
## [1] 0.9588711
##
## $entropy
## [1] 1.255947
##
## $wb.ratio
## [1] 0.6456321
##
## $ch
## [1] 17.30098
##
## $cwidegap
## [1] 98.27018 106.99219 102.04286 174.91549
##
## $widestgap
## [1] 174.9155
##
## $sindex
## [1] 60.06623
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Nevertheless, based on the results, it is not possible to decide how many clusters (3 or 4) should be specified for the countries, so the number of clusters should be decided by their interpretability.
To examine the interpretability of clusters, the next step is to use multidimensional scaling for countries. First, we specify three clusters (i.e. three groups of countries), the results of which are shown in Figure 10.
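The configuration below is presumably obtained analogously with isoMDS() from MASS, this time on the Euclidean country distances (a sketch, assuming dist.c defined above):
mds.isomds <- isoMDS(as.dist(dist.c), k = 2) # non-metric MDS of the countries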
## initial value 0.728168
## final value 0.727938
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=3)
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text_repel(aes(colour=factor(cut)), size=4) +
scale_colour_discrete(name = "Clusters") +
labs(x="", y="", title="Multidimensional scaling for countries considering 3 clusters") + theme_bw()
Figure 10 shows that the first cluster contains 21 countries, the second 15, and the third 14.
In the next step, we look at what happens if we specify 4 clusters. The result is shown in Figure 11.
## initial value 0.728168
## final value 0.727938
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cutree(hc, k=4)
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text_repel(aes(colour=factor(cut)), size=4) +
scale_colour_discrete(name = "Clusters") +
labs(x="", y="", title="Multidimensional scaling for countries considering 4 clusters") + theme_bw()
In the case of specifying 4 clusters, 21 countries were included in the first cluster, 11 in the second, 14 in the third, and 4 in the fourth. Compared to the previous case (when we specified three clusters), the countries in the first and third clusters did not change. The fourth cluster contains the countries (Denmark, Sweden, Switzerland, United States) that belonged to the second cluster in the previous case. The principles formulated by Van Mechelen et al. (1993) are mostly fulfilled by 4 clusters.
The results of the clustering tendency analysis: it is clear that clustering can be performed both on the variables and on the countries (see the Hopkins indices). However, the resulting clusters are quite difficult to interpret: the groups of variables contain mixed indicators, so model reduction is needed before grouping the countries.
3. Model reduction
The first step is to determine the number of principal components / factors. Figure 12 shows the scree plot, which can be used to specify the number of latent variables (i.e. principal components or factors).
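Such a scree plot can be produced, for example, with the scree() function of the psych package (a sketch; the actual chunk is not shown):
scree(cor(U21_filtered), factors = TRUE, pc = TRUE) # eigenvalues for factors and PCs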
Based on Figure 12, at most 5 principal components (PCs) or at most 3 factors should be kept.
U21 contains 4 main groups of indicators: resources (R1 - R5), environment (E1 - E4), connectivity (C1 - C6) and outputs (O1 - O9). In the case of principal component analysis, it is worth starting from these predefined groups of indicators.
First, we define the principal component R, which contains the indicators R1-R5 describing the resources of each country.
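The KMO test and the one-component fit reported below are presumably produced by a hidden chunk along these lines (fitR is reused later for the factor scores; the Es/fitE, Cs/fitC and Os/fitO objects of the following steps are built analogously):
Rs <- U21_filtered[,c("R1","R2","R3","R4","R5")] # resource indicators
KMO(cor(Rs)) # sampling adequacy
fitR <- principal(Rs, nfactors = 1) # one principal component
fitR$communality # communalities reported below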
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Rs))
## Overall MSA = 0.65
## MSA for each item =
## R1 R2 R3 R4 R5
## 0.75 0.54 0.66 0.68 0.60
##
## Factor analysis with Call: principal(r = Rs, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5 and the objective function was 1.28
## The number of observations was 50 with Chi Square = 58.63 with prob < 2.3e-11
##
## The root mean square of the residuals (RMSA) is 0.15
## R1 R2 R3 R4 R5
## 0.4804492 0.3481106 0.7782275 0.8074245 0.8522280
pcaR <- prcomp(Rs, scale = TRUE)
eigs <- pcaR$sdev^2
varR <- eigs[1] / sum(eigs)
paste("Variance rate=",varR)
## [1] "Variance rate= 0.653287948596533"
Since the KMO value is greater than 0.5, principal component / factor analysis can be performed. All communalities are greater than 0.25, which means that the original variables fit the principal component well. Furthermore, the variance ratio is greater than 0.5, which means that at least 50% of the information is retained after the model reduction. Based on these, the principal component R created above is appropriate and can be used in further analyses.
Next, we specify the principal component E, which contains the indicators E1-E4 describing the environment of each country.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA = 0.55
## MSA for each item =
## E1 E2 E3 E4
## 0.61 0.56 0.53 0.54
##
## Factor analysis with Call: principal(r = Es, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2 and the objective function was 0.28
## The number of observations was 50 with Chi Square = 13.06 with prob < 0.0015
##
## The root mean square of the residuals (RMSA) is 0.21
## E1 E2 E3 E4
## 0.3235923 0.1889422 0.5982086 0.6393162
pcaE <- prcomp(Es, scale = TRUE)
eigs <- pcaE$sdev^2
varE <- eigs[1] / sum(eigs)
paste("Variance rate=",varE)
## [1] "Variance rate= 0.43751482380739"
Based on the above results, it can be seen that the communality of E2 is less than 0.25 (0.189), so E2 does not fit the PC properly. However, if we excluded E2 in the next step, the communality of E1 would drop below 0.25. More indicators can be retained if E3 is excluded instead of E2. Thus, in the next step, we omit E3.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Es))
## Overall MSA = 0.55
## MSA for each item =
## E1 E2 E4
## 0.53 0.54 0.61
##
## Factor analysis with Call: principal(r = Es, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 0 and the objective function was 0.22
## The number of observations was 50 with Chi Square = 10.42 with prob < NA
##
## The root mean square of the residuals (RMSA) is 0.25
## E1 E2 E4
## 0.6167214 0.5175482 0.2970137
pcaE <- prcomp(Es, scale = TRUE)
eigs <- pcaE$sdev^2
varE <- eigs[1] / sum(eigs)
paste("Variance rate=",varE)
## [1] "Variance rate= 0.477094428557052"
After excluding E3, the communality is greater than 0.25 for all variables, and the KMO value is greater than 0.5. However, the variance ratio is less than 0.5 (0.477), which means that more than 50% of the information is lost after the model reduction.
Next, we specify the principal component C, which contains the indicators C1-C6 describing the connectivity of each country.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA = 0.67
## MSA for each item =
## C1 C2 C3 C4 C5 C6
## 0.57 0.66 0.37 0.70 0.82 0.78
##
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 9 and the objective function was 0.53
## The number of observations was 50 with Chi Square = 24.07 with prob < 0.0042
##
## The root mean square of the residuals (RMSA) is 0.14
## C1 C2 C3 C4 C5 C6
## 0.4456108 0.5144444 0.1290487 0.6247154 0.6170022 0.5594279
pcaC <- prcomp(Cs, scale = TRUE)
eigs <- pcaC$sdev^2
varC <- eigs[1] / sum(eigs)
paste("Variance rate=",varC)
## [1] "Variance rate= 0.481708220917479"
The communality value of C3 is lower than 0.25, therefore, this variable should be excluded in the next step.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Cs))
## Overall MSA = 0.79
## MSA for each item =
## C1 C2 C4 C5 C6
## 0.78 0.79 0.80 0.79 0.80
##
## Factor analysis with Call: principal(r = Cs, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5 and the objective function was 0.23
## The number of observations was 50 with Chi Square = 10.55 with prob < 0.061
##
## The root mean square of the residuals (RMSA) is 0.13
## C1 C2 C4 C5 C6
## 0.5120676 0.5011992 0.5971696 0.6207606 0.5728886
pcaC <- prcomp(Cs, scale = TRUE)
eigs <- pcaC$sdev^2
varC <- eigs[1] / sum(eigs)
paste("Variance rate=",varC)
## [1] "Variance rate= 0.560817123312952"
Both the KMO value and the variance rate are greater than 0.5.
In the last step, we specify the principal component O, which contains the indicators O1-O9 describing the research output of each country.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA = 0.73
## MSA for each item =
## O1 O2 O3 O4 O5 O6 O7 O8 O9
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64
##
## Factor analysis with Call: principal(r = Os, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 27 and the objective function was 3.28
## The number of observations was 50 with Chi Square = 145.84 with prob < 2.9e-18
##
## The root mean square of the residuals (RMSA) is 0.14
## O1 O2 O3 O4 O5 O6
## 0.1066000549 0.8603383885 0.7882600079 0.7698125877 0.6266131079 0.3970218624
## O7 O8 O9
## 0.5495702936 0.7077575306 0.0005444639
pcaO <- prcomp(Os, scale = TRUE)
eigs <- pcaO$sdev^2
varO <- eigs[1] / sum(eigs)
paste("Variance rate=",varO)
## [1] "Variance rate= 0.560817123312952"
Based on the above results, the communality of both O1 and O9 is lower than 0.25; however, the variance ratio is adequate (0.56). There are two options: use one principal component, or use two factors.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA = 0.79
## MSA for each item =
## O2 O3 O4 O5 O6 O7 O8
## 0.74 0.79 0.81 0.75 0.82 0.78 0.85
##
## Factor analysis with Call: principal(r = Os, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 14 and the objective function was 1.62
## The number of observations was 50 with Chi Square = 73.02 with prob < 5.4e-10
##
## The root mean square of the residuals (RMSA) is 0.1
## O2 O3 O4 O5 O6 O7 O8
## 0.8860009 0.7906704 0.7907994 0.5663786 0.4074405 0.5578915 0.7229660
pcaO <- prcomp(Os, scale = TRUE)
eigs <- pcaO$sdev^2
varO <- eigs[1] / sum(eigs)
paste("Variance rate=",varO)
## [1] "Variance rate= 0.560817123312952"
The KMO value is greater than 0.5, the communality of every indicator is greater than 0.25, and the variance rate is 0.56, i.e. at least 50% of the original information is retained after the model reduction. However, two important indicators (O1 and O9) were dropped from the model. The O1 indicator is the number of articles published by higher education institutions, while O9 is the proportion of unemployed graduates. These indicators are substantively important and should be preserved after the model reduction.
The second option, which preserves more indicators, is to use 2 factors. The results for 2 factors are reported below.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA = 0.73
## MSA for each item =
## O1 O2 O3 O4 O5 O6 O7 O8 O9
## 0.40 0.73 0.77 0.79 0.63 0.83 0.74 0.85 0.64
##
## Factor analysis with Call: principal(r = Os, nfactors = 2)
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 19 and the objective function was 1.87
## The number of observations was 50 with Chi Square = 81.98 with prob < 8.4e-10
##
## The root mean square of the residuals (RMSA) is 0.11
## O1 O2 O3 O4 O5 O6 O7 O8
## 0.8438830 0.9223750 0.7955037 0.8210897 0.8665949 0.4008533 0.5506741 0.7255365
## O9
## 0.3083455
##
## Loadings:
## RC1 RC2
## O1 0.138 0.908
## O2 0.959
## O3 0.886 0.104
## O4 0.905
## O5 0.670 0.646
## O6 0.629
## O7 0.732 0.124
## O8 0.850
## O9 0.547
##
## RC1 RC2
## SS loadings 4.656 1.579
## Proportion Var 0.517 0.175
## Cumulative Var 0.517 0.693
In the case of two (principal) factors, all communality values are greater than 0.25; however, the loadings of O5 on factor 1 and on factor 2 are very similar (|cor(RC1,O5)| < 2|cor(RC2,O5)| or |cor(RC1,O5)| < |cor(RC2,O5)| + 0.25). Therefore O5 cannot be assigned to either factor. In this case we say that O5 is a common indicator, which has to be excluded from the model.
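A sketch of this common-indicator rule applied to the loadings above (the helper function is ours, not part of the source):
# TRUE if an indicator with loadings l1 >= l2 cannot be assigned to a single factor
is.common <- function(l1, l2) abs(l1) < 2*abs(l2) | abs(l1) < abs(l2) + 0.25
is.common(0.670, 0.646) # O5 loads 0.670 on RC1 and 0.646 on RC2 -> TRUE, common indicator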
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(Os))
## Overall MSA = 0.78
## MSA for each item =
## O1 O2 O3 O4 O6 O7 O8 O9
## 0.39 0.75 0.83 0.78 0.79 0.80 0.84 0.55
##
## Factor analysis with Call: principal(r = Os, nfactors = 2)
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 13 and the objective function was 1.21
## The number of observations was 50 with Chi Square = 53.62 with prob < 7.1e-07
##
## The root mean square of the residuals (RMSA) is 0.1
## O1 O2 O3 O4 O6 O7 O8 O9
## 0.3429981 0.9340280 0.7946173 0.8450090 0.5310990 0.5954249 0.7515102 0.7198578
##
## Loadings:
## RC1 RC2
## O1 0.131 0.571
## O2 0.966
## O3 0.891
## O4 0.911 -0.123
## O6 0.615 0.390
## O7 0.716 0.288
## O8 0.848 0.180
## O9 -0.111 0.841
##
## RC1 RC2
## SS loadings 4.195 1.319
## Proportion Var 0.524 0.165
## Cumulative Var 0.524 0.689
Both the KMO value (0.78) and the variance rate (0.689) are greater than 0.5, all communality values are greater than 0.25, and there is no common indicator.
The first factor contains the following indicators:
The second factor contains the following indicators:
Thus, compared to the specification of the principal component \(O\), both the \(O_1\) and \(O_9\) indicators were saved in the second factor.
Since we specified the principal components separately for the \(R\), \(E\), \(C\), and \(O\) indicator groups, the components may be correlated. Therefore, in the next step, the correlations between them have to be checked.
Figure 13 shows the results of the correlation analysis: there are strong correlations among the principal components \(R\), \(O\), and \(C\).
factors<-cbind(fitR$scores,fitE$scores,fitC$scores,fitO$scores)
colnames(factors) <- c("R","E","C","O")
cor(factors)
## R E C O
## R 1.0000000 0.1659944 0.7354010 0.8207528
## E 0.1659944 1.0000000 0.2930795 0.2203930
## C 0.7354010 0.2930795 1.0000000 0.8482095
## O 0.8207528 0.2203930 0.8482095 1.0000000
Correlation analysis is also performed among the PCs \(R\), \(E\), and \(C\) and the two specified factors \(O_{fact_{1}}\) and \(O_{fact_{2}}\). The result of the correlation analysis is shown in Figure 14.
factors<-cbind(fitR$scores,fitE$scores,fitC$scores,fitO2$scores)
colnames(factors) <- c("R","E","C","O_fact1","O_fact2")
cor(factors)
## R E C O_fact1 O_fact2
## R 1.0000000 0.1659944 0.7354010 8.350790e-01 -5.855570e-02
## E 0.1659944 1.0000000 0.2930795 2.054947e-01 1.568645e-01
## C 0.7354010 0.2930795 1.0000000 8.631607e-01 -1.107934e-01
## O_fact1 0.8350790 0.2054947 0.8631607 1.000000e+00 1.713747e-16
## O_fact2 -0.0585557 0.1568645 -0.1107934 1.713747e-16 1.000000e+00
Based on Figure 14, the latent variables R, O_fact1, and C are strongly correlated with each other. The correlation analysis thus specifies three groups of latent variables: (1) \(R\), \(C\), \(O_{fact_{1}}\); (2) \(O_{fact_{2}}\); (3) \(E\).
After the principal component analysis, factor analysis (FA) is performed on the indicators. As a first step, we examine the applicability of factor analysis, aided by scree plots and the KMO value.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21_filtered))
## Overall MSA = 0.73
## MSA for each item =
## R1 R2 R3 R4 R5 E1 E2 E3 E4 C1 C2 C3 C4 C5 C6 O1
## 0.70 0.51 0.82 0.75 0.79 0.56 0.35 0.69 0.82 0.76 0.63 0.37 0.83 0.83 0.76 0.36
## O2 O3 O4 O5 O6 O7 O8 O9
## 0.80 0.80 0.79 0.75 0.73 0.66 0.74 0.30
Figure 15 suggests specifying 3 independent groups of variables (see the same result in Figure 12).
U21s<-U21_filtered[,c("R1","R2","R3","R4","R5","E1","E2", "E3", "E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")]
fit <- principal(U21s,nfactors=3,rotate = "varimax")
summary(fit) # print variance accounted for
##
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 207 and the objective function was 11.54
## The number of observations was 50 with Chi Square = 440.6 with prob < 5.8e-19
##
## The root mean square of the residuals (RMSA) is 0.09
## R1 R2 R3 R4 R5 E1 E2 E3
## 0.5654944 0.5743389 0.8261170 0.7944043 0.8819861 0.3833381 0.1206436 0.5076963
## E4 C1 C2 C3 C4 C5 C6 O1
## 0.5389281 0.3994398 0.7234222 0.2496209 0.5761145 0.6911836 0.6506483 0.5851133
## O2 O3 O4 O5 O6 O7 O8 O9
## 0.9083897 0.8677446 0.8564676 0.7466957 0.3476795 0.5736386 0.7121123 0.3973431
##
## Loadings:
## RC1 RC2 RC3
## R1 0.397 -0.639
## R2 0.439 -0.239 -0.570
## R3 0.906
## R4 0.849 -0.265
## R5 0.914 -0.211
## E1 0.158 0.191 0.567
## E2 0.147 0.302
## E3 0.337 0.593 0.204
## E4 0.516 0.512 0.106
## C1 0.576 0.255
## C2 0.541 -0.159 0.637
## C3 0.154 -0.111 0.462
## C4 0.669 0.311 0.178
## C5 0.815 -0.153
## C6 0.787 0.163
## O1 0.227 0.495 -0.537
## O2 0.942 0.138
## O3 0.920 0.140
## O4 0.908 0.175
## O5 0.731 0.305 -0.345
## O6 0.526 0.208 -0.165
## O7 0.644 0.101 -0.386
## O8 0.836
## O9 0.626
##
## RC1 RC2 RC3
## SS loadings 10.059 2.231 2.188
## Proportion Var 0.419 0.093 0.091
## Cumulative Var 0.419 0.512 0.603
Figure 16 shows the properties of the three factors. As can be seen from the above results, there are indicators with low communality (e.g. \(C_3\)) as well as common indicators (e.g. \(R_2\)). To get a comprehensive picture of the relative positions of the factors and of the indicators belonging to each factor, it is worth plotting the initial state of the factors (\(RC_1\), \(RC_2\), \(RC_3\)) in 3D; this is shown in Figure 17. By rotating the figure, you can see how the indicators are grouped and which factor each of them belongs to. Consistent with the above results, the first factor includes, for example, \(R_3\), \(R_4\) and \(R_5\).
options(save)
plot3d(fit$loadings) # plot the factor loadings in 3D
text3d(fit$loadings,texts = as.vector(rownames(fit$loadings))) # label each indicator
# draw a segment from the origin to each indicator's loading vector
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
rglwidget() # embed the interactive 3D widget
After excluding the common indicators and those with low communality values, the result is shown in Figure 18.
columns<-c("R3","R4","R5","E1","E2","C1","C5","C6","O1","O4","O8","O9")
U21s<-U21_filtered[,columns]
KMO(cor(U21s))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(U21s))
## Overall MSA = 0.81
## MSA for each item =
## R3 R4 R5 E1 E2 C1 C5 C6 O1 O4 O8 O9
## 0.79 0.83 0.80 0.40 0.47 0.93 0.85 0.83 0.41 0.85 0.87 0.49
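The three-factor model is then refit on the reduced indicator set (the call is confirmed by the output below, and fit is reused in the later figures):
fit <- principal(U21s, nfactors = 3, rotate = "varimax")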
##
## Factor analysis with Call: principal(r = U21s, nfactors = 3, rotate = "varimax")
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 33 and the objective function was 2.21
## The number of observations was 50 with Chi Square = 93.17 with prob < 1.2e-07
##
## The root mean square of the residuals (RMSA) is 0.08
## R3 R4 R5 E1 E2 C1 C5 C6
## 0.8262154 0.8290226 0.9132030 0.6894734 0.5995707 0.4921809 0.6855794 0.6979789
## O1 O4 O8 O9
## 0.6054390 0.8427117 0.7250600 0.6423689
##
## Loadings:
## RC1 RC2 RC3
## R3 0.888 0.190
## R4 0.905
## R5 0.954
## E1 0.134 0.815
## E2 -0.126 0.764
## C1 0.616 0.202 -0.269
## C5 0.824
## C6 0.796 -0.161 0.197
## O1 0.148 -0.114 0.755
## O4 0.906 0.136
## O8 0.829 0.169
## O9 -0.108 0.296 0.737
##
## RC1 RC2 RC3
## SS loadings 5.787 1.450 1.312
## Proportion Var 0.482 0.121 0.109
## Cumulative Var 0.482 0.603 0.712
Figure 19 shows the final factor structure in 3D.
plot3d(fit$loadings)
text3d(fit$loadings,texts = as.vector(rownames(fit$loadings)))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
rglwidget()
The 3D figure above illustrates the three distinct groups of indicators.
Figure 20 shows the structural equation model, which graphically shows which indicators belong to which factor. In the figure, the thickness of the arrows illustrates the extent to which each indicator contributes to a given factor.
The list of indicators that belong to factor 1:
The list of indicators that belong to factor 2:
The list of indicators that belong to factor 3:
The first factor (\(RC_1\)) contains resource- (expenditure-), connectivity- and output-type indicators alike. The second factor (\(RC_2\)) shows the role of women in higher education, and the third factor (\(RC_3\)) contains output indicators.
model='RC1=~R3+R4+R5+C1+C5+C6+O4+O8
RC2=~E1+E2
RC3=~O1+O9
RC1~~0*RC2 + 0*RC3
RC2~~0*RC3'
fitsem <- cfa(model,U21_filtered,std.lv = TRUE)
semPaths(fitsem,residuals=FALSE,"std",edge.color = "black",color="yellow")
Summary of the evaluation of the dimension reduction: if the goal is to specify independent groups of variables, then 3 groups of variables can be defined with both principal component analysis and factor analysis. Of these, two "pure" groups of variables contain only 2 indicators each, while the third, mixed group is formed from R, C, and O indicators. A serious problem is that only 12 of the 24 indicators are preserved.
Based on the preliminary clustering study, 3-4 clusters were probable for countries (rows). Thus, as a first step, we specify three clusters.
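The statistics below are presumably produced by k-means clustering on the factor scores, evaluated with cluster.stats(); a sketch (the seed is our assumption; km3 is reused in the later figures and in Table 1):
set.seed(42) # assumed seed for reproducibility
km3 <- kmeans(fit$scores, centers = 3)$cluster # 3-cluster memberships
cluster.stats(dist(fit$scores), km3)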
## $n
## [1] 50
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 18 20 12
##
## $min.cluster.size
## [1] 12
##
## $noisen
## [1] 0
##
## $diameter
## [1] 2.180009 5.704993 4.494681
##
## $average.distance
## [1] 1.146773 1.610924 1.901035
##
## $median.distance
## [1] 1.108597 1.356117 1.479310
##
## $separation
## [1] 0.7804456 0.7390499 0.7390499
##
## $average.toother
## [1] 2.479894 2.470991 2.636661
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.0000000 0.7804456 0.8288624
## [2,] 0.7804456 0.0000000 0.7390499
## [3,] 0.8288624 0.7390499 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 2.373189 2.657735
## [2,] 2.373189 0.000000 2.617694
## [3,] 2.657735 2.617694 0.000000
##
## $average.between
## [1] 2.520423
##
## $average.within
## [1] 1.513456
##
## $n.between
## [1] 816
##
## $n.within
## [1] 409
##
## $max.diameter
## [1] 5.704993
##
## $min.separation
## [1] 0.7390499
##
## $within.cluster.ss
## [1] 77.04899
##
## $clus.avg.silwidths
## 1 2 3
## 0.4922942 0.3117005 0.2117167
##
## $avg.silwidth
## [1] 0.3527181
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.4333483
##
## $dunn
## [1] 0.1295444
##
## $dunn2
## [1] 1.248367
##
## $entropy
## [1] 1.076819
##
## $wb.ratio
## [1] 0.6004771
##
## $ch
## [1] 21.33511
##
## $cwidegap
## [1] 0.9296514 2.7278667 2.2474481
##
## $widestgap
## [1] 2.727867
##
## $sindex
## [1] 0.764972
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Figure 21 shows the results of the clustering based on the 3 factors: 18 countries belong to the first, 20 to the second, and 12 to the third cluster.
plot3d(fit$loadings)
text3d(fit$loadings,texts = rownames(fit$loadings))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
text3d(fit$scores,texts = rownames(U21_filtered),col=km3)
rglwidget()
Figure 22 shows which countries belong to which clusters; the countries separate along the three factors.
data(wrld_simpl)
countries_km1 = wrld_simpl@data$NAME %in% c("Australia","Austria","Belgium","Canada","Denmark","Finland","France","Hong Kong","Ireland","Israel","Netherlands","New Zealand","Norway","Portugal","Singapore","Sweden","Switzerland","United Kingdom")
countries_km2 = wrld_simpl@data$NAME %in% c("Argentina","Brazil","Bulgaria","China", "Croatia", "Czech Republic", "Germany", "Hungary","Malaysia","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","South Africa","Spain","Thailand","Ukraine","United States")
countries_km3 = wrld_simpl@data$NAME %in% c("Chile", "Greece", "India", "Indonesia", "Iran", "Italy", "Japan", "Mexico", "Saudi Arabia", "Taiwan", "Turkey", "Korea, Republic of")
countries_km2 = countries_km2 *2
countries_km3 = countries_km3 *3
countries_map = countries_km1+countries_km2+countries_km3+1
plot(wrld_simpl, col = c(gray(.80), "red", "blue", "green")[countries_map])
The next step is to specify four clusters on the 3 factors.
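A presumed sketch for the four-cluster solution, analogous to km3 (km4 is reused below):
km4 <- kmeans(fit$scores, centers = 4)$cluster # 4-cluster memberships
cluster.stats(dist(fit$scores), km4)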
## $n
## [1] 50
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 22 4 17 7
##
## $min.cluster.size
## [1] 4
##
## $noisen
## [1] 0
##
## $diameter
## [1] 2.581561 3.503032 2.180009 3.714822
##
## $average.distance
## [1] 1.228931 2.257800 1.134808 1.939406
##
## $median.distance
## [1] 1.237384 2.276079 1.096520 1.818473
##
## $separation
## [1] 0.3953279 0.6144058 0.3953279 0.7669507
##
## $average.toother
## [1] 2.389488 3.094787 2.484270 3.014312
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.6144058 0.3953279 0.7669507
## [2,] 0.6144058 0.0000000 1.1363751 2.1025880
## [3,] 0.3953279 1.1363751 0.0000000 1.3234252
## [4,] 0.7669507 2.1025880 1.3234252 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 2.807635 2.142717 2.749849
## [2,] 2.807635 0.000000 3.167920 3.819656
## [3,] 2.142717 3.167920 0.000000 3.167064
## [4,] 2.749849 3.819656 3.167064 0.000000
##
## $average.between
## [1] 2.612725
##
## $average.within
## [1] 1.378705
##
## $n.between
## [1] 831
##
## $n.within
## [1] 394
##
## $max.diameter
## [1] 3.714822
##
## $min.separation
## [1] 0.3953279
##
## $within.cluster.ss
## [1] 53.58231
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.3877404 0.1005759 0.4489660 0.2472150
##
## $avg.silwidth
## [1] 0.3659104
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5644199
##
## $dunn
## [1] 0.1064191
##
## $dunn2
## [1] 0.9490284
##
## $entropy
## [1] 1.205341
##
## $wb.ratio
## [1] 0.5276886
##
## $ch
## [1] 26.73279
##
## $cwidegap
## [1] 0.8288624 2.7278667 0.9296514 2.2474481
##
## $widestgap
## [1] 2.727867
##
## $sindex
## [1] 0.5028333
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Figure 23 shows that the specified four clusters are well separated. There are 22 countries in the first cluster, 4 in the second, 17 in the third and 7 in the fourth.
plot3d(fit$loadings)
text3d(fit$loadings,texts = rownames(fit$loadings))
coords <- NULL
for (i in 1:nrow(fit$loadings)) {
coords <- rbind(coords, rbind(c(0,0,0),fit$loadings[i,]))
}
lines3d(coords, col="red", lwd=4)
text3d(fit$scores,texts = rownames(U21_filtered),col=km4)
rglwidget()
Figure 24 shows the map of clustered countries.
library(maptools)
data(wrld_simpl)
countries_km1 = wrld_simpl@data$NAME %in% c("Argentina","Brazil","Bulgaria","Chile","Croatia","Czech Republic","Greece", "Italy","Malaysia","Mexico","Poland","Portugal","Romania","Russia","Saudi Arabia","Serbia","Slovakia","Slovenia","South Africa","Spain","Thailand","Ukraine")
countries_km2 = wrld_simpl@data$NAME %in% c("China","Germany","Hungary","United States")
countries_km3 = wrld_simpl@data$NAME %in% c("Australia","Austria","Belgium","Canada","Denmark","Finland","France","Hong Kong", "Ireland","Israel","Netherlands","New Zealand","Norway","Singapore","Sweden","Switzerland","United Kingdom")
countries_km4 = wrld_simpl@data$NAME %in% c("India","Indonesia","Iran","Japan","Taiwan","Turkey","Korea, Republic of")
countries_km2 = countries_km2 *2
countries_km3 = countries_km3 *3
countries_km4 = countries_km4 *4
countries_map = countries_km1+countries_km2+countries_km3+countries_km4+1
plot(wrld_simpl, col = c(gray(.80), "red", "blue", "green", "orange")[countries_map])
When specifying the partial rankings (within clusters of countries), we should either use the calculated factor weights or the original weights used by U21.
First, we specify the original weights used by U21.
weights=matrix(0,1,24)
colnames(weights)<-c("R1","R2","R3","R4","R5","E1","E2","E3","E4","C1","C2","C3","C4","C5","C6","O1","O2","O3","O4","O5","O6","O7","O8","O9")
weights[1,] <- c(0.05,0.05,0.05,0.025,0.025,0.02,0.02,0.02,0.14,0.04,0.04,0.02,0.02,0.04,0.04,0.13,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03)
rownames(weights)<-c("weights")
In the next step, the partial rankings are calculated (see Table 1). In the last two columns of Table 1, the variables km3 and km4 indicate the cluster memberships in the 3-cluster and 4-cluster solutions, respectively. U21_overall_Rank contains the rankings of the original U21 ranking, FA_Rank the ranking based on the factor scores, and U21_Rank_selected the ranking computed only from the indicators that correspond to the factors.
base<-as.data.frame(1:nrow(U21_rank))
colnames(base)<-paste("ID")
#base$countries<-rownames(U21_rank)
base$U21_overall_Rank<-U21_rank
base$FA_weights <- as.matrix(pnorm(fit$scores[,1])*pnorm(fit$scores[,2])*pnorm(fit$scores[,3]))
base$FA_Rank<-rank(-base$FA_weights)
base$Score_selected <- rowSums(U21_filtered[,columns]*weights[,columns])*100/max(rowSums(U21_filtered[,columns]*weights[,columns]))
base$U21_Rank_selected<-rank(-base$Score_selected)
base$km3 <-km3
base$km4 <-km4
rownames(base)<-rownames(U21_rank)
kable(base, caption="**Table 1** Partial rankings")
Country | ID | U21_overall_Rank | FA_weights | FA_Rank | Score_selected | U21_Rank_selected | km3 | km4 |
---|---|---|---|---|---|---|---|---|
Argentina | 1 | 41 | 3.716323e-02 | 39 | 53.20702 | 16 | 2 | 1 |
Australia | 2 | 9 | 1.226081e-01 | 20 | 46.85638 | 23 | 1 | 3 |
Austria | 3 | 12 | 1.205554e-01 | 22 | 77.76420 | 6 | 1 | 3 |
Belgium | 4 | 13 | 2.847575e-01 | 6 | 41.03496 | 27 | 1 | 3 |
Brazil | 5 | 38 | 1.626992e-01 | 14 | 41.07081 | 26 | 2 | 1 |
Bulgaria | 6 | 40 | 5.989465e-02 | 30 | 22.48808 | 47 | 2 | 1 |
Canada | 7 | 4 | 2.566109e-01 | 7 | 77.78374 | 5 | 1 | 3 |
Chile | 8 | 33 | 1.120821e-02 | 43 | 20.57509 | 48 | 3 | 1 |
China | 9 | 34 | 6.702021e-02 | 28 | 39.96442 | 29 | 2 | 2 |
Croatia | 10 | 44 | 6.752780e-02 | 27 | 26.79678 | 42 | 2 | 1 |
Czech Republic | 11 | 26 | 1.535241e-01 | 15 | 61.93087 | 9 | 2 | 1 |
Denmark | 12 | 3 | 1.276323e-01 | 17 | 56.33062 | 12 | 1 | 3 |
Finland | 13 | 5 | 4.592789e-01 | 2 | 95.87306 | 2 | 1 | 3 |
France | 14 | 18 | 2.193017e-01 | 9 | 40.09535 | 28 | 1 | 3 |
Germany | 15 | 14 | 4.278289e-01 | 3 | 73.57370 | 7 | 2 | 2 |
Greece | 16 | 32 | 3.890460e-02 | 38 | 28.52945 | 41 | 3 | 1 |
Hong Kong | 17 | 15 | 1.276280e-01 | 18 | 50.23970 | 21 | 1 | 3 |
Hungary | 18 | 29 | 1.664197e-01 | 12 | 31.49285 | 38 | 2 | 2 |
India | 19 | 50 | 1.537641e-03 | 47 | 37.16362 | 31 | 3 | 4 |
Indonesia | 20 | 48 | 1.284827e-03 | 48 | 17.93802 | 50 | 3 | 4 |
Iran | 21 | 49 | 4.994809e-03 | 45 | 35.34617 | 33 | 3 | 4 |
Ireland | 22 | 17 | 2.919132e-01 | 5 | 41.66170 | 25 | 1 | 3 |
Israel | 23 | 19 | 1.652540e-01 | 13 | 54.70782 | 13 | 1 | 3 |
Italy | 24 | 27 | 4.908183e-02 | 34 | 32.20203 | 36 | 3 | 1 |
Japan | 25 | 20 | 1.944635e-04 | 49 | 61.59416 | 10 | 3 | 4 |
Korea, Rep. (South) | 26 | 21 | 1.681845e-05 | 50 | 31.96663 | 37 | 3 | 4 |
Malaysia | 27 | 28 | 5.352892e-02 | 31 | 50.55125 | 20 | 2 | 1 |
Mexico | 28 | 46 | 9.292853e-03 | 44 | 20.50180 | 49 | 3 | 1 |
Netherlands | 29 | 7 | 1.669516e-01 | 11 | 68.99285 | 8 | 1 | 3 |
New Zealand | 30 | 16 | 7.233421e-02 | 26 | 39.72090 | 30 | 1 | 3 |
Norway | 31 | 11 | 2.270396e-01 | 8 | 78.31016 | 4 | 1 | 3 |
Poland | 32 | 31 | 8.456104e-02 | 24 | 25.39432 | 43 | 2 | 1 |
Portugal | 33 | 24 | 1.182719e-01 | 23 | 52.54867 | 18 | 1 | 1 |
Romania | 34 | 39 | 4.490066e-02 | 35 | 24.88180 | 44 | 2 | 1 |
Russia | 35 | 36 | 7.997149e-02 | 25 | 44.89472 | 24 | 2 | 1 |
Saudi Arabia | 36 | 30 | 2.028512e-02 | 41 | 32.57981 | 35 | 3 | 1 |
Serbia | 37 | 34 | 5.163543e-02 | 33 | 53.96639 | 14 | 2 | 1 |
Singapore | 38 | 10 | 5.246536e-02 | 32 | 48.01649 | 22 | 1 | 3 |
Slovakia | 39 | 37 | 1.219388e-01 | 21 | 50.87108 | 19 | 2 | 1 |
Slovenia | 40 | 25 | 1.228653e-01 | 19 | 30.92769 | 39 | 2 | 1 |
South Africa | 41 | 45 | 4.384428e-02 | 36 | 29.91428 | 40 | 2 | 1 |
Spain | 42 | 23 | 1.420620e-01 | 16 | 32.97279 | 34 | 2 | 1 |
Sweden | 43 | 2 | 2.060980e-01 | 10 | 100.00000 | 1 | 1 | 3 |
Switzerland | 44 | 6 | 6.600255e-02 | 29 | 52.70262 | 17 | 1 | 3 |
Taiwan | 45 | 22 | 3.518628e-02 | 40 | 58.63325 | 11 | 3 | 4 |
Thailand | 46 | 42 | 4.127924e-02 | 37 | 24.36351 | 45 | 2 | 1 |
Turkey | 47 | 47 | 3.550801e-03 | 46 | 35.56118 | 32 | 3 | 4 |
Ukraine | 48 | 43 | 1.884380e-02 | 42 | 23.44289 | 46 | 2 | 1 |
United Kingdom | 49 | 8 | 3.468806e-01 | 4 | 80.62078 | 3 | 1 | 3 |
United States | 50 | 1 | 4.693511e-01 | 1 | 53.27870 | 15 | 2 | 2 |
In the next step, the rank correlations between the rankings are calculated.
Figures 25-28 show rank correlations between rankings. Figure 25 shows rank correlations for all countries, while Figures 26-28 show rank correlations separately for the clusters 1-3.
If we examine all countries (see Figure 25), there is a close positive relationship between the rankings of the original U21 ranking (U21_overall_Rank) and the ranking we calculated, which contains only the indicators corresponding to each factor (U21_Rank_selected). If we examine only the rankings of the countries belonging to clusters 1 and 2, a similar result is obtained (see Figures 26-27).
par(mfrow=c(2,2))
RANKS<-c("U21_overall_Rank","FA_Rank","U21_Rank_selected")
corrgram(base[,RANKS],cor.method = "spearman",order = TRUE,main="For all countries",lower.panel=panel.cor)
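The per-cluster panels (Figures 26-28) are presumably produced by subsetting base on km3, e.g.:
for (g in 1:3) {
corrgram(base[km3==g,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km3==g),"countries from cluster",g),lower.panel=panel.cor)
}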
Rank correlations are also calculated for the four-cluster solution. For clusters 1, 2, and 3, there is a moderately strong relationship between the original U21 rankings and the calculated rankings.
par(mfrow=c(2,2))
RANKS<-c("U21_overall_Rank","FA_Rank","U21_Rank_selected")
corrgram(base[km4==1,RANKS],cor.method = "spearman",order = TRUE,main=paste("Partial rank correlations of",sum(km4==1)," countries from cluster 1"),lower.panel=panel.cor)
The results of the two-step clustering are summarized below.
MTX3<-as.data.frame(matrix(0,nrow(U21_filtered),ncol(U21_filtered)))
MTX4<-as.data.frame(matrix(0,nrow(U21_filtered),ncol(U21_filtered)))
rownames(MTX3)<-rownames(U21_filtered)
colnames(MTX3)<-colnames(U21_filtered)
rownames(MTX4)<-rownames(U21_filtered)
colnames(MTX4)<-colnames(U21_filtered)
MTX3[,columns]<-km3
MTX4[,columns]<-km4
cols<-c('0'="#FFFFFF",'1'="#99FF66",'2'="#66FF33",'3'="#33CC00",'4'="#009900")
SER3<-c(seriate(dist(as.matrix(MTX3))),seriate(dist(t(as.matrix(MTX3)))))
SER4<-c(seriate(dist(as.matrix(MTX4))),seriate(dist(t(as.matrix(MTX4)))))
ORDERED3<-MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)]
ORDORIG3<-U21_filtered[get_order(SER3,dim=1),get_order(SER3,dim=2)]
ORDERED4<-MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)]
ORDORIG4<-U21_filtered[get_order(SER4,dim=1),get_order(SER4,dim=2)]
rownames(ORDERED3)<-rownames(MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)])
rownames(ORDERED4)<-rownames(MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)])
rownames(ORDORIG3)<-rownames(MTX3[get_order(SER3,dim=1),get_order(SER3,dim=2)])
rownames(ORDORIG4)<-rownames(MTX4[get_order(SER4,dim=1),get_order(SER4,dim=2)])
par(mfrow=c(2,2))
Figure 33 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.
p <- plot_ly(x=colnames(ORDORIG3), y=rownames(ORDORIG3), z = as.matrix(ORDORIG3), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p
Figure 34 shows the countries belonging to the three clusters, as well as the indicators that belong to the three factors. Indicators that do not belong to any factor are displayed with a dark red cell. It can be seen that half (12) of the indicators were dropped during the analysis.
p <- plot_ly(x=colnames(ORDERED3), y=rownames(ORDERED3), z = as.matrix(ORDERED3), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p
p <- plot_ly(x=colnames(ORDORIG4), y=rownames(ORDORIG4), z = as.matrix(ORDORIG4), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p
Figure 35 shows the heat map of the original U21 data. The coloring of the cells depends on the values for each indicator. The darker red cells indicate smaller values, and the darker blue cells indicate higher values.
Figure 36 shows the four specified clusters and the indicators that belong to the three factors.
p <- plot_ly(x=colnames(ORDERED4), y=rownames(ORDERED4), z = as.matrix(ORDERED4), type = "heatmap",colorscale="RdBu",reversescale=TRUE)
p
Summary: the applicability of two-step clustering is questionable, first of all because of the small sample size. Furthermore, the method does not allow clusters to overlap, even though a country may perform well on some factors and poorly on others. As a result of the clustering, we also obtained very small clusters, which are difficult to analyze and interpret.
The additional files attached to the study (including this html) are free to use for scientific purposes. Please cite the original paper.
Van Mechelen, I., J. Hampton, R. S. Michalski, and P. Theuns (1993). Categories and Concepts - Theoretical Views and Inductive Data Analysis. Academic Press, London.