Feedback should be sent to goran.milovanovic_ext@wikimedia.de. These notebook(s) refer to Phab T285458: Generate inputs for 1st sensemaking session about ORES quality score distributions across the Wikidata classes.


0. Setup

NOTE. The usage of mean quality or mean ORES quality in this report is the following one:

  • mean ORES quality is always computed per Wikidata class,
  • by mapping the ORES scores A, B, C, D, and E onto 5, 4, 3, 2, and 1, respectively.

Example. If a class has 5 A items, 3 B items, 0 C items, 2 D items, and 10 E items, its mean quality score is

5 * 5 + 3 * 4 + 0 * 3 + 2 * 2 + 10 * 1 = 51

divided by the number of items in that class, which happens to be

5 + 3 + 0 + 2 + 10 = 20.

In this example, the mean quality score for the given Wikidata class would be: 2.55.
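
The arithmetic above can be reproduced in a few lines of R. This is a minimal sketch: the function name mean_ores_quality is introduced here for illustration only and is not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): mean ORES quality of one class.
# `counts` holds the number of A, B, C, D, and E items, in that order.
mean_ores_quality <- function(counts) {
  weights <- 5:1  # A = 5, B = 4, C = 3, D = 2, E = 1
  sum(counts * weights) / sum(counts)
}

example_class <- c(A = 5, B = 3, C = 0, D = 2, E = 10)
mean_ores_quality(example_class)
# 51 / 20 = 2.55
```

Note that a class with no items at all would give 0 / 0, which is NaN in R; this is why empty classes are worth checking for before the mean is computed.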

### --- Setup
library(mclust)
library(data.table)
library(tidyverse)
library(snowfall)
library(ggrepel)
source('WD_ORES_ClassClustering_Functions.R')
### --- dirTree
dataDir <- paste0(getwd(), "/_data/")
reportingDir <- paste0(getwd(), "/_reporting/")

1. The Fundamental Dataset

## --- Load
# - load
rawData <- data.table::fread(paste0(dataDir, "classQualityContingency.csv"), 
                             header = T)
rownames(rawData) <- rawData$V1
rawData$V1 <- NULL
# - check if there are empty classes
w <- which(rowSums(rawData) == 0)
if (length(w) > 0) {
  print(length(w))
}

The fundamental dataset in this analysis comprises 472034 Wikidata classes and the counts of ORES A, B, C, D, and E scored items in each of them. Here are the first rows from the fundamental dataset, each row representing a Wikidata class:

head(rawData)

2. EDA: The distribution of ORES scores across Wikidata classes

2.1 The distribution of Wikidata class size

The first problem that we arrive at is the distribution of the class size in Wikidata: the number of items present in each class. Let’s take a look at the distribution of class size in Wikidata:

### --- data
dataSet <- rawData
dataSet$numItems <- rowSums(dataSet)
dataSet <- dataSet %>% 
  dplyr::arrange(desc(numItems))
summary(dataSet$numItems)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       1        1        2      197        4 37215412 

The median of 2 shows that 50% of Wikidata classes have two items or fewer. The 3rd quartile is found at 4, which means that at most 25% of classes have more than four items. The largest class has 37,215,412 items, and as we shall see that is the Wikidata class of scholarly article (Q13442814).
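
The reading of the median and the 3rd quartile as shares of classes can be checked directly. The sketch below uses a small made-up vector of class sizes (num_items is toy data, not the real dataset) and is only meant to illustrate the interpretation.

```r
# Toy class sizes (made up for illustration; not the real Wikidata counts).
num_items <- c(1, 1, 1, 2, 2, 3, 4, 150)

med <- median(num_items)
q3 <- quantile(num_items, 0.75, names = FALSE)

# Share of classes at or below the median: at least one half by definition.
share_le_median <- mean(num_items <= med)

# Share of classes strictly above the 3rd quartile: at most one quarter.
share_gt_q3 <- mean(num_items > q3)
```

A single very large class, like the value 150 here, pulls the mean far above the median, which is the same pattern the summary of the real data shows.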

It is questionable, thus, whether it makes sense to assess the distribution of ORES quality scores across all Wikidata classes at once. Let’s take a closer look at the top 100 most populated Wikidata classes:

dataSet$class <- rownames(dataSet)
labels <- wmdedata_api_fetch_labels(items = dataSet$class[1:100], 
                                    language = "en",
                                    fallback = T)
top100WDClasses <- dataSet[1:100, c('class', 'numItems')]
top100WDClasses <- dplyr::left_join(top100WDClasses, 
                                    labels,
                                    by = c("class" = "item"))
top100WDClasses

We already know that Astronomical Object (Q6999 + its subclasses) and Scholarly Article (Q13442814 + its subclasses) account for a huge amount of Wikidata items. Let’s single out those two classes and take a look at the distribution of ORES scores inside them.

2.2 ORES quality in Astronomical Object (Q6999 + its subclasses)

Here’s the dataset for Astronomical Object (Q6999 + its subclasses) only:

astronomicalObjects = c(
    'Q523', 'Q318', 'Q1931185', 'Q1457376', 'Q2247863', 'Q3863', 'Q83373',
    'Q2154519', 'Q726242', 'Q1153690', 'Q204107', 'Q71963409', 'Q67206691',
    'Q1151284', 'Q67206701', 'Q66619666', 'Q72802727', 'Q2168098', 'Q6243',
    'Q72802508', 'Q11282', 'Q72803170', 'Q1332364', 'Q72802977', 'Q6999',
    'Q1491746', 'Q272447', 'Q497654', 'Q204194', 'Q130019', 'Q744691',
    'Q71798532', 'Q46587', 'Q11276', 'Q71965429', 'Q5871', 'Q72803622',
    'Q72803426', 'Q3937', 'Q72803708', 'Q168845', 'Q24452', 'Q67201574',
    'Q2557101', 'Q691269', 'Q13632', 'Q10451997', 'Q28738741', 'Q22247', 'Q6999'
  )
astronomySet <- dataSet %>% 
  dplyr::filter(class %in% astronomicalObjects)
head(astronomySet)

It actually comprises 11.87% of Wikidata. Let’s take a look at the distribution of the mean per class ORES score:
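
The notebook does not show how this percentage was obtained. One plausible reading, sketched below on toy data, is the number of items in the selected classes divided by the number of items in all classes. Both the toy data frame and this definition are assumptions made for illustration.

```r
# Toy data standing in for `dataSet` (values are made up).
toy <- data.frame(
  class = c("Q6999", "Q523", "Q5", "Q13442814"),
  numItems = c(10, 30, 20, 140)
)
selected <- c("Q6999", "Q523")

# Assumed definition: items in the selected classes / items in all classes.
share_pct <- 100 * sum(toy$numItems[toy$class %in% selected]) / sum(toy$numItems)
round(share_pct, 2)
```

Under this definition, an item that is counted in more than one class contributes once per class, so the shares of different class groups need not add up to 100%.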

astronomySet$meanQuality <- apply(astronomySet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = astronomySet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkred", color = "darkred", size = .25) + 
  ggtitle("Astronomical Object (Q6999 + its subclasses)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))

summary(astronomySet$meanQuality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.093   2.495   2.504   2.839   3.213 
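
The row-wise apply() used to compute meanQuality is repeated for every subset in this report. On a dataset with several hundred thousand classes, a vectorised form may be faster. The sketch below compares the two on toy counts; it is an illustrative alternative, not the code that produced the results shown here.

```r
# Toy contingency table: one row per class, one column per ORES score.
toy <- data.frame(A = c(5, 0), B = c(3, 1), C = c(0, 1), D = c(2, 0), E = c(10, 2))
counts <- as.matrix(toy[, c("A", "B", "C", "D", "E")])

# Row-wise version, following the pattern used in the notebook.
row_wise <- apply(counts, 1, function(x) sum(x * 5:1) / sum(x))

# Vectorised version: weighted row sums (matrix product) divided by row totals.
vectorised <- as.vector(counts %*% 5:1) / rowSums(counts)
```

Both versions give the same values; the matrix product simply computes all the weighted sums in one step.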

2.4 ORES quality in Scholarly Article (Q13442814 + its subclasses)

Here’s the dataset for Scholarly Article (Q13442814 + its subclasses) only:

scientificPapers = c('Q7318358', 'Q2782326', 'Q18918145',
                     'Q1504425', 'Q7316896', 'Q92998777',
                     'Q10885494', 'Q15706459', 'Q58901470',
                     'Q59458414', 'Q56478376', 'Q12183006',
                     'Q82969330', 'Q58900805', 'Q13442814')
articleSet <- dataSet %>% 
  dplyr::filter(class %in% scientificPapers)
head(articleSet)

It actually comprises 42.49% of Wikidata. Let’s take a look at the distribution of the mean per class ORES score:

articleSet$meanQuality <- apply(articleSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = articleSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkblue", color = "darkblue", size = .25) + 
  ggtitle("Scholarly Article (Q13442814 + its subclasses)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))

summary(articleSet$meanQuality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.292   2.819   2.726   3.188   3.500 

2.5 ORES quality in Human (Q5)

Here’s the dataset for Human (Q5) only:

human = 'Q5'
humanSet <- dataSet %>% 
  dplyr::filter(class %in% human)
humanSet$meanQuality <- apply(humanSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
head(humanSet)

It actually comprises 7.51% of Wikidata.

2.6 The distribution of ORES scores in the remaining Wikidata classes

The two large Wikidata classes - Astronomical Object (Q6999) and Scholarly Article (Q13442814), and their subclasses as well - will be removed from all following analyses.

Let’s take a look at the distribution of ORES scores in the remaining Wikidata classes:

wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers))
head(wikidataSet)
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))

summary(wikidataSet$meanQuality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   2.251   2.955   5.000 

Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly bimodal, with a mean at 2.2508731. Let’s find out which are the largest Wikidata classes found in the following ranges of mean ORES scores:

  • 1.5 - 2.5
  • 2.5 - 3.5
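
Selecting the largest classes within a range of mean quality amounts to a filter, a sort by class size, and a cut-off. The sketch below shows this on toy data using base R; the helper largest_in_band and the toy values are hypothetical, and the ranges are treated as half-open so that a class with a mean of exactly 2.5 falls into the upper range only.

```r
# Toy data standing in for `wikidataSet` (values are made up).
toy <- data.frame(
  class = c("Q1", "Q2", "Q3", "Q4", "Q5"),
  numItems = c(500, 40, 3000, 12, 900),
  meanQuality = c(2.1, 2.4, 1.8, 3.0, 2.6)
)

# Hypothetical helper: the n largest classes with lower <= meanQuality < upper.
largest_in_band <- function(df, lower, upper, n = 100) {
  in_band <- df[df$meanQuality >= lower & df$meanQuality < upper, ]
  in_band <- in_band[order(-in_band$numItems), ]  # largest classes first
  head(in_band, n)
}

lower_band <- largest_in_band(toy, 1.5, 2.5, n = 2)
upper_band <- largest_in_band(toy, 2.5, 3.5, n = 2)
```

Ranking explicitly by numItems matters here: the goal is the most populated classes in each range, not the classes with the highest mean quality.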

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  # - keep the 100 largest classes; top_n(100) without wt ranks by the last column (meanQuality)
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses

Neither the lower nor the higher score classes isolated in the analysis above seem prima facie semantically coherent.

2.7 The distribution of ORES scores in the remaining Wikidata classes: >= 1000 items

Let’s take a look at the distribution of ORES scores in the remaining Wikidata classes, this time considering only classes with 1,000 or more items:

wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers)) %>% 
  dplyr::filter(numItems >= 1000)
head(wikidataSet)
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)\nOnly classes w. >= 1000 items") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))

summary(wikidataSet$meanQuality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.038   2.209   2.448   3.999 

Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly bimodal, with a mean at 2.209 (see the summary above). Let’s find out which are the largest Wikidata classes found in the following ranges of mean ORES scores:

  • 1.5 - 2.5
  • 2.5 - 3.5

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  # - keep the 100 largest classes; top_n(100) without wt ranks by the last column (meanQuality)
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses

2.8 The distribution of ORES scores in the remaining Wikidata classes: < 1000 items

Let’s take a look at the distribution of ORES scores in the remaining Wikidata classes, this time considering only classes with fewer than 1,000 items:

wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers)) %>% 
  dplyr::filter(numItems < 1000)
head(wikidataSet)
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)\nOnly classes w. < 1000 items") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))

summary(wikidataSet$meanQuality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   2.251   2.957   5.000 

Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly bimodal, with a mean at 2.2508731. Let’s find out which are the largest Wikidata classes found in the following ranges of mean ORES scores:

  • 1.5 - 2.5
  • 2.5 - 3.5

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  # - keep the 100 largest classes; top_n(100) without wt ranks by the last column (meanQuality)
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses

3. Clustering the ORES scores across the Wikidata classes

We will attempt to better understand the distribution of ORES scores across Wikidata by clustering the dataset that remains after removing Astronomical Object (Q6999) and Scholarly Article (Q13442814), together with their subclasses.

clustSet <- wikidataSet %>% 
  dplyr::select(A, B, C, D, E)
### --- Cluster with {mclust}
snowfall::sfInit(parallel = T, cpus = 23)
R Version:  R version 4.0.3 (2020-10-10) 
snowfall::sfExport('clustSet')
snowfall::sfLibrary(mclust)
Library mclust loaded.
t1 <- Sys.time()
models <- snowfall::sfClusterApplyLB(1:30, function(x) {
  mclust.options(hcUse = "STD")
  return(
    Mclust(clustSet, 
           G = 2:30, 
           initialization = list())
  )
})
t2 <- Sys.time()
print(
  paste0("mclust took: ", 
         as.character(difftime(t2, t1, units = "mins"))
  )
)
[1] "mclust took: 24.2635710159938"
snowfall::sfStop()

### --- Post-hoc model selection
maxBIC <- sapply(models, function(x) {
  max(apply(x$BIC, 2, max, na.rm = T))
})
no non-missing arguments to max; returning -Inf
(the same warning is repeated many times in the output, once for every BIC column that contains only missing values)
selectedModel <- which.max(maxBIC)
wd_classes_optimal <- models[[selectedModel]]
rm(models)
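
The post-hoc selection step keeps the run whose best BIC is the highest (in mclust, larger BIC values indicate a better model). The sketch below reproduces that logic on two made-up BIC tables; the matrices and variable names are toy stand-ins for the BIC element of each fitted model. Taking the maximum over the whole table with na.rm = TRUE gives the same best value as taking column maxima first, but does not warn when a column is entirely NA.

```r
# Toy BIC tables: rows = number of clusters (G), columns = covariance models.
# NA marks a model that could not be fitted for that G.
bic_run1 <- matrix(c(-120, -110, NA, NA), nrow = 2,
                   dimnames = list(c("2", "3"), c("EII", "VII")))
bic_run2 <- matrix(c(-130, -105, NA, -100), nrow = 2,
                   dimnames = list(c("2", "3"), c("EII", "VII")))
toy_runs <- list(bic_run1, bic_run2)

# Best BIC per run, taken over all non-missing cells of the table.
best_bic <- sapply(toy_runs, function(bic) max(bic, na.rm = TRUE))
selected_run <- which.max(best_bic)

# Locate the winning (G, model) cell in the selected run.
best_cell <- which(toy_runs[[selected_run]] == best_bic[selected_run], arr.ind = TRUE)
best_G <- rownames(toy_runs[[selected_run]])[best_cell[1, "row"]]
best_model <- colnames(toy_runs[[selected_run]])[best_cell[1, "col"]]
```

In this toy case the second run wins, with its best cell at G = 3 for the VII model.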

### --- Inspect Model
legend_args <- list(x = "bottomright", 
                    ncol = 5)
plot(wd_classes_optimal, 
     what = 'BIC', 
     legendArgs = legend_args)

We have actually run many cluster models and picked the best one that we found:

wd_classes_optimal$modelName
[1] "VII"
wd_classes_optimal$G
[1] 9
wd_classes_optimal$n
[1] 471970
wd_classes_optimal$d
[1] 5

The best clustering solution that we were able to produce encompasses 9 clusters (the value of wd_classes_optimal$G reported above).

3.1 Clusters: number of Wikidata classes per cluster, mean ORES quality per Cluster, and the average number of items in a class per cluster

### --- Join clusters to data points (WD classes)
cluster_assign <- as.data.frame(wd_classes_optimal$classification)
colnames(cluster_assign) <- "cluster"
cluster_assign$class <- rownames(cluster_assign)
clustSet$class <- rownames(clustSet)
clustSet <- dplyr::left_join(clustSet,
                             cluster_assign,
                             by = "class")
rm(cluster_assign)

### --- Inspect clusters
clustSet$classSize <- rowSums(clustSet[, c('A', 'B', 'C', 'D', 'E')])
clustSet$meanQuality <- apply(clustSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})


# - ORES cluster quality profiles
classProfiles <- clustSet %>% 
  dplyr::select(cluster, meanQuality, classSize, A, B, C, D, E) %>% 
  dplyr::group_by(cluster) %>% 
  dplyr::summarise(meanClusterQuality = mean(meanQuality),
                   meanClusterClassSize = mean(classSize), 
                   medianClusterClassSize = median(classSize), 
                   numClasses = n(),
                   A = sum(A), 
                   B = sum(B), 
                   C = sum(C), 
                   D = sum(D), 
                   E = sum(E)) %>% 
  arrange(desc(meanClusterClassSize))
classProfiles$totalItems <- classProfiles$A + 
  classProfiles$B + 
  classProfiles$C + 
  classProfiles$D + 
  classProfiles$E

# - Analytics
# - Plot A. Mean per Cluster Class Size vs Mean per Cluster Class Quality
ggplot(data = classProfiles, 
       aes(x = meanClusterClassSize, 
           y = meanClusterQuality, 
           size = numClasses, 
           label = cluster)) + 
  geom_point(color = "red", fill = "tomato") +
  geom_text_repel(size = 4) +
  ggtitle("Mean per Cluster WD Class Size vs Mean per Cluster Class Quality") + 
  scale_x_continuous(labels = scales::comma) + 
  scale_y_continuous(labels = scales::comma) +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(size = 10, hjust = .5))

In the chart above each bubble represents a cluster of Wikidata classes. The horizontal axis, labeled meanClusterClassSize, represents the mean size of a Wikidata class in a particular cluster: the size of a class is the number of items in it, and since each cluster encompasses a number of classes we compute the mean class size per cluster. The vertical axis represents the mean ORES quality: the average of the per-class mean ORES scores across all classes found in a particular cluster. The size of the bubble represents the number of classes found in each cluster. One cluster, labeled Cluster 6, encompasses a relatively low number of Wikidata classes, but those classes are very large, which is why it is positioned on the far right of the horizontal axis. All other clusters encompass Wikidata classes with a relatively low number of items, which is why they are positioned on the far left of the horizontal axis. In the lower left corner of the chart we find Cluster 2: a large bubble, meaning that it encompasses a large number of classes, with a relatively small average number of items per class and relatively low mean ORES scores per class.

3.2 Clusters: the distribution of ORES scores per cluster

# - Plot B. Average number of A, B, C, D, and E ranked items per class, per Cluster
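plotBFrame <- classProfiles %>% 
  dplyr::select(cluster, numClasses, A, B, C, D, E) %>% 
  # - classProfiles holds per-cluster sums of items: divide by the number
  # - of classes in the cluster to obtain the mean number of items per class
  dplyr::mutate(dplyr::across(c(A, B, C, D, E), ~ .x / numClasses)) %>% 
  dplyr::select(-numClasses) %>% 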
  pivot_longer(-cluster, 
               names_to = "ORES Rank", 
               values_to = "Num Items per ORES Rank"
  ) 
plotBFrame$cluster <- factor(plotBFrame$cluster)
ggplot(data = plotBFrame, 
       aes(x = `ORES Rank`, 
           y = `Num Items per ORES Rank`,
           group = cluster,
           color = cluster,
           fill = cluster)
) + 
  geom_point() + geom_path() + 
  facet_wrap(~cluster, scales = "free") + 
  scale_y_continuous(labels = scales::comma) +
  ylab("Mean(Num.items)\nper ORES Rank per WD Class") +
  ggtitle("Average number of A, B, C, D, and E ranked items per WD Class, per Cluster") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(legend.position = "none") + 
  theme(strip.background = element_blank()) + 
  theme(plot.title = element_text(size = 10, hjust = .5)) 

The horizontal axis in all charts represents the ORES score, while the vertical axis represents the mean number of items per class in a cluster that belong to a particular ORES score category (A, B, C, D, or E); each chart represents a cluster.

If we superimpose the quality score profiles:

# - superimposed: 
ggplot(data = plotBFrame, 
       aes(x = `ORES Rank`, 
           y = log(`Num Items per ORES Rank`),
           group = cluster,
           color = cluster,
           fill = cluster)
) + 
  geom_point() + geom_path() + 
  scale_y_continuous(labels = scales::comma) +
  ylab("log(Mean(Num.items))\nper ORES Rank per WD Class") +
  ggtitle("Average number of A, B, C, D, and E ranked items per WD Class, per Cluster") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(legend.position = "top") + 
  theme(plot.title = element_text(size = 12))

3.3 Clusters: Describing their content

We will query the Wikidata API to obtain the English labels for the 20 most populated Wikidata classes in each cluster.

### --- Describe clusters
describeClusters <- lapply(sort(unique(clustSet$cluster)), function(x) {
  d <- clustSet[clustSet$cluster == x, ]
  d <- d %>% 
    dplyr::arrange(desc(classSize))
  # - head() avoids NA padding when a cluster has fewer than 20 classes
  return(head(d$class, 20))
})
# - call wmdedata_api_fetch_labels()
describeClusters <- lapply(describeClusters, 
                           wmdedata_api_fetch_labels, 
                           language = "en", 
                           fallback = T)

However, just from looking at Wikidata class labels, it does not seem possible to provide a coherent interpretation of the clusters. It is interesting, however, to take a look at the composition of Cluster 6:

describeClusters[[6]]

This cluster (6) encompasses large Wikidata classes. It is quite possible that some of them could be excluded from further analyses (e.g. Wikimedia category, Wikimedia template, Wikimedia disambiguation page, etc). Also, some of them might need a separate treatment similar to Astronomical Objects and Scholarly Articles (e.g. Human, taxon, gene, protein, etc).

It is possible that a more comprehensive description of the clustering solution could be obtained from re-labeling the present Wikidata classes by their super-classes in the Wikidata hierarchy via P31/P279, and by looking at the distribution of the super-classes across the clusters. However, it would take a lot of computational resources to be able to describe the solution in that way.

4. Classes of highest and lowest mean ORES quality

We will take a look only at the Wikidata classes with an above-average number of items.

dataSet <- rawData
dataSet$classSize <- rowSums(dataSet)
dataSet$meanQuality <- apply(dataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
dataSet$Class <- rownames(dataSet)
dataSet <- dataSet %>% 
  dplyr::filter(classSize > mean(dataSet$classSize))
dataSet <- dataSet %>% 
  dplyr::arrange(desc(meanQuality))
topQualitySet <- head(dataSet, 100)
topQualitySetLabels <- wmdedata_api_fetch_labels(topQualitySet$Class, 
                                                 language = "en", 
                                                 fallback = T)
topQualitySetLabels <- dplyr::left_join(topQualitySetLabels,
                                        dplyr::select(topQualitySet, Class, classSize, meanQuality),
                                        by = c("item" = "Class"))

lowQualitySet <- tail(dataSet, 100)
lowQualitySetLabels <- wmdedata_api_fetch_labels(lowQualitySet$Class, 
                                                 language = "en", 
                                                 fallback = T)
lowQualitySetLabels <- dplyr::left_join(lowQualitySetLabels,
                                        dplyr::select(lowQualitySet, Class, classSize, meanQuality),
                                        by = c("item" = "Class"))

The 100 top quality Wikidata classes are:

topQualitySetLabels

The 100 lowest quality Wikidata classes are:

lowQualitySetLabels

We can see that many Wikimedia* classes are found among the classes of lowest mean ORES scores in the dataset.


License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.


---
title: The Distribution of ORES Quality Scores across Wikidata Classes
author:
- name: Goran S. Milovanović
  affiliation: Wikimedia Deutschland, Data Scientist
date: "`r format(Sys.time(), '%d %B %Y')`"
abstract: 
output:
  html_notebook:
    code_folding: hide
    theme: spacelab
    toc: yes
    toc_float: yes
    toc_depth: 5
  html_document:
    toc: yes
    toc_depth: 5
---

![](img/Wikidata-logo-en.png) 
![](img/Wikimedia_Deutschland_Logo_small.png)

***
**Feedback** should be sent to `goran.milovanovic_ext@wikimedia.de`. 
These notebook(s) refer to [Phab T285458: Generate inputs for 1st sensemaking session about ORES quality score distributions across the Wikidata classes](https://phabricator.wikimedia.org/T285458).

***

### 0. Setup

**NOTE.** The usage of **mean quality** or **mean ORES quality** in this report is the following one:

- mean ORES quality is always computed per Wikidata class,
- by mapping the ORES scores A, B, C, D, and E onto
- 5, 4, 3, 2, and 1, respectively.

*Example.* If a class has 5 A items, 3 B items, 0 C items, 2 D items, and 10 E items, its mean quality score is

> 5 * 5 + 3 * 4 + 0 * 3 + 2 * 2 + 10 * 1 = 51

divided by the number of items in that class, which happens to be

> 5 + 3 + 0 + 2 + 10 = 20.

In this example, the mean quality score for the given Wikidata class would be: `r 51/20`.
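
The same computation can be sketched as a small R function. This is a minimal illustration only: `mean_ores_quality` is a hypothetical helper name that does not appear elsewhere in this notebook, and it assumes the counts are always passed in the order A, B, C, D, E.

```r
# Mean ORES quality of one class from its A-E item counts.
# NOTE: `mean_ores_quality` is a hypothetical helper used for illustration.
mean_ores_quality <- function(counts) {
  weights <- 5:1  # A = 5, B = 4, C = 3, D = 2, E = 1
  stopifnot(length(counts) == length(weights))
  # weighted sum of the scores divided by the number of items in the class
  sum(counts * weights) / sum(counts)
}

example_class <- c(A = 5, B = 3, C = 0, D = 2, E = 10)
mean_ores_quality(example_class)
# 2.55
```

Note that a class with no items at all would yield `NaN` (0 divided by 0), which is why the notebook checks for empty classes right after loading the data.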


```{r echo = T, message = F}
### --- Setup
library(mclust)
library(data.table)
library(tidyverse)
library(snowfall)
library(ggrepel)
source('WD_ORES_ClassClustering_Functions.R')
### --- dirTree
dataDir <- paste0(getwd(), "/_data/")
reportingDir <- paste0(getwd(), "/_reporting/")
```

### 1. The Fundamental Dataset

```{r echo = T, message = F}
## --- Load
# - load
rawData <- data.table::fread(paste0(dataDir, "classQualityContingency.csv"), 
                             header = T)
rownames(rawData) <- rawData$V1
rawData$V1 <- NULL
# - check if there are empty classes
w <- which(rowSums(rawData) == 0)
if (length(w) > 0) {
  print(length(w))
}
```

The fundamental dataset in this analysis comprises `r dim(rawData)[1]` Wikidata classes and the counts of ORES A, B, C, D, and E scored items in each of them. Here are the first rows from the fundamental dataset, each row representing a Wikidata class:

```{r echo = T, message = F}
head(rawData)
```

### 2. EDA: The distribution of ORES scores across Wikidata classes

#### 2.1 The distribution of Wikidata class size

The first problem that we arrive at is the distribution of the class size in Wikidata: the number of items present in each class. Let's take a look at the distribution of class size in Wikidata:

```{r echo = T, message = F}
### --- data
dataSet <- rawData
dataSet$numItems <- rowSums(dataSet)
dataSet <- dataSet %>% 
  dplyr::arrange(desc(numItems))
summary(dataSet$numItems)
```
From the value of the median, 2, we can see that 50% of Wikidata classes have two items or fewer. The 3rd quartile (75% of data) is found at 4, which means that only 25% of classes have more than four items. The maximal number of items is `37,215,412`, and as we shall see that is the Wikidata class of `scholarly article (Q13442814)`.

It is questionable, thus, whether it makes sense to assess the distribution of ORES quality scores across all Wikidata classes at once. Let's take a closer look at the top 100 most populated Wikidata classes:

```{r echo = T, message = F}
dataSet$class <- rownames(dataSet)
labels <- wmdedata_api_fetch_labels(items = dataSet$class[1:100], 
                                    language = "en",
                                    fallback = T)
top100WDClasses <- dataSet[1:100, c('class', 'numItems')]
top100WDClasses <- dplyr::left_join(top100WDClasses, 
                                    labels,
                                    by = c("class" = "item"))
top100WDClasses
```

We already know that **Astronomical Object** (`Q6999` + its subclasses) and **Scholarly Article** (`Q13442814` + its subclasses) account for a very large number of Wikidata items. Let's single out those two classes and take a look at the distribution of ORES scores inside them.

#### 2.2 ORES quality in Astronomical Object (`Q6999` + its subclasses)

Here's the dataset for Astronomical Object (`Q6999` + its subclasses) only:

```{r echo = T, message = F}
 astronomicalObjects = c(
    'Q523', 'Q318', 'Q1931185', 'Q1457376', 'Q2247863', 'Q3863', 'Q83373',
    'Q2154519', 'Q726242', 'Q1153690', 'Q204107', 'Q71963409', 'Q67206691',
    'Q1151284', 'Q67206701', 'Q66619666', 'Q72802727', 'Q2168098', 'Q6243',
    'Q72802508', 'Q11282', 'Q72803170', 'Q1332364', 'Q72802977', 'Q6999',
    'Q1491746', 'Q272447', 'Q497654', 'Q204194', 'Q130019', 'Q744691',
    'Q71798532', 'Q46587', 'Q11276', 'Q71965429', 'Q5871', 'Q72803622',
    'Q72803426', 'Q3937', 'Q72803708', 'Q168845', 'Q24452', 'Q67201574',
    'Q2557101', 'Q691269', 'Q13632', 'Q10451997', 'Q28738741', 'Q22247', 'Q6999'
  )
astronomySet <- dataSet %>% 
  dplyr::filter(class %in% astronomicalObjects)
head(astronomySet)
```

It actually comprises `r round(sum(astronomySet$numItems)/sum(dataSet$numItems)*100, 2)`% of Wikidata. Let's take a look at the distribution of the mean per class ORES score:

```{r echo = T, message = F}
astronomySet$meanQuality <- apply(astronomySet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = astronomySet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkred", color = "darkred", size = .25) + 
  ggtitle("Astronomical Object (Q6999 + its subclasses)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))
```
```{r echo = T, message = F}
summary(astronomySet$meanQuality)
```
#### 2.4 ORES quality in Scholarly Article (`Q13442814` + its subclasses)

Here's the dataset for Scholarly Article (`Q13442814` + its subclasses) only:

```{r echo = T, message = F}
scientificPapers = c('Q7318358', 'Q2782326', 'Q18918145',
                     'Q1504425', 'Q7316896', 'Q92998777',
                     'Q10885494', 'Q15706459', 'Q58901470',
                     'Q59458414', 'Q56478376', 'Q12183006',
                     'Q82969330', 'Q58900805', 'Q13442814')
articleSet <- dataSet %>% 
  dplyr::filter(class %in% scientificPapers)
head(articleSet)
```

It actually comprises `r round(sum(articleSet$numItems)/sum(dataSet$numItems)*100, 2)`% of Wikidata. Let's take a look at the distribution of the mean per class ORES score:

```{r echo = T, message = F}
articleSet$meanQuality <- apply(articleSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = articleSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkblue", color = "darkblue", size = .25) + 
  ggtitle("Scholarly Article (Q13442814 + its subclasses)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))
```

```{r echo = T, message = F}
summary(articleSet$meanQuality)
```

#### 2.5 ORES quality in Human (`Q5`)

Here's the dataset for Human (`Q5`) only:

```{r echo = T, message = F}
human = 'Q5'
humanSet <- dataSet %>% 
  dplyr::filter(class %in% human)
humanSet$meanQuality <- apply(humanSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
head(humanSet)
```

It actually comprises `r round(sum(humanSet$numItems)/sum(dataSet$numItems)*100, 2)`% of Wikidata.

#### 2.6 The distribution of ORES scores in the remaining Wikidata classes

The two large Wikidata classes - **Astronomical Object** (`Q6999`) and **Scholarly Article** (`Q13442814`), and their subclasses as well - will be removed from all following analyses.

Let's take a look at the distribution of ORES scores in the remaining Wikidata classes:

```{r echo = T, message = F}
wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers))
head(wikidataSet)
```

```{r echo = T, message = F}
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))
```
```{r echo = T, message = F}
summary(wikidataSet$meanQuality)
```
Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly *bimodal*, with a mean at `r mean(wikidataSet$meanQuality)`. Let's find out which are *the largest Wikidata classes* found in the following ranges of mean ORES scores: 

- 1.5 - 2.5
- 2.5 - 3.5

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

```{r echo = T, message = F}
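# - NOTE: top_n() without wt ranks by the last column (meanQuality);
# - head() keeps the 100 largest classes after sorting by numItems
lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)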
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses
```

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

```{r echo = T, message = F}
higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses
```

Neither the lower nor the higher score classes isolated in the analysis above seem *prima facie* semantically coherent.
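
The band selection used above (filter classes by a range of mean quality, then keep the largest ones) can be sketched on toy data in base R. The data and the helper name `largest_in_band` are made up for illustration and are not part of the notebook's functions.

```r
# Toy classes with made-up sizes and mean quality scores
toy <- data.frame(
  class       = c("c1", "c2", "c3", "c4", "c5"),
  numItems    = c(50, 10, 400, 7, 90),
  meanQuality = c(1.6, 2.4, 2.6, 1.9, 3.4)
)

# NOTE: `largest_in_band` is a hypothetical helper used for illustration.
# Keeps classes with lower <= meanQuality < upper and returns the n largest.
largest_in_band <- function(d, lower, upper, n) {
  band <- d[d$meanQuality >= lower & d$meanQuality < upper, ]
  band <- band[order(band$numItems, decreasing = TRUE), ]
  head(band, n)
}

lower_band  <- largest_in_band(toy, 1.5, 2.5, n = 2)  # classes c1, c2
higher_band <- largest_in_band(toy, 2.5, 3.5, n = 2)  # classes c3, c5
```

Sorting explicitly by `numItems` before taking the first `n` rows makes it unambiguous that "largest" refers to the class size and not to the mean quality.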


#### 2.7 The distribution of ORES scores in the remaining Wikidata classes: >= 1000 items

Let's take a look at the distribution of ORES scores in the remaining Wikidata classes, **taking into consideration only classes with 1,000 or more items**:

```{r echo = T, message = F, warning = F}
wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers)) %>% 
  dplyr::filter(numItems >= 1000)
head(wikidataSet)
```

```{r echo = T, message = F, warning = F}
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)\nOnly classes w. >= 1000 items") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))
```
```{r echo = T, message = F}
summary(wikidataSet$meanQuality)
```
Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly *bimodal*, with a mean at `r mean(wikidataSet$meanQuality)`. Let's find out which are *the largest Wikidata classes* found in the following ranges of mean ORES scores: 

- 1.5 - 2.5
- 2.5 - 3.5

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

```{r echo = T, message = F}
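# - NOTE: top_n() without wt ranks by the last column (meanQuality);
# - head() keeps the 100 largest classes after sorting by numItems
lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)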
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses
```

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

```{r echo = T, message = F}
higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses
```

#### 2.8 The distribution of ORES scores in the remaining Wikidata classes: < 1000 items

Let's take a look at the distribution of ORES scores in the remaining Wikidata classes, **taking into consideration only classes with fewer than 1,000 items**:

```{r echo = T, message = F}
wikidataSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers)) %>% 
  dplyr::filter(numItems < 1000)
head(wikidataSet)
```

```{r echo = T, message = F}
wikidataSet$meanQuality <- apply(wikidataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})
ggplot(data = wikidataSet, 
       aes(x = meanQuality)) + 
  geom_density(alpha = .25, fill = "darkorange", color = "darkorange", size = .25) + 
  ggtitle("Wikidata - (Astronomical Object + Scholarly Article)\nOnly classes w. < 1000 items") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  xlim(0, 5) +
  theme(plot.title = element_text(size = 12, hjust = .5))
```
```{r echo = T, message = F}
summary(wikidataSet$meanQuality)
```
Interestingly, the distribution of the mean ORES score in the remainder of Wikidata is clearly *bimodal*, with a mean at `r mean(wikidataSet$meanQuality)`. Let's find out which are *the largest Wikidata classes* found in the following ranges of mean ORES scores: 

- 1.5 - 2.5
- 2.5 - 3.5

First the lower score classes, with a range of mean quality of 1.5 - 2.5:

```{r echo = T, message = F}
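# - NOTE: top_n() without wt ranks by the last column (meanQuality);
# - head() keeps the 100 largest classes after sorting by numItems
lowerScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 1.5 & meanQuality < 2.5) %>% 
  arrange(desc(numItems)) %>% 
  head(100)
higherScoreSet <- wikidataSet %>% 
  dplyr::filter(meanQuality >= 2.5 & meanQuality < 3.5)  %>% 
  arrange(desc(numItems)) %>% 
  head(100)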
lowerScoreClasses <- wmdedata_api_fetch_labels(items = lowerScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
lowerScoreClasses <- dplyr::left_join(lowerScoreClasses,
                                      dplyr::select(lowerScoreSet, class, numItems, meanQuality),
                                      by = c("item" = "class"))
lowerScoreClasses
```

Now for the higher score classes, with a range of mean quality of 2.5 - 3.5:

```{r echo = T, message = F}
higherScoreClasses <- wmdedata_api_fetch_labels(items = higherScoreSet$class, 
                                               language = "en", 
                                               fallback = T)
higherScoreClasses <- dplyr::left_join(higherScoreClasses,
                                       dplyr::select(higherScoreSet, class, numItems, meanQuality),
                                       by = c("item" = "class"))
higherScoreClasses
```


### 3. Clustering the ORES scores across the Wikidata classes

We will attempt to better understand the distribution of ORES scores across Wikidata by clustering the dataset following the removal of **Astronomical Object** (`Q6999`) and **Scholarly Article** (`Q13442814`) (and their subclasses as well).

```{r echo = T, message = F}
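# - cluster all remaining classes: wikidataSet was last restricted 
# - to classes with < 1,000 items, so we filter dataSet again
clustSet <- dataSet %>% 
  dplyr::filter(!(class %in% astronomicalObjects | class %in% scientificPapers)) %>% 
  dplyr::select(A, B, C, D, E)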
```


```{r echo = T, message = F}
### --- Cluster with {mclust}
snowfall::sfInit(parallel = T, cpus = 23)
snowfall::sfExport('clustSet')
snowfall::sfLibrary(mclust)
t1 <- Sys.time()
models <- snowfall::sfClusterApplyLB(1:30, function(x) {
  mclust.options(hcUse = "STD")
  return(
    Mclust(clustSet, 
           G = 2:30, 
           initialization = list())
  )
})
t2 <- Sys.time()
print(
  paste0("mclust took: ", 
         as.character(difftime(t2, t1, units = "mins"))
  )
)
snowfall::sfStop()

### --- Post-hoc model selection
maxBIC <- sapply(models, function(x) {
  max(apply(x$BIC, 2, max, na.rm = T))
})
selectedModel <- which.max(maxBIC)
wd_classes_optimal <- models[[selectedModel]]
rm(models)

### --- Inspect Model
legend_args <- list(x = "bottomright", 
                    ncol = 5)
plot(wd_classes_optimal, 
     what = 'BIC', 
     legendArgs = legend_args)
```

We have run many cluster models and selected the one with the highest BIC:

```{r echo = T, message = F}
wd_classes_optimal$modelName
wd_classes_optimal$G
wd_classes_optimal$n
wd_classes_optimal$d
```
The best clustering solution that we were able to produce encompasses `r wd_classes_optimal$G` clusters.
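
The post-hoc selection step (take the run whose best BIC is the highest, then read off its number of clusters and model name) can be sketched in base R on made-up BIC tables. This is an illustration under assumptions: the matrices below only mimic the layout of the `BIC` component returned by `Mclust()` (rows for the number of clusters, columns for model names, `NA` where a model could not be fitted), and all values are invented. In {mclust}, a larger BIC indicates a better model, which is why `which.max()` is used.

```r
# Toy BIC tables (made-up values); matrix() fills column by column
bic_run_1 <- matrix(c(-120, -110, NA,
                      -115, -105, -108),
                    nrow = 3,
                    dimnames = list(c("2", "3", "4"), c("EII", "VII")))
bic_run_2 <- matrix(c(-118, -101, -104,
                      NA,   -109, -112),
                    nrow = 3,
                    dimnames = list(c("2", "3", "4"), c("EII", "VII")))
runs <- list(bic_run_1, bic_run_2)

# Best (largest) BIC reached in each run, ignoring models that failed
max_bic  <- sapply(runs, function(b) max(b, na.rm = TRUE))
best_run <- which.max(max_bic)

# Locate the winning (number of clusters, model name) cell in the best run
best_cell  <- which(runs[[best_run]] == max_bic[best_run], arr.ind = TRUE)
best_G     <- rownames(runs[[best_run]])[best_cell[1, "row"]]
best_model <- colnames(runs[[best_run]])[best_cell[1, "col"]]
```

In this toy case the second run wins with three clusters under the `EII` model.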

#### 3.1 Clusters: number of Wikidata classes per cluster, mean ORES quality per Cluster, and the average number of items in a class per cluster

```{r echo = T, message = F}
### --- Join clusters to data points (WD classes)
cluster_assign <- as.data.frame(wd_classes_optimal$classification)
colnames(cluster_assign) <- "cluster"
cluster_assign$class <- rownames(cluster_assign)
clustSet$class <- rownames(clustSet)
clustSet <- dplyr::left_join(clustSet,
                             cluster_assign,
                             by = "class")
rm(cluster_assign)

### --- Inspect clusters
clustSet$classSize <- rowSums(clustSet[, c('A', 'B', 'C', 'D', 'E')])
clustSet$meanQuality <- apply(clustSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum((x * 5:1))/sum(x))
})


# - ORES cluster quality profiles
classProfiles <- clustSet %>% 
  dplyr::select(cluster, meanQuality, classSize, A, B, C, D, E) %>% 
  dplyr::group_by(cluster) %>% 
  dplyr::summarise(meanClusterQuality = mean(meanQuality),
                   meanClusterClassSize = mean(classSize), 
                   medianClusterClassSize = median(classSize), 
                   numClasses = n(),
                   A = sum(A), 
                   B = sum(B), 
                   C = sum(C), 
                   D = sum(D), 
                   E = sum(E)) %>% 
  arrange(desc(meanClusterClassSize))
classProfiles$totalItems <- classProfiles$A + 
  classProfiles$B + 
  classProfiles$C + 
  classProfiles$D + 
  classProfiles$E

# - Analytics
# - Plot A. Mean per Cluster Class Size vs Mean per Cluster Class Quality
ggplot(data = classProfiles, 
       aes(x = meanClusterClassSize, 
           y = meanClusterQuality, 
           size = numClasses, 
           label = cluster)) + 
  geom_point(color = "red", fill = "tomato") +
  geom_text_repel(size = 4) +
  ggtitle("Mean per Cluster WD Class Size vs Mean per Cluster Class Quality") + 
  scale_x_continuous(labels = scales::comma) + 
  scale_y_continuous(labels = scales::comma) +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(size = 10, hjust = .5))
```
In the chart above each bubble represents a cluster of Wikidata classes. The horizontal axis, labeled `meanClusterClassSize`, represents the mean size of a Wikidata class in a particular cluster: the size of a class is the number of items in it, and since each cluster encompasses a number of classes we compute the mean class size per cluster. The vertical axis represents the mean ORES quality: the average of the per-class mean ORES scores across all classes found in a particular cluster. The size of the bubble represents the number of classes found in each cluster. One cluster, labeled Cluster 6, encompasses a relatively low number of Wikidata classes, but those classes are very large, which is why it is positioned on the far right of the horizontal axis. All other clusters encompass Wikidata classes with a relatively low number of items, which is why they are positioned on the far left of the horizontal axis. In the lower left corner of the chart we find Cluster 2: a large bubble, meaning that it encompasses a large number of classes, with a relatively small average number of items per class and relatively low mean ORES scores per class.
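
How the three quantities in the chart relate to the per-class data can be sketched on a toy example in base R. The data are made up, and base R is used here only to keep the sketch dependency-free; the notebook itself computes the same summaries with {dplyr}.

```r
# Toy data: one row per class with its cluster, size and mean quality (made up)
toy <- data.frame(
  cluster     = c(1, 1, 1, 2, 2),
  classSize   = c(2, 4, 6, 1000, 3000),
  meanQuality = c(1.5, 2.0, 2.5, 3.0, 4.0)
)

# One row per cluster: bubble position (x, y) and bubble size
profiles <- data.frame(
  cluster              = sort(unique(toy$cluster)),
  meanClusterClassSize = as.numeric(tapply(toy$classSize, toy$cluster, mean)),
  meanClusterQuality   = as.numeric(tapply(toy$meanQuality, toy$cluster, mean)),
  numClasses           = as.numeric(table(toy$cluster))
)
profiles
```

Note that `meanClusterQuality` is an unweighted mean of the per-class means: a class with two items contributes as much as a class with thousands of items.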

#### 3.2 Clusters: the distribution of ORES scores per cluster

```{r echo = T, message = F}
# - Plot B. Average number of A, B, C, D, and E ranked items per class, per Cluster
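plotBFrame <- classProfiles %>% 
  dplyr::select(cluster, numClasses, A, B, C, D, E) %>% 
  # - classProfiles holds per-cluster sums of items: divide by the number
  # - of classes in the cluster to obtain the mean number of items per class
  dplyr::mutate(dplyr::across(c(A, B, C, D, E), ~ .x / numClasses)) %>% 
  dplyr::select(-numClasses) %>% 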
  pivot_longer(-cluster, 
               names_to = "ORES Rank", 
               values_to = "Num Items per ORES Rank"
  ) 
plotBFrame$cluster <- factor(plotBFrame$cluster)
ggplot(data = plotBFrame, 
       aes(x = `ORES Rank`, 
           y = `Num Items per ORES Rank`,
           group = cluster,
           color = cluster,
           fill = cluster)
) + 
  geom_point() + geom_path() + 
  facet_wrap(~cluster, scales = "free") + 
  scale_y_continuous(labels = scales::comma) +
  ylab("Mean(Num.items)\nper ORES Rank per WD Class") +
  ggtitle("Average number of A, B, C, D, and E ranked items per WD Class, per Cluster") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(legend.position = "none") + 
  theme(strip.background = element_blank()) + 
  theme(plot.title = element_text(size = 10, hjust = .5)) 
```
The horizontal axis in all charts represents the ORES score, while the vertical axis represents the mean number of items per class in a cluster that belong to a particular ORES score category (A, B, C, D, or E); each chart represents a cluster.
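
The per-cluster quality profile described above (the mean number of A, B, C, D, and E items per class within a cluster) can be sketched on toy data. The counts below are made up, and base R's `aggregate()` is used only to keep the example self-contained.

```r
# Toy data: one row per class, with its cluster and A-E item counts (made up)
toy <- data.frame(
  cluster = c(1, 1, 2),
  A = c(0, 2, 10),
  B = c(1, 1, 20),
  C = c(2, 0, 30),
  D = c(3, 5, 40),
  E = c(4, 2, 100)
)
grades <- c("A", "B", "C", "D", "E")

# Mean count per ORES grade within each cluster
mean_profile <- aggregate(toy[, grades],
                          by = list(cluster = toy$cluster),
                          FUN = mean)
mean_profile
```

Each row of `mean_profile` corresponds to one panel of the faceted chart: the five values are the heights of the points for A through E.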

If we superimpose the quality score profiles: 

```{r echo = T, message = F}
# - superimposed: 
ggplot(data = plotBFrame, 
       aes(x = `ORES Rank`, 
           y = log(`Num Items per ORES Rank`),
           group = cluster,
           color = cluster,
           fill = cluster)
) + 
  geom_point() + geom_path() + 
  scale_y_continuous(labels = scales::comma) +
  ylab("log(Mean(Num.items))\nper ORES Rank per WD Class") +
  ggtitle("Average number of A, B, C, D, and E ranked items per WD Class, per Cluster") +
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(legend.position = "top") + 
  theme(plot.title = element_text(size = 12))
```
#### 3.3 Clusters: Describing their content

We will query the Wikidata API to obtain the English labels for the 20 most populated Wikidata classes in each cluster.

```{r echo = T, message = F}
### --- Describe clusters
describeClusters <- lapply(sort(unique(clustSet$cluster)), function(x) {
  d <- clustSet[clustSet$cluster == x, ]
  d <- d %>% 
    dplyr::arrange(desc(classSize))
  # - head() avoids NA padding when a cluster has fewer than 20 classes
  return(head(d$class, 20))
})
# - call wmdedata_api_fetch_labels()
describeClusters <- lapply(describeClusters, 
                           wmdedata_api_fetch_labels, 
                           language = "en", 
                           fallback = T)
```

However, just from looking at Wikidata class labels, it does not seem possible to provide a coherent interpretation of the clusters.
It is interesting, however, to take a look at the composition of Cluster 6:

```{r echo = T, message = F}
describeClusters[[6]]
```

This cluster (6) encompasses large Wikidata classes. It is quite possible that some of them could be excluded from further analyses (e.g. Wikimedia category, Wikimedia template, Wikimedia disambiguation page, etc). Also, some of them might need a separate treatment similar to Astronomical Objects and Scholarly Articles (e.g. Human, taxon, gene, protein, etc).

It is possible that a more comprehensive description of the clustering solution could be obtained from re-labeling the present Wikidata classes by their super-classes in the Wikidata hierarchy via `P31/P279`, and by looking at the distribution of the super-classes across the clusters. However, it would take a lot of computational resources to be able to describe the solution in that way.
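
The idea of re-labeling classes by their super-classes can be sketched on toy data. Everything below is hypothetical: the identifiers are placeholders rather than real Wikidata items, and the lookup vector `superclass_of` stands in for a mapping that would in practice have to be derived from the `P279` paths, which is the computationally expensive part.

```r
# Hypothetical class -> super-class lookup (placeholder identifiers)
superclass_of <- c(Q_a = "Q_x", Q_b = "Q_x", Q_c = "Q_y", Q_d = "Q_y")

# Toy cluster assignment of the classes (made up)
toy <- data.frame(
  class   = c("Q_a", "Q_b", "Q_c", "Q_d"),
  cluster = c(1, 1, 1, 2)
)

# Re-label each class by its super-class
toy$superclass <- unname(superclass_of[toy$class])

# Distribution of super-classes (rows) across clusters (columns)
superclass_by_cluster <- table(toy$superclass, toy$cluster)
superclass_by_cluster
```

A super-class concentrated in a single column of this table would suggest a semantically coherent cluster, while a super-class spread evenly across columns would not.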

### 4. Classes of highest and lowest mean ORES quality

We take a look only at the Wikidata classes with an above-average number of items.

```{r echo = T, message = F}
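dataSet <- rawData
dataSet$classSize <- rowSums(dataSet)
# - mean ORES quality per class: A, B, C, D, E map onto 5, 4, 3, 2, 1
dataSet$meanQuality <- apply(dataSet[, c('A', 'B', 'C', 'D', 'E')], 1, function(x) {
  return(sum(x * 5:1)/sum(x))
})
dataSet$Class <- rownames(dataSet)
# - keep only the classes larger than the average class size
dataSet <- dataSet %>% 
  dplyr::filter(classSize > mean(dataSet$classSize))
# - order by decreasing mean ORES quality
dataSet <- dataSet %>% 
  dplyr::arrange(desc(meanQuality))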
```
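
As a quick sanity check of the formula used above, the following sketch recomputes the worked example from the Setup section (5 A, 3 B, 0 C, 2 D, and 10 E items) and shows a vectorized alternative based on a matrix product. The helper name `meanOresQuality()` and the toy table `toyCounts` are assumptions made for illustration only.

```r
# Mean ORES quality per row: A, B, C, D, E are weighted 5, 4, 3, 2, 1
# Assumption: counts has exactly the columns A, B, C, D, E, in that order
meanOresQuality <- function(counts) {
  counts <- as.matrix(counts)
  weights <- 5:1
  as.numeric(counts %*% weights) / unname(rowSums(counts))
}

# Toy counts (hypothetical): row 1 is the example from the Setup section
toyCounts <- data.frame(A = c(5, 0),
                        B = c(3, 0),
                        C = c(0, 0),
                        D = c(2, 0),
                        E = c(10, 4))

# Row 1: (5*5 + 3*4 + 0*3 + 2*2 + 10*1) / 20 = 51 / 20 = 2.55
# Row 2: only E items, so the mean quality is 1
toyQuality <- meanOresQuality(toyCounts)
print(toyQuality)
```

A class with no items at all would yield `NaN` (0 divided by 0); such classes cannot pass the above-average class size filter.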

```{r echo = T, message = F}
topQualitySet <- head(dataSet, 100)
topQualitySetLabels <- wmdedata_api_fetch_labels(topQualitySet$Class, 
                                                 language = "en", 
                                                 fallback = TRUE)
topQualitySetLabels <- dplyr::left_join(topQualitySetLabels,
                                        dplyr::select(topQualitySet, Class, classSize, meanQuality),
                                        by = c("item" = "Class"))

lowQualitySet <- tail(dataSet, 100)
lowQualitySetLabels <- wmdedata_api_fetch_labels(lowQualitySet$Class, 
                                                 language = "en", 
                                                 fallback = TRUE)
lowQualitySetLabels <- dplyr::left_join(lowQualitySetLabels,
                                        dplyr::select(lowQualitySet, Class, classSize, meanQuality),
                                        by = c("item" = "Class"))
```

The 100 Wikidata classes of highest mean ORES quality are:

```{r echo = T, message = F}
topQualitySetLabels
```

The 100 Wikidata classes of lowest mean ORES quality are:

```{r echo = T, message = F}
lowQualitySetLabels
```

We can see that many `Wikimedia*` classes are found among the classes of lowest mean ORES quality in the dataset.

***
License: [GPLv3](http://www.gnu.org/licenses/gpl-3.0.txt)
This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this Notebook. If not, see <http://www.gnu.org/licenses/>.

***


