Report timestamp: 2019-09-22 03:15:07

Contact: goran.milovanovic_ext at wikimedia.de


Introduction

The Wikidata Quality Report assesses the quality of Wikidata items based on predictions from the Objective Revision Evaluation Service (ORES) machine learning system.

The Grading Scheme for Wikidata items used in this report encompasses five categories (A, B, C, D, and E) of decreasing quality.

This Report combines re-use statistics from the Wikidata Concepts Monitor (WDCM) with ORES prediction scores to provide a more comprehensive picture of data quality in Wikidata. The WDCM system tracks Wikidata re-use across the Wikimedia projects and assigns each item a re-use statistic. The WDCM re-use statistic is defined as the number of mentions of an item across the Wikimedia projects: multiple uses of the same item on the same page of the same project, in any of the several usage aspects (see: wbc_entity_usage), are counted as one mention. For example, the re-use statistic for Q42 is the number of pages, summed across all Wikimedia projects, that make at least one use of it, irrespective of the usage aspect.
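The counting rule above can be illustrated with a minimal sketch on made-up data. The data frame and its column names (item, project, page, aspect) are assumptions for illustration only; they are not the actual WDCM pipeline or the schema of wbc_entity_usage.

```r
# Toy usage records; column names and values are hypothetical
usage <- data.frame(
  item    = c("Q42", "Q42", "Q42", "Q42", "Q1"),
  project = c("enwiki", "enwiki", "enwiki", "dewiki", "enwiki"),
  page    = c("Page_A", "Page_A", "Page_B", "Page_A", "Page_A"),
  aspect  = c("S", "L", "S", "T", "S"),
  stringsAsFactors = FALSE
)

# Drop the usage aspect and keep one row per (item, project, page):
# several uses of an item on the same page count as one mention
mentions <- unique(usage[, c("item", "project", "page")])

# Re-use statistic: number of distinct (project, page) pairs per item
reuse <- table(mentions$item)
reuse[["Q42"]]
```

Here Q42 appears twice on the same enwiki page under different aspects, yet that page contributes only one mention.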

All data sets upon which this Report is based are publicly available from the following URL:

https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/

library(tidyverse)
library(ggrepel)

Overview

stats <- read.csv('https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/dataQuality_Stats.csv', 
                  header = T, 
                  check.names = F, 
                  stringsAsFactors = F)

This Report finds a total of 63084587 items currently in Wikidata, of which 58558373 (92.83%) have received a quality assessment from the ORES system. Out of the total of 63084587 items, WDCM finds that 28658550 (45.43%) are re-used across the Wikimedia projects. The latest ORES prediction run upon which this Report is based took place on 2019-09-16 13:17:00, while the latest update of the WDCM re-use statistics happened on 2019-09-21 21:32:20 UTC. The Report is based on the 2019-08 snapshot of the wmf.mediawiki_history table in the WMF Data Lake.
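The percentages quoted above follow directly from the three counts. The sketch below recomputes them; the variable names are chosen here for illustration and are not taken from dataQuality_Stats.csv.

```r
# Counts as quoted in the paragraph above
total_items  <- 63084587
scored_items <- 58558373
reused_items <- 28658550

# Share of all items with an ORES score, and share re-used per WDCM
pct_scored <- round(scored_items / total_items * 100, 2)
pct_reused <- round(reused_items / total_items * 100, 2)
```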

Quality distribution for all items

Looking at all of the 58558373 Wikidata items that have received a quality score from ORES, we find the following:

dataA <- read.csv('https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/dataQuality_oresScoresDistribution.csv',
                 header = T,
                 check.names = F,
                 stringsAsFactors = F,
                 row.names = 1)
colnames(dataA) <- c('Quality', 'Number of items')
dataA$Percent <- round(dataA$`Number of items`/sum(dataA$`Number of items`)*100, 2)
DT::datatable(dataA, 
              options = list(
                width = '100%',
                columnDefs = list(list(className = 'dt-center', targets = "_all"))
              ),
              rownames = FALSE
    )

Quality distribution for the top 10,000 re-used items

Looking only at the top 10,000 most re-used items across the Wikimedia projects according to the WDCM, we find the following distribution of quality:

dataB <- read.csv('https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/dataQuality_oresScoresDistribution_10000.csv',
                 header = T,
                 check.names = F,
                 stringsAsFactors = F,
                 row.names = 1)
colnames(dataB) <- c('Quality', 'Number of items')
dataB$Percent <- round(dataB$`Number of items`/sum(dataB$`Number of items`)*100, 2)
DT::datatable(dataB, 
              options = list(
                width = '100%',
                columnDefs = list(list(className = 'dt-center', targets = "_all"))
              ),
              rownames = FALSE
    )

Quality distribution: top 10K most used items vs all items

Comparing the quality distribution of all items to that of the top 10,000 most re-used items across the Wikimedia projects gives the following picture:

qualFrame <- data.frame(
  Quality = c("A", "B", "C", "D", "E"),
  All = dataA$Percent,
  Top10K = dataB$Percent, 
  stringsAsFactors = F)
qualFrame <- gather(qualFrame, 
                    key = 'Items',
                    value = 'Percent', 
                    2:3)
ggplot(qualFrame, 
       aes(x = Quality, 
           y = Percent, 
           group = Items, 
           color = Items, 
           fill = Items, 
           label = paste0(Percent, " %"))) + 
  geom_line() + 
  geom_point(size = 2.5) +
  geom_point(size = 2, color = "white") + 
  geom_text_repel(show.legend = FALSE) + 
  scale_colour_manual(values= c("darkblue", "darkred")) +
  theme_bw() +
  theme(panel.background = element_rect(color = "white", fill = "white")) +
      theme(panel.border = element_blank()) +
      theme(panel.grid = element_blank()) + 
      theme(legend.text = element_text(size = 14)) +
      theme(legend.title = element_text(size = 15))


Item quality and re-use

Item re-use with outliers

In the following chart we present the WDCM re-use statistics (vertical axis, logarithmic scale) for all Wikidata items, grouped by their predicted quality class (A, B, C, D, or E). The horizontal line inside each box represents the median of the re-use statistic, while the lower and upper edges of the box represent the 1st (.25) and the 3rd (.75) quartile, respectively. The free-floating points above (and sometimes below) the boxes are outliers: they represent the Wikidata items used suspiciously more (or less) than other items in the same quality class. Once again: the outliers in this boxplot are detected for each quality class (A, B, C, D, or E) separately.

Item re-use without outliers

Now we present the same data with the outliers removed. We can see that item quality is correlated with item re-use: the lower the item quality, the less the item seems to be re-used across the Wikimedia projects. The vertical lines above and below the boxes extend as far as Q3 + 1.5*IQR and Q1 - 1.5*IQR, respectively, where Q1 and Q3 are the 1st and 3rd quartiles and IQR = Q3 - Q1 is the interquartile range; any outliers that remained would be shown above and below these limits.
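The 1.5*IQR rule, applied to each quality class separately, can be sketched as follows. The data are made up, and the sketch works on raw counts with the default quantile definition in R; whether the report detects outliers on raw or log-transformed values is not stated, so treat this as an illustration of the rule only.

```r
# Toy re-use counts for two quality classes (made-up numbers)
reuse <- data.frame(
  quality = c(rep("A", 6), rep("E", 6)),
  count   = c(10, 12, 15, 18, 20, 500, 1, 1, 2, 2, 3, 40),
  stringsAsFactors = FALSE
)

# Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR
fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Flag outliers within each quality class separately;
# ave() returns 0/1 here, so compare with 1 to get a logical vector
reuse$outlier <- ave(reuse$count, reuse$quality, FUN = function(x) {
  f <- fences(x)
  x < f["lower"] | x > f["upper"]
}) == 1

reuse[reuse$outlier, ]
```

Because the limits are computed per class, a count of 40 is flagged in class E even though it would be unremarkable among the class A values.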

Diversity of item re-use vs item quality

The following chart provides an in-depth insight into the relationship between item quality (as predicted by ORES) and item re-use across the Wikimedia projects (as assessed by WDCM). Each bubble represents (potentially) many Wikidata items. Let’s focus on a single quality class, A for example. Any A class item could be re-used 1, 2, 3, ..., n times across the projects. Each bubble in this chart represents all Wikidata items in the respective quality class that share the same value of the re-use statistic (y-axis, log scale). The size of the bubble corresponds to the number of items that it represents (i.e. the number of items that share the same value of the re-use statistic). From the chart we can observe the following: the lower the item quality class (A > B > C > D > E), the fewer unique values of the re-use statistic the items from that class take.

The diversity of the unique values of the re-use statistic is much higher in the A quality class, for example, than in the D or E classes (where the re-use statistic takes only three unique values). This might be a consequence of (a) more human or combined human and machine engagement in the re-use of the top quality A items across the projects and (b) more purely machine-driven engagement in the re-use of the lower quality D and E items.
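The two quantities discussed above, bubble size and diversity of re-use values, can be sketched on made-up data. The data frame and column names below are assumptions for illustration and do not reproduce the report's data sets.

```r
# Toy items with a quality class and a re-use statistic (made-up numbers)
items <- data.frame(
  quality = c("A", "A", "A", "A", "A", "E", "E", "E", "E", "E"),
  reuse   = c(1, 2, 2, 7, 150, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One bubble per (quality, reuse) pair; bubble size = number of items in it
bubbles <- aggregate(n_items ~ quality + reuse,
                     data = transform(items, n_items = 1),
                     FUN = sum)

# Diversity: number of distinct re-use values observed in each class
diversity <- tapply(items$reuse, items$quality,
                    function(x) length(unique(x)))
diversity
```

In this toy example class A spreads over four distinct re-use values while class E concentrates on two, which is the pattern the chart shows at scale.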


Critical Wikidata items

We provide a list of Wikidata items scored as imperfect (i.e. found in the B, C, D, or E quality class) that are also outliers in terms of their re-use (i.e. used suspiciously more than other Wikidata items across the Wikimedia projects). The list encompasses, for each imperfect quality class, the top 1,000 most re-used items that are recognized as re-use outliers. The items from the D and E classes in this list are probably the most critical Wikidata items and need to be improved immediately.

critical <- read.csv('https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/positiveOutliers.csv',
                 header = T,
                 check.names = F,
                 stringsAsFactors = F,
                 row.names = 1)
colnames(critical) <- c('Item', 'Quality', 'Re-use')
critical$url <- paste0('https://www.wikidata.org/wiki/', critical$Item)
critical$Item <- paste0('<a href="', critical$url, '" target="_blank">', critical$Item, "</a>")
critical$url <- NULL
critical <- critical %>% 
  dplyr::arrange(desc(Quality), desc(`Re-use`))
DT::datatable(critical, 
              options = list(
                pageLength = 100, 
                width = '100%',
                columnDefs = list(list(className = 'dt-center', targets = "_all"))
              ),
              escape = FALSE,
              rownames = FALSE
    )
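The selection rule behind the list above (imperfect classes only, then the most re-used outliers per class) can be sketched as follows. The input data frame, its column names, and the cut-off of 2 are hypothetical; the report uses 1,000 per class, and the way positiveOutliers.csv is actually produced is not shown here.

```r
# Toy set of items already flagged as positive re-use outliers (made-up)
outliers <- data.frame(
  item    = c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6"),
  quality = c("A", "B", "B", "B", "E", "E"),
  reuse   = c(900, 300, 800, 500, 70, 90),
  stringsAsFactors = FALSE
)
top_n <- 2  # illustrative cut-off; the report keeps 1,000 per class

# Keep imperfect classes only (everything except A)
imperfect <- outliers[outliers$quality != "A", ]

# Sort by class, then by re-use in decreasing order
imperfect <- imperfect[order(imperfect$quality, -imperfect$reuse), ]

# Rank items within their class and keep the top_n of each
rank_in_class <- ave(imperfect$reuse, imperfect$quality, FUN = seq_along)
critical_toy <- imperfect[rank_in_class <= top_n, ]
critical_toy
```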

We wish to thank all the members of the ORES team for their help in the production of this report.


