Report timestamp: 2020-04-28 00:08:04


Reference Phabricator Ticket: T248308


Notebook: Optimizing WDQS Response Time From SPARQL Queries Analysis

Feedback should be send to goran.milovanovic_ext@wikimedia.de.

Summary

WMDE together with php.cc is trying to find a way to optimize the Wikidata Query Service. This Report provides (a) and Exploratory Data Analysis (EDA) approach to WDQS response times (Part A) and (2) describes a Machine Learning (ML) approach (Part B) to analyze the SPARQL queries that were received by the endpoint in the first three weeks of April 2020, with the goal of developing a model able to predict whether a query would be critical for WDQS/Blazegraph or not. The motivation for the ML approach to the problem is the following one: by singling out the most important features that are present in the SPARQL queries under analysis we hope to be able to understand better what types of queries typically cause problems.

This, Part A, of the Report, provides the Exploratory Data Analytics(EDA) on the WDQS query response times. We analyze the query response times across a range of variables obtained directly from the event.wdqs_external_sparql_query schema in the WMF Data Lake.

In the following sections we describe (1) the characteristics of the sample of SPARQL queries used in this study, (2) the overview of the number of queries run per (a) day of week, (b) hour of day, (c) WMF Datacenter/Host, (d) HTTP method of request, (e) server HTTP response code, and (f) the desired output format, (3) the mean and median WDQS query processing times across the mentioned (a) - (f) variables, and (4) the distributions of WDQS processing times across WMF Datacenter/Hosts and output format.

Summary

  • The eqiad data center is receiving tons of queries in comparison to codfw.
  • The XML output format seems to take much more to process in comparison to JSON and text/plain (except for we really have only few observations of text/plain in the sample).
  • The distributions of the WDQS processing time across the crucial variables (WMF Datacenter/Host, Output format) are highly skewed towards short processing times - so we really need to focus on the outliers seriously.

0. Sample of SPARQL queries observed at the WDQS SPARQL endpoint

The SPARQL queries that hit the endpoint are registered in the event.wdqs_external_sparql_query table in the WMF Data Lake (our Hadoop Big Data storage). A bucketing approach to sample data from this Hive table was taken, sampling approximately 1% of queries that were observed in the period between 2020/04/01 and 2020/04/21. The sample is approximately uniformly distributed across days and hours of the mentioned period of time. The sample size is slightly higher than 1M SPARQL queries - including duplicates.

The following fields were derived from the event.wdqs_external_sparql_query table for each collected SPARQL query:

  • ds_sparql - the SPARQL query
  • ds_dt - the query timestamp, later parsed into year, month, day, hour, and minute
  • ds_status_code - the server http status response for this query
  • ds_query_time - how many seconds did it took for the WDQS to process the query
  • ds_datacenter_host - a combination of the (a) WMF data center and (b) a back-end host in the respective data center
  • ds_format - content type (e.g. JSON, text/tsv, xml, etc.)
  • ds_method - HTTP method (e.g. GET, POST)
  • ds_n_conc_queries - for each SPARQL query, the number of SPARQL whose processing started in (1) the same minute, on (2) the same data center, and on (3) the same back-end host.

The following R script was used to obtain a sample of SPARQL queries from event.wdqs_external_sparql_query.

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WDQS_Event_S1_ETL.R v 0.0.1
### --- Script: WDQS_Event_S1_ETL.R, v. Beta 0.1
### --- Description: WDQS_Event_S1_ETL.R collects a sample of 
### --- SPARQL queries from event.wdqs_external_sparql_query in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/22
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- Section 1. Collect raw data set event.wdqs_external_sparql_query
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = F)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')

### --- grab a sample of SPARQL queries event.wdqs_external_sparql_query
### --- w. HiveQL from Beeline
# - query filename
queryFile <- "wdqs_CollectData.hql"
# - Kerberos Init
kerberosPrefix <- 'sudo -u analytics-privatedata kerberos-run-command analytics-privatedata '
# - dateRange
startDate <- '2020/04/01'
endDate <- '2020/04/21'
dateRange <- seq(as.Date(startDate), as.Date(endDate), by = "day")
# - query over dateRange
for (i in 1:length(dateRange)) {
  print(paste0("Started: ", dateRange[i]))
  # - year, month, date
  year = substr(dateRange[i], 1, 4)
  month = substr(dateRange[i], 6, 7)
  if (substr(month, 1, 1) == '0') {month = substr(month, 2, 2)}
  day = substr(dateRange[i], 9, 10)
  if (substr(day, 1, 1) == '0') {day = substr(day, 2, 2)}
  # - compose query
  hiveQLquery <- paste0("SELECT meta.dt as dt, 
                          regexp_replace(
                            regexp_replace(
                              regexp_replace(query, '\\n|\\r\\n|\\r|\\n\\r', ' __NEWLINE__ '),
                            '\\t', ' '),
                          '\\000', ' ') as sparql, 
                          format, meta.domain as domain, 
                          http.request_headers['X-BIGDATA-MAX-QUERY-MILLIS'] as max_query_millis, http.method as method, 
                          http.status_code as status_code, 
                          backend_host, datacenter, query_time 
                          FROM event.wdqs_external_sparql_query TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) w
                          WHERE isnotnull(http.status_code) 
                          AND (meta.domain = 'query.wikidata.org') 
                          AND year=", year,
                        " AND month=", month,
                        " AND day=", day, ';')
  filename <- "wdqs_data_sample_TEST_"
  filename <- paste0(filename, dateRange[i], ".tsv")
  # - write hql
  write(hiveQLquery, paste0(fPath, queryFile))
  # - to Report
  print("Fetching sparql_testData from wmf.webrequest now.")
  # - Run query
  query <- system(command = paste(kerberosPrefix, 
                                  '/usr/local/bin/beeline --incremental=true --silent -f "',
                                  paste0(fPath, queryFile),
                                  '" > ', dataDir, filename,
                                  sep = ""),
                  wait = TRUE)
  print(paste0("Completed: ", dateRange[i]))
  print("--------------------------------------------")
}
# - to Report
print("DONE w. ETL from Hadoop: wmf.webrequest.")

1. Exploratory Data Analytics

All of the following analytics were derived from the datasets produced by the following R script.

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WDQS_Event_S5_ReportingDatasets.R, v. 0.0.1
### --- Script: WDQS_Event_S5_ReportingDatasets.R, v. 0.0.1
### --- Description: WDQS_Event_S5_ReportingDatasets.R is used
### --- only to produce the reporting datasets for the 
### --- Wikidata SPARQL Endpoint Analytics Project.
### --- All data are obtained from the event.wdqs_external_sparql_query 
### --- table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/27
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(httr)
library(pROC)
library(xgboost)
library(caret)
library(fastDummies)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")

fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
# - NOTE: Hetzner local dir
#  - fPath <- '/home/goran/WMDE/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')

### ---------------------------------------------------------------------------
### --- Section 1. Produce reporting dataset
### ---------------------------------------------------------------------------

### --- feature frequencies: vocabulary
filename <- 'queries_vocabulary.csv'
vocabulary <- fread(paste0(analyticsDir, filename))
vocabulary$V1 <- NULL

### --- dataset
filename <- 'wdqs_data_sample_2020-04-22.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - derive: n_conc_queries per datacenter x back_endhost
dataSet$datacenter_host <- paste(dataSet$datacenter,
                                 dataSet$backend_host, 
                                 sep = "_")
dataSet$datacenter <- NULL
dataSet$backend_host <- NULL
# - the number of queries started in the same minute
# - as the target query
# - per datacenter x back_endhost
dataSet$datacenter_host_ymdhm <- paste(dataSet$datacenter_host,
                                       dataSet$ymdhm, 
                                       sep = "_")
dataSet$ymdhm <- NULL
n_conc_queries <- dataSet %>% 
  dplyr::select(datacenter_host_ymdhm) %>% 
  dplyr::group_by(datacenter_host_ymdhm) %>% 
  dplyr::summarise(n_conc_queries = n())
dataSet <- dataSet %>% 
  dplyr::left_join(n_conc_queries, by = "datacenter_host_ymdhm")
dataSet$datacenter_host_ymdhm <- NULL
rm(n_conc_queries)
# - derive: mild and extreme outliers on query_time
# - query_time >= q3 + 3*iq 
# - q3 = 75 percentile, q1 = 25 percentile
q3 <- quantile(dataSet$query_time, .75)
q1 <- quantile(dataSet$query_time, .25)
# - iq = q3 - q1; Interquartile range
iq <- unname(q3 - q1)
upper_outer_fence <- q3 + 3*iq
upper_inner_fence <- q3 + 1.5*iq
dataSet$query_time_extreme <- 
  ifelse(dataSet$query_time >= upper_outer_fence, 1, 0)
dataSet$query_time_mild <- 
  ifelse(dataSet$query_time >= upper_inner_fence, 1, 0)
# - derive multiclass DV: 0 - mild, mild - extreme, extreme query_time
dataSet$query_time_class <- 0
dataSet$query_time_class[dataSet$query_time_extreme == 0 & dataSet$query_time_mild == 1] <- 1
dataSet$query_time_class[dataSet$query_time_extreme == 1] <- 2
# - derive: below and above median query_time
dataSet$query_time_median <- ifelse(dataSet$query_time >= median(dataSet$query_time), 1, 0)

### --- vocabulary: top_f most frequently used features
top_f <- 300
ob_features <- c('nchar', '__vars__', '__vars_usage__', '__literal__', 
                 '__vars_label__', '__vars_label_usage__', '__iri_ref__')
vocabulary <- vocabulary$feature[1:top_f]
vocabulary <- unique(c(vocabulary, ob_features))

### --- join vocabulary to dataSet
queries_processed <- fread(paste0(analyticsDir, 'queries_processed.csv'))
queries_processed$V1 <- NULL
dim(queries_processed)
queries_processed <- queries_processed %>% 
  dplyr::filter(feature %in% vocabulary)
rm(vocabulary)
dim(queries_processed)
featureFrame <- data.table::dcast(as.data.table(queries_processed),
                                  uniqueSparqlId ~ feature,
                                  value.var = "freq")
rm(queries_processed); gc()
# - mark features in dataSet: ds
colnames(dataSet) <- paste0("ds_", colnames(dataSet))
# - join:
dataSet <- dataSet %>% 
  dplyr::left_join(featureFrame, 
                   by = c('ds_uniqueSparqlId' = 'uniqueSparqlId'))
rm(featureFrame); gc()
colnames(dataSet)

# - store reporting data.frame: reportingFrame.csv
filename <- paste0('reportingFrame', '_', top_f, '.csv')
write.csv(dataSet,
          paste0(reportingDir, filename))

# - clear
rm(list = setdiff(ls(), c('reportingDir', 'analyticsDir', 'dataDir', 'fPath')))
gc()

### ---------------------------------------------------------------------------
### --- Section 2. Exploratory Data Analysis
### ---------------------------------------------------------------------------

### --- load reporting data.frame
top_f <- 300
filename <- paste0('reportingFrame', '_', top_f, '.csv')
reportFrame <- fread(paste0(reportingDir, filename))
reportFrame$V1 <- NULL

### --- reportFrame$recode ds_day to day of week
sampleDays <- ifelse(nchar(reportFrame$ds_day) == 1, 
                     paste0('0', reportFrame$ds_day),
                     reportFrame$ds_day)
reportFrame$ds_day <- paste0("2020-04-", sampleDays)
reportFrame$ds_day <- weekdays(as.Date(reportFrame$ds_day))
rm(sampleDays)

### --- Sample characteristics
# - external (non-SPARQL parsing based) fatures in the dataset
ds_features <- 
  colnames(reportFrame)[which(grepl("^ds_", colnames(reportFrame)))]

# - days
sampleDays <- as.data.frame(table(reportFrame$ds_day))
colnames(sampleDays) <- c('Day', 'Num.Queries')
sampleDays$Percent <- round(
  sampleDays$Num.Queries/sum(sampleDays$Num.Queries)*100, 2)
sampleDays <-dplyr::arrange(sampleDays, desc(Percent))
write.csv(sampleDays, paste0(reportingDir, 'sampleChar_Days.csv'))
# - hours
sampleHours <- as.data.frame(table(reportFrame$ds_hour))
colnames(sampleHours) <- c('Hour', 'Num.Queries')
sampleHours$Percent <- round(
  sampleHours$Num.Queries/sum(sampleHours$Num.Queries)*100, 2)
sampleHours <-dplyr::arrange(sampleHours, desc(Percent))
write.csv(sampleHours, paste0(reportingDir, 'sampleChar_Hours.csv'))
# - HTTP method
sampleMethod <- as.data.frame(table(reportFrame$ds_method))
colnames(sampleMethod) <- c('HTTP.method', 'Num.Queries')
sampleMethod$Percent <- round(
  sampleMethod$Num.Queries/sum(sampleMethod$Num.Queries)*100, 2)
sampleMethod <-dplyr::arrange(sampleMethod, desc(Percent))
write.csv(sampleMethod, paste0(reportingDir, 'sampleChar_HTTPmethod.csv'))
# - datacenter_host
sample_datacenter_host <- as.data.frame(table(reportFrame$ds_datacenter_host))
colnames(sample_datacenter_host) <- c('Datacenter.Host', 'Num.Queries')
sample_datacenter_host$Percent <- round(
  sample_datacenter_host$Num.Queries/sum(sample_datacenter_host$Num.Queries)*100, 2)
sample_datacenter_host <-dplyr::arrange(sample_datacenter_host, desc(Percent))
write.csv(sample_datacenter_host, paste0(reportingDir, 'sampleChar_DatacenterHost.csv'))
# - Status Code
sample_StatusCode <- as.data.frame(table(reportFrame$ds_status_code))
colnames(sample_StatusCode) <- c('Status', 'Num.Queries')
sample_StatusCode$Percent <- round(
  sample_StatusCode$Num.Queries/sum(sample_StatusCode$Num.Queries)*100, 2)
sample_StatusCode <-dplyr::arrange(sample_StatusCode, desc(Percent))
write.csv(sample_StatusCode, paste0(reportingDir, 'sampleChar_StatusCode.csv'))
# - Format
sample_Format <- as.data.frame(table(reportFrame$ds_format))
colnames(sample_Format) <- c('Format', 'Num.Queries')
sample_Format$Percent <- round(
  sample_Format$Num.Queries/sum(sample_Format$Num.Queries)*100, 2)
sample_Format <-dplyr::arrange(sample_Format, desc(Percent))
write.csv(sample_Format, paste0(reportingDir, 'sample_Format.csv'))
# - Distribution of the number of queries 
# - concurrently started in the same minute
reportFrame$ds_day_hour_minute <- paste(reportFrame$ds_day,
                                        reportFrame$ds_hour,
                                        reportFrame$ds_minute,
                                        sep = "-")
sample_datacenter_host_qrqueries <- reportFrame %>% 
  dplyr::select(ds_datacenter_host, ds_day_hour_minute) %>% 
  dplyr::group_by(ds_datacenter_host, ds_day_hour_minute) %>% 
  dplyr::summarise(query_count = n()) %>%
  dplyr::select(ds_datacenter_host, query_count) %>% 
  dplyr::group_by(ds_datacenter_host) %>% 
  dplyr::summarise(mean_conc_queries = mean(query_count), 
                   median_conc_queries = median(query_count), 
                   stdev_conc_queries = sd(query_count)) %>%
  dplyr::arrange(desc(mean_conc_queries))
write.csv(sample_datacenter_host_qrqueries, 
          paste0(reportingDir, 'sample_datacenter_host_qrqueries.csv'))
reportFrame$ds_day_hour_minute <- NULL

### --- Produce repeated queries dataset
repeatedQueries <- reportFrame %>% 
  dplyr::select(ds_uniqueSparqlId, 
                ds_sparql, 
                ds_query_time)
colnames(repeatedQueries)
repeatedQueries <- repeatedQueries %>% 
  dplyr::group_by(ds_uniqueSparqlId, ds_sparql) %>% 
  dplyr::summarise(num = n(),
                   mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time),
                   stdev_query_time = sd(ds_query_time))
repeatedQueries$Percent <- round(
  repeatedQueries$num/sum(repeatedQueries$num)*100, 6)
repeatedQueries <- dplyr::arrange(repeatedQueries, desc(num))
write.csv(repeatedQueries, 
          paste0(reportingDir, 'repeatedQueries.csv'))
rm(repeatedQueries); gc()

### --- Cross-Tabulation: ds_* features x query_time
# - ds_format
crossTabs_format <- reportFrame %>% 
  dplyr::select(ds_format, ds_query_time) %>% 
  dplyr::group_by(ds_format) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_format, 
          paste0(reportingDir, 'crossTabs_format.csv'))
# - ds_day
crossTabs_day <- reportFrame %>% 
  dplyr::select(ds_day, ds_query_time) %>% 
  dplyr::group_by(ds_day) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_day, 
          paste0(reportingDir, 'crossTabs_day.csv'))
# - ds_method
crossTabs_method <- reportFrame %>% 
  dplyr::select(ds_method, ds_query_time) %>% 
  dplyr::group_by(ds_method) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_method, 
          paste0(reportingDir, 'crossTabs_method.csv'))
# - ds_datacenter_host
crossTabs_datacenter_host <- reportFrame %>% 
  dplyr::select(ds_datacenter_host, ds_query_time) %>% 
  dplyr::group_by(ds_datacenter_host) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_datacenter_host, 
          paste0(reportingDir, 'crossTabs_datacenter_host.csv'))

# - ds_status_code
crossTabs_status_code <- reportFrame %>% 
  dplyr::select(ds_status_code, ds_query_time) %>% 
  dplyr::group_by(ds_status_code) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_status_code, 
          paste0(reportingDir, 'crossTabs_status_code.csv'))

1.1 Sample Description

library(ggplot2)
library(ggrepel)
library(scales)

First we provide descriptive statistics for the most important variables in the data set.

1.1.1 The distribution of queries per day of week

pFrame <- read.csv('sampleChar_Days.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Day <- factor(pFrame$Day,
                     levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(data = pFrame, 
       aes(x = Day, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()

1.1.2 The distribution of queries per hour of day

pFrame <- read.csv('sampleChar_Hours.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Hour <- factor(pFrame$Hour,
                     levels = as.character(0:23))
ggplot(data = pFrame, 
       aes(x = Hour, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()

1.1.3A The distribution of queries per WMF Datacenter/Host

pFrame <- read.csv('sampleChar_DatacenterHost.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Datacenter.Host <- factor(pFrame$Datacenter.Host,
                     levels = unique(pFrame$Datacenter.Host))
ggplot(data = pFrame, 
       aes(x = Datacenter.Host, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.1.3B The mean and median number of queries concurrently started in the same minute, per WMF Datacenter/Host

pFrame <- read.csv('sample_datacenter_host_qrqueries.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Datacenter.Host <- factor(pFrame$ds_datacenter_host,
                     levels = unique(pFrame$ds_datacenter_host))
ggplot(data = pFrame, 
       aes(x = ds_datacenter_host, 
           y = mean_conc_queries, 
           label = paste0("Med: ", median_conc_queries))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('WMF Datacenter/Host') + ylab('Mean Response Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.1.4 The distribution of queries per HTTP method of request

pFrame <- read.csv('sampleChar_HTTPmethod.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$HTTP.method <- factor(pFrame$HTTP.method,
                     levels = unique(pFrame$HTTP.method))
ggplot(data = pFrame, 
       aes(x = HTTP.method, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkblue', fill = 'blue', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()

1.1.5 The distribution of queries per server response code

pFrame <- read.csv('sampleChar_StatusCode.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Status<- as.character(pFrame$Status)
ggplot(data = pFrame, 
       aes(x = Status, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) +
  scale_y_continuous(labels = comma) + 
  geom_bar(width = .15, color = 'darkorange', fill = 'orange', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()

1.1.6 The distribution of queries per output format

pFrame <- read.csv('sample_Format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Format<- factor(pFrame$Format, 
                       levels = unique(pFrame$Format))
ggplot(data = pFrame, 
       aes(x = Format, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkviolet', fill = 'violet', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  theme_bw()

1.2 WDQS processing times

1.2.1 Day of week vs. WDQS response time

pFrame <- read.csv('crossTabs_day.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_day <- factor(pFrame$ds_day,
                        levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(data = pFrame, 
       aes(x = ds_day, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Day') + ylab('Mean Query Time (secs)') +
  theme_bw()

1.2.2 Hour of day vs. WDQS response time

pFrame <- read.csv('crossTabs_hour.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_hour <- factor(pFrame$ds_hour,
                     levels = as.character(0:23))
ggplot(data = pFrame, 
       aes(x = ds_hour, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Hour') + ylab('Mean Query Time (secs)') +
  theme_bw()

1.2.3 WMF Datacenter/Host vs. WDQS response time

pFrame <- read.csv('crossTabs_datacenter_host.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_datacenter_host <- factor(pFrame$ds_datacenter_host,
                                 levels = unique(pFrame$ds_datacenter_host))
ggplot(data = pFrame, 
       aes(x = ds_datacenter_host, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('WMF/Datacenter/Host') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.2.4 HTTP method of request vs. WDQS response time

pFrame <- read.csv('crossTabs_method.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_method <- factor(pFrame$ds_method,
                           levels = unique(pFrame$ds_method))
ggplot(data = pFrame, 
       aes(x = ds_method, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkblue', fill = 'blue', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('HTTP method of request') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.2.5 Server response code vs. WDQS response time

pFrame <- read.csv('crossTabs_status_code.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_status_code <- factor(pFrame$ds_status_code,
                                levels = unique(pFrame$ds_status_code))
ggplot(data = pFrame, 
       aes(x = ds_status_code, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkorange', fill = 'orange', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Status Code') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.2.6 Output format vs. WDQS response time

pFrame <- read.csv('crossTabs_format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_format <- factor(pFrame$ds_format,
                           levels = unique(pFrame$ds_format))
ggplot(data = pFrame, 
       aes(x = ds_format, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkviolet', fill = 'violet', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Status Code') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))

1.3 The distribution of WDQS processing times

1.3.1 The distribution of WDQS processing times across WMF Datacenter/Hosts

NOTE. Please mind that we use a logarithmic scale for the WDQS query processing times. The distributions are highly skewed on their natural scales.

pFrame <- read.csv('qtDistribution_DatacenterHost.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
ggplot(data = pFrame, 
       aes(x = log(ds_query_time))) +
  geom_density(size = .25, color = "darkgreen", fill="green", alpha = .2) +
  facet_wrap(~ds_datacenter_host) +
  xlab('log(WDQS processing time)') + ylab('Density') + 
  ylim(c(0, 1)) + 
  theme_bw()+ 
  theme(axis.text.x =  element_text(angle = 90))

1.3.2 The distribution of WDQS processing times across output formats

NOTE. The text/plain output format has two few observations.

pFrame <- read.csv('qtDistribution_Format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
ggplot(data = pFrame, 
       aes(x = log(ds_query_time))) +
  geom_density(size = .25, color = "darkviolet", fill="violet", alpha = .2) +
  facet_wrap(~ds_format) +
  xlab('log(WDQS processing time)') + ylab('Density') + 
  ylim(c(0, 1)) + 
  theme_bw()+ 
  theme(axis.text.x =  element_text(angle = 90))


Goran S. Milovanović

Wikimedia Deutschland, Data Scientist DataKolektiv, Owner

contact:

---
title: WDQS Endpoint Analytics - Optimizing Server Response Time - Part A (EDA)
author:
- name: Goran S. Milovanović
  affiliation: Wikimedia Deutschland, Data Scientist, DataKolektiv, Owner
output:
  html_notebook:
    code_folding: hide
    theme: spacelab
    toc: yes
    toc_float: yes
    toc_depth: 5
  html_document:
    toc: yes
    toc_depth: 5
---
***

**Report timestamp:** `r {Sys.time()}`

***

**Reference Phabricator Ticket:** [T248308](https://phabricator.wikimedia.org/T248308)

***
### Notebook: Optimizing WDQS Response Time From SPARQL Queries Analysis
**Feedback** should be send to `goran.milovanovic_ext@wikimedia.de`. 

## **Summary** 

[WMDE](https://www.wikimedia.de/) together with [php.cc](https://thephp.cc/welcome) is trying to find a way to optimize the [Wikidata Query Service](https://www.mediawiki.org/wiki/Wikidata_Query_Service). This Report provides (a) and Exploratory Data Analysis (EDA) approach to WDQS response times (Part A) and (2) describes a Machine Learning (ML) approach (Part B) to analyze the SPARQL queries that were received by the endpoint in the first three weeks of April 2020, with the goal of developing a model able to predict whether a query would be **critical** for WDQS/Blazegraph or not. The motivation for the ML approach to the problem is the following one: by singling out the most important features that are present in the SPARQL queries under analysis we hope to be able to understand better what types of queries typically cause problems.

**This, Part A, of the Report, provides the Exploratory Data Analytics(EDA) on the WDQS query response times. We analyze the query response times across a range of variables obtained directly from the [`event.wdqs_external_sparql_query`](https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/master/jsonschema/sparql/query/1.0.0.yaml) schema in the [WMF Data Lake](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake).**

In the following sections we describe **(1)** the characteristics of the sample of SPARQL queries used in this study, **(2)** the overview of the number of queries run per (a) day of week, (b) hour of day, (c) WMF Datacenter/Host, (d) HTTP method of request, (e) server HTTP response code, and (f) the desired output format, **(3)** the mean and median WDQS query processing times across the mentioned (a) - (f) variables, and **(4)** the distributions of WDQS processing times across WMF Datacenter/Hosts and output format.

**Summary** 

* The `eqiad` data center is receiving tons of queries in comparison to `codfw`.
* The `XML` output format seems to take much more to process in comparison to `JSON` and `text/plain` (except for we really have only few observations of `text/plain` in the sample).
* The distributions of the WDQS processing time across the crucial variables (WMF Datacenter/Host, Output format) are highly skewed towards short processing times - so we really need to focus on the outliers seriously.


***

## 0. Sample of SPARQL queries observed at the WDQS SPARQL endpoint

The SPARQL queries that hit the endpoint are registered in the `event.wdqs_external_sparql_query` table in the WMF Data Lake (our Hadoop Big Data storage). A [bucketing approach](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) to sample data from this Hive table was taken, sampling approximately `1%` of queries that were observed in the period between `2020/04/01` and `2020/04/21`. The sample is approximately uniformly distributed across days and hours of the mentioned period of time. The sample size is slightly higher than `1M` SPARQL queries - including duplicates.

The following fields were derived from the `event.wdqs_external_sparql_query` table for each collected SPARQL query:

- `ds_sparql` - the SPARQL query
- `ds_dt` - the query timestamp, later parsed into `year`, `month`, `day`, `hour`, and `minute`
- `ds_status_code` - the server http status response for this query
- `ds_query_time` - how many seconds did it took for the WDQS to process the query
- `ds_datacenter_host` - a combination of the (a) WMF data center and (b) a back-end host in the respective data center
- `ds_format` - content type (e.g. JSON, text/tsv, xml, etc.)
- `ds_method` - HTTP method (e.g. GET, POST)
- `ds_n_conc_queries` - for each SPARQL query, the number of SPARQL whose processing started in (1) the same minute, on (2) the same data center, and on (3) the same back-end host.

The following R script was used to obtain a sample of SPARQL queries from `event.wdqs_external_sparql_query`.

```{r echo = T, eval = F, message = F}
#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WDQS_Event_S1_ETL.R v 0.0.1
### --- Script: WDQS_Event_S1_ETL.R, v. Beta 0.1
### --- Description: WDQS_Event_S1_ETL.R collects a sample of 
### --- SPARQL queries from event.wdqs_external_sparql_query in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/22
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- Section 1. Collect raw data set event.wdqs_external_sparql_query
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = F)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')

### --- grab a sample of SPARQL queries event.wdqs_external_sparql_query
### --- w. HiveQL from Beeline
# - query filename
queryFile <- "wdqs_CollectData.hql"
# - Kerberos Init
kerberosPrefix <- 'sudo -u analytics-privatedata kerberos-run-command analytics-privatedata '
# - dateRange
startDate <- '2020/04/01'
endDate <- '2020/04/21'
dateRange <- seq(as.Date(startDate), as.Date(endDate), by = "day")
# - query over dateRange
for (i in 1:length(dateRange)) {
  print(paste0("Started: ", dateRange[i]))
  # - year, month, date
  year = substr(dateRange[i], 1, 4)
  month = substr(dateRange[i], 6, 7)
  if (substr(month, 1, 1) == '0') {month = substr(month, 2, 2)}
  day = substr(dateRange[i], 9, 10)
  if (substr(day, 1, 1) == '0') {day = substr(day, 2, 2)}
  # - compose query
  hiveQLquery <- paste0("SELECT meta.dt as dt, 
                          regexp_replace(
                            regexp_replace(
                              regexp_replace(query, '\\n|\\r\\n|\\r|\\n\\r', ' __NEWLINE__ '),
                            '\\t', ' '),
                          '\\000', ' ') as sparql, 
                          format, meta.domain as domain, 
                          http.request_headers['X-BIGDATA-MAX-QUERY-MILLIS'] as max_query_millis, http.method as method, 
                          http.status_code as status_code, 
                          backend_host, datacenter, query_time 
                          FROM event.wdqs_external_sparql_query TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) w
                          WHERE isnotnull(http.status_code) 
                          AND (meta.domain = 'query.wikidata.org') 
                          AND year=", year,
                        " AND month=", month,
                        " AND day=", day, ';')
  filename <- "wdqs_data_sample_TEST_"
  filename <- paste0(filename, dateRange[i], ".tsv")
  # - write hql
  write(hiveQLquery, paste0(fPath, queryFile))
  # - to Report
  print("Fetching sparql_testData from wmf.webrequest now.")
  # - Run query
  query <- system(command = paste(kerberosPrefix, 
                                  '/usr/local/bin/beeline --incremental=true --silent -f "',
                                  paste0(fPath, queryFile),
                                  '" > ', dataDir, filename,
                                  sep = ""),
                  wait = TRUE)
  print(paste0("Completed: ", dateRange[i]))
  print("--------------------------------------------")
}
# - to Report
print("DONE w. ETL from Hadoop: wmf.webrequest.")
```


## 1. Exploratory Data Analytics 

All of the following analytics were derived from the datasets produced by the following R script.

```{r echo = T, eval = F, message = F}

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WDQS_Event_S5_ReportingDatasets.R, v. 0.0.1
### --- Script: WDQS_Event_S5_ReportingDatasets.R, v. 0.0.1
### --- Description: WDQS_Event_S5_ReportingDatasets.R is used
### --- only to produce the reporting datasets for the 
### --- Wikidata SPARQL Endpoint Analytics Project.
### --- All data are obtained from the event.wdqs_external_sparql_query 
### --- table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/27
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(httr)
library(pROC)
library(xgboost)
library(caret)
library(fastDummies)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")

fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
# - NOTE: Hetzner local dir
#  - fPath <- '/home/goran/WMDE/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')

### ---------------------------------------------------------------------------
### --- Section 1. Produce reporting dataset
### ---------------------------------------------------------------------------

### --- feature frequencies: vocabulary
filename <- 'queries_vocabulary.csv'
vocabulary <- fread(paste0(analyticsDir, filename))
vocabulary$V1 <- NULL

### --- dataset
filename <- 'wdqs_data_sample_2020-04-22.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - derive: n_conc_queries per datacenter x back_endhost
dataSet$datacenter_host <- paste(dataSet$datacenter,
                                 dataSet$backend_host, 
                                 sep = "_")
dataSet$datacenter <- NULL
dataSet$backend_host <- NULL
# - the number of queries started in the same minute
# - as the target query
# - per datacenter x back_endhost
dataSet$datacenter_host_ymdhm <- paste(dataSet$datacenter_host,
                                       dataSet$ymdhm, 
                                       sep = "_")
dataSet$ymdhm <- NULL
n_conc_queries <- dataSet %>% 
  dplyr::select(datacenter_host_ymdhm) %>% 
  dplyr::group_by(datacenter_host_ymdhm) %>% 
  dplyr::summarise(n_conc_queries = n())
dataSet <- dataSet %>% 
  dplyr::left_join(n_conc_queries, by = "datacenter_host_ymdhm")
dataSet$datacenter_host_ymdhm <- NULL
rm(n_conc_queries)
# - derive: mild and extreme outliers on query_time
# - query_time >= q3 + 3*iq 
# - q3 = 75 percentile, q1 = 25 percentile
q3 <- quantile(dataSet$query_time, .75)
q1 <- quantile(dataSet$query_time, .25)
# - iq = q3 - q1; Interquartile range
iq <- unname(q3 - q1)
upper_outer_fence <- q3 + 3*iq
upper_inner_fence <- q3 + 1.5*iq
dataSet$query_time_extreme <- 
  ifelse(dataSet$query_time >= upper_outer_fence, 1, 0)
dataSet$query_time_mild <- 
  ifelse(dataSet$query_time >= upper_inner_fence, 1, 0)
# - derive multiclass DV: 0 - mild, mild - extreme, extreme query_time
dataSet$query_time_class <- 0
dataSet$query_time_class[dataSet$query_time_extreme == 0 & dataSet$query_time_mild == 1] <- 1
dataSet$query_time_class[dataSet$query_time_extreme == 1] <- 2
# - derive: below and above median query_time
dataSet$query_time_median <- ifelse(dataSet$query_time >= median(dataSet$query_time), 1, 0)

### --- vocabulary: top_f most frequently used features
top_f <- 300
ob_features <- c('nchar', '__vars__', '__vars_usage__', '__literal__', 
                 '__vars_label__', '__vars_label_usage__', '__iri_ref__')
vocabulary <- vocabulary$feature[1:top_f]
vocabulary <- unique(c(vocabulary, ob_features))

### --- join vocabulary to dataSet
queries_processed <- fread(paste0(analyticsDir, 'queries_processed.csv'))
queries_processed$V1 <- NULL
dim(queries_processed)
queries_processed <- queries_processed %>% 
  dplyr::filter(feature %in% vocabulary)
rm(vocabulary)
dim(queries_processed)
featureFrame <- data.table::dcast(as.data.table(queries_processed),
                                  uniqueSparqlId ~ feature,
                                  value.var = "freq")
rm(queries_processed); gc()
# - mark features in dataSet: ds
colnames(dataSet) <- paste0("ds_", colnames(dataSet))
# - join:
dataSet <- dataSet %>% 
  dplyr::left_join(featureFrame, 
                   by = c('ds_uniqueSparqlId' = 'uniqueSparqlId'))
rm(featureFrame); gc()
colnames(dataSet)

# - store reporting data.frame: reportingFrame.csv
filename <- paste0('reportingFrame', '_', top_f, '.csv')
write.csv(dataSet,
          paste0(reportingDir, filename))

# - clear
rm(list = setdiff(ls(), c('reportingDir', 'analyticsDir', 'dataDir', 'fPath')))
gc()

### ---------------------------------------------------------------------------
### --- Section 2. Exploratory Data Analysis
### ---------------------------------------------------------------------------

### --- load reporting data.frame
top_f <- 300
filename <- paste0('reportingFrame', '_', top_f, '.csv')
reportFrame <- fread(paste0(reportingDir, filename))
reportFrame$V1 <- NULL

### --- reportFrame$recode ds_day to day of week
sampleDays <- ifelse(nchar(reportFrame$ds_day) == 1, 
                     paste0('0', reportFrame$ds_day),
                     reportFrame$ds_day)
reportFrame$ds_day <- paste0("2020-04-", sampleDays)
reportFrame$ds_day <- weekdays(as.Date(reportFrame$ds_day))
rm(sampleDays)

### --- Sample characteristics
# - external (non-SPARQL parsing based) fatures in the dataset
ds_features <- 
  colnames(reportFrame)[which(grepl("^ds_", colnames(reportFrame)))]

# - days
sampleDays <- as.data.frame(table(reportFrame$ds_day))
colnames(sampleDays) <- c('Day', 'Num.Queries')
sampleDays$Percent <- round(
  sampleDays$Num.Queries/sum(sampleDays$Num.Queries)*100, 2)
sampleDays <-dplyr::arrange(sampleDays, desc(Percent))
write.csv(sampleDays, paste0(reportingDir, 'sampleChar_Days.csv'))
# - hours
sampleHours <- as.data.frame(table(reportFrame$ds_hour))
colnames(sampleHours) <- c('Hour', 'Num.Queries')
sampleHours$Percent <- round(
  sampleHours$Num.Queries/sum(sampleHours$Num.Queries)*100, 2)
sampleHours <-dplyr::arrange(sampleHours, desc(Percent))
write.csv(sampleHours, paste0(reportingDir, 'sampleChar_Hours.csv'))
# - HTTP method
sampleMethod <- as.data.frame(table(reportFrame$ds_method))
colnames(sampleMethod) <- c('HTTP.method', 'Num.Queries')
sampleMethod$Percent <- round(
  sampleMethod$Num.Queries/sum(sampleMethod$Num.Queries)*100, 2)
sampleMethod <-dplyr::arrange(sampleMethod, desc(Percent))
write.csv(sampleMethod, paste0(reportingDir, 'sampleChar_HTTPmethod.csv'))
# - datacenter_host
sample_datacenter_host <- as.data.frame(table(reportFrame$ds_datacenter_host))
colnames(sample_datacenter_host) <- c('Datacenter.Host', 'Num.Queries')
sample_datacenter_host$Percent <- round(
  sample_datacenter_host$Num.Queries/sum(sample_datacenter_host$Num.Queries)*100, 2)
sample_datacenter_host <-dplyr::arrange(sample_datacenter_host, desc(Percent))
write.csv(sample_datacenter_host, paste0(reportingDir, 'sampleChar_DatacenterHost.csv'))
# - Status Code
sample_StatusCode <- as.data.frame(table(reportFrame$ds_status_code))
colnames(sample_StatusCode) <- c('Status', 'Num.Queries')
sample_StatusCode$Percent <- round(
  sample_StatusCode$Num.Queries/sum(sample_StatusCode$Num.Queries)*100, 2)
sample_StatusCode <-dplyr::arrange(sample_StatusCode, desc(Percent))
write.csv(sample_StatusCode, paste0(reportingDir, 'sampleChar_StatusCode.csv'))
# - Format
sample_Format <- as.data.frame(table(reportFrame$ds_format))
colnames(sample_Format) <- c('Format', 'Num.Queries')
sample_Format$Percent <- round(
  sample_Format$Num.Queries/sum(sample_Format$Num.Queries)*100, 2)
sample_Format <-dplyr::arrange(sample_Format, desc(Percent))
write.csv(sample_Format, paste0(reportingDir, 'sample_Format.csv'))
# - Distribution of the number of queries 
# - concurrently started in the same minute
reportFrame$ds_day_hour_minute <- paste(reportFrame$ds_day,
                                        reportFrame$ds_hour,
                                        reportFrame$ds_minute,
                                        sep = "-")
sample_datacenter_host_qrqueries <- reportFrame %>% 
  dplyr::select(ds_datacenter_host, ds_day_hour_minute) %>% 
  dplyr::group_by(ds_datacenter_host, ds_day_hour_minute) %>% 
  dplyr::summarise(query_count = n()) %>%
  dplyr::select(ds_datacenter_host, query_count) %>% 
  dplyr::group_by(ds_datacenter_host) %>% 
  dplyr::summarise(mean_conc_queries = mean(query_count), 
                   median_conc_queries = median(query_count), 
                   stdev_conc_queries = sd(query_count)) %>%
  dplyr::arrange(desc(mean_conc_queries))
write.csv(sample_datacenter_host_qrqueries, 
          paste0(reportingDir, 'sample_datacenter_host_qrqueries.csv'))
reportFrame$ds_day_hour_minute <- NULL

### --- Produce repeated queries dataset
repeatedQueries <- reportFrame %>% 
  dplyr::select(ds_uniqueSparqlId, 
                ds_sparql, 
                ds_query_time)
colnames(repeatedQueries)
repeatedQueries <- repeatedQueries %>% 
  dplyr::group_by(ds_uniqueSparqlId, ds_sparql) %>% 
  dplyr::summarise(num = n(),
                   mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time),
                   stdev_query_time = sd(ds_query_time))
repeatedQueries$Percent <- round(
  repeatedQueries$num/sum(repeatedQueries$num)*100, 6)
repeatedQueries <- dplyr::arrange(repeatedQueries, desc(num))
write.csv(repeatedQueries, 
          paste0(reportingDir, 'repeatedQueries.csv'))
rm(repeatedQueries); gc()

### --- Cross-Tabulation: ds_* features x query_time
# - ds_format
crossTabs_format <- reportFrame %>% 
  dplyr::select(ds_format, ds_query_time) %>% 
  dplyr::group_by(ds_format) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_format, 
          paste0(reportingDir, 'crossTabs_format.csv'))
# - ds_day
crossTabs_day <- reportFrame %>% 
  dplyr::select(ds_day, ds_query_time) %>% 
  dplyr::group_by(ds_day) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_day, 
          paste0(reportingDir, 'crossTabs_day.csv'))
# - ds_method
crossTabs_method <- reportFrame %>% 
  dplyr::select(ds_method, ds_query_time) %>% 
  dplyr::group_by(ds_method) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_method, 
          paste0(reportingDir, 'crossTabs_method.csv'))
# - ds_datacenter_host
crossTabs_datacenter_host <- reportFrame %>% 
  dplyr::select(ds_datacenter_host, ds_query_time) %>% 
  dplyr::group_by(ds_datacenter_host) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_datacenter_host, 
          paste0(reportingDir, 'crossTabs_datacenter_host.csv'))

# - ds_status_code
crossTabs_status_code <- reportFrame %>% 
  dplyr::select(ds_status_code, ds_query_time) %>% 
  dplyr::group_by(ds_status_code) %>% 
  dplyr::summarise(mean_query_time = mean(ds_query_time), 
                   median_query_time = median(ds_query_time), 
                   stdev_query_time = sd(ds_query_time)) %>% 
  dplyr::arrange(desc(mean_query_time))
write.csv(crossTabs_status_code, 
          paste0(reportingDir, 'crossTabs_status_code.csv'))
```

### 1.1 Sample Description

```{r echo = T, eval = T, message = F}
library(ggplot2)
library(ggrepel)
library(scales)
```

First we provide descriptive statistics for the most important variables in the data set.

#### 1.1.1 The distribution of queries per day of week

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sampleChar_Days.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Day <- factor(pFrame$Day,
                     levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(data = pFrame, 
       aes(x = Day, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()
```

#### 1.1.2 The distribution of queries per hour of day

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sampleChar_Hours.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Hour <- factor(pFrame$Hour,
                     levels = as.character(0:23))
ggplot(data = pFrame, 
       aes(x = Hour, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()
```

#### 1.1.3A The distribution of queries per WMF Datacenter/Host

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sampleChar_DatacenterHost.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Datacenter.Host <- factor(pFrame$Datacenter.Host,
                     levels = unique(pFrame$Datacenter.Host))
ggplot(data = pFrame, 
       aes(x = Datacenter.Host, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.1.3B The mean and median number of queries concurrently started in the same minute, per WMF Datacenter/Host

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sample_datacenter_host_qrqueries.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Datacenter.Host <- factor(pFrame$ds_datacenter_host,
                     levels = unique(pFrame$ds_datacenter_host))
ggplot(data = pFrame, 
       aes(x = ds_datacenter_host, 
           y = mean_conc_queries, 
           label = paste0("Med: ", median_conc_queries))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('WMF Datacenter/Host') + ylab('Mean Response Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.1.4 The distribution of queries per HTTP method of request

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sampleChar_HTTPmethod.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$HTTP.method <- factor(pFrame$HTTP.method,
                     levels = unique(pFrame$HTTP.method))
ggplot(data = pFrame, 
       aes(x = HTTP.method, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkblue', fill = 'blue', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()
```

#### 1.1.5 The distribution of queries per server response code

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sampleChar_StatusCode.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Status<- as.character(pFrame$Status)
ggplot(data = pFrame, 
       aes(x = Status, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) +
  scale_y_continuous(labels = comma) + 
  geom_bar(width = .15, color = 'darkorange', fill = 'orange', stat = 'identity') + 
  geom_text_repel(size = 3) + 
  theme_bw()
```

#### 1.1.6 The distribution of queries per output format

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('sample_Format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$Format<- factor(pFrame$Format, 
                       levels = unique(pFrame$Format))
ggplot(data = pFrame, 
       aes(x = Format, 
           y = Num.Queries, 
           label = paste0(Percent, "%"))) + 
  geom_bar(width = .15, color = 'darkviolet', fill = 'violet', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  theme_bw()
```

### 1.2 WDQS processing times

#### 1.2.1 Day of week vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_day.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_day <- factor(pFrame$ds_day,
                        levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(data = pFrame, 
       aes(x = ds_day, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Day') + ylab('Mean Query Time (secs)') +
  theme_bw()
```

#### 1.2.2 Hour of day vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_hour.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_hour <- factor(pFrame$ds_hour,
                     levels = as.character(0:23))
ggplot(data = pFrame, 
       aes(x = ds_hour, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkred', fill = 'red', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Hour') + ylab('Mean Query Time (secs)') +
  theme_bw()
```

#### 1.2.3 WMF Datacenter/Host vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_datacenter_host.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_datacenter_host <- factor(pFrame$ds_datacenter_host,
                                 levels = unique(pFrame$ds_datacenter_host))
ggplot(data = pFrame, 
       aes(x = ds_datacenter_host, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkgreen', fill = 'green', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('WMF/Datacenter/Host') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.2.4 HTTP method of request vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_method.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_method <- factor(pFrame$ds_method,
                           levels = unique(pFrame$ds_method))
ggplot(data = pFrame, 
       aes(x = ds_method, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkblue', fill = 'blue', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('HTTP method of request') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.2.5 Server response code vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_status_code.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_status_code <- factor(pFrame$ds_status_code,
                                levels = unique(pFrame$ds_status_code))
ggplot(data = pFrame, 
       aes(x = ds_status_code, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkorange', fill = 'orange', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Status Code') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.2.6 Output format vs. WDQS response time

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('crossTabs_format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
pFrame$ds_format <- factor(pFrame$ds_format,
                           levels = unique(pFrame$ds_format))
ggplot(data = pFrame, 
       aes(x = ds_format, 
           y = mean_query_time, 
           label = paste0("Med: ", median_query_time))) + 
  geom_bar(width = .15, color = 'darkviolet', fill = 'violet', stat = 'identity') + 
  scale_y_continuous(labels = comma) + 
  geom_text_repel(size = 3) + 
  xlab('Status Code') + ylab('Mean Query Time (secs)') +
  theme_bw() + 
  theme(axis.text.x =  element_text(angle = 90))
```

### 1.3 The distribution of WDQS processing times

#### 1.3.1 The distribution of WDQS processing times across WMF Datacenter/Hosts

**NOTE.** Please mind that we use a logarithmic scale for the WDQS query processing times. The distributions are *highly skewed* on their natural scales.

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('qtDistribution_DatacenterHost.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
ggplot(data = pFrame, 
       aes(x = log(ds_query_time))) +
  geom_density(size = .25, color = "darkgreen", fill="green", alpha = .2) +
  facet_wrap(~ds_datacenter_host) +
  xlab('log(WDQS processing time)') + ylab('Density') + 
  ylim(c(0, 1)) + 
  theme_bw()+ 
  theme(axis.text.x =  element_text(angle = 90))
```

#### 1.3.2 The distribution of WDQS processing times across output formats

**NOTE.** The `text/plain` output format has two few observations.

```{r echo = T, eval = T, message = F}
pFrame <- read.csv('qtDistribution_Format.csv', 
                   header = T, 
                   row.names = 1, 
                   check.names = F,
                   stringsAsFactors = F)
ggplot(data = pFrame, 
       aes(x = log(ds_query_time))) +
  geom_density(size = .25, color = "darkviolet", fill="violet", alpha = .2) +
  facet_wrap(~ds_format) +
  xlab('log(WDQS processing time)') + ylab('Density') + 
  ylim(c(0, 1)) + 
  theme_bw()+ 
  theme(axis.text.x =  element_text(angle = 90))
```

***
Goran S. Milovanović

Wikimedia Deutschland, Data Scientist
DataKolektiv, Owner

2020.

contact: goran.milovanovic_ext@wikimedia.de

![](_img/DK_Logo_100.png)
![](_img/Wikidata-logo-en.png)
![](_img/Wikimedia_Deutschland_Logo_small.png)





