Report timestamp: 2020-04-15 09:55:56


Reference Phabricator Ticket: T248308


Notebook: Optimizing WDQS Response Time From SPARQL Queries Analysis

Feedback should be send to goran.milovanovic_ext@wikimedia.de.

Summary

WMDE together with php.cc is trying to find a way to optimize the Wikidata Query Service. This Report describes a Machine Learning (ML) approach to analyze the SPARQL queries that were received by the endpoint in March 2020, with the goal of developing a model able to predict whether a query would be critical for WDQS/Blazegraph or not. By singling out the most important features that are present in the SPARQL queries under analysis we hope to be able to understand better what types of queries typically cause problems. Beyond that, the development of a predictive model for SPARQL query server response time in itself lays out the foundations for the development of an optimization engine for WDQS, a system that would be able to quickly scan a received query and decide whether the WDQS handling of any given query needs to be adjusted in order to avoid critically high server response times.

In the following sections we describe (1) the sampling approach taken to collect a relevant set of SPARQL queries observed at the (WDQS) endpoint in three weeks of March 2020, (2) the feature engineering approach taken to analyze the composition of queries and extract meaningful “atomic elements” of SPARQL that will be used in an attempt to predict the WDQS response time for a query, (3) the definition of a criterion for selecting a critical query for WDQS (i.e. the one that makes too much use of the service’s resources), (4) the ML (optimization) approach, relying on XGBoost, taken to build a predictive model for WDQS server response times starting from the extracted SPARQL features, and (5) the lessons learned - specifically on what constraints will be met in the case of an eventual development of a WDQS optimization engine in the future - as well as suggestions on what further steps need to be made in order to reach an acceptable level of understanding of the problem and design a prototype of a future solution.


0. Sample of SPARQL queries observed at the WDQS SPARQL endpoint

The SPARQL queries that hit the endpoint are registered in the wmf.webrequest table in the WMF Data Lake (our Hadoop Big Data storage). A bucketing approach to sample data from this Hive table was taken, sampling approximately 1% of queries that were observed in the period between 2020/03/01 and 2020/03/20. The sample is approximately uniformly distributed across days and hours of the mentioned period of time. The sample size is slightly higher than 1M SPARQL queries - including duplicates.

The following fields were collected from the wmf.webrequest table for each collected SPARQL query (see: wmf.webrequest documenation for details):

  • uri_query - the URL encoded SPARQL query w. WDQS query parameters
  • dt - the query timestamp, later parsed into year, month, day, hour, and minute
  • http_status - the server http status response for this query
  • cache_status - the cache status for this query (see: https://wikitech.wikimedia.org/wiki/Caching_overview#Headers)
  • time_firstbyte - how many seconds elapsed before the first byte of the server response was sent out to the client
  • response_size - response size
  • agent_type - spider or human
  • content_type - content type (e.g. JSON, text/tsv, xml, etc.)
  • access_method - desktop, mobile, or app.

NOTE. As noted in the Summary, our attempt will be to analyze the composition of the SPARQL queries and try to predict a binary variable critical response time/non-critical response time which we plan to derive from time_firstbyte (taken to be an approximation to the WDQS response time). It is immediately obvious that only some fields from the wmf.webrequest table are candidates for a predictive model of the described kind: uri_query contains the query and we will use it to extract the query features, which can be done before WDQS query processing, dt is the timestamp which can be known before the query is processed, however, response_size is certainly not available before query processing and cannot be used in any imaginable WDQS optimization system. So, the following fields are useful for exploratory data analysis only: http_status, cache_status, response_size, agent_type, content_type (note: we can learn about this variable from the SPARQL query parameters present in the uri_query field), access_method.

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_CollectData.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_CollectData.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_CollectData.R collects a sample of 
### --- SPARQL queries from wmf.webrequest in the WMF Data Lake,
### --- uri_path = "/sparql" OR uri_path= "/bigdata/namespace/wdq/sparql"
### --- NOTE. "/bigdata/namespace/wdq/sparql" redirects to "/sparql" 
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- Section 1. Collect raw data set from wmf.webrequest
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### --- grab a sample of SPARQL queries from wmf.webrequest
### --- w. HiveQL from Beeline
# - query filename
queryFile <- "wdqs_CollectData.hql"
# - Kerberos Init
kerberosPrefix <- 'sudo -u analytics-privatedata kerberos-run-command analytics-privatedata '
# - dateRange
endDate <- '2020/03/20'
startDate <- '2020/03/01'
dateRange <- seq(as.Date(startDate), as.Date(endDate), by = "day")
# - query over dateRange
for (i in 1:length(dateRange)) {
  print(paste0("Started: ", dateRange[i]))
  # - year, month, date
  year = substr(dateRange[i], 1, 4)
  month = substr(dateRange[i], 6, 7)
  if (substr(month, 1, 1) == '0') {month = substr(month, 2, 2)}
  day = substr(dateRange[i], 9, 10)
  if (substr(day, 1, 1) == '0') {day = substr(day, 2, 2)}
  # - compose query
  hiveQLquery <- paste0('SELECT dt, uri_query, http_status, cache_status, time_firstbyte, response_size, agent_type, content_type, access_method
                          FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) w
                          WHERE (uri_path = "/sparql" OR uri_path= "/bigdata/namespace/wdq/sparql")
                          AND webrequest_source = "text"
                          AND year=', year,
                        ' AND month=', month,
                        ' AND day=', day, ';')
  
  filename <- "wdqs_data_sample_TEST_"
  filename <- paste0(filename, dateRange[i], ".tsv")
  # - write hql
  write(hiveQLquery, paste0(fPath, queryFile))
  # - to Report
  print("Fetching sparql_testData from wmf.webrequest now.")
  # - Run query
  query <- system(command = paste(kerberosPrefix, 
                                  '/usr/local/bin/beeline --incremental=true --silent -f "',
                                  paste0(fPath, queryFile),
                                  '" > ', dataDir, filename,
                                  sep = ""),
                  wait = TRUE)
  print(paste0("Completed: ", dateRange[i]))
  print("--------------------------------------------")
}
# - to Report
print("DONE w. ETL from Hadoop: wmf.webrequest.")

1. Data Cleaning and Feature Engineering

1.1 Data Cleaning

The first step following the sample of SPARQL queries is obtained from wmf.webrequest is to check the data for consistency and urltools::url_decode() decode the queries (R {urltools} package was used). The following chunk of code also parses the dt (timestamp) field and extracts year, month, day, hour, and minute when the query was run. Also, and additional variable is produced from a simple concatenation of year, month, day, hour, and minute: ymdmh. This variable will be subsequently used to calculate the number of queries that were run in a particular minute observed in the sample. This is a potential “environmental” variable (i.e. taken from the WDQS/Blazegraph enviornment) in predictive modeling of server response time. It can characterize the query processing context by estimating how many queries were run against Blazegraph concurrently in the same minute when a particular query under of interest was run. There are possible refinments of this variable that were not explored in this pilot study. Any future WDQS optimization system relying on this variable would have to rely on Blazegraph broadcasting this variable to the predictive modeling API.

NOTE. In this step we eliminate all queries that have resulted in a http_status of 4**: client side errors. Obviously, incorrectly stated SPARQL queries are of no use to us in any attempt to optimize the WDQS processing time. So if a query fails an initial syntax analysis on the endpoint it does not enter our analysis.

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_Clean_Inspect_Data_EDA.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_FeatureEngineering.R performs 
### --- feature engineering of the SPARQL queries data 
### --- collected w. WD_WDQS_Analytics_CollectData.R from the 
### --- wmf.webrequest table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)
library(urltools)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'

dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### ---------------------------------------------------------------------------
### --- Section 1. Load, clean, and decode the SPARQL queries data set
### --- NOTE. Mind the private data (!)
### ---------------------------------------------------------------------------

### --- Sys.setlocale()
Sys.setlocale('LC_ALL','C')

### --- load data
lF <- list.files(dataDir)
lF <- lF[grepl("^wdqs_data_sample", lF)]
lF <- lF[!grepl("TEST", lF)]
dataSet <- lapply(paste0(dataDir, lF), fread)
dataSet <- rbindlist(dataSet)
dim(dataSet)

### --- clean data: http_status
table(dataSet$http_status)
# - inspect http_status x time_firstbyte
dataSet %>% 
  dplyr::select(http_status, time_firstbyte) %>% 
  dplyr::group_by(http_status) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))
# - NOTE. http_status == "-" has min == 0, max == 0, mean == 0, median == 0
# - Remove queries that failed for client errors: http_status == 4**
w <- which(grepl("^4", dataSet$http_status))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}

### --- clean data: cache_status
table(dataSet$cache_status)
# - inspect cache_status x time_firstbyte
dataSet %>% 
  dplyr::select(cache_status, time_firstbyte) %>% 
  dplyr::group_by(cache_status) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))
# - NOTE. cache_status == "-" has min == 0, max == 0, mean == 0, median == 0

### --- clean data: agent_type
table(dataSet$agent_type)
# - inspect agent_type x time_firstbyte
dataSet %>% 
  dplyr::select(agent_type, time_firstbyte) %>% 
  dplyr::group_by(agent_type) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))

### --- clean data: content_type
# - recode content_type
dataSet$content_type <- sapply(dataSet$content_type, function(x) {
  d <- gsub("charset=", "", x)
  d <- gsub("tab-separated-values", "tsv", d)
  return(d)
})
table(dataSet$content_type)
# - inspect content_type x time_firstbyte
dataSet %>% 
  dplyr::select(content_type, time_firstbyte) %>% 
  dplyr::group_by(content_type) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))

### --- clean data: access_method
table(dataSet$access_method)
# - NOTE. Almost always dataSet$access_method == "desktop"

### --- validate dt (timestamp)
w <- which(dataSet$dt == "-")
length(w)
w <- which(!(grepl("\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d:\\d\\d:\\d\\dZ", 
                 dataSet$dt)))
length(w)
### --- extract year, month, day, hour, and minute from dt (timestamp)
dataSet$year <- substr(dataSet$dt, 1, 4)
dataSet$month <- substr(dataSet$dt, 6, 7)
dataSet$day <- substr(dataSet$dt, 9, 10)
dataSet$hour <- substr(dataSet$dt, 12, 13)
dataSet$minute <- substr(dataSet$dt, 15, 16)
dataSet$ymdhm <- paste(dataSet$year, 
                       dataSet$month,
                       dataSet$day,
                       dataSet$hour,
                       dataSet$minute,
                       sep = "-")
dim(dataSet)

### --- clean data: uri_query 
### --- prepare SPARQL queries for parsing
# - decode and clean uri_query
dataSet$uri_query <- urltools::url_decode(dataSet$uri_query)
# - check: dataSet$uri_query == ""
w <- which(dataSet$uri_query == "")
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)
# - check: is.na(dataSet$uri_query)
w <- which(is.na(dataSet$uri_query))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)
w <- which(grepl("^\\s+$", dataSet$uri_query))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)

### --- add ID, uniqueQueryID
dataSet$id <- 1:dim(dataSet)[1]
uniqueQueries <- unique(dataSet$uri_query)
uniqueQueries <- data.frame(uri_query = uniqueQueries, 
                            uniqueQueryId = 1:length(uniqueQueries), 
                            stringsAsFactors = F)
dataSet <- dataSet %>% 
  dplyr::left_join(uniqueQueries, 
                   by = "uri_query")
rm(uniqueQueries)

### --- rearrange, store clean data set
colnames(dataSet)
dataSet <- dataSet %>% 
  dplyr::select('id', 'uniqueQueryId', 'dt', 'year', 'month', 'day', 'hour', 'minute',
                       'uri_query', 'http_status', 'cache_status', 'time_firstbyte', 
                       'response_size', 'agent_type', 'content_type', 'access_method')
filename <- paste0(analyticsDir, 
                   'wdqs_data_sample_', 
                   substr(as.character(Sys.time()), 1, 10), 
                   '.csv')
write.csv(dataSet, filename)

1.2 Feaure Engineering

The SPARQL queries in our sample, present in the uri_query field retrieved from the wmf.webrequest table, need to be parsed in order to extract a set of meaningful features to use in predictive modeling of server response time. As in the case of Natural Language Processing, we need to differentiate between three possible levels of analysis here:

  • syntax: the composition of SPARQL keywords, operators, variables, and functions result in a specific description of the data that need to be obtained, and their analysis on behalf of Blazegraph determines the query execution plan (via Abstract Syntax Tree, AST, derived from query analysis); determines query validity and query execution plan;
  • semantics: the exact mapping of query syntax to particular variables that need to be instantiated in order to obtain the desired result from WDQS; theoretically, it determines truth relations; in practice, we can say that it precisely specifies the expected query result;
  • pragmatics: given the SPARQL language conventions are satisfied, any query migth instantiate various variable names, use (more or less efficiently) various syntactic constructions to obtain the same effect, include code comments, etc.: the way users use the SPARQL query language to communicate their goals to the WDQS.

Our approach to feature engineering is to parse each SPARQL query with a set of regular expressions in order to (a) single out the “atomic elements” of the SPARQL language (keywords, functions, IRI mentions, variables, literals, operators, Wikibase parameters, etc.), (b) compute a set of abstract variables that can be used to describe the query, e.g.: the number of different variables instantiated (i.e. how many unique variables there are in a query), the total number of variables mentioned (including repeated uses of the same variable), the number of IRI references in a query, how many variables also request labels, and similar.

With this appproach, which is essentialy the same as what is typically done in Natural Language Processing, query syntax and semantics are in effect fused. A question could be posed: why keep specific variable names, or comments, when it is clear that the later are not even processed while variable names are conventional and WDQS users are free to use them as long as they satisfy the rules of the SPARQL syntax? These aspects of query analysis are important from the viewpoint of query pragmatics and carry useful information indeed. Of course the variable names are conventional, but imagine a community of librarians using WDQS who learn SPARQL from each other and thus frequently use variable names like ?author and/or ?authorLabel? If such “typical” variable names are frequently used in similar queries, they can act as informative features in the prediction of query response time indeed. Comments: many Wikidata bots might use a “signature” comment (and indeed some do) of some form, for example. If such bots run their queries against WDQS frequently, the comments in their code might also act as predictive feaures for the query perfomance.

NOTE. The current regex analysis of SPARQL queries is approximate and, most of time, correct in a sense of being able to extract intuitively meaningful atoms of SPARQL. It is approximate in a sense that the regular expressions use are derived by the author of this report by directly analyzing the typical SPARQL queries with only limited attention payed to the formal definition of SPARQL (e.g. its Backus–Naur forms, or the formal regex definitions of its various syntactic constituents). The feature engineering procedure used in our current analyses can be improved, but from our manual query analysis inspection it can be improved only slightly.

Beyond extracting SPARQL “atoms”, the following features are derived as frequencies (counts) for each query:

  • nchar - the length of the query in characters
  • __vars__ - how many variables are used?
  • __vars_usage__ - how many variable mentions there are in total?
  • __vars_label__ - how many variables request labels also?
  • __vars_label_usage__ - how many times are variable labels mentioned in a query?
  • __paths__ - how many property paths are present in the query?
  • __IRI_REF__ - how many IRI references are used in the query?
  • __LITERAL__ - how many SPARQL literals are used in the query?

So the final description of any SPARQL query, following these feature engineering steps, is a vector of counts for each extracted feature, including: (a) tokens (each IRI mentioned, for example), as well as (b) types (how many __IRI_REFs__ there are in a query in total).

NOTE. Approximating syntax. Obviously, this feature engineering approach does not analyze the structural relations between the language constituents: its syntax. However, some of the extracted features clearly “correlate” with the query syntactic complexity. For example, the count of { tokens, or the count of SELECT keywords, used in a query, clearly speaks about the query depth (i.e. how many sub-queries do we have, if any). So the syntactic information is present in our feature engineering more or less up to a degree dictated by how much useful information is carried by a mention of a particular SPARQL keyword, or function, or operator, etc; the number of variables used, the query “depth”, and similar. More useful syntactic information can be obtained from Blazegraph by running a query in the EXPLAIN mode (similar to what one would in SQL), see: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Explain_Query. However, even if featurus obtained from the query execution plan in this way might prove to be useful, it is unclear at this point whether the SPARQL query execution plan can be obtained from Blazegraph without actually running the query.

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_FeatureEngineering.R performs 
### --- feature engineering of the SPARQL queries data 
### --- collected w. WD_WDQS_Analytics_CollectData.R from the 
### --- wmf.webrequest table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: 
### --- /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(snowfall)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### ---------------------------------------------------------------------------
### --- Section 1. Load and prepare SPARQL queries
### ---------------------------------------------------------------------------
# - load queries
filename <- 'wdqs_data_sample_TEST_2020-04-12.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - keep only unique queries
w <- which(!duplicated(dataSet$uniqueQueryId))
if (length(w) > 0) {
  dataSet <- dataSet[w]
  }
# - prepare queries for parsing
queries <- dataSet$uri_query
uniqueQueryId <- dataSet$uniqueQueryId
rm(dataSet); gc()

# - parse_SPARQL()
parse_SPARQL <- function(tq) {
  # - get nchar()
  tq_nchar <- nchar(tq)
  # - split by "\\?query=|\\?query ="
  tq <- gsub("query=|query =|\\?query=|\\?query =", "query=__QUERY_STARTS___", tq)
  tq <- unlist(strsplit(tq, split = "query=|query =|\\?query=|\\?query ="))
  w_empty <- which(nchar(tq) == 0)
  if (length(w_empty) > 0) {tq <- tq[-w_empty]}
  # - recognize format, output, results parameters
  wPars <- which(grepl("format=|format =|output=|output =|results=|results =", tq))
  # - store format, output, results parameters in a separate
  # - vector: pars; remove from tq if any parameters are present
  if (length(wPars) > 0) {
    pars <- tq[wPars]
    pars <- gsub("\\?|&", "", pars)
    tq <- tq[-wPars]
  } else {
    pars <- NULL
  }
  tq <- gsub("__QUERY_STARTS___", "", tq, fixed = T)
  # - remove "?" from the beginning of SPARQL if present
  tq <- gsub("^\\?", "", tq)
  # - split by "\n"
  tq <- unlist(lapply(tq, function(x) {
    strsplit(x, split = "\\n")
  }))
  # - remove "\t" if any
  tq <- gsub("\\t+", "", tq)
  # - trim trailing space at the beginning and end, if any
  tq <- str_trim(tq, side = "both")
  # - collect IRI references
  tq_iri <- unlist(str_extract_all(tq, '<([[:alnum:]]|[[:punct:]])+>'))
  # - match IRI references
  tq <- gsub('<([[:alnum:]]|[[:punct:]])+>', " __IRI_REF__ ", tq)
  # - preserve lines that encompasse only comments
  wCom <- which(grepl("^#", tq))
  if (length(wCom) > 0) {
    comment_lines <- tq[wCom]
    tq <- tq[-wCom]
  } else {
    comment_lines <- NULL
  }
  # - space to the left and right of 
  # - brackets, operators, and punctuation
  tq <- gsub("\\{", " \\{ ", tq)
  tq <- gsub("\\}", " \\} ", tq)
  tq <- gsub("\\)", " \\) ", tq)
  tq <- gsub("\\(", " \\( ", tq)
  tq <- gsub("\\]", " \\] ", tq)
  tq <- gsub("\\[", " \\[ ", tq)
  tq <- gsub("([^<]|[^>]|[^!])=", " = ", tq)
  tq <- gsub("!=", " != ", tq)
  tq <- gsub(">=", " >= ", tq)
  tq <- gsub("<=", " <= ", tq)
  tq <- gsub(">", " > ", tq)
  tq <- gsub("<", " < ", tq)
  tq <- gsub("<", " < ", tq)
  tq <- gsub("\\|\\|", " \\|\\| ", tq)
  tq <- gsub("&&", " && ", tq)
  tq <- gsub(",", " , ", tq)
  tq <- gsub("\\.", " \\. ", tq)
  tq <- gsub(";", " ; ", tq)
  tq <- gsub("\\s+", " ", tq)
  # - match literals
  tq <- gsub('(\'|")+([^\'"])+(\'|")+', " __LITERAL__ ", tq)
  tq <- gsub('\\b[[:digit:]]+\\b', " __LITERAL__ ", tq)
  # - "word tokenization": split by "\s"
  tq <- unlist(lapply(tq, function(x) {
    strsplit(x, split = "\\s")
  }))
  # - remove: '"', ".", ","
  tq <- gsub('("|\\.|,)+', '', tq)
  # - remove: "'"
  tq <- gsub("'+", '', tq)
  # - count unique variables
  num_vars <- length(unique(tq[grepl('^\\?|^\\$', tq)]))
  # - count variable use
  vars_usage <- length(tq[grepl('^\\?|^\\$', tq)])
  # - count unique label variables
  num_vars_label <- length(unique(tq[grepl('[^:]label$', tq, ignore.case = T)]))
  # - count label variable use
  label_vars_usage <- length(tq[grepl('[^:]label$', tq, ignore.case = T)])
  # - add variables use and unique variable count to tq
  tq <- c(tq, 
          rep('__vars__', num_vars), 
          rep('__vars_usage__', vars_usage), 
          rep('__vars_label__', num_vars_label),
          rep('__vars_label_usage__', label_vars_usage))
  # - tolower()
  tq <- tolower(tq)
  # - remove tq elements that begin with: "?", "$", ")", ";", "}"
  wClean <- which(grepl('^\\)|^;|^\\}|^\\]', tq))
  if (length(wClean > 0)) {
    tq <-  tq[-wClean]
    }
  # - clean from: "]", "}", ";", ")" if present anywhere
  tq <- gsub("\\]|\\}|\\)|;", "", tq)
  # - count how many "/" path operators were used in a query
  num_paths <- length(tq[grepl('([[:alnum:]]|:)+/([[:alnum:]]|:)+', tq)])
  tq <- c(tq, 
          rep('__paths__', num_paths))
  # - remove empty elements from tq if any:
  wEmpty <- which(grepl("^\\s+$|^$", tq))
  if (length(wEmpty) > 0) {tq <- tq[-wEmpty]}
  # - enter comments
  if (length(comment_lines) > 0) {tq <- c(tq, comment_lines)}
  # - enter pars
  if (!is.null(pars)) {tq <- c(tq, pars)}
  # - enter tq_iri
  if (length(tq_iri)>0) {tq <- c(tq, tq_iri)}
  # - tq feature frequency as data.frame
  tq <- as.data.frame(table(tq))
  # - enter nchar(tq) as feature:
  tq <- rbind(tq, 
              data.frame(tq = 'nchar',
                         Freq = tq_nchar))
  # - ouput: tq
  return(tq)
}

# - Parse SPARQL queries
sfInit(parallel = T, cpus = 36)
t1 <- Sys.time()
sfExport("queries")
sfExport("parse_SPARQL")
sfLibrary(stringr)
# - process
queries_processed <- sfClusterApplyLB(queries, parse_SPARQL)
# - stop cluster
sfStop()
print(paste0("Parsing took: ", 
             round(difftime(Sys.time(), t1, units = "min"), 2)))

# - collect results
names(queries_processed) <- uniqueQueryId
queries_processed <- rbindlist(queries_processed, 
                               idcol = T, 
                               fill = T, 
                               use.names = T)
colnames(queries_processed) <- 
  c('uniqueQueryId', 'feature', 'freq')
write.csv(queries_processed, paste0(analyticsDir, "queries_processed_TEST.csv"))

# - produce vocabulary: feature frequency
vocabulary <- queries_processed %>%
  dplyr::select(feature, freq) %>% 
  dplyr::group_by(feature) %>% 
  dplyr::summarise(freq = sum(freq)) %>% 
  dplyr::arrange(desc(freq))
write.csv(vocabulary, paste0(analyticsDir, "SPARQLvocabulary_TEST.csv"))
head(data.frame(vocabulary), 400)

# - in how many queries is a particular feature used?
head(queries_processed)
query_features <- queries_processed %>% 
  dplyr::select(feature) %>% 
  dplyr::group_by(feature) %>% 
  dplyr::summarise(num_queries = n()) %>% 
  dplyr::arrange(desc(num_queries))
query_features$percent_queries <- 
  query_features$num_queries/length(unique(queries_processed$uniqueQueryId))
write.csv(query_features, paste0(analyticsDir, 'feature_query_percent_TEST.csv'))

Again, The goal is to derive a description of SPARQL queries via feature vectors. For example:

iri_ref:11, literal:1, vars:5, vars_label:1, vars_label_usage:1, vars_usage:20, ?item:15, ?type:1, ?typelabel:1, ?url:2, ?wikipedia:1, {:14, http://standartconta.com.tr/:1, http://standartconta.com.tr:1, http://www.standartconta.com.tr/:2, http://www.standartconta.com.tr:1, http://www.www.standartconta.com.tr/:1, https://standartconta.com.tr/:1, https://standartconta.com.tr:1, https://www.standartconta.com.tr/:1, https://www.standartconta.com.tr:1, https://www.www.standartconta.com.tr/:1, bd:serviceparam:1, distinct:1, optional:1, schema:about:1, select:1, service:1, union:10, wdt:p31:1, wdt:p856:12, where:1, wikibase:label:1, wikibase:language:1, nchar:903

In this example of a query feature vector we see that the query had used 11 IRI references (__iri_ref__:11), one literal (__literal__:1), five variables (__vars__:5) that were mentioned 20 times altogether (__vars_usage__:20), used wikibase:label once (wikibase:label:1), the UNION keyword 10 times (union:10), that the length of the query in characters is nchar:903, etc.

2. A criterion for selecting a critical query for WDQS

Now when each SPARQL query in our sample is described by a feature vector suitable for predictive modeling, we need to select a criterion variable: what do we want to predict? We want to predict the WDQS response time, obviously - the information carried in the time_firstbyte field of the wmf.webrequest table (i.e.how many seconds elapsed before the first byte of the server response was sent out to the client). However, since we already now that the time to process a SPARQL query depends not only upon the features of the query itself, but also upon the WDQS/Blazegraph configuration that presents a black box for us, we might conclude that predicting the exact server response time might present a too ambitious goal. Indeed, our initial attempts to predict the server response time for a query shows that approximately a correlation of R = .72 (on the test data set, approx. half of 1M queries in the sample) between the predictive model based on our features and response time can be achieved - a result not that sound.

What if we can train a model to predict a derived, binary variable that describes each query as critical (i.e. resulting in an excessively long query processing time) or non-critical (i.e. resulting in a “typical” query processing time)? If we pose the problem in this way, than it is easy to derive such a binary criterion from the observed time_firstbyte value for each query:

  • take the extreme outliers on time_firstbyte to be all observations above the variable’s upper outer fence: Q3 + 3*IQR, where Q3 is the .75 percentile of time_firstbyte, while IQR stands for Inter-Quartile Range and presents the difference between Q3 (the .75 percentile of the data) and Q1 (the .25 percentile of the data).

This definition is widely used in Exploratory Data Analysis. We can also select a less extreme criterion by considering anything that is a mild outlier on time_firstbyte as critical:

  • take the mild outliers on time_firstbyte to be all observations above the variable’s upper inner fence: Q3 + 1.5*IQR, where Q3 is the .75 percentile of time_firstbyte, while IQR stands for Inter-Quartile Range and presents the difference between Q3 (the .75 percentile of the data) and Q1 (the .25 percentile of the data).

While we have used both criteria in our attempts at predictive modeling, we will mainly focus on the prediction of extreme outliers on time_firstbyte in what follows.

3. Machine Learning w. XGBoost: can we predict extreme outliers on WDQS response time from SPARQL query features?

Thus far,

  • all SPARQL queries are represented as feature vectors, and
  • a criterion variable has been defined (extreme outliers on WDQS server response times).

Our goal is to select a predictive model that will use the feauture vectors to predict whether the respective queries’ response times are critical (extreme) or not. At the same time, the model will be performing feature selection and informing us about the value of each feature in the prediction. This will allows us to select a small set of SPARQL feauters that can be used to characterize the queries as resulting in typical vs. extreme WDQS response times. And such small feature vocabulary is exactly what we are looking for in order to be able to develop a WDQS optimization engine in the future: a system that would fast can any given SPARQL query and decide wheter it needs to be handled as a special, potentially difficult case to process - or should the processing proceed as usual.

We have used XGBoost via the R {xgboost} package to train many predictive models that attempt to categorize SPARQL queries as critical or not in accordance with the above stated criterion. XGBoost - short for Extreme Gradient Boosting - is a ML method to estimate an ensemble of mathematical models - decision trees in our case - that jointly work to predict the response variable. Its description is really way out of scope of this Report. This technique is as known to be notoriously good in extracting almost every and the last piece of useful information from the training data (feauture vectors describing the SPARQL queries in this case). The XGBoost models involve a large number of parameters that must be fine-tuned in order for it to derive a plausible predictive model for the task at hand. We have experimented with various parameter settings and data sets encompassing a variable number of extracted SPARQL features to reach the following conclusions:

  • the number of features used to build a model does not significantly change the accuracy and precision (both to be explained soon) with which the model can tell the critical SPARQL queries from typical ones; this is most probably entirely due to the fact that XGBoost performs feature selection along the way of its training; this is also very important because it means that we can use only a small fraction of features to train a predictive model - and the model with less features will run faster than the one encompassing a large number of feature;
  • mostly, the performance of the predictive model is not seriously affected by choosing various combinations of parameter settings; we have performed many cross-validations, always additional splitting our 1M sample of SPARQL queries into train and test data (approx. 50% of the sample each), in order to reach this conclusion;
  • the model performance is good but far from perfect, however
  • there are possible improvements that always involve a compromise in respect to how can be we design a future WDQS Optimization Engine.

In order to explain the typical modelling result with XGBoost, we need to introduce some terminology first.

Let’s assume each of the SPARQL queries is coded as 1 if the respective WDQS response time for that query was critically high or 0 if it had fallen in a range of typical processing times. For a sample of SPARQL queries than we have two binary vectors over {0, 1} describing: (1) the observed status of the query (again: 1 - critical, 0 - typical), and (2) the predicted status of the query: what the trained predictive model “thinks” the query would be by looking at the features that describe it.

For example:

sampleFrame <- data.frame(query = paste0("Query_", 1:10),
                          observed = round(runif(10, 0, 1)),
                          predicted = round(runif(10, 0, 1)))
sampleFrame

For each of the 10 queries in the table we know if they really where critical (1) or not (0) and we see that from the observed column, and if the model predicts they would be critical (1) or not (0) from their features: the predicted column. For example, in the first row we have a query that was critical indeed and correctly predicted as critical: that is something we call a Hit (or a True Positive) in ML. In the second row we observe a False Alarm (or a False Positive): the query was not really critical (it has 0 on observed) but the model has incorrectly categorized it as if it would be (1 on predicted). In the eigth row we find a situation known as a Correct Rejection: the query was not critical (0 on observed) and the model figured out it wouldn’t be critical indeed by looking into its features (0 on predicted). Finally, the fourth important situation can be found in the sixth row, where the query was critical (1 on observed) but not recognized as such by the model (0 on predicted), a case known as Miss in the ML terminology. More on this terminology: https://en.wikipedia.org/wiki/Confusion_matrix.

In the table above, we can compute the model accuracy by looking at how many times do the observe and predicted column coincide:

Table 1. Confusion matrix example.

acc <- sum(sampleFrame$observed == sampleFrame$predicted)/dim(sampleFrame)[1]
print(paste0("Model accuracy = ", round(acc, 2), "."))
[1] "Model accuracy = 0.6."

When we look at the Hit rate we see that:

hits <- sum(sampleFrame$observed == 1 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 1)
print(paste0("Hit rate = ", round(hits, 2), "."))
[1] "Hit rate = 0.43."

Finally, the False alarm rate of the model is:

FAs <- sum(sampleFrame$observed == 0 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 1)
print(paste0("FA rate = ", round(FAs, 2), "."))
[1] "FA rate = 0."

So, the model would recognize a critical query correctly 71% of time, while 30% of time it would incorrectly predict a non-critical query to be critical. Imagine if we use this model to chose when to start and when not to start some optimization procedure in WDQS - be it as simple as diverging different queries (critical vs. non-critical) to separate processing resources. From the viewpoint of any system optimization, the decision is needed: can we live with these two values, or not? The False Alarm suggests that 30% of time the model would suggest the optimization routine to start in a situation when it should not start and thus spare valuable system resources. Again, 100% - 71% (Hit rate) = 29% of time the model would miss to trigger the optimization routine for a critical query that could engage too many resources.

A good predictive model is the one which attains (a) a high Hit rate (i.e. observed critical and predicted critical), and (b) a low False Alarm rate (i.e. observed non-critical but predicted critical). In other words, we do not want our model to miss critical queries and thus not trigger some optimization procedure to help WDQS better use its resources, but we also do not want our model to trigger any optimization procedures on a query that would not result in a critically high usage of resources.

Our current results:

  • the results vary with the number of features used and specific values of model parameters, but they do not vary significantly:
  • typical model accuracy is around 95%;
  • typical Hit rate is around 48%, and
  • typical False alarm rate is slightly above 2%.

Now we can see how misleading is model accuracy alone.

Tweaks and Compromises

  • Without going into detail about it, since this is a binary classification task, there is a legitimate way to tweak a model and attain a higher hit rate (by manipulating the so called decision boundary), but at the expense of increasing the False Alarm rate at the same time. For example, with the current results, we can achieve a Hit rate as high as 84% but at the expense of increasing the False Alarm rate to almost 40% at the same time.

How do we compromise about this prediction model? First to remind: it is a binary classification model. The model does not directly output its predictions as classes: 1 - critical, 0 - typical, as in the example in Table 1. Rather, the model learns about the queries from their feature vectors and predicts a probability that the query will critical (1). Now, in order to categorize the query as critical or not, we need to set a treshold value (i.e. a decision boundary), .5 for example, and prononuce the model prediction to be 1 each time a predicted probability for a query is above the treshold value. A value of .5 is an intuitive choice for binary classification, however, we might pick another value as well. It is by manipulating the treshold value how we enter a trade-off between True (Hit) and False Positive (False Alarm) rates. In the following chart the treshold value varies smoothly between 0 and 1, the True Positive rate is found on the vertical, and the False Positive rate on the horizontal axis:

Chart 1.

library(ggplot2)
library(ggrepel)
locDir <- '/home/goransm/Work/___DataKolektiv/Projects/WikimediaDEU/_WMDE_Projects/Wikidata/WD_Misc/WD_SPARQL_Endpoint_Analytics/_analytics/'
rocModel <- read.csv(paste0(locDir, 'rocModel.csv'),
                     header = T, 
                     row.names = 1,
                     stringsAsFactors = F)
rocModel$diff <- rocModel$hit - rocModel$FA
wMaxDiff <- which.max(rocModel$diff)
rocModel[wMaxDiff, ]
rocModel$label <- ''
rocModel$label[wMaxDiff] <- paste0('Hit: ', round(rocModel$hit[wMaxDiff], 2), 
                                   "; FA: ", round(rocModel$FA[wMaxDiff], 2))
ggplot(data = rocModel,
       aes(x = FA, y = hit, label = label)) + 
  geom_point(size = .75) + geom_path(size = .25, color = "blue") + 
  ggrepel::geom_text_repel() + 
  xlab('False Positives (False Alarm Rate)') + ylab('True Positives (Hit Rate)') +
  theme_bw()

We can see that the there is a treshold value where we observe a 67% True Positive Rate and a 12% False Positive Rate: the difference between the two is highest when treshold = .2.

In effect this means:

  • imagine, again, that we develop a system to optimize the WDQS by somehow splitting its resources into those that process queries with typical expected processing times from those with critical expected processing times;
  • our current modelling efforts let us know that we can increase the number of True Positives to two-thirds at the expense of observing 12% of typical queries as if they were critical;
  • that means the system would correctly run on critical processing resourcess 2/3 of the time, while it would erroneously run on critical processing resources 12% of time (i.e. 12% of queries should be processed with the typical processing resources but they would incorrectly trigger the critical processing resources);
  • or, to rephrase: the model would think a critical query is non-critical one-third of a time, and decide to pass such a query to a normal, typical WDQS processing, while at the same time it would recognize a non-critical query as non-critical 88% of time and than it would not trigger any optimization action.

The question is: can we afford such an optimizer for WDQS? What is the trade-off that we can accept to live with?

4. Directions for future research + considerations of possible problems in WDQS optimization

Directions for further research (time estimate: until the end of the next week)

  • The sampling approach appears to be robust, but we need to replicate all feature engineering and ML procedures across at least several more samples;
  • The feature engineering procedures are approximate: they could and should be improved;
  • The ML results were discussed and it is uncertain if further significant improvements are possible, but at least the following should be tested:
    • working with corrections for imbalanced designs: we observe a very low number of critical queries compared to a huge number of typical queries in our sample; some corrections to control for this imbalance are already applied, but there are other approaches that might be helpful to potentially train better classifiers;
    • not all XGBoost parameters that might help train a better classifier were entered into our cross-validation procedures; some of them should be given a change;
    • maybe a different model altogether (e.g. Convolutional Deep Learning) should be tested.

Final considerations

  • If we are about to develop some sort of a WDQS Optimization Engine based on this and/or similar models,
  • we need to consider how fast can these class of models run in production, and
  • if they can grab “environmental variables” from the WDQS/Blazegraph: for example, one of the features they use is the total number of queries running against the WDQS in the same minute in which the query under analysis was run - and this cannot be known from SPARQL query analysis alone.
  • Deployment to production: we would probably need to switch from {xgboost} to its implementation in {h2o} where we can deploy models as Java MOJO or POJO objects, or build a REST API for a model predictive run; all these steps imply that some time would be lost in query processing before the optimizer would be able to deliver a prediction, so deploying a model with a minimum runtime latency in a predictive run would be essential.

Goran S. Milovanović

Wikimedia Deutschland, Data Scientist DataKolektiv, Owner

contact: goran.milovanovic_ext@wikimedia.de

---
title: WDQS Endpoint Analytics - Optimizing Server Response Time
author:
- name: Goran S. Milovanović
  affiliation: Wikimedia Deutschland, Data Scientist, DataKolektiv, Owner
output:
  html_notebook:
    code_folding: hide
    theme: spacelab
    toc: yes
    toc_float: yes
    toc_depth: 5
  html_document:
    toc: yes
    toc_depth: 5
---
***

**Report timestamp:** `r {Sys.time()}`

***

**Reference Phabricator Ticket:** [T248308](https://phabricator.wikimedia.org/T248308)

***
### Notebook: Optimizing WDQS Response Time From SPARQL Queries Analysis
**Feedback** should be send to `goran.milovanovic_ext@wikimedia.de`. 

## **Summary** 

[WMDE](https://www.wikimedia.de/) together with [php.cc](https://thephp.cc/welcome) is trying to find a way to optimize the [Wikidata Query Service](https://www.mediawiki.org/wiki/Wikidata_Query_Service). This Report describes a Machine Learning (ML) approach to analyze the SPARQL queries that were received by the endpoint in March 2020, with the goal of developing a model able to predict whether a query would be **critical** for WDQS/Blazegraph or not. By singling out the most important features that are present in the SPARQL queries under analysis we hope to be able to understand better what types of queries typically cause problems. Beyond that, the development of a predictive model for SPARQL query server response time in itself lays out the foundations for the development of an optimization engine for WDQS, a system that would be able to quickly scan a received query and decide whether the WDQS handling of any given query needs to be adjusted in order to avoid critically high server response times. 

In the following sections we describe **(1)** the *sampling approach* taken to collect a relevant set of SPARQL queries observed at the (WDQS) endpoint in three weeks of March 2020, **(2)** the *feature engineering* approach taken to analyze the composition of queries and extract meaningful "atomic elements" of SPARQL that will be used in an attempt to predict the WDQS response time for a query, **(3)** the definition of a criterion for selecting a **critical** query for WDQS (i.e. the one that makes too much use of the service's resources), **(4)** the ML (optimization) approach, relying on [XGBoost](https://xgboost.readthedocs.io/en/latest/), taken to build a predictive model for WDQS server response times starting from the extracted SPARQL features, and (5) the *lessons learned* - specifically on what *constraints* will be met in the case of an eventual development of a WDQS optimization engine in the future - as well as *suggestions* on what further steps need to be made in order to reach an acceptable level of understanding of the problem and design a prototype of a future solution.

***

## 0. Sample of SPARQL queries observed at the WDQS SPARQL endpoint

The SPARQL queries that hit the endpoint are registered in the [wmf.webrequest](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest) table in the WMF Data Lake (our Hadoop Big Data storage). A [bucketing approach](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) to sample data from this Hive table was taken, sampling approximately `1%` of queries that were observed in the period between `2020/03/01` and `2020/03/20`. The sample is approximately uniformly distributed across days and hours of the mentioned period of time. The sample size is slightly higher than `1M` SPARQL queries - including duplicates.

The following fields were collected from the `wmf.webrequest` table for each collected SPARQL query (see: [wmf.webrequest documenation](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest) for details):

- `uri_query` - the URL encoded SPARQL query w. WDQS query parameters
- `dt` - the query timestamp, later parsed into `year`, `month`, `day`, `hour`, and `minute`
- `http_status` - the server http status response for this query
- `cache_status` - the cache status for this query (see: https://wikitech.wikimedia.org/wiki/Caching_overview#Headers)
- `time_firstbyte` - how many seconds elapsed before the first byte of the server response was sent out to the client
- `response_size` - response size
- `agent_type` - spider or human
- `content_type` - content type (e.g. JSON, text/tsv, xml, etc.)
- `access_method` - desktop, mobile, or app.

**NOTE.** As noted in the **Summary**, our attempt will be to analyze the composition of the SPARQL queries and try to predict a binary variable *critical response time*/*non-critical response time* which we plan to derive from `time_firstbyte` (taken to be an approximation to the WDQS response time). It is immediately obvious that only some fields from the `wmf.webrequest` table are candidates for a predictive model of the described kind: `uri_query` contains the query and we will use it to extract the query features, which can be done before WDQS query processing, `dt` is the timestamp which can be known before the query is processed, however, `response_size` is certainly not available before query processing and cannot be used in any imaginable WDQS optimization system. So, the following fields are useful for exploratory data analysis only: `http_status`, `cache_status`, `response_size`, `agent_type`, `content_type` (note: we can learn about this variable from the SPARQL query parameters present in the `uri_query` field), `access_method`.


```{r echo = T, eval = F, message = F}
#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_CollectData.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_CollectData.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_CollectData.R collects a sample of 
### --- SPARQL queries from wmf.webrequest in the WMF Data Lake,
### --- uri_path = "/sparql" OR uri_path= "/bigdata/namespace/wdq/sparql"
### --- NOTE. "/bigdata/namespace/wdq/sparql" redirects to "/sparql" 
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- Section 1. Collect raw data set from wmf.webrequest
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### --- grab a sample of SPARQL queries from wmf.webrequest
### --- w. HiveQL from Beeline
# - query filename
queryFile <- "wdqs_CollectData.hql"
# - Kerberos Init
kerberosPrefix <- 'sudo -u analytics-privatedata kerberos-run-command analytics-privatedata '
# - dateRange
endDate <- '2020/03/20'
startDate <- '2020/03/01'
dateRange <- seq(as.Date(startDate), as.Date(endDate), by = "day")
# - query over dateRange
for (i in 1:length(dateRange)) {
  print(paste0("Started: ", dateRange[i]))
  # - year, month, date
  year = substr(dateRange[i], 1, 4)
  month = substr(dateRange[i], 6, 7)
  if (substr(month, 1, 1) == '0') {month = substr(month, 2, 2)}
  day = substr(dateRange[i], 9, 10)
  if (substr(day, 1, 1) == '0') {day = substr(day, 2, 2)}
  # - compose query
  hiveQLquery <- paste0('SELECT dt, uri_query, http_status, cache_status, time_firstbyte, response_size, agent_type, content_type, access_method
                          FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) w
                          WHERE (uri_path = "/sparql" OR uri_path= "/bigdata/namespace/wdq/sparql")
                          AND webrequest_source = "text"
                          AND year=', year,
                        ' AND month=', month,
                        ' AND day=', day, ';')
  
  filename <- "wdqs_data_sample_TEST_"
  filename <- paste0(filename, dateRange[i], ".tsv")
  # - write hql
  write(hiveQLquery, paste0(fPath, queryFile))
  # - to Report
  print("Fetching sparql_testData from wmf.webrequest now.")
  # - Run query
  query <- system(command = paste(kerberosPrefix, 
                                  '/usr/local/bin/beeline --incremental=true --silent -f "',
                                  paste0(fPath, queryFile),
                                  '" > ', dataDir, filename,
                                  sep = ""),
                  wait = TRUE)
  print(paste0("Completed: ", dateRange[i]))
  print("--------------------------------------------")
}
# - to Report
print("DONE w. ETL from Hadoop: wmf.webrequest.")
```


## 1. Data Cleaning and Feature Engineering 

### 1.1 Data Cleaning

The first step following the sample of SPARQL queries is obtained from `wmf.webrequest` is to check the data for consistency and `urltools::url_decode()` decode the queries (R {urltools} package was used). The following chunk of code also parses the `dt` (timestamp) field and extracts `year`, `month`, `day`, `hour`, and `minute` when the query was run. Also, and additional variable is produced from a simple concatenation of `year`, `month`, `day`, `hour`, and `minute`: `ymdmh`. This variable will be subsequently used to calculate the number of queries that were run in a particular minute observed in the sample. This is a potential "environmental" variable (i.e. taken from the WDQS/Blazegraph enviornment) in predictive modeling of server response time. It can characterize the query processing context by estimating how many queries were run against Blazegraph concurrently in the same minute when a particular query under of interest was run. There are possible refinments of this variable that were not explored in this pilot study. Any future WDQS optimization system relying on this variable would have to rely on Blazegraph broadcasting this variable to the predictive modeling API.

**NOTE.** In this step we eliminate all queries that have resulted in a `http_status` of `4**`: client side errors. Obviously, incorrectly stated SPARQL queries are of no use to us in any attempt to optimize the WDQS processing time. So if a query fails an initial syntax analysis on the endpoint it does not enter our analysis.

```{r echo = T, eval = F, message = F}

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_Clean_Inspect_Data_EDA.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_FeatureEngineering.R performs 
### --- feature engineering of the SPARQL queries data 
### --- collected w. WD_WDQS_Analytics_CollectData.R from the 
### --- wmf.webrequest table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(tidyverse)
library(urltools)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'

dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### ---------------------------------------------------------------------------
### --- Section 1. Load, clean, and decode the SPARQL queries data set
### --- NOTE. Mind the private data (!)
### ---------------------------------------------------------------------------

### --- Sys.setlocale()
Sys.setlocale('LC_ALL','C')

### --- load data
lF <- list.files(dataDir)
lF <- lF[grepl("^wdqs_data_sample", lF)]
lF <- lF[!grepl("TEST", lF)]
dataSet <- lapply(paste0(dataDir, lF), fread)
dataSet <- rbindlist(dataSet)
dim(dataSet)

### --- clean data: http_status
table(dataSet$http_status)
# - inspect http_status x time_firstbyte
dataSet %>% 
  dplyr::select(http_status, time_firstbyte) %>% 
  dplyr::group_by(http_status) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))
# - NOTE. http_status == "-" has min == 0, max == 0, mean == 0, median == 0
# - Remove queries that failed for client errors: http_status == 4**
w <- which(grepl("^4", dataSet$http_status))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}

### --- clean data: cache_status
table(dataSet$cache_status)
# - inspect cache_status x time_firstbyte
dataSet %>% 
  dplyr::select(cache_status, time_firstbyte) %>% 
  dplyr::group_by(cache_status) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))
# - NOTE. cache_status == "-" has min == 0, max == 0, mean == 0, median == 0

### --- clean data: agent_type
table(dataSet$agent_type)
# - inspect agent_type x time_firstbyte
dataSet %>% 
  dplyr::select(agent_type, time_firstbyte) %>% 
  dplyr::group_by(agent_type) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))

### --- clean data: content_type
# - recode content_type
dataSet$content_type <- sapply(dataSet$content_type, function(x) {
  d <- gsub("charset=", "", x)
  d <- gsub("tab-separated-values", "tsv", d)
  return(d)
})
table(dataSet$content_type)
# - inspect content_type x time_firstbyte
dataSet %>% 
  dplyr::select(content_type, time_firstbyte) %>% 
  dplyr::group_by(content_type) %>% 
  dplyr::summarise(mean = mean(time_firstbyte, na.rm = T), 
                   median = median(time_firstbyte, na.rm = T), 
                   min = min(time_firstbyte, na.rm = T), 
                   max = max(time_firstbyte, na.rm = T))

### --- clean data: access_method
table(dataSet$access_method)
# - NOTE. Almost always dataSet$access_method == "desktop"

### --- validate dt (timestamp)
w <- which(dataSet$dt == "-")
length(w)
w <- which(!(grepl("\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d:\\d\\d:\\d\\dZ", 
                 dataSet$dt)))
length(w)
### --- extract year, month, day, hour, and minute from dt (timestamp)
dataSet$year <- substr(dataSet$dt, 1, 4)
dataSet$month <- substr(dataSet$dt, 6, 7)
dataSet$day <- substr(dataSet$dt, 9, 10)
dataSet$hour <- substr(dataSet$dt, 12, 13)
dataSet$minute <- substr(dataSet$dt, 15, 16)
dataSet$ymdhm <- paste(dataSet$year, 
                       dataSet$month,
                       dataSet$day,
                       dataSet$hour,
                       dataSet$minute,
                       sep = "-")
dim(dataSet)

### --- clean data: uri_query 
### --- prepare SPARQL queries for parsing
# - decode and clean uri_query
dataSet$uri_query <- urltools::url_decode(dataSet$uri_query)
# - check: dataSet$uri_query == ""
w <- which(dataSet$uri_query == "")
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)
# - check: is.na(dataSet$uri_query)
w <- which(is.na(dataSet$uri_query))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)
w <- which(grepl("^\\s+$", dataSet$uri_query))
if (length(w) > 0) {
  dataSet <- dataSet[-w, ]
}
dim(dataSet)

### --- add ID, uniqueQueryID
dataSet$id <- 1:dim(dataSet)[1]
uniqueQueries <- unique(dataSet$uri_query)
uniqueQueries <- data.frame(uri_query = uniqueQueries, 
                            uniqueQueryId = 1:length(uniqueQueries), 
                            stringsAsFactors = F)
dataSet <- dataSet %>% 
  dplyr::left_join(uniqueQueries, 
                   by = "uri_query")
rm(uniqueQueries)

### --- rearrange, store clean data set
colnames(dataSet)
dataSet <- dataSet %>% 
  dplyr::select('id', 'uniqueQueryId', 'dt', 'year', 'month', 'day', 'hour', 'minute',
                       'uri_query', 'http_status', 'cache_status', 'time_firstbyte', 
                       'response_size', 'agent_type', 'content_type', 'access_method')
filename <- paste0(analyticsDir, 
                   'wdqs_data_sample_', 
                   substr(as.character(Sys.time()), 1, 10), 
                   '.csv')
write.csv(dataSet, filename)
```

### 1.2 Feaure Engineering

The SPARQL queries in our sample, present in the `uri_query` field retrieved from the `wmf.webrequest` table, need to be parsed in order to extract a set of meaningful features to use in predictive modeling of server response time. As in the case of Natural Language Processing, we need to differentiate between three possible levels of analysis here:

- **syntax**: the composition of SPARQL keywords, operators, variables, and functions result in a specific description of the data that need to be obtained, and their analysis on behalf of Blazegraph determines the query execution plan (via Abstract Syntax Tree, AST, derived from query analysis); determines query validity and query execution plan;
- **semantics**: the exact mapping of query syntax to particular variables that need to be instantiated in order to obtain the desired result from WDQS; theoretically, it determines truth relations; in practice, we can say that it precisely specifies the expected query result; 
- **pragmatics**: given the SPARQL language conventions are satisfied, any query migth instantiate various variable names, use (more or less efficiently) various syntactic constructions to obtain the same effect, include code comments, etc.: the way users *use* the SPARQL query language to communicate their goals to the WDQS.

Our approach to feature engineering is to parse each SPARQL query with a set of regular expressions in order to (a) single out the "atomic elements" of the SPARQL language (keywords, functions, IRI mentions, variables, literals, operators, Wikibase parameters, etc.), (b) compute a set of abstract variables that can be used to describe the query, e.g.: the number of different variables instantiated (i.e. how many unique variables there are in a query), the total number of variables mentioned (including repeated uses of the same variable), the number of IRI references in a query, how many variables also request labels, and similar.

With this appproach, which is essentialy the same as what is typically done in Natural Language Processing, query syntax and semantics are in effect fused. A question could be posed: why keep specific variable names, or comments, when it is clear that the later are not even processed while variable names are conventional and WDQS users are free to use them as long as they satisfy the rules of the SPARQL syntax? These aspects of query analysis are important from the viewpoint of query *pragmatics* and carry useful information indeed. Of course the variable names are conventional, but imagine a community of librarians using WDQS who learn SPARQL from each other and thus frequently use variable names like `?author` and/or `?authorLabel`? If such "typical" variable names are frequently used in similar queries, they can act as informative features in the prediction of query response time indeed. Comments: many Wikidata bots might use a "signature" comment (and indeed some do) of some form, for example. If such bots run their queries against WDQS frequently, the comments in their code might also act as predictive feaures for the query perfomance. 

**NOTE.** The **current regex analysis of SPARQL queries is approximate** and, most of time, correct in a sense of being able to extract intuitively meaningful atoms of SPARQL. It is approximate in a sense that the regular expressions use are derived by the author of this report by directly analyzing the typical SPARQL queries with only limited attention payed to the formal definition of SPARQL (e.g. its Backus–Naur forms, or the formal regex definitions of its various syntactic constituents). The feature engineering procedure used in our current analyses **can be improved**, but from our manual query analysis inspection it can be improved only slightly. 

Beyond extracting SPARQL "atoms", the following features are derived as frequencies (counts) for each query:

- `nchar` - the length of the query in characters
- `__vars__` - how many variables are used?
- `__vars_usage__` - how many variable mentions there are in total? 
- `__vars_label__` - how many variables request labels also?
- `__vars_label_usage__` - how many times are variable labels mentioned in a query?
- `__paths__` - how many property paths are present in the query?
- `__IRI_REF__` - how many IRI references are used in the query?
- `__LITERAL__` - how many SPARQL literals are used in the query?

So the final description of any SPARQL query, following these feature engineering steps, is a vector of counts for each extracted feature, including: (a) tokens (each IRI mentioned, for example), as well as (b) types (how many `__IRI_REFs__` there are in a query in total).

**NOTE.** Approximating syntax. Obviously, this feature engineering approach does not analyze the structural relations between the language constituents: its syntax. However, some of the extracted features clearly "correlate" with the query syntactic complexity. For example, the count of `{` tokens, or the count of `SELECT` keywords, used in a query, clearly speaks about the query `depth` (i.e. how many sub-queries do we have, if any). So the syntactic information is present in our feature engineering more or less up to a degree dictated by how much useful information is carried by a mention of a particular SPARQL keyword, or function, or operator, etc; the number of variables used, the query "depth", and similar. More useful syntactic information can be obtained from Blazegraph by running a query in the EXPLAIN mode (similar to what one would in SQL), see: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Explain_Query. However, even if featurus obtained from the query execution plan in this way might prove to be useful, it is unclear at this point whether the SPARQL query execution plan can be obtained from Blazegraph *without* actually running the query.


```{r echo = T, eval = F, message = F}

#!/usr/bin/env Rscript

### ---------------------------------------------------------------------------
### --- WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Script: WD_WDQS_Analytics_FeatureEngineering.R, v. Beta 0.1
### --- Description: WD_WDQS_Analytics_FeatureEngineering.R performs 
### --- feature engineering of the SPARQL queries data 
### --- collected w. WD_WDQS_Analytics_CollectData.R from the 
### --- wmf.webrequest table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: 
### --- /home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics
### --- on stat1005
### --- Date: 2020/03/20
### ---------------------------------------------------------------------------

### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software: 
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed 
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------

### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(snowfall)

### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
  paste(fPath[1:length(fPath) - 1], collapse = "/"),
  "/",
  sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_data/')
analyticsDir <- paste0(fPath, '_analytics/')
imgDir <- paste0(fPath, '_img/')

### ---------------------------------------------------------------------------
### --- Section 1. Load and prepare SPARQL queries
### ---------------------------------------------------------------------------
# - load queries
filename <- 'wdqs_data_sample_TEST_2020-04-12.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - keep only unique queries
w <- which(!duplicated(dataSet$uniqueQueryId))
if (length(w) > 0) {
  dataSet <- dataSet[w]
  }
# - prepare queries for parsing
queries <- dataSet$uri_query
uniqueQueryId <- dataSet$uniqueQueryId
rm(dataSet); gc()

# - parse_SPARQL()
parse_SPARQL <- function(tq) {
  # - get nchar()
  tq_nchar <- nchar(tq)
  # - split by "\\?query=|\\?query ="
  tq <- gsub("query=|query =|\\?query=|\\?query =", "query=__QUERY_STARTS___", tq)
  tq <- unlist(strsplit(tq, split = "query=|query =|\\?query=|\\?query ="))
  w_empty <- which(nchar(tq) == 0)
  if (length(w_empty) > 0) {tq <- tq[-w_empty]}
  # - recognize format, output, results parameters
  wPars <- which(grepl("format=|format =|output=|output =|results=|results =", tq))
  # - store format, output, results parameters in a separate
  # - vector: pars; remove from tq if any parameters are present
  if (length(wPars) > 0) {
    pars <- tq[wPars]
    pars <- gsub("\\?|&", "", pars)
    tq <- tq[-wPars]
  } else {
    pars <- NULL
  }
  tq <- gsub("__QUERY_STARTS___", "", tq, fixed = T)
  # - remove "?" from the beginning of SPARQL if present
  tq <- gsub("^\\?", "", tq)
  # - split by "\n"
  tq <- unlist(lapply(tq, function(x) {
    strsplit(x, split = "\\n")
  }))
  # - remove "\t" if any
  tq <- gsub("\\t+", "", tq)
  # - trim trailing space at the beginning and end, if any
  tq <- str_trim(tq, side = "both")
  # - collect IRI references
  tq_iri <- unlist(str_extract_all(tq, '<([[:alnum:]]|[[:punct:]])+>'))
  # - match IRI references
  tq <- gsub('<([[:alnum:]]|[[:punct:]])+>', " __IRI_REF__ ", tq)
  # - preserve lines that encompasse only comments
  wCom <- which(grepl("^#", tq))
  if (length(wCom) > 0) {
    comment_lines <- tq[wCom]
    tq <- tq[-wCom]
  } else {
    comment_lines <- NULL
  }
  # - space to the left and right of 
  # - brackets, operators, and punctuation
  tq <- gsub("\\{", " \\{ ", tq)
  tq <- gsub("\\}", " \\} ", tq)
  tq <- gsub("\\)", " \\) ", tq)
  tq <- gsub("\\(", " \\( ", tq)
  tq <- gsub("\\]", " \\] ", tq)
  tq <- gsub("\\[", " \\[ ", tq)
  tq <- gsub("([^<]|[^>]|[^!])=", " = ", tq)
  tq <- gsub("!=", " != ", tq)
  tq <- gsub(">=", " >= ", tq)
  tq <- gsub("<=", " <= ", tq)
  tq <- gsub(">", " > ", tq)
  tq <- gsub("<", " < ", tq)
  tq <- gsub("<", " < ", tq)
  tq <- gsub("\\|\\|", " \\|\\| ", tq)
  tq <- gsub("&&", " && ", tq)
  tq <- gsub(",", " , ", tq)
  tq <- gsub("\\.", " \\. ", tq)
  tq <- gsub(";", " ; ", tq)
  tq <- gsub("\\s+", " ", tq)
  # - match literals
  tq <- gsub('(\'|")+([^\'"])+(\'|")+', " __LITERAL__ ", tq)
  tq <- gsub('\\b[[:digit:]]+\\b', " __LITERAL__ ", tq)
  # - "word tokenization": split by "\s"
  tq <- unlist(lapply(tq, function(x) {
    strsplit(x, split = "\\s")
  }))
  # - remove: '"', ".", ","
  tq <- gsub('("|\\.|,)+', '', tq)
  # - remove: "'"
  tq <- gsub("'+", '', tq)
  # - count unique variables
  num_vars <- length(unique(tq[grepl('^\\?|^\\$', tq)]))
  # - count variable use
  vars_usage <- length(tq[grepl('^\\?|^\\$', tq)])
  # - count unique label variables
  num_vars_label <- length(unique(tq[grepl('[^:]label$', tq, ignore.case = T)]))
  # - count label variable use
  label_vars_usage <- length(tq[grepl('[^:]label$', tq, ignore.case = T)])
  # - add variables use and unique variable count to tq
  tq <- c(tq, 
          rep('__vars__', num_vars), 
          rep('__vars_usage__', vars_usage), 
          rep('__vars_label__', num_vars_label),
          rep('__vars_label_usage__', label_vars_usage))
  # - tolower()
  tq <- tolower(tq)
  # - remove tq elements that begin with: "?", "$", ")", ";", "}"
  wClean <- which(grepl('^\\)|^;|^\\}|^\\]', tq))
  if (length(wClean > 0)) {
    tq <-  tq[-wClean]
    }
  # - clean from: "]", "}", ";", ")" if present anywhere
  tq <- gsub("\\]|\\}|\\)|;", "", tq)
  # - count how many "/" path operators were used in a query
  num_paths <- length(tq[grepl('([[:alnum:]]|:)+/([[:alnum:]]|:)+', tq)])
  tq <- c(tq, 
          rep('__paths__', num_paths))
  # - remove empty elements from tq if any:
  wEmpty <- which(grepl("^\\s+$|^$", tq))
  if (length(wEmpty) > 0) {tq <- tq[-wEmpty]}
  # - enter comments
  if (length(comment_lines) > 0) {tq <- c(tq, comment_lines)}
  # - enter pars
  if (!is.null(pars)) {tq <- c(tq, pars)}
  # - enter tq_iri
  if (length(tq_iri)>0) {tq <- c(tq, tq_iri)}
  # - tq feature frequency as data.frame
  tq <- as.data.frame(table(tq))
  # - enter nchar(tq) as feature:
  tq <- rbind(tq, 
              data.frame(tq = 'nchar',
                         Freq = tq_nchar))
  # - ouput: tq
  return(tq)
}

# - Parse SPARQL queries
sfInit(parallel = T, cpus = 36)
t1 <- Sys.time()
sfExport("queries")
sfExport("parse_SPARQL")
sfLibrary(stringr)
# - process
queries_processed <- sfClusterApplyLB(queries, parse_SPARQL)
# - stop cluster
sfStop()
print(paste0("Parsing took: ", 
             round(difftime(Sys.time(), t1, units = "min"), 2)))

# - collect results
names(queries_processed) <- uniqueQueryId
queries_processed <- rbindlist(queries_processed, 
                               idcol = T, 
                               fill = T, 
                               use.names = T)
colnames(queries_processed) <- 
  c('uniqueQueryId', 'feature', 'freq')
write.csv(queries_processed, paste0(analyticsDir, "queries_processed_TEST.csv"))

# - produce vocabulary: feature frequency
vocabulary <- queries_processed %>%
  dplyr::select(feature, freq) %>% 
  dplyr::group_by(feature) %>% 
  dplyr::summarise(freq = sum(freq)) %>% 
  dplyr::arrange(desc(freq))
write.csv(vocabulary, paste0(analyticsDir, "SPARQLvocabulary_TEST.csv"))
head(data.frame(vocabulary), 400)

# - in how many queries is a particular feature used?
head(queries_processed)
query_features <- queries_processed %>% 
  dplyr::select(feature) %>% 
  dplyr::group_by(feature) %>% 
  dplyr::summarise(num_queries = n()) %>% 
  dplyr::arrange(desc(num_queries))
query_features$percent_queries <- 
  query_features$num_queries/length(unique(queries_processed$uniqueQueryId))
write.csv(query_features, paste0(analyticsDir, 'feature_query_percent_TEST.csv'))
```

Again, The goal is to derive a description of SPARQL queries via feature vectors. For example:

> __iri_ref__:11, __literal__:1, __vars__:5, __vars_label__:1, __vars_label_usage__:1, __vars_usage__:20, ?item:15, ?type:1, ?typelabel:1, ?url:2, ?wikipedia:1, {:14, <http://standartconta.com.tr/>:1, <http://standartconta.com.tr>:1, <http://www.standartconta.com.tr/>:2, <http://www.standartconta.com.tr>:1, <http://www.www.standartconta.com.tr/>:1, <https://standartconta.com.tr/>:1, <https://standartconta.com.tr>:1, <https://www.standartconta.com.tr/>:1, <https://www.standartconta.com.tr>:1, <https://www.www.standartconta.com.tr/>:1, bd:serviceparam:1, distinct:1, optional:1, schema:about:1, select:1, service:1, union:10, wdt:p31:1, wdt:p856:12, where:1, wikibase:label:1, wikibase:language:1, nchar:903

In this example of a query feature vector we see that the query had used 11 IRI references (`__iri_ref__:11`), one literal (`__literal__:1`), five variables (`__vars__:5`) that were mentioned 20 times altogether (`__vars_usage__:20`), used `wikibase:label` once (`wikibase:label:1`), the `UNION` keyword 10 times (`union:10`), that the length of the query in characters is `nchar:903`, etc.

## 2. A criterion for selecting a critical query for WDQS

Now when each SPARQL query in our sample is described by a feature vector suitable for predictive modeling, we need to select a criterion variable: what do we want to predict? We want to predict the WDQS response time, obviously - the information carried in the `time_firstbyte` field of the `wmf.webrequest` table (i.e.how many seconds elapsed before the first byte of the server response was sent out to the client). However, since we already now that the time to process a SPARQL query depends not only upon the features of the query itself, but also upon the WDQS/Blazegraph configuration that presents a *black box* for us, we might conclude that predicting the exact server response time might present a too ambitious goal. Indeed, our initial attempts to predict the server response time for a query shows that approximately a correlation of R = .72 (on the test data set, approx. half of `1M` queries in the sample) between the predictive model based on our features and response time can be achieved - a result not that sound. 

What if we can train a model to predict a derived, binary variable that describes each query as *critical* (i.e. resulting in an excessively long query processing time) or *non-critical* (i.e. resulting in a "typical" query processing time)? If we pose the problem in this way, than it is easy to derive such a binary criterion from the observed `time_firstbyte` value for each query:

- take the **extreme outliers** on `time_firstbyte` to be all observations above the variable's *upper outer fence*: Q3 + 3*IQR, where Q3 is the `.75` percentile of `time_firstbyte`, while IQR stands for Inter-Quartile Range and presents the difference between Q3 (the `.75` percentile of the data) and Q1 (the `.25` percentile of the data).

This definition is widely used in Exploratory Data Analysis. We can also select a less extreme criterion by considering anything that is a mild outlier on `time_firstbyte` as critical:

- take the **mild outliers** on `time_firstbyte` to be all observations above the variable's *upper inner fence*: Q3 + 1.5*IQR, where Q3 is the `.75` percentile of `time_firstbyte`, while IQR stands for Inter-Quartile Range and presents the difference between Q3 (the `.75` percentile of the data) and Q1 (the `.25` percentile of the data).

While we have used both criteria in our attempts at predictive modeling, we will mainly focus on the prediction of **extreme** outliers on `time_firstbyte`  in what follows.

## 3. Machine Learning w. XGBoost: can we predict extreme outliers on WDQS response time from SPARQL query features?

Thus far,

- all SPARQL queries are represented as feature vectors, and
- a criterion variable has been defined (extreme outliers on WDQS server response times).

Our goal is to select a predictive model that will use the feauture vectors to predict whether the respective queries' response times are critical (extreme) or not. At the same time, the model will be performing feature selection and informing us about the value of each feature in the prediction. This will allows us to select a small set of SPARQL feauters that can be used to characterize the queries as resulting in typical vs. extreme WDQS response times. And such small feature vocabulary is exactly what we are looking for in order to be able to develop a WDQS optimization engine in the future: a system that would fast can any given SPARQL query and decide wheter it needs to be handled as a special, potentially difficult case to process - or should the processing proceed as usual.

We have used [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) via the R [{xgboost}](https://cran.r-project.org/web/packages/xgboost/index.html) package to train many predictive models that attempt to categorize SPARQL queries as critical or not in accordance with the above stated criterion. XGBoost - short for [Extreme Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) - is a ML method to estimate an ensemble of mathematical models - decision trees in our case - that jointly work to predict the response variable. Its description is really way out of scope of this Report. This technique is as known to be notoriously good in extracting almost every and the last piece of useful information from the training data (feauture vectors describing the SPARQL queries in this case). The XGBoost models involve a [large number of parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) that must be fine-tuned in order for it to derive a plausible predictive model for the task at hand. We have experimented with various parameter settings and data sets encompassing a variable number of extracted SPARQL features to reach the following conclusions:

- the number of features used to build a model does not significantly change the *accuracy* and *precision* (both to be explained soon) with which the model can tell the critical SPARQL queries from typical ones; this is most probably entirely due to the fact that XGBoost performs feature selection along the way of its training; this is also *very important* because it means that we can use only a small fraction of features to train a predictive model - and the model with less features will run faster than the one encompassing a large number of feature; 
- mostly, the performance of the predictive model is not seriously affected by choosing various combinations of parameter settings; we have performed many cross-validations, always additional splitting our `1M` sample of SPARQL queries into train and test data (approx. 50% of the sample each), in order to reach this conclusion;
- the model performance is good but far from perfect, however
- there are possible improvements that always involve a compromise in respect to how can be we design a future WDQS Optimization Engine.

```{r echo = T, eval = F, message = F}

```


In order to explain the typical modelling result with XGBoost, we need to introduce some terminology first.

Let's assume each of the SPARQL queries is coded as `1` if the respective WDQS response time for that query was critically high or `0` if it had fallen in a range of typical processing times. For a sample of SPARQL queries than we have two binary vectors over {0, 1} describing: (1) the observed status of the query (again: 1 - critical, 0 - typical), and (2) the predicted status of the query: what the trained predictive model "thinks" the query would be by looking at the features that describe it.

For example:

```{r echo = T}
sampleFrame <- data.frame(query = paste0("Query_", 1:10),
                          observed = round(runif(10, 0, 1)),
                          predicted = round(runif(10, 0, 1)))
sampleFrame
```

For each of the 10 queries in the table we know if they really where critical (1) or not (0) and we see that from the `observed` column, and if the model predicts they would be critical (1) or not (0) from their features: the `predicted` column. For example, in the first row we have a query that was critical indeed and correctly predicted as critical: that is something we call a Hit (or a True Positive) in ML. In the second row we observe a False Alarm (or a False Positive): the query was not really critical (it has 0 on `observed`) but the model has incorrectly categorized it as if it would be (1 on `predicted`). In the eigth row we find a situation known as a Correct Rejection: the query was not critical (0 on `observed`) and the model figured out it wouldn't be critical indeed by looking into its features (0 on `predicted`). Finally, the fourth important situation can be found in the sixth row, where the query was critical (1 on `observed`) but not recognized as such by the model (0 on `predicted`), a case known as Miss in the ML terminology. More on this terminology: https://en.wikipedia.org/wiki/Confusion_matrix.

In the table above, we can compute the model accuracy by looking at how many times do the `observe` and `predicted` column coincide:

**Table 1.** Confusion matrix example.

```{r echo = T}
acc <- sum(sampleFrame$observed == sampleFrame$predicted)/dim(sampleFrame)[1]
print(paste0("Model accuracy = ", round(acc, 2), "."))
```

When we look at the Hit rate we see that:

```{r echo = T}
hits <- sum(sampleFrame$observed == 1 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 1)
print(paste0("Hit rate = ", round(hits, 2), "."))
```

Finally, the False alarm rate of the model is:

```{r echo = T}
FAs <- sum(sampleFrame$observed == 0 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 1)
print(paste0("FA rate = ", round(FAs, 2), "."))
```

So, the model would recognize a critical query correctly 71% of time, while 30% of time it would incorrectly predict a non-critical query to be critical. Imagine if we use this model to chose when to start and when not to start some optimization procedure in WDQS - be it as simple as diverging different queries (critical vs. non-critical) to separate processing resources. From the viewpoint of any system optimization, the decision is needed: can we live with these two values, or not? The False Alarm suggests that 30% of time the model would suggest the optimization routine to start in a situation when it should not start and thus spare valuable system resources. Again, 100% - 71% (Hit rate) = 29% of time the model would miss to trigger the optimization routine for a critical query that could engage too many resources.

A good predictive model is the one which attains (a) a high Hit rate (i.e. observed critical and predicted critical), and (b) a low False Alarm rate (i.e. observed non-critical but predicted critical). In other words, we do not want our model to miss critical queries and thus not trigger some optimization procedure to help WDQS better use its resources, but we also do not want our model to trigger any optimization procedures on a query that would not result in a critically high usage of resources.

**Our current results**:

- the results vary with the number of features used and specific values of model parameters, but they do not vary significantly:
- typical model accuracy is around 95%;
- typical Hit rate is around 48%, and
- typical False alarm rate is slightly above 2%.

Now we can see how misleading is model accuracy alone.

**Tweaks and Compromises**

- Without going into detail about it, since this is a binary classification task, there is a legitimate way to tweak a model and attain a higher hit rate (by manipulating the so called decision boundary), but at the expense of increasing the False Alarm rate at the same time. For example, with the current results, we can achieve a Hit rate as high as 84% but at the expense of increasing the False Alarm rate to almost 40% at the same time.

How do we compromise about this prediction model? First to remind: it is a binary classification model. The model does not directly output its predictions as classes: `1` - critical, `0` - typical, as in the example in Table 1. Rather, the model learns about the queries from their feature vectors and predicts a **probability** that the query will critical (`1`). Now, in order to categorize the query as critical or not, we need to set a treshold value (i.e. a decision boundary), `.5` for example, and prononuce the model prediction to be `1` each time a predicted probability for a query is above the treshold value. A value of `.5` is an intuitive choice for binary classification, however, we might pick another value as well. It is by manipulating the treshold value how we enter a trade-off between True (Hit) and False Positive (False Alarm) rates. In the following chart the treshold value varies smoothly between 0 and 1, the True Positive rate is found on the vertical, and the False Positive rate on the horizontal axis:

**Chart 1.**

```{r echo}
library(ggplot2)
library(ggrepel)
locDir <- '/home/goransm/Work/___DataKolektiv/Projects/WikimediaDEU/_WMDE_Projects/Wikidata/WD_Misc/WD_SPARQL_Endpoint_Analytics/_analytics/'
rocModel <- read.csv(paste0(locDir, 'rocModel.csv'),
                     header = T, 
                     row.names = 1,
                     stringsAsFactors = F)
rocModel$diff <- rocModel$hit - rocModel$FA
wMaxDiff <- which.max(rocModel$diff)
rocModel[wMaxDiff, ]
rocModel$label <- ''
rocModel$label[wMaxDiff] <- paste0('Hit: ', round(rocModel$hit[wMaxDiff], 2), 
                                   "; FA: ", round(rocModel$FA[wMaxDiff], 2))
ggplot(data = rocModel,
       aes(x = FA, y = hit, label = label)) + 
  geom_point(size = .75) + geom_path(size = .25, color = "blue") + 
  ggrepel::geom_text_repel() + 
  xlab('False Positives (False Alarm Rate)') + ylab('True Positives (Hit Rate)') +
  theme_bw()
```

We can see that the there is a treshold value where we observe a 67% True Positive Rate and a 12% False Positive Rate: the difference between the two is highest when `treshold = .2`.

In effect this means:

- imagine, again, that we develop a system to optimize the WDQS by somehow splitting its resources into those that process queries with typical expected processing times from those with critical expected processing times;
- our current modelling efforts let us know that we can increase the number of True Positives to two-thirds at the expense of observing 12% of typical queries as if they were critical;
- that means the system would correctly run on critical processing resourcess 2/3 of the time, while it would erroneously run on critical processing resources 12% of time (i.e. 12% of queries should be processed with the typical processing resources but they would incorrectly trigger the critical processing resources);
- or, to rephrase: the model would think a critical query is non-critical one-third of a time, and decide to pass such a query to a normal, typical WDQS processing, while at the same time it would recognize a non-critical query as non-critical 88% of time and than it would not trigger any optimization action. 

The question is: can we afford such an optimizer for WDQS? What is the trade-off that we can accept to live with?

## 4. Directions for future research + considerations of possible problems in WDQS optimization

**Directions for further research (time estimate: until the end of the next week)**

- The **sampling approach** appears to be robust, but we need to replicate all feature engineering and ML procedures across at least several more samples;
- The **feature engineering** procedures are approximate: they could and should be improved;
- The **ML** results were discussed and it is uncertain if further significant improvements are possible, but at least the following should be tested:
    - working with corrections for imbalanced designs: we observe a very low number of critical queries compared to a huge number of typical queries in our sample; some corrections to control for this imbalance are already applied, but there are other approaches that might be helpful to potentially train better classifiers;
    - not all XGBoost parameters that might help train a better classifier were entered into our cross-validation procedures; some of them should be given a change;
    - maybe a different model altogether (e.g. Convolutional Deep Learning) should be tested.

**Final considerations**

- If we are about to develop some sort of a WDQS Optimization Engine based on this and/or similar models,
- we need to consider how fast can these class of models run in production, and
- if they can grab "environmental variables" from the WDQS/Blazegraph: for example, **one of the features they use is the total number of queries running against the WDQS in the same minute in which the query under analysis was run** - and this cannot be known from SPARQL query analysis alone.
- Deployment to production: we would probably need to switch from {xgboost} to its implementation in {h2o} where we can deploy models as Java MOJO or POJO objects, or build a REST API for a model predictive run; all these steps imply that some time would be lost in query processing before the optimizer would be able to deliver a prediction, so deploying a model with a minimum runtime latency in a predictive run would be essential.

***
Goran S. Milovanović

Wikimedia Deutschland, Data Scientist
DataKolektiv, Owner

2020.

contact: goran.milovanovic_ext@wikimedia.de

![](_img/DK_Logo_100.png)
![](_img/Wikidata-logo-en.png)
![](_img/Wikimedia_Deutschland_Logo_small.png)





