Report timestamp: 2020-04-29 13:03:08
Reference Phabricator Ticket: T248308
Feedback should be sent to goran.milovanovic_ext@wikimedia.de
WMDE, together with php.cc, is trying to find a way to optimize the Wikidata Query Service (WDQS). This report (1) provides an Exploratory Data Analysis (EDA) approach to WDQS response times (Part A) and (2) describes a Machine Learning (ML) approach (Part B) to analyzing the SPARQL queries that were received by the endpoint in the first three weeks of April 2020, with the goal of developing a model able to predict whether a query would be critical for WDQS/Blazegraph or not. By singling out the most important features present in the SPARQL queries under analysis, we hope to better understand what types of queries typically cause problems.
In the following sections we describe (1) the sampling approach taken to collect a relevant set of SPARQL queries observed at the WDQS endpoint in three weeks of April 2020, (2) the feature engineering approach taken to analyze the composition of queries and extract meaningful “atomic elements” of SPARQL that will be used in an attempt to predict the WDQS response time for a query, (3) the definition of a criterion for selecting a critical query for WDQS (i.e. one that makes too much use of the service’s resources), and (4) the ML (optimization) approach, relying on XGBoost, taken to build a predictive model for WDQS server response times starting from the extracted SPARQL features.
In the event.wdqs_external_sparql_query table (which lives in the WMF Data Lake, Hadoop) we find the query_time variable, representing the WDQS processing time in seconds for each query observed at the endpoint. Two groups of queries are distinguished:
- queries that took excessively long processing times, defined as mild outliers: query_time >= Q3 + 1.5×IQR, where Q3 represents the 3rd quartile (the 75th percentile) of the WDQS response time distribution and IQR represents the inter-quartile range, IQR = Q3 - Q1;
- queries that took “typical” processing times: everything with a processing time below what we have just defined as mild outliers.

The SPARQL queries that hit the endpoint are registered in the event.wdqs_external_sparql_query table in the WMF Data Lake (our Hadoop Big Data storage). A bucketing approach to sampling data from this Hive table was taken, drawing approximately 1% of the queries observed in the period between 2020/04/01 and 2020/04/21. The sample is approximately uniformly distributed across the days and hours of the mentioned period. The sample size is slightly above 1M SPARQL queries, including duplicates.
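The mild-outlier criterion can be computed directly from the empirical query_time distribution; a minimal sketch with made-up response times (R's default quantile() estimator is used here, which may differ slightly from the one used in the report's analyses):

```r
# - illustrative query_time values in seconds (invented, not real WDQS data)
query_time <- c(0.2, 0.5, 1.1, 1.4, 2.0, 2.3, 3.8, 4.1, 9.5, 60)
# - Q1, Q3 and the inter-quartile range
q1 <- unname(quantile(query_time, 0.25))
q3 <- unname(quantile(query_time, 0.75))
iqr <- q3 - q1
# - mild outlier cut-off: Q3 + 1.5 x IQR
threshold <- q3 + 1.5 * iqr
# - flag critical queries
critical <- query_time >= threshold
```

Queries at or above the threshold form the "critical" class; everything below it is "typical".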
The following fields were derived from the event.wdqs_external_sparql_query table for each collected SPARQL query:
- ds_sparql - the SPARQL query
- ds_dt - the query timestamp, later parsed into year, month, day, hour, and minute
- ds_status_code - the server HTTP status response for this query
- ds_query_time - how many seconds it took the WDQS to process the query
- ds_datacenter_host - a combination of (a) the WMF data center and (b) a back-end host in the respective data center
- ds_format - content type (e.g. JSON, text/tsv, xml, etc.)
- ds_method - HTTP method (e.g. GET, POST)
- ds_n_conc_queries - for each SPARQL query, the number of SPARQL queries whose processing started in (1) the same minute, on (2) the same data center, and on (3) the same back-end host.

NOTE. It is immediately obvious that not all fields from the event.wdqs_external_sparql_query table are candidates for a predictive model of the described kind: ds_status_code, the server HTTP response code, is not known before the query is actually processed. We keep this variable in the dataset, having in mind that it might serve as a potential criterion in similar prediction tasks someday.
The following R script was used to obtain a sample of SPARQL queries from event.wdqs_external_sparql_query
.
#!/usr/bin/env Rscript
### ---------------------------------------------------------------------------
### --- WDQS_Event_S1_ETL.R, v. 0.0.1
### --- Script: WDQS_Event_S1_ETL.R, v. 0.0.1
### --- Description: WDQS_Event_S1_ETL.R collects a sample of
### --- SPARQL queries from event.wdqs_external_sparql_query in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/22
### ---------------------------------------------------------------------------
### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software:
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------
### ---------------------------------------------------------------------------
### --- Section 1. Collect raw data set event.wdqs_external_sparql_query
### ---------------------------------------------------------------------------
### --- setup
library(data.table)
library(tidyverse)
### --- directories
### --- Read WDCM parameters
# - fPath: where the script is run from
fPath <- as.character(commandArgs(trailingOnly = F)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
paste(fPath[1:(length(fPath) - 1)], collapse = "/"),
"/",
sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')
### --- grab a sample of SPARQL queries event.wdqs_external_sparql_query
### --- w. HiveQL from Beeline
# - query filename
queryFile <- "wdqs_CollectData.hql"
# - Kerberos Init
kerberosPrefix <- 'sudo -u analytics-privatedata kerberos-run-command analytics-privatedata '
# - dateRange
startDate <- '2020/04/01'
endDate <- '2020/04/21'
dateRange <- seq(as.Date(startDate), as.Date(endDate), by = "day")
# - query over dateRange
for (i in 1:length(dateRange)) {
print(paste0("Started: ", dateRange[i]))
# - year, month, date
year = substr(dateRange[i], 1, 4)
month = substr(dateRange[i], 6, 7)
if (substr(month, 1, 1) == '0') {month = substr(month, 2, 2)}
day = substr(dateRange[i], 9, 10)
if (substr(day, 1, 1) == '0') {day = substr(day, 2, 2)}
# - compose query
hiveQLquery <- paste0("SELECT meta.dt as dt,
regexp_replace(
regexp_replace(
regexp_replace(query, '\\n|\\r\\n|\\r|\\n\\r', ' __NEWLINE__ '),
'\\t', ' '),
'\\000', ' ') as sparql,
format, meta.domain as domain,
http.request_headers['X-BIGDATA-MAX-QUERY-MILLIS'] as max_query_millis, http.method as method,
http.status_code as status_code,
backend_host, datacenter, query_time
FROM event.wdqs_external_sparql_query TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) w
WHERE isnotnull(http.status_code)
AND (meta.domain = 'query.wikidata.org')
AND year=", year,
" AND month=", month,
" AND day=", day, ';')
filename <- "wdqs_data_sample_TEST_"
filename <- paste0(filename, dateRange[i], ".tsv")
# - write hql
write(hiveQLquery, paste0(fPath, queryFile))
# - to Report
print("Fetching SPARQL data sample from event.wdqs_external_sparql_query now.")
# - Run query
query <- system(command = paste(kerberosPrefix,
'/usr/local/bin/beeline --incremental=true --silent -f "',
paste0(fPath, queryFile),
'" > ', dataDir, filename,
sep = ""),
wait = TRUE)
print(paste0("Completed: ", dateRange[i]))
print("--------------------------------------------")
}
# - to Report
print("DONE w. ETL from Hadoop: event.wdqs_external_sparql_query.")
The following R script checks for data consistency. It also parses the dt (timestamp) field and extracts the year, month, day, hour, and minute when the query was run. In addition, a new variable, ymdhm, is produced as a simple concatenation of day, hour, and minute (since the sample covers a single month, year and month are not needed to make it unique). This variable will subsequently be used to calculate the number of queries that were run in a particular minute observed in the sample, per WMF data center (eqiad or codfw) and back-end host in the data center. This is a potential “environmental” variable (i.e. taken from the WDQS/Blazegraph environment) in predictive modeling of server response time. It can characterize the query processing context by estimating how many queries were run against Blazegraph concurrently in the same minute, on the same server, when a particular query of interest was run. Any future WDQS optimization system relying on this variable would have to rely on Blazegraph somehow broadcasting this variable to the predictive modeling API.
NOTE. In this step we also eliminate all queries that resulted in an http_status of 4**: client-side errors. Obviously, incorrectly stated SPARQL queries are of no use to us in any attempt to optimize the WDQS processing time, so if a query fails the initial syntax analysis on the endpoint it does not enter our analysis.
#!/usr/bin/env Rscript
### ---------------------------------------------------------------------------
### --- WDQS_Event_S2_CleanUp_Inspect.R, v. 0.0.1
### --- Script: WDQS_Event_S2_CleanUp_Inspect.R, v. 0.0.1
### --- Description: WDQS_Event_S2_CleanUp_Inspect.R performs
### --- feature engineering of the SPARQL queries data
### --- collected w. WDQS_Event_S1_ETL.R from the
### --- event.wdqs_external_sparql_query table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/22
### ---------------------------------------------------------------------------
### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software:
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------
### --- setup
library(data.table)
library(tidyverse)
library(urltools)
### --- directories
### --- Read WDCM parameters
# - fPath: where the script is run from
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
paste(fPath[1:(length(fPath) - 1)], collapse = "/"),
"/",
sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')
### ---------------------------------------------------------------------------
### --- Section 1. Load, clean, and decode the SPARQL queries data set
### ---------------------------------------------------------------------------
### --- Sys.setlocale()
Sys.setlocale('LC_ALL','C')
### --- load data
lF <- list.files(dataDir)
lF <- lF[grepl("^wdqs_data_sample", lF)]
dataSet <- lapply(paste0(dataDir, lF), fread, header = T, quote="")
dataSet <- rbindlist(dataSet)
dim(dataSet)
### --- inspect
colnames(dataSet)
str(dataSet)
### --- clean data: sparql
dataSet$sparql <- gsub("__NEWLINE__", "\\\\n", dataSet$sparql)
### --- clean data: max_query_millis
sumNA <- sum(is.na(dataSet$max_query_millis))
sumNA
table(dataSet$max_query_millis)
# - all dataSet$max_query_millis values are: 60000; dropping
dataSet$max_query_millis <- NULL
### --- clean data: status_code
sumNA <- sum(is.na(dataSet$status_code))
sumNA
table(dataSet$status_code)
# - remove status_code = 4*
w400 <- which(grepl('^4', dataSet$status_code))
length(w400)
dim(dataSet)
if (length(w400) > 0) {
dataSet <- dataSet[-w400, ]
}
dim(dataSet)
### --- clean data: format
sumNA <- sum(is.na(dataSet$format))
sumNA
table(dataSet$format)
# - fix: lower case and upper case
dataSet$format <- tolower(dataSet$format)
table(dataSet$format)
### --- clean data: domain
sumNA <- sum(is.na(dataSet$domain))
sumNA
table(dataSet$domain)
# - all dataSet$domain values are: query.wikidata.org; dropping
dataSet$domain <- NULL
### --- clean data: method
sumNA <- sum(is.na(dataSet$method))
sumNA
table(dataSet$method)
### --- clean data: backend_host
sumNA <- sum(is.na(dataSet$backend_host))
sumNA
table(dataSet$backend_host)
### --- clean data: datacenter
sumNA <- sum(is.na(dataSet$datacenter))
sumNA
table(dataSet$datacenter)
### --- clean data: query_time
sumNA <- sum(is.na(dataSet$query_time))
sumNA
summary(dataSet$query_time)
### --- validate dt (timestamp)
sumNA <- sum(is.na(dataSet$dt))
sumNA
w <- which(dataSet$dt == "-")
length(w)
w <- which(!(grepl("\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d:\\d\\d:\\d\\d\\.\\d+Z",
dataSet$dt)))
length(w)
# - nothing crucial:
dataSet$dt[w]
### --- extract year, month, day, hour, and minute from dt (timestamp)
dataSet$day <- substr(dataSet$dt, 9, 10)
dataSet$hour <- substr(dataSet$dt, 12, 13)
dataSet$minute <- substr(dataSet$dt, 15, 16)
dataSet$ymdhm <- paste(dataSet$day,
dataSet$hour,
dataSet$minute,
sep = "-")
### --- remove dataSet$dt
dataSet$dt <- NULL
### --- inspect: format and query_time
dataSet %>%
dplyr::select(format, query_time) %>%
dplyr::group_by(format) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: method and query_time
dataSet %>%
dplyr::select(method, query_time) %>%
dplyr::group_by(method) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: status_code and query_time
dataSet %>%
dplyr::select(status_code, query_time) %>%
dplyr::group_by(status_code) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: backend_host and query_time
dataSet %>%
dplyr::select(backend_host, query_time) %>%
dplyr::group_by(backend_host) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: datacenter and query_time
dataSet %>%
dplyr::select(datacenter, query_time) %>%
dplyr::group_by(datacenter) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: day and query_time
dataSet %>%
dplyr::select(day, query_time) %>%
dplyr::group_by(day) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- inspect: hour and query_time
dataSet %>%
dplyr::select(hour, query_time) %>%
dplyr::group_by(hour) %>%
dplyr::summarise(mean = mean(query_time, na.rm = T),
median = median(query_time, na.rm = T),
min = min(query_time, na.rm = T),
max = max(query_time, na.rm = T),
n = n()) %>%
dplyr::arrange(desc(median))
### --- add ID, uniqueQueryID
dataSet$id <- 1:dim(dataSet)[1]
uniqueQueries <- unique(dataSet$sparql)
uniqueQueries <- data.frame(sparql = uniqueQueries,
uniqueSparqlId = 1:length(uniqueQueries),
stringsAsFactors = F)
dim(uniqueQueries)
dataSet <- dataSet %>%
dplyr::left_join(uniqueQueries,
by = "sparql")
rm(uniqueQueries)
### --- rearrange, store clean data set
colnames(dataSet)
dataSet <- dataSet %>%
dplyr::select('id', 'uniqueSparqlId', 'day', 'hour', 'minute', 'ymdhm',
'sparql', 'format', 'method', 'status_code',
'backend_host', 'datacenter', 'query_time')
# - final check:
w <- which(complete.cases(dataSet))
length(w)
dim(dataSet)
### --- store to analyticsDir
filename <- paste0(analyticsDir,
'wdqs_data_sample_',
substr(as.character(Sys.time()), 1, 10),
'.csv')
write.csv(dataSet, filename)
The SPARQL queries in our sample, present in the sparql field of the dataset retrieved from the event.wdqs_external_sparql_query table, need to be parsed in order to extract a set of meaningful features to use in predictive modeling of server response time. As in Natural Language Processing, we need to differentiate between three possible levels of analysis here: query syntax, query semantics, and query pragmatics.
Our approach to feature engineering is to parse each SPARQL query with a set of regular expressions in order to (a) single out the “atomic elements” of the SPARQL language (keywords, functions, IRI mentions, variables, literals, operators, Wikibase parameters, etc.), and (b) compute a set of abstract variables that can be used to describe the query, e.g.: the number of different variables instantiated (i.e. how many unique variables there are in a query), the total number of variable mentions (including repeated uses of the same variable), the number of IRI references in a query, how many variables also request labels, and similar.
Our approach to feature engineering is to parse each SPARQL query with a set of regular expressions in order to (a) single out the “atomic elements” of the SPARQL language (keywords, functions, IRI mentions, variables, literals, operators, Wikibase parameters, etc.), (b) compute a set of abstract variables that can be used to describe the query, e.g.: the number of different variables instantiated (i.e. how many unique variables there are in a query), the total number of variables mentioned (including repeated uses of the same variable), the number of IRI references in a query, how many variables also request labels, and similar.
With this approach, which is essentially the same as what is typically done in Natural Language Processing, query syntax and semantics are in effect fused. A question could be posed: why keep specific variable names, or comments, when it is clear that the latter are not even processed, while variable names are conventional and WDQS users are free to choose them as long as they satisfy the rules of SPARQL syntax? These aspects of query analysis are important from the viewpoint of query pragmatics and indeed carry useful information. Of course the variable names are conventional, but imagine a community of librarians using WDQS who learn SPARQL from each other and thus frequently use variable names like ?author and/or ?authorLabel. If such “typical” variable names are frequently used in similar queries, they can indeed act as informative features in the prediction of query response time. As for comments: many WDQS bots and/or human users might use a “signature” comment of some form (and indeed some do). If such bots run their queries against WDQS frequently, the comments in their code might also act as predictive features for query performance.
NOTE. The current regex analysis of SPARQL queries is approximate and, most of the time, correct in the sense of being able to extract intuitively meaningful atoms of SPARQL. It is approximate in the sense that the regular expressions used were derived by the author of this report by directly analyzing typical SPARQL queries, with only limited attention paid to the formal definition of SPARQL (e.g. its Backus–Naur forms, or the formal regex definitions of its various syntactic constituents). The feature engineering procedure used in our current analyses can certainly be improved further.
Beyond extracting SPARQL “atoms”, the following features are derived as frequencies (counts) for each query:
- nchar - the length of the query in characters
- __vars__ - how many variables are used?
- __vars_usage__ - how many variable mentions are there in total?
- __vars_label__ - how many variables also request labels?
- __vars_label_usage__ - how many times are variable labels mentioned in a query?
- __IRI_REF__ - how many IRI references are used in the query?
- __LITERAL__ - how many SPARQL literals are used in the query?

So the final description of any SPARQL query, following these feature engineering steps, is a vector of counts for each extracted feature, including: (a) tokens (e.g. each IRI or SPARQL literal mentioned), as well as (b) types (e.g. how many __IRI_REF__ mentions there are in a query in total).
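As a toy illustration of this idea (a simplified sketch, not the full parse_SPARQL() function given in the script below), a few of these counts can be pulled out of a single query with base-R regular expressions:

```r
# - a toy SPARQL query (invented, for illustration only)
q <- paste0('SELECT ?item ?itemLabel WHERE { ?item ',
            '<http://www.wikidata.org/prop/direct/P31> ',
            '<http://www.wikidata.org/entity/Q5> . }')
# - IRI references: anything enclosed in <...> (no whitespace inside)
iri_refs <- regmatches(q, gregexpr('<([[:alnum:]]|[[:punct:]])+>', q))[[1]]
# - variable mentions: tokens starting with ? or $
var_mentions <- regmatches(q, gregexpr('[?$][[:alnum:]_]+', q))[[1]]
# - a few of the abstract count features
features <- c(
  `__IRI_REF__`    = length(iri_refs),
  `__vars_usage__` = length(var_mentions),
  `__vars__`       = length(unique(tolower(var_mentions)))
)
```

For this query the vector records two IRI references, three variable mentions, and two unique variables (?item and ?itemLabel).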
NOTE 1. Approximating syntax. Obviously, this feature engineering approach does not analyze the structural relations between the language constituents: the query's syntax. However, some of the extracted features clearly “correlate” with the query's syntactic complexity. For example, the count of { tokens, or the count of SELECT keywords used in a query, clearly speaks to the query depth (i.e. how many sub-queries there are, if any). So syntactic information is present in our feature engineering up to a degree dictated by how much useful information is carried by the mention of a particular SPARQL keyword, function, or operator, the number of variables used, the query “depth”, and similar. More useful syntactic information could be obtained from Blazegraph by running a query in the EXPLAIN mode (similar to what one would do in SQL), see: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Explain_Query. However, even if features obtained from the query execution plan in this way might prove useful, it is unclear at this point whether the SPARQL query execution plan can be obtained from Blazegraph without actually running the query.
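To make the depth intuition concrete: the nesting level of a query is the running maximum of open minus closed braces while scanning it. A small sketch (query_depth() is a hypothetical helper, not part of the report's actual feature set):

```r
# - approximate query "depth" as the maximum brace-nesting level;
# - note: braces inside string literals or comments would be miscounted,
# - so this is only a rough structural proxy
query_depth <- function(q) {
  chars <- strsplit(q, "")[[1]]
  level <- cumsum((chars == "{") - (chars == "}"))
  max(c(0, level))
}
flat   <- "SELECT ?x WHERE { ?x ?p ?o }"
nested <- "SELECT ?x WHERE { { SELECT ?x WHERE { ?x ?p ?o } } }"
```

The flat query has depth 1, the query with a nested sub-select has depth 3; a raw count of { tokens approximates the same information without tracking the nesting explicitly.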
NOTE 2. The following script performs feature engineering following the principles outlined above. It also computes the n_conc_queries variable: the number of queries whose processing started concurrently in the same minute, in the same WMF data center, on the same back-end host. The queries_vocabulary.csv file contains a list of all extracted features, sorted in descending order of their usage frequency across all SPARQL queries in the sample. The queries_coverage.csv dataset, on the other hand, lists features alongside the number of queries in the sample that make use of them, and also expresses feature coverage as a percentage. Additional operations are performed on column (feature) names in order to conform them to XGBoost standards. The top_f variable controls how many of the most frequently used features are selected for the data set that is passed on to the ML phase with XGBoost.
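The top_f selection step itself is straightforward; a sketch under the assumption of a vocabulary table with feature and freq columns like queries_vocabulary.csv (the frequencies here are invented):

```r
# - toy vocabulary of feature frequencies (invented values)
vocabulary <- data.frame(
  feature = c("__vars_usage__", "select", "where", "{", "rare_comment"),
  freq    = c(5000, 3200, 3100, 6400, 3),
  stringsAsFactors = FALSE
)
# - keep only the top_f most frequently used features for the ML phase
top_f <- 4
selected <- vocabulary[order(-vocabulary$freq), "feature"][1:top_f]
```

Rarely used features (here, rare_comment) are dropped, which keeps the model matrix passed to XGBoost dense enough to be informative.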
#!/usr/bin/env Rscript
### ---------------------------------------------------------------------------
### --- WDQS_Event_S3_FeatureEngineering.R, v. 0.0.1
### --- Script: WDQS_Event_S3_FeatureEngineering.R, v. Beta 0.1
### --- Description: WDQS_Event_S3_FeatureEngineering.R performs
### --- feature engineering of the SPARQL queries data
### --- collected w. WDQS_Event_S1_ETL.R from the
### --- event.wdqs_external_sparql_query table in the WMF Data Lake.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: on stat1005
### --- Date: 2020/04/22
### ---------------------------------------------------------------------------
### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software:
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------
### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(snowfall)
### --- directories
### --- Read WDCM parameters
# - fPath: where the script is run from
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
paste(fPath[1:(length(fPath) - 1)], collapse = "/"),
"/",
sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')
### ---------------------------------------------------------------------------
### --- Section 1. Parse SPARQL queries
### ---------------------------------------------------------------------------
### --- load queries
filename <- 'wdqs_data_sample_2020-04-22.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - keep only unique queries
w <- which(!duplicated(dataSet$uniqueSparqlId))
if (length(w) > 0) {
dataSet <- dataSet[w]
}
dim(dataSet)
### --- prepare queries for parsing
queries <- dataSet$sparql
uniqueQueryId <- dataSet$uniqueSparqlId
rm(dataSet); gc()
### --- function: parse_SPARQL()
parse_SPARQL <- function(tq) {
# - get nchar()
tq_nchar <- nchar(tq)
# - split by "\n"
tq <- unlist(lapply(tq, function(x) {
strsplit(x, split = "\\\\n")
}))
# - trim trailing space at the beginning and end, if any
tq <- str_trim(tq, side = "both")
# - remove empty lines if any
wEmpty <- which(tq == "")
if (length(wEmpty) > 0) {
tq <- tq[-wEmpty]
}
# - collect IRI references
tq_iri <- unlist(str_extract_all(tq, '<([[:alnum:]]|[[:punct:]])+>'))
# - match IRI references
tq <- gsub('<([[:alnum:]]|[[:punct:]])+>', " __IRI_REF__ ", tq)
# - process comments
tq_comments <- unlist(str_extract_all(tq, "#.+$"))
tq <- gsub("#.+$", "", tq)
wEmpty <- which(tq == "")
if (length(wEmpty) > 0) {
tq <- tq[-wEmpty]
}
# - space to the left and right of
# - brackets, operators, and punctuation
tq <- gsub("\\{", " \\{ ", tq)
tq <- gsub("\\}", " \\} ", tq)
tq <- gsub("\\)", " \\) ", tq)
tq <- gsub("\\(", " \\( ", tq)
tq <- gsub("\\]", " \\] ", tq)
tq <- gsub("\\[", " \\[ ", tq)
# - pad two-character operators first, then single-character ones;
# - keep the captured neighbour character (\\1) instead of dropping it
tq <- gsub("!=", " != ", tq)
tq <- gsub(">=", " >= ", tq)
tq <- gsub("<=", " <= ", tq)
tq <- gsub("([^<>!=])=", "\\1 = ", tq)
tq <- gsub("!([^=])", " ! \\1", tq)
tq <- gsub(">([^=])", " > \\1", tq)
tq <- gsub("<([^=])", " < \\1", tq)
tq <- gsub("\\^", " \\^ ", tq)
tq <- gsub("\\+", " \\+ ", tq)
tq <- gsub("-", " - ", tq)
tq <- gsub("\\*", " \\* ", tq)
tq <- gsub("/", " / ", tq)
tq <- gsub("\\|\\|", " \\|\\| ", tq)
tq <- gsub("([^|])\\|([^|])", "\\1 | \\2", tq)
tq <- gsub("&&", " && ", tq)
tq <- gsub(",", " , ", tq)
tq <- gsub("\\.", " \\. ", tq)
tq <- gsub(";", " ; ", tq)
tq <- gsub("\\s+", " ", tq)
# - match literals
tq <- gsub('(\'|")+([^\'"])+(\'|")+', " __LITERAL__ ", tq)
tq <- gsub('\\b[[:digit:]]+\\b', " __LITERAL__ ", tq)
# - "word tokenization": split by "\s"
tq <- unlist(lapply(tq, function(x) {
strsplit(x, split = "\\s")
}))
# - remove: '"', ".", ","
tq <- gsub('("|\\.|,)+', '', tq)
# - remove: "'"
tq <- gsub("'+", '', tq)
# - count unique variables
num_vars <- length(unique(tq[grepl('^\\?|^\\$', tq)]))
# - count variable use
vars_usage <- length(tq[grepl('^\\?|^\\$', tq)])
# - count unique label variables
num_vars_label <- length(unique(tq[grepl('[^:]label$', tq, ignore.case = T)]))
# - count label variable use
label_vars_usage <- length(tq[grepl('[^:]label$', tq, ignore.case = T)])
# - add variables use and unique variable count to tq
tq <- c(tq,
rep('__vars__', num_vars),
rep('__vars_usage__', vars_usage),
rep('__vars_label__', num_vars_label),
rep('__vars_label_usage__', label_vars_usage))
# - tolower()
tq <- tolower(tq)
# - clean from: "]", "}", ";", ")" if present anywhere
tq <- gsub("\\]|\\}|\\)|;", "", tq)
# - remove empty elements from tq if any:
wEmpty <- which(grepl("^\\s+$|^$", tq))
if (length(wEmpty) > 0) {tq <- tq[-wEmpty]}
# - enter comments
if (length(tq_comments) > 0) {
tq <- c(tq, tq_comments)
}
# - enter tq_iri
if (length(tq_iri) > 0) {
tq <- c(tq, tq_iri)
}
# - tq feature frequency as data.frame
tq <- as.data.frame(table(tq))
# - enter nchar(tq) as feature:
tq <- rbind(tq,
data.frame(tq = 'nchar',
Freq = tq_nchar))
# - ouput: tq
return(tq)
}
### --- Parse SPARQL queries in parallel
sfInit(parallel = T, cpus = 36)
t1 <- Sys.time()
sfExport("queries")
sfExport("parse_SPARQL")
sfLibrary(stringr)
# - process
queries_processed <- sfClusterApplyLB(queries, parse_SPARQL)
# - stop cluster
sfStop()
print(paste0("Parsing took: ",
round(difftime(Sys.time(), t1, units = "min"), 2)))
### --- collect results
names(queries_processed) <- uniqueQueryId
queries_processed <- rbindlist(queries_processed,
idcol = T,
fill = T,
use.names = T)
colnames(queries_processed) <-
c('uniqueSparqlId', 'feature', 'freq')
write.csv(queries_processed, paste0(analyticsDir, "queries_processed.csv"))
### --- produce vocabulary: feature frequency
vocabulary <- queries_processed %>%
dplyr::select(feature, freq) %>%
dplyr::group_by(feature) %>%
dplyr::summarise(freq = sum(freq)) %>%
dplyr::arrange(desc(freq))
write.csv(vocabulary, paste0(analyticsDir, "queries_vocabulary.csv"))
dim(vocabulary)
head(data.frame(vocabulary), 400)
### --- in how many queries is a particular feature used?
query_features <- queries_processed %>%
dplyr::select(feature) %>%
dplyr::group_by(feature) %>%
dplyr::summarise(num_queries = n()) %>%
dplyr::arrange(desc(num_queries))
query_features$percent_queries <-
query_features$num_queries/length(unique(queries_processed$uniqueSparqlId))
write.csv(query_features, paste0(analyticsDir, "queries_coverage.csv"))
head(data.frame(query_features), 400)
# - clear
rm(list = setdiff(ls(), c('analyticsDir', 'dataDir', 'fPath')))
gc()
### ---------------------------------------------------------------------------
### --- Section 2. Prepare Model Datasets
### ---------------------------------------------------------------------------
### --- feature frequencies: vocabulary
filename <- 'queries_vocabulary.csv'
vocabulary <- fread(paste0(analyticsDir, filename))
vocabulary$V1 <- NULL
### --- dataset
filename <- 'wdqs_data_sample_2020-04-22.csv'
dataSet <- fread(paste0(analyticsDir, filename))
dataSet$V1 <- NULL
# - remove features that will not be used in the ML phase
dataSet$sparql <- NULL
dataSet$minute <- NULL
# - derive: n_conc_queries per datacenter x back_endhost
dataSet$datacenter_host <- paste(dataSet$datacenter,
dataSet$backend_host,
sep = "_")
dataSet$datacenter <- NULL
dataSet$backend_host <- NULL
# - the number of queries started in the same minute
# - as the target query
# - per datacenter x back_endhost
dataSet$datacenter_host_ymdhm <- paste(dataSet$datacenter_host,
dataSet$ymdhm,
sep = "_")
dataSet$ymdhm <- NULL
n_conc_queries <- dataSet %>%
dplyr::select(datacenter_host_ymdhm) %>%
dplyr::group_by(datacenter_host_ymdhm) %>%
dplyr::summarise(n_conc_queries = n())
dataSet <- dataSet %>%
dplyr::left_join(n_conc_queries, by = "datacenter_host_ymdhm")
dataSet$datacenter_host_ymdhm <- NULL
rm(n_conc_queries)
# - derive: mild and extreme outliers on query_time
# - query_time >= q3 + 3*iq
# - q3 = 75 percentile, q1 = 25 percentile
q3 <- quantile(dataSet$query_time, .75)
q1 <- quantile(dataSet$query_time, .25)
# - iq = q3 - q1; Interquartile range
iq <- unname(q3 - q1)
upper_outer_fence <- q3 + 3*iq
upper_inner_fence <- q3 + 1.5*iq
dataSet$query_time_extreme <-
ifelse(dataSet$query_time >= upper_outer_fence, 1, 0)
dataSet$query_time_mild <-
ifelse(dataSet$query_time >= upper_inner_fence, 1, 0)
# - derive multiclass DV: 0 - mild, mild - extreme, extreme query_time
dataSet$query_time_class <- 0
dataSet$query_time_class[dataSet$query_time_extreme == 0 & dataSet$query_time_mild == 1] <- 1
dataSet$query_time_class[dataSet$query_time_extreme == 1] <- 2
# - derive: below and above median query_time
dataSet$query_time_median <- ifelse(dataSet$query_time >= median(dataSet$query_time), 1, 0)
### --- vocabulary: top_f most frequently used features
top_f <- 300
ob_features <- c('nchar', '__vars__', '__vars_usage__', '__literal__',
'__vars_label__', '__vars_label_usage__', '__iri_ref__')
vocabulary <- vocabulary$feature[1:top_f]
vocabulary <- unique(c(vocabulary, ob_features))
### --- join vocabulary to dataSet
queries_processed <- fread(paste0(analyticsDir, 'queries_processed.csv'))
queries_processed$V1 <- NULL
dim(queries_processed)
queries_processed <- queries_processed %>%
dplyr::filter(feature %in% vocabulary)
rm(vocabulary)
dim(queries_processed)
featureFrame <- data.table::dcast(as.data.table(queries_processed),
uniqueSparqlId ~ feature,
value.var = "freq")
rm(queries_processed); gc()
# - mark features in dataSet: ds
colnames(dataSet) <- paste0("ds_", colnames(dataSet))
# - join:
dataSet <- dataSet %>%
dplyr::left_join(featureFrame,
by = c('ds_uniqueSparqlId' = 'uniqueSparqlId'))
rm(featureFrame); gc()
colnames(dataSet)
### --- conform column names to {xgboost}
modelFrameColumns <- colnames(dataSet)
# - define replacements
mchar_regex <- c('\\.', '\\\\', '\\|', '\\(', '\\)',
'\\[', '\\]', '\\{', '\\}', '\\^',
'\\$', '\\*', '\\+', '\\?')
mchar_regex_replace <- c('_dot_', '_backslash_', '_vbar_', '_left_bracket_', '_right_bracket_',
'_left_sq_bracket_', '_right_sq_bracket_', '_left_curly_', '_right_curly_', '_caret_',
'_dollar_', '_asterisk_', '_plus_', '_question_mark_')
modelFrameColumns <- unname(sapply(modelFrameColumns, function(x) {
for (i in 1:length(mchar_regex)) {
if (grepl(mchar_regex[i], x)) {
x <- gsub(mchar_regex[i], mchar_regex_replace[i], x)
}
}
return(x)
}))
sparql_ops <- c('#', '@', ':', ';', '/', '>', '<', '=', '<=', '>=')
sparql_ops_replace <- c('_hash_', '_at_', '_colon_', '_semi_colon_', '_slash_', '_greater_than_', '_less_than_',
'_equals_', '_less_than_or_equal_', '_greater_than_or_equal_')
modelFrameColumns <- unname(sapply(modelFrameColumns, function(x) {
for (i in 1:length(sparql_ops)) {
if (grepl(sparql_ops[i], x)) {
x <- gsub(sparql_ops[i], sparql_ops_replace[i], x)
}
}
return(x)
}))
modelFrameColumns <- paste0("f_", modelFrameColumns)
# - produce and store modelFrameFeatureCode
modelFrameFeatureCode <- data.frame(feature = colnames(dataSet),
xgboost_feature = modelFrameColumns,
stringsAsFactors = F)
filename <- paste0('modelFrameFeatureCode', '_', top_f, '.csv')
write.csv(modelFrameFeatureCode,
paste0(analyticsDir, filename))
# - store dataSet w. {xgboost} column names
colnames(dataSet) <- modelFrameFeatureCode$xgboost_feature
filename <- paste0('modelFrame', '_', top_f, '.csv')
write.csv(dataSet,
paste0(analyticsDir, filename))
Again, the goal is to derive a description of each SPARQL query as a feature vector. For example:
iri_ref:11, literal:1, vars:5, vars_label:1, vars_label_usage:1, vars_usage:20, ?item:15, ?type:1, ?typelabel:1, ?url:2, ?wikipedia:1, {:14, http://standartconta.com.tr/:1, http://standartconta.com.tr:1, http://www.standartconta.com.tr/:2, http://www.standartconta.com.tr:1, http://www.www.standartconta.com.tr/:1, https://standartconta.com.tr/:1, https://standartconta.com.tr:1, https://www.standartconta.com.tr/:1, https://www.standartconta.com.tr:1, https://www.www.standartconta.com.tr/:1, bd:serviceparam:1, distinct:1, optional:1, schema:about:1, select:1, service:1, union:10, wdt:p31:1, wdt:p856:12, where:1, wikibase:label:1, wikibase:language:1, nchar:903
In this example of a query feature vector we see that the query used 11 IRI references (__iri_ref__:11), one literal (__literal__:1), and five variables (__vars__:5) that were mentioned 20 times altogether (__vars_usage__:20); that it used wikibase:label once (wikibase:label:1) and the UNION keyword 10 times (union:10); that the length of the query in characters is 903 (nchar:903), etc.
Our goal is to select a predictive model that will use the feature vectors to predict whether the respective queries' response times are critical (i.e. whether they are outliers on query_time) or not. At the same time, the model will perform feature selection (i.e. XGBoost will not necessarily use all of the provided features to build its sequence of decision trees for binary classification) and inform us about the contribution of each feature to the prediction. This will allow us to select a small set of SPARQL features that can be used to characterize the queries as resulting in typical vs. extreme WDQS response times. Such a small feature vocabulary is exactly what we are looking for in order to develop a WDQS optimization engine in the future: a system that would quickly scan any given SPARQL query and decide whether it needs to be handled as a special, potentially difficult case, or whether processing should proceed as usual.
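To make the intended use concrete, here is a toy sketch of how a trained {xgboost} classifier would flag an incoming query. The data, the model, and the decision threshold below are all simulated stand-ins; the real model and features are developed later in this Report.

```r
library(xgboost)

# - toy stand-in data: random "feature vectors" and a binary criterion
# - (the real features are the SPARQL features engineered below)
set.seed(42)
X <- matrix(runif(100 * 5), ncol = 5)
y <- as.numeric(X[, 1] + X[, 2] > 1)
toy_model <- xgboost(data = X, label = y, nrounds = 10,
                     objective = "binary:logistic", verbose = 0)
# - scoring one incoming query vector: the predicted probability of
# - the query being critical for WDQS
p_critical <- predict(toy_model, matrix(runif(5), nrow = 1))
ifelse(p_critical > 0.5, "handle as critical", "process as usual")
```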
We have used XGBoost, via the R {xgboost} package, to train many predictive models that attempt to categorize SPARQL queries as critical or not according to the criterion stated above. XGBoost (short for Extreme Gradient Boosting) is an ML method that estimates an ensemble of mathematical models (decision trees, in our case) which jointly predict the response variable. A full description of the method is beyond the scope of this Report; the technique is known to be notoriously good at extracting nearly every last piece of useful information from the training data (here, the feature vectors describing the SPARQL queries). XGBoost models involve a large number of parameters that must be fine-tuned in order to derive a plausible predictive model for the task at hand. We have experimented with various parameter settings and data sets encompassing a variable number of extracted SPARQL features, and reached the following conclusions:
We have split the 1M sample of SPARQL queries into train and test data (approx. 50% of the sample each) in order to reach this conclusion.
#!/usr/bin/env Rscript
### ---------------------------------------------------------------------------
### --- WDQS_Event_S4A_XGBoostOptimization_Downsampling.R, v. 0.0.1
### --- Script: WDQS_Event_S4A_XGBoostOptimization_Downsampling.R, v. 0.0.1
### --- Description: WDQS_Event_S4A_XGBoostOptimization_Downsampling.R performs
### --- XGBoost optimization of SPARQL queries WDQS response times
### --- collected w. WDQS_Event_S1_ETL.R from the
### --- event.wdqs_external_sparql_query table in the WMF Data Lake.
### --- NOTE. Approach to Class Imbalances: downsampling.
### --- Author: Goran S. Milovanovic, Data Scientist, WMDE
### --- Developed under the contract between Goran Milovanovic PR Data Kolektiv
### --- and WMDE.
### --- Contact: goran.milovanovic_ext@wikimedia.de
### ---------------------------------------------------------------------------
### --- RUN FROM: stat1005
### --- Date: 2020/04/27
### ---------------------------------------------------------------------------
### ---------------------------------------------------------------------------
### --- LICENSE:
### ---------------------------------------------------------------------------
### --- GPL v2
### --- This file is part of the Wikidata SPARQL Endpoint Analytics Project
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is free software:
### --- you can redistribute it and/or modify
### --- it under the terms of the GNU General Public License as published by
### --- the Free Software Foundation, either version 2 of the License, or
### --- (at your option) any later version.
### ---
### --- Wikidata SPARQL Endpoint Analytics Project is distributed
### --- in the hope that it will be useful,
### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
### --- GNU General Public License for more details.
### ---
### --- You should have received a copy of the GNU General Public License
### --- along with Wikidata SPARQL Endpoint Analytics Project.
### --- If not, see <http://www.gnu.org/licenses/>.
### ---------------------------------------------------------------------------
### --- setup
library(data.table)
library(stringr)
library(dplyr)
library(tidyr)
library(httr)
library(pROC)
library(xgboost)
library(caret)
library(fastDummies)
### --- directories
### --- Read WDCM paramereters
# - fPath: where the scripts is run from?
fPath <- as.character(commandArgs(trailingOnly = FALSE)[4])
fPath <- gsub("--file=", "", fPath, fixed = T)
fPath <- unlist(strsplit(fPath, split = "/", fixed = T))
fPath <- paste(
paste(fPath[1:length(fPath) - 1], collapse = "/"),
"/",
sep = "")
fPath <- '/home/goransm/Analytics/Wikidata/WD_misc/WD_SPARQL_Endpoint_Analytics/'
# - NOTE: Hetzner local dir
# - fPath <- '/home/goran/WMDE/WD_SPARQL_Endpoint_Analytics/'
dataDir <- paste0(fPath, '_event/_data/')
analyticsDir <- paste0(fPath, '_event/_analytics/')
reportingDir <- paste0(fPath, '_event/_reporting/')
### ---------------------------------------------------------------------------
### --- Section 1.1 Train selected model
### ---------------------------------------------------------------------------
### --- grab dataset
filename <- 'modelFrame_300.csv'
### --- load model dataset
modelFrame <- fread(paste0(analyticsDir, filename))
modelFrame$V1 <- NULL
print(paste0("The dimension is: ", paste(dim(modelFrame), collapse = ", ")))
### --- recode f_ds_day to day of week
sampleDays <- ifelse(nchar(modelFrame$f_ds_day) == 1,
paste0('0', modelFrame$f_ds_day),
modelFrame$f_ds_day)
modelFrame$f_ds_day <- paste0("2020-04-", sampleDays)
modelFrame$f_ds_day <- weekdays(as.Date(modelFrame$f_ds_day))
### --- process model dataset
# - split SPARQL features
ds_features <- colnames(modelFrame)[which(grepl("^f_ds", colnames(modelFrame)))]
modelFrame_ds <- modelFrame[, ..ds_features]
modelFrame_f <- modelFrame %>%
dplyr::select(-ds_features)
rm(modelFrame); gc()
# - transform NA to zero in modelFrame_f
# - (for all non ds_ variables:)
modelFrame_f[is.na(modelFrame_f)] <- 0
# - pick dimension of SPARQL features only:
top_f <- dim(modelFrame_f)[2]
# - recompose modelFrame
modelFrame <- cbind(modelFrame_ds, modelFrame_f)
rm(modelFrame_ds); rm(modelFrame_f); gc()
### --- Prima Facie Correlation between:
# - modelFrame$f_ds_n_conc_queries
# - modelFrame$f_ds_query_time
corrFrame <- modelFrame %>%
dplyr::select(f_ds_datacenter_host,
f_ds_n_conc_queries,
f_ds_query_time)
corrFrame <- corrFrame %>%
dplyr::group_by(f_ds_datacenter_host) %>%
dplyr::summarise(med_n_conc_queries = median(f_ds_n_conc_queries),
med_query_time = median(f_ds_query_time))
cor.test(corrFrame$med_n_conc_queries,
corrFrame$med_query_time,
method = "pearson")
cor.test(corrFrame$med_n_conc_queries,
corrFrame$med_query_time,
method = "spearman")
### --- Select only ML features
# - to split train and test: f_ds_ttSplit
modelFrame$f_ds_ttSplit <- round(runif(dim(modelFrame)[1], 0, 1))
# - remove IDs and keep track of f_ds_ttSplit
modelFrame_IDs <- modelFrame %>%
dplyr::select(f_ds_id,
f_ds_uniqueSparqlId,
f_ds_ttSplit)
filename <- paste0('modelFrame_IDs_', top_f, ".csv")
write.csv(modelFrame_IDs,
paste0(analyticsDir, filename))
rm(modelFrame_IDs); gc()
modelFrame <- modelFrame %>%
dplyr::select(-f_ds_id,
-f_ds_uniqueSparqlId)
### --- Select criterion
### --- Drop all other potential criteria
### --- Potential criteria are:
# - f_ds_query_time_median
# - f_ds_query_time
# - f_ds_query_time_mild
# - f_ds_query_time_extreme
# - f_ds_query_time_class
# - f_ds_status_code
### --- criterion: f_ds_query_time_mild
table(modelFrame$f_ds_query_time_mild)
modelFrame <- modelFrame %>%
dplyr::select(-f_ds_query_time,
-f_ds_query_time_median,
-f_ds_query_time_extreme,
-f_ds_query_time_class,
-f_ds_status_code)
### --- prepare XGBoost structures
# - modelFrame as designMatrix
modelFrame$f_ds_hour <- as.character(modelFrame$f_ds_hour)
modelFrame <- fastDummies::dummy_cols(modelFrame)
# - remove dummified:
modelFrame$f_ds_hour <- NULL
modelFrame$f_ds_day <- NULL
modelFrame$f_ds_format <- NULL
modelFrame$f_ds_method <- NULL
modelFrame$f_ds_datacenter_host <- NULL
# - split: train vs test
trainFrame <- modelFrame[modelFrame$f_ds_ttSplit == 0, ]
testFrame <- modelFrame[modelFrame$f_ds_ttSplit == 1, ]
rm(modelFrame); gc()
### --- Downsampling
table(trainFrame$f_ds_query_time_mild)
positiveSampleSize <- sum(trainFrame$f_ds_query_time_mild == 1)
negativeSample <- trainFrame[trainFrame$f_ds_query_time_mild == 0]
dim(negativeSample)
positiveSample <- trainFrame[trainFrame$f_ds_query_time_mild == 1]
dim(positiveSample)
rm(trainFrame)
negativeSampleChoice <- sample(1:dim(negativeSample)[1],
size = positiveSampleSize,
replace = FALSE,
prob = NULL)
negativeSample <- negativeSample[negativeSampleChoice, ]
dim(negativeSample)
trainFrame <- rbind(negativeSample,
positiveSample)
rm(positiveSample); rm(negativeSample); gc()
# - design matrices
response_time_train <- trainFrame$f_ds_query_time_mild
trainFrame$f_ds_query_time_mild <- NULL
trainFrame$f_ds_ttSplit <- NULL
trainFrame <- as.matrix(trainFrame)
response_time_test <- testFrame$f_ds_query_time_mild
testFrame$f_ds_query_time_mild <- NULL
testFrame$f_ds_ttSplit <- NULL
testFrame <- as.matrix(testFrame)
# - to {xgboost} DMatrix
trainFrame <- xgb.DMatrix(trainFrame, label = response_time_train)
print(paste0("Dimension of trainFrame: ",
paste(dim(trainFrame), collapse = ", ", sep = "")))
print("testFrame as xgb.DMatrix now.")
testFrame <- xgb.DMatrix(testFrame, label = response_time_test)
print(paste0("Dimension of testFrame: ",
paste(dim(testFrame), collapse = ", ", sep = "")))
# - store testFrame to save RAM on stat100*
xgb.DMatrix.save(testFrame, paste0(analyticsDir, "temp_testFrame.dmatrix"))
rm(testFrame); gc()
### --- train XGBoost on train model dataset
print("Preparation complete. Model w. XGBoost now.")
print("-----------------------------------------------------------------------")
print("Set CV tree parameters:")
### --- grab model params
eta <- .5
gamma <- 10
max_depth <- 6
subsample <- .5
colsample_bytree <- .5
nthread <- 36
nrounds <- 30000
t1 <- Sys.time()
### --- FIRE!
res_boost <- xgb.train(
data = trainFrame,
watchlist = list(validation1 = trainFrame),
params = list(booster = "gbtree",
nthread = nthread,
eta = eta,
gamma = gamma,
max_depth = max_depth,
subsample = subsample,
colsample_bytree = colsample_bytree,
objective = "binary:logistic"),
nrounds = nrounds,
verbose = 1,
print_every_n = 100,
eval_metric = "logloss",
early_stopping_rounds = 1000,
maximize = NULL,
save_period = NULL,
save_name = NULL,
xgb_model = NULL
)
print(paste0("Training ends: ", Sys.time()))
training_time <- difftime(Sys.time(), t1, units = "mins")
print(paste0("XGBoost CV took: ", training_time))
# - grab feature importance
importanceFrame <- xgb.importance(feature_names = colnames(trainFrame),
model = res_boost)
# - remove trainFrame
rm(trainFrame); gc()
# - predictions on the testFrame:
testFrame <- xgb.DMatrix(paste0(analyticsDir, "temp_testFrame.dmatrix"))
training_prediction <- predict(res_boost, testFrame,
ntreelimit = res_boost$best_ntreelimit)
# - remove testFrame
rm(testFrame);gc()
# - ROC Analysis
d_t <- seq(.01, .99, .01)
rocFrame <- data.frame(threshold = numeric(length(d_t)),
acc = numeric(length(d_t)),
tp_rate = numeric(length(d_t)),
fn_rate = numeric(length(d_t)),
fp_rate = numeric(length(d_t)),
tn_rate = numeric(length(d_t)))
for (i in 1:length(d_t)) {
dec_threshold <- d_t[i]
acc <- mean(as.numeric(training_prediction > dec_threshold) == response_time_test)
hitRate <- sum((as.numeric(training_prediction > dec_threshold) == 1) & (response_time_test == 1))
hitRate <- hitRate/sum(response_time_test == 1)
missRate <- sum((as.numeric(training_prediction > dec_threshold) == 0) & (response_time_test == 1))
missRate <- missRate/sum(response_time_test == 1)
falseAlarmRate <- sum((as.numeric(training_prediction > dec_threshold) == 1) & (response_time_test == 0))
falseAlarmRate <- falseAlarmRate/sum(response_time_test == 0)
correctRejectionRate <- sum((as.numeric(training_prediction > dec_threshold) == 0) & (response_time_test == 0))
correctRejectionRate <- correctRejectionRate/sum(response_time_test == 0)
rocFrame[i, ] <- c(dec_threshold,
round(acc, 3),
round(hitRate, 3),
round(missRate, 3),
round(falseAlarmRate, 3),
round(correctRejectionRate, 3)
)
}
rocFrame$diff_hit_fa <- rocFrame$tp_rate - rocFrame$fp_rate
wmax_diff <- which.max(rocFrame$diff_hit_fa)
rocFrame[wmax_diff, ]
write.csv(rocFrame,
paste0(analyticsDir, "ROC_model_qtMild_eta001_", top_f, ".csv"))
# - feature importance
modelFrameCodeFile <- list.files(analyticsDir)
modelFrameCodeFile <- modelFrameCodeFile[grepl("modelFrameFeatureCode", modelFrameCodeFile)]
modelFrameCodeFile <- modelFrameCodeFile[grepl(top_f, modelFrameCodeFile)]
modelFrameCode <- read.csv(paste0(analyticsDir, modelFrameCodeFile),
check.names = F,
header = T,
row.names = 1,
stringsAsFactors = F)
importanceFrame <- importanceFrame %>%
dplyr::left_join(modelFrameCode,
by = c('Feature' = 'xgboost_feature'))
rm(modelFrameCode)
importanceFrame$feature[which(is.na(importanceFrame$feature))] <-
importanceFrame$Feature[which(is.na(importanceFrame$feature))]
importanceFrame <- importanceFrame[, c(5, 1, 2, 3, 4)]
colnames(importanceFrame)[2] <- 'xgboost_feature'
importanceFrame <- dplyr::arrange(importanceFrame,
desc(Gain))
write.csv(importanceFrame,
paste0(analyticsDir, "importance_qtMild_eta001_", top_f, ".csv"))
In order to explain the typical modelling result with XGBoost, we need to introduce some terminology first.
Let’s assume each of the SPARQL queries is coded as 1
if the respective WDQS response time for that query was critically high or 0
if it had fallen in a range of typical processing times. For a sample of SPARQL queries, then, we have two binary vectors over {0, 1} describing: (1) the observed status of the query (again: 1 - critical, 0 - typical), and (2) the predicted status of the query: what the trained predictive model "thinks" the query would be, judging from the features that describe it.
For example:
sampleFrame <- data.frame(query = paste0("Query_", 1:10),
observed = round(runif(10, 0, 1)),
predicted = round(runif(10, 0, 1)))
sampleFrame
For each of the 10 queries in the table we know whether they really were critical (1) or not (0) from the observed column, and we see whether the model predicts them to be critical (1) or not (0) from their features in the predicted column. For example, in the first row we have a query that was indeed critical and was correctly predicted as critical: that is what ML calls a Hit (or a True Positive). In the second row we observe a False Alarm (or a False Positive): the query was not really critical (it has 0 on observed) but the model incorrectly categorized it as if it were (1 on predicted). In the eighth row we find a situation known as a Correct Rejection: the query was not critical (0 on observed) and the model correctly inferred from its features that it would not be (0 on predicted). Finally, the fourth important situation is found in the sixth row, where the query was critical (1 on observed) but not recognized as such by the model (0 on predicted), a case known in ML terminology as a Miss. More on this terminology: https://en.wikipedia.org/wiki/Confusion_matrix.
In the table above, we can compute the model accuracy by looking at how many times the observed and predicted columns coincide:
Table 1. Confusion matrix example.
acc <- sum(sampleFrame$observed == sampleFrame$predicted)/dim(sampleFrame)[1]
print(paste0("Model accuracy = ", round(acc, 2), "."))
[1] "Model accuracy = 0.2."
When we look at the Hit rate we see that:
hits <- sum(sampleFrame$observed == 1 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 1)
print(paste0("Hit rate = ", round(hits, 2), "."))
[1] "Hit rate = 0.33."
Finally, the False alarm rate of the model is:
FAs <- sum(sampleFrame$observed == 0 & sampleFrame$predicted == 1)/sum(sampleFrame$observed == 0)
print(paste0("FA rate = ", round(FAs, 2), "."))
[1] "FA rate = 0.67."
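The rates computed above can also be read off a single cross-tabulation of the two vectors. A base-R sketch, with made-up observed/predicted vectors for ten queries:

```r
# - made-up observed and predicted status vectors for ten queries
observed  <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 0, 0, 0, 1, 1)
# - rows: predicted status; columns: observed status
conf_mat <- table(predicted = predicted, observed = observed)
conf_mat
#          observed
# predicted 0 1
#         0 4 1
#         1 2 3
# - Hit rate: predicted critical among truly critical queries
hit_rate <- conf_mat["1", "1"] / sum(observed == 1)  # 3/4 = 0.75
# - False Alarm rate: predicted critical among truly non-critical queries
fa_rate  <- conf_mat["1", "0"] / sum(observed == 0)  # 2/6 = 0.333...
```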
So, the model would recognize a critical query correctly 71% of the time, while 30% of the time it would incorrectly predict a non-critical query to be critical. Imagine that we use this model to decide when to start, and when not to start, some optimization procedure in WDQS, be it as simple as diverting critical and non-critical queries to separate processing resources. From the viewpoint of any system optimization, a decision is needed: can we live with these two values, or not? The False Alarm rate means that 30% of the time the model would trigger the optimization routine in a situation where it should not, wasting valuable system resources. Conversely, 100% - 71% (the Hit rate) = 29% of the time the model would fail to trigger the optimization routine for a critical query that could engage too many resources.
A good predictive model is the one which attains (a) a high Hit rate (i.e. observed critical and predicted critical), and (b) a low False Alarm rate (i.e. observed non-critical but predicted critical). In other words, we do not want our model to miss critical queries and thus not trigger some optimization procedure to help WDQS better use its resources, but we also do not want our model to trigger any optimization procedures on a query that would not result in a critically high usage of resources.
Our current results:
Now we can see how misleading model accuracy alone can be. In effect, these results imply the following: there are two datacenters, eqiad and codfw, of which the latter seems to be barely used to support the WDQS; we can think about the problem as posing the question of whether a query is too heavy to be processed in the overloaded eqiad, and thus should be sent to codfw, or not.
The XGBoost model has used 312 features to model query_time.
NOTE. Some of the initial features, like day of week, actually become several features in the process of dummy coding (e.g. for day of week it would be: Monday, Tuesday, Wednesday, etc., each being a binary 0/1 feature, e.g. 1 for Monday if it was Monday and 0 otherwise).
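The script performs this dummy coding with fastDummies::dummy_cols; the same transformation can be sketched in base R with model.matrix, on a toy day-of-week column:

```r
# - a toy day-of-week column with three levels
day <- factor(c("Monday", "Tuesday", "Monday", "Wednesday"),
              levels = c("Monday", "Tuesday", "Wednesday"))
# - one binary 0/1 column per level; "- 1" drops the intercept so that
# - every level gets its own dummy column
dummies <- model.matrix(~ day - 1)
colnames(dummies)  # "dayMonday" "dayTuesday" "dayWednesday"
dummies[1, ]       # first row is a Monday: 1 0 0
```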
The following table presents all the features used to predict whether a query will reach a critical or typical processing time. The table is sorted by Gain, which is the measure of a feature's importance in the categorization of queries. Remember that ds_n_conc_queries is the number of queries concurrently started in the same minute, in the same datacenter, and on the same back-end host.
Feature Importance Table
library(ggplot2)
library(tidyverse)
library(DT)
featureFrame <- read.csv('importance_qtMild_eta001_300.csv',
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
featureFrame <- featureFrame %>%
dplyr::select(feature, Gain)
featureFrame$Gain <- round(featureFrame$Gain, 5)
DT::datatable(featureFrame,
options = list(
paging = T,
pageLength = 50
))
Just from a glimpse at the feature importance table we can see that:

- f_ds_day_*, representing day of week, and f_ds_hour_*, representing hour of day, are selected as important, as is the above mentioned ds_n_conc_queries (which is a proxy measure of a datacenter/host load in a given minute);
- so are nchar (the length of the query in characters) and __vars_usage__ (the total number of tokens representing variables found in the SPARQL code);
- as well as the datacenter/host features: f_ds_datacenter_host_eqiad_wdqs1005, f_ds_datacenter_host_eqiad_wdqs1006, f_ds_datacenter_host_eqiad_wdqs1004, f_ds_datacenter_host_eqiad_wdqs1007, f_ds_datacenter_host_codfw_wdqs2002, etc.;
- __vars_label_usage__, a count of tokens representing labels for a particular variable, e.g. ?authorLabel and similar;
- the ?imdbid feature (unless WDQS users are pulling our leg by using a variable named ?imdbid while actually retrieving data on cats - such are the risks of analyzing the pragmatics of a language);
- the { feature, i.e. how many open curly braces we find in a query (knowing that each must be matched by a respective } symbol, we eliminated the latter early in the feature engineering phase).

Following the model and variable selection, we have estimated a Binomial Logistic Regression model with the selected features on the same dataset. We have used the Ridge regression approach to regularization. The obtained coefficients are linear:
The following charts show the linear coefficients of the 30 features with the most positive and the 30 features with the most negative coefficients:
pFrame <- read.csv('Regression_Coeffs_300.csv',
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
# - log-odds to odds
pFrame$Weight <- exp(pFrame$Weight)
pFrame_top <- head(pFrame, 30)
pFrame_bottom <- tail(pFrame, 30)
pFrame <- rbind(pFrame_top, pFrame_bottom)
pFrame$pos_neg <- c(rep('positive', 30), rep('negative', 30))
pFrame$xgboost_feature <- NULL
pFrame$Feature <- factor(pFrame$Feature,
levels = pFrame$Feature[order(pFrame$Weight)])
ggplot(data = pFrame,
aes(x = Feature,
y = Weight)) +
geom_line(group = 1) +
geom_point(size = 2.5) + geom_point(size = 2, color = "white") +
facet_wrap(~pos_neg, scales='free_x') +
theme_bw() +
theme(axis.text.x = element_text(size = 9, angle = 90, hjust=0.95, vjust=0.2))
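The estimation step itself is not reproduced in this Report. A minimal sketch of how such a ridge-regularized binomial logistic regression could be fitted with {glmnet} (on simulated stand-in data; alpha = 0 selects the ridge penalty):

```r
library(glmnet)

# - simulated stand-in data; the real X would be the SPARQL feature
# - matrix and y the binary critical/typical criterion
set.seed(42)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))
# - alpha = 0 gives the ridge penalty; cv.glmnet selects the
# - regularization strength (lambda) by cross-validation
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0)
# - coefficients at the selected lambda are log-odds;
# - exp() turns them into odds
odds <- exp(as.numeric(coef(cv_fit, s = "lambda.min")))
```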
The following table presents these features alongside their respective linear coefficients:
pFrame$pos_neg <- NULL
pFrame$Weight <- round(pFrame$Weight, 2)
DT::datatable(pFrame,
options = list(
paging = T,
pageLength = 50
))
Goran S. Milovanović
Data Scientist, Wikimedia Deutschland; Owner, DataKolektiv
contact: goran.milovanovic_ext@wikimedia.de