Verify edits_hourly data content page edit counts

https://phabricator.wikimedia.org/T228721

The purpose of this task is to QC the edits_hourly data in Turnilo and Superset and confirm the content page edit counts are correct by comparing mediawiki_history to mariadb replicas.

Documentation and analaysis of previous issues leading to missing content page info: https://github.com/wikimedia-research/2019-02-active-editors-discrepancy/blob/master/analysis.ipynb and task https://phabricator.wikimedia.org/T221338

In [47]:
# https://stackoverflow.com/a/35018739/1091835
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)
In [23]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(magrittr); library(zeallot); library(glue); library(tidyverse); library(lubridate)
    library(scales)
})

# date range - limiting to 2019
start_date <- "20190101"
start_date_human <- format(as.Date(start_date, '%Y%m%d'), '%d %B %Y')
end_date <- "20190630"
end_date_human <- format(as.Date(end_date, '%Y%m%d'), '%d %B %Y')

# Pass a sample of top sized ranked wiki and ones dentified as having a high volume of editors 
# Also some wikis identified as having missing page info in https://phabricator.wikimedia.org/T221338]

target_wikis <- c(
   "jawiki" = "Japanese", "itwiki" = "Italian", 
    "fawiki" = "Persian", "arwiki" = "Arabic",  "idwiki" = "Indonesian", "thwiki" = "Thai", "enwiki" = "English", 
    "dewiki" = "German", "frwiki" = "French",  "ruwiki" = "Russian", 
    "zhwiki" = "Chinese", "ptwiki" = "Portuguese", "nlwiki" = "Nederlandstalige"
)

Official edit counts based on MariaDB Replicas

In [117]:
## Find content space edits on MariaDB replicas. 
## Refer to https://meta.wikimedia.org/wiki/Research:Editor_month_dataset 

query <- "SELECT
 database() as wiki,
  date,
  ifnull(sum(content_edits), 0) as content_edits
FROM 
(
  SELECT 
    DATE(LEFT(rev_timestamp, 8)) as date, 
    sum(page_namespace = 0 or cn.namespace is not null) as content_edits
FROM revision
left join page on rev_page = page_id
left join staging.content_namespaces cn on database() = wiki and page_namespace = namespace
WHERE rev_timestamp >= '{start_date}' AND
rev_timestamp < '{end_date}'     

union all

SELECT
  DATE(LEFT(ar_timestamp, 8)) as date,
sum(ar_namespace = 0 or cn.namespace is not null) as content_edits
FROM archive
left join staging.content_namespaces cn on database() = wiki and ar_namespace = namespace
WHERE ar_timestamp >= '{start_date}' AND
ar_timestamp < '{end_date}' 
) revs

group by date;"
In [29]:
content_edits_replicas <- map_df(
        set_names(names(target_wikis), names(target_wikis)),
        ~ shhh(wmf::mysql_read(glue(query), .x)),
        .id = "wiki"
    ) 

 save(content_edits_replicas, file="Data/content_edits_replicas.RData")
In [42]:
load("Data/content_edits_replicas.RData")
content_edits_replicas$date <- as.Date(content_edits_replicas$date, format = "%Y-%m-%d")
In [43]:
content_edits_replicas_monthly <- content_edits_replicas %>%
 mutate(date = floor_date(date, "month")) %>%
  group_by(date) %>%
  summarise(content_edits = sum(content_edits)) %>%
  mutate(type = 'MariaDB')
In [44]:
head(content_edits_replicas_monthly)
A tibble: 6 × 3
datecontent_editstype
<date><dbl><chr>
2019-01-017310977MariaDB
2019-02-016431345MariaDB
2019-03-017136986MariaDB
2019-04-017402043MariaDB
2019-05-017714056MariaDB
2019-06-017025704MariaDB

I verified the monthly content numbers from the replicas with the data available in Turnilo and Superset. Now to check against the mediawiki_history on hive.

Content edit counts based on mediawiki_history dataset in the Data Lake

In [16]:
## Select main page namespace edits from mediawiki_history using the page_namespace_is_content_historical

# In terminal
# spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
# use https://github.com/wikimedia-research/Editing-movement-metrics/blob/master/queries/update_editor_month.sql as a reference

query <- "SELECT date_format(event_timestamp, 'yyyy-MM-dd') as date,
wiki_db as wiki,
coalesce(sum(cast(page_namespace_is_content_historical as int)), 0
    ) as content_edits
FROM wmf.mediawiki_history
WHERE
event_entity = 'revision' AND
event_type = 'create' and
event_timestamp IS NOT NULL and
wiki_db in ('jawiki', 'itwiki', 'fawiki', 'arwiki' , 'idwiki' , 'thwiki' , 'enwiki', 'dewiki' ,
    'frwiki' , 'ruwiki' , 'zhwiki' , 'ptwiki' , 'nlwiki' ) and
event_timestamp >= '2019-01-01' and
event_timestamp < '2019-06-30' and
snapshot = '2019-07'
GROUP BY wiki_db, date_format(event_timestamp, 'yyyy-MM-dd')"


results <- collect(sql(query))
save(results, file="R/Data/content_edits_hive.RData")
Error in UseMethod("collect"): no applicable method for 'collect' applied to an object of class "c('sql', 'character')"
Traceback:

1. collect(sql(query))
In [45]:
load("Data/content_edits_hive.RData")
content_edits_datalake$date <- as.Date(content_edits_datalake$date, format = "%Y-%m-%d")

Confirmed that the hive numbers match with the edit counts available in Turnilo and Superset.

In [53]:
content_edits_datalake_monthly <- content_edits_datalake %>%
 mutate(date = floor_date(date, "month")) %>% 
  group_by(date) %>%
  summarise(content_edits = sum(content_edits)) %>%
  mutate(type = 'data_lake')
In [54]:
head(content_edits_datalake)
A data.frame: 6 × 3
datewikicontent_edits
<date><chr><dbl>
2019-04-17fawiki 3668
2019-02-23ruwiki 12575
2019-03-02enwiki103464
2019-06-19ruwiki 14792
2019-06-11idwiki 8097
2019-01-06dewiki 23555

Verified hive content edits with the edits_hourly data available in Superset and Turnilo. Now to double check any discrepancies.

In [55]:
content_edits_combined_monthly <- rbind(content_edits_datalake_monthly, 
                                        content_edits_replicas_monthly)
In [56]:
content_edits_combined_monthly
A tibble: 12 × 3
datecontent_editstype
<date><dbl><chr>
2019-01-017243513data_lake
2019-02-016369705data_lake
2019-03-017067604data_lake
2019-04-017339635data_lake
2019-05-017647072data_lake
2019-06-016980287data_lake
2019-01-017310977MariaDB
2019-02-016431345MariaDB
2019-03-017136986MariaDB
2019-04-017402043MariaDB
2019-05-017714056MariaDB
2019-06-017025704MariaDB
In [59]:
p <- ggplot(content_edits_combined_monthly, (aes(x=date, y= content_edits, color = type)))+
geom_line() +
scale_y_continuous("total content edits per day", labels = polloi::compress) +
    scale_x_date("Date", labels = date_format("%b %Y"), date_breaks = "1 month") +
    labs(title = "Monthly Content Edits: Replicas vs Data Lake") +
    ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
    theme(axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_line("gray70"))

p
In [60]:
content_edits_diff_monthly <- content_edits_combined_monthly %>%
    spread(type, content_edits) %>%
 mutate(diff = data_lake - MariaDB, 
        diff_percent = (data_lake - MariaDB)/MariaDB *100)
    

head(content_edits_diff_monthly)
A tibble: 6 × 5
datedata_lakeMariaDBdiffdiff_percent
<date><dbl><dbl><dbl><dbl>
2019-01-0172435137310977-67464-0.9227768
2019-02-0163697056431345-61640-0.9584309
2019-03-0170676047136986-69382-0.9721471
2019-04-0173396357402043-62408-0.8431186
2019-05-0176470727714056-66984-0.8683370
2019-06-0169802877025704-45417-0.6464406
In [100]:
# Plot the difference beween Hive and MariaDB edits
p <- ggplot(content_edits_diff_monthly, (aes(x=date, y= abs(diff))))+
geom_col(fill = 'dark blue') +
scale_y_continuous("Diff in edit count per month", labels = polloi::compress) +
    scale_x_date("Date", labels = date_format("%b %Y"), date_breaks = "1 month") +
    labs(title = "Difference between total monthly Content Edits \n from DataLake and MariaDB") +
    ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
    theme(axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_line("gray70"))

p

There is a slight difference (under 0.1%) between the DataLake and MariaDB content namespace edit counts using the two query above, which is likely due to difference in the two datasets.

Note: MariaDB query shows the numbers of edits to the main namespace. This is not exactly the same thing as edits to content namespaces, but it's a pretty close approximation. See https://meta.wikimedia.org/wiki/Research:Content_namespace

Verify that mediawiki content issues with missing namespace info is corrected in recent snapshot

In [36]:
query <- 
"SELECT
  revision_is_deleted_by_page_deletion as rev_deleted,
  page_is_deleted as page_deleted,
  page_title IS NULL as null_title,
  page_namespace IS NULL as null_namespace,
  page_namespace_is_content IS NULL as null_is_content_namespace,
  page_title_historical IS NULL as null_historical_title,
  page_namespace_historical IS NULL as null_historical_namespace,
  page_namespace_is_content_historical IS NULL as null_historical_is_content_namespace,
  COUNT(1) as c
FROM wmf.mediawiki_history
WHERE snapshot = '2019-06'
  AND event_entity = 'revision'
GROUP BY
  revision_is_deleted_by_page_deletion,
  page_is_deleted,
  page_title IS NULL,
  page_namespace IS NULL,
  page_namespace_is_content IS NULL,
  page_title_historical IS NULL,
  page_namespace_historical IS NULL,
  page_namespace_is_content_historical IS NULL
ORDER BY
  rev_deleted,
  page_deleted,
  null_title,
  null_namespace,
  null_is_content_namespace,
  null_historical_title,
  null_historical_namespace,
  null_historical_is_content_namespace
LIMIT 1000"
In [37]:
namespace_check <- wmf::query_hive(query)
In [38]:
namespace_check
A data.frame: 7 × 9
rev_deletedpage_deletednull_titlenull_namespacenull_is_content_namespacenull_historical_titlenull_historical_namespacenull_historical_is_content_namespacec
<chr><chr><chr><chr><chr><chr><chr><chr><dbl>
falseNULL true true true true true true 1458440
falsefalsefalsefalsefalsefalsefalsefalse4055914781
falsetrue falsefalsefalsefalsefalsefalse 98726
true NULL true true true falsefalsefalse 48671097
true NULL true true true falsefalsetrue 414357
true falsefalsefalsefalsefalsefalsefalse 203823
true true falsefalsefalsefalsefalsefalse 158672610

Results of the above query look as expected. About 95% of all revisions that are not deleted do not belong to a deleted page and have all their page_namespace info set correctly.

Nlwiki previously had 5160672 deleted pages with null set for page_namespace_historical, page_namespace_is_content and page_namespace_is_content_historical. I did a spot check on wmf.mediawikihistory to confirm this is no longer true.

In [71]:
nlwiki_query <- "SELECT COUNT(*)
FROM wmf.mediawiki_history
WHERE 
wiki_db = 'nlwiki' 
AND page_is_deleted = TRUE
AND page_namespace_historical = NULL
AND page_namespace_is_content = NULL
AND page_namespace_is_content_historical = NULL"
In [72]:
nlwiki_page_check <- wmf::query_hive(nlwiki_query)

Confirmed that there are 0 pages in nlwiki where this condition is met using the current snapshot of mediawiki_history.

Check Active Editor Counts

Redo check of active counts using Neil's method documented here

In [80]:
# User previsous "official" active editor count data pulled using MariaDB
metrics_url <- "https://raw.githubusercontent.com/wikimedia-research/Editing-movement-metrics/75b3251727f8c766e4872f775f57a09632df6500/metrics/metrics.tsv"

official_ae_raw <- read.csv(url(metrics_url),sep="\t")

tail(official_ae_raw)
A data.frame: 6 × 23
monthactive_editorsdata_editsexisting_active_editorsglobal_south_active_editorsglobal_south_editsglobal_south_nonbot_editsmobile.heavy_wiki_active_editorsmobile.heavy_wiki_editsmobile.heavy_wiki_new_editor_retentionnet_new_Wikidata_entitiesnet_new_Wikipedia_articlesnet_new_content_pagesnew_active_editorsnew_editor_retentionnonbot_nondata_nonupload_editsrevert_ratesecond_month_active_editorstotal_editsuploads
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
2132018-08-01786811825756560371 NA NA NA374112108200.04700795 5202352215791818411140180.05151272138886620.06979796343340969220799857
2142018-09-0182084178101165926520109207032820703263863 8465030.05068545 3772221856441635067182840.05681354134392830.07611506367138588449835420
2152018-10-0182567230900686081519891197977119797693825 9577940.0504849811478041910872101982166900.06641143133821930.07354250425143471807588857
2162018-11-018258423908469613302000820151772015175363911507930.0433289012721491923402268787156660.06874217135635610.07146537473544349706592679
2172018-12-018107521426310611612035920675772067577404611253810.04704514 9475711862332128146147530.07207374134904200.06528740429741966740551296
2182019-01-0186776232234446516121367243494124349414228 9882330.04335825 9519762003661938938171080.05423185149170460.06034288365246294961587164
In [81]:
official_ae_raw$month <- as.Date(official_ae_raw$month, format = "%Y-%m-%d")

official_ae <- official_ae_raw %>%
select('month', 'active_editors')%>%
filter(month != '2005-06-01') %>%  #Maria DB appears to have a duplicate value for this date
mutate(type = "MariaDb")
In [ ]:
#Create intermediate hive table to calculate results

hive.run(["""
CREATE TABLE IF NOT EXISTS mneisler.editor_month_new (
  `wiki` STRING,
  `month` TIMESTAMP, -- Hive 1.1 does not support the DATE type
  `local_user_id` BIGINT,
  `user_name` STRING,
  `edits` BIGINT,
  `content_edits` BIGINT,
  `bot_flag` BOOLEAN,
  `user_registration` TIMESTAMP
) 
STORED AS PARQUET
""", """
insert into mneisler.editor_month_new
select
    wiki_db as wiki,
    trunc(event_timestamp, "MONTH") as month,
    event_user_id as local_user_id,
    max(event_user_text) as user_name, -- Some rows incorrectly have a null `event_user_text`
    count(*) as edits,
    coalesce(
        sum(cast(page_namespace_is_content_historical as int)),
        0
    ) as content_edits,
    (
        max(array_contains(event_user_groups, "bot")) or 
        max(array_contains(event_user_groups_historical, "bot"))
    ) as bot,
    min(event_user_creation_timestamp) as user_registration
from wmf.mediawiki_history
where
    event_timestamp < "2019-02-01" and
    event_entity = "revision" and
    event_type = "create" and
    snapshot = "2019-06"
group by
    trunc(event_timestamp, "MONTH"),
    wiki_db,
    event_user_id
"""])
In [74]:
#Look at active editor numbers via hive.
query <- "
select
    month,
    count(*) as active_editors
from (
    select
        cast(month as date) as month,
        user_name,
        sum(content_edits) as content_edits,
        max(bot_flag) as bot_flag
    from mneisler.editor_month_new
    where 
        local_user_id != 0
    group by month, user_name
) global_edits
where
    content_edits >= 5 and
    not bot_flag and 
    user_name not regexp 'bot\\b'
group by month
"
In [75]:
new_ae_raw = wmf::query_hive(query)
In [76]:
new_ae_raw$month <- as.Date(new_ae_raw$month, format = "%Y-%m-%d")
In [78]:
new_ae <- new_ae_raw %>%
filter(month != '2005-06-01') %>%  # filtered out duplicate month that was in MariaDB so I can do direct comparison
mutate(type = "data_lake")
In [82]:
deviation <- rbind(new_ae, official_ae)
In [96]:
comparison_plot <- rbind(new_ae, official_ae) %>%
ggplot(aes(x=month, y= active_editors, color = type)) +
geom_line() +
scale_y_continuous("monthly active editors", labels = polloi::compress) +
    scale_x_date("Date", labels = date_format("%b %Y"), date_breaks = "12 months") +
    labs(title = "Monthly active editors from Data Lake and MariaDb") +
ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
    theme(axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

comparison_plot

It looks like is minimal difference between active editor date from the official and new ae metrics. This metric appears to be fixed. Let's confirm the difference.

In [95]:
deviation_plot <- rbind(new_ae, official_ae) %>%
spread(type, active_editors) %>%
mutate(diff = data_lake-MariaDb) %>%
ggplot(aes(x=month, y= diff)) +
geom_line()  +
scale_y_continuous("Diff in monthly active edtiors", labels = polloi::compress) +
    scale_x_date("Date", labels = date_format("%b %Y"), date_breaks = "12 months") +
    labs(title = "Difference between total monthly active edits \n from DataLake and MariaDB") +
    ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
    theme(axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

deviation_plot

The difference between the two is under 1%. The highest monthly difference between the two is 750 active editors per month which is a signficiantly improved from the previous difference of 10,000 active editors between the two dataset.

In [92]:
head(official_ae)
A data.frame: 6 × 3
monthactive_editorstype
<date><dbl><chr>
2001-01-01 4MariaDb
2001-02-01 8MariaDb
2001-03-0121MariaDb
2001-04-0118MariaDb
2001-05-0120MariaDb
2001-06-0119MariaDb
In [93]:
head(new_ae)
A data.frame: 6 × 3
monthactive_editorstype
<date><int><chr>
2001-01-01 9data_lake
2001-02-0117data_lake
2001-03-0131data_lake
2001-04-0124data_lake
2001-05-0129data_lake
2001-06-0132data_lake