Feedback should be sent to goran.milovanovic_ext@wikimedia.de.

Reference Document: New editors - WMDE deep dive - analytics questions

Reference Phabricator Ticket: https://phabricator.wikimedia.org/T256433

Background/ Reason why

The first campaign that included tracking was conducted in 2017. Since then, several campaigns with different content and user journeys have been run. To inform strategic decisions on future activities to gain new editors, we want to comprehensively analyze past activities and their impact. Besides qualitative results, we need to consider quantitative results. Campaign reports usually cover a limited period during and after the campaign but not long-term effects; this comprehensive analysis should cover those long-term effects.

Timing: Briefing: 9th July 2020. Delivery of report: week 31.

General requirements

  • The report should be delivered as tables and charts in HTML, as in previous reports

  • The report might be made publicly available in the future. This is not a requirement for this delivery deadline.

  • Communication will happen in Phabricator

  • Based on this report we will have more questions which should be addressed in a phase 2 in August.

  • The time span for this report is January 2017 until June 2020

0. Data Acquisition

NOTE: the Data Acquisition code chunk is not fully reproducible from this Report. The data are collected by running the PySpark script CampaignsReview2020_S01_ETL.py on the stat1005.eqiad.wmnet server; the output is either collected as .tsv files or, for large datasets, stored directly to HDFS in the WMF Data Lake. Immediately following this step, the R script CampaignsReview2020_S01_ETL.R is run to clean the data, produce aggregated datasets, prepare the data for visualization, and compute all requested statistics.
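
As an illustration of the kind of aggregate computed in the second step, the sketch below bins per-user revision counts into the edit classes used in this report (1, 2 - 5, 6 - 9, 10 - 49, and an open-ended top class that the report labels "> 50"). It is a minimal Python sketch, not the code that produced the report: the function names edit_class and edit_class_distribution are hypothetical, and the input is assumed to be one user_id per revision row.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the edit classes used in this report. Counts of 50 or more
# fall in the open-ended last class, which the report labels "> 50".
LOWER_BOUNDS = [1, 2, 6, 10, 50]
LABELS = ["1", "2 - 5", "6 - 9", "10 - 49", "> 50"]


def edit_class(revisions):
    """Return the edit-class label for a per-user revision count (>= 1)."""
    if revisions < 1:
        raise ValueError("a user in the revisions table has at least one revision")
    # bisect_right finds how many lower bounds are <= revisions.
    return LABELS[bisect_right(LOWER_BOUNDS, revisions) - 1]


def edit_class_distribution(user_ids):
    """Map each label to (number of users, percent of users).

    `user_ids` holds one entry per revision row (hypothetical input shape).
    """
    revisions_per_user = Counter(user_ids)
    classes = Counter(edit_class(n) for n in revisions_per_user.values())
    total = sum(classes.values())
    return {
        label: (classes[label], 100.0 * classes[label] / total if total else 0.0)
        for label in LABELS
    }
```

A lookup on sorted lower bounds avoids testing every class interval for every row, which matters when the same labelling is applied to each revision rather than to each user.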

0.1 ETL: dewiki user registrations and revisions

The CampaignsReview2020_S01_ETL.py script collects data on user registrations and revisions from the wmf.mediawiki_history table:

# - Setup
import pyspark
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import rank, col, explode, regexp_extract, array_contains, when, sum, count, expr
import re
import csv
import pandas as pd

### --- dir structure and params
mwwikiSnapshot = "2020-06"

# - Spark Session
sc = SparkSession\
    .builder\
    .appName("WD Human Edits per Class")\
    .enableHiveSupport()\
    .getOrCreate()
# - SQL context
sqlContext = pyspark.SQLContext(sc)

# - Process wmf.mediawiki_history: all users ever registered with dewiki
dewiki_regusers = sqlContext.sql("""SELECT user_id, user_registration_timestamp 
                                    FROM wmf.mediawiki_history 
                                    WHERE event_entity = 'user' 
                                        AND event_type = 'create' 
                                        AND user_is_anonymous = false 
                                        AND user_is_created_by_self = true 
                                        AND user_id IS NOT NULL  
                                        AND user_registration_timestamp IS NOT NULL 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by, 'name') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by, 'group') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'name') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'group') 
                                        AND wiki_db = 'dewiki' 
                                        AND snapshot='""" + mwwikiSnapshot + """' """ +  # trailing space separates the literal from ORDER BY
                                 """ORDER BY user_id""")

dewiki_regusers.toPandas().to_csv("/home/goransm/Analytics/NewEditors/CampaignsReview2020/_data/dewiki_regusers.csv", 
                                  header=True, 
                                  index=False)
                                  
# - Process wmf.mediawiki_history: all revisions ever made on dewiki
dewiki_revisions = sqlContext.sql("""SELECT event_user_id, 
                                           event_user_registration_timestamp, 
                                           event_timestamp 
                                    FROM wmf.mediawiki_history 
                                    WHERE event_entity = 'revision' 
                                        AND event_type = 'create' 
                                        AND event_user_is_anonymous = false 
                                        AND event_user_is_created_by_self = true 
                                        AND event_user_id IS NOT NULL 
                                        AND event_user_registration_timestamp IS NOT NULL 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') 
                                        AND page_namespace_is_content = true 
                                        AND page_namespace_is_content_historical = true 
                                        AND wiki_db = 'dewiki' 
                                        AND snapshot='""" + mwwikiSnapshot + """' """ +  # trailing space separates the literal from ORDER BY
                                 """ORDER BY event_user_id, event_timestamp""")

dewiki_revisions.repartition(10).write.format('csv').save('dewiki_revisions')
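
The last statement writes the revisions as ten CSV splits; Spark's CSV writer emits no header row unless asked to, so whoever reads the splits has to supply the column names. Below is a minimal Python sketch of collecting such splits once they have been copied to a local directory. The function name read_splits and the file-name pattern are assumptions for illustration (the pattern mirrors the local names the R script below gives the copied files); the report itself performs this step in R.

```python
import csv
from pathlib import Path

# Column order follows the SELECT list of the revisions query above:
# event_user_id, event_user_registration_timestamp, event_timestamp.
COLUMNS = ("user_id", "reg_time", "rev_time")


def read_splits(directory, pattern="dewiki_revisions*.csv"):
    """Read every headerless CSV split in `directory` into a list of dicts.

    `pattern` is a hypothetical default, not a name fixed by Spark or HDFS.
    """
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            for record in csv.reader(handle):
                rows.append(dict(zip(COLUMNS, record)))
    return rows
```

All values are kept as strings here; parsing the timestamps is left to the later processing steps.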

0.2 Definitions: user registration and user revision

From the CampaignsReview2020_S01_ETL.py script we can see exactly what definitions of user registration and user revision are used in this report:

  • User registration: we keep only non-anonymous users (user_is_anonymous = false) who created their own account (user_is_created_by_self = true) and who have values in both the user_id and the registration timestamp fields (user_id IS NOT NULL AND user_registration_timestamp IS NOT NULL), and we exclude any users who are currently classified as bots or were classified as bots in the past.

  • User revision: we focus on revisions made on content pages only (page_namespace_is_content = true AND page_namespace_is_content_historical = true), applying the same set of constraints to the users who made a revision as in the definition of user registration.
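
The two definitions can be restated as predicates over a single history record. The following is a minimal Python sketch under stated assumptions: each record is a plain dict keyed by the wmf.mediawiki_history column names used in the queries, the bot fields are lists of strings, and the helper names (is_bot, is_counted_registration, is_counted_revision) are hypothetical. The wiki_db and snapshot filters are omitted, and the content-page condition follows the wording of this section.

```python
def is_bot(record, prefix=""):
    """True if the user is, or ever was, flagged as a bot by name or by group."""
    flags = set(record.get(prefix + "user_is_bot_by") or [])
    flags |= set(record.get(prefix + "user_is_bot_by_historical") or [])
    return bool(flags & {"name", "group"})


def is_counted_registration(record):
    """Apply the user registration definition to one history record."""
    return (
        record.get("event_entity") == "user"
        and record.get("event_type") == "create"
        and record.get("user_is_anonymous") is False
        and record.get("user_is_created_by_self") is True
        and record.get("user_id") is not None
        and record.get("user_registration_timestamp") is not None
        and not is_bot(record)
    )


def is_counted_revision(record):
    """Apply the user revision definition to one history record."""
    return (
        record.get("event_entity") == "revision"
        and record.get("event_type") == "create"
        and record.get("event_user_is_anonymous") is False
        and record.get("event_user_is_created_by_self") is True
        and record.get("event_user_id") is not None
        and record.get("event_user_registration_timestamp") is not None
        and not is_bot(record, prefix="event_")
        # content pages only, both currently and at the time of the revision
        and record.get("page_namespace_is_content") is True
        and record.get("page_namespace_is_content_historical") is True
    )
```

Writing the bot check once and reusing it with the event_ prefix keeps the two definitions aligned, which is the property the revision definition relies on.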

0.3 Data pre-processing: dewiki user registrations and revisions

The CampaignsReview2020_S01_ETL.R script cleans the data, produces aggregated datasets, prepares the data for visualization, and computes all requested statistics:

### --- 2020/07/14
### --- Phab: https://phabricator.wikimedia.org/T256433
### --- WMDE Banner Campaigns Comprehensive Report 2017-2020

t1 <- Sys.time()

library(tidyverse)
library(data.table)

fPath <- '/home/goransm/Analytics/NewEditors/CampaignsReview2020/'
dataDir <- paste0(fPath, "_data/")
analyticsDir <- paste0(fPath, "_analytics/")
hdfsPath <- 'hdfs:///user/goransm/dewiki_revisions'

### ---------------------------------------------------------------------
### --- Section 0. Campaigns Dataset
### ---------------------------------------------------------------------

### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"), 
                        header = T,
                        check.names = F,
                        stringsAsFactors = F)

### ---------------------------------------------------------------------
### --- Section 1. Datasets
### ---------------------------------------------------------------------

### --- Compose final revision dataset from hdfs: dewiki_revisions
# - copy splits from hdfs to local dataDir
system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -ls ', 
              hdfsPath, ' > ', 
              dataDir, 'files.txt'), 
       wait = T)
files <- read.table(paste0(dataDir, 'files.txt'), skip = 1)
files <- as.character(files$V8)[2:length(as.character(files$V8))]
file.remove(paste0(dataDir, 'files.txt'))
for (i in 1:length(files)) {
  system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -text ', 
                files[i], ' > ',  
                paste0(dataDir, "dewiki_revisions", i, ".csv")), wait = T)
}
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)

### ---------------------------------------------------------------------
### --- Section 2. Statistics and Analytical Datasets
### ---------------------------------------------------------------------

### -----------------------------------------------------
### --- Statistics on user registrations since the beginning of time
### -----------------------------------------------------

# - remove campaign registered users from dewiki_regusers
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[!(dewiki_regusers$user_id %in% campaign_regIDS), ]

# - stats
stats <- list()
stats$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats$total_users_who_edited <- length(wEdited)

# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
  difftime(Sys.time(),
           dewiki_regusers$reg_time,
           units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429

# - stats
stats$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
  )
stats$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats, 
        paste0(analyticsDir, "dewiki_stats.Rds"))

# - store data file
saveRDS(dewiki_regusers, 
        paste0(analyticsDir, "dewiki_regusers.Rds"))

### -----------------------------------------------------
### --- Statistics on user registrations since 2017
### -----------------------------------------------------

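# - compare as text so the filter works whether the timestamp was read as character or as date-time
dewiki_regusers_2017 <- dewiki_regusers[as.character(dewiki_regusers$user_registration_timestamp) >= "2017", ]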

# - stats
stats_2017 <- list()
stats_2017$total_registered_users <- dim(dewiki_regusers_2017)[1]
wEdited <- which(dewiki_regusers_2017$user_id %in% dewiki_revisions$user_id)
stats_2017$total_users_who_edited <- length(wEdited)
stats_2017$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[1]
)
stats_2017$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[6]
)
stats_2017$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[4]
)
stats_2017$median_account_age_weeks <- median(dewiki_regusers_2017$account_age_weeks)
stats_2017$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[1]
)
stats_2017$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[6]
)
stats_2017$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[4]
)
stats_2017$median_account_age_years <- median(dewiki_regusers_2017$account_age_years)
saveRDS(stats_2017, 
        paste0(analyticsDir, "dewiki_stats_2017.Rds"))

# - store data file
saveRDS(dewiki_regusers_2017, 
        paste0(analyticsDir, "dewiki_regusers_2017.Rds"))

# - clean up
rm(dewiki_regusers); rm(dewiki_regusers_2017); gc()

### -----------------------------------------------------
### --- Statistics on revisions since the beginning of time
### -----------------------------------------------------

# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time >= "2020-05-14")
  )
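# - skip the removal when nothing matches (an empty negative index selects no rows)
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}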

# - statistics
stats_revisions <- list()
stats_revisions$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist <- table(dewiki_revisions$user_id)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('user_id', 'revisions')
rev_dist <- arrange(rev_dist, desc(revisions))
rev_dist <- table(rev_dist$revisions)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('revisions', 'users')
rev_dist <- arrange(rev_dist, desc(users))
saveRDS(rev_dist, 
        paste0(analyticsDir, "rev_dist.Rds"))
rm(rev_dist); gc()

# - edit classes
editClasses <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses$editClass <- sapply(editClasses$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses$editClass[editClasses$editClass == "0 - 1"] <- "1"
editClasses <- arrange(editClasses, desc(revisions))
editClasses_dist <- table(editClasses$editClass)
editClasses_dist <- as.data.frame(editClasses_dist)
colnames(editClasses_dist) <- c("Edit Class", "Users")
editClasses_dist$`Edit Class` <- factor(editClasses_dist$`Edit Class`, 
                                        levels = c('1', 
                                                   '2 - 5', 
                                                   '6 - 9', 
                                                   '10 - 49', 
                                                   '> 50'))
editClasses_dist <- arrange(editClasses_dist, `Edit Class`)
editClasses_dist$`% Users` <- editClasses_dist$Users/sum(editClasses_dist$Users)*100
saveRDS(editClasses_dist, 
        paste0(analyticsDir, "editClasses_dist.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - running revision count per user, in chronological order
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

### ___ NOTE ___
# - There are 29148 observations where reg_time > rev_time:
sum(dewiki_revisions$account_age_rev_time_weeks < 0)
### ___ ACTION ___
# - remove from dewiki_revisions:
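w <- which(dewiki_revisions$account_age_rev_time_weeks < 0)
# - skip the removal when nothing matches (an empty negative index selects no rows)
if (length(w) > 0) {
  dewiki_revisions <- dewiki_revisions[-w, ]
}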

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor, not round: an account aged 0.6 years belongs to class "0 - 1"
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_elaborated.Rds"))

# - produce dewiki_revisions_overview for visualization
dewiki_revisions_overview <- dewiki_revisions %>% 
  select(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  summarise(n_users = n())
saveRDS(dewiki_revisions_overview, 
        paste0(analyticsDir, "dewiki_revisions_overview.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users <- list()
active_users$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
      )
    )
active_users$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users$two_weeks_p_total_registered_users <- 
  active_users$two_weeks/stats$total_registered_users
active_users$one_month_p_total_registered_users <- 
  active_users$one_month/stats$total_registered_users
active_users$six_months_p_total_registered_users <- 
  active_users$six_months/stats$total_registered_users
active_users$one_year_p_total_registered_users <- 
  active_users$one_year/stats$total_registered_users
active_users$two_weeks_p_total_users_who_edited <- 
  active_users$two_weeks/stats$total_users_who_edited
active_users$one_month_p_total_users_who_edited <- 
  active_users$one_month/stats$total_users_who_edited
active_users$six_months_total_users_who_edited <- 
  active_users$six_months/stats$total_users_who_edited
active_users$one_year_p_total_users_who_edited <- 
  active_users$one_year/stats$total_users_who_edited
saveRDS(active_users, 
        paste0(analyticsDir, "active_users.Rds"))

### -----------------------------------------------------
### --- Statistics on revisions since 2017
### -----------------------------------------------------

rm(dewiki_revisions); gc()
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

# - filter for >= 2017 on user registration
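# - compare as text so the filter works whether reg_time was read as character or as date-time
dewiki_revisions <- filter(dewiki_revisions, as.character(reg_time) >= "2017")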
dewiki_revisions <- as.data.table(dewiki_revisions)

# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time >= "2020-05-14")
)
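# - skip the removal when nothing matches (an empty negative index selects no rows)
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}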

# - statistics
stats_revisions_2017 <- list()
stats_revisions_2017$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist_2017 <- table(dewiki_revisions$user_id)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('user_id', 'revisions')
rev_dist_2017 <- arrange(rev_dist_2017, desc(revisions))
rev_dist_2017 <- table(rev_dist_2017$revisions)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('revisions', 'users')
rev_dist_2017 <- arrange(rev_dist_2017, desc(users))
saveRDS(rev_dist_2017, 
        paste0(analyticsDir, "rev_dist_2017.Rds"))
rm(rev_dist_2017); gc()

# - edit classes
editClasses_2017 <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses_2017$editClass <- sapply(editClasses_2017$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses_2017$editClass[editClasses_2017$editClass == "0 - 1"] <- "1"
editClasses_2017 <- arrange(editClasses_2017, desc(revisions))
editClasses_dist_2017 <- table(editClasses_2017$editClass)
editClasses_dist_2017 <- as.data.frame(editClasses_dist_2017)
colnames(editClasses_dist_2017) <- c("Edit Class", "Users")
editClasses_dist_2017$`Edit Class` <- factor(editClasses_dist_2017$`Edit Class`, 
                                        levels = c('1', 
                                                   '2 - 5', 
                                                   '6 - 9', 
                                                   '10 - 49', 
                                                   '> 50'))
editClasses_dist_2017 <- arrange(editClasses_dist_2017, `Edit Class`)
editClasses_dist_2017$`% Users` <- editClasses_dist_2017$Users/sum(editClasses_dist_2017$Users)*100
saveRDS(editClasses_dist_2017, 
        paste0(analyticsDir, "editClasses_dist_2017.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - running revision count per user, in chronological order
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor, not round: an account aged 0.6 years belongs to class "0 - 1"
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions_2017 
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_2017_elaborated.Rds"))

# - produce dewiki_revisions_overview_2017 for visualization
dewiki_revisions_overview_2017 <- dewiki_revisions %>% 
  select(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  summarise(n_users = n())
saveRDS(dewiki_revisions_overview_2017, 
        paste0(analyticsDir, "dewiki_revisions_overview_2017.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_2017 <- list()
active_users_2017$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
    )
  )
active_users_2017$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users_2017$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users_2017$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users_2017$two_weeks_p_total_registered_users <- 
  active_users_2017$two_weeks/stats_2017$total_registered_users
active_users_2017$one_month_p_total_registered_users <- 
  active_users_2017$one_month/stats_2017$total_registered_users
active_users_2017$six_months_p_total_registered_users <- 
  active_users_2017$six_months/stats_2017$total_registered_users
active_users_2017$one_year_p_total_registered_users <- 
  active_users_2017$one_year/stats_2017$total_registered_users
active_users_2017$two_weeks_p_total_users_who_edited <- 
  active_users_2017$two_weeks/stats_2017$total_users_who_edited
active_users_2017$one_month_p_total_users_who_edited <- 
  active_users_2017$one_month/stats_2017$total_users_who_edited
active_users_2017$six_months_total_users_who_edited <- 
  active_users_2017$six_months/stats_2017$total_users_who_edited
active_users_2017$one_year_p_total_users_who_edited <- 
  active_users_2017$one_year/stats_2017$total_users_who_edited
saveRDS(active_users_2017, 
        paste0(analyticsDir, "active_users_2017.Rds"))

### -----------------------------------------------------
### --- Statistics on user registrations for 
### --- campaign registered users
### -----------------------------------------------------

### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"), 
                        header = T,
                        check.names = F,
                        stringsAsFactors = F)

# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)

# - which campaign registered users cannot be found in dewiki_regusers
# - index against the full campaignIDs table so the flag lands on the matching rows
wNotFound <- which(campaignIDs$registered == 1 & 
                     !(campaignIDs$user_id %in% dewiki_regusers$user_id))
campaignIDs$not_found_in_dewiki_regusers <- 0
campaignIDs$not_found_in_dewiki_regusers[wNotFound] <- 1
# - store elaborated campaignIDs
write.csv(campaignIDs, 
          paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"))

# - keep only campaign registered users from dewiki_regusers
dim(dewiki_regusers)
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[(dewiki_regusers$user_id %in% campaign_regIDS), ]
dim(dewiki_regusers)

# - stats
stats_campaigns <- list()
stats_campaigns$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats_campaigns$total_users_who_edited <- length(wEdited)

# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
  difftime(Sys.time(),
           dewiki_regusers$reg_time,
           units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429

# - stats
stats_campaigns$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
)
stats_campaigns$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats_campaigns$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats_campaigns$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats_campaigns$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats_campaigns$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats_campaigns$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats_campaigns$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats_campaigns, 
        paste0(analyticsDir, "dewiki_stats_campaigns.Rds"))

# - store data file
saveRDS(dewiki_regusers, 
        paste0(analyticsDir, "dewiki_regusers_campaigns.Rds"))

### -----------------------------------------------------
### --- Statistics on revisions for campaign registered users
### -----------------------------------------------------

# - filter for campaign registered users on user registration
dim(dewiki_revisions)
dewiki_revisions <- filter(dewiki_revisions, user_id %in% campaign_regIDS)
dewiki_revisions <- as.data.table(dewiki_revisions)
dim(dewiki_revisions)

# - for non-registering campaigns: keep only user revisions following the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
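wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time < "2020-05-14")
)
# - guard: negative indexing with an empty index vector would drop all rows
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}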

# - statistics
stats_revisions_campaigns <- list()
stats_revisions_campaigns$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist_campaigns <- table(dewiki_revisions$user_id)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('user_id', 'revisions')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(revisions))
rev_dist_campaigns <- table(rev_dist_campaigns$revisions)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('revisions', 'users')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(users))
saveRDS(rev_dist_campaigns, 
        paste0(analyticsDir, "rev_dist_campaigns.Rds"))
rm(rev_dist_campaigns); gc()

# - edit classes
editClasses_campaigns <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses_campaigns$editClass <- sapply(editClasses_campaigns$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses_campaigns$editClass[editClasses_campaigns$editClass == "0 - 1"] <- "1"
editClasses_campaigns <- arrange(editClasses_campaigns, desc(revisions))
editClasses_dist_campaigns <- table(editClasses_campaigns$editClass)
editClasses_dist_campaigns <- as.data.frame(editClasses_dist_campaigns)
colnames(editClasses_dist_campaigns) <- c("Edit Class", "Users")
editClasses_dist_campaigns$`Edit Class` <- factor(editClasses_dist_campaigns$`Edit Class`, 
                                             levels = c('1', 
                                                        '2 - 5', 
                                                        '6 - 9', 
                                                        '10 - 49', 
                                                        '> 50'))
editClasses_dist_campaigns <- arrange(editClasses_dist_campaigns, `Edit Class`)
editClasses_dist_campaigns$`% Users` <- editClasses_dist_campaigns$Users/sum(editClasses_dist_campaigns$Users)*100
saveRDS(editClasses_dist_campaigns, 
        paste0(analyticsDir, "editClasses_dist_campaigns.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - running revision count per user (rows are ordered by user and time)
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() assigns an account aged e.g. 0.6 years to the "0 - 1" class
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions_campaigns
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_campaigns_elaborated.Rds"))


# - produce dewiki_revisions_overview_campaigns for visualization
dewiki_revisions_overview_campaigns <- dewiki_revisions %>% 
  select(user_id, rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  # - count distinct users, not revision rows, in each group
  summarise(n_users = n_distinct(user_id))
saveRDS(dewiki_revisions_overview_campaigns, 
        paste0(analyticsDir, "dewiki_revisions_overview_campaigns.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_campaigns <- list()
active_users_campaigns$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
    )
  )
active_users_campaigns$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users_campaigns$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users_campaigns$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users_campaigns$two_weeks_p_total_registered_users <- 
  active_users_campaigns$two_weeks/stats_campaigns$total_registered_users

active_users_campaigns$one_month_p_total_registered_users <- 
  active_users_campaigns$one_month/stats_campaigns$total_registered_users

active_users_campaigns$six_months_p_total_registered_users <- 
  active_users_campaigns$six_months/stats_campaigns$total_registered_users

active_users_campaigns$one_year_p_total_registered_users <- 
  active_users_campaigns$one_year/stats_campaigns$total_registered_users

active_users_campaigns$two_weeks_p_total_users_who_edited <- 
  active_users_campaigns$two_weeks/stats_campaigns$total_users_who_edited

active_users_campaigns$one_month_p_total_users_who_edited <- 
  active_users_campaigns$one_month/stats_campaigns$total_users_who_edited

active_users_campaigns$six_months_p_total_users_who_edited <- 
  active_users_campaigns$six_months/stats_campaigns$total_users_who_edited

active_users_campaigns$one_year_p_total_users_who_edited <- 
  active_users_campaigns$one_year/stats_campaigns$total_users_who_edited

saveRDS(active_users_campaigns, 
        paste0(analyticsDir, "active_users_campaigns.Rds"))

### --- Final Reporting
paste0("Processing took: ", difftime(Sys.time(), t1, units = "mins"), " minutes.")
rm(list = ls()); gc()
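
The edit-class assignment in the script above applies a nested sapply to every row, which can be slow on millions of revisions. The following is a minimal vectorized sketch using base R cut(), assuming integer revision counts and reusing the class labels of this report. The helper name assign_edit_class and the toy input are hypothetical and not part of the ETL script.

```r
# Hypothetical helper: map integer revision counts to the edit classes
# used in this report.
assign_edit_class <- function(revisions) {
  as.character(
    cut(revisions,
        # right-closed intervals: (-Inf, 1], (1, 5], (5, 9], (9, 49], (49, Inf)
        breaks = c(-Inf, 1, 5, 9, 49, Inf),
        labels = c("1", "2 - 5", "6 - 9", "10 - 49", "> 50"))
  )
}

# Toy input (not report data)
toy_classes <- assign_edit_class(c(1, 2, 5, 6, 9, 10, 49, 50, 120))
print(toy_classes)
```

Note that, as in the script, a user with exactly 50 revisions falls into the class labelled "> 50", so that class effectively covers 50 or more edits.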

1 Data Analysis

All statistics and visualizations reported in the following sections refer to the questions formulated in the reference Phabricator ticket.

Note. All revisions made by campaign registered users were removed from the revision datasets used in the sections addressing the organic registrations and revisions since 2017. For the non-registering WMDE Banner Campaigns (e.g. the WMDE Occasional Editors 2020 campaign, which addressed already registered users only), the organic datasets exclude all edits that these users made after their exposure to the respective campaign. Likewise, the campaign datasets exclude all edits that these users made before their exposure to the respective campaign.

1.1 Organic Registrations and Revisions since 2017

1.1.1 Organic Registrations since 2017

Q. What is the age of the German Wikipedia Community in terms of account age?

stats_2017 <- readRDS(paste0(analyticsDir, 'dewiki_stats_2017.Rds'))

The following statistics all refer to dewiki user registrations since 2017:

  • The total number of registered users is 386471.
  • The total number of registered users who have ever edited is 145690.
  • That means that 37.7% of registered users ever edited dewiki.
  • Statistics on account age in weeks: the minimum account age is 2.39, the maximum account age is 184.8, the mean account age is 92.97, and the median account age is 91.53;
  • while in years, that would be: the minimum account age is 0.05, the maximum account age is 3.54, the mean account age is 1.78, and the median account age is 1.76.
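
The percentage and the week-to-year conversion in the list above can be re-derived with a few lines of arithmetic. This is only a sketch that re-uses the reported (already rounded) figures; the variable names are illustrative, and 52.1429 is the weeks-per-year constant used in the ETL script.

```r
# Figures as reported in the list above
registered_users <- 386471
users_who_edited <- 145690
weeks_per_year <- 52.1429   # conversion constant used in the ETL script

# Share of registered users who ever edited, in percent
pct_edited <- round(100 * users_who_edited / registered_users, 1)

# Account age statistics converted from weeks to years
age_weeks <- c(min = 2.39, max = 184.8, mean = 92.97, median = 91.53)
age_years <- round(age_weeks / weeks_per_year, 2)

print(pct_edited)
print(age_years)
```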

1.1.2 User Revisions since 2017

Q. How many of them edited (since registration until 30th June 2020):

  • 1 edit

  • 2 to 5 edits

  • 6 to 9 edits

  • 10 to 49 edits

  • 50 or more edits

editClasses_dist_2017 <- readRDS(paste0(analyticsDir, 'editClasses_dist_2017.Rds'))
editClasses_dist_2017$`% Users` <- round(editClasses_dist_2017$`% Users`, 2)
datatable(editClasses_dist_2017)

Q. Retention rate: How many newly registered users are active after (active = at least 1 edit)

  • 2 weeks after registration
  • 1 month after registration
  • 6 months after registration
  • 12 months after registration

How high is the retention rate of these active users compared to the number of registrations?

active_users_2017 <- readRDS(paste0(analyticsDir, 'active_users_2017.Rds'))
active_users_2017 <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_2017[1:4]),
  `As % of registered users` = round(100 * as.numeric(active_users_2017[5:8]), 2),
  `As % of users who ever edited` = round(100 * as.numeric(active_users_2017[9:12]), 2),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_2017)
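
To make the retention definition concrete: a user counts as active after a given period if at least one of their revisions was made when the account was older than that period. The sketch below applies this definition to a toy revision log; the data and the total of five registered users are made up for illustration, while the thresholds in weeks are the ones used in the ETL script.

```r
# Toy revision log (not report data): one row per revision,
# with the account age in weeks at the time of the revision.
toy_revisions <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  account_age_weeks = c(0.5, 3.0, 1.0, 0.2, 30.0, 60.0)
)
total_registered <- 5   # assumed: includes two users who never edited

# Thresholds in weeks, as in the ETL script
thresholds <- c(two_weeks = 2, one_month = 4.34524,
                six_months = 26.0715, one_year = 52.1429)

# Number of distinct users with at least one revision made after each threshold
retained <- sapply(thresholds, function(th) {
  length(unique(toy_revisions$user_id[toy_revisions$account_age_weeks > th]))
})

# Retention expressed as a percentage of all registered users
retention_pct <- round(100 * retained / total_registered, 2)

print(retained)
print(retention_pct)
```

In this toy example user 1 and user 3 are active after two weeks, and only user 3 is active after one month, six months, and one year.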

Q. Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017.

Note. In the following chart, the Account Age variable refers to the user account age at the moment when a respective revision was made by that user. Panels refer to different edit classes.

dewiki_revisions_overview_2017 <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_2017.Rds'))
dewiki_revisions_overview_2017 <- ungroup(dewiki_revisions_overview_2017)
colnames(dewiki_revisions_overview_2017) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_2017 <- filter(dewiki_revisions_overview_2017, 
                                         !(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_2017$`Edit Class`[dewiki_revisions_overview_2017$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_2017$`Edit Class` <- factor(dewiki_revisions_overview_2017$`Edit Class`,
                                                      levels = c('1',
                                                                 '2 - 5',
                                                                 '6 - 9',
                                                                 '10 - 49',
                                                                 '> 50'))
ggplot(dewiki_revisions_overview_2017, 
       aes(x = `Revision Year-Month`, 
           y = Users,
           group = `Account Age`, 
           color = `Account Age`)) + 
  geom_line() + geom_point(size = 1) +
  scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) +
  facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))

1.2 Campaign Registrations and Revisions since 2017

Note. Some users found in the WMDE campaign user registration datasets could not be matched with the user IDs in the wmf.mediawiki_history table. The following table presents an overview of how many campaign registered users are missing from the wmf.mediawiki_history table (see: Wikitech wmf.mediawiki_history documentation). All WMDE campaign registered user IDs were checked for uniqueness and confirmed to be unique. The total number of WMDE campaign registered users that were not matched to the user ID fields (event_user_id for revisions, and user_id for registrations) in the wmf.mediawiki_history table is 49.

campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"), 
                        header = T, 
                        check.names = F,
                        row.names = 1,
                        stringsAsFactors = F)
not_found <- as.data.frame(
  table(campaignIDs$campaign[campaignIDs$not_found_in_dewiki_regusers == 1])
)
colnames(not_found) <- c('Campaign Code', 'Num.Users')
datatable(not_found)

Given that there are 4163 WMDE campaign registered users in total, that means that we will not be able to analyze 1.18% of them. To keep the analysis consistent with the numbers previously reported on user registrations and revisions since 2017, all user registration data are derived from the campaign registered users that were matched with the user IDs in the wmf.mediawiki_history table.
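
As an illustration of the matching step, the sketch below flags campaign user IDs that have no counterpart in a history table and then repeats the arithmetic behind the figures quoted above. The ID vectors are toy data and the variable names are hypothetical; only the totals 4163 and 49 come from this report.

```r
# Toy IDs (not report data)
campaign_user_ids <- c(101, 102, 103, 104, 105)
history_user_ids  <- c(100, 101, 103, 105, 106)

# Flag campaign users that have no match in the history table
not_found <- !(campaign_user_ids %in% history_user_ids)
n_not_found <- sum(not_found)

# The same arithmetic with the figures reported above
total_campaign_users <- 4163
unmatched_users <- 49
matched_users <- total_campaign_users - unmatched_users
unmatched_pct <- round(100 * unmatched_users / total_campaign_users, 2)

print(n_not_found)
print(matched_users)
print(unmatched_pct)
```

The matched total obtained this way is the number of campaign registered users that the following sections can analyze.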

1.2.1 Campaign Registrations since 2017

Q. What is the age of the German Wikipedia Community in terms of account age for campaign registered users?

dewiki_stats_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_stats_campaigns.Rds'))

The following statistics all refer to dewiki campaign user registrations since 2017:

  • The total number of registered users is 4114.
  • The total number of registered users who have ever edited is 1123.
  • That means that 27.3% of registered users ever edited dewiki.
  • Statistics on account age in weeks: the minimum account age is 2.68, the maximum account age is 170.4, the mean account age is 117.4, and the median account age is 114.39;
  • while in years, that would be: the minimum account age is 0.05, the maximum account age is 3.27, the mean account age is 2.25, and the median account age is 2.19.

1.2.2 Campaign Revisions since 2017

Q. How many of the campaign registered users edited (since registration until 30th June 2020):

  • 1 edit

  • 2 to 5 edits

  • 6 to 9 edits

  • 10 to 49 edits

  • 50 or more edits

editClasses_dist_campaigns <- readRDS(paste0(analyticsDir, 'editClasses_dist_campaigns.Rds'))
editClasses_dist_campaigns$`% Users` <- round(editClasses_dist_campaigns$`% Users`, 2)
datatable(editClasses_dist_campaigns)

Q. Retention rate: How many campaign registered users are active after (active = at least 1 edit)

  • 2 weeks after registration
  • 1 month after registration
  • 6 months after registration
  • 12 months after registration

How high is the retention rate of these active users compared to the number of registrations?

active_users_campaigns <- readRDS(paste0(analyticsDir, 'active_users_campaigns.Rds'))
active_users_campaigns <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_campaigns[1:4]),
  `As % of registered users` = round(100 * as.numeric(active_users_campaigns[5:8]), 2),
  `As % of users who ever edited` = round(100 * as.numeric(active_users_campaigns[9:12]), 2),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_campaigns)

Q. Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017 (for campaign registered users):

Note. In the following chart, the Account Age variable refers to the user account age at the moment when a respective revision was made by that user. Panels refer to different edit classes.

dewiki_revisions_overview_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_campaigns.Rds'))
dewiki_revisions_overview_campaigns <- ungroup(dewiki_revisions_overview_campaigns)
colnames(dewiki_revisions_overview_campaigns) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_campaigns <- filter(dewiki_revisions_overview_campaigns, 
                                         !(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_campaigns$`Edit Class`[dewiki_revisions_overview_campaigns$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_campaigns$`Edit Class` <- factor(dewiki_revisions_overview_campaigns$`Edit Class`,
                                                           levels = c('1',
                                                                      '2 - 5',
                                                                      '6 - 9',
                                                                      '10 - 49',
                                                                      '> 50'))
ggplot(dewiki_revisions_overview_campaigns, 
       aes(x = `Revision Year-Month`, 
           y = Users,
           group = `Account Age`, 
           color = `Account Age`)) + 
  geom_line() + geom_point(size = 1) + 
  scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) + 
  facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))

1.3 Campaign Registrations and Revisions: Overview

1.3.1 Campaign User Registrations and Revisions

I have just one major remark on the report: I also need registrations and revisions per campaign to be able to compare the campaigns. Are you still on it or did you miss it out? (the edit class and account age comparison is not necessary for the campaign split).

Reference: Phab: T256433#6328618

Note. The recent editors column reports the number of campaign registered users who made at least one edit as of 30th June 2020; reference: Phab: T256433#6385973

campaignRegs <- read.csv(paste0(analyticsDir, "campaignRegistrationsSummary.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
campaignRevs <- read.csv(paste0(analyticsDir, "campaignRevisionsSummary.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
campaignsOverview <- left_join(campaignRevs, 
                               campaignRegs, 
                               by = "campaign")
campaignsOverview$rev_per_reg <- round(campaignsOverview$revisions/campaignsOverview$registered, 2)
recentEditors <- read.csv(paste0(analyticsDir, "recentCampaignEditors.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
colnames(recentEditors)[2] <- "recent editors"
campaignsOverview <- left_join(campaignsOverview, 
                               recentEditors, 
                               by = "campaign")
datatable(campaignsOverview)
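
The revisions-per-registration ratio divides by the number of registrations, so a campaign with zero registrations (as could be the case for a non-registering campaign) would yield Inf. The sketch below shows one possible guard on toy data; the campaign names and counts are made up, and base R merge() stands in for the left_join() used above.

```r
# Toy per-campaign summaries (not report data; campaign names are made up)
toy_regs <- data.frame(campaign = c("campaign_a", "campaign_b", "campaign_c"),
                       registered = c(200, 50, 0))
toy_revs <- data.frame(campaign = c("campaign_a", "campaign_b", "campaign_c"),
                       revisions = c(500, 25, 40))

# Join the two summaries on the campaign code (left join on toy_revs)
toy_overview <- merge(toy_revs, toy_regs, by = "campaign", all.x = TRUE)

# Revisions per registration; NA where a campaign registered no users,
# which avoids the Inf that a division by zero would produce
toy_overview$rev_per_reg <- ifelse(
  toy_overview$registered > 0,
  round(toy_overview$revisions / toy_overview$registered, 2),
  NA
)

print(toy_overview)
```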

1.3.2 Campaign User Retention

I wondered if it were much additional work to compute not only # of revisions and registrations (in section 1.3) , but also retention/ retention rates (like you did in section 1.2.2) per campaign? I guess that’s what @Verena was initially asking for.

Reference: Phab: T256433#6337701

active_users_per_campaign <- readRDS(paste0(analyticsDir, "active_users_per_campaign.Rds"))
active_users_per_campaign <- active_users_per_campaign[, c('campaign',
                                                           'two_weeks',
                                                           'one_month',
                                                           'six_months',
                                                           'one_year')]
colnames(active_users_per_campaign) <- c('campaign',
                                         '2 weeks',
                                         '1 month',
                                         '6 months',
                                         '1 year')
datatable(active_users_per_campaign)

1.4 Training Modules in 2018 Campaigns: Active Users

Of the people who started training modules (onboarding content used in 2018 in thank you, spring and summer campaign), how is the rate of still active users?

active_users_campaigns_training_2018 <- 
  readRDS(paste0(analyticsDir, "active_users_campaigns_training.Rds"))
active_users_campaigns_training_2018 <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_campaigns_training_2018[1:4]),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_campaigns_training_2018)

2 Additional Requests

2.1 Organic Growth/Age of Community

Request. “Organic Growth/ Age of Community: For the years 2001 to 2019 we need for every year the average age of all accounts who did at least one edit in this year. age = number of years since registration ( I am aware that for a few accounts the registration date can’t be retrieved from the database. Because this should be a relatively small number of accounts we can neglect that here.)” reference Phab ticket

dewiki_revisions_elaborated <- readRDS("~/WMDE/NewEditors/CampaignsReview2020/_analytics/dewiki_revisions_elaborated.Rds")
dewiki_revisions_elaborated <- dplyr::select(dewiki_revisions_elaborated, 
                                             reg_time, 
                                             rev_time)
dewiki_revisions_elaborated$rev_year <- substr(dewiki_revisions_elaborated$rev_time, 1, 4)
dewiki_revisions_elaborated <- dplyr::filter(dewiki_revisions_elaborated, 
                                             rev_year != "2020")
# - account age at revision time, in weeks, as a plain numeric
dewiki_revisions_elaborated$account_age <- as.numeric(
  difftime(dewiki_revisions_elaborated$rev_time, 
           dewiki_revisions_elaborated$reg_time, 
           units = "weeks")
)
# - convert weeks to years
one_year <- 52.1429
dewiki_revisions_elaborated$account_age <- dewiki_revisions_elaborated$account_age/one_year
# - average account age (in years) over all revisions made in each year
organicGrowth <- dewiki_revisions_elaborated %>% 
  dplyr::select(rev_year, account_age) %>% 
  dplyr::group_by(rev_year) %>% 
  dplyr::summarise(avg_account_age = mean(account_age))
datatable(organicGrowth)
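
The chunk above averages account age over revisions, so accounts with many edits in a year contribute more to that year's mean. If a per-account average is wanted instead, which is one possible reading of the request, it could be obtained as in the following sketch. It runs on toy data and assumes that a user_id column is available alongside rev_year and account_age; the object names are hypothetical.

```r
# Toy revisions (not report data): account age in years at each revision
toy <- data.frame(
  user_id     = c(1, 1, 1, 2, 3, 3),
  rev_year    = c("2018", "2018", "2018", "2018", "2019", "2019"),
  account_age = c(1.0, 1.2, 1.4, 3.0, 0.5, 0.7)
)

# Per-revision average: heavy editors weigh more
per_revision <- aggregate(account_age ~ rev_year, data = toy, FUN = mean)

# Per-account average: first average within each account and year,
# then average those account-level values within each year
per_user <- aggregate(account_age ~ user_id + rev_year, data = toy, FUN = mean)
per_account <- aggregate(account_age ~ rev_year, data = per_user, FUN = mean)

print(per_revision)
print(per_account)
```

In the toy data the two approaches differ for 2018, where one account made three of the four revisions, and agree for 2019, where only one account edited.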

---
title: '2020 WMDE New Editors Campaigns Review'
author: "Goran S. Milovanovic, Data Scientist, WMDE"
date: "July 17, 2020"
output:
  html_notebook:
    code_folding: hide
    theme: simplex
    toc: yes
    toc_float: yes
    toc_depth: 5
  html_document:
    toc: yes
    toc_depth: 5
---

**Feedback** should be sent to `goran.milovanovic_ext@wikimedia.de`. 

**Reference Document:** [New editors - WMDE deep dive - analytics questions](https://docs.google.com/document/d/1tOCpmn4KgJLDPR1Uaq3kBdOiL_E677COynhhbpr6MZk/edit?ts=5f05e1cc)

**Reference Phabricator Ticket:** [https://phabricator.wikimedia.org/T256433](https://phabricator.wikimedia.org/T256433)

**Background/ Reason why**

The first campaign which contained tracking was conducted in 2017. Since then several campaigns with different content and user journeys have been realized. For strategic decisions on future activities to gain new editors we want to comprehensively analyze past activities and their impact. Besides qualitative results we need to consider quantitative results. Campaign reports usually cover a certain period during and after the campaign but not the long-term effects; these should be addressed in this comprehensive analysis.

**Timing**
Briefing: 9th July 2020
Delivery of report: week 31

**General requirements**

- The report should be delivered in tables and charts in html as in previous reports

- The report might be publicly available in the future. For this delivery deadline this is no requirement.

- Communication will happen in [phabricator](https://phabricator.wikimedia.org/T256433)

- Based on this report we will have more questions which should be addressed in a phase 2 in August.

- The time span for this report is January 2017 until June 2020

```{r, echo = F, warning = 'hide', message = F, results = 'hide'}
# !diagnostics off
### --- Setup
library(kableExtra)
library(rmarkdown)
library(knitr)
library(tidyverse)
library(data.table)
library(reshape2)
library(DT)
library(ggrepel)
library(scales)
library(RColorBrewer)

### --- directory tree
dataDir <- paste0(getwd(), "/", "_data/")
analyticsDir <- paste0(getwd(), "/", "_analytics/")
```

## 0. Data Acquisition

**NOTE:** the Data Acquisition code chunk is not fully reproducible from this Report. The data are collected by running the PySpark script `CampaignsReview2020_S01_ETL.py` on the stat1005.eqiad.wmnet server, collecting the data either as `.tsv` files or storing large datasets directly to HDFS in the WMF Data Lake. Immediately following this step the R script `CampaignsReview2020_S01_ETL.R` is run to clean the data, produce aggregated datasets, prepare the data for visualization, and compute all requested statistics.

### 0.1 ETL: `dewiki` user registrations and revisions

The `CampaignsReview2020_S01_ETL.py` script collects data on user registrations and revisions from the `wmf.mediawiki_history` table:

```{python, echo = T, eval = F}
# - Setup
import pyspark
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import rank, col, explode, regexp_extract, array_contains, when, sum, count, expr
import re
import csv
import pandas as pd

### --- dir structure and params
mwwikiSnapshot = "2020-06"

# - Spark Session
sc = SparkSession\
    .builder\
    .appName("WD Human Edits per Class")\
    .enableHiveSupport()\
    .getOrCreate()
# - SQL context
sqlContext = pyspark.SQLContext(sc)

# - Process wmf.mediawiki_history: all users ever registered with dewiki
dewiki_regusers = sqlContext.sql("""SELECT user_id, user_registration_timestamp 
                                    FROM wmf.mediawiki_history 
                                    WHERE event_entity = 'user' 
                                        AND event_type = 'create' 
                                        AND user_is_anonymous = false 
                                        AND user_is_created_by_self = true 
                                        AND user_id IS NOT NULL  
                                        AND user_registration_timestamp IS NOT NULL 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by, 'name') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by, 'group') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'name') 
                                        AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'group') 
                                        AND wiki_db = 'dewiki' 
                                        AND snapshot='""" + mwwikiSnapshot + """' """ + 
                                 """ORDER BY user_id""")

dewiki_regusers.toPandas().to_csv("/home/goransm/Analytics/NewEditors/CampaignsReview2020/_data/dewiki_regusers.csv", 
                                  header=True, 
                                  index=False)
                                  
# - Process wmf.mediawiki_history: all revisions ever made on dewiki
dewiki_revisions = sqlContext.sql("""SELECT event_user_id, 
                                           event_user_registration_timestamp, 
                                           event_timestamp 
                                    FROM wmf.mediawiki_history 
                                    WHERE event_entity = 'revision' 
                                        AND event_type = 'create' 
                                        AND event_user_is_anonymous = false 
                                        AND event_user_is_created_by_self = true 
                                        AND event_user_id IS NOT NULL 
                                        AND event_user_registration_timestamp IS NOT NULL 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') 
                                        AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') 
                                        AND page_namespace_is_content = true 
                                        AND page_namespace_is_content_historical = true 
                                        AND wiki_db = 'dewiki' 
                                        AND snapshot='""" + mwwikiSnapshot + """' """ + 
                                 """ORDER BY event_user_id, event_timestamp""")

dewiki_revisions.repartition(10).write.format('csv').save('dewiki_revisions')
```

### 0.2 Definitions: user registration and user revision

From the `CampaignsReview2020_S01_ETL.py` script we can see exactly what definitions of `user registration` and `user revision` are used in this report:

- **User registration:** we **exclude** _anonymous users_ (`user_is_anonymous = false`), users who _were not self-created_ (`user_is_created_by_self = true`), users _who have no values found in the `user_id` or the registration timestamp field_ (`user_id IS NOT NULL AND user_registration_timestamp IS NOT NULL`), and any users _who are currently classified as bots or used to be classified as bots in the past_.

- **User revision:** we **focus on** revisions made on _content pages_ only (`page_namespace_is_content = true AND page_namespace_is_content_historical = true`), applying to the users who made a particular revision the same set of constraints as in the definition of user registration.
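
The registration definition above can be illustrated on a toy data frame. This is only a sketch: the rows are made up, and the single logical column `is_bot` is a hypothetical stand-in for the `user_is_bot_by` and `user_is_bot_by_historical` array checks performed in the Spark SQL query.

```r
# Toy user records (not report data); `is_bot` stands in for the
# user_is_bot_by / user_is_bot_by_historical array checks in the query.
toy_users <- data.frame(
  user_id = c(1, 2, NA, 4, 5),
  user_is_anonymous = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  user_is_created_by_self = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  user_registration_timestamp = c("2017-01-05", "2017-02-01", "2018-03-03",
                                  NA, "2019-07-09"),
  is_bot = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Keep only self-created, non-anonymous, non-bot users with a known
# user ID and registration timestamp
keep <- with(toy_users,
             !user_is_anonymous &
               user_is_created_by_self &
               !is.na(user_id) &
               !is.na(user_registration_timestamp) &
               !is_bot)
toy_registrations <- toy_users[keep, ]

print(toy_registrations)
```

Only the first toy user passes all five conditions; each of the other rows fails at least one of them.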


### 0.3 Data pre-processing: `dewiki` user registrations and revisions

The `CampaignsReview2020_S01_ETL.R` script cleans the data, produces aggregated datasets, prepares the data for visualization, and computes all requested statistics:

```{r, echo = T, eval = F}

### --- 2020/07/14
### --- Phab: https://phabricator.wikimedia.org/T256433
### --- WMDE Banner Campaigns Comprehensive Report 2017-2020

t1 <- Sys.time()

library(tidyverse)
library(data.table)

fPath <- '/home/goransm/Analytics/NewEditors/CampaignsReview2020/'
dataDir <- paste0(fPath, "_data/")
analyticsDir <- paste0(fPath, "_analytics/")
hdfsPath <- 'hdfs:///user/goransm/dewiki_revisions'

### ---------------------------------------------------------------------
### --- Section 0. Campaigns Dataset
### ---------------------------------------------------------------------

### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"), 
                        header = T,
                        check.names = F,
                        stringsAsFactors = F)

### ---------------------------------------------------------------------
### --- Section 1. Datasets
### ---------------------------------------------------------------------

### --- Compose final revision dataset from hdfs: dewiki_revisions
# - copy splits from hdfs to local dataDir
system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -ls ', 
              hdfsPath, ' > ', 
              dataDir, 'files.txt'), 
       wait = T)
files <- read.table(paste0(dataDir, 'files.txt'), skip = 1)
files <- as.character(files$V8)[2:length(as.character(files$V8))]
file.remove(paste0(dataDir, 'files.txt'))
for (i in seq_along(files)) {
  system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -text ', 
                files[i], ' > ',  
                paste0(dataDir, "dewiki_revisions", i, ".csv")), wait = T)
}
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)

### ---------------------------------------------------------------------
### --- Section 2. Statistics and Analytical Datasets
### ---------------------------------------------------------------------

### -----------------------------------------------------
### --- Statistics on user registrations since the beginning of time
### -----------------------------------------------------

# - remove campaign registered users from dewiki_regusers
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[!(dewiki_regusers$user_id %in% campaign_regIDS), ]

# - stats
stats <- list()
stats$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats$total_users_who_edited <- length(wEdited)

# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
  difftime(Sys.time(),
           dewiki_regusers$reg_time,
           units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429

# - stats
stats$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
  )
stats$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats, 
        paste0(analyticsDir, "dewiki_stats.Rds"))

# - store data file
saveRDS(dewiki_regusers, 
        paste0(analyticsDir, "dewiki_regusers.Rds"))

### -----------------------------------------------------
### --- Statistics on user registrations since 2017
### -----------------------------------------------------

# - compare dates with a date (reg_time is a Date), not a timestamp with the number 2017
dewiki_regusers_2017 <- dewiki_regusers[dewiki_regusers$reg_time >= as.Date("2017-01-01"), ]

# - stats
stats_2017 <- list()
stats_2017$total_registered_users <- dim(dewiki_regusers_2017)[1]
wEdited <- which(dewiki_regusers_2017$user_id %in% dewiki_revisions$user_id)
stats_2017$total_users_who_edited <- length(wEdited)
stats_2017$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[1]
)
stats_2017$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[6]
)
stats_2017$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[4]
)
stats_2017$median_account_age_weeks <- median(dewiki_regusers_2017$account_age_weeks)
stats_2017$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[1]
)
stats_2017$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[6]
)
stats_2017$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers_2017$account_age_years))[4]
)
stats_2017$median_account_age_years <- median(dewiki_regusers_2017$account_age_years)
saveRDS(stats_2017, 
        paste0(analyticsDir, "dewiki_stats_2017.Rds"))

# - store data file
saveRDS(dewiki_regusers_2017, 
        paste0(analyticsDir, "dewiki_regusers_2017.Rds"))

# - clean up
rm(dewiki_regusers); rm(dewiki_regusers_2017); gc()

### -----------------------------------------------------
### --- Statistics on revisions since the beginning of time
### -----------------------------------------------------

# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time >= "2020-05-14")
  )
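# - guard: negative indexing with an empty index vector selects no rows at all
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}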

# - statistics
stats_revisions <- list()
stats_revisions$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist <- table(dewiki_revisions$user_id)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('user_id', 'revisions')
rev_dist <- arrange(rev_dist, desc(revisions))
rev_dist <- table(rev_dist$revisions)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('revisions', 'users')
rev_dist <- arrange(rev_dist, desc(users))
saveRDS(rev_dist, 
        paste0(analyticsDir, "rev_dist.Rds"))
rm(rev_dist); gc()

# - edit classes
editClasses <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses$editClass <- sapply(editClasses$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses$editClass[editClasses$editClass == "0 - 1"] <- "1"
editClasses <- arrange(editClasses, desc(revisions))
editClasses_dist <- table(editClasses$editClass)
editClasses_dist <- as.data.frame(editClasses_dist)
colnames(editClasses_dist) <- c("Edit Class", "Users")
editClasses_dist$`Edit Class` <- factor(editClasses_dist$`Edit Class`, 
                                        levels = c('1', 
                                                   '2 - 5', 
                                                   '6 - 9', 
                                                   '10 - 49', 
                                                   '> 50'))
editClasses_dist <- arrange(editClasses_dist, `Edit Class`)
editClasses_dist$`% Users` <- editClasses_dist$Users/sum(editClasses_dist$Users)*100
saveRDS(editClasses_dist, 
        paste0(analyticsDir, "editClasses_dist.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - rows are sorted by user_id and rev_time, so this numbers each user's revisions in time order
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

### ___ NOTE ___
# - There are 29148 observations where reg_time > rev_time:
sum(dewiki_revisions$account_age_rev_time_weeks < 0)
### ___ ACTION ___
# - remove from dewiki_revisions:
w <- which(dewiki_revisions$account_age_rev_time_weeks < 0)
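# - guard: negative indexing with an empty index vector selects no rows at all
if (length(w) > 0) {
  dewiki_revisions <- dewiki_revisions[-w, ]
}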

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() so that the class "k - k+1" holds ages from k up to (but excluding) k+1 years
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_elaborated.Rds"))

# - produce dewiki_revisions_overview for visualization
dewiki_revisions_overview <- dewiki_revisions %>% 
  select(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  summarise(n_users = n())
saveRDS(dewiki_revisions_overview, 
        paste0(analyticsDir, "dewiki_revisions_overview.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users <- list()
active_users$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
      )
    )
active_users$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users$two_weeks_p_total_registered_users <- 
  active_users$two_weeks/stats$total_registered_users
active_users$one_month_p_total_registered_users <- 
  active_users$one_month/stats$total_registered_users
active_users$six_months_p_total_registered_users <- 
  active_users$six_months/stats$total_registered_users
active_users$one_year_p_total_registered_users <- 
  active_users$one_year/stats$total_registered_users
active_users$two_weeks_p_total_users_who_edited <- 
  active_users$two_weeks/stats$total_users_who_edited
active_users$one_month_p_total_users_who_edited <- 
  active_users$one_month/stats$total_users_who_edited
active_users$six_months_p_total_users_who_edited <- 
  active_users$six_months/stats$total_users_who_edited
active_users$one_year_p_total_users_who_edited <- 
  active_users$one_year/stats$total_users_who_edited
saveRDS(active_users, 
        paste0(analyticsDir, "active_users.Rds"))

### -----------------------------------------------------
### --- Statistics on revisions since 2017
### -----------------------------------------------------

rm(dewiki_revisions); gc()
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

# - filter for user registrations since 2017-01-01
dewiki_revisions <- filter(dewiki_revisions, reg_time >= "2017-01-01")
dewiki_revisions <- as.data.table(dewiki_revisions)

# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time >= "2020-05-14")
)
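# - guard: negative indexing with an empty index vector selects no rows at all
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}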

# - statistics
stats_revisions_2017 <- list()
stats_revisions_2017$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist_2017 <- table(dewiki_revisions$user_id)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('user_id', 'revisions')
rev_dist_2017 <- arrange(rev_dist_2017, desc(revisions))
rev_dist_2017 <- table(rev_dist_2017$revisions)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('revisions', 'users')
rev_dist_2017 <- arrange(rev_dist_2017, desc(users))
saveRDS(rev_dist_2017, 
        paste0(analyticsDir, "rev_dist_2017.Rds"))
rm(rev_dist_2017); gc()

# - edit classes
editClasses_2017 <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses_2017$editClass <- sapply(editClasses_2017$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses_2017$editClass[editClasses_2017$editClass == "0 - 1"] <- "1"
editClasses_2017 <- arrange(editClasses_2017, desc(revisions))
editClasses_dist_2017 <- table(editClasses_2017$editClass)
editClasses_dist_2017 <- as.data.frame(editClasses_dist_2017)
colnames(editClasses_dist_2017) <- c("Edit Class", "Users")
editClasses_dist_2017$`Edit Class` <- factor(editClasses_dist_2017$`Edit Class`, 
                                        levels = c('1', 
                                                   '2 - 5', 
                                                   '6 - 9', 
                                                   '10 - 49', 
                                                   '> 50'))
editClasses_dist_2017 <- arrange(editClasses_dist_2017, `Edit Class`)
editClasses_dist_2017$`% Users` <- editClasses_dist_2017$Users/sum(editClasses_dist_2017$Users)*100
saveRDS(editClasses_dist_2017, 
        paste0(analyticsDir, "editClasses_dist_2017.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - rows are sorted by user_id and rev_time, so this numbers each user's revisions in time order
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() so that the class "k - k+1" holds ages from k up to (but excluding) k+1 years
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions_2017 
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_2017_elaborated.Rds"))

# - produce dewiki_revisions_overview_2017 for visualization
dewiki_revisions_overview_2017 <- dewiki_revisions %>% 
  select(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  summarise(n_users = n())
saveRDS(dewiki_revisions_overview_2017, 
        paste0(analyticsDir, "dewiki_revisions_overview_2017.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_2017 <- list()
active_users_2017$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
    )
  )
active_users_2017$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users_2017$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users_2017$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users_2017$two_weeks_p_total_registered_users <- 
  active_users_2017$two_weeks/stats_2017$total_registered_users
active_users_2017$one_month_p_total_registered_users <- 
  active_users_2017$one_month/stats_2017$total_registered_users
active_users_2017$six_months_p_total_registered_users <- 
  active_users_2017$six_months/stats_2017$total_registered_users
active_users_2017$one_year_p_total_registered_users <- 
  active_users_2017$one_year/stats_2017$total_registered_users
active_users_2017$two_weeks_p_total_users_who_edited <- 
  active_users_2017$two_weeks/stats_2017$total_users_who_edited
active_users_2017$one_month_p_total_users_who_edited <- 
  active_users_2017$one_month/stats_2017$total_users_who_edited
active_users_2017$six_months_p_total_users_who_edited <- 
  active_users_2017$six_months/stats_2017$total_users_who_edited
active_users_2017$one_year_p_total_users_who_edited <- 
  active_users_2017$one_year/stats_2017$total_users_who_edited
saveRDS(active_users_2017, 
        paste0(analyticsDir, "active_users_2017.Rds"))

### -----------------------------------------------------
### --- Statistics on user registrations for 
### --- campaign registered users
### -----------------------------------------------------

### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"), 
                        header = T,
                        check.names = F,
                        stringsAsFactors = F)

# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')

### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)

# - which campaign registered users cannot be found in dewiki_regusers
# - (row indices refer to the full campaignIDs table, not to the registered == 1 subset)
wNotFound <- which(campaignIDs$registered == 1 & 
                     !(campaignIDs$user_id %in% dewiki_regusers$user_id))
campaignIDs$not_found_in_dewiki_regusers <- 0
campaignIDs$not_found_in_dewiki_regusers[wNotFound] <- 1
# - store elaborated campaignIDs
write.csv(campaignIDs, 
          paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"))

# - keep only campaign registered users from dewiki_regusers
dim(dewiki_regusers)
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[(dewiki_regusers$user_id %in% campaign_regIDS), ]
dim(dewiki_regusers)

# - stats
stats_campaigns <- list()
stats_campaigns$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats_campaigns$total_users_who_edited <- length(wEdited)

# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
  difftime(Sys.time(),
           dewiki_regusers$reg_time,
           units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429

# - stats
stats_campaigns$min_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
)
stats_campaigns$max_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats_campaigns$mean_account_age_weeks <- unname(
  summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats_campaigns$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats_campaigns$min_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats_campaigns$max_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats_campaigns$mean_account_age_years <- unname(
  summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats_campaigns$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats_campaigns, 
        paste0(analyticsDir, "dewiki_stats_campaigns.Rds"))

# - store data file
saveRDS(dewiki_regusers, 
        paste0(analyticsDir, "dewiki_regusers_campaigns.Rds"))

### -----------------------------------------------------
### --- Statistics on revisions for campaign registered users
### -----------------------------------------------------

# - filter for campaign registered users on user registration
dim(dewiki_revisions)
dewiki_revisions <- filter(dewiki_revisions, user_id %in% campaign_regIDS)
dewiki_revisions <- as.data.table(dewiki_revisions)
dim(dewiki_revisions)

# - for non-registering campaigns: keep only user revisions following the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>% 
  filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
  (dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) & 
    (dewiki_revisions$rev_time < "2020-05-14")
)
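# - guard: negative indexing with an empty index vector selects no rows at all
if (length(wRemoveRevisions) > 0) {
  dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
}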

# - statistics
stats_revisions_campaigns <- list()
stats_revisions_campaigns$total_revisions <- dim(dewiki_revisions)[1]

# - distribution of number of revisions per user
rev_dist_campaigns <- table(dewiki_revisions$user_id)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('user_id', 'revisions')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(revisions))
rev_dist_campaigns <- table(rev_dist_campaigns$revisions)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('revisions', 'users')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(users))
saveRDS(rev_dist_campaigns, 
        paste0(analyticsDir, "rev_dist_campaigns.Rds"))
rm(rev_dist_campaigns); gc()

# - edit classes
editClasses_campaigns <- dewiki_revisions %>% 
  select(user_id) %>% 
  group_by(user_id) %>% 
  summarise(revisions = n())
editBoundaries <- list(
  c(0, 1), 
  c(2, 5),
  c(6, 9),
  c(10, 49)
)
editClasses_campaigns$editClass <- sapply(editClasses_campaigns$revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
editClasses_campaigns$editClass[editClasses_campaigns$editClass == "0 - 1"] <- "1"
editClasses_campaigns <- arrange(editClasses_campaigns, desc(revisions))
editClasses_dist_campaigns <- table(editClasses_campaigns$editClass)
editClasses_dist_campaigns <- as.data.frame(editClasses_dist_campaigns)
colnames(editClasses_dist_campaigns) <- c("Edit Class", "Users")
editClasses_dist_campaigns$`Edit Class` <- factor(editClasses_dist_campaigns$`Edit Class`, 
                                             levels = c('1', 
                                                        '2 - 5', 
                                                        '6 - 9', 
                                                        '10 - 49', 
                                                        '> 50'))
editClasses_dist_campaigns <- arrange(editClasses_dist_campaigns, `Edit Class`)
editClasses_dist_campaigns$`% Users` <- editClasses_dist_campaigns$Users/sum(editClasses_dist_campaigns$Users)*100
saveRDS(editClasses_dist_campaigns, 
        paste0(analyticsDir, "editClasses_dist_campaigns.Rds"))

# - cumulative edits in dewiki_revisions
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
# - rows are sorted by user_id and rev_time, so this numbers each user's revisions in time order
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
                                                        dewiki_revisions$reg_time,
                                                        units = "weeks")
dewiki_revisions$account_age_rev_time_years <- 
  dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <- 
  as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <- 
  as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
  wEC <- sapply(editBoundaries, function(y) {
    x >= y[1] & x <= y[2]
  })
  if (sum(wEC) == 0) {
    return("> 50")
  } else {
    return(paste0(editBoundaries[[which(wEC)]][1],
                  " - ",
                  editBoundaries[[which(wEC)]][2]
    )
    )
  }
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]

# - introduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() so that the class "k - k+1" holds ages from k up to (but excluding) k+1 years
dewiki_revisions$account_age_rev_time_years_class <- 
  floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <- 
  paste0(dewiki_revisions$account_age_rev_time_years_class, 
         " - ", 
         dewiki_revisions$account_age_rev_time_years_class + 1)

# - save the elaborated version of dewiki_revisions_campaigns
saveRDS(dewiki_revisions, 
        paste0(analyticsDir, "dewiki_revisions_campaigns_elaborated.Rds"))


# - produce dewiki_revisions_overview_campaigns for visualization
dewiki_revisions_overview_campaigns <- dewiki_revisions %>% 
  select(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>% 
  summarise(n_users = n())
saveRDS(dewiki_revisions_overview_campaigns, 
        paste0(analyticsDir, "dewiki_revisions_overview_campaigns.Rds"))

# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_campaigns <- list()
active_users_campaigns$two_weeks <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
    )
  )
active_users_campaigns$one_month <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
    )
  )
active_users_campaigns$six_months <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
    )
  )
active_users_campaigns$one_year <- 
  length(
    unique(
      dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
    )
  )
active_users_campaigns$two_weeks_p_total_registered_users <- 
  active_users_campaigns$two_weeks/stats_campaigns$total_registered_users

active_users_campaigns$one_month_p_total_registered_users <- 
  active_users_campaigns$one_month/stats_campaigns$total_registered_users

active_users_campaigns$six_months_p_total_registered_users <- 
  active_users_campaigns$six_months/stats_campaigns$total_registered_users

active_users_campaigns$one_year_p_total_registered_users <- 
  active_users_campaigns$one_year/stats_campaigns$total_registered_users

active_users_campaigns$two_weeks_p_total_users_who_edited <- 
  active_users_campaigns$two_weeks/stats_campaigns$total_users_who_edited

active_users_campaigns$one_month_p_total_users_who_edited <- 
  active_users_campaigns$one_month/stats_campaigns$total_users_who_edited

active_users_campaigns$six_months_p_total_users_who_edited <- 
  active_users_campaigns$six_months/stats_campaigns$total_users_who_edited

active_users_campaigns$one_year_p_total_users_who_edited <- 
  active_users_campaigns$one_year/stats_campaigns$total_users_who_edited

saveRDS(active_users_campaigns, 
        paste0(analyticsDir, "active_users_campaigns.Rds"))

### --- Final Reporting
paste0("Processing took: ", difftime(Sys.time(), t1, units = "mins"), " minutes.")
rm(list = ls()); gc()
```


## 1 Data Analysis

All statistics and visualizations reported in the following sections refer to the questions formulated in the [reference Phabricator ticket](https://phabricator.wikimedia.org/T256433). 

**Note.** All revisions made by campaign registered users were removed from the revision datasets used in the sections on organic registrations and revisions since 2017. For the non-registering WMDE Banner Campaigns (e.g. the WMDE Occasional Editors 2020 campaign, which addressed only already registered users), we remove all edits that these users made _following_ their exposure to the respective campaign. Likewise, in the campaign datasets, we remove all edits that these users made _before_ their exposure to the respective campaign.
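
The exclusion rule for the organic datasets can be illustrated on toy data. This is a minimal sketch, not the ETL code itself: the data frames `campaign_ids` and `revisions` and all of their values are invented for illustration. Only the `registered` flag and the campaign onset date (2020-05-14) are taken from the ETL script.

```r
# Toy data; all values are invented for illustration only.
campaign_ids <- data.frame(
  user_id    = c(1, 2, 3),
  campaign   = c("campaign_a", "campaign_a", "occasional_editors2020"),
  registered = c(1, 1, 0)
)
revisions <- data.frame(
  user_id  = c(1, 3, 3, 4, 4),
  rev_time = as.Date(c("2019-03-01", "2020-05-01", "2020-06-01",
                       "2018-01-10", "2020-06-15"))
)
onset <- as.Date("2020-05-14")  # onset of the non-registering campaign

registered_ids     <- campaign_ids$user_id[campaign_ids$registered == 1]
non_registered_ids <- campaign_ids$user_id[campaign_ids$registered == 0]

# Organic dataset: drop campaign registered users entirely, and drop the
# post-onset edits of users addressed by a non-registering campaign.
drop_organic <- revisions$user_id %in% registered_ids |
  (revisions$user_id %in% non_registered_ids & revisions$rev_time >= onset)
organic <- revisions[!drop_organic, ]
organic
```

In this toy example user 1 disappears completely, user 3 keeps only the edit made before the campaign onset, and user 4 (never part of any campaign) keeps both edits.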

### 1.1 Organic Registrations and Revisions since 2017

#### 1.1.1 Organic Registrations since 2017

**Q.** What is the age of the German Wikipedia Community in terms of account age?

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
stats_2017 <- readRDS(paste0(analyticsDir, 'dewiki_stats_2017.Rds'))
```

The following statistics all refer to `dewiki` user registrations since 2017:

- The total number of registered users is **`r stats_2017$total_registered_users`**.
- The total number of registered users who have ever edited is **`r stats_2017$total_users_who_edited`**.
- That means that **`r round(stats_2017$total_users_who_edited/stats_2017$total_registered_users*100, 2)`%** of registered users ever edited `dewiki`.
- Statistics on account age _in weeks_: the minimum account age is **`r round(stats_2017$min_account_age_weeks, 2)`**, the maximum account age is **`r round(stats_2017$max_account_age_weeks, 2)`**, the mean account age is **`r round(stats_2017$mean_account_age_weeks, 2)`**, and the median account age is **`r round(stats_2017$median_account_age_weeks, 2)`**;
- while _in years_, that would be: the minimum account age is **`r round(stats_2017$min_account_age_years, 2)`**, the maximum account age is **`r round(stats_2017$max_account_age_years, 2)`**, the mean account age is **`r round(stats_2017$mean_account_age_years, 2)`**, and the median account age is **`r round(stats_2017$median_account_age_years, 2)`**.
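
For orientation, account age statistics of this kind could be derived from registration timestamps roughly as in the sketch below. The toy registration dates and the reference date (30th June 2020, the end of the reporting time span) are assumptions; the weeks-to-years constant is the one used in Section 2.1 of this report.

```r
# Toy registration timestamps (hypothetical values)
reg_time <- as.POSIXct(c("2017-01-01", "2018-07-01", "2020-06-01"), tz = "UTC")

# Assumed reference date: the end of the reporting time span
reference_time <- as.POSIXct("2020-06-30", tz = "UTC")

# Account age in weeks, as a plain numeric vector
age_weeks <- as.numeric(difftime(reference_time, reg_time, units = "weeks"))

# Convert weeks to years (52.1429 weeks per year, as in Section 2.1)
weeks_per_year <- 52.1429
age_years <- age_weeks / weeks_per_year

# Summary statistics in both units
account_age_stats <- data.frame(
  unit = c("weeks", "years"),
  min = c(min(age_weeks), min(age_years)),
  max = c(max(age_weeks), max(age_years)),
  mean = c(mean(age_weeks), mean(age_years)),
  median = c(median(age_weeks), median(age_years)),
  stringsAsFactors = FALSE
)
```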

#### 1.1.2 User Revisions since 2017

**Q.** How many of them edited (since registration until 30th June 2020):

- 1 edit

- 2 to 5 edits

- 6 to 9 edits

- 10 to 49 edits

- 50 or more edits

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
editClasses_dist_2017 <- readRDS(paste0(analyticsDir, 'editClasses_dist_2017.Rds'))
editClasses_dist_2017$`% Users` <- round(editClasses_dist_2017$`% Users`, 2)
datatable(editClasses_dist_2017)
```
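
A minimal sketch of how per-user edit counts can be binned into these edit classes with base R `cut()`. The toy counts are made up, and the class labels are copied from the factor levels used in the chart code later in this section; note that the `> 50` label is used here for the "50 or more edits" class.

```r
# Toy per-user edit counts (hypothetical values)
edits_per_user <- c(1, 2, 5, 6, 9, 10, 49, 50, 120)

# Right-closed intervals: (0,1], (1,5], (5,9], (9,49], (49,Inf]
edit_class <- cut(
  edits_per_user,
  breaks = c(0, 1, 5, 9, 49, Inf),
  labels = c("1", "2 - 5", "6 - 9", "10 - 49", "> 50")
)

# Distribution of users across edit classes, with percentages
edit_class_dist <- as.data.frame(table(edit_class))
colnames(edit_class_dist) <- c("Edit Class", "Users")
edit_class_dist$`% Users` <- round(
  edit_class_dist$Users / sum(edit_class_dist$Users) * 100, 2
)
```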

**Q.** Retention rate: How many newly registered users are active after (active = at least 1 edit)

- 2 weeks after registration
- 1 month after registration
- 6 months after registration
- 12 months after registration

How high is the retention rate of these active users compared to the number of registrations?

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
active_users_2017 <- readRDS(paste0(analyticsDir, 'active_users_2017.Rds'))
active_users_2017 <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_2017[1:4]),
  `As % of registered users` = round(as.numeric(active_users_2017[5:8]), 2),
  `As % of users who ever edited` = round(as.numeric(active_users_2017[9:12]), 2),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_2017)
```
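
The retention table above is read from a precomputed file. As a rough sketch of one possible way to obtain such counts, the example below treats a user as active after a given period if they made at least one revision that many days or more after registration. That reading of "active after", the day thresholds for months and years, and all column names and toy values are assumptions.

```r
library(dplyr)

# Toy revisions with registration times (hypothetical columns and values)
revisions <- data.frame(
  user_id = c(1, 1, 2, 3),
  reg_time = as.POSIXct(rep("2019-01-01", 4), tz = "UTC"),
  rev_time = as.POSIXct(c("2019-01-02", "2019-08-01",
                          "2019-01-20", "2019-01-03"), tz = "UTC")
)

# Toy total of registered users, including two who never edited
n_registered <- 5

# Assumed thresholds in days for each retention class
thresholds_days <- c(`2 weeks` = 14, `1 month` = 30,
                     `6 months` = 182, `1 year` = 365)

# Latest activity of each user, in days since registration
last_activity <- revisions %>%
  mutate(days = as.numeric(difftime(rev_time, reg_time, units = "days"))) %>%
  group_by(user_id) %>%
  summarise(max_days = max(days))

# Users with at least one revision at or beyond each threshold
users <- sapply(thresholds_days, function(d) sum(last_activity$max_days >= d))

retention <- data.frame(
  `Retention Class` = names(thresholds_days),
  Users = as.numeric(users),
  `As % of registered users` = round(unname(users) / n_registered * 100, 2),
  `As % of users who ever edited` =
    round(unname(users) / nrow(last_activity) * 100, 2),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
```

The two percentage columns differ only in the denominator: all registered users versus users with at least one revision.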

**Q.** Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017.

**Note.** In the following chart, the `Account Age` variable refers to the user account age at the moment when a respective revision was made by that user. Panels (facets) correspond to different edit classes.

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
dewiki_revisions_overview_2017 <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_2017.Rds'))
dewiki_revisions_overview_2017 <- ungroup(dewiki_revisions_overview_2017)
colnames(dewiki_revisions_overview_2017) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_2017 <- filter(dewiki_revisions_overview_2017, 
                                         !(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_2017$`Edit Class`[dewiki_revisions_overview_2017$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_2017$`Edit Class` <- factor(dewiki_revisions_overview_2017$`Edit Class`,
                                                      levels = c('1',
                                                                 '2 - 5',
                                                                 '6 - 9',
                                                                 '10 - 49',
                                                                 '> 50'))
ggplot(dewiki_revisions_overview_2017, 
       aes(x = `Revision Year-Month`, 
           y = Users,
           group = `Account Age`, 
           color = `Account Age`)) + 
  geom_line() + geom_point(size = 1) +
  scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) +
  facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))
```

### 1.2 Campaign Registrations and Revisions since 2017

**Note.** Some users found in the WMDE campaign user registration datasets could not be matched with the user IDs in the `wmf.mediawiki_history` table. The following table presents an overview of how many campaign registered users are missing from the `wmf.mediawiki_history` table (see: [Wikitech `wmf.mediawiki_history` documentation](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history)). All WMDE campaign registered user IDs were checked for uniqueness and confirmed to be unique. The total number of WMDE campaign registered users that were not matched to the user ID fields (`event_user_id` for revisions, and `user_id` for registrations) in the `wmf.mediawiki_history` table is **49**.

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"), 
                        header = T, 
                        check.names = F,
                        row.names = 1,
                        stringsAsFactors = F)
not_found <- as.data.frame(
  table(campaignIDs$campaign[campaignIDs$not_found_in_dewiki_regusers == 1])
)
colnames(not_found) <- c('Campaign Code', 'Num.Users')
datatable(not_found)
```

Given that there are **`r length((campaignIDs$user_id[campaignIDs$registered==1]))`** WMDE campaign registered users in total, this means that we will not be able to analyze **`r round(49/length(campaignIDs$user_id[campaignIDs$registered==1])*100, 2)`%** of them. To keep the analysis consistent with the numbers previously reported on user registrations and revisions since 2017, all user registration data are derived from the campaign registered users that were matched with the user IDs in the `wmf.mediawiki_history` table.
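
The matching step described above can be illustrated with a small sketch. The toy IDs, the campaign codes, and the `history_user_ids` vector are hypothetical; only the general idea (uniqueness check, membership test, per-campaign count of unmatched users) follows the text.

```r
# Toy campaign registered users (hypothetical IDs and campaign codes)
campaign_ids <- data.frame(
  user_id = c(10, 11, 12, 13),
  campaign = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Toy user IDs found in the history table (hypothetical)
history_user_ids <- c(10, 12, 13, 99)

# Campaign user IDs must be unique before matching
stopifnot(!any(duplicated(campaign_ids$user_id)))

# Flag campaign users that have no match in the history table
campaign_ids$not_found <- as.integer(
  !(campaign_ids$user_id %in% history_user_ids)
)

# Unmatched users: total, share, and breakdown per campaign
n_not_found <- sum(campaign_ids$not_found)
pct_not_found <- round(n_not_found / nrow(campaign_ids) * 100, 2)
not_found_by_campaign <- as.data.frame(
  table(campaign_ids$campaign[campaign_ids$not_found == 1])
)
colnames(not_found_by_campaign) <- c("Campaign Code", "Num.Users")
```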

#### 1.2.1 Campaign Registrations since 2017

**Q.** What is the age of the German Wikipedia Community in terms of account age **for campaign registered users**?

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
dewiki_stats_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_stats_campaigns.Rds'))
```

The following statistics all refer to `dewiki` **campaign** user registrations since 2017:

- The total number of registered users is **`r dewiki_stats_campaigns$total_registered_users`**.
- The total number of registered users who have ever edited is **`r dewiki_stats_campaigns$total_users_who_edited`**.
- That means that **`r round(dewiki_stats_campaigns$total_users_who_edited/dewiki_stats_campaigns$total_registered_users*100, 2)`%** of registered users ever edited `dewiki`.
- Statistics on account age _in weeks_: the minimum account age is **`r round(dewiki_stats_campaigns$min_account_age_weeks, 2)`**, the maximum account age is **`r round(dewiki_stats_campaigns$max_account_age_weeks, 2)`**, the mean account age is **`r round(dewiki_stats_campaigns$mean_account_age_weeks, 2)`**, and the median account age is **`r round(dewiki_stats_campaigns$median_account_age_weeks, 2)`**;
- while _in years_, that would be: the minimum account age is **`r round(dewiki_stats_campaigns$min_account_age_years, 2)`**, the maximum account age is **`r round(dewiki_stats_campaigns$max_account_age_years, 2)`**, the mean account age is **`r round(dewiki_stats_campaigns$mean_account_age_years, 2)`**, and the median account age is **`r round(dewiki_stats_campaigns$median_account_age_years, 2)`**.

#### 1.2.2 Campaign Revisions since 2017

**Q.** How many of **campaign registered users** edited (since registration until 30th June 2020):

- 1 edit

- 2 to 5 edits

- 6 to 9 edits

- 10 to 49 edits

- 50 or more edits

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
editClasses_dist_campaigns <- readRDS(paste0(analyticsDir, 'editClasses_dist_campaigns.Rds'))
editClasses_dist_campaigns$`% Users` <- round(editClasses_dist_campaigns$`% Users`, 2)
datatable(editClasses_dist_campaigns)
```

**Q.** Retention rate: How many **campaign registered** users are active after (active = at least 1 edit)

- 2 weeks after registration
- 1 month after registration
- 6 months after registration
- 12 months after registration

How high is the retention rate of these active users compared to the number of registrations?

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
active_users_campaigns <- readRDS(paste0(analyticsDir, 'active_users_campaigns.Rds'))
active_users_campaigns <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_campaigns[1:4]),
  `As % of registered users` = round(as.numeric(active_users_campaigns[5:8]), 2),
  `As % of users who ever edited` = round(as.numeric(active_users_campaigns[9:12]), 2),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_campaigns)
```

**Q.** Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017 (**for campaign registered users**):

**Note.** In the following chart, the `Account Age` variable refers to the user account age at the moment when a respective revision was made by that user. Panels (facets) correspond to different edit classes.

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
dewiki_revisions_overview_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_campaigns.Rds'))
dewiki_revisions_overview_campaigns <- ungroup(dewiki_revisions_overview_campaigns)
colnames(dewiki_revisions_overview_campaigns) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_campaigns <- filter(dewiki_revisions_overview_campaigns, 
                                         !(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_campaigns$`Edit Class`[dewiki_revisions_overview_campaigns$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_campaigns$`Edit Class` <- factor(dewiki_revisions_overview_campaigns$`Edit Class`,
                                                           levels = c('1',
                                                                      '2 - 5',
                                                                      '6 - 9',
                                                                      '10 - 49',
                                                                      '> 50'))
ggplot(dewiki_revisions_overview_campaigns, 
       aes(x = `Revision Year-Month`, 
           y = Users,
           group = `Account Age`, 
           color = `Account Age`)) + 
  geom_line() + geom_point(size = 1) + 
  scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) + 
  facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))
```

### 1.3 Campaign Registrations and Revisions: Overview

#### 1.3.1 Campaign User Registrations and Revisions

> I have just one major remark on the report: I also need registrations and revisions per campaign to be able to compare the campaigns. Are you still on it or did you miss it out? (the edit class and account age comparison is not necessary for the campaign split).

Reference: [Phab: T256433#6328618](https://phabricator.wikimedia.org/T256433#6328618)

**Note.** The `recent editors` column reports the number of campaign registered users who made at least one edit as of 30th June 2020, reference: [Phab: T256433#6385973](https://phabricator.wikimedia.org/T256433#6385973)

```{r echo = T, eval = T, warning = 'hide', message = FALSE}
campaignRegs <- read.csv(paste0(analyticsDir, "campaignRegistrationsSummary.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
campaignRevs <- read.csv(paste0(analyticsDir, "campaignRevisionsSummary.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
campaignsOverview <- left_join(campaignRevs, 
                               campaignRegs, 
                               by = "campaign")
campaignsOverview$rev_per_reg <- round(campaignsOverview$revisions/campaignsOverview$registered, 2)
recentEditors <- read.csv(paste0(analyticsDir, "recentCampaignEditors.csv"), 
                         header = T,
                         check.names = F,
                         row.names = 1,
                         stringsAsFactors = F)
colnames(recentEditors)[2] <- "recent editors"
campaignsOverview <- left_join(campaignsOverview, 
                               recentEditors, 
                               by = "campaign")
datatable(campaignsOverview)
```

#### 1.3.2 Campaign User Retention

> I wondered if it were much additional work to compute not only # of revisions and registrations (in section 1.3) , but also retention/ retention rates (like you did in section 1.2.2) per campaign? I guess that's what @Verena was initially asking for.

Reference: [Phab: T256433#6337701](https://phabricator.wikimedia.org/T256433#6337701)

```{r echo = T, eval = T, warning = 'hide', message = FALSE}
active_users_per_campaign <- readRDS(paste0(analyticsDir, "active_users_per_campaign.Rds"))
active_users_per_campaign <- active_users_per_campaign[, c('campaign',
                                                           'two_weeks',
                                                           'one_month',
                                                           'six_months',
                                                           'one_year')]
colnames(active_users_per_campaign) <- c('campaign',
                                         '2 weeks',
                                         '1 month',
                                         '6 months',
                                         '1 year')
datatable(active_users_per_campaign)
```


### 1.4 Training Modules in 2018 Campaigns: Active Users

> Of the people who started training modules (onboarding content used in 2018 in thank you, spring and summer campaign), how is the rate of still active users?

```{r echo = T, eval = T, warning = 'hide', message = FALSE}
active_users_campaigns_training_2018 <- 
  readRDS(paste0(analyticsDir, "active_users_campaigns_training.Rds"))
active_users_campaigns_training_2018 <- data.frame(
  `Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
  Users = as.numeric(active_users_campaigns_training_2018[1:4]),
  stringsAsFactors = F, 
  check.names = F)
datatable(active_users_campaigns_training_2018)
```

## 2 Additional Requests

### 2.1 Organic Growth/Age of Community

**Request**. "Organic Growth/ Age of Community: For the years 2001 to 2019 we need for every year the average age of all accounts who did at least one edit in this year. age = number of years since registration ( I am aware that for a few accounts the registration date can't be retrieved from the database. Because this should be a relatively small number of accounts we can neglect that here.)" [reference Phab ticket](https://phabricator.wikimedia.org/T256433#6385973)

```{r echo = T, eval = T, warning = 'hide', message = FALSE}
dewiki_revisions_elaborated <- readRDS("~/WMDE/NewEditors/CampaignsReview2020/_analytics/dewiki_revisions_elaborated.Rds")
dewiki_revisions_elaborated <- dplyr::select(dewiki_revisions_elaborated, 
                                             reg_time, 
                                             rev_time)
dewiki_revisions_elaborated$rev_year <- substr(dewiki_revisions_elaborated$rev_time, 1, 4)
dewiki_revisions_elaborated <- dplyr::filter(dewiki_revisions_elaborated, 
                                             rev_year != "2020")
dewiki_revisions_elaborated$account_age <- difftime(dewiki_revisions_elaborated$rev_time, 
                                                    dewiki_revisions_elaborated$reg_time, 
                                                    units = "weeks")
# approximate number of weeks in a year (365 / 7)
one_year <- 52.1429
dewiki_revisions_elaborated$account_age <- dewiki_revisions_elaborated$account_age/one_year
organicGrowth <- dewiki_revisions_elaborated %>% 
  dplyr::select(rev_year, account_age) %>% 
  dplyr::group_by(rev_year) %>% 
  summarise(avg_account_age = mean(account_age))
datatable(organicGrowth)
```
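
If `dewiki_revisions_elaborated` holds one row per revision, as its name suggests, the mean computed above is taken over revisions, so accounts with many revisions in a year weigh more than accounts with few. The sketch below shows a per-account variant on toy data, in which each account active in a year counts once. The `user_id` column, the toy values, and the choice to represent each account by its mean age across that year's revisions are assumptions.

```r
library(dplyr)

# Toy revisions (hypothetical columns and values)
revisions <- data.frame(
  user_id = c(1, 1, 1, 2),
  reg_time = as.POSIXct(c("2015-01-01", "2015-01-01",
                          "2015-01-01", "2018-01-01"), tz = "UTC"),
  rev_time = as.POSIXct(c("2018-03-01", "2018-06-01",
                          "2018-09-01", "2018-07-01"), tz = "UTC")
)

weeks_per_year <- 52.1429

# Account age in years at the time of each revision
revisions <- revisions %>%
  mutate(
    rev_year = format(rev_time, "%Y"),
    account_age = as.numeric(difftime(rev_time, reg_time, units = "weeks")) /
      weeks_per_year
  )

# Per-revision average: every revision counts once
per_revision <- revisions %>%
  group_by(rev_year) %>%
  summarise(avg_account_age = mean(account_age))

# Per-account average: collapse to one value per account and year first
per_account <- revisions %>%
  group_by(rev_year, user_id) %>%
  summarise(account_age = mean(account_age), .groups = "drop") %>%
  group_by(rev_year) %>%
  summarise(avg_account_age = mean(account_age), n_accounts = n())
```

In this toy example the older account made three of the four revisions, so the per-revision average is higher than the per-account average.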

```{r echo = T, eval = T, warning = 'hide', message = FALSE, fig.width = 10, fig.height = 10}
organicGrowth$rev_year <- as.numeric(organicGrowth$rev_year)
organicGrowth$avg_account_age <- as.numeric(organicGrowth$avg_account_age)
ggplot(organicGrowth, 
       aes(x = rev_year, 
           y = avg_account_age, 
           label = round(avg_account_age, 2))) + 
  geom_path(size = .25) + 
  geom_point(size = 1.5) + 
  geom_point(size = 1, color = "white") + 
  scale_x_continuous(
    breaks = organicGrowth$rev_year, 
    labels = as.character(organicGrowth$rev_year)) +
  geom_text_repel() + 
  xlab('Year') + ylab('Average Account Age (Yrs)') +
  ggtitle('Average Account Age in Years: Dewiki Revisions until 2019.') + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))
```









