Feedback should be sent to goran.milovanovic_ext@wikimedia.de.
Reference Document: New editors - WMDE deep dive - analytics questions
Reference Phabricator Ticket: https://phabricator.wikimedia.org/T256433
Background/ Reason why
The first campaign that included tracking was conducted in 2017. Since then, several campaigns with different content and user journeys have been run. To support strategic decisions on future activities for gaining new editors, we want to comprehensively analyze past activities and their impact. Besides qualitative results, we need to consider quantitative results. Campaign reports usually cover a certain period during and after the campaign, but not long-term effects; these should be addressed in this comprehensive analysis.
Timing
Briefing: 9th July 2020
Delivery of report: week 31
General requirements
The report should be delivered as tables and charts in HTML, as in previous reports.
The report might be made publicly available in the future; this is not a requirement for the current delivery deadline.
Communication will happen in Phabricator.
Based on this report we will have more questions, which should be addressed in a phase 2 in August.
The time span for this report is January 2017 until June 2020.
NOTE: the Data Acquisition code chunk is not fully reproducible from this Report. The data are collected by running the PySpark script CampaignsReview2020_S01_ETL.py on the stat1005.eqiad.wmnet server, which either writes the data to local .csv files or stores large datasets directly to HDFS in the WMF Data Lake. Immediately following this step, the R script CampaignsReview2020_S01_ETL.R is run to clean the data, produce aggregated datasets, prepare the data for visualization, and compute all requested statistics.
dewiki user registrations and revisions
The CampaignsReview2020_S01_ETL.py script: collect data on user registrations and revisions from the wmf.mediawiki_history table:
# - Setup
import pyspark
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import rank, col, explode, regexp_extract, array_contains, when, sum, count, expr
import re
import csv
import pandas as pd
### --- dir structure and params
mwwikiSnapshot = "2020-06"
# - Spark Session
sc = SparkSession\
.builder\
.appName("WD Human Edits per Class")\
.enableHiveSupport()\
.getOrCreate()
# - SQL context
sqlContext = pyspark.SQLContext(sc)
# - Process wmf.mediawiki_history: all users ever registered with dewiki
dewiki_regusers = sqlContext.sql("""SELECT user_id, user_registration_timestamp
FROM wmf.mediawiki_history
WHERE event_entity = 'user'
AND event_type = 'create'
AND user_is_anonymous = false
AND user_is_created_by_self = true
AND user_id IS NOT NULL
AND user_registration_timestamp IS NOT NULL
AND NOT ARRAY_CONTAINS(user_is_bot_by, 'name')
AND NOT ARRAY_CONTAINS(user_is_bot_by, 'group')
AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'name')
AND NOT ARRAY_CONTAINS(user_is_bot_by_historical, 'group')
AND wiki_db = 'dewiki'
AND snapshot='""" + mwwikiSnapshot + """'
ORDER BY user_id""")
dewiki_regusers.toPandas().to_csv("/home/goransm/Analytics/NewEditors/CampaignsReview2020/_data/dewiki_regusers.csv",
header=True,
index=False)
# - Process wmf.mediawiki_history: all revisions ever made on dewiki
dewiki_revisions = sqlContext.sql("""SELECT event_user_id,
event_user_registration_timestamp,
event_timestamp
FROM wmf.mediawiki_history
WHERE event_entity = 'revision'
AND event_type = 'create'
AND event_user_is_anonymous = false
AND event_user_is_created_by_self = true
AND event_user_id IS NOT NULL
AND event_user_registration_timestamp IS NOT NULL
AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name')
AND NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group')
AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name')
AND NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group')
AND page_namespace_is_content = true
AND page_namespace_is_content_historical = true
AND wiki_db = 'dewiki'
AND snapshot='""" + mwwikiSnapshot + """'
ORDER BY event_user_id, event_timestamp""")
dewiki_revisions.repartition(10).write.format('csv').save('dewiki_revisions')
From the CampaignsReview2020_S01_ETL.py script we can see exactly what definitions of user registration and user revision are used in this report:
User registration: we exclude anonymous users (user_is_anonymous = false), users who were not self-created (user_is_created_by_self = true), users who have no value in the user_id or the registration timestamp field (user_id IS NOT NULL AND user_registration_timestamp IS NOT NULL), and any users who are currently classified as bots or were classified as bots in the past.
User revision: we focus on revisions made on content pages only (page_namespace_is_content = true AND page_namespace_is_content_historical = true), applying the same set of constraints to the user who made a particular revision as in the definition of user registration.
dewiki user registrations and revisions
The CampaignsReview2020_S01_ETL.R script: clean the data, produce aggregated datasets, prepare the data for visualization, and compute all requested statistics:
### --- 2020/07/14
### --- Phab: https://phabricator.wikimedia.org/T256433
### --- WMDE Banner Campaigns Comprehensive Report 2017-2020
t1 <- Sys.time()
library(tidyverse)
library(data.table)
fPath <- '/home/goransm/Analytics/NewEditors/CampaignsReview2020/'
dataDir <- paste0(fPath, "_data/")
analyticsDir <- paste0(fPath, "_analytics/")
hdfsPath <- 'hdfs:///user/goransm/dewiki_revisions'
### ---------------------------------------------------------------------
### --- Section 0. Campaigns Dataset
### ---------------------------------------------------------------------
### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"),
header = T,
check.names = F,
stringsAsFactors = F)
### ---------------------------------------------------------------------
### --- Section 1. Datasets
### ---------------------------------------------------------------------
### --- Compose final revision dataset from hdfs: dewiki_revisions
# - copy splits from hdfs to local dataDir
system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -ls ',
hdfsPath, ' > ',
dataDir, 'files.txt'),
wait = T)
files <- read.table(paste0(dataDir, 'files.txt'), skip = 1)
files <- as.character(files$V8)[2:length(as.character(files$V8))]
file.remove(paste0(dataDir, 'files.txt'))
for (i in 1:length(files)) {
system(paste0('sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -text ',
files[i], ' > ',
paste0(dataDir, "dewiki_revisions", i, ".csv")), wait = T)
}
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')
### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)
### ---------------------------------------------------------------------
### --- Section 2. Statistics and Analytical Datasets
### ---------------------------------------------------------------------
### -----------------------------------------------------
### --- Statistics on user registrations since the beginning of time
### -----------------------------------------------------
# - remove campaign registered users from dewiki_regusers
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[!(dewiki_regusers$user_id %in% campaign_regIDS), ]
# - stats
stats <- list()
stats$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats$total_users_who_edited <- length(wEdited)
# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
difftime(Sys.time(),
dewiki_regusers$reg_time,
units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429
# - stats
stats$min_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
)
stats$max_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats$mean_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats$min_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats$max_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats$mean_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats,
paste0(analyticsDir, "dewiki_stats.Rds"))
# - store data file
saveRDS(dewiki_regusers,
paste0(analyticsDir, "dewiki_regusers.Rds"))
### -----------------------------------------------------
### --- Statistics on user registrations since 2017
### -----------------------------------------------------
# - compare dates with dates: reg_time is the Date column derived from user_registration_timestamp
dewiki_regusers_2017 <- dewiki_regusers[dewiki_regusers$reg_time >= as.Date("2017-01-01"), ]
# - stats
stats_2017 <- list()
stats_2017$total_registered_users <- dim(dewiki_regusers_2017)[1]
wEdited <- which(dewiki_regusers_2017$user_id %in% dewiki_revisions$user_id)
stats_2017$total_users_who_edited <- length(wEdited)
stats_2017$min_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[1]
)
stats_2017$max_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[6]
)
stats_2017$mean_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_weeks))[4]
)
stats_2017$median_account_age_weeks <- median(dewiki_regusers_2017$account_age_weeks)
stats_2017$min_account_age_years <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_years))[1]
)
stats_2017$max_account_age_years <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_years))[6]
)
stats_2017$mean_account_age_years <- unname(
summary(as.numeric(dewiki_regusers_2017$account_age_years))[4]
)
stats_2017$median_account_age_years <- median(dewiki_regusers_2017$account_age_years)
saveRDS(stats_2017,
paste0(analyticsDir, "dewiki_stats_2017.Rds"))
# - store data file
saveRDS(dewiki_regusers_2017,
paste0(analyticsDir, "dewiki_regusers_2017.Rds"))
# - clean up
rm(dewiki_regusers); rm(dewiki_regusers_2017); gc()
### -----------------------------------------------------
### --- Statistics on revisions since the beginning of time
### -----------------------------------------------------
# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>%
filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
(dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) &
(dewiki_revisions$rev_time >= "2020-05-14")
)
dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
# - statistics
stats_revisions <- list()
stats_revisions$total_revisions <- dim(dewiki_revisions)[1]
# - distribution of number of revisions per user
rev_dist <- table(dewiki_revisions$user_id)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('user_id', 'revisions')
rev_dist <- arrange(rev_dist, desc(revisions))
rev_dist <- table(rev_dist$revisions)
rev_dist <- as.data.frame(rev_dist)
colnames(rev_dist) <- c('revisions', 'users')
rev_dist <- arrange(rev_dist, desc(users))
saveRDS(rev_dist,
paste0(analyticsDir, "rev_dist.Rds"))
rm(rev_dist); gc()
# - edit classes
editClasses <- dewiki_revisions %>%
select(user_id) %>%
group_by(user_id) %>%
summarise(revisions = n())
editBoundaries <- list(
c(0, 1),
c(2, 5),
c(6, 9),
c(10, 49)
)
editClasses$editClass <- sapply(editClasses$revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
editClasses$editClass[editClasses$editClass == "0 - 1"] <- "1"
editClasses <- arrange(editClasses, desc(revisions))
editClasses_dist <- table(editClasses$editClass)
editClasses_dist <- as.data.frame(editClasses_dist)
colnames(editClasses_dist) <- c("Edit Class", "Users")
editClasses_dist$`Edit Class` <- factor(editClasses_dist$`Edit Class`,
levels = c('1',
'2 - 5',
'6 - 9',
'10 - 49',
'> 50'))
editClasses_dist <- arrange(editClasses_dist, `Edit Class`)
editClasses_dist$`% Users` <- editClasses_dist$Users/sum(editClasses_dist$Users)*100
saveRDS(editClasses_dist,
paste0(analyticsDir, "editClasses_dist.Rds"))
# - cumulative edits in dewiki_revisions
# - running revision count per user, in order of revision time
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
dewiki_revisions$reg_time,
units = "weeks")
dewiki_revisions$account_age_rev_time_years <-
dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <-
as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <-
as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]
### ___ NOTE ___
# - There are 29148 observations where reg_time > rev_time:
sum(dewiki_revisions$account_age_rev_time_weeks < 0)
### ___ ACTION ___
# - remove from dewiki_revisions:
w <- which(dewiki_revisions$account_age_rev_time_weeks < 0)
dewiki_revisions <- dewiki_revisions[-w, ]
# - introduce revision year-month and account age classes in years
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() so that an account age between k and k + 1 years is labelled "k - k+1"
dewiki_revisions$account_age_rev_time_years_class <-
floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <-
paste0(dewiki_revisions$account_age_rev_time_years_class,
" - ",
dewiki_revisions$account_age_rev_time_years_class + 1)
# - save the elaborated version of dewiki_revisions
saveRDS(dewiki_revisions,
paste0(analyticsDir, "dewiki_revisions_elaborated.Rds"))
# - produce dewiki_revisions_overview for visualization
# - NOTE: n() counts rows, so n_users holds the number of revisions per group
dewiki_revisions_overview <- dewiki_revisions %>%
select(rev_time_ym, editClass, account_age_rev_time_years_class) %>%
group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>%
summarise(n_users = n())
saveRDS(dewiki_revisions_overview,
paste0(analyticsDir, "dewiki_revisions_overview.Rds"))
# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users <- list()
active_users$two_weeks <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
)
)
active_users$one_month <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
)
)
active_users$six_months <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
)
)
active_users$one_year <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
)
)
active_users$two_weeks_p_total_registered_users <-
active_users$two_weeks/stats$total_registered_users
active_users$one_month_p_total_registered_users <-
active_users$one_month/stats$total_registered_users
active_users$six_months_p_total_registered_users <-
active_users$six_months/stats$total_registered_users
active_users$one_year_p_total_registered_users <-
active_users$one_year/stats$total_registered_users
active_users$two_weeks_p_total_users_who_edited <-
active_users$two_weeks/stats$total_users_who_edited
active_users$one_month_p_total_users_who_edited <-
active_users$one_month/stats$total_users_who_edited
active_users$six_months_total_users_who_edited <-
active_users$six_months/stats$total_users_who_edited
active_users$one_year_p_total_users_who_edited <-
active_users$one_year/stats$total_users_who_edited
saveRDS(active_users,
paste0(analyticsDir, "active_users.Rds"))
### -----------------------------------------------------
### --- Statistics on revisions since 2017
### -----------------------------------------------------
rm(dewiki_revisions); gc()
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')
# - filter for >= 2017 on user registration (compare dates with dates)
dewiki_revisions <- filter(dewiki_revisions, as.Date(reg_time) >= as.Date("2017-01-01"))
dewiki_revisions <- as.data.table(dewiki_revisions)
# - remove campaign registered users from dewiki_revisions
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_revisions <- dewiki_revisions[!(dewiki_revisions$user_id %in% campaign_regIDS), ]
# - for non-registering campaigns: keep only user revisions before the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>%
filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
(dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) &
(dewiki_revisions$rev_time >= "2020-05-14")
)
dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
# - statistics
stats_revisions_2017 <- list()
stats_revisions_2017$total_revisions <- dim(dewiki_revisions)[1]
# - distribution of number of revisions per user
rev_dist_2017 <- table(dewiki_revisions$user_id)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('user_id', 'revisions')
rev_dist_2017 <- arrange(rev_dist_2017, desc(revisions))
rev_dist_2017 <- table(rev_dist_2017$revisions)
rev_dist_2017 <- as.data.frame(rev_dist_2017)
colnames(rev_dist_2017) <- c('revisions', 'users')
rev_dist_2017 <- arrange(rev_dist_2017, desc(users))
saveRDS(rev_dist_2017,
paste0(analyticsDir, "rev_dist_2017.Rds"))
rm(rev_dist_2017); gc()
# - edit classes
editClasses_2017 <- dewiki_revisions %>%
select(user_id) %>%
group_by(user_id) %>%
summarise(revisions = n())
editBoundaries <- list(
c(0, 1),
c(2, 5),
c(6, 9),
c(10, 49)
)
editClasses_2017$editClass <- sapply(editClasses_2017$revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
editClasses_2017$editClass[editClasses_2017$editClass == "0 - 1"] <- "1"
editClasses_2017 <- arrange(editClasses_2017, desc(revisions))
editClasses_dist_2017 <- table(editClasses_2017$editClass)
editClasses_dist_2017 <- as.data.frame(editClasses_dist_2017)
colnames(editClasses_dist_2017) <- c("Edit Class", "Users")
editClasses_dist_2017$`Edit Class` <- factor(editClasses_dist_2017$`Edit Class`,
levels = c('1',
'2 - 5',
'6 - 9',
'10 - 49',
'> 50'))
editClasses_dist_2017 <- arrange(editClasses_dist_2017, `Edit Class`)
editClasses_dist_2017$`% Users` <- editClasses_dist_2017$Users/sum(editClasses_dist_2017$Users)*100
saveRDS(editClasses_dist_2017,
paste0(analyticsDir, "editClasses_dist_2017.Rds"))
# - cumulative edits in dewiki_revisions
# - running revision count per user, in order of revision time
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
dewiki_revisions$reg_time,
units = "weeks")
dewiki_revisions$account_age_rev_time_years <-
dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <-
as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <-
as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]
# - introduce revision year-month and account age classes in years
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor() so that an account age between k and k + 1 years is labelled "k - k+1"
dewiki_revisions$account_age_rev_time_years_class <-
floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <-
paste0(dewiki_revisions$account_age_rev_time_years_class,
" - ",
dewiki_revisions$account_age_rev_time_years_class + 1)
# - save the elaborated version of dewiki_revisions_2017
saveRDS(dewiki_revisions,
paste0(analyticsDir, "dewiki_revisions_2017_elaborated.Rds"))
# - produce dewiki_revisions_overview_2017 for visualization
# - NOTE: n() counts rows, so n_users holds the number of revisions per group
dewiki_revisions_overview_2017 <- dewiki_revisions %>%
select(rev_time_ym, editClass, account_age_rev_time_years_class) %>%
group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>%
summarise(n_users = n())
saveRDS(dewiki_revisions_overview_2017,
paste0(analyticsDir, "dewiki_revisions_overview_2017.Rds"))
# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_2017 <- list()
active_users_2017$two_weeks <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
)
)
active_users_2017$one_month <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
)
)
active_users_2017$six_months <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
)
)
active_users_2017$one_year <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
)
)
active_users_2017$two_weeks_p_total_registered_users <-
active_users_2017$two_weeks/stats_2017$total_registered_users
active_users_2017$one_month_p_total_registered_users <-
active_users_2017$one_month/stats_2017$total_registered_users
active_users_2017$six_months_p_total_registered_users <-
active_users_2017$six_months/stats_2017$total_registered_users
active_users_2017$one_year_p_total_registered_users <-
active_users_2017$one_year/stats_2017$total_registered_users
active_users_2017$two_weeks_p_total_users_who_edited <-
active_users_2017$two_weeks/stats_2017$total_users_who_edited
active_users_2017$one_month_p_total_users_who_edited <-
active_users_2017$one_month/stats_2017$total_users_who_edited
active_users_2017$six_months_total_users_who_edited <-
active_users_2017$six_months/stats_2017$total_users_who_edited
active_users_2017$one_year_p_total_users_who_edited <-
active_users_2017$one_year/stats_2017$total_users_who_edited
saveRDS(active_users_2017,
paste0(analyticsDir, "active_users_2017.Rds"))
### -----------------------------------------------------
### --- Statistics on user registrations for
### --- campaign registered users
### -----------------------------------------------------
### --- load Campaign Registered Users dataset
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs.csv"),
header = T,
check.names = F,
stringsAsFactors = F)
# - read splits: dewiki_revisions
# - load
lF <- list.files(dataDir)
lF <- lF[grepl("dewiki_revisions", lF)]
dewiki_revisions <- lapply(lF, function(x) {fread(paste0(dataDir, x), header = F)})
# - collect
dewiki_revisions <- rbindlist(dewiki_revisions)
# - schema
colnames(dewiki_revisions) <- c('user_id', 'reg_time', 'rev_time')
### --- Load registration data: dewiki_regusers
dewiki_regusers <- fread(paste0(dataDir, "dewiki_regusers.csv"), header = T)
# - which campaign registered users cannot be found in dewiki_regusers
# - index against the full campaignIDs data frame so that the flag lands on the right rows
wNotFound <- which(campaignIDs$registered == 1 &
!(campaignIDs$user_id %in% dewiki_regusers$user_id))
campaignIDs$not_found_in_dewiki_regusers <- 0
campaignIDs$not_found_in_dewiki_regusers[wNotFound] <- 1
# - store elaborated campaignIDs
write.csv(campaignIDs,
paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"))
# - keep only campaign registered users from dewiki_regusers
dim(dewiki_regusers)
campaign_regIDS <- unique(campaignIDs$user_id[campaignIDs$registered == 1])
dewiki_regusers <- dewiki_regusers[(dewiki_regusers$user_id %in% campaign_regIDS), ]
dim(dewiki_regusers)
# - stats
stats_campaigns <- list()
stats_campaigns$total_registered_users <- dim(dewiki_regusers)[1]
wEdited <- which(dewiki_regusers$user_id %in% dewiki_revisions$user_id)
stats_campaigns$total_users_who_edited <- length(wEdited)
# - distribution of account age
dewiki_regusers$reg_time <- as.Date(dewiki_regusers$user_registration_timestamp)
dewiki_regusers$account_age_weeks <- as.numeric(
difftime(Sys.time(),
dewiki_regusers$reg_time,
units = "weeks")
)
dewiki_regusers$account_age_years = dewiki_regusers$account_age_weeks/52.1429
# - stats
stats_campaigns$min_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[1]
)
stats_campaigns$max_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[6]
)
stats_campaigns$mean_account_age_weeks <- unname(
summary(as.numeric(dewiki_regusers$account_age_weeks))[4]
)
stats_campaigns$median_account_age_weeks <- median(dewiki_regusers$account_age_weeks)
stats_campaigns$min_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[1]
)
stats_campaigns$max_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[6]
)
stats_campaigns$mean_account_age_years <- unname(
summary(as.numeric(dewiki_regusers$account_age_years))[4]
)
stats_campaigns$median_account_age_years <- median(dewiki_regusers$account_age_years)
saveRDS(stats_campaigns,
paste0(analyticsDir, "dewiki_stats_campaigns.Rds"))
# - store data file
saveRDS(dewiki_regusers,
paste0(analyticsDir, "dewiki_regusers_campaigns.Rds"))
### -----------------------------------------------------
### --- Statistics on revisions for campaign registered users
### -----------------------------------------------------
# - filter for campaign registered users on user registration
dim(dewiki_revisions)
dewiki_revisions <- filter(dewiki_revisions, user_id %in% campaign_regIDS)
dewiki_revisions <- as.data.table(dewiki_revisions)
dim(dewiki_revisions)
# - for non-registering campaigns: keep only user revisions following the campaign onset
# - there is currently one non-registering campaign present in campaignIDs:
non_registering_campaignIDs <- campaignIDs %>%
filter(registered == 0)
non_registering_campaigns <- unique(non_registering_campaignIDs$campaign)
non_registering_campaigns
# - "occasional_editors2020"
# - the campaign onset for "occasional_editors2020" is:
# - 2020/05/14
wRemoveRevisions <- which(
(dewiki_revisions$user_id %in% non_registering_campaignIDs$user_id) &
(dewiki_revisions$rev_time < "2020-05-14")
)
dewiki_revisions <- dewiki_revisions[-wRemoveRevisions, ]
# - statistics
stats_revisions_campaigns <- list()
stats_revisions_campaigns$total_revisions <- dim(dewiki_revisions)[1]
# - distribution of number of revisions per user
rev_dist_campaigns <- table(dewiki_revisions$user_id)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('user_id', 'revisions')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(revisions))
rev_dist_campaigns <- table(rev_dist_campaigns$revisions)
rev_dist_campaigns <- as.data.frame(rev_dist_campaigns)
colnames(rev_dist_campaigns) <- c('revisions', 'users')
rev_dist_campaigns <- arrange(rev_dist_campaigns, desc(users))
saveRDS(rev_dist_campaigns,
paste0(analyticsDir, "rev_dist_campaigns.Rds"))
rm(rev_dist_campaigns); gc()
# - edit classes
editClasses_campaigns <- dewiki_revisions %>%
select(user_id) %>%
group_by(user_id) %>%
summarise(revisions = n())
editBoundaries <- list(
c(0, 1),
c(2, 5),
c(6, 9),
c(10, 49)
)
editClasses_campaigns$editClass <- sapply(editClasses_campaigns$revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
editClasses_campaigns$editClass[editClasses_campaigns$editClass == "0 - 1"] <- "1"
editClasses_campaigns <- arrange(editClasses_campaigns, desc(revisions))
editClasses_dist_campaigns <- table(editClasses_campaigns$editClass)
editClasses_dist_campaigns <- as.data.frame(editClasses_dist_campaigns)
colnames(editClasses_dist_campaigns) <- c("Edit Class", "Users")
editClasses_dist_campaigns$`Edit Class` <- factor(editClasses_dist_campaigns$`Edit Class`,
levels = c('1',
'2 - 5',
'6 - 9',
'10 - 49',
'> 50'))
editClasses_dist_campaigns <- arrange(editClasses_dist_campaigns, `Edit Class`)
editClasses_dist_campaigns$`% Users` <- editClasses_dist_campaigns$Users/sum(editClasses_dist_campaigns$Users)*100
saveRDS(editClasses_dist_campaigns,
paste0(analyticsDir, "editClasses_dist_campaigns.Rds"))
# - cumulative edits in dewiki_revisions
# - running revision count per user, in order of revision time
setkey(dewiki_revisions, user_id, rev_time)
dewiki_revisions <- dewiki_revisions[order(user_id, rev_time)]
dewiki_revisions[, cum_revisions := seq_len(.N), by = user_id]
dewiki_revisions$reg_time <- as.Date(dewiki_revisions$reg_time)
dewiki_revisions$rev_time <- as.Date(dewiki_revisions$rev_time)
dewiki_revisions$account_age_rev_time_weeks <- difftime(dewiki_revisions$rev_time,
dewiki_revisions$reg_time,
units = "weeks")
dewiki_revisions$account_age_rev_time_years <-
dewiki_revisions$account_age_rev_time_weeks/52.1429
dewiki_revisions$account_age_rev_time_weeks <-
as.numeric(dewiki_revisions$account_age_rev_time_weeks)
dewiki_revisions$account_age_rev_time_years <-
as.numeric(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$editClass <- sapply(dewiki_revisions$cum_revisions, function(x) {
wEC <- sapply(editBoundaries, function(y) {
x >= y[1] & x <= y[2]
})
if (sum(wEC) == 0) {
return("> 50")
} else {
return(paste0(editBoundaries[[which(wEC)]][1],
" - ",
editBoundaries[[which(wEC)]][2]
)
)
}
})
dewiki_revisions <- dewiki_revisions[order(rev_time)]
# - intoduce account age in weeks and years classes
dewiki_revisions$rev_time_ym <- substr(dewiki_revisions$rev_time, 1, 7)
# - floor(): an account aged 0.6 years belongs to the class "0 - 1"
dewiki_revisions$account_age_rev_time_years_class <-
floor(dewiki_revisions$account_age_rev_time_years)
dewiki_revisions$account_age_rev_time_years_class <-
paste0(dewiki_revisions$account_age_rev_time_years_class,
" - ",
dewiki_revisions$account_age_rev_time_years_class + 1)
# - save the elaborated version of dewiki_revisions_campaigns
saveRDS(dewiki_revisions,
paste0(analyticsDir, "dewiki_revisions_campaigns_elaborated.Rds"))
# - produce dewiki_revisions_overview_campaigns for visualization
dewiki_revisions_overview_campaigns <- dewiki_revisions %>%
select(user_id, rev_time_ym, editClass, account_age_rev_time_years_class) %>%
group_by(rev_time_ym, editClass, account_age_rev_time_years_class) %>%
# - count distinct users, not revisions (rows)
summarise(n_users = n_distinct(user_id))
saveRDS(dewiki_revisions_overview_campaigns,
paste0(analyticsDir, "dewiki_revisions_overview_campaigns.Rds"))
# - users active (at least one edit) after
# - two weeks, one month, six months, and one year
two_weeks = 2
one_month = 4.34524
six_months = 26.0715
one_year = 52.1429
active_users_campaigns <- list()
active_users_campaigns$two_weeks <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > two_weeks]
)
)
active_users_campaigns$one_month <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_month]
)
)
active_users_campaigns$six_months <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > six_months]
)
)
active_users_campaigns$one_year <-
length(
unique(
dewiki_revisions$user_id[dewiki_revisions$account_age_rev_time_weeks > one_year]
)
)
active_users_campaigns$two_weeks_p_total_registered_users <-
active_users_campaigns$two_weeks/stats_campaigns$total_registered_users
active_users_campaigns$one_month_p_total_registered_users <-
active_users_campaigns$one_month/stats_campaigns$total_registered_users
active_users_campaigns$six_months_p_total_registered_users <-
active_users_campaigns$six_months/stats_campaigns$total_registered_users
active_users_campaigns$one_year_p_total_registered_users <-
active_users_campaigns$one_year/stats_campaigns$total_registered_users
active_users_campaigns$two_weeks_p_total_users_who_edited <-
active_users_campaigns$two_weeks/stats_campaigns$total_users_who_edited
active_users_campaigns$one_month_p_total_users_who_edited <-
active_users_campaigns$one_month/stats_campaigns$total_users_who_edited
active_users_campaigns$six_months_p_total_users_who_edited <-
active_users_campaigns$six_months/stats_campaigns$total_users_who_edited
active_users_campaigns$one_year_p_total_users_who_edited <-
active_users_campaigns$one_year/stats_campaigns$total_users_who_edited
saveRDS(active_users_campaigns,
paste0(analyticsDir, "active_users_campaigns.Rds"))
### --- Final Reporting
paste0("Processing took: ", difftime(Sys.time(), t1, units = "mins"), " minutes.")
rm(list = ls()); gc()
All statistics and visualizations reported in the following sections refer to the questions formulated in the reference Phabricator ticket.
Note. All revisions made by campaign registered users were removed from the revision datasets used in the sections on organic registrations and revisions since 2017. For the non-registering WMDE Banner Campaigns (e.g. the WMDE Occasional Editors 2020 campaign, which addressed only existing, registered users), we removed from the organic datasets all edits these users made after their exposure to the respective campaign. Likewise, we removed from the campaign datasets all edits they made before their exposure to the respective campaign.
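The exposure rule for the non-registering campaigns can be sketched as follows. This is a minimal illustration in Python (the language of the ETL script), not the code used for this report: the record layout, the example dates, and the function name `split_by_exposure` are hypothetical, and assigning edits made on the exposure day itself to the campaign dataset is an assumption of this sketch.

```python
from datetime import date


def split_by_exposure(revisions, exposure_date):
    """Split one user's revisions at the campaign exposure date.

    Revisions made before exposure stay in the organic dataset;
    revisions made on or after exposure go to the campaign dataset
    (the handling of the exposure day itself is an assumption).
    """
    organic = [r for r in revisions if r["rev_time"] < exposure_date]
    campaign = [r for r in revisions if r["rev_time"] >= exposure_date]
    return organic, campaign


# Hypothetical revisions of one already registered user
revisions = [
    {"user_id": 1, "rev_time": date(2019, 11, 3)},
    {"user_id": 1, "rev_time": date(2020, 2, 14)},
    {"user_id": 1, "rev_time": date(2020, 3, 1)},
]
organic, campaign = split_by_exposure(revisions, exposure_date=date(2020, 2, 1))
print(len(organic), len(campaign))
```

Each revision lands in exactly one of the two datasets, so no edit is counted both as organic and as campaign related.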
Q. What is the age of the German Wikipedia Community in terms of account age?
stats_2017 <- readRDS(paste0(analyticsDir, 'dewiki_stats_2017.Rds'))
The following statistics all refer to dewiki user registrations since 2017:
Q. How many of them edited (since registration until 30th June 2020):
1 edit
2 to 5 edits
6 to 9 edits
10 to 49 edits
50 or more edits
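As an illustration of how users are assigned to these edit classes, here is a minimal Python sketch. It follows the class labels used in the code of this report ('1', '2 - 5', '6 - 9', '10 - 49', '> 50'); the boundaries are assumed to be inclusive, and the function names and the example data are hypothetical.

```python
from collections import Counter

# Assumed inclusive class boundaries, mirroring the labels used in this report
EDIT_CLASSES = [("1", 1, 1), ("2 - 5", 2, 5), ("6 - 9", 6, 9), ("10 - 49", 10, 49)]


def edit_class(n_edits):
    """Return the edit class label for a user with n_edits edits (n_edits >= 1)."""
    if n_edits < 1:
        raise ValueError("only users with at least one edit are classified")
    for label, low, high in EDIT_CLASSES:
        if low <= n_edits <= high:
            return label
    return "> 50"  # label used in this report for 50 or more edits


def edit_class_distribution(edits_per_user):
    """Map each class label to (number of users, % of users who edited)."""
    counts = Counter(edit_class(n) for n in edits_per_user.values())
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


# Hypothetical total edit counts per user
edits_per_user = {"a": 1, "b": 3, "c": 7, "d": 12, "e": 50, "f": 1}
dist = edit_class_distribution(edits_per_user)
print(dist)
```

The percentages are taken over users who edited at least once, which corresponds to the `% Users` column in the table below.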
editClasses_dist_2017 <- readRDS(paste0(analyticsDir, 'editClasses_dist_2017.Rds'))
editClasses_dist_2017$`% Users` <- round(editClasses_dist_2017$`% Users`, 2)
datatable(editClasses_dist_2017)
Q. Retention rate: How many newly registered users are active (active = at least 1 edit) after two weeks, one month, six months, and one year?
How high is the retention rate of these active users compared to the number of registrations?
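The retention counts can be sketched in Python as follows. The week thresholds are the constants used in the R code of this report; the data layout, the function name `retention`, and the example users are hypothetical, and only the share of registered users is computed here.

```python
from datetime import date

# Account age thresholds in weeks (same constants as in the R code)
WEEK_THRESHOLDS = {
    "2 weeks": 2,
    "1 month": 4.34524,
    "6 months": 26.0715,
    "1 year": 52.1429,
}


def retention(registrations, revisions):
    """Count users with at least one edit made after each account age threshold.

    registrations: {user_id: registration date}
    revisions: list of (user_id, revision date)
    Returns {label: (number of users, % of registered users)}.
    """
    result = {}
    for label, weeks in WEEK_THRESHOLDS.items():
        active = {
            user_id
            for user_id, rev_date in revisions
            if (rev_date - registrations[user_id]).days / 7 > weeks
        }
        share = round(100 * len(active) / len(registrations), 2)
        result[label] = (len(active), share)
    return result


# Hypothetical data: four registered users, three of whom edited
registrations = {u: date(2018, 1, 1) for u in (1, 2, 3, 4)}
revisions = [
    (1, date(2018, 1, 2)),
    (2, date(2018, 1, 20)),
    (3, date(2018, 8, 1)),
    (3, date(2019, 2, 1)),
]
stats = retention(registrations, revisions)
print(stats)
```

A user counts as active after a threshold as soon as one of their edits was made at an account age above it, so a single late edit is sufficient.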
active_users_2017 <- readRDS(paste0(analyticsDir, 'active_users_2017.Rds'))
active_users_2017 <- data.frame(
`Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
Users = as.numeric(active_users_2017[1:4]),
# - values are assumed to be stored as proportions (as computed for the
# - campaign users): multiply by 100 to report them as percentages
`As % of registered users` = round(100 * as.numeric(active_users_2017[5:8]), 2),
`As % of users who ever edited` = round(100 * as.numeric(active_users_2017[9:12]), 2),
stringsAsFactors = FALSE,
check.names = FALSE)
datatable(active_users_2017)
Q. Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017.
Note. In the following chart, the Account Age variable refers to the user account age at the moment when a respective revision was made by that user. Panels refer to different edit classes.
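To make the Account Age variable concrete, the following Python sketch derives the account age at revision time and its one-year class. The weeks-per-year constant is the one used in the R code of this report; forming the class from the number of completed years is an assumption of this sketch, and the function names and dates are hypothetical.

```python
import math
from datetime import date

WEEKS_PER_YEAR = 52.1429  # same constant as in the R code of this report


def account_age_years(reg_date, rev_date):
    """Account age in years at the moment a revision was made."""
    weeks = (rev_date - reg_date).days / 7
    return weeks / WEEKS_PER_YEAR


def account_age_class(reg_date, rev_date):
    """One-year class label, e.g. '0 - 1' for an account younger than one year."""
    lower = math.floor(account_age_years(reg_date, rev_date))
    return f"{lower} - {lower + 1}"


# Hypothetical user registered on 1st March 2017
first = account_age_class(date(2017, 3, 1), date(2017, 11, 15))
later = account_age_class(date(2017, 3, 1), date(2019, 6, 30))
print(first, later)
```

The same user therefore moves through the account age classes over time, which is why the chart groups revisions, not accounts, by account age.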
dewiki_revisions_overview_2017 <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_2017.Rds'))
dewiki_revisions_overview_2017 <- ungroup(dewiki_revisions_overview_2017)
colnames(dewiki_revisions_overview_2017) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_2017 <- filter(dewiki_revisions_overview_2017,
!(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_2017$`Edit Class`[dewiki_revisions_overview_2017$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_2017$`Edit Class` <- factor(dewiki_revisions_overview_2017$`Edit Class`,
levels = c('1',
'2 - 5',
'6 - 9',
'10 - 49',
'> 50'))
ggplot(dewiki_revisions_overview_2017,
aes(x = `Revision Year-Month`,
y = Users,
group = `Account Age`,
color = `Account Age`)) +
geom_line() + geom_point(size = 1) +
scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) +
facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))
Note. Some users found in the WMDE campaign user registration datasets could not be matched with the user IDs in the wmf.mediawiki_history table. The following table presents an overview of how many campaign registered users are missing from the wmf.mediawiki_history table (see: Wikitech wmf.mediawiki_history documentation). All WMDE campaign registered user IDs were checked for uniqueness and confirmed to be unique. The total number of WMDE campaign registered users that were not matched to the user ID fields (event_user_id for revisions, and user_id for registrations) in the wmf.mediawiki_history table is 49.
campaignIDs <- read.csv(paste0(dataDir, "_campaignIDs/WMDE_Campaign_Registered_Users_IDs_Elaborated.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
not_found <- as.data.frame(
table(campaignIDs$campaign[campaignIDs$not_found_in_dewiki_regusers == 1])
)
colnames(not_found) <- c('Campaign Code', 'Num.Users')
datatable(not_found)
Given that there are 4163 WMDE campaign registered users in total, this means that we are not able to analyze 1.18% of them. To keep the analysis consistent with the numbers previously reported on user registrations and revisions since 2017, all user registration data are derived from the campaign registered users that were matched with the user IDs in the wmf.mediawiki_history table.
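The share of unmatched users follows directly from the two totals reported above. A short Python check of the arithmetic (the variable names are illustrative only):

```python
total_campaign_users = 4163  # WMDE campaign registered users in total
unmatched_users = 49         # not found in wmf.mediawiki_history

# Percentage of campaign registered users that cannot be analyzed
pct_unmatched = round(100 * unmatched_users / total_campaign_users, 2)

# Users remaining in the campaign analyses
matched_users = total_campaign_users - unmatched_users

print(pct_unmatched, matched_users)
```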
Q. What is the age of the German Wikipedia Community in terms of account age for campaign registered users?
dewiki_stats_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_stats_campaigns.Rds'))
The following statistics all refer to dewiki campaign user registrations since 2017:
Q. How many of campaign registered users edited (since registration until 30th June 2020):
1 edit
2 to 5 edits
6 to 9 edits
10 to 49 edits
50 or more edits
editClasses_dist_campaigns <- readRDS(paste0(analyticsDir, 'editClasses_dist_campaigns.Rds'))
editClasses_dist_campaigns$`% Users` <- round(editClasses_dist_campaigns$`% Users`, 2)
datatable(editClasses_dist_campaigns)
Q. Retention rate: How many campaign registered users are active (active = at least 1 edit) after two weeks, one month, six months, and one year?
How high is the retention rate of these active users compared to the number of registrations?
active_users_campaigns <- readRDS(paste0(analyticsDir, 'active_users_campaigns.Rds'))
active_users_campaigns <- data.frame(
`Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
Users = as.numeric(active_users_campaigns[1:4]),
# - stored values are proportions: multiply by 100 to report percentages
`As % of registered users` = round(100 * as.numeric(active_users_campaigns[5:8]), 2),
`As % of users who ever edited` = round(100 * as.numeric(active_users_campaigns[9:12]), 2),
stringsAsFactors = FALSE,
check.names = FALSE)
datatable(active_users_campaigns)
Q. Edit Classes (facets) x Account Age Classes (group, step: one year) x Time (horizontal) → do we observe always one and the same group of active editors, or do the newcomers join in to stay active editors? - start: 2017 (for campaign registered users):
Note. In the following chart, the Account Age variable refers to the user account age at the moment when a respective revision was made by that user. Panels refer to different edit classes.
dewiki_revisions_overview_campaigns <- readRDS(paste0(analyticsDir, 'dewiki_revisions_overview_campaigns.Rds'))
dewiki_revisions_overview_campaigns <- ungroup(dewiki_revisions_overview_campaigns)
colnames(dewiki_revisions_overview_campaigns) <- c('Revision Year-Month', 'Edit Class', 'Account Age', 'Users')
dewiki_revisions_overview_campaigns <- filter(dewiki_revisions_overview_campaigns,
!(`Revision Year-Month` == "2020-07"))
dewiki_revisions_overview_campaigns$`Edit Class`[dewiki_revisions_overview_campaigns$`Edit Class` == "0 - 1"] <- "1"
dewiki_revisions_overview_campaigns$`Edit Class` <- factor(dewiki_revisions_overview_campaigns$`Edit Class`,
levels = c('1',
'2 - 5',
'6 - 9',
'10 - 49',
'> 50'))
ggplot(dewiki_revisions_overview_campaigns,
aes(x = `Revision Year-Month`,
y = Users,
group = `Account Age`,
color = `Account Age`)) +
geom_line() + geom_point(size = 1) +
scale_color_manual(values=c("#308FF3", "#70BA0A", "#FABC0A", "#ED2809")) +
facet_wrap(~`Edit Class`, nrow = 5, ncol = 1, scales="free_y") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 0.95, vjust = 0.2))
I have just one major remark on the report: I also need registrations and revisions per campaign to be able to compare the campaigns. Are you still on it or did you miss it out? (the edit class and account age comparison is not necessary for the campaign split).
Reference: Phab: T256433#6328618
Note. The recent editors column reports the number of campaign registered users who made at least one edit as of 30th June 2020; reference: Phab: T256433#6385973
campaignRegs <- read.csv(paste0(analyticsDir, "campaignRegistrationsSummary.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
campaignRevs <- read.csv(paste0(analyticsDir, "campaignRevisionsSummary.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
campaignsOverview <- left_join(campaignRevs,
campaignRegs,
by = "campaign")
campaignsOverview$rev_per_reg <- round(campaignsOverview$revisions/campaignsOverview$registered, 2)
recentEditors <- read.csv(paste0(analyticsDir, "recentCampaignEditors.csv"),
header = T,
check.names = F,
row.names = 1,
stringsAsFactors = F)
colnames(recentEditors)[2] <- "recent editors"
campaignsOverview <- left_join(campaignsOverview,
recentEditors,
by = "campaign")
datatable(campaignsOverview)
I wondered if it were much additional work to compute not only # of revisions and registrations (in section 1.3) , but also retention/ retention rates (like you did in section 1.2.2) per campaign? I guess that’s what @Verena was initially asking for.
Reference: Phab: T256433#6337701
active_users_per_campaign <- readRDS(paste0(analyticsDir, "active_users_per_campaign.Rds"))
active_users_per_campaign <- active_users_per_campaign[, c('campaign',
'two_weeks',
'one_month',
'six_months',
'one_year')]
colnames(active_users_per_campaign) <- c('campaign',
'2 weeks',
'1 month',
'6 months',
'1 year')
datatable(active_users_per_campaign)
Of the people who started training modules (onboarding content used in 2018 in the thank you, spring, and summer campaigns), what is the rate of still active users?
active_users_campaigns_training_2018 <-
readRDS(paste0(analyticsDir, "active_users_campaigns_training.Rds"))
active_users_campaigns_training_2018 <- data.frame(
`Retention Class` = c('2 weeks', '1 month', '6 months', '1 year'),
Users = as.numeric(active_users_campaigns_training_2018[1:4]),
stringsAsFactors = F,
check.names = F)
datatable(active_users_campaigns_training_2018)
Request. “Organic Growth/ Age of Community: For the years 2001 to 2019 we need for every year the average age of all accounts who did at least one edit in this year. age = number of years since registration ( I am aware that for a few accounts the registration date can’t be retrieved from the database. Because this should be a relatively small number of accounts we can neglect that here.)” reference Phab ticket
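One possible reading of this request, in which every account counts once per year regardless of how many edits it made, can be sketched in Python as follows. This is not the computation used for the table below. Measuring the age at the end of the respective year, the 365.25-day year, the function name `avg_account_age_by_year`, and the example data are all assumptions of this sketch.

```python
from collections import defaultdict
from datetime import date


def avg_account_age_by_year(revisions):
    """Average account age (in years) of accounts with at least one edit per year.

    revisions: iterable of (user_id, registration date or None, revision date).
    Each account is counted once per year; its age is measured on 31st December
    of that year (an assumption). Accounts without a registration date are skipped.
    """
    accounts = defaultdict(dict)  # year -> {user_id: registration date}
    for user_id, reg_date, rev_date in revisions:
        if reg_date is None:
            continue
        accounts[rev_date.year][user_id] = reg_date

    result = {}
    for year, users in sorted(accounts.items()):
        ages = [(date(year, 12, 31) - reg).days / 365.25 for reg in users.values()]
        result[year] = round(sum(ages) / len(ages), 2)
    return result


# Hypothetical revisions: user 1 edited twice, user 3 has no registration date
revisions = [
    (1, date(2017, 1, 1), date(2018, 5, 1)),
    (1, date(2017, 1, 1), date(2018, 6, 1)),
    (2, date(2018, 7, 1), date(2018, 8, 1)),
    (3, None, date(2018, 9, 1)),
]
avg_age = avg_account_age_by_year(revisions)
print(avg_age)
```

Counting each account once per year keeps very active editors from dominating the yearly average.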
dewiki_revisions_elaborated <- readRDS("~/WMDE/NewEditors/CampaignsReview2020/_analytics/dewiki_revisions_elaborated.Rds")
dewiki_revisions_elaborated <- dplyr::select(dewiki_revisions_elaborated,
reg_time,
rev_time)
dewiki_revisions_elaborated$rev_year <- substr(dewiki_revisions_elaborated$rev_time, 1, 4)
dewiki_revisions_elaborated <- dplyr::filter(dewiki_revisions_elaborated,
rev_year != "2020")
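# - account age at revision time, in weeks (plain numeric, not difftime)
dewiki_revisions_elaborated$account_age <- as.numeric(
difftime(dewiki_revisions_elaborated$rev_time,
dewiki_revisions_elaborated$reg_time,
units = "weeks"))
# - convert weeks to years
one_year <- 52.1429
dewiki_revisions_elaborated$account_age <- dewiki_revisions_elaborated$account_age/one_year
# - NOTE: the mean is taken over revisions, so each account is weighted
# - by the number of revisions it made in the respective year
organicGrowth <- dewiki_revisions_elaborated %>%
dplyr::select(rev_year, account_age) %>%
dplyr::group_by(rev_year) %>%
dplyr::summarise(avg_account_age = mean(account_age, na.rm = TRUE))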
datatable(organicGrowth)