Feedback should be sent to goran.milovanovic_ext@wikimedia.de.

knitr::opts_chunk$set(fig.width = 14, fig.height = 10) 

0. Task Description

The task is described in the respective Phabricator ticket by Jan Dittrich:

User Story :: As a user researcher, I want to know how our user base is composed so I can evaluate if survey and interview data has a bias compared to it and to see basic patterns that I might want to further explore.

Information needs:

I imagine the following data to be useful (all can be aggregates): I imagine this should not be a graphana board but a report (ideally: RMarkdown or Jupyter Notebook)

  • How is our user base composed in terms of edit count? (e.g. shown by an histogram, y= count, x=bins of edit count ranges)
  • How were our user base composed in terms of recent edits? How many account did, in the last month (or so) edited n times (e.g. shown by an histogram, y= count, x=bins of edit count ranges for a month)
  • How many of those users do participate in discussions? (mosaic plot, y=bins of edit count, x= bins of discussion participation)
  • How many new users do we have each month? – And of them, how many do edit how much? – And of them, how many do participate in discussions?

Or in general, how is the relation between:

  • Edit count and: Discussion participation (possibly split by item and property discussions vs. other project discussions), adding values, correcting values, adding references, adding qualifiers

1. Data Acquisition

The relevant data were obtained from the wmf.mediawiki_history table, which stores the denormalized edit history of WMF’s wikis. The following HiveQL queries were run within the R environment from the stat1005 statbox server on October 15, 2018:

1.1 Total revisions: users vs. namespaces

The following HiveQL query, run by a system() call from within R, collects the counts of all event_user_id (user IDs) and page_namespace (page namespace of the page where a particular revision took place) combinations across all users and revisions on wikidatawiki. Bots and anonymous users are filtered out, as well as page redirects and revisions that were subsequently deleted. The resulting table presents the number of revisions made by any user on any page, grouped by page namespaces.

filename <- "WD_revisions.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT COUNT(*), event_user_id, page_namespace FROM mediawiki_history 
            WHERE (
              event_entity = \'revision\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              event_user_id != 0 AND 
              page_is_redirect = FALSE AND  
              revision_is_deleted = FALSE AND 
              snapshot = \'2018-09\'
            ) 
            GROUP BY event_user_id, page_namespace;"'
out <- paste0("> /home/goransm/RScripts/Wikidata/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)

1.2 Total revisions: users vs. namespaces in September 2018.

The following HiveQL query is the same as the previous one, except that it keeps only the revisions made in September 2018 (all older revisions are filtered out). A snapshot of the wmf.mediawiki_history table takes some time to compute, so the latest snapshot available was 2018-09. In other words, the October 2018 dataset was not yet ready when we ran these analyses.

filename <- "WD_revisions_September.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT COUNT(*), event_user_id, page_namespace FROM mediawiki_history 
            WHERE (
              event_entity = \'revision\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              event_user_id != 0 AND 
              page_is_redirect = FALSE AND  
              revision_is_deleted = FALSE AND 
              snapshot = \'2018-09\' AND 
              event_timestamp > \'2018-09-01 00:00:00.0\'
            ) 
            GROUP BY event_user_id, page_namespace;"'
out <- paste0("> /home/goransm/RScripts/Wikidata/WD_edits/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)

1.3 User registrations

The following HiveQL query collects all user registrations on wikidatawiki. The documentation of the wmf.mediawiki_history table is not clear on one point: what exactly do the event_user_ fields mean when event_entity = user? In other words, it is uncertain whether bots and anonymous users should be filtered out by referring to the event_user_is_ fields or to the user_is_ fields. To prevent inconsistencies in the data, both types of fields were used to filter out bots and anonymous users. In spite of this precaution, the later results suggest that at least some bots have most likely “survived” these filtering procedures; they were filtered out by statistical methods, as will be demonstrated. The resulting table contains the user IDs and the timestamps of their registration with wikidatawiki.

Note. Although the Wikidata project started on October 30, 2012, many registration timestamps are as old as 2004. The only hypothesis offered here is that the older registration timestamps are a consequence of the unified logins of users who were already registered at other wikis. As we will see, these user accounts were filtered out from some of the analyses.

filename <- "WD_users.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT user_id, user_creation_timestamp FROM mediawiki_history 
            WHERE (
              event_entity = \'user\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              user_is_bot_by_name = FALSE AND 
              user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              NOT ARRAY_CONTAINS(user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(user_groups_historical, \'bot\') AND 
              snapshot = \'2018-09\'
            );"'
out <- paste0("> /home/goransm/RScripts/Wikidata/WD_edits/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)

2. Data Wrangling

The following procedures load the datasets locally, impose consistent column names to obtain the desired matches, introduce the namespace names (e.g. “Main (Items)”, “Main (Items talk)”, and similar) in addition to the namespace codes, and parse the user registration timestamps into dates. Finally, the revision data are additionally filtered to keep only the user accounts that are found in the user registration data.

# - locally: revisions vs namespaces
library(data.table)
setwd('/home/goransm/Work/___DataKolektiv/Projects/WikimediaDEU/_WMDE_Projects/Wikidata/WD_edits/')
wdRevisions <- fread('WD_revisions.tsv', sep = "\t")
colnames(wdRevisions) <- c('revisions', 'user', 'wd_namespace')
# - locally: revisions vs namespaces, September only
wdRevisionsSep <- fread('WD_revisions_September.tsv', sep = "\t")
colnames(wdRevisionsSep) <- c('revisions', 'user', 'wd_namespace')
# - locally: user registrations
wdUsers <- fread('WD_users.tsv', sep = "\t")

# - fread() reported: 2998702 rows and 2 (of 2) columns read from a 0.083 GB file
colnames(wdUsers) <- c('user', 'creation_timestamp')
### --- Wrangling
library(dplyr)

# - WD namespace coding table; 
# - add namespace name to wdRevisions, wdRevisionsSep 
# - https://www.wikidata.org/wiki/Help:Namespaces
namespaces <- read.csv('namespace_coding_scheme.csv', 
                       header = T, 
                       check.names = F, 
                       stringsAsFactors = F)
wdRevisions <- left_join(wdRevisions, namespaces,
                         by = c("wd_namespace" = "namespaceCode"))
wdRevisionsSep <- left_join(wdRevisionsSep, namespaces,
                            by = c("wd_namespace" = "namespaceCode"))
# - creation_timestamp in wdUsers to Posix
library(lubridate)

# - keep the date part only (drop the time of day and fractional seconds)
wdUsers$creation_timestamp <- ymd(sub("[ .].*$", "", wdUsers$creation_timestamp))
# - year-month of registration, e.g. "2018-09"
wdUsers$YM <- format(wdUsers$creation_timestamp, "%Y-%m")

3. Data Analyses

### --- Analytics
library(ggplot2)
library(ggrepel)
library(stringr)

Q1. How is our user base composed in terms of edit count? (e.g. shown by an histogram, y= count, x=bins of edit count ranges)

First we group all user revisions and user IDs irrespective of the page namespace and sum up all revisions per user:

plotFrame <- wdRevisions %>% 
  select(user, revisions) %>% 
  group_by(user) %>% 
  summarise(revisions = sum(revisions)) %>% 
  arrange(desc(revisions))

Exploratory analysis indicated that outlier removal is necessary in this case. Namely, accounts with millions of edits were discovered, indicating (as already noted) the possibility that some bots were not successfully removed by filtering on the respective database flags. It is also possible that we are observing the outcomes of automated imports to Wikidata that were not flagged as bot edits here.

# - APPROACH: for a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR beyond the quartiles, where IQR, the 
# - ‘Inter Quartile Range’, is the difference between the 75th and 25th percentiles.
# - Logical indexing is used so that no rows are dropped if no outliers are found.
plotFrame <- plotFrame[
  !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out), ]
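
The rule can be illustrated on toy data. The following sketch is not part of the original analysis, and the vectors `revs` and `clean` are hypothetical. It shows which values `boxplot.stats()` flags, and why logical indexing is the safer way to drop them: negative indexing with an empty index selects nothing at all.

```r
# Toy revision counts (hypothetical values, for illustration only)
revs <- c(1, 1, 2, 2, 3, 3, 4, 5, 6, 1000000)

# Values flagged by the boxplot rule: more than 1.5 * IQR beyond the hinges
flagged <- boxplot.stats(revs)$out

# Logical indexing removes the flagged values and keeps the rest
kept <- revs[!(revs %in% flagged)]

# Pitfall: with no outliers, which() returns integer(0),
# and x[-integer(0)] selects nothing instead of everything
clean <- c(1, 2, 3, 4, 5)
noOuts <- which(clean %in% boxplot.stats(clean)$out)
length(clean[-noOuts])                                 # all elements are lost
length(clean[!(clean %in% boxplot.stats(clean)$out)])  # all elements are kept
```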

Q1.1 Checking for power-law behavior: log(Number of users) vs. log (Number of revisions) plot

We do this not to test for power laws in any rigorous sense, but because the skewness of the distribution prevents us from visualizing these data informatively on a linear scale.

# - Power-Law
userRevisions <- as.data.frame(table(plotFrame$revisions), 
                               stringsAsFactors = F)
colnames(userRevisions) <- c('Revisions', 'Num.Users')
userRevisions$Revisions <- as.numeric(userRevisions$Revisions)
userRevisions <- arrange(userRevisions, desc(Revisions))
# - log-log
ggplot(data = userRevisions, 
       aes(x = log10(Revisions), 
           y = log10(`Num.Users`),
           label = Revisions)) + 
  geom_point(size = .1) + geom_smooth(method = lm, size = .2) + 
  ggtitle("Revisions per user") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("log(Num.Users)")
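
To make the construction of this plot concrete, here is a minimal sketch on simulated data (not the Wikidata data). It tabulates how many users share each revision count and fits a straight line in log10-log10 coordinates, which is what `geom_smooth(method = lm)` draws. The simulated counts and the names `simRevs`, `freq` and `fit` are assumptions made for illustration.

```r
set.seed(42)
# Simulated, strongly right-skewed revision counts (hypothetical, not real data)
simRevs <- ceiling(rexp(5000, rate = 0.05) ^ 1.5)

# Frequency table: how many users made exactly k revisions
freq <- as.data.frame(table(simRevs), stringsAsFactors = FALSE)
colnames(freq) <- c("Revisions", "Num.Users")
freq$Revisions <- as.numeric(freq$Revisions)

# Straight-line fit in log-log coordinates
fit <- lm(log10(Num.Users) ~ log10(Revisions), data = freq)
coef(fit)
```

A roughly linear pattern in these coordinates is compatible with many skewed distributions, so the fitted line is a visual aid rather than evidence of a power law.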

Q1.2 Checking for power-law behavior: Number of users vs. Number of revisions plot

In absolute values:

# - absolute values
ggplot(data = userRevisions, 
       aes(x = Revisions, 
           y = `Num.Users`,
           label = Revisions)) + 
  geom_point(size = .5) + geom_path(size = .2) + 
  ggtitle("Revisions per user") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("Num.Users")

Q1.3 Distribution of revisions per user: a Histogram

# - histogram (linear scale)
ggplot(data = plotFrame, 
       aes(revisions)) + 
  geom_histogram(bins = 50, fill = "deepskyblue", color = "white") + 
  ggtitle("Distribution of revisions per user") + 
  ylab("Num.Users") + xlab("Revisions") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))

Q2. How were our user base composed in terms of recent edits? How many account did, in the last month (or so) edited n times (e.g. shown by an histogram, y= count, x=bins of edit count ranges for a month)?

The procedures are the same as in Q1, except that we now focus on the September 2018 data only.

plotFrame <- wdRevisionsSep %>% 
  select(user, revisions) %>% 
  group_by(user) %>% 
  summarise(revisions = sum(revisions)) %>% 
  arrange(desc(revisions))
# - Outlier detection and removal
# - APPROACH: for a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR beyond the quartiles, where IQR, the 
# - ‘Inter Quartile Range’, is the difference between the 75th and 25th percentiles.
# - Logical indexing is used so that no rows are dropped if no outliers are found.
plotFrame <- plotFrame[
  !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out), ]

Q2.1 Checking for power-law behavior: log(Number of users) vs. log (Number of revisions) plot

userRevisions <- as.data.frame(table(plotFrame$revisions), 
                               stringsAsFactors = F)
colnames(userRevisions) <- c('Revisions', 'Num.Users')
userRevisions$Revisions <- as.numeric(userRevisions$Revisions)
userRevisions <- arrange(userRevisions, desc(Revisions))
ggplot(data = userRevisions, 
       aes(x = log10(Revisions), 
           y = log10(`Num.Users`),
           label = Revisions)) + 
  geom_point(size = .1) + geom_smooth(method = lm, size = .2) + 
  ggtitle("Revisions per user: September 2018") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("log(Num.Users)")

Q2.2 Checking for power-law behavior: Number of users vs. Number of revisions plot

ggplot(data = userRevisions, 
       aes(x = Revisions, 
           y = `Num.Users`,
           label = Revisions)) + 
  geom_point(size = .5) + geom_path(size = .2) + 
  ggtitle("Revisions per user: September 2018") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("Num.Users")

Q2.3 Distribution of revisions per user in September 2018: a Histogram

ggplot(data = plotFrame, 
       aes(revisions)) + 
  geom_histogram(bins = 50, fill = "deepskyblue", color = "white") + 
  ggtitle("Distribution of revisions per user: September 2018") + 
  ylab("Num.Users") + xlab("Revisions") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))

Q3. How many of those users do participate in discussions? (mosaic plot, y = bins of edit count, x = bins of discussion participation)

When I started working on this analysis, I was not entirely sure that I had correctly understood the intention behind the mosaic plot. However, the following approach should be illustrative with respect to Q3.

In the first step, we separate all talk (i.e. discussion) namespaces from the namespaces where “main” revisions take place.

wdRevisionsSep$talk <- ifelse(grepl("talk", wdRevisionsSep$namespace, fixed = T), "discussion", "edit")
plotFrame <- wdRevisionsSep %>% 
  select(user, revisions, talk) %>% 
  group_by(user, talk) %>% 
  summarise(revisions = sum(revisions))

Removing outliers, as in previous analyses:

# - APPROACH: for a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR beyond the quartiles, where IQR, the 
# - ‘Inter Quartile Range’, is the difference between the 75th and 25th percentiles.
# - Logical indexing is used so that no rows are dropped if no outliers are found.
plotFrame <- plotFrame[
  !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out), ]
plotFrame <- arrange(plotFrame, user)

Now we cross-tabulate the number of revisions made on discussion and main namespaces for all users, and visualize the R table object to obtain a mosaic plot:

# - number of users that have both edited and discussed
numEditTalkUsers <- sum(duplicated(plotFrame$user))
wEditTalkUsers <- which(duplicated(plotFrame$user))
wEditTalkUsers <- plotFrame$user[wEditTalkUsers]
plotFrame <- plotFrame[plotFrame$user %in% wEditTalkUsers, ]
pFrame <- data.frame(edits = plotFrame$revisions[plotFrame$talk == "edit"],
                     discussions = plotFrame$revisions[plotFrame$talk == "discussion"], 
                     stringsAsFactors = F)
pFrame <- arrange(pFrame, edits)
# A "mosaic" plot
plot(table(pFrame$edits, pFrame$discussions),
     xlab = "Edits", ylab = "Discussions", 
     main = "Edits vs. Discussions, Sep 2018", 
     color = "darkorange")
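
The pairing of edit and discussion counts above relies on both vectors being sorted by user. A hedged alternative, sketched here in base R on toy data with hypothetical user IDs, pairs the two counts explicitly by user before cross-tabulating. It is an illustration only, not the procedure used for the figures in this report.

```r
# Toy long-format data: one row per user and activity (hypothetical values)
toy <- data.frame(
  user      = c(1, 1, 2, 3, 3, 4),
  talk      = c("edit", "discussion", "edit", "discussion", "edit", "edit"),
  revisions = c(10, 2, 5, 1, 7, 3),
  stringsAsFactors = FALSE
)

# Wide format: one row per user, one column per activity
m <- tapply(toy$revisions, list(toy$user, toy$talk), sum)
m[is.na(m)] <- 0   # a user with no rows for an activity has zero revisions
wide <- as.data.frame(m)

# Keep only the users who both edited and discussed
both <- wide[wide$edit > 0 & wide$discussion > 0, ]

# Cross-tabulate edits against discussions, as for the mosaic plot
table(both$edit, both$discussion)
```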

Another, more conservative and potentially useful approach to visualizing the same cross-tabulated data:

ggplot(data = pFrame, 
       aes(x = edits, y = discussions, label = discussions)) +
  geom_point(size = 1.5, color = "blue") +
  geom_point(size = 1, color = "white") +
  ggtitle("Edits vs. Discussions: September 2018") +
  ylab("Discussions") + xlab("Edits") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))

Q3.1 Alternative approach

This is the number of users following the binning of the distributions of the number of edits (i.e. revisions made in any of the main namespaces) and the number of discussions (i.e. revisions made on talk pages):

pFrame$editIntervals <- cut(pFrame$edits, 
                            breaks = 4)
pFrame$editIntervals <- sapply(pFrame$editIntervals, function(x) {
  edges <- strsplit(as.character(x), split = ",")[[1]]
  edges[1] <- floor(as.numeric(str_extract(edges[1], "([[:digit:]]|\\.)+"))) + 1
  edges[2] <- floor(as.numeric(str_extract(edges[2], "([[:digit:]]|\\.)+")))
  return(paste0("(", edges[1]," - ", edges[2], ")"))
})
pFrame$discussionIntervals <- cut(pFrame$discussions,
                                  breaks = 4)
pFrame$discussionIntervals <- sapply(pFrame$discussionIntervals, function(x) {
  edges <- strsplit(as.character(x), split = ",")[[1]]
  edges[1] <- floor(as.numeric(str_extract(edges[1], "([[:digit:]]|\\.)+"))) + 1
  edges[2] <- floor(as.numeric(str_extract(edges[2], "([[:digit:]]|\\.)+")))
  return(paste0("(", edges[1]," - ", edges[2], ")"))
})
editIntervalsLevels <- sapply(unique(pFrame$editIntervals), function(x) {
  lower <- str_extract(x, '[[:digit:]]+')
})
editIntervalsLevels <- names(editIntervalsLevels)[order(as.numeric(editIntervalsLevels))]
discussionIntervalsLevels <- sapply(unique(pFrame$discussionIntervals), function(x) {
  lower <- str_extract(x, '[[:digit:]]+')
})
discussionIntervalsLevels <- names(discussionIntervalsLevels)[order(as.numeric(discussionIntervalsLevels))]
pFrame$editIntervals <- factor(pFrame$editIntervals, 
                               levels = editIntervalsLevels)
pFrame$discussionIntervals <- factor(pFrame$discussionIntervals, 
                               levels = discussionIntervalsLevels)
pFrame %>% 
  dplyr::select(editIntervals, discussionIntervals) %>% 
  dplyr::group_by(editIntervals, discussionIntervals) %>% 
  summarise(Count = n()) %>% 
  ggplot(aes(x = editIntervals,
             y = discussionIntervals,
             label = Count)) +
  geom_point(aes(size = Count), color = "cadetblue4", shape = 19) +
  geom_text(size = 3, nudge_x = .15) + 
  xlab("Edits") + ylab("Discussions") + 
  theme_bw() + 
  theme(panel.background = element_rect(fill = "lightblue"))
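
To see what the binning does in isolation, the sketch below applies `cut()` with four equal-width bins to toy counts and cross-tabulates the resulting bins. The toy vectors are hypothetical, and the default interval labels of `cut()` are kept instead of the reformatted labels built above.

```r
# Toy per-user counts (hypothetical values, for illustration only)
edits       <- c(1, 3, 8, 15, 22, 40, 41, 60, 75, 80)
discussions <- c(1, 1, 2,  1,  4,  3,  8,  2,  6,  8)

# cut() with breaks = 4 splits the observed range into four equal-width bins
editBins       <- cut(edits, breaks = 4, dig.lab = 4)
discussionBins <- cut(discussions, breaks = 4, dig.lab = 4)

# Number of users in each combination of bins
binCounts <- table(editBins, discussionBins)
binCounts
```

Equal-width bins on skewed counts tend to put most users in the lowest bin; quantile-based breaks would be one alternative if more balanced cells are wanted.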

NOTE. This is (as requested on Phab) only for recent revisions (September 2018). We can have this produced from all available revisions of course.

Q4. How many new users do we have each month?

Note. My decision here was to include only those users who registered in or after November 2012 (since Wikidata started in October 2012).

Q4.1 User registrations per month

library(tidyr)
usersFrame <- wdUsers %>% 
  filter(YM >= "2012-11") %>% 
  arrange(YM)
usersRegistered <- usersFrame %>% 
  group_by(YM) %>% 
  summarise(registrations = n())
ggplot(data = usersRegistered, 
       aes(x = YM, y = registrations, label = registrations)) + 
  geom_path(size = .25, group = 1, color = "blue") + 
  geom_point(size = 1.5, color = "blue") + 
  geom_point(size = 1, color = "white") + 
  geom_text_repel(size = 2.5) +
  ggtitle("Registrations on Wikidata since: November 2012") +
  xlab("Year-Month") + ylab("Registrations") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))

Q4.2 User registrations and revisions vs. participation in discussions

Note. Logarithmic scaling was used; edit stands for revisions on the “main” namespaces (note: not only Items, but all namespaces that are not “talk” namespaces), discussion for revisions on “talk” namespaces. Each point represents a count in the particular month since November 2012 in which the users whose data are taken into consideration have registered.

wdRevisions$talk <- ifelse(grepl("talk", wdRevisions$namespace, fixed = T), "discussion", "edit")
userRevs <- wdRevisions %>% 
  select(user, talk, revisions) %>% 
  group_by(user, talk) %>% 
  summarise(revisions = sum(revisions)) %>% 
  spread(key = talk, 
         value = revisions, 
         fill = 0)
usersFrame <- left_join(usersFrame, userRevs, 
                        by = 'user')
usersFrame <- usersFrame[complete.cases(usersFrame), ]
uFrame <- usersFrame %>% 
  select(YM, discussion, edit) %>% 
  group_by(YM) %>% 
  summarise(edit = sum(edit), discussion = sum(discussion))
uFrame <- uFrame %>% 
  gather(key = "Activity",
         value = "Revisions",
         c('edit', 'discussion'))
ggplot(data = uFrame, 
       aes(x = YM, y = log10(Revisions), 
           color = Activity, 
           group = Activity, 
           label = Revisions)) + 
  geom_text_repel(size = 2, color = "black", segment.size   = .2) +
  geom_point(size = 1.5) + geom_path(size = .25) + 
  geom_point(size = 1, color = "white") + 
  ggtitle("log(Revisions) per Activity: since November 2012") +
  xlab("Year-Month") + ylab("log10(Revisions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  theme(legend.position = "top")

Q4.3 The relationship between revisions and participations in discussions

By eyeballing only: there seems to be an approximate and weak linear relationship between the number of revisions made on the “main” and “talk” namespaces (once again: by “main” namespaces we mean not only Items, but all namespaces that are not “talk” namespaces). Again, each point represents a particular month since November 2012 in which the users whose data are taken into consideration registered, except that the data are now represented on a scattergram.

pFrame <- uFrame %>% 
  spread(key = Activity, 
         value = Revisions)
ggplot(data = pFrame, 
       aes(x = edit, y = discussion)) + 
  geom_point(size = 1.5) +  
  geom_point(size = 1, color = "white") +
  geom_smooth(method = lm, size = .25) + 
  ggtitle("Edits vs. Discussions: since November 2012") +
  ylab("Discussions") + xlab("Edits") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 8, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))

model <- lm(pFrame$discussion ~ pFrame$edit)
summary(model)

Call:
lm(formula = pFrame$discussion ~ pFrame$edit)

Residuals:
    Min      1Q  Median      3Q     Max 
-8395.3  -381.8    50.5   200.9 11101.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.594e+01  3.081e+02  -0.117    0.907    
pFrame$edit  1.003e-03  6.718e-05  14.935   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2300 on 69 degrees of freedom
Multiple R-squared:  0.7637,    Adjusted R-squared:  0.7603 
F-statistic: 223.1 on 1 and 69 DF,  p-value: < 2.2e-16

We can explain some 76% of the variance in revisions made on “talk” namespaces from revisions made on the “main” namespaces. However, the assumptions of the linear model were not tested (not to mention that we are dealing with count data here, so a Poisson model would be more appropriate), and no outliers were removed (there are some obvious ones in this analysis, cf. the scattergram), so this result should be taken with a grain of salt (at least).
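
As a hedged sketch of the Poisson alternative mentioned above (it was not fitted for this report), the following code uses simulated monthly totals. The variable names, the log-linear form and the simulated relationship are all assumptions. A quasi-Poisson family is one common choice when counts are overdispersed.

```r
set.seed(1)
# Simulated monthly totals (hypothetical, not the Wikidata data)
sim <- data.frame(edit = round(runif(71, 1e4, 5e6)))
sim$discussion <- rpois(71, lambda = exp(5 + 0.3 * log(sim$edit)))

# Poisson regression of discussion counts on log(edit counts)
poisModel <- glm(discussion ~ log(edit), family = poisson, data = sim)

# Quasi-Poisson: same coefficients, standard errors scaled by the dispersion
quasiModel <- glm(discussion ~ log(edit), family = quasipoisson, data = sim)

summary(quasiModel)$coefficients
```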

---
title: 'Basic Data On Wikidata Use Report'
author: "Goran S. Milovanovic, Data Scientist, WMDE"
date: "October 15, 2018"
output:
  html_notebook:
    code_folding: show
    theme: simplex
    toc: yes
    toc_float: yes
    toc_depth: 3
  html_document:
    toc: yes
    toc_depth: 3
---

**Feedback** should be sent to `goran.milovanovic_ext@wikimedia.de`.

```{r}
knitr::opts_chunk$set(fig.width = 14, fig.height = 10) 
```

# 0. Task Description

The task is described in the respective [Phabricator ticket](https://phabricator.wikimedia.org/T206214) by Jan Dittrich:

**User Story** :: As a user researcher, I want to know how our user base is composed so I can evaluate if survey and interview data has a bias compared to it and to see basic patterns that I might want to further explore.

Information needs:

I imagine the following data to be useful (all can be aggregates): 
I imagine this should not be a graphana board but a report (ideally: RMarkdown or Jupyter Notebook)

- How is our user base composed in terms of edit count? (e.g. shown by an histogram, y= count, x=bins of edit count ranges)
- How were our user base composed in terms of recent edits? How many account did, in the last month (or so) edited n times (e.g. shown by an histogram, y= count, x=bins of edit count ranges for a month)
- How many of those users do participate in discussions? (mosaic plot, y=bins of edit count, x= bins of discussion participation)
- How many new users do we have each month?
-- And of them, how many do edit how much?
-- And of them, how many do participate in discussions?

Or in general, how is the relation between:

- Edit count and: Discussion participation (possibly split by item and property discussions vs. other project discussions), adding values, correcting values, adding references, adding qualifiers

# 1. Data Acquisition

The relevant data were obtained from the [wmf.mediawiki_history](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history) table, which stores the denormalized edit history of WMF's wikis. The following HiveQL queries were run within the R environment from the stat1005 statbox server on October 15, 2018:

## 1.1 Total revisions: users vs. namespaces

The following HiveQL query, run by a `system()` call from within R, collects the counts of all `event_user_id` (user IDs) and `page_namespace` (page namespace of the page where a particular revision took place) combinations across all users and revisions on `wikidatawiki`. Bots and anonymous users are filtered out, as well as page redirects and revisions that were subsequently deleted. The resulting table presents the number of revisions made by any user on any page, grouped by page namespaces.

```{r, echo = T, eval = F}
filename <- "WD_revisions.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT COUNT(*), event_user_id, page_namespace FROM mediawiki_history 
            WHERE (
              event_entity = \'revision\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              event_user_id != 0 AND 
              page_is_redirect = FALSE AND  
              revision_is_deleted = FALSE AND 
              snapshot = \'2018-09\'
            ) 
            GROUP BY event_user_id, page_namespace;"'
out <- paste0("> /home/goransm/RScripts/Wikidata/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)
```

## 1.2 Total revisions: users vs. namespaces in September 2018.

The following HiveQL query is the same as the previous one, except that it keeps only the revisions made in September 2018 (all older revisions are filtered out). A snapshot of the `wmf.mediawiki_history` table takes some time to compute, so the latest snapshot available was `2018-09`. In other words, the October 2018 dataset was not yet ready when we ran these analyses.

```{r, echo = T, eval = F}
filename <- "WD_revisions_September.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT COUNT(*), event_user_id, page_namespace FROM mediawiki_history 
            WHERE (
              event_entity = \'revision\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              event_user_id != 0 AND 
              page_is_redirect = FALSE AND  
              revision_is_deleted = FALSE AND 
              snapshot = \'2018-09\' AND 
              event_timestamp > \'2018-09-01 00:00:00.0\'
            ) 
            GROUP BY event_user_id, page_namespace;"'
out <- paste0("> /home/goransm/RScripts/Wikidata/WD_edits/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)
```
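The two queries above differ only in the output file and one extra predicate, so the `beeline` call could be wrapped in a small helper. The following is a minimal sketch under that assumption; `build_beeline_command()` is a hypothetical name that does not appear in the original analysis, and it only assembles the command string without running anything.

```r
# Hypothetical helper (not part of the original analysis): assembles the
# shell command that runs a HiveQL query through beeline and redirects
# the result to a file. It only builds the string; nothing is executed.
build_beeline_command <- function(query, out_file,
                                  beeline = "/usr/local/bin/beeline") {
  paste(beeline, "--silent --incremental -e",
        shQuote(query, type = "sh"),
        ">", shQuote(out_file, type = "sh"))
}

# Example with a placeholder query and a placeholder output path.
cmd <- build_beeline_command(
  query = "USE wmf; SELECT 1;",
  out_file = "example_output.tsv"
)
print(cmd)
# On a host where beeline is available one would then call:
# system(cmd, wait = TRUE)
```

Using `shQuote()` avoids escaping the single quotes inside the query by hand.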

## 1.3 User registrations

The following HiveQL query collects all user registrations on `wikidatawiki`. The documentation of the `wmf.mediawiki_history` table is not clear on the following point: what is the exact meaning of the `event_user_` type of fields in the context where `event_entity = user`? In other words, it is uncertain whether filtering out bots and anonymous users should be done by referring to the `event_user_is_` type of fields or to the `user_is_` type of fields. In order to prevent any inconsistencies in the data, both types of fields were used to filter out bots and anonymous users. In spite of this precaution, the later results will demonstrate that it is highly likely that at least some bots have "survived" these filtering procedures. They were filtered out by statistical methods, as will be demonstrated. The resulting table contains the user IDs and the timestamps of their registration with `wikidatawiki`.

**Note.** In spite of the fact that the Wikidata project started on October 30, 2012, many registration timestamps are as old as 2004. The only hypothesis with respect to this fact is that the older registration timestamps exist as a consequence of unified logins of those users who were already registered at other wikis. As we will see, these user accounts were filtered out from some of the analyses.

```{r, echo = T, eval = F}
filename <- "WD_users.tsv"
connect <- '/usr/local/bin/beeline --silent --incremental -e'
args <- '"USE wmf; 
          SELECT user_id, user_creation_timestamp FROM mediawiki_history 
            WHERE (
              event_entity = \'user\' AND 
              event_type = \'create\' AND 
              wiki_db = \'wikidatawiki\' AND 
              event_user_is_bot_by_name = FALSE AND 
              event_user_is_anonymous = FALSE AND 
              user_is_bot_by_name = FALSE AND 
              user_is_anonymous = FALSE AND 
              NOT ARRAY_CONTAINS(event_user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(event_user_groups_historical, \'bot\') AND 
              NOT ARRAY_CONTAINS(user_groups, \'bot\') AND 
              NOT ARRAY_CONTAINS(user_groups_historical, \'bot\') AND 
              snapshot = \'2018-09\'
            );"'
out <- paste0("> /home/goransm/RScripts/Wikidata/WD_edits/", filename)
qCommand <- paste(connect, args, out)
system(qCommand, wait = T)
```
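The note on registration timestamps that predate the project can be checked directly once the data are loaded. The following is a minimal sketch on made-up timestamps, assuming they arrive as text in the `YYYY-MM-DD hh:mm:ss.s` form; `toy_users` and `wikidata_start` are illustrative names, not objects from the original analysis.

```r
# Toy registration timestamps in the assumed textual format of the query
# output (values are illustrative, not taken from the real data).
toy_users <- data.frame(
  user = 1:4,
  creation_timestamp = c("2004-05-17 10:00:00.0", "2012-10-30 08:15:00.0",
                         "2013-02-01 12:30:00.0", "2018-09-20 23:59:59.0"),
  stringsAsFactors = FALSE
)

# Project start date as stated in the note above.
wikidata_start <- as.Date("2012-10-30")

# Keep the date part only and flag registrations older than the project.
reg_date <- as.Date(substr(toy_users$creation_timestamp, 1, 10))
toy_users$predates_wikidata <- reg_date < wikidata_start

# How many accounts carry a registration date older than Wikidata itself?
print(table(toy_users$predates_wikidata))
```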


# 2. Data Wrangling

The following procedures load the datasets locally, impose consistent column names to obtain the desired matches, introduce the namespace names (e.g. "Main (Items)", "Main (Items talk)", and similar) in addition to the namespace codes, and parse the user registration timestamps into dates, from which a year-month label (`YM`) is derived. Finally, the revision data are additionally filtered to keep only the user accounts that are found in the user registration data.

```{r echo = T}
# - locally: revisions vs namespaces
library(data.table)
setwd('/home/goransm/Work/___DataKolektiv/Projects/WikimediaDEU/_WMDE_Projects/Wikidata/WD_edits/')
wdRevisions <- fread('WD_revisions.tsv', sep = "\t")
colnames(wdRevisions) <- c('revisions', 'user', 'wd_namespace')
# - locally: revisions vs namespaces, September only
wdRevisionsSep <- fread('WD_revisions_September.tsv', sep = "\t")
colnames(wdRevisionsSep) <- c('revisions', 'user', 'wd_namespace')
# - locally: user registrations
wdUsers <- fread('WD_users.tsv', sep = "\t")
colnames(wdUsers) <- c('user', 'creation_timestamp')

### --- Wrangling
library(dplyr)
# - WD namespace coding table; 
# - add namespace name to wdRevisions, wdRevisionsSep 
# - https://www.wikidata.org/wiki/Help:Namespaces
namespaces <- read.csv('namespace_coding_scheme.csv', 
                       header = T, 
                       check.names = F, 
                       stringsAsFactors = F)
wdRevisions <- left_join(wdRevisions, namespaces,
                         by = c("wd_namespace" = "namespaceCode"))
wdRevisionsSep <- left_join(wdRevisionsSep, namespaces,
                            by = c("wd_namespace" = "namespaceCode"))
# - creation_timestamp in wdUsers to Posix
library(lubridate)
wdUsers$creation_timestamp <- sapply(wdUsers$creation_timestamp, 
                                     function(x) {
                                       strsplit(
                                         strsplit(x, split = ".", fixed = T)[[1]][1], 
                                         split = " ", fixed = T)[[1]][1]
                                     })
wdUsers$creation_timestamp <- ymd(wdUsers$creation_timestamp)
wdUsers$YM <- sapply(as.character(wdUsers$creation_timestamp), 
                     function(x) {
                       paste(strsplit(x, split = "-", fixed = T)[[1]][1:2], collapse = "-")
                     })
wdUsers$creation_timestamp <- NULL
# - keep only revisions in wdRevisions, wdRevisionsSep
# - made on behalf of users in wdUsers
w1 <- which(wdRevisions$user %in% wdUsers$user)
wdRevisions <- wdRevisions[w1, ]
w2 <- which(wdRevisionsSep$user %in% wdUsers$user)
wdRevisionsSep <- wdRevisionsSep[w2, ]
```
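The two wrangling steps that matter most for the later analyses, deriving a year-month label and restricting revisions to registered users, can also be written in vectorized base R. This is only a sketch on toy values, assuming the timestamps start with a `YYYY-MM-DD` date; all `toy_` names are illustrative.

```r
# Toy timestamps in the assumed "YYYY-MM-DD hh:mm:ss.s" format.
toy_ts <- c("2012-11-03 14:22:10.0", "2018-09-01 00:00:00.0")

# Parse the date part, then format it as a year-month label.
toy_date <- as.Date(substr(toy_ts, 1, 10), format = "%Y-%m-%d")
toy_ym <- format(toy_date, "%Y-%m")
print(toy_ym)

# Keep only revisions made by users present in the registration data.
toy_revisions <- data.frame(revisions = c(5, 2, 9), user = c(1, 2, 3))
toy_registered <- c(1, 3)
toy_kept <- toy_revisions[toy_revisions$user %in% toy_registered, ]
print(toy_kept)
```

Parsing through `as.Date()` before formatting has the side effect of turning malformed timestamps into `NA` rather than silently passing them through.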


# 3. Data Analyses


```{r echo = T}
### --- Analytics
library(ggplot2)
library(ggrepel)
library(stringr)
```

## Q1. How is our user base composed in terms of edit count? (e.g. shown by an histogram, y= count, x=bins of edit count ranges)

First we group all user revisions and user IDs irrespective of the page namespace and sum up all revisions per user:

```{r echo = T}
plotFrame <- wdRevisions %>% 
  select(user, revisions) %>% 
  group_by(user) %>% 
  summarise(revisions = sum(revisions)) %>% 
  arrange(desc(revisions))
```

Exploratory analysis has indicated that outlier removal is necessary in this case. Namely, accounts with millions of edits were discovered, indicating (as already observed) the possibility that some bots were not successfully removed when filtering by the respective database flags. It is also possible that we are observing the outcomes of automated imports to Wikidata that were not flagged as bots here.

```{r echo = T}
# - APPROACH: For a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR below the lower or above the upper hinge 
# - (roughly the 25th and 75th percentiles), where IQR, the 'Inter Quartile 
# - Range', is the distance between the two hinges.
# - Logical indexing keeps all rows when no outliers are found.
keep <- !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out)
plotFrame <- plotFrame[keep, ]
```
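To make the outlier rule concrete, here is a small sketch on a toy vector that reproduces what `boxplot.stats()` reports with its default coefficient of 1.5. The values are made up for illustration.

```r
# Toy revision counts with one obviously extreme value.
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns the five-number summary in $stats:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
stats <- boxplot.stats(x)          # default coef = 1.5
hinges <- stats$stats[c(2, 4)]
iqr <- diff(hinges)

# Fences: 1.5 * IQR beyond the hinges.
lower <- hinges[1] - 1.5 * iqr
upper <- hinges[2] + 1.5 * iqr

# Observations outside the fences are the reported outliers.
is_outlier <- x < lower | x > upper
x_clean <- x[!is_outlier]
print(x[is_outlier])
print(stats$out)
```

Note that this rule is fairly aggressive on heavily skewed count data: it removes the long right tail, not just suspected bots.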

## Q1.1 Checking for power-law behavior: log(Number of users) vs. log (Number of revisions) plot

The purpose here is not really to test for power-law behavior in any rigorous sense; rather, the distribution is so skewed that the data cannot be visualized informatively on the original scale.

```{r echo = T}
# - Power-Law
userRevisions <- as.data.frame(table(plotFrame$revisions), 
                               stringsAsFactors = F)
colnames(userRevisions) <- c('Revisions', 'Num.Users')
userRevisions$Revisions <- as.numeric(userRevisions$Revisions)
userRevisions <- arrange(userRevisions, desc(Revisions))
# - log-log
ggplot(data = userRevisions, 
       aes(x = log10(Revisions), 
           y = log10(`Num.Users`),
           label = Revisions)) + 
  geom_point(size = .1) + geom_smooth(method = lm, size = .2) + 
  ggtitle("Revisions per user") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("log(Num.Users)")
```
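On a log-log plot, a power law appears as a straight line whose slope is the exponent. The following sketch fits that line on toy counts constructed to follow an exact power law; it illustrates the idea only and is not a proper test for power-law behavior.

```r
# Toy frequency table: each tenfold increase in revisions comes with a
# tenfold drop in the number of users (an exact power law, exponent -1).
toy <- data.frame(Revisions = c(1, 10, 100, 1000),
                  Num.Users = c(10000, 1000, 100, 10))

# Straight-line fit on the log-log scale, as geom_smooth(method = lm) does.
fit <- lm(log10(Num.Users) ~ log10(Revisions), data = toy)
slope <- unname(coef(fit)[2])
print(slope)
```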

## Q1.2 Checking for power-law behavior: Number of users vs. Number of revisions plot

In absolute values:

```{r echo = T}
# - absolute values
ggplot(data = userRevisions, 
       aes(x = Revisions, 
           y = `Num.Users`,
           label = Revisions)) + 
  geom_point(size = .5) + geom_path(size = .2) + 
  ggtitle("Revisions per user") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("Num.Users")
```

## Q1.3 Distribution of revisions per user: a Histogram

```{r echo = T}
# - histogram of revisions per user (linear scale)
ggplot(data = plotFrame, 
       aes(revisions)) + 
  geom_histogram(bins = 50, fill = "deepskyblue", color = "white") + 
  ggtitle("Distribution of revisions per user") + 
  ylab("Num.Users") + xlab("Revisions") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))
```

## Q2. How were our user base composed in terms of recent edits? How many account did, in the last month (or so) edited n times (e.g. shown by an histogram, y= count, x=bins of edit count ranges for a month)?

The procedures are the same as in `Q1`, except that we now focus on the September 2018 data only.

```{r echo = T}
plotFrame <- wdRevisionsSep %>% 
  select(user, revisions) %>% 
  group_by(user) %>% 
  summarise(revisions = sum(revisions)) %>% 
  arrange(desc(revisions))
# - Outlier detection and removal
# - APPROACH: For a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR below the lower or above the upper hinge 
# - (roughly the 25th and 75th percentiles), where IQR, the 'Inter Quartile 
# - Range', is the distance between the two hinges.
# - Logical indexing keeps all rows when no outliers are found.
keep <- !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out)
plotFrame <- plotFrame[keep, ]
```

## Q2.1 Checking for power-law behavior: log(Number of users) vs. log (Number of revisions) plot

```{r echo = T}
userRevisions <- as.data.frame(table(plotFrame$revisions), 
                               stringsAsFactors = F)
colnames(userRevisions) <- c('Revisions', 'Num.Users')
userRevisions$Revisions <- as.numeric(userRevisions$Revisions)
userRevisions <- arrange(userRevisions, desc(Revisions))
ggplot(data = userRevisions, 
       aes(x = log10(Revisions), 
           y = log10(`Num.Users`),
           label = Revisions)) + 
  geom_point(size = .1) + geom_smooth(method = lm, size = .2) + 
  ggtitle("Revisions per user: September 2018") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("log(Num.Users)")
```

## Q2.2 Checking for power-law behavior: Number of users vs. Number of revisions plot

```{r echo = T}
ggplot(data = userRevisions, 
       aes(x = Revisions, 
           y = `Num.Users`,
           label = Revisions)) + 
  geom_point(size = .5) + geom_path(size = .2) + 
  ggtitle("Revisions per user: September 2018") + 
  geom_text_repel(size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  ylab("Num.Users")
```

## Q2.3 Distribution of revisions per user in September 2018: a Histogram

```{r echo = T}
ggplot(data = plotFrame, 
       aes(revisions)) + 
  geom_histogram(bins = 50, fill = "deepskyblue", color = "white") + 
  ggtitle("Distribution of revisions per user: September 2018") + 
  ylab("Num.Users") + xlab("Revisions") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))
```

## Q3. How many of those users do participate in discussions? (mosaic plot, y = bins of edit count, x = bins of discussion participation)

When I started working on this analysis, I was not entirely sure that I had correctly understood the intention behind the mosaic plot. However, the following approach should be illustrative with respect to `Q3`.

In the first step, we separate all talk (i.e. discussion) namespaces from the namespaces where "main" revisions take place.

```{r echo = T}
wdRevisionsSep$talk <- ifelse(grepl("talk", wdRevisionsSep$namespace, fixed = T), "discussion", "edit")
plotFrame <- wdRevisionsSep %>% 
  select(user, revisions, talk) %>% 
  group_by(user, talk) %>% 
  summarise(revisions = sum(revisions))
```

Removing outliers, as in previous analyses:

```{r echo = T}
# - APPROACH: For a given continuous variable, outliers are those observations 
# - that lie more than 1.5 * IQR below the lower or above the upper hinge 
# - (roughly the 25th and 75th percentiles), where IQR, the 'Inter Quartile 
# - Range', is the distance between the two hinges.
# - Logical indexing keeps all rows when no outliers are found.
keep <- !(plotFrame$revisions %in% boxplot.stats(plotFrame$revisions)$out)
plotFrame <- plotFrame[keep, ]
plotFrame <- arrange(plotFrame, user)
```

Now we cross-tabulate the number of revisions made on discussion and main namespaces for all users, and visualize the R `table` object to obtain a mosaic plot: 

```{r echo = T}
# - number of users that have both edited and discussed
numEditTalkUsers <- sum(duplicated(plotFrame$user))
wEditTalkUsers <- which(duplicated(plotFrame$user))
wEditTalkUsers <- plotFrame$user[wEditTalkUsers]
plotFrame <- plotFrame[plotFrame$user %in% wEditTalkUsers, ]
pFrame <- data.frame(edits = plotFrame$revisions[plotFrame$talk == "edit"],
                     discussions = plotFrame$revisions[plotFrame$talk == "discussion"], 
                     stringsAsFactors = F)
pFrame <- arrange(pFrame, edits)
# A "mosaic" plot
plot(table(pFrame$edits, pFrame$discussions),
     xlab = "Edits", ylab = "Discussions", 
     main = "Edits vs. Discussions, Sep 2018", 
     color = "darkorange")
```
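The pairing of each user's edit count with their discussion count can also be done with an explicit join on the user ID, which does not depend on the row order of the two subsets. A minimal sketch on toy data follows; the `toy`, `edits`, `talks` and `paired` names are illustrative, not objects from the analysis.

```r
# Toy long-format data: one row per user and activity type.
toy <- data.frame(
  user = c(1, 1, 2, 3, 3, 4),
  talk = c("edit", "discussion", "edit", "discussion", "edit", "discussion"),
  revisions = c(10, 2, 5, 1, 7, 3),
  stringsAsFactors = FALSE
)

# Split by activity and join on user: only users with both kinds remain.
edits <- toy[toy$talk == "edit", c("user", "revisions")]
talks <- toy[toy$talk == "discussion", c("user", "revisions")]
paired <- merge(edits, talks, by = "user",
                suffixes = c("_edit", "_discussion"))
names(paired) <- c("user", "edits", "discussions")
print(paired)

# Cross-tabulation of the paired counts; plot(tab) would draw the mosaic.
tab <- table(Edits = paired$edits, Discussions = paired$discussions)
print(tab)
```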

Another, more conservative and potentially useful approach to visualizing the same cross-tabulated data:

```{r echo = T}
ggplot(data = pFrame, 
       aes(x = edits, y = discussions, label = discussions)) +
  geom_point(size = 1.5, color = "blue") +
  geom_point(size = 1, color = "white") +
  ggtitle("Edits vs. Discussions per user: September 2018") +
  ylab("Discussions") + xlab("Edits") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))
```


### Q3.1 Alternative approach

This is the number of users following the binning of the distributions of the number of edits (i.e. revisions made in any of the main namespaces) and the number of discussions (i.e. revisions made on talk pages):

```{r echo = T}
pFrame$editIntervals <- cut(pFrame$edits, 
                            breaks = 4)
pFrame$editIntervals <- sapply(pFrame$editIntervals, function(x) {
  edges <- strsplit(as.character(x), split = ",")[[1]]
  edges[1] <- floor(as.numeric(str_extract(edges[1], "([[:digit:]]|\\.)+"))) + 1
  edges[2] <- floor(as.numeric(str_extract(edges[2], "([[:digit:]]|\\.)+")))
  return(paste0("(", edges[1]," - ", edges[2], ")"))
})
pFrame$discussionIntervals <- cut(pFrame$discussions,
                                  breaks = 4)
pFrame$discussionIntervals <- sapply(pFrame$discussionIntervals, function(x) {
  edges <- strsplit(as.character(x), split = ",")[[1]]
  edges[1] <- floor(as.numeric(str_extract(edges[1], "([[:digit:]]|\\.)+"))) + 1
  edges[2] <- floor(as.numeric(str_extract(edges[2], "([[:digit:]]|\\.)+")))
  return(paste0("(", edges[1]," - ", edges[2], ")"))
})
editIntervalsLevels <- sapply(unique(pFrame$editIntervals), function(x) {
  lower <- str_extract(x, '[[:digit:]]+')
})
editIntervalsLevels <- names(editIntervalsLevels)[order(as.numeric(editIntervalsLevels))]
discussionIntervalsLevels <- sapply(unique(pFrame$discussionIntervals), function(x) {
  lower <- str_extract(x, '[[:digit:]]+')
})
discussionIntervalsLevels <- names(discussionIntervalsLevels)[order(as.numeric(discussionIntervalsLevels))]

pFrame$editIntervals <- factor(pFrame$editIntervals, 
                               levels = editIntervalsLevels)
pFrame$discussionIntervals <- factor(pFrame$discussionIntervals, 
                               levels = discussionIntervalsLevels)
pFrame %>% 
  dplyr::select(editIntervals, discussionIntervals) %>% 
  dplyr::group_by(editIntervals, discussionIntervals) %>% 
  summarise(Count = n()) %>% 
  ggplot(aes(x = editIntervals,
             y = discussionIntervals,
             label = Count)) +
  geom_point(aes(size = Count), color = "cadetblue4", shape = 19) +
  geom_text(size = 3, nudge_x = .15) + 
  xlab("Edits") + ylab("Discussions") + 
  theme_bw() + 
  theme(panel.background = element_rect(fill = "lightblue"))
```
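The interval labels above are recovered by parsing the text that `cut()` generates. An alternative, sketched below on toy values, is to pass explicit integer breaks and ready-made labels to `cut()`, so that the factor levels come out ordered without any string parsing. The break points are arbitrary and chosen only for illustration.

```r
# Toy edit counts per user.
toy_edits <- c(1, 4, 9, 15, 27, 33, 48, 60)

# Explicit integer break points; intervals are right-closed by default,
# i.e. (0,15], (15,30], (30,45], (45,60].
breaks <- c(0, 15, 30, 45, 60)

# Human-readable labels in the same "(lower - upper)" style.
labels <- paste0("(", head(breaks, -1) + 1, " - ", tail(breaks, -1), ")")

# cut() returns a factor whose levels follow the order of the breaks.
bins <- cut(toy_edits, breaks = breaks, labels = labels)
print(table(bins))
```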

**NOTE.** This is (as requested on Phab) only for recent revisions (September 2018). We can have this produced from all available revisions of course.

## Q4. How many new users do we have each month?

**Note.** My decision here was to include only those users who registered in or after November 2012 (since Wikidata started in October 2012).

### Q4.1 User registrations per month

```{r echo = T}
library(tidyr)
usersFrame <- wdUsers %>% 
  filter(YM >= "2012-11") %>% 
  arrange(YM)
usersRegistered <- usersFrame %>% 
  group_by(YM) %>% 
  summarise(registrations = n())
ggplot(data = usersRegistered, 
       aes(x = YM, y = registrations, label = registrations)) + 
  geom_path(size = .25, group = 1, color = "blue") + 
  geom_point(size = 1.5, color = "blue") + 
  geom_point(size = 1, color = "white") + 
  geom_text_repel(size = 2.5) +
  ggtitle("Registrations on Wikidata since: November 2012") +
  xlab("Year-Month") + ylab("Registrations") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))
```
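The filter `YM >= "2012-11"` compares character strings, which works here because the labels are zero-padded `YYYY-MM` values and therefore sort in calendar order. A small sketch on toy labels, which also counts registrations per month in base R:

```r
# Toy year-month labels, including one that predates November 2012.
toy_ym <- c("2012-09", "2012-11", "2012-11", "2013-01", "2018-09")

# String comparison is safe for zero-padded YYYY-MM labels.
kept <- toy_ym[toy_ym >= "2012-11"]

# Registrations per month, equivalent to group_by(YM) + summarise(n()).
registrations <- table(kept)
print(registrations)
```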

### Q4.2 User registrations and revisions vs. participation in discussions

**Note.** Logarithmic scaling was used; `edit` stands for revisions on the "main" namespaces (note: not only Items, but all namespaces that are not "talk" namespaces), `discussion` for revisions on "talk" namespaces. Each point represents a count in the particular month since November 2012 in which the users whose data are taken into consideration have registered.

```{r echo = T}
wdRevisions$talk <- ifelse(grepl("talk", wdRevisions$namespace, fixed = T), "discussion", "edit")
userRevs <- wdRevisions %>% 
  select(user, talk, revisions) %>% 
  group_by(user, talk) %>% 
  summarise(revisions = sum(revisions)) %>% 
  spread(key = talk, 
         value = revisions, 
         fill = 0)
usersFrame <- left_join(usersFrame, userRevs, 
                        by = 'user')
usersFrame <- usersFrame[complete.cases(usersFrame), ]
uFrame <- usersFrame %>% 
  select(YM, discussion, edit) %>% 
  group_by(YM) %>% 
  summarise(edit = sum(edit), discussion = sum(discussion))
uFrame <- uFrame %>% 
  gather(key = "Activity",
         value = "Revisions",
         c('edit', 'discussion'))
ggplot(data = uFrame, 
       aes(x = YM, y = log10(Revisions), 
           color = Activity, 
           group = Activity, 
           label = Revisions)) + 
  geom_text_repel(size = 2, color = "black", segment.size = .2) +
  geom_point(size = 1.5) + geom_path(size = .25) + 
  geom_point(size = 1, color = "white") + 
  ggtitle("log(Revisions) per Activity: since November 2012") +
  xlab("Year-Month") + ylab("log10(Revisions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11)) + 
  theme(legend.position = "top")
```

### Q4.3 The relationship between revisions and participations in discussions

By eye-balling only: there seems to be an approximate and weak linear relationship between the number of revisions made on the "main" and "talk" namespaces (once again: by "main" namespaces we mean not only Items, but all namespaces that are not "talk" namespaces). Again, each point represents a particular month since November 2012 in which the users whose data are taken into consideration registered; the only difference is that the data are now shown in a scatter plot.

```{r echo = T}
pFrame <- uFrame %>% 
  spread(key = Activity, 
         value = Revisions)
ggplot(data = pFrame, 
       aes(x = edit, y = discussion)) + 
  geom_point(size = 1.5) +  
  geom_point(size = 1, color = "white") +
  geom_smooth(method = lm, size = .25) + 
  ggtitle("Edits vs. Discussions: since November 2012") +
  ylab("Discussions") + xlab("Edits") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 8, hjust = 1, angle = 90)) + 
  theme(axis.text.y = element_text(size = 10, hjust = 1)) +
  theme(axis.title.x = element_text(size = 10)) +
  theme(axis.title.y = element_text(size = 10)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 11))
```

```{r echo = T}
model <- lm(pFrame$discussion ~ pFrame$edit)
summary(model)
```

We can explain some `76%` of the variance in revisions made on "talk" namespaces from revisions made on the "main" namespaces. However, the assumptions of the linear model were not tested (not to mention that we are dealing with count data here, so a Poisson model would be more appropriate), and no outliers were removed (there are some obvious ones in this analysis, cf. the scattergram), so this result should be taken with a grain of salt (at least).
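For reference, the share of explained variance is the `r.squared` element of the model summary, and a Poisson regression for count outcomes can be fitted with `glm()`. The sketch below uses simulated data (not the Wikidata figures) purely to show the two calls side by side; the simulation parameters are arbitrary assumptions.

```r
# Simulated monthly totals: discussion counts depend on edit counts.
set.seed(42)
toy <- data.frame(edit = round(runif(60, 1000, 50000)))
toy$discussion <- rpois(60, lambda = exp(2 + 0.00005 * toy$edit))

# Ordinary linear model and its R-squared (share of variance explained).
lin <- lm(discussion ~ edit, data = toy)
r2 <- summary(lin)$r.squared
print(r2)

# Poisson regression with a log link, the usual starting point for counts.
pois <- glm(discussion ~ edit, data = toy, family = poisson(link = "log"))
print(coef(pois))
```

With real data one would also check for overdispersion before trusting the Poisson standard errors; a quasi-Poisson or negative binomial model is a common fallback.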







