Investigate pageviews decline in June 2019

Task

Summary

In June 2019, there was a slight decline in year-over-year pageviews (-0.9%). This was the first YOY decline in total pageviews since May/June 2018. We'd like to look into possible reasons for this yoy decline this month and flattening the past quarter.

Approach: Review pageviews in broken down by the following:

  • platform differences (desktop vs mobile web)
  • location (country and continent)
  • project
  • referrers
  • os/browser
  • weekend vs weekday changes

Data Notes:

  • Data source wmf.pageviews_hourly via Hive and also Turnilo and Superset dashboards. June 2019 data.
  • I applied the IE corrections (T176023) to all queries in this notebook since they were also applied in the board metrics where we intially identified the decline. However, the difference appears minima when comparing the numbers to turnilo/superset,
  • Pageviews filtered to view only desktop views (where YOY decline occured) and to filter out bots.
In [300]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(magrittr); library(zeallot); library(glue); library(tidyverse); library(glue); library(lubridate)
    library(scales)
})

Pageviews by Platform

In [14]:
#pageviews by platform since May 2019.
query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
access_method, SUM(view_Count)as views
FROM wmf.pageview_hourly
WHERE year = 2019 and month >= 5
AND agent_type='user'
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
GROUP BY year, month, day, access_method;"
In [15]:
pageviews_byplatform <- wmf::query_hive(query)
In [21]:
pageviews_byplatform$date <- as.Date(pageviews_byplatform$date, format = "%Y-%m-%d")
In [26]:
p <- pageviews_byplatform %>%
filter(date < '2019-07-01') %>%
       ggplot(aes(x=date, y =views, color = access_method)) +  
  geom_line() +
  scale_y_continuous("daily pageviews", labels = polloi::compress) +
  scale_x_date("Date", labels = date_format("%Y-%m-%d"), date_breaks = "1 week")+
  labs(title = "Total daily Wikimedia pageviews by access method") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1),
        panel.grid = element_line("gray70"))
p

There was a yoy decline in desktop (-11.9%) while mobile web pageviews increased yoy by +7.8%. We typically report a yoy decline in desktop each month but this one was higher compared to previous months (-8.3% in May and -9.4% in April).

I will investigate the decline in desktop by breaking down by various factors.

Desktop Pageviews Change By Country

In [28]:
#Find desktop pageviews by country to calculate yoy and mom changes on desktop

query <- "
SELECT country, 
SUM(IF((month =5 and year = 2019), view_count, null)) AS may19_views,
SUM(IF((month =6 and year = 2019), view_count, null)) AS june19_views,
SUM(IF((month =6 and year = 2018), view_count, null)) AS june18_views
FROM wmf.pageview_hourly
WHERE
((year = 2018 and month =6) OR (year=2019 and (month = 5 OR month =6))) 
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND agent_type = 'user'
AND access_method = 'desktop'
GROUP BY country;"
In [29]:
pageviews_bycountry <- wmf::query_hive(query)
In [31]:
pageviews_bycountry_clean <- pageviews_bycountry

pageviews_bycountry_clean$june18_views <- as.numeric(pageviews_bycountry_clean$june18_views)
pageviews_bycountry_clean$june19_views <- as.numeric(pageviews_bycountry_clean$june19_views)
pageviews_bycountry_clean$may19_views <- as.numeric(pageviews_bycountry_clean$may19_views)
Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”
In [294]:
top_declines_bycountry <- pageviews_bycountry_clean %>%
filter(june19_views > 100000) %>% ## filter out countries with extremely low traffic
group_by(country) %>%
mutate(yoy_percent = (june19_views/june18_views-1)*100,
      mom_percent = (june19_views/may19_views-1)* 100) %>%
arrange(yoy_percent)

head(top_declines_bycountry, 10)
A grouped_df: 10 × 6
countrymay19_viewsjune19_viewsjune18_viewsyoy_percentmom_percent
<chr><dbl><dbl><dbl><dbl><dbl>
China 220390612075955894758876-78.09223 -5.805615
Angola 1299766 1370404 4243697-67.70731 5.434671
Sudan 356867 153037 432793-64.63968-57.116517
Seychelles 147283 123991 228457-45.72677-15.814452
Saint Lucia 128526 112224 201168-44.21379-12.683815
Guinea 152504 100088 178443-43.91038-34.370246
Malawi 196063 149743 266474-43.80577-23.625059
Iceland 2486622 2375930 4062511-41.51573 -4.451501
Cyprus 2176770 1838748 3017296-39.05974-15.528604
Dominican Republic 7489960 4854531 7508222-35.34380-35.186156

There was a significant yoy decline (> 50%) in China, Angola and Sudan. China has been blocked on all Wikipedias since April 2019 and has had a significant decrease in overall pageviews since that date. http://bit.ly/2OFQq7V I'll look further into China YOY changes to see the rates of YoY decrease.

In [237]:
#Daily china pageviews since 2018
#Did not apply IE correction since bug was not relevant for views from China
query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
SUM(view_Count)as views
FROM wmf.pageview_hourly
WHERE year >= 2018 
AND agent_type='user'
AND access_method = 'desktop'
AND country = 'China'
GROUP BY year, month, day;"
In [238]:
pageviews_china <- wmf::query_hive(query)
In [239]:
pageviews_china$date <- as.Date(pageviews_china$date, format = "%Y-%m-%d")
In [241]:
#Various adjustments to allow for yoy comparison.
pageviews_china_monthly <- pageviews_china %>%
  filter(date < '2019-07-01') %>% 
    mutate(date = floor_date(date, "month"))  %>% 
    group_by(date) %>% 
    summarise(views = sum(as.numeric(views)))  %>%   
    mutate(year = ifelse(date >= '2018-01-01' & date < '2019-01-01', '2018', '2019'),
      MonthN =as.factor(format(as.Date(date),"%m")),
         Month = months(as.Date(date), abbreviate=TRUE)) 

pageviews_china_monthly
A tibble: 18 × 5
dateviewsyearMonthNMonth
<date><dbl><chr><fct><chr>
2018-01-01 94312710201801Jan
2018-02-01 83652834201802Feb
2018-03-01 98097047201803Mar
2018-04-01 88804295201804Apr
2018-05-01103179527201805May
2018-06-01 94758876201806Jun
2018-07-01 98069866201807Jul
2018-08-01 91183329201808Aug
2018-09-01145545026201809Sep
2018-10-01139642061201810Oct
2018-11-01119397625201811Nov
2018-12-01125441745201812Dec
2019-01-01135626917201901Jan
2019-02-01107050240201902Feb
2019-03-01105678078201903Mar
2019-04-01 55467027201904Apr
2019-05-01 22039061201905May
2019-06-01 20759558201906Jun
In [250]:
pageviews_china_yoy <- pageviews_china_monthly %>%
  ggplot(aes(x=MonthN, y = views, group = year, color = year)) +  
  geom_line(size = 1) +
  scale_y_continuous("pageviews per month", labels = polloi::compress) +
  scale_x_discrete(breaks = pageviews_china_monthly$MonthN, labels = pageviews_china_monthly$Month)+
  labs(title = "Monthly wikimedia pageviews from China \n year over year comparison") +
  ggthemes::theme_tufte(base_size = 14, base_family = "Gill Sans") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank(),
        legend.position = "bottom",
        panel.grid = element_line("gray70"),
        legend.key.width=unit(1.5,"cm")) +
  xlab("Month") +
  scale_linetype_manual(breaks=c('2018','2019'), values=c(2,1))

pageviews_china_yoy
In [301]:
# Daily Wikimedia pageviews on desktop in China for 2019
 
pageviews_china_weekly <- pageviews_china %>%
filter(date < '2019-07-01') %>%
 mutate(date = floor_date(date, "week"))  %>%
group_by(date) %>%
summarise(views = sum(views)) %>%
  ggplot(aes(x=date, y =views)) +  
  geom_line(color= "blue") +
geom_vline(xintercept = as.numeric(as.Date("2019-04-23")),
             linetype = "dashed", color = "black") +
  geom_text(aes(x=as.Date('2019-04-23'), y=30E6, label="All Wikipedias blocked in China"), size=3.5, vjust = -1.2, angle = 90, color = "black") +
  scale_y_continuous("Weekly pageviews", labels = polloi::compress) +
  scale_x_date("Date", labels = date_format("%Y-%m-%d"), date_breaks = "1 month")+
  labs(title = "Weekly wikimedia pageviews from China") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1),
        panel.grid = element_line("gray70"))

ggsave("Figures/pageviews_china_weekly.png", pageviews_china_weekly, width = 18, height = 9, units = "in", dpi = 150)
pageviews_china_weekly
In [206]:
#Find estimated percentage decline in overall pageviews due to China drop in pageviews. 
china_drop <- top_declines_bycountry %>%
filter(country == 'China') %>%
summarise(num_decrease = june18_views - june19_views)

#Total Wikimedia June 2018 Pageviews across all countries is 6570949411

china_drop_percent <- china_drop$num_decrease/ 6570949411 *100
china_drop_percent
1.12615869292986

Following the China block at the end of April 2019, pageviews have declined with the high YoY declines . In June 2019, there was a 74M year over year pageview drop in China across all Wikimedia projects on desktop. This represents about a little over 1% year over year decrease in all desktop Wikimedia traffic.

Pageviews by Continent

In [112]:
#Find pageviews by continent to calculate yoy and mom changes on desktop

query <- "
SELECT continent, 
SUM(IF((month =5 and year = 2019), view_count, null)) AS may19_views,
SUM(IF((month =6 and year = 2019), view_count, null)) AS june19_views,
SUM(IF((month =6 and year = 2018), view_count, null)) AS june18_views
FROM wmf.pageview_hourly
WHERE
((year = 2018 and month =6) OR (year=2019 and (month = 5 OR month =6))) 
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND agent_type = 'user'
AND access_method = 'desktop'
GROUP BY continent;"
In [113]:
pageviews_bycontinent <- wmf::query_hive(query)
Error in nrow(data): object 'results' not found
Traceback:

1. wmf::query_hive(query)
2. stop_on_empty(results)
3. nrow(data)
In [35]:
pageviews_bycontinent_clean <- pageviews_bycontinent

pageviews_bycontinent_clean$june18_views <- as.numeric(pageviews_bycontinent_clean$june18_views)
pageviews_bycontinent_clean$june19_views <- as.numeric(pageviews_bycontinent_clean$june19_views)
pageviews_bycontinent_clean$may19_views <- as.numeric(pageviews_bycontinent_clean$may19_views)
In [295]:
top_declines_bycontinent <- pageviews_bycontinent_clean %>%
group_by(continent) %>%
mutate(yoy_percent = (june19_views/june18_views-1)*100,
      mom_percent = (june19_views/may19_views-1)* 100) %>%
arrange(yoy_percent)

head(top_declines_bycontinent, 10)
A grouped_df: 8 × 6
continentmay19_viewsjune19_viewsjune18_viewsyoy_percentmom_percent
<chr><dbl><dbl><dbl><dbl><dbl>
Africa 86063700 72500679 95054438-23.727203-15.759282
South America 345128819 305522000 369310810-17.272392-11.475952
Asia 122870153211297121121307746112-13.613805 -8.056425
Europe 305082255426290045562954416896-11.014435-13.826369
North America176971804715165760541700284728-10.804583-14.304086
Oceania 132374985 120991398 132036397 -8.365117 -8.599500
Unknown 17933093 16190848 12099495 33.814246 -9.715251
Antarctica 3943 3514 535556.822430-10.880041
In [37]:
#Daily pageviews by continent
query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
continent, SUM(view_Count)as views
FROM wmf.pageview_hourly
WHERE year = 2019 and month >= 5
AND agent_type='user'
AND access_method = 'desktop'
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
GROUP BY year, month, day, continent;"
In [38]:
pageviews_bycontinent_daily <- wmf::query_hive(query)
In [39]:
pageviews_bycontinent_daily$date <- as.Date(pageviews_bycontinent_daily$date, format = "%Y-%m-%d")
In [40]:
p <- pageviews_bycontinent_daily %>%
filter(date < '2019-07-01') %>%
       ggplot(aes(x=date, y =views, color = continent)) +  
  geom_line() +
  scale_y_continuous("daily pageviews", labels = polloi::compress) +
  scale_x_date("Date", labels = date_format("%Y-%m-%d"), date_breaks = "1 week")+
  labs(title = "Total daily Wikimedia pageviews by Continent") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1),
        panel.grid = element_line("gray70"))
p

All continents saw YoY declines in desktop pageviews ranging from -8% (Oceania) to -24% (Africa). YoY declines were not strongly isolated to one individual continent.

Pageviews By Project

In [47]:
#Pageviews by project, YoY and MoM changes
query <- "
SELECT project, 
SUM(IF((month =5 and year = 2019), view_count, null)) AS may19_views,
SUM(IF((month =6 and year = 2019), view_count, null)) AS june19_views,
SUM(IF((month =6 and year = 2018), view_count, null)) AS june18_views
FROM wmf.pageview_hourly
WHERE
((year = 2018 and month =6) OR (year=2019 and (month = 5 OR month =6))) 
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND agent_type = 'user'
AND access_method = 'desktop'
GROUP BY project;"
In [48]:
pageviews_byproject <- wmf::query_hive(query)
In [49]:
pageviews_byproject_clean <- pageviews_byproject

pageviews_byproject_clean$june18_views <- as.numeric(pageviews_byproject_clean$june18_views)
pageviews_byproject_clean$june19_views <- as.numeric(pageviews_byproject_clean$june19_views)
pageviews_byproject_clean$may19_views <- as.numeric(pageviews_byproject_clean$may19_views)
Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”
In [296]:
top_declines_byproject <- pageviews_byproject_clean %>%
filter(project %in% c("en.wikipedia", 'es.wikipedia', 'de.wikipedia', 'ja.wikipedia', 
      'fr.wikipedia', 'ru.wikipedia', 'commons.wikimedia', 'it.wikipedia', 'zh.wikipedia', 'pt.wikipedia')) %>% 
## Isolate to projects that account for largets portion of June traffic
group_by(project) %>%
mutate(yoy_percent = (june19_views/june18_views-1)*100,
      mom_percent = (june19_views/may19_views-1)* 100) %>%
arrange(yoy_percent)


top_declines_byproject
A grouped_df: 10 × 6
projectmay19_viewsjune19_viewsjune18_viewsyoy_percentmom_percent
<chr><dbl><dbl><dbl><dbl><dbl>
ru.wikipedia 411354317 355753794 453964977-21.6340881-13.516455
pt.wikipedia 114951406 104268113 132051914-21.0400593 -9.293747
it.wikipedia 183944351 168652384 207553817-18.7428174 -8.313366
es.wikipedia 402611831 330642555 390153358-15.2531823-17.875599
fr.wikipedia 304478595 250102684 290544145-13.9192139-17.858697
de.wikipedia 460107993 393327434 445920582-11.7942858-14.514105
en.wikipedia 304833921627024069493028839644-10.7774836-11.348221
zh.wikipedia 209023131 186467182 206515020 -9.7076900-10.791126
ja.wikipedia 374520717 367571906 398345144 -7.7252700 -1.855388
commons.wikimedia 113460459 107567146 107167805 0.3726315 -5.194156

Commons Wikimedia saw the only YoY increase out of the top 10 visited wikimedia projects. All the other projects had yoy decreases ranging from -21% to -8%. Note that Wikimedia Commons has not been reported as blocked by China, which may be an indication that the block is attributing to a portion in yoy decline on the other language Wikipedias.

In [302]:
#Comparison of june 2018 and June 2019 Pageviews by Projects
pageviews_byproject_yoy <- top_declines_byproject %>%
gather(year, views, june19_views:june18_views) %>%
ggplot(aes(x=year, y= views, fill = year)) +
geom_col() +
scale_y_continuous("total pageviews", labels = polloi::compress) +
facet_wrap(~project) +
labs(title = "Comparison of June 2018 and June 2019 Pageviews by Projects") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_blank(),
        panel.grid = element_line("gray70"),
       axis.title.x=element_blank())

ggsave("Figures/pageviews_byproject_yoy .png", pageviews_byproject_yoy , width = 18, height = 9, units = "in", dpi = 150)

pageviews_byproject_yoy
In [114]:
#Daily pageviews by project
query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
project, SUM(view_Count)as views
FROM wmf.pageview_hourly
WHERE year = 2019 and month >= 5
AND agent_type='user'
AND access_method = 'desktop'
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
GROUP BY year, month, day, project;"
In [115]:
pageviews_byproject_daily <- wmf::query_hive(query)
In [116]:
pageviews_byproject_daily$date <- as.Date(pageviews_byproject_daily$date, format = "%Y-%m-%d")
In [117]:
p <- pageviews_byproject_daily %>%
filter(date < '2019-07-01',
    project %in% c('es.wikipedia', 'de.wikipedia', 'ja.wikipedia', 
      'fr.wikipedia', 'ru.wikipedia', 'commons.wikimedia', 'it.wikipedia', 'zh.wikipedia', 'pt.wikipedia')) %>%
  ggplot(aes(x=date, y =views, color = project)) +  
  geom_line() +
  scale_y_continuous("daily pageviews", labels = polloi::compress) +
  scale_x_date("Date", labels = date_format("%Y-%m-%d"), date_breaks = "1 week")+
  labs(title = "Total daily Wikimedia pageviews by Project") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1),
        panel.grid = element_line("gray70"))
p

Pageviews by Referer

In [123]:
#Pageviews by referer project, YoY and MoM changes
query <- "
SELECT referer_class, 
SUM(IF((month =5 and year = 2019), view_count, null)) AS may19_views,
SUM(IF((month =6 and year = 2019), view_count, null)) AS june19_views,
SUM(IF((month =6 and year = 2018), view_count, null)) AS june18_views
FROM wmf.pageview_hourly
WHERE
((year = 2018 and month =6) OR (year=2019 and (month = 5 OR month =6))) 
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND agent_type = 'user'
AND access_method = 'desktop'
GROUP BY referer_class;"
In [124]:
pageviews_byreferer <- wmf::query_hive(query)
In [297]:
top_declines_byreferer <- pageviews_byreferer %>%
group_by(referer_class) %>%
mutate(yoy_percent = (june19_views/june18_views-1)*100,
      mom_percent = (june19_views/may19_views-1)* 100) %>%
arrange(yoy_percent)


top_declines_byreferer
A grouped_df: 5 × 6
referer_classmay19_viewsjune19_viewsjune18_viewsyoy_percentmom_percent
<chr><dbl><dbl><dbl><dbl><dbl>
unknown 3717178 2677223 3551112-24.608883-27.977003
external 153716112 136872428 163903775-16.492205-10.957657
external (search engine)267589519422648968592640852562-14.236149-15.359284
none 138148159511972358141346170662-11.063593-13.336825
internal 241593659421888188372416471300 -9.420864 -9.400816
In [ ]:
There was a yoy decline across all referer_class types.
In [270]:
#Comparison of june 2018 and June 2019 Pageviews by Referer Class
pageviews_byreferer_yoy <- top_declines_byreferer %>%
filter(referer_class != "unknown") %>% #remove unknown referers
gather(year, views, june19_views:june18_views) %>%
ggplot(aes(x=year, y= views, fill = year)) +
geom_col() +
scale_y_continuous("total pageviews", labels = polloi::compress) +
facet_wrap(~referer_class) +
labs(title = "Comparison of June 2018 and June 2019 \n pageviews by referer class") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_blank(),
        panel.grid = element_line("gray70"),
       axis.title.x=element_blank())

pageviews_byreferer_yoy

I also reviewed a breakdown of pageviews by major search engine using the discovery wmflabs dashboard

There are no signficant changes across June seen for any of the major external serarch engines.

Pageviews by Weekday and Weekend

Historically, there are fewer views on desktop on the weekends compared to weekdays. To investigate if the higher number of weekend days in June led to a decrease I'll look at the total views on weekends and weekydays in June 2019 and compare to June 2018 last year.

I first calculated calulcated the average weekday and weekend pageviews in both June 2018 and June 2019.

In [129]:
## Collect daily pageviews on desktop in June 2018 and June 2019.

query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
SUM(view_count)as views
FROM wmf.pageview_hourly
WHERE ((year = 2019 and month = 6) OR (year = 2018 and month = 6))
AND agent_type='user'
AND access_method = 'desktop'
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
GROUP BY year, month, day;"
In [130]:
pageviews_daily <- wmf::query_hive(query)
In [131]:
pageviews_daily$date <- as.Date(pageviews_daily$date, format = "%Y-%m-%d")
In [139]:
#Insert day of the week
pageviews_daily_wday <- pageviews_daily
pageviews_daily_wday$day <- weekdays(as.Date(pageviews_daily_wday$date))
In [188]:
# Find average weekend and weekday views on desktop

wday_avg <- pageviews_daily_wday %>%
mutate(date = floor_date(date, "month"),
       day_type = ifelse(day %in% c('Saturday', 'Sunday'), 'weekend', 'weekday'),
      year = ifelse(date =='2019-06-01', 'June 2019', 'June 2018')) %>%
group_by(year, day_type)  %>%
summarise(avg_daily_views = mean(as.numeric(views)),
         total_views = sum(as.numeric(views)))
                         
wday_avg
A grouped_df: 4 × 4
yearday_typeavg_daily_viewstotal_views
<chr><chr><dbl><dbl>
June 2018weekday2335248754904022383
June 2018weekend1852141141666927028
June 2019weekday2086575644173151288
June 2019weekend1617349871617349873

There was a YoY decrease in both weekday and weekend pageviews in June consistent with an overall decrease in traffic.

There was one additional weekend day in June 2019 compared to June 2018. Since there are fewer views on desktop on the weekends compared to weekdays, this might have also attributed to a small portion of the overall YoY decline on desktop.

To calculate, I multiplied the difference in number of weekend days for each June (1) by the average 2019 difference in weekend and weekday daily pageviews to estimate the drop in pageviews from the additional weekend day.

In [158]:
#Count number of day types

wday_counts <- pageviews_daily_wday  %>%
mutate(date = floor_date(date, "month"),
       day_type = ifelse(day %in% c('Saturday', 'Sunday'), 'weekend', 'weekday'),
      year = ifelse(date =='2019-06-01', 'June 2019', 'June 2018')) %>%
group_by(year)  %>%
summarise(weekday_num = sum(day_type == 'weekday'),
         weekend_num = sum(day_type == 'weekend'))


wday_counts
A tibble: 2 × 3
yearweekday_numweekend_num
<chr><int><int>
June 201821 9
June 20192010
In [290]:
# Calculate pageview drop in June attibuted to number of weekends. 
# Proposed equation: Multiply the difference in weekend numbers by avg 2019 difference in weekend and weekday previews.

drop_num <- (wday_counts[2, 3] - wday_counts[1, 3]) * (wday_avg[3, 3] - wday_avg[4, 3])

#Find estimated percentage drop in June 2019 traffic due to additional weekend day 
#Note: 6570949411 is the total num of June 2018 pageviews)

drop_percent <- (drop_num/ 6570949411) *100
drop_percent
A data.frame: 1 × 1
weekend_num
<dbl>
0.7140913

The additional weekend day in June 2019 led to an estimated 46.9M pageview drop across all Wikimedia projects on desktop or about 0.7% decrease in overall desktop pageviews in June 2019.

Pageviews by OS Browser Family

In [279]:
#Find desktop pageviews by browser family to calculate yoy and mom changes 

query <- "
SELECT user_agent_map['browser_family'] as browser, 
SUM(IF((month =5 and year = 2019), view_count, null)) AS may19_views,
SUM(IF((month =6 and year = 2019), view_count, null)) AS june19_views,
SUM(IF((month =6 and year = 2018), view_count, null)) AS june18_views
FROM wmf.pageview_hourly
WHERE
((year = 2018 and month =6) OR (year=2019 and (month = 5 OR month =6))) 
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND agent_type = 'user'
AND access_method = 'desktop'
AND user_agent_map['browser_family'] IN ('IE', 'Chrome', 'Firefox', 'Safari', 'Edge', 'Other', 'Opera')
GROUP BY user_agent_map['browser_family'];"
In [280]:
pageviews_bybrowser <- wmf::query_hive(query)
In [281]:
head(pageviews_bybrowser)
A data.frame: 6 × 4
browsermay19_viewsjune19_viewsjune18_views
<chr><dbl><dbl><dbl>
Opera 160340508 155410004 135562031
Chrome 356625770430410012653353866096
IE 628724346 582400797 807325877
Firefox1150456571 9939843871135771806
Other 232294161 215284369 262134242
Safari 351045917 306683116 370938256
In [298]:
top_declines_bybrowser <- pageviews_bybrowser %>%
filter(june19_views > 1) %>%
group_by(browser) %>%
mutate(yoy_percent = (june19_views/june18_views-1)*100,
      mom_percent = (june19_views/may19_views-1)* 100) %>%
arrange(yoy_percent)


top_declines_bybrowser
A grouped_df: 7 × 6
browsermay19_viewsjune19_viewsjune18_viewsyoy_percentmom_percent
<chr><dbl><dbl><dbl><dbl><dbl>
IE 628724346 582400797 807325877-27.860507 -7.367863
Other 232294161 215284369 262134242-17.872474 -7.322522
Safari 351045917 306683116 370938256-17.322328-12.637321
Firefox1150456571 9939843871135771806-12.483795-13.600877
Chrome 356625770430410012653353866096 -9.328483-14.728505
Edge 317927056 282759705 275885045 2.491857-11.061453
Opera 160340508 155410004 135562031 14.641248 -3.075021
In [299]:
#Daily pageviews by browser
query <- "
SELECT CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
user_agent_map['browser_family'] as browser, SUM(view_Count)as views
FROM wmf.pageview_hourly
WHERE year = 2019 and month >= 5
AND agent_type='user'
AND access_method = 'desktop'
AND NOT (country_code IN ('PK', 'IR', 'AF') -- https://phabricator.wikimedia.org/T157404#3194046
AND user_agent_map['browser_family'] = 'IE') -- https://phabricator.wikimedia.org/T193578#4300284
AND user_agent_map['browser_family'] IN ('IE', 'Chrome', 'Firefox', 'Safari', 'Edge', 'Other', 'Opera')
GROUP BY year, month, day, user_agent_map['browser_family'];"
In [286]:
pageviews_bybrowser_daily <- wmf::query_hive(query)
In [287]:
pageviews_bybrowser_daily$date <- as.Date(pageviews_bybrowser_daily$date, format = "%Y-%m-%d")
In [288]:
p <- pageviews_bybrowser_daily %>%
filter(date < '2019-07-01') %>%
  ggplot(aes(x=date, y =views, color = browser)) +  
  geom_line() +
  scale_y_continuous("daily pageviews", labels = polloi::compress) +
  scale_x_date("Date", labels = date_format("%Y-%m-%d"), date_breaks = "1 week")+
  labs(title = "Total daily Wikimedia pageviews by Browser") +
  ggthemes::theme_tufte(base_size = 12, base_family = "Gill Sans") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1),
        panel.grid = element_line("gray70"))
p

There were YoY declines across all browser groups except Edge and Opera but no significants drops associated with any one browser family.