21:03:59 #startmeeting ArchCom RFC meeting: T145472: Survey Cookies/Local Storage usage on Wikimedia sites
21:03:59 Meeting started Wed Oct 19 21:03:59 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:59 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:59 The meeting name has been set to 'archcom_rfc_meeting__t145472__survey_cookies_local_storage_usage_on_wikimedia_sites'
21:03:59 T145472: Survey Cookies/Local Storage usage on Wikimedia sites - https://phabricator.wikimedia.org/T145472
21:04:07 o/
21:04:37 robla: I guess the bot being blocked from topic changes in this channel saves a step :/
21:05:54 hi zzhou_ !
21:05:58 hi everyone
21:06:43 gwicke, Krinkle and I just briefly discussed this in the ArchCom Planning meeting last hour
21:07:31 ...and I've been talking to bawolff and dapatrick about this for a little while
21:07:43 * bawolff waves
21:08:05 * dapatrick waves also
21:08:50 does anyone have any questions about this RFC before I prompt zzhou_ to ask the questions that he has?
21:10:19 I love the idea of automating the audit process.
21:10:34 But I wonder who is going to watch the logs?
21:10:37 (Hi All)
21:10:56 and whether we are going to flood ourselves with spam each time a new cookie is introduced
21:11:31 so I am not sure who will watch the logs right now - probably someone on the legal team?
21:11:37 ??
21:11:40 and ideally over time people will be more careful about introducing these, so
21:11:49 that there will be fewer alerts over time
21:12:00 as we better communicate to them about this issue
21:12:14 of course the first step is to understand the issue (hence this RFC) and figure out the scale of it
21:12:43 if there really are that many cookies that we need to track down, we might have to think of more scalable ways of managing this for now
21:14:05 bawolff: can you describe how you used mwgrep to help out zzhou_ ?
21:14:42 I'm not so much concerned about distinct cookies. It's more about the sheer number of requests that are likely to come between the cookie being introduced and a change to the extension to make it behave as expected.
21:15:05 I think that it's necessary to have some process/application which consumes the logged cookie information and stores unique cookies, the associated wiki name, and the number of times observed. But I'm getting a little ahead of where we are in the meeting.
21:15:18 we may have to sample the logs
21:15:25 We tried to find instances of setting cookies in JS pages on wiki
21:15:50 mwgrep returned something like 1500 results
21:15:56 #idea test this out on meta or mediawiki before completely rolling the change out
21:16:00 o_O
21:16:20 so that sort of simple static analysis was kind of unfeasible
21:16:25 bawolff: mostly gadgets storing state?
21:16:43 Seemed like it
21:17:16 often the same gadget or similar gadgets copied across multiple wikis
21:18:37 dapatrick: talk to tgr and get Sentry set up in prod :)
21:18:39 #info We tried to find instances of setting cookies in JS pages on wiki; mwgrep returned like 1500 results
21:18:58 bd808 Will do.
21:19:31 #info 14:18:36 dapatrick: talk to tgr and get sentry setup in prod :)
21:19:32 One concern I have is that it seems like we are approaching this backwards - we want to know when personal information is stored, so we are looking at cookies
21:19:49 but... really cookies are just a means
21:20:18 and it's no different if personal info is stored some other way
21:20:36 So it feels like we are looking at a symptom
21:20:43 well, in my mind cookies == correlation == possible tracking
21:20:50 a gadget could store personal info in an API preference?
21:20:59 but I don't have any better suggestions to address this issue
21:21:15 Krenair: yeah. Or in a public wiki page
21:21:23 database and EL schema audits?
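The static-analysis approach described above — grepping on-wiki JavaScript for cookie-setting calls — might look roughly like the sketch below. This is a hypothetical illustration only: the actual mwgrep query is not shown in the log, and the regex here is merely the kind of "naive regex that probably missed a lot" mentioned later in the discussion. The gadget snippet and all names are invented.

```javascript
// Patterns that suggest a cookie or local-storage write in gadget code.
// Assumed for illustration; the real mwgrep query may differ.
const COOKIE_SET_RE = new RegExp(
  [
    'document\\.cookie\\s*=',        // raw assignment
    '\\$\\.cookie\\s*\\(',           // jquery.cookie plugin
    'mw\\.cookie\\.set\\s*\\(',      // mediawiki.cookie module
    'localStorage\\.setItem\\s*\\('  // Web Storage writes
  ].join('|')
);

// Return 1-based line numbers that appear to set a cookie.
function findCookieWrites(jsSource) {
  return jsSource
    .split('\n')
    .map((line, i) => (COOKIE_SET_RE.test(line) ? i + 1 : 0))
    .filter(Boolean);
}

// Invented gadget fragment standing in for on-wiki MediaWiki: namespace JS.
const gadget = [
  "// remember the reader's chosen skin variant",
  "$.cookie('gadget-skinVariant', value, {expires: 30});",
  "mw.cookie.set('hideBanner', '1');"
].join('\n');
console.log(findCookieWrites(gadget)); // -> [ 2, 3 ]
```

Running something like this over every page in the MediaWiki: namespace across all wikis is what produced the ~1500 hits; note it flags call sites, not unique cookie names.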
21:21:35 or probably other ways I haven't thought of
21:22:47 so one reason for tracking down cookies/local storage is that we currently have a table that lists all the cookies/local storage we use, so ideally that information will be up-to-date: https://wikimediafoundation.org/wiki/Cookie_statement#3._What_types_of_cookies_does_Wikimedia_use.3F
21:24:39 if we found out a gadget is storing cookies on, say, the Javanese Wikisource, what would we do about it?
21:25:22 presumably the same thing we'd do if we found one on the English Wikisource
21:25:27 seems to me like we are planning to collect loads of non-actionable data
21:25:54 zzhou_, for discussion's sake, could we include a statement on that page that says something to the effect of 'WMF uses these cookies, but there may be others created by gadgets, extensions, etc. deployed by administrators of individual wikis/projects'?
21:25:59 If this is being set for, say, all users of the Javanese Wikisource automatically, without their consent, ideally we would list that in the cookies table
21:26:27 dapatrick: that's also a possibility
21:26:43 how do you check whether it's set for all users and without consent?
21:27:04 but I think perhaps after we figure out the scope of the issue
21:27:08 tgr, I think the answer there is source code and settings analysis.
21:27:13 I don't think there is any instance of any cookie anywhere that asks for consent
21:27:39 there could be implied consent when you use cookie statements like this
21:27:47 or at least a warning to the user
21:27:59 extensions aren't deployed by administrators of individual wikis/projects
21:28:29 a script that just loads automatically when a user visits a wiki page, without the end user knowing about it, would be more problematic
21:28:36 Thanks for that clarification, Krenair. This is not the final wording of such an addition.
21:29:47 who is going to be doing all this log checking and source code analysis?
21:30:41 IMO 1) looking through gadget code on thousands of wikis (possibly written in the local language, possibly broken for ages and/or no one still active knowing what it does) is not realistic
21:31:23 2) even if you want to do that, logging cookies does not seem like very helpful data for that kind of review
21:32:00 I guess one could do horrible hacks with replacing document.cookie and then logging stack traces
21:33:05 If you assume cookie names are relatively unique, having the cookie name is a good start to finding the relevant code
21:33:23 but indeed, not an easy task
21:34:16 I'm sitting in the same room with zzhou_ now, and I'm going to try to restate the point he's trying to make
21:34:26 the JS code might be loaded from another wiki or an external domain
21:34:38 zzhou_: I noticed that, in addition to having a law degree from Columbia, you "spent a semester studying Chinese law at Peking University in Beijing". Will the cookie questions you're asking have differential potential effects for Wikipedia/Wikimedia working in China, do you think? Are these considerations to plan for in any way - both legally and in the various languages in China too?
21:34:44 (well, hopefully not external external, but Tool Labs)
21:35:03 it might come from a browser plugin etc.
21:35:24 bawolff gave zzhou_ the output of mwgrep, which is basically just a list of cookie-setting calls in the MediaWiki: namespace on all of our wikis (I think)
21:35:27 CSP will hopefully solve the external problem one day :p
21:35:48 bawolff's run gave back 1500 results
21:36:10 zzhou_ is basically saying "1500 is a lot, but *that's* manageable, right?"
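The "horrible hack" of replacing `document.cookie` and logging stack traces could be sketched as below. To stay runnable outside a browser this mocks `document` with a plain object; in a real page you would instead wrap the accessor obtained from `Object.getOwnPropertyDescriptor(Document.prototype, 'cookie')`. The cookie name is invented, and nothing here is an actual Wikimedia deployment.

```javascript
const audit = [];  // observed cookie writes, with call-site stack traces
let jar = '';      // minimal stand-in for the browser's cookie store

// Mock document; in a browser this defineProperty would target the real
// document and delegate to the original accessor instead of `jar`.
const document = {};
Object.defineProperty(document, 'cookie', {
  get() { return jar; },
  set(value) {
    audit.push({
      name: value.split('=')[0].trim(),  // cookie name being set
      stack: new Error().stack,          // where the write came from
    });
    // keep only the name=value pair, dropping attributes like path
    const pair = value.split(';')[0];
    jar = jar ? jar + '; ' + pair : pair;
  },
});

// Simulated gadget write (hypothetical cookie name):
document.cookie = 'hideFundraisingBanner=1; path=/';
console.log(audit[0].name); // -> hideFundraisingBanner
```

The stack trace is what would let an auditor map a logged cookie name back to the gadget that set it, which is exactly the hard step discussed below.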
21:36:47 tgr: that'd have to be a pretty broken browser plugin, but presumably that'd only be a small number of users, so in the long tail
21:36:47 yea, I am saying even if we end up with a list of 1500 unique cookies, we can take time to go through them
21:37:06 I am not sure we will have this table anymore at that point, since it is too large
21:37:08 zzhou_: I guess that depends on who has to do that audit and what it keeps them from doing otherwise
21:37:41 where "a list of 1500 unique cookies" is "1500 places in the MediaWiki: namespace JavaScript that seem to be setting cookies"
21:38:07 how much time would you estimate for dealing with one cookie?
21:38:22 Based on a naive regex that probably missed a lot
21:38:24 Would anyone be viewing the info received for the cookies?
21:38:39 yea, I wasn't proposing someone go over 1500 cookies necessarily - I think potentially past a certain large number, we will just rethink our strategy of listing all the cookies
21:39:25 the point isn't just to list them though, is it? it's to audit why they exist
21:39:50 probably most of the cookies are opt-in (even if the user is not specifically told they are opting into a cookie, they would have to enable a gadget or something)
21:39:52 and likely to stop using them if there isn't a very good reason?
21:40:00 not necessarily, since we don't even know the scale of the issue yet
21:40:27 and to clarify, by *1500 unique cookies* I meant 1500 unique cookie names
21:41:07 If there will be people who aren't employees viewing the info received, I would suggest having some sort of confidentiality document (I'm not a lawyer/legal team member, but that's just my 2 cents)
21:41:33 Zppix|mobile, zzhou_ is on the legal team.
21:41:37 so the logging is just going to end up with a set of N strings. Then someone will need to pore through source code on-wiki and on the server side to see if they can find those same strings
21:42:04 Then they will need to determine who "owns" the code that sets the cookie
21:42:08 Zppix|mobile: Also, that's what the generic NDAs cover anyway
21:42:09 So what is the ultimate goal we have here?
21:42:18 and then contact those persons to find out why they are doing so
21:42:25 bd808: Right, then determine from source code, documentation, or conversation with the project owner the reason for the existence of the cookie.
21:42:43 bd808, sorry, what you said when you finished your thought. :)
21:42:45 bd808: correct, but potentially a lot of the scripts are just copies of one another and are really using the same cookie names, so maybe we don't have as many distinct cookies as the mwgrep output suggests
21:42:57 Do we basically want to explain ourselves to our users(?)
21:43:02 did you not run it through uniq?
21:43:17 even if someone gets really, really good at that process, that's going to take an hour a cookie
21:43:40 Nearly a year of full-time work
21:43:57 Could a bot handle the tedious source editing, or no?
21:44:28 Reedy: at the time the output didn't seem amenable to processing like that
21:44:34 at least not easily
21:44:41 Zppix: no
21:45:15 where's the list?
21:45:20 ^
21:45:42 Currently only on a private email thread
21:45:58 * robla would love to make mwgrep public, and short of that, make it so that we run mwgrep scans and publish the static logs
21:46:00 i can pastebin it once i find it again
21:46:10 there's a ticket for that, robla
21:46:37 Also needs my no private patch merging ;)
21:46:38 Krenair: I heard about that from Krinkle... please do tell!
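The pipeline just outlined — log raw cookie observations, then reduce them ("run it through uniq") to unique names with per-wiki counts — could look like the following sketch. The wiki and cookie names are invented for illustration; the real logging infrastructure is not specified in this discussion.

```javascript
// Collapse raw logged (wiki, cookie-name) observations into unique names
// with occurrence counts, as proposed earlier in the meeting.
function summarise(observations) {
  const counts = new Map();
  for (const { wiki, name } of observations) {
    const entry = counts.get(name) || { name, wikis: new Set(), seen: 0 };
    entry.wikis.add(wiki);
    entry.seen += 1;
    counts.set(name, entry);
  }
  // most frequently observed cookie names first
  return [...counts.values()].sort((a, b) => b.seen - a.seen);
}

// Invented sample of what the sampled logs might contain:
const logged = [
  { wiki: 'enwiki', name: 'hideBanner' },
  { wiki: 'jawiki', name: 'hideBanner' },
  { wiki: 'enwiki', name: 'gadget-navpop' },
  { wiki: 'enwiki', name: 'hideBanner' },
];
const summary = summarise(logged);
console.log(summary[0].name, summary[0].seen); // -> hideBanner 3
```

The output is the "set of N strings" (plus counts) that an auditor would then trace back to source code, which is where the hour-per-cookie estimate comes in.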
21:46:56 so I guess working from the mwgrep list is not realistic; that leaves logging what cookies are set, doing some sort of honeypot approach, working with the community and leaving it to them to identify cookies, or just ignoring the issue
21:47:00 Is the main time consumer translation for the cookie? I can't think of any other reason
21:47:22 robla, https://phabricator.wikimedia.org/T71489
21:48:18 So umm. What about if we just put the cookie table on meta, and tell people to add items when they introduce new cookies
21:48:32 !bug 1 | bawolff
21:48:32 bawolff: https://bugzilla.wikimedia.org/show_bug?id=1
21:48:39 and then use cookie logging to gauge how complete the table is
21:50:01 * robla is sad that the bug 1 link above doesn't go to https://phabricator.wikimedia.org/T2001
21:50:23 bawolff: you mean a separate table to help us chase down the cookies (not the cookies table for the end user we have right now)?
21:50:29 !botbrain
21:51:17 !bug del
21:51:18 Sorry, you are not authorized to perform this
21:51:29 !bug del
21:51:29 Sorry, you are not authorized to perform this
21:51:30 just needs !bug is https://bugzilla.wikimedia.org/$1
21:51:32 Lol
21:52:00 say we get a table with 100 cookies and we log 1000 unique cookie names (let's optimistically assume there are no dynamically named cookies)
21:52:01 zzhou_: a crowd-sourced table
21:52:09 Maybe !bug should change from bugzilla.wikimedia to phabricator.wikimedia
21:52:10 again, what would we do with the data?
21:52:25 Zppix|mobile: bugzilla had bugs, phab has tasks
21:52:26 would someone have to go through the 900 missing names and check?
21:52:32 if the url is right, it will redirect correctly
21:52:41 tgr, Yes.
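The crowd-sourced-table idea — keep the table on meta, and use cookie logging to gauge how complete it is — reduces to a set difference between documented and observed names. A minimal sketch, with invented names and numbers (the 100-vs-1000 figures above are hypotheticals from the discussion, not measurements):

```javascript
// Compare cookie names seen in the logs against a crowd-sourced table to
// gauge how complete the table is.
function tableCoverage(documented, observed) {
  const known = new Set(documented);
  const uniqueObserved = [...new Set(observed)];
  const undocumented = uniqueObserved.filter((n) => !known.has(n));
  return {
    undocumented,
    coverage: (uniqueObserved.length - undocumented.length) / uniqueObserved.length,
  };
}

// Invented examples standing in for the table and the logs:
const documented = ['centralnotice_bucket', 'hideBanner'];
const observed = ['hideBanner', 'hideBanner', 'gadget-navpop', 'centralnotice_bucket'];
const { undocumented, coverage } = tableCoverage(documented, observed);
console.log(undocumented); // -> [ 'gadget-navpop' ]
```

The "900 missing names" question below is exactly the `undocumented` list this produces: someone still has to decide what to do with each entry.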
21:52:47 Zppix: but then we can't make snide comments about bug #1
21:52:50 * robla has a meeting to go to in 7 minutes, so will end this abruptly
21:52:55 :p
21:53:13 also, we can keep the conversation generally going on Phab and in #wikimedia-tech
21:53:29 what are the chances of ending up with an amount of data that does not take man-months to sort through?
21:53:51 tgr: if we have that many cookies, we might need some sort of disclaimer like dapatrick suggested earlier, as it would not be feasible to go over all that many, and furthermore, it is not disclosure to the user if we just present them with a list of 1000 cookies
21:54:21 zzhou_: so can we just start with that disclaimer and skip the intermediate steps? :)
21:55:07 that's an option - it's definitely less ideal than having a cookies table that's up to date
21:55:27 (assuming the size of the table is still limited)
21:56:52 180 seconds until abrupt end of meeting....
21:57:01 does everyone think it is likely we have many hundreds to thousands of unique cookie (names) lying around?
21:57:13 Perfectly timed with my battery dying ;)
21:57:27 the list sorted and uniq'd will remove dupes
21:58:16 #info next week's tentative topic: T138783 SVG stuff
21:58:16 T138783: SVG Upload should (optionally) allow the xhtml namespace - https://phabricator.wikimedia.org/T138783
21:58:18 I think we need the logging to find out, honestly. Probably not too hard to add into the WikimediaMessages extension or something similar
21:58:19 Reedy: yea, perhaps that's the first step
21:58:19 I think the distribution will have a long tail
21:58:21 zzhou_: I believe there may be a possibly untenable number. I do not believe it will be many hundreds of thousands.
21:58:36 sorry, I meant hundreds to thousands ;)
21:58:58 Ah. I also read you wrong.
21:59:07 across all projects and languages? I wouldn't doubt high hundreds
21:59:10 45 seconds to end of meeting
21:59:18 Hundreds to thousands is about what I expect.
21:59:19 yeah, the long tail will be long
21:59:27 ok
21:59:34 thanks everyone! those that want to keep talking can use #wikimedia-tech
22:00:04 alright, I will pop over to #wikimedia-tech in case people have time; I want to follow up a little
22:00:07 #endmeeting