Mikhail Popov
Discovery / Analysis
Wikimedia Foundation
The Wikidata Query Service (WDQS) is designed to let users run queries on the data contained in Wikidata. The service uses SPARQL Protocol and RDF Query Language (SPARQL) as the query language.
The number of users of the service has fallen since the announcement, hovering at around 100 users per day in recent weeks.
The lower and upper bounds represent the first and third quartiles (25% and 75%). Here we can see that the number of queries per user has stabilized considerably since the announcement, mostly because, prior to the announcement, the queries came from a few bots testing the service.
WDQS users are a very geographically diverse bunch! In fact, 73 different countries¹ were represented between August 23rd and October 4th.
The U.S., the U.K., Germany, and France are the top-represented countries, with the U.S. leading the pack.
Chrome and Firefox are, unsurprisingly, WDQS users’ preferred browsers.
Windows 7 and Mac OS X are by far the most popular operating systems among WDQS users.
Perhaps the biggest challenge of working with this dataset was the fact that a lot of the queries that our users ran were just sample queries we provide on query.wikidata.org or examples found on MediaWiki/Wikitech. Therefore, we put together a procedure for detecting whether a query is an example or not.
We were able to find a few queries and manually mark them as examples. Other queries that were perfect matches to these manually verified queries were marked as “definitely an example”.
First, we compiled a list of example queries, stripped out extraneous spaces (“condensed”), and cropped them using the maximum length of condensed user-submitted queries.
The procedure can be described as follows. For each user-submitted query \(Q\):
That is, if a query is at least 4 times more likely to be an example than not, it makes sense to say it’s probably an example.
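The individual steps of the procedure are not reproduced in this text, so the following Python sketch only illustrates how the pieces described here could fit together: condensing, cropping to a maximum length, treating perfect matches as "definitely yes", and applying the 4-to-1 odds rule. Everything else is an assumption made for illustration: the function names `condense` and `classify`, the use of `difflib.SequenceMatcher` as a stand-in for the match probability, and the cut-off between "maybe" and "definitely no". It is not the matching method used for the figures in this report.

```python
import math
import re
from difflib import SequenceMatcher


def condense(query: str) -> str:
    """Collapse runs of whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", query).strip()


def classify(query, examples, max_len, maybe_odds=1.0, probable_odds=4.0):
    """Label a query; only the 4-to-1 rule comes from the report."""
    q = condense(query)[:max_len]
    if not q:
        return "definitely no"
    best = 0.0
    for example in examples:
        # Examples are condensed and cropped the same way as user queries.
        e = condense(example)[:max_len]
        if q == e:
            return "definitely yes"  # perfect match to a known example
        # ASSUMPTION: difflib's similarity ratio stands in for the
        # probability that the query is this example.
        best = max(best, SequenceMatcher(None, q, e, autojunk=False).ratio())
    # Odds of "example" versus "not an example".
    odds = math.inf if best >= 1.0 else best / (1.0 - best)
    if odds >= probable_odds:
        return "probably yes"
    if odds >= maybe_odds:  # ASSUMPTION: hypothetical cut-off for "maybe"
        return "maybe"
    return "definitely no"


# Hypothetical example list and user queries, for illustration only.
examples = ["SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }"]
for user_query in [
    "SELECT  ?item\nWHERE { ?item wdt:P31 wd:Q146 }",
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }",
    "Berlin",
]:
    print(classify(user_query, examples, max_len=500))
```

With a similarity of p, the odds are p / (1 - p), so the 4-to-1 rule corresponds to a similarity of at least 0.8 under this stand-in measure.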
Total queries (black dashes) over time and how many were the sample queries we provided for demonstration.
Is the query an example? | Total queries | % of total |
---|---|---|
definitely no | 21487 | 56.48 |
definitely yes | 651 | 1.71 |
maybe | 7955 | 20.91 |
probably yes | 7949 | 20.90 |
For many of the statistical breakdowns in this report, we will restrict ourselves to queries that are definitely not examples we provided.
Varying patterns of WDQS usage by country (top 5 countries, over time). Purple dashes mark the public announcement.
Varying patterns of WDQS unique users by country (top 5 countries, over time). Purple dashes mark the public announcement. Interestingly, South Korea is a top 5 country by usage despite having barely any users.
total queries | os | browser_major | country |
---|---|---|---|
2126 | Other | Other Other | Germany |
1340 | Other | Other Other | Unknown |
1339 | Other | Other Other | Unknown |
1281 | Other | Other Other | Unknown |
1222 | Other | Other Other | Unknown |
1163 | Other | Other Other | Unknown |
972 | Windows 7 | Chrome 45 | United Kingdom |
867 | Other | Other Other | Unknown |
794 | Other | Other Other | Unknown |
782 | Mac OS X | Safari 7 | United States |
702 | Other | Other Other | Unknown |
676 | Other | Other Other | Unknown |
627 | Other | Other Other | Unknown |
555 | Other | Other Other | Unknown |
473 | Other | Other Other | Unknown |
469 | Windows 7 | Firefox 40 | United Kingdom |
468 | Windows 7 | Chrome 45 | Republic of Korea |
465 | Other | Other Other | Unknown |
433 | Windows 8.1 | Firefox 41 | France |
422 | Windows XP | Firefox 40 | Switzerland |
median queries per day | os | browser_major | country |
---|---|---|---|
172.0 | Windows 7 | Chrome 45 | Republic of Korea |
164.0 | Windows 8.1 | Chrome 45 | Republic of Korea |
126.0 | Windows 7 | Chrome 45 | United Kingdom |
110.0 | Windows 7 | Chrome 46 | United States |
103.0 | Other | Other Other | United States |
101.0 | Ubuntu | Chromium 44 | France |
99.0 | Linux | Firefox 42 | France |
97.0 | Windows 7 | Chrome 45 | Republic of Korea |
92.0 | Windows 7 | Chrome 45 | Republic of Korea |
90.0 | Windows 8.1 | Chrome 45 | Spain |
89.0 | Linux | Firefox 43 | United States |
85.5 | Windows 7 | Firefox 40 | United Kingdom |
82.0 | Windows 7 | Firefox 40 | Belgium |
81.0 | Windows 7 | Chrome 45 | Germany |
68.0 | Windows 10 | Chrome 45 | Republic of Korea |
67.0 | Other | Other Other | Unknown |
66.0 | Other | Other Other | Unknown |
62.0 | Mac OS X | Safari 7 | United States |
60.0 | Other | Other Other | Unknown |
60.0 | Windows 7 | Firefox 40 | Austria |
total queries | median queries per day | os | browser_major | country |
---|---|---|---|---|
1340 | 60.0 | Other | Other Other | Unknown |
1281 | 67.0 | Other | Other Other | Unknown |
782 | 62.0 | Mac OS X | Safari 7 | United States |
469 | 85.5 | Windows 7 | Firefox 40 | United Kingdom |
468 | 172.0 | Windows 7 | Chrome 45 | Republic of Korea |
url | total | % of total |
---|---|---|
– | 13269 | 88.13 |
https://query.wikidata.org/ | 661 | 4.39 |
https://query.wikidata.org/#…QUERY… | 355 | 2.36 |
https://www.wikidata.org/wiki/Wikidata:Data_access | 172 | 1.14 |
https://www.facebook.com/gertruda | 70 | 0.46 |
https://query.wikidata.org | 46 | 0.31 |
https://tools.wmflabs.org/ppp-sparql/ | 37 | 0.25 |
http://www.traackr.com/ | 29 | 0.19 |
http://www.google.com/ | 28 | 0.19 |
http://longurl.org | 26 | 0.17 |
https://m.facebook.com/?_rdr | 26 | 0.17 |
https://www.facebook.com/ | 24 | 0.16 |
http://jamiembrown.com/ | 21 | 0.14 |
http://tools.wmflabs.org/wdq2sparql/w2s.php | 20 | 0.13 |
https://accounts.google.com/ServiceLogin?service=m… | 15 | 0.10 |
https://www.rebelmouse.com/ | 15 | 0.10 |
https://en.planet.wikimedia.org/ | 14 | 0.09 |
https://www.google.com/ | 13 | 0.09 |
https://www.google.com/?gws_rd=ssl#q=news | 10 | 0.07 |
https://www.wikidata.org/wiki/Wikidata:Bistro | 10 | 0.07 |
domain | total | % of total |
---|---|---|
– | 13269 | 88.13 |
query.wikidata.org | 1063 | 7.06 |
www.wikidata.org | 192 | 1.28 |
www.facebook.com | 103 | 0.68 |
tools.wmflabs.org | 65 | 0.43 |
www.google.com | 57 | 0.38 |
www.traackr.com | 29 | 0.19 |
m.facebook.com | 28 | 0.19 |
longurl.org | 26 | 0.17 |
jamiembrown.com | 21 | 0.14 |
accounts.google.com | 18 | 0.12 |
www.rebelmouse.com | 15 | 0.10 |
en.planet.wikimedia.org | 14 | 0.09 |
www.google.de | 13 | 0.09 |
magnusmanske.de | 9 | 0.06 |
www.google.it | 9 | 0.06 |
it.wikisource.org | 8 | 0.05 |
www.google.co.uk | 7 | 0.05 |
www.google.nl | 7 | 0.05 |
www.google.fr | 6 | 0.04 |
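The domain-level table can be derived from the URL-level one by keeping only the host part of each referer. The sketch below shows one way to do that with Python's standard library. The function names `referer_domain` and `tabulate_domains` are hypothetical, and treating "–" as the marker for a missing referer is an assumption based on the tables above.

```python
from collections import Counter
from urllib.parse import urlsplit

NO_REFERER = "–"  # ASSUMPTION: marker used in the tables for a missing referer


def referer_domain(url: str) -> str:
    """Return the host of a referer URL, or the no-referer marker."""
    if not url or url == NO_REFERER:
        return NO_REFERER
    # urlsplit ignores the path, query string and fragment for .hostname
    return urlsplit(url).hostname or NO_REFERER


def tabulate_domains(referers):
    """Count referers per domain and compute each domain's share in percent."""
    counts = Counter(referer_domain(url) for url in referers)
    total = sum(counts.values())
    return [(domain, n, round(100 * n / total, 2))
            for domain, n in counts.most_common()]


# Hypothetical input: one referer string per logged request.
sample = [
    "–", "–", "–",
    "https://query.wikidata.org/",
    "https://query.wikidata.org",
    "https://www.wikidata.org/wiki/Wikidata:Data_access",
]
for row in tabulate_domains(sample):
    print(row)
```

Grouping by host is why the three `query.wikidata.org` variants in the URL table collapse into a single row in the domain table.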
…from ourselves:
…from others:
Note that this is an interactive graph like the ones we use in Discovery Dashboards. Mouse-over to see the values of the time series in the legend. You can also zoom in on a particular range. (Zoom out by double-clicking.)
The referers were shortened for privacy and space reasons as they contained queries.
We can see multiple modes in the distribution of query lengths, which suggests that the distribution is a mixture of several distributions. The next step is to use a clustering algorithm to separate the distributions out into distinct groups. For this task, we chose a model-based clustering algorithm.
We performed model-based clustering on the log10-transformed character counts of condensed queries that were “definitely not” sample queries we provided. (Model-based clustering relies on Gaussian mixture models, so the log10 transformation was employed to correct for the right-skewness and make the data approximately Normal.)
The optimal model was a 2-component univariate mixture with unequal variances. The centers of the 2 clusters (on the raw scale) are 97 and 576 characters.
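To make the idea concrete, here is a minimal pure-Python sketch of fitting a 2-component univariate Gaussian mixture with unequal variances by expectation-maximization on log10 query lengths. It is not the software used for this report: the number of components is fixed at 2 rather than selected by comparing models, the function names are hypothetical, and the data are synthetic values drawn around the two reported centers purely for demonstration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, iterations=100):
    """EM for a 2-component univariate mixture with unequal variances."""
    xs = sorted(values)
    n, half = len(xs), len(xs) // 2
    # Initialise the means from the lower and upper halves of the data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    sd0 = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    sds, weights = [sd0, sd0], [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = (dens[0] + dens[1]) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and standard deviation per component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


# SYNTHETIC log10 lengths (not the report's data), centred on 97 and 576.
rng = random.Random(0)
log_lengths = ([rng.gauss(math.log10(97), 0.20) for _ in range(600)]
               + [rng.gauss(math.log10(576), 0.15) for _ in range(400)])
weights, means, sds = fit_two_gaussians(log_lengths)
# Back-transform the fitted means to the raw character scale.
print(sorted(round(10 ** m) for m in means))
```

Because the fit is done on the log10 scale, the reported centers are back-transformed with `10 ** mean`.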
a | _O_O | cpi | Django | Poland |
{ | isis | house | PREFIX | ASK {} |
l | Paris | turku | Fruits | iraq |
GIS | Putin | Berlin | Hitman | {SPAQL |
Q975 | qqq | Darwin | SELECT | wallera |
http://tago01.inttech.flab.fujitsu.co.jp/lod4all/api/search.cgi?appID=xawsaykmcb&type=sparql&format=json&query=select+*+where++%7B%0A++++++++++++%7B%3Chttp%3A%2F%2Fja.dbpedia.org%2Fresource%2F%E6%9D%B1%E4%BA%AC%3E+%3Fp+%3Fo+.+%7D%0AUNION+%7B+%3Chttp%3A%2F%2Fja.dbpedia.org%2Fresource%2FTokyo%3E+%3Fp+%3Fo+.+%7D%0A+++++++++++++++++++++++++++%7D
======================
DESCRIBE <https://be-tarask.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D1%8D%D0%B3%D0%BE%D1%80%D1%8B%D1%8F%3A%D0%9A%D0%BE%D1%81%D0%BC%D0%B0%D1%81>
======================
DESCRIBE <https://mk.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%98%D0%B0%3A%D0%92%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%B0>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
prefix wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?s WHERE {
?s p:P551 ?st .
?st q:P580 ?t .
OPTIONAL { ?st q:P582 ?t2 }
?st wikibase:rank wikibase:NormalRank.
FILTER(!bound(?t2))
?s p:P551 ?st2 .
FILTER(?st2 != ?st)
OPTIONAL { ?s wdt:P570 ?d }
FILTER(!bound(?d))
?st2 wikibase:rank wikibase:NormalRank.
FILTER NOT EXISTS { VALUES ?s { wd:Q49601 wd:Q2832892 wd:Q5249867 wd:Q6174253 wd:Q11994910 wd:Q18588524 wd:Q18582082 wd:Q18643319 wd:Q17100851 wd:Q20820956 wd:Q360 wd:Q6294 wd:Q19799971 wd:Q37876 wd:Q2835169 wd:Q16247359 wd:Q18643303 wd:Q15407925 wd:Q6783051 wd:Q18676928 wd:Q11982912 wd:Q18587441 wd:Q18643324 wd:Q17114155 wd:Q6880994 wd:Q18642095 wd:Q18588176 wd:Q18643322 wd:Q18638333 wd:Q170581 wd:Q18325835 wd:Q18585872 wd:Q873 wd:Q18610003 wd:Q18663763 wd:Q6767478 wd:Q7035242 wd:Q19848 wd:Q2643843 wd:Q18587597 wd:Q13424289 wd:Q16727834 wd:Q18643302 wd:Q19562542 wd:Q18643276 wd:Q1183503 wd:Q11971872 wd:Q20310566 wd:Q19562540 wd:Q18581089 wd:Q19666600 wd:Q6111740 wd:Q19562683 wd:Q20754694 wd:Q454970 wd:Q19562535 wd:Q17114066 wd:Q18643281 wd:Q37079 wd:Q16172281 wd:Q16165619 wd:Q5543184 wd:Q122229 wd:Q18643273 wd:Q55800 wd:Q18643272 wd:Q15430322 } }
} LIMIT 10
======================
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
prefix wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?s WHERE {
?s p:P551 ?st .
?st q:P580 ?t .
OPTIONAL { ?st q:P582 ?t2 }
?st wikibase:rank wikibase:NormalRank.
FILTER(!bound(?t2))
?s p:P551 ?st2 .
FILTER(?st2 != ?st)
OPTIONAL { ?s wdt:P570 ?d }
FILTER(!bound(?d))
?st2 wikibase:rank wikibase:NormalRank.
FILTER NOT EXISTS { VALUES ?s { wd:Q49601 wd:Q2832892 wd:Q5249867 wd:Q6174253 wd:Q11994910 wd:Q18588524 wd:Q18582082 wd:Q18643319 wd:Q17100851 wd:Q20820956 wd:Q360 wd:Q6294 wd:Q19799971 wd:Q37876 wd:Q2835169 wd:Q16247359 wd:Q18643303 wd:Q15407925 wd:Q6783051 wd:Q18676928 wd:Q11982912 wd:Q18587441 wd:Q18643324 wd:Q17114155 wd:Q6880994 wd:Q18642095 wd:Q18588176 wd:Q18643322 wd:Q18638333 wd:Q170581 wd:Q18325835 wd:Q18585872 wd:Q873 wd:Q18610003 wd:Q18663763 wd:Q6767478 wd:Q7035242 wd:Q19848 wd:Q2643843 wd:Q18587597 wd:Q13424289 wd:Q16727834 wd:Q18643302 wd:Q19562542 wd:Q18643276 wd:Q1183503 wd:Q11971872 wd:Q20310566 wd:Q19562540 wd:Q18581089 wd:Q19666600 wd:Q6111740 wd:Q19562683 wd:Q20754694 wd:Q454970 wd:Q19562535 wd:Q17114066 wd:Q18643281 wd:Q37079 wd:Q16172281 wd:Q16165619 wd:Q5543184 wd:Q122229 wd:Q18643273 wd:Q55800 wd:Q18643272 wd:Q15430322 wd:Q1780654 } }
} LIMIT 10
======================
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?x ?y ?z WHERE {
w
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?sl ?o WHERE {
?s wdt:P2005 ?o .
?s rdfs:label ?sl .
FILTER (lang(?sl)='ko')
}limit 1000
======================
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?p ?o
WHERE {<http://www.wikidata.org/prop/direct/P354> ?p ?o}
======================
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?sl ?o WHERE {
?s wdt:P351 ?o .
?s rdfs:label ?sl .
FILTER (lang(?sl)='en')
}limit 1000
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
prefix wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?s WHERE {
?s p:P551 ?st .
?st q:P580 ?t .
OPTIONAL { ?st q:P582 ?t2 }
?st wikibase:rank wikibase:NormalRank.
FILTER(!bound(?t2))
?s p:P551 ?st2 .
FILTER(?st2 != ?st)
OPTIONAL { ?s wdt:P570 ?d }
FILTER(!bound(?d))
?st2 wikibase:rank wikibase:NormalRank.
FILTER NOT EXISTS { VALUES ?s { wd:Q49601 wd:Q2832892 wd:Q5249867 wd:Q6174253 wd:Q11994910 wd:Q18588524 wd:Q18582082 wd:Q18643319 wd:Q17100851 wd:Q20820956 wd:Q360 wd:Q6294 wd:Q19799971 wd:Q37876 wd:Q2835169 wd:Q16247359 wd:Q18643303 wd:Q15407925 wd:Q6783051 wd:Q18676928 wd:Q11982912 wd:Q18587441 wd:Q18643324 wd:Q17114155 wd:Q6880994 wd:Q18642095 wd:Q18588176 wd:Q18643322 wd:Q18638333 wd:Q170581 wd:Q18325835 wd:Q18585872 wd:Q873 wd:Q18610003 wd:Q18663763 wd:Q6767478 wd:Q7035242 wd:Q19848 wd:Q2643843 wd:Q18587597 wd:Q13424289 wd:Q16727834 wd:Q18643302 wd:Q19562542 wd:Q18643276 wd:Q1183503 wd:Q11971872 wd:Q20310566 wd:Q19562540 wd:Q18581089 wd:Q19666600 wd:Q6111740 wd:Q19562683 wd:Q20754694 wd:Q454970 wd:Q19562535 wd:Q17114066 wd:Q18643281 wd:Q37079 wd:Q16172281 wd:Q16165619 wd:Q5543184 wd:Q122229 wd:Q18643273 wd:Q55800 wd:Q18643272 wd:Q15430322 } }
} LIMIT 10
======================
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?tl ?coord ?b ?lat ?long WHERE {
?t wdt:P31 wd:Q484170 .
?t wdt:P625 ?coord .
?t p:P625 ?a .
?a v:P625 ?c .
?c wikibase:geoLatitude ?lat .
?c wikibase:geoLongitude ?long .
OPTIONAL {
?t rdfs:label ?tl filter (lang(?tl) = "fr") .
}
}
ORDER BY ASC(?lat)
LIMIT 100
One of the bigger challenges encountered in this analysis was the fact that queries were cropped. When the user executes a query, their query is passed via GET, and is saved in Varnish as an encoded uri_path. Varnish, however, has a character limit, so the encoded queries get cropped. Therefore, when we decode the queries, the end result is also cropped. So a lot of the queries in this dataset were partial queries.
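As an illustration of what cropping does to decoding, the sketch below decodes a percent-encoded query that was cut off at an arbitrary character. The function names are hypothetical, and two assumptions are made: spaces are encoded as "+" (as in the logged URL shown earlier), and unbalanced braces are used as a rough hint that a query is partial. The limit itself is not modelled here.

```python
import re
from urllib.parse import unquote_plus


def decode_cropped_query(encoded: str) -> str:
    """Decode a percent-encoded query that may have been cut off mid-escape."""
    # Drop a percent-escape that was cut in half by the crop ("%" or "%7").
    trimmed = re.sub(r"%[0-9A-Fa-f]?$", "", encoded)
    # ASSUMPTION: "+" encodes a space. Incomplete multi-byte characters are
    # replaced with U+FFFD, which is then stripped from the end.
    return unquote_plus(trimmed, errors="replace").rstrip("\ufffd")


def looks_cropped(query: str) -> bool:
    """Rough heuristic (an assumption): unbalanced braces suggest a partial query."""
    return query.count("{") != query.count("}")


# Hypothetical encoded query, cut off in the middle of the escape "%7D".
encoded = "SELECT+%3Fs+WHERE+%7B+%3Fs+wdt%3AP31+wd%3AQ5+%7"
decoded = decode_cropped_query(encoded)
print(decoded)
print(looks_cropped(decoded))
```

A brace count cannot detect every partial query (a crop can fall after the final closing brace but before a LIMIT clause), so it should be read as a lower bound on how many queries were affected.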
Another issue (and this may actually be the biggest issue) is that many of the queries are sample queries found on various WDQS-related MediaWiki/Wikitech articles. We (read: I) had to compile as many of the example queries as we could and then perform approximate string matching to separate the queries that are user-written from the ones that are probably examples.
The process we employed was ad-hoc, not very robust, and highly dubious, but not entirely unreasonable. We recommend collaborating with our language expert (read: Trey) to develop a more robust methodology for detecting when the query submitted matches an example query we have on file.
Furthermore, for the sake of time, we did not include example queries from other languages in our initial compilation of examples. Some of the queries that were deemed “definitely not an example” actually WERE most definitely examples written in French.
We would like to thank Trey Jones for his advice in dealing with approximate string matching, and Oliver Keyes for his review of this report and helpful feedback.
¹ The countries are: Algeria, Angola, Argentina, Armenia, Australia, Austria, Azerbaijan, Belarus, Belgium, Brazil, Bulgaria, Cambodia, Canada, Chile, China, Colombia, Croatia, Czech Republic, Denmark, Ecuador, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, Guadeloupe, Hungary, India, Indonesia, Iran, Ireland, Israel, Italy, Japan, Latvia, Luxembourg, Malaysia, Mali, Malta, Martinique, Mexico, Montenegro, Nepal, Netherlands, New Zealand, Norway, Poland, Portugal, Qatar, Republic of Korea, Romania, Russia, Saudi Arabia, Serbia, Singapore, Slovak Republic, Slovenia, South Africa, Spain, Sri Lanka, Sweden, Switzerland, Taiwan, Thailand, Turkey, Ukraine, United Kingdom, United States, Uruguay, Venezuela, and Vietnam.