Web UI for cirrus debug/devel features:
- Settings dump
- Mappings dump
- Copy version of settings+mappings suitable to create index with curl
- cirrusDumpQuery
- cirrusDumpResult
- cirrusExplain
- cirrusUserTesting

Top level idea is to make it easy to access all of these things. Could be a userscript run on-page in the wiki. Could be an SPA run from tool labs (or even people.wikimedia.org).
============
docker setup to initialize elasticsearch, import latest cirrus dump, and attach a kibana instance for UI. Probably with a modified mapping more amenable to kibana inspection.
============
Some script to manage elasticsearch allocation manually via api? Pointless, but perhaps fun.
===========
phabricator formatted export for jupyter
- problem: images?
  -- Seems we would need to upload them separately and then reference them in the final output
  -- There is an api for this, but then we can't just emit something to paste into a field; the whole export needs to happen over the api then.
- better, but worse: data URIs would be great. But I dunno if phab is built for megabyte sized posts. They also don't support data URIs. Browsers also hate when you copy/paste excessive amounts of data.
==========
Custom implementation to find similar images in commons:
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=rep1&type=pdf
- http://www.deepideas.net/building-content-based-search-engine-quantifying-similarity/
- Convert image into a feature vector
- Use clustering to generate an image signature
- Find k-nearest-neighbors via Earth Mover Distance (EMD), can utilize pyemd library.
  - It's very not-obvious how the signature + weight gets plugged into pyemd
  - EMD is expensive, no clue how this would scale to millions of images
- This would probably perform poorly; more interesting as a way to understand some of the history of similar image retrieval
=========
https://github.com/beniz/deepdetect.git ?
- Use pre-trained ML to detect objects in images and then label those objects.
- Can compare similarity of objects detected for similar images. Can probably extend with color information
- Do we actually have a use case for images similar to other images? Perhaps on upload?
==========
Elasticsearch cluster balance simulator
- Simulate and evaluate how the cluster balancing performs under various simulated conditions
- no way this could be done in a weekend hackathon. It would probably be completely wrong as well and simulate some idealized cluster that doesn't act like ours.
==========
Prototype Lire plugin for elasticsearch
- Lire = Lucene Image REtrieval
- I know nothing about it, other than it exists
- Plugin already exists plugging it into solr, so how hard could it be?
- Maybe try it out standalone with some small test set to see what it does
==========
Potential daemon for serving up similarity, bring your own image vector.
- https://github.com/facebookresearch/faiss/wiki
- Could use vectors from above "custom implementation", but probably not too fancy
- Could use opencv: https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html
=========
Updatable doc values in elasticsearch!
* Guaranteed to suck!
* Needs to query doc value on update to put into new document (in source?)
* Should there be some stupid hack that makes requesting field from source return the doc value?
* Otherwise, different results from different places. Fun!
==========
Segment-level sidecar data?
==========
Scroll to section / Scroll to snippet from search results
* Javascript string search?
* Elasticsearch highlighter to be aware of heading positions in text?
* lookup
==========
Index pHash for images into elasticsearch
- Do some crazy expensive query to measure hamming distance
=========
extract transliteration from mediawiki into a composer library
build a small server over the library as a transliteration service
=========
Make progress on extension.json for CirrusSearch
========
Expose cirrussearch sort orders to api/ui
========
Convert cindy's actual runner into a docker container
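==========
Rough sketch for the pHash idea above, to pin down what "measure hamming distance" means. Assumes 64-bit hashes held as plain Python ints. The function names and the example hash values are made up for illustration, and this is a brute-force scan in Python, not an elasticsearch query.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def nearest_by_hamming(query: int, hashes: dict, k: int = 3) -> list:
    """Brute-force k nearest neighbours by hamming distance.

    Scans every hash (O(n) per query), which is the "crazy expensive"
    part; ties are broken by name so the result order is stable.
    """
    scored = [(name, hamming_distance(query, h)) for name, h in hashes.items()]
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


# Made-up 64-bit hash values, not real pHash output.
EXAMPLE_HASHES = {
    "A.jpg": 0xFFFF0000FFFF0000,
    "B.jpg": 0xFFFF0000FFFF0001,
    "C.jpg": 0x0000FFFF0000FFFF,
}

if __name__ == "__main__":
    for name, dist in nearest_by_hamming(0xFFFF0000FFFF0000, EXAMPLE_HASHES, k=2):
        print(name, dist)
```

Near-duplicates should end up only a few bits apart, so a distance threshold would probably be more useful than a fixed k.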