## Intro

Efficient Wikipedia Information Retrieval

raster@hluw.de

Bitreichcon 1, Chemnitz, 11 March 2017

## Synopsis

* 5 GB of text files, i.e. Wikipedia articles
* served via HTTP by megabytes of PHP, JavaScript and SQL (measured with sloccount)
* alternatively served by gopher, wikifs (Plan 9) and plenty of other wiki programs

## dmenu: efficiently selecting articles

* searching or retrieving?
* dmenu (sketch in the appendix)

## Tools

* grep
* i.e. the ed command g/regular-expression/p
* fgrep, egrep, agrep

## Information Retrieval

* worse than full-text search
* when the data size grows too large, tricks and workarounds come into play
* precision/recall, "F-measure" (the harmonic mean of precision and recall; example in the appendix)
* percentage of SLOC that deal with IR

## full-text corpus

* some measurements with the 5 GB corpus (TODO; crude timing sketch in the appendix)
* table: different patterns | time in ms
* data size / intersection

## Bibliography

* Beesley and Karttunen 2003: Finite State Morphology, Stanford
* Kernighan and Plauger 1981: Software Tools in Pascal, Reading
* Kernighan and Pike 1984: The UNIX Programming Environment, Englewood Cliffs
* Lesk 1997: Practical Digital Libraries, San Francisco
* Lesk 2005: Understanding Digital Libraries, San Francisco
* Witten, Moffat and Bell 1999: Managing Gigabytes, San Francisco
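## Appendix: dmenu selection sketch

A minimal sketch of the dmenu idea from the selection slide, assuming the corpus is a flat directory of article files; the WIKIDIR location and the use of less as a pager are assumptions for illustration, not part of the talk.

```sh
#!/bin/sh
# Offer all article names in dmenu and open the chosen one in a pager.
dir=${WIKIDIR:-$HOME/wikipedia}                  # assumed corpus location
article=$(ls "$dir" | dmenu -l 20 -p 'article:') || exit 1
exec less -- "$dir/$article"
```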
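## Appendix: F-measure example

The F-measure named on the Information Retrieval slide is the harmonic mean of precision and recall. A tiny calculation with made-up example values (0.8 and 0.6 are not measurements from the talk):

```sh
#!/bin/sh
# F1 = 2PR / (P + R); prints "F1 = 0.686" for P=0.8, R=0.6.
P=0.8 R=0.6
awk -v p="$P" -v r="$R" 'BEGIN { printf "F1 = %.3f\n", 2 * p * r / (p + r) }'
```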
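## Appendix: grep timing sketch

One way the pattern/ms measurements on the full-text corpus slide could be taken, assuming the same WIKIDIR directory as above and that a POSIX time utility is available; patterns and paths are illustrative, the actual numbers remain TODO.

```sh
#!/bin/sh
# Crude timing of fixed-string vs. regular-expression search over the
# whole corpus (grep -F and grep -E cover the fgrep/egrep of the Tools slide).
dir=${WIKIDIR:-$HOME/wikipedia}
for pat in 'Chemnitz' 'finite state'; do
    printf 'pattern: %s\n' "$pat"
    time grep -rF "$pat" "$dir" >/dev/null   # fixed strings
    time grep -rE "$pat" "$dir" >/dev/null   # extended regexps
done
```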