[HN Gopher] Text Classification by Data Compression
___________________________________________________________________
 
Text Classification by Data Compression
 
Author : Lemaxoxo
Score  : 32 points
Date   : 2021-06-08 20:08 UTC (2 hours ago)
 
(HTM) web link (maxhalford.github.io)
(TXT) w3m dump (maxhalford.github.io)
 
| sean_pedersen wrote:
| Cool idea! Shouldn't this also work by concatenating the single
| document (the one you want to classify) with the compressed
| version of the concatenated class corpus (saving compute time)?
| 
| Lemaxoxo wrote:
| I think that is what is being suggested in the other comment.
| One would have to try! My instincts tell me the results would
| not be identical.
| 
| lovasoa wrote:
| You don't have to recompress the whole corpus to add a single
| document to it. All the compression algorithms mentioned here
| work in a streaming fashion. You could "just" save the internal
| state of the algorithm after compressing the training data, and
| then reuse that state for each classification task.
| 
| Lemaxoxo wrote:
| I suspected as much. However, I wasn't able to grok the
| documentation well enough to find a convincing example. It
| seems to me that these Python compressors get "frozen" and
| can't be used to compress further data.
| 
| spullara wrote:
| I was going to come here to say that. I played around with this
| a bit for compressing small fields using a learned dictionary:
| 
| https://github.com/spullara/corpuscompression
| 
| thomasluce wrote:
| I worked for an internet scraping/statistics-gathering company
| some years ago, and we used this approach alongside a few
| others to find mailing addresses embedded in websites.
| Basically: use LZW-type compression with entropy information
| trained only on known addresses, then compress a document,
| looking for the section of the document with the highest
| compression ratio.
| 
| It worked decently well, and surprisingly better than a lot of
| other, more standard approaches, just because of the wild
| non-uniformity of human-generated content on the web.
___________________________________________________________________
(page generated 2021-06-08 23:00 UTC)
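[Editor's note] The streaming-state trick lovasoa describes, and the "frozen compressor" problem Lemaxoxo ran into, can be sketched with Python's `zlib`: a `compressobj` exposes `copy()`, which clones its internal state, so the class corpus is compressed once and each classification works on a throwaway clone. The function names (`prime`, `extra_bytes`, `classify`) and the toy corpora below are illustrative, not from the linked post.

```python
import zlib

def prime(texts, level=9):
    """Feed the class corpus through a compressor once, returning the
    compressor with its internal (LZ77 window + state) primed on it."""
    c = zlib.compressobj(level)
    for t in texts:
        c.compress(t.encode())
    return c

def extra_bytes(primed, document):
    """Approximate C(corpus + doc) - C(corpus): the compressed bytes the
    document adds on top of the corpus. copy() keeps `primed` reusable,
    since flush() would otherwise freeze it for further input."""
    baseline = len(primed.copy().flush())
    c = primed.copy()
    total = len(c.compress(document.encode())) + len(c.flush())
    return total - baseline

def classify(document, primed_by_label):
    """Pick the class whose corpus makes the document cheapest to append."""
    return min(primed_by_label,
               key=lambda label: extra_bytes(primed_by_label[label], document))
```

Because only clones are ever flushed, one primed compressor per class can serve any number of classification calls without recompressing the training data.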
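[Editor's note] thomasluce's address-finding variant can be sketched the same way: slide a window over the document and keep the span that a compressor primed on known addresses shrinks the most. The original used LZW-style compression with an entropy model trained on addresses; this sketch substitutes zlib's deflate, and the window width, step size, and sample strings are made up for illustration.

```python
import zlib

def prime(texts, level=9):
    """Compressor whose internal state has seen only known addresses."""
    c = zlib.compressobj(level)
    for t in texts:
        c.compress(t.encode())
    return c

def added_bytes(primed, chunk):
    """Compressed bytes `chunk` adds on top of the primed corpus."""
    baseline = len(primed.copy().flush())
    c = primed.copy()
    return len(c.compress(chunk)) + len(c.flush()) - baseline

def best_window(document, primed, width=40, step=10):
    """Scan the document and return the span with the highest
    compression ratio against the address-primed state."""
    best_ratio, best_span = 0.0, None
    for i in range(0, max(1, len(document) - width + 1), step):
        span = document[i:i + width]
        cost = max(added_bytes(primed, span.encode()), 1)
        ratio = len(span) / cost
        if ratio > best_ratio:
            best_ratio, best_span = ratio, span
    return best_span
```

Spans resembling the training addresses compress into short back-references, so they score a far higher ratio than surrounding free text, which is what makes this robust to the non-uniformity of web content.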