[HN Gopher] Text Classification by Data Compression
       ___________________________________________________________________
        
       Text Classification by Data Compression
        
       Author : Lemaxoxo
       Score  : 32 points
       Date   : 2021-06-08 20:08 UTC (2 hours ago)
        
 (HTM) web link (maxhalford.github.io)
 (TXT) w3m dump (maxhalford.github.io)
        
       | sean_pedersen wrote:
        | Cool idea! Shouldn't this also work by concatenating the single
        | document (the one you want to classify) with the compressed
        | version of the concatenated class corpus (saving compute time)?
        
         | Lemaxoxo wrote:
         | I think that is what is being suggested in the other comment.
         | One would have to try! My instincts tell me the results would
         | not be identical.
        
       | lovasoa wrote:
       | You don't have to recompress the whole corpus to add a single
       | document to it. All the compression algorithms mentioned here
       | work in a streaming fashion. You could "just" save the internal
       | state of the algorithm after compressing the training data, and
       | then reuse that state for each classification task.
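        |
        | A minimal, untested sketch of that idea with Python's zlib: a
        | compressobj exposes a copy() method that snapshots its internal
        | state, as long as the stream hasn't been finished with flush().
        | (The function names below are made up; only zlib.compressobj,
        | compress, copy and flush are the real API.)
        |
        |     import zlib
        |
        |     def train(class_docs):
        |         # Feed the whole class corpus into one live
        |         # compressor and keep it around; do NOT flush.
        |         comp = zlib.compressobj(9)
        |         for doc in class_docs:
        |             comp.compress(doc.encode())
        |         return comp
        |
        |     def extra_bytes(comp, document):
        |         # Clone the saved state twice: once as a baseline
        |         # (corpus alone) and once with the new document,
        |         # so the corpus bytes still buffered inside the
        |         # compressor cancel out.
        |         baseline = len(comp.copy().flush())
        |         clone = comp.copy()
        |         n = len(clone.compress(document.encode()))
        |         n += len(clone.flush())
        |         return n - baseline
        |
        | Classification is then just picking the class whose saved
        | compressor needs the fewest extra bytes, with no recompression
        | of any corpus.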
        
         | Lemaxoxo wrote:
          | I suspected this. However, I wasn't able to grok the
          | documentation well enough, and I wasn't able to find a
          | convincing example. It seems to me that these Python
          | compressors get "frozen" and can't be used to compress
          | further data.
        
         | spullara wrote:
          | Came here to say that. I played around with this a bit for
          | compressing small fields using a learned dictionary:
         | 
         | https://github.com/spullara/corpuscompression
        
       | thomasluce wrote:
       | I worked for an internet scraping/statistics gathering company
       | some years ago, and we used this approach alongside a few others
        | to find mailing addresses embedded in websites. Basically, use
        | LZW-type compression with entropy information trained only on
        | known addresses, then compress a document and look for the
        | section of the document with the highest compression ratio.
       | 
       | It worked decently well, and surprisingly better than a lot of
       | other, more standard approaches just because of the wild non-
       | uniformity of human-generated content on the web.
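        |
        | A rough, untested sketch of that windowing trick, using zlib (a
        | DEFLATE preset dictionary rather than LZW) and a made-up
        | ADDRESSES blob standing in for the real training addresses:
        |
        |     import zlib
        |
        |     # Hypothetical training data: known addresses
        |     # concatenated into one bytes blob.
        |     ADDRESSES = b"123 Main St, Springfield, IL 62701 ..."
        |
        |     def most_address_like(page, size=200, step=50):
        |         # Slide a window over the page bytes and score
        |         # each window by how well it compresses against
        |         # the address dictionary; the best-compressing
        |         # window is the most address-like region.
        |         best_at, best_ratio = None, float("inf")
        |         for i in range(0, len(page) - size + 1, step):
        |             chunk = page[i:i + size]
        |             c = zlib.compressobj(9, zdict=ADDRESSES)
        |             out = c.compress(chunk) + c.flush()
        |             ratio = len(out) / len(chunk)
        |             if ratio < best_ratio:
        |                 best_at, best_ratio = i, ratio
        |         return best_at, best_ratio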
        
       ___________________________________________________________________
       (page generated 2021-06-08 23:00 UTC)