[HN Gopher] Gzip beats BERT? Part 2: dataset issues, improved sp...
       ___________________________________________________________________
        
       Gzip beats BERT? Part 2: dataset issues, improved speed, and
       results
        
       Author : JoeyBananas
       Score  : 170 points
       Date   : 2023-07-29 16:04 UTC (6 hours ago)
        
 (HTM) web link (kenschutte.com)
 (TXT) w3m dump (kenschutte.com)
        
       | beefman wrote:
       | Part 1 discussed here:
       | https://news.ycombinator.com/item?id=36758433
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Bad numbers in the "gzip beats BERT" paper?_ -
         | https://news.ycombinator.com/item?id=36758433 - July 2023 (128
         | comments)
        
       | __vec__ wrote:
        | Has anyone recreated all of the results with "uncontaminated"
        | datasets?
        
       | ks2048 wrote:
       | This is my blog post, if anyone has any questions.
       | 
        | I'll add that since I wrote these two blog posts, other people
        | have sent me more interesting related work:
       | 
       | (1) I link to this at the end of the post (using zstd
       | dictionaries): https://github.com/cyrilou242/ftcc
       | 
       | (2) today someone sent me this (bag-of-words better than gzip):
       | https://arxiv.org/abs/2307.15002v1
        
         | p1esk wrote:
         | Your conclusion: "using ideas from text compression for text
         | classification tasks is an interesting idea and may lead to
         | other interesting research."
         | 
         | Would you say this idea is interesting enough for you
         | personally to research it further?
        
           | ks2048 wrote:
           | For me, no. Mainly because "text classification" is a pretty
           | limited application and one I don't plan to spend much time
           | on. For NLP tasks that require a deeper "understanding", I
           | don't see how compression algorithms can help much (at least
           | _directly_ ).
        
             | nico wrote:
             | Just conceptually, compression is an analog of
             | understanding
             | 
             | To be able to compress something, you need to understand it
             | first
             | 
             | We use this everyday, we compress things by naming them
             | 
             | Once we name something, we don't need to explain or
             | describe, we can just use the name instead
             | 
              | That allows us to compress our communications, and it
              | directly affects the parties' understanding of the
              | information
             | 
             | That's just conceptually. At a math/algorithm level I don't
             | really know the specifics of your research or the paper in
             | question
        
               | ChainOfFools wrote:
               | It may sound strange out of context, but the most
               | memorable quote I've encountered in any book or any piece
               | of writing anywhere, at least in terms of informing my
               | own understanding of language and the construction of
               | meaning through communication, came in a book on screen
               | writing by William Goldman. The guy who wrote The
               | Princess Bride, of all things.
               | 
               | The sentence was simply, (and in capitals in the
               | original), "POETRY IS COMPRESSION."
        
               | quickthrower2 wrote:
               | Would make a good haiku line 2
        
               | ks2048 wrote:
               | Yes, I agree. That's why I said _directly_ (with regards
               | to compression algorithms used for understanding).
               | _Indirectly_ , yes, compression and
               | intelligence/understanding are closely related.
        
               | mannykannot wrote:
               | One could say that you need to understand _something_
               | about the artifact you are compressing, but, to be clear,
               | you can compress text without understanding anything
               | about its semantic content, and this is what gzip does.
               | The only understanding needed for that level of
               | compression is that the thing to be compressed is a
               | string in a binary alphabet.
        
               | joshuamorton wrote:
               | Of course, which is why gzip is a good baseline for
               | "better" compressors that do have semantic understanding.
               | 
               | The whole idea of an autoencoder _is_ conceptual
               | compression. You take a concept (say: human faces) and
               | create a compressor that is so overfit to that concept
                | that when given complete gobbledygook (random seed data)
               | it decompresses that to something with semantic meaning!
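                | 
                | As a toy illustration (my own sketch, nothing from
                | the paper), an autoencoder literally is a learned
                | compress/decompress pair:
                | 
                |     import torch.nn as nn
                | 
                |     encoder = nn.Sequential(          # "compress"
                |         nn.Linear(784, 32), nn.ReLU())
                |     decoder = nn.Sequential(          # "decompress"
                |         nn.Linear(32, 784), nn.Sigmoid())
                | 
                |     def autoencode(x):
                |         # x: a batch of flattened 28x28 images
                |         return decoder(encoder(x))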
        
           | refulgentis wrote:
            | Text-similarity embeddings aren't very interesting and will
            | correlate with gzip, especially when the test is text
            | similarity and the texts being tested have distinct
            | vocabularies.
           | 
           | The really useful ones are based on SBERT, and measure the
           | likelihood that the answer is contained in the text that was
           | embedded.
           | 
           | ex. from my unit tests: "what is my safe passcode?" has a
           | strong match with "my lockbox pin is 1234", but a very weak
           | match to 'my jewelry is stored safely in the safe'
           | 
           | I learned this from
           | https://news.ycombinator.com/item?id=35377935: thank you to
           | whoever posted this, blew my mind and gave me a powerful
           | differentiator
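            | 
            | For the curious, a minimal sketch of that kind of check
            | with sentence-transformers (the model name is just an
            | example, not necessarily what I use):
            | 
            |     from sentence_transformers import (
            |         SentenceTransformer, util)
            | 
            |     model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
            |     q = model.encode("what is my safe passcode?")
            |     docs = model.encode([
            |         "my lockbox pin is 1234",
            |         "my jewelry is stored safely in the safe",
            |     ])
            |     # first doc should score far higher than the second
            |     print(util.cos_sim(q, docs))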
        
         | cs702 wrote:
         | No questions from me. Just want to say: Thank you for doing all
         | this work!
        
         | phyzome wrote:
         | I'm idly curious how much of a speedup you achieved.
        
           | ks2048 wrote:
           | I don't have complete numbers on this (I think it depends a
           | lot on the size of training set), but for one dataset,
            | normalized time for a batch:
            | 
            |     original    : 1.000
            |     precomputed : 0.644 (first improvement)
            |     gziplength  : 0.428 (+ 2nd improvement)
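            | 
            | The first improvement is basically just precomputing the
            | compressed length of every training text once instead of
            | per test item. A rough sketch of that idea (not the exact
            | code from the post), using the paper's NCD formula:
            | 
            |     import gzip
            | 
            |     def clen(s: str) -> int:
            |         # gzip-compressed length of a string
            |         return len(gzip.compress(s.encode()))
            | 
            |     def ncd_row(test_text, train_texts, train_clens):
            |         # train_clens = [clen(t) for t in train_texts],
            |         # computed once for the whole test set
            |         cx = clen(test_text)
            |         row = []
            |         for t, cy in zip(train_texts, train_clens):
            |             cxy = clen(test_text + " " + t)
            |             row.append((cxy - min(cx, cy)) / max(cx, cy))
            |         return row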
        
       | godelski wrote:
        | I really think that the numbers were inflated because of the
        | prolific benchmarkism that goes on in ML. Basically, if you
        | don't beat SOTA, you don't get published. Usually you need SOTA
        | on MULTIPLE datasets. Which is problematic, because plenty of
        | non-SOTA methods are useful (forget novel). Given the results
        | Ken/ks2048 calculated, I am pretty confident that the work
        | wouldn't have made it in. BUT I think the results, given the
        | other features, do make the work quite useful! I agree, Ken,
        | that it unfairly boosts their work, but I understand why they're
        | bending over backwards to defend it. I wish people would just
        | admit mistakes, but that risks (probably not) losing a paper.
        | This is probably the same reason they didn't think to double
        | check suspicious results like the Filipino dataset (btw, it's
        | not uncommon for datasets to be spoiled, people. Always be
        | suspicious!).
       | 
       | I'm not trying to give them a pass, but we do need to discuss the
       | perverse incentives we've set up that make these kinds of things
       | prolific. The work should be good on its own, but good doesn't
       | mean it'll get published in a journal. And frankly, it doesn't
       | matter how many citations your arxiv paper has, people will still
       | say "it isn't peer reviewed" and it won't help you get a job,
       | graduate, or advance in academia. Which I think we should all
        | agree is idiotic, since citations indicate peer review too.
        
         | lalaland1125 wrote:
         | I don't blame them for failing to double check their results.
         | 
         | I blame them for giving obviously incorrect excuses on GitHub
         | when such an obvious mistake is pointed out.
         | 
         | There is no way they could be at the stage they claim to be in
         | their program (having just defended their thesis) and think the
         | excuses they gave on GitHub are reasonable.
        
           | godelski wrote:
           | Yeah, I fully agree. They should just admit the mistake
           | rather than try to justify it. I was just trying to explain
           | the incentive structure around them that encourages this
           | behavior. Unfortunately no one gives you points for admitting
           | your mistakes (in fact, you risk losing points) and you are
           | unlikely to lose points for doubling down on an error.
           | 
           | > There is no way they could be at the stage they claim to be
           | in their program (having just defended their thesis) and
           | think the excuses they gave on GitHub are reasonable.
           | 
           | Unfortunately it is a very noisy process. I know people from
           | top 3 universities that have good publication records and
           | don't know probabilities from likelihoods. I know students
           | and professors at these universities that think autocasting
           | your model to fp16 reduces your memory by half (from fp32)
           | and are confused when you explain that that's a theoretical
           | (and not practical) lower bound. Just the other day I had
           | someone open an issue on my github (who has a PhD from one of
           | these universities and is currently a professor!) who was
           | expecting me to teach them how to load a pretrained model.
           | This is not uncommon.
           | 
           | Goodhart's Law is a bitch.
        
       | recov wrote:
       | Related, sentdex did a video and implementation on it as well:
       | https://www.youtube.com/watch?v=jkdWzvMOPuo
        
       | dekhn wrote:
       | This is a masterwork of analysis and improvement of a method.
        
       | [deleted]
        
       | the_man_of_sex wrote:
       | [flagged]
        
       | birdyrooster wrote:
       | There is this sense of deflation of effort in tech right now
       | (always?) where, if you can just wait a moment longer to start
       | coding, you can just adopt something else and save yourself from
       | the rat race.
        
       | luc4sdreyer wrote:
       | > Scientific research typically has been founded on high ethical
       | standards established by researchers in academia and health care
       | research institutions. Scientific fraud, an act of deception or
       | misrepresentation of one's own work, violates these ethical
       | standards.
       | 
       | And according to Ken Schutte:
       | 
       | > this method uses the test label as part of its decision process
       | which is not the standard classification setting and can't be
       | fairly compared to others that don't.
       | 
       | Can anyone make the case that these two descriptions don't
       | overlap? Personally I can't see how the original author can be so
        | blasé about this.
       | 
       | [1] https://pubmed.ncbi.nlm.nih.gov/2061524/
        
         | godelski wrote:
         | I try to explain in this comment[0]. I agree that this is
         | unethical behavior, but we need to also be aware of what
         | pressures are encouraging this behavior. I also think Ken is
         | romanticizing the standards of science a bit here. This would
         | be great, but it is not what happens in practice.
          | Unfortunately. Mostly unintentionally, but there are
          | intentional cases too.
         | 
         | [0] https://news.ycombinator.com/item?id=36922708
        
       | codeflo wrote:
       | The article has a link[1] to a discussion between the blog author
       | and the paper author that I find revealing.
       | 
       | Perhaps as a reminder, the issue is that the paper's
       | implementation of their 2-nearest neighbor secretly uses an
       | oracle to break ties, which obviously inflates the accuracy
       | compared to a real-world kNN classifier that has to choose
       | heuristically. To be fair, this could be a weird implementation
       | accident and not malice. But I think it does invalidate the
       | results.
       | 
       | But rather than admit error, the author _defends_ this choice,
       | and does so using (in my opinion) dubious statistical arguments.
       | Which leads me to believe that -- at least at this point -- they
       | know they made a mistake and just won't admit it.
       | 
       | They claim that instead of a real-world accuracy, they wanted to
       | find the "max" accuracy that their classifier was statistically
       | capable of. That is, the accuracy you get if the stars happen to
       | align and you get the luckiest possible result. Well, not only is
       | this creative new metric not described in the paper, it's also
       | not applied to the other algorithms. For example, I think a
       | neural network is capable of achieving a "max" accuracy of 100%,
       | if all the initial weights happen to perfectly encode both the
       | training and test sets. But of course they just use standard
       | training to give the numbers for those algorithms.
       | 
       | [1] https://github.com/bazingagin/npc_gzip/issues/3
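        | 
        | Schematically (my own sketch, not the repo's code), the
        | difference for k=2 comes down to:
        | 
        |     def knn2_standard(sorted_labels, true_label):
        |         # sorted_labels: neighbor labels, nearest first.
        |         # Ties broken by the nearer neighbor, so k=2 behaves
        |         # like k=1 - no peeking at true_label.
        |         return sorted_labels[0] == true_label
        | 
        |     def knn2_reported(sorted_labels, true_label):
        |         # the contested scoring: counted correct if *either*
        |         # of the two nearest labels matches the test label
        |         return true_label in sorted_labels[:2]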
        
         | ks2048 wrote:
         | Well put. Yes, I mention a similar case towards the end of that
         | exchange: Consider a random-guess classifier. That has a _max
         | accuracy_ of 100%. Clearly, not a useful measure on its own.
        
         | pedrosorio wrote:
         | > They claim that instead of a real-world accuracy, they wanted
         | to find the "max" accuracy that their classifier was
         | statistically capable of
         | 
         | Yeah, I read this on the GitHub issue a week ago and couldn't
         | believe it. Ideally, their profile(1) should allow them to
         | quickly admit they were wrong on such a simple issue. Pursuit
         | of truth and knowledge, etc.
         | 
         | (1) a young PhD from a prestigious university
         | 
         | > For example, I think a neural network is capable of achieving
         | a "max" accuracy of 100%
         | 
          | Why reach for such powerful tools? f(x) = random(num_classes)
          | achieves 100% "upper bound" accuracy.
        
         | lalaland1125 wrote:
         | In academia, it's better to cling to obviously false
         | justifications to dismiss criticism and keep a paper accepted
         | than to admit fault and potentially be forced to retract.
         | 
         | Publish or perish
        
           | bonzini wrote:
            | Retracting is extremely rare in computer science, which is
            | why many conferences have instead started "stamping" papers
            | whose artifacts provide reproducible results.
        
         | hinkley wrote:
         | A couple of AI hype cycles ago, everyone was abuzz about
          | genetic algorithms. I recall a cautionary tale about someone
          | using FPGAs to run genetic algorithms.
         | 
         | After a while they noticed several disturbing things. One, that
         | the winners had fewer gates than theory thought was necessary
         | to solve the problem. Two, some days the winners didn't work,
         | and three, sometimes the winners didn't work on a different
         | FPGA.
         | 
         | After much study the answer was that the winning candidates
         | were treating the gate logic as analog. Manufacturing flaws or
         | PSU fluctuations would result in the analog aspects behaving
          | differently.
         | 
          | To fix this, they split the fitness test into two passes. All
         | implementations that actually worked got re-run in an emulator,
         | which of course treats the behavior as purely digital. Only if
         | they worked with both did they avoid being culled.
        
         | [deleted]
        
       | Twirrim wrote:
       | > The paper's repo does minimal processing on the datasets. It
       | turns out that these problems exist in the source Huggingface
       | datasets. The two worst ones can be checked quickly using only
       | Huggingface's datasets.load_dataset:
       | 
       | I'm really surprised HuggingFace isn't doing filtering/evaluation
       | of the datasets they're presenting. This ought to be a simple
       | check for them.
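        | 
        | For reference, a check along those lines only takes a few lines
        | (the dataset id and column name below are placeholders -
        | substitute the ones from the post):
        | 
        |     from datasets import load_dataset
        | 
        |     ds = load_dataset("some_dataset_id")   # placeholder id
        |     train = set(ds["train"]["text"])       # column name varies
        |     test = set(ds["test"]["text"])
        |     overlap = len(train & test) / len(test)
        |     print(f"{overlap:.1%} of test texts also appear in train")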
        
         | lalaland1125 wrote:
         | It's not the job of HuggingFace to certify datasets. It's
         | simply outside the scope of their work.
        
         | godelski wrote:
         | That's a tall order. While the cases here are simple and more
         | obvious, they don't scale well. It can also be problematic if
         | an official dataset has the error, as now they've created a
         | different one. They have 48,627 datasets. Their goal is not to
          | validate datasets (which is far more difficult than checking
          | for dupes, itself not easy btw), but to be like GitHub, so
          | that others (like Ken) can review the work of their peers and
          | check for mistakes. Due to this, HF has to allow uploading of
          | arbitrary datasets, because they cannot be an arbiter of what
          | is good or bad, since that depends on what's being solved.
         | They could probably set a flag for datasets (and maybe even
         | some statistics!) that are under a few gigs in size, but they
         | cannot and should not filter them.
        
           | Twirrim wrote:
           | I appreciate there is nuance, and some checks would be
           | computationally expensive, but something like training data
           | and evaluation data being literally identical seems like it
            | would be pretty straightforward to check for, and a very
            | simple, quick rejection.
        
         | _delirium wrote:
         | I think of HuggingFace as essentially a GitHub for ML stuff.
         | They just provide infrastructure that anyone can upload to.
        
         | pizza wrote:
          | Is there a feature in hf's datasets platform that makes
          | load_dataset throw an exception if you try to load a known-
          | dubious dataset unless you explicitly provide a kwarg like
          | 'allow_dubious=True'? If not, that might be a boon for the
          | whole field... it might nip the propagation of false results
          | at the outset.
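          | 
          | Even without official support, a user-side wrapper would be
          | easy enough (the flag and the blocklist here are made up,
          | purely illustrative):
          | 
          |     from datasets import load_dataset
          | 
          |     KNOWN_DUBIOUS = {"some_dataset_id"}   # hypothetical list
          | 
          |     def load_dataset_checked(name, *args,
          |                              allow_dubious=False, **kwargs):
          |         if name in KNOWN_DUBIOUS and not allow_dubious:
          |             raise ValueError(
          |                 f"{name} has known train/test issues; pass "
          |                 "allow_dubious=True to load it anyway")
          |         return load_dataset(name, *args, **kwargs)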
        
       | itvision wrote:
        | <offtopic probably>Haven't read the article, but nowadays
        | there's no reason to use either GZIP or bzip2 when ZSTD is
        | available. It's just so much better than both; I've no idea why
        | people haven't replaced everything with ZSTD, except for
        | XZ/7-Zip, which can provide much higher compression ratios at
        | the cost of very slow compression and insane RAM requirements
        | (a 3840MB dictionary with at least two threads).</offtopic
        | probably>
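        | 
        | For what it's worth, swapping compressors into the kind of
        | compressed-length helper these experiments rely on is a one-
        | liner (sketch below uses the zstandard package):
        | 
        |     import zstandard
        | 
        |     cctx = zstandard.ZstdCompressor(level=3)
        | 
        |     def clen_zstd(s: str) -> int:
        |         # zstd-compressed length, a drop-in for gzip's
        |         return len(cctx.compress(s.encode()))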
        
       ___________________________________________________________________
       (page generated 2023-07-29 23:00 UTC)