[HN Gopher] Gzip beats BERT? Part 2: dataset issues, improved sp...
___________________________________________________________________
 
Gzip beats BERT? Part 2: dataset issues, improved speed, and
results
 
Author : JoeyBananas
Score  : 170 points
Date   : 2023-07-29 16:04 UTC (6 hours ago)
 
(HTM) web link (kenschutte.com)
(TXT) w3m dump (kenschutte.com)
 
| beefman wrote:
| Part 1 discussed here:
| https://news.ycombinator.com/item?id=36758433
 
  | dang wrote:
  | Thanks! Macroexpanded:
  |
  | _Bad numbers in the "gzip beats BERT" paper?_ -
  | https://news.ycombinator.com/item?id=36758433 - July 2023
  | (128 comments)
 
| __vec__ wrote:
| Has anyone recreated all of the results with un-"contaminated"
| datasets?
 
| ks2048 wrote:
| This is my blog post, if anyone has any questions.
|
| I'll add that since I wrote these two blog posts, other people
| have sent me related work:
|
| (1) I link to this at the end of the post (using zstd
| dictionaries): https://github.com/cyrilou242/ftcc
|
| (2) today someone sent me this (bag-of-words better than gzip):
| https://arxiv.org/abs/2307.15002v1
 
  | p1esk wrote:
  | Your conclusion: "using ideas from text compression for text
  | classification tasks is an interesting idea and may lead to
  | other interesting research."
  |
  | Would you say this idea is interesting enough for you
  | personally to research it further?
 
    | ks2048 wrote:
    | For me, no. Mainly because "text classification" is a
    | pretty limited application and one I don't plan to spend
    | much time on. For NLP tasks that require a deeper
    | "understanding", I don't see how compression algorithms can
    | help much (at least _directly_).
 
      | nico wrote:
      | Just conceptually, compression is an analog of
      | understanding.
      |
      | To be able to compress something, you need to understand
      | it first.
      |
      | We use this every day: we compress things by naming them.
      | Once we name something, we don't need to explain or
      | describe it; we can just use the name instead. That
      | allows us to compress our communications, and it directly
      | affects the parties' understanding of the information.
      |
      | That's just conceptually. At a math/algorithm level I
      | don't really know the specifics of your research or the
      | paper in question.
 
        | ChainOfFools wrote:
        | It may sound strange out of context, but the most
        | memorable quote I've encountered in any piece of
        | writing, at least in terms of informing my own
        | understanding of language and the construction of
        | meaning through communication, came in a book on
        | screenwriting by William Goldman. The guy who wrote The
        | Princess Bride, of all things.
        |
        | The sentence was simply (and in capitals in the
        | original): "POETRY IS COMPRESSION."
 
          | quickthrower2 wrote:
          | Would make a good haiku line 2.
 
        | ks2048 wrote:
        | Yes, I agree. That's why I said _directly_ (with regard
        | to compression algorithms used for understanding).
        | _Indirectly_, yes, compression and
        | intelligence/understanding are closely related.
 
        | mannykannot wrote:
        | One could say that you need to understand _something_
        | about the artifact you are compressing, but, to be
        | clear, you can compress text without understanding
        | anything about its semantic content, and this is what
        | gzip does. The only understanding needed for that level
        | of compression is that the thing to be compressed is a
        | string over a binary alphabet.
 
          | joshuamorton wrote:
          | Of course, which is why gzip is a good baseline for
          | "better" compressors that do have semantic
          | understanding.
          |
          | The whole idea of an autoencoder _is_ conceptual
          | compression. You take a concept (say: human faces)
          | and create a compressor that is so overfit to that
          | concept that when given complete gobbledygook (random
          | seed data), it decompresses that to something with
          | semantic meaning!
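 
For readers following along, the method being debated boils down to
a few lines. A minimal sketch, assuming gzip-based normalized
compression distance (NCD) and a 2-nearest-neighbor vote as in the
paper; function and variable names are illustrative, not the repo's
actual code:
 
    import gzip

    def clen(s: str) -> int:
        # Length of the gzip-compressed UTF-8 encoding of s.
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(x: str, y: str) -> float:
        # Normalized compression distance: small when x and y share
        # enough structure that compressing them together is cheap.
        cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def knn_classify(test_text, train_texts, train_labels, k=2):
        # Rank training examples by NCD to the test text and vote
        # among the k nearest. Ties here are broken arbitrarily by
        # max(); how ties *should* be broken is the crux of the
        # dispute discussed further down the thread.
        ranked = sorted((ncd(test_text, t), lab)
                        for t, lab in zip(train_texts, train_labels))
        top = [lab for _, lab in ranked[:k]]
        return max(set(top), key=top.count)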
  | refulgentis wrote:
  | Text-similarity embeddings aren't very interesting and will
  | correlate with gzip, especially when the test is text
  | similarity, and especially when the texts being tested have
  | distinct vocabularies.
  |
  | The really useful ones are based on SBERT, and measure the
  | likelihood that the answer is contained in the text that was
  | embedded.
  |
  | ex. from my unit tests: "what is my safe passcode?" has a
  | strong match with "my lockbox pin is 1234", but a very weak
  | match to "my jewelry is stored safely in the safe".
  |
  | I learned this from
  | https://news.ycombinator.com/item?id=35377935: thank you to
  | whoever posted this; it blew my mind and gave me a powerful
  | differentiator.
 
  | cs702 wrote:
  | No questions from me. Just want to say: Thank you for doing
  | all this work!
 
  | phyzome wrote:
  | I'm idly curious how much of a speedup you achieved.
 
    | ks2048 wrote:
    | I don't have complete numbers on this (I think it depends a
    | lot on the size of the training set), but for one dataset,
    | normalized time for a batch:
    |
    |     original    : 1.000
    |     precomputed : 0.644  (first improvement)
    |     gziplength  : 0.428  (+ 2nd improvement)
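 
The two improvements in that table can be sketched roughly as
follows. This is a guess at their spirit rather than the blog's
exact code: "precomputed" caches the per-training-text compression
work outside the test loop, and "gziplength" computes only the
compressed length via raw zlib instead of building a full gzip byte
string (names are illustrative):
 
    import zlib

    def clen_only(data: bytes) -> int:
        # Just the compressed length: raw zlib, no gzip container,
        # and the compressed bytes are discarded immediately.
        co = zlib.compressobj()
        return len(co.compress(data)) + len(co.flush())

    def precompute_train_lens(train_texts: list[str]) -> list[int]:
        # "precomputed": compress each training text exactly once,
        # instead of once per (test, train) pair.
        return [clen_only(t.encode("utf-8")) for t in train_texts]

    def ncd_row(test_text, train_texts, train_lens) -> list[float]:
        # One row of the distance matrix for a single test text.
        test_b = test_text.encode("utf-8")
        c_test = clen_only(test_b)
        return [
            (clen_only(test_b + b" " + t.encode("utf-8"))
             - min(c_test, c_t)) / max(c_test, c_t)
            for t, c_t in zip(train_texts, train_lens)
        ]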
| godelski wrote:
| I really think that the numbers were inflated because of the
| prolific benchmarkism that goes on in ML. Basically, if you
| don't beat SOTA, you don't get published. Usually you need SOTA
| on MULTIPLE datasets. Which is problematic, because plenty of
| non-SOTA methods are useful (never mind novel). Given the
| results Ken/ks2048 calculated, I am pretty confident that the
| work wouldn't have made it in. BUT I think the results, given
| the other features, do make the work quite useful! I agree,
| Ken, that it unfairly boosts their work, but I understand why
| they're bending over backwards to defend it. I wish people
| would just admit mistakes, but that risks losing a paper
| (probably wouldn't, though). This is probably the same reason
| they didn't think to double-check suspicious results like the
| Filipino dataset (btw, it's not uncommon for datasets to be
| spoiled, people. Always be suspicious!).
|
| I'm not trying to give them a pass, but we do need to discuss
| the perverse incentives we've set up that make these kinds of
| things so common. The work should be good on its own, but good
| doesn't mean it'll get published in a journal. And frankly, it
| doesn't matter how many citations your arxiv paper has; people
| will still say "it isn't peer reviewed" and it won't help you
| get a job, graduate, or advance in academia. Which I think we
| should all agree is idiotic, since citations indicate a form
| of peer review too.
 
  | lalaland1125 wrote:
  | I don't blame them for failing to double-check their
  | results.
  |
  | I blame them for giving obviously incorrect excuses on
  | GitHub when such an obvious mistake is pointed out.
  |
  | There is no way they could be at the stage they claim to be
  | in their program (having just defended their thesis) and
  | think the excuses they gave on GitHub are reasonable.
 
    | godelski wrote:
    | Yeah, I fully agree. They should just admit the mistake
    | rather than try to justify it. I was just trying to
    | explain the incentive structure around them that
    | encourages this behavior. Unfortunately, no one gives you
    | points for admitting your mistakes (in fact, you risk
    | losing points), and you are unlikely to lose points for
    | doubling down on an error.
    |
    | > There is no way they could be at the stage they claim to
    | be in their program (having just defended their thesis)
    | and think the excuses they gave on GitHub are reasonable.
    |
    | Unfortunately, it is a very noisy process. I know people
    | from top-3 universities with good publication records who
    | don't know probabilities from likelihoods. I know students
    | and professors at these universities who think autocasting
    | your model to fp16 halves your memory use (from fp32) and
    | are confused when you explain that that's a theoretical
    | (not practical) lower bound. Just the other day, someone
    | who has a PhD from one of these universities and is
    | currently a professor opened an issue on my GitHub
    | expecting me to teach them how to load a pretrained model.
    | This is not uncommon.
    |
    | Goodhart's Law is a bitch.
 
| recov wrote:
| Related: sentdex did a video and implementation on it as well:
| https://www.youtube.com/watch?v=jkdWzvMOPuo
 
| dekhn wrote:
| This is a masterwork of analysis and improvement of a method.
 
| [deleted]
 
| the_man_of_sex wrote:
| [flagged]
 
| birdyrooster wrote:
| There is this sense of deflation of effort in tech right now
| (always?) where, if you can just wait a moment longer to start
| coding, you can adopt something else and save yourself from
| the rat race.
 
| luc4sdreyer wrote:
| > Scientific research typically has been founded on high
| ethical standards established by researchers in academia and
| health care research institutions. Scientific fraud, an act of
| deception or misrepresentation of one's own work, violates
| these ethical standards. [1]
|
| And according to Ken Schutte:
|
| > this method uses the test label as part of its decision
| process which is not the standard classification setting and
| can't be fairly compared to others that don't.
|
| Can anyone make the case that these two descriptions don't
| overlap? Personally, I can't see how the original author can
| be so blasé about this.
|
| [1] https://pubmed.ncbi.nlm.nih.gov/2061524/
 
  | godelski wrote:
  | I try to explain in this comment[0]. I agree that this is
  | unethical behavior, but we need to also be aware of what
  | pressures are encouraging it. I also think Ken is
  | romanticizing the standards of science a bit here. They
  | would be great, but they are not what happens in practice,
  | unfortunately. Mostly the violations are unintentional, but
  | there are intentional ones too.
  |
  | [0] https://news.ycombinator.com/item?id=36922708
 
| codeflo wrote:
| The article has a link[1] to a discussion between the blog
| author and the paper author that I find revealing.
|
| Perhaps as a reminder, the issue is that the paper's
| implementation of their 2-nearest-neighbor classifier secretly
| uses an oracle to break ties, which obviously inflates the
| accuracy compared to a real-world kNN classifier that has to
| choose heuristically. To be fair, this could be a weird
| implementation accident and not malice. But I think it does
| invalidate the results.
|
| But rather than admit error, the author _defends_ this choice,
| and does so using (in my opinion) dubious statistical
| arguments. Which leads me to believe that -- at least at this
| point -- they know they made a mistake and just won't admit
| it.
|
| They claim that instead of a real-world accuracy, they wanted
| to find the "max" accuracy that their classifier was
| statistically capable of. That is, the accuracy you get if the
| stars happen to align and you get the luckiest possible
| result. Well, not only is this creative new metric not
| described in the paper, it's also not applied to the other
| algorithms. For example, I think a neural network is capable
| of achieving a "max" accuracy of 100%, if all the initial
| weights happen to perfectly encode both the training and test
| sets. But of course they just use standard training to give
| the numbers for those algorithms.
|
| [1] https://github.com/bazingagin/npc_gzip/issues/3
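 
The tie-breaking issue codeflo describes fits in a few lines. A
sketch of the difference, assuming a 2-nearest-neighbor setup;
this illustrates the scoring behavior described in the linked
GitHub issue, not the repo's actual code:
 
    import random
    from collections import Counter

    def knn2_fair(top_labels: list[str]) -> str:
        # Real-world kNN: break ties at random (or by some fixed
        # heuristic); the test label is never consulted.
        counts = Counter(top_labels).most_common()
        best = counts[0][1]
        return random.choice([lab for lab, c in counts if c == best])

    def knn2_oracle(top_labels: list[str], true_label: str) -> str:
        # The scoring being criticized: a prediction counts as
        # correct if the true label appears anywhere in the top k.
        return true_label if true_label in top_labels else top_labels[0]
 
With k=2, a tie means the two neighbors carry different labels, so
the oracle variant is scored correct whenever either neighbor is;
that is exactly the "max" accuracy being defended.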
  | ks2048 wrote:
  | Well put. Yes, I mention a similar case towards the end of
  | that exchange: consider a random-guess classifier. That has
  | a _max accuracy_ of 100%. Clearly not a useful measure on
  | its own.
 
  | pedrosorio wrote:
  | > They claim that instead of a real-world accuracy, they
  | wanted to find the "max" accuracy that their classifier was
  | statistically capable of
  |
  | Yeah, I read this on the GitHub issue a week ago and
  | couldn't believe it. Ideally, their profile(1) should allow
  | them to quickly admit they were wrong on such a simple
  | issue. Pursuit of truth and knowledge, etc.
  |
  | (1) a young PhD from a prestigious university
  |
  | > For example, I think a neural network is capable of
  | achieving a "max" accuracy of 100%
  |
  | Why reach for such powerful tools? f(x) =
  | random(num_classes) achieves 100% "upper bound" accuracy.
 
    | lalaland1125 wrote:
    | In academia, it's better to cling to obviously false
    | justifications to dismiss criticism and keep a paper
    | accepted than to admit fault and potentially be forced to
    | retract.
    |
    | Publish or perish.
 
      | bonzini wrote:
      | Retracting is extremely rare in computer science, which
      | is why many conferences have instead started "stamping"
      | papers whose artifacts provide reproducible results.
 
  | hinkley wrote:
  | A couple of AI hype cycles ago, everyone was abuzz about
  | genetic algorithms. I recall a cautionary tale about someone
  | using FPGAs to do genetic algorithms.
  |
  | After a while they noticed several disturbing things. One,
  | the winners had fewer gates than theory said was necessary
  | to solve the problem. Two, some days the winners didn't
  | work. And three, sometimes the winners didn't work on a
  | different FPGA.
  |
  | After much study, the answer was that the winning candidates
  | were treating the gate logic as analog. Manufacturing flaws
  | or PSU fluctuations would make the analog behavior differ.
  |
  | To fix this, they split the fitness test into two passes.
  | All implementations that actually worked got re-run in an
  | emulator, which of course treats the behavior as purely
  | digital. Only if they worked in both did they avoid being
  | culled.
 
| [deleted]
 
| Twirrim wrote:
| > The paper's repo does minimal processing on the datasets. It
| turns out that these problems exist in the source Huggingface
| datasets. The two worst ones can be checked quickly using only
| Huggingface's datasets.load_dataset
|
| I'm really surprised HuggingFace isn't doing
| filtering/evaluation of the datasets they're presenting. This
| ought to be a simple check for them.
 
  | lalaland1125 wrote:
  | It's not the job of HuggingFace to certify datasets. It's
  | simply outside the scope of their work.
 
  | godelski wrote:
  | That's a tall order. While the cases here are simple and
  | fairly obvious, such checks don't scale well. It can also be
  | problematic if an official dataset contains the error, since
  | fixing it means Huggingface has created a different dataset.
  | They host 48,627 datasets. Their goal is not to validate
  | datasets (which is far harder than checking for dupes,
  | itself not easy btw), but to be like GitHub, so that others
  | (like Ken) can review their peers' work and check for
  | mistakes. Because of this, HF has to allow uploading of
  | arbitrary datasets: they cannot be the arbiter of what is
  | good or bad, since that depends on what's being solved. They
  | could probably set a flag (and maybe even compute some
  | statistics!) for datasets under a few gigs in size, but they
  | cannot and should not filter them.
 
    | Twirrim wrote:
    | I appreciate there is nuance, and some checks would be
    | computationally expensive, but something like training
    | data and evaluation data being literally identical seems
    | straightforward to check for, and a very simple, quick
    | rejection.
 
  | _delirium wrote:
  | I think of HuggingFace as essentially a GitHub for ML stuff.
  | They just provide infrastructure that anyone can upload to.
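 
The quick check Twirrim quotes takes only a few lines with
Huggingface's datasets library. A sketch, assuming the Filipino
dataset mentioned upthread is "dengue_filipino" on the Hub and that
its text column is named "text" (both assumptions; substitute
whichever dataset you want to audit):
 
    from datasets import load_dataset

    # Dataset name and column name are assumptions for illustration.
    ds = load_dataset("dengue_filipino")

    train = set(ds["train"]["text"])
    test = set(ds["test"]["text"])

    print(f"train size: {len(train)}, test size: {len(test)}")
    print(f"test examples also present in train: {len(train & test)}")
 
In the worst cases the post reports, the overlap equals the test
set size, i.e. the "test" split is the training data.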
| pizza wrote:
| Is there a feature of HF's datasets platform that makes
| load_dataset throw an exception if you try to load a known-
| dubious dataset, unless you explicitly pass a kwarg like
| allow_dubious=True? If not, that might be a boon for the whole
| field; it might nip the propagation of false results at the
| outset.
 
| itvision wrote:
| <offtopic probably>Haven't read the article, but nowadays
| there's no reason to use either GZIP or bzip2 when ZSTD is
| available. It's just so much better than both. I've no idea
| why people haven't replaced everything with ZSTD, except for
| XZ/7-Zip, which can provide much higher compression ratios at
| the cost of very slow compression and insane RAM requirements
| (a 3840MB dictionary with at least two threads).</offtopic
| probably>
 
___________________________________________________________________
(page generated 2023-07-29 23:00 UTC)