[HN Gopher] Abusing linear regression to make a point
___________________________________________________________________
 
  Abusing linear regression to make a point
 
  Author : furcyd
  Score  : 41 points
  Date   : 2020-07-06 20:47 UTC (2 hours ago)
 
  (HTM) web link (www.goodmath.org)
  (TXT) w3m dump (www.goodmath.org)
 
  | graycat wrote:
  | The article has:
  |
  | > For trace failure, the probability of failure is linear in the size of the radiation dose that the chip is exposed to.
  |
  | No it's not. Impossible. Wrong.
  |
  | Does anyone not see why it's wrong?
 
  | olliej wrote:
  | I saw that chart on Twitter and thought it was a joke :-/
  |
  | You don't need to know anything about maths to see that that is farcical.
 
  | ooobit2 wrote:
  | "Hold my America," said the beer.
 
  | User23 wrote:
  | > I said that if you had reason to believe in a linear relationship, then you could try to find it. That's the huge catch to linear regression: no matter what data you put in, you'll always get a "best match" line out.
  |
  | This challenge generalizes to all model fitting. Incorrectly assuming a distribution is Gaussian is a big one.
 
  | xapata wrote:
  | The variable of interest may not have a Gaussian distribution, but its expected value and variance generally are. Sure, there are some pathological cases, but the Cauchy distribution doesn't show up that often.
 
  | solveit wrote:
  | Does the Cauchy distribution ever actually show up?
 
  | ooobit2 wrote:
  | Less often than meetings start and end on time.
 
  | fractionalhare wrote:
  | Another one is handwaving that a distribution is normal when _n > 30_ because "central limit theorem!"
  |
  | Amateur statistics is full of magical numbers and thresholds where everything "just works" :)
 
  | ooobit2 wrote:
  | When I applied to my MSDA program, I had to interview with the lady who would become my first-year mentor. "Now, do it over again... on paper." That's what she'd say to anyone too confident in their outputs.
  | She has a reputation for weeding out people who can pass the entrance testing and qualifications, but can't adapt. And what we do requires thick skin. You're wrong until you just _happen_ to be right.
 
  | spekcular wrote:
  | This article does not capture what is actually wrong with the regression.
  |
  | First, it's not necessarily wrong to fit a linear regression to data that might not be from a linear model, or that you know to be nonlinear. The data could be linear enough in the region of interest for the line to nonetheless be useful, for example. Sure, you need an underlying linear process if you want certain theorems and guarantees to apply. But with any data set, linear or not, regression still gives the best linear approximation to the conditional expectation function.
  |
  | Second, the following paragraph seems to imply that small correlations are the same as no correlation, and the reason the regression is problematic is that the correlation is small:
  |
  | > How does that fit look to you? I don't have access to the original dataset, so I can't check it, but I'm guessing that the correlation there is somewhere around 0.1 or 0.2 - also known as "no correlation".
  |
  | But small correlations, if they actually exist, can sometimes be of great practical relevance. So that's not it either.
  |
  | The actual problem is that the correlation isn't statistically significant - there isn't enough evidence to conclude that the observed (small) correlation actually exists, as opposed to being the result of random noise in the data. And indeed, as some other comments here point out, you can get similar graphs by fitting lines to randomly simulated fake data.
  |
  | (If you prefer a Bayesian gloss: the data isn't informative enough to move you off any reasonable prior with most of its mass around zero. Same principle.)
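[Editor's note: spekcular's point — that least squares always produces a "best fit" line, even for pure noise, so the real question is significance — can be checked directly. A minimal simulation sketch, assuming only NumPy; the function name and parameters are illustrative, not from the thread.]

```python
import numpy as np

def r_squared_of_noise(n=50, trials=1000, seed=0):
    """Fit OLS lines to pure noise and return the average R^2.

    Under the null (no relationship at all), the expected R^2 of a
    simple regression is roughly 1 / (n - 1), i.e. close to zero --
    yet the fitted line itself always exists and always draws a
    plausible-looking "trend" through the scatter.
    """
    rng = np.random.default_rng(seed)
    r2s = []
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # independent of x by construction
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        r2 = 1 - resid.var() / y.var()
        r2s.append(r2)
    return float(np.mean(r2s))
```

With n=50 the average R^2 over many trials comes out around 0.02 — but every single trial still yields a line, which is exactly why the line's existence proves nothing.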
  | fractionalhare wrote:
  | _> First, it's not necessarily wrong to fit a linear regression to data that might not be from a linear model, or that you know to be nonlinear. The data could be linear enough in the region of interest for the line to nonetheless be useful, for example._
  |
  | Yes, very good point! Likewise, regression is not robust to log transformations, but we still use log transformations (depending on the data) because we may be able to tolerate some loss of information. If the nonlinear correlation is bounded below some tolerable amount, it's okay to use the linear relationship on its own.
 
  | khr wrote:
  | Agreed. I was curious enough to run the model myself, so I used a tool to extract the data. The slope estimate (b=17.24) is not significantly different from zero, p=.437.
  |
  | The data are here: https://pastebin.com/HhWTKZRb
 
  | SubiculumCode wrote:
  | The problem is that the author is essentially claiming that running the regression for data not passing his eyeball test is, in itself, a misuse of regression... which is nonsense.
 
  | gweinberg wrote:
  | The way I'd say it is that the uncertainty of the value is large compared to the value itself. Just thinking about the causal mechanisms, drinking (binge or not) with other people could put you at risk if any of the other drunks are infected. But drinking alone might help protect you from covid, since it is a way to avoid other people.
 
  | asdf_snar wrote:
  | This person has no idea what they're writing about.
  |
  | Edited because my post was flagged (I'm not sure why). The definition of the correlation coefficient is incorrect, which could have been attributed to a typo, except the author goes on to say "The bottom is, essentially, just stripping the signs away.", suggesting the square root of a sum of squared differences would be the same as the sum of differences, were it not for the signs. That's not how norms work.
  |
  | The whole paragraph on interpreting a correlation coefficient is particularly painful to read: "... if the correlation is perfect - that is, if the dependent variable increases linearly with the independent, then the correlation will be 1. If the dependency variable decreases linearly in opposition to the dependent, then the correlation will be -1. If there's no relationship, then the correlation will be 0."
  |
  | For all its good intentions, I feel like this post hurts more than it helps.
 
  | scottlocklin wrote:
  | You can't mention spurious linear regression without predicting the S&P500 with Leinweber's price-of-butter-in-Bangladesh indicator.
  |
  | https://nerdsonwallstreet.typepad.com/my_weblog/files/datami...
 
  | klenwell wrote:
  | In a similar vein:
  |
  | https://www.tylervigen.com/spurious-correlations
  |
  | I'm not convinced all of these are spurious.
 
  | fractionalhare wrote:
  | Overfitting is insidious. These graphics would convince a lot of people who'd otherwise catch the error in the article without needing it to be explained to them. At that point it's not enough to eyeball the line of best fit; you have to sanity-check the explanatory power and do cross-validation...
 
  | leto_ii wrote:
  | Taleb was recently steaming on Twitter about a similar thing done to supposedly show a correlation between physician salary and covid mortality:
  | https://twitter.com/nntaleb/status/1279954325087891464
  |
  | He follows it with a few examples of spurious regressions from random data:
  | https://twitter.com/nntaleb/status/1280090844113100801
 
  | nvader wrote:
  | The chart referenced in this article was by the same author.
  |
  | https://twitter.com/AmihaiGlazer/status/1277769775855235072/...
  | https://twitter.com/AmihaiGlazer/status/1279210404602712064/...
  |
  | My favourite part is the discussion about what a vertical line of regression means.
  |
  | https://twitter.com/AmihaiGlazer/status/1279905458812149760
  |
  | Discovering a vertical regression line sounds like a beautiful prompt for a hard sci-fi short story.
 
  | Tainnor wrote:
  | This physically hurts.
 
  | SubiculumCode wrote:
  | Ultimately, the author rejected the use of regression by using an eyeball test.
  |
  | Eyeball tests are not rigorous and can be misleading. Further, the purpose of regression is not just to obtain the slope via least squares in the case of obvious relationships, but to provide a test of the null hypothesis (slope = 0) for weaker, but theoretically interesting, relationships.
  |
  | This type of amateur (and wrong) statistics article shouldn't be making it to the top of HN.
 
  | webel0 wrote:
  | Could someone link to the original article? I didn't see it in the post. Notice that they don't cite what the authors' computed R^2 was, but conjectured it was low (and I agree that it is likely low). Thus, it doesn't appear to be a case of blind p-hacking off the bat.
  |
  | Could just be really bad. However, it could be that:
  |
  | - The conclusion of the paper was that no relationship exists.
  |
  | - Later specifications include covariates. For example, including travel flows here could help to disentangle cultural mores regarding drinking from the probability that ANY virus was transmitted to a place.
  |
  | - Some sort of weighting was done. Although in that case I would expect to see a steeper slope to account for New York. Usual practice here would be to display the univariate relationship with circles that are sized to match weights.
  |
  | - Graphs like this can play tricks on your eyes. There might be a lot of dots clustered along the fit line that are overlapping, etc.
 
  | huac wrote:
  | The beauty of this post is that it is truly evergreen: there is, and always will be, bad statistics to grimace at.
 
  | MrL567 wrote:
  | It's telling that every time someone posts a bad regression, they never post the R^2.
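[Editor's note: the R^2 MrL567 asks for costs one extra line alongside the slope. A minimal sketch, assuming only NumPy; the function name is illustrative:]

```python
import numpy as np

def fit_with_r2(x, y):
    """Least-squares line plus the R^2 that should accompany it.

    R^2 = 1 - SS_res / SS_tot: the share of the variance in y
    that the fitted line accounts for.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1 - ss_res / ss_tot
```

For example, `fit_with_r2([1, 2, 3, 4], [1, 2, 2, 3])` gives slope 0.6, intercept 0.5, and R^2 = 0.9 — a reader seeing all three numbers can judge the fit; a reader seeing only the line cannot.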
  | surething123 wrote:
  | You might find this article insightful as to the utility of R-squared - "Is R-squared Useless?"
  |
  | https://data.library.virginia.edu/is-r-squared-useless/
 
  | sideshowb wrote:
  | If only. Sometimes they smooth or bin the data points and _then_ post r2!
___________________________________________________________________
(page generated 2020-07-06 23:00 UTC)
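[Editor's note: sideshowb's complaint about binning is easy to demonstrate — averaging y within x-bins strips out most of the noise, so the binned scatter reports a far higher R^2 for the same weak relationship. A sketch under assumed parameters (true slope 1, noise sd 5, 10 bins); all names are illustrative:]

```python
import numpy as np

def raw_vs_binned_r2(n=1000, n_bins=10, noise_sd=5.0, seed=0):
    """Compare R^2 on raw points vs. on bin-averaged points."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)
    y = x + rng.normal(0, noise_sd, n)  # weak signal, heavy noise

    def r2(xs, ys):
        # squared Pearson correlation = R^2 of the simple regression
        return np.corrcoef(xs, ys)[0, 1] ** 2

    # average x and y within equal-width bins of x
    edges = np.linspace(0, 10, n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    bx = np.array([x[idx == b].mean() for b in range(n_bins)])
    by = np.array([y[idx == b].mean() for b in range(n_bins)])
    return r2(x, y), r2(bx, by)
```

With these parameters the raw R^2 sits around 0.25 while the binned version jumps to roughly 0.95+, even though the underlying data never changed — which is exactly the sleight of hand being complained about.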