[HN Gopher] Abusing linear regression to make a point
       ___________________________________________________________________
        
       Abusing linear regression to make a point
        
       Author : furcyd
       Score  : 41 points
       Date   : 2020-07-06 20:47 UTC (2 hours ago)
        
 (HTM) web link (www.goodmath.org)
 (TXT) w3m dump (www.goodmath.org)
        
       | graycat wrote:
       | The article has:
       | 
       | > For trace failure, the probability of failure is linear in the
       | size of the radiation dose that the chip is exposed to.
       | 
       | No it's not. Impossible. Wrong.
       | 
       | Does anyone not see why it's wrong?
        
       | olliej wrote:
       | I saw that chart on Twitter and thought it was a joke :-/
       | 
       | You don't need to know anything about maths to see that that is
       | farcical.
        
         | ooobit2 wrote:
         | "Hold my America," said the beer.
        
       | User23 wrote:
       | > I said that if you had reason to believe in a linear
       | relationship, then you could try to find it. That's the huge
       | catch to linear regression: no matter what data you put in,
       | you'll always get a "best match" line out.
       | 
       | This challenge generalizes to all model fitting. Incorrectly
       | assuming a distribution is Gaussian is a big one.
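The "you'll always get a best match line out" point is easy to demonstrate with hypothetical data (numpy assumed) — least squares returns a line even when y is pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = rng.normal(0, 10, size=50)   # y is generated independently of x

# Closed-form simple-regression coefficients: a "best fit" line comes
# out regardless of whether any relationship actually exists.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)
```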
        
         | xapata wrote:
          | The variable of interest may not have a Gaussian
          | distribution, but the sampling distributions of its mean and
          | variance generally are approximately Gaussian. Sure, there
          | are some pathological cases, but the Cauchy distribution
          | doesn't show up that often.
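A quick numerical sketch of the contrast (hypothetical distributions, numpy assumed): sample means of a well-behaved distribution concentrate as n grows, while Cauchy sample means never do:

```python
import numpy as np

rng = np.random.default_rng(1)

# Means of 100 exponential(1) draws concentrate around 1 (CLT at work)...
exp_means = rng.exponential(1.0, size=(10_000, 100)).mean(axis=1)
print(exp_means.std())   # about 0.1, i.e. 1/sqrt(100)

# ...but the mean of n standard Cauchy draws is itself standard Cauchy,
# so averaging buys nothing at all.
cauchy_means = rng.standard_cauchy(size=(10_000, 100)).mean(axis=1)
print(np.median(np.abs(cauchy_means)))   # stays around 1 for any n
```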
        
           | solveit wrote:
           | Does the Cauchy distribution ever actually show up?
        
             | ooobit2 wrote:
             | Less often than meetings start and end on time.
        
         | fractionalhare wrote:
         | Another one is handwaving that a distribution is normal when _n
         | > 30_ because "central limit theorem!"
         | 
         | Amateur statistics is full of magical numbers and thresholds
         | where everything "just works" :)
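The n > 30 rule of thumb is easy to break with a skewed enough distribution — a sketch with hypothetical lognormal data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(a):
    # Simple sample skewness; a normal distribution would give ~0.
    a = np.asarray(a, float)
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

# Sampling distribution of the mean of n=30 lognormal(0, 2) draws:
means = np.exp(rng.normal(0, 2, size=(20_000, 30))).mean(axis=1)
print(skewness(means))   # still heavily skewed, nowhere near normal
```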
        
           | ooobit2 wrote:
           | When I applied to my MSDA program, I had to interview with
           | the lady who would become my first year mentor. "Now, do it
           | over again... on paper." That's what she'd say to anyone too
           | confident in their outputs. She has a reputation for weeding
           | out people who can pass the entrance testing and
           | qualifications, but can't adapt. And what we do requires
           | thick skin. You're wrong until you just _happen_ to be right.
        
       | spekcular wrote:
       | This article does not capture what is actually wrong with the
       | regression.
       | 
       | First, it's not necessarily wrong to fit a linear regression to
       | data that might not be from a linear model, or that you know to
       | be nonlinear. The data could be linear enough in the region of
       | interest for the line to nonetheless be useful, for example.
       | Sure, you need an underlying linear process if you want certain
       | theorems and guarantees to apply. But with any data set, linear
       | or not, regression still gives the best linear approximation to
       | the conditional expectation function.
       | 
       | Second, the following paragraph seems to imply that small
       | correlations are the same as no correlation, and the reason the
       | regression is problematic is that the correlation is small:
       | 
       | > How does that fit look to you? I don't have access to the
       | original dataset, so I can't check it, but I'm guessing that the
       | correlation there is somewhere around 0.1 or 0.2 - also known as
       | "no correlation".
       | 
       | But small correlations, if they actually exist, can sometimes be
       | of great practical relevance. So that's not it either.
       | 
       | The actual problem is that the correlation isn't statistically
       | significant - there isn't enough evidence to conclude that the
       | observed (small) correlation actually exists, as opposed to being
       | the result of random noise in the data. And indeed, as some other
       | comments here point out, you can get similar graphs by fitting
       | lines to randomly simulated fake data.
       | 
       | (If you prefer a Bayesian gloss: the data isn't informative
       | enough to move you off any reasonable prior with most of its mass
       | around zero. Same principle.)
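The random-noise point can be checked directly (sample sizes here are made up; scipy assumed): fit lines to many pure-noise datasets and look at the p-values.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)

# 1000 datasets where y is independent of x: every fit returns a slope
# and a small-looking correlation, but only about 5% come out
# "significant" at the 0.05 level -- exactly the false-positive rate
# the test is constructed to have.
pvals = []
for _ in range(1000):
    x = rng.uniform(0, 100, 50)
    y = rng.normal(0, 10, 50)
    pvals.append(linregress(x, y).pvalue)

print(np.mean(np.array(pvals) < 0.05))
```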
        
         | fractionalhare wrote:
         | _> First, it 's not necessarily wrong to fit a linear
         | regression to data that might not be from a linear model, or
         | that you know to be nonlinear. The data could be linear enough
         | in the region of interest for the line to nonetheless be
         | useful, for example._
         | 
          | Yes, very good point! Likewise, a regression on
          | log-transformed data is not the same model as a regression on
          | the raw data, but we still use log transformations (depending
          | on the data) because we may be able to tolerate some loss of
          | information. If the departure from linearity is bounded below
          | some tolerable amount, it's okay to use the linear
          | relationship on its own.
        
         | khr wrote:
         | Agreed. I was curious enough to run the model myself so I used
         | a tool to extract the data. The slope estimate (b=17.24) is not
         | significantly different from zero, p=.437.
         | 
         | The data are here: https://pastebin.com/HhWTKZRb
        
           | SubiculumCode wrote:
           | The problem is that the author is essentially claiming that
           | running the regression for data not passing his eyeball test
           | is, in itself, a misuse of regression...which is nonsense.
        
         | gweinberg wrote:
         | The way I'd say it is that the uncertainty of the value is
         | large compared to the value itself. Just thinking about the
         | causal mechanisms, drinking (binge or not) with other people
         | could put you at risk if any of the other drunks are infected.
         | But drinking alone might help protect you from covid, since it
         | is a way to avoid other people.
        
       | asdf_snar wrote:
       | This person has no idea what they're writing about.
       | 
       | Edited because my post was flagged (I'm not sure why). The
       | definition of correlation coefficient is incorrect, which could
       | have been attributed to a typo, except the author goes on to say
       | "The bottom is, essentially, just stripping the signs away.",
       | suggesting the square root of a sum of squared differences would
       | be the same as the sum of differences, were it not for the signs.
       | That's not how norms work.
       | 
       | The whole paragraph on interpreting a correlation coefficient is
       | particularly painful to read: "... if the correlation is perfect
       | - that is, if the dependent variable increases linearly with the
       | independent, then the correlation will be 1. If the dependency
       | variable decreases linearly in opposition to the dependent, then
       | the correlation will be -1. If there's no relationship, then the
       | correlation will be 0."
       | 
       | For all its good intentions, I feel like this post hurts more
       | than it helps.
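For reference, the definition the comment is correcting: Pearson's r is the covariance normalized by the product of the norms of the centered variables, not a sign-stripped sum of differences. A minimal version (numpy assumed):

```python
import numpy as np

def pearson_r(x, y):
    # Correlation = centered dot product / product of centered norms.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: perfect linear increase
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: perfect linear decrease
```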
        
       | scottlocklin wrote:
       | You can't mention spurious linear regression without predicting
        | the S&P500 with Leinweber's price of butter in Bangladesh
       | indicator.
       | 
       | https://nerdsonwallstreet.typepad.com/my_weblog/files/datami...
        
         | klenwell wrote:
         | In a similar vein:
         | 
         | https://www.tylervigen.com/spurious-correlations
         | 
         | I'm not convinced all of these are spurious.
        
         | fractionalhare wrote:
         | Overfitting is insidious. These graphics would convince a lot
         | of people who'd otherwise catch the error in the article
         | without needing it to be explained to them. At that point it's
         | not enough to eyeball the line of best fit, you have to sanity
         | check the explanatory power and do cross validation...
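The data-mining trap behind indicators like the butter example can be sketched in a few lines (all data hypothetical, numpy assumed): search enough random series and one will fit the target well in-sample, then fall apart out of sample.

```python
import numpy as np

rng = np.random.default_rng(4)

target = rng.normal(size=200)
indicators = rng.normal(size=(1000, 200))   # 1000 random "predictors"

# Pick the indicator with the best in-sample correlation...
train, test = slice(0, 100), slice(100, 200)
in_r = [np.corrcoef(ind[train], target[train])[0, 1] for ind in indicators]
best = int(np.argmax(np.abs(in_r)))

print(in_r[best])   # the in-sample winner looks respectable
# ...then check it out of sample, where it typically collapses:
print(np.corrcoef(indicators[best][test], target[test])[0, 1])
```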
        
       | leto_ii wrote:
       | Taleb was recently steaming on Twitter about a similar thing done
       | to supposedly show a correlation between physician salary and
       | covid mortality:
       | https://twitter.com/nntaleb/status/1279954325087891464
       | 
       | He follows it with a few examples of spurious regressions from
       | random data:
       | https://twitter.com/nntaleb/status/1280090844113100801
        
         | nvader wrote:
         | The chart referenced in this article was by the same author.
         | 
         | https://twitter.com/AmihaiGlazer/status/1277769775855235072/...
         | 
         | https://twitter.com/AmihaiGlazer/status/1279210404602712064/...
         | 
         | My favourite part is the discussion about what a vertical line
         | of regression means.
         | 
         | https://twitter.com/AmihaiGlazer/status/1279905458812149760
         | 
         | Discovering a vertical regression line sounds like a beautiful
         | prompt for a hard sci-fi short story.
        
       | Tainnor wrote:
       | This physically hurts.
        
       | SubiculumCode wrote:
       | Ultimately, the author rejected the use of regression by using an
       | eyeball test.
       | 
       | Eyeball tests are not rigorous, and can be misleading. Further,
       | the purpose of regression is not just to obtain the slope via
       | least squares in the case of obvious relationships, but to
       | provide a test of the null hypothesis (slope = 0) of weaker, but
       | theoretically interesting relationships.
       | 
       | This type of amateur (and wrong) statistics article shouldn't be
       | making it to the top of HN.
        
       | webel0 wrote:
        | Could someone link to the original article? I didn't see it in
        | the post. Note that the post doesn't cite the authors' computed
        | R^2, but conjectures it was low (and I agree that it is likely
        | low). So this doesn't appear to be a case of blind p-hacking
        | off the bat.
       | 
       | Could just be really bad. However could be that:
       | 
       | - The conclusion of the paper was that no relationship exists.
       | 
       | - Later specifications include covariates. For example, including
       | travel flows here could help to disentangle cultural mores
       | regarding drinking and probability that ANY virus was transmitted
       | to place.
       | 
        | - Some sort of weighting was done. Although in that case I would
       | expect to see a steeper slope to account for New York. Usual
       | practice here would be to display the univariate relationship
       | with circles that are sized to match weights.
       | 
       | - Graphs like this can play tricks on your eyes. There might be a
       | lot of dots clustered along the fit line that are overlapping
       | etc.
        
         | huac wrote:
         | the beauty of this post is that it is truly evergreen: there
         | is, and always will be, bad statistics to grimace at
        
       | MrL567 wrote:
        | It's telling that every time someone posts a bad regression,
        | they never post the R^2.
        
         | surething123 wrote:
         | You might find this article insightful as to the utility of
         | R-squared - "Is R-squared Useless?"
         | 
         | https://data.library.virginia.edu/is-r-squared-useless/
        
         | sideshowb wrote:
         | If only. Sometimes they smooth or bin the data points and
         | _then_ post r2!
        
       ___________________________________________________________________
       (page generated 2020-07-06 23:00 UTC)