hngopher.com

       [HN Gopher] Introduction to Modern Statistics
       ___________________________________________________________________
        
       Introduction to Modern Statistics
        
       Author : noelwelsh
       Score  : 526 points
       Date   : 2023-10-12 08:45 UTC (12 hours ago)
        
 (HTM) web link (openintro-ims2.netlify.app)
 (TXT) w3m dump (openintro-ims2.netlify.app)
        
       | noelwelsh wrote:
       | Statistics education is undergoing a bit of a revolution, driven
       | by the accessibility of computers. For example, hypothesis
       | testing is introduced by randomization[1], using a randomized
       | permutation test[2]. I find this really easy to understand,
       | compared to how I learned statistics using a more traditional
       | approach. The traditional approach taught me a cookbook of
       | hypothesis tests to use: use the t-test in this situation, use
       | the chi-squared in this situation, and so on. I never gained any
       | understanding of why I should use these different tests, or where
       | they came from, from the cookbook approach.
       | 
       | For the same approach in a slightly different context see [3].
       | 
       | [1]: https://openintro-ims2.netlify.app/11-foundations-
       | randomizat...
       | 
       | [2]: https://en.wikipedia.org/wiki/Permutation_test
       | 
       | [3]:
       | https://inferentialthinking.com/chapters/11/1/Assessing_a_Mo...
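       | 
       | A minimal sketch of the randomization idea in Python, using
       | only the standard library. The data and the function name
       | `permutation_test` are made up for illustration and are not
       | taken from the linked book:

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Hypothetical measurements, for illustration only.
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [11.0, 12.2, 11.8, 12.9, 11.5, 12.4]
observed, p_value = permutation_test(treatment, control)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

       | The p-value here is just the share of random relabelings that
       | give a difference at least as large as the observed one.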
        
         | iTokio wrote:
         | There is also Brilliant that has a very polished interactive
         | course:
         | 
         | https://brilliant.org/courses/statistics/
        
           | usgroup wrote:
           | These things are great if they add value for you, but I would
           | be very skeptical of any non-mathematical approach to
           | statistics. I think statistics is only made clear by
           | mathematics, much the same as Physics. And one cannot grasp
           | statistics without being able to understand the maths.
           | 
           | I think that still the best way to understand statistics is
           | to start with the mathematical theory and to grind 1000+
           | textbook problems.
        
             | mkl wrote:
             | > I think that still the best way to understand statistics
             | is to start with the mathematical theory and to grind 1000+
             | textbook problems.
             | 
             | Are there any books you'd recommend for this approach?
        
               | usgroup wrote:
               | My grind was "Mathematical Statistics with Applications"
               | by Wackerly et al. There are PDF versions if you Google
               | for it. I can't say it was quick, easy or intuitive; but
               | it works.
               | 
               | I also liked "In all Likelihood" by Pawitan for a
               | "likelihoodist" foundational approach.
        
         | usgroup wrote:
         | I've had similar thoughts, but I think it's more to do with what
         | is in your head at the time you hear about it. I found
         | permutation tests satisfying to learn about because they
         | somehow helped consolidate what I knew from distribution
         | theory. If I didn't know any distribution theory prior, I'm not
         | sure they could have that effect.
         | 
         | If you study mathematical statistics, it is not taught as a
         | cookbook. At the elementary level you learn probability theory
         | and distribution theory, all the different distributions,
         | hypothesis tests, regression, ANOVA and so on proceed from
         | there. Meanwhile, I think research scientists are often taught
         | statistics as a set of recipes because it's usually a short
         | course for a specific discipline. E.g. Statistics for
         | biologists.
        
           | ImaCake wrote:
           | I think those short courses would be more effective if they
           | didn't bother with ANOVA and instead taught intro probability
           | and distributions and then jumped straight to regression.
           | ANOVA is just a really specific way of doing a regression.
           | 
           | In R and Python's statsmodels, you get the answer to
           | (essentially) an ANOVA any time you run an LM or GLM; it's the
           | F-statistic for your whole model.
           | 
           | I know there is more nuance to this, but teaching students
           | that they can use regression for most of the problems they
           | would have used seemingly arcane tests for is going to be
           | much more useful for the students.
           | 
           | Here is a lovely page demonstrating how to do this in R:
           | https://lindeloev.github.io/tests-as-linear/
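           | 
           | A small Python sketch of the same point, assuming two groups
           | and made-up numbers: the classic equal-variance t statistic
           | and the slope t statistic from a regression on a 0/1 dummy
           | are the same number. The helper names are hypothetical:

```python
from math import sqrt
from statistics import mean

def pooled_t(a, b):
    """Classic equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s2 = ss / (na + nb - 2)  # pooled variance
    return (mb - ma) / sqrt(s2 * (1 / na + 1 / nb))

def slope_t(x, y):
    """t statistic for the slope in y = b0 + b1 * x (least squares)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return b1 / se

# Hypothetical data: group membership is encoded as a 0/1 dummy.
group_a = [4.1, 5.0, 4.7, 5.3, 4.4]
group_b = [5.6, 6.1, 5.2, 6.4, 5.9]
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_classic = pooled_t(group_a, group_b)
t_regression = slope_t(x, y)
print(t_classic, t_regression)
```

           | For two groups, the one-way ANOVA F statistic equals the
           | square of this t statistic.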
        
             | usgroup wrote:
             | I agree with the sentiment although I'm not sure there is
             | the time for all of it. At least when I took them,
             | probability theory and distribution theory were separate
             | semester long courses, and the former was a prerequisite
             | for the latter.
        
             | gpderetta wrote:
             | Statsmodels and that GitHub page are the only reason I
             | have some understanding of statistical tests.
        
           | bunderbunder wrote:
           | _Principles of Statistics_ by M.G. Bulmer is a nice
           | introduction to the mathematical side of things. It's part
           | of Dover's classic textbook series, so it's inexpensive
           | compared to newer textbooks, and also concise and well-
           | written.
           | 
           | It does assume you already have a solid understanding of
           | calculus and combinatorics, though. Which I think is fair.
           | Discrete statistics is arguably just applied combinatorics,
           | and continuous statistics applied calculus, so if you have a
           | strong foundation in those two subjects then you're already
           | 90% of the way there. (And, if you don't, stop the cart and
           | let the horse catch up.)
        
         | dr_dshiv wrote:
         | Do you know of any validation studies with Advanced Data
         | Analysis (formerly code interpreter) in chatGPT? I think it can
         | be excellent as a teaching tool.
        
         | wespiser_2018 wrote:
         | The difficulty of teaching statistics is that the maths you
         | need to prove things are right and gain an intuitive
         | understanding of the methods is far more advanced than what is
         | presented in a basic stats course. Gosset came up with the
         | t-test and proved to the world it made sense, yet we teach
         | students to apply it in a black box way without a fundamental
         | understanding of why it's right. That's not great pedagogy.
         | 
         | IMO, this is where Bayesian Statistics is far superior. There's
         | a Curry-Howard isomorphism to logic which runs extremely deep,
         | and it's possible to introduce using conjugate distributions
         | with nice closed form analytical solutions. Anything more
         | complex, well, that's what computers are for, and there are
         | great ways (Stan) to run complex distributions that are far
         | more intricate than frequentist methods.
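         | 
         | As a sketch of what a conjugate, closed-form update looks
         | like, here is the Beta-Binomial case in Python. The counts
         | and function names are assumptions chosen for illustration:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)  # Beta(1, 1) is the uniform prior on [0, 1]
# Hypothetical data: 7 successes and 3 failures.
posterior = beta_binomial_update(*prior, successes=7, failures=3)
print(posterior, beta_mean(*posterior))
```

         | The whole inference step is two additions, which is why
         | conjugate pairs work well as a first example.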
        
           | zozbot234 wrote:
           | Maximum likelihood (which underpins many frequentist methods)
           | basically amounts to Bayesian statistics with a uniform prior
           | on your parameters. And the "shape" of your prior actually
           | depends on the chosen parametrization, so in principle you
           | can account for non-flat priors as well.
        
             | nextos wrote:
             | IMHO, the discussion should not be so much whether to teach
             | Bayesian or maximum likelihood. But instead, whether to
             | teach generative models or to keep going with hypothesis
             | tests, which are generally presented to students as a bag
             | of tricks.
             | 
             | Generative models, (implemented in e.g. Stan, PyMC, Pyro,
             | Turing, etc.) split models from inference. So one can
             | switch from maximum likelihood to variational inference or
             | MCMC quite easily.
             | 
             | Generative models, beginning from regression, make a lot
             | more sense to students and yield much more robust
             | inference. Most people I know who publish research articles
             | on a frequent basis do not know p-values are not a measure
             | of effect sizes. This demonstrates current education has
             | failed.
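             | 
             | A quick Python sketch of why a p-value is not an effect
             | size, assuming a two-sample z-test with a known common
             | standard deviation (the numbers are made up): the same
             | effect gives very different p-values at different
             | sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, known common sd."""
    z = mean_diff / (sd * sqrt(2 / n_per_group))
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect = 0.1  # identical effect size in both scenarios
p_small = two_sample_z_p_value(effect, sd=1.0, n_per_group=50)    # z = 0.5
p_large = two_sample_z_p_value(effect, sd=1.0, n_per_group=5000)  # z = 5.0
print(p_small, p_large)
```

             | The effect is 0.1 standard deviations in both cases;
             | only the amount of data changes.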
        
             | eutectic wrote:
             | Maximum Likelihood corresponds to Bayesian statistics with
             | MAP estimation, which is not the typical way to use the
             | posterior.
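             | 
             | A grid-based Python sketch of this distinction, assuming
             | a binomial model with 7 successes in 10 trials (made-up
             | data): with a flat prior the posterior mode matches the
             | maximum likelihood estimate, while the posterior mean
             | does not.

```python
def binomial_likelihood(p, k, n):
    # Proportional to the binomial likelihood; the binomial
    # coefficient is dropped because it does not depend on p.
    return p ** k * (1 - p) ** (n - k)

k, n = 7, 10  # hypothetical data
grid = [i / 1000 for i in range(1001)]

# Flat prior: every grid point gets the same prior weight of 1.
unnormalized = [binomial_likelihood(p, k, n) * 1.0 for p in grid]
total = sum(unnormalized)
posterior = [w / total for w in unnormalized]

mle = k / n
map_estimate = grid[max(range(len(grid)), key=posterior.__getitem__)]
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(mle, map_estimate, posterior_mean)
```

             | The posterior mean lands close to (k + 1) / (n + 2),
             | the mean of the exact Beta(8, 4) posterior.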
        
           | thefringthing wrote:
           | > There's a Curry-Howard isomorphism [between] logic [and
           | Bayesian statistical inference].
           | 
           | This is an odd way of putting it. I think it's better to say
           | that, given some mostly uncontroversial assumptions, if one
           | is willing to assign real number degrees of belief to
           | uncertain claims, then Bayesian statistical inference is the
           | only way of reasoning about those claims that's compatible
           | with classical propositional logic.
        
       | jna_sh wrote:
       | Very excited to see Mine Cetinkaya-Rundel is an author here! Many
       | might be familiar with "R for Data Science"
       | (https://r4ds.had.co.nz/), to which she is a contributor, but
       | she's also published a lot of great papers around teaching data
       | science.
        
         | ayhanfuat wrote:
         | She also has some online courses on Coursera
         | (https://www.coursera.org/instructor/minecetinkayarundel).
         | Hands down one of the best instructors I have seen.
        
       | zvmaz wrote:
       | What is a good book on statistics that one can use for self-
       | learning?
        
         | noelwelsh wrote:
         | Depends where you are starting from and what you want to learn.
         | The linked book is a first year introduction, and does a good
         | job of that. If you want to go further there are many other
         | options:
         | 
         | * Statistical Inference by Casella and Berger. This book has a
         | very good reputation for building statistics from first
         | principles. I won't link to them, but you can find full PDF
         | scans online with a simple search. Amazon reviews:
         | https://www.amazon.com/Statistical-Inference-Roger-Berger/dp...
         | 
         | * Statistics by Freedman, Pisani, and Purves has similarly very
         | good reviews and can be easily found online. Amazon reviews:
         | https://www.amazon.com/Statistics-Fourth-David-Freedman-eboo...
         | 
         | * The majority of the Berkeley data science core curriculum
         | books are online. This is not purely statistics but 1) is
         | taught in a modern style that makes use of computation and
         | randomization and 2) uses tools that may be useful to learn
         | about.
         | 
         | 1. https://inferentialthinking.com/chapters/intro.html (Data 8)
         | 
         | 2. https://learningds.org/intro.html (Data 100)
         | 
         | 3. http://prob140.org/textbook/content/README.html (Data 140)
         | 
         | 4. https://data102.org/fa23/resources/#textbooks-from-
         | previous-... (Data 102; this gets into machine learning and
         | pure statistics)
         | 
         | The Berkeley curriculum is not the only one; there are tens,
         | possibly hundreds, of online courses. The Berkeley curriculum
         | is just 1) quite extensive and 2) the one I happened to read
         | the most about when I was recently researching how data science
         | is currently taught.
        
         | sudoankit wrote:
         | I particularly like Statistical Inference by George Casella and
         | Roger Lee Berger.
         | 
         | You could also look at Introduction to Probability by Joseph K.
         | Blitzstein and Jessica Hwang (available for free here:
         | http://probabilitybook.net (redirects to drive)).
        
           | laichzeit0 wrote:
           | Should be noted that Casella's book is... well... really
           | great if you thought Spivak's calculus and Rudin's analysis
           | to be fun books, especially the exercises.
           | 
           | Casella's exercises are absolutely brutal.
        
         | dan-robertson wrote:
         | I like _Statistical Rethinking_. It's targeted at science PhD
         | students so the focus is "how can you use statistics for
         | testing your scientific hypotheses and trying to tease out
         | causation". It doesn't go deep into the mathematics of things
         | (though expects readers to be decently numerate and comfortable
         | analysing data without statistics). It only really talks about
         | Bayesian models and how to fit them by computer, so won't cover
         | much of the frequentist side of things at all.
        
         | verbify wrote:
         | ISLR/ISLP is free, was used in my masters and is excellent (and
         | has an accompanying video series)
         | 
         | https://www.statlearning.com/
        
         | dtjohnnyb wrote:
         | A couple of more introductory books that come at it from the
         | point of view of "someone who can code" are:
         | 
         | * https://greenteapress.com/wp/think-stats-2e/ (and the similar
         | Think Bayes if you enjoy this one)
         | 
         | * https://nostarch.com/learnbayes
         | 
         | Can second Statistical Rethinking though if you have the basics
         | of stats and want to learn it again from a very different, more
         | causal/bayesian point of view.
        
         | begemotz wrote:
         | What is your background and what field will you be applying
         | your knowledge to?
         | 
         | There can be a rather wide gap between a theoretical approach
         | that you might encounter as taught by a statistician and an
         | applied approach you might encounter in a business statistics
         | or social science statistics course.
         | 
         | Depending on your math background and the area of intended
         | application, in my opinion, it would sway recommendations for a
         | first 'book' on statistics for self-learning.
        
         | photochemsyn wrote:
         | Good video lecture series:
         | 
         | https://www.thegreatcourses.com/courses/learning-statistics-...
         | 
         | Might be available for free via your local library, too.
        
       | ricksunny wrote:
       | I'm looking for help with distilling 'truth' from folk belief
       | systems by formalizing them under a Bayesian network framework, in
       | case anyone is looking for a project through which to sharpen
       | their statistical saw.
        
       | d00mer wrote:
       | They should remove "modern" from the title, because who the hell
       | uses the "R programming language" these days anymore?
        
         | Onawa wrote:
         | Everyone in my branch of Toxicology? Tons of people in
         | biological sciences. Just because you have bias against the
         | tool and don't run in the same circles doesn't mean that R
         | isn't used and loved by a subset of devs.
        
         | noelwelsh wrote:
         | Statisticians do. The Berkeley curriculum, which I've linked to
         | in another comment, uses Python.
        
         | adr1an wrote:
         | Everyone but you. Check any statistics journal. Only a few
         | people developing methods switched to Python or Julia.
        
         | i_love_limes wrote:
         | A lot of people... in fact a huge portion of statisticians,
         | epidemiologists, and econometricians use it as their primary
         | language.
         | 
         | I do genetic epidemiology (which is considerably more compute
         | intensive than regular epidemiology), and R is still the most
         | common language, with the most libraries and packages being
         | used for it, compared to python for example.
         | 
         | I think maybe you should consider being less forthcoming with
         | your opinions on topics which you are not well informed on.
        
           | wespiser_2018 wrote:
           | I worked in data science for a few start ups, and even though
           | I know Python (it's my LeetCode language of choice), R just
           | dominates when it comes to accessing academic methods and
           | computational analysis. If you are going to push the
           | boundaries of what you can and can't analyze for statistical
           | effects and leverage academic learnings, it's R.
        
         | dereify wrote:
         | fyi many state-of-the-art statistical libraries exist (or are
         | properly maintained) in R only
        
           | ImaCake wrote:
           | I find it depends on what you want. There is no canonical GAM
           | (generalized additive model) library in Python but there are a few
           | options - which are not easy to use. The statsmodels GAM
           | implementation appears to be broken. R, of course, has a
           | stupid easy to use GAM library that is pretty fast.
           | 
           | On the other hand, R has _too many_ obscure options for what
           | I can find in scipy or sklearn. So I find it easier to just
           | jump into sklearn, use the very nice unified interface
           | "pipelines" to churn through a whole bunch of different
           | estimators without having to do any munging on my data.
           | 
           | So I think it just depends on your field. But R seems to
           | stick more with academia.
        
         | nomilk wrote:
         | Before I knew command line, I tried to install python and spent
         | the next 3 days resolving an installation issue with 'wheel'.
         | 
         | By contrast, from first downloading R to running my first R
         | script took about 1 hour (the most difficult part was opening
         | the 'script' pane in RStudio IDE, which doesn't open by default
         | on new installations, for some reason).
         | 
         | There's huge demand out there for statistical software that's
         | accessible to people whose primary pursuit is not
         | programming/cs, but genetics, bioinformatics, economics,
         | ecology and other disciplines that necessitate tooling much
         | more powerful than excel, but with barriers to entry not much
         | greater than excel. R is a fairly amazing fit for those folks.
        
           | perrygeo wrote:
           | R and CRAN really get package management right. Even as a
           | very infrequent R user, there are no surprises, it "just
           | works". Compare that to my daily Python usage where I am
           | continually flummoxed by dependency issues.
        
             | _Wintermute wrote:
             | Strong disagree, there's a reason RStudio/Posit are
             | spending so much time trying to develop 3rd party
             | alternatives to install.packages() and CRAN.
             | 
             | Try installing an older version of a package without it
             | pulling in the most recent incompatible dependencies, it's
             | a whole adventure.
        
         | MilStdJunkie wrote:
         | Respectfully, I'm going to ask, "what what?". I can't swing a
         | cat without hitting dplyr. It's probably industry dependent
         | though - I could see a dataset that's 99% text having
         | absolutely no reason to even look at R at all.
        
         | f6v wrote:
         | Most people in bioinformatics.
        
         | epgui wrote:
         | Probably most people who do statistics.
         | 
         | R sucks as a language but it excels at that specific
         | application, just because of its tremendous ecosystem (putting
         | even python to shame in some niche areas).
        
           | wespiser_2018 wrote:
           | R is fine, it's no more absurd than other non-typed languages
           | like javascript. Most languages are very good at one or two
           | things, then not so good or appropriate for other tasks. For
           | R, that's statistics, modeling, and exploratory analysis,
           | which it absolutely crushes at due to ecosystem effects.
        
       | dleeftink wrote:
       | Anyone looking to apply and compare frequentist and bayesian
       | methods within a unified GUI (which is essentially an elegant
       | wrapper to R and selected/custom statistical packages), should
       | check out _JASP_ developed by the University of Amsterdam [0].
       | It's free to use, and the graphs + captions generated during each
       | step are publication quality right out of the box.
       | 
       | Using it truly feels like a 'fresh way' to do statistics. Its
       | main website provides ample use cases, guides and tutorials, and
       | I often return to the blog for the well documented deepdives into
       | how traditional frequentist methods and their bayesian
       | counterparts compare (the animated explainers are especially
       | helpful, and I appreciate the devs reflecting on each release and
       | future directions).
       | 
       | [0]: https://jasp-stats.org/
        
         | NeutralForest wrote:
         | There was an interview with one of the JASP people (creator or
         | maintainer, can't remember) on the "Learn Bayesian Stats"
         | podcast; it was very interesting.
        
           | rdhyee wrote:
           | I think the referenced episode is
           | https://learnbayesstats.com/episode/61-why-we-still-use-
           | non-... Thanks for pointing it out!
        
           | dleeftink wrote:
           | To me, it's academic software _done right_, both in terms of
           | accessibility and maintenance. I'd love to hear more about
           | their governance and funding structure and how this might be
           | applied elsewhere, and learn about academic software of
           | similar grade and utility.
        
         | mindcrime wrote:
         | Even better than just being "free to use" it's F/OSS (under the
         | AGPL):
         | 
         | https://github.com/jasp-stats/jasp-desktop
        
         | 3abiton wrote:
         | How does this compare to other stat libraries?
        
       | begemotz wrote:
       | I like the inclusion of randomization and bootstrapping. It's
       | unfortunate that the hypothesis framework is still NHST -- I
       | wouldn't consider that 'modern' by any means.
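         | 
         | For readers who have not met bootstrapping, a minimal
         | percentile-bootstrap sketch in Python. The sample values and
         | the function name `bootstrap_ci` are assumptions for
         | illustration, not code from the book:

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical sample.
sample = [2.3, 3.1, 2.8, 4.0, 3.6, 2.9, 3.3, 3.8, 2.5, 3.0]
low, high = bootstrap_ci(sample)
print(f"mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

         | Swapping `stat` for `statistics.median` gives an interval for
         | the median with no change to the rest of the code.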
        
         | noelwelsh wrote:
         | I don't see widespread agreement in the statistics community as
         | to what should replace NHST. If you go Bayesian you need to
         | completely rewrite the course. I've seen confidence intervals
         | suggested as an alternative, but there are arguments against.
         | I've also seen arguments that hypothesis tests shouldn't be
         | used at all. Given that NHST is still widely used and there
         | isn't a clear alternative I think it's a disservice to students
         | to not introduce them.
        
           | begemotz wrote:
           | I probably should have been more clear. I didn't say
           | hypothesis testing, I said NHST (the binary null/alt
           | hypothesis approach) - which is an approach to hypothesis
           | testing particularly prevalent in certain disciplines such as
           | Psychology.
           | 
           | And in that context, there is a lot of agreement that this
           | approach is fundamentally flawed and outdated. If you are
           | interested, I can provide references when I get to the
           | office. But off the top of my head consider Gigerenzer and
           | Cumming.
        
             | noelwelsh wrote:
             | For those following along at home Gigerenzer is, I think,
             | "Mindless Statistics"[1] and Cumming is "The New
             | Statistics"[2].
             | 
             | [1]: https://pure.mpg.de/rest/items/item_2101336/component/
             | file_2...
             | 
             | [2]: Sample at
             | https://tandfbis.s3.amazonaws.com/rt-
             | media/pp/common/sample-...
        
               | begemotz wrote:
               | Yes, those are appropriate (although Gigerenzer and
               | Cumming both have other relevant publications on the
               | topic).
               | 
               | As for an undergraduate text that 'teaches the
               | difference', you can look at 'An Introduction to
               | Statistics' by Carlson & Winquist.
        
       | RedShift1 wrote:
       | Can I download this as a PDF? I'd like to read it offline.
        
         | noelwelsh wrote:
         | Here: https://www.openintro.org/book/ims/
        
           | RedShift1 wrote:
           | This is the first version, not the 2nd?
        
             | noelwelsh wrote:
             | Hmmm ... must be because the 2nd edition is still in
             | progress. Best option might be to follow the immortal words
             | of Obi-Wan Kenobi and "use the source":
             | https://github.com/OpenIntroStat/ims
             | 
             | Otherwise you can try building a PDF from the very similar
             | Data 8 book[1] using [2]
             | 
             | [1]: https://github.com/data-8/textbook
             | 
             | [2]: https://jupyterbook.org/en/stable/advanced/pdf.html
        
       | usgroup wrote:
       | I think Ronald Fisher may not have used bootstrap to calculate
       | confidence intervals; but it looks to me like he invented most of
       | the rest of the syllabus .. in the early 1900s :-)
        
       | mjburgess wrote:
       | What's often missing from these introductions is when statistics
       | will not work; and what it even means when it "works". The amount
       | of data needed to tell two normal distributions apart is about 30
       | data points -- for two power-law distributions, >trillion. (And this
       | basically scuppers the central limit theorem, on which a lot of
       | cargo-cult stats is justified).
       | 
       | Stats, imv, should be taught simulation-first: code up your
       | hypotheses and see if they're even testable. Many many projects
       | would immediately fail at the research stage.
       | 
       | Next, know that predictions are almost never a good goal. Almost
       | everything is practically unpredictable -- with a near infinite
       | number of relevant causes, uncontrollable.
       | 
       | At best, in ideal cases, you can use stats to model a
       | distribution of predictions _and then_ determine a risk/value
       | across that range. I.e., the goal isn't to predict anything but to
       | prescribe some action (or inference) according to a risk
       | tolerance (risk of error, or financial risk, etc.).
       | 
       | It seems a generation of people have half-learned bits of stats,
       | glued them together, and created widespread 'statistical cargo-
       | cultism'.
       | 
       | The lesson of stats isn't hypothesis testing, but how almost no
       | hypotheses are testable -- _and then_ what do you do.
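       | 
       | A simulation-first sketch in Python of the heavy-tail point.
       | It does not reproduce the sample sizes quoted above; it only
       | compares how much sample means wander for a normal population
       | and for a Pareto population with tail index 1.1, both with a
       | theoretical mean of 11. All names and settings are assumptions:

```python
import random
from statistics import fmean, median

def relative_spread_of_means(draw, n, reps, rng):
    """Range of repeated sample means, relative to their median."""
    means = [fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return (max(means) - min(means)) / median(means)

rng = random.Random(0)

# Normal population: mean 11, standard deviation 1.
normal_spread = relative_spread_of_means(
    lambda r: r.gauss(11.0, 1.0), n=1000, reps=200, rng=rng)

# Pareto population with tail index 1.1: finite mean (11),
# infinite variance.
pareto_spread = relative_spread_of_means(
    lambda r: r.paretovariate(1.1), n=1000, reps=200, rng=rng)

print(f"normal: {normal_spread:.3f}, pareto: {pareto_spread:.3f}")
```

       | With 1000 observations per sample the normal means barely
       | move, while the heavy-tailed means are still dominated by a
       | handful of extreme draws.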
        
         | Ensorceled wrote:
         | It's ironic that this ... rant? ... is basically unreadable
         | without knowledge of basic statistical methods.
         | 
         | How do you teach any of this to someone who hasn't already
         | taken introductory statistics? How do you learn anything if you
         | first have to learn the myriad ways something you don't even
         | have a basic working knowledge of can fail before you learn it?
        
           | mjburgess wrote:
           | The comment is addressed to the informed reader who is the
           | only one with a hope of being persuaded on this point.
           | 
           | To teach this, from scratch, I think is fairly easy -- but
           | there are few with any incentive to do it. Many in academia
           | wouldn't know how, and if they did, would discover that much
           | of their research can be shown _a priori_ to not be
           | worthwhile (rather than after a decade of 'debate').
           | 
           | All you really need is to start with establishing an
           | intuitive understanding of randomness, how apparently highly
           | patterned it is, and so on. Then ask: how easy is it to
           | reproduce an observed pattern with (simulated) randomness?
           | 
           | That question alone, properly supported via basic programming
           | simulations, will take you extremely far. Indeed, the answer
           | to it is often obvious -- a trivial program.
           | 
           | That few ever write such programs shows how the whole edifice
           | of stats education is geared towards confirmation bias.
           | 
           | Before computers, stats was either an extremely mathematical
           | discipline seeking (empirically useless) formulas for toy
           | models; or using heuristic empirical formulas that rarely
           | applied.
           | 
           | Computers basically obviate all of that. Stats is mostly
           | about counting things and making comparisons -- perfect tasks
           | for machines. With only a few high-school mathematical
           | formulas most could derive most useful statistical techniques
           | as simple computer programs.
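           | 
           | One such trivial program, sketched in Python. The "observed
           | pattern" is a hypothetical streak of 7 heads in 100 coin
           | flips, and the question is how often a fair coin produces
           | a streak at least that long on its own:

```python
import random

def longest_run(flips):
    """Length of the longest run of True values."""
    best = current = 0
    for f in flips:
        current = current + 1 if f else 0
        best = max(best, current)
    return best

def chance_of_streak(streak, n_flips=100, n_sims=10_000, seed=0):
    """Share of simulated fair-coin sequences with a run >= streak."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if longest_run(flips) >= streak:
            hits += 1
    return hits / n_sims

share = chance_of_streak(7)
print(f"share of random sequences with a streak of 7+: {share:.3f}")
```

           | If pure chance reproduces the pattern in a sizeable share of
           | runs, the pattern alone is weak evidence of anything.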
        
             | noelwelsh wrote:
             | The modern approach, of which this textbook is an example,
             | does start with simulation. In fact there is very little
             | classical statistics (distributions, analytic tests) in the
             | book. The Berkeley Data 8 book, which I link to in another
             | comment, takes the same approach. I imagine there is still
             | too much classical material for your tastes, but there is
             | definitely change happening.
        
             | 2devnull wrote:
             | "that much of their research can be shown a priori to not
             | be worthwhile"
             | 
             | Bingo. Cargo cult stats all the way down. It's not just
             | personal interest, it's the entire field, it's their
             | colleagues, mentors, and students. Good luck getting
             | somebody to see the light when not just their own income
             | depends on not seeing it, their whole world depends on the
             | "stat recipes" handed down from granny.
        
               | brutusborn wrote:
               | I think the egotistical aspect is the most powerful: many
               | researchers have built an identity based on the fact that
               | they "know" something, so to propose better alternatives
               | to their pet theories is tantamount to proposing their
               | life is a lie. To change their mind they need to admit
               | they didn't "know".
               | 
               | The better the alternatives, the more fierce the passion
               | with which they will be rejected by the mainstream.
        
               | 2devnull wrote:
               | I now think it's best explained by simple economics.
               | Academia and academics are the product of economic forces
               | by and large. It's not quirky personalities or uniquely
               | talented minds that make up academia today. It's droves
               | of conscientious (big five sense) conformists, with
               | either high iq or mere socio-economic privilege, who have
               | been trained by our society to feel that financial
               | security means college, and even more financial security
               | means even more college. Credentials are like alpha .05,
               | they solve a scale problem in a way that alters the
               | quality/quantity ratio. If you want more
               | researchers/research/science output, credentials and
               | alpha .05 cargo cult stats are your levers to get more
               | quantity at lower quality.
        
           | Retric wrote:
           | It seems like a reasonable critique. The suggestion is to
           | include such ideas as people are taking introductory
           | statistics which isn't inappropriate. I wouldn't suggest
           | forcing students to code up their own simulations from
           | scratch, but creating a framework where students can plug in
           | various formulas for each population, attach a statistical
           | test, and then run various simulations could do quite a bit.
           | However, which kinds of formulas students are told to plug in
           | is important.
           | 
           | If every formula is producing bell curves then that's a
           | failure to educate people. 50d6 vs 50d6 + 1 is easy enough;
           | you can also include 1d2 * 50 + 50d6 for a two-peaked
           | distribution, as well as significantly different
           | distributions which then fail various tests etc.
           | 
           | I've seen people correctly recall the formulas for
           | statistical tests from memory and then wildly misapply them.
           | That seems like focusing on the wrong things in an age when
           | such information is at everyone's fingertips, but
           | understanding of what that information means isn't.
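
One possible shape for such a plug-in framework, sketched in Python
under stated assumptions: the population functions follow the dice
formulas mentioned above, while the helper names, the sample size of
30, the run count, and the use of a Welch-style statistic with a
normal-approximation cutoff of 1.96 are all hypothetical choices
made for illustration.

```python
import random
from statistics import mean, stdev


def roll(rng, count, sides=6):
    """Sum of `count` dice with `sides` faces."""
    return sum(rng.randint(1, sides) for _ in range(count))


# Populations a student could plug in.
def plain(rng):
    return roll(rng, 50)                            # 50d6


def shifted(rng):
    return roll(rng, 50) + 1                        # 50d6 + 1


def two_peaked(rng):
    return rng.randint(1, 2) * 50 + roll(rng, 50)   # 1d2 * 50 + 50d6


def welch_rejects(xs, ys, critical=1.96):
    """Welch-style test on the means; 1.96 is a normal approximation."""
    se = (stdev(xs) ** 2 / len(xs) + stdev(ys) ** 2 / len(ys)) ** 0.5
    return abs(mean(xs) - mean(ys)) / se > critical


def rejection_rate(pop_a, pop_b, test, n=30, runs=200, seed=0):
    """Fraction of simulated experiments in which `test` rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        xs = [pop_a(rng) for _ in range(n)]
        ys = [pop_b(rng) for _ in range(n)]
        hits += test(xs, ys)
    return hits / runs


rate_null = rejection_rate(plain, plain, welch_rejects)
rate_shift = rejection_rate(plain, shifted, welch_rejects)
rate_peaks = rejection_rate(plain, two_peaked, welch_rejects)
print(rate_null, rate_shift, rate_peaks)
```

Comparing the three rates shows how often the attached test fires
when nothing differs, when the shift is tiny relative to the noise,
and when the shape of the population changes entirely.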
        
         | taeric wrote:
         | Model building, at large, is the thing I regret being bad at.
         | Model your problem and then throw inputs at it and see what you
         | can see.
         | 
         | Sucks, as we seem to have taught everyone that statistical
         | models are somehow unique models that can only be made to get a
         | prediction. To the point that we seem to have hard delineations
         | between "predictive" models and other "models".
         | 
         | I suspect there are some decent ontologies there. But, at
         | large, I regret that so many won't try to build a model.
        
         | srean wrote:
         | I work in applied ML and stats. Whenever a client gets pushy
         | about getting a prediction and would not care about quantifying
         | the uncertainty around it, I take it as a signal to disengage
         | and look for better pastures. It is really not worth the time,
         | more so if you value integrity.
         | 
         | Competent stakeholders and decision makers use the uncertainty
         | around predictions, the chances of an outcome that is different
         | from the point-predicted outcome, to come to a decision and the
         | plan includes what the course of action should be should the
         | outcome differ from the prediction.
        
         | 0xDEAFBEAD wrote:
         | >The amount of data needed to tell between two normal is about
         | 30 data points
         | 
         | What are you trying to say here? If there are two normal
         | distributions, both with variance one, one having mean 0 and
         | the other having mean 100, and I get a single sample from one
         | of the distributions, I can guess which distribution it came
         | from with very high confidence. Where did the number 30 come
         | from?
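
The point can be checked numerically. The sketch below assumes two
normal distributions with equal variance and equal prior odds, and
guesses the source by asking which mean the sample average is closer
to. The function names and the 5% error target are illustrative
assumptions; the only claim is that the number of samples needed
depends on how far apart the means are.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def misclassification_rate(mean_gap, n_samples=1, sigma=1.0):
    """Chance of guessing the wrong source with the midpoint rule.

    The average of n samples has standard deviation sigma / sqrt(n),
    and the guess is wrong when it lands past the midpoint.
    """
    z = (mean_gap / 2) * (n_samples ** 0.5) / sigma
    return 1 - STANDARD_NORMAL.cdf(z)


def samples_needed(mean_gap, max_error=0.05, sigma=1.0):
    """Smallest sample size whose error rate is at most max_error."""
    n = 1
    while misclassification_rate(mean_gap, n, sigma) > max_error:
        n += 1
    return n


for gap in (100, 1, 0.5):
    print(gap, samples_needed(gap))
```

With means 100 apart a single sample is enough, while closer means
need progressively more data, so no single number such as 30 can
hold in general.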
        
           | sndean wrote:
           | > Where did the number 30 come from?
           | 
           | Yeah, I've also heard 30 for normal distributions over and
           | over in ~7 stats courses that I've taken.
           | 
           | This SE stats answer sounds reasonable enough:
           | https://stats.stackexchange.com/a/2542
        
         | juunpp wrote:
         | I am a noob and I've always got stuck on comparing two
         | independent means. Assumption: normality. Yeah, data is never
         | normal in my bakery.
        
         | haberman wrote:
         | This really resonates with me. I've attempted self-study about
         | statistics many times, each time wanting to understand the
         | fundamental assumptions that underlie popular statistical
         | methods. When I read the result of a poll or a scientific
         | study, how rigorous are the claimed results, and what false
         | assumptions could undermine them?
         | 
         | I want to build intuitions for how these statistical methods
         | even work, at a high level, before getting drowned in math
         | about all the details. And like you say, I want to understand
         | the boundaries: "when statistics will not work; and what it
         | even means when it 'works'".
         | 
         | I imagine that different methodologies exist on a spectrum,
         | where some give more reliable results, and others are more
         | likely to be noise. I want to understand how to roughly tell
         | the good from the bad, and how to spot common problems.
        
         | wespiser_2018 wrote:
         | "Simulation first" is how I did things when I worked in data
         | science and bioinformatics. Define the simulation that
         | represents "random", then see how far off the actual data is
         | using either information theory or just a visual examination of
         | the data and summary statistic checks. That's a fast and easy
         | way to gut check any observation to see if there is an
         | underlying effect, which you can then "prove" using a more
         | sophisticated analysis.
         | 
         | Raw hypothesis testing is just too easy to juke by overwhelming
         | it with trials. Lots of research papers have "statistically
         | significant" results, but give no mention of how many
         | experiments it took to get them, or any indication of negative
         | results. Given enough effort, there will eventually always be
         | an analysis where you incorrectly reject the null hypothesis.
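
A rough illustration of the "enough trials" effect, assuming the
null hypothesis is true in every experiment and the experiments are
independent. The function names, the alpha of 0.05, and the run
count are assumptions made for the sketch.

```python
import random


def chance_of_false_positive(n_experiments, alpha=0.05):
    """Probability that at least one of n independent tests rejects."""
    return 1 - (1 - alpha) ** n_experiments


def simulated_rate(n_experiments, alpha=0.05, runs=20_000, seed=0):
    """Same quantity, estimated by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # Under a true null hypothesis each p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(n_experiments)):
            hits += 1
    return hits / runs


for n in (1, 5, 14, 20):
    print(n, chance_of_false_positive(n), simulated_rate(n))
```

Under these assumptions the chance of at least one spurious
"significant" result passes one half by the fourteenth experiment,
which is why the number of attempts matters when reading a result.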
        
         | RSMDZ wrote:
         | >> between two power-law distributions, >trillion
         | 
         | Do you have anywhere I can read more about this? I would have
         | assumed that a trillion data points would be sufficient to
         | compare any two real-world distributions
        
         | bigbillheck wrote:
         | > The amount of data needed to tell between ... two power-law
         | distributions, >trillion.
         | 
         | I don't agree with this as a statement of fact (except in the
         | obvious case of two power-law distributions with extremely
         | close parameters). Supposing it was true, that would mean that
         | you would almost never have to actually worry about the
         | parameter, because unless your dataset is that large one power
         | law is about as good as any other for describing your data.
        
       | elashri wrote:
       | Thanks to the author for the book and making it open access. I
       | always admire these efforts.
        
       | growingkittens wrote:
       | Is there a "pre-statistics" book that teaches the thinking skills
       | and concepts needed to understand statistics?
        
         | ndr wrote:
         | This book seems to start where you need it to start.
         | 
         | You don't need much beyond basic calculus. Most suffer from
         | some mental block they got installed at a young age, akin to
         | those who say "I'm bad at math" because their teacher sucked.
         | Dive in and you won't regret it.
        
           | obscurette wrote:
           | I have been a math teacher and although I can't guarantee
           | that I didn't suck, I can say that most kids don't develop
           | this attitude because of teachers, but because of their
           | parents. "My mum says that she sucked at math/music/whatever
           | as well, so do I!" is far too common. As a teacher I just
           | didn't have resources to influence this attitude either.
        
             | ndr wrote:
             | Yes, parents can be horrible too. Unfortunately it's
             | somehow socially acceptable and even worthy of pride in
             | some circles, to be "bad at math". It seems very rare for
             | someone to openly say "I'm bad at [my native language]" or
             | "writing".
             | 
             | I feel stats has a somewhat similar effect even among
             | those with math education. Several friends who have a
             | degree in math recoil at the first mention of stats
             | concepts.
        
               | obscurette wrote:
               | > It seems very rare for someone to openly say "I'm bad
               | at [my native language]" or "writing".
               | 
               | It is actually even fashionable in non-English countries.
               | Declaring "I'm bad at [my native language], I only use
               | English anyway" makes you a better person somehow. And
               | it's not rare in other areas either - in a post-truth
               | world it's trendy not to know things.
        
               | Novosell wrote:
               | In non-English countries? All of them? Source? I, as a
               | person from one of said non-English countries, disagree.
        
           | growingkittens wrote:
           | My mental block is a brain injury that went undiagnosed until
           | I was 30. I can't really hold more than two numbers in my
           | head at a time. I struggled through math in school because it
           | was lecture based, and the books were written to accompany a
           | lecture.
           | 
           | I can learn math fairly well if I have the right written
           | material and the right direction. However, I do not retain
           | math skills: without active practice, I revert back to "how
           | do fractions work?"
           | 
           | For example, I did extremely well in a college algebra course
           | that was partially online (combined with Khan Academy to
           | catch up). I could do my tests perfectly in pen, much to the
           | amusement of the assistants. I could make connections and see
           | the implications and applications of the math. Roughly three
           | to six months later, I was back to forgetting fractions.
           | 
           | I can't learn these things over time, but I can learn them
           | all at once. I'm collecting resources for my next math
           | adventure.
        
       | armcat wrote:
       | One of my favourite books on statistics and probability is
       | "Regression and Other Stories", by Andrew Gelman, Jennifer Hill
       | and Aki Vehtari. You can access the book for free here:
       | https://users.aalto.fi/~ave/ROS.pdf
        
         | epgui wrote:
         | +1, this is a great textbook, and not just for social sciences
         | as the second header would suggest.
        
       | epgui wrote:
       | As much as I appreciate and love all pedagogical endeavours in
       | the field, especially in the form of open texts, I really,
       | really, really dislike this overall approach to teaching
       | introductory statistics.
       | 
       | I'm hoping to see, over time, a shift away from ad-hoc null
       | hypothesis testing in favour of linear models (yes, in
       | introductory courses, from the start-- see link below) and
       | Bayesian-by-default approaches.
       | 
       | https://lindeloev.github.io/tests-as-linear/#:~:text=Most%20....
        
         | bschne wrote:
         | I am partway through McElreath's "Statistical Rethinking" and I
         | fully agree with this.
        
           | epgui wrote:
           | That's a great textbook!
        
             | TheAlchemist wrote:
             | It's been recommended on this topic several times, so I'm
             | looking at it. Quite expensive! I see there is a series of
             | lectures, which seems identical to the book. Is it the
             | same? Or is it still worth buying the book?
        
               | noelwelsh wrote:
               | The lectures are good, and I've been told the book can be
               | found online by the intrepid. I guess that Anna's Archive
               | or Library Genesis has it.
        
               | TheAlchemist wrote:
               | I've found the book indeed - although it seems to be the
               | first edition.
               | 
               | It's here:
               | https://civil.colorado.edu/~balajir/CVEN6833/bayes-
               | resources...
        
         | begemotz wrote:
         | I agree about teaching from a unified GLM basis. The 'bayesian-
         | by-default' approach seems to be going out on a more tenuous
         | limb, imo.
        
           | JHonaker wrote:
           | It only appears tenuous because the subjective choices you
           | have to make when using frequentist methods are made for you
           | by the developer of the method.
           | 
           | It's less comfortable to use Bayesian methods because you
           | have to be explicit about your assumptions _as the user_,
           | which opens your assumptions up for easier inspection.
           | There's also way less specific information implied by priors
           | than most people think. Informative priors should try to make
           | distinctions between something that's reasonable-ish and
           | something that's essentially infinity (take pharmacokinetics,
           | for example: a molecule diffusing in your bloodstream
           | shouldn't have a velocity near the speed of light in a
           | vacuum, should it?). They should not be forcing
           | your model to achieve a particular result. Luckily, because
           | of the need to explicitly state them in a Bayesian analysis,
           | it's much easier to determine if they were properly set.
           | 
           | Prior specification is essentially problem domain-informed
           | regularization where you can actually hope to understand if
           | the hyperparameter is going to work or not.
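
A small sketch of priors acting as regularization, using the
textbook normal-normal update for an unknown mean with a known
observation standard deviation. The function name and every number
below are illustrative assumptions; the aim is only to show that a
wide prior barely moves the estimate while a narrow one dominates
it.

```python
def posterior_mean_and_sd(prior_mean, prior_sd, sample_mean, obs_sd, n):
    """Conjugate update for a normal mean with known observation sd."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / obs_sd ** 2
    post_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average.
    post_mean = (prior_precision * prior_mean
                 + data_precision * sample_mean) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical data: 20 observations averaging 5.0, observation sd 2.0.
weak_mean, weak_sd = posterior_mean_and_sd(0.0, 100.0, 5.0, 2.0, 20)
strong_mean, strong_sd = posterior_mean_and_sd(0.0, 0.1, 5.0, 2.0, 20)
print(weak_mean, strong_mean)
```

The prior standard deviation plays the role of the regularization
strength here, and because it is written down explicitly a reader
can see at a glance whether it is forcing the answer.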
        
         | fallat wrote:
         | > I'm hoping to see, over time, a shift away from ad-hoc null
         | hypothesis testing in favour of linear models (yes, in
         | introductory courses, from the start-- see link below) and
         | Bayesian-by-default approaches.
         | 
         | Is there anything where I can start today, as a guinea pig? My
         | statistics education is basically zero.
        
           | bschne wrote:
           | See my sibling comment, can recommend this:
           | https://xcelab.net/rm/statistical-rethinking/
        
           | noelwelsh wrote:
           | There are other comments here that suggest a number of books
           | at varying levels. "Introduction to Modern Statistics" is
           | very approachable in its presentation.
        
       | willsmith72 wrote:
       | The epub is apparently too big to send to a kindle, but I can't
       | see the option to download it, only the pdf. Any ideas?
        
       | tea-coffee wrote:
       | This looks to be the 2nd edition. Can anyone comment on how the
       | 1st edition was?
        
       | mavam wrote:
       | For studying statistics, I put together a comprehensive cheat
       | sheet: https://github.com/mavam/stat-cookbook
        
       ___________________________________________________________________
       (page generated 2023-10-12 21:00 UTC)