[HN Gopher] Logistic regression from scratch
       ___________________________________________________________________
        
       Logistic regression from scratch
        
       Author : pmuens
       Score  : 107 points
       Date   : 2020-06-25 13:54 UTC (9 hours ago)
        
 (HTM) web link (philippmuens.com)
 (TXT) w3m dump (philippmuens.com)
        
       | curiousgal wrote:
       | Very interesting seeing people in the comments debate what is a
       | very basic thing taught in any stats/econometrics class.
       | 
        | The idea behind binary regression (Y is 0 or 1) is that you use
        | a latent variable Y* = beta X + epsilon.
       | 
        | X is the matrix of independent variables, beta is the vector of
        | coefficients, and epsilon is an error term that captures the
        | rest of what X can't explain.
       | 
       | Y thus becomes 1 if Y* >0 and 0 otherwise.
       | 
       | Seeing how Y is binary, we can model it using a Bernoulli
       | distribution with a success probability P(Y=1) = P(Y* >0) = 1 -
       | P(Y* <=0) = 1 - P(epsilon <= -beta X) = 1 - CDF_epsilon(-beta X)
       | 
       | Technically you can use any function that maps R to [0, 1] as a
       | CDF. If its density is symmetrical then you can directly write
       | the above probability as CDF(beta X). The two usual choices are
       | either the normal CDF which gives the Probit model or the
       | Logistic function (Sigmoid) which gives the Logit model. With the
       | CDF known you can calculate the likelihood and use it to estimate
       | the coefficients.
       | 
        | People prefer the Logit model because its coefficients are
        | interpretable in terms of log-odds and the function has some
        | nice numerical properties.
       | 
       | That's all there is to it really.
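        | 
        | A minimal sketch of the logit case in Python (plain NumPy; the
        | names sigmoid and log_likelihood are just illustrative, not from
        | the article):
        | 
        |     import numpy as np
        | 
        |     def sigmoid(z):
        |         # logistic CDF: maps R to (0, 1)
        |         return 1.0 / (1.0 + np.exp(-z))
        | 
        |     def log_likelihood(beta, X, y):
        |         # P(Y=1 | X) = CDF(X beta); swap sigmoid for the normal
        |         # CDF to get the Probit model instead
        |         p = sigmoid(X @ beta)
        |         return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))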
        
         | reactspa wrote:
         | I've always found it a little simplistic that the default cut-
         | off, in most statistical software, for whether something should
         | be 0 or 1, is 0.5.
         | 
         | (i.e. > 0.5 equals 1, and < 0.5 equals 0).
         | 
         | This seems to be a "rarely-questioned assumption".
         | 
         | Is there a reason why this is considered reasonable? And is
         | there a name for the cut-off (i.e., if I were to want to change
         | the cut-off, what keyword should I search for inside the
         | software's manual?)?
        
           | curiousgal wrote:
            | From a stats perspective the cutoff is included in the
            | coefficients. If you use a design matrix (add a column of 1s
            | to your variables) you get, in non-matrix notation,
            | beta_0 * 1 + beta_1 * X_1 + ..., so the threshold can be
            | considered beta_0.
           | 
           | In the software, you can get classification models to output
           | class probabilities instead of class labels. You can then use
            | whatever threshold you like to transform those
            | probabilities into labels.
           | 
            | You may see it referred to as the "discrimination threshold".
           | Varying that threshold is how ROC curves are constructed.
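            | 
            | For example, with scikit-learn (assuming it is installed; the
            | data here is made up), the 0.5 default only enters when you
            | turn probabilities into labels yourself:
            | 
            |     import numpy as np
            |     from sklearn.datasets import make_classification
            |     from sklearn.linear_model import LogisticRegression
            | 
            |     X, y = make_classification(n_samples=200, random_state=0)
            |     clf = LogisticRegression().fit(X, y)
            | 
            |     proba = clf.predict_proba(X)[:, 1]  # class probabilities
            |     threshold = 0.3              # discrimination threshold
            |     labels = (proba >= threshold).astype(int)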
        
           | [deleted]
        
         | platz wrote:
          | fyi, generally when explaining things to non-practitioners,
          | it's off-putting to add qualifiers like
         | 
         | - this is a very basic thing
         | 
         | - that's all there is to it
         | 
         | because although it is a basic thing to you, it's not a basic
         | thing to someone who hasn't spent the same time studying all
         | the concepts beforehand.
         | 
         | This is generally why you have to take a class to grok stats
         | rather than just read some reference material and definitions.
         | 
          | It's similar to programming environments where the senior dev
          | says something that requires a lot of context is just really
          | simple.
         | 
         | Monads are Simply Just monoids in the category of endofunctors,
         | after all, they are really Simple, that's all there is to it!
         | (what the subject hears: This is so simple to me, why aren't
         | you smart enough to get this simple concept? You should know
         | this already--I shouldn't even have to make this comment! Why
         | haven't you been properly educated, like I have? )
        
           | curiousgal wrote:
            | I totally agree, but I was explaining that to practitioners,
            | or at least people who consider themselves data scientists. I
            | agree that it is by no means a beginner-friendly explanation.
        
       | IdiocyInAction wrote:
       | My understanding of Logistic Regression is that it's linear
       | regression on the log-odds, which are then converted to
       | probabilities with the sigmoid/softmax function. This formulation
       | allows one to do direct linear regression on the probabilities,
       | without the unpleasant side effects of just using a linear model
       | as-is. A mathematical justification for doing this is given by
       | the generalized linear model formulation.
        
         | obastani wrote:
         | This is not quite correct. The log probabilities are
         | 
         | log p(y=1 | x; beta) = beta * x - log Z(x; beta)
         | 
         | where
         | 
          | Z(x; beta) = exp(0 * x) + exp(beta * x) = 1 + exp(beta * x) is
          | the normalizer over the two classes
         | 
         | Thus, you can think of it as linear regression, but with an
         | additional term log Z(x; beta) in the log likelihood.
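          | 
          | A quick numerical check of that identity (illustrative values
          | for beta and x):
          | 
          |     import numpy as np
          | 
          |     beta = np.array([0.5, -1.2])
          |     x = np.array([2.0, 1.0])
          | 
          |     score = beta @ x                     # beta * x
          |     log_Z = np.log(1.0 + np.exp(score))  # log Z(x; beta)
          |     log_p1 = score - log_Z               # log p(y=1 | x; beta)
          | 
          |     # same thing, written via the sigmoid
          |     assert np.isclose(log_p1,
          |                       np.log(1.0 / (1.0 + np.exp(-score))))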
        
         | j7ake wrote:
         | Putting logistic and linear regression into the generalised
         | linear model framework is the right way to think of it and
         | compare them.
         | 
          | From this point of view, linear regression is a GLM with the
          | identity link function, while logistic regression uses the
          | logit function as the link.
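          | 
          | For example, with statsmodels (assuming it is installed; the
          | data below is simulated), both are the same GLM call with a
          | different family/link:
          | 
          |     import numpy as np
          |     import statsmodels.api as sm
          | 
          |     rng = np.random.default_rng(0)
          |     X = sm.add_constant(rng.normal(size=(100, 2)))
          |     y_cont = X @ [1.0, 2.0, -1.0] + rng.normal(size=100)
          |     p = 1 / (1 + np.exp(-(X @ [0.5, 1.0, -1.0])))
          |     y_bin = (rng.uniform(size=100) < p).astype(int)
          | 
          |     # identity link (linear regression)
          |     lin_fit = sm.GLM(y_cont, X,
          |                      family=sm.families.Gaussian()).fit()
          |     # logit link (logistic regression)
          |     logit_fit = sm.GLM(y_bin, X,
          |                        family=sm.families.Binomial()).fit()
          |     print(lin_fit.params, logit_fit.params)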
        
         | tomrod wrote:
          | From what I recall this is a bit off -- not a bad mental model,
          | but the math plays out differently.
         | 
         | Linear regression has a closed form solution of X projected
         | onto Y: \hat{\beta} = (X'X)^{-1} X' Y
         | 
          | It is equivalent to the Maximum Likelihood Estimator (MLE) for
          | linear regression. For logistic regression, however, there is
          | no closed form, and the MLE gives different estimates than a
          | linear regression fit to the log odds.
         | 
         | Linear regression on {class_inclusion} = XB gives the linear
         | probability model, which has limited utility. The required
         | transform is covered by another commenter.
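          | 
          | A rough sketch of the contrast in NumPy (the logistic fit here
          | uses a few thousand plain gradient steps rather than a proper
          | solver, just to show there is no one-line closed form):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = np.column_stack([np.ones(500), rng.normal(size=500)])
          |     true_beta = np.array([-0.5, 2.0])
          |     y = (rng.uniform(size=500) <
          |          1 / (1 + np.exp(-X @ true_beta))).astype(float)
          | 
          |     # OLS / linear probability model: closed form (X'X)^-1 X'Y
          |     beta_lpm = np.linalg.solve(X.T @ X, X.T @ y)
          | 
          |     # Logistic MLE: no closed form, iterate on the gradient
          |     beta_logit = np.zeros(2)
          |     for _ in range(2000):
          |         p = 1 / (1 + np.exp(-X @ beta_logit))
          |         beta_logit += 0.5 * X.T @ (y - p) / len(y)
          | 
          |     print(beta_lpm, beta_logit)  # very different estimates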
        
           | IdiocyInAction wrote:
            | You're right, my model was a bit off. Thanks for pointing
            | that out; I forgot about that.
        
         | delib wrote:
         | > it's linear regression on the log-odds
         | 
         | Almost - logistic regression assumes that the function is
          | _linear in the log odds_, i.e. log(p/(1-p)) = Xb + e. The
         | problem is that you can't compute the log-odds, because you
         | don't know p.
        
         | random314 wrote:
          | Linear regression uses MSE loss. Logistic regression uses log-
          | loss. The two loss functions behave differently.
          | 
          | It's not just the underlying model; the loss function is also
          | different.
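          | 
          | Concretely (per-batch losses written out in NumPy; y is the
          | 0/1 label, y_hat / p the model output):
          | 
          |     import numpy as np
          | 
          |     def mse_loss(y, y_hat):
          |         # squared error, used by linear regression
          |         return np.mean((y - y_hat) ** 2)
          | 
          |     def log_loss(y, p, eps=1e-12):
          |         # Bernoulli negative log-likelihood, used by
          |         # logistic regression
          |         p = np.clip(p, eps, 1 - eps)
          |         return -np.mean(y * np.log(p)
          |                         + (1 - y) * np.log(1 - p))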
        
         | crdrost wrote:
         | There is a different mathematical justification as well in
         | terms of Bayesian reasoning.
         | 
         | The claim is that "evidence" in Bayesian reasoning naturally
         | acts upon the log-odds, mapping prior log-odds to posterior
         | log-odds additively. To see this, calculate from the
          | definition:
          | 
          |     Odds(X | E) := Pr(X | E) / Pr(!X | E)
          |                  = Pr(X ∧ E) / Pr(!X ∧ E)
          |                  = (Pr(E | X) x Pr(X)) / (Pr(E | !X) x Pr(!X))
          |                  = LR(E, X) x Odds(X),
          | 
          | where LR is the usual likelihood ratio.
         | 
         | So when we take the logarithm of both sides, we find that new
         | evidence adds some quantity--the log of the likelihood ratio of
         | the evidence--to our log of prior probability, in this phrasing
         | of Bayes' theorem.
         | 
          | I sometimes tell people this in a slightly strange way: if we
          | ran into aliens, we might find out that they don't believe that
          | things are absolutely true or false, but instead measure their
          | truth or falsity in decibels.
         | 
         | So another perspective on what logistic regression is trying to
         | do, is that it is trying to assume linear log-likelihood-ratio
         | dependence based on the strength of some independent pieces of
         | evidence. You can weakly justify this in all cases, using
         | calculus and assuming everything has a small impact. You can
         | further justify it strongly for any signal where twice as large
         | of a measured regression variable ultimately implies twice as
         | many independent events at a much lower level happening and
          | independently providing their evidence for the regression
         | outcome. So like, I come from a physics background, I am
         | thinking in this case of photon counts in a photomultiplier
          | tube or so: I know that at a lower level, each photon
          | contributes an equal, small bit of evidence for something, so
          | when I add all of those counts up, this is the appropriate
          | framework to use.
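          | 
          | A toy sketch of evidence adding up on the log-odds (decibel)
          | scale, with made-up likelihood ratios:
          | 
          |     import math
          | 
          |     def to_db(odds):
          |         return 10 * math.log10(odds)
          | 
          |     prior_odds = 1 / 99                  # P(X) = 1%
          |     lrs = [5.0, 2.0, 0.5]                # independent evidence
          | 
          |     posterior_db = to_db(prior_odds) + sum(to_db(lr)
          |                                            for lr in lrs)
          |     posterior_odds = 10 ** (posterior_db / 10)
          |     print(posterior_odds / (1 + posterior_odds))  # ~0.048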
        
         | [deleted]
        
         | nerdponx wrote:
         | It's better to think of linear regression and logistic
         | regression as special cases of the Generalized Linear Model
         | (GLM).
         | 
         | In that framework, they are literally the same model with
         | different "settings" - Gaussian vs Bernoulli distribution.
        
           | markkvdb wrote:
           | I have to disagree with you. While assuming Gaussian
           | disturbance terms results in a linear regression, the linear
           | regression framework is more general. It makes no assumptions
           | about the distribution of the disturbance terms. Instead, it
           | merely restricts the variance to be constant over all values
           | of the response variable.
        
             | nerdponx wrote:
             | Both things can be true.
             | 
             | Linear regression is extra-special because it's a special
             | case of several different frameworks and model classes.
             | 
             | I should have written that it's better (in my opinion) to
             | think of logistic regression in the context of GLMs, at
             | least while you're learning.
             | 
             | Edit: yes logistic regression is a special case of
             | regression with a different loss function. But it's not
             | nearly "as special" as linear regression.
        
       | colinmhayes wrote:
       | How do we differentiate between econometrics and machine
       | learning? Logistic regression seems like it fits into
       | econometrics better than machine learning to me. There's no
       | regularization. I guess there's gradient descent which can be
       | seen as more machine learning. In the end it's semantics of
       | course, still an interesting distinction.
        
         | blackbear_ wrote:
         | The correct bucket for logistic regression should be
         | "statistics", under "generalized linear models".
        
         | TrackerFF wrote:
         | Well, what do you define as machine learning?
         | 
         | Logistic regression is clearly a classifier. And you need data
         | to train it. So it's a supervised learning algorithm.
        
           | colinmhayes wrote:
           | I'm trying to have a conversation so I can figure it out.
           | Pretty confident that being a classifier does not make it
           | machine learning, econometrics has classifiers too.
           | Econometric models also need data to train them, so I'm not
           | sure your second point is helpful either. Unless you're
           | claiming the difference is nothing but whether the model is
           | used by an economist.
        
             | heavenlyblue wrote:
             | What is machine learning then?
        
         | alexilliamson wrote:
         | From my experience, econometricians and ML practitioners mostly
         | pretend like the other group doesn't exist.
        
         | eVoLInTHRo wrote:
         | Econometrics is the application of statistical techniques on
         | economics-related problems, typically to understand
         | relationships between economic phenomena (e.g. income) and
         | things that might be associated with it (e.g. education).
         | 
         | Machine learning is typically defined as a way to enable
         | computers to learn from data to accomplish tasks, without
         | explicitly telling them how.
         | 
         | Both fields can use logistic regression, regularization, and
         | gradient descent to accomplish their goals, so in that sense
         | there's no distinction.
         | 
         | But IMO there is a difference in their primary intention:
         | econometrics typically focuses on inference about
         | relationships, machine learning typically focuses on predictive
         | accuracy. That's not to say that econometrics doesn't consider
         | predictive accuracy, or that machine learning doesn't consider
         | inference, but it's usually not their primary concern.
        
           | colinmhayes wrote:
           | So you're going with the only difference being who's building
           | the model. Interesting take, can't say I disagree much.
            | Although I would say that regularization in econometric
            | models is a bit rare because it distorts the coefficients,
            | whose interpretation, as you pointed out, is the primary
            | goal of econometrics.
        
         | random314 wrote:
         | Logistic regression can use both L1 and L2 regularization
        
           | colinmhayes wrote:
            | And OLS can too. That doesn't make it machine learning. This
           | implementation doesn't involve any regularization.
        
       | oli5679 wrote:
        | Weight of evidence binning can be a helpful feature engineering
        | strategy for logistic regression.
       | 
       | Often this is a good 'first cut' model for a binary classifier on
       | tabular data. If feature interactions don't have a major impact
       | on your target then this can actually be a tough benchmark to
       | beat.
       | 
       | https://github.com/oli5679/WeightOfEvidenceDemo
       | 
       | https://www.listendata.com/2015/03/weight-of-evidence-woe-an...
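        | 
        | A minimal sketch of the WoE transform itself (pandas; the column
        | names and data are made up, and sign conventions vary):
        | 
        |     import numpy as np
        |     import pandas as pd
        | 
        |     def woe_table(df, feature, target, bins=5):
        |         # WoE per bin = ln(% of events / % of non-events)
        |         binned = pd.qcut(df[feature], q=bins, duplicates="drop")
        |         counts = pd.crosstab(binned, df[target])
        |         dist_event = counts[1] / counts[1].sum()
        |         dist_non_event = counts[0] / counts[0].sum()
        |         return np.log(dist_event / dist_non_event)
        | 
        |     rng = np.random.default_rng(0)
        |     df = pd.DataFrame({"income": rng.normal(50, 10, 1000)})
        |     df["default"] = (rng.uniform(size=1000) < 0.2).astype(int)
        |     print(woe_table(df, "income", "default"))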
        
       | thomasahle wrote:
       | Logistic regression can learn some quite amazing things. I
       | trained a linear function to play chess:
       | https://github.com/thomasahle/fastchess and it manages to predict
       | the next moves of top engine games with 27% accuracy.
       | 
        | A benefit of logistic regression is that the resulting model is
        | really fast. Furthermore, it's linear, so you can do incremental
       | updates to your prediction. If you have `n` classes and `b` input
       | features change, you can recompute in `bn` time, rather than
       | doing a full matrix multiplication, which can be a huge time
       | saver.
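        | 
        | Roughly what that incremental update looks like (NumPy; the
        | sizes here are made up):
        | 
        |     import numpy as np
        | 
        |     n_classes, n_features = 1800, 768
        |     rng = np.random.default_rng(0)
        |     W = rng.normal(size=(n_classes, n_features))
        |     x = rng.normal(size=n_features)
        |     scores = W @ x               # full matrix-vector product
        | 
        |     # b features change: update scores in O(b * n) instead of
        |     # recomputing the full product
        |     changed = np.array([3, 17, 42])
        |     delta = rng.normal(size=changed.size)
        |     x[changed] += delta
        |     scores += W[:, changed] @ delta
        | 
        |     assert np.allclose(scores, W @ x)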
        
         | mhh__ wrote:
         | Isn't 27% worse than flipping a coin?
        
           | thomasahle wrote:
           | A typical chess position has 20-40 legal moves. The complete
           | space of moves for the model to predict from has about 1800
           | moves.
           | 
           | For comparison Leela Zero gets around 60% accuracy on
           | predicting its own next move.
           | 
           | With this sort of accuracy you can reduce the search part of
           | the algorithm to an effective branching factor of 2-4 rather
           | than 40, nearly for free, which is a pretty big win.
        
             | nuclearnice1 wrote:
              | I don't understand the comment about Leela. Why isn't its
              | own move prediction deterministic?
        
               | thomasahle wrote:
               | Because Leela (like fastchess mentioned above) has two
               | parts: A neural network predicting good moves, and a tree
               | search exploring the moves suggested and evaluating the
               | resulting positions (with a second net).
               | 
               | If the prediction (policy) net had a 100% accuracy, you
               | wouldn't need the tree search part at all.
        
               | heavenlyblue wrote:
                | You haven't mentioned any nondeterministic behaviour,
                | therefore Leela should be able to predict its own moves
                | with 100% accuracy.
        
               | tel wrote:
               | It's not non-determinism, it's partial information. The
               | NN part guesses the best move that will be found by
               | search X% of the time. If you just ditched the search
               | part, Leela would be faster and lose out on (1-X)% of the
               | better moves.
        
               | nuclearnice1 wrote:
               | Got it. Thanks for clarifying. Let me restate.
               | 
               | Part one of Leela ranks several chess moves. Part two
               | picks among those.
               | 
               | 60% of the time part 2 chooses the #1 ranked move.
        
           | gpderetta wrote:
           | Only if you have only two legal moves.
        
           | enchiridion wrote:
            | No, uniform random would be bounded by 1/16. However you
            | cannot move every piece in every configuration, so it's
            | greater than that. Actually it would be an interesting
            | problem to figure out...
        
             | adiM wrote:
             | There are only 16 pieces, but in most board positions, many
             | pieces can make more than one legal move.
        
       | clircle wrote:
       | Ah, the old "regression from scratch" post that is mandatory for
       | all blogs
        
         | melling wrote:
         | It looks like he's working through a lot of algorithms:
         | 
         | https://github.com/pmuens/lab
        
       ___________________________________________________________________
       (page generated 2020-06-25 23:00 UTC)