[HN Gopher] Logistic regression from scratch
___________________________________________________________________

  Logistic regression from scratch

  Author : pmuens
  Score  : 107 points
  Date   : 2020-06-25 13:54 UTC (9 hours ago)

  (HTM) web link (philippmuens.com)
  (TXT) w3m dump (philippmuens.com)

  | curiousgal wrote:
  | Very interesting seeing people in the comments debate what is a very basic thing taught in any stats/econometrics class.
  |
  | The idea behind binary regression (Y is 0 or 1) is that you use a latent variable Y* = beta X + epsilon.
  |
  | X is the matrix of independent variables, beta is the vector of coefficients, and epsilon is an error term that captures the rest of what X can't explain.
  |
  | Y thus becomes 1 if Y* > 0 and 0 otherwise.
  |
  | Since Y is binary, we can model it with a Bernoulli distribution with success probability
  |
  |     P(Y=1) = P(Y* > 0) = 1 - P(Y* <= 0) = 1 - P(epsilon <= -beta X) = 1 - CDF_epsilon(-beta X)
  |
  | Technically you can use any function that maps R to [0, 1] as a CDF. If its density is symmetric, you can write the above probability directly as CDF(beta X). The two usual choices are the normal CDF, which gives the Probit model, or the logistic function (sigmoid), which gives the Logit model. With the CDF known you can calculate the likelihood and use it to estimate the coefficients.
  |
  | People prefer the Logit model because its coefficients are interpretable in terms of log-odds, and the function has some nice numerical properties.
  |
  | That's all there is to it really.
  |
  | reactspa wrote:
  | I've always found it a little simplistic that the default cut-off, in most statistical software, for whether something should be 0 or 1, is 0.5 (i.e. > 0.5 equals 1, and < 0.5 equals 0).
  |
  | This seems to be a "rarely-questioned assumption".
  |
  | Is there a reason why this is considered reasonable? And is there a name for the cut-off (i.e. if I were to want to change the cut-off, what keyword should I search for in the software's manual)?
  |
  | curiousgal wrote:
  | From a stats perspective the cutoff is included in the coefficients. If you use a design matrix (add a column of 1s to your variables), you get, in non-matrix notation, beta_0 * 1 + beta_1 * X_1 + ..., so the threshold can be considered beta_0.
  |
  | In the software, you can get classification models to output class probabilities instead of class labels. You can then use whatever threshold you like to transform those probabilities into labels.
  |
  | You may see it referred to as the "discrimination threshold". Varying that threshold is how ROC curves are constructed.
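
  A minimal sketch of the two points above (the sigmoid as the CDF that turns the linear score into P(Y=1), and the cutoff as a free choice applied to the predicted probabilities), assuming NumPy; the coefficients, data and the 0.3 threshold are made up for illustration:

      import numpy as np

      def sigmoid(z):
          # logistic CDF: maps the linear score (the log-odds) to a probability
          return 1.0 / (1.0 + np.exp(-z))

      beta = np.array([-1.0, 2.0])              # made-up "fitted" coefficients (intercept, slope)
      X = np.array([[1.0, 0.2],                 # design matrix with a column of 1s
                    [1.0, 0.9]])

      p = sigmoid(X @ beta)                     # P(Y = 1 | X) for each row
      labels_default = (p > 0.5).astype(int)    # the usual 0.5 cutoff
      labels_custom  = (p > 0.3).astype(int)    # any other "discrimination threshold" works too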
  | [deleted]
  |
  | platz wrote:
  | FYI, generally when explaining things to non-practitioners, it's a detractor to add qualifiers like
  |
  | - this is a very basic thing
  |
  | - that's all there is to it
  |
  | because although it is a basic thing to you, it's not a basic thing to someone who hasn't spent the same time studying all the concepts beforehand.
  |
  | This is generally why you have to take a class to grok stats rather than just read some reference material and definitions.
  |
  | It's similar to programming environments, when the senior dev says something that requires a lot of context is just really simple.
  |
  | Monads are Simply Just monoids in the category of endofunctors, after all; they are really Simple, that's all there is to it! (What the subject hears: "This is so simple to me, why aren't you smart enough to get this simple concept? You should know this already -- I shouldn't even have to make this comment! Why haven't you been properly educated, like I have?")
  |
  | curiousgal wrote:
  | I totally agree, but I was explaining that to practitioners, or at least people who consider themselves data scientists. I agree that it is by no means a beginner-friendly explanation.
  |
  | IdiocyInAction wrote:
  | My understanding of logistic regression is that it's linear regression on the log-odds, which are then converted to probabilities with the sigmoid/softmax function. This formulation allows one to do direct linear regression on the probabilities, without the unpleasant side effects of just using a linear model as-is. A mathematical justification for doing this is given by the generalized linear model formulation.
  |
  | obastani wrote:
  | This is not quite correct. The log probabilities are
  |
  |     log p(y=1 | x; beta) = beta * x - log Z(x; beta)
  |
  | where
  |
  |     Z(x; beta) = 1 + exp(beta * x)
  |
  | is the normalizer (the sum of the unnormalized scores for y=0 and y=1). Thus, you can think of it as linear regression, but with an additional term log Z(x; beta) in the log likelihood.
  |
  | j7ake wrote:
  | Putting logistic and linear regression into the generalised linear model framework is the right way to think of them and compare them.
  |
  | From this point of view, linear regression is a GLM with the identity link function, while logistic regression uses the logit function as the link function.
  |
  | tomrod wrote:
  | From what I recall this is a bit off -- not a bad mental model, but the math plays out differently.
  |
  | Linear regression has a closed-form solution, the projection of Y onto X:
  |
  |     \hat{\beta} = (X'X)^{-1} X' Y
  |
  | It is equivalent to the Maximum Likelihood Estimator (MLE) for linear regression. For logistic regression, however, the MLE gives different estimates than running a regression on the log-odds output, and there is no closed-form solution.
  |
  | Linear regression on {class_inclusion} = XB gives the linear probability model, which has limited utility. The required transform is covered by another commenter.
  |
  | IdiocyInAction wrote:
  | You're right, my model was a bit off. Thanks for pointing that out; I had forgotten about that.
  |
  | delib wrote:
  | > it's linear regression on the log-odds
  |
  | Almost -- logistic regression assumes that the relationship is _linear in the log-odds_, i.e. log(p/(1-p)) = Xb + e. The problem is that you can't compute the log-odds, because you don't know p.
  |
  | random314 wrote:
  | Linear regression uses MSE loss. Logistic regression uses log-loss. The two loss functions behave differently.
  |
  | It's not just the underlying model; the loss function is also different.
  |
  | crdrost wrote:
  | There is a different mathematical justification as well, in terms of Bayesian reasoning.
  |
  | The claim is that "evidence" in Bayesian reasoning naturally acts upon the log-odds, mapping prior log-odds to posterior log-odds additively. To see this, calculate from the definition:
  |
  |     Odds(X | E) := Pr(X | E) / Pr(!X | E)
  |                  = Pr(X and E) / Pr(!X and E)
  |                  = (Pr(E | X) * Pr(X)) / (Pr(E | !X) * Pr(!X))
  |                  = LR(E, X) * Odds(X),
  |
  | where LR is the usual likelihood ratio.
  |
  | So when we take the logarithm of both sides, we find that new evidence adds some quantity -- the log of the likelihood ratio of the evidence -- to our log of the prior odds, in this phrasing of Bayes' theorem.
  |
  | I sometimes tell people this in a slightly strange language: I say that if we ran into aliens, we might find out that they don't believe things are absolutely true or false, but instead measure their truth or falsity in decibels.
  |
  | So another perspective on what logistic regression is trying to do is that it assumes a linear log-likelihood-ratio dependence on the strength of some independent pieces of evidence. You can weakly justify this in all cases, using calculus and assuming everything has a small impact. You can further justify it strongly for any signal where twice as large a measured regression variable ultimately implies twice as many independent events at a much lower level happening and independently providing their evidence for the regression outcome. I come from a physics background, so I am thinking here of something like photon counts in a photomultiplier tube: I know that at a lower level each photon contributes equally some small bit of evidence for something, so when I count them all up together, this is the appropriate framework to use.
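
  A rough from-scratch sketch of what this sub-thread describes, assuming NumPy: the linear score X @ beta is the log-odds, the sigmoid turns it into P(y=1 | x), and, since unlike OLS there is no closed-form solution, the coefficients are found by gradient descent on the log-loss. The learning rate and iteration count are arbitrary illustration values:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def fit_logistic(X, y, lr=0.1, n_iter=5000):
          # X: (n, d) design matrix (include a column of 1s for an intercept); y: (n,) array of 0/1 labels
          beta = np.zeros(X.shape[1])
          for _ in range(n_iter):
              p = sigmoid(X @ beta)              # predicted P(y = 1 | x)
              grad = X.T @ (p - y) / len(y)      # gradient of the mean log-loss w.r.t. beta
              beta -= lr * grad
          return beta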
  | [deleted]
  |
  | nerdponx wrote:
  | It's better to think of linear regression and logistic regression as special cases of the Generalized Linear Model (GLM).
  |
  | In that framework, they are literally the same model with different "settings": Gaussian vs Bernoulli distribution.
  |
  | markkvdb wrote:
  | I have to disagree with you. While assuming Gaussian disturbance terms results in linear regression, the linear regression framework is more general. It makes no assumptions about the distribution of the disturbance terms; it merely restricts the variance to be constant over all values of the response variable.
  |
  | nerdponx wrote:
  | Both things can be true.
  |
  | Linear regression is extra-special because it's a special case of several different frameworks and model classes.
  |
  | I should have written that it's better (in my opinion) to think of logistic regression in the context of GLMs, at least while you're learning.
  |
  | Edit: yes, logistic regression is a special case of regression with a different loss function. But it's not nearly "as special" as linear regression.
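
  A small sketch of the GLM view above, assuming statsmodels and made-up data: the two fits use the same GLM machinery and differ only in the chosen family (and hence the default link function):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      X = sm.add_constant(rng.normal(size=(200, 2)))     # intercept + two features
      true_beta = np.array([0.5, 2.0, -1.0])             # made-up coefficients

      y_continuous = X @ true_beta + rng.normal(size=200)                   # Gaussian response
      y_binary = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))    # Bernoulli response

      # Same framework, different "settings":
      linear   = sm.GLM(y_continuous, X, family=sm.families.Gaussian()).fit()  # identity link -> linear regression
      logistic = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()      # logit link -> logistic regression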
  | colinmhayes wrote:
  | How do we differentiate between econometrics and machine learning? Logistic regression seems like it fits into econometrics better than machine learning to me. There's no regularization. I guess there's gradient descent, which can be seen as more machine learning. In the end it's semantics of course, but it's still an interesting distinction.
  |
  | blackbear_ wrote:
  | The correct bucket for logistic regression should be "statistics", under "generalized linear models".
  |
  | TrackerFF wrote:
  | Well, what do you define as machine learning?
  |
  | Logistic regression is clearly a classifier, and you need data to train it. So it's a supervised learning algorithm.
  |
  | colinmhayes wrote:
  | I'm trying to have a conversation so I can figure it out. I'm pretty confident that being a classifier does not make it machine learning; econometrics has classifiers too. Econometric models also need data to train them, so I'm not sure your second point is helpful either. Unless you're claiming the difference is nothing but whether the model is used by an economist.
  |
  | heavenlyblue wrote:
  | What is machine learning then?
  |
  | alexilliamson wrote:
  | From my experience, econometricians and ML practitioners mostly pretend like the other group doesn't exist.
  |
  | eVoLInTHRo wrote:
  | Econometrics is the application of statistical techniques to economics-related problems, typically to understand relationships between economic phenomena (e.g. income) and things that might be associated with them (e.g. education).
  |
  | Machine learning is typically defined as a way to enable computers to learn from data to accomplish tasks, without explicitly telling them how.
  |
  | Both fields can use logistic regression, regularization, and gradient descent to accomplish their goals, so in that sense there's no distinction.
  |
  | But IMO there is a difference in their primary intention: econometrics typically focuses on inference about relationships, while machine learning typically focuses on predictive accuracy. That's not to say that econometrics doesn't consider predictive accuracy, or that machine learning doesn't consider inference, but it's usually not their primary concern.
  |
  | colinmhayes wrote:
  | So you're going with the only difference being who's building the model. Interesting take, can't say I disagree much. Although I would say that regularization in econometric models is a bit rare, because it distorts the coefficients, which, as you pointed out, are the primary concern of econometrics.
  |
  | random314 wrote:
  | Logistic regression can use both L1 and L2 regularization.
  |
  | colinmhayes wrote:
  | And OLS can too. That doesn't make it machine learning. This implementation doesn't involve any regularization.
  |
  | oli5679 wrote:
  | Weight-of-evidence binning can be a helpful feature engineering strategy for logistic regression.
  |
  | Often this is a good "first cut" model for a binary classifier on tabular data. If feature interactions don't have a major impact on your target, then this can actually be a tough benchmark to beat.
  |
  | https://github.com/oli5679/WeightOfEvidenceDemo
  |
  | https://www.listendata.com/2015/03/weight-of-evidence-woe-an...
  |
  | thomasahle wrote:
  | Logistic regression can learn some quite amazing things. I trained a linear function to play chess: https://github.com/thomasahle/fastchess and it manages to predict the next moves of top engine games with 27% accuracy.
  |
  | A benefit of logistic regression is that the resulting model is really fast. Furthermore, it's linear, so you can do incremental updates to your prediction: if you have `n` classes and `b` input features change, you can recompute the scores in `bn` time, rather than doing a full matrix multiplication, which can be a huge time saver.
  |
  | mhh__ wrote:
  | Isn't 27% worse than flipping a coin?
  |
  | thomasahle wrote:
  | A typical chess position has 20-40 legal moves. The complete space of moves for the model to predict from has about 1800 moves.
  |
  | For comparison, Leela Zero gets around 60% accuracy on predicting its own next move.
  |
  | With this sort of accuracy you can reduce the search part of the algorithm to an effective branching factor of 2-4 rather than 40, nearly for free, which is a pretty big win.
  |
  | nuclearnice1 wrote:
  | I don't understand the comment about Leela. Why isn't own-move prediction deterministic?
  |
  | thomasahle wrote:
  | Because Leela (like fastchess mentioned above) has two parts: a neural network predicting good moves, and a tree search exploring the suggested moves and evaluating the resulting positions (with a second net).
  |
  | If the prediction (policy) net had 100% accuracy, you wouldn't need the tree search part at all.
  |
  | heavenlyblue wrote:
  | You haven't mentioned any nondeterministic behaviour, therefore Leela is supposed to predict its own moves with 100% accuracy.
  |
  | tel wrote:
  | It's not non-determinism, it's partial information. The NN part guesses the best move that will be found by search X% of the time. If you just ditched the search part, Leela would be faster but would lose out on (1-X)% of the better moves.
  |
  | nuclearnice1 wrote:
  | Got it. Thanks for clarifying. Let me restate.
  |
  | Part one of Leela ranks several chess moves. Part two picks among those.
  |
  | 60% of the time, part two chooses the #1-ranked move.
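
  A toy sketch of the incremental-update trick thomasahle describes above, assuming a dense weight matrix W of shape (n_classes, n_features); the sizes and changed indices are made up. When only b features change, the score vector can be patched in O(b * n_classes) instead of redoing the full matrix-vector product:

      import numpy as np

      rng = np.random.default_rng(0)
      n_classes, n_features = 1800, 768            # made-up sizes
      W = rng.normal(size=(n_classes, n_features))
      x = rng.normal(size=n_features)

      scores = W @ x                               # full product: O(n_classes * n_features)

      # Only two features change (e.g. a piece leaves one square and lands on another):
      changed = np.array([3, 17])
      delta = np.array([-1.0, 1.0])
      x[changed] += delta
      scores += W[:, changed] @ delta              # incremental update: O(b * n_classes)

      assert np.allclose(scores, W @ x)            # matches recomputing from scratch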
  | gpderetta wrote:
  | Only if you have only two legal moves.
  |
  | enchiridion wrote:
  | No, a uniform random guess would be bounded by 1/16. However, you cannot move every piece in every configuration, so it's greater than that. It would actually be an interesting problem to figure out...
  |
  | adiM wrote:
  | There are only 16 pieces, but in most board positions many pieces can make more than one legal move.
  |
  | clircle wrote:
  | Ah, the old "regression from scratch" post that is mandatory for all blogs.
  |
  | melling wrote:
  | It looks like he's working through a lot of algorithms: https://github.com/pmuens/lab
___________________________________________________________________
  (page generated 2020-06-25 23:00 UTC)