[HN Gopher] Codeball - AI-powered code review
       ___________________________________________________________________
        
       Codeball - AI-powered code review
        
       Author : lladnar
       Score  : 56 points
       Date   : 2022-05-27 18:38 UTC (4 hours ago)
        
 (HTM) web link (codeball.ai)
 (TXT) w3m dump (codeball.ai)
        
       | sanity31415 wrote:
       | "Download free money" sounds like a scam.
        
         | videlov wrote:
         | It's just poking fun at other websites being too serious.
        
       | videlov wrote:
       | Creator of Codeball here, somebody beat us to sharing it :).
       | 
        | Codeball is the result of a hack week at Sturdy - we were
        | thinking about ways to reduce time spent waiting for code
        | review, and we were curious exactly how predictable the entire
        | process is. It turned out to be very predictable!
       | 
       | Happy to answer any questions.
        
         | dchichkov wrote:
          | Hi. I tried creating the same service about 5 years back ;)
          | articoder.com ;) I was digging at it for a few months, but
          | the natural language processing of the time was not up to the
          | task...
          | 
          | Good to know that it is now doable in a week, with such good
          | precision! Or do you have humans in the backend ;) ?
          | 
          | How do you compare yourselves to PullRequest (they'd been
          | digging at it for 5 years as well, and recently folded)? [Fun
          | fact: we interviewed in the same YC batch, which always makes
          | me wonder if YC liked the idea enough to have it implemented
          | by another team ;)]
        
           | videlov wrote:
           | It's really cool to hear that others have thought about this
           | too!
           | 
           | >How do you compare yourself to PullRequest
           | 
            | So it turns out that most code contributions nowadays get
            | merged without fixes or feedback during review (about 2/3).
            | I think this is because of the increased focus on
            | continuous delivery and shipping small & often. Codeball's
            | purpose is to identify and approve those 'easy' PRs, so
            | humans get to deal with the trickier ones. The cool part
            | about it is being less blocked.
        
         | emeraldd wrote:
          | Is your model trained per language?
          | 
          | Without something that semantically understands the code under
          | review (which all but requires general AI, or at the least a
          | strong static analyzer), I don't see this doing anything more
          | than adding noise to the process, or worse, leading to certain
          | groups of developers effectively being given a free pass.
        
         | sidlls wrote:
         | I know for a fact I would not want to automate many of the
         | predictable aspects of code reviews at any job I've ever had.
         | This is because many of the predictable aspects of code review
         | are due to poor review practices. Things like rubber-stamping
         | review requests with a certain change size (e.g. lines of code,
         | number of files), surface-level reviews (e.g. "yep this doesn't
         | violate our style guidelines that the linter can't catch"), and
         | similar items.
         | 
         | A proper code review isn't simply catching API or style errors
         | --it seeks to understand how the change affects the
         | architecture and structure of the existing code. I'm sure AI
         | can help with that, and for a broad class of changes it's
          | likely somewhat-to-very predictable--but I'm skeptical that it
          | is predictable for enough use cases to make it worth spending
          | money on, for now.
         | 
          | Put another way: "approves code reviews a human would have
          | approved" isn't exactly the standard I'd want automated
          | reviews to aspire to. Human approval, in my experience, mostly
          | doesn't reflect good-quality review.
        
           | visarga wrote:
            | Maybe the AI approach is still useful. I am thinking of
            | analysing the AST to measure the impact of a code change,
            | or the complexity of the various components of the project
            | - some kind of graph analysis to measure complexity and
            | maintainability at the project level.
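            | 
            | For illustration, a minimal sketch of that idea
            | (hypothetical, just counting branch nodes with Python's ast
            | module):
            | 
            |     import ast
            | 
            |     BRANCHES = (ast.If, ast.For, ast.While, ast.Try,
            |                 ast.BoolOp, ast.ExceptHandler)
            | 
            |     def complexity(source: str) -> dict:
            |         """Crude per-function score: 1 + branch count."""
            |         scores = {}
            |         for node in ast.walk(ast.parse(source)):
            |             if isinstance(node, ast.FunctionDef):
            |                 scores[node.name] = 1 + sum(
            |                     isinstance(n, BRANCHES)
            |                     for n in ast.walk(node))
            |         return scores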
        
           | videlov wrote:
            | My thesis is that a tool like Codeball would reduce the
            | amount of rubber-stamping. The thing is, many devs aim to
            | ship code "small and often", which inevitably leads to
            | fast, pattern-matching-style reviews. If software can
            | reliably deal with those, humans can focus their energy on
            | the tricky ones. Kind of like how using a code formatter
            | eliminates that type of discussion and lets people focus on
            | semantics.
        
         | cobbal wrote:
         | How much will you be charging for the adversarial network to
         | allow someone to get any PR approved? ;)
        
       | gombosg wrote:
        | I'm a bit skeptical here. We should ask the question: why are we
        | reviewing code in the first place? This sparks hot debates on HN
        | every now and then, because reviews are not just automated
        | checks but part of the engineering culture, which is a defining
        | part of any company or eng department.
        | 
        | PR reviews are a way of learning from each other, keeping up
        | with how the codebase evolves, sharing progress and ideas,
        | giving feedback and asking questions. For example, at $job we
        | approve ~90% of PRs, with various levels of pleas, suggestions,
        | nitpicks and questions. We approve because of trust (each PR
        | contains a demo video of a working feature or fix) and so as
        | not to block each other, but there might be important feedback
        | or suggestions given among the comments. A "rubber stamp bot"
        | would be hard to train on such a review system, and it simply
        | misses the point of what reviews are about.
       | 
        | What happens if there is a mistake (hidden y2k bomb, deployment
        | issue, incident, regression, security bug, bad database
        | migration, wrong config) in a PR that passes a human review? At
        | a toxic company you get finger-pointing, but on a healthy team,
        | people can learn a lot when something bad passes a review. But
        | you can't discuss anything with a nondeterministic review bot.
        | There's no responsibility there.
       | 
        | Another question is the review culture itself. If this app is
        | trained on some repo's history (which PRs were approved and
        | which were not), the past reviews it learns from reflect the
        | review culture of the company. What happens when a black-box AI
        | takes that over? Is it going to train itself on its own
        | reviews? People and review culture can be changed, but a
        | black-box AI is hard to change in a predictable way.
       | 
        | I'd rather set up code conventions, automated linters (i.e.
        | deterministic checks), etc. than have a review bot allow code
        | into production. Or just let go of PR reviews altogether -
        | there were some articles shared on HN about that recently. :)
        
       | apugoneappu wrote:
       | Explanation of results for non-ML folks (results on the default
       | supabase repo shown on the homepage):
       | 
        | Codeball's precision is 0.99. It simply means that 99% of the
        | PRs that Codeball predicted to be approvable were actually
        | approved. In layman's terms: if Codeball says that a PR is
        | approvable, you can be 99% sure that it is.
        | 
        | But recall is 48%, meaning that only 48% of the actually
        | approved PRs were predicted to be approvable. So Codeball
        | incorrectly flagged 52% of the approvable PRs as un-approvable,
        | just to be safe.
        | 
        | So Codeball is like a strict bartender who only serves you when
        | they are absolutely sure you're old enough. You may well be of
        | age, but Codeball's not serving you.
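        | 
        | In code, with hypothetical counts picked to match those numbers:
        | 
        |     # Suppose 206 PRs were actually approvable. Codeball flagged
        |     # 100 of them as approvable, and was right about 99.
        |     true_positives = 99    # flagged approvable, and was
        |     false_positives = 1    # flagged approvable, but wasn't
        |     false_negatives = 107  # approvable, but not flagged
        | 
        |     precision = true_positives / (true_positives + false_positives)
        |     recall = true_positives / (true_positives + false_negatives)
        |     print(precision, recall)  # 0.99, ~0.48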
        
         | Der_Einzige wrote:
         | A LOT of ML applications should be exactly like this.
         | 
         | I want systems with low recall that "flag" things but ultra
         | ultra high precision. Many times, we get exactly the opposite -
         | which is far worse!
        
         | l33t2328 wrote:
         | That's still super useful.
         | 
          | I'm assuming most PRs are approvable. If that's the case,
          | then this should cut down time spent doing reviews by a lot.
        
       | sabujp wrote:
        | Is this just (not) approving PRs, or is it actually providing
        | automated feedback on what needs to be fixed, with suggestions?
        
         | videlov wrote:
          | It is like a first-line reviewer. It approves contributions
          | that it is really confident are good and leaves the rest to
          | humans. So basically it saves developers time and context
          | switching.
        
           | mikeryan wrote:
            | Is there no marker that can be provided to indicate why it
            | failed, or even a line number?
            | 
            | I can't tell if it's something like formatting and code
            | style, or "bad code", or what. Even as a first-line
            | reviewer, I can't tell if this is valuable without any
            | details on why it would approve something.
            | 
            | The PRs it would approve here were all super minor. You
            | could probably get a similar number of them approved just
            | by combining lines of code changed with a "has it been
            | linted?" check (sketched below).
            | 
            | It's really hard to tell if this is valuable or not yet.
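            | 
            | I.e. a hypothetical baseline along the lines of (PR fields
            | assumed):
            | 
            |     def naive_approve(lines_changed: int,
            |                       lint_passed: bool) -> bool:
            |         """Approve tiny, lint-clean PRs; defer the rest."""
            |         return lines_changed < 20 and lint_passed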
        
             | videlov wrote:
              | You are making a very good point. Right now it can't give
              | such an indication, because it is a black-box model.
              | There are hundreds of inputs that go in (e.g.
              | characteristics of the code, how much the author has
              | worked with this code in the past, how frequently this
              | part of the code changes), and the output is how
              | confident the model is that the contribution is safe to
              | merge.
              | 
              | With that said, there are ways of exposing more details
              | to developers. For example, scoring is done per file, and
              | Codeball can tell you which files it was not confident
              | in.
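              | 
              | Roughly like this (simplified; the scores and the
              | all-files rule here are illustrative, not the exact
              | implementation):
              | 
              |     # Hypothetical per-file confidence scores
              |     scores = {"api.go": 0.97, "auth.go": 0.41}
              |     unsure = [f for f, s in scores.items() if s < 0.9]
              |     approve = not unsure  # confident in every file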
        
       | apugoneappu wrote:
        | Maybe I'm misled, but how is it a code review without looking
        | at the actual code? (The code isn't listed as an input feature
        | on the 'how' page.)
        
         | videlov wrote:
          | It does look at the code at a meta level - in particular, at
          | whether the kind of change in the PR has previously been
          | objected to or corrected afterwards. It creates perceptual
          | hashes out of the code changes, which are used as categorical
          | variables that go into the neural net.
          | 
          | Deriving features about the code contributions is probably
          | the most challenging aspect of the project so far.
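          | 
          | To give a flavour, here is a simplified SimHash-style sketch
          | (illustrative only, not the exact pipeline):
          | 
          |     import hashlib
          | 
          |     def simhash(tokens, bits=32):
          |         # Similar token sets produce similar hashes
          |         v = [0] * bits
          |         for tok in tokens:
          |             h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
          |             for i in range(bits):
          |                 v[i] += 1 if (h >> i) & 1 else -1
          |         return sum(1 << i for i in range(bits) if v[i] > 0)
          | 
          |     def change_bucket(diff: str, buckets=1024):
          |         # Hash only added/removed lines into a categorical id
          |         toks = [t for line in diff.splitlines()
          |                 if line.startswith(("+", "-"))
          |                 and not line.startswith(("+++", "---"))
          |                 for t in line[1:].split()]
          |         return simhash(toks) % buckets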
        
       | anonred wrote:
        | So I dry-ran it against a tiny open-source repo I maintain, and
        | it worked on exactly 0 of the last 50 PRs. For example, it
        | didn't auto-approve a PR that was just deleting stale
        | documentation files... The idea sounds nice, but the execution
        | is a bit lacking right now.
        
         | moffkalast wrote:
          | I don't really get the point of it either, since it just
          | approves PRs. I know when my PR is mergeable; you don't have
          | to tell me that. What I need is feedback, since that's what
          | code review is for.
          | 
          | Any linter is more useful than this.
        
       | mdaniel wrote:
       | I can't tell if this is a joke or not
        
         | videlov wrote:
         | It is definitely not a joke. This started off as scratching our
         | own itch in answering 'how predictable are programmers?' but it
         | turned out to be really useful, so we made a site.
        
           | mdaniel wrote:
           | Good to know; is there an example (other than its own GH
           | Action repo) to see what it has approved?
           | 
           | Given that it's a model, is there a feedback mechanism
           | through which one could advise it (or you) of false
           | positives?
           | 
           | I would be thrilled to see what it would have said about:
           | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/76318
           | (q.v. https://news.ycombinator.com/item?id=30872415)
        
             | videlov wrote:
              | It would have said nothing. The idea behind the model is
              | to identify the bulk of easy / safe contributions that
              | get approved without objection, and to let the humans
              | focus on the tricky ones (like the example above).
              | 
              | On the site you can give it GitHub repos, and it will
              | test the last 50 PRs and show you what it would have done
              | (false negatives and false positives included). You can
              | also give it a link to an individual PR, but GitLab is
              | not yet supported.
        
             | zegl wrote:
             | I've tried to reproduce #76318 as best as I could (using a
             | fork of the CE version of GitLab).
             | https://github.com/zegl/gitlabhq-cve-test/pull/1
             | 
              | Codeball did not approve the PR!
              | https://codeball.ai/prediction/8cc54ce2-9f50-4e5c-9a16-3bc48...
        
         | jldugger wrote:
          | if len(diff) > 500:  # lines
          |     return "Looks good to me"
          | time.sleep(86400)
          | return "+1"
        
       | [deleted]
        
       | danielmarkbruce wrote:
       | Looks awesome.
       | 
       | Tone down the marketing page :) This page makes it sound like a
       | non-serious person built the tool.
       | 
       | How about: "Codeball approves Pull Requests that a human would
       | approve. Reduce waiting for reviews, save time and money."
       | 
       | And make the download button: "Download"
        
       | donkarma wrote:
        | I would never use something like this. It seems to me that it's
        | just a heuristic based on the char diff count. I made a simple
        | repo with a shell script that does rm -rf
        | /usr/old_files_and_stuff, added a space after the first slash
        | (turning it into rm -rf / usr/old_files_and_stuff, which wipes
        | the root filesystem), and it was approved, which is dangerous.
        | If I need to manually verify it anyway for stuff like this, why
        | would I use it?
        
         | iamnafets wrote:
          | I generally feel the same way, but just to steel-man the
          | argument: would your manual code review process have caught
          | this issue?
          | 
          | Sometimes we compare new things against their hypothetical
          | ideal rather than against the status quo. The latter
          | comparison is significantly more tractable.
        
       | tehsauce wrote:
       | I would be a bit concerned about adversarial attacks with this.
       | I'm sure someone will be able to come up with an innocent looking
       | PR that the system will always approve, but actually is
       | malicious. Then any repo which auto-approves PRs with this could
       | be vulnerable.
        
         | videlov wrote:
          | There are 3 categories of predictors that the model takes
          | into account. Here are some examples: (1) the code
          | complexity, and its perceptual hash; (2) the author and their
          | track record in the repository; (3) the author's past
          | involvement in the specific files being modified.
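          | 
          | Schematically, the input vector looks something like this
          | (feature names are illustrative, not the exact inputs):
          | 
          |     features = {
          |         "complexity": 12,             # (1) the change itself
          |         "change_bucket": 381,         # (1) perceptual hash id
          |         "author_merged_prs": 240,     # (2) track record
          |         "author_file_familiarity": 0.8,  # (3) file history
          |     }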
         | 
          | With that said, an adversarial attack from somebody within
          | the team/organisation would be very difficult to detect.
        
       | codeenlightener wrote:
        | This is an exciting direction for AI code tools! I'm curious to
        | see code review tools that give developers feedback on
        | non-approved code, which I think is an important purpose of
        | code review: building a shared understanding of technical
        | standards.
        | 
        | On a related note, I'm working on https://denigma.app, an AI
        | that tries to explain code, giving a second opinion on what it
        | looks like the code does. One company said they found it useful
        | for code review. Maybe how clear an AI-generated explanation is
        | makes for a decent metric of code quality.
        
       | mchusma wrote:
        | I think I'd like this better expressed as a linter than as a
        | code reviewer. Maybe it doesn't sell as well. But giving this
        | to devs to help them make better PRs and have more confidence
        | in approval? Good. Skipping code review? Bad.
        | 
        | In my experience, most "issues" found in code review are not
        | technical errors but business logic errors, and most of the
        | time there is not even enough context in the code to know what
        | the right answer is. It is in a PM's or salesperson's head.
        
       | Imnimo wrote:
        | >Codeball uses a Multi-layer Perceptron classifier neural
        | network as its prediction model. The model takes hundreds of
        | inputs in its input layer, has two hidden layers, and a single
        | output scoring the likelihood a Pull Request would be approved.
       | 
       | Really bringing out the big guns here!
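        | 
        | For reference, that architecture is a few lines of scikit-learn
        | (layer sizes are guesses; the page doesn't give them):
        | 
        |     import numpy as np
        |     from sklearn.neural_network import MLPClassifier
        | 
        |     rng = np.random.default_rng(0)
        |     X = rng.random((1000, 300))   # "hundreds of inputs"
        |     y = rng.integers(0, 2, 1000)  # 1 = PR was approved
        | 
        |     # two hidden layers, one approval-likelihood output
        |     model = MLPClassifier(hidden_layer_sizes=(64, 32))
        |     model.fit(X, y)
        |     p_approve = model.predict_proba(X[:1])[0, 1]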
        
       | eyelidlessness wrote:
        | This is a neat idea, but it gives me pause. Thinking about how
        | it would work in the projects I maintain, it would either:
        | 
        | - be over-confident, providing negative value, because the
        | proportion of PRs that get a straight "LGTM" is extraordinarily
        | low, and my increasingly deep familiarity with the code and its
        | areas of risk makes me even more suspicious when something
        | looks that safe
        | 
        | - never gain confidence in any PR, providing no value
        | 
        | I can't think of a scenario where I'd use this for these
        | projects. But I can certainly imagine it in the abstract, under
        | circumstances where the baseline safety of changes is much
        | higher.
        
       ___________________________________________________________________
       (page generated 2022-05-27 23:00 UTC)