[HN Gopher] Codeball - AI-powered code review
___________________________________________________________________

Codeball - AI-powered code review

Author : lladnar
Score  : 56 points
Date   : 2022-05-27 18:38 UTC (4 hours ago)

(HTM) web link (codeball.ai)
(TXT) w3m dump (codeball.ai)

| sanity31415 wrote:
| "Download free money" sounds like a scam.

| videlov wrote:
| It's just poking fun at other websites being too serious.

| videlov wrote:
| Creator of Codeball here, somebody beat us to sharing it :).
|
| Codeball is a result of a hack-week at Sturdy - we were thinking
| about ways to reduce the waiting-for-code-review and were curious
| exactly how predictable the entire process is. It turned out to be
| very predictable!
|
| Happy to answer any questions.

| dchichkov wrote:
| Hi. I tried creating the same service about 5 years back ;)
| articoder.com ;) Was digging at it for a few months. But natural
| language processing at the time was not up to the task...
|
| Good to know that now it is doable in a week, with such good
| precision! Or do you have humans in the backend ;) ?
|
| How do you compare yourself to PullRequest (they've been digging
| at it for 5 years as well), which recently folded? [Fun fact: we
| were interviewed in the same YC batch, which always makes me
| wonder if YC liked the idea enough to have it implemented by
| another team ;)]

| videlov wrote:
| It's really cool to hear that others have thought about this too!
|
| > How do you compare yourself to PullRequest
|
| So it turns out that most code contributions nowadays get merged
| without fixes or feedback during the review (about 2/3). I think
| this is because of the increased focus on continuous delivery and
| on shipping small & often. Codeball's purpose is to identify and
| approve those 'easy' PRs so that humans get to deal with the
| trickier ones. The cool part about it is being less blocked.

| emeraldd wrote:
| Is your model trained per language?
|
| Without something that semantically understands the code under
| review (which all but requires general AI, or at the least a
| strong static analyzer), this risks doing nothing more than adding
| noise to the process -- or worse, effectively giving certain
| groups of developers a free pass.

| sidlls wrote:
| I know for a fact I would not want to automate many of the
| predictable aspects of code reviews at any job I've ever had. This
| is because many of the predictable aspects of code review are due
| to poor review practices: things like rubber-stamping review
| requests with a certain change size (e.g. lines of code, number of
| files), surface-level reviews (e.g. "yep, this doesn't violate our
| style guidelines that the linter can't catch"), and similar items.
|
| A proper code review isn't simply catching API or style errors --
| it seeks to understand how the change affects the architecture and
| structure of the existing code. I'm sure AI can help with that,
| and for a broad class of changes it's likely somewhat to very
| predictable -- but I'm skeptical that it is predictable for enough
| use cases to make it worth spending money on, for now.
|
| Put another way: "approves code reviews a human would have
| approved" isn't exactly the standard I'd want automated reviews to
| aspire to. Human approval, in my experience, mostly doesn't
| indicate a good-quality review.

| visarga wrote:
| Maybe the AI approach is still useful. I am thinking of analysing
| the AST to measure the impact of a code change, or the complexity
| of the various components of the project: some kind of graph
| analysis to measure complexity and maintainability at the project
| level.
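As a rough sketch of the kind of AST analysis visarga is describing
-- a cyclomatic-style count of decision points, using Python's
standard ast module. The list of branch nodes and the file names are
illustrative assumptions, not anything Codeball actually does:

    import ast

    # Node types that open a new branch in the control flow, roughly
    # following the cyclomatic-complexity convention.
    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With,
                    ast.BoolOp, ast.IfExp)

    def complexity(source: str) -> int:
        """Count decision points in a piece of Python source."""
        tree = ast.parse(source)
        return 1 + sum(isinstance(node, BRANCH_NODES)
                       for node in ast.walk(tree))

    # Scoring a file before and after a change gives a crude measure
    # of how much complexity a diff adds or removes.
    delta = (complexity(open("module_new.py").read())
             - complexity(open("module_old.py").read()))
    print(f"complexity delta: {delta:+d}")

Aggregating such per-file deltas over a dependency graph would give
the project-level maintainability signal visarga has in mind.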
| videlov wrote:
| My thesis is that a tool like Codeball would reduce the amount of
| rubber-stamping. Thing is, many devs aim to ship code "small and
| often", which inevitably leads to fast, pattern-matching-style
| reviews. If software can reliably deal with those, humans can
| focus their energy on the tricky ones. Kind of like how using a
| code formatter eliminates this type of discussion and lets people
| focus on semantics.

| cobbal wrote:
| How much will you be charging for the adversarial network to allow
| someone to get any PR approved? ;)

| gombosg wrote:
| I'm a bit skeptical here. We should ask the question: why are we
| reviewing code in the first place? This sparks hot debates on HN
| every now and then, because reviews are not just automated checks
| but part of the engineering culture, which is a defining part of
| any company or eng department.
|
| PR reviews are a way of learning from each other, keeping up with
| how the codebase evolves, sharing progress and ideas, giving
| feedback and asking questions. For example, at $job we approve
| ~90% of PRs, with various levels of pleas, suggestions, nitpicks
| and questions. We approve because of trust (each PR contains a
| demo video of a working feature or fix) and so as not to block
| each other, but there might be important feedback or suggestions
| given among the comments. A "rubber stamp bot" would be hard to
| train in such a review system and simply misses the point of what
| reviews are about.
|
| What happens if there is a mistake (hidden Y2K bomb, deployment
| issue, incident, regression, security bug, bad database migration,
| wrong config) in a PR that passes a human review? At a toxic
| company you get finger-pointing, but with a healthy team, people
| can learn a lot when something bad passes a review. But you can't
| discuss anything with a nondeterministic review bot. There's no
| responsibility there.
|
| Another question is the review culture. If this app is trained on
| some repo (whether PRs were approved or not), past reviews reflect
| the review culture of the company. What happens when a black-box
| AI takes that over? Is it going to train itself on its own
| reviews? People and review culture can be changed, but a black-box
| AI is hard to change in a predictable way.
|
| I'd rather set up code conventions, automated linters (i.e.
| deterministic checks) etc. than have a review bot allow code into
| production. Or just let go of PR reviews altogether; there were
| some articles shared on HN about that recently. :)

| apugoneappu wrote:
| Explanation of the results for non-ML folks (the results for the
| default supabase repo shown on the homepage):
|
| Codeball's precision is 0.99. That simply means that 99% of the
| PRs it predicted as approvable were actually approved. In layman's
| terms, if Codeball says that a PR is approvable, you can be 99%
| sure that it is.
|
| But recall is 48%, meaning that only 48% of the actually approved
| PRs were predicted to be approvable. So Codeball incorrectly
| flagged 52% of the approvable PRs as un-approvable, just to be
| safe.
|
| So Codeball is like a strict bartender who only serves you when
| they are absolutely sure you're old enough. You may well be of
| age, but Codeball's not serving you.
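To make those two numbers concrete, here is a toy computation. The
confusion-matrix counts are invented for illustration; only the
resulting ratios match the figures quoted above:

    # Invented counts; only the ratios are meaningful here.
    true_positives = 480   # approved PRs Codeball also marked approvable
    false_positives = 5    # PRs Codeball approved that humans didn't
    false_negatives = 520  # approved PRs Codeball left for human review

    # Precision: of everything Codeball approved, how much was right?
    precision = true_positives / (true_positives + false_positives)

    # Recall: of everything that deserved approval, how much did it catch?
    recall = true_positives / (true_positives + false_negatives)

    print(f"precision = {precision:.2f}")  # 0.99
    print(f"recall    = {recall:.2f}")     # 0.48

Raising the approval threshold trades recall away for precision,
which is exactly the "strict bartender" behaviour described above.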
| Der_Einzige wrote:
| A LOT of ML applications should be exactly like this.
|
| I want systems with low recall that "flag" things but have ultra,
| ultra high precision. Many times we get exactly the opposite,
| which is far worse!

| l33t2328 wrote:
| That's still super useful.
|
| I'm assuming most PRs are approvable. If that's the case, then
| this should cut down on time spent doing reviews by a lot.

| sabujp wrote:
| Is this just (not) approving, or is it actually providing
| automated feedback on what needs to be fixed, plus suggestions?

| videlov wrote:
| It is like a first-line reviewer. It approves contributions that
| it is really confident are good and leaves the rest to humans. So
| basically it saves time and context switching for developers.

| mikeryan wrote:
| Is there no marker that can be provided to indicate why it failed,
| or even a line number?
|
| I can't tell if it's something like formatting and code style, or
| "bad code", or what. Even as a first-line reviewer, I can't tell
| if this is valuable or not without any details on why it would
| approve something.
|
| The PRs it would approve here were all super minor. You could
| probably get a similar number of these approved just by checking
| lines of code changed + "has it been linted".
|
| It's really hard to tell if this is valuable or not yet.

| videlov wrote:
| You are making a very good point. Right now it can't give such an
| indication because it is a black-box model. There are hundreds of
| inputs that go in (e.g. characteristics of the code, how much the
| author has worked with this code in the past, how frequently this
| part of the code changes) and the output is how confident the
| model is that the contribution is safe to merge.
|
| With that said, there are ways of exposing more details to
| developers. For example, scoring is done per file, and Codeball
| can tell you which files it was not confident in.

| apugoneappu wrote:
| Maybe I'm misreading, but how is it a code review without looking
| at the actual code? (The code isn't listed as an input feature on
| the 'how' page.)

| videlov wrote:
| It does look at the code at a meta level, in particular at whether
| the kind of change in the PR has previously been objected to or
| corrected afterwards. It creates perceptual hashes out of the code
| changes, which are used as categorical variables that go into the
| neural net.
|
| Deriving features about the code contributions is probably the
| most challenging aspect of the project so far.
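Codeball hasn't published its hashing scheme, but to give a feel for
the idea of a "perceptual hash" of a diff -- a fingerprint that
similarly shaped changes tend to share -- here is a minimal
MinHash-style sketch. Every detail (the normalisation, the shingle
size, the band count) is an assumption, not Codeball's
implementation:

    import hashlib
    import re

    def change_fingerprint(diff: str, bands: int = 4) -> tuple:
        """Locality-sensitive hash of a unified diff: similar changes
        tend to collide, so the result can serve as a categorical
        feature for a model."""
        # Keep only added/removed lines, and collapse identifiers so
        # that the 'shape' of the change matters more than the names.
        lines = [re.sub(r"[A-Za-z_]\w*", "x", line)
                 for line in diff.splitlines()
                 if line.startswith(("+", "-"))
                 and not line.startswith(("+++", "---"))]
        text = "".join(line.replace(" ", "") for line in lines)
        # MinHash over character 4-grams: the smallest hashes are
        # stable under small edits elsewhere in the diff.
        shingles = {text[i:i + 4] for i in range(max(len(text) - 3, 1))}
        hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                        for s in shingles)
        return tuple(h % 1000 for h in hashes[:bands])

Two near-identical refactors in different files would then land in
the same buckets, giving the model a "we have approved this shape of
change before" signal to learn from.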
| anonred wrote:
| So I dry-ran it against a tiny open source repo I maintain, and it
| worked on exactly 0 of the last 50 PRs. For example, it didn't
| auto-approve a PR that was just deleting stale documentation
| files... The idea sounds nice, but the execution is a bit lacking
| right now.

| moffkalast wrote:
| I don't really get the point of it either, since it just approves
| PRs. I know when my PR is mergeable; you don't have to tell me
| that. What I need is some feedback, since that's what code review
| is for.
|
| Any linter is more useful than this.

| mdaniel wrote:
| I can't tell if this is a joke or not.

| videlov wrote:
| It is definitely not a joke. This started off as scratching our
| own itch in answering "how predictable are programmers?", but it
| turned out to be really useful, so we made a site.

| mdaniel wrote:
| Good to know; is there an example (other than its own GH Action
| repo) to see what it has approved?
|
| Given that it's a model, is there a feedback mechanism through
| which one could advise it (or you) of false positives?
|
| I would be thrilled to see what it would have said about:
| https://gitlab.com/gitlab-org/gitlab/-/merge_requests/76318
| (q.v. https://news.ycombinator.com/item?id=30872415)

| videlov wrote:
| It would have said nothing. The model's idea is to identify the
| bulk of easy/safe contributions that get approved without
| objection and let the humans focus on the tricky ones (like the
| example above).
|
| On the site you can give it GitHub repos, and it will test the
| last 50 PRs and show you what it would have done (false negatives
| and false positives included). You can also give it a link to an
| individual PR, but GitLab is not yet supported.

| zegl wrote:
| I've tried to reproduce #76318 as best as I could (using a fork of
| the CE version of GitLab):
| https://github.com/zegl/gitlabhq-cve-test/pull/1
|
| Codeball did not approve the PR!
| https://codeball.ai/prediction/8cc54ce2-9f50-4e5c-9a16-3bc48...

| jldugger wrote:
|     if len(diff) > 500:  # lines
|         return "Looks good to me"
|     time.sleep(86400)
|     return "+1"

| [deleted]

| danielmarkbruce wrote:
| Looks awesome.
|
| Tone down the marketing page :) This page makes it sound like a
| non-serious person built the tool.
|
| How about: "Codeball approves Pull Requests that a human would
| approve. Reduce waiting for reviews, save time and money."
|
| And make the download button: "Download"

| donkarma wrote:
| I would never use something like this. Seems to me that it's just
| a heuristic based on the char diff count. I made a simple repo
| that has a shell script that does rm -rf /usr/old_files_and_stuff,
| added a space next to the first slash, and it was approved, which
| is dangerous. If I need to manually verify it anyway for stuff
| like this, why would I use it?

| iamnafets wrote:
| I generally feel the same way, but just to steelman the argument:
| would your manual code review process have caught this issue?
|
| Sometimes we compare new things against their hypothetical ideal
| rather than the status quo. The latter is significantly more
| tractable.

| tehsauce wrote:
| I would be a bit concerned about adversarial attacks with this.
| I'm sure someone will be able to come up with an innocent-looking
| PR that the system will always approve but that is actually
| malicious. Then any repo which auto-approves PRs with this could
| be vulnerable.

| videlov wrote:
| There are 3 categories of predictors that the model takes into
| account; here are some examples: (1) the code complexity and its
| perceptual hash, (2) the author and their track record in the
| repository, (3) the author's past involvement in the specific
| files being modified.
|
| With that said, an adversarial attack from somebody within the
| team/organisation would be very difficult to detect.

| codeenlightener wrote:
| This is an exciting direction for AI code tools! I'm curious to
| see code review tools that give feedback on non-approved code to
| developers, which I think is an important purpose of code review:
| to build a shared understanding of technical standards.
|
| On a related note, I'm working on https://denigma.app, which is an
| AI that tries to explain code, giving a second opinion on what it
| looks like it does. One company said they found it useful for code
| review. Maybe how clear an AI explanation is would be a decent
| metric of code quality.

| mchusma wrote:
| I think I like this better expressed as a linter than as a code
| reviewer. Maybe it doesn't sell as well. But giving this to devs
| to help them make better PRs and have more confidence in approval?
| Good. Skipping code review? Bad.
|
| In my experience, most "issues" in code review are not technical
| errors; they are business logic errors, for which most of the time
| there isn't even enough context in the code to know what the right
| answer is. It's in a PM's or salesperson's head.

| Imnimo wrote:
| > Codeball uses a Multi-layer Perceptron classifier neural network
| > as prediction model. The model takes hundreds of inputs in its
| > input layer, has two hidden layers and a single output scoring
| > the likelihood a Pull Request would be approved.
|
| Really bringing out the big guns here!
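For anyone curious what the quoted architecture amounts to in
practice, it maps almost directly onto scikit-learn's MLPClassifier.
The feature count, layer widths and threshold below are made-up
placeholders; only the overall shape (hundreds of inputs, two hidden
layers, one approval score) comes from the quote:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Hypothetical training data: one row of per-PR features (code
    # complexity, perceptual-hash buckets, author history, ...) and
    # a label saying whether the PR was approved without objection.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 300))    # 300 input features (a guess)
    y = rng.integers(0, 2, 1000)   # 1 = approved as-is, 0 = not

    # Two hidden layers and a single probabilistic output, as quoted;
    # the widths (64, 16) are invented for this sketch.
    model = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=500)
    model.fit(X, y)

    # The output scores "the likelihood a Pull Request would be
    # approved"; auto-approving only above a high threshold is what
    # trades recall away for precision.
    auto_approve = model.predict_proba(X[:5])[:, 1] > 0.95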
| eyelidlessness wrote:
| This is a neat idea, but it gives me pause. Thinking about how it
| would work in projects I maintain, it would either:
|
| - be over-confident, providing negative value, because the
| proportion of PRs which get an "LGTM" is extraordinarily low, and
| my increasingly deep familiarity with the code and areas of risk
| makes me even more suspicious when something looks that safe
|
| - never gain confidence in any PR, providing no value
|
| I can't think of a scenario where I'd use this for these projects.
| But I can certainly imagine it in the abstract, under
| circumstances where the baseline safety of changes is much higher.
___________________________________________________________________
(page generated 2022-05-27 23:00 UTC)