[HN Gopher] Ignore 98% of dependency alerts: introducing Semgrep...
       ___________________________________________________________________
        
       Ignore 98% of dependency alerts: introducing Semgrep Supply Chain
        
       Author : ievans
       Score  : 113 points
       Date   : 2022-10-04 15:45 UTC (7 hours ago)
        
 (HTM) web link (r2c.dev)
 (TXT) w3m dump (r2c.dev)
        
       | snowstormsun wrote:
       | Really nice idea to only show warnings if they are relevant. It's
        | indeed annoying to have to upgrade lodash just to make your
        | audit tool stop showing critical warnings about some function
        | that isn't used at all.
       | 
       | This is not open source, though? It does make a big difference
       | for some whether you're able to run the check offline or you're
       | forced to upload your code to some service.
       | 
        | One feature I'd love in such a tool would be the ability to get
        | the relevant parts of the changelog of the package that needs to
        | be upgraded. It's not responsible to just run the upgrade command
       | without checking the changelog for breaking or relevant changes.
       | That's exactly why upgrades tend to be done very late, because
       | there is a real risk of breaking something even if it's just a
       | minor version.
        
         | mattkopecki wrote:
         | There are definitely other approaches that don't require code
         | to be uploaded anywhere. For example, we (https://rezilion.com)
         | work with your package managers to understand what dependencies
         | your program has, and then analyze that metadata on the back
          | end. The net result is still being able to see which
          | vulnerabilities are truly exploitable and which are not.
        
         | ievans wrote:
         | All the engine functionality is FOSS
         | https://semgrep.dev/docs/experiments/r2c-internal-project-de...
         | (code at https://github.com/returntocorp/semgrep); but the
         | rules are currently private (may change in the future).
         | 
         | As with all other Semgrep scanning, the analysis is done
         | locally and offline -- which is a major contrast to most other
         | vendors. See #12 on our development philosophy for more
         | details: https://semgrep.dev/docs/contributing/semgrep-
         | philosophy/
         | 
          | Showing the relevant part of the changelog is a good idea --
          | others have also come out with statistical approaches based on
          | upgrades other projects made (e.g. Dependabot has a
          | compatibility score based on "when we made PRs for this on
          | other repos, what % of the time did tests pass vs fail").
        
         | freeqaz wrote:
         | Here is some code on GitHub that does call site checking using
          | Semgrep: https://github.com/lunasec-
         | io/lunasec/blob/master/lunatrace/...
         | 
         | (Note: I helped write that. We're building a similar service to
         | the r2c one.)
         | 
         | You're right that patching is hard because of opaque package
         | diffs. I've seen some tools coming out like Socket.dev which
         | show a diff between versions.
         | https://socket.dev/npm/package/react/versions
         | 
          | But, that said, this is still a hard problem to solve, and
          | malware[0][1] has been silently shipped before because of how
          | opaque packages are.
         | 
         | 0:
         | https://web.archive.org/web/20201221173112/https://github.co...
         | 
         | 1: https://www.coindesk.com/markets/2018/11/27/fake-
         | developer-s...
        
           | feross wrote:
           | Thanks for mentioning Socket.dev :)
           | 
           | Looking at package diffs is super important because of the
           | rise of "protestware". For example, a maintainer of the
           | event-source-polyfill package recently added code which
           | redirects website visitors located in Eastern European
           | timezones to a change.org petition page. This means that real
           | users are being navigated to this random URL in production.
           | 
           | See the attack code here:
           | https://socket.dev/npm/package/event-source-
           | polyfill/diff/1....
           | 
           | It's very unlikely that users of event-source-polyfill are
           | aware that this hidden behavior has been added to the
           | package. And yet, the package remains available on npm many
           | months after it was initially published. We think that supply
           | chain security tools like Socket have an important role to
           | play in warning npm users when unwanted 'gray area' code is
           | added to packages they use.
        
       | stevebmark wrote:
        | I've always thought that Dependabot was busy-work, a waste of
        | time. This article makes a good point that drives it home:
        | alarms that aren't real make all alarms useless. Dependabot is
        | especially painful in non-typed languages (Python, Ruby, and
        | especially JavaScript) where "upgrading" a library can break
        | things in ways you won't know about until production.
       | 
       | Maybe the constant work, extra build time (and cash for all
       | that), and risk of breaking production, is worth it for the 0.01%
       | of the time there's a real vulnerability? It seems like a high
       | price to pay though. When there are major software
       | vulnerabilities (like log4j), the whole industry usually swarms
       | around it, and the alarm has high value.
       | 
       | I just realized how much CircleCI probably loves Dependabot. I
        | wonder what percentage hit their margins would take if we moved
        | off it collectively as an industry.
        
         | bawolff wrote:
          | I kind of feel like Dependabot alerts should be treated like a
          | coding convention error - that extra whitespace isn't actually
          | causing a problem, but we fix it right away.
         | 
         | Otherwise you have to start analyzing the alerts, and good luck
         | with that. The low severity ones are marked critical and the
         | scary ones are marked low. Suddenly you have 200 unfixed alerts
          | and it's impossible to know whether somewhere in that haystack
          | there's an important one.
        
         | mfer wrote:
         | > When there are major software vulnerabilities (like log4j),
         | the whole industry usually swarms around it, and the alarm has
         | high value.
         | 
         | You're leaving me with the impression that you think we should
         | only patch major software vulnerabilities. This I would
         | disagree with. Minor vulnerabilities can be used, especially in
         | groups, to do things we don't anticipate. It's not just about a
         | single vulnerability but about how an attacker can leverage
         | multiple different vulnerabilities together.
        
         | danenania wrote:
         | If you use vendoring, it's also worth considering that there's
         | always some inherent security risk in upgrading dependencies.
         | If an attacker takes control of a package somewhere in your
         | dependency tree, you don't get compromised until you actually
         | install a new version of that package. This risk can often
         | outweigh the risk of very minor/dev-facing CVEs.
        
           | feross wrote:
           | Shameless plug: This is what I'm building Socket.dev to
           | solve.
           | 
           | Socket watches for changes to "package manifest" files such
           | as package.json, package-lock.json, and yarn.lock. Whenever a
           | new dependency is added in a pull request, Socket analyzes
           | the package's behavior and leaves a comment if it is a
           | security risk.
           | 
           | You can see some real-world examples here:
           | https://socket.dev/blog/socket-for-github-1.0
        
             | e1g wrote:
              | We use Socket, and my favorite feature is how you highlight
              | new dependencies that ship a post-install hook. It's not
              | always a problem, but it's almost always a smell.
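              | 
              | For anyone unfamiliar: npm will run a package's
              | "preinstall", "install", and "postinstall" scripts at
              | install time, so any dependency can execute arbitrary code
              | on your machine. A rough sketch of that kind of check (just
              | an illustration in Python over an unpacked node_modules
              | tree, not Socket's actual implementation):
              | 
              |     import json
              |     from pathlib import Path
              |     
              |     # npm lifecycle scripts that execute at install time
              |     HOOKS = {"preinstall", "install", "postinstall"}
              |     
              |     for manifest in Path("node_modules").glob("**/package.json"):
              |         try:
              |             pkg = json.loads(manifest.read_text())
              |         except (OSError, json.JSONDecodeError):
              |             continue
              |         found = HOOKS & pkg.get("scripts", {}).keys()
              |         if found:
              |             print(manifest.parent.name, "->", sorted(found))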
             | 
             | One feature request: please allow me to "suppress" warnings
             | for a specific package+version combo. This is useful for
             | activist libs that take a political stance - I know it
              | happens, but I often can't remove them, and I don't want to
             | continue flagging the same problem at every sec review.
        
         | smcleod wrote:
         | IMO Dependabot is really dreadful at its job. Try Renovate -
          | it's really brilliant: fast, flexible, and with proper support
          | for grouping PRs/MRs.
        
       | scinerio wrote:
       | Will this ever be integrated with Gitlab Ultimate?
        
         | mattkopecki wrote:
         | Gitlab Ultimate uses Rezilion to accomplish a similar aim.
         | Rather than using the principle of "reachability", Rezilion
          | analyzes at runtime which functions and classes are loaded
          | into memory. That's much more deterministic and less of a
          | guess about what code will be called.
         | 
         | https://about.gitlab.com/blog/2022/03/23/gitlab-rezilion-int...
        
           | masklinn wrote:
           | How does it do that in the face of lazy loading, or for
            | languages in which "what functions and classes are loaded
            | into memory" is not really a thing (e.g. C)?
        
             | tsimionescu wrote:
             | Shouldn't this be very easy in C? With static linking,
             | you're vulnerable if you're linking the package. With
             | dynamic linking, you're vulnerable if you're importing the
             | specific functions. Otherwise, you're not vulnerable -
             | there's no other legal way to call a function in C.
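              | 
              | (For the dynamic-linking case, a rough approximation -
              | assuming binutils is available, and with a purely
              | illustrative symbol name - is to look at the binary's
              | undefined dynamic symbols:
              | 
              |     # sketch: does this binary import a given dynamic symbol?
              |     import subprocess, sys
              |     
              |     def imports_symbol(binary: str, symbol: str) -> bool:
              |         out = subprocess.run(
              |             ["nm", "-D", "--undefined-only", binary],
              |             capture_output=True, text=True, check=True,
              |         ).stdout
              |         names = {line.split()[-1].split("@")[0]
              |                  for line in out.splitlines() if line.strip()}
              |         return symbol in names
              |     
              |     # e.g. imports_symbol("./app", "SSL_read")
              |     print(imports_symbol(sys.argv[1], sys.argv[2]))
              | 
              | That only tells you the symbol is imported, though, not that
              | it's actually called on a vulnerable path.)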
             | 
             | Now, if you're memory mapping some file and jumping into it
             | to call that function, good luck. You're already well into
             | undefined behavior territory.
             | 
             | Now, for lazy loading, I'm assuming the answer is the same
             | as any other runtime path analysis tool: it's up to you to
             | make sure all relevant code paths are actually running
             | during the analysis. Presumably your tests should be
             | written in such a way as to trigger the loading of all
             | dependencies.
             | 
              | I think there's really no other reasonable way to handle
              | this, though I can't say I've worked with either GitLab
              | Ultimate or Rezilion, so maybe I'm missing something.
        
               | underyx wrote:
               | Hey, I work on OP's product, and just wanted to mention
               | that reachability is not always about a function being
               | called. Sometimes insecure behavior is triggered by
               | setting options to a certain value[0]. Other times it's
               | feasible to mark usages of an insecure function as safe
               | when we know that the passed argument comes from a
               | trusted source[1]. The Semgrep rules we write understand
               | these nuances instead of just flagging function calls.
               | 
               | [0]: e.g. https://nvd.nist.gov/vuln/detail/CVE-2021-28957
               | 
               | [1]: e.g. https://nvd.nist.gov/vuln/detail/CVE-2014-0081
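                | 
                | As a toy illustration of that distinction (using PyYAML
                | here rather than the two CVEs above), a rule that only
                | looks for the function name would flag all three calls
                | below, while a reachability rule can tell them apart:
                | 
                |     import yaml
                |     
                |     untrusted = input()     # stands in for attacker-controlled data
                |     trusted = "retries: 3"  # constant shipped with the app
                |     
                |     # insecure option + untrusted input: worth flagging
                |     yaml.load(untrusted, Loader=yaml.UnsafeLoader)
                |     # safe option value: no finding needed
                |     yaml.load(untrusted, Loader=yaml.SafeLoader)
                |     # argument comes from a trusted source: lower risk
                |     yaml.load(trusted, Loader=yaml.UnsafeLoader)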
        
             | mattkopecki wrote:
             | Rezilion works at runtime when the Gitlab runner spins up a
             | container for testing the app. Rezilion observes the
             | contents of memory and can reverse-engineer back to the
             | filesystem to see where everything was loaded from.
             | 
             | In the CI pipeline this depends on your tests exercising
             | the app, but when you deploy Rezilion into a longer-lived
             | environment like Stage or Prod then you may get some new
              | code pathways being exercised, although most find that the
              | results don't differ much between environments.
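              | 
              | In Python terms the idea is roughly the following (a
              | simplified sketch of the general approach on Python 3.10+,
              | not our actual implementation): enumerate what has actually
              | been imported into the running process and map it back to
              | installed distributions.
              | 
              |     import sys
              |     from importlib.metadata import packages_distributions
              |     
              |     # top-level modules that are actually loaded in memory
              |     loaded = {m.split(".")[0] for m in sys.modules}
              |     
              |     # map top-level module names back to installed packages
              |     dist_map = packages_distributions()
              |     used = {d for m in loaded for d in dist_map.get(m, [])}
              |     print(sorted(used))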
        
           | scinerio wrote:
           | Ah, thank you. It's not entirely clear whether this is
            | something baked into Gitlab Ultimate's SAST CI/CD
           | feature/template, or if it's a third party that I would have
           | to license first. Do you happen to know?
        
       | jollyllama wrote:
       | Sounds nice. I've never worked with a tool like this that doesn't
       | turn up a ridiculous number of false positives.
        
       | henvic wrote:
       | How the hell do you end up with 1644 vulnerable packages anyways?
       | 
       | * rhetorical question, JS...
       | 
       | It was actually one of the main drivers for me to start using Go
       | instead of JavaScript for server-side applications and CLIs about
       | 8 years ago.
        
         | nightpool wrote:
         | Roughly: NPM, Github, and others funded open bug bounties for
         | all popular NPM packages. These bug bounties led to a rash of
         | security "vulnerabilities" being reported against open source
          | projects, to satisfy the terms of the bounty conditions. Public
         | bug bounty "intermediary" companies are a major culprit here--
         | they have an incentive to push maintainers to accept even
         | trivial "vulnerabilities", since their success is tied to
         | "number of vulnerabilities reported" and "amount of bounties
         | paid out". This leads to classes of vulnerabilities like reDOS
         | or prototype pollution that would never have been noticed or
         | worth any money otherwise.
        
       | thenerdhead wrote:
       | The problem really comes down to data quality in disclosing
       | vulnerabilities.
       | 
       | With higher quality data, better CVSS scores can be calculated.
       | With higher quality data, affected code paths can be better
       | disclosed. With higher quality data, unknown vulnerabilities may
       | be found in parallel to the known ones.
       | 
        | I don't think any tool or automation can solve the problem of
        | high-quality data. Humans have to apply discernment to provide
        | it. No amount of code analysis can solve that, but it sure can
        | help.
        
         | light24bulbs wrote:
         | You're right. Nobody bothers to make scanners because there's
         | no data, and nobody has come up with a good format to convey
         | the data between producers (like NVD) and consumers (like
         | dependabot).
         | 
         | I wrote a blog post talking about some of this stuff:
         | https://www.lunasec.io/docs/blog/the-issue-with-vuln-scanner...
         | 
          | It truly is a chicken-and-egg problem. There are next to no
          | automated scanners that make use of data like that; Semgrep is
          | the furthest along, and my company is close behind them in
          | taking a stab at it, as far as I can tell. Heck, there are hardly
         | any that do anything with the existing "Environmental" part of
         | the CVSS, and that has been pretty well populated by NVD, I
         | believe.
         | 
         | The existing interchange formats for vulnerability data, such
         | as OSV, are underdesigned to the point that it feels like
          | GitHub Copilot designed them. It's real work to even get to the
         | point that you can consume them, given all the weird choices in
         | there. Sorry if I'm salty.
         | 
         | There is an attempt to create a standard for situational
          | vulnerability exposure called "VEX" (Vulnerability
          | Exploitability eXchange), but it's almost entirely focused on
          | conveying information about which vulnerabilities have been
          | manually eliminated, so that software "vendors" can satisfy
          | their customers, especially in government contracts. It's not
          | modeling the full picture of what can happen in a dependency
          | tree and all the useful false-positive information in there.
        
           | thenerdhead wrote:
           | Yeah agreed. When I see these problem statements, I see us
           | addressing problems that are by-products of vulnerability
           | fatigue.
           | 
           | I.e "be lazy and ignore those vulnerabilities by using our
           | tools!"
           | 
            | It hardly solves the true issue: an industry-wide lack of
            | useful information, or even of transparency about said
            | information from responsible parties. I believe this laziness
           | is what got us here in the first place.
        
       | CSDude wrote:
        | Joke's on you, I already ignore 100% of them /s
       | 
        | I like the promise, but how can I completely trust that the
        | ignored part is not actually reachable? Most languages (except a
        | few) allow some magic that might not be detected. At a previous
        | job, we were bombarded with dependency upgrades; I can still
        | feel the pain in my bones.
        
       | thefrozenone wrote:
        | How does this tool go from a vuln in a library to a set of
        | affected functions/control paths? My understanding was that the
        | CVE format is unstructured, which makes an analysis like this
        | difficult.
        
         | theptip wrote:
         | My question too. All I see is this citation:
         | 
         | > [1] We'll be sharing more details about this work later in
         | October. Stay tuned!
        
         | ievans wrote:
         | We added support to the Semgrep engine for combining package
         | metadata restrictions (from the CVE format) with code search
         | patterns that indicate you're using the vulnerable library
          | (we're writing those mostly manually, but Semgrep makes it
          | pretty easy):
          | 
          |     - id: vulnerable-awscli-apr-2017
          |       pattern-either:
          |       - pattern: boto3.resource('s3', ...)
          |       - pattern: boto3.client('s3', ...)
          |       r2c-internal-project-depends-on:
          |         namespace: pypi
          |         package: awscli
          |         version: "<= 1.11.82"
          |       message: this version of awscli is subject to a directory
          |         traversal vulnerability in the s3 module
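          | 
          | For illustration, a call site that the patterns above would
          | match looks like this (and the finding only fires if the
          | project also depends on awscli <= 1.11.82):
          | 
          |     import boto3
          |     
          |     # matches `boto3.client('s3', ...)` in the rule above
          |     s3 = boto3.client('s3')
          |     s3.download_file("my-bucket", "remote/key", "local-file")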
         | 
         | This is still experimental and internal
         | (https://semgrep.dev/docs/experiments/r2c-internal-project-
         | de...) but eventually we'd like to promote it and also maybe
         | open up our CVE rules more as well!
        
           | mattkopecki wrote:
           | Here is a good writeup of some of the pros and cons of using
           | a "reachability" approach.
           | 
           | https://blog.sonatype.com/prioritizing-open-source-
           | vulnerabi...
           | 
           | >Unfortunately, no technology currently exists that can tell
           | you whether a method is definitively not called, and even if
           | it is not called currently, it's just one code change away
           | from being called. This means that reachability should never
           | be used as an excuse to completely ignore a vulnerability,
           | but rather reachability of a vulnerability should be just one
           | component of a more holistic approach to assessing risk that
           | also takes into account the application context and severity
           | of the vulnerability.
        
             | DannyBee wrote:
             | Err, "no technology currently exists" is wrong, "no
             | technology can possibly exist" to say whether something if
             | definitively called.
             | 
             | It's an undecidable problem in any of the top programming
              | languages, and some of the subproblems (like aliasing)
             | themselves are similarly statically undecidable in any
             | meaningful programming language.
             | 
              | You can choose between over-approximation and
              | under-approximation.
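              | 
              | A toy illustration (Python here, but the same issue exists
              | in the other popular languages): whether the function below
              | ever gets called depends entirely on runtime data, so a
              | static tool can only over- or under-approximate it.
              | 
              |     import importlib
              |     
              |     def run(task: str) -> None:
              |         # callee is chosen from data, e.g. "os:getcwd"
              |         mod_name, _, func_name = task.partition(":")
              |         func = getattr(importlib.import_module(mod_name),
              |                        func_name)
              |         func()
              |     
              |     run(input())  # could name any importable function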
        
             | sverhagen wrote:
             | I saw that Java support was still in beta. But it makes me
             | wonder if it's going to come with a "don't use reflection"
             | disclaimer, then...?
        
       | jrockway wrote:
        | This is a similar mechanism to govulncheck
       | (https://pkg.go.dev/golang.org/x/vuln/cmd/govulncheck), which has
       | been quite nice to use in practice. Because it only cares about
       | vulnerable code that is actually possible to call, it's quiet
       | enough to use as a presubmit check without annoying people. Nice
       | to see this for other languages.
        
         | Hooray_Darakian wrote:
         | How does it deal with vulnerability alerts which don't say
         | anything about what code is affected?
        
           | jrockway wrote:
           | From https://go.dev/security/vuln/: "A vulnerability database
           | is populated with reports using information from the data
           | pipeline. All reports in the database are reviewed and
           | curated by the Go Security team."
           | 
           | I would imagine that's what Semgrep is doing as well. You're
           | paying for the analysis; the code is the easy part.
        
           | ievans wrote:
           | Both Semgrep Supply Chain and govulncheck (AFAIK) are doing
           | this work manually, for now. It would indeed be nice if the
           | vulnerability reporting process had a way to provide
           | metadata, but there's no real consensus on what format that
           | data would take. We take advantage of the fact that Semgrep
           | makes it much easier than other commercial tools (or even
           | most linters) to write a rule quickly.
           | 
            | The good news is there's a natural power-law distribution:
            | most alerts come from a few vulnerabilities in the most
            | popular (and often large) libraries, so you get significant
            | lift just by writing rules for those libraries first.
        
             | Hooray_Darakian wrote:
             | > Both Semgrep Supply Chain and govulncheck (AFAIK) are
             | doing this work manually, for now.
             | 
             | Ya I get that, but surely you don't have 100% coverage.
             | What does your code do for the advisories which you don't
             | have coverage for? Alert? Ignore?
        
               | nightpool wrote:
               | Since security vulnerability alerts are already created
               | and processed manually (e.g., every Dependabot alert is
                | triggered by some GitHub employee who imported the right
               | data into their system and clicked "send" on it), adding
               | an extra step to create the right rules doesn't seem
               | impossibly resource intensive. Certainly much more time
               | is spent "manually" processing even easier-to-automate
               | things in other parts of the economy, like payments
               | reconciliation (https://keshikomisimulator.com/)
        
       ___________________________________________________________________
       (page generated 2022-10-04 23:00 UTC)