[HN Gopher] The Python Package Index is now a GitHub secret scan...
       ___________________________________________________________________
        
       The Python Package Index is now a GitHub secret scanning integrator
        
       Author : rbanffy
       Score  : 327 points
       Date   : 2021-03-24 11:52 UTC (11 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | soheil wrote:
       | This makes me wonder if Github should do basic code sanity checks
       | on every repo. Things like checking for division by zero,
       | infinite-loops, etc. They'd have to be very conservative checks
       | so as not to trigger false positives. But if there is benefit in
       | secret scanning for all public repos there must be benefit in
       | detecting other types of programmer mistakes.
        
         | leblancfg wrote:
         | They acquired LGTM (https://github.com/marketplace/lgtm) not
         | too long ago, so expect this to happen.
        
       | RocketSyntax wrote:
       | would love to see tighter integration with some GitHub Secret/
       | Action publishing
        
         | di wrote:
         | Not sure if this is what you're asking for, but the PyPA does
         | maintain a GitHub Action for publishing to PyPI as well:
         | https://github.com/pypa/gh-action-pypi-publish
        
       | sneak wrote:
       | This is some epic-level brand building in action. Pretty soon,
       | people just entering our industry will mistakenly believe that
       | GitHub's ownership (Microsoft) wants open source to exist and
       | thrive.
        
       | molticrystal wrote:
       | They got a decent list of partnered companies which you can find
       | over here:
       | 
       | https://docs.github.com/en/code-security/secret-security/abo...
       | 
       | Glad they got our back.
        
       | linkdd wrote:
       | Great news!
       | 
       | IMHO, Github should make it mandatory for integrated services to
       | provide this feature.
        
       | loloquwowndueo wrote:
       | Wow today I learned this acronym. PyPI -> python package index,
       | after using python for over a decade. Thanks!
        
         | kspacewalk2 wrote:
         | Pronounced pie-pee-aye, and not pee-pee, pie-pee or any of the
         | other ways I heard it pronounced at work :)
        
           | verall wrote:
           | Based on my workplace, I'm pretty sure it's "pee-pee". Just
           | like 'Qt' is "cue-tee".
           | 
           | There's no winning these battles..
        
             | porker wrote:
             | > Just like 'Qt' is "cue-tee".
             | 
             | How else would you want to pronounce it?
        
               | conradludgate wrote:
               | According to them, it's just "cute"
        
           | loloquwowndueo wrote:
           | Right, I used to pronounce it as pie-pie. Might continue to
           | do so but at least I know what it stands for :D
        
             | danudey wrote:
             | I call it pie-pie because that makes the most sense and
             | sounds the least weird.
        
         | mschulkind wrote:
         | Just don't confuse it with PyPy, which is entirely different...
        
           | fredley wrote:
           | That's why you pronounce it "Cheese Shop"
        
           | daviddavis wrote:
           | And don't pronounce PyPI as pie-pie. It's pie-P-I.
        
             | dec0dedab0de wrote:
             | it was much easier when it was just called the cheese shop
        
             | lostcolony wrote:
             | Ah. The fat detective.
        
       | cpcallen wrote:
       | The headline sounds insidious (How dare PyPI and GitHub secretly
       | scan me! I'm glad someone has revealed this dastardly collusion!)
       | but it turns out they're actually doing something great.
        
         | zitterbewegung wrote:
         | Naming things is the hardest thing to do in computer science.
        
           | brian_herman wrote:
           | Yes brother I agree!
        
           | doubleunplussed wrote:
           | I thought it was the second hardest. At least that's what I
           | remember, since I last checked.
        
           | cbm-vic-20 wrote:
           | That, and cache invalidation.
        
             | teraku wrote:
             | That, and off-by-one errors
        
               | airstrike wrote:
               | There are actually only two hard problems in computer
               | science:
               | 
               | 0) Cache invalidation
               | 
               | 1) Naming things
               | 
               | 5) Asynchronous callbacks
               | 
               | 2) Off-by-one errors
               | 
               | 3) Scope creep
               | 
               | 6) Bounds checking
        
               | jsheard wrote:
               | 4294967295) Integer underflows
        
               | macksd wrote:
               | 7) Project estimation
        
               | DonHopkins wrote:
               | -1) Keeping secrets
        
               | weeboid wrote:
               | Luckily, building better garbage collectors is easy: ref
               | pointers to each cons
        
               | wizzwizz4 wrote:
               | Naming things is the hardest thing to do in computer
               | science.
        
               | mbreese wrote:
               | 7) February 29th.
        
               | moviuro wrote:
               | 7) Timezones
               | 
               | FTFY
        
               | Sebb767 wrote:
               | 7.0000001) leap seconds
        
               | _joel wrote:
               | NaN) Javascript
        
               | gogopuppygogo wrote:
               | 9000) communicating
        
         | eganist wrote:
         | @dang, in re: this comment, any hopes of editing the title to
         | say "secret-scanning" with a hyphen? Might add some clarity.
        
       | melson wrote:
       | good one
        
       | z77dj3kl wrote:
       | Is there some best practice on creating a format for secret keys?
       | If I create an API with secret keys, should I make them something
       | like z77dj3kl-secret-pk-[secret-stuff]?
       | 
       | Is there an argument (security by obscurity?) that that makes it
       | easier to spot it and abuse it?
       | 
       | Or would it be better to encode it in the secret bits somehow,
       | add 16 control bits that have known values?
        
         | theoretick wrote:
         | FWIW There's a new RFC for specifying a URI scheme:
         | https://tools.ietf.org/html/rfc8959
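A minimal sketch of the idea behind RFC 8959, assuming only that a secret is marked by putting the `secret-token:` scheme in front of the token text. The helper names and the character class in the regex are illustrative choices for this sketch, not the RFC's own grammar.

```python
import re

# Assumed, simplified pattern: the scheme prefix followed by a run of
# URL-safe characters. The grammar in the RFC is not reproduced here.
SECRET_TOKEN_RE = re.compile(r"secret-token:[A-Za-z0-9._~%!$&'()*+,;=:@-]+")


def wrap(token: str) -> str:
    """Mark a token as secret by prefixing the URI scheme."""
    return "secret-token:" + token


def find_secret_tokens(text: str) -> list[str]:
    """Return every secret-token URI found in a blob of text."""
    return SECRET_TOKEN_RE.findall(text)
```

With a fixed prefix like this, a scanner does not need a per-service pattern to know that a string is meant to stay private.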
        
       | einpoklum wrote:
       | As a non-Python person:
       | 
       | Is it an easy mistake to make, for someone to inadvertently
       | commit and push a "secret PyPI token"?
        
         | progval wrote:
         | I think not. The standard tools read the token from ~/.pypirc
         | (or the console if absent). Inadvertent commits of the token
         | probably only happen if you have a custom script with a
         | hardcoded token.
        
         | macintux wrote:
         | Secrets in general leak into source code all the time, nothing
         | specific about PyPI.
        
         | klyrs wrote:
         | I can certainly imagine putting a token into a deploy script in
         | the same directory as a python package's repo. From there, it's
         | a typo away from getting added and committed to the repo. So,
         | it's better to keep those tokens elsewhere.
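One way to keep the token out of the deploy script, sketched under assumptions: the script reads it from an environment variable and refuses to run without it. `PYPI_API_TOKEN` and `load_token` are made-up names for this example, not something the standard tools define.

```python
import os


def load_token(env=os.environ, var="PYPI_API_TOKEN"):
    """Fetch the upload token from the environment instead of the script.

    `PYPI_API_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(var)
    if not token:
        raise RuntimeError(f"set {var} before running the deploy script")
    return token
```

Because the script itself never contains the secret, adding it to the repository by accident leaks nothing.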
        
           | einpoklum wrote:
           | Isn't it totally verboten to put secret tokens / passwords
           | into scripts? Regardless of language?
           | 
           | When I write, say, bash scripts which do work using ssh, I
           | don't specify a password: The user running the script will
           | provide their own manually, or use ssh-copy-id, or edit the
           | authorized_keys file on the target machine if they want to
           | save themselves some typing. That is - authentication is
           | decoupled from my script's actual work. Why is that not how
           | things work with PyPI?
        
             | progval wrote:
             | It is. But even if it is strongly discouraged, some people
             | will commit it anyway. Look at any beginner's repository,
             | there is a high chance it contains files compiled from the
             | source of the repo (executable, .pyc, ...), the developer's
             | IDE config (.vscode, ...), __MACOSX, ...
        
             | klyrs wrote:
             | > Isn't it totally verboten to put secret tokens /
             | passwords into scripts?
             | 
             | It's only a rule because people have made the mistake
             | enough to learn the lesson...
        
         | hannasanarion wrote:
         | If you are trying to publish your package for other people to
         | download through the `pip` package manager, then yeah.
         | 
         | Most python devs will probably never publish to PyPi, but this
         | can save some headaches for those who do, especially for the
         | first time.
        
       | seanwilson wrote:
       | Do any APIs standardise on a simple secret key pattern that can
       | be easily identified as a secret? For example, all secrets have a
       | "secret-" prefix? Or is this idea unworkable?
       | 
       | I usually try and prefix e.g. fields in config files with
       | "secret" to make it obvious they shouldn't be committed.
        
         | csnover wrote:
         | There was a discussion a while ago about IETF RFC 8959 which
         | proposes a secret-token URI that might be of interest:
         | https://news.ycombinator.com/item?id=25978185
        
       | amichal wrote:
       | These secret scanning integrations have been very helpful. We had
       | a client ask to take a project open source recently that had
       | started a few years ago as closed source. We of course checked
       | over the current version of the code and have had linters in
       | place to look for secrets for a while but not in the very early
       | days of the project. In that one codebase we had:
       | 
       | - AWS IAM token for S3 upload access to a throwaway dev bucket.
       | The bucket had already been deleted but still... Got an email
       | about it informing me the IAM token had been revoked by AWS
       | within 5 minutes
       | 
       | - A Slack webhook notification URL/secret. Committed as an example
       | on a working branch and then git rm'ed but still active. Got an
       | email about it and token revoked by Slack automatically within 5
       | minutes.
       | 
       | - A Mapbox API token. This one was funny. The token was indeed in
       | there and functional but was in the docs/sample code for a
       | dependency. Still, we got an email within the hour about it and
       | were able to investigate.
       | 
       | Edit: In this case we intentionally kept the commit history. A
       | safer alternative (and one we normally practice) is to start a
       | fresh repo for the open source variant.
        
         | ed25519FUUU wrote:
         | An overlooked vector is old commits. It's often times better to
         | squash all commits before taking a project open source, which
         | is a real shame for obvious reasons.
         | 
         | Commit histories can spill a lot of secrets that are easy to
         | overlook.
        
           | psanford wrote:
           | There are tools available to help look for this sort of thing
           | (for both you and any potential attackers). TruffleHog[1] is
           | the first one that comes to mind for me.
           | 
           | I also like shhgit[2] for looking for secrets in
           | repositories. (I don't think shhgit will look back in the git
           | history for you though).
           | 
           | [1]: https://github.com/dxa4481/truffleHog
           | 
           | [2]: https://github.com/eth0izzle/shhgit
        
             | amichal wrote:
             | Thanks! I knew they existed but hadn't investigated for one
             | that would look over past history. Will try out truffleHog.
        
             | lstamour wrote:
             | Another idea is to use a git commit hook, such as
             | https://github.com/cloud-gov/caulking
        
           | _the_inflator wrote:
           | Absolutely this!
           | 
           | Same problem here with inner source that goes open source.
           | 
           | I feel sorry for all our internal committers, but I know of
           | "secrets" that went into the commit history. We are still
           | considering our options, but lean toward deleting our commit
           | history entirely and building a wall of fame for the former
           | committers.
        
           | jgalt212 wrote:
           | My current fear is versioning backup systems. KeePass files
           | may now have secure master keys, but maybe the version saved
           | 18 mos ago did not.
           | 
           | 1. Get an old copy 2. run dictionary attack 3. prosper
        
         | danudey wrote:
         | > A safer alternative (and one we normally practice) is to
         | start a fresh repo for the open source variant.
         | 
         | Note that it's also possible to go back and rewrite history
         | (e.g. if you know what the tokens are and where/when they were
         | committed), to preserve Git history while cleaning out tokens.
         | It can be mildly slow or complicated, but there are tools to
         | automate it, such as BFG Repo Cleaner[0] which is relatively
         | easy to use (once you learn it).
         | 
         | There are other awesome rewriting tools, like git
         | filter-repo[1], but that operates solely on the structure of the
         | repository (i.e. it can manipulate basically anything _except_
         | file contents). Great for removing unwanted files or
         | directories extremely fast, but not good for removing tokens
         | (unless you want to remove the entire file the token was in).
         | 
         | [0] https://rtyley.github.io/bfg-repo-cleaner/
         | 
         | [1] https://github.com/newren/git-filter-repo
        
           | amichal wrote:
           | Learning so many options from this thread. I've used these
           | tools when I knew what to look for but that's been the tricky
           | bit.
           | 
           | psanford also mentioned truffleHog and others, lstamour
           | mentioned https://github.com/cloud-gov/caulking which is
           | built on gitleaks which looks good. caulking's customized
           | list of patterns for gitleaks is here
           | https://github.com/cloud-gov/caulking/blob/master/local.toml
           | Looks like it would have found the keys in my example case no
           | problem.
        
         | anderskaseorg wrote:
         | When I helped to take Zulip open-source in 2015, I wrote a
         | simple script that scrubbed secrets from the commit history
         | using git fast-export and git fast-import. We replaced all our
         | secrets with xxxxxxx placeholders, replaced internal customer
         | references with dummy names, deleted and renamed certain files,
         | and even did some code replacements that caused certain commit
         | diffs to become empty so those commits could be removed from
         | the history.
         | 
         | https://github.com/zulip/zulip/blob/3.3/tools/zanitizer
         | 
         | https://github.com/zulip/zulip/blob/3.3/tools/zanitizer_conf...
         | 
         | The script was really fast (all ~10000 commits in a few
         | minutes), which allowed us to iterate quickly on its
         | configuration as we audited using gitk and other tools for
         | remaining items to scrub.
         | 
         | Doing this work allowed us to release with an essentially
         | complete history going back to the first commit in 2012, which
         | has been a really valuable resource for understanding why
         | various Zulip subsystems were written the way they were.
         | 
         | Nowadays there are other tools for scrubbing history that might
         | be more polished, like BFG: https://rtyley.github.io/bfg-repo-
         | cleaner/
        
           | amichal wrote:
           | Nice tooling. I've used bfg when we knew what patterns to
           | look for. This project didn't generally access private data,
           | had a reasonably well behaved team for most of its life (the
           | pre-linter & code-review commits were my own damn fault).
           | Since it was low risk, I just did a few manual `git log -S
           | ...` and moved on. I was still very happy to have github
           | catch my throwaway credentials and remind me in the most
           | obvious way that these things go in `ENV` and not IN code
           | even in examples!
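The manual `git log -S ...` check mentioned above can be wrapped in a few lines. This is a sketch that assumes you already know the literal secret to look for; `pickaxe_command` and `commits_touching` are hypothetical helper names.

```python
import subprocess


def pickaxe_command(needle: str) -> list[str]:
    """Build a `git log -S` command listing commits, on any branch,
    where the number of occurrences of `needle` changed."""
    return ["git", "log", "--all", "--oneline", f"-S{needle}"]


def commits_touching(needle: str, repo: str = ".") -> list[str]:
    """Run the search inside `repo` and return one line per matching commit."""
    result = subprocess.run(
        pickaxe_command(needle),
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

An empty list from `commits_touching` only means that exact string never appeared; it says nothing about secrets you did not think to search for.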
        
       | dthul wrote:
       | I was seriously impressed when a few days ago I accidentally
       | pushed my secret Discord bot token to Github and literally one
       | second later I received a Discord message and an email letting me
       | know that I leaked my token and that they deactivated it.
        
       | Kaimunchi wrote:
       | Look into this software for device management sclera VDMS -
       | https://youtu.be/0_7V3lECy_s
        
       | akhilpotla wrote:
       | It would be nice instead if the git command prevented you from
       | committing a file with a token in it.
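Git can be made to do roughly this with a pre-commit hook. The sketch below assumes a token pattern based on the `pypi-` prefix; the regex, `leaked_lines`, and `main` are illustrative and not taken from any existing tool.

```python
import re
import subprocess
import sys

# Assumed pattern for this sketch: the "pypi-" prefix followed by a long
# run of URL-safe base64 characters. Real tokens may need a tighter rule.
TOKEN_RE = re.compile(r"pypi-[A-Za-z0-9_-]{20,}")


def leaked_lines(diff_text: str) -> list[str]:
    """Return added lines of a unified diff that contain a token-like string."""
    return [
        line
        for line in diff_text.splitlines()
        if line.startswith("+")
        and not line.startswith("+++")  # skip the file header lines
        and TOKEN_RE.search(line)
    ]


def main() -> int:
    """Inspect the staged changes; a non-zero result makes git abort the commit."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    hits = leaked_lines(diff)
    for line in hits:
        print(f"possible token in staged change: {line}", file=sys.stderr)
    return 1 if hits else 0
```

Saved as an executable `.git/hooks/pre-commit` with `sys.exit(main())` as its last line, this would stop the commit when a match is found. `git commit --no-verify` still bypasses hooks, so it is a safety net rather than a guarantee.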
        
       | simonw wrote:
       | In case anyone is interested, it looks like this is the
       | implementation on the PyPI side:
       | https://github.com/pypa/warehouse/pull/8563
        
         | danudey wrote:
         | > Fixes #6051
         | > See #7124 reverted in #8555 due to #8554 which is addressed in
         | > #8562 (pfew...)
         | > Should not be merged before #8562: EDIT:
         | >
         | > Re-revert of the code. The bug that caused revert was splitted
         | > into #8562
         | 
         | Software development in a nutshell, everyone.
        
       | remram wrote:
       | FYI pypi tokens look like
       | pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9
       | 
       | The integration means that GitHub knows to recognize this format,
       | and calls some API of pypi.org when it finds one so PyPI can
       | revoke it.
       | 
       | As always, please allow me to lament that we don't have a
       | standard for this, such as secret-
       | token:pypi.org/9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9,
       | which would let any system know that this string is a secret and
       | that pypi.org should be notified (for example via POST
       | pypi.org/.well-known/compromised-secret). See also
       | https://news.ycombinator.com/item?id=25978185
        
         | l0b0 wrote:
         | One cool data format standard I only recently learned about is
         | multihash[1] - a self-describing hash format: the first byte
         | represents the hashing algorithm, the second byte represents
         | the length of the hash, and the subsequent [length] bytes are
         | the actual hash.
         | 
         | Something similar for tokens would be really useful.
         | 
         | [1] https://multiformats.io/multihash/
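The layout described above (algorithm byte, length byte, then the digest) can be sketched for SHA-256, whose multihash code is 0x12. The sketch treats both header fields as single bytes, which holds for values below 128; the function names are made up for the example.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash code for sha2-256


def multihash_sha256(data: bytes) -> bytes:
    """Prefix a SHA-256 digest with its algorithm code and its length."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def parse_multihash(blob: bytes) -> tuple[int, bytes]:
    """Split a multihash into (algorithm code, digest), checking the length."""
    code, length = blob[0], blob[1]
    digest = blob[2:2 + length]
    if len(digest) != length:
        raise ValueError("truncated multihash")
    return code, digest
```

A reader of the value can tell which hash function produced it without any outside context, which is the property a self-describing token format would borrow.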
        
         | nindalf wrote:
         | According to the documentation
         | (https://docs.github.com/en/developers/overview/secret-
         | scanni...), secret issuers specify a regex that can detect
         | secrets they've issued. "Be as precise as possible, because
         | this will reduce the number of false positives" - that's the
         | guideline from GitHub. Github runs the regex on every commit
         | that is uploaded and informs the secret provider when a match
         | occurs.
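Why precision matters can be seen with two patterns run over the same text. Both regexes are assumptions made for this sketch, not the ones PyPI or GitHub use; the sample token is the example string quoted elsewhere in the thread.

```python
import re

# A loose rule: any run of 40 or more alphanumeric characters.
LOOSE = re.compile(r"\b[A-Za-z0-9]{40,}\b")
# A tighter rule that requires the issuer's prefix (assumed format).
PRECISE = re.compile(r"\bpypi-[A-Za-z0-9_-]{20,}")

SAMPLE = """
commit 3f2a9c0d1e4b5a6978877665544332211ffeeddc
token = pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9
"""


def loose_hits(text: str) -> list[str]:
    """Everything the loose rule would report to the issuer."""
    return LOOSE.findall(text)


def precise_hits(text: str) -> list[str]:
    """Everything the prefix-based rule would report."""
    return PRECISE.findall(text)
```

On `SAMPLE`, the loose rule also flags the commit hash, while the prefix-based rule reports only the token line. Every extra match is a report the issuer has to check and discard.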
        
           | kevincox wrote:
           | I wonder if false-positives often result in GitHub sending
           | secrets to the wrong service.
        
             | danudey wrote:
             | I wonder if any of those services have a combination of bad
             | regexes and bad validation and could be SQL injected by
             | committing a malicious faux-token to GitHub.
        
         | woodruffw wrote:
         | Hey there! I designed and implemented PyPI's tokens (although
         | not the secret scanning integration).
         | 
         | They're actually just macaroons[1] internally, which means that
         | they could easily be upgraded at some point to include a
         | reporting URL like you mention.
         | 
         | Just as a tidbit: they were originally prefixed with "pypi:"
         | rather than "pypi-", but that colon caused problems for a few
         | packaging utilities. Any sort of in-band signaling like that is
         | unlikely to gain widespread adoption for exactly that reason
         | :-)
         | 
         | [1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)
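For readers who have not met macaroons, the core construction is a chained HMAC: the signature starts from a server-side key and the token identifier, and each added caveat is folded in by using the previous signature as the next key. The sketch below shows only that chain. It does not reflect PyPI's actual token encoding or caveat format, and the "report-to" caveat in the usage note is a hypothetical illustration of the upgrade mentioned above.

```python
import hashlib
import hmac


def _mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def mint(root_key: bytes, identifier: bytes, caveats: list[bytes]) -> bytes:
    """Chain an HMAC over the identifier and then over each caveat in order."""
    sig = _mac(root_key, identifier)
    for caveat in caveats:
        sig = _mac(sig, caveat)
    return sig


def attenuate(sig: bytes, caveat: bytes) -> bytes:
    """Add a caveat to an existing token without knowing the root key."""
    return _mac(sig, caveat)


def verify(root_key: bytes, identifier: bytes, caveats: list[bytes], sig: bytes) -> bool:
    """Recompute the chain from the root key and compare in constant time.

    A real verifier would also check that each caveat's condition holds.
    """
    return hmac.compare_digest(mint(root_key, identifier, caveats), sig)
```

For example, calling `attenuate(sig, b"report-to = https://example.invalid/leak")` on an issued token yields a signature that only verifies when that caveat is presented along with the earlier ones, so caveats can be added but not silently dropped.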
        
       | leot wrote:
       | > to help keep their customers safe
       | 
       | The elimination of a distinction between "safety" and "security"
       | is unhealthy imo, as it leads to a failure to distinguish between
       | unintentional harm caused by nature, and intentional harm caused
       | by other people.
       | 
       | E.g. "safety first" is only intelligible if it doesn't also
       | prevent you from trusting anyone (which is what would be implied
       | by "security first" as a general priority).
        
         | hannasanarion wrote:
         | Do you lock your doors?
        
           | leot wrote:
           | Sometimes. But I can't say that I have a "security first"
           | mindset, which seems analogous to "trust no one".
        
       | brian_herman wrote:
       | This is great hopefully we will get GitHub packages support for
       | python soon. https://github.com/features/packages
        
         | luhn wrote:
         | It's on their public roadmap:
         | https://github.com/github/roadmap/issues/94
         | 
         | Unfortunately it's marked as "Future," so it's still a ways
         | out.
        
       | natemcintosh wrote:
       | Can someone explain what exactly this means?
        
         | stevekemp wrote:
         | If you commit your AWS secrets/tokens, or similar, inside a
         | python script it will now be discovered by github
         | automatically.
         | 
         | They have integrations with a bunch of services to recognize
         | the tokens, and disable them. This means malicious users can't
         | copy/paste them, spin up servers and leave you with a big bill.
         | (Ideally, of course it could still happen, but the aim is to
         | prevent that kind of thing.)
        
           | JosephRedfern wrote:
           | Though this has been true for a while, it's not what this
           | announcement is about. This is specifically announcing
           | automated scanning and reporting of PyPI keys, which if
           | exposed, could allow a bad actor to distribute compromised
           | Python packages via PyPi (e.g. pip)
        
             | russfink wrote:
             | And this is a potentially huge security issue. Think about
             | all the systems software that relies on Python packages.
        
         | geofft wrote:
         | If you accidentally commit your PyPI private token to git and
         | push it to GitHub, PyPI will detect this and disable the token
         | within seconds (because there are absolutely bots who will try
         | to find it and abuse it).
        
         | eecc wrote:
         | > From today, GitHub will scan every commit to a public
         | repository for exposed PyPI API tokens. We will forward any
         | tokens we find to PyPI, who will automatically disable them and
         | notify their owners.
        
         | prepend wrote:
         | It should reduce the possibility of pypi packages being taken
         | over as the result of its owner being careless with their pypi
         | credentials.
         | 
         | I think it's good because the risk of a package being taken
         | over is low, but very damaging if it occurs in a widely used
         | package.
        
         | nautilus12 wrote:
         | I presume it means that if someone accidentally pushes up a
         | token to a public github repo then it can't be used to hijack
         | all the PyPi packages corresponding to that token to become
         | malicious
        
       | bombcar wrote:
       | The API keys I've used (admittedly not many) all seem to be long
       | random text strings - how does GitHub detect them? By them being
       | used (ie in api code) or do they actually have a known format?
        
         | di wrote:
         | PyPI API keys have a known format, they start with "pypi-".
        
         | Deathmax wrote:
         | GitHub documents the process over at
         | https://docs.github.com/en/developers/overview/secret-
         | scanni.... You specify a regex, and you check if the secret is
         | valid on your end.
        
           | monkeybutton wrote:
           | There must be an astounding number of false positives for
           | common patterns like N-length string of base64 chars. Could
           | someone upload a malicious file with millions of matching
           | strings and watch Github DDoS a company's verification
           | endpoint?
        
             | neurostimulant wrote:
             | I imagine the scanning would be rate-limited on per-repo
             | basis.
        
               | lostcolony wrote:
               | Probably also a max false positive rate; this isn't a
               | guarantee, just a service, so if it detects X false
               | positives it could just exclude the repo entirely as
               | problematic.
        
               | monkeybutton wrote:
               | Yeah, that would be reasonable.
        
           | michaelcampbell wrote:
           | "Now you have 2 problems."
        
         | MattConfluence wrote:
         | This is a difficult problem indeed, but thankfully it is just
         | as difficult for the malicious actors as it is for the "good
         | guys". Since various bad guys have presumably been scanning
         | public repos for years already, Github and PyPa adding this
         | feature is leveling the playing field, even if it is not a 100%
         | accurate search algorithm.
        
         | boarnoah wrote:
         | Not sure how these particular scanners do it, but during
         | security assessments you sometimes use tools that will find all
         | strings in an application package with high entropy.
         | 
         | Usually it's junk, but occasionally you do get lucky and find
         | tokens.
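The high-entropy search described above can be approximated with Shannon entropy over character frequencies. This is a rough sketch: the minimum length of 20, the threshold of 4.0 bits per character, and the candidate character class are arbitrary assumptions, and real tools tune these differently.

```python
import math
import re
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, based on character frequencies."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def high_entropy_strings(text: str, min_len: int = 20, threshold: float = 4.0) -> list[str]:
    """Return long base64-like runs whose per-character entropy is high."""
    candidates = re.findall(r"[A-Za-z0-9+/=_-]{%d,}" % min_len, text)
    return [c for c in candidates if shannon_entropy(c) >= threshold]
```

A long run of a single repeated character scores 0 and is dropped, while a random-looking key scores well above the threshold. As the comment says, most hits from this kind of search are still junk such as hashes and encoded assets.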
        
       ___________________________________________________________________
       (page generated 2021-03-24 23:00 UTC)