[HN Gopher] The Python Package Index is now a GitHub secret scan...
___________________________________________________________________
The Python Package Index is now a GitHub secret scanning integrator

Author : rbanffy
Score  : 327 points
Date   : 2021-03-24 11:52 UTC (11 hours ago)

(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)

| soheil wrote:
| This makes me wonder if GitHub should do basic code sanity checks
| on every repo. Things like checking for division by zero,
| infinite loops, etc. They'd have to be very conservative checks
| so as not to trigger false positives. But if there is benefit in
| secret scanning for all public repos there must be benefit in
| detecting other types of programmer mistakes.
| leblancfg wrote:
| They acquired LGTM (https://github.com/marketplace/lgtm) not
| too long ago, so expect this to happen.
| [deleted]
| RocketSyntax wrote:
| would love to see tighter integration with some GitHub
| Secret/Action publishing
| di wrote:
| Not sure if this is what you're asking for, but the PyPA does
| maintain a GitHub Action for publishing to PyPI as well:
| https://github.com/pypa/gh-action-pypi-publish
| sneak wrote:
| This is some epic-level brand building in action. Pretty soon,
| people just entering our industry will mistakenly believe that
| GitHub's ownership (Microsoft) wants open source to exist and
| thrive.
| molticrystal wrote:
| They got a decent list of partnered companies which you can find
| over here:
|
| https://docs.github.com/en/code-security/secret-security/abo...
|
| Glad they got our back.
| linkdd wrote:
| Great news!
|
| IMHO, GitHub should make it mandatory for integrated services to
| provide this feature.
| loloquwowndueo wrote:
| Wow today I learned this acronym. PyPI -> python package index,
| after using python for over a decade. Thanks!
| kspacewalk2 wrote:
| Pronounced pie-pee-aye, and not pee-pee, pie-pee or any of the
| other ways I heard it pronounced at work :)
| verall wrote:
| Based on my workplace, I'm pretty sure it's "pee-pee". Just
| like 'Qt' is "cue-tee".
|
| There's no winning these battles..
| porker wrote:
| > Just like 'Qt' is "cue-tee".
|
| How else would you want to pronounce it?
| conradludgate wrote:
| According to them, it's just "cute"
| loloquwowndueo wrote:
| Right, I used to pronounce it as pie-pie. Might continue to
| do so but at least I know what it stands for :D
| danudey wrote:
| I call it pie-pie because that makes the most sense and
| sounds the least weird.
| mschulkind wrote:
| Just don't confuse it with PyPy, which is entirely different...
| fredley wrote:
| That's why you pronounce it "Cheese Shop"
| daviddavis wrote:
| And don't pronounce PyPI as pie-pie. It's pie-P-I.
| dec0dedab0de wrote:
| it was much easier when it was just called the cheese shop
| lostcolony wrote:
| Ah. The fat detective.
| cpcallen wrote:
| The headline sounds insidious (How dare PyPI and GitHub secretly
| scan me! I'm glad someone has revealed this dastardly collusion!)
| but it turns out they're actually doing something great.
| zitterbewegung wrote:
| Naming things is the hardest thing to do in computer science.
| brian_herman wrote:
| Yes brother I agree!
| doubleunplussed wrote:
| I thought it was the second hardest. At least that's what I
| remember, since I last checked.
| cbm-vic-20 wrote:
| That, and cache invalidation.
| teraku wrote:
| That, and off-by-one errors
| airstrike wrote:
| There are actually only two hard problems in computer
| science:
|
| 0) Cache invalidation
|
| 1) Naming things
|
| 5) Asynchronous callbacks
|
| 2) Off-by-one errors
|
| 3) Scope creep
|
| 6) Bounds checking
| jsheard wrote:
| 4294967295) Integer underflows
| macksd wrote:
| 7) Project estimation
| DonHopkins wrote:
| -1) Keeping secrets
| weeboid wrote:
| Luckily, building better garbage collectors is easy: ref
| pointers to each cons
| wizzwizz4 wrote:
| Naming things is the hardest thing to do in computer
| science.
| mbreese wrote:
| 7) February 29th.
| moviuro wrote:
| 7) Timezones
|
| FTFY
| Sebb767 wrote:
| 7.0000001) leap seconds
| _joel wrote:
| NaN) Javascript
| gogopuppygogo wrote:
| 9000) communicating
| eganist wrote:
| @dang, in re: this comment, any hopes of editing the title to
| say "secret-scanning" with a hyphen? Might add some clarity.
| melson wrote:
| good one
| z77dj3kl wrote:
| Is there some best practice on creating a format for secret keys?
| If I create an API with secret keys, should I make them something
| like z77dj3kl-secret-pk-[secret-stuff]?
|
| Is there an argument (security by obscurity?) that that makes it
| easier to spot it and abuse it?
|
| Or would it be better to encode it in the secret bits somehow,
| add 16 control bits that have known values?
| theoretick wrote:
| FWIW There's a new RFC for specifying a URI scheme:
| https://tools.ietf.org/html/rfc8959
| einpoklum wrote:
| As a non-Python person:
|
| Is it an easy mistake to make, for someone to inadvertently
| commit and push a "secret PyPI token"?
| progval wrote:
| I think not. The standard tools read the token from ~/.pypirc
| (or the console if absent). Inadvertent commits of the token
| probably only happen if you have a custom script with a
| hardcoded token.
| macintux wrote:
| Secrets in general leak into source code all the time, nothing
| specific about PyPI.
| klyrs wrote:
| I can certainly imagine putting a token into a deploy script in
| the same directory as a python package's repo. From there, it's
| a typo away from getting added and committed to the repo. So,
| it's better to keep those tokens elsewhere.
| einpoklum wrote:
| Isn't it totally verboten to put secret tokens / passwords
| into scripts? Regardless of language?
|
| When I write, say, bash scripts which do work using ssh, I
| don't specify a password: The user running the script will
| provide their own manually, or use ssh-copy-id, or edit the
| authorized_keys file on the target machine if they want to
| save themselves some typing. That is - authentication is
| decoupled from my script's actual work. Why is that not how
| things work with PyPI?
| progval wrote:
| It is. But even if it is strongly discouraged, some people
| will commit it anyway. Look at any beginner's repository,
| there is a high chance it contains files compiled from the
| source of the repo (executable, .pyc, ...), the developer's
| IDE config (.vscode, ...), __MACOSX, ...
| klyrs wrote:
| > Isn't it totally verboten to put secret tokens /
| passwords into scripts?
|
| It's only a rule because people have made the mistake
| enough to learn the lesson...
| hannasanarion wrote:
| If you are trying to publish your package for other people to
| download through the `pip` package manager, then yeah.
|
| Most python devs will probably never publish to PyPI, but this
| can save some headaches for those who do, especially for the
| first time.
| seanwilson wrote:
| Do any APIs standardise on a simple secret key pattern that can
| be easily identified as a secret? For example, all secrets have a
| "secret-" prefix? Or is this idea unworkable?
|
| I usually try and prefix e.g. fields in config files with
| "secret" to make it obvious they shouldn't be committed.
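The "secret-" prefix idea above can be sketched as a small check, for instance one run from a pre-commit hook. This is only an illustration under stated assumptions: the prefix convention, the minimum length of 16 characters, and the names `SECRET_PATTERN` and `find_secrets` are all hypothetical and do not come from any existing API or standard.

```python
import re

# Hypothetical convention (an assumption for illustration): every secret
# starts with "secret-" followed by at least 16 URL-safe characters.
SECRET_PATTERN = re.compile(r"\bsecret-[A-Za-z0-9_\-]{16,}")


def find_secrets(text: str) -> list[str]:
    """Return every substring that matches the hypothetical secret format."""
    return SECRET_PATTERN.findall(text)


if __name__ == "__main__":
    sample = 'api_key = "secret-0123456789abcdefXYZ"\nname = "not-a-secret"'
    for hit in find_secrets(sample):
        # Print only a short prefix so the check does not leak the value itself.
        print("possible secret found:", hit[:10] + "...")
```

A fixed prefix makes the check cheap and keeps false positives low, at the cost of also making leaked values easy for an attacker to grep for, which is the trade-off raised earlier in the thread.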
| csnover wrote:
| There was a discussion a while ago about IETF RFC 8959 which
| proposes a secret-token URI that might be of interest:
| https://news.ycombinator.com/item?id=25978185
| amichal wrote:
| These secret scanning integrations have been very helpful. We had
| a client ask to take a project open source recently that had
| started a few years ago as closed source. We of course checked
| over the current version of the code and have had linters in
| place to look for secrets for a while but not in the very early
| days of the project. In that one codebase we had:
|
| - AWS IAM token for S3 upload access to a throwaway dev bucket.
| The bucket had already been deleted but still... Got an email
| about it informing me the IAM token had been revoked by AWS
| within 5 minutes
|
| - A Slack webhook notification URL/secret. Committed as an example
| on a working branch and then git rm'ed but still active. Got an
| email about it and token revoked by Slack automatically within 5
| minutes.
|
| - A Mapbox API token. This one was funny. The token was indeed in
| there and functional but was in the docs/sample code for a
| dependency. Still, we got an email within the hour about it and
| were able to investigate.
|
| Edit: In this case we intentionally kept the commit history. A
| safer alternative (and one we normally practice) is to start a
| fresh repo for the open source variant.
| ed25519FUUU wrote:
| An overlooked vector is old commits. It's oftentimes better to
| squash all commits before taking a project open source, which
| is a real shame for obvious reasons.
|
| Commit histories can spill a lot of secrets that are easy to
| overlook.
| psanford wrote:
| There are tools available to help look for this sort of thing
| (for both you and any potential attackers). TruffleHog[1] is
| the first one that comes to mind for me.
|
| I also like shhgit[2] for looking for secrets in
| repositories.
| (I don't think shhgit will look back in the git
| history for you though).
|
| [1]: https://github.com/dxa4481/truffleHog
|
| [2]: https://github.com/eth0izzle/shhgit
| amichal wrote:
| Thanks! I knew they existed but hadn't investigated for one
| that would look over past history. Will try out truffleHog.
| lstamour wrote:
| Another idea is to use a git commit hook, such as
| https://github.com/cloud-gov/caulking
| _the_inflator wrote:
| Absolutely this!
|
| Same problem here with inner source that goes open source.
|
| I feel sorry for all our internal committers, however I know
| of "secrets" that went into the commit history. We are still
| considering our options, but tend to opt for deleting our
| commit history entirely and build a wall of fame for the
| former committers.
| jgalt212 wrote:
| My current fear is versioning backup systems. KeePass files
| may now have secure master keys, but maybe the version saved
| 18 mos ago did not.
|
| 1. Get an old copy 2. run dictionary attack 3. prosper
| danudey wrote:
| > A safer alternative (and one we normally practice) is to
| start a fresh repo for the open source variant.
|
| Note that it's also possible to go back and rewrite history
| (e.g. if you know what the tokens are and where/when they were
| committed), to preserve Git history while cleaning out tokens.
| It can be mildly slow or complicated, but there are tools to
| automate it, such as BFG Repo Cleaner[0] which is relatively
| easy to use (once you learn it).
|
| There are other awesome rewriting tools, like git
| filter-repo[1], but that operates solely on the structure of the
| repository (i.e. it can manipulate basically anything _except_
| file contents). Great for removing unwanted files or
| directories extremely fast, but not good for removing tokens
| (unless you want to remove the entire file the token was in).
|
| [0] https://rtyley.github.io/bfg-repo-cleaner/
| [1] https://github.com/newren/git-filter-repo
| [deleted]
| amichal wrote:
| Learning so many options from this thread. I've used these
| tools when I knew what to look for but that's been the tricky
| bit.
|
| psanford also mentioned truffleHog and others, lstamour
| mentioned https://github.com/cloud-gov/caulking which is
| built on gitleaks which looks good. caulking's customized
| list of patterns for gitleaks is here
| https://github.com/cloud-gov/caulking/blob/master/local.toml
| Looks like it would have found the keys in my example case no
| problem.
| anderskaseorg wrote:
| When I helped to take Zulip open-source in 2015, I wrote a
| simple script that scrubbed secrets from the commit history
| using git fast-export and git fast-import. We replaced all our
| secrets with xxxxxxx placeholders, replaced internal customer
| references with dummy names, deleted and renamed certain files,
| and even did some code replacements that caused certain commit
| diffs to become empty so those commits could be removed from
| the history.
|
| https://github.com/zulip/zulip/blob/3.3/tools/zanitizer
|
| https://github.com/zulip/zulip/blob/3.3/tools/zanitizer_conf...
|
| The script was really fast (all ~10000 commits in a few
| minutes), which allowed us to iterate quickly on its
| configuration as we audited using gitk and other tools for
| remaining items to scrub.
|
| Doing this work allowed us to release with an essentially
| complete history going back to the first commit in 2012, which
| has been a really valuable resource for understanding why
| various Zulip subsystems were written the way they were.
|
| Nowadays there are other tools for scrubbing history that might
| be more polished, like BFG:
| https://rtyley.github.io/bfg-repo-cleaner/
| amichal wrote:
| Nice tooling. I've used bfg when we knew what patterns to
| look for.
| This project didn't generally access private data,
| had a reasonably well behaved team for most of its life (the
| pre-linter & code-review commits were my own damn fault).
| Since it was low risk, I just did a few manual `git log -S
| ...` and moved on. I was still very happy to have GitHub
| catch my throwaway credentials and remind me in the most
| obvious way that these things go in `ENV` and not IN code
| even in examples!
| dthul wrote:
| I was seriously impressed when a few days ago I accidentally
| pushed my secret Discord bot token to GitHub and literally one
| second later I received a Discord message and an email letting me
| know that I leaked my token and that they deactivated it.
| Kaimunchi wrote:
| Look into this software for device management sclera VDMS -
| https://youtu.be/0_7V3lECy_s
| akhilpotla wrote:
| It would be nice instead if the git command prevented you from
| committing a file with a token in it.
| simonw wrote:
| In case anyone is interested, it looks like this is the
| implementation on the PyPI side:
| https://github.com/pypa/warehouse/pull/8563
| danudey wrote:
| > Fixes #6051
| > See #7124 reverted in #8555 due to #8554 which is addressed in
| > #8562 (pfew...)
| > Should not be merged before #8562: EDIT:
| >
| > Re-revert of the code. The bug that caused revert was splitted
| > into #8562
|
| Software development in a nutshell, everyone.
| remram wrote:
| FYI PyPI tokens look like
| pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9
|
| The integration means that GitHub knows to recognize this format,
| and calls some API of pypi.org when it finds one so PyPI can
| revoke it.
|
| As always, please allow me to lament that we don't have a
| standard for this, such as
| secret-token:pypi.org/9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9,
| which would let any system know that this string is a secret and
| that pypi.org should be notified (for example via POST
| pypi.org/.well-known/compromised-secret).
| See also
| https://news.ycombinator.com/item?id=25978185
| l0b0 wrote:
| One cool data format standard I only recently learned about is
| multihash[1] - a self-describing hash format: the first byte
| represents the hashing algorithm, the second byte represents
| the length of the hash, and the subsequent [length] bytes are
| the actual hash.
|
| Something similar for tokens would be really useful.
|
| [1] https://multiformats.io/multihash/
| nindalf wrote:
| According to the documentation
| (https://docs.github.com/en/developers/overview/secret-scanni...),
| secret issuers specify a regex that can detect
| secrets they've issued. "Be as precise as possible, because
| this will reduce the number of false positives" - that's the
| guideline from GitHub. GitHub runs the regex on every commit
| that is uploaded and informs the secret provider when a match
| occurs.
| kevincox wrote:
| I wonder if false-positives often result in GitHub sending
| secrets to the wrong service.
| danudey wrote:
| I wonder if any of those services have a combination of bad
| regexes and bad validation and could be SQL injected by
| committing a malicious faux-token to GitHub.
| woodruffw wrote:
| Hey there! I designed and implemented PyPI's tokens (although
| not the secret scanning integration).
|
| They're actually just macaroons[1] internally, which means that
| they could easily be upgraded at some point to include a
| reporting URL like you mention.
|
| Just as a tidbit: they were originally prefixed with "pypi:"
| rather than "pypi-", but that colon caused problems for a few
| packaging utilities.
| Any sort of in-band signaling like that is
| unlikely to gain widespread adoption for exactly that reason
| :-)
|
| [1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)
| leot wrote:
| > to help keep their customers safe
|
| The elimination of a distinction between "safety" and "security"
| is unhealthy imo, as it leads to a failure to distinguish between
| unintentional harm caused by nature, and intentional harm caused
| by other people.
|
| E.g. "safety first" is only intelligible if it doesn't also
| prevent you from trusting anyone (which is what would be implied
| by "security first" as a general priority).
| hannasanarion wrote:
| Do you lock your doors?
| leot wrote:
| Sometimes. But I can't say that I have a "security first"
| mindset, which seems analogous to "trust no one".
| brian_herman wrote:
| This is great, hopefully we will get GitHub packages support for
| python soon. https://github.com/features/packages
| luhn wrote:
| It's on their public roadmap:
| https://github.com/github/roadmap/issues/94
|
| Unfortunately it's marked as "Future," so it's still a ways
| out.
| natemcintosh wrote:
| Can someone explain what exactly this means?
| stevekemp wrote:
| If you commit your AWS secrets/tokens, or similar, inside a
| python script it will now be discovered by GitHub
| automatically.
|
| They have integrations with a bunch of services to recognize
| the tokens, and disable them. This means malicious users can't
| copy/paste them, spin up servers and leave you with a big bill.
| (Ideally, of course it could still happen, but the aim is to
| prevent that kind of thing.)
| JosephRedfern wrote:
| Though this has been true for a while, it's not what this
| announcement is about. This is specifically announcing
| automated scanning and reporting of PyPI keys, which if
| exposed, could allow a bad actor to distribute compromised
| Python packages via PyPI (e.g. pip)
| russfink wrote:
| And this is a potentially huge security issue.
| Think about
| all the systems software that relies on Python packages.
| geofft wrote:
| If you accidentally commit your PyPI private token to git and
| push it to GitHub, PyPI will detect this and disable the token
| within seconds (because there are absolutely bots who will try
| to find it and abuse it).
| eecc wrote:
| > From today, GitHub will scan every commit to a public
| repository for exposed PyPI API tokens. We will forward any
| tokens we find to PyPI, who will automatically disable them and
| notify their owners.
| [deleted]
| prepend wrote:
| It should reduce the possibility of PyPI packages being taken
| over as the result of their owner being careless with their PyPI
| credentials.
|
| I think it's good because the risk of a package being taken
| over is low, but very damaging if it occurs in a widely used
| package.
| nautilus12 wrote:
| I presume it means that if someone accidentally pushes up a
| token to a public GitHub repo then it can't be used to hijack
| all the PyPI packages corresponding to that token to become
| malicious
| bombcar wrote:
| The API keys I've used (admittedly not many) all seem to be long
| random text strings - how does GitHub detect them? By them being
| used (i.e. in API code) or do they actually have a known format?
| di wrote:
| PyPI API keys have a known format, they start with "pypi-".
| Deathmax wrote:
| GitHub documents the process over at
| https://docs.github.com/en/developers/overview/secret-scanni....
| You specify a regex, and you check if the secret is
| valid on your end.
| monkeybutton wrote:
| There must be an astounding number of false positives for
| common patterns like N-length string of base64 chars. Could
| someone upload a malicious file with millions of matching
| strings and watch GitHub DDoS a company's verification
| endpoint?
| neurostimulant wrote:
| I imagine the scanning would be rate-limited on a per-repo
| basis.
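The two-step flow described above (the scanner matches a regex, then the issuer checks whether each match is a real token) can be sketched roughly as follows. This is a guess at the shape of the process, not GitHub's or PyPI's actual implementation: the pattern, the minimum length of 40, and the `scan` and `tokens_to_revoke` helpers are assumptions, and a plain Python set stands in for whatever lookup an issuer would really use.

```python
import re

# Assumed pattern: "pypi-" followed by a long run of URL-safe base64
# characters. The pattern PyPI actually registered with GitHub may differ.
CANDIDATE = re.compile(r"pypi-[A-Za-z0-9_\-]{40,}")


def scan(text: str) -> list[str]:
    """Scanner side: collect every string that looks like a token."""
    return CANDIDATE.findall(text)


def tokens_to_revoke(candidates: list[str], issued: set[str]) -> list[str]:
    """Issuer side: keep only candidates that are real, issued tokens.

    False positives from the regex are dropped here, so a match alone
    never triggers a revocation.
    """
    return [c for c in candidates if c in issued]


if __name__ == "__main__":
    # Example token format quoted earlier in the thread.
    real = "pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9"
    lookalike = "pypi-" + "A" * 45  # matches the regex but was never issued
    commit = f"token = '{real}'\nother = '{lookalike}'\n"
    print(tokens_to_revoke(scan(commit), issued={real}))
```

Splitting the work this way means the regex only has to be precise enough to keep the volume of reports manageable; the issuer still makes the final decision about what to revoke.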
| lostcolony wrote:
| Probably also a max false positive rate; this isn't a
| guarantee, just a service, so if it detects X false
| positives it could just exclude the repo entirely as
| problematic.
| monkeybutton wrote:
| Yeah, that would be reasonable.
| michaelcampbell wrote:
| "Now you have 2 problems."
| MattConfluence wrote:
| This is a difficult problem indeed, but thankfully it is just
| as difficult for the malicious actors as it is for the "good
| guys". Since various bad guys have presumably been scanning
| public repos for years already, GitHub and PyPA adding this
| feature is leveling the playing field, even if it is not a 100%
| accurate search algorithm.
| boarnoah wrote:
| Not sure how these particular scanners do it, but during
| security assessments you sometimes use tools that will find all
| strings in an application package with high entropy.
|
| Usually it's junk, but occasionally you do get lucky and find
| tokens.
___________________________________________________________________
(page generated 2021-03-24 23:00 UTC)