[HN Gopher] Show HN: I mirrored all the code from PyPI to GitHub...
       ___________________________________________________________________
        
       Show HN: I mirrored all the code from PyPI to GitHub and analysed
       it
        
       This is a side project I've been working on for the last few
       months. I built an automated system to continuously mirror all the
       code on PyPI to a series of GitHub repositories. Mirroring PyPI
       code to GitHub enables:

       1. Scanning of all new Python packages for accidentally published
          credentials
       2. A browsable/searchable index of published code with a nice UI
       3. Large-scale analysis of _all_ published code to see how the
          language is evolving

       Using this project anyone is able to download the contents of PyPI
       to their personal machine and analyse every piece of code ever
       published in a matter of hours.

       I hope it enables people to do things with the world's largest and
       oldest corpus of Python code that weren't possible before, and
       while this is likely totally useless to most people I think that
       is kind of cool and unique.
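
As a rough illustration of the "large-scale analysis" use case, the sketch below counts a few syntax features in a single source string using the standard-library `ast` module. It is not the project's actual pipeline: the function name `count_features`, the chosen features and the sample source are all assumptions made for this example.

```python
import ast
from collections import Counter


def count_features(source: str) -> Counter:
    """Count a handful of syntax features in one Python source string.

    Hypothetical helper for illustration. Code that does not parse under
    the running interpreter (e.g. old Python 2 code) raises SyntaxError,
    which a real scan would need to handle.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.AsyncFunctionDef):
            counts["async def"] += 1
        elif isinstance(node, ast.JoinedStr):
            counts["f-string"] += 1
        elif isinstance(node, ast.NamedExpr):
            counts["walrus"] += 1
        elif isinstance(node, ast.With):
            counts["with"] += 1
    return counts


# Made-up sample source; it is only parsed, never executed.
sample = '''
async def fetch(url):
    with open("log.txt", "a") as log:
        log.write(f"fetching {url}")
    if (n := len(url)) > 10:
        return n
'''

features = count_features(sample)
print(dict(features))
```

Running the same function over every file in a cloned repository and summing the counters would give a per-feature total for that slice of the corpus.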
        
       Author : orf
       Score  : 77 points
       Date   : 2023-09-02 18:45 UTC (4 hours ago)
        
 (HTM) web link (py-code.org)
        
       | tentacleuno wrote:
       | I'm all for archivism, but wouldn't this get taken down by
       | GitHub? If what your website says is true, you're storing 300GB+
       | of code on GitHub. I've heard stories from people who've also
       | tested the limits, and they've had emails from GitHub asking them
       | to cease activity.
        
         | orf wrote:
         | I was in contact with them, and they are apparently OK with
         | having the repositories be split up. There are over 230 of
         | them, each under 1.3 GB in total size.
         | 
         | I'm working on distributing this data without GitHub - git
         | packfiles are a fantastic way of compressing this data, and you
         | can serve those easily enough from a bucket.
        
       | tlocke wrote:
       | Interesting. In the language feature list, what is 'try star' ?
        
         | orf wrote:
         | It's exception groups[1], with the syntax `except*`.
         | 
         | 1. https://peps.python.org/pep-0654/
        
       | swyx wrote:
       | highly recommend clicking the "witness the inevitable future"
       | button
       | 
       | as a python oldie coming back into python, i've been surprised by
       | dataclasses. are they basically "backwards compatible better
       | classes"? any strong blogposts or readings that people can share
       | about better typed python/OSS module publishing practices?
        
         | fbdab103 wrote:
         | You reminded me of this article[0] where the author asks why
         | not dataclasses by default? I am inclined to agree, dataclasses
         | feel Pythonic in that they remove boilerplate with reasonable
         | defaults (sorting, hashing, etc).
         | 
         | [0] https://blog.glyph.im/2023/02/data-classification.html
        
         | dylanjcastillo wrote:
         | It's under Growth > Files for those struggling to find the
         | button
        
         | kzrdude wrote:
         | Dataclasses are orthogonal to typing (IMO), they just use types
         | for their evocative syntax for fields.
         | 
         | Dataclasses are nice - they are a pared down version of the
         | attrs library, so a simple way to create data-only or mainly-
         | data records through classes. They are not intended to replace
         | all classes.
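
A small sketch of what the comments above describe, using only the standard-library `dataclasses` module. The class `Release` and its fields are invented for the example. Note that ordering and hashing are opt-in: comparison methods need `order=True`, and a dataclass with the default `eq=True` is only hashable if it is also `frozen=True` (or hashing is requested explicitly).

```python
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Release:
    # Annotations declare the fields; they are not checked at runtime.
    name: str
    version: tuple
    # Excluded from comparison (and therefore from hashing as well).
    tags: tuple = field(default=(), compare=False)


a = Release("demo", (1, 0))
b = Release("demo", (1, 2), tags=("beta",))

print(a)               # generated __repr__
print(a < b)           # generated ordering, compares (name, version)
print(sorted([b, a]))  # works because order=True

# The type hints are documentation only: this is accepted without error.
untyped = Release(123, "not-a-tuple")
```

The last line is the sense in which dataclasses are orthogonal to typing: the annotations name the fields, and enforcing them is left to an external type checker.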
        
       | teddyh wrote:
       | Some people _don't_ put their code on GitHub, since they object
       | to GitHub's ToS, especially the parts pertaining to analysis and
       | use by Copilot and the various other uses to which Microsoft may
       | see fit to put the code.
       | 
       | Will Microsoft see this as a free license to use all of PyPI?
        
         | chatmasta wrote:
         | The CoPilot team could pull the code from PyPi and use it to
         | train their models, regardless of whether it's on GitHub. If
         | you don't want AI trained on your code, then either don't
         | publish it, or publish it somewhere that forbids AI companies
         | from indexing or training on it (and preferably enforces that).
         | But good luck with that... it's public code. Don't publish it if
         | you don't want humans or machines to read it.
        
           | labster wrote:
           | The GPL should really be updated to say that any code
           | produced from machine learning on GPL code is also GPL
           | licensed.
        
             | chatmasta wrote:
             | Yes, or there should be a ROBOTS.TXT file that describes
             | how the code in a directory may be indexed or used by
             | machines (e.g. malware scanning okay, no LLM training,
             | etc.) But you're probably correct that such rights should
             | just be covered by the license itself.
             | 
             | Your reciprocity suggestion could also work, since it would
             | mean any LLM trained on even a single file of GPL code
             | would be "poisoned" in the sense that any code it produced
             | would also be GPL. This would make people wary of using the
             | LLM for writing code, and thus would make the creators of
             | it wary of training it on the code.
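
To make the robots.txt idea concrete, here is a sketch that reuses the existing robots.txt syntax and Python's standard `urllib.robotparser` to express "malware scanning okay, no LLM training". No such convention exists for source code today: the agent names `malware-scanner` and `llm-trainer` and the path are hypothetical, and honouring the file would be voluntary, just as it is for web crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical per-repository policy written in ordinary robots.txt syntax.
RULES = """
User-agent: malware-scanner
Allow: /

User-agent: llm-trainer
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# A well-behaved tool would check its own agent name before reading a file.
scanner_ok = parser.can_fetch("malware-scanner", "/pkg/module.py")
trainer_ok = parser.can_fetch("llm-trainer", "/pkg/module.py")
print(scanner_ok, trainer_ok)
```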
        
             | jbaber wrote:
             | This is an interesting idea.
        
           | teddyh wrote:
           | If Microsoft could legally do that, then why is that clause
           | present in the GitHub ToS?
        
             | schemescape wrote:
             | Out of my own curiosity, which clause in the GitHub TOS are
             | you referring to?
        
             | chatmasta wrote:
             | Because you're pushing the code to GitHub, so they need to
             | enumerate their rights in terms of what they can do with it
             | once you push it there. But if you publish your code to
             | PyPi, the relevant ToS is the PyPi ToS, which has no such
             | clause forbidding either PyPi or others from using the code
             | how they'd like to (and as mentioned by other comments in
             | this thread, the ToS actually explicitly grants others the
             | right to republish the code).
        
         | orf wrote:
         | They are within their rights to make that choice, but when you
         | publish a package to PyPI you agree to their terms which gives
         | anyone the right to mirror, distribute and otherwise use the
         | code you've published.
        
           | kzrdude wrote:
           | The rights you find in the PyPI terms, do they provide
           | everything you need to comply with the Github terms?
           | Ultimately it's tricky to understand what Github really means
           | by their terms (they say "User-Generated Content" a lot).
        
       | DrNosferatu wrote:
       | What fraction of the total is TensorFlow related code?
        
         | orf wrote:
         | There is a section on the page that tells you (near the
         | bottom): 16% of all uncompressed data in PyPI is from the
         | Tensorflow project.
        
           | geysersam wrote:
           | What the? 16%? Of the Python ecosystem. Why is it so big?
        
             | orf wrote:
             | You can go check out the page linked in the post to see?
             | That's kind of why it exists!
             | 
             | Spoiler: the answer is nightly builds, huge binaries and a
             | lot of wheels per release.
        
       | codetrotter wrote:
       | > https://py-code.org/download
       | 
       | This is a perfect job for task-spooler! :D
       | 
       | To mirror your pypi data, I sshed into my server and did this:
       |
       |     mkdir -p ~/src/github.com/pypi-data
       |     cd !$
       |     wget https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt
       |     xargs -L1 tsp git clone --bare < repositories.txt
       | 
       | And then I closed the ssh connection to my server, knowing that
       | my server will proceed to mirror all of those repositories of
       | yours one by one :D
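
The same idea can be sketched in Python for anyone without task-spooler. This assumes `repositories.txt` holds one repository URL per line (which the `xargs -L1` usage suggests but the thread does not confirm). The helper names `clone_commands` and `mirror` and the two example URLs are placeholders invented for illustration.

```python
import subprocess
from pathlib import Path


def clone_commands(listing: str, dest: Path) -> list[list[str]]:
    """Turn a newline-separated list of repository URLs into git commands."""
    commands = []
    for line in listing.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        name = url.rstrip("/").rsplit("/", 1)[-1]
        target = dest / (name if name.endswith(".git") else name + ".git")
        commands.append(["git", "clone", "--bare", url, str(target)])
    return commands


def mirror(listing: str, dest: Path) -> None:
    """Run the clones one after another; stops at the first failure."""
    dest.mkdir(parents=True, exist_ok=True)
    for command in clone_commands(listing, dest):
        subprocess.run(command, check=True)


# Placeholder URLs, not real entries from repositories.txt.
example_listing = """
https://github.com/pypi-data/example-one
https://github.com/pypi-data/example-two
"""

commands = clone_commands(example_listing, Path("mirror"))
for command in commands:
    print(" ".join(command))
```

Calling `mirror(Path("repositories.txt").read_text(), Path("mirror"))` would perform the actual clones sequentially, much like the queued `tsp` jobs above.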
        
       | gustavus wrote:
       | Did you respect the licence and wishes of everyone who published
       | their code to PyPi but didn't put it on GitHub?
       |
       | Is it possible that by publishing the code on GitHub you are
       | going to cause the other avenues the authors have for distributing
       | their information to be down-ranked in search results because of
       | the prominence of GitHub?
       | 
       | What it really sounds like is you didn't think it was easy enough
       | to use PyPi to do analytics and analysis of the code and figured
       | you'd just export all of the data from one service to another
       | service with a UI you liked better without permission, and mind
       | you without really doing anything more than copying code from one
       | page to another, which is fairly trivial.
       | 
       | I don't like this, it's this kind of stunt that makes me
       | reluctant to publish my code in general.
        
         | hluska wrote:
         | If you don't like it, don't publish your code to PyPI. It's as
         | simple as that. Their terms of use allow behaviour like this.
         | 
         | In the future, rather than shit on someone's project, read.
        
         | galenmarchetti wrote:
         | https://pypi.org/policy/terms-of-use/
         | 
         | Seems like PyPi explicitly allows this behavior:
         | 
         | "If I upload Content covered by a royalty-free license included
         | with such Content, giving the PSF the right to copy and
         | redistribute such Content unmodified on PyPI as I have uploaded
         | it, with no further action required by the PSF (an "Included
         | License"), I represent and warrant that the uploaded Content
         | meets all the requirements necessary for free redistribution by
         | the PSF and any mirroring facility, public or private, under
         | the Included License.
         | 
         | If I upload Content other than under an Included License, then
         | I grant the PSF and all other users of the web site an
         | irrevocable, worldwide, royalty-free, nonexclusive license to
         | reproduce, distribute, transmit, display, perform, and publish
         | the Content, including in digital form."
        
         | orf wrote:
         | It's not that I didn't think "it was easy enough to use PyPI to
         | do analytics and analysis": it is near _impossible_ for the
         | layman to use PyPI to do analytics and analysis in its current
         | form. The volume of data is very unwieldy, there are numerous
         | quirks with extracting packages and no tooling exists to help
         | you in any way.
         | 
         | This means no analysis has been done on the contents of PyPI.
         | In turn this means malicious packages are harder to detect (and
         | for sure still present somewhere in there), it means people
         | publish an absolutely _crazy_ number of credentials to PyPI on
         | a daily basis without ever knowing (and there is no simple way
         | to find concrete ways to improve this), and it means there is a
         | lack of exploration of the impact of language features/changes
         | on the ecosystem.
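
To give a flavour of what credential scanning can look like, here is a deliberately tiny sketch. It is not the scanner used by the project. The two patterns (the commonly cited `AKIA...` shape of AWS access key IDs and PEM private-key headers) are illustrative assumptions, and real scanners use far larger rule sets plus verification to cut false positives. The sample uses AWS's documented example key, not a real credential.

```python
import re

# Illustrative patterns only; names and coverage are assumptions.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


sample = "\n".join([
    'DEBUG = True',
    'AWS_KEY = "AKIAIOSFODNN7EXAMPLE"',
    'API_URL = "https://example.com"',
])

findings = scan(sample)
print(findings)
```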
         | 
         | To me the GitHub aspect isn't important or interesting. Would
         | it make any difference if it was distributed from a series of
         | git repositories hosted on S3? It's the git aspect that is
         | interesting, because it lowers the barrier for anyone to access
         | the corpus of already public, already mirrored and already
         | automatically-scanned-by-bad-actors code that is on PyPI.
         | 
         | While this project is more "a number of things glued-together"
         | than "a groundbreaking invention", I have to disagree with the
         | triviality aspect. Most problems we deal with can be reduced to
         | 'copying X from one place to another' (sorting?), and the devil
         | is always in the details.
         | 
         | > I don't like this, it's this kind of stunt that makes me
         | reluctant to publish my code in general.
         | 
         | Isn't this quite circular? People using code you publish
         | publicly makes you reluctant to publish code publicly?
        
           | lizard wrote:
           | > Isn't this quite circular? People using code you publish
           | publicly makes you reluctant to publish code publicly?
           | 
           | Not that I disagree with this project, but just to maybe help
           | see it from a little different perspective...
           | 
           | When people publish their code, I think they typically expect
           | it's going to be used like
           |
           |     import my_package
           |     do_something_cool()
           | 
           | So it is a little weird when things like this come along and
           | change that expectation.
           | 
           | It's kind of like, "I scanned millions of Facebook photos for
           | soda cans to see if people prefer Pepsi or Coke!" People
           | didn't post those photos to be part of a project, they just
           | wanted to share some pictures with their friends.
        
             | cmcaleer wrote:
             | It's not unusual to want to change certain behaviours of a
             | project, e.g. by subclassing something within it. It's also
             | worth at least having some idea of the code you're running
             | before you run it, particularly if you don't know the
             | developer, for many reasons; see e.g. [0].
             | 
             | I'm not really sold on the perspective that if you're a
             | sophisticated enough developer to know+upload+publish on
             | pypi that you wouldn't expect someone to read your code. In
             | many ways that's kind of the point. Not to say such people
             | don't exist, but they're probably a small minority.
             | 
             | [0]: https://cyble.com/blog/over-45-thousand-users-fell-
             | victim-to...
        
             | orf wrote:
             | Thanks for that, I can definitely appreciate this
             | perspective. I'd say it's more akin to uploading photos to
             | a shared public host like imgur rather than Facebook, but
             | regardless I can see how someone's expectations of who/what
             | would use it might be different than mine.
        
       | ehPReth wrote:
       | Very interesting! Just a heads up that your charts on dark mode
       | under Growth have yellow text on a white background which is
       | near-impossible to read for me on hover
        
         | orf wrote:
         | Thanks, fixed!
        
       ___________________________________________________________________
       (page generated 2023-09-02 23:00 UTC)