Recently I was talking to an academic friend about avoiding LLM scraping, and they submitted my question to Google's chatbot, a GPT-2 derivative. There were several interesting points in the response, the gist of which was that it is an arms race. Google suggested "honeypotting" with garbage content.

The most useful kind of garbage content I can think of is using the dark net. "Dark net" refers to SOCKS-proxying your traffic, which makes it cryptographically homogeneous and ambiguous. In practice it mostly means Tor onion routing (originally a US Navy project, now run by a charity) or the smaller, university-born i2p. For example, when I ssh to a tilde, I route through Tor; the tilde and I both know who each other are, and a sketch of that setup follows below. It's not useful for obfuscating ordinary crime, despite the widely promoted myth. Since the tilde has a clearnet address, one outproxy (the gateway back to the clearnet) knows only that some Tor user has sshed to the tilde. My ISP can only harvest that their customer is connected to onion routing, and nothing else. This way, a lot of the information a person leaks about their internet usage becomes unavailable to data merchants. I suppose that is only metadata for the purpose of constructing a large language model, though.

Aside: in general one would never need or want to touch the clearnet, whose connections are scraped. It's just hard for people to find out how not to use it, because capitalists don't profit from people being safe from them.

A eusocial advantage of the dark net is that its emphasis on safety makes self-hosting safer and easier, since every connection is made by you connecting outwards to participate. My experience had been that i2pd, the C++ i2p implementation, was confusing and unreliable, but it seems to be working well now: I just installed and configured it on OpenBSD with pkg_add i2pd and then followed /usr/local/share/doc/pkg-readmes/i2pd (the steps are sketched at the end of this post). So we can replace browsing tracking with opaque garbage as a first step in fighting capitalist scraping, and we can self-host to avoid leaking our own data, and our visitors' data, to proprietary hosting.
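For concreteness, here is a minimal sketch of the ssh-over-Tor routing I mean. It assumes a Tor daemon already running with its default SOCKS port 9050 and OpenBSD's netcat; tilde.example.org is a placeholder for your own tilde, not a real host.

    # ~/.ssh/config -- send this one host through the local Tor SOCKS proxy
    Host tilde.example.org
        # OpenBSD nc: -X 5 selects SOCKS version 5, -x gives the proxy address
        ProxyCommand /usr/bin/nc -X 5 -x 127.0.0.1:9050 %h %p

With that in place, a plain "ssh user@tilde.example.org" leaves my machine only as Tor traffic; running "torsocks ssh user@tilde.example.org" would do the same job without touching the config file.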
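And a sketch of the i2pd install, roughly the steps I followed. The rcctl lines assume the OpenBSD package ships the usual rc script (the pkg-readme covers this), and the ports mentioned are i2pd's documented defaults rather than anything I customized.

    # install i2pd and read OpenBSD's package notes
    doas pkg_add i2pd
    less /usr/local/share/doc/pkg-readmes/i2pd

    # enable and start the daemon
    doas rcctl enable i2pd
    doas rcctl start i2pd

    # by default i2pd offers an HTTP proxy on 127.0.0.1:4444 and a web
    # console on 127.0.0.1:7070; point a browser at the proxy to reach
    # i2p sites, or check and override these defaults in i2pd.conf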