[HN Gopher] Almost all searches on my independent search engine ...
       ___________________________________________________________________
        
       Almost all searches on my independent search engine are now from
       SEO spam bots
        
       Author : m-i-l
       Score  : 604 points
       Date   : 2022-05-16 10:08 UTC (12 hours ago)
        
 (HTM) web link (blog.searchmysite.net)
 (TXT) w3m dump (blog.searchmysite.net)
        
       | mywaifuismeta wrote:
       | That's really interesting... and sad. For what it's worth, I've
       | noticed comment bots dramatically increase over the last year
       | too. They have always been there, but looking at Reddit, YouTube,
       | etc, now there seem to be 10x more than there were a few years
       | earlier. Even on HN it has gotten worse.
        
         | cbozeman wrote:
         | Is there a browser plug-in or some other piece of software that
         | can filter, or highlight, which posts / comments are likely
         | made by bots?
        
         | BuyMyBitcoins wrote:
         | On two occasions I've read one of my comments here on HN copied
         | and posted on Reddit. The user profiles that copied my comments
         | _seemed_ like they were run by a real person, but the rest of
         | their posts might have all been scraped as well.
         | 
         | I only found out because I just so happened to be looking at
         | the comments on a related news story and quickly realized the
         | post sounded strangely familiar. I'm sure most of us here have
         | had our comments copied without our knowledge.
        
       | chairmanwow1 wrote:
       | I created a temporary email service that was being used by about
       | 10k users / week. Then several weeks ago, the number of users
       | started growing like crazy up to about 60k users a day. Then we
       | checked the recent email activity and 60k / 65k emails were from
       | a social networking site.
       | 
       | Seems our service was being used to create fake bot accounts. The
       | newly created accounts were obvious fakes. Rather than deal with
       | the issue, we just shut the service off.
        
       | wibyweb wrote:
       | In late April up to now, Wiby (a small mostly unheard of search
       | engine) began having the exact same issue. Tens of thousands of
       | the exact same type of "powered by..." requests coming from
       | thousands of IPs. They are using a tool called QHub.
        
         | m-i-l wrote:
         | Thanks for wiby.me. I have seen QHub coming up in the scraping
         | footprints, but my assumption has been that the footprint query
         | is looking for Question and Answers sites powered by QHub
         | containing their targeted terms, e.g. because there's a known
         | vulnerability with QHub that their scripts can exploit to auto-
         | post backlinks or whatever it is they do. There are lots of
         | other hosting tools, other than QHub, that come up in the
         | footprints as well. I found some lists of footprints by doing
         | an internet search for one of them: "Designed by Mitre Design
         | and SWOOP".
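The footprint queries described above could be flagged in a search log roughly as follows. This is only a sketch: the phrases "powered by" and "designed by" come from the thread, but the pattern list, the function names `looks_like_footprint` and `footprint_share`, and the idea of measuring a share per log sample are assumptions made for illustration, not what either search engine actually runs.

```python
import re
from typing import Sequence

# Hypothetical footprint phrases. "powered by" and "designed by" appear in
# the thread; a real list would be longer and maintained from observed logs.
FOOTPRINT_PATTERNS = [
    re.compile(r"\bpowered by\b", re.IGNORECASE),
    re.compile(r"\bdesigned by\b", re.IGNORECASE),
]


def looks_like_footprint(query: str) -> bool:
    """Return True if the query contains one of the footprint phrases."""
    return any(p.search(query) for p in FOOTPRINT_PATTERNS)


def footprint_share(queries: Sequence[str]) -> float:
    """Fraction of queries in a log sample that match a footprint phrase."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if looks_like_footprint(q))
    return hits / len(queries)
```

A phrase match like this will also catch the occasional genuine query, so it is better suited to measuring how much of the traffic is footprint scraping than to blocking on its own.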
        
           | wibyweb wrote:
           | Interesting, thanks for that extra info.
        
       | mfrye0 wrote:
       | I run a data aggregation company that has a fairly advanced
       | scraping infrastructure for collecting data across the web.
       | Having built the scraping side, I'm pretty familiar with most of
       | the strategies for avoiding bot detection.
       | 
       | Coming from that perspective, detecting and stopping at least the
       | majority of bots out there is fairly doable, and I put together a
       | rudimentary thing for a side project.
       | 
       | The core of it uses an IP API for looking up the requesting IP to
       | identify the country and if it's coming from a data center, VPN,
       | Tor, etc. If it passes that, I trigger Google Captcha to show up.
       | Lastly, I track IPs that make it through and have some basic
       | rules in place to try to detect patterns and block offenders that
       | way.
       | 
       | There's a bunch more stuff you can check for, but the core of it
       | is basically filtering out data center traffic to minimize the
       | requests going to Google Captcha.
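The three stages described above (IP lookup, then a captcha, then pattern rules for IPs that get through) might be sketched like this. Everything specific here is an assumption: the data center ranges are RFC 5737 documentation blocks standing in for an IP intelligence API, the `captcha_passed` flag stands in for a real captcha check, and the threshold and window values are invented.

```python
import ipaddress
import time
from collections import defaultdict, deque

# Hypothetical data center ranges (documentation blocks used as stand-ins).
# A real deployment would query an IP intelligence API instead.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

MAX_REQUESTS = 5        # assumed limit per window for verified visitors
WINDOW_SECONDS = 60.0   # assumed sliding-window length

_recent = defaultdict(deque)  # ip -> timestamps of recent verified requests


def is_datacenter(ip: str) -> bool:
    """Cheap first-stage check against the known data center ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_NETWORKS)


def decide(ip: str, captcha_passed: bool, now=None) -> str:
    """Return "block", "captcha", or "allow" for a single request."""
    now = time.monotonic() if now is None else now

    # Stage 1: drop data center traffic before it costs a captcha call.
    if is_datacenter(ip):
        return "block"

    # Stage 2: anything not yet verified is sent to the captcha.
    if not captcha_passed:
        return "captcha"

    # Stage 3: track verified IPs and block those exceeding the rate rule.
    window = _recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_REQUESTS:
        return "block"
    return "allow"
```

Putting the IP check first mirrors the point made in the comment: it keeps most bot traffic from ever reaching the captcha provider.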
        
       | buzzwords wrote:
       | I have had very interesting conversations with people who are
       | "casual" users of the internet. They are still finding the results
       | of the likes of Google, Bing and DuckDuckGo perfectly suitable.
       | Maybe it's most of us here who have different needs to what's
       | available.
        
         | bachmeier wrote:
         | I suppose it depends what they're looking for. If you're a
         | homeowner looking for a service of some kind...good luck. There
         | are domains that aren't too bad, like programming, but you
         | should go into a search with low expectations. Anyone that
         | remembers the early days of Google will find today's search
         | engines to be useless in comparison.
        
       | not2b wrote:
       | The conclusion isn't that there's nobody out there, but that the
       | billion-odd people who use search engines every day have no idea
       | what searchmysite.net is. They use Google, often without even
       | knowing it because they just type some words into their browser
       | and take what they get.
        
       | Auguste wrote:
       | I'm disappointed that Search My Site isn't seeing many legitimate
       | viewers.
       | 
       | Just wanted you to know that I'm a fan. I love reading people's
       | personal websites, and Search My Site has been great for
       | discoverability. I visit the Newest Pages and Browse Sites pages
       | once or twice a week to check out the new sites being indexed.
       | 
       | I don't know what the answer is to the spam bots, but you do have
       | some real visitors out there. :)
        
       | closedloop129 wrote:
       | >I noted that there had been multiple weeks where not one single
       | real person had visited a single blog entry for the whole week
       | 
       | The site is not on https://searchengine.party/ nor on
       | seirdy.one's overview. Apart from the blog, how could users find
       | that engine?
       | 
       | Is there some place where new search engines are announced and
       | where new search engines band together to make themselves heard?
        
         | m-i-l wrote:
         | Actually seirdy.one added searchmysite.net to his excellent
         | list[0] way back in March 2021[1].
         | 
         | [0] https://seirdy.one/2021/03/10/search-engines-with-own-
         | indexe...
         | 
         | [1]
         | https://git.sr.ht/~seirdy/seirdy.one/commit/ab92d8ded69fd869...
        
       | arunsivadasan wrote:
       | Thank you for building something like this!
        
       | xwdv wrote:
       | Will we ever see the return of hand curated directories of
       | websites like the old days, categorized by topics and approved by
       | human review?
        
         | closedloop129 wrote:
         | Coincidentally, such a site was submitted yesterday:
         | https://news.ycombinator.com/item?id=31387592
        
         | saalweachter wrote:
         | Wikipedia, maybe?
         | 
         | The greater problem of curation is that it doesn't scale, and
         | you need immense human effort to survey and curate both the
         | breadth of questions -- what's a good table saw? what aspects
         | of Egyptian culture were exported back to Greece? is HDPE
         | plastic safe? give me some punk music. -- and also the breadth
         | of answers, both every website and every type of table saw.
         | 
         | The lesser is that you cannot curate without introducing a
         | _voice_, a set of preferences that may not be universal.
         | Tastes are not universal, you can't recommend the same band for
         | everyone. Resources are not universal, regardless of whether
         | the $10000 table saw is more than 100x better than the $100
         | table saw, it's just out of reach of most people. And needs
         | aren't universal -- a professional cabinet maker and a DIYer
         | making a chicken coop don't need the same saw.
         | 
         | There's a set of priors behind every query, and you either need
         | to get users to frame their queries in a way that captures all
         | of the relevant priors, or you need to create a variety of
         | voices that capture different sets of priors and curate answers
         | appropriate to that voice. Are you asking Norm Abram, Monica
         | Mangin, or Shane Wighton for a recommendation on a table saw?
        
           | xwdv wrote:
           | Perhaps there can be a difference between search engines for
           | answering specific questions, and directories where one may
           | browse a broad range of topics without any goal in
           | particular.
        
       | westcort wrote:
       | My key takeaways:
       | 
       | 1. Almost all searches on my independent search engine are now
       | from SEO spam bots
       | 
       | 2. In summary, if they break through the current reverse proxy
       | level protection, options include an invisible ReCAPTCHA (but
       | given I've sometimes 160,000 requests a day I'd be well over the
       | 1,000,000 a month free tier limit), requiring JavaScript as per
       | the web analytics or some Cross Site Request Forgery style
       | protection (but those would place much more load on the servers),
       | or CloudFlare (but the searchmysite.net spider is still currently
       | blocked by CloudFlare as per Some of the challenges of building
       | an internet search)
       | 
       | 3. If you were into conspiracy theories you could claim that the
       | major search engines were trying to stifle the competition, but a
       | more realistic explanation is simply that searchmysite.net is
       | being drowned out by SEO spam
       | 
       | 4. If I'd had a decent amount of real users visiting and never
       | returning I could reasonably conclude that updating the blog
       | wasn't the most productive use of my time and effort, but without
       | any real users in the first place it is hard to gauge whether
       | people like it or not
       | 
       | My own independent search engine, https://www.locserendipity.com,
       | is seeing similar trends.
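The free tier arithmetic in point 2 can be checked quickly. The two figures are the ones quoted from the blog post; the 30-day month is an assumption, and because 160,000 is a "sometimes" figure the result is an upper bound rather than a typical month.

```python
PEAK_REQUESTS_PER_DAY = 160_000   # quoted figure ("sometimes 160,000 a day")
FREE_TIER_PER_MONTH = 1_000_000   # quoted free tier limit
DAYS_PER_MONTH = 30               # assumption: a 30-day month

# Monthly volume if every day ran at the quoted peak.
monthly_at_peak = PEAK_REQUESTS_PER_DAY * DAYS_PER_MONTH

# How many times over the free tier that would be.
overage_factor = monthly_at_peak / FREE_TIER_PER_MONTH

# Daily volume that would just fit inside the free tier.
break_even_per_day = FREE_TIER_PER_MONTH / DAYS_PER_MONTH

print(monthly_at_peak)           # 4800000
print(overage_factor)            # 4.8
print(int(break_even_per_day))   # 33333
```

So even a sustained rate of about a fifth of the quoted peak would already exhaust the free tier.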
        
       | superasn wrote:
       | To me it looks like some popular spamming software (like
       | thebestspinner, etc) just integrated you, and everyone who is
       | using the software is now hitting your site.
       | 
       | The good news in this case is that it'll be easy to spot the
       | pattern and block it, the bad news is you're entering a never-
       | ending cat and mouse game.
        
       | larsrc wrote:
       | Google puts a _lot_ of effort into avoiding SEO spam, but it's a
       | red queen problem.
        
       | melenaboija wrote:
       | I know almost nothing about search engines, but after seeing all
       | the comments and projects popping up lately that criticize
       | Google's results, I have a question: is it realistic to think
       | that something similar to Google could exist?
       | 
       | It seems to me that there are all sorts of tools out there to do
       | so, such as the public NLP implementations, vector search
       | engines, ... so I wonder whether the obstacle is that not
       | everything needed is truly available, that it takes too many
       | resources to get something working, or that such products
       | already exist and just aren't getting traction (and I am not
       | talking about the other big search engines).
        
       | phkahler wrote:
       | >> This time I'm really not sure what the solution is.
       | 
       | As with everything internet, the solution is to have solid,
       | verifiable user identification. I realize the downside is that
       | sites would love to have all your activity logged under a
       | verifiable identity, so the other problem is we need to ban
       | collection of such personal data.
        
       | jrochkind1 wrote:
       | I'm not sure I understand the theory of what motivates the
       | automated "powered by" searches; can anyone explain it (or an
       | alternate theory) further?
        
       | Teandw wrote:
       | This guy throws multiple reasons/conspiracies out there on why
       | the website is really struggling to gain literally any sort of
       | traction. Web is all bots, search engines not promoting
       | competitors and being drowned out by SEO spam, yet he's failing
       | to see the most obvious reason... the reason nearly all websites
       | don't gain traction...
       | 
       | Because it's a bad website. It provides no value to the user. I
       | put in a few search terms and had no relevant search results
       | back. What use is a search engine that can't find what I'm
       | searching for?
       | 
       | Maybe if that was improved he may see traction.
        
         | CWuestefeld wrote:
         | Whether or not his site is meeting his goals is his business.
         | 
         | I find this a really interesting post, because I'm also dealing
         | with excessive bot traffic (it's generally about half of my
         | overall), and specifically how to salvage analytics data when
         | there's so much noise. Seeing what other people are doing to
         | combat it helps me, regardless of whether you might think of
         | them as successful or not.
        
         | lukev wrote:
         | I second this. Don't get me wrong, I applaud the concept and
         | the effort, but this implementation isn't quite there.
         | 
         | I searched for "document management system comparison" since I
         | am currently in the process of selecting one for our legal team
         | at work. Some on-the-ground reports from real users would be
         | hugely valuable. But this is the classic example of where
         | Google utterly fails; document management is a 100 billion
         | industry and there are absolutely no search results which are
         | not SEO, marketing copy, or astroturfed listicles with nearly
         | zero value.
         | 
         | Unfortunately, this website returned even less relevant
         | results. Not a single result pertained to document management
         | at all; instead it returned random matches on words like
         | "system" and "management."
         | 
         | Whoever solves this problem could definitely unseat Google as
         | the go-to search engine for most people. So it's a big prize.
         | But it's also a super hard socio-technical problem, requiring
         | incredibly sophisticated and powerful tech in a highly
         | adversarial environment. However, regrettably, it looks like
         | this attempt hasn't even got the basic search tech down.
        
           | marginalia_nu wrote:
           | Is a comparison of document management systems something you
           | expect to actually find, as something written by humans? I
           | wouldn't write such an article, I don't know who would.
           | 
           | The only people who seem to be writing these types of
           | comparison articles are spammers.
           | 
           | I typed this reply without checking, but I checked now, and
           | yeah -- if you google "document management system
           | comparison", you get ads for document management systems, and
           | search engine spam. That's hardly helpful.
        
             | oneeyedpigeon wrote:
             | 2nd result I got from that exact search is an article from
             | techradar:
             | 
             | https://www.techradar.com/uk/best/best-document-
             | management-s...
             | 
             | Do you consider that search engine spam?
        
               | marginalia_nu wrote:
               | Yeah, that's affiliate marketing dressed up as a review.
               | They're getting a kickback for several of the links in
               | the review.
               | 
               | The deal on DocuWare is perhaps the most obvious, but the
               | Abbyy-link also runs through an affiliate marketing
               | redirect service.
        
           | freediver wrote:
           | Typed this search into Kagi and got:
           | 
           | - This result from an old site
           | https://www.scanstore.com/Scanning_Software/Document_Managem...
           | not sure if still relevant
           | 
           | - A bunch of discussions from reddit and other forums
           | (probably best lead)
           | 
           | - One research paper https://arxiv.org/pdf/1403.3131.pdf
           | 
           | - Listicles grouped together so you can skip them
           | 
           | - The noncommercial filter gave a few more good results, but
           | it seems like there is not much 'good' content written on
           | this topic
           | 
           | I would definitely not call all Kagi results fantastic, but
           | it does seem to be better than Google. We are trying hard to
           | solve the problem of the nonsense on the web (Kagi founder
           | here).
        
             | alx__ wrote:
             | Thanks for building Kagi! Have been enjoying the experience
             | of it this past month
        
             | kldx wrote:
             | Got any beta slots to share?
        
         | status200 wrote:
         | I searched "best dress shoes reddit" as a test, and just got a
         | random list of websites that had the word "shoes" on the page
         | somewhere, including a Dinosaur Comic from 2008.
         | 
         | So... yeah. Won't exactly be my first choice of search engine
         | in the future.
        
           | matt_heimer wrote:
           | Looking at the blog
           | (https://blog.searchmysite.net/posts/milestone-1000th-site-
           | in...) I think very little of the internet is in this search
           | engine.
           | 
           | It's difficult to gauge the quality of the engine itself at
           | this point with so little content in it.
           | 
           | What I can say is that even remotely presenting the system as
           | a general purpose internet search engine like the UI from
           | https://searchmysite.net/ does is going to give people the
           | wrong idea and make them think the system is bad. To start
           | with I'd suggest adding the number of sites indexed to the
           | main search page.
           | 
           | I also think that the https://searchmysite.net/ portal will
           | likely never be a destination. I'd suggest trying to promote
           | it differently: offer a service for OG internet sites;
           | they opt in to the service because they want a search
           | widget they can embed on their site that has a filter to search
           | just that site or all OG sites. Having website categories
           | would also help so people could search across tech blogs, or
           | aquarium, or bowling sites, etc. Basically the old web ring
           | idea but powered by search instead of just browsing a list.
           | 
           | Since there is a chicken and egg scenario - What you really
           | need are people that think Google sucks that are invested in
           | a niche and want to build a search ring out. The "only sites
           | submitted by verified site owners" restriction needs to go,
           | you want good curation but this is just too restrictive. I
           | also think "downranks results containing adverts" is too
           | restrictive, switch that to "downranks results containing
           | excessive adverts and SEO spam".
        
           | _tom_ wrote:
           | It doesn't index sites like Reddit, so, not too surprising
           | Reddit wasn't in the result.
        
         | honkdaddy wrote:
         | Searching for Astral Codex Ten, a popular, well-written, non-
         | spammy blog which I would expect is indexed...
         | 
         | Returns only results in which _other_ bloggers are referencing
         | ACX. Consider me as one of the datapoints that arrived from HN
         | and likely won't be back, I'm afraid.
        
           | m-i-l wrote:
           | Thanks for your feedback. The idea was for people to submit
           | sites they like, and search sites other people have liked.
           | I've submitted Astral Codex Ten, and that site is now indexed
           | for the benefit of others.
        
           | wccrawford wrote:
           | I just searched Kagi, Google, and DDG for "Astral Codex Ten"
           | and it was the first result on each.
        
             | weird-eye-issue wrote:
             | Ironically the Kagi search engine is not in the first few
             | results in Google when you search Kagi (at least in
             | Thailand)
             | 
             | And when I did make it to the site, it looks like I have to
             | sign up to use it? I'm not sure putting a locked gate in
             | front of a search engine in 2022 makes sense but okay
        
               | norman784 wrote:
               | The whole concept of Kagi is to be a paid service (it is
               | still in beta and for now it's free AFAIK), so you pay
               | money instead of having ads or the search engine selling
               | your data. Use the service that best suits your
               | purposes and philosophy.
        
               | ipaddr wrote:
               | The concept in 2022 sounds doomed to fail on many fronts.
               | A service that claims to offer privacy but requires
               | identifying payment information. A required email signup
               | so followup sales emails can happen when the service is
               | ready.
               | 
               | Ddg was popular on here until they censored certain
               | websites. Does this search service censor?
               | 
               | Sounds like they are trying to tackle privacy but in
               | reality users of this service will have less privacy.
        
         | m-i-l wrote:
         | Hi, "this guy" here:-) If people come to a site but don't come
         | back then it is reasonable to conclude that "it's a bad
         | website", but as the blog entry put it "without any real users
         | in the first place it is hard to gauge whether people like it
         | or not".
         | 
         | Note also that it isn't intended to be a general purpose search
         | engine, but a niche search engine to try and find some of the
         | fun and interesting content, e.g. relating to hobbies and
         | interests, which used to be at the core of the web but which
         | can be difficult to find anywhere nowadays.
        
           | soheil wrote:
           | How exactly is a "general purpose search engine" different
           | than a "search engine to try and find some of the fun and
           | interesting content"?
        
             | m-i-l wrote:
             | The general purpose search engines search the whole
             | internet, and as a result claim that you can search for
             | anything on the whole internet, even going beyond that to
             | answer questions which aren't on the internet as such, e.g.
             | "What is my IP?" and "What time is it?". However, niche
             | search engines only search specific parts of the internet,
             | and only claim to be able to deliver results relating to
             | their specific topic, e.g. you wouldn't ask the search on a
             | car forum what the weather is today.
        
               | soheil wrote:
               | Ok, but answering questions like "what time is it?"
               | doesn't subtract from the usefulness of a search engine.
               | Seems like you're saying it makes your search engine
               | better somehow because it can't do the above.
        
               | dumbfounder wrote:
               | I am a search guy and I would like you to succeed. But I
               | don't get it. The name of the site is bland and makes me
               | think you are a white label search service for websites.
               | On the homepage it says "Open source search engine and
               | search as a service for personal and independent
               | websites." but it offers me no reason why I (or
               | anyone) would want to use it. The content it actually
               | searches is random and of no real particular value as far
               | as I can tell. Also, you are trying to avoid spam sites,
               | but once you reach a certain size, all you would
               | see is people submitting spam sites. If you blocked
               | people from submitting you would never get all the
               | diamonds in the rough you are trying to expose.
               | 
               | You need to find an actual niche that solves a real
               | problem people have and can understand and orient
               | everything you do to tackling that. Then expand from
               | there.
        
               | haswell wrote:
               | > _general purpose search engines search the whole
               | internet, and as a result claim that you can search for
               | anything on the whole internet, even going beyond that to
               | answer questions which aren't on the internet as such,
               | e.g. "What is my IP?"_
               | 
               | I think there are two distinct things here:
               | 
               | 1) Searching the whole internet
               | 
               | 2) Returning results that aren't necessarily from the
               | Internet, but instead are convenience features of the
               | engine
               | 
               | I understand that you're not trying to replicate things
               | like "What's the weather today", but when I want results
               | about <very specific classic car X>, how can you return
               | meaningful results without searching the whole Internet?
               | 
               | Put another way, if you don't search the whole Internet,
               | the results are going to be limited to only the curated
               | list of sources you do search. This can be useful in its
               | own way - i.e. if you are positioning this as "search
               | this list of curated sources", but also means the site
               | will only be as useful as the curation you provide.
               | 
               | For example, I dabble with Software Defined Radio. If I
               | search your site for "rtlsdr", a very popular package, I
               | get three results. Those results are somewhat
               | interesting, but I know there's a whole world of content
               | out there related to rtlsdr that I'm not seeing here.
               | 
               | So adding a bit to what the parent commenter was saying -
               | if I'm using your site to look for my particular niche,
               | and I only see three results when I know there are many
               | more, I'm not likely to continue using your site to
               | search for rtlsdr.
               | 
               | It then leads me to wonder what I _can_ search for, or if
               | there's much utility to searching at all.
               | 
               | Please take these comments in the spirit they are
               | intended - I think a search engine that helps find things
               | on the "old" web, or just helps me cut through all of the
               | SEO optimized crap is a great idea. It's something I want
               | to use. But I can also understand why someone might try a
               | search and move on.
               | 
               | Just an idea, but maybe providing a way for independent
               | creators to submit their site for indexing (or for an
               | interested user like me to submit a site) would help
               | increase your reach.
        
             | _tom_ wrote:
             | Google is demonstrating this nicely now. It's become almost
             | useless, replacing the query I actually typed with
             | something more popular. And when that doesn't happen, the
             | results are likely seo'd junk. (The latter is not purely
             | Google's fault; it's just that smaller search engines aren't
             | targeted as much).
             | 
             | Try looking up a phone number (by number) in google for a
             | great example of nothing but spam results.
        
         | native_samples wrote:
         | Well, it's worse than that. The whole schtick is that it's only
         | pure, real content by folksy people like us. The top reason to
         | use it on the about page is:
         | 
         |  _Indexes only user-submitted sites with a moderation layer on
         | top, for a community-based approach to content curation, rather
         | than indexing the entire internet with all of its spam, "search
         | engine optimisation" and "click-bait" content._
         | 
         | So I tried searching [kotlin] and got 123 results ...
         | 
         | https://searchmysite.net/search/?q=kotlin
         | 
         | ... of which the 9th result is SEO spam! It reads:
         | 
         | PersonalSit.es | Yes we got hot and fresh sites
         | https://personalsit.es/ ...
         | Shandilyahttps://msfjarvis.devTagsandroid, kotlin, rust Go to
         | feed Go to siteradoslawkoziel.plradoslawkoziel.pl ...
         | 
         | That looks like junk to me. How is that possible if what the
         | developer says is true, that it's all verified and pre-
         | moderated?
        
           | m-i-l wrote:
           | Thanks for your feedback. It is just the home page which is
           | moderated before indexing (and reviewed annually). When
           | https://personalsit.es/ was listed it looked legitimate, but
           | agreed the results for that site look infected with spam now.
           | I've found at least one other site today where the home page
           | and blog look genuinely legitimate, but which has a complete
           | spam subdomain, quite possibly the victim of a subdomain
           | takeover attack by spammers. I've delisted both.
           | Unfortunately it isn't an easy task trying to defeat a vast
           | army of well funded spammers in your spare time!
        
             | stevenicr wrote:
             | As someone that has a few sites that can get user generated
             | content - I must say that it saddens me that spam stuffing
             | would get the main domain and site delisted - and likely
             | never re-listed.
             | 
             | A couple times a year I get hit with a bunch of spam blogs
             | / user profiles and when I discover and clean them up, I
             | assume that at least google/bing see that the spam-to-real
             | ratio has been fixed and rank it higher again.. but I'm not
             | sure really, especially since google took keywords out of
             | click traffic.
             | 
             | What would be nice is something like the 'site has been
             | hacked page' that I've unfortunately seen a few times for
             | sites - that lets you clean it up and submit a re-check
             | it's clean now button thing.
             | 
             | I've also suggested that google make it so you have to
             | vouch for links which would expose people using the spam
             | stuffing techniques.. kind of the opposite of the disavow
             | tool - but they never read any of my disavow submissions.
             | 
             | Sucks to get spammed, fight spam, and then be penalized for
             | it in more ways than one.
             | 
             | One of my older buddypress/wpmu sites I recently turned off
             | blog creation for users because it's just so tiring
             | fighting the spammers - which are only doing what they do
             | because google - meh.
        
             | salawat wrote:
             | Your problem is that SEOs are under no obligation to be
             | truthful with you, and will likely pull bait and switches
             | as far as making accounts if it ever seems like your site
             | will catch on.
             | 
             | Note, I nearly spit my food the first time I was at lunch
             | and someone was talking about SEO a few tables away...oh a
             | decade or so ago now. It's sad it's gotten this bad.
        
         | pwiercinski wrote:
         | I guess the use-case just isn't that popular. It's a good
         | website if you want to learn what some devs are up to, but
         | barely anyone cares about that. Most people use search engines
         | to find answers to their questions and Search My Site just
         | doesn't work like that.
        
         | fortran77 wrote:
         | I found a few pro-terrorism sites here. I don't think it's the
         | OP's purpose, but he's being duped by the few users that do look
         | for sites like this where they can add a "curated link" to
         | their ISIS or Hezbollah or Hamas site with a slick facade.
        
           | m-i-l wrote:
           | Thanks for your feedback. If you can drop me a note I'll
           | remove those sites - it is against the Terms of Use at
           | https://searchmysite.net/pages/terms/ (not that spammers,
           | terrorists, etc. care about complying with a Terms of Use). I
           | think legitimate looking home pages as a front to other non-
           | legitimate content is a genuine problem this model doesn't
           | solve (also noting that some of those home pages may even be
           | genuinely legitimate but have been hacked e.g. via a
           | subdomain takeover).
        
         | [deleted]
        
         | Jleagle wrote:
         | I'm getting lots of `No results found for query = xxx.`
        
           | rightbyte wrote:
           | That sounds like a feature actually, being honest about no
           | hits.
        
       | XCSme wrote:
       | If the internet is dead, is there anything left that's "alive"?
       | The mobile app stores are also filled with crap[0] and it seems
       | that the ratio of spam content vs real content is getting close
       | to infinity.
       | 
       | [0]: https://youtu.be/E8Lhqri8tZk - 1,500 Slot Machines Walk into
       | a Bar: Adventures in Quantity Over Quality
        
       | john-radio wrote:
       | Since everyone in this thread wants to jump down OP's throat
       | about the quality of his web site, another interesting search
       | engine is millionshort.com, which allows you to filter out the
       | top N web sites from the results of your search. It's a great
       | tool for looking past sites with good SEO; all you have to do is
       | fiddle with the value of N.
       | 
       | For example, searching for "electronic music box" as /u/ajnin
       | suggested, with the top 100K web sites removed from the results,
       | filters out the following:
       | 
       | > These 23 sites were removed from your results:
       | 
       | > alibaba.com (1 result removed)
       | 
       | > aliexpress.com (1 result removed)
       | 
       | > allaboutcircuits.com (1 result removed)
       | 
       | > amazon.com (2 result removed)
       | 
       | > apple.com (1 result removed)
       | 
       | > bestreviews.com (1 result removed)
       | 
       | > ebay.com (1 result removed)
       | 
       | > etsy.com (2 result removed)
       | 
       | > facebook.com (1 result removed)
       | 
       | > instructables.com (2 result removed)
       | 
       | > lightinthebox.com (2 result removed)
       | 
       | > lumberjocks.com (1 result removed)
       | 
       | > mapquest.com (1 result removed)
       | 
       | > reverb.com (1 result removed)
       | 
       | > twitter.com (1 result removed)
       | 
       | > wikipedia.org (1 result removed)
       | 
       | > yelp.com (1 result removed)
       | 
       | > youtube.com (2 result removed)
       | 
       | And the top result ends up being https://midiguy.com/.
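
The "remove the top N sites" idea described above can be sketched as follows. This is only an illustration of the filtering step, not Million Short's actual implementation: the `SITE_RANK` table, its rank values, and the function names are hypothetical stand-ins for a real popularity ranking.

```python
from urllib.parse import urlparse

# Hypothetical popularity ranks (1 = most popular). A real system would
# load these from a large site-ranking list.
SITE_RANK = {
    "amazon.com": 12,
    "wikipedia.org": 7,
    "youtube.com": 2,
    "midiguy.com": 850_000,
}


def site_of(url):
    """Crude site extraction: hostname with a leading 'www.' stripped."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def drop_top_sites(urls, top_n):
    """Keep only results whose site is NOT among the top_n most popular.

    Sites missing from the rank table are treated as unranked and kept.
    """
    kept = []
    for url in urls:
        rank = SITE_RANK.get(site_of(url))
        if rank is None or rank > top_n:
            kept.append(url)
    return kept


results = [
    "https://www.amazon.com/music-box",
    "https://www.youtube.com/watch?v=abc",
    "https://midiguy.com/",
]
print(drop_top_sites(results, top_n=100_000))
```

Raising or lowering `top_n` is the "fiddle with the value of N" step: a small value removes only the giants, a large one digs further into the long tail.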
        
         | mdoms wrote:
         | Million Short also has an option to remove only e-commerce
         | results which is invaluable if you still want results from
         | sites like Twitter, Wikipedia and YouTube but don't want online
         | shopping spam.
        
           | consp wrote:
           | Would this also work for the fake-sites-stealing-text-to-
           | look-legit sites since they quickly end up in the top
           | results?
        
         | blisterpeanuts wrote:
         | That's an outstanding concept. One problem though: wouldn't it
         | also filter out high quality curated results?
        
       | trinovantes wrote:
       | If this was the spam for a search engine (almost) nobody uses, it
       | makes you wonder how much abuse the major search engines face.
        
         | Nextgrid wrote:
         | My understanding is that this wasn't about gaming this
         | particular search engine itself, and more about the spammers
         | using the search engine for its intended purpose of finding
         | spam-free content so they can then use this content as copy for
         | their spam posts.
        
         | sonicggg wrote:
         | I'd assume they have more control though. I noticed whenever I
         | use Google after connecting to NordVPN, it requires a captcha
         | the first time.
        
         | mensetmanusman wrote:
         | They face a lot. I always browse with incognito on safari, and
         | I quite often have to do captchas on google and bing etc. to
         | prove I'm not a computer...
         | 
         | If there is money involved and value in being able to trick
         | search engines, I'm not surprised it's a thriving business of
         | grift.
        
           | hihihihi1234 wrote:
           | Why do you use Bing?
        
             | the_third_wave wrote:
             | Why don't you like diversity, in this case diversity of
             | search engines? Bing may have its problems but so does
             | Google, the way to handle this is to either use many
             | different engines or to use a meta-search engine like
             | Searx. The latter is far easier so it is what I do. Just
             | relying on a single source makes you an easy target for
             | those who control that source.
        
             | maven29 wrote:
             | You should try Bing again. Bing doesn't mess with your
             | query terms as much as google does. If you aren't a zoomer
             | typing out whole sentences into the search bar, the fact
             | that Bing doesn't substitute your jargon for more general
             | terms will help with spending less time in the search
             | results.
             | 
             | I just got tired of iterative refining not working as it
             | used to in the past. I once got results for databases when
             | searching for decibels (despite spelling it out in full),
             | so it isn't just a matter of semantically related terms.
             | 
             | The rewriting is just braindead and the ranking algorithm
             | falls for generated content way too easily. Google
             | shouldn't be trying to teach me DHCP when I am clearly
             | trying to recall a config item, but then it gets worse when
             | you read the infobox and realize that it's written at a
             | toddler level of comprehension.
             | 
             | This is with the caveat that all search engines rely on
             | some level of personalization, so you might be able to get
             | good results on google if they deem you worthy.
        
         | ricardo81 wrote:
         | Indeed. There are various SEO "rank tracking" services that
         | scrape millions of SERPs a month.
        
       | thelittleone wrote:
       | Complete SEO noob here. Can someone help explain what these bots
       | are trying to achieve? There is mention in the blog that they're
       | trying to uncover ad free content.
        
       | DethNinja wrote:
       | Only solution is a webring based federated search engine.
       | 
       | 1. You just put /webring.txt on your website. It shows links to
       | other websites with a hard limit of 100 websites.
       | 
       | 2. To combat spam and bots, the search engine accepts a blocklist
       | as an input, so other people can curate the content.
       | 
       | 3. People can personally rank the websites they like, so webring
       | of the said website gets ranked higher for that specific user.
       | This can be a community effort too.
       | 
       | 4. Search engine itself should be under a commercial license so
       | that other people can keep building it and add ads if they want
       | to commercialise it.
       | 
       | I'm too busy to spend time with this but perhaps one day I can
       | start coding it.
       | 
       | I'm convinced that search engine model of early internet is just
       | dead, webrings are the way forward.
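
Points 1 to 3 of the proposal above could be prototyped along these lines. This is a rough sketch under stated assumptions: no `webring.txt` format is specified in the comment, so the one-URL-per-line layout with `#` comments is invented here, as are all function names.

```python
from urllib.parse import urlparse

MAX_LINKS = 100  # the hard limit from the proposal


def parse_webring(text, limit=MAX_LINKS):
    """Parse a hypothetical webring.txt: one URL per line, '#' for comments.

    Entries past the limit are ignored, so listing more gains nothing.
    """
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if urlparse(line).scheme in ("http", "https"):
            links.append(line)
        if len(links) == limit:
            break
    return links


def apply_blocklist(links, blocked_hosts):
    """Drop links whose host appears on a community-curated blocklist."""
    return [u for u in links if urlparse(u).hostname not in blocked_hosts]


def rank_for_user(links, boosted_hosts):
    """Move boosted hosts to the front, keeping the original order otherwise.

    boosted_hosts would be the hosts found in the webrings of sites this
    particular user has ranked highly.
    """
    return sorted(links, key=lambda u: urlparse(u).hostname not in boosted_hosts)


sample = """\
# my webring
https://alice.example/
https://spam.example/
https://bob.example/
"""
ring = apply_blocklist(parse_webring(sample), {"spam.example"})
print(rank_for_user(ring, {"bob.example"}))
```

Nothing in this sketch addresses sites that all list each other; that would need a separate trust signal on top of the blocklist.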
        
         | mcv wrote:
         | Nice idea, but of course if it gets even slightly popular,
         | every SEO content farm will immediately generate 10,000 sites
         | that all list each other in their webring.txt.
        
       | TheRealDunkirk wrote:
       | If there's a game to play, people will write software to play it
       | for their profit.
       | 
       | I guess it's back to web rings.
        
       | robmay wrote:
       | While I'm generally a blockchain skeptic, this is actually a good
       | use for a blockchain - to "register" bots so they have an id, and
       | an owner, and you can measure their behavior. There are going to
       | be more bots interacting with more sites, so, this could work.
        
       | PaulHoule wrote:
       | Spammers badly need spam-free content so they can mix some
       | legitimate links with the junk they spew.
       | 
       | One great Black Hat SEO trick is to find where your competitors
       | are getting clean links and insert your own links there so they
       | do your spamming for you.
        
         | closedloop129 wrote:
         | Why does the mixing work? Shouldn't Google and Bing know what
         | the original content is and automatically identify the sites
         | that are copies?
        
           | PaulHoule wrote:
           | Here's an example.
           | 
           | If I have (say) 15 affiliate marketing sites, I might make a
           | link aggregator site that looks a bit like Hacker News.
           | Except I won't make just one, I might make 30 of them.
           | 
           | These might subscribe to a bunch of RSS feeds and randomly
           | select articles, maybe 10% of the links on those sites go to
           | my affiliate sites.
           | 
           | If you can inject spam into those RSS feeds, the system I
           | describe would amplify it, with effects ranging from you
           | using my marketing machine to promote your content, to my
           | sites getting really obnoxious and getting blocked.
           | 
           | ----
           | 
           | "Duplicate Detection" is a necessary technology for web
           | search because sheepeople copy themselves and other people
           | without bound. It cuts both ways because Google and Bing have
           | no sure way to know which one is the copy and which is the
           | original. So (1) they aren't completely efficient at removing
           | duplicates and (2) duplicate detection can be turned into a
           | weapon against you, just like that link aggregator.
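
One common textbook approach to near-duplicate detection is word shingling with Jaccard similarity. The sketch below shows that technique only; it is not a claim about what Google or Bing actually run, and the shingle size `k=3` is an arbitrary choice for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox jumps over the lazy cat"
unrelated = "search engines struggle with scraped content farms"

sim_copy = jaccard(shingles(original), shingles(copy))
sim_other = jaccard(shingles(original), shingles(unrelated))
print(sim_copy, sim_other)
```

The score is symmetric: it says two pages are near-copies but carries no information about which came first, which is the weakness the comment points at.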
        
       | randomstring wrote:
       | Search traffic has always been mostly automated spam bots.
       | 
       | Even back in the Open Directory Days when we powered part of
       | search.netscape.com I estimated 80+% of all search traffic was
       | automated. At least most of it self-identified with the same Java
       | useragent.
       | 
       | Later when working at Topix, despite being a news search engine,
       | most traffic was bot traffic. Most included the word "mortgage"
       | in the query. Topix specialized in localized content, and that
       | was very popular for SEO scrapers.
       | 
       | Lastly at Blekko, I estimate 90+% of traffic was automated. By
       | then maybe half or more learned to change the user agent. Most
       | used HTTP/1.0, a dead giveaway as no browser still uses 1.0. This
       | was a major aspect in Blekko's load shedding strategy. If the
       | servers started to get overloaded, we'd start bouncing suspected
       | bot traffic to a redirect that would show in the logs. If there
       | was a human with a modern browser running javascript on the other
       | end, they would get redirected to a link that wouldn't get bounced. I
       | would check the logs weekly to see if any humans got caught. None
       | ever did. This was a huge monetary savings, you only need 1/10th
       | the servers if you can safely ignore the bots.
       | 
       | Often it's endless repetition of the same keywords in a random
       | order with a place name appended, or prepended, or inserted, over
       | and over. Often variations on known monetizable SEO keywords.
       | However, much of it doesn't make any sense.
       | 
       | I don't have any insight into Google's numbers but I would
       | conservatively estimate 95% or more of all their queries are
       | automated bots and not humans. And the level of spy-vs-spy going
       | on for Google CPU resources vs SEO bots is probably pretty
       | evolved by now. I stopped tracking many years ago when Google
       | switched to densely packed obfuscated javascript for page
       | renders. Maybe this is part of why automated queries are so high
       | across the web, maybe google is too hard to crack for most.
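
The load-shedding approach described in the comment above (bounce suspected bots only when the servers are overloaded, and always let through anything that has run the JavaScript redirect) might look roughly like this. It is a minimal sketch, not Blekko's actual code: the `Request` type, the `passed_js_check` flag, and the `Java/` user-agent prefix test are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    http_version: str              # e.g. "HTTP/1.0" or "HTTP/1.1"
    user_agent: str
    passed_js_check: bool = False  # True once the JS redirect was followed


def looks_automated(req):
    """Cheap heuristics: HTTP/1.0, or a default-looking Java user agent."""
    return req.http_version == "HTTP/1.0" or req.user_agent.startswith("Java/")


def route(req, overloaded):
    """Return 'serve' or 'bounce' for a single request."""
    if req.passed_js_check:
        return "serve"    # a real browser ran the script; never bounce it
    if overloaded and looks_automated(req):
        return "bounce"   # send to the redirect page that shows in the logs
    return "serve"


bot = Request("HTTP/1.0", "Java/1.8.0")
human = Request("HTTP/1.1", "Mozilla/5.0")
print(route(bot, overloaded=True), route(human, overloaded=True))
```

Bouncing only under load keeps the cost of a false positive low: a misclassified human is sent through the redirect once and then served normally.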
        
         | superjan wrote:
         | Almost sounds like it is justified to add a javascript crypto
         | miner to your pages to make the bots pay for the use of your
         | service.
        
           | randomstring wrote:
           | The point is that the vast majority of scrapers do not bother
           | to run javascript.
        
         | [deleted]
        
         | stevenicr wrote:
         | appreciate the sharing of info here.
         | 
         | I have recently been discovering and combating some similar,
         | albeit much smaller issues.
         | 
         | I've been finding that a bunch of my recent 'resource sucks'
         | have been constant spidering from petal-bot, semrush bot,
         | alibiba-bot and a few others.
         | 
         | Using the wordpress plugin stop-bad-bots and its logs has been
         | eye-opening for me recently.
         | 
         | I understand many of these are not directly dark-seo related,
         | but their aggressive nature is hurting the cpu and memory
         | limits of some of my servers and sites so it's a big issue
         | regardless of the intents behind them.
         | 
         | (kind of) glad someone else has dealt with these issues, and
         | glad to see some of the 'how' for handling, identifying, and
         | some actual real numbers for the impacts, as I've been guessing
         | some of these things in my small projects, indeed it's a real
         | thing. As well as a practical issue to pay attention to and
         | work on.
        
       | munk-a wrote:
       | Could you possibly use your robots.txt to redirect them all to
       | ad-laden pages to try and subsidize your legitimate users?
        
       | buro9 wrote:
       | This is for comment spam.
       | 
       | It's trying to find a long tail of popular but not top listed
       | blogs for the purpose of posting comments with the much desired
       | links to the SEO target.
        
         | Veen wrote:
         | Does that work any more? I thought everyone put nofollow
         | attributes on comment links.
        
           | 0des wrote:
           | If it didn't work, would you still see it?
        
             | Veen wrote:
             | Yes, because to sell it you need someone to believe it
             | works. That's independent of whether it actually works
             | (although this does answer my initial question).
        
               | [deleted]
        
       | hinkley wrote:
       | I am slowly convincing my coworkers that deploying the exact same
       | binary as two different 'services' is a significant tool to have
       | in your toolbox. Some disaster recovery work we're doing is
       | making it a much easier sell.
       | 
       | I'm really just combining two very old tricks here. Traffic
       | shaping based on class of service for two different requests, and
       | for two different classes of users.
       | 
       | Segregating bot traffic improves consumer experience. Segregating
       | admin traffic from both allows you to set an upper and lower
       | bound on availability.
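
The idea of running one binary as several "services" and steering each class of traffic to its own pool can be illustrated with a tiny router. This is a sketch under assumptions, not the commenter's setup: the pool names, the `/admin` path prefix, and the "bot" substring test on the user agent are all hypothetical.

```python
# Hypothetical pool names; every pool runs the exact same binary.
POOLS = {"admin": "admin-pool", "bot": "bot-pool", "user": "user-pool"}


def classify(path, user_agent):
    """Rough traffic class: admin paths, self-declared bots, everyone else."""
    if path.startswith("/admin"):
        return "admin"
    if "bot" in user_agent.lower():
        return "bot"
    return "user"


def pick_pool(path, user_agent):
    """Choose the deployment that should handle this request."""
    return POOLS[classify(path, user_agent)]


print(pick_pool("/search", "SemrushBot/7"))
print(pick_pool("/search", "Mozilla/5.0"))
print(pick_pool("/admin/users", "Mozilla/5.0"))
```

Because the pools are separate deployments, each can be sized and rate-limited on its own, so a surge of bot traffic cannot starve admin or consumer requests.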
        
       | FargaColora wrote:
       | You mention the "Dead Internet Theory" (not heard that phrase
       | before!).
       | 
       | I agree: the WWW Internet is dead, that is your problem. No-one
       | visits websites anymore, everyone has moved to the 10 biggest
       | websites and all data is now siloed there.
       | 
       | If I want to search for something topical and relevant, I go to
       | Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps,
       | Discord etc.
       | 
       | The general Internet is dead: it's just legacy content and spam.
       | 
       | If you think it's bad for you, imagine what it is like for Google
       | Search! Their entire business is indexing a medium which no
       | longer has any relevancy. People complain that Google no longer
       | delivers good results. But what can Google do? The "good content"
       | is no longer available for them to index.
       | 
       | Want to become rich? Make a search engine which indexes the fresh
       | relevant data from the big siloed websites, and ignores the
       | general dead Internet.
        
         | marginalia_nu wrote:
         | I built my search engine in part to explore whether this was
         | actually true, and I don't think it actually is.
         | 
         | There's still a lot of organic human-made content out
         | there, possibly more than ever, it's just not able to compete
         | with the SEO industry that completely displaces it from Google
         | and social media.
        
           | kodah wrote:
           | Agreed, the general internet is not dead, but the majority of
           | internet users are on Facebook, Twitter, Reddit, HackerNews,
           | Instagram, Google Maps, Discord etc.
           | 
           | From my perspective, we onboarded a lot (if not most) people
           | to the internet after 2007 (the explosion of social media).
           | People sticking to big sites really speaks to an inability to
           | explore the larger internet and a lack of knowing _why_ you
           | would even want to.
        
           | alxlaz wrote:
           | This matches my findings 100%. The WWW is active and
           | bubbling, but virtually all the cool websites I've found in
           | the last 10 years or so came through friends, small IRC
           | channels, or more recently through marginalia.nu :-). Google
           | and friends are facilitators for the SEO and tracking
           | industries, so of course they have zero interest to
           | prioritize these things over content spam -- their whole
           | business runs on content spam. But the WWW is as alive as it
           | gets.
        
           | dylan604 wrote:
           | And who uses your search? I had never heard of "you" until
           | just now. And there is the problem with "new" search engines.
           | Unless you can come up with what would have to be one of the
           | greatest ad campaigns the world has ever seen, no significant
           | number of users will know you exist. Where does the money to
           | pay for that ad campaign come from? How will a search engine
           | generate money to stay relevant? Once people see you becoming
           | relevant, they will figure out how to game your system. It's
           | just the nature of the beast. I don't think I'm being overly
           | cynical about this either.
        
             | marginalia_nu wrote:
             | Why would I need to generate money to stay relevant?
        
               | dylan604 wrote:
               | <edit>The first </edit>relevant was the wrong word.
               | sustainable would be more appropriate. on the assumption
               | that hosting the search engine isn't free, and unless it
               | is supported by a generous benefactor it will need to
               | have a way of generating money to keep the servers
               | running.
        
               | marginalia_nu wrote:
               | I'm self hosting so my operational cost is like $50/mo.
        
             | throwaway14356 wrote:
             | then he must be relevant
        
           | fifticon wrote:
           | I second that independent sites exist - I maintain my own
           | website on a personally run server. There are dozens of us!
           | to quote a quaint phrase.
        
           | api wrote:
           | All open systems are destroyed by spam once they become
           | popular enough to be profitable targets. This will eventually
           | happen to the Fediverse too. If there is money to be made
           | pissing all over the commons, the commons will be pissed all
           | over.
           | 
           | It even happens to proprietary silos if they are too open.
           | Look at how many bots and spammers infest social media.
           | Propaganda and disinformation can also be considered a form
           | of spam.
           | 
           | I realize this sounds cynical but don't shoot the messenger.
           | It's just something I've learned watching the Internet evolve
           | since the middle 1990s. Spam eats everything it can.
           | 
           | IMHO the future is enclaves and invite only communities. The
           | Internet is a dark forest.
        
             | marginalia_nu wrote:
             | As old open systems are destroyed, new ones are created to
             | replace them. The Internet exists in a constant state of
             | rebirth and transformation. You really can't step into the
             | same river twice.
        
               | nonrandomstring wrote:
               | > You really can't step into the same river twice.
               | 
               | I love the maxim and philosophy of eternal refreshment.
               | 
               | Seems like the problem is more akin to having nuclear
               | waste dumped into our rivers though.
        
             | pixl97 wrote:
             | It's not cynical, it's how every system in nature works.
             | Everything alive must develop an immune system or it is
             | attacked and eaten.
        
             | NoGravitas wrote:
             | You are probably right about the future; not necessarily
             | because of spam, though that's a part of it, but just
             | because of the toxicity of global, open to the world,
             | mostly public social media. The Fediverse has mostly
             | coasted by so far on obscurity, but it's not great, and
             | it's bound to get worse. All of my online socializing these
             | days is either through short-lived pseuds on topic-oriented
             | fora, or invite-only Matrix rooms.
        
             | pwdisswordfish9 wrote:
             | > This will eventually happen to the Fediverse too.
             | 
             | Oh, don't worry, the Fediverse will never catch on.
        
               | ffhhj wrote:
               | Why? Serious question.
        
           | indigochill wrote:
           | How do you surface organic human content? I happen to linger
           | around the fediverse/tildeverse sphere where I see organic
           | content from people I personally have a direct (digital)
           | connection to (and I started self-hosting my music after Epic
           | bought Bandcamp), but I'm not clear on how I'd go about
           | digging that kind of stuff up in the more general case.
        
             | marginalia_nu wrote:
             | I do a traditional web crawl and exclude anything that
             | looks too much like it wants a high google ranking. Nothing
             | to it.
        
               | ratww wrote:
               | This might be controversial, but I wish Google would
               | exclude those websites too.
               | 
               | Google started punishing keyword spam, then it started
               | punishing black-hat comment spam. Even Youtube
               | backtracked on the "videos have to be 10 minutes to
               | rank".
               | 
               | I wish they would do the same for carefully manicured SEO
               | content farms too, as those sites are causing a harm
               | worse than keyword-spammer sites did.
        
               | marginalia_nu wrote:
               | They're probably doing all they can. The problem is their
               | dominance, which means both that they effectively have an
               | entire industry looking for loopholes in everything they
               | do, and that they face legal considerations (arbitrarily
               | punishing individual smaller actors might skirt the
               | territory of anti-competitive behavior).
        
               | ajmurmann wrote:
               | I love your search engine. Should I stop recommending it
               | to friends to keep it safe?
               | 
               | I jest a little bit, but your comment genuinely makes me
               | wonder if Marginalia++ is search results - Google -
               | Marginalia
        
               | sdoering wrote:
               | I fear that Google also has a conflict of interest here.
               | A lot of these non optimized sites are not interested in
               | making money via ads. So Google wouldn't profit
               | additionally from leading people there.
               | 
               | And a lot of people (myself often times included) are
               | looking for a quick answer. A good enough answer. So good
               | enough, SEO optimized is being surfaced. The result of an
               | optimization war on both sides combined with the
               | inevitable monetary interests.
               | 
               | I don't have a solution. Sadly.
        
               | galangalalgol wrote:
               | Does anyone have an ad free search engine? You'd start
               | with blacklists from ublock origin, pi-hole, and similar,
               | don't bother even crawling those, then have easy
               | reporting for new or self hosted ads. Not much money in
               | it if any, but it would be refreshing. Might even have a
               | mode to nix anything with a payment method on the site,
               | or that links to a site with a payment method.
        
               | ajmurmann wrote:
               | > Does anyone have an ad free search engine
               | 
               | kagi.com search.marginalia.nu
        
               | EVa5I7bHFq9mnYK wrote:
               | Maybe back to Yahoo model of the 90s? Manually created
               | collection of curated links?
        
               | datavirtue wrote:
               | Yes. We have enough users now.
        
               | ratww wrote:
               | I think there's two kinds of SEO spam going on.
               | 
               | The black-hat kind is definitely made to extract money
               | from ads. But those are easy to avoid for web veterans
               | IMO. And I also feel that Google is doing its part, even
               | though it's costing them money from those sweet ads!
               | 
               | But the white-hat kind, also known as content marketing,
               | is made to let legit companies _save_ money. Instead of
               | paying for Google Advertisement, they get traffic by
               | means of organic content. Think "Michelin Guide" or "Red
               | Bull". Which is a jolly fine idea and responsible for a
               | lot of good stuff, but the problem is that this has been
               | taken to extremes, and now the web is littered with low-
               | effort content made by freelancer writers getting
               | peanuts.
               | 
               | I would personally prefer if those freelancer writers
               | were doing 10 interesting Red Bull articles per month
               | rather than 500 rehashes of contents from other websites.
               | But who am I to judge.
               | 
               | In the news industry things are also very similar.
        
               | Nextgrid wrote:
               | The "white-hat kind" can trivially be filtered out (or
               | deterred) by downranking any of the crap these marketers
               | use to measure their conversion rate - analytics, etc.
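
Downranking by analytics presence could be prototyped as below. This is a sketch of the idea only: the two hosts in `TRACKER_HOSTS` are an illustrative sample rather than a maintained list, and the 20% penalty per tracker is an arbitrary assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative sample only; a real list would come from a maintained blocklist.
TRACKER_HOSTS = {"www.google-analytics.com", "www.googletagmanager.com"}


class TrackerCounter(HTMLParser):
    """Counts <script src=...> tags that load from known tracker hosts."""

    def __init__(self):
        super().__init__()
        self.tracker_count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src") or ""
        if urlparse(src).hostname in TRACKER_HOSTS:
            self.tracker_count += 1


def adjusted_score(base_score, html, penalty=0.2):
    """Scale a relevance score down by a fixed factor per tracker found."""
    counter = TrackerCounter()
    counter.feed(html)
    return base_score * (1 - penalty) ** counter.tracker_count


tracked = '<script src="https://www.google-analytics.com/analytics.js"></script>'
plain = "<p>plain page</p>"
print(adjusted_score(1.0, tracked), adjusted_score(1.0, plain))
```

This only sees trackers referenced directly in the HTML. Scripts injected at runtime would need a rendering crawler to detect.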
        
               | ratww wrote:
               | I love this idea. Would be nice to see it in a search
               | engine, or at least a browser extension showing how much
               | analytics junk a site has before you click it.
        
               | Nextgrid wrote:
               | Kagi has a non-commercial filter that I suspect uses the
               | presence of ads/analytics as a signal.
        
             | ysavir wrote:
             | It's not about surfacing organic human content, it's about
             | only indexing organic human content. The problem is
             | automated indexing. So long as indexing works according to
             | defined rules, the advantage will be to those able to shape
             | their content to those rules, and the spammers and scammers
             | will win.
             | 
             | An idea I've had for a few years is making a social-network
             | based index engine. The only pages that get indexed are
             | pages that users themselves mark as worth indexing, and the
             | only pages returned in your results are pages that were
             | marked for indexing by people you added to your circles, or
             | the people in their circles, or the people in _those_
             | circles, etc (probably up to 5 or 6 degrees of separation).
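The circle-based lookup described above could be sketched roughly as a breadth-first walk over a trust graph. The `circles` and `marked` dictionaries and both function names are hypothetical stand-ins for whatever storage such an engine would use.

```python
from collections import deque


def trusted_users(circles, start, max_degrees):
    """Breadth-first walk of the trust graph, up to max_degrees hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_degrees:
            continue  # do not expand past the allowed degrees of separation
        for friend in circles.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return seen


def visible_pages(circles, marked, start, max_degrees=2):
    """Union of pages marked by start or by anyone within reach of start."""
    pages = set()
    for user in trusted_users(circles, start, max_degrees):
        pages |= marked.get(user, set())
    return pages
```

The set of reachable users grows quickly with each extra degree, so `max_degrees` is the main knob for how much of the index a given user sees.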
        
               | nyokodo wrote:
               | > up to 5 or 6 degrees of separation
               | 
               | So basically everyone on earth?
        
               | ysavir wrote:
               | Alright, 2 or 3!
        
               | kmeisthax wrote:
               | ...so, blogrolls?
        
               | ysavir wrote:
               | Not familiar with blogrolls, but not quite. The idea is
               | more to have standard search engine user experience, but
               | with the requirement that each result is vetted by
               | someone the user trusts, or trusts by proxy.
        
             | pixl97 wrote:
             | Welcome to the billion dollar question. Any place that is
             | authentic will face the zombie horde attempting to fake
             | authenticity in order to capture attention.
        
               | tomxor wrote:
               | I think you're _almost_ right, but it's not necessarily
               | authenticity... I think it's just money.
               | 
               | Large "authentic" search engines can exist to serve the
               | rest of the web, those personal blogs and other small
               | communities. Those sites have a natural tendency to not
               | be trying to turn everything into a revenue stream, so if
               | that was the prerequisite for an engine, it would be a
               | perfect match and naturally dissuade marketing types.
        
               | pixl97 wrote:
               | Authenticity is worth money.
               | 
               | When you have a 'real' community you're talking about
               | real people with real salaries and desires, add in that
               | you tend to develop a real trust between members. Think
               | of this as fertilized soil. You can grow crops in it, but
               | weed seeds will eventually land and try to take over it.
               | 
               | HackerNews is a good example of this, it takes a healthy
               | amount of moderation to keep things on topic where things
               | like politics get pared pretty ruthlessly. If for a
               | minute Dang gave in and found ways to additionally monetize
               | the forums, something that would be profitable for a
               | while at least, things would start down a bad path.
        
           | sdoering wrote:
           | I can only agree with my sister comment. I find this
           | industrialized web more and more shallow and taxing to use.
           | 
           | While professionally I need to help (smaller, local) clients
           | to reach their audiences I become more and more weary.
           | 
           | It is like walking through a supermarket with industrialized
           | fast convenience food shouting in bright colors and
           | advertising while ultimately not nourishing me like slow,
           | real food could.
           | 
           | I am still looking for this digital slow food movement.
        
             | nonrandomstring wrote:
             | > I am still looking for this digital slow food movement.
             | 
             | https://digitalvegan.net
             | 
             | Please read it, and if you enjoy it please suggest it to
             | friends.
        
           | Vladimof wrote:
           | I added it to my list of search engines on Firefox... your
           | favicon is really small, that's on purpose?
        
           | ColinHayhurst wrote:
           | Agreed.
           | 
           | > If I want to search for something topical and relevant, I
           | go to Facebook, Twitter, Reddit, HackerNews, Instagram,
           | Google Maps, Discord etc. The general Internet is dead: it's
           | just legacy content and spam.
           | 
           | The "general" Internet is not dead. Though if you just want
           | to participate in just Facebook, Twitter, Reddit, HackerNews,
           | Instagram, Google Maps, Discord you might well think that.
           | 
           | Users of marginalia (author above), Mojeek (disclosure: CEO)
           | and others [0] are well aware that there are riches of
           | organic human-made content; from years back and new. Yes, a
           | lot of noise too, which Google has a bigger (SEO) struggle to
           | compete against. But still there is good and different
           | content available.
           | 
           | To find good content, using search, you need to use "search"
           | engines which enable discovery, as Google used to do. I
           | stress the "search" as the emphasis of Google, Bing and thus
           | their syndicates is increasingly on being "answer" engines.
           | 
           | [0] https://seirdy.one/2021/03/10/search-engines-with-own-
           | indexe...
        
             | mc32 wrote:
             | Sounds like we're back to AskJeeves and a number of failed
             | answer engines from a couple of decades ago!
        
               | ColinHayhurst wrote:
               | AskBERT but now MUM knows best.
        
             | tmaly wrote:
             | Everyone is trying to game the Google algorithm. The net
             | result is all this long form content and cooking recipes
             | that are 10 pages long.
             | 
             | There seems to be a big disconnect between a typical user's
             | attention span and the length of a post.
        
               | ajmurmann wrote:
               | I thought the recipe thing was to be able to copyright
               | them
        
             | Domenic_S wrote:
             | > _The "general" Internet is not dead._
             | 
             | For some things it is. Good luck getting a non-
             | sponsored/SEO-gamed review of a kitchen appliance or
             | particular vacation mode such as a cruise. It's
             | flabbergasting.
             | 
             | Most times I just stick "inurl:reddit.com" in my search and
             | _try_ to get discussion threads about the thing I'm
             | researching, but even that's getting filled up with shills.
        
               | ColinHayhurst wrote:
               | Result #1 & #2 for kitchen appliance review (your
               | personalised/local results might vary):
               | 
               | Google:
               | 
               | https://www.expertreviews.co.uk/home-garden/home-
               | appliances
               | 
               | https://www.goodhousekeeping.com/appliances/
               | 
               | Bing:
               | 
               | https://www.which.co.uk/reviews/fitted-
               | kitchens/article/plan...
               | 
               | https://www.goodhousekeeping.com/appliances/
               | 
               | DDG:
               | 
               | https://www.goodhousekeeping.com/appliances/
               | 
               | https://www.which.co.uk/reviews/fitted-
               | kitchens/article/plan...
               | 
               | Marginalia:
               | 
               | https://www.infiniteeureka.com/shop-markdowns-on-small-
               | kitch...
               | 
               | http://www.fullyramblomatic.com/essays/sarah.htm
               | 
               | Mojeek:
               | 
               | https://www.appliancesreviewed.net/
               | 
               | https://busybakers.co.uk/category/kitchen-appliance-
               | reviews/
        
               | [deleted]
        
               | FargaColora wrote:
               | Most of these are spam. They contain affiliate links to
               | Amazon to buy the product which is being reviewed,
               | therefore the review cannot be trusted.
               | 
               | "Which" looks to be the exception, but that is a paid-for
               | service.
               | 
               | It's a sad state of affairs.
        
               | kelnage wrote:
               | I understand your opinion about affiliate links - but I
               | use several review websites that use such links for all
               | products they review, and have both positive and negative
               | reviews for products. So I wouldn't say it necessarily
               | follows that affiliate links = biased reviews.
        
               | throwaway894345 wrote:
               | I think search engines are broken, but the Internet
               | itself is probably not "dead". It's just our
               | accessibility to that information. That's not super
               | helpful until we have better search engines (which steer
               | us away from this SEO stuff), but the good news is that
               | building a better search engine is easier than
               | resurrecting the Internet. In particular, there's a good
               | chance that a niche, naive search engine might be able to
               | significantly improve accessibility (e.g., high rankings
               | for pages that answer user queries in the fewest bytes).
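As a rough illustration of the "fewest bytes" ranking mentioned above, here is a naive sketch that keeps pages containing every query term and sorts them by UTF-8 size. The `pages` mapping and the function name are assumptions for the example, and plain substring matching is a crude stand-in for "answers the query".

```python
def rank_by_brevity(pages, query):
    """Return URLs of pages containing every query term, shortest page first."""
    terms = query.lower().split()
    matches = [
        (len(text.encode("utf-8")), url)
        for url, text in pages.items()
        if all(term in text.lower() for term in terms)
    ]
    # Tuples sort by byte length first, then by URL for ties.
    return [url for _, url in sorted(matches)]
```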
        
               | marginalia_nu wrote:
               | ¯\_(ツ)_/¯
               | 
               | http://www.jitterbuzz.com/indmix.html
               | 
               | http://www.alaska.net/~akpassag/
        
               | FargaColora wrote:
               | These websites seem to be last updated decades ago, which
               | is prehistoric to most casual browsers. There's no doubt
               | there is great content on the general internet, but these
               | examples I would classify as "legacy".
        
               | marginalia_nu wrote:
               | I can see why the website owners would be interested in
               | getting traffic to recent websites, but why would you be
               | interested in recently updated websites?
        
           | pmontra wrote:
           | I take myself as an example.
           | 
           | People that know me and don't meet me regularly might know
           | the URL of my web site and might care to look at it once per
           | year and check if there is something new. Usually pictures
           | and tales from holidays. Covid made those holidays less
           | memorable so I didn't make any update since fall 2019. People
           | that meet me regularly don't need that website, I'm telling
           | them the tales first hand and showing them the pictures
           | without being obnoxious. I guess that this website is a
           | target for your search engine except it's not in English and
           | your search engine seems to want English search phrases.
           | 
           | I don't have anything of value to share on a public chat like
           | Twitter and I don't have an ego to pretend I do. I also don't
           | use Facebook anymore. I go there once per year to like the
           | messages that wish me happy birthday. I think it's polite to
           | do so. All my media production is on WhatsApp or Telegram in
           | group chats with people I know in real life.
           | 
           | If I really cared about producing content for the world I'd
           | probably be using Twitter, Medium or the fad of the year and
           | they'd take care of my SEO (do they?) or I'd be trying to
           | score points on StackOverflow.
           | 
           | To recap: I never intended to compete on SEO. I'm really OK
           | that my website is only for friends and spreads by word of
           | mouth. It probably never did, I bet it's been on a flatline
           | since I created it 20+ years ago.
        
         | captainmuon wrote:
         | But Twitter, Reddit, HN, and most other such places are just
         | websites and can be indexed fine. Same with Wikipedia, which is
         | very much a silo (they don't have regular links in text in the
         | hypertext spirit, but only footnotes).
         | 
         | Facebook and Instagram are more of a walled garden, like Quora,
         | but there is a lot of junk there anyway.
         | 
         | It's sad for the WWW, but I don't really think it is a
         | fundamental problem for search engines. In fact Twitter for
         | example gives a direct pipe to Google. If you tweet something,
         | it is immediately findable. Similar for StackExchange, but
         | there I think the site is so "small" that Google can afford to
         | just continuously index it.
        
           | ratww wrote:
           | Twitter and Reddit still can be indexed, but they've also
           | become increasingly hard to use without an account. Reddit
           | doesn't let you fully expand threads when you're unlogged.
           | Twitter limits the amount of things you can read and shows a
           | modal. Both of them heavily limit usage on mobile devices
           | without installing an app.
           | 
           | Sure, an account is free but might require giving information
           | you don't want to give. Twitter asks me for a phone number a
           | few minutes after creating an account, even if I don't post
           | anything. Reddit at least lets you skip giving an email.
           | 
           | Sure, there are workarounds such as using lite versions (old
           | Reddit, mobile Twitter), but that's not known to all people
           | coming from a search engine.
           | 
           | It feels as if HN is the only one that's not a partially
           | walled garden yet (and Wikipedia of course).
        
             | airstrike wrote:
             | > Reddit doesn't let you fully expand threads when you're
             | unlogged.
             | 
             | that's what old.reddit.com is for!
        
               | FargaColora wrote:
               | old.reddit will be gone soon, it is inevitable.
               | Especially once they go public.
        
               | ntauthority wrote:
               | Isn't it a bit ironic that a site - or its operator -
               | 'going public' means all the content on said site
               | actually 'goes private'?
        
               | aceazzameen wrote:
               | Yup. It's bound to happen. And when it does, Reddit will
               | no longer exist in my eyes.
        
               | azemetre wrote:
               | Agreed. IDK how I feel about Reddit. I've been on it
               | since 2010 when Fark lost its spark. I remember some
               | great times but a lot of it was "junk" content that in
               | the end was very meaningless. I wish I could say I used
               | it to develop my career in tech but that isn't true
               | either; I use specific blogs, books, and tutorial sites
               | to learn instead.
               | 
               | I suppose I mostly view it as a continuous party, yeah
               | it's fun if you attend but after a few hours I wish I was
               | doing something more productive.
        
               | ratww wrote:
               | Exactly, I mentioned it. But not only is it bound to go
               | away sometime, it's also not trivial to find to anyone
               | who's not an expert Reddit user, unfortunately.
        
           | TheRealDunkirk wrote:
           | And isn't it great to get a link to Reddit or Twitter, and you
           | click the link, and try to navigate to the comments for
           | context or the answer, and you go to click the link to expand
           | it, and then you get a demand to log in and install their
           | app? Don't talk about walled gardens and not include Reddit
           | or Twitter just because they let you look at one brick before
           | demanding their tax.
        
             | [deleted]
        
         | hn_throwaway_99 wrote:
         | Doesn't _this_ site, and all of the content it links to, pretty
         | much disprove your theory?
         | 
         | Yes, sure, I often do go to the "top sites" when searching for
         | content, but I still usually start at Google. And, despite all
         | the SEO spam, Google still does a fairly decent job of landing me
         | on, for example, the appropriate Wikipedia page, Stackoverflow
         | post, travel site, etc.
        
         | mrtksn wrote:
         | It has been dead for a while now and the whole society feels it
         | globally. Things were getting so good, then things became
         | horrible, and whoever cracks the path to the good stuff again
         | will find great riches at the end of the path.
        
         | dageshi wrote:
         | I agree with you to an extent. The web is less useful than it
         | used to be. BUT I would say a lot of that usefulness has
         | diverted into youtube. There are people who would previously
         | have made sites who are making youtube videos instead which of
         | course is owned by google.
        
         | Jenk wrote:
         | > If I want to search for something topical and relevant, I go
         | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
         | Maps, Discord etc.
         | 
         | High chances you will find a link to an external site over
         | content actually on those big named sites though, right? That
         | tells us the organic web isn't dead, it's just hard to
         | discover/navigate - because of SEO wars, most probably... The
         | problem isn't the lack of content, it's the number of shitty
         | spammy sites standing in your way of the sites you actually
         | want to see. Like a sleazy salesman trying to direct you to the
         | crap laden three wheeled rust bucket when you were heading
         | toward the family sedans.
        
         | altairprime wrote:
         | If you want to be rich, solve search without full-text indexing
         | of sites. Pagerank only ever worked because of human curation
         | of webrings. Full-text search made it easier to find content,
         | and opened the door for spammers. The only viable route forward
         | for search will be to replace full-text indexing with human
         | curation, somehow. Solve how to scale that up instead, so that
         | when everyone else realizes we need it for the health of the
         | Web, you're ready.
        
         | [deleted]
        
         | shortformblog wrote:
         | I think this is a tad reductive, but I will say that we sure
         | let a lot of big companies convince a huge portion of the
         | population to create all of their content on platforms that
         | they have no real control over.
         | 
         | The problem is, many of them didn't realize this was a problem
         | until recently.
         | 
         | That said, plenty of exciting stuff is happening outside of the
         | walled garden, as long as you know how to find it.
        
           | Gravityloss wrote:
           | And not only did this happen already over a decade ago, a lot
           | of the current internet users have never known anything else.
           | 
           | We had a discussion with coworkers and somebody mentioned
           | irc. Explaining to younger colleagues what it was and that it
           | was not a product of a company, but operators had servers
           | that formed a network, and it was more like infrastructure.
           | Felt weird.
        
             | Elvie wrote:
             | isn't Discord a bit like IRC used to be?
        
               | ori_b wrote:
               | How do I connect to a self hosted discord, and then
               | connect it to my friends self hosted one?
               | 
               | And where do I get the RFC for the protocol so that I can
               | write my own compatible implementation?
               | 
               | IRC isn't a product. It's a standardized protocol
               | sufficiently simple to implement in a day or two.
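To give a sense of how small the wire format is, here is a sketch of parsing a single IRC line in the layout described by RFC 1459 (optional `:prefix`, a command, space-separated parameters, and an optional trailing parameter after ` :`). The function name is made up for this example, it assumes a well-formed, non-empty line, and it covers only message parsing, not a full client.

```python
def parse_irc_line(line):
    """Split one IRC line into (prefix, command, params)."""
    line = line.rstrip("\r\n")
    prefix = None
    if line.startswith(":"):
        # The prefix runs from the leading colon to the first space.
        prefix, _, line = line[1:].partition(" ")
    # Everything after the first " :" is a single trailing parameter.
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    command, params = parts[0], parts[1:]
    if sep:
        params.append(trailing)
    return prefix, command, params
```

For instance, a server `PING` line carries no prefix and a single trailing parameter, which the client echoes back in its `PONG`.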
        
             | kasey_junk wrote:
             | Most of the kids in my 3rd grader's peer group understand
             | federated infrastructures quite well because of Minecraft.
             | 
             | Perhaps it wasn't the federated nature of irc that was
             | surprising but the fact that it was irc?
        
               | mst wrote:
               | Isn't minecraft more decentralised than federated?
               | 
               | IRC networks usually have multiple servers connected
               | together (historically, often run by a bunch of different
               | people) and I didn't think people self-hosting minecraft
               | servers usually did that?
        
             | shortformblog wrote:
             | I think honestly it highlights the power of marketing as
             | much as anything else. In some ways, building an open
             | network is always going to put you at a disadvantage to a
             | company that can throw money at user acquisition and PR
             | teams. That federated networks like Mastodon have seen
             | growth reflects the fact that word of mouth still means
             | something in 2022.
        
         | NicoJuicy wrote:
         | The big siloed websites are just indexes of fresh content
         | though.
         | 
         | With a generic way to place comments on it.
        
         | psyc wrote:
         | Based on my observations over the past year, I'm certain that
         | Google and Bing choose not to show us most of the web anymore.
         | 
         | I usually find what I'm looking for. It just takes literally
         | three orders of magnitude longer than it used to for the same
         | kind of stuff. I used to use Google a lot to jog my memory
         | about various things I vaguely remembered. Type a few
         | associative words and snippets, press Enter, done. Google's
         | useless for that now.
         | 
         | If you're looking for hot pop shit in trendy publications,
         | things to buy, commercial services to subscribe to - G has you
         | covered. That's what they do now.
        
         | ouid wrote:
         | Google is still pretty good at searching reddit. Maybe reddit
         | can acquire them.
        
           | big_blind wrote:
           | site:reddit just is the best search engine at this point. I
           | still don't like Google though.
        
         | dotnet00 wrote:
         | I agree that this seems way too reductive. I was recently
         | reflecting on this and noticed that I constantly run across new
         | blogs and sites whenever trying to learn something. I just
         | don't usually pay much attention to the site name in the way
         | that I remember HN, Reddit, Twitter etc.
         | 
         | So, while I would agree that some aspects of the old internet
         | are dead (like 'small' ~1000 user forums focused on specific
         | topics having largely been replaced by generally inferior
         | subreddits and discord servers), I think it hasn't gotten as
         | bad as you're making it out to be.
        
         | baxtr wrote:
         | I am not so sure...
         | 
         | I think what happened is this: the WWW was everything back in
         | the days. But in the "old days," only 10% of all people were
         | online, the web elite. Then, AOL came, and the rest came online
         | slowly but surely. The so-called "mainstream" people were no
         | geeks, and these people were "just" ordinary people. Almost all
         | were captured by what you call "big websites".
         | 
         | Now, we see the 100% being dominated by the 90%. That's why
         | "Google results are bad". Bad for us! But maybe (most probably)
         | not for them.
        
           | nl wrote:
           | Eternal September was Sep 1993. AOL hit the internet in March
           | 1994.
           | 
           | Netscape didn't launch until December 1994 (and the WWW was
           | nothing before that. I subscribed to a mailing list with new
           | sites that were released and I'd visit most new websites on
           | the internet on most days with the Cello browser in my uni
           | labs most days).
           | 
           | AOL users have been there since the beginning of the WWW.
           | 
           | https://en.m.wikipedia.org/wiki/Eternal_September
        
             | CWuestefeld wrote:
             | My recollection is that the AOL event you reference was
             | only making usenet accessible - a point that makes good
             | sense in the context of the eternal September.
             | 
             | But when talking about the WWW, that's a very different
             | story. I think that AOL didn't incorporate a web browser
             | until quite some time after that.
        
         | mywaifuismeta wrote:
         | I no longer see Google as a neutral "search engine" the way it
         | used to be. Now it's just another company that owns and
         | promotes certain types of content, no different from reddit.
         | For some things Google has the best content, for some things
         | Twitter or Reddit have the best content.
        
           | dixego wrote:
           | Google is an advertising company. It has been for a good
           | while.
        
             | big_blind wrote:
             | Yeah I use you.com and kagi.com. No advertising on either.
             | Less SEO spam too it seems.
        
           | [deleted]
        
           | photochemsyn wrote:
           | I find one of the best ways to find interesting content on
           | specific subjects using Google is now to start blocking all
           | their top returns (a lot of SEO spam). This is somewhat
           | tedious (lots of -site:seospam.com) and Google doesn't like
           | automated queries. However, a few rounds of this often turns
           | up interesting content down low in the search results. Just
           | don't take what's on offer on page one of search results,
           | basically.
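The tedious part of the exclusion approach above, typing out many `-site:` operators, can be scripted for pasting into the search box by hand. A small sketch, with the helper name and the blocked domains purely illustrative:

```python
def exclude_sites(query, blocked_domains):
    """Append a -site: operator for each blocked domain to the query."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked_domains)
    return f"{query} {exclusions}".strip()
```

This only builds the query string; it does not send anything to a search engine.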
           | 
           | Where it's gotten really bad is on news searches as Google
           | either now has some kind of shitlist of independent news
             | sites that it won't allow to show up on, for example,
           | site:youtube.com searches - or, it's filtered through a guest
           | list. It's hard to tell which strategy they're using, but
           | news is definitely being heavily filtered based on very
           | dubious propaganda-smelling agendas.
        
             | xvello wrote:
             | You might be interested in using uBlockOrigin and
             | https://letsblock.it/filters/search-results to easily block
             | these domains. In addition to your own domain list, you can
             | use the community-maintained SO / github / npm copycat
             | lists.
        
           | maxwelldone wrote:
           | Back in 2000s Google used to be the place for any type of
           | search (IIRC).
           | 
           | Now, I've been conditioned to use it only for specific use
           | cases, mostly for convenience. Some examples include:
           | 
           | 1. Anything programming related (searching for man pages,
           | error codes etc) is straightforward. (I do have some UBO
           | filters to exclude SO copycats)
           | 
           | 2. Utility stuff like currency conversion, finding time in
           | another city, weather etc.
           | 
           | Where Google has really fallen behind is in multimedia
           | search. Not sure if it's due to copyright issues or not but
           | Bing and Yandex provide way better service in this regard.
           | 
           | Not to mention the "reddit" suffix I need to add to any
           | search that even remotely calls for public opinion. In many
           | cases, Google is just a shortcut to take me to the relevant
           | subreddit.
        
             | ufmace wrote:
             | Programming-related stuff seems to have gotten a lot worse
             | in the last couple of years. Now most terms, at least for
             | common things, return a ton of blogspam, when the official
             | docs or SO are usually the best source.
        
             | LegitShady wrote:
             | Another thing seems to be prioritizing current news over
             | past news, which makes searching for old articles you've
             | read quite difficult.
        
         | samstave wrote:
         | This MUST be the reason that they threw their purchase of
         | Postini in the garbage and my GMAIL INBOX is filled with spam,
         | and my "social" and "promotions" tabs don't filter....
         | 
         | GMAIL is garbage now, I literally use it as my spam email these
         | days. Which sucks because I have had it for a _really_ long
         | more. Which sucks because I have had it for a _really_ long
         | time.
         | 
         | Anecdote on Yahoo! Mail: years ago I wrote to Yahoo support
         | asking when I created my Yahoo Mail account (I'd had it from
         | the 90s when it was very early available...)
         | 
         | And support told me that they couldn't tell me when my account
         | was created as that was *proprietary company information*
         | 
         | So I deleted my Yahoo account. I'm about to DL all my gmail and
         | do the same.
        
         | throw10920 wrote:
         | > I agree: the WWW Internet is dead
         | 
         | I've heard this claim a lot, with 0 supporting evidence. Do you
         | have any?
         | 
         | My own experience is that there are _thousands_ of content-
         | rich, high-quality blogs still being written by real humans,
         | because I regularly find and bookmark new ones weekly, without
         | even looking for them, so: please provide evidence for this
         | claim that runs counter to my lived experience.
        
         | PragmaticPulp wrote:
         | > If I want to search for something topical and relevant, I go
         | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
         | Maps, Discord etc.
         | 
         | Maybe we're searching for different content, but I disagree.
         | While Google results are not without noise, I think it's a huge
         | exaggeration to suggest it's useless. I still regularly find
         | quality results from a quick skim of the first or second page
         | of Google results.
         | 
         | Meanwhile places like Reddit, Twitter, and Hacker News are full
         | of very strong opinions that _feel_ truthy, but are mostly
         | noise. Unless you go in with enough baseline knowledge to
         | filter out 9/10 underinformed comments to dig out the 10% who
         | actually have direct knowledge of the subject and aren't just
         | parroting some version of something they read from other
         | comments, skipping straight to social sites becomes a source of
         | misinformation.
        
         | derefr wrote:
         | > Make a search engine which indexes the fresh relevant data
         | from the big siloed websites, and ignores the general dead
         | Internet
         | 
         | I don't understand why Google themselves don't do this.
         | LinkedIn v. hiQ demonstrated that they won't get in trouble for
         | scraping users' subjective views of data within these silos and
         | then stitching them together to form a cohesive whole. So
         | where's the effort to do so? It seems like the obvious step.
        
         | Gigachad wrote:
         | Interesting thought. I just went through my browser history and
         | realised that almost every time I use google search, I already
         | know what website I want, I just don't know the exact
         | link/page. I'll use google because the search on stack overflow
         | or reddit sucks but I know I'm looking for a page on one
         | particular site.
        
           | Pelam wrote:
           | I realized this too. I disabled search from address bar and
           | started bookmarking everything even remotely sane I see. I
           | often add a few personal keywords to the bookmark bar.
           | 
           | It is starting to pay dividends. Instead of weird stuff
           | thrown up by google when I type in something, I get the "oh
           | yeah, that was the page" from a short list of bookmarks shown
           | to match the words.
        
           | npilk wrote:
           | I had the same realization and ended up setting up a simple
           | Cloudflare script to automatically do an "I'm Feeling Lucky"
           | style search to return the first result:
           | https://notes.npilk.com/custom-search
        
         | lysecret wrote:
         | I think this is a very "consumer focused" take. Yes. A lot of
         | interesting people data is now "locked" behind these
         | aggregators and platforms (and also hard to handle because of
         | GDPR). But most interesting company data is still out there.
        
         | matheusmoreira wrote:
         | The internet itself is probably gonna die soon anyway. Every
         | country wants to impose its own laws on it. I think it'll
         | eventually fragment into multiple segregated continental
         | networks, if not national ones, all with heavy filtering at the
         | borders.
         | 
         | I'm happy to have experienced the free internet. Truly a jewel
         | of humanity.
        
           | cesarb wrote:
           | > I think it'll eventually fragment into multiple segregated
           | continental networks, if not national ones
           | 
           | That's exactly the world in which the Internet grew. There
           | were multiple segregated national and sub-national networks,
           | and the Internet was built as a means to interconnect them.
           | After some time, the Internet protocols ended up being used
           | even within these networks, but that was not originally the
           | case. And even today, there are still things like the AS
           | (Autonomous System) concept which permeates the core of the
           | top-level Internet routing protocols, which still reflect the
           | Internet being a "network of networks" instead of a single
           | unified network.
           | 
           | That's why I'm not too worried about the Internet
           | fragmenting; we've seen this before. What happens next is
           | gateways between the networks, and there are already shades
           | of these in the VPN providers which allow one to connect as
           | if one were located in a different network, often from a
           | different country.
        
           | kmlx wrote:
           | > I think it'll eventually fragment into multiple segregated
           | continental networks
           | 
           | i think it already has.
           | 
           | the Great Firewall of China is the classic example, but I
           | think the trend started in the west with the Right to be
           | forgotten/right to erasure in Europe, and subsequent HTTP
           | Status 451 Unavailable For Legal Reasons. GDPR just further
           | cemented the split between Europe and the rest, and the new
           | DMA & DSA regulation in the European Union finally makes it
           | clear. The writing is of course on the wall, so countries
           | like India or Australia aren't too far behind. Places like
           | California also have their own "right to be forgotten", and
           | I'm sure the US will not be left behind for too long before
           | we see regulation further splitting their internet from the
           | RoW. And I don't think the RoW will hold off much longer till
           | it also splits into multiple big blocks. It's the start of
           | the new "nationalist" internet, and I'm sure we'll all be
           | poorer because of it.
        
             | matheusmoreira wrote:
             | Exactly what I mean. There is no way to have an
             | international network with national borders.
             | Telecommunications providers have always been centralized
             | and have always been in bed with the government. Only way
             | we'll ever be free is if someone invents some kind of
             | decentralized long range wireless mesh network.
        
               | politician wrote:
               | Like Starlink?
        
               | ricardobeat wrote:
               | Starlink connects to standard internet gateways on the
               | ground. It cannot function without the 'regular
               | internet', unless a replacement appears.
        
               | dotnet00 wrote:
               | IIRC there was mention of it providing some p2p network
               | style communication capabilities for Ukraine's military,
               | and one of the reasons it's appealing to the US's
               | military is the ability to route communications entirely
               | within the network (well, with the gen 2 satellites which
               | have laser interconnects).
               | 
               | So it can (at least eventually) function without 'regular
               | internet', although I would still be hesitant to call it
               | a viable infrastructure choice if the goal is to get
               | around government control, simply from how much SpaceX
               | have to appease the government to do anything space
               | related.
        
               | matheusmoreira wrote:
               | Starlink is maintained by a company, it's an internet
               | service provider. One visit from the police and they'll
               | censor anything.
               | 
               | The mesh network should be made out of common hardware in
               | order to be viable. I'd suggest phones but those devices
               | are owned before they've even left the factory.
        
               | Nextgrid wrote:
               | One visit from the _US_ police. US-unfriendly countries
               | have no leverage over it, and similarly, the US has no
                | leverage over satellite ISPs based in countries they
                | aren't on good terms with.
        
               | jrockway wrote:
               | > US-unfriendly countries have no leverage over it
               | 
               | "Star Wars Episode 10: The one that's not fiction."
        
               | Nextgrid wrote:
               | Internet censorship isn't worth going to war over and
               | disclosing secret anti-satellite weapons that are better
               | saved for a rainy day.
        
               | jrockway wrote:
               | It's probably easier to just cut off outgoing payments to
               | Starlink anyway. They're not a charity, so if they don't
               | get paid, they probably don't want to provide service
               | just to send a message to some random government.
               | 
               | On the other hand, if you want to demonstrate that you
               | have anti-satellite capability it's probably a better
               | idea to shoot down a corporate satellite than a military
               | one. The Soviet Union shot down Korean Air Lines Flight
               | 007 and it didn't start a war, after all.
        
               | eloisius wrote:
               | Good luck, spectrum is highly regulated in every country
               | I can think of. If national governments don't want you
               | networking across borders, you're definitely not going to
               | be broadcasting long range radio transmissions that way.
               | In fact, it's currently illegal to transmit encrypted
               | data or to relay packets via ham radio in the US.
        
               | matheusmoreira wrote:
               | Who knows? The whole point of decentralization is for
               | there to be so many nodes in the network they can't
               | possibly take them all down so that it's pointless to
               | even try. What if all smartphones formed a mesh network?
               | There aren't enough prisons in my country for all those
               | criminals.
        
               | eloisius wrote:
               | I agree with your ethos, but I don't share your optimism.
               | If the state wants to enforce networking firewalls along
               | national boundaries, no technological solution will save
               | us in general. As a resourceful techie with the right
               | know-how you may be able to sneak your packets through,
               | just like people in Cuba receive a literal packet of data
               | via sneakernet, but if the state doesn't want widespread
               | meshnets circumventing their firewall, they will imprison
               | you for emitting pirate radio signals, they will penalize
               | any electronics manufacturer that makes non-compliant
               | hardware, and rest assured that companies will go right
               | along. Liberty requires more than technical solutions.
               | 
               | I'm saying this as someone who once wrote a decentralized
               | P2P mesh for instant messaging[1]. I was inspired by the
               | HK protests going on ~2014 after hearing that they were
               | using Bluetooth chat apps. Luckily Matrix, Telegram,
               | Signal, etc. mostly solved the problem. Still, I don't
               | think any amount of mesh networking would turn back the
               | tide of Hong Kong now.
               | 
               | [1]: https://github.com/zacstewart/comm/
        
               | groby_b wrote:
               | >What if all smartphones formed a mesh network? There
               | aren't enough prisons in my country for all those
               | criminals.
               | 
               | There don't need to be. You publicly gruesomely execute
               | the first 100 or so you catch, and the practice of
               | running a mesh node on your cell phone will fall so far
               | out of fashion that the network breaks.
               | 
               | Societal shortcomings cannot be fixed via tech alone. If
               | you can't build a society resilient to authoritarianism
               | in the first place, tech will not help you. It can be
                | used to _increase_ resilience, but that's far from
               | fixing the problem by itself.
        
           | 7sidedmarble wrote:
           | The networking may have been open like that, but I'm not sure
           | the content ever was. It seems to me like a lot of internet
           | users consume mainly the content of sites from their country.
           | Kind of hard to blame them when that content is probably
           | going to download fastest. But the language barrier has also
           | kept the internet from becoming truly global.
        
           | dreen wrote:
           | I think this was inevitable all along, something similar
           | happened to radio if I'm not mistaken.
           | 
           | However, the good news is that we will never stop reinventing
           | everything. The real value of the old internet was showing us
           | what is possible.
        
             | nonrandomstring wrote:
             | > The real value of the old internet was showing us what is
             | possible.
             | 
             | Of equal value is that it showed us what not to do.
             | 
             | We have 30 years of documentation for research on exactly
             | what a successful intra-planetary network needs to be
              | immune to. A successful future network must build in
              | resistance to all forms of human psychopathology from the
             | ground up.
        
               | pde3 wrote:
               | This is a nice fantasy, but it's a fantasy. The tech
               | stack and network we have is too dense a forest to be
               | replaced by clean slate designs. But maybe some of the
               | problems could be improved with some new platforms and
               | APIs. Mind you, ML is making so much progress so quickly
               | that what happened over the last thirty years is at best
               | a partial model of the problem we have to solve now, and
               | the tools we have to do it with...
        
               | nonrandomstring wrote:
               | > ML is making so much progress so quickly that what
               | happened over the last thirty years is at best a partial
               | model of the problem we have to solve now, and the tools
               | we have to do it with...
               | 
               | Sorry I don't see how ML can help here. It seems like
               | another thing to pin hopes of repairing an already too
               | broken system on.
               | 
               | "We cannot solve our problems with the same thinking we
               | used when we created them." -- Albert Einstein
               | 
               | "A new scientific truth does not triumph by convincing
               | its opponents and making them see the light, but rather
               | because its opponents eventually die, and a new
               | generation grows up that is familiar with it." -- Max
               | Planck
               | 
               | We are the dying generation my friend. We built it. They
               | came. It didn't work. Surely if ML can do anything it's
               | telling us that we need to tear down the old system
               | completely and start again, don't you think? Adding
               | sticking tape won't help.
               | 
               | edit: turning a grunt into an honest question
        
           | Whiteshadow12 wrote:
           | This made me sad, the optimist in me believes that some
           | alternative will be built, that could take us back to those
            | days. Honestly, I do feel that for most of my life I mostly
            | experienced an American Internet (from South Africa). As long
            | as one can still hop from one internet to another, in as simple
            | a manner as possible, it might not be as bad as it could be.
        
             | matheusmoreira wrote:
             | I'm sad as well. To me it feels like we're already living
             | in a cyberpunk nightmare, things just keep getting worse
             | and there's nothing anyone can do to stop it.
        
           | [deleted]
        
         | lkxijlewlf wrote:
         | > If I want to search for something topical and relevant, I go
         | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
         | Maps, Discord etc.
         | 
         | Interesting. When I search for something topical I search those
         | sites using Google because (almost) all of those sites (I don't
         | use some, like FB and Insta) have really shitty search.
        
         | jerf wrote:
         | "I agree: the WWW Internet is dead, that is your problem. No-
         | one visits websites anymore, everyone has moved to the 10
         | biggest websites and all data is now siloed there."
         | 
         | That is not the Dead Internet Theory. That's just something
         | anyone can see by looking at the world.
         | 
         | The Dead Internet Theory is that the Internet is _already_ an
         | echo chamber custom fed to you by a collection of bots and
         | other such things, and that a lot of the "people" you think
         | you're interacting with are already, today, faked. You're
         | basically in a constructed echo chamber designed only with the
         | interests of the creators of that chamber in mind, using the
         | powerful social cues of _homo sapiens_ effectively against you.
         | 
         | In particular, those silos aren't where people are
         | communicating. Those silos are where you _think_ you're
         | communicating.
         | 
         | It is obviously not entirely true. When we physically meet
         | friends, sometimes topics wander to "Did you see what I posted
         | on Facebook?" So far, we've not caught Facebook actively
         | forging posts from our real-life friends that we physically
         | know. (Though we _have_ caught them failing to disseminate
         | posts in what seems to be a distinctly slanted manner.)
         | 
         | I am also not terribly convinced that the bots have mastered
         | long-form content like you see on HN. I think we've had some
         | try, and while they can sort of pass, they seem to expend so
         | much effort on merely "passing" that they don't have much left
         | over to actually drive the conversation. HN probably still
         | requires real humans to manipulate things.
         | 
         | Where I do seriously wonder about this theory is Twitter. AI
         | _has_ progressed to the point that short-form content like that
         | can be effectively generated and driven in a certain direction.
         | There's been some chatter on the far-out rumor mills about
         | just how bot-infested Twitter may be, how many people think
         | they have thousands of followers, even having interacted with
         | some of them as "people", and in fact may only have dozens of
         | flesh-and-blood humans following them, if that. Stay tuned,
         | this one is developing.
         | 
         | (Note that while this could be "a big plan", it is also a
         | possible outcome of many groups independently coming to the
         | conclusion that a Twitter bot horde could be useful. A few
         | hundred from X trying to nudge you one way, a few hundred from
         | Y trying to nudge you another, another few thousand from Z
         | trying to nudge you yet another, before you know it, the vast
         | vast majority of everyone's "followers" is bots bots bots, and
         | there was no grand plan to produce that result. It just so
         | happens that Twitter's ancient decision to be dedicated to
         | short-form content, with no particular real-world connection to
         | the conversation participants, where everyone is isolated on
         | their own feed (even if that is shared in some ways) made it
         | the first place where this could happen. Things with real-world
         | connections, things where everyone is in the same "area" like
         | an HN conversation, and long-form content will all be three
         | things that will be harder for AIs to manipulate. Twitter is
         | like the agar dish for this sort of thing, by its structure.)
        
           | thesuitonym wrote:
           | > (Though we have caught them failing to disseminate posts in
           | what seems to be a distinctly slanted manner.)
           | 
           | I haven't seen this, but I'd be interested in reading about
           | it, if you have a link!
        
           | ftkftk wrote:
           | I agree - I don't believe that there is a grand master plan
           | of a conspiratorial or other nature. I think it is simply, as
           | you stated, a co-evolution of independent actors.
        
         | rchaud wrote:
         | > Want to become rich? Make a search engine which indexes the
         | fresh relevant data from the big siloed websites, and ignores
         | the general dead Internet.
         | 
         | That would be a great service, but it certainly wouldn't make
         | you rich. Where's the money going to come from? Google got rich
         | because they acquired an ads platform (DoubleClick) and an
         | analytics platform (Urchin) and started monetizing the vast
         | amounts of data they had. That was years after Google had
         | established goodwill as the best search engine.
        
           | big_blind wrote:
           | I use beta search engines. On kagi.com and you.com you can
           | preference and filter top sites. There's also no advertising
            | on either. I've just stopped using Google altogether and it's
           | improved search so much.
        
         | simion314 wrote:
         | This is not true, maybe for a subset of Internet users.
         | 
         | For example you have wikis and forums. Wikis are good for
         | communities that are passionate about a topic and
         | collaborate on building content for their passion. Reddit is a
         | valid alternative to forums, but if the community is older and
         | has members that are technically competent then they usually
         | have the forum customized for their purpose and the forum will
         | continue to exist, especially if you want to avoid some third
         | party censorship.
         | 
         | I have never searched for something and found answers on
         | Facebook; very rarely I find something that points to
         | Instagram blogs/posts, but never Facebook.
         | 
         | Probably depends on your location and what you search for, so
         | it might be possible that 99% of your Internet consumption is
         | satisfied by 5-10 websites.
        
         | Hnrobert42 wrote:
         | As you describe this, it makes me think about how populations
         | tend to migrate to cities and away from rural areas. There's
         | even a parallel to white flight in the emerging popularity of
         | the chan/gab fora.
        
         | hombre_fatal wrote:
         | I don't get how TFA shows evidence of the Dead Internet Theory
         | just because their site manages to attract ~zero users.
         | 
         | Just host a <form><textarea><button></form> at an IP address
         | and notice it's just spambots submitting it with backlinks, not
         | actual users. Doesn't mean the internet is dead nor that the
         | indieweb is dead.
         | 
         | It doesn't really show anything other than the only people able
         | to extract value from your creation are the spammers.
        
         | jspaetzel wrote:
         | This is so incredibly false. I've been working on a project for
         | the last six months and MoM I've seen a steady increase in
         | usage. Tbh much, much higher usage than I expected. Most users
         | find my site via Google or Facebook; however, they are looking
         | for content that's not in those silos and have no problems
         | leaving them.
         | 
         | If you have high quality content and you get it indexed
         | properly by Google, users will come.
         | 
         | There are reasons users are not using your website.
         | 
         | 1. It's not solving a problem people have.
         | 
         | 2. Users can't find it.
         | 
         | Who in their right mind searches for search engines? Nobody I
         | know.
         | 
         | If you want users you have to go out and get them (literally
         | pound the pavement and talk to people) or create a LOT more
         | content ironically, so they can find your site on the search
         | engines they are using today.
        
         | black_puppydog wrote:
         | These discussions always make me recall Jacob Appelbaum. Think
         | of him what you want, but this statement of his really stuck
         | with me at the time. Paraphrasing:
         | 
         | The real dark-net is facebook. Everything that goes in there
         | never comes out again and is basically invisible to the world,
         | except if you join facebook yourself.
         | 
         | My own prime example of that used to be pinterest: it seems to
         | be a 100% sink in the directed graph of internet links. But
         | since Appelbaum stated this, instagram (also facebook of
         | course) is trying hard to push pinterest off that particular
         | throne.
        
           | LegitShady wrote:
           | to me this is also discord - which seems to have become the
            | chosen alternative to online forums for many communities and
           | basically hides what used to be the public face of those
           | communities.
        
           | samatman wrote:
        
         | boplicity wrote:
         | > No-one visits websites anymore, everyone has moved to the 10
         | biggest websites and all data is now siloed there.
         | 
         | Really? We make our living running a small web based
         | publication; around 40k readers a month. I know of many other
         | sites like this. Google, and other search engines, depend on
         | niche websites to provide quality search results. Without sites
         | like ours, the internet would truly be dead, and search would
         | be mostly useless. Our "traffic sources" come from a mix of
         | Facebook, Search, Reddit, etc, in addition to our many loyal
         | readers.
         | 
         | Others in our niche are producing blog spam, which looks nearly
         | identical to people who aren't experts in the field, but we
         | have real experts, fact checkers, etc, as part of our
         | production process. This is a big problem: These low quality
         | websites get similar rankings to our own, which does make it
         | much harder for people to get quality information via search.
         | (Hence the general shift towards trusting social
         | recommendations, such as from Reddit.)
         | 
         | In short, the WWW is alive and well, it's just buried under a
         | bunch of #$#$%.
        
           | rchaud wrote:
           | > Our "traffic sources" come from a mix of Facebook, Search,
           | Reddit, etc, in addition to our many loyal readers.
           | 
           | 40k/mo is a pretty good number for an independent website. As
           | a word of warning though, relying on social media reach is a
           | dangerous game, as there is anecdotal evidence that tweets
           | with outbound links don't get as many impressions as those
           | that link to in-site content, like another Twitter post.
           | 
           | As for Facebook, well, there's a good comic from The Oatmeal
           | (enormously popular on FB back in 2010) that talks about what
           | happened in the long run:
           | 
           | https://twitter.com/Oatmeal/status/923250055540219904
        
         | Cthulhu_ wrote:
         | I don't believe the WWW internet is dead; there's still
         | millions of webpages being made and published every day.
         | However, the traffic numbers are skewed in favor of the big
         | socials and aggregators; I wouldn't be surprised if the 80/20
         | rule applies there.
        
           | pnutjam wrote:
            | There seems to be a tendency towards video that undercuts the
            | "old internet". I prefer instructions in a text or list
            | format, but that's almost impossible to find for things like
            | changing the headlight bulb on my Traverse.
           | 
           | 1. turn the wheel so it is pointed hard in the direction of
           | the bulb you are changing.
           | 
           | 2. remove the hex screws from the shroud in the wheel well
           | 
           | 3. pull the shroud down, it's pretty flexible plastic.
           | 
           | 4. reach up and change the bulb. The wires are a bit short so
           | you might need to get both hands in there. I have big hands
           | and I'm able to do it.
           | 
           | ---- There are innumerable videos explaining this process,
           | but very few text directions.
        
             | ElevenLathe wrote:
             | I think this is actually because real, fluent literacy is
             | still rare even in highly developed places. It may be
             | easier for a very literate someone to dash off those
             | instructions but most people are 1000x more comfortable
             | making a little video. Same goes for reading vs watching
             | the video.
             | 
             | This is my same theory about meetings being universally
             | preferred to asynchronous email, even when literally all
             | the questions someone asks at a meeting have already been
             | answered in my long form email.
             | 
             | Most people, even if they can read, are not really
             | comfortable with it. Doubly so for writing. There used to
             | be no choice to function in society, but increasingly we
             | can use technology to substitute for reading and writing
             | effectively, so people do.
        
               | pnutjam wrote:
               | You're probably right, it's just so frustrating.
               | 
               | I think I'm going to start compiling stuff like this in
               | my git repo.
        
             | Jiro wrote:
             | Even something like that flounders on the question "these
             | instructions say to pull down the shroud, what is a
             | shroud?" or "I can't find those hex screws, where are they
             | located?" Repairs are inherently visual, although text with
             | illustrations might work.
        
         | soheil wrote:
         | To a fish the world is made of water and there can't possibly
         | be anything else worthwhile. This is more indicative of how you
         | spend your time online vs reality.
        
         | heavyset_go wrote:
         | I was once on this bandwagon, but I think it was just
         | confirmation bias reflecting the way _I_ used the internet at
         | the time. The non-siloed internet is bigger than the pre-siloed
         | internet ever was.
        
         | omoikane wrote:
         | I think the Dead Internet Theory bit is just a bait to get more
         | comments. It's a bit of a stretch to conclude that the internet
         | is mostly robots just because one website sees mostly robots.
         | This extrapolation would be convincing if that one website is a
         | high ranking website that sees a lot of traffic, but
         | searchmysite.net does not appear to be one of the top websites.
        
         | DebtDeflation wrote:
         | Unfortunately, correct. The average Internet user accesses it
         | via a phone, not a desktop, laptop, or even tablet these days.
         | Most of that access is through apps, not a browser. To the
         | extent that a user is looking for a factoid answer and does a
         | search, a Google Knowledge Graph result with a Wikipedia link
         | is probably enough in most cases. If they want a technical
         | question answered, Stack Exchange; a product review, Reddit;
         | nearby restaurants with reviews, Google Maps; etc.
        
         | stackbutterflow wrote:
         | I think you're generalizing your own behavior. I regularly use
         | google to search for topics that cross my mind and I end up on
         | many websites that are not one of the giants in your list. It's a
         | fun activity. If people stick to the same 10 websites that's on
         | them. Nothing prevents you from exploring the web.
        
           | MockObject wrote:
           | > Nothing prevents you from exploring the web.
           | 
           | What prevents you from exploring the web is you can't find
           | but the same 10 sites through search engines.
        
         | jrussbowman wrote:
         | "Want to become rich? Make a search engine which indexes the
         | fresh relevant data from the big siloed websites, and ignores
         | the general dead Internet."
         | 
         | Did that to some degree. Unscatter.com pulls from reddit and
         | twitter to source links.
         | 
         | I found reddit only created an echo chamber bubble of obvious
         | bias and twitter only diluted it a little.
        
         | CTDOCodebases wrote:
         | People are doing this already. You just have to include the
         | site name in the search on google, e.g. reddit. Search on these
         | platforms is often broken.
        
       | freeone3000 wrote:
       | Well, the first two links loaded for a search for "magic the
       | gathering" are 404s. The "Random" link at the bottom 403s. The
       | search engine feels broken.
        
       | assemblylang wrote:
       | There are still ways to prod out good content from the SEO spam
       | on search engines. I wrote a google search front end that does
       | this [0], using search operators to remove some common SEO spam.
       | 
       | [0] https://sayno2seo.com
        
       | hammock wrote:
       | Makes me wonder whether Google tolerates bots on its search
       | engine, to boost its ads revenue.
       | 
       | See also Twitter's extraordinary claim that 5% or less of its
       | users are bots (or a claim from Twitter's detractors that up to
       | 90% of its DAU are bots)
        
         | exyi wrote:
         | I don't think it does, I get a ~~middle finger~~ recaptcha
         | every time I try to google something
        
       | iamjbn wrote:
       | Adding to the list I have been building for very long --
       | "Becoming irrelevant, Google Search" -- here:
       | https://docs.google.com/document/d/1cSMY5wXSKhJdMxeJEvTUJ21e...
        
       | iamwil wrote:
       | To the OP of the article, this is great. I had just never known
       | about it, to use it for searching.
       | 
       | Usually quality blog posts on specific technical topics are just
       | things I run across through HN, lobsters, or twitter. Now it's
       | one more channel to look for things that I'm specifically
       | researching, like CRDTs. Kudos!
        
       | ColinHayhurst wrote:
       | Mojeek member here. We have always had a high level of spam bots;
       | as any search engine/service will have. It's a constant battle to
       | fend off new bots; folks can always try out our API rather
       | than freeloading, and some do. Many obviously do not. We are
       | taking a look at whether things have also changed for us since
       | mid-April 2022.
        
         | ColinHayhurst wrote:
         | Some evidence of an uptick here too. Historically it has been
         | ~80%. 6 days ago we had to block 92%. Yesterday we blocked
         | around three times that number of bot searches.
         | 
         | edit: the three times spike yesterday was one particular new
         | attacker; general recent rise holds.
        
       | alphabet9000 wrote:
       | i recently built a habitat for spam bots, they eventually found
       | it and now post peacefully
       | 
       | https://upstairs.treehouse.telnet.asia/pharm/cylohexapine
        
         | TremendousJudge wrote:
         | It's beautiful to see nature healing
        
           | tbm57 wrote:
           | Maybe someone should start an internet rewilding project
        
         | getcrunk wrote:
         | this is the best thing I have ever seen. It's art, engineering,
         | biology and sociology. Do you write blog posts about it?
        
       | mcv wrote:
       | If you're trying to boost your user numbers, I'm in. Results on
       | topics I search for are very sparse, but it's all content I
       | hadn't seen before, which is great.
       | 
       | Sounds like your search engine is not suitable as a replacement
       | for more traditional search engines, but it might complement them
       | very well. I'll give it a try.
       | 
       | As for the SEO bots: can't you simply block those?
        
       | egberts1 wrote:
       | Error codes. Open source that reports in mysterious error codes.
       | 
       | Used to be able to Google for those; now, not so much.
        
       | 0xbadcafebee wrote:
       | People will only use your product if they know about it and
       | perceive value in it. How do people know about it, and why would
       | they want to use it?
       | 
       | On _"Most of the tiny number of real users have come from links
       | posted to places like Hacker News, and there is almost no organic
       | traffic from other search engines"_ - Organic traffic comes from
       | word of mouth. Are people talking about your site? If they're
       | not, you're not gonna see organic traffic. You could do what
       | others do and pay some influencers to advertise your site, but
       | that's expensive and not as scalable as "real" buzz. Is your
       | product exciting or controversial? If not, why would people talk
       | about it?
       | 
       | Your homepage's tag line is _"Open source search engine and
       | search as a service for personal and independent websites."_ A
       | regular person's eyes would glaze as they try to figure out what
       | this means. Given some time they might put together the words
       | "search engine" and "personal" and "websites" and figure this is
       | a blog search engine. So just say that.
       | 
       | The "Newest Pages" section is a fun novelty, but after a few
       | minutes the novelty wears off.
       | 
       | The "Browse Sites" section is _almost_ useful. Next to the list
       | of sites I see some tags. Why isn't a heatmap of the tags the
       | first thing I see? That would be way more useful than a paginated
       | list of random sites.
       | 
       | Your "About" page lists _"community-based approach to content
       | curation"_. This is the most exciting aspect of the whole
       | endeavor, so add that to your front page blurb ("Community search
       | engine"). You would probably do well to build a real community
       | around it, for example with a forum or chat system (GitHub
       | Discussions does not count). A SubReddit would be an easy way to
       | bootstrap this and later move it to your own hosted forum.
       | 
       | You'll probably need a very complicated moderation system if this
       | thing takes off.
        
       | unnouinceput wrote:
       | Plot twist: His website/search engine/blog is written by a bot
       | and not a real person behind.
        
       | albatrosstrophy wrote:
       | On a tangential note, I remember a time when Google had the option
       | to search only for 'discussions'. The results were amazing and
       | accurate as it scoured online forums. Almost all issues I had (was
       | following the rooting scene closely back then) were quickly
       | resolved. Then suddenly it got removed for reasons unknown to me.
       | Does anyone know if it's replicable today?
        
         | sodality2 wrote:
         | Brave Search does have a discussion search section.
        
         | blackhaz wrote:
         | Sometimes adding "reddit" to a search query produces fantastic
         | results.
        
           | jrussbowman wrote:
           | I do this all the time
        
           | tunap wrote:
           | I have had some success adding "forum", when looking for
           | trade discussions; eg: controls & automotive. With all the
           | walled silos on the net, this is much less useful with every
           | passing day. On the bright side, I don't have to use -twitter
           | & -facebook, so there's that.
        
           | throwaway27727 wrote:
           | This is great but it seems reddit has done something to mess
           | with their date reporting. When looking for recent posts, I
           | might see a result on Google that says it was posted in the
           | last few days, but on clicking the result will actually be
           | from years ago.
        
             | asddubs wrote:
             | might also be google. I've noticed inaccurate dates that
             | don't appear anywhere for some of my pages. my only theory
             | as to why these were displayed is that google interpreted a
             | (server side) randomly generated number in an inline script
             | as a timestamp (but i can't know for sure that's what
             | happened)
        
             | oefrha wrote:
             | Messed up dates, plus irrelevant topics showing up because
             | there are matched snippets in "more posts from...".
        
           | SirAiedail wrote:
           | I use "site:reddit.com" to fully restrict to that. You can
           | even filter by subreddit that way.
           | 
           | Works well with HN and other sites, too.
        
           | matheusmoreira wrote:
           | Not sure for how much longer this is going to work. Plenty of
           | marketers make fake posts there in grassroots campaigns.
           | Reddit itself is an advertising company.
           | 
           | God I hope they never find out about this site.
        
         | f0xJtpvHYTVQ88B wrote:
         | Brave Search recently implemented "discussions". From what I've
         | seen it is mostly Reddit results but StackExchange also can
         | appear there.
         | 
         | https://searchengineland.com/brave-search-discussions-383706
        
         | Cthulhu_ wrote:
         | I have a suspicion they removed it because of the amount of
         | spam on those forums. There's tons of abandoned forums that are
         | only occupied by spambots.
         | 
         | There's even pretty convincing looking accounts and messages
         | that turn out to be spam in the end, once they start trying to
         | post links.
         | 
         | I have Akismet on the comment section of the Wordpress front-
         | end of the site I run, it basically said something like 99.99%
         | of attempted comments were spam. I'm sure the same applies to
         | e-mail and the like.
        
           | matsemann wrote:
           | Reminds me of those "fake forums" I sometimes see when
           | exhausting google's results. Found a screenshot of the
           | concept here:
           | https://www.reddit.com/r/Scams/comments/jxtr1k/but_it_requir...
        
           | 6510 wrote:
           | Everyone is a spammer according to Akismet. I wouldn't be
           | surprised if 99% of that 99.9999% is false positives.
           | 
           | You could start a website for people you don't like, flag all
           | the comments as spam and they won't be allowed to post
           | anything elsewhere - forever!
        
             | efreak wrote:
             | That percentage sounds about right to me. I've seen
             | comments on blogs from ~10-15 years ago, that continue to
             | have spam posted to them. The first 2-3 comments will be
             | relevant, but comments 50-100 may have a single relevant
             | comment among them, with a total of anywhere from 300-3000
             | comments. Older comments link mainly to blogs
             | (*.WordPress.com) and such, while newer comments link to
             | Facebook and Instagram.
        
       | arbuge wrote:
       | It is my experience that SEO bots are increasingly ignoring
       | robots.txt entries disallowing them from crawling our sites. Last
       | week we noticed several doing this. I don't mind naming names -
       | semrush, something called grapeshot crawler, something else
       | called blex bot, and moz dotbot. Anyone else having the same
       | experience?
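
For reference, here is a minimal sketch of what honoring robots.txt looks like from the crawler's side, using Python's standard `urllib.robotparser`. The user-agent tokens (`SemrushBot`, `BLEXBot`) and the example.org URLs are assumptions for illustration; check each crawler's documentation for the token it actually matches on. Nothing in this mechanism is enforced: a crawler that never runs a check like this simply ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two SEO crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("SemrushBot", "BLEXBot/1.0", "SomeBrowser"):
        print(agent, may_crawl(agent, "https://example.org/search?q=test"))
```

Since compliance is voluntary, repeat offenders usually have to be blocked at the web server or firewall instead.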
        
       | edenfed wrote:
       | I'm currently building a search engine made specifically for
       | developers. We are searching directly in
       | GitHub/StackOverflow/Reddit so SEO is not a problem. You are
       | welcome to try it at https://keyval.dev
        
       | mcovalt wrote:
       | I noticed this on https://hndex.org. So many searches for hair
       | loss products. Like thousands... daily.
        
       | ajnin wrote:
       | This made me curious to try that search engine so I typed
       | "electronic music box" (first thing that came to mind). As far as
       | I can tell none of the 10+ pages of results include all those 3
       | words. I mean, you might not have any relevant sites in your
       | database (likely if there are only 1000 sites or so as another of
       | your blog posts implies), and I understand you want to show _some_
       | result to the user, but if I want irrelevant links I might as
       | well go to google.com...
        
         | thehodge wrote:
         | Yeah same, I searched for Leeds grand theatre and the top
         | result is something titled "June 2012 - Sam's Blog" which just
         | mentions the word grand.
        
         | lubesGordi wrote:
         | What the heck is an 'electronic music box'? I personally
         | wouldn't expect those three words to show up on any sites
         | served by a small search engine.
        
       | nspattak wrote:
       | This is an awesome website that I was not aware of!
        
         | mlatu wrote:
         | and there you have it: nobody uses it because nobody knows of
         | it.
         | 
         | of course for a bot it is easy to remember your site, it's just
         | another url in a long list of others... but what does a human
         | do? they go to their fav search site, be it duck duck go,
         | google or bing... perhaps even yahoo.
         | 
         | i remember when google just started out, back then you would
         | have used askjeeves, altavista or yahoo... google was really
         | good compared to those... and the name was new, kinda
         | orthogonal to existing search engines (except yahoo perhaps)
         | and perhaps the most important bit: the site was "clean" except
         | for the searchbar, there was nothing distracting there. you
         | opened it and knew it is for looking up stuff
         | 
         | now, to join in, this late in the game? difficult. difficult.
         | 
         | maybe it would be easier if it specialized for some niche? idk.
         | 
         | dear OP: i'll try to remember your searchengine, but i cant
         | promise to become a regular
        
       | jacquesm wrote:
       | One day we'll have an internet for humans exclusively. On another
       | note, with 160K requests / day from bots you could of course
       | simply block the bots structurally assuming they are nice enough
       | to identify themselves. Block all of AWS and Google, Russia,
       | China, NK and a couple of other bot hot spots and the service may
       | well become more successful for regular users because they get
       | faster results. Bots can afford to wait, humans are often
       | impatient. And with 2 hits / second by bots that may well become
       | a factor.
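
The "2 hits / second" figure follows from the 160K requests per day quoted above. A quick check of the arithmetic, assuming the traffic is spread evenly over the day (real bot traffic is likely burstier):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming an even spread across the day."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{requests_per_second(160_000):.2f} requests/second")
```

That comes out slightly under two requests per second on average.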
        
         | netsharc wrote:
         | I wonder how that could be accomplished. Maybe they'll build a
         | brain interface to replace the "I'm not a robot" captchas/add a
         | TPM chip to the brain.
         | 
         | And then the spammers will start selling tools to fake the
         | responses. Or pay Filipinos a few cents a month to have the
         | chip implanted to their brains...
        
           | jacquesm wrote:
           | Well, we can do it with the roads, I'm pretty sure if the
           | incentives are right we can come up with a way to do it
           | online. As long as we have not passed the Turing test ;)
           | 
           | The current web seems to favor machines talking to machines
           | and that is definitely not how it was intended.
        
           | Nextgrid wrote:
           | > Or pay Filipinos a few cents a month to have the chip
           | implanted to their brains...
           | 
           | That's the problem with blocking _bots_ as opposed to
           | malicious behavior. Bot blocking is actually trivial and very
           | cheap to bypass as long as you can buy slave-like labor for
           | peanuts.
           | 
           | Ideally you'd want to block malicious _behavior_ (when it
           | comes to SEO spam, downrank anything for-profit such as ads,
           | analytics, affiliate links, etc) instead to remove the
           | incentives for spamming, regardless of whether it's a bot or
           | human.
           | 
           | In this case the only problem is that this search engine
           | gives away resources (search queries) for free and then
           | complains that people (in this case spammers) are taking it.
           | It's not really a _spam_ problem - they'd complain equally
           | well if they had some _legitimate_ user that happened to need
           | tons of search queries to achieve their task.
           | 
           | The only solution here is to start charging for stuff that
           | costs money, and then it doesn't really matter who is on the
           | other side, as long as they pay the bill.
        
             | samatman wrote:
             | It's a principal-agent problem. Websites want to be paid
             | for their content, rent ad space, advertisers want users to
             | see ads, users want to find content.
             | 
             | The agent in the middle fucking over all three principals
             | is hmm. Metaalphabetic, let's say.
        
         | m-i-l wrote:
         | This isn't indexing by search engine spiders, which are usually
         | fairly benign and easy to identify with user agent etc. This is
         | searches for "scraping footprints" executed en masse by "SEO
         | proxy farms", which are designed to be very difficult to detect
         | (e.g. originating from globally distributed residential IPs,
         | quite possibly ordinary home users' machines which have been
         | compromised). The main giveaway that something is a "scraping
         | footprint" is the long search query which includes text that
         | would appear on a template, e.g. ""This website is proudly
         | using the open source classifieds software OSClass" rega
         | turntables", for someone looking for OSClass-powered pages they
         | could "search engine optimise" for the query "rega turntables".
        
           | thesuitonym wrote:
           | That's funny to me, because if I'm searching for something
           | that would have been around between 2004-2012, I'll often
           | append "Powered by phpBB" (or other software) to find posts
           | about it on forums.
        
         | pjmlp wrote:
         | And the cycle will reboot itself again.
         | 
         | The silos we have nowadays were there before the Internet took
         | off, on BBS, Compuserve, Geocities, ....
         | 
         | Apparently the majority of regular humans likes to have
         | centralized providers they can reach out to, instead of the
         | freedom of decentralized content.
        
           | jacquesm wrote:
           | Yes, that's true. Bots tend to follow the money.
        
         | xmodem wrote:
         | > Block all of AWS and Google,
         | 
         | Google for "residential proxy". This is already a huge
         | industry, and it's difficult to see how we haven't lost
         | this war a long time ago.
        
         | kmeisthax wrote:
         | ...so you're going to write your own HTTP requests? Encrypt
         | your traffic and validate certificates by hand? Toggle in each
         | TCP header from a memory debugger?
         | 
         | Most of the Internet is bots because humans don't actually
         | generate HTTP traffic - they fire up a bot called a "browser"
         | to do it for them. The challenge for anti-spam is to
         | distinguish which bots are currently being directly controlled
         | by humans and which ones are not-so-directly controlled by
         | such. This isn't even a hard line; I've frequently hit Hacker
         | News' bot detection just by upvoting a comment and then
         | clicking reply too quickly.
        
           | jacquesm wrote:
           | I really don't understand your comment.
           | 
           | Just so we don't have to argue about what constitutes a bot
           | and what does not I propose we use this definition:
           | 
           | https://en.wikipedia.org/wiki/Internet_bot
        
       | calltrak wrote:
        
       | [deleted]
        
       | oefrha wrote:
       | > I didn't notice at first because the web analytics only shows
       | real users, and the unusual activity could only be seen by
       | looking at the server logs.
       | 
       | Sounds like everyone blocking analytics (Plausible in this case),
       | e.g. myself just now, is lumped in with spam bots.
       | 
       | Of course, analytics blocking can't meaningfully swing the
       | ~99.99% statistic.
        
         | rhn_mk1 wrote:
         | I would argue that yes, it can. If the only people who are
         | interested in using the website are those who block analytics -
         | and, given the demographic of a niche search engine, it doesn't
         | sound entirely implausible - then there's no telling how the
         | 99.99% splits into bots and nerds.
        
           | oefrha wrote:
           | Not every "nerd" uses a blocker. I know many who don't. Some
           | want to support the sites they visit; some want to see the
           | web as it is for most people; some say their mental filters
           | are so well developed that ads don't bother them; etc.
        
           | Xylakant wrote:
           | You could guesstimate by checking the IP address - blocks
           | assigned to residential users are likely humans, blocks
           | assigned to cloud providers etc. likely bots.
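
A rough sketch of that guesstimate using Python's standard `ipaddress` module. The ranges below are reserved documentation blocks standing in for real cloud-provider ranges, which you would have to load from the providers' published lists. A match is weak evidence at best, and a non-match says nothing about whether a human is behind the request.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real cloud ranges.
DATACENTER_BLOCKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def guess_origin(ip: str) -> str:
    """Label an address 'datacenter' if it falls in a known block, else 'unknown'."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_BLOCKS):
        return "datacenter"
    return "unknown"


if __name__ == "__main__":
    for ip in ("192.0.2.10", "203.0.113.5"):
        print(ip, guess_origin(ip))
```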
        
             | gnabgib wrote:
             | This is far from true. Either via trojans, botnets, "crowd
             | sourced vpns", or of course tor relays, residential IPs are
             | a source of many bots. The overwhelming majority of spam
             | sources (after you block a few data centers in NL).
        
           | asddubs wrote:
           | even if there's 99 people blocking analytics for every person
           | who doesn't, the figure is still 99%
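
The arithmetic behind that, as a small sketch. It assumes the ~99.99% figure means analytics sees about 0.01% of all searches as real users; the function and parameter names are made up for illustration.

```python
def bot_share(visible_human_share: float, blockers_per_visible: float) -> float:
    """Share of traffic that is bots once hidden (analytics-blocking) humans are counted.

    visible_human_share: fraction of all searches that analytics attributes to humans.
    blockers_per_visible: humans blocking analytics for each human who doesn't.
    """
    human_share = visible_human_share * (1 + blockers_per_visible)
    return 1 - human_share


if __name__ == "__main__":
    # 0.01% visible humans, 99 blockers for every visible one.
    print(f"{bot_share(0.0001, 99):.2%}")
```

So even a 99-to-1 ratio of blockers only moves the bot share from 99.99% to about 99%.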
        
         | scambier wrote:
         | If you self-host Plausible, it's also possible to bundle the
         | analytics package with the website, so that there isn't an
         | "ad-blockable" lone request for the .js file.
         | 
         | https://github.com/plausible/plausible-tracker
        
           | pluc wrote:
           | Yeah there is. I surf with JS off because of people like you.
        
             | varun_ch wrote:
             | Most of the data you can collect with Plausible could just
             | be collected server side instead, it's nothing like Google
             | Analytics.
        
               | netr0ute wrote:
               | > Most of the data you can collect with Plausible could
               | just be collected server side instead
               | 
               | Then why not just use that instead?
        
               | tylergetsay wrote:
               | SPAs & marketing teams are used to snippets
        
             | scambier wrote:
             | Also notice how I said "analytics package" and not
             | "tracking" in my comment, because there is no tracking. I
             | mean, unless you're the only visitor from a specific
             | country, there is literally 0 identifying data in
             | Plausible.
        
               | netr0ute wrote:
               | Analytics is still unnecessary JS and a bandwidth hog, so
               | it has to go.
        
             | folkrav wrote:
             | https://plausible.io/privacy-focused-web-analytics
             | 
             | You surf with JS off because of sites abusing their users'
             | data. This is not it.
        
               | [deleted]
        
               | 34679 wrote:
               | Collecting data that a user doesn't want collected is
               | abuse. It doesn't matter what you do with it.
        
               | folkrav wrote:
               | Oof. Hard disagree on that one, way too black & white of
               | a position for me in the face of such a broad concept as
               | "data".
        
               | inetknght wrote:
               | > _You surf with JS off because of sites abusing their
               | users' data. This is not it._
               | 
               | Wrong. I surf with JS off because of sites that use JS to
               | collect information about me.
               | 
               | If it's available on the server, then sure that might be
               | considered fair game. But using javascript (or any other
               | client-side tool) to do what you _should_ instead do
               | server-side _is_ abusing users (or their data).
               | 
               | Putting analytics inline so it's "not ad-blocked by a url
               | request" is absolutely disrespecting users and a perfect
               | reason to turn off javascript.
        
               | folkrav wrote:
               | > Wrong. I surf with JS off because of sites that use JS
               | to collect information about me.
               | 
               | Plausible doesn't collect information about you, but the
               | site's usage. Do you also object to physical stores
               | putting up cameras?
               | 
               | Here's their own instance, open to public.
               | 
               | https://plausible.io/plausible.io
               | 
               | > If it's available on the server, then sure that might
               | be considered fair game. But using javascript (or any
               | other client-side tool) to do what you should instead do
               | server-side is abusing users (or their data).
               | 
               | That's quite the affirmation. Is this fact or opinion?
        
               | inetknght wrote:
               | > _Plausible doesn't collect information about you, but
               | the site's usage. Do you also object to physical stores
               | putting up cameras?_
               | 
               | The difference is that the cameras don't get attached to
               | my physical body, don't have any ability to monitor my
               | actions after I have left the presence of the physical
               | store, and can't force me to take any physical item or
               | action.
               | 
               | Javascript, on the other hand, has the capability to
               | become persistent, can monitor my computer's activity
               | outside of your website, and can leave a lot (!) of
               | additional data on my computer without my permission.
        
       | MicahKV wrote:
       | So spammers have latched onto your search engine because they are
       | getting useful results. They are able to systematically discover
       | websites built on certain platforms that allow users to post
       | content containing links, which they can target for link spam. It
       | is very difficult to fight this on a technical level because
       | there is an entire industry built around blackhat SEO, with all
       | kinds of software and services dedicated to thwarting your
       | defensive efforts. Even Google struggles to keep up with this.
       | 
       | However, they are also systematically feeding you their footprint
       | lists. I imagine you could put together a footprint blacklist
       | pretty quickly, and just stop returning results for any obvious
       | spam queries like those containing "powered by wordpress".
       | 
       | It's not a very elegant solution I'll admit. It won't stop the
       | bots from trying, and you may have to circle back periodically to
       | add new footprints as they surface. But it's a potentially quick
       | and easy way to stop rewarding their efforts, and the blackhat
       | world is pretty used to burning out their resources so hopefully
       | they will figure out it's a dead end and move on.
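
A minimal sketch of the footprint blacklist idea, assuming a simple substring match on a normalized query. The footprint strings, function names, and the `backend` callable are all hypothetical; a real list would be compiled from the spam queries in the server logs.

```python
from typing import Callable, List

# Hypothetical footprint list, lowercase; extend as new footprints surface.
FOOTPRINTS = (
    "powered by wordpress",
    "powered by phpbb",
    "open source classifieds software osclass",
)


def is_footprint_query(query: str) -> bool:
    """True if the query contains a known footprint (case/whitespace-insensitive)."""
    normalized = " ".join(query.lower().split())
    return any(fp in normalized for fp in FOOTPRINTS)


def search(query: str, backend: Callable[[str], List[str]]) -> List[str]:
    """Return no results for footprint queries; otherwise defer to the real search."""
    if is_footprint_query(query):
        return []
    return backend(query)


if __name__ == "__main__":
    fake_backend = lambda q: [f"result for {q}"]
    print(search('"Powered by WordPress" rega turntables', fake_backend))
    print(search("rega turntables", fake_backend))
```

Returning an empty result set rather than an error keeps the response cheap to produce while giving scrapers nothing to harvest. The trade-off is that a human who appends a footprint phrase on purpose gets nothing back either.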
        
         | wolpoli wrote:
         | Considering that as of Mar 12, this search engine only has 1001
         | sites indexed, I am not sure how useful this site is for
         | getting SEO backlinks. Speaking of which, are backlinks still a
         | thing these days?
        
         | pascalxus wrote:
         | just to throw out ideas: What if he decided to charge for each
         | search?, say 1 cent or so. Users could purchase them in bulk,
         | say 100 searches for $1.
         | 
         | The world is getting more and more desperate for a better
         | search engine. the day may come, when people are willing to pay
         | for better results.
        
         | marginalia_nu wrote:
         | > So spammers have latched onto your search engine because they
         | are getting useful results.
         | 
         | I'm not sure about this. At least with my search engine, it
         | doesn't really seem to matter what response they get, I don't
         | even think they look at the responses. They keep hammering away
         | with tens of thousands of queries per day with the requests
         | even though they've seen nothing but HTTP Status 403 since last
         | October or so.
         | 
         | My best guess is they're going after search engines in general
         | in case they forward queries to google, in order to manipulate
         | their typeahead suggestions.
        
           | miohtama wrote:
           | Put a CloudFlare web application firewall at the front of the
           | site and then use its rate limiting / CAPTCHA features to
           | throttle traffic. It is the easiest way to get rid of
           | parasitic scraping and API abuse. Cost is $0.
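
For anyone who would rather throttle in the application itself, here is a sketch of a token-bucket rate limiter. This is a generic technique, not Cloudflare's implementation, and the class and parameter names are assumptions. In practice you would keep one bucket per client key (IP address, API key, etc.).

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 now: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.now = now              # injectable clock, handy for testing
        self.updated = now()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        current = self.now()
        elapsed = current - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2)
    print([bucket.allow() for _ in range(3)])
```

Per-IP buckets help less when the requests come from many distributed addresses, so this is a complement to query-level filtering rather than a replacement for it.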
        
           | MicahKV wrote:
           | Huh, well I guess there goes my theory about the incentive.
           | What a bummer. I would have thought that at least with search
           | engine scraping, they would stop expending the effort once
           | the results dried up.
        
         | z3t4 wrote:
         | Or put those query results behind an anti-bot/"captcha" test.
        
           | Ikatza wrote:
           | How about serving bots with one link per page, and taking a
           | minute to serve each page? Would this impact their
           | efficiency?
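
A rough way to reason about that question is to compute how many links a bot can collect per day under a given delay. The sketch below is only arithmetic; the function name and the baseline figures in the usage note are assumptions, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def links_per_day(links_per_page: int, seconds_per_page: float, connections: int = 1) -> float:
    """Upper bound on links collected per day by a bot holding `connections` open at once."""
    pages = SECONDS_PER_DAY / seconds_per_page * connections
    return pages * links_per_page
```

With one link per page and a minute per page, `links_per_day(1, 60)` gives 1440 links per connection per day, against 864000 for a hypothetical baseline of ten links per page served in one second. The delay only bites if the bot cannot open many connections in parallel: `links_per_day(1, 60, connections=100)` is already 144000.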
        
           | tofuahdude wrote:
           | Captcha breaking is SO easy these days; even the modern
           | captchas are easy to defeat.
        
           | MicahKV wrote:
           | That would probably help, but it's also a continuation of the
           | cat and mouse game. There are plenty of captcha breaking
           | services out there; it only costs about $1 to programmatically
           | solve 1000 captchas.
        
             | sylware wrote:
             | ... and there are the "click farms" with human beings.
        
               | z3t4 wrote:
               | If someone pays people to collect data, you could outright
               | sell the data to them.
        
             | anselmschueler wrote:
             | As I understand it, the main point of CAPTCHAs isn't to
             | keep out bots completely, but to give enough friction to
             | make automated attacks or uses infeasible, while keeping
             | the friction low enough that normal users can still use it
             | normally.
        
             | noAnswer wrote:
             | > There are plenty of captcha breaking services out there
             | 
             | Give it a try and see what happens.
             | 
             | People said greylisting against email spam wouldn't work,
             | since spammers would just resend. It has worked for 20 years.
             | To get your IP off the DNSBL NiX Spam you just have to
             | follow a link. People said spammers would automate that
             | process. Never happened in 19 years. Sometimes spammers are
             | just lazy.
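
For anyone unfamiliar with greylisting, the core idea is small enough to sketch: temporarily reject the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accept a retry that arrives after a minimum delay. The class below is a simplified illustration, not any particular mail server's implementation; the 300-second default and the string return values are assumptions.

```python
from typing import Dict, Tuple

# (client IP, envelope sender, envelope recipient)
Triplet = Tuple[str, str, str]


class Greylist:
    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, triplet: Triplet, now: float) -> str:
        """Return "defer" (temporary rejection) or "accept" for a delivery attempt."""
        first = self.first_seen.get(triplet)
        if first is None:
            # First sighting: remember the time and ask the sender to retry later.
            self.first_seen[triplet] = now
            return "defer"
        if now - first < self.min_delay:
            # Retried too quickly; a well-behaved server will try again later.
            return "defer"
        return "accept"
```

In SMTP, "defer" corresponds to a temporary 4xx failure reply. A legitimate mail server queues the message and retries, while a sender that never retries is filtered out. A real deployment would also expire old entries, which this sketch omits.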
        
             | minsc_and_boo wrote:
             | Sure, but it increases friction that forces a re-eval of
             | cost/benefit of the bot(s).
             | 
             | The newest captcha services return a prediction score rather
             | than a verification screen, and you can feed polluting data
             | to bots you are certain exist.
        
               | Calavar wrote:
               | Agreed. I suspect that this is an arbitrage game on the
               | part of the SEO spammers. Each search is cheaper for them
               | than it is for a competitor who's using a major search
               | engine with more extensive anti-spammer protections, and
               | that difference equals $$$. A captcha doesn't have to be
               | an unbeatable solution. It just has to provide enough of
               | a barrier to equalize the cost.
        
               | MicahKV wrote:
               | I'm not so sure about this. The spammer's goal is to build
               | up as big a list of link spam targets as possible. If one
               | spammer chooses to only scrape minor engines and another
               | only major engines, the one scraping the major engines
               | will probably come out on top despite the higher cost.
               | Whoever is abusing OP's search engine is likely doing it
               | to supplement the data they are already scraping from the
               | major engines.
               | 
               | For OP, I think simply not returning results at all is a
               | more practical measure because it removes the reward
               | completely. Captchas and bot detection keep the reward in
               | play, while taking away the results entirely makes the
               | entire pursuit futile.
        
               | go_prodev wrote:
               | Deliberately feeding the spam bots into an endless loop
               | of captchas might slowly drain their accounts if they are
               | paying 3rd party captcha farms.
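
To put rough numbers on that, here is a small calculation using the figure of about $1 per 1000 solved captchas quoted earlier in the thread. The function name and the `captchas_per_query` parameter are assumptions introduced for illustration.

```python
def daily_captcha_cost(queries_per_day: int,
                       usd_per_thousand: float = 1.0,
                       captchas_per_query: int = 1) -> float:
    """Estimated daily spend (USD) for a bot that pays a solving service per captcha."""
    return queries_per_day * captchas_per_query / 1000 * usd_per_thousand
```

At 10,000 queries per day and one captcha each, that comes to about $10 a day. Chaining three captchas per query would triple it to about $30. Whether that is enough to change a spammer's behaviour depends on what a scraped result is worth to them, which this thread does not establish.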
        
               | jfim wrote:
               | It might be a better idea to return low quality results
               | than nothing at all. The idea is that it's pretty obvious
               | when the bot is banned when it receives no results at
               | all. Having to look at the results manually to determine
               | whether one is banned is a much more time consuming
               | endeavor.
        
               | MicahKV wrote:
               | Well what I'm suggesting isn't about blocking the bots,
               | it's about removing the incentive. So in this case, I
               | think the more obvious it is the better. I would want
               | them to realize as soon as possible that they are 100%
               | wasting their time.
               | 
               | If anything, it might be best to return a page that
               | explicitly states "Sorry, this search engine no longer
               | supports SEO footprint search queries."
               | 
               | *edit for typo & wording
        
               | bornfreddy wrote:
               | On the other hand, making content difficult to parse is
               | easy to do and a very strong weapon. Make them waste dev
               | time... It is much easier to make variants of HTML than
               | it is to parse it. You can even automate it to some
               | degree.
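
As a sketch of how such variants could be automated, the function below renders the same result list with randomly chosen tags and a random class name on each call. The tag choices and the naming scheme are arbitrary assumptions; a scraper that ignores structure and just extracts every `href` would not be slowed down by this.

```python
import random
from html import escape


def render_results(urls, rng: random.Random) -> str:
    """Render result links with randomly varied markup so fixed selectors break."""
    container = rng.choice(["ul", "ol", "div", "section"])
    # List containers need <li> children; other containers can take several tags.
    item = "li" if container in ("ul", "ol") else rng.choice(["p", "div", "span"])
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    cls = "c" + "".join(rng.choice(alphabet) for _ in range(6))
    rows = "".join(
        f'<{item} class="{cls}"><a href="{escape(u, quote=True)}">{escape(u)}</a></{item}>'
        for u in urls
    )
    return f"<{container}>{rows}</{container}>"
```

Calling it as `render_results(urls, random.Random())` on every request gives each response a different structure while the visible content stays the same.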
        
         | gopher_space wrote:
         | > It is very difficult to fight this on a technical level
         | 
         | It is when your base assumption is that you won't hire outside
         | of engineering. There are more bored teenagers with phones than
         | people creating quality content, so I'm not sure why you
         | wouldn't just brute force checks against bad actors.
        
         | pstuart wrote:
         | If the confidence was high enough, perhaps return garbage data?
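
A minimal sketch of that idea, assuming some upstream classifier already produces a `bot_score` between 0 and 1. The score, the 0.95 threshold, and the decoy URLs (on reserved example domains) are all hypothetical.

```python
# Decoy links on reserved example domains, used only for illustration.
DECOYS = [
    "https://example.com/a",
    "https://example.org/b",
    "https://example.net/c",
]


def choose_results(real_results, bot_score: float, threshold: float = 0.95):
    """Serve decoys only when the bot score clears a deliberately high threshold."""
    if bot_score >= threshold:
        return list(DECOYS)
    return list(real_results)
```

Keeping the threshold high trades some missed bots for a lower chance of serving junk to a real user, which is the concern behind the "if the confidence was high enough" condition.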
        
       | _tom_ wrote:
       | I think many people in the comments here, and most users, are
       | missing that you index a SMALL subset of the web. This leads to
       | people running a default test search, finding no results, and
       | concluding your search engine is bad, and leaving.
       | 
       | While you imply that in the search page, obviously it's not clear
       | enough.
       | 
       | Maybe add "this search engine only searches a small set of user
       | submitted sites. Click <here> for the list. Or <here> to add your
       | site."
        
       | AdamN wrote:
       | IMHO what you should try is excluding all sites with excessive
       | third-party cookies, sluggish performance, and too many ads. That
       | will slice the index down by 80% probably but it would be a
       | really nice thing to see. It might push out low quality SEO
       | results for a couple of years.
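
One way such a filter might be expressed, assuming the crawler already records per-site measurements. The `SiteStats` fields and every threshold below are assumptions for illustration; the comment above does not specify any numbers.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    domain: str
    third_party_cookies: int
    load_seconds: float
    ad_slots: int


def passes_quality_bar(site: SiteStats,
                       max_cookies: int = 5,
                       max_load_seconds: float = 3.0,
                       max_ad_slots: int = 3) -> bool:
    """Keep a site in the index only if it stays under all three limits."""
    return (site.third_party_cookies <= max_cookies
            and site.load_seconds <= max_load_seconds
            and site.ad_slots <= max_ad_slots)
```

The index would then be built from `[s for s in sites if passes_quality_bar(s)]`. Where to set the limits is a judgment call that would need tuning against real crawl data.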
        
         | guerrilla wrote:
         | This is the solution. Google and DuckDuckGo should be doing
         | this too (and make exceptions if they need to so that they
         | don't collapse). We have to incentivize the good behavior and
         | create an environment where people actually compete on the
         | properties we want and not horseshit.
        
       | stuff4ben wrote:
       | Just posting here that I'm real and I'm glad I found
       | searchmysite! After HN, Verge, Ars, Gizmodo, and some car forum,
       | I struggle to find content I want to read. Hopefully this will
       | allow me to continue to find something I can read as I work on
       | solving problems at work. I find distractions help me to refocus
       | in an odd way.
        
       | marginalia_nu wrote:
       | I had to put my search engine behind Cloudflare to deal with
       | this. Like the volume grew to about 10x the traffic I saw sitting
       | at the front page of Hacker News for a full week.
        
         | marginalia_nu wrote:
         | This is the rate of rejected HTTP requests I'm seeing at this
         | point: https://www.marginalia.nu/junk/spam.png
         | 
         | Real search traffic is about half that.
        
           | m-i-l wrote:
           | Thanks V. I'm seeing a similar number of problem search
           | requests (although nowhere near as many real search
           | requests:-), so it is probably the same "SEO practitioners"
           | running the same "scraping footprints" against different
           | search engines around the same time.
           | 
           | I was kind-of hoping that somewhere in this discussion there
           | would be an "And the answer to your problem is...", but I
           | suppose it is a very specific problem which only a search
           | engine would encounter. I think the Cloudflare solution you
           | have is probably the best to block the requests as early as
           | possible. The reverse proxy config[0] I've got seems to be
           | mostly holding out for now though.
           | 
           | [0]
           | https://github.com/searchmysite/searchmysite.net/issues/55
        
             | marginalia_nu wrote:
             | If they're from the same outfit I've had problems with I
             | really am at a loss as to what, other than Cloudflare, is a
             | good solution. I got like 4-5 requests per second at worst.
             | Seems to be a botnet, I entered a few of the source IPs
             | into my browser and got like login screens to enterprise
             | routers and so on.
        
       | searchableguy wrote:
       | Not surprised. I see many startups with a Head of SEO (search
       | engine optimization) on a huge salary nowadays.
        
       | evanmoran wrote:
       | Has anyone seen this bot growth with online newsletter signups?
       | I've noticed a steady increase in signups but without any
       | equivalent marketing or product buzz that might account for it.
        
       | jrussbowman wrote:
       | It's been the same for unscatter.com for years but I've always
       | attributed that to me not having a real marketing strategy or
       | even sticking with the ones I've tried to start.
        
       ___________________________________________________________________
       (page generated 2022-05-16 23:00 UTC)