[HN Gopher] Almost all searches on my independent search engine ... ___________________________________________________________________ Almost all searches on my independent search engine are now from SEO spam bots Author : m-i-l Score : 604 points Date : 2022-05-16 10:08 UTC (12 hours ago) (HTM) web link (blog.searchmysite.net) (TXT) w3m dump (blog.searchmysite.net) | mywaifuismeta wrote: | That's really interesting... and sad. For what it's worth, I've | noticed comment bots dramatically increase over the last year | too. They have always been there, but looking at Reddit, YouTube, | etc, now there seem to be 10x more than there were a few years | earlier. Even on HN it has gotten worse. | cbozeman wrote: | Is there a browser plug-in or some other piece of software that | can filter, or highlight, which posts / comments are likely | made by bots? | BuyMyBitcoins wrote: | On two occasions I've read one of my comments here on HN copied | and posted on Reddit. The user profiles that copied my comments | in _seemed_ like they were run by a real person but the rest of | their posts might have all been scraped as well. | | I only found out because I just so happened to be looking at | the comments on a related news story and quickly realized the | post sounded strangely familiar. I'm sure most of us here have | had our comments copied without our knowledge. | chairmanwow1 wrote: | I created a temporary email service that was being used by about | 10k users / week. Then several weeks ago, the number of users | started growing like crazy up to about 60k users a day. Then we | checked the recent email activity and 60k / 65k emails were from | a social networking site. | | Seems our service was being used to create fake bot accounts. The | newly created accounts were obvious fakes. Rather than deal with | the issue, we just shut the service off. | wibyweb wrote: | In late April up to now, Wiby (a small mostly unheard of search | engine) began having the exact same issue. 
Tens of thousands of | the exact same type of "powered by..." requests coming from | thousands of IPs. They are using a tool called QHub. | m-i-l wrote: | Thanks for wiby.me. I have seen QHub coming up in the scraping | footprints, but my assumption has been that the footprint query | is looking for Question and Answers sites powered by QHub | containing their targeted terms, e.g. because there's a known | vulnerability with QHub that their scripts can exploit to auto- | post backlinks or whatever it is they do. There are lots of | other hosting tools, other than QHub, that come up in the | footprints as well. I found some lists of footprints by doing | an internet search for one of them: "Designed by Mitre Design | and SWOOP". | wibyweb wrote: | Interesting, thanks for that extra info. | mfrye0 wrote: | I run a data aggregation company that has a fairly advanced | scraping infrastructure for collecting data across the web. | Having built the scraping side, I'm pretty familiar with most of | the strategies for avoiding bot detection. | | Coming from that perspective, detecting and stopping at least the | majority of bots out there is fairly doable, and I put together a | rudimentary thing for a side project. | | The core of it uses an IP API for looking up the requesting IP to | identify the country and if it's coming from a data center, VPN, | Tor, etc. If it passes that, I trigger Google Captcha to show up. | Lastly, I track IPs that make it through and have some basic | rules in place to try to detect patterns and block offenders that | way. | | There's a bunch more stuff you can check for, but the core of it | is basically filtering out data center traffic to minimize the | requests going to Google Captcha. | buzzwords wrote: | I have had very interesting conversations with people who are | "casual" users of internet. They are still finding the results of | the likes of Google, bing and duckduckgo perfectly suitable. 
| Maybe it's most of us here who have different needs to what's | available. | bachmeier wrote: | I suppose it depends what they're looking for. If you're a | homeowner looking for a service of some kind...good luck. There | are domains that aren't too bad, like programming, but you | should go into a search with low expectations. Anyone that | remembers the early days of Google will find today's search | engines to be useless in comparison. | not2b wrote: | The conclusion isn't that there's nobody out there, but that the | billion-odd people who use search engines every day have no idea | what searchmysite.net is. They use Google, often without even | knowing it because they just type some words into their browser | and take what they get. | Auguste wrote: | I'm disappointed that Search My Site isn't seeing many legitimate | viewers. | | Just wanted you to know that I'm a fan. I love reading people's | personal websites, and Search My Site has been great for | discoverability. I visit the Newest Pages and Browse Sites pages | once or twice a week to check out the new sites being indexed. | | I don't know what the answer is to the spam bots, but you do have | some real visitors out there. :) | closedloop129 wrote: | >I noted that there had been multiple weeks where not one single | real person had visited a single blog entry for the whole week | | The site is not on https://searchengine.party/ nor on | seirdy.one's overview. Apart from the blog, how could users find | that engine? | | Is there some place where new search engines are announced and | where new search engines band together to make themselves heard? | m-i-l wrote: | Actually seirdy.one added searchmysite.net to his excellent | list[0] way back in March 2021[1]. | | [0] https://seirdy.one/2021/03/10/search-engines-with-own- | indexe... | | [1] | https://git.sr.ht/~seirdy/seirdy.one/commit/ab92d8ded69fd869... | arunsivadasan wrote: | Thank you for building something like this!
| xwdv wrote: | Will we ever see the return of hand curated directories of | websites like the old days, categorized by topics and approved by | human review? | closedloop129 wrote: | Coincidentally, such a site was submitted yesterday: | https://news.ycombinator.com/item?id=31387592 | saalweachter wrote: | Wikipedia, maybe? | | The greater problem of curation is that it doesn't scale, and | you need immense human effort to survey and curate both the | breadth of questions -- what's a good table saw? what aspects | of Egyptian culture were exported back to Greece? is HDPE | plastic safe? give me some punk music. -- and also the breadth | of answers, both every website and every type of table saw. | | The lesser is that you cannot curate without introducing a | _voice_ , a set of preferences that may not be universal. | Tastes are not universal, you can't recommend the same band for | everyone. Resources are not universal, regardless of whether | the $10000 table saw is more than 100x better than the $100 | table saw, it's just out of reach of most people. And needs | aren't universal -- a professional cabinet maker and a DIYer | making a chicken coop don't need the same saw. | | There's a set of priors behind every query, and you either need | to get users to frame their queries in a way that captures all | of the relevant priors, or you need to create a variety of | voices that capture different sets of priors and curate answers | appropriate to that voice. Are you asking Norm Abrams, Monica | Mangin, or Shane Wighton for a recommendation on a table saw? | xwdv wrote: | Perhaps there can be a difference between search engines for | answering specific questions, and directories where one may | browse a broad range of topics without any goal in | particular. | westcort wrote: | My key takeaways: | | 1. Almost all searches on my independent search engine are now | from SEO spam bots | | 2. 
In summary, if they break through the current reverse proxy | level protection, options include an invisible ReCAPTCHA (but | given I've sometimes 160,000 requests a day I'd be well over the | 1,000,000 a month free tier limit), requiring JavaScript as per | the web analytics or some Cross Site Request Forgery style | protection (but those would place much more load on the servers), | or CloudFlare (but the searchmysite.net spider is still currently | blocked by CloudFlare as per Some of the challenges of building | an internet search) | | 3. If you were into conspiracy theories you could claim that the | major search engines were trying to stifle the competition, but a | more realistic explanation is simply that searchmysite.net is | being drowned out by SEO spam | | 4. If I'd had a decent amount of real users visiting and never | returning I could reasonably conclude that updating the blog | wasn't the most productive use of my time and effort, but without | any real users in the first place it is hard to gauge whether | people like it or not | | My own independent search engine, https://www.locserendipity.com, | is seeing similar trends. | superasn wrote: | To me it looks like some popular spamming software (like | thebestspinner, etc) just integrated you and now everyone who is | using the software is hitting your site. | | The good news in this case is that it'll be easy to spot the | pattern and block it; the bad news is you're entering a never- | ending cat and mouse game. | larsrc wrote: | Google puts a _lot_ of effort into avoiding SEO spam, but it's a | red queen problem. | melenaboija wrote: | I am totally ignorant about search engines, and I have a question | after seeing all types of comments and projects popping up lately | and criticizing Google results: is it realistic to | think that something similar to Google could exist?
| | It seems to me that there are all sort of tools out there to do | so such as all the public NLP implementations, vector search | engines, ... and I wonder if it is that not everything that is | needed is truly available, it is a matter of the needed resources | to have something working or is just a matter of the products | already existing and not getting traction (and I am not talking | about the other big search engines). | phkahler wrote: | >> This time I'm really not sure what the solution is. | | As with everything internet, the solution is to have solid, | verifiable user identification. I realize the downside is that | sites would love to have all your activity logged under a | verifiable identity, so the other problem is we need to ban | collection of such personal data. | jrochkind1 wrote: | I'm not sure I understand the theory of what motivates the | automated "powered by" searches; can anyone explain it (or an | alternate theory) further? | Teandw wrote: | This guy throws multiple reasons/conspiracies out there on why | the website is really struggling to gain literally any sort of | traction. Web is all bots, search engines not promoting | competitors and being drowned out by SEO spam, yet he's failing | to see the most obvious reason... the reason nearly all websites | don't gain traction... | | Because it's a bad website. It provides no value to the user. I | put in a few search terms and had no relevant search results | back. What use is a search engine that can't find what I'm | searching for? | | Maybe if that was improved he may see traction. | CWuestefeld wrote: | Whether or not his site is meeting his goals is his business. | | I find this a really interesting post, because I'm also dealing | with excessive bot traffic (it's generally about half of my | overall), and specifically how to salvage analytics data when | there's so much noise. 
Seeing what other people are doing to | combat it helps me, regardless of whether you might think of | them as successful or not. | lukev wrote: | I second this. Don't get me wrong, I applaud the concept and | the effort, but this implementation isn't quite there. | | I searched for "document management system comparison" since I | am currently in the process of selecting one for our legal team | at work. Some on-the-ground reports from real users would be | hugely valuable. But this is the classic example of where | Google utterly fails; document management is a 100 billion | industry and there are absolutely no search results which are | not SEO, marketing copy, or astroturfed listicles with nearly | zero value. | | Unfortunately, this website returned even less relevant | results. Not a single result pertained to document management | at all; instead it returned random matches on words like | "system" and "management." | | Whoever solves this problem could definitely unseat Google as | the go-to search engine for most people. So it's a big prize. | But it's also a super hard socio-technical problem, requiring | incredibly sophisticated and powerful tech in a highly | adversarial environment. However, regrettably, it looks like | this attempt hasn't even got the basic search tech down. | marginalia_nu wrote: | Is a comparison of document management systems something you | expect to actually find, as something written by humans? I | wouldn't write such an article, I don't know who would. | | The only people who seem to be writing these types of | comparison articles are spammers. | | I typed this reply without checking, but I checked now, and | yeah -- if you google "document management system | comparison", you get ads for document management systems, and | search engine spam. That's hardly helpful. | oneeyedpigeon wrote: | 2nd result I got from that exact search is an article from | techradar: | | https://www.techradar.com/uk/best/best-document- | management-s...
| | Do you consider that search engine spam? | marginalia_nu wrote: | Yeah, that's affiliate marketing dressed up as a review. | They're getting a kickback for several of the links in | the review. | | The deal on DocuWare is perhaps the most obvious, but the | Abbyy link also runs through an affiliate marketing | redirect service. | freediver wrote: | Typed this search into Kagi and got: | | - This result from an old site https://www.scanstore.com/Sca | nning_Software/Document_Managem... not sure if still relevant | | - A bunch of discussions from reddit and other forums | (probably best lead) | | - One research paper https://arxiv.org/pdf/1403.3131.pdf | | - Listicles grouped together so you can skip them | | - The noncommercial filter gave a few more good results, but | it seems like there is not much 'good' content written on | this topic | | I would definitely not call all Kagi results fantastic, but | it does seem to be better than Google. We are trying hard to | solve the problem of the nonsense on the web (Kagi founder | here). | alx__ wrote: | Thanks for building Kagi! Have been enjoying the experience | of it this past month | kldx wrote: | Got any beta slots to share? | status200 wrote: | I searched "best dress shoes reddit" as a test, and just got a | random list of websites that had the word "shoes" on the page | somewhere, including a Dinosaur Comic from 2008. | | So... yeah. Won't exactly be my first choice of search engine | in the future. | matt_heimer wrote: | Looking at the blog | (https://blog.searchmysite.net/posts/milestone-1000th-site- | in...) I think very little of the internet is in this search | engine. | | It's difficult to gauge the quality of the engine itself at | this point with so little content in it.
| | What I can say is that even remotely presenting the system as | a general purpose internet search engine like the UI from | https://searchmysite.net/ does is going to give people the | wrong idea and make them think the system is bad. To start | with I'd suggest adding the number of sites indexed to the | main search page. | | I also think that the https://searchmysite.net/ portal will | likely never be a destination. I'd suggest trying to promote | it differently: offer a service for OG internet | sites; they opt in to the service because they want a search | widget they can embed on their site that has a filter to search | just that site or all OG sites. Having website categories | would also help so people could search across tech blogs, or | aquarium, or bowling sites, etc. Basically the old web ring | idea but powered by search instead of just browsing a list. | | Since there is a chicken and egg scenario - What you really | need are people that think Google sucks that are invested in | a niche and want to build a search ring out. The "only sites | submitted by verified site owners" restriction needs to go; | you want good curation but this is just too restrictive. I | also think "downranks results containing adverts" is too | restrictive; switch that to "downranks results containing | excessive adverts and SEO spam". | _tom_ wrote: | It doesn't index sites like Reddit, so, not too surprising | Reddit wasn't in the result. | honkdaddy wrote: | Searching for Astral Codex Ten, a popular, well-written, non- | spammy blog which I would expect is indexed... | | Returns only results in which _other_ bloggers are referencing | ACX. Consider me as one of the datapoints that arrived from HN | and likely won't be back, I'm afraid. | m-i-l wrote: | Thanks for your feedback. The idea was for people to submit | sites they like, and search sites other people have liked. | I've submitted Astral Codex Ten, and that site is now indexed | for the benefit of others.
| wccrawford wrote: | I just search Kagi, Google, and DDG for "Astral Codex Ten" | and it was the first result on each. | weird-eye-issue wrote: | Ironically the Kagi search engine is not in the first few | results in Google when you search Kagi (at least in | Thailand) | | And when I did make it to the site, it looks like I have to | sign up to use it? I'm not sure putting a locked gate in | front of a search engine in 2022 makes sense but okay | norman784 wrote: | The whole concept of kagi is to be a paid service (is | still in beta and for now it's free AFAIK), so you pay | money instead of having ads or the search engine selling | your data, use the service that suits best to your | purposes and philosophy. | ipaddr wrote: | The concept in 2022 sounds doomed to fail on many fronts. | A service that claims to offer privacy but requires | identifying payment information. A required email signup | so followup sales emails can happen when the service is | ready. | | Ddg was popular on here until they censored certain | websites. Does this search service censor? | | Sounds like they are trying to tackle privacy but in | reality users of this service will have less privacy. | m-i-l wrote: | Hi, "this guy" here:-) If people come to a site but don't come | back then it is reasonable to conclude that "it's a bad | website", but as the blog entry put it "without any real users | in the first place it is hard to gauge whether people like it | or not". | | Note also that it isn't intended to be a general purpose search | engine, but a niche search engine to try and find some of the | fun and interesting content, e.g. relating to hobbies and | interests, which used to be at the core of the web but which | can be difficult to find anywhere nowadays. | soheil wrote: | How exactly is a "general purpose search engine" different | than a "search engine to try and find some of the fun and | interesting content"? 
| m-i-l wrote: | The general purpose search engines search the whole | internet, and as a result claim that you can search for | anything on the whole internet, even going beyond that to | answer questions which aren't on the internet as such, e.g. | "What is my IP?" and "What time is it?". However, niche | search engines only search specific parts of the internet, | and only claim to be able to deliver results relating to | their specific topic, e.g. you wouldn't ask the search on a | car forum what the weather is today. | soheil wrote: | Ok, but answering questions like "what time is it?" | doesn't subtract from the usefulness of a search engine. | Seems like you're saying it makes your search engine | better somehow because it can't do the above. | dumbfounder wrote: | I am a search guy and I would like you to succeed. But I | don't get it. The name of the site is bland and makes me | think you are a white label search service for websites. | On the homepage it says "Open source search engine and | search as a service for personal and independent | websites." but it offers me no reason why I (or | anyone) would want to use it. The content it actually | searches is random and of no real particular value as far | as I can tell. Also, you are trying to avoid spam sites, | but once you reach a certain size all you would | see is people submitting spam sites. If you blocked | people from submitting you would never get all the | diamonds in the rough you are trying to expose. | | You need to find an actual niche that solves a real | problem people have and can understand and orient | everything you do to tackling that. Then expand from | there. | haswell wrote: | > _general purpose search engines search the whole | internet, and as a result claim that you can search for | anything on the whole internet, even going beyond that to | answer questions which aren't on the internet as such, | e.g.
"What is my IP?"_ | | I think there are two distinct things here: | | 1) Searching the whole internet | | 2) Returning results that aren't necessarily from the | Internet, but instead are convenience features of the | engine | | I understand that you're not trying to replicate things | like "What's the weather today", but when I want results | about <very specific classic car X>, how can you return | meaningful results without searching the whole Internet? | | Put another way, if you don't search the whole Internet, | the results are going to be limited to only the curated | list of sources you do search. This can be useful in its | own way - i.e. if you are positioning this as "search | this list of curated sources", but also means the site | will only be as useful as the curation you provide. | | For example, I dabble with Software Defined Radio. If I | search your site for "rtlsdr", a very popular package, I | get three results. Those results are somewhat | interesting, but I know there's a whole world of content | out there related to rtlsdr that I'm not seeing here. | | So adding a bit to what the parent commenter was saying - | if I'm using your site to look for my particular niche, | and I only see three results when I know there are many | more, I'm not likely to continue using your site to | search for rtlsdr. | | It then leads me to wonder what I _can_ search for, or if | there 's much utility to searching at all. | | Please take these comments in the spirit they are | intended - I think a search engine that helps find things | on the "old" web, or just helps me cut through all of the | SEO optimized crap is a great idea. It's something I want | to use. But I can also understand why someone might try a | search and move on. | | Just an idea, but maybe providing a way for independent | creators to submit their site for indexing (or for an | interested user like me to submit a site) would help | increase your reach. 
| _tom_ wrote: | Google is demonstrating this nicely now. It's become almost | useless, replacing the query I actually typed with | something more popular. And when that doesn't happen, the | results are likely seo'd junk. (The latter is not purely | googles fault, it's just that smaller search engines aren't | targeted as much). | | Try looking up a phone number (by number) in google for a | great example of nothing but spam results. | native_samples wrote: | Well, it's worse than that. The whole schtick is that it's only | pure, real content by folksy people like us. The top reason to | use it on the about page is: | | _Indexes only user-submitted sites with a moderation layer on | top, for a community-based approach to content curation, rather | than indexing the entire internet with all of its spam, "search | engine optimisation" and "click-bait" content._ | | So I tried searching [kotlin] and got 123 results ... | | https://searchmysite.net/search/?q=kotlin | | ... of which the 9th result is SEO spam! It reads: | | PersonalSit.es | Yes we got hot and fresh sites | https://personalsit.es/ ... | Shandilyahttps://msfjarvis.devTagsandroid, kotlin, rust Go to | feed Go to siteradoslawkoziel.plradoslawkoziel.pl ... | | That looks like junk to me. How is that possible if what the | developer says is true, that it's all verified and pre- | moderated? | m-i-l wrote: | Thanks for your feedback. It is just the home page which is | moderated before indexing (and reviewed annually). When | https://personalsit.es/ was listed it looked legitimate, but | agreed the results for that site look infected with spam now. | I've found at least one other site today where the home page | and blog look genuinely legitimate, but which has a complete | spam subdomain, quite possibly the victim of a subdomain | takeover attack by spammers. I've delisted both. | Unfortunately it isn't an easy task trying to defeat a vast | army of well funded spammers in your spare time! 
| stevenicr wrote: | As someone that has a few sites that can get user generated | content - I must say that it saddens me that spam stuffing | would get the main domain and site delisted - and likely | never re-listed. | | A couple times a year I get hit with a bunch of spam blogs | / user profiles and when I discover and clean them up, I | assume that at least google/bing see that the spam-to-real | ratio has been fixed and rank it higher again.. but I'm not | sure really, especially since google took keywords out of | click traffic. | | What would be nice is something like the 'site has been | hacked page' that I've unfortunately seen a few times for | sites - that lets you clean it up and submit a re-check | it's clean now button thing. | | I've also suggested that google make it so you have to | vouch for links which would expose people using the spam | stuffing techniques.. kind of the opposite of the disavow | tool - but they never read any of my disavow submissions. | | Sucks to get spammed, fight spam, and then be penalized for | it more ways than one. | | One of my older buddypress/wpmu sites I recently turned off | blog creation for users because it's just so tiring | fighting the spammers - which are only doing what they do | because google - meh. | salawat wrote: | Your problem is that SEO are under no obligation to be | truthful with you, and will likely pull bait and switches | as far as making accounts if it ever seems like your site | will catch on. | | Note, I nearly spit my food the first time I was at lunch | and someone was talking about SEO a few tables away...oh a | decade or so ago now. It's sad it's gotten this bad. | pwiercinski wrote: | I guess the use-case just isn't that popular. It's a good | website if you want to learn what some devs are up to, but | barely anyone cares about that. Most people use search engines | to find answers to their questions and Search My Site just | doesn't work like that. 
| fortran77 wrote: | I found a few pro-terrorism sites here. I don't think it's the | OPs purpose, but he's being duped by the few users that do look | for sites like this where they can add a "curated link" to | their ISIS or Hezbollah or Hamas site with a slick facade. | m-i-l wrote: | Thanks for your feedback. If you can drop me a note I'll | remove those sites - it is against the Terms of Use at | https://searchmysite.net/pages/terms/ (not that spammers, | terrorists, etc. care about complying with a Terms of Use). I | think legitimate looking home pages as a front to other non- | legitimate content is a genuine problem this model doesn't | solve (also noting that some of those home pages may even be | genuinely legitimate but have been hacked e.g. via a | subdomain takeover). | [deleted] | Jleagle wrote: | I'm getting lots of `No results found for query = xxx.` | rightbyte wrote: | That sounds like a feature actually, being honest about no | hits. | XCSme wrote: | If the internet is dead, is there anything left that's "alive"? | The mobile app stores are also filled with crap[0] and it seems | that the ratio of spam content vs real content is getting close | to infinity. | | [0]: https://youtu.be/E8Lhqri8tZk - 1,500 Slot Machines Walk into | a Bar: Adventures in Quantity Over Quality | john-radio wrote: | Since everyone in this thread wants to jump down OP's throat | about the quality of his web site, another interesting search | engine is millionshort.com, which allows you to filter out the | top N web sites from the results of your search. It's a great | tool for looking past sites with good SEO; all you have to do is | fiddle with the value of N. 
| | For example, searching for "electronic music box" as /u/ajnin | suggested, with the top 100K web sites removed from the results, | filters out the following: | | > These 23 sites were removed from your results: | | > alibaba.com (1 result removed) | | > aliexpress.com (1 result removed) | | > allaboutcircuits.com (1 result removed) | | > amazon.com (2 result removed) | | > apple.com (1 result removed) | | > bestreviews.com (1 result removed) | | > ebay.com (1 result removed) | | > etsy.com (2 result removed) | | > facebook.com (1 result removed) | | > instructables.com (2 result removed) | | > lightinthebox.com (2 result removed) | | > lumberjocks.com (1 result removed) | | > mapquest.com (1 result removed) | | > reverb.com (1 result removed) | | > twitter.com (1 result removed) | | > wikipedia.org (1 result removed) | | > yelp.com (1 result removed) | | > youtube.com (2 result removed) | | And the top result ends up being https://midiguy.com/. | mdoms wrote: | Million Short also has an option to remove only e-commerce | results which is invaluable if you still want results from | sites like Twitter, Wikipedia and YouTube but don't want online | shopping spam. | consp wrote: | Would this also work for the fake-sites-stealing-text-to- | look-legit sites since they quickly end up in the top | results? | blisterpeanuts wrote: | That's an outstanding concept. One problem though: wouldn't it | also filter out high quality curated results? | trinovantes wrote: | If this was the spam for a search engine (almost) nobody uses, it | makes you wonder how much abuse the major search engines face | Nextgrid wrote: | My understanding is that this wasn't about gaming this | particular search engine itself, and more about the spammers | using the search engine for its intended purpose of finding | spam-free content so they can then use this content as copy for | their spam posts. | sonicggg wrote: | I'd assume they have more control though. 
I noticed whenever I | use Google after connecting to NordVPN, it requires a captcha | the first time. | mensetmanusman wrote: | They face a lot. I always browse with incognito on safari, and | I quite often have to do captchas on google and bing etc. to | prove I'm not a computer... | | If there is money involved and value in being able to trick | search engines, I'm not surprised it's a thriving business of | grift. | hihihihi1234 wrote: | Why do you use Bing? | the_third_wave wrote: | Why don't you like diversity, in this case diversity of | search engines? Bing may have its problems but so does | Google, the way to handle this is to either use many | different engines or to use a meta-search engine like | Searx. The latter is far easier so it is what I do. Just | relying on a single source makes you an easy target for | those who control that source. | maven29 wrote: | You should try Bing again. Bing doesn't mess with your | query terms as much as google does. If you aren't a zoomer | typing out whole sentences into the search bar, the fact | that Bing doesn't substitute your jargon for more general | terms will help with spending less time in the search | results. | | I just got tired of iterative refining not working as it | used to in the past. I once got results for databases when | searching for decibels (despite spelling it out in full), | so it isn't just a matter of semantically related terms. | | The rewriting is just braindead and the ranking algorithm | falls for generated content way too easily. Google | shouldn't be trying to teach me DHCP when I am clearly | trying to recall a config item, but then it gets worse when | you read the infobox and realize that it's written at a | toddler level of comprehension. | | This is with the caveat that all search engines rely on | some level of personalization, so you might be able to get | good results on google if they deem you worthy. | ricardo81 wrote: | Indeed. 
There are various SEO "rank tracking" services that | scrape millions of SERPs a month. | thelittleone wrote: | Complete SEO noob here. Can someone help explain what these bots | are trying to achieve? There is mention in the blog that they're | trying to uncover ad free content. | DethNinja wrote: | Only solution is a webring based federated search engine. | | 1. You just put /webring.txt to your website. It shows links to | other websites with a hard limit of 100 websites. | | 2. To combat spam and bots, search engine does accept blocklist | as an input. So other people can curate the content. | | 3. People can personally rank the websites they like, so webring | of the said website gets ranked higher for that specific user. | This can be a community effort too. | | 4. Search engine itself should be under a commercial license so | that other people can keep building it and add ads if they want | to commercialise it. | | I'm too busy to spend time with this but perhaps one day I can | start coding it. | | I'm convinced that search engine model of early internet is just | dead, webrings are the way forward. | mcv wrote: | Nice idea, but of course if it gets even slightly popular, | every SEO content farm will immediately generate 10,000 sites | that all list each other in their webring.txt. | TheRealDunkirk wrote: | If there's a game to play, people will write software to play it | for their profit. | | I guess it's back to web rings. | robmay wrote: | While I'm generally a blockchain skeptic, this is actually a good | use for a blockchain - to "register" bots so they have an id, and | an owner, and you can measure their behavior. There are going to | be more bots interacting with more sites, so, this could work. | PaulHoule wrote: | Spammers badly need spam-free content so they can mix some | legitimate links with the junk they spew. 
| | One great Black Hat SEO trick is to find where your competitors | are getting clean links and insert your own links there so they | do your spamming for you. | closedloop129 wrote: | Why does the mixing work? Shouldn't Google and Bing know what | the original content is and automatically identify the sites | that are copies? | PaulHoule wrote: | Here's an example. | | If I have (say) 15 affiliate marketing sites, I might make a | link aggregator site that looks a bit like Hacker News. | Except I won't make just one, I might make 30 of them. | | These might subscribe to a bunch of RSS feeds and randomly | select articles, maybe 10% of the links on those sites go to | my affiliate sites. | | If you can inject spam into those RSS feeds that system I | describe would amplify it and this could have effects ranging | from: you are using my marketing machine to promote your | content to my sites getting really obnoxious and getting | blocked. | | ---- | | "Duplicate Detection" is a necessary technology for web | search because sheepeople copy themselves and other people | without bound. It cuts both ways because Google and Bing have | no sure way to know which one is the copy and which is the | original. So (1) they aren't completely efficient at removing | duplicates and (2) duplicate detection can be turned into a | weapon against you, just like that link aggregator. | randomstring wrote: | Search traffic has always been mostly automated spam bots. | | Even back in the Open Directory Days when we powered part of | search.netscape.com I estimated 80+% of all search traffic was | automated. At least most of it self-identified with the same Java | useragent. | | Later when working Topix, despite being a news search engine, | most traffic was bot traffic. Most included the word "mortgage" | in the query. Topix specialized in localized content, and that | was very popular for SEO scrapers. | | Lastly at Blekko, I estimate 90+% of traffic was automated. 
By | then maybe half or more learned to change the user agent. Most | used HTTP/1.0, a dead giveaway as no browser still uses 1.0. This | was a major aspect in Blekko's load shedding strategy. If the | servers started to get overloaded, we'd start bouncing suspected | bot traffic to a redirect that would show in the logs. If there | was a human with a modern browser running javascript on the other | end, they would get redirected to a link that wouldn't get bounced. I | would check the logs weekly to see if any humans got caught. None | ever did. This was a huge monetary savings, you only need 1/10th | the servers if you can safely ignore the bots. | | Often it's endless repetition of the same keywords in a random | order with a place name appended, or prepended, or inserted, over | and over. Often variations on known monetizable SEO keywords. | However, much of it doesn't make any sense. | | I don't have any insight into Google's numbers but I would | conservatively estimate 95% or more of all their queries are | automated bots and not humans. And the level of spy-vs-spy going | on for Google CPU resources vs SEO bots is probably pretty | evolved by now. I stopped tracking many years ago when Google | switched to densely packed obfuscated javascript for page | renders. Maybe this is part of why automated queries are so high | across the web, maybe google is too hard to crack for most. | superjan wrote: | Almost sounds like it is justified to add a javascript crypto | miner to your pages to make the bots pay for the use of your | service. | randomstring wrote: | The point is that the vast majority of scrapers do not bother | to run javascript. | [deleted] | stevenicr wrote: | appreciate the sharing of info here. | | I have recently been discovering and combating some similar, | albeit much smaller issues. | | I've been finding that a bunch of my recent 'resource sucks' | have been constant spidering from petal-bot, semrush bot, | alibiba-bot and a few others.
| | Using the WordPress plugin stop-bad-bots and its logs has been | eye-opening for me recently. | | I understand many of these are not directly dark-seo related, | but their aggressive nature is hurting the cpu and memory | limits of some of my servers and sites so it's a big issue | regardless of the intents behind them. | | (kind of) glad someone else has dealt with these issues, and | glad to see some of the 'how' for handling, identifying, and | some actual real numbers for the impacts, as I've been guessing | some of these things in my small projects, indeed it's a real | thing. As well as a practical issue to pay attention to and | work on. | munk-a wrote: | Could you possibly use your robots.txt to redirect them all to | ad-laden pages to try and subsidize your legitimate users? | buro9 wrote: | This is for comment spam. | | It's trying to find a long tail of popular but not top listed | blogs for the purpose of posting comments with the much desired | links to the SEO target. | Veen wrote: | Does that work any more? I thought everyone put nofollow | attributes on comment links. | 0des wrote: | If it didn't work, would you still see it? | Veen wrote: | Yes, because to sell it you need someone to believe it | works. That's independent of whether it actually works | (although this does answer my initial question). | [deleted] | hinkley wrote: | I am slowly convincing my coworkers that deploying the exact same | binary as two different 'services' is a significant tool to have | in your toolbox. Some disaster recovery work we're doing is | making it a much easier sell. | | I'm really just combining two very old tricks here. Traffic | shaping based on class of service for two different requests, and | for two different classes of users. | | Segregating bot traffic improves consumer experience. Segregating | admin traffic from both allows you to set an upper and lower | bound on availability.
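A minimal sketch of the class-of-service split hinkley describes, not his actual setup. It assumes three hypothetical pools (`admin`, `bots`, `humans`) that all run the same binary, a hypothetical `/admin/` path prefix, and two crude bot signals borrowed from elsewhere in this thread: HTTP/1.0 requests and self-identifying user agents. A real deployment would need stronger signals than these.

```python
from dataclasses import dataclass

# Hypothetical pool names: each pool would run the *same* binary,
# deployed as a separate 'service' with its own capacity limits.
ADMIN_POOL = "admin"
BOT_POOL = "bots"
HUMAN_POOL = "humans"

# Assumed heuristics only (self-identifying agents, HTTP/1.0).
# These substrings are illustrative, not a vetted bot list.
BOT_AGENT_HINTS = ("bot", "spider", "crawler", "java/")


@dataclass(frozen=True)
class Request:
    path: str
    user_agent: str
    http_version: str  # e.g. "1.0" or "1.1"


def looks_automated(req: Request) -> bool:
    """Return True when the request matches one of the assumed bot signals."""
    if req.http_version == "1.0":
        return True
    agent = req.user_agent.lower()
    return any(hint in agent for hint in BOT_AGENT_HINTS)


def pick_pool(req: Request) -> str:
    """Choose which deployment of the shared binary should serve the request."""
    # Admin traffic is checked first so it never competes with bots or humans.
    if req.path.startswith("/admin/"):
        return ADMIN_POOL
    if looks_automated(req):
        return BOT_POOL
    return HUMAN_POOL
```

Because suspected bots land in their own pool, that pool can be throttled or starved under load without touching the capacity reserved for human and admin requests.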
| FargaColora wrote: | You mention the "Dead Internet Theory" (not heard that phrase | before!). | | I agree: the WWW Internet is dead, that is your problem. No-one | visits websites anymore, everyone has moved to the 10 biggest | websites and all data is now siloed there. | | If I want to search for something topical and relevant, I go to | Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, | Discord etc. | | The general Internet is dead: it's just legacy content and spam. | | If you think it's bad for you, imagine what it is like for Google | Search! Their entire business is indexing a medium which no | longer has any relevancy. People complain that Google no longer | delivers good results. But what can Google do? The "good content" | is no longer available for them to index. | | Want to become rich? Make a search engine which indexes the fresh | relevant data from the big siloed websites, and ignores the | general dead Internet. | marginalia_nu wrote: | I built my search engine in part to explore whether this was | actually true, and I don't think it actually is. | | There's still a lot of organic human-made content still out | there, possibly more than ever, it's just not able to compete | with the SEO industry that completely displaces it from Google | and social media. | kodah wrote: | Agreed, the general internet is not dead, but the majority of | internet users are on Facebook, Twitter, Reddit, HackerNews, | Instagram, Google Maps, Discord etc. | | From my perspective, we onboarded a lot (if not most) people | to the internet after 2007 (the explosion of social media). | People sticking to big sites really speaks to an inability to | explore the larger internet and a lack of knowing _why_ you | would even want to. | alxlaz wrote: | This matches my findings 100%. 
The WWW is active and | bubbling, but virtually all the cool websites I've found in | the last 10 years or so came through friends, small IRC | channels, or more recently through marginalia.nu :-). Google | and friends are facilitators for the SEO and tracking | industries, so of course they have zero interest to | prioritize these things over content spam -- their whole | business runs on content spam. But the WWW is as alive as it | gets. | dylan604 wrote: | And who uses your search? I had never heard of "you" until | just now. And there is the problem with "new" search engines. | Unless you can come up with what would have to be one of the | greatest ad campaigns the world has ever seen, no significant | number of users will know you exist. Where does the money to | pay for that ad campaign come from? How will a search engine | generate money to stay relevant? Once people see you becoming | relevant, they will figure out how to game your system. It's | just the nature of the beast. I don't think I'm being overly | cynical about this either. | marginalia_nu wrote: | Why would I need to generate money to stay relevant? | dylan604 wrote: | <edit>The first </edit>relevant was the wrong word. | sustainable would be more appropriate. on the assumption | that hosting the search engine isn't free, and unless it | is supported by a generous benefactor it will need to | have a way of generating money to keep the servers | running. | marginalia_nu wrote: | I'm self hosting so my operational cost is like $50/mo. | throwaway14356 wrote: | then he must be relevant | fifticon wrote: | I second that independent sites exist - I maintain my own | website on a personally run server. There are dozens of us! | to quote a quaint phrase. | api wrote: | All open systems are destroyed by spam once they become | popular enough to be profitable targets. This will eventually | happen to the Fediverse too. 
If there is money to be made | pissing all over the commons, the commons will be pissed all | over. | | It even happens to proprietary silos if they are too open. | Look at how many bots and spammers infest social media. | Propaganda and disinformation can also be considered a form | of spam. | | I realize this sounds cynical but don't shoot the messenger. | It's just something I've learned watching the Internet evolve | since the mid-1990s. Spam eats everything it can. | | IMHO the future is enclaves and invite-only communities. The | Internet is a dark forest. | marginalia_nu wrote: | As old open systems are destroyed, new ones are created to | replace them. The Internet exists in a constant state of | rebirth and transformation. You really can't step into the | same river twice. | nonrandomstring wrote: | > You really can't step into the same river twice. | | I love the maxim and philosophy of eternal refreshment. | | Seems like the problem is more akin to having nuclear | waste dumped into our rivers though. | pixl97 wrote: | It's not cynical, it's how every system in nature works. | Everything alive must develop an immune system or it is | attacked and eaten. | NoGravitas wrote: | You are probably right about the future; not necessarily | because of spam, though that's a part of it, but just | because of the toxicity of global, open to the world, | mostly public social media. The Fediverse has mostly | coasted by so far on obscurity, but it's not great, and | it's bound to get worse. All of my online socializing these | days is either through short-lived pseuds on topic-oriented | fora, or invite-only Matrix rooms. | pwdisswordfish9 wrote: | > This will eventually happen to the Fediverse too. | | Oh, don't worry, the Fediverse will never catch on. | ffhhj wrote: | Why? Serious question. | indigochill wrote: | How do you surface organic human content?
I happen to linger | around the fediverse/tildeverse sphere where I see organic | content from people I personally have a direct (digital) | connection to (and I started self-hosting my music after Epic | bought Bandcamp), but I'm not clear on how I'd go about | digging that kind of stuff up in the more general case. | marginalia_nu wrote: | I do a traditional web crawl and exclude anything that | looks too much like it wants a high google ranking. Nothing | to it. | ratww wrote: | This might be controversial, but I wish Google would | exclude those websites too. | | Google started punishing keyword spam, then it started | punishing black-hat comment spam. Even YouTube | backtracked on the "videos have to be 10 minutes to | rank". | | I wish they would do the same for carefully manicured SEO | content farms too, as those sites are causing harm | worse than keyword-spammer sites did. | marginalia_nu wrote: | They're probably doing all they can. The problem is their | dominance, which means both that they have effectively an entire | industry looking for loopholes in everything they do, and that | there are legal considerations (arbitrarily punishing | individual smaller actors might skirt on the territory of | anti-competitive behavior) | ajmurmann wrote: | I love your search engine. Should I stop recommending it | to friends to keep it safe? | | I jest a little bit, but your comment genuinely makes me | wonder if Marginalia++ is search results - Google - | Marginalia | sdoering wrote: | I fear that Google also has a conflict of interest here. | A lot of these non-optimized sites are not interested in | making money via ads. So Google wouldn't profit | additionally from leading people there. | | And a lot of people (myself often times included) are | looking for a quick answer. A good enough answer. So | good-enough, SEO-optimized content is being surfaced. The result of an | optimization war on both sides combined with the | inevitable monetary interests. | | I don't have a solution.
Sadly. | galangalalgol wrote: | Does anyone have an ad free search engine? You'd start | with blacklists from ublock origin, pi-hole, and similar, | don't bother even crawling those, then have easy | reporting for new or self hosted ads. Not much money in | it if any, but it would be refreshing. Might even have a | mode to nix anything with a payment method on the site, | or that links to a site with a payment method. | ajmurmann wrote: | > Does anyone have an ad free search engine | | kagi.com search.marginalia.nu | EVa5I7bHFq9mnYK wrote: | Maybe back to Yahoo model of the 90s? Manually created | collection of curated links? | datavirtue wrote: | Yes. We have enough users now. | ratww wrote: | I think there's two kinds of SEO spam going on. | | The black-hat kind is definitely made to extract money | from ads. But those are easy to avoid for web veterans | IMO. And I also feel that Google is doing its part, even | though it's costing them money from those sweet ads! | | But the white-hat kind, also known as content marketing, | is made to let legit companies _save_ money. Instead of | paying for Google Advertisement, they get traffic by | means of organic content. Think "Michelin Guide" or "Red | Bull". Which is a jolly fine idea and responsible for a | lot of good stuff, but the problem is that this has been | taken to extremes, and now the web is littered with low- | effort content made by freelancer writers getting | peanuts. | | I would personally prefer if those freelancer writers | were doing 10 interesting Red Bull articles per month | rather than 500 rehashes of contents from other websites. | But who am I to judge. | | In the news industry things are also very similar. | Nextgrid wrote: | The "white-hat kind" can trivially be filtered out (or | deterred) by downranking any of the crap these marketers | use to measure their conversion rate - analytics, etc. | ratww wrote: | I love this idea. 
Would be nice to see it in a search | engine, or at least a browser extension showing how much | analytics junk a site has before you click it. | Nextgrid wrote: | Kagi has a non-commercial filter that I suspect uses the | presence of ads/analytics as a signal. | ysavir wrote: | It's not about surfacing organic human content, it's about | only indexing organic human content. The problem is | automated indexing. So long as indexing works according to | defined rules, the advantage will be to those able to shape | their content to those rules, and the spammers and scammers | will win. | | An idea I've had for a few years is making a social-network | based index engine. The only pages that get indexed are | pages that users themselves mark as worth indexing, and the | only pages returned in your results are pages that were | marked for indexing by people you added to your circles, or | the people in their circles, or the people in _those_ | circles, etc. (probably up to 5 or 6 degrees of separation). | nyokodo wrote: | > up to 5 or 6 degrees of separation | | So basically everyone on earth? | ysavir wrote: | Alright, 2 or 3! | kmeisthax wrote: | ...so, blogrolls? | ysavir wrote: | Not familiar with blogrolls, but not quite. The idea is | more to have a standard search engine user experience, but | with the requirement that each result is vetted by | someone the user trusts, or trusts by proxy. | pixl97 wrote: | Welcome to the billion dollar question. Any place that is | authentic will face the zombie horde attempting to fake | authenticity in order to capture attention. | tomxor wrote: | I think you're _almost_ right, but it's not necessarily | authenticity... I think it's just money. | | Large "authentic" search engines can exist to serve the | rest of the web, those personal blogs and other small | communities.
Those sites have a natural tendency to not | be trying to turn everything into a revenue stream, so if | that was the prerequisite for an engine, it would be a | perfect match and naturally dissuade marketing types. | pixl97 wrote: | Authenticity is worth money. | | When you have a 'real' community you're talking about | real people with real salaries and desires, add in that | you tend to develop a real trust between members. Think | of this as fertilized soil. You can grow crops in it, but | weed seeds will eventually land and try to take over it. | | HackerNews is a good example of this, it takes a healthy | amount of moderation to keep things on topic where things | like politics get pared pretty ruthlessly. If for a | minute Dang gave in and found ways to additionally monetize | the forums, something that would be profitable for a | while at least, things would start down a bad path. | sdoering wrote: | I can only agree with my sister comment. I find this | industrialized web more and more shallow and taxing to use. | | While professionally I need to help (smaller, local) clients | to reach their audiences I become more and more weary. | | It is like walking through a supermarket with industrialized | fast convenience food shouting in bright colors and | advertising while ultimately not nourishing me like slow, | real food could. | | I am still looking for this digital slow food movement. | nonrandomstring wrote: | > I am still looking for this digital slow food movement. | | https://digitalvegan.net | | Please read it, and if you enjoy it please suggest it to | friends. | Vladimof wrote: | I added it to my list of search engines on Firefox... your | favicon is really small, that's on purpose? | ColinHayhurst wrote: | Agreed. | | > If I want to search for something topical and relevant, I | go to Facebook, Twitter, Reddit, HackerNews, Instagram, | Google Maps, Discord etc. The general Internet is dead: it's | just legacy content and spam.
| | The "general" Internet is not dead. Though if you only want | to participate in Facebook, Twitter, Reddit, HackerNews, | Instagram, Google Maps, Discord you might well think that. | | Users of marginalia (author above), Mojeek (disclosure: CEO) | and others [0] are well aware that there are riches of | organic human-made content; from years back and new. Yes, a | lot of noise too, which Google has a bigger (SEO) struggle to | compete against. But still there is good and different | content available. | | To find good content, using search, you need to use "search" | engines which enable discovery, as Google used to do. I | stress the "search" as the emphasis of Google, Bing and thus | their syndicates is increasingly on being "answer" engines. | | [0] https://seirdy.one/2021/03/10/search-engines-with-own- | indexe... | mc32 wrote: | Sounds like we're back to AskJeeves and a number of failed | answer engines from a couple of decades ago! | ColinHayhurst wrote: | AskBERT but now MUM knows best. | tmaly wrote: | Everyone is trying to game the Google algorithm. The net | result is all this long form content and cooking recipes | that are 10 pages long. | | There seems to be a big disconnect between a typical user's | attention span and the length of a post. | ajmurmann wrote: | I thought the recipe thing was to be able to copyright | them | Domenic_S wrote: | > _The "general" Internet is not dead._ | | For some things it is. Good luck getting a non- | sponsored/SEO-gamed review of a kitchen appliance or | particular vacation mode such as a cruise. It's | flabbergasting. | | Most times I just stick "inurl:reddit.com" in my search and | _try_ to get discussion threads about the thing I'm | researching, but even that's getting filled up with shills.
| ColinHayhurst wrote: | Result #1 & #2 for kitchen appliance review (your | personalised/local results might vary): | | Google: | | https://www.expertreviews.co.uk/home-garden/home- | appliances | | https://www.goodhousekeeping.com/appliances/ | | Bing: | | https://www.which.co.uk/reviews/fitted- | kitchens/article/plan... | | https://www.goodhousekeeping.com/appliances/ | | DDG: | | https://www.goodhousekeeping.com/appliances/ | | https://www.which.co.uk/reviews/fitted- | kitchens/article/plan... | | Marginalia: | | https://www.infiniteeureka.com/shop-markdowns-on-small- | kitch... | | http://www.fullyramblomatic.com/essays/sarah.htm | | Mojeek: | | https://www.appliancesreviewed.net/ | | https://busybakers.co.uk/category/kitchen-appliance- | reviews/ | [deleted] | FargaColora wrote: | Most of these are spam. They contain affiliate links to | Amazon to buy the product which is being reviewed, | therefore the review cannot be trusted. | | "Which" looks to be the exception, but that is a paid-for | service. | | It's a sad state of affairs. | kelnage wrote: | I understand your opinion about affiliate links - but I | use several review websites that use such links for all | products they review, and have both positive and negative | reviews for products. So I wouldn't say it necessarily | follows that affiliate links = biased reviews. | throwaway894345 wrote: | I think search engines are broken, but the Internet | itself is probably not "dead". It's just our | accessibility to that information. That's not super | helpful until we have better search engines (which steer | us away from this SEO stuff), but the good news is that | building a better search engine is easier than | resurrecting the Internet. In particular, there's a good | chance that a niche, naive search engine might be able to | significantly improve accessibility (e.g., high rankings | for pages that answer user queries in the fewest bytes).
| marginalia_nu wrote: | ¯\_(ツ)_/¯ | | http://www.jitterbuzz.com/indmix.html | | http://www.alaska.net/~akpassag/ | FargaColora wrote: | These websites seem to be last updated decades ago, which | is prehistoric to most casual browsers. There's no doubt | there is great content on the general internet, but these | examples I would classify as "legacy". | marginalia_nu wrote: | I can see why the website owners would be interested in | getting traffic to recent websites, but why would you be | interested in recently updated websites? | pmontra wrote: | I take myself as an example. | | People that know me and don't meet me regularly might know | the URL of my web site and might care to look at it once per | year and check if there is something new. Usually pictures | and tales from holidays. Covid made those holidays less | memorable so I haven't made any updates since fall 2019. People | that meet me regularly don't need that website, I'm telling | them the tales first hand and showing them the pictures | without being obnoxious. I guess that this website is a | target for your search engine except it's not in English and | your search engine seems to want English search phrases. | | I don't have anything of value to share on a public chat like | Twitter and I don't have an ego to pretend I do. I also don't | use Facebook anymore. I go there once per year to like the | messages that wish me happy birthday. I think it's polite to | do so. All my media production is on WhatsApp or Telegram in | group chats with people I know in real life. | | If I really cared about producing content for the world I'd | probably be using Twitter, Medium or the fad of the year and | they'd take care of my SEO (do they?) or I'd be trying to | score points on StackOverflow. | | To recap: I never intended to compete on SEO. I'm really OK | that my website is only for friends and spreads by word of | mouth.
It probably never did, I bet it's been on a flatline | since I created it 20+ years ago. | captainmuon wrote: | But Twitter, Reddit, HN, and most other such places are just | websites and can be indexed fine. Same with Wikipedia, which is | very much a silo (they don't have regular links in text in the | hypertext spirit, but only footnotes). | | Facebook and Instagram are more of a walled garden, like Quora, | but there is a lot of junk there anyway. | | It's sad for the WWW, but I don't really think it is a | fundamental problem for search engines. In fact Twitter for | example gives a direct pipe to Google. If you tweet something, | it is immediately findable. Similar for StackExchange, but | there I think the site is so "small" that Google can afford to | just continuously index it. | ratww wrote: | Twitter and Reddit still can be indexed, but they've also | become increasingly hard to use without an account. Reddit | doesn't let you fully expand threads when you're unlogged. | Twitter limits the amount of things you can read and shows a | modal. Both of them heavily limit usage on mobile devices | without installing an app. | | Sure, an account is free but might require giving information | you don't want to give. Twitter asks me for a phone number a | few minutes after creating an account, even if I don't post | anything). Reddit at least lets you skip giving an email. | | Sure, there are workarounds such as using lite versions (old | Reddit, mobile Twitter), but that's not known to all people | coming from a search engine. | | It feels as if HN are the only one that's not a partially | walled garden yet (and Wikipedia of course). | airstrike wrote: | > Reddit doesn't let you fully expand threads when you're | unlogged. | | that's what old.reddit.com is for! | FargaColora wrote: | old.reddit will be gone soon, it is inevitable. | Especially once they go public. 
| ntauthority wrote: | Isn't it a bit ironic that a site - or its operator - | 'going public' means all the content on said site | actually 'goes private'? | aceazzameen wrote: | Yup. It's bound to happen. And when it does, Reddit will | no longer exist in my eyes. | azemetre wrote: | Agreed. IDK how I feel about Reddit. I've been on it | since 2010 when Fark lost its spark. I remember some | great times but a lot of it was "junk" content that in | the end was very meaningless. I wish I could say I used | it to develop my career in tech but that isn't true | either; I use specific blogs, books, and tutorial sites | to learn instead. | | I suppose I mostly view it as a continuous party, yeah | it's fun if you attend but after a few hours I wish I was | doing something more productive. | ratww wrote: | Exactly, I mentioned it. But not only is it bound to go | away sometime, it's also not trivial to find for anyone | who's not an expert Reddit user, unfortunately. | TheRealDunkirk wrote: | And isn't it great to get a link to Reddit or Twitter, and you | click the link, and try to navigate to the comments for | context or the answer, and you go to click the link to expand | it, and then you get a demand to log in and install their | app? Don't talk about walled gardens and not include Reddit | or Twitter just because they let you look at one brick before | demanding their tax. | [deleted] | hn_throwaway_99 wrote: | Doesn't _this_ site, and all of the content it links to, pretty | much disprove your theory? | | Yes, sure, I often do go to the "top sites" when searching for | content, but I still usually start at Google. And, despite all | the SEO spam, Google still does a fairly decent job of landing me | on, for example, the appropriate Wikipedia page, Stackoverflow | post, travel site, etc. | mrtksn wrote: | It has been dead for a while now and the whole society feels it | globally.
Things were getting so good, then things became | horrible and whoever cracks the path to the good stuff again | will find great riches at the end of the path. | dageshi wrote: | I agree with you to an extent. The web is less useful than it | used to be. BUT I would say a lot of that usefulness has | diverted into youtube. There are people who would previously | have made sites who are making youtube videos instead which of | course is owned by google. | Jenk wrote: | > If I want to search for something topical and relevant, I go | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google | Maps, Discord etc. | | High chances you will find a link to an external site over | content actually on those big named sites though, right? That | tells us the organic web isn't dead, it's just hard to | discover/navigate - because of SEO wars, most probably... The | problem isn't the lack of content, it's the number of shitty | spammy sites standing in your way of the sites you actually | want to see. Like a sleazy salesman trying to direct you to the | crap laden three wheeled rust bucket when you were heading | toward the family sedans. | altairprime wrote: | If you want to be rich, solve search without full-text indexing | of sites. Pagerank only ever worked because of human curation | of webrings. Full-text search made it easier to find content, | and opened the door for spammers. The only viable route forward | for search will be to replace full-text indexing with human | curation, somehow. Solve how to scale that up instead, so that | when everyone else realizes we need it for the health of the | Web, you're ready. | [deleted] | shortformblog wrote: | I think this is a tad reductive, but I will say that we sure | let a lot of big companies convince a huge portion of the | population to create all of their content on platforms that | they have no real control over. | | The problem is, many of them didn't realize this was a problem | until recently.
| | That said, plenty of exciting stuff is happening outside of the | walled garden, as long as you know how to find it. | Gravityloss wrote: | And not only did this happen already over a decade ago, a lot | of the current internet users have never known anything else. | | We had a discussion with coworkers and somebody mentioned | irc. Explaining to younger colleagues what it was and that it | was not a product of a company, but operators had servers | that formed a network, and it was more like infrastructure. | Felt weird. | Elvie wrote: | isn't Discord a bit like IRC used to be? | ori_b wrote: | How do I connect to a self hosted discord, and then | connect it to my friend's self hosted one? | | And where do I get the RFC for the protocol so that I can | write my own compatible implementation? | | IRC isn't a product. It's a standardized protocol | sufficiently simple to implement in a day or two. | kasey_junk wrote: | Most of the kids in my 3rd grader's peer group understand | federated infrastructures quite well because of Minecraft. | | Perhaps it wasn't the federated nature of irc that was | surprising but the fact that it was irc? | mst wrote: | Isn't minecraft more decentralised than federated? | | IRC networks usually have multiple servers connected | together (historically, often run by a bunch of different | people) and I didn't think people self-hosting minecraft | servers usually did that? | shortformblog wrote: | I think honestly it highlights the power of marketing as | much as anything else. In some ways, building an open | network is always going to put you at a disadvantage to a | company that can throw money at user acquisition and PR | teams. That federated networks like Mastodon have seen | growth reflects the fact that word of mouth still means | something in 2022. | NicoJuicy wrote: | The big siloed websites are just indexes of fresh content | though. | | With a generic way to place comments on it.
| psyc wrote: | Based on my observations over the past year, I'm certain that | Google and Bing choose not to show us most of the web anymore. | | I usually find what I'm looking for. It just takes literally | three orders of magnitude longer than it used to for the same | kind of stuff. I used to use Google a lot to jog my memory | about various things I vaguely remembered. Type a few | associative words and snippets, press Enter, done. Google's | useless for that now. | | If you're looking for hot pop shit in trendy publications, | things to buy, commercial services to subscribe to - G has you | covered. That's what they do now. | ouid wrote: | Google is still pretty good at searching reddit. Maybe reddit | can acquire them. | big_blind wrote: | site:reddit just is the best search engine at this point. I | still don't like Google though. | dotnet00 wrote: | I agree that this seems way too reductive. I was recently | reflecting on this and noticed that I constantly run across new | blogs and sites whenever trying to learn something. I just | don't usually pay much attention to the site name in the way | that I remember HN, Reddit, Twitter etc. | | So, while I would agree that some aspects of the old internet | are dead (like 'small' ~1000 user forums focused on specific | topics having largely been replaced by generally inferior | subreddits and discord servers), I think it hasn't gotten as | bad as you're making it out to be. | baxtr wrote: | I am not so sure... | | I think what happened is this: the WWW was everything back in | the days. But in the "old days," only 10% of all people were | online, the web elite. Then, AOL came, and the rest came online | slowly but surely. The so-called "mainstream" people were no | geeks, and these people were "just" ordinary people. Almost all | were captured by what you call "big websites". | | Now, we see the 100% being dominated by the 90%. That's why | "Google results are bad". Bad for us! 
But maybe (most probably) | not for them. | nl wrote: | Eternal September was Sep 1993. AOL hit the internet in March | 1994. | | Netscape didn't launch until December 1994 (and the WWW was | nothing before that. I subscribed to a mailing list with new | sites that were released and I'd visit most new websites on | the internet on most days with the Cello browser in my uni | labs). | | AOL users have been there since the beginning of the WWW. | | https://en.m.wikipedia.org/wiki/Eternal_September | CWuestefeld wrote: | My recollection is that the AOL event you reference was | only making usenet accessible - a point that makes good | sense in the context of the eternal September. | | But when talking about the WWW, that's a very different | story. I think that AOL didn't incorporate a web browser | until quite some time after that. | mywaifuismeta wrote: | I no longer see Google as a neutral "search engine" the way it | used to be. Now it's just another company that owns and | promotes certain types of content, no different from reddit. | For some things Google has the best content, for some things | Twitter or Reddit have the best content. | dixego wrote: | Google is an advertising company. It has been for a good | while. | big_blind wrote: | Yeah I use you.com and kagi.com. No advertising on either. | Less SEO spam too it seems. | [deleted] | photochemsyn wrote: | I find one of the best ways to find interesting content on | specific subjects using Google is now to start blocking all | their top returns (a lot of SEO spam). This is somewhat | tedious (lots of -site:seospam.com) and Google doesn't like | automated queries. However, a few rounds of this often turn | up interesting content down low in the search results. Just | don't take what's on offer on page one of search results, | basically.
| | Where it's gotten really bad is on news searches as Google | either now has some kind of shitlist of independent news | sites that it won't allow to show up on, for example, | site:youtube.com searches - or, it's filtered through a guest | list. It's hard to tell which strategy they're using, but | news is definitely being heavily filtered based on very | dubious propaganda-smelling agendas. | xvello wrote: | You might be interested in using uBlockOrigin and | https://letsblock.it/filters/search-results to easily block | these domains. In addition to your own domain list, you can | use the community-maintained SO / github / npm copycat | lists. | maxwelldone wrote: | Back in the 2000s Google used to be the place for any type of | search (IIRC). | | Now, I've been conditioned to use it only for specific use | cases, mostly for convenience. Some examples include: | | 1. Anything programming related (searching for man pages, | error codes etc) is straightforward. (I do have some UBO | filters to exclude SO copycats) | | 2. Utility stuff like currency conversion, finding time in | another city, weather etc. | | Where Google has really fallen behind is in multimedia | search. Not sure if it's due to copyright issues or not but | Bing and Yandex provide way better service in this regard. | | Not to mention the "reddit" suffix I need to add to any | search that even remotely calls for public opinion. In many | cases, Google is just a shortcut to take me to the relevant | subreddit. | ufmace wrote: | Programming-related stuff seems to have gotten a lot worse | in the last couple of years. Now most terms, at least for | common things, return a ton of blogspam, when the official | docs or SO are usually the best source. | LegitShady wrote: | another thing seems to be prioritizing current news over | past news which makes searching for old articles you've read | quite difficult.
| samstave wrote: | This MUST be the reason that they threw their purchase of | Postini in the garbage and my GMAIL INBOX is filled with spam, | and my "social" and "promotions" tabs don't filter.... | | GMAIL is garbage now, I literally only use it as my spam email | now. Which sucks because I have had it for a _really_ long | time. | | Anecdote on Yahoo! Mail ; years ago I wrote to yahoo support | asking when I created my Yahoo Mail account (I'd had it from | the 90s when it was very early available...) | | And support told me that they couldn't tell me when my account | was created as that was *proprietary company information* | | So I deleted my Yahoo account. I'm about to DL all my gmail and | do the same. | throw10920 wrote: | > I agree: the WWW Internet is dead | | I've heard this claim a lot, with 0 supporting evidence. Do you | have any? | | My own experience is that there are _thousands_ of content- | rich, high-quality blogs still being written by real humans, | because I regularly find and bookmark new ones weekly, without | even looking for them, so: please provide evidence for this | claim that runs counter to my lived experience. | PragmaticPulp wrote: | > If I want to search for something topical and relevant, I go | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google | Maps, Discord etc. | | Maybe we're searching for different content, but I disagree. | While Google results are not without noise, I think it's a huge | exaggeration to suggest it's useless. I still regularly find | quality results from a quick skim of the first or second page | of Google results. | | Meanwhile places like Reddit, Twitter, and Hacker News are full | of very strong opinions that _feel_ truthy, but are mostly | noise.
Unless you go in with enough baseline knowledge to | filter out 9/10 underinformed comments to dig out the 10% who | actually have direct knowledge of the subject and aren't just | parroting some version of something they read from other | comments, skipping straight to social sites becomes a source of | misinformation. | derefr wrote: | > Make a search engine which indexes the fresh relevant data | from the big siloed websites, and ignores the general dead | Internet | | I don't understand why Google themselves don't do this. | LinkedIn v. hiQ demonstrated that they won't get in trouble for | scraping users' subjective views of data within these silos and | then stitching them together to form a cohesive whole. So | where's the effort to do so? It seems like the obvious step. | Gigachad wrote: | Interesting thought. I just went through my browser history and | realised that almost every time I use google search, I already | know what website I want, I just don't know the exact | link/page. I'll use google because the search on stack overflow | or reddit sucks but I know I'm looking for a page on one | particular site. | Pelam wrote: | I realized this too. I disabled search from address bar and | started bookmarking everything even remotely sane I see. I | often add a few personal keywords to the bookmark bar. | | It is starting to pay dividends. Instead of weird stuff | thrown up by google when I type in something, I get the "oh | yeah, that was the page" from a short list of bookmarks shown | to match the words. | npilk wrote: | I had the same realization and ended up setting up a simple | Cloudflare script to automatically do an "I'm Feeling Lucky" | style search to return the first result: | https://notes.npilk.com/custom-search | lysecret wrote: | I think this is a very "consumer focused" take. Yes. A lot of | interesting people data is now "locked" behind these | aggregators and platforms (and also hard to handle because of | GDPR).
But most interesting company data is still out there. | matheusmoreira wrote: | The internet itself is probably gonna die soon anyway. Every | country wants to impose its own laws on it. I think it'll | eventually fragment into multiple segregated continental | networks, if not national ones, all with heavy filtering at the | borders. | | I'm happy to have experienced the free internet. Truly a jewel | of humanity. | cesarb wrote: | > I think it'll eventually fragment into multiple segregated | continental networks, if not national ones | | That's exactly the world in which the Internet grew. There | were multiple segregated national and sub-national networks, | and the Internet was built as a means to interconnect them. | After some time, the Internet protocols ended up being used | even within these networks, but that was not originally the | case. And even today, there are still things like the AS | (Autonomous System) concept which permeates the core of the | top-level Internet routing protocols, which still reflect the | Internet being a "network of networks" instead of a single | unified network. | | That's why I'm not too worried about the Internet | fragmenting; we've seen this before. What happens next is | gateways between the networks, and there are already shades | of these in the VPN providers which allow one to connect as | if one were located in a different network, often from a | different country. | kmlx wrote: | > I think it'll eventually fragment into multiple segregated | continental networks | | i think it already has. | | the Great Firewall of China is the classic example, but I | think the trend started in the west with the Right to be | forgotten/right to erasure in Europe, and subsequent HTTP | Status 451 Unavailable For Legal Reasons. GDPR just further | cemented the split between Europe and the rest, and the new | DMA & DSA regulation in the European Union finally makes it | clear. 
The writing is of course on the wall, so countries | like India or Australia aren't too far behind. Places like | California also have their own "right to be forgotten", and | I'm sure the US will not be left behind for too long before | we see regulation further splitting their internet from the | RoW. And I don't think the RoW will hold off much longer till | it also splits into multiple big blocks. It's the start of | the new "nationalist" internet, and I'm sure we'll all be | poorer because of it. | matheusmoreira wrote: | Exactly what I mean. There is no way to have an | international network with national borders. | Telecommunications providers have always been centralized | and have always been in bed with the government. Only way | we'll ever be free is if someone invents some kind of | decentralized long range wireless mesh network. | politician wrote: | Like Starlink? | ricardobeat wrote: | Starlink connects to standard internet gateways on the | ground. It cannot function without the 'regular | internet', unless a replacement appears. | dotnet00 wrote: | IIRC there was mention of it providing some p2p network | style communication capabilities for Ukraine's military, | and one of the reasons it's appealing to the US's | military is the ability to route communications entirely | within the network (well, with the gen 2 satellites which | have laser interconnects). | | So it can (at least eventually) function without 'regular | internet', although I would still be hesitant to call it | a viable infrastructure choice if the goal is to get | around government control, simply from how much SpaceX | have to appease the government to do anything space | related. | matheusmoreira wrote: | Starlink is maintained by a company, it's an internet | service provider. One visit from the police and they'll | censor anything. | | The mesh network should be made out of common hardware in | order to be viable. 
I'd suggest phones but those devices | are owned before they've even left the factory. | Nextgrid wrote: | One visit from the _US_ police. US-unfriendly countries | have no leverage over it, and similarly, the US has no | leverage over satellite ISPs based in countries they aren't | on good terms with. | jrockway wrote: | > US-unfriendly countries have no leverage over it | | "Star Wars Episode 10: The one that's not fiction." | Nextgrid wrote: | Internet censorship isn't worth going to war over and | disclosing secret anti-satellite weapons that are better | saved for a rainy day. | jrockway wrote: | It's probably easier to just cut off outgoing payments to | Starlink anyway. They're not a charity, so if they don't | get paid, they probably don't want to provide service | just to send a message to some random government. | | On the other hand, if you want to demonstrate that you | have anti-satellite capability it's probably a better | idea to shoot down a corporate satellite than a military | one. The Soviet Union shot down Korean Air Lines Flight | 007 and it didn't start a war, after all. | eloisius wrote: | Good luck, spectrum is highly regulated in every country | I can think of. If national governments don't want you | networking across borders, you're definitely not going to | be broadcasting long range radio transmissions that way. | In fact, it's currently illegal to transmit encrypted | data or to relay packets via ham radio in the US. | matheusmoreira wrote: | Who knows? The whole point of decentralization is for | there to be so many nodes in the network they can't | possibly take them all down so that it's pointless to | even try. What if all smartphones formed a mesh network? | There aren't enough prisons in my country for all those | criminals. | eloisius wrote: | I agree with your ethos, but I don't share your optimism.
If the state wants to enforce networking firewalls along | national boundaries, no technological solution will save | us in general. As a resourceful techie with the right | know-how you may be able to sneak your packets through, | just like people in Cuba receive a literal packet of data | via sneakernet, but if the state doesn't want widespread | meshnets circumventing their firewall, they will imprison | you for emitting pirate radio signals, they will penalize | any electronics manufacturer that makes non-compliant | hardware, and rest assured that companies will go right | along. Liberty requires more than technical solutions. | | I'm saying this as someone who once wrote a decentralized | P2P mesh for instant messaging[1]. I was inspired by the | HK protests going on ~2014 after hearing that they were | using Bluetooth chat apps. Luckily Matrix, Telegram, | Signal, etc. mostly solved the problem. Still, I don't | think any amount of mesh networking would turn back the | tide of Hong Kong now. | | [1]: https://github.com/zacstewart/comm/ | groby_b wrote: | >What if all smartphones formed a mesh network? There | aren't enough prisons in my country for all those | criminals. | | There don't need to be. You publicly gruesomely execute | the first 100 or so you catch, and the practice of | running a mesh node on your cell phone will fall so far | out of fashion that the network breaks. | | Societal shortcomings cannot be fixed via tech alone. If | you can't build a society resilient to authoritarianism | in the first place, tech will not help you. It can be | used to _increase_ resilience, but that's far from | fixing the problem by itself. | 7sidedmarble wrote: | The networking may have been open like that, but I'm not sure | the content ever was. It seems to me like a lot of internet | users consume mainly the content of sites from their country. | Kind of hard to blame them when that content is probably | going to download fastest.
But the language barrier has also | kept the internet from becoming truly global. | dreen wrote: | I think this was inevitable all along, something similar | happened to radio if I'm not mistaken. | | However, the good news is that we will never stop reinventing | everything. The real value of the old internet was showing us | what is possible. | nonrandomstring wrote: | > The real value of the old internet was showing us what is | possible. | | Of equal value is that it showed us what not to do. | | We have 30 years of documentation for research on exactly | what a successful intra-planetary network needs to be | immune to. A successful future network must build in | resistance to all forms of human psychopathology from the | ground up. | pde3 wrote: | This is a nice fantasy, but it's a fantasy. The tech | stack and network we have is too dense a forest to be | replaced by clean slate designs. But maybe some of the | problems could be improved with some new platforms and | APIs. Mind you, ML is making so much progress so quickly | that what happened over the last thirty years is at best | a partial model of the problem we have to solve now, and | the tools we have to do it with... | nonrandomstring wrote: | > ML is making so much progress so quickly that what | happened over the last thirty years is at best a partial | model of the problem we have to solve now, and the tools | we have to do it with... | | Sorry I don't see how ML can help here. It seems like | another thing to pin hopes of repairing an already too | broken system on. | | "We cannot solve our problems with the same thinking we | used when we created them." -- Albert Einstein | | "A new scientific truth does not triumph by convincing | its opponents and making them see the light, but rather | because its opponents eventually die, and a new | generation grows up that is familiar with it." -- Max | Planck | | We are the dying generation my friend. We built it. They | came. It didn't work.
Surely if ML can do anything it's | telling us that we need to tear down the old system | completely and start again, don't you think? Adding | sticking tape won't help. | | edit: turning a grunt into an honest question | Whiteshadow12 wrote: | This made me sad, the optimist in me believes that some | alternative will be built, that could take us back to those | days. Honestly I do feel for most of my life I experienced an | American Internet mostly (From South Africa), as long as one | can still hop from one internet to another, in as simple a | manner as possible it might not be as bad as it could be. | matheusmoreira wrote: | I'm sad as well. To me it feels like we're already living | in a cyberpunk nightmare, things just keep getting worse | and there's nothing anyone can do to stop it. | [deleted] | lkxijlewlf wrote: | > If I want to search for something topical and relevant, I go | to Facebook, Twitter, Reddit, HackerNews, Instagram, Google | Maps, Discord etc. | | Interesting. When I search for something topical I search those | sites using Google because al(most) (I don't use some like FB | and insta) all those sites have really shitty search. | jerf wrote: | "I agree: the WWW Internet is dead, that is your problem. No- | one visits websites anymore, everyone has moved to the 10 | biggest websites and all data is now siloed there." | | That is not the Dead Internet Theory. That's just something | anyone can see by looking at the world. | | The Dead Internet Theory is that the Internet is _already_ an | echo chamber custom fed to you by a collection of bots and | other such things, and that a lot of the "people" you think | you're interacting with are already, today, faked. You're | basically in a constructed echo chamber designed only with the | interests of the creators of that chamber in mind, using the | powerful social cues of _homo sapiens_ effectively against you. | | In particular, those silos aren't where people are | communicating.
Those silos are where you _think_ you're | communicating. | | It is obviously not entirely true. When we physically meet | friends, sometimes topics wander to "Did you see what I posted | on Facebook?" So far, we've not caught Facebook actively | forging posts from our real-life friends that we physically | know. (Though we _have_ caught them failing to disseminate | posts in what seems to be a distinctly slanted manner.) | | I am also not terribly convinced that the bots have mastered | long-form content like you see on HN. I think we've had some | try, and while they can sort of pass, they seem to expend so | much effort on merely "passing" that they don't have much left | over to actually drive the conversation. HN probably still | requires real humans to manipulate things. | | Where I do seriously wonder about this theory is Twitter. AI | _has_ progressed to the point that short-form content like that | can be effectively generated and driven in a certain direction. | There's been some chatter on the far-out rumor mills about | just how bot-infested Twitter may be, how many people think | they have thousands of followers, even having interacted with | some of them as "people", and in fact may only have dozens of | flesh-and-blood humans following them, if that. Stay tuned, | this one is developing. | | (Note that while this could be "a big plan", it is also a | possible outcome of many groups independently coming to the | conclusion that a Twitter bot horde could be useful. A few | hundred from X trying to nudge you one way, a few hundred from | Y trying to nudge you another, another few thousand from Z | trying to nudge you yet another, before you know it, the vast | vast majority of everyone's "followers" is bots bots bots, and | there was no grand plan to produce that result.
It just so | happens that Twitter's ancient decision to be dedicated to | short-form content, with no particular real-world connection to | the conversation participants, where everyone is isolated on | their own feed (even if that is shared in some ways) made it | the first place where this could happen. Things with real-world | connections, things where everyone is in the same "area" like | an HN conversation, and long-form content will all be three | things that will be harder for AIs to manipulate. Twitter is | like the agar dish for this sort of thing, by its structure.) | thesuitonym wrote: | > (Though we have caught them failing to disseminate posts in | what seems to be a distinctly slanted manner.) | | I haven't seen this, but I'd be interested in reading about | it, if you have a link! | ftkftk wrote: | I agree - I don't believe that there is a grand master plan | of a conspiratorial or other nature. I think it is simply, as | you stated, a co-evolution of independent actors. | rchaud wrote: | > Want to become rich? Make a search engine which indexes the | fresh relevant data from the big siloed websites, and ignores | the general dead Internet. | | That would be a great service, but it certainly wouldn't make | you rich. Where's the money going to come from? Google got rich | because they acquired an ads platform (DoubleClick) and an | analytics platform (Urchin) and started monetizing the vast | amounts of data they had. That was years after Google had | established goodwill as the best search engine. | big_blind wrote: | I use beta search engines. On kagi.com and you.com you can | preference and filter top sites. There's also no advertising | on either. I've just stopped using Google altogether and its | improved search so much. | simion314 wrote: | This is not true, maybe for a subset of Internet users. | | For example you have Wikis and forums. 
Wikis are good for | communities that are passionate about a topic and they | collaborate on building content for their passion. Reddit is a | valid alternative to forums but if the community is older and | has members that are technically competent then they usually have | the forum customized for their purpose and the forum will | continue to exist, especially if you want to avoid some third | party censorship. | | I never ever search for something and find answers on | Facebook, very rarely I find something that points to | Instagram blogs/posts but never Facebook. | | Probably depends on your location and what you search for, so | it might be possible that 99% of your Internet consumption is | satisfied by 5-10 websites. | Hnrobert42 wrote: | As you describe this, it makes me think about how populations | tend to migrate to cities and away from rural areas. There's | even a parallel to white flight in the emerging popularity of | the chan/gab fora. | hombre_fatal wrote: | I don't get how TFA shows evidence of the Dead Internet Theory | just because their site manages to attract ~zero users. | | Just host a <form><textarea><button></form> at an IP address | and notice it's just spambots submitting it with backlinks, not | actual users. Doesn't mean the internet is dead nor that the | indieweb is dead. | | It doesn't really show anything other than the only people able | to extract value from your creation are the spammers. | jspaetzel wrote: | This is so incredibly false, I've been working on a project for | the last six months and MoM I've seen steady increase in usage. | Tbh much much higher usage than I expected. Most users find my | site via Google or Facebook however they are looking for | content that's not in those silos and have no problems leaving | them. | | If you have high quality content and you get it indexed | properly by Google, users will come. | | There are reasons users are not using your website. | | 1.
It's not solving a problem people have. | | 2. Users can't find it. | | Who, in their right mind searches for search engines? Nobody I | know. | | If you want users you have to go out and get them (literally | pound the pavement and talk to people) or create a LOT more | content ironically, so they can find your site on the search | engines they are using today. | black_puppydog wrote: | These discussions always make me recall Jacob Applebaum. Think | of him what you want, but this statement of his really stuck | with me at the time. Paraphrasing: | | The real dark-net is facebook. Everything that goes in there | never comes out again and is basically invisible to the world, | except if you join facebook yourself. | | My own prime example of that used to be pinterest: it seems to | be a 100% sink in the directed graph of internet links. But | since Applebaum stated this, instagram (also facebook of | course) is trying hard to push pinterest off that particular | throne. | LegitShady wrote: | to me this is also discord - which seems to have become the | chosen alternative to online forums for many communities and | basically hides what used to be the public face of those | communities. | samatman wrote: | boplicity wrote: | > No-one visits websites anymore, everyone has moved to the 10 | biggest websites and all data is now siloed there. | | Really? We make our living running a small web based | publication; around 40k readers a month. I know of many other | sites like this. Google, and other search engines, depend on | niche websites to provide quality search results. Without sites | like ours, the internet would truly be dead, and search would | be mostly useless. Our "traffic sources" come from a mix of | Facebook, Search, Reddit, etc, in addition to our many loyal | readers.
| | Others in our niche are producing blog spam, which looks nearly | identical to people who aren't experts in the field, but we | have real experts, fact checkers, etc, as part of our | production process. This is a big problem: These low quality | websites get similar rankings to our own, which does make it | much harder for people to get quality information via search. | (Hence the general shift towards trusting social | recommendations, such as from Reddit.) | | In short, the WWW is alive and well, it's just buried under a | bunch of #$#$%. | rchaud wrote: | > Our "traffic sources" come from a mix of Facebook, Search, | Reddit, etc, in addition to our many loyal readers. | | 40k/mo is a pretty good number for an independent website. As | a word of warning though, relying on social media reach is a | dangerous game, as there is anecdotal evidence that tweets | with outbound links don't get as many impressions as those | that link to in-site content, like another Twitter post. | | As for Facebook, well, there's a good comic from The Oatmeal | (enormously popular on FB back in 2010) that talks about what | happened in the long run: | | https://twitter.com/Oatmeal/status/923250055540219904 | Cthulhu_ wrote: | I don't believe the WWW internet is dead; there's still | millions of webpages being made and published every day. | However, the traffic numbers are skewed in favor of the big | socials and aggregators; I wouldn't be surprised if the 80/20 | rule applies there. | pnutjam wrote: | There seems to be a tendency towards video that undercuts the | "old internet". I prefer instructions in a text or list | format, but that's almost impossible to find for things like, | changing the headlight bulb on my traverse. | | 1. turn the wheel so it is pointed hard in the direction of | the bulb you are changing. | | 2. remove the hex screws from the shroud in the wheel well | | 3. pull the shroud down, it's pretty flexible plastic. | | 4. reach up and change the bulb.
The wires are a bit short so | you might need to get both hands in there. I have big hands | and I'm able to do it. | | ---- There are innumerable videos explaining this process, | but very few text directions. | ElevenLathe wrote: | I think this is actually because real, fluent literacy is | still rare even in highly developed places. It may be | easier for a very literate someone to dash off those | instructions but most people are 1000x more comfortable | making a little video. Same goes for reading vs watching | the video. | | This is my same theory about meetings being universally | preferred to asynchronous email, even when literally all | the questions someone asks at a meeting have already been | answered in my long form email. | | Most people, even if they can read, are not really | comfortable with it. Doubly so for writing. There used to | be no choice to function in society, but increasingly we | can use technology to substitute for reading and writing | effectively, so people do. | pnutjam wrote: | You're probably right, it's just so frustrating. | | I think I'm going to start compiling stuff like this in | my git repo. | Jiro wrote: | Even something like that flounders on the question "these | instructions say to pull down the shroud, what is a | shroud?" or "I can't find those hex screws, where are they | located?" Repairs are inherently visual, although text with | illustrations might work. | soheil wrote: | To a fish the world is made of water and there can't possibly | be anything else worthwhile. This is more indicative of how you | spend your time online vs reality. | heavyset_go wrote: | I was once on this bandwagon, but I think it was just | confirmation bias reflecting the way _I_ used the internet at | the time. The non-siloed internet is bigger than the pre-siloed | internet ever was. | omoikane wrote: | I think the Dead Internet Theory bit is just a bait to get more | comments. 
It's a bit of a stretch to conclude that the internet | is mostly robots just because one website sees mostly robots. | This extrapolation would be convincing if that one website were a | high ranking website that sees a lot of traffic, but | searchmysite.net does not appear to be one of the top websites. | DebtDeflation wrote: | Unfortunately, correct. The average Internet user accesses it | via a phone, not a desktop, laptop, or even tablet these days. | Most of that access is through apps, not a browser. To the | extent that a user is looking for a factoid answer and does a | search, a Google Knowledge Graph result with a Wikipedia link | is probably enough in most cases. If they want a technical | question answered, Stack Exchange; a product review, Reddit; | nearby restaurants with reviews, Google Maps; etc. | stackbutterflow wrote: | I think you're generalizing your own behavior. I regularly use | google to search for topics that cross my mind and I end up on | many websites that are not one of the giants in your list. It's a | fun activity. If people stick to the same 10 websites that's on | them. Nothing prevents you from exploring the web. | MockObject wrote: | > Nothing prevents you from exploring the web. | | What prevents you from exploring the web is you can't find | anything but the same 10 sites through search engines. | jrussbowman wrote: | "Want to become rich? Make a search engine which indexes the | fresh relevant data from the big siloed websites, and ignores | the general dead Internet." | | Did that to some degree. Unscatter.com pulls from reddit and | twitter to source links. | | I found reddit only created an echo chamber bubble of obvious | bias and twitter only diluted it a little. | CTDOCodebases wrote: | People are doing this already. You just have to include the | site name in the search on google, e.g. reddit. Search on these | platforms is often broken.
| freeone3000 wrote: | Well, the first two links loaded for a search for "magic the | gathering" are 404s. The "Random" link at the bottom 403s. The | search engine feels broken. | assemblylang wrote: | There are still ways to prod out good content from the SEO spam | on search engines. I wrote a google search front end that does | this [0], using search operators to remove some common SEO spam. | | [0] https://sayno2seo.com | hammock wrote: | Makes me wonder whether Google tolerates bots on its search | engine, to boost its ads revenue. | | See also Twitter's extraordinary claim that 5% or less of its | users are bots (or a claim from Twitter's detractors that up to | 90% of its DAU are bots) | exyi wrote: | I don't think it does, I get a ~~middle finger~~ recaptcha | every time I try to google something | iamjbn wrote: | Adding to the list I have been building for very long -- | "Becoming irrelevant, Google Search" -- here: | https://docs.google.com/document/d/1cSMY5wXSKhJdMxeJEvTUJ21e... | iamwil wrote: | To the OP of the article, this is great. I had just never known | about it, to use it for searching. | | Usually quality blog posts on specific technical topics are just | things I run across through HN, lobsters, or twitter. Now it's | one more channel to look for things that I'm specifically | researching, like CRDTs. Kudos! | ColinHayhurst wrote: | Mojeek member here. We have always had a high level of spam bots, | as any search engine/service will have. It's a constant battle to | fend off new bots; folks can always try out our API rather | than freeloading, and some do. Many obviously do not. We are | taking a look at whether things have also changed for us since | mid-April 2022. | ColinHayhurst wrote: | Some evidence of an uptick here too. Historically it has been | ~80%. 6 days ago we had to block 92%. Yesterday we blocked | around three times that number of bot searches.
| | edit: the three times spike yesterday was one particular new | attacker; general recent rise holds. | alphabet9000 wrote: | i recently built a habitat for spam bots, they eventually found | it and now post peacefully | | https://upstairs.treehouse.telnet.asia/pharm/cylohexapine | TremendousJudge wrote: | It's beautiful to see nature healing | tbm57 wrote: | Maybe someone should start an internet rewilding project | getcrunk wrote: | this is the best thing I have ever seen. Its art, engineering, | biology and sociology. Do you write blog posts about it? | mcv wrote: | If you're trying to boost your user numbers, I'm in. Results on | topics I search for are very sparse, but it's all content I | hadn't seen before, which is great. | | Sounds like your search engine is not suitable as a replacement | for more traditional search engines, but it might complement them | very well. I'll give it a try. | | As for the SEO bots: can't you simply block those? | egberts1 wrote: | Error codes. Open source that reports in mysterious error codes. | | Used to be able to Google for those; now, not so much. | 0xbadcafebee wrote: | People will only use your product if they know about it and | perceive value in it. How do people know about it, and why would | they want to use it? | | On _" Most of the tiny number of real users have come from links | posted to places like Hacker News, and there is almost no organic | traffic from other search engines"_ - Organic traffic comes from | word of mouth. Are people talking about your site? If they're | not, you're not gonna see organic traffic. You could do what | others do and pay some influencers to advertise your site, but | that's expensive and not as scalable as "real" buzz. Is your | product exciting or controversial? If not, why would people talk | about it? 
| | Your homepage's tag line is _"Open source search engine and | search as a service for personal and independent websites."_ A | regular person's eyes would glaze as they try to figure out what | this means. Given some time they might put together the words | "search engine" and "personal" and "websites" and figure this is | a blog search engine. So just say that. | | The "Newest Pages" section is a fun novelty, but after a few | minutes the novelty wears off. | | The "Browse Sites" section is _almost_ useful. Next to the list | of sites I see some tags. Why isn't a heatmap of the tags the | first thing I see? That would be way more useful than a paginated | list of random sites. | | Your "About" page lists _"community-based approach to content | curation"_. This is the most exciting aspect of the whole | endeavor, so add that to your front page blurb ("Community search | engine"). You would probably do well to build a real community | around it, for example with a forum or chat system (GitHub | Discussions does not count). A SubReddit would be an easy way to | bootstrap this and later move it to your own hosted forum. | | You'll probably need a very complicated moderation system if this | thing takes off. | unnouinceput wrote: | Plot twist: His website/search engine/blog is written by a bot | with no real person behind it. | albatrosstrophy wrote: | On a tangential note, I remember a time when Google had the option | to search only for 'discussions'. The results were amazing and | accurate as it scoured online forums. Almost all issues I had (was | following the rooting scene closely back then) were quickly | resolved. Then suddenly it got removed for reasons unknown to me. | Anyone know if it's replicable today? | sodality2 wrote: | Brave Search does have a discussion search section. | blackhaz wrote: | Sometimes adding "reddit" to a search query produces fantastic | results.
| jrussbowman wrote: | I do this all the time | tunap wrote: | I have had some success adding "forum", when looking for | trade discussions; eg: controls & automotive. With all the | walled silos on the net, this is much less useful with every | passing day. On the bright side, I don't have to use -twitter | & -facebook, so there's that. | throwaway27727 wrote: | This is great but it seems reddit has done something to mess | with their date reporting. When looking for recent posts, I | might see a result on Google that says it was posted in the | last few days, but on clicking the result will actually be | from years ago. | asddubs wrote: | might also be google. I've noticed inaccurate dates that | don't appear anywhere for some of my pages. my only theory | as to why these were displayed is that google interpreted a | (server side) randomly generated number in an inline script | as a timestamp (but i can't know for sure that's what | happened) | oefrha wrote: | Messed up dates, plus irrelevant topics showing up because | there are matched snippets in "more posts from...". | SirAiedail wrote: | I use "site:reddit.com" to fully restrict to that. You can | even filter by subreddit that way. | | Works well with HN and other sites, too. | matheusmoreira wrote: | Not sure for how much longer this is going to work. Plenty of | marketers make fake posts there in grassroots campaigns. | Reddit itself is an advertising company. | | God I hope they never find out about this site. | f0xJtpvHYTVQ88B wrote: | Brave Search recently implemented "discussions". From what I've | seen it is mostly Reddit results but StackExchange also can | appear there. | | https://searchengineland.com/brave-search-discussions-383706 | Cthulhu_ wrote: | I have a suspicion they removed it because of the amount of | spam on those forums. There's tons of abandoned forums that are | only occupied by spambots. 
| | There's even pretty convincing looking accounts and messages | that turn out to be spam in the end, once they start trying to | post links. | | I have Akismet on the comment section of the Wordpress front- | end of the site I run, it basically said something like 99.99% | of attempted comments were spam. I'm sure the same applies to | e-mail and the like. | matsemann wrote: | Reminds me of those "fake forums" I sometimes see when | exhausting google's results. Found a screenshot of the | concept here: https://www.reddit.com/r/Scams/comments/jxtr1k/ | but_it_requir... | 6510 wrote: | Everyone is a spammer according to Akismet. I wouldn't be | surprised if 99% of that 99.9999% is false positives. | | You could start a website for people you don't like, flag all | the comments as spam and they won't be allowed to post | anything elsewhere - forever! | efreak wrote: | That percentage sounds about right to me. I've seen | comments on blogs from ~10-15 years ago, that continue to | have spam posted to them. The first 2-3 comments will be | relevant, but comments 50-100 may have a single relevant | comment among them, with a total of anywhere from 300-3000 | comments. Older comments link mainly to blogs | (*.WordPress.com) and such, while newer comments link to | Facebook and Instagram. | arbuge wrote: | It is my experience that SEO bots are increasingly ignoring | robots.txt entries disallowing them from crawling our sites. Last | week we noticed several doing this. I don't mind naming names - | semrush, something called grapeshot crawler, something else | called blex bot, and moz dotbot. Anyone else having the same | experience? | edenfed wrote: | I'm currently building a search engine made specifically for | developers. We are searching directly in | GitHub/StackOverflow/Reddit so SEO is not a problem. You are | welcome to try it at https://keyval.dev | mcovalt wrote: | I noticed this on https://hndex.org. So many searches for hair | loss products. Like thousands...
daily. | ajnin wrote: | This made me curious to try that search engine so I typed | "electronic music box" (first thing that came to mind). As far as | I can tell none of the 10+ pages of results include all those 3 | words. I mean, you might not have any relevant sites in your | database (likely if there are only 1000 sites or so as another of | your blog posts implies), and I understand you want to show _some_ | result to the user, but if I want irrelevant links I might as | well go to google.com... | thehodge wrote: | Yeah same, I searched for Leeds grand theatre and the top | result is something titled "June 2012 - Sam's Blog" which just | mentions the word grand. | lubesGordi wrote: | What the heck is an 'electronic music box'? I personally | wouldn't expect those three words to show up on any sites | served by a small search engine. | nspattak wrote: | This is an awesome website that I was not aware of! | mlatu wrote: | and there you have it: nobody uses it because nobody knows of | it. | | of course for a bot it is easy to remember your site, it's just | another url in a long list of others... but what does a human | do? they go to their fav search site, be it duck duck go, | google or bing... perhaps even yahoo. | | i remember when google just started out, back then you would | have used askjeeves, altavista or yahoo... google was really | good compared to those... and the name was new, kinda | orthogonal to existing search engines (except yahoo perhaps) | and perhaps the most important bit: the site was "clean" except | for the searchbar, there was nothing distracting there. you | opened it and knew it is for looking up stuff | | now, to join in, this late in the game? difficult. difficult. | | maybe it would be easier if it specialized for some niche? idk. | | dear OP: i'll try to remember your searchengine, but i can't | promise to become a regular | jacquesm wrote: | One day we'll have an internet for humans exclusively.
On another | note, with 160K requests / day from bots you could of course | simply block the bots structurally assuming they are nice enough | to identify themselves. Block all of AWS and Google, Russia, | China, NK and a couple of other bot hot spots and the service may | well become more successful for regular users because they get | faster results. Bots can afford to wait, humans are often | impatient. And with 2 hits / second by bots that may well become | a factor. | netsharc wrote: | I wonder how that could be accomplished. Maybe they'll build a | brain interface to replace the "I'm not a robot" captchas/add a | TPM chip to the brain. | | And then the spammers will start selling tools to fake the | responses. Or pay Filipinos a few cents a month to have the | chip implanted to their brains... | jacquesm wrote: | Well, we can do it with the roads, I'm pretty sure if the | incentives are right we can come up with a way to do it | online. As long as we have not passed the Turing test ;) | | The current web seems to favor machines talking to machines | and that is definitely not how it was intended. | Nextgrid wrote: | > Or pay Filipinos a few cents a month to have the chip | implanted to their brains... | | That's the problem with blocking _bots_ as opposed to | malicious behavior. Bot blocking is actually trivial and very | cheap to bypass as long as you can buy slave-like labor for | peanuts. | | Ideally you'd want to block malicious _behavior_ (when it | comes to SEO spam, downrank anything for-profit such as ads, | analytics, affiliate links, etc) instead to remove the | incentives for spamming, regardless of whether it 's a bot or | human. | | In this case the only problem is that this search engine | gives away resources (search queries) for free and then | complains that people (in this case spammers) are taking it. 
| It's not really a _spam_ problem - they'd complain equally | well if they had some _legitimate_ user that happened to need | tons of search queries to achieve their task. | | The only solution here is to start charging for stuff that | costs money, and then it doesn't really matter who is on the | other side, as long as they pay the bill. | samatman wrote: | It's a principal-agent problem. Websites want to be paid | for their content, rent ad space, advertisers want users to | see ads, users want to find content. | | The agent in the middle fucking over all three principals | is hmm. Metaalphabetic, let's say. | m-i-l wrote: | This isn't indexing by search engine spiders, which are usually | fairly benign and easy to identify with user agent etc. This is | searches for "scraping footprints" executed en masse by "SEO | proxy farms", which are designed to be very difficult to detect | (e.g. originating from globally distributed residential IPs, | quite possibly ordinary home users' machines which have been | compromised). The main giveaway that something is a "scraping | footprint" is the long search query which includes text that | would appear on a template, e.g. ""This website is proudly | using the open source classifieds software OSClass" rega | turntables", for someone looking for OSClass-powered pages they | could "search engine optimise" for the query "rega turntables". | thesuitonym wrote: | That's funny to me, because if I'm searching for something | that would have been around between 2004-2012, I'll often | append "Powered by phpBB" (or other software) to find posts | about it on forums. | pjmlp wrote: | And the cycle will reboot itself again. | | The silos we have nowadays were there before the Internet took | off, on BBS, Compuserve, Geocities, .... | | Apparently the majority of regular humans likes to have | centralized providers they can reach out to, instead of the | freedom of decentralized content. | jacquesm wrote: | Yes, that's true.
Bots tend to follow the money. | xmodem wrote: | > Block all of AWS and Google, | | Google for "residential proxy". This is already a huge | industry, and it's difficult not to see how we haven't lost | this war a long time ago. | kmeisthax wrote: | ...so you're going to write your own HTTP requests? Encrypt | your traffic and validate certificates by hand? Toggle in each | TCP header from a memory debugger? | | Most of the Internet is bots because humans don't actually | generate HTTP traffic - they fire up a bot called a "browser" | to do it for them. The challenge for anti-spam is to | distinguish which bots are currently being directly controlled | by humans and which ones are not-so-directly controlled by | such. This isn't even a hard line; I've frequently hit Hacker | News' bot detection just by upvoting a comment and then | clicking reply too quickly. | jacquesm wrote: | I really don't understand your comment. | | Just so we don't have to argue about what constitutes a bot | and what does not I propose we use this definition: | | https://en.wikipedia.org/wiki/Internet_bot | calltrak wrote: | [deleted] | oefrha wrote: | > I didn't notice at first because the web analytics only shows | real users, and the unusual activity could only be seen by | looking at the server logs. | | Sounds like everyone blocking analytics (Plausible in this case), | e.g. myself just now, is lumped in with spam bots. | | Of course, analytics blocking can't meaningfully swing the | ~99.99% statistic. | rhn_mk1 wrote: | I would argue that yes, it can. If the only people who are | interested in using the website are those who block analytics - | and, given the demographic of a niche search engine, it doesn't | sound entirely implausible - then there's no telling how the | 99.99% splits into bots and nerds. | oefrha wrote: | Not every "nerd" use a blocker. I know many who don't. 
Some | want to support the sites they visit; some want to see the | web as it is for most people; some say their mental filters | are so well developed that ads don't bother them; etc. | Xylakant wrote: | You could guesstimate by checking the IP address - blocks | assigned to residential users are likely humans, blocks | assigned to cloud providers etc. likely bots. | gnabgib wrote: | This is far from true. Either via trojans, botnets, "crowd | sourced vpns", or of course tor relays, residential IPs are | a source of many bots. The overwhelming majority of spam | sources (after you block a few data centers in NL). | asddubs wrote: | even if there's 99 people blocking analytics for every person | who doesn't, the figure is still 99% | scambier wrote: | If you self-host Plausible, it's also possible to bundle the | analytics package with the website, so that there's isn't an | "ad-blockable" lone request for the .js file. | | https://github.com/plausible/plausible-tracker | pluc wrote: | Yeah there is. I surf with JS off because of people like you. | varun_ch wrote: | Most of the data you can collect with Plausible could just | be collected server side instead, it's nothing like Google | Analytics. | netr0ute wrote: | > Most of the data you can collect with Plausible could | just be collected server side instead | | Then why not just use that instead? | tylergetsay wrote: | SPAs & marketing teams are used to snippets | scambier wrote: | Also notice how I said "analytics package" and not | "tracking" in my comment, because there is no tracking. I | mean, unless you're the only visitor from a specific | country, there is literally 0 identifying data in | Plausible. | netr0ute wrote: | Analytics is still unnecessary JS and a bandwidth hog, so | it has to go. | folkrav wrote: | https://plausible.io/privacy-focused-web-analytics | | You surf with JS off because of sites abusing their users' | data. This is not it. 
| [deleted] | 34679 wrote: | Collecting data that a user doesn't want collected is | abuse. It doesn't matter what you do with it. | folkrav wrote: | Oof. Hard disagree on that one, way too black & white of | a position for me in the face of such a broad concept as | "data". | inetknght wrote: | > _You surf with JS off because of sites abusing their | users ' data. This is not it._ | | Wrong. I surf with JS off because of sites that use JS to | collect information about me. | | If it's available on the server, then sure that might be | considered fair game. But using javascript (or any other | client-side tool) to do what you _should_ instead do | server-side _is_ abusing users (or their data). | | Putting analytics inline so it's "not ad-blocked by a url | request" is absolutely disrespecting users and a perfect | reason to turn off javascript. | folkrav wrote: | > Wrong. I surf with JS off because of sites that use JS | to collect information about me. | | Plausible doesn't collect information about you, but the | site's usage. Do you also object to physical stores | putting up cameras? | | Here's their own instance, open to public. | | https://plausible.io/plausible.io | | > If it's available on the server, then sure that might | be considered fair game. But using javascript (or any | other client-side tool) to do what you should instead do | server-side is abusing users (or their data). | | That's quite the affirmation. Is this fact or opinion? | inetknght wrote: | > _Plausible doesn 't collect information about you, but | the site's usage. Do you also object to physical stores | putting up cameras?_ | | The difference is that the cameras don't get attached to | my physical body, doesn't have any ability to monitor my | actions after I have left the presence of the physical | store, and can't force me to take any physical item or | action. 
| | Javascript, on the other hand, has the capability to | become persistent, can monitor my computer's activity | outside of your website, and can leave a lot (!) of | additional data on my computer without my permission. | MicahKV wrote: | So spammers have latched onto your search engine because they are | getting useful results. They are able to systematically discover | websites built on certain platforms that allow users to post | content containing links, which they can target for link spam. It | is very difficult to fight this on a technical level because | there is an entire industry built around blackhat SEO, with all | kinds of softwares and services dedicated to thwarting your | defensive efforts. Even Google struggles to keep up with this. | | However, they are also systematically feeding you their footprint | lists. I imagine you could put together a footprint blacklist | pretty quickly, and just stop returning results for any obvious | spam queries like those containing "powered by wordpress". | | It's not a very elegant solution I'll admit. It won't stop the | bots from trying, and you may have to circle back periodically to | add new footprints as they surface. But it's a potentially quick | and easy way to stop rewarding their efforts, and the blackhat | world is pretty used to burning out their resources so hopefully | they will figure out it's a dead end and move on. | wolpoli wrote: | Considering that as of Mar 12, this search engine only has 1001 | sites indexed, I am not sure how useful this site is for | getting SEO backlinks. Speaking of which, are backlinks still a | thing these days? | pascalxus wrote: | just to throw out ideas: What if he decided to charge for each | search?, say 1 cent or so. Users could purchase them in bulk, | say 100 searches for a 1$. | | The world is getting more and more desperate for a better | search engine. the day may come, when people are willing to pay | for better results. 
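The footprint blacklist MicahKV describes above could look roughly like the following. This is only a minimal sketch, not anything the search engine in question is known to run: the two phrases in the list are the ones quoted in this thread, and the names `FOOTPRINTS`, `normalize` and `is_footprint_query` are hypothetical. A real list would have to be built from the site's own server logs and revisited as new footprints surface.

```python
import re

# Hypothetical footprint list. Both phrases are quoted in this thread;
# a real list would be compiled from the site's own search logs.
FOOTPRINTS = [
    "powered by wordpress",
    "this website is proudly using the open source classifieds software osclass",
]


def normalize(query: str) -> str:
    """Lowercase the query, drop double quotes, and collapse whitespace."""
    unquoted = query.lower().replace('"', " ")
    return re.sub(r"\s+", " ", unquoted).strip()


def is_footprint_query(query: str, footprints=FOOTPRINTS) -> bool:
    """Return True if the normalized query contains a known footprint phrase."""
    normalized = normalize(query)
    return any(phrase in normalized for phrase in footprints)
```

A search handler could call `is_footprint_query` before touching the index and return an empty result page (or an explicit notice) when it matches. Plain substring matching is easy to evade with small wording changes, so this only removes the reward for the footprints already seen.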
| marginalia_nu wrote: | > So spammers have latched onto your search engine because they | are getting useful results. | | I'm not sure about this. At least with my search engine, it | doesn't really seem to matter what response they get, I don't | even think they look at the responses. They keep hammering away | with tens of thousands of queries per day with the requests | even though they've seen nothing but HTTP Status 403 since last | October or so. | | My best guess is they're going after search engines in general | in case they forward queries to google, in order to manipulate | their typeahead suggestions. | miohtama wrote: | Put a CloudFlare web application firewall at the front of the | site and then use its rate limited / CAPTCHA features to | throttle traffic. It is the easiest way to get rid of | parasitic scraping and API abuse. Cost is $0. | MicahKV wrote: | Huh, well I guess there goes my theory about the incentive. | What a bummer. I would have thought that at least with search | engine scraping, they would stop expending the effort once | the results dried up. | z3t4 wrote: | Or put those query results behind an anti-bot/"capcha" test. | Ikatza wrote: | How about serving bots with one link per page, and taking a | minute to serve each page? Would this impact their | efficiency? | tofuahdude wrote: | Captcha breaking is SO easy these days; even the modern | captchas are easy to defeat. | MicahKV wrote: | That would probably help, but it's also a continuation of the | cat and mouse game. There are plenty of captcha breaking | services out there, it only cost about $1 to programmatically | solve 1000 captchas. | sylware wrote: | ... and there are the "click farms" with human beings. | z3t4 wrote: | If someone pay people to collect data you could outright | sell the data to them. 
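For readers who want to see what the rate limiting mentioned above amounts to, here is a minimal sliding-window sketch. It is not how Cloudflare or any site in this thread implements it; the class name `SlidingWindowLimiter` and its parameters are made up for illustration. Note the caveat raised elsewhere in the thread: when requests arrive from thousands of distributed residential IPs, a per-IP limit like this catches only the heaviest individual sources.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client key (e.g. an IP address) to timestamps of allowed hits.
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A handler would call `limiter.allow(client_ip)` per search request and answer with HTTP 429 (or a CAPTCHA) on `False`. The optional `now` argument exists only so the behaviour can be checked without sleeping.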
| anselmschueler wrote: | As I understand it, the main point of CAPTCHAs isn't to | keep out bots completely, but to give enough friction to | make automated attacks or uses infeasible, while keeping | the friction low enough that normal users can still use it | normally. | noAnswer wrote: | > There are plenty of captcha breaking services out there | | Give it a try and see what happens. | | People said greylisting against email spam wouldn't work, | since spammers would just resend. It works since 20 years. | To get your IP off the DNSBL NiX Spam you just have to | follow a link. People said spammers would automate that | process. Never happened in 19 years. Sometimes spammers are | just lazy. | minsc_and_boo wrote: | Sure, but it increases friction that forces a re-eval of | cost/benefit of the bot(s). | | Newest captcha services are a prediction score, not even a | verification screen, and you can feed polluting data to | bots you are certain to exist. | Calavar wrote: | Agreed. I suspect that this is an arbitrage game on the | part of the SEO spammers. Each search is cheaper for them | than it is for a competitor who's using a major search | engine with more extensive anti-spammer protections, and | that difference equals $$$. A captcha doesn't have to be | an unbeatable solution. It just has to provide enough of | a barrier to equalize the cost. | MicahKV wrote: | I'm not so sure about this. The spammers goal is to build | up as big a list of link spam targets as possible. If one | spammer chooses to only scrape minor engines and another | only major engines, the one scraping the major engines | will probably come out on top despite the higher cost. | Whoever is abusing OP's search engine is likely doing it | to supplement the data they are already scraping from the | major engines. | | For OP, I think simply not returning results at all is a | more practical measure because it removes the reward | completely. 
Captchas and bot detection keep the reward in | play, while taking away the results entirely makes the | entire pursuit futile. | go_prodev wrote: | Deliberately feeding the spam bots into an endless loop | of captchas might slowly drain their accounts if they are | paying 3rd party captcha farms. | jfim wrote: | It might be a better idea to return low quality results | than nothing at all. The idea is that it's pretty obvious | when the bot is banned when it receives no results at | all. Having to look at the results manually to determine | whether one is banned is a much more time consuming | endeavor. | MicahKV wrote: | Well what I'm suggesting isn't about blocking the bots, | it's about removing the incentive. So in this case, I | think the more obvious it is the better. I would want | them to realize as soon as possible that they are 100% | wasting their time. | | If anything, it might be best to return a page that | explicitly states "Sorry, this search engine no longer | supports SEO footprint search queries." | | *edit for typo & wording | bornfreddy wrote: | On the other hand, making content difficult to parse is | easy to do and a very strong weapon. Make them waste dev | time... It is much easier to make variants of HTML than | it is to parse it. You can even automate it to some | degree. | gopher_space wrote: | > It is very difficult to fight this on a technical level | | It is when your base assumption is that you won't hire outside | of engineering. There are more bored teenagers with phones than | people creating quality content, so I'm not sure why you | wouldn't just brute force checks against bad actors. | pstuart wrote: | If the confidence was high enough, perhaps return garbage data? | _tom_ wrote: | I think many people in the comments here, and most users, are | missing that you index a SMALL subset of the web. This leads to | people running a default test search, finding no results, and | concluding your search engine is bad, and leaving. 
| | While you imply that in the search page, obviously it's not clear | enough. | | Maybe add "this search engine only searches a small set of user | submitted sites. Click <here> for the list. Or <here> to add your | site." | AdamN wrote: | IMHO what you should try is excluding all sites with excessive | third-party cookies, sluggish performance, and too many ads. That | will slice the index down by 80% probably but it would be a | really nice thing to see. It might push out low quality SEO | results for a couple of years. | guerrilla wrote: | This is the solution. Google and DuckDuckGo should be doing | this too (and make exceptions if they need to so that they | don't collapse). We have to incentivize the good behavior and | create an environment where people actually compete on the | properties we want and not horseshit. | stuff4ben wrote: | Just posting here that I'm real and I'm glad I found | searchmysite! After HN, Verge, Ars, Gizmodo, and some car forum, | I struggle to find content I want to read. Hopefully this will | allow me to continue to find something I can read as I work on | solving problems at work. I find distractions help me to refocus | in an odd way. | marginalia_nu wrote: | I had to put my search engine behind Cloudflare to deal with | this. Like the volume grew to about 10x the traffic I saw sitting | at the front page of Hacker News for a full week. | marginalia_nu wrote: | This is the rate of rejected HTTP requests I'm seeing at this | point: https://www.marginalia.nu/junk/spam.png | | Real search traffic is about half that. | m-i-l wrote: | Thanks V. I'm seeing a similar number of problem search | requests (although nowhere near as many real search | requests:-), so it is probably the same "SEO practitioners" | running the same "scraping footprints" against different | search engines around the same time. 
| | I was kind of hoping that somewhere in this discussion there | would be an "And the answer to your problem is...", but I | suppose it is a very specific problem which only a search | engine would encounter. I think the Cloudflare solution you | have is probably the best way to block the requests as early as | possible. The reverse proxy config[0] I've got seems to be | mostly holding out for now though. | | [0] | https://github.com/searchmysite/searchmysite.net/issues/55 | marginalia_nu wrote: | If they're from the same outfit I've had problems with, I | really am at a loss as to what, other than Cloudflare, is a | good solution. I got like 4-5 requests per second at worst. | Seems to be a botnet: I entered a few of the source IPs | into my browser and got like login screens to enterprise | routers and so on. | searchableguy wrote: | Not surprised. I see many startups with a Head of SEO (search | engine optimization) with huge salaries nowadays. | evanmoran wrote: | Has anyone seen this bot growth with online newsletter signups? | I've noticed a steady increase in signups but without any | equivalent marketing or product buzz that might account for it. | jrussbowman wrote: | It's been the same for unscatter.com for years but I've always | attributed that to me not having a real marketing strategy or | even sticking with the ones I've tried to start. ___________________________________________________________________ (page generated 2022-05-16 23:00 UTC)