[HN Gopher] Google Search Results Plagued with spam ".it" domains
       ___________________________________________________________________
        
       Google Search Results Plagued with spam ".it" domains
        
       Author : pyinstallwoes
       Score  : 210 points
       Date   : 2022-07-23 08:19 UTC (14 hours ago)
        
 (HTM) web link (community.cloudflare.com)
 (TXT) w3m dump (community.cloudflare.com)
        
       | seydor wrote:
       | It must be the mafia. No other explanation
        
         | tremon wrote:
         | Yeah, but which one? The Camorra, Ndrangheta, Stidda, RIAA or
         | FAANG?
        
       | mkl95 wrote:
       | .it may be the .tk of the 2020s
        
         | qalmakka wrote:
         | Just, no. Plenty of legitimate websites under .it, basically
         | every single Italian company plus all localized versions of
         | international websites (apple.it, google.it, ...)
        
           | aaaaaaaaaaab wrote:
           | >basically every single Italian company
           | 
           | Sounds about as trustworthy to me as a .tk domain.
        
         | toastal wrote:
         | Shame too. I was hosting my personal site on .tk when I was
         | broke out of college, but often a link to it was automatically
         | flagged as spam.
        
         | marginalia_nu wrote:
         | Whatever domain name is cheap is going to be plagued by spam.
         | 
         | I think in recent time, .icu and .xyz have been the most
         | problematic, to the point where you to this day probably don't
         | want to host a mail server on those domains.
         | 
         | The same with cloud providers. A fairly significant amount of
         | sketchy websites seem to be hosted on cheap cloud providers
         | with weak rules enforcement. I've taken to blocking all of
         | Alibaba's IP ranges from my search engine crawler, the signal
         | to noise from those sites was so bad it just wasn't worth
         | looking for legit content.
        
         | password1 wrote:
         | Why? .tk was popular because it was free, so it was really
         | useful for teens and young adults in an era when you still had
         | to host things somewhere if you wanted them online. On the
         | other hand .it is the tld of Italy and used legit by all
         | businesses of EU's third largest economy.
        
           | ricardobeat wrote:
           | .it is currently being offered for free / EUR1
        
             | peoplefromibiza wrote:
             | .com are being offered free or EUR1 as well, but you don't
             | need to be an EU member with a valid EU ID to register a
             | .com
             | 
             | https://www.register.it/domains/?lang=en
             | 
             | https://imgur.com/W0XkZIj
             | 
             | https://imgur.com/a/p9sFsKj
        
               | [deleted]
        
       | worldofmatthew wrote:
       | Google appears to have stopped caring after Matt Cutts left...
        
         | jrockway wrote:
         | The Internet is quite a bit bigger than it was in 2016:
         | https://www.internetworldstats.com/emarketing.htm
         | 
         | (Have no idea how reputable that data is, but it seems about
         | right to me. In 2016 there were 3.6 billion Internet users. Now
         | there are 5.3 billion.)
        
           | ehnto wrote:
           | Also imagine how many bots there are, and how fast they can
           | generate content now.
        
           | Proven wrote:
        
       | sub7 wrote:
       | This headline would be accurate without the ".it" domains part
        
       | randomperson_24 wrote:
       | I think Google probably cannot fix it. Users will have to
       | manually report these results as spam. On seeing traffic from
       | Google's crawler bots, these websites show a perfectly legit and
       | highly SEO-optimized page, but show spam to everyone else. If
       | Google starts indexing from random IP ranges, most websites would
       | probably block indexing from "unofficial IPs" or some companies
       | (esp in EU) would file some lawsuit against Google. The reason
       | being that some pay-walled news article websites won't be indexed
       | properly, as the "unofficial IP-ed" Googlebot will not get the
       | paywalled content.
       | 
       | If a website lies to Google itself, I believe the only way to
       | solve it is by reporting the search result as spam, or by Google
       | contracting people to somehow visit all billions of web pages
       | (again the same problem - from different IP ranges) to verify
       | each one as a legit page.
       | 
       | I would like to know how Google currently handles it and probably
       | how it could be improved
        
         | panarky wrote:
         | Google has all the text they've scraped.
         | 
         | They can see all the domains that have served a given snippet.
         | 
         | They also have history to identify where each snippet was first
         | seen.
         | 
         | If SO has a lot of traffic and a good reputation, and if the
         | same snippet is found first at SO and then later at a bunch of
         | newly created, low volume, low reputation domains, then show
         | the SO result and not the others.
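         | 
         | A minimal sketch of that first-seen heuristic, not Google's
         | actual ranking logic. The Sighting record, the 0.0-1.0
         | reputation score and the 0.5 threshold are all assumptions made
         | up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One crawl observation of a text snippet on a domain (hypothetical schema)."""
    domain: str
    first_seen: int    # e.g. Unix timestamp of the first crawl that saw the snippet
    reputation: float  # assumed 0.0-1.0 score; how it is computed is out of scope


def pick_original(sightings: list[Sighting], min_reputation: float = 0.5) -> Sighting:
    """Prefer the earliest sighting among reputable domains.

    Falls back to the earliest sighting overall when no domain clears
    the (assumed) reputation threshold.
    """
    if not sightings:
        raise ValueError("no sightings for this snippet")
    reputable = [s for s in sightings if s.reputation >= min_reputation]
    pool = reputable or sightings
    return min(pool, key=lambda s: s.first_seen)


sightings = [
    Sighting("stackoverflow.com", first_seen=1_500_000_000, reputation=0.95),
    Sighting("copycat-one.example", first_seen=1_650_000_000, reputation=0.05),
    Sighting("copycat-two.example", first_seen=1_651_000_000, reputation=0.02),
]
print(pick_original(sightings).domain)  # stackoverflow.com
```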
        
         | Nextgrid wrote:
         | The practice of "cloaking" has been around for ages and I'm
         | sure Google has (or at least had) solutions against it.
         | 
         | I'm not sure on what grounds could someone sue for crawling
         | from random, unaffiliated addresses as long as the crawling
         | isn't causing a denial of service (they can always check
         | robots.txt using the main IP then use that to throttle crawling
         | from random IPs as to remain compliant).
         | 
         | > The reason being that some pay-walled news article websites
         | won't be indexed properly, as the "unofficial IP-ed" Googlebot
         | will not get the paywalled content.
         | 
         | Good riddance? That would be a welcome change.
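         | 
         | A rough sketch of the "check robots.txt once, then apply it to
         | every crawl address" idea, using Python's standard
         | urllib.robotparser. The robots.txt content, the "ExampleBot"
         | name and the helper functions are hypothetical; a real crawler
         | would fetch the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, as if fetched once from the "main" crawler IP.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object that can be shared by all workers."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy: RobotFileParser, user_agent: str, url: str) -> bool:
    """True if the parsed rules allow this user agent to fetch the URL."""
    return policy.can_fetch(user_agent, url)


def delay_for(policy: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = policy.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


policy = load_policy(ROBOTS_TXT)
print(may_fetch(policy, "ExampleBot", "https://example.com/index.html"))
print(may_fetch(policy, "ExampleBot", "https://example.com/private/page"))
print(delay_for(policy, "ExampleBot"))
```

         | Workers crawling from other addresses would consult the same
         | parsed policy before each request, so the throttle and the
         | disallow rules stay in force regardless of source IP.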
        
       | mrkramer wrote:
       | This is spam on the 2nd and 3rd pages of search results. I get
       | phishing websites on the 1st page of search results when I search
       | for certain ecommerce websites. Google is done.
        
       | jwally wrote:
       | Anecdotally gmail has been doing a miserable job filtering spam
       | for the last 5 months or so. For me it used to be pretty
       | bulletproof - one of its best features.
       | 
       | Now I get something from McAffee Pratners(sic) every other day
       | warning my computer is about to expire. Back in May I kept
       | winning things from Home Depot and Lowes; and gmail would
       | categorize it as "forums".
       | 
       | No idea if it's related, just odd.
        
         | hn_throwaway_99 wrote:
         | Agreed, same experience, all with the same format, primarily
         | from some sort of outlook.com domain.
        
           | hyperdimension wrote:
           | Ironic, since Microsoft is the worst (in my experience) at
           | being a giant black hole to emails sent from an otherwise
           | well-configured (SPF, DKIM, DMARC, non-SBL-listed IP, &c) but
           | not major SMTP host.
        
         | sofixa wrote:
         | I occasionally have spam that slips through Gmail's filters,
         | but when I explicitly mark it as spam it disappears and nothing
         | of the same type reappears.
        
         | silversmith wrote:
         | Add another anecdote to the anecdata pile. Past three months,
         | McAffee and north american shopping chain spam is breaking
         | through Gmail filters. And reporting as spam does not help. I
         | assume they've been somehow building Google reputation for the
         | spam accounts.
        
         | thelollies wrote:
         | Similarly I've been getting smashed with gmail spam, google
         | calendar spam and google drive spam for around 6 months now
         | after never previously getting any and despite reporting most
         | of it.
        
           | tempest_ wrote:
           | Yeah the drive spam started for me last year or the year
           | before. It took a break but in the last couple months has
           | returned with a vengeance.
           | 
           | I had the Slack google drive integration and I needed to mute
           | it because it was a couple of doc invites every few hours.
        
         | edvards wrote:
         | Been having the same issue on my old email. a LOT of spam going
         | past the filter.
        
         | pwiecz wrote:
         | On the other hand, over the last few months the majority of
         | posts from a few of the private Google Groups I'm a member of
         | keep getting wrongly classified as spam.
         | 
         | I don't know if it's related either.
        
         | pverghese wrote:
         | I disagree. Prior to Gmail I used to get thousands of spam
         | emails every day. Now everything is filtered. Barely get any.
         | 
         | The added benefit is I don't get any tech calls for help from
         | my parents who also don't end up clicking random spam and
         | wondering why bad things are happening
        
         | echen wrote:
         | I've been experiencing the same. Here are a bunch of egregious
         | spam mistakes we collected from different people to illustrate
         | the problem: https://www.surgehq.ai/blog/are-the-spammers-
         | winning-failure...
        
       | krono wrote:
       | In my experience, every single one of these results has the
       | `html` file extension in its URL. This is likely a consequence
       | of the useragent-based switcheroo technique they use to fool
       | Google.
       | 
       | Just blanket block the lot with the following uBlock Origin
       | filter:
       | google.*##.g:has(a[href*=".it"][href$=".html"])
       | 
       | Google ain't going to fix itself ;)
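       | 
       | A rough way to see what that filter matches, assuming the two
       | attribute selectors behave as plain string tests: href*=".it" is
       | a substring match and href$=".html" is a suffix match. The
       | function name and the example URLs are hypothetical.

```python
def filter_would_hide(href: str) -> bool:
    """Mimic a[href*=".it"][href$=".html"]: substring test plus suffix test."""
    return ".it" in href and href.endswith(".html")


examples = [
    "https://spam-site.example.it/cheap-product.html",  # spam on .it: hidden
    "https://comune.example.it/orari.html",             # legit .it page: also hidden
    "https://example.com/docs/page.items.html",         # ".it" inside the path: also hidden
    "https://example.it/",                              # .it without .html: kept
    "https://example.com/index.html",                   # no ".it" anywhere: kept
]

for href in examples:
    print(f"{'hide' if filter_would_hide(href) else 'keep'}  {href}")
```

       | Because the ".it" test is a substring match anywhere in the
       | link, it also catches non-Italian URLs that merely contain
       | ".it" somewhere, which is part of the trade-off.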
        
         | [deleted]
        
         | LinuxBender wrote:
         | In addition to this if one runs unbound as their DNS on their
         | home router and they block DoH then one could add
         | local-zone: "it" always_nxdomain
         | 
         | to NXDOMAIN all requests for the .it TLD and protect non
         | browser devices. I use this method to stay off sanctioned
         | country TLD's and to remove the cheap/free spammy domains and
         | TLD's that often contain more malware than anything useful.
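         | 
         | For more than one TLD, the same kind of line can be generated
         | from a list. A small sketch; the helper name and the example
         | TLD list are made up, and the output is meant to be placed
         | under the server: clause of unbound.conf.

```python
def unbound_block_lines(tlds: list[str]) -> list[str]:
    """Build one always_nxdomain local-zone line per TLD, skipping blanks."""
    lines = []
    for tld in tlds:
        # Normalise input such as ".ICU " to "icu".
        name = tld.strip().strip(".").lower()
        if not name:
            continue
        lines.append(f'local-zone: "{name}" always_nxdomain')
    return lines


# Example list only; pick whatever TLDs you actually want to blackhole.
for line in unbound_block_lines(["it", ".ICU ", "xyz", ""]):
    print(line)
```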
        
         | jacooper wrote:
         | Or use Brave search, which honestly from my experience is much
         | better.
        
         | peoplefromibiza wrote:
         | cool!
         | 
         | now s/\.it/every TLD/ and you solved domain spam forever.
         | 
         | /s
         | 
         | You might not know that 99.99% of .it domains with urls ending
         | up in .html are completely legit, including some official
         | government ones.
        
           | [deleted]
        
           | nkrisc wrote:
           | Since uBlock is run on the client, unless you're Italian or
           | interested in Italian sites it doesn't really seem like much
           | of an issue.
           | 
           | I could block all .it sites on my network and I'd likely
           | never even notice.
        
             | peoplefromibiza wrote:
             | yeah, right, unless you're american, why should you care
             | about .com domains? ¯\_(ツ)_/¯
             | 
             | the problem is not .it domains, it's clearly stated in the
             | linked post
             | 
             |  _A large number of spam pages are indexed when searching
             | by our product name. It's very similar to Japanese Keyword
             | hack, but the difference is that our site is not hacked_
             | 
             | so it's definitely an indexing issue, those .it domains are
             | being indexed for the Japanese keyword hack for some reason,
             | it's not that .it domains are particularly spammy per se.
             | 
             | Your "solution" would filter the vast minority of the
             | abusers at the cost of banning an entire TLD, not much
             | different than turning off the internet connection
             | entirely.
             | 
             | Most of the spam on the internet comes from .com domains
             | though, even more so because registering a .com domain is
             | much easier than getting an .it
             | 
             | Are you willing to ban .com too?
        
               | kzrdude wrote:
               | .com implies spam - it's commercial, so let's go ahead.
               | If it's not .org I'm not playing. /s
        
               | plank wrote:
               | And yet, here you are, and not on ycombinator.org? ;-)
        
               | nkrisc wrote:
               | > Your "solution" would filter the vast minority of the
               | abusers at the cost of banning an entire TLD, not much
               | different than turning off the internet connection
               | entirely.
               | 
               | Again, we're talking about client-side filtering. The
               | original comment about blocking .it domains was talking
               | about a uBlock Origin rule. No one's talking about
               | blocking .it domains from the web.
               | 
               | Yes, as an American, I could block all .it domains on my
               | end and my web experience likely wouldn't change at all.
               | I rarely, if ever, need to visit .it domains. So maybe I
               | will.
        
               | krono wrote:
               | This visually hides the HTML elements on Google Search
               | and for me only. There is no networking involved and so
               | Italian TLDs are still reachable.
               | 
               | This is a personal solution to an extremely disruptive
               | and long standing problem, and only affects those who
               | choose to employ it. It's not hurting anyone.
        
             | tbran wrote:
             | Nah. I've been reading the docs on Spatialite (the spatial
             | extension for SQLite) at http://www.gaia-gis.it/ the last
             | couple days. It has both a "spam" TLD and a design from
             | 1998.
        
           | fijiaarone wrote:
           | But not many of the official government ones.
        
             | peoplefromibiza wrote:
             | official government in Italy also means cities, towns,
             | hospitals, universities, public schools etc
             | 
             | There are 8 thousand towns in Italy, each with their own
             | .it website.
        
         | qalmakka wrote:
         | Blanket banning a whole TLD is stupid. One thing is blocking
         | some obscure stuff like ".su", but .it? It's just too big, and
         | arguably unwise if you are in Europe where having to connect to
         | Italian websites or services isn't a remote possibility.
        
           | krono wrote:
           | This merely hides Google search results in my browser.
           | 
           | No network connections are blocked...
        
         | permo-w wrote:
         | I'm sure there are plenty of non-spam html pages based in Italy
         | too
        
           | krono wrote:
           | Considering the crowd, that trade-off seemed too obvious to
           | mention.
        
         | seumars wrote:
         | What's this useragent switcheroo?
        
           | krono wrote:
           | Browsers and other programs can use the User-Agent[1] header
           | to send along a bit of information about themselves with each
           | request.
           | 
           | This and other information is then used to filter out various
           | types of visitor.
           | 
           | In this case, requests claiming to be a Google Search crawler
           | will receive a boring page with lots of text that it can
           | index and use as search results.
           | 
           | Most browsers' devtools let you change your user-agent
           | string, and a listing of the ones used by Google crawlers is
           | publicly available. Not saying that you should, but you could
           | check this out for yourself... entirely at your own risk of
           | course :)
           | 
           | https://en.wikipedia.org/wiki/User_agent
           | 
           | https://developers.google.com/search/docs/advanced/crawling/.
           | ..
        
       | politelemon wrote:
       | What does Cloudflare normally do with spam sites, is it a hands
       | off approach or do they do some policing?
        
         | yellow_lead wrote:
         | Hands off till it gets upvoted on HN
        
         | charcircuit wrote:
         | They do the bare minimum. You can report sites for abuse and
         | they will take them down. It doesn't appear like they do
         | anything to proactively stop similar sites so the person can
         | just make a new account and domain and be back in business.
        
           | miyuru wrote:
           | Cloudflare's abuse department is really lacking.
           | 
           | Their abuse form is getting abused too. It sends an email to
           | the site operator and the server hosting company in a single
           | submit, so it's getting abused. It doesn't even have a captcha.
           | 
           | https://abuse.cloudflare.com/
        
         | [deleted]
        
         | breakingcups wrote:
         | They will do nothing, and it is a feature.
         | 
         | The only thing they'll do is forward the complaint to the user.
         | Leaving you with no recourse other than to take legal action
         | before Cloudflare will lift a finger.
         | 
         | Unless there's CSAM, of course.
        
           | sammy2244 wrote:
           | You need to remember that Cloudflare isn't a host
        
             | r1ch wrote:
             | That isn't always the case these days, web apps can run on
             | Cloudflare without an origin.
        
           | bad416f1f5a2 wrote:
           | This... seems about right?
           | 
           | A trademark dispute is a civil issue between two parties. We
           | have legal systems to solve these. Cloudflare should ensure
           | that their customers get timely notification of complaints,
           | and that's pretty much it.
        
       | jl6 wrote:
       | I recently cursed google search results when trying to research
       | an actor's birth date. There were two dates given on Wikipedia
       | and I wanted to see which one (if either) was correct. Google
       | returned the actor's IMDB page (which listed a third date, and no
       | source), and then pages upon pages of what appeared to be auto-
       | generated sites that clearly scraped from Wikipedia, repeating
       | one or the other of the Wikipedia dates.
       | 
       | This is not helping to organize the world's knowledge.
        
         | hericium wrote:
         | > This is not helping to organize the world's knowledge.
         | 
         | And Google is not about organizing the world's knowledge but
         | creeping on people for YoY financial results.
        
           | richbell wrote:
           | They're quoting Google's own mission statement; though, you
           | are correct.
           | 
           | > Google's mission is to organize the world's information and
           | make it universally accessible and useful.
        
         | znpy wrote:
         | > This is not helping to organize the world's knowledge.
         | 
         | Oh they stopped doing that long long ago...
        
         | dazc wrote:
         | A lot of actors lie about their age so I wouldn't hold out too
         | much hope on getting an accurate result on that one.
         | 
         | I get your point though about the multiple results for
         | something where there clearly is no authoritative answer.
        
       | jeroenhd wrote:
       | I don't think this problem should be solved by Cloudflare. Cheap
       | domains will always exist and they shouldn't be a problem. The
       | problem lies with Google and its failure to detect these spam
       | sites.
       | 
       | Surely Google can spare an engineer or two to do a deep dive into
       | the way any one of these spam sites manages to get itself to the
       | first page of Google, work out their scheme, and fix the
       | algorithm? This problem isn't exactly hard to reproduce!
        
         | johnklos wrote:
         | I really don't get why so many people are willing to give
         | Cloudflare a free pass on stuff like this. Why is it OK for a
         | company to facilitate and host (1) thousands of scam domains,
         | making reporting arduous and ineffective?
         | 
         | Anyone trying to infect others with Trojans and viruses just
         | needs to check user agents or use dynamic redirect URLs, and
         | suddenly this clearly illegal activity becomes black magic that
         | is way beyond the comprehension of the folks at Cloudflare.
         | 
         | Cloudflare is basically making the shittiest parts of the
         | Internet safe for scammers and spammers, and this is just one
         | example.
         | 
         | If that's not bad enough, they're trying like crazy to become a
         | monopoly. If this is what they do now, imagine how bad it'll be
         | when they control even more and feel even more immune to making
         | money from scammers.
         | 
         | (1) Hosting is providing services on the Internet without which
         | a site would not function. Providing DNS is hosting. Providing
         | proxy is hosting. Providing email is hosting. Don't fall for
         | Cloudflare's "we don't host" bullshit.
        
           | Nextgrid wrote:
           | I disagree.
           | 
           | For example, SO copycats are legitimate in that they respect
           | the license and otherwise just serve the content to whoever
           | sends them an HTTP request. As far as I know they don't spam
           | links to their domain anywhere. They are low-quality and of
           | dubious utility for sure, but I'd rather not make the
           | Internet a place where you need to prove quality & utility to
           | someone to be able to host an HTTP server.
           | 
           | The real problem is that a dumbass like Google comes along,
           | sees this and decides that it should rank _higher_ than the
           | source content.
        
             | jfengel wrote:
             | If they're not spamming their links, how do they get such
             | high search engine optimization?
             | 
             | Somebody upthread suggested it was just the use of Google
             | ads, which I suppose is possible, but somehow it seems
             | unlikely. Google sure does love money but they also need to
             | be considered a good search engine, and I'd expect them to
             | be at least a little wary about things like that.
             | 
             | Is there something else I'm missing?
        
               | Nextgrid wrote:
               | It's my understanding that link spamming has become
               | counter-productive since a Google algorithm update almost
               | a decade ago? I'm not sure what they're doing but I don't
               | think it's link spam, because of that and also because
               | I've never seen their spam anywhere (if they're using
               | link spam they must do so on sources that have good
               | "authority" for programming-related topics and thus one
               | of us would've likely seen it).
        
               | techdragon wrote:
               | It's often cheap blog spam "original content" and
               | matching cheap social media spam to increase how
               | legitimate the blog looks ... which is cheaper than ever
               | now thanks to advances in machine learning models like
               | GPT-3 and other current generation models. The pipeline
               | is take a random sample of pages in the domain, take the
               | target page -> summarise -> generate some blog spam of
               | varying length and level of human input -> if desired
               | based on social media analytics then generate some
               | automated social posts about the blog article that was
               | just added since it's widely done by real humans with
               | their real blogs it all looks legit.
               | 
               | This is how it gets done and Google used to be brutal
               | about crushing it, somewhere along the way they seem to
               | have given up on being so brutal.
        
             | forgotpwd16 wrote:
             | >SO copycats are legitimate in that they respect the
             | license
             | 
             | Do they? SO contributions are under CC BY-SA. Haven't seen
             | copycats providing attribution let alone specifying that
             | the content is under the same license.
        
               | Nextgrid wrote:
               | I'm not sure, but the business model of them is ad
               | revenue - they get paid as soon as the page loads. Adding
               | the required attribution & license disclosure wouldn't
               | hurt them at all, so I'm assuming they're either already
               | doing it or will start doing it if asked.
        
             | mod wrote:
             | Exactly. Similarly, I think any dumbass should be able to
             | fix cars in his own yard, including for a fee and calling
             | himself a business.
             | 
             | But Google maps better not drive me to someone's yard when
             | I ask to navigate to a nearby mechanic.
             | 
             | If it did, it would be hard to blame anyone but Google.
        
             | abbe98 wrote:
             | Is it common that SO and Wikipedia copycats respect open
             | licenses? Most times I run into them they do not.
             | 
             | It's really tricky to enforce open licenses on this scale
             | as it's each contributor that licenses their content rather
             | than the platform host.
        
               | Nextgrid wrote:
               | See my response here:
               | https://news.ycombinator.com/item?id=32204913
        
           | 2OEH8eoCRo0 wrote:
           | Cloudflare should worry about what sites Google is choosing
           | to index and show?
           | 
           | This is clearly Google's issue.
        
             | tremon wrote:
             | No, Cloudflare should worry about what sites they host and
             | enable. How Google ranks the sites that Cloudflare hosts is
             | a secondary issue, and is outside of Cloudflare's control.
        
           | jeroenhd wrote:
           | Cloudflare should take action on reported domains and their
           | owners, especially if those domains are malicious.
           | 
           | However, I don't want Cloudflare to preventatively police
           | what is and isn't a bad website. When these scam sites go
           | live, they can quite easily contain real content (say, a
           | blog, with articles written by AI good enough not to be
           | immediately obvious) and then change into malware on a
           | schedule.
           | 
           | Cloudflare can't see what code customers run on the backend
           | and that's probably a good thing. They're already holding too
           | much power over the internet and requiring the backend to be
           | transparent would only make them more in control of the web.
           | 
           | Any registrar hosts thousands if not millions of spam sites
           | because every single one of the billion registrars have DNS
           | set up in some way.
           | 
           | Despite being almost exclusively used for spam and amateur
           | projects, the .TK TLD barely shows up in Google. Spam sites
           | are a symptom of other services linking to them and making
           | them worth the investment. If Google, Bing, Qwant and Yandex
           | weren't falling for the SEO scams these scammers use, we
           | wouldn't have this problem.
           | 
           | Hosters have some immunity by design, and that's very much a
           | good thing. They have to respond to abuse complaints, but
           | they're not responsible for filtering out all of their
           | customers. Requiring them to do so is exactly what the EU is
           | trying to force upon the internet, which is terrible for
           | online freedom.
        
             | RockRobotRock wrote:
             | You make some good arguments, but if a site is caught in
             | the act of hosting obvious malware, Cloudflare should make
             | a reasonable effort to suspend their activity.
        
         | nousermane wrote:
         | "Cheap domains" is not a thing. $25/year domain for a personal
         | website is kinda pricey. But scam/spam operator can pay that
         | and more pretty darn easy.
        
           | [deleted]
        
           | schroeding wrote:
           | There is at least one registrar that gives away .it domains
           | (and apparently .eu domains? WTF?) for free for one year[1],
           | with no major strings attached (as long as you cancel after
           | the first year) as far as I read, correct me if I'm wrong.
           | 
           | Why they decided to ".xyz the TLD", I don't know.
           | ¯\_(ツ)_/¯
           | 
           | [1] https://www.register.it/?lang=en
        
             | fortran77 wrote:
             | I'm not sure about today, but about 12 years ago, I was
             | able to get 1000 .info domains for about $200. (We were
             | doing some machine-generated splog creation to see if we
             | could game Google search results. We could.)
        
           | [deleted]
        
           | worldofmatthew wrote:
           | Who is charging $25/year for domains?
        
             | dark-star wrote:
             | $25/year for an .it domain is pretty cheap actually,
             | usually they sell for more like 40EUR per year
        
               | thiht wrote:
               | .it are around 10 bucks a year. No idea where you would
               | find them at 25 or 40 a year.
        
             | smcl wrote:
             | Depends on the TLD, I searched gandi with a very random set
             | of characters from the keyboard (to ensure I could probably get
             | many results) and here's a selection of country-level ones
             | which are above $25/yr:
             | 
             | - abcedasdfff.io = EUR59.29/year
             | 
             | - abcedasdfff.tw = EUR25.20/year
             | 
             | - abcedasdfff.nz = EUR25.40/year
             | 
             | - abcedasdfff.mx = EUR48.28/year
             | 
             | Most of them appear to be EUR10-20/yr, but it's certainly
             | not uncommon to see them go for EUR25 or higher. Note: EUR
             | and USD are roughly at parity so I don't think it's really
             | necessary to do a conversion.
        
               | worldofmatthew wrote:
               | Most common TLDs are not in that price range.
        
         | Hrundi wrote:
         | The fact that this has been going on for several years makes me
         | believe Google either doesn't care or the problem is
         | particularly hard to fix (less believable)
        
           | worldofmatthew wrote:
           | I noticed issues since Matt Cutts left. No one cares anymore.
           | There are AI-generated websites that have been running for
           | years, ranking highly in Google.
        
           | jaimex2 wrote:
           | It's incredibly easy to fix. They don't care as they have a
           | monopoly.
        
             | worldofmatthew wrote:
             | Allow people to report AI-generated websites to a human at
             | Google.
        
             | robocat wrote:
             | Another comment links to a blacklist, which works.
             | 
             | If it can be effectively blacklisted, then Google is
             | dropping the ball. This isn't difficult algorithm foo
             | failure.
             | 
             | I don't agree with your sentences, but I do agree with your
             | point.
        
           | ehnto wrote:
           | It has been an ongoing battle for 10-15 years at this point.
           | Search engines are constantly battling people trying to game
           | their systems. I have to wonder if Google hasn't lost the
           | thread a bit, inside their surely quite complex algorithm
           | black boxes.
           | 
           | For a while now Google has suggested that the best way to
           | rank well is to have human readable content and focus on user
           | experience. At the same time, natural language generation has
           | come leaps and bounds, to the point where sometimes even I, a
           | human, can't tell if an article has been spun by a bot or
           | not.
           | 
           | So if Google starts ranking human readable content, and
           | robots can now produce human readable content, what is the
           | next ranking signal they can use to differentiate spam from
           | humans? Are we going to end up with "Verified Websites" ala
           | verified Twitter handles?
           | 
           | A huge portion of the web at this point is just bots
           | communicating with each other, and legitimate business systems
           | having to process bots' participation on the internet. I
           | imagine the portion of the web that Google crawls that is
           | legitimate versus that which is bot generated would surely be
           | majority bots, just because of how fast they can generate
           | content. One thing they can't do as easily though is register
           | domains, so it may be one of the better points of defense.
        
             | pixl97 wrote:
             | The dead internet theory.
        
           | frankfrankfrank wrote:
           | Google has at the very least neglected its search for many
           | years now and recently has also actively made it worse
           | through all the censorship and thought control stuff. I find
           | it rather surprising because essentially all of google's
           | success is lynchpinned by search. All it would take is for a
           | narrative to dominate that the best results can be found
           | elsewhere, which does not seem particularly remote,
           | considering how much damage google has done to its search.
        
         | hericium wrote:
         | > I don't think this problem should be solved by Cloudflare.
         | Cheap domains will always exist and they shouldn't be a
         | problem. The problem lies with Google and its failure to detect
         | these spam sites.
         | 
         | The problem exists outside (Google-controlled) web: with (not
         | fully Google-controlled) email, too.
         | 
         | Around 2020 I did per-TLD checks on wanted/unwanted messages
         | (ham and spam). With thousands of messages sent from .xyz
         | domains (envelope sender host or PTR record of sending host; I
         | ignored the From header) there wasn't a single legit message.
         | 100% SPAM.
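
A per-TLD tally like the one described above could be sketched roughly as follows. This is a minimal illustration, not the commenter's actual tooling: the function names and the sample messages are made up, and it assumes each message has already been labelled ham or spam and reduced to its sending host (envelope sender host or PTR record).

```python
from collections import Counter


def tld_of(host: str) -> str:
    """Return the last DNS label of a host name, lowercased."""
    return host.rstrip(".").rsplit(".", 1)[-1].lower()


def spam_share_by_tld(messages):
    """Tally spam per TLD.

    messages: iterable of (sender_host, is_spam) pairs.
    Returns {tld: (spam_count, total_count, spam_fraction)}.
    """
    total = Counter()
    spam = Counter()
    for host, is_spam in messages:
        tld = tld_of(host)
        total[tld] += 1
        if is_spam:
            spam[tld] += 1
    return {tld: (spam[tld], n, spam[tld] / n) for tld, n in total.items()}


# Hypothetical sample data, for illustration only.
sample = [
    ("mail.example.xyz", True),
    ("mx1.promo.example.xyz.", True),
    ("smtp.example.com", False),
    ("bulk.example.com", True),
]
report = spam_share_by_tld(sample)
for tld, (spam_count, total_count, fraction) in sorted(report.items()):
    print(f".{tld}: {spam_count}/{total_count} spam ({fraction:.0%})")
```

A TLD whose spam fraction stays at 100% over thousands of messages is the kind of result the comment reports for .xyz.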
        
           | axsharma wrote:
           | The irony, Google/Alphabet uses ABC.xyz.
        
       | oefrha wrote:
       | What can search engines do about user-agent based content
       | differentiation? Say my robots.txt allows Googlebot and nothing
       | else. If Google attempts to double-check with a covert user
       | agent, robots.txt is violated. Assign humans to review reported
       | pages? It's pretty easy to swamp a manual system like that. Just
       | forget about robots.txt?
        
         | foobarbecue wrote:
         | I remember reading on HN years ago that Google bots have never
         | honored robots.txt, but I don't actually know
        
         | hombre_fatal wrote:
         | robots.txt is just a guideline between well-meaning actors for
         | the majority of their traffic, like helping a bot not waste its
         | time nor your bandwidth by crawling dynamically-generated,
         | endless-scrolling /calendar.php pages. Google does use it to
         | that extent.
         | 
         | It's not a firewall.
         | 
         | Seems like you're describing cloaking
         | (https://developers.google.com/search/docs/advanced/guideline...),
         | one of the oldest SEO tricks, and you can imagine that search
         | engines started defeating it on Day 2 of crawling the web.
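
To illustrate the "guideline, not a firewall" point, here is a small sketch using Python's standard `urllib.robotparser`. The rules text is a made-up robots.txt mirroring the scenario in the parent comment (Googlebot allowed, everything else disallowed), and `SomeOtherBot` and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow Googlebot, disallow every other crawler.
rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The parser only reports what the site owner asked for.
googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/page")
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/page")

print("Googlebot allowed:", googlebot_ok)
print("SomeOtherBot allowed:", other_ok)
```

Nothing here stops a client from requesting the page anyway. Honoring the answer is entirely up to the crawler, which is why robots.txt cannot by itself protect a cloaking setup from being checked.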
        
       | Thorentis wrote:
       | The obsession with "machine learning" is actually making systems
       | dumber. Google Search and Gmail spam filters are getting worse
       | with each passing week, and I am almost certain the increasing
       | reliance on ML is to blame.
        
         | patentatt wrote:
         | I chalk it up to a cost benefit calculation. Google clearly
         | isn't trying to eliminate all spam in search. It's not their
         | goal. They are not trying to optimize for the user experience.
         | They're trying to optimize revenue.
        
           | mod wrote:
           | They're trying to keep spam out of my inbox, and the spam
           | rate has been increasing for me (and other HN commenters who
           | frequently talk about it)
        
         | tremon wrote:
         | The competing explanation is that "machine learning" is
         | actually making spam generator systems smarter, so spam gets
         | harder to detect.
        
       | beardyw wrote:
       | Presumably these folk take advantage of cheap/free domain offers
       | wherever in the world they are.
        
         | 01acheru wrote:
         | I think you are right, https://www.register.it/ has been
         | offering free domains for 1 year for some time now.
        
           | [deleted]
        
           | peoplefromibiza wrote:
           | it's not that straightforward though.
           | 
           | to register an .it you must prove you are a person or a
           | business working or residing in one of the EU member states
           | and need to provide the ID of a person who's gonna be listed
           | as admin-c of the domain.
        
             | gsich wrote:
             | No you don't. I had a .it domain too, yes there is a field
             | in registration where you should enter an "identity card
             | id", but I didn't have one so I entered something random.
             | Worked of course.
        
               | peoplefromibiza wrote:
               | > No you don't.
               | 
               | Yes, you do!
               | 
               | of course it worked.
               | 
               | you just committed a crime.
               | 
               | you can fake your id everywhere in the World, it is a
               | crime everywhere in the world and if something happens
               | doesn't mean you won't get caught.
               | 
               | you can drive a stolen car, it will work.
               | 
               | > yes there is a field in registration where you should
               | enter an "identity card id"
               | 
               | so it is required! you simply ignored it, lied and broke
               | the law.
               | 
               | your criminal behaviour doesn't imply laws do not exist.
               | 
               | if you tried to buy an insurance policy with that fake
               | ID, you would be in trouble now.
        
               | [deleted]
        
               | [deleted]
        
               | pixl97 wrote:
               | Right, and I'm sure that government across the ocean will
               | get right on prosecuting that violation...
        
               | schroeding wrote:
               | You can do that, but you always run the risk of someone
               | snitching to nic.it, in which case you would lose the
               | domain. :/
        
               | 01acheru wrote:
               | I don't think this is an issue if you're a spammer. Those
               | domains are probably short lived anyway.
        
         | pyinstallwoes wrote:
         | I've actually experienced this and it is not related at all to
         | the device. It was related to the signed in google account
         | across networks and devices.
        
       | Ueland wrote:
       | Note that the discussion is a year old. Around one year ago I
       | wrote more about this "phenomenon" here:
       | https://news.ycombinator.com/item?id=27993123
       | 
       | tl;dr: I managed to find the servers behind it, most likely
       | anybody who is still affected can do the same thing I did pretty
       | easily. We also followed the money, which is a tad more work.
        
       | thejosh wrote:
       | There has been a huge influx this year with the amount of sites
       | that simply scrape SO and then have the exact content on their
       | site. It's a pain, and there is no official way to remove them.
       | 
       | I thought that this was a massive no-no from Google's side, has
       | something changed?
        
         | xbar wrote:
         | This reminds me of this one site that simply scraped all the
         | open source code it could see and then produced AI-generated
         | copies.
        
         | phreack wrote:
         | It took me a while for no good reason but I finally got an
         | unofficial extension to add a "block" button to search results.
         | It immediately improved my experience, I can't recommend it
         | enough. No more Pinterest, SO clones, useless Quora spam, with
         | very little work. I can't believe I didn't do it sooner.
        
         | ofou wrote:
         | just switch to you.com
        
         | anonred wrote:
         | Other search engines allow you to block domains from showing up
         | in the results. I've switched to Kagi out of frustration and
         | honestly it's as good or better than Google just because of
         | that one feature.
        
         | endofreach wrote:
         | https://news.ycombinator.com/item?id=29403947
        
         | atwood22 wrote:
         | I have no evidence of this, but the ad load on the returned
         | results has gotten way higher. In theory, ranking sites that
         | display Google ads higher would be a very easy knob for Google
         | to turn to increase profit. The SO scrapers probably have
         | Google ads on them, making them more profitable for Google.
        
           | Nextgrid wrote:
           | Turning the knob one way explicitly might raise some anti-
           | trust concerns, however the same motivation can be used to
           | _avoid_ turning the knob the other way and this can be done
           | much more sneakily without leaving clear evidence - simply
           | don't allocate budget/etc to projects that would turn the
           | knob the other way and you're done.
        
           | hombre_fatal wrote:
           | I ran into so many Stack Overflow "mirrors" yesterday like
           | this:
           | https://www.anycodings.com/1questions/400836/swiftui-update-...
           | 
           | 10 years ago I gave up on a large project where I rehosted and
           | organized dead Usenet forum content because Google's dupe-
           | penalty detector was too good and too aggressive for content
           | that you could barely find beyond a six-year-old cache hit
           | where the origin website was long gone.
           | 
           | Meanwhile these Stack Overflow scrapers are just
           | `<html>{copy-and-paste}</html>` and the same domains are
           | still alive despite years of cloning.
           | 
           | Looks like it's time to boot my project back up.
        
             | atwood22 wrote:
             | It's clearly not a copy and paste. I just visited that link
             | on my phone and got blocked from viewing because I'm using
             | an ad blocker.
        
             | avipars wrote:
             | Also lots of github scrapers
        
             | skilled wrote:
             | "All Rights Reserved."
        
           | panarky wrote:
           | This is a very old conspiracy theory that's been repeatedly
           | debunked.
           | 
           | https://www.searchenginejournal.com/ranking-factors/google-a...
        
             | chakkepolja wrote:
             | That link is about AdWords spend by the site in question,
             | and not about displaying AdSense ads on the site. Totally
             | unrelated.
        
           | 1597 wrote:
           | I've noticed this with youtube. Even though I'm on desktop
           | with an adblocker they repeatedly autoplay the same video
           | with a creator embedded crypto promotion at the beginning
           | (especially when it would be plausible to infer I'm asleep
           | from user interaction and clock/watch time). Must be getting
           | a cut (plus scamming the ad buyer).
        
         | Surfactant7 wrote:
         | There's a simple way around that. Nothing to install. Nothing
         | to update.
         | 
         | Just go to SO and use its search bar. It's actually quite good.
         | 
         | I mean, you know that's where you'll want to find the answer
         | anyway - not some random corporate webpage or ad-infested
         | splog. Why not cut out the middle man?
         | 
         | Only if that fails do I bother with Google.
        
           | watchdogtimer wrote:
           | Or, if DuckDuckGo is your default search engine, you can
           | append ' !so' to your search term.
        
           | burnished wrote:
           | Huh, you know, you're right. I recently did that and it was
           | fine.
           | 
           | I think a lot of others formed their opinion (myself muchly
           | included) about this from sites where the search bar was a
           | joke played on people.
           | 
           | Edit: let me upgrade that 'fine' to 'great', now that I think
           | about it it was actually better than a google search which
           | was not my previous experience.
        
           | avereveard wrote:
           | Google's index used to be far more competent at finding
           | relevant issues for a query, even when some of the words were
           | only synonyms of, or loosely related to, what appeared in the
           | snippets.
        
         | Quenhus wrote:
         | Here is my uBlock filter with hundreds of GitHub/StackOverflow
         | copycats: https://github.com/quenhus/uBlock-Origin-dev-filter
         | 
         | It blocks copycats and hides them from multiple search engines.
         | You may also use the list with uBlacklist.
        
           | thejosh wrote:
           | This is fantastic! This is exactly what I needed, thanks!
        
           | SmellTheGlove wrote:
           | You rock. Thank you.
        
           | Phlogi wrote:
           | This even works on Firefox Nightly on Android. Thanks a lot!
        
           | colordrops wrote:
           | With these two pieces of data:
           | 
           | * the identical text copied from stack overflow should be
           | easily identifiable
           | 
           | * volunteers put together a list of these sites themselves
           | 
           | it should be obvious to Google apologists that Google is
           | either negligent or intentionally allowing these sites in
           | their search. I'm sick of hearing about how "the world is
           | different" and it's an "arms race" between spam sites and
           | google. Bullshit.
        
             | IfOnlyYouKnew wrote:
             | The problem with these theories is that they lack any
             | sensible explanation of motive. Google intentionally
             | degrading its search results because they "earn more if the
             | user has to search again and again" just doesn't feel
             | right: even if it were true in some short-term experiment,
             | it would compromise the way people at Google think of
             | themselves and their work to a degree that would be
             | devastating to the company. There is no way they would
             | throw away that sort of value without being under intense
             | pressure, which they definitely are not.
        
               | colordrops wrote:
               | These large tech companies have a long and varied history
               | of stupid short-term decision making for profit and bad
               | products due to local individual failures. Until there is
               | a clear and detailed explanation of how the spam sites
               | are avoiding google's wrath, the explanation of stupidity
               | or short-term thinking on Google's part seems just as
               | plausible.
        
               | lamontcg wrote:
               | Well come up with an explanation of how these entirely
               | mechanically generated SO clone sites, with no
               | obfuscation, are allowed to exist by Google, when
               | identifying them and removing them should be fairly
               | trivial?
               | 
               | At the very least they're being deliberately neglectful
               | because they don't feel the bad experience harms their
               | revenue because there's no other substantial competitor
               | so they can abuse their monopoly status.
               | 
               | I guess they may just not care enough about software
               | developers and figure we're mostly using ad blockers so
               | it's wasted effort and we'll develop blocklists ourselves.
               | With no monetary value that they can assign to the ill
               | will that it engenders they figure it must not matter so
               | they don't bother. Pissing off a large chunk of the
               | entire IT community via obvious neglect seems like a poor
               | move to me, but then I've never felt that I'm cut out for
               | management.
        
               | burnished wrote:
               | Maybe the problem is just genuinely hard and beyond their
               | capabilities.
        
               | colordrops wrote:
               | Detecting identical snippets of text is beyond virtually
               | no one's abilities.
        
               | Beldin wrote:
               | Another comment stated that SO uses ads from someone other
               | than Google, while the copy-paste sites use Google for
               | ads. If true, that is clear monetary incentive to not go
               | after this too hard.
        
             | rightbyte wrote:
             | SO seems to have Yahoo ads, so I guess it is a no brainer
             | for Google to rank sites they profit from over the content
             | the lusers want.
        
               | jiggawatts wrote:
               | This is the real answer.
        
             | remus wrote:
             | > the identical text copied from stack overflow should be
             | easily identifiable
             | 
             | Google starts matching content from SO => Spammers start
             | tweaking the text slightly => google implements some
             | expensive similarity score to down-rank copycat sites =>
             | spammers use more complex scrambling => ...
             | 
             | > volunteers put together a list of these sites themselves
             | 
             | These lists only work because they're used by a tiny
             | minority of people. If Google were to do this the spammers
             | would start switching domains more quickly (or find some
             | other workaround).
             | 
             | I'm no Google apologist but I think you're underestimating
             | how hard search ranking is when spammers are actively
             | trying to game the system.
        
               | colordrops wrote:
               | > tweaking the text slightly
               | 
               | That's what ML is perfect at detecting, which is Google's
               | forte.
               | 
               | Some of these sites have been returned as top results for
               | a while, so are you suggesting that Google just gave up
               | because spammers would be able to evade them with an
               | update?
        
         | maxwelldone wrote:
         | I've been using this uBO filter since someone recommended it in
         | a different thread and it's been great at removing those
         | annoying sites from search results:
         | https://github.com/quenhus/uBlock-Origin-dev-filter
        
           | burtekd wrote:
           | The author actually posted above your comment ;)
        
         | minutillo wrote:
         | My theory is that one of the inputs to Google's ranking
         | algorithm is now "how much money would we make from this
         | click?" A click to SO has a small number of ads which are
         | obviously ads and easily ignored. A click to the average
         | scrape-jacked SO page has dozens of ads using every dark
         | pattern in the book to generate accidental clicks.
        
           | fluidcruft wrote:
           | One of the other commenters above made the claim that SO runs
           | yahoo ads. If that's true then from a Google perspective, the
           | click has either zero or negative money-making value.
           | 
           | Maybe that means we should be searching in yahoo rather than
           | google.
        
         | disruptiveink wrote:
         | I'd like to get actual confirmation of this, but my vague
         | feeling is that, once upon a time, Google Search would get
         | "updates", as in, actually deployed code that would change the
         | rule of the game and most of the previous dirty tricks would
         | become unusable, leading people to go out and find new
         | ones.
         | 
         | This changed with the Google "machine learning" days, where you
         | no longer have humans at the helm laying down explicit rules,
         | so no more "change the world" updates, you can only slightly
         | nudge the parameters towards what you want, meaning the same
         | old tricks keep being effective for far too long.
        
           | nerdawson wrote:
           | The "May Core Update" which recently rolled out impacted
           | every site.
           | 
           | A lot of updates are targeted at specific problems such as
           | low quality product reviews but there are still broader
           | updates taking place.
        
           | dewey wrote:
           | > Google Search would get "updates", as in, actually deployed
           | code that would change the rule of the game
           | 
           | That's just what the scheduled "core update" days are now:
           | https://developers.google.com/search/blog/2022/05/may-2022-c...
        
         | skilled wrote:
         | I had this topic brought back to my mind yesterday as I was
         | doing some research using the Ahrefs keyword tool. I do believe
         | it would be possible to create a very large dataset of these
         | copycat sites (using Ahrefs) to be used as a blacklist in
         | various filters/extensions.
         | 
         | But the crazy part is that, for example - Ahrefs says that
         | StackOverflow has "Organic traffic" in the range of 22 million
         | per month. A lot of these copycat sites, at least the ones I
         | saw - have a traffic range anywhere from 10k to 500k per month.
         | 
         | I mean, it's pretty insane just how well such sites can rank in
         | Google, and you bet those copycats are making absolute bank
         | from ads even if the majority of developers immediately close
         | the site.
         | 
         | There's a lot going on with Google Search these days, a lot of
         | people are complaining that sites that scrape content can
         | easily rank really well for long-tail keywords. One case in
         | particular, a site will scrape Google to collect "featured
         | snippets" and "people also ask" - then combine anywhere from
         | 20 to 40 of these answers and publish them as a blog post.
         | 
         | None of the words are changed, all questions/answers worded
         | exactly the same. And Google puts these sites on page 1.
         | 
         | What a joke.
        
           | helsinkiandrew wrote:
           | > I do believe it would be possible to create a very large
           | dataset of these copycat sites
           | 
           | Would they just move to creating and using new domains with
           | the same content as soon as traffic to the old ones drops?
           | (Which looks like what the spammers in the original post are
           | doing.)
           | 
           | But something does need to be done to these sites.
        
             | fragmede wrote:
             | Fingerprint the site's content so the new domain name isn't
             | able to SEO a good score.
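
One common way to fingerprint content independently of the domain it is served from is to compare sets of word shingles with Jaccard similarity. The sketch below is only an illustration of that general idea, under the assumption that page text has already been extracted. It is not a description of how Google or any search engine actually works, and the function names and sample sentences are invented.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles from lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page texts, for illustration only.
original = "You can reverse a list in place with the reverse method of the list object."
tweaked = "You can reverse a list in place using the reverse method of the list object."
unrelated = "Domain registrars sometimes give away a free first year to attract customers."

sim_tweaked = jaccard(shingles(original), shingles(tweaked))
sim_unrelated = jaccard(shingles(original), shingles(unrelated))

print("tweaked copy:", sim_tweaked)
print("unrelated text:", sim_unrelated)
```

Because the score depends only on the text, a scraper that moves the same content to a fresh domain would still score high against the original. Heavier rewording lowers the score, which is the escalation described elsewhere in the thread.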
        
           | SheetPost wrote:
           | > bet those copycats are making absolute bank from ads even
           | if the majority of developers immediately close the site
           | 
           | I bet the majority of developers block ads
        
             | skilled wrote:
             | "developers"
        
           | chewz wrote:
           | > What a joke.
           | 
           | It is simple. Google is making more money from copycat sites
           | than from original content...
        
             | dan-robertson wrote:
             | I think this just isn't how Google works. I would expect to
             | see a lot more spam if Google were happy to collect money
             | from advertising on spam sites.
        
             | unglaublich wrote:
             | ... in the short term.
        
               | pixl97 wrote:
               | Does the market care about anything else?
        
               | [deleted]
        
           | ben_jones wrote:
           | For all we know the sites have better internationalizations
           | and cater to audiences invisible from a US-based perspective.
        
             | bequanna wrote:
             | These sites are just scraping SO and dumping the text from
             | the question+answers in a blog-style format.
             | 
             | I don't think this is a cultural issue, I fail to see how
             | this can be considered value add by anyone.
        
               | adhesive_wombat wrote:
               | I found that Google will even rank a quote from an issue
               | tracker on one of those "clones with advert/malware
               | overlays" higher than the original.
        
           | lamontcg wrote:
           | > then combine anywhere from 20 to 40 of these answers and
           | publish them as a blog post.
           | 
           | yeah i've been hitting a ton of those lately.
        
         | aaaaaaaaaaab wrote:
         | Some ex-Googlers say that someone ran an AB-test, and it turned
         | out that per-search revenue was decreasing when these sites
         | were blocked.
        
         | MaxDPS wrote:
         | I've been using uBlacklist and it works really well. It even
         | lets me highlight specific websites so I have a better chance
         | of seeing them if they are further down the list.
         | https://iorate.github.io/ublacklist/docs
        
         | pyinstallwoes wrote:
         | Similar to YouTube search results. Lots of spam videos. No way
         | to block a creator. Totally ruins it.
        
       | aaron695 wrote:
        
       | kmfrk wrote:
       | Remember the good old days of talking about a "semantic web"? Now
       | we just get one Google results page of SEO'd garbage with no way
       | to process them.
       | 
       | I can't help but plug kagi.com, which has the amazing feature of
       | grouping SEO'd stuff like recommendation lists together, so a
       | thing that's contextually useful is still available but without
       | polluting the other contexts.
        
       | midislack wrote:
       | Google hasn't given a shit about search since at least a decade
       | ago. It's all about data collection via Android and Chrome OS,
       | and gmail and docs. They don't need search to collect your data
       | any more. Don't people actually know this? LOL
        
         | DharmaPolice wrote:
         | All that data they've collected is only useful if they can sell
         | something (i.e. ads) based on the data. AFAIK the majority of
         | their ad income comes from search-based ads.
        
       | pyinstallwoes wrote:
       | Also noticed here:
       | https://support.google.com/websearch/thread/118733416/lot-of...
       | 
       | Locked?
        
         | rwmj wrote:
         | I've been getting these spam .it domains for years and years,
         | this is nothing at all new.
        
       | meerita wrote:
       | Dealwith.it
        
       | fredgrott wrote:
       | there is also the spam of name.ru.com domains as well
       | 
       | Warning, do not click on those links as you will get your PC
       | infected.
        
         | sammy2244 wrote:
         | Wow from clicking on a link on a modern browser? New 0-day?
        
       | nottorp wrote:
       | No, google search is plagued with spam from any domain. And even
       | the non spam results are useless.
        
       | H8crilA wrote:
       | It may be that deep learning is now increasingly used to generate
       | the spam. It either is or will be used for spam generation A LOT.
       | Frankly it seems to be the most promising commercial use-case for
       | the large language models.
        
       ___________________________________________________________________
       (page generated 2022-07-23 23:00 UTC)