[HN Gopher] Google Search Results Plagued with spam ".it" domains ___________________________________________________________________ Google Search Results Plagued with spam ".it" domains Author : pyinstallwoes Score : 210 points Date : 2022-07-23 08:19 UTC (14 hours ago) (HTM) web link (community.cloudflare.com) (TXT) w3m dump (community.cloudflare.com) | seydor wrote: | It must be the mafia. No other explanation | tremon wrote: | Yeah, but which one? The Camorra, Ndrangheta, Stidda, RIAA or | FAANG? | mkl95 wrote: | .it may be the .tk of the 2020s | qalmakka wrote: | Just, no. Plenty of legitimate websites under .it, basically | every single Italian company plus all localized versions of | international websites (apple.it, google.it, ...) | aaaaaaaaaaab wrote: | >basically every single Italian company | | Sounds about as trustworthy to me as a .tk domain. | toastal wrote: | Shame too. I was hosting my personal site on .tk when I was | broke out of college, but often a link too it was automatically | flagged as spam. | marginalia_nu wrote: | Whatever domain name is cheap is going to be plagued by spam. | | I think in recent time, .icu and .xyz have been the most | problematic, to the point where you to this day probably don't | want to host a mail server on those domains. | | The same with cloud providers. A fairly significant amount of | sketchy websites seem to be hosted on cheap cloud providers | with weak rules enforcement. I've taken to blocking all of | Alibaba's IP ranges from my search engine crawler, the signal | to noise from those sites were so bad it just wasn't worth | looking for legit content. | password1 wrote: | Why? .tk was popular because it was free, so it was really | useful for teens and young adults in an era when you still had | to host things somewhere if you wanted them online. On the | other hand .it is the tld of Italy and used legit by all | businesses of EU's third largest economy. 
| ricardobeat wrote: | .it is currently being offered for free / EUR1 | peoplefromibiza wrote: | .com are being offered free or EUR1 as well, but you don't | need to be an EU member with a valid EU ID to register a | .com | | https://www.register.it/domains/?lang=en | | https://imgur.com/W0XkZIj | | https://imgur.com/a/p9sFsKj | [deleted] | worldofmatthew wrote: | Google appears to stopped caring after Matt Cutts left...... | jrockway wrote: | The Internet is quite a bit bigger than it was in 2016: | https://www.internetworldstats.com/emarketing.htm | | (Have no idea how reputable that data is, but it seems about | right to me. In 2016 there were 3.6 billion Internet users. Now | there are 5.3 billion.) | ehnto wrote: | Also imagine how many bots there are, and how fast they can | generate content now. | Proven wrote: | sub7 wrote: | This headline would be accurate without the ".it" domains part | randomperson_24 wrote: | I think Google can probably not fix it. Users will have to be | manually reporting as spam. These websites on seeing traffic from | Google's crawler bots show a perfectly legit and highly SEO | optimized website, but for anything else show other spam. If | Google starts indexing from random IP ranges, most websites would | probably block indexing from "unofficial IPs" or some companies | (esp in EU) would file some lawsuit against Google. The reason | being that some pay-walled news article websites won't be indexed | properly, as the "unofficial IP-ed" Googlebot will not get the | paywalled content. | | If a website lies to Google itself, I believe the only way to | solve it is by reporting the search result as spam or Google | contracts people to somehow visit all billions of web pages | (again the same problem - from different IP ranges) to verify it | as a legit page. | | I would like to know how Google currently handles it and probably | how it could be improved | panarky wrote: | Google has all the text they've scraped. 
| | They can see all the domains that have served a given snippet. | | They also have history to identify where each snippet was first | seen. | | If SO has a lot of traffic and a good reputation, and if the | same snippet is found first at SO and then later at bunch of | newly created, low volume, low reputation domains, then show | the SO result and not the others. | Nextgrid wrote: | The practice of "cloaking" has been around for ages and I'm | sure Google has (or at least had) solutions against it. | | I'm not sure on what grounds could someone sue for crawling | from random, unaffiliated addresses as long as the crawling | isn't causing a denial of service (they can always check | robots.txt using the main IP then use that to throttle crawling | from random IPs as to remain compliant). | | > The reason being that some pay-walled news article websites | won't be indexed properly, as the "unofficial IP-ed" Googlebot | will not get the paywalled content. | | Good riddance? That would be a welcome change. | mrkramer wrote: | This is 2nd and 3rd search page results spam, I get spam phishing | websites on 1st page of search results when I search for certain | ecommerce websites. Google is done. | jwally wrote: | Anecdotally gmail has been doing a miserable job filtering spam | for the last 5 months or so. For me it used to be pretty | bulletproof - one of its best features. | | Now I get something from McAffee Pratners(sic) every other day | warning my computer is about to expire. Back in May I kept | winning things from Home Depot and Lowes; and gmail would | categorize it as "forums". | | No idea if its related, just odd. | hn_throwaway_99 wrote: | Agreed, same experience, all with the same format, primarily | from some sort of outlook.com domain. 
| hyperdimension wrote: | Ironic, since Microsoft is the worst (in my experience) at | being a giant black hole for emails sent from an otherwise | well-configured (SPF, DKIM, DMARC, non-SBL-listed IP, &c) but | not major SMTP host. | sofixa wrote: | I occasionally have spam that slips through Gmail's filters, | but when I explicitly mark it as spam it disappears and nothing of | the same type reappears. | silversmith wrote: | Add another anecdote to the anecdata pile. For the past three months, | McAfee and North American shopping chain spam has been breaking | through Gmail filters. And reporting it as spam does not help. I | assume they've been somehow building Google reputation for the | spam accounts. | thelollies wrote: | Similarly I've been getting smashed with gmail spam, google | calendar spam and google drive spam for around 6 months now | after never previously getting any and despite reporting most | of it. | tempest_ wrote: | Yeah the drive spam started for me last year or the year | before. It took a break but in the last couple of months has | returned with a vengeance. | | I had the Slack google drive integration and I needed to mute | it because it was a couple of doc invites every few hours. | edvards wrote: | Been having the same issue on my old email. A LOT of spam going | past the filter. | pwiecz wrote: | On the other hand, over the last few months the majority of posts from | a few private Google Groups I'm a member of keep getting | wrongly classified as spam. | | I don't know if it's related either. | pverghese wrote: | I disagree. Prior to Gmail I used to get thousands of spam | emails every day. Now everything is filtered. Barely get any. | | The added benefit is I don't get any tech calls for help from | my parents, who also don't end up clicking random spam and | wondering why bad things are happening | echen wrote: | I've been experiencing the same. 
Here are a bunch of egregious | spam mistakes we collected from different people to illustrate | the problem: https://www.surgehq.ai/blog/are-the-spammers-winning-failure... | krono wrote: | In my experience, every single one of these results carries the `html` filetype as | part of its URL. This is likely a consequence | of the useragent-based switcheroo technique they use to fool | Google. | | Just blanket block the lot with the following uBlock Origin | filter: | google.*##.g:has(a[href*=".it"][href$=".html"]) | | Google ain't going to fix itself ;) | [deleted] | LinuxBender wrote: | In addition to this, if one runs unbound as the DNS on their | home router and blocks DoH, then one could add | local-zone: "it" always_nxdomain | | to NXDOMAIN all requests for the .it TLD and protect non | browser devices. I use this method to stay off sanctioned | country TLDs and to remove the cheap/free spammy domains and | TLDs that often contain more malware than anything useful. | jacooper wrote: | Or use Brave search, which honestly from my experience is much | better. | peoplefromibiza wrote: | cool! | | now s/\.it/every TLD/ and you solved domain spam forever. | | /s | | You might not know that 99.99% of .it domains with urls ending | in .html are completely legit, including some official | government ones. | [deleted] | nkrisc wrote: | Since uBlock is run on the client, unless you're Italian or | interested in Italian sites it doesn't really seem like much | of an issue. | | I could block all .it sites on my network and I'd likely | never even notice. | peoplefromibiza wrote: | yeah, right, unless you're american, why should you care | about .com domains? ¯\_(ツ)_/¯ | | the problem is not .it domains, it's clearly stated in the | linked post | | _A large number of spam pages are indexed when searching | by our product name. 
It's very similar to Japanese Keyword | hack, but the difference is that our site is not hacked_ | | so it's definitely an indexing issue, those .it domains are | being indexed for the Japanese word hack for some reason, | it's not that .it domains are particularly spammy per se. | | Your "solution" would filter the vast minority of the | abusers at the cost of banning an entire TLD, not much | different than turning off the internet connection | entirely. | | Most of the spam on the internet comes from .com domains | though, even more so because registering a .com domain is | much easier than getting an .it | | Are you willing to ban .com too? | kzrdude wrote: | .com implies spam - it's commercial, so let's go ahead. | If it's not .org I'm not playing. /s | plank wrote: | And yet, here you are, and not on ycombinator.org? ;-) | nkrisc wrote: | > Your "solution" would filter the vast minority of the | abusers at the cost of banning an entire TLD, not much | different than turning off the internet connection | entirely. | | Again, we're talking about client-side filtering. The | original comment about blocking .it domains was talking | about a uBlock Origin rule. No one's talking about | blocking .it domains from the web. | | Yes, as an American, I could block all .it domains on my | end and my web experience likely wouldn't change at all. | I rarely, if ever, need to visit .it domains. So maybe I | will. | krono wrote: | This visually hides the HTML elements on Google Search | and for me only. There is no networking involved and so | Italian TLDs are still reachable. | | This is a personal solution to an extremely disruptive | and long standing problem, and only affects those who | choose to employ it. It's not hurting anyone. | tbran wrote: | Nah. I've been reading the docs on Spatialite (the spatial | extension for SQLite) at http://www.gaia-gis.it/ the last | couple days. It has both a "spam" TLD and a design from | 1998. 
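For readers unfamiliar with the selector syntax in the uBlock Origin filter quoted earlier in the thread, here is a minimal Python sketch of what its two attribute selectors test. It is an illustration only, not how uBlock Origin is implemented: `matches_filter` is a hypothetical helper, and the sample hrefs are invented. It relies only on standard CSS semantics, where `[href*=".it"]` is a substring match and `[href$=".html"]` is a suffix match.

```python
def matches_filter(href: str) -> bool:
    """Mimic a[href*=".it"][href$=".html"]: substring test plus suffix test."""
    return ".it" in href and href.endswith(".html")


# Hypothetical hrefs, invented for illustration only.
samples = [
    "http://spam.example.it/some-product.html",  # ".it" present, ends in ".html"
    "https://www.example.it/",                   # ".it" present, no ".html" suffix
    "https://www.example.com/page.html",         # ".html" suffix, no ".it"
    "https://www.itv.example/page.html",         # ".it" occurs inside ".itv"
]

for href in samples:
    print(href, matches_filter(href))
```

Because the first test is a plain substring match, it is not limited to the .it TLD: any href that contains ".it" anywhere and ends in ".html" would be hidden, which is worth keeping in mind before adopting the filter as-is.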
| fijiaarone wrote: | But not many of the official government ones. | peoplefromibiza wrote: | official government in Italy also means cities, towns, | hospitals, universities, public schools etc | | There are 8 thousands towns in Italy, each with their own | .it website. | qalmakka wrote: | Blanket banning a whole TLD is stupid. One thing is blocking | some obscure stuff like ".su", but .it? It's just too big, and | arguably unwise if you are in Europe where having to connect to | Italian websites or services isn't a remote possibility. | krono wrote: | This merely hides Google search results in my browser. | | No network connections are blocked... | permo-w wrote: | I'm sure there are plenty of non-spam html pages based in Italy | too | krono wrote: | Considering the crowd that trade-off that seemed too obvious | to mention. | seumars wrote: | What's this useragent switcheroo? | krono wrote: | Browsers and other programs can use the User-Agent[1] header | to send along a bit of information about themselves with each | request. | | This and other information is then used to filter out various | types of visitor. | | In this case, requests claiming to be a Google Search crawler | will receive a boring page with lots of text that it can | index and use as search results. | | Most browsers' devtools let you change your user-agent | string, and a listing of the ones used by Google crawlers is | publicly available. Not saying that you should, but you could | check this out for yourself... entirely at your own risk of | course :) | | https://en.wikipedia.org/wiki/User_agent | | https://developers.google.com/search/docs/advanced/crawling/. | .. | politelemon wrote: | What does Cloudflare normally do with spam sites, is it a hands | off approach or do they do some policing? | yellow_lead wrote: | Hands off till it gets upvoted on HN | charcircuit wrote: | They do the bare minimum. You can report sites for abuse and | they will take them down. 
It doesn't appear like they do | anything to proactively stop similar sites so the person can | just make a new account and domain and be back in business. | miyuru wrote: | Cloudflare's abuse department is really lacking. | | Their abuse form is getting abused too. It sends an email to | the site operator and the server hosting company in a single submission, | so it's getting abused. It doesn't even have a captcha. | | https://abuse.cloudflare.com/ | [deleted] | breakingcups wrote: | They will do nothing, and it is a feature. | | The only thing they'll do is forward the complaint to the user. | Leaving you with no recourse other than to take legal action | before Cloudflare will lift a finger. | | Unless there's CSAM, of course. | sammy2244 wrote: | You need to remember that Cloudflare isn't a host | r1ch wrote: | That isn't always the case these days, web apps can run on | Cloudflare without an origin. | bad416f1f5a2 wrote: | This... seems about right? | | A trademark dispute is a civil issue between two parties. We | have legal systems to solve these. Cloudflare should ensure | that their customers get timely notification of complaints, | and that's pretty much it. | jl6 wrote: | I recently cursed google search results when trying to research | an actor's birth date. There were two dates given on Wikipedia | and I wanted to see which one (if either) was correct. Google | returned the actor's IMDB page (which listed a third date, and no | source), and then pages upon pages of what appeared to be auto-generated | sites that clearly scraped from Wikipedia, repeating | one or the other of the Wikipedia dates. | | This is not helping to organize the world's knowledge. | hericium wrote: | > This is not helping to organize the world's knowledge. | | And Google is not about organizing the world's knowledge but | creeping on people for YoY financial results. | richbell wrote: | They're quoting Google's own mission statement; though, you | are correct. 
| | > Google's mission is to organize the world's information and | make it universally accessible and useful. | znpy wrote: | > This is not helping to organize the world's knowledge. | | Oh they stopped doing that long long ago... | dazc wrote: | A lot of actors lie about their age so I wouldn't hold out too | much hope on getting an accurate result on that one. | | I get your point though about the multiple results for | something where there clearly is no authoritative answer. | jeroenhd wrote: | I don't think this problem should be solved by Cloudflare. Cheap | domains will always exist and they shouldn't be a problem. The | problem lies with Google and its failure to detect these spam | sites. | | Surely Google can spare an engineer or two to do a deep dive into | the way any one of these spam sites manages to get itself to the | first page of Google, work out their scheme, and fix the | algorithm? This problem isn't exactly hard to reproduce! | johnklos wrote: | I really don't get why so many people are willing to give | Cloudflare a free pass on stuff like this. Why is it OK for a | company to facilitate and host (1) thousands of scam domains, | making reporting arduous and ineffective? | | Anyone trying to infect others with Trojans and viruses just | need to check user agents or use dynamic redirect URLs, and | suddenly this clearly illegal activity becomes black magic that | is way beyond the comprehension of the folks at Cloudflare. | | Cloudflare is basically making the shittiest parts of the | Internet safe for scammers and spammers, and this is just one | example. | | If that's not bad enough, they're trying like crazy to become a | monopoly. If this what they do now, imagine how bad it'll be | when they control even more and feel even more immune to making | money from scammers. | | (1) Hosting is providing services on the Internet without which | a site would not function. Providing DNS is hosting. Providing | proxy is hosting. 
Providing email is hosting. Don't fall for | Cloudflare's "we don't host" bullshit. | Nextgrid wrote: | I disagree. | | For example, SO copycats are legitimate in that they respect | the license and otherwise just serve the content to whoever | sends them an HTTP request. As far as I know they don't spam | links to their domain anywhere. They are low-quality and of | dubious utility for sure, but I'd rather not make the | Internet a place where you need to prove quality & utility to | someone to be able to host an HTTP server. | | The real problem is that a dumbass like Google comes along, | sees this and decides that it should rank _higher_ than the | source content. | jfengel wrote: | If they're not spamming their links, how do they get such | high search engine optimization? | | Somebody upthread suggested it was just the use of Google | ads, which I suppose is possible, but somehow it seems | unlikely. Google sure does love money but they also need to | be considered a good search engine, and I'd expect them to | be at least a little wary about things like that. | | Is there something else I'm missing? | Nextgrid wrote: | It's my understanding that link spamming has become | counter-productive since a Google algorithm update almost | a decade ago? I'm not sure what they're doing but I don't | think it's link spam, because of that and also because | I've never seen their spam anywhere (if they're using | link spam they must do so on sources that have good | "authority" for programming-related topics and thus one | of us would've likely seen it). | techdragon wrote: | It's often cheap blog spam "original content" and | matching cheap social media spam to increase how | legitimate the blog looks ... which is cheaper than ever | now thanks to advances in machine learning models like | GPT-3 and other current generation models. 
The pipeline | is take a random sample of pages in the domain, take the | target page -> summarise -> generate some blog spam of | varying length and level of human input -> if desired | based on social media analytics then generate some | automated social posts about the blog article that was | just added since it's widely done by real humans with | their real blogs it all looks legit. | | This is how it gets done and Google used to be brutal | about crushing it, somewhere along the way they seem to | have given up on being so brutal. | forgotpwd16 wrote: | >SO copycats are legitimate in that they respect the | license | | Do they? SO contributions are under CC BY-SA. Haven't seen | copycats providing attribution let alone specifying that | the content is under the same license. | Nextgrid wrote: | I'm not sure, but the business model of them is ad | revenue - they get paid as soon as the page loads. Adding | the require attribution & license disclosure wouldn't | hurt them at all, so I'm assuming they're either already | doing it or will start doing it if asked. | mod wrote: | Exactly. Similarly, I think any dumbass should be able to | fix cars in his own yard, including for a fee and calling | himself a business. | | But Google maps better not drive me to someone's yard when | I ask to navigate to a nearby mechanic. | | If it did, it would be hard to blame anyone but Google. | abbe98 wrote: | Is it common that SO and Wikipedia copycats respects open | licenses? Most times I run into them they do not. | | It's really tricky to enforce open licenses on this scale | as it's each contributor that licenses their content rather | than the platform host. | Nextgrid wrote: | See my response here: | https://news.ycombinator.com/item?id=32204913 | 2OEH8eoCRo0 wrote: | Cloudflare should worry about what sites Google is choosing | to index and show? | | This is clearly Google's issue. | tremon wrote: | No, Cloudflare should worry about what sites they host and | enable. 
How Google ranks the sites that Cloudflare hosts is | a secondary issue, and is outside of Cloudflare's control. | jeroenhd wrote: | Cloudflare should take action on reported domains and their | owners, especially if those domains are malicious. | | However, I don't want Cloudflare to preventatively police | what is and isn't a bad website. When these scam sites go | live, they can quite easily contain real content (say, a | blog, with articles written by AI good enough not to be | immediately obvious) and then change into malware on a | schedule. | | Cloudflare can't see what code customers run on the backend | and that's probably a good thing. They're already holding too | much power over the internet and requiring the backend to be | transparent would only make them more in control of the web. | | Any registrar hosts thousands if not millions of spam sites | because every single one of the billion registrars have DNS | set up in some way. | | Despite being almost exclusively used for spam and amateur | projects, the .TK TLD barely shows up in Google. Spam sites | are a symptom of other services linking to them and making | them worth the investment. If Google, Bing, Qwant and Yandex | weren't falling for the SEO scams these scammers use, we | wouldn't have this problem. | | Hosters have some immunity by design, and that's very much a | good thing. They have to respond to abuse complaints, but | they're not responsible for filtering out all of their | customers. Requiring them to do so is exactly what the EU is | trying to force upon the internet, which is terrible for | online freedom. | RockRobotRock wrote: | You make some good arguments, but if a site is caught in | the act of hosting obvious malware, Cloudflare should make | a reasonable effort to suspend their activity. | nousermane wrote: | "Cheap domains" is not a thing. $25/year domain for a personal | website is kinda pricey. But scam/spam operator can pay that | and more pretty darn easy. 
| [deleted] | schroeding wrote: | There is at least one registrar that gives away .it domains | (and apparently .eu domains? WTF?) for free for one year[1], | with no major strings attached (as long as you cancel after | the first year) as far as I read, correct me if I'm wrong. | | Why they decided to ".xyz the TLD", I don't know. | ¯\_(ツ)_/¯ | | [1] https://www.register.it/?lang=en | fortran77 wrote: | I'm not sure about today, but about 12 years ago, I was | able to get 1000 .info domains for about $200. (We were | doing some machine-generated splog creation to see if we | could game Google search results. We could.) | [deleted] | worldofmatthew wrote: | Who is charging $25/year for domains? | dark-star wrote: | $25/year for an .it domain is pretty cheap actually, | usually they sell for more like 40EUR per year | thiht wrote: | .it are around 10 bucks a year. No idea where you would | find them at 25 or 40 a year. | smcl wrote: | Depends on the TLD, I searched gandi with a very random set | of characters from the keyboard (to ensure I could probably get | many results) and here's a selection of country-level ones | which are above $25/yr: | | - abcedasdfff.io = EUR59.29/year | | - abcedasdfff.tw = EUR25.20/year | | - abcedasdfff.nz = EUR25.40/year | | - abcedasdfff.mx = EUR48.28/year | | Most of them appear to be EUR10-20/yr, but it's certainly | not uncommon to see them go for EUR25 or higher. Note: EUR | and USD are roughly at parity so I don't think it's really | necessary to do a conversion. | worldofmatthew wrote: | Most common TLDs are not in that price range. | Hrundi wrote: | The fact that this has been going on for several years makes me | believe Google either doesn't care or the problem is | particularly hard to fix (less believable) | worldofmatthew wrote: | I've noticed issues since Matt Cutts left. No one cares anymore. | There are AI-generated websites that have been running for | years, ranking highly in Google. 
| jaimex2 wrote: | It's incredibly easy to fix. They don't care as they have a | monopoly. | worldofmatthew wrote: | Allow people to report AI generated website to a human at | Google. | robocat wrote: | Another comment links to a blacklist, which works. | | If it can be effectively blacklisted, then Google is | dropping the ball. This isn't difficult algorithm foo | failure. | | I don't agree with your sentences, but I do agree with your | point. | ehnto wrote: | It has been an ongoing battle for 10-15 years at this point. | Search engines are constantly battling people trying to game | their systems. I have to wonder if Google hasn't lost the | thread a bit, inside their surely quite complex algorithm | black boxes. | | For a while now Google has suggested that the best way to | rank well is to have human readable content and focus on user | experience. At the same time, natural language generation has | come leaps and bounds, to the point where sometimes even I, a | human, can't tell if an article has been spun by a bot or | not. | | So if Google starts ranking human readable content, and | robots can now produce human readable content, what is the | next ranking signal they can use to differentiate spam from | humans? Are we going to end up with "Verified Websites" ala | verified Twitter handles? | | A huge portion of the web at this point is just bots | communicating with eachother, and legitimate business systems | having to process bots participation on the internet. I | imagine the portion of the web that Google crawls that is | legitimate versus that which is bot generated would surely be | majority bots, just because of how fast they can generate | content. One thing they can't do as easily though is register | domains, so it may be one of the better points of defense. | pixl97 wrote: | The dead internet theory. 
| frankfrankfrank wrote: | Google has at the very least neglected its search for many | years now and recently has also actively made it worse | through all the censorship and thought control stuff. I find | it rather surprising because essentially all of google's | success is lynchpinned by search. All it would take is for a | narrative to dominate that the best results can be found | elsewhere, which does not seem particularly remote, | considering how much damage google has done to its search. | hericium wrote: | > I don't think this problem should be solved by Cloudflare. | Cheap domains will always exist and they shouldn't be a | problem. The problem lies with Google and its failure to detect | these spam sites. | | The problem exists outside (Google-controlled) web: with (not | fully Google-controlled) email, too. | | Around 2020 I did a per-tld checks on wanted/unwanted messages | (ham and spam). With thousands of messages sent from .xyz | domains (envelope sender host or PTR record of sending host; I | ignored the From header) there wasn't a single legit message. | 100% SPAM. | axsharma wrote: | The irony, Google/Alphabet uses ABC.xyz. | oefrha wrote: | What can search engines do about user-agent based content | differentiation? Say my robots.txt allows Googlebot and nothing | else. If Google attempts to double-check with a covert user | agent, robots.txt is violated. Assign humans to review reported | pages? It's pretty easy to swamp a manual system like that. Just | forget about robots.txt? | foobarbecue wrote: | I remember reading on HN years ago that Google bots have never | honored robots.txt, but I don't actually know | hombre_fatal wrote: | robots.txt is just a guideline between well-meaning actors for | the majority of their traffic, like helping a bot not waste its | time nor your bandwidth by crawling dynamically-generated, | endless-scrolling /calendar.php pages. Google does use it to | that extent. | | It's not a firewall. 
| | Seems like you're describing cloaking (https://developers.google.com/search/docs/advanced/guideline...), one of the oldest SEO | tricks, and you can imagine that search engines started | defeating it on Day 2 of crawling the web. | Thorentis wrote: | The obsession with "machine learning" is actually making systems | dumber. Google Search and Gmail spam filters are getting worse | with each passing week, and I am almost certain the increasing | reliance on ML is to blame. | patentatt wrote: | I chalk it up to a cost benefit calculation. Google clearly | isn't trying to eliminate all spam in search. It's not their | goal. They are not trying to optimize for the user experience. | They're trying to optimize revenue. | mod wrote: | They're trying to keep spam out of my inbox, and the spam | rate has been increasing for me (and other HN commenters who | frequently talk about it) | tremon wrote: | The competing explanation is that "machine learning" is | actually making spam generator systems smarter, so spam gets | harder to detect. | beardyw wrote: | Presumably these folk take advantage of cheap/free domain offers | wherever in the world they are. | 01acheru wrote: | I think you are right, https://www.register.it/ has been offering | free domains for 1 year for some time. | [deleted] | peoplefromibiza wrote: | it's not that straightforward though. | | to register an .it you must prove you are a person or a | business working or residing in one of the EU member states | and need to provide the ID of a person who's gonna be listed | as admin-c of the domain. | gsich wrote: | No you don't. I had a .it domain too, yes there is a field | in registration where you should enter an "identity card | id", but I didn't have one so I entered something random. | Worked of course. | peoplefromibiza wrote: | > No you don't. | | Yes, you do! | | of course it worked. | | you just committed a crime. 
| | you can fake your id everywhere in the World, it is a | crime everywhere in the world and if something happens | doesn't mean you won't get caught. | | you can drive a stolen car, it will work. | | > yes there is a field in regstritation where you should | enter a "identity card id" | | so it is required! you simply ignored it, lied and broke | the law. | | your criminal behaviour doesn't imply laws do not exist. | | if you tried to buy an insurance policy with that fake | ID, you would be in troubles now. | [deleted] | [deleted] | pixl97 wrote: | Right, and I'm sure that government across the ocean will | get right on prosecuting that violation... | schroeding wrote: | You can do that, but you always run the risk of someone | snitching to nic.it, in which case you would lose the | domain. :/ | 01acheru wrote: | I don't think this is an issue if you're a spammer. Those | domains are probably short lived anyway. | pyinstallwoes wrote: | I've actually experienced this and it is not related at all to | the device. It was related to the signed in google account | across networks and devices. | Ueland wrote: | Note that the discussion is a year old. Around one year ago I | wrote more about this "phenonomen" here: | https://news.ycombinator.com/item?id=27993123 | | tl;dr: I managed to find the servers behind it, most likely | anybody who are still affected can do the same thing I did pretty | easily. We also followed the money, which is a tad more work. | thejosh wrote: | There has been a huge influx this year with the amount of sites | that simply scrape SO and then have the exact content on their | site. It's a pain, and there is no official way to remove them. | | I thought that this was a massive nono from Googles side, has | something changed? | xbar wrote: | This reminds me of this one site that simply scraped all the | open source code it could see and then produced AI-generated | copies. 
| phreack wrote: | It took me a while for no good reason but I finally got an | unofficial extension to add a "block" button to search results. | It immediately improved my experience, I can't recommend it | enough. No more Pinterest, SO clones, useless Quora spam, with | very little work. I can't believe I didn't do it sooner. | ofou wrote: | just switch to you.com | anonred wrote: | Other search engines allow you to block domains from showing up | in the results. I've switched to Kagi out of frustration and | honestly it's as good as or better than Google just because of | that one feature. | endofreach wrote: | https://news.ycombinator.com/item?id=29403947 | atwood22 wrote: | I have no evidence of this, but the ad load on the returned | results has gotten way higher. In theory, ranking sites that | display Google ads higher would be a very easy knob for Google | to turn to increase profit. The SO scrapers probably have | Google ads on them, making them more profitable for Google. | Nextgrid wrote: | Turning the knob one way explicitly might raise some antitrust | concerns; however, the same motivation can be used to | _avoid_ turning the knob the other way and this can be done | much more sneakily without leaving clear evidence - simply | don't allocate budget/etc to projects that would turn the | knob the other way and you're done. | hombre_fatal wrote: | I ran into so many Stack Overflow "mirrors" yesterday like | this: https://www.anycodings.com/1questions/400836/swiftui-update-... | | 10 years ago I gave up on a large project where I rehosted and | organized dead Usenet forum content because Google's dupe-penalty | detector was too good and too aggressive for content | that you could barely find beyond a six-year-old cache hit | where the origin website was long gone. | | Meanwhile these Stack Overflow scrapers are just | `<html>{copy-and-paste}</html>` and the same domains are | still alive despite years of cloning.
| | Looks like it's time to boot my project back up. | atwood22 wrote: | It's clearly not a copy and paste. I just visited that link | on my phone and got blocked from viewing because I'm using | an ad blocker. | avipars wrote: | Also lots of GitHub scrapers | skilled wrote: | "All Rights Reserved." | panarky wrote: | This is a very old conspiracy theory that's been repeatedly | debunked. | | https://www.searchenginejournal.com/ranking-factors/google-a... | chakkepolja wrote: | That link is about AdWords spend by the site in question, | and not about displaying AdSense ads on the site. Totally | unrelated. | 1597 wrote: | I've noticed this with YouTube. Even though I'm on desktop | with an adblocker they repeatedly autoplay the same video | with a creator-embedded crypto promotion at the beginning | (especially when it would be plausible to infer I'm asleep | from user interaction and clock/watch time). Must be getting | a cut (plus scamming the ad buyer). | Surfactant7 wrote: | There's a simple way around that. Nothing to install. Nothing | to update. | | Just go to SO and use its search bar. It's actually quite good. | | I mean, you know that's where you'll want to find the answer | anyway - not some random corporate webpage or ad-infested | splog. Why not cut out the middleman? | | Only if that fails do I bother with Google. | watchdogtimer wrote: | Or, if DuckDuckGo is your default search engine, you can | append ' !so' to your search term. | burnished wrote: | Huh, you know, you're right. I recently did that and it was | fine. | | I think a lot of others formed their opinion (myself muchly | included) about this from sites where the search bar was a | joke played on people. | | Edit: let me upgrade that 'fine' to 'great', now that I think | about it, it was actually better than a Google search, which | was not my previous experience.
| avereveard wrote: | Google's index used to be rather more competent at finding | relevant issues for a query, especially if some words were | synonyms of what was found in the snippets at even loosely | related | Quenhus wrote: | Here is my uBlock filter with hundreds of GitHub/StackOverflow | copycats: https://github.com/quenhus/uBlock-Origin-dev-filter | | It blocks copycats and hides them from multiple search engines. | You may also use the list with uBlacklist. | thejosh wrote: | This is fantastic! This is exactly what I needed, thanks! | SmellTheGlove wrote: | You rock. Thank you. | Phlogi wrote: | This even works on Firefox Nightly on Android. Thanks a lot! | colordrops wrote: | With these two pieces of data: | | * the identical text copied from stack overflow should be | easily identifiable | | * volunteers put together a list of these sites themselves | | it should be obvious to Google apologists that Google is | either negligent or intentionally allowing these sites in | their search. I'm sick of hearing about how "the world is | different" and it's an "arms race" between spam sites and | Google. Bullshit. | IfOnlyYouKnew wrote: | The problem with these theories is that they lack any | sensible explanation of motive. Google intentionally | degrading its search results because they "earn more if the | user has to search again and again" just doesn't feel | right: even if it were true in some short-term experiment, | it would compromise the way people at Google think of | themselves and their work to a degree that would be | devastating to the company. There is no way they would | throw away that sort of value without being under intense | pressure, which they definitely are not. | colordrops wrote: | These large tech companies have a long and varied history | of stupid short-term decision-making for profit and bad | products due to local individual failures.
Until there is | a clear and detailed explanation of how the spam sites | are avoiding Google's wrath, the explanation of stupidity | or short-term thinking on Google's part seems just as | plausible. | lamontcg wrote: | Well, come up with an explanation of how these entirely | mechanically generated SO clone sites, with no | obfuscation, are allowed to exist by Google, when | identifying them and removing them should be fairly | trivial. | | At the very least they're being deliberately neglectful | because they don't feel the bad experience harms their | revenue because there's no other substantial competitor | so they can abuse their monopoly status. | | I guess they may just not care enough about software | developers and figure we're mostly using ad blockers so | it's wasted effort and we'll develop blocklists ourselves. | With no monetary value that they can assign to the ill | will that it engenders they figure it must not matter so | they don't bother. Pissing off a large chunk of the | entire IT community via obvious neglect seems like a poor | move to me, but then I've never felt that I'm cut out for | management. | burnished wrote: | Maybe the problem is just genuinely hard and beyond their | capabilities. | colordrops wrote: | Detecting identical snippets of text is beyond virtually | no one's abilities. | Beldin wrote: | Another comment stated that SO uses ads from someone other | than Google, while the copy-paste sites use Google for | ads. If true, that is a clear monetary incentive to not go | after this too hard. | rightbyte wrote: | SO seems to have Yahoo ads, so I guess it is a no-brainer | for Google to rank sites they profit from over the content | the lusers want. | jiggawatts wrote: | This is the real answer.
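For what it's worth, the "detecting identical snippets of text" claim above can be illustrated in a few lines. This is only a rough sketch of one textbook approach (word shingles compared with Jaccard similarity), not a description of anything Google actually runs. The function names `shingles`, `jaccard` and `looks_like_copy`, the shingle size `k=5` and the `0.5` threshold are all made up for the example.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    # Lowercase and keep only alphanumeric runs, so markup and punctuation
    # differences between the original and the copy do not matter much.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_copy(original, candidate, k=5, threshold=0.5):
    """Hypothetical check: flag candidate if it shares most shingles with original.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return jaccard(shingles(original, k), shingles(candidate, k)) >= threshold
```

A verbatim scrape scores 1.0 here, and swapping a word or two only lowers the score a little, so light rewording alone does not hide a copy from this kind of check. Doing it across a whole web index, against spammers who adapt, is a different matter.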
| remus wrote: | > the identical text copied from stack overflow should be | easily identifiable | | Google starts matching content from SO => spammers start | tweaking the text slightly => Google implements some | expensive similarity score to down-rank copycat sites => | spammers use more complex scrambling => ... | | > volunteers put together a list of these sites themselves | | These lists only work because they're used by a tiny | minority of people. If Google were to do this the spammers | would start switching domains more quickly (or find some | other workaround). | | I'm no Google apologist but I think you're underestimating | how hard search ranking is when spammers are actively | trying to game the system. | colordrops wrote: | > tweaking the text slightly | | That's what ML is perfect at detecting, which is Google's | forte. | | Some of these sites have been returned as top results for | a while, so are you suggesting that Google just gave up | because spammers would be able to evade them with an | update? | maxwelldone wrote: | I've been using this uBO filter since someone recommended it on a | different thread and it's been great at removing those annoying | sites from search results: https://github.com/quenhus/uBlock-Origin-dev-filter | burtekd wrote: | The author actually posted above your comment ;) | minutillo wrote: | My theory is that one of the inputs to Google's ranking | algorithm is now "how much money would we make from this | click?" A click to SO has a small number of ads which are | obviously ads and easily ignored. A click to the average | scrape-jacked SO page has dozens of ads using every dark | pattern in the book to generate accidental clicks. | fluidcruft wrote: | One of the other commenters above made the claim that SO runs | Yahoo ads. If that's true then from a Google perspective, the | click has either zero or negative money-making value. | | Maybe that means we should be searching in Yahoo rather than | Google.
| disruptiveink wrote: | I'd like to get actual confirmation of this, but my vague | feeling is that, once upon a time, Google Search would get | "updates", as in, actually deployed code that would change the | rules of the game and most of the previous dirty tricks would | become unusable, leading people to go out and find new | ones. | | This changed with the Google "machine learning" days, where you | no longer have humans at the helm laying down explicit rules, | so no more "change the world" updates; you can only slightly | nudge the parameters towards what you want, meaning the same | old tricks keep being effective for far too long. | nerdawson wrote: | The "May Core Update", which recently rolled out, impacted | every site. | | A lot of updates are targeted at specific problems such as | low-quality product reviews but there are still broader | updates taking place. | dewey wrote: | > Google Search would get "updates", as in, actually deployed | code that would change the rules of the game | | That's just what the scheduled "core update" days are now: | https://developers.google.com/search/blog/2022/05/may-2022-c... | skilled wrote: | I had this topic brought back to my mind yesterday as I was | doing some research using the Ahrefs keyword tool. I do believe | it would be possible to create a very large dataset of these | copycat sites (using Ahrefs) to be used as a blacklist in | various filters/extensions. | | But the crazy part is that, for example - Ahrefs says that | StackOverflow has "Organic traffic" in the range of 22 million | per month. A lot of these copycat sites, at least the ones I | saw - have a traffic range anywhere from 10k to 500k per month. | | I mean, it's pretty insane just how well such sites can rank in | Google, and you bet those copycats are making absolute bank | from ads even if the majority of developers immediately close | the site.
| | There's a lot going on with Google Search these days; a lot of | people are complaining that sites that scrape content can | easily rank really well for long-tail keywords. One case in | particular, a site will scrape Google to collect "featured | snippets" and "people also ask" - then combine anywhere from | 20 to 40 of these answers and publish them as a blog post. | | None of the words are changed, all questions/answers worded | exactly the same. And Google puts these sites on page 1. | | What a joke. | helsinkiandrew wrote: | > I do believe it would be possible to create a very large | dataset of these copycat sites | | Would they just move to creating and using new domains with | the same content as soon as traffic to the old ones drops? | (Which looks like what the spammers in the original post are doing.) | | But something does need to be done to these sites. | fragmede wrote: | Fingerprint the site's content so the new domain name isn't | able to SEO a good score. | SheetPost wrote: | > bet those copycats are making absolute bank from ads even | if the majority of developers immediately close the site | | I bet the majority of developers block ads | skilled wrote: | "developers" | chewz wrote: | > What a joke. | | It is simple. Google is making more money from copycat sites | than from original content... | dan-robertson wrote: | I think this just isn't how Google works. I would expect to | see a lot more spam if Google were happy to collect money | from advertising on spam sites. | unglaublich wrote: | ... in the short term. | pixl97 wrote: | Does the market care about anything else? | [deleted] | ben_jones wrote: | For all we know the sites have better internationalization | and cater to audiences invisible from a US-based perspective. | bequanna wrote: | These sites are just scraping SO and dumping the text from | the question+answers in a blog-style format.
| | I don't think this is a cultural issue, I fail to see how | this can be considered a value add by anyone. | adhesive_wombat wrote: | I found that Google will even rank a quote from an issue | tracker on one of those "clones with advert/malware | overlays" higher than the original. | lamontcg wrote: | > then combine anywhere from 20 to 40 of these answers and | publish them as a blog post. | | yeah i've been hitting a ton of those lately. | aaaaaaaaaaab wrote: | Some ex-Googlers say that someone ran an A/B test, and it turned | out that per-search revenue was decreasing when these sites | were blocked. | MaxDPS wrote: | I've been using uBlacklist and it works really well. It even | lets me highlight specific websites so I have a better chance | of seeing them if they are further down the list. | https://iorate.github.io/ublacklist/docs | pyinstallwoes wrote: | Similar to YouTube search results. Lots of spam videos. No way | to block a creator. Totally ruins it. | aaron695 wrote: | kmfrk wrote: | Remember the good old days of talking about a "semantic web"? Now | we just get one Google results page of SEO'd garbage with no way | to process it. | | I can't help but plug kagi.com, which has the amazing feature of | grouping SEO'd stuff like recommendation lists together, so a | thing that's contextually useful is still available but without | polluting the other contexts. | midislack wrote: | Google hasn't given a shit about search for at least a | decade. It's all about data collection via Android and Chrome OS, | and Gmail and Docs. They don't need search to collect your data | any more. Don't people actually know this? LOL | DharmaPolice wrote: | All that data they've collected is only useful if they can sell | something (i.e. ads) based on the data. AFAIK the majority of | their ad income is from search-based ads. | pyinstallwoes wrote: | Also noticed here: | https://support.google.com/websearch/thread/118733416/lot-of... | | Locked?
| rwmj wrote: | I've been getting these spam .it domains for years and years, | this is nothing at all new. | meerita wrote: | Dealwith.it | fredgrott wrote: | there is also spam from name.ru.com domains | | Warning, do not click on those links as you will get your PC | infected. | sammy2244 wrote: | Wow, from clicking on a link on a modern browser? New 0-day? | nottorp wrote: | No, Google search is plagued with spam from any domain. And even | the non-spam results are useless. | H8crilA wrote: | It may be that deep learning is now increasingly used to generate | the spam. It either is or will be used for spam generation A LOT. | Frankly it seems to be the most promising commercial use case for | the large language models. ___________________________________________________________________ (page generated 2022-07-23 23:00 UTC)