[HN Gopher] Filters to block and remove copycat-websites from Du...
___________________________________________________________________

Filters to block and remove copycat-websites from DuckDuckGo, Google and other

Author : gleb_the_human
Score  : 145 points
Date   : 2022-02-17 16:27 UTC (6 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| sebazzz wrote:
| I use Kagi as a search engine and can just block the site from
| the search results.
| wanderingmind wrote:
| Kagi filters are great for programming, but still evolving for
| others. I still see a lot of pinterest results. You can block
| domains by adding them to the Kagi blocklist through Settings ->
| Personalized Results -> Blocked Domains.
| lolinder wrote:
| I started using Kagi recently, and so far haven't had to block
| a single site. Their filters are great!
| xarope wrote:
| I've been recently searching for some very specific keyword
| stuff, and bumping into a lot of sites which seemed like just
| reformatted copy-n-paste of stackoverflow and various mailing
| lists, adding zero value and clogging up the top 100 search
| results.
|
| Now that I see the HUGE number of copycat sites in the
| stackoverflow_copycats.txt file, I am beginning to understand
| what's going on.
|
| Thanks!
| Kovah wrote:
| Does anyone have an idea how to make this work in Brave without
| uBlock? I added the block list to custom filters
| (brave://adblock/) but results for those spam sites are still
| shown in DuckDuckGo.
| aunty_helen wrote:
| I've been using uBlacklist which adds a little "block this site"
| button to the google search results. Handy.
|
| It is a browser extension and I haven't looked too deeply into it
| so if that's important to you perhaps have a browse over their
| repo etc before installing.
| OGWhales wrote:
| I use that too, great for blocking pinterest from flooding all
| of your google image search results.
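For anyone who wants to seed uBlacklist with more than one site at a time, here is a minimal sketch that turns bare domains into rules. It assumes uBlacklist accepts WebExtension-style match patterns such as `*://*.example.com/*`; the function names `to_match_pattern` and `build_rules` and the sample domains are made up for illustration and are not part of the extension.

```python
from typing import Iterable, List


def to_match_pattern(domain: str) -> str:
    """Turn a bare domain into a match pattern for it and its subdomains.

    Assumption: the blocker understands WebExtension-style match patterns.
    """
    cleaned = domain.strip().lower().rstrip(".")
    if not cleaned or "/" in cleaned or " " in cleaned:
        raise ValueError(f"not a bare domain: {domain!r}")
    return f"*://*.{cleaned}/*"


def build_rules(domains: Iterable[str]) -> List[str]:
    """Convert domains to rules, dropping duplicates and keeping first-seen order."""
    seen = set()
    rules = []
    for domain in domains:
        rule = to_match_pattern(domain)
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)
    return rules


if __name__ == "__main__":
    # Hypothetical input list; paste the printed lines into the extension's rule box.
    print("\n".join(build_rules(["pinterest.com", "Pinterest.com", "example.org"])))
```

The output is one rule per line, so it can be pasted into a rule editor or saved as a text file.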
| reillyse wrote:
| I really don't like these websites and I smash that back button
| as soon as I realize I've landed on one.
|
| That said I'm amazed they are still showing up at the top of
| google search. My understanding was that that kind of behavior
| (which I think at least some other people do too) combined with
| the fact that they are just copying another much higher page
| ranked website would mean that they are highly unlikely to rank
| above the relevant stack overflow article that they are duping.
| So what is happening here?
| giancarlostoro wrote:
| Reminds me of going to Google Images, and getting sent to
| Pinterest... which is not where the image is sourced from.
| hughrr wrote:
| Pinterest is one of those sites that really makes me want to
| strangle someone. It's just an abhorrent walled garden of
| other people's property.
| giancarlostoro wrote:
| If I were on Pinterest looking up things, fine. But I'm on
| Google, not trying to find a mirror of what I want, I want
| what I want.
|
| Edit: I have a friend who works there, but not as an
| engineer haha I'm pretty sure I've told him my woes with
| pinterest. My wife loves pinterest though. It allows her to
| come up with amazing design ideas and art ideas.
| blacksmith_tb wrote:
| I am not sure how GOOG weighs what happens after you click on a
| result, it would be clever of them to notice how quickly you
| click on another result for the same search and slightly
| downgrade the first link (though, what happens if you open the
| first three links into new tabs before you actually visit them,
| say). My assumption was that they just counted clicks as an
| upvote, so if these scammers can make it into the first page of
| results, they will tend to stay.
| ChefboyOG wrote:
| For a long time now, Google has weighted behavioral signals
| similar to what you describe. "Bounce Rate" is the percentage
| of users who quickly leave your site after clicking.
| "Dwell Time" is the amount of time a user spends on a page.
|
| There's even a cottage industry around gaming these signals.
| See SerpClix and the like.
| ajsnigrutin wrote:
| So how come pinterest is still on top with many searches?
| jonas21 wrote:
| Why not? Pinterest is a popular site, with lots of
| content all linked to each other. Many people probably
| spend a long time there after clicking a result.
| dylan604 wrote:
| The inordinate amount of time it takes to click away all of the
| dark UI login screens just to see the content, before
| deciding it's not what you wanted, has already increased
| the dwell time beyond that of other sites.
| xenadu02 wrote:
| This is not even the first or second time Google has rolled
| out changes that allowed SEO spam sites copying Stackoverflow
| or Wikipedia to rank higher than the original.
|
| They did fix this at one point in time by figuring out which
| site posted the content first and penalizing the copycats,
| but it appears the fix is once again broken.
| reillyse wrote:
| I figure they must just be monitoring the original content
| and republishing it before it's indexed by google. The
| searches are so specific and niche that generally ranking
| isn't hard; it's beating the OG that's hard.
|
| I just don't know how they are managing to get indexed
| before the big name established sites. Perhaps they are
| succeeding on some small percentage and that is what we are
| seeing?
|
| Perhaps they have an additional trick to make it look like
| they posted the content first, perhaps internal links or
| something.
| dtech wrote:
| I find that hard to believe, the SO questions are often
| years old, the GH ones months.
| [deleted]
| [deleted]
| dawnerd wrote:
| FYI whoever made this, you can create clickable links to import
| filters. For example:
| https://subscribe.adblockplus.org/?location=https://raw.gith...
|
| Quick edit: I know the domain is ABP but ublock origin picks it
| up.
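A rough sketch of how such a clickable import link could be generated, assuming the `subscribe.adblockplus.org/?location=` form shown in the comment above. The helper name `subscribe_link` and the example list URL are placeholders, not the project's real address. Percent-encoding the `location` value is a conservative choice on my part; the link quoted above passes the URL unencoded.

```python
from urllib.parse import urlencode

# Endpoint taken from the link quoted in the comment above.
SUBSCRIBE_ENDPOINT = "https://subscribe.adblockplus.org/"


def subscribe_link(list_url: str) -> str:
    """Build a one-click subscription link for a hosted filter list."""
    if not list_url.startswith(("http://", "https://")):
        raise ValueError("list_url should be an absolute http(s) URL")
    # urlencode percent-encodes ':' and '/' so the nested URL stays one query value.
    return SUBSCRIBE_ENDPOINT + "?" + urlencode({"location": list_url})


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(subscribe_link("https://example.com/filters/all.txt"))
```

Dropping the resulting link into a README gives readers a single click instead of a copy-and-paste into the filter-list settings.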
| poulpy123 wrote:
| Great, I will try that soon. These websites are infuriating
| pajko wrote:
| Another useful extension like this is
| https://iorate.github.io/ublacklist/
| Melatonic wrote:
| Anybody know of a way I could bulk import these into NextDNS?
| cmroanirgo wrote:
| Missing from the title is:
|
| > _Specific to dev websites like StackOverflow or GitHub._
|
| Before I noticed that, I had searched for pinterest and found
| nothing. Even marking the HN title with "dev" would be good.
|
| If this were my list I'd add w3schools because to me, it's low
| quality, especially compared to mozilla.
| hlbjhblbljib wrote:
| > I had searched for pinterest and found nothing
|
| So it's working as intended and blocking low effort spam sites
| oxguy3 wrote:
| Cool idea! I was surprised that Wikipedia mirrors aren't
| included, as I encounter them constantly and they drive me
| bonkers. I opened an issue:
| https://github.com/quenhus/uBlock-Origin-dev-filter/issues/2...
| ummonk wrote:
| Yeah it's really frustrating when I read a poorly sourced
| Wikipedia article and I'm trying to search for other sources on
| the claims in the article but all I get is clones of the
| Wikipedia article.
| nhoughto wrote:
| Making it this obvious how to set this up made me finally do it,
| no more junk results (well less junk..). Thanks!
| willis936 wrote:
| Bless you. This took 30 seconds to put on my phone and laptop and
| has already improved my results so much.
| mcfedr wrote:
| I never understood why Google isn't blocking these crap results,
| it's making my experience of search really bad for a lot
| of my searches
| brimble wrote:
| Do they do a good job at getting clickthroughs on Google ads on
| their site? :-/
|
| Does the rate of ad-clicking on the results page increase if
| most of the "natural" results are crap? :-(
| lumost wrote:
| I've noticed a recent trend where the copy cat/adware sites
| are "up-ranked" relative to original content.
| This would be the expected behavior of a search engine
| optimizing for clicks and revenue.
| slig wrote:
| They used to be superb at detecting duplicated content. They
| also were extremely good at detecting spam/ham. Nowadays it
| feels like they don't even care anymore and whatever filters
| they have are either broken or untrained.
| ahelwer wrote:
| Looks great and much better than my piecemeal efforts, although I
| recommend linking to a specific commit of all.txt so you aren't
| opening up your browser's ublock origin filter list to arbitrary
| remote control. Like:
|
| https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...
| Quenhus wrote:
| As the author of the filter, I strongly agree with you.
| However, I believe it would be too tedious for most people to
| update the filter "by hand". I think I'm going to add this
| important security information in the README.
| btdmaster wrote:
| To be fair, they are quite nice when the official website is down
| or blocked...
| kipchak wrote:
| Google's cached version of pages can be another useful option,
| if you click on the ellipses to the right of a search result's
| address and then "cached" in the bottom right hand corner of
| the "About this result" box.
| userbinator wrote:
| Came here to post a similar sentiment. I've rescued very useful
| content from mirroring sites that was gone from the original.
| You can filter them out if you want, but don't forget you're
| doing that or you may not find what you're after.
| Jerry2 wrote:
| ... or DMCA'd.
|
| Over the past year, I've noticed that quite a few repos that I
| used to track have disappeared. I keep a local bookmarks list
| now because if a "starred" project is removed or DMCA'd, Github
| does not tell you about it and they remove any mention of the
| repo from the "starred" list.
___________________________________________________________________
(page generated 2022-02-17 23:00 UTC)