[HN Gopher] Filters to block and remove copycat-websites from Du...
       ___________________________________________________________________
        
       Filters to block and remove copycat-websites from DuckDuckGo,
       Google and other
        
       Author : gleb_the_human
       Score  : 145 points
       Date   : 2022-02-17 16:27 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sebazzz wrote:
       | I use Kagi as a search engine and can just block the site from
       | the search results.
        
         | wanderingmind wrote:
         | Kagi filters are great for programming, but still evolving for
         | others. I still see a lot of pinterest results. You can block
         | domains by adding them to Kagi blocklist through Settings ->
         | Personalized Results -> Blocked Domains.
        
         | lolinder wrote:
         | I started using Kagi recently, and so far haven't had to block
         | a single site. Their filters are great!
        
       | xarope wrote:
       | I've been recently searching for some very specific keyword
       | stuff, and bumping into a lot of sites which seemed like just
       | reformatted copy-n-paste of stackoverflow and various mailing
       | lists, adding zero value and clogging up the top 100 search
       | results.
       | 
       | Now that I see the HUGE number of copycat sites in the
       | stackoverflow_copycats.txt file, I am beginning to understand
       | what's going on.
       | 
       | Thanks!
        
       | Kovah wrote:
       | Does anyone have an idea how to make this work in Brave without
       | uBlock? I added the block list to custom filters
       | (brave://adblock/) but results for those spam sites are still
       | shown in DuckDuckGo.
        
       | aunty_helen wrote:
       | I've been using uBlacklist, which adds a little "block this site"
       | button to the Google search results. Handy.
       | 
       | It is a browser extension and I haven't looked too deeply into it
       | so if that's important to you perhaps have a browse over their
       | repo etc before installing.
        
         | OGWhales wrote:
         | I use that too, great for blocking pinterest from flooding all
         | of your google image search results.
        
       | reillyse wrote:
       | I really don't like these websites and I smash that back button
       | as soon as I realize I've landed on one.
       | 
       | That said I'm amazed they are still showing up at the top of
       | google search. My understanding was that that kind of behavior
       | (which I think at least some other people do too) combined with
       | the fact that they are just copying another much higher page
       | ranked website would mean that they are highly unlikely to rank
       | above the relevant stack overflow article that they are duping.
       | So what is happening here?
        
         | giancarlostoro wrote:
         | Reminds me of going to Google Images and getting sent to
         | Pinterest... which is not where the image originally came from.
        
           | hughrr wrote:
           | Pinterest is one of those sites that really makes me want to
           | strangle someone. It's just an abhorrent walled garden of
           | other people's property.
        
             | giancarlostoro wrote:
             | If I were on Pinterest looking up things, fine. But I'm on
             | Google, not trying to find a mirror of what I want, I want
             | what I want.
             | 
             | Edit: I have a friend who works there, but not as an
             | engineer haha I'm pretty sure I've told him my woes with
             | pinterest. My wife loves pinterest though. It allows her to
             | come up with amazing design ideas and art ideas.
        
         | blacksmith_tb wrote:
         | I am not sure how GOOG weighs what happens after you click on a
         | result, it would be clever of them to notice how quickly you
         | click on another result for the same search and slightly
         | downgrade the first link (though, what happens if you open the
         | first three links into new tabs before you actually visit them,
         | say). My assumption was that they just counted clicks as an
         | upvote, so if these scammers can make it into the first page of
         | results, they will tend to stay.
        
           | ChefboyOG wrote:
           | For a long time now, Google has weighted behavioral signals
           | similar to what you describe. "Bounce Rate" is the percentage
           | of users who quickly leave your site after clicking. "Dwell
           | Time" is the amount of time a user spends on a page.
           | 
           | There's even a cottage industry around gaming these signals.
           | See SerpClix and the like.
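
The two signals defined above can be illustrated with a minimal sketch. It only shows how such numbers could be computed from per-visit durations; it is not how Google measures or weights them. The 10-second bounce threshold and the sample visit data are assumptions made up for the example.

```python
from statistics import mean

# Assumption: a visit shorter than this counts as a "bounce".
# The thread gives no threshold; 10 seconds is purely illustrative.
BOUNCE_THRESHOLD_SECONDS = 10.0


def bounce_rate(dwell_times, threshold=BOUNCE_THRESHOLD_SECONDS):
    """Percentage of visits that ended before `threshold` seconds."""
    if not dwell_times:
        raise ValueError("need at least one visit")
    bounces = sum(1 for seconds in dwell_times if seconds < threshold)
    return 100.0 * bounces / len(dwell_times)


def average_dwell_time(dwell_times):
    """Mean time, in seconds, that a visitor spent on the page."""
    return mean(dwell_times)


# Hypothetical visit durations in seconds for a single result page.
visits = [3.0, 5.0, 120.0, 45.0]
print(bounce_rate(visits))
print(average_dwell_time(visits))
```

Under these definitions, a page that keeps visitors busy for any reason, including dismissing login prompts, scores a longer dwell time.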
        
             | ajsnigrutin wrote:
             | So how come pinterest is still on top with many searches?
        
               | jonas21 wrote:
               | Why not? Pinterest is a popular site, with lots of
               | content all linked to each other. Many people probably
               | spend a long time there after clicking a result.
        
               | dylan604 wrote:
               | The inordinate amount of time it takes to click away all of
               | the dark UI login screens just to see the content, before
               | deciding it's not what you wanted, has already increased
               | the dwell time beyond that of other sites.
        
           | xenadu02 wrote:
           | This is not even the first or second time Google has rolled
           | out changes that allowed SEO spam sites copying Stackoverflow
           | or Wikipedia to rank higher than the original.
           | 
           | They did fix this at one point in time by figuring out which
           | site posted the content first and penalizing the copycats,
           | but it appears the fix is once again broken.
        
             | reillyse wrote:
             | I figure they must just be monitoring the original content
             | and republishing it before it's indexed by google. The
             | searches are so specific and niche that ranking generally
             | isn't hard; it's beating the OG that's hard.
             | 
             | I just don't know how they are managing to get indexed
             | before the big name established sites. Perhaps they are
             | succeeding on some small percentage and that is what we are
             | seeing?
             | 
             | Perhaps they have an additional trick to make it look like
             | they posted the content first, perhaps internal links or
             | something.
        
               | dtech wrote:
               | I find that hard to believe, the SO questions are often
               | years old, the GH ones months.
        
       | dawnerd wrote:
       | FYI whoever made this, you can create clickable links to import
       | filters. For example:
       | https://subscribe.adblockplus.org/?location=https://raw.gith...
       | 
       | Quick edit: I know the domain is ABP but ublock origin picks it
       | up.
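
A minimal sketch of building such a subscribe link, assuming the `location` query parameter shown in the truncated URL above plus an optional `title` parameter. The list URL and title below are hypothetical placeholders, not the project's real path.

```python
from urllib.parse import urlencode


def subscribe_link(list_url, title=""):
    """Build a one-click subscribe link for a filter list URL."""
    params = {"location": list_url}
    if title:
        # Assumption: the subscribe page accepts an optional display title.
        params["title"] = title
    # urlencode percent-encodes the list URL so it survives as a query value.
    return "https://subscribe.adblockplus.org/?" + urlencode(params)


# Hypothetical list location, for illustration only.
example_link = subscribe_link("https://example.com/filters/all.txt", "Example list")
print(example_link)
```

Encoding the inner URL matters because it contains `:` and `/`, which would otherwise be ambiguous inside a query string.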
        
       | poulpy123 wrote:
       | Great, I will try that soon. These websites are infuriating
        
       | pajko wrote:
       | Another useful extension like this is
       | https://iorate.github.io/ublacklist/
        
       | Melatonic wrote:
       | Anybody know of a way I could bulk import these into NextDNS?
        
       | cmroanirgo wrote:
       | Missing from the title is:
       | 
       | > _Specific to dev websites like StackOverflow or GitHub._
       | 
       | Before I noticed that, I had searched for pinterest and found
       | nothing. Even marking the HN title with "dev" would be good.
       | 
       | If this were my list I'd add w3schools because to me, it's low
       | quality, especially compared to mozilla.
        
         | hlbjhblbljib wrote:
         | > I had searched for pinterest and found nothing
         | 
         | So it's working as intended and blocking low effort spam sites
        
       | oxguy3 wrote:
       | Cool idea! I was surprised that Wikipedia mirrors aren't
       | included, as I encounter them constantly and they drive me
       | bonkers. I opened an issue:
       | https://github.com/quenhus/uBlock-Origin-dev-filter/issues/2...
        
         | ummonk wrote:
         | Yeah it's really frustrating when I read a poorly sourced
         | Wikipedia article and I'm trying to search for other sources on
         | the claims in the article but all I get is clones of the
         | Wikipedia article.
        
       | nhoughto wrote:
       | Making it this obvious how to set this up made me finally do it.
       | No more junk results (well, less junk...). Thanks!
        
       | willis936 wrote:
       | Bless you. This took 30 seconds to put on my phone and laptop and
       | has already improved my results so much.
        
       | mcfedr wrote:
       | I never understood why Google isn't blocking these crap results,
       | it's making my search experience really bad for a lot of my
       | searches
        
         | brimble wrote:
         | Do they do a good job at getting clickthroughs on Google ads on
         | their site? :-/
         | 
         | Does the rate of ad-clicking on the results page increase if
         | most of the "natural" results are crap? :-(
        
           | lumost wrote:
           | I've noticed a recent trend where the copycat/adware sites
           | are "up-ranked" relative to original content. This would be
           | the expected behavior of a search engine optimizing for
           | clicks and revenue.
        
         | slig wrote:
         | They used to be superb at detecting duplicated content. They
         | also were extremely good at detecting spam/ham. Nowadays it
         | feels like they don't even care anymore and whatever filters
         | they have are either broken or untrained.
        
       | ahelwer wrote:
       | Looks great and much better than my piecemeal efforts, although I
       | recommend linking to a specific commit of all.txt so you aren't
       | opening up your browser's ublock origin filter list to arbitrary
       | remote control. Like:
       | 
       | https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...
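
One way to sketch the pinning idea: build the raw URL from a full commit hash instead of a branch name, so the list content cannot change underneath you. The owner, repository, hash, and file path below are hypothetical placeholders; the real path is truncated above and is not reproduced here.

```python
import re

# A full Git commit hash is 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_raw_url(owner, repo, commit_sha, path):
    """Return a raw.githubusercontent.com URL fixed to one commit."""
    if not _FULL_SHA.fullmatch(commit_sha):
        # Branch names such as "main" are rejected: they move over time.
        raise ValueError("commit_sha must be a full 40-character hash")
    return (
        "https://raw.githubusercontent.com/"
        f"{owner}/{repo}/{commit_sha}/{path.lstrip('/')}"
    )


# Placeholder values for illustration only.
EXAMPLE_SHA = "0123456789abcdef0123456789abcdef01234567"
pinned = pinned_raw_url("example-owner", "example-repo", EXAMPLE_SHA, "all.txt")
print(pinned)
```

The trade-off is that a pinned list never updates on its own, so the hash has to be bumped by hand after reviewing new commits.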
        
         | Quenhus wrote:
         | As the author of the filter, I strongly agree with you.
         | However, I believe it would be too tedious for most people to
         | update the filter "by hand". I think I'm going to add this
         | important security information in the README.
        
       | btdmaster wrote:
       | To be fair, they are quite nice when the official website is down
       | or blocked...
        
         | kipchak wrote:
         | Google's cached version of pages can be another useful option,
         | if you click on the ellipses to the right of a search result's
         | address and then "cached" in the bottom right hand corner of
         | the "About this result" box.
        
         | userbinator wrote:
         | Came here to post a similar sentiment. I've rescued very useful
         | content from mirroring sites that was gone from the original.
         | You can filter them out if you want, but don't forget you're
         | doing that or you may not find what you're after.
        
         | Jerry2 wrote:
         | ... or DMCA'd.
         | 
         | Over the past year, I've noticed that quite a few repos that I
         | used to track have disappeared. I keep a local bookmarks list
         | now because if a "starred" project is removed or DMCA'd, Github
         | does not tell you about it and they remove any mention of the
         | repo from the "starred" list.
        
       ___________________________________________________________________
       (page generated 2022-02-17 23:00 UTC)