[HN Gopher] We can do better than DuckDuckGo
       ___________________________________________________________________
        
       We can do better than DuckDuckGo
        
       Author : als0
       Score  : 292 points
       Date   : 2020-11-17 18:16 UTC (4 hours ago)
        
 (HTM) web link (drewdevault.com)
 (TXT) w3m dump (drewdevault.com)
        
        | x87678r wrote:
        | I don't want privacy, I want competition. Which is why I use
        | Bing, and honestly it works virtually all the time.
        | 
        | Maybe there could be some purely anonymous, ad-free search
        | engine, but it's more realistic to have an alternative
        | commercial one. I really don't care that people are looking
        | at my searches for how to resize an array or cheap hotels in
        | Florida.
        
          | ablanco wrote:
          | This "nothing to hide" argument has been analyzed pretty
          | thoroughly, and in my opinion, it's really dangerous.
          | https://spreadprivacy.com/three-reasons-why-the-nothing-to-h...
        
       | staunch wrote:
        | We have to make sure to include highly relevant
        | advertisements in the search results; at least 50% of the
        | results should be ads. So
       | there needs to be a marketplace for buying/selling ads.
       | 
       | We can't have a search engine that is only useful for finding the
       | most relevant web pages for a given query. People love highly
       | relevant advertisement in their search results.
        
       | TheGrassyKnoll wrote:
       | I actually heard a DDG ad on the radio in the Los Angeles area
       | (KNX 1070 24 hour news). Still love the bangs.
        
       | Siira wrote:
       | Searx is a partially viable FOSS meta-search engine.
        
       | todd3834 wrote:
       | Aren't the secrets of the algorithm what prevent people from
       | gaming the results? While I love the idea of search becoming
       | fully open source I'm skeptical it could be done. I hope I'm
       | wrong and I'd love to dedicate time to an open source project
       | with this goal if anyone presents a convincing plan.
        
        | moonchild wrote:
        | Findx[1] was an attempt to make an open-source search engine.
        | Today it's just another Bing wrapper, but their code[2] is
        | still available, waiting to be used as a starting point for
        | another project.
       | 
       | 1. https://www.findx.com/
       | 
       | 2. https://github.com/privacore/open-source-search-engine
        
       | abalaji wrote:
        | This is why search is hard: 15% of Google searches are new
        | each day. [1] And, with over 1.7 billion web pages, [2] it
        | would take a gargantuan open source effort to put something
        | like this together.
        | 
        | Not to mention the cost; I'm not sure something like this
        | could be sustained with a Wikipedia-esque "please donate $15"
        | fundraising model.
       | 
       | [1] https://searchengineland.com/google-reaffirms-15-searches-
       | ne...
       | 
       | [2] https://www.weforum.org/agenda/2019/09/chart-of-the-day-
       | how-...
        
        | ecommerceguy wrote:
        | I'm surprised no one has rented Ahrefs' database, whipped up
        | an algorithm, and called it a search engine. Besides Google
        | and Microsoft, who has a bigger snapshot of the entire web
        | (NSA not included)? Majestic, maybe?
        
       | claytoneast wrote:
       | I wonder if you could start small on something like this. Build a
       | proof of concept, a search engine for programmers that indexes
       | only programming sites/material. See if you can technically do
       | it, & if you can figure out governance mechanisms for the
       | project. Sort of like Amazon starting with just selling books.
        
       | mcqueenjordan wrote:
       | I've recently been /tinkering/ with exactly such an idea! In my
       | case, it's even more specific and scoped: A search engine with
       | only allow-listed domains about software engineering/tech/product
       | blogs that I trust.
       | 
       | https://github.com/jmqd/folklore.dev
       | 
       | It's not even really at the POC stage yet, but I hope to host it
       | with a simple web frontend sometime soon. Primarily, this is just
       | for myself... I just want a good way to search the sources that I
       | myself trust.
        
        | easymovet wrote:
        | Crawlers are a top-down approach. A distributed list that
        | people pay digital money to be listed in would both
        | incentivize nodes to stay online and transform Sybil attacks
        | into paid advertising.
        
       | rjurney wrote:
       | Check out the serious difficulties the Common Crawl had with
       | crawling 1% of the public internet on donated money and then get
       | back to me with a plan. This is really, really hard to do for
       | free. Maybe talk to Gates :)
        
       | Analemma_ wrote:
       | I spent seven years working at Bing, and I can tell you that this
       | guy is massively, hugely underestimating the difficulty of this
       | problem. His repeated "it's easy! You just have to..."
       | suggestions are absurd. This is typical HN content where someone
       | with no domain expertise swaggers in and assumes everyone in the
       | space must be idiots, and that only he can save the day.
       | 
        | Trust me, there is _not_ a ton of potential "just sitting on
        | the floor" in web search.
        
          | ablanco wrote:
          | Given your experience, what's your opinion about DDG result
          | quality?
        
        | 6510 wrote:
        | While I agree with its lack of organization, I don't think
        | YaCy being intolerably slow is necessarily an argument
        | against it. If you are looking for a complete set of pages on
        | a specific topic, time is sort-of irrelevant. Google, for
        | example, has alerts for new results. That these pages are not
        | available sooner (before publication) is not intolerable. You
        | can also throw hardware at YaCy and adjust the settings,
        | which improves it a lot. The challenge with a distributed
        | approach is sorting the results. Other crawlers have the same
        | problem, but in a distributed system it is even harder.
        | 
        | For running an instance for websites related to your
        | occupation or hobby, YaCy is quite wonderful. You don't want
        | Google removing a bunch of pages that might cover exactly the
        | sub-topic you are looking for. Of course, the smaller the
        | number of pages in your niche, the better it works.
        
       | neurobashing wrote:
       | Am I the only person who just doesn't have problems with DDG
       | search results?
       | 
       | What am I doing wrong (or right), here? I put a thing in and find
       | it. I just don't use Google any more.
       | 
       | Genuinely curious why it's working for me and such garbage for
       | everyone else.
        
          | djsumdog wrote:
          | I'd say about 50% of the time I'm good with DDG. About 1/3
          | of the time I add !g, usually for weird error messages and
          | tech stuff.
          | 
          | Honestly, we shouldn't be using Google for everything. Why
          | not just search StackExchange or GitHub issues directly for
          | known bug problems? If you need a movie, !imdb or !rt
          | forwards you to exactly where you really want to search.
          | 
          | If DDG or Google also included independent small blogs in
          | movie results, I could see the value in that. I'd prefer
          | someone's review on their own site or video channel, but
          | they don't include those. We've kinda lost that part of the
          | Internet.
        
          | pizza234 wrote:
          | I tried DDG for a while, around a couple of years ago, and
          | I got lower-quality results, particularly for technical
          | subjects (which are the vast majority of my searches). I
          | will give DDG another shot, though.
        
          | keithnz wrote:
          | For generic stuff DDG is mostly OK. But for local results,
          | even though it has a switch for local results, it REALLY
          | REALLY REALLY sucks. It often doesn't get any of the
          | expected places anywhere in the first few pages for New
          | Zealand, which makes it somewhat useless.
        
          | dybber wrote:
          | I'm mostly getting Norwegian results when searching for
          | Danish subjects from a Danish IP address. It also seems it
          | just hasn't indexed as many websites as Google.
        
          | proactivesvcs wrote:
          | I sometimes come across inappropriate results - for
          | example, I search for a hex error code and the results are
          | for other numbers - and sometimes the adverts are
          | misleading, but neither is prevalent enough to harm the
          | experience in general.
         | 
         | I always send feedback when I come across incorrect results and
         | also try to when I get a really easy find.
         | 
         | I have not had to resort to any other search engine for at
         | least five years.
        
          | Moru wrote:
          | I've also been using DDG exclusively for many years. I
          | usually find what I need in the first couple of results or
          | in the box on the right, which usually points directly to
          | the authoritative source anyway.
        
         | jlarocco wrote:
         | Yeah, I'm with you.
         | 
         | I can think of some improvements (better forum/mailing list
         | coverage), but it's generally fine for almost everything.
         | Lately if I don't find it on DDG I probably won't have much
         | better luck anywhere else, either.
        
          | Dahoon wrote:
          | Do Google search results work for you? If yes, then I'd say
          | the reason is you don't see, or don't agree with, how bad
          | results are today (as others have posted extensively
          | about). I for one find DDG the search engine that returns
          | the worst results. Qwant is a better Bing-based engine IMO,
          | but it is still bad.
        
        | jesuscyborg wrote:
        | The way I'd code a better search engine is I'd design an ML
        | model trained to recognize handwritten HTML like this, and
        | only add those pages to the index. It'd be cheap to crawl,
        | probably needing only a single computer to run the whole
        | search engine. It'd resurrect The Old Web, which still
        | exists, but got buried beneath the spammy, SEO-optimized
        | grifter web over the years as normies flooded the scene.
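        | 
        | A rough sketch of one such heuristic (hypothetical; a real
        | system would train a classifier on labeled pages rather than
        | hand-pick signals):
        | 
        |     import re
        | 
        |     def looks_handwritten(html: str) -> bool:
        |         # Crude signals that a page was written by hand.
        |         lines = html.splitlines() or [""]
        |         avg = sum(len(l) for l in lines) / len(lines)
        |         minified = avg > 500  # one giant line: build tooling
        |         bundled = bool(re.search(r"webpack|\.min\.(js|css)",
        |                                  html))
        |         tags = len(re.findall(r"<[a-zA-Z]", html))
        |         text = re.sub(r"<[^>]*>", " ", html)
        |         density = tags / max(len(text), 1)  # "div soup"
        |         return not minified and not bundled and density < 0.05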
        
          | buzzerbetrayed wrote:
          | I hope to never use your search engine. I love hand-written
          | HTML as much as the next guy, but search engines are made
          | to find things, and useful information exists on websites
          | that use generated and/or minified HTML.
        
        | mixologic wrote:
        | How would anybody ever know what the server is running and/or
        | doing with the data you send it, regardless of whether it is
        | running open or closed source code?
        | 
        | A service, running on somebody else's machine, is essentially
        | closed.
       | 
       | I think the only way to have an 'open' service is to have it
       | managed like a co-op, where the users all have access to
       | deployment logs or other such transparency.
       | 
       | Even then, it requires implicit trust in whomever has the
       | authorization to access the servers.
        
          | joosters wrote:
          | In _theory_, this is the kind of thing that the GPL v3 was
          | trying to address: roughly speaking, if you host & run a
          | service that is derived from GPL-v3'd software, you are
          | obliged to publish your modifications.
          | 
          | But I agree with you - and I don't think the author had
          | really thought through what they were demanding. They made
          | no mention of licensing other than singing the praises of
          | FOSS, as if that would magically mean you could trust what
          | a search engine was doing.
        
           | lixtra wrote:
           | > In theory, this is the kind of thing that the GPL v3 was
           | trying to address: roughly speaking, if you host & run a
           | service that is derived from GPL-v3'd software, you are
           | obliged to publish your modifications.
           | 
           | You mean AGPL https://en.m.wikipedia.org/wiki/Affero_General_
           | Public_Licens...
        
              | joosters wrote:
              | You're right... I'm misremembering the GPL. Wikipedia
              | says that it was only 'Early drafts of GPLv3 also let
              | licensors add an Affero-like requirement that would have
              | plugged the ASP loophole in the GPL' - I hadn't realised
              | it never made it into the final version.
        
           | jedimastert wrote:
           | > In theory, this is the kind of thing that the GPL v3 was
           | trying to address: roughly speaking, if you host & run a
           | service that is derived from GPL-v3'd software, you are
           | obliged to publish your modifications.
           | 
           | Why would I trust someone to do that, though?
        
         | joshuaissac wrote:
         | That sounds a bit like YaCy.[1] It is a program that apparently
         | lets you host a search engine on your own machine, or have it
         | run as a P2P node.
         | 
         | I think the next step forward should be to have indices that
         | can be shared/sold for use with local mode. So you might buy
         | specialised indices for particular fields, or general ones like
         | what Google has. The size of Google's index is measured in
         | petabytes, so a normal person would still not have the
         | capability to run something like that locally.
         | 
         | 1. https://yacy.net/
        
         | Jyaif wrote:
         | > How would anybody ever know what the server is running and/or
         | doing with the data you send it, regardless of if it is running
         | open or closed source code?
         | 
         | https://en.wikipedia.org/wiki/Homomorphic_encryption
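          | 
          | For a taste of what that allows - a server computing on
          | data it cannot read - here is a toy example using the
          | third-party python-paillier library (additive homomorphism
          | only; private _search_ is far more involved):
          | 
          |     from phe import paillier  # pip install phe
          | 
          |     pub, priv = paillier.generate_paillier_keypair()
          |     enc = pub.encrypt(42)  # client encrypts its value
          |     enc2 = enc + 8         # server adds without seeing 42
          |     assert priv.decrypt(enc2) == 50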
        
         | [deleted]
        
        | a3camero wrote:
        | I took a stab at making a simple search engine as a hobby
        | project during Coronavirus and wrote up some of the lessons I
        | learned: https://www.cameronhuff.com/blog/making-a-
        | search-engine-gori.... Here are some of those lessons, which
        | might help anyone else out there considering trying their
        | hand at this (which is a great educational project!):
       | 
       | 1. Use private networking for traffic between components.
       | 
       | 2. Compress screenshots carefully. Screenshots are a major part
       | of the disk requirement.
       | 
       | 3. Use "block storage" (network-attached flash memory storage) to
       | store indices instead of RAM.
       | 
        | 4. Carefully distinguish between the URL that is perceived
        | (i.e. displayed on a site) vs. the actual URL that results
        | from following the link.
       | 
        | 5. When dequeueing URLs, be careful to dequeue the one with
        | the lowest depth and lowest number of attempts.
        | 
        | 6. Store pages using delta compression.
       | 
       | 7. Don't store something if it's already stored, by addressing
       | content using hashes.
       | 
        | 8. Sequential-ish integer data can often be stored using
        | offsets instead of the actual number to achieve significant
        | file size savings.
       | 
       | 9. Hash collisions are far more common than you'd expect due to
       | the "birthday problem".
       | 
       | 10. Always use an object for storing a URL because raw URLs that
       | are read from webpages often have issues.
       | 
        | 11. Use redis to cache data. A cache is essential, and MySQL
        | (or another database; I started with mongodb) isn't meant for
        | that. Also use redis for basic queues; a system like this
        | needs several of them to achieve good throughput.
       | 
       | 12. Use APIs to connect components across system boundaries
       | rather than using file access or database access directly.
       | 
       | 13. Think carefully about where data is stored. Some data needs
       | to be in RAM, some can be stored temporarily on disk on a VM or
       | in a database, some needs to be on block storage, etc.
       | 
        | 14. Bandwidth charges make even cheap services like B2
        | (S3-compatible object storage) expensive.
       | 
       | 15. Cheap VMs are important.
       | 
        | 16. SHA-1 hashes can be done using the WebCrypto API, but MD5
        | can't.
        | 
        | 17. The Redis BITFIELD command can be used to store
        | information in bitmaps that can be very memory-efficient.
       | 
       | 18. Using block storage to store indices is cheap but limits the
       | throughput of the system.
       | 
       | 19. Storing data in tables where the table name is a part of the
       | index (such as docs1, docs2, etc.) can make a lot of sense and be
       | much faster than a large table with an index for the field.
       | 
       | 20. Websites are not designed to be crawled. Proper crawling,
       | that is respectful (i.e. not too many pages loaded per hour),
       | thorough without being overly thorough, and adjusted according to
       | how frequently site content appears to change, is harder than it
       | appears.
       | 
       | 21. Study academic articles and search engine company
       | presentations, even old ones (pre-2010) to understand how to
       | design a search engine.
       | 
       | 22. The distribution of words in a document is just as important
       | as the word count, and maybe more so.
       | 
        | 23. Search terms can be hashed locally on the user's computer
        | and sent to the server to see if there are pages that have
        | that term, without exposing the term itself to the search
        | engine (see the sketch after this list).
       | 
        | 24. Downloading and indexing a website with hundreds of
        | thousands of pages takes a long time if you want to crawl the
        | site politely (i.e. one page per minute).
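        | 
        | A minimal sketch of lesson 23, assuming a hypothetical
        | /lookup endpoint that maps digests to matching page IDs:
        | 
        |     import hashlib
        |     import requests  # third-party: pip install requests
        | 
        |     def private_lookup(term: str) -> list:
        |         # Hash locally; the server only sees the digest.
        |         h = hashlib.sha256(term.lower().encode()).hexdigest()
        |         resp = requests.get("https://search.example/lookup",
        |                             params={"h": h})
        |         return resp.json()  # page IDs matching the term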
        
        | ufo wrote:
        | One thing I've always wished for is a way to use DuckDuckGo
        | bang searches in my browser without sending them through DDG.
        | But apparently it's harder to implement than it sounds.
        
          | takeda wrote:
          | You absolutely can. At least in Firefox, you can right-click
          | on a search field and select "Add a Keyword for this
          | Search...", then save it as a bookmark and enter the
          | keyword (you don't have to use !, but it's an option if you
          | choose to).
          | 
          | You can also create such a bookmark manually and use %s in
          | the URL as a placeholder where the search query should be
          | placed.
          | 
          | The manual configuration can be useful when there's no
          | direct search field. For example, freshports.org allows
          | querying freebsd.org. I can add a bookmark with the search
          | keyword "fp" pointing to https://freshports.org/%S
          | 
          | After that I can type "fp lang/python39" in the address bar
          | to land on https://freshports.org/lang/python39 (the
          | capital %S doesn't escape special characters like /).
        
          | iuqiddis wrote:
          | In Firefox you can right-click on a search field and add a
          | keyword bookmark. Once saved, you can type "kw search
          | query" in the address bar, where kw is your defined
          | keyword, to directly search the relevant site.
        
           | ufo wrote:
           | I'm aware of that. The problem is that you have to manually
           | add all the keywords yourself. AFAIK, there isn't an easy way
           | to import a large list of curated keywords like the DDG bang
           | list.
        
             | detaro wrote:
             | They are bookmarks you can export/import from Firefox, so
             | someone could easily make a Firefox bookmark file for a
             | large set of them.
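              | 
              | For instance (a hypothetical sketch; Firefox's exported
              | bookmarks use the SHORTCUTURL attribute for keywords),
              | a few lines of Python could emit an importable file
              | from a bang list:
              | 
              |     bangs = {
              |         "g": "https://www.google.com/search?q=%s",
              |         "w": ("https://en.wikipedia.org/wiki/"
              |               "Special:Search?search=%s"),
              |     }
              |     with open("bangs.html", "w") as f:
              |         f.write("<!DOCTYPE NETSCAPE-Bookmark-file-1>"
              |                 "\n<DL><p>\n")
              |         for kw, url in bangs.items():
              |             f.write(f'<DT><A HREF="{url}" '
              |                     f'SHORTCUTURL="{kw}">!{kw}</A>\n')
              |         f.write("</DL><p>\n")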
        
           | wldcordeiro wrote:
           | Love this feature. I've got basically all the bang keywords
           | but instead of say `!g query` it just becomes `g <query>`.
        
        | wolco2 wrote:
        | Privacy or not, I'm starting to find things on DDG that
        | Google has been filtering.
        | 
        | I found out through comments on HN that 8chan was back up
        | under a new name: 8kun.
        | 
        | Typing it into Google, I get articles about it but no link to
        | it in the results.
        | 
        | In DuckDuckGo, it's the first link.
        | 
        | Made me think: what else am I missing?
        
          | the_only_law wrote:
          | Google's filtering is so weird. I have a bad habit of buying
          | old hardware without checking if the documentation has been
          | made available on the web.
          | 
          | Recently I found myself desperate for any information on a
          | piece of hardware I had gotten. I was swapping out all
          | sorts of queries with different keywords hoping to find a
          | manual. I was able to find some marketing material which
          | was helpful, albeit barely. Eventually I had exhausted the
          | search results for most of my queries, gave up, and assumed
          | that it was simply lost to time and I was out of luck.
          | 
          | Eventually I went back to the sales paper I had found and
          | visited the site it was hosted on, a Lithuanian reseller. I
          | translated the page, eventually finding a direct link to a
          | user manual on the exact same page as the sales paper. The
          | document was in English and contained important words from
          | my queries (such as the product name, company, "user
          | manual", etc.). The document was at the same path as the
          | sales paper too. I have no idea why Google found the sales
          | paper but not the manual.
          | 
          | Unfortunately the manual still wasn't what I was looking
          | for _exactly_, but it was a hell of a lot better than what
          | I could get from Google's results.
        
       | jron wrote:
       | SEO is crushing the utility of Google. It is pretty telling when
       | you need to add things like site:reddit.com to get anything of
       | value. Harnessing real user experiences (blogs, etc) is the key
       | to a better search engine. This model unfortunately crumbles
       | under walled gardens which is increasingly the preferred location
       | of user activity.
        
          | RileyJames wrote:
          | That's where blogs were at, but now a massive portion of
          | them are content farms / splogs.
          | 
          | You're right that the walled gardens have hurt this. So
          | often I search something specific, or a topic, and find
          | very little. But I know there are communities on Facebook
          | for this; I know there would be people's posts out there on
          | Instagram which 100% answer my question. But they may as
          | well not exist. Unless I was "following" them when it was
          | said, and mentally indexed it, these things are mostly
          | unfindable - and that's if I even have an account for said
          | service (which I don't for Facebook).
          | 
          | It's sad: more people than ever are using the internet, and
          | more content & knowledge is being created than ever before,
          | yet it's no longer possible to find the great answers.
        
          | djsumdog wrote:
          | > Harnessing real user experiences (blogs, etc)
          | 
          | This is what we need more than anything: more independent
          | blogs, the ability to search events now or 10 years ago,
          | mass indexing of RSS feeds, etc.
          | 
          | A general search engine is kinda out of the ballpark for
          | now. But you could specialize in long-form blogs, from all
          | sides: hard-left, hard-right, women in tech, white
          | supremacists, all the extremes and moderates.
          | 
          | I'd love to have an interface to search a topic and see
          | what all kinds of people have posted long form, without
          | commentary or Twitter/Facebook bullshit "fact checking"
          | notices. I want to see what real writers are saying across
          | the spectrum on a given topic for the week or month.
        
            | ant6n wrote:
            | It's hard to get readership writing blogs these days.
            | That's pretty demotivating.
        
             | CameronNemo wrote:
             | Also difficult to distinguish a blog from a content farm if
             | you are just crawling the web. Any content pattern you
             | select for would likely be quickly adopted by SEOs.
        
           | grey_earthling wrote:
           | > This is what we need more than anything. More independent
           | blogs. The ability to search events now, or 10 years ago,
           | mass indexing of RSS feeds, etc.
           | 
           | Thought experiment: what would a search engine look like if
           | it _only_ indexed RSS and Atom feeds?
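            | 
            | Small enough to sketch - a toy version using the
            | third-party feedparser library (the feed URL is
            | hypothetical):
            | 
            |     import feedparser  # pip install feedparser
            | 
            |     def index_feed(url, index):
            |         # Feeds are pre-extracted, author-published
            |         # content: no boilerplate or JS to fight with.
            |         for e in feedparser.parse(url).entries:
            |             link = e.get("link", "")
            |             for w in e.get("title", "").lower().split():
            |                 index.setdefault(w, set()).add(link)
            | 
            |     index = {}
            |     index_feed("https://blog.example.org/atom.xml", index)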
        
        | snowwrestler wrote:
        | Why would advertisements not fly? Search intent is like the
        | canonical example of an ad targeting signal that does not
        | need personal data to succeed. If someone is searching for
        | laptops, you can show them laptop ads.
        | 
        | I think most of Google's personal data efforts actually
        | support a) better organic search results (does this person
        | mean Apple the company or apple the fruit), and b) all the
        | ads that are served off of their SERPs, where there is no
        | intent signal to read (i.e. their display network). Again,
        | you don't need personal data to serve ads based on search
        | terms.
        
       | merlinscholz wrote:
       | I miss Cliqz. It was a new search engine, with its own crawler,
       | almost completely from scratch. It even had a dev blog where they
       | wrote articles on how to build your own search engine:
       | https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...
        
        | wcerfgba wrote:
        | I wonder if, instead of another search engine, we would
        | benefit from a directory, like DMOZ, or perhaps something
        | tag-based or non-hierarchical. Sometimes I find better
        | results by first finding a good website in the space of my
        | query and then searching within that site, as opposed to
        | applying a specific query over all websites. One example
        | would be recipes: if you search for "bean burger recipe" you
        | will get lots of results across many websites, but some may
        | not be very good, whereas if you already know of recipe
        | websites that you consider high-quality or matching your
        | preferences, then you'll find the best (subjectively) recipe
        | by visiting such a site and then searching for bean burgers.
        
          | SNosTrAnDbLe wrote:
          | Yeah, exactly my thoughts. I really liked the concept of
          | del.icio.us, where humans could bookmark and tag websites.
          | 
          | DMOZ looks pretty great but the categories look limiting.
        
        | nikivi wrote:
        | I'd love a truly open-source, world-class search engine. I'm
        | curious how both the crawler and the search index / search
        | are done by the likes of Google/Bing/DDG. Eventually someone
        | will make an OSS version of it that can compete.
        | 
        | The beauty of such an OSS solution may be the custom
        | heuristics that can be created based off the crawled data.
        
          | anotherdirtbag wrote:
          | There's no need to compete. People who want things like this
          | just do it themselves. Check out YaCy:
          | https://github.com/yacy/yacy_search_server
        
         | Mediterraneo10 wrote:
         | The challenges to OSS developers are numerous. First of all,
         | many popular sites on the internet block crawlers other than
         | Google and Bing, because only those ones seem to matter to
         | their business, and any small upstart would be assumed to be a
         | dodgy bot. Secondly, Google amasses the database it has only
         | with vast data centers, incredible amounts of bandwidth, and
         | power requirements unavailable to a startup.
        
           | creese wrote:
           | How would anyone block a crawler? A crawler is just a
           | headless browser.
        
             | tleb_ wrote:
             | robots.txt
             | 
             | https://www.robotstxt.org/
             | 
             | https://en.wikipedia.org/wiki/Robots_exclusion_standard
        
               | Xylakant wrote:
               | Note that robots.txt is a hint to well-behaved crawlers,
               | not blocking them in any regard.
               | 
               | You can block crawlers if you can identify them, but
               | reliably identifying them is hard.
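                | 
                | Python's standard library ships a parser for it,
                | which is exactly the "well-behaved" part - nothing
                | forces a crawler to honor the answer:
                | 
                |     import urllib.robotparser
                | 
                |     rp = urllib.robotparser.RobotFileParser()
                |     rp.set_url("https://example.com/robots.txt")
                |     rp.read()
                |     # A polite crawler checks before every fetch.
                |     ok = rp.can_fetch("MyCrawler/0.1",
                |                       "https://example.com/page")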
        
         | ddorian43 wrote:
         | Good luck with that mate. Check out https://commoncrawl.org/
        
        | katsura wrote:
        | My biggest pet peeve with DDG at the moment is that whenever
        | I search for something on my phone, the first two results are
        | ads, and those two results take up my whole screen. I mean,
        | sure, those are probably not privacy-invading, but I
        | literally don't care, as I wasn't looking for them.
        
       | metroholografix wrote:
       | DuckDuckGo is a mirage and should not be used by privacy-
       | conscious folks. Take a look at its terms of service, information
       | collected section:
       | 
       | "We also save searches, but again, not in a personally
       | identifiable way, as we do not store IP addresses or unique User
       | agent strings. We use aggregate, non-personal search data to
       | improve things like misspellings."
       | 
        | So they save your web searches and claim that they do so in a
        | non-personally identifiable way. The privacy problems with this
       | claim are many, even if one accepts it at face value (good luck
       | verifying that this is the case).
        
         | robertlagrant wrote:
         | I don't see why you'd both nitpick their terms of service, and
         | then also claim that it's a pack of lies and can't be trusted.
         | Why do the former and then the latter? If your complaint is
         | just "I can't verify anything about their privacy" then that
         | would've made sense.
        
          | fbelzile wrote:
          | > DuckDuckGo is a mirage ... The privacy problems with this
          | claim are many ... good luck verifying ...
          | 
          | Okay, can you list just a few?
          | 
          | If you're going to make counter-claims like this, you're
          | going to have to provide evidence.
          | 
          | Statements like these are not conducive to gaining popular
          | support for increased privacy.
        
            | metroholografix wrote:
            | How do you save a search in a non-personally identifiable
            | way? Do you have a human verify the data belonging to
            | each and every search? Not saving IPs and/or browser data
            | doesn't solve the problem, since the search terms
            | themselves can be personally identifiable.
            | 
            | How do you verify that DuckDuckGo does the (minimal and
            | ineffective) things they claim to do? They offer no
            | proof.
            | 
            | How do you verify that DuckDuckGo does not secretly
            | cooperate with more powerful coercive actors?
            | 
            | How do you verify that DuckDuckGo, offering a single
            | point of compromise, has not been thoroughly compromised
            | by more powerful actors?
        
             | bscphil wrote:
             | > How do you save a search in a non-personally identifiable
             | way?
             | 
             | Save a sha256 hash of every search for 24 hours. If you see
             | the same hash from >10 distinct IP addresses in a 24 hour
             | period, save the search terms.
             | 
             | That's just off the top of my head, I have no reason to
             | think they're doing it exactly like that. The point is that
             | you're claiming that we shouldn't trust DuckDuckGo because
             | you can't think of a way that they could securely and
             | privately do what they do -- but that's just your
             | intuitions, for whatever they may be worth.
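              | 
              | A minimal sketch of that off-the-cuff scheme, for
              | concreteness (popular terms get saved; one-off,
              | potentially identifying queries never do):
              | 
              |     import hashlib, time
              | 
              |     seen = {}  # digest -> (window start, set of IPs)
              | 
              |     def record(term, ip, now=None):
              |         now = now or time.time()
              |         h = hashlib.sha256(term.encode()).hexdigest()
              |         start, ips = seen.get(h, (now, set()))
              |         if now - start > 86400:  # new 24h window
              |             start, ips = now, set()
              |         ips.add(ip)
              |         seen[h] = (start, ips)
              |         # Only terms seen from >10 distinct IPs are
              |         # worth keeping as plain text.
              |         return term if len(ips) > 10 else None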
             | 
             | I also don't really buy the worries you have with the last
             | two questions, e.g.:
             | 
             | > How do you verify that DuckDuckGo does not secretly
             | cooperate with more powerful coercive actors?
             | 
             | How would you verify that for _any_ centralized service,
             | open source or not? I think your security concerns go a bit
                | beyond what most people interested in critiquing /
                | improving DDG can reasonably expect to achieve.
        
               | pfarrell wrote:
               | > How would you verify that for any centralized service,
               | open source or not?
               | 
                | I think, technically, some sort of honeypot
                | verification could prove a compromise (i.e. if
                | information that has very little chance of existing
                | naturally in two systems, say a string of GUIDs,
                | shows up in both).
               | 
               | But... I agree with your point. I don't think this is
               | actually feasible or realistic, just technically
               | possible.
        
               | pb7 wrote:
               | >How would you verify that for any centralized service,
               | open source or not?
               | 
               | Other centralized (search) services don't have their
               | entire existence depending on this one factor. What is
               | DDG if not alleged privacy? Just use Bing directly.
        
               | bscphil wrote:
               | I don't understand that argument at all. What's the
               | threat model?
               | 
               | I think it's entirely reasonable to be in the following
               | posture: I want as much privacy for my web searches as I
               | can reasonably achieve without having to run a search
               | engine myself. I'm willing to trust that search providers
               | are not saving personally identifiable information or
               | passively turning over search data to law enforcement if
               | they claim that they are not in their terms of service.
               | 
               | That's pretty much the use case for DDG. With Bing you
               | _know_ they are violating your privacy. With DDG you have
                | a promise _in writing_ that they are not. It's hard to
               | see how that's not strictly better than what you get from
               | Bing if privacy is among your core desiderata.
        
               | pb7 wrote:
               | I think we're on the same page. I was saying that if it
               | were to be discovered that DDG lacks privacy then there
               | would be no reason to use it over Bing since that is its
               | raison d'etre.
               | 
               | >I'm willing to trust that search providers are not
               | saving personally identifiable information or passively
               | turning over search data to law enforcement if they claim
               | that they are not in their terms of service.
               | 
               | Do other search companies disclose that they share data
               | with the FBI, NSA, etc in their ToS? Genuinely don't
               | know.
        
             | jerf wrote:
             | "How do you save a search in a non-personally identifiable
             | way?"
             | 
             | To a first approximation, you just... do it.
             | 
             | Granted, if you search "{jerf's realname here}
             | {embarrassing disease} cure" or something, in the
             | pathological case, you could at least guess that maybe it
             | was me, though even then my real name is far from unique,
             | and nothing stops anyone else from running such a search.
             | 
             | But otherwise, if all you have is a pile of a few billion
             | searches, you don't have any information about any of the
             | specific searchers. Even if you search for your own
             | specific address, you don't really get anything out of it;
             | there's no guarantee it was you, or a friend of yours, or
             | an automated address scraper. There isn't much you can get
             | out of a search string without more information connected
             | to it.
             | 
             | The rest of your criticisms are too powerful for the topic
             | at hand; they don't prove we shouldn't use DDG, they prove
             | we shouldn't use the internet at all.
        
                | Dahoon wrote:
                | At the very least your example is PII, which you
                | cannot save and also claim to be private.
        
               | [deleted]
        
             | [deleted]
        
         | WA9ACE wrote:
         | Do you have a search engine that you prefer to use that claims
         | not to store said information that I might try?
        
           | h2onock wrote:
           | I can hand on heart tell you that Mojeek doesn't and never
           | has. I know this because I work for Mojeek.
        
              | Pick-A-Hill2019 wrote:
              | Hi. I took a look at Mojeek (first time I've heard of
              | it) and since you mentioned the site and you work
              | there -
              | 
              | In your Privacy page (Data Usage section) there is a
              | mention of stored "Browser Data": "These logs contain
              | the time of visit, page requested, possibly referral
              | data, and located in a separate log browser
              | information."
              | 
              | This is an honest question - how is that not exactly
              | what the parent stated was the issue?
              | 
              |     So they save your web searches and claim that they
              |     do so in a non-personally identifiable way.
        
                | ricardo81 wrote:
                | The referred-to issue with DDG is that its favicon
                | service was informing DDG of sites _you visit_,
                | rather than searches you make.
                | 
                | But agreed that all search engines have to be trusted
                | on their word about anonymising data and not
                | retaining PII.
        
            | metroholografix wrote:
            | The only solution I see is fully
            | distributed/decentralized search. Run your own crawler or
            | be part of a network that distributes this work out to
            | each participating node.
            | 
            | Every centralized search engine has immensely
            | hard-to-resist and powerful incentives to play "The Eye
            | of Sauron" with your data. Additionally, they offer
            | single points of compromise to other, far more powerful
            | actors. Whatever guarantees DuckDuckGo gives you (and
            | right now they don't give any) don't mean much if they've
            | been thoroughly (willingly or unwillingly) compromised.
            | 
            | Which doesn't mean one should always steer well clear,
            | just that one should at least be aware of the tradeoffs
            | one makes when using a centralized search engine. And
            | with DuckDuckGo's misleading marketing, I feel that this
            | point is lost on significant chunks of its userbase.
        
             | ravenstine wrote:
             | Such search engines have been around for many years, and
             | they suck donkey balls. Pardon my French. Install YaCy and
             | tell me how you like it.
             | 
             | It wouldn't matter anyway, because decentralization doesn't
             | really solve privacy any better than centralized search,
             | besides the fact that it could theoretically provide more
             | choices.
             | 
             | No matter what you use, privacy ultimately depends on
              | trust. The reason that I have more trust for DDG than I
              | do for Google is that, unlike Google, its primary
              | audience is privacy-minded folks. If it came out that
              | DDG was tracking users
             | and selling that data, DDG would be immediately done as a
             | brand. They at least have some incentive to do what they
             | say. Decentralization provides no such benefit because a
             | search "node" is unlikely to have any sort of meaningful
             | brand to keep up.
             | 
             | > And with DuckDuckGo's misleading marketing, I feel that
             | this point is lost on significant chunks of its userbase.
             | 
             | How is it misleading? My understanding from their marketing
             | is that they don't create profiles of their users based on
             | searches. Until we have evidence to the contrary, it's not
             | outrageous to assume they are being truthful.
        
             | burnthrow wrote:
             | "Run your own crawler" is not a solution.
             | 
             | Cool my comments are immediately downvoted like that
             | Italian guy's.
        
             | unethical_ban wrote:
             | Yeah, now you're just saying "Nothing centralized can ever
             | be trusted". So just say that rather than nitpicking their
             | ToS. You weren't going to care what they said anyway.
        
       | keyle wrote:
       | I agree DDG isn't perfect or great but it's _good_ 80% of the
       | time.
       | 
        | I always start with DDG and fall back to Google if it doesn't
        | help, or if I feel "there's got to be a better way".
       | 
       | That said, talk is cheap, show us your engine.
        
        | Guest19023892 wrote:
        | I wonder if someone could set up a curated search engine, but
        | allow anyone to curate the results and define a custom list
        | of allowed URLs. Then others can use that list.
       | 
       | For example, I decide Google is terrible when I'm searching for
       | product reviews, and all I get are results to Amazon referral
       | websites and spam blogs that never owned the products to begin
       | with. So, I find 200 sites or forums that actually have quality
       | reviews and I create a whitelist of those URLs, and I name it
       | "John Doe's Product Reviews List".
       | 
       | Other people visit the search engine and they can see my list,
       | rate it, favorite it, and apply it to their results.
       | 
       | So, the idea is you visit the search engine, type your query,
       | then select from a drop down one of your favorite curated lists
       | to apply. Maybe you like to use "Mike's favorite free stock photo
       | websites" when searching for free photos for your projects. Maybe
       | you like to apply "Jane's vegan friendly results" when searching
       | recipes or face creams. Maybe you want to buy local, so you use
       | the "Handmade in X" list when searching for your next belt. Maybe
       | you use another list that only shows results from forums. Or
       | another for tracking/ad free websites.
       | 
       | Keep track of list changes. So, if someone gets paid off to allow
       | certain sites on their popular list, others can easily fork a
       | past version of the list.
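        | 
        | Mechanically, applying a list is just a post-filter over any
        | engine's results - a sketch, with hypothetical list data:
        | 
        |     from urllib.parse import urlparse
        | 
        |     # "John Doe's Product Reviews List" (hypothetical)
        |     allowed = {"reviews.example.org", "forum.example.net"}
        | 
        |     def apply_list(results, allowed):
        |         # Keep only hits whose host is on the curated list.
        |         return [r for r in results
        |                 if urlparse(r["url"]).hostname in allowed]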
        
       | benmller313 wrote:
       | I think this person actually means "We can imagine doing better
       | than DuckDuckGo".
        
         | 6510 wrote:
         | The right question is: How to do search using open source
         | tools?
         | 
         | If your goal is "to make something better than the Duck" and
         | you succeed, the Duck dies... what is your goal now?
        
         | timClicks wrote:
         | Well, ideas are much easier than implementations.
        
           | ikiris wrote:
           | It's kind of amazing how many people think an idea is the
           | biggest part of a viable product.
        
             | 6510 wrote:
             | So you want to build a team and organize finances first?
             | That doesn't seem like a bad idea... wait...
        
           | TedDoesntTalk wrote:
           | Cliqz in Germany was one such implementation, funded in part
           | by Mozilla but completely independent.
           | 
           | They wrote their own search engine.
           | 
           | They closed shop earlier this year.
        
       | corytheboyd wrote:
       | You need money and dedicated resources to run and manage the
       | service, which at some point is just going to require trust.
       | Trusting nobody is smart, but expecting a service to compete and
       | win the long game without trusting it is pointless.
        
       | blibble wrote:
       | as a recent ddg convert, I've noticed little difference from
       | google
       | 
       | (might be because google's results these days are so bad
       | though... can't really tell)
        
        | beefield wrote:
        | How about a crowdsourced search engine, like Wikipedia or
        | Stack Overflow? Like:
        | 
        | When you search for "kittens", you get the links that are
        | most upvoted by the community.
        | 
        | If nobody has ever submitted links for the search term
        | "kittens", you get a link to selected generic search engines,
        | and "kittens" ends up on a list of terms someone has searched
        | for but nobody has yet added a good result link for.
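        | 
        | The data model for this is tiny (a sketch; moderation is the
        | hard part):
        | 
        |     from collections import Counter, defaultdict
        | 
        |     votes = defaultdict(Counter)  # term -> {url: score}
        |     missing = set()               # searched, never curated
        | 
        |     def search(term):
        |         if votes[term]:
        |             return [u for u, _ in
        |                     votes[term].most_common(10)]
        |         missing.add(term)  # lands on the to-curate list
        |         return ["https://duckduckgo.com/?q=" + term]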
        
         | Moru wrote:
         | I hate to be so negative but that's just another sort of SEO
         | problem. Someone will pay a large group of people to sit and
         | click upvotes for their clients nonstop.
        
            | beefield wrote:
            | Of course there are going to be some highly debated
            | search terms. But I think that applies also to Wikipedia,
            | and they have managed to pull it off so that it works
            | reasonably well.
            | 
            | I mean, you could always put a big red badge on top of
            | the results that says something along the lines of "this
            | search term seems to be troublesome. You may want to
            | check Qwant/DDG or maybe even Google."
        
       | josefresco wrote:
       | Just do it.
        
         | MaxBarraclough wrote:
         | I suspect Drew has his hands full with the SourceHut project.
        
            | mekster wrote:
            | Perhaps it would have been better for him to say "there's
            | a better way to do it than DDG" rather than "we can do
            | better than DDG", as if he's about to do it, when in fact
            | he's waiting for his revenue to go up.
        
         | eznzt wrote:
         | The last thing we need is a search engine with pictures of
         | anime girls.
        
        | vladmk wrote:
        | This post looks horrible at the bottom left on mobile, FYI.
        
        | dgudkov wrote:
        | >We need a real, working FOSS search engine, complete with its
        | own crawler.
        | 
        | How would an open-source search engine stand up against
        | abusive SEO optimization? If anyone can understand how the
        | ranking algorithm works, then anyone can game it.
        
          | AsyncAwait wrote:
          | Not as much if you have user-curated tier 1 sites. If these
          | start to become spammy, they get removed.
        
       | dumbfounder wrote:
       | Yes, we can do better than DDG. But if you are expecting to fund
       | a real search engine with a few hundred thousand dollars you are
       | insane. It will take a ton of development and a ton of hardware
       | to create an index that isn't a pile of garbage. This isn't 2000
       | anymore. You need to index >100 billion pages and you need it
       | updated and you need great crawling and parsing and you need
       | great algorithms and probably an entirely proprietary engine and
       | you need to CONSTANTLY refine all the above until it isn't
       | garbage. Maybe you could muster something passable for $1B over 5
       | years with a strong core team that attracts great talent. If
       | Apple actually does this, as they are rumored to, I bet they dump
       | $10b into it just for the initial version.
        
         | Aeolun wrote:
         | If you want a _good_ engine there is no need to index 100B
         | pages, since 99% of the pages are blogspam.
        
           | bmurphy1976 wrote:
           | How are you going to identify what's blogspam and what's
           | legitimate without indexing it all in the first place?
        
         | ricardo81 wrote:
         | Agreed, it's going to require significant investment in
         | hardware and software.
         | 
          | The recent UK Competition and Markets Authority report
          | evaluating Google and the UK search market came to the
          | conclusion that a new entrant would require about 18
          | billion GBP in capital to become a credible alternative
          | search engine, in terms of size, quality, hardware, and
          | man-hours making it.
         | 
         | Remember Cuil? Had the size, the fanfare but unfortunately not
         | the quality.
        
         | AsyncAwait wrote:
         | I think that is why the idea would be to have the tier 1 sites
         | so you don't have to index as much.
        
           | dumbfounder wrote:
           | I said ISN'T a pile of garbage :)
        
         | nostromo wrote:
         | > If Apple actually does this, as they are rumored to, I bet
         | they dump $10b into it just for the initial version.
         | 
         | Google pays Apple more than that every year just to set Google
         | as the default search engine on iPhones.
         | 
         | In a way, Google is funding its future competitor.
        
          | lemax wrote:
          | OpenStreetMap is a nice analogy for what could work. Aside
          | from the open-source maintenance of the map, there's also
          | tons of corporate help in the background. Companies that
          | are delivering OSM as a service or relying on it for their
          | own services have an interest in making it better. Mapbox,
          | for example, apparently pays tons of people a salary to
          | contribute upstream to OpenStreetMap. If we can get an
          | Apple/Microsoft/other-players collab, maybe a viable
          | alternative can actually be built.
        
          | kilroy123 wrote:
          | I agree, and I have been hoping Apple builds a serious
          | competitor. I welcome any competition at this point. Let's
          | be real: not many people are using Bing. People _would_
          | actually use Apple search.
        
           | _underfl0w_ wrote:
           | > People would actually use apple search.
           | 
           | Of course they would - it would be set as the default search
           | on their iPhones with no clear-cut way to change it. You
           | know, "security". The users don't know what's best for them,
           | etc. as Apple seems to think.
        
           | ur-whale wrote:
           | >I welcome any competition at this point.
           | 
           | Microsoft tried and failed to build a competitor and it's not
           | like they have shallow pockets.
           | 
            | They grossly underestimated a number of aspects:
            | 
            | - The huge number of man-years invested in hand-tuning
            | Google's search quality stack, and what it would take to
            | replicate it.
            | 
            | - The infrastructure required to build a crawler /
            | indexer stack as good as Google's.
           | 
           | I think in 2020, the second problem is within reach of many
           | companies technically. It's mostly a matter of throwing
           | enough money at optimized infrastructure.
           | 
           | However, replicating the search quality stack is going to be
           | very hard, unless someone makes a huge breakthrough in
           | machine learning / language modeling / language understanding
           | at a thousandth of the cost it currently takes to run
           | something like GPT-3.
           | 
           | The most likely candidate to execute properly on that last
           | bit is - unfortunately - Google.
        
              | Krasnol wrote:
              | Sure, but Microsoft has tried and failed to build a
              | competitor to Google, not to DDG.
              | 
              | I don't see it as that hopeless. I feel it's kind of
              | like starting OpenStreetMap: it won't be perfect for a
              | long time, but there will be people who'd prefer it and
              | help out.
        
        | nynx wrote:
        | I'd love to participate in a project like this. Does anyone
        | know how to contact Drew DeVault?
        
         | enriquto wrote:
         | He's very responsive in the mailing lists of all his projects.
         | Just don't ask him a silly question, like blockchain support,
         | or shit like that.
        
        | flas9sd wrote:
        | Is somebody aware of a project where the end-user browser
        | acts as a crawler? It has already spent the energy to render
        | the content. Readability.js extracts the page content, does
        | some processing for keywords, hashes anchor links, signs the
        | result, and sends it off. Cache-Control response headers
        | indicate whether the page is public or private. Of course,
        | wherever it is sent to will have an electricity bill to pay
        | to index the submissions.
        
         | 1MachineElf wrote:
         | The idealist in me fantasizes this is possible with a browser-
         | based P2P zettelkasten.
        
         | jedimastert wrote:
         | That's an interesting point...I wouldn't trust the `Cache-
         | Control`, unfortunately, but a distributed indexing model might
         | be interesting...
         | 
         | I know there have been talks of set-ups that essentially take a
         | web archive of your entire history to search back through...
        
        | znpy wrote:
        | The tiering system is dumb, really really dumb.
        | 
        | It would basically make already-famous domains shine and dump
        | lesser-known domains into 20th-page oblivion.
        | 
        | It saddens me because Google search results actually used to
        | help you discover new sites and new people, but it's been
        | years since that changed.
        
       | cbsks wrote:
       | > Crucially, I would not have it crawling the entire web from the
       | outset. Instead, it should crawl a whitelist of domains, or "tier
       | 1" domains. These would be the limited mainly to authoritative or
       | high-quality sources for their respective specializations, and
       | would be weighed upwards in search results. Pages that these
       | sites link to would be crawled as well, and given tier 2 status,
       | recursively up to an arbitrary N tiers.
       | 
        | I like this idea. It would be interesting to see the domain of
        | every search result that I have ever clicked on and what the
        | distribution is like. I suspect there would be a long tail,
        | but I wonder how many domains actually need to be indexed to
        | cover 99% of my personal search needs. Does anyone have data
        | like this?
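        | 
        | For reference, a minimal sketch of the quoted tiering logic
        | (Python; fetch_links and the seed list are placeholders):
        | 
        |   from collections import deque
        | 
        |   # Example whitelist; a real one would be curated per topic.
        |   TIER1_SEEDS = ["https://docs.python.org/3/",
        |                  "https://en.wikipedia.org/"]
        |   MAX_TIER = 3  # the arbitrary N
        | 
        |   def crawl(fetch_links, seeds=TIER1_SEEDS, max_tier=MAX_TIER):
        |       """fetch_links(url) -> absolute URLs linked from the page."""
        |       tier_of = {}  # url -> best tier seen (1 ranks highest)
        |       queue = deque((url, 1) for url in seeds)
        |       while queue:
        |           url, tier = queue.popleft()
        |           if tier_of.get(url, max_tier + 1) <= tier:
        |               continue  # already seen at an equal/better tier
        |           tier_of[url] = tier
        |           if tier < max_tier:  # stop expanding at tier N
        |               for link in fetch_links(url):
        |                   queue.append((link, tier + 1))
        |       return tier_of
        | 
        | The resulting tier would then be used as a ranking weight, so
        | a tier-1 page outranks a tier-3 page for the same terms.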
        
       | RileyJames wrote:
        | I'm pro-privacy, but I don't have a problem with AdWords,
        | outside of Google's implementation.
        | 
        | If AdWords targeting were purely based on the search term, I
        | wouldn't mind.
        | 
        | The search engine has to generate revenue somehow, and the
        | revenue generated on "SaaS CRM" with a single click is likely
        | to be larger than any user's annual subscription ($10-100+ per
        | click).
       | 
       | I'm unclear on the ethical / privacy concerns of "AdWords" style
       | advertising.
        
         | ricardo81 wrote:
          | FWIW, it's no longer called AdWords, just Google Ads.
          | 
          | Agreed: keywords (and location; a lot of searches are for _X
          | near me_) for the most part offer a way of delivering
          | relevant ads.
          | 
          | Google are able to generate more income per search because
          | of their critical mass of searches and advertisers, as well
          | as having more data on searchers, based on search history,
          | to maximise that revenue per search.
        
         | blendergeek wrote:
         | Here is the problem (and it isn't privacy):
         | 
         | A search engine's job is to present you with the best possible
         | results for any given query.
         | 
          | An ad is either A) the best possible result or B) not the
          | best possible result. If the ad is the best possible result,
          | then the search engine must display it anyway in order to
          | fulfill its mission. If it is not the best possible result,
          | the search engine must violate its mission in order to
          | display it. To put it bluntly, advertising is paying to
          | decrease the quality of search results.
        
       | andreareina wrote:
       | DDG does operate their own crawler[1], though they also do still
       | rely on third parties[2].
       | 
       | [1] https://help.duckduckgo.com/duckduckgo-help-
       | pages/results/du...
       | 
       | [2] https://help.duckduckgo.com/duckduckgo-help-
       | pages/results/so...
        
         | Kiro wrote:
         | Their own crawler is only used to fetch things for the widgets,
         | not the search index.
        
         | mekster wrote:
          | The author didn't even DDG it to find this out?
        
           | Dahoon wrote:
            | Clearly neither did you, as he is correct. DDG's crawler
            | is not a crawler like Googlebot.
        
           | eqv wrote:
           | Drew has a longstanding history of ill-informed rants ([1]
           | [2]) about technology. He's also quite willing to lie about
           | the facts[3].
           | 
           | [1] https://news.ycombinator.com/item?id=24121609
           | 
           | [2] https://news.ycombinator.com/item?id=23966778
           | 
           | [3] https://news.ycombinator.com/item?id=24023998
        
             | dang wrote:
             | No personal attacks on HN, please.
             | 
             | Digging up past internet history as ammunition in an
             | argument isn't cool either.
        
             | Eeems wrote:
             | Just don't give him IRC ops and then get into a private
             | argument with him. https://www.omnimaga.org/news/omnomirc-
             | moved-to-new-server/m...
        
               | djsumdog wrote:
               | Wow. I'm honestly not surprised. That's ... that's pretty
               | shitty.
        
               | Eeems wrote:
               | Knowing a bit of his personal history I can kind of
               | understand why he acts the way he does, and has the
               | opinions he does. Doesn't excuse some of it, but at least
               | I kinda get why.
               | 
               | I just wish his name would stop coming up for me tied to
               | opinion pieces like this. I'd rather just see things
               | about how some project he's working on is doing great and
               | being widely adopted.
        
             | djsumdog wrote:
             | I don't like to criticize the author. We all have good
             | takes and bad takes and really for a single post, you
             | should address the argument. Digging up the past is part of
             | what's making the world worse.
             | 
              | That being said, I do see a valid reason for bringing up
              | his history of bad takes. I used to respect DeVault. He
              | banned me on the Fediverse because he disagreed with me
              | being against defunding the police and against critical
              | race theory.
             | 
              | I find some of his stuff interesting, and I agree with
              | more AGPL and more real open-source development. I'd
              | even say I'm jealous that he can actually fund himself
              | off of his FOSS projects and do what he loves.
             | 
             | But I do agree, he does have a lot of questionable takes.
              | He seems to love Go and hate Rust, hates threads for
              | some reason, and has a lot of RMS-style takes. Not all
              | of them are bad, and hardcore people can help you think.
             | 
             | As far as this post goes, I do think search is pretty
             | broken. I think a better solution is more specialized
              | search: have a web tool just for tech searches that
              | covers StackExchange sites, GitHub, blogs, forums, bug
              | trackers, and other things specialized to development.
             | 
              | Another idea would be an index that just covers blogs,
              | so you can look up any topic and see what people are
              | writing about it long-form in the current month. Add
              | features to easily see what people were saying 5 or 10
              | years ago too. There is a ton of specialized work there:
              | filtering blog spam, making sure you get topics from all
              | sides (including "banned" blogs), etc.
             | 
              | You used to have to go to Lycos, Yahoo, HotBot, and
              | Excite, and you'd get different results and find lots of
              | different helpful things. We need that back. It will
              | take some good, specialized tools to break people away
              | from Google search.
        
       | baggachipz wrote:
       | > The main problem is: who's going to pay for it? Advertisements
       | or paid results are not going to fly -- conflict of interest.
       | Private, paid access to search APIs or index internals is one
       | opportunity, but it's kind of shit and I think that preferring
       | open data access and open APIs would be exceptionally valuable
       | for the community.
       | 
        | There's no reason you couldn't allow the first _N_ API hits to
        | be free, then charge for higher tiers of access.
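        | 
        | A sketch of that metering (the quota number and tier names are
        | arbitrary):
        | 
        |   from collections import defaultdict
        | 
        |   FREE_HITS_PER_MONTH = 1000  # arbitrary free quota
        | 
        |   usage = defaultdict(int)  # api_key -> calls this period
        | 
        |   def authorize(api_key):
        |       # First N hits are free; everything after is metered.
        |       usage[api_key] += 1
        |       return ("free" if usage[api_key] <= FREE_HITS_PER_MONTH
        |               else "billable")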
        
       | wenbin wrote:
        | It's almost impossible to build a decent web search engine
        | from scratch today (i.e., build your own index, fight SEO
        | spam, tweak search result relevance...). The web is already so
        | big and so complex; otherwise Google wouldn't need to hire so
        | many people to work on search alone.
        | 
        | If you didn't start at the very early stage of the tiny web
        | (e.g., Google in 1996 as a research project) and grow with the
        | web over the past 20+ years, and you don't have super deep
        | pockets (e.g., Microsoft Bing in the mid-2000s), then it's
        | almost impossible to build a decent web search engine within a
        | few years.
        | 
        | It's possible to build vertical search engines for far
        | smaller-scale, far less complex, far less lucrative niches
        | that Google/Microsoft have little interest in today (e.g.,
        | recipes [2], podcasts [3], GIFs [4]...).
        | 
        | It's also possible to come up with a different discovery
        | mechanism for the web (or a small portion of it), other than a
        | traditional complete web search engine. Essentially, you don't
        | cross the moat to attack a huge castle (e.g., Google).
        | Instead, you bypass the castle [1].
       | 
       | [1]
       | https://twitter.com/benedictevans/status/1038538688232226817...
       | 
       | [2] https://www.yummly.com/
       | 
       | [3] https://www.listennotes.com/
       | 
       | [4] https://giphy.com/
        
         | mongol wrote:
          | You are probably right. But still... the suggested approach
          | makes a kind of sense: a curated list of trusted sites as a
          | seed, not the entire web. This can be as small or as large
          | as is useful. It does not need to cover the entire web. How
          | big is the "useful" blogosphere, for example? Couldn't an
          | open-source project that gathers momentum somehow create a
          | curated list of, let's say, 10,000 trusted blogs and index
          | those? Index all mailing lists that can be found, index all
          | of reddit, index Hacker News, index Wikipedia, the 100 most
          | well-regarded news sites in each country, etc. Wouldn't such
          | an index be a good start, and better than Google in many
          | cases?
        
       | jedimastert wrote:
       | > Instead, it should crawl a whitelist of domains, or "tier 1"
       | domains. These would be the limited mainly to authoritative or
       | high-quality sources for their respective specializations, and
       | would be weighed upwards in search results.
       | 
        | Not a big fan of this conclusion. Who chooses the whitelist,
        | and why should I trust them? Is it democratically chosen?
        | Just because a site is popular very clearly does not mean it's
        | trustworthy. Does it get vetted? By whom? Also, whose
        | definition of trustworthy are we trusting?
       | 
       | If I want my blog to show up on your search engine, do I have to
       | get it linked by one of those sites, or can I register with you?
       | Will I be tier 1, or
        
         | ecommerceguy wrote:
         | So basically lock out any new site, regardless of content.
         | Great idea /s
        
         | mekster wrote:
          | But email is already like this. It's the inbox providers who
          | decide which domains are legit, and new domains start from a
          | negative rating. Treating the web the same way doesn't sound
          | too unnatural.
          | 
          | It would be bad if those in that position profited by
          | "authorizing" who is good, though.
        
           | buzzerbetrayed wrote:
            | I'm not sure why email should be an example of the correct
            | way to do it. And with email I can check my spam folder
            | and see exactly what has been rejected. So unless the
            | search engine includes a list of sites that weren't deemed
            | worthy with every search (which probably wouldn't happen),
            | I think this solution has some pretty big flaws. It should
            | be noted that the current system also has these flaws, as
            | Google and DDG can show you whatever they want based on
            | whatever criteria they see fit.
        
             | 6510 wrote:
              | I like this idea! Have the usual official results...
              | then have an option to go to level 2, level 3, level 4,
              | etc. (level 1 is not included in level 2).
              | 
              | You could have really biased, technically terrible
              | filters that, for example, put a site on level 4 because
              | it is too new, too small, or any number of other dumb
              | SEO-nonsense reasons. (The topic was not in the URL!
              | There was a poor choice of text color!)
              | 
              | I think Wikipedia has a lot of research to offer on what
              | to do, but also on what not to do. Try getting tier-2
              | edits on a popular article: it would take days to sort
              | out the edits and construct a tier-2 article by hand.
        
         | Shared404 wrote:
         | > Who chooses the white list, and why should I trust them? Is
         | it democratically chosen?
         | 
          | You could have user-compiled lists of sites to show in
          | search results.
         | 
         | Let the users pick the lists they want to see, and communities
         | can create and distribute lists within themselves.
        
           | Jtsummers wrote:
           | That's what directory sites offered once upon a time. It was
           | a pretty good way to discover new content back then. I spent
           | a lot of time on dmoz when I wanted to find information about
           | various topics.
        
           | RileyJames wrote:
           | Great idea, but why build a search engine at all in this
           | case? You can use DDG + your filter and see only the results
           | from your whitelist.
           | 
           | Could easily be implemented for any current search engine.
           | 
            | To a large extent, this is what you already do when you
            | view a page of search results: filter them based on your
            | understanding of which sites/results hold value.
        
             | Arnavion wrote:
             | >Great idea, but why build a search engine at all in this
             | case? You can use DDG + your filter and see only the
             | results from your whitelist.
             | 
             | If I want to search for "X" within N sites, where N = 20,
             | how do I make a DDG filter for that?
        
         | vorpalhex wrote:
         | I wonder if the correct answer is a blacklist for known spammy
         | sites and the ability to turn the list off.
         | 
          | If I never saw a Pinterest link, or one of those sites that
          | just republishes Stack Overflow answers unedited, I'd be
          | fine with that.
         | 
         | Of course, these systems always get abused and some political
         | or news site will end up on it.
        
         | bscphil wrote:
         | > If I want my blog to show up on your search engine, do I have
         | to get it linked by one of those sites, or can I register with
         | you? Will I be tier 1, or
         | 
         | I think what I'd say in defense is that we've misunderstood
         | what search engines are useful for. They're really bad at
         | helping us discover new things. Your blog might be awesome, but
         | it's not going to be easy for a search engine to tell that it's
         | awesome. It's going to have to compete with other blogs that
         | also want views, some of whom are going to be better than yours
         | at SEO, and so on.
         | 
          | What a search engine _might_ be able to tell is that it's
          | _useful_. Because what search engines are at least
          | potentially good at is answering questions. You do that by
          | having a list of known good sites to answer specific types
          | of questions, and looking at the sites they link to. It's
          | when you try to do both (index everything on the web and
          | provide accurate answers to specific questions) that you end
          | up failing to do either.
         | For example this is the #2 result for "python f strings" on
         | DDG[1]. It's total garbage, and, quoting the blog, "we can do
         | better". (This result is also on page 1 for the same query on
         | Google.)
         | 
         | What I believe ddevault is suggesting is that we make a search
         | engine that does the only thing search engines are really good
         | at, answering questions. You throw away the idea of indexing
         | everything on the web, and therefore the possibility of
         | "discovery". What that means is that in 2020 you need some
         | other mechanism for discovering new sites, bloggers, and so on.
         | Fortunately we do have some alternatives in that space.
         | 
         | To be clear, I don't know if I 100% buy this argument, but I
         | think it's the general idea behind what's being suggested in
         | this blog post.
         | 
         | [1] https://careerkarma.com/blog/python-f-string/
        
       | suff wrote:
       | No you can't.
        
       | bscphil wrote:
       | > they've demonstrated gross incompetence in privacy
       | 
       | Not sure I buy the example that is given here.
       | 
       | 1. It's an issue in their browser app, not their search service.
       | 
       | 2. It's not completely indefensible: it allows fetching favicons
       | (potentially) much faster, since they're cached, and they promise
       | that the favicon service is 100% anonymous anyway.
       | 
       | 3. They responded to user feedback and switched to fetching
       | favicons locally, so this is no longer an issue.
       | https://github.com/duckduckgo/Android/issues/527#issuecommen...
       | 
       | > The search results suck! The authoritative sources for anything
       | I want to find are almost always buried beneath 2-5 results from
       | content scrapers and blogspam. This is also true of other search
       | engines like Google.
       | 
       | This part is kinda funny because "DuckDuckGo sucks, it's just as
       | bad as Google" is ... not the sort of complaint you normally hear
       | about an alternative search engine, nor does it really connect
       | with any of the normal reasons people consider alternative search
       | engines.
       | 
        | That said, I _agree_ with this point. Both DDG and Google seem
        | to be losing the spam war, from what I can tell. And the
        | diagnosis is a good one too: the problem with modern search
        | engines is that they're not opinionated / biased _enough_!
       | 
       | > Crucially, I would not have it crawling the entire web from the
       | outset. Instead, it should crawl a whitelist of domains, or "tier
       | 1" domains. These would be the limited mainly to authoritative or
       | high-quality sources for their respective specializations, and
       | would be weighed upwards in search results. Pages that these
       | sites link to would be crawled as well, and given tier 2 status,
       | recursively up to an arbitrary N tiers.
       | 
       | This is, obviously, very different from the modern search engine
       | paradigm where domains are treated neutrally at the outset, and
       | then they "learn" weights from how often they get linked and so
       | on. (I'm not sure whether it's possible to make these opinionated
        | decisions in an open-source way, but it seems obviously the
        | right way to go for higher-quality results.) Some kind of logic
       | like "For Python programming queries, docs.python.org and then
       | StackExchange are the tier 1 sources" seems to be the kind of
       | hard-coded information that would vastly improve my experience
       | trying to look things up on DuckDuckGo.
        
         | dwd wrote:
         | I thought DDG already crawled their own curated list of sites?
         | 
          | There is a DuckDuckGoBot, and I think it was in an interview
          | or podcast a while back that Gabriel mentioned they use it
          | to fill out gaps in the Bing API data, to provide the
          | instant answers and favicons. Their preference for the
          | instant answers was authoritative references such as
          | docs.python.org. This would have been a while back, though.
        
           | bscphil wrote:
            | If memory serves, those crawls are _only_ used for Instant
            | Answers. My interpretation of the blog post is that it
            | would be nice to have a search engine that's sort of a
            | hybrid approach based on Instant Answers for the _whole_
            | web.
        
         | jedberg wrote:
         | I think Google sort of takes into account "votes", in that they
         | look at the last thing you clicked on from that search, and
         | consider that the "right answer", which they then feed back
         | into their results.
         | 
         | As such, they effectively have a list of "tier 1" domains.
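          | 
          | If so, a naive version of that feedback loop (pure
          | speculation about Google's internals) might look like:
          | 
          |   from collections import defaultdict
          |   from urllib.parse import urlparse
          | 
          |   votes = defaultdict(int)  # (query, domain) -> click count
          | 
          |   def record_session_end(query, last_clicked_url):
          |       # Treat the final click of a search session as the
          |       # "right answer" for that query.
          |       domain = urlparse(last_clicked_url).netloc
          |       votes[(query, domain)] += 1
          | 
          |   def click_boost(query, url, weight=0.1):
          |       # Folded into the ranking score on later searches.
          |       return weight * votes[(query, urlparse(url).netloc)]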
        
           | gregmac wrote:
           | I kind of hope they don't, or there is more to it than just
           | that -- for example, a user coming back and clicking on
           | something else counts as a downvote for the first item.
           | 
           | Any system that ranks things purely based on votes or view
           | counts can have a feedback loop that can amplify "bad"
           | results that happen to get near the top for whatever reason.
            | For web search, this would encourage results that _look_
            | right from the results page, even if they're not actually
            | a good result for what the user is looking for.
           | 
           | An example of this would be when you're trying to find an
           | answer to a specific question like "How do I do X when Y?".
           | The best result I'd hope for is a page that answers the
           | question (or a close enough question to be applicable), while
           | the promising-looking-but-actually-bad result is a page where
           | someone asks the exact same question but there are no
           | answers.
        
             | eyelidlessness wrote:
             | > Any system that ranks things purely based on votes or
             | view counts can have a feedback loop that can amplify "bad"
             | results that happen to get near the top for whatever
             | reason.
             | 
             | I think this is a place where Google has pretty obvious
             | algorithm problems. For example, I'm building a personal
             | website for the first time in many years, and obviously
             | that means I'm doing a fair bit of looking up new or
             | forgotten webdev stuffs. It's widely known that W3Schools
             | is low quality/high clickbait/has a long history of gaming
             | the SEO system. They've been penalized by Google's
             | algorithm rule changes but continue to get the top result
             | (or even the top 3-5 results!), _even with Google having a
             | profile of my browsing habits, and knowing that I
             | intentionally spend longer on these searches to pick a
             | result from MDN or whatever_. It seems pretty likely that
             | W3Schools is just riding click rate to stay at the top. And
             | it's pathological.
        
               | beckingz wrote:
                | Is W3Schools that bad?
                | 
                | For some languages, W3Schools is as good a reference
                | as, or better than, the official documentation.
                | 
                | And they're definitely better than most SEO spam.
        
               | eyelidlessness wrote:
               | W3Schools is _awful_. The official documentation is hard
               | to navigate, but W3Schools is notorious for misleading
               | and poor quality examples and advice. MDN, caniuse, CSS
               | Tricks and such are much better resources.
        
           | bscphil wrote:
           | I don't know if DDG does that exactly, but their help page
           | does say this:
           | 
           | > Second, we measure engagement of specific events on the
           | page (e.g. when a misspelling message is displayed, and when
           | it is clicked). This allows us to run experiments where we
           | can test different misspelling messages and use CTR (click
           | through rate) to determine the message's efficacy. If you are
           | looking at network requests, these are the ones going to the
           | one-pixel image at improving.duckduckgo.com. These requests
           | are anonymous and the information is used only by us to
           | improve our products.
           | 
            | The Firefox network logger does show requests to this
            | domain when I click on a link in the search results,
            | before the page navigates away. This suggests to me that
            | they might be logging this information. _To be clear_,
            | this is speculation on my part, because I haven't examined
            | the URL parameters in detail.
           | 
           | In any case, I'm not sure how much this manages to improve
           | the results, since usually I _can_ get help with my Python
           | query (for example) using whatever crappy blog post is first
           | in the results, but results from the official docs or
           | StackExchange are still probably better and should be
           | prioritized.
        
         | Silhouette wrote:
         | _Some kind of logic like "For Python programming queries,
         | docs.python.org and then StackExchange are the tier 1 sources"
         | seems to be the kind of hard-coded information that would
         | vastly improve my experience trying to look things up on
         | DuckDuckGo._
         | 
         | The problem with this strategy is always going to be that
         | different users will regard different sources as most
         | desirable.
         | 
         | For example, it's enormously frustrating that searching for
         | almost anything Python-related on DDG seems to return lots of
         | random blog posts but hardly ever shows the official Python
         | docs near the top. I don't personally think the official Python
         | docs are ideally presented, but they're almost certainly more
         | useful to me at that time than some random blog that happens to
         | mention an API call I'm looking up.
         | 
         | On the other hand, I would gladly have an option in a search
         | engine to hide the entire Stack Exchange network by default.
         | The signal/noise ratio has been so bad for a long time that I
         | would prefer to remove them from my search experience entirely
         | rather than prioritise them. YMMV, of course. (Which is my
         | point.)
        
         | judge2020 wrote:
         | > and they promise that the favicon service is 100% anonymous
         | anyway.
         | 
         | With that logic, Apple's OCSP server is also 100% anonymous
         | (which I legitimately can believe it is).
        
         | brundolf wrote:
         | Agreed. I think the key point here is that the web is a
         | radically different place than it was in 1998 (when Google
         | launched and established the paradigm as we know it). Back then
          | the quality-to-spam ratio was probably much higher, and there
          | were many more self-hosted sources than platforms (for better
          | or worse). The naive scraping approach was both more crucial
         | and more effective. And in the decades since, it's been a
         | constant war of attrition to make that model keep working under
         | more and more adversarial circumstances.
         | 
         | So I think that stepping back and re-thinking what a search
         | engine fundamentally is, is a great starting point for
         | disruption.
         | 
         | Additionally, something the OP didn't mention is that ML
         | technologies have progressed dramatically since 1998, and that
         | much of that progress has been done in the open. I can't
         | imagine that not being a force-multiplier for any upstart in
         | this domain.
        
         | jbay808 wrote:
         | Maybe instead of hard-coding these preferences in the search
         | engine, or having it try to guess for you based on your search
         | history, you can opt-in to download and apply such lists of
         | ranking modifiers to your user profile. Those lists would be
          | maintained by third parties and users, just like e.g. adblock
         | blacklists and whitelists. For example, Python devs might
         | maintain a list of search terms and associated urls that get
         | boosted, including stack exchange and their own docs. "Learn
         | python" tutorials would recommend you set up your search
         | preferences for efficient python work, just like they recommend
         | you set up the rest of your workflow. Japanese python devs
         | might have their own list that boosts the official python docs
         | and also whatever the popular local equivalent of stackexchange
         | is in Japan, which gets recommended by the Japanese tutorials.
         | People really into 3D printing can compile their own list for
         | 3D printing hobbyists. You can apply and remove any number of
         | these to your profile at a time.
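          | 
          | A sketch of applying such lists at ranking time (the list
          | format and all numbers are invented):
          | 
          |   from urllib.parse import urlparse
          | 
          |   # A community-maintained "python-dev" list, like an
          |   # adblock subscription but for ranking:
          |   PYTHON_DEV = {"docs.python.org": 3.0,
          |                 "stackoverflow.com": 2.0,
          |                 "w3schools.com": 0.25}
          | 
          |   def rerank(results, lists):
          |       """results: [(url, base_score)]; lists: iterable of
          |       domain -> multiplier dicts the user opted into."""
          |       def score(item):
          |           url, base = item
          |           domain = urlparse(url).netloc
          |           for weights in lists:
          |               base *= weights.get(domain, 1.0)
          |           return base
          |       return sorted(results, key=score, reverse=True)
          | 
          |   ranked = rerank([("https://w3schools.com/x", 0.9),
          |                    ("https://docs.python.org/3/y", 0.7)],
          |                   [PYTHON_DEV])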
        
           | visarga wrote:
            | I've had a similar idea; what you're proposing is
            | essentially ranking/filtering customisation. The internet
           | is a big scene, and on this scene we have companies and their
           | products, political parties, ad agencies and regular users.
           | Everyone is fighting for attention, clicks. Google has
           | control over a ranking and filtering system that covers most
           | searches on the internet. FB and Twitter hold another
           | ranking/filtering sweet spot for social networks.
           | 
           | The problem is that we have no say in ranking and filtering.
           | I think it should be customisable both on a personal and
           | community level. We need a way to filter out the crap and
           | surface the good parts on all these sites.
        
           | hobs wrote:
           | Back in the day you'd have webrings - groups of sites that
           | linked each other in clear association.
        
           | mech422 wrote:
           | This would be awesome! I'm so tired of google ignoring what I
           | tell it, and trying to 'guess' what I want.
           | 
            | I'd also love to be able to specify that I want results
            | from the last year without having to set it every time.
        
           | wstrange wrote:
            | This doesn't really seem immune to spam.
            | 
            | I signed up for Goodreads (a book review site), and I get
            | tons of spam.
            | 
            | This is a hard problem.
        
             | vorpalhex wrote:
             | Like any other list, it depends on who maintains it. You
             | basically want to find the correct BDFL to maintain a list,
             | much like many awesome-* repositories operate.
        
           | AsyncAwait wrote:
           | This is actually a great idea and something I can see working
           | rather well.
        
           | nolanhergert89 wrote:
           | As a hack until then, I've found Google's Custom Search
           | Engine feature to work well enough for my use cases.
           | https://programmablesearchengine.google.com/cse/all
        
           | bscphil wrote:
           | I like this idea! I think the biggest difficulty with it -
           | which is also probably _the_ most important reason that
           | engines like Google and DDG are currently struggling to
           | return good results - is that the search space is just so
           | enormously large now. The advantage of the suggestion in the
           | blog post is that you trim down the possible results to a
            | handful of "known good" sources.
           | 
           | As I understand it, you'd want to continue to search the
           | whole "unbiased" web, then apply different filters / weights
           | on every search. I really do like the idea, but I imagine
           | we'd be talking about an increase in compute requirements of
           | several orders of magnitude for each search as a result.
           | 
           | Maybe something like this could be made a paid feature, with
           | a certain set of reasonable filters / weights made the
           | default.
        
             | retsibsi wrote:
             | This may be a very dumb question, but could the filtering
             | be done client-side? As in, DDG's servers do their thing as
             | normal and return the results, then code is executed on
             | your machine to weight/prune the results according to your
             | preferences.
             | 
             | Maybe this would require too much data to be sent to the
             | client, compared to the usual case where they only need a
             | page of results at a time. If so, would a compromise be
             | viable, whereby the client receives the top X results and
             | filters those?
        
             | Spooky23 wrote:
             | I disagree; the search space is shrinking as more and more
             | stuff moves to walled gardens like Facebook and Twitter.
        
           | 867-5309 wrote:
           | > to guess for you based on your search history, you can opt-
           | in to download and apply such lists of ranking modifiers to
           | your user profile
           | 
           | pro-privacy does not sit well with terms such as search
           | history and user profile
        
             | jbay808 wrote:
             | You might have misread. My suggestion is as an
             | _alternative_ to history based ranking.
        
           | brundolf wrote:
           | This is a great idea. It's like a modern reboot of the old
           | concept of curated "link lists", maintained by everyone from
           | bloggers to Yahoo. Doing it at a meta level for search-engine
           | domains is a really cool thought.
        
       | meerita wrote:
        | I'm an old-time user of DDG. I agree with the 3 points. I feel
        | like I'm driving a shitty car in a world where everyone drives
        | a Ferrari.
        | 
        | The one that strikes me most is the results. I feel like DDG
        | doesn't search the entire internet; there are zillions of
        | pages out there waiting to be indexed, even old websites, but
        | the results I get are so poor.
        | 
        | Even with this handicap, I still use it over INSERT YOUR AD
        | HERE Google Search.
        
       | pmoriarty wrote:
       | My main problem with DDG is that there's no way to be sure they
       | actually respect their users' privacy as they claim to.
       | 
        | Ideally, services like theirs would be continuously audited by
        | respectable, trusted organizations like the EFF... multiple
        | such organizations, even.
        | 
        | Then I'd have at least some reason to believe their claims of
        | not collecting data about me.
        | 
        | As it stands, I only have their word for it... which in this
        | day and age is pretty worthless.
       | 
       | That said, I'd still _much_ rather use DDG, who at least pay lip
       | service to privacy, than sites like Google or Facebook, who are
       | openly contemptuous of it.
       | 
       | At the very least it sends a message to these organizations that
       | privacy is still valued, and they'd lose out by not trying to
       | accommodate the privacy needs of their users to some extent.
        
         | spinach wrote:
          | Facebook and Google are huge, global companies whose main
          | product is free, and yet they aren't charities. The only way
          | to be mega-rich and offer something free is to be shady and
          | manipulative with users' data. Exploiting privacy is their
          | business model. They aren't gonna respect it.
          | 
          | Being super financially successful off free products and
          | services is not a recipe for an honest, citizen-respecting
          | company.
        
           | Dahoon wrote:
            | DDG search costs the same as Google search.
        
         | dangus wrote:
         | I don't even care about the privacy. (Well, I do, but in this
         | context I have no reasonable way to ensure it)
         | 
         | What I do care about is trust-building and monopolistic
         | practices.
         | 
         | That, to me, is a great reason to use DDG instead of Google or
         | even Bing.
        
           | cpeterso wrote:
           | I also prefer DDG's user interface over Google's. And DDG's
           | !bang search shortcuts.
           | 
           | DDG has been my default search engine for years and its
           | results are good enough for me 95% of the time. I only need
           | to use Google as a fallback when searching for niche
           | technical information or "needles in haystacks".
        
             | jschwartzi wrote:
             | Even then, the Google results are usually terrible. I
             | haven't used Google as a fallback in about a year because
             | every time I tried it they couldn't find what I was looking
             | for either. Or they did something atrocious like changing
             | my search terms for me.
        
       | lambda_obrien wrote:
        | Why couldn't several coordinating specialized search engines
        | share their data via something like "charge the downloader"
        | (requester-pays) S3 buckets? Then an org like StackExchange
        | could provide indexed data from their site, plus the
        | algorithms to search that data most efficiently; GitHub could
        | do the same for their specific zone of specialty; Amazon; etc.
       | 
       | Then anyone who wants to use the data can either copy it to their
       | own S3 buckets to pay just once, or can use it with some sort of
       | pay-as-you-go method. Anyone who runs a search engine can use the
       | algorithms as a guide for the specific searches they are
       | interested in for their site, or can just make their own.
       | 
       | You could trust the other indexers not to give you bad data,
       | because you'd have some sort of legal agreement and technical
       | standards that would ensure that they couldn't/wouldn't "poison
       | the well" somehow with the data they provide. Further, if a bad
       | actor was providing faulty data, the other actors would notice
       | and kick them out of the group or just stop using their data.
       | 
        | It would have to be fully open source (I agree with the other
        | parts of Drew's essay here), but I think we _could_ share the
        | index/data somehow if we got together and tried to think about
        | it. We just need a standard for how we share the data.
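        | 
        | The "charge the downloader" part already exists as S3
        | requester-pays buckets; consuming one looks roughly like this
        | (bucket and key names are invented):
        | 
        |   import boto3  # pip install boto3
        | 
        |   s3 = boto3.client("s3")
        | 
        |   # With a requester-pays bucket, the downloader's AWS account
        |   # is billed for the transfer; the publisher pays storage.
        |   resp = s3.get_object(
        |       Bucket="stackexchange-search-index",      # hypothetical
        |       Key="shards/2020-11/part-0001.jsonl.gz",  # hypothetical
        |       RequestPayer="requester",
        |   )
        |   shard_bytes = resp["Body"].read()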
        
         | cptskippy wrote:
         | So you're proposing Snowflake for search?
        
         | ricardo81 wrote:
         | There's Common Crawl for the crawling aspect, about 3.2 billion
         | pages last time I looked. One of the issues with that kind of
         | detachment of jobs is crawl data freshness.
        
       | moocowtruck wrote:
        | Let me guess: and "better" is drewdrewdevault.
        
       | api wrote:
       | A major challenge with search in 2020 is that it's adversarial.
       | Any open source search engine that gets popular is going to be
       | analyzed by black hat SEO people and explicitly targeted by spam
       | networks. Competently indexing and searching content is really
       | only a small part of the problem now, with the adversarial "red
       | queen's race" against black hat SEO and spam being the more
       | significant issue.
        
       | suff wrote:
       | I take that back. Aside from loads of money and boundless dev
       | time, you've got it all figured out :-)
        
       | messo wrote:
       | > If SourceHut eventually grows in revenue -- at least 5-10x its
       | present revenue -- I intend to sponsor this as a public benefit
       | project, with no plans for generating revenue.
       | 
       | I like this attitude. Makes me happy to be a paying member of
       | SourceHut.
        
       ___________________________________________________________________
       (page generated 2020-11-17 23:01 UTC)