[HN Gopher] We can do better than DuckDuckGo ___________________________________________________________________ We can do better than DuckDuckGo Author : als0 Score : 292 points Date : 2020-11-17 18:16 UTC (4 hours ago) (HTM) web link (drewdevault.com) (TXT) w3m dump (drewdevault.com) | x87678r wrote: | I don't want privacy, I want competition. Which is why I use Bing | and honestly it works virtually all the time. | | Maybe there could be some pure anonymous ad-free search engine | but it's more realistic to have an alternative commercial one. I | really don't care that people are looking at my searches for how | to resize an array or cheap hotels in Florida. | ablanco wrote: | This "nothing to hide" argument has been analyzed pretty | thoroughly, and in my opinion, it's really dangerous. | https://spreadprivacy.com/three-reasons-why-the-nothing-to-h... | staunch wrote: | We have to make sure to include highly relevant advertisements in | the search results; at least 50% of the results should be ads. So | there needs to be a marketplace for buying/selling ads. | | We can't have a search engine that is only useful for finding the | most relevant web pages for a given query. People love highly | relevant advertisements in their search results. | TheGrassyKnoll wrote: | I actually heard a DDG ad on the radio in the Los Angeles area | (KNX 1070 24 hour news). Still love the bangs. | Siira wrote: | Searx is a partially viable FOSS meta-search engine. | todd3834 wrote: | Aren't the secrets of the algorithm what prevent people from | gaming the results? While I love the idea of search becoming | fully open source, I'm skeptical it could be done. I hope I'm | wrong, and I'd love to dedicate time to an open source project | with this goal if anyone presents a convincing plan. | moonchild wrote: | Findx[1] was an attempt to make an open-source search engine.
| Today it's just another Bing wrapper, but their code[2] is still | available, waiting to be used as a starting point for another | project. | | 1. https://www.findx.com/ | | 2. https://github.com/privacore/open-source-search-engine | abalaji wrote: | This is why search is hard: 15% of Google searches are new each | day. [1] And, with over 1.7 billion websites, [2] it would take | a gargantuan open source effort to put something together like | this. | | Not to mention the cost; I'm not sure something like this could be | sustained with a Wikipedia-esque "please donate $15" fundraising | model. | | [1] https://searchengineland.com/google-reaffirms-15-searches- | ne... | | [2] https://www.weforum.org/agenda/2019/09/chart-of-the-day- | how-... | ecommerceguy wrote: | I'm surprised no one has rented Ahrefs' database, whipped up an | algorithm and called it a search engine. Besides Google and | Microsoft, who has a bigger snapshot of the entire web (NSA not | included)? Majestic maybe? | claytoneast wrote: | I wonder if you could start small on something like this. Build a | proof of concept, a search engine for programmers that indexes | only programming sites/material. See if you can technically do | it, & if you can figure out governance mechanisms for the | project. Sort of like Amazon starting with just selling books. | mcqueenjordan wrote: | I've recently been /tinkering/ with exactly such an idea! In my | case, it's even more specific and scoped: a search engine limited | to allow-listed domains: software engineering/tech/product | blogs that I trust. | | https://github.com/jmqd/folklore.dev | | It's not even really at the POC stage yet, but I hope to host it | with a simple web frontend sometime soon. Primarily, this is just | for myself... I just want a good way to search the sources that I | myself trust.
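The allow-listed approach described in the comments above can be sketched in a few lines. This is a hypothetical illustration (the domains, pages, and term-frequency ranking here are invented for the example), not a description of how folklore.dev actually works:

```python
# Sketch of an allow-listed search engine: index only pages from
# trusted domains, then answer queries from an inverted index.
# All domains and page contents below are illustrative stand-ins.
from collections import Counter, defaultdict
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"blog.example.dev", "notes.example.org"}  # curated list

def build_index(pages):
    """pages: iterable of (url, text). Returns term -> {url: count}."""
    index = defaultdict(Counter)
    for url, text in pages:
        if urlparse(url).netloc not in ALLOWED_DOMAINS:
            continue  # skip anything outside the allow-list
        for term in text.lower().split():
            index[term][url] += 1
    return index

def search(index, query):
    """Rank allow-listed URLs by total query-term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    return [url for url, _ in scores.most_common()]

pages = [
    ("https://blog.example.dev/p1", "rust borrow checker explained"),
    ("https://notes.example.org/a", "profiling rust allocations"),
    ("https://spam.example.com/x", "rust rust rust buy now"),  # not allowed
]
index = build_index(pages)
print(search(index, "rust profiling"))
# → ['https://notes.example.org/a', 'https://blog.example.dev/p1']
```

Because the allow-list is applied at index time, SEO spam from unlisted domains never enters the index at all, which is the main appeal of the idea.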
| easymovet wrote: | Crawlers are a top-down approach; a distributed list that people | pay digital money to be listed in would both incentivize nodes to | stay online and transform Sybil attacks into paid advertising. | rjurney wrote: | Check out the serious difficulties the Common Crawl had with | crawling 1% of the public internet on donated money and then get | back to me with a plan. This is really, really hard to do for | free. Maybe talk to Gates :) | Analemma_ wrote: | I spent seven years working at Bing, and I can tell you that this | guy is massively, hugely underestimating the difficulty of this | problem. His repeated "it's easy! You just have to..." | suggestions are absurd. This is typical HN content where someone | with no domain expertise swaggers in and assumes everyone in the | space must be idiots, and that only he can save the day. | | Trust me, there is _not_ a ton of potential "just sitting on the | floor" in web search. | ablanco wrote: | Given your experience, what's your opinion about DDG result | quality? | 6510 wrote: | While I agree with its lack of organization, I don't think YaCy | being intolerably slow is necessarily an argument. If you are | looking for a complete set of pages on a specific topic, time is | sort of irrelevant. Google, for example, has alerts for new | results. That these pages are not available sooner (before | publication) is not intolerable. You can also throw hardware at | YaCy and adjust the settings, which improves it a lot. The | challenge with a distributed approach is sorting the results. | Other crawlers have the same problem, but in a distributed system | it is even harder. | | Running an instance for websites related to your occupation or | hobby, YaCy is quite wonderful. You don't want Google removing a | bunch of pages that might cover exactly the sub-topic you are | looking for. Of course, the smaller the number of pages in your | niche, the better it works.
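The result-sorting problem raised above (scores from different nodes in a distributed engine are not directly comparable) can be illustrated with a toy merge step. This is a minimal sketch under invented data, using per-node min-max normalization before merging; real distributed engines use far more sophisticated score calibration:

```python
# Toy illustration of merging ranked results from several search
# nodes: normalize each node's scores to [0, 1] so they become
# roughly comparable, then take a global top-k. Data is invented.
import heapq

def normalize(results):
    """results: list of (score, url) from one node, best first."""
    if not results:
        return []
    scores = [s for s, _ in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for one-score nodes
    return [((s - lo) / span, url) for s, url in results]

def merge_nodes(node_results, k=3):
    """Merge per-node rankings into a single top-k URL list."""
    merged = []
    for results in node_results:
        merged.extend(normalize(results))
    return [url for _, url in heapq.nlargest(k, merged)]

node_a = [(9.0, "a.example/1"), (3.0, "a.example/2")]  # raw scores 0-10
node_b = [(0.9, "b.example/1"), (0.1, "b.example/2")]  # raw scores 0-1
print(merge_nodes([node_a, node_b], k=2))
```

Without the normalization step, node_a's results would always dominate simply because its scoring scale is larger, which is one concrete form of the sorting difficulty the comment describes.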
| neurobashing wrote: | Am I the only person who just doesn't have problems with DDG | search results? | | What am I doing wrong (or right), here? I put a thing in and find | it. I just don't use Google any more. | | Genuinely curious why it's working for me and such garbage for | everyone else. | djsumdog wrote: | I'd say about 50% of the time I'm good with DDG. About 1/3 of the | time I add !g, usually for weird error messages and tech stuff. | | Honestly we shouldn't be using Google for everything. Why not | just search StackExchange or GitHub issues directly for known | bugs? If you need a movie, !imdb or !rt forwards you | exactly where you really want to search. | | If DDG or Google also included independent small blogs in | movie results, I could see the value in that. I'd prefer | someone's review on their own site or video channel, but they | don't. We've kinda lost that part of the Internet. | pizza234 wrote: | I've tried DDG for a while, around a couple of years ago, and I | had lower-quality results particularly for technical subjects | (which are the vast majority of my searches). I will give DDG | another shot, though. | keithnz wrote: | For generic stuff DDG is mostly OK. But for local results, even | though it has a switch for them, it REALLY REALLY | REALLY sucks and often doesn't get any of the expected | places anywhere in the first few pages for New Zealand, which | makes it somewhat useless. | dybber wrote: | I'm mostly getting Norwegian results when searching for Danish | subjects from a Danish IP address. It also seems it just hasn't | indexed as many websites as Google. | proactivesvcs wrote: | I sometimes come across inappropriate results - for example I | search for a hex error code and the results are for other | numbers - and sometimes the adverts are misleading, but neither | is prevalent enough to harm the experience in | general.
| | I always send feedback when I come across incorrect results and | also try to when I get a really easy find. | | I have not had to resort to any other search engine for at | least five years. | Moru wrote: | I've also been using DDG exclusively for many years. I usually | find what I need in the first couple of results or in the box on | the right, which usually goes directly to the authoritative source | anyway. | jlarocco wrote: | Yeah, I'm with you. | | I can think of some improvements (better forum/mailing list | coverage), but it's generally fine for almost everything. | Lately, if I don't find it on DDG I probably won't have much | better luck anywhere else, either. | Dahoon wrote: | Do Google search results work for you? If so, then I'd say | the reason is you don't see or agree with how bad results are | today (as others have posted about extensively). I for one find | DDG the search engine that returns the worst results. Qwant | is a better Bing-based engine IMO, but it is still bad. | jesuscyborg wrote: | The way I'd code a better search engine is I'd design an ML model | that's trained to recognize handwritten HTML like this, and only | add those pages to the index. It'd be cheap to crawl, probably only | needing a single computer to run the whole search engine. It'd | resurrect The Old Web, which still exists but just got buried | beneath the spammy, SEO-optimized grifter web over the years as | normies flooded the scene. | buzzerbetrayed wrote: | I hope to never use your search engine. I love hand-written | HTML as much as the next guy, but search engines are made to | find things. And useful information exists on web sites that | use generated and/or minified HTML. | mixologic wrote: | How would anybody ever know what the server is running and/or | doing with the data you send it, regardless of whether it is | running open or closed source code? | | A service, running on somebody else's machine, is essentially | closed.
| | I think the only way to have an 'open' service is to have it | managed like a co-op, where the users all have access to | deployment logs or other such transparency. | | Even then, it requires implicit trust in whomever has the | authorization to access the servers. | joosters wrote: | In _theory_, this is the kind of thing that the GPL v3 was | trying to address: roughly speaking, if you host & run a | service that is derived from GPL-v3'd software, you are obliged | to publish your modifications. | | But I agree with you - and I don't think the author had really | thought through what they were demanding; they made no mention | of licensing other than singing the praises of FOSS, as if | that would magically mean you could trust what a search engine | was doing. | lixtra wrote: | > In theory, this is the kind of thing that the GPL v3 was | trying to address: roughly speaking, if you host & run a | service that is derived from GPL-v3'd software, you are | obliged to publish your modifications. | | You mean AGPL https://en.m.wikipedia.org/wiki/Affero_General_ | Public_Licens... | joosters wrote: | You're right... I'm misremembering the GPL; Wikipedia says | that it was only 'Early drafts of GPLv3 also let licensors | add an Affero-like requirement that would have plugged the | ASP loophole in the GPL' - I hadn't realised it never made | it into the final version. | jedimastert wrote: | > In theory, this is the kind of thing that the GPL v3 was | trying to address: roughly speaking, if you host & run a | service that is derived from GPL-v3'd software, you are | obliged to publish your modifications. | | Why would I trust someone to do that, though? | joshuaissac wrote: | That sounds a bit like YaCy.[1] It is a program that apparently | lets you host a search engine on your own machine, or have it | run as a P2P node. | | I think the next step forward should be to have indices that | can be shared/sold for use with local mode.
So you might buy | specialised indices for particular fields, or general ones like | what Google has. The size of Google's index is measured in | petabytes, so a normal person would still not have the | capability to run something like that locally. | | 1. https://yacy.net/ | Jyaif wrote: | > How would anybody ever know what the server is running and/or | doing with the data you send it, regardless of if it is running | open or closed source code? | | https://en.wikipedia.org/wiki/Homomorphic_encryption | [deleted] | a3camero wrote: | I took a stab at making a simple search engine and wrote up some | of the lessons I learned from doing this as a hobby project | during the Coronavirus pandemic: https://www.cameronhuff.com/blog/ | making-a-search-engine-gori.... Here are some of the lessons I | learned that might help anyone else out there considering trying | their hand at this (which is a great educational project!): | | 1. Use private networking for traffic between components. | | 2. Compress screenshots carefully. Screenshots are a major part | of the disk requirement. | | 3. Use "block storage" (network-attached flash memory storage) to | store indices instead of RAM. | | 4. Carefully distinguish between URLs that are perceived (i.e. | displayed on a site) vs. the actual URL that results from following | the link. | | 5. When dequeueing URLs, be careful to dequeue the one with the | lowest depth and lowest number of attempts. | | 6. Store pages using | delta compression. | | 7. Don't store something if it's already stored, by addressing | content using hashes. | | 8. Sequential-ish integer data can often be stored using offsets | instead of the actual number to achieve significant file size | savings. | | 9. Hash collisions are far more common than you'd expect due to | the "birthday problem". | | 10. Always use an object for storing a URL because raw URLs that | are read from webpages often have issues. | | 11. Use Redis to cache data.
A cache is essential and MySQL (or | another database, I started with MongoDB) isn't meant for that. | But also use Redis for basic queues, of which a system like this | needs several to achieve good throughput. | | 12. Use APIs to connect components across system boundaries | rather than using file access or database access directly. | | 13. Think carefully about where data is stored. Some data needs | to be in RAM, some can be stored temporarily on disk on a VM or | in a database, some needs to be on block storage, etc. | | 14. Bandwidth charges make even cheap services like B2 | (S3-compatible object storage) expensive. | | 15. Cheap VMs are important. | | 16. SHA-1 hashes can be computed using the WebCrypto API but MD5 | hashes can't. | | 17. The Redis BITFIELD command can be used to store information in | bitmaps that can be very efficient memory-wise. | | 18. Using block storage to store indices is cheap but limits the | throughput of the system. | | 19. Storing data in tables where the table name is a part of the | index (such as docs1, docs2, etc.) can make a lot of sense and be | much faster than a large table with an index for the field. | | 20. Websites are not designed to be crawled. Proper crawling, | that is respectful (i.e. not too many pages loaded per hour), | thorough without being overly thorough, and adjusted according to | how frequently site content appears to change, is harder than it | appears. | | 21. Study academic articles and search engine company | presentations, even old ones (pre-2010), to understand how to | design a search engine. | | 22. The distribution of words in a document is just as important | as the word count, and maybe more so. | | 23. Search terms can be hashed locally on the user's computer and | sent to the server to see if there are pages that have that term, | without exposing the term itself to the search engine. | | 24.
Downloading and indexing a website with hundreds of thousands | of pages takes a long time if you want to crawl a site | politely (i.e. one page per minute). | ufo wrote: | One thing I always wished for is if there were a way to use | DuckDuckGo bang searches in my browser without sending them | through DDG. But apparently it's harder to implement than it | sounds. | takeda wrote: | You absolutely can, at least in Firefox: you can right-click on a | search field, select "Add a Keyword for this Search...", then | save it as a bookmark and enter the keyword (you don't have to | use !, but it is an option if you choose to). | | You can also create such a bookmark manually and use %s in the | URL as a placeholder where the search query should be placed. | | The manual configuration can be useful when there's no direct | search field. For example, freshports.org allows querying | freebsd.org. I can add a bookmark with search keyword "fp" to | point to https://freshports.org/%S | | After that I can type in the address bar: fp lang/python39 to land | on https://freshports.org/lang/python39 (the capital %S doesn't | escape special characters like /) | iuqiddis wrote: | In Firefox you can right-click on a search field and add a | keyword bookmark. Once saved, you can type 'kw search query', | where kw is your defined keyword, in the address bar to | directly search the relevant site | ufo wrote: | I'm aware of that. The problem is that you have to manually | add all the keywords yourself. AFAIK, there isn't an easy way | to import a large list of curated keywords like the DDG bang | list. | detaro wrote: | They are bookmarks you can export/import from Firefox, so | someone could easily make a Firefox bookmark file for a | large set of them. | wldcordeiro wrote: | Love this feature. I've got basically all the bang keywords, | but instead of say `!g query` it just becomes `g <query>`. | wolco2 wrote: | Privacy or not, I'm starting to find things on DDG that Google has | been filtering.
| | I found out through comments on HN that 8chan was back up under a | new name: 8kun | | Typing it into Google I get articles about it but no link in the | results. | | In DuckDuckGo, it's the first link. | | Made me think: what else am I missing? | the_only_law wrote: | Google's filtering is so weird. I have a bad habit of buying old | hardware without checking if the documentation had been made | available on the web. | | Recently I found myself desperate for any information on a | piece of hardware I had gotten. I was swapping out all sorts of | queries with different keywords hoping to find a manual. I was | able to find some marketing material which was helpful, albeit | barely. Eventually I had exhausted the search results for most | of my queries, gave up and assumed that it was simply lost to | time and I was out of luck. | | Eventually I went back to the sales paper I found. Going to the | site it was hosted on, a Lithuanian reseller, I translated the | page, eventually finding a direct link to a user manual on the | exact same page as the sales paper I had found. The document | was in English and contained important words from my queries (such | as the product name, company, "user manual", etc.). The document | was at the same path as the sales paper too. I have no idea why | Google found the sales paper but not the manual. | | Unfortunately the manual still wasn't what I was looking for | _exactly_, but it was a hell of a lot better than what I could | get from Google's results. | jron wrote: | SEO is crushing the utility of Google. It is pretty telling when | you need to add things like site:reddit.com to get anything of | value. Harnessing real user experiences (blogs, etc.) is the key | to a better search engine. This model unfortunately crumbles | under walled gardens, which are increasingly the preferred location | of user activity. | RileyJames wrote: | That's where blogs were at, but now a massive portion of them | are content farms / splogs.
| | You're right that the walled gardens have hurt this. So often I | search something specific, or a topic, and find very little. | But I know there are communities on Facebook for this; I know | there would be people's posts out there on Instagram which 100% | answer my question. But they may as well not exist. Unless I | was "following" them when it was said, and mentally indexed it, | these things are mostly unfindable, and that's if I even have | an account for said service (which I don't for Facebook). | | It's sad: more people than ever are using the internet, more | content & knowledge is being created than ever before, yet it's no | longer possible to find the great answers. | djsumdog wrote: | > Harnessing real user experiences (blogs, etc) | | This is what we need more than anything. More independent | blogs. The ability to search events now, or 10 years ago, mass | indexing of RSS feeds, etc. | | A general search engine is kinda way out of the ballpark for | now. But you could specialize in long-form blogs, from all | sides: hard-left, hard-right, women in tech, white | supremacists, all the extremes and moderates. | | I'd love to have an interface to search a topic and see what | all kinds of people have posted long form, without commentary | or Twitter/Facebook bullshit "Fact checking" notices. I want to | see what real writers are saying across the spectrum on a given | topic for the week or month. | ant6n wrote: | It's hard to get readership writing blogs these days. That's | pretty demotivating. | CameronNemo wrote: | Also difficult to distinguish a blog from a content farm if | you are just crawling the web. Any content pattern you | select for would likely be quickly adopted by SEOs. | grey_earthling wrote: | > This is what we need more than anything. More independent | blogs. The ability to search events now, or 10 years ago, | mass indexing of RSS feeds, etc.
| | Thought experiment: what would a search engine look like if | it _only_ indexed RSS and Atom feeds? | snowwrestler wrote: | Why would advertisements not fly? Search intent is like the | canonical example of an ad-targeting signal that does not need | personal data to succeed. If someone is searching for laptops, | you can show them laptop ads. | | I think most of Google's personal data efforts actually support | a) better organic search results (does this person mean Apple the | company or apple the fruit), and b) all the ads that are served | off their SERPs, where there is no signal of intent to read (i.e. | their display network). Again, you don't need personal data to | serve ads based on search terms. | merlinscholz wrote: | I miss Cliqz. It was a new search engine, with its own crawler, | almost completely from scratch. It even had a dev blog where they | wrote articles on how to build your own search engine: | https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr... | wcerfgba wrote: | I wonder if instead of another search engine we would benefit | from a directory, like DMOZ, or perhaps something tag-based or | non-hierarchical. Sometimes I find better results by first | finding a good website in the space of my query, and then searching | within that site, as opposed to applying a specific query over | all websites. One example would be recipes: if you search for | "bean burger recipe" you will get lots of results across many | websites, but some may not be very good, whereas if you already | know of recipe websites that you consider high-quality or that match | your preferences, then you'll find the best (subjectively) recipe | by visiting that site and then searching for bean burgers. | SNosTrAnDbLe wrote: | Yeah. Exactly my thoughts. I really liked the concept of | del.icio.us, where humans could bookmark and tag websites. | | DMOZ looks pretty great, but the categories look limiting.
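The tag-based, del.icio.us-style directory floated in the comments above is simple to sketch: bookmarks carry user-assigned tags, and a query is a set of tags whose intersection picks out matching sites. This is a hypothetical minimal illustration with invented data:

```python
# Minimal sketch of a tag-based bookmark directory: build an index
# from tag -> set of URLs, then answer queries by tag intersection.
# All bookmarks and tags below are invented for the example.
from collections import defaultdict

def build_tag_index(bookmarks):
    """bookmarks: iterable of (url, tags). Returns tag -> set of URLs."""
    index = defaultdict(set)
    for url, tags in bookmarks:
        for tag in tags:
            index[tag].add(url)
    return index

def lookup(index, *tags):
    """Return URLs carrying every requested tag, sorted for stability."""
    sets = [index.get(t, set()) for t in tags]
    return sorted(set.intersection(*sets)) if sets else []

bookmarks = [
    ("https://recipes.example/bean-burger", {"recipe", "vegan"}),
    ("https://recipes.example/chili", {"recipe"}),
    ("https://photos.example", {"stock-photos", "free"}),
]
index = build_tag_index(bookmarks)
print(lookup(index, "recipe", "vegan"))
# → ['https://recipes.example/bean-burger']
```

Unlike DMOZ's fixed hierarchy, tags are non-hierarchical: a bookmark can sit in as many overlapping categories as its users assign, which is the property the comment is pointing at.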
| nikivi wrote: | I'd love a truly open source world-class search engine. Curious | how both the crawler and the search index / search are done by the | likes of Google/Bing/DDG. Eventually someone will make an OSS | version of it that can compete. | | The beauty of such an OSS solution may be the custom heuristics | that can be created based on the crawled data. | anotherdirtbag wrote: | There's no need to compete. People who want things like this | just do it themselves. Check out YaCy: | https://github.com/yacy/yacy_search_server | Mediterraneo10 wrote: | The challenges to OSS developers are numerous. First of all, | many popular sites on the internet block crawlers other than | Google and Bing, because only those ones seem to matter to | their business, and any small upstart would be assumed to be a | dodgy bot. Secondly, Google amasses the database it has only | with vast data centers, incredible amounts of bandwidth, and | power requirements unavailable to a startup. | creese wrote: | How would anyone block a crawler? A crawler is just a | headless browser. | tleb_ wrote: | robots.txt | | https://www.robotstxt.org/ | | https://en.wikipedia.org/wiki/Robots_exclusion_standard | Xylakant wrote: | Note that robots.txt is a hint to well-behaved crawlers; it | doesn't block them in any regard. | | You can block crawlers if you can identify them, but | reliably identifying them is hard. | ddorian43 wrote: | Good luck with that mate. Check out https://commoncrawl.org/ | katsura wrote: | My biggest pet peeve with DDG at the moment is that whenever I | search for something on my phone the first two results are ads, | and those two results actually take up my whole screen. I mean | sure, those are probably not privacy-invading, but I literally | don't care as I wasn't looking for them. | metroholografix wrote: | DuckDuckGo is a mirage and should not be used by privacy- | conscious folks.
Take a look at its terms of service, information | collected section: | | "We also save searches, but again, not in a personally | identifiable way, as we do not store IP addresses or unique User | agent strings. We use aggregate, non-personal search data to | improve things like misspellings." | | So they save your web searches and claim that they do so in a | non-personally identifiable way. The privacy problems with this | claim are many, even if one accepts it at face value (good luck | verifying that this is the case). | robertlagrant wrote: | I don't see why you'd both nitpick their terms of service and | then also claim that it's a pack of lies and can't be trusted. | Why do the former and then the latter? If your complaint is | just "I can't verify anything about their privacy" then that | would've made sense. | fbelzile wrote: | > DuckDuckGo is a mirage ... The privacy problems with this | claim are many ... good luck verifying ... | | Okay, can you list just a few? | | If you're going to make counter-claims like this, you're going | to have to provide evidence. | | Statements like these are not conducive to gaining popular | support for increased privacy. | metroholografix wrote: | How do you save a search in a non-personally identifiable | way? Do you have a human verify the data belonging to each | and every search? Not saving IPs and/or browser data doesn't | solve the problem, since the search terms themselves can be | personally identifiable. | | How do you verify that DuckDuckGo does -the minimal and | ineffective- things they claim to do? They offer no proof. | | How do you verify that DuckDuckGo does not secretly cooperate | with more powerful coercive actors? | | How do you verify that DuckDuckGo, offering a single point of | compromise, has not been thoroughly compromised by more | powerful actors? | bscphil wrote: | > How do you save a search in a non-personally identifiable | way? | | Save a SHA-256 hash of every search for 24 hours.
If you see | the same hash from >10 distinct IP addresses in a 24-hour | period, save the search terms. | | That's just off the top of my head; I have no reason to | think they're doing it exactly like that. The point is that | you're claiming that we shouldn't trust DuckDuckGo because | you can't think of a way that they could securely and | privately do what they do -- but that's just your | intuitions, for whatever they may be worth. | | I also don't really buy the worries you have with the last | two questions, e.g.: | | > How do you verify that DuckDuckGo does not secretly | cooperate with more powerful coercive actors? | | How would you verify that for _any_ centralized service, | open source or not? I think your security concerns go a bit | beyond what most people interested in critiquing / | improving DDG can reasonably expect to achieve. | pfarrell wrote: | > How would you verify that for any centralized service, | open source or not? | | I think, technically, some sort of honeypot verification | could prove a compromise (i.e. planting information that has | very little chance of existing naturally in two systems, | say a string of GUIDs). | | But... I agree with your point. I don't think this is | actually feasible or realistic, just technically | possible. | pb7 wrote: | >How would you verify that for any centralized service, | open source or not? | | Other centralized (search) services don't have their | entire existence depending on this one factor. What is | DDG if not alleged privacy? Just use Bing directly. | bscphil wrote: | I don't understand that argument at all. What's the | threat model? | | I think it's entirely reasonable to be in the following | posture: I want as much privacy for my web searches as I | can reasonably achieve without having to run a search | engine myself.
I'm willing to trust that search providers | are not saving personally identifiable information or | passively turning over search data to law enforcement if | they claim that they are not in their terms of service. | | That's pretty much the use case for DDG. With Bing you | _know_ they are violating your privacy. With DDG you have | a promise _in writing_ that they are not. It's hard to | see how that's not strictly better than what you get from | Bing if privacy is among your core desiderata. | pb7 wrote: | I think we're on the same page. I was saying that if it | were to be discovered that DDG lacks privacy then there | would be no reason to use it over Bing, since that is its | raison d'etre. | | >I'm willing to trust that search providers are not | saving personally identifiable information or passively | turning over search data to law enforcement if they claim | that they are not in their terms of service. | | Do other search companies disclose that they share data | with the FBI, NSA, etc. in their ToS? Genuinely don't | know. | jerf wrote: | "How do you save a search in a non-personally identifiable | way?" | | To a first approximation, you just... do it. | | Granted, if you search "{jerf's realname here} | {embarrassing disease} cure" or something, in the | pathological case, you could at least guess that maybe it | was me, though even then my real name is far from unique, | and nothing stops anyone else from running such a search. | | But otherwise, if all you have is a pile of a few billion | searches, you don't have any information about any of the | specific searchers. Even if you search for your own | specific address, you don't really get anything out of it; | there's no guarantee it was you, or a friend of yours, or | an automated address scraper. There isn't much you can get | out of a search string without more information connected | to it.
| | The rest of your criticisms are too powerful for the topic | at hand; they don't prove we shouldn't use DDG, they prove | we shouldn't use the internet at all. | Dahoon wrote: | At the very least, your example is PII, which you cannot | save while also claiming to be private. | [deleted] | [deleted] | WA9ACE wrote: | Do you have a search engine that you prefer to use that claims | not to store said information that I might try? | h2onock wrote: | I can hand on heart tell you that Mojeek doesn't and never | has. I know this because I work for Mojeek. | Pick-A-Hill2019 wrote: | Hi. I took a look at Mojeek (first time I've heard about | it) and since you mentioned the site and you work there - | | In your Privacy page (Data Usage section) there is a | mention of stored "Browser Data": "These logs contain the | time of visit, page requested, possibly referral data, and, | located in a separate log, browser information." | | This is an honest question - how is that not exactly what | the parent stated was the issue? So they | save your web searches and claim that they do so in a non- | personally identifiable way. | ricardo81 wrote: | The referred-to issue with DDG is that its favicon | service was informing DDG of sites _you visit_, rather | than searches you make. | | But agreed that all search engines have to be trusted on | their word about anonymising data and not retaining PII. | metroholografix wrote: | The only solution I see is fully distributed/decentralized | search. Run your own crawler or be part of a network that | distributes this out to each participating node. | | Every centralized search engine has immensely hard-to-resist | and powerful incentives to play "The Eye of Sauron" with your | data. Additionally, they offer single points of compromise to | other, far more powerful actors. Whatever guarantees | DuckDuckGo gives you - and right now they don't give any - | don't mean much if they've been thoroughly (willingly or | unwillingly) compromised.
| | Which doesn't mean one should always steer well clear, just | that one should at least be aware of the tradeoffs one makes | when using a centralized search engine. And with DuckDuckGo's | misleading marketing, I feel that this point is lost on | significant chunks of its userbase. | ravenstine wrote: | Such search engines have been around for many years, and | they suck donkey balls. Pardon my French. Install YaCy and | tell me how you like it. | | It wouldn't matter anyway, because decentralization doesn't | really solve privacy any better than centralized search, | besides the fact that it could theoretically provide more | choices. | | No matter what you use, privacy ultimately depends on | trust. The reason that I have more trust for DDG than I do | Google is that, unlike Google, its primary audience is | privacy-minded folks. If it came out that DDG was tracking users | and selling that data, DDG would be immediately done as a | brand. They at least have some incentive to do what they | say. Decentralization provides no such benefit because a | search "node" is unlikely to have any sort of meaningful | brand to keep up. | | > And with DuckDuckGo's misleading marketing, I feel that | this point is lost on significant chunks of its userbase. | | How is it misleading? My understanding from their marketing | is that they don't create profiles of their users based on | searches. Until we have evidence to the contrary, it's not | outrageous to assume they are being truthful. | burnthrow wrote: | "Run your own crawler" is not a solution. | | Cool my comments are immediately downvoted like that | Italian guy's. | unethical_ban wrote: | Yeah, now you're just saying "Nothing centralized can ever | be trusted". So just say that rather than nitpicking their | ToS. You weren't going to care what they said anyway. | keyle wrote: | I agree DDG isn't perfect or great but it's _good_ 80% of the | time.
| | I always start with DDG and revert to Google if it doesn't | help, or I feel "there's got to be a better way". | | That said, talk is cheap, show us your engine. | Guest19023892 wrote: | I wonder if someone could set up a curated search engine. However, | allow anyone to curate the results and define a custom list of | allowed URLs. Then, others can use that list. | | For example, I decide Google is terrible when I'm searching for | product reviews, and all I get are results to Amazon referral | websites and spam blogs that never owned the products to begin | with. So, I find 200 sites or forums that actually have quality | reviews and I create a whitelist of those URLs, and I name it | "John Doe's Product Reviews List". | | Other people visit the search engine and they can see my list, | rate it, favorite it, and apply it to their results. | | So, the idea is you visit the search engine, type your query, | then select from a drop-down one of your favorite curated lists | to apply. Maybe you like to use "Mike's favorite free stock photo | websites" when searching for free photos for your projects. Maybe | you like to apply "Jane's vegan friendly results" when searching | recipes or face creams. Maybe you want to buy local, so you use | the "Handmade in X" list when searching for your next belt. Maybe | you use another list that only shows results from forums. Or | another for tracking/ad-free websites. | | Keep track of list changes. So, if someone gets paid off to allow | certain sites on their popular list, others can easily fork a | past version of the list. | benmller313 wrote: | I think this person actually means "We can imagine doing better | than DuckDuckGo". | 6510 wrote: | The right question is: How to do search using open source | tools? | | If your goal is "to make something better than the Duck" and | you succeed, the Duck dies... what is your goal now? | timClicks wrote: | Well, ideas are much easier than implementations.
| ikiris wrote: | It's kind of amazing how many people think an idea is the | biggest part of a viable product. | 6510 wrote: | So you want to build a team and organize finances first? | That doesn't seem like a bad idea... wait... | TedDoesntTalk wrote: | Cliqz in Germany was one such implementation, funded in part | by Mozilla but completely independent. | | They wrote their own search engine. | | They closed shop earlier this year. | corytheboyd wrote: | You need money and dedicated resources to run and manage the | service, which at some point is just going to require trust. | Trusting nobody is smart, but expecting a service to compete and | win the long game without trusting it is pointless. | blibble wrote: | as a recent ddg convert, I've noticed little difference from | google | | (might be because google's results these days are so bad | though... can't really tell) | beefield wrote: | How about a crowdsourced search engine like wikipedia or | stackoverflow? Like: | | When you search for "kittens" you get the links that are most | upvoted by the community. | | If nobody has ever submitted links for search term "kittens", | you get a link to selected generic search engines. And "kittens" | ends up on a list of words someone has searched but nobody has | yet added a good result link for. | Moru wrote: | I hate to be so negative but that's just another sort of SEO | problem. Someone will pay a large group of people to sit and | click upvotes for their clients nonstop. | beefield wrote: | Of course there are going to be some highly debated search | terms. But I think that applies also to Wikipedia, and they | have managed to pull it off so that it works reasonably well. | | I mean, you could always put a big red badge on top of the | results that says something along the lines of "this search term | seems to be troublesome. You may want to check qwant/ddg or | maybe even google." | josefresco wrote: | Just do it.
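beefield's crowdsourced scheme a few comments up (upvoted links per term, fallback to generic engines, and a queue of terms nobody has curated yet) could be sketched roughly like this; all names and URLs are made up for illustration:

```python
class CrowdSearch:
    """Community-curated results: each query term maps to links with
    vote totals; unseen terms fall back to generic engines and are
    queued for the community to fill in."""

    FALLBACK_ENGINES = ["https://duckduckgo.com/?q=",
                        "https://www.qwant.com/?q="]

    def __init__(self):
        self.votes = {}      # term -> {url: vote count}
        self.wanted = set()  # terms searched but not yet curated

    def upvote(self, term: str, url: str) -> None:
        self.votes.setdefault(term, {})
        self.votes[term][url] = self.votes[term].get(url, 0) + 1
        self.wanted.discard(term)  # term is now curated

    def search(self, term: str):
        if term in self.votes:
            # Most-upvoted links first.
            return sorted(self.votes[term],
                          key=self.votes[term].get, reverse=True)
        # No community results yet: note the gap, punt to big engines.
        self.wanted.add(term)
        return [engine + term for engine in self.FALLBACK_ENGINES]

cs = CrowdSearch()
print(cs.search("kittens")[0])  # falls back to a generic engine
cs.upvote("kittens", "https://example.org/a")
cs.upvote("kittens", "https://example.org/a")
cs.upvote("kittens", "https://example.org/b")
print(cs.search("kittens"))  # most-upvoted link first
```

Moru's objection applies directly to `upvote`: without rate limiting or identity, a paid click farm can reorder any term's results, which is the part a real system would have to solve.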
| MaxBarraclough wrote: | I suspect Drew has his hands full with the SourceHut project. | mekster wrote: | Perhaps it would have been better for him to say, "There's a | better way to do it than DDG" than "We can do better than DDG" | as if he's about to do it when in fact he's waiting for his | revenue to go up. | eznzt wrote: | The last thing we need is a search engine with pictures of | anime girls. | vladmk wrote: | This post looks horrible on the bottom left for mobile fyi | dgudkov wrote: | >We need a real, working FOSS search engine, complete with its | own crawler. | | How would an open-source search engine stand against abusive | SEO optimization? If anyone can understand how the ranking | algorithm works, then anyone can game it. | AsyncAwait wrote: | Not as much if you have user-curated tier 1 sites. If these | start to become spammy, they get removed. | dumbfounder wrote: | Yes, we can do better than DDG. But if you are expecting to fund | a real search engine with a few hundred thousand dollars you are | insane. It will take a ton of development and a ton of hardware | to create an index that isn't a pile of garbage. This isn't 2000 | anymore. You need to index >100 billion pages and you need it | updated and you need great crawling and parsing and you need | great algorithms and probably an entirely proprietary engine and | you need to CONSTANTLY refine all the above until it isn't | garbage. Maybe you could muster something passable for $1B over 5 | years with a strong core team that attracts great talent. If | Apple actually does this, as they are rumored to, I bet they dump | $10b into it just for the initial version. | Aeolun wrote: | If you want a _good_ engine there is no need to index 100B | pages, since 99% of the pages are blogspam. | bmurphy1976 wrote: | How are you going to identify what's blogspam and what's | legitimate without indexing it all in the first place?
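One possible answer to bmurphy1976's question, in the spirit of the tiered-whitelist idea from the article: you never try to classify the whole web, you only crawl outward from a trusted tier-1 seed for a bounded number of tiers, so most blogspam is simply never reached. A rough sketch at the domain level (a real crawler works per page; the example domains are invented):

```python
from collections import deque

def crawl_tiers(tier1_domains, outlinks, max_tier=3):
    """Assign each domain the lowest tier at which it is reachable
    from the tier-1 whitelist, stopping at max_tier; anything beyond
    that horizon is never indexed at all. `outlinks` maps a domain to
    the domains it links to (a stand-in for crawl discoveries)."""
    tier = {d: 1 for d in tier1_domains}
    queue = deque(tier1_domains)
    while queue:
        d = queue.popleft()
        if tier[d] >= max_tier:
            continue  # do not expand the frontier past the last tier
        for linked in outlinks.get(d, []):
            if linked not in tier:  # BFS keeps the lowest tier found
                tier[linked] = tier[d] + 1
                queue.append(linked)
    return tier

links = {
    "docs.python.org": ["bugs.python.org", "peps.python.org"],
    "peps.python.org": ["example-blog.net"],
    "example-blog.net": ["spam-farm.biz"],
}
print(crawl_tiers(["docs.python.org"], links, max_tier=3))
```

The tier number doubles as a ranking prior (tier 1 weighed upwards, as the article suggests), and the spam farm three hops out never enters the index in the first place.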
| ricardo81 wrote: | Agreed, it's going to require significant investment in | hardware and software. | | The recent UK Competition and Markets Authority report | evaluating Google and the UK search market came to the | conclusion that a new entrant would require about 18 billion GBP | in capital to become a credible alternative search engine, in | terms of size, quality, hardware, and man-hours making it. | | Remember Cuil? It had the size and the fanfare, but | unfortunately not the quality. | AsyncAwait wrote: | I think that is why the idea would be to have the tier 1 sites, | so you don't have to index as much. | dumbfounder wrote: | I said ISN'T a pile of garbage :) | nostromo wrote: | > If Apple actually does this, as they are rumored to, I bet | they dump $10b into it just for the initial version. | | Google pays Apple more than that every year just to set Google | as the default search engine on iPhones. | | In a way, Google is funding its future competitor. | lemax wrote: | Open Street Map is a nice analogy for what could work. Aside | from the open source maintenance of the map, there's also tons | of corporate help in the background. Companies that are | delivering OSM as a service or relying on it for their own | services have an interest in making it better. MapBox, for | example, apparently pays tons of people a salary who are | contributing upstream to Open Street Map. If we can get an | Apple/Microsoft/other players collab, maybe a viable | alternative can actually be built. | kilroy123 wrote: | I agree and I have been hoping Apple builds a serious | competitor. I welcome any competition at this point. Let's be | real, not many people are using Bing. People _would_ actually | use apple search. | _underfl0w_ wrote: | > People would actually use apple search. | | Of course they would - it would be set as the default search | on their iPhones with no clear-cut way to change it. You | know, "security". The users don't know what's best for them, | etc.
as Apple seems to think. | ur-whale wrote: | >I welcome any competition at this point. | | Microsoft tried and failed to build a competitor, and it's not | like they have shallow pockets. | | They grossly underestimated a number of aspects: | - The huge number of man-years invested in hand-tuning Google's | search quality stack, and what it would take to replicate it. | - The infrastructure required to build a crawler / indexer stack | as good as Google's | | I think in 2020, the second problem is within reach of many | companies technically. It's mostly a matter of throwing | enough money at optimized infrastructure. | | However, replicating the search quality stack is going to be | very hard, unless someone makes a huge breakthrough in | machine learning / language modeling / language understanding | at a thousandth of the cost it currently takes to run | something like GPT-3. | | The most likely candidate to execute properly on that last | bit is - unfortunately - Google. | Krasnol wrote: | Sure, but Microsoft has tried and failed to build a | competitor to Google, not to DDG. | | I don't see it as that hopeless. I feel it's kind of like | starting Open Street Maps. It won't be perfect for a long | time but there will be people who'd prefer it and help out. | nynx wrote: | I'd love to participate in a project like this. Does anyone know | how to contact Drew DeVault? | enriquto wrote: | He's very responsive in the mailing lists of all his projects. | Just don't ask him a silly question, like blockchain support, | or shit like that. | flas9sd wrote: | is somebody aware of a project where the end-user browser acts as | a crawler? it already spent the energy to render the content. | Readability.js extracts the page content, does some processing | for keywords, hashes anchor links, signs it, and sends it off. | Cache-Control response headers indicate if the page is public or | private.
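flas9sd's browser-as-crawler pipeline (extract, keyword-process, hash anchors, sign, submit) might look something like the following. The payload format, the HMAC signing, and every name here are assumptions for illustration, not anything an existing project does:

```python
import hashlib
import hmac
import json

def build_submission(url, extracted_text, anchor_links, signing_key: bytes):
    """What a browser extension might send to a shared index after the
    page has already been rendered locally: a keyword digest rather
    than raw content, hashed anchors, and an HMAC so the collector
    can reject tampered submissions."""
    # Crude keyword extraction as a stand-in for real processing.
    keywords = sorted({w.lower() for w in extracted_text.split()
                       if len(w) > 3})
    payload = {
        "url": url,
        "keywords": keywords[:50],
        "anchors": [hashlib.sha256(a.encode()).hexdigest()
                    for a in anchor_links],
    }
    # Canonical serialization so the signature is reproducible.
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

sub = build_submission(
    "https://example.org/post",
    "DuckDuckGo search results discussion",
    ["https://example.org/other"],
    b"demo-key",
)
print(sub["signature"][:8])
```

Hashing the anchors lets the collector build a link graph without learning the literal URLs until another submitter reports the same hash, which is one possible answer to the private-page concern raised below.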
Of course, where it is sending to will have an | electricity bill to pay to index the submissions. | 1MachineElf wrote: | The idealist in me fantasizes this is possible with a | browser-based P2P zettelkasten. | jedimastert wrote: | That's an interesting point... I wouldn't trust the | `Cache-Control`, unfortunately, but a distributed indexing model | might be interesting... | | I know there have been talks of set-ups that essentially take a | web archive of your entire history to search back through... | znpy wrote: | the tiering system is dumb, really really dumb. | | it would basically make already famous domains shine and dump | lesser-known domains into 20th-page oblivion. | | it saddens me because google search results actually helped you | discover new sites and new people, but it's been years since that | has changed. | cbsks wrote: | > Crucially, I would not have it crawling the entire web from the | outset. Instead, it should crawl a whitelist of domains, or "tier | 1" domains. These would be limited mainly to authoritative or | high-quality sources for their respective specializations, and | would be weighed upwards in search results. Pages that these | sites link to would be crawled as well, and given tier 2 status, | recursively up to an arbitrary N tiers. | | I like this idea. It would be interesting to see the domains of | every search query that I have clicked on and see what the | distribution is like. I suspect there would be a long tail, but I | wonder how many domains actually need to be indexed for 99% of my | personal search needs. Does anyone have data like this? | RileyJames wrote: | I'm pro privacy, but I don't have a problem with AdWords, outside | of Google's implementation. | | If AdWords targeting was purely based on the search term, I don't | mind. | | The search engine has to generate revenue somehow, and the | revenue generated on "Saas crm" with a single click is likely to | be larger than any user's annual subscription.
(10 - 100+ per | click) | | I'm unclear on the ethical / privacy concerns of "AdWords"-style | advertising. | ricardo81 wrote: | FWIW it's no longer called AdWords, just Google Ads. | | Agree, keyword (and location; a lot of searches are for _X near | me_) for the most part offers a way of delivering relevant | ads. | | Google are able to generate more income per search because of | their critical mass of searches and advertisers, as well as | having more data on searchers based on search history to | maximise that revenue per search. | blendergeek wrote: | Here is the problem (and it isn't privacy): | | A search engine's job is to present you with the best possible | results for any given query. | | An ad is either A) the best possible result or B) not the best | possible result. If the ad is the best possible result, then | the search engine must display it anyway in order to fulfill | its mission. If it is not the best possible result, the search | engine must violate its mission in order to display it. To put | it bluntly, advertising is paying to decrease the quality of | search results. | andreareina wrote: | DDG does operate their own crawler[1], though they also do still | rely on third parties[2]. | | [1] https://help.duckduckgo.com/duckduckgo-help- | pages/results/du... | | [2] https://help.duckduckgo.com/duckduckgo-help- | pages/results/so... | Kiro wrote: | Their own crawler is only used to fetch things for the widgets, | not the search index. | mekster wrote: | Author didn't even DDG to find this out? | Dahoon wrote: | Clearly neither did you, as he is correct. DDG's crawler is not | a crawler like googlebot. | eqv wrote: | Drew has a longstanding history of ill-informed rants ([1] | [2]) about technology. He's also quite willing to lie about | the facts[3].
| | [1] https://news.ycombinator.com/item?id=24121609 | | [2] https://news.ycombinator.com/item?id=23966778 | | [3] https://news.ycombinator.com/item?id=24023998 | dang wrote: | No personal attacks on HN, please. | | Digging up past internet history as ammunition in an | argument isn't cool either. | Eeems wrote: | Just don't give him IRC ops and then get into a private | argument with him. https://www.omnimaga.org/news/omnomirc- | moved-to-new-server/m... | djsumdog wrote: | Wow. I'm honestly not surprised. That's... that's pretty | shitty. | Eeems wrote: | Knowing a bit of his personal history I can kind of | understand why he acts the way he does, and has the | opinions he does. Doesn't excuse some of it, but at least | I kinda get why. | | I just wish his name would stop coming up for me tied to | opinion pieces like this. I'd rather just see things | about how some project he's working on is doing great and | being widely adopted. | djsumdog wrote: | I don't like to criticize the author. We all have good | takes and bad takes, and really, for a single post, you | should address the argument. Digging up the past is part of | what's making the world worse. | | That being said, I do see a valid reason for bringing up | his history of bad takes. I used to respect DeVault. He | banned me on the Fediverse because he disagreed with me | being against defunding the police and against critical | race theory. | | I find some of his stuff interesting, and I agree with more | AGPL and more real open source development. I'd even say | I'm jealous that he can actually fund himself off of his | FOSS projects and do what he loves. | | But I do agree, he does have a lot of questionable takes. | He seems to love Go and hate Rust, hate threads for some | reason, and has a lot of RMS-style takes. Not all of them | are bad, and hardcore people can help you think. | | As far as this post goes, I do think search is pretty | broken. I think a better solution is more specialized | search.
Have a web tool just for tech searching that covers | StackExchange sites, GitHub, blogs, forums, bug trackers | and other things specialized to development. | | Another idea would be an index that just did blogs, so you | can look up any topic and see what people are writing about | long-form for the current month. Add features to easily see | what people were saying 5 or 10 years ago too. There is a | ton of specialized work there, in filtering blog spam, | making sure you get topics from all sides (including | "banned" blogs), etc. | | You used to have to go to Lycos, Yahoo, Hotbot, Excite and | you'd get different results and find lots of different | helpful things. We need that back. It will take some good, | specialized tools to break people from Google search. | baggachipz wrote: | > The main problem is: who's going to pay for it? Advertisements | or paid results are not going to fly -- conflict of interest. | Private, paid access to search APIs or index internals is one | opportunity, but it's kind of shit and I think that preferring | open data access and open APIs would be exceptionally valuable | for the community. | | There's no reason you couldn't allow the first _N_ API | hits to be free, then charge for higher tiers of access. | wenbin wrote: | It's almost impossible to build a decent web search engine from | scratch today (i.e., build your own index, fight SEO spam, tweak | search result relevance...). The web is already so big and so | complex. Otherwise Google wouldn't need to hire so many people to | work on search alone. | | If you didn't start at the very early stage of the tiny web | (e.g., Google in 1996 as a research project) and grew with the | web over the past 20+ years, or you don't have super deep pockets | (e.g., Microsoft Bing in the mid 2000s), then it's almost | impossible to build a decent web search engine within a few | years.
| | It's possible to build vertical search engines on a far smaller | scale, for far less complex, far less lucrative things that | Google/Microsoft have little interest in today (e.g., recipes | [2], podcasts [3], gifs [4]...) | | It's also possible to come up with a different discovery | mechanism for the web (or a small portion of the web), other than | a traditional complete web search engine. Essentially you don't | cross the moat to attack a huge castle (e.g., Google). Instead, | you bypass the castle [1]. | | [1] | https://twitter.com/benedictevans/status/1038538688232226817... | | [2] https://www.yummly.com/ | | [3] https://www.listennotes.com/ | | [4] https://giphy.com/ | mongol wrote: | You are probably right. But still... the approach suggested | makes a kind of sense. A curated list of trusted sites as a kind | of seed. Not the entire web. This can be as small or as large as | can be useful. It does not need to be about the entire web. How | big is the "useful" blogosphere, for example? Cannot an | open-source project that gathers momentum somehow create a | curated list of, let's say, 10 000 trusted blogs and index those? | Index all mailing lists that can be found, index all of Reddit, | index Hacker News, index Wikipedia, the 100 most well-regarded | news sites in each country, etc. Would not such an index be a | good start and better than Google in many cases? | jedimastert wrote: | > Instead, it should crawl a whitelist of domains, or "tier 1" | domains. These would be limited mainly to authoritative or | high-quality sources for their respective specializations, and | would be weighed upwards in search results. | | Not a big fan of this conclusion. Who chooses the whitelist, and | why should I trust them? Is it democratically chosen? Just | because a site is popular very clearly does not mean it's | trustworthy. Does it get vetted? By whom? Also, whose definition | of trustworthy are we trusting?
| | If I want my blog to show up on your search engine, do I have | to get it linked by one of those sites, or can I register with | you? Will I be tier 1, or | ecommerceguy wrote: | So basically lock out any new site, regardless of content. | Great idea /s | mekster wrote: | But email is already like this. It's the inbox providers | who choose which domains are legit, and new domains start from | a negative rating. Treating the web the same way doesn't sound | too unnatural. | | It would be bad if those in the position profit by | "authorizing" who is good though. | buzzerbetrayed wrote: | I'm not sure why email should be an example of the correct | way to do it. And with email I can check my spam folder and | see exactly what has been rejected. So unless the search | engine has a list of sites that aren't deemed worthy included | with every search (which probably wouldn't happen), I think | this solution has some pretty big flaws. It should be noted | that the current system also has these flaws, as Google and | DDG can show you whatever they want based on whatever | criteria they see fit. | 6510 wrote: | I like this idea! Have the usual official results... then | have an option to go to level 2, level 3, level 4 etc. (lvl | 1 is not included in lvl 2) | | You can have really biased, technically terrible filters | that for example put a site on level 4 because it is too | new, too small, and any number of other dumb SEO nonsense | arguments. (The topic was not in the URL! There was a poor | choice of text color!) | | I think Wikipedia has a lot of research to offer on what to | do but also what not to do. Try getting to tier 2 edits on | a popular article? It would take days to sort out the edits | and construct by hand a tier 2 article. | Shared404 wrote: | > Who chooses the whitelist, and why should I trust them? Is | it democratically chosen? | | You could have user-compiled lists of sites to show in search | results.
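A user-compiled list of this kind is just data, so applying one to any engine's results client-side is straightforward. A hypothetical sketch, with the list contents and URLs invented for illustration:

```python
from urllib.parse import urlparse

def apply_list(results, allowed_domains):
    """Filter result URLs down to a user-chosen allowlist, e.g.
    "John Doe's Product Reviews List". Because a list is just a set
    of domains, lists can be shared, rated, and forked as plain
    data, with no engine cooperation required."""
    def domain(url):
        host = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        return host[4:] if host.startswith("www.") else host
    return [r for r in results if domain(r) in allowed_domains]

reviews_list = {"example-forum.net", "honestreviews.example"}
results = [
    "https://www.example-forum.net/thread/42",
    "https://affiliate-spam.example/top-10",
    "https://honestreviews.example/widget",
]
print(apply_list(results, reviews_list))
```

This is also one answer to the "X within N sites" question further down: intersecting fetched results with a 20-domain set is trivial locally, even when the engine's own query syntax can't express it.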
| | Let the users pick the lists they want to see, and communities | can create and distribute lists within themselves. | Jtsummers wrote: | That's what directory sites offered once upon a time. It was | a pretty good way to discover new content back then. I spent | a lot of time on dmoz when I wanted to find information about | various topics. | RileyJames wrote: | Great idea, but why build a search engine at all in this | case? You can use DDG + your filter and see only the results | from your whitelist. | | Could easily be implemented for any current search engine. | | To a large extent, this is what you already do when you view | a page of search results. Filter them based on your | understanding of what sites / results hold value. | Arnavion wrote: | >Great idea, but why build a search engine at all in this | case? You can use DDG + your filter and see only the | results from your whitelist. | | If I want to search for "X" within N sites, where N = 20, | how do I make a DDG filter for that? | vorpalhex wrote: | I wonder if the correct answer is a blacklist for known spammy | sites and the ability to turn the list off. | | If I never see a pinterest link, or one of those sites that | just republishes stackoverflow answers unedited, I'd be fine | with it. | | Of course, these systems always get abused and some political | or news site will end up on it. | bscphil wrote: | > If I want my blog to show up on your search engine, do I have | to get it linked by one of those sites, or can I register with | you? Will I be tier 1, or | | I think what I'd say in defense is that we've misunderstood | what search engines are useful for. They're really bad at | helping us discover new things. Your blog might be awesome, but | it's not going to be easy for a search engine to tell that it's | awesome. It's going to have to compete with other blogs that | also want views, some of whom are going to be better than yours | at SEO, and so on.
| | What a search engine _might_ be able to tell is that it's | _useful_. Because what search engines are at least potentially | good at is answering questions. You do that by having a list of | known good sites to answer specific types of questions, and | looking at the sites they link to. It's when you try to do | both (index everything on the web and provide accurate answers | to specific questions) that you end up failing to do either. | For example this is the #2 result for "python f strings" on | DDG[1]. It's total garbage, and, quoting the blog, "we can do | better". (This result is also on page 1 for the same query on | Google.) | | What I believe ddevault is suggesting is that we make a search | engine that does the only thing search engines are really good | at, answering questions. You throw away the idea of indexing | everything on the web, and therefore the possibility of | "discovery". What that means is that in 2020 you need some | other mechanism for discovering new sites, bloggers, and so on. | Fortunately we do have some alternatives in that space. | | To be clear, I don't know if I 100% buy this argument, but I | think it's the general idea behind what's being suggested in | this blog post. | | [1] https://careerkarma.com/blog/python-f-string/ | suff wrote: | No you can't. | bscphil wrote: | > they've demonstrated gross incompetence in privacy | | Not sure I buy the example that is given here. | | 1. It's an issue in their browser app, not their search service. | | 2. It's not completely indefensible: it allows fetching favicons | (potentially) much faster, since they're cached, and they promise | that the favicon service is 100% anonymous anyway. | | 3. They responded to user feedback and switched to fetching | favicons locally, so this is no longer an issue. | https://github.com/duckduckgo/Android/issues/527#issuecommen... | | > The search results suck!
The authoritative sources for anything | I want to find are almost always buried beneath 2-5 results from | content scrapers and blogspam. This is also true of other search | engines like Google. | | This part is kinda funny because "DuckDuckGo sucks, it's just as | bad as Google" is... not the sort of complaint you normally hear | about an alternative search engine, nor does it really connect | with any of the normal reasons people consider alternative search | engines. | | That said, I _agree_ with this point. Both DDG and Google seem to | be losing the spam war, from what I can tell. And the diagnosis | is a good one too: the problem with modern search engines is that | they're not opinionated / biased _enough_! | | > Crucially, I would not have it crawling the entire web from the | outset. Instead, it should crawl a whitelist of domains, or "tier | 1" domains. These would be limited mainly to authoritative or | high-quality sources for their respective specializations, and | would be weighed upwards in search results. Pages that these | sites link to would be crawled as well, and given tier 2 status, | recursively up to an arbitrary N tiers. | | This is, obviously, very different from the modern search engine | paradigm where domains are treated neutrally at the outset, and | then they "learn" weights from how often they get linked and so | on. (I'm not sure whether it's possible to make these opinionated | decisions in an open source way, but it seems like obviously the | right way to go for higher quality results.) Some kind of logic | like "For Python programming queries, docs.python.org and then | StackExchange are the tier 1 sources" seems to be the kind of | hard-coded information that would vastly improve my experience | trying to look things up on DuckDuckGo. | dwd wrote: | I thought DDG already crawled their own curated list of sites?
| | There is a DuckDuckGoBot, and I think it was in an interview or | podcast a while back that Gabriel mentioned they use it | for filling in gaps in the Bing API data to provide the | instant answers and favicons. Their preference for the instant | answers was for authoritative references such as | docs.python.org. This would have been a while back though. | bscphil wrote: | If memory serves, those crawls are _only_ used for Instant | Answers. My interpretation of the blog post is that it would | be nice to have a search engine that's sort of a hybrid | approach based on Instant Answers for the _whole_ web. | jedberg wrote: | I think Google sort of takes into account "votes", in that they | look at the last thing you clicked on from that search, and | consider that the "right answer", which they then feed back | into their results. | | As such, they effectively have a list of "tier 1" domains. | gregmac wrote: | I kind of hope they don't, or there is more to it than just | that -- for example, a user coming back and clicking on | something else counts as a downvote for the first item. | | Any system that ranks things purely based on votes or view | counts can have a feedback loop that can amplify "bad" | results that happen to get near the top for whatever reason. | For web search, this would encourage results that _look_ | right from the results page, even if they're not actually a | good result for what the user is looking for. | | An example of this would be when you're trying to find an | answer to a specific question like "How do I do X when Y?". | The best result I'd hope for is a page that answers the | question (or a close enough question to be applicable), while | the promising-looking-but-actually-bad result is a page where | someone asks the exact same question but there are no | answers.
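gregmac's tweak (coming back and clicking a different result counts as a downvote for the first click) can be sketched as a toy scorer. This ignores sessions, time gaps, and everything else a real system needs, and the penalty weight is arbitrary:

```python
class ClickFeedback:
    """Naive click-through ranking with a return-click penalty: if a
    later click for the same query lands on a different result, the
    abandoned earlier click is penalized instead of rewarded."""

    def __init__(self):
        self.score = {}       # (query, url) -> score
        self.last_click = {}  # query -> url of most recent click

    def click(self, query, url):
        prev = self.last_click.get(query)
        if prev is not None and prev != url:
            # Clicking something else implies the previous result did
            # not answer the query: treat it as a downvote.
            self.score[(query, prev)] = self.score.get((query, prev), 0) - 2
        self.score[(query, url)] = self.score.get((query, url), 0) + 1
        self.last_click[query] = url

    def ranking(self, query):
        urls = {u for (q, u) in self.score if q == query}
        return sorted(urls, key=lambda u: self.score[(query, u)],
                      reverse=True)

fb = ClickFeedback()
fb.click("python f strings", "https://careerkarma.com/blog/python-f-string/")
fb.click("python f strings", "https://docs.python.org/3/")
print(fb.ranking("python f strings"))  # the abandoned first click sinks
```

Note how this counters the amplification loop described above: a promising-looking-but-bad page collects clicks and then immediately collects return-click penalties, instead of compounding its advantage.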
| eyelidlessness wrote: | > Any system that ranks things purely based on votes or | view counts can have a feedback loop that can amplify "bad" | results that happen to get near the top for whatever | reason. | | I think this is a place where Google has pretty obvious | algorithm problems. For example, I'm building a personal | website for the first time in many years, and obviously | that means I'm doing a fair bit of looking up new or | forgotten webdev stuff. It's widely known that W3Schools | is low-quality, clickbait-heavy, and has a long history of | gaming the SEO system. They've been penalized by Google's | algorithm rule changes but continue to get the top result | (or even the top 3-5 results!), _even with Google having a | profile of my browsing habits, and knowing that I | intentionally spend longer on these searches to pick a | result from MDN or whatever_. It seems pretty likely that | W3Schools is just riding click rate to stay at the top. And | it's pathological. | beckingz wrote: | Is W3Schools that bad? | | For some languages, W3Schools is as good a reference or | better than the official documentation. | | And they're definitely better than most SEO spam. | eyelidlessness wrote: | W3Schools is _awful_. The official documentation is hard | to navigate, but W3Schools is notorious for misleading | and poor-quality examples and advice. MDN, caniuse, CSS | Tricks and such are much better resources. | bscphil wrote: | I don't know if DDG does that exactly, but their help page | does say this: | | > Second, we measure engagement of specific events on the | page (e.g. when a misspelling message is displayed, and when | it is clicked). This allows us to run experiments where we | can test different misspelling messages and use CTR (click | through rate) to determine the message's efficacy. If you are | looking at network requests, these are the ones going to the | one-pixel image at improving.duckduckgo.com.
These requests
| are anonymous and the information is used only by us to
| improve our products.
| |
| The Firefox network logger does show requests to this domain
| when I click on a link in the search results, before the page
| navigates away. This suggests to me they might be logging
| this information. _To be clear_, this is speculation on my
| part, because I haven't examined the URL parameters in
| detail.
| |
| In any case, I'm not sure how much this manages to improve
| the results, since usually I _can_ get help with my Python
| query (for example) using whatever crappy blog post is first
| in the results, but results from the official docs or
| StackExchange are still probably better and should be
| prioritized.
| Silhouette wrote:
| _Some kind of logic like "For Python programming queries,
| docs.python.org and then StackExchange are the tier 1 sources"
| seems to be the kind of hard-coded information that would
| vastly improve my experience trying to look things up on
| DuckDuckGo._
| |
| The problem with this strategy is always going to be that
| different users will regard different sources as most
| desirable.
| |
| For example, it's enormously frustrating that searching for
| almost anything Python-related on DDG seems to return lots of
| random blog posts but hardly ever shows the official Python
| docs near the top. I don't personally think the official Python
| docs are ideally presented, but they're almost certainly more
| useful to me at that time than some random blog that happens to
| mention an API call I'm looking up.
| |
| On the other hand, I would gladly have an option in a search
| engine to hide the entire Stack Exchange network by default.
| The signal/noise ratio has been so bad for a long time that I
| would prefer to remove them from my search experience entirely
| rather than prioritise them. YMMV, of course. (Which is my
| point.)
| judge2020 wrote:
| > and they promise that the favicon service is 100% anonymous
| |
| With that logic, Apple's OCSP server is also 100% anonymous
| (which I can legitimately believe it is).
| brundolf wrote:
| Agreed. I think the key point here is that the web is a
| radically different place than it was in 1998 (when Google
| launched and established the paradigm as we know it). Back then
| the quality-to-spam ratio was probably much higher, and there
| were many more self-hosted sources than platforms (for better
| or worse). The naive scraping approach was both more crucial
| and more effective. And in the decades since, it's been a
| constant war of attrition to make that model keep working under
| more and more adversarial circumstances.
| |
| So I think that stepping back and re-thinking what a search
| engine fundamentally is, is a great starting point for
| disruption.
| |
| Additionally, something the OP didn't mention is that ML
| technologies have progressed dramatically since 1998, and that
| much of that progress has been done in the open. I can't
| imagine that not being a force-multiplier for any upstart in
| this domain.
| jbay808 wrote:
| Maybe instead of hard-coding these preferences in the search
| engine, or having it try to guess for you based on your search
| history, you can opt-in to download and apply such lists of
| ranking modifiers to your user profile. Those lists would be
| maintained by 3rd parties and users, just like e.g. adblock
| blacklists and whitelists. For example, Python devs might
| maintain a list of search terms and associated URLs that get
| boosted, including Stack Exchange and their own docs. "Learn
| Python" tutorials would recommend you set up your search
| preferences for efficient Python work, just like they recommend
| you set up the rest of your workflow. Japanese Python devs
| might have their own list that boosts the official Python docs
| and also whatever the popular local equivalent of Stack Exchange
| is in Japan, which gets recommended by the Japanese tutorials.
| People really into 3D printing can compile their own list for
| 3D printing hobbyists. You can apply and remove any number of
| these to your profile at a time.
| visarga wrote:
| I have had a similar idea; what you're proposing is
| essentially a ranking/filtering customisation. The internet
| is a big scene, and on this scene we have companies and their
| products, political parties, ad agencies and regular users.
| Everyone is fighting for attention and clicks. Google has
| control over a ranking and filtering system that covers most
| searches on the internet. FB and Twitter hold another
| ranking/filtering sweet spot for social networks.
| |
| The problem is that we have no say in ranking and filtering.
| I think it should be customisable both on a personal and
| community level. We need a way to filter out the crap and
| surface the good parts on all these sites.
| hobs wrote:
| Back in the day you'd have webrings - groups of sites that
| linked each other in clear association.
| mech422 wrote:
| This would be awesome! I'm so tired of Google ignoring what I
| tell it, and trying to 'guess' what I want.
| |
| I'd also love to be able to specify I want results from the
| last year without having to set it every time.
| wstrange wrote:
| This doesn't really seem immune to spam.
| |
| I got signed up for Goodreads (book review site), and I get
| tons of spam.
| |
| This is a hard problem...
| vorpalhex wrote:
| Like any other list, it depends on who maintains it. You
| basically want to find the correct BDFL to maintain a list,
| much like many awesome-* repositories operate.
| AsyncAwait wrote:
| This is actually a great idea and something I can see working
| rather well.
| nolanhergert89 wrote:
| As a hack until then, I've found Google's Custom Search
| Engine feature to work well enough for my use cases.
| https://programmablesearchengine.google.com/cse/all
| bscphil wrote:
| I like this idea!
I think the biggest difficulty with it - | which is also probably _the_ most important reason that | engines like Google and DDG are currently struggling to | return good results - is that the search space is just so | enormously large now. The advantage of the suggestion in the | blog post is that you trim down the possible results to a | handful of "known good" sources. | | As I understand it, you'd want to continue to search the | whole "unbiased" web, then apply different filters / weights | on every search. I really do like the idea, but I imagine | we'd be talking about an increase in compute requirements of | several orders of magnitude for each search as a result. | | Maybe something like this could be made a paid feature, with | a certain set of reasonable filters / weights made the | default. | retsibsi wrote: | This may be a very dumb question, but could the filtering | be done client-side? As in, DDG's servers do their thing as | normal and return the results, then code is executed on | your machine to weight/prune the results according to your | preferences. | | Maybe this would require too much data to be sent to the | client, compared to the usual case where they only need a | page of results at a time. If so, would a compromise be | viable, whereby the client receives the top X results and | filters those? | Spooky23 wrote: | I disagree; the search space is shrinking as more and more | stuff moves to walled gardens like Facebook and Twitter. | 867-5309 wrote: | > to guess for you based on your search history, you can opt- | in to download and apply such lists of ranking modifiers to | your user profile | | pro-privacy does not sit well with terms such as search | history and user profile | jbay808 wrote: | You might have misread. My suggestion is as an | _alternative_ to history based ranking. | brundolf wrote: | This is a great idea. 
It's like a modern reboot of the old
| concept of curated "link lists", maintained by everyone from
| bloggers to Yahoo. Doing it at a meta level for search-engine
| domains is a really cool thought.
| meerita wrote:
| I'm an old-time user of DDG. I agree with the 3 points. I feel
| like I'm using a shitty car in a world where everyone drives a
| Ferrari.
| |
| The one that strikes me most is the results. I feel like DDG
| doesn't search the entire internet, like there are zillions of
| pages there waiting to be indexed, even old websites, but the
| results I get are so poor.
| |
| Even with this handicap, I still use it over INSERT YOUR AD HERE
| Google Search.
| pmoriarty wrote:
| My main problem with DDG is that there's no way to be sure they
| actually respect their users' privacy as they claim to.
| |
| Ideally, services like theirs would be continuously audited by
| respectable, trusted organizations like the EFF... multiple such
| organizations even.
| |
| Then I'd have at least some reason to believe their claims of not
| collecting data about me.
| |
| As it stands, I only have their word for it... which in this day
| and age is pretty worthless.
| |
| That said, I'd still _much_ rather use DDG, who at least pay lip
| service to privacy, than sites like Google or Facebook, who are
| openly contemptuous of it.
| |
| At the very least it sends a message to these organizations that
| privacy is still valued, and they'd lose out by not trying to
| accommodate the privacy needs of their users to some extent.
| spinach wrote:
| Facebook and Google are huge, global companies where their main
| product is free, and yet they aren't a charity. The only way to
| be mega-rich and offer something free is to be shady and
| manipulative with users' data. Exploiting privacy is their
| business model. They aren't gonna respect it.
| |
| Being super financially successful off free products and
| services is not a recipe for an honest, citizen-respecting
| company.
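| [Editor's note: jbay808's suggestion upthread - downloadable
| ranking-modifier lists applied like adblock filter lists -
| could look something like the minimal sketch below. The list
| format, domains, and weights are all invented for
| illustration; no engine is known to implement this.]

```python
# Sketch of user-installable ranking-modifier lists: each list maps
# a domain to a multiplier that boosts (>1) or buries (<1) results.
from urllib.parse import urlparse

# A hypothetical "python-dev" list a community might publish and a
# user opts into, in the spirit of an adblock filter list.
PYTHON_DEV_LIST = {
    "docs.python.org": 3.0,    # boost the official docs
    "stackoverflow.com": 1.5,  # mild boost
    "w3schools.com": 0.1,      # effectively bury
}

def apply_modifiers(results, *lists):
    """results: [(url, base_score)]. Returns URLs re-sorted after
    multiplying each base score by every installed list's weight
    for that URL's host (unlisted hosts keep weight 1.0)."""
    def adjusted(item):
        url, score = item
        host = urlparse(url).netloc
        for mods in lists:
            score *= mods.get(host, 1.0)
        return score
    return [url for url, _ in sorted(results, key=adjusted, reverse=True)]
```

| [Because the lists are plain data, they could be maintained by
| third parties and stacked per user, as the comment proposes,
| without the engine hard-coding anyone's preferences.]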
| Dahoon wrote:
| DDG search costs the same as Google Search.
| dangus wrote:
| I don't even care about the privacy. (Well, I do, but in this
| context I have no reasonable way to ensure it.)
| |
| What I do care about is trust-building and monopolistic
| practices.
| |
| That, to me, is a great reason to use DDG instead of Google or
| even Bing.
| cpeterso wrote:
| I also prefer DDG's user interface over Google's. And DDG's
| !bang search shortcuts.
| |
| DDG has been my default search engine for years and its
| results are good enough for me 95% of the time. I only need
| to use Google as a fallback when searching for niche
| technical information or "needles in haystacks".
| jschwartzi wrote:
| Even then, the Google results are usually terrible. I
| haven't used Google as a fallback in about a year because
| every time I tried it they couldn't find what I was looking
| for either. Or they did something atrocious like changing
| my search terms for me.
| lambda_obrien wrote:
| Why couldn't several coordinating specialized search engines
| share their data via something like "charge the downloader"
| (Requester Pays) S3 buckets? Then you get an org like
| StackExchange that could provide indexed data from their site
| and the algorithms to search the data most efficiently; GitHub
| can do the same for their specific zone of speciality, Amazon,
| etc.
| |
| Then anyone who wants to use the data can either copy it to their
| own S3 buckets to pay just once, or can use it with some sort of
| pay-as-you-go method. Anyone who runs a search engine can use the
| algorithms as a guide for the specific searches they are
| interested in for their site, or can just make their own.
| |
| You could trust the other indexers not to give you bad data,
| because you'd have some sort of legal agreement and technical
| standards that would ensure that they couldn't/wouldn't "poison
| the well" somehow with the data they provide.
Further, if a bad
| actor were providing faulty data, the other actors would notice
| and kick them out of the group or just stop using their data.
| |
| It would have to be fully open source; I agree with the other
| parts of Drew's essay here, but I think we _could_ share the
| index/data somehow if we got together and tried to think about
| it. We just need a standard for how we share the data.
| cptskippy wrote:
| So you're proposing Snowflake for search?
| ricardo81 wrote:
| There's Common Crawl for the crawling aspect, about 3.2 billion
| pages last time I looked. One of the issues with that kind of
| detachment of jobs is crawl data freshness.
| moocowtruck wrote:
| let me guess, and better is drewdrewdevault
| api wrote:
| A major challenge with search in 2020 is that it's adversarial.
| Any open source search engine that gets popular is going to be
| analyzed by black hat SEO people and explicitly targeted by spam
| networks. Competently indexing and searching content is really
| only a small part of the problem now, with the adversarial "Red
| Queen's race" against black hat SEO and spam being the more
| significant issue.
| suff wrote:
| I take that back. Aside from loads of money and boundless dev
| time, you've got it all figured out :-)
| messo wrote:
| > If SourceHut eventually grows in revenue -- at least 5-10x its
| present revenue -- I intend to sponsor this as a public benefit
| project, with no plans for generating revenue.
| |
| I like this attitude. Makes me happy to be a paying member of
| SourceHut.
___________________________________________________________________
(page generated 2020-11-17 23:01 UTC)