[HN Gopher] EU Open Web Search project kicked off ___________________________________________________________________ EU Open Web Search project kicked off Author : ZacnyLos Score : 126 points Date : 2022-09-20 17:50 UTC (5 hours ago) (HTM) web link (openwebsearch.eu) (TXT) w3m dump (openwebsearch.eu) | Animats wrote: | Early version: https://www.chatnoir.eu/ | andrewmcwatters wrote: | I suspect search engines are an outdated concept for at least the | largest of sites, who will generally, but not always, have better | ways to directly search their own content. | | The remainder of the search problem seems to just be collecting | relevant trafficked sites for listing in results. Today Google et | al seem to be doing this BY HAND. And it's not even obfuscated. | | Recently, for the first time in my life, the wizard behind the | curtain seems to have been exposed. I feel strongly that one | could probably start a small index that catered to a fairly large | audience. | | And honestly, for other queries, just tell the user to search | that site directly. I think you could even market it to users as | not a technical limitation, but behavior that should be | considered fuddy-duddy. | | Like, really, you're going to search me? You know they have their | own search right? | | Even Yellow Pages faded into obscurity eventually. | ur-whale wrote: | > the largest of sites, who will generally, but not always, | have better ways to directly search their own content. | | I have the exact opposite experience. | | To wit: searching HN via the algolia link at the bottom is way | worse than searching on Google with a site:ycombinator.com | restrict. | | Same thing for YouTube, where the search engine is tuned for | maximizing watch time and strictly not to return what you're | looking for. 
| notright wrote: | the EU loves taxing productive companies and wasting said money | in stillborn projects that nevertheless promise a kind of bright | socialist federalist Europe in their bureaucratic minds | Comevius wrote: | At least we have some of the most livable countries on Earth to | show for it. I take taxes over any trickle-down economics, and | don't let me stop you looking up the definition of socialist, | because you are using it wrong. | | Besides, it's an 8.5 million EUR project, it's literally nothing, | it's payroll for a few people. The money is being invested into | people who then spend most of it, so it's a triple investment. | arjenpdevries wrote: | Isn't it lovely?! | notright wrote: | I am fine as long as they pay for these self-centered utopias | with their own money | hrbf wrote: | I've already caught their crawler ignoring robots.txt directives | on one of my sites, aggressively indexing explicitly excluded | information. | arjenpdevries wrote: | That cannot be true, as the project has yet to start. But | anyone can start a crawler, so you may have encountered other | people's software. We wouldn't be so unknowledgeable as to ignore | robots.txt ;-) | lizardactivist wrote: | Out of curiosity, what's the URL for your website, and from | what IP or host do their crawlers connect? | logicalmonster wrote: | What does "based on European values and jurisdiction" refer to? | I'd love to be pleasantly surprised, but this sounds like it's | ripe for centralized censorship. | notright wrote: | InTheArena wrote: | Given the history of the 20th century, this kind of comment | promoting European values and jurisdiction seems..... dicey. | Companies' ethical records, as shitty as they are, have nothing | on the mass destruction, genocide and stupidity of governments. | ur-whale wrote: | Looks like it's Northern EU only. | | No research institutes from {France, Italy, Spain, Greece, | Portugal, etc ...} involved.
| arjenpdevries wrote: | Slovenia, Czech Republic. But yes, I think there was a | competing proposal from Italy/Spain. Not enough budget for two | projects in this area, unfortunately, as they were good too. | marginalia_nu wrote: | I'm a bit skeptical EU-funding a bunch of professors is the way a | search engine will be built. | | The primary goal for academics is to publish new findings, while | what you need to build a search engine is rock solid CS and | information retrieval basics. Academically, it's not very | exciting. Most of it was hashed out in the 1980s or earlier. | hkt wrote: | ..correct me if I'm wrong, but Google was started by a couple | of postdoctoral researchers, no? | DannyBee wrote: | Who deliberately did not stay in academia to do it. More to | the point, a successful team building a product like a search | engine requires roles that academia doesn't really have. | | Who is doing product management? | | Who is doing product marketing? | | etc | | This is all applied engineering at this point, not R&D. How | does it at all fit into academia's strong suit? | mkl95 wrote: | > 14 European research and computing centers | | > 7 countries. | | > 25+ people. | | There are literally dozens of them! | | https://openwebsearch.eu/partners/ | marginalia_nu wrote: | I don't think the number of people or even the size of the | budget is wrong. A small team can be incredibly powerful and | productive if you have the right people. In fact, I think far | more often search engines fail from trying to start too big | than too small. | | The problem is that you need people who actually know how to | architect complex software systems much more than you need | revolutionary new algorithms. For that, professors are the | wrong people. A professor on the team, sure, that might be | helpful. Not half a Manhattan project's worth. | mkl95 wrote: | It happens all the time in Europe. 
Collaboration between | public and private companies is pretty much a pipe dream in | the EU. Some company that actually works on building search | technology would achieve way more than a bunch of | professors. | | I disagree on the budget though. It is basically pocket | change. | marginalia_nu wrote: | Arguably the biggest unsolved problem in search is | how to make a profit (or even break even). This can be | approached in two ways: You can either try to find some | way of making search more profitable, or you can find a | way to make search cheaper. I think the latter is a lot | more plausible than the former. | | A shoestring budget keeps the costs down by design and by | necessity. A large budget virtually ensures the search | engine becomes so expensive to operate it will never | break even. | [deleted] | jjulius wrote: | >I'm a bit skeptical EU-funding a bunch of professors is the | way a search engine will be built. | | Heh, so, funny story... | | >A second grant--the DARPA-NSF grant most closely associated | with Google's origin--was part of a coordinated effort to build | a massive digital library using the internet as its backbone. | Both grants funded research by two graduate students who were | making rapid advances in web-page ranking, as well as tracking | (and making sense of) user queries: future Google cofounders | Sergey Brin and Larry Page. | | >The research by Brin and Page under these grants became the | heart of Google: people using search functions to find | precisely what they wanted inside a very large data set. | | https://qz.com/1145669/googles-true-origin-partly-lies-in-ci... | imhoguy wrote: | "unbiased...based on European values" - will it fly? | topspin wrote: | European values are inherently unbiased. What's the problem? | o.O | tricky777 wrote: | Seems like a very interesting idea. So many times I wanted some | kind of advanced google-query-language. (I know about allinurl and | such, but that's not enough.
Google is tuned for the average user, | which is good for Google, but not for any non-average query) | dataking wrote: | I don't see any mention of Quaero, the EU search engine that was | supposed to compete with Google [0, 1]. How is this time | different? | | [0] https://en.wikipedia.org/wiki/Quaero | | [1] https://www.dw.com/en/germany-pulls-away-from-quaero- | search-... | arjenpdevries wrote: | For starters: the objective is to create the index, not the | engine; that's quite a different ambition. | | We are very aware of the Quaero/Theseus history :-) | marginalia_nu wrote: | What is the difference? | freediver wrote: | Supposedly the project is about just building the | platform/infrastructure (which is what the index is) upon | which search engines can be built. | | These search engines will then have the freedom to define | their own search product experience, business model, even | ranking of results. | jonas21 wrote: | So something even more vaguely defined and detached from | real use cases than last time? Great. | freediver wrote: | The above actually defines the scope very well. There is | a lot more to be built upon it, but it is not what the | project is trying to solve. | notright wrote: | notright wrote: | This was the past legislature project. The new legislature | brings CHANGE. They are not the same.. | thepangolino wrote: | dang wrote: | Url changed from https://www.zylstra.org/blog/2022/09/eu-open- | web-search-proj..., which points to | https://djoerdhiemstra.com/2022/open-web-search-project-kick..., | which points to this. | lucideer wrote: | Which now shows: | | > _Resource Limit Is Reached_ | | > _The website is temporarily unable to service your request as | it exceeded resource limit. Please try again later._ | | Original URL might be more resilient... | dang wrote: | Hmm. I can access the page without that message. In any case | the Internet Archive seems to have it: | | https://web.archive.org/web/20220920183027/https://openwebse. | ..
| Proven wrote: | rrwo wrote: | It will be interesting to see what the index contains, and how it | is structured. | | What made Google such a game changer was that they based their | index not just on the contents, but on how pages linked to each | other. | arjenpdevries wrote: | That's the marketing story. I think it's because they didn't | clutter their homepage like AltaVista did. | boyter wrote: | I have written this before but I'll put it here again. What I | would like to see is a federated search engine, based on | ActivityPub, that works like Mastodon. Don't like the results from | one source? Just remove them from your sources, or lower their | ranking. Similar to YaCy, but you can work with the protocol to | connect or build whatever type of index you want using whatever | technology you like, and communicate over an existing standard. | Want to build the world's best index of Pokemon sites? Then go do | it. Want to build a search engine using Idris or ATS? Sure! I did | note the professors are on Mastodon so perhaps this may actually | happen. | | One of these days I'll actually implement the above assuming | nobody else does. I figured if I can at least get the basics done | and a reference implementation that's easy to run it could prove | the concept. If anyone is interested in this, do email me (address | in my bio). | | What I worry about for this project is that it becomes another | island which prohibits remixing of results, like Google and Bing, | and its own index and ranking algorithms become gamed. | | I wish the creators the best of luck though. I am also hoping for | some more blogs and papers about the internals of the engine. So | little information is published in the space that anything is | welcome, especially if it's deeply technical. | fabrice_d wrote: | At least one of the partners | (https://openwebsearch.eu/partners/radboud-university/) does | research on "federated search systems", so there's hope!
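The federated model boyter describes above — independent indexes you can query, merge, demote, or drop entirely — can be sketched in a few lines. This is a hypothetical illustration with in-process stub "nodes" standing in for remote ActivityPub-style peers; the node names, documents, and weights are all invented:

```python
# Minimal sketch of federated meta-search: each "node" returns scored
# results for a query, and the client merges them, applying a per-node
# weight so users can demote (or remove) sources they distrust.

def pokemon_node(query):
    # A hypothetical niche index, as in the "best index of Pokemon
    # sites" example above.
    docs = {"https://pokedex.example": "pokemon stats and movesets"}
    return [(url, 1.0) for url, text in docs.items() if query in text]

def general_node(query):
    # A hypothetical general-purpose index with spammier content.
    docs = {
        "https://spam.example": "pokemon pokemon pokemon buy now",
        "https://wiki.example": "pokemon franchise history",
    }
    return [(url, 0.5) for url, text in docs.items() if query in text]

def federated_search(query, nodes):
    # nodes: list of (search_fn, weight); weight 0 removes a source.
    merged = []
    for fn, weight in nodes:
        if weight <= 0:
            continue
        merged.extend((url, score * weight) for url, score in fn(query))
    return [url for url, _ in sorted(merged, key=lambda r: -r[1])]

# Demote the general node after seeing too much spam from it:
results = federated_search("pokemon", [(pokemon_node, 1.0), (general_node, 0.3)])
print(results[0])  # -> https://pokedex.example
```

The point of the sketch is that ranking policy lives with the user, not the index: changing a weight reorders results without touching any node.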
| asim wrote: | One of the things I wonder here is if it would be easier to | just start by crawling known RSS feeds and then exposing a JSON | API for the data and making the whole thing open source. Then | keeping a public list of indexes and who crawls what. | Eventually moving into crawling other sources, but first | primarily addressing the majority of useful content that's | easily parseable. | TacticalCoder wrote: | > Don't like the results from one source? Just remove them from | your sources, or lower their ranking. | | That's basically Usenet killfiles and, yes, I think they're | totally due for a comeback in one form or another. Usenet may | have had its issues towards the end (although it still exists), | but killfiles weren't one of its problems. The simplest ones just | let you discard sources you didn't want to read anymore, while | the more advanced could assign weights/rankings based on | various factors (keywords / usernames / whether or not you | participated in a discussion / etc.). | arjenpdevries wrote: | We like federated search, we like decentralized search, and | even P2P search; we are trying to find a good mix, and decided | to get started rather than wait! Exciting times. | marginalia_nu wrote: | What are the benefits from this? | | I'm not trying to be dismissive; it's just my feeling from | working on search.marginalia.nu that nearly every aspect | of search benefits from locality. Not only is the full crawl-set | instrumental in determining both domain rankings and | term-level relevance signals such as anchor tag | keywords; the way an inverted index is typically set up | is also extremely disk-cache friendly, where the access pattern | for checking the first document warms up the cache for the other | queries, and that discount obviously only exists when it's | the same cache.
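marginalia_nu's locality point can be made concrete with a toy inverted index (a hypothetical sketch, not marginalia.nu's actual layout): each term's posting list is a sorted, contiguous array, so answering a multi-term query is a sequential intersection over adjacent memory, which is exactly the access pattern that keeps a disk cache warm:

```python
# Toy inverted index: each term maps to a sorted, contiguous list of
# document IDs. Multi-term queries intersect the postings; the
# sequential scan over contiguous lists is what makes this structure
# disk-cache friendly, as described in the comment above.
from bisect import insort

class InvertedIndex:
    def __init__(self):
        self.postings = {}  # term -> sorted list of doc IDs

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            insort(self.postings.setdefault(term, []), doc_id)

    def search(self, query):
        # Intersect posting lists, shortest first to minimise work.
        lists = sorted((self.postings.get(t, []) for t in query.lower().split()),
                       key=len)
        if not lists:
            return []
        result = lists[0]
        for plist in lists[1:]:
            s = set(plist)
            result = [d for d in result if d in s]
        return result

idx = InvertedIndex()
idx.add(1, "open web search index")
idx.add(2, "federated web search")
idx.add(3, "open data")
print(idx.search("open search"))  # -> [1]
```

In a federated setting each node holds its own postings, so the cache warmed by one sub-query cannot help another node, which is the "same cache" discount the comment refers to.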
| hkt wrote: | I would _love_ to be able to run a node that mirrors part or | all of an index like this, and to let people query it - a bit | like https://torrents-csv.ml/#/ | | Good luck! I'll be watching your progress and cheering you | all on! | cookiengineer wrote: | Isn't searx what you're describing? I was running an instance | for a while, and it's basically a meta search engine that has | support for all kinds of providers. | | There are also some web extensions available so that you can | fill it with more data. | | [1] https://searx.github.io/searx/ | boyter wrote: | Searx is half of it: it calls out to other searches but | does not provide its own index, as far as I can see. It also | does not remix the results. | vindarel wrote: | I'd say it rather looks like Seeks, unfortunately defunct: | https://en.wikipedia.org/wiki/Seeks | | > a decentralized p2p websearch and collaborative tool. | | > It relies on a distributed collaborative filter[6] to let | users personalize and share their preferred results on a | search. | googlryas wrote: | What benefit does federation bring here? Unless it is very | simple to set up, most communities are non-technical and | probably won't be able to set up their own crawler. I would | think just a search engine that lets you customize the ranking | algorithm, and maybe hook into whatever ontology they've | developed and rank it accordingly, would be sufficient. | melony wrote: | What's the point of a federated search engine? At the end of | the day most nodes will end up implementing the same | regulations/censorship with development driven primarily by a | few. It's like Ethereum vs Ethereum Classic all over again. If | the EU or the developers' respective governments demand a | censorship or forgetting feature to be implemented, it's not | like the federated nature would matter. An open source search | index is useful, a search engine that can be easily self hosted | is also useful.
But building a search engine as a federated | system is a gimmick with no significant value. | | Do you see any major Mastodon nodes interfacing with Truth | Social or Gab? I certainly don't. If federation barely works | for a social media app, I fail to see how it would even matter | for a search engine. | ur-whale wrote: | Search is way more than just indexing. | | I'd really like to see them match the 20+ years of search quality | fine-tuning that Google built into their search engine. | | Not that Google is as good as it used to be, but still, catching up | with them is way more complicated than just building a big crawl | + index piece of infrastructure. | | And all of that on a government-funded shoestring budget. | | Mmmh. | | Good luck to them, but I'm not holding my breath. | bslqn wrote: | merb wrote: | so it began, that sern starts to gather market share. | | -- | | I doubt this will take off. I mean they invested more in funding | and marketing instead of starting to build something. They | should've started with code (AGPLv3, of course) and invited more | and more people. At the moment this is more buzzword-bingo | bullshit than anything else. It's basically always the same | problem: instead of focusing on the product, they focus more on | the message. | s-xyz wrote: | Correct me if I am wrong, but so the purpose is to create an | index database, upon which custom search engines can be | attached? I.e., the EU will crawl all pages on the web? | murphyslab wrote: | The index is just the first step according to news articles: | | > Once the index has been created, the next step is to develop | search applications. | | > The team at TU Graz will be particularly active here in the | CoDiS Lab and will work on the conception and user-centric | aspects of the search applications. This includes, for example, | research into new search paradigms that enable searchers to | have a say in how the search takes place.
The idea is that | there are different search algorithms or that you can influence | the behavior of the search algorithms. For example, you could | search specifically for scientific documents or for documents | with arguments, include search terms that have already been | used, or include documents from the intranet in the search. | | https://www.krone.at/2791083 | rgrieselhuber wrote: | The real game-changer in search would be if companies would agree | to publish indexes of their own sites in an open standard to a | place that everyone could access. This would undercut the | monopoly power that large search engines have and allow everyone | to focus on innovating the best way to search that content vs. | having to spend so much time and money to crawl and index it. | _Algernon_ wrote: | People would abuse that for SEO purposes within seconds. | rgrieselhuber wrote: | The market need would then be shifted to the best search | interfaces instead of who has the most money to build the | biggest index. A much better focus, IMO. | TheFerridge wrote: | I believe that is precisely what the project is aiming to do, | and to turn it into a public resource. | arjenpdevries wrote: | We will explore that idea in the project; I also think it may | help (but it is vulnerable to Web index spam by adversarial | parties). | rgrieselhuber wrote: | That is indeed the biggest problem, but maybe something that | can be more effectively dealt with downstream by the content | rankers and potentially even the user base / custom search | algorithm builders. Brave's Goggles project is a good early | prototype of this concept. | freediver wrote: | A standard for this already exists [1] but it does not solve the | problems of | | 1. Implementation (sites do not need to have a sitemap; or | those that have it, may not have an accurate one) | | 2. 
Discoverability (finding sites in the first place; you'll | need a centralised directory of all sites, or resort back to | crawling, in which case sitemaps are not needed) | | 3. Ranking (biggest problem in creating a search engine) | | [1] https://www.sitemaps.org/protocol.html | rgrieselhuber wrote: | The sitemaps standard (if this is the basis) would need to be | expanded to support additional metadata / structured data to | support this idea. | | 1. This would be up to sites, to your point; the major question | would be the best way to create incentives. | | 2. This is solvable via a number of approaches, but the | search engines themselves would be mostly responsible for | finding the right approach for their business. I know how I | would do it. | | 3. Indeed, which would be the main point of this | decentralization, to let search engines focus on their | hardest problem. | | Edit: would Kagi not benefit from not having to worry about | crawling / indexing sites? | [deleted] | freediver wrote: | > would Kagi not benefit from not having to worry about | crawling / indexing sites? | | It would, but sitemaps do not provide that function, as we | discussed above. However, if EU Open Web Search succeeded, | that is something we could probably use to some extent. | wizofaus wrote: | I suspect you underestimate how much of the power of search | engines is being able to interpret search queries and figure | out what a user is really looking for. Even if there were a | public, standardised, up-to-date, high-performance full-text | index of the entire web freely available, I'm willing to bet | Google search would be a useful value-add in its ability to | answer natural language queries. | rgrieselhuber wrote: | I run an SEO platform SaaS, so I'm familiar. :) | spookthesunset wrote: | I'm pretty sure we tried that way back in the day with <meta | name="keywords" content="spam spam spam spam">. People would | stuff that with every word in the English language.
Older | search engines that used those keywords returned some pretty | awful results. You simply can't trust sites, who have a strong | incentive to get to the top of SEO rankings, to not lie. In | fact, given at least one of your competitors will stuff their | keywords to get to the top you'll have to do it too. It would | become an arms race for who can stuff the most garbage into | their indexes to "win". It just doesn't work. | | All search engines that attempt to be useful will have to | filter out the junk. You just have to trust that the search | engine you are using isn't withholding results from you that it | considers "bad" (eg: "misinformation" (i.e. stuff somebody | disagrees with)). | | And to me, that is the crux of the debate really. Nobody wants | spam for search results--everybody agrees with that and there | is no real debate about filtering that crap out. The argument | really is should a very large company that has a huge market | share get to decide what constitutes "fact" and what is | "misinformation". Based on 2.5 years of experience so far, what | was once deemed "misinformation" has a sneaky way of becoming | "factual information". Labeling and hiding "misinformation" | because it goes against some narrative pushed by incredibly | powerful entities is very scary and there was a hell of a lot | of exactly that going on during this covid crap. | | I used to fall on the side of "private companies can do | whatever they want" but now I'm not so sure. Companies like FB, | Twitter or Google play a huge role in shaping politics and | society. I'm no longer convinced it is okay to let them play | the role of "fact checker" or anything like that. Filtering | spam is one thing, but hiding "misinformation" is entirely | different. | rgrieselhuber wrote: | Your last point is also the one (aside from the economics) I | am the most interested in. 
| | I think we live in a world now where we are so used to a few | tech giants mediating everything for us that we can't even | imagine other solutions to this problem, but it's also how we | got to this point in the first place. | closedloop129 wrote: | >You simply can't trust sites, who have a strong incentive to | get to the top of SEO rankings | | Why is it not enough to punish sites that abuse the keywords? | spookthesunset wrote: | Who is the one who punishes the abusers? How can you scale | the solution to deal with billions of pages? | bobajeff wrote: | One problem with that is now you have to trust the websites to | give an accurate index of their content. | jeffbee wrote: | Anyone who thinks this will work has never tried to index a | site. A huge amount of effort is spent trying to figure out | if the site is serving different content to users vs | crawlers, or if the site is coded to appear visually | different to humans vs machines. If you ask sites to index | themselves you will get lies only. | rgrieselhuber wrote: | I index sites all the time and I think it could work. There | will be other problems, of course, but we already are | partly there with XML sitemaps. Relying on the large search | engines to enforce "honesty" from websites puts them into a | mediator role that has a number of negative effects both | for search in general and, increasingly, society at large. | kittiepryde wrote: | Relying on sites to be honest about themselves is even | less likely. There are monetary incentives for many of | them not to do that. Many sites host dishonest and | clickbait content with extreme levels of SEO already. The | cost of dishonesty decreases if you can directly modify | the index. | rgrieselhuber wrote: | I think that is primarily a symptom of the fact that we | have a bottleneck on search interface providers. 
If it | were easier / cheaper for new search engines / rankers to | exist in the market, they could fairly easily filter out | unscrupulous domains. | wumpus wrote: | I've run a web-scale search engine and I don't think it | will work. | rgrieselhuber wrote: | Indeed | boyter wrote: | I'd rather see them publish a federated search of their own | content. | rgrieselhuber wrote: | Your comment prompted me to check out Searchcode; it looks very | interesting. How would the federated search model work in | this example? Instead of you having to index the various code | repositories, they would index themselves and make their | search of those indexes available via a federated API? | rrwo wrote: | There are already sitemaps, and pages use structured data like | HTML5/ARIA roles, RDF or JSON+LD to provide some semantic | annotations. | | I'd rather that web robots use this information to build useful | indexes than have to worry about generating yet another feed | in the hopes that it helps people find my content in a search | engine. | | Besides, a web robot can determine how much other sites link to | my content and help determine its overall ranking in results. | Adding another type of index file to my site will do nothing to | determine how it relates to other sites. | rgrieselhuber wrote: | The structured data on sites, unfortunately, still requires a | crawler to index that content, which serves as a barrier for | search engine startups. At a minimum, adding some metadata | content to XML sitemaps would go a long way to solving some | of this problem (title, meta description, content summary, | even structured data in the sitemaps). | Eduard wrote: | What's the problem of using any of the many free webcrawler | (libraries) available to crawl a website (even if solely | based on the pages advertised by sitemap.xml / robots.txt- | announced sitemaps), then extract structured data from | these pages? | | I don't see this as a barrier unique to startups. 
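The crawl entry point Eduard describes — discover a site's sitemap via robots.txt, then read the advertised page URLs out of the sitemap XML — can be sketched with the Python standard library alone. The robots.txt and sitemap bodies below are made-up fixtures, not a real site:

```python
# Sketch of sitemap-driven crawling: parse robots.txt for crawl rules
# and the advertised sitemap, then extract the page URLs from it.
import urllib.robotparser
import xml.etree.ElementTree as ET

# Made-up fixtures standing in for https://example.org's actual files.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.org/sitemap.xml
"""

sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>
"""

# Honour crawl rules and discover the advertised sitemap.
rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.site_maps())  # -> ['https://example.org/sitemap.xml']
print(rp.can_fetch("*", "https://example.org/private/x"))  # -> False

# Pull page URLs out of the sitemap (namespace-aware; bytes input,
# since the XML carries an encoding declaration).
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml.encode("utf-8"))
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # -> ['https://example.org/', 'https://example.org/about']
```

As rgrieselhuber notes above, the hard part is not this per-site mechanics but the cost of doing it at web scale.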
| rgrieselhuber wrote: | It's easy to do for small sets of sites, but try doing | this at web-scale and you quickly run into a large | financial barrier. It's not about technical feasibility | as much as it is cost. | DOsinga wrote: | With a budget of 8.5M EUR/USD. Alphabet spends 200B per year. If | 40% of that is spent on search, their budget is 10 thousand times | larger. | lucideer wrote: | It's definitely a comparative underdog regardless, but if you | think Alphabet spends anywhere near 40% on search you're out of | your mind. I'd be shocked if their spend is double-digits. I'd | be unsurprised if it's <1%. | o_m wrote: | I doubt 40% is spent on search. Seeing how bad Google has | gotten, it seems more likely there is just a skeleton crew | keeping the lights on | mkl95 wrote: | I would be shocked if Alphabet spent >5% on search. But even 1% | would dwarf this project. | antics9 wrote: | We need to develop a social aspect to search where results are | also moderated and curated by humans in some way. | topspin wrote: | And when that curation produces results you find abhorrent? | What then? Because I guarantee it would; a metaphysical | certitude. | Extropy_ wrote: | On first glance, I see the word "unbiased" immediately followed | by "based on European values". Now, I'm no expert, but to me, | that seems pretty biased. | radiojasper wrote: | biased on European values | nathan_phoenix wrote: | This is just a short reply to a blog which mentions that the | project started... | | The actual website of the project (with some concrete info) can | be found here: https://openwebsearch.eu/ | dang wrote: | Changed now. Thanks! | [deleted] | jacooper wrote: | > A new EU project OpenWebSearch.eu ... [in which] ... the key | idea is to separate index construction from the search engines | themselves, where the most expensive step to create index shards | can be carried out on large clusters while the search engine | itself can be operated locally. 
...[including] an Open-Web-Search | Engine Hub, [where anyone can] share their specifications of | search engines and pre-computed, regularly updated search | indices. ... that would enable a new future of human-centric | search without privacy concerns. | | So.. Who's going to create the index? Indexing the web is | expensive, and it's offset by the ads the indexer runs on their | search website, as with Google, Bing, Brave and others. | amelius wrote: | I wonder how privacy will be ensured when your query hits the | map-reduce infrastructure running on these clusters. | | Regarding privacy, the bar is significantly higher than what | Google has to deal with. This will come at some cost in quality | and/or speed. | caust1c wrote: | Every individual website has an incentive to create indices of | their own content, and hosting providers could provide it as a | service. Not hard to envision. Search engines could download | these indices periodically to build the meta-search. | wizofaus wrote: | Also not hard to envision websites being incentivised to lie | in their indexes. | moffkalast wrote: | Someone who's snagging an EU grant, that's who. | ur-whale wrote: | > Someone who's snagging an EU grant, that's who. | | Bullseye. | beardedman wrote: | Oh cool, but do you mean the "EU Open Web Search Data Collection | Program"? ___________________________________________________________________ (page generated 2022-09-20 23:00 UTC)