[HN Gopher] Marginalia: DIY search engine that focuses on non-co...
       ___________________________________________________________________
        
       Marginalia: DIY search engine that focuses on non-commercial
       content
        
       Author : thunderbong
       Score  : 460 points
       Date   : 2023-04-18 09:53 UTC (13 hours ago)
        
 (HTM) web link (search.marginalia.nu)
 (TXT) w3m dump (search.marginalia.nu)
        
       | gerdesj wrote:
       | What a cracking resource! When you need to get away from the
       | beige web, a few clicks on Random Mode is all you need.
        
       | throwaway280382 wrote:
       | @marginalia_nu, a few months ago you said you would consider open
       | sourcing this search engine. Are there any tasks that we GitHub
       | warriors can help with?
        
       | PaulHoule wrote:
       | Wow! Relevance looks good for the queries I tried, and I like the
       | square interface too.
        
       | barbs wrote:
       | How does it compare to https://wiby.me/ ?
        
         | marginalia_nu wrote:
         | Wiby is manually curated, which means their results align much
         | better with the operator's vision. Marginalia's index is orders
         | of magnitude bigger, but not all of it is as consistently good
         | as Wiby.
        
       | marginalia_nu wrote:
       | Watch my computer struggle here: https://www.marginalia.nu/stats/
        
         | antman wrote:
         | What is the stack? Can it scale up?
        
           | marginalia_nu wrote:
           | Custom index software built from scratch in Java. MariaDB
           | link database. The entire search engine runs on a PC in my
           | living room.
           | 
           | You could pretty trivially shard the index by `hash(domain) %
           | numShards`. There's no support for this because I literally
           | only have this single server, but it wouldn't be much work.
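
A minimal sketch of the `hash(domain) % numShards` idea, in Python rather
than the engine's Java. The function names and the choice of SHA-256 are
assumptions for illustration; the thread does not say which hash would be
used. A stable digest is used because Python's built-in hash() is salted
per process and would assign domains differently on every run.

```python
import hashlib


def shard_for(domain: str, num_shards: int) -> int:
    """Map a domain name to a shard number in [0, num_shards)."""
    # Normalise so "Example.COM" and "example.com" land in the same shard.
    key = domain.strip().lower().encode("utf-8")
    # Stable across processes and machines, unlike the built-in hash().
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(domains, num_shards):
    """Group domains into num_shards lists, one list per shard."""
    shards = [[] for _ in range(num_shards)]
    for domain in domains:
        shards[shard_for(domain, num_shards)].append(domain)
    return shards
```

Every document from one domain ends up in the same shard, so per-domain
work never has to cross shard boundaries. Changing the shard count
reassigns most domains, which a plain modulo scheme does not avoid.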
        
             | [deleted]
        
           | toyg wrote:
           | https://github.com/MarginaliaSearch/MarginaliaSearch
        
       | dylan808hewitt wrote:
       | [dead]
        
       | qwertox wrote:
       | What really drives me crazy with Google is that they think that
       | it is ok to not label potentially paywalled articles as ads.
       | 
       | I get tricked so often into clicking a news snippet offered by
       | Google only to then land on a site which not only presents me a
       | paywall, but also wants me to accept their cookie policy
       | _before_ they present me the paywall.
       | 
       | It makes me angry every time anew.
        
       | HarHarVeryFunny wrote:
       | I'm not sure how effective it is at present - just gave it a very
       | quick test on searching for info on roman coins - but the concept
       | is great. This is something that I've often wished existed.
       | 
       | If I'm searching for roman coins I certainly don't want to find
       | commercial sites selling them (I know what those are), or even
       | the well-known online national collections or auction archives...
       | I'd like to be able to find the specialist sites built by
       | collectors (and maybe academics) that are non-commercial and way
       | more interesting.
       | 
       | In the early days of the internet some specialist content/pages
       | were organized into "web rings" each linking to each other, but
       | nowadays we're mostly relying on search to discover new pages,
       | and it seems a lot of the hobbyist content is way harder to find,
       | assuming it's even out there.
        
         | marginalia_nu wrote:
         | What did you search? Try just 'roman coins'
         | 
         | #1: http://www.romancoins.info/Content.html
         | 
         | #2-4: were not very good
         | 
         | #5: https://www.forumancientcoins.com/dougsmith/voc1.html
         | 
         | #6: https://www.cngcoins.com/Greek+and+Roman+Coins.aspx
         | 
         | #7: https://www.crystalinks.com/romecoins.html
         | 
         | If you search specifically for the 'as', it may be eaten as a
         | stop word :-/
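
To illustrate how a stop-word filter can swallow a query term such as the
Roman 'as', here is a small sketch. The stop-word list and the function
name are hypothetical; the thread does not show Marginalia's actual list.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "as", "of", "the", "to"}


def query_terms(query: str) -> list[str]:
    """Lowercase the query, split on whitespace, drop stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]
```

With this list, query_terms("roman coins") keeps both words, while
query_terms("as") comes back empty, so there is nothing left to search for.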
        
           | HarHarVeryFunny wrote:
           | I was searching "imp constantinvs" which is part of the
           | legend on many coins of constantine the great. Would expect
           | to see these details listed on any hobbyist sites, as well as
           | the commercial ones I'm not interested in.
           | 
           | BTW #1, 5, 6 are all good sites, but those are very
           | mainstream - those will be top links in Google as well. #6 is
           | purely commercial - an auction house. #5 is a coin dealer's
           | commercial site, but has good collector resources (discussion
           | board, Wiki, collectors galleries) as well.
        
             | marginalia_nu wrote:
             | Do you know of any hobbyist sites within this space? I want
             | to check a thing, could be this corner of the internet
             | isn't well indexed. I should be able to tell with explore2.
        
               | HarHarVeryFunny wrote:
               | constantinethegreatcoins.com is one - the owner is also a
               | dealer, but this is his private hobbyist site.
               | 
               | Some examples of other non-commercial roman coin hobbyist
               | sites (that will also rank fairly highly with Google)
               | are:
               | 
               | augustuscoins.com wildwinds.com beastcoins.com
               | www.notinric.lechstepniewski.info
               | https://www.nummus-bibleii.com/
               | 
               | I'm at work right now, so these are just some examples
               | off the top of my head. I can give more examples later if
               | it's useful. Some of these sites will include links to
               | other collector/hobbyist sites.
        
               | marginalia_nu wrote:
               | Hmm, several of those weren't indexed, I added them to
               | the crawl queue. Seems like the numismatics corner of the
               | web isn't well indexed by marginalia.
               | 
               | constantinethegreatcoins does show up for 'imp
               | constantinvs' though.
        
       | CalRobert wrote:
       | Marginalia comes up on HN every so often, and I always look at
       | it, think "oh that's neat!", maybe add it to my bookmarks
       | toolbar, and then forget about it. Are there a lot of people who
       | find themselves using it daily?
        
         | frogulis wrote:
         | I use it from time to time when I want to read something
         | interesting. It can be a great source of articles that feel HN-
         | worthy, if that makes sense.
        
         | yuhong wrote:
         | [dead]
        
         | Firmwarrior wrote:
         | I use it as my daily time-waster, it has a lot more interesting
         | stuff than you'll find on sites like Reddit or Twitter
         | 
         | https://search.marginalia.nu/explore/random
        
           | newqer wrote:
           | What have you done? I tried to cut back on Reddit, but this
           | seems like I could go down a rabbit hole for a few hours per
           | day.
        
         | marginalia_nu wrote:
         | I don't even use it daily. It's not a Google replacement, and
         | it's not trying to be either. It's more of an on-ramp for the
         | obscure web at this point.
         | 
         | That said, it's gotten way better at finding stuff with the
         | last few releases.
        
           | GravitasFailure wrote:
           | It's interesting, and I really appreciate that you aren't
           | trying to out-Google Google. This seems like a useful tool in
           | its own right in addition to what else exists.
        
             | marginalia_nu wrote:
             | I don't think that would make sense. If Google is
             | struggling with search, a one man Google clone isn't going
             | to do it better.
             | 
             | I also think that having "a google", one central search
             | engine, is inherently a bad thing for the health of the
             | Internet. It drives a lot of this search engine spam
             | epidemic we're seeing.
             | 
             | A broader and (IMO) more interesting problem is Internet
             | discovery.
        
               | throwaway14356 wrote:
               | without/before commerce one would link to similar
               | websites as much as possible. Now those are called
               | competitors.
               | 
               | I bet one could make a fascinating ranking algo that
               | groups sites by subject, then sorts them by number of links
               | to others in that group.
               | 
               | So the perfect SEO would be to have a blogroll at the top
               | of the left menu with every related website in it.
               | 
               | i.e. 3 stores sell the same item. Nr 1 is the one linking
               | to the other 2. Extra points for linking to that specific
               | product page.
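
The ranking idea in this comment can be sketched as follows. This is only
an illustration of the commenter's proposal, not anything Marginalia does;
the function name and the input shapes are assumptions.

```python
from collections import defaultdict


def rank_by_in_group_links(subject_of, outlinks):
    """Group sites by subject, then order each group by how many
    other sites in the same group a site links to (most first).

    subject_of: dict mapping site -> subject label
    outlinks:   dict mapping site -> set of sites it links to
    """
    groups = defaultdict(list)
    for site, subject in subject_of.items():
        groups[subject].append(site)

    ranked = {}
    for subject, sites in groups.items():
        members = set(sites)

        def score(site, members=members):
            # Count links to other members of the group, ignoring self-links.
            return len((outlinks.get(site, set()) & members) - {site})

        # Ties are broken alphabetically to keep the output deterministic.
        ranked[subject] = sorted(sites, key=lambda s: (-score(s), s))
    return ranked
```

In the three-store example, the store that links to the other two gets a
score of 2 and is listed first.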
        
               | hedora wrote:
               | Google used to be a ~one man search engine clone (and it
               | was definitely better a few years after that than it is
               | today).
        
               | marginalia_nu wrote:
               | A lot of Google's initial quality was due to the fact
               | that the content it indexed was much higher quality.
               | 
               | Even besides the point that the websites they indexed
               | were a lot less adversarial, they put a lot of emphasis
               | on indexing academia, and were outspoken against what
               | came to be their present mixed motives[1].
               | 
               | [1] http://infolab.stanford.edu/~backrub/google.html#a
        
               | robin_reala wrote:
               | Well, two-man.
        
               | marginalia_nu wrote:
               | I think it was actually three-man with Scott Hassan.
        
         | pxc wrote:
         | I don't use it daily, but I have reached for it multiple times
         | in the last few weeks. I like it for finding blog posts,
         | tutorials, comparisons, and hobby projects without getting
         | caught up in fake articles like SEO-heavy wikis of copy-pasted
         | content.
        
         | cpach wrote:
         | Use it maybe once every week or something like that. I like it.
         | 
         | I use it mostly for tech/programming/FOSS stuff. Especially for
         | programming topics it can be good for filtering out all the
         | 'w3schools' type of blog spam that just floods Google's
         | results.
        
       | themodelplumber wrote:
       | > The Random Mode has been overhauled, and is quite entertaining.
       | I encourage you to give it a spin.
       | 
       | Yep, this is a good example of warping the Feeling Lucky pattern
       | into a really neat little discovery tool.
       | 
       | IMO it would even be cool if the site was this part first, oh and
       | hey it's also a search engine.
       | 
       | (While I'm random-ing: The Arch Wiki is in there? Seriously? Just
       | for that, I propose that it either be skinned to max vaporwave,
       | or host a webcam pointed at a Manjaro machine, or both...I'll be
       | waiting over here, downloading 4.1 GB of marginalia for my AUR
       | build of PCManFM)
        
         | selfhoster11 wrote:
         | Reminds me of StumbleUpon. I still miss that.
        
           | arcanemachiner wrote:
           | That period gave rise to the most educational web surfing
           | I've ever experienced.
        
         | BiteCode_dev wrote:
         | Stumble upon used to do this. It was really cool.
        
           | justusthane wrote:
           | StumbleUpon was amazing. I feel like that was really peak
           | internet, at least for me. I found so many weird, awesome
           | things.
        
             | marginalia_nu wrote:
             | It really could only work around the time it existed I
             | feel. The internet _was_ a lot weirder back then.
             | 
             | One big difference then from now is that you basically need
             | a PhD in the Canvas API (or WebGL or whatever) to
             | accomplish something a 5 year old could do in Flash. Web
             | design was a lot more accessible. You didn't have to worry
             | about responsive designs and fluid layouts. You could just
             | position:absolute everything and that was kinda fine.
        
               | giantrobot wrote:
               | I think you might have some nostalgia goggles on at the
               | moment. There's nothing holding people back from making
               | "weird" web pages today, they can even make them nice and
               | responsive. One of the better concepts around HTML and
               | CSS was separation of data and layout.
               | 
               | It's trivial to have a "weird" position:absolute design
               | with a break for mobile that switches to a more fluid
               | layout. Desktop users can have their "weird" layout but I
               | can still read the page on mobile and you can readily
               | crawl and index it.
               | 
               | People moved away from design tools like DreamWeaver that
               | helped make "weird" stuff and instead installed WordPress
               | or some CSS/JavaScript framework that just bakes in all
               | the "boring" fluid layouts.
               | 
               | You're not necessarily wrong about Flash in terms of
               | design or creation but your search engine wouldn't be
               | terribly practical if everyone was still using Flash for
               | everything. Flash allowed content packed inside SWFs but
               | also allowed fetching external resources. You wouldn't be
               | able to index any of that unless your crawler executed
               | the Flash and/or inspected all the URL references for
               | external resources.
               | 
               | Flash created an inaccessible deep web just like today's
               | JavaScript website-is-an-application "sites".
               | 
               | Don't get me wrong, I love the old web with quirky table-
               | based layouts, "unofficial" fansites, and personal
               | homepages hosted on forgotten university servers in a
               | closet. There was a vibrancy that's largely missing from
               | today's web.
               | 
               | I think a big change has been that tools have become more
               | geared toward the boring than the creative, and people treat
               | content on the web as a side hustle. Google et al.
               | haven't helped by favoring recency over other relevance
               | factors.
        
               | counttheforks wrote:
               | Is it trivial enough that a 5 year old could do it in a
               | point and click editor?
        
               | giantrobot wrote:
               | It could be. But modern tools don't bother. Then again,
               | Flash's usability by a 5 year old is being a bit oversold
               | here.
        
             | themodelplumber wrote:
             | There are quite a few interesting alternatives these days.
             | Bored Button is one. There are some that are even more like
             | it but I'm away from grep and my notes at the moment.
             | 
             | Reddit even has some kinda-similar subs.
        
         | marginalia_nu wrote:
         | I haven't got the time to curate this stuff. There's like
         | 10,000 domains in the list. It's some one off SQL script I
         | think that generated the sample based on parameters lost to
         | time.
        
           | muyuu wrote:
           | are domains whitelisted?
        
             | marginalia_nu wrote:
             | The domains you get from browse:random are from a small
             | selection yeah. But if you start traversing with "similar"
             | there is no such limitation, only limit is that they must
             | have a screenshot.
             | 
             | (There's also explore2.marginalia.nu which is not even
             | limited to websites with a screenshot)
        
       | dang wrote:
       | Related. Others?
       | 
       |  _Marginalia Search has received an NLNet grant_ -
       | https://news.ycombinator.com/item?id=34945541 - Feb 2023 (17
       | comments)
       | 
       |  _A Theoretical Justification (2021)_ -
       | https://news.ycombinator.com/item?id=32586273 - Aug 2022 (22
       | comments)
       | 
       |  _The Evolution of Marginalia's Crawling_ -
       | https://news.ycombinator.com/item?id=32565052 - Aug 2022 (22
       | comments)
       | 
       |  _Botspam apocalypse_ -
       | https://news.ycombinator.com/item?id=32339314 - Aug 2022 (346
       | comments)
       | 
       |  _Marginalia Goes Open Source_ -
       | https://news.ycombinator.com/item?id=31536626 - May 2022 (72
       | comments)
       | 
       |  _Uncertain Future for Marginalia Search_ -
       | https://news.ycombinator.com/item?id=31200319 - April 2022 (37
       | comments)
       | 
       |  _Marginalia Search: 1 Year_ -
       | https://news.ycombinator.com/item?id=30823481 - March 2022 (29
       | comments)
       | 
       |  _Show HN: Marginalia - Exploration Mode_ -
       | https://news.ycombinator.com/item?id=30047455 - Jan 2022 (53
       | comments)
       | 
       |  _A search engine that favors text-heavy sites and punishes
       | modern web design_ -
       | https://news.ycombinator.com/item?id=28550764 - Sept 2021 (717
       | comments)
       | 
       | (just as a reminder, these lists are only to satiate curious
       | readers - there's no reproach for reposting! Reposts are fine on
       | HN after a year or so: https://news.ycombinator.com/newsfaq.html)
        
       | nullandvoid wrote:
       | I get no results at all for this seemingly simple
       | https://search.marginalia.nu/search?query=how+to+draw+a+3d+b...,
       | presume it's been HN hugged to death?
        
         | shp0ngle wrote:
         | nah it just has a tiny index.
         | 
         | search for just "3d box" or something like that.
        
         | melx wrote:
         | No, it really means "404 nothing found" for your search
         | query. I searched for my company name and got nothing as well.
         | A bit surprising since it says "search the Internet" ;-)
        
         | marginalia_nu wrote:
         | It doesn't do semantic search or synonyms. Think keywords, not
         | questions.
         | 
         | Search for "draw 3d box" or "draw a cube" and it starts giving
         | results.
        
       | overthemoon wrote:
       | The "random" button sent me on an hour long rabbit hole and I
       | learned about (among other things) the gopher protocol. A+, would
       | lose that time again.
        
         | gerdesj wrote:
         | My first experience of the internet was telnet from a Win 3 box
         | to an X.25 PAD and then telnet to something on JANET (UK), then
         | something US based (NSF I think) and fire up Gopher or WAIS.
         | 
         | Later my boss asked me to look at this web thing that he had
         | heard about. I fired up telnet and eventually found an on ramp
         | to CERN. To me it looked rather like everything else but I'm
         | not exactly a rocket scientist!
         | 
         | https://www.w3.org/History/1992/WWW/FAQ/WAISandGopher.html
        
       | qwertox wrote:
       | Hmm, "getting started with react" yields the following as the
       | first match:
       | 
       | "https://frontendmasters.com/courses/complete-react-v5/gettin...
       | Getting Started with Pure React - Complete Intro to React, v5 |
       | Frontend Masters The "Getting Started with Pure React" Lesson is
       | part of the full, Complete Intro to React, v5 course featured in
       | this preview video."
        
         | marginalia_nu wrote:
         | Thanks for pointing it out. I blacklisted the domain. I don't
         | mind commercial content, but if they're using SEO like that
         | they're being a nuisance.
        
       | [deleted]
        
       | melx wrote:
       | All seems nice until I get to the search results, which I
       | cannot "read". It's very difficult to read output that runs 5
       | boxes across, with each such row then continuing vertically
       | "forever". It's like a yellow pages book from the 1990s.
        
         | marginalia_nu wrote:
         | Yeah, the "magic: the gathering" layout has some limitations. I
         | want something that makes good use of a large screen though.
         | 
         | I've got some ideas in the pipe, but haven't had the time to
         | give them enough polish that I'm happy with them.
         | 
         | This is an early draft: https://imgur.com/a/vMVO7CK
        
           | melx wrote:
           | Oh I see! It displays well on mobile devices.
           | 
           | The draft looks nice. The text colour is a bit hard to
           | distinguish from the surrounding background, and I don't have
           | any eye conditions.
        
             | marginalia_nu wrote:
             | Yeah, the contrast is one of the things I'm not entirely
             | happy with. The positioning is also a bit off, especially
             | if you resize the window a bit. As stated, needs polish.
             | But I really like the idea of the search engine being a bit
             | more transparent with how it works.
        
       | gavinhoward wrote:
       | Yay! Marginalia considers my site important and good enough to be
       | indexed!
       | 
       | Honestly, this makes me really happy. I would prefer that my
       | traffic be driven by curated search engines, even if I get less
       | traffic.
       | 
       | Also, I use this. I think it's great.
        
       | hk__2 wrote:
       | Is it limited to English? I made a few queries in French and
       | Italian and got either no results or a couple of irrelevant
       | ones.
        
         | marginalia_nu wrote:
         | Yes.
         | 
         | It's in part a measure to limit the scope of the project (the
         | entire thing runs off a single PC), but it's also hard to build
         | a good language model for a language you don't speak, and I
         | only speak English and Swedish. But if the project grows, gets
         | more hardware, and contributors that speak other languages,
         | then maybe this will change in the future.
        
       | repeekad wrote:
       | I remember being so excited about the search engine Neeva because
       | they seemed to be building a full-fledged independent web index
       | with top notch talent, I was really bought into the idea of a
       | premium new search experience that I could pay for (no ads) and
       | revenue share some kinds of content. But years later they focused
       | on crypto and AI instead, and I always find myself just googling
       | it because I have ad block and the results are more relevant,
       | sigh
        
         | whiplash451 wrote:
         | Per their website, they still seem to focus on ad-less search.
         | What am I missing?
        
           | [deleted]
        
         | ajmurmann wrote:
         | There also is kagi.com which should fit all the requirements
         | you described.
        
           | phendrenad2 wrote:
           | Such a bad name though. Try telling your friends about kagi.
           | They'll type "kaggy". Maybe it'll be popular in Japan?
        
             | kevincox wrote:
             | FWIW Googling "kaggy search" has "Did you mean: kagi
             | search" at the top and kagi.com as the second hit.
        
           | slekker wrote:
           | Their new pricing is quite unaffordable now :(
        
             | ajmurmann wrote:
             | It's really bad. For years I've been saying that I want
             | someone to make a search engine where I am the customer and
             | that I am happy to pay. Now kagi is here and I am too cheap
             | to pay for it. I feel called out.
        
       | cloudyporpoise wrote:
       | I've always been curious about how search engines seed their
       | scanning and index programs. Like how do you know what domains,
       | IPs, etc. to start scanning, and where is the origin?
        
         | gertgoeman wrote:
         | I remember reading somewhere that Google used dmoz
         | (https://en.wikipedia.org/wiki/DMOZ) as seed page for their
         | crawler. Not sure if it's true though...
        
         | ddorian43 wrote:
         | Start with Common Crawl and go from there.
        
         | djoldman wrote:
         | That may be a much easier question to answer than discovery.
         | 
         | How do you discover relevant new domains?
        
           | marginalia_nu wrote:
           | I've actually sort of solved this recently. Marginalia's
           | ranking algorithm is a modified PageRank that instead of
           | links uses website adjacencies[1].
           | 
           | It can rank websites even if they aren't indexed, based on
           | who is linking to them.
           | 
           | Vanilla PageRank can't do this very well. Domains that aren't
           | indexed don't have (known) outgoing links, so they end up in
           | the periphery of the ranking. There are some tricks to get
           | these to not mess up the algorithm completely, but they
           | basically all rank poorly. That's even without considering all
           | the well known tricks for manipulating vanilla PageRank. The
           | modified version seems very robust with regards to both problems.
           | 
           | [1] https://memex.marginalia.nu/log/73-new-approach-to-
           | ranking.g...
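
A rough sketch of what a PageRank over "adjacencies" instead of links
could look like. Everything specific here is an assumption: the thread
does not define adjacency, so the sketch takes it to be the cosine
similarity between the sets of domains that link to two sites, and then
runs an ordinary damped power iteration over that weighted graph. The
function names and parameters are hypothetical.

```python
import math


def cosine_similarity(a: set, b: set) -> float:
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def adjacency_rank(inlinks, damping=0.85, iterations=50):
    """inlinks: dict mapping domain -> set of domains known to link to it.

    Returns dict mapping domain -> rank, with ranks summing to 1.
    """
    domains = sorted(inlinks)
    n = len(domains)

    # Symmetric edge weights: two domains are "adjacent" when the
    # same sites link to both of them (assumed definition).
    weight = {d: {} for d in domains}
    for i, a in enumerate(domains):
        for b in domains[i + 1:]:
            w = cosine_similarity(inlinks[a], inlinks[b])
            if w > 0:
                weight[a][b] = w
                weight[b][a] = w

    rank = {d: 1.0 / n for d in domains}
    for _ in range(iterations):
        nxt = {d: (1.0 - damping) / n for d in domains}
        for d in domains:
            total = sum(weight[d].values())
            if total == 0:
                # No neighbours: spread this domain's rank evenly.
                for e in domains:
                    nxt[e] += damping * rank[d] / n
            else:
                for e, w in weight[d].items():
                    nxt[e] += damping * rank[d] * w / total
        rank = nxt
    return rank
```

Only inbound links are needed as input, so a domain can receive a rank
without having been crawled itself, which matches the property described
in the comment.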
        
         | marginalia_nu wrote:
         | It's basically seeded with my personal bookmark list. Like a
         | few dozen links.
         | 
         | Not exactly this, but close enough:
         | https://memex.marginalia.nu/links/bookmarks.gmi
         | 
         | I've changed the crawler design a couple of times, but the
         | principle for growing the set of sites to be crawled is to look
         | for sites that are (in some sense) adjacent to domains that
         | were found to be good.
        
           | cloudyporpoise wrote:
           | So if there was a new domain, unlinked by anything - this
           | wouldn't find it?
        
             | marginalia_nu wrote:
             | It wouldn't. But such islands are typically not very
             | interesting either. The context of who links to a domain is
             | very important for a search engine for many tasks, not just
             | discovery.
        
               | cloudyporpoise wrote:
               | Very cool. Reason I ask is at first glance the header
               | "Search the Internet" to me, implies you are searching
               | the entire internet. It sounds like a more appropriate
               | header would be "Search the obscure Internet"
        
               | marginalia_nu wrote:
               | To be fair, no search engine lets you search the entire
               | Internet, not even Google does this.
               | 
               | The Internet arguably doesn't even have a size. You can
               | construct a website that's like n.example.com/m which
               | links to '(n+1).example.com/m' and 'n.example.com/(m+1)',
               | for each m and n between 0 and 1e308.
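
The construction can be made concrete with a short sketch. The URL shape
follows the comment; the function names are made up for illustration.

```python
def trap_links(n: int, m: int, base: str = "example.com") -> list[str]:
    """Links found on page n.example.com/m, as described above."""
    return [f"{n + 1}.{base}/{m}", f"{n}.{base}/{m + 1}"]


def pages_within(clicks: int) -> int:
    """Count distinct pages reachable from 0.example.com/0 in at most
    the given number of clicks."""
    seen = {(0, 0)}
    frontier = [(0, 0)]
    for _ in range(clicks):
        next_frontier = []
        for n, m in frontier:
            for page in ((n + 1, m), (n, m + 1)):
                if page not in seen:
                    seen.add(page)
                    next_frontier.append(page)
        frontier = next_frontier
    return len(seen)
```

Within k clicks there are (k + 1)(k + 2) / 2 distinct pages, so the set
keeps growing with every step and a crawler has to cut it off somewhere.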
        
               | Lex-2008 wrote:
               | I did it! For every two numbers, calc.shpakovsky.ru has a
               | static(-looking) webpage showing their sum (or
               | difference, etc). Together with links to several other
               | pages. The only limitation I know of is 4k URL length.
               | Interestingly enough, major search engines are rather
               | smart about it and cooled down their indexing efforts
               | after some time. Guess, I'm not the first one to make
               | such a website.
        
               | marginalia_nu wrote:
               | Haha, nice! Crawler traps are a quite old phenomenon.
               | Been around since before Google.
               | 
               | Dunno about the others, but my crawler has a set depth it
               | will crawl. It'll BFS for like 1000-10000 documents
               | depending on some factors.
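
A minimal sketch of a breadth-first crawl with a document budget, which
is one way to stay out of crawler traps. The function name, the
`fetch_links` callback and the default budget are assumptions; the
comment only says the crawl is a BFS capped at roughly 1000-10000
documents.

```python
from collections import deque


def crawl_bfs(seed, fetch_links, max_documents=1000):
    """Visit pages breadth-first from seed, stopping after max_documents.

    fetch_links is a caller-supplied function (hypothetical here) that
    returns the links found on a page.
    """
    visited = []
    queued = {seed}          # everything ever put on the queue
    queue = deque([seed])
    while queue and len(visited) < max_documents:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return visited
```

Because the budget counts fetched documents, the crawl terminates even on
a site that generates links forever.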
        
           | HeckFeck wrote:
           | May I submit my sites to your index? I think they'd be a good
           | fit for the index.
           | 
           | https://www.thran.uk and https://wmw.thran.uk
        
             | marginalia_nu wrote:
             | You can add them yourself :-)
             | 
             | https://search.marginalia.nu/site/www.thran.uk
             | 
             | https://search.marginalia.nu/site/wmw.thran.uk
             | 
             | This is only possible as long as the index knows about the
             | domain. Yours are known, but if not, anyone can shoot me an
             | email or something and I can poke them into the database.
             | 
             | The limitation for known domains is in place to avoid
             | abuse.
        
               | HeckFeck wrote:
               | Thanks!
        
           | gregw134 wrote:
           | 1) How many pages are in your index? 2) How do you do indexing
           | and retrieval? Do you build a word index by document and find
           | documents that match all words in the query?
        
             | marginalia_nu wrote:
             | 1) At this moment about 70 million documents. I've had it
             | at about 110 million, dunno what the actual limit is.
             | 
             | 2) Yes. Everything is in-house.
             | 
             | > Do you build a word index by document and find documents
             | > that match all words in the query?
             | 
             | Yeah. It's actually got three indices;
             | 
             | * One is a forward index with `document id -> document
             | metadata`
             | 
             | * One is a priority term index with `term -> document id`.
             | 
             | * One is a full index with `term -> (document, term
             | metadata)`
             | 
             | They're all based on static b-trees.
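
A toy model of the three indices and an all-terms-must-match lookup.
This is a sketch under several assumptions: plain dicts stand in for the
static b-trees, title words are used as the "priority" terms, and word
positions are used as the term metadata. None of these details come from
the thread, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    forward: dict = field(default_factory=dict)   # document id -> document metadata
    priority: dict = field(default_factory=dict)  # term -> set of document ids
    full: dict = field(default_factory=dict)      # term -> {document id: term metadata}

    def add(self, doc_id, title, body):
        self.forward[doc_id] = {"title": title}
        # Assumption: a term is "priority" for a document if it is in the title.
        for term in title.lower().split():
            self.priority.setdefault(term, set()).add(doc_id)
        # Assumption: term metadata is the list of word positions.
        for position, term in enumerate(f"{title} {body}".lower().split()):
            postings = self.full.setdefault(term, {})
            postings.setdefault(doc_id, {"positions": []})["positions"].append(position)

    def search(self, query):
        """Return ids of documents containing every query term,
        with documents that have more priority hits listed first."""
        terms = query.lower().split()
        if not terms:
            return []
        candidates = None
        for term in terms:
            docs = set(self.full.get(term, {}))
            candidates = docs if candidates is None else candidates & docs

        def priority_hits(doc_id):
            return sum(doc_id in self.priority.get(t, ()) for t in terms)

        return sorted(candidates, key=lambda d: (-priority_hits(d), d))
```

The intersection over per-term posting lists is what makes this an AND
query: a single term that appears nowhere empties the result.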
        
               | abracadaniel wrote:
               | Is there a domain list if I wanted to crawl the hosts
               | myself? I see you have the raw crawl data, which is
               | appreciated, but a raw domain list would be cool.
        
               | marginalia_nu wrote:
               | I guess technically that could be arranged. Although I
               | don't want everyone to run their own crawler. It would
               | annoy a lot of webmasters and end up with even more
               | hurdles to be able to run a crawler. Better to share the
               | data if possible.
        
       | cs702 wrote:
       | Like HN, Marginalia is a breath of fresh air in comparison to
       | today's SEO-optimized, monolith-dominated web.
       | 
       | Is there a way to donate money?
        
         | postdb wrote:
         | it is on the front page of their site:
         | https://memex.marginalia.nu/projects/edge/supporting.gmi
        
           | cs702 wrote:
           | Silly me, I assumed the "Support" button at the top of the
           | page was there for users who need... support.
           | 
           | I kept looking for a "Donate" button :-P
           | 
           | Thank you!
        
             | andruby wrote:
             | @marginalia_nu this is probably actionable feedback for you
        
       | dizhn wrote:
       | For all the talk of needing all the cloud infra to run even a
       | simple website, Marginalia hits the frontpage of HN and we can't
       | even bring a single PC sitting in some guy's living room to its
       | knees.
        
         | marginalia_nu wrote:
         | https://www.marginalia.nu/junk/just-a-fleshwound.webp
         | 
         | If anything it's running faster now. All you've done is warm up
         | the caches and given the JVM a chance to optimize the hottest
         | code.
         | 
         | (real talk the SSDs are running pretty near 100% utilization
         | though)
        
           | yonrg wrote:
           | 239 days up. That's brave too ;)
        
             | marginalia_nu wrote:
             | Rebooting is like an hour of downtime :-/
             | 
             | FWIW I'm going commando with no ECC ram too.
        
             | nine_k wrote:
             | I remember there is ksplice or something like that to
             | upgrade even the kernel without a complete downtime.
             | Everything else can be upgraded piecemeal, provided that
             | worker processes can be restarted without downtime.
        
           | winrid wrote:
           | If the SSDs were really maxed out you'd see high CPU
           | usage/load as the CPU is blocked by IOWAIT.
        
             | marginalia_nu wrote:
             | Not in this case, it's all mmap.
        
               | winrid wrote:
               | Really? Even with mmap'ed memory won't the CPU still
               | register user code waiting on reading pages from disk as
               | iowait? I'm so surprised by that that if it doesn't it
               | sounds like a bug.
        
               | marginalia_nu wrote:
               | Yeah it's at least what I've been seeing. Although it
               | could alternatively be that a lot of the I/O activity is
               | predictive reads, and the threads don't actually stall on
               | page faults all that often.
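For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of reading a file through a memory mapping. It only shows that an mmap read looks like an ordinary memory access to user code; it makes no claim about how the kernel accounts for the resulting page faults. The temporary file and the `read_record` helper are hypothetical stand-ins for an index file and its reader.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def read_record(path, offset, length):
    """Read bytes from a file through a read-only memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing touches the mapped pages. If they are not resident,
            # the kernel services a page fault and reads them from disk;
            # no explicit read() call appears in user code.
            return m[offset:offset + length]


# Build a small throwaway file, two pages long, to stand in for an index.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * PAGE + b"B" * PAGE)
    path = tmp.name

first = read_record(path, 0, 4)      # bytes from the first page
second = read_record(path, PAGE, 4)  # bytes from the second page
os.remove(path)
```

If the kernel has already prefetched the second page by the time it is touched, the access completes without the thread stalling, which is the "predictive reads" possibility raised above.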
        
           | mvdwoord wrote:
           | Very well done! On mobile now but will check out the site
           | once home.
           | 
           | Do you happen to have a writeup somewhere of your tech stack?
        
             | btown wrote:
             | Not OP but https://github.com/MarginaliaSearch/MarginaliaSe
             | arch/blob/ma... has a diagram and component overview!
        
               | marginalia_nu wrote:
               | I'm working on the documentation. It's getting there, but
               | it's still kinda thin in many places.
        
               | btown wrote:
               | IMO it's actually incredibly well-documented and
               | thoughtfully organized for a one-person project! You
               | should be proud of what you've put together here!
        
               | marginalia_nu wrote:
               | Yeah I did a huge refactoring effort very recently. I put
               | a lot of effort in making the code easy to poke around in
               | and I feel that works very well.
               | 
               | But besides that, there's still a lot left to be desired
               | when it comes to how it actually works. Not everything is
               | easy to glean from the code alone.
        
             | marginalia_nu wrote:
             | It's a Debian server running nginx into a bunch of custom
             | java services that use the spark microframework[1]. I use a
             | MariaDB server for link data, and I've built a bespoke
             | index in Java.
             | 
             | [1] https://sparkjava.com/ I don't use springboot or
             | anything like that, besides Spark I'm not using frameworks.
        
           | rglover wrote:
           | Semi-related: do you just run a static IP out of your house?
        
             | marginalia_nu wrote:
             | Yup.
        
           | dizhn wrote:
           | I love this!! :) How much does 24 real cores + 126GB ram cost
           | on the cloud? A million dollars?
        
             | ftth_finland wrote:
             | About 150EUR per month as a dedicated server from Hetzner.
        
               | marginalia_nu wrote:
               | Something like EUR200/mo if you factor in the need for
               | disk space as well. This is also Hetzner we're talking
               | about. They're sort of infamous for horror stories of
               | arbitrarily removed servers and having shitty support.
               | They're the cheapest for a reason.
               | 
               | But with dedicated servers, are we really talking cloud?
        
               | ftth_finland wrote:
               | A dedicated server is obviously not "cloud", but that
               | wasn't really the point I was making.
               | 
               | My point was rather underlining the absurdity of using
               | cloud for everything.
               | 
               | Hetzner is just an example of a dedicated server
               | provider. There are others, some in the same price range,
               | others a bit more.
               | 
               | As an aside to my point, it is often cheaper and more
               | flexible to use dedicated servers than to buy and
               | colocate your own hardware.
        
               | belter wrote:
               | So the Cloud is the cheaper option, if you factor in
               | energy cost and hardware depreciation.
        
               | ftth_finland wrote:
               | The monthly cost of a dedicated server includes
               | everything: bandwidth, power and hardware.
               | 
               | How do you figure a $5k annual cloud spend is cheaper
               | than ~150EUR per month?
        
               | belter wrote:
               | I do not understand your comment.
        
               | dizhn wrote:
               | I think parent is talking about colo.
        
               | hiddencost wrote:
               | I think you're confused.
               | 
               | ~150EUR in cloud costs is cheaper than $5k cost to buy
               | the hardware the guy has in his living room.
        
             | DHolzer wrote:
             | in Azure that's roughly 5k per year if you pay for the
             | whole year upfront. I have the pleasure of playing with
             | 64cores, 256gb RAM and 2xV100 for data science projects
             | every now and then. That turns out to be roughly 32k per
             | year.
        
             | marginalia_nu wrote:
             | Yeah I wouldn't try to run this in the cloud. Would be
             | broke as a joke in a week.
             | 
             | This is $5000 worth of consumer hardware, give or take.
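The figures quoted in this subthread can be put side by side with a rough break-even calculation. This is a back-of-the-envelope sketch under loud assumptions: EUR and USD are treated as interchangeable, the 50-per-month running cost for electricity is a made-up placeholder, and bandwidth, depreciation, and the owner's time are ignored.

```python
def break_even_months(hardware_cost, monthly_rent, monthly_running_cost=0.0):
    """Months after which owning the hardware is cheaper than renting."""
    saving = monthly_rent - monthly_running_cost
    if saving <= 0:
        return float("inf")  # renting never costs more than owning
    return hardware_cost / saving


HARDWARE = 5000.0      # "$5000 worth of consumer hardware"
DEDICATED = 200.0      # "EUR200/mo" dedicated server including disk
CLOUD = 5000.0 / 12    # "roughly 5k per year" paid upfront

vs_dedicated = break_even_months(HARDWARE, DEDICATED)
vs_cloud = break_even_months(HARDWARE, CLOUD)
# Hypothetical: 50 per month for power at home.
with_power = break_even_months(HARDWARE, DEDICATED, monthly_running_cost=50.0)
```

Under these assumptions the hardware pays for itself in about two years against the dedicated server and about one year against the quoted cloud price; adding a running cost pushes the break-even point later.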
        
               | TremendousJudge wrote:
               | nice
        
             | alex_sf wrote:
             | More than the cores and RAM, you have bigger issues with
             | I/O (both throughput and latency) to disk and the network
             | from cloud providers. Physical hardware, even when
             | comparing cores/RAM 1:1, is outrageously faster than cloud
             | services.
        
               | nine_k wrote:
               | Don't use EBS, use the local SSDs, which are an option for
               | most cloud VMs.
        
           | [deleted]
        
           | adrian_mrd wrote:
           | Kudos. [0] is the best URL I have come across in the past few
           | years
           | 
           | [0] https://www.marginalia.nu/junk/just-a-fleshwound.webp
        
             | phendrenad2 wrote:
             | "/junk/just-a-fleshwound.webp" for those on mobile
        
           | gary_0 wrote:
           | I'm curious about your network bandwidth/load. You only serve
           | text, right? [Edit: No, I see thumbnail images too!] Is the
           | box in a datacenter? If not, what kind of Internet connection
           | does it have?
        
             | marginalia_nu wrote:
             | Average load today has at worst been about 300 Kb/s TX, 200
             | Kb/s RX. I've got a 1000/100 mbit/s down/up connection.
             | Seems to be holding without much trouble.
             | 
             | Most pages with images do lazy loading so I'm not hit with
             | 30 images all at once. They're also webp and cached via
             | cloudflare, softens the blow quite a lot.
        
         | [deleted]
        
         | [deleted]
        
         | alex_sf wrote:
         | StackOverflow still just runs on a pair of (beefy) SQL servers.
         | Modern web engineering is a joke.
        
           | dizhn wrote:
           | I believe hackernews itself is a couple of servers too.
        
             | winrid wrote:
             | One big server with a flat text file DB on NVME drives
             | AFAIK.
        
               | amiga-workbench wrote:
               | I believe the application code is single threaded due to
               | its interpreter too.
        
               | TremendousJudge wrote:
               | it does run pretty slow under load though, and they have
               | acknowledged it's due to this
        
               | marginalia_nu wrote:
               | HN has mutable data though. That's a much harder problem
               | than indexing a large amount of static data like a search
               | engine.
        
               | winrid wrote:
               | Not only that, but I don't think the custom datastore
               | handles concurrent writes.
               | 
               | One thread FTW. :)
        
             | bookofjoe wrote:
             | dang? Bueller? Anyone?
        
             | hedora wrote:
             | Wikipedia too.
        
               | mrweasel wrote:
               | That doesn't sound right:
               | https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF
        
           | nemo44x wrote:
           | A lot of modern web engineering is built for problems most
           | people won't have.
        
             | nine_k wrote:
             | The people _strive_ to have these problems! Hockey stick
             | growth, servers melting under signup requests, payment
             | systems struggling under the stream of subscription
             | payments! Scale up, up, up! And for _that_ you might want
             | to run your setup under k8s since day one, just in case,
             | even though a single inexpensive server would run the whole
             | thing with a 5x capacity reserve. But that would feel like
             | a side project, not a startup!
        
         | HarHarVeryFunny wrote:
         | That is pretty impressive - not only not on its knees, but very
         | responsive atm.
        
         | MichaelZuo wrote:
         | I doubt anyone would be foolish enough to claim that the site
         | NEEDS 'cloud infra' to run.
        
           | nine_k wrote:
           | It depends. Both https://google.com and, say,
           | https://www.medusa.priv.at/ are technically web sites, but
           | the complexity of the tech that makes them work is pretty
           | different.
        
           | marginalia_nu wrote:
           | I think it sort of depends on what you want.
           | 
           | Every time I deploy a service it goes down for anything
           | between 30 seconds and 5 minutes. When I switch indices, the
           | entire search engine is down for a day or more. Since the
           | entire project is essentially non-commercial, I think this is
           | fine. I don't need five nines.
           | 
           | If reliability was extremely important, scales would tilt
           | differently, maybe cloud would be a good option. A lot of it
           | is for CYA's sake as well. If I mess up with my server,
           | that's both my problem and my responsibility. If a cloud
           | provider messes up, then that's a SLA violation and maybe
           | damages are due.
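To put "five nines" next to the downtime figures mentioned above, here is a small worked calculation. The deploy count of 20 per year and the two-minute average are assumptions chosen for illustration; only the 30-second-to-5-minute range and the day-long index switch come from the comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(nines):
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


def availability(downtime_seconds):
    """Fraction of the year the service is up."""
    return 1 - downtime_seconds / SECONDS_PER_YEAR


# Five nines (99.999%) leaves a little over five minutes per year.
five_nines_budget = downtime_budget_seconds(5)

# Hypothetical year: 20 deploys at 2 minutes each, plus one index
# switch that takes the search engine down for a full day.
yearly_downtime = 20 * 120 + 24 * 60 * 60
achieved = availability(yearly_downtime)
```

A single five-minute deploy already uses up almost the entire five-nines budget, and one day-long index switch alone puts the yearly figure between two and three nines.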
        
         | nine_k wrote:
         | 24 cores, 128 GB RAM. One could run 10-20 EC2 instances to
         | utilize a box like this, and produce an impression of sprawling
         | backend infrastructure.
        
         | [deleted]
        
       | MichaelZuo wrote:
       | It's quite hard to search for lists of any kind. For example:
       | 
       | list of Italian generals
       | 
       | list of CPU architectures
       | 
       | list of positive rights
       | 
       | etc...
       | 
       | Return no relevant results. Perhaps it's not giving enough weight
       | to the 'list' aspect?
        
         | marginalia_nu wrote:
         | I think there's two reasons for this.
         | 
         | The query processing is fairly crude. For better or worse, it
         | doesn't do much special processing. Which means you basically
         | need a website that repeatedly says "list of CPU architectures"
         | to rank well.
         | 
         | Most of the pages that contain such a title are also actual
         | lists. The index de-prioritizes documents that are mostly lists
         | or tabular data, as they're very often false positives,
         | since they often contain repeated words.
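Why repeated words produce false positives can be illustrated with a toy scoring function. This is not how Marginalia detects lists or tables, which the comment does not describe; the unique-token ratio used as a penalty here is an assumption made purely for the example.

```python
from collections import Counter


def naive_score(query, text):
    """Raw term-frequency score: counts every occurrence of a query term."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query.lower().split())


def unique_ratio(text):
    """Share of distinct tokens; low for lists and tables that repeat words."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def penalized_score(query, text):
    """Naive score scaled down for repetitive documents (an assumption)."""
    return naive_score(query, text) * unique_ratio(text)


query = "cpu architectures"
prose = "an essay on cpu architectures and why they differ"
table = "cpu arm cpu arm cpu arm cpu x86 cpu x86"

naive = (naive_score(query, prose), naive_score(query, table))
penalized = (penalized_score(query, prose), penalized_score(query, table))
```

With raw term counts the table-like text outranks the prose even though it never mentions "architectures"; once repetition is penalized, the prose ranks first.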
        
           | MichaelZuo wrote:
           | It doesn't seem like an issue of ranking since there are no
           | results with lists of any sort, even at the bottom.
           | 
           | Even if the word 'list of ...' only appears once, it wouldn't
           | be filtered out, right?
           | 
           | Out of 100 million pages, it seems like there could easily be
           | a few hundred thousand with lists.
        
       | SinePost wrote:
       | What advantages does this offer over something like typing
       | "$QUERY -site:*.com" into a mainstream search engine? I think
       | webmasters in general do a pretty good job at self-segregating
       | their sites into commercial and non-commercial entities through
       | the use of different top-level domains.
        
         | 1123581321 wrote:
         | The advantage is a better set of sites since there are a ton of
         | interesting little .com sites out there, and lots of unwanted
         | sites on .org and country code domains. He also blocks some URL
         | patterns that appear on spammy domains regardless of TLD. Try a
         | few searches or the random results page to see the difference.
         | It's fun browsing.
        
       | djoldman wrote:
       | From my goto search term, "chickens":
       | 
       | https://theoutline.com/post/5608/bury-me-in-chicken-diapers
        
         | marginalia_nu wrote:
         | I'll counter with the top result for "cats":
         | http://diabellalovescats.com/catland.htm
        
         | Aeolun wrote:
         | Thank you. I needed this in my life.
        
         | pxc wrote:
         | Anecdotally, all the people I know who have
         | recreational/pet/indoor chickens are lovely human beings, so
         | I'm wholly in favor of this absurd industry and its success.
        
           | marginalia_nu wrote:
           | "Recreational Chicken" is a two word poem if I ever saw one.
        
           | bookofjoe wrote:
           | https://www.amazon.com/Under-Henfluence-Inside-Backyard-
           | Chic...
        
       | Jemm wrote:
       | First search for a commodity item returned pages full of
       | conspiracy theories and then drifted in to anti-vax territory.
        
         | marginalia_nu wrote:
         | Interesting, what did you search for?
        
         | HeckFeck wrote:
         | The unfiltered Internet in all its glory!
        
         | shanebellone wrote:
         | I yearn for an exclusionary Internet. Most voices don't matter.
        
           | manuelmoreale wrote:
           | Would you mind explaining 1) why you want that and 2) who
           | decides which voices do matter?
        
             | shanebellone wrote:
             | "1) why you want that"
             | 
             | Universal broadcast does not work (beneficially for
             | society) in an industry built to monetize reach.
             | 
             | Everyone is entitled to their opinions, but voices are not
             | equal in utility or worth.
             | 
             | "2) who decides which voices do matter?"
             | 
             | This is always the problem, isn't it? I don't have an
             | answer for you.
        
               | manuelmoreale wrote:
               | Isn't 1 just the result of 2? It's because we don't have
               | an answer and because there probably isn't an answer to
               | the second question that we need universal broadcast as
               | you called it.
               | 
               | We could get away from it only if we figure out an answer
               | to the second question but I suspect we'll never get to
               | an answer.
        
               | shanebellone wrote:
               | "Isn't 1 just the result of 2?"
               | 
               | Only in that reach will continue to be monetized
               | regardless of its impact on society.
        
           | pulpfictional wrote:
           | Isn't it already?
        
         | SinePost wrote:
         | That doesn't actually sound very different from my experience
         | with major search engines beyond the first page. I've taken it
         | as a bit of a law that the Internet outside of large
         | centralized and/or moderated sites gets very fringe very
         | quickly. Since the whole point of the search engine is to
         | display noncommercial sites, users will inevitably face
         | thousands of self-published blogs of varying beliefs, quality,
         | and truthiness. As these fringe sites take up more domain names
         | by total volume than mainstream platforms (there is only one
         | Twitter, Facebook, et al.), I am not surprised at all that they
         | seem to be even more voluminous here than on commercial search
         | engines.
        
       | yonrg wrote:
       | Thanks for this! It absolutely touched me. www in a way it was
       | back in the 90s :) those memories came back when getting results
       | to neocities.
        
       ___________________________________________________________________
       (page generated 2023-04-18 23:00 UTC)