[HN Gopher] Marginalia: DIY search engine that focuses on non-co... ___________________________________________________________________ Marginalia: DIY search engine that focuses on non-commercial content Author : thunderbong Score : 460 points Date : 2023-04-18 09:53 UTC (13 hours ago) (HTM) web link (search.marginalia.nu) (TXT) w3m dump (search.marginalia.nu) | gerdesj wrote: | What a cracking resource! When you need to get away from the | beige web, a few clicks on Random Mode is all you need. | throwaway280382 wrote: | @marginalia_nu, Few months ago, you said you would consider open | sourcing this search engine. Are there any tasks that us github | warriors can help with? | PaulHoule wrote: | Wow! Relevance looks good for queries i tried and i like the | square interface too. | barbs wrote: | How does it compare to https://wiby.me/ ? | marginalia_nu wrote: | Wiby is manually curated. Means their result align much better | with the operator's vision. Marginalia has an orders of | magnitude bigger index, but not all of it is as consistently | good as Wiby. | marginalia_nu wrote: | Watch my computer struggle here: https://www.marginalia.nu/stats/ | antman wrote: | What is the stack? Can it scale up? | marginalia_nu wrote: | Custom index software built from scratch in Java. MariaDB | link database. The entire search engine runs on a PC in my | living room. | | You could pretty trivially shard the index by `hash(domain) % | numShards`. There's no support for this because I literally | only have this single server, but it wouldn't be much work. | [deleted] | toyg wrote: | https://github.com/MarginaliaSearch/MarginaliaSearch | dylan808hewitt wrote: | [dead] | qwertox wrote: | What really drives me crazy with Google is that they think that | it is ok to not label potentially paywalled articles as ads. 
| | I get tricked so often into clicking a news snippet offered by | Google only to then land on a site which not only presents me a | paywall, but also does want me to accept their cookie policy | _before_ they present me the paywall. | | It makes me angry every time anew. | HarHarVeryFunny wrote: | I'm not sure how effective it is at present - just gave it a very | quick test on searching for info on roman coins - but the concept | is great. This is something that I've often wished existed. | | If I'm searching for roman coins I certainly don't want to find | commercial sites selling them (I know what those are), or even | the well-known online national collections or auction archives... | I'd like to be able to find the specialist sites built by | collectors (and maybe academics) that are non-commercial and way | more interesting. | | In the early days of the internet some specialist content/pages | were organized into "web rings" each linking to each other, but | nowadays we're mostly relying on search to discover new pages, | and it seems a lot of the hobbyist content is way harder to find, | assuming it's even out there. | marginalia_nu wrote: | What did you search? Try just 'roman coins' | | #1: http://www.romancoins.info/Content.html | | #2-4: were not very good | | #5: https://www.forumancientcoins.com/dougsmith/voc1.html | | #6: https://www.cngcoins.com/Greek+and+Roman+Coins.aspx | | #7: https://www.crystalinks.com/romecoins.html | | If you search for specifically the 'as' it may be eaten as a | stop word :-/ | HarHarVeryFunny wrote: | I was searching "imp constantinvs" which is part of the | legend on many coins of constantine the great. Would expect | to see these details listed on any hobbyist sites, as well as | the commercial ones I'm not interested it. | | BTW #1, 5, 6 are all good sites, but those are very | mainstream - those will be top links in Google as well. #6 is | purely commercial - an auction house. 
#5 is a coin dealer's | commercial site, but has good collector resources (discussion | board, Wiki, collectors galleries) as well. | marginalia_nu wrote: | Do you know of any hobbyist sites within this space? I want | to check a thing, could be this corner of the internet | isn't well indexed. I should be able to tell with explore2. | HarHarVeryFunny wrote: | constantinethegreatcoins.com is one - the owner is also a | dealer, but this is his private hobbyist site. | | Some examples of other non-commercial roman coin hobbyist | sites (that will also rank fairly highly with Google) | are: | | augustuscoins.com wildwinds.com beastcoins.com | www.notinric.lechstepniewski.info https://www.nummus- | bibleii.com/ | | I'm at work right now, so these are just some examples | off the top of my head. I can give more examples later if | it's useful. Some of these site will include links to | other collector/hobbyist sites. | marginalia_nu wrote: | Hmm, several of those weren't indexed, I added them to | the crawl queue. Seems like the numismatics corner of the | web isn't well indexed by marginalia. | | constantinethegreatcoins does show up for 'imp | constantinvs' though. | CalRobert wrote: | Marginalia comes up on HN every so often, and I always look at | it, think "oh that's neat!", maybe add it to my bookmarks | toolbar, and then forget about it. Are there a lot of people who | find themselves using it daily? | frogulis wrote: | I use it from time to time when I want to read something | interesting. It can be a great source of articles that feel HN- | worthy, if that makes sense. | yuhong wrote: | [dead] | Firmwarrior wrote: | I use it as my daily time-waster, it has a lot more interesting | stuff than you'll find on sites like Reddit or Twitter | | https://search.marginalia.nu/explore/random | newqer wrote: | What have you done? I tried to cut back on Reddit, but this | seems like I could go done a rabbit hole for a few hours per | day. 
| marginalia_nu wrote: | I don't even use it daily. It's not a Google replacement, and | it's not trying to be either. It's more of an on-ramp for the | obscure web at this point. | | That said, it's gotten way better at finding stuff with the | last few releases. | GravitasFailure wrote: | It's interesting, and I really appreciate that you aren't | trying to out-Google Google. This seems like a useful tool in | its own right in addition to what else exists. | marginalia_nu wrote: | I don't think that would make sense. If Google is | struggling with search, a one man Google clone isn't going | to do it better. | | I also think that having "a google", one central search | engine, is inherently a bad thing for the health of the | Internet. It drives a lot of this search engine spam | epidemic we're seeing. | | A broader and (IMO) more interesting problem is Internet | discovery. | throwaway14356 wrote: | without/before commerce one would link to similar | websites as much as possible. Now those are called | competitors. | | I bet one could make a facinating ranking algo that | groups sites by subject then sort them by nr of links to | others in that group. | | So the perfect SEO would be to have a blogroll at the top | of the left menu with every related website in it. | | i.e. 3 stores sell the same item. Nr 1 is the one linking | to the other 2. Extra points for linking to that specific | product page. | hedora wrote: | Google used to be a ~one man search engine clone (and it | was definitely better a few years after that than it is | today). | marginalia_nu wrote: | A lot of Google's initial quality was due to the fact | that the content it indexed was much higher quality. | | Even besides the point that the websites they indexed | were a lot less adversarial, they put a lot of emphasis | on indexing academia, and were outspoken against what | came to be their present mixed motives[1]. 
| | [1] http://infolab.stanford.edu/~backrub/google.html#a | robin_reala wrote: | Well, two-man. | marginalia_nu wrote: | I think it was actually three-man with Scott Hassan. | pxc wrote: | I don't use it daily, but I have reached for it multiple times | in the last few weeks. I like it for finding blog posts, | tutorials, comparisons, and hobby projects without getting | caught up in fake articles like SEO-heavy wikis of copy-pasted | content. | cpach wrote: | Use it maybe once every week or something like that. I like it. | | I use it mostly for tech/programming/FOSS stuff. Especially for | programming topics it can be good for filtering out all the | 'w3schools' type of blog spam that just floods Google's | results. | themodelplumber wrote: | > The Random Mode has been overhauled, and is quite entertaining. | I encourage you to give it a spin. | | Yep, this is a good example of warping the Feeling Lucky pattern | into a really neat little discovery tool. | | IMO it would even be cool if the site was this part first, oh and | hey it's also a search engine. | | (While I'm random-ing: The Arch Wiki is in there? Seriously? Just | for that, I propose that it either be skinned to max vaporwave, | or host a webcam pointed at a Manjaro machine, or both...I'll be | waiting over here, downloading 4.1 GB of marginalia for my AUR | build of PCManFM) | selfhoster11 wrote: | Reminds me of StumbleUpon. I still miss that. | arcanemachiner wrote: | That period gave rise to the most educational web surfing | I've ever experienced. | BiteCode_dev wrote: | Stumble upon used to do this. It was really cool. | justusthane wrote: | StumbleUpon was amazing. I feel like that was really peak | internet, at least for me. I found so many weird, awesome | things. | marginalia_nu wrote: | It really could only work around the time it existed I | feel. The internet _was_ a lot weirder back then. 
| | One big difference then from now is that you basically need | a PhD in the Canvas API (or WebGL or whatever) to | accomplish something a 5 year old could do in Flash. Web | design was a lot more accessible. You didn't have to worry | about responsive designs and fluid layouts. You could just | position:absolute everything and that was kinda fine. | giantrobot wrote: | I think you might have some nostalgia goggles on at the | moment. There's nothing holding people back from making | "weird" web pages today, they can even make them nice and | responsive. One of the better concepts around HTML and | CSS was separation of data and layout. | | It's trivial to have a "weird" position:absolute design | with a break for mobile that switches to a more fluid | layout. Desktop users can have their "weird" layout but I | can still read the page on mobile and you can readily | crawl and index it. | | People moved away from design tools like DreamWeaver that | helped make "weird" stuff and instead installed WordPress | or some CSS/JavaScript framework that just bakes in all | the "boring" fluid layouts. | | You're not necessarily wrong about Flash in terms of | design or creation but your search engine wouldn't be | terribly practical if everyone was still using Flash for | everything. Flash allowed content packed inside SWFs but | also allowed fetching external resources. You wouldn't be | able to index any of that unless your crawler executed | the Flash and/or inspected all the URL references for | external resources. | | Flash created an inaccessible deep web just like today's | JavaScript website-is-an-application "sites". | | Don't get me wrong, I love the old web with quirky table- | based layouts, "unofficial" fansites, and personal | homepages hosted on forgotten university servers in a | closet. There was a vibrancy that's largely missing from | today's web. 
| | I think a big change has been tools have become more | geared for boring than the creative and people treat | content on the web as a side hustle. Google et all | haven't helped by favoring recency over other relevance | factors. | counttheforks wrote: | Is it trivial enough that a 5 year old could do in a | point and click editor? | giantrobot wrote: | It could be. But modern tools don't bother. Then again, | Flash's usability by a 5 year old is being a bit oversold | here. | themodelplumber wrote: | There are quite a few interesting alternatives these days. | Bored Button is one. There are some that are even more like | it but I'm away from grep and my notes at the moment. | | Reddit even has some kinda-similar subs. | marginalia_nu wrote: | I haven't got the time to curate this stuff. There's like | 10,000 domains in the list. It's some one off SQL script I | think that generated the sample based on parameters lost to | time. | muyuu wrote: | are domains whitelisted? | marginalia_nu wrote: | The domains you get from browse:random is from a small | selection yeah. But if you start traversing with "similar" | there is no such limitation, only limit is that they must | have a screenshot. | | (There's also explore2.marginalia.nu which is not even | limited to websites with a screenshot) | dang wrote: | Related. Others? 
| | _Marginalia Search has received an NLNet grant_ - | https://news.ycombinator.com/item?id=34945541 - Feb 2023 (17 | comments) | | _A Theoretical Justification (2021)_ - | https://news.ycombinator.com/item?id=32586273 - Aug 2022 (22 | comments) | | _The Evolution of Marginalia 's Crawling_ - | https://news.ycombinator.com/item?id=32565052 - Aug 2022 (22 | comments) | | _Botspam apocalypse_ - | https://news.ycombinator.com/item?id=32339314 - Aug 2022 (346 | comments) | | _Marginalia Goes Open Source_ - | https://news.ycombinator.com/item?id=31536626 - May 2022 (72 | comments) | | _Uncertain Future for Marginalia Search_ - | https://news.ycombinator.com/item?id=31200319 - April 2022 (37 | comments) | | _Marginalia Search: 1 Year_ - | https://news.ycombinator.com/item?id=30823481 - March 2022 (29 | comments) | | _Show HN: Marginalia - Exploration Mode_ - | https://news.ycombinator.com/item?id=30047455 - Jan 2022 (53 | comments) | | _A search engine that favors text-heavy sites and punishes | modern web design_ - | https://news.ycombinator.com/item?id=28550764 - Sept 2021 (717 | comments) | | (just as a reminder, these lists are only to satiate curious | readers - there's no reproach for reposting! Reposts are fine on | HN after a year or so: https://news.ycombinator.com/newsfaq.html) | nullandvoid wrote: | I get no results at all for this seemingly simple | https://search.marginalia.nu/search?query=how+to+draw+a+3d+b..., | presume it's been HN hugged to death? | shp0ngle wrote: | nah it just have a tiny index. | | search for just "3d box" or something like that. | melx wrote: | No, it's really means "404 nothing found" for your search | query. I searched for my company name and got nothing as well. | A bit surprising since it says "search the Internet" ;-) | marginalia_nu wrote: | It doesn't do semantic search or synonyms. Think keywords, not | questions. | | Search for "draw 3d box" or "draw a cube" and it starts giving | results. 
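A minimal sketch of why a question-style query can come back empty under the keyword matching described above (no synonyms, no semantic search). The stop-word list and both function names are hypothetical; the thread only confirms that a word like 'as' "may be eaten as a stop word". This is not Marginalia's actual query code.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "as", "how", "the", "to"}


def query_terms(query: str) -> list[str]:
    """Lower-case the query, split on whitespace, and drop stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]


def matches(document: str, query: str) -> bool:
    """True only if every remaining query term appears verbatim in the document."""
    words = set(document.lower().split())
    return all(term in words for term in query_terms(query))


doc = "tutorial: draw a cube in perspective with pencil"
print(matches(doc, "draw a cube"))           # True: both keywords are present
print(matches(doc, "how to draw a 3d box"))  # False: '3d' and 'box' never appear
```

Without synonym expansion, "box" never matches a page that only says "cube", which is why rephrasing the query as bare keywords helps.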
| overthemoon wrote: | The "random" button sent me on an hour long rabbit hole and I | learned about (among other things) the gopher protocol. A+, would | lose that time again. | gerdesj wrote: | My first experience of the internet was telnet from Win 3 box | to a X.25 PAD and then telnet to something JANET (UK) then | something US based (NSF I think) and fire up Gopher or WAIS. | | Later my boss asked me to look at this web thing that he had | heard about. I fired up telnet and eventually found an on ramp | to CERN. To me it looked rather like everything else but I'm | not exactly a rocket scientist! | | https://www.w3.org/History/1992/WWW/FAQ/WAISandGopher.html | qwertox wrote: | Hmm, "getting started with react" yields the following as the | first match: | | "https://frontendmasters.com/courses/complete-react-v5/gettin... | Getting Started with Pure React - Complete Intro to React, v5 | | Frontend Masters The "Getting Started with Pure React" Lesson is | part of the full, Complete Intro to React, v5 course featured in | this preview video." | marginalia_nu wrote: | Thanks for pointing it out. I blacklisted the domain. I don't | mind commercial content, but if they're using SEO like that | they're being a nuisance. | [deleted] | melx wrote: | All seem nice until I get to see the search results.. which I | cannot "read". It's very difficult to read an output that goes 5x | boxes horizontally, and each such line goes then vertically on | "forever". It's like yellow pages book from the 1990s. | marginalia_nu wrote: | Yeah, the "magic: the gathering" layout some limitations. I | want something that makes good use of a large screen though. | | I've got some ideas in the pipe, but haven't had the time to | give them enough polish that I'm happy with them. | | This is an early draft: https://imgur.com/a/vMVO7CK | melx wrote: | Oh I see! It displays good on mobile devices. | | The draft looks nice. 
The text colour is a bit hard to | distinguish from the surrounding background, and I don't have | any eye conditions. | marginalia_nu wrote: | Yeah, the contrast is one of the things I'm not entirely | happy with. The positioning is also a bit off, especially | if you resize the window a bit. As stated, needs polish. | But I really like the idea of the search engine being a bit | more transparent with how it works. | gavinhoward wrote: | Yay! Marginalia considers my site important and good enough to be | indexed! | | Honestly, this makes me really happy. I would prefer that my | traffic be driven by curated search engines, even if I get less | traffic. | | Also, I use this. I think it's great. | hk__2 wrote: | Is it limited to English? I made a few queries in French and | Italian and got either no results either a couple of irrelevant | ones. | marginalia_nu wrote: | Yes. | | It's in part a measure to limit the scope of the project (the | entire thing runs off a single PC), but it's also hard to build | a good language model for a language you don't speak, and I | only speak English and Swedish. But if the project grows, gets | more hardware, and contributors that speak other languages, | then maybe this will change in the future. | repeekad wrote: | I remember being so excited about the search engine Neeva because | they seemed to be building a full fledge independent web index | with top notch talent, I was really bought into the idea of a | premium new search experience that I could pay for (no ads) and | revenue share some kinds of content. But years later they focused | on crypto and AI instead, and I always find myself just googling | it because I have ad block and the results are more relevant, | sigh | whiplash451 wrote: | Per their website, they still seem to focus on ad-less search. | What am I missing? | [deleted] | ajmurmann wrote: | There also is kagi.com which should fit all the requirements | you described. 
| phendrenad2 wrote: | Such a bad name though. Try telling your friends about kagi. | They'll type "kaggy". Maybe it'll be popular in Japan? | kevincox wrote: | FWIW Googling "kaggy search" has "Did you mean: kagi | search" at the top and kagi.com as the second hit. | slekker wrote: | Their new pricing is quite unaffordable now :( | ajmurmann wrote: | It's really bad. For years I've been saying that I want | someone to make a search engine where I am the customer and | that I am happy to pay. Now kagi is here and I am too cheap | to pay for it. I feel called out. | cloudyporpoise wrote: | I've always been curious about how search engines seed their | scanning and index programs. Like how do you know what domains, | IPs, etc. to start scanning and where is the origin? | gertgoeman wrote: | I remember reading somewhere that Google used dmoz | (https://en.wikipedia.org/wiki/DMOZ) as a seed page for their | crawler. Not sure if it's true though... | ddorian43 wrote: | Start with Common Crawl and go from there. | djoldman wrote: | That may be a much easier question to answer than discovery. | | How do you discover relevant new domains? | marginalia_nu wrote: | I've actually sort of solved this recently. Marginalia's | ranking algorithm is a modified PageRank that instead of | links uses website adjacencies[1]. | | It can rank websites even if they aren't indexed, based on | who is linking to them. | | Vanilla PageRank can't do this very well. Domains that aren't | indexed don't have (known) outgoing links, so they end up in the | periphery of the rank. There are some tricks to get these to not mess | up the algorithm completely, but they basically all rank | poorly. That's even without considering all the well known | tricks for manipulating vanilla PageRank. The modified | version seems very robust with regards to both problems. | | [1] https://memex.marginalia.nu/log/73-new-approach-to- | ranking.g... | marginalia_nu wrote: | It's basically seeded with my personal bookmark list.
Like a | few dozen links. | | Not exactly this, but close enough: | https://memex.marginalia.nu/links/bookmarks.gmi | | I've changed the crawler design a couple of times, but the | principle for growing the set of sites to be crawled is to look | for sites that are (in some sense) adjacent to domains that | were found to be good. | cloudyporpoise wrote: | So if there was a new domain, unlinked by anything - this | wouldn't find it? | marginalia_nu wrote: | It wouldn't. But such islands are typically not very | interesting either. The context of who links to a domain is | very important for a search engine for many tasks, not just | discovery. | cloudyporpoise wrote: | Very cool. Reason I ask is that, at first glance, the header | "Search the Internet" implies to me that you are searching | the entire internet. It sounds like a more appropriate | header would be "Search the obscure Internet" | marginalia_nu wrote: | To be fair, no search engine lets you search the entire | Internet, not even Google does this. | | The Internet arguably doesn't even have a size. You can | construct a website that's like n.example.com/m which | links to '(n+1).example.com/m' and 'n.example.com/(m+1)', | for each m and n between 0 and 1e308. | Lex-2008 wrote: | I did it! For every two numbers, calc.shpakovsky.ru has a | static(-looking) webpage showing their sum (or | difference, etc). Together with links to several other | pages. The only limitation I know of is 4k URL length. | Interestingly enough, major search engines are rather | smart about it and cooled down their indexing efforts | after some time. Guess I'm not the first one to make | such a website. | marginalia_nu wrote: | Haha, nice! Crawler traps are quite an old phenomenon. | Been around since before Google. | | Dunno about the others, but my crawler has a set depth it | will crawl. It'll BFS for like 1000-10000 documents | depending on some factors. | HeckFeck wrote: | May I submit my sites to your index?
I think they'd be a good | fit for the index. | | https://www.thran.uk and https://wmw.thran.uk | marginalia_nu wrote: | You can add them yourself :-) | | https://search.marginalia.nu/site/www.thran.uk | | https://search.marginalia.nu/site/wmw.thran.uk | | This is only possible as long as the index knows about the | domain. Yours are, but if not, anyone can shoot me an email | or something and I can poke them into the database. | | The limitation for known domains is in place to avoid | abuse. | HeckFeck wrote: | Thanks! | gregw134 wrote: | 1) How many pages are in your index? 2) How do you do indexing | and retrieval? Do you build a word index by document and find | documents that match all words in the query? | marginalia_nu wrote: | 1) At this moment about 70 million documents. I've had it | at about 110 million, dunno what the actual limit is. | | 2) Yes. Everything is in-house. | | (Do you build a word index by document and find documents | that match all words in the query?) | | Yeah. It's actually got three indices: | | * One is a forward index with `document id -> document | metadata` | | * One is a priority term index with `term -> document id`. | | * One is a full index with `term -> (document, term | metadata)` | | They're all based on static b-trees. | abracadaniel wrote: | Is there a domain list if I wanted to crawl the hosts | myself? I see you have the raw crawl data, which is | appreciated, but a raw domain list would be cool. | marginalia_nu wrote: | I guess technically that could be arranged. Although I | don't want everyone to run their own crawler. It would | annoy a lot of webmasters and end up with even more | hurdles to be able to run a crawler. Better to share the | data if possible. | cs702 wrote: | Like HN, Marginalia is a breath of fresh air in comparison to | today's SEO-optimized, monolith-dominated web. | | Is there a way to donate money?
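The three-index layout described earlier in this exchange (forward index, priority term index, full index, with retrieval matching all query words) can be sketched roughly as follows. This is an illustrative toy, not Marginalia's implementation: it uses in-memory dicts where the real indices are static b-trees, the class and method names are hypothetical, and treating title words as the "priority" terms is purely an assumption.

```python
from collections import defaultdict


class TinyIndex:
    """Toy model of the three indices; every name here is hypothetical."""

    def __init__(self):
        self.forward = {}                 # document id -> document metadata
        self.priority = defaultdict(set)  # term -> document ids (assumed: title terms)
        self.full = defaultdict(dict)     # term -> {document id: term metadata}

    def add(self, doc_id, url, title, body):
        self.forward[doc_id] = {"url": url, "title": title}
        for term in title.lower().split():
            self.priority[term].add(doc_id)
        # Term metadata is modelled as a list of word positions (an assumption).
        for position, term in enumerate(f"{title} {body}".lower().split()):
            self.full[term].setdefault(doc_id, []).append(position)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        # Keep only documents that contain every query term.
        hits = set(self.full.get(terms[0], {}))
        for term in terms[1:]:
            hits &= set(self.full.get(term, {}))

        def priority_hits(doc_id):
            return sum(doc_id in self.priority.get(term, set()) for term in terms)

        # Documents with more priority-term matches come first; ties break on id.
        ranked = sorted(hits, key=lambda d: (-priority_hits(d), d))
        return [self.forward[d]["url"] for d in ranked]


index = TinyIndex()
index.add(1, "http://a.example/coins", "Roman coins", "notes on roman coins of constantine")
index.add(2, "http://b.example/blog", "My blog", "a post mentioning roman coins once")
index.add(3, "http://c.example/cats", "Cats", "cats and more cats")
print(index.search("roman coins"))  # ['http://a.example/coins', 'http://b.example/blog']
print(index.search("roman cats"))   # []: no single document contains both terms
```

The forward index is only consulted at the end to turn document ids back into something displayable, which is one plausible reason for keeping it separate from the term indices.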
| postdb wrote: | it is on the front page of their site: | https://memex.marginalia.nu/projects/edge/supporting.gmi | cs702 wrote: | Silly me, I assumed the "Support" button at the top of the | page was there for users who need... support. | | I kept looking for a "Donate" button :-P | | Thank you! | andruby wrote: | @marginalia_nu this is probably actionable feedback for you | dizhn wrote: | For all the talk of needing all the cloud infra to run even a | simple website, Marginalia hits the frontpage of HN and we can't | even bring a single PC sitting in some guy's living room to its | knees. | marginalia_nu wrote: | https://www.marginalia.nu/junk/just-a-fleshwound.webp | | If anything it's running faster now. All you've done is warm up | the caches and given the JVM a chance to optimize the hottest | code. | | (real talk the SSDs are running pretty near 100% utilization | though) | yonrg wrote: | 239 days up. That's brave too ;) | marginalia_nu wrote: | Rebooting is like an hour of downtime :-/ | | FWIW I'm going commando with no ECC RAM too. | nine_k wrote: | I remember there is ksplice or something like that to | upgrade even the kernel without a complete downtime. | Everything else can be upgraded piecemeal, provided that | worker processes can be restarted without downtime. | winrid wrote: | If the SSDs were really maxed out you'd see high CPU | usage/load as the CPU is blocked by IOWAIT. | marginalia_nu wrote: | Not in this case, it's all mmap. | winrid wrote: | Really? Even with mmap'ed memory won't the CPU still | register user code waiting on reading pages from disk as | iowait? I'm so surprised by that that if it doesn't it | sounds like a bug. | marginalia_nu wrote: | Yeah it's at least what I've been seeing. Although it | could alternatively be that a lot of the I/O activity is | predictive reads, and the threads don't actually stall on | page faults all that often. | mvdwoord wrote: | Very well done!
On mobile now but will check out the site | once home. | | Do you happen to have a writeup somewhere of your tech stack? | btown wrote: | Not OP but https://github.com/MarginaliaSearch/MarginaliaSe | arch/blob/ma... has a diagram and component overview! | marginalia_nu wrote: | I'm working on the documentation. It's getting there, but | it's still kinda thin in many places. | btown wrote: | IMO it's actually incredibly well-documented and | thoughtfully organized for a one-person project! You | should be proud of what you've put together here! | marginalia_nu wrote: | Yeah I did a huge refactoring effort very recently. I put | a lot of effort into making the code easy to poke around in | and I feel that works very well. | | But besides that, there's still a lot left to be desired | when it comes to how it actually works. Not everything is | easy to glean from the code alone. | marginalia_nu wrote: | It's a Debian server running nginx into a bunch of custom | Java services that use the Spark microframework[1]. I use a | MariaDB server for link data, and I've built a bespoke | index in Java. | | [1] https://sparkjava.com/ I don't use Spring Boot or | anything like that, besides Spark I'm not using frameworks. | rglover wrote: | Semi-related: do you just run a static IP out of your house? | marginalia_nu wrote: | Yup. | dizhn wrote: | I love this!! :) How much do 24 real cores + 126GB RAM cost | on the cloud? A million dollars? | ftth_finland wrote: | About 150EUR per month as a dedicated server from Hetzner. | marginalia_nu wrote: | Something like EUR200/mo if you factor in the need for | disk space as well. This is also Hetzner we're talking | about. They're sort of infamous for horror stories of | arbitrarily removed servers and having shitty support. | They're the cheapest for a reason. | | But with dedicated servers, are we really talking cloud? | ftth_finland wrote: | A dedicated server is obviously not "cloud", but that | wasn't really the point I was making.
| | My point was rather underlining the absurdity of using | cloud for everything. | | Hetzner is just an example of a dedicated server | provider. There are others, some in the same price range, | others a bit more. | | As an aside to my point, it is often cheaper and more | flexible to use dedicated servers than to buy and | colocate your own hardware. | belter wrote: | So the Cloud is the cheaper option, if you factor in | energy cost and hardware depreciation. | ftth_finland wrote: | The monthly cost of a dedicated server includes | everything: bandwidth, power and hardware. | | How do you figure a $5k annual cloud spend is cheaper | than ~150EUR per month? | belter wrote: | I do not understand your comment. | dizhn wrote: | I think parent is talking about colo. | hiddencost wrote: | I think you're confused. | | ~150EUR in cloud costs is cheaper than $5k cost to buy | the hardware the guy has in his living room. | DHolzer wrote: | In Azure that's roughly 5k per year if you pay for the | whole year upfront. I have the pleasure of playing with | 64 cores, 256 GB RAM and 2xV100 for data science projects | every now and then. That turns out to be roughly 32k per | year. | marginalia_nu wrote: | Yeah I wouldn't try to run this in the cloud. Would be | broke as a joke in a week. | | This is $5000 worth of consumer hardware, give or take. | TremendousJudge wrote: | nice | alex_sf wrote: | More than the cores and RAM, you have bigger issues with | I/O (both throughput and latency) to disk and the network | from cloud providers. Physical hardware, even when | comparing cores/RAM 1:1, is outrageously faster than cloud | services. | nine_k wrote: | Don't use EBS, use the local SSDs, which are an option for | most cloud VMs. | [deleted] | adrian_mrd wrote: | Kudos.
[0] is the best URL I have come across in the past few | years | | [0] https://www.marginalia.nu/junk/just-a-fleshwound.webp | phendrenad2 wrote: | "/junk/just-a-fleshwound.webp" for those on mobile | gary_0 wrote: | I'm curious about your network bandwidth/load. You only serve | text, right? [Edit: No, I see thumbnail images too!] Is the | box in a datacenter? If not, what kind of Internet connection | does it have? | marginalia_nu wrote: | Average load today has at worst been about 300 Kb/s TX, 200 | Kb/s RX. I've got a 1000/100 Mbit/s down/up connection. | Seems to be holding without much trouble. | | Most pages with images do lazy loading so I'm not hit with | 30 images all at once. They're also WebP and cached via | Cloudflare, which softens the blow quite a lot. | [deleted] | [deleted] | alex_sf wrote: | StackOverflow still just runs on a pair of (beefy) SQL servers. | Modern web engineering is a joke. | dizhn wrote: | I believe Hacker News itself is a couple of servers too. | winrid wrote: | One big server with a flat text file DB on NVMe drives | AFAIK. | amiga-workbench wrote: | I believe the application code is single threaded due to | its interpreter too. | TremendousJudge wrote: | it does run pretty slow under load though, and they have | acknowledged it's due to this | marginalia_nu wrote: | HN has mutable data though. That's a much harder problem | than indexing a large amount of static data like a search | engine. | winrid wrote: | Not only that, but I don't think the custom datastore | handles concurrent writes. | | One thread FTW. :) | bookofjoe wrote: | dang? Bueller? Anyone? | hedora wrote: | Wikipedia too. | mrweasel wrote: | That doesn't sound right: | https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF | nemo44x wrote: | A lot of modern web engineering is built for problems most | people won't have. | nine_k wrote: | The people _strive_ to have these problems!
Hockey stick | growth, servers melting under signup requests, payment | systems struggling under the stream of subscription | payments! Scale up, up, up! And for _that_ you might want | to run your setup under k8s since day one, just in case, | even though a single inexpensive server would run the whole | thing with a 5x capacity reserve. But that would feel like | a side project, not a startup! | HarHarVeryFunny wrote: | That is pretty impressive - not only not on its knees, but very | responsive atm. | MichaelZuo wrote: | I doubt anyone would be foolish enough to claim that the site | NEEDS 'cloud infra' to run. | nine_k wrote: | It depends. Both https://google.com and, say, | https://www.medusa.priv.at/ are technically web sites, but | the complexity of the tech that makes them work is pretty | different. | marginalia_nu wrote: | I think it sort of depends on what you want. | | Every time I deploy a service it goes down for anything | between 30 seconds and 5 minutes. When I switch indices, the | entire search engine is down for a day or more. Since the | entire project is essentially non-commercial, I think this is | fine. I don't need five nines. | | If reliability was extremely important, scales would tilt | differently, maybe cloud would be a good option. A lot of it | is for CYA's sake as well. If I mess up with my server, | that's both my problem and my responsibility. If a cloud | provider messes up, then that's a SLA violation and maybe | damages are due. | nine_k wrote: | 24 cores, 128 GB RAM. One could run 10-20 EC2 instances to | utilize a box like this, and produce an impression of sprawling | backend infrastructure. | [deleted] | MichaelZuo wrote: | It's quite hard to search for lists of any kind. For example: | | list of Italian generals | | list of CPU architectures | | list of positive rights | | etc... | | Return no relevant results. Perhaps it's not giving enough weight | to the 'list' aspect? 
| marginalia_nu wrote: | I think there's two reasons for this. | | The query processing is fairly crude. For better or worse, it | doesn't do much special processing. Which means you basically | need a website that repeatedly says "list of CPU architectures" | to rank well. | | Most of the pages that contain such a title are also actual | lists. The index de-prioritizes documents that are mostly lists | or tabular data, as they're very often false positives, | since they often contain repeated words. | MichaelZuo wrote: | It doesn't seem like an issue of ranking since there are no | results with lists of any sort, even at the bottom. | | Even if the word 'list of ...' only appears once, it wouldn't | be filtered out, right? | | Out of 100 million pages, it seems like there could easily be | a few hundred thousand with lists. | SinePost wrote: | What advantages does this offer over something like typing | "$QUERY -site:*.com" into a mainstream search engine? I think | webmasters in general do a pretty good job at self-segregating | their sites into commercial and non-commercial entities through | the use of different top-level domains. | 1123581321 wrote: | The advantage is a better set of sites since there are a ton of | interesting little .com sites out there, and lots of unwanted | sites on .org and country code domains. He also blocks some URL | patterns that appear on spammy domains regardless of TLD. Try a | few searches or the random results page to see the difference. | It's fun browsing. | djoldman wrote: | From my goto search term, "chickens": | | https://theoutline.com/post/5608/bury-me-in-chicken-diapers | marginalia_nu wrote: | I'll counter with the top result for "cats": | http://diabellalovescats.com/catland.htm | Aeolun wrote: | Thank you. I needed this in my life. | pxc wrote: | Anecdotally, all the people I know who have | recreational/pet/indoor chickens are lovely human beings, so | I'm wholly in favor of this absurd industry and its success. 
| marginalia_nu wrote: | "Recreational Chicken" is a two word poem if I ever saw one. | bookofjoe wrote: | https://www.amazon.com/Under-Henfluence-Inside-Backyard- | Chic... | Jemm wrote: | First search for a commodity item returned pages full of | conspiracy theories and then drifted in to anti-vax territory. | marginalia_nu wrote: | Interesting, what did you search for? | HeckFeck wrote: | The unfiltered Internet in all its glory! | shanebellone wrote: | I yearn for an exclusionary Internet. Most voices don't matter. | manuelmoreale wrote: | Would you mind explaining 1) why you want that and 2) who | decides which voices do matter? | shanebellone wrote: | "1) why you want that" | | Universal broadcast does not work (beneficially for | society) in an industry built to monetize reach. | | Everyone is entitled to their opinions, but voices are not | equal in utility or worth. | | "2) who decides which voices do matter?" | | This is always the problem, isn't it? I don't have an | answer for you. | manuelmoreale wrote: | Isn't 1 just the result of 2? It's because we don't have | an answer and because there probably isn't an answer to | the second question that we need universal broadcast as | you called it. | | We could get away from it only if we figure out an answer | to the second question but I suspect we'll never get to | an answer. | shanebellone wrote: | "Isn't 1 just the result of 2?" | | Only in that reach will continue to be monetized | regardless of its impact on society. | pulpfictional wrote: | Isn't it already? | SinePost wrote: | That doesn't actually sound very different from my experience | with major search engines beyond the first page. I've taken it | as a bit of a law that the Internet outside of large | centralized and/or moderated sites gets very fringe very | quickly. 
Since the whole point of the search engine is to | display noncommercial sites, users will inevitably face | thousands of self-published blogs of varying beliefs, quality, | and truthiness. As these fringe sites take up more domain names | by total volume than mainstream platforms (there is only one | Twitter, Facebook, et al.), I am not surprised at all that they | seem to be even more voluminous here than on commercial search | engines. | yonrg wrote: | Thanks for this! It absolutely touched me. www in the way it was | back in the '90s :) those memories came back when getting results | to neocities. ___________________________________________________________________ (page generated 2023-04-18 23:00 UTC)