[HN Gopher] A Unix-style personal search engine and web crawler ... ___________________________________________________________________ A Unix-style personal search engine and web crawler for your digital footprint Author : amirGi Score : 257 points Date : 2021-07-26 16:09 UTC (6 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | ctocoder wrote: | Wrote something of the same ilk but got distracted: | https://github.com/dathan/go-find-hexagonal | yunruse wrote: | I love this idea, but the name "digital footprint" sort of | implies it's about the effect you've had on the Internet, helping | keep your online persona under control: your tweets, comments, | emails, et cetera. | | But this is a great idea! Having a search engine for vaguely | _anything_ you touch does very much look like it'd increase the | signal-to-noise ratio. It'd be interesting to be able to add whole | sites (using, say, DuckDuckGo as an external crawler) to be able | to fetch general ideas, such as, say, "Stack Exchange posts | marked with these tags". | flanbiscuit wrote: | > but the name "digital footprint" sort of implies it's about the | effect you've had on the Internet, helping keep your online | persona under control: your tweets, comments, emails, et | cetera. | | I had the exact same thought when I saw that in the title. It | would also be cool to be able to search within your own | online accounts. | | So this is the project's description of what "digital | footprint" means: | | > Apollo is a search engine and web crawler to digest your | digital footprint. What this means is that you choose what to | put in it. When you come across something that looks | interesting, be it an article, blog post, website, whatever, | you manually add it (with built-in systems to make doing so | easy). If you always want to pull in data from a certain data | source, like your notes or something else, you can do that too.
| This tackles one of the biggest problems of recall in search | engines, which return a lot of irrelevant information, because with | Apollo the signal-to-noise ratio is very high. You've chosen | exactly what to put in it. | | If I'm interpreting this correctly, this seems like an | alternative way of bookmarking with advanced searching, because | it scrapes the data from the source. Cool idea; it means I have to | worry less about organizing my bookmarks. | zerop wrote: | How is it different from Instapaper-like services? There is also an | open-source alternative to Instapaper called Wallabag. | fidesomnes wrote: | Adding support for transcribed voice notes, like those from Otter, would | be nice. | dpcx wrote: | Similar also to Promnesia | (https://github.com/karlicoss/promnesia), which includes a | browser extension to search the records. | dandanua wrote: | A similar tool: https://github.com/go-shiori/shiori | encryptluks2 wrote: | Shiori is "okay" but is not actively being maintained at all. | The original author abandoned it and the new maintainer | apparently never planned on supporting it. | toomanyducks wrote: | If nothing else, that README is fantastic! | soheil wrote: | Has the author tried pressing CMD+Y to view and search browser | history? | ThinkBeat wrote: | I use Evernote for this. | | You can set it to save a link, a screenshot, or the content of the | page. You can add tags if you want, and it is also easy to | annotate it so you can remember the context better. You can also | add links to other posts inside Evernote. | | Pocket is also a great tool I used for many years. Quite similar, | yet different. | | Both have browser extensions, so it is easy to clip. | | With Evernote I even have shortcuts defined so I don't have to | click for the webpage to be clipped. | Minor49er wrote: | This looks really cool.
It's beyond the scope of this project, | but I think that having something like this as a browser | extension would make it easier to use: instead of manually | copying and scraping links, it could index and save pages that | you've been on, placing much more significance on anything that | you've bookmarked. Granted, this is just an immediate thought. | I'm going to give this a proper try once I have some more spare | time. | ya1sec wrote: | Great thought. I've adopted a similar workflow using the | https://www.are.na/ Chrome extension to save links to channels. | Might be a nice touch to feed channels into the engine using | their API. | Minor49er wrote: | This looks like a fun way to explore topics. I just signed up. | pantulis wrote: | Reminds me a lot of DEVONthink for Mac. | MisterTea wrote: | > I've wasted many an hour combing through Google and my search | history to look up a good article, blog post, or just something | I've seen before. | | This is the fault of web browser vendors who have yet to give a | damn about bookmarks. | | > Apollo is a search engine and web crawler to digest your | digital footprint. What this means is that you choose what to put | in it. When you come across something that looks interesting, be | it an article, blog post, website, whatever, you manually add it | (with built-in systems to make doing so easy). | | So it's a searchable database for bookmarks then. | | > The first thing you might notice is that the design is | reminiscent of the old digital computer age, back in the Unix | days. This is intentional for many reasons. In addition to paying | homage to the greats of the past, this design makes me feel like | I'm searching through something that is authentically my own. | When I search for stuff, I genuinely feel like I'm travelling | through the past. | | This does not make any sense. It's Unix-like because it feels | old? It seems like the author thoroughly misses the point of the Unix | philosophy.
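The "searchable database for bookmarks" characterization above is easy to prototype. Here is a minimal illustrative sketch, not Apollo's actual implementation (Apollo is written in Go and maintains its own index); it uses Python's stdlib sqlite3 with the FTS5 extension, and the URLs and entries are made up for the example:

```python
import sqlite3

# In-memory demo; a real tool would persist a database file on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE bookmarks USING fts5(url, title, content)")

def add(url, title, content):
    # "content" is the scraped page text, so search covers more than the title.
    db.execute("INSERT INTO bookmarks VALUES (?, ?, ?)", (url, title, content))

def search(query):
    # FTS5 MATCH searches all indexed columns; rank orders by relevance.
    cur = db.execute(
        "SELECT url, title FROM bookmarks WHERE bookmarks MATCH ? ORDER BY rank",
        (query,))
    return cur.fetchall()

add("https://example.com/unix", "Unix philosophy",
    "small tools composed via pipes")
add("https://example.com/go", "Go concurrency",
    "goroutines and channels")

print(search("pipes"))  # matches on page content, not just the title
```

Because every row was added by hand, every hit is something the user chose to keep, which is exactly the high signal-to-noise property the README claims.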
| chris_st wrote: | > _So it's a searchable database for bookmarks then._ | | It appears to be that, but it also appears to pull out the | _content_ of the web page and index that too, so you can | (presumably) find stuff that isn't in the "pure" bookmark, | which I think of as a link with maybe a title. | nextaccountic wrote: | I think browsers should download a full copy of each bookmark | (so you can still see it when they are taken down) and make | it fully searchable. | | Actually, I've been trying to find Firefox extensions that | give a better interface to bookmarks, and there doesn't seem | to be one. It's like people don't use bookmarks anymore, accept | that they might as well not exist, and use something | else. | | It's telling that Firefox has two bookmark systems built in | (Pocket and regular bookmarks) and they aren't integrated | with each other; I suppose that people who use Pocket never | think about regular bookmarks. | | Edit: but my pet peeve is that it isn't easy to search | history for something I saw 10 days ago when I don't remember | the exact keywords to search for. | forgotpwd16 wrote: | > I think browsers should download a full copy of each | bookmark [...] and make it fully searchable. | | This, outside a browser, could be implemented as a | server/client self-hosted solution, with a back-end taking | care of downloading/searching and an extension acting as the | client. Maybe it could even be made entirely as an extension? | berkes wrote: | That would miss all the personalized content, all the | content behind authorization, and so on. | | At the very least, it would need to be able to get the | content pushed to it by the client, the way the client | has it at the moment of bookmarking, making the | download/scraping kind of superfluous. | | Indexing and doing search, however, is hard, but solved. | Hard in the sense that it is not something a Firefox | add-on could do very well. I presume a (self-)hosted | Meilisearch would suffice, though.
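One way the client-push model described above could work: the extension sends the page's rendered HTML (as the client sees it, authenticated content included), and the receiving indexer strips it down to visible text before storing it. A stdlib-only sketch of that text-extraction step follows; this is a hypothetical illustration, not how Meilisearch or any particular extension ingests content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0       # depth inside script/style elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = ("<html><head><style>body{}</style></head>"
        "<body><h1>My Notes</h1><p>grep is enough</p>"
        "<script>var x=1;</script></body></html>")
print(page_text(html))  # "My Notes grep is enough"
```

The extracted text is what would be handed to the full-text index, so the downloader/scraper step berkes calls superfluous never has to run on the server side.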
| huanwin wrote: | You and GP might find ArchiveBox to have overlap with | what you're describing? | https://github.com/ArchiveBox/ArchiveBox | | Edit: here's the description from their repo: | | "ArchiveBox is a powerful, self-hosted internet archiving | solution to collect, save, and view sites you want to | preserve offline. | | You can set it up as a command-line tool, web app, and | desktop app (alpha), on Linux, macOS, and Windows. | | You can feed it URLs one at a time, or schedule regular | imports from browser bookmarks or history, feeds like | RSS, bookmark services like Pocket/Pinboard, and more. | See input formats for a full list. | | It saves snapshots of the URLs you feed it in several | formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, | with a wide variety of content extracted and | preserved automatically (article text, audio/video, git | repos, etc.). See output formats for a full list." | phildenhoff wrote: | The difference, to me, with Pocket is that I use it | specifically as a to-read list. My list is just "sites I | want to visit/read/watch later", whereas bookmarks are more | "I want to go here regularly". Also, all the bookmark | systems I've ever used treat links as files that can only | be in one folder, whereas Pocket at least has tags, so links | can associate with multiple topics. | [deleted] | forgotpwd16 wrote: | > at least has tags so links can associate with multiple | topics | | This has always applied to regular bookmarks as well. | Basically, you can just throw everything in unsorted and | use tags only. | cassepipe wrote: | Firefox has bookmark tags. | joshuaissac wrote: | Older versions of IE used to have something like this. | "Favourites" had a "Make available offline" box that could | be ticked to keep an offline copy of the page. But they | were not searchable. | cratermoon wrote: | > I think browsers should download a full copy of each | bookmark | | Have you tried Zotero?
| totetsu wrote: | Zotero is great for this. Set up a WebDAV Docker | container and you can sync it easily, too. | nojito wrote: | Safari's Reading List does this and it's awesome. | throwawayboise wrote: | In Firefox, _File_ -> _Save Page As..._ lets me do this. | Local search tools should be able to index such archives | (if they can index Word documents, they should be able to | index HTML). Seems a fairly solved problem if it's | something you need? | asdff wrote: | Pocket isn't for bookmarks. It's a reading list. Safari and | Chrome have this feature too. | medstrom wrote: | If you don't categorize bookmarks anyway, Pocket and | equivalents might be all-around better than bookmarks. | chillpenguin wrote: | Agree with the Unix bit. I was expecting something "Unix | philosophy", but it turns out they just meant it looks retro. | jll29 wrote: | "Looks (intentionally) retro", like SerenityOS. | 1vuio0pswjnm7 wrote: | "It seems like the author thoroughly misses the point of the | Unix philosophy." | | It's like a re-interpretation of history where AT&T still | controls UNIX. (What do people think of AT&T these days?) | | "The first thing you might notice ..." | | The first thing I notice is that this project is 100% tied to Google, | what with Chrome and Go (even for SNOBOL pattern matching, | sheesh). | | "... this design makes me feel like I'm searching through | something that is authentically my own." | | Except it isn't. It shuns the use of freely available, open-source | UNIX-like projects in favor of software belonging to a | company that Hoovers up personal data and sells online ad | services. Enjoy the illusion. :) | | Life can be very comfortable inside the gilded cage.1 The | Talosians will take good care of you.2 | | 1. https://en.wikipedia.org/wiki/Gilded_cage | | 2. https://en.wikipedia.org/wiki/Talosians | stevekemp wrote: | I've been thinking recently it might be interesting/useful to | write a simple SOCKS proxy which could be used by my browser.
| | The SOCKS proxy would not just fetch the content of the page(s) | requested, but would also dump them to | ~/Archive/$year/$month/$day/$domain/$id.html. | | Of course I'd only want to archive text/plain and text/html, | but it seems like it should be a simple thing to write and | might be useful. Searching would be a simple matter of grep. | habibur wrote: | Did that. But then you will find your disk quickly getting | filled up with GBs of cached content that you rarely search | within. | | Rather, when you need that same content, you will find | yourself going to Google, searching for it, and the page is | instantly there unless it has been removed. | | There's a reason why bookmarks aren't as popular as they once | were. People now use Google + keywords instead of bookmarks. | berkes wrote: | It would also miss all the pages that are built from Ajax | requests on the client side. Which, nowadays, is a large | share. The client is the one assembling all the content | into the thing you read, and so it is the most likely | candidate to offer the copy that you want indexed. | kbenson wrote: | Maybe archive.org should run a subscription service where, | for a few bucks a month, you can request that your page visits | be archived (in a timely manner and with some level of | assurance) and leverage their system for tracking content | over time. That, in conjunction with something like Google, | might actually give fairly good assurance that what you're | searching for actually exists in a state like you saw it. It | would also mean that 30 people accessing this blog today | through the service don't use significantly more resources | to store the data, and it would help archive.org fulfill its | mission. | ryandrake wrote: | > This does not make any sense. It's Unix-like because it feels | old? It seems like the author thoroughly misses the point of the | Unix philosophy. | | Yeah, I couldn't figure out what makes it Unix-like, either. I | mean, which UNIX in particular? Solaris? AIX?
HP-UX? Do you use | UNIX commands to navigate it? Is there a shell or something? | Kind of an odd way to describe it. | chillpenguin wrote: | Usually when someone says something is Unix-like, they mean | it "embraces the Unix philosophy", which usually means something | like: it operates on stdin/stdout so it can be composed in a | pipeline on the shell. | | Which is why I was misled in this case :) | jll29 wrote: | Microsoft Research's Dr. Susan Dumais is the expert on this kind | of personal information management. | | Her landmark system (and associated seminal SIGIR '03 paper) | "Stuff I've Seen" tackled re-finding material: | http://susandumais.com/UMAP2009-DumaisKeynote_Share.pdf | simonw wrote: | My version of this is https://dogsheep.github.io/ - the idea is | to pull your digital footprint from various different sources | (Twitter, Foursquare, GitHub, etc.) into SQLite database files, | then run Datasette on top to explore them. | | On top of that I built a search engine called Dogsheep Beta, which | builds a full-text search index across all of the different | sources and lets you search in one place: | https://github.com/dogsheep/dogsheep-beta | | You can see a live demonstration of that search engine on the | Datasette website: https://datasette.io/-/beta?q=dogsheep | | The key difference I see with Apollo is that Dogsheep separates | fetching of data from search and indexing, and uses SQLite as the | storage format. I'm using a YAML configuration to define how the | search index should work: | https://github.com/simonw/datasette.io/blob/main/templates/d... - | it defines SQL queries that can be used to build the index from | other tables, plus HTML fragments for how those results should be | displayed. | gizdan wrote: | Wow! That's super cool. I will have to check this out at some | point. Am I correct in understanding that the Pocket tool | actually imports the URLs' contents? If not, how hard would it | be to include the actual content of URLs?
Specifically, I'll | probably end up using something else (for me, Nextcloud | Bookmarks). | simonw wrote: | Sadly not - I'd love it to do that, but the Pocket API | doesn't make that available. | | I've been contemplating building an add-on for Dogsheep that | can do this for any given URL (from Pocket or other sources) | by shelling out to an archive script such as | https://github.com/postlight/mercury-parser - I collected | some suggestions for libraries to use here: | https://twitter.com/simonw/status/1401656327869394945 | | That way you could save a URL using Pocket or browser | bookmarks or Pinboard or anything else that I can extract | saved URLs from, and a separate script could then archive the | full contents for you. | neolog wrote: | SingleFile and SingleFileZ are Chrome extensions that | export full web pages pretty effectively. | | https://chrome.google.com/webstore/detail/singlefile/mpiodij... | | https://chrome.google.com/webstore/detail/singlefilez/offkdf... | tomcam wrote: | Holy crap, you should submit this as a Show HN. | mosselman wrote: | Simon is not an unknown on HN. | simonw wrote: | It's failed to make the homepage a few times in the past: | https://hn.algolia.com/?q=dogsheep - the one time it did make | it was this one about Dogsheep Photos: | https://news.ycombinator.com/item?id=23271053 | ryanfox wrote: | I run a similar project: https://apse.io | | It runs locally on your laptop/desktop, so you don't need a | server to host anything. | | Also, it can index _everything_ you do, not just web content. | | It works really well for me! | totetsu wrote: | There used to be an activity timeline journal program I ran on | Ubuntu that let me see which days I accessed which files. It was | very useful as a student. | cratermoon wrote: | Interesting project, but some of what the author writes just | sounds flat-out weird. "The first thing you might notice is that | the design is reminiscent of the old digital computer age, back | in the Unix days."
| | "Apollo's client side is written in Poseidon." | | I had to look that up: Poseidon is not a language, it's just a | JavaScript framework for event-driven DOM updates. | wydfre wrote: | It seems pretty cool - but I think falcon[0] is more practical. | You can install it from the Chrome extension store[1] if you are | too lazy to get it running yourself. | | [0]: https://github.com/lengstrom/falcon | | [1]: | https://chrome.google.com/webstore/detail/falcon/mmifbbohghe... | grae_QED wrote: | Are there any Firefox equivalents to Falcon? I'm very | interested in something like this. | news_to_me wrote: | If it's a WebExtension, it's usually not too hard to port to | Firefox (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...) | nathan_phoenix wrote: | In the issues someone says that it works even in FF. You just | need to change the extension of the file. Though I haven't tried it | yet. | | https://github.com/lengstrom/falcon/issues/73#issuecomment-6... | soheil wrote: | There is something really strange about a lot of recent Go | projects, including this one. I can't put my finger on it, but the | combination of the author and the type of problem they choose to | tackle oftentimes seems baffling to me. Most projects seem to be | solving a problem that is often misidentified or otherwise badly | solved, but somehow the focus ends up being on the code | architecture or the UI design. It's like they're trying to solve | a problem just for the sake of writing some code and using Go | idiomatically, or something, and don't really care about the | problem or how well the solution actually works. | asdff wrote: | I think projects like this are just resume builders. Everyone | says "show a project on GitHub"; well, here is one of those | projects. The dev is probably hoping this helps land them a job | offer.
It's fine if the project is ultimately "lame" in some | way, since it's not the job description of a developer to make a | cool, unique app, but to follow orders from the project manager | and write code, which is what this project shows this dev can | do. | jrm4 wrote: | Yeah, as a bit of an old-timer, I'm trying to learn to stop | worrying and love watching everybody reinvent wheels. | For me it's "why are you people doing that in JavaScript?" that | continually comes up in my own head, but I suppose I should try | to be patient and see if anything comes of it. | rhn_mk1 wrote: | This seems similar to Recoll augmented with recoll-we. | | https://addons.mozilla.org/en-US/firefox/addon/recoll-we/ | SahAssar wrote: | Looks very much like one of the ideas I've been thinking of | building! The way I planned to do it was to use a similar | approach to rga for files ( https://github.com/phiresky/ripgrep-all ) | and have a WebExtension pull all web pages I visit | (filtered via something like | https://github.com/mozilla/readability ), then dump that into either | SQLite with FTS5 or Postgres with FTS for search. | | A good search engine for "my stuff" and "stuff I've seen before" | is not available to most people, in my experience. Pinboard and | similar sites fill some of that role, but only for things that | you bookmark (and I'm not sure they do full-text search of the | documents). | | --- | | Two things I'd mention are: | | 1. Digital footprint usually means your info on other sites, not | just things I've accessed. If I read a blog, that is not part of | my footprint, but if I leave a comment on that blog, that comment | is part of it. The term is also mostly used in a tracking and | negative context (although there are exceptions), so you might | want to change that: | https://en.wikipedia.org/wiki/Digital_footprint | | 2. I don't really get what makes it UNIX-style (or what exactly | you mean by that?
There seem to be many definitions), and the | readme does not seem to clarify much beyond expecting me to | notice it myself. | eddieh wrote: | I've been toying with an idea like this too. I set my browser | to never delete history items years ago, so I have a huge | amount of daily web use that needs to be indexed. The browser's | built-in history search has saved me a few times, but it is so | primitive it hurts. | grae_QED wrote: | > I don't really get what makes it UNIX-style | | I think what they meant was that it's an entirely text-based | program. Perhaps they are conflating UNIX with CLI. | alanh wrote: | A code comment in the readme describes the Record as constituting | an 'interverted index'. Typo for 'inverted'? Although it is not | obvious to me what would make this an inverted index instead of a | normal index. | [deleted] | etherio wrote: | This is cool! Similar to one of the goals I'm trying to | accomplish with Archivy (https://archivy.github.io), with the | broader goal of not just storing your digital presence but also | acting as a personal knowledge base. | kordlessagain wrote: | Cool! It's great to see others thinking about this. I've been | working on https://mitta.us for a while now, and it uses Solr, a | headless browser, and Google Vision to snapshot and index full | text. The UI is a bit odd, but you can just prepend mitta.us/ to | any URL to save it. | encryptluks2 wrote: | Why do all these bookmark projects: | | 1. Rely on JavaScript for the interface? Being built in Go, why | not just paginate the results and utilize Bleve or Xapian for | search? | | 2. Store data in a format that is not easily readable by itself? | The only exception to this is nb. | | 3. Suck at CLI tools? I'm looking to rclone, Hugo, kubectl, etc., | for the right way to build a CLI. ___________________________________________________________________ (page generated 2021-07-26 23:00 UTC)