[HN Gopher] A Unix-style personal search engine and web crawler ...
       ___________________________________________________________________
        
       A Unix-style personal search engine and web crawler for your
       digital footprint
        
       Author : amirGi
       Score  : 257 points
       Date   : 2021-07-26 16:09 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ctocoder wrote:
        | Wrote something of the same ilk but got distracted:
       | https://github.com/dathan/go-find-hexagonal
        
       | yunruse wrote:
        | I love this idea, but the name "digital footprint" sort of
        | implies it's about the effect you've had on the Internet, for
        | helping keep your online persona under control: your tweets,
        | comments, emails, et cetera.
       | 
        | But this is a great idea! Having a search engine for vaguely
        | _anything_ you touch really does look like it'd increase the
        | signal:noise ratio. It'd be interesting to be able to add whole
        | sites (using, say, DuckDuckGo as an external crawler) to fetch
        | general ideas, such as "Stack Exchange posts marked with these
        | tags".
        
         | flanbiscuit wrote:
          | > but the name "digital footprint" sort of implies it's about
          | the effect you've had on the Internet, for helping keep your
          | online persona under control: your tweets, comments, emails,
          | et cetera.
         | 
         | I had the exact same thought when I saw that in the title. That
         | would also be a cool idea to be able to search within your own
         | online accounts.
         | 
          | So this is the project's description of what "digital
          | footprint" means:
         | 
         | > Apollo is a search engine and web crawler to digest your
         | digital footprint. What this means is that you choose what to
         | put in it. When you come across something that looks
         | interesting, be it an article, blog post, website, whatever,
         | you manually add it (with built in systems to make doing so
         | easy). If you always want to pull in data from a certain data
         | source, like your notes or something else, you can do that too.
         | This tackles one of the biggest problems of recall in search
         | engines returning a lot of irrelevant information because with
         | Apollo, the signal to noise ratio is very high. You've chosen
         | exactly what to put in it.
         | 
          | If I'm interpreting this correctly, this seems like an
          | alternative way of bookmarking with advanced search, because
          | it scrapes the data from the source. Cool idea; it means I
          | have to worry less about organizing my bookmarks.
        
       | zerop wrote:
        | How is it different from Instapaper-like services? There is
        | also an open-source alternative to Instapaper called Wallabag.
        
       | fidesomnes wrote:
       | Adding support for transcribed voice notes like from Otter would
       | be nice.
        
       | dpcx wrote:
       | Similar also to Promnesia
       | (https://github.com/karlicoss/promnesia), which includes a
       | browser extension to search the records.
        
       | dandanua wrote:
       | A similar tool - https://github.com/go-shiori/shiori
        
         | encryptluks2 wrote:
         | Shiori is "okay" but is not actively being maintained at all.
         | The original author abandoned it and the new maintainer
         | apparently never planned on supporting it.
        
       | toomanyducks wrote:
       | If nothing else, that README is fantastic!
        
       | soheil wrote:
       | Has the author tried pressing CMD+Y to view and search browser
       | history?
        
       | ThinkBeat wrote:
       | I use Evernote for this.
       | 
        | You can set it to save a link, a screenshot, or the content of
        | the page. You can add tags if you want, and it is also easy to
        | annotate it so you can remember the context better. You can
        | also add links to other posts inside Evernote.
        | 
        | Pocket is also a great tool that I used for many years. Quite
        | similar, yet different.
       | 
       | Both have browser extensions, so it is easy to clip.
       | 
        | With Evernote I even have shortcuts defined so I don't have to
        | click for the webpage to be clipped.
        
       | Minor49er wrote:
       | This looks really cool. It's beyond the scope of this project,
       | but I think that having something like this as a browser
       | extension would make it easier to use: instead of manually
       | copying and scraping links, it could index and save pages that
       | you've been on, placing much more significance on anything that
       | you've bookmarked. Granted, this is just an immediate thought.
       | I'm going to give this a proper try once I have some more spare
       | time.
        
         | ya1sec wrote:
          | Great thought. I've adopted a similar workflow using the
          | https://www.are.na/ Chrome extension to save links to
          | channels. It might be a nice touch to feed channels into the
          | engine using their API.
        
           | Minor49er wrote:
           | This looks like a fun way to explore topics. I just signed up
        
       | pantulis wrote:
       | Reminds me a lot of DEVONthink for Mac
        
       | MisterTea wrote:
       | > I've wasted many an hour combing through Google and my search
       | history to look up a good article, blog post, or just something
       | I've seen before.
       | 
        | This is the fault of web browser vendors who have yet to give a
        | damn about bookmarks.
       | 
       | > Apollo is a search engine and web crawler to digest your
       | digital footprint. What this means is that you choose what to put
       | in it. When you come across something that looks interesting, be
       | it an article, blog post, website, whatever, you manually add it
       | (with built in systems to make doing so easy).
       | 
       | So it's a searchable database for bookmarks then.
       | 
       | > The first thing you might notice is that the design is
       | reminiscent of the old digital computer age, back in the Unix
       | days. This is intentional for many reasons. In addition to paying
       | homage to the greats of the past, this design makes me feel like
       | I'm searching through something that is authentically my own.
       | When I search for stuff, I genuinely feel like I'm travelling
       | through the past.
       | 
        | This does not make any sense. It's Unix-like because it feels
        | old? It seems like the author thoroughly misses the point of
        | the unix philosophy.
        
         | chris_st wrote:
          | > _So it's a searchable database for bookmarks then._
          | 
          | It appears to be that, but it also appears to pull out the
          | _content_ of the web page and index that too, so you can
          | (presumably) find stuff that isn't in the "pure" bookmark,
          | which I think of as a link with maybe a title.
        
           | nextaccountic wrote:
            | I think browsers should download a full copy of each bookmark
            | (so you can still see it when the page is taken down) and
            | make it fully searchable.
           | 
           | Actually, I've been trying to find Firefox extensions that
           | give a better interface to bookmarks and there doesn't seem
            | to be one. It's as if people don't use bookmarks anymore,
            | accept that they might as well not exist, and use something
            | else.
           | 
            | It's telling that Firefox has two bookmark systems built in
            | (Pocket and regular bookmarks) and they aren't integrated
            | with each other; I suppose that people who use Pocket never
            | think about regular bookmarks.
           | 
            | edit: but my pet peeve is that it isn't easy to search
            | history for something I saw 10 days ago when I don't
            | remember the exact keywords to search for.
        
             | forgotpwd16 wrote:
             | >I think browsers should download a full copy of each
             | bookmark [...] and make it fully searchable.
             | 
              | This, outside a browser, could be implemented as a
              | self-hosted server/client solution, with a back-end taking
              | care of downloading/searching and an extension acting as
              | the client. Maybe it could even be built entirely as an
              | extension?
        
               | berkes wrote:
               | That would miss all the personalized content, all the
               | content behind authorization and so on.
               | 
                | At the very least, it would need to be able to get the
                | content pushed to it by the client, the way the client
                | has it at the moment of bookmarking, making the
                | download/scraping kind of superfluous.
                | 
                | Indexing and searching, however, are hard but solved.
                | Hard in the sense that it is not something a Firefox
                | add-on could do very well. I presume a (self-)hosted
                | Meilisearch would suffice, though.
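                | 
                | A very rough, untested sketch of that push endpoint
                | (the port numbers and the "pages" index name are made
                | up; Meilisearch's documents endpoint takes a JSON
                | array, and a real instance may also need an
                | Authorization header):
                | 
                |     // extension POSTs the rendered page here; we
                |     // forward it to a local meilisearch index.
                |     package main
                | 
                |     import (
                |         "bytes"
                |         "io"
                |         "log"
                |         "net/http"
                |     )
                | 
                |     func main() {
                |         http.HandleFunc("/push", push)
                |         addr := "127.0.0.1:7777"
                |         log.Fatal(http.ListenAndServe(addr, nil))
                |     }
                | 
                |     func push(w http.ResponseWriter, r *http.Request) {
                |         // body: {"id": ..., "url": ...,
                |         //        "title": ..., "content": ...}
                |         body, err := io.ReadAll(r.Body)
                |         if err != nil {
                |             http.Error(w, err.Error(), 400)
                |             return
                |         }
                |         // wrap the single document in an array
                |         doc := append([]byte("["), body...)
                |         doc = append(doc, ']')
                |         url := "http://127.0.0.1:7700" +
                |             "/indexes/pages/documents"
                |         resp, err := http.Post(url,
                |             "application/json",
                |             bytes.NewReader(doc))
                |         if err != nil {
                |             http.Error(w, err.Error(), 502)
                |             return
                |         }
                |         defer resp.Body.Close()
                |         w.WriteHeader(resp.StatusCode)
                |     }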
        
               | huanwin wrote:
               | You and GP might find ArchiveBox to have overlap with
               | what you're describing?
               | https://github.com/ArchiveBox/ArchiveBox
               | 
               | Edit: here's the description from their repo
               | 
               | "ArchiveBox is a powerful, self-hosted internet archiving
               | solution to collect, save, and view sites you want to
               | preserve offline.
               | 
               | You can set it up as a command-line tool, web app, and
               | desktop app (alpha), on Linux, macOS, and Windows.
               | 
               | You can feed it URLs one at a time, or schedule regular
               | imports from browser bookmarks or history, feeds like
               | RSS, bookmark services like Pocket/Pinboard, and more.
               | See input formats for a full list.
               | 
               | It saves snapshots of the URLs you feed it in several
               | formats: HTML, PDF, PNG screenshots, WARC, and more out-
               | of-the-box, with a wide variety of content extracted and
               | preserved automatically (article text, audio/video, git
               | repos, etc.). See output formats for a full list."
        
             | phildenhoff wrote:
             | The difference, to me, about Pocket is that I use it
             | specifically as a to-read list. My list is just "sites I
             | want to visit/read/watch later", whereas bookmarks are more
             | of "I want to go here regularly". Also, all the bookmark
             | systems I've ever used treat links as files that can only
             | be in one folder, whereas Pocket at least has tags so links
             | can associate with multiple topics.
        
               | [deleted]
        
               | forgotpwd16 wrote:
               | >at least has tags so links can associate with multiple
               | topics
               | 
                | This has always applied to regular bookmarks as well.
                | Basically you can just throw everything in unsorted and
                | use tags only.
        
               | cassepipe wrote:
               | Firefox has bookmark tags
        
             | joshuaissac wrote:
              | Older versions of IE used to have something like this.
             | "Favourites" had a "Make available offline" box that could
             | be ticked to keep an offline copy of the page. But they
             | were not searchable.
        
             | cratermoon wrote:
             | > I think browsers should download a full copy of each
             | bookmark
             | 
             | Have you tried Zotero?
        
               | totetsu wrote:
                | Zotero is great for this. Set up a WebDAV Docker
                | container and you can sync it easily too.
        
             | nojito wrote:
              | Safari's Reading List does this and it's awesome.
        
             | throwawayboise wrote:
              | In Firefox, _File_ -> _Save Page As..._ lets me do this.
              | Local search tools should be able to index such archives
              | (if they can index Word documents, they should be able to
              | index HTML). Seems a fairly solved problem if it's
              | something you need?
        
             | asdff wrote:
             | Pocket isn't for bookmarks. It's a reading list. Safari and
             | Chrome have this feature too.
        
               | medstrom wrote:
               | If you don't categorize bookmarks anyway, Pocket and
               | equivalent might be all-around better than bookmarks.
        
         | chillpenguin wrote:
         | Agree with the unix bit. I was expecting something "unix
         | philosophy" but it turns out they just meant it looks retro.
        
           | jll29 wrote:
           | "looks (intentionally) retro" like Serenity OS.
        
         | 1vuio0pswjnm7 wrote:
         | "It seems like the author thoroughly misses the point of the
         | unix philosophy."
         | 
          | It's like a re-interpretation of history where AT&T still
          | controls UNIX. (What do people think of AT&T these days?)
         | 
         | "The first thing you might notice ..."
         | 
         | First thing I notice is this project is 100% tied to Google,
         | what with Chrome and Go (even for SNOBOL pattern matching,
         | sheesh).
         | 
         | "... this design makes me feel like I'm searching through
         | something that is authentically my own."
         | 
         | Except it isn't. It shuns the use of freely available, open-
         | source UNIX-like projects in favor of software belonging to a
         | company that Hoovers up personal data and sells online ad
         | services. Enjoy the illusion. :)
         | 
          | Life can be very comfortable inside the gilded cage [1]. The
          | Talosians will take good care of you [2].
         | 
         | 1. https://en.wikipedia.org/wiki/Gilded_cage
         | 
         | 2. https://en.wikipedia.org/wiki/Talosians
        
         | stevekemp wrote:
         | I've been thinking recently it might be interesting/useful to
         | write a simple SOCKS proxy which could be used by my browser.
         | 
         | The SOCKS proxy would not just fetch the content of the page(s)
         | requested, but would also dump them to
         | ~/Archive/$year/$month/$day/$domain/$id.html.
         | 
         | Of course I'd only want to archive text/plain and text/html,
         | but it seems like it should be a simple thing to write and
          | might be useful. Searching would be a simple matter of grep.
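          | 
          | A rough, untested sketch of the archiving side in Go (plain
          | HTTP only; HTTPS is tunnelled via CONNECT and would need TLS
          | interception to be archived; the $id here is just a
          | timestamp):
          | 
          |     // archiving forward proxy: point the browser at
          |     // 127.0.0.1:8080 for plain-HTTP traffic.
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "io"
          |         "net/http"
          |         "os"
          |         "path/filepath"
          |         "strings"
          |         "time"
          |     )
          | 
          |     func main() {
          |         h := http.HandlerFunc(handle)
          |         http.ListenAndServe("127.0.0.1:8080", h)
          |     }
          | 
          |     func handle(w http.ResponseWriter, r *http.Request) {
          |         // proxy requests carry an absolute URL
          |         resp, err := http.DefaultTransport.RoundTrip(r)
          |         if err != nil {
          |             http.Error(w, err.Error(), 502)
          |             return
          |         }
          |         defer resp.Body.Close()
          |         for k, vs := range resp.Header {
          |             for _, v := range vs {
          |                 w.Header().Add(k, v)
          |             }
          |         }
          |         w.WriteHeader(resp.StatusCode)
          | 
          |         ct := resp.Header.Get("Content-Type")
          |         if strings.HasPrefix(ct, "text/html") ||
          |             strings.HasPrefix(ct, "text/plain") {
          |             now := time.Now()
          |             dir := filepath.Join(os.Getenv("HOME"),
          |                 "Archive", now.Format("2006"),
          |                 now.Format("01"), now.Format("02"),
          |                 r.URL.Hostname())
          |             os.MkdirAll(dir, 0o755)
          |             name := fmt.Sprintf("%d.html", now.UnixNano())
          |             f, err := os.Create(filepath.Join(dir, name))
          |             if err == nil {
          |                 defer f.Close()
          |                 // tee: browser and archive both get the body
          |                 io.Copy(io.MultiWriter(w, f), resp.Body)
          |                 return
          |             }
          |         }
          |         io.Copy(w, resp.Body)
          |     }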
        
           | habibur wrote:
            | Did that. But then you will find your disk quickly getting
            | filled up with GBs of cached content that you rarely search
            | within.
            | 
            | Rather, when you need that same content, you will find
            | yourself going to Google, searching for it, and the page is
            | instantly there unless it has been removed.
            | 
            | There's a reason why bookmarks aren't as popular as they
            | used to be. People now use Google + keywords instead of
            | bookmarks.
        
             | berkes wrote:
              | It would also miss all the pages that are built from Ajax
              | requests on the client side, which, nowadays, is a large
              | share of them. The client is the one assembling all the
              | content into the thing you read, and so it is the most
              | likely candidate to offer the copy that you want indexed.
        
             | kbenson wrote:
             | Maybe archive.org should run a subscription service where
             | for a few bucks a month, you can request your page visits
             | be archived (in a timely manner and with some level of
             | assurance) and leverage their system for tracking content
              | over time. That, in conjunction with something like Google,
              | might actually give fairly good assurance that what you're
              | searching for still exists in a state like you saw it. It
              | would also mean that 30 people who use the service and
              | visit the same blog today don't use significantly more
              | resources to store the data, and it would help archive.org
              | fulfill its mission.
        
         | ryandrake wrote:
         | > This does not make any sense. It's Unix-like because it feels
         | old? It seems like the author thoroughly misses the point of
         | unix philosophy.
         | 
         | Yea, I couldn't figure out what makes it Unix-like, either. I
         | mean, which UNIX in particular? Solaris? AIX? HP-UX? Do you use
         | UNIX commands to navigate it? Is there a shell or something?
         | Kind of odd way to describe it.
        
           | chillpenguin wrote:
           | Usually when someone says something is unix-like, they mean
           | it "embraces unix philosophy", which usually means something
           | like it operates on stdin/stdout so it can be composed in a
           | pipeline on the shell.
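            | 
            | E.g. the classic shape of such a tool, as a trivial,
            | untested Go sketch (Go only because that's what Apollo is
            | written in): read stdin, write stdout, so it can sit in a
            | shell pipeline.
            | 
            |     // tiny stdin/stdout filter: print lines that
            |     // contain the first argument.
            |     package main
            | 
            |     import (
            |         "bufio"
            |         "fmt"
            |         "os"
            |         "strings"
            |     )
            | 
            |     func main() {
            |         needle := ""
            |         if len(os.Args) > 1 {
            |             needle = os.Args[1]
            |         }
            |         sc := bufio.NewScanner(os.Stdin)
            |         for sc.Scan() {
            |             if strings.Contains(sc.Text(), needle) {
            |                 fmt.Println(sc.Text())
            |             }
            |         }
            |     }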
           | 
            | Which is why I was misled in this case :)
        
       | jll29 wrote:
       | Microsoft Research's Dr. Susan Dumais is the expert on this kind
       | of personal information management.
       | 
       | Her landmark system (and associated seminal SIGIR'03 paper)
       | "Stuff I've Seen" tackled re-finding material:
       | http://susandumais.com/UMAP2009-DumaisKeynote_Share.pdf
        
       | simonw wrote:
       | My version of this is https://dogsheep.github.io/ - the idea is
       | to pull your digital footprint from various different sources
       | (Twitter, Foursquare, GitHub etc) into SQLite database files,
       | then run Datasette on top to explore them.
       | 
       | On top of that I built a search engine called Dogsheep Beta which
       | builds a full-text search index across all of the different
       | sources and lets you search in one place:
       | https://github.com/dogsheep/dogsheep-beta
       | 
       | You can see a live demonstration of that search engine on the
       | Datasette website: https://datasette.io/-/beta?q=dogsheep
       | 
       | The key difference I see with Apollo is that Dogsheep separates
       | fetching of data from search and indexing, and uses SQLite as the
       | storage format. I'm using a YAML configuration to define how the
       | search index should work:
       | https://github.com/simonw/datasette.io/blob/main/templates/d... -
       | it defines SQL queries that can be used to build the index from
       | other tables, plus HTML fragments for how those results should be
       | displayed.
        
         | gizdan wrote:
         | Wow! That's super cool. I will have to check this out at some
          | point. Am I correct in understanding that the Pocket tool
          | actually imports the URLs' contents? If not, how hard would it
         | be to include the actual content of URLs? Specifically, I'll
         | probably end up using something else (for me NextCloud
         | bookmarks).
        
           | simonw wrote:
           | Sadly not - I'd love it to do that, but the Pocket API
           | doesn't make that available.
           | 
           | I've been contemplating building an add-on for Dogsheep that
           | can do this for any given URL (from Pocket or other sources)
           | by shelling out to an archive script such as
           | https://github.com/postlight/mercury-parser - I collected
           | some suggestions for libraries to use here:
           | https://twitter.com/simonw/status/1401656327869394945
           | 
            | That way you could save a URL using Pocket or browser
            | bookmarks or Pinboard or anything else that I can extract
            | saved URLs from, and a separate script could then archive
            | the full contents for you.
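            | 
            | The shelling-out step could be as simple as something like
            | this (untested; the JSON field names are assumptions about
            | mercury-parser's output):
            | 
            |     // archive one URL by shelling out to an archiver
            |     // CLI and keeping whatever JSON it prints.
            |     package main
            | 
            |     import (
            |         "encoding/json"
            |         "log"
            |         "os"
            |         "os/exec"
            |     )
            | 
            |     func main() {
            |         url := os.Args[1]
            |         out, err := exec.Command(
            |             "mercury-parser", url).Output()
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         // field names are assumed, not verified
            |         var page struct {
            |             Title   string `json:"title"`
            |             Content string `json:"content"`
            |         }
            |         if err := json.Unmarshal(out, &page); err != nil {
            |             log.Fatal(err)
            |         }
            |         // keep the raw JSON; indexing happens later
            |         if err := os.WriteFile("archive.json",
            |             out, 0o644); err != nil {
            |             log.Fatal(err)
            |         }
            |         log.Printf("archived %q", page.Title)
            |     }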
        
             | neolog wrote:
             | SingleFile and SingleFileZ are chrome extensions that
             | export full web pages pretty effectively.
             | 
             | https://chrome.google.com/webstore/detail/singlefile/mpiodi
             | j...
             | 
             | https://chrome.google.com/webstore/detail/singlefilez/offkd
             | f...
        
         | tomcam wrote:
         | Holy crap you should submit as a Show HN
        
           | mosselman wrote:
           | Simon is not an unknown on HN.
        
           | simonw wrote:
           | It's failed to make the homepage a few times in the past:
           | https://hn.algolia.com/?q=dogsheep - the one time it did make
           | it was this one about Dogsheep Photos:
           | https://news.ycombinator.com/item?id=23271053
        
       | ryanfox wrote:
       | I run a similar project: https://apse.io
       | 
       | It runs locally on your laptop/desktop, so you don't need a
       | server to host anything.
       | 
       | Also, it can index _everything_ you do, not just web content.
       | 
       | It works really well for me!
        
       | totetsu wrote:
        | There used to be an activity timeline journal program I ran on
        | Ubuntu that let me see which days I accessed which files. It
        | was very useful as a student.
        
       | cratermoon wrote:
       | Interesting project but some of what the author writes just
       | sounds flat-out weird. "The first thing you might notice is that
       | the design is reminiscent of the old digital computer age, back
       | in the Unix days."
       | 
       | "Apollo's client side is written in Poseidon."
       | 
        | I had to look that up: Poseidon is not a language, it's just a
        | JavaScript framework for event-driven DOM updates.
        
       | wydfre wrote:
       | It seems pretty cool - but I think falcon[0] is more practical.
       | You can install it from the chrome extension store[1], if you are
       | too lazy to get it running yourself.
       | 
       | [0]: https://github.com/lengstrom/falcon
       | 
       | [1]:
       | https://chrome.google.com/webstore/detail/falcon/mmifbbohghe...
        
         | grae_QED wrote:
         | Are there any Firefox equivalents to Falcon? I'm very
         | interested in something like this.
        
           | news_to_me wrote:
           | If it's a WebExtension, it's usually not too hard to port to
           | Firefox (https://developer.mozilla.org/en-
           | US/docs/Mozilla/Add-ons/Web...)
        
           | nathan_phoenix wrote:
            | In the issues, someone says that it even works in FF. You
            | just need to change the extension of the file. Though I
            | haven't tried it yet.
           | 
           | https://github.com/lengstrom/falcon/issues/73#issuecomment-6.
           | ..
        
       | soheil wrote:
       | There is something really strange about a lot of recent Go
        | projects, including this one. I can't put my finger on it, but the
       | combination of the author and the type of problem they choose to
       | tackle oftentimes seems baffling to me. Most projects seem to be
       | solving a problem that is often misidentified or otherwise badly
       | solved, but somehow the focus ends up being on the code
        | architecture or the UI design. It's like they're trying to
        | solve a problem just for the sake of writing some code in
        | idiomatic Go or something, and they don't really care about the
        | problem or how well the solution actually works.
        
         | asdff wrote:
         | I think projects like this are just resume builders. Everyone
         | says "show a project on github," well here is one of these
         | projects. The dev is probably hoping this helps land them a job
          | offer. It's fine if the project is ultimately "lame" in some
          | way, since it's not the job description of a developer to
          | make a cool, unique app, but to follow orders from the project
          | manager and write code, which is what this project shows this
          | dev can do.
        
         | jrm4 wrote:
          | Yeah, as a bit of an old-timer, I'm trying to learn to stop
          | worrying and love watching everybody reinvent wheels?
         | For me it's "why are you people doing that in Javascript?" that
         | continually comes up in my own head, but I suppose I should try
         | to be patient and see if anything comes of it.
        
       | rhn_mk1 wrote:
       | This seems similar to recoll augmented with recoll-we.
       | 
       | https://addons.mozilla.org/en-US/firefox/addon/recoll-we/
        
       | SahAssar wrote:
       | Looks very much like one of the ideas I've been thinking of
        | building! The way I planned to do it was to use a similar
        | approach to rga for files
        | ( https://github.com/phiresky/ripgrep-all ) and to have a
        | webextension pull in all webpages I visit (filtered via
        | something like https://github.com/mozilla/readability ), then
        | dump that into either SQLite with FTS5 or Postgres with FTS for
        | search.
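        | 
        | Roughly this shape, as an untested sketch (Go with
        | mattn/go-sqlite3 built with the fts5 tag; the table and column
        | names are just placeholders):
        | 
        |     // store extracted pages in SQLite FTS5 and query them
        |     package main
        | 
        |     import (
        |         "database/sql"
        |         "fmt"
        |         "log"
        | 
        |         _ "github.com/mattn/go-sqlite3"
        |     )
        | 
        |     func main() {
        |         db, err := sql.Open("sqlite3", "pages.db")
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         defer db.Close()
        | 
        |         _, err = db.Exec(`CREATE VIRTUAL TABLE
        |             IF NOT EXISTS pages
        |             USING fts5(url, title, body)`)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        | 
        |         // the webextension would POST these
        |         _, err = db.Exec(`INSERT INTO pages(url, title, body)
        |             VALUES(?, ?, ?)`, "https://example.com",
        |             "Example", "some readability-extracted text")
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        | 
        |         rows, err := db.Query(`SELECT url, title FROM pages
        |             WHERE pages MATCH ? ORDER BY rank`, "extracted")
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         defer rows.Close()
        |         for rows.Next() {
        |             var url, title string
        |             rows.Scan(&url, &title)
        |             fmt.Println(url, title)
        |         }
        |     }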
       | 
       | A good search engine for "my stuff" and "stuff I've seen before"
       | is not available for most people in my experience. Pinboard and
       | similar sites fill some of that role, but only for things that
       | you bookmark (and I'm not sure they do full-text search of the
       | documents).
       | 
       | ---
       | 
       | Two things I'd mention are:
       | 
        | 1. Digital footprint usually means your info on other sites,
        | not just the things you've accessed. If I read a blog, that is
        | not part of my footprint, but if I leave a comment on that
        | blog, that comment is part of it. The term is also mostly used
        | in a tracking and negative context (although there are
        | exceptions), so you might want to change that:
       | https://en.wikipedia.org/wiki/Digital_footprint
       | 
        | 2. I don't really get what makes it UNIX-style (or what exactly
        | you mean by that? There seem to be many definitions), and the
        | readme does not seem to clarify much besides expecting me to
        | notice it myself.
        
         | eddieh wrote:
         | I've been toying with an idea like this too. I set my browser
         | to never delete history items years ago, so I have a huge
         | amount of daily web use that needs to be indexed. The browser's
         | built in history search has saved me a few times, but it is so
         | primitive it hurts.
        
         | grae_QED wrote:
         | >I don't really get what makes it UNIX-style
         | 
         | I think what they meant was that it's an entirely text based
         | program. Perhaps they are conflating UNIX with CLI.
        
       | alanh wrote:
        | A code comment in the readme describes the Record as
        | constituting an 'interverted index'. Typo for inverted?
        | Although it is not obvious to me what would make this an
        | inverted index instead of a normal index.
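        | 
        | For reference, an inverted index maps each term to the
        | documents containing it, rather than each document to its
        | terms. A toy sketch (names made up):
        | 
        |     // toy inverted index: term -> IDs of docs containing it
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "strings"
        |     )
        | 
        |     func main() {
        |         docs := map[int]string{
        |             1: "personal search engine",
        |             2: "web crawler and search",
        |         }
        |         index := map[string][]int{}
        |         for id, text := range docs {
        |             for _, term := range strings.Fields(text) {
        |                 index[term] = append(index[term], id)
        |             }
        |         }
        |         // a "normal" (forward) index is docs itself
        |         fmt.Println(index["search"]) // both doc IDs
        |     }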
        
       | [deleted]
        
       | etherio wrote:
        | This is cool! It's similar to one of the goals I'm trying to
        | accomplish with Archivy (https://archivy.github.io), which has
        | the broader aim of not just storing your digital presence but
        | also acting as a personal knowledge base.
        
       | kordlessagain wrote:
       | Cool! It's great to see others thinking about this. I've been
        | working on https://mitta.us for a while now and it uses Solr, a
        | headless browser, and Google Vision to snapshot and index full
        | text. The UI is a bit odd but you can just append mitta.us/ to
       | any URL to save it.
        
       | encryptluks2 wrote:
       | Why do all these bookmark projects:
       | 
        | 1. Rely on JavaScript for the interface. Being built in Go, why
        | not just paginate the results and utilize Bleve or Xapian for
        | search? (See the sketch at the end of this comment.)
       | 
       | 2. Store data in a format that is not easily readable by itself.
       | The only exception to this is nb.
       | 
        | 3. Suck as CLI tools. I'm looking to rclone, Hugo, kubectl,
        | etc. for the right way to build a CLI.
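        | 
        | For point 1, an untested Bleve sketch of what I mean (the
        | index path and struct fields are made up):
        | 
        |     // index bookmarks with Bleve and search them
        |     // server-side, paginating with Size/From.
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "log"
        | 
        |         "github.com/blevesearch/bleve/v2"
        |     )
        | 
        |     type Bookmark struct {
        |         URL, Title, Content string
        |     }
        | 
        |     func main() {
        |         // bleve.New fails if the index already exists;
        |         // use bleve.Open on later runs.
        |         mapping := bleve.NewIndexMapping()
        |         index, err := bleve.New("bookmarks.bleve", mapping)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         defer index.Close()
        | 
        |         bm := Bookmark{
        |             URL:     "https://example.com",
        |             Title:   "Example",
        |             Content: "some page text",
        |         }
        |         if err := index.Index(bm.URL, bm); err != nil {
        |             log.Fatal(err)
        |         }
        | 
        |         req := bleve.NewSearchRequest(
        |             bleve.NewMatchQuery("page"))
        |         req.Size, req.From = 10, 0
        |         res, err := index.Search(req)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         for _, hit := range res.Hits {
        |             fmt.Println(hit.ID, hit.Score)
        |         }
        |     }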
        
       ___________________________________________________________________
       (page generated 2021-07-26 23:00 UTC)