[HN Gopher] Show HN: DontBeEvil.rip: Search, for developers (API...
       Show HN: DontBeEvil.rip: Search, for developers (API, expressions,
       I'd like to invite everyone to try out DontBeEvil.rip, an
       experimental search engine for developers.  tl;dr  $ alias
       rip="curl -G -H 'Accept: text/plain' --url
       https://dontbeevil.rip/search --data-urlencode "  $ rip
       'q=Heartbleed bug'  DontBeEvil.rip is a year long experiment to see
       if a small team can build a developer-focused search engine that is
       self-sustaining on $10 monthly subscriptions.  It works by only
       indexing high-quality resources that are relevant to developers.
       You won't get useless listicles because we'll never crawl them.
       Relevant urls are harvested from HN, StackOverflow, programmer
       Reddit, and a few others. Page content comes mostly from the Common
       Crawl project.  The limited, but awesome, features in this first
       release are:  - Expressions! Experience the power of
       Elasticsearch's Simple Query Strings.  - REST API. Just change
       'text/plain' to `application/json` in the above alias.  - CLI. Just
       use curl in the terminal. Simple as.  HackerNews, StackOverflow,
       Arxiv abstracts, 2M Github repos, and programmer Reddit (up to
       2020) are being indexed right now. There's much more to come in the
       next few months.  I'd love to hear your questions, comments and
       suggestions in the comments below.
       Author : alangibson
       Score  : 218 points
       Date   : 2022-03-03 11:25 UTC (11 hours ago)
       | waynecochran wrote:
       | Are you using something similar to the original pagerank
       | algorithm that uses eigen-analysis of the link graph?
         | alangibson wrote:
         | It's not even that sophisticated yet. I'm ranking urls based on
         | their normalized score on the various community sites I find
         | them on. My next TODO is to roll up those ranks to get a rank
         | for the site, then index the whole site.
         | I will also be using the PageRank calculated by Common Crawl as
         | soon as they release the next data set.
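The ranking described above (a normalized per-community score, later rolled up into a site-level rank) can be sketched roughly as follows. This is only an illustration: dividing by each community's maximum score and averaging per host are assumptions, as are the function name `site_ranks` and the whitespace-separated input format `<source> <score> <url>`. The thread does not say how the real pipeline normalizes or aggregates.

```shell
# Hypothetical sketch: read "<source> <score> <url>" lines on stdin,
# normalize each score by the highest score seen for its source,
# then average the normalized scores per host, best host first.
site_ranks() {
    awk '
        {
            src[NR] = $1; score[NR] = $2 + 0; url[NR] = $3
            if (score[NR] > max[$1]) max[$1] = score[NR]
        }
        END {
            for (i = 1; i <= NR; i++) {
                norm = (max[src[i]] > 0) ? score[i] / max[src[i]] : 0
                split(url[i], parts, "/")   # scheme://host/path -> parts[3] is the host
                total[parts[3]] += norm
                count[parts[3]]++
            }
            for (host in total)
                printf "%.2f %s\n", total[host] / count[host], host
        }' | sort -rn
}
```

Feeding it a few HN and StackOverflow rows (one per line) prints one `<rank> <host>` line per site, so a host whose links score well in several communities floats to the top.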
       | bryanrasmussen wrote:
       | how's ranking done, I searched for xslt and I saw a lot of HN
       | results in the first part, seemed weird that HN would rank highly
       | for that.
         | alangibson wrote:
         | Results are heavily weighted to HN and Stackoverflow right now
         | because they are the easiest resources to access and rank.
         | Since posts have a score on both platforms, it's easy to pull
         | out some 'authority' signal.
         | There's many more web pages coming. They are much more
         | difficult to get ahold of and rank though because I need to run
         | my own crawler to fill in what Common Crawl doesn't have and
         | then calculate my own site authority rankings.
       | sjbrown wrote:
       | Now do recipes
         | alangibson wrote:
         | I want to. So much.
         | all2 wrote:
         | StackOverflow but for food. RecipeOverflow. StackFood.
       | husainfazel wrote:
       | Going to "https://dontbeevil.rip/" results in a JSON error in the
       | browser:
       | {"message":"Missing Authentication Token"}
         | kiettv wrote:
         | You need to access the sub path /search?q= :)
         | przemub wrote:
         | Apparently developers need no homepages, just APIs :)
           | alangibson wrote:
           | Jup. No time for fancy things like HTML yet :)
         | alangibson wrote:
         | There's nothing there yet. I'm 100% focused on building the
         | index and tuning the master search query. There will, however,
         | be a blog post soon that goes into more detail on how to do
         | things like pagination.
         | tl;dr: https://dontbeevil.rip/search?q=monads&from=10
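The `from` parameter in the URL above looks like a zero-based result offset. A small wrapper could turn page numbers into offsets, as sketched below. The page size of 10 and the helper names `rip_offset` and `rip_page` are assumptions made for illustration; only `q` and `from` appear in the thread.

```shell
# Assumed page size; the thread only shows from=10, not the real page length.
RIP_PAGE_SIZE=10

# Convert a 1-based page number into a zero-based `from` offset.
rip_offset() {
    page=${1:-1}
    echo $(( (page - 1) * RIP_PAGE_SIZE ))
}

# Fetch one page of plain-text results, e.g.: rip_page 'monads' 2
rip_page() {
    curl -G -H 'Accept: text/plain' \
        --url https://dontbeevil.rip/search \
        --data-urlencode "q=$1" \
        --data-urlencode "from=$(rip_offset "$2")"
}
```

Under these assumptions, `rip_page monads 2` requests the same thing as the `?q=monads&from=10` URL.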
       | rdiddly wrote:
       | Love the URL most of all but will be trying this out!
       | dotancohen wrote:
       | This is great, even if I'm not getting as good results as from
       | google for now.
       | Can you expose the filtering features of ES? I'd love to query
       | e.g. "+python lists" and get results only related to Python, no
       | e.g. Lisp results. For Stack Overflow you could use the
       | question's tags as filter keys, and for other sites you'd add
       | them manually (so e.g. the PHP docs get the PHP key).
       | If you're thinking of monetizing this, I'll tell you what I tell
       | all the small, useful services that I'd like to pay for. There
       | are too many small, useful services that I'd like to pay for.
       | I'll gladly pay $1 for such a service, but you'll have a hard
       | time convincing me to pay more.
       | gargarplex wrote:
       | I guess blogs that are linked-to in non-killed HN comments should
       | probably be crawled a bit. Have you considered using social user
       | karma (this could be a 1-10 score uniquely calculated for users
       | of each of HN, Twitter, Reddit as long as it's built in a modular
       | way) as a weight in a PageRank style schema?
       | Here's how I am going to evaluate your search engine. Yesterday I
       | searched Google for "get dynamodb table row count" and found this
       | URL, https://bobbyhadz.com/blog/aws-dynamodb-count-items, which
       | provides a terrible recommendation involving a full table scan.
       | With DontBeEvil, I didn't find the correct answer, to use the
       | describe-table API.
       | If you really plan to dedicate a year to this, I would strongly
       | encourage you to re-post again as soon as you have a strong
       | update. Right now this has potential to provide value but really
       | does not. So update us when you have confidence that you might be
       | providing value! But we think you're on to a great opportunity.
         | alangibson wrote:
         | > I guess blogs that are linked-to in non-killed HN comments
         | should probably be crawled a bit
         | They are, but there are relatively few of them because my
         | only page content source is the Common Crawl. The hit rate vs
         | the total urls I'm interested in is not great. I expect to fix
         | this soon.
         | I'm also not indexing entire sites, only specific upvoted urls.
         | This will change as well.
         | > Have you considered using social user karma (this could be a
         | 1-10 score uniquely calculated for users of each of HN,
         | Twitter, Reddit as long as it's built in a modular way) as a
         | weight in a PageRank style schema?
         | Definitely. I've already started in on calculating a rank
         | coefficient for submitters, but it's not completely clear how
         | to best use it yet.
         | > Here's how I am going to evaluate your search engine
         | Feel free to dump more of these. Some solid test cases would be
         | very helpful.
       | Traster wrote:
       | This is an interesting approach to the general problem. The
       | general problem being that whatever data source you use is
       | inevitably going to be polluted by players who wish to be top of
       | the rankings in a search engine if your engine is used. Maybe
       | this solution - serving a very small niche will work, but I'd be
       | really interested to know if you guys have spent any time trying
       | to SEO your own search engine? Hire an intern whose sole task is
       | to get a page to the top of a fairly common search query like
       | replacing some common python package with your own one?
       | kiettv wrote:
       | It's powered by ElasticSearch, is it? So I can use all of its
       | query parameters?
         | alangibson wrote:
         | Indeed it is. You can use simple query strings for the q
         | parameter. See
         | https://www.elastic.co/guide/en/elasticsearch/reference/curr...
         | I'm considering opening up full ES query support for paying
         | customers, but it's too dangerous to expose it to the Internet
         | unrestricted.
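Elasticsearch's simple query string syntax treats `+` as AND, `|` as OR, `-` as negation and double quotes as a phrase. As a rough sketch, a helper can build a query in which every term is required. The name `require_all` is hypothetical, and whether the service enables every operator is not stated in the thread.

```shell
# Hypothetical helper: join terms with " +" so each one is required,
# e.g. require_all python lists  ->  python +lists
require_all() {
    [ "$#" -gt 0 ] || return 1
    query=$1
    shift
    for term in "$@"; do
        query="$query +$term"
    done
    printf '%s\n' "$query"
}

# Send the query the same way as the curl examples in this thread.
rip() {
    curl -G -H 'Accept: text/plain' \
        --url https://dontbeevil.rip/search \
        --data-urlencode "q=$1"
}
```

Usage would then be `rip "$(require_all python lists)"`. Because the query goes through `--data-urlencode`, the `+` characters reach the server as literal plus signs rather than as encoded spaces.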
           | rmbyrro wrote:
           | I think it's dangerous to expect a malicious actor would not
           | pay $10 to screw your service.
             | cj wrote:
             | "Risk management" is often not the same as "risk
             | elimination"
             | alangibson wrote:
             | Indeed it is. Presumably I would have had time to build up
             | some safeguards and run beefier servers by then though.
               | marginalia_nu wrote:
               | As a word of warning, when HN discovered my search
               | engine, I was hit hard by a botnet within a few days. Saw
               | about 30-40k queries/hour from some 10k IP addresses. I'm
               | self hosted so the worst that happened is my search
               | engine was a bit slow, but if I was cloud hosted I'd have
               | a _very_ sizable bill to pay.
               | If you do not already have a global rate limit, implement
               | one ASAP. Better to have one and not need it, than to
               | need it and not have it.
               | alangibson wrote:
               | I can't wait for the bots to show up. Setting a rate
               | limit was one of the first things I did :)
       | AnyTimeTraveler wrote:
       | I just tried a few queries related to rust and its library
       | rocket. I got only useless results on the first page and didn't
       | check further.
       | I'm guessing that's because it doesn't index docs.rs and the rust
       | forum. Both incredibly important for Rust development.
       | So as long as this engine doesn't also index most programming
       | related forums, I won't be able to use it effectively, even
       | though I really would like to.
       | The concept of limiting the scope to just a few websites sounds
       | really interesting, though. I think I will take this idea and
       | build a little thing on top of google to implement that site
       | filtering on my queries.
       | ykevinator2 wrote:
       | Can we drop the q= and the quotes from the shell cmd somehow,
       | that would make it so much nicer, and rip is a great command
       | line.
         | alangibson wrote:
         | There's a real command line coming. If you're on a Debian Linux
         | and feel like testing it out, just do
         | apt install curl jq
         | pip3 install jtbl
         | curl -O
         | https://raw.githubusercontent.com/alangibson/dontbeevil.rip/...
         | chmod u+x rip
         | ./rip 'what is a monad'
           | l0b0 wrote:
           | With long options, JSON output, and no extra Python
           | dependencies:
           |     rip() {
           |         curl \
           |             --data-urlencode "q=${1}" \
           |             --get \
           |             --header 'Content-Type: application/json' --header 'Accept: application/json' \
           |             --silent \
           |             'https://dontbeevil.rip/search' \
           |             | jq '[ .hits.hits[] | { title: .fields.title[0], url: .fields.url[0], highlight: .highlight.text[0] } ]'
           |     }
         | alexrsagen wrote:
         | This function should work for you:
         | $ rip() { curl -G -H "Accept: text/plain" --url
         | https://dontbeevil.rip/search --data-urlencode "q=$*"; }
         | $ rip Heartbleed bug
         | edit: alangibson's solution in this thread is better :)
       | cbreynoldson wrote:
       | did this idea spark from PG's old talk on new ideas
       | https://youtu.be/R9ITLdmfdLI? One of them is literally "search
       | engine for developers/hackers"
       | Natfan wrote:
       | Here's a simple PowerShell wrapper for your lovely tool:
       |     function rip {
       |         param (
       |             [Parameter(Mandatory, ValueFromRemainingArguments)]
       |             [String]
       |             $Query
       |         )
       |         $RequestParameters = @{
       |             URI = "https://dontbeevil.rip/search?q=$Query"
       |             Headers = @{ Accept = "text/plain" }
       |         }
       |         $Request = Invoke-RestMethod @RequestParameters
       |         return $Request
       |     }
       | Usage:
       |     > rip heartbleed bug
       |     Heartbleed Bug
       |     <https://heartbleed.com/>
       |     Heartbleed Bug The Heartbleed Bug The Heartbleed Bug is a
       |     serious vulnerability in the popular OpenSSL cryptographic
       |     software library....
       | przemub wrote:
       | I really like it and it already gives some useful results. A rise
       | of curated search engines as yours would be lovely.
       | It would be nice if the main page linked to your blog or anything
       | really, because I would like to know where can I follow this
       | project!
         | alangibson wrote:
         | Thanks!
         | I'm giving this project a year to build up momentum. If it
         | looks promising, I plan on having other STEM verticals. Maybe
         | even fix recipe searches one day :)
         | A real homepage is coming. Feel free to subscribe to my blog's
         | RSS feed for now: https://landshark.io/feed.xml
       | amar-laksh wrote:
         | I saw your blog post a couple of days ago. This looks really
       | promising!
         | alangibson wrote:
         | Thanks! I'll be updating that post today. I changed quite a few
         | things getting ready for this Show HN, so it's now out of date.
       | arminiusreturns wrote:
       | Best of luck to you, I think there is a targetable niche that
       | could utilize this.
       | Having thought quite a bit about the search space, I think a
       | whitelist approach is going to be the next big search thing,
       | because advertising and bs sites have corrupted SEO too far.
       | I'm reminded of the site indexer websites in the early days of
       | the internet. Curation if done properly (based on quality of
       | content and not certain other factors that currently play too
       | heavy a role in the seo algo black boxes) seems to be how we
       | adapt to the current information tsunami we are all dealing with.
       | I think a long time ago I decided I would even pay for such a
       | service, just like I am willing to pay for a good news source (FT
       | for me, not cheap, but worth it). I'm not positive the 10$ mark is
       | low enough but I hope for the general landscape it is.
       | Just don't forget to keep dontbeevil more than a catchphrase. In
       | particular, please be transparent with what user data you collect
       | and how you use it.
         | alangibson wrote:
         | > I think a whitelist approach is going to be the next big
         | search thing
         | It almost has to be. Spammers, growth hackers, et al. are just
         | too numerous and too good.
         | > Im not positive the 10$ mark is low enough but I hope for the
         | general landscape it is.
         | I saw enough people mention $10 that I decided to go for it. To
         | be honest, $10 is already probably too low to be sustainable
         | because of the huge amount of resources search consumes and the
         | high cost of development.
         | My gut feeling is that it's economically impossible to build a
         | good search engine that isn't loaded with ads and spyware. But
         | I spent so long complaining about G that I decided to prove it
         | to myself one way or another.
           | wolfgang42 wrote:
           | _> because of the huge amount of resources search consumes_
           | I've been intermittently working on much the same idea as the
           | OP, and I suspect this is actually a lot less of a problem
           | than it seems, since they're focusing on a niche. Indexing
           | _everything_ the way Google does requires a lot of resources,
           | but indexing the majority of useful material in a specific
           | domain takes a lot less. (My ElasticSearch index for the
           | entirety of StackOverflow is a mere 40 GB, for example.)
           | By far the more expensive part is likely to be paying market
           | rates for a developer (you need a decent number of users
           | paying $10/mo to hit a mid-market salary), but in theory this
           | scales relatively independently of userbase.
           |  _Edit:_ I've just noticed I'm replying to the OP, who's
           | mentioned downthread that they're using BigQuery and spending
           | $200/week. I've gone the marginalia.nu route and run
           | everything on a computer in my living room, which changes the
           | calculus somewhat--it's a lot cheaper, but probably involves
           | more development time.
           | For me it's mainly about the learning experience but I'd be
           | interested to hear your thoughts on the tradeoff.
       | martinald wrote:
       | Why not make a website for this? Why just limit it to the
       | terminal (hard to use on mobile for example)?
       | Edit: obviously you can query it from a browser, but it would
       | take like a couple hours to have a view that parsed the json and
       | put it in a google-style layout with a search bar.
         | alangibson wrote:
         | > Why not make a website for this?
         | It's coming. I decided to get it out there as soon as I had an
         | index that could theoretically be useful. The feedback I'm
         | getting will drive the next chunk of work that gets done. For
         | instance, I'll probably bring in language docs next as a lot of
         | people have asked for them already.
       | leke wrote:
       | Nice. Some very different results returned.
       | dmix wrote:
       | If you search for "Reddit" the first result is "Google Search is
       | Dying" on Hacker News.
       | https://dontbeevil.rip/search?q=reddit
         | alangibson wrote:
         | The reason is because there was a long discussion about Reddit
         | as a search engine in that thread. reddit.com will likely never
         | be indexed. Many of the subreddits already are, but I haven't
         | exposed the ability to do something like Google's
         | `site:reddit.com/r/*` yet. That's coming though.
       | [deleted]
       | niek_pas wrote:
       | This looks really cool! It would be neat to have a proper CLI
       | with a more fully fleshed-out UI, with things like shortcuts to
       | quickly open links. Is there any way I can be kept up to date
       | with the state of this project?
       | Also, am I correct in assuming it's not open source?
         | kordlessagain wrote:
         | Like an ultra powerful goosh.org UI, with AI command synthesis,
         | image uploads, crawling, search, opening pages, etc.
           | alangibson wrote:
           | CLI in the browser. I love it.
         | alangibson wrote:
         | The repo is over here:
         | https://github.com/alangibson/dontbeevil.rip
         | You'll be disappointed though as most of the important stuff
         | only lives as BigQuery queries. I will be updating it in the
         | near future though.
       | 0x20cowboy wrote:
       | Maybe use Gopher? Lynx supports it and there are a few other
       | newish clients out there.
       | l0b0 wrote:
       | Nice! Do any more advanced query strings work right now? Like
       | looking for recent pages or only searching titles?
       | operator-name wrote:
       | I've tried it out. The limited number of crawled sites is quite
       | obvious when searching for anything obscure or one step
       | outside of programming.
       | Even 'javascript reverse string', for which I expected some docs
       | or Stack Overflow pages, seems to give me a HN thread, someone's
       | github repo and a not very related SO thread.
       | Is MDN, MSDN, more dev docs documentation on the roadmap?
       | It's definitely an interesting technique. Do you have anything in
       | place to detect garbage, substanceless articles like those which
       | have started popping up on Google?
       | I've seen the occasional one using github repositories or pages.
       | Looking at the current list you're broadly reliant on moderators
       | and communities, and as the search engine you moderate which
       | sites are indexed.
         | alangibson wrote:
         | > one step outside of programming.
         | Indeed, it's explicitly for programming only (for now).
         | > Even 'javascript reverse string', which I expected some docs
         | or stack overflow pages seems to give me a HN thread,
         | Next up is indexing language documentation. At this point I'm
         | relying heavily on Q&A and community sites since they have
         | their own built in quality rankings.
         | > Is MDN, MSDN, more dev docs documentation on the roadmap?
         | Most definitely. Feel free to dump a list of urls of your
         | favorite doc sites. I'm building a whitelist now.
         | > Do you have anything in place to detect garbage,
         | substanceless articles like those which have started popping up
         | on Google?
         | My strategy is to not index spam in the first place. That's why
         | I started by extracting links from within community sites that
         | have their own moderation in place. The next step is to
         | whitelist high quality sites. That is potentially a huge list
         | to maintain, which is why I am narrowly focused on software
         | development.
         | Everything old is new again...
           | thewebcount wrote:
           | If you haven't already, please crawl
           | <https://developer.apple.com> and <https://swift.org>.
           | pbhjpbhj wrote:
           | >At this point I'm relying heavily on Q&A and community sites
           | since they have their own built in quality rankings. //
           | How are you using the "built in quality rankings", could you
           | give some examples?
           | On Reddit, say, except in a few groups like AskHistorians you
           | can still get very high ranking for a meme post and very low
           | ranking for a list that has high informational value.
           | StackOverflow is extraordinarily prone to killing off
           | reasonably good contributions and giving very high ranking to
           | out of date answers (the latter is the biggest problem with
           | SO sites at present IMO).
           | ejp wrote:
           | I use this documentation aggregator/search in the browser to
           | access most language docs. It might serve as a whitelist
           | starting point! https://devdocs.io/
             | alangibson wrote:
             | Nice. Thanks!
           | vulcan01 wrote:
           | Ok, site request (aside from MDN): pkg.go.dev
           | Many of these are linked to GitHub/GitLab repos, so I'm not
           | sure how you'll deduplicate that.
       | mynameismon wrote:
       | > StackOverflow Does this include the entire StackExchange
       | network, or only StackOverflow? Because some SE sites (in
       | particular, UnixSE and ServerFault) also produce highly relevant
       | results.
         | alangibson wrote:
         | > Does this include the entire StackExchange network
         | Not yet. I'm focused on explicitly developer-oriented resources
         | right now. Those you mentioned are on the TODO list though.
       | eatbitseveryday wrote:
       | I would recommend adding technical blogs. Not by hand, but if you
       | can automate identifying some. Many are small but have good
       | content.
       | Edit: also some corporate technical documentation like Mozilla,
       | Microsoft, IBM, etc have many such developer pages.
         | alangibson wrote:
         | I automate it by pulling urls out of HN, programmer Reddit,
         | etc. Right now my only source of page content is the Common
         | Crawl, which is why there are relatively few web pages indexed.
         | That will change.
         | A next step is to index entire sites, not just individual
         | pages, based on the positive votes their links get.
       | sanmartin65 wrote:
       | hansott wrote:
       | Couple of thoughts:
       | Make it available through a web page instead of a raw search
       | dump?
       | Hide the internals of your search engine? In case you want to
       | switch to meilisearch, algolia, ... (for cost reasons)
       | Preferably use your own search DSL so users don't have to learn about
       | Elasticsearch queries (goes hand in hand with hiding internals of
       | search engine)
       | Good luck! :)
         | alangibson wrote:
         | That'll all happen (though ES simple search expressions are
         | quite OK). The reason it is the way it is today is to enable me
         | to get it out into the world as fast as possible. It puts the M
         | in MVP.
         | randomsilence wrote:
         | Why reinvent the wheel? It's a selling point that Elasticsearch
         | queries can be used.
         | If he changes the engine he might as well implement the
         | Elasticsearch language then for the new engine.
       | baggachipz wrote:
       | 1. Stand up ElasticSearch instance
       | 2. Have it index SO and HN
       | 3. Charge $10 per month
       | 4. Profit!!
         | alangibson wrote:
         | Maybe you should read the other posts about future plans and
         | how this is extremely alpha. Or maybe go anywhere else and do
         | literally anything else but be obnoxious in this thread.
         | I'm spending over $200 per week just to stand up the service as
         | it is. $10 for a fully functional search engine will likely not
         | be even close to PROFIT!!!!
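To put the two figures from this exchange side by side, here is a back-of-the-envelope calculation of how many $10 subscribers it would take just to cover $200 per week of infrastructure. It is a sketch only: the function name `breakeven_subscribers` is hypothetical, and it ignores developer time, payment fees and taxes.

```shell
# Subscribers needed to cover a weekly cost at a given monthly price
# (whole dollars), rounded up. Compares yearly totals: 52 weeks of cost
# against 12 months of revenue per subscriber.
breakeven_subscribers() {
    weekly_cost=$1
    monthly_price=$2
    yearly_cost=$(( weekly_cost * 52 ))
    yearly_revenue_per_user=$(( monthly_price * 12 ))
    echo $(( (yearly_cost + yearly_revenue_per_user - 1) / yearly_revenue_per_user ))
}

breakeven_subscribers 200 10   # prints 87
```

That is $10,400 a year in hosting against $120 a year per subscriber, before anyone is paid for development.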
           | sdesol wrote:
           | > I'm spending over $200 per week just to stand up the
           | service as it is
           | What are you using, if you don't mind me asking? Not trying to
           | criticize or anything. I have a Hetzner box that gives me 1TB
           | SSD in RAID 0 mode and 64 GB of RAM for about 80 CAD a month.
             | alangibson wrote:
             | I have 3 sizeable EC2 instances running an ElasticSearch
             | cluster, plus a beefy box for data preprocessing and
             | crawling.
             | A big chunk actually goes to BigQuery. There are publicly
             | available datasets for HN, Stackoverflow and a few others
             | there. I've also loaded up the Common Crawl index. The
             | query and storage fees really add up.
             | I'm hopefully done with huge BigQuery queries, so that $200
             | will probably drop for a while.
               | rmbyrro wrote:
               | I'm probably wrong about my assumptions, but presume you
               | are open to any kind of constructive feedback, so here it
               | goes...
               | Maybe you're overkilling with the infra stack.
               | I would simplify until having a mature product,
               | especially if I'm bootstrapping, which I think is your
               | case.
               | Right now, you're still a bit far from MVP, from my point
               | of view. Those $200 can probably be reduced by 50%-75% if
               | you compromise on stuff only important to non-alpha
               | services (i.e. 99.99% availability). A single EC2 box
               | should be enough. Maybe look into Postgres or another
               | FOSS instead of BigQuery.
               | These $100-$150 savings per week can go into promoting
               | your service, getting as much attention as possible to
               | maximize feedback.
               | Good luck!
         | harryvederci wrote:
         | > Please don't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something.
         | Source: https://news.ycombinator.com/newsguidelines.html
           | baggachipz wrote:
           | You're right, that was dickish. I could have asked what's
           | different without the snark.
             | alangibson wrote:
             | The purpose of this project is to see if it's possible to
             | build a highly targeted, privacy respecting, search engine
             | that people will pay for. I've given myself a year to build
             | the index and tune for relevance. If at the end of that
             | year it's not a path to sustainability, I'll shut it down
             | secure in the knowledge that, despite what they say, people
             | really won't pay for search. If it is, then I'll start
             | scaling into other STEM subjects.
             | So the difference is, it has the things folks on HN say
             | they want:
             | - search expressions
             | - REST api
             | - no tracking
             | - users are buyers, not products
               | all2 wrote:
               | Would you consider allowing users to host instances/nodes
               | of the engine in return for free or reduced monthly
               | rates? I wouldn't mind making that kind of trade.
               | rglullis wrote:
               | How would one go about ensuring these nodes are not
               | malicious?
       | ghawk1ns wrote:
       | You can try with a function to simplify the cli:
       | $ rip() { curl -G -H 'Accept: text/plain' --url
       | https://dontbeevil.rip/search --data-urlencode 'q=$@'}
       | $ rip heartbleed bug
         | mananaysiempre wrote:
         | The single quotes probably need to be double ones in the last
         | argument to permit parameter expansion, and the $@ (separately
         | quote every argument if quoted) probably wants to be a $*
         | (quote the entire space-separated argument array if quoted)?
         | There's also the grammar quirk where the last command inside
         | braces (but not parens) needs a semicolon or newline to
         | separate it from the brace itself. Thus:
         |     rip() { curl -G -H 'Accept: text/plain' \
         |         --url https://dontbeevil.rip/search --data-urlencode "q=$*"; }
         | (tested).
         | I still support the point that there is no reason for this to
         | be a (grammar-defying) alias rather than a (tame) shell
         | function or even a separate script.
       | wodenokoto wrote:
       | My shell-fu isn't the greatest. I thought I could be clever and
       | do
       |     >alias rip="curl -G -H 'Accept: text/plain' \
       |         --url https://dontbeevil.rip/search --data-urlencode q="
       |     >rip hello
       |     {"message": "Missing required request parameters: [q]"}
       |     curl: (6) Could not resolve host: hello
         | cudder wrote:
         | Turn it into a function:
         |     rip() {
         |         curl -G -H 'Accept: text/plain' \
         |             --url https://dontbeevil.rip/search --data-urlencode "q=$@"
         |     }
         | Now `rip hello` works.
           | harryvederci wrote:
           | This only works with 1-word arguments. You can change $@ to
           | $* to fix that.
           | (I'm acting all wise, but I learned that today as well :) )
             | pxeger1 wrote:
             | Switch to Zsh, where there'll be no difference!
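The difference between `"q=$@"` and `"q=$*"` discussed here can be seen without touching the network by counting how many words each expansion produces in a POSIX shell. The helper names below are made up for the demonstration.

```shell
# Print how many arguments were received.
count_args() { echo "$#"; }

# "q=$@" expands to one word per argument: "q=heartbleed" "bug"
with_at() { count_args "q=$@"; }

# "q=$*" joins all arguments into a single word: "q=heartbleed bug"
with_star() { count_args "q=$*"; }

with_at heartbleed bug     # prints 2
with_star heartbleed bug   # prints 1
```

With `"q=$@"`, curl therefore receives `q=heartbleed` as the data and `bug` as a separate argument, which it treats as another URL. That is the same failure mode as the "Could not resolve host: hello" error above.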
         | nameless912 wrote:
         |     function rip() {
         |         printf ">>"
         |         read query
         |         curl -G -H 'Accept: text/plain' \
         |             --url https://dontbeevil.rip/search --data-urlencode q="$query"
         |     }
         | :)
         | harryvederci wrote:
         | You could also put a shell script on your `PATH` instead of
         | creating an alias:
         |     #! /bin/sh
         |     query_string="q=$@"
         |     curl --get \
         |         --header 'Accept: text/plain' \
         |         --url https://dontbeevil.rip/search \
         |         --data-urlencode "${query_string}"
           | mrlemke wrote:
           | I'd do it almost the same but without the variable. Note: the
           | long shebang is for using on Termux, PC users should change
           | it to something like #!/use/bin/env sh.
           |     #!/data/data/com.termux/files/use/bin/env sh
           |     curl -G -H "Accept: text/plain" \
           |         --url "https://dontbeevil.rip/search" \
           |         --data-urlenconde "q=$*"
             | harryvederci wrote:
             | Didn't know about "$*", thanks!
             | Edit: typo in your version: "urlenconde"
               | laumars wrote:
               | Also every instance of 'usr' has been autocorrected to
               | 'use'.
               | Autocorrect does make me laugh some days :D
       | sitkack wrote:
       | I love this idea.
       | This is mostly just raw data, it isn't that useful (yet).
       | The security issues with using curl directly to my terminal feel
       | a bit dangerous. I'd rather use my browser and be able to see the
       | results over a json tree. Providing raw access to ES has a high
       | risk to reward.
       | The search results for
       | https://dontbeevil.rip/search?q=python%20context%20manager%2...
       | are non-topical hits on SO records. I even put the name of the
       | python package (from stdlib no less) in the query string.
       | I was able to find what I needed on devdocs.io in less than 10
       | seconds.
       | https://devdocs.io/python~3.10/library/contextlib#contextlib...
       | In no way am I trying to discourage you, but until the basics are
       | in place, search over arxiv abstracts is way less useful than
       | just SO and docs (language and libraries).
       | I would recommend returning text/plain by default and .json if
       | someone asks for it (in the url), not everyone can set headers.
       | I'd also put an about page at the root of your site, plain text
       | is fine.
         | alangibson wrote:
         | Thanks for the feedback. I'm considering making searching for
         | language docs a special use case, maybe even with its own
         | index.
         | > Providing raw access to ES has a high risk to reward.
         | I'm not quite that foolhardy :). The only thing that gets
         | passed to ES is a sanitized simple_query_string.query. Should
         | be relatively safe. I'm sure someone will prove me wrong
         | though.
         | tekromancr wrote:
         | What security implications does running curl have that wouldn't
         | be present in a browser?
           | artursapek wrote:
           | I'm wondering the same. You're not piping them into a shell.
           | feanaro wrote:
           | There have been instances of terminal vulnerabilities via
           | terminal escape codes, as bad as an RCE in iterm2:
           | https://blog.mozilla.org/security/2019/10/09/iterm2-critical....
           | I suppose the OP is thinking of something like that.
             | laumars wrote:
             | And this is exactly why I'm always playing the damp squib
             | when people advocate for more features being supported via
             | shell escape codes.
             | tekromancr wrote:
             | Yea, I was wondering about that; but the risk feels similar
             | to a browser RCE to me. Maybe it's higher because browsers
             | are more widely used/analyzed; but then again, a browser
             | RCE has a much wider range of targets with more
             | opportunities to exploit
               | dundarious wrote:
               | Even just having the potential for the terminal to
               | interpret escape codes is frustrating. Always pipe remote
               | output to `less` or `less -R` (not `less -r`).
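As a complement to piping through `less`, remote output can also be filtered before it reaches the terminal. The following is a minimal sketch, assuming a POSIX shell with standard `sed` and `tr`; `strip_escapes` is a hypothetical helper name, and the filter only targets CSI-style sequences plus leftover control bytes, so it should not be read as a complete defence.

```shell
#!/bin/sh
# strip_escapes removes ANSI CSI sequences (ESC [ ... letter), then deletes
# any remaining control characters except tab and newline.
strip_escapes() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;?]*[A-Za-z]//g" | tr -d '\000-\010\013-\037\177'
}

# Sample "remote" output with a colour code and a title-setting sequence.
printf 'safe \033[31mred\033[0m text\033]0;pwned\007\n' | strip_escapes
# prints: safe red text]0;pwned
```

The title-setting payload survives only as inert visible text because its ESC and BEL bytes are deleted. In practice the filter would sit at the end of the curl pipeline, for example `rip 'q=zelda' | strip_escapes`.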
         | fao_ wrote:
         | > The security issues with using curl directly to my terminal
         | feel a bit dangerous.
         | What?
         | laurent123456 wrote:
         | All this curl script does is print strings in the terminal. Are
         | you saying that printing strings is dangerous? I think you may
         | be confusing it with executing the output of curl which is not
         | what this script is doing.
           | kodah wrote:
           | Escape codes could be leveraged to RCE the terminal. That
           | said, every CLI you install on your computer can do code
           | execution and could potentially couple that with remote
           | instructions. There are two vectors there: one where you
           | don't trust the server, and another where you don't trust the
           | connection between the client and the server.
         | nano9 wrote:
         | Then don't use a privileged terminal to run curl.
           | jrockway wrote:
           | I don't think that's a satisfactory mitigation. For example,
           | there is a terminal escape code to change the title of your
           | terminal. Your windowing system, which is very privileged,
           | then displays that.
           | I think it's fine to be paranoid here, the attack surface is
           | massive.
       | dorianmariefr wrote:
       | I'm getting:
       | > {"message": "Internal server error"}
         | alangibson wrote:
         | Give it another try. I fixed a flaw in the json to text
         | translation.
           | dorianmariefr wrote:
           | Thanks, it works now
         | alangibson wrote:
         | What query are you running?
       | pcthrowaway wrote:
       | Are you not putting github (+issues and PRs) in the indexed set?
         | alangibson wrote:
         | Not yet. That's an astonishing amount of data, and I want to
         | make sure that people genuinely want it first. I'm considering
         | an index specifically for this actually.
         | I'll put you down as a +1
       | feanaro wrote:
       | Getting internal server error for many ordinary requests. I'm not
       | able to discern a pattern. An example is `rip 'q=zelda'`.
         | alangibson wrote:
         | It should be fixed now.
         | alangibson wrote:
         | Thanks for the report. I'll get this fixed.
         | In the meantime you can use application/json:
         |     curl -G -H 'Accept: application/json' \
         |         https://dontbeevil.rip/search?q=zelda
       | erklik wrote:
       | > programmer Reddit
       | Is it just /r/programmer? or many other programming related
       | subreddits?
       | In general though, this is great. I would similarly love a
       | solution where we could submit sites to be indexed. A way to have
       | a search engine for all the websites I want specifically would be
       | awesome. You could probably add some sort of popular filter on
       | top of it so that only sites popular enough get indexed. I don't
       | know. Just an idea.
       | I love the fact that it's accessible from the terminal. That's
       | fantastic. Although, would be nice to have a very simple HTML
       | front-end. Think very early Google or go very brutalistic.
       | Anyway, excited to hear about it.
       | Edit:
       | Doing the following gives me an Internal Server Error for some
       | reason.
       |     curl -G -H 'Accept: text/plain' \
       |         --url https://dontbeevil.rip/search --data-urlencode 'q=Notes'
       |     {"message": "Internal server error"}
         | alangibson wrote:
         | > Is it just /r/programmer? or many other programming related
         | subreddits?
         | There are about 30 programming-focused subreddits.
         | > I would similarly love a solution where we could submit sites
         | to be indexed.
         | This is in the plan. I want to allow common interest groups to
         | maintain their own search verticals. I also want to allow
         | individual users to add everything from bookmarks to notes
         | (privately only of course) to act as a sort of external memory.
         | That's very long term though.
         | > Doing the following gives me an Internal Server Error for
         | some reason.
         | Should be fixed now
       | wolfgang42 wrote:
       | Congrats on the launch! Over the past 6 months or so I've been
       | intermittently working on building pretty much exactly the same
       | thing, but with a lot of procrastinating on fiddling with the
       | internals rather than just putting something out there. Your API-
       | first approach is a clever way to get around the desire to keep
       | fiddling around with the page design!
         | alangibson wrote:
         | I find that plain text is a very effective anti-procrastination
         | tool. That's why the "API" was actually text-first. Limiting
         | your options can be very liberating.
       (page generated 2022-03-03 23:00 UTC)