[HN Gopher] The technology behind GitHub's new code search
       ___________________________________________________________________
        
       The technology behind GitHub's new code search
        
       Author : joshbetz
       Score  : 377 points
       Date   : 2023-02-06 17:32 UTC (5 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | tuan wrote:
       | I wish they provided short-name versions of their filters. For
       | example: instead of "withContext language:python path:tests", I
       | could write "withContext l:python p:tests".
        
       | debdut wrote:
       | https://grep.app
        
         | napsterbr wrote:
         | I use this one almost daily. It's great to find real world
         | examples of APIs/contracts being used. Also, instant results!
         | 
         | The underlying data may be limited (I have no idea how large it
         | is, I doubt it has indexed every public repository out there),
         | but I never failed to find examples of what I was looking for.
        
       | Beefin wrote:
       | If you ever want to search binary files (image, video, pdf, etc.)
       | within github repos: https://learn.mixpeek.com/github-search/
        
       | hbn wrote:
       | I've been using this since it was still an email signup beta. I
       | don't do anything too complicated, but man it's been invaluable
       | to do exact-string searches across all of my organization's
       | repos. I use it most days at work
        
       | tonymet wrote:
       | This is a great intro / overview of full-text search for those
       | wondering how to build your own search engine.
       | 
       | It's a great 101-level exercise to write an inverted index
       | implementation; you can do it in an afternoon, and then expand to
       | a leaf/aggregator in follow-up exercises.
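For readers who have not seen one, the afternoon-sized inverted index mentioned above can be sketched roughly as follows. This is a minimal illustration, not anything GitHub built: the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical, and it indexes whole word tokens rather than ngrams.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it (a posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical sample corpus, for illustration only.
docs = {
    "a.py": "def parse_args(argv): return argv",
    "b.py": "def main(): parse_args(sys.argv)",
    "c.py": "print('hello world')",
}
index = build_index(docs)
```

Calling `search(index, "main parse_args")` intersects the two posting lists and returns only the documents containing both tokens. A leaf/aggregator setup would, under the same assumptions, run `search` on several shards and merge the results.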
        
         | jeffbee wrote:
         | It is indeed Information Retrieval 101 level stuff which leads
         | to the question of why this is the best GitHub can do with all
         | the resources of Microsoft behind them. It's almost useless, at
         | least for C++. It can't tell the difference between foo(int)
         | and foo(double) or this::foo vs. that::foo.
         | 
         | If I wanted the kind of search engine I can get a teenager to
         | write in 16 weeks why would I expect my org to be paying $$$
         | for the service?
        
           | 100k wrote:
           | Have you tried the new search? Thanks to the variable length
           | ngram indexing mentioned in the post, it can handle all of
           | those cases. Sign up here to try it:
           | https://github.com/features/code-search
           | 
           | Symbol extraction for C and C++ is currently disabled because
           | we were having problems with the performance of the tree-
           | sitter queries we were using, but we are planning to bring
           | that back.
        
             | jeffbee wrote:
             | Sorry, it cannot handle _any_ of those cases. You're
             | talking about the ability to find the literal `this::foo`
             | but that's not how it would normally appear. It normally
             | will appear anywhere inside a `namespace this` scope, which
             | cs.github does not grok. And cs.github cannot address
             | finding the definition related to a given call site. It
             | doesn't even try.
        
               | 100k wrote:
               | You are correct, as I mentioned, we do not analyze
               | symbols for C and C++ at this time.
        
           | burntsushi wrote:
           | What a shit take. The article itself is perhaps a nice light
           | overview of 101-ish level concepts, although knowing how and
           | when to apply them in a real engineering context is not
           | something I would consider 101 level. And certainly, building
           | something that is actually at the scale of GitHub Search is
           | nowhere near 101 level.
           | 
           | This is what a 101-level inverted index implementation looks
           | like: https://github.com/BurntSushi/imdb-rename
           | 
           | In other words, absolutely nothing like what GitHub built.
           | Nowhere close.
        
         | anitil wrote:
         | I did this for our organization using sqlite's FTS module and
         | datasette and boy was it fast. Unfortunately I did get
         | (temporarily) banned from the organisation's GitHub account, but
         | it was definitely worth it.
         | 
         | Even now I find myself using it despite the index being a few
         | months out of date.
        
       | Waterluvian wrote:
       | This looks delightful!
       | 
       | One nit I have about current search: I'll look something up and
       | find I'm getting results for some obtuse commit in some old
       | branch somewhere. I'd like to be able to optionally say "latest
       | commit on branches only please" or "main branch only please."
       | 
       | Another thing, which might betray that I don't understand search
       | all that well: language aware searching that knows, for example,
       | that a single or a double quote are syntactically
       | interchangeable. Don't omit half the results because I used one
       | quote over the other when looking up `interpolation = 'nearest'`
        
       | bjd2385 wrote:
       | When can we have a usable search in GitLab?
        
         | john_cogs wrote:
         | GitLab team member. Thanks for the question.
         | 
         | Our Code Search team is currently working on moving to Zoekt[0]
         | which is expected to be a significant improvement as it is
         | purpose-built for code search.
         | 
         | We also shipped an improvement[1] to our existing search
         | functionality at the end of last year. If you haven't used it
         | recently, I'd encourage you to check out code search again to
         | see if the quality has been improved for you.
         | 
         | [0] - https://gitlab.com/groups/gitlab-org/-/epics/9404
         | 
         | [1] - https://gitlab.com/gitlab-org/gitlab/-/issues/346914
        
       | tantalor wrote:
       | Why not kythe?
       | 
       | https://kythe.io/
        
         | Scaevolus wrote:
         | Kythe is not a regex search engine. It depends on extracting
         | precise semantics of all the code it runs on to compute correct
         | edges like "calls-function". This only works for a few
         | languages, and is extremely difficult to do generically across
         | all of github.
        
       | mperham wrote:
       | On the spectrum of "build vs buy", this is a good example where a
       | business should build it. Scaling code search is their core
       | value.
        
       | [deleted]
        
       | Existenceblinks wrote:
       | Blackbird being written in Rust is a natural approach. Those who
       | try to sell "build the whole thing with one thing" are unwise
       | (looking at you, isomorphic JavaScript).
        
         | ZephyrBlu wrote:
         | Isomorphic projects are generally good for full stack apps, but
         | I don't think anyone would recommend you build a search engine
         | with isomorphic JS.
        
       | user3939382 wrote:
       | The cursor position in the free-form query terms in the search
       | input doesn't align correctly when the input contains tags.
        
       | ZephyrBlu wrote:
       | Search is a fascinating topic because it's such a fundamental
       | problem and every search engine is based around the same
       | extremely simple data structure (Posting list/inverted index).
       | Despite that, search isn't easy and every search engine seems to
       | be quite unique. It also seems to get exponentially harder with
       | scale.
       | 
       | You can write your own search engine that will perform very well
       | on a surprisingly large amount of data, even doing naive full-
       | text search. A search tool I came across a while back is a great
       | example of something at that scale: https://pagefind.app/.
       | 
       | For anyone who doesn't know anything about search I highly
       | recommend reading this (It's mentioned in the blog post as well):
       | https://swtch.com/~rsc/regexp/regexp4.html.
       | 
       | Algolia also has a series of blog posts describing how their
       | search engine works:
       | https://www.algolia.com/blog/engineering/inside-the-algolia-....
       | 
       | ---
       | 
       | It's interesting that GitHub seems to have quite a few shards.
       | Algolia basically has a monolithic architecture with 3 different
       | hosts which replicate data and they embed their search engine in
       | Nginx:
       | 
       | _"Our search engine is a C++ module which is directly embedded
       | inside Nginx. So when the query enters Nginx, we directly run it
       | through the search engine and send it back to the client."_
       | 
       | I'm guessing GitHub probably doesn't store repos in a custom
       | binary format like Algolia does though:
       | 
       | _"Each index is a binary file in our own format. We put the
       | information in a specific order so that it is very fast to
       | perform queries on it."_
       | 
       | _"Our Nginx C++ module will directly open the index file in
       | memory-mapped mode in order to share memory between the different
       | Nginx processes and will apply the query on the memory-mapped
       | data structure."_
       | 
       | https://stackshare.io/posts/how-algolia-built-their-realtime...
       | 
       | 100ms p99 seems pretty good, but I'm curious what the p50 is and
       | how much time is spent searching vs ranking. I've seen Dan Luu
       | say that the majority of time should be spent ranking rather than
       | searching and when I've snooped on https://hn.algolia.com I've
       | seen single digit millisecond search times in the responses,
       | which seems to corroborate this.
       | 
       | I'm curious why they chose to optimize ingestion when it only
       | took 36hrs to re-index the entire corpus without optimizations. A
       | 50% speedup is nice, but 36hrs and 18hrs are the same order of
       | magnitude and it sounds like there was a fair amount of
       | engineering effort put into this. An index 1/5 of the size is
       | pretty sweet though, I have to assume that's a bigger win than
       | 50% faster ingestion.
       | 
       | Since they're indexing by language I wonder if they have custom
       | indexing/searching for each language, or if their ngram strategy
       | is generic over all languages. Perhaps their "sparse grams"
       | naturally tokenize differently for every language. Hard to tell when
       | they leave out the juiciest part of the strategy though: "Assume
       | you have some function that given a bigram gives a weight".
       | 
       | Search is so cool. I could talk about it all day.
        
         | 100k wrote:
         | I agree! Search is so cool.
         | 
         | _It's interesting that GitHub seems to have quite a few
         | shards. Algolia basically has a monolithic architecture with 3
         | different hosts_
         | 
         | I used to work at an Algolia competitor. I don't know for sure,
         | but my guess is that Algolia shards their indices by customer.
         | Algolia does not provide global search. GitHub code search
         | does. That, and the desire to deduplicate data, is what led us
         | to our current sharding strategy (notably, it is different from
         | the old GitHub code search's sharding).
         | 
         | _I'm guessing GitHub probably doesn't store repos in a custom
         | binary format like Algolia does though:_
         | 
         | We have a custom index format, so I would say this is the same,
         | unless you mean something different. We of course translate
         | repos from their Git form to our index document form for
         | indexing.
         | 
         | _I'm curious why they chose to optimize ingestion when it
         | only took 36hrs to re-index the entire corpus without
         | optimizations. A 50% speedup is nice, but 36hrs and 18hrs are
         | the same order of magnitude and it sounds like there was a fair
         | amount of engineering effort put into this. An index 1/5 of the
         | size is pretty sweet though, I have to assume that's a bigger
         | win than 50% faster ingestion._
         | 
         | The index size is a bigger win, but being able to reindex
         | quickly is huge for our development velocity and trying things
         | out. We really feel it when things are slow. This is also not
         | our final goal, we want to scale the system up considerably.
        
           | ZephyrBlu wrote:
           | I'm not familiar with production search systems at scale
           | (Very curious about them though). How do you think Algolia
           | shards their data given that architecture? Based on their
           | description it seems like the search engine itself is
           | monolithic. Maybe they're running a 3-node cluster with a
           | monolithic index for each customer?
           | 
           | Interesting, do you keep a copy of the index document form of
           | repos or is that done on the fly during indexing? Is your
           | custom index format a binary format? I have no idea whether
           | that's standard practice, or just a compressed text format is
           | enough. I guess that non-binary formats would be enormous
           | though, and given that an index is by definition relatively
           | unique it probably wouldn't compress that well.
           | 
           | I do feel the development velocity thing. I've felt something
           | similar on my smaller scale projects. Being able to fully re-
           | index the corpus in less than a day definitely seems like it
           | would provide a lot of opportunities to experiment and try
           | stuff out without it being too costly.
           | 
           | Scale up in terms of what? Is the current system not indexing
           | all of GitHub, or you mean you want to index on more things
           | (E.g. commits, PRs, etc)?
        
       | boyter wrote:
       | The sparse grams solution to deal with stupidly common ngrams
       | such as for or tes is very interesting.
       | 
       | I'd love to see more discussion on how they are dealing with the
       | false positives though. It looks like a positional index is being
       | used to achieve this, but that usually blows out your index size.
       | 
       | Additional information about deduplication would be especially
       | interesting to me as well. It seems to solve this quite well. I
       | usually try a search of Jquery to test this and it does not
       | return multiple copies of different versions of it which is a
       | good indicator that it's slightly fuzzy.
       | 
       | What I find really interesting about all the code search engines
       | I know of is that each one implemented its own index. Nobody is
       | using off the shelf software for this. I suspect that might be
       | down to no off the shelf software providing a decent enough
       | solution, and none providing a solution that scales. At least
       | none that scales with decent costs.
       | 
       | I did a small comparison of GitHub code search a while ago
       | https://twitter.com/boyter/status/1480667185475244036?s=61&t...
       | But I should note a lot has improved since then, and it looks
       | like sourcegraph now also does default AND of terms rather than
       | exact match, so my complaints there are resolved.
       | 
       | Impressive work by GitHub. I am sure some of the people behind it
       | will read this comment, let me say well done to you all. I am
       | very impressed. Also please post more information like this.
       | There is so little out there.
        
         | 100k wrote:
         | Thanks! I enjoyed reading your blog posts about building your
         | code search engine. One minor point of clarification, we do not
         | use a positional ngram index, which as you note blows up the
         | index size. Instead, we use the covering sparse ngrams to
         | produce candidate documents and then search the content.
         | 
         | An early version of Blackbird experimented with trigrams plus a
         | bitmask of the next character, but it didn't work well because
         | it wasn't selective enough. This is mentioned in the blog post:
         | 
         |  _We tried a number of strategies to fix this like adding
         | follow masks, which use bitmasks for the character following
         | the trigram (basically halfway to quad grams), but they
         | saturate too quickly to be useful._
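The "produce candidate documents and then search the content" flow described above can be sketched as below. This is an illustration under loud assumptions: it uses plain fixed-size trigrams, not the sparse grams from the blog post, and every name (`trigrams`, `build_trigram_index`, `candidates`, `search`) and the two sample documents are hypothetical.

```python
from collections import defaultdict


def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_trigram_index(docs):
    """Map each trigram to the set of document ids containing it (no positions stored)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates(index, all_ids, query):
    """Intersect posting lists for the query's trigrams; may include false positives."""
    grams = trigrams(query)
    if not grams:
        # Query shorter than 3 characters: the index cannot filter, so scan everything.
        return set(all_ids)
    result = None
    for gram in grams:
        ids = index.get(gram, set())
        result = set(ids) if result is None else result & ids
    return result


def search(index, docs, query):
    """Verify each candidate by scanning its content for the literal query."""
    return sorted(d for d in candidates(index, docs.keys(), query) if query in docs[d])


# Hypothetical corpus: "y" contains both trigrams of "abcd" but not "abcd" itself.
docs = {"x": "abcd", "y": "abc bcd"}
index = build_trigram_index(docs)
```

Here `candidates(index, docs, "abcd")` returns both documents, because a non-positional index cannot tell that `abc` and `bcd` are not adjacent in "y"; the content scan in `search` removes that false positive without paying for a positional index.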
        
           | boyter wrote:
           | That's what I get for a cursory glance at 4am when I wrote
           | this. I will have a much better look after I get some coffee
           | into me.
           | 
           | Thanks for the clarification. Looking forward to seeing what
           | else you and your team end up writing about. Which reminds me
           | to publish some other posts I have about searchcode.
        
         | ZephyrBlu wrote:
         | Implementing your own index gives you more control over it. I
         | think at this scale you probably want to tweak things
         | specifically to your product rather than using a generic
         | solution. I would guess that what you're indexing on (E.g.
         | language, file, repo, etc) and sharding strategy affects the
         | structure of your index as well.
        
           | boyter wrote:
           | Believe me I am aware. I am one of those who implemented
           | their own index for a code search engine :) I did it for my
           | own learning, but find it interesting because something like
           | elastic with trigrams can get you very close, albeit at a far
           | greater cost.
        
             | ZephyrBlu wrote:
             | I'm reading your blog posts about building your own index
             | now.
             | 
             | I started writing my own very simple index and search
             | engine, but quickly decided to just use ClickHouse via
             | https://tinybird.co as my backend (Serverless SQL with
             | automatic APIs is pretty sweet) because I wanted to build
             | out the product side of things and my data is really small,
             | so I felt like it was going to be a lot of effort for
             | little reward.
             | 
             | Maybe one day I will need to write a custom index or search
             | engine that actually scales though :).
        
               | boyter wrote:
               | I won't hijack this thread with details but if you have
               | questions you can find my details on my profile.
        
       | gavinray wrote:
       | I just want to say thank-you to the folks who work on Code Search
       | at GitHub.
       | 
       | It's the number one way I research and understand new
       | libraries/APIs and programming languages.
       | 
       | There's a lot more you can learn from usage in the wild than
       | tutorial posts sometimes.
        
       | cozos wrote:
       | I have been waiting for this for so long.
        
       | andrewmcwatters wrote:
       | > Just use grep? First though, let's explore the brute force
       | approach to the problem. We get this question a lot: "Why don't
       | you just use grep?" To answer that, let's do a little napkin math
       | using ripgrep on that 115 TB of content. On a machine with an
       | eight core Intel CPU, ripgrep can run an exhaustive regular
       | expression query on a 13 GB file cached in memory in 2.769
       | seconds, or about 0.6 GB/sec/core.
       | 
       | But you don't NEED to do this do you? I'm ALREADY in a
       | repository, I just don't want to check out, say all of WebKit, I
       | just need to find where a specific reference is defined.
       | 
       | Maybe, maybe on a really serious day do I need to search an
       | entire organization. But hardly ever.
       | 
       | I have never, in over a decade ever, wanted sophisticated
       | symbolic searching from GitHub code search, I just need remote
       | grep.
       | 
       | Why is the code search not feature bisected into this 99% use
       | case, and then the occasional global repository search, which can
       | behave entirely differently?
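The napkin math quoted at the top of this comment can be reproduced with a few lines of arithmetic. The inputs (13 GB, 2.769 seconds, eight cores, 115 TB) come from the quoted passage; treating 1 TB as 1000 GB, the helper name `wall_clock_seconds`, and the 1000-core figure in the usage note are assumptions made here for illustration.

```python
# Figures taken from the quoted passage.
file_gb = 13.0        # size of the file scanned by ripgrep
scan_seconds = 2.769  # time for one exhaustive regex query
cores = 8             # cores on the test machine
corpus_tb = 115       # total content to search

# Per-core throughput: roughly 0.6 GB/sec/core, as the quote states.
gb_per_sec_per_core = file_gb / scan_seconds / cores

# Assumption: decimal units, 1 TB = 1000 GB.
corpus_gb = corpus_tb * 1000

# Total single-core work needed to scan the whole corpus once.
core_seconds = corpus_gb / gb_per_sec_per_core
core_hours = core_seconds / 3600


def wall_clock_seconds(total_cores):
    """Idealized time for one query if the scan parallelized perfectly."""
    return core_seconds / total_cores
```

Under these assumptions a single exhaustive query costs on the order of 54 core-hours, so even a hypothetical 1000-core cluster (`wall_clock_seconds(1000)`) would need a few minutes per query, assuming the whole corpus stayed cached in memory.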
        
         | IshKebab wrote:
         | Actually I've been using https://grep.app for ages and while I
         | agree on GitHub I basically only search the repo I'm in, that's
         | mainly because Github's existing search sucks.
         | 
         | On grep.app I regularly search all repos. It's very useful for
         | finding out how to use APIs or where APIs from dependencies are
         | defined.
         | 
         | So I suspect you don't want it because subconsciously you know
         | that Github's "search all" feature won't return you useful
         | results.
         | 
         | Hell they still don't provide a way to filter out test
         | directories which makes the code search inside a single repo
         | useless a lot of the time.
        
           | Xeophon wrote:
           | Recently discovered grep.app and it is incredibly useful.
           | Wish I'd known about it earlier.
        
         | tiagod wrote:
         | I use github-wide searches all the time to see how people are
         | using certain APIs, to find libraries used in some blob from
         | the strings I find there, to find people working with the same
         | data I'm about to attempt to work with, and the list goes on.
         | 
         | What _you_ use github search for doesn't require all this
         | engineering, but what I use it for does. Why wouldn't they
         | build something that satisfies both our necessities well?
        
           | rtuin wrote:
           | Same here. I find doing a quick org-wide code search for a
           | specific call a great starting point for impact analysis for
           | API changes. Using it multiple times a week!
        
         | nonethewiser wrote:
         | That's not 99% of use cases. That's just how you use it.
        
         | 100k wrote:
         | (I work on this.)
         | 
         | If you check out our prior blog post, "A brief history of code
         | search at GitHub" (https://github.blog/2021-12-15-a-brief-
         | history-of-code-searc...), you can learn a bit about the
         | evolution of this feature. And, in fact, we used to use git
         | grep to search repositories.
         | 
         | This doesn't work well at GitHub's scale. We have 100M users
         | and over 200M repositories in a multi-tenant environment. Your
         | git grep is going to be competing for resources with other
         | users' pushes and clones.
        
         | joshuamorton wrote:
         | > I'm ALREADY in a repository, I just don't want to check out,
         | say all of WebKit, I just need to find where a specific
         | reference is defined.
         | 
         | If you're in some repository that uses a webkit api, and you
         | want to know how that api is defined, how do you do that
         | without global cross references or a global lookup?
         | 
         | Even for local lookups, indices are useful (as any ctags user
         | will tell you!), but for any kind of cross repo xrefs they're
         | necessary.
        
       | chatmasta wrote:
       | In general, I really recommend code search as a tool for
       | supplementing reading the documentation and source code of your
       | dependencies (you _are_ reading the source code, right?). I reach
       | for it almost every day, and I find it's a reliable tool for
       | identifying "the right way" to use a library, especially one that
       | isn't fully documented.
        
       | imadethis wrote:
       | Sourcegraph should've accepted that offer from GitHub.
        
         | lern_too_spel wrote:
         | Sourcegraph had to have known GitHub would do this if they
         | didn't accept the offer. Since this should be expected, the
         | launch of this feature shouldn't change what their decision
         | should have been.
        
           | sqs wrote:
           | Sourcegraph CEO here. Just to be clear, so internet rumors
           | don't get started, there was no "offer" here. We started
           | Sourcegraph with the intent of remaining independent because
           | building really good code search and intelligence means
           | working across all code (not just on GitHub), all devs, and
           | all code intelligence sources (code nav plus every dev tool
           | you use that knows stuff about code, not just the ones in the
           | GitHub/Microsoft suite bundle). We've never entertained any
           | kind of acquisition interest for this reason.
           | 
           | We don't think any of today's code host vendors with their
           | current strategies can make truly great code search and
           | intelligence because they'll be biased toward their own
           | bundled tools and limited to the subset of code hosted on
           | that instance. It'd be kind of like Encyclopedia Britannica
           | or The NY Times building a web search engine: helpful, but so
           | much more limited compared to what the independent Google
           | became.
           | 
           | And yes, none of this was a surprise. GitHub's new code
           | search has been out for 14 months now.
           | 
           | OK, hope this puts an internet rumor to rest!
        
         | pigtailgirl wrote:
         | -- was looking at their glassdoor last night - one of the worst
         | ive seen in tech so far --
        
           | icelancer wrote:
           | Seems fine to me? I was expecting a bloodbath but the reviews
           | are all pretty balanced, with Yegge getting a pretty good
           | amount of praise.
        
         | mxstbr wrote:
         | I don't think Sourcegraph is in big trouble here. Their whole
         | play is enterprises, who likely have code spread across many
         | different hosts. On top of that, their code search is still
         | miles ahead of GitHub's.
        
           | the_mitsuhiko wrote:
           | The new code search includes private repos. Miles ahead
           | doesn't matter if you get the search at GitHub for free.
           | GitHub actions started out pretty terrible but they are now
           | dominating hosted CI.
           | 
           | SourceGraph likely has challenging times ahead considering
           | the valuation.
        
             | sqs wrote:
             | Sourcegraph CEO here. "Challenging times" is how it should
             | be all the time. That means competition is forcing both
             | GitHub and us to build better stuff for devs. Devs win.
             | 
             | But to be clear, as a company we are doing well and growing
             | nicely inside customers, with a ton of cash in the bank, an
             | awesome team, and a huge opportunity ahead of us. GitHub's
             | new code search has been out for 14 months now, so this is
             | nothing new.
             | 
             | It's a big market and there's way more room for
             | differentiation and dev choice in code search/intelligence
             | than in CI. There's a lot of code intelligence that GitHub
             | won't support (precise code nav for more languages,
             | comprehensive code ownership, metadata from other dev tools
             | that know things about code outside the GitHub/Microsoft
             | suite, etc.), there's a need for the ability to fix (with
             | our Batch Changes) not just find, and even in the core
             | search workflow there's so much room for improvement with
             | AI fine-tuned on your own code, etc.
             | 
             | But talk is cheap and only shipping matters. So, watch what
             | we ship, and send any feedback and requests our way!
        
           | tooltitude wrote:
           | It's very useful for enterprises to have one solution like
           | github. Github isn't the best in every field, but overall
           | it's the best offer (they have git hosting, continuous
           | integration, bug tracking, discussion forums, hosted dev
           | environment, and now code search). It reminds me of a
           | bundling strategy used in MS Office.
        
           | slimsag wrote:
           | I've worked alongside the CEO/CTO of Sourcegraph for the past
           | 8 years, everyone else is at our company offsite so I figured
           | I'd chime in :) nobody asked me to write this (nor did I ask)
           | :)
           | 
           | The article is a top-notch technical write-up, the devs on
           | GitHub code search should be proud of what they've achieved
           | so far!
           | 
           | Honestly, we're rooting for GitHub to improve their code
           | search, viewing them as a close peer, not a competitor. We
           | also maintain OSS projects like Zoekt, which IIRC GitLab is
           | maybe looking at using for their own. The more devs that
           | 'get' code search, the better off Sourcegraph is frankly!
           | 
           | GitHub has a nice intuitive/simple UX, we could learn a thing
           | or two there (though, easier to do with fewer features.)
           | 
           | Still, Sourcegraph search tech is quite a bit more powerful:
           | 
           | * Searching over commit messages, diffs, filename, etc. are
           | super nice for tracking down regressions / finding 'that PR I
           | swear my coworker made'
           | 
           | * Expressiveness like "find this regexp in repositories, but
           | only if the repo has had a commit in the last month AND has a
           | file named package.json in its root"
           | 
           | * Since Steve Yegge joined us, we've started thinking about
           | _ranking_ of search results, a notoriously difficult thing to
           | do well in code search unless you have great factors to rank
           | on (e.g. a semantic understanding of code):
           | https://about.sourcegraph.com/blog/new-search-ranking
           | 
           | * We stream results back, so you can get a comprehensive set
           | of results - not just a few pages, from our API.
           | 
           | * Works in GitHub Enterprise, not just GitHub.com. Plus on
           | all your code hosts, think BitBucket, GitLab, Azure DevOps,
           | Gerrit, Phabricator, etc. and even non-Git VCS like Perforce.
           | 
           | * Respects permissions of all your code hosts (a very
           | difficult problem, as there are no official APIs to query
           | this info from code hosts in general)
           | 
           | Having code search is one thing, but using it is another:
           | 
           | * Code Insights (we use search as an API to gather statistics
           | about code, track code quality, keywords, etc. both over time
           | and retroactively and let you build dashboards)
           | 
           | * Batch changes (find+replace, but over thousands of
           | repositories. Run a Docker container per repo, run your
           | custom linter script etc. and then draft or send PRs to
           | thousands of repos, manage/track campaigns with thousands of
           | PRs like that over time, etc.)
           | 
           | * Precise code intel / semantic awareness of code, we use
           | SCIP indexers for this (spiritual successor to Microsoft's
           | LSIF format for indexing LSP servers.)
           | 
           | I am super happy GitHub continues to push their code search
           | effort, and genuinely believe it's a great thing for all
           | developers and us over at Sourcegraph. Also excited to see
           | when they do their public rollout of this :)
           | 
           | Anyway, that's just my take as someone who works there; other
           | Sourcegraphers will chime in later if anything I said above
           | feels off to them I'm sure :)
        
             | zellyn wrote:
             | Thanks for the information! I came to the hackernews
             | discussion because SourceGraph was so curiously absent from
             | the article and the "brief history of codesearch" article
             | it linked to, and I wanted to see what others thought!
        
             | john_cogs wrote:
             | GitLab team member here.
             | 
             | We are looking at Zoekt for code search:
             | https://gitlab.com/groups/gitlab-org/-/epics/9404
        
           | imemyself wrote:
           | Sourcegraph has become pretty obscenely expensive in the last
           | couple years (with borderline hostile sales folks to boot). I
           | know at least one company who would love to be able to cancel
           | that contract if GH search is "good enough" now.
        
             | sqs wrote:
             | Sourcegraph CEO here. We definitely need a cheaper tier for
             | smaller companies or those who don't need our entire
             | feature set. I agree! What do you think that should be?
             | 
             | Overall, we're building what our customers need, and our
             | product goes way beyond what GitHub can offer. Sourcegraph
             | indexes all the code and increasingly all the code
             | intelligence (including code nav but also code ownership
             | and other metadata in the future from your other dev
             | tools). We charge based on active usage, so we make money
             | when devs at customers /choose/ to use us over the
             | alternatives. We're trying to do this the right way, and
             | tons of customers agree. (If anyone reading this disagrees,
             | please let me know!)
             | 
             | Re: your comment about our sales team, I'm really sorry to
             | hear that and want to understand more so I can fix the
             | problem. Can you please email me at sqs@sourcegraph.com?
        
         | [deleted]
        
       | kjuulh wrote:
       | I really like the new search. Though sometimes it is a bit
       | deceptive. I.e. when searching for a function name by clicking on
       | a piece of code and suddenly you are in an entirely different
       | code base with an unrelated function though it shares the name.
       | 
       | It feels like github code browsing is a step between a full
       | editor with lsp and a static site. I hope they work out the kinks
       | and make it more smooth
        
         | kibwen wrote:
         | Supporting jump-to-definition natively seems like something
         | that will be table stakes for any code hosting site in the
         | future.
        
           | dymk wrote:
           | I think that there should be some sort of standardized,
           | language-agnostic metadata format for semantically indexing a
           | codebase. It could include e.g type information for
           | expressions or declared variables (for languages that infer
           | types), and an index of symbols and how they're connected.
           | 
           | This metadata file would be generated by a language-specific
           | tool. For instance, Cargo could generate it for Rust
           | projects, ctags/cmake for C / C++, Sorbet for Ruby.
           | 
           | Then a service like Github / Gitlab / your own homegrown code
           | viewer could provide things like "Show type", "Jump to
           | source" without ever needing to build a language-specific
           | parser or interpreter (which seems arbitrarily difficult,
           | most build systems provide escape hatches, so you can't
           | assume much about project structure).
           | 
           | Basically, LSP as a static file in a standard format that
           | tools can read to understand a codebase, without needing to
           | model the language's semantics.
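
A rough sketch of what such a static, language-agnostic index could look like to a consumer. Everything here is hypothetical (the `Location`, `Symbol` and `StaticIndex` names, the field layout, the sample paths); it is not SCIP, LSIF or SemanticDB, only an illustration of "LSP as a static file": a language-specific tool fills in the records, and the viewer answers "Jump to source" and "Show type" with plain lookups.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Location:
    path: str
    line: int
    column: int


@dataclass
class Symbol:
    name: str                          # fully qualified name chosen by the indexer
    definition: Location
    type_text: Optional[str] = None    # pre-rendered type, produced by the language tool
    references: list[Location] = field(default_factory=list)


@dataclass
class StaticIndex:
    symbols: dict[str, Symbol] = field(default_factory=dict)
    # Maps every known occurrence (definition or reference) to a symbol name.
    occurrences: dict[Location, str] = field(default_factory=dict)

    def add(self, symbol: Symbol) -> None:
        self.symbols[symbol.name] = symbol
        self.occurrences[symbol.definition] = symbol.name
        for ref in symbol.references:
            self.occurrences[ref] = symbol.name

    def jump_to_source(self, at: Location) -> Optional[Location]:
        name = self.occurrences.get(at)
        return self.symbols[name].definition if name else None

    def show_type(self, at: Location) -> Optional[str]:
        name = self.occurrences.get(at)
        return self.symbols[name].type_text if name else None


# Hypothetical data that a language-specific tool would have emitted.
index = StaticIndex()
index.add(Symbol(
    name="app.util.parse",
    definition=Location("app/util.py", 10, 4),
    type_text="(text: str) -> int",
    references=[Location("app/main.py", 3, 8)],
))
```

The viewer never parses the language: `index.jump_to_source(Location("app/main.py", 3, 8))` is a dictionary lookup, which is the property the comment is after.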
        
             | maxov wrote:
             | SemanticDB
             | (https://scalameta.org/docs/semanticdb/guide.html) is a
             | protobuf-based file format that does almost exactly this
             | for JVM languages, primarily Scala (I was a contributor a
             | while back). It is used to build an intelligent online code
             | browser, as the backend for a language server, and to do
             | intelligent refactorings.
             | 
             | I think a language-agnostic semantic metadata format is a
             | good idea, but requires a lot of compromise. ctags
             | partially does this, but only to a very coarse level
             | (mostly definitions and references). I think some ctags
             | implementations also define 'extension fields' that could
             | be used to give type information, but I don't know how/if
             | these are used in practice. SemanticDB is extremely fine-
             | grained, but highly specialized to JVM languages and type
             | systems that are designed to work with the JVM. Finding a
             | common set of semantic features that can be used across
             | languages and type systems that is fine-grained enough to
             | be more useful than ctags sounds very difficult to me.
        
               | dymk wrote:
               | I think simple things like "go to reference" or "show
               | type" would be sufficient for 95% of usecases. But if you
               | split languages up into a few different categories (maybe
               | along the lines of Algol-like vs Lisp-like), and were
               | flexible with extensions, I'd imagine we'd see some
               | common patterns emerging, and clients would take
               | advantage of that. Best effort is probably good enough to
               | greatly improve the ergonomics of search.
        
             | slimsag wrote:
             | This is pretty much exactly what we've built at
             | Sourcegraph. Microsoft had introduced (but pretty much
             | abandoned before it even started) LSIF, a static index
             | format for LSP servers which encodes in detail all possible
             | LSP requests/responses, effectively.
             | 
             | We took that torch and carried it forward, building the
             | spiritual successor called SCIP[0]. It's language agnostic,
             | we have indexers for quite a few languages already, and we
             | genuinely intend for it to be vendor neutral / a proper OSS
             | project[1].
             | 
             | [0] https://about.sourcegraph.com/blog/announcing-scip
             | 
             | [1] https://github.com/sourcegraph/scip
        
           | jeffbee wrote:
           | I can only think of one hosted repo service that provides a
           | working go-to-definition feature and it is not github or
           | sourcegraph. What makes you think this will suddenly become
           | widespread? Github spent years doing this and their new thing
           | is strictly non-semantic. It doesn't have the faintest idea
           | where the name is defined, or if there's even a difference
           | between a function name, a parameter name, or a word in a
           | comment.
        
             | saagarjha wrote:
             | I can think of zero. Doing go-to-definition statically is
             | very difficult.
        
             | slimsag wrote:
             | FWIW Sourcegraph has fully precise/semantic go-to-
             | definition, find-references, etc. We use SCIP code indexers
             | (a spiritual successor to LSIF, the Microsoft standard for
             | indexing LSP servers)
        
               | jeffbee wrote:
               | Not for C++. To test my recollection I navigated to
               | abseil-cpp/strings/str_split.h, clicked on the
               | declaration of absl::ByString::Find, and clicked "Go to
               | definition". I was presented with every function in
               | Abseil named "Find" regardless of its scope or parameter
               | types. That's not "precise code intelligence"!
        
               | slimsag wrote:
               | In the top right corner of the tooltip it will say either
               | "Search-based" or "Precise" - in this case, you're right,
               | we don't have the abseil-cpp repo indexed so it falls
               | back to search-based as you describe.
               | 
               | We do have a C++ code indexer in beta,
               | https://github.com/sourcegraph/lsif-clang - it is based
               | on clang but C++ indexing is notably harder to do
               | automatically/without-setup due to the varying build
               | systems that need to be understood in order to invoke the
               | compiler.
        
             | tjoff wrote:
             | The rise and popularity of LSP and projects such as
             | treesitter are a superb foundation for features such as
             | this. Both support a wealth of languages, it is still and
             | will be quite hard to assume the toolchain and settings
             | required for producing accurate information though.
             | 
             | But this could be tied into CI, especially for projects
             | utilizing runners for building the code.
             | 
             | So the barrier to entry now is orders of magnitude less and
             | with github and others pushing for codespaces it could be
             | the final piece to tie everything together.
        
             | hobofan wrote:
             | > It doesn't have the faintest idea where the name is
             | defined, or if there's even a difference between a function
             | name, a parameter name, or a word in a comment.
             | 
             | I don't think what you are saying is actually true for
             | stack-graphs[0][1].
             | 
             | [0]: https://github.com/github/stack-graphs
             | 
             | [1]: https://github.blog/2021-12-09-introducing-stack-
             | graphs/
        
       | solarkraft wrote:
       | Damn, it's about time, the current search sucks. What I have
       | found to work very well is SourceGraph; they offer search for
       | public repos. Maybe this'll be an alternative to it.
        
       | saagarjha wrote:
       | I've been using the new code search for a couple of months and I
       | like it, but the UI is kind of antagonistic to how I typically
       | want to search for things. For one, the new experience doesn't
       | actually load code onto the page, it does some sort of lazy
       | loading thing as you scroll around, so ⌘F doesn't work. I
       | understand that there's a custom search box to try to get around
       | this but it's pretty slow and fiddly and I don't really want to
       | use it. I also find the layout to be pretty annoying, because
       | invariably there's a symbol panel on the side that doesn't work
       | for the code I want to look at, and then it's just there taking
       | space. If I hit "t" to enter a file name and start typing the
       | text field loses focus after a second and I need to click on it
       | again. I know there are a couple of people on the team in this
       | thread: I search a lot of code on GitHub and I feel like there's
       | a couple of tweaks that would greatly improve my experience.
       | Like, I think I could even show you a video of all the places
       | where the UI has gotten less usable for me. What would be the
       | best way to get this feedback to you? I've posted stuff on the
       | forum or whatever but it's unclear to me if this is the intended
       | way to raise issues.
        
         | colin353 wrote:
         | Hey saagarjha, thanks for the feedback. It's our goal to make
         | the experience as good as possible, and we're aware of
         | shortcomings with cmd+F and `t`, among other things. We're
         | working on it, and your feedback helps us a lot.
         | 
         | We read all the feedback on the forum here:
         | https://github.com/orgs/community/discussions/38692, so please
         | keep providing it. Videos and screenshots are super helpful
         | too. Thanks for bearing with us as we continue to polish the
         | UX!
        
       | thinking001001 wrote:
       | Hmm not sure if I should delete my (2nd) Github account again,
       | just thinking about how much data they are getting from users, it
       | could become the Facebook of Git.
        
       | drcongo wrote:
       | With current search, I can search [0] the Django repo for a class
       | that definitely exists [1] in Django, there are 0 code results.
       | Zero. GitHub search is mystifyingly bad, I hope this is a LOT
       | better.
       | 
       | [0]
       | https://github.com/django/django/search?q=DeleteView&type=co...
       | 
       | [1]
       | https://github.com/django/django/blob/main/django/views/gene...
        
         | MapleWalnut wrote:
         | With the new GitHub code search, the first result is the
         | DeleteView class:
         | https://github.com/django/django/blob/bd366ca2aeffa869b7dbc0...
        
       | jd3 wrote:
       | I was working on a research project awhile ago and every time I
       | searched for something particular it immediately thought I was a
       | bot after like 2-3 particular/exact queries.
       | 
       | Ever since then, I've exclusively used sourcegraph.
        
       | ezekg wrote:
       | I use their new code search a lot to grok how people use certain
       | features, or implement certain things. But I do wish there was a
       | way to filter out forks. Sometimes I search a string and just get
       | a bunch of forks all with the same result. For example, searching
       | a common class in a Rails app often just shows a bunch of
       | rails/rails forks, which is a lot of noise to sift through when
       | you're trying to see how devs commonly use a certain feature.
        
         | 100k wrote:
         | Thanks for the feedback! That's coming, we've been prioritizing
         | scaling the index and ingest process and haven't had a chance
           | to add that yet. There are a bunch of value-add features like
         | this I am looking forward to knocking out soon.
        
           | BaculumMeumEst wrote:
           | thanks so much, you guys are doing amazing work
        
           | kowlo wrote:
           | Well done
        
           | orf wrote:
           | Out of interest, if I have a repo with many millions of files
           | that compress quite nicely down to about a 1.4gb packfile, is
           | it better for the ingestion and/or indexer if I break this
           | down into many smaller pushes or one large push?
           | 
           | Because I pushed such a repo yesterday and it's still not
           | been indexed.
        
             | 100k wrote:
             | Sorry, there are size limits on the number of files in a
             | repository that we'll index. That is probably why your
             | repository wasn't indexed. Out of curiosity, what are you
             | storing in this repository?
        
               | orf wrote:
               | Sequences of code from Python package releases from pypi,
               | as an experiment I'm working on. They compress quite
               | nicely as the deltas between releases are fairly small.
               | 
               | I thought it would be nice to be able to search through
               | them, but I guess the file limit was reached. That's a
               | shame.
        
         | simonw wrote:
         | As an open source library maintainer I've been using it for
         | that too - it's fantastic for answering questions like "is
         | anyone using this API method I wrote?" and "how much of a mess
         | would it likely cause people if I deprecated this function?"
        
         | chatmasta wrote:
         | Perhaps it could benefit from something like a "dissimilarity"
         | filter, which ranks the current result set by returning the
         | most unique hits first. You wouldn't always want this, because
         | sometimes you're searching for how something is typically used,
         | and with many duplicative results you can confirm that's the
         | preferred pattern. But other times you're looking for more
         | esoteric usage of a certain function, and it would be nice to
         | filter the "standard usage" from the results (although you can
         | already do this with carefully chosen negated keywords).
         | 
         | Personally, I'm happy with the new code search so far. I
         | stopped using Sourcegraph because I could never get the deeper
         | results I wanted - it would just return the top five
         | repositories including some common code snippet and I couldn't
         | explore further than that. GitHub Code Search doesn't seem to
         | have this problem to such a degree, since I can use negation
         | more naturally, and since my query is not limited to some
         | shallow subset of the corpus before refining it.
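
One way to picture the "dissimilarity" idea, as a minimal sketch and not anything GitHub or Sourcegraph is known to do: collapse hits whose snippets are identical after whitespace normalization, then show the rarest snippets first. The `(repo, snippet)` hit shape and the function names are assumptions made for the example.

```python
import re
from collections import Counter


def normalize(snippet: str) -> str:
    # Collapse whitespace so trivially reformatted copies compare equal.
    return re.sub(r"\s+", " ", snippet).strip()


def rank_by_rarity(hits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep one hit per distinct snippet, ordered from rarest to most common."""
    counts = Counter(normalize(snippet) for _, snippet in hits)
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for repo, snippet in hits:
        key = normalize(snippet)
        if key not in seen:
            seen.add(key)
            unique.append((repo, snippet))
    # sorted() is stable, so equally rare snippets keep their original order.
    return sorted(unique, key=lambda hit: counts[normalize(hit[1])])
```

Sorting the same list in descending order instead would give the opposite view, the "preferred pattern" confirmed by many duplicates.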
        
           | sqs wrote:
           | Re: Sourcegraph, we're working on improving that, and sorry
           | you couldn't get the results you wanted. We primarily build
           | for the code within customers, where this particular problem
           | is less common than across all open-source repositories. But
           | we want it to work really well in every case.
           | 
           | Our new ranking (https://about.sourcegraph.com/blog/new-
           | search-ranking) should help a lot here, and it's live on
           | https://sourcegraph.com. Can you share some of the queries
           | you tried so we can see how much ranking helps and how to
           | handle them better?
        
             | chatmasta wrote:
             | Sure! I will try it next time I'm using code search. I
             | can't remember the specific queries I've tried, but I only
             | ever used it when looking for "how have people used this
             | function," so I remember the limited depth of results being
             | particularly annoying.
             | 
             | Thanks for the link to your blog post. I didn't realize
             | Steve Yegge had joined your team - congratulations, that's
             | quite an endorsement!
             | 
             | I've always liked Sourcegraph just because our company
             | names are so similar (founder of Splitgraph here)... we
             | might have even gotten a few candidates because of that
             | initial confusion. :)
             | 
             | How has GitHub Code Search impacted your product direction?
             | Do you see it as an opportunity to focus more on the
             | internal use case, or do you have plans for some other
             | differentiation? It's always unfortunate when a big company
             | introduces a product so similar to the core product of a
             | startup, but I'm sure there is a silver lining there,
             | especially when you have a talented team and a mature
             | codebase (for example, Fly.io has been able to carve out a
             | niche for itself despite Cloudflare moving to compete in
             | the same areas). Either way, best of luck to you from a
             | fellow S-grapher!
        
         | mnutt wrote:
         | I run into this a lot where I'd like to see some real world use
         | cases of a method in FooClass, but get a hundred pages of forks
         | of the FooClass.h header file. In some cases I've been
         | successful adding "-filename:FooClass.h" to filter out that
         | header file. If the method is used a lot in the primary project
         | it can be a game of whack-a-mole but it often eventually works.
        
         | ljm wrote:
         | But also, you know what you're looking for and you can filter
         | the results.
         | 
         | Someone less experienced than you may take any of those results
         | as gospel. Gospel the same way StackOverflow could be, but this
         | time by a source who will do its best to tell you what you
         | want.
         | 
         | my personal rule about AI is about the same as the rule about
         | being a lawyer. Never ask a question you don't know the answer
         | to.
        
         | pohuing wrote:
         | I think forks without any changes shouldn't really show up in
         | search tbh.
        
         | qabqabaca wrote:
         | Sourcegraph[1] does this better and has done for a couple of
         | years now. I use it for this reason all the time.
         | 
         | [1] https://sourcegraph.com
        
           | Arnavion wrote:
           | Yes, just change the URL from https://github.com/foo/bar to
           | https://sourcegraph.com/github.com/foo/bar to be dropped in
           | to a code search for that GH repo.
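
The URL rewrite described above, as a small helper. The function name is made up for the example, and it only covers the repository-level mapping the comment states; anything after `owner/repo` is dropped, since how deeper paths map is not confirmed here.

```python
from urllib.parse import urlsplit


def to_sourcegraph(github_url: str) -> str:
    """Map https://github.com/<owner>/<repo> to its sourcegraph.com URL."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {github_url}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"expected an owner/repo path: {github_url}")
    owner, repo = segments[:2]
    return f"https://sourcegraph.com/github.com/{owner}/{repo}"
```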
        
       | Scaevolus wrote:
       | As a comparison to Sourcegraph: Sourcegraph shards and indexes a
       | repository at a time, and uses trigrams and bloom filters (to
       | skip shards).
       | 
       | Github shards and indexes individual files according to their
       | hashes. It also uses variable length ngrams (neat!). This makes
       | horizontal scaling simpler, but also means more of the index
       | needs to be scanned for org/repo-scoped queries ("Due to our
       | sharding strategy, a query request must be sent to each shard in
       | the cluster.").
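
A toy illustration of the trade-off described above. The shard count, the use of SHA-1 and the function names are all assumptions for the sketch; neither system's real routing is shown. When files are placed by content hash, a repo-scoped query cannot know which shards hold that repo's files, so it fans out to all of them; when a whole repository lives on one shard, a repo-scoped query can go to a single shard.

```python
import hashlib
from typing import Optional

NUM_SHARDS = 4  # hypothetical cluster size


def _bucket(data: bytes) -> int:
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shard_for_blob(content: bytes) -> int:
    # Content-addressed placement: identical files land on the same shard.
    return _bucket(content)


def shard_for_repo(repo: str) -> int:
    # Repository-level placement: all files of a repo share one shard.
    return _bucket(repo.encode("utf-8"))


def shards_to_query(repo_scope: Optional[str], by_content: bool) -> set[int]:
    if by_content or repo_scope is None:
        # A repo's files are scattered by hash, so every shard must be asked.
        return set(range(NUM_SHARDS))
    return {shard_for_repo(repo_scope)}
```

In this sketch, content-hash placement spreads load evenly regardless of repository size, which is the "simpler horizontal scaling" half of the trade.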
        
       | colin353 wrote:
       | Hey everyone, I'm Colin from GitHub's code search team: happy to
       | answer any questions people have about it. Also, you can sign up
       | to get access here: https://github.com/features/code-search
        
       | kirillbobyrev wrote:
       | This is exciting! I see a lot of familiar pieces here that
       | propagated from Google's Code Search and I know a few people from
       | Code Search went to GitHub, probably specifically to work on
       | this. I always wondered why GitHub didn't invest in decent
       | code search features, but I'm happy it finally gets to the
       | State of the Art one step at a time. Some of the folks going to
       | GitHub to work on this I know are just incredible and I have no
       | doubt GitHub's code search will be amazing.
       | 
       | I also worked on something similar to the search engine that is
       | described here for the purposes of making auto-complete fast for
       | C++ in Clangd. That was my intern project back in 2018 and it was
       | very successful in reducing the delays and latencies in the auto-
       | complete pipeline. That project was a lot of fun and was also
       | based on Russ Cox's original Google Code Search trigram index. My
       | implementation of the index is still largely untouched and is a
       | hot path of Clangd. I made a huge effort to document it as much
       | as I can and the code is, I believe, very readable (although I'm
       | obviously very biased because I spent a lot of time with it).
       | 
       | Here is the implementation:
       | 
       | https://github.com/llvm/llvm-project/tree/main/clang-tools-e...
       | 
       | I also wrote a... very long design document about how exactly
       | this works, so if you're interested in understanding the
       | internals of a code search engine, you can check it out:
       | 
       | https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...
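
For readers who have not seen a trigram index before, here is a minimal sketch of the general idea for substring search. It is not the Clangd implementation linked above and not GitHub's engine; the class and method names are invented for the example. Every document is indexed under each 3-character window it contains, a query is answered by intersecting the posting lists of the query's trigrams, and the surviving candidates are verified with a real substring check.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.docs: dict[int, str] = {}
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query: str) -> list[int]:
        grams = trigrams(query)
        if not grams:
            # Queries shorter than three characters cannot use the index.
            candidates = set(self.docs)
        else:
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # The index only narrows the candidates; confirm each one for real.
        return sorted(d for d in candidates if query in self.docs[d])
```

The verification step matters because a document can contain every trigram of the query without containing the query itself.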
        
       | ZephyrBlu wrote:
       | I really hope they release this soon and that it's actually good.
       | 
       | The current search sucks ass, you can't find anything.
       | 
       | I was trying to search for something in the WebKit source the
       | other day and I had to use Sourcegraph because the GitHub search
       | gave me zero results.
        
         | 100k wrote:
         | You can try it right now in our beta. Right now, we are
         | onboarding all signups daily. Sign up here:
         | https://github.com/features/code-search
         | 
         | Obvious disclaimer because I work on this (and I also worked on
         | the legacy search): It is way better.
        
           | ZephyrBlu wrote:
           | I just signed up to the beta. Looking forwards to trying it
           | out :).
        
       | Daffodils wrote:
       | Was looking for more details on the data structure 'Geometric
       | filter' mentioned in the footnotes. Couldn't find anything (a few
       | unrelated papers in object recognition aside). If anybody can
       | share anything that would be great!
        
         | 100k wrote:
         | We hope to share more on this soon!
        
       | simonw wrote:
       | I really appreciate that this includes details about how search
       | permissions work - how they ensure that search results include
       | data from my private repos.
       | 
       | I'd always wondered how they implemented that: it turns out they
       | add extra internal filters to their searches along the lines of
       | "RepoIDs(...) or PublicRepo".
       | 
       | Question for the team: Do you have an additional permission check
       | in the view layer before the results are shown to the end-user? I
       | worry that if I switch a repo from public to private it may take
       | a while for the code search index to catch up to the new
       | permissions.
        
         | tclem wrote:
         | Yes, we never fully trust the search index so before anything
         | is displayed to the user there are a number of final checks
         | performed to make sure you're actually allowed to see that
         | content.
         | 
         | Another fun example is that your SSO session might have
         | expired. While you technically have access to view the result,
         | we can't show it until you go through the refresh dance to get
         | another valid token.
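
The two layers discussed in this subthread can be sketched as follows, under stated assumptions: the `IndexedDoc` record, the `search` and `visible` functions and the callbacks are hypothetical, and the real checks are certainly more involved. The query is first scoped with a filter in the spirit of "RepoIDs(...) or PublicRepo", then each result is re-checked against current permissions before display, which covers a repo that went private after it was indexed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IndexedDoc:
    repo_id: int
    path: str
    public_at_index_time: bool  # may be stale by the time of the query


def search(
    index: list[IndexedDoc],
    matches: Callable[[IndexedDoc], bool],
    allowed_repo_ids: set[int],
) -> list[IndexedDoc]:
    # Query-time scope: repos the user can access, or anything indexed as public.
    return [
        doc for doc in index
        if matches(doc)
        and (doc.repo_id in allowed_repo_ids or doc.public_at_index_time)
    ]


def visible(
    results: list[IndexedDoc],
    can_read_now: Callable[[int], bool],
) -> list[IndexedDoc]:
    # Final check against the source of truth, not the search index.
    return [doc for doc in results if can_read_now(doc.repo_id)]
```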
        
       | loginatnine wrote:
       | I'm curious if they'll open source Blackbird, it does not seem
       | mentioned in the post.
        
       | robertlagrant wrote:
       | Not to diminish this excellent work, but:
       | 
       | 1) I never want to search all repos globally. At worst I want to
       | search all of my org's repos.
       | 
       | 2) the search UI is a little clunky, in a way I'd need to be
       | using it again to remember.
       | 
       | Between those two I think there's loads of progress to be made
       | outside of raw search power. Of course it's nice to have that,
       | but that's what I'm really after.
        
         | aeyes wrote:
         | I use global search from time to time to see how other projects
         | use certain libraries. When the documentation of said libraries
         | is sparse this can sometimes be a good timesaver.
        
           | robertlagrant wrote:
           | The other big thing that works well is being able to jump
           | directly into the source of an open source library from your
           | code. That is powerful, but again, possibly doesn't need a
           | giant ultra search. Just some clever linking.
        
       | PaulHoule wrote:
       | My beef with GitHub's code search is that it doesn't distinguish
       | between the definition of a symbol and the uses of the symbol, so
       | you need to wade through 5 pages of results to get the one result
       | you're looking for. I would contrast that to my IDE which usually
       | scores a direct hit if I enter a search in the right box.
       | 
       | The indexing they talk about in that article seems like
       | rearranging the deck chairs on the Titanic so far as that is
       | concerned.
        
         | DontchaKnowit wrote:
         | Thank you. This is SUCH an obvious feature that seems super
         | trivial to implement.
        
       ___________________________________________________________________
       (page generated 2023-02-06 23:00 UTC)