[HN Gopher] The technology behind GitHub's new code search ___________________________________________________________________ The technology behind GitHub's new code search Author : joshbetz Score : 377 points Date : 2023-02-06 17:32 UTC (5 hours ago) (HTM) web link (github.blog) (TXT) w3m dump (github.blog) | tuan wrote: | I wish they provide short name versions for their filters. For | example: instead of "withContext language:python path:tests", I | could write "withContext l:python p:tests". | debdut wrote: | https://grep.app | napsterbr wrote: | I use this one almost daily. It's great to find real world | examples of APIs/contracts being used. Also, instant results! | | The underlying data may be limited (I have no idea how large it | is, I doubt it has indexed every public repository out there), | but I never failed to find examples of what I was looking for. | Beefin wrote: | If you ever want to search binary files (image, video, pdf, etc.) | within github repos: https://learn.mixpeek.com/github-search/ | hbn wrote: | I've been using this since it was still an email signup beta. I | don't do anything too complicated, but man it's been invaluable | to do exact-string searches across all of my organization's | repos. I use it most days at work | tonymet wrote: | This is a great intro / overview of full-text search for those | wondering how to build your own search engine. | | It's a great 101-level exercise to write an inverted index | implementation you can do it in an afternoon , and then expand to | a leaf /aggregator in follow-up exercises. | jeffbee wrote: | It is indeed Information Retrieval 101 level stuff which leads | to the question of why this is the best GitHub can do with all | the resources of Microsoft behind them. It's almost useless, at | least for C++. It can't tell the difference between foo(int) | and foo(double) or this::foo vs. that::foo. 
| | If I wanted the kind of search engine I can get a teenager to | write in 16 weeks why would I expect my org to be paying $$$ | for the service? | 100k wrote: | Have you tried the new search? Thanks to the variable length | ngram indexing mentioned in the post, it can handle all of | those cases. Sign up here to try it: | https://github.com/features/code-search | | Symbol extraction for C and C++ is currently disabled because | we were having problems with the performance of the tree- | sitter queries we were using, but we are planning to bring | that back. | jeffbee wrote: | Sorry, it cannot handle _any_ of those cases. You 're | talking about the ability to find the literal `this::foo` | but that's not how it would normally appear. It normally | will appear anywhere inside a `namespace this` scope, which | cs.github does not grok. And cs.github cannot address | finding the definition related to a given call site. It | doesn't even try. | 100k wrote: | You are correct, as I mentioned, we do not analyze | symbols for C and C++ at this time. | burntsushi wrote: | What a shit take. The article itself is perhaps a nice light | overview of 101-ish level concepts, although knowing how and | when to apply them in a real engineering context is not | something I would consider 101 level. And certainly, building | something that is actually at the scale of GitHub Search is | nowhere near 101 level. | | This is what a 101-level inverted index implementation looks | like: https://github.com/BurntSushi/imdb-rename | | In other words, absolutely nothing like what GitHub built. | Nowhere close. | anitil wrote: | I did this for our organization using sqlite's FTS module and | datasette and boy was it fast. Unfortunately I did get | (temporarily) banned from the organisations github account, but | it was definitely worth it. | | Even now I find myself using it despite the index being a few | months out of date. | Waterluvian wrote: | This looks delightful! 
| | One nit I have about current search: I'll look something up and | find I'm getting results for some obtuse commit in some old | branch somewhere. I'd like to be able to optionally say "latest | commit on branches only please" or "main branch only please." | | Another thing, which might betray that I don't understand search | all that well: language aware searching that knows, for example, | that a single or a double quote are syntactically | interchangeable. Don't omit half the results because I used one | quote over the other when looking up `interpolation = 'nearest'` | bjd2385 wrote: | When can we have a usable search in GitLab? | john_cogs wrote: | GitLab team member. Thanks for the question. | | Our Code Search team is currently working on moving to Zoekt[0] | which is expected to be a significant improvement as it is | purpose-built for code search. | | We also shipped an improvement[1] to our existing search | functionality at the end of last year. If you haven't used it | recently, I'd encourage you to check out code search again to | see if the quality has been improved for you. | | [0] - https://gitlab.com/groups/gitlab-org/-/epics/9404 | | [1] - https://gitlab.com/gitlab-org/gitlab/-/issues/346914 | tantalor wrote: | Why not kythe? | | https://kythe.io/ | Scaevolus wrote: | Kythe is not a regex search engine. It depends on extracting | precise semantics of all the code it runs on to compute correct | edges like "calls-function". This only works for a few | languages, and is extremely difficult to do generically across | all of github. | mperham wrote: | On the spectrum of "build vs buy", this is a good example where a | business should build it. Scaling code search is their core | value. | [deleted] | Existenceblinks wrote: | Blackbird written in Rust is a natural approach. 
Those who try to | sell build the whole thing with a whole thing is unwise (look at | you isomorphic javascript) | ZephyrBlu wrote: | Isomorphic projects are generally good for full stack apps, but | I don't think anyone would recommend you build a search engine | with isomorphic JS. | user3939382 wrote: | The cursor position in the free-form query terms in the search | input doesn't align correctly when the input contains tags. | ZephyrBlu wrote: | Search is a fascinating topic because it's such a fundamental | problem and every search engine is based around the same | extremely simple data structure (Posting list/inverted index). | Despite that, search isn't easy and every search engine seems to | be quite unique. It also seems to get exponentially harder with | scale. | | You can write your own search engine that will perform very well | on a surprisingly large amount of data, even doing naive full- | text search. A search tool I came across a while back is a great | example of something at that scale: https://pagefind.app/. | | For anyone who doesn't know anything about search I highly | recommend reading this (It's mentioned in the blog post as well): | https://swtch.com/~rsc/regexp/regexp4.html. | | Algolia also has a series of blog posts describing how their | search engine works: | https://www.algolia.com/blog/engineering/inside-the-algolia-.... | | --- | | It's interesting that GitHub seems to have quite a few shards. | Algolia basically has a monolithic architecture with 3 different | hosts which replicate data and they embed their search engine in | Nginx: | | _" Our search engine is a C++ module which is directly embedded | inside Nginx. So when the query enters Nginx, we directly run it | through the search engine and send it back to the client."_ | | I'm guessing GitHub probably doesn't store repos in a custom | binary format like Algolia does though: | | _" Each index is a binary file in our own format. 
We put the | information in a specific order so that it is very fast to | perform queries on it."_ | | _" Our Nginx C++ module will directly open the index file in | memory-mapped mode in order to share memory between the different | Nginx processes and will apply the query on the memory-mapped | data structure."_ | | https://stackshare.io/posts/how-algolia-built-their-realtime... | | 100ms p99 seems pretty good, but I'm curious what the p50 is and | how much time is spent searching vs ranking. I've seen Dan Luu | say that majority of time should be spent ranking rather than | searching and when I've snooped on https://hn.algolia.com I've | seen single digit millisecond search times in the responses, | which seems to corroborate this. | | I'm curious why they chose to optimize ingestion when it only | took 36hrs to re-index the entire corpus without optimizations. A | 50% speedup is nice, but 36hrs and 18hrs are the same order of | magnitude and it sounds like there was a fair amount of | engineering effort put into this. An index 1/5 of the size is | pretty sweet though, I have to assume that's a bigger win that | 50% faster ingestion. | | Since they're indexing by language I wonder if they have custom | indexing/searching for each language, or if their ngram strategy | is generic over all languages. Perhaps their "sparse grams" | naturally token different for every language. Hard to tell when | they leave out the juiciest part of the strategy though: "Assume | you have some function that given a bigram gives a weight". | | Search is so cool. I could talk about it all day. | 100k wrote: | I agree! Search is so cool. | | _It 's interesting that GitHub seems to have quite a few | shards. Algolia basically has a monolithic architecture with 3 | different hosts_ | | I used to work at an Algolia competitor. I don't know for sure, | but my guess is that Algolia shards their indices by customer. | Algolia does not provide global search. GitHub code search | does. 
That, and the desire to deduplicate data, is what led us | to our current sharding strategy (notably, it is different than | the old GitHub code search's sharding.). | | _I 'm guessing GitHub probably doesn't store repos in a custom | binary format like Algolia does though:_ | | We have a custom index format, so I would say this is the same, | unless you mean something different. We of course translate | repos from their Git form to our index document form for | indexing. | | _I 'm curious why they chose to optimize ingestion when it | only took 36hrs to re-index the entire corpus without | optimizations. A 50% speedup is nice, but 36hrs and 18hrs are | the same order of magnitude and it sounds like there was a fair | amount of engineering effort put into this. An index 1/5 of the | size is pretty sweet though, I have to assume that's a bigger | win that 50% faster ingestion._ | | The index size is a bigger win, but being able to reindex | quickly is huge for our development velocity and trying things | out. We really feel it when things are slow. This is also not | our final goal, we want to scale the system up considerably. | ZephyrBlu wrote: | I'm not familiar with production search systems at scale | (Very curious about them though). How do you think Algolia | shards their data given that architecture? Based on their | description it seems like the search engine itself is | monolithic. Maybe they're running a 3-node cluster with a | monolithic index for each customer? | | Interesting, do you keep a copy of the index document form of | repos or is that done on the fly during indexing? Is your | custom index format a binary format? I have no idea whether | that's standard practice, or just a compressed text format is | enough. I guess that non-binary formats would be enormous | though, and given that an index is by definition relatively | unique it probably wouldn't compress that well. | | I do feel the development velocity thing. 
I've felt something | similar on my smaller scale projects. Being able to fully re- | index the corpus in less than a day definitely seems like it | would provide a lot of opportunities to experiment and try | stuff out without it being too costly. | | Scale up in terms of what? Is the current system not indexing | all of GitHub, or you mean you want to index on more things | (E.g. commits, PRs, etc)? | boyter wrote: | The sparse grams solution to deal with stupidly common ngrams | such as for or tes is very interesting. | | I'd love to see more discussion on how they are dealing with the | false positives though. It looks like a positional index is being | used to achieve this, but that usually blows out your index size. | | Additional information about deduplication would be especially | interesting to me as well. It seems to solve this quite well. I | usually try a search of Jquery to test this and it does not | return multiple copies of different versions of it which is a | good indicator that it's slightly fuzzy. | | What I find really interesting about all the code search engines | I know of is that each one implemented its own index. Nobody is | using off the shelf software for this. I suspect that might be | down to no off the shelf software providing a decent enough | solution, and none providing a solution that scales. At least | none that scales with decent costs. | | I did a small comparison of GitHub code search a while ago | https://twitter.com/boyter/status/1480667185475244036?s=61&t... | But I should note a lot has improved since then, and it looks | like sourcegraph now also does default AND of terms rather than | exact match, so my complaints there are resolved. | | Impressive work by GitHub. I am sure some of the people behind it | will read this comment, let me say well done to you all. I am | very impressed. Also please post more information like this. | There is so little out there. | 100k wrote: | Thanks! 
I enjoyed reading your blog posts about building your | code search engine. One minor point of clarification, we do not | use a positional ngram index, which as you note blows up the | index size. Instead, we use the covering sparse ngrams to | produce candidate documents and then search the content. | | An early version of Blackbird experimented with trigrams plus a | bitmask of the next character, but it didn't work well because | it wasn't selective enough. This is mentioned in the blog post: | | _We tried a number of strategies to fix this like adding | follow masks, which use bitmasks for the character following | the trigram (basically halfway to quad grams), but they | saturate too quickly to be useful._ | boyter wrote: | That's what I get for a cursory glance at 4am when I wrote | this. I will have a much better look after I get some coffee | into me. | | Thanks for the clarification. Looking forward to see what | else you and your team end up writing about. Which reminds me | to publish some other posts I have about searchcode. | ZephyrBlu wrote: | Implementing your own index gives you more control over it. I | think at this scale you probably want to tweak things | specifically to your product rather than using a generic | solution. I would guess that what you're indexing on (E.g. | language, file, repo, etc) and sharding strategy affects the | structure of your index as well. | boyter wrote: | Believe me I am aware. I am one of those who implemented | their own index for a code search engine :) I did it for my | own learning, but find it interesting because something like | elastic with trigrams can get you very close, albeit at a far | greater cost. | ZephyrBlu wrote: | I'm reading your blog posts about building your own index | now. 
| | I started writing my own very simple index and search | engine, but quickly decided to just use ClickHouse via | https://tinybird.co as my backend (Serverless SQL with | automatic APIs is pretty sweet) because I wanted to build | out the product side of things and my data is really small, | so I felt like it was going to be a lot of effort for | little reward. | | Maybe one day I will need to write a custom index or search | engine that actually scales though :). | boyter wrote: | I won't hijack this thread with details but if you have | questions you can find my details on my profile. | gavinray wrote: | I just want to say thank-you to the folks who work on Code Search | at GitHub. | | It's the number one way I research and understand new | libraries/API's and programming languages. | | There's a lot more you can learn from usage in the wild than | tutorial posts sometimes. | cozos wrote: | I have been waiting for this for so long. | andrewmcwatters wrote: | > Just use grep? First though, let's explore the brute force | approach to the problem. We get this question a lot: "Why don't | you just use grep?" To answer that, let's do a little napkin math | using ripgrep on that 115 TB of content. On a machine with an | eight core Intel CPU, ripgrep can run an exhaustive regular | expression query on a 13 GB file cached in memory in 2.769 | seconds, or about 0.6 GB/sec/core. | | But you don't NEED to do this do you? I'm ALREADY in a | repository, I just don't want to check out, say all of WebKit, I | just need to find where a specific reference is defined. | | Maybe, maybe on a really serious day do I need to search an | entire organization. But hardly ever. | | I have never, in over a decade ever, wanted sophisticated | symbolic searching from GitHub code search, I just need remote | grep. | | Why is the code search not feature bisected into this 99% use | case, and then the occasional global repository search, which can | behave entirely differently? 
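The napkin math quoted above is easy to reproduce. The sketch below is not taken from the article: it only recomputes the per-core rate from the quoted ripgrep figures (13 GB, 2.769 seconds, eight cores) and extrapolates to the 115 TB corpus. The 2,048-core cluster size is a hypothetical value chosen for illustration, and decimal units (1 TB = 1000 GB) are assumed.

```python
# Back-of-the-envelope cost of brute-force scanning, using the figures
# quoted in the comment above. All names here are illustrative.

def per_core_throughput(file_gb: float, seconds: float, cores: int) -> float:
    """GB scanned per second per core for one exhaustive regex query."""
    return file_gb / seconds / cores


def core_seconds_for_scan(corpus_gb: float, gb_per_sec_per_core: float) -> float:
    """Total CPU time (core-seconds) to scan the whole corpus once."""
    return corpus_gb / gb_per_sec_per_core


# Quoted measurement: 13 GB file, 2.769 s, eight-core machine.
rate = per_core_throughput(13, 2.769, 8)

# Quoted corpus size: 115 TB (assumption: decimal units, 1 TB = 1000 GB).
corpus_gb = 115 * 1000
total_core_seconds = core_seconds_for_scan(corpus_gb, rate)

# Hypothetical cluster size, assuming perfect parallelism and the whole
# corpus already cached in memory.
HYPOTHETICAL_CORES = 2048
wall_clock_seconds = total_core_seconds / HYPOTHETICAL_CORES

print(f"per-core rate: {rate:.2f} GB/s")
print(f"CPU time for one query: {total_core_seconds:,.0f} core-seconds")
print(f"wall clock on {HYPOTHETICAL_CORES} cores: {wall_clock_seconds:.0f} s")
```

Even under these generous assumptions a single global query keeps thousands of cores busy for well over a minute, which is the case for an index. A search scoped to one repository scans far less data, which is the trade-off the comment above is asking about.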
| IshKebab wrote: | Actually I've been using https://grep.app for ages and while I | agree on GitHub I basically only search the repo I'm in, that's | mainly because GitHub's existing search sucks. | | On grep.app I regularly search all repos. It's very useful for | finding out how to use APIs or where APIs from dependencies are | defined. | | So I suspect you don't want it because subconsciously you know | that GitHub's "search all" feature won't return you useful | results. | | Hell they still don't provide a way to filter out test | directories which makes the code search inside a single repo | useless a lot of the time. | Xeophon wrote: | Recently discovered grep.app and it is incredibly useful. | Wish I'd known about it earlier. | tiagod wrote: | I use github-wide searches all the time to see how people are | using certain APIs, to find libraries used in some blob from | the strings I find there, to find people working with the same | data I'm about to attempt to work with, and the list goes on. | | What _you_ use github search for doesn't require all this | engineering, but what I use it for does. Why wouldn't they | build something that satisfies both our necessities well? | rtuin wrote: | Same here. I find doing a quick org-wide code search for a | specific call a great starting point for impact analysis for | API changes. Using it multiple times a week! | nonethewiser wrote: | That's not 99% of use cases. That's just how you use it. | 100k wrote: | (I work on this.) | | If you check out our prior blog post, "A brief history of code | search at GitHub" (https://github.blog/2021-12-15-a-brief- | history-of-code-searc...), you can learn a bit about the | evolution of this feature. And, in fact, we used to use git | grep to search repositories. | | This doesn't work well at GitHub's scale. We have 100M users | and over 200M repositories in a multi-tenant environment. Your | git grep is going to be competing for resources with other | users' pushes and clones. 
| joshuamorton wrote: | > I'm ALREADY in a repository, I just don't want to check out, | say all of WebKit, I just need to find where a specific | reference is defined. | | If you're in some repository that uses a WebKit API, and you | want to know how that API is defined, how do you do that | without global cross references or a global lookup? | | Even for local lookups, indices are useful (as any ctags user | will tell you!), but for any kind of cross repo xrefs they're | necessary. | chatmasta wrote: | In general, I really recommend code search as a tool for | supplementing reading the documentation and source code of your | dependencies (you _are_ reading the source code, right?). I reach | for it almost every day, and I find it's a reliable tool for | identifying "the right way" to use a library, especially one that | isn't fully documented. | imadethis wrote: | Sourcegraph should've accepted that offer from GitHub. | lern_too_spel wrote: | Sourcegraph had to have known GitHub would do this if they | didn't accept the offer. Since this should be expected, the | launch of this feature shouldn't change what their decision | should have been. | sqs wrote: | Sourcegraph CEO here. Just to be clear, so internet rumors | don't get started, there was no "offer" here. We started | Sourcegraph with the intent of remaining independent because | building really good code search and intelligence means | working across all code (not just on GitHub), all devs, and | all code intelligence sources (code nav plus every dev tool | you use that knows stuff about code, not just the ones in the | GitHub/Microsoft suite bundle). We've never entertained any | kind of acquisition interest for this reason. | | We don't think any of today's code host vendors with their | current strategies can make truly great code search and | intelligence because they'll be biased toward their own | bundled tools and limited to the subset of code hosted on | that instance. 
It'd be kind of like Encyclopedia Britannica | or The NY Times building a web search engine: helpful, but so | much more limited compared to what the independent Google | became. | | And yes, none of this was a surprise. GitHub's new code | search has been out for 14 months now. | | OK, hope this puts an internet rumor to rest! | pigtailgirl wrote: | -- was looking at their glassdoor last night - one of the worst | ive seen in tech so far -- | icelancer wrote: | Seems fine to me? I was expecting a bloodbath but the reviews | are all pretty balanced, with Yegge getting a pretty good | amount of praise. | mxstbr wrote: | I don't think Sourcegraph is in big trouble here. Their whole | play is enterprises, who likely have code spread across many | different hosts. On top of that, their code search is still | miles ahead of GitHub's. | the_mitsuhiko wrote: | The new code search includes private repos. Miles ahead | doesn't matter if you get the search at GitHub for free. | GitHub actions started out pretty terrible but they are now | dominating hosted CI. | | SourceGraph likely has challenging times ahead considering | the valuation. | sqs wrote: | Sourcegraph CEO here. "Challenging times" is how it should | be all the time. That means competition is forcing both | GitHub and us to build better stuff for devs. Devs win. | | But to be clear, as a company we are doing well and growing | nicely inside customers, with a ton of cash in the bank, an | awesome team, and a huge opportunity ahead of us. GitHub's | new code search has been out for 14 months now, so this is | nothing new. | | It's a big market and there's way more room for | differentiation and dev choice in code search/intelligence | than in CI. 
There's a lot of code intelligence that GitHub | won't support (precise code nav for more languages, | comprehensive code ownership, metadata from other dev tools | that know things about code outside the GitHub/Microsoft | suite, etc.), there's a need for the ability to fix (with | our Batch Changes) not just find, and even in the core | search workflow there's so much room for improvement with | AI fine-tuned on your own code, etc. | | But talk is cheap and only shipping matters. So, watch what | we ship, and send any feedback and requests our way! | tooltitude wrote: | It's very useful for enterprises to have one solution like | github. Github isn't the best in every fields, but overall | it's the best offer (they have git hosting, continuous | integration, bug tracking, discussion forums, hosted dev | environment, and now code search). It reminds me of a | bundling strategy used in MS Office. | slimsag wrote: | I've worked alongside the CEO/CTO of Sourcegraph for the past | 8 years, everyone else is at our company offsite so I figured | I'd chime in :) nobody asked me to write this (nor did I ask) | :) | | The article is a top-notch technical write-up, the devs on | GitHub code search should be proud of what they've achieved | so far! | | Honestly, we're rooting for GitHub to improve their code | search, viewing them as a close peer-not a competitor. We | also maintain OSS projects like Zoekt, which IIRC GitLab is | maybe looking at using for their own. The more devs that | 'get' code search, the better off Sourcegraph is frankly! | | GitHub has a nice intuitive/simple UX, we could learn a thing | or two there (though, easier to do with less features.) | | Still, Sourcegraph search tech is quite a bit more powerful: | | * Searching over commit messages, diffs, filename, etc. 
are | super nice for tracking down regressions / finding 'that PR I | swear my coworker made' | | * Expressiveness like "find this regexp in repositories, but | only if the repo has had a commit in the last month AND has a | file named package.json in its root" | | * Since Steve Yegge joined us, we've started thinking about | _ranking_ of search results, a notoriously difficult thing to | do well in code search unless you have great factors to rank | on (e.g. a semantic understanding of code): | https://about.sourcegraph.com/blog/new-search-ranking | | * We stream results back, so you can get a comprehensive set | of results - not just a few pages, from our API. | | * Works in GitHub Enterprise, not just GitHub.com. Plus on | all your code hosts, think BitBucket, GitLab, Azure DevOps, | Gerrit, Phabricator, etc. and even non-Git VCS like Perforce. | | * Respects permissions of all your code hosts (a very | difficult problem, as there are no official APIs to query | this info from code hosts in general) | | Having code search is one thing, but using it is another: | | * Code Insights (we use search as an API to gather statistics | about code, track code quality, keywords, etc. both over time | and retroactively and let you build dashboards) | | * Batch changes (find+replace, but over thousands of | repositories. Run a Docker container per repo, run your | custom linter script etc. and then draft or send PRs to | thousands of repos, manage/track campaigns with thousands of | PRs like that over time, etc.) | | * Precise code intel / semantic awareness of code, we use | SCIP indexers for this (spiritual successor to Microsoft's | LSIF format for indexing LSP servers.) | | I am super happy GitHub continues to push their code search | effort, and genuinely believe it's a great thing for all | developers and us over at Sourcegraph. 
Also excited to see | when they do their public rollout of this :) | | Anyway, that's just my take as someone who works there-other | Sourcegraphers will chime in later if anything I said above | feels off to them I'm sure :) | zellyn wrote: | Thanks for the information! I came to the hackernews | discussion because SourceGraph was so curiously absent from | the article and the "brief history of codesearch" article | it linked to, and I wanted to see what others thought! | john_cogs wrote: | GitLab team member here. | | We are looking at Zoekt for code search: | https://gitlab.com/groups/gitlab-org/-/epics/9404 | imemyself wrote: | Sourcegraph has become pretty obscenely expensive in the last | couple years (with borderline hostile sales folks to boot). I | know at least one company who would love to be able to cancel | that contract if GH search is "good enough" now. | sqs wrote: | Sourcegraph CEO here. We definitely need a cheaper tier for | smaller companies or those who don't need our entire | feature set. I agree! What do you think that should be? | | Overall, we're building what our customers need, and our | product goes way beyond what GitHub can offer. Sourcegraph | indexes all the code and increasingly all the code | intelligence (including code nav but also code ownership | and other metadata in the future from your other dev | tools). We charge based on active usage, so we make money | when devs at customers /choose/ to use us over the | alternatives. We're trying to do this the right way, and | tons of customers agree. (If anyone reading this disagrees, | please let me know!) | | Re: your comment about our sales team, I'm really sorry to | hear that and want to understand more so I can fix the | problem. Can you please email me at sqs@sourcegraph.com? | [deleted] | kjuulh wrote: | I really like the new search. Though sometimes it is a bit | deceptive. I.e. 
when you search for a function name by clicking on | a piece of code, you can suddenly land in an entirely different | code base, in an unrelated function that merely shares the name. | | It feels like GitHub code browsing is a step between a full | editor with LSP and a static site. I hope they work out the kinks | and make it smoother. | kibwen wrote: | Supporting jump-to-definition natively seems like something | that will be table stakes for any code hosting site in the | future. | dymk wrote: | I think that there should be some sort of standardized, | language-agnostic metadata format for semantically indexing a | codebase. It could include e.g. type information for | expressions or declared variables (for languages that infer | types), and an index of symbols and how they're connected. | | This metadata file would be generated by a language-specific | tool. For instance, Cargo could generate it for Rust | projects, ctags/cmake for C / C++, Sorbet for Ruby. | | Then a service like GitHub / GitLab / your own homegrown code | viewer could provide things like "Show type", "Jump to | source" without ever needing to build a language-specific | parser or interpreter (which seems arbitrarily difficult, | most build systems provide escape hatches, so you can't | assume much about project structure). | | Basically, LSP as a static file in a standard format that | tools can read to understand a codebase, without needing to | model the language's semantics. | maxov wrote: | SemanticDB | (https://scalameta.org/docs/semanticdb/guide.html) is a | protobuf-based file format that does almost exactly this | for JVM languages, primarily Scala (I was a contributor a | while back). It is used to build an intelligent online code | browser, as the backend for a language server, and to do | intelligent refactorings. | | I think a language-agnostic semantic metadata format is a | good idea, but requires a lot of compromise. 
ctags | partially does this, but only to a very coarse level | (mostly definitions and references). I think some ctags | implementations also define 'extension fields' that could | be used to give type information, but I don't know how/if | these are used in practice. SemanticDB is extremely fine- | grained, but highly specialized to JVM languages and type | systems that are designed to work with the JVM. Finding a | common set of semantic features that can be used across | languages and type systems that is fine-grained enough to | be more useful than ctags sounds very difficult to me. | dymk wrote: | I think simple things like "go to reference" or "show | type" would be sufficient for 95% of usecases. But if you | split languages up into a few different categories (maybe | along the lines of Algol-like vs Lisp-like), and were | flexible with extensions, I'd imagine we'd see some | common patterns emerging, and clients would take | advantage of that. Best effort is probably good enough to | greatly improve the ergonomics of search. | slimsag wrote: | This is pretty much exactly what we've built at | Sourcegraph. Microsoft had introduced (but pretty much | abandoned before it even started) LSIF, a static index | format for LSP servers which encodes in detail all possible | LSP requests/responses, effectively. | | We took that torch and carried it forward, building the | spiritual successor called SCIP[0]. It's language agnostic, | we have indexers for quite a few languages already, and we | genuinely intend for it to be vendor neutral / a proper OSS | project[1]. | | [0] https://about.sourcegraph.com/blog/announcing-scip | | [1] https://github.com/sourcegraph/scip | jeffbee wrote: | I can only think of one hosted repo service that provides a | working go-to-definition feature and it is not github or | sourcegraph. What makes you think this will suddenly become | widespread? Github spent years doing this and their new thing | is strictly non-semantic. 
It doesn't have the faintest idea | where the name is defined, or if there's even a difference | between a function name, a parameter name, or a word in a | comment. | saagarjha wrote: | I can think of zero. Doing go-to-definition statically is | very difficult. | slimsag wrote: | FWIW Sourcegraph has fully precise/semantic go-to- | definition, find-references, etc. We use SCIP code indexers | (a spiritual successor to LSIF, the Microsoft standard for | indexing LSP servers) | jeffbee wrote: | Not for C++. To test my recollection I navigated to | abseil-cpp/strings/str_split.h, clicked on the | declaration of absl::ByString::Find, and clicked "Go to | definition". I was presented with every function in | Abseil named "Find" regardless of its scope or parameter | types. That's not "precise code intelligence"! | slimsag wrote: | In the top right corner of the tooltip it will say either | "Search-based" or "Precise" - in this case, you're right, | we don't have the abseil-cpp repo indexed so it falls | back to search-based as you describe. | | We do have a C++ code indexer in beta, | https://github.com/sourcegraph/lsif-clang - it is based | on clang but C++ indexing is notably harder to do | automatically/without-setup due to the varying build | systems that need to be understood in order to invoke the | compiler. | tjoff wrote: | The rise and popularity of LSP and projects such as | treesitter are a superb foundation for features such as | this. Both support a wealth of languages, it is still and | will be quite hard to assume the toolchain and settings | required for producing accurate information though. | | But this could be tied into CI, especially for projects | utilizing runners for building the code. | | So the barrier to entry now is orders of magnitude less and | with github and others pushing for codespaces it could be | the final piece to tie everything together. 
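A minimal sketch of the "static index file" idea discussed in this subthread, assuming Python as the indexed language and using the standard library's ast module as the language-specific tool. The function name build_symbol_index and the JSON layout are hypothetical, chosen for illustration; this is not the SCIP, LSIF, or SemanticDB format. It also shows the limitation raised above: a name-only index returns every definition called "find", regardless of scope.

```python
import ast
import json


def build_symbol_index(path, source):
    """Map each function or class name to the places it is defined.

    Hypothetical layout: {name: [{"path": ..., "line": ...}, ...]}.
    """
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            index.setdefault(node.name, []).append({"path": path, "line": node.lineno})
    # Sort each entry list by line so the output is deterministic.
    for entries in index.values():
        entries.sort(key=lambda entry: entry["line"])
    return index


# Two unrelated definitions share the name "find": a method and a function.
source = (
    "class Finder:\n"
    "    def find(self, text):\n"
    "        return text\n"
    "\n"
    "def find(text):\n"
    "    return text\n"
)

index = build_symbol_index("example.py", source)

# The serialized form is what a CI job could publish for a code viewer to read.
index_json = json.dumps(index, sort_keys=True)
```

A viewer could answer "jump to definition" for a name by reading index_json alone, but telling Finder.find apart from the module-level find would need scope and type information that this sketch does not record.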
| hobofan wrote: | > It doesn't have the faintest idea where the name is | defined, or if there's even a difference between a function | name, a parameter name, or a word in a comment. | | I don't think what you are saying is actually true for | stack-graphs[0][1]. | | [0]: https://github.com/github/stack-graphs | | [1]: https://github.blog/2021-12-09-introducing-stack-graphs/ | solarkraft wrote: | Damn, it's about time, the current search sucks. What I have | found to work very well is SourceGraph; they offer search for | public repos. Maybe this'll be an alternative to it. | saagarjha wrote: | I've been using the new code search for a couple of months and I | like it, but the UI is kind of antagonistic to how I typically | want to search for things. For one, the new experience doesn't | actually load code onto the page, it does some sort of lazy | loading thing as you scroll around, so ⌘F doesn't work. I | understand that there's a custom search box to try to get around | this but it's pretty slow and fiddly and I don't really want to | use it. I also find the layout to be pretty annoying, because | invariably there's a symbol panel on the side that doesn't work | for the code I want to look at, and then it's just there taking | space. If I hit "t" to enter a file name and start typing the | text field loses focus after a second and I need to click on it | again. I know there are a couple of people on the team in this | thread: I search a lot of code on GitHub and I feel like there's | a couple of tweaks that would greatly improve my experience. | Like, I think I could even show you a video of all the places | where the UI has gotten less usable for me. What would be the | best way to get this feedback to you? I've posted stuff on the | forum or whatever but it's unclear to me if this is the intended | way to raise issues. | colin353 wrote: | Hey saagarjha, thanks for the feedback.
It's our goal to make | the experience as good as possible, and we're aware of | shortcomings with cmd+F and `t`, among other things. We're | working on it, and your feedback helps us a lot. | | We read all the feedback on the forum here: | https://github.com/orgs/community/discussions/38692, so please | keep providing it. Videos and screenshots are super helpful | too. Thanks for bearing with us as we continue to polish the | UX! | thinking001001 wrote: | Hmm not sure if I should delete my (2nd) Github account again, | just thinking about how much data they are getting from users, it | could become the Facebook of Git. | drcongo wrote: | With current search, I can search [0] the Django repo for a class | that definitely exists [1] in Django, there are 0 code results. | Zero. GitHub search is mystifyingly bad, I hope this is a LOT | better. | | [0] | https://github.com/django/django/search?q=DeleteView&type=co... | | [1] | https://github.com/django/django/blob/main/django/views/gene... | MapleWalnut wrote: | With the new GitHub code search, the first result is the | DeleteView class: | https://github.com/django/django/blob/bd366ca2aeffa869b7dbc0... | jd3 wrote: | I was working on a research project awhile ago and every time I | searched for something particular it immediately thought I was a | bot after like 2-3 particular/exact queries. | | Ever since then, I've exclusively used sourcegraph. | ezekg wrote: | I use their new code search a lot to grok how people use certain | features, or implement certain things. But I do wish there was a | way to filter out forks. Sometimes I search a string and just get | a bunch of forks all with the same result. For example, searching | a common class in a Rails app often just shows a bunch of | rails/rails forks, which is a lot of noise to sift through when | you're trying to see how devs commonly use a certain feature. | 100k wrote: | Thanks for the feedback! 
That's coming, we've been prioritizing | scaling the index and ingest process and haven't had a chance | to add that yet. There are a bunch of value-add features like | this I am looking forward to knocking out soon. | BaculumMeumEst wrote: | thanks so much, you guys are doing amazing work | kowlo wrote: | Well done | orf wrote: | Out of interest, if I have a repo with many millions of files | that compress quite nicely down to about a 1.4 GB packfile, is | it better for the ingestion and/or indexer if I break this | down into many smaller pushes or one large push? | | Because I pushed such a repo yesterday and it's still not | been indexed. | 100k wrote: | Sorry, there are limits on the number of files in a | repository that we'll index. That is probably why your | repository wasn't indexed. Out of curiosity, what are you | storing in this repository? | orf wrote: | Sequences of code from Python package releases from PyPI, | as an experiment I'm working on. They compress quite | nicely as the deltas between releases are fairly small. | | I thought it would be nice to be able to search through | them, but I guess the file limit was reached. That's a | shame. | simonw wrote: | As an open source library maintainer I've been using it for | that too - it's fantastic for answering questions like "is | anyone using this API method I wrote?" and "how much of a mess | would it likely cause people if I deprecated this function?" | chatmasta wrote: | Perhaps it could benefit from something like a "dissimilarity" | filter, which ranks the current result set by returning the | most unique hits first. You wouldn't always want this, because | sometimes you're searching for how something is typically used, | and with many duplicative results you can confirm that's the | preferred pattern.
But other times you're looking for more | esoteric usages of a certain function, and it would be nice to | filter the "standard usage" from the results (although you can | already do this with carefully chosen negated keywords). | | Personally, I'm happy with the new code search so far. I | stopped using Sourcegraph because I could never get the deeper | results I wanted - it would just return the top five | repositories including some common code snippet and I couldn't | explore further than that. GitHub Code Search doesn't seem to | have this problem to such a degree, since I can use negation | more naturally, and since my query is not limited to some | shallow subset of the corpus before refining it. | sqs wrote: | Re: Sourcegraph, we're working on improving that, and sorry | you couldn't get the results you wanted. We primarily build | for the code within customers, where this particular problem | is less common than across all open-source repositories. But | we want it to work really well in every case. | | Our new ranking (https://about.sourcegraph.com/blog/new-search-ranking) | should help a lot here, and it's live on | https://sourcegraph.com. Can you share some of the queries | you tried so we can see how much ranking helps and how to | handle them better? | chatmasta wrote: | Sure! I will try it next time I'm using code search. I | can't remember the specific queries I've tried, but I only | ever used it when looking for "how have people used this | function," so I remember the limited depth of results being | particularly annoying. | | Thanks for the link to your blog post. I didn't realize | Steve Yegge had joined your team - congratulations, that's | quite an endorsement! | | I've always liked Sourcegraph just because our company | names are so similar (founder of Splitgraph here)... we | might have even gotten a few candidates because of that | initial confusion. :) | | How has GitHub Code Search impacted your product direction?
| Do you see it as an opportunity to focus more on the | internal use case, or do you have plans for some other | differentiation? It's always unfortunate when a big company | introduces a product so similar to the core product of a | startup, but I'm sure there is a silver lining there, | especially when you have a talented team and a mature | codebase (for example, Fly.io has been able to carve out a | niche for itself despite Cloudflare moving to compete in | the same areas). Either way, best of luck to you from a | fellow S-grapher! | mnutt wrote: | I run into this a lot where I'd like to see some real world use | cases of a method in FooClass, but get a hundred pages of forks | of the FooClass.h header file. In some cases I've been | successful adding "-filename:FooClass.h" to filter out that | header file. If the method is used a lot in the primary project | it can be a game of whack-a-mole but it often eventually works. | ljm wrote: | But also, you know what you're looking for and you can filter | the results. | | Someone less experienced than you may take any of those results | as gospel. Gospel the same way StackOverflow could be, but this | time by a source who will do its best to tell you what you | want. | | My personal rule about AI is about the same as the rule about | being a lawyer. Never ask a question you don't know the answer | to. | pohuing wrote: | I think forks without any changes shouldn't really show up in | search tbh. | qabqabaca wrote: | Sourcegraph[1] does this better and has done for a couple of | years now. I use it for this reason all the time. | | [1] https://sourcegraph.com | Arnavion wrote: | Yes, just change the URL from https://github.com/foo/bar to | https://sourcegraph.com/github.com/foo/bar to be dropped into | a code search for that GH repo. | Scaevolus wrote: | As a comparison to Sourcegraph: Sourcegraph shards and indexes a | repository at a time, and uses trigrams and bloom filters (to | skip shards).
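A minimal sketch of the trigram indexing mentioned in this comment, assuming a tiny in-memory corpus. The class and function names (TrigramIndex, trigrams) are hypothetical and this is neither Sourcegraph's nor GitHub's implementation; bloom filters, sharding, and variable-length ngrams are left out. The index maps every 3-character substring to the documents containing it, intersects the posting sets for the query's trigrams to get candidates, then verifies each candidate with a real substring check.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of docs containing it
        self.docs = {}                    # doc id -> full content

    def add(self, doc_id, content):
        self.docs[doc_id] = content
        for gram in trigrams(content):
            self.postings[gram].add(doc_id)

    def search(self, query):
        grams = trigrams(query)
        if grams:
            # A match must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        else:
            # Queries shorter than 3 characters have no trigrams: scan everything.
            candidates = set(self.docs)
        # Sharing all trigrams is necessary but not sufficient, so verify.
        return sorted(d for d in candidates if query in self.docs[d])


index = TrigramIndex()
index.add("a.py", "def parse_args(argv):")
index.add("b.py", "args = parse(argv)")
```

The final verification step matters: b.py contains the trigrams "par", "ars", "rse" and "arg" but not the string "parse_args", so only the intersection over all of the query's trigrams plus the substring check rules it out.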
| | GitHub shards and indexes individual files according to their | hashes. It also uses variable length ngrams (neat!). This makes | horizontal scaling simpler, but also means more of the index | needs to be scanned for org/repo-scoped queries ("Due to our | sharding strategy, a query request must be sent to each shard in | the cluster."). | colin353 wrote: | Hey everyone, I'm Colin from GitHub's code search team: happy to | answer any questions people have about it. Also, you can sign up | to get access here: https://github.com/features/code-search | kirillbobyrev wrote: | This is exciting! I see a lot of familiar pieces here that | propagated from Google's Code Search and I know a few people from | Code Search went to GitHub, probably specifically to work on | this. I always wondered why GitHub didn't invest in decent | code search features, but I'm happy it finally gets to the | State of the Art one step at a time. Some of the folks going to | GitHub to work on this I know are just incredible and I have no | doubt GitHub's code search will be amazing. | | I also worked on something similar to the search engine that is | described here for the purposes of making auto-complete fast for | C++ in Clangd. That was my intern project back in 2018 and it was | very successful in reducing the delays and latencies in the | auto-complete pipeline. That project was a lot of fun and was also | based on Russ Cox's original Google Code Search trigram index. My | implementation of the index is still largely untouched and is a | hot path of Clangd. I made a huge effort to document it as much | as I can and the code is, I believe, very readable (although I'm | obviously very biased because I spent a lot of time with it). | | Here is the implementation: | | https://github.com/llvm/llvm-project/tree/main/clang-tools-e... | | I also wrote a...
very long design document about how exactly | this works, so if you're interested in understanding the | internals of a code search engine, you can check it out: | | https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG... | ZephyrBlu wrote: | I really hope they release this soon and that it's actually good. | | The current search sucks ass, you can't find anything. | | I was trying to search for something in the WebKit source the | other day and I had to use Sourcegraph because the GitHub search | gave me zero results. | 100k wrote: | You can try it right now in our beta. Right now, we are | onboarding all signups daily. Sign up here: | https://github.com/features/code-search | | Obvious disclaimer because I work on this (and I also worked on | the legacy search): It is way better. | ZephyrBlu wrote: | I just signed up to the beta. Looking forwards to trying it | out :). | Daffodils wrote: | Was looking for more details on the data structure 'Geometric | filter' mentioned in the footnotes. Couldn't find anything (a few | unrelated papers in object recognition aside). If anybody can | share anything that would be great ! | 100k wrote: | We hope to share more on this soon! | simonw wrote: | I really appreciate that this includes details about how search | permissions work - how they ensure that search results include | data from my private repos. | | I'd always wondered how they implemented that: it turns out they | add extra internal filters to their searches along the lines of | "RepoIDs(...) or PublicRepo". | | Question for the team: Do you have an additional permission check | in the view layer before the results are shown to the end-user? I | worry that if I switch a repo from public to private it may take | a while for the code search index to catch up to the new | permissions. 
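A minimal sketch of the permission filter described in this comment, where a search is narrowed with a clause along the lines of "RepoIDs(...) or PublicRepo". The Doc type and the visible_to and search functions are hypothetical names for illustration, and the linear scan stands in for a real index; this is not GitHub's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    repo_id: int
    public: bool
    text: str


def visible_to(doc, allowed_repo_ids):
    """Permission clause: the repo is one the user can read, or it is public."""
    return doc.public or doc.repo_id in allowed_repo_ids


def search(docs, term, allowed_repo_ids):
    """Return docs containing term, restricted to what the user may see."""
    allowed = frozenset(allowed_repo_ids)
    return [d for d in docs if term in d.text and visible_to(d, allowed)]


docs = [
    Doc(repo_id=1, public=True, text="class DeleteView: ..."),
    Doc(repo_id=2, public=False, text="class DeleteView(Base): ..."),
    Doc(repo_id=3, public=False, text="class DeleteView(Other): ..."),
]

# A user with access to private repo 2 sees it alongside the public result.
results = search(docs, "DeleteView", allowed_repo_ids=[2])
```

Because the filter runs against whatever the index currently holds, a repo that was switched from public to private recently could still carry a stale public flag, which is the gap the question above is about.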
| tclem wrote: | Yes, we never fully trust the search index so before anything | is displayed to the user there are a number of final checks | performed to make sure you're actually allowed to see that | content. | | Another fun example is that your SSO session might have | expired. While you technically have access to view the result, | we can't show it until you go through the refresh dance to get | another valid token. | loginatnine wrote: | I'm curious if they'll open source Blackbird, it does not seem | mentioned in the post. | robertlagrant wrote: | Not to diminish this excellent work, but: | | 1) I never want to search all repos globally. At worst I want to | search all of my org's repos. | | 2) the search UI is a little clunky, in a way I'd need to be | using it again to remember. | | Between those two I think there's loads of progress to be made | outside of raw search power. Of course it's nice to have that, | but that's what I'm really after. | aeyes wrote: | I use global search from time to time to see how other projects | use certain libraries. When the documentation of said libraries | is sparse this can sometimes be a good timesaver. | robertlagrant wrote: | The other big thing that works well is being able to jump | directly into the source of an open source library from your | code. That is powerful, but again, possibly doesn't need a | giant ultra search. Just some clever linking. | PaulHoule wrote: | My beef with GitHub's code search is that it doesn't distinguish | between the definition of a symbol and the uses of the symbol, so | you need to wade through 5 pages of results to get the one result | you're looking for. I would contrast that to my IDE which usually | scores a direct hit if I enter a search in the right box. | | The indexing they talk about in that article seems like | rearranging the deck chairs on the Titanic so far as that is | concerned. | DontchaKnowit wrote: | Thank you. 
This is SUCH an obvious feature that seems super | trivial to implement. ___________________________________________________________________ (page generated 2023-02-06 23:00 UTC)