[HN Gopher] Code Search at Google: Han-Wen and Zoekt
___________________________________________________________________
Author : intrepidsoldier
Score  : 94 points
Date   : 2023-11-21 17:10 UTC (5 hours ago)

(HTM) web link (sourcegraph.com)

| jeffbee wrote:
| Is Zoekt actually in use at Google and if so how does it relate
| to Kythe? I know the Zoekt instance for Bazel exists, but the
| Kythe index also exists
| (https://cs.opensource.google/bazel/bazel)
| dmoy wrote:
| It has nothing to do with Kythe.
|
| I'm on the Kythe team, and I don't know off the top of my head
| what Zoekt is. Looking it up, I see it's some sort of trigram
| search, which means if it's used at all (I have no idea), it's
| codesearch proper, not Kythe.
|
| The Kythe index is the semantic index of the codebase;
| Codesearch does all of the text/regex/etc searching.
| sluongng wrote:
| Are you sure? There is find definition and references in
| https://cs.opensource.google/bazel/bazel and I'm quite sure
| it's thanks to the Kythe indexing job the Bazel team is running
| in CI.
| dmoy wrote:
| The refs & jump to def in bazel/bazel are using Kythe, yes.
| But that is Kythe's semantic index from running (also it's
| the Kythe team running it, not the Bazel team). It's not the
| Codesearch trigram/text search (which again, I have no idea
| if it uses zoekt).
| hanwenn wrote:
| Not in use that I know
| frutiger wrote:
| I'm a bit confused as to how
| https://swtch.com/~rsc/regexp/regexp4.html isn't mentioned at
| all.
| beyang wrote:
| Zoekt was heavily inspired by Google's internal code search, as
| mentioned in the blog post. The original version of the
| internal code search is described in the rsc post. Zoekt keeps
| some of the foundational ideas (e.g., trigram index), but was a
| from-scratch implementation. We probably should link to the rsc
| post for completeness, will update.
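For readers unfamiliar with the "trigram index" mentioned in this subthread, here is a minimal Python sketch of the general idea: map every 3-character substring to the documents that contain it, intersect those sets to get candidates for a query, then verify each candidate with a real substring check. This is only an illustration under simplifying assumptions (literal substring queries, in-memory sets). The names `TrigramIndex`, `add`, and `search` are hypothetical, and this is not the code of Zoekt or of Google's internal code search.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy index mapping trigram -> set of document IDs (hypothetical names)."""

    def __init__(self):
        self.docs = {}                    # doc_id -> full text, kept for verification
        self.postings = defaultdict(set)  # trigram -> IDs of documents containing it

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for tri in trigrams(text):
            self.postings[tri].add(doc_id)

    def search(self, query):
        """Return sorted IDs of documents that contain query as a substring."""
        tris = trigrams(query)
        if not tris:
            # Queries shorter than 3 characters have no trigrams: scan every document.
            candidates = set(self.docs)
        else:
            # A matching document must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(t, set()) for t in tris)
            )
        # Sharing all trigrams is necessary but not sufficient, so verify each candidate.
        return sorted(d for d in candidates if query in self.docs[d])
```

For example, a document containing "abcd bcde" has every trigram of the query "abcde" yet does not contain it, so it survives the intersection and is only rejected by the final verification step.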
| hanwenn wrote:
| At the time that I started Zoekt (2016), Google's internal
| codesearch used suffix arrays for the string matching, which
| the team wasn't happy with, presumably because of the
| algorithmic complexity and indexing slowness. The Codesearch
| team was exploring alternatives, one of them the technique
| described in
| https://link.springer.com/article/10.1007/s11390-016-1618-6.
| The positional trigrams were a simplification of this, which
| they didn't mind me open sourcing.
|
| So, in terms of algorithms, Zoekt wasn't actually inspired by
| Google's internal code search.
|
| The precise query syntax of zoekt is mostly copied from
| Google's internal syntax, though.
| IshKebab wrote:
| It is mentioned.
| frutiger wrote:
| It is now, but wasn't earlier.
| hanwenn wrote:
| Russ Cox's trigram approach uses document IDs for the posting
| list, which makes the index much smaller, but gives less
| precise (i.e. slower) matching. This is mentioned in the design
| doc at
| https://github.com/sourcegraph/zoekt/blob/main/doc/design.md....
| j2kun wrote:
| IIUC, the main thing that Google's internal codesearch does that
| makes it superior to external systems (outside of an IDE, like
| GitHub code search) is that Google actually builds everything,
| and so it can incorporate that information into its index.
| There's only so much text search can do when you have macros
| generating code.
| dmoy wrote:
| Yea that would be Kythe. We build almost everything, across
| 44-45 different programming languages, and postprocess that
| into a giant semantic graph.
|
| Most major parts are open sourced at kythe.io, and there's a
| somewhat dated talk given by Luke here:
| https://youtu.be/VYI3ji8aSM0
| dmoy wrote:
| > macros
|
| Corollary: while we can do a lot with indexing generated code
| (even cross language) in Kythe, there are limits. Macros may
| be one, I forget atm
| sa46 wrote:
| Do you have any case studies or success stories for non-Google
| repos?
| I miss code search but I'm not sure how close Kythe is to
| code-search-in-a-box.
| dmoy wrote:
| Internally, we use variants of our pipeline to index a
| variety of open source repos, and some non-blaze/bazel
| internal repos. Those are often non-Google repos. But we're
| using some internal postprocessing and serving logic to
| actually create and host the final index.
|
| Unfortunately I don't know if there's any significant use
| of Kythe outside of Google. We get a handful of questions
| on the open source repo from time to time, but that's all I
| know about.
| beyang wrote:
| Great call out! We've built this code navigation infra on top
| of Zoekt into Sourcegraph. Example:
| https://sourcegraph.com/github.com/golang/go/-/blob/src/net/...
|
| Docs:
| https://docs.sourcegraph.com/code_navigation/explanations/pr...
| tromp wrote:
| Wondering how this tool got named after the Dutch verb for seek,
| I found this quote on its github page [1].
|
| > "Zoekt, en gij zult spinazie eten" - Jan Eertink
|
| > ("seek, and ye shall eat spinach" - My primary school teacher)
|
| [1] https://github.com/sourcegraph/zoekt
| JohnMakin wrote:
| I didn't start my tech journey til late 00's, so it's constantly
| surprising to me that something as ubiquitous as git only came
| out in _2005_.
|
| Is it possible at all this story helped spur the widespread
| adoption of git (the early implementation of this tool)?
| jeffbee wrote:
| I think it is odd that the story mentions git at all. Git5, the
| mentioned wrapper around piper, had only a niche audience when
| I last used it 5 years ago, and it was a demonstrated fact that
| the users of it were less productive than perforce users.
| Whether that was causal or not was unknown.
| hanwenn wrote:
| Hi, I'm the Han-Wen from the title.
|
| The story mentions git because git5 got me into developer
| tooling. In particular, it put me in touch with Shawn
| Pearce who ran the Git/Gerrit team at Google.
| When I went to work for him, Shawn wanted to have codesearch
| support in Gerrit, and Zoekt was ultimately the outcome of my
| explorations in this space.
|
| IIRC, Git5 was deleted approximately 5 years ago because Fig
| (the Hg-based replacement) had taken over all the use cases.
| billllll wrote:
| I agree there doesn't seem to be a good connection between
| work on version control and work on code search.
|
| However, I don't think it makes sense to downplay git5.
| Anecdotally, basically everybody knew about it, and I'd
| constantly run into people using it (which is by itself
| noteworthy since nobody was exactly talking about version
| control all the time).
|
| Git5 was at the time the most robust solution to chain
| commits, which was tedious bordering on impossible without
| some tool. Without definitive data, I wouldn't say users were
| less productive with git5: it definitely was a useful tool
| that people at least recommended for chain commits. I was
| definitely more productive with it.
|
| There were a lot of footguns though, and I do think the hg
| wrapper that superseded it was way better.
| ajross wrote:
| Git stepped into a source control ecosystem that was
| well-served (albeit contentious). People knew (or at least
| thought they knew) what they wanted from
| bk/svn/CVS/p4/rcs/sccs.
|
| So git essentially was the "final form" that integrated all the
| various workflows and topped it off with a maximally-scaled use
| case (linux) that proved out the tool, drove innovation in
| integration/scripting/gadgetry, and provided a clear beacon for
| everyone else to adopt it. So it won.
|
| But in 2004, if you asked around, everyone would have told you
| that a tool like this was coming at some point (even if they
| probably wouldn't have described it as very git-like!).
| pgeorgi wrote:
| If you squint a little,
| https://web.archive.org/web/20030629114010/http://www.venge....
| is a fair approximation of some of the core ideas behind git
| (and Linus played with it and wrote a critique of its
| shortcomings before starting git)
| mettamage wrote:
| That is so cool to see where Linus got some of his
| inspiration from. It made a few things more clear to me as
| to why git uses certain things.
| justrealist wrote:
| Oh yeah. I remember merging SVN branches into production in
| 2010 or so.
|
| It was a... special time. Let's not reminisce.
| dekhn wrote:
| Before git, most people in my larger circle used RCS, a UNIX
| version control system from the early 80's. It was very limited
| (basically each file had its own side-file that contained
| revision data, and there was no project-wide file) but did its
| job. Many people moved over to VCS, which used RCS files but
| added project-wide files so you could manage a dir tree.
|
| After that, I think many people moved to subversion, which had
| a lot more functionality for distributed VC, for example there
| was a server. svn was popular for a while but building it was
| painful (due to berkeley db) and it sort of never grew. I
| invested a lot of time in it (specifically apache with mod_dav
| and mod_dav_svn) but lost interest in VC after fighting with
| subversion.
|
| git came along and from what I can tell it mainly had "it's by
| linus, and the kernel uses it" and "it's fast" and "something
| about reflogs". I use git day-to-day but I still can't explain
| how git became so ubiquitous; I find using it outright painful.
| dws wrote:
| Lightweight branches were a huge selling point. If you didn't
| do them often enough that they were rote, branches in
| RCS/CVS/SVN required ritual sacrifice.
| reportingsjr wrote:
| Mercurial (aka hg) was also gaining popularity at the same
| time as git. The interface was a lot nicer and more sane than
| git, but it had some serious performance limitations that
| hampered it.
|
| Both were definitely way better than SVN/CVS/etc.
| dekhn wrote:
| Yes, after using git for a few years I was introduced to
| Mercurial and it was like a breath of fresh air, although
| I'm also told hg added a number of things that made it much
| more usable, "right before I started using it".
|
| Since I have limited brain capacity I focus my efforts on
| being able to use git, not hg, merely because it has so
| much market share.
| cpach wrote:
| Nit-pick: Did you mean CVS?
| dekhn wrote:
| Yes, CVS.
| IshKebab wrote:
| The same algorithm is also used in Hound
| (https://github.com/hound-search/hound) though I have to say the
| best implementation of code search by far that I've seen is
| https://grep.app
|
| You really should check it out if you haven't already. It's
| incredibly useful; I used it all the time. Not open source
| though.
___________________________________________________________________
(page generated 2023-11-21 23:00 UTC)