[HN Gopher] Code Search at Google: Han-Wen and Zoekt
       ___________________________________________________________________
        
       Code Search at Google: Han-Wen and Zoekt
        
       Author : intrepidsoldier
       Score  : 94 points
       Date   : 2023-11-21 17:10 UTC (5 hours ago)
        
 (HTM) web link (sourcegraph.com)
 (TXT) w3m dump (sourcegraph.com)
        
       | jeffbee wrote:
       | Is Zoekt actually in use at Google and if so how does it relate
       | to Kythe? I know the Zoekt instance for Bazel exists, but the
       | Kythe index also exists
       | (https://cs.opensource.google/bazel/bazel)
        
         | dmoy wrote:
         | It has nothing to do with Kythe.
         | 
         | I'm on the Kythe team, and I don't know off the top of my head
         | what Zoekt is. Looking it up, I see it's some sort of trigram
         | search, which means if it's used at all (I have no idea), it's
         | codesearch proper, not Kythe.
         | 
         | The Kythe index is the semantic index of the codebase,
         | Codesearch does all of the text/regex/etc searching.
        
           | sluongng wrote:
           | Are you sure? There is find definition and references in
           | https://cs.opensource.google/bazel/bazel and I'm quite sure
           | it's thanks to the Kythe indexing job Bazel team is running
           | in CI.
        
             | dmoy wrote:
             | The refs & jump to def in bazel/bazel are using Kythe, yes.
             | But that is Kythe's semantic index from running (also it's
             | Kythe team running it, not bazel team). It's not the
             | Codesearch trigram/text search (which again, I have no idea
             | if it uses zoekt).
        
         | hanwenn wrote:
         | Not in use that I know
        
       | frutiger wrote:
       | I'm a bit confused as to how
       | https://swtch.com/~rsc/regexp/regexp4.html isn't mentioned at
       | all.
        
         | beyang wrote:
         | Zoekt was heavily inspired by Google's internal code search, as
         | mentioned in the blog post. The original version of the
         | internal code search is described in the rsc post. Zoekt keeps
         | some of the foundational ideas (e.g., trigram index), but was a
         | from-scratch implementation. We probably should link to the rsc
         | post for completeness, will update.
        
           | hanwenn wrote:
           | At the time that I started Zoekt (2016), Google's internal
           | codesearch used suffix arrays for the string matching, which
           | the team wasn't happy with, presumably because of the
           | algorithmic complexity and indexing slowness. The Codesearch
           | team was exploring alternatives, one of them the technique
           | described in
           | https://link.springer.com/article/10.1007/s11390-016-1618-6.
           | The positional trigrams were a simplification of this, that
           | they didn't mind me open sourcing.
           | 
           | So, in terms of algorithms, Zoekt wasn't actually inspired by
           | Google's internal code search.
           | 
           | The precise query syntax of Zoekt is mostly copied from
           | Google's internal syntax, though.
        
         | IshKebab wrote:
         | It is mentioned.
        
           | frutiger wrote:
           | It is now, but wasn't earlier.
        
         | hanwenn wrote:
         | Russ Cox's trigram approach uses document IDs for the posting
         | list, which makes the index much smaller, but gives less
         | precise (i.e., slower) matching. This is mentioned in the design
         | doc at
         | https://github.com/sourcegraph/zoekt/blob/main/doc/design.md....
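
The trade-off described in the comment above can be illustrated with a toy sketch. This is not Zoekt's or Russ Cox's actual code: the function names, the sample files, and the choice to compare only the first and last trigram of the query are assumptions made for illustration. A document-ID posting list only records that a trigram occurs somewhere in a file, so a file containing all of the query's trigrams in the wrong places still has to be scanned. A positional posting list also stores offsets, so such files can be rejected before any scan.

```python
from collections import defaultdict


def trigrams(text):
    """Yield (offset, trigram) for every 3-character window in text."""
    for i in range(len(text) - 2):
        yield i, text[i:i + 3]


def build_doc_index(docs):
    """Map each trigram to the set of document IDs containing it (no offsets)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, tri in trigrams(text):
            index[tri].add(doc_id)
    return index


def build_positional_index(docs):
    """Map each trigram to the set of (document ID, offset) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for off, tri in trigrams(text):
            index[tri].add((doc_id, off))
    return index


def candidates_by_doc(index, query):
    """Documents that contain every trigram of the query, at any position."""
    sets = [index.get(tri, set()) for _, tri in trigrams(query)]
    return set.intersection(*sets) if sets else set()


def candidates_by_position(index, query):
    """Documents where the query's first and last trigram sit the right distance apart."""
    if len(query) < 3:
        return set()
    first, last = query[:3], query[-3:]
    gap = len(query) - 3  # expected offset difference between the two trigrams
    last_hits = index.get(last, set())
    return {doc for doc, off in index.get(first, set()) if (doc, off + gap) in last_hits}


def search(docs, candidates, query):
    """Confirm each candidate with a real substring check."""
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample corpus; b.go has every trigram of "foobar" but not the string.
docs = {
    "a.go": "func foobar() {}",
    "b.go": "foob oobar",
    "c.go": "func main() {}",
}
doc_index = build_doc_index(docs)
pos_index = build_positional_index(docs)

by_doc = candidates_by_doc(doc_index, "foobar")       # a.go and b.go must be scanned
by_pos = candidates_by_position(pos_index, "foobar")  # only a.go must be scanned
results = search(docs, by_pos, "foobar")
```

Both indexes lead to the same final result after the substring check; the positional one stores more entries but leaves fewer false candidates to verify.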
        
       | j2kun wrote:
       | IIUC, the main thing that Google's internal codesearch does that
       | makes it superior to external systems (outside of an IDE, like
       | GitHub code search) is that Google actually builds everything,
       | and so it can incorporate that information into its index.
       | There's only so much text search can do when you have macros
       | generating code.
        
         | dmoy wrote:
         | Yea that would be Kythe. We build almost everything, across
         | 44-45 different programming languages, and postprocess that
         | into a giant semantic graph.
         | 
         | Most major parts are open sourced at kythe.io, and there's a
         | somewhat dated talk given by Luke here:
         | https://youtu.be/VYI3ji8aSM0
        
           | dmoy wrote:
           | > macros
           | 
           | Corollary: while we can do a lot with indexing generated code
           | (even cross language) in Kythe, there are limits. Macros may
           | be one, I forget atm
        
           | sa46 wrote:
           | Do you have any case studies or success stories for non-
           | Google repos? I miss code search but I'm not sure how close
           | Kythe is to code-search-in-a-box.
        
             | dmoy wrote:
             | Internally, we use variants of our pipeline to index a
             | variety of open source repos, and some non-blaze/bazel
             | internal repos. Those are often non-Google repos. But we're
             | using some internal postprocessing and serving logic to
             | actually create and host the final index.
             | 
             | Unfortunately I don't know if there's any significant use
             | of Kythe outside of Google. We get a handful of questions
             | on the open source repo from time to time, but that's all I
             | know about.
        
         | beyang wrote:
         | Great call out! We've built this code navigation infra on top
         | of Zoekt into Sourcegraph. Example:
         | https://sourcegraph.com/github.com/golang/go/-/blob/src/net/...
         | 
         | Docs:
         | https://docs.sourcegraph.com/code_navigation/explanations/pr...
        
       | tromp wrote:
       | Wondering how this tool got named after the Dutch verb for seek,
       | I found this quote on its github page [1].
       | 
       | > "Zoekt, en gij zult spinazie eten" - Jan Eertink
       | 
       | > ("seek, and ye shall eat spinach" - My primary school teacher)
       | 
       | [1] https://github.com/sourcegraph/zoekt
        
       | JohnMakin wrote:
       | I didn't start my tech journey til late 00's, so it's constantly
       | surprising to me that something as ubiquitous as git only came
       | out in _2005_.
       | 
       | Is it possible at all this story helped spur the widespread
       | adoption of git (the early implementation of this tool)?
        
         | jeffbee wrote:
         | I think it is odd that the story mentions git at all. Git5, the
         | mentioned wrapper around piper, had only a niche audience when
         | I last used it 5 years ago, and it was a demonstrated fact that
         | the users of it were less productive than perforce users.
         | Whether that was causal or not was unknown.
        
           | hanwenn wrote:
           | Hi, I'm the Han-Wen from the title.
           | 
           | The story mentions git because git5 got me into developer
           | tooling. More in particular, it put me in touch with Shawn
           | Pearce who ran the Git/Gerrit team at Google. When I went to
           | work for him, Shawn wanted to have codesearch support in
           | Gerrit, and Zoekt was ultimately the outcome of my
           | explorations in this space.
           | 
           | IIRC, Git5 was deleted approximately 5 years ago because Fig
           | (the Hg based replacement) had taken over all the use cases
        
           | billllll wrote:
           | I agree there doesn't seem to be a good connection between
           | work on version control and work on code search.
           | 
           | However, I don't think it makes sense to downplay git5.
           | Anecdotally, basically everybody knew about it, and I'd
           | constantly run into people using it (which is by itself
           | noteworthy since nobody was exactly talking about version
           | control all the time).
           | 
           | Git5 was at the time the most robust solution to chain
           | commits, which was tedious bordering on impossible without
           | some tool. Without definitive data, I wouldn't say users were
           | less productive with git5: it definitely was a useful tool
           | that people at least recommended for chain commits. I was
           | definitely more productive with it.
           | 
           | There were a lot of footguns though, and I do think the hg
           | wrapper that superseded it was way better.
        
         | ajross wrote:
         | Git stepped into a source control ecosystem that was well-
         | served (albeit contentious). People knew (or at least thought
         | they knew) what they wanted from bk/svn/CVS/p4/rcs/sccs.
         | 
         | So git essentially was the "final form" that integrated all the
         | various workflows and topped it off with a maximally-scaled use
         | case (linux) that proved out the tool, drove innovation in
         | integration/scripting/gadgetry, and provided a clear beacon for
         | everyone else to adopt it. So it won.
         | 
         | But in 2004, if you asked around, everyone would have told you
         | that a tool like this was coming at some point (even if they
         | probably wouldn't have described it as very git-like!).
        
           | pgeorgi wrote:
           | If you squint a little,
           | https://web.archive.org/web/20030629114010/http://www.venge....
           | is a fair approximation of some of the core ideas behind git
           | (and Linus played with it and wrote a critique of its
           | shortcomings before starting git)
        
             | mettamage wrote:
             | That is so cool to see where Linus got some of his
             | inspiration from. It made a few things more clear to me as
             | to why git uses certain things.
        
         | justrealist wrote:
         | Oh yeah. I remember merging SVN branches into production in
         | 2010 or so.
         | 
         | It was a... special time. Let's not reminisce.
        
         | dekhn wrote:
         | Before git, most people in my larger circle used RCS, a UNIX
         | version control system from the early 80's. It was very limited
         | (basically each file had its own side-file that contained
         | revision data, and there was no project-wide file) but did its
         | job. Many people moved over to VCS, which used RCS files but
         | added project-wide files so you could manage a dir tree.
         | 
         | After that, I think many people moved to subversion, which had
         | a lot more functionality for distributed VC, for example there
         | was a server. svn was popular for a while but building it was
         | painful (due to berkeley db) and it sort of never grew. I
         | invested a lot of time in it (specifically apache with mod_dav
         | and mod_dav_svn) but lost interest in VC after fighting with
         | subversion.
         | 
         | git came along and from what I can tell it mainly had "it's by
         | Linus, and the kernel uses it" and "it's fast" and "something
         | about reflogs". I use git day-to-day but I still can't explain
         | how git became so ubiquitous; I find using it outright painful.
        
           | dws wrote:
           | Lightweight branches were a huge selling point. If you didn't
           | do them often enough that they were rote, branches in
           | RCS/CVS/SVN required ritual sacrifice.
        
           | reportingsjr wrote:
           | Mercurial (aka hg) was also gaining popularity at the same
           | time as git. The interface was a lot nicer and more sane than
           | git, but it had some serious performance limitations that
           | hampered it.
           | 
           | Both were definitely way better than SVN/CVS/etc.
        
             | dekhn wrote:
             | Yes, after using git for a few years I was introduced to
             | Mercurial and it was like a breath of fresh air, although
             | I'm also told hg added a number of things that made it much
             | more usable, "right before I started using it".
             | 
             | Since I have limited brain capacity I focus my efforts on
             | being able to use git, not hg, merely because it has so
             | much marketshare.
        
           | cpach wrote:
           | Nit-pick: Did you mean CVS?
        
             | dekhn wrote:
             | Yes, CVS.
        
       | IshKebab wrote:
       | The same algorithm is also used in Hound
       | (https://github.com/hound-search/hound) though I have to say the
       | best implementation of code search by far that I've seen is
       | https://grep.app
       | 
       | You really should check it out if you haven't already. It's
       | incredibly useful; I used it all the time. Not open source
       | though.
        
       ___________________________________________________________________
       (page generated 2023-11-21 23:00 UTC)