[HN Gopher] Graph Mining Library ___________________________________________________________________ Graph Mining Library Author : zuzatm Score : 223 points Date : 2023-10-03 15:46 UTC (7 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | xxpor wrote: | I was hoping this would mine literal stats graphs for anomaly | detection | blitzar wrote: | I think they use the word "graph" to mean a different thing to | what I use the word for. | lanstin wrote: | https://en.wikipedia.org/wiki/Graph_theory | | It's interesting and deceptively simple at first. | [deleted] | sbrother wrote: | I might be (very) far behind the times, but does this have any | relationship with Pregel? | [deleted] | charcircuit wrote: | Most of these files have a double license header. | ldhulipala wrote: | Thanks for pointing this out (fixed now). | specproc wrote: | If you're working on this repo, can we plz haz docs? | ldhulipala wrote: | Thanks, yes, this is on the list of TODOs! (also, to | open-source the tests) | choppaface wrote: | Did you release it without docs so that you could add it | to your Perf packet? | specproc wrote: | Thank you kindly! | xw3098 wrote: | It would be nice to have a bit of documentation on what | makes this library special as well. It's a significant | time investment to learn a library like this one well. So | some information on why one should choose this over, say, | http://snap.stanford.edu/ would be very helpful. | MarkMarine wrote: | Whew. Lots of complaints from people who probably will never need | to use this code. | | If you need docs just read the .h files, they have extensive | comments. I'm sure they'll add them or maybe, just maybe, you | could write some to contribute. | | This would have made some of my previous work much easier, it's | really nice to see google open source this. | riku_iki wrote: | > If you need docs just read the .h files | | curious if this is typical dev experience inside google..
| dekhn wrote: | I think in most cases, back when I worked there, I would have | instead searched the monorepo for targets that depended on | this library (an easy lookup), and looked at how they used it. | | Some code libraries had excellent docs (recordio, sstable, | mapreduce). But yes, reading the header file was often the | best place to start. | MarkMarine wrote: | I'm not at google so I've got no idea. | | Reading the code, especially the header files, seems to be | pretty standard as far as what I see in non-open source code. | So, it's been my typical dev experience. I'd say if you're | somewhere that has gleaming, easy-to-understand docs that are | actually up to date with the code, you all have too much time | on your hands, but I serially work at startups that are | running to market. | riku_iki wrote: | A header file gives you a view into some narrow window of the | system, API, pipeline, and you probably have no idea which | header files are important and which are part of some | internal implementation. | | 10 mins spent on a readme with some high-level details is an | investment with a 100x return for lib users. | helsinki wrote: | The .proto files are the documentation everyone is looking for. | [deleted] | ls612 wrote: | I think it's that it's not at all obvious how to even build the | damn thing, so at least a little bit of readme would have been | nice. I agree with the sentiment this looks like a super cool | tool. | PaulHoule wrote: | It says you're supposed to leave a ticket if you have | questions or comments... A README file isn't much to ask for. | MarkMarine wrote: | I'm not saying it's too much to ask for, but also, when | you're doing distributed in-memory graph mining (which | means you've got an application with a big enough graph | that you need to do this, and the technical expertise to | need the algorithms in this open source package) maybe it's | expected that you can read the bazel files and bazel docs | yourself and figure it out.
| | Or just write a make file and cut all the bazel build | optimization out. | | They don't put instructions on how to start an F1 car inside | the cockpit, and you don't hop into a fighter jet and look for | the push-to-start easy button; it's expected that when | you're at that level you bring expertise. | PaulHoule wrote: | Yeah, and somebody who is that smart can probably pack | their data structures efficiently and find an | approximation to do the job on a MacBook Pro that people | with too many resources need a 1000-machine cluster to | do. And get the coding and the computation done in the | time that the C++ compiler is still chewing on the | headers of the bloated library. (At times I've been that | guy.) | | But seriously, there is such a thing as | industrialization. Google is notorious though for hiring | 180 IQ people and getting them to perform at a 35 IQ | level because there, the documentation makes no sense, a | procedure which should be done in one step really takes | 300, fighting with C++, etc. They can afford to do it | because it is just so profitable to help that guy who | shows up on every. single. video. roll. who says "you | can't lose weight by exercising", "you can't lose weight | by cutting carbs", who links to some video that drones on | for hours and hours and signs you up for some subscription | you can never cancel to sell scam supplements. | | Shame on them. | | BTW, with high-complexity software there is enough | question whether you got it working right that you expect a | significant process of testing that it works for your | application. For instance if you got a hydrocode for | simulating the explosion of a nuclear weapon you would | not take for granted that you had built it properly and | were using it properly until you'd done a lot of | validation work. A system like that isn't a product | unless it comes with that validation suite.
The same is | true for automated trading software (gonna hook it up | straight to the market without validation? hope you have | $100M to burn!) | | ... now there was that time a really famous CS professor | e-mailed me a short C program that was said to do | something remarkable that crashed before it even got into | main(), which did teach me a thing about C, but that's not | what a professional programmer does. | MarkMarine wrote: | I agree with all of this. | | It's just frustrating that all the comments about an | interesting library seem to be customer service | complaints from people who never need to reach for this | library. I was hoping for a real discussion, something I | could learn from. | PaulHoule wrote: | Really though an open source product has not really been | released until there is documentation walking through | setting it up and doing some simple thing with it. As it | is I am really not so sure what it is, what kind of | hardware it can run on, etc. Do you really think it got | 117 Github stars from people who were qualified to | evaluate it? | | (I'd consider myself qualified to evaluate it... if I put | two weeks into futzing with it.) | | Every open source release I've done that's been | successful has involved me spending almost as much time | in documentation, packaging and fit-and-finish work as I | did getting it working well enough for me. It's why I | dread the thought of an open source YOShInOn as much as I | get asked for it. | | Sometimes though it is just a bitch. I have some image | curation projects and was thinking of setting up some | "booru" software and found there wasn't much out there | that was easy to install because there are so many moving | parts, and figured I'd go for the motherf4r of them all | because at least the docker compose here is finite: | | https://github.com/danbooru/danbooru | | even if it means downloading 345TB of images over my DSL | connection.
| ponyous wrote: | No idea where the hype is coming from; who is actually upvoting | this? 0 docs, 0 examples, 0 explanation of how it is useful. | | Is "Graph Mining" so ubiquitous that people know what this is all | about? | bafe wrote: | It was hyped some years ago. There are plenty of legitimate | applications of graphs, and perhaps the library offers | well-optimized implementations of important algorithms. But the past | hype around all things "graph" was not rational. As always, you | can't solve all problems with a graph, just as you can't with a | neural network or with any other structure/algorithm | ldhulipala wrote: | We are updating the README to be more descriptive; in the | meantime, please see https://gm-neurips-2020.github.io/ or | https://research.google/teams/graph-mining/ | [deleted] | nologic01 wrote: | Graph algorithms cry out for some standardization. Think BLAS and | LAPACK. | bigbillheck wrote: | Consider: https://graphblas.org | nologic01 wrote: | I wonder how much overlap this new project has with GraphBLAS and | older graph libraries like boost::graph | https://www.boost.org/doc/libs/1_83_0/libs/graph/doc/ | 0x6461188A wrote: | How is this usable? I see no documentation. There is a docs | folder, but all it contains is a code of conduct. | mathisfun123 wrote: | I'm not trying to be snarky, but have you considered reading the | code? Like, I'll be honest, I can't remember the last time I | looked at docs at all instead of reading the code itself. | zekenie wrote: | some examples would be super helpful! | zuzatm wrote: | It's coming! Check again in 12 hours, I believe it should be up | then! | specproc wrote: | Documentation of any sort would be super helpful. | corentin88 wrote: | Interesting fact: the first commit is 2 years old and is entitled | "Boilerplate for new Google open source project". | | Either they rewrote git history or it took about 2 years to get | approval on making this repo public.
| spankalee wrote: | If you know you want to open source a project eventually, it's | easier if you start it in the open source part of the internal | repo with all the licensing and headers in place. Open sourcing | existing code is harder because you need to review that it | hasn't used something that can't be opened. | | So probably they just started the project two years ago, had | aspirations to open source, and finally just did now. Some teams | might publish earlier; some like to wait until it's had enough | internal usage to prove it out. | hiddencost wrote: | That could be the template it was cloned from | progval wrote: | FWIW they had already pushed that commit four months ago: | https://archive.softwareheritage.org/browse/snapshot/bd01717... | [deleted] | thfuran wrote: | Now that's bureaucracy. | numpad0 wrote: | I'd agree if the _last_ commit was 2 years ago. | j2kun wrote: | The code has an internal analogue, and the tooling lets you | choose whether to export the entire git history or squash it. | They may have chosen the former, in which case it could just be | 2 years to migrate and rework the code to be ready for open | sourcing. In that time I imagine there were four reorgs and | countless priority shifts :) | Daan21 wrote: | [dead] | afandian wrote: | Can someone with familiarity with Bazel give any clues how to | build? `bazel build` does something, but I end up with | `bazel-build` and `bazel-build` with no obvious build artefacts. | Keyframe wrote: | Interesting. I always had in the back of my mind this notion that I | ought to check bazel one of these days. So, one of these days | is then today. In order to install bazel, the recommended way seems | to be to install bazelisk first and just rename that to bazel | and move it somewhere on the path like /usr/local/bin/bazel.. | fine. Now when I run query it warned me about the JDK.. huh.
Now | when I run build it errored and failed due to missing Java with | "WARNING: Ignoring JAVA_HOME, because it must point to a JDK, | not a JRE.". Ok, I'm not using Java - let's check which Java | JDK/JRE to use these days, and after a few minutes of googling I'm | not up for it anymore, and that, ladies and gentlemen, is where | this day is then up for another day after all. Pathetic how | cargo and even npm/yarn spoiled us. | | edit: thanks to https://sdkman.io/ it's up and running. It | wasn't _that_ bad after all. | elteto wrote: | In bazel //... is the equivalent of the 'all' target in make: | bazel build //... bazel test //... bazel query | //... | | The last one should list all targets (from what I remember). | afandian wrote: | Thanks! That last one lists 84 results. None looks obviously | like 'main'. Trying a random one: bazel run | //in_memory/clustering:graph ERROR: Cannot run target | //in_memory/clustering:graph | | I'm going to wait until someone updates the readme, I think! | elteto wrote: | Then most likely this is meant to be used primarily as a | library. You should wait until they open source the tests | (soon, per another commenter). Those will be runnable | targets. | tfsh wrote: | Considering the repo doesn't contain a cc_binary build | rule, I'm inclined to believe there's no demo. The easiest | way to get started (if you want to play around from | scratch) would be to add a cc_binary, point that to a | main.cpp file which depends on the library targets you | want, e.g. "//in_memory/clustering:graph", and ensure there's | sufficient visibility from the targets. | hashar wrote: | `bazel run` is for a rule that has been marked `executable | = True` and there is no such rule in the repository. | | If you `bazel build //...`, you should get the compiled | libs under `bazel-out/*fastbuild/bin/`. | esafak wrote: | Graph mining was "so hot right now" ten years ago.
Remember | GraphX (https://spark.apache.org/graphx/) and GraphLab | (https://en.wikipedia.org/wiki/GraphLab)? Or graph databases? | | I guess it coincided with the social network phenomenon. Much | more recently geometric learning (ML on graphs and other | structures) shone, until LLMs stole its thunder. I still think | geometric learning has a lot of life left in it, and I would like | to see it gain popularity. | [deleted] | reaperman wrote: | I still use NetworkX a lot when a problem is best solved with | graph analysis; I really enjoy the DevEx of that package. | [deleted] | PaulHoule wrote: | There are "graph databases" which see graphs as a universal | approach to data, see RDF and SPARQL and numerous pretenders. | For that matter, think of a C program where the master data | structure is a graph of pointers. In a graph like that there is | usually a huge number of different edge types such as "is | married to", "has yearly average temperature", ... | | Then there are "graph algorithms" such as PageRank, graph | centrality, and such. In a lot of those cases there is one edge | type or a small number of edge types. | | There are some generic algorithms you can apply to graphs with | many typed edges, such as the magic SPARQL pattern | ?s1 ?p ?o . ?s2 ?p ?o . | | which finds ?s1 and ?s2 that share a relationship ?p with some | ?o and is the basis for a similarity metric between ?s1 and | ?s2. Then there are the cases where you pick out nodes with some | specific ?p and apply some graph algorithm to those. | | The thing about graphs is, in general, they are amorphous and | could have any structure (or lack of structure) at all, which | can be a disaster from a memory latency perspective. Specific | graphs usually do have some structure with some locality.
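[Editor's sketch] The magic pattern described above is a self-join that counts shared (predicate, object) pairs between subjects. A minimal Python sketch of that similarity metric over a toy triple list (the data and names are illustrative, not any library's API):

```python
from collections import defaultdict
from itertools import combinations

# Toy RDF-style triples: (subject, predicate, object).
triples = [
    ("alice", "livesIn", "paris"),
    ("bob",   "livesIn", "paris"),
    ("alice", "likes",   "jazz"),
    ("bob",   "likes",   "jazz"),
    ("carol", "livesIn", "oslo"),
]

# Invert the triples: each (predicate, object) pair links all its subjects.
by_po = defaultdict(set)
for s, p, o in triples:
    by_po[(p, o)].add(s)

# The "?s1 ?p ?o . ?s2 ?p ?o" join: count shared (p, o) pairs per subject pair.
shared = defaultdict(int)
for subjects in by_po.values():
    for s1, s2 in combinations(sorted(subjects), 2):
        shared[(s1, s2)] += 1

print(dict(shared))  # {('alice', 'bob'): 2}
```

For large graphs the inner loop is quadratic in the size of each subject set, which is exactly where the data-structure repacking and approximation tricks come in.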
There | was a time I was using that magic SPARQL pattern and wrote a | program that would have taken 100 years to run, and then | repacked the data structures and discovered an approximation | that let me run the calculation in 20 minutes. | | Thus practitioners tend to be skeptical about general purpose | graph processing libraries, as you may very well have a problem that | I could code up a special-purpose answer to in less time than | you'll spend fighting with the build system for that thing that | runs 1000x faster. | | ---- | | If you really want to be fashionable though, arXiv today is | just crammed with papers about "graph neural networks" that | never seem to get hyped elsewhere. YOShInOn has made me a long | queue of GNN papers to look at but I've only skimmed a few. A | lot of articles say they can be applied to the text analysis | problems I do but they don't seem to really perform better than | the system YOShInOn and I use, so I haven't been in a hurry to | get into them. | esafak wrote: | 1. Graph algorithms like the ones you mentioned are processed | not by graph databases like Neo4j, but by graph processing | libraries like the titular Google library. | | 2. Geometric learning is the broader category that subsumes | graph neural networks. | | https://geometricdeeplearning.com/ | PaulHoule wrote: | Depends, some graph databases have some support for graph | algorithms. | | I'll also say I think graph algorithms are overrated; I | mean, you know the diameter of some graph: who cares? | Physicists (like me back in the day) are notorious for | plotting some statistics on log-log paper, seeing that the | points sorta kinda fall on a line if you squint, deciding | that three of the points are really bug splats, and then | yelling "UNIVERSALITY" and sending it to _Physical Review | E_, but the only thing that is universal is that none of | them have ever heard of a statistical estimator or a | significance test for power law distributions.
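[Editor's sketch] The estimator being alluded to does exist: for a continuous power law p(x) ∝ x^(-alpha) with x ≥ x_min, the maximum-likelihood exponent is alpha = 1 + n / Σ ln(x_i / x_min) (the Clauset-Shalizi-Newman survey covers this, along with goodness-of-fit tests). A minimal Python sketch, checked against synthetic data drawn by inverse-transform sampling:

```python
import math
import random

def powerlaw_alpha_mle(xs, xmin):
    # MLE for a continuous power law p(x) ~ x^(-alpha), x >= xmin:
    # alpha_hat = 1 + n / sum(ln(x_i / xmin))
    tail = [x for x in xs if x >= xmin]
    if not tail:
        raise ValueError("no data above xmin")
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def sample_powerlaw(alpha, xmin, n, rng):
    # Inverse-transform sampling from P(X > x) = (x / xmin)^(1 - alpha).
    # Using (1 - u) keeps the base in (0, 1], avoiding 0 ** negative.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

rng = random.Random(0)
data = sample_powerlaw(alpha=2.5, xmin=1.0, n=50_000, rng=rng)
print(powerlaw_alpha_mle(data, xmin=1.0))  # close to the true exponent 2.5
```

Fitting by least squares on the log-log plot, the practice being mocked above, is biased; the MLE plus a goodness-of-fit test is the standard remedy.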
Node 7741 is | the "most central" node, but does that make a difference? | Maybe if you kill the top 1% central nodes that will | disrupt the terrorist network, but for most of us I don't | see high quality insights coming out of graph algorithms | most of the time. | esafak wrote: | > _Physicists (like me back in the day) are notorious for | plotting some statistics on log-log paper_... | | For people who've missed it: _So You Think You Have a | Power Law -- Well Isn't That Special?_ | (http://bactra.org/weblog/491.html) :) | Someone wrote: | > a universal approach to data, see RDF and SPARQL and | numerous pretenders. For that matter, think of a C program | where the master data structure is a graph of pointers. | | A graph of typed pointers. As you likely know, the basic | element of RDF is not "_foo_ has a relationship with _bar_", | but "_foo_ has a relationship with _bar_ of type _baz_". | | Also, the types themselves can be part of relationships, as in | "_baz_ has a relationship with _quux_ of type _foobar_" | | > The thing about graphs is, in general, they are amorphous | and could have any structure (or lack of structure) at all | which can be a disaster from a memory latency perspective | | But that's an implementation detail ;-) | | In theory, the engine you use to store the graph could | automatically optimize memory layout for both the data and | the types of query that are run on it. | | Practice is different. | | > Thus practitioners tend to be skeptical about general | purpose graph processing libraries | | I am, too. I think the thing they're mostly good for is | producing PhDs, both on the theory of querying them, | ignoring performance, and on improving performance of | implementations. | PaulHoule wrote: | Funny, the core table of salesforce.com is triples, but they | got a patent circa 2000 on a system that builds indexes and | materializes views based on query profiling, so the | performance is good (w/ gold-plated hardware).
That patent | is one reason why graph databases sucked for a long time. | | Now the Lotus notes patents have been long expired so I'd | like to see some graph database based products that can do | what Notes did 30 years ago but it is lost technology like | the pyramids, stonehenge and how to make HTML form | applications without React. | pharmakom wrote: | Can someone explain what this library might be useful for? | oddthink wrote: | Clustering. I used the correlation clusterer from here for a | problem that I could represent as a graph of nodes with | similarity measures (this data looks like this other data) and | strong repelling features (this data is known to be different | from this other, so never merge them). | PaulHoule wrote: | Towards the end of my relationship with a business partner, he | was really impressed with a graph processing library released by | Intel (because it was Intel), while my thoughts were "ho hum, | this looks like it was done by a student" (like a student who got | a C-, not a A student) and thought about how much I liked my | really crude graph processing scripts in Pig that were crazy fast | because they used compressed data structures and well-chosen | algorithms. | whitten wrote: | Github says it is C, C++, and Starland. | | What is Starland ? | Laremere wrote: | It's Starlark, the language for configuring the build system | Bazel. Bazel is the open source port of Google's internal build | system, Blaze. Starlark is a subset of Python. | Terr_ wrote: | This list of corporate project name associations makes me | wonder where Galactus comes in. :P | | https://www.youtube.com/watch?v=y8OnoxKotPQ | [deleted] | ashout33 wrote: | if I had to guess, that is a typo and should be starlark, which | is the language used for bazel build files. bazel is the build | system they use | jefftk wrote: | Github says "Starlark 6.2%", so it looks like whitten's typo, | not GitHub's. | nolok wrote: | On which keyboard layout is rk into nd a typo ... 
| bsimpson wrote: | "STARLAND VOCAL BAND? THEY SUCK!" | mcpeepants wrote: | on any layout operated by a human, who may at times type | the wrong word entirely | macintux wrote: | ...and/or fall victim to autocorrect. | tomrod wrote: | This is a big deal, I think. I'm guessing it's not widely used | internally anymore if they are open sourcing it. What is used | instead? | wilsynet wrote: | As merely two examples, both gRPC and Kubernetes are important | to Google, and yet Google open sourced them. "No longer used" | is not the criterion Google uses to make their software OSS. | | FYI, I work at Google. | tomrod wrote: | Thanks for clarifying | jefftk wrote: | Google Wave is the only counterexample I can think of, where | it was "we're deprecating this project, but releasing it as | open source". | bsimpson wrote: | I don't think Google generally open-sources _products_ - | either it always is open source (Android) or never is (web | apps). I can't think of an example where a product was | closed source, released as open source, and continually | maintained. | | Open source at Google generally takes the form of libraries | rather than products. Often, that's something that an | individual engineer is working on, and it's easier to open | source than get the copyright reassigned (since Google by | default owns any code you write). There are also libraries | that are open sourced for business reasons - e.g. SDKs. You | can tell the difference, because most individually-driven | libraries contain the copy "Not an official Google product" | in the README. | progval wrote: | > I can't think of an example where a product was | closed source, released as open source, and continually | maintained. | | I found one after some searching: Nomulus. | https://opensource.googleblog.com/2016/10/introducing-nomulu... | PaulHoule wrote: | I'd say both of those are actively harmful products (like | PFOS or cigarettes) that hurt Google's competition by being | open sourced.
Google wrecked their own productivity, the | least they could do was wreck everybody else's. | dieortin wrote: | And why would any of those be harmful? Care to elaborate? | PaulHoule wrote: | They take a process a small team could complete quickly | with high quality and low cost maintenance and turn it | into a process a huge team completes slowly with poor | quality and high maintenance cost. Google can afford this | because of huge profits from their advertising monopoly | that they don't know how to spend. | | Go look at the manuals for IBM's Parallel Sysplex for | mainframes and compare the simplicity of that to K8S for | instance. | | Or for that matter look at DCOM and the family of systems | which Microsoft built around it which are truly atrocious | but look like a model of simplicity compared to gRPC. (At | least Don Box wrote a really great book about COM that | showed people in the Microsoftsphere how to write good | documentation.) | | Or for that matter try doing something with AWS, Azure or | some off-brand cloud and Google Cloud from zero (no | account) and time yourself with a stopwatch. Well, it | will be a stopwatch for AWS but you will probably need a | calendar for Google Cloud. | spankalee wrote: | Most projects try not to open something that's not going to be | maintained. If they do it's usually rather loudly called out in | the readme. | simonw wrote: | I don't think "not widely used internally anymore" is a common | rationale for open sourcing something. 
| | Generally I'd expect companies to open source things when it's | proven itself internally and they want to reap the benefits of | open source: | | - Make internal engineers happy - engineers like having their | code released outside the bounds of their company | | - Prestige, which can help with hiring | | - External contributions (not even code necessarily, just | feedback from people who are using it can be amazingly useful | for improving the software) | | - Ability to hire people in the future who already know | important parts of your technical stack, and don't need | internal training on it | | - Externally produced resources that help people learn how to | use the software (tutorials, community discussion forums etc) | | If the software is no longer used internally, open sourcing it | is MORE expensive - first you have to get through the whole | process of ensuring the code you are releasing can be released | (lots of auditing, legal reviews etc), and then you'll have to | deal with a flow of support requests which you can either deal | with (expensive, especially if your internal knowledge of the | tool is atrophying) or ignore (expensive to your reputation). | mlinhares wrote: | Google has done this before so it's not surprising people | would think that. | tomrod wrote: | To be fair, I can often be uninformed with regards to some | of the smaller movements of techcos. | palata wrote: | IMO, you forget one important point: control. | | If your open source project/protocol is the most popular, and | you have the governance over it, then you decide where it | goes. Chromium is open source, but Google controls it, and | everyone who depends on it has to follow. If Chromium was not | open source, maybe Firefox would be more popular, and Google | would not have control over that. | | > or ignore (expensive to your reputation). | | I don't think that anything is expensive for Google. They can | do whatever they want. 
| meneer_oke wrote: | Would it be possible to explain why it's a big deal? | nivekney wrote: | It's based on ParlayLib, which is for shared-memory multicore | machines. Highly suspect that they moved the algorithms onto | distributed systems. | dllthomas wrote: | Does it accept graphviz? ___________________________________________________________________ (page generated 2023-10-03 23:00 UTC)