[HN Gopher] Flat Data ___________________________________________________________________ Flat Data Author : idan Score : 179 points Date : 2021-05-18 17:21 UTC (5 hours ago) (HTM) web link (octo.github.com) (TXT) w3m dump (octo.github.com) | FemmeAndroid wrote: | The really interesting thing about this to me is that if this | wasn't being put out via GitHub, I would have dismissed it as | being potentially against the TOS or abuse of GitHub's free | service. But with them putting it out, I'm quite interested in | reevaluating my use cases for GitHub. | gerner wrote: | See the comment from @jasoncwarner about GitHub actions being a | platform for much more than CI. | | I wonder how far that extends to non-GitHub provided services. | For instance, could we leverage GitHub actions, perhaps even | Flat Data, to scrape some web site and store it (perhaps | uploading elsewhere) in a more comprehensive way vs. storing | some small snippet of the data in a git repo? | VWWHFSfQ wrote: | you mean like in a database | gerner wrote: | Yes. Or S3 bucket, or whatever. The thing I'm getting at | is, can we use GitHub actions for application tasks like | web scraping that need compute and network access, but that | don't really do much with a git repo. Does GitHub want | to support that? | yNeolh wrote: | Interesting, it is not an official product from Github, but I love | the idea and them being upfront about their inspiration from | Simon, a really interesting person to follow, I love his | investment in Datasette, SQLite utils and Django. | | The thing about Git scraping, although I think the idea is | awesome, I thought it was against Github Actions rules, or at the | very least being on the edge. So I don't know what the position | from Github is about this as this is not an official thing from | them, but this gives me positive vibes. | simonw wrote: | Same here! 
I was reasonably confident that Git scraping was | within the boundaries of GitHub Actions supported use-cases but | it did always feel a little bit on the edge, this is fantastic | confirmation that it's a supported technique. | ellimilial wrote: | Very interesting how Github comes with more and more interesting | 'actions' to turn repos into 'platforms' and moves us closer to | a serverless future. | | @idan how does it scale with the size (including storage)? Is 'a | billion rows' a goal or an actual tested use case? | jasoncwarner wrote: | Hi! Jason, CTO @ GitHub, here | | You're getting at the heart of Actions. Actions was never | intended to be "CI" or any such vertical capability. It has | always been intended to be a platform that exposes capabilities like | CI or packages etc out to the world, but the underlying | serverless very flexible workflow platform is the bedrock upon | which we want to build the future | | My long held view is that the only real 'competitor' to what I | want github to be was AWS/major cloud infra companies and if | you believe in that view along with me, you likely see | why the past four years of github and the next few years of | github make a lot of sense | | And it even makes more sense when you squint just a bit and | realize what codespaces + repos + actions (CI/security/packages | + other things) + automated workflows would eventually do. Now | imagine a bit further out into the future and what it would | mean if we understood your production workloads a bit more | ellimilial wrote: | Hi Jason, thank you very much for the background and the | explanation. It is fascinating to see the progress in this | direction. | | I started raising my eyebrow (in the best possible sense) | upon seeing parts of tooling very similar to ours but simpler | and more importantly - without moving parts. 
We operate in | biomedical data space and deal with flat/static data a lot, | for example we power https://biokeanos.com with data-in-repo, | so Flat Data was immediately interesting. | | It is really inspiring to see GitHub actions having a | foray in this direction, definitely something to keep an eye | on. | eksabajt wrote: | It's storing the files in the repository which has a file size | limit of 100MB. I think the repositories themselves have a soft | limit of 5GB and a hard limit of 100GB. | idan wrote: | It doesn't scale! This isn't a replacement for databases. | | Our take on this is about "working sets" of data -- if you have | billions of rows, that's a lot bigger than a working set! At | some point, you have to query, filter, and aggregate to get | your data down to a chewable size for work. | | You can do that in your code too, and sometimes that's | absolutely the right approach! But often it's easier to push | that work to "outside your code," and that is what Flat is | great for. | ellimilial wrote: | Thank you for the response and clearing up the 'billion rows' | / surly bonds confusion I had from reading the project's Why Flat | Data? section. I think I understand the target use case | slightly better now. | | One of the strong arguments for object-like storage (S3 etc) | in the context of plain / flat data is scalability and | availability for large scale processing frameworks. Databases | are only occasionally relevant. | danso wrote: | As someone who has written so much boilerplate data-collection | code (i.e. scripts that I cron on my local repo, then push to | Github), this is really incredible. I've been really impressed | with what Simon W. has shown off with Github Actions but hadn't | yet felt compelled enough to dive in and learn the | conventions...but this looks like a great entry point. 
| | Don't know if this is the place to report bugs, but I was trying | the github>>flatgithub data viewer trick on an old repo that has | a name of `white_house_salaries`. | | My data subdirectories have several files named | _white_house_salaries.csv_ -- e.g. _data | /wrangled/white_house_salaries.csv_ is the "finished" version. | However, visiting that file in flatgithub.com gives me a "No | valid data" error: | | https://flatgithub.com/storydrivendatasets/white_house_salar... | | I get the same error when visiting _data | /fused/white_house_salaries.csv_. | | However, when I rename the file to something other than | "white_house_salaries.csv", like, _data | /wrangled/white_house_salaries_wrangled.csv_, it works as | expected: | | https://flatgithub.com/storydrivendatasets/white_house_salar... | | I'm guessing there must be some issue with the data filename | (white_house_salaries.csv) sharing the same name as the repo | (storydrivendatasets/white_house_salaries)? | rothenbizzle wrote: | Hey there! Matt from the DevEx team here. Apologies for the | lack of polish - I _think_ the issue here is that the | flatgithub.com URL only works when you specify the repo owner | and repo name, a la https://flatgithub.com/storydrivendatasets/ | white_house_salar.... | | It gets confused by all of the other stuff afterward, | "tree/master/data/wrangled". | | Let me know if that gets you sorted! | whats_spinning wrote: | How big of data can this handle? | [deleted] | trinovantes wrote: | I once ran a web scraper on an hourly schedule with GitHub Action | that wrote to a json file in my gh-pages branch and saved its | results with sh "git commit --amend". Glad to see this workflow | in a more integrated environment than my janky hack | gerner wrote: | I don't know much about Flat Data, but I'm impressed with how | much GitHub is doing as GitHub since the MSFT acquisition. They | continue to offer compelling services to developers, and | increasingly to enterprise customers. 
All without abandoning much | of what made GitHub great: a focus on developers and easy to | access dev productivity. | | Notice the prominence of the VSCode integration here. Notice the | dramatically increased presence of MSFT on GitHub in general. It | seems like they've managed to integrate these two cultures and | product-sets in sensible ways. Given how hard big integrations | like this are to pull off, I feel like the community really | dodged a bullet in terms of access to products/tools. | alexander-litty wrote: | Dodged a bullet for now. | | I'm worried this is their extend-embrace stage, and the | extinguish is yet to come. | | I truly hate to be pessimistic, and I'm not trying to start a | flame war. I just can't see this behavior lasting in the long | run. | pwdisswordfish8 wrote: | It's already here, it's just that the userbase and third | parties are (happily) doing the dirty work for them. Try | going GitHub-free for a month or three and you'll notice how | many things rest on the assumption that you have a GitHub | account. | | Look at how it shat on Markdown with what it calls "GitHub | Flavored Markdown". Look at the things that it calls "wikis". | Look at how GitHub's PR merge tool junks up the commit log. | Look at how many projects don't even have a way to accept a | fix unless you submit it with GitHub's janky pull request | workflow. Hell, a bug in Netlify's command-line client | managed to make its way into release versions that would | straight up cause the process to terminate for bog standard | "hello world"-style static sites due to an unhandled exception | when cwd was a repo that wasn't hosted on github.com. | | The tacit assumption that you're using GitHub is like the | tacit assumption 15 years ago that you were using Visual | Studio, and "Log in with GitHub" is essentially what | Microsoft hoped for with Passport, if Passport had actually | been successful. 
| agency wrote: | I have no particular love for MSFT but I don't think any of | the issues you mentioned began after the acquisition. | pwdisswordfish8 wrote: | ...so? | | They acquired a company that was doing the thing that | they are wont to do and are criticized for, and have | poured the significant resources at their disposal into | growing the circle of impact. Where it originates from | and whether it was or wasn't already independently in | full swing (or partial, in this case) before their | involvement doesn't matter, the effect on the user is the | same. Besides that, if a person's problem with a given | practice is whether or not Microsoft is the perpetrator, | then that person is a hypocrite and doesn't actually give | a shit about the thing they claim to be concerned | about. | gerner wrote: | Agree, it's important that we keep an eye on things and, | however we can, hold MSFT and GitHub accountable to keep up | the good showing. | | We've seen new features launched (e.g. this one) long enough | after the acquisition that much (most, all?) of the work | happened in the post acquisition environment that I'm | optimistic. But I've been wrong before. | idan wrote: | The OCTO DevEx team reaaaaaallly loves VS Code -- beyond the | editor, it's just a great surface for experimental developer | tooling! | | GitHub Codespaces aren't generally available yet, but being | able to target both "native" VS Code as well as in-browser VS | Code with the same extension is super powerful. Expect a lot | more from us on that front. | | We've also released a pair of little projects re VS Code | development that we've extracted from our work: | | https://github.com/githubocto/tailwind-vscode: a Tailwind CSS | plugin which creates Tailwind color tokens for each of the VS | Code theme colors, easing theme-native styling in VS Code. 
| | https://github.com/githubocto/snowpack-vscode-extension- | temp...: a VS Code extension template that incorporates the | fastest toolchain with the wisdom we've accumulated about | webview development. | adamcstephens wrote: | Can you help me get a Codespaces invite? ;) | duped wrote: | The monthly downtime during working hours has been getting to | me lately. | dataangel wrote: | ...they reinvented cron? it just commits a file on a timer | idan wrote: | Correct! And if you're Simon Willison, this is a super easy | thing to Just(tm) implement manually. | | The point of Flat Data is to push the edges of that bubble | outwards. Add tooling and examples. Add a viewer. Make the | "happy path" situations where this is helpful really fast and | easy. | | We're pretty upfront about this not being a major technological | advance. The difference between a difficult-to-use API and a | good API is usually just about the mental model. We like this | mental model, and the kinds of patterns it encourages! | abuehrle wrote: | This is really cool! I would have liked to have incorporated this | into my vaccine appointment slot finder tool a few months ago. I | like using git commits for change tracking too. Seems not | dissimilar (though not identical) to what they're doing at Dolt | (https://www.dolthub.com/). | idan wrote: | Yup, there's Dolt, and DVC, and probably a dozen other projects | I'm forgetting or haven't heard of. Dat! | | There's more than one way to data. We looked at a bunch of | them, and the key thing we keep coming back to is git | semantics. In many ways, all these other projects attempt to | graft git semantics on top of more scalable datastores, | allowing you to "fork" your data or roll it back to a given | version. Trouble is, these abstractions have subtly different | semantics or behaviors. These aren't inherently bad -- just not | the same as the ones you know from git. | | This approach sacrifices "scalability" in order to let you Just | Use Git(tm). 
It won't work (well) for a larger dataset, but we | find that it's useful in a ton of situations. | | For example: I have personally shipped bugs to production | because my test fixtures had stale example data. I should have | remembered to create new fixtures, but I didn't. Flat could | have made them for me, on a schedule, subsampling and | anonymizing production data as it worked. | | It's a subtle difference in application. If your goal is to | version $BIGDATA, then Flat isn't the right tool for the job, | and you should check out Dolt, DVC &co. | FractalHQ wrote: | Funny, I'm currently working on a project where I'm fetching post | data from a Wordpress backend with a few GQL queries via the | WPGraphQL plug-in and `@urql/svelte` to populate a static SSG'd | frontend. While developing locally, I copied and pasted the JSON | response into a local file in the repo to develop against. I was | thinking this would be nice to automate. | | If I'm understanding correctly, it seems like this tool more or | less automates that process? | | Can it send a GQL query? | idan wrote: | This is a really powerful use-case! If you saw Alex Gaynor's | election tracker[1] during the US 2020 elections, it's exactly | how it worked. Actions scraped the NYT election results.json, | and a static site on GH pages rendered the data, XHRing the | scraped JSON out of the repo periodically. | | There's no GraphQL backend yet! We've only done HTTP and SQL | backends so far. If your GQL query is simple enough, you might | be able to squeak by with an HTTP flat action whose target is | https://your.site/graphql?query=whatever ? | | [1] https://alex.github.io/nyt-2020-election- | scraper/battlegroun... | simonw wrote: | If you want to run GraphQL queries against this kind of data I | have a roundabout way of doing it: | | 1. Set up a repo that uses actions to scrape data into a CSV | | 2. Set up another action that converts that CSV to SQLite | (using my sqlite-utils tool) and then... 
| | 3. Publishes that database to Cloud Run or Vercel with | Datasette and with the datasette-graphql plugin | | Here's an example repo that does exactly that: | https://github.com/simonw/cdc-vaccination-history | | It scrapes vaccination data from the CDC, compiles that into a | SQLite database and publishes it using Datasette on Vercel at | https://cdc-vaccination-history.datasette.io/ | | Then you can run GraphQL queries at https://cdc-vaccination- | history.datasette.io/graphql | | (Here's the plugin: https://datasette.io/plugins/datasette- | graphql) | | Another demo: https://covid-19.datasettes.com/graphql runs from | this repo: https://github.com/simonw/covid-19-datasette | bob1029 wrote: | I am sensing some interesting capabilities here, but also get the | impression that this is more about denormalized views of data | (JSON/CSV/etc) than anything else. It's also in the name - | 'Flat'. | | Perhaps it is actually supported and I can't read properly, but I | feel like you are just 1 tiny step away from allowing someone to | write one of these things such that it can ETL any arbitrary data | source into a SQLite database (i.e. many tables). There's not a | whole lot of difference between CSV and SQLite when it comes to | repository file management. Granted, SQLite databases would | present as opaque blobs at code review time, but this is | something we can tolerate because you still get all of the nice | versioning & project consistency. Hell, you could probably write | a special GitHub-branded diff viewer that allows you to compare 2 | different SQLite databases, schema & all. | | SQLite in general is such a force to be reckoned with. You could | do a lot of damage (in a good way) with product features built up | around the most popular database engine on earth. | nt2h9uh238h wrote: | I'm actually very excited about it. Could start a new era of how | we develop and work with data. | idan wrote: | Hi HN! 
Our team has loved building this, as well as all of the | storytelling and examples. We'd love your feedback! | everybodyknows wrote: | The screen videos are interesting, but too fast to follow, and | make reading the accompanying text impossible for those of us | with fragile concentration. Reader View (Safari) drops the | screen imagery entirely, so goes too far the other way. | | How about a video pause/seek control? | idan wrote: | Hey! This is a great callout, we'll think about how to make | it better! | dariosalvi78 wrote: | nice idea, but exploring the data is very limited. Would be even | better if it had some sort of query language and maybe an API. ___________________________________________________________________ (page generated 2021-05-18 23:01 UTC)