hngopher.com

       [HN Gopher] Flat Data
       ___________________________________________________________________
        
       Flat Data
        
       Author : idan
       Score  : 179 points
       Date   : 2021-05-18 17:21 UTC (5 hours ago)
        
 (HTM) web link (octo.github.com)
 (TXT) w3m dump (octo.github.com)
        
       | FemmeAndroid wrote:
       | The really interesting thing about this to me is that if this
       | wasn't being put out via GitHub, I would have dismissed it as
       | being potentially against the TOS or abuse of GitHub's free
       | service. But with them putting it out, I'm quite interested in
       | reevaluating my use cases for GitHub.
        
         | gerner wrote:
         | See the comment from @jasoncwarner about GitHub actions being a
         | platform for much more than CI.
         | 
         | I wonder how far that extends to non-GitHub provided services.
         | For instance, could we leverage GitHub actions, perhaps even
         | Flat Data, to scrape some web site and store it (perhaps
         | uploading elsewhere) in a more comprehensive way vs. storing
         | some small snippet of the data in a git repo?
        
           | VWWHFSfQ wrote:
           | you mean like in a database
        
             | gerner wrote:
             | Yes. Or S3 bucket, or whatever. The thing I'm getting at
             | is, can we use GitHub actions for application tasks like
             | web scraping that need compute and network access, but that
             | don't really do much with a git repo. Does GitHub want
             | to support that?
        
       | yNeolh wrote:
       | Interesting. It is not an official product from GitHub, but I
       | love the idea, and I like that they are upfront about their
       | inspiration from Simon, a really interesting person to follow. I
       | love his investment in Datasette, SQLite utils and Django.
       | 
       | As for Git scraping: although I think the idea is awesome, I
       | thought it was against GitHub Actions rules, or at the very
       | least on the edge. So I don't know what GitHub's position on
       | this is, as this is not an official thing from them, but this
       | gives me positive vibes.
        
         | simonw wrote:
         | Same here! I was reasonably confident that Git scraping was
         | within the boundaries of GitHub Actions supported use-cases but
         | it did always feel a little bit on the edge, this is fantastic
         | confirmation that it's a supported technique.
        
       | ellimilial wrote:
       | Very interesting how GitHub comes up with more and more
       | interesting 'actions' to turn repos into 'platforms' and moves
       | us closer to a serverless future.
       | 
       | @idan how does it scale with the size (including storage)? Is 'a
       | billion rows' a goal or an actual tested use case?
        
         | jasoncwarner wrote:
         | Hi! Jason, CTO @ GitHub, here
         | 
          | You're getting at the heart of Actions. Actions was never
          | intended to be "CI" or any such vertical capability. It has
          | always been intended to be a platform that exposes
          | capabilities like CI or packages etc. out to the world, but
          | the underlying serverless, very flexible workflow platform is
          | the bedrock upon which we want to build the future
         | 
          | My long-held view has been that the only real 'competitor' to
          | what I want github to be is AWS/major cloud infra companies,
          | and if you believe in that view along with me, you likely see
          | why the past four years of github and the next few years of
          | github make a lot of sense
         | 
         | And it even makes more sense when you squint just a bit and
         | realize what codespaces + repos + actions (CI/security/packages
         | + other things) + automated workflows would eventually do. Now
         | imagine a bit further out into the future and what it would
         | mean if we understood your production workloads a bit more
        
           | ellimilial wrote:
           | Hi Jason, thank you very much for the background and the
           | explanation. It is fascinating to see the progress in this
           | direction.
           | 
           | I started raising my eyebrow (in the best possible sense)
           | upon seeing parts of tooling very similar to ours but simpler
           | and more importantly - without moving parts. We operate in
           | biomedical data space and deal with flat/static data a lot,
           | for example we power https://biokeanos.com with data-in-repo,
           | so Flat Data was immediately interesting.
           | 
            | It is really inspiring to see GitHub Actions making a foray
            | in this direction, definitely something to keep an eye on.
        
         | eksabajt wrote:
         | It's storing the files in the repository which has a file size
         | limit of 100MB. I think the repositories themselves have a soft
         | limit of 5GB and a hard limit of 100GB.
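A minimal sketch of how one might check a working tree against the limits quoted above before committing scraped data. The figures are the ones from the comment, not verified official numbers, and the function names are hypothetical. Working-tree size is only a rough proxy for repository size, since git history also counts.

```python
from pathlib import Path

# Figures quoted in the comment above; reading "MB"/"GB" as powers of two
# is an assumption made for this sketch.
FILE_LIMIT_BYTES = 100 * 1024 * 1024      # 100MB per file
REPO_SOFT_LIMIT_BYTES = 5 * 1024 ** 3     # 5GB soft limit per repository


def tracked_candidates(root):
    """List regular files under root, skipping anything inside .git."""
    root = Path(root)
    return [
        p for p in sorted(root.rglob("*"))
        if p.is_file() and ".git" not in p.relative_to(root).parts
    ]


def oversized_files(root, limit=FILE_LIMIT_BYTES):
    """Return the files whose size exceeds the per-file limit."""
    return [p for p in tracked_candidates(root) if p.stat().st_size > limit]


def exceeds_soft_limit(root, limit=REPO_SOFT_LIMIT_BYTES):
    """Return True if the combined file size exceeds the soft limit."""
    return sum(p.stat().st_size for p in tracked_candidates(root)) > limit
```

A scheduled job could call `oversized_files(".")` before committing and skip the commit if the list is not empty.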
        
         | idan wrote:
         | It doesn't scale! This isn't a replacement for databases.
         | 
         | Our take on this is about "working sets" of data -- if you have
         | billions of rows, that's a lot bigger than a working set! At
         | some point, you have to query, filter, and aggregate to get
         | your data down to a chewable size for work.
         | 
         | You can do that in your code too, and sometimes that's
         | absolutely the right approach! But often it's easier to push
         | that work to "outside your code," and that is what Flat is
         | great for.
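To illustrate the "working set" idea, here is a small sketch that aggregates raw rows down to one row per day before the result is stored. The column names `day` and `amount` and the function name are hypothetical; Flat itself is not involved in this example.

```python
import csv
import io
from collections import defaultdict


def daily_totals(raw_csv, min_total=0):
    """Aggregate (day, amount) rows into one total per day.

    Days whose total is below min_total are filtered out, so the
    result is a smaller "working set" than the raw input.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        totals[row["day"]] += float(row["amount"])
    return {
        day: total
        for day, total in sorted(totals.items())
        if total >= min_total
    }
```

The same reduction could happen in a SQL query at the source; the point is only that the file committed to the repository holds the reduced data, not the raw rows.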
        
           | ellimilial wrote:
           | Thank you for the response and clearing up the 'billion rows'
           | / surly bonds confusion I had from reading project's Why Flat
           | Data? section. I think I understand the target use case
           | slightly better now.
           | 
           | One of the strong arguments for object-like storage (S3 etc)
           | in the context of plain / flat data is scalability and
           | availability for large scale processing frameworks. Databases
           | are only occasionally relevant.
        
       | danso wrote:
       | As someone who has written so much boilerplate data-collection
       | code (i.e. scripts that I cron on my local repo, then push to
       | Github), this is really incredible. I've been really impressed
       | with what Simon W. has shown off with Github Actions but hadn't
       | yet felt compelled enough to dive in and learn the
       | conventions...but this looks like a great entry point.
       | 
       | Don't know if this is the place to report bugs, but I was trying
       | the github>>flatgithub data viewer trick on an old repo that has
       | a name of `white_house_salaries`.
       | 
       | My data subdirectories have several files named
       | _white_house_salaries.csv_ -- e.g.
       | _data/wrangled/white_house_salaries.csv_ is the "finished"
       | version. However, visiting that file in flatgithub.com gives me
       | a "No valid data" error:
       | 
       | https://flatgithub.com/storydrivendatasets/white_house_salar...
       | 
       | I get the same error when visiting
       | _data/fused/white_house_salaries.csv_.
       | 
       | However, when I rename the file to something other than
       | "white_house_salaries.csv", like
       | _data/wrangled/white_house_salaries_wrangled.csv_, it works as
       | expected:
       | 
       | https://flatgithub.com/storydrivendatasets/white_house_salar...
       | 
       | I'm guessing there must be some issue with the data filename
       | (white_house_salaries.csv) sharing the same name as the repo
       | (storydrivendatasets/white_house_salaries)?
        
         | rothenbizzle wrote:
         | Hey there! Matt from the DevEx team here. Apologies for the
         | lack of polish - I _think_ the issue here is that the
          | flatgithub.com URL only works when you specify the repo owner
          | and repo name, a la
          | https://flatgithub.com/storydrivendatasets/white_house_salar....
         | 
         | It gets confused by all of the other stuff afterward,
         | "tree/master/data/wrangled".
         | 
         | Let me know if that gets you sorted!
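The advice above (keep only the owner and repo name) can be sketched as a small helper. The `https://flatgithub.com/<owner>/<repo>` shape is taken from the reply; the function name and the example owner and repo are hypothetical.

```python
from urllib.parse import urlparse


def flat_viewer_url(github_url):
    """Build a flatgithub.com URL from a github.com URL.

    Everything after the owner and repo name (tree/branch/path)
    is dropped, since the viewer reportedly gets confused by it.
    """
    parsed = urlparse(github_url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "github.com" or len(parts) < 2:
        raise ValueError(f"not a github.com repository URL: {github_url}")
    owner, repo = parts[:2]
    return f"https://flatgithub.com/{owner}/{repo}"
```

For example, `flat_viewer_url("https://github.com/example-owner/example-repo/tree/master/data")` yields a URL with just the owner and repo segments.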
        
       | whats_spinning wrote:
       | How large a dataset can this handle?
        
       | trinovantes wrote:
       | I once ran a web scraper on an hourly schedule with GitHub
       | Actions that wrote to a JSON file in my gh-pages branch and
       | saved its results with sh "git commit --amend". Glad to see this
       | workflow in a more integrated environment than my janky hack
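A rough sketch of the pattern described above: write the scraped result to a JSON file, and commit only when the content changed. The function names are hypothetical, the scraping step itself is left out, and this is not how Flat is implemented. The `amend` flag mirrors the `git commit --amend` trick from the comment, which overwrites the previous commit instead of growing the history.

```python
import json
import subprocess
from pathlib import Path


def write_if_changed(path, data):
    """Write data as stable JSON; return True only if the content changed."""
    path = Path(path)
    # sort_keys keeps the output stable so identical data gives identical text
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


def commit_snapshot(repo_dir, filename, amend=False):
    """Stage filename and commit it; amend=True rewrites the last commit."""
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    cmd = ["git", "commit", "-m", f"Update {filename}"]
    if amend:
        cmd.append("--amend")
    subprocess.run(cmd, cwd=repo_dir, check=True)
```

Without `--amend`, every change becomes its own commit, which gives a history of the data over time; with it, the repository only ever holds the latest snapshot.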
        
       | gerner wrote:
       | I don't know much about Flat Data, but I'm impressed with how
       | much GitHub is doing as GitHub since the MSFT acquisition. They
       | continue to offer compelling services to developers, and
       | increasingly to enterprise customers. All without abandoning much
       | of what made GitHub great: a focus on developers and easy to
       | access dev productivity.
       | 
       | Notice the prominence of the VSCode integration here. Notice the
       | dramatically increased presence of MSFT on GitHub in general. It
       | seems like they've managed to integrate these two cultures and
       | product-sets in sensible ways. Given how hard big integrations
       | like this are to pull off, I feel like the community really
       | dodged a bullet in terms of access to products/tools.
        
         | alexander-litty wrote:
         | Dodged a bullet for now.
         | 
         | I'm worried this is their extend-embrace stage, and the
         | extinguish is yet to come.
         | 
         | I truly hate to be pessimistic, and I'm not trying to start a
         | flame war. I just can't see this behavior lasting in the long
         | run.
        
           | pwdisswordfish8 wrote:
            | It's already here, it's just that the userbase and third
           | parties are (happily) doing the dirty work for them. Try
           | going GitHub-free for a month or three and you'll notice how
           | many things rest on the assumption that you have a GitHub
           | account.
           | 
           | Look at how it shat on Markdown with what it calls "GitHub
           | Flavored Markdown". Look at the things that it calls "wikis".
           | Look at how GitHub's PR merge tool junks up the commit log.
           | Look at how many projects don't even have a way to accept a
           | fix unless you submit it with GitHub's janky pull request
           | workflow. Hell, a bug in Netlify's command-line client
           | managed to make its way into release versions that would
           | straight up cause the process to terminate for bog standard
           | "hello world"-style static sites due to unhandled exception
           | when cwd was a repo that wasn't hosted on github.com.
           | 
           | The tacit assumption that you're using GitHub is like the
           | tacit assumption 15 years ago that you were using Visual
           | Studio, and "Log in with GitHub" is essentially what
           | Microsoft hoped for with Passport, if Passport had actually
           | been successful.
        
             | agency wrote:
             | I have no particular love for MSFT but I don't think any of
             | the issues you mentioned began after the acquisition.
        
               | pwdisswordfish8 wrote:
               | ...so?
               | 
               | They acquired a company that was doing the thing that
               | they are wont to do and are criticized for, and have
               | poured the significant resources at their disposal into
               | growing the circle of impact. Where it originates from
               | and whether it was or wasn't already independently in
               | full swing (or partial, in this case) before their
               | involvement doesn't matter, the effect on the user is the
               | same. Besides that, if a person's problem with a given
               | practice is whether or not Microsoft is the perpetrator,
               | then that person is a hypocrite and doesn't actually give
                | a shit about the thing they claim to be concerned
               | about.
        
           | gerner wrote:
           | Agree, it's important that we keep an eye on things and,
           | however we can, hold MSFT and GitHub accountable to keep up
           | the good showing.
           | 
           | We've seen new features launched (e.g. this one) long enough
           | after the acquisition that much (most, all?) of the work
           | happened in the post acquisition environment that I'm
           | optimistic. But I've been wrong before.
        
         | idan wrote:
         | The OCTO DevEx team reaaaaaallly loves VS Code -- beyond the
         | editor, it's just a great surface for experimental developer
         | tooling!
         | 
         | GitHub Codespaces aren't generally available yet, but being
         | able to target both "native" VS Code as well as in-browser VS
         | Code with the same extension is super powerful. Expect a lot
         | more from us on that front.
         | 
         | We've also released a pair of little projects re VS Code
         | development that we've extracted from our work:
         | 
         | https://github.com/githubocto/tailwind-vscode: a Tailwind CSS
         | plugin which creates Tailwind color tokens for each of the VS
         | Code theme colors, easing theme-native styling in VS Code.
         | 
          | https://github.com/githubocto/snowpack-vscode-extension-temp...:
          | a VS Code extension template that incorporates the fastest
          | toolchain with the wisdom we've accumulated about webview
          | development.
        
           | adamcstephens wrote:
           | Can you help me get a Codespaces invite? ;)
        
         | duped wrote:
         | The monthly downtime during working hours has been getting to
         | me lately.
        
       | dataangel wrote:
       | ...they reinvented cron? it just commits a file on a timer
        
         | idan wrote:
         | Correct! And if you're Simon Willison, this is a super easy
         | thing to Just(tm) implement manually.
         | 
         | The point of Flat Data is to push the edges of that bubble
         | outwards. Add tooling and examples. Add a viewer. Make the
         | "happy path" situations where this is helpful really fast and
         | easy.
         | 
         | We're pretty upfront about this not being a major technological
         | advance. The difference between a difficult-to-use API and a
         | good API is usually just about the mental model. We like this
         | mental model, and the kinds of patterns it encourages!
        
       | abuehrle wrote:
       | This is really cool! I would have liked to have incorporated this
       | into my vaccine appointment slot finder tool a few months ago. I
       | like using git commits for change tracking too. Seems not
       | dissimilar (though not identical) to what they're doing at Dolt
       | (https://www.dolthub.com/).
        
         | idan wrote:
         | Yup, there's Dolt, and DVC, and probably a dozen other projects
         | I'm forgetting or haven't heard of. Dat!
         | 
         | There's more than one way to data. We looked at a bunch of
         | them, and the key thing we keep coming back to is git
         | semantics. In many ways, all these other projects attempt to
         | graft git semantics on top of more scalable datastores,
         | allowing you to "fork" your data or roll it back to a given
         | version. Trouble is, these abstractions have subtly different
         | semantics or behaviors. These aren't inherently bad -- just not
         | the same as the ones you know from git.
         | 
         | This approach sacrifices "scalability" in order to let you Just
         | Use Git(tm). It won't work (well) for a larger dataset, but we
         | find that it's useful in a ton of situations.
         | 
         | For example: I have personally shipped bugs to production
         | because my test fixtures had stale example data. I should have
         | remembered to create new fixtures, but I didn't. Flat could
         | have made them for me, on a schedule, subsampling and
         | anonymizing production data as it worked.
         | 
          | It's a subtle difference in application. If your goal is to
         | version $BIGDATA, then Flat isn't the right tool for the job,
         | and you should check out Dolt, DVC &co.
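The fixture example above (subsample and anonymize production data on a schedule) might look roughly like this. It is a sketch under assumptions: the function name and field names are hypothetical, and hashing is pseudonymization rather than true anonymization, so real data would need a proper privacy review.

```python
import hashlib
import random


def make_fixture(rows, sample_size, sensitive_fields, seed=0):
    """Return a small, masked sample of rows suitable for test fixtures.

    rows is a list of dicts. Sensitive fields are replaced with a
    truncated SHA-256 digest (pseudonymization, not anonymization).
    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    fixture = []
    for row in sample:
        clean = dict(row)  # copy so the input rows are left untouched
        for field in sensitive_fields:
            if field in clean:
                raw = str(clean[field]).encode("utf-8")
                clean[field] = hashlib.sha256(raw).hexdigest()[:12]
        fixture.append(clean)
    return fixture
```

Writing the returned list to a JSON file in the repository on a schedule would keep fixtures from going stale, which is the situation the comment describes.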
        
       | FractalHQ wrote:
       | Funny, I'm currently working on a project where I'm fetching post
       | data from a Wordpress backend with a few GQL queries via the
       | WPGraphQL plug-in and `@urql/svelte` to populate a static SSG'd
       | frontend. While developing locally, I copied and pasted the JSON
       | response into a local file in the repo to develop against. I was
       | thinking this would be nice to automate.
       | 
       | If I'm understanding correctly, it seems like this tool more or
       | less automates that process?
       | 
       | Can it send a GQL query?
        
         | idan wrote:
         | This is a really powerful use-case! If you saw Alex Gaynor's
         | election tracker[1] during the US 2020 elections, it's exactly
         | how it worked. Actions scraped the NYT election results.json,
         | and a static site on GH pages rendered the data, XHRing the
         | scraped JSON out of the repo periodically.
         | 
         | There's no GraphQL backend yet! We've only done HTTP and SQL
         | backends so far. If your GQL query is simple enough, you might
         | be able to squeak by with an HTTP flat action whose target is
         | https://your.site/graphql?query=whatever ?
         | 
          | [1] https://alex.github.io/nyt-2020-election-scraper/battlegroun...
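The workaround suggested above needs the GraphQL query encoded into the URL. A minimal sketch, assuming the server accepts GET requests with a `query` parameter (not every GraphQL server does); the function name is hypothetical and the endpoint is the placeholder from the comment.

```python
from urllib.parse import urlencode


def graphql_get_url(endpoint, query):
    """Return endpoint with the query percent-encoded as ?query=..."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Example with a made-up query; braces and spaces get encoded.
url = graphql_get_url("https://your.site/graphql", "{ posts { title } }")
```

The resulting URL could then be used as the target of an HTTP fetch, which is what the comment proposes for simple queries.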
        
         | simonw wrote:
         | If you want to run GraphQL queries against this kind of data I
         | have a roundabout way of doing it:
         | 
         | 1. Set up a repo that uses actions to scrape data into a CSV
         | 
         | 2. Set up another action that converts that CSV to SQLite
         | (using my sqlite-utils tool) and then...
         | 
         | 3. Publishes that database to Cloud Run or Vercel with
         | Datasette and with the datasette-graphql plugin
         | 
         | Here's an example repo that does exactly that:
         | https://github.com/simonw/cdc-vaccination-history
         | 
          | It scrapes vaccination data from the CDC, compiles that into a
          | SQLite database and publishes it using Datasette on Vercel at
          | https://cdc-vaccination-history.datasette.io/
          | 
          | Then you can run GraphQL queries at
          | https://cdc-vaccination-history.datasette.io/graphql
          | 
          | (Here's the plugin:
          | https://datasette.io/plugins/datasette-graphql)
         | 
         | Another demo: https://covid-19.datasettes.com/graphql runs from
         | this repo: https://github.com/simonw/covid-19-datasette
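Step 2 of the list above (CSV to SQLite) can be sketched with only the Python standard library. This is a stand-in for illustration, not the sqlite-utils tool the comment refers to; the function name, table name, and columns are hypothetical, and every column is stored as text.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(csv_text, conn, table):
    """Load CSV text into a SQLite table and return the number of rows.

    The first CSV line is used as the column names. All values are
    stored as TEXT; no type detection is attempted in this sketch.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

With a file-backed connection such as `sqlite3.connect("data.db")`, the resulting database file is what a later step would publish.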
        
       | bob1029 wrote:
       | I am sensing some interesting capabilities here, but also get the
       | impression that this is more about denormalized views of data
       | (JSON/CSV/etc) than anything else. It's also in the name -
       | 'Flat'.
       | 
       | Perhaps it is actually supported and I can't read properly, but I
       | feel like you are just 1 tiny step away from allowing someone to
       | write one of these things such that it can ETL any arbitrary data
       | source into a SQLite database (i.e. many tables). There's not a
       | whole lot of difference between CSV and SQLite when it comes to
       | repository file management. Granted, SQLite databases would
       | present as opaque blobs at code review time, but this is
       | something we can tolerate because you still get all of the nice
       | versioning & project consistency. Hell, you could probably write
       | a special GitHub-branded diff viewer that allows you to compare 2
       | different SQLite databases, schema & all.
       | 
       | SQLite in general is such a force to be reckoned with. You could
       | do a lot of damage (in a good way) with product features built up
       | around the most popular database engine on earth.
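The diff viewer imagined above would need, at minimum, a way to compare the schemas of two SQLite databases. A rough sketch of that one piece, using the `sqlite_master` table; the function names are hypothetical and row-level data comparison is not covered.

```python
import sqlite3


def table_schemas(conn):
    """Map each table name to the CREATE TABLE statement SQLite stored."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return {name: sql for name, sql in rows}


def diff_schemas(old_conn, new_conn):
    """Report tables added, removed, or changed between two databases."""
    old, new = table_schemas(old_conn), table_schemas(new_conn)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(t for t in old.keys() & new.keys() if old[t] != new[t]),
    }
```

Because the comparison is on the stored SQL text, a table that was recreated with different whitespace would also show up as changed.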
        
       | nt2h9uh238h wrote:
       | I'm actually very excited about it. Could start a new era of how
       | we develop and work with data.
        
       | idan wrote:
       | Hi HN! Our team has loved building this, as well as all of the
       | storytelling and examples. We'd love your feedback!
        
         | everybodyknows wrote:
         | The screen videos are interesting, but too fast to follow, and
         | make reading the accompanying text impossible for those of us
         | with fragile concentration. Reader View (Safari) drops the
         | screen imagery entirely, so goes too far the other way.
         | 
         | How about a video pause/seek control?
        
           | idan wrote:
           | Hey! This is a great callout, we'll think about how to make
           | it better!
        
       | dariosalvi78 wrote:
       | nice idea, but exploring the data is very limited. Would be even
       | better if it had some sort of query language and maybe an API.
        
       ___________________________________________________________________
       (page generated 2021-05-18 23:01 UTC)