[HN Gopher] Architecture Notes: Datasette
       ___________________________________________________________________
        
       Architecture Notes: Datasette
        
       Author : pcr910303
       Score  : 216 points
       Date   : 2022-05-27 15:25 UTC (7 hours ago)
        
 (HTM) web link (architecturenotes.co)
 (TXT) w3m dump (architecturenotes.co)
        
       | simonw wrote:
       | This is the first issue of the new Architecture Notes publication
       | (website and newsletter) run by Mahdi Yusuf.
       | 
       | Mahdi interviewed me about the architecture of
       | https://datasette.io/ - topics we covered include:
       | 
       | - Building a modern Python app using ASGI
       | 
       | - Benefits of SQLite
       | 
       | - Designing plugin hooks
       | 
       | - Safely allowing SQL injection
       | 
       | - Using SQLite from asyncio
       | 
       | - The Baked Data architectural pattern
       | 
       | - Bundling a Python web application in Electron
       | 
       | - Packaging a Python for WebAssembly
        
         | samwillis wrote:
         | Simon, I have a wish list item for Datasette, pagination of ad-
         | hock queries. I know thats a really difficult thing to
         | implement as it would require parsing and altering the sql
         | query, but with large datasets it would be so useful!
         | 
         | (Love Datasette!)
        
           | simonw wrote:
           | I want this too!
           | 
           | Datasette avoids offset/limit pagination because it performs
           | poorly on huge queries - and I don't want random visitors
           | (and crawlers) hurting performance of a public instance by
           | crawling through offset/limit of thousands of pages.
           | 
           | That's why table pages implement keyset pagination instead -
           | so you can do https://congress-
           | legislators.datasettes.com/legislators/legi... and get back
           | records following the one with A000106, which is a fast query
           | because the ID column has an index on it.
           | 
           | Supporting this with arbitrary queries is harder. One idea I
           | had is to allow the user to specify which column and sort
           | order should be used for keyset pagination - so you could
           | construct a URL like this:                   /?sql=select+*+f
           | rom+legislators+order+by+id&_pagination_column=id
           | 
           | If a pagination column has been specified, Datasette would
           | use the same trick it uses on regular table pages and add
           | next links that way.
           | 
           | Would that work for you?
           | 
           | The other, probably easier option is a setting that enables
           | offset/limit pagination of arbitrary SQL queries - turned off
           | by default, but easy to turn on for users who are running
           | Datasette on a private server. If that takes several seconds
           | people can at least opt into it.
        
             | samwillis wrote:
             | Either would work, but as you said the latter is easer to
             | implement and would have done the job for what I was
             | working on. If I had had more time I would have looking
             | into trying to do it myself, but there is never enough
             | time...
        
         | prepend wrote:
         | What can government data providers like data.cdc.gov do to
         | support datasette better? Is it expected that these providers
         | would make SQLite distros of data? Or do you think people will
         | just suck stuff down in csv and convert it to datasette to make
         | it easier to work with and reproduce the work of others?
        
           | simonw wrote:
           | My ultimate dream is that providers like that will use
           | Datasette itself - or a similar system that has the same
           | characteristics: make it really easy for people to slice and
           | dice the data using querystring parameters or even SQL
           | queries and get back out just the data they want as JSON, CSV
           | and other formats.
           | 
           | I do think that SQLite is a really interesting format for
           | publishing data, and I'd love to see more places publish raw
           | SQLite files. It's much better at preserving things like
           | column type information and relationships between tables than
           | CSV is.
        
       | googletron wrote:
       | Hello creator of the project. This entire project was motivated
       | by another HN post where someone was asking where we can get
       | information about architectures and system designs across our
       | industry. Seeing there wasn't anything available, I decided to
       | start it; glad to see its resonated with the community.
       | 
       | I hope to create high quality posts from engineers that work on
       | these systems and the challenges they are trying to solve. Really
       | dig into the problems and the technologies and strategies they
       | used to solve it.
       | 
       | What they have learned? Where they have failed?
       | 
       | I also plan to write up deep technical dives on technologies we
       | all rely on and use everyday.
       | 
       | If there is any feedback to improve or you have something you
       | want to write up with me please reach out.
        
         | eatonphil wrote:
         | By creator of the project I assume you mean creator of the
         | blog/newsletter, not creator of Datasette?
        
           | googletron wrote:
           | Correct. :)
        
         | singhrac wrote:
         | Hey, I love this. Please keep doing it, we need more of this
         | kind of high quality writing available on the web. When I was
         | just getting started programming, I read the AOSA guides, but
         | they were considerably harder to read than this.
         | 
         | There are so many things I love about this: the art style, the
         | large font, the summary image, etc. Great work.
        
           | googletron wrote:
           | We worked hard to make very approachable, explain concepts
           | people assume people are aware of. This means a lot and hope
           | to continue producing these and more.
        
             | singhrac wrote:
             | Is there a way for me to sponsor or donate? I know the most
             | limiting resource is probably your time but let me know if
             | there's we can do to help.
        
               | googletron wrote:
               | We have paid membership that currently doesn't have much
               | yet.. https://architecturenotes.co/membership/
               | 
               | So you can support there though. Do appreciate it. Hoping
               | to do more technical dives on technologies and use cases
               | here soon.
        
         | pc86 wrote:
         | Looks great so far, one nit pick is the menu isn't playing
         | well[0] for Brave Version 1.39.111 Chromium: 102.0.5005.61
         | (Official Build) (arm64)
         | 
         | [0] https://imgur.com/a/WSEVCFG
        
           | googletron wrote:
           | Thanks for the feedback. I will look into it.
        
             | lelandfe wrote:
             | Kill the z-index properties entirely on `.toc-wrap` and
             | `.toc` to make it play nicely with full width desktop
             | images :)
             | 
             | (or maybe make intentional with e.g. a big blurry box-
             | shadow)
             | 
             | Also: thanks for the resource! I felt incredibly lost doing
             | system design interviews over the last few months, this
             | seems like a fantastic site.
        
               | googletron wrote:
               | Thanks for the help there. I am struggling my way through
               | that.
               | 
               | I am glad it can be of help and plan to do much more. :)
        
         | klooney wrote:
         | There's some prior art: http://aosabook.org/en/index.html
        
         | usrme wrote:
         | Thanks for doing this! The only things I noticed is that in
         | this image (https://architecturenotes.co/content/images/2022/05
         | /Datasett...) the word "acquire" is misspelled and near the end
         | GitHub is written as "Github" with incorrect casing, despite it
         | being correct earlier on.
        
           | googletron wrote:
           | Good catch. I will revise this. Thanks for pointing it out.
        
       | googletron wrote:
       | Here is the sketch note of the entire post as well.
       | 
       | https://architecturenotes.co/content/images/size/w1600/2022/...
        
       | emadda wrote:
       | You can load your Stripe account into a Datasette instance using
       | this:
       | 
       | https://github.com/tabledog/datasette-stripe
        
       | jrvarela56 wrote:
       | Congrats on the launch and what an awesome first guest! Looking
       | fwd to more posts :D
        
       | beebmam wrote:
       | I have a hard time believing people feel comfortable using Python
       | applications in production for anything other than prototyping. I
       | have seen some shit over my career, across many kinds of
       | interpreted languages, that will never let me approve of that.
       | 
       | Unless one simply doesn't care about runtime quality.
        
         | simonw wrote:
         | Believe it. I've spent almost my entire career running Python
         | applications in production, as have many of my friends, and
         | many large companies that I've worked for or worked with.
         | 
         | Given the number of terrible, buggy sites I've seen built using
         | Java or .NET I personally have trouble believing companies run
         | those in production, but evidently they do!
        
           | simonw wrote:
           | A slightly less snarky answer: the thing I care about isn't
           | the language, it's the process and environment around the
           | project.
           | 
           | If I'm going to put something in production, I want it to
           | have:
           | 
           | - Comprehensive tests, protected by CI
           | 
           | - Thorough, up-to-date documentation
           | 
           | - Code that lives in version control, with good commit
           | messages that help answer "why" questions about how it works
           | 
           | - Good development environments
           | 
           | - A robust deployment process
           | 
           | The language influences these in as much as different
           | languages have different cultures and tooling around them,
           | but conceptually they are pretty language agnostic.
           | 
           | I know how to do all of these things well in Python, which is
           | why I tend to continue to spend my time in Python land.
        
         | googletron wrote:
         | I am surprised how this hot take still has legs. World's
         | biggest site's use Python.
        
         | sontek wrote:
         | Yeah, it'd be crazy if sites like Reddit, SurveyMonkey,
         | Dropbox, Spotify, Instagram, Pinterest, Lyft, and Sentry were
         | built on a silly little prototyping language like Python
         | instead of a _real_ programming language. Right?
        
         | noSyncCloud wrote:
         | What about production apps running on scripting languages like
         | JS?
        
         | getpost wrote:
         | Production sites like YouTube, Instagram, Netflix, Reddit, ...?
         | https://www.botreetechnologies.com/blog/top-15-websites-buil...
        
       | redredrobot wrote:
       | Datasette is pretty cool.
       | 
       | But AFAICT, it just doesn't scale whatsoever. That SQLite db is
       | both the dataset index and the dataset content combined, right?
       | So you're limited by how big that SQLite db can realistically be.
       | The docs say "share data of any shape or any size", but AFAICT it
       | can't handle large datasets containing large unstructured data
       | like images and video and multi-billion data point datasets are
       | hard to store in a single machine/file.
       | 
       | Not really a criticism, but more wondering if there are scale
       | optimizations in Datasette I'm not aware of since the docs do say
       | any shape or size.
        
         | wswope wrote:
         | > AFAICT it can't handle large datasets containing large
         | unstructured data like images and video and multi-billion data
         | point datasets are hard to store in a single machine/file
         | 
         | Images and videos can easily be yeeted in as binary blobs (same
         | as with any other standard DB), and SQLite DBs scale into the
         | hundreds of TB range as a single file. Are you comparing the
         | single file strategy to something like a sharded cluster of
         | DBs, or is your thought that a DB that stores objects as
         | independent files is somehow superior?
        
         | simonw wrote:
         | You're right, Datasette isn't the right tool for sharing
         | billion point datasets (actually low-billions might be OK if
         | each row is small enough).
         | 
         | I think of Datasette as a tool for working with "small data" -
         | where I define small data as data that will fit on a USB stick,
         | or on my phone.
         | 
         | My iPhone has a TB of storage these days, so small data can get
         | you a very long way!
         | 
         | Using it for unstructured image and video would work fine using
         | the pattern where those binary files live somewhere like S3 and
         | the Datasette instance exposes URLs to them. I should find
         | somewhere in the documentation to talk about that.
         | 
         | But yes, I should probably take "of any size" off the homepage,
         | it does give a misleading impression.
        
           | simonw wrote:
           | Opened an issue exploring alternatives here:
           | https://github.com/simonw/datasette.io/issues/109
           | 
           | I decided to just drop "any size" but keep "any shape".
        
             | bspammer wrote:
             | Very interesting idea to use GPT3 as a starting point for
             | rewording text. I can see it being an effective way to
             | break writer's block.
        
         | samwillis wrote:
         | Not quite the scale you are suggesting but I used it with a 7gb
         | 20m row dataset and it worked incredible well.
        
           | redredrobot wrote:
           | Yeah - it's probably unfair of me to say it doesn't scale at
           | all. But between large data and 2 extra orders of magnitudes
           | of rows, the single SQLite file approach quickly breaks down,
           | even if you don't store the large content in-db.
        
       | jandrese wrote:
       | I was slightly disappointed that this wasn't an article
       | describing the technical details of systems that store data on
       | cassette tapes.
        
         | jmbwell wrote:
         | This is the second time this 'datassette' has tricked me for
         | the same reason.
        
         | tesseract wrote:
         | Here's one then (in German):
         | https://www.c64-wiki.de/wiki/Aufzeichnungsformat_der_Datasse...
        
       ___________________________________________________________________
       (page generated 2022-05-27 23:00 UTC)