[HN Gopher] Architecture Notes: Datasette ___________________________________________________________________ Architecture Notes: Datasette Author : pcr910303 Score : 216 points Date : 2022-05-27 15:25 UTC (7 hours ago) (HTM) web link (architecturenotes.co) (TXT) w3m dump (architecturenotes.co) | simonw wrote: | This is the first issue of the new Architecture Notes publication | (website and newsletter) run by Mahdi Yusuf. | | Mahdi interviewed me about the architecture of | https://datasette.io/ - topics we covered include: | | - Building a modern Python app using ASGI | | - Benefits of SQLite | | - Designing plugin hooks | | - Safely allowing SQL injection | | - Using SQLite from asyncio | | - The Baked Data architectural pattern | | - Bundling a Python web application in Electron | | - Packaging a Python for WebAssembly | samwillis wrote: | Simon, I have a wish list item for Datasette, pagination of ad- | hock queries. I know thats a really difficult thing to | implement as it would require parsing and altering the sql | query, but with large datasets it would be so useful! | | (Love Datasette!) | simonw wrote: | I want this too! | | Datasette avoids offset/limit pagination because it performs | poorly on huge queries - and I don't want random visitors | (and crawlers) hurting performance of a public instance by | crawling through offset/limit of thousands of pages. | | That's why table pages implement keyset pagination instead - | so you can do https://congress- | legislators.datasettes.com/legislators/legi... and get back | records following the one with A000106, which is a fast query | because the ID column has an index on it. | | Supporting this with arbitrary queries is harder. One idea I | had is to allow the user to specify which column and sort | order should be used for keyset pagination - so you could | construct a URL like this: /?sql=select+*+f | rom+legislators+order+by+id&_pagination_column=id | | If a pagination column has been specified, Datasette would | use the same trick it uses on regular table pages and add | next links that way. | | Would that work for you? | | The other, probably easier option is a setting that enables | offset/limit pagination of arbitrary SQL queries - turned off | by default, but easy to turn on for users who are running | Datasette on a private server. If that takes several seconds | people can at least opt into it. | samwillis wrote: | Either would work, but as you said the latter is easer to | implement and would have done the job for what I was | working on. If I had had more time I would have looking | into trying to do it myself, but there is never enough | time... | prepend wrote: | What can government data providers like data.cdc.gov do to | support datasette better? Is it expected that these providers | would make SQLite distros of data? Or do you think people will | just suck stuff down in csv and convert it to datasette to make | it easier to work with and reproduce the work of others? | simonw wrote: | My ultimate dream is that providers like that will use | Datasette itself - or a similar system that has the same | characteristics: make it really easy for people to slice and | dice the data using querystring parameters or even SQL | queries and get back out just the data they want as JSON, CSV | and other formats. | | I do think that SQLite is a really interesting format for | publishing data, and I'd love to see more places publish raw | SQLite files. It's much better at preserving things like | column type information and relationships between tables than | CSV is. | googletron wrote: | Hello creator of the project. This entire project was motivated | by another HN post where someone was asking where we can get | information about architectures and system designs across our | industry. Seeing there wasn't anything available, I decided to | start it; glad to see its resonated with the community. | | I hope to create high quality posts from engineers that work on | these systems and the challenges they are trying to solve. Really | dig into the problems and the technologies and strategies they | used to solve it. | | What they have learned? Where they have failed? | | I also plan to write up deep technical dives on technologies we | all rely on and use everyday. | | If there is any feedback to improve or you have something you | want to write up with me please reach out. | eatonphil wrote: | By creator of the project I assume you mean creator of the | blog/newsletter, not creator of Datasette? | googletron wrote: | Correct. :) | singhrac wrote: | Hey, I love this. Please keep doing it, we need more of this | kind of high quality writing available on the web. When I was | just getting started programming, I read the AOSA guides, but | they were considerably harder to read than this. | | There are so many things I love about this: the art style, the | large font, the summary image, etc. Great work. | googletron wrote: | We worked hard to make very approachable, explain concepts | people assume people are aware of. This means a lot and hope | to continue producing these and more. | singhrac wrote: | Is there a way for me to sponsor or donate? I know the most | limiting resource is probably your time but let me know if | there's we can do to help. | googletron wrote: | We have paid membership that currently doesn't have much | yet.. https://architecturenotes.co/membership/ | | So you can support there though. Do appreciate it. Hoping | to do more technical dives on technologies and use cases | here soon. | pc86 wrote: | Looks great so far, one nit pick is the menu isn't playing | well[0] for Brave Version 1.39.111 Chromium: 102.0.5005.61 | (Official Build) (arm64) | | [0] https://imgur.com/a/WSEVCFG | googletron wrote: | Thanks for the feedback. I will look into it. | lelandfe wrote: | Kill the z-index properties entirely on `.toc-wrap` and | `.toc` to make it play nicely with full width desktop | images :) | | (or maybe make intentional with e.g. a big blurry box- | shadow) | | Also: thanks for the resource! I felt incredibly lost doing | system design interviews over the last few months, this | seems like a fantastic site. | googletron wrote: | Thanks for the help there. I am struggling my way through | that. | | I am glad it can be of help and plan to do much more. :) | klooney wrote: | There's some prior art: http://aosabook.org/en/index.html | usrme wrote: | Thanks for doing this! The only things I noticed is that in | this image (https://architecturenotes.co/content/images/2022/05 | /Datasett...) the word "acquire" is misspelled and near the end | GitHub is written as "Github" with incorrect casing, despite it | being correct earlier on. | googletron wrote: | Good catch. I will revise this. Thanks for pointing it out. | googletron wrote: | Here is the sketch note of the entire post as well. | | https://architecturenotes.co/content/images/size/w1600/2022/... | emadda wrote: | You can load your Stripe account into a Datasette instance using | this: | | https://github.com/tabledog/datasette-stripe | jrvarela56 wrote: | Congrats on the launch and what an awesome first guest! Looking | fwd to more posts :D | beebmam wrote: | I have a hard time believing people feel comfortable using Python | applications in production for anything other than prototyping. I | have seen some shit over my career, across many kinds of | interpreted languages, that will never let me approve of that. | | Unless one simply doesn't care about runtime quality. | simonw wrote: | Believe it. I've spent almost my entire career running Python | applications in production, as have many of my friends, and | many large companies that I've worked for or worked with. | | Given the number of terrible, buggy sites I've seen built using | Java or .NET I personally have trouble believing companies run | those in production, but evidently they do! | simonw wrote: | A slightly less snarky answer: the thing I care about isn't | the language, it's the process and environment around the | project. | | If I'm going to put something in production, I want it to | have: | | - Comprehensive tests, protected by CI | | - Thorough, up-to-date documentation | | - Code that lives in version control, with good commit | messages that help answer "why" questions about how it works | | - Good development environments | | - A robust deployment process | | The language influences these in as much as different | languages have different cultures and tooling around them, | but conceptually they are pretty language agnostic. | | I know how to do all of these things well in Python, which is | why I tend to continue to spend my time in Python land. | googletron wrote: | I am surprised how this hot take still has legs. World's | biggest site's use Python. | sontek wrote: | Yeah, it'd be crazy if sites like Reddit, SurveyMonkey, | Dropbox, Spotify, Instagram, Pinterest, Lyft, and Sentry were | built on a silly little prototyping language like Python | instead of a _real_ programming language. Right? | noSyncCloud wrote: | What about production apps running on scripting languages like | JS? | getpost wrote: | Production sites like YouTube, Instagram, Netflix, Reddit, ...? | https://www.botreetechnologies.com/blog/top-15-websites-buil... | redredrobot wrote: | Datasette is pretty cool. | | But AFAICT, it just doesn't scale whatsoever. That SQLite db is | both the dataset index and the dataset content combined, right? | So you're limited by how big that SQLite db can realistically be. | The docs say "share data of any shape or any size", but AFAICT it | can't handle large datasets containing large unstructured data | like images and video and multi-billion data point datasets are | hard to store in a single machine/file. | | Not really a criticism, but more wondering if there are scale | optimizations in Datasette I'm not aware of since the docs do say | any shape or size. | wswope wrote: | > AFAICT it can't handle large datasets containing large | unstructured data like images and video and multi-billion data | point datasets are hard to store in a single machine/file | | Images and videos can easily be yeeted in as binary blobs (same | as with any other standard DB), and SQLite DBs scale into the | hundreds of TB range as a single file. Are you comparing the | single file strategy to something like a sharded cluster of | DBs, or is your thought that a DB that stores objects as | independent files is somehow superior? | simonw wrote: | You're right, Datasette isn't the right tool for sharing | billion point datasets (actually low-billions might be OK if | each row is small enough). | | I think of Datasette as a tool for working with "small data" - | where I define small data as data that will fit on a USB stick, | or on my phone. | | My iPhone has a TB of storage these days, so small data can get | you a very long way! | | Using it for unstructured image and video would work fine using | the pattern where those binary files live somewhere like S3 and | the Datasette instance exposes URLs to them. I should find | somewhere in the documentation to talk about that. | | But yes, I should probably take "of any size" off the homepage, | it does give a misleading impression. | simonw wrote: | Opened an issue exploring alternatives here: | https://github.com/simonw/datasette.io/issues/109 | | I decided to just drop "any size" but keep "any shape". | bspammer wrote: | Very interesting idea to use GPT3 as a starting point for | rewording text. I can see it being an effective way to | break writer's block. | samwillis wrote: | Not quite the scale you are suggesting but I used it with a 7gb | 20m row dataset and it worked incredible well. | redredrobot wrote: | Yeah - it's probably unfair of me to say it doesn't scale at | all. But between large data and 2 extra orders of magnitudes | of rows, the single SQLite file approach quickly breaks down, | even if you don't store the large content in-db. | jandrese wrote: | I was slightly disappointed that this wasn't an article | describing the technical details of systems that store data on | cassette tapes. | jmbwell wrote: | This is the second time this 'datassette' has tricked me for | the same reason. | tesseract wrote: | Here's one then (in German): | https://www.c64-wiki.de/wiki/Aufzeichnungsformat_der_Datasse... ___________________________________________________________________ (page generated 2022-05-27 23:00 UTC)