[HN Gopher] File Attachments: Databases can now store files and ...
___________________________________________________________________

File Attachments: Databases can now store files and images

Author : todsacerdoti
Score  : 90 points
Date   : 2023-08-30 16:21 UTC (6 hours ago)

(HTM) web link (xata.io)
(TXT) w3m dump (xata.io)

| excalibur wrote:
| This sounds useful for malware persistence after a backup
| recovery.

| pmpjr wrote:
| [flagged]

| pmpjr wrote:
| [flagged]

| TOGoS wrote:
| Skimmed the article, grepped for 'hash', didn't find it.
|
| > they are stored in AWS S3
|
| Not sure why anyone would feel the need to couple this sort of
| thing to S3. Just store file hashes in the database, and have a
| blobstore mapping hashes to content. I like bitprint URNs for
| this purpose [1], but any strong cryptographic hash function
| will do. Then you can use any old datastore you want, be it a
| plain old filesystem, S3, a Git repo, or something else. It's
| trivial to bolt on alternate backends because you can always
| guarantee that what you asked for is what you got. If not, try
| the next one.
|
| [1] http://www.nuke24.net/docs/2015/HashURNs.html
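A minimal sketch of the hash-addressed scheme TOGoS describes, with
a local directory standing in for the blobstore (SHA-256 instead of
bitprint URNs; the store path and function names are illustrative
assumptions, not from the thread):

    import hashlib
    from pathlib import Path

    STORE = Path("/var/blobstore")  # could be S3, a Git repo, etc.

    def put(content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        path = STORE / digest[:2] / digest  # fan out to avoid huge dirs
        if not path.exists():               # content-addressed: write once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(content)
        return digest                       # store this string in the DB row

    def get(digest: str) -> bytes:
        content = (STORE / digest[:2] / digest).read_bytes()
        # Integrity check: what you asked for is what you got,
        # so alternate backends are interchangeable.
        assert hashlib.sha256(content).hexdigest() == digest
        return content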
| tudorg wrote:
| hey HN, and thank you todsacerdoti for posting it!
|
| This is a feature that we wanted for a long time, but we also
| wanted to get it right. It's roughly the equivalent of storing
| files and images in an S3 bucket and then putting URLs to them
| in the database, except that you'd then have to keep things in
| sync, take care of security permissions, delete the files when
| the database row is deleted, etc. We automate all of that for
| you, and more: we cache the files in a CDN, and if they are
| images we support transformations out of the box.

| adrianmsmith wrote:
| I think for smaller projects just storing images as BLOBs in
| e.g. PostgreSQL works quite well.
|
| I know this is controversial. People will claim it's bad for
| performance.
|
| But that's only bad if performance is something you're having
| problems with, or going to have problems with. If the system
| you're working on is small, and going to stay small, then it
| doesn't matter. Not all systems are like this, but some are.
|
| Storing the images directly in the database has a lot of
| advantages. They share a transactional context with the other
| updates you're doing. They get backed up at the same time, and
| if you do a "point in time" restore they're consistent with the
| rest of your data. And there's no additional infrastructure to
| manage to store the files.

| andyp-kw wrote:
| I've seen companies store PDF invoices in the database too, for
| the same reasons you spoke of.

| lovasoa wrote:
| How big does the database need to be for large BLOBs to become
| a problem? "Big" and "small" are quite subjective terms.
|
| How many BLOBs does one need to have, and how often do we need
| to touch them, for this solution to become untenable?

| spiffytech wrote:
| One datapoint on BLOB performance: _SQLite: 35% Faster Than
| The Filesystem_
|
| https://www.sqlite.org/fasterthanfs.html

| crabbone wrote:
| In storage you measure things in blocks. Historically, blocks
| were 512 bytes; today the tendency is to make them bigger, and
| 4K would be the typical size in a server setting.
|
| So, the idea here is this: databases that store structured
| information, i.e. those that need to store integers, booleans,
| and short strings, are typically something like relational
| databases, e.g. PostgreSQL. Filesystems (e.g. Ext4) also think
| in whole blocks, but are designed with an eye toward smaller
| files, i.e. files aren't expected to be more than some tens or
| hundreds of blocks in size for optimal performance. Object
| stores (e.g. S3) are the kinds of storage systems that are
| supposed to work well for anything larger than typical files.
|
| This suggests an answer to your question: blobs in a relational
| database are probably OK if they are under one block big.
| Databases will probably be able to handle bigger ones too, but
| you will start seeing serious drops in performance when it
| comes to indexing, filtering, searching etc., because such
| systems optimize internal memory buffers so that they fit a
| "perfect" number of elements of the "perfect" size.
|
| Another concern is that with stored elements larger than a
| single block you need a different approach to parallelism.
| Ultimately, the number of blocks touched by an I/O operation
| determines its performance. If you are reading/writing sub-
| block-sized elements, you try to make it so that they come from
| the same block, to minimize the number of requests made to the
| physical storage. If you work with multi-block elements, your
| approach to performance optimization is different -- you
| pre-fetch the "neighbor" blocks because you expect to need them
| soon. Modern storage hardware has a decent degree of
| parallelism that allows you to queue multiple I/O requests
| without awaiting completion. This latter mechanism is a lot
| less relevant to something like an RDBMS, but is at the heart
| of an object store.
|
| In other words: the problem is not a function of the size of
| the database. In principle, nothing stops e.g. PostgreSQL from
| special-casing blobs and dealing with them differently than it
| normally does with "small" objects... but it probably isn't
| interested in doing so, because you already have appropriate
| storage for that kind of stuff, and PostgreSQL, like most other
| RDBMSs, sits _on top of_ the storage for larger objects (the
| filesystem), so it has little hope of doing better than the
| layer below it.

| brazzy wrote:
| Most of what you wrote there is simply not true for modern
| DBMSs. Specifically, PostgreSQL has a mechanism called TOAST
| (https://www.enterprisedb.com/postgres-
| tutorials/postgresql-t...) that does exactly what you claim
| "they aren't probably interested in doing", and completely
| eliminates any performance penalty of large objects in a table
| when they are not used.

| doubled112 wrote:
| I've seen BLOBs in an Oracle DB used to store Oracle install
| ISOs, which I think is ironic on some level.
|
| Let's attach what we're using to the ticket. All of it. Why is
| the ticketing DB huge? Well, you attached all of it.

| rajman187 wrote:
| Several years ago Walmart dramatically sped up their online
| store's performance by storing images as blobs in their
| distributed Cassandra cluster.
|
| https://medium.com/walmartglobaltech/building-object-store-s...

| mohamedattahri wrote:
| BLOBs on Postgres are awesome, but there's also a full-featured
| file API called Large Objects for when the use case requires
| seeking/streaming.
|
| Wrote a small Go library to interface with it:
| https://github.com/mohamedattahri/pgfs
|
| Large Objects:
| https://www.postgresql.org/docs/current/largeobjects.html

| dventimi wrote:
| Large Objects work great for images and stylesheets and
| other file assets for my small PostgREST projects.

| ibgeek wrote:
| TIL, thanks!
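For reference, a minimal sketch of the Large Objects API
mohamedattahri mentions, here via Python's psycopg2 rather than the
pgfs Go library (connection string and file names are assumptions):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = False  # large objects must live inside a transaction

    # Write: create a new large object and stream data into it.
    lobj = conn.lobject(0, "wb")          # oid=0 -> let the server assign one
    with open("invoice.pdf", "rb") as f:
        while chunk := f.read(64 * 1024):
            lobj.write(chunk)
    oid = lobj.oid                        # store this OID in a regular column
    lobj.close()
    conn.commit()

    # Read: seek and stream without loading the whole file into memory.
    lobj = conn.lobject(oid, "rb")
    lobj.seek(0)
    first_kb = lobj.read(1024)
    lobj.close()
    conn.commit()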
| bastawhiz wrote:
| I think the one big problem with BLOBs, especially if you have
| a heavily read-biased DB, is that you're going to run up
| against bandwidth/throughput as a bottleneck. One of the DBs I
| help maintain has some very large JSON columns, and we
| frequently see this problem when traffic is at its peak: simply
| pulling the data down from Postgres is the problem.
|
| If the data is frequently accessed, it also means there are
| extra hops the data has to take before getting to the user.
| It's a lot faster to pull static files from S3 or a CDN (or
| even just a dumb static file server) than it is to round-trip
| through your application to the DB and back. For one, it's
| almost impossible to stream the response, so the whole BLOB
| needs to be copied in memory in each system it passes through.
|
| It's rare that any request for, say, user data would _also_
| return the user avatar, so you ultimately end up with one
| endpoint for structured data and one to serve binary BLOB
| data, which have very little overlap except for ACL stuff, but
| signed S3 URLs will get you the same security properties with
| much better performance overall.

| refulgentis wrote:
| Do you have any thoughts on when a JSON column is too large?
| I've been wondering about the tradeoffs between a jsonb column
| in Postgres with values that may reach 10 MB at the extreme,
| but are usually just 100 KB, versus using S3.

| cryptonector wrote:
| > But that's only bad if performance is something you're
| having problems with, or going to have problems with.
|
| The first part is easy enough, but how do you predict when
| you're going to hit a knee in the system's performance?

| SigmundA wrote:
| The main issues with large blobs in a DB are moving and/or
| deleting them. I'm not sure if PG has this issue (I'd like to
| confirm), but in, say, SQL Server, if you delete a blob or try
| to move it to another file group, you get transaction logging
| equal to the size of the blobs.
|
| You would think that at least with a delete, especially in PG
| with the way it handles MVCC, there would just be an entry
| saying which blob was deleted; then, if you need rollback, you
| just undelete the blob.
|
| So merely being able to move and delete the data after it is
| in there becomes a real problem once there is a lot of it.

| vbezhenar wrote:
| It makes backups a PITA. I migrated blobs to S3 and went from
| backing up once a month to once a day. The database is now
| very slim. HDD space is no longer an issue. I can delete the
| code which serves files. Lots of improvements with no
| downsides so far.

| crooked-v wrote:
| Of course, that also means that now you don't have backups of
| your files.

| 0x457 wrote:
| They do have backups if versioning is enabled on S3.

| ericbarrett wrote:
| Yes, S3 is extremely reliable and versioning protects
| against "oopsies." I do always recommend disallowing
| s3:DeleteObjectVersion for all IAM roles, and/or as a
| bucket policy; manage old versions with a lifecycle
| policy instead. This will protect against programmatic
| deletion by anybody except the root account.

| brazzledazzle wrote:
| I second these sensible protections. I'd also recommend
| replicating your bucket as a DR measure, ideally to another
| (infrequently used and separately secured) account. Even
| better if you do it to another region, but I think another
| account is more important, since the chances of your account
| getting owned are higher than the chances of Amazon suffering
| a total region failure.
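A sketch of the setup ericbarrett recommends, using boto3 (the
bucket name and the 90-day window are assumptions; the IAM or
bucket-policy deny on s3:DeleteObjectVersion would be configured
separately):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-app-blobs"  # hypothetical

    # Turn on versioning so overwrites and deletes keep old versions.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Expire noncurrent versions via lifecycle instead of manual deletes.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-old-versions",
                "Filter": {"Prefix": ""},   # apply to the whole bucket
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }]
        },
    )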
| [deleted]

| [deleted]

| ltbarcly3 wrote:
| I think there is too much emphasis on 'the right way'
| sometimes, which leads people to skip over an analysis of the
| problem they are solving right now.
|
| For example, if you are building something that will have
| thousands or millions of records, and storing small binaries
| in PostgreSQL lets you avoid integrating with S3 at all, then
| you should seriously consider doing it. The simplification
| gain almost certainly pays for any 'badness' and then some.
|
| If you are working on something that you hope to scale to
| millions of users, then you should just bite the bullet and
| integrate with S3 or something, because you will use too much
| DB capacity storing binaries (assuming they aren't like 50
| bytes each or something remarkably small), and that will force
| you to scale out the database far before you would otherwise
| be forced to do so.

| sorintm wrote:
| Brilliant point! It always depends on the problem you're
| trying to solve. What Xata did was integrate files as a
| database column, BUT store the binary content in S3, thus
| offering both transactional file management and point-in-time
| recovery, as well as high-performance direct (S3) file access
| through a CDN. Basically, Xata implements the orchestration
| and takes away the pain of managing two services.

| harshitaneja wrote:
| Going on a tangent here, but why S3? Why not avoid vendor
| lock-in and just serve static files?
|
| My setup is to post blobs over a small REST API which balances
| storage load over all the file servers using nginx. It
| responds with the serverSubdomainId, which in turn gets saved
| in the database, so the URL for file access can be generated
| as https://{serverSubdomainId}.domain.tld/{resourceId}
| The only costly part is writing, which, with caching of the
| available storage of all the servers, isn't bad. The whole
| system is horizontally and vertically scalable. What am I
| missing that S3 is still the de facto choice even with vendor
| lock-in and high egress costs?

| connordoner wrote:
| I can't speak for everyone but, in my case, it comes down
| to the ease of setup and having someone else manage it.

| timando wrote:
| "S3" is often used to refer to any S3-compatible object
| storage system. You can even self-host it with something
| like MinIO.

| aforwardslash wrote:
| There are plenty of OSS solutions that talk "S3", like
| Swift, Ceph and SeaweedFS.
|
| Why object storage? It's easier to back up, version and scale
| horizontally. Most solutions will also provide encryption
| capabilities that can be commanded either via external key
| management systems or from the client application. Also,
| custom access policies are great for private documents and
| file uploads.
|
| Using static files is a good solution in some cases, nothing
| against it. But in many scenarios there are huge benefits to
| using object storage instead, even if it is quite a bit
| slower.
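On the self-hosted route timando and aforwardslash mention: S3
clients generally just need an endpoint override, e.g. boto3
against a local MinIO (the port and credentials below are MinIO's
out-of-the-box defaults, assumed here for illustration):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",  # MinIO's default port
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )
    s3.create_bucket(Bucket="blobs")
    s3.put_object(Bucket="blobs", Key="avatars/42.png", Body=b"...")

    # Signed URLs (as discussed elsewhere in the thread) work the
    # same way against a self-hosted endpoint.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "blobs", "Key": "avatars/42.png"},
        ExpiresIn=3600,
    )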
| crooked-v wrote:
| > and that will force you to scale-out the database far
| before you would otherwise be forced to do so
|
| Disk space is cheap, even when it's on a DB server and not
| S3. Why worry that much about it?

| tomnipotent wrote:
| > Disk space is cheap
|
| Disk I/O less so. An average RDBMS writing a 10MB BLOB is
| actually writing at minimum 20-30MB to disk: once to the
| journal/WAL, once to table storage, and for updates/deletes a
| third copy in the redo/undo log.
|
| You also get a less efficient buffer manager with a higher
| eviction rate, which can further exacerbate disk I/O.

| crooked-v wrote:
| But if the files are important enough that you need
| transactional updates or point-in-time recovery for them,
| you're stuck with the equivalents of those things anyway,
| wherever those files end up, plus the extra overhead of
| coordinating two systems.

| [deleted]

| fiedzia wrote:
| > If the system you're working on is small, and going to stay
| small, then that doesn't matter.
|
| Having a good default solution saves a lot of migration
| problems. Storing files outside the database today really
| isn't that much more complicated, while the benefits even for
| small files are significant: you can use a CDN, you don't want
| traffic spikes to affect database performance, you will want
| different availability guarantees for your media than for your
| database, and so on.

| EvanAnderson wrote:
| The issue I've run into re: storing files as BLOBs in a
| database has been the "impedance mismatch" coming from others
| wanting to use tools that act on filesystem objects against
| the files stored in the database.
|
| That aside, I've had good experiences for some applications.
| It's certainly a lot easier than keeping a filesystem
| hierarchy in sync with the database, particularly if you're
| trying to replicate the database's native access-control
| semantics to the filesystem. (So many applications get this
| wrong and leave their entire BLOB store sitting out on a
| filesystem, completely exposed to filesystem-level actors with
| excessive permissions.)
|
| Microsoft has an interesting feature in SQL Server to expose
| BLOBs in a FUSE-like manner: https://learn.microsoft.com/en-
| us/sql/relational-databases/b...
|
| I see postgresqlfs [0] on GitHub, but it's unclear exactly
| what it does, and it looks like it has been idle for at least
| 6 years.
|
| [0] https://github.com/petere/postgresqlfs

| harlanji wrote:
| Minio is easy enough to spin up, and S3-compatible. That seems
| like my default path to persistence going forward. More and
| more deployment options seem like they'll benefit from not
| using the disk directly but instead using a dedicated
| storage-service path, so you might as well use a tool designed
| for that.
|
| S3 can be a bit of a mismatch for people who want to work with
| FS objects as well, but there are a couple of options that are
| a lot easier than dealing with blob files in PGSQL: s3cmd and
| s3fs; perhaps SSHFS to the backing directory of Minio, or
| direct access on the host (direct routes untested, unsure if
| it maps 1:1).

| maerF0x0 wrote:
| You may not like Mongo, but it's had this feature for a long
| time: https://www.mongodb.com/docs/manual/core/gridfs/ The
| title maybe needs to say "Xata can now store..."?

| winrid wrote:
| With built-in replication, horizontal scalability, sharding,
| detailed monitoring, a built-in LRU memory cache, reading your
| own writes, automatic file expiration, etc.

| [deleted]

| benatkin wrote:
| It's hard to find out when CouchDB and MongoDB got
| attachments, but I suspect CouchDB had them first, so it may
| have been an obvious choice for MongoDB.
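A minimal sketch of the GridFS API maerF0x0 links, assuming pymongo
and a local mongod (database and file names are illustrative):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["app"]          # local mongod assumed
    fs = gridfs.GridFS(db)

    # put() chunks the bytes across the fs.chunks collection and
    # records metadata in fs.files; it returns the file's _id.
    file_id = fs.put(b"...big binary...", filename="report.pdf")

    data = fs.get(file_id).read()      # fetch by _id, file-like reads
    for f in fs.find({"filename": "report.pdf"}):
        print(f.upload_date, f.length) # metadata stored alongside chunks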
| EMM_386 wrote:
| "Databases can now store files and images"
|
| Just with SQL Server alone, you've been able to store BLOB
| data for decades now.
|
| Need more than 2GB?
|
| There's FILESTREAM, where enormous files can sit on the disk,
| outside the DB, but be referenced as if they were in the
| database via a varbinary(max) column.
|
| "Think of it as having a new database column type where you
| can store files of any size"
|
| We have that. It might not be AWS S3 and cached on a CDN, but
| that is just adding additional cost, complexity, and lock-in.

| saulrh wrote:
| Being able to demote files from "this is an object on disk" to
| "this is just another record, albeit a very big one" is a huge
| win for data architecture. I've seen this implemented ad hoc
| in dozens of different instances. Having it work out of the
| box with a plugin and a custom column type would be really
| nifty.

| orf wrote:
| Just a note: if you are reading this post and then click the
| logo in the top right, you cannot navigate back to the post;
| it immediately jumps to the homepage again.
|
| Safari on iOS.

| meepmorp wrote:
| Works fine for me with Safari on iOS.

| GreeningRun wrote:
| Works fine on Chrome/Android.

| crabbone wrote:
| Coming from storage, I really hate it when people (unknowingly
| or deliberately) misuse terminology in this field.
|
| Whoever Xata is, they aren't storing _files_ in their
| database. They are storing blobs, or objects... depends on how
| you look at it, but definitely not files.
|
| A simple way to see that this isn't true: can they store
| write-only files in their database (e.g. /dev/null)? Or can
| they store UNIX socket files? What about device files? Can
| they recognize that a file has the setuid bit? And so on...
| practically no useful attributes of files are stored in their
| database.
|
| They obviously didn't even want to store files... It just
| feels like some kind of marketing trick (especially since the
| article is excessively peppered with self-congratulatory
| quotes from people who have allegedly found the product
| useful). Just say it as it is: _blob attachments_.

| HumanOstrich wrote:
| I think you're conflating files with file metadata.

| berkes wrote:
| I'm no expert in this field. I presume you are technically
| correct, but also probably too focused on technical details.
|
| Send a file or path to a DB server in an INSERT query (INSERT
| INTO users (email, pwhash, avatar) VALUES (?,?,?)), and have
| it handle the actual storage, all abstracted away: for the
| user (programmer) of this system, this is "storing the file in
| the DB". If I can then do a "SELECT email, variation('tiny',
| avatar)::cdn_url FROM users", that is "getting the image from
| the DB". Well, technically it's getting the email and a URL
| where this file can be retrieved.
|
| To me, as a user of such a database, it matters not whether
| some engine stores it as a blob, base64, a pointer, or just a
| URI to an S3 block. What matters is that it's in the same
| layer of abstraction.

| DaiPlusPlus wrote:
| > To me, as a user of such a database, it matters not if some
| engine stores it as a blob, base64, pointer, or just a URI to
| an S3 block. What matters is that it's in the same layer of
| abstraction.
|
| By "user of such a database" you mean the end-user of an
| application, right? As opposed to the devs, DBAs, SREs, et
| cetera? Because those users absolutely care about how this
| feature works and absolutely do not want anything abstracted
| away, because that's the difference.
|
| Also, data: URIs need to die. It started off with lazy JS devs
| using <canvas>'s to-data-uri (instead of toBlob), and now it's
| infecting the entire dev ecosystem: just go to StackOverflow
| and there'll be a post every few hours where some asker
| genuinely believes we're supposed to store large,
| gigabyte-sized files in a Postgres or MSSQL table as a Base64
| string in a varchar column with a text collation (aiiieeeeee).
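For a concrete sense of the Base64-in-a-varchar cost DaiPlusPlus
describes: the encoding alone inflates the payload by about a
third, before any collation or parsing overhead (a tiny
illustration, not from the thread):

    import base64, os

    blob = os.urandom(1024 * 1024)       # 1 MiB of binary data
    encoded = base64.b64encode(blob)
    print(len(blob), len(encoded))       # 1048576 vs 1398104: ~33% bigger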
| rat9988 wrote:
| "practically no useful attributes of files are stored in their
| database."
|
| Probably because you work in the field, you miss what the
| layman wants to see. The most useful attribute for me is the
| data. The other use cases you cited are legitimate; it's just
| that I, as a layperson, don't think about them when we talk
| about files.

| deadbeeves wrote:
| Early operating systems did not support directory trees, and
| I'm sure some didn't even support timestamps and such. Would
| you say that those systems do not store files at all? What
| even is the defining characteristic of a "file"?

| fiedzia wrote:
| > Think of it as having a new database column type where you
| can store files of any size, and behind the scenes they are
| stored in AWS S3 and cached through a global CDN.
|
| So the files are not stored in the database, but somewhere
| else. This belongs in a web framework, not a database.

| karlmush wrote:
| [dead]
___________________________________________________________________
(page generated 2023-08-30 23:00 UTC)