[HN Gopher] File Attachments: Databases can now store files and ...
       ___________________________________________________________________
        
       File Attachments: Databases can now store files and images
        
       Author : todsacerdoti
       Score  : 90 points
       Date   : 2023-08-30 16:21 UTC (6 hours ago)
        
 (HTM) web link (xata.io)
 (TXT) w3m dump (xata.io)
        
       | excalibur wrote:
       | This sounds useful for malware persistence after a backup
       | recovery
        
       | pmpjr wrote:
       | [flagged]
        
       | pmpjr wrote:
       | [flagged]
        
       | TOGoS wrote:
       | Skimmed the article, grepped for 'hash', didn't find it.
       | 
       | > they are stored in AWS S3
       | 
       | Not sure why anyone would feel the need to couple this sort of
       | thing to S3. Just store file hashes in the database, and have a
       | blobstore mapping hashes to content. I like bitprint URNs for
       | this purpose[1] but any strong cryptographic hash function will
       | do. Then you can use any old datastore you want, be it a plain
       | old filesystem, S3, a Git repo, or something else. It's trivial
       | to bolt on alternate backends because you can always guarantee
       | that what you asked for is what you got. If not, try the next
       | one.
       | 
       | [1] http://www.nuke24.net/docs/2015/HashURNs.html
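        | 
        | A minimal sketch of what I mean, assuming a plain-filesystem
        | backend and SHA-256 rather than bitprint URNs:
        | 
        |   import hashlib, pathlib
        | 
        |   BLOB_DIR = pathlib.Path("/var/blobs")  # any backend works here
        | 
        |   def put(data: bytes) -> str:
        |       digest = hashlib.sha256(data).hexdigest()
        |       path = BLOB_DIR / digest
        |       if not path.exists():  # identical content dedupes itself
        |           path.write_bytes(data)
        |       return digest  # this hash is what goes in the database row
        | 
        |   def get(digest: str) -> bytes:
        |       data = (BLOB_DIR / digest).read_bytes()
        |       # any backend can be verified: what you asked for is what you got
        |       assert hashlib.sha256(data).hexdigest() == digest
        |       return data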
        
       | tudorg wrote:
       | hey HN, and thank you todsacerdoti for posting it!
       | 
        | This is a feature that we have wanted for a long time, but we
        | also wanted to get it right. It's roughly the equivalent of
        | storing files and images in an S3 bucket and putting URLs to
        | them in the database, except that with that approach you have
        | to keep things in sync, take care of security permissions,
        | delete the files when the database row is deleted, etc. We
        | automate all of that for you, and more: we cache the files in
        | a CDN, and if they are images we support transformations out
        | of the box.
        
       | adrianmsmith wrote:
       | I think for smaller projects just storing images as BLOBs in e.g.
       | PostgreSQL works quite well.
       | 
       | I know this is controversial. People will claim it's bad for
       | performance.
       | 
       | But that's only bad if performance is something you're having
       | problems with, or going to have problems with. If the system
       | you're working on is small, and going to stay small, then that
       | doesn't matter. Not all systems are like this, but some are.
       | 
       | Storing the images directly in the database has a lot of
       | advantages. They share a transactional context with the other
       | updates you're doing. They get backed up at the same time, and if
       | you do a "point in time" restore they're consistent with the rest
       | of your data. No additional infrastructure to manage to store the
       | files.
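        | 
        | For what it's worth, a minimal sketch of that approach with a
        | bytea column (psycopg2; the table and file names are made up):
        | 
        |   import psycopg2
        | 
        |   conn = psycopg2.connect("dbname=app")  # hypothetical DSN
        |   with conn, conn.cursor() as cur:
        |       cur.execute("CREATE TABLE IF NOT EXISTS users "
        |                   "(email text PRIMARY KEY, avatar bytea)")
        |       # the image is written in the same transaction as the row
        |       cur.execute("INSERT INTO users (email, avatar) VALUES (%s, %s)",
        |                   ("a@example.com",
        |                    psycopg2.Binary(open("avatar.png", "rb").read())))
        |       cur.execute("SELECT avatar FROM users WHERE email = %s",
        |                   ("a@example.com",))
        |       image_bytes = bytes(cur.fetchone()[0])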
        
         | andyp-kw wrote:
         | I've seen companies store pdf invoices in the database too, for
         | the same reasons you spoke of.
        
         | lovasoa wrote:
          | How big does the database need to be for large BLOBs to
          | become a problem? "Big" and "small" are quite subjective
          | terms.
          | 
          | How many BLOBs does one need to have, and how often do we
          | need to touch them, for this solution to become untenable?
        
           | spiffytech wrote:
           | One datapoint on BLOB performance: _SQLite: 35% Faster Than
           | The Filesystem_
           | 
           | https://www.sqlite.org/fasterthanfs.html
        
           | crabbone wrote:
            | In storage you measure things in blocks. Historically,
            | blocks were 512 bytes, but today the tendency is to make
            | them bigger; 4K would be the typical size in a server
            | setting.
            | 
            | So, the idea here is this: databases that store
            | structured information, i.e. integers, booleans, short
            | strings, are typically relational databases, e.g.
            | PostgreSQL.
            | 
            | Filesystems (e.g. Ext4) usually think in whole blocks,
            | but are designed with an eye toward smaller files, i.e.
            | for optimal performance files aren't expected to be more
            | than some tens or hundreds of blocks in size.
            | 
            | Object stores (e.g. S3) are the kind of storage system
            | that is supposed to work well for anything larger than
            | typical files.
            | 
            | This gives the answer to your question: blobs in a
            | relational database are probably OK if they are under
            | one block in size. Databases will probably be able to
            | handle bigger ones too, but you will start seeing
            | serious drops in performance when it comes to indexing,
            | filtering, searching etc., because such systems optimize
            | internal memory buffers so that they can fit a "perfect"
            | number of elements of a "perfect" size.
            | 
            | Another concern is that with stored elements larger than
            | a single block you need a different approach to
            | parallelism. Ultimately, the number of blocks touched by
            | an I/O operation determines its performance. If you are
            | reading/writing sub-block-sized elements, you try to
            | make them come from the same block, to minimize the
            | number of requests made to the physical storage. If you
            | work with multi-block elements, your approach to
            | performance optimization is different -- you try to
            | pre-fetch the "neighbor" blocks because you expect you
            | might need them soon. Modern storage hardware has a
            | decent degree of parallelism that allows you to queue
            | multiple I/O requests without awaiting completion. This
            | latter mechanism is a lot less relevant to something
            | like an RDBMS, but is at the heart of an object store.
            | 
            | In other words: the problem is not a function of the
            | size of the database. In principle, nothing stops e.g.
            | PostgreSQL from special-casing blobs and handling them
            | differently than it normally handles "small" objects...
            | but they probably aren't interested in doing so, because
            | appropriate storage for that kind of stuff already
            | exists, and PostgreSQL, like most other RDBMSes, sits
            | _on top_ of the storage for larger objects (the
            | filesystem), so it has no hope of doing better than the
            | layer below it.
        
             | brazzy wrote:
              | Most of what you wrote there is simply not true of
              | modern DBMSes. PostgreSQL specifically has a mechanism
              | called TOAST (https://www.enterprisedb.com/postgres-
              | tutorials/postgresql-t...) that does exactly what you
              | claim they "aren't probably interested in doing" and
              | eliminates the performance penalty of large values in a
              | table when they are not accessed.
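              | 
              | A rough sketch of it in action, if anyone is curious
              | (psycopg2; the table name is made up). TOAST moves the
              | big value out of the heap page, so the main relation
              | stays small:
              | 
              |   import psycopg2
              | 
              |   conn = psycopg2.connect("dbname=app")  # hypothetical DSN
              |   with conn, conn.cursor() as cur:
              |       cur.execute("CREATE TABLE IF NOT EXISTS photos "
              |                   "(id serial PRIMARY KEY, data bytea)")
              |       # EXTERNAL = keep values out-of-line, uncompressed
              |       cur.execute("ALTER TABLE photos "
              |                   "ALTER COLUMN data SET STORAGE EXTERNAL")
              |       cur.execute("INSERT INTO photos (data) VALUES (%s)",
              |                   (psycopg2.Binary(b"x" * 1_000_000),))
              |       # heap stays tiny; the blob lives in the TOAST table
              |       cur.execute("SELECT pg_relation_size('photos'), "
              |                   "pg_total_relation_size('photos')")
              |       heap_bytes, total_bytes = cur.fetchone()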
        
         | doubled112 wrote:
         | I've seen BLOBs in an Oracle DB used to store Oracle install
         | ISOs, which I think is ironic on some level.
         | 
         | Let's attach what we're using to the ticket. All of it. Why is
         | the ticketing DB huge? Well, you attached all of it.
        
         | rajman187 wrote:
         | Several years ago Walmart dramatically sped up their online
         | store's performance by storing images as blobs in their
         | distributed Cassandra cluster.
         | 
         | https://medium.com/walmartglobaltech/building-object-store-s...
        
         | mohamedattahri wrote:
          | BLOBs on Postgres are awesome, but there's also a full-featured
         | file API called Large Objects for when the use-case requires
         | seeking/streaming.
         | 
         | Wrote a small Go library to interface with it:
         | https://github.com/mohamedattahri/pgfs
         | 
         | Large Objects:
         | https://www.postgresql.org/docs/current/largeobjects.html
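          | 
          | From Python, psycopg2 exposes the same underlying API via
          | lobject; a rough sketch (file name made up):
          | 
          |   import psycopg2
          | 
          |   conn = psycopg2.connect("dbname=app")  # hypothetical DSN
          |   with conn:
          |       lo = conn.lobject(0, "wb")  # 0 = let the server pick an OID
          |       lo.write(open("video.mp4", "rb").read())
          |       oid = lo.oid  # store this OID in an ordinary column
          |       lo.close()
          | 
          |       lo = conn.lobject(oid, "rb")
          |       lo.seek(1024)           # random access...
          |       chunk = lo.read(65536)  # ...and chunked streaming
          |       lo.close()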
        
           | dventimi wrote:
            | Large Objects work great for images, stylesheets, and
            | other file assets in my small PostgREST projects.
        
           | ibgeek wrote:
           | TIL thanks!
        
         | bastawhiz wrote:
         | I think the one big problem with BLOBs, especially if you have
         | a heavily read-biased DB, is you're going to run up against
         | bandwidth/throughput as a bottleneck. One of the DBs I help
         | maintain has some very large JSON columns and we frequently see
         | this problem when traffic is at its peak: simply pulling the
         | data down from Postgres is the problem.
         | 
         | If the data is frequently accessed, it also means there are
         | extra hops the data has to take before getting to the user.
         | It's a lot faster to pull static files from S3 or a CDN (or
         | even just a dumb static file server) than it is to round trip
         | through your application to the DB and back. For one, it's
         | almost impossible to stream the response, so the whole BLOB
         | needs to be copied in memory in each system it passes through.
         | 
          | It's rare that a request for, say, user data would _also_
          | return the user avatar, so you ultimately end up with one
          | endpoint for structured data and one to serve binary BLOB
          | data, which have very little overlap except for ACL stuff.
          | Signed S3 URLs will get you the same security properties
          | with much better performance overall.
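          | 
          | For what it's worth, a presigned URL is a one-liner with
          | boto3 (bucket and key names made up); the app only does the
          | ACL check and hands back a short-lived link:
          | 
          |   import boto3
          | 
          |   s3 = boto3.client("s3")
          |   url = s3.generate_presigned_url(
          |       "get_object",
          |       Params={"Bucket": "user-avatars", "Key": "avatars/1234.png"},
          |       ExpiresIn=300,  # valid for five minutes
          |   )
          |   # redirect the client to `url`; the bytes never pass
          |   # through the app server or the database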
        
           | refulgentis wrote:
           | Do you have any thoughts on when a JSON column is too large?
           | I've been wondering about the tradeoffs between a jsonb
           | column in postgres that may have values, at the extreme, as
           | large as 10 MB, usually just 100 KB, versus using S3.
        
         | cryptonector wrote:
         | > But that's only bad if performance is something you're having
         | problems with, or going to have problems with.
         | 
         | The first part is easy enough, but how do you predict when
         | you're going to hit a knee in the system's performance?
        
         | SigmundA wrote:
          | The main issue with large blobs in a DB is moving and/or
          | deleting them. Not sure if PG has this issue (I would like
          | to confirm), but in, say, SQL Server, if you delete a blob
          | or try to move it to another filegroup, you get transaction
          | logging equal to the size of the blobs.
          | 
          | You would think that at least for a delete, especially in
          | PG with the way it handles MVCC, there would just be an
          | entry saying which blob was deleted, and if you need to
          | roll back you just undelete the blob.
          | 
          | So just being able to move and delete the data after it is
          | in there becomes a real problem when there is a lot of it.
        
         | vbezhenar wrote:
          | It makes backups a PITA. I migrated blobs to S3 and managed to
         | improve backups from once a month to once a day. Database is
         | now very slim. HDD space is no longer an issue. Can delete code
         | which serves files. Lots of improvements with no downsides so
         | far.
        
           | crooked-v wrote:
           | Of course, that also means that now you don't have backups of
           | your files.
        
             | 0x457 wrote:
             | They do have backups if versioning is enabled on S3.
        
               | ericbarrett wrote:
               | Yes, S3 is extremely reliable and versioning protects
               | against "oopsies." I do always recommend disallowing
               | s3:DeleteObjectVersion for all IAM roles, and/or as a
                | bucket policy; manage old versions with a lifecycle
               | policy instead. This will protect against programmatic
               | deletion by anybody except the root account.
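                | 
                | Roughly this shape as a bucket policy, if it helps
                | (the bucket name is a placeholder):
                | 
                |   import json, boto3
                | 
                |   policy = {
                |       "Version": "2012-10-17",
                |       "Statement": [{
                |           "Sid": "NoVersionDeletes",
                |           "Effect": "Deny",
                |           "Principal": "*",
                |           "Action": "s3:DeleteObjectVersion",
                |           "Resource": "arn:aws:s3:::my-bucket/*",
                |       }],
                |   }
                |   boto3.client("s3").put_bucket_policy(
                |       Bucket="my-bucket", Policy=json.dumps(policy))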
        
               | brazzledazzle wrote:
               | I second these sensible protections. Also would recommend
               | replicating your bucket as a DR measure. Ideally to
               | another (infrequently used and separately secured)
                | account. Even better if you do it in another region,
                | but I think another account is more important, since
                | the chances of your account getting owned are higher
                | than Amazon suffering a total region failure.
        
         | [deleted]
        
         | [deleted]
        
         | ltbarcly3 wrote:
         | I think there is too much emphasis on 'the right way'
         | sometimes, which leads people to skip over an analysis for the
         | problem they are solving right now.
         | 
         | For example, if you are building something that will have
         | thousands or millions of records, and storing small binaries in
         | postgresql lets you avoid integrating with S3 at all, then you
         | should seriously consider doing it. The simplification gain
         | almost certainly pays for any 'badness' and then some.
         | 
         | If you are working on something that you hope to scale to
         | millions of users then you should just bite the bullet and
         | integrate with S3 or something, because you will use too much
         | db capacity to store binaries (assuming they aren't like 50
         | bytes each or something remarkably small) and that will force
          | you to scale out the database far earlier than you would
          | otherwise need to.
        
           | sorintm wrote:
            | Brilliant point! It always depends on the problem you're
            | trying to solve. What Xata did was integrate files as a
            | database column but store the binary content in S3, thus
            | offering both transactional file management and point-in-
            | time recovery and high-performance direct (S3) file
            | access through a CDN. Basically, Xata implements the
            | orchestration and takes away the pain of managing two
            | services.
        
           | harshitaneja wrote:
            | Going on a tangent here, but why S3? Why not avoid vendor
            | lock-in and just serve static files?
            | 
            | My setup is to post blobs to a small REST API, which
            | balances storage load across all the file servers using
            | nginx. It responds with the serverSubdomainId, which in
            | turn gets saved in the database, so the URL for file
            | access can be generated as
            | https://{serverSubdomainId}.domain.tld/{resourceId}. The
            | only costly part is writing, which, with caching of the
            | available storage on all the servers, isn't bad. The
            | whole system is horizontally and vertically scalable.
            | What am I missing that makes S3 still the de facto choice
            | even with vendor lock-in and high egress costs?
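            | 
            | Roughly, the write path looks like this (all names are
            | illustrative):
            | 
            |   import requests
            | 
            |   def store_blob(data: bytes) -> tuple[str, str]:
            |       # the small REST API behind nginx picks a file
            |       # server with spare capacity
            |       resp = requests.post("https://upload.domain.tld/blobs",
            |                            data=data)
            |       resp.raise_for_status()
            |       body = resp.json()  # {"serverSubdomainId": ..., "resourceId": ...}
            |       # both ids go into the database row; the public
            |       # URL is derived from them
            |       url = (f"https://{body['serverSubdomainId']}"
            |              f".domain.tld/{body['resourceId']}")
            |       return body["serverSubdomainId"], url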
        
             | connordoner wrote:
             | I can't speak for everyone but, in my case, it comes down
             | to the ease of setup and having someone else manage it.
        
             | timando wrote:
             | "S3" is often used to refer to any S3 compatible object
             | storage system. You can even self-host it with something
             | like MinIO.
        
             | aforwardslash wrote:
              | There are plenty of OSS solutions that speak "S3", like
              | Swift, Ceph, and SeaweedFS.
              | 
              | Why object storage? It's easier to back up, version,
              | and scale horizontally. Most solutions will also
              | provide encryption capabilities that can be driven
              | either by external key management systems or by the
              | client application. Also, custom access policies are
              | great for private documents and file uploads.
              | 
              | Using static files is a good solution in some cases,
              | nothing against it. But in many scenarios there are
              | huge benefits to using object storage instead, even if
              | it is quite a bit slower.
        
           | crooked-v wrote:
           | > and that will force you to scale-out the database far
           | before you would otherwise be forced to do so
           | 
           | Disk space is cheap, even when it's on a DB server and not
           | S3. Why worry that much about it?
        
             | tomnipotent wrote:
             | > Disk space is cheap
             | 
             | Disk I/O less so. An average RDBMS writing a 10MB BLOB is
             | actually writing at minimum 20-30MB to disk - once to
             | journal/WAL, once to table storage, and for updates/deletes
             | a third copy in the redo/undo log.
             | 
              | You also get a less efficient buffer manager with a higher
              | eviction rate, which can further exacerbate disk I/O.
        
               | crooked-v wrote:
               | But if the files are important enough that you need
               | transactional updates or point-in-time recovery for them,
               | you're stuck with the equivalents of those things anyway,
               | wherever those files end up, plus the extra overhead of
               | coordinating two systems.
        
             | [deleted]
        
         | fiedzia wrote:
         | > If the system you're working on is small, and going to stay
         | small, then that doesn't matter.
         | 
          | Having a good default solution saves a lot of migration
          | problems later. Storing files outside of the database today
          | really isn't that much more complicated, while the benefits
          | even for small files are significant: you can use a CDN,
          | you don't want traffic spikes to affect database
          | performance, you will want different availability
          | guarantees for your media than for the database, and so on.
        
         | EvanAnderson wrote:
         | The issue I've run into re: storing files as BLOBs in a
         | database has been the "impedance mismatch" coming from others
         | wanting to use tools that act on filesystem objects against the
         | files stored in the database.
         | 
         | That aside I've had good experiences for some applications.
         | It's certainly a lot easier than keeping a filesystem hierarchy
         | in sync w/ the database, particularly if you're trying to
         | replicate the database's native access control semantics to the
         | filesystem. (So many applications get this wrong and leave
         | their entire BLOB store, sitting out on a filesystem,
         | completely exposed to filesystem-level actors with excessive
         | permission.)
         | 
         | Microsoft has an interesting feature in SQL Server to expose
         | BLOBs in a FUSE-like manner: https://learn.microsoft.com/en-
         | us/sql/relational-databases/b...
         | 
         | I see postgresqlfs[0] on Github, but it's unclear exactly what
         | it does and it looks like it has been idle for at least 6
         | years.
         | 
         | [0] https://github.com/petere/postgresqlfs
        
           | harlanji wrote:
            | MinIO is easy enough to spin up, and it's S3-compatible.
            | That seems like my default path to persistence going
            | forward. More and more deployment options seem like
            | they'll benefit from not using the disk directly but
            | instead going through a dedicated storage service, so you
            | might as well use a tool designed for that.
            | 
            | S3 can be a bit of a mismatch for people who want to work
            | with FS objects as well, but there are a couple of
            | options that are a lot easier than dealing with blob
            | files in PostgreSQL: s3cmd, s3fs, perhaps SSHFS to the
            | backing directory of MinIO, or direct access on the host
            | (direct routes untested, unsure if it maps 1:1).
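            | 
            | Pointing boto3 (or any S3 SDK) at a local MinIO is just
            | an endpoint override; a quick sketch using MinIO's
            | default dev credentials (bucket and file names made up):
            | 
            |   import boto3
            | 
            |   s3 = boto3.client(
            |       "s3",
            |       endpoint_url="http://localhost:9000",  # local MinIO
            |       aws_access_key_id="minioadmin",
            |       aws_secret_access_key="minioadmin",
            |   )
            |   s3.create_bucket(Bucket="attachments")
            |   s3.put_object(Bucket="attachments", Key="report.pdf",
            |                 Body=open("report.pdf", "rb").read())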
        
       | maerF0x0 wrote:
       | You may not like Mongo, but it's had this feature for a long
       | time. https://www.mongodb.com/docs/manual/core/gridfs/ Title
       | maybe needs to say "Xata now can store..." ?
        
         | winrid wrote:
          | With built-in replication, horizontal scalability, sharding,
          | detailed monitoring, a built-in LRU memory cache, reading
          | your own writes, automatic file expiration, etc.
        
         | [deleted]
        
         | benatkin wrote:
         | It's hard to find when CouchDB and MongoDB got attachments, but
         | I suspect CouchDB had them first, so it may have been an
         | obvious choice for MongoDB.
        
       | EMM_386 wrote:
       | "Databases can now store files and images"
       | 
        | With SQL Server alone, you've been able to store BLOB data
        | for decades now.
       | 
       | Need more than 2GB?
       | 
       | There's FILESTREAM, where enormous files can sit on the disk,
       | outside the DB, but be referenced as if they were in the database
       | via a varbinary(max) column.
       | 
       | "Think of it as having a new database column type where you can
       | store files of any size"
       | 
       | We have that. It might not be AWS S3 and cached on a CDN, but
       | that is just adding additional cost, complexity, and lock-in.
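        | 
        | Rough shape of the FILESTREAM setup, for anyone who hasn't
        | seen it (T-SQL via pyodbc; assumes a FILESTREAM filegroup is
        | already configured, and the DSN and file name are made up):
        | 
        |   import pyodbc
        | 
        |   conn = pyodbc.connect("DSN=app")  # hypothetical DSN
        |   cur = conn.cursor()
        |   # the BLOBs live on disk via FILESTREAM but are queried
        |   # like any other varbinary(max) column
        |   cur.execute("""
        |       CREATE TABLE Documents (
        |           Id  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL
        |               UNIQUE DEFAULT NEWID(),
        |           Doc VARBINARY(MAX) FILESTREAM NULL
        |       )
        |   """)
        |   cur.execute("INSERT INTO Documents (Doc) VALUES (?)",
        |               open("big.iso", "rb").read())
        |   conn.commit()
        | 
        | (Reading a whole file into memory like that is just for
        | illustration; in practice you'd stream it.)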
        
       | saulrh wrote:
       | Being able to demote files from "this is an object on disk" to
       | "this is just another record, albeit a very big one" is a huge
       | win for data architecture. I've seen this implemented ad-hoc in
       | dozens of different instances. Having it work out-of-the-box with
       | a plugin and custom column type would be really nifty.
        
       | orf wrote:
        | Just a note: if you are reading this post and then click the
        | logo in the top right, you cannot navigate back to the post -
        | it immediately jumps to the homepage again.
        | 
        | Safari on iOS.
        
         | meepmorp wrote:
         | Works fine for me with Safari on iOS.
        
         | GreeningRun wrote:
         | Works fine on chrome/android
        
       | crabbone wrote:
       | Coming from storage, I really hate it when people (unknowingly or
       | deliberately) misuse terminology in this field.
       | 
       | Whoever Xata is, they aren't storing _files_ in their database.
       | They are storing blobs, or objects... depends on how you look at
       | it, but definitely not files.
       | 
        | A simple way to see that that's not true: can they store
        | write-only files in their database (e.g. /dev/null)? Or can
        | they store UNIX socket files? What about device files? Can
        | they recognize that a file has the setuid bit set? And so
        | on... practically no useful attributes of files are stored
        | in their database.
       | 
       | They obviously didn't even want to store files... It just feels
       | like some kind of marketing trick (especially since the article
       | is excessively peppered with self-congratulatory quotes from
       | people who have allegedly found the product useful). Just say it
       | as it is: _Blob attachments_.
        
         | HumanOstrich wrote:
         | I think you're conflating files with file metadata.
        
         | berkes wrote:
         | I'm no expert in this field. I presume you are technically
         | correct. But also probably too focused on technical details.
         | 
          | Send a file or path to a DB server in an INSERT query (INSERT
          | email, pwhash, avatar (?,?,?) into users), and have it handle
          | the actual storage, all abstracted away; for the user
          | (programmer) of this system, this is "storing the file in the
          | DB". If I can then do a "SELECT email, variation('tiny',
          | avatar)::cdn_url FROM users", that is "getting the image from
          | the DB". Well, technically, getting the email and a URL where
          | this file can be retrieved.
          | 
          | To me, as a user of such a database, it matters not whether
          | some engine stores it as a blob, base64, a pointer, or just
          | a URI to an S3 object. What matters is that it's all in the
          | same layer of abstraction.
        
           | DaiPlusPlus wrote:
           | > To me, as a user of such a database, it matters not if some
           | engine stores it as a blob, base64, pointer, or just a URI to
           | an S3 block. What matters is that it's in the same layer of
           | abstraction.
           | 
           | By "user of such a database" you mean the end-user of an
           | application, right? As opposed to the devs, DBAs, SREs, et
           | cetera? Because those users absolutely care about how this
           | feature works and absolutely do not want anything to be
           | abstracted away, because that's the dif
           | 
           | Also, data: URIs need to die. It started off with lazy JS
            | devs using <canvas>'s toDataURL (instead of toBlob) and now
           | it's infecting the entire dev ecosystem: just go to
           | StackOverflow and there'll be a post every few hours where
           | some asker genuinely believes we're supposed to store large,
           | gigabyte-sized files in a Postgres or MSSQL table as a Base64
           | string in a varchar column with a text collation
           | (aiiieeeeee).
        
         | rat9988 wrote:
         | "practically no useful attributes of files are stored in their
         | database."
         | 
         | Probably because you work on the field, you miss what the
         | layman wants to see. The most useful attrivute for me is the
         | data. The other use case you cited are legitimate, it's just
         | that I, as a layperson, don't think about them when we talk
         | about files.
        
         | deadbeeves wrote:
         | Early operating systems did not support directory trees, and
         | I'm sure some didn't even support timestamps and such. Would
          | you say that those systems do not store files at all? What even
         | is the defining characteristic of a "file"?
        
       | fiedzia wrote:
       | >Think of it as having a new database column type where you can
       | store files of any size, and behind the scenes they are stored in
       | AWS S3 and cached through a global CDN.
       | 
        | So the files are not stored in the database, but somewhere else.
        | This belongs in a web framework, not a database.
        
       | karlmush wrote:
       | [dead]
        
       ___________________________________________________________________
       (page generated 2023-08-30 23:00 UTC)