[HN Gopher] Gcsfuse: A user-space file system for interacting wi...
       ___________________________________________________________________
        
       Gcsfuse: A user-space file system for interacting with Google Cloud
       Storage
        
       Author : yla92
       Score  : 126 points
       Date   : 2023-09-06 09:48 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | carbocation wrote:
       | I do scientific computing in google cloud. When I first got
       | started, I heavily relied on GCSFuse. Over time, I have
       | encountered enough trouble that I no longer use it for the vast
       | majority of my work. Instead, I explicitly localize the files I
       | want to the machine that will be operating on them, and this has
       | eliminated a whole class of slowdown bugs and availability bugs.
       | 
       | The scale of data for my work is modest (~50TB, ~1 million files
       | total, about 50k files per "directory").
        
         | paulddraper wrote:
         | > The scale of data for my work is modest (~50TB, ~1 million
         | files total, about 50k files per "directory").
         | 
          | Then my work must be downright embarrassing.
        
         | nyc_pizzadev wrote:
         | Did you use a local caching proxy like Varnish or Squid? Would
         | that have helped?
        
           | dekhn wrote:
            | These codes aren't talking HTTP. They are talking POSIX to a
            | real filesystem. The problem is that cloud-based FUSE mounts
            | are never as reliable as a real filesystem (whether a local
            | POSIX one, NFS, or SMB): they will "just hang" at random
            | times, and you need some sort of external timeout to kill
            | the process, restart the job, and possibly restart the host.
           | 
           | I've used all the main FUSE cloud FS (gcsfuse, s3-fuse,
           | rclone, etc) and they all end up falling over in prod.
           | 
            | I think a better approach would be to port all the important
            | science codes to work with file formats like Parquet and use
            | user-space access libraries linked into the application, so
            | that both the access library and the user code handle errors
            | robustly. This is how systems like MapReduce work, and in my
            | experience they work far more reliably than FUSE mounts when
            | dealing with 10s to 100s of TBs.
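            | 
            | As a rough sketch of the library approach, here it is with
            | the google-cloud-storage Python client, an explicit timeout,
            | and a retry loop (bucket and object names are made up):
            | 
            |   from google.cloud import storage
            | 
            |   def fetch_object(bucket_name, object_name, attempts=3):
            |       # Use the client library instead of a FUSE mount so
            |       # timeouts and retries stay under our control.
            |       client = storage.Client()
            |       blob = client.bucket(bucket_name).blob(object_name)
            |       for attempt in range(attempts):
            |           try:
            |               return blob.download_as_bytes(timeout=60)
            |           except Exception:
            |               if attempt == attempts - 1:
            |                   raise
            | 
            |   data = fetch_object("my-bucket", "runs/shard-0001.parquet")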
        
         | laurencerowe wrote:
         | These file systems are not a good fit for large numbers of
         | small files. Their sweet spot is working with large (~GB+)
         | files which are mostly read from beginning to end. I've mostly
         | used them for bioinformatics stuff.
        
         | ashishbijlani wrote:
         | FUSE does not work well with a large number of small files (due
         | to high metadata ops such as inode/dentry lookups).
         | 
         | ExtFUSE (optimized FUSE with eBPF) [1] can offer you much
         | higher performance. It caches metadata in the kernel to avoid
         | lookups in user space. Disclaimer: I built it.
         | 
         | 1. https://github.com/extfuse/extfuse
        
           | laurencerowe wrote:
           | ExtFUSE seems really cool and great for implementing
           | performant drivers in userspace for local or lower latency
           | network filesystems, but I doubt FUSE is the bottleneck in
           | this case since S3/GCS have 100ms first byte latency.
           | 
           | https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi.
           | ..
        
         | markstos wrote:
         | I had a similar experience with S3 Fuse. It was slower, more
         | complex and expensive than using S3 directly. I had feared
         | refactoring my code to use the API, but it went quickly. I've
         | never gone back to using or recommending a cloud filesystem
         | like that for a project.
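          | 
          | For what it's worth, the direct-API version of a read is only
          | a few lines with boto3 (bucket and key are made up):
          | 
          |   import boto3
          | 
          |   s3 = boto3.client("s3")
          |   # What was open("/mnt/s3/reports/2023-09.csv").read()
          |   # through the FUSE mount becomes:
          |   obj = s3.get_object(Bucket="my-bucket",
          |                       Key="reports/2023-09.csv")
          |   data = obj["Body"].read()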
        
       | [deleted]
        
       | yread wrote:
       | There is also blobfuse2 for mounting Azure Storage
       | https://github.com/Azure/azure-storage-fuse
       | 
       | It has some nice features like streaming with block level caching
       | for fast readonly access
        
         | yla92 wrote:
         | There is s3fs-fuse as well for AWS S3. https://github.com/s3fs-
         | fuse/s3fs-fuse
         | 
          | It even supports GCS (as GCS has an S3-compatible API)
         | 
         | https://github.com/s3fs-fuse/s3fs-fuse/wiki/Google-Cloud-Sto...
        
         | alpb wrote:
         | Hah nice! I developed https://github.com/ahmetb/azurefs back in
          | 2012 when I was about to join Azure. I'm glad Azure actually
         | provides a supported and actively-maintained tool for this.
        
         | easton wrote:
          | mountpoint-s3 is AWS's first-party solution for mounting S3
          | buckets as file systems:
         | https://github.com/awslabs/mountpoint-s3
         | 
         | Haven't used it but it looks cool, if a bit immature.
        
       | bushbaba wrote:
        | I'd also look at goofys, which I've found to be quite performant
        | for reads. It's also nice that it's a single Go binary that is
        | easily passed around to hosts.
        
       | nyc_pizzadev wrote:
       | Does anyone have any experience on how this works at scale?
       | 
       | Let's say I have a directory tree with 100MM files in a nested
       | structure, where the average file is 4+ directories deep. When I
       | `ls` the top few directories, is it fast? How long until I
       | discover updates?
       | 
       | Reading the docs, it looks like it's using this API for traversal
       | [0]?
       | 
        | What about metadata like creation times, permissions, owner,
        | and group?
       | 
       | Any consistency concerns?
       | 
       | [0]
       | https://cloud.google.com/storage/docs/json_api/v1/objects/li...
        
         | BrandonY wrote:
          | Hi, Brandon from GCS here. If you're looking for all of the
          | guarantees of a real POSIX filesystem, if you want fast top-
          | level directory listing over 100MM+ nested files, and if POSIX
          | permissions/owner/group and other file metadata are important
          | to you, Gcsfuse is probably not what you're after. You might
          | want something more like Filestore:
         | https://cloud.google.com/filestore
         | 
         | We've got some additional documentation on the differences and
         | limitations between Gcsfuse and a proper POSIX filesystem:
         | https://cloud.google.com/storage/docs/gcs-fuse#expandable-1
         | 
         | Gcsfuse is a great way to mount Cloud Storage buckets and view
         | them like they're in a filesystem. It scales quite well for all
         | sorts of uses. However, Cloud Storage itself is a flat
         | namespace with no built-in directory support. Listing the few
         | top level directories of a bucket with 100MM files more or less
         | requires scanning over your entire list of objects, which means
         | it's not going to be very fast. Listing objects in a leaf
         | directory will be much faster, though.
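          | 
          | To make the flat-namespace point concrete, a "directory"
          | listing is really a prefix-plus-delimiter query; a minimal
          | sketch with the Python client (bucket and prefix are made up):
          | 
          |   from google.cloud import storage
          | 
          |   client = storage.Client()
          |   # "Directories" are only a naming convention: prefix + "/".
          |   it = client.list_blobs("my-bucket",
          |                          prefix="experiments/run-42/",
          |                          delimiter="/")
          |   for blob in it:          # objects under the prefix
          |       print(blob.name)
          |   for sub in it.prefixes:  # immediate "subdirectories"
          |       print(sub)
          | 
          | Listing a leaf prefix like this is a bounded call; rolling up
          | the top-level "directories" of a bucket with 100MM objects has
          | no such bound, which is where the slowness comes from.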
        
           | milesward wrote:
           | Brandon, I know why this was built, and I agree with your
           | list of viable uses; that said, it strikes me as extremely
           | likely to lead to gnarly support load, grumpy customers, and
            | system instability when it is inevitably misused. What steps
            | is GCP taking, across all of its user interfaces, to warn
            | users who may not understand their workload characteristics
            | at all about the narrow utility of this feature?
        
           | nyc_pizzadev wrote:
           | Thanks for the reply.
           | 
            | Our theoretical use case is 10+ PB and we need multiple TB/s
            | of read throughput (maybe a fraction of that for writing).
           | So I don't think Filestore fits this scale, right?
           | 
           | As for the directory traversals, I guess caching might help
           | here? Top level changes aren't as frequent as leaf additions.
           | 
           | That being said, I don't see any (caching) proxy support
           | anywhere other than the Google CDN.
        
         | daviesliu wrote:
         | If you really expect a file system experience over GCS, please
          | try JuiceFS [1], which scales to 10 billion files pretty well
          | with TiKV or FoundationDB as the metadata engine.
          | 
          | PS: I'm the founder of JuiceFS.
         | 
         | [1] https://github.com/juicedata/juicefs
        
           | victor106 wrote:
           | The description says S3. Does it also support GCS?
        
             | 8organicbits wrote:
             | The architecture image shows GCS and others, so I suspect
             | it does.
             | 
             | https://github.com/juicedata/juicefs#architecture
        
           | skrowl wrote:
           | [dead]
        
       | asah wrote:
       | gcsfuse worked great for me on a couple of projects, but YMMV for
       | production use. As with all distributed storage systems, make
       | sure you can handle timeouts, retries, high latency periods and
       | outages.
        
       | djbusby wrote:
       | Why not rclone? It was discussed here yesterday as a replacement
       | for sshfs - and supports GCS as well as dozens more backends.
       | 
       | https://rclone.org/
       | 
       | https://news.ycombinator.com/item?id=37390184
        
         | [deleted]
        
         | capableweb wrote:
         | Last time gcsfuse was on HN
         | (https://news.ycombinator.com/item?id=35784889), the author of
         | rclone was in the comments:
         | 
         | > From reading the docs, it looks very similar to `rclone
         | mount` with `--vfs-cache-mode off` (the default). The
         | limitations are almost identical.
         | 
         | > However rclone has `--vfs-cache-mode writes` which caches
         | file writes to disk first to allow overwriting in the middle of
         | a file and `--vfs-cache-mode full` to cache all objects on a
         | LRU basis. They both make the file system a whole lot more
         | POSIX compatible and most applications will run using `--vfs-
         | cache-mode writes` unlike `--vfs-cache-mode off`.
         | 
         | https://news.ycombinator.com/item?id=35788919
         | 
         | Seems rclone would be an even better option than Google's own
         | tool.
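          | 
          | If you want to try that, the flags quoted above are the whole
          | trick; a hypothetical mount (remote name and paths made up)
          | looks like:
          | 
          |   rclone mount my-gcs-remote:my-bucket /mnt/bucket \
          |     --vfs-cache-mode writes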
        
           | tough wrote:
            | Comments like this are why the HN comments section is
            | usually better than the news itself.
            | 
            | Also, hi capableweb, I think your name rings a bell from
            | LLM/gen-AI threads.
        
             | capableweb wrote:
             | Me too!
             | 
             | Hello! That's probably a sign I need to take a break from
             | writing too many HN comments per day, thanks :)
        
               | tough wrote:
               | Don't worry man, wasn't implying that, it's just cool to
               | see the same names/non-faces around tbh. simon is
               | likewise heh
        
       | paulgb wrote:
       | > Cloud Storage FUSE can only write whole objects at a time to
       | Cloud Storage and does not provide a mechanism for patching. If
       | you try to patch a file, Cloud Storage FUSE will reupload the
       | entire file. The only exception to this behavior is that you can
       | append content to the end of a file that's 2 MB and larger, where
       | Cloud Storage FUSE will only reupload the appended content.
       | 
       | I didn't know GCS supported appends efficiently. Correct me if
       | I'm wrong, but I don't think S3 has an equivalent way to append
       | to a value, which makes it clunky to work with as a log sink.
        
         | nicornk wrote:
          | With S3 you can do something similar by misusing the multipart
         | upload functionality, e.g.:
         | https://github.com/fsspec/s3fs/blob/fa1c76a3b75c6d0330ed03c4...
        
         | rickette wrote:
         | Azure Blob Storage actually has explicit append support using
         | "Append"-blobs (next to block and page blobs)
        
           | capableweb wrote:
           | Building a storage service like these today and not having
           | "append" would be very silly indeed. I guess S3 is kind of
           | excused since it's so old by now. Although I haven't read
           | anything about them adding it, so maybe less excused...
        
         | paulddraper wrote:
         | > Correct me if I'm wrong, but I don't think S3 has an
         | equivalent way to append to a value, which makes it clunky to
         | work with as a log sink.
         | 
         | You are correct. (There are multipart uploads, but that's kinda
         | different.)
         | 
          | ELB logs are delivered as separate objects every few minutes,
         | FWIW.
        
         | londons_explore wrote:
         | Append workloads are common in distributed systems. Turns out
         | nearly every time you think you need random read/write to a
         | datastructure (eg. a hard drive/block device for a
         | Windows/Linux VM), you can instead emulate that with an append-
         | only log of changes and a set of append-only indexes.
         | 
         | Doing so has huge benefits: Write performance is _way_ higher,
         | you can do rollbacks easily (just ignore the tail of the
         | files), you can do snapshotting easily (just make a new file
         | and include by reference a byte range of the parent), etc.
         | 
         | The downside is from time to time you need to make a new file
         | and chuck out the dead data - but such an operation can be done
         | 'online', and can be done during times of lower system load.
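          | 
          | A toy sketch of the pattern, with a local file standing in for
          | an append-only object (rollback is just dropping the tail):
          | 
          |   import json, os
          | 
          |   class AppendLog:
          |       def __init__(self, path):
          |           self.path = path
          |           open(path, "a").close()  # create if missing
          | 
          |       def append(self, record):
          |           # The size before the write is a rollback point.
          |           offset = os.path.getsize(self.path)
          |           with open(self.path, "a") as f:
          |               f.write(json.dumps(record) + "\n")
          |           return offset
          | 
          |       def rollback(self, offset):
          |           # Undo later writes by truncating the tail.
          |           with open(self.path, "r+") as f:
          |               f.truncate(offset)
          | 
          |       def replay(self):
          |           with open(self.path) as f:
          |               return [json.loads(l) for l in f if l.strip()]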
        
           | merb wrote:
              | You just described the WAL files of a database.
        
             | KRAKRISMOTT wrote:
             | The generalized architecture is called event sourcing
        
               | capableweb wrote:
                | Well, I'd argue that they're just two different names for
                | similar concepts, applied at different levels. A WAL is a
                | low-level implementation detail, usually for durability,
                | while event sourcing is an architecture applied to solve
                | business problems.
                | 
                | A WAL would usually disappear or be truncated after a
                | while, and you'd only rerun things from it if you
                | absolutely have to. Changes in business requirements
                | shouldn't require you to do anything with a WAL.
                | 
                | In contrast, an event-sourcing log would be kept
                | indefinitely, so when business requirements change, you
                | could (if you want to, it's not required) re-run N
                | previous events and apply new logic to old data in your
                | data storage.
                | 
                | But if you really want to put it that way, it's basically
                | the same thing, just applied differently :)
        
             | tough wrote:
             | ha I was just thinking how similar this was to postgres pg-
             | audit way of reusing the logs to sum up the correct state
        
           | capableweb wrote:
           | > Append workloads are common in distributed systems.
           | 
            | Bringing the topic back to what the parent was saying: since
           | S3 is a pretty common system, and a distributed system at
           | that, are you saying that S3 does support appending data?
           | AFAIK, S3 never supported any append operations.
        
             | vlovich123 wrote:
              | I'll add some nuance here. You can implement append
              | yourself in a clunky way: create a new multipart upload
              | for the file, copy its existing contents, add a new part
              | with the content to append, and then complete the upload.
             | 
             | Not as elegant / fast as GCS's and there may be other
             | subtleties, but it's possible to simulate.
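              | 
              | Roughly, with boto3 (names made up; one real subtlety is
              | that every part except the last must be at least 5 MiB, so
              | the existing object has to be big enough to serve as part
              | 1):
              | 
              |   import boto3
              | 
              |   s3 = boto3.client("s3")
              |   bucket, key = "my-bucket", "logs/app.log"
              | 
              |   mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
              |   uid = mpu["UploadId"]
              | 
              |   # Part 1: server-side copy of the existing object.
              |   p1 = s3.upload_part_copy(
              |       Bucket=bucket, Key=key, UploadId=uid, PartNumber=1,
              |       CopySource={"Bucket": bucket, "Key": key})
              | 
              |   # Part 2: the bytes to append.
              |   p2 = s3.upload_part(
              |       Bucket=bucket, Key=key, UploadId=uid, PartNumber=2,
              |       Body=b"new log lines\n")
              | 
              |   # Atomically replace the object with parts 1 + 2.
              |   s3.complete_multipart_upload(
              |       Bucket=bucket, Key=key, UploadId=uid,
              |       MultipartUpload={"Parts": [
              |           {"PartNumber": 1,
              |            "ETag": p1["CopyPartResult"]["ETag"]},
              |           {"PartNumber": 2, "ETag": p2["ETag"]}]})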
        
             | ozfive wrote:
             | It does not. Consider using services like Amazon Kinesis
             | Firehose, which can buffer and batch logs, then
             | periodically write them to S3.
        
         | Severian wrote:
         | You can use S3 versioning, assuming you have enabled this on
         | the bucket. It would be a little clunky. It would also be done
         | in batches and not continuous append.
         | 
         | Basically if your data is append only (such as a log), buffer
         | whatever reasonable amount is needed, and then put a new
         | version of the file with said data (recording the generated
         | version ID AWS gives you). This gets added to the "stack" of
         | versions of said S3 object. To read them all, you basically get
         | each version from oldest to newest and concatenate them
         | together on the application side.
         | 
         | Tracking versions would need to be done application side
         | overall.
         | 
         | You could also do "random" byte ranges if you track the
         | versioning and your object has the range embedded somewhere in
         | it. You'd still need to read everything to find what is the
         | most up to date as some byte ranges would overwrite others.
         | 
         | Definitely not the most efficient but it is doable.
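          | 
          | A rough boto3 sketch of the read side (names made up,
          | pagination ignored):
          | 
          |   import boto3
          | 
          |   s3 = boto3.client("s3")
          |   bucket, key = "my-bucket", "logs/app.log"
          | 
          |   # Each buffered batch was PUT as a new version of the key.
          |   resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
          |   versions = [v for v in resp.get("Versions", [])
          |               if v["Key"] == key]
          |   versions.sort(key=lambda v: v["LastModified"])
          | 
          |   data = b"".join(
          |       s3.get_object(Bucket=bucket, Key=key,
          |                     VersionId=v["VersionId"])["Body"].read()
          |       for v in versions)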
        
           | jeffbarr wrote:
           | OMG...
        
           | dataangel wrote:
           | what is the advantage of versioning versus just naming your
           | objects log001, log002, etc and opening them in order?
        
             | vlovich123 wrote:
              | You can set up lifecycle policies. For example, auto-
              | delete or auto-archive versions older than a given date.
              | That's one lifecycle rule. With custom naming schemes, it
              | wouldn't scale as well.
        
         | advisedwang wrote:
         | gcsfuse uses compose [1] to append. Basically it uploads the
         | new data to a temp object, then performs a compose operation to
         | make a new object in the place if the original with the
         | combined content.
         | 
         | [1] https://cloud.google.com/storage/docs/composing-objects
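          | 
          | In the Python client the same trick looks roughly like this
          | (names made up; compose accepts up to 32 source objects per
          | call):
          | 
          |   from google.cloud import storage
          | 
          |   bucket = storage.Client().bucket("my-bucket")
          |   original = bucket.blob("logs/app.log")
          |   tail = bucket.blob("logs/app.log.append-tmp")
          | 
          |   # Upload only the new bytes, then stitch them on.
          |   tail.upload_from_string(b"new log lines\n")
          |   original.compose([original, tail])
          |   tail.delete()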
        
         | KptMarchewa wrote:
         | I wonder how it handles potential conflicts.
        
           | londons_explore wrote:
           | For appends, the normal way is to apply the append operations
           | in an arbitrary order if there are multiple concurrent
           | writers. That way you can have 10 jobs all appending data to
           | the same 'file', and you know every record will end up in
           | that file when you later scan through it.
           | 
           | Obviously, you need to make sure no write operation breaks a
           | record midway while doing that. (unlike the posix write() API
           | which can be interrupted midway).
        
             | KptMarchewa wrote:
             | That makes sense - if you keep data in something like
             | ndjson and don't require any order.
             | 
             | If you need order then probably writing to separate files
             | and having compaction jobs is still better.
        
             | boulos wrote:
             | Objects have three fields for this: Version, Generation,
             | and Metageneration. There's also a checksum. You can be
             | sure that you were the writer / winner by checking these.
        
               | dpkirchner wrote:
               | You can also send a x-goog-if-generation-match[0] header
               | that instructs GCS to reject writes that would replace
               | the wrong generation (sort of like a version) of a file.
               | Some utilities use this for locking.
               | 
               | 0: https://cloud.google.com/storage/docs/xml-
               | api/reference-head...
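                | 
                | In the Python client that precondition shows up as
                | if_generation_match; a minimal sketch (names made up):
                | 
                |   from google.cloud import storage
                |   from google.api_core.exceptions import (
                |       PreconditionFailed)
                | 
                |   blob = storage.Client().bucket(
                |       "my-bucket").blob("state/lock.json")
                |   blob.reload()  # read the current generation
                |   try:
                |       blob.upload_from_string(
                |           b'{"owner": "worker-1"}',
                |           if_generation_match=blob.generation)
                |   except PreconditionFailed:
                |       print("lost the race: object changed")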
        
       | buildbuildbuild wrote:
        | The gcsfuse k8s CSI driver also works well if you build your
        | workloads to expect occasional timeouts. It is a shame that a
        | reliable S3-compatible alternative does not yet exist in the
        | open source realm.
        
       | jarym wrote:
        | One thing I don't fully understand: is data cached locally, or
        | would I have to handle that myself (for example, if I have to
        | read a configuration file)? And if it is cached, how can I
        | control how often it refreshes?
        
         | plicense wrote:
          | It uses FUSE, and there are three types of kernel cache you
          | could use with FUSE (although it seems like gcsfuse exposes
          | only one):
          | 
          | 1. Cache of file attributes in the kernel (this is controlled
          | by the "stat-cache-ttl" value:
          | https://github.com/GoogleCloudPlatform/gcsfuse/blob/7dc5c7ff...)
          | 
          | 2. Cache of directory listings
          | 
          | 3. Cache of file contents
          | 
          | It should be possible to use (2) and (3) for better
          | performance, but it might need changes to the underlying FUSE
          | library they use to expose those options.
        
         | hansonw wrote:
         | gcsfuse has controllable built-in caching of _metadata_ but not
         | contents: https://cloud.google.com/storage/docs/gcsfuse-
         | performance-an...
         | 
         | You'd have to use your own cache otherwise. IME the OS-level
         | page cache is actually quite effective at caching reads and
         | seems to work out of the box with gcsfuse.
        
         | droque wrote:
        | I don't think it is; instead, each operation makes a request.
        | You can use something like catfs:
        | https://github.com/kahing/catfs
        
       ___________________________________________________________________
       (page generated 2023-09-06 20:00 UTC)