[HN Gopher] Gcsfuse: A user-space file system for interacting wi... ___________________________________________________________________ Gcsfuse: A user-space file system for interacting with Google Cloud Storage Author : yla92 Score : 126 points Date : 2023-09-06 09:48 UTC (10 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | carbocation wrote: | I do scientific computing in Google Cloud. When I first got | started, I heavily relied on GCSFuse. Over time, I have | encountered enough trouble that I no longer use it for the vast | majority of my work. Instead, I explicitly localize the files I | want to the machine that will be operating on them, and this has | eliminated a whole class of slowdown bugs and availability bugs. | | The scale of data for my work is modest (~50TB, ~1 million files | total, about 50k files per "directory"). | paulddraper wrote: | > The scale of data for my work is modest (~50TB, ~1 million | files total, about 50k files per "directory"). | | Then my work must be downright embarrassing. | nyc_pizzadev wrote: | Did you use a local caching proxy like Varnish or Squid? Would | that have helped? | dekhn wrote: | These codes aren't talking HTTP. They are talking POSIX to a | real filesystem. The problem is that cloud-based FUSE mounts | are never as reliable (they will "just hang" at random times | and you need some sort of external timeout to kill the | process and restart the job and possibly the host) as a real | filesystem (either a local POSIX one or NFS or SMB). | | I've used all the main FUSE cloud FS (gcsfuse, s3-fuse, | rclone, etc.) and they all end up falling over in prod. | | I think a better approach would be to port all the important | science codes to work with file formats like Parquet and use | user-space access libraries linked into the application, and | have both the access library and the user code handle errors | robustly. This is how systems like MapReduce work, and in my | experience they work far more reliably than FUSE mounts when | dealing with 10s to 100s of TBs. | laurencerowe wrote: | These file systems are not a good fit for large numbers of | small files. Their sweet spot is working with large (~GB+) | files which are mostly read from beginning to end. I've mostly | used them for bioinformatics stuff. | ashishbijlani wrote: | FUSE does not work well with a large number of small files (due | to high metadata ops such as inode/dentry lookups). | | ExtFUSE (optimized FUSE with eBPF) [1] can offer you much | higher performance. It caches metadata in the kernel to avoid | lookups in user space. Disclaimer: I built it. | | 1. https://github.com/extfuse/extfuse | laurencerowe wrote: | ExtFUSE seems really cool and great for implementing | performant drivers in userspace for local or lower-latency | network filesystems, but I doubt FUSE is the bottleneck in | this case since S3/GCS have 100ms first-byte latency. | | https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi... | markstos wrote: | I had a similar experience with S3 Fuse. It was slower, more | complex, and more expensive than using S3 directly. I had feared | refactoring my code to use the API, but it went quickly. I've | never gone back to using or recommending a cloud filesystem | like that for a project.
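The "localize files explicitly and talk to the API directly" approach that carbocation and markstos describe needs no FUSE layer at all. A minimal sketch, assuming the Python google-cloud-storage client and hypothetical bucket and object names; timeouts and retries live in application code instead of inside a mount that can hang:

    from google.cloud import storage
    from google.api_core.retry import Retry

    client = storage.Client()
    bucket = client.bucket("my-science-data")            # hypothetical bucket

    # Copy the objects this job needs onto local disk before computing,
    # with an explicit per-request timeout and retry policy.
    for name in ["runs/batch-01/sample.parquet"]:         # hypothetical object
        blob = bucket.blob(name)
        blob.download_to_filename("/tmp/" + name.replace("/", "_"),
                                  timeout=60, retry=Retry(deadline=300))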
| [deleted] | yread wrote: | There is also blobfuse2 for mounting Azure Storage: | https://github.com/Azure/azure-storage-fuse | | It has some nice features like streaming with block-level caching | for fast read-only access. | yla92 wrote: | There is s3fs-fuse as well for AWS S3: https://github.com/s3fs-fuse/s3fs-fuse | | It even supports GCS (as GCS has an S3-compatible API): | | https://github.com/s3fs-fuse/s3fs-fuse/wiki/Google-Cloud-Sto... | alpb wrote: | Hah, nice! I developed https://github.com/ahmetb/azurefs back in | 2012 when I was about to join Azure. I'm glad Azure actually | provides a supported and actively maintained tool for this. | easton wrote: | mountpoint-s3 is AWS's first-party solution for mounting S3 | buckets as file systems: | https://github.com/awslabs/mountpoint-s3 | | Haven't used it, but it looks cool, if a bit immature. | bushbaba wrote: | I'd also look at goofys, which I've found to be quite performant | for reads. Also nice that it's a Golang binary which is easily | passed around to hosts. | nyc_pizzadev wrote: | Does anyone have any experience with how this works at scale? | | Let's say I have a directory tree with 100MM files in a nested | structure, where the average file is 4+ directories deep. When I | `ls` the top few directories, is it fast? How long until I | discover updates? | | Reading the docs, it looks like it's using this API for traversal | [0]? | | What about metadata like creation times, permissions, owner, | group? | | Any consistency concerns? | | [0] | https://cloud.google.com/storage/docs/json_api/v1/objects/li... | BrandonY wrote: | Hi, Brandon from GCS here. If you're looking for all of the | guarantees of a real POSIX filesystem, you want to do fast top-level | directory listing for 100MM+ nested files, and POSIX | permissions/owner/group and other file metadata are important | to you, Gcsfuse is probably not what you're after. You might | want something more like Filestore: | https://cloud.google.com/filestore | | We've got some additional documentation on the differences and | limitations between Gcsfuse and a proper POSIX filesystem: | https://cloud.google.com/storage/docs/gcs-fuse#expandable-1 | | Gcsfuse is a great way to mount Cloud Storage buckets and view | them like they're in a filesystem. It scales quite well for all | sorts of uses. However, Cloud Storage itself is a flat | namespace with no built-in directory support. Listing the few | top-level directories of a bucket with 100MM files more or less | requires scanning over your entire list of objects, which means | it's not going to be very fast. Listing objects in a leaf | directory will be much faster, though. | milesward wrote: | Brandon, I know why this was built, and I agree with your | list of viable uses; that said, it strikes me as extremely | likely to lead to gnarly support load, grumpy customers, and | system instability when it is inevitably misused. What steps | across all of the user interfaces is GCP taking to warn users | who may not understand their workload characteristics at all | as to the narrow utility of this feature? | nyc_pizzadev wrote: | Thanks for the reply. | | Our theoretical use case is 10+ PB and we need multiple TB/s | of read throughput (maybe a fraction of that for writing). | So I don't think Filestore fits this scale, right? | | As for the directory traversals, I guess caching might help | here? Top-level changes aren't as frequent as leaf additions.
| | That being said, I don't see any (caching) proxy support | anywhere other than the Google CDN. | daviesliu wrote: | If you really expect a file system experience over GCS, please | try JuiceFS [1], which scales to 10 billion files pretty | well with TiKV or FoundationDB as the metadata engine. | | PS: I'm the founder of JuiceFS. | | [1] https://github.com/juicedata/juicefs | victor106 wrote: | The description says S3. Does it also support GCS? | 8organicbits wrote: | The architecture image shows GCS and others, so I suspect | it does. | | https://github.com/juicedata/juicefs#architecture | skrowl wrote: | [dead] | asah wrote: | gcsfuse worked great for me on a couple of projects, but YMMV for | production use. As with all distributed storage systems, make | sure you can handle timeouts, retries, high-latency periods, and | outages. | djbusby wrote: | Why not rclone? It was discussed here yesterday as a replacement | for sshfs - and supports GCS as well as dozens more backends. | | https://rclone.org/ | | https://news.ycombinator.com/item?id=37390184 | [deleted] | capableweb wrote: | Last time gcsfuse was on HN | (https://news.ycombinator.com/item?id=35784889), the author of | rclone was in the comments: | | > From reading the docs, it looks very similar to `rclone | mount` with `--vfs-cache-mode off` (the default). The | limitations are almost identical. | | > However rclone has `--vfs-cache-mode writes` which caches | file writes to disk first to allow overwriting in the middle of | a file and `--vfs-cache-mode full` to cache all objects on an | LRU basis. They both make the file system a whole lot more | POSIX-compatible and most applications will run using | `--vfs-cache-mode writes`, unlike `--vfs-cache-mode off`. | | https://news.ycombinator.com/item?id=35788919 | | Seems rclone would be an even better option than Google's own | tool. | tough wrote: | Comments like this are why the HN comments section is usually | better than the news in it. | | Also hi capableweb, I think your name rings a bell from | LLM/Gen AI threads. | capableweb wrote: | Me too! | | Hello! That's probably a sign I need to take a break from | writing too many HN comments per day, thanks :) | tough wrote: | Don't worry man, wasn't implying that, it's just cool to | see the same names/non-faces around tbh. simon is | likewise heh | paulgb wrote: | > Cloud Storage FUSE can only write whole objects at a time to | Cloud Storage and does not provide a mechanism for patching. If | you try to patch a file, Cloud Storage FUSE will reupload the | entire file. The only exception to this behavior is that you can | append content to the end of a file that's 2 MB and larger, where | Cloud Storage FUSE will only reupload the appended content. | | I didn't know GCS supported appends efficiently. Correct me if | I'm wrong, but I don't think S3 has an equivalent way to append | to a value, which makes it clunky to work with as a log sink. | nicornk wrote: | With S3 you can do something similar by misusing the multipart | upload functionality, e.g.: | https://github.com/fsspec/s3fs/blob/fa1c76a3b75c6d0330ed03c4... | rickette wrote: | Azure Blob Storage actually has explicit append support using | "Append" blobs (next to block and page blobs). | capableweb wrote: | Building a storage service like these today and not having | "append" would be very silly indeed. I guess S3 is kind of | excused since it's so old by now. Although I haven't read | anything about them adding it, so maybe less excused...
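For what it's worth, the multipart trick nicornk links to (and vlovich123 spells out below) can be sketched roughly like this with boto3, using hypothetical bucket and key names. The existing object is server-side copied as part 1 (parts other than the last must be at least 5 MiB) and the new bytes become part 2:

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-log-bucket", "logs/app.log"   # hypothetical names

    # Start a multipart upload that will overwrite the same key.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    # Part 1: server-side copy of the existing object (nothing is downloaded).
    part1 = s3.upload_part_copy(Bucket=bucket, Key=key, UploadId=upload_id,
                                PartNumber=1,
                                CopySource={"Bucket": bucket, "Key": key})

    # Part 2: the bytes being "appended".
    part2 = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                           PartNumber=2, Body=b"new log lines\n")

    # Completing the upload atomically replaces the object with old + new bytes.
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": [
            {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
            {"PartNumber": 2, "ETag": part2["ETag"]},
        ]})

This is simulated append rather than real append: the whole object is rewritten server-side, so the cost grows with object size even though no existing data is re-uploaded from the client.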
| paulddraper wrote: | > Correct me if I'm wrong, but I don't think S3 has an | equivalent way to append to a value, which makes it clunky to | work with as a log sink. | | You are correct. (There are multipart uploads, but that's kinda | different.) | | ELB logs are delivered as separate objects every few minutes, | FWIW. | londons_explore wrote: | Append workloads are common in distributed systems. Turns out | nearly every time you think you need random read/write to a | data structure (e.g. a hard drive/block device for a | Windows/Linux VM), you can instead emulate that with an | append-only log of changes and a set of append-only indexes. | | Doing so has huge benefits: Write performance is _way_ higher, | you can do rollbacks easily (just ignore the tail of the | files), you can do snapshotting easily (just make a new file | and include by reference a byte range of the parent), etc. | | The downside is that from time to time you need to make a new file | and chuck out the dead data - but such an operation can be done | 'online', and can be done during times of lower system load. | merb wrote: | You just described the WAL files of a database. | KRAKRISMOTT wrote: | The generalized architecture is called event sourcing. | capableweb wrote: | Well, I'd argue that it's just two different names for | similar concepts, but applied at different levels. A WAL is | a low-level implementation detail, usually for | durability, while event sourcing is an architecture applied | to solve business problems. | | A WAL would usually disappear or truncate its length | after a while, and you'd only rerun things from it if you | absolutely have to. Changes in business requirements | shouldn't require you to do anything with a WAL. | | In contrast, an event sourcing log would be kept | indefinitely, so when business requirements change, you | could (if you want to, not required) re-run N previous | events so you can apply new changes to old data in your | data storage. | | But, if you really want to, it's basically the same, but | in the end, applied differently :) | tough wrote: | ha, I was just thinking how similar this is to Postgres | pgaudit's way of reusing the logs to sum up the correct state | capableweb wrote: | > Append workloads are common in distributed systems. | | Bringing the topic back to what the parent was saying; since | S3 is a pretty common system, and a distributed system at | that, are you saying that S3 does support appending data? | AFAIK, S3 never supported any append operations. | vlovich123 wrote: | I'll add some nuance here. You can implement append | yourself in a clunky way. Create a new multipart upload for | the file, copy its existing contents, create a new part | appending what you need, and then complete the upload. | | Not as elegant / fast as GCS's, and there may be other | subtleties, but it's possible to simulate. | ozfive wrote: | It does not. Consider using services like Amazon Kinesis | Firehose, which can buffer and batch logs, then | periodically write them to S3. | Severian wrote: | You can use S3 versioning, assuming you have enabled this on | the bucket. It would be a little clunky. It would also be done | in batches and not as a continuous append. | | Basically, if your data is append-only (such as a log), buffer | whatever reasonable amount is needed, and then put a new | version of the file with said data (recording the generated | version ID AWS gives you). This gets added to the "stack" of | versions of said S3 object.
To read them all, you basically get | each version from oldest to newest and concatenate them | together on the application side. | | Tracking versions would need to be done application-side | overall. | | You could also do "random" byte ranges if you track the | versioning and your object has the range embedded somewhere in | it. You'd still need to read everything to find what is the | most up to date, as some byte ranges would overwrite others. | | Definitely not the most efficient, but it is doable. | jeffbarr wrote: | OMG... | dataangel wrote: | What is the advantage of versioning versus just naming your | objects log001, log002, etc. and opening them in order? | vlovich123 wrote: | You can set up lifecycle policies. For example, auto-delete | or auto-archive versions > X date. That's one lifecycle | rule. With custom naming schemes, it wouldn't scale as | well. | advisedwang wrote: | gcsfuse uses compose [1] to append. Basically it uploads the | new data to a temp object, then performs a compose operation to | make a new object in the place of the original with the | combined content. | | [1] https://cloud.google.com/storage/docs/composing-objects | KptMarchewa wrote: | I wonder how it handles potential conflicts. | londons_explore wrote: | For appends, the normal way is to apply the append operations | in an arbitrary order if there are multiple concurrent | writers. That way you can have 10 jobs all appending data to | the same 'file', and you know every record will end up in | that file when you later scan through it. | | Obviously, you need to make sure no write operation breaks a | record midway while doing that (unlike the POSIX write() API, | which can be interrupted midway). | KptMarchewa wrote: | That makes sense - if you keep data in something like | ndjson and don't require any order. | | If you need order, then probably writing to separate files | and having compaction jobs is still better. | boulos wrote: | Objects have three fields for this: Version, Generation, | and Metageneration. There's also a checksum. You can be | sure that you were the writer / winner by checking these. | dpkirchner wrote: | You can also send an x-goog-if-generation-match[0] header | that instructs GCS to reject writes that would replace | the wrong generation (sort of like a version) of a file. | Some utilities use this for locking. | | 0: https://cloud.google.com/storage/docs/xml-api/reference-head... | buildbuildbuild wrote: | The gcsfuse k8s CSI also works well if you build to expect | occasional timeouts. It is a shame that a reliable S3-compatible | alternative does not yet exist in the open-source realm. | jarym wrote: | One thing I don't fully understand is whether data is cached | locally or whether I would have to handle that myself (for | example, if I have to read a configuration file)? And if it is | cached, how can I control how often it refreshes? | plicense wrote: | It uses FUSE, and there are three types of kernel cache you could | use with FUSE (although it seems like gcsfuse is exposing only | one): | | 1. Cache of file attributes in the kernel (this is controlled | by the "stat-cache-ttl" value - https://github.com/GoogleCloudPlatform/gcsfuse/blob/7dc5c7ff...) | 2. Cache of directory listings | 3. Cache of file contents | | It should be possible to use (2) and (3) for better | performance, but it might need changes to the underlying FUSE | library they use to expose those options.
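A rough sketch of the compose-based append advisedwang describes, plus the generation precondition dpkirchner mentions, assuming the Python google-cloud-storage client and hypothetical bucket and object names:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")             # hypothetical bucket

    log = bucket.blob("logs/app.log")
    tmp = bucket.blob("logs/.app.log.append")       # hypothetical temp object

    # Upload only the new bytes as a temporary object.
    tmp.upload_from_string(b"new log lines\n")

    # Server-side compose rewrites logs/app.log as original + temp,
    # so only the appended bytes cross the network.
    log.compose([log, tmp])
    tmp.delete()

    # Optimistic concurrency: make a later write conditional on the generation
    # we last saw; GCS rejects it if another writer changed the object first.
    log.reload()
    log.upload_from_string(b"replacement contents",
                           if_generation_match=log.generation)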
| hansonw wrote: | gcsfuse has controllable built-in caching of _metadata_ but not | contents: https://cloud.google.com/storage/docs/gcsfuse-performance-an... | | You'd have to use your own cache otherwise. IME the OS-level | page cache is actually quite effective at caching reads and | seems to work out of the box with gcsfuse. | droque wrote: | I don't think it is; instead, each operation makes a request. | You can use something like catfs: | https://github.com/kahing/catfs ___________________________________________________________________ (page generated 2023-09-06 20:00 UTC)