[HN Gopher] HopsFS: 100x Times Faster Than AWS S3
___________________________________________________________________
 
HopsFS: 100x Times Faster Than AWS S3
 
Author : nathaliaariza
Score  : 88 points
Date   : 2020-11-19 13:17 UTC (2 days ago)
 
(HTM) web link (www.logicalclocks.com)
(TXT) w3m dump (www.logicalclocks.com)
 
| elhawtaky wrote:
| I'm the author. Let me know if you have any questions.
  | [deleted]
  | fwip wrote:
  | What tradeoffs did you make? In what situations does S3 have
  | better characteristics than HopsFS?
| fh973 wrote:
| > ... but has 100X the performance of S3 for file move/rename
| operations
|
| Isn't rename in S3 effectively a copy-delete operation?
  | booi wrote:
  | That's my understanding too. Also, rename/copy turned out not
  | to be very useful at the end of the day. Nearly all my
  | implementations just boil down to randomized characters as
  | ids.
    | yandie wrote:
    | Yup. Use a system like Redis/DynamoDB or even a traditional
    | database to store the metadata, and use a random UUID for
    | the actual file storage.
    |
    | And tag the files for expiration/clean-up. S3 is not a file
    | system, and people should stop treating it like one - only
    | to get bitten by these assumptions around it being a FS.
| ajb wrote:
| You say it's "POSIX-like" - so what from POSIX had to be left
| out?
| de_Selby wrote:
| How does this differ from ObjectiveFS?
| dividuum wrote:
| The linked post says it "has the same cost as S3", yet on the
| linked pricing page there's only a "Contact Us" Enterprise plan
| besides a free one. Am I missing something?
| hilbertseries wrote:
| Reading through your article, this solution is built on top of
| S3. So moving and listing files is faster, presumably due to a
| new metadata system you've built for tracking files. The
| trade-off here is that writes must be strictly slower than they
| were previously, because you've added a network hop. All read
| and write data now flows through these workers, which adds a
| point of failure: if you stream too much data through these
| workers, you could potentially OOM them. Reads are potentially
| faster, but that also depends on the hit rate of the block
| cache. It would be nice to see a more transparent post listing
| the pros and cons, rather than what reads as a technical
| advertisement.
| mwcampbell wrote:
| > But, until today, there has been no equivalent to ADLS for
| S3.
|
| ObjectiveFS has been around for several years. How is HopsFS
| better?
  | SirOibaf wrote:
  | I'd say that the main difference from ObjectiveFS is metadata
  | operations. From the documentation of ObjectiveFS:
  |
  | `doesn't fully support regular file system semantics or
  | consistency guarantees (e.g. atomic rename of directories,
  | mutual exclusion of open exclusive, append to file requires
  | rewriting the whole file and no hard links).`
  |
  | HopsFS _does_ provide strongly consistent metadata operations
  | like atomic directory rename, which is essential if you are
  | running frameworks like Apache Spark.
    | mwcampbell wrote:
    | That quote from the ObjectiveFS documentation [1] is out of
    | context. It was describing limitations in s3fs, not
    | ObjectiveFS. My understanding is that because ObjectiveFS is
    | a log-structured filesystem that uses S3 as underlying
    | storage, it doesn't have those limitations.
    |
    | [1]: https://objectivefs.com/faq
| hartator wrote:
| > 100X the performance of S3 for file move/rename operations
|
| I don't see how this can be useful. Moving or renaming files in
| S3 seems more like maintenance than something you want to do on
| a regular basis.
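 
A note on the rename claim discussed above: as fh973 observes, S3
has no native rename; a "rename" is a per-object copy followed by
a delete. Below is a minimal sketch of this with boto3 (the bucket
and prefix names are made up for illustration), showing why the
cost grows with the number of objects and why the operation is not
atomic:
 
    # Sketch: "renaming" a prefix on S3 is O(n) copy+delete calls.
    # A concurrent lister can observe both prefixes mid-rename.
    import boto3
 
    s3 = boto3.client("s3")
    BUCKET = "example-bucket"  # hypothetical
 
    def rename_prefix(old_prefix, new_prefix):
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET,
                                       Prefix=old_prefix):
            for obj in page.get("Contents", []):
                old_key = obj["Key"]
                new_key = new_prefix + old_key[len(old_prefix):]
                s3.copy_object(Bucket=BUCKET, Key=new_key,
                               CopySource={"Bucket": BUCKET,
                                           "Key": old_key})
                s3.delete_object(Bucket=BUCKET, Key=old_key)
 
    rename_prefix("raw/2020-11-19/", "processed/2020-11-19/")
 
A filesystem with real metadata (HDFS, HopsFS, ADLS) can instead
rename a directory as a single metadata operation, which is where
the claimed 100X comes from.
 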
  | stingraycharles wrote:
  | We do this in our ETL jobs on several hundred thousand files a
  | day. Not a reason to switch to a different system, but there
  | are definitely non-maintenance use cases for this.
    | otterley wrote:
    | Any particular reason you don't tag the objects instead?
    | That's a significantly lighter-weight operation, since S3
    | doesn't have native renaming capability.
      | bfrydl wrote:
      | You can't list objects by tag.
| jeffbee wrote:
| The diagram seems to imply an active/passive namenode setup
| like HDFS. Doesn't that limit it to tiny filesystems, and curse
| it with the same availability problems that plague HDFS?
  | SirOibaf wrote:
  | No, HopsFS namenodes are stateless, as the metadata is stored
  | in an in-memory distributed database (NDB).
  |
  | There are other papers that describe the HopsFS architecture
  | in more detail if you are interested:
  | https://www.usenix.org/system/files/conference/fast17/fast17...
  | chordysson wrote:
  | No, because HopsFS uses a stateless namenode architecture.
  | From https://github.com/hopshadoop/hops
  |
  | "HopsFS is a new implementation of the Hadoop Filesystem
  | (HDFS), that supports multiple stateless NameNodes, where the
  | metadata is stored in MySQL Cluster, an in-memory distributed
  | database. HopsFS enables more scalable clusters than Apache
  | HDFS (up to ten times larger clusters), and enables NameNode
  | metadata to be both customized and analyzed, because it can
  | now be easily accessed via a SQL API."
| oneplane wrote:
| Does it matter? I often see "our product is much faster than
| thing X at cloud Y!" and find myself asking why. Why would I
| want something less integrated, for a performance change I
| don't need, a software change I'll have to write, and extra
| overhead from dealing with another source?
|
| It's great that one individual thing is better than one other
| individual thing, but if you look at the bigger picture, it
| generally isn't that individual thing by itself that you are
| using.
  | ethanwillis wrote:
  | It does matter, because not every use case requires everything
  | but the kitchen sink. If you're building things that _only_
  | ever require the same mass-produced components for every
  | application, well, that stifles the possibilities of what can
  | be built.
    | oneplane wrote:
    | So say you have a solution for your object storage and it is
    | plenty. Then a different product pops up, solves the same
    | problem in exactly the same way, and costs the same.
    | Migrating isn't free, and at the end you have two suppliers
    | to manage. Does that make any sense at all? I think not.
    |
    | There is a different case that makes sense: you have object
    | storage and it's not sufficient, so you go look for object
    | storage suppliers that deliver something different so it
    | suits your need. Now it makes sense to look for a service
    | that is relatively similar to what you are already using but
    | is better in a factor that is significant for your
    | application (i.e. speed of object key changes).
    |
    | Marketing just says: "Look at us, we are faster." I think
    | that message is not going to matter unless that happens to
    | be your exact problem in isolation, which isn't exactly
    | common; systems don't tend to run in isolation.
  | rgbrenner wrote:
  | Ok, so you're not the target customer for this product. Do you
  | believe people with this problem should be deprived of a
  | solution just because you don't need it?
    | oneplane wrote:
    | I don't think I suggest that a product should not exist. I
    | suggest that targeting the competition instead of the use
    | case is a bit silly.
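 
To make the stateless-NameNode discussion above concrete: when
file metadata lives in a transactional database (MySQL Cluster/NDB
in HopsFS's case), renaming a directory can be a single-row update
inside a transaction, no matter how many descendants the directory
has. Here is a toy sketch using sqlite3 as a stand-in for the
metadata store; the schema is hypothetical, not HopsFS's actual
one:
 
    # Toy model: inodes keyed by (parent, name). Renaming a
    # directory updates one row; children reference the parent by
    # id and are untouched. The transaction makes it atomic.
    import sqlite3
 
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE inodes (
                      id     INTEGER PRIMARY KEY,
                      parent INTEGER,
                      name   TEXT,
                      UNIQUE (parent, name))""")
 
    # Root is id 0; "raw" is a directory with two files in it.
    db.executemany("INSERT INTO inodes VALUES (?, ?, ?)",
                   [(1, 0, "raw"), (2, 1, "a.csv"),
                    (3, 1, "b.csv")])
 
    with db:  # atomic rename: one row, O(1) in subtree size
        db.execute("UPDATE inodes SET name = ? WHERE id = ?",
                   ("processed", 1))
 
Contrast this with the per-object copy+delete sketch earlier in
the thread.
 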
| Just1689 wrote:
| Do you plan on having a Kubernetes storage provider? I like the
| idea of presenting a POSIX-like mount to a container while
| paying per use for storage in S3.
  | threeseed wrote:
  | You can do this using s3backer, goofys etc. to mount S3 as a
  | host filesystem and then use the host-path provisioner.
| rwdim wrote:
| What's the latency with 100,000 and 1,000,000 files?
  | elhawtaky wrote:
  | We haven't run mv/list experiments on 100,000 and 1,000,000
  | files for the blog. However, we expect the gap in latency
  | between HopsFS/S3 and EMRFS would increase even further with
  | larger directories. In our original HopsFS paper, we showed
  | that the latency of mv/rename for a directory with 1 million
  | files was around 5.8 seconds.
  | [deleted]
| therealmarv wrote:
| Reading/writing SMALL files is SUPER slow on things like S3,
| Google Drive, Backblaze. Using a lot of threads only helps a
| little; it's nowhere near the read/write speed of e.g. a single
| 600MB file.
|
| Is HopsFS helping in this area?
  | chordysson wrote:
  | Regarding small files, HopsFS can store small files in the
  | metadata layer for improved performance.
  |
  | https://kth.diva-portal.org/smash/record.jsf?pid=diva2:12608...
  | ignoramous wrote:
  | How small are the files?
  |
  | Here are some strategies to make S3 go faster [0].
  |
  | You're right that S3 isn't for small files, but for a lot of
  | small files (think 500 bytes), I either push them to S3
  | through Kinesis Firehose or fit them, gzipped, into DynamoDB.
  |
  | You could also consider using Amazon FSx, which can read and
  | write S3.
  |
  | [0] https://news.ycombinator.com/item?id=19475726
  | Matthias247 wrote:
  | It's not surprising that it's slow. Handling files has two
  | costs associated with it. One is a fixed setup cost, which
  | includes looking up the file metadata, creating connections
  | between all the services that make it possible to access the
  | file, and starting the read/write. The other part is actually
  | transferring the file content.
  |
  | For small files the fixed cost will be the most important
  | factor. The "transfer time" after this "first byte latency"
  | might actually be 0, since all the data could be transferred
  | within a single write call on each node.
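 
To put numbers behind Matthias247's fixed-cost point, here is a
rough benchmark sketch with boto3 (the bucket name is made up;
results will vary with region, object size, and thread count).
Threads overlap the per-request latency, but every small object
still pays it once:
 
    # Compare N small PUTs (threaded) against one large PUT of
    # the same total size. The large PUT pays the per-request
    # setup cost once; the small PUTs pay it N times.
    import time
    from concurrent.futures import ThreadPoolExecutor
 
    import boto3
 
    s3 = boto3.client("s3")
    BUCKET = "example-bucket"      # hypothetical
    N, CHUNK = 1000, 600 * 1024    # 1000 x 600 KB vs 1 x ~600 MB
 
    def put_small(i):
        s3.put_object(Bucket=BUCKET, Key=f"small/{i}",
                      Body=b"x" * CHUNK)
 
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(put_small, range(N)))
    t_small = time.time() - t0
 
    t0 = time.time()
    s3.put_object(Bucket=BUCKET, Key="big/blob",
                  Body=b"x" * (N * CHUNK))  # ~600 MB in memory
    t_big = time.time() - t0
 
    print(f"{N} small puts: {t_small:.1f}s; one large put: "
          f"{t_big:.1f}s")
 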
| tutfbhuf wrote:
| A highly available, durable filesystem is a difficult problem
| to solve. It usually starts with NFS, which is one big single
| point of failure. Depending on the nature of the application,
| this might be good enough.
|
| But if it's not, you'll typically want cross-datacenter
| replication so that if one rack goes down you don't lose all
| your data. So then you're looking at something like
| GlusterFS/MooseFS/Ceph. But the latencies involved with
| synchronously replicating to multiple datacenters can really
| kill your performance. For example, try git cloning a large
| project onto a GlusterFS mount with >20ms ping between nodes.
| It's brutal.
|
| Other products try to do asynchronous replication; EdgeFS is
| one I was looking at recently. This follows the general
| industry trend, like it or not, of "eventually consistent is
| consistent enough". Not much better than a cron job + rsync, in
| my opinion, but for some workloads it's good enough. If there's
| a partition, you'll lose data.
|
| Eventually you just give up and realize that a perfectly
| synchronized, geographically distributed POSIX filesystem is a
| pipe dream. You bite the bullet, rewrite your app against S3,
| and call it a day.
  | fock wrote:
  | Wait, S3 can do git now? I guess you are right about nearly
  | everything... but starting with git on Gluster and then
  | jumping to selling apples as bananas (e.g. S3) doesn't really
  | make an argument.
  | the8472 wrote:
  | > For example, try git cloning a large project onto a
  | Glusterfs mount with >20ms ping between nodes. It's brutal.
  |
  | That may be true, but it is also due to applications often
  | having very sequential IO patterns even when they don't need
  | to.
  |
  | I hope we'll get some convenience wrappers around io_uring
  | that make batching of many small IO calls in a synchronous
  | manner simple and easy for cases where you don't want to deal
  | with async runtimes. E.g. bulk_statx() or fsync_many() would
  | be prime candidates for batching.
| Lucasoato wrote:
| Can this be installed on AWS EMR clusters?
  | threeseed wrote:
  | You could, but I would do thorough real-world benchmarks.
  |
  | EMRFS has dozens of optimisations for Spark/Hadoop workloads,
  | e.g. S3 Select, partition pruning, optimised committers etc.,
  | and since EMR is a core product it is continually being
  | improved. Using HopsFS would negate all of that.
  | SirOibaf wrote:
  | Not really, but you can try it out on https://hopsworks.ai
  |
  | It's conceptually similar to EMR in the way it works. You
  | connect your AWS account and we'll deploy a cluster there.
  | HopsFS will run on top of an S3 bucket in your organization.
  | You get a fully featured Spark environment (with metrics and
  | logging included - no need for CloudWatch), a UI with Jupyter
  | notebooks, the Hopsworks feature store, and ML capabilities
  | that EMR does not provide.
| threeseed wrote:
| Looks no different from Alluxio, the MinIO S3 gateway, or the
| dozens of other S3 caches around.
|
| It would've been more interesting had it taken advantage of
| newer technologies such as io_uring, NVMe over Fabrics, RDMA,
| etc.
___________________________________________________________________
(page generated 2020-11-21 23:00 UTC)