[HN Gopher] HopsFS: 100x Times Faster Than AWS S3
       ___________________________________________________________________
        
       HopsFS: 100x Times Faster Than AWS S3
        
       Author : nathaliaariza
       Score  : 88 points
       Date   : 2020-11-19 13:17 UTC (2 days ago)
        
 (HTM) web link (www.logicalclocks.com)
 (TXT) w3m dump (www.logicalclocks.com)
        
       | elhawtaky wrote:
       | I'm the author. Let me know if you have any questions.
        
         | [deleted]
        
         | fwip wrote:
         | What tradeoffs did you make? In what situations does S3 have
         | better characteristics than HopsFS?
        
         | fh973 wrote:
         | > ... but has 100X the performance of S3 for file move/rename
         | operations
         | 
         | Isn't rename in S3 effectively a copy-delete operation?
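          | 
          | (I.e., roughly this - a minimal boto3 sketch, with made-up
          | bucket and key names:)
          | 
          |     import boto3
          | 
          |     s3 = boto3.client("s3")
          | 
          |     def rename(bucket, src_key, dst_key):
          |         # No rename primitive: server-side copy, then delete
          |         # the original object.
          |         s3.copy_object(Bucket=bucket,
          |                        CopySource={"Bucket": bucket,
          |                                    "Key": src_key},
          |                        Key=dst_key)
          |         s3.delete_object(Bucket=bucket, Key=src_key)
          | 
          |     rename("my-bucket", "raw/part-0001.csv",
          |            "staged/part-0001.csv")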
        
           | booi wrote:
            | That's my understanding too. Also, rename/copy turned out
            | not to be very useful at the end of the day. Nearly all my
            | implementations just boil down to randomized characters as
            | ids.
        
             | yandie wrote:
             | Yup. Use a system like Redis/DynamoDB or even a traditional
             | database to store the metadata and use random UUID for
             | actual file storage.
             | 
             | And tag the files for expiration/clean up. S3 is not a file
             | system and people should stop treating it like one - only
             | to get bitten by these assumptions around it being a FS.
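              | 
              | Roughly this pattern (a sketch; the table name
              | "file_index" and bucket "blob-store" are made up):
              | 
              |     import uuid, boto3
              | 
              |     s3 = boto3.client("s3")
              |     table = boto3.resource("dynamodb").Table("file_index")
              | 
              |     def store(data, logical_name, ttl_epoch):
              |         file_id = uuid.uuid4().hex   # random, never renamed
              |         s3.put_object(Bucket="blob-store", Key=file_id,
              |                       Body=data)
              |         table.put_item(Item={
              |             "file_id": file_id,
              |             # "renames" only touch this metadata row
              |             "logical_name": logical_name,
              |             # DynamoDB TTL handles expiration/cleanup
              |             "expires_at": ttl_epoch,
              |         })
              |         return file_id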
        
         | ajb wrote:
          | You say it's "POSIX-like" - so what from POSIX had to be
          | left out?
        
         | de_Selby wrote:
         | How does this differ from Objectivefs?
        
         | dividuum wrote:
         | The linked post says "has the same cost as S3", yet on the
         | linked pricing page there's only a "Contact Us" Enterprise plan
         | besides a free one. Am I missing something?
        
         | hilbertseries wrote:
         | Reading through your article, this solution is built on top of
         | s3. So, moving and listing files is faster, presumably due to a
         | new metadata system you've built for tracking files. The trade
         | off here, is that writes must be strictly slower now than they
         | were previously because you've added a network hop. All read
         | and write data now flows through these workers. Which adds a
         | point of failure, if you steam too much data through these
         | workers, you could potentially OOM them. Reads are potentially
         | faster, but that also depends on the hit rate of the block
         | cache. Would be nice to see a more transparent post listing the
         | pros and cons, rather than what reads as a technical
         | advertisement.
        
       | mwcampbell wrote:
       | > But, until today, there has been no equivalent to ADLS for S3.
       | 
       | ObjectiveFS has been around for several years. How is HopsFS
       | better?
        
         | SirOibaf wrote:
          | I'd say that the main difference from ObjectiveFS is
          | metadata operations. From the documentation of ObjectiveFS:
         | 
         | `doesn't fully support regular file system semantics or
         | consistency guarantees (e.g. atomic rename of directories,
         | mutual exclusion of open exclusive, append to file requires
         | rewriting the whole file and no hard links).`
         | 
         | HopsFS _does_ provide strongly consistent metadata operations
         | like atomic directory rename, which is essential if you are
         | running frameworks like Apache Spark.
        
           | mwcampbell wrote:
           | That quote from the ObjectiveFS documentation [1] is out of
           | context. It was describing limitations in s3fs, not
           | ObjectiveFS. My understanding is that because ObjectiveFS is
           | a log-structured filesystem that uses S3 as underlying
           | storage, it doesn't have those limitations.
           | 
           | [1]: https://objectivefs.com/faq
        
       | hartator wrote:
       | > 100X the performance of S3 for file move/rename operations
       | 
       | I don't see how it can be useful. Moving or renaming files in S3
       | seems more like maintenance than something you want to do on a
       | regular basis.
        
         | stingraycharles wrote:
          | We do this in our ETL jobs on several hundred thousand files
          | a day. Not a reason to switch to a different system, but
          | there are definitely non-maintenance use cases for this.
        
           | otterley wrote:
           | Any particular reason you don't tag the objects instead?
           | That's a significantly lighter-weight operation, since S3
           | doesn't have native renaming capability.
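            | 
            | E.g. (a rough boto3 sketch; bucket, key, and tag names are
            | placeholders):
            | 
            |     import boto3
            | 
            |     s3 = boto3.client("s3")
            | 
            |     # Mark the object in place instead of copying it to a
            |     # new key.
            |     s3.put_object_tagging(
            |         Bucket="my-bucket",
            |         Key="raw/part-0001.csv",
            |         Tagging={"TagSet": [{"Key": "stage",
            |                              "Value": "processed"}]},
            |     )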
        
             | bfrydl wrote:
             | You can't list objects by tag.
        
       | jeffbee wrote:
       | Diagram seems to imply an active/passive namenode setup like
       | HDFS. Doesn't that limit it to tiny filesystems, and curse it
       | with the same availability problems that plague HDFS?
        
         | SirOibaf wrote:
          | No, HopsFS namenodes are stateless, as the metadata is
          | stored in an in-memory distributed database (NDB).
          | 
          | There are other papers that describe the HopsFS architecture
          | in more detail if you are interested:
          | https://www.usenix.org/system/files/conference/fast17/fast17...
        
         | chordysson wrote:
          | No, because HopsFS uses a stateless namenode architecture.
          | From https://github.com/hopshadoop/hops:
         | 
         | "HopsFS is a new implementation of the Hadoop Filesystem
         | (HDFS), that supports multiple stateless NameNodes, where the
         | metadata is stored in MySQL Cluster, an in-memory distributed
         | database. HopsFS enables more scalable clusters than Apache
         | HDFS (up to ten times larger clusters), and enables NameNode
         | metadata to be both customized and analyzed, because it can now
         | be easily accessed via a SQL API."
        
       | oneplane wrote:
        | Does it matter? I often see 'our product is much faster than
        | thing X at cloud Y!' and find myself asking why. Why would I
        | want something less integrated, for a performance change I
        | don't need, a software change I'll have to write, and extra
        | overhead for dealing with another source?
       | 
       | It's great that one individual thing is better than one other
       | individual thing, but if you look at the bigger picture it
       | generally isn't that individual thing by itself that you are
       | using.
        
         | ethanwillis wrote:
         | It does matter, because not every use case requires everything
         | but the kitchen sink. If you're building things that _only_
         | ever require the same mass produced components for every
         | application, well that stifles the possibilities of what can be
         | built.
        
           | oneplane wrote:
            | So say you have a solution for your object storage and it
            | is plenty. Then a different product pops up, solves the
            | same problem in exactly the same way, and costs the same.
            | Migrating isn't free, and at the end you have two suppliers
            | to manage. Does that make any sense at all? I think not.
            | 
            | There is a different case that makes sense: you have object
            | storage and it's not sufficient, so you go look for object
            | storage suppliers that deliver something different so it
            | suits your need. Then it makes sense to look for a service
            | that is relatively similar to what you are already using
            | but is better in a factor that is significant for your
            | application (e.g. speed of object key changes).
            | 
            | Marketing just says: "Look at us, we are faster". I think
            | that message is not going to matter unless that happens to
            | be your exact problem in isolation, which isn't exactly
            | common; systems don't tend to run in isolation.
        
         | rgbrenner wrote:
         | Ok so you're not the target customer for this product. Do you
         | believe people with this problem should be deprived of a
         | solution just because you don't need it?
        
           | oneplane wrote:
            | I don't think I suggested that the product should not
            | exist. I'm suggesting that targeting the competition
            | instead of the use case is a bit silly.
        
       | Just1689 wrote:
        | Do you plan on having a Kubernetes storage provider? I like
        | the idea of presenting a POSIX-like mount to a container while
        | paying per use for storage in S3.
        
         | threeseed wrote:
         | You can do this using s3backer, goofys etc to mount S3 as a
         | host filesystem and then use the host path provisioner.
        
       | rwdim wrote:
       | What's the latency with 100,000 and 1,000,000 files?
        
         | elhawtaky wrote:
          | We haven't run mv/list experiments on 100,000 and 1,000,000
          | files for the blog. However, we expect the gap in latency
          | between HopsFS/S3 and EMRFS would increase even further with
          | larger directories. In our original HopsFS paper, we showed
          | that the latency for mv/rename for a directory with 1 million
          | files was around 5.8 seconds.
        
           | [deleted]
        
       | therealmarv wrote:
        | Reading/writing SMALL files is SUPER slow on things like S3,
        | Google Drive, and Backblaze. Using a lot of threads only helps
        | a little bit; it's nowhere near the read/write speed of, e.g.,
        | a single 600 MB file.
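        | 
        | This is the kind of threaded upload I mean (a rough boto3
        | sketch; the bucket name is a placeholder):
        | 
        |     import boto3
        |     from concurrent.futures import ThreadPoolExecutor
        | 
        |     s3 = boto3.client("s3")
        |     payload = b"x" * 500          # tiny object
        | 
        |     def put(i):
        |         s3.put_object(Bucket="my-bucket", Key=f"small/{i}",
        |                       Body=payload)
        | 
        |     # Even with 64 threads, every PUT still pays the full
        |     # per-request latency, so aggregate MB/s stays far below
        |     # a single big sequential read/write.
        |     with ThreadPoolExecutor(max_workers=64) as pool:
        |         list(pool.map(put, range(10_000)))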
       | 
       | Is HopsFS helping in this area?
        
         | chordysson wrote:
         | Regarding small files, HopsFS can store small files in the
         | metadata layer for improved performance.
         | 
         | https://kth.diva-portal.org/smash/record.jsf?pid=diva2:12608...
        
         | ignoramous wrote:
         | How small are the files?
         | 
         | Here are some strategies to make S3 go faster [0].
         | 
          | You're right that S3 isn't for small files, but for a lot of
          | small files (think 500 bytes), I either push them to S3
          | through Kinesis Firehose or fit them gzipped into DynamoDB.
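          | 
          | For the DynamoDB route, a rough sketch (the table name
          | "tiny_blobs" is made up):
          | 
          |     import gzip, boto3
          | 
          |     table = boto3.resource("dynamodb").Table("tiny_blobs")
          | 
          |     def put_small(item_id, payload):
          |         # Gzip the ~500-byte payload and store it as a binary
          |         # attribute; well under DynamoDB's 400 KB item limit.
          |         table.put_item(Item={"id": item_id,
          |                              "body": gzip.compress(payload)})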
         | 
          | You could also consider using Amazon FSx, which can
          | read/write S3.
         | 
         | [0] https://news.ycombinator.com/item?id=19475726
        
         | Matthias247 wrote:
          | It's not surprising that it's slow. Handling files has two
          | costs associated with it: one is a fixed setup cost, which
          | includes looking up the file metadata, creating connections
          | between all the services that make it possible to access the
          | file, and starting the read/write. The other part is actually
          | transferring the file content.
          | 
          | For small files the fixed cost will be the most important
          | factor. The "transfer time" after this "first byte latency"
          | might actually be 0, since all the data could be transferred
          | within a single write call on each node.
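          | 
          | As a back-of-envelope illustration (made-up but plausible
          | numbers): 10,000 x 50 KB objects at ~30 ms of per-request
          | overhead each is ~300 s of pure setup time for ~500 MB of
          | data, while a single 600 MB object streamed at 50 MB/s takes
          | ~12 s - the fixed cost completely dominates.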
        
       | tutfbhuf wrote:
        | A highly available, durable filesystem is a difficult problem
        | to solve. It usually starts with NFS, which is one big single
        | point of failure. Depending on the nature of the application
        | this might be good enough.
       | 
       | But if it's not, you'll typically want cross-datacenter
       | replication so if one rack goes down you don't lose all your
       | data. So then you're looking at something like
       | Glusterfs/MooseFS/Ceph. But the latencies involved with
       | synchronously replicating to multiple datacenters can really kill
       | your performance. For example, try git cloning a large project
       | onto a Glusterfs mount with >20ms ping between nodes. It's
       | brutal.
       | 
        | Other products try to do asynchronous replication; EdgeFS is
        | one I was looking at recently. This follows the general
        | industry trend, like it or not, of "eventually consistent is
        | consistent enough". Not much better than a cron job + rsync,
        | in my opinion, but for some workloads it's good enough. If
        | there's a partition you'll lose data.
       | 
        | Eventually you just give up and realize that a perfectly
        | synchronized, geographically distributed POSIX filesystem is a
        | pipe dream; you bite the bullet, rewrite your app against S3,
        | and call it a day.
        
         | fock wrote:
         | wait, S3 can do git now? I guess you are right in nearly
         | everything... but starting with git on gluster and then jumping
         | to selling apples as bananas (e.g. S3) doesn't really make an
         | argument.
        
         | the8472 wrote:
         | > For example, try git cloning a large project onto a Glusterfs
         | mount with >20ms ping between nodes. It's brutal.
         | 
          | That may be true, but it's also partly due to applications
          | often having very sequential IO patterns even when they
          | don't need to.
         | 
         | I hope we'll get some convenience wrappers around io_uring that
         | make batching of many small IO calls in a synchronous manner
         | simple and easy for cases where you don't want to deal with
         | async runtimes. E.g. bulk_statx() or fsync_many() would be
         | prime candidates for batching.
        
       | Lucasoato wrote:
       | Can this be installed on AWS EMR clusters?
        
         | threeseed wrote:
          | You could, but I would do thorough real-world benchmarks
          | first.
          | 
          | EMRFS has dozens of optimisations for Spark/Hadoop workloads,
          | e.g. S3 Select, partition pruning, optimised committers,
          | etc., and since EMR is a core product it is continually being
          | improved. Using HopsFS would negate all of that.
        
         | SirOibaf wrote:
          | Not really, but you can try it out on https://hopsworks.ai
          | 
          | It's conceptually similar to EMR in the way it works. You
          | connect your AWS account and we'll deploy a cluster there.
          | HopsFS will run on top of an S3 bucket in your organization.
          | You get a fully featured Spark environment (with metrics and
          | logging included - no need for CloudWatch), a UI with Jupyter
          | notebooks, the Hopsworks feature store, and ML capabilities
          | that EMR does not provide.
        
       | threeseed wrote:
       | Looks no different to Alluxio or Minio S3 Gateway or the dozens
       | of other S3 caches around.
       | 
        | Would've been more interesting had it taken advantage of newer
        | technologies such as io_uring, NVMe over Fabrics, RDMA, etc.
        
       ___________________________________________________________________
       (page generated 2020-11-21 23:00 UTC)