[HN Gopher] AWS doesn't make sense for scientific computing
___________________________________________________________________

AWS doesn't make sense for scientific computing

Author : lebovic
Score  : 206 points
Date   : 2022-10-07 15:28 UTC (7 hours ago)

(HTM) web link (www.noahlebovic.com)
(TXT) w3m dump (www.noahlebovic.com)

| renewiltord wrote:
| > _Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers - that's the API equivalent of storing your inbound API requests, and then responding to them months later_
|
| Makes sense if the jobs are all low urgency.
|
| We have a similar problem in trading, so we have a composite solution with non-cloud simulation hardware and additional AWS hardware. That's because we have the high-utilization solution combined with high urgency.
| Fomite wrote:
| I did have to chuckle a bit because, working on HPC simulations of the pandemic during the pandemic, there was an awful lot of "this needs to be done tomorrow" urgency.
| prpl wrote:
| Actually, computing is fine for most use cases (spot instances, preemptible VMs on GCP) and has been used in lots of situations, even at CERN. Where the cloud also excels is if you need any kind of infrastructure, because no HPC center has figured out a reasonable approach to that (some are trying with k8s). Also, obviously, you get a huge selection of hardware.
|
| Where cloud/AWS doesn't make sense is storage, especially if you need egress, and if you actually need IB.
| wistlo wrote:
| Database analyst for a large communication company here.
|
| I have similar doubts about AWS for certain kinds of intensive business analysis. Not API-based transactions, but back-office analysis where complex multi-join queries are run in sequence against tables with 10s of millions of records.
|
| We do some of this with SQL servers running right on the desktop (and one still uses Excel with VLOOKUP). We have a pilot project to try these tasks in a new Azure instance. I look forward to seeing how it performs, and at what cost.
| AyyWS wrote:
| Do you have disaster recovery / high availability requirements? SQL Server on a desktop has a lot of single points of failure.
| captainmuon wrote:
| A former colleague did his PhD in particle physics with a novel technique (the matrix element method). I can't really explain it, but it is extremely CPU intensive. That working group did it on CERN's resources, and they had to borrow quotas from a bunch of other people. For fun they calculated how much it would have cost on AWS and came up with something ridiculous like 3 million euros.
| wenc wrote:
| I can't speak specifically to CERN and the exact workload. But bear in mind that the 3MM euros is non-negotiated sticker pricing. In real life, negotiated pricing can be much, much less depending on your org size and spend. This is a variable most people neglect.
| captainmuon wrote:
| That is true, and a large part of the theoretical cost was probably also traffic, and the use of nonstandard nodes. They could have gotten a much more realistic price.
|
| I guess the point is also that scientists often don't realize that compute costs money, when the computers are already bought.
| dguest wrote:
| The bigger experiments will routinely burn through tens of millions worth of computing. But 10 million euros isn't much for these experiments. The issue is that they are publicly funded: any country is much happier to build a local computing center and lend it to scientists than to fork the money over to an American cloud provider.
|
| (The expensive part of these experiments is simulating billions of collisions and how the thousands of outgoing particles propagate through a detector the size of a small building. Simulating a single event takes around a minute on a modern CPU, and the experiments will simulate billions of events in a few months. If AWS is charging 5 cents a minute it works out to tens of millions easy.)
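A rough back-of-envelope check of the arithmetic in the comment above (a sketch only: the event count and the per-minute rate are the commenter's illustrative figures, not quoted AWS prices):

    # Back-of-envelope cost of simulating collision events, using the figures
    # from the comment above (illustrative, not actual AWS pricing).
    events = 2e9                 # "billions of events" -> assume 2 billion
    minutes_per_event = 1        # ~1 CPU-minute per simulated event
    cost_per_cpu_minute = 0.05   # "5 cents a minute"

    total_cost = events * minutes_per_event * cost_per_cpu_minute
    cpu_years = events * minutes_per_event / (60 * 24 * 365)

    print(f"~${total_cost / 1e6:.0f}M for ~{cpu_years:,.0f} CPU-years")
    # -> ~$100M for ~3,805 CPU-years; even 1e9 events lands in the tens of millions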
| rirze wrote:
| I would imagine CERN's resources are essentially a data center comparable to a small cloud provider's resources.
| somesortofthing wrote:
| The author makes a convincing argument against doing this workload on on-demand instances, but what about spot instances? AWS explicitly calls out scientific computing as a major use case for spot instances in its training/promotional materials. Given the advertised ~70-90% markdown on spot instance time, it seems like a great option: you pay almost the same amount as for the workstation, but you don't have to pay to buy, maintain, or replace the hardware.
| lebovic wrote:
| Author here! Spot instance pricing is better than on-demand, but it doesn't include data transfer, and it's still more expensive than on-prem/Hetzner/etc. Data transfer costs exceed the cost of the instance itself if you're transferring many TB off AWS.
|
| For one of the more popular AWS instance types I use - a c5a.24xlarge, used for comparison in the post - the cheapest spot price over the past month in us-east-1 was $1.69/hr. That's still $1233.70/mo: above on-prem, colo, or Hetzner pricing. Data transfer is still extremely expensive.
|
| That said, for bursty loads that can't be smoothed with a queue, spot instances (or just normal EC2 instances) do make sense! I use them all the time for my computational biology company.
| latchkey wrote:
| I read this as a thinly veiled advertisement for the author's service, Toolchest.
| lebovic wrote:
| Toolchest actually runs scientific computing on AWS! I'm just frustrated by what we can build, because most scientific compute can't effectively shift to AWS.
| latchkey wrote:
| As others have noted, there are many other providers out there. I think your essay would have had more value if it didn't end with an advertisement.
| [deleted]
| gammarator wrote:
| Astronomy is moving more and more to cloud computing:
|
| https://www.nature.com/articles/d41586-020-02284-7
|
| https://arxiv.org/abs/1907.06320
| Moissanite wrote:
| This has been my exact field of work for a few years now; in general I have found that:
|
| When people claim it is 10x more expensive to use public cloud, they have no earthly idea what it actually costs to run an HPC service, a data centre, or do any of the associated maintenance.
|
| When the claim is 3x more expensive in the cloud, they do know those things but are making a bad-faith comparison, because their job involves running an on-premises cluster and they are scared of losing their toys.
|
| When the claim is 0-50% more to run in the cloud, someone is doing the math properly and aiming for a fair comparison.
|
| When the claim is that cloud is cheaper than on-prem, you are probably talking to a cloud vendor account manager whose colleagues are wincing at the fact that they just torched their credibility.
| lebovic wrote:
| Author here! I think running an HPC service that has a steady queue in AWS can be more than 3x as expensive.
|
| What type of HPC do you work in? Maybe I'm over-indexing on computational biology.
| Moissanite wrote:
| All types of HPC; I'm a sysadmin/consultant. I don't think the problem with the cost gap is overestimating cloud costs but rather underestimating on-prem costs. Also, failing to account for financing differences and the opportunity costs of large up-front capital purchases.
| johnklos wrote:
| This is oversimplifying things a bit.
|
| It can categorically be stated that for a year's worth of CPU compute, local will always be less than Amazon. Of course, putting percentages on it doesn't work - there are just too many variables.
|
| There are many admins out there who have no idea what an Alpha is who'll swear that if you're not buying Dell or HP hardware at a premium with expensive support contracts, you're doing things wrong and you're not a real admin. Visit Reddit's /r/sysadmin if you want to see the kind of people I'm talking about.
|
| The point is that if people insist on the most expensive, least efficient type of servers, such as Dell Xeons with ridiculous service contracts, the savings over Amazon won't be large.
|
| It's a cumulative problem, because trying to cool and house less efficient hardware requires more power, and that hardware ultimately has less tolerance for non-datacenter cooling.
|
| Rethink things. You can have AMD Threadripper / EPYC systems in larger rooms that require less overall cooling, that have better temperature tolerance, that are more reliable in aggregate, which cost less, and for which you can easily keep spare parts around - which would give better turnaround and availability than support contracts from Dell / HP. Suddenly your compute costs are halved, because of pricing, efficiency, overall power, real estate considerations...
|
| So percentages don't work, but the bottom line is that when you're doing lots of compute, over time it's always cheaper locally, even if you do things the "traditional" expensive and inefficient way. Arguing percentages with so many variables doesn't make any sense - it's still cheaper, no matter what.
| thayne wrote:
| > Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers
|
| That sounds very much like an argument _for_ a cloud. Instead of waiting months to do your processing, you spin up what you need, then tear it down when you are done.
| withinboredom wrote:
| The queue then just turns into the bank account. The queue doesn't magically go away.
| adolph wrote:
| Classic iron triangle, pick two:
|   * cheap
|   * fast
|   * available
|
| https://en.wikipedia.org/wiki/Project_management_triangle
| twawaaay wrote:
| On the other hand, it makes sense if you just need to borrow their infrastructure for a while to calculate something.
|
| A lot of scientific computing isn't happening continuously, and a lot of it is a one-time experiment, or maybe runs a couple of times, after which you would have to tear down and reassign.
|
| Another fun fact people forget is that our ability to predict the future is still pretty poor. Not only that, we are biased towards thinking we can predict it when in fact this is complete bullshit.
|
| You have to buy and set up infrastructure before you can use it, and then you have to be ready to use it. What if you are not ready? What if you end up not needing as many resources? What if you stop needing it earlier than you thought? When you borrow it from AWS you have the flexibility to start using it when you are ready and drop it immediately when you no longer need it. Which has value on its own.
| At the company I work for, we basically banned signing long-term contracts for discounts. We found that, on average, we pay many times more for unused services than whatever we gained through discounts. Also, when you pay for the resources there is an incentive to improve efficiency. When you have basically prepaid for everything, that incentive is very small and is basically limited to making sure you stay within limits.
| bsenftner wrote:
| This is the case for a large class of big data + high compute applications. Animation / simulation in engineering, planning, forecasting, not to mention entertainment, require pipelines for which the typical cloud is simply too expensive.
| didip wrote:
| No way. I vehemently disagree.
|
| When a company reaches a certain mass, hardware cost is a factor that is considered, but not a big factor.
|
| The bigger problems are lost opportunity costs and unnecessary churn.
|
| Businesses lose a lot when the product launch is delayed by a year simply because the hardware arrived late or had too many defects (ask your hardware fulfillment people how many defective RAM sticks and SSDs they get per new shipment).
|
| Churn can cost the business a lot as well. For example, imagine the model that everyone has been using is trained on a Mac Pro under XYZ's desk. And then when XYZ quits, they never properly backed up the code and the model.
|
| Bare metal allows for sloppiness that the cloud cannot afford to allow. Accountability and ownership are a lot more apparent in the cloud.
| idiot900 wrote:
| This rings true for me. I have a federal grant that prohibits me from using its funds for capital acquisitions: i.e. servers. But I can spend it on AWS at massive cost for minimal added utility for my use case. Even though it would be a far better use of taxpayer funds to buy the servers, I have to rent them instead.
| giantrobot wrote:
| I'm not saying AWS is automatically the best option, but the question isn't just servers. It's servers, networking hardware, HVAC, a facility to put them all in, and at least a couple of people to run and maintain it all. The TCO of some servers is way higher than the cost of the hardware.
| adgjlsfhk1 wrote:
| Can you get your university to buy some servers for unrelated reasons and have them rent them to you?
| chrisseaton wrote:
| Well, that's just rebuilding AWS badly. I've used academic-managed time-sharing setups and have some horror stories.
| lostmsu wrote:
| Doesn't have to be a university either. Depending on the amount of compute needed, any capable IT guy can do it for you from their garage with a contract.
| boldlybold wrote:
| Lots of places (Hetzner for example) will rent you servers at 10-25% the cost of AWS if you want dedicated hardware, without the ability to autoscale. You can even set up a K8s cluster there if the overhead is worth it.
| intelVISA wrote:
| Fond memories of Hetzner asking for my driving license as ID for renting a $2 VPS. Lost a customer for life with that nonsense.
| testplzignore wrote:
| > prohibits me from using its funds for capital acquisitions
|
| What is a legitimate reason for this restriction?
| blep_ wrote:
| I can think of a few ways to abuse it while still spinning it as "for research". The obvious one is to buy a $9999 gaming machine with several of whatever the fanciest GPU on the market is at the time, and say you're doing machine learning.
|
| So my guess is it's an overly broad patch for that sort of thing.
| Fomite wrote:
| Not really - this is also true for things with no particular "civilian" use.
| Fomite wrote:
| Basically, the granting organization doesn't want to pay for the full cost of capital equipment that will - either via time or capacity - not be fully used for that grant.
|
| There are other grant mechanisms for large capital expenditures.
|
| The problem is the thresholds haven't shifted in a long time, so you can easily trigger it with a nice workstation. But then, the budget for a modular NIH R01 was set in 1999, so that's hardly a unique problem.
| lowbloodsugar wrote:
| > Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.
|
| Can't imagine you are paying public prices on any cloud provider if you have a $50M/yr budget.
|
| In addition, if, as the article states, the scientists are OK waiting some considerable time for results, then one can run most, if not all, of it on spot instances, and that can save 10x right there.
|
| If you don't have $50M/yr, there are companies that will move your workload around different AWS regions to get the best price - and will factor in the cost of transferring the data too.
|
| I was an architect at a large scientific company using AWS.
| lebovic wrote:
| Author here. I agree that pricing is highly negotiable for any large cloud provider, and there are even (capped) egress fee waivers that you can negotiate as a part of your contract. There's also a place for using AWS; I used it for a smaller DNA sequencing facility, and I use it for my computational biology startup.
|
| That said, I'll repeat something that I commented somewhere else: most scientific computing (by % of compute) happens in a context that still doesn't make sense in AWS. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc.), and a well-maintained HPC cluster that analyzes that data.
|
| Spot instances are still pretty expensive for a steady queue (2x Hetzner's monthly costs, for reference), and you still have to pay AWS data transfer egress costs - which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link. Data transfer to optimize for spot instance pricing becomes prohibitive when your job has 100 TB of raw data.
| dastbe wrote:
| Using on-demand for latency-insensitive work, especially when you're also very cost sensitive, isn't the right choice. Spot instances will get you somewhere in the realm of the Hetzner/on-prem numbers.
| Sebb767 wrote:
| But, as the article points out, you are still paying a lot of money for features that you don't need for scientific computing.
|
| Also, AWS is notoriously easy to undercut with on-prem hardware, especially if your budget is large and your uptime requirements aren't - you'll save a few hundred thousand a year alone by not having to hire expert engineers for on-call duty and extreme reliability.
| lebovic wrote:
| Even spot instances on AWS are still over 2x more expensive per month than Hetzner. The cheapest c5a.24xlarge spot instance right now is $1.5546/hr in us-east-1c. That's $1134.86/mo, excluding data transfer costs. If you transfer out 10 TB over the course of a month, that's another $921.60/mo - or now 4x more expensive than Hetzner.
|
| Using the estimate from the article, spot instances are still over 8x more expensive than on-prem for scientific computing.
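A small sketch of the comparison being made in the comment above, using only the numbers quoted in this thread (the cited spot price, ~730 hours/month, the ~$0.09/GB egress rate implied by the $921.60 figure, and the Hetzner figure quoted later). These are the thread's example numbers, not current price lists:

    # Rough monthly cost comparison from the numbers quoted in this thread
    # (illustrative only; spot and egress prices change constantly).
    HOURS_PER_MONTH = 730

    spot_price_per_hr = 1.5546   # cheapest c5a.24xlarge spot price cited above
    egress_per_gb     = 0.09     # implied by "$921.60 for 10 TB"
    egress_tb         = 10
    hetzner_monthly   = 512.92   # dedicated-server figure cited in the thread

    aws_compute = spot_price_per_hr * HOURS_PER_MONTH
    aws_egress  = egress_tb * 1024 * egress_per_gb
    aws_total   = aws_compute + aws_egress

    print(f"AWS spot compute  : ${aws_compute:8.2f}/mo")
    print(f"AWS egress (10 TB): ${aws_egress:8.2f}/mo")
    print(f"AWS total         : ${aws_total:8.2f}/mo "
          f"(~{aws_total / hetzner_monthly:.1f}x the Hetzner figure)")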
| dekhn wrote:
| Even more importantly, if you have any reasonable amount of spend on cloud, you can get preferred pricing agreements. As much as I hate to talk to "salespeople", I did manage to cut millions in costs per year with discounts on serving and storage.
|
| Personally, when I estimate the total cost of ownership of scientific cloud computing versus on-prem (for extremely large-scale science with both significant server, storage, and bandwidth requirements), the cloud ends up winning for a number of reasons. I've seen a lot of academics who disagree, but then I find out they use their grad students to manage their clusters.
| thesausageking wrote:
| I'm suspicious of the author's actual experience.
|
| The fact that scientific computing has a different pattern than the typical web app is actually a good thing. If you can architect large batch jobs to use spot instances, it's 50-80% cheaper.
|
| Also, this bit: "you can keep your servers at 100% utilization by maintaining a queue of requested jobs" isn't true in practice. The pattern of research is that the work normally comes in waves. You'll want to train a new model or run a number of large simulations. And then there will be periods of tweaking and work on other parts. And then more need for a lot of training. Yes, you can always find work to put on a cluster to keep it at >90% utilization, but if it can be elastic (and compute has a budget attached to it), it will rise and fall.
| lebovic wrote:
| Author here! I worked on the computing infrastructure for a DNA sequencing facility, and I run a computational biology infrastructure company (trytoolchest.com, YC W22). Both are built on AWS, so I do think AWS in scientific computing has its use cases - mostly in places where you can't saturate a queue or you want a fast cycle time.
|
| Spot instances are still pretty expensive for a steady queue (2x Hetzner's monthly costs, for reference), and you still have to pay AWS data transfer egress costs - which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link.
|
| This post was born from frustration at AWS for their pricing and offerings after trying to get people to switch to AWS in scientific computing for years :)
| Fomite wrote:
| One of the aspects not touched on here is PII/confidential data/HIPAA data, etc.
|
| For that, whether it makes sense or not, a lot of universities are moving to AWS, and the infrastructure cost of AWS for what would be a pretty modest server is still considerably less than the cost of complying with the policies and regulations involved.
|
| Recently at my institution I asked about housing it on premises, and the answer was that IT supports AWS, and if I wanted to do something else, supporting that - as well as the responsibility for a breach - would rest entirely on my shoulders. Not doing that.
| [deleted]
| KaiserPro wrote:
| It's much more complex than described.
|
| The author is making a brilliant argument for getting a secondhand workstation and shoving it under their desk.
|
| If you are doing multi-machine, batch-style processing, then you won't be using on-demand; you'd use the spot pricing. The missing argument in that part is storage costs. Managing a high-speed, highly available synchronous file system that can do a sustained 50 GB/sec is hard bloody work (no, S3 isn't a good fit - too much management overhead).
|
| Don't get me wrong, AWS _is_ expensive if you are using a machine for more than a month or two.
|
| However, if you are doing highly parallel stuff, Batch and Lustre on demand are pretty ace.
|
| If you are doing a multi-year project, then real steel is where it's at - assuming you have factored in hosting, storage, and admin costs.
| bushbaba wrote:
| Check out Apache Iceberg, which makes it fairly trivial to get high throughput from S3 without much fine-tuning. Bursts from 0 to 50 Gbps should be possible from S3 without much effort; just have object sizes that are in the NN+ MiB range. Personally, I find Lustre is a mess: it's expensive and even more of a pain to fine-tune.
| awiesenhofer wrote:
| From https://iceberg.apache.org
|
| > Iceberg is a high-performance format for huge analytic tables.
|
| How would that help speed up S3? Genuine question?
| gautamdivgi wrote:
| Even for multi-year, if you factor in everything, does it still come out cheaper than AWS? Would you be running everything 24x7 on an HPC system? I don't think so. You need scale at some points, and there are probably times where research is done on your desktop.
|
| You could invest in an HPC system - but I think the human cost of maintaining one, especially if you're in a high cost-of-living area (e.g. Bay Area, NYC, etc.), is going to be pretty high. Admin cost, UPS, cable wiring, heat/cooling, etc. can all be pretty expensive. Maintenance of these can be pretty pricey too.
|
| Are there any companies that remotely manage data centers and rent out bare-metal infra?
| lostmsu wrote:
| Isn't 50 GB/sec like 5 NVMe Gen 5 SSDs + 1 or 2 for redundancy?
|
| Actually, you are right. Consumer SSDs I've seen only do about 1.5 GB/s sustained.
| davidmr wrote:
| Not in the context the person you responded to meant it. Yes, you can very easily get 50 GB/s from a few NVMe devices on a single box. Getting 50 GB/s on a POSIX-ish filesystem exported to 1000 servers is very possible and common, but orders of magnitude more complicated. 500 GB/s is tougher still. 5 TB/s is real tough, but real fun.
| pclmulqdq wrote:
| Even (high-end) consumer SSDs can saturate a PCIe Gen 4 x4 link if you are doing sequential reads. Non-sequential hurts on even enterprise SSDs.
| wenc wrote:
| Calculating costs based on sticker price is sometimes misleading because there's another variable: negotiated pricing, which can be much, much lower than sticker prices, depending on your negotiating leverage. Different companies pay different prices for the same product.
|
| If you've ever worked at a big company or university (any place where you spend at scale), you'll know you rarely pay sticker price. Software licensing is particularly elastic because it's almost zero marginal cost. Raw cloud costs are largely a function of energy usage and amortized hardware costs - there's a certain minimum you can't go under, but there remains a huge margin that is open to negotiation.
|
| Startups/individuals rarely even think about this because they rarely qualify. But big orgs with large spends do. You can get negotiated cloud pricing.
| racking7 wrote:
| This is definitely true for cloud retail prices. However, I've seen cases where it stops being true when there is an existing discount. Reserved instances, for example.
| bee_rider wrote:
| Is genomic code typically distributed-memory parallel? I'm under the impression that it is more like batch processing: not a ton of node-to-node communication, but you want lots of bandwidth and storage.
|
| If you are doing a big distributed-memory numerical simulation, on the other hand, you probably want InfiniBand, I guess.
|
| AWS seems like an OK fit for the former, maybe not great for the latter...
| pclmulqdq wrote:
| The fastest way to do a lot of genomics stuff is with FPGA accelerators, which also aren't used by most of the other tenants in a multi-tenant scientific computing center. The cloud is perfect for that kind of work.
| bee_rider wrote:
| That's interesting. It is sort of funny that I was right (putting genomics in the "maybe good for cloud" bucket) for the wrong reason (characterizing it as more suited for general-purpose commodity resources, rather than suited for the more niche FPGA platform).
| timeu wrote:
| As an HPC sysadmin for 3 research institutes (mostly life sciences & biology), I can't see how a cloud HPC system could be any cheaper than an on-prem HPC system, especially if I look at the resource efficiency (how many resources were requested vs. how many were actually used) of our users' SLURM jobs. Often the users request 100s of GB but only use a fraction of it. In our on-prem HPC system this might decrease utilization (which is not great), but in the cloud this would result in increased computing costs (because of a bigger VM flavor), which would probably be worse (CapEx vs. OpEx). Of course you could argue that the users should know better and properly size/measure their resource requirements; however, most of our users have a lab background and are new to computational biology, so estimating or even knowing what all the knobs of the job specification (cores, mem per core, total memory, etc.) mean is hard for them. We try to educate by providing trainings and job efficiency reporting; however, the researchers/users have little incentive to optimize their job requests and are more interested in quick results and turnover, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them - or rather their group leaders/institute heads - to put a focus on this, but until you move to the cloud you won't know.
|
| Additionally, the typical workloads that run on our HPC system are often some badly maintained bioinformatics software or R/Perl/Python throwaway scripts, and often enough a typo in the script causes the entire pipeline to fail after days of running on the HPC system and it needs to be restarted (maybe even multiple times). Again, on the on-prem system you have wasted electricity (bad enough), but in the cloud you have to pay the computing costs of the failed runs. Again, cost transparency might force a fix for this, but the users are not software engineers.
|
| One thing that the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPUs to GPUs. A new cryo-EM microscope was installed where the downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is now using GPUs as well. We have around 100 GPUs, average utilization has increased to 80-90%, and the users are complaining about long waiting/queueing times for their GPU jobs. For this, bursting to the cloud would be nice; however, GPUs are unfortunately prohibitively expensive in the cloud, and the above-mentioned caveats regarding job resource efficiency still apply.
|
| One thing that will hurt on-prem HPC systems, though, is the increased electricity prices. We are now taking measures to actively save energy (i.e. by powering down idle nodes and powering them up again when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't increased their prices yet, either because they cover the electricity cost increase with their profit margins or because they are not affected as much since they have better deals with electricity providers.
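The requested-vs-used gap described above can be measured from Slurm's accounting database. A minimal sketch (the sacct options are standard, but ReqMem/MaxRSS formatting differs between Slurm versions - older releases add 'n'/'c' suffixes and MaxRSS usually appears on the .batch step - so treat the parsing as illustrative and adapt it to your site):

    #!/usr/bin/env python3
    """Rough per-job memory-efficiency report from Slurm accounting (a sketch)."""
    import subprocess

    def to_mb(field: str) -> float:
        """Convert a Slurm size string like '16G', '4000Mc', '123456K' to MB."""
        field = field.rstrip("nc")  # strip per-node/per-CPU suffixes (older Slurm)
        units = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
        if field and field[-1] in units:
            return float(field[:-1]) * units[field[-1]]
        return float(field or 0)

    out = subprocess.run(
        ["sacct", "--allusers", "--starttime=now-7days", "--parsable2",
         "--noheader", "--format=JobID,User,ReqMem,MaxRSS,State"],
        capture_output=True, text=True, check=True).stdout

    for line in out.splitlines():
        jobid, user, reqmem, maxrss, state = line.split("|")
        if not maxrss or not reqmem:  # MaxRSS is only reported on job steps
            continue
        used, asked = to_mb(maxrss), to_mb(reqmem)
        if asked:
            print(f"{jobid:>14} asked {asked:9.0f} MB, used {used:9.0f} MB "
                  f"({100 * used / asked:5.1f}%)")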
| kortex wrote:
| What does the landscape look like now for "terraform for bare metal"? Is Ansible/Chef still the main game in town? I just wanna netboot some lightweight image, set up some basic network discovery on a control plane, and turn every connected box into a flexible worker bee that I can deploy whatever cluster control layer (k8s/Nomad) on top of and start slinging containers.
| nwilkens wrote:
| I really like this description of how bare-metal infrastructure should work, and this is where I think (shameless self-promotion) Triton DataCenter[1] plays really well today on-prem.
|
| PXE-booted lightweight compute nodes with a robust API, including an operator portal, user portal, and CLI.
|
| Keep an eye out for the work we are doing with Triton Linux + K8s: a very lightweight Triton Linux compute node + bare-metal K8s deployments on Triton.
|
| [1] https://www.tritondatacenter.com
| jrm4 wrote:
| I imagine what makes this especially hard is that you have (at least) three parties in play here:
|
| - the people doing the research
|
| - the institution's IT services group
|
| - the administrator who writes the checks
|
| And in my experience, "actual knowledge of what must be done and what it will or could cost" can vary greatly across these three groups; frequently in very unintuitive ways.
| Fomite wrote:
| This is the biggest point of friction. I spent the better part of a year trying to get a postdoc admin access to his machine.
| rpep wrote:
| I think there are some things this misses about the scientific ecosystem in universities etc. that can make the cloud more attractive than it first appears:
|
| * If you want to run really big jobs, e.g. with multiple multi-GPU nodes, this might not even be possible depending on your institution or your access. Most research-intensive universities have a cluster, but they're not normally big machines. For regional and national machines, you usually have to bid for access for specific projects, and you might not be successful.
|
| * You have control of exactly what hardware and OS you want on your nodes. Often you're using an out-of-date RHEL version, and despite Spack and EasyBuild gaining ground, all too often you're given a compiler and some old versions of libraries and that's it.
|
| * For many computationally intensive studies, your data transfer actually isn't that large. E.g. you can often do the post-processing on-node and then only get aggregate statistics about simulation runs out.
| danking00 wrote:
| I think this post is identifying scientific computing with simulation studies and legacy workflows, to a fault. Scientific computing includes those things, but it _also_ includes interactive analysis of very large datasets as well as workflows designed around cloud computing.
|
| Interactive analysis of large datasets (e.g. genome & exome sequencing studies with 100s of 1000s of samples) is well suited to low-latency, serverless, & horizontally scalable systems (like Dremel/BigQuery, or Hail [1], which we build and which is inspired by Dremel, among other systems). The load profile is unpredictable because after a scientist runs an analysis they need an unpredictable amount of time to think about their next step.
|
| As for productionized workflows, if we redesign the tools used within these workflows to read and write data directly to cloud storage, as well as to tolerate VM preemption, then we can exploit the ~1/5 cost of preemptible/spot instances.
|
| One last point: for the subset of scientific computing I highlighted above, speed is key. I want the scientist to stay in a flow state, receiving feedback from their experiments as fast as possible, ideally within 300 ms. The only way to achieve that on huge datasets is through rapid and substantial scale-out followed by equally rapid and substantial scale-in (to control cost).
|
| [1] https://hail.is
| jessfyi wrote:
| I've followed Hail and applaud the Broad Institute's work wrt establishing better bioinformatics software and toolkits, so I hope this doesn't come across as rude, but I can't imagine an instance in a real industry or academic workflow where you need 300 ms feedback from an experiment to "maintain flow", considering how long experiments on data that large (especially exome sequencing!) take overall. My (likely lacking) imagination aside, I guess what I'm really saying is that I don't know what's preventing the use case you've described from being performed locally, considering there'd be even _less_ latency.
| CreRecombinase wrote:
| These MPI-based scientific computing applications make up the bulk of the compute hours on HPC clusters, but there is a crazy long tail of scientists who have workloads that can't (or shouldn't) run on their personal computers. The other option is HPC. This sucks for a ton of reasons, but I think the biggest one is that it's more or less impossible to set up a persistent service of any kind. So no databases; if you want Spark, be ready to spin it up from nothing every day (also no HDFS unless you spin that up in your SLURM job too). This makes getting work done harder, but it also makes integrating existing work so much harder, because everyone's workflow involves reinventing everything, and everyone does it in subtly incompatible ways; there are no natural (common) abstraction layers because there are no services.
| 0xbadcafebee wrote:
| AWS is _fantastic_ for scientific computing. With it you can:
|
| - Deploy a thousand servers with GPUs in 10 minutes, churn over a giant dataset, then turn them all off again. Nobody ever has to wait for access to the supercomputer.
|
| - Automatically back up everything into cold storage over time with a lifecycle policy.
|
| - Avoid the massive overhead of maintaining HPC clusters, labs, data centers, additional staff and training, capex, load estimation, and months/years of advance planning to be ready to start computing.
|
| - Automate via APIs to enable very quick adaptation with little coding.
|
| - Use an entire universe of services which ramp up your capabilities to analyze data and apply ML without needing to build anything yourself.
|
| - Use a marketplace of B2B and B2C solutions to quickly deploy new tools within your account.
|
| - Share data with other organizations easily.
|
| AWS costs are also "retail costs". There are massive savings to be had quite easily.
| Fomite wrote:
| One thing to consider:
|
| _I_ don't control my AWS account. I don't even _have_ an AWS account in my professional life.
|
| I tell my IT department what I want. They tell the AWS people in central IT what they want. It's set up. At some point I get an email with login information.
|
| I email them again to turn it off.
|
| Do I hate this system? Yes. Is it the system I have to work with? Also yes.
|
| "AWS as implemented by any large institution" is considerably less agile than AWS itself.
| [deleted]
| slaymaker1907 wrote:
| Cloud worked really well for me when I was in school. A lot of the time, I would only need a beefy computer for a few hours at a time (often due to high memory usage), and you can/could rent spot instances for very cheap. There are about 730 hours per month, so the cost calculus is very different for a student/researcher who needs fast turnaround times (high performance), but only for a short period of time.
|
| However, I know not all HPC/scientific computing works that way, and some workloads are much more continuous.
| a2tech wrote:
| That's how my department uses the cloud - we have an image we store on AWS geared towards a couple of tasks, and we spin up a big instance when we need it, run the task, pull out the results, then stop the machine. Total cost: sub-100 dollars. If we had to go to the HPC group we'd have to fight with them to get the environment configured, get access to the system, get payment set up, teach the faculty to use the environment, etc. It's just a pain for very little gain.
| Mave83 wrote:
| I agree with the article. We at croit.io help customers around the globe build their clusters and save huge amounts. For example, Ceph S3 in any data center of your choice is around 1/10 of the AWS S3 price.
| nharada wrote:
| I'd love to buy my own servers for small-scale (i.e. startup-size or research-lab-size) projects, but it's very hard to keep them utilized 24x7. Does anyone know of open-source software or tools that allow multiple people to timeshare one of these? A big server full of A100s would be awesome, with the ability to reserve the server on specific days.
| jpeloquin wrote:
| > the ability to reserve the server on specific days
|
| In an environment where there are not too many users and everyone is cooperative, using Google Calendar to reserve time slots works very well and is very low maintenance. Technical restrictions are needed only when the users can't be trusted to stay out of each other's way.
| didip wrote:
| This is just the cloud with extra steps.
| mbreese wrote:
| I completely agree for most cases. In many scientific computing applications, compute time isn't the factor you prioritize in the good/fast/cheap triad. Instead, you often need to do things as cheaply as possible. And your data access isn't always predictable, so you need to keep results around for an extended period of time. This makes storage costs a major factor. For us, this alone was enough to move workloads away from cloud and onto local resources.
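A quick sketch of the storage point above: what it costs to keep a large result set around versus buying disks, using round, assumed unit prices (not current list prices) and the 90 TB project size quoted from the article just below:

    # Illustrative cost of retaining results long-term.  All unit prices here
    # are round assumptions for the sketch, not quoted list prices.
    tb_retained       = 90      # e.g. one month-long sequencing project
    months_retained   = 12

    s3_standard_gb_mo = 0.023   # assumed object-storage price, $/GB-month
    egress_per_gb     = 0.09    # egress rate implied earlier in the thread
    local_disk_per_tb = 150.0   # assumed raw HDD price, $/TB (one-time)
    local_overhead    = 3.0     # x for RAID/replication, chassis, power

    cloud  = tb_retained * 1024 * s3_standard_gb_mo * months_retained
    onprem = tb_retained * local_disk_per_tb * local_overhead
    pull   = tb_retained * 1024 * egress_per_gb

    print(f"object storage, {months_retained} months : ${cloud:10,.0f}")
    print(f"on-prem disks (one-time, reusable): ${onprem:10,.0f}")
    print(f"pulling one full copy back out    : ${pull:10,.0f}")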
| COGlory wrote:
| > a month-long DNA sequencing project can generate 90 TB of data
|
| Our EM facility generates 10 TB of raw data per day, and once you start computing on it, that increases by 30%-50% depending on what you do with it. Plus, moving between network storage and local scratch for computational steps basically never ends and keeps multiple 10 GbE links saturated 100% of the time.
| bgro wrote:
| When I was looking at AWS for personal use, I first thought it was oddly expensive even when factoring in not having to buy the hardware. But when I looked at just what the electricity to run it myself would cost, I think that factor alone meant AWS was actually cheaper. This is without factoring in cooling / dedicated space / maintenance.
| betolink wrote:
| I see both sides of the argument; there is a reason why CERN is not processing its data using EC2 and Lambdas.
| thamer wrote:
| The vast majority of researchers don't need anywhere close to the amount of resources that CERN needs. The fact that CERN doesn't use EC2 and Lambdas shouldn't be taken as a lesson by anyone who's not operating at their scale.
|
| This feels like a similar argument to the one made by people who use Kubernetes to ensure their web app with 100 visitors a day is web scale.
| harunurhan wrote:
| The cost isn't the only reason:
|
| - CERN started planning its computing grid before AWS was launched.
|
| - It's pretty complicated (politics, mission, vision) for CERN to use external proprietary software/hardware for its main functions (they have even started to move away from MS Office-like products).
|
| - [cost] CERN is quite different from a small team of researchers doing a few years of research; the scale is enormous and very long-lived, continuing for decades.
|
| - and more...
|
| HPC and scientific computing aside, I would have loved to be able to use AWS when I worked there; the internal infra for running web apps and services wasn't nearly as good and reliable, nor did it have a wide catalog of services on offer.
| betolink wrote:
| I think the spirit of the article is to put the cloud in the perspective of organization size and workload type. There is a sweet spot where the cloud is the only option that makes sense: with variable loads and the capacity to scale on demand as big as our budget allows, there is no match for it. However... there are organizations with certain types of workloads that could afford to put infrastructure in place, and even with the costs of staffing, energy, etc. they will save millions in the long run. NASA, CERN, etc. are some. This is not limited to HPC; the cloud at scale is not cheap either, see: https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...
| bluedino wrote:
| We have a 500-node cluster at a chemical company, and we've been experimenting with "hybrid cloud". This allows jobs to use servers with resources we just don't have, or couldn't add fast enough.
|
| Storage is a huge issue for us. We have a petabyte of local storage from a big-name vendor that's bursting at the seams and expensive to upgrade. A lot of our users leave big files lying around for a long time. Every few months we have to hound everyone to delete old stuff.
|
| The other thing that you get with the cloud is there's way more accountability for who's using how much of the resources. Right now we just let people have access and roam free. Cloud HPC is 5-10x more in cost, and the beancounters would shut shit down real quick if the actual costs were divvied up.
|
| We also still have a legacy datacenter, so in a similar vein, it's hard to say how much not having to deal with physical hardware/networking/power/bandwidth would be worth. Our work is maybe 1% of what that team does.
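The "hound everyone to delete old stuff" step above is easy to semi-automate. A minimal sketch - the scratch path, age, and size thresholds are made up for illustration:

    #!/usr/bin/env python3
    """List big, stale files grouped by owner (a sketch for a Unix filesystem)."""
    import os, pwd, time
    from collections import defaultdict

    ROOT = "/scratch"               # hypothetical shared filesystem
    MIN_SIZE = 10 * 1024**3         # 10 GiB
    MAX_AGE = 90 * 86400            # 90 days since last modification
    now = time.time()

    stale = defaultdict(list)
    for dirpath, _, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue            # vanished file, permissions, etc.
            if st.st_size >= MIN_SIZE and now - st.st_mtime > MAX_AGE:
                stale[st.st_uid].append(st.st_size)

    for uid, sizes in sorted(stale.items()):
        try:
            owner = pwd.getpwuid(uid).pw_name
        except KeyError:
            owner = str(uid)        # orphaned UID
        print(f"{owner}: {len(sizes)} stale files, {sum(sizes) / 1024**4:.2f} TiB")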
| adolph wrote:
| I can relate to these problems. Cloud brings positive accountability that is difficult to justify on-prem. I have some hope that higher-level tools for project/data/experiment management (as opposed to a bash prompt and a path) will bring some accountability without stifling flexibility.
| julienchastang wrote:
| I've also been skeptical of the commercial cloud for scientific computing workflows. I don't think this cost-benefit analysis mentions it, but the commercial cloud makes even less sense when you take into account brick-and-mortar considerations. In other words, if your company/institution has already paid for the machine rooms, sysadmins, networks, and the physical buildings, the commercial cloud is even less appealing. This is especially true with "persistent services" - for example, data servers that are always on because they handle real-time data.
|
| Another aspect of scientific computing on the commercial cloud that's a pain if you work in academia is procurement, or paying for the cloud. Academic groups are much more comfortable with the grant model. They often operate on shoestring budgets and are simply not comfortable entering a credit card number. You can also get commercial cloud grants, but they often lack long-term, multi-year continuity.
| mattkrause wrote:
| It's often not that they're "not comfortable"; it's that we're often flat-out not allowed to.
| Fomite wrote:
| This. It's got nothing to do with "comfort". I use cloud computing all the time in the rest of my life, but the rest of my life isn't subject to university policies and state regulations.
| ordiel wrote:
| Having worked for 2 of the largest cloud providers (1 of them being the largest), I have to say "The Cloud" just doesn't make sense yet for most use cases (maybe with the exception of cloud storage). This includes startups and small and mid-size companies: it's just way too expensive for the benefits it provides. It moves your hardware acquisition/maintenance cost to development costs; you just think it's better/cheaper because that cost comes in small monthly chunks rather than as a single bill. Plus you add all the security risks, either those introduced by the vendor or those introduced by the massive complexity and poor training of the developers - which, if you want to avoid them, you will have to pay for by hiring a developer competent in security for that particular cloud provider.
| manv1 wrote:
| Having worked in 3 startups that were AWS-first, I can say that you've learned the completely wrong lessons from your time at your cloud providers.
|
| Building on AWS has provided scale, security, and redundancy at a substantially lower cost than doing any on-prem solution (except for a shitty one strung together with lowendbox machines).
|
| The combined AWS bill for the three startups is less than the cost of an F5, even on a non-inflation-adjusted basis.
|
| The cloud doesn't mean that you can be totally clueless. I've had experience in HA/scalability/redundancy/deployment/development/networking/etc. It means that if you do know what you're doing, you can deliver a scalable HA solution at a ridiculously lower price point than a DIY solution using bare iron and colo.
| ordiel wrote:
| "The combined bill" during which time period?
|
| 1 month, for sure. What about 1 year? Also, did those companies have to provide any training or hiring to achieve that? Because you also need to add that to the cost comparison.
|
| If you are comparing a one-month bill against a one-time purchase (which, if it is chosen correctly, should not happen more than once every 10 years at the earliest), for sure it will be cheaper. When it comes down to scalability, development, and deployment, you should check your tech stack rather than your infrastructure. Kubernetes and containerization should easily take care of those with on-premises hardware while also reducing complexity, plus you will no longer have to worry about off-the-charts network transit fees.
| jerjerjer wrote:
| Sure? I mean, if you have:
|
| 1) A large enough queue of tasks
|
| 2) Users/downstream willing to wait
|
| using your own infrastructure always wins (assuming free labor), since you can load your own infrastructure to ~95% pretty much 24/7, which is unbeatable.
| mrweasel wrote:
| It might also depend on how long you're actually willing to wait. There's nothing stopping you from having a job queue in AWS, and you can set things up so that instances are only running if the price is low enough.
|
| Otherwise I completely agree; there might be some cases where the cost of labour means that you're better off running something in AWS, even if that requires someone to do the configuration as well.
| aschleck wrote:
| This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers".) Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientist time by making them wait in a queue when you can instead just autoscale as high as needed? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you're doing that on-prem then you need to provision for the peak load, which drives your utilization down and makes on-prem far less cost effective.
| lebovic wrote:
| For fast-moving researchers who are blocked by a queue, cloud computing still makes sense. I guess I wasn't clear enough in the last section about how I still use AWS for startup-scale computational biology. My scientific computing startup (trytoolchest.com) is 100% built on top of AWS.
|
| Most scientific computing still happens on supercomputers in slower-moving academic or big-company settings. That's the group for whom cloud computing - or at least running everything on the cloud - doesn't make sense.
| adolph wrote:
| Another service that runs on AWS is CodeOcean. It looks like Toolchest is oriented toward facilitating execution of specific packages rather than organization and execution like CodeOcean. Is that a fair summary?
|
| https://codeocean.com/explore
| lebovic wrote:
| Yep, that's right! Toolchest focuses on compute, deploying and optimizing popular scientific computing packages.
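The utilization argument running through this subthread reduces to a one-line break-even calculation. A sketch using the c5a.24xlarge and owned-server figures already quoted in the thread (illustrative, not negotiated prices; the result is very sensitive to what you assume the owned server really costs all-in):

    # At what utilization does owning beat renting?  Figures from this thread.
    HOURS_PER_MONTH = 730
    owned_monthly   = 200.0                      # article's all-in owned-server estimate
    on_demand_hr    = 2670.36 / HOURS_PER_MONTH  # quoted on-demand monthly price
    spot_hr         = 1.5546                     # cheapest spot price cited above

    for label, rate in [("on-demand", on_demand_hr), ("spot", spot_hr)]:
        breakeven = owned_monthly / (rate * HOURS_PER_MONTH)
        print(f"vs {label:>9}: owning wins above ~{breakeven:.0%} utilization")
    # vs on-demand: owning wins above ~7% utilization
    # vs      spot: owning wins above ~18% utilization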
| secabeen wrote:
| Generally, scientists aren't blocked while they are waiting on a computational queue. The results of a computation are needed eventually, but there is lots of other work that can be done that doesn't depend on a specific calculation.
| jefftk wrote:
| It's good to learn how not to be blocked on long-running calculations.
|
| On the other hand, if transitioning to a bursty cloud model means you can do your full run in hours instead of weeks, that has real impact on how many iterations you can do and often does appreciably affect velocity.
| secabeen wrote:
| It can, if you have the technical ability to write code that can leverage the scale-out that most bursty-cloud solutions entail. Coding for clustering can be pretty challenging, and I would generally recommend a user target a single large system with a job that takes a week over trying to adapt that job to a clustered solution of 100 smaller systems that can complete it in 8 hours.
| Fomite wrote:
| This is a big part of it. In my lab, I have a lot of grad students who are _computational_ scientists, not computer scientists. The time it would take them to optimize code far exceeds the time for a quick-and-dirty job array on Slurm, after which they can go back to working on the introduction of the paper, or catching up on the literature, or any one of a dozen other things.
| secabeen wrote:
| The general rule of thumb in the HPC world is that if you can keep a system computing for more than 40% of the time, it will be cheaper to buy.
| tejtm wrote:
| Cloud never has made sense for scientific computing. Renting someone else's big computer makes good sense in a business setting, where you are not paying for your peak capacity when you are not using it, and you are not losing revenue by underestimating whatever peak capacity the market happens to dictate.
|
| For business, outsourcing the compute cost center eliminates both cost and risk for a big win each quarter.
|
| Scientists never say, "Gee, it isn't the holiday season, guess we'd better scale things back."
|
| Instead they will always tend to push whatever compute limit there is; it is kinda in the job description.
|
| As for the grant argument, that is letting the tool shape the hand.
|
| Business-science is not science; we will pay now or pay later.
| aBioGuy wrote:
| Furthermore, scientific computing often (usually?) involves trainees. It can be difficult to train people when small mistakes can lead to five-figure bills.
| Moissanite wrote:
| This is the biggest unaddressed problem, IMO. Getting more scientific computing done in the cloud is where we are inevitably trending, but no one yet has a good answer for completely ad-hoc, low-value experimentation and skill building in the cloud. I see universities needing to maintain clusters to allow PhDs and postdocs to develop their computing skills for a good while yet.
| avereveard wrote:
| > Hardware is amortized over five years
|
| Hardware running at 100% won't last five years.
|
| If the hardware doesn't need to run at 100% full steam for five years, you can turn down instances on the cloud and you don't pay anything.
|
| In 2 years you'll be stuck with the same hardware, while on the cloud you follow CPU evolution as it arrives at the provider.
|
| All in all, the comparison is too high-level to be useful.
| e63f67dd-065b wrote:
| > hardware running 100% won't last five years
|
| Five years is a pretty typical amortisation schedule for HPC hardware. During my sysadmin days, of CPU, memory, cooling, power, storage, and networking, the only things that broke were hard disks and a few cooling fans. Disks were replaced by just grabbing a spare and slotting it in, and fans were replaced by, well, swapping them out.
|
| Modern CPUs and memory last a very long time. I think I remember seeing Ivy Bridge CPUs running in Hetzner servers in a video they put out, and they're still fine.
| avereveard wrote:
| If you expect downtime over the 5 years to replace fans and whatnot, you're not getting 100% of your money/perf back - and I didn't see that in the article.
|
| If you have spares, the value lost to downtime stays minimal, but you have to include the spares in the expenses. If you don't have spares, a 1-2 day downtime is going to be a decent hit to value.
| davidmr wrote:
| I'm not sure I understand what you mean. I've run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500-node cluster are down for a few days while you wait for RMA parts to arrive, you haven't lost much value. Your cluster is still functioning at nearly peak capacity.
|
| You have a handful of nodes that the cluster can't function without (scheduler, fileservers, etc.), but you buy spares and 24x7 contracts for those nodes.
|
| Did I misunderstand your comment?
| icedchai wrote:
| I think you underestimate how long modern hardware can last. I have 8-to-12-year-old PCs running non-stop, in a musty and damp basement.
| avereveard wrote:
| They don't just die: thermal paste dries up, fans gum up. The GPU will live, but thermal throttling will mean it runs at, say, 80%.
| aflag wrote:
| I've worked with a YARN cluster with around 200 nodes which ran non-stop for well over 5 years and is still kicking. There were a handful of failures and replacements, but I'd say 95% of the cluster was fine 7 years in.
| walnutclosefarm wrote:
| Having had the responsibility of providing HPC for literal buildings full of scientists, I can say that it may be true that you can get computation cheaper with owned hardware than in the cloud. Certainly pay-as-you-go, one-project-at-a-time processing will look that way to the scientist. But I can also say with confidence that the contest is far closer than they think. Scientists who make this argument almost invariably leave major costs out of their calculation - assuming they can put their servers in a closet, maintain them themselves, do all the security infrastructure, provide redundancy, and still get to shared compute when they have an overflow need. When the closet starts to smoke because they stuffed it with too many cheaply sourced, hot-running cores and GPUs, or gets hacked by one of their postdocs resulting in an institutional HIPAA violation, well, that's not their fault.
|
| Put like for like in a well-managed data center against negotiated and planned cloud services, and the former may still win, but it won't be dramatically cheaper - and figured over depreciable lifetime and including opportunity cost, it may cost more. It takes work to figure out which is true.
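A sketch of that like-for-like, depreciation-lifetime comparison, following the amortization approach of the article's footnote [3] quoted just below (every number here is an assumption to swap for your own; the hardware price is simply back-derived from the footnote's $67.08/mo over five years):

    # Monthly all-in cost of an owned compute server, footnote-[3] style.
    hardware_price  = 4025.0    # capex, back-derived from $67.08/mo over 5 years
    amortization_mo = 60        # five-year depreciation
    power_kw        = 0.20      # assumed average draw at full load
    electricity_kwh = 0.23      # $/kWh, the "high electricity prices" figure
    overhead_factor = 2.0       # cooling, space, spares, share of a sysadmin

    hardware_mo = hardware_price / amortization_mo        # ~ $67/mo
    power_mo    = power_kw * 730 * electricity_kwh        # ~ $34/mo
    total_mo    = (hardware_mo + power_mo) * overhead_factor

    print(f"hardware ${hardware_mo:6.2f}/mo, power ${power_mo:6.2f}/mo, "
          f"all-in ~${total_mo:.0f}/mo")                  # ~ $200/mo here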
| pbronez wrote:
| The article estimated:
|
|   Running a modern AMD-based server that has 48 cores, at least 192 GB of RAM, and no included disk space costs:
|     ~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
|     ~$1014.70/mo for a c5a.24xlarge AWS reserved instance on a three-year term, paid upfront
|     ~$558.65/mo on OVH Cloud[1]
|     ~$512.92/mo on Hetzner[2]
|     ~$200/mo on your own infrastructure as a large institution[3]
|
| Footnote [3] explains this cost estimate as:
|
|   "Assumes an AMD EPYC 7552 run at 100% load in Boston with high electricity prices of $0.23/kWh, for $33.24/mo in raw power. Hardware is amortized over five years, for an average monthly price of $67.08/mo. We assume that your large institution already has 24/7 security and public internet bandwidth, but multiply base hardware and power costs by 2x to account for other hardware, cooling, physical space, and a half-a-$120k-sysadmin amortized across 100 servers."
| jacobr1 wrote:
| It also assumes full utilization of the hardware. If you have a variable load (such as only needing to run compute after an experiment), the overhead cost of maintaining a cluster you don't need all the time probably compares poorly with resources you can schedule on-demand.
| duxup wrote:
| When I worked as a network engineer I spent months working with some great scientists / their team, who built a crazy microscope (I assumed it was looking at atoms or something...) the size of a small building.
|
| Their budget for the network gear was a couple hundred bucks and some old garbage consumer-grade network gear. This was for something that spit out 10s of GB a second (at least) across a ton of network connections (they didn't seem to know what would even happen when they ran it), and was so bursty that all but the highest end of gear would choke on it.
|
| Can confirm sometimes scientists aren't really up on the overall costs. Then they dump it ("this isn't working") on their university IT team to absorb the costs / manpower costs.
| ipaddr wrote:
| You are paying 10x more because no one gets fired for using IBM. AWS has many benefits, most of which you don't need. Pair up with another school in a different region and back up data. Computers are not scary; they rarely catch fire.
| whatever1 wrote:
| Nah, for us it was the department IT guy who set everything up once (a full cluster of 50 R720s), and it works like a dream.
|
| Properly provisioned Linux machines need no maintenance. You drive them until there is a hardware failure.
| mangecoeur wrote:
| I've been running a group server (basically a shared workstation) for 5 years and it's been great. Way cheaper than cloud, no worrying about national rules on where data can be stored, no waiting in a SLURM batch queue, Jupyter notebooks on tap for everyone. A single ~$6k outlay (we don't need GPUs, which helps).
|
| Classic big workstations are way more capable than people think - but at the same time it's hard to justify buying one machine per user unless your department is swimming in money. Also, academic budgets tend to come in fixed chunks, and university IT departments may not have your particular group as a priority - so often it's just better to invest once in a standalone server tower that you can set up to do exactly what you need than to try to get IT to support your needs or the accounting department to pay recurring AWS bills.
| killingtime74 wrote:
| Aren't you talking about 1 server when this is talking about HPC?
| mangecoeur wrote: | Well, the title is scientific computing, which includes HPC | but isn't limited to it. Anyway, the fact is that a lot of "HPC" in | university clusters is smaller jobs that are too much for | an average PC to handle, but still fit into a single | typical HPC node. These are usually the jobs that people | think to farm out to AWS, but that you will generally find | are cheaper, faster, and more reliable if you just run them | on your own hardware. | [deleted] | forgomonika wrote: | This nails so much of the discussion that should be had. When | using any cloud service provider, you aren't just paying for | the machines/hardware you use - you are paying for people to | take care of a bunch of headaches of having to maintain this | hardware. It's incredibly easy to overlook this aspect of costs | and really easy to oversimplify what's involved if you don't | know how these things actually work. | prpl wrote: | The things that tend to be "cheap" on campuses: | | Power (especially if there is some kind of significant | scientific facility on premises), space (especially in reused | buildings), manpower (undergrads, grad students, post docs, | professional post graduates), running old/reused hardware, | etc... | | You can get away with those at large research universities. | Some of that you can get away with at national lab sorts of | places (not going to find as much free/cheap labor or surplus | hardware). If you start going down in scale/prestige, etc... | none of that holds true. | | Running a bunch of hardware from the surplus store in a closet | somewhere with Lasko fans taped to the door is cheap. To some | extent, the university system encourages such subsidies. | | In any case, once you get to actually building a datacenter, if | you have to factor in power, a 4-year hardware refresh | cycle, professional staffing, etc... unless you are in one of | those low-CoL college towns - cloud is probably no more than | 1.5 to 3x more expensive for compute (spot, etc...). Storage on | prem is much cheaper - erasure-coded storage systems are cheap | to buy and run, and everybody wants their own high-performance | file system. | | One continuing cloud obstacle, though - researchers don't want | to spend their time figuring out how to make their code friendly | to preemptible VMs - which is the cost-effective way to run on | cloud. | | Another real issue with sticking to on-prem HPC is talent | acquisition and staff development. When you don't care about | those things so much, it's easy to say it's cheap to run on-prem, | but often the pay is crap for the required expertise, and | ignoring cloud doesn't help your staff either. | W-Stool wrote: | Let me echo this as someone who once was responsible for HPC | computing in a research-intensive public university. Most | career academics have NO IDEA how much enterprise computing | infrastructure costs. If a 1 terabyte USB hard drive is $40 at | Costco, we (university IT) must be getting a much better deal | than that. Take this argument and apply it to any aspect of HPC | computing and that's what you're fighting against. The closet | with racks of gear and no cooling is another fond memory. Don't | forget the AC terminal strips that power the whole thing, | sourced from the local dollar store. | bluedino wrote: | It's kind of funny around this time of year when some | researchers have $10,000 in their budget they need to spend, | and they want to 'gift' us with some GPUs.
| davidmr wrote: | That was definitely one of the weirdest things about working | in academia IT: "Hey, can you buy me a workstation that's | as close to $6,328.45 as it is possible to get, and can you | do it by 4pm?" | systemvoltage wrote: | I am dealing with the exact opposite problem: "Oh, you mean | we should leave the EC2 instance running _24/7_??? No way, | that would be too expensive"... to which I need to respond, | "No, it would be like $15/month. Trivial, stop worrying about | costs in EC2 and S3, we're like 7 people here with 3 GB of | data." | | I deal with scientists that think AWS is some sort of a | massively expensive enterprise thing. It can be, but not for | the use case they're going to be embarking on. Our budget is | $7M spanning 4 years. | capableweb wrote: | > think AWS is some sort of a massively expensive | enterprise thing | | Compared to using dedicated instances with way cheaper | bandwidth, storage and compute power, it might as well be. | | Cloud makes sense when you have to scale up/down very | quickly, or you'd be losing money fast. But most don't | suffer from this problem. | gonzo41 wrote: | Don't say the budget out loud near AWS. They'll find a way | to help you spend it. | systemvoltage wrote: | Hahaha, maybe I need to just go into the AWS ether and | start yakking big words like "Elastic Kubernetes Service" | to confuse the scientists and get my AWS fix. These | people are too stingy. I want some shit running in AWS; | what good is this admin IAM role? | 0xbadcafebee wrote: | I remember the first time a server caught fire in the closet | we kept the rack in. Backups were kept on a server right | below the one on fire. But, y'know, we saved money. | eastbound wrote: | Don't worry, we do incremental backups during weekdays and | a full backup on Sunday. We use 2 tapes only, so one is | always outside of the building. But you know, we saved | money. | [deleted] | treeman79 wrote: | We had a million dollars' worth of hardware installed in a | closet. It had a portable AC hooked up that needed its | water bin changed every so often. | | Well, I was in the middle of that when the Director | decided to show off the new security doors. So he closed | the room up. Then he found out that the new security doors | didn't work. I find out as I'm coming back to turn the AC | back on. The room will get hot really fast. | | We get office Security to unlock the door. He says he doesn't | have authority. His supervisor will be by later in the | day. | | Completely deadpan, and in front of several VPs of a | Fortune 50. | | I turn to the guy to my right who lived nearby. "Go home and | get your chainsaw." | | We were quickly let in. Also got fast approval to install | proper cooling. | bilbo0s wrote: | A bit off topic, but I gotta say you guys are a riot! | | If there was a comedy tour for IT/programmer types, I'd | pay to see you guys in it. | | Best thing about your stuff is that it's literally all | funny precisely because it's all true. | [deleted] | [deleted] | Proven wrote: | pbronez wrote: | This is my fear about my homelab lol | | Fire extinguisher nearby, smart temp sensors, but still... | rovr138 wrote: | Oh, nice idea with the temp sensor. | | I have extinguishers all over the house, but hadn't | considered a temperature sensor set to send alerts. | | Do you have any recommendations? | W-Stool wrote: | What are you using for a homelab-priced temperature | sensor? | xani_ wrote: | Homelab-priced sensor is the temp sensor in your server; | it's free!
Actual servers have a bunch, usually including one | at the intake; "random old PC" servers can use motherboard | temp as a rough proxy for environment temp. | | Hell, even in a DC you can look at temperatures and see in | front of which server a technician was standing, just by | those sensors. | | Second cheapest would be a USB-to-1wire module + some | DS18B20 1-wire sensors. Easy hobby job to make. They also | come with a unique ID, which means that if you record them in a TSDB by | that ID, it doesn't matter where you plug those sensors in. | COGlory wrote: | >the security infrastructure, provide redundancy and still get | to shared compute when they have an overflow need | | The article points out that this is mostly not necessary for | scientific computing. | jrumbut wrote: | Which I thought was the best point of the article, that a lot | of IT best practice comes from the web app world. | | Web apps quickly become finely tuned factory machines, | executing a million times a day and being duplicated | thousands of times. | | Scientific computing projects are often more like workshops. | Getting charged by the second while you're sitting at a | console trying to figure out what this giant blob you were | sent even is, is unpleasant. The solution you create is most | likely to be run exactly once. If it is a big hit, it may be | run a dozen times. | | Trying to run scientific workloads on the cloud is like | trying to put a human shoe on a horse. It might be possible, | but it's clearly not designed for that purpose. | onetimeusename wrote: | Is a postdoc hacking a cluster something you have seen before? | I am genuinely curious because I worked on a cluster owned by | my university as an undergrad and everyone was kind of assumed | to be trusted. If you had shell access on the main node you | could run any job you wanted on the cluster. You could enhance | security; I just wonder about this threat model, that's an | interesting one. I am sure it happens, to be clear. | ptero wrote: | I think it really depends on the task. Where a HIPAA violation is | a real threat, the equation changes. And just for CYA purposes | those projects can get pushed to a cloud. Which does not | necessarily involve any attempts to make them any more secure, | but this is a different topic. | | That said, many scientists _are_ operating on-premise hardware | like this: some servers in a shared rack and an el-cheapo | storage solution with ssh access for people working in the | lab. And it works just fine for them. | | Cloud services focus on running _business_ computing in a | cloud, emphasizing recurring revenue. Most research labs are | _much_ more comfortable with spending the hardware portion of a | grant upfront and not worrying about some student who, instead | of working on some fluid dynamics problem, found a script to | re-train a Stable Diffusion model and left it running over winter break. | My 2c. | secabeen wrote: | Thankfully, only a small part of the academic research | enterprise involves human subjects, HIPAA, and all that. | Neither fruit flies nor quarks have privacy rights. | dmicah wrote: | Research involving human subjects (psychology, cognitive | neuroscience, behavioral economics, etc.) requires | institutional review board approval and informed consent, | etc., but mostly doesn't involve HIPAA either. | charcircuit wrote: | That is not a law. | icedchai wrote: | There are actually laws around such things.
You can read | about them here: https://www.hhs.gov/ohrp/index.html | Fomite wrote: | And many, many institutions are overcautious. My own | university, for example, has no data classification | between "It would be totally okay if anyone in the | university has access" and "Regulated data", so "I mean, | it's health information, and it's governed by our data | use agreement with the provider..." gets it kicked to the | same level as full-fat HIPAA data. | crazygringo wrote: | > _And it works just fine for them._ | | Until it doesn't, because there's a fire or huge power surge | or whatever. | | That's the point -- there's a lot of risk they're not taking | into account, and by focusing on the "it works just fine for | them", you're cherry-picking the ones that didn't suffer | disaster. | horsawlarway wrote: | I'd counter by saying I think you're over-estimating how | valuable mitigating that risk is to this crowd. | | I'd further say that you're probably over-estimating how | valuable mitigating that risk is to _anyone_ , although | there is a limited set of customers that genuinely do | care. | | There are few places I can think of that would benefit more | from avoiding cloud costs than scientific computing... | | They often have limited budgets that are driven by grants, | not derived from providing online services (a computer going | down does not impact the bottom line). | | They have real computation needs that mean hardware is | unlikely to sit idle. | | There is no compelling reason to "scale" in the way that a | company might need to in order to handle additional | unexpected load from customers or hit marketing campaigns. | | Basically... the _only_ meaningful offering from the cloud | is likely preventing data loss, and this can be done fairly | well with a simple backup strategy. | | Again - they aren't a business where losing a few | hours/days of customer data is potentially business-ending. | | --- | | And to be blunt - I can make the same risk-avoidance claims | about a lot of things that would simply get me laughed out | of the room. | | "The lead researcher shouldn't be allowed in a car because | it might crash!" | | "The lab work must be done in a bomb shelter in case of war | or tornados!" | | "No one on the team can eat red meat because it increases | the risk of heart attack!" | | and on and on and on... Simply saying "There's risk" is not | sufficient - you must still make a compelling argument that | the cost of avoiding that risk is justified, and you're not | doing that. | billythemaniam wrote: | The counterpoint to that point is that a significant | percentage of scientific computing doesn't care about any | of that. They are unlikely to have enough hardware to cause | a fire, and they don't care about outages or even data loss | in many cases. As others have said, it depends on the | specifics of the research. In the cases where that stuff | matters, the cloud would be the better option. | Fomite wrote: | This. If my lab-level server failed tomorrow, I'd be | annoyed, order another one, and start the simulations | again. | vjk800 wrote: | The point is, there's no need for everything to be 100% | reliable in this context. If a fire destroys everything and | their computational resources are unavailable for a few | days, that's somewhat okay. Not ideal, but not a | catastrophic loss either. Even data loss is not catastrophic | - at worst it means redoing one or two weeks' worth of | computations. | | Some sort of 80/20 principle is at work here.
Most of the | cost in professional cloud solutions comes from making the | infrastructure 99.99% reliable instead of 99% reliable. It | is totally worth it if you have millions of customers that | expect a certain level of reliability, but complete | overkill if the worst-case scenario from a system failure | is some graduate student having to redo a few days' worth of | computations (which probably had to be redone several times | anyway because of some bug in the code or something). | kijin wrote: | Even that depends on what you're doing. Most scientists | aren't running apps that require several 9's of | availability, connect to an irreplaceable customer | database, etc. | | An outage, or even permanent loss of hardware, might not be | a big problem if you're running easily repeatable | computations on data of which you have multiple copies. At | worst, you might have to copy some data from an external | hard drive and redo a few weeks' worth of computations. | withinboredom wrote: | Ummm. I've def been unable to do anything for entire days | because our AWS region went down and we had to rebuild the | database from scratch. AWS goes down, you twiddle your | thumbs, and the people you report to are going to be asking | why, for how long, etc., and you can't give them an answer | until AWS comes back to see how fubar things are. | | When your own hardware rack goes down, you know the | problem, how much it costs to fix it, and when it will come | back up; usually within a few hours (or minutes) of it | going down. | | Do things catch fire? Yes. But I think you're over-estimating | how often. In my entire life, I've had a single | SATA connector catch fire, and it just melted plastic before | going out. | crazygringo wrote: | I'm not talking about temporary outages, I'm talking about | data loss. | | With AWS it's extremely easy to keep an up-to-date | database backup in a different region. | | And it's great that you haven't personally encountered | disaster, but of course once again that's cherry-picking. | And it's not just a component overheating, it's the whole | closet on fire, it's a broken ceiling sprinkler system | going off, it's a hurricane, it's whatever. | withinboredom wrote: | I was also talking about data loss. Not everything can | be replicated, but backups can be, and were made. | | For the rest, there's insurance. Most calculations done | in a research setting are dependent upon that research | surviving. If there's a fire and the whole building goes | down, those calculations are probably worthless now too. | | Hell, most companies probably can't survive their own | building/factory burning down. | FpUser wrote: | >"With AWS it's extremely easy to keep an up-to-date | database backup in a different region" | | It is just as extremely easy on Hetzner or on premises. | monkmartinez wrote: | I would say even easier on prem, as you don't need to wade | 15 layers deep to do anything. Since I have moved to | hosting my own stuff at my house, I have learned that | connecting a monitor and keyboard to a 'server' is awesome | for productivity. I know where everything is, it's fast as | hell, and everything is locked down. Monitoring temps, | adjusting and configuring hardware is just better in | every imaginable way. Need more RAM, storage, compute? | Slap those puppies in there and send it. | | For home gamers like myself, it has become a no-brainer | with advances in tunneling, Docker, and cheap prices on | eBay.
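On the temperature-monitoring tangent above (the homelab temp sensors and xani_'s DS18B20 suggestion): on Linux, 1-wire DS18B20 thermometers show up under /sys/bus/w1/devices once the w1-therm driver is loaded, keyed by their unique IDs. A minimal polling sketch, assuming that sysfs layout; the output format is just an illustration, while keying on the sensor ID follows the suggestion in the thread:

```python
# Minimal DS18B20 poller: read every 1-wire thermometer under sysfs and print
# "<sensor-id> <temp-C>" lines, e.g. for piping into a time-series database.
# Assumes the w1-therm kernel driver is loaded; DS18B20 IDs start with "28-".
from pathlib import Path

W1_ROOT = Path("/sys/bus/w1/devices")

def read_ds18b20(dev: Path) -> float | None:
    """Return the temperature in Celsius, or None if the CRC check failed."""
    lines = (dev / "w1_slave").read_text().splitlines()
    if len(lines) < 2 or not lines[0].strip().endswith("YES"):  # CRC line ends in YES/NO
        return None
    return int(lines[1].rsplit("t=", 1)[-1]) / 1000.0           # t= is millidegrees C

if __name__ == "__main__":
    for dev in sorted(W1_ROOT.glob("28-*")):   # family code 28 = DS18B20
        temp = read_ds18b20(dev)
        if temp is not None:
            print(f"{dev.name} {temp:.3f}")
```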
| ptero wrote: | > there's a lot of risk they're not taking into account | | I see it the other way: experimental scientists operate | with unreliable systems all the time: fickle systems, | soldered one-time setups, shared lab space, etc. Computing | is just one more thing that is not 100% reliable (but way | more reliable than some other equipment), and USB data | sticks serve as a good-enough data backup. | mangecoeur wrote: | Or your university might have its own backup system. We | have a massive central tape-based archive that you can | run nightly backups to. | noobermin wrote: | Maybe consider that your use case and the average | scientist's use case aren't the same? What works for you | won't work for them and vice versa? What you consider a | risk, I wouldn't? | | Consider the following: I have never considered applying | Meltdown or Spectre mitigations if they make my code run | slower, because I plain don't care. Assuming anyone even | peeks at what my simulations are doing, whoopdeedo, I don't | care. I won't do that on my laptop I use to buy shit off | Amazon with, but the workstation I have control of? I don't | care. I DO care if my simulation will take 10 days instead | of a week. | | My use case isn't yours because my needs aren't yours. Not | everything maps across domains. | insane_dreamer wrote: | Plus, the supposed savings of in-house hardware only materialize | if you have sufficiently managed and queued load to keep your | servers running at 100%, 24/7. The advantage of AWS/other is to | be able to acquire the necessary amount of compute power for | the duration that you need it. | | For a large university it probably makes sense to have and | manage their own compute infrastructure (cheap post-doc labor, | ftw!), but for smaller outfits, AWS can make a lot of sense for | scientific computing (said as someone who uses AWS for | scientific computing), especially if you have fluctuating | loads. | | What works best IMO (and what we do) is to have a minimum-to-moderate | amount of compute resources in house that can satisfy | the processing jobs most commonly run (and where you haven't | had to overinvest in hardware), and then switch to AWS/other | for heavier loads that run for a finite period. | | Another problem with in-house hardware is that you spent all | that money on Nvidia V100s a few years ago and now there's the | A100 that blows it away, but you can't just switch and take | advantage of it without another huge capital investment. | secabeen wrote: | They leave out major costs because they don't pay those costs. | Power, cooling, and real estate are all significant drivers of AWS | costs. Researchers don't pay those costs directly. The | university does, sure, but to the researcher, that means those | costs are pre-paid. Going to AWS means you're essentially | paying for those costs twice, plus all the profit margin and | availability that AWS provides that you also don't need. | fwip wrote: | The killer we've seen is data egress costs. Crunching the numbers | for some of our pipelines, we'd actually be paying more to get | the data out of AWS than to compute it. | bhewes wrote: | Data movement has become the number one cost in system builds, | energy-wise. | boldlybold wrote: | As in, the networking equipment consumes the most energy? | Given the 30x markup on AWS egress I'm inclined to say it's | more about incentives and marketing, but I'd love to learn | otherwise. | pclmulqdq wrote: | Even as a big cloud detractor, I have to disagree with this.
| | A lot of scientific computing doesn't need a persistent data | center, since you are running a ton of simulations that only take | a week or so, and scientific computing centers at big | universities are a big expense that isn't always well utilized. | Also, when they are full, jobs can wait weeks to run. | | These computing centers have fairly high overhead, too, although | some of that is absorbed by the university/nonprofit that runs | them. It is entirely possible that this dynamic, where | universities pay some of the cost out of your grant overhead, | makes these computing centers synthetically cheaper for | researchers when they are actually more expensive. | | One other issue here is that scientific computing really benefits | from ultra-low-latency InfiniBand networks, and the cloud | providers offer something more similar to a virtualized RoCE | system, which is a lot slower. That means accounting for cloud | servers potentially being slower core-for-core. | davidmr wrote: | This is tangential to your point, but I'll just mention that | Azure has some properly specced-out HPC gear: IB, FPGAs, the | works. You used to be able to get time on a Cray XC with an | Aries interconnect, but I never have occasion to use it, so I | don't know if you still can. They've been aggressively hiring | top-notch HPC people for a while. | lebovic wrote: | Author here. I agree with your points! I use AWS for a | computational biology company I'm working on. A lot of | scientific computing can spin up and down within a couple hours | on AWS and benefits from fast turnaround. Most academic HPCs | (by # of clusters) are slower than a mega-cluster on AWS, not | well utilized, and have a lot of bureaucratic process. | | That said, most of scientific computing (by % of total compute) | happens in a different context. There's often a physical | machine within the organization that's creating data (e.g. a | DNA sequencer, particle accelerator, etc), and a well-maintained | HPC cluster that analyzes that data. The researchers | have already waited months for their data, so another couple | weeks in a queue doesn't impact their cycle. | | For that context, AWS doesn't really make sense. I do think | there's room for a cloud provider that's geared towards an HPC | use-case, and doesn't have the app-inspired limits (e.g. data | transfer) like AWS, GCP, and Azure. | hellodanylo wrote: | [retracted] | Marazan wrote: | It says 0.09 per GB on that page. | philipkglass wrote: | Where do you see that? On your link I see: Data | Transfer OUT From Amazon EC2 To Internet -- First 10 TB / | Month: $0.09 per GB; Next 40 TB / Month: $0.085 per GB; | Next 100 TB / Month: $0.07 per GB; Greater than 150 TB / | Month: $0.05 per GB. | | Which means if you transfer out 90 TB in one month, it's $0.09 | * 10000 + $0.085 * 40000 + $0.07 * 40000 = $7100. | hellodanylo wrote: | Sorry, you are right. I need another coffee today. | xani_ wrote: | It always was that way for load that doesn't allow for autoscaling to save | you money; the savings were always from the convenience of not having | to do ops and pay for ops. | | Then again, part of the ops cost you save is paid again in the | salaries of devs who have to deal with AWS stuff instead of just throwing | a blob of binaries and letting ops worry about the rest. | citizenpaul wrote: | No one seems to even consider colo data centers as an | option anymore?
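Tiered pricing like this is easy to misread, as the exchange above shows, and philipkglass's arithmetic generalizes to a small helper. A minimal sketch, using the tier boundaries quoted in that comment with decimal TB (1 TB = 1,000 GB), which is how the $7,100 figure for 90 TB works out; the actual AWS rate card varies by region and over time:

```python
# EC2-to-internet egress estimate using the tiered rates quoted above.
# Each tier is (tier size in GB, $/GB); None = everything past the last boundary.
TIERS = [
    (10_000, 0.09),    # first 10 TB / month
    (40_000, 0.085),   # next 40 TB
    (100_000, 0.07),   # next 100 TB
    (None, 0.05),      # greater than 150 TB
]

def egress_cost(gb: float) -> float:
    cost, remaining = 0.0, gb
    for size, rate in TIERS:
        chunk = remaining if size is None else min(remaining, size)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return cost

print(egress_cost(90_000))   # 90 TB out -> 7100.0, matching the figure above
```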
| remram wrote: | My university owns hardware in multiple locations, plus uses | hardware in a colocation, and still uses the cloud for | bursting (overflow). You can't beat the provisioning time of | cloud providers, which is measured in seconds. | zatarc wrote: | Why does no one consider colocation services anymore? | | And why do people only know Hetzner, OVH and Linode as | alternatives to the big cloud providers? | | There are so many good and inexpensive server hosting providers, | some with decades of experience. | lostmsu wrote: | Any particular one you could recommend for GPU? | zatarc wrote: | I'm not in a position to recommend (or not) a particular | provider for GPU-equipped servers, simply because I've never | had the need for GPUs. | | My first thought was related to colocation services. From | what I understand, a lot of people avoid on-premise/in-house | solutions because they don't want to deal with server rooms, | redundant power, redundant networks, etc. | | So people go to the cloud and pay horrendous prices there. | | Why not take a middle path? Build your own custom server with | your preferred hardware and put it in a colocation facility. | dkobran wrote: | There are several tier-two clouds that offer GPUs, but I think | they generally fall prey to many of the same issues | you'll find with AWS. There is a new generation of | accelerator-native clouds, e.g. Paperspace | (https://paperspace.com), that cater specifically to HPC, AI, | etc. workloads. The main differentiators are: a much larger | GPU catalog; support for new accelerators, e.g. Graphcore | IPUs; and a different pricing structure that addresses problematic | areas for HPC such as egress. | | However, one of the most important differences is the _lack_ | of unrelated web-services components that pose a | major distraction/headache to users that don't have a DevOps | background (which AWS obviously caters to). AWS can be | incredibly complicated. Simple tasks are encumbered by a | whole host of unrelated options/capabilities, and the learning | curve is very steep. A platform that is specifically designed | to serve the scientific computing audience can be much more | streamlined and user-friendly for this audience. | | Disclosure: I work on Paperspace. | latchkey wrote: | Coreweave. I know the CTO. They are doing great work over | there. | | https://www.coreweave.com | sabalaba wrote: | Lambda GPU Cloud has the cheapest A100s of that group. | https://lambdalabs.com/service/gpu-cloud | | Lambda A100s - $1.10 / hr; Paperspace A100s - $3.09 / hr; | Genesis - no A100s, but their 3090 (1/2 the speed of | an A100) is $1.30 / hr. | lostmsu wrote: | That's still way too expensive. The 3090 is less than 2x of the | monthly cost in Genesis. The A100 is priced better here. | tryauuum wrote: | datacrunch.io has some 80G A100s | theblazehen wrote: | https://www.genesiscloud.com/ is pretty decent | snorkel wrote: | Buying your own fleet of dedicated servers seems like a smart | move in the short term, but then five years from now you'll get | someone on the team insisting that they need the latest, greatest | GPU to run their jobs. Cloud providers give you the option of | using newer chipsets without having to re-purchase your entire | server fleet every five years. | lebovic wrote: | In HPC land, most hardware is amortized over five years and | then replaced! If you keep your servers in service for five years | at high utilization, you're doing great.
| | For example, the Blue Waters supercomputer at UIUC was | originally expected to last five years, although they kept it | in service for nine; it was considered a success: | https://www.ncsa.illinois.edu/historic-blue-waters-supercomp... | adamsb6 wrote: | I've never worked in this space, but I'm curious about the need | for massive egress. What's driving the need to bring all that | data back to the institution? | | Could whatever actions have to be performed on the data also be | performed in AWS? | | Also, while briefly looking into this I found that AWS has an | egress waiver for researchers and educational institutions: | https://aws.amazon.com/blogs/publicsector/data-egress-waiver... | COGlory wrote: | Well, for starters, if you are NIH or NSF funded, they have data | storage requirements you must meet. So usually this involves | something like tape backups in two locations. | | The other is for reproducibility - typically you need to | preserve lots of in-between steps for peer review and for proving | that you aren't making things up. Some intermediary data is | wiped out, but usually only if it can be quickly and easily | regenerated. | jpeloquin wrote: | Regarding the waiver--"The maximum discount is 15 percent of | total monthly spending on AWS services". Was very excited at | first. | | As for leaving data in AWS, data is often (not always) | revisited repeatedly for years after the fact. If new questions | are raised about the results, it's often much easier to check | the output than to rerun the analysis. And cloud storage is not | cheap. But yes, it sometimes makes sense to egress only summary | statistics and discard the raw data. | [deleted] ___________________________________________________________________ (page generated 2022-10-07 23:00 UTC)