[HN Gopher] AWS doesn't make sense for scientific computing
___________________________________________________________________

AWS doesn't make sense for scientific computing

Author : lebovic
Score  : 206 points
Date   : 2022-10-07 15:28 UTC (7 hours ago)

(HTM) web link (www.noahlebovic.com)
(TXT) w3m dump (www.noahlebovic.com)

| renewiltord wrote:
| > _Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers - that's the API equivalent of storing your inbound API requests, and then responding to them months later_
|
| Makes sense if the jobs are all low urgency.
|
| We have a similar problem in trading, so we have a composite solution with non-cloud simulation hardware and additional AWS hardware. That's because we have the high-utilization solution combined with high urgency.
| Fomite wrote:
| I did have to chuckle a bit because, working on HPC simulations of the pandemic during the pandemic, there was an awful lot of "this needs to be done tomorrow" urgency.
| prpl wrote:
| Actually, computing is fine for most use cases (spot instances, preemptible VMs on GCP) and has been used in lots of situations, even at CERN. Where the cloud also excels is if you need any kind of infrastructure, because no HPC center has figured out a reasonable approach to that (some are trying with k8s). Also, obviously, you get a huge selection of hardware.
|
| Where cloud/AWS doesn't make sense is storage, especially if you need egress, and if you actually need IB.
| wistlo wrote:
| Database analyst for a large communication company here.
|
| I have similar doubts about AWS for certain kinds of intensive business analysis. Not API-based transactions, but back-office analysis where complex multi-join queries are run in sequence against tables with 10s of millions of records.
|
| We do some of this with SQL servers running right on the desktop (and one still uses Excel with VLOOKUP). We have a pilot project to try these tasks in a new Azure instance. I look forward to seeing how it performs, and at what cost.
| AyyWS wrote:
| Do you have disaster recovery / high availability requirements? SQL Server on a desktop has a lot of single points of failure.
| captainmuon wrote:
| A former colleague did his PhD in particle physics with a novel technique (the matrix element method). I can't really explain it, but it is extremely CPU intensive. That working group did it on CERN's resources, and they had to borrow quotas from a bunch of other people. For fun they calculated how much it would have cost on AWS and came up with something ridiculous like 3 million euros.
| wenc wrote:
| I can't speak specifically to CERN and the exact workload. But bear in mind that the 3MM euros is non-negotiated sticker pricing. In real life, negotiated pricing can be much, much less depending on your org size and spend. This is a variable most people neglect.
| captainmuon wrote:
| That is true, and a large part of the theoretical cost was probably also traffic, and the use of nonstandard nodes. They could have gotten a much more realistic price.
|
| I guess the point is also that scientists often don't realize that compute costs money, when the computers are already bought.
| dguest wrote:
| The bigger experiments will routinely burn through tens of millions worth of computing. But 10 million euros isn't much for these experiments. The issue is that they are publicly funded: any country is much happier to build a local computing center and lend it to scientists than to fork the money over to an American cloud provider.
|
| (The expensive part of these experiments is simulating billions of collisions and how the thousands of outgoing particles propagate through a detector the size of a small building. Simulating a single event takes around a minute on a modern CPU, and the experiments will simulate billions of events in a few months. If AWS is charging 5 cents a minute it works out to tens of millions easy.)
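A rough back-of-envelope check of the arithmetic in the comment above (a sketch only: the event count and the per-minute rate are the commenter's illustrative figures, not quoted AWS prices):

    # Back-of-envelope cost of simulating collision events, using the figures
    # from the comment above (illustrative, not actual AWS pricing).
    events = 2e9                 # "billions of events" -> assume 2 billion
    minutes_per_event = 1        # ~1 CPU-minute per simulated event
    cost_per_cpu_minute = 0.05   # "5 cents a minute"

    total_cost = events * minutes_per_event * cost_per_cpu_minute
    cpu_years = events * minutes_per_event / (60 * 24 * 365)

    print(f"~${total_cost / 1e6:.0f}M for ~{cpu_years:,.0f} CPU-years")
    # -> ~$100M for ~3,805 CPU-years; even 1e9 events lands in the tens of millions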
| rirze wrote:
| I would imagine CERN's resources are essentially a data center comparable to a small cloud provider's resources.
| somesortofthing wrote:
| The author makes a convincing argument against doing this workload on on-demand instances, but what about spot instances? AWS explicitly calls out scientific computing as a major use case for spot instances in its training/promotional materials. Given the advertised ~70-90% markdown on spot instance time, it seems like a great option: you pay almost the same amount as for the workstation, but you don't have to pay to buy, maintain, or replace the hardware.
| lebovic wrote:
| Author here! Spot instance pricing is better than on-demand, but it doesn't include data transfer, and it's still more expensive than on-prem/Hetzner/etc. Data transfer costs exceed the cost of the instance itself if you're transferring many TB off AWS.
|
| For one of the more popular AWS instance types I use - a c5a.24xlarge, used for comparison in the post - the cheapest spot price over the past month in us-east-1 was $1.69/hr. That's still $1233.70/mo: above on-prem, colo, or Hetzner pricing. Data transfer is still extremely expensive.
|
| That said, for bursty loads that can't be smoothed with a queue, spot instances (or just normal EC2 instances) do make sense! I use them all the time for my computational biology company.
| latchkey wrote:
| I read this as a thinly veiled advertisement for the author's service, Toolchest.
| lebovic wrote:
| Toolchest actually runs scientific computing on AWS! I'm just frustrated by what we can build, because most scientific compute can't effectively shift to AWS.
| latchkey wrote:
| As others have noted, there are many other providers out there. I think your essay would have had more value if it didn't end with an advertisement.
| [deleted]
| gammarator wrote:
| Astronomy is moving more and more to cloud computing:
|
| https://www.nature.com/articles/d41586-020-02284-7
|
| https://arxiv.org/abs/1907.06320
| Moissanite wrote:
| This has been my exact field of work for a few years now; in general I have found that:
|
| When people claim it is 10x more expensive to use public cloud, they have no earthly idea what it actually costs to run an HPC service, a data centre, or do any of the associated maintenance.
|
| When the claim is 3x more expensive in the cloud, they do know those things but are making a bad-faith comparison, because their job involves running an on-premises cluster and they are scared of losing their toys.
|
| When the claim is 0-50% more to run in the cloud, someone is doing the math properly and aiming for a fair comparison.
|
| When the claim is that cloud is cheaper than on-prem, you are probably talking to a cloud vendor account manager whose colleagues are wincing at the fact that they just torched their credibility.
| lebovic wrote:
| Author here! I think running an HPC service that has a steady queue in AWS can be more than 3x as expensive.
|
| What type of HPC do you work in? Maybe I'm over-indexing on computational biology.
| Moissanite wrote:
| All types of HPC; I'm a sysadmin/consultant. I don't think the problem with the cost gap is overestimating cloud costs but rather underestimating on-prem costs. Also, failing to account for financing differences and the opportunity costs of large up-front capital purchases.
| johnklos wrote:
| This is oversimplifying things a bit.
|
| It can categorically be stated that for a year's worth of CPU compute, local will always be less than Amazon. Of course, putting percentages on it doesn't work - there are just too many variables.
|
| There are many admins out there who have no idea what an Alpha is who'll swear that if you're not buying Dell or HP hardware at a premium with expensive support contracts, you're doing things wrong and you're not a real admin. Visit Reddit's /r/sysadmin if you want to see the kind of people I'm talking about.
|
| The point is that if people insist on the most expensive, least efficient type of servers, such as Dell Xeons with ridiculous service contracts, the savings over Amazon won't be large.
|
| It's a cumulative problem, because trying to cool and house less efficient hardware requires more power, and that hardware ultimately has less tolerance for non-datacenter cooling.
|
| Rethink things. You can have AMD Threadripper / EPYC systems in larger rooms that require less overall cooling, that have better temperature tolerance, that are more reliable in aggregate, which cost less, and for which you can easily keep spare parts around - which would give better turnaround and availability than support contracts from Dell / HP. Suddenly your compute costs are halved, because of pricing, efficiency, overall power, real estate considerations...
|
| So percentages don't work, but the bottom line is that when you're doing lots of compute, over time it's always cheaper locally, even if you do things the "traditional" expensive and inefficient way. Arguing percentages with so many variables doesn't make any sense - it's still cheaper, no matter what.
| thayne wrote:
| > Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers
|
| That sounds very much like an argument _for_ a cloud. Instead of waiting months to do your processing, you spin up what you need, then tear it down when you are done.
| withinboredom wrote:
| The queue then just turns into the bank account. The queue doesn't magically go away.
| adolph wrote:
| Classic iron triangle, pick two:
|   * cheap
|   * fast
|   * available
|
| https://en.wikipedia.org/wiki/Project_management_triangle
| twawaaay wrote:
| On the other hand, it makes sense if you just need to borrow their infrastructure for a while to calculate something.
|
| A lot of scientific computing isn't happening continuously, and a lot of it is a one-time experiment, or maybe runs a couple of times, after which you would have to tear down and reassign.
|
| Another fun fact people forget is that our ability to predict the future is still pretty poor. Not only that, we are biased towards thinking we can predict it when in fact this is complete bullshit.
|
| You have to buy and set up infrastructure before you can use it, and then you have to be ready to use it. What if you are not ready? What if you end up not needing as many resources? What if you stop needing it earlier than you thought? When you borrow it from AWS you have the flexibility to start using it when you are ready and drop it immediately when you no longer need it. Which has value on its own.
| At the company I work for, we basically banned signing long-term contracts for discounts. We found that, on average, we pay many times more for unused services than whatever we gained through discounts. Also, when you pay for the resources there is an incentive to improve efficiency. When you have basically prepaid for everything, that incentive is very small and is basically limited to making sure you stay within limits.
| bsenftner wrote:
| This is the case for a large class of big data + high compute applications. Animation / simulation in engineering, planning, forecasting, not to mention entertainment, require pipelines for which the typical cloud is simply too expensive.
| didip wrote:
| No way. I vehemently disagree.
|
| When a company reaches a certain mass, hardware cost is a factor that is considered, but not a big factor.
|
| The bigger problems are lost opportunity costs and unnecessary churn.
|
| Businesses lose a lot when the product launch is delayed by a year simply because the hardware arrived late or had too many defects (ask your hardware fulfillment people how many defective RAM sticks and SSDs they get per new shipment).
|
| Churn can cost the business a lot as well. For example, imagine the model that everyone has been using is trained on a Mac Pro under XYZ's desk. And then when XYZ quits, they never properly backed up the code and the model.
|
| Bare metal allows for sloppiness that the cloud cannot afford to allow. Accountability and ownership are a lot more apparent in the cloud.
| idiot900 wrote:
| This rings true for me. I have a federal grant that prohibits me from using its funds for capital acquisitions: i.e. servers. But I can spend it on AWS at massive cost for minimal added utility for my use case. Even though it would be a far better use of taxpayer funds to buy the servers, I have to rent them instead.
| giantrobot wrote:
| I'm not saying AWS is automatically the best option, but the question isn't just servers. It's servers, networking hardware, HVAC, a facility to put them all in, and at least a couple of people to run and maintain it all. The TCO of some servers is way higher than the cost of the hardware.
| adgjlsfhk1 wrote:
| Can you get your university to buy some servers for unrelated reasons and have them rent them to you?
| chrisseaton wrote:
| Well, that's just rebuilding AWS badly. I've used academic-managed time-sharing setups and have some horror stories.
| lostmsu wrote:
| Doesn't have to be a university either. Depending on the amount of compute needed, any capable IT guy can do it for you from their garage with a contract.
| boldlybold wrote:
| Lots of places (Hetzner for example) will rent you servers at 10-25% the cost of AWS if you want dedicated hardware, without the ability to autoscale. You can even set up a K8s cluster there if the overhead is worth it.
| intelVISA wrote:
| Fond memories of Hetzner asking for my driving license as ID for renting a $2 VPS. Lost a customer for life with that nonsense.
| testplzignore wrote:
| > prohibits me from using its funds for capital acquisitions
|
| What is a legitimate reason for this restriction?
| blep_ wrote:
| I can think of a few ways to abuse it while still spinning it as "for research". The obvious one is to buy a $9999 gaming machine with several of whatever the fanciest GPU on the market is at the time, and say you're doing machine learning.
|
| So my guess is it's an overly broad patch for that sort of thing.
| Fomite wrote:
| Not really - this is also true for things with no particular "civilian" use.
| Fomite wrote:
| Basically, the granting organization doesn't want to pay for the full cost of capital equipment that will - either via time or capacity - not be fully used for that grant.
|
| There are other grant mechanisms for large capital expenditures.
|
| The problem is the thresholds haven't shifted in a long time, so you can easily trigger it with a nice workstation. But then, the budget for a modular NIH R01 was set in 1999, so that's hardly a unique problem.
| lowbloodsugar wrote:
| > Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.
|
| Can't imagine you are paying public prices on any cloud provider if you have a $50M/yr budget.
|
| In addition, if, as the article states, the scientists are OK waiting some considerable time for results, then one can run most, if not all, of it on spot instances, and that can save 10x right there.
|
| If you don't have $50M/yr, there are companies that will move your workload around different AWS regions to get the best price - and will factor in the cost of transferring the data too.
|
| I was an architect at a large scientific company using AWS.
| lebovic wrote:
| Author here. I agree that pricing is highly negotiable for any large cloud provider, and there are even (capped) egress fee waivers that you can negotiate as a part of your contract. There's also a place for using AWS; I used it for a smaller DNA sequencing facility, and I use it for my computational biology startup.
|
| That said, I'll repeat something that I commented somewhere else: most scientific computing (by % of compute) happens in a context that still doesn't make sense in AWS. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc.), and a well-maintained HPC cluster that analyzes that data.
|
| Spot instances are still pretty expensive for a steady queue (2x Hetzner's monthly costs, for reference), and you still have to pay AWS data transfer egress costs - which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link. Data transfer to optimize for spot instance pricing becomes prohibitive when your job has 100 TB of raw data.
| dastbe wrote:
| Using on-demand for latency-insensitive work, especially when you're also very cost sensitive, isn't the right choice. Spot instances will get you somewhere in the realm of the Hetzner/on-prem numbers.
| Sebb767 wrote:
| But, as the article points out, you are still paying a lot of money for features that you don't need for scientific computing.
|
| Also, AWS is notoriously easy to undercut with on-prem hardware, especially if your budget is large and your uptime requirements aren't - you'll save a few hundred thousand a year alone by not having to hire expert engineers for on-call duty and extreme reliability.
| lebovic wrote:
| Even spot instances on AWS are still over 2x more expensive per month than Hetzner. The cheapest c5a.24xlarge spot instance right now is $1.5546/hr in us-east-1c. That's $1134.86/mo, excluding data transfer costs. If you transfer out 10 TB over the course of a month, that's another $921.60/mo - or now 4x more expensive than Hetzner.
|
| Using the estimate from the article, spot instances are still over 8x more expensive than on-prem for scientific computing.
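A small sketch of the comparison being made in the comment above, using only the numbers quoted in this thread (the cited spot price, ~730 hours/month, the ~$0.09/GB egress rate implied by the $921.60 figure, and the Hetzner figure quoted later). These are the thread's example numbers, not current price lists:

    # Rough monthly cost comparison from the numbers quoted in this thread
    # (illustrative only; spot and egress prices change constantly).
    HOURS_PER_MONTH = 730

    spot_price_per_hr = 1.5546   # cheapest c5a.24xlarge spot price cited above
    egress_per_gb     = 0.09     # implied by "$921.60 for 10 TB"
    egress_tb         = 10
    hetzner_monthly   = 512.92   # dedicated-server figure cited in the thread

    aws_compute = spot_price_per_hr * HOURS_PER_MONTH
    aws_egress  = egress_tb * 1024 * egress_per_gb
    aws_total   = aws_compute + aws_egress

    print(f"AWS spot compute  : ${aws_compute:8.2f}/mo")
    print(f"AWS egress (10 TB): ${aws_egress:8.2f}/mo")
    print(f"AWS total         : ${aws_total:8.2f}/mo "
          f"(~{aws_total / hetzner_monthly:.1f}x the Hetzner figure)")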
| dekhn wrote:
| Even more importantly, if you have any reasonable amount of spend on cloud, you can get preferred pricing agreements. As much as I hate to talk to "salespeople", I did manage to cut millions in costs per year with discounts on serving and storage.
|
| Personally, when I estimate the total cost of ownership of scientific cloud computing versus on-prem (for extremely large-scale science with both significant server, storage, and bandwidth requirements), the cloud ends up winning for a number of reasons. I've seen a lot of academics who disagree, but then I find out they use their grad students to manage their clusters.
| thesausageking wrote:
| I'm suspicious of the author's actual experience.
|
| The fact that scientific computing has a different pattern than the typical web app is actually a good thing. If you can architect large batch jobs to use spot instances, it's 50-80% cheaper.
|
| Also, this bit: "you can keep your servers at 100% utilization by maintaining a queue of requested jobs" isn't true in practice. The pattern of research is that the work normally comes in waves. You'll want to train a new model or run a number of large simulations. And then there will be periods of tweaking and work on other parts. And then more need for a lot of training. Yes, you can always find work to put on a cluster to keep it at >90% utilization, but if it can be elastic (and compute has a budget attached to it), it will rise and fall.
| lebovic wrote:
| Author here! I worked on the computing infrastructure for a DNA sequencing facility, and I run a computational biology infrastructure company (trytoolchest.com, YC W22). Both are built on AWS, so I do think AWS in scientific computing has its use cases - mostly in places where you can't saturate a queue or you want a fast cycle time.
|
| Spot instances are still pretty expensive for a steady queue (2x Hetzner's monthly costs, for reference), and you still have to pay AWS data transfer egress costs - which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link.
|
| This post was born from frustration at AWS for their pricing and offerings after trying to get people to switch to AWS in scientific computing for years :)
| Fomite wrote:
| One of the aspects not touched on here is PII/confidential data/HIPAA data, etc.
|
| For that, whether it makes sense or not, a lot of universities are moving to AWS, and the infrastructure cost of AWS for what would be a pretty modest server is still considerably less than the cost of complying with the policies and regulations involved.
|
| Recently at my institution I asked about housing it on premises, and the answer was that IT supports AWS, and if I wanted to do something else, supporting that - as well as the responsibility for a breach - would rest entirely on my shoulders. Not doing that.
| [deleted]
| KaiserPro wrote:
| It's much more complex than described.
|
| The author is making a brilliant argument for getting a secondhand workstation and shoving it under their desk.
|
| If you are doing multi-machine, batch-style processing, then you won't be using on-demand; you'd use the spot pricing. The missing argument in that part is storage costs. Managing a high-speed, highly available synchronous file system that can do a sustained 50 GB/sec is hard bloody work (no, S3 isn't a good fit - too much management overhead).
|
| Don't get me wrong, AWS _is_ expensive if you are using a machine for more than a month or two.
|
| However, if you are doing highly parallel stuff, Batch and Lustre on demand are pretty ace.
|
| If you are doing a multi-year project, then real steel is where it's at - assuming you have factored in hosting, storage, and admin costs.
| bushbaba wrote:
| Check out Apache Iceberg, which makes it fairly trivial to get high throughput from S3 without much fine-tuning. Bursts from 0 to 50 Gbps should be possible from S3 without much effort; just have object sizes that are in the NN+ MiB range. Personally, I find Lustre is a mess: it's expensive and even more of a pain to fine-tune.
| awiesenhofer wrote:
| From https://iceberg.apache.org
|
| > Iceberg is a high-performance format for huge analytic tables.
|
| How would that help speed up S3? Genuine question?
| gautamdivgi wrote:
| Even for multi-year, if you factor in everything, does it still come out cheaper than AWS? Would you be running everything 24x7 on an HPC system? I don't think so. You need scale at some points, and there are probably times where research is done on your desktop.
|
| You could invest in an HPC system - but I think the human cost of maintaining one, especially if you're in a high cost-of-living area (e.g. Bay Area, NYC, etc.), is going to be pretty high. Admin cost, UPS, cable wiring, heat/cooling, etc. can all be pretty expensive. Maintenance of these can be pretty pricey too.
|
| Are there any companies that remotely manage data centers and rent out bare-metal infra?
| lostmsu wrote:
| Isn't 50 GB/sec like 5 NVMe Gen 5 SSDs + 1 or 2 for redundancy?
|
| Actually, you are right. Consumer SSDs I've seen only do about 1.5 GB/s sustained.
| davidmr wrote:
| Not in the context the person you responded to meant it. Yes, you can very easily get 50 GB/s from a few NVMe devices on a single box. Getting 50 GB/s on a POSIX-ish filesystem exported to 1000 servers is very possible and common, but orders of magnitude more complicated. 500 GB/s is tougher still. 5 TB/s is real tough, but real fun.
| pclmulqdq wrote:
| Even (high-end) consumer SSDs can saturate a PCIe Gen 4 x4 link if you are doing sequential reads. Non-sequential hurts on even enterprise SSDs.
| wenc wrote:
| Calculating costs based on sticker price is sometimes misleading because there's another variable: negotiated pricing, which can be much, much lower than sticker prices, depending on your negotiating leverage. Different companies pay different prices for the same product.
|
| If you've ever worked at a big company or university (any place where you spend at scale), you'll know you rarely pay sticker price. Software licensing is particularly elastic because it's almost zero marginal cost. Raw cloud costs are largely a function of energy usage and amortized hardware costs - there's a certain minimum you can't go under, but there remains a huge margin that is open to negotiation.
|
| Startups/individuals rarely even think about this because they rarely qualify. But big orgs with large spends do. You can get negotiated cloud pricing.
| racking7 wrote:
| This is definitely true for cloud retail prices. However, I've seen cases where it stops being true when there is an existing discount. Reserved instances, for example.
| bee_rider wrote:
| Is genomic code typically distributed-memory parallel? I'm under the impression that it is more like batch processing: not a ton of node-to-node communication, but you want lots of bandwidth and storage.
|
| If you are doing a big distributed-memory numerical simulation, on the other hand, you probably want InfiniBand, I guess.
|
| AWS seems like an OK fit for the former, maybe not great for the latter...
| pclmulqdq wrote:
| The fastest way to do a lot of genomics stuff is with FPGA accelerators, which also aren't used by most of the other tenants in a multi-tenant scientific computing center. The cloud is perfect for that kind of work.
| bee_rider wrote:
| That's interesting. It is sort of funny that I was right (putting genomics in the "maybe good for cloud" bucket) for the wrong reason (characterizing it as more suited for general-purpose commodity resources, rather than suited for the more niche FPGA platform).
| timeu wrote:
| As an HPC sysadmin for 3 research institutes (mostly life sciences & biology), I can't see how a cloud HPC system could be any cheaper than an on-prem HPC system, especially if I look at the resource efficiency (how many resources were requested vs. how many were actually used) of our users' SLURM jobs. Often the users request 100s of GB but only use a fraction of it. In our on-prem HPC system this might decrease utilization (which is not great), but in the cloud this would result in increased computing costs (because of a bigger VM flavor), which would probably be worse (CapEx vs. OpEx). Of course you could argue that the users should know better and properly size/measure their resource requirements; however, most of our users have a lab background and are new to computational biology, so estimating or even knowing what all the knobs of the job specification (cores, mem per core, total memory, etc.) mean is hard for them. We try to educate by providing trainings and job efficiency reporting; however, the researchers/users have little incentive to optimize their job requests and are more interested in quick results and turnover, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them - or rather their group leaders/institute heads - to put a focus on this, but until you move to the cloud you won't know.
|
| Additionally, the typical workloads that run on our HPC system are often some badly maintained bioinformatics software or R/Perl/Python throwaway scripts, and often enough a typo in the script causes the entire pipeline to fail after days of running on the HPC system and it needs to be restarted (maybe even multiple times). Again, on the on-prem system you have wasted electricity (bad enough), but in the cloud you have to pay the computing costs of the failed runs. Again, cost transparency might force a fix for this, but the users are not software engineers.
|
| One thing that the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPUs to GPUs. A new cryo-EM microscope was installed where the downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is now using GPUs as well. We have around 100 GPUs, average utilization has increased to 80-90%, and the users are complaining about long waiting/queueing times for their GPU jobs. For this, bursting to the cloud would be nice; however, GPUs are unfortunately prohibitively expensive in the cloud, and the above-mentioned caveats regarding job resource efficiency still apply.
|
| One thing that will hurt on-prem HPC systems, though, is the increased electricity prices. We are now taking measures to actively save energy (i.e. by powering down idle nodes and powering them up again when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't increased their prices yet, either because they cover the electricity cost increase with their profit margins or because they are not affected as much since they have better deals with electricity providers.
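The requested-vs-used gap described above can be measured from Slurm's accounting database. A minimal sketch (the sacct options are standard, but ReqMem/MaxRSS formatting differs between Slurm versions - older releases add 'n'/'c' suffixes and MaxRSS usually appears on the .batch step - so treat the parsing as illustrative and adapt it to your site):

    #!/usr/bin/env python3
    """Rough per-job memory-efficiency report from Slurm accounting (a sketch)."""
    import subprocess

    def to_mb(field: str) -> float:
        """Convert a Slurm size string like '16G', '4000Mc', '123456K' to MB."""
        field = field.rstrip("nc")  # strip per-node/per-CPU suffixes (older Slurm)
        units = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
        if field and field[-1] in units:
            return float(field[:-1]) * units[field[-1]]
        return float(field or 0)

    out = subprocess.run(
        ["sacct", "--allusers", "--starttime=now-7days", "--parsable2",
         "--noheader", "--format=JobID,User,ReqMem,MaxRSS,State"],
        capture_output=True, text=True, check=True).stdout

    for line in out.splitlines():
        jobid, user, reqmem, maxrss, state = line.split("|")
        if not maxrss or not reqmem:  # MaxRSS is only reported on job steps
            continue
        used, asked = to_mb(maxrss), to_mb(reqmem)
        if asked:
            print(f"{jobid:>14} asked {asked:9.0f} MB, used {used:9.0f} MB "
                  f"({100 * used / asked:5.1f}%)")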
| kortex wrote:
| What does the landscape look like now for "terraform for bare metal"? Is Ansible/Chef still the main game in town? I just wanna netboot some lightweight image, set up some basic network discovery on a control plane, and turn every connected box into a flexible worker bee that I can deploy whatever cluster control layer (k8s/Nomad) on top of and start slinging containers.
| nwilkens wrote:
| I really like this description of how bare-metal infrastructure should work, and this is where I think (shameless self-promotion) Triton DataCenter[1] plays really well today on-prem.
|
| PXE-booted lightweight compute nodes with a robust API, including an operator portal, user portal, and CLI.
|
| Keep an eye out for the work we are doing with Triton Linux + K8s: a very lightweight Triton Linux compute node + bare-metal K8s deployments on Triton.
|
| [1] https://www.tritondatacenter.com
| jrm4 wrote:
| I imagine what makes this especially hard is that you have (at least) three parties in play here:
|
| - the people doing the research
|
| - the institution's IT services group
|
| - the administrator who writes the checks
|
| And in my experience, "actual knowledge of what must be done and what it will or could cost" can vary greatly across these three groups; frequently in very unintuitive ways.
| Fomite wrote:
| This is the biggest point of friction. I spent the better part of a year trying to get a postdoc admin access to his machine.
| rpep wrote:
| I think there are some things this misses about the scientific ecosystem in universities etc. that can make the cloud more attractive than it first appears:
|
| * If you want to run really big jobs, e.g. with multiple multi-GPU nodes, this might not even be possible depending on your institution or your access. Most research-intensive universities have a cluster, but they're not normally big machines. For regional and national machines, you usually have to bid for access for specific projects, and you might not be successful.
|
| * You have control of exactly what hardware and OS you want on your nodes. Often you're using an out-of-date RHEL version, and despite Spack and EasyBuild gaining ground, all too often you're given a compiler and some old versions of libraries and that's it.
|
| * For many computationally intensive studies, your data transfer actually isn't that large. E.g. you can often do the post-processing on-node and then only get aggregate statistics about simulation runs out.
| danking00 wrote:
| I think this post is identifying scientific computing with simulation studies and legacy workflows, to a fault. Scientific computing includes those things, but it _also_ includes interactive analysis of very large datasets as well as workflows designed around cloud computing.
|
| Interactive analysis of large datasets (e.g. genome & exome sequencing studies with 100s of 1000s of samples) is well suited to low-latency, serverless, & horizontally scalable systems (like Dremel/BigQuery, or Hail [1], which we build and which is inspired by Dremel, among other systems). The load profile is unpredictable because after a scientist runs an analysis they need an unpredictable amount of time to think about their next step.
|
| As for productionized workflows, if we redesign the tools used within these workflows to read and write data directly to cloud storage, as well as to tolerate VM preemption, then we can exploit the ~1/5 cost of preemptible/spot instances.
|
| One last point: for the subset of scientific computing I highlighted above, speed is key. I want the scientist to stay in a flow state, receiving feedback from their experiments as fast as possible, ideally within 300 ms. The only way to achieve that on huge datasets is through rapid and substantial scale-out followed by equally rapid and substantial scale-in (to control cost).
|
| [1] https://hail.is
| jessfyi wrote:
| I've followed Hail and applaud the Broad Institute's work wrt establishing better bioinformatics software and toolkits, so I hope this doesn't come across as rude, but I can't imagine an instance in a real industry or academic workflow where you need 300 ms feedback from an experiment to "maintain flow", considering how long experiments on data that large (especially exome sequencing!) take overall. My (likely lacking) imagination aside, I guess what I'm really saying is that I don't know what's preventing the use case you've described from being performed locally, considering there'd be even _less_ latency.
| CreRecombinase wrote:
| These MPI-based scientific computing applications make up the bulk of the compute hours on HPC clusters, but there is a crazy long tail of scientists who have workloads that can't (or shouldn't) run on their personal computers. The other option is HPC. This sucks for a ton of reasons, but I think the biggest one is that it's more or less impossible to set up a persistent service of any kind. So no databases; if you want Spark, be ready to spin it up from nothing every day (also no HDFS unless you spin that up in your SLURM job too). This makes getting work done harder, but it also makes integrating existing work so much harder, because everyone's workflow involves reinventing everything, and everyone does it in subtly incompatible ways; there are no natural (common) abstraction layers because there are no services.
| 0xbadcafebee wrote:
| AWS is _fantastic_ for scientific computing. With it you can:
|
| - Deploy a thousand servers with GPUs in 10 minutes, churn over a giant dataset, then turn them all off again. Nobody ever has to wait for access to the supercomputer.
|
| - Automatically back up everything into cold storage over time with a lifecycle policy.
|
| - Avoid the massive overhead of maintaining HPC clusters, labs, data centers, additional staff and training, capex, load estimation, and months/years of advance planning to be ready to start computing.
|
| - Automate via APIs to enable very quick adaptation with little coding.
|
| - Use an entire universe of services which ramp up your capabilities to analyze data and apply ML without needing to build anything yourself.
|
| - Use a marketplace of B2B and B2C solutions to quickly deploy new tools within your account.
|
| - Share data with other organizations easily.
|
| AWS costs are also "retail costs". There are massive savings to be had quite easily.
| Fomite wrote:
| One thing to consider:
|
| _I_ don't control my AWS account. I don't even _have_ an AWS account in my professional life.
|
| I tell my IT department what I want. They tell the AWS people in central IT what they want. It's set up. At some point I get an email with login information.
|
| I email them again to turn it off.
|
| Do I hate this system? Yes. Is it the system I have to work with? Also yes.
|
| "AWS as implemented by any large institution" is considerably less agile than AWS itself.
| [deleted]
| slaymaker1907 wrote:
| Cloud worked really well for me when I was in school. A lot of the time, I would only need a beefy computer for a few hours at a time (often due to high memory usage), and you can/could rent spot instances for very cheap. There are about 730 hours per month, so the cost calculus is very different for a student/researcher who needs fast turnaround times (high performance), but only for a short period of time.
|
| However, I know not all HPC/scientific computing works that way, and some workloads are much more continuous.
| a2tech wrote:
| That's how my department uses the cloud - we have an image we store on AWS geared towards a couple of tasks, and we spin up a big instance when we need it, run the task, pull out the results, then stop the machine. Total cost: sub-100 dollars. If we had to go to the HPC group we'd have to fight with them to get the environment configured, get access to the system, get payment set up, teach the faculty to use the environment, etc. It's just a pain for very little gain.
| Mave83 wrote:
| I agree with the article. We at croit.io help customers around the globe build their clusters and save huge amounts. For example, Ceph S3 in any data center of your choice is around 1/10 of the AWS S3 price.
| nharada wrote:
| I'd love to buy my own servers for small-scale (i.e. startup-size or research-lab-size) projects, but it's very hard to keep them utilized 24x7. Does anyone know of open-source software or tools that allow multiple people to timeshare one of these? A big server full of A100s would be awesome, with the ability to reserve the server on specific days.
| jpeloquin wrote:
| > the ability to reserve the server on specific days
|
| In an environment where there are not too many users and everyone is cooperative, using Google Calendar to reserve time slots works very well and is very low maintenance. Technical restrictions are needed only when the users can't be trusted to stay out of each other's way.
| didip wrote:
| This is just the cloud with extra steps.
| mbreese wrote:
| I completely agree for most cases. In many scientific computing applications, compute time isn't the factor you prioritize in the good/fast/cheap triad. Instead, you often need to do things as cheaply as possible. And your data access isn't always predictable, so you need to keep results around for an extended period of time. This makes storage costs a major factor. For us, this alone was enough to move workloads away from cloud and onto local resources.
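A quick sketch of the storage point above: what it costs to keep a large result set around versus buying disks, using round, assumed unit prices (not current list prices) and the 90 TB project size quoted from the article just below:

    # Illustrative cost of retaining results long-term.  All unit prices here
    # are round assumptions for the sketch, not quoted list prices.
    tb_retained       = 90      # e.g. one month-long sequencing project
    months_retained   = 12

    s3_standard_gb_mo = 0.023   # assumed object-storage price, $/GB-month
    egress_per_gb     = 0.09    # egress rate implied earlier in the thread
    local_disk_per_tb = 150.0   # assumed raw HDD price, $/TB (one-time)
    local_overhead    = 3.0     # x for RAID/replication, chassis, power

    cloud  = tb_retained * 1024 * s3_standard_gb_mo * months_retained
    onprem = tb_retained * local_disk_per_tb * local_overhead
    pull   = tb_retained * 1024 * egress_per_gb

    print(f"object storage, {months_retained} months : ${cloud:10,.0f}")
    print(f"on-prem disks (one-time, reusable): ${onprem:10,.0f}")
    print(f"pulling one full copy back out    : ${pull:10,.0f}")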
| COGlory wrote:
| > a month-long DNA sequencing project can generate 90 TB of data
|
| Our EM facility generates 10 TB of raw data per day, and once you start computing on it, that increases by 30%-50% depending on what you do with it. Plus, moving between network storage and local scratch for computational steps basically never ends and keeps multiple 10 GbE links saturated 100% of the time.
| bgro wrote:
| When I was looking at AWS for personal use, I first thought it was oddly expensive even when factoring in not having to buy the hardware. But when I looked at just what the electricity to run it myself would cost, I think that factor alone meant AWS was actually cheaper. This is without factoring in cooling / dedicated space / maintenance.
| betolink wrote:
| I see both sides of the argument; there is a reason why CERN is not processing its data using EC2 and Lambdas.
| thamer wrote:
| The vast majority of researchers don't need anywhere close to the amount of resources that CERN needs. The fact that CERN doesn't use EC2 and Lambdas shouldn't be taken as a lesson by anyone who's not operating at their scale.
|
| This feels like a similar argument to the one made by people who use Kubernetes to ensure their web app with 100 visitors a day is web scale.
| harunurhan wrote:
| The cost isn't the only reason:
|
| - CERN started planning its computing grid before AWS was launched.
|
| - It's pretty complicated (politics, mission, vision) for CERN to use external proprietary software/hardware for its main functions (they have even started to move away from MS Office-like products).
|
| - [cost] CERN is quite different from a small team of researchers doing a few years of research; the scale is enormous and very long-lived, continuing for decades.
|
| - and more...
|
| HPC and scientific computing aside, I would have loved to be able to use AWS when I worked there; the internal infra for running web apps and services wasn't nearly as good and reliable, nor did it have a wide catalog of services on offer.
| betolink wrote:
| I think the spirit of the article is to put the cloud in the perspective of organization size and workload type. There is a sweet spot where the cloud is the only option that makes sense: with variable loads and the capacity to scale on demand as big as our budget allows, there is no match for it. However... there are organizations with certain types of workloads that could afford to put infrastructure in place, and even with the costs of staffing, energy, etc. they will save millions in the long run. NASA, CERN, etc. are some. This is not limited to HPC; the cloud at scale is not cheap either, see: https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...
| bluedino wrote:
| We have a 500-node cluster at a chemical company, and we've been experimenting with "hybrid cloud". This allows jobs to use servers with resources we just don't have, or couldn't add fast enough.
|
| Storage is a huge issue for us. We have a petabyte of local storage from a big-name vendor that's bursting at the seams and expensive to upgrade. A lot of our users leave big files lying around for a long time. Every few months we have to hound everyone to delete old stuff.
|
| The other thing that you get with the cloud is there's way more accountability for who's using how much of the resources. Right now we just let people have access and roam free. Cloud HPC is 5-10x more in cost, and the beancounters would shut shit down real quick if the actual costs were divvied up.
|
| We also still have a legacy datacenter, so in a similar vein, it's hard to say how much not having to deal with physical hardware/networking/power/bandwidth would be worth. Our work is maybe 1% of what that team does.
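The "hound everyone to delete old stuff" step above is easy to semi-automate. A minimal sketch - the scratch path, age, and size thresholds are made up for illustration:

    #!/usr/bin/env python3
    """List big, stale files grouped by owner (a sketch for a Unix filesystem)."""
    import os, pwd, time
    from collections import defaultdict

    ROOT = "/scratch"               # hypothetical shared filesystem
    MIN_SIZE = 10 * 1024**3         # 10 GiB
    MAX_AGE = 90 * 86400            # 90 days since last modification
    now = time.time()

    stale = defaultdict(list)
    for dirpath, _, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue            # vanished file, permissions, etc.
            if st.st_size >= MIN_SIZE and now - st.st_mtime > MAX_AGE:
                stale[st.st_uid].append(st.st_size)

    for uid, sizes in sorted(stale.items()):
        try:
            owner = pwd.getpwuid(uid).pw_name
        except KeyError:
            owner = str(uid)        # orphaned UID
        print(f"{owner}: {len(sizes)} stale files, {sum(sizes) / 1024**4:.2f} TiB")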
| adolph wrote:
| I can relate to these problems. Cloud brings positive accountability that is difficult to justify on-prem. I have some hope that higher-level tools for project/data/experiment management (as opposed to a bash prompt and a path) will bring some accountability without stifling flexibility.
| julienchastang wrote:
| I've also been skeptical of the commercial cloud for scientific computing workflows. I don't think this cost-benefit analysis mentions it, but the commercial cloud makes even less sense when you take into account brick-and-mortar considerations. In other words, if your company/institution has already paid for the machine rooms, sysadmins, networks, and the physical buildings, the commercial cloud is even less appealing. This is especially true with "persistent services" - for example, data servers that are always on because they handle real-time data.
|
| Another aspect of scientific computing on the commercial cloud that's a pain if you work in academia is procurement, or paying for the cloud. Academic groups are much more comfortable with the grant model. They often operate on shoestring budgets and are simply not comfortable entering a credit card number. You can also get commercial cloud grants, but they often lack long-term, multi-year continuity.
| mattkrause wrote:
| It's often not that they're "not comfortable"; it's that we're often flat-out not allowed to.
| Fomite wrote:
| This. It's got nothing to do with "comfort". I use cloud computing all the time in the rest of my life, but the rest of my life isn't subject to university policies and state regulations.
| ordiel wrote:
| Having worked for 2 of the largest cloud providers (1 of them being the largest), I have to say "The Cloud" just doesn't make sense yet for most use cases (maybe with the exception of cloud storage). This includes startups and small and mid-size companies: it's just way too expensive for the benefits it provides. It moves your hardware acquisition/maintenance cost to development costs; you just think it's better/cheaper because that cost comes in small monthly chunks rather than as a single bill. Plus you add all the security risks, either those introduced by the vendor or those introduced by the massive complexity and poor training of the developers - which, if you want to avoid them, you will have to pay for by hiring a developer competent in security for that particular cloud provider.
| manv1 wrote:
| Having worked in 3 startups that were AWS-first, I can say that you've learned the completely wrong lessons from your time at your cloud providers.
|
| Building on AWS has provided scale, security, and redundancy at a substantially lower cost than doing any on-prem solution (except for a shitty one strung together with lowendbox machines).
|
| The combined AWS bill for the three startups is less than the cost of an F5, even on a non-inflation-adjusted basis.
|
| The cloud doesn't mean that you can be totally clueless. I've had experience in HA/scalability/redundancy/deployment/development/networking/etc. It means that if you do know what you're doing, you can deliver a scalable HA solution at a ridiculously lower price point than a DIY solution using bare iron and colo.
| ordiel wrote:
| "The combined bill" during which time period?
|
| 1 month, for sure. What about 1 year? Also, did those companies have to provide any training or hiring to achieve that? Because you also need to add that to the cost comparison.
|
| If you are comparing a one-month bill against a one-time purchase (which, if it is chosen correctly, should not happen more than once every 10 years at the earliest), for sure it will be cheaper. When it comes down to scalability, development, and deployment, you should check your tech stack rather than your infrastructure. Kubernetes and containerization should easily take care of those with on-premises hardware while also reducing complexity, plus you will no longer have to worry about off-the-charts network transit fees.
| jerjerjer wrote:
| Sure? I mean, if you have:
|
| 1) A large enough queue of tasks
|
| 2) Users/downstream willing to wait
|
| using your own infrastructure always wins (assuming free labor), since you can load your own infrastructure to ~95% pretty much 24/7, which is unbeatable.
| mrweasel wrote:
| It might also depend on how long you're actually willing to wait. There's nothing stopping you from having a job queue in AWS, and you can set things up so that instances are only running if the price is low enough.
|
| Otherwise I completely agree; there might be some cases where the cost of labour means that you're better off running something in AWS, even if that requires someone to do the configuration as well.
| aschleck wrote:
| This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers".) Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientist time by making them wait in a queue when you can instead just autoscale as high as needed? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you're doing that on-prem then you need to provision for the peak load, which drives your utilization down and makes on-prem far less cost effective.
| lebovic wrote:
| For fast-moving researchers who are blocked by a queue, cloud computing still makes sense. I guess I wasn't clear enough in the last section about how I still use AWS for startup-scale computational biology. My scientific computing startup (trytoolchest.com) is 100% built on top of AWS.
|
| Most scientific computing still happens on supercomputers in slower-moving academic or big-company settings. That's the group for whom cloud computing - or at least running everything on the cloud - doesn't make sense.
| adolph wrote:
| Another service that runs on AWS is CodeOcean. It looks like Toolchest is oriented toward facilitating execution of specific packages rather than organization and execution like CodeOcean. Is that a fair summary?
|
| https://codeocean.com/explore
| lebovic wrote:
| Yep, that's right! Toolchest focuses on compute, deploying and optimizing popular scientific computing packages.
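The utilization argument running through this subthread reduces to a one-line break-even calculation. A sketch using the c5a.24xlarge and owned-server figures already quoted in the thread (illustrative, not negotiated prices; the result is very sensitive to what you assume the owned server really costs all-in):

    # At what utilization does owning beat renting?  Figures from this thread.
    HOURS_PER_MONTH = 730
    owned_monthly   = 200.0                      # article's all-in owned-server estimate
    on_demand_hr    = 2670.36 / HOURS_PER_MONTH  # quoted on-demand monthly price
    spot_hr         = 1.5546                     # cheapest spot price cited above

    for label, rate in [("on-demand", on_demand_hr), ("spot", spot_hr)]:
        breakeven = owned_monthly / (rate * HOURS_PER_MONTH)
        print(f"vs {label:>9}: owning wins above ~{breakeven:.0%} utilization")
    # vs on-demand: owning wins above ~7% utilization
    # vs      spot: owning wins above ~18% utilization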
| secabeen wrote:
| Generally, scientists aren't blocked while they are waiting on a computational queue. The results of a computation are needed eventually, but there is lots of other work that can be done that doesn't depend on a specific calculation.
| jefftk wrote:
| It's good to learn how not to be blocked on long-running calculations.
|
| On the other hand, if transitioning to a bursty cloud model means you can do your full run in hours instead of weeks, that has real impact on how many iterations you can do and often does appreciably affect velocity.
| secabeen wrote:
| It can, if you have the technical ability to write code that can leverage the scale-out that most bursty-cloud solutions entail. Coding for clustering can be pretty challenging, and I would generally recommend a user target a single large system with a job that takes a week over trying to adapt that job to a clustered solution of 100 smaller systems that can complete it in 8 hours.
| Fomite wrote:
| This is a big part of it. In my lab, I have a lot of grad students who are _computational_ scientists, not computer scientists. The time it would take them to optimize code far exceeds the time for a quick-and-dirty job array on Slurm, after which they can go back to working on the introduction of the paper, or catching up on the literature, or any one of a dozen other things.
| secabeen wrote:
| The general rule of thumb in the HPC world is that if you can keep a system computing for more than 40% of the time, it will be cheaper to buy.
| tejtm wrote:
| Cloud never has made sense for scientific computing. Renting someone else's big computer makes good sense in a business setting, where you are not paying for your peak capacity when you are not using it, and you are not losing revenue by underestimating whatever peak capacity the market happens to dictate.
|
| For business, outsourcing the compute cost center eliminates both cost and risk for a big win each quarter.
|
| Scientists never say, "Gee, it isn't the holiday season, guess we'd better scale things back."
|
| Instead they will always tend to push whatever compute limit there is; it is kinda in the job description.
|
| As for the grant argument, that is letting the tool shape the hand.
|
| Business-science is not science; we will pay now or pay later.
| aBioGuy wrote:
| Furthermore, scientific computing often (usually?) involves trainees. It can be difficult to train people when small mistakes can lead to five-figure bills.
| Moissanite wrote:
| This is the biggest unaddressed problem, IMO. Getting more scientific computing done in the cloud is where we are inevitably trending, but no one yet has a good answer for completely ad-hoc, low-value experimentation and skill building in the cloud. I see universities needing to maintain clusters to allow PhDs and postdocs to develop their computing skills for a good while yet.
| avereveard wrote:
| > Hardware is amortized over five years
|
| Hardware running at 100% won't last five years.
|
| If the hardware doesn't need to run at 100% full steam for five years, you can turn down instances on the cloud and you don't pay anything.
|
| In 2 years you'll be stuck with the same hardware, while on the cloud you follow CPU evolution as it arrives at the provider.
|
| All in all, the comparison is too high-level to be useful.
| e63f67dd-065b wrote:
| > hardware running 100% won't last five years
|
| Five years is a pretty typical amortisation schedule for HPC hardware. During my sysadmin days, of CPU, memory, cooling, power, storage, and networking, the only things that broke were hard disks and a few cooling fans. Disks were replaced by just grabbing a spare and slotting it in, and fans were replaced by, well, swapping them out.
|
| Modern CPUs and memory last a very long time. I think I remember seeing Ivy Bridge CPUs running in Hetzner servers in a video they put out, and they're still fine.
| avereveard wrote:
| If you expect downtime over the 5 years to replace fans and whatnot, you're not getting 100% of your money/perf back - and I didn't see that in the article.
|
| If you have spares, the value lost to downtime stays minimal, but you have to include the spares in the expenses. If you don't have spares, a 1-2 day downtime is going to be a decent hit to value.
| davidmr wrote:
| I'm not sure I understand what you mean. I've run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500-node cluster are down for a few days while you wait for RMA parts to arrive, you haven't lost much value. Your cluster is still functioning at nearly peak capacity.
|
| You have a handful of nodes that the cluster can't function without (scheduler, fileservers, etc.), but you buy spares and 24x7 contracts for those nodes.
|
| Did I misunderstand your comment?
| icedchai wrote:
| I think you underestimate how long modern hardware can last. I have 8-to-12-year-old PCs running non-stop, in a musty and damp basement.
| avereveard wrote:
| They don't just die: thermal paste dries up, fans gum up. The GPU will live, but thermal throttling will mean it runs at, say, 80%.
| aflag wrote:
| I've worked with a YARN cluster with around 200 nodes which ran non-stop for well over 5 years and is still kicking. There were a handful of failures and replacements, but I'd say 95% of the cluster was fine 7 years in.
| walnutclosefarm wrote:
| Having had the responsibility of providing HPC for literal buildings full of scientists, I can say that it may be true that you can get computation cheaper with owned hardware than in the cloud. Certainly pay-as-you-go, one-project-at-a-time processing will look that way to the scientist. But I can also say with confidence that the contest is far closer than they think. Scientists who make this argument almost invariably leave major costs out of their calculation - assuming they can put their servers in a closet, maintain them themselves, do all the security infrastructure, provide redundancy, and still get to shared compute when they have an overflow need. When the closet starts to smoke because they stuffed it with too many cheaply sourced, hot-running cores and GPUs, or gets hacked by one of their postdocs resulting in an institutional HIPAA violation, well, that's not their fault.
|
| Put like for like in a well-managed data center against negotiated and planned cloud services, and the former may still win, but it won't be dramatically cheaper - and figured over depreciable lifetime and including opportunity cost, it may cost more. It takes work to figure out which is true.
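A sketch of that like-for-like, depreciation-lifetime comparison, following the amortization approach of the article's footnote [3] quoted just below (every number here is an assumption to swap for your own; the hardware price is simply back-derived from the footnote's $67.08/mo over five years):

    # Monthly all-in cost of an owned compute server, footnote-[3] style.
    hardware_price  = 4025.0    # capex, back-derived from $67.08/mo over 5 years
    amortization_mo = 60        # five-year depreciation
    power_kw        = 0.20      # assumed average draw at full load
    electricity_kwh = 0.23      # $/kWh, the "high electricity prices" figure
    overhead_factor = 2.0       # cooling, space, spares, share of a sysadmin

    hardware_mo = hardware_price / amortization_mo        # ~ $67/mo
    power_mo    = power_kw * 730 * electricity_kwh        # ~ $34/mo
    total_mo    = (hardware_mo + power_mo) * overhead_factor

    print(f"hardware ${hardware_mo:6.2f}/mo, power ${power_mo:6.2f}/mo, "
          f"all-in ~${total_mo:.0f}/mo")                  # ~ $200/mo here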
| pbronez wrote:
| The article estimated:
|
|   Running a modern AMD-based server that has 48 cores, at least 192 GB of RAM, and no included disk space costs:
|     ~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
|     ~$1014.70/mo for a c5a.24xlarge AWS reserved instance on a three-year term, paid upfront
|     ~$558.65/mo on OVH Cloud[1]
|     ~$512.92/mo on Hetzner[2]
|     ~$200/mo on your own infrastructure as a large institution[3]
|
| Footnote [3] explains this cost estimate as:
|
|   "Assumes an AMD EPYC 7552 run at 100% load in Boston with high electricity prices of $0.23/kWh, for $33.24/mo in raw power. Hardware is amortized over five years, for an average monthly price of $67.08/mo. We assume that your large institution already has 24/7 security and public internet bandwidth, but multiply base hardware and power costs by 2x to account for other hardware, cooling, physical space, and a half-a-$120k-sysadmin amortized across 100 servers."
| jacobr1 wrote:
| It also assumes full utilization of the hardware. If you have a variable load (such as only needing to run compute after an experiment), the overhead cost of maintaining a cluster you don't need all the time probably compares poorly with resources you can schedule on-demand.
| duxup wrote:
| When I worked as a network engineer I spent months working with some great scientists / their team, who built a crazy microscope (I assumed it was looking at atoms or something...) the size of a small building.
|
| Their budget for the network gear was a couple hundred bucks and some old garbage consumer-grade network gear. This was for something that spit out 10s of GB a second (at least) across a ton of network connections (they didn't seem to know what would even happen when they ran it), and was so bursty that all but the highest end of gear would choke on it.
|
| Can confirm sometimes scientists aren't really up on the overall costs. Then they dump it ("this isn't working") on their university IT team to absorb the costs / manpower costs.
| ipaddr wrote:
| You are paying 10x more because no one gets fired for using IBM. AWS has many benefits, most of which you don't need. Pair up with another school in a different region and back up data. Computers are not scary; they rarely catch fire.
| whatever1 wrote:
| Nah, for us it was the department IT guy who set everything up once (a full cluster of 50 R720s), and it works like a dream.
|
| Properly provisioned Linux machines need no maintenance. You drive them until there is a hardware failure.
| mangecoeur wrote:
| I've been running a group server (basically a shared workstation) for 5 years and it's been great. Way cheaper than cloud, no worrying about national rules on where data can be stored, no waiting in a SLURM batch queue, Jupyter notebooks on tap for everyone. A single ~$6k outlay (we don't need GPUs, which helps).
|
| Classic big workstations are way more capable than people think - but at the same time it's hard to justify buying one machine per user unless your department is swimming in money. Also, academic budgets tend to come in fixed chunks, and university IT departments may not have your particular group as a priority - so often it's just better to invest once in a standalone server tower that you can set up to do exactly what you need than to try to get IT to support your needs or the accounting department to pay recurring AWS bills.
| killingtime74 wrote:
| Aren't you talking about 1 server when this is talking about HPC?
| mangecoeur wrote: | Well, the title is scientific computing, which includes HPC | but isn't limited to it. Anyway, the fact is that a lot of "HPC" in | university clusters is smaller jobs that are too much for | an average PC to handle, but still fit into a single | typical HPC node. These are usually the jobs that people | think to farm out to AWS, but that you will generally find | are cheaper, faster, and more reliable if you just run them | on your own hardware. | [deleted] | forgomonika wrote: | This nails so much of the discussion that should be had. When | using any cloud service provider, you aren't just paying for | the machines/hardware you use - you are paying for people to | take care of a bunch of headaches of having to maintain this | hardware. It's incredibly easy to overlook this aspect of costs | and really easy to oversimplify what's involved if you don't | know how these things actually work. | prpl wrote: | The things that tend to be "cheap" on campuses: | | Power (especially if there is some kind of significant | scientific facility on premises), space (especially in reused | buildings), manpower (undergrads, grad students, post docs, | professional post graduates), running old/reused hardware, | etc... | | You can get away with those at large research universities. | Some of that you can get away with at national lab sorts of | places (not going to find as much free/cheap labor or surplus | hardware). If you start going down in scale/prestige, etc... | none of that holds true. | | Running a bunch of hardware from the surplus store in a closet | somewhere with Lasko fans taped to the door is cheap. To some | extent, the university system encourages such subsidies. | | In any case, once you get to actually building a datacenter, if | you have to factor in power, a 4-year hardware refresh | cycle, professional staffing, etc... unless you are in one of | those low-CoL college towns - cloud is probably no more than | 1.5 to 3x more expensive for compute (spot, etc...). Storage on | prem is much cheaper - erasure-coded storage systems are cheap | to buy and run, and everybody wants their own high-performance | file system. | | One continuing cloud obstacle, though - researchers don't want | to spend their time figuring out how to make their code friendly | to preemptible VMs - which is the cost-effective way to run on | cloud. | | Another real issue with sticking to on-prem HPC is talent | acquisition and staff development. When you don't care about | those things so much, it's easy to say it's cheap to run on-prem, | but often the pay is crap for the required expertise, and | ignoring cloud doesn't help your staff either. | W-Stool wrote: | Let me echo this as someone who once was responsible for HPC | computing in a research-intensive public university. Most | career academics have NO IDEA how much enterprise computing | infrastructure costs. If a 1 terabyte USB hard drive is $40 at | Costco, we (university IT) must be getting a much better deal | than that. Take this argument and apply it to any aspect of HPC | computing and that's what you're fighting against. The closet | with racks of gear and no cooling is another fond memory. Don't | forget the AC terminal strips that power the whole thing, | sourced from the local dollar store. | bluedino wrote: | It's kind of funny around this time of year when some | researchers have $10,000 in their budget they need to spend, | and they want to 'gift' us with some GPUs.
| davidmr wrote: | That was definitely one of the weirdest things about working | in academia IT: "Hey, can you buy me a workstation that's | as close to $6,328.45 as it is possible to get, and can you | do it by 4pm?" | systemvoltage wrote: | I am dealing with the exact opposite problem: "Oh, you mean | we should leave the EC2 instance running _24/7_??? No way, | that would be too expensive"... to which I need to respond, | "No, it would be like $15/month. Trivial, stop worrying about | costs in EC2 and S3, we're like 7 people here with 3 GB of | data." | | I deal with scientists that think AWS is some sort of a | massively expensive enterprise thing. It can be, but not for | the use case they're going to be embarking on. Our budget is | $7M spanning 4 years. | capableweb wrote: | > think AWS is some sort of a massively expensive | enterprise thing | | Compared to using dedicated instances with way cheaper | bandwidth, storage and compute power, it might as well be. | | Cloud makes sense when you have to scale up/down very | quickly, or you'd be losing money fast. But most don't | suffer from this problem. | gonzo41 wrote: | Don't say the budget out loud near AWS. They'll find a way | to help you spend it. | systemvoltage wrote: | Hahaha, maybe I need to just go into the AWS ether and | start yakking big words like "Elastic Kubernetes Service" | to confuse the scientists and get my AWS fix. These | people are too stingy. I want some shit running in AWS; | what good is this admin IAM role? | 0xbadcafebee wrote: | I remember the first time a server caught fire in the closet | we kept the rack in. Backups were kept on a server right | below the one on fire. But, y'know, we saved money. | eastbound wrote: | Don't worry, we do incremental backups during weekdays and | a full backup on Sunday. We use 2 tapes only, so one is | always outside of the building. But you know, we saved | money. | [deleted] | treeman79 wrote: | We had a million dollars' worth of hardware installed in a | closet. It had a portable AC hooked up that needed its | water bin changed every so often. | | Well, I was in the middle of that when the Director | decided to show off the new security doors. So he closed | the room up. Then he found out that the new security doors | didn't work. I find out as I'm coming back to turn the AC | back on. The room will get hot really fast. | | We get office Security to unlock the door. He says he doesn't | have authority. His supervisor will be by later in the | day. | | Completely deadpan, and in front of several VPs of a | Fortune 50. | | I turn to the guy to my right who lived nearby. "Go home and | get your chainsaw." | | We were quickly let in. Also got fast approval to install | proper cooling. | bilbo0s wrote: | A bit off topic, but I gotta say you guys are a riot! | | If there was a comedy tour for IT/programmer types, I'd | pay to see you guys in it. | | Best thing about your stuff is that it's literally all | funny precisely because it's all true. | [deleted] | [deleted] | Proven wrote: | pbronez wrote: | This is my fear about my homelab lol | | Fire extinguisher nearby, smart temp sensors, but still... | rovr138 wrote: | Oh, nice idea with the temp sensor. | | I have extinguishers all over the house, but hadn't | considered a temperature sensor set to send alerts. | | Do you have any recommendations? | W-Stool wrote: | What are you using for a homelab-priced temperature | sensor? | xani_ wrote: | Homelab-priced sensor is the temp sensor in your server; | it's free!
Actual servers have a bunch, usually including one | at the intake; "random old PC" servers can use motherboard | temp as a rough proxy for environment temp. | | Hell, even in a DC you can look at temperatures and see in | front of which server a technician was standing, just by | those sensors. | | Second cheapest would be a USB-to-1wire module + some | DS18B20 1-wire sensors. Easy hobby job to make. They also | come with a unique ID, which means that if you record them in a TSDB by | that ID, it doesn't matter where you plug those sensors in. | COGlory wrote: | >the security infrastructure, provide redundancy and still get | to shared compute when they have an overflow need | | The article points out that this is mostly not necessary for | scientific computing. | jrumbut wrote: | Which I thought was the best point of the article, that a lot | of IT best practice comes from the web app world. | | Web apps quickly become finely tuned factory machines, | executing a million times a day and being duplicated | thousands of times. | | Scientific computing projects are often more like workshops. | Getting charged by the second while you're sitting at a | console trying to figure out what this giant blob you were | sent even is, is unpleasant. The solution you create is most | likely to be run exactly once. If it is a big hit, it may be | run a dozen times. | | Trying to run scientific workloads on the cloud is like | trying to put a human shoe on a horse. It might be possible, | but it's clearly not designed for that purpose. | onetimeusename wrote: | Is a postdoc hacking a cluster something you have seen before? | I am genuinely curious because I worked on a cluster owned by | my university as an undergrad and everyone was kind of assumed | to be trusted. If you had shell access on the main node you | could run any job you wanted on the cluster. You could enhance | security; I just wonder about this threat model, that's an | interesting one. I am sure it happens, to be clear. | ptero wrote: | I think it really depends on the task. Where a HIPAA violation is | a real threat, the equation changes. And just for CYA purposes | those projects can get pushed to a cloud. Which does not | necessarily involve any attempts to make them any more secure, | but this is a different topic. | | That said, many scientists _are_ operating on-premise hardware | like this: some servers in a shared rack and an el-cheapo | storage solution with ssh access for people working in the | lab. And it works just fine for them. | | Cloud services focus on running _business_ computing in a | cloud, emphasizing recurring revenue. Most research labs are | _much_ more comfortable with spending the hardware portion of a | grant upfront and not worrying about some student who, instead | of working on some fluid dynamics problem, found a script to | re-train a Stable Diffusion model and left it running over winter break. | My 2c. | secabeen wrote: | Thankfully, only a small part of the academic research | enterprise involves human subjects, HIPAA, and all that. | Neither fruit flies nor quarks have privacy rights. | dmicah wrote: | Research involving human subjects (psychology, cognitive | neuroscience, behavioral economics, etc.) requires | institutional review board approval and informed consent, | etc., but mostly doesn't involve HIPAA either. | charcircuit wrote: | That is not a law. | icedchai wrote: | There are actually laws around such things.
You can read | about them here: https://www.hhs.gov/ohrp/index.html | Fomite wrote: | And many, many institutions are overcautious. My own | university, for example, has no data classification | between "It would be totally okay if anyone in the | university has access" and "Regulated data", so "I mean, | it's health information, and it's governed by our data | use agreement with the provider..." gets it kicked to the | same level as full-fat HIPAA data. | crazygringo wrote: | > _And it works just fine for them._ | | Until it doesn't, because there's a fire or huge power surge | or whatever. | | That's the point -- there's a lot of risk they're not taking | into account, and by focusing on the "it works just fine for | them", you're cherry-picking the ones that didn't suffer | disaster. | horsawlarway wrote: | I'd counter by saying I think you're over-estimating how | valuable mitigating that risk is to this crowd. | | I'd further say that you're probably over-estimating how | valuable mitigating that risk is to _anyone_ , although | there is a limited set of customers that genuinely do | care. | | There are few places I can think of that would benefit more | from avoiding cloud costs than scientific computing... | | They often have limited budgets that are driven by grants, | not derived from providing online services (a computer going | down does not impact the bottom line). | | They have real computation needs that mean hardware is | unlikely to sit idle. | | There is no compelling reason to "scale" in the way that a | company might need to in order to handle additional | unexpected load from customers or hit marketing campaigns. | | Basically... the _only_ meaningful offering from the cloud | is likely preventing data loss, and this can be done fairly | well with a simple backup strategy. | | Again - they aren't a business where losing a few | hours/days of customer data is potentially business-ending. | | --- | | And to be blunt - I can make the same risk-avoidance claims | about a lot of things that would simply get me laughed out | of the room. | | "The lead researcher shouldn't be allowed in a car because | it might crash!" | | "The lab work must be done in a bomb shelter in case of war | or tornados!" | | "No one on the team can eat red meat because it increases | the risk of heart attack!" | | and on and on and on... Simply saying "There's risk" is not | sufficient - you must still make a compelling argument that | the cost of avoiding that risk is justified, and you're not | doing that. | billythemaniam wrote: | The counterpoint to that point is that a significant | percentage of scientific computing doesn't care about any | of that. They are unlikely to have enough hardware to cause | a fire, and they don't care about outages or even data loss | in many cases. As others have said, it depends on the | specifics of the research. In the cases where that stuff | matters, the cloud would be the better option. | Fomite wrote: | This. If my lab-level server failed tomorrow, I'd be | annoyed, order another one, and start the simulations | again. | vjk800 wrote: | The point is, there's no need for everything to be 100% | reliable in this context. If a fire destroys everything and | their computational resources are unavailable for a few | days, that's somewhat okay. Not ideal, but not a | catastrophic loss either. Even data loss is not catastrophic | - at worst it means redoing one or two weeks' worth of | computations. | | Some sort of 80/20 principle is at work here.
Most of the | cost in professional cloud solutions comes from making the | infrastructure 99.99% reliable instead of 99% reliable. It | is totally worth it if you have millions of customers that | expect a certain level of reliability, but complete | overkill if the worst-case scenario from a system failure | is some graduate student having to redo a few days' worth of | computations (which probably had to be redone several times | anyway because of some bug in the code or something). | kijin wrote: | Even that depends on what you're doing. Most scientists | aren't running apps that require several 9's of | availability, connect to an irreplaceable customer | database, etc. | | An outage, or even permanent loss of hardware, might not be | a big problem if you're running easily repeatable | computations on data of which you have multiple copies. At | worst, you might have to copy some data from an external | hard drive and redo a few weeks' worth of computations. | withinboredom wrote: | Ummm. I've def been unable to do anything for entire days | because our AWS region went down and we had to rebuild the | database from scratch. AWS goes down, you twiddle your | thumbs, and the people you report to are going to be asking | why, for how long, etc., and you can't give them an answer | until AWS comes back to see how fubar things are. | | When your own hardware rack goes down, you know the | problem, how much it costs to fix it, and when it will come | back up; usually within a few hours (or minutes) of it | going down. | | Do things catch fire? Yes. But I think you're over-estimating | how often. In my entire life, I've had a single | SATA connector catch fire, and it just melted plastic before | going out. | crazygringo wrote: | I'm not talking about temporary outages, I'm talking about | data loss. | | With AWS it's extremely easy to keep an up-to-date | database backup in a different region. | | And it's great that you haven't personally encountered | disaster, but of course once again that's cherry-picking. | And it's not just a component overheating, it's the whole | closet on fire, it's a broken ceiling sprinkler system | going off, it's a hurricane, it's whatever. | withinboredom wrote: | I was also talking about data loss. Not everything can | be replicated, but backups can be, and were made. | | For the rest, there's insurance. Most calculations done | in a research setting are dependent upon that research | surviving. If there's a fire and the whole building goes | down, those calculations are probably worthless now too. | | Hell, most companies probably can't survive their own | building/factory burning down. | FpUser wrote: | >"With AWS it's extremely easy to keep an up-to-date | database backup in a different region" | | It is just as extremely easy on Hetzner or on premises. | monkmartinez wrote: | I would say even easier on prem, as you don't need to wade | 15 layers deep to do anything. Since I have moved to | hosting my own stuff at my house, I have learned that | connecting a monitor and keyboard to a 'server' is awesome | for productivity. I know where everything is, it's fast as | hell, and everything is locked down. Monitoring temps, | adjusting and configuring hardware is just better in | every imaginable way. Need more RAM, storage, compute? | Slap those puppies in there and send it. | | For home gamers like myself, it has become a no-brainer | with advances in tunneling, Docker, and cheap prices on | eBay.
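On the temperature-monitoring tangent above (the homelab temp sensors and xani_'s DS18B20 suggestion): on Linux, 1-wire DS18B20 thermometers show up under /sys/bus/w1/devices once the w1-therm driver is loaded, keyed by their unique IDs. A minimal polling sketch, assuming that sysfs layout; the output format is just an illustration, while keying on the sensor ID follows the suggestion in the thread:

```python
# Minimal DS18B20 poller: read every 1-wire thermometer under sysfs and print
# "<sensor-id> <temp-C>" lines, e.g. for piping into a time-series database.
# Assumes the w1-therm kernel driver is loaded; DS18B20 IDs start with "28-".
from pathlib import Path

W1_ROOT = Path("/sys/bus/w1/devices")

def read_ds18b20(dev: Path) -> float | None:
    """Return the temperature in Celsius, or None if the CRC check failed."""
    lines = (dev / "w1_slave").read_text().splitlines()
    if len(lines) < 2 or not lines[0].strip().endswith("YES"):  # CRC line ends in YES/NO
        return None
    return int(lines[1].rsplit("t=", 1)[-1]) / 1000.0           # t= is millidegrees C

if __name__ == "__main__":
    for dev in sorted(W1_ROOT.glob("28-*")):   # family code 28 = DS18B20
        temp = read_ds18b20(dev)
        if temp is not None:
            print(f"{dev.name} {temp:.3f}")
```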
| ptero wrote: | > there's a lot of risk they're not taking into account | | I see it the other way: experimental scientists operate | with unreliable systems all the time: fickle systems, | soldered one-time setups, shared lab space, etc. Computing | is just one more thing that is not 100% reliable (but way | more reliable than some other equipment), and USB data | sticks serve as a good-enough data backup. | mangecoeur wrote: | Or your university might have its own backup system. We | have a massive central tape-based archive that you can | run nightly backups to. | noobermin wrote: | Maybe consider that your use case and the average | scientist's use case aren't the same? What works for you | won't work for them and vice versa? What you consider a | risk, I wouldn't? | | Consider the following: I have never considered applying | Meltdown or Spectre mitigations if they make my code run | slower, because I plain don't care. Assuming anyone even | peeks at what my simulations are doing, whoopdeedo, I don't | care. I won't do that on my laptop I use to buy shit off | Amazon with, but the workstation I have control of? I don't | care. I DO care if my simulation will take 10 days instead | of a week. | | My use case isn't yours because my needs aren't yours. Not | everything maps across domains. | insane_dreamer wrote: | Plus, the supposed savings of in-house hardware only materialize | if you have sufficiently managed and queued load to keep your | servers running at 100%, 24/7. The advantage of AWS/other is to | be able to acquire the necessary amount of compute power for | the duration that you need it. | | For a large university it probably makes sense to have and | manage their own compute infrastructure (cheap post-doc labor, | ftw!), but for smaller outfits, AWS can make a lot of sense for | scientific computing (said as someone who uses AWS for | scientific computing), especially if you have fluctuating | loads. | | What works best IMO (and what we do) is to have a minimum-to-moderate | amount of compute resources in house that can satisfy | the processing jobs most commonly run (and where you haven't | had to overinvest in hardware), and then switch to AWS/other | for heavier loads that run for a finite period. | | Another problem with in-house hardware is that you spent all | that money on Nvidia V100s a few years ago and now there's the | A100 that blows it away, but you can't just switch and take | advantage of it without another huge capital investment. | secabeen wrote: | They leave out major costs because they don't pay those costs. | Power, cooling, and real estate are all significant drivers of AWS | costs. Researchers don't pay those costs directly. The | university does, sure, but to the researcher, that means those | costs are pre-paid. Going to AWS means you're essentially | paying for those costs twice, plus all the profit margin and | availability that AWS provides that you also don't need. | fwip wrote: | The killer we've seen is data egress costs. Crunching the numbers | for some of our pipelines, we'd actually be paying more to get | the data out of AWS than to compute it. | bhewes wrote: | Data movement has become the number one cost in system builds, | energy-wise. | boldlybold wrote: | As in, the networking equipment consumes the most energy? | Given the 30x markup on AWS egress I'm inclined to say it's | more about incentives and marketing, but I'd love to learn | otherwise. | pclmulqdq wrote: | Even as a big cloud detractor, I have to disagree with this.
| | A lot of scientific computing doesn't need a persistent data | center, since you are running a ton of simulations that only take | a week or so, and scientific computing centers at big | universities are a big expense that isn't always well utilized. | Also, when they are full, jobs can wait weeks to run. | | These computing centers have fairly high overhead, too, although | some of that is absorbed by the university/nonprofit that runs | them. It is entirely possible that this dynamic, where | universities pay some of the cost out of your grant overhead, | makes these computing centers synthetically cheaper for | researchers when they are actually more expensive. | | One other issue here is that scientific computing really benefits | from ultra-low-latency InfiniBand networks, and the cloud | providers offer something more similar to a virtualized RoCE | system, which is a lot slower. That means accounting for cloud | servers potentially being slower core-for-core. | davidmr wrote: | This is tangential to your point, but I'll just mention that | Azure has some properly specced-out HPC gear: IB, FPGAs, the | works. You used to be able to get time on a Cray XC with an | Aries interconnect, but I never have occasion to use it, so I | don't know if you still can. They've been aggressively hiring | top-notch HPC people for a while. | lebovic wrote: | Author here. I agree with your points! I use AWS for a | computational biology company I'm working on. A lot of | scientific computing can spin up and down within a couple hours | on AWS and benefits from fast turnaround. Most academic HPCs | (by # of clusters) are slower than a mega-cluster on AWS, not | well utilized, and have a lot of bureaucratic process. | | That said, most of scientific computing (by % of total compute) | happens in a different context. There's often a physical | machine within the organization that's creating data (e.g. a | DNA sequencer, particle accelerator, etc), and a well-maintained | HPC cluster that analyzes that data. The researchers | have already waited months for their data, so another couple | weeks in a queue doesn't impact their cycle. | | For that context, AWS doesn't really make sense. I do think | there's room for a cloud provider that's geared towards an HPC | use-case, and doesn't have the app-inspired limits (e.g. data | transfer) like AWS, GCP, and Azure. | hellodanylo wrote: | [retracted] | Marazan wrote: | It says 0.09 per GB on that page. | philipkglass wrote: | Where do you see that? On your link I see: Data | Transfer OUT From Amazon EC2 To Internet -- First 10 TB / | Month: $0.09 per GB; Next 40 TB / Month: $0.085 per GB; | Next 100 TB / Month: $0.07 per GB; Greater than 150 TB / | Month: $0.05 per GB. | | Which means if you transfer out 90 TB in one month, it's $0.09 | * 10000 + $0.085 * 40000 + $0.07 * 40000 = $7100. | hellodanylo wrote: | Sorry, you are right. I need another coffee today. | xani_ wrote: | It always was that way for load that doesn't allow for autoscaling to save | you money; the savings were always from the convenience of not having | to do ops and pay for ops. | | Then again, part of the ops cost you save is paid again in the | salaries of devs who have to deal with AWS stuff instead of just throwing | a blob of binaries and letting ops worry about the rest. | citizenpaul wrote: | No one seems to even consider colo data centers as an | option anymore?
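Tiered pricing like this is easy to misread, as the exchange above shows, and philipkglass's arithmetic generalizes to a small helper. A minimal sketch, using the tier boundaries quoted in that comment with decimal TB (1 TB = 1,000 GB), which is how the $7,100 figure for 90 TB works out; the actual AWS rate card varies by region and over time:

```python
# EC2-to-internet egress estimate using the tiered rates quoted above.
# Each tier is (tier size in GB, $/GB); None = everything past the last boundary.
TIERS = [
    (10_000, 0.09),    # first 10 TB / month
    (40_000, 0.085),   # next 40 TB
    (100_000, 0.07),   # next 100 TB
    (None, 0.05),      # greater than 150 TB
]

def egress_cost(gb: float) -> float:
    cost, remaining = 0.0, gb
    for size, rate in TIERS:
        chunk = remaining if size is None else min(remaining, size)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return cost

print(egress_cost(90_000))   # 90 TB out -> 7100.0, matching the figure above
```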
| remram wrote: | My university owns hardware in multiple locations, plus uses | hardware in a colocation, and still uses the cloud for | bursting (overflow). You can't beat the provisioning time of | cloud providers, which is measured in seconds. | zatarc wrote: | Why does no one consider colocation services anymore? | | And why do people only know Hetzner, OVH and Linode as | alternatives to the big cloud providers? | | There are so many good and inexpensive server hosting providers, | some with decades of experience. | lostmsu wrote: | Any particular one you could recommend for GPU? | zatarc wrote: | I'm not in a position to recommend (or not) a particular | provider for GPU-equipped servers, simply because I've never | had the need for GPUs. | | My first thought was related to colocation services. From | what I understand, a lot of people avoid on-premise/in-house | solutions because they don't want to deal with server rooms, | redundant power, redundant networks, etc. | | So people go to the cloud and pay horrendous prices there. | | Why not take a middle path? Build your own custom server with | your preferred hardware and put it in a colocation facility. | dkobran wrote: | There are several tier-two clouds that offer GPUs, but I think | they generally fall prey to many of the same issues | you'll find with AWS. There is a new generation of | accelerator-native clouds, e.g. Paperspace | (https://paperspace.com), that cater specifically to HPC, AI, | etc. workloads. The main differentiators are: a much larger | GPU catalog; support for new accelerators, e.g. Graphcore | IPUs; and a different pricing structure that addresses problematic | areas for HPC such as egress. | | However, one of the most important differences is the _lack_ | of unrelated web-services components that pose a | major distraction/headache to users that don't have a DevOps | background (which AWS obviously caters to). AWS can be | incredibly complicated. Simple tasks are encumbered by a | whole host of unrelated options/capabilities, and the learning | curve is very steep. A platform that is specifically designed | to serve the scientific computing audience can be much more | streamlined and user-friendly for this audience. | | Disclosure: I work on Paperspace. | latchkey wrote: | Coreweave. I know the CTO. They are doing great work over | there. | | https://www.coreweave.com | sabalaba wrote: | Lambda GPU Cloud has the cheapest A100s of that group. | https://lambdalabs.com/service/gpu-cloud | | Lambda A100s - $1.10 / hr; Paperspace A100s - $3.09 / hr; | Genesis - no A100s, but their 3090 (1/2 the speed of | an A100) is $1.30 / hr. | lostmsu wrote: | That's still way too expensive. The 3090 is less than 2x of the | monthly cost in Genesis. The A100 is priced better here. | tryauuum wrote: | datacrunch.io has some 80G A100s | theblazehen wrote: | https://www.genesiscloud.com/ is pretty decent | snorkel wrote: | Buying your own fleet of dedicated servers seems like a smart | move in the short term, but then five years from now you'll get | someone on the team insisting that they need the latest, greatest | GPU to run their jobs. Cloud providers give you the option of | using newer chipsets without having to re-purchase your entire | server fleet every five years. | lebovic wrote: | In HPC land, most hardware is amortized over five years and | then replaced! If you keep your servers in service for five years | at high utilization, you're doing great.
| | For example, the Blue Waters supercomputer at UIUC was | originally expected to last five years, although they kept it | in service for nine; it was considered a success: | https://www.ncsa.illinois.edu/historic-blue-waters-supercomp... | adamsb6 wrote: | I've never worked in this space, but I'm curious about the need | for massive egress. What's driving the need to bring all that | data back to the institution? | | Could whatever actions have to be performed on the data also be | performed in AWS? | | Also, while briefly looking into this I found that AWS has an | egress waiver for researchers and educational institutions: | https://aws.amazon.com/blogs/publicsector/data-egress-waiver... | COGlory wrote: | Well, for starters, if you are NIH or NSF funded, they have data | storage requirements you must meet. So usually this involves | something like tape backups in two locations. | | The other is for reproducibility - typically you need to | preserve lots of in-between steps for peer review and for proving | that you aren't making things up. Some intermediary data is | wiped out, but usually only if it can be quickly and easily | regenerated. | jpeloquin wrote: | Regarding the waiver--"The maximum discount is 15 percent of | total monthly spending on AWS services". Was very excited at | first. | | As for leaving data in AWS, data is often (not always) | revisited repeatedly for years after the fact. If new questions | are raised about the results, it's often much easier to check | the output than to rerun the analysis. And cloud storage is not | cheap. But yes, it sometimes makes sense to egress only summary | statistics and discard the raw data. | [deleted] ___________________________________________________________________ (page generated 2022-10-07 23:00 UTC)