[HN Gopher] Scaling to 100k Users
       ___________________________________________________________________
        
       Scaling to 100k Users
        
       Author : sckwishy
       Score  : 280 points
       Date   : 2020-02-05 16:29 UTC (6 hours ago)
        
 (HTM) web link (alexpareto.com)
 (TXT) w3m dump (alexpareto.com)
        
       | majkinetor wrote:
       | This is relevant only for multimedia apps.
       | 
        | I have fintech systems in production with 100k+ users,
        | including a complex government app for an entire country, that
        | run on commodity hardware (the majority of the work is done by
        | 1 backend server and 1 database server, with all reporting
        | handled by 1 reporting backend server using the same db). Based
        | on our Grafana metrics it could survive 10x the number of users
        | without an upgrade of any kind. It runs on Linux, .NET Core and
        | SQL Server.
       | 
       | Most of the software is not multimedia in nature and those
       | numbers are off the charts for such systems.
        
         | rlander wrote:
         | I'll bite. What's a "fintech system" with "complex Gov app"?
        
           | jermaustin1 wrote:
            | Looking at his CV [1], it looks like he's done a lot of work
            | inside the Serbian Treasury.
           | 
           | 1: https://gist.github.com/majkinetor/877d5174ba322fbb808cc47
           | a8...
        
         | Kiro wrote:
         | Yeah, it really depends on the kind of app. My app with a
         | couple of hundred thousand users runs on a single $5 Digital
         | Ocean droplet with standard PHP and MySQL.
        
           | leeoniya wrote:
           | > My app with a couple of hundred thousand users
           | 
           | per year? per month? per day? simultaneously? doing what?
           | 
           | it matters.
           | 
           | i ask this as someone who runs a $40/mo Linode with a
           | debian/nginx/node/mysql stack that's definitely 20x over-
           | provisioned for an e-commerce site with 10k daily visitors,
           | 15 simultaneous backend users (reporting, order-entry, CRM,
           | analytics) and 0 caching tricks. i could easily run the site
           | on any 5 year old laptop with an SSD and 8GB RAM.
           | 
           | normalize/de-normalize when needed, understand and hand-write
           | efficient SQL queries (ditch ORMs), choose small/fast libs
           | carefully (or write your own), and you can easily serve 100k
           | users per day on a single cheap VPS with no
           | orchestration/replication/hz-scaling bullshit. definitely
           | can't say the same about 200k _simultaneous_ users - that
           | would need proper hardware, but can still be a single server.
           | 
           | Monoliths Are the Future:
           | https://news.ycombinator.com/item?id=22193383
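            | 
            | to make the "hand-write efficient SQL" bit concrete, a rough
            | sketch in python (sqlite3 stdlib as a stand-in for mysql,
            | made-up schema): one explicit query with a join + aggregate
            | instead of letting an ORM fire N+1 lookups and sum things in
            | app code.
            | 
            |   import sqlite3
            | 
            |   # stand-in schema; the real site is mysql, names are made up
            |   conn = sqlite3.connect(":memory:")
            |   conn.executescript("""
            |       CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, created_at TEXT);
            |       CREATE TABLE order_items (order_id INT, sku TEXT, qty INT, price_cents INT);
            |   """)
            | 
            |   # one hand-written query: 30-day revenue per customer, instead of
            |   # fetching every order and summing its items in application code
            |   rows = conn.execute("""
            |       SELECT o.customer_id,
            |              SUM(i.qty * i.price_cents) AS revenue_cents
            |       FROM   orders o
            |       JOIN   order_items i ON i.order_id = o.id
            |       WHERE  o.created_at >= date('now', '-30 days')
            |       GROUP  BY o.customer_id
            |       ORDER  BY revenue_cents DESC
            |   """).fetchall()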
        
             | neillyons wrote:
             | What is the e-commerce site?
        
               | Demiurge wrote:
               | Are you offering a free stress test? :)
        
             | Scarbutt wrote:
             | _hand-write efficient SQL queries (ditch ORMs)_
             | 
             | Raw SQL strings or a query builder?
        
         | karambir wrote:
          | Yeah, for a normal web app, we can easily have 100x the users
          | for each step mentioned in the article.
          | 
          | For our company, we had more than 500k users with 1 small nginx
          | + 1 medium app server (with autoscaling, though we never needed
          | it) + 1 small cache server and RDS until now. We just added an
          | AWS managed load balancer to the setup and think it might be
          | overkill.
          | 
          | For a client with NewsFeed needs, I used a dedicated server
          | (64GB, 2TB space) to run nginx, the app, the cache, a huge
          | Elasticsearch and Postgres. It was a great (and cheap) option
          | for an MVP and let them validate the product for a few months
          | with >10k users.
          | 
          | It was awesome to learn, a few years ago, how much compute
          | power we don't use.
        
         | bob1029 wrote:
         | Thank you for this post. I read "10 Users: Split out the
         | Database Layer" and about had an aneurysm.
         | 
         | I also work with fintech systems built upon .NET Core and have
         | similar experiences regarding scaling of these solutions. You
         | can get an incredible amount of throughput from a single box if
         | you are careful with the technologies you use.
         | 
         | A single .NET Core web API process using Kestrel and whatever
         | RDBMS (even SQLite) can absolutely devour requests in the <1
         | megabyte range. I would feel confident putting the largest
         | customer we can imagine x10 on a single 16~32 core server for
         | the solution we provide today. Obviously, if you are pushing
         | any form of multimedia this starts to break down rapidly.
        
           | thinkmassive wrote:
           | It seems like they recommend splitting out the database at
           | the start because using a managed service is much easier than
           | properly managing your own production database.
        
             | throwaway5752 wrote:
              | I can't speak for high performance/near-realtime system
              | people, but I would not trust a managed database service
              | for those needs. My experience is that the managed
              | offerings lag behind upstream and are usually economical
              | because they are multitenant. So you have a bit less
              | predictability in io wait / cpu queue, lose host-level
              | tunings (page sizes or hugepages, shared memory
              | allocation, etc), and - not naming
             | names - some managed db services are so behind they lack
             | critical query planner profiling features. That's not even
             | going into application workload specific tuning for various
             | nosql stores. This is a nice article but its audience is
             | people that haven't scaled up a system but are trying to
             | cope with success. It's not great generalized scaling
             | advice.
        
         | tempestn wrote:
         | Even though the absolute numbers may well not apply to other
         | types of apps, the general concepts of how to scale do. We have
         | 100k+ users on effectively a single box too (actually two for
         | redundancy, ease of upgrades, etc., but one can handle it), but
         | this is a great overview of how to think about scaling beyond
          | that, however many users that happens at. Honestly, when I was
          | reading the article I read the 1/10/100 as more of a unitless
          | degree of scale than actual numbers of humans.
        
         | grezql wrote:
        | SQL Server is a different beast. You get a lot of performance
        | enhancement out of the box. Yes, it costs money, but you save a
        | tremendous amount of tweaking time and headaches.
        
           | viggity wrote:
            | Or you can just use Azure SQL and essentially pay what it
            | costs for the box, because it is platform as a service. It's
            | far cheaper (and easier to maintain) than what it'd cost to
            | buy a SQL Server license and run it on a VM.
        
         | kernoble wrote:
          | Hey, really specific question regarding your deployment. A
          | teammate of mine reported difficulties with the .NET SQL Server
          | database driver establishing a connection from a Linux client
          | (a container instance based on the public .NET Core image) to a
          | SQL Server instance (on Windows). Are you familiar with this
          | problem? I think moving our systems over to .NET Core on Linux
          | is the future, but this one experience has somewhat soured the
          | idea for some decision makers and the team managing our db.
        
         | NicoJuicy wrote:
          | Any tests in progress? I'm going with Postgres because of BSON
          | support and no-sql with sql duplicate fields for search.
          | 
          | Using .NET Core also.
          | 
          | In general: I agree, there's rarely a case for really using the
          | cloud. Page loads of my e-commerce project are 8ms for the
          | basket; I'm wondering what would be killed first under a big
          | load, even without caching. Probably the database, not sure
          | yet.
        
       | munns wrote:
       | As the original creator of the presentation referenced by the
       | blog author (later re-delivered by Joel in the linked post) I am
       | super excited to see this still have an impact on people, but I'd
        | say today in 2019 you'd probably do things very differently (as
        | others call out).
       | 
       | Tech has progressed really far and there are tools like Netlify
       | for hosting that would replace 90% of the non-DB parts of this.
        | Cloud providers have also grown drastically, and so again a lot
        | of this would/could look a lot different.
       | 
        | Fwiw, the original deck is from spring 2013, delivered at a VC
        | event; it went on to be the most viewed/shared deck on SlideShare
       | for a bit: https://www.slideshare.net/AmazonWebServices/scaling-
       | on-aws-...
       | 
       | thanks, - munns@AWS
        
         | debaserab2 wrote:
         | Does it look that much different if you exclude solutions that
         | increase vendor lock-in?
        
           | munns wrote:
           | On your own metal it looks like it does in this post.
           | 
           | With managed services it looks a world different.
        
           | Swizec wrote:
           | You always pay the vendor. Whether that's in sweat and tears
           | or in dollars is up to you.
           | 
            | Fwiw, you are almost certainly shooting yourself in the foot
            | by avoiding vendor lock-in at stages before 8 figures of
            | revenue per year. Your engineering takes longer, is more
            | brittle, and because you're only actively using 1 vendor,
            | your solution is still vendor locked-in.
           | 
           | Love, ~ Guy who learned his lesson many times
        
             | munns wrote:
              | Fwiw: Slack, Lyft, Airbnb, Snapchat, Stripe are all 100%
             | public cloud based (in so far as I know). So up through 8+
             | figures they are still doing it too.
             | 
              | removed Uber as it's not 100% cloud (or at least wasn't in
             | the past)
        
               | jayp wrote:
               | Uber has always been self hosted. Some workloads are on
               | Cloud and migrating more there. I last worked there 2
               | years ago.
        
               | munns wrote:
               | Thank you for clarifying! I know quite a lot has
               | supposedly shifted. Will update my original comment.
        
             | debaserab2 wrote:
             | I think it depends on the type of vendor lock-in -- sure,
             | the trade off of having a managed Postgres instance is
             | obvious, but it becomes less obvious to me when you're
             | using things like a proprietary queueing or deployment
             | service.
             | 
              | Writing service API integration code instead of code that
              | interfaces directly with the underlying technology makes
              | the code quite brittle. If/when the vendor deprecates the
              | service, introduces backwards-incompatible changes, or
              | abandons development of the product, you're left on the
              | hook to engineer your way out of that problem. Oftentimes
              | that effort is equal to or greater than the effort of an
              | in-house solution in the first place.
              | 
              | I had the same mentality as you until this happened to the
              | SaaS product I work on, for a few different services. Now,
              | at the very least, I try to make sure solutions are cloud
              | agnostic.
        
       | thaniri wrote:
       | This blog post is almost entirely a re-hash of
       | http://highscalability.com/blog/2016/1/11/a-beginners-guide-...
       | 
       | The primary difference is that this post tries to be more
       | generic, whereas the original is specific to AWS.
       | 
       | The original, for what it is worth, is far more detailed than
       | this one.
        
         | lixtra wrote:
         | That's on purpose:
         | 
         | >> This post was inspired by one of my favorite posts on High
         | Scalability. I wanted to flesh the article out a bit more for
         | the early stages and make it a bit more cloud agnostic.
         | Definitely check it out if you're interested in these kind of
         | things.
        
       | [deleted]
        
       | gfodor wrote:
        | It's probably a bad idea to switch to read-only replicas for
        | reads pre-emptively, vs vertically scaling up the database. Doing
        | so adds a lot of incidental complexity, since you have to avoid
        | read-after-write anomalies, or ensure those reads come from the
        | master.
        | 
        | The reason punting on this is a good idea is that you can get
        | pretty far with vertical scaling, database optimization, and
        | caching. And when push comes to shove, you are going to need to
        | shard the data anyway to scale writes, reduce index depths, etc.
        | A re-architecture of your data layer will need to happen
        | eventually, so it may turn out that you can avoid the
        | intermediate "read from replica" overhaul by just punting the
        | ball until sharding becomes necessary.
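        | 
        | To make the incidental complexity concrete, a minimal sketch (no
        | particular framework, all names are made up) of the kind of
        | routing you end up writing: pin a session to the primary for a
        | short window after it writes, so replica lag can't cause
        | read-after-write anomalies.
        | 
        |   import time
        | 
        |   REPLICA_LAG_WINDOW = 2.0  # seconds we assume a replica may lag
        | 
        |   class RoutingSession:
        |       """Route reads to a replica unless this session wrote recently."""
        | 
        |       def __init__(self, primary, replica):
        |           self.primary, self.replica = primary, replica
        |           self.last_write_at = 0.0
        | 
        |       def execute_write(self, sql, params=()):
        |           self.last_write_at = time.monotonic()
        |           return self.primary.execute(sql, params)
        | 
        |       def execute_read(self, sql, params=()):
        |           # read-your-own-writes: stay on the primary after a write
        |           if time.monotonic() - self.last_write_at < REPLICA_LAG_WINDOW:
        |               return self.primary.execute(sql, params)
        |           return self.replica.execute(sql, params)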
        
         | danenania wrote:
         | For those who have reached vertical database write scaling
         | limits and had to start sharding, I'm curious what kind of load
         | that entails? Looking at RDS instances, the biggest one is
         | db.r5.24xlarge with 48 cores and 768 gb ram. I imagine that can
         | take you quite a long way--perhaps even into millions of users
         | territory for a well-designed crud app that's read-heavy and
         | doesn't do anything too fancy?
        
           | adventured wrote:
           | > the biggest one is db.r5.24xlarge with 48 cores and 768 gb
           | ram. I imagine that can take you quite a long way--perhaps
           | even into millions of users territory
           | 
            | That will run Stack Overflow's db by itself, for reference,
           | along with sensible caching (they're very read-heavy and
           | cache like crazy). Here's their hardware for their SQL server
           | for 2016:
           | 
           | 2 Dell R720xd Servers featuring: Dual E5-2697v2 Processors
           | (12 cores @2.7-3.5GHz each), 384 GB of RAM (24x 16 GB DIMMs),
           | 1x Intel P3608 4 TB NVMe PCIe SSD (RAID 0, 2 controllers per
           | card), 24x Intel 710 200 GB SATA SSDs (RAID 10), Dual 10 Gbps
           | network (Intel X540/I350 NDC).
           | 
           | https://nickcraver.com/blog/2016/03/29/stack-overflow-the-
           | ha...
        
           | bcrosby95 wrote:
           | Very far I would guess. 10 years ago we took a single bare
           | metal database server running mysql with 8 cores and 64gb of
            | memory to 8 million daily users: 15k requests per second of
            | per-user dynamic pages at peak load.
           | 
           | We did use memcached where we could.
        
         | jedberg wrote:
         | The problem with going to the "top of the vertical" scaling so
         | to speak is that one day, if you're lucky, you'll have enough
         | traffic that you'll reach the limit. And it will be like
         | hitting a wall.
         | 
         | And then you have to rearchitect your data layer under extreme
         | duress as your databases are constantly on fire.
         | 
         | So you really need to find the balance point and start doing it
         | _before_ your databases are on fire all the time.
        
           | NicoJuicy wrote:
            | I actually implemented domain-driven design WITH an API layer
            | (so core, application, infrastructure + API). They are also
            | split into basket, catalog, checkout, shipping and pricing
            | domains with separate DBs.
            | 
            | So just splitting up the heaviest part (e.g. catalog) into "a
            | microservice" would be easy while I add nginx as a load
            | balancer. I already separated domain vs integration events.
            | 
            | Both now use in-memory events in the application; I only need
            | a message broker like NATS for the integration events.
            | 
            | It would be an easy wall ;). I have multiple options, like
            | heavier hardware, splitting the db off from the application
            | server, or splitting a domain-bound API out to a separate
            | server.
            | 
            | As long as I don't need multimedia streaming, Kubernetes or
            | Kafka, the future is clear.
            | 
            | Ps. Load balancing based on tenant and cookie would be an
            | easy fix in extreme circumstances.
            | 
            | The thing I'm afraid of the most is hitting the identity
            | server for authentication/token verification. Not sure if
            | it's justified though.
            | 
            | Side note: one application has an insane amount of complex
            | joins and will not scale :)
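            | 
            | Fwiw, the integration-event hand-off is tiny with nats-py; a
            | rough sketch (subject name and payload are made up):
            | 
            |   import asyncio
            |   import json
            |   import nats  # nats-py client
            | 
            |   async def publish_order_checked_out(order_id: int) -> None:
            |       # swap the in-memory dispatch for a broker publish; the
            |       # catalog/shipping domains subscribe to the same subject
            |       nc = await nats.connect("nats://localhost:4222")
            |       event = {"type": "order.checked_out", "order_id": order_id}
            |       await nc.publish("integration.order.checked_out",
            |                        json.dumps(event).encode())
            |       await nc.drain()
            | 
            |   asyncio.run(publish_order_checked_out(42))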
        
           | toast0 wrote:
           | Assuming you have a relatively stable growth curve, you
           | should have some ability to predict how long your hardware
           | upgrades will last.
           | 
           | With that, you can start planning your rearchitecture if
           | you're running out of upgrades, and start implementing when
           | your servers aren't yet on fire, but are likely to be.
           | 
           | Today's server hardware ecosystem isn't advancing as reliably
           | as it was 8 years ago, but we're still seeing significant
           | capacity upgrades every couple years. If you're CPU bound,
           | the new Zen2 Epyc processors are pretty exciting, I think
           | they also increased the amount of accessible ram, which is
           | also a potential scaling bottleneck.
        
             | jedberg wrote:
             | > Assuming you have a relatively stable growth curve, you
             | should have some ability to predict how long your hardware
             | upgrades will last.
             | 
             | But that's not how the real world works. The databases
             | don't just slowly get bad. They hit a wall, and when they
             | do it is pretty unpredictable. Unless you have your scaling
             | story set ahead of time, you're gonna have a bad day (or
             | week).
        
               | wolco wrote:
                | That's exactly how the real world works. Databases will
                | get slow, then slower. Resources get used up.
                | Unpredictable? Not really. Maybe you've run out of space
                | or RAM, or processes are hanging. The database will never
                | just start rendering HTML or formatting your disk or
                | emailing someone. It is pretty predictable.
        
               | jedberg wrote:
               | The failure I've seen multiple times is that the database
               | is returning data within normal latencies, and then there
               | is a traffic tipping point and the latencies go up 1000x
               | for all requests.
        
               | toast0 wrote:
               | If you're lucky, the wall is at 95-100% cpu. Oftentimes,
               | we're not that lucky, and when you approach 60%,
               | everything gets clogged up, I've even worked on systems
               | where it was closer to 30%.
               | 
               | Usually, databases are pretty good at running up to 100%,
               | though. And if you started with small hardware, and have
               | upgraded a few times already, you should have a pretty
               | good idea of where your wall is going to hit. Some
               | systems won't work much better on a two socket system
               | than a one socket system, because the work isn't open to
               | concurrency, but again, we're talking about scaling
               | databases, and database authors spend a lot of time
               | working on scaling, and do a pretty good job. Going
               | vertically up to a two socket system makes a lot of sense
               | on a database; four and eight socket systems could work
               | too, but get a lot more expensive pretty fast.
               | 
                | Sometimes, the wall on a database is from bad queries or
               | bad tuning; sharding can help with that, because maybe
               | you isolate the bad queries and they don't affect
               | everyone at once, but fixing those queries would help you
               | stay on a single database design.
        
               | bcrosby95 wrote:
                | The minute your RDBMS' hot dataset doesn't fit into
                | memory, it's going to shit itself. I've seen it happen
               | anywhere from 90% CPU down to around 10%. Queries that
               | were instant can start to take 50ms.
               | 
               | It can be an easy fix (buy more memory), but the first
               | time it happens it can be pretty mysterious.
        
           | lllr_finger wrote:
           | DID is an extremely important concept that is alien to a lot
           | of developers: Deploy for 1.5X, Implement for 3X, Design for
           | 10X (your numbers may vary slightly)
        
             | [deleted]
        
         | cactus2093 wrote:
         | There are some cases where adding a read replica can be helpful
         | at almost no extra overhead - for instance if your product has
         | something like a stats dashboard you'll have some heavy queries
         | that are never going to result in a write after read and don't
         | matter if they are sometimes a few ms or even a few seconds or
         | tens of seconds out of date. Similarly if you have analysts
         | poking around running exploratory queries, a read replica can
         | be the first step towards an analytics workflow/data warehouse.
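          | 
          | In Django (which comes up elsewhere in the thread) the routing
          | half of that is only a few lines; a sketch assuming a "replica"
          | alias is already defined in DATABASES and a hypothetical
          | "analytics" app holds the heavy read-only models:
          | 
          |   class AnalyticsReplicaRouter:
          |       """Send the analytics app's reads to the replica; all
          |       writes and everything else go to the primary."""
          | 
          |       def db_for_read(self, model, **hints):
          |           if model._meta.app_label == "analytics":
          |               return "replica"
          |           return "default"
          | 
          |       def db_for_write(self, model, **hints):
          |           return "default"
          | 
          |   # settings.py:
          |   # DATABASE_ROUTERS = ["myproject.routers.AnalyticsReplicaRouter"]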
        
       | charlesju wrote:
       | These posts are great and there is always great information in
        | them. But to nitpick, it would be a lot easier to digest at face
        | value if you led with concurrency rather than raw total users, as
        | that's the true gauge of what your server infrastructure should
        | look like.
        
         | stingraycharles wrote:
         | Yeah I still don't understand the need to split servers at 10
         | users. Even if this is in parallel, it must still mean there is
         | a well-beyond-average resource consumption per user.
        
           | cwingrav wrote:
           | Probably so when your 10 users grow to 1000, your efforts at
           | 10x are good for 1000x, and you're working on 100000x.
        
       | superphil0 wrote:
        | Use Firebase or any other serverless architecture and forget
        | about scaling and devops. Not only will you save development time
        | but also money, because you need fewer developers. Yes, I
        | understand at some point it will get expensive, but you can still
        | optimize later and move expensive parts to your own
        | infrastructure if needed.
        
       | ablekh wrote:
        | Nice, but very simplistic (on purpose, it seems) write-up on the
        | topic. For much more comprehensive and excellent coverage of
        | designing & implementing large-scale systems, see
        | https://github.com/donnemartin/system-design-primer. Also, I want
        | to mention that an important - and relevant to scaling - aspect,
        | _multi-tenancy_, is very often (as in Alex's post) not addressed.
        | Most large-scale software-intensive systems are SaaS, hence the
        | importance and relevance of multi-tenancy.
        
       | huzaif wrote:
       | We can now achieve pretty high scalability from day 1 with a tiny
       | bit of "engineering cost" up front. Serverless on AWS is pretty
       | cheap and can scale quickly.
       | 
       | App load: |User| <-> |Cloudfront| <-> |S3 hosted React/Vue app|
       | 
       | App operations: |App| <-> |Api Gateway| <-> |Lambda| <-> |Dynamo
       | DB|
       | 
       | Add in Route53 for DNS, ACM to manage certs, Secrets Manager to
       | store secrets, SES for Email and Cognito for users.
       | 
       | All this will not cost a whole lot until you grow. At that point,
       | you can make additional engineering decisions to manage costs.
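        | 
        | The Lambda piece of that is only a few lines; a rough sketch
        | (boto3, made-up table and field names) of a handler behind API
        | Gateway's proxy integration writing to DynamoDB:
        | 
        |   import json
        |   import os
        |   import boto3
        | 
        |   # table name injected via environment so the same code works per stage
        |   table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
        | 
        |   def handler(event, context):
        |       body = json.loads(event.get("body") or "{}")
        |       table.put_item(Item={
        |           "pk": f"user#{body['user_id']}",   # partition key (made up)
        |           "sk": f"note#{body['note_id']}",   # sort key (made up)
        |           "text": body.get("text", ""),
        |       })
        |       return {"statusCode": 201, "body": json.dumps({"ok": True})}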
        
         | aratakareigen wrote:
         | Great, but this reads like a particularly blunt Amazon ad. Is
         | there a way to achieve "high scalability" without selling my
         | soul to Amazon?
        
           | dumbfoundded wrote:
           | I think if you use something like serverless, you can
           | abstract the cloud layer. I've never used it for anything
           | more than a toy project though.
           | 
           | https://serverless.com/
        
           | huzaif wrote:
           | Yes, it does read like that.
           | 
           | In the context of a start-up, cost is a big factor and then
           | perhaps (hopefully) handling growth. You could start small
           | and refactor apps/infrastructure as you grow but I am unsure
           | how one could afford to do that efficiently while also
           | managing a growing startup.
           | 
            | On selling your soul to a cloud provider, I don't see it like
            | that. I have a start-up to bootstrap and I want to see it
            | grow before making altruistic decisions that would sustain
            | the business model.
           | 
           | Once you are past the initial growth stage, there are many
            | options for serverless, gateways, caches, and proxies that
            | can be orchestrated in K8s on commodity VMs in the
            | datacenter. Though
           | this is where you would need some decent financial backing.
           | 
           | (I am not associated with Amazon, Google or Azure. I do run
           | my start-up on Azure.)
        
             | ignoramous wrote:
             | I'm down a similar route, but I must point out that beyond
             | a certain number of users / scale, Serverless becomes cost-
             | prohibitive. For instance, per back-of-the-napkin
             | calculation, the Serverless load I run right now, though
             | very cost-effective for the smaller userbase I've got,
             | would quickly spiral out of control once I cross a
              | threshold (which is at 40k users). At 5M users, I'd be
              | paying an astonishing 100x what it would cost to host the
              | services on a VPS. That said, Serverless does reduce DevOps
              | to an extent, though it introduces different (if fewer)
              | complications of its own.
              | 
              | As patio11 would like to remind us all, _we've got a
              | revenue problem, not a cost problem._ [0]
             | 
             | [0] https://news.ycombinator.com/item?id=22202301
        
           | sky_rw wrote:
           | Yes, sell your soul to Google.
        
           | fragmede wrote:
            | The big clouds have similar enough products, just with the
            | names changed, so at a high level, GP's list of AWS products
            | can be swapped with, e.g., Azure's product names.
           | https://www.wintellect.com/the-rosetta-stone-of-cloud-
           | servic...
           | 
            | Sadly, for anything more in-depth than that, you'll need to
            | sign an NDA with AWS to learn anything about the performance
            | limits of their services (e.g. Redshift), and you won't get
           | that unless you're already a big customer there. Azure's not
           | going to be falling over themselves to let you know where
           | they fall short, either. This is vendor lock-in, and is why
           | there are so many free cloud credits to be had to startups.
           | 
            | This is also a reason I believe SaaS companies will find it
            | harder than they realized to arbitrage between clouds, and
            | business models based on that may not be able to get it
            | right.
        
         | papito wrote:
         | I bet that DNC Iowa primaries app was serverless. Problem
         | solved! [dusts off hands].
        
         | marriedWpt wrote:
         | AWS seems like it would be expensive long term.
         | 
         | Between my issues with AWS currently and the exterior look of
         | Amazon, I'm skeptical AWS is a good solution.
        
           | lbriner wrote:
           | Like most providers, it does depend. Some products are priced
           | very competitively while others seem over-the-top. For
           | smaller companies, the cloud is a cheaper starting point for
           | many systems but even for larger organisations, there are
           | savings to be made by outsourcing your servers. Do you know
           | how much it costs to install and maintain a decent air-con
           | system for your server room?
           | 
            | One of the other major advantages of the cloud is that you
            | can save a lot on support staff. Compare the wages of even 1
            | decent sysadmin looking after your own hardware to several
            | thousand dollars of AWS and it's still loads cheaper.
            | Hardware upgrades, OS updates etc. are often automatic or
            | hidden.
        
         | ludamad wrote:
         | I hated DynamoDB. What good is there about it other than
         | convenience?
        
           | ignoramous wrote:
           | I've found that KV stores like DynamoDB make for a good
           | control-plane configuration repository. For instance, say,
           | you need to know if a client, X, is allowed to access a
           | resource, Y. And, say, you've clients in order of millions
           | and resources in order of 100s, and you've got very specific
           | queries to execute on such denormalized data and need
           | consistently low latency and high throughput across key-
           | combinations.
           | 
           | Another good use-case is to store checkpointing information.
           | Say, you've processed some task and would like to check-in
           | the result. Either the information fits the 400KB DynamoDB
            | limit or you use DynamoDB as an index to an S3 file.
           | 
            | You could do those things with a managed or self-hosted
            | RDBMS, but DynamoDB takes away the need to manage the
            | hardware, the backups, the scale-ups, and the scale-outs, and
            | reduces ceremony whilst dealing with locks, schemas,
            | misbehaving clients, and myriad other configuration knobs,
            | whilst also fitting your query patterns to a tee.
           | 
           | KV stores typically give you consistent performance on reads
           | and writes, if you avoid cascading relationships between two
           | or more keys, and make just the right amount of trade-offs in
           | terms of both cross-cluster data-consistency and cross-table
           | data-consistency.
           | 
           | Besides, in terms of features, one can add a write-through
           | cache in front of a DynamoDB table, can point-in-time-restore
           | data up to a minute granularity, can create on-demand tables
           | that scale with load (not worry about provisioned capacity
           | anymore), can auto-stream updates to Elasticsearch for
           | materialised views or consume the updates in real-time
           | themselves, can replicate tables world-wide with lax
           | consistency guarantees and so on...with very little fuss, if
           | any.
           | 
           | Running databases is hard. I pretty much exclusively favour a
           | managed solution over self-hosted one, at this point. And for
           | denormalized data, a managed KV store makes for a viable
           | solution, imo.
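            | 
            | The control-plane lookup in the first example is basically
            | one call; a sketch with boto3 (table and attribute names are
            | hypothetical):
            | 
            |   import boto3
            | 
            |   acl = boto3.resource("dynamodb").Table("client-resource-acl")
            | 
            |   def is_allowed(client_id: str, resource_id: str) -> bool:
            |       # single-digit-millisecond point read; ConsistentRead only if
            |       # stale reads of freshly changed permissions are unacceptable
            |       resp = acl.get_item(
            |           Key={"client_id": client_id, "resource_id": resource_id},
            |           ConsistentRead=True,
            |       )
            |       return resp.get("Item", {}).get("allowed", False)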
        
             | danenania wrote:
             | All good points, but one thing people should look at very
             | closely before choosing DynamoDB as a primary db is the
             | transaction limits. Most apps are going to have some
             | operations that should be atomic and involve more than 25
             | items. With DynamoDB, your only option currently is to
             | break these up into multiple transactions and hope none of
             | them fail. But as you scale, eventually some _will_ fail,
             | while others in the same request succeed, leaving your data
             | in an inconsistent state.
             | 
             | While this could be ok for some apps, I think for most use
             | cases it's really bad and ends up being more trouble than
             | what you save on ops in the long run, especially
             | considering options like Aurora that, while not as hands-
             | off as Dynamo, are still pretty low-maintenance and don't
             | limit transactions at all.
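              | 
              | To make the failure mode concrete, a sketch of what you're
              | forced into with boto3 once a logical operation exceeds the
              | limit (hypothetical table; items are assumed to be in the
              | low-level attribute-value format, e.g. {"pk": {"S":
              | "order#1"}}). Each chunk is atomic, but the batch as a
              | whole is not:
              | 
              |   import boto3
              | 
              |   ddb = boto3.client("dynamodb")
              | 
              |   def put_all_atomic_ish(items, table="orders"):
              |       # each transact_write_items call takes at most 25 items, so a
              |       # 60-item logical operation becomes 3 independent transactions;
              |       # any one of them can fail after the others have committed.
              |       for i in range(0, len(items), 25):
              |           ddb.transact_write_items(TransactItems=[
              |               {"Put": {"TableName": table, "Item": item}}
              |               for item in items[i:i + 25]
              |           ])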
        
           | Scarbutt wrote:
           | If you don't mind watching a video:
           | https://www.youtube.com/watch?v=6yqfmXiZTlM
        
         | dkarras wrote:
         | It will cost an arm and a leg when you eventually grow though.
        
       | agumonkey wrote:
       | odd, that was my google query of yesterday..
       | 
       | I'm curious what kind of hardware can sustain 100k concurrent
       | connections these days.
        
         | lbriner wrote:
         | We were running a speed test with node vs dotnet core and even
         | on a small Linux box (4GB, 1 core), we could reach nearly 10K
         | concurrent requests for a basic HTTP response but the exact
         | nature of the system will affect that massively.
         | 
         | Add large request/response sizes or CPU/RAM bound operations
         | and your servers can very quickly reach their limits with far
         | fewer concurrent requests.
         | 
         | Architecture is a big picture task since you have to consider
         | the whole system before implementing part of it, otherwise you
         | end up having to start again.
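          | 
          | For anyone who wants to reproduce that kind of number, a rough
          | sketch of the client side in Python with aiohttp (URL and
          | concurrency are placeholders; dedicated tools like wrk will do
          | this better):
          | 
          |   import asyncio
          |   import time
          |   import aiohttp
          | 
          |   URL = "http://localhost:8080/"   # endpoint under test
          |   CONCURRENCY = 1_000
          |   TOTAL = 50_000
          | 
          |   async def worker(session, remaining):
          |       while remaining:
          |           remaining.pop()
          |           async with session.get(URL) as resp:
          |               await resp.read()
          | 
          |   async def main():
          |       remaining = list(range(TOTAL))
          |       start = time.monotonic()
          |       async with aiohttp.ClientSession() as session:
          |           await asyncio.gather(
          |               *(worker(session, remaining) for _ in range(CONCURRENCY)))
          |       rate = TOTAL / (time.monotonic() - start)
          |       print(f"{rate:.0f} req/s at concurrency {CONCURRENCY}")
          | 
          |   asyncio.run(main())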
        
           | agumonkey wrote:
           | thanks that's already a lower bound point of reference
        
       | bcrosby95 wrote:
       | Conversely, rent or buy 1 bare metal server. That's how we went
       | until we hit around 300k users. Back in 2008.
        
         | brokencode wrote:
         | I think it's kind of crazy that we have 64 core processors
         | available, but still need so many servers to handle only a
         | hundred thousand users. That's what, a few thousand requests
         | per second max?
         | 
         | Having many servers gives you redundancy and horizontal
         | scalability, but also comes at a high complexity and
         | maintenance cost. Also, with many machines communicating over
         | the network, latency and reliability can become much harder to
         | manage.
         | 
         | Most smaller companies can probably get away with having a
         | single powerful server with one extra server for failover, and
          | probably two more for the database with failover as well. I
          | think this would also result in better performance and
          | reliability. I'm curious to know whether the author
         | tried vertical scaling first or went straight to horizontal
         | scaling.
        
       | jedberg wrote:
        | Heh, this is one of the questions I liked to use for interviews.
        | 
        | "Let's work together and design a system that scales
        | appropriately but isn't overbuilt. Let's start with 10 users."
       | 
       | Then we talk about what we need and go from there. The end result
       | looks a lot like this blog post, for those who are qualified.
        
         | ignoramous wrote:
         | /offtopic
         | 
         | Heh, you're being modest. I'm sure you've dealt with far more
         | complex distributed systems than the hypothetical one in the
         | blog post.
        
           | jedberg wrote:
           | Sure, but most of the people I was interviewing hadn't, so it
           | was a good way to test their knowledge. :)
           | 
           | If you can scale to 100K users, you can probably learn the
           | rest to scale to 100M users.
        
       | k2xl wrote:
       | A little bit of overkill recommendations here.
       | 
        | With 10 users you don't "need" to separate out the database
        | layer. Heck, you don't need to do that with 100 users. A website
        | I ran back in 2007-2010 had tens of thousands of users on a
        | single machine running the app, database, and caching just fine.
       | 
        | Users are actually a really poor measure to use for scalability
        | planning. What's more relevant is queries/data transmission per
        | interval, and also the distribution of the types of data
        | transfers.
        | 
        | I'd say replace the "Users" in this post with "queries per
        | second" and then I think it's a better general guide.
        
       | segmondy wrote:
        | IMHO, discussion about scale should begin with at least 1 million
        | users these days. 100k has been old news for more than a decade.
        
         | lbriner wrote:
          | As stated above, the number of users is not a good measure of a
          | system; what matters is the concurrency multiplied by the
          | typical system load per action.
          | 
          | Clearly a million users on Facebook is a much heavier load than
          | a million users registered with online banking who only use it
          | once a month.
        
       | erkken wrote:
        | We now use a DigitalOcean managed database with 0 standby nodes,
        | coupled with another instance running Django. It is working well.
       | 
       | We are however actually thinking about switching to a new
       | dedicated server at another provider (Hetzner) where we are
       | looking at having the Web server and the DB on the same server,
       | however the new server will have hugely improved performance
       | (which is sometimes needed), still at a reduced cost compared to
       | the DigitalOcean setup.
       | 
        | The thing we are unsure about is whether having a managed db is
        | worth it. The selling point is that everything is, of course,
        | managed. But what does this mean in reality? Updating packages is
        | easy, backups as well (to some extent), and we still do not use
        | any standby nodes and doubt we will need any replication. So far
        | we have never had the need to recover anything (about 5 years).
        | Before we got the managed db we had it on the same machine (as we
        | are now looking at going back to) and never had any issues.
       | 
       | Any input?
        
         | heffer wrote:
         | Do note that with dedicated servers you are subjecting yourself
         | to things such as hardware upgrades and failures which you will
         | have to manage yourself if you want to prevent downtime.
         | 
         | And while Hetzner customer support is generally excellent, in
         | my experience, their handling of DDoS incidents will generally
         | leave your server blackholed and sometimes requires manual
         | intervention to get back online.
         | 
         | This is something you need to account for in terms of
          | redundancy if you are planning to expose your application
         | directly to the net without any CDN/Load balancer/DDoS filter
         | in place.
         | 
         | From my experience it makes sense to work with a data centre
         | that is less focussed on a mass market but allows for
         | individual client relations to mitigate risks like that. I love
         | Hetzner for what they are and do host some services with them,
         | but I wouldn't build a business around services hosted there.
         | 
         | And this not only goes for Hetzner but pretty much any provider
         | whose business model is based on low margin/high throughput.
        
           | erkken wrote:
           | Oh, their DDoS protection was one of the reasons we were
           | thinking about moving away from DO.
           | 
            | It is a public-facing SaaS API which does not get much
            | traffic in terms of requests, but it would be catastrophic
            | for it to be blackholed. So it's that bad?
           | 
           | Regarding hardware failures- have never experienced any so
           | far, but guess it's just a question of when then.
        
             | heffer wrote:
              | > It is a public-facing SaaS API which does not get much
              | traffic in terms of requests, but it would be catastrophic
              | for it to be blackholed. So it's that bad?
             | 
             | Well, depends on whether you have people that don't like
             | you. For them it can be rather easy to stage a DDoS against
             | your server and take that server offline for some time.
             | 
             | > Regarding hardware failures- have never experienced any
             | so far, but guess it's just a question of when then.
             | 
             | It has been happening to me much less often since they
             | switched most of their portfolio over to "Enterprise Grade"
             | disks. These days I tend to go with NVMe anyway so it has
             | become less of an issue.
        
         | ryanar wrote:
         | I thought part of their managed service was that they optimized
         | / tuned your postgres db based on how you were using it. If
         | that is true, then moving off of the managed service means you
         | are tuning postgres yourself now.
         | 
         | Also want to throw in there that it is important to not only
         | compare specs, but to also compare hardware. If DO has newer
         | chips and faster RAM, then you will take a performance hit
         | moving to the new provider even if the machine is beefier.
        
           | sb8244 wrote:
           | They quote "We'll handle setting up, backing up, and
           | updating". I interpret that as literally the database itself,
           | not the application specific nature of it--how it's used.
           | 
           | For example, I would be surprised if they noticed that your
           | IOPS was high and you needed to upgrade the storage/disk
           | components. (That would be cool if it's the type of thing
           | they offer).
        
           | adventured wrote:
           | Pretty certain that DO tunes to broad usage performance
           | optimization (all the easy, obvious performance wins), not
           | dynamically per client to each client's usage.
           | 
           | Here's their pitch: easy setup & maintenance, scaling, daily
           | backups, optional standby nodes & automated failover, fast
           | reliable performance including SSDs, can run on the private
           | network at DO and encrypts data at rest & in transit.
        
       | marcinzm wrote:
       | That seems pretty aggressive for just 100k users unless they mean
       | concurrent users (in which case they should say so).
       | 
       | Let's say that maybe 10% of your users are on at any given time
       | and they each may make 1 request a minute. That's under 200 QPS
       | which a single server running a half-decent stack should be able
       | to handle fine.
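        | 
        | The arithmetic, for anyone who wants to plug in their own
        | numbers (the 10% active share and 1 request/minute are the
        | assumptions above):
        | 
        |   users = 100_000
        |   active_fraction = 0.10           # 10% online at any given time
        |   requests_per_user_per_min = 1
        | 
        |   qps = users * active_fraction * requests_per_user_per_min / 60
        |   print(f"~{qps:.0f} QPS")         # ~167 QPS, under 200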
        
       ___________________________________________________________________
       (page generated 2020-02-05 23:00 UTC)