[HN Gopher] One million queries per second with MySQL
___________________________________________________________________

One million queries per second with MySQL

Author : aarondf
Score : 125 points
Date : 2022-09-01 18:35 UTC (4 hours ago)

(HTM) web link (planetscale.com)
(TXT) w3m dump (planetscale.com)

| truth_seeker wrote:
| Correct me if my knowledge is outdated,
|
| Vitess deployment comes at a cost:
|
| 1. You can't have a single database view of all shards
| 2. No distributed ACID
| 3. No distributed JOINs
|
| If I were to do it in PostgreSQL 14 natively without any other
| third-party plugins, I can get the help of Table Partitioning
| (say root server) + Tables on Foreign Servers (say leaf servers)
| through FDW. Parallel scans and JOINs are allowed. Also,
| PostgreSQL 15 (in beta now) will allow parallel commits to
| foreign server tables through FDW.
| harshitgangal wrote:
| Disclaimer: I am a Vitess maintainer.
|
| Please find my answers inline.
|
| 1. You can't have a single database view of all shards
|
| - Vitess does query planning and routing for the application, so
| the application does not have to know about shards. Yes, Vitess
| gives a view of a single database.
|
| 2. No distributed ACID
|
| - This has been a much talked-about feature and it is on our
| roadmap to provide. Currently, we have a twoPC implementation
| which we do not recommend for production use. Right now the
| users of Vitess either choose their sharding key in such a way
| that all the queries in a transaction go to a single shard, or
| the application handles best-effort multi-shard transactions.
|
| 3. No distributed JOINs
|
| - Vitess does support distributed joins, aggregation,
| subqueries and other complex queries. Vitess strives to be 100%
| MySQL compatible. If anyone encounters such a query, a GitHub
| issue helps in knowing and prioritizing it.
| truth_seeker wrote:
| Thanks for the info.
| mattlord wrote:
| > If I were to do it in PostgreSQL 14 natively without any
| other third-party plugins, I can get the help of Table
| Partitioning (say root server) + Tables on Foreign Servers (say
| leaf servers) through FDW. Parallel scans and JOINs are allowed.
| Also, PostgreSQL 15 (in beta now) will allow parallel commits
| to foreign server tables through FDW.
|
| I wish you all the best in operating that at scale and
| providing a reliable general purpose PostgreSQL DBaaS (even if
| only internally) around it. :-) PostgreSQL is great, not going
| to argue otherwise -- every solution has tradeoffs and if the
| tradeoffs of one solution fit your usage best then go with it.
| If you need to scale out and manage a large cluster of
| PostgreSQL instances as if it were a single logical database,
| then I'd recommend something that's in a similar space as
| Vitess and PlanetScale -- Citus and Citus on Azure
| respectively. At the hand-waving level it's always easy to "do
| this, and that", but it's a whole other matter when you develop
| a general purpose production-grade product that anyone can use
| and operate at virtually infinite scale. :-)
| truth_seeker wrote:
| aeyes wrote:
| FDW is not nearly as usable as you think. Joins of even 2 or 3
| remote tables with more than a few hundred rows in the join are
| enough to make simple queries take minutes to return, even if
| you tune all the settings and have statistics of remote tables.
|
| Also, how do you reshard without downtime?
| truth_seeker wrote:
| (Plz excuse the brevity. I am using my mobile phone to reply)
|
| Have you read this page carefully, especially the section
| "remote execution option"?
|
| https://www.postgresql.org/docs/current/postgres-fdw.html
|
| If not, give it a try.
|
| Also this one might help too.
|
| https://www.percona.com/blog/2021/06/02/postgres_fdw-enhance...
|
| Now, about re-sharding:
|
| I know this is not as carefree as in NewSQL DBs like
| CockroachDB, but it's not that difficult either.
|
| Choosing a shard key ultimately comes down to breaking down
| your ER diagram and inspecting the incoming write flow to
| certain tables which are bound to grow.
|
| In any of these cases you don't have to let your users suffer
| downtime. At least I have not. Even if you chose the wrong
| shard key, you can gracefully transfer or redistribute the
| data records by creating new tables and renaming them in place
| of the old ones. The window of apparent downtime here is a few
| milliseconds. This is not something which will happen
| frequently.
| gen220 wrote:
| Postgres FDW [1] seems to me to be a crazy cool (i.e.
| unexpected and good) feature of Postgres, but I haven't seen
| much chatter about it on HN in the rdbms stories.
|
| Has anyone here used it? What is/was your experience like?
|
| [1]: https://www.postgresql.org/docs/current/postgres-fdw.html
| sroussey wrote:
| That's a great feature but not at all comparable for a
| deployment.
|
| The closest thing for PostgreSQL would be Citus, I imagine.
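For readers unfamiliar with the feature discussed in this sub-thread:
the layout truth_seeker describes maps onto PostgreSQL's declarative
partitioning with foreign tables (via postgres_fdw) attached as
partitions. The following is only a minimal sketch under assumed
names -- the leaf server, credentials and orders schema are
hypothetical, not taken from the thread:

    -- On the "root" server (PostgreSQL 11+ for hash partitioning)
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER leaf1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'leaf1.internal', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER leaf1
        OPTIONS (user 'app', password 'secret');

    -- Parent table partitioned by the shard key.
    CREATE TABLE orders (
        customer_id bigint NOT NULL,
        order_id    bigint NOT NULL,
        total       numeric
    ) PARTITION BY HASH (customer_id);

    -- One partition kept locally on the root server...
    CREATE TABLE orders_p0 PARTITION OF orders
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);

    -- ...and one attached from a leaf server as a foreign table.
    -- A matching orders_p1 table must already exist on leaf1.
    CREATE FOREIGN TABLE orders_p1 PARTITION OF orders
        FOR VALUES WITH (MODULUS 4, REMAINDER 1)
        SERVER leaf1 OPTIONS (table_name 'orders_p1');

Queries against orders are then routed to the relevant partitions,
local or remote; how well joins and scans against the remote ones
perform depends heavily on postgres_fdw tuning (use_remote_estimate,
fetch_size, and the async_capable option added in PostgreSQL 14),
which is roughly the tuning aeyes alludes to above.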
| 29athrowaway wrote:
| Not all queries are the same.
| andersonrkton wrote:
| I would like to see a bench using the "common" setup and not the
| enterprise one... what are the limits on those?
| forgot_old_user wrote:
| It says this in the article at the end:
|
| > We ran this benchmark against a single-tenant environment,
| with levels of resources that we reserve for our enterprise
| customers.
| nelsondev wrote:
| How much hardware did you need?
| tiffanyh wrote:
| In 2018, folks were getting 4M queries per second using SQLite
| (BedrockDB) [0].
|
| This was also accomplished from just 1 server, not needing 40
| shards like the blog post.
|
| [0] https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...
| mattlord wrote:
| The Expensify test was targeted at demonstrating the upper
| limits of SQLite itself. The hardware used:
|
| > 1TB of DDR4 RAM
| > 3TB of NVME SSD storage
| > 192 physical 2.7GHz cores (384 with hyperthreading)
|
| And a very simple read-only query w/o overlapping rows, with
| the full dataset already loaded in memory:
|
| > The query itself just sums a range of 10 rows randomly
| located inside 10B row (447GB) two column database of random
| integers, one of which is indexed and the other not. This means
| that the test is fully cached in RAM, and the database is
| precached prior to the test. No writes are done during this
| test, only reads.
|
| All that is good and fine for that purpose -- to demonstrate
| the limits of what SQLite can do in scaling UP.
|
| The purpose of the PlanetScale post was demonstrating linear
| scalability by scaling OUT, using a realistic workload (TPC-C
| is a warehouse / ecommerce site simulation) on small commodity
| VMs (EC2 instances).
|
| As noted, you're comparing two very different tests that had
| very different goals.
| throwusawayus wrote:
| i strongly dislike planetscale and even i think this is not a
| useful or fair comparison .. sqlite works with local files,
| much simpler, no network latency, no HA
| kamikaz1k wrote:
| why do you dislike it?
| marginalia_nu wrote:
| > i strongly dislike planetscale and even i think this is not
| a useful or fair comparison .. sqlite works with local files,
| much simpler, no network latency, no HA
|
| To be fair, if you accomplish the same effect without needing
| distributed computing, then the solution gets much simpler;
| you cut out worrying about network latency and HA*
|
| * the probability of node failure grows very rapidly with the
| number of nodes. Same reason you probably don't need RAID for
| 1 disk, but you definitely do for a dozen
| rbranson wrote:
| Apples to oranges comparison. The benchmark in the PlanetScale
| blog post is based on TPC-C.
| [deleted]
| [deleted]
| SnowHill9902 wrote:
| And with awk you can query 10M lines per second.
| bitobserver wrote:
| "Granted, it's not SQLite by itself. We've wrapped it in a
| custom distributed transaction layer ..."
| tiffanyh wrote:
| How is this any different from the blog post that uses
| Vitess, which is a custom wrapper around MySQL?
|
| BedrockDB is a custom wrapper around SQLite.
| hbrn wrote:
| I'm unfamiliar with Vitess, but what exactly is the achievement
| here?
|
| If you have a shared-nothing architecture, you can keep
| indefinitely adding nodes to get more throughput. You could
| easily do a billion qps if you wanted to.
|
| The downside is that you have to eliminate relations (or at
| least not expect them to be performant/scalable).
|
| Am I missing something?
| [deleted]
| samlambert wrote:
| You understand the point. It's a real point. Show me a modern
| relational database that can scale this predictably on the
| cloud with this level of operability.
| tiffanyh wrote:
| When you have to eliminate data relationships for it to
| scale, you no longer have a "relational" database.
| sroussey wrote:
| You would be surprised then. Most SaaS companies can easily
| shard by customer. All the customer data stays together,
| relational and all.
|
| Cross-customer queries will be somewhat slower.
| tiffanyh wrote:
| Sure, but what you are describing is no longer a multi-tenant
| application/database. It's essentially a single-tenant
| deployment of your tech stack _per_ customer.
| Which is not very cost effective.
| staticassertion wrote:
| I would still consider it a multi-tenant system. It's a
| single database to manage, which distributes your
| customers using their identity as a partitioning key.
| sroussey wrote:
| It's multi-tenant --- you have 10,000,000 customers with
| 1,000,000 each on 10 servers, for example.
|
| Or I don't understand what you mean...
| [deleted]
| hbrn wrote:
| A relational database without relations is an oxymoron. As
| folks pointed out, you also have to throw ACID away. So
| what's left of the original database, the SQL dialect? I bet
| that gets limited too.
|
| Look, I get it, you have to sell your product. Some folks
| want semi-infinitely scalable storage, and they don't
| understand that the only way to achieve it is turning their
| DB into a key-value store. As a side effect they would have
| to rewrite their whole application from scratch, but they
| would only realize it after they get vendor locked in.
|
| You can advertise your solution as MySQL-compatible. And I
| can claim that it's dishonest.
| mattlord wrote:
| > A relational database without relations is an oxymoron.
|
| OK. You're the only one talking to this straw man though.
| :-) Every Vitess user that I'm aware of has a pretty
| typical 2NF/3NF schema design. A small sampling of them
| is listed here: https://vitess.io
|
| You set up your data distribution/partitioning/sharding
| scheme so that you have data locality for 99.9999+% of your
| queries -- meaning that the query executes against a data
| subset that lives on a single shard/node (e.g. sharding by
| customer_id) -- and you live with the performance hit and
| consistency tradeoffs for those very rare cases where
| cross-shard queries cannot be avoided (Vitess does support
| this). You should do this even if the solution you're using
| claims to have distributed SQL with ACID and MVCC
| guarantees/properties. There's no magic that improves the
| speed of light and removes other resource constraints. In
| practice most people say they want perfect
| security/consistency/<name your desired property here> but
| then realize that the costs (perf, resources, $$, etc.) are
| simply so high that it is not practical for their
| business/use case.
|
| I know MySQL fairly well (I started working at MySQL AB in
| 2003) and you can certainly claim that "MySQL-compatible"
| is dishonest, but I would offer a counterclaim that either
| you don't know this space very well or you're not operating
| in good faith here.
| hbrn wrote:
| To be fair, I skimmed through your docs and did misread
| them initially: I thought you don't allow foreign keys,
| but you actually don't allow foreign key _constraints_.
|
| If you are still allowing JOINs within a shard, then I
| need to apologize.
| hodgesrm wrote:
| To pile on your answer a bit: the manual bucketing you
| describe is exactly how ClickHouse works in most cases.
| It won't allow joins / IN on multiple distributed tables
| -- i.e., sharded/replicated tables -- unless you explicitly
| set a property called distributed_product_mode. [0] This
| is to prevent you from shooting yourself in the foot,
| either by bad performance or improperly distributed data.
|
| This constraint will eventually be relaxed, but most apps
| are able to work around it just fine. The ones that can't,
| use Snowflake.
|
| [0] https://clickhouse.com/docs/en/operations/settings/settings/...
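To make the ClickHouse behavior hodgesrm describes concrete, here is
a hedged sketch (the distributed tables and columns are invented for
illustration): a subquery over a second distributed table is refused
under the default setting and has to be allowed explicitly, either by
changing the setting or by using the GLOBAL keyword:

    -- Refused while distributed_product_mode keeps its default value of 'deny':
    SELECT count()
    FROM orders_dist
    WHERE user_id IN (SELECT user_id FROM users_dist WHERE plan = 'pro');

    -- Either relax the setting for the session...
    SET distributed_product_mode = 'global';

    -- ...or state the intent per query: run the subquery once on the
    -- initiating node and broadcast its result to every shard.
    SELECT count()
    FROM orders_dist
    WHERE user_id GLOBAL IN (SELECT user_id FROM users_dist WHERE plan = 'pro');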
| closeparen wrote:
| "Relational" actually refers to tables, not foreign keys.
| https://en.wikipedia.org/wiki/Relation_(database)
| johannes1234321 wrote:
| More to "relational algebra", of which joins are a part.
| (But not necessarily "foreign keys")
| [deleted]
| loxias wrote:
| Is a shard a server? If so, how powerful of a machine were they
| using? I'm deeply unimpressed at only 1M TPS on a _40 node_
| cluster.
| ijidak wrote:
| Very good job.
|
| A few questions.
|
| What is the response payload size? How large is the key space?
| And how many unique keys?
|
| I may have missed those details in the write-up.
|
| It's hard to judge queries per second without understanding
| those pieces.
|
| Genuinely curious.
| chreniuc wrote:
| I'm curious about the latency between a backend app that runs
| on a different cloud provider than AWS (e.g. Hetzner) and the
| database, because I've noticed that they run on AWS (it shows
| up when choosing the region for your database). Has anyone used
| this and connected from a different cloud provider? Have you
| noticed big latencies?
| orware wrote:
| You will generally run into different types of latencies in
| this case, since you have the connection latency while
| connections are being established, plus the regular
| physical/network latency between where the database is located
| and your own servers.
|
| For connections, since a TLS handshake is required, physical
| distance has a greater impact on the connection time. The
| following article:
| https://sking7.github.io/articles/44961356.html actually
| provides a good 3.5x-4x figure, which correlates with some
| connection tests I've completed.
|
| In other words, if an initial TCP packet takes ~100ms to get
| from the database to your server, then establishing a TLS
| connection to the database will probably be around 400ms.
|
| Once the connection is established, running queries over an
| open connection is generally going to be quicker, at least for
| simpler queries. More complex queries will still take whatever
| time they need to process on the database end before they can
| start sending results back, so results will generally vary
| there.
|
| But going back to that 100ms example... if the amount of data
| being returned from a very simple query is minimal, then the
| response time here would be very close to that 100ms figure
| over an already open connection, and likely would go up from
| there depending on the complexity of the query and the amount
| of data needing to be returned.
|
| Since the connection hostnames are publicly accessible and TLS
| is always required for connections, you can easily test from
| your own provider's location. So long as the general physical
| location isn't too far away from a supported region, the
| latency overall shouldn't be unusable.
|
| I may have mangled some terminology/analogies above but
| hopefully that helps provide a bit of a ballpark for you. If
| you have specific to/from regions in mind I might be able to
| try and collect some specific numbers for you!
| toast0 wrote:
| I haven't spent time optimizing TLS between a database client
| and server, but in HTTPS, using TLS 1.3 without early data
| (or TLS 1.2 with somewhat optimistic handshake handling) gets
| you to one added round trip; TLS 1.3 early data gets you down
| to zero added round trips. Early data isn't always
| appropriate, because there's potential for replays, but the
| latency improvement might be worth considering for some
| selects.
| [deleted]
| minhazm wrote:
| It depends on how far the two data centers are from one
| another. But the big cloud providers have data centers in
| similar locations since they usually want to be near an
| internet exchange [1]. For example, most companies that provide
| cloud services have a data center in Virginia, USA. So if both
| data centers are in the same region then you can get numbers in
| the sub-5ms range. As long as you're not crossing the ocean the
| latency should be fine.
|
| 1. https://en.wikipedia.org/wiki/List_of_Internet_exchange_poin...
| leetrout wrote:
| I know their devs (or the Vitess devs) are on here...
|
| No foreign keys are supported - are there any other schema-level
| constraints that are not supported that would also enable
| scaling out and achieving these numbers?
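Regarding leetrout's question: as hbrn and mattlord discuss above,
what is disallowed is the FOREIGN KEY constraint, not the
relationship or the join itself. A hedged sketch of what that
typically looks like in practice (table names are made up;
referential integrity is enforced by the application or by
out-of-band checks rather than by the database):

    CREATE TABLE customers (
        customer_id BIGINT UNSIGNED NOT NULL,
        name        VARCHAR(255) NOT NULL,
        PRIMARY KEY (customer_id)
    );

    CREATE TABLE orders (
        order_id    BIGINT UNSIGNED NOT NULL,
        customer_id BIGINT UNSIGNED NOT NULL,  -- relationship kept as a plain column
        total       DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (order_id),
        KEY idx_orders_customer (customer_id)  -- indexed, but no FOREIGN KEY ... REFERENCES
    );

    -- Joins (within a shard, if customer_id is the sharding key) work as usual:
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE c.customer_id = 42;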
| samsquire wrote:
| HandlerSocket is an interesting approach to turn MySQL into a
| NoSQL database and get 700,000 requests per second on a single
| node.
|
| http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-a...
|
| When I benchmarked Redis on an Intel hexacore I got 800,000
| requests a second on 8 terminal windows. So I assume the MySQL
| data fit in RAM.
| maxdemarzi wrote:
| What were the stats on each shard? Was each shard running 4
| cores? 8 cores? 32 cores? RAM? The blog post is light on
| details.
___________________________________________________________________
(page generated 2022-09-01 23:00 UTC)