[HN Gopher] Kafka Is Not a Database ___________________________________________________________________ Kafka Is Not a Database Author : andrioni Score : 208 points Date : 2020-12-08 15:52 UTC (7 hours ago) (HTM) web link (materialize.com) (TXT) w3m dump (materialize.com) | amai wrote: | Elasticsearch is also not a database. | charlieflowers wrote: | Could you elaborate on that? Because it sure seems like a DB to | me. | | Not a relational one. And should not replace a typical CRUD | OLTP DB. | | But it sure seems like a no-sql DB to me. | detay wrote: | using the right tool for the right problem requires skill. | joking wrote: | neither has to be. | diehunde wrote: | Relevant to the discussion: | | Martin Kleppmann | Kafka Summit SF 2018 Keynote (Is Kafka a | Database?) [1] | | [1] https://www.youtube.com/watch?v=v2RJQELoM6Y | jdmichal wrote: | So, the problem really being addressed but not named is that | eventing systems give _eventual_ consistency. But sometimes that's not good enough. And it's OK to admit that and bring in | another technology when you need a stronger guarantee than that. | | The example I was taught with was a booking system, where the | inventory management system-of-record was separate from the | search system. Search does not need 100% up-to-date inventory. A | delay between the last item being booked and it being removed | from the search results is acceptable. In fact, it has to be | acceptable, because it can happen anyway. If someone books the | last item after another hits the search button... There's nothing | the system can do about that. | | When actually committing a booking, however, it must be done | atomically within the inventory management system. | | So, to bring it home, it's OK for the search system to be | eventually consistent against bookings, and read bookings off of | an event stream to update its internal tracking. 
However, the | bookings themselves cannot be eventually consistent without | risking a double-booking. | Cojen wrote: | Jim Gray disagrees: | https://arxiv.org/ftp/cs/papers/0701/0701158.pdf | georgewfraser wrote: | It's funny that you use that example; we actually cited it in | an earlier draft of this post. Despite the seemingly opposite | title "Queues are Databases", that note actually makes many of | the same arguments: that message brokers are missing much of | the functionality of database management systems and this is a | problem. | [deleted] | arthurcolle wrote: | I had an issue with RabbitMQ where I didn't know how my consumer | was going to use the data that I was writing to a queue yet (from | a producer that was listening on a SocketIO or WebSockets | stream), and I was kind of just going to figure it out in an hour | or something. | | Eventually, my buffer ran out of memory and I couldn't write | anything else to it, and it was dropping lots of messages. I was | bummed. Is there a way to avoid this in Kafka? | atmosx wrote: | RabbitMQ and Kafka serve different needs. | | RabbitMQ is the most widely used open source implementation of | the AMQP protocol. It is slower but can support complex routing | scenarios internally and handle situations where at-least-once- | delivery guarantees are important. RabbitMQ supports on-disk | persistent queues, which you can tune if you like. Compared to | Kafka, RabbitMQ is slow in terms of volume that can be managed | per queue. | | Kafka is fast because it is horizontally scalable and you have | parallel producers and consumers per topic. You can tune the | speed and move the needle where you need between consistency and | availability. However, if you want things like at-least-once- | delivery and such, you'll have to use the building blocks kafka | gives you, but ultimately you'll have to handle this on the | application side. | | Regarding storage, by default kafka stores data for 7 days. 
| IIRC the NY Times stores articles from 1970 onwards on kafka | clusters. The storage is horizontally scalable and durable. | This is a common use case. As many have pointed out, the | cluster setup depends highly on your needs. We store data for 7 | days in kafka as well and it's in the order of 500GB or more | per node. | | Looks like you have a configuration issue. You can configure | RabbitMQ to store queues on the hard disk and with a quick | calculation you can make sure you have enough space for 10 or | 150 hours of data. I don't see any reason to switch to kafka, a | different tool with different characteristics, just because you | need more storage. | jgraettinger1 wrote: | This post doesn't mention the _actual_ answer, which is to: | | 1) Write an event recording a _desire_ to checkout. 2) Build a | view of checkout decisions, which compares requests against | inventory levels and produces checkout _results_. This is a | stateful stream/stream join. 3) Read out the checkout decision to | respond to the user, or send them an email, or whatever. | | CDC is great and all, too, but there are architectures where ^ | makes more sense than sticking a database in front. | | Admittedly working up highly available, stateful stream-stream | joins which aren't challenging to operate in production is... | hard, but getting better. | fuckadtech wrote: | Hard is an understatement. Particularly so if you are using | Kafka Streams to attempt to run a highly available, fault | tolerant, zero downtime, etc., service. The race condition and | compaction bugs in that library are not fun to debug. | EamonnMR wrote: | Kafka is a very nice communication channel. You can dump the | results into a database and query it if you need a database. | jkarneges wrote: | Another potential misuse of Kafka I've been wondering about is | how a single Kafka instance/cluster is often shared by multiple | microservices. 
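[Editor's note: the "checkout intent → checkout result" join that jgraettinger1 describes above can be sketched in a few lines of plain Python. In-memory lists stand in for Kafka topics, and all names (fields, SKUs, users) are invented for illustration; this is a toy, not a production design.]

```python
# Toy sketch of the stateful stream/stream join: checkout *intents* are
# consumed in log order, joined against inventory state, and one checkout
# *result* is emitted per intent. Lists stand in for topics.

def decide_checkouts(intents, initial_inventory):
    """Consume checkout intents in log order; emit one result per intent."""
    inventory = dict(initial_inventory)  # the processor's local state
    results = []
    for intent in intents:
        sku, user = intent["sku"], intent["user"]
        if inventory.get(sku, 0) > 0:
            inventory[sku] -= 1
            results.append({"user": user, "sku": sku, "ok": True})
        else:
            results.append({"user": user, "sku": sku, "ok": False})
    return results

intents = [
    {"user": "alice", "sku": "X1"},
    {"user": "bob", "sku": "X1"},  # only one X1 left: exactly one wins
]
print(decide_checkouts(intents, {"X1": 1}))
# → [{'user': 'alice', 'sku': 'X1', 'ok': True}, {'user': 'bob', 'sku': 'X1', 'ok': False}]
```

Because the intents are totally ordered within the stream, at most one buyer wins the last unit, which is exactly the isolation the article says raw topics lack.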
| | On one hand the ability to connect multiple microservices to a | central message broker is convenient, but on the other hand | this goes against the microservice philosophy of not sharing | subcomponents (databases, etc). I wonder where the lines should | be drawn. | ivalm wrote: | Wait, what? Isn't the whole point to have multiple | publishers/subscribers? | mrweasel wrote: | I think the point was about using a single cluster for | multiple topics, for different services. | | Depending on the scenario I can see the point. If the micro | services are all part of the larger overall solution, having | a single cluster is perfectly fine. Using the same cluster | for multiple "products" is a little like having one central | database server for a number of different solutions. You can | do it, but it potentially becomes a bottleneck or a central | point for your different solutions to impact performance of | each other. | jkarneges wrote: | I'd agree there is an arguable difference between sharing a | server vs sharing data within the server. | | Bottleneck issues aside, letting two microservices connect | to the same Postgres cluster but access different | "databases" (collection of tables) within that cluster | could be considered an acceptable data separation. | Certainly with multi-tenant DBaaS systems there may be some | server sharing by unrelated microservices/customers. | Whereas letting two microservices access the same database | tables would probably be frowned upon. | | Nevertheless, sharing the same Kafka topics between | microservices seems to be a common thing to do. | ivalm wrote: | > Whereas letting two microservices access the same | database tables would probably be frowned upon. | | > Nevertheless, sharing the same Kafka topics between | microservices seems to be a common thing to do. | | I think if it is all part of one whole, isn't this fine? 
You | have one service that generates customer facing output, | you may have another service that powers | analytics/dashboards, and you may have yet another service | that ETLs data into some data mart. Why wouldn't they | touch the same table/subscribe to the same topic (since | they just need read-only access to the data)? Genuinely | curious what the problem is except for | bottleneck/performance; and if it's just a bottleneck, then | wouldn't scaling horizontally solve it? | jkarneges wrote: | Sure. I believe microservice boundaries are more about | development agility than scalability. By limiting | each microservice to a minimal API surface and a "2 | pizza" team, everyone can iterate faster. And if a | particular microservice is implemented as multiple sub- | services sharing database tables only known by the team | in charge, that seems fine. | mumblemumble wrote: | I would argue that, if it's being used properly, the message | broker itself is a service. It runs as a separate process, you | communicate with it over an API, and its subcomponents (e.g., | the database) are encapsulated. | | It's all about framing and perspective, of course. But that's | how I'd want to try and frame it from a system architecture | point of view. | wongarsu wrote: | By that same reasoning Postgres is its own microservice. It | runs as a separate process, you communicate with it over a | well defined API, and its subcomponents (data store, query | optimizer etc) are encapsulated. | | With enough framing everything is possible, and in some | contexts it will even make sense. | mumblemumble wrote: | It has to do with how it functions in practice, IMO. | PostgreSQL itself is arguably a service, but the database | probably is not - you're probably crawling all over its | implementation details and data model. | | You _could_ take a stand and say, "All access is through | stored procedures. They are the API." 
And, if that API | operates at the same semantic level as a well-crafted REST | API, then you could make an argument that that particular | database is a microservice that just happens to have been | implemented on top of PostgreSQL. But I don't think I've | ever seen such a thing happen in the wild. It's much more | popular to use an ORM to get things nice and tightly | coupled. | selfhoster11 wrote: | Such implementations exist and have been discussed a few | days ago here on HN. There are also REST adapters for | Postgres: https://github.com/PostgREST/postgrest | onefuncman wrote: | this isn't any different than microservices getting a deathball | dependency on a user service or logging service or security | service or ... | | you either don't allow microservices to consume from others' | topics, or you publish event schemas so they can still iterate | independently. | | the move to a 2nd kafka cluster in my experience has always | been driven by isolation and fault tolerance concerns, not | scalability. | tacitusarc wrote: | I think because software engineers tend to excel at pattern | recognition, oftentimes solutions to different problems appear so | similar that it seems like with a small amount of abstraction, | they can be reused. But it's a trap! | | Everything abstracted to the highest level is the same, but | problems aren't solved at the highest level. | | The devil, as they say, is in the details. | UK-Al05 wrote: | This is a lack of abstraction. You can certainly fix this in | kafka using various hacks, but it's an implementation of an | abstraction you can get in a standard db for free. | | Funnily enough a list of events is pretty much what a | transaction log is in a standard db. Although the events have | more of a business meaning. In many ways event sourcing is | removing a lot of abstraction databases give you. | mrkeen wrote: | > a standard db | | Yes, ACID works for one database. Many databases? Not so | much. 
| | > Funnily enough a list of events is pretty much what a | transaction log is in a standard db. | | When I "SELECT Balance WHERE user = 12345", I usually just | get back a balance, I don't get back the transaction log. | | If nothing else, adopting the Kafka model gets your teammates | to append updates to your ledger, rather than changing values | in-place. | UK-Al05 wrote: | Yes, that's my point. You get half of a database. The | transaction log. Not the up-to-date state. | hodgesrm wrote: | Is this really a thing? Do people _really_ try to use Kafka as | the system of record for financial transactions or similar data? | mrkeen wrote: | Hell yes. Best thing I ever did. | | Updating balances using an RDBMS is like managing your finances | with pencils and erasers. Unless you somehow ban the UPDATE | statement. | | Updating balances with Kafka is like working in pen. You | can't[1] change the ledger lines, you can only add corrections | after the fact. | | [1] Yes, Kafka records can be updated/deleted depending on | configuration. But when your codebase is written around append- | only operations, in-place mutations are hard and corrections | are easy, so your fellow programmers fall into the 'pit of | success'. | hodgesrm wrote: | I would just put the ledger in a database table if it's that | important and maintain the current state of the account in a | separate table. ACID transactions and database constraints | make this kind of consistency easier to achieve than many | alternatives. It's also easier to prove correctness since you | can run queries that return consistent results thanks to the | isolation guaranteed by ACID. (Modulo some corner cases that | are not hard to work around.) | | Just my $0.02. | mrkeen wrote: | > ACID transactions and database constraints make this kind | of consistency easier to achieve than many alternatives. | | If your company only runs one database. 
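[Editor's note: hodgesrm's single-database version — an append-only ledger table plus a current-state table, updated in one ACID transaction — is easy to sketch with Python's bundled sqlite3. The schema and names below are invented for illustration.]

```python
import sqlite3

# Ledger *and* current balance live in one database and change atomically.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ledger  (user TEXT, amount INTEGER);
    CREATE TABLE balance (user TEXT PRIMARY KEY, total INTEGER);
""")

def post(user, amount):
    with db:  # one ACID transaction: both tables change, or neither does
        db.execute("INSERT INTO ledger VALUES (?, ?)", (user, amount))
        db.execute(
            "INSERT INTO balance VALUES (?, ?) "
            "ON CONFLICT(user) DO UPDATE SET total = total + excluded.total",
            (user, amount),
        )

post("u12345", 100)
post("u12345", -30)
print(db.execute("SELECT total FROM balance WHERE user='u12345'").fetchone())
# → (70,)
```

The point of the `with db:` block is that a crash between the two INSERTs rolls both back, so the ledger and the derived balance can never disagree — the guarantee that has to be rebuilt by hand in a topics-only design.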
| | > It's also easier to prove correctness | | RDBMSs are wonderful and I don't consider them unreliable | at all. But I can't prove the correctness of my teammate's | actions. I want them to show their working by putting their | updates onto an append-only ledger. | jacques_chester wrote: | > _If your company only runs one database._ | | I think the same argument can be made with "only one | Kafka cluster" and "only one blockchain". | sixdimensional wrote: | Blockchain databases take this model to the extreme - it is a | database that is append only and immutable, with | cryptographic guarantees of the ledger. | | I'm not selling any particular tool but Amazon's QLDB is an | interesting example of a blockchain-based database. I am | interested to see how things like Kafka and this might come | together somehow. | | In my opinion, we have evolved to a point where storage is | not a concern for temporal use cases - i.e. we can now store | every change in an immutable fashion. When you think about | this being an append only transaction log that you never have | to purge, and you make observable events on that log (which | is what most CDC systems do)... yeah it works. Now you have | every change, cryptographically secure, with events to | trigger downstream consumers and you can really rethink the | whole architecture of monolithic databases vs. data | platforms. | | It is an exciting time we are in, in my opinion. | jacques_chester wrote: | The original sin of SQL is that it begins without memory. | | The original sin of Kafka is that it begins with _only_ | memory. | | To me the right middle way is relational temporal tables[0]. | You get both the memoryless query/update convenience and the | ability to travel through time. | | [0] SQL:2011 introduced temporal data, but in a slightly | janky just-so-happens-to-fit-Oracle's-preference kind of way. | [deleted] | fuckadtech wrote: | Yes, with great pain, backing a graph database. 
Powering | billions of dollars in revenue. | fouc wrote: | Any sufficiently complex software will end up implementing a | database. | dgb23 wrote: | The article links to this talk[0], which has a funny and | interesting-sounding title: "Did you accidentally build a | database?" | | [0] https://www.oreilly.com/library/view/strata- | hadoop/978149194... | revertts wrote: | That link's a 3min clip for non-subscribers, but the full | talk is here https://www.youtube.com/watch?v=Bz2EXg0Fy98 | quickthrower2 wrote: | I'd qualify that as "complex infrastructure software" | UK-Al05 wrote: | One way around this is to make sure your kafka command streams | are processed in order, in serial, partitioned by an id where you | want the concurrency control. | | Normally you only want concurrency control within certain | boundaries. | | By figuring out the minimum transaction and concurrency | boundaries you can inch out quite a bit of performance. | loopz wrote: | Sure, but that defeats the quest for horizontal scalability. | You _can_ build highly performant systems based on serial | execution, but not sure this is an area where Kafka excels | particularly. | UK-Al05 wrote: | That's why you partition by some id. Say stock SKU id for | stock control. Then you can handle other SKUs in parallel. | It's only in serial for a single SKU. That's probably the | maximum performance potential you're going to get in a | traditional db anyway. | jacques_chester wrote: | This strikes me as mixing the physical and logical models. | edbrown23 wrote: | This definitely seems like the "Kafka" way to solve this | problem, but I fear there are implications to this | partitioning scheme I'd love to see answered. For example, | partition counts aren't infinite, and aren't easily | adjusted after the fact. So if you choose, say, 10 | partitions originally, for a SKU space that is nearly | infinite, then in reality you can only handle 10 parallel | streams of work. 
Any SKU that is partitioned behind a bit | of slow work is then blocked by that work. | | It's doable to repartition to 100 partitions or more, but | you basically need to replay the work kept in the log based | on 10 partitions onto the new 100 partitions, and that | operation gets more expensive over time. Then of course | you're basically stuck again once your traffic increases to | a high enough level that the original problem returns. If | the unit of horizontal scaling is the partition, but the | partition count can't be easily changed, consumers | eventually lose their horizontal scalability in Kafka, from | my perspective. | morelisp wrote: | On the other hand Kafka partitions are relatively cheap | on both broker and client side; 100 partitions does not | require 100 parallel consumers so over-provisioning is | not so risky. | soumyadeb wrote: | The architecture of dumping events into Kafka and creating | materialized views is a perfect choice for many use cases - e.g. | collecting clickstream data and building analytical reports. | | If ACID is a prerequisite, then a lot of things won't qualify as | databases - None of Mongo, Cassandra, ElasticSearch etc. Not even | many data-warehouses. | je42 wrote: | Ok. I admit using Kafka as DB is not straightforward, but just | stating it doesn't provide ACID functionality is not enough. | | The example they give is very simplistic. With the correct design | of Kafka topics and events, the problem in the example can be | fixed. | | And according to Oracle https://www.oracle.com/database/what-is- | database/ : | | > A database is an organized collection of structured | information, or data, typically stored electronically in a | computer system. | | So Kafka clearly fits that definition. 
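[Editor's note: UK-Al05's partition-by-id scheme above is the standard trick: hash the record key so every event for a given SKU lands on the same partition, serializing work per SKU while different SKUs proceed in parallel. A plain-Python sketch, with invented event shapes; real Kafka producers do a similar hash-the-key routing by default.]

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Deterministic: the same SKU always routes to the same partition,
    # so its events keep their relative order.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Route a few (sku, action) events into per-partition logs.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for event in [("sku-1", "reserve"), ("sku-2", "reserve"), ("sku-1", "confirm")]:
    partitions[partition_for(event[0])].append(event)

# All sku-1 events share one partition, in their original order.
p = partition_for("sku-1")
print(partitions[p])
```

This also makes edbrown23's follow-up concern concrete: `NUM_PARTITIONS` is baked into the routing, so changing it reroutes keys and forces the replay he describes.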
| satisfaction wrote: | I'm guilty of using it as a DB, my home weather station writes | to kafka topics, sometimes the postgres instance is down for | months, no problems letting the kafka topic store the data | until i get around to rebooting pg and restarting the | connector. | dathinab wrote: | Didn't some newspaper use Kafka to store the newspapers they | released in order or something similar? (I think it was the New | York Times, maybe??). | | Honestly as long as you don't use it as a general purpose | database, it might very well be the best choice for your use- | case. | je42 wrote: | Indeed. Know the tools and apply the one that has the fewest cons | and the most pros ;). | | Also, a good starter for learning how to use a streaming | store like Kafka as a DB is the video Database Inside-Out. | https://www.youtube.com/watch?v=fU9hR3kiOK0 | barnabask wrote: | 2017 article: https://open.nytimes.com/publishing-with- | apache-kafka-at-the... | halbritt wrote: | The author presumes that every use case requires a | transactional database. ACID is nice when it's needed, but | it generally isn't, especially in many streaming | data applications for which Kafka is most suitable. | based2 wrote: | https://www.postgresql.org/docs/10/rules-materializedviews.h... | Kalium wrote: | As recently as last year, I worked for a company where the Chief | Architect, in his infinite wisdom, had decided that a database | was a silly legacy thing. The future looked like Kafka streams, | with each service being a function against Kafka streams, and | data retention set to infinite. | | Predictably, this setup ran into an interesting assortment of | issues. There were no real transactions, no ensured consistency, | and no referential integrity. There was also no authentication or | authorization, because a default-configured deployment of Kafka | from Confluent happily neglects such trivial details. | | To say this was a vast mess would be putting it lightly. 
It was a | nightmare to code against once you left the fantasy world of | functional programming nirvana and encountered real requirements. | It meant pushing a whole series of concerns that isolation | addresses into application code... or not addressing them at all. | Teams routinely relied on one another's internal kafka streams. | It was a GDPR nightmare. | | Kafka Connect was deployed to bridge between Kafka and some real | databases. This was its own mess. | | Kafka, I have learned, is a very powerful tool. And like all | shiny new tools, deeply prone to misuse. | sixdimensional wrote: | It sounds like this would have made a better proof of concept | than a commitment to the architecture. | | The idea on the face of it is not per se a bad one, quite | interesting, but the implementations are perhaps not there yet | to back such an idea. | | It's important to know as an architect when your vision for the | architecture is outpacing reality, to know your time horizon, | and match the vision with the tools that help you implement the | actual use cases you have in your hand right now. | | It sounds like this person might have had an interesting idea | but not a working system. In another light, this could have | been a good idea if all the technology was in place to support | it... but the timing and implementation doesn't sound like it | was right, perhaps. | | The old saying "use the right tool for the job" comes to mind, | but that can be hard to see when the tools are changing so | fast, and there is a risk to going too far onto the bleeding | edge. Perhaps the saying should have been, "use the rightest | tool you can find at the time, that gives you some room to | grow, for the job"... | Kalium wrote: | It was definitely an interesting proof of concept that needed | some refinement. The core idea was functional services | against nicely organized data streams on a flat network. | Which is a really cool approach that works quite well for a | lot of things. 
| | Several of these points fell apart when credit card handling | and PCI-DSS entered the picture. | JMTQp8lwXL wrote: | Instead of single architects, companies need architect boards. | And they can vote on these ideas before a single individual | becomes a single point of failure. | | Expecting 1 person to make 100% correct decisions all the time | is too much to ask of one person. People go down rabbit | holes and they have weird takeaways, like replacing all the | databases with queues. | Kalium wrote: | I agree in the abstract, but in practice it's quite difficult to | set up a successful democratic architecture board. You need | teams or departments that all have architects, and an | engineering organization where both managers and engineers | accept a degree of centralized technical leadership. Getting | there is, in my opinion, the work of years. It's especially | challenging because spinning up such a board requires a | person who can run it single-handedly. | | In this particular company, the Chief Architect in theory had | a group around him. They did nothing to check his poor | decisions, and from the outside seemed primarily interested | in living in the functional programming nirvana he promised | them. | sixdimensional wrote: | Thinking in terms of building physical buildings... | architects are often the visionaries but engineers are the | realists. You know, architects come up with these crazy | incredible building designs based on their engineering | understanding, but ultimately it has to be vetted, proven | and implemented by engineers. | | I have, however, always wondered why people should be seen | as "only architects" and "only engineers". While the | separation of duties is critical to ensure the overall | construction is sound, people can be visionary engineers, | and people can be knowledgeable in both how to do something | in real terms as well as dreaming of how to go beyond. 
| EamonnMR wrote: | At our company we use a ton of services that operate | essentially as functions on a Kafka stream (well, they tend | to read/write in batches for efficiency) but we write event | streams we want to query into a regular database for | later querying. It works out very well. The idea of our poor Kafka | cluster having to field queries in addition to the load of | acting as a transport layer is frightening. The 'superpower' | Kafka gives you is the ability to turn back time if something | goes wrong and the ability to orchestrate really big pipelines. | You have to build or buy a fair bit of tooling to make it work | though. | Kalium wrote: | This particular org did its level best to think of Kafka as | the regular database for queries. | shay_ker wrote: | This is maybe a silly question, but what's the difference between | the timely dataflow that Materialize uses and Spark's execution | engine? From my understanding they're doing very similar things - | break down a sequence of functions on a stream of data, | parallelize them on several machines, and then gather the | results. | | I understand that the feature set of timely dataflow is more | flexible than Spark - I just don't understand why (I couldn't | figure it out from the paper, academic papers really go over my | head). | frankmcsherry wrote: | There are a few differences; the main one between Spark and | timely dataflow is that TD operators can be stateful, and so | can respond to new rounds of input data in time proportional to | the new input data, rather than that plus accumulated state. | | So, streaming one new record in and seeing how this changes the | results of a multi-way join with many other large relations can | happen in milliseconds in TD, vs batch systems which will re- | read the large inputs as well. | | This isn't a fundamentally new difference; Flink had this | difference from Spark as far back as 2014. 
There are other | differences between Flink and TD that have to do with state | sharing and iteration, but I'd crack open the papers and check | out the obligatory "related work" sections each should have. | | For example, here's the first para of the Related Work section | from the Naiad paper: | | > Dataflow Recent systems such as CIEL [30], Spark [42], Spark | Streaming [43], and Optimus [19] extend acyclic batch dataflow | [15, 18] to allow dynamic modification of the dataflow graph, | and thus support iteration and incremental computation without | adding cycles to the dataflow. By adopting a batch-computation | model, these systems inherit powerful existing techniques | including fault tolerance with parallel recovery; in exchange | each requires centralized modifications to the dataflow graph, | which introduce substantial overhead that Naiad avoids. For | example, Spark Streaming can process incremental updates in | around one second, while in Section 6 we show that Naiad can | iterate and perform incremental updates in tens of | milliseconds. | mamon wrote: | There's no difference really. All "Big Data" (tm) tools are | trying to capitalize on the hype, so Kafka adds database | capabilities, while Spark adds Streaming. At some point they | will reach feature parity. | vladsanchez wrote: | I want to Upvote this more than once. So many facts | condensed into a small essay. Good job! | | Money quote: "Event-sourced architectures like these suffer many | such isolation anomalies, which constantly gaslight users with | "time travel" behavior that we're all familiar with." | rowland66 wrote: | Maybe I am just old because I had to Google what gaslighting | meant, but as best I can tell getting gaslighted by your system | architecture is really stretching the meaning of the term | gaslighting to the point of absurdity. 
| theptip wrote: | This is a bit dumbed down, and ignores the domain terminology | required to properly discuss the trade-offs here (which is | puzzling given that it links to a post by Aphyr, where you can | find incredibly thorough discussions around isolation levels and | anomalies). | | > The fundamental problem with using Kafka as your primary data | store is it provides no isolation. | | This is false. I can only assume the author doesn't know about | the Kafka transactions feature? | | To be specific, Kafka's transaction machinery offers read- | committed isolation, and you get read-uncommitted by default if | you don't opt in to use that transaction machinery (the docs: | https://kafka.apache.org/0110/javadoc/index.html?org/apache/...). | Depending on your workload, read-committed might be sufficient | for correctness, in which case you can absolutely use Kafka as | your database. | | Of course, proving that your application is sound with just read- | committed isolation can be challenging, not to mention testing | that your application continues to be sound as new features are | added. | | Because of that, in general I think that the underlying point of | this article is probably correct, in that you probably shouldn't | use Kafka as your database -- but for certain applications / use- | cases it's a completely valid system design choice. | | More generally this is an area that many applications get wrong | by using the wrong isolation levels, because most frameworks | encourage incorrect implementations by their unsafe defaults; | e.g. see the classic "Feral concurrency control" paper | http://www.bailis.org/papers/feral-sigmod2015.pdf. So I think the | general message of "don't use Kafka as your DB unless you know | enough about consistency to convince yourself that read-committed | isolation is and will always be sufficient for your usecase" | would be more appropriate (though it's certainly a less snappy | title). 
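[Editor's note: theptip's caveat — that read-committed alone may not be enough — is the classic lost-update anomaly. Under read committed, two writers can both read the same committed value and both act on it. A toy interleaving in plain Python, no database involved; names are invented.]

```python
# Two buyers race for the last item. Each read sees committed state,
# which is all read-committed isolation promises.
inventory = {"last-item": 1}

def read_stock():
    return inventory["last-item"]  # reads committed state: both see 1

# Both buyers read *before* either writes -- legal under read committed.
seen_by_a = read_stock()
seen_by_b = read_stock()

# Both saw stock, so both "commit" a purchase.
sold = 0
for seen in (seen_by_a, seen_by_b):
    if seen > 0:
        inventory["last-item"] -= 1
        sold += 1

print(sold, inventory["last-item"])  # → 2 -1  (oversold: stock went negative)
```

Preventing this requires something stronger than read committed: a read-then-write done atomically (serializable isolation, a row lock, or a compare-and-set), which is the article's underlying point.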
| georgewfraser wrote: | "Read-committed isolation" is not a meaningful implementation | of transactions. If you can't do a read, then a write, while | guaranteeing the database didn't change in between, then you | don't really have transactions. | [deleted] | LgWoodenBadger wrote: | This sounds like "serializable" which is (in my experience) | rarely useful for a meaningful system. | theptip wrote: | Depends on your use-case; if it's meaningless, why is it | implemented in all the leading SQL DBs? It's the default in | Postgres... | | https://www.postgresql.org/docs/9.5/transaction-iso.html | | If you're arguing that in practice this isn't enough | isolation, then sure, that's what I said in my post; most | applications need more than the default isolation levels. I | feel like you're making an absolutist point (just like the | original article) where my point was that the domain is | actually more nuanced, and absolutes just obscure the | technical complexity. | megachucky wrote: | Here is my answer to whether Kafka is a database: | | (the answer is yes, but you should still not try to replace every | other database) | | https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-... | | Would be curious what the people here think about my post, too. | Disclaimer: I work for Confluent - and I am also happy about | critical feedback. | fredliu wrote: | Kafka is essentially commit logs, which are at the core of any | traditional database engine. Streaming is just turning the guts | of the DB inside out (mostly for scalability reasons), while a DB is | a wrapped-up commit log that provides higher-level functionality | (ACID, Transactions, etc.). It's two sides of the same coin, yin | and yang of the same thing... But on the practical side of | things, yes, if what you need is indeed what's described | in this article, your life would be easier with a traditional DB. | hasanic wrote: | I mean, duh? Did Apache Kafka ever make the claim that it is a | database? 
| | Other things that are not a database: Apache Traffic Server, | Apache Mahout, Apache Jakarta, Apache ActiveMQ... hundreds of | these exist. | dang wrote: | The title sounds generic, but the article makes it clear that | it's responding to specific proposed uses of Kafka. | Spivak wrote: | I feel like the inventory thing is a bit of a straw-man because | the situation is set out in such a way that you need transactions | for it to work. If you find yourself wishing you had a global | write-lock on a topic, then of course it won't work. Modeling | your data for Kafka is work just the same as it is for MySQL. Of | course it might not be the best tool for the job but you should | at least give it a fair shake. | | You should be able to post "buy" messages to a topic without fear | that it messes up your data integrity. Who cares if two people | are fighting over the last item? You have a durable log. Post | both "buys" and wait for the "confirm" message from a consumer | that's reading the log at that point in time, validates, and | confirms or rejects the buys. At the point that the buy reaches a | consumer there is enough information to know for sure whether | it's valid or not. Both of the buy events happened and should be | recorded whether they can be fulfilled or not. | jacques_chester wrote: | > _Who cares if two people are fighting over the last item?_ | | The two people, at least. Customers tend to be a bit | underwhelmed by "well, the CAP theorem..." as a customer | service script. | Spivak wrote: | None of this shows up as user-facing any differently than a | relational database. No CAP theorem at all. | | Kafka: | | User clicks buy and it shows "processing", which behind the | scenes posts the buy message and waits for a "confirmed" | message. When it's confirmed the user is directed to success! If | someone else posts the buy before them they get back a | "failed: sold out" message.
| | Relational: | | User clicks buy and it shows "processing", which behind the | scenes tries to get a lock on the db, looks at inventory, | updates it if there's still one available, and creates a row | in the purchases table. If all this works the user is | directed to success. If by the time the lock was acquired the | inventory was zero the server returns "failure: sold out". | jacques_chester wrote: | The CAP theorem line was smart-arsery. | | The thing here is that the database can update the cart and | the inventory in one logical step, to the exclusion of | others. | | The Kafka approach doesn't guarantee that out of the box, | leading to the creation of de facto locking protocols | (write cart intent, read cart intent, write inventory | intent ...). A traditional database does that for you with | selectable levels of guarantees. | Spivak wrote: | You're absolutely right but I think the advantage is that | to the users of the database it doesn't really feel like | there's any locking going on and it's this implicit thing | that's constructed from the history. | | Like for a hypothetical ticket selling platform you just | get a log like
| 00:00 RESERVE AAA FOR xxx, 5 min
| 00:02 BUY BBB FOR qqq
| 00:15 RESERVE AAA FOR yyy, 5 min
| 00:23 CONFIRM AAA FOR xxx
| 00:25 CONFIRM BBB FOR qqq
| 00:27 REJECT AAA FOR yyy, "already reserved"
| 05:16 RESERVE AAA FOR zzz, 5 min
| 05:34 CONFIRM AAA FOR zzz
| | So although there's "locking" going on here enforced by | the consumer sending the confirms, the producer just sends | intents/events and sees whether they're successful or not, | and both producer and consumer are stateless. | | I guess it just depends on which model you think is more | a PITA to work with. | AndyPa32 wrote: | Yeah, if you are lucky enough that your cart and | inventory are stored in the same database. | jacques_chester wrote: | Of course.
That's why Materialize is interesting: they | are apparently able to deliver a greatly improved | compute:result ratio over the previous state of the art. | j-pb wrote: | Tbh, it's a weird blog post coming from the Materialize folks, | considering they know better. | | The "event sourced" arch they sketched is missing pieces. Normally | you'd have single writer instances that are locked to the | corresponding Kafka partition, which ensure strong transactional | guarantees, IF you need them. | | Throwing shade for marketing's sake is something that they should | be above. | | I mean c'mon, I'd argue that Postgres enhanced with Materialize | isn't a database anymore either, but in a good sense! | | It's building material. A hybrid between MQ, DB, backend logic & | frontend logic. | | The reduction in application logic and the increase in | reliability you can get from reactive systems is insane. | | SQL is declarative; reactive Materialize streams are declarative | on a whole new level. | | Once that tech makes it into other parts of computing like the | frontend, development will be so much better: less code, fewer | bugs, a lot more fun. | | Imagine that your React component could simply declare all the | data it needs from a db, and the system would figure out all the | caching and rerendering. | | So yeah, they have awesome tech with many advantages, so I don't | get why they bad-mouth other architectures. | dkhenry wrote: | I am not terribly surprised. The Materialize team was | previously at CockroachDB, which also had a habit of putting out | marketing material like this. | _jal wrote: | I ignored them for a long time because of it. I just assumed | they were lightweights being annoying because they didn't | have anything else. | | It serves as anti-marketing, at least to me. | j-pb wrote: | Maybe I'm biased because I'm such a huge Frank McSherry | fanboy. The differential dataflow work he does in Rust is | simply awesome, and he writes great papers too!
| | Little known fun fact: the Rust type- and borrow-checker uses | a Datalog engine internally to express typing rules, and that | engine was written and improved by Frank McSherry. So whenever | you hit compile on a Rust program, you're using a tiny bit of | Materialize tech. | dwelch2344 wrote: | Totally agree with this. "Kafka is terrible if you use it with | poor practices" should be the title of this. It's totally | clickbait targeting the "Database Inside Out" articles / | concept, which, if you read the most common 3-page high-level | treatment, explicitly states you should be materializing up into other | data sources / structures / systems / etc. | | To date, I've never written code that reads from a Kafka topic | that wasn't taking data and transforming + materializing it | into domain intelligence. | georgewfraser wrote: | We're trying to address a real problem that is happening in our | industry: VPs of eng and principal engineers at startups are | adopting the "Kappa Architecture" / "Turning the Database | Inside Out", without realizing how much functionality from | traditional database systems they are leaving behind. This has | led to a barrage of consistency bugs in everything from food- | delivery apps to the "unread message count" in LinkedIn. We're | at the peak of the hype cycle for Kafka, and it's being used in | all kinds of places it doesn't belong. For 99% of companies, a | traditional DBMS is the right foundation. | arjunnarayan wrote: | Hi! I'm one of the two authors here. At Materialize, we're | definitely of the 'we are a bunch of voices, we are people | rather than corp-speak, and you get our largely unfiltered | takes' flavor. This is my (and George's from Fivetran) take. In | particular this is not Frank's take, as you attribute below :) | | > SQL is declarative, reactive Materialize streams are | declarative on a whole new level. | | Thank you for the kind words about our tech, I'm flattered! | That said, this dream is downstream of Kafka.
Most of our | quibbles with the Kafka-as-database architecture are to do with | the fact that that architecture neglects the work that needs to | be done _upstream_ of Kafka. | | That work is best done with an OLTP database. Funnily enough, | neither of us is building OLTP databases, but this piece | largely is a defense of OLTP databases (if you're curious, yes, | I'd recommend CockroachDB), and their virtues at the head of | the data pipeline. | | Kafka has its place - and when it's used downstream of CDC from | said OLTP database (using, e.g., Debezium), we could not be | happier with it (and we say so). | | The best example is foreign key checks. Things get ugly if | you ever need to enforce them (which translates | to checking a denormalization of your source data | _transactionally_ while deciding whether to admit or deny an | event). This is something that you may not need in your data | pipeline on day 1, but adding it in later is a trivial schema | change with an OLTP database, and exceedingly difficult with a | Kafka-based event-sourced architecture. | | > Normally you'd have single writer instances that are locked | to the corresponding Kafka partition, which ensure strong | transactional guarantees, IF you need them. | | This still does not deal with the use-case of needing to add a | foreign key check. You'd have to:
| 1. Log "intents to write" rather than writes themselves in Topic A.
| 2. Have a separate denormalization computed and kept in a separate Topic B, which can be read from. This denormalization needs to be read until the intent propagates from Topic A.
| 3. Convert those intents into commits.
| 4. Deal with all the failure cases in a distributed system, e.g. cleaning up abandoned intents, etc.
| | If you use an OLTP database, and generate events into Kafka via | CDC, you get the best of both worlds. And hopefully, yes, have | a reactive declarative stack downstream of that as well!
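The intent/commit dance arjunnarayan outlines can be sketched in a few lines of plain Python (a toy, single-threaded model; the event shapes and names are hypothetical, and the genuinely hard part, the failure handling of step 4, is elided entirely). A lone validator serializes the intents and turns each one into a commit or an abort by checking a foreign key against denormalized state:

```python
def validate_intents(intents, committed_users):
    """Serially turn order intents into commits or aborts.

    `intents` plays the role of the "Topic A" stream of
    ("order", order_id, user_id) events; `committed_users` plays
    the role of the "Topic B" denormalization consulted for the
    foreign-key check.
    """
    users = set(committed_users)
    decisions = []                      # the downstream commit/abort log
    for _, order_id, user_id in intents:
        if user_id in users:            # the foreign-key check
            decisions.append(("commit", order_id))
        else:
            decisions.append(("abort", order_id))
    return decisions

intents = [("order", "o1", "alice"), ("order", "o2", "mallory")]
print(validate_intents(intents, ["alice", "bob"]))
# [('commit', 'o1'), ('abort', 'o2')]
```

The point of the sketch is that the check only works because a single validator sees the intents in one serial order; distributing that validator is exactly where the abandoned-intent cleanup of step 4 comes in.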
| skybrian wrote: | It seems like Kafka could be used to record requests and | responses (upstream of the database) and approved changes | (downstream of the database), but isn't good for the approval | process itself? | skyde wrote: | > If you use an OLTP database, and generate events into Kafka | via CDC, you get the best of both worlds. | | 100% agree this is the way to go. Instead of rolling your own | transaction support, you get the "ACID" for free from the DB | and use Kafka to archive changes and subscribe to them. | vvern wrote: | > 1. Log "intents to write" rather than writes themselves in | Topic A 2. Have a separate denormalization computed and kept | in a separate Topic B, which can be read from. This | denormalization needs to be read until the intent propagates | from Topic A. 3. Convert those intents into commits. 4. Deal | with all the failure cases in a distributed system, e.g. | cleaning up abandoned intents, etc. | | People do do this. I have done this. I wish I had been more | principled with the error paths. It got there _eventually_. | | It was a lot of code and complexity to ship a feature which | in retrospect could have been nearly trivial with a | transactional database. I'd say months rather than days. I | won't get those years of my life back. | | The products were built on top of Kafka, Cassandra, and | Elasticsearch where, over time, there was a desire to | maintain some amount of referential integrity. The only | reason we bought into this architecture at the time was | horizontal scalability (not even multi-region). Kafka, sagas, | 2PC at the "application layer" can work, but you're going to | spend a heck of a lot on engineering. | | It was this experience that drove me to Cockroach and I've | been spreading the good word ever since. | | > If you use an OLTP database, and generate events into Kafka | via CDC, you get the best of both worlds. | | This is the next chapter in the gospel of the distributed | transaction.
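The "transact to change state, react to state changes" pattern both commenters endorse can be sketched with an outbox-style example, using the stdlib sqlite3 module as a stand-in for the OLTP database (the table names are illustrative; in production a CDC tool such as Debezium tails the database's log rather than polling an outbox table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT)")
con.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT)")

# The business write and the event it implies commit atomically:
# either both become durable or neither does.
with con:
    con.execute("INSERT INTO orders VALUES ('o1', 'alice')")
    con.execute("INSERT INTO outbox (event) VALUES ('order_created:o1')")

# A relay (CDC, or a simple poller) then ships outbox rows to Kafka in
# commit order, so downstream consumers only ever react to committed state.
events = [row[0] for row in con.execute("SELECT event FROM outbox ORDER BY seq")]
print(events)  # ['order_created:o1']
```

Because the event row rides in the same transaction as the business write, there is no window in which an event exists for a write that was rolled back, which is precisely the guarantee the Kafka-first architecture struggles to provide.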
| gunnarmorling wrote: | >> If you use an OLTP database, and generate events into | Kafka via CDC, you get the best of both worlds. | | > This is the next chapter in the gospel of the distributed | transaction. | | Actually, it's the opposite. CDC helps to _avoid_ | distributed transactions; apps write to a single database | only, and other resources (Kafka, other databases, etc.) | based on that are updated asynchronously, eventually | consistent. | vvern wrote: | I mean, I hear you that people do that and it does allow | you to avoid needing distributed transactions. When | you're stuck with an underlying NoSQL database without | transactions, this is the best thing you can do. This is | the whole "invert the database" idea that ends up with | [1]. | | I'm arguing that that usage pattern was painful and that | if you can have horizontally scalable transaction | processing and a stream of events due to committed | transactions downstream, you can use that to power | correct and easy-to-reason-about materializations as well | as asynchronous task processing. | | Transact to change state. React to state changes. | Prosper. | | [1] https://youtu.be/5ZjhNTM8XU8?t=2075 | AndrewKemendo wrote: | Alternatively from Jay Krebs [1] a much more thorough and nuanced | discussion that is probably the best write-up on this topic. | | "So is it crazy to do this? The answer is no, there's nothing | crazy about storing data in Kafka: it works well for this because | it was designed to do it. Data in Kafka is persisted to disk, | checksummed, and replicated for fault tolerance. Accumulating | more stored data doesn't make it slower. There are Kafka clusters | running in production with over a petabyte of stored data." | | [1] https://www.confluent.io/blog/okay-store-data-apache-kafka/ | pram wrote: | Log compaction definitely isn't problem-free. I'd say it was | crazy when he wrote that. | | We ran a cluster with lots of compacted topics, hundreds of | terabytes of data.
At the time it would make broker startup | insanely slow. An unclean startup could literally take _an | hour_ to go through all the compacted partitions. It was awful. | netn3twebw3b wrote: | Agreed. I feel like the tech community is afflicted with | collective functional fixedness, or some sort of essentialism. | | At its core it's electron state in hardware. So long as those | limits are not incidentally exceeded, and you validate outputs, | who really cares what gets loaded? | | While we rip rare minerals from the ground and toss all that at | scale every 3-5 years later we get economical over installing | software. | | So long as it offers the necessary order of operations to do | the work, whatever. | | https://en.m.wikipedia.org/wiki/Functional_fixedness | hodgesrm wrote: | Thanks, this is a great article. The money quote for me is: | | > I think it makes more sense to think of your datacenter as a | giant database, in that database Kafka is the commit log, and | these various storage systems are kinds of derived indexes or | views. | | My company supports analytic systems and we see this pattern | constantly. It's also sort of a Pat Helland view of the world | that subsumes a large fraction of data management under a | relatively simple idea. [1] What's interesting is that Pat also | sees it as a way to avoid sticky coordination problems as well. | | [1] http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf | skyde wrote: | The problem is that the Kafka API and a commit-log API are very | different. | | If you wanted to literally use Kafka as your commit log, the | same way Amazon Aurora uses a distributed commit | log, you would find that a lot of features a commit log needs | are missing and impossible to add to Kafka. | georgewfraser wrote: | That post explains that there are scenarios where it makes | sense to store data permanently in Kafka.
"Kafka is Not a | Database" makes a different point, which is that Kafka doesn't | solve any of the hard transaction-processing problems that | database systems do, so it's not an alternative to a | traditional DBMS. This is not a straw man---Confluent advocates | for "turning the database inside out" all over their marketing | materials and conference talks. | cfontes wrote: | It actually does, with exactly-once semantics; in fact I've | been using it as the single source of truth in a cash management | system for almost 2 years without a single issue related to | transactions. | jashmatthews wrote: | How do you deal with side effects outside of the Kafka | cluster? | theptip wrote: | Short answer: you write your consumer's state into the | same DB as you're writing the side-effects to, in the | same transaction. | | Long answer: say your consumer is a service with a SQL DB | -- if you want to process Event(offset=123), you need to | 1. start a transaction, 2. write a record in your DB | logging that you've consumed offset=123, 3. write your | data for the side-effect, 4. commit your transaction. | (Reverse 2 and 3 if you prefer; it shouldn't make a | difference.) If your side-effect write fails (say your DB | goes down) then your transaction will be broken, your | side-effect won't be readable outside the transaction, | and the update to the consumer offset pointer also won't | get persisted. Next loop around on your consumer's event | loop, you'll start at the same offset and retry the same | transaction. | arafsheikh wrote: | Persisting message offsets in the DB has its own challenges. | The apps become tightly coupled with a specific Kafka | cluster and that makes it difficult to swap clusters in | case of a failover event. | | If you expect apps to persist offsets then it's important | to have a mechanism/process to safely reset the app state | in the DB when the stored offset doesn't make sense.
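theptip's "offset and side effect in one transaction" recipe can be sketched with the stdlib sqlite3 module standing in for the consumer's SQL DB (the schema, topic name, and payload are illustrative, not from any real system):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE consumer_offsets (topic TEXT PRIMARY KEY, next_offset INTEGER)")
con.execute("CREATE TABLE side_effects (msg_offset INTEGER PRIMARY KEY, payload TEXT)")
con.execute("INSERT INTO consumer_offsets VALUES ('orders', 0)")

def process(con, offset, payload):
    """Apply one event's side effect and advance the offset atomically.

    The PRIMARY KEY on side_effects turns redelivery of an
    already-processed offset into a failed transaction, so the
    side effect is applied at most once.
    """
    with con:  # BEGIN ... COMMIT; rolled back on any exception
        con.execute("INSERT INTO side_effects VALUES (?, ?)", (offset, payload))
        con.execute(
            "UPDATE consumer_offsets SET next_offset = ? WHERE topic = 'orders'",
            (offset + 1,),
        )

process(con, 0, "charge card")
try:
    process(con, 0, "charge card")  # redelivery after a crash/rebalance
except sqlite3.IntegrityError:
    pass                            # already processed; safe to skip
print(con.execute("SELECT next_offset FROM consumer_offsets").fetchone()[0])  # 1
print(con.execute("SELECT COUNT(*) FROM side_effects").fetchone()[0])         # 1
```

Because the side effect and the offset advance commit or roll back together, a crash between Kafka delivery and the DB commit simply replays the event into the same (now failing or retried) transaction, which is the effectively-once behavior theptip describes.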
| wdb wrote: | Interesting, do you have any resources about how to best | handle side effects with message queues (e.g. GCP | PubSub)? Trying to find out if it's worth the effort, or | good practice, to allow replay-ability (like from backup) | of messages and get back to the same state. | andrewem wrote: | *Kreps, not Krebs | gopalv wrote: | > Accumulating more stored data doesn't make it slower | | That is a valid theory when we talk about readers which look at | recent data or when you are trying to append data to the | existing system. | | But in practice, the accumulation of cold data on a local disk | is where this starts to hurt, particularly if it has to serve | read traffic which starts from the beginning of time (i.e. your | queries don't start with a timestamp range). | | KSQL transforms do help reduce the depth of the traversal, by | building flatter versions of the data set, but you need to | repartition the same data on every lookup key you want - so if | you had a video game log trace, you'd need multiple | materializations for (user), (user,game), (game), etc. | | And on the local storage part, EBS is expensive if you just hold | cold data but then replicate it to maintain availability | during a node recovery - EBS is like a 1.5x redundant store, | better than a single node. I liked the Druid segment model of | shoving it off to S3 and still being able to read off it (i.e. not | just streaming to S3 as a dumping ground). | | When Pravega came out, I liked it a lot for the same reason - but it | hasn't gained enough traction. | halbritt wrote: | I recommend his book, "I Heart Logs". It's a short read, but it | changed my perspective. | tutfbhuf wrote: | Well, then you have never heard of ksqlDB. It adds SQL and DB | features to Kafka. It is backed by Confluent, the company | founded by the team that initially developed Kafka at LinkedIn. | | https://ksqldb.io | albertwang wrote: | We're very aware of ksqlDB.
I would recommend this video from | last week where Matthias talks about some of the strengths and | weaknesses of ksql: https://youtu.be/KUQuegJ4do8 | cirego wrote: | My understanding is that ksqlDB is a read-only interface on top | of streams and only helps people write better consumers. The | problems mentioned in the blog post relate to producers. | zaphar wrote: | If it stores data it's a database. Filesystems are databases, | MongoDB is a database. LevelDB is a database. Postgres and MySQL | are databases. Kafka is a database. They are all very different | in features and functionality though. | | What the authors mean is that Kafka is not a traditional database | and doesn't solve the same problems that traditional databases | solve. Which is a useful distinction to make but is not the | distinction they make. | | The reality is that database is now a very general term and for | many usecases you can choose special-purpose databases for | what you need. | yumaikas wrote: | I think I'd differentiate between a database and a data store. | | I'd argue that a filesystem is a data store, rather than a | database. | Hamuko wrote: | Some say that Microsoft Excel is the world's most popular | database engine. | Baeocystin wrote: | Honestly, I'd agree with that, too, at least in a | colloquial sense. | zaphar wrote: | What is your distinction? Is mongodb a database? What about | leveldb? | | Whether we want it to be so or not, the term database is much | more encompassing than it used to be. You can try to fight | that change if you want to, but it means you'll be speaking a | different language than most of the rest of us as a result. | IfOnlyYouKnew wrote: | You're using the term "traditional database" two levels up. | Whatever your definition of that is, it's their definition | of "database".
| | Your initial post says "I have a different definition of | 'database'", which is different than anyone else's, as you | acknowledge when you refer to that common definition as a | 'traditional database'. | | Then, you fault them for making an argument that is invalid | for your non-standard definition of a database! | | You would like Kafka, the author. | yumaikas wrote: | Honestly, where I'd draw the distinction between a database | and a datastore is the ability to do multi-statement | transactions. So, a filesystem? A datastore. Some support | fsync on a single file, but multi-file sync isn't usually | supported. | | MongoDB and LevelDB both support transactions, so I'd err | on the side of calling them databases myself. | edgyquant wrote: | I'd say initially file systems were data stores, but once they | developed hierarchies they became more akin to a database. | I'm not sure there's a huge difference, but it seems a | database is a collection of data stores (though there is | probably a more technical and correct definition). ___________________________________________________________________ (page generated 2020-12-08 23:00 UTC)