[HN Gopher] What every software engineer should know about Apach... ___________________________________________________________________ What every software engineer should know about Apache Kafka Author : aloknnikhil Score : 127 points Date : 2020-05-16 19:40 UTC (3 hours ago) (HTM) web link (www.michael-noll.com) (TXT) w3m dump (www.michael-noll.com) | [deleted] | gnfargbl wrote: | Something that I wish I had known about Apache Kafka a year or so | ago is that it essentially has no support for long-running tasks, | i.e. tasks where longest-possible-worker-execution-time >> | longest-tolerable-group-rebalance-time. | | After much angst in trying to work around this issue, I finally | gave up and switched to Pulsar. Pulsar isn't without it's own | issues (mostly around bugs and general maturity) but it handles | this particular scenario admirably. | ketralnis wrote: | It's true, message buses and work queues have different | characteristics. It sounds like you want a work queue, not a | message bus. I have very successful experience with using | rabbitmq for work queueing, but as you mention there are others | too. | biggestlou wrote: | Pulsar works quite well as a message queue: | https://pulsar.apache.org/docs/en/cookbooks-message-queue/ | gnfargbl wrote: | You're right. The issue is that in this particular | application I need _both_ a work queue and a message bus. | I've also successfully used Rabbit as a work queue, but it's | not high-throughput enough to meet my messaging needs. Pulsar | seems to cope well in both roles. | | All that said, if I had my time again I'd probably just use | one of the cloud providers' solutions and spend my efforts | elsewhere... | Traster wrote: | There seems to be this common problem among relatively new | technologies, that they're not _actually_ aware of what the | average person knows about them. So let me be the moron in the | room. I work at a company that uses Kafka. What I know so far is | that Kafka is broken. It seems to me that this article is more | about what every software engineer who plans to re-skill as a | kafka engineer should know. | oweiler wrote: | In which way is Kafka broken? | ddevault wrote: | I was going to say, what every software engineer should know | about Kafka is "don't use it". This is more a list of things | people who are already stuck with Kafka should know. | skyde wrote: | What every engineer should know about Kafka is that it should not | be used for anything critical like you would use Cassandra or | Hbase. | | But if you are ok with partitions not being available for many | hours or losing all written data because the cluster did not | automatically move parution to 3 new replica after 2 of the | replica failed ... then it's a good scalable(speed) product. | | There is also no serious multi tenant support. So if you need | multitenancy you gotta use kubernete and do one cluster per | tenant and automate that yourself. | kevindeasis wrote: | anyone wanna share their thoughts about deploying their own | messaging system vs using a messaging system provided by their | cloud provider? | antoncohen wrote: | If you are on GCP I think the choice is simple, use Cloud | Pub/Sub. Extremely simple, extremely reliable, extremely | performant, fairly inexpensive, multi-region (global). No | maintenance, no scaling, almost no tunables, it just works. | | Google provides a Pub/Sub emulator for local development. | | I don't really buy the vendor lock-in thing for Pub/Sub-like | systems. The Cloud Pub/Sub usage pattern is basically the same | as Kafka, you can have a library that abstracts away the | differences. There are open source libraries that do that[1]. | If you ever need to switch cloud providers, or want a messaging | system to span cloud providers, you can switch without changing | lots of code. | | [1] https://github.com/google/go-cloud/tree/master/pubsub | peterhunt wrote: | I don't think it's that simple unless I misunderstand how GCP | Pubsub works. I don't think GCP PubSub will give you | deterministic delivery order within a partition the way Kafka | will: https://cloud.google.com/pubsub/docs/ordering | batter wrote: | We have Kafka and GCP pubsub. Kafka is the way to go for | us. In terms of reliability, performance, load, etc. | oweiler wrote: | We use Amazon MSK and are pretty happy with it so far. | dtech wrote: | Distributed messaging is _really_ hard to get right. It 'll | seem to work fine right up until you get weird bugs and | unreliability during at the worst moments. | | I wouldn't recommend relying primarily on something vendor- | specific like Amazon SQS, but there are very good out-of-the- | box tools like RabbitMQ or Kafka available. | | Writing your own messaging system is like writing your own | database, it's the wrong choice 99.9% of the time. | TheBlight wrote: | Avoiding vendor lock-in is the only quasi-sane reason I can | think of. | sz4kerto wrote: | Testability. Single biggest reason for us not going with the | proprietary queue. We can create, reset, throw away queues as | we want during testing, thousands times a day. Even on a | laptop. | sa46 wrote: | Do you have any advice for setting up a test Kafka | environment? I'd love to be able setup a lightweight in- | memory Kafka for unit-y tests without going through the whole | Docker compose rigamarole. | realtalk_sp wrote: | The GCP Pub/Sub API has largely replicated all the features | you'd want out of Kafka (including Consumer Groups). The | primary consideration at this point is cost. There's an | inflection point in size (at some very large message volume) | where it makes sense to start running your own Kafka cluster | and hire a dedicated person or two to manage it. Most companies | will never get anywhere close. | | Any project just starting out should use Pub/Sub. One thing I | really like is that GCP provides emulators of Pub/Sub et al for | local testing. That used to be a bit of an obstacle not too | long ago. | | In terms of lock-in, I don't see how that applies to an AMQ. | The data moving through it should only be transiently | persisted, up to a week or two at most in the usual case. | | If you want to avoid cloud lock-in, have DB backups, use | Postgres/MySQL/etc, containerize your service(s), replicate | data in object storage, etc. Common sense stuff, if that's | something that's of concern. | | Personally, I've seen "vendor lock-in" weaponized as an excuse | for a lot of costly NIH bullshit. It's painful to reflect back | on a project that could have involved literally a tenth of the | time and pain it ended up taking because of that one choice | alone. | gigatexal wrote: | IMO lock-in fears are overblown. Build stuff out quickly and | prove your idea and then refactor when you have customers and | revenue. | SpicyLemonZest wrote: | Refactoring a growing system while maintaining bug-for-bug | compatibility is extraordinarily hard. Most people I know | who've gone through such a migration never want to do it | again. | nojito wrote: | Much much cheaper. | cameronbrown wrote: | Dev time would far outweigh the cost of a managed equivalent. | dajohnson89 wrote: | does that factor in development costs? | Rebelgecko wrote: | I'm surprised by the amount of criticism in this thread. I've | used Kafka in the past and it definitely got the job done (as a | message bus, not using stream processing or the other more whiz- | bang features). What do people use instead? | skyde wrote: | If you need. Ordered log abstraction (Apache pulsar, Facebook | log device ...) if you need a real transactional and highly | available message broker any (my/jms) server like Rabbitmq IBM | MQ .... | | Kafka in my experience is always the worse solution unless you | need to aggregate http server log to do offline analytics using | something like spark | sixhobbits wrote: | So much criticism here! I've read a lot about Kafka over the last | few years and I wish I had read this article earlier -- even | basic questions like "Can Kafka store data persistently?" are not | adequately answered in many intros to it. | | That said, I do find the tutorial flip-flops a bit in target | audience. It's mainly "this is what Kafka is", but sometimes has | weird asides like "This is how to optimise Kafka" (redundancy, | number of partitions, etc) which are pretty distracting from the | more fundamental points. | seemslegit wrote: | Hmm, I'm pretty sure that a software engineer developing safety- | critical firmware for embedded medical systems does not need to | know anything about Apache Kafka. Or a game developer. Or a web | frontend developer. Given the title it's surprising how many | software engineers can in fact go through life and career without | ever knowing anything about Apache Kafka. | vsareto wrote: | By now, "What Every Software Engineer Should Know" headlines | aren't intended to be serious. | seemslegit wrote: | I wish, instead they are just not intended to be taken | literally. | cosmotic wrote: | Cool click-bait title | fmjrey wrote: | Not necessarily, it's most likely a reference to the popular | article from 2013: | https://engineering.linkedin.com/distributed-systems/log-wha... | ken wrote: | Or directly from 1991's "What Every Computer Scientist Should | Know About Floating-Point Arithmetic". | pierrec wrote: | Where the "every" was actually justified, contrarily to its | descendants. | georgewfraser wrote: | This notion of "stream-table duality" might be the most | misleading, damaging idea floating around in software engineering | today. Yes, you can turn a stream of events into a table of the | present state. However, during that process you will eventually | confront every single hard problem that relational database | management systems have faced for decades. You will more or less | have to write a full-fledged DBMS in your application code. And | you will probably not do a great job, and will end up with dirty | reads, phantoms, and all the other symptoms of a buggy database. | | Kafka is a message broker. It's not a database and it's not close | to being a database. This idea of stream-table duality is not | nearly as profound or important as it seems at first. | Ozzie_osman wrote: | > This notion of "stream-table duality" might be the most | misleading, damaging idea floating around in software | engineering today. | | No. The notion of a "stream-table duality" is a powerful | concept, that I've found can change the way any engineer thinks | about how they are storing / retrieving data for the better | (it's an old idea, rooted in Domain Driven Design, but for some | reason a lot of engineers, myself included, still need to | learn, or relearn, it). | | The notion that _relying_ on a stream as the primary data | persistence abstraction or mechanism in a production is the | misleading part, at least for now. I 'd argue Kafka pushes us | in a direction that makes progress along that dimension, and | you can apply it successfully with a lot of effort. But to | match what you can get from a more traditional DBMS? The tech | just isn't there (yet). | manigandham wrote: | That's why ksqlDB exists and handles all that for you, turning | streams into tables that you can query. | georgewfraser wrote: | ksql does not solve any of the hard consistency or contention | problems you will face if you attempt to use Kafka as a | datastore. Consider the simplest possible example: you write | an "update event" to a topic and then read a ksql view of | that topic. The view may or may not yet reflect the update. | This is called read-after-write consistency, and you will | need to create it in your application code. | wenc wrote: | A quibble I have with the term "stream-table duality" is that | it's not true duality. | | You can construct a state (table) from a stream, but you cannot | do the reverse. You cannot deconstruct a table into its | original stream because the WAL / CDC information is lost -- | you can only create a new stream from the table. This means you | lose all ability to retrieve previous states by replaying the | stream. Information is lost. | | Duality in math is an overloaded term but it generally means | you can go in either direction. This is not true here. | epistasis wrote: | The point of that is for people trying to figure out why | streams are a useful abstraction at all what's needed to make | them useful are some sort of aggregation, and of course tabular | state is a common end point. | | The article does not recommend writing this code yourself, it | shows how to aggregate data into usable forms. | | So I think your concerns may be a bit overblown. If you think | that ksqlDB or Kafka Streams, the tools shown in this blog | post, are are at risk for what you warn, this comment would be | a valid criticism. But it's clear that the article isn't | advocating for people to write their own versions of that... | SpicyLemonZest wrote: | Yes, they're definitely at risk. ksqlDB does not appear to | have transactions at all. | strictfp wrote: | I always found it ironic that you get most of this for free if | you design your sql updates and save/query the transaction log | and/or history. A lot of relational dbs have functionality for | that. | | And if you don't want to use that, there's also products for | this specifically, such as event store. | zok3102 wrote: | Reminds me of that time when database vendors were overreaching | to be message queues/brokers - OracleAQ, MSMQ, etc. | math wrote: | At the moment you more or less need to write a DBMS in your app | code, but I don't think that's the end state. I think what | we're seeing the beginnings of something big - it just might | not seem like it yet because it's the v1 / no where near | complete version. I think having all your data in a single | system (Kafka, KsqlDb, ..) that allows you to work with it in | cross paradigm ways will turn out to be very compelling. | eternalban wrote: | > At the moment you more or less need to write a DBMS in your | app code | | Since we're discussing misunderstandings in the community, it | should be pointed out that a _Database Management System_ | (DBMS) is not merely a _database_ , to say nothing of a _data | store_. Oracle, Postgres, et al are genuine "DBMS". ATM you | are very likely putting together a data store in your app | code. | tomnipotent wrote: | You're conflating DBMS with RDBMS. | | I interpreted the OP as DBMS=database, which absolutely | includes application code that stores and retrieves data in | proprietary formats. | | Even a linked list mmap'd to disk can be a database, just | maybe not a very good one. | eternalban wrote: | A RDBMS is simply a _Relational_ Database Management | System. A quick tour of CS history reveals ancient | curiosities such as _Hierarchical_ Database Management | Systems. And there is more: | | https://www.studytonight.com/dbms/database-model.php | tomnipotent wrote: | There are many kinds of databases. Graph, key/value, | relational, hierarchal etc. That doesn't change the fact | that any app that writes code to store and retrieve data | is creating their own database or using someone else's. | eternalban wrote: | There are many kinds of data models. A DBMS is a DBMS, | and typically a specific DBMS supports a specific data | model. A log structured file is at most a data store. | | To be fair this is a somewhat fuzzy categorization (DBMS, | DB, Data Store) and it can cause confusion. An DBMS is a | _system_ , just like it says right on the acronym tin. A | DBMS is a system that provides capabilities such DDL, | DML, auxiliary processes, etc. to _manage_ a database. | There are various data models, e.g. a triple store, for | databases, but the data model X is a orthogonal to XDBMS. | ckdarby wrote: | What every software engineer should know about Kafka, it's dead. | | If you're not already technically chained into it and Confluence | hasn't already upsold your poor organization avoid it. | | If you want the early flexibility and the rapid PoC just look at | AWS Kinesis/Firehouse. | | If you're looking at large scale (+1 gbit ingest, 100k/s, kind of | stuff) then Apache Pulsar is where to go. | gigatexal wrote: | this is so timely, thank you! | opportune wrote: | >With Kafka, such a stream may record the history of your | business for hundreds of years | | Do not do this. Kafka is not a database! Kafka should never be | the source of truth for your business. The source of truth should | be in whatever consumes data from Kafka downstream when messages | are committed as read. Why? Because in your middle layer you can | do all the data normalization, sanity checking, processing, and | interaction with a REAL database system downstream that can give | you things like transactions, ACID, etc. | | Of course Confluent _wants_ you to try to use Kafka as a DB, so | your usage of it is very high and you pay for the top support | package and they have you by the cajones, but that doesn 't mean | you should do that. You will miss out on all the benefits of | using a real database, with what benefit? Having a simple client | API? | sixdimensional wrote: | So, I've been having a back and forth with a colleague on this | and I'm genuinely interested in why you so strongly suggest | this. | | For the record, I have good real world experience with all | kinds of databases (relational, NoSQL, and even legacy | multivalue and hierarchical ones), and I don't see why what you | say has to be "always true". | | One way of looking at Kafka is that is an unbundled transaction | log, nothing more or less, so it could be used to permanently | store and replay transactional activity, if one wishes. Noting | that, even most databases don't store an immutable, permanent | transaction log (as they often grow to be huge and are | truncated every so often, and tables are used as the current | state). | | This article by Confluent seems to cover the topic (yes, | recognizing it is written by the very vendor you suggest is | trying to lock us in): https://www.confluent.io/blog/okay- | store-data-apache-kafka/. | | Ok, so how about the idea of a persistent, immutable, never- | ending transaction log (uhoh, sounds like blockchain now!)? | Setting aside Kafka for now, what do you think about the basic | design pattern? To me it sounds a bit like it could represent a | temporal database in raw transactional log form. Why not? | | EDIT: After rereading your comment I see your main concern is | using Kafka as a database management system (DBMS). I would | agree, that's not what Kafka is for. But, I don't think | Confluent intends that use case, do they? I look at it more as | an unbundled single component that is very useful by itself, | and is part of a more complex data platform/architecture (ex. | Lambda or Kappa architecture). | opportune wrote: | I agree that you can use Kafka as a raw event log - not | necessarily a transaction log unless you are basically | putting records into Kafka that don't need to be transformed, | which you _can_ do but probably don 't want/need to do. There | are situations in which you want to replay raw events, but in | most cases I think you want to replay actual transactions to | your DB, so it makes more sense to log your progress as you | commit using whatever consumes from Kafka rather than in | Kafka itself. | | My main concern is that when all you have is a hammer, or you | already have a hammer lying around, everything looks like a | nail. Kafka can be great as a raw event stream, and yes you | can store raw events forever, but are raw events really the | source of truth for your business? If your workflow is, as I | think is appropriate in almost all cases, Kafka->Consumer | service->DB, why do you want to rely on Kafka when you have a | consumer that can have better logic and custom handling | regarding how you actually interpret your events? Moreover, | why keep the data in Kafka when you can just plug it into a | temporal DB from your service? | unohoo wrote: | Use pulsar - so much better than kafka | lytedev wrote: | Why is that? | wpietri wrote: | If anybody has seen a good detailed comparison, I'd love to | read one. The first dozen hits were pretty weak. | dominotw wrote: | millions of topics, no zookeeper ect. Kafka is addressing | these shortcomings on the roadmap. | oweiler wrote: | For a lot a projects this is hardly a problem. On the other | hand Kafka is more mature and has a huge ecosystem (Kafka | Connect, Kafka Streams, KSQL, ...). | biggestlou wrote: | Pulsar is less mature but does provide functional | equivalents to all of the above. Pulsar IO (Kafka | Connect), Pulsar SQL (KSQL), Pulsar Functions (Kafka | Streams). | math wrote: | Kafka also has less moving parts even today before | zookeeper removal is complete (2 vs pulsar 3). | biggestlou wrote: | But one of those moving parts of Pulsar, BookKeeper, | means that you're no longer storing data on message | brokers. Worth the extra puzzle piece for a lot of use | cases. | nitwit005 wrote: | The Pulsar documentation says it requires Zookeeper: | https://pulsar.apache.org/docs/en/administration-zk-bk/ | dominotw wrote: | oh sorry i meant storing topic info in zookeper that | limits kafka to a certain number of topics. | mrunkel wrote: | I am all for self-reliance, but if you really want to influence | someone else, you might want to include a link to the project, | especially when the only word you share has a much more | prevalent meaning. ___________________________________________________________________ (page generated 2020-05-16 23:00 UTC)