[HN Gopher] What every software engineer should know about Apach...
       ___________________________________________________________________
        
       What every software engineer should know about Apache Kafka
        
       Author : aloknnikhil
       Score  : 127 points
       Date   : 2020-05-16 19:40 UTC (3 hours ago)
        
 (HTM) web link (www.michael-noll.com)
 (TXT) w3m dump (www.michael-noll.com)
        
       | [deleted]
        
       | gnfargbl wrote:
       | Something that I wish I had known about Apache Kafka a year or so
       | ago is that it essentially has no support for long-running tasks,
       | i.e. tasks where longest-possible-worker-execution-time >>
       | longest-tolerable-group-rebalance-time.
       | 
       | After much angst in trying to work around this issue, I finally
       | gave up and switched to Pulsar. Pulsar isn't without its own
       | issues (mostly around bugs and general maturity) but it handles
       | this particular scenario admirably.
        
         | ketralnis wrote:
         | It's true, message buses and work queues have different
         | characteristics. It sounds like you want a work queue, not a
         | message bus. I have very successful experience with using
         | rabbitmq for work queueing, but as you mention there are others
         | too.
        
           | biggestlou wrote:
           | Pulsar works quite well as a message queue:
           | https://pulsar.apache.org/docs/en/cookbooks-message-queue/
        
           | gnfargbl wrote:
           | You're right. The issue is that in this particular
           | application I need _both_ a work queue and a message bus.
           | I've also successfully used Rabbit as a work queue, but it's
           | not high-throughput enough to meet my messaging needs. Pulsar
           | seems to cope well in both roles.
           | 
           | All that said, if I had my time again I'd probably just use
           | one of the cloud providers' solutions and spend my efforts
           | elsewhere...
        
       | Traster wrote:
       | There seems to be this common problem among relatively new
       | technologies, that they're not _actually_ aware of what the
       | average person knows about them. So let me be the moron in the
       | room. I work at a company that uses Kafka. What I know so far is
       | that Kafka is broken. It seems to me that this article is more
       | about what every software engineer who plans to re-skill as a
       | kafka engineer should know.
        
         | oweiler wrote:
         | In which way is Kafka broken?
        
         | ddevault wrote:
         | I was going to say, what every software engineer should know
         | about Kafka is "don't use it". This is more a list of things
         | people who are already stuck with Kafka should know.
        
       | skyde wrote:
       | What every engineer should know about Kafka is that it should not
       | be used for anything critical like you would use Cassandra or
       | HBase.
       | 
       | But if you are ok with partitions not being available for many
       | hours or losing all written data because the cluster did not
       | automatically move partitions to 3 new replicas after 2 of the
       | replicas failed ... then it's a good scalable (speed) product.
       | 
       | There is also no serious multi-tenant support. So if you need
       | multitenancy you gotta use Kubernetes and do one cluster per
       | tenant and automate that yourself.
        
       | kevindeasis wrote:
       | anyone wanna share their thoughts about deploying their own
       | messaging system vs using a messaging system provided by their
       | cloud provider?
        
         | antoncohen wrote:
         | If you are on GCP I think the choice is simple, use Cloud
         | Pub/Sub. Extremely simple, extremely reliable, extremely
         | performant, fairly inexpensive, multi-region (global). No
         | maintenance, no scaling, almost no tunables, it just works.
         | 
         | Google provides a Pub/Sub emulator for local development.
         | 
         | I don't really buy the vendor lock-in thing for Pub/Sub-like
         | systems. The Cloud Pub/Sub usage pattern is basically the same
         | as Kafka, you can have a library that abstracts away the
         | differences. There are open source libraries that do that[1].
         | If you ever need to switch cloud providers, or want a messaging
         | system to span cloud providers, you can switch without changing
         | lots of code.
         | 
         | [1] https://github.com/google/go-cloud/tree/master/pubsub
        
           | peterhunt wrote:
           | I don't think it's that simple unless I misunderstand how GCP
           | Pubsub works. I don't think GCP PubSub will give you
           | deterministic delivery order within a partition the way Kafka
           | will: https://cloud.google.com/pubsub/docs/ordering
        
             | batter wrote:
             | We have Kafka and GCP pubsub. Kafka is the way to go for
             | us. In terms of reliability, performance, load, etc.
        
         | oweiler wrote:
         | We use Amazon MSK and are pretty happy with it so far.
        
         | dtech wrote:
         | Distributed messaging is _really_ hard to get right. It'll
         | seem to work fine right up until you get weird bugs and
         | unreliability at the worst moments.
         | 
         | I wouldn't recommend relying primarily on something vendor-
         | specific like Amazon SQS, but there are very good out-of-the-
         | box tools like RabbitMQ or Kafka available.
         | 
         | Writing your own messaging system is like writing your own
         | database, it's the wrong choice 99.9% of the time.
        
         | TheBlight wrote:
         | Avoiding vendor lock-in is the only quasi-sane reason I can
         | think of.
        
         | sz4kerto wrote:
         | Testability. Single biggest reason for us not going with the
         | proprietary queue. We can create, reset, throw away queues as
         | we want during testing, thousands of times a day. Even on a
         | laptop.
        
           | sa46 wrote:
           | Do you have any advice for setting up a test Kafka
           | environment? I'd love to be able to set up a lightweight
           | in-memory Kafka for unit-y tests without going through the
           | whole Docker Compose rigamarole.
        
         | realtalk_sp wrote:
         | The GCP Pub/Sub API has largely replicated all the features
         | you'd want out of Kafka (including Consumer Groups). The
         | primary consideration at this point is cost. There's an
         | inflection point in size (at some very large message volume)
         | where it makes sense to start running your own Kafka cluster
         | and hire a dedicated person or two to manage it. Most companies
         | will never get anywhere close.
         | 
         | Any project just starting out should use Pub/Sub. One thing I
         | really like is that GCP provides emulators of Pub/Sub et al for
         | local testing. That used to be a bit of an obstacle not too
         | long ago.
         | 
         | In terms of lock-in, I don't see how that applies to an AMQ.
         | The data moving through it should only be transiently
         | persisted, up to a week or two at most in the usual case.
         | 
         | If you want to avoid cloud lock-in, have DB backups, use
         | Postgres/MySQL/etc, containerize your service(s), replicate
         | data in object storage, etc. Common sense stuff, if that's
         | something that's of concern.
         | 
         | Personally, I've seen "vendor lock-in" weaponized as an excuse
         | for a lot of costly NIH bullshit. It's painful to reflect back
         | on a project that could have involved literally a tenth of the
         | time and pain it ended up taking because of that one choice
         | alone.
        
           | gigatexal wrote:
           | IMO lock-in fears are overblown. Build stuff out quickly and
           | prove your idea and then refactor when you have customers and
           | revenue.
        
             | SpicyLemonZest wrote:
             | Refactoring a growing system while maintaining bug-for-bug
             | compatibility is extraordinarily hard. Most people I know
             | who've gone through such a migration never want to do it
             | again.
        
         | nojito wrote:
         | Much much cheaper.
        
           | cameronbrown wrote:
           | Dev time would far outweigh the cost of a managed equivalent.
        
           | dajohnson89 wrote:
           | does that factor in development costs?
        
       | Rebelgecko wrote:
       | I'm surprised by the amount of criticism in this thread. I've
       | used Kafka in the past and it definitely got the job done (as a
       | message bus, not using stream processing or the other more whiz-
       | bang features). What do people use instead?
        
         | skyde wrote:
         | If you need an ordered log abstraction: Apache Pulsar,
         | Facebook LogDevice, ... If you need a real transactional and
         | highly available message broker: any (my/jms) server like
         | RabbitMQ, IBM MQ, ...
         | 
         | Kafka in my experience is always the worst solution unless you
         | need to aggregate HTTP server logs to do offline analytics
         | using something like Spark.
        
       | sixhobbits wrote:
       | So much criticism here! I've read a lot about Kafka over the last
       | few years and I wish I had read this article earlier -- even
       | basic questions like "Can Kafka store data persistently?" are not
       | adequately answered in many intros to it.
       | 
       | That said, I do find the tutorial flip-flops a bit in target
       | audience. It's mainly "this is what Kafka is", but sometimes has
       | weird asides like "This is how to optimise Kafka" (redundancy,
       | number of partitions, etc) which are pretty distracting from the
       | more fundamental points.
        
       | seemslegit wrote:
       | Hmm, I'm pretty sure that a software engineer developing safety-
       | critical firmware for embedded medical systems does not need to
       | know anything about Apache Kafka. Or a game developer. Or a web
       | frontend developer. Given the title it's surprising how many
       | software engineers can in fact go through life and career without
       | ever knowing anything about Apache Kafka.
        
         | vsareto wrote:
         | By now, "What Every Software Engineer Should Know" headlines
         | aren't intended to be serious.
        
           | seemslegit wrote:
           | I wish, instead they are just not intended to be taken
           | literally.
        
       | cosmotic wrote:
       | Cool click-bait title
        
         | fmjrey wrote:
         | Not necessarily, it's most likely a reference to the popular
         | article from 2013:
         | https://engineering.linkedin.com/distributed-systems/log-wha...
        
           | ken wrote:
           | Or directly from 1991's "What Every Computer Scientist Should
           | Know About Floating-Point Arithmetic".
        
             | pierrec wrote:
             | Where the "every" was actually justified, contrary to its
             | descendants.
        
       | georgewfraser wrote:
       | This notion of "stream-table duality" might be the most
       | misleading, damaging idea floating around in software engineering
       | today. Yes, you can turn a stream of events into a table of the
       | present state. However, during that process you will eventually
       | confront every single hard problem that relational database
       | management systems have faced for decades. You will more or less
       | have to write a full-fledged DBMS in your application code. And
       | you will probably not do a great job, and will end up with dirty
       | reads, phantoms, and all the other symptoms of a buggy database.
       | 
       | Kafka is a message broker. It's not a database and it's not close
       | to being a database. This idea of stream-table duality is not
       | nearly as profound or important as it seems at first.
        
         | Ozzie_osman wrote:
         | > This notion of "stream-table duality" might be the most
         | misleading, damaging idea floating around in software
         | engineering today.
         | 
         | No. The notion of a "stream-table duality" is a powerful
         | concept, that I've found can change the way any engineer thinks
         | about how they are storing / retrieving data for the better
         | (it's an old idea, rooted in Domain Driven Design, but for some
         | reason a lot of engineers, myself included, still need to
         | learn, or relearn, it).
         | 
         | The notion of _relying_ on a stream as the primary data
         | persistence abstraction or mechanism in production is the
         | misleading part, at least for now. I'd argue Kafka pushes us
         | in a direction that makes progress along that dimension, and
         | you can apply it successfully with a lot of effort. But to
         | match what you can get from a more traditional DBMS? The tech
         | just isn't there (yet).
        
         | manigandham wrote:
         | That's why ksqlDB exists and handles all that for you, turning
         | streams into tables that you can query.
        
           | georgewfraser wrote:
           | ksql does not solve any of the hard consistency or contention
           | problems you will face if you attempt to use Kafka as a
           | datastore. Consider the simplest possible example: you write
           | an "update event" to a topic and then read a ksql view of
           | that topic. The view may or may not yet reflect the update.
           | This is called read-after-write consistency, and you will
           | need to create it in your application code.
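The read-after-write gap described above can be sketched with a toy model. This is not ksqlDB or any Kafka client API: `Topic`, `View`, `poll`, and `read_after` are hypothetical names, and the "view" is just a dict that applies log records lazily. The assumption is only that the view lags the log, and that a writer can remember the offset of its own write and wait for the view to reach it.

```python
from typing import Dict, List, Optional, Tuple


class Topic:
    """Append-only log standing in for a single topic partition."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, int]] = []

    def append(self, key: str, value: int) -> int:
        self.records.append((key, value))
        return len(self.records) - 1  # offset of the record just written


class View:
    """Materialized view that applies log records asynchronously."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.state: Dict[str, int] = {}
        self.applied = -1  # highest offset applied so far

    def poll(self, max_records: int = 1) -> None:
        """Apply up to max_records pending records (simulates lag)."""
        end = min(self.applied + 1 + max_records, len(self.topic.records))
        for offset in range(self.applied + 1, end):
            key, value = self.topic.records[offset]
            self.state[key] = value
            self.applied = offset

    def read(self, key: str) -> Optional[int]:
        """Plain read: may return a value older than the latest write."""
        return self.state.get(key)

    def read_after(self, key: str, offset: int) -> Optional[int]:
        """Read-your-writes: catch up to `offset` before answering."""
        if offset >= len(self.topic.records):
            raise ValueError("offset has not been written yet")
        while self.applied < offset:
            self.poll()
        return self.state.get(key)


topic = Topic()
view = View(topic)
topic.append("balance", 100)
view.poll()                                  # view has seen the first write
offset = topic.append("balance", 250)        # the "update event"
stale = view.read("balance")                 # view has not applied it yet
fresh = view.read_after("balance", offset)   # waits for the view to catch up
print(stale, fresh)
```

The extra bookkeeping (tracking offsets, blocking reads) is exactly the part that ends up in application code when the log is treated as the datastore.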
        
         | wenc wrote:
         | A quibble I have with the term "stream-table duality" is that
         | it's not true duality.
         | 
         | You can construct a state (table) from a stream, but you cannot
         | do the reverse. You cannot deconstruct a table into its
         | original stream because the WAL / CDC information is lost --
         | you can only create a new stream from the table. This means you
         | lose all ability to retrieve previous states by replaying the
         | stream. Information is lost.
         | 
         | Duality in math is an overloaded term but it generally means
         | you can go in either direction. This is not true here.
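The one-way nature of the stream-to-table step can be shown with a small sketch. It assumes a simple upsert/delete event model; `materialize` and `table_to_stream` are hypothetical helper names, not part of Kafka or ksqlDB. Two different histories collapse to the same table, so the table alone cannot tell you which history produced it.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (key, new value); a value of None means "delete this key".
Event = Tuple[str, Optional[int]]


def materialize(stream: Iterable[Event]) -> Dict[str, int]:
    """Fold a stream of upsert/delete events into the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def table_to_stream(table: Dict[str, int]) -> List[Event]:
    """The only stream recoverable from a table: one upsert per surviving key."""
    return [(key, value) for key, value in table.items()]


# Two different histories...
stream_a: List[Event] = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None)]
stream_b: List[Event] = [("alice", 3)]

# ...that end in the same present state.
table_a = materialize(stream_a)
table_b = materialize(stream_b)
same_table = table_a == table_b

# Going back yields a new, shorter stream, not the original one.
regenerated = table_to_stream(table_a)
print(table_a, same_table, regenerated)
```

Replaying `regenerated` reproduces the current table, but the intermediate states (alice at 1, bob at 5) are gone.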
        
         | epistasis wrote:
         | The point of that is for people trying to figure out why
         | streams are a useful abstraction at all: what's needed to make
         | them useful is some sort of aggregation, and of course tabular
         | state is a common end point.
         | 
         | The article does not recommend writing this code yourself, it
         | shows how to aggregate data into usable forms.
         | 
         | So I think your concerns may be a bit overblown. If you think
         | that ksqlDB or Kafka Streams, the tools shown in this blog
         | post, are at risk for what you warn, this comment would be
         | a valid criticism. But it's clear that the article isn't
         | advocating for people to write their own versions of that...
        
           | SpicyLemonZest wrote:
           | Yes, they're definitely at risk. ksqlDB does not appear to
           | have transactions at all.
        
         | strictfp wrote:
         | I always found it ironic that you get most of this for free if
         | you design your sql updates and save/query the transaction log
         | and/or history. A lot of relational dbs have functionality for
         | that.
         | 
         | And if you don't want to use that, there are also products for
         | this specifically, such as Event Store.
        
         | zok3102 wrote:
         | Reminds me of that time when database vendors were overreaching
         | to be message queues/brokers - OracleAQ, MSMQ, etc.
        
         | math wrote:
         | At the moment you more or less need to write a DBMS in your app
         | code, but I don't think that's the end state. I think we're
         | seeing the beginnings of something big - it just might
         | not seem like it yet because it's the v1 / nowhere near
         | complete version. I think having all your data in a single
         | system (Kafka, ksqlDB, ..) that allows you to work with it in
         | cross-paradigm ways will turn out to be very compelling.
        
           | eternalban wrote:
           | > At the moment you more or less need to write a DBMS in your
           | app code
           | 
           | Since we're discussing misunderstandings in the community, it
           | should be pointed out that a _Database Management System_
           | (DBMS) is not merely a _database_, to say nothing of a _data
           | store_. Oracle, Postgres, et al are genuine "DBMS". ATM you
           | are very likely putting together a data store in your app
           | code.
        
             | tomnipotent wrote:
             | You're conflating DBMS with RDBMS.
             | 
             | I interpreted the OP as DBMS=database, which absolutely
             | includes application code that stores and retrieves data in
             | proprietary formats.
             | 
             | Even a linked list mmap'd to disk can be a database, just
             | maybe not a very good one.
        
               | eternalban wrote:
               | A RDBMS is simply a _Relational_ Database Management
               | System. A quick tour of CS history reveals ancient
               | curiosities such as _Hierarchical_ Database Management
               | Systems. And there is more:
               | 
               | https://www.studytonight.com/dbms/database-model.php
        
               | tomnipotent wrote:
               | There are many kinds of databases. Graph, key/value,
               | relational, hierarchical etc. That doesn't change the fact
               | that any app that writes code to store and retrieve data
               | is creating their own database or using someone else's.
        
               | eternalban wrote:
               | There are many kinds of data models. A DBMS is a DBMS,
               | and typically a specific DBMS supports a specific data
               | model. A log structured file is at most a data store.
               | 
               | To be fair this is a somewhat fuzzy categorization (DBMS,
               | DB, Data Store) and it can cause confusion. A DBMS is a
               | _system_, just like it says right on the acronym tin. A
               | DBMS is a system that provides capabilities such as DDL,
               | DML, auxiliary processes, etc. to _manage_ a database.
               | There are various data models, e.g. a triple store, for
               | databases, but the data model X is orthogonal to XDBMS.
        
       | ckdarby wrote:
       | What every software engineer should know about Kafka: it's dead.
       | 
       | If you're not already technically chained into it and Confluent
       | hasn't already upsold your poor organization, avoid it.
       | 
       | If you want the early flexibility and the rapid PoC just look at
       | AWS Kinesis/Firehose.
       | 
       | If you're looking at large scale (1+ Gbit ingest, 100k/s, kind of
       | stuff) then Apache Pulsar is where to go.
        
       | gigatexal wrote:
       | this is so timely, thank you!
        
       | opportune wrote:
       | >With Kafka, such a stream may record the history of your
       | business for hundreds of years
       | 
       | Do not do this. Kafka is not a database! Kafka should never be
       | the source of truth for your business. The source of truth should
       | be in whatever consumes data from Kafka downstream when messages
       | are committed as read. Why? Because in your middle layer you can
       | do all the data normalization, sanity checking, processing, and
       | interaction with a REAL database system downstream that can give
       | you things like transactions, ACID, etc.
       | 
       | Of course Confluent _wants_ you to try to use Kafka as a DB, so
       | your usage of it is very high and you pay for the top support
       | package and they have you by the cojones, but that doesn't mean
       | you should do that. You will miss out on all the benefits of
       | using a real database, with what benefit? Having a simple client
       | API?
        
         | sixdimensional wrote:
         | So, I've been having a back and forth with a colleague on this
         | and I'm genuinely interested in why you so strongly suggest
         | this.
         | 
         | For the record, I have good real world experience with all
         | kinds of databases (relational, NoSQL, and even legacy
         | multivalue and hierarchical ones), and I don't see why what you
         | say has to be "always true".
         | 
         | One way of looking at Kafka is that it is an unbundled
         | transaction log, nothing more or less, so it could be used to
         | permanently store and replay transactional activity, if one
         | wishes. Note that even most databases don't store an
         | immutable, permanent transaction log (as they often grow to be
         | huge and are truncated every so often, and tables are used as
         | the current state).
         | 
         | This article by Confluent seems to cover the topic (yes,
         | recognizing it is written by the very vendor you suggest is
         | trying to lock us in):
         | https://www.confluent.io/blog/okay-store-data-apache-kafka/
         | 
         | Ok, so how about the idea of a persistent, immutable, never-
         | ending transaction log (uhoh, sounds like blockchain now!)?
         | Setting aside Kafka for now, what do you think about the basic
         | design pattern? To me it sounds a bit like it could represent a
         | temporal database in raw transactional log form. Why not?
         | 
         | EDIT: After rereading your comment I see your main concern is
         | using Kafka as a database management system (DBMS). I would
         | agree, that's not what Kafka is for. But, I don't think
         | Confluent intends that use case, do they? I look at it more as
         | an unbundled single component that is very useful by itself,
         | and is part of a more complex data platform/architecture (ex.
         | Lambda or Kappa architecture).
        
           | opportune wrote:
           | I agree that you can use Kafka as a raw event log - not
           | necessarily a transaction log unless you are basically
           | putting records into Kafka that don't need to be transformed,
           | which you _can_ do but probably don't want/need to do. There
           | are situations in which you want to replay raw events, but in
           | most cases I think you want to replay actual transactions to
           | your DB, so it makes more sense to log your progress as you
           | commit using whatever consumes from Kafka rather than in
           | Kafka itself.
           | 
           | My main concern is that when all you have is a hammer, or you
           | already have a hammer lying around, everything looks like a
           | nail. Kafka can be great as a raw event stream, and yes you
           | can store raw events forever, but are raw events really the
           | source of truth for your business? If your workflow is, as I
           | think is appropriate in almost all cases, Kafka->Consumer
           | service->DB, why do you want to rely on Kafka when you have a
           | consumer that can have better logic and custom handling
           | regarding how you actually interpret your events? Moreover,
           | why keep the data in Kafka when you can just plug it into a
           | temporal DB from your service?
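A rough sketch of the Kafka->Consumer service->DB workflow described above, where progress is logged in the database as part of the same commit rather than in Kafka itself. Everything here is an assumption for illustration: the topic is replaced by an in-memory list of `(offset, payload)` pairs, the `order:<id>:<amount>` payload format is made up, and `sqlite3` stands in for the "real" database.

```python
import sqlite3
from typing import List, Optional, Tuple

# Stand-in for one topic partition: (offset, raw payload) pairs.
events: List[Tuple[int, str]] = [
    (0, "order:42:19.99"),
    (1, "garbage"),          # malformed event the consumer must reject
    (2, "order:43:5.00"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
db.execute(
    "CREATE TABLE progress ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), next_offset INTEGER NOT NULL)"
)
db.execute("INSERT INTO progress VALUES (1, 0)")
db.commit()


def parse(payload: str) -> Optional[Tuple[int, float]]:
    """Sanity-check and normalize a raw event; None means 'skip it'."""
    parts = payload.split(":")
    if len(parts) != 3 or parts[0] != "order":
        return None
    try:
        return int(parts[1]), float(parts[2])
    except ValueError:
        return None


def consume(conn: sqlite3.Connection, log: List[Tuple[int, str]]) -> None:
    """Apply unseen events; the row and the progress marker commit together."""
    (next_offset,) = conn.execute(
        "SELECT next_offset FROM progress WHERE id = 1"
    ).fetchone()
    for offset, payload in log:
        if offset < next_offset:
            continue  # already applied in an earlier run
        order = parse(payload)
        with conn:  # one transaction per event: commit or roll back as a unit
            if order is not None:
                conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)", order)
            conn.execute(
                "UPDATE progress SET next_offset = ? WHERE id = 1", (offset + 1,)
            )


consume(db, events)
consume(db, events)  # replaying the same events changes nothing
rows = db.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
(next_offset,) = db.execute("SELECT next_offset FROM progress").fetchone()
print(rows, next_offset)
```

Because the offset lives next to the data it describes, a crash between "write the row" and "record progress" cannot leave the two out of step.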
        
       | unohoo wrote:
       | Use pulsar - so much better than kafka
        
         | lytedev wrote:
         | Why is that?
        
           | wpietri wrote:
           | If anybody has seen a good detailed comparison, I'd love to
           | read one. The first dozen hits were pretty weak.
        
           | dominotw wrote:
           | millions of topics, no ZooKeeper etc. Kafka is addressing
           | these shortcomings on the roadmap.
        
             | oweiler wrote:
             | For a lot of projects this is hardly a problem. On the other
             | hand Kafka is more mature and has a huge ecosystem (Kafka
             | Connect, Kafka Streams, KSQL, ...).
        
               | biggestlou wrote:
               | Pulsar is less mature but does provide functional
               | equivalents to all of the above. Pulsar IO (Kafka
               | Connect), Pulsar SQL (KSQL), Pulsar Functions (Kafka
               | Streams).
        
               | math wrote:
               | Kafka also has fewer moving parts even today before
               | ZooKeeper removal is complete (2 vs Pulsar's 3).
        
               | biggestlou wrote:
               | But one of those moving parts of Pulsar, BookKeeper,
               | means that you're no longer storing data on message
               | brokers. Worth the extra puzzle piece for a lot of use
               | cases.
        
             | nitwit005 wrote:
             | The Pulsar documentation says it requires Zookeeper:
             | https://pulsar.apache.org/docs/en/administration-zk-bk/
        
               | dominotw wrote:
               | Oh sorry, I meant that storing topic info in ZooKeeper
               | limits Kafka to a certain number of topics.
        
         | mrunkel wrote:
         | I am all for self-reliance, but if you really want to influence
         | someone else, you might want to include a link to the project,
         | especially when the only word you share has a much more
         | prevalent meaning.
        
       ___________________________________________________________________
       (page generated 2020-05-16 23:00 UTC)