[HN Gopher] SQL Maxis: Why We Ditched RabbitMQ and Replaced It w...
       ___________________________________________________________________
        
       SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres
       Queue
        
       Author : ctc24
       Score  : 360 points
       Date   : 2023-04-11 15:48 UTC (7 hours ago)
        
 (HTM) web link (www.prequel.co)
 (TXT) w3m dump (www.prequel.co)
        
       | jasonlotito wrote:
       | So, this article contains a serious issue.
       | 
        | What does the prefetch value for RabbitMQ mean? > The value defines
       | the max number of unacknowledged deliveries that are permitted on
       | a channel.
       | 
       | From the Article: > Turns out each RabbitMQ consumer was
       | prefetching the next message (job) when it picked up the current
       | one.
       | 
        | That's a prefetch count of 2.
       | 
       | The first message is unacknowledged, and if you have a prefetch
       | count of 1, you'll only get 1 message because you've set the
       | maximum number of unacknowledged messages to 1.
       | 
       | So, I'm curious what the actual issue is. I'm sure someone
       | checked things, and I'm sure they saw something, but this isn't
       | right.
       | 
        | tl;dr: a prefetch count of 1 only gets one message; it doesn't
        | get one message and then a second.
       | 
       | Note: I didn't test this, so there could be some weird issue, or
       | the documentation is wrong, but I've never seen this as an issue
       | in all the years I've used RabbitMQ.
        
         | stuff4ben wrote:
         | That's my thinking as well. Seems like they're not using the
         | tool correctly and didn't read the documentation. Oh well,
         | let's switch to Postgres because "reasons". And now to get the
         | features of a queuing system, you have to build it yourself.
         | Little bit of Not Invented Here syndrome it sounds like.
        
           | sseagull wrote:
           | I mean, at some point you do have to write code. Either
           | 
           | 1.) You (hopefully) know a bit about how your DB works, what
           | the workload is, what your tasks are. You also (hopefully)
           | know a bit about SQL and Postgres. So you learn a bit more
           | and build upon that knowledge, and implement the queue there
           | (which comes with other benefits).
           | 
           | 2.) You learn about a new package, deal with how to set that
           | up, and how to integrate it with your existing database
           | (including how tasks get translated between the queue and
           | your existing DB). This also increases your maintenance and
           | deployment burden, and now developers need to know not only
           | about your DB, but the queueing package as well.
           | 
           | There are certainly cases where #2 makes sense. But writing
           | off #1 as NIH often leads to horrifically over-engineered
            | software stacks, when tens or a few hundred lines of code will
           | suffice.
        
         | mannyv wrote:
         | The RabbitMQ stuff seems pretty straightforward:
         | 
         | Channel prefetch:
         | 
         | https://www.rabbitmq.com/confirms.html
         | 
         | "Once the number reaches the configured count, RabbitMQ will
         | stop delivering more messages on the channel unless at least
         | one of the outstanding ones is acknowledged"
         | 
         | consumer prefetch:
         | 
         | https://www.rabbitmq.com/consumer-prefetch.html
         | 
         | So a prefetch count of 1 = 1 un-ACKed message -> what they want
        
           | eternalban wrote:
           | The article unintentionally communicated more about the
           | engineering team than the subject.
           | 
           | btw, publisher confirms used in conjunction with prefetch
           | setting can allow for flow control within a very well behaved
           | band.
           | 
           | People run into issues with Rabbit for two reasons. You noted
            | one (they are documentation averse), and number two is
           | mistaking a message queue for a distributed log. Rabbit does
           | -not- like holding on to messages. Performance will crater if
            | you treat it as a lake. It is a river.
        
             | baq wrote:
              | great point and the same is true for postgres! where I'm at,
              | DBAs rip people's legs off for idle-in-transaction sessions
              | lasting more than a few seconds
        
         | ketchupdebugger wrote:
         | Maybe this is a case of an engineer who just wanted to put
          | 'implementing a queue using postgres' on their resume.
         | 
         | > took half a day to implement + test
         | 
         | so seems like there are maybe 2 or 3 services using rabbitmq
        
         | hangonhn wrote:
         | I'm wondering if they made the mistake of acknowledging the
         | message before the processing was done. From the article it
         | sounds like their jobs take a long time to run so they may have
         | acked the message to stop RabbitMQ from delivering the message
         | to another worker for retry but IIRC there is a setting that
         | allows you to extend the "lease" time on a message before
         | retry.
        
         | binaryBandicoot wrote:
          | Agreed!
         | 
          | If prefetch was the issue, they could have even used AMQP's
         | basic.get -
         | https://www.rabbitmq.com/amqp-0-9-1-quickref.html#basic.get
        
           | jalla wrote:
           | This is most likely correct. They didn't realize that
           | consumers always prefetch, and the minimum is 1. Answered
           | here: https://stackoverflow.com/questions/39699727/what-is-
           | the-dif...
        
       | sevenf0ur wrote:
       | Sounds like a poorly written AMQP client of which there are many.
       | Either you go bare bones and write wrappers to implement basic
       | functionality or find a fully fleshed out opinionated client. If
       | you can get away with using PostgreSQL go for it.
        
         | kitd wrote:
         | As a matter of interest, what are good, reliable, actively
         | developed AMQP clients? Asking because I've had problems with
         | several.
        
           | datadeft wrote:
           | So far I have not had any issues with the Elixir one.
           | 
           | https://github.com/pma/amqp
        
       | jrib wrote:
       | > One of our team members has gotten into the habit of pointing
       | out that "you can do this in Postgres" whenever we do some kind
       | of system design or talk about implementing a new feature. So
       | much so that it's kind of become a meme at the company.
       | 
       | love it
        
         | mordae wrote:
         | Did this recently with a GIS query. But everyone here loves PG
         | already. :-)
        
       | code-e wrote:
       | As the maintainer of a rabbitmq client library (not the golang
        | one mentioned in the article), the bit about dealing with
        | reconnections really rang true. Something about the AMQP
       | protocol seems to make library authors just... avoid dealing with
       | it, forcing the work onto users, or wrapper libraries. It's a
       | real frustration across languages, golang, python, JS, etc.
       | Retry/reconnect is built in to HTTP libraries, and database
       | drivers. Why don't more authors consider this a core component of
       | a RabbitMQ client?
        
         | klabb3 wrote:
         | Nats has all of these features built in, and is a small
         | standalone binary with optional persistence. I still don't
         | understand why it's not more popular.
        
         | [deleted]
        
         | skrimp wrote:
         | This also occurs when dealing with channel-level protocol
         | exceptions, so this behavior is doubly important to get right.
         | I think one of the hard parts here is that the client needs to
         | be aware of these events in order to ensure that application
         | level consistency requirements are being kept. The other part
         | is that most of the client libraries I have seen are very
         | imperative. It's much easier to handle retries at the library
         | level when the client has specified what structures need to be
         | declared again during a reconnect/channel-recreation.
        
       | j3th9n wrote:
       | Sooner or later they will have to deal with deadlocks.
        
       | coding123 wrote:
       | We have a mix of agenda jobs and rabbitmq. I know there are more
        | complex use-cases, like fan-out, but in reality the rabbit stack
       | keeps disconnecting silently in the stack we're using (js).
       | Someone has to go in and restart pods (k8s).
       | 
       | All the stuff on Agenda works perfectly all the time. (which is
       | basically using mongo's find and update)
        
       | exabrial wrote:
       | We use ActiveMQ (classic) because of the incredible console.
       | Combine that with hawt.io and you get some extra functionality
       | not included in the normal console.
       | 
       | I'm always surprised, even with the older version of ActiveMQ,
       | what kind of throughput you can get, on modest hardware. A 1gb
       | kvm with 1 cpu easily pushes 5000 msgs/second across a couple
       | hundred topics and queues. Quite impressive and more than we need
       | for our use case. ActiveMQ Artemis is supposed to scale even
       | farther out.
        
         | therealdrag0 wrote:
         | Same. Ran a perf test recently. With two 1 core brokers I got
         | 2000 rps with persistence and peaked at 12000 rps with non-
         | persistence.
         | 
         | We've also had similar issues as OP, except fixing it just came
         | down to configuring the Java client to have 0 prefetch so that
         | long jobs don't block other msgs from being processed by other
          | clients. Also using separate queues for different workloads.
        
       | 0xbadcafebee wrote:
       | It's important not to gloss over what your actual use-case is.
       | Don't just pick some tech because "it seems simpler". Who gives a
       | crap about simplicity if it doesn't meet your needs? List your
       | exact needs and how each solution is going to meet them, and then
       | pick the simplest solution that meets your needs.
       | 
       | If you ever get into a case where "we don't think we're using it
       | right", then you didn't understand it when you implemented it.
       | That is a much bigger problem to understand and prevent in the
       | future than the problem of picking a tool.
        
       | chime wrote:
       | If you don't want to roll your own, look into
       | https://github.com/timgit/pg-boss
        
       | bstempi wrote:
       | I've done something like this and opted to use advisory locks
       | instead of row locks thinking that I'd increase performance by
       | avoiding an actual lock.
       | 
       | I'm curious to hear what the team thinks the pros/cons of a row
       | vs advisory lock are and if there really are any performance
       | implications. I'm also curious what they do with job/task records
       | once they're complete (e.g., do they leave them in that table? Is
       | there some table where they get archived? Do they just get
       | deleted?)
        
         | brasetvik wrote:
         | Advisory locks are purely in-memory locks, while row locks
         | might ultimately hit disk.
         | 
         | The memory space reserved for locks is finite, so if you were
         | to have workers claim too many queue items simultaneously, you
         | might get "out of memory for locks" errors all over the place.
         | 
         | > Both advisory locks and regular locks are stored in a shared
         | memory pool whose size is defined by the configuration
         | variables max_locks_per_transaction and max_connections. Care
         | must be taken not to exhaust this memory or the server will be
         | unable to grant any locks at all. This imposes an upper limit
         | on the number of advisory locks grantable by the server,
         | typically in the tens to hundreds of thousands depending on how
         | the server is configured.
         | 
         | - https://www.postgresql.org/docs/current/explicit-
         | locking.htm...
        
         | wswope wrote:
         | What did you do to avoid implicit locking, and what sort of
         | isolation level were you using?
         | 
         | Without more information about your setup, the advisory locking
         | sounds like dead weight.
        
           | bstempi wrote:
           | > What did you do to avoid implicit locking, and what sort of
           | isolation level were you using?
           | 
           | I avoided implicit locking by manually handling transactions.
           | The query that acquired the lock was a separate transaction
           | from the query that figured out which jobs were eligible.
           | 
           | > Without more information about your setup, the advisory
           | locking sounds like dead weight.
           | 
           | Can you expand on this? Implementation-wise, my understanding
           | is that both solutions require a query to acquire the lock or
           | fast-fail, so the advisory lock acquisition query is almost
           | identical SQL to the row-lock solution. I'm not sure where
           | the dead weight is.
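            | 
            | Roughly, the flow I'm describing (illustrative table/id names,
            | untested sketch):
            | 
            |     -- 1) figure out which jobs are eligible (no lock taken)
            |     SELECT id FROM job WHERE status = 'pending' ORDER BY id;
            | 
            |     -- 2) in a separate transaction, try to claim one of them
            |     SELECT pg_try_advisory_lock(42);  -- 42 = an id from step 1
            | 
            |     -- 3) if true, re-check the job's status, do the work, then
            |     SELECT pg_advisory_unlock(42);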
        
             | wswope wrote:
             | That helps, thanks. Totally possible that I'm just missing
             | the nuances of your implementation.
             | 
             | I'm imagining this is the sorta setup we're comparing:
             | 
              | Row Lock - https://pastebin.com/sgm45gF2
              | 
              | Advisory Lock - https://pastebin.com/73bqfBfV
             | 
             | And if that's accurate, I'm failing to see how an advisory
             | lock would leave the table unblocked for any amount of time
             | greater than row-level locks would.
             | 
             | The point of the explicit row-level locking is to allow a
             | worker to query for fresh rows without fetching any records
             | that are already in-progress (i.e. it avoids what would
             | otherwise be a race condition between the procedural SELECT
             | and UPDATE components of parallel workers), so if you've
             | already queried a list of eligible jobs, and then have your
             | workers take advisory locks, what are those locks
             | synchronizing access to?
        
       | nemothekid wrote:
       | > _To make all of this run smoothly, we enqueue and dequeue
       | thousands of jobs every day._
       | 
        | If your needs aren't that expensive, and you don't anticipate
       | growing a ton, then it's probably a smart technical decision to
        | minimize your operational stack. Assuming 10k jobs a day, that's
       | roughly 7 jobs per minute. Even the most unoptimized database
       | should be able to handle this.
        
         | hinkley wrote:
         | Years of being bullshitted have taught me to instantly distrust
         | anyone who is telling me about how many things they do per day.
          | Jobs or customers per day is something to tell your banker, or
          | investors. For tech people it's per second, per minute, maybe
          | per hour, or self-aggrandizement.
         | 
         | A million requests a day sounds really impressive, but it's
         | 12req/s which is not a lot. I had a project that needed 100
         | req/s ages ago. That was considered a reasonably complex
         | problem but not world class, and only because C10k was an open
         | problem. Now you could do that with a single 8xlarge. You don't
         | even need a cluster.
         | 
         | 10k tasks a day is 7 per minute. You could do that with
         | Jenkins.
        
           | Eumenes wrote:
           | you could use mechanical turk for 10k tasks per day
        
           | tomjakubowski wrote:
           | the other thing is averaging over days says nothing about
           | spikes in the rate - I imagine very few systems see more-or-
           | less constant traffic over the course of an entire day
        
             | jt_b wrote:
             | This. Customers are not organized in neat lines waiting
             | their turns around the clock IRL
        
         | SideburnsOfDoom wrote:
         | > Assuming 10k/jobs a day, that's roughly 7 jobs per minute.
         | 
         | I've seen systems at that scale where that's roughly true. But
         | I've also seen systems where those jobs come in a daily batch,
         | at a point in time of day, and then nothing until the next
         | day's batch.
        
           | spacedcowboy wrote:
           | Yep, even websites can be highly non-periodic. I used to run
           | the web services for the 50 or so magazines that Risk-Waters
           | had at the time, and around lunch time was a massive peak of
           | traffic, easily 100x our slower times.
           | 
           | Postgres could still handle that though, IMHO :)
        
         | bdcravens wrote:
         | For many apps, using a mature library like Sidekiq or Celery
         | (with a stock Redis) within a monolith will take you a very
         | long way.
        
         | aetherson wrote:
         | And, on the other hand, people shouldn't kid themselves about
         | the ability of Postgres to handle millions of jobs per day as a
         | queue.
        
           | shakow wrote:
            | A few million a day is a few dozen per second; we
           | currently have a service running this order of magnitude of
           | jobs with a SELECT/SKIP LOCKED pattern and no issue at all on
           | a medium AWS box.
        
           | pritambaral wrote:
           | I've used Postgres to handle 60M jobs per second (using FOR
           | UPDATE SKIP LOCKED) in production, for two years, on a single
           | dual core 8GB GCE VM. Postgres goes far.
        
             | qskousen wrote:
             | Either you meant "per day" or you've got a really well
             | tuned database.
        
           | dymk wrote:
           | My hobby project does ~1.5M jobs per day enqueued into
           | Postgres, no sweat. I use
           | https://github.com/bensheldon/good_job which uses PG's
           | LISTEN/NOTIFY to lower worker poll latency.
        
             | avinassh wrote:
             | > which uses PG's LISTEN/NOTIFY to lower worker poll
             | latency
             | 
              | Can you elaborate on how you do this?
        
               | Mavvie wrote:
               | I can't speak for how they do it, but when your worker
               | polls the table and finds no rows, you will sleep. While
               | sleeping, you can also LISTEN on a channel (and if you
               | get a message, you abort your sleep).
               | 
               | Then, whenever you write a new job to the queue, you also
               | do a NOTIFY on the same channel.
               | 
               | This lets you keep latency low while still polling
               | relatively infrequently.
               | 
               | NOTIFY is actually transactional which makes this
               | approach even better (the LISTENer won't be notified
               | until the NOTIFY transaction commits)
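                | 
                | A minimal sketch of the two sides (channel and table
                | names are made up):
                | 
                |     -- worker connection: subscribe once, then sleep/poll
                |     LISTEN job_created;
                | 
                |     -- producer, in the same transaction that enqueues:
                |     BEGIN;
                |     INSERT INTO job (payload) VALUES ('...');
                |     NOTIFY job_created;  -- delivered only after COMMIT
                |     COMMIT;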
        
               | fweimer wrote:
               | These are somewhat obscure PostgreSQL SQL commands:
               | 
               | https://www.postgresql.org/docs/15/sql-listen.html
               | 
               | https://www.postgresql.org/docs/15/sql-notify.html
               | 
               | I think they have been around for ages, but handling the
               | LISTEN responses may need special client library support.
        
               | sideofbacon wrote:
               | I'm the GoodJob author. Here's the class that is
               | responsible for implementing Postgres's LISTEN/NOTIFY
               | functionality in GoodJob:
               | 
               | https://github.com/bensheldon/good_job/blob/10e9d9b714a66
               | 8dc...
               | 
                | That's heavily inspired by Rails' Action Cable
               | (websockets) Adapter for Postgres, which is a bit simpler
               | and easier to understand:
               | 
               | https://github.com/rails/rails/blob/be287ac0d5000e667510f
               | aba...
               | 
               | Briefly, it spins up a background thread with a dedicated
               | database connection and makes a blocking Postgres LISTEN
               | query until results are returned, and then it forwards
               | the result to other subscribing objects.
        
               | comboy wrote:
               | _excited claps_
        
           | philipbjorge wrote:
           | We were comfortably supporting millions of jobs per day as a
           | Postgres queue (using select for update skip locked
           | semantics) at a previous role.
           | 
           | Scaled much, much further than I would've guessed at the time
           | when I called it a short-term solution :) -- now I have much
           | more confidence in Postgres ;)
        
             | fjni wrote:
             | This! Most haven't tried. It goes incredibly far.
        
               | jbverschoor wrote:
               | Because all popular articles are about multi million tps
               | at bigtech scale, and everybody thinks they're big tech
               | somehow.
        
               | int_19h wrote:
               | That's the original problem, but then there are the
               | secondary effects. Some of the people who made decision
               | on that basis write blog posts about what they did, and
               | then those blog posts end up on StackOverflow etc, and
               | eventually it just becomes "this is how we do it by
               | default" orthodoxy without much conscious reasoning
               | involved - it's just a safe bet to do what works for
               | everybody else even if it's not optimal.
        
             | simplotek wrote:
             | > We were comfortably supporting millions of jobs per day
             | as a Postgres queue (using select for update skip locked
             | semantics) at a previous role.
             | 
              | That's very refreshing to hear. In a previous role I was in
              | a similar situation to yours, but I pushed for RabbitMQ
              | instead of postgres due to scaling concerns, with
              | hypothetical ceilings smaller than the ones you faced. My
              | team had to make a call without having hard numbers to
              | support any decision and no time to put together a proof of
              | concept. The design pressure was the simplicity of
              | postgres vs. paying in complexity for the assurance of a
              | working message broker. In the end I pushed for the
              | most conservative approach and we went with RabbitMQ,
              | because I didn't want to be the one having to explain why
              | we had problems getting an RDBMS to act as a message broker
              | when we could get a real message broker for free with a
              | docker pull.
             | 
             | I was always left wondering if that was the right call, and
             | apparently it wasn't, because RabbitMQ also put up a fight.
             | 
             | If there were articles out there showcasing case studies of
             | real world applications of implementing message brokers
             | over RDBMS then people like me would have an easier time
             | pushing for saner choices.
        
               | marcosdumay wrote:
               | > showcasing case studies of real world applications of
               | implementing message brokers over RDBMS
               | 
               | You mean "industrial scale RDBMS" that you can license
               | for thousands of dollars? No, you can't really implement
               | message brokers on those.
               | 
               | You will never see those showcase articles. Nobody paying
               | wants them.
        
             | mhink wrote:
             | Just out of curiosity (as someone who hasn't done a lot of
             | this kind of operational stuff) how does this approach to
             | queueing with Postgres degrade as scale increases? Is it
             | just that your job throughput starts to hit a ceiling?
        
               | aetherson wrote:
               | With Postgres, you also need to worry a lot about
               | tombstoning and your ability to keep up with the vacuums
               | necessary to deal with highly mutable data. This can
               | depend a lot on what else is going on with the database
               | and whether you have more than one index on the table.
        
               | nightpool wrote:
                | Throughput is less of an issue than queue size--Postgres
               | can handle a truly incredible amount of throughput as
               | long as the jobs table is small enough that it can safely
               | remain in memory for every operation. We can handle 800k
               | jobs/hr with postgres, but if you have more than 5k or
               | 10k jobs in the table at any given time, you're in
               | dangerous territory. It's a different way of thinking
                | about queue design than some other systems, but it's
               | definitely worth it if you're interested in the benefits
               | Postgres can bring (atomicity, reliability, etc)
        
           | baq wrote:
           | a well-tuned bare metal box in a master-slave config should
           | easily handle (being conservative here) 10k/s... I assume a
           | purpose-built box could handle 100k/s without breaking a
           | sweat
        
           | mjevans wrote:
           | In other SQL databases an 'in memory' table could be a
           | candidate. It looks like Postgres only has session specific
           | temporary tables, but does have an UNLOGGED
           | https://www.postgresql.org/docs/13/sql-createtable.html table
           | type which has desirable properties for temporary data that
           | must be shared.
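            | 
            | A minimal sketch (illustrative columns); it skips WAL
            | writes, so it's fast, but the table is truncated after a
            | crash:
            | 
            |     CREATE UNLOGGED TABLE job (
            |         id      bigserial PRIMARY KEY,
            |         status  text NOT NULL DEFAULT 'pending',
            |         payload jsonb
            |     );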
        
             | cryptonector wrote:
             | PG really needs GLOBAL TEMP tables...
        
               | mjevans wrote:
               | The properties of the UNLOGGED table suggest it fills
               | this niche already.
        
               | eshnil wrote:
               | There's an extension for this:
               | https://github.com/darold/pgtt
        
           | vidarh wrote:
           | Have done millions on nearly a decade old hardware. A million
           | is <12 a second, and that's trivial.
        
           | lelanthran wrote:
           | I'm not kidding myself, postgresql easily handles 10s of
           | thousands of queries _per second_.
           | 
           | No problem with millions of enqueue+dequeue per day.
           | 
           | A table for a queue is also going to be so tiny that
           | postgresql might even outdo my own expectations.
        
         | ed25519FUUU wrote:
         | Honestly this is how projects should start. Simple, easy,
         | maintainable.
        
         | MuffinFlavored wrote:
         | > Even the most unoptimized database should be able to handle
         | this.
         | 
         | Anybody had any success running a queue on top of... sqlite?
         | 
         | With the way the sqlite file locking mechanisms work, are you
         | basically guaranteed really low concurrency? You can have lots
         | of readers but not really a lot of writers, and in order to pop
         | a job off of the queue you need to have a process spinning
         | waiting for work, move its status from "to do" to "in progress"
         | and then "done" or "error", which is sort of "write" heavy?
         | 
         | > An EXCLUSIVE lock is needed in order to write to the database
         | file. Only one EXCLUSIVE lock is allowed on the file and no
         | other locks of any kind are allowed to coexist with an
         | EXCLUSIVE lock. In order to maximize concurrency, SQLite works
         | to minimize the amount of time that EXCLUSIVE locks are held.
         | 
         | > You can avoid locks when reading, if you set database journal
         | mode to Write-Ahead Logging (see:
         | http://www.sqlite.org/wal.html).
        
           | polyrand wrote:
           | SQLite is missing some features like `SELECT FOR UPDATE`, but
           | you can work around some issues with a few extra queries. I
           | wrote litequeue[0] with this specific purpose. I haven't been
           | able to use it a lot, so I don't have real-world numbers of
           | how it scales, but the scaling limits depend on how fast you
           | can insert into the database.
           | 
           | [0]: https://github.com/litements/litequeue
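            | 
            | For reference, a bare-bones claim that works on SQLite
            | (>= 3.35 for RETURNING; table/column names are made up),
            | leaning on SQLite serializing writers anyway:
            | 
            |     UPDATE task
            |     SET status = 'in_progress'
            |     WHERE id = (SELECT id FROM task
            |                 WHERE status = 'queued'
            |                 ORDER BY id LIMIT 1)
            |     RETURNING id, payload;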
        
             | st3fan wrote:
             | I wrote https://github.com/TinyWebServices/tqs a couple of
             | years ago. It is modelled after SQS and runs in a single
             | threaded Tornado server.
             | 
             | I don't know how many messages per second it does but for a
             | podcast crawling side project I have processed hundreds of
             | millions of messages through this little Python wrapper
             | around SQLite. Zero problems. It just keeps running
             | happily.
        
           | zamalek wrote:
           | The problem with a queue on SQLite is that every successful
           | read implies a write. SQLite is fast enough that it may not
           | matter, though.
        
             | jacobr1 wrote:
             | We used sqlite for a queue at a prior startup - enabling
             | the WAL was critical to maintain disk throughput.
        
             | simplotek wrote:
             | > The problem with a queue on SQLite is that every
             | successful read implies a write.
             | 
             | SQLite also supports in-memory databases with shared cache.
             | 
             | https://www.sqlite.org/inmemorydb.html
             | 
             | If a message queue does not require any form of
             | persistence, writes don't sound like an issue.
        
               | avinassh wrote:
               | But this is purely in-memory with no disk persistence.
               | That's quite risky
        
               | michaelcampbell wrote:
               | Not to be coy, but it only is if it is. For this
               | application I'd agree, but there are plenty of apps that
               | want queues for in-memory use, and if the thing that
               | holds the memory for the queue dies, chances are the
               | whole ecosystem has.
               | 
               | Sometimes this is fine.
        
           | vidarh wrote:
           | I processed ~3m messages a day on Sqlite using a pair of ca
           | 2006 era Xeon on spinning rust (for redundancy; each could
           | _easily_ handle the load by itself). The queue processor was
           | written in Ruby and ran on the (very slow) 1.8.x series (even
            | then it used about 10% of a single core).
           | 
           | On modern hardware you should be able to trivially handle
           | more than that.
        
             | randito wrote:
             | Does this mean you are processing messages on only one
             | machine, since it's Sqlite? Depending on what you are
             | processing, that could take longer than the queue/dequeue.
        
               | vidarh wrote:
               | The queue was managed by a server written in Ruby that
               | spoke Stomp [1], so while for the biggest queue most of
               | the processing did in fact happen on the same machine,
               | that was just because it happened to fit.
               | 
               | [1] https://stomp.github.io/
        
           | [deleted]
        
           | foldr wrote:
           | A simple implementation of a queue in SQL will need to
           | acquire an exclusive lock on the table anyway. Although it's
           | not necessary to use locking at the level of SQL itself:
           | https://news.ycombinator.com/item?id=27482402
        
             | TedDoesntTalk wrote:
             | The article mentions row-level locking, not full table
             | locking. Big difference in write performance.
        
           | 0xbadcafebee wrote:
           | Why use something as complicated as SQLite? You can use a
           | plain old set of directories and files as a queue, with sane,
           | portable, exclusive, atomic locks, on any filesystem, with
           | concurrent readers/writers. That's how we ran mail servers
           | that handled millions (now billions) of messages a day, 20+
           | years ago.
        
             | Sesse__ wrote:
             | But your durability goes down the drain on power loss or
             | similar. Files are _really_ complicated to get durable.
        
             | nomel wrote:
             | Yeah, I'm super confused with this. Getting a few thousand
             | per second seems relatively trivial, on an Arduino. Maybe
             | there's something I'm missing here, or is this the
             | abstractions that software lives at these days?
        
               | TedDoesntTalk wrote:
               | This limits simultaneous writes to the maximum number of
               | open file handles supported by the OS. I don't know what
                | that is, but I don't see how it can compare to multiple
                | multiplexed TCP/IP sockets.
               | 
               | When you're writing billions of messages per day, I don't
               | see how a file system scales.
        
             | quintes wrote:
             | Yep I've used this approach for file transfers/ messaging
              | between two systems. Primitive, but get the lock, process,
              | unlock, and move sequence working and it works for that
              | kind of use case.
        
           | matzf wrote:
           | The huey project supports sqlite as one of its backend
           | options
           | (https://huey.readthedocs.io/en/latest/api.html#SqliteHuey).
           | It works, is all I can say.
        
           | alberth wrote:
           | Expensify does.
           | 
           | https://bedrockdb.com/jobs.html
           | 
           | https://blog.expensify.com/2018/01/08/scaling-sqlite-
           | to-4m-q...
        
           | sodapopcan wrote:
           | Oban [0] is a job queuing system in the Elixir world that
           | supports both postgres and sqlite.
           | 
           | [0] https://getoban.pro/
        
         | CodeSgt wrote:
         | That's assuming those 10k jobs are equally distributed, which
         | is almost certainly not the case.
        
           | dymk wrote:
           | Postgres can handle 10k batch inserts in seconds on even
            | commodity hardware. Without batching, you should still get >100
           | inserts/second with a few indexes thrown in there.
        
       | animex wrote:
       | Interestingly, we've always started with an SQL custom queue and
       | thought one day we'll "upgrade to RabbitMQ".
        
       | northisup wrote:
       | Reticulating Splines?
        
       | endisneigh wrote:
       | Thousands a day? Really? Even if it were hundreds of thousands a
       | day it would make more sense to use a managed Pub Sub service and
       | save yourself the hassle (assuming modest throughput).
        
         | dymk wrote:
         | Yeah, I'd do the opposite of what they ended up doing. Start
         | with Postgres, which will handle their thousands-per-day no
         | sweat. If they scaled to > 100 millions/day, then start
         | investigating a dedicated message bus / queue system if an
         | optimized PG solution starts to hit its limits.
        
           | dpflan wrote:
           | Yeah, it seems like a more natural evolution into
           | specialization and scale rather than a step "backwards" to PG
           | which I suspect will be the subject of a future blogpost
            | about replacing their PG queue with another solution...
        
       | sa46 wrote:
       | Here are a couple of tips if you want to use postgres queues:
       | 
       | - You probably want FOR NO KEY UPDATE instead of FOR UPDATE so
       | you don't block inserts into tables that have a foreign key
       | relationship with the job table. [1]
       | 
       | - If you need to process messages in order, you don't want SKIP
       | LOCKED. Also, make sure you have an ORDER BY clause.
       | 
       | My main use-case for queues is syncing resources in our database
       | to QuickBooks. The overall structure looks like:
        | 
        |     BEGIN; -- start a transaction
        | 
        |     SELECT job.job_id, rm.data
        |     FROM qbo.transmit_job job
        |     JOIN resource_mutation rm USING (tenant_id, resource_mutation_id)
        |     WHERE job.state = 'pending'
        |     ORDER BY job.create_time
        |     LIMIT 1 FOR NO KEY UPDATE OF job NOWAIT;
        | 
        |     -- External API call to QuickBooks.
        |     -- If successful:
        |     UPDATE qbo.transmit_job
        |     SET state = 'transmitted'
        |     WHERE job_id = $1;
        | 
        |     COMMIT;
       | 
       | This code will serialize access to the transmit_job table. A more
       | clever approach would be to serialize access by tenant_id. I
       | haven't figured out how to do that yet (probably lock on a tenant
       | ID first, then lock on the job ID).
       | 
       | Somewhat annoyingly, Postgres will log an error if another worker
       | holds the row lock (since we're not using SKIP LOCKED). It won't
       | block because of NOWAIT.
       | 
       | CrunchyData also has a good overview of Postgres queues: [2]
       | 
       | [1]: https://www.migops.com/blog/2021/10/05/select-for-update-
       | and...
       | 
       | [2]: https://blog.crunchydata.com/blog/message-queuing-using-
       | nati...
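        | 
        | For the tenant_id idea above, one possible (untested) shape is a
        | transaction-scoped advisory lock keyed on the tenant, so a worker
        | skips tenants that are already being worked:
        | 
        |     SELECT job.job_id
        |     FROM qbo.transmit_job job
        |     WHERE job.state = 'pending'
        |       AND pg_try_advisory_xact_lock(hashtext(job.tenant_id::text))
        |     ORDER BY job.create_time
        |     LIMIT 1;
        |     -- caveat: the lock function is volatile and may be evaluated
        |     -- for rows that end up filtered out, so a worker can hold
        |     -- extra tenant locks until its transaction ends.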
        
       | eckesicle wrote:
       | Postgres is probably the best solution for every type of data
       | store for 95-99% of projects. The operational complexity of
        | maintaining other attached resources far exceeds the benefit they
       | realise over just using Postgres.
       | 
       | You don't need a queue, a database, a blob store, and a cache.
       | You just need Postgres for all of these use cases. Once your
       | project scales past what Postgres can handle along one of these
       | dimensions, replace it (but most of the time this will never
       | happen)
       | 
       | It also does wonders for your uptime and SLO.
        
         | alberth wrote:
         | > Postgres is probably the best solution for every type of data
         | store for 95-99% of projects.
         | 
         | I'd say it's more like:
         | 
         | - 95.0%: SQLite
         | 
         | - 4.9%: Postgres
         | 
         | - 0.1%: Other
        
         | adverbly wrote:
         | Careful with using postgres as a blob store. That can go bad
         | fast...
        
           | colonwqbang wrote:
           | Ominous... Care to elaborate?
        
           | no_butterscotch wrote:
           | more deets?
           | 
           | I want to use Postgres for JSON, I know it has specific
           | functionality for that.
           | 
            | But still, what do you mean by that, and does it apply to
            | JSON? Why or why not?
        
           | jakearmitage wrote:
           | Why?
        
         | fullstop wrote:
         | We collect messages from tens of thousands of devices and use
         | RabbitMQ specifically because it is uncoupled from the Postgres
         | databases. If the shit hits the fan and a database needs to be
         | taken offline the messages can pool up in RabbitMQ until we are
         | in a state where things can be processed again.
        
           | hn_throwaway_99 wrote:
           | Still trivial to get that benefit with just a separate
           | postgres instance for your queue, then you have the (very
           | large IMO) benefit of simplifying your overall tech stack and
           | having fewer separate components you have to have knowledge
           | for, keep track of version updates for, etc.
        
           | wvenable wrote:
           | You may well be the 1-5% of projects that need it.
        
             | Spivak wrote:
             | Everyone should use this pattern unless there's a good
             | reason not too though. Turning what would otherwise be
             | outages into queue backups is a godsend.
             | 
             | It becomes impossible to ever lose in-flight data. The
             | moment you persist to your queue you can ack back to the
             | client.
        
               | colonwqbang wrote:
               | > The moment you persist to your queue you can ack back
               | to the client.
               | 
               | Relational databases also have this feature.
        
               | Spivak wrote:
               | And if you store your work inbox in a relational db then
               | you invented a queueing system. The point is that queues
               | can ingest messages much much faster and cheaper than a
               | db, route messages based on priority and globally tune
               | the speed of workers to keep your db from getting
               | overwhelmed or use idle time.
        
               | jakearmitage wrote:
               | > The moment you persist to your queue you can ack back
               | to the client
               | 
               | You mean like... ACID transactions?
               | 
               | https://www.postgresql.org/docs/current/tutorial-
               | transaction...
        
               | Spivak wrote:
               | You act like the "I can persist data" is the part that
               | matters. It's the fact that I can from my pool of app
               | servers post a unit of work to be done and be sure it
               | will happen even if the app server gets struck down. It's
               | the architecture of offloading work from your frontend
               | whenever possible to work that can be done at your
               | leisure.
               | 
               | Use whatever you like to actually implement the queue,
               | Postgres wouldn't be my first or second choice but it's
               | fine, I've used it for small one-off projects.
        
               | mordae wrote:
               | In my experience, persistent message queue is just a poor
               | secondary database.
               | 
               | If anything, I prefer to use ZeroMQ and make sure
               | everything can recover from an outage and settle
               | eventually.
               | 
                | To ingest large inputs, I would just use short append-only
                | files and maybe send them over to the other node over
                | ZeroMQ to get a little bit more reliability, but such
                | high-volume data is rarely that critical.
               | 
               | There is nothing like free lunch when talking distributed
               | fault tolerant systems and simplicity usually fares
               | rather well.
        
               | 0cf8612b2e1e wrote:
               | What if the queue goes down? Without insane engineering,
               | there is always going to be a single point of failure
               | somewhere.
        
               | Spivak wrote:
               | It doesn't, it's two clusters in replication. It's like
               | the least insane engineering?
               | 
               | Are y'all not running your stuff highly available?
        
       | CobaltHorizon wrote:
       | This is interesting because I've seen a queue that was
       | implemented in Postgres that had performance problems before: the
        | job that wrote new work to the queue table would have DB
        | contention with the workers marking the rows as processed. I wonder
       | if they have the same problem but the scale is such that it
       | doesn't matter or if they're marking the rows as processed in a
       | way that doesn't interfere with rows being added.
        
         | throwaway5959 wrote:
         | > To make all of this run smoothly, we enqueue and dequeue
         | thousands of jobs every day.
         | 
         | The scale isn't large enough for this to at all be a worry. The
         | biggest worry here I imagine is ensuring that a job isn't
         | processed by multiple workers, which they solve with features
         | built into Postgres.
         | 
         | Usually I caution against using a database as a queue, but in
         | this case it removes a piece of the architecture that they have
         | to manage and they're clearly more comfortable with SQL than
         | RabbitMQ so it sounds like a good call.
        
           | KrugerDunnings wrote:
           | It is easy to avoid multiple workers processing the same
           | task: `delete from task where id = (select id from task for
           | update skip locked limit 1) returning *;`
        
             | throwaway5959 wrote:
             | I didn't say it was difficult, I just said it was the
             | biggest concern. That looks correct.
        
         | zem wrote:
         | yep, i had precisely this issue in a previous job, where i
         | tried to build a hacky distributed queue on top of postgres.
         | almost certainly my inexperience with databases rather than the
         | volume of jobs, but i figured i was being shortsighted trying
         | to roll my own and replaced it with rabbitmq (which we had a
         | hell of a time administering, but that's a different story)
        
         | zrail wrote:
         | (not sure why this comment was dead, I vouched for it)
         | 
         | There are a lot of ways to implement a queue in an RDBMS and a
         | lot of those ways are naive to locking behavior. That said,
         | with PostgreSQL specifically, there are some techniques that
         | result in an efficient queue without locking problems. The
         | article doesn't really talk about their implementation so we
         | can't know what they did, but one open source example is
         | Que[1]. Que uses a combination of advisory locking rather than
         | row-level locks and notification channels to great effect, as
         | you can read in the README.
         | 
         | [1]: https://github.com/que-rb/que
        
         | cldellow wrote:
         | They claim "thousands of jobs every day", so the volume sounds
         | very manageable. In a past job, I used postgres to handle
         | millions of jobs/day without too much hassle.
         | 
         | They also say that some jobs take hours and that they use
         | 'SELECT ... FOR UPDATE' row locks for the duration of the job
         | being processed. That strongly implies a small volume to me, as
         | you'd otherwise need many active connections (which are
         | expensive in Postgres!) or some co-ordinator process that
         | handles the locking for multiple rows using a single connection
         | (but it sounds like their workers have direct connections).
        
           | riogordo2go wrote:
           | I think the 'select for update' query is used by a worker to
           | fetch jobs ready for pickup, then update the status to
           | something like 'processing' and the lock is released. The
           | article doesn't mention holding the lock for the entire
           | duration of the job.
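            | 
            | One common shape of that claim step (names illustrative;
            | SKIP LOCKED added so workers don't queue behind each other):
            | 
            |     BEGIN;
            |     SELECT id, payload
            |     FROM job
            |     WHERE status = 'queued'
            |     ORDER BY created_at
            |     LIMIT 1
            |     FOR UPDATE SKIP LOCKED;
            | 
            |     UPDATE job SET status = 'processing' WHERE id = $1;
            |     COMMIT;  -- lock released; the worker runs the job after this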
        
             | ketchupdebugger wrote:
             | what happens if the task cannot be completed? or a worker
             | goes down? Is there a retry mechanism? maintaining a retry
             | mechanism might be a huge hassle.
        
             | sgarman wrote:
             | I wish they actually wrote about their exact
             | implementation. Article is kinda light on any content
             | without that. I suspect you are right, I have implemented
             | this kinda thing in a similar way.
        
           | bpodgursky wrote:
           | Sharding among workers by ID isn't hard.
        
       | FooBarWidget wrote:
       | I've also used PostgreSQL as a queue but I worry about
       | operational implications. Ideally you want clients to dequeue an
       | item, but put it back in the queue (rollback transaction) if they
        | crash while processing the item. But processing is a long-running
       | task, which means that you need to keep the database connection
       | open while processing. Which means that your number of database
       | connections must scale along with the number of queue workers.
       | And I've understood that scaling database connections can be
       | problematic.
       | 
       | Another problem is that INSERT followed by SELECT FOR UPDATE
       | followed by UPDATE and DELETE results in a lot of garbage pages
       | that need to be vacuumed. And managing vacuuming well is _also_
       | an annoying issue...
        
         | benlivengood wrote:
         | I've generally seen short(ish) leases as the solution to this
          | problem. The queue has an owner and expiration column, and
          | workers set the lease to NOW()+N when getting work; when
          | selecting for work they take anything that has expired or has
          | no lease.
         | 
         | This assumes the processing is idempotent in the rest of the
         | system and is only committed transactionally when it's done.
         | Some workers might do wasted work, but you can tune the
         | expiration future time for throughput or latency.
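          | 
          | A rough sketch of that claim query (names illustrative; $1 is
          | the worker id, and SKIP LOCKED keeps two workers from grabbing
          | the same row):
          | 
          |     UPDATE job
          |     SET owner = $1,
          |         lease_expires = now() + interval '5 minutes'
          |     WHERE id = (
          |         SELECT id FROM job
          |         WHERE done = false
          |           AND (lease_expires IS NULL OR lease_expires < now())
          |         ORDER BY id
          |         LIMIT 1
          |         FOR UPDATE SKIP LOCKED
          |     )
          |     RETURNING id, payload;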
        
       | u89012 wrote:
       | Would be nice if a little more detail were added in order to give
       | anyone looking to do the same more heads-up to watch out for
       | potential trouble spots. I take it the workers are polling to
       | fetch the next job which requires a row lock which in turn
       | requires a transaction yeah? How tight is this loop? What's the
       | sleep time per thread/goroutine? At what point does Postgres go,
       | sorry not doing that? Or is there an alternative to polling and
       | if so, what? :)
        
         | adamckay wrote:
         | > Or is there an alternative to polling and if so, what? :)
         | 
         | LISTEN and NOTIFY are the common approaches to avoiding
         | polling, but I've not used them myself (yet).
         | 
         | https://www.postgresql.org/docs/current/sql-listen.html
         | 
         | https://www.postgresql.org/docs/current/sql-notify.html
        
       | avinassh wrote:
       | > And we guarantee that jobs won't be picked up by more than one
       | worker through simple read/write row-level locks. The new system
       | is actually kind of absurdly simple when you look at it. And
       | that's a good thing. It's also behaved flawlessly so far.
       | 
        | Wouldn't this lead to contention issues when a lot of workers
        | are involved?
        
         | fabian2k wrote:
         | My understanding is that a naive implementation essentially
         | serializes access to the queue table. So it works, but no
         | matter how many requests you make in parallel, only one will be
         | served at a time (unless you have a random component in the
         | query).
         | 
         | With SKIP LOCKED you can resolve this easily, as long as you
         | know about the feature. But almost every tutorial and
         | description of implementing a job queue in Postgres mentions
         | this now.
        
         | prpl wrote:
          | You can do a select statement skipping locked rows in
         | postgres.
        
           | zinclozenge wrote:
           | Specifically you also want SELECT ... FOR UPDATE SKIP LOCKED;
        
         | CBLT wrote:
         | I believe that problem is avoided by using SKIP LOCKED[0].
         | 
         | [0] https://www.2ndquadrant.com/en/blog/what-is-select-skip-
         | lock...
        
       | jabl wrote:
       | Seems like a slamdunk example of choosing boring technology.
       | https://boringtechnology.club/
        
       | gamedna wrote:
        | Side note, the number of times this article redundantly
       | mentioned "Ditched RabbitMQ And Replaced It With A Postgres
       | Queue" made me kinda sick.
        
       | tonymet wrote:
        | whenever i see RDBMS queues i think: why would you implement a
        | queue or stack in a b-tree?
        | 
        | always go back to fundamentals. the rdbms is giving you
        | replication, queries, locking, but at what cost?
        
       | autospeaker22 wrote:
       | We do just about everything with one or more Postgres databases.
       | We have workers that query the db for tasks, do the work, and
       | update the db. Portals that are the read-only view of the work
       | being performed, and it's pretty amazing how far we've gotten
        | with just Postgres and no real tuning on our end. There have been a
        | couple of scenarios where query time was excessive, and we solved them by
       | learning a bit more about how Postgres worked and how to redefine
       | our data model. It seems to be the swiss army knife that allows
       | you to excel at most general cases, and if you need to do
       | something very specific, well at that point you probably need a
       | different type of database.
        
       | borplk wrote:
       | In many scenarios a DB/SQL-backed queue is far superior to the
       | fancy queue solutions such as RabbitMQ because it gives you
       | instantaneous granular control over 'your queue' (since it is the
       | result set of your query to reserve the next job).
       | 
       | Historically people like to point out the common locking issues
        | etc... with SQL but in modern databases you have a good number of
       | tools to deal with that ("select for update nowait").
       | 
       | If you think about it a queue is just a performance optimisation
       | (it helps you get the 'next' item in a cheap way, that's it).
       | 
       | So you can get away with "just a db" for a long time and just
       | query the DB to get the next job (with some 'reservations' to
       | avoid duplicate processing).
       | 
       | At some point you may overload the DB if you have too many
       | workers asking the DB for the next job. At that point you can add
       | a queue to relieve that pressure.
       | 
       | This way you can keep a super dynamic process by periodically
       | selecting 'next 50 things to do' and injecting those job IDs in
       | the queue.
       | 
       | This gives you the best of both worlds because you can maintain
       | granular control of the process by not having large queues (you
       | drip feed from DB to queue in small batches) and the DB is not
       | overly burdened.
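        | 
        | A rough sketch of that drip-feed reservation step (illustrative
        | schema; the returned IDs are what would get published to the
        | queue):
        | 
        |     WITH next_batch AS (
        |         SELECT id
        |         FROM job
        |         WHERE status = 'pending'
        |         ORDER BY priority DESC, created_at
        |         LIMIT 50
        |         FOR UPDATE SKIP LOCKED
        |     )
        |     UPDATE job
        |     SET status = 'reserved'
        |     WHERE id IN (SELECT id FROM next_batch)
        |     RETURNING id;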
        
         | dangets wrote:
         | +1 to this. I'm just as wary to recommend using a DB as a queue
         | as the next person, but it is a very common pattern at my
         | company.
         | 
         | DB queues allow easy prioritization, blocking, canceling, and
         | other dynamic queued job controls that basic FIFO queues do
         | not. These are all things that add contention to the queue
         | operations. Keep your queues as dumb as you possibly can and go
         | with FIFO if you can get away with it, but DB queues aren't the
         | worst design choice.
        
       | rorymalcolm wrote:
       | Were Prequel using RabbitMQ to stay cloud platform agnostic when
       | spinning up new environments? Always wondered how companies that
       | offer managed services on the customer's cloud like this manage
       | infrastructure in this regard. Do you maintain an environment on
       | each cloud platform with a relatively standard configuration, or
       | do you have a central cluster hosted in one cloud provider which
       | the other deployments phone home to?
        
       | twawaaay wrote:
       | As much as I detest MongoDB's immaturity in many respects, I
       | found a lot of features that actually make life easier when you
       | design pretty large scale applications (mine was typically doing
       | 2GB/s of data out of the database, which I like to think is
       | pretty large).
       | 
       | One feature I like is change event stream which you can subscribe
       | to. It is pretty fast and reliable and for good reason -- the
       | same mechanism is used to replicate MongoDB nodes.
       | 
       | I found you can use it as a handy notification / queueing
       | mechanism (more like Kafka topics than RabbitMQ). I would not
       | recommend it as any kind of interface between components, but
       | within an application, for its internal workings, I think it is
       | a pretty viable option.
        
         | aPoCoMiLogin wrote:
         | I've used that as a cache flush mechanism when some cached
         | records were updated/deleted; the simplicity was the key.
        
         | Joel_Mckay wrote:
         | Funny enough, we designed one subsystem to use RabbitMQ to
         | enforce linear committed records into mongodb to avoid indices.
         | I.e. the routes in rabbitMQ would ensure a GUID tagged record
         | was spatially localized with other user data on the same host
         | (the inter-host shovel traffic is minimized).
         | 
         | Depends on the use-case, but the original article smells like
         | FUD. This is because the connection C so lib allows you to
         | select how the envelopes are bound/ack'ed on the queue/dead-
         | letter-route in the AMQP client-consumer (you don't usually
         | camp on the connection). Also, the expected runtime constraint
         | should always be included when designing a job-queue regardless
         | of the underlying method (again, expiry default routing is
         | built into rabbitMQ)...
         | 
         | Cheers =)
        
         | necovek wrote:
         | That's quite interesting: I wonder if someone has done
         | something similar with Postgres WAL log streaming?
        
           | twic wrote:
           | If i understand correctly, then yes, they have. PostgreSQL
           | supports "logical decoding", which is where you parse the WAL
           | into logical events:
           | 
           | https://www.postgresql.org/docs/15/logicaldecoding.html
           | 
           | Quite a number of bits of software use that to do things with
           | these events. For example:
           | 
           | https://debezium.io/documentation/reference/stable/connector.
           | ..
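           | 
           | As a quick illustration, the bundled test_decoding plugin
           | can be poked at straight from psql (this assumes wal_level
           | is set to logical):
           | 
           |     SELECT pg_create_logical_replication_slot(
           |         'demo_slot', 'test_decoding');
           | 
           |     -- ...make some inserts/updates, then pull the events:
           |     SELECT * FROM pg_logical_slot_get_changes(
           |         'demo_slot', NULL, NULL);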
        
           | twawaaay wrote:
           | MongoDB's change stream is accidentally very simple to use.
           | You just call the database and get a continuous stream of
           | the documents that you are interested in. If
           | you need to restart, you can restart processing from the
           | chosen point. It is not a global WAL or anything like that,
           | it is just a stream of documents with some metadata.
        
             | spmurrayzzz wrote:
             | > If you need to restart, you can restart processing from
             | the chosen point
             | 
             | One caveat to this is that you can only start from wherever
             | the beginning of your oplog window is. So for large
             | deployments and/or situations where your oplog on-disk size
             | simply isn't tuned properly, you're SOL unless you build a
             | separate mechanism for catching up.
        
               | twawaaay wrote:
               | Which is fine, queueing systems can't store an infinite
               | number of messages either. In the end messages are stored
               | somewhere, so there is always some limit.
        
               | spmurrayzzz wrote:
               | Yep, absolutely. But the side effect I am referring to
               | (and probably wasn't clear enough about) is that the
               | oplog is globally shared across the replica set. So even
               | if your queue collection tops out at like 10k documents
               | max, if you have another collection in the same
               | deployment that's getting 10mm docs/min, your queue window
               | is also gonna be artificially limited.
               | 
               | Putting the queue in its own deployment is a good
               | insulation against this (assuming you don't need to use
               | aggregate() with the queue across collections obviously).
        
               | twawaaay wrote:
               | I do agree, but listen... this is supposed to be _handy_
               | solution. You know, my app already uses MongoDB, why do I
               | need another component if I can run my notifications with
               | a collection?
               | 
               | Also, I am a firm believer that you should not put
               | actual data through notifications. Notifications are
               | meant to wake other systems up, not carry gigabytes of
               | data. You can put your data into other storage and
               | notify: "Hey, here is the data for 10k new clients that
               | needs to be processed. Cheers!"
               | 
               | The message is meant to ensure correct processing flow
               | (message has been received, processed, if it fails
               | somebody else will process it, etc.), but it does not
               | have to carry all the data.
               | 
               | I have fixed at least one platform that "reached limits
               | of Kafka" (their words not mine) and "was looking for
               | expert help" to manage the problem.
               | 
               | My solution? I had the component that publishes upload
               | the data as compressed JSON to S3 and post a notification
               | with some metadata and a link to the JSON, and the client
               | parse the JSON. Bam, suddenly everything works fine, no
               | bottlenecks anymore. For the cost of maybe three pages of
               | code.
               | 
               | There are few situations where you absolutely need to
               | track so many individual objects that you have to start
               | caring whether hard drives come large enough. And I have
               | managed some pretty large systems.
        
               | spmurrayzzz wrote:
               | > I do agree, but listen... this is supposed to be handy
               | solution. You know, my app already uses MongoDB, why do I
               | need another component if I can run my notifications with
               | a collection?
               | 
               | We're in agreement, I think we may be talking past each
               | other. I use mongo for the exact use case you're
               | describing (messages as signals, not payloads of data).
               | 
               | I'm just sharing a footgun for others that may be reading
               | that bit me fairly recently in a 13TB replica set dealing
               | with 40mm docs/min ingress.
               | 
               | (It's a high resolution RF telemetry service, but the
               | queue mechanism is only a minor portion of it which never
               | gets larger than maybe 50-100 MB. Its oplog window got
               | starved because of the unrelated ingress.)
        
       | stereosteve wrote:
       | Another good library for this is Graphile Worker.
       | 
       | It uses both LISTEN/NOTIFY and advisory locks, so it is using
       | all the right features. And you can enqueue a job from SQL and
       | plpgsql triggers. Nice!
       | 
       | The worker is in Node.js.
       | 
       | https://github.com/graphile/worker
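       | 
       | As a generic illustration of the trigger-enqueue idea (this is
       | not Graphile Worker's actual API; the "jobs" and "users" tables
       | are made up):
       | 
       |     CREATE FUNCTION enqueue_welcome_email() RETURNS trigger AS $$
       |     BEGIN
       |       INSERT INTO jobs (task, payload)
       |       VALUES ('send_welcome_email',
       |               json_build_object('user_id', NEW.id));
       |       RETURN NEW;
       |     END;
       |     $$ LANGUAGE plpgsql;
       | 
       |     CREATE TRIGGER enqueue_on_signup
       |       AFTER INSERT ON users
       |       FOR EACH ROW EXECUTE FUNCTION enqueue_welcome_email();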
        
       | mark242 wrote:
       | In summary -- their RabbitMQ consumer library and config is
       | broken in that their consumers are fetching additional messages
       | when they shouldn't. I've never seen this in years of dealing
       | with RabbitMQ. This caused a cascading failure in that consumers
       | were unable to grab messages, rightfully, when only one of the
       | messages was manually ack'ed. Fixing this one fetch issue with
       | their consumer would have fixed the entire problem. Switching to
       | pg probably caused them to rewrite their message fetching code,
       | which probably fixed the underlying issue.
       | 
       | It ultimately doesn't matter because of the low volume they're
       | dealing with, but gang, "just slap a queue on it" gets you the
       | same results as "just slap a cache on it" if you don't understand
       | the tool you're working with. If they knew that some jobs would
       | take hours and some jobs would take seconds, why would you not
       | immediately spin up four queues? Two for the short jobs (one
       | acting as a DLQ), and two for the long jobs (again, one acting as
       | a DLQ). Your DLQ queues have a low TTL, and on expiration those
       | messages get placed back onto the tail of the original queues.
       | Any failure by your consumer, and that message gets dropped onto
       | the DLQ and your overall throughput is determined by the number *
       | velocity of your consumers, not by your queue architecture.
       | 
       | This pg queue will last a very long time for them. Great! They're
       | willing to give up the easy fanout architecture for simplicity,
       | which again at their volume, sure, that's a valid trade. At
       | higher volumes, they should go back to the drawing board.
        
         | ftkftk wrote:
         | My thoughts exactly half way through the article.
        
         | aidos wrote:
         | It may be a misconfiguration but I'm fairly sure you couldn't
         | change this behaviour in the past. Each worker would take a job
         | in advance and you could not prevent it (I might be
         | misremembering but I think I checked the source at the time).
         | 
         | In my experience, RabbitMQ isn't a good fit for long running
         | tasks. This was 10 years ago. But honestly, if you have a small
         | number of long running tasks, Postgres is probably a better
         | fit. You get transactional control and you remove a load of
         | complexity from the system.
        
           | skrtskrt wrote:
           | > RabbitMQ isn't a good fit for long running tasks
           | 
           | yeah I've seen 3 different workplaces run into this exact
           | issue, usually when they started off with a default Django-
           | Celery-Redis approach
           | 
           |  _all_ of those cases were actually easily fixed with
           | Postgres SELECT FOR UPDATE as a job queue
        
             | phamilton wrote:
             | I'll add another to the anecdata. We saw this issue with
             | RabbitMQ. We replaced it with SQS at the time but we're
             | currently rebuilding it all on SELECT FOR UPDATE.
             | 
             | Our problem was that when a consumer hung on a poison
             | pilled message, the prefetched messages would not be
             | released. We fixed the hanging, but hit a similar issue,
             | and then we fixed that, etc.
             | 
             | We moved to SQS for other reasons (the primary being that
             | we sometimes saturated a single erlang process per rabbit
             | queue), but moving to the SQS visibility timeout model has
             | in general been easier to reason about and has been a
             | better operations experience.
             | 
             | However, we've found that all the jobs are in postgres
             | anyway, and being able to index into our job queue and
             | remove jobs is really useful. We started storing job
             | metadata (including "don't process this job") in postgres
             | and checking it at the start of all our queue workers and
             | we've decided that our lives would be simpler if it was all
             | in postgres.
             | 
             | It's still an experiment on our part, but we've seen a lot
             | of strong stories around it and think it's worth trying
             | out.
        
           | 2muchcoffeeman wrote:
           | The linked docs in the article also suffer from the "wall of
           | text" style that almost all software docs suffer from. If
           | that's what they had to go by to work out this behaviour, I
           | wouldn't be surprised they missed it. Most of the time when I
           | look at docs, none of them are laid out to build
           | understanding. It's just a description of the system.
        
           | whakim wrote:
           | I don't think this behavior has changed significantly. The
           | key issue is that they seem to have correctly identified that
           | they wanted to prefetch a single task, but they didn't
           | recognize that this setting is the count of _un-ACK'ed_
           | tasks. If you ACK upon receipt (as most consumers do by
           | default), then you're really prefetching _two_ tasks - one
           | that's being processed, and one that's waiting to be
           | processed. If you ACK late, you get the behavior that TFA
           | seems to want. I've seen this misconfiguration a number of
           | times.
        
           | aftbit wrote:
           | What exactly do you mean by "take a job in advance"? I have
           | certainly set the prefetch limit to 1 on my queues, which I
           | believe prevents them from taking a job while they are
           | running one. One of our production applications runs 4+ hour
           | jobs in a RabbitMQ queue without blocking and has done so for
           | 5+ years.
           | 
           | You do want to make sure to set the VM high water mark below
           | 50% of RAM as the GC phase can double the used memory. If
           | high water mark is set too high, the box will thrash swap
           | really badly during GC and hang entirely.
        
         | seunosewa wrote:
         | Not exactly. For performance (I guess) each worker fetches an
         | extra job while it's working on the current job. If the current
         | job happens to be very long, then the extra job it fetched will
         | be stuck waiting for a long time.
         | 
         | Your multiple queue solution might work, but it is most
         | efficient to have just one queue with a pool of workers where
         | each worker doesn't pop a job unless it's ready to process it
         | immediately. In my experience, this is the optimal solution.
        
           | quintes wrote:
           | I used rabbit many many years ago but agree, scale consumers
           | and only pop when ready to actually process
        
           | mannyv wrote:
           | It's actually a misconfiguration (see the comment below with
           | the documentation).
        
         | tracker1 wrote:
         | Yeah, my first thought was curiosity about their volume needs.
         | DB based queues are fine if you don't need more than a few
         | messages a second of transport. For that matter, I've found
         | Azure's Storage Queues probably the simplest and most reliable
         | easy button for queues that don't need a lot of complexity.
         | Once you need more than that, it gets... complicated.
         | 
         | Also, sharing queues for multiple types of jobs just feels like
         | frustration waiting to happen.
        
           | TexanFeller wrote:
           | > DB based queues are fine if you don't need more than a few
           | messages a second of transport
           | 
           | I'd estimate more like dozens to hundreds per second should
           | be pretty doable, depending on payload, even on a small DB.
           | More if you can logically partition the events. Have
           | implemented such a queue and haven't gotten close to
           | bottlenecking on it.
        
             | tracker1 wrote:
             | Had meant few hundred... :-)
        
         | echelon wrote:
         | I'm at a point where I built a low volume queue in MySQL and
         | need to rip it out and replace it with something that does 100+
         | QPS, exactly once dispatch / single worker processing, job
         | priority level, job topics, sampling from the queue without
         | dequeuing, and limited on failure retry.
         | 
         | I can probably bolt some of these properties onto a queue that
         | doesn't support all the features I need.
        
           | qaq wrote:
           | batch requests instead of ripping anything out. 100 qps
           | sounds really low btw
        
           | ikiris wrote:
           | If your MySQL queue can't handle 100qps you've done something
           | really wrong.
        
         | ryanjshaw wrote:
         | > I've never seen this in years of dealing with RabbitMQ.
         | 
         | Did you do long running jobs like they did? It's a stereotype,
         | but I don't think they used the technology correctly here --
         | you're not supposed to hold onto messages for _hours_ before
         | acknowledging. They should have used RabbitMQ just to kick off
         | the job, immediately ACKing the request, with job
         | tracking/completion handled inside... a database.
        
           | carrja99 wrote:
           | Bingo. This is exactly a setup I have seen before as well.
           | Message received and acked, process kicked off, job state
           | tracked in redis or similar.
        
           | hospadar wrote:
           | at which point, if you've got to use a DB to track status,
           | really why bother with the queuing system?
        
             | TedDoesntTalk wrote:
             | When you're dealing with billions of messages, i think
             | queuing systems may be tuned more for it?
             | 
             | I'd like to hear why people chose Kafka over some RDBMS
             | tables.
        
             | giovannibonetti wrote:
             | It's all a matter of how much throughput you need. A
             | queuing system can handle, in the same hardware, orders of
             | magnitude more than a traditional SQL database that writes
             | rows to disk in a page-oriented fashion.
             | 
             | If your load is, say, a few hundred writes/second, stick
             | with the database only, and it will be much simpler.
        
               | arcticfox wrote:
               | how does that help if you still have to have a DB
               | tracking status? you still need the same order-of-
               | magnitude of DB throughput
        
           | xahrepap wrote:
           | I've used RabbitMq to do long running jobs. Jobs that take
           | hours and hours to complete. Occasionally even 1-2 days.
           | 
           | It did take some configuring to get it working. Between
           | acking appropriately and the prefetch (qos perhaps? Can't
           | remember, don't have it in front of me), we were able to make
           | it work. It was pretty straightforward; it never even crossed
           | my mind that this isn't a correct use case for RMQ.
           | 
           | (Used the Java client.)
        
           | mark242 wrote:
           | The short answer is "yes" but the questions that you should
           | be asking are: A) How long am I willing to block the queue
           | for additional consumers, B) How committed am I to getting
           | close to exactly-once processing, and C) how tolerant of
           | consumer failure should I be? The answers to those three
           | questions are what drive your queue architecture.
           | Note that this has nothing to do with time spent processing
           | messages or "long running jobs".
           | 
           | Assume that your producers will be able to spike and generate
           | messages faster than your consumers can process them. This is
           | normal! This is why you have a queue in the first place! If
           | your jobs take 5 seconds or 5 hours, your strategy is
           | influenced by the answers to those three questions. For
           | example -- if you're willing to drop a message if a consumer
           | gets power-cycled, then yeah, you'd immediately ack the
           | request and put it back onto a dead letter queue if your
           | consumer runs into an exception. Alternatively, if you're
           | unwilling to block and you want to be very tolerant of
           | consumer failure, you'd fan out your queues and have your
           | consumers checking multiple queues in parallel. Etc etc etc,
           | you get the drift.
           | 
           | Keep in mind also that this isn't specific to RabbitMQ! You'd
           | want to answer the same questions if you were using SQS, or
           | if you were using Kafka, or if you were using 0mq, or if you
           | were using Redis queues, or if you were using pg queues.
        
             | zibarn wrote:
             | Except SQS has a 12-hour limit on unacked messages.
        
         | dpflan wrote:
         | Your last comment is the key: they had an issue, not the
         | scale, so a simpler approach works. But then I imagine that
         | this company, which is new and growing, will have a future
         | blogpost about switching from a pg queue to something that
         | fits their scale...
        
           | SahAssar wrote:
           | So they are picking the right tool (and a tool that they
           | know) for their problem.
        
             | hospadar wrote:
             | Also, when they have too many jobs for their one table,
             | partition the table by customer; when that's still somehow
             | too big, shard the table across a couple DB instances.
             | Toss in some beefy machines that can keep the tables in
             | memory and I suspect you'd have a LOOOONG way to go before
             | you ever really needed to get off of postgres.
             | 
             | In my experience, the benefits of a SQL table for a problem
             | like this are real big - easier to see what's in the queue,
             | manipulate the queue, resolve head-of-queue blocking
             | problems, etc.
        
             | dpflan wrote:
             | There is another variable of experience with the technology
             | which seems to be high for Postgres, low for RabbitMQ...
        
       | simonw wrote:
       | The best thing about using PostgreSQL for a queue is that you can
       | benefit from transactions: only queue a job if the related data
       | is 100% guaranteed to have been written to the database, in such
       | a way that it's not possible for the queue entry not to be
       | written.
       | 
       | Brandur wrote a great piece about a related pattern here:
       | https://brandur.org/job-drain
       | 
       | He recommends using a transactional "staging" queue in your
       | database which is then written out to your actual queue by a
       | separate process.
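       | 
       | The core of the pattern is a single transaction (table and
       | column names below are illustrative, not from the article):
       | 
       |     BEGIN;
       |     INSERT INTO invoices (id, customer_id, total)
       |         VALUES (123, 42, 100.00);
       |     INSERT INTO staged_jobs (job_name, job_args)
       |         VALUES ('email_receipt', '{"invoice_id": 123}');
       |     COMMIT;
       | 
       | Either both rows commit or neither does, so the drain process
       | can never pick up a job whose underlying data is missing.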
        
         | orthoxerox wrote:
         | I also used a similar pattern in reverse to process incoming
         | messages that had to be partially processed in-order: one
         | simple worker that dumped the queue to a Postgres table at full
         | speed and N workers that used a slightly more complicated
         | SELECT ... FOR UPDATE SKIP LOCKED query than usual to pick up
         | the next message that could be safely processed. Again, using a
         | single database made transactions very simple: only the
         | message-dumping worker had to worry about processing messages
         | exactly once (which it did via being conservative with acking
         | and utilizing INSERT ... ON CONFLICT DO NOTHING); every other
         | worker could just implicitly rely on database transactions.
        
         | justinsaccount wrote:
         | Also you can benefit from being able to use all of SQL to
         | manage the queue.
         | 
         | I built a system ages ago that had modest queue needs.. maybe
         | 100 jobs a day. It involved syncing changes in the local
         | database with external devices. Many changes would ultimately
         | update the same device, and making the fewest number of updates
         | was important.
         | 
         | The system used an extremely simple schema: A table with
         | something like [job_id, device, start_after, time_started,
         | time_finished]
         | 
         | When queueing a job for $device, do an upsert to either insert
         | a new record, or bump up the start_after of a not yet started
         | job to now+5 minutes. When looking for a job to run, ignore
         | anything with a start_after in the future.
         | 
         | As edits were made, it would create a single job for each
         | device that would run 5 minutes after the last change was made.
         | 
         | I know a lot of queueing systems have the concept of a delayed
         | job, but I haven't come across any that had the concept of
         | delayed jobs+dedup/coalescence.
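         | 
         | A rough sketch of that upsert, assuming a partial unique index
         | that allows at most one unstarted job per device:
         | 
         |     CREATE UNIQUE INDEX jobs_pending_device
         |         ON jobs (device) WHERE time_started IS NULL;
         | 
         |     INSERT INTO jobs (device, start_after)
         |     VALUES ('device-1', now() + interval '5 minutes')
         |     ON CONFLICT (device) WHERE time_started IS NULL
         |     DO UPDATE SET start_after = excluded.start_after;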
        
         | tlarkworthy wrote:
         | It's got a better name called a transactional outbox
         | https://microservices.io/patterns/data/transactional-outbox....
        
           | simonw wrote:
           | Yeah, that's a better name for it - good clear explanation
           | too.
        
         | djm_ wrote:
         | This is so important if you want to avoid incredibly gnarly
         | race conditions. In particular for us: jobs being run even
         | before the transaction has been fully committed to the
         | database.
         | 
         | We utilise a decorator for our job addition to external queues,
         | such that the function that does the addition gets attached to
         | Django's "on transaction commit" signal and thus doesn't
         | actually run until the outer database transaction for that
         | request has been committed.
        
         | tannhaeuser wrote:
         | The original messaging middleware on which RabbitMQ (and other
         | AMQP-capable message brokers) are based is IBM's WebSphere MQ
         | fka MQseries. MQ can be used in a transactional fashion
         | depending on whether messages are stored or can be timed out,
         | and other QoS. Within CICS, MQ can also participate in
         | distributed transactions along with database transactions, and
         | this would be a typical way to use transactional message
         | processing. And X/Open distributed transactions were one of the
         | earliest open system/Unix standards from around 1983, so have
         | been a well understood architecture outside mainframes as well.
         | 
         | That it's beneficial to use Postgres messaging (or Oracle AQ or
         | whatever) for its transactional semantics is kind of accidental
         | and a consequence of folks not wanting to bother with dtx. Even
         | though databases are accessed via networks, truly scalable work
         | distribution can't be achieved using SQL, much less with
         | SQLite. Or in other words, if you're using messaging queues in
         | databases, you could use tables and row locks directly just as
         | well.
        
       | [deleted]
        
       | dapearce wrote:
       | Love to see it. We (CoreDB) recently released PGMQ, a message
       | queue extension for Postgres: https://github.com/CoreDB-
       | io/coredb/tree/main/extensions/pgm...
        
         | fullstop wrote:
         | Perhaps you mean https://github.com/CoreDB-
         | io/coredb/tree/main/extensions/pgm...
         | 
         | Your link results in a 404.
        
           | dapearce wrote:
           | Yes, copy/paste error just fixed!
        
       | gorjusborg wrote:
       | I'm all for simplifying stacks by removing stuff that isn't
       | needed.
       | 
       | I've also used in-database queuing, and it worked well enough for
       | some use cases.
       | 
       | However, most importantly: calling yourself a maxi multiple times
       | is cringey and you should stop immediately :)
        
         | Zopieux wrote:
         | What even is a maxi, please?
        
       | macspoofing wrote:
       | How do you handle stale 'processing' jobs (i.e. jobs that were
       | picked-up by a consumer but never finished - maybe because the
       | consumer died)?
        
         | severino wrote:
         | In the very few occasions that I've seen a queue backed by a
         | Postgres table, when a job was taken, its row in the database
         | was locked for the entire processing. If the job was finished,
         | a status column was updated so the job won't be taken again in
         | the future. If it wasn't, maybe because the consumer died, the
         | transaction would eventually be rolled back, leaving the row
         | unlocked for another consumer to take it. But the author may
         | have implemented this differently.
        
         | bstempi wrote:
         | Not the author, but I've used PG like this in the past. My
         | criteria for selecting a job was (1) the job was not locked and
         | (2) was not in a terminal state. If a job was in the
         | "processing" state and the worker died, that lock would be free
         | and that job would be eligible to get picked up since its not
         | in a terminal state (e.g., done or failed). This can be
         | misleading at times because a job will be marked as processing
         | even though it's not actually being worked on.
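         | 
         | A common variant (not necessarily what either of us ran) is to
         | record when a job was claimed and periodically sweep stale
         | claims back onto the queue:
         | 
         |     UPDATE jobs
         |     SET status = 'queued', claimed_at = NULL
         |     WHERE status = 'processing'
         |       AND claimed_at < now() - interval '15 minutes';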
        
       | andrewstuart wrote:
       | I wrote a message queue in Python called StarQueue.
       | 
       | It's meant to be a simpler reimagining of Amazon SQS.
       | 
       | It has an HTTP API and behaves mostly like SQS.
       | 
       | I wrote it to support Postgres, Microsoft's SQL server and also
       | MySQL because they all support SKIP LOCKED.
       | 
       | At some point I turned it into a hosted service and only
       | maintained the Postgres implementation though the MySQL and SQL
       | server code is still in there.
       | 
       | It's not an active project but the code is at
       | https://github.com/starqueue/starqueue/
       | 
       | After that I wanted to write the world's fastest message queue,
       | so I implemented an HTTP message queue in Rust. It maxed out the
       | disk at about 50,000 messages a second, I vaguely recall, so I
       | switched to memory only, and on the biggest EC2 instance I could
       | run it on it did about 7 million messages a second. That was
       | just a crappy prototype so I never released the code.
       | 
       | After that I wanted to make the simplest possible message queue,
       | and discovered that Linux atomic moves are the basis of a
       | perfectly acceptable message queue that is simply file system
       | based. I didn't turn it into a message queue, but I wrote
       | something close enough to be the same: an SMTP buffer called
       | Arnie. It's only about 100 lines of Python.
       | https://github.com/bootrino/arniesmtpbufferserver
        
       | pnathan wrote:
       | I've had a very good experience with pg queuing. I didn't even
       | know `skip locked` was a pg clause. That would have... made the
       | experience even better!
       | 
       | I am afraid I've moved to a default three-way architecture:
       | 
       | - backend autoscaling stateless server
       | 
       | - postgres database for small data
       | 
       | - blobstore for large data
       | 
       | It's not that other systems are bad; it's just that those 3
       | components get you off the ground flying, and if you're
       | struggling to scale past that, you're already doing enormous
       | volume or have some really interesting data patterns (geospatial
       | or timeseries, perhaps).
        
         | allan_s wrote:
         | For geospatial you actually have the PostGIS extension, which
         | is a battle-tested solution.
        
           | pnathan wrote:
           | Quite correct. Made an error. Carryover from a prior job
           | where we mucked with weather data... I was thinking more
           | along the lines of raster geo datasets like sat images, etc.
           | Each pixel represents one geo location, with a carrying load
           | of metadata, then a timeseries of that snapshot & md, so
           | timeseries-raster-geomapped data basically.
           | 
           | I don't remember anymore what that general kind of datatype
           | is called, sadly.
        
       | smallerfish wrote:
       | We've inadvertently "load tested" our distributed locking / queue
       | impl on postgres in production, and so I know that it can handle
       | hundreds of thousands of "what should I run / try to take lock on
       | task" queries per minute, with a schema designed to avoid
       | bloat/vacuuming, tuned indices, and reasonably beefy hardware.
        
       | haarts wrote:
       | Last year I didn't even know you could use Postgres as a queue.
       | I used it just for fun and it is GREAT. People using Kafka are
       | kidding themselves.
        
       | omneity wrote:
       | Postgres is super cool and comes with batteries for almost any
       | situation you can throw at it. Low throughput scenarios are a
       | great match. In high throughput cases, you might find yourself
       | not needing all the extra guarantees that Postgres gives you, and
       | at the same time you might need other capabilities that Postgres
       | was not designed to handle, or at least not without a performance
       | hit.
       | 
       | Like everything else in life, it's always a tradeoff. Know your
       | workload, the tradeoffs your tools are making, and make sure to
       | mix and match appropriately.
       | 
       | In the case of Prequel, it seems they possibly have a low
       | throughput situation on their hands, i.e. in the case of
       | periodic syncs the time spent queuing the instruction <<< the
       | time needed to execute it. Postgres is great in this case.
        
       | semiquaver wrote:
       | When your workload is trivially tiny, most any technology can be
       | made to work.
        
         | fabian2k wrote:
         | There's a pretty large area between "trivially tiny" and "so
         | large that a single Postgres instance on reasonable hardware
         | can't handle it anymore".
        
           | semiquaver wrote:
           | Agreed! Lots of people overengineer.
           | 
           | I'd venture to guess that the median RabbitMQ-using app in
           | production could not easily be replaced with postgres though.
           | The main reasons this one could are very low volume and not
           | really using any of RMQ's features.
           | 
           | I love postgres! But RMQ and co fulfill a different need.
        
           | baq wrote:
           | and the high end of unreasonable hardware can get you
           | reaaaaaallly far, which in the age of cloud computing is
           | something people forget about and try to scale horizontally
           | when not strictly necessary
        
             | threeseed wrote:
             | Most people should at least be thinking about horizontal
             | scalability.
             | 
             | Because the chances of a cloud instance randomly going
             | down are not insignificant.
        
       | anecdotal1 wrote:
       | Postgres job queue in Elixir: Oban
       | 
       | "a million jobs a minute"
       | 
       | https://getoban.pro/articles/one-million-jobs-a-minute-with-...
        
       | tantalor wrote:
       | What does "maxi/maxis" mean in this context?
       | 
       | Google search for [sql maxis] just returns this article.
        
       | VincentEvans wrote:
       | One thing worth pointing out - that the approach described in TFA
       | changes PUSH architecture to PULL.
       | 
       | So now you have to deal with deciding how tight your polling loop
       | is, and with reads that are happening regardless of whether you
       | have messages waiting to be processed or not, expending both CPU
       | and requests, which may matter if you are billed accordingly.
       | 
       | Not in any way knocking it, just pointing out some trade-offs.
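       | 
       | Worth noting that Postgres can soften the polling cost with
       | LISTEN/NOTIFY (a sketch, with a made-up channel name):
       | 
       |     -- worker connection:
       |     LISTEN new_job;
       | 
       |     -- producer, right after committing a new job row:
       |     NOTIFY new_job;
       | 
       | Workers still poll occasionally as a fallback, but they can
       | sleep until a notification arrives instead of hammering the
       | table.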
        
       | say_it_as_it_is wrote:
       | I love postgresql. It's a great database. However, this blog post
       | is by people who are not quite experienced enough with message
       | processing systems to understand that the problem wasn't RabbitMQ
       | but how they used it.
        
       | TexanFeller wrote:
       | Using a DB as an event queue opens up many options not easily
       | possible with traditional queues. You can dedupe your events by
       | upserting. You can easily implement dynamic priority adjustment
       | to adjust processing order. Dedupe and priority adjustment feels
       | like an operational superpower.
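       | 
       | For instance (illustrative schema, assuming a unique index on
       | event_key):
       | 
       |     -- dedupe: re-queuing the same event just refreshes it
       |     INSERT INTO events (event_key, payload, priority)
       |     VALUES ('sync:customer:42', '{}', 5)
       |     ON CONFLICT (event_key) DO UPDATE
       |         SET payload = excluded.payload;
       | 
       |     -- dynamic priority: bump a customer's pending events
       |     UPDATE events SET priority = 100
       |     WHERE event_key LIKE 'sync:customer:42%'
       |       AND processed_at IS NULL;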
        
       | user3939382 wrote:
       | Another middle ground is AWS Batch. If you don't need
       | complicated rules based on the outcome of the run, etc., it's
       | simpler, especially if you're already used to doing ECS tasks.
        
       | SergeAx wrote:
       | > One of our team members has gotten into the habit of pointing
       | out that "you can do this in Postgres"
       | 
       | Actually, using Postgres stored procedures they can do anything
       | in Postgres. I am quite sure they can rewrite their entire
       | product using only stored procedures. Doesn't mean they really
       | want to do that, of course.
        
       ___________________________________________________________________
       (page generated 2023-04-11 23:00 UTC)