Back then, you couldn't build a performant queue in a database without a huge number of resources, both hardware wise and people wise. That's why only huge enterprises were using them then.
Only recently (last decade or so), has the performance of an open source database on modest hardware caught up to the alternatives.
Hi, OP. Any thoughts on why to prefer using your OLTP for queueing as opposed to Kafka? In my mind it would be a drop in ease of observability (though there is still KSQL and a number of UI wrappers) in exchange for much better performance. I'm interested if you had other reasons.
Observability is a big advantage, another advantage (in the context of DBOS specifically) is integration with durable workflows, so you can write a large workflow that enqueues and manages many smaller tasks.
From what I recall, Reddit uses AWS extensively. Could they not have replaced RabbitMQ with SQS? You get the near unlimited horizontal scalability, extremely good uptime, guaranteed at least once message delivery and for the case of a worker crash, the messages will become visible again after the visibility timeout (since they wouldn’t have been deleted by the worker).
I think there is a hard limit on the number of in-flight requests (that is items that have been dequeued by a worker, but whose job has not been completed). I wouldn't be surprised if Reddit hit those sorts of volumes.
I was architecture team lead at Skype 2005-2011 and persistent queues were the only ones around. Basically, because they knew how to scale databases, our DBA team (Hannu Krosing comes to mind) built queues into and on top of Postgres. It happened not just at Skype, too: eBay guys (Dan Prichett in particular) had built a very cool Oracle-based queue solution. Scaling persistent queues seems to be something that needs to be reinvented periodically. Maybe it’s too nuanced of a problem?
This article resonated with my experience building what was essentially a distributed task queue using Redis+PostgreSQL with Python workers in Kubernetes.
It seems like these systems naturally evolve different patterns based on their specific use cases.
The logic of our queue was intertwined with a rule engine. I wrote about building a rule engine here [1].
Another difference to this article is that it did not report back to the client as the events were delivered via web hooks.
There are some different approaches here and there which come from making it application specific, e.g. we added a periodic reconciliation check.
I also built a debouncer into the queue to give special treatment to burst in the load.
I actually worked on security for Skype in 2005/2006. And I knew Dan at eBay too.
Using databases for queues isn't a new idea, but a couple things make it different now. One is that I think this is the first time a solution has been open-sourced (not 100% sure on that).
And two, Postgres added SKIP LOCKED in 9.5 (around 2016), which was the performance unlock to make this work without needing a whole DBA team like Skype had.
I know from firsthand experience that there were investment banks doing this at least as early as the mid 1990s - on top of Sybase in the particular case I have in mind. Perhaps not the best performance, but it was trivially easy to inspect, and you could integrate with other database features, for example stored procedures.
> What we really needed to make distributed task queueing robust are durable queues ...
> Durable queues were rare when I was at Reddit, but they’re more and more popular now.
It sounds like the answer was known at the time but there wasn't the resources to solve it ?
Only recently (last decade or so), has the performance of an open source database on modest hardware caught up to the alternatives.
There are some different approaches here and there which come from making it application specific, e.g. we added a periodic reconciliation check. I also built a debouncer into the queue to give special treatment to burst in the load.
[1] https://blog.benediktsvogler.com/blog/building-a-distributed...
Using databases for queues isn't a new idea, but a couple things make it different now. One is that I think this is the first time a solution has been open-sourced (not 100% sure on that).
And two, Postgres added SKIP LOCKED in 9.5 (around 2016), which was the performance unlock to make this work without needing a whole DBA team like Skype had.