Databases in 2025: A Year in Review

(cs.cmu.edu)

650 points | by viveknathani_ 1 day ago

35 comments

  • danielfalbo 1 day ago
    Maybe off-topic but,

    If you're not familiar with the CMU DB Group you might want to check out their eccentric teaching style [1].

    I absolutely love their gangsta intros like [2] and pre-lecture dj sets like [3].

    I also remember a video where he was lecturing with someone sleeping on the floor in the background for some reason. I can't find that video right now.

    Not too sure about the context or Andy's biography; I'll research that later. I'm even more curious now.

    [1] https://youtube.com/results?search_query=cmu+database

    [2] https://youtu.be/dSxV5Sob5V8

    [3] https://youtu.be/7NPIENPr-zk?t=85

    • sargun 1 hour ago
      Andy Pavlo absolutely seems like the kind of guy that I would want to get a drink with.
    • sirfz 1 day ago
      Indeed, I was delighted when I read the part about Wu-Tang's time capsule; obviously OP is a Wu-Tang and general hip-hop fan. The intro you shared is dope!
    • znpy 1 day ago
      I can't tell whether their "intro to database systems" is an introductory (undergrad) level course or some advanced course (as in, an introduction to database internals).

      Anyone willing to clarify this? I'm quite weak at database stuff, and I'd love to find a proper undergrad-level course to learn and catch up.

      • lmwnshn 23 hours ago
        It is an undergrad course, though it is cross-listed for masters students as well. At CMU, the prerequisites chain looks like this: 15-122 (intro imperative programming, zero background assumed, taken by first semester CS undergrads) -> 15-213 (intro systems programming, typically taken by the end of the second year) -> 15-445 (intro to database systems, typically taken in the third or fourth year). So in theory, it's about one year of material away from zero experience.
      • Tostino 1 day ago
        It's the internals.

        He is training up people to work on new features for existing databases, or build new ones.

        He's not training application developers in how to use a database.

        Knowing some of the internals can help application developers make better decisions when it comes to using databases though.

    • dang 22 hours ago
      (I consed "https://" onto your links so they'd become clickable. Hope that's ok!)
  • beders 1 day ago
    While the author mentions that he just doesn't have the time to look at all the databases, none of the reviews of the last few years mention immutable and/or bi-temporal databases.

    Which looks more like a blind spot to me honestly. This category of databases is just fantastic for industries like fintech.

    Two candidates stand out: https://xtdb.com/blog/launching-xtdb-v2 (2025) and https://blog.datomic.com/2023/04/datomic-is-free.html (2023)

    • apavlo 1 day ago
      > none of the reviews of the last few years mention immutable and/or bi-temporal databases.

      We hosted XTDB to give a tech talk five weeks ago:

      https://db.cs.cmu.edu/events/futuredata-reconstructing-histo...

      > Which looks more like a blind spot to me honestly.

      What do you want me to say about them? Just that they exist?

      • mrtimo 1 day ago
        Nice work Andy. I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.). Something to consider for the future. Thanks.
    • zie 1 day ago
      You can get pretty far with just PG using tstzrange and friends: https://www.postgresql.org/docs/current/rangetypes.html

      Otherwise there are full bitemporal extensions for PG, like this one: https://github.com/hettie-d/pg_bitemporal

      What we do is use range types for when a row applies or not, so we get history, and then for 'immutability' we have 2 audit systems: one in-database, as row triggers that keep an online copy of what's changed and by whom. This also gives us built-in undo for everything. If a mistake happens, we can just undo the change, easy peasy. The audit log captures the undo as well of course, so we keep that history too.

      Then we also do an "off-line" copy, via PG logs, that get shipped off the main database into archival storage.

      Works really well for us.
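
      A minimal sketch of the range-type half of this, assuming psycopg2 and made-up table/column names (the trigger-based audit part is left out):

          import psycopg2

          DDL = """
          CREATE TABLE IF NOT EXISTS prices (
              item_id bigint    NOT NULL,
              price   numeric   NOT NULL,
              valid   tstzrange NOT NULL DEFAULT tstzrange(now(), NULL),
              EXCLUDE USING gist (item_id WITH =, valid WITH &&)  -- no overlapping versions per item
          );
          """

          # An "update" closes the current version's range, then inserts the new version.
          UPDATE_SQL = """
          UPDATE prices SET valid = tstzrange(lower(valid), now())
           WHERE item_id = %(id)s AND upper_inf(valid);
          INSERT INTO prices (item_id, price) VALUES (%(id)s, %(price)s);
          """

          with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
              cur.execute("CREATE EXTENSION IF NOT EXISTS btree_gist;")  # needed by the exclusion constraint
              cur.execute(DDL)
              cur.execute(UPDATE_SQL, {"id": 42, "price": 19.99})
              # Point-in-time read: which price applied at a given moment?
              cur.execute("SELECT price FROM prices WHERE item_id = %s AND valid @> %s::timestamptz",
                          (42, "2025-06-01 00:00:00+00"))
              print(cur.fetchone())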

    • radarroark 1 day ago
      People are slow to realize the benefit of immutable databases, but it is happening. It's not just auditability; immutable databases can also allow concurrent reads while writes are happening, fast cloning of data structures, and fast undo of transactions.

      The ones you mentioned are large backend databases, but I'm working on an "immutable SQLite"...a single file immutable database that is embedded and works as a library: https://github.com/radarroark/xitdb-java

      • j16sdiz 3 hours ago
        Is there any big new research/development happening in immutable databases?

        I know they're great... but I don't see much news about them.

    • delichon 1 day ago
      I see people bolting temporality and immutability onto triple stores, because xtdb and datomic can't keep up with their SPARQL graph traversal. I'm hoping for a triple store with native support for time travel.
    • felipelalli 23 hours ago
      FYI I made a comment very similar to yours, before reading yours. I'll put it here for reference. https://news.ycombinator.com/item?id=46503181
    • anonymousDan 1 day ago
      Why fintech specifically?
      • groestl 1 day ago
        Destructive operations are both tempting to some devs and immensely problematic in that industry for regulatory purposes, so picking a tech that is inherently incapable of destructive operations is alluring, I suppose.
      • falcor84 1 day ago
        I would assume that it's because in fintech it's more common than in other domains to want to revert a particular thread of transactions without touching others from the same time.
        • postexitus 1 day ago
          Not only transactions - but state of the world.
      • defo10 1 day ago
        compliance requirements mostly (same for health tech)
      • postexitus 1 day ago
        Because, money.
    • quotemstr 23 hours ago
      XTDB addresses a real use-case. I wish we invested more in time series databases actually: there's a ton of potential in a GIS-style database, but 1D and oriented around regions on the timeline, not shapes in space.

      That said, it's kind of frustrating that XTDB has to be its own top-level database instead of a storage engine or plugin for another. XTDB's core competence is its approach to temporal row tagging and querying. What part of this core competence requires a new SQL parser?

      I get that the XTDB people don't want to expose their feature set as a bunch of awkward table-valued functions or whatever. Ideally, DB plugins for Postgres, SQLite, DuckDB, whatever would be able to extend the SQL grammar itself (which isn't that hard if you structure a PEG parser right) and expose new capabilities in an ergonomic way so we don't end up with a world of custom database-verticals each built around one neat idea and duplicating the rest.

      I'd love to see databases built out of reusable lego blocks to a greater extent than today. Why doesn't Calcite get more love? Is it the Java smell?

      • refset 21 hours ago
        > it's kind of frustrating that XTDB has to be its own top-level database instead of a storage engine or plugin for another. XTDB's core competence is its approach to temporal row tagging and querying. What part of this core competence requires a new SQL parser?

        Many implementation options were considered before we embarked on v2, including building on Calcite. We opted to maximise flexibility over the long term (we have bigger ambitions beyond the bitemporal angle) and to keep non-Clojure/Kotlin dependencies to a minimum.

  • TekMol 1 day ago
    From my perspective on databases, two trends continued in 2025:

    1: Moving everything to SQLite

    2: Using mostly JSON fields

    Both started a few years back and accelerated in 2025.

    SQLite is just so nice and easy to deal with, with its no-daemon, one-file-per-DB, one-type-per-value approach.

    And the JSON arrow functions make it a pleasure to work with flexible JSON data.
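
    A tiny sketch of that, using Python's built-in sqlite3 module (the -> / ->> operators need SQLite 3.38+; the table and payload are made up):

        import sqlite3

        con = sqlite3.connect("app.db")
        con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
        con.execute("INSERT INTO events (payload) VALUES (?)",
                    ('{"user": "alice", "tags": ["signup", "mobile"], "meta": {"plan": "pro"}}',))
        con.commit()

        # ->> extracts an SQL value, -> returns a JSON fragment
        rows = con.execute("""
            SELECT payload ->> '$.user'      AS user,
                   payload ->  '$.tags'      AS tags_json,
                   payload ->> '$.meta.plan' AS plan
              FROM events
             WHERE payload ->> '$.meta.plan' = 'pro'
        """).fetchall()
        print(rows)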

    • delaminator 1 day ago
      From my perspective, everything's DuckDB.

      Single file per database, multiple ingestion formats, full-text search, S3 support, Parquet file support, columnar storage, fully typed.

      WASM version for full SQL in JavaScript.

      • sanderjd 1 day ago
        This is a funny thread to me because my frustration is at the intersection of your comments: I keep wanting sqlite for writes (and lookups) and duckdb for reads. Are you aware of anything that works like this?
        • nlittlepoole 1 day ago
          DuckDB can read/write SQLite files via extension. So you can do that now with DuckDB as is.

          https://duckdb.org/docs/stable/core_extensions/sqlite
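
          For example, a rough sketch in DuckDB's Python API (the file name and the orders table are made up): writes keep going through SQLite, and DuckDB attaches the same file for analytical reads.

              import duckdb

              con = duckdb.connect()  # in-memory DuckDB for the analytical side
              con.execute("INSTALL sqlite;")
              con.execute("LOAD sqlite;")
              con.execute("ATTACH 'app.db' AS app (TYPE sqlite);")

              # Analytical query over tables that live in the SQLite file
              print(con.execute("""
                  SELECT status, count(*) AS n, avg(total) AS avg_total
                    FROM app.orders
                   GROUP BY status
                   ORDER BY n DESC
              """).fetchall())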

          • sanderjd 1 day ago
            My understanding is that this is still too slow for quick inserts, because duckdb (like all columnar stores) is designed for batches.
            • theanonymousone 1 day ago
              The way I understood it, you can do your inserts with SQLite "proper", and simultaneously use DuckDB for analytics (aka read-only).
              • sanderjd 1 day ago
                Aha! That makes so much sense. Thank you for this.

                Edit: Ah, right, the downside is that this is not going to have good olap query performance when interacting directly with the sqlite tables. So still necessary to copy out to duckdb tables (probably in batches) if this matters. Still seems very useful to me though.

                • dietr1ch 1 day ago
                  Analytics is done in "batches" (daily, weekly) anyways, right?

                  We know you can't get both row and column order at the same time, and that continuously maintaining both means duplication and ensures you get the worst case from both worlds.

                  Local, row-wise writing is the way to go for write performance. Column-oriented reads are the way to do analytics at scale. It seems alright to have a sync process that does the order re-arrangement (maybe with extra precomputed statistics, and sharding to allow many workers if necessary) to let queries of now historical data run fast.

                  • delaminator 10 hours ago
                    It's not just about row versus column. OLAP stores are potentially denormalised as well, and sometimes pre-aggregated, such as rolling up by day or by customer.

                    If you really need to get performance you'll be building a star schema.

                  • sanderjd 1 day ago
                    Not all olap-like queries are for daily reporting.

                    I agree that the basic architecture should be row order -> delay -> column order, but the question (in my mind) is balancing the length of that delay with the usefulness of column order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently) but not on the extremely slow cadence seen in the typical olap reporting type flow. Essentially, building up state and then querying the results.

                    I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?

                    I mean, this isn't that different conceptually from the architecture of log-structured merge trees, which have this same kind of "duplication" but for good purpose. (Indeed, rocksdb has been the closest thing to what I want for this workload that I've found; I just think it would be neat if I could use sqlite+duckdb instead, accepting some tradeoffs.)

                    • dietr1ch 1 day ago
                      > the question (in my mind) is balancing the length of that delay with the usefulness of column order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently) but not on the extremely slow cadence seen in the typical olap reporting type flow. Essentially, building up state and then querying the results.

                      I see. Can you come up with row/table watermarks? Say your column store is up to date with a certain watermark; any query that requires freshness beyond that will need to snoop into the rows that haven't made it into the columnar store to check for data up to the required query timestamp.

                      In the past I've dealt with a system that had read-optimised columnar data that was overlaid with fresh write-optimised data and used timestamps to agree on the data that should be visible to the queries. It continuously consolidated data into the read-optimised store instead of having the silly daily job that you might have in the extremely slow cadence reporting job you mention.

                      You can write such a system, but in reality I've found it hard to justify building a system for continuous updates when a 15min delay isn't the end of the world, but it's doable if you want it.
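
                      A compressed sketch of that watermark split, reusing the SQLite + DuckDB pairing from upthread (the file names, the events table, and the ts column are all made up):

                          import duckdb

                          con = duckdb.connect()
                          con.execute("INSTALL sqlite;")
                          con.execute("LOAD sqlite;")
                          con.execute("ATTACH 'hot.db' AS hot (TYPE sqlite);")

                          def read_with_watermark(watermark_ts):
                              # rows at or below the watermark are already consolidated into Parquet;
                              # anything fresher is snooped from the row store
                              return con.execute("""
                                  SELECT * FROM read_parquet('events/*.parquet') WHERE ts <= ?
                                  UNION ALL
                                  SELECT * FROM hot.events WHERE ts > ?
                              """, [watermark_ts, watermark_ts]).fetchall()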

                      > I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?

                      I mean that if you want both views in a consistent world, then writes will bring things to a crawl, as both row- and column-ordered data need to be updated before the write lock is released.

                      • sanderjd 23 hours ago
                        Yes! We're definitely talking about the same thing here! Definitely not thinking of consistent writes to both views.

                        Now that you said this about watermarks, I realize that this is definitely the same idea as streaming systems like flink (which is where I'm familiar with watermarks from), but my use cases are smaller data and I'm looking for lower latency than distributed systems like that. I'm interested in delays that are on the order of double to triple digit milliseconds, rather than 15 minutes. (But also not microseconds.)

                        I definitely agree that it's difficult to justify building this, which is why I keep looking for a system that already exists :)

        • SchwKatze 1 day ago
          I think you could build an ETL-ish workflow where you use SQLite for OLTP and DuckDB for OLAP, but I suppose it's very workload dependent, there are several tradeoffs here.
          • sanderjd 1 day ago
            Right. This is what I want, but transparently to the client. It seems fairly straightforward, but I keep looking for an existing implementation of it and haven't found one yet.
      • swyx 1 day ago
        Very interesting. What's the vector indexing story like in DuckDB these days?

        Also, are there SQLite-DuckDB sync engines, or is that an oxymoron?

    • DrBazza 1 day ago
      From my perspective - do you even need a database?

      SQLite is kind of the middle ground between a full-fat database and 'writing your own object storage'. To put it another way, it provides a 'regularised' object-access API rather than, say, a variant of types in a vector that you filter or map over.

    • kopirgan 1 day ago
      As a backend database that's not multi-user, how many web connections doing writes can it realistically handle? Assuming writes are small, say 100+ rows each?

      Any mitigation strategy for larger use cases?

      Thanks in advance!

      • WJW 1 day ago
        A couple thousand simultaneous connections should be fine, depending on total system load, whether you're running on spinning disks or on SSDs, and your p50/p99 latency demands; you'd also want to enable WAL mode so readers don't block the single writer. Run an experiment to be sure about your specific situation.
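
        A tiny sketch of that setup with Python's sqlite3 module (path and schema are made up):

            import sqlite3

            con = sqlite3.connect("service.db")
            con.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the single writer
            con.execute("PRAGMA synchronous=NORMAL;")  # common pairing with WAL
            con.execute("CREATE TABLE IF NOT EXISTS submissions (id INTEGER PRIMARY KEY, body TEXT)")

            def write_batch(rows):
                # ~100 small rows per request, one transaction per request
                with con:  # commits on success, rolls back on exception
                    con.executemany("INSERT INTO submissions (body) VALUES (?)", rows)

            write_batch([("row %d" % i,) for i in range(100)])
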
      • loxs 1 day ago
        After 2 years in production with a small (but write-heavy) web service... it's a mixed bag. It definitely does the job, but not having a DB server has drawbacks as well as benefits. The biggest is the (lack of) caching of the file/DB in RAM. As a result I have to do my own read caching, which is fine in Rust using the moka caching library, but it's still something you have to do yourself, which would otherwise come for free with Postgres. This of course also makes it impossible to share the cache between instances; doing so would require employing redis/memcached, at which point it would be better to use Postgres.

        It has been OK so far, but I will definitely have to migrate to Postgres at some point, sooner rather than later.

        • TekMol 1 day ago
          How would caching on the db layer help with your web service?

          In my experience, caching makes most sense on the CDN layer. Which not only caches the DB requests but the result of the rendering and everything else. So most requests do not even hit your server. And those that do need fresh data anyhow.

          • loxs 1 day ago
            As I said, my app is write-heavy. So there are several separate processes that constantly write to the database, but of course, often, before writing, they need to read in order to decide what/where to write. Currently they need to have their own read cache in order to not clog the database.

            The "web service" is only the user facing part which bears the least load. Read caching is useful there too as users look at statistics, so calculating them once every 5-10 minutes and caching them is needed, as that requires scanning the whole database.

            A CDN is something I don't even have. It's not needed for the amount of users I have.

            If I were using Postgres, these writer processes + the web service would share the same read cache for free (coming from Postgres itself). The difference wouldn't be huge if I migrated right now, but now I already have the custom caching.

        • kopirgan 1 day ago
          I am no expert, but SQLite does have an in-memory store? At least for tables that need it... of course, syncing the writes to this store may need more work.
      • TekMol 1 day ago
        Why have multiple connections in the first place?

        If your writes are fast, doing them serially does not cause anyone to wait.

        How often does the typical user write to the DB? Often it is like once per day or so (for example on hacker news). Say the write takes 1/1000s. Then you can serve

            1000 * 60 * 60 * 24 = 86 million users
        
        And nobody has to wait longer than a second when they hit the "reply" button, as I do now ...
        • frje1400 1 day ago
          > If your writes are fast, doing them serially does not cause anyone to wait.

          Why impose such a limitation on your system when you don't have to by using some other database actually designed for multi user systems (Postgres, MySQL, etc)?

          • TekMol 1 day ago
            Because development and maintenance are faster and easier to reason about, increasing the chances you really get to 86 million daily active users.
            • frje1400 1 day ago
              So in this solution, you run the backend on a single node that reads/writes from an SQLite file, and that is the entire system?
              • withinboredom 1 day ago
                That's basically how the web started. You can serve a ridiculous number of users from a single physical machine. It isn't until you get into the hundreds-of-millions-of-users ballpark that you need to actually create architecture. The "cloud" lets you rent a small part of a physical machine, so it feels like you need more machines than you do. But a modern server? Easily 16-32+ cores, 128+ GB of RAM, and hundreds of TB of space. All for less than $2k per month (amortized). Yeah, you need an actual (small) team of people to manage that; but that will get you so far that it is utterly ridiculous.

                Assuming you can accept 99% uptime (that's ~3 days a year of downtime): if you were on a single cloud in 2025, that's basically what you got last year.

                • kopirgan 1 day ago
                  I agree...there is scale and then there is scale. And then there is scale like Facebook.

                  We need not assume internet/FB-level scale for typical biz apps, where one instance may support a few hundred users max. Or even a few thousand. Over-engineering under such assumptions is likely cost-ineffective and may even increase the surface area of risk. $0.02

                  • downsplat 7 hours ago
                    It goes much further than that... a single moderately sized VPS web server can handle millions of hard-to-cache requests per day, all hitting the DB.

                    Most will want to use a managed db, but for a real basic setup you can just run postgres or mysql on the same box. And running your own db on a separate VPS is not hard either.

        • kopirgan 1 day ago
          That depends on the use case. HN is not a good example. I am referring to business applications where users submit data. Of course, in these cases we are looking at hundreds, not millions, of users. The answer is: good enough.
        • nijave 14 hours ago
          >How often does the typical user write to the DB

          Turns out a lot when you have things like "last accessed" timestamps on your models.

          Really depends on the app

          I also don't think that calculation is valid. Your users aren't going to access the app uniformly over the course of a day. Invariably you'll have queuing delays at a significantly smaller user count (but maybe the delays are acceptable).

    • andrewinardeer 1 day ago
      Pardon my ignorance, but wasn't the prevailing thought a few years ago that you would never use SQLite in production? Has that school of thought changed?
      • WJW 1 day ago
        SQlite as a database for web services had a little bit of a boom due to:

        1. People gaining newfound appreciation of having the database on the same machine as the web server itself. The latency gains can be substantial and obviously there are some small cost savings too as you don't need a separate database server anymore. This does obviously limit you to a single web server, but single machines can have tons of cores and serve tens of thousands of requests per second, so that is not as limiting as you'd think.

        2. Tools like litestream will continuously back up all writes to object storage, so that one web server having a hardware failure is not a problem as long as your SLA allow downtimes of a few minutes every few years. (and let's be real, most small companies for which this would be a good architecture don't have any SLA at all)

        3. SQLite has concurrent writes now, so it's gotten much more performant in situations with multiple users at the same time.

        So for specific use cases it can be a nice setup because you don't feel the downsides (yet) but you do get better latency and simpler architecture. That said, there's a reason the standard became the standard, so unless you have a very specific reason to choose this I'd recommend the "normal" multitier architectures in like 99% of cases.

        • pixelesque 1 day ago
          > SQLite has concurrent writes now

          Just to clarify: Unless I've missed something, this is only with WAL mode and concurrent reads at the same time as writes, I don't think it can handle multiple concurrent writes at the same time?

          • WJW 8 hours ago
            As I understand it, there can be concurrent writes as long as they don't touch the same data (the same file system pages, to be exact). Also, the actual COMMIT part is still serialized and you need to begin your transactions with BEGIN CONCURRENT. If two transactions do conflict, the later one will be forced to ROLLBACK although you can still try again. It is up to the application to do this.

            See also https://www.sqlite.org/src/doc/begin-concurrent/doc/begin_co...

            This type of limitation is exactly why I would recommend "normal" server-based databases like Postgres or MySQL for the vast majority of web backends.
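
            For reference, the application-side retry loop looks roughly like this. This is only a sketch: it assumes a SQLite build from the begin-concurrent branch (stock SQLite and the stock Python sqlite3 module don't include it) and a made-up accounts table.

                import sqlite3, time

                # isolation_level=None so we control BEGIN/COMMIT ourselves
                con = sqlite3.connect("app.db", isolation_level=None)

                def transfer(frm, to, amount, retries=5):
                    for attempt in range(retries):
                        try:
                            con.execute("BEGIN CONCURRENT")  # optimistic, page-level conflict detection
                            con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, frm))
                            con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to))
                            con.execute("COMMIT")            # COMMITs themselves are still serialized
                            return
                        except sqlite3.OperationalError:     # conflict: roll back and try again
                            con.execute("ROLLBACK")
                            time.sleep(0.01 * (attempt + 1))
                    raise RuntimeError("could not commit after %d retries" % retries)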

          • giovannibonetti 1 day ago
            I think only Turso — SQLite rewritten in Rust — supports that.
        • chasd00 1 day ago
          I’m a fan of SQLite but just want to point out there’s no reason you can’t have Postgres or some other rdbms on the same machine as the webserver too. It’s just another program running in the background bound to a port similar to the web server itself.
      • lpil 1 day ago
        SQLite is likely the most widely used production database due to its widespread usage in desktop and mobile software, and SQLite databases being a Library of Congress "sustainable format".
        • zerr 1 day ago
          Most of the usage was/is as a local ACID-compliant replacement for txt/ini/custom local/bundled files though.
      • em500 1 day ago
        "Production" can mean many different things to different people. It's very widely used as a backend strutured file format in Android and iOS/macOS (e.g. for appls like Notes, Photos). Is that "production"? It's not widely used and largely inappropriate for applications with many concurrent writes.

        The SQLite docs have a good overview of appropriate and inappropriate uses: https://sqlite.org/whentouse.html It's best to start with Section 2, "Situations Where A Client/Server RDBMS May Work Better".

      • scott_w 1 day ago
        Only for large-scale, multi-user applications. It's more than reasonable as a data store in local applications or at smaller scales where having the application and data layer on the same machine is acceptable.

        If you’re at a point where the application needs to talk over a network to your database then that’s a reasonable heuristic that you should use a different DB. I personally wouldn’t trust my data to NFS.

        • kunley 1 day ago
          What is a "local application"?
          • loxs 1 day ago
            Funny how people used to ask "what is a cloud application", and now they ask "what is a local application" :-)

            Local as in "desktop application on the local machine" where you are the sole user.

            • scott_w 1 day ago
              This, though I think other posters have pointed to a web app/site that’s backed by SQLite. It can be a perfectly reasonable approach, I think, as the application is the web server and it likely accesses SQLite on the same machine.
            • kunley 9 hours ago
              That commenter's idea clearly wasn't about desktop application on a local machine. That is why I asked.
              • scott_w 4 hours ago
                You mean Andrew's comment? I took it in the broadest sense to try and give a more complete answer.
      • almost 1 day ago
        The reason you heard that was probably because they were talking about a more specific circumstance. For example SQLite is often used as a database during development in Django projects but not usually in production (there are exceptions of course!). So you may have read when setting up Django, or a similar thing, that the SQLite option wasn't meant for production because usually you'd use a database like Postgres for that. Absolutely doesn't mean that SQLite isn't used in production, it's just used for different things.
    • randomtoast 1 day ago
      I would say SQLite when possible, PostgreSQL (incl. extensions) when necessary, DuckDB for local/hobbyist data analysis and BigQuery (often TB or PB range) for enterprise business intelligence.
    • CuriouslyC 1 day ago
      I think the right pattern here is edge sharding of user data. Cloudflare makes this pretty easy with D1/Hyperdrive.
    • odie5533 1 day ago
      For as much talk as I see about SQLite, are people actually using it or does it just have good marketers?
      • TekMol 1 day ago
        Among people who can actually code (in contrast to just stitch together services), I see it used all around.

        For someone who openly describes his stack and revenue, look up Pieter Levels, how he serves hundreds of thousands of users and makes millions of dollars per year, using SQLite as the storage layer.

      • sgbeal 1 day ago
        > are people actually using it or does it just have good marketers?

        _You_ are using it right this second. It's storing your browser's bookmarks (at a minimum, and possibly other browser-internal data).

      • SJMG 1 day ago
        It's the standard for mobile. That said, in server-side enterprise computing, I know no one who uses it. I'm sure there are applications, but in this domain you'd need a good justification for not following standard patterns.

        I have used DuckDB on an application server because it computes aggregations lightning fast which saved this app from needing caching, background services and all the invalidation and failure modes that come with those two.

      • greenavocado 1 day ago
        If you use desktops, laptops, or mobile phones, there is a very good chance you have at least ten SQLite databases in your possession right now.
      • CyberDildonics 1 day ago
        It is fantastic software, have you ever used it?
        • odie5533 23 hours ago
          I don't have a use case for it. I've used it a tiny bit for mocking databases in memory, but because it's not fully Postgres, I've switched entirely to TestContainers.
    • phendrenad2 12 hours ago
      Man, I hope so. Bailing people out of horribly slow NoSQL databases is good business.
    • quotemstr 23 hours ago
      FWIW (and this is IMHO of course) DuckDB makes working with random JSON much nicer than SQLite, not least because I can extract JSON fields to dense columnar representations and do it in a deterministic, repeatable way.
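
      For instance, something like this (DuckDB's Python API; the input file and field names are invented) gives a repeatable JSON-to-columns step, since the column names and types are spelled out:

          import duckdb

          con = duckdb.connect("analytics.duckdb")
          con.execute("""
              CREATE OR REPLACE TABLE events AS
              SELECT customer, amount, ts
                FROM read_json('raw_events.jsonl',
                               format  = 'newline_delimited',
                               columns = {customer: 'VARCHAR', amount: 'DOUBLE', ts: 'TIMESTAMP'})
          """)
          print(con.execute("SELECT count(*), sum(amount) FROM events").fetchall())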

      The only thing I want out of DuckDB core at this point is support for overriding the columnar storage representation for certain structs. Right now, DuckDB decomposes structs into fields and stores each field in a column. I'd like to be able to say "no, please, pre-materialize this tuple subset and store this struct in an internal BLOB or something".

  • A1aM0 1 day ago
    Pavlo is right to be skeptical about MCP security. The entire philosophy of MCP seems to be about maximizing context availability for the model, which stands in direct opposition to the principle of Least Privilege.

    When you expose a database via a protocol designed for 'context', you aren't just exposing data; you're exposing the schema's complexity to an entity that handles ambiguity poorly. It feels like we're just reinventing SQL injection, but this time the injection comes from the system's own hallucinations rather than a malicious user.

    • Miyamura80 1 day ago
      Totally agree, unfettered access to databases is dangerous

      There are ways to reduce injection risk: since LLMs are stateless, you can monitor the origin and trustworthiness of the context that enters the LLM and then decide whether MCP actions that affect state will be dangerous or not.

      We've implemented a mechanism like this, based on Simon Willison's lethal trifecta framework, as an MCP gateway monitoring what enters context. LMK if you have any feedback on this approach to MCP security. This is not as elegant as the approach that Pavlo talks about in the post, but nonetheless, we believe this is a good band-aid solution for the time being as the technology matures.

      https://github.com/Edison-Watch/open-edison

      • quotemstr 23 hours ago
        > Totally agree, unfettered access to databases is dangerous

        Any decent MVCC database should be able to provide an MCP access to a mutable yet isolated snapshot of the DB though, and it doesn't strike me as crazy to let the agent play with that.
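
        A rough sketch of that idea (psycopg2, with a made-up list of agent-generated statements): run everything in one repeatable-read transaction and never commit it.

            import psycopg2

            def run_agent_sandboxed(agent_sql):
                conn = psycopg2.connect("dbname=app")
                conn.set_session(isolation_level="REPEATABLE READ")
                try:
                    with conn.cursor() as cur:
                        for stmt in agent_sql:  # the agent sees (and can mutate) a stable snapshot
                            cur.execute(stmt)
                        return cur.fetchall()   # whatever the last statement returned
                finally:
                    conn.rollback()             # nothing the agent did ever reaches the real data
                    conn.close()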

        • thesz 23 hours ago
          For this, the database has to have nested transactions, where COMMITs propagate up one level and not to the actual database, and not many databases have them. Also, a double COMMIT may propagate changes outside of the agent's playbox.
          • quotemstr 23 hours ago
            > For this, the database has to have nested transactions, where COMMITs propagate up one level and not to the actual database,

            Correct, but nested transaction support doesn't seem that much of a reach if you're an MVCC-style system anyway (although you might have to factor out things like row watermarks to lookaside tables if you want to let them be branchy instead of XID being a write lock.)

            You could version the index B-tree nodes too.

            • thesz 23 hours ago

              > but nested transaction support doesn't seem that much of a reach if you're an MVCC-style system anyway

              You are talking about code that has to be written and tested.

              Also, do not forget about double COMMIT, intentional or not.

    • nijave 14 hours ago
      Yes and no. Least privilege has existed in databases for a very long time. You need to implement correct DB privileges using user/roles, views, and other best practices. The MCP server is more like a dumb client in this setup.

      However, that's easy for people to forget and throw privileged creds at the MCP and hope for the best.

      The same stands for all LLM tools (MCP servers or otherwise). You always need to implement correct permissions in the tool--the LLM is too easily tricked and confused to enforce a permission boundary
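
      A hedged sketch of what that looks like in practice, run once by an admin (the role, schema, and view names are made up); the MCP server then connects as the restricted role:

          import psycopg2

          SETUP_SQL = """
          CREATE ROLE mcp_readonly LOGIN PASSWORD 'change-me';
          GRANT CONNECT ON DATABASE app TO mcp_readonly;
          GRANT USAGE ON SCHEMA reporting TO mcp_readonly;
          GRANT SELECT ON reporting.orders_summary, reporting.customers_public TO mcp_readonly;
          -- no INSERT/UPDATE/DELETE, no access to base tables or other schemas
          """

          with psycopg2.connect("dbname=app user=admin") as conn, conn.cursor() as cur:
              cur.execute(SETUP_SQL)

      Even if the model is tricked or confused, the worst it can do is read the curated views.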

    • anthonypasq 1 day ago
      I don't know anyone with a brain who is using a DB MCP with write permissions in prod. I mean, trying to lay the blame on a protocol for doing something as nuts as that seems unfair.
    • SpaceL10n 1 day ago
      Was the trade-off so exciting that we abandoned our own principles? Or, are we lemmings?

      Edit: My apologies for the cynical take. I like to think that this is just the move fast break stuff ethos coming about.

  • cluckindan 52 minutes ago
    ”I still haven't met anybody who is actively using Dgraph.”

    That’s because it is mostly used in national security and military applications in several countries.

  • p2hari 1 day ago
    The author mentions it when covering the name change from EdgeDB to Gel. However, it could also have been added to the acquisitions landscape: Gel joined Vercel [1].

    1. https://www.geldata.com/blog/gel-joins-vercel

    • apavlo 1 day ago
      Thanks for catching this. Updated: https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

      I need to figure out an automatic way to track these.

    • djsjajah 1 day ago
      You just ruined my day. The post makes it sound like gel is now dead. The post by Vercel does not give me much hope either [1]. Last commit on the gel repo was two weeks ago.

      [1] https://vercel.com/blog/investing-in-the-python-ecosystem

      • kaelwd 1 day ago
        From discord:

        > There has been a ton of interest expressed this week about potential community maintenance of Gel moving forward. To help organize and channel these hopes, I'm putting out a call for volunteers to join a Gel Community Fork Working Group (...GCFWG??). We are looking for 3-5 enthusiastic, trustworthy, and competent engineers to form a working group to create a "blessed" community-maintained fork of Gel. I would be available as an advisor to the WG, on a limited basis, in the beginning.

        > The goal would be to produce a fork with its own build and distribution infrastructure and a credible commitment to maintainership. If successful, we will link to the project from the old Gel repos before archiving them, and potentially make the final CLI release support upgrading to the community fork.

        > Applications accepted here: https://forms.gle/GcooC6ZDTjNRen939

        > I'll be reaching out to people about applications in January.

  • ComputerGuru 20 hours ago
    Pg18 is an absolutely fantastic release. Everyone raves about the async I/O worker support, but there's so much more. Built-in Unicode locales, unique indexes/constraints/FKs that can be added in an unvalidated state, generated virtual (expression) columns, skip scans on btree indexes (absolutely huge), uuidv7 support, and so much more.
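
    Two of those in a quick sketch (invented table, via psycopg2): uuidv7() as a time-ordered primary key default, and a virtual generated column that is computed on read rather than stored.

        import psycopg2

        with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS orders (
                    id       uuid PRIMARY KEY DEFAULT uuidv7(),  -- new built-in in Postgres 18
                    net      numeric NOT NULL,
                    tax_rate numeric NOT NULL DEFAULT 0.2,
                    gross    numeric GENERATED ALWAYS AS (net * (1 + tax_rate)) VIRTUAL
                );
            """)
            cur.execute("INSERT INTO orders (net) VALUES (100) RETURNING id;")
            order_id = cur.fetchone()[0]
            cur.execute("SELECT gross FROM orders WHERE id = %s;", (order_id,))
            print(cur.fetchone())  # gross computed on read from net and tax_rate
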
  • felipelalli 23 hours ago
    I think it's time for a big move towards immutable databases, which weren't even mentioned in this article. I've already worked with Datomic and immudb: Datomic is very good, but extremely complex and exotic, with a difficult learning curve to achieve perfect tuning. immudb is definitely not ready for production and starts having problems with mere hundreds of thousands of records. There's nothing too serious yet.
  • lvl155 1 day ago
    I want to thank Andy and the entire DB Group at CMU. They've done a great job of making databases accessible to so many people. They are world class.
  • zjaffee 1 day ago
    What an amazing set of articles. One thing I think he's missed is the clear multi-year trends.

    Over the past 5 years there have been significant changes and several clear winners. Databricks and Snowflake have really demonstrated the ability to stay resilient despite strong competition from the cloud providers themselves, often through the privatization of what was previously open source. This is especially relevant given the article's mention of how Cloudera and Hortonworks failed to make it.

    I also think the quiet execution of databases like ClickHouse has been extremely impressive; they've filled a niche that wasn't previously served by an obvious solution.

  • throw0101d 1 day ago
    Regarding distributed(-ish) Postgres, does anyone know if something like MySQL/MariaDB's multi-master Galera† exists for Pg:

    > MariaDB Galera Cluster provides a synchronous replication system that uses an approach often called eager replication. In this model, nodes in a cluster synchronize with all other nodes by applying replicated updates as a single transaction. This means that when a transaction COMMITs, all nodes in the cluster have the same value. This process is accomplished using write-set replication through a group communication framework.

    * https://mariadb.com/docs/galera-cluster/galera-architecture/...

    This isn't necessarily about being "web scale", but having a first-party, fairly-automated replication solution would make HA for a bunch of internal-only stuff much simpler.

    † Yes, I am aware: https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster

    • nijave 14 hours ago
      Citus, sort of Cockroach

      For HA, Patroni, stolon, CNPG

      Multimaster doesn't necessarily buy you availability. Usually it trades performance and potentially uptime for data integrity.

  • qinchencq 8 hours ago
    Was hoping to read about graph databases, AI-related changes..., but didn't expect this: "I almost died in the spring semester...surprisingly hard to concentrate on important things like databases when you can't breathe." Hope Prof. Pavlo has been breathing better. Stellar review.
  • backtogeek 1 day ago
    I can't believe that article has no mention of SQLite ??
    • apavlo 1 day ago
      > I can't believe that article has no mention of SQLite ??

      https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

    • bob1029 1 day ago
      No MSSQL, DB2 or Oracle either. Anything this proven & stable is probably not worth blogging about in this context. SQLite gets a lot of attention on HN but that's a bit of an exception.
    • astrostl 23 hours ago
      Same. CMD-F, 'sqlite', no hits, skip and go straight to comments.
  • jereze 1 day ago
    No mention of DuckDB? Surprising.
    • dujuku 1 day ago
      Also somewhat surprised. DuckDB traction is impressive and on par with vector databases in their early phases. I think there's a good chance it will earn an honorable mention next year if adoption holds and becomes more mainstream. But my impression is that it's still early in its adoption curve where only those "in the know" are using it as a niche tool. It also still has some quirks and foot-guns that need moderately knowledgeable systems people to operate (e.g. it will happily OOM your DB)
    • mariocesar 1 day ago
      Same surprise here. However in practice, the community tends to talk about DuckDB more like a client-side tool than a traditional database
  • divan 1 day ago
    > Acquisitions ... Gel → Vercel

    is a bit misleading. Gel (formerly EdgeDB) is sunsetting its development. The (extremely talented) team joins Vercel to work on other stuff.

    That was a hard hit for me in December. I loved working with EdgeQL so much.

    • senderista 1 day ago
      It is a beautifully designed language and would make a great starting point for future DB projects.
  • santiagobasulto 1 day ago
    I love these yearly review posts. Thanks Andy and team.
  • bzGoRust 1 day ago
    I would like to mention that vector databases like Milvus got lots of new features to support RAG and agent development: features like BM25, hybrid search, etc.
  • gr4vityWall 1 day ago
    Didn't know MongoDB was suing the company behind FerretDB. That's disgusting.
    • beembeem 1 day ago
      Andy has a balanced and appropriate take here.
  • alexpadula 11 hours ago
    Been reading these for a few years. I enjoy them, thank you Andy. I hope you’re doing better.
  • tiemster 1 day ago
    Also emmer (which is perhaps too niche to get mentioned in an article like this), which focuses more on being a quick/flexible 'data scratchpad' rather than on scale.

    https://hub.docker.com/r/tiemster/emmer

    • furrball010 1 day ago
      Nice to see it get mentioned here :). I like using it for scripts etc. as well. Quite flexible because you can do everything with the API.
  • thesurlydev 18 hours ago
    Supabase seems to be killing it. I read somewhere they are used by ~70% of YCombinator startups. I wonder how many of those eventually move to self-hosted.
  • npalli 1 day ago
    Andy is probably the only person who adores Larry Ellison (Oracle) unironically.
    • viccis 22 hours ago
      Ironically unironically.
  • shrx 1 day ago
    Nothing about time series-oriented databases?
    • apavlo 1 day ago
      > Nothing about time series-oriented databases?

      https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

    • speedgoose 1 day ago
      Not much happened I guess. Clickhouse has got an experimental time series engine : https://clickhouse.com/docs/engines/table-engines/special/ti...
      • shrx 19 hours ago
        QuestDB at least is gaining some popularity: https://questdb.com/

        I was hoping to learn about some new potentially viable alternatives to InfluxDB, alas it seems I'll continue using it for now.

        • speedgoose 9 hours ago
          I'm running an experimental side project where I'm doing some kind of glue between various time-series APIs and storage engines.

          For example it has an InfluxDB compatible ingestion API, so Telegraf can push its data to it or InfluxDB can replicate to it. It also has a Prometheus remote read and remote write API, so it's compatible with Prometheus.

          The storage can be done in various systems, including ClickHouse, SQLite, DuckDB, TimescaleDB… I should try to include QuestDB.

  • pjmlp 1 day ago
    Over here, it is DB2, SQL Server or Oracle if using a plain RDBMS, or whatever DB abstraction layer is provided on top of a SaaS product, where we get to query with some kind of ORM abstraction preventing raw SQL, or GraphQL, without knowing the implementation details.
    • sandos 1 day ago
      This sounds like a flashback to J2EE. Which I know is still alive and well. Banks, insurance companies and the tax agency do not much care for fancy new stuff, but that it works.
      • chasd00 1 day ago
        I describe these techs like garbage trucks. No one likes to see them but they’re there every day doing a decent part of what it takes to hold society together hah.
      • pjmlp 1 day ago
        Yep, Fortune 500 enterprise consulting, boring technology that pays the bills.

        Java, .NET, C++, nodejs, Sitecore, Adobe Experience Manager, Optimizely, SAP, Dynamics, headless CMSes,...

        • sanderjd 1 day ago
          Never felt so old, seeing nodejs in a list of old boring stuff.
          • pjmlp 1 day ago
            Yeah, it is on the edge, but unavoidable in many Web projects.
  • codeulike 1 day ago
    Barely any mention of Oracle or MS SQL Server, commonly reckoned to be the #1 and #3 most used databases in the world

    https://db-engines.com/en/ranking

    • qcnguy 1 day ago
      Oracle is mentioned at the start, where he proclaims the "dominance" of Postgres and then admits its newest features have been in Oracle for nearly a quarter of a century already. The dominance he's talking about is only about how many startups raise how many millions from investors, not anything technical.

      And then of course at the end he has a whole section about Larry Ellison, like always.

      • sanderjd 1 day ago
        Isn't it because it's about news, as in what's changing, rather than being about what's staying the same? He's a researcher, so his interests are always going to be more oriented toward new systems and new companies more than the big dominant systems.
        • qcnguy 1 day ago
          There's nothing technically new that he's covering here though? It's all just startups adding stuff to Postgres that Oracle had for decades already.
          • sanderjd 1 day ago
            The startups are new.
  • andersmurphy 22 hours ago
    With the trend towards immutable single-writer databases, mmap seems like a massive win.
    • mtndew4brkfst 16 hours ago
      Andy is very critical of using mmap in database implementations.
      • andersmurphy 12 hours ago
        Why? SQLite and LMDB make fantastic use of it. For anyone doing a single-writer DB it's a no-brainer. It does so much for you and it does it very well. All the things you don't have to implement because it does them for you:

        - Reading the data from disk

        - Concurrency between different threads reading the same data

        - Caching and buffer management

        - Eviction of pages from memory

        - Playing nice with other processes in the machine

        Why would you not leverage it? It's such a great fit for scaling reads.
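
        For a sense of how little you have to write to get all of that, here's a trivial read-only mapping in Python (the file name is made up):

            import mmap

            with open("data.db", "rb") as f:
                buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                page = buf[4096:8192]  # page faults pull data in on demand; the OS page cache keeps it warm
                buf.close()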

  • jimmar 1 day ago
    > "The Dominance of PostgreSQL Continues"

    It seems like the author is more focused on database features than user base. Every metric I can find online says that MySQL/MariaDB is more popular than PostgreSQL. PostgreSQL seems "better" (more features, better standards compliance) but MySQL/MariaDB works fine for many people. Am I living in a bubble?

    • mdasen 1 day ago
      Popularity can mean multiple things. Are we talking about how frequently a database is used or how frequently a database is chosen for new projects? MySQL will always be very popular because some very popular things use it like WordPress.

      It does feel like a lot of the momentum has shifted to PostgreSQL recently. You even see it in terms of what companies are choosing for compatibility. Google has a lot more MySQL work historically, but when they created a compatibility interface for Cloud Spanner, they went with PostgreSQL. ClickHouse went with PostgreSQL. More that I'm forgetting at the moment. It used to be that everyone tried for MySQL wire compatibility, but that doesn't feel like what's happening now.

      If MySQL is making you happy, great. But there has certainly been a shift toward PostgreSQL. MySQL will continue to be one of the most used databases just as PHP will remain one of the most used programming languages. There's a lot of stuff already built with those things. I think most metrics would say that PHP is more widely deployed than NodeJS, but I think it'd be hard to argue that PHP is what the developer community is excited about.

      Even search here on HN. In the past year, 4 MySQL stories with over 100 points compared to 28 PostgreSQL stories with over 100 points (and zero MariaDB stories above 100 points and 42 SQLite). What are we talking about here on HN? Not nearly as frequently MySQL - we're talking about SQLite and PostgreSQL. That's not to say that MySQL doesn't work great for you or that it doesn't have a large installed base, but it isn't where our mindshare is about the future.

      • evanelias 21 hours ago
        > ClickHouse went with PostgreSQL.

        What do you mean by this? AFAIK they added MySQL wire protocol compatibility long before they added Postgres. And meanwhile their cloud offering still doesn't support Postgres wire protocol today, but it does support MySQL wire protocol.

        > Even search here on HN.

        fwiw MySQL has been extremely unpopular on HN for a decade or more, even back when MySQL was a more common choice for startups. So there's a bit of a self-fulfilling prophecy where MySQL ecosystem folks mostly stopped submitting stories here because they never got enough upvotes to rank high enough to get eyeballs and discussion.

        That all said, I do agree with your overall thesis.

    • dujuku 22 hours ago
      > Every metric I can find online says that MySQL/MariaDB is more popular than PostgreSQL

      What are those metrics? If you're talking about things like db-engines rankings, those are heavily skewed by non-production workloads. For example, MySQL still being the database for WordPress means it will forever have a high number of installations and of developers using it and asking Stack Overflow questions. But when a new company or an established company is deciding which new database to use for their custom application, MySQL is seldom in the running like it was 8-10 years ago.

    • spprashant 1 day ago
      I think the author is basing his observations on where the money is flowing. PostgreSQL-adjacent startups and businesses are seeing a lot of investment.
    • apavlo 1 day ago
      > Am I living in a bubble?

      There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025. Oracle is putting all its energy in its closed-source MySQL Heatwave product. There is a new company that is looking to take over leadership of open-source MySQL but I can't talk about them yet.

      The MariaDB Corporation financial problems have also spooked companies and so more of them are looking to switch to Postgres.

      • Sesse__ 1 day ago
        > There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025.

        Not just the open-source project; 80%+ (depending a bit on when you start counting) of the MySQL team as a whole was let go, and the SVP in charge of MySQL was, eh, “moving to another part of the org to spend more time with his family”. There was never really a separate “MySQL Community Edition team” that you could fire, although of course there were teams that worked mostly or entirely on projects that were not open-sourced.

        • menaerus 10 hours ago
          Wouldn't Oracle need those 80%+ devs if they wanted to shift their efforts into Heatwave? That percentage sounds too huge to me, and if true I believe they won't be making any larger investments into Heatwave either. There are several core teams in MySQL, and if you let those people go... I don't know, I'm not sure what to make of it other than that Oracle is completely moving away from MySQL as a strategic component of its business.
          • Sesse__ 4 hours ago
            > Wouldn't Oracle need those 80%+ devs if they wanted to shift their efforts into Heatwave?

            They would, so Heatwave is also going to suffer over this.

      • _1tan 12 hours ago
        Percona I suppose?
  • cloutiertyler 1 day ago
    How is SpacetimeDB not mentioned here?
  • dmarwicke 1 day ago
    We had to restrict ours to views only because it kept trying to run updates. It still breaks sometimes when it hallucinates column names, but at least it can't do anything destructive.
  • quotemstr 23 hours ago
    Why does "database" is surveys like this not include DuckDB and SQLite, which are great [1] embedded answers to Clickhouse and PostgreSQL. Both are excellent and useful databases; DuckDB's reasonable syntax, fast vectorized everything, and support for ingesting the hairiest of data as in-DB ETL make me reach for it first these days, at least for the things I want to do.

    Why is it that in "I'm a serious database person" circles, the popular embedded databases don't count?

    [1] Yes, I know it's not an exact comparison.

  • shekispeaks 23 hours ago
    TiDB has gained some momentum in Silicon Valley, with companies looking to adopt it. Does he have any commentary on TiDB, which is an OLTP and OLAP hybrid?
  • SchwKatze 1 day ago
    Can we even say that Anyblox is a file format? By my understanding of the project it's "just" a decoder for other file formats to solve the MxN problem.
  • maximgeorge 1 day ago
    [dead]
  • cryptica 1 day ago
    It's so weird how everyone nowadays is using Postgres. It's not like end users can see your database.

    It's disturbing how everyone is gravitating towards the same tools. This has been happening since React and keeps getting worse. Software development sucks nowadays.

    All technical decisions about which tools to use are made by people who don't have to use the tools. There is no nuance anymore. There's a blanket solution for every problem and there isn't much to choose from. Meanwhile, software is less reliable than it's ever been.

    It's like a bad dream. Everything is bad and getting worse.

    • esafak 22 hours ago
      What's wrong with Postgres?
    • da02 23 hours ago
      Which alternatives to PostgreSQL would you like to see get more attention?
      • cryptica 20 hours ago
        All of them. Nothing wrong with Postgres, I like Postgres. But the more alternatives the better. My favorite database is RethinkDB but officially, it's a dead project. Unofficially it's still pretty great.