Last week I was at SIGMOD/PODS 2025 in Berlin, one of the leading conferences in data management. This year there were over 1200 attendees. Data management is still hot! Congrats to the Berlin data management community for pulling this off.
I was there as one of the senior chairs for the Provenance Week workshop, which was well organised by Tanja Auge and Seokki Lee. We had 30+ attendees even though we were the last workshop on the last day, which I thought was pretty good. It was also fun catching up with many of my provenance colleagues. In particular, it had been a while since I'd met up with my PhD supervisor Luc Moreau. He's working on provenance-first systems using the idea of provenance templates, based on his experience putting provenance into practice. He promised a book on the topic – so mark that down – and no pressure, Luc 🙂. It was also nice to have other team members and alumni (Hazar, Stefan, Madelon) from INDElab at the conference. Conferences should all have a selfie stand 📷.


SIGMOD is a gigantic conference with lots going on, but here are some of the major themes that came out of the conference for me.
AI changing systems and workloads
The database community has always been good at taking advantage of new compute architectures and infrastructures to build systems. See: the 21st edition of the workshop on Data Management on New Hardware (DaMoN). The massive infrastructure build-out for AI and its implications for data management systems was a topic that repeatedly came up during the conference. Here, I'll point to the keynote at the aforementioned DaMoN workshop by Carlo Curino from the Gray Systems Lab. His talk focused on SQL on GPUs but I think it illustrates the point:



This is further exemplified by the work from Matteo Interlandi and co-authors on Tensor Query Processors (e.g. compiling SQL to PyTorch programs) to take advantage of the underlying GPU infrastructure. Pushing this even further, their recent work looks at the potential of processing SQL using GPU clusters in commercial clouds with high-speed interconnects, which translates to 60x performance increases on large datasets. On a different line, the work from CWI on which cloud infrastructure is a good fit for vector databases also shows how to take advantage of these resources.
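To make the tensor query processing idea concrete, here is a toy sketch (my own, not from the papers above) of how a simple filter-and-aggregate SQL query can be expressed as PyTorch tensor operations so that it runs on a GPU:

```python
# Toy sketch: SELECT category, SUM(amount) FROM sales
#             WHERE amount > 100 GROUP BY category
# expressed as tensor operations (runs on a GPU if one is available).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Columnar "table": one tensor per column, categories dictionary-encoded.
category = torch.tensor([0, 1, 0, 2, 1, 0], device=device)
amount = torch.tensor([50.0, 200.0, 150.0, 300.0, 80.0, 120.0], device=device)

# WHERE amount > 100 -> a boolean mask instead of row-at-a-time filtering.
mask = amount > 100

# GROUP BY category, SUM(amount) -> a scatter-style reduction, one slot per group.
num_groups = int(category.max().item()) + 1
sums = torch.zeros(num_groups, device=device)
sums.index_add_(0, category[mask], amount[mask])

print(sums)  # per-category sums: tensor([270., 200., 300.])
```

The point is that relational operators map onto exactly the bulk primitives (masks, reductions, matrix ops) that GPUs and tensor runtimes are built for.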
AI also changes data management workloads. I think the most straightforward one to think about is Text2SQL. We've been working on this with Statistics Netherlands. There were a ton of papers about this (here you go Lucas): to address hallucinations, to make schemas more “natural” to improve LLM understandability, to include humans-in-the-loop through abstention, to address complex queries by iterative composition, to choose the right examples for prompting, and to build better training datasets for the problem. You can also expand this out to include the combination of queries and data exploration. Almost every database company repping at SIGMOD had their own Text2SQL story.
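If you have not played with Text2SQL, the basic loop is simple – the papers above are all about making it robust (schema grounding, abstention, better examples, and so on). Here is a bare-bones sketch against DuckDB, where `call_llm` is a placeholder for whatever model API you use:

```python
# Bare-bones Text2SQL sketch: give the model the schema plus the question,
# get SQL back, then run it. `call_llm` is a placeholder, not a real API.
import duckdb

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def text2sql(question: str, con: duckdb.DuckDBPyConnection) -> str:
    # Ground the model with the CREATE statements of the existing tables.
    schema = "\n".join(
        row[0] for row in con.sql("SELECT sql FROM duckdb_tables()").fetchall()
    )
    prompt = (
        "You are a SQL assistant. Given this schema:\n"
        f"{schema}\n\n"
        f"Write one DuckDB SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)

con = duckdb.connect()
con.sql("CREATE TABLE sales(category TEXT, amount DOUBLE)")
# query = text2sql("What is the total amount per category?", con)
# print(con.sql(query))
```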
But of course that’s just the start. This was the message that I took from the panel on AI for Future Databases. We need to think bigger. See Tim Kraska’s takes:

One key idea behind this discussion is that LLMs are operators, and this enables unstructured data to be treated as a peer to structured data (see Immanuel Trummer's slides) – more about that in the next section. A second idea was that AI agents produce very different database query workloads. As Aditya Parameswaran discussed – see his very cool intro talk from the panel and the longer keynote from the NOVAS workshop – agents can generate thousands of queries, but those queries look more like what humans would write, which is very different from queries coming from programs/scripts. I've seen this with our MCP integration for longform.ai, where Claude generates tons of different styles of queries when we're generating nifty-looking reports over our knowledge graph of podcast data. For an example of the output, check out our DuckDB ecosystem analysis. Another notion mentioned by Aditya was that LLMs provide the ability to build semantic layers from data.
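As a toy illustration of the "LLMs are operators" idea (my own sketch, not from the systems mentioned here or in the next section; `ask_llm` is a placeholder), here is a semantic predicate that composes with an ordinary relational filter:

```python
# Toy "LLM as operator": a semantic filter over unstructured text rows that
# composes with ordinary relational operators. `ask_llm` is a placeholder;
# real systems batch, cache, and optimise these calls.
from typing import Iterable, Iterator

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")

def semantic_filter(rows: Iterable[dict], column: str, condition: str) -> Iterator[dict]:
    """Keep rows where the model judges that row[column] satisfies the condition."""
    for row in rows:
        answer = ask_llm(
            f"Does the following text satisfy the condition '{condition}'? "
            f"Answer yes or no.\n\n{row[column]}"
        )
        if answer.strip().lower().startswith("yes"):
            yield row

# Composes like any other operator:
# reviews = [{"stars": 2, "text": "Battery died after a week"}, ...]
# complaints = semantic_filter(
#     (r for r in reviews if r["stars"] <= 3),   # ordinary structured filter
#     column="text",
#     condition="mentions a hardware defect",    # semantic predicate over text
# )
```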
Multimodal data
Building on the previous theme, integrating AI into data management systems means that unstructured data becomes a first-class citizen (e.g. SwellDB, palimpzest.org, docetl.org). This was summarised in the AI panel by these two slides from Alibaba Cloud's Feifei Li:


So with this you can now do some very cool things with multimodal data. First, you can have impressive demos with hardware: integrating multimodal data from hospital beds (NebulaStream) or processing data from LIDAR sensors (Alpha-Demo):


You can also build interesting systems that process point cloud data, do exploratory queries on video data, do compositional queries on video, or go really crazy and treat neural networks as data and query them using SQL.
Two of my favourite papers at the conference were, first, the work from the University of Washington's visual data project on generating UDFs for video analysis on the fly. The second was Paolo Papotti's team's work, Galois, on how to execute SQL queries over not only multimodal data but, importantly, the parameters of the LLM. I also very much appreciated Paolo's insights and thinking about LLMs as data themselves. I still think there's lots to be explored in treating LLMs as data sources.
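As a rough sketch of what treating an LLM as a data source can look like (my own toy version, not how Galois actually works; `ask_llm` is again a placeholder): prompt the model to enumerate the rows of a "virtual table", parse them into tuples, and then apply ordinary relational operations on top.

```python
# Toy "LLM as data source": materialise a virtual table from the model's
# parametric knowledge, then query it relationally. `ask_llm` is a placeholder;
# real systems like Galois are far more careful about decomposition and validation.
import csv
import io

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")

def llm_table(description: str, columns: list[str]) -> list[dict]:
    raw = ask_llm(
        f"List {description} as CSV with the columns {', '.join(columns)}. "
        "Output only CSV rows, no header."
    )
    reader = csv.reader(io.StringIO(raw))
    return [dict(zip(columns, row)) for row in reader if row]

# Roughly: SELECT name FROM countries WHERE continent = 'Europe'
# countries = llm_table("the countries of the world", ["name", "continent", "capital"])
# europe = [c["name"] for c in countries if c["continent"] == "Europe"]
```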


Speaking of multimodal data, FlockMTL provides multimodal support for DuckDB. The ability to extend a robust columnar database that's super easy to install is pretty cool for getting research ideas into something that's usable. Another example is SmokedDuck, which is an implementation of provenance in DuckDB. There was a nifty Provenance Week poster about this by Haneen Mohammed discussing trade-offs in the performance of lineage tracking.
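For the curious, the community extension mechanism is what makes this so easy to try. Assuming FlockMTL is published as a DuckDB community extension under that name (check the project's docs for the exact name, setup, and the SQL functions it registers), the pattern from Python looks roughly like this:

```python
# General pattern for trying a DuckDB community extension from Python.
# The extension name "flockmtl" is assumed here; see the project's docs.
import duckdb

con = duckdb.connect()
con.sql("INSTALL flockmtl FROM community;")  # fetch from DuckDB's community repository
con.sql("LOAD flockmtl;")

# Check what is installed/loaded, then use whatever functions the extension registers.
print(con.sql("SELECT extension_name, installed, loaded FROM duckdb_extensions()"))
```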
Empirical insights
Workloads are always a big deal in the data management field and everyone is always interested in "real workloads". I like to think of this more broadly and, in general, look at how people use data systems in practice. It was great to see a bunch of these studies at SIGMOD.
A good place to start is Gaël Varoquaux's keynote at the DEEM workshop:

Carsten Binnig's keynote at the MIDAS workshop focused on the challenges of using LLMs for enterprise data engineering, based on real-world customer data from SAP. A key conclusion is the importance of enterprise knowledge in actually doing effective data engineering with LLMs.
Other work looked at whether the assumptions of transaction systems hold by studying 30,000 transactions from 111 open source projects. Somehow this seemed appropriate given the excellent keynote on the history of transactions (slides) by one of its pioneers, Phil Bernstein.
Back to Gaël's keynote – strings are important:

Hence, I thought DaVinci was interesting as a system that cleans both semantic and syntactic errors in strings based on the insight that “67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text”.
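To give a flavour of the syntactic side of that problem (a toy illustration only, not DaVinci's approach): numbers stored as text show up in wildly inconsistent formats, and even the "easy" cases need care.

```python
# Toy syntactic cleaning of numbers-as-text; not DaVinci's approach, just the
# flavour of the problem it tackles.
import re

def parse_number(raw: str) -> float | None:
    s = re.sub(r"[^\d,.\-]", "", raw.strip())       # drop currency symbols, spaces
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):              # 1.234,5 -> 1234.5
            s = s.replace(".", "").replace(",", ".")
        else:                                        # 1,234.5 -> 1234.5
            s = s.replace(",", "")
    elif "," in s:
        # Heuristic: a trailing group of 3 digits is a thousands separator.
        s = s.replace(",", "") if len(s.split(",")[-1]) == 3 else s.replace(",", ".")
    try:
        return float(s)
    except ValueError:
        return None                                  # semantic errors need smarter tools

print([parse_number(x) for x in ["1,234.5", "1.234,5", "$1 234", "12,5", "n/a"]])
# [1234.5, 1234.5, 1234.0, 12.5, None]
```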
There were a couple of new interesting benchmarks:
- SQLBarber: if you don't have a query workload, generate one with an LLM
- Modyn provides a collection of benchmarks for dynamic datasets (distribution and temporal shifts).
- SIMBA is a benchmark for the open ended task of interacting with data dashboards.
- New dataset (and approach) for table overlap estimation based on GitTables and WikiTables.
- UDFBench
- Not new, but the creator of the CleanML benchmark won the best dissertation award
Lastly, one can not only go wide by looking at a lot of empirical data ("distant reading") but also go deep; here, the work of George Fletcher and colleagues on the close reading of data models was particularly insightful.
Wrapping up, I left SIGMOD excited about how dynamic the data management community feels right now in the context of the AI boom, both at the infrastructure and capabilities level, and about how it foregrounds the importance of knowledge to data management.
Random Notes
- DataEd is an important workshop and I’m very proud that Daphne on our team helps organise it.
- Thanks to Madelon, Stefan, Matteo and Shreya for organising a very good DEEM workshop.
- Google’s Spanner database won the test of time award. It processes over 6 billion queries per second over 17 exabytes in 40 regions. Mind blowing scale.
- Cool Knowledge Graph things:
- Exposing RDF knowledge graphs as property graphs using schema constraints
- TARTE: Using knowledge graphs to improve tabular foundation models.
- Knowledge Graph accuracy estimation
- Use the schema to improve recursive graph queries
- Vectorized SPARQL Query Execution from Stardog
- “An analysis of over 148.7 million regular path queries occurring in 937.2 million queries used on 29 different data sets”
- Also look at my ESWC 2025 trip report 🙂
- I think the main-track-papers-as-posters format needs a bit of work. Personally, I'd rather have more parallel sessions.
- One regret: I didn’t get to talk to Joseph Hellerstein. One of my favourite papers is Potter’s Wheel.
- Provenance-Enabled Explainable AI
- I enjoyed the PODS keynote – new ways to think about what a hard problem is.
- The RelationalAI folks are very convincing – datalog(ish) all the things




It was quite good, I hope they put the slides up somewhere. The key notion for me is this idea of instance optimality: by using machine learning we can tailor performance to specific users and applications, whereas in the past this was not cost effective because of the programmer effort required. They suggested four ways to create instance-optimized algorithms and data structures.
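To make that concrete, here is a minimal learned-index-style sketch (my own toy, in the spirit of this line of work rather than any particular system): learn a simple model from key to position over sorted data, then correct the prediction with a search bounded by the model's maximum error.

```python
# Minimal learned-index flavour of instance optimization: a model predicts where
# a key sits in a sorted array; a bounded local search corrects the prediction.
import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=100_000))
positions = np.arange(len(keys))

# "Model": a linear fit of position as a function of key, tailored to this data.
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = np.clip(keys * slope + intercept, 0, len(keys) - 1).astype(int)
max_err = int(np.max(np.abs(pred - positions)))  # error bound for the local search

def lookup(key: float) -> int:
    guess = int(np.clip(key * slope + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))  # binary search in a small window

i = lookup(keys[12345])
assert keys[i] == keys[12345]
```

The instance-optimized part is that the structure's performance comes from a model fit to this particular data distribution, rather than from a one-size-fits-all index.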










