Last week I was at SIGMOD/PODS 2025 in Berlin, one of the leading conferences in data management. This year there were over 1200 attendees. Data management is still hot! Congrats to the Berlin data management community for pulling this off.
I was there as one of the senior chairs for the Provenance Week workshop, which was well organised by Tanja Auge and Seokki Lee. We had 30+ attendees even though we were the last workshop on the last day, which I thought was pretty good. It was also fun catching up with many of my provenance colleagues. In particular, it had been a while since I'd met up with my PhD supervisor Luc Moreau. He's working on provenance-first systems using the idea of provenance templates, based on his experience putting provenance into practice. He promised a book on the topic – so mark that down – and no pressure, Luc 🙂. It was also nice to have other team members and alumni (Hazar, Stefan, Madelon) from INDElab at the conference. Conferences should all have a selfie stand 📷.


SIGMOD is a gigantic conference with lots going on, but here are some of the major themes that came out of the conference for me.
AI changing systems and workloads
The database community has always been good at taking advantage of new compute architectures and infrastructures to build systems. See: the 21st edition of the workshop on Data Management on New Hardware (DaMoN). The massive infrastructure build-out for AI and its implications for data management systems was a topic that repeatedly came up during the conference. Here, I'll point to the keynote at the aforementioned DaMoN workshop by Carlo Curino from the Gray Systems Lab. His talk focused on SQL on GPUs but I think it illustrates the point:



This is further exemplified by the work from Matteo Interlandi and co-authors on Tensor Query Processors (e.g. compiling SQL to PyTorch programs) to take advantage of the underlying GPU infrastructure. Pushing this even further, their recent work looks at the potential of processing SQL using GPU clusters in commercial clouds with high-speed interconnects, which translates to 60x performance increases on large datasets. On a different line, the work from CWI on which cloud infrastructure is a good fit for vector databases also shows how to take advantage of these resources.
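To make the tensor query processing idea concrete, here is a toy sketch (my own, not from the papers above) of how a simple filter-and-aggregate SQL query can be expressed as PyTorch tensor operations so that it runs on a GPU:

```python
# Toy sketch: SELECT category, SUM(amount) FROM sales
#             WHERE amount > 100 GROUP BY category
# expressed as tensor operations (runs on a GPU if one is available).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Columnar "table": one tensor per column, categories dictionary-encoded.
category = torch.tensor([0, 1, 0, 2, 1, 0], device=device)
amount = torch.tensor([50.0, 200.0, 150.0, 300.0, 80.0, 120.0], device=device)

# WHERE amount > 100 -> a boolean mask instead of row-at-a-time filtering.
mask = amount > 100

# GROUP BY category, SUM(amount) -> a scatter-style reduction, one slot per group.
num_groups = int(category.max().item()) + 1
sums = torch.zeros(num_groups, device=device)
sums.index_add_(0, category[mask], amount[mask])

print(sums)  # per-category sums: tensor([270., 200., 300.])
```

The point is that relational operators map onto exactly the bulk primitives (masks, reductions, matrix ops) that GPUs and tensor runtimes are built for.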
AI also changes data management workloads. I think the most straightforward one to think about is Text2SQL. We've been working on this with Statistics Netherlands. There were a ton of papers about this (here you go Lucas): to address hallucinations, to make schemas more “natural” to improve LLM understandability, to include humans-in-the-loop through abstention, to address complex queries by iterative composition, to choose the right examples for prompting, and to build better training datasets for the problem. You can also expand this out to include the combination of queries and data exploration. Almost every database company repping at SIGMOD had their own Text2SQL story.
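If you have not played with Text2SQL, the basic loop is simple – the papers above are all about making it robust (schema grounding, abstention, better examples, and so on). Here is a bare-bones sketch against DuckDB, where `call_llm` is a placeholder for whatever model API you use:

```python
# Bare-bones Text2SQL sketch: give the model the schema plus the question,
# get SQL back, then run it. `call_llm` is a placeholder, not a real API.
import duckdb

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def text2sql(question: str, con: duckdb.DuckDBPyConnection) -> str:
    # Ground the model with the CREATE statements of the existing tables.
    schema = "\n".join(
        row[0] for row in con.sql("SELECT sql FROM duckdb_tables()").fetchall()
    )
    prompt = (
        "You are a SQL assistant. Given this schema:\n"
        f"{schema}\n\n"
        f"Write one DuckDB SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)

con = duckdb.connect()
con.sql("CREATE TABLE sales(category TEXT, amount DOUBLE)")
# query = text2sql("What is the total amount per category?", con)
# print(con.sql(query))
```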
But of course that’s just the start. This was the message that I took from the panel on AI for Future Databases. We need to think bigger. See Tim Kraska’s takes:

One key idea behind this discussion is that LLMs are operators, and this enables unstructured data to be treated as a peer to structured data (see Immanuel Trummer's slides) – more about that in the next section. A second idea was that AI agents produce very different database query workloads. As Aditya Parameswaran discussed – see his very cool intro talk from the panel and the longer keynote from the NOVAS workshop – agents can generate thousands of queries, but those queries look more like what humans would write, which is very different from queries coming from programs/scripts. I've seen this with our MCP integration for longform.ai, where Claude generates tons of different styles of queries when we're generating nifty-looking reports over our knowledge graph of podcast data. For an example of the output, check out our DuckDB ecosystem analysis. Another notion mentioned by Aditya was that LLMs provide the ability to build semantic layers from data.
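As a toy illustration of the "LLMs are operators" idea (my own sketch, not from the systems mentioned here or in the next section; `ask_llm` is a placeholder), here is a semantic predicate that composes with an ordinary relational filter:

```python
# Toy "LLM as operator": a semantic filter over unstructured text rows that
# composes with ordinary relational operators. `ask_llm` is a placeholder;
# real systems batch, cache, and optimise these calls.
from typing import Iterable, Iterator

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")

def semantic_filter(rows: Iterable[dict], column: str, condition: str) -> Iterator[dict]:
    """Keep rows where the model judges that row[column] satisfies the condition."""
    for row in rows:
        answer = ask_llm(
            f"Does the following text satisfy the condition '{condition}'? "
            f"Answer yes or no.\n\n{row[column]}"
        )
        if answer.strip().lower().startswith("yes"):
            yield row

# Composes like any other operator:
# reviews = [{"stars": 2, "text": "Battery died after a week"}, ...]
# complaints = semantic_filter(
#     (r for r in reviews if r["stars"] <= 3),   # ordinary structured filter
#     column="text",
#     condition="mentions a hardware defect",    # semantic predicate over text
# )
```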
Multimodal data
Building on the previous theme, integrating AI into data management systems means that unstructured data becomes a first-class citizen (e.g. SwellDB, palimpzest.org, docetl.org). This was summarised in the AI panel by these two slides from Alibaba Cloud's Feifei Li:


So with this you can now do some very cool things with multimodal data. First, you can have impressive demos with hardware: integrating multimodal data from hospital beds (NebulaStream) or processing data from LIDAR sensors (Alpha-Demo):


You can also build interesting systems that process point cloud data, do exploratory queries on video data, do compositional queries on video, or go really crazy and treat neural networks as data and query them using SQL.
Two of my favourite papers at the conference were, first, the work from the University of Washington's visual data project on generating UDFs for video analysis on the fly. The second was Paolo Papotti's team's work, Galois, on how to execute SQL queries over not only multimodal data but, importantly, the parameters of the LLM. I also very much appreciated Paolo's insights and thinking about LLMs as data themselves. I still think there's lots to be explored in treating LLMs as data sources.
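As a rough sketch of what treating an LLM as a data source can look like (my own toy version, not how Galois actually works; `ask_llm` is again a placeholder): prompt the model to enumerate the rows of a "virtual table", parse them into tuples, and then apply ordinary relational operations on top.

```python
# Toy "LLM as data source": materialise a virtual table from the model's
# parametric knowledge, then query it relationally. `ask_llm` is a placeholder;
# real systems like Galois are far more careful about decomposition and validation.
import csv
import io

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")

def llm_table(description: str, columns: list[str]) -> list[dict]:
    raw = ask_llm(
        f"List {description} as CSV with the columns {', '.join(columns)}. "
        "Output only CSV rows, no header."
    )
    reader = csv.reader(io.StringIO(raw))
    return [dict(zip(columns, row)) for row in reader if row]

# Roughly: SELECT name FROM countries WHERE continent = 'Europe'
# countries = llm_table("the countries of the world", ["name", "continent", "capital"])
# europe = [c["name"] for c in countries if c["continent"] == "Europe"]
```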


Speaking of multimodal data, FlockMTL provides multimodal support for DuckDB. The ability to extend a robust columnar database that's super easy to install is pretty cool for getting research ideas into something that's usable. Another example is SmokedDuck, which is an implementation of provenance in DuckDB. There was a nifty Provenance Week poster about this by Haneen Mohammed discussing trade-offs in the performance of lineage tracking.
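For the curious, the community extension mechanism is what makes this so easy to try. Assuming FlockMTL is published as a DuckDB community extension under that name (check the project's docs for the exact name, setup, and the SQL functions it registers), the pattern from Python looks roughly like this:

```python
# General pattern for trying a DuckDB community extension from Python.
# The extension name "flockmtl" is assumed here; see the project's docs.
import duckdb

con = duckdb.connect()
con.sql("INSTALL flockmtl FROM community;")  # fetch from DuckDB's community repository
con.sql("LOAD flockmtl;")

# Check what is installed/loaded, then use whatever functions the extension registers.
print(con.sql("SELECT extension_name, installed, loaded FROM duckdb_extensions()"))
```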
Empirical insights
Workloads are always a big deal in the data management field and everyone is always interested in "real workloads". I like to think of this more broadly and, in general, look at how people use data systems in practice. It was great to see a bunch of these studies at SIGMOD.
A good place to start is Gaël Varoquaux's keynote at the DEEM workshop:

Carsten Binnig's keynote at the MIDAS workshop focused on the challenges of using LLMs for enterprise data engineering, based on real-world customer data from SAP. A key conclusion is the importance of enterprise knowledge in actually doing effective data engineering with LLMs.
Other work looked at whether the assumptions of transaction systems hold by studying 30,000 transactions from 111 open source projects. Somehow this seemed appropriate given the excellent keynote on the history of transactions (slides) by one of its pioneers, Phil Bernstein.
Back to Gaël's keynote – strings are important:

Hence, I thought DaVinci was interesting as a system that cleans both semantic and syntactic errors in strings based on the insight that “67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text”.
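To give a flavour of the syntactic side of that problem (a toy illustration only, not DaVinci's approach): numbers stored as text show up in wildly inconsistent formats, and even the "easy" cases need care.

```python
# Toy syntactic cleaning of numbers-as-text; not DaVinci's approach, just the
# flavour of the problem it tackles.
import re

def parse_number(raw: str) -> float | None:
    s = re.sub(r"[^\d,.\-]", "", raw.strip())       # drop currency symbols, spaces
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):              # 1.234,5 -> 1234.5
            s = s.replace(".", "").replace(",", ".")
        else:                                        # 1,234.5 -> 1234.5
            s = s.replace(",", "")
    elif "," in s:
        # Heuristic: a trailing group of 3 digits is a thousands separator.
        s = s.replace(",", "") if len(s.split(",")[-1]) == 3 else s.replace(",", ".")
    try:
        return float(s)
    except ValueError:
        return None                                  # semantic errors need smarter tools

print([parse_number(x) for x in ["1,234.5", "1.234,5", "$1 234", "12,5", "n/a"]])
# [1234.5, 1234.5, 1234.0, 12.5, None]
```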
There were a couple of new interesting benchmarks:
- SQLBarber: if you don't have a query workload, generate one with an LLM
- Modyn provides a collection of benchmarks for dynamic datasets (distribution and temporal shifts).
- SIMBA is a benchmark for the open ended task of interacting with data dashboards.
- New dataset (and approach) for table overlap estimation based on GitTables and WikiTables.
- UDFBench
- Not new, but the creator of the CleanML benchmark won the best dissertation award
Lastly, one can not only go wide by looking at a lot of empirical data ("distant reading") but also go deep; here, the work of George Fletcher and colleagues on the close reading of data models was particularly insightful.
Wrapping up, I left SIGMOD excited about how dynamic the data management community feels right now in the context of the AI boom, both at the infrastructure and capabilities level, and about how it foregrounds the importance of knowledge to data management.
Random Notes
- DataEd is an important workshop and I’m very proud that Daphne on our team helps organise it.
- Thanks to Madelon, Stefan, Matteo and Shreya for organising a very good DEEM workshop.
- Google’s Spanner database won the test of time award. It processes over 6 billion queries per second over 17 exabytes in 40 regions. Mind blowing scale.
- Cool Knowledge Graph things:
- Exposing RDF knowledge graphs as property graphs using schema constraints
- TARTE: Using knowledge graphs to improve tabular foundation models.
- Knowledge Graph accuracy estimation
- Use the schema to improve recursive graph queries
- Vectorized SPARQL Query Execution from Stardog
- “An analysis of over 148.7 million regular path queries occurring in 937.2 million queries used on 29 different data sets”
- Also look at my ESWC 2025 trip report 🙂
- I think the main-track-papers-as-posters format needs a bit of work. Personally, I'd rather have more parallel sessions.
- One regret: I didn’t get to talk to Joseph Hellerstein. One of my favourite papers is Potter’s Wheel.
- Provenance-Enabled Explainable AI
- I enjoyed the PODS keynote – new ways to think about what a hard problem is.
- The RelationalAI folks are very convincing – datalog(ish) all the things




It was quite good, I hope they put the slides up somewhere. The key notion for me is this idea of instance optimality: by using machine learning we can tailor performance to specific users and applications, whereas in the past this was not cost effective because of the programmer effort required. They suggested four ways to create instance-optimized algorithms and data structures.
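To make that concrete, here is a minimal learned-index-style sketch (my own toy, in the spirit of this line of work rather than any particular system): learn a simple model from key to position over sorted data, then correct the prediction with a search bounded by the model's maximum error.

```python
# Minimal learned-index flavour of instance optimization: a model predicts where
# a key sits in a sorted array; a bounded local search corrects the prediction.
import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=100_000))
positions = np.arange(len(keys))

# "Model": a linear fit of position as a function of key, tailored to this data.
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = np.clip(keys * slope + intercept, 0, len(keys) - 1).astype(int)
max_err = int(np.max(np.abs(pred - positions)))  # error bound for the local search

def lookup(key: float) -> int:
    guess = int(np.clip(key * slope + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))  # binary search in a small window

i = lookup(keys[12345])
assert keys[i] == keys[12345]
```

The instance-optimized part is that the structure's performance comes from a model fit to this particular data distribution, rather than from a one-size-fits-all index.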










