CRISTIAN RIVEROS
Website
"In their capacity as a tool, computers will be but a ripple on the surface of our culture. In their capacity as intellectual challenge, they are without precedent in the cultural history of mankind." E. W. Dijkstra, Turing Award Lecture, 1972
Department of Computer Science
Vicuña Mackenna 4860
Edificio San Agustín, 4th floor
Macul, Santiago, 7820436
+56 2 23547407
cristian.riveros@uc.cl
Conference papers
2021
When is Approximate Counting for Conjunctive Queries Tractable?
[ pdf
]
With Marcelo Arenas, Rajesh Jayaram, and Luis Alberto Croquevielle. To appear in STOC.
Conference: 53rd Annual ACM Symposium on Theory of Computing
(STOC) - Rome, Italy.
Abstract:
Conjunctive queries are one of the most common class of queries used in database systems,
and the best studied in the literature. A seminal result of Grohe, Schwentick, and Segoufin (STOC 2001)
demonstrates that for every class G of graphs, the evaluation of all conjunctive queries whose underlying
graph is in G is tractable if, and only if, G has bounded treewidth. In this work, we extend this characterization
to the counting problem for conjunctive queries. Specifically, for every class C of conjunctive queries with
bounded treewidth, we introduce the first fully polynomial-time randomized approximation scheme (FPRAS)
for counting answers to a query in C, and the first polynomial-time algorithm for sampling answers uniformly from a query in C.
As a corollary, it follows that for every class G of graphs, the counting problem for conjunctive queries
whose underlying graph is in G admits an FPRAS if, and only if, G has bounded treewidth (unless BPP != P).
In fact, our FPRAS is more general, and also applies to conjunctive queries with bounded hypertree width,
as well as unions of such queries.
The key ingredient in our proof is the resolution of a fundamental counting problem from automata theory.
Specifically, we demonstrate the first FPRAS and polynomial time sampler for the set of trees of size n
accepted by a tree automaton, which improves the prior quasi-polynomial time randomized approximation scheme (QPRAS)
and sampling algorithm of Gore, Jerrum, Kannan, Sweedyk, and Mahaney '97. We demonstrate how this algorithm can
be used to obtain an FPRAS for many hitherto open problems, such as counting solutions to constraint satisfaction problems (CSP)
with bounded hypertree-width, counting the number of error threads in programs with nested call subroutines, and
counting valid assignments to structured DNNF circuits.
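As a toy illustration of why bounded-width structure makes counting tractable (a minimal Python sketch, not the paper's FPRAS; the relation E and the path-shaped query are invented for the example), exact counting for a path query reduces to a local dynamic program:

from collections import defaultdict

# Count answers to the path query Q(x, y, z) :- E(x, y), E(y, z)
# over a toy directed graph: each answer picks an in-edge and an
# out-edge at the middle variable y, so it suffices to aggregate
# indeg(y) * outdeg(y) locally, one vertex at a time.
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]

indeg, outdeg = defaultdict(int), defaultdict(int)
for u, v in edges:
    outdeg[u] += 1
    indeg[v] += 1

count = sum(indeg[y] * outdeg[y] for y in set(indeg) | set(outdeg))
print(count)  # 3 answers: (0,1,2), (0,1,3), (1,2,3)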
Expressive power of linear algebra query languages
[ pdf
]
With Floris Geerts, Thomas Muñoz, and Domagoj Vrgoč. To appear in PODS.
Conference: 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
(PODS) - Xi'an, Shaanxi, China.
Abstract:
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination,
matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that
is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this
suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds
to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language,
and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
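As a rough illustration of the dimension-bounded iteration described above (a Python/NumPy sketch, not actual MATLANG syntax), transitive closure needs no more iterations than the matrix dimension:

import numpy as np

# Reachability (via at least one edge) by iterated matrix product,
# looping a number of steps bounded by the matrix dimension.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])          # toy adjacency matrix

T = A.copy()
for _ in range(len(A)):            # at most n = dim(A) iterations
    T = np.clip(T + T @ A, 0, 1)   # add paths one edge longer
print(T)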
Ranked enumeration of MSO logic on words
[ pdf
]
With Pierre Bourhis, Alejandro Grez, and Louis Jachiet. To appear in ICDT.
Conference: 24th International Conference on Database Theory
(ICDT) - Nicosia, Cyprus.
Abstract:
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention
for several data management tasks. Given a query and the data, the task is to preprocess the
data and then enumerate all the answers to the query one by one and without repetitions. This
enumeration scheme is typically useful when the solutions are treated on the fly or when we want
to stop the enumeration once the pertinent solutions have been found. However, with the current
schemes, there is no restriction on the order in which the solutions are given, and this order usually
depends on the techniques used rather than on their relevance to the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when
the solutions are ranked. We present a framework based on MSO cost functions that allows one to
express MSO formulae on words with a cost associated with each solution. We then demonstrate the
generality of our framework which subsumes, for instance, document spanners and regular complex
event processing queries and adds ranking to them. The main technical result of the paper is an
algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely,
with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this
algorithm is based on using functional data structures, in particular, by extending functional Brodal
queues to suit the ranked enumeration of MSO on words.
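As a toy analogue of this ranked-enumeration regime, the sketch below enumerates the s-t paths of a weighted DAG in increasing order of cost, with an ordinary binary heap standing in for the paper's functional Brodal queues; the graph and weights are invented:

import heapq

dag = {'s': [('a', 1), ('b', 4)], 'a': [('t', 5), ('b', 1)],
       'b': [('t', 2)], 't': []}

def ranked_paths(source, target):
    heap = [(0, [source])]                 # (cost so far, partial path)
    while heap:
        cost, path = heapq.heappop(heap)
        if path[-1] == target:
            yield cost, path               # complete paths pop cheapest first
            continue
        for nxt, w in dag[path[-1]]:       # weights are nonnegative, so a
            heapq.heappush(heap, (cost + w, path + [nxt]))  # popped path is minimal

for cost, path in ranked_paths('s', 't'):
    print(cost, '->'.join(path))           # 4 s->a->b->t, then the two cost-6 paths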
2020
A Family of Centrality Measures for Graph Data Based on Subgraphs
[ pdf
]
With Jorge Salas. In ICDT.
Conference: 23rd International Conference on Database Theory
(ICDT) - Copenhagen, Denmark.
Abstract:
We present the theoretical foundations of a new approach in centrality measures for graph data.
The main principle of our approach is very simple: the more relevant subgraphs around a vertex,
the more central it is in the network. We formalize the notion of "relevant subgraphs" by
choosing a family of subgraphs that, given a graph G and a vertex v in G, assigns a subset
of connected subgraphs of G that contain v. Any such family defines a measure of centrality
by counting the number of subgraphs assigned to the vertex, i.e., a vertex will be more important
for the network if it belongs to more subgraphs in the family. We show many examples of this approach
and, in particular, we propose the all-subgraphs centrality, a centrality measure that takes every
subgraph into account. We study fundamental properties over families of subgraphs that guarantee
desirable properties over the corresponding centrality measure. Interestingly, all-subgraphs centrality
satisfies all these properties, showing its robustness as a notion for centrality. Finally, we study
the computational complexity of counting certain families of subgraphs and show a polynomial-time
algorithm to compute the all-subgraphs centrality for graphs with bounded treewidth.
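A brute-force sketch of the underlying definition (here instantiated with the family of all connected vertex-induced subgraphs; the toy graph is invented): count, for each vertex, the connected subgraphs that contain it. This is exponential in general, which is what makes the bounded-treewidth algorithm mentioned above interesting:

from itertools import combinations

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}  # triangle plus a pendant

def connected(vertices):
    vertices = set(vertices)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u] & vertices)   # stay inside the chosen subgraph
    return seen == vertices

def all_subgraphs_centrality(v):
    return sum(1
               for k in range(1, len(adj) + 1)
               for S in combinations(adj, k)
               if v in S and connected(S))

print({v: all_subgraphs_centrality(v) for v in adj})  # vertex 3 is most central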
On the Expressiveness of Languages for Complex Event Recognition
[ pdf
]
With Alejandro Grez, Martin Ugarte, and Stijn Vansummeren. In ICDT.
Conference: 23rd International Conference on Database Theory
(ICDT) - Copenhagen, Denmark.
Abstract:
Complex Event Recognition (CER for short) has recently gained attention as a mechanism
for detecting patterns in streams of continuously arriving event data.
Numerous CER systems and languages have been proposed in the literature, commonly
based on combining operations from regular expressions (sequencing, iteration, and disjunction)
and relational algebra (e.g., joins and filters). While variables in these languages can only
bind single elements, they also provide capabilities for filtering sets of events that occur
inside iterative patterns; for example, requiring sequences of numbers to be increasing.
Unfortunately, these types of filters usually have ad-hoc syntax and under-defined semantics,
precisely because variables cannot bind sets of events. As a result, CER languages that provide filtering
of sequences commonly lack rigorous semantics and their expressive power is not understood.
In this paper we embark on two tasks: First, to define a denotational semantics for CER that
naturally allows binding and filtering sets of events; and second, to compare the expressive power
of this semantics with that of CER languages that only allow for binding single events.
Concretely, we introduce Set-based Complex Event Logic (S-CEL for short), a variation of the CER language
introduced by Grez et al. in which all variables bind to sets of matched events. We then compare
S-CEL with Event-based CEL (E-CEL), the language proposed by Grez et al. where variables bind single events.
We show that they are equivalent in expressive power when restricted to unary predicates but,
surprisingly, incomparable in general. Nevertheless, we show that if we restrict to sets
of binary predicates, then S-CEL is strictly more expressive than E-CEL. To get a better understanding
of the expressive power, computational capabilities, and limitations of S-CEL, we also investigate
the relationship between S-CEL and Complex Event Automata (CEA), a natural computational model for CER languages.
We define a property on CEA called the *-property and show that, under unary predicates, S-CEL captures
precisely the subclass of CEA that satisfies this property. Finally, we identify the operations that S-CEL
is lacking to characterize CEA and introduce a natural extension of the language that captures
the complete class of CEA under unary predicates.
Towards Streaming Evaluation of Queries with Correlation in Complex Event Processing
[ pdf
]
With Alejandro Grez. In ICDT.
Conference: 23rd International Conference on Database Theory
(ICDT) - Copenhagen, Denmark.
Abstract:
Complex event processing (CEP) has gained a lot of attention for evaluating complex patterns
over high-throughput data streams.
Recently, new algorithms for the evaluation of CEP patterns have emerged with
strong guarantees of efficiency, i.e. constant update-time per tuple and constant-delay enumeration.
Unfortunately, these techniques are restricted to patterns with local filters, limiting the
possibility of using joins for correlating the data of events that are far apart.
In this paper, we embark on the search for efficient evaluation algorithms of CEP
patterns with joins. We start by formalizing the so-called partition-by operator, a
standard operator in data stream management systems to correlate contiguous events on streams.
Although this operator is a restricted version of a join query, we show that partition-by
(without iteration) is as expressive as hierarchical queries, the largest class of full
conjunctive queries that can be evaluated with constant update-time and constant-delay
enumeration over streams. To evaluate queries with partition-by we introduce an automata model,
called chain complex event automata (chain-CEA), an extension of complex event automata that can
compare data values by using equalities and disequalities.
We show that chain-CEA is closed under determinization and is expressive enough to capture queries with partition-by.
More importantly, we provide an algorithm with constant update time and constant delay enumeration
for evaluating any query definable by chain-CEA, showing that all CEP queries with partition-by can
be evaluated with these strong guarantees of efficiency.
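A minimal sketch of the partition-by idea under simplifying assumptions (events as (key, payload) pairs and a fixed-length sequence pattern; this is plain Python bookkeeping, not chain-CEA): per-key state is updated in constant expected time per tuple.

from collections import defaultdict

def partitioned_matches(stream, pattern_len=2):
    state = defaultdict(list)                  # partition key -> recent events
    for key, payload in stream:
        window = state[key]
        window.append(payload)                 # constant-time update per tuple
        if len(window) > pattern_len:
            del window[0]                      # keep a bounded window per key
        if len(window) == pattern_len:
            yield key, tuple(window)

events = [('sensor1', 10), ('sensor2', 7), ('sensor1', 12), ('sensor2', 9)]
for key, match in partitioned_matches(events):
    print(key, match)                          # sensor1 (10, 12), sensor2 (7, 9)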
2019
Efficient Logspace Classes for Enumeration, Counting, and Uniform Generation
[ pdf
| slides
]
( Best paper award )
With Marcelo Arenas, Rajesh Jayaram, and Luis Alberto Croquevielle. In PODS.
Conference: 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles
of Database Systems
(PODS) - Amsterdam, The Netherlands.
Abstract:
In this work, we study two simple yet general complexity classes, based on
logspace Turing machines, which provide a unifying framework for efficient query
evaluation in areas like information extraction and graph databases, among others.
We investigate the complexity of three fundamental algorithmic problems for
these classes: enumeration, counting and uniform generation of solutions, and
show that they have several desirable properties in this respect.
Both complexity classes are defined in terms of nondeterministic logspace transducers (NL transducers).
For the first class, we consider the case of unambiguous NL transducers, and we
prove constant delay enumeration, and both counting and uniform generation of
solutions in polynomial time. For the second class, we consider unrestricted NL transducers,
and we obtain polynomial delay enumeration, approximate counting in polynomial time,
and polynomial-time randomized algorithms for uniform generation.
More specifically, we show that each problem in this second class
admits a fully polynomial-time randomized approximation scheme (FPRAS)
and a polynomial-time Las Vegas algorithm for uniform generation.
Interestingly, the key idea to prove these results is to show that the
fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting
the number of strings of length n accepted by a nondeterministic finite automaton (NFA).
While this problem is known to be #P-complete and, more precisely, SpanL-complete,
it was open whether this problem admits an FPRAS.
In this work, we solve this open problem, and obtain as a welcome corollary
that every function in SpanL admits an FPRAS.
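For intuition on the unambiguous case, where counting is exact and polynomial, the sketch below counts strings of length n accepted by an unambiguous NFA: unambiguity puts accepted strings in bijection with accepting runs, and runs are counted by a simple dynamic program. The automaton is a toy example accepting strings over {a, b} that end in 'b'; state 1 is final with no outgoing transitions, so every accepted string has exactly one run.

delta = {(0, 'a'): [0], (0, 'b'): [0, 1]}    # toy unambiguous NFA
alphabet, start, final, n = 'ab', 0, 1, 10

counts = {start: 1}                          # run prefixes ending in each state
for _ in range(n):
    nxt = {}
    for s in counts:
        for c in alphabet:
            for t in delta.get((s, c), []):
                nxt[t] = nxt.get(t, 0) + counts[s]
    counts = nxt

print(counts.get(final, 0))                  # 2**(n-1) = 512 accepted strings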
A Formal Framework for Complex Event Processing
[ pdf
]
With Alejandro Grez and Martin Ugarte. In ICDT.
Conference: 22nd International Conference on Database Theory
(ICDT) - Lisbon, Portugal.
Abstract:
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require
processing and correlating distributed data sources in real-time. CEP finds applications in diverse
domains, which has resulted in a large number of proposals for expressing and processing complex
events. However, existing CEP languages lack a clear semantics, making them hard to
understand and generalize. Moreover, there are no general techniques for evaluating CEP query
languages with clear performance guarantees.
In this paper we embark on the task of giving a rigorous and efficient framework to CEP.
We propose a formal language for specifying complex events, called CEL, that contains the main
features used in the literature and has a denotational and compositional semantics. We also
formalize the so-called selection strategies, which had only been presented as by-design extensions
to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently
process complex events by evaluating CEL formulas with unary filters. We start by studying
the syntactical properties of CEL and propose rewriting optimization techniques for simplifying
the evaluation of formulas. Then, we introduce a formal computational model for CEP, called
complex event automata (CEA), and study how to compile CEL formulas with unary filters into
CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using
constant time per event followed by constant-delay enumeration of the results. Finally, we gather
the main results of this work to present an efficient and declarative framework for CEP.
A Worst-Case Optimal Join Algorithm for SPARQL
[ pdf
]
With Aidan Hogan, Carlos Rojas, and Adrian Soto. In ISWC.
Conference: The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference
(ISWC) - Auckland, New Zealand.
Abstract:
Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature.
These algorithms not only offer strong theoretical guarantees of efficiency, but have also been empirically
demonstrated to significantly improve query runtimes for relational and graph databases.
Despite these promising theoretical and practical results, however,
the Semantic Web community has yet to adopt such techniques; to the best of our knowledge,
no native RDF database currently supports such join algorithms; in this paper we demonstrate that this should change.
We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case optimal join algorithm called Leapfrog Triejoin.
We propose an adaptation of this algorithm for evaluating SPARQL queries, and implement it in Apache Jena.
We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose
based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns.
Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base
version and two other SPARQL engines: Virtuoso and Blazegraph.
2018
Constant Delay Algorithms for Regular Document Spanners
[ pdf
| slides
]
With Fernando Florenzano, Martin Ugarte, Stijn Vansummeren, and Domagoj Vrgoč. In PODS.
Conference: 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles
of Database Systems
(PODS) - Houston, TX, USA.
Abstract:
Regular expressions and automata models with capture variables are
core tools in rule-based information extraction. These formalisms,
also called regular document spanners, use regular languages in
order to locate the data that a user wants to extract from a text
document, and then store this data into variables.
Since document spanners can easily generate large outputs, it is
important to have good evaluation algorithms that can generate the
extracted data in quick succession, and with relatively little
precomputation time. Towards this goal, we present a practical
evaluation algorithm that allows constant delay enumeration of a
spanner's output after a precomputation phase that is linear in the
document. While the algorithm assumes that the spanner is specified
in a syntactic variant of variable set automata, we also study how
it can be applied when the spanner is specified by general variable
set automata, regex formulas, or spanner algebras. Finally, we study
the related problem of counting the number of outputs of a document
spanner, providing a fine-grained analysis of the classes of
document spanners that support efficient enumeration of their
results.
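For flavour, here is span extraction in miniature, with Python's re module standing in for a variable-set automaton (an eager sketch without the constant-delay guarantees studied above; the document and pattern are invented):

import re

doc = "Ada Lovelace <ada@example.org>, Alan Turing <alan@example.org>"
pattern = re.compile(r'(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) <(?P<email>[^>]+)>')

for m in pattern.finditer(doc):
    # a spanner reports spans (start, end) per capture variable
    print({var: m.span(var) for var in ('name', 'email')})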
Document Spanners for Extracting Incomplete Information: Expressiveness
and Complexity
[ pdf
]
With Francisco Maturana and Domagoj Vrgoč. In PODS.
Conference: 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles
of Database Systems
(PODS) - Houston, TX, USA.
Abstract:
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose a theoretical framework which supports the use of mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information, extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of the two languages, focusing on the query enumeration problem, as well as satisfiability and containment.
Pumping Lemmas for Weighted Automata
[ pdf
]
With Filip Mazowiecki. In STACS.
Conference: 35th Symposium on Theoretical Aspects of Computer Science
(STACS) - Caen, France.
Abstract:
We present three pumping lemmas for three classes of functions definable by fragments of weighted automata over the min-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus semiring.
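A tiny worked example of a weighted automaton over the min-plus semiring (two invented states; semiring addition is min, multiplication is +): state 0 charges 1 per 'a', state 1 charges 1 per 'b', and the value of a word is the minimum over runs, so the automaton computes f(w) = min(#a, #b), a finitely-ambiguous function:

import math

INF = math.inf
M = {'a': [[1, INF], [INF, 0]],    # cost of staying in state 0 on 'a' is 1
     'b': [[0, INF], [INF, 1]]}    # cost of staying in state 1 on 'b' is 1

def value(word):
    v = [0, 0]                     # both states start for free
    for c in word:
        v = [min(v[p] + M[c][p][q] for p in range(2)) for q in range(2)]
    return min(v)                  # both states are final

print(value('aababb'))             # min(#a, #b) = min(3, 3) = 3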
2017
Descriptive Complexity for counting complexity classes
[ pdf
| slides
]
With Marcelo Arenas and Martin Muñoz. In LICS.
Conference: 32nd Annual ACM/IEEE Symposium on Logic in Computer Science
(LICS) - Reykjavik, Iceland.
Abstract:
Descriptive Complexity has been very successful in characterizing complexity classes
of decision problems in terms of the properties definable in some logics.
However, descriptive complexity for counting complexity classes, such as FP and #P,
has not been systematically studied, and it is not as developed as its decision counterpart.
In this paper, we propose a framework based on Weighted Logics to address this issue.
Specifically, by focusing on the natural numbers we obtain a logic called
Quantitative Second Order Logics (QSO), and show how some of its fragments
can be used to capture fundamental counting complexity classes such as FP, #P and FPSPACE, among others.
We also use QSO to define a hierarchy inside #P, identifying counting complexity
classes with good closure and approximation properties, and which
admit natural complete problems. Finally, we add recursion to QSO, and show
how this extension naturally captures lower counting complexity classes such as #L.
Probabilistic Automata of Bounded Ambiguity
[ pdf
]
With Nathanaël Fijalkow and James Worrell. In CONCUR.
Conference: 28th International Conference on Concurrency Theory
(CONCUR) - Berlin, Germany.
Abstract:
Probabilistic automata are a computational model introduced by
Michael Rabin, extending nondeterministic finite automata with
probabilistic transitions. Despite its simplicity, this model is
very expressive and many of the associated algorithmic questions are
undecidable. In this work we focus on the emptiness problem, which
asks whether a given probabilistic automaton accepts some word with
probability higher than a given threshold. We consider a natural
and well-studied structural restriction on automata, namely the
degree of ambiguity, which is defined as the maximum number of
accepting runs over all words. We observe that undecidability of
the emptiness problem requires infinite ambiguity and so we focus on
the case of finitely ambiguous probabilistic automata.
Our main results are to construct efficient algorithms for analysing
finitely ambiguous probabilistic automata through a reduction to a
multi-objective optimisation problem, called the stochastic path
problem. We obtain a polynomial time algorithm for approximating
the value of finitely ambiguous probabilistic automata and a
quasi-polynomial time algorithm for the emptiness problem for
2-ambiguous probabilistic automata.
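The underlying model is concrete enough to state in a few lines: a probabilistic automaton assigns each word an acceptance probability through products of stochastic matrices, and emptiness asks whether some word exceeds a threshold. A minimal NumPy sketch with an invented two-state automaton:

import numpy as np

M = {'a': np.array([[0.5, 0.5], [0.0, 1.0]]),   # row-stochastic transition
     'b': np.array([[1.0, 0.0], [0.3, 0.7]])}   # matrices, one per letter
initial = np.array([1.0, 0.0])                  # start in state 0
final = np.array([0.0, 1.0])                    # state 1 is accepting

def acceptance_probability(word):
    v = initial
    for c in word:
        v = v @ M[c]                            # push the distribution forward
    return float(v @ final)

print(acceptance_probability('ab'))             # 0.35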
2016
Copyless Cost-Register Automata: Structure, Expressiveness, and Closure
Properties
[ pdf
]
With Filip Mazowiecki. In STACS.
Conference: 33rd Symposium on Theoretical Aspects of Computer Science
(STACS) - Orleans, France.
Abstract:
Cost register automata (CRA) and their subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over strings.
We study structural properties, expressiveness, and closure properties of copyless CRA.
We show that copyless CRA are strictly less expressive than weighted automata and are not closed under the reverse operation.
To find a better class we impose restrictions on copyless CRA, which leads to a new robust computational model that is closed under reverse and other extensions.
A framework for annotating CSV-like data
[ pdf
]
With Marcelo Arenas, Francisco Maturana, and Domagoj Vrgoč. In VLDB.
Conference: Proceedings of the Very Large Data Base Endowment
(VLDB) - New Delhi, India.
Abstract:
In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants.
The framework is based on annotating parts of the document that can be later used to read, query, or exchange the data.
The core of our framework is a language based on extended regular expressions that are used for selecting data.
These expressions are then combined using a set of rules in order to annotate the data.
We study the computational complexity of implementing our framework and present an efficient evaluation algorithm
that runs in time proportional to its output and linear in its input. As a proof of concept,
we test an implementation of our framework against a large number of real world datasets and show that it can be efficiently used in practice.
Querying Wikidata: Comparing SPARQL, Relational and Graph Databases
[ pdf
]
With Daniel Hernández, Aidan Hogan, Carlos Rojas, and Enzo Zerega. In ISWC.
Conference: The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference
(ISWC) - Kobe, Japan.
Abstract:
In this paper, we experimentally compare the efficiency of various database engines for the purposes of querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers. We take two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can be represented in the models of each engine. We design a set of experiments to test the relative query performance of these representations in the context of their respective engines. We first execute a large set of atomic lookups to establish a baseline performance for each test setting, and subsequently perform experiments on instances of more complex graph patterns based on real-world examples. We conclude with a summary of the strengths and limitations of the engines observed.
2015
Maximal Partition Logic: Towards a Logical Characterization of Copyless
Cost Register Automata
[ pdf
]
With Filip Mazowiecki. In CSL.
Conference: 24th EACSL Annual Conference on Computer Science Logic
(CSL) - Berlin, Germany.
Abstract:
It is highly desirable for a computational model to have a logic characterization, as in the seminal work of Büchi that connects MSO with finite automata. For example, weighted automata are the quantitative extension of finite automata for computing functions over words and they can be naturally characterized by a subfragment of weighted logic introduced by Droste and Gastin. Recently, cost register automata (CRA) were introduced by Alur et al. as an alternative model for weighted automata. In hope of finding decidable subclasses of weighted automata, they proposed to restrict their model with the so-called copyless restriction. Unfortunately, copyless CRA do not enjoy good closure properties and, therefore, a logical characterization of this class seems to be unlikely.
In this paper, we introduce a new logic called maximal partition logic (MP) for studying the expressiveness of copyless CRA. In contrast to the previous approaches (i.e. weighted logics), MP is based on a new set of "regular" quantifiers that partition a word into maximal subwords, compute the output of a subformula over each subword separately, and then aggregate these outputs with a semiring operation.
We study the expressiveness of MP and compare it with weighted logics. Furthermore, we show that MP is as expressive as a natural subclass of copyless CRA.
This shows the first logical characterization of copyless CRA and it gives a better understanding of the copyless restriction in weighted automata.
2013
Quantitative Monadic Second-Order Logic
[ pdf
| slides
]
With Stephan Kreutzer. In LICS.
Conference: 28th Annual IEEE Symposium on Logic in Computer Science
(LICS) - New Orleans, USA.
Abstract:
While monadic second-order logic is a prominent
logic for specifying languages of finite words, it
lacks the power to compute quantitative properties, e.g.
to count. An automata model capable of computing such
properties are weighted automata, but logics equivalent to
these automata have only recently emerged.
We propose a new framework for adding quantitative
properties to logics specifying Boolean properties of words.
We use this to define Quantitative Monadic Second-Order
Logic (QMSO). In this way we obtain a simple logic which
is as expressive as weighted automata. We analyse its
evaluation complexity, both data and combined complexity,
and show completeness results for combined complexity.
We further refine the analysis of this logic and obtain
fragments that characterise exactly subclasses of weighted
automata defined by the level of ambiguity allowed in the
automata. In this way, we define a quantitative logic which
has good decidability properties while being reasonably
expressive and enjoying a simple syntactical definition.
Which DTDs are streaming bounded repairable?
[ pdf
| slides
]
With Pierre Bourhis and Gabriele Puppis. In ICDT.
Conference: 16th International Conference on Database Theory
(ICDT) - Genova, Italy.
Abstract:
Integrity constraint management concerns both checking
whether data is valid and taking action to restore correctness
when invalid data is discovered.
In XML the notion of valid data can be captured by schema languages
such as Document Type Definitions (DTDs)
and more generally XML schemas.
DTDs have the property that constraint checking can be done
in streaming fashion. In this paper we consider when the corresponding
action to restore validity (repair)
can be done in streaming fashion.
We formalize this as the problem of determining, given a DTD,
whether or not a streaming procedure exists that transforms an input document
so as to satisfy the DTD, using a number of edits independent of the document.
We show that this problem is decidable. In fact, we show the decidability
of a more general problem, allowing a more general class of schemas
than DTDs, and requiring a repair procedure that works
only for documents that are already known to satisfy another class of
constraints.
The decision procedure relies on a new analysis of the structure of DTDs,
reducing to a novel notion of game played on pushdown systems associated
with the schemas.
2012
Bounded repairability for regular tree languages
[ pdf
| slides
]
With Gabriele Puppis and Slawek Staworko. In ICDT.
Conference: 15th International Conference on Database Theory
(ICDT) - Berlin, Germany.
Abstract:
We consider the problem of repairing unranked trees (e.g., XML documents)
satisfying a given restriction specification R (e.g., a DTD) into unranked
trees satisfying a given target specification T. Specifically, we focus on
the question of whether one can get from any tree in a regular language R
to some tree in another regular language T with a finite, uniformly bounded,
number of edit operations (i.e., deletions and insertions of nodes).
We give effective characterizations of the pairs of specifications R and T for
which such a uniform bound exists, and we study the complexity of the problem
under different representations of the regular tree languages (e.g., non-deterministic
stepwise automata, deterministic stepwise automata, DTDs). Finally, we point out some
connections with the analogous problem for regular languages of words,
which was previously studied in [6].
2011
The cost of traveling between languages
[ pdf
| slides
]
With Michael Benedikt and Gabriele Puppis. In ICALP.
Conference: 38th International Colloquium on Automata, Languages and Programming
(ICALP) - Zürich, Switzerland.
Abstract: We show how to calculate the maximum number of edits per character needed
to convert any string in one regular language to a string in another
language. Our algorithm makes use of a local determinization
procedure applicable to a subclass of distance automata.
We then show how to calculate the same property when the editing
needs to be done in streaming fashion, by a finite state transducer, using
a reduction to mean-payoff games. We show that the optimal streaming editor
can be produced in PTIME.
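The per-character measure above is built on the classic string edit (Levenshtein) distance, which the paper lifts from pairs of strings to pairs of regular languages; for reference, the standard dynamic program over two strings:

def levenshtein(s, t):
    prev = list(range(len(t) + 1))               # distances from "" to prefixes of t
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # delete a
                           cur[j - 1] + 1,       # insert b
                           prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))          # 3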
Regular repair of specifications
[ pdf
| slides
]
With Michael Benedikt and Gabriele Puppis. In LICS.
Conference: 26th Annual IEEE Symposium on Logic in Computer Science
(LICS) - Toronto, Canada.
Abstract: What do you do if a computational object (e.g. program trace) fails a specification?
An obvious approach is to perform repair: modify the object minimally to get
something that satisfies the constraints.
In this paper we study repair of temporal
constraints, given as automata or temporal logic formulas. We focus on
determining the number of repairs that must be applied to a word satisfying
a given input constraint in order to ensure that it satisfies a given target
constraint.
This number may well be unbounded; one of our main contributions is to
isolate the complexity of the "bounded repair problem", based on
a characterization of the pairs of regular languages that admit such a repair. We
consider this in the setting where the repair strategy is unconstrained and also
when the strategy is restricted to use finite memory.
Although the streaming setting is quite different from the general setting, we find
that there are surprising connections between streaming and non-streaming, as well
as within variants of the streaming problem.
2010
Foundations of schema mapping management
[ pdf
]
With Marcelo Arenas, Jorge Pérez, and Juan Reutter. In PODS.
Conference: 29th ACM Symposium on Principles of Database Systems
(PODS) - Indianapolis, USA.
Abstract: In the last few years, a lot of attention has been paid to the specification
and subsequent manipulation of schema mappings, a problem
which is of fundamental importance in metadata management.
There have been many achievements in this area, and semantics
have been defined for operators on schema mappings such as composition
and inverse. However, little research has been pursued
towards providing formal tools to compare schema mappings, in
terms of their ability to transfer data and avoid storing redundant
information, which has hampered the development of foundations
for more complex operators as many of them involve these notions.
In this paper, we address the problem of providing foundations
for metadata management by developing an order to compare the
amount of information transferred by schema mappings. From this
order we derive several other criteria to compare mappings, we provide
tools to deal with these criteria, and we show their usefulness
in defining and studying schema mapping operators. More precisely,
we show how the machinery developed can be used to study
the extract and merge operators, that have been identified as fundamental
for the development of a metadata management framework.
We also use our machinery to provide simpler proofs for
some fundamental results regarding the inverse operator, and we
give an effective characterization for the decidability of the well-known
schema evolution problem.
2009
Inverting schema mappings: Bridging the gap between theory and practice
[ pdf
]
With Marcelo Arenas, Jorge Pérez, and Juan Reutter. In VLDB.
Conference: 35th International Conference on Very Large Data Bases
(VLDB) - Lyon, France.
Abstract:
The inversion of schema mappings has been identified as
one of the fundamental operators for the development of a
general framework for metadata management. In fact, during
the last years three alternative notions of inversion for
schema mappings have been proposed (Fagin-inverse,
quasi-inverse and maximum recovery). However, the
procedures that have been developed for computing these
operators have some features that limit their practical
applicability. First, these algorithms work in exponential time
and produce inverse mappings of exponential size. Second,
these algorithms express inverses in some mappings languages
which include features that are difficult to use in
practice. A typical example is the use of disjunction in the
conclusion of the mapping rules, which makes the process of
exchanging data much more complicated.
In this paper, we propose solutions for the two problems
mentioned above. First, we provide a polynomial time algorithm
that computes the three inverse operators mentioned
above given a mapping specified by a set of tuple-generating
dependencies (tgds). This algorithm uses an output mapping
language that can express these three operators in a
compact way and, in fact, can compute inverses for a much
larger class of mappings. Unfortunately, it has already been
proved that this type of mapping languages has to include
some features that are difficult to use in practice and, hence,
this is also the case for our output mapping language. Thus,
as our second contribution, we propose a new and natural
notion of inversion that overcomes this limitation. In particular,
every mapping specified by a set of tgds admits an
inverse under this new notion that can be expressed in a
mapping language that slightly extends tgds, and that has
the same good properties for data exchange as tgds.
Finally, as our last contribution, we provide an algorithm for
computing such inverses.
2008
The Recovery of a schema mapping: Bringing exchanged data back
[ pdf
]
With Marcelo Arenas and Jorge Pérez. In PODS.
Conference: 27th ACM Symposium on Principles of Database Systems
(PODS) - Vancouver, Canada.
Abstract:
A schema mapping is a specification that describes how data from a source schema is to be mapped to a target
schema. Once the data has been transferred from the source to the target, a natural question is whether one can
undo the process and recover the initial data, or at least part of it. In fact, it would be desirable to find a reverse
schema mapping from target to source that specifies how to bring the exchanged data back.
In this paper, we introduce the notion of a recovery of a schema mapping: it is a reverse mapping M' for a
mapping M that recovers sound data with respect to M. We further introduce an order relation on recoveries.
This allows us to choose mappings that recover the maximum amount of sound information. We call such mappings
maximum recoveries. We study maximum recoveries in detail, providing a necessary and sufficient condition
for their existence. In particular, we prove that maximum recoveries exist for the class of mappings specified
by FO-TO-CQ source-to-target dependencies. This class subsumes the class of source-to-target tuple-generating
dependencies used in previous work on data exchange. For the class of mappings specified by FO-TO-CQ dependencies,
we provide an exponential-time algorithm for computing maximum recoveries, and a simplified version
for full dependencies that works in quadratic time. We also characterize the language needed to express maximum
recoveries, and we include a detailed comparison with the notion of inverse (and quasi-inverse) mapping
previously proposed in the data exchange literature. In particular, we show that maximum recoveries strictly
generalize inverses. We finally study the complexity of some decision problems related to the notions of recovery
and maximum recovery.
Journal versions
2020
Efficient Enumeration Algorithms for Regular Document Spanners
[ pdf
]
With Fernando Florenzano, Martin Ugarte, Stijn Vansummeren, and Domagoj Vrgoč. In TODS.
Journal: ACM Transactions on Database Systems
(TODS)
Volume: 45
(1)
Pages: 3:1-3:42
Abstract:
Regular expressions and automata models with capture variables are
core tools in rule-based information extraction. These formalisms, also
called regular document spanners, use regular languages to locate the
data that a user wants to extract from a text document and then store
this data into variables. Since document spanners can easily generate
large outputs, it is important to have efficient evaluation algorithms
that can generate the extracted data in quick succession, and with relatively
little precomputation time. Toward this goal, we present a practical
evaluation algorithm that allows output-linear delay enumeration of a spanner’s
result after a precomputation phase that is linear in the document.
Although the algorithm assumes that the spanner is specified in a
syntactic variant of variable-set automata, we also study how it
can be applied when the spanner is specified by general variable-set
automata, regex formulas, or spanner algebras. Finally, we study the
related problem of counting the number of outputs of a document spanner
and provide a fine-grained analysis of the classes of document spanners
that support efficient enumeration of their results.
Descriptive Complexity for Counting Complexity Classes
[ pdf
]
With Marcelo Arenas and Martin Muñoz. In LMCS.
Journal: Logical Methods in Computer Science
(LMCS)
Volume: 16
(1)
Pages: 9:1-9:42
Abstract:
Descriptive Complexity has been very successful in characterizing
complexity classes of decision problems in terms of the properties
definable in some logics. However, descriptive complexity for
counting complexity classes, such as FP and #P, has not been
systematically studied, and it is not as developed as its
decision counterpart. In this paper, we propose a framework based
on Weighted Logics to address this issue. Specifically, by focusing
on the natural numbers we obtain a logic called Quantitative Second
Order Logics (QSO), and show how some of its fragments can be used
to capture fundamental counting complexity classes such as FP, #P
and FPSPACE, among others. We also use QSO to define a hierarchy
inside #P, identifying counting complexity classes with good closure
and approximation properties, and which admit natural complete problems.
Finally, we add recursion to QSO, and show how this extension naturally
captures lower counting complexity classes such as #L.
Efficient Logspace Classes for Enumeration, Counting, and Uniform Generation
[ pdf
]
With Marcelo Arenas, Rajesh Jayaram, and Luis Alberto Croquevielle. In SIGMOD Record.
Journal: ACM SIGMOD Record
(SIGMOD Record)
Volume: 49
(1)
Pages: 52-59
Abstract:
We study two simple yet general complexity classes,
which provide a unifying framework for efficient query evaluation
in areas like graph databases and information extraction, among others.
We investigate the complexity of three fundamental algorithmic problems
for these classes: enumeration, counting and uniform generation of solutions,
and show that they have several desirable properties in this respect.
Both complexity classes are defined in terms of nondeterministic logarithmic-space transducers (NL transducers).
For the first class, we consider the case of unambiguous NL transducers, and we
prove constant delay enumeration, and both counting and uniform generation of solutions
in polynomial time. For the second class, we consider unrestricted NL transducers, and we
obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized
algorithms for uniform generation. More specifically, we show that each problem in this second class admits a
fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm
(with preprocessing) for uniform generation. Remarkably, the key idea to prove these
results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem
of counting the number of strings of length n (given in unary) accepted by a
non-deterministic finite automaton (NFA).
While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was
open whether this problem admits an FPRAS. In this work, we solve this open problem,
and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
Probabilistic Automata of Bounded Ambiguity
[ pdf
]
With Nathanaël Fijalkow and James Worrell. In Information and Computation.
Journal: Information and Computation
(Information and Computation)
Info: To appear in November 2020.
Abstract:
Probabilistic automata are an extension of nondeterministic finite automata in which
transitions are annotated with probabilities. Despite its simplicity, this model is very
expressive and many algorithmic questions are undecidable.
In this work we focus on the emptiness problem (and its variant the value problem), which asks
whether a given probabilistic automaton accepts some word with probability greater
than a given threshold. We consider finitely ambiguous probabilistic automata.
Our main contributions are to construct efficient algorithms for analysing finitely
ambiguous probabilistic automata through a reduction to a multi-objective optimisation
problem called the stochastic path problem. We obtain a polynomial time algorithm for
approximating the value of probabilistic automata of fixed ambiguity and a quasi-polynomial
time algorithm for the emptiness problem for 2-ambiguous probabilistic automata.
We complement these positive results by an inapproximability result stating
that the value of finitely ambiguous probabilistic automata cannot be approximated unless P=NP.
2019
Copyless cost-register automata: Structure, expressiveness, and closure properties
[ pdf
]
With Filip Mazowiecki. In JCSS.
Journal: Journal of Computer and System Sciences
(JCSS)
Volume: 100
Pages: 1-29
Abstract:
Cost register automata (CRA) and their subclass, copyless CRA, were recently
proposed by Alur et al. as a new model for computing functions over strings.
We study some structural properties, expressiveness, and closure properties
of copyless CRA. We show that copyless CRA are strictly less expressive than
weighted automata and are not closed under the reverse operation. To find a better
class we impose restrictions on copyless CRA, which leads to
a new robust computational model that is closed under reverse and other extensions.
2016
Bounded Repairability for Regular Tree Languages
[ pdf
]
With Pierre Bourhis, Gabriele Puppis, and Slawek Staworko. In TODS.
Journal: ACM Transactions on Database Systems
(TODS)
Volume: 41
(3)
Pages: 18:1-18:45
Abstract:
We study the problem of bounded repairability of a given restriction tree
language R into a target tree language T. More precisely, we say that R is
bounded repairable with respect to T if there exists a bound on the number
of standard tree editing operations necessary to apply to any tree in R
to obtain a tree in T. We consider a number of possible specifications for
tree languages: bottom-up tree automata (on the curry encoding of unranked trees)
that capture the class of XML schemas and document type definitions (DTDs).
We also consider a special case when the restriction language R is universal
(i.e., contains all trees over a given alphabet).
We give an effective characterization of bounded repairability between
pairs of tree languages represented with automata. This characterization
introduces two tools—synopsis trees and a coverage relation between them—allowing
one to reason about tree languages that undergo a bounded number of editing
operations. We then employ this characterization to provide upper bounds to
the complexity of deciding bounded repairability and show that these bounds
are tight. In particular, when the input tree languages are specified with
arbitrary bottom-up automata, the problem is coNExp-complete. The problem
remains coNExp-complete even if we use deterministic nonrecursive DTDs
to specify the input languages. The complexity of the problem can be
reduced if we assume that the alphabet, the set of node labels, is fixed:
the problem becomes PSpace-complete for nonrecursive DTDs and coNP-complete
for deterministic nonrecursive DTDs. Finally, when the restriction tree
language R is universal, we show that the bounded repairability problem
becomes Exp-complete if the target language is specified by an arbitrary
bottom-up tree automaton and becomes tractable (P-complete, in fact) when
a deterministic bottom-up automaton is used.
A framework for annotating CSV-like data
[ pdf
]
With Marcelo Arenas, Francisco Maturana, and Domagoj Vrgoč. In VLDB.
Journal: Proceedings of the Very Large Data Base Endowment
(VLDB)
Volume: 9
(11)
Pages: 876-887
Abstract:
In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants.
The framework is based on annotating parts of the document that can be later used to read, query, or exchange the data.
The core of our framework is a language based on extended regular expressions that are used for selecting data.
These expressions are then combined using a set of rules in order to annotate the data.
We study the computational complexity of implementing our framework and present an efficient evaluation algorithm
that runs in time proportional to its output and linear in its input. As a proof of concept,
we test an implementation of our framework against a large number of real world datasets and show that it can be efficiently used in practice.
2015
Which XML Schemas are Streaming Bounded Repairable?
[ pdf
]
With Pierre Bourhis and Gabriele Puppis. In ToCS.
Journal: Theory of Computing Systems
(ToCS)
Volume: 57
(4)
Pages: 1250-1321
Abstract:
In this paper we consider the problem of repairing, that is, restoring validity of,
documents with respect to XML schemas. We formalize this as the problem of determining,
given an XML schema, whether or not a streaming procedure exists that transforms an
input document so as to satisfy the XML schema, using a number of edits independent
of the document. We show that this problem is decidable. In fact, we show the
decidability of a more general problem, which allows the repair procedure to work on
documents that are already known to satisfy another XML schema. The decision procedure
relies on the analysis of the structure of an automaton model specifying the restriction
and target XML schemas and reduces the problem to a novel notion of game played on
pushdown systems associated with the schemas.
2014
The per-character cost of repairing word languages
[ pdf
]
With Michael Benedikt and Gabriele Puppis. In TCS.
Journal: Theoretical Computer Science
(TCS)
Volume: 539
Pages: 38-67
Abstract:
We show how to calculate the maximum number of edits per character
needed to convert any string in one regular language to a string in
another language. Our algorithm makes use of a local determinization
procedure applicable to a subclass of distance automata. We then show
how to calculate the same property when the editing needs to be done
in streaming fashion, by a finite state transducer, using a reduction
to mean-payoff games. In this case, we show that the optimal streaming
editor can be produced in P.
2013
Bounded repairability of word languages
[ pdf
]
With Michael Benedikt and Gabriele Puppis. In JCSS.
Journal: Journal of Computer and System Sciences
(JCSS)
Volume: 79
(8)
Pages: 1302–1321
Abstract:
What do you do if a computational object (e.g. program trace) fails a
specification? An obvious approach is to perform a repair: modify the
object minimally to get something that satisfies the constraints.
This approach has been investigated in the database community,
for integrity constraints, and in the AI community for propositional logics.
Here we study how difficult it is to repair a document in the form of a string.
Specifically, we consider the number of edits that must be applied to
an input string in order to satisfy a given target language.
This number may be unbounded; our main contribution is to isolate
the complexity of the bounded repair problem based on a characterization
of the regular languages that admit bounded repair.
We consider the settings where the repair strategy is unconstrained and
when the editing must be produced in a streaming way,
i.e. by a letter-to-letter transducer.
The language of plain SO-tgds: composition, inversion and structural properties
[ pdf
]
With Marcelo Arenas, Jorge Pérez, and Juan Reutter. In JCSS.
Journal: Journal of Computer and System Sciences
(JCSS)
Volume: 79
(6)
Pages: 763–784
Abstract:
The problems of composing and inverting schema mappings specified by source-to-target tuple-generating
dependencies (st-tgds) have attracted a lot of attention, as they are of fundamental importance for the
development of Bernstein’s metadata management framework. In the case of the composition operator,
a natural semantics has been proposed and the language of second-order tuple generating dependencies
(SO-tgds) has been identified as the right language to express it. In the case of the inverse operator,
several semantics have been proposed, most notably the maximum recovery, the only inverse notion that
guarantees that every mapping specified by st-tgds is invertible. Unfortunately, less attention has been paid
to combining both operators, which is the motivation of this paper. More precisely, we start our investigation
by showing that SO-tgds are not good for inversion, as there exist mappings specified by SO-tgds that are
not invertible under any of the notions of inversion proposed in the literature. To overcome this limitation,
we borrow the notion of CQ-composition, which is a relaxation obtained by parameterizing the composition
of mappings by the class of conjunctive queries (CQ), and we propose a restriction over the class of SO-tgds
that gives rise to the language of plain SO-tgds. Then we show that plain SO-tgds are the right language
to express the CQ-composition of mappings given by st-tgds, in the same sense that SO-tgds are the right
language to express the composition of st-tgds, and we prove that every mapping specified by a plain SO-tgd admits a maximum recovery, thus showing that plain SO-tgds have a good behavior w.r.t. inversion.
Moreover, we show that the language of plain SO-tgds shares some fundamental structural properties with
the language of st-tgds, but being much more expressive, and we provide a polynomial-time algorithm to
compute maximum recoveries for mappings specified by plain SO-tgds (which can also be used to compute
maximum recoveries for mappings given by st-tgds). All these results suggest that the language of plain
SO-tgds is a good alternative to be implemented in data exchange and data integration applications.
2012
Query language-based inverses of schema mappings: semantics, computation, and closure properties
[ pdf
]
With Marcelo Arenas, Jorge Pérez, and Juan Reutter. In VLDBJ.
Journal: The VLDB Journal
(VLDBJ)
Volume: 21
(6)
Pages: 823-842
Abstract:
The inversion of schema mappings has been identified as one of the
fundamental operators for the development of a general framework for
metadata management. During the last few years three alternative
notions of inversion for schema mappings have been proposed
(Fagin-inverse, quasi-inverse and maximum
recovery). However, these notions lack some fundamental
properties which limit their practical applicability: most of them
are expressed in languages including features that are difficult to
use in practice, some of these inverses are not guaranteed to exist
for mappings specified with source-to-target tuple-generating
dependencies (st-tgds), and it has been futile to search for a
meaningful mapping language that is closed under any of
these notions of inverse.
In this paper, we develop a framework for the inversion of schema
mappings that fulfills all of the above requirements. It is based on
the notion of C-maximum recovery, for a query language C, a
notion designed to generate inverse mappings that recover back only
the information that can be retrieved with queries in C. By
focusing on the language of conjunctive queries (CQ), we are able
to find a mapping language that contains the class of st-tgds, is
closed under CQ-maximum recovery, and for which the chase procedure
can be used to exchange data efficiently. Furthermore, we show that
our choices of inverse notion and mapping language are optimal, in the
sense that choosing a more expressive inverse operator or mapping
language causes the loss of these properties.
2009
Composition and inversion of schema mappings
[ pdf
]
With Marcelo Arenas, Jorge Pérez, and Juan Reutter. In SIGMOD Record.
Journal: ACM SIGMOD Record
(SIGMOD Record)
Volume: 38
(3)
Pages: 17-28
Abstract: (not available)
The Recovery of a schema mapping: Bringing exchanged data back
[ pdf
]
With Marcelo Arenas and Jorge Pérez. In TODS.
Journal: ACM Transactions on Database Systems
(TODS)
Volume: 34
(4)
Pages: Article No. 22
Abstract:
A schema mapping is a specification that describes how data from a source schema is to be mapped to a target
schema. Once the data has been transferred from the source to the target, a natural question is whether one can
undo the process and recover the initial data, or at least part of it. In fact, it would be desirable to find a reverse
schema mapping from target to source that specifies how to bring the exchanged data back.
In this paper, we introduce the notion of a recovery of a schema mapping: it is a reverse mapping M' for a
mapping M that recovers sound data with respect to M. We further introduce an order relation on recoveries.
This allows us to choose mappings that recover the maximum amount of sound information. We call such mappings
maximum recoveries. We study maximum recoveries in detail, providing a necessary and sufficient condition
for their existence. In particular, we prove that maximum recoveries exist for the class of mappings specified
by FO-TO-CQ source-to-target dependencies. This class subsumes the class of source-to-target tuple-generating
dependencies used in previous work on data exchange. For the class of mappings specified by FO-TO-CQ dependencies,
we provide an exponential-time algorithm for computing maximum recoveries, and a simplified version
for full dependencies that works in quadratic time. We also characterize the language needed to express maximum
recoveries, and we include a detailed comparison with the notion of inverse (and quasi-inverse) mapping
previously proposed in the data exchange literature. In particular, we show that maximum recoveries strictly
generalize inverses. We finally study the complexity of some decision problems related to the notions of recovery
and maximum recovery.
I am an Assistant Professor at the Department of Computer Science
at the Pontificia Universidad Católica de Chile.
I received a D.Phil. from the University of Oxford in 2013 and an
M.Sc. from Pontificia Universidad Católica de Chile in 2008.
Before that, I studied at Pontificia Universidad Católica de Chile as an undergraduate, where I received
a B.A. in Mathematics in 2006 and my Professional Degree in Computer Engineering in 2008.
My research interests are mostly in data management systems, specifically, in data streams, information extraction, and graph data. Also, I do research on theoretical computer science, mostly in automata theory, logics, and computational complexity.