Leonid Boytsov edited this page Oct 20, 2018 · 24 revisions
Overview
This repository hosts a learning-to-rank retrieval (and re-ranking) pipeline, which is part of the PGraph project, where we study the applicability of k-nearest neighbor (k-NN) search methods to IR and QA applications. This project is supported primarily by the NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond".
The proposed research aims to improve the retrieval component of an extractive QA system, which traditionally relies on a filter-and-refine approach. A cheap filtering step uses full-text retrieval to obtain a list of candidates; the refining step then applies a more expensive re-ranking model. Usually, the full-text index contains only lexical entries, which may reduce recall due to the vocabulary mismatch between queries and documents.
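The filter-and-refine idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the toy corpus, the term-overlap filter, and the length-normalized refinement score are all stand-in assumptions (the real system uses Lucene for candidate retrieval and a learned re-ranking model).

```python
def filter_step(query, docs, top_k=3):
    """Cheap filtering: rank documents by raw term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(d.lower().split())), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only candidates that share at least one term with the query.
    return [d for score, d in scored[:top_k] if score > 0]

def refine_step(query, candidates):
    """More expensive refinement: re-rank candidates with a finer score
    (here, term overlap normalized by document length as a stand-in)."""
    q_terms = set(query.lower().split())
    def score(d):
        terms = d.lower().split()
        return len(q_terms & set(terms)) / max(len(terms), 1)
    return sorted(candidates, key=score, reverse=True)

corpus = [
    "proximity graphs for nearest neighbor search",
    "question answering with neural networks",
    "inverted indices support full text retrieval",
    "graph algorithms and shortest paths",
]
candidates = filter_step("nearest neighbor search graphs", corpus)
ranking = refine_step("nearest neighbor search graphs", candidates)
```

Note how the filter can silently drop relevant documents that use different vocabulary, which is exactly the recall problem the paragraph above describes.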
The main research hypothesis is that a high-accuracy similarity (k-NN) search powered by a proximity graph can provide a significant improvement over the classic retrieval approach based on inverted indices. In addition, similarity search can efficiently handle more sophisticated document representations, such as vectorial document embeddings, which classic inverted indices do not support.
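To make the contrast concrete, here is a brute-force k-NN search over dense document embeddings, the kind of vectorial representation an inverted index cannot index directly. The two-dimensional vectors are made-up toy data; in a real system a proximity-graph index would replace this exhaustive scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query_vec, doc_vecs, k=2):
    """Exhaustive k-NN: score every document embedding against the query
    and return the indices of the k most similar documents."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# Toy embeddings: two similar documents and one dissimilar one.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = knn([1.0, 0.05], doc_vecs, k=2)
```

The exhaustive scan is exact but costs a similarity computation per document; proximity-graph methods aim to retrieve nearly the same neighbors while visiting only a small fraction of the collection.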
Existing state-of-the-art k-NN search algorithms based on proximity graphs work well in relatively simple domains with a metric distance function (e.g., the Euclidean distance), but they are not sufficiently accurate for complex non-symmetric and non-metric distances. Our central goal is to improve these search methods and adapt them to the NLP domain.
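The core routine behind proximity-graph methods is a greedy best-first walk: start at some node and repeatedly move to the neighbor closest to the query until no neighbor improves on the current node. The sketch below is an illustrative assumption, not the project's algorithm: the tiny hand-built graph and the one-dimensional distance are toy choices, whereas real indices (e.g., SW-graph or HNSW in NMSLIB) construct neighbor lists from the data and must cope with arbitrary, possibly non-metric distances.

```python
def greedy_search(graph, dist, start, query):
    """Greedy walk on a proximity graph: move to the neighbor closest to
    the query; stop at a local minimum of the distance function."""
    current = start
    while True:
        neighbors = graph[current]
        best = min(neighbors, key=lambda n: dist(n, query))
        if dist(best, query) >= dist(current, query):
            return current  # no neighbor is closer: local minimum reached
        current = best

# Toy proximity graph over points on a line; nodes are the points themselves.
graph = {0: [2], 2: [0, 5], 5: [2, 9], 9: [5]}
result = greedy_search(graph, lambda a, b: abs(a - b), start=0, query=8)
```

Greedy search can get stuck in a local minimum rather than the true nearest neighbor, and with non-symmetric or non-metric distances such failures become more frequent, which is precisely the accuracy gap this project targets.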
License
The code written by us (both this repository and NMSLIB) is distributed under the Apache 2.0 license. We also use Apache Lucene, Apache UIMA, Apache Thrift, DKPro Core, and ClearNLP, which carry the same license. The RankLib library uses a BSD license. Finally, we use Stanford NLP as well as a modified version of Manaal Faruqui's retrofitting script. These modules ship with a GNU license.
Last but not least, we employ a number of auxiliary packages (see the POM file), which may have their own licenses different from Apache 2.0.