III: Small: Domain-Agnostic Dataset Search
NSF AWARD IIS-1816325
Participants:
- Investigators: Brian D. Davison (PI: Lehigh University, Dept. of Computer Science and Engineering), Jeff Heflin (Co-PI: Lehigh University, Dept. of Computer Science and Engineering), and Haiyan Jia (Co-PI: Lehigh University, Dept. of Journalism and Communication).
- Student participants: Helen Borchart '21, Zhiyu Chen, PhD '22, Jessica Hicks '19, Yujie Ji '20 MS, Alexandra Johnson '19, Drake Johnson (from California University of Pennsylvania), Alissa Landberg '22, dePaul Miller '20, Larrisa Miller '20, Mericel Mirabal '22, Ethan Moscot '22, Kishan Patel '23, Lixuan Qiu '20, Keith Register (from Princeton), Emma Stein '20, Mathangi Sundar (from BITS, India), Mohamed Trabelsi, PhD '22, Ngan Tran '21, Xuewei Brooks Wang '20, Hui Ye MS'20, Yang Yi '18, and Yifan Zhang, MS '23.
Description:
-
Today, the size of the Web is such that one cannot imagine finding much information without a web search engine. Similarly, the number of collections of public datasets now available has become so large that it is difficult for a researcher to track all of them within his or her discipline, and impossible to do so across disciplines. To help searchers find data in a discipline-agnostic manner, this project investigates new, promising approaches to full-content dataset search.
This research will provide the technology and develop the prototype of a tool that can ultimately assist many kinds of scientists to locate data that they can use to perform exploratory analysis and test hypotheses. Thus, this work will enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. A dataset search engine using these methods benefits society by helping researchers to accelerate their work and reduce duplicate efforts. It will also benefit others, such as data journalists, for whom data promises a new source of evidence and story discovery and a new means of storytelling and fact-checking, supporting reporting that is both meaningful and trustworthy. This work will help any data analyst locate relevant datasets.
This project will impact the training of graduate students and undergraduates (both within and separate from the requested REU supplement). This involvement will make it possible to broaden participation by underrepresented groups and to develop educational materials. The researchers will incorporate results of this work in courses, including Data Science, Web Search Engines, Data Journalism, and Semantic Web Topics.
Existing dataset search services are cumbersome, focusing on searching descriptions, not data, and cater to searchers looking within their own discipline. The project's goal is to develop a prototype dataset search engine incorporating new techniques for full-content indexing to enable searchers to find data across the web, regardless of domain. The investigators will combine principles and novel methods from information retrieval, databases, and data mining. The design and development of the prototype will also take a user-centric approach, involving professionals and practitioners in observational, interview and experimental studies to inform and guide this process.
The outcomes of this work include: 1. The development of new principles, methods, and technologies for the construction of search indexes from hundreds of thousands of real-world public datasets: the researchers will create novel methods for a) full-content indexing and analysis, b) inferring additional metadata such as attribute names when the existing descriptors are lacking, and c) inferring additional descriptors that can be used to resolve schema and data heterogeneity. 2. The understanding of searchers' cognitive processes as they search for and consider use of datasets. A social cognitive model will be built to describe human-system interactions in dataset searches, and to predict the effectiveness of the system in various scenarios. 3. The development of novel interfaces to support the search, exploration, and presentation of datasets to such users. Through this process, the researchers will develop a set of instruments for evaluating the dataset search technology and interface from the user's perspective. Research results will be disseminated broadly by presenting and publishing at conferences and in journals, sharing on the web, giving talks, and making developed software open source.
News:
- Our partnership with data.world was mentioned in their blog post on September 8, 2020.
- A search engine for datasets, Lehigh Research Review, June 2020.
- Lehigh research team to investigate a 'Google for research data', EurekAlert!, 20-Aug-2018.
Publications:
-
H. Jia, L. Miller, J. Hicks, E. Moscot, A. Landberg, J. Heflin, and
B.D. Davison.
(2022)
Truth in a Sea of Data: Adoption and Use of Data Search Tools among
Researchers and Journalists.
In
Information, Communication and Society, 26(16): 3239-3258.
Taylor & Francis, November.
DOI: 10.1080/1369118X.2022.2147398
-
Z. Chen.
(2022)
Dataset
Search and Augmentation. Doctoral dissertation, Department of
Computer Science and Engineering, Lehigh University, August.
-
M. Trabelsi.
(2022)
Leveraging
Dataset Content in Neural Models for Search and Curation.
Doctoral dissertation, Department of Computer Science and Engineering,
Lehigh University, August.
-
M. Trabelsi, Z. Chen, S. Zhang, B.D. Davison, and J. Heflin.
(2022)
StruBERT: Structure-aware BERT for Table Search and Matching.
In Proceedings of the
31st edition of the Web
Conference, pp. 442-451,
online, April.
DOI: 10.1145/3485447.3511972
-
Z. Chen, M. Trabelsi, J. Heflin, D. Yin and B. D. Davison.
(2021)
MGNETS: Multi-Graph Neural Networks for Table Search.
In Proceedings of the 30th ACM International Conference on
Information and Knowledge Management (CIKM),
pp. 2945-2949,
online, November.
DOI: 10.1145/3459637.3482140
-
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin.
(2021)
Neural Ranking Models for Document Retrieval.
Information Retrieval, 24:400-444, October.
DOI: 10.1007/s10791-021-09398-0
-
J. Heflin, B. D. Davison, and H. Jia.
(2021)
Exploring Datasets via Cell-Centric Indexing.
In Proceedings of
DESIRES 2021: Second International Conference on Design of Experimental
Search and Information REtrieval Systems,
CEUR Workshop Proceedings,
Volume 2950,
pp. 53-60,
Padua, Italy, September.
-
Z. Chen, S. Zhang, and B. D. Davison.
(2021)
WTR: A Test Collection for Web Table Retrieval.
In Proceedings of 44th
International ACM SIGIR Conference on Research and Development in
Information Retrieval,
pages 2514-2520, July.
DOI: 10.1145/3404835.3463260
-
H. Borchart.
(2021)
Query
Refinement in Dataset Search.
Senior Project Report, Cognitive Science Program, Lehigh University, May.
-
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin.
(2020)
A Hybrid Deep Model for Learning to Rank Data Tables.
In Proceedings of the 2020
IEEE International Conference on Big Data (IEEE BigData 2020),
December.
-
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin.
(2020)
Relational Graph Embeddings for Table Retrieval.
In
Seventh International Workshop on High
Performance Big Graph Data Management, Analysis, and Mining (BigGraphs
2020), held with IEEE BigData 2020,
December.
-
D. Johnson, K. Register, B. D. Davison, and J. Heflin.
(2020)
An
Exploratory Interface for Dataset Repositories Using Cell-Centric
Indexing.
Poster paper
in Proceedings of the 2020
IEEE International Conference on Big Data (IEEE BigData 2020),
pp. 5716-5718,
December.
-
Z. Chen, M. Trabelsi, B. D. Davison, and J. Heflin.
(2020)
Towards Knowledge Acquisition of Metadata on AI Progress.
In Proceedings
of the ISWC 2020 Demos and Industry Tracks: From Novel Ideas to Industrial
Practice,
co-located with the
19th International Semantic Web
Conference (ISWC 2020),
CEUR Workshop Proceedings, Vol. 2721, pages 232-237,
November.
-
L. Qiu, H. Jia, B. D. Davison, and J. Heflin.
(2020)
An
Architecture for Cell-Centric Indexing of Datasets.
In Proceedings of PROFILES'20: 7th International
Workshop on Dataset PROFILing and Search,
pages 82-96,
held with ISWC 2020,
November.
-
Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, and B. D. Davison.
(2020)
Table Search Using a Deep Contextualized Language Model.
In Proceedings of
43rd International ACM SIGIR Conference on Research and Development in Information Retrieval,
pages 589-598,
July.
-
L. Miller.
(2020)
Facilitating Dataset Search of Non-Expert Users through Heuristic and Systematic Information Processing.
Honors Thesis, Cognitive Science Program, Lehigh University,
May.
-
E. Stein.
(2020)
How Communication and Customization Influence Perceived Credibility, Usability and Adoption of Dataset Search Tools.
Senior Project Report, Cognitive Science Program, Lehigh University,
May.
-
Z. Chen,
H. Jia,
J. Heflin, and
B. D. Davison.
(2020)
Leveraging
Schema Labels to Enhance Dataset Search.
In Proceedings of the
42nd European Conference on Information
Retrieval (ECIR 2020),
pages 267-280,
April.
-
M. Trabelsi,
B. D. Davison, and
J. Heflin.
(2019)
Improved Table Retrieval Using Multiple Context Embeddings for
Attributes.
In Proceedings of the 2019 IEEE
International Conference on Big Data (BigData),
pages 1238-1244,
Los Angeles, CA,
December.
-
Z. Chen.
(2018)
Challenges and Progress in Dataset Search.
Presentation at the
Eighth BCS-IRSG Symposium on Future Directions in
Information Access (FDIA 2018), co-located with the
8th International Conference on the
Theory of Information Retrieval,
Tianjin, China, September 2018.
-
Y. Yi, Z. Chen, J. Heflin and B. D. Davison.
(2018)
Recognizing Quantity Names for Tabular Data.
In Joint Proceedings of
the First International Workshop on Professional Search (ProfS2018); the
Second Workshop on Knowledge Graphs and Semantics for Text Retrieval,
Analysis, and Understanding (KG4IR); and the International Workshop on
Data Search (DATA:SEARCH'18), pages 68-73.
Presented at the
International
Workshop on Data Search (DATA:SEARCH'18).
Co-located with SIGIR 2018, Ann Arbor, Michigan, USA, July.
-
Z. Chen, H. Jia, J. Heflin and B. D. Davison.
(2018)
Generating Schema Labels through Dataset Content Analysis.
In Companion Proceedings of The Web Conference (WWW '18),
pages 1515-1522.
Presented at the
International Workshop on
Profiling and Searching Data on the Web
(Profiles & Data:Search'18, co-located with The Web Conference),
Lyon, France, April.
Best paper award.
Last modified: 8 December 2022