The Data Science Research (DSR) Lab at the University of Florida focuses on large-scale data management, data mining, and data analysis using technologies from database management systems (DBMSs), statistical machine learning (SML), and information visualization. In the Big Data era, such research is called Data Science, which is a profession, a research agenda, as well as a sport! The goal of Data Science research is to build systems and algorithms that extract knowledge, find patterns, and generate insights and predictions from diverse data for various applications and visualization.
The challenges in Data Science research include:
- Terabytes, even petabytes of data are generated each day;
- Almost every discipline is facing big data analysis problems, including medical sciences, life sciences, bioinformatics, law, civil engineering and government;
- Data comes in different forms, such as free text, structured data, audio/video, images;
- Analysis tasks performed over the data are becoming more and more sophisticated;
- High performance computing platforms are advancing fast (e.g., cloud computing, multi-core machines, GPU, mobile-computing);
- Communication and feedback need to be established between machines, algorithms and people.
The Archimedes Project
The Archimedes project aims to build a probabilistic master knowledge base system by combining novel system components and algorithms that we are designing and building at UF. Within the Archimedes project, the UF Data Science Research (DSR) group pursues a spectrum of research directions, including query-driven and scalable statistical inference, probabilistic data models, state-parallel and data-parallel data analytics frameworks, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research on supporting large-scale, automatically extracted knowledge bases has high impact in many application domains, from medical informatics to ecology. We have received funding from industry as well as the federal government, including NSF, DARPA, EMC/Greenplum, Amazon, Pivotal and Google. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault.
- DARPA: ECOLE:CReLeRI: Concept-centric Representation, Learning, Reasoning, and Interaction (2023-2026)
- DARPA: DEFT: Deep Extraction and Filtering of Text (2013-2017)
- NSF: III: Eureka: Efficient Query Processing over Large Probabilistic Knowledge Bases (2015-2021)
- DARPA: AIDA: Active Interpretation of Disparate Alternatives (2018-2022)
- NSF: MRA: Disentangling cross-scale influences on tree species, traits, and diversity from individual trees to continental scales (2019-2022)
Hiring! The DSR@UF lab is currently looking for exceptional candidates to fill a PhD student position.
Hiring! The DSR@UF lab is currently looking for exceptional candidates to fill a postdoc position and multiple graduate student positions.
News
- [Feb 2024] Our paper by Yang Bai et al.: M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval was accepted by COLING 2024.
- [Oct 2023] Our paper by Anthony Colas and Haodi Ma et al.: Can Knowledge Graphs Simplify Text? is published in CIKM 2023.
- [Oct 2023] Our paper by Yifan Wang et al.: LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval is published in VLDB 2023.
- [Jul 2023] DARPA funded our research ($4.4M) on Concept-centric Representation, Learning, Reasoning, and Interaction (CReLeRI) (PI: Zhiting Hu (UCSD); Co-PIs: Jaime Ruiz (UF), Daisy Wang (UF), Eric Xing (CMU), Jun-Yan Zhu (CMU)).
- [Jul 2023] Our paper by Yang Bai et al.: MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering is published in SIGIR 2023.
- [May 2023] Our paper by Ira Harmon et al.: Improving Rare Tree Species Classification Using Domain Knowledge is published in IEEE Geoscience and Remote Sensing Letters, vol. 20, 2023.
- [Mar 2023] Our paper by Yifan Wang et al.: Learned Accelerator Framework for Angular-Distance-Based High-Dimensional DBSCAN is published in EDBT 2023.
- [Aug 2022] Our paper by Anthony Colas et al.: GAP: A Graph-aware Language Model Framework for Knowledge Graph-to-Text Generation will appear in the proceedings of COLING 2022 in the Republic of Korea.
- [Jul 2022] Our paper by Yifan Wang et al.: Extensible Database Simulator for Fast Prototyping In-Database Algorithms will appear in the proceedings of ACM CIKM 2022 in Atlanta, Georgia.
- [Mar 2022] Our paper by Jayetri Bardhan et al.: DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records for Medicine Related Queries is published in the proceedings of LREC 2022 in Marseille, France.
- [Feb 2022] We gave a talk at the Dorothy M. Smith Nursing Leadership Conference on the AI-Driven Virtual Skin Prep Coach. [slides]
- [Dec 2021] Our paper by Anthony Colas et al.: EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation is published in NeurIPS 2021.
- [Jan 2021] Our paper by Ali Sadeghian et al.: ChronoR: Rotation Based Temporal Knowledge Graph Embedding is published in AAAI.
- [Dec 2020] Our work in GAIA at SM-KBP 2020 – A Dockerized Multi-media Multi-lingual Knowledge Extraction, Clustering, Temporal Tracking and Hypothesis Generation System in the DARPA AIDA program achieved top performance in the hypotheses generation task.
- [Jun 2020] The University of Florida Weecology Lab in collaboration with the DSR Lab is holding a data science challenge. The goal is to delineate tree crowns and classify tree species from hyperspectral and RGB remote sensing data. The IDTreeS challenge is open to all participants until August 15th.
- [Dec 2019] Our conference paper by Ali Sadeghian et al.: DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs is published at the Neural Information Processing Systems Conference.
- [Nov 2019] Our work Gaia: A Multi-media Multi-lingual Knowledge Extraction and Hypothesis Generation System is published at the Text Analysis Conference 2019.
- [Jun 2019] A collaboration between faculty at the UF College of Engineering and UF IFAS has won an NSF award ($1.2M), MRA: Disentangling cross-scale influences on tree species, traits, and diversity from individual trees to continental scales, to investigate cross-scale tree identification.
- [Sep 2018] We are participating in the 2018 TAC Streaming Multimedia Knowledge Base Population (SM-KBP) Task 3 hypotheses generation evaluation as part of the DARPA AIDA program. Dr. Wang was featured in the UF College of Engineering news article “Taming the Data Monster to make Better Decisions”.
- [Jun 2018] Prof. Daisy Zhe Wang and her collaborators have been awarded a 2018 Very Large Databases (VLDB) 10-Year Test-of-Time Award for their paper, “WebTables: exploring the power of tables on the web.” This award is given to the VLDB paper published ten years earlier that has had the most influence since its publication.
- [Mar 2018] We have been part of the larger NIH-funded PRISMA-P (Precision and Intelligent Systems in Medicine) project since its inception in 2013. One of our key publications has been accepted by Annals of Surgery. We continue to expand our research experience in biomedical and translational research through projects such as Rose and PRISMA-P.
- [Jan 2018] We are part of a newly funded NSF IUCRC (Industry and University Cooperative Research Center) program at the University of Florida: Center for Big Learning, whose goal is to push further the research, tech transfer and application of deep learning technologies.
- [Dec 2017] In collaboration with USC ISI, Columbia University and RPI, we have been selected to receive a grant to work on the DARPA Active Interpretation of Disparate Alternatives (AIDA) program. The UF team will focus on mining hypotheses from probabilistic knowledge graphs constructed from multimedia, event-driven corpora.
- [Oct 2017] Supported by NIST and co-PIed with Dr. Ethan White from the Weecology Lab, the Data Science Evaluation (DSE) for Plant Identification with NEON Remote Sensing data is well underway. The task and evaluation guideline documents have been released. Please join!
- [Aug 2017] The Apache Software Foundation announced Apache® MADlib™ as a Top-Level Project. MADlib 1.12 was recently released with a neural network implementation of the multi-layer perceptron and a Jupyter Notebook demonstrating its application to the MNIST dataset.
- [May 2017] The UF Clinical and Translational Science Institute (CTSI) and the UF Institute for Child Health Policy (ICHP) sponsored the research and development of Rose, an intelligent virtual health navigator supported by research from the DSR Lab.
- [Jan 2017] Our journal paper by Sean Goldberg et al.: pi-CASTLE: A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction is published in ACM JDIQ (Journal of Data and Information Quality), 2017.