The Data Science Research (DSR) Lab at the University of Florida focuses on large-scale data management, data mining, and data analysis using technologies from Database Management Systems (DBMSs), Statistical Machine Learning (SML), and Information Visualization. In the Big Data era, such research is called Data Science, which is a profession, a research agenda, and a sport! The goal of Data Science research is to build systems and algorithms that extract knowledge, find patterns, and generate insights and predictions from diverse data for various applications and visualization.
Research challenges in Data Science include:
- Terabytes, even petabytes of data are generated each day;
- Almost every discipline is facing big data analysis problems, including medical sciences, life sciences, bioinformatics, law, civil engineering and government;
- Data comes in different forms, such as free text, structured data, audio/video, and images;
- Analysis tasks performed over the data are becoming more and more sophisticated;
- High-performance computing platforms are advancing fast (e.g., cloud computing, multi-core machines, GPUs, mobile computing);
- Communication and feedback need to be established among machines, algorithms and people.
The Archimedes project aims to build a probabilistic master knowledge base system by combining novel system components and algorithms that we are designing and building at UF. In the context of the Archimedes project, the UF Data Science Research (DSR) group pursues a spectrum of research directions, including query-driven and scalable statistical inference, probabilistic data models, a state-parallel and data-parallel data analytics framework, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research on supporting large-scale, automatically extracted knowledge bases has high impact in many application domains, from medical informatics to ecology. We have received funding from the federal government and from industry, including NSF, DARPA, EMC/Greenplum, Amazon, Pivotal and Google. Related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault.
- NSF: Eureka: Efficient Query Processing over Large Probabilistic Knowledge Bases
- DARPA: DEFT: Deep Extraction and Filtering of Text
- [Jan 2017] Our journal paper by Sean Goldberg et al., pi-CASTLE: A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction, has been published in ACM JDIQ (Journal of Data and Information Quality).
- [Nov 2016] Two journal papers on two core aspects of the Archimedes Probabilistic Knowledge Base System were published in the VLDB Journal: 1) Inference: In-Database Batch and Query-time Inference over Probabilistic Graphical Models using UDA-GIST, by Kun Li*, Xiaofeng Zhou*, Daisy Zhe Wang, Christan Grant, Alin Dobra, Christopher Dudley; and 2) Learning: ScaLeKB: Scalable Learning and Inference over Large Knowledge Bases, by Yang Chen, Daisy Zhe Wang, Sean Goldberg.
- [Sept 2016] NIST sponsors UF faculty members Prof. Daisy Zhe Wang and Prof. Ethan White as PIs to develop a new Data Science Evaluation track for Fall 2017 on Data Science for Plant Identification with Remote Sensing data from the National Ecological Observatory Network (NEON).
- [August 2016] Two system demo papers, 1) ArchimedesOne: Query Processing over Probabilistic Knowledge Bases, by Xiaofeng Zhou, Yang Chen, Daisy Zhe Wang, and 2) SigmaKB: Multiple Uncertain Knowledge Base Fusion, by Miguel E. Rodríguez, Sean Goldberg, Daisy Zhe Wang, were presented at the 2016 VLDB conference in New Delhi, India.
- [July 2016] Our work on Multimodal Ensemble Fusion for Disambiguation and Retrieval by Yang Peng et al. has been accepted as a journal paper to the IEEE Multimedia Magazine.
- [May 2016] Prof. Daisy Zhe Wang visited Computer Science at the University of Washington to give a talk on Archimedes: A Probabilistic Master Knowledge Base System, as part of the North West Database Society series of talks.
- [March 2016] UF DSR Lab has been invited to participate in the NIST Data Science pre-pilot evaluation workshop 2016 and will present (1) the results of UF's participation in the 2015 NIST Data Science pre-pilot evaluation and (2) a proposal for a new Data Science evaluation on Computational Ecology using remote sensing and data from the NSF NEON program.
- [March 2016] Consensus Maximization Fusion of Probabilistic Information Extractors by Miguel Rodriguez et al. has been accepted at NAACL HLT 2016. This CMF algorithm participated in the TAC KBP SVF evaluation organized by NIST in 2015 and achieved top-3 ranked results in the CSSF/CSKB and overall ensemble runs.
- [Feb 2016] Prof. Daisy Zhe Wang visited Computer Science at the University of Miami and the Information Sciences Institute at the University of Southern California and gave talks on different aspects of Archimedes. She also visited UC Irvine to discuss research projects.
- [Jan 2016] UF CTSI is drawing on the NLP expertise in the UF DSR lab to support the newly funded OneFlorida Clinical Research Consortium, which was recently designated as one of the nation’s 13 clinical data research networks, or CDRNs, by the Patient-Centered Outcomes Research Institute (PCORI) to accelerate the translation of promising research findings into improved patient care.
- [Spring 2016] Prof. Daisy Zhe Wang is advising four student projects in CAP4773/CAP6779 Project in Data Science: (1) contributing to Apache MADlib; (2) legal citation graph analytics and case predictions; (3) automatically extracting biomedical knowledge bases; and (4) a distributed RDF store for query processing over large knowledge bases.