Fall 2024
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 06 | Tre’ R. Jeter | Active Gradient Manipulation for Privacy Breaching in Vertical Federated Learning
Federated Learning (FL) has emerged as a promising approach for privacy-preserving collaborative machine learning. Specifically, vertical FL (vFL) allows various devices in multi-agent systems to collectively train models on vertically partitioned data while safeguarding sensitive information. Recently, a significant amount of research has been conducted on the privacy analysis of vFL, but the majority of this work explores the passive setting, where attackers follow the FL protocol. Unfortunately, this perspective underestimates the potential threats to vFL, as practical adversaries can deviate from the protocol to enhance their capabilities. In response, this work proposes two novel active data reconstruction attacks to compromise data privacy. Each attack induces gradient manipulation during the training phase to breach data privacy. Our first attack, which includes an Active Inversion Network (AIN), exploits a subset of known data in the training set to make passive parties train an auto-encoder (AE) that reconstructs their private data. The second attack introduces an Active Generative Network (AGN) that relies only on the data distribution to train a conditional generative adversarial network (C-GAN) for private feature reconstruction. Our experiments demonstrate the effectiveness of both attacks on three real-world datasets. Additionally, we provide valuable insights and guidelines for enhancing the security of vFL systems through the application of Local Differential Privacy (LDP). |
|
Nov 22 | Yupu Zhang | Enhancing Protein-Ligand Binding Affinity Prediction with Graph Contrastive Learning
In drug discovery, accurately predicting the binding affinity between proteins and ligands is crucial for identifying potential therapeutic compounds. A reliable prediction can streamline the drug design process, saving both time and resources by guiding experimental efforts toward the most promising candidates. While computational techniques can easily obtain the 3D structures of protein-ligand complexes, determining the actual binding affinity often requires complex, costly, and time-intensive experiments. This scarcity of labeled data poses a significant challenge for machine learning models, leading to unsatisfying performance in predicting binding affinities. Recent advancements in self-supervised learning provide an opportunity to mitigate this data bottleneck by leveraging the vast amounts of unlabeled data. By extracting useful patterns and representations from unlabeled protein-ligand complexes, self-supervised pretraining can enhance the performance of models in downstream tasks. We develop a novel approach to pretrain graph neural networks using graph contrastive learning (GCL), specifically designed for the context of binding affinity prediction. |
|
Nov 08 | Haodi Ma (Part 1) | VQA Faithfulness of Vision Language Model
Recent large language models show potential for complex reasoning on textual tasks, which has led to efforts to apply such models to multi-modal tasks such as object recognition, action recognition, and video question answering. While these models seem to work well for image-level tasks, they tend to fail on video data, which requires higher-level reasoning abilities such as temporal understanding. To look into these shortcomings, we evaluate SOTA works in this field on popular benchmarks like STAR and propose possible solutions to improve VLMs' faithfulness with different types of information grounding. |
|
Nov 08 | Michael Perez (Part 2) | Temporal Action Detection Based on Vision-Language and Action Segmentation Models with UI
This week, we completed a version of an image and video analysis website demonstrated to DARPA as a part of the ECOLE program. This talk focuses on the video analysis component. In temporal action detection, the input is an untrimmed video with multiple actions and the output is action/argument predictions for segmented clips. This task is challenging because the start and end times of actions are unknown. Existing approaches rely on large, labeled datasets with a closed vocabulary or on Vision-Language Models (VLMs), which often suffer from hallucinations. Our system addresses these issues by using state-of-the-art action segmentation models, VLMs, and Large Language Models. Following action segmentation, modules for object detection, caption generation, action/argument parsing, and grounding perform action recognition and reduce hallucinations. I will discuss each pipeline step and the UI design, and demonstrate examples in the interface. |
|
Oct 25 | Kyle Ulmer | Hyperion Bulk Inventory Measuring System
Hyperion is a system that uses stationary LiDAR scanners to generate pointclouds for automatically measuring and tracking the inventory of bulk material storage. Pointclouds are oriented to ground using Random Sample Consensus (RANSAC) plane segmentation and aligned either to a reference pointcloud (i.e., from a drone) or to each other using iterative closest point (ICP) registration. A user interface presents options for defining fixed areas where material is stored, and then defines options for these areas within the registered pointcloud to remove statistical outliers, determine material and bounding edges, build a surface using Delaunay Triangulation, smooth the surface through Poisson-Disc resampling, then convert the surface to a watertight mesh and calculate its volume. These meshes and volumes are stored as a timeseries and presented to the customer via a web UI. Sometimes the scanners become dirty, or the line of sight from the sensor is blocked, leaving voids in the pointclouds. Interpolation between points often works well but can produce bad results. Sometimes the hardware is broken, and we must automatically make that determination. Is there a way to improve this system? Yes! How? |
|
Oct 04 | Tony (Yang Bai) | Machine Learning-based Information Retrieval for Large-Scale Natural Language Processing
Information retrieval (IR) is essential for large-scale natural language processing (NLP) tasks like open-domain question answering and automatic fact-checking. Traditional IR models, such as TF-IDF and BM25, rely on lexical matching and often fail to capture semantic meaning, limiting their effectiveness. In contrast, recent machine learning-based IR methods use pretrained language models to encode the semantic meaning of both queries and documents. These models offer significant improvements by enabling better ranking of relevant results. This dissertation presents several contributions to advancing machine learning-based IR for improving large-scale NLP tasks. First, we introduce MythQA, a multi-answer open-domain question answering system for detecting check-worthy claims directly from large-scale information sources like Twitter. To support this study, we construct TweetMythQA, a benchmark with specific evaluation metrics and 5.3K annotated tweets, classified as "Supporting", "Refuting", or "Neutral". Further, we propose M3, a multi-hop dense retrieval system that integrates contrastive and multi-task learning to enhance text retrieval. M3 demonstrates state-of-the-art performance on FEVER, an open-domain fact verification benchmark, addressing limitations of contrastive learning in dense retrieval. Lastly, we explore multi-modal retrieval-augmented question answering (MRAQA) by developing RAMQA, a framework combining learning-to-rank methods with generative LLMs for multi-modal question answering. Using LLaVa and LLaMA models, RAMQA outperforms strong baselines on the WebQA and MultiModalQA benchmarks, highlighting its effectiveness in multi-modal retrieval tasks. In summary, this dissertation introduces novel benchmarks, advanced retrieval models, and robust frameworks to improve large-scale NLP tasks, enhancing open-domain claim detection, automatic fact verification, and retrieval-augmented multi-modal question answering. |
|
Sep 20 | Zachary F. Greenberg | Expanding pMHC's functional space with generative modelling to improve cancer immunotherapy
Implementing effective cancer immunotherapies is hindered by patient heterogeneity in the peptide-presented major histocompatibility complex (pMHC) and T cell receptor (TCR). This heterogeneity enables tumors to evade host immunosurveillance and drug delivery formulations, leading to tumor proliferation. We address this challenge by expanding the functional pMHC space using generative artificial intelligence (AI) to improve cancer immunotherapeutic designs. Our generative adversarial network model, ExoGAN, creates novel peptides by modeling a pMHC’s binding affinity using sequence data of existing peptides curated from the Immune Epitope Database (IEDB) plus physiochemical feature engineering. We demonstrate how pMHC physiochemistry is correlated with pMHC presentation, which enables the effective generation of novel peptides and allows ExoGAN to create more diverse binding peptides than those it was trained on. Next, we cluster this expanded peptide space based on sequence similarity to identify peptide subtypes, perform statistical analysis on the physiochemical features of the generated and existing peptides, and select motifs and sequences for experimental validation. Experimental validation of a selection of diverse peptides designed by ExoGAN shows that all are strong binders, demonstrating the power of this approach. Overall, ExoGAN enables rapid and cost-effective expansion of the functional pMHC space, allowing researchers to efficiently and systematically unravel pMHC-TCR interactions. |
|
Sep 06 | Jean Louis | Unfair Size Advantage: Exploring Machine Learning Model Compression’s Impact on Fairness
Recently, we have seen a glimpse of a new trend in machine learning (ML): running powerful ML models on smaller devices. The field of TinyML aims to run ML models on resource-constrained devices such as microcontrollers. The benefits of smart low-resource devices include reduced latency, energy utilization, and cost, as well as increased privacy. As we bring more complex black-box models closer to end users, we must continue looking beyond simple accuracy. Identifying and mitigating changes in the fairness and reliability of these devices is critical. TinyML supports a vast range of applications such as telehealth, Industrial IoT, and smart homes. Many ML fairness researchers have uncovered and investigated the real-world impact of biased algorithms in facial recognition, predictive policing, hiring, and healthcare datasets. However, the impact of model compression remains under-explored. A model once thought to be fair may not behave the same after quantization (a form of model compression) to fit on a smaller device. For example, facial recognition (FR) is being implemented in smart glasses, airports, and smart homes. With the growth of IoT and privacy concerns, researchers must be proactive. This talk will discuss quantization methods and their effects on model size and fairness metrics for facial recognition tasks. |
Spring 2024
Date | Speaker | Title | Slides |
---|---|---|---|
May 09 | Dr. Zoey Liu | Addressing the "gut feelings" in cross-linguistic model generalizations
Abstract: What models should I use, and how should I evaluate my models? While there is growing interest in cross-linguistic NLP evaluation, much of the work still attends to languages with large amounts of data available, often relying on “gut feelings” based on experimental results from these high-resource languages to make decisions about what training schemes to apply. In this talk, I discuss the potential consequences of having these gut feelings, with the consequences largely induced by crucial limitations in language diversity, adoption of multiple datasets and data splits, as well as detailed model comparisons. To that end, I present ongoing work that investigates the effect of dataset construction and data partitioning strategies on model generalizability, using morphological segmentation as the test case; the datasets employed cover a spectrum of data availability for languages ranging from critically endangered to relatively high-resource. Time permitting, I will move on to work probing the role of simple n-gram language models in the development and evaluation of automatic speech recognition for low-resource languages.
|
|
Apr 25 | Wenchong He | Interdisciplinary Geospatial Artificial Intelligence for Scientific Applications
Abstract: Over the last decade, Artificial Intelligence (AI) has revolutionized society in the realms of computer vision and natural language processing. However, the progress of AI in the scientific domain has faced significant challenges, including the existence of physical knowledge and constraints, the paucity of high-quality ground truth, spatiotemporal autocorrelation and heterogeneity, and long-range and multi-scale dependencies. In this talk, I will introduce my representative work aimed at overcoming these challenges in developing Geospatial AI (GeoAI) models for scientific applications. Additionally, I will discuss my future research directions towards building foundation models for geospatial scientific applications.
|
|
Apr 11 | Michael Perez | Exploring Human and Animal Action Recognition with Practical Applications
Abstract: This presentation will first describe Michael’s general exam survey paper on CNN-based action recognition in videos. Terminology such as action recognition, action detection, pose estimation, and optical flow will be defined. Two-dimensional, two-stream, and 3D CNNs designed for action recognition will be introduced. Representative works in human and animal action recognition and pose estimation will be summarized. An action recognition taxonomy will be presented, organized by the species and number of subjects that the method was validated on, inference speed, and the use of pose estimation, action labels, optical flow, multiple cameras, a depth camera, and 2D or 3D pose. This presentation will also describe extensions of Michael’s previous research to two DARPA-funded projects about an augmented reality task guidance system and a multimodal QA system. An NSF project about classifying behaviors in animal videos for behavioral neuroscience applications will be briefly mentioned.
|
|
Apr 4 | Yang Bai | Machine Learning-based Information Retrieval for Large-Scale Natural Language Processing
Abstract: This presentation is a rehearsal of Yang Bai (Tony)'s Ph.D. Thesis Proposal. Tony's research enhances information retrieval (IR) for pivotal NLP tasks like open-domain QA and fact-checking. His work demonstrates the limitations of traditional IR methods, like BM25, and showcases the advancements made by using neural networks for dense retrieval. The proposed multi-answer QA task, MythQA, extends to large-scale check-worthy claim detection. Another contribution, M3, is a multi-hop retrieval system that outperforms existing methods on the FEVER dataset. Tony's work also explores the potential of large generative pretrained transformer (GPT) models for complex multi-modal information retrieval tasks.
|
|
Feb 29 | Thiago de Paulo Faleiros | DODFMiner: An automated tool for Named Entity Recognition from Brazilian Official Gazettes
Abstract: This presentation provides an overview of my research journey, starting with an introduction to my academic background. The second part of the presentation focuses on the KnEDLe project. The project "KnEDLe - Knowledge Extraction from Documents of Legal Content" was proposed to employ official publications as a research object and extract knowledge from them. The objective was to develop intelligent tools for extracting structured information from official publications, aiming to facilitate the search and retrieval of information, increase government transparency, facilitate audit tasks, and detect problems related to the use of public resources. We proposed DODFMiner, an automated tool for Named Entity Recognition from Official Gazettes. The computational challenges and solutions implemented in the DODFMiner tool will be presented. To conclude the presentation, I will talk about perspectives and proposals for new research collaborations.
Bio: Thiago de Paulo Faleiros is an Assistant Professor in the Department of Computer Science at the University of Brasília and coordinated the KnEDLe project. He earned his Bachelor's degree in Computer Science in 2007 and his Master's degree in 2011. He received his Ph.D. in Computer Science from the Institute of Mathematical Sciences and Computing at the University of São Paulo in 2016. His primary research interests focus on machine learning, natural language processing, and graph mining.
|
|
Feb 15 | Jean Louis | (Tiny) Solutions to Big Problems: Introduction to applications and challenges in TinyML
Abstract: There is increasing research on edge computing and device-side machine learning. ML at the edge provides several benefits, such as reduced latency, energy utilization, and cost, as well as increased privacy. A class of research also exists focused on even more constrained devices like microcontrollers or embedded devices. Microcontrollers have limited power and memory and no operating system but enable new Intelligent IoT (IIoT) applications. For example, smart, low-cost sensors are used in smart manufacturing for anomaly detection and preventative maintenance. These pursuits have led to the creation of models with a fraction of the resource requirements and negligible drops in accuracy or top-line metrics (top-k accuracy). However, research in ML fairness demonstrates that high accuracy may still hide disproportionate effects on a minority subset of the data. In this presentation, we will delve into the realm of Tiny Machine Learning (TinyML) and explore the influence of compression on various aspects of a model, including its accuracy, fairness, efficiency, robustness, and explainability.
|
|
Feb 01 | Yifan Wang | Learning-Based High-Dimensional Query Processing
Abstract: With the emergence of deep learning techniques, more and more applications are utilizing neural embeddings, which are high-dimensional numeric vector representations. A recent example is retrieval-augmented generation (RAG) for LLMs. Accordingly, efficient and effective query processing over high-dimensional vectors is in demand. However, the performance of traditional methods has hit a bottleneck as data scale rapidly increases. In this talk, Yifan will introduce his studies that apply machine learning techniques to speed up commonly used high-dimensional data operations, as well as his future research and funding plans.
|
|
Jan 18 | Wenchong He | Interdisciplinary GeoAI for Scientific Applications
Over the past decade, Artificial Intelligence (AI) has had a transformative impact on society, particularly in areas such as computer vision and natural language processing. Despite these advancements, the progress of AI in the Geo-domain (GeoAI) has faced significant challenges, including the distinctive characteristics of spatiotemporal data, scientific knowledge constraints, and issues related to trustworthiness. During this seminar, Wenchong will present two representative projects that focus on leveraging scientific knowledge within spatiotemporal data to enhance the robustness and efficiency of GeoAI models. Additionally, Wenchong will outline his future research plans aimed at developing geo-foundation models.
|
Fall 2023
Date | Speaker | Title | Slides |
---|---|---|---|
Nov 09 | Sangpil Youm | First Come First Assign (FCFA): Explainable, Divergence-Aware SRL Projection
|
|
Nov 09 | Yifan Wang | AI4DB and DB4AI in the Era of Deep Learning and LLM
|
|
Oct 26 | Gloria Katuka | Investigating Automatic Dialogue Act Classification in Collaborative Learning through Federated Transfer Learning and Cross-Corpora Domain Adaptation
|
|
Oct 26 | Alexander Webber | An algorithm to induce the grammar of Mixtec Codices - a work in progress
|
|
Oct 5 | Richa Dutt | A Machine Learning Approach for Chlorophyll-a Estimation in Coastal Waters from Top-of-Atmosphere VIIRS Satellite Data
|
|
Oct 5 | Yifan Wang | LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval
Many recent approaches to passage retrieval use dense embeddings generated from deep neural models, an approach called "dense passage retrieval". State-of-the-art end-to-end dense passage retrieval systems normally deploy a deep neural model followed by an approximate nearest neighbor (ANN) search module. The model generates embeddings of the corpus and queries, which are then indexed and searched by the high-performance ANN module. With increasing data scale, the ANN module unavoidably becomes the efficiency bottleneck. An alternative is the learned index, which achieves significantly higher search efficiency by learning the data distribution and predicting the target data location. But most existing learned indexes are designed for low-dimensional data and are not suitable for dense passage retrieval with high-dimensional dense embeddings. In this paper, we propose LIDER, an efficient high-dimensional Learned Index for large-scale DEnse passage Retrieval. LIDER has a clustering-based hierarchical architecture formed by two layers of core models. As the basic unit LIDER uses to index and search data, a core model includes an adapted recursive model index (RMI) and a dimension reduction component, which consists of an extended SortingKeys-LSH (SK-LSH) and a key re-scaling module. The dimension reduction component reduces the high-dimensional dense embeddings into one-dimensional keys and sorts them in a specific order; the RMI then uses these keys to make fast predictions. Experiments show that LIDER has a higher search speed with high retrieval quality compared to state-of-the-art ANN indexes on passage retrieval tasks, e.g., on large-scale data it achieves 1.2x search speed and significantly higher retrieval quality than the fastest baseline in our evaluation. Furthermore, LIDER offers a better speed-quality trade-off.
|
|
Sep 21 | Jayetri Bardhan | DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries
This paper develops the first question answering dataset (DrugEHRQA) containing question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR). EHRs contain patient records, stored in structured tables and unstructured clinical notes. The information in structured and unstructured EHRs is not strictly disjoint: information may be duplicated, contradictory, or provide additional context between these sources. Our dataset has medication-related queries, containing over 70,000 question-answer pairs. To provide a baseline model and help analyze the dataset, we have used a simple model (MultimodalEHRQA) which uses the predictions of a modality selection network to choose between EHR tables and clinical notes to answer the questions. This is used to direct the questions to the table-based or text-based state-of-the-art QA model. In order to address the problem arising from complex, nested queries, this is the first time Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (RAT-SQL) has been used to test the structure of query templates in EHR data. Our goal is to provide a benchmark dataset for multi-modal QA systems, and to open up new avenues of research in improving question answering over EHR structured data by using context from unstructured clinical data.
|
|
Sep 21 | Haodi Ma | KGSimple: Can Knowledge Graphs Simplify Text?
Knowledge Graph (KG)-to-Text Generation has seen recent improvements in generating fluent and informative sentences which describe a given KG. As KGs are widespread across multiple domains and contain important entity-relation information, and as text simplification aims to reduce the complexity of a text while preserving the meaning of the original text, we propose KGSimple, a novel approach to unsupervised text simplification which infuses KG-established techniques in order to construct a simplified KG path and generate a concise text which preserves the original input’s meaning. Through an iterative and sampling KG-first approach, our model is capable of simplifying text when starting from a KG by learning to keep important information while harnessing KG-to-text generation to output fluent and descriptive sentences. We evaluate various settings of the KGSimple model on currently available KG-to-text datasets, demonstrating its effectiveness compared to unsupervised text simplification models which start with a given complex text.
|
|
Sep 07 | Zelin Xu | Spatial Knowledge-Infused Hierarchical Learning: An Application in Flood Mapping on Earth Imagery
|
|
Sep 07 | Wenchong He | Physics-guided AI for Spatiotemporal Data in Scientific Applications
|
|
Aug 24 | Yang Bai | MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering
Check-worthy claim detection aims to provide plausible misinformation to downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Much effort has gone into identifying check-worthy claims from a small set of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering (QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer, then classify them into three categories: "Supporting", "Refuting", and "Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA |
|
Aug 24 | Alexander Rajender Webber | Figure Identification in Mixtec
|
Fall 2022
Date | Speaker | Title | Slides |
---|---|---|---|
Oct 21 | I Harmon |
Neuro-symbolics in Remote Sensing Species Classification
Forests cover 31% of the Earth's surface and play a vital role in sustaining life on the planet. Remote sensing is used to efficiently monitor forest health at scale. Many forest parameters, such as biomass, can be estimated more accurately when species are known or individual tree crowns can be counted. Therefore, crown delineation and species classification are important to remote sensing forest parameter estimation. However, creating robust machine learning classifiers to delineate crowns and classify species is a difficult task. In this talk we discuss how neuro-symbolics can be leveraged to improve the performance of remote-sensing-based tree species classifiers. |
|
Oct 6 | Anthony Colas | GAP: A Graph-aware Language Model Framework for Knowledge Graph-to-Text Generation Recent improvements in KG-to-text generation are due to additional auxiliary pre-trained tasks designed to give the fine-tune task a boost in performance. These tasks require extensive computational resources while only suggesting marginal improvements. Here, we demonstrate that by fusing graph-aware elements into existing pre-trained language models, we are able to outperform state-of-the-art models and close the gap imposed by additional pre-train tasks. We do so by proposing a mask structure to capture neighborhood information and a novel type encoder that adds a bias to the graph-attention weights depending on the connection type. Experiments on two KG-to-text benchmark datasets show these models to be superior in quality while involving fewer parameters and no additional pre-trained tasks. By formulating the problem as a framework, we can interchange the various proposed components and begin interpreting KG-to-text generative models based on the topological and type information found in a graph. | |
Sep 23 | Yifan Wang |
DBSim - Extensible Database Simulator for Fast Prototyping In-Database Algorithms
In-database analytics has become one of the most studied topics in the data science community because of its significance in reducing the gap between the management and the analytics of data, which can save much of the time spent exchanging data between databases and external analytic tools. But implementing in-database algorithms inside mainstream databases without pre-verification of the ideas is risky and may result in a significant waste of time. In this talk we present a testbed, DBSim, which simulates a relational database and allows users to easily extend it based on their needs, so that users can quickly prototype their ideas and estimate the performance before diving into the large codebase of a real RDBMS. |
Fall 2020
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 11 | DSR Lab | Neuro-Symbolics Applications Continuation of neuro-symbolics applications. | |
Dec 4 | DSR Lab | Neuro-Symbolics Applications Neuro-symbolics is a new AI paradigm that combines neural models with symbolic reasoning. It's a promising technology that is edging AI capabilities closer to human levels. Today we will look at some of its applications. | |
Nov 20 | Yifan Wang | SystemV: A Generic Embedding-based Platform for Efficient Similarity Query Processing Similarity queries are becoming increasingly important in the age of big data where the data objects do not possess any natural order. Examples include large collections of images, text and multimodal data objects. Embeddings are widely used in processing of similarity queries for measuring semantic similarity of data. However, current database systems cannot adequately support embedding based similarity query processing. Furthermore, multiple types of embeddings may be constructed over data objects that contain multiple modalities, such as text and image, where similarity-based join queries are required but the traditional join operators do not work well on these queries. So we propose SystemV, a generic platform on top of Spark, providing scalable end-to-end solutions for efficient embedding based similarity query processing, including search on single embedding space and join on multiple embedding spaces. The goal of SystemV is to democratize AI by facilitating the use of pre-trained embeddings (e.g., learned from large amounts of data such as Wikipedia and ImageNet). Using such pre-trained models, users can focus more on their core application rather than designing the retrieval models. | |
Nov 13 | Ali Sadeghian | Relational learning and reasoning on knowledge graphs (My PhD defense dry run) In this talk we will briefly discuss knowledge graphs and the current well-studied reasoning methods on KGs. We then dive deeper into temporal knowledge graphs and temporal reasoning. This is an important topic because, despite the importance and abundance of temporal knowledge graphs, most current research has focused on reasoning on static graphs. We study the challenging problem of inference over temporal knowledge graphs, in particular the task of temporal link prediction. In general, this is a difficult task due to data non-stationarity, data heterogeneity, and complex temporal dependencies. We propose Chronological Rotation (ChronoR), a novel model for learning representations for entities, relations, and time. Learning dense representations is frequently used as an efficient and versatile method to perform reasoning on knowledge graphs. The proposed model learns a k-dimensional rotation transformation for every (relation, time) pair such that each fact’s head entity, once transformed using the rotation, is close to its tail. By using high-dimensional rotation as its transformation operator, ChronoR captures rich interaction between the temporal and multi-relational characteristics of a Temporal Knowledge Graph. Experimentally, we show that ChronoR is able to outperform state-of-the-art methods on the benchmark datasets for temporal knowledge graph link prediction. | |
Nov 6 | Haodi Ma | Survery on neural symbolic learning Motivated by human's ability of learning visual concepts by jointly understanding vision and language, a recent work proposes the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. This work has some similarity with previous works from our lab focusing on information extraction with multimodal approaches. In this talk I will include these works and discuss the similarity and distinction between them for possible future research. | |
Oct 30 | Anthony Colas | Few-Shot Learning Deep learning has achieved state-of-the-art results on many tasks in various areas, including image-to-image translation and image classification in computer vision and summarization and machine translation in natural language processing (NLP). However, these types of architectures still require large amounts of data, which can be difficult to obtain or simply not available in many domains. Thus, few-shot learning can help solve these problems in domains where large data are not as copious. Few-shot learning enables one to train a model with smaller amounts of data, while also teaching the model to learn how to learn (meta-learning). These models have been shown to perform fairly well in their respective objectives. In this talk, we will introduce few-shot learning as well as the related meta- and one-shot learning. We will look at a few example architectures used in the few-shot learning task, as well as commonly used datasets. Next, we will examine how few-shot learning has been more prevalently used in computer vision for image classification and more recently, image-to-image translation. From there, we can transition into how few-shot learning has been more recently applied in NLP to tasks such as summarization and machine translation. Finally, we will conclude with how such models can be adapted for the graph-to-text task. | |
Oct 23 | Jaytri Bardhan | Question Answering on Electronic Health Records An Electronic Health Record (EHR) is an electronic version of a patient’s medical history. The state-of-the-art model, ‘Text to SQL Generation for Question Answering on Electronic Medical Records’, has some major shortcomings. The accuracy of this question answering system on the MIMIC III database is very poor. It is very challenging to obtain good accuracy on multi-relational health records. In this project, Relation-Aware Schema Encoding and Linking for Text-to-SQL (the RAT-SQL model) has been used for question answering over structured medical records. The talk will also cover the annotation of the drug dataset for question answering over structured and unstructured clinical records. Furthermore, this QA system could aid in applications like Sepsis phenotype detection and treatment prediction. | |
Oct 16 | Yang Bai | AIDA Progress in the Summer AIDA is a DARPA-funded project that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. The Data Science Research (DSR) lab of UF is a participant in this project. We are responsible for generating hypotheses with knowledge graphs that are generated by upstream teams. HypoGator is the hypothesis generation system designed by the DSR lab for the AIDA project. It uses a search-score-rank approach to find alternative answers to complex queries over the automatically extracted event-driven multimedia knowledge graph. In this talk, I will describe the major changes and improvements made to HypoGator last summer. | |
Oct 9 | Ali Sadeghian | N/A | |
Sep 25 | I Harmon | A Survey of Weak Supervision and its Applications Machine learning models are becoming increasingly powerful but require larger datasets for optimal performance. One of the biggest problems in the model creation pipeline is the bottleneck of dataset creation. Weak supervision is one solution to this problem. We will explore exactly what weak supervision is, its taxonomy, and its applicability. We'll look at frameworks that allow for the rapid creation of datasets using weak supervision. We'll also look at some applications of weak supervision. |
Spring 2020
Date | Speaker | Title | Slides |
---|---|---|---|
Mar 13 | Ali Sadeghian | Relational learning and reasoning on knowledge graphs In this talk, we will cover two of the main reasoning methods on knowledge graphs: rule learning and embedding models. We will briefly cover how KGs are constructed and why reasoning on KGs is crucial for both completing the KG and using them in applications. We will then discuss differentiable rule mining and the benefits of rules as an interpretable way of reasoning. Finally, we will discuss embedding based methods on KGs and dive deeper into recent advances of these latent models in temporal KG completion. | |
Feb 14 | Yang Bai | HypoGator for the AIDA Project In this talk, I’m going to give an introduction to our hypothesis generation system, HypoGator, which is designed for the AIDA project. It uses a search-score-rank approach to find alternative answers to complex queries over the automatically extracted event-driven multimedia knowledge graph. I will go through some major features of the pipeline in reasonable detail and demonstrate the results with our specially developed visualization tool. Finally, we will discuss directions for future improvement and task extensions of the system. AIDA is a DARPA-funded project that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. The Data Science Research lab of UF is a participant in this project. We are responsible for generating hypotheses with knowledge graphs that are generated by upstream teams. | |
Feb 07 | Anthony Colas | EventQA (Complex-Question Narration) Knowledge bases (KBs) present large amounts of data in a structured manner. Because of the amount of information in them, there has been much work on querying these structured repositories of data using natural language. This work extends to complex question-answering (complexQA) where the complexity of question increases by mentioning multiple entities, relations, constraints, or all three. In this talk, we discuss the problems with current complexQA systems over knowledge graphs and motivate our work in KB-QA narration. Namely, we differentiate our problem from conventional KB-QA by observing questions that cannot be solved well given the current approaches. We first introduce EventKG in order to motivate our work by giving examples of the complexQA narrative task over this dataset. To solve this new problem, we review some work in both complexQA and graph-to-text and show how our task is a combination of these two. Next, we discuss our baseline approach in generating answer narratives. Finally, we present our approach to building a novel dataset for the complex question-narratives task and discuss possible models that generate narratives from questions and KB components. This talk will serve as an introduction to works in both the KB-QA and graph-to-text translation tasks and as our roadmap to solving the new problem of KB-QA narration. | |
Jan 31 | Jaytri Bardhan | Question Answering Systems using Electronic Health Records This talk will cover the scope of the project Question Answering Systems using Electronic Health Records. A brief introduction will be given to the different types of electronic health records, followed by the related research work that exists in this area. A detailed description of the different milestones for this project will be presented along with its different challenges. Also, a short overview will be given of the paper BERT-based Ranking for Biomedical Entity Normalization. In this paper, the authors propose an entity normalization architecture that fine-tunes the pre-trained BERT / BioBERT / ClinicalBERT models, and conduct extensive experiments to evaluate the effectiveness of the pre-trained models for biomedical entity normalization using three different types of datasets. The experimental results show that the best fine-tuned models consistently outperformed previous methods and advanced the state-of-the-art for biomedical entity normalization, with up to a 1.17% increase in accuracy. |
Fall 2019
Date | Speaker | Title | Slides |
---|---|---|---|
Nov 22 | Jaytri Bardhan | Survey on Question Answering Systems over Electronic Health Records (EHRs) The widespread adoption of electronic health records (EHRs) has enabled the secondary use of EHR data for clinical research and healthcare delivery. An Electronic Health Record (EHR) is an electronic version of a patient’s medical history and may include demographics, progress notes, problems, medications, vital signs, past medical history, and laboratory data. Natural language processing (NLP) techniques have been used over the years to develop question answering systems over structured as well as unstructured electronic health records. The following two papers will be presented: The paper - ‘A Translate-Edit Model for Natural Language Question to SQL Query Generation on Multi-relational Healthcare Data’ develops a deep learning-based approach that can translate a natural language question on multi-relational EHR data into its corresponding SQL query, which is referred to as a Question-to-SQL generation task. To address the challenge of generating queries on multi-relational databases from natural language questions, the TRanslate-Edit model for Question-to-SQL query (TREQS) is proposed. The paper - ‘CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using the OMOP Common Data Model’ presents a cohort retrieval system that can execute textual cohort selection queries on both structured and unstructured EHR data. CREATE is a proof-of-concept system that leverages a combination of structured queries and IR techniques on NLP results to improve cohort retrieval performance while adopting the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to enhance model portability. | |
Nov 8 | Yang Bai | HypoGator: Hypotheses Generation and Ranking over Event-Driven Multimedia Knowledge Graph This presentation is going to be a dry run for the TAC workshop next week where we are going to show our latest progress on the AIDA project during the M18 evaluation phase. It is going to include a brief introduction to our old pipeline and a detailed introduction to the improvements in our new pipeline, as well as a demo of our new visualization tool which is quite helpful in result analysis. AIDA is a DARPA-funded program that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. As a key part of the AIDA project, we are responsible for generating hypotheses with the knowledge graphs that are generated by upstream teams. | |
Nov 1 | Miguel Rodriguez | REASONING OVER MULTI-SOURCE AND DYNAMIC KNOWLEDGE GRAPHS (Dissertation Dry Run) Innovative approaches to Information Extraction(IE) have enabled the creation of large Knowledge Graphs (KGs) (e.g., YAGO, NELL, DBPedia, Wikidata) and Dynamic Knowledge Graphs (e.g., ICEWS, GDELT). These knowledge graphs have become an increasingly popular domain knowledge representation used in semantic search, recommendation systems, question-answering, natural language processing, etc. Despite the increased efforts, KGs are still predominantly incomplete and contain a high degree of uncertainty. In this dissertation, we study two approaches to fill in the gaps in existing knowledge graphs 1) Reasoning over facts from different knowledge graphs and/or extractors to increase coverage and automatically evaluate the correctness in the aggregate, this task is known as Knowledge Fusion (KF). We propose CM Fusion, an ensemble approach that combines supervised learning and unsupervised consensus. We also propose SigmaKB, a query engine that uses CM Fusion and user feedback to integrate and improve query results compared to single KGs. 2) Reasoning over existing facts to determine possibly missing ones, moreover we propose to do this over dynamic knowledge graphs. We propose to learn observed feature models over dynamic knowledge graphs in the form of sequential rules and use them to infer missing facts and forecast events in a streaming fashion. CM fusion uses the Consensus Maximization algorithm to ensemble supervised classifiers learned using information extractors that can be assessed a priori and generates consensus with black box extractors that can not be assessed beforehand. Consensus Maximization Fusion is able to promote high-quality facts and eliminate incorrect ones. We demonstrate the effectiveness of our system on the NIST Slot Filler Validation evaluation, which seeks to evaluate and aggregate multiple independent information extractors. 
Our system achieved the highest F1 score relative to other system submissions. We mine temporally constrained first-order inference rules over dynamic knowledge graphs. The learned rules are the first set of temporal rules mined over dynamic knowledge graphs. The algorithm we propose uses adjusted definitions of support, confidence and head coverage metrics that consider minimal occurrences and time windowing constraints. We ground the learned sequential rules over dynamic knowledge graphs in two tasks: temporal link prediction and streaming link prediction. Our experiments show that inference using sequential rules can outperform representation learning approaches while at the same time yielding interpretable patterns. These patterns can be further refined by experts or used by analysts in QA tasks. Finally, we show that rules and embeddings complement each other and propose to cast the ensemble as a ranking aggregation problem. We use Reciprocal Rank Fusion, an unsupervised rank aggregation model. | |
Oct 25 | Haodi Ma | Answering Complex Questions by Joining Multi-Doc Evidence with Quasi Knowledge Graph In this talk, I will first discuss the pipeline of the model and then go into the details of each step in order. The model has two main steps: graph construction and the graph algorithm. Its main contribution is casting the QA task as a GST problem. I will then cover the evaluation of the model, including experimental results, and finally compare QUEST with the AIDA task and describe what I have accomplished so far. | |
Oct 18 | Ali Sadeghian | Embed All The Things! In this talk, we will discuss StarSpace, a method that can be used to embed multiple types of objects and solve a variety of tasks such as labeling tasks, ranking tasks, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. | |
Oct 11 | Yang Bai | Knowledge Graph Embeddings Knowledge graph (KG) embedding aims to embed components of a KG, including entities and relations, into continuous vector spaces, so as to simplify manipulation while preserving the inherent structure of the KG. It can benefit a variety of downstream tasks such as KG completion and relation extraction and hence has quickly gained massive attention. In this presentation, I’m going to first give a thorough introduction to classical KG embedding methods: the translational distance models. Then, I’m going to introduce one of the latest works in this field: Knowledge graph embedding via reasoning over entities, relations, and text, which combines the classic translational distance models with the LSTM neural network to extract both latent semantic features and observable structure patterns in a unified knowledge graph. | |
Sep 27 | Jaytri Bardhan | Template-Based Complex Query The following two papers will be presented - ‘Automated template generation for question answering over knowledge graphs’ and ‘Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases’. They both emphasize different template-based approaches to handling complex question answering over knowledge graphs. Templates are an important asset for question answering over knowledge graphs, simplifying the semantic parsing of input utterances and generating structured queries for interpretable answers. The paper ‘Automated template generation for question answering over knowledge graphs’ presents QUINT, a system that automatically learns utterance-query templates solely from user questions paired with their answers. Additionally, QUINT uses basic templates to answer structurally complex compositional questions without observing such questions during training. The paper ‘Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases’ presents NEQA, a continuous learning paradigm for KB-QA. Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases. The existing methods require access to a large annotated training set that is not always readily available and fail on questions from previously unseen domains. In this paper, NEQA, when offline, automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. If a new question cannot be satisfactorily answered via templates, user feedback is used on the output of a semantic similarity function to learn a new template based on the new question. | |
Sep 20 | Miguel Rodriguez | HypoGator - M18 Eval Knowledge Graphs are widely used to represent knowledge in part because the open-world assumption, uncertainties, and noise can all coexist in them. Unfortunately, the representation flexibility of KGs is not matched by the query languages or the query engines that run on them. HypoGator is the DSR lab's hypothesis generation system in the framework of the AIDA program. HypoGator answers queries over KGs constructed by combining and aligning document-level knowledge elements and provides a ranked list of possible answers. In this talk, we describe the major changes made to the system for the summer evaluation cycle and show a KG narration to better visualize the generated hypotheses. | |
Sep 13 | Yifan Wang | AI-Supported DBMS We are going to present two papers: CognitiveDB and RankSQL. They are relevant to supporting AI in DBMS and efficient execution of top-k queries in DBMS. These are two popular topics currently in the database community. CognitiveDB is an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of their design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter-/intra-column relationships between database tokens of different types. They seamlessly integrate the word embedding model into existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. This system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice. RankSQL is a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS), by extending relational algebra and query optimization. They aim to support ranking as a first-class database construct. So they extend relational algebra by proposing a rank-relational model to capture the ranking property and introducing new and extended operators to support ranking as a first-class construct. 
Enabled by the extended algebra, they present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. Their approach can significantly reduce the cost of ranking operation in DBMS. | |
Sep 6 | Ali Sadeghian | Hotel2vec: Learning Attribute-Aware Hotel Embeddings with Self-Supervision We propose a neural network architecture for learning vector representations of hotels. Unlike previous works, which typically only use user click information for learning item embeddings, we propose a framework that combines several sources of data, including user clicks, hotel attributes (e.g., property type, star rating, average user rating), amenity information (e.g., the hotel has free Wi-Fi or free breakfast), and geographic information. During model training, a joint embedding is learned from all of the above information. We show that including structured attributes about hotels enables us to make better predictions in a downstream task than when we rely exclusively on click data. We train our embedding model on more than 40 million user click sessions from a leading online travel platform and learn embeddings for more than one million hotels. Our final learned embeddings integrate distinct sub-embeddings for user clicks, hotel attributes, and geographic information, providing an interpretable representation that can be used flexibly depending on the application. We show empirically that our model generates high-quality representations that boost the performance of a hotel recommendation system in addition to other applications. An important advantage of the proposed neural model is that it addresses the cold-start problem for hotels with insufficient historical click information by incorporating additional hotel attributes which are available for all hotels. | |
Aug 30 | Anthony Colas | Natural Language to Query Language Users often converse with chatbots in order to ask task-specific questions. Much of the information is only accessible in databases, which requires one to know a query language. Thus, when building a conversational question answering system, it is crucial to convert natural language questions into a query language in order for users to obtain answers to their questions. We approach this problem in two different domains and utilize two types of query languages: SQL and SPARQL. Furthermore, we develop a novel data collection methodology in order to generate synthetic data for use on the natural language to query language tasks. With this methodology, one can quickly and efficiently generate data to use on their models for their specific domain needs. We show promising results for our synthetically generated datasets and present further steps needed for future work. |
Spring 2019
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 5 | Ali Sadeghian Giacomo Bergami | emrQA: A Large Corpus for Question Answering on Electronic Medical Records Question and answering systems remain relatively unexplored in clinical domains. This paper proposes a novel methodology to generate domain-specific large-scale question answering datasets and demonstrates an instance of this methodology in creating a large-scale QA dataset for electronic medical records. The method creates questions and logical form templates obtained through expert annotations and with existing annotations in clinical notes it generates questions, logical form, and answers. The dataset’s learning potential is explored by training baseline models for question to logical form and question to answer mapping. | |
Mar 28 | Giacomo Bergami | Alternatives for generating Alternative Hypotheses over Knowledge graph In this presentation, we will provide an introduction to graph search-based algorithms in comparison with a graph query answering algorithm. While the former focus on getting paths from one entry point, the latter returns a subgraph matching the query. We will show similarities and differences between the two. After doing this, we show the output of the former evaluation of HypoGator and those from a straightforward implementation of SAMA (a distributed approximate graph matching algorithm). Different metrics will be used to evaluate those two alternative algorithms for generating alternative hypotheses answering questions over a probabilistic knowledge graph. | |
Mar 22 | Miguel Rodriguez | SERM - Sequence Rule Mining from Temporal Knowledge Bases Research efforts in reasoning over large scale Knowledge Bases (KBs) such as Freebase, YAGO or NELL have largely focused on static representations of knowledge. The recent availability of time-annotated KB facts in YAGO and Wikidata and time-stamped Event Knowledge Bases (EKBs) such as GDELT and ICEWS has ignited efforts in developing models that also reason over the temporal dimension. The majority of work in this research area is centered on representation learning for the link prediction task. In contrast, in this paper, we study the problem of learning first-order inference rules where the rule atoms occur in sequential order. In particular, we propose an algorithm to mine sequence rules from temporal knowledge bases and interestingness metrics that take into account time windowing constraints. Our experiments show that interpretable patterns can be mined from ICEWS and further used by human analysts and improved by experts. | |
Mar 1 | Rahul Sengupta Debdeep Basu | Fusing differentiable Rule-Learning and Embeddings for Knowledge Base Completion During the course of this project, we investigate the use of deep learning applied to learn probabilistic first-order logical rules in order to improve embeddings, for the task of knowledge base completion. In particular, we attempt to fuse the concepts of the paper “Differentiable Learning of Logical Rules for Knowledge Base Reasoning” with popular embedding models such as DISTMULT and TransE. We hope that a joint model will perform better overall than the individual models, with each compensating for the other's weaknesses. Evaluation Framework for Hypothesis Generation Knowledge Graphs are generated from various signals (video, audio, text, etc.) for a particular event. These small knowledge graphs are then merged into a big graph which contains the accumulated information about an event from various streams. Hypothesis generation produces a list of subsets of this merged graph. The output of hypothesis generation signifies the possible causes for the occurrence of a particular event. The evaluation framework is a web application with which users can upload their datasets, visualize statistics in that dataset and play with an interactive knowledge graph exploration tool with which they will be able to query the graph (in natural language), visualize the generated list of hypotheses and proceed to unfold the graph as and when required. This can be used as a debugging tool for HypoGator. | |
Feb 15 | Anthony Colas Hyun Choi | Implementing Question Understanding Via Template Decomposition There has been a lot of work on simple question-answering over structured knowledge bases. That is, answering questions which contain only one relation between two entities. Though there has been a lot of development in simple question-answering over structured data by using templates, syntactic parsers, or language models. In this talk, we present our progress towards implementing the approach used by Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition. We show experimental results thus far on the WebQuestions dataset (simple questions) and describe some of the challenges encountered during the process. We also discuss future steps and some statistics of the natural language pattern question templates. emrQA: A Large Corpus for Question Answering on Electronic Medical Records Question and answering systems remain relatively unexplored in clinical domains. This paper proposes a novel methodology to generate domain-specific large-scale question answering datasets and demonstrates an instance of this methodology in creating a large-scale QA dataset for electronic medical records. The method utilizes existing expert annotations on clinical notes for various NLP tasks from i2b2 datasets. The corpus contains question templates and logical forms and question-answer pairs. The dataset’s learning potential is explored by training baseline models for question to logical form and question to answer mapping. | |
Feb 8 | Yang Bai | Approximate Subgraph Matching It is increasingly common to find real-life data represented as graphs of labeled, heterogeneous entities and relations. To query these graphs, one often needs to identify the matches of a given query graph in a (typically large) graph database. Due to noise and the lack of fixed schema in many real-life graph databases, the query graph can substantially differ from its matches in the graph database in both structure and node/edge labels, thus bringing challenges to the graph querying tasks. To tackle this problem, Approximate Subgraph Matching is proposed to help users better query graph databases. Today, I am going to give a talk on this topic based on two representative models: neighborhood based approximate graph matching and path alignment based approximate graph matching. | |
Feb 1 | Ali Sadeghian | DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs Logical rules play an important role in capturing interpretable patterns present in knowledge bases. An extensive body of research has focused on automatically discovering first-order logical rules by exploring a discrete search space for the rule structure and then applying statistical measures to assess the correctness of each rule. In this talk, we consider a new parametric approach for mining logical rules from knowledge graphs. We prove the limitations of the state-of-the-art differentiable technique for mining logical rules and propose a method that (1) has no theoretical restrictions on the structure of meaningful logic rules, (2) is adaptive to the data, and (3) avoids over-parametrization while finding meaningful signals in the data. We show that our method outperforms existing rule mining methods in the quality of the mined rules over benchmark datasets as well as in the link prediction task. | |
Jan 25 | Yifan Wang | Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities A cognitive Database is an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of its design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter-/intra-column relationships between database tokens of different types. For each database token, the model includes a vector that encodes contextual semantic relationships. It seamlessly integrates the word embedding model into existing SQL query infrastructure and uses it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. This paper demonstrates unique capabilities of Cognitive Databases using an Apache Spark based prototype to execute inductive reasoning CI queries over a multi-modal database containing text and images. The authors believe their first-of-a-kind system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice. | |
Jan 18 | Anthony Colas | Instructional QA In this talk, I will go over my work on a question answering problem involving tutorial videos. Previous works have focused on generating short responses or factoids as answers to a user’s question. Here, the answer is in the form of a span (from a video segment), and the answers are usually multiple sentences long. To model and accomplish this task, we use a dataset containing the video transcript, question, and answer. To conclude, we apply some baseline models to our task and present some future directions for the task. | |
Jan 11 | Giacomo Bergami | Schema Independent Relational Learning Learning novel relations from relational databases is an important problem with many applications. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same database may be represented under different schemas for various reasons, such as data quality, efficiency and usability. The output of current relational learning algorithms tends to vary quite substantially over the choice of schema. This variation complicates their off-the-shelf application. We introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We show that current algorithms are not schema independent. We propose Castor, a relational learning algorithm that achieves schema independence by leveraging data dependencies. |
Fall 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 7 | Giacomo Bergami | On Inconsistency Detection over Alternative Hypotheses This talk is going to focus on my previous work on Inconsistency Detection for the M9 evaluation of the AIDA project. After outlining the general AIDA project scenario and motivating the hypothesis generation assumption, I will give a brief introduction on what should be considered as an inconsistency, and which metrics have been already developed in current literature to summarise such inconsistency information. I also present the difference between the standard definition of MultiValued Dependencies with "equality" and their generalisation with "relatedness" and "is-a" relationships. I provide the benchmarks and the quality measures for both approaches and outline some future work directions. | |
Nov 30 | Ali Sadeghian | Deep learning for/and with Logical rules There are multiple methods for inference and reasoning over Knowledge Bases. Methods like mining Horn rules are useful because they can be understood by humans (interpretable) and, unlike embedding-based methods, can be applied to entities not seen before. In this talk, we will discuss various methods of rule mining. This learning problem is difficult because it requires searching a very large space. We briefly overview search-based and embedding-based methods of rule mining and their results. We then focus on a differentiable way of learning rules that learns parameters as well as the structure in a continuous space. We show how a neural control system is designed to learn to compose these operations. We also show that combining logical rules with deep neural networks can enhance their performance in several domains. | |
Nov 16 | Anthony Colas, Caleb Bryant | Complex Question Answering over Knowledge Bases There has been a lot of work on simple question-answering over structured knowledge bases, that is, on questions which contain only one relation between two entities. Though there has been a lot of development in simple question-answering over structured data--by using templates, syntactic parsers, or language models--solving complex questions is still an ongoing issue. Recent work has delved into complex question-answering where there can be multiple relationships between more than two entities. For example, the question "Where was the wife of the US president born?" is a complex question that can be divided into multiple simple questions using multiple relations. In this talk, we look into recent research which has dealt with complex questions (those with more than one relationship) by using template-based methods in order to formulate a formal query from a natural language utterance. Another method which we will examine is the state-transition approach, which translates a natural language question into a semantic query graph to find answers in a KB. We will also briefly go over simple question cases in order to build a basis from which the complex questions are formulated into formal queries. This talk serves as a survey of the work on complex question answering over KBs and will motivate future work in improving the current state-of-the-art and developing new methods. | |
Nov 09 | Hyun Choi | Generalized joint attribute model to learn population dynamics in ecosystems The Generalized Joint Attribute Model (GJAM) is used to study ecological systems. The GJAM model is a joint species distribution model that can accommodate the multifarious data in ecological datasets. The model determines interspecies relationships and environmental factors to make predictions of the species response. In this talk the main focus will be on the way GJAM predicts the population dynamics in the continental United States and Florida. The accuracy of the model with different scales is also examined to study the behavior of the model with different climate inputs. | |
Nov 09 | Sergio Marconi | Data Science in Ecology: real-world hard problems dealing with multifarious and/or limited data In the last decades ecology has increasingly become a data-intensive discipline, whose challenges inherently overlap with data science problems in a scenario of real-world complexity. For example, in order to forecast how tree species distribution and productivity change in an uncertain future, we need to develop generalized methods to extract information from big data, account for the uncertainty in the data source, integrate different sources into cross-scale models, and formally link biogeochemical knowledge to observed patterns from small unbalanced training sets. These real-world challenges represent cutting-edge fundamental problems that the data science community has started recognizing. In this seminar we will present what our interdisciplinary group at UF has done to address such methodological and biological issues. First, we built a Data Science Evaluation Series aiming to predict species labels for each individual tree at the scale of thousands of hectares from remote sensing data. Second, we built a fully Bayesian hierarchical model (GJAM) on a dataset of millions of individual trees sampled from a coarse grid across the country, to identify rules for how those species are distributed. The final goal is to merge these products to understand how those rules change with scale. | |
Nov 01 | Xiaofeng Zhou | Efficient Conditional Rule Mining Over Knowledge Bases Present day web-scale knowledge bases (KBs) incorporate a substantial amount of information in a structured format. Availability of this readily machine-digestible data has made KBs a desirable resource for other applications. This has motivated many to explore learning from KBs. Embedding methods and learning inference rules are examples of such methods. Rules provide great inference power and are also easily understandable. Most recent work focuses only on normal rules (where all the predicates only support variables). We explore conditional inference rules, a class of logical rules which allow predicates with constants and have more expressive power. We show their effectiveness in knowledge expansion by comparing to normal rules’ number of predictions and precision. However, due to the larger search space, mining conditional rules is much more time-consuming than mining normal rules. Current state-of-the-art rule mining methods, when adapted to mine conditional rules, are infeasibly slow on medium/large KBs. To address this shortcoming, we introduce a scalable conditional rule mining algorithm. Our algorithm makes it possible to mine conditional rules from web-scale KBs. | |
Oct 26 | Sourav Dutta, Ali Sadeghian | HypoGator: Alternative Hypotheses Generation and Ranking We provide an overview of HypoGator, our hypotheses generation system. HypoGator relies on the KB constructed by combining and aligning the document-level knowledge elements. Using multiple features and inconsistency detection methods, it extracts multiple coherent and consistent hypotheses. It finally returns a sorted list of the alternative hypotheses relevant to the query. In this talk, we describe the major components of our system and present our experiments and results. Initial analysis of our results provides insight into some of the things that can be done by TA2-TA1 to help improve the generation of the hypotheses. | |
Oct 19 | Anthony Colas, Caleb Bryant | Template Generation for Querying Relational Databases using Natural Language In most cases, questions are asked using natural language. Because of the amount of information stored in structured knowledge bases, natural language interfaces for databases have been developed which take in a natural language query, structure the natural language to query the knowledge base, and then give a natural language answer. Two main approaches for querying knowledge bases with natural language involve sequence-to-sequence and template-based generation. This talk will focus on state-of-the-art methods for generating templates in order to query relational/structured databases with natural language-based queries. The talk will cover how to generate the templates, select the relevant templates, rank the templates, and map queries to templates. Comparisons to deep learning approaches in question answering systems will also be made. The presentation will draw from various works cited in the Querying RDBMS Using Natural Language (Li, 2017) dissertation and serves as a survey of question answering systems on structured knowledge bases for easily querying structured systems and gathering interpretable results. | |
Oct 12 | Xiaofeng Zhou | Efficient Conditional Rule Mining Over Knowledge Bases Present day web-scale knowledge bases (KBs) incorporate a substantial amount of information in a structured format. Availability of this readily machine-digestible data has made KBs a desirable resource for other applications. This has motivated many to explore learning from KBs. Embedding methods and learning inference rules are examples of such methods. Rules provide great inference power and are also easily understandable. Most recent work focuses only on normal rules (where all the predicates only support variables). We explore conditional inference rules, a class of logical rules which allow predicates with constants and have more expressive power. We show their effectiveness in knowledge expansion by comparing to normal rules’ number of predictions and precision. However, due to the larger search space, mining conditional rules is much more time-consuming than mining normal rules. Current state-of-the-art rule mining methods, when adapted to mine conditional rules, are infeasibly slow on medium/large KBs. To address this shortcoming, we introduce a scalable conditional rule mining algorithm. Our algorithm makes it possible to mine conditional rules from web-scale KBs. | |
Oct 05 | Yang Bai | Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks With the revival of neural networks, many studies try to adapt powerful sequential neural models, i.e., Recurrent Neural Networks (RNN), to sequential recommendation. RNN-based networks encode historical interaction records into a hidden state vector. Although the state vector is able to encode sequential dependency, it still has limited representation power in capturing complicated user preference. It is difficult to capture fine-grained user preference from the interaction sequence. Furthermore, the latent vector representation is usually hard to understand and explain. To address these issues, in this paper, we propose a novel knowledge-enhanced sequential recommender. Our model integrates the RNN-based networks with a Key-Value Memory Network (KV-MN). We further incorporate knowledge base (KB) information to enhance the semantic representation of the KV-MN. RNN-based models are good at capturing sequential user preference, while knowledge-enhanced KV-MNs are good at capturing attribute-level user preference. By using a hybrid of RNNs and KV-MNs, the model is expected to be endowed with the benefits of both components. The sequential preference representation and the attribute-level preference representation are combined as the final representation of user preference. With the incorporation of KB information, our model is also highly interpretable. To our knowledge, this is the first time that a sequential recommender is integrated with external memories by leveraging large-scale KB information. | |
Sep 28 | Caleb Bryant | Narrating a Knowledge Base Narrating structured data with a paragraph of text remains a challenging problem. In this presentation, we examine recent efforts to tackle the challenge of natural language generation from Wikipedia tables, focusing on Wang et al.'s paper, Narrating a Knowledge Base. We begin with a brief review of seq2seq neural networks, next investigating how Wang et al. successfully applied multiple types of self-attention to increase the length and completeness of their Wikipedia summaries. Finally, we assess the successes and failures of their method and propose future research directions. | |
Sep 21 | Anthony Colas | Graph Embeddings: A Review on Graph Representations Recently, there has been a lot of interest in efficiently embedding graphs based on node similarity. In this talk, I will introduce what it means to embed graphs. The talk will also compare "shallow" approaches to "deep" approaches. I will also go over some of the state of the art deep approaches used to embed graph data, including their different methodologies and results. This includes Graph Convolutional Networks, GraphSAGE, and Graph Attention Networks. Finally, I will conclude by discussing some applications of embedding graphs using "deep" approaches. | |
Sep 14 | Giacomo Bergami | Query Answering over Probabilistic KB with Alternative Hypotheses Current literature faces three main problems with inconsistency detection. First, theoretical approaches treat all the entities that are not "syntactically" the same as being not equal and perform query answering with the repair-then-query approach. These constraints are too naïve for real-world data: data may contain different descriptions at different abstraction levels, and biased data sources will not allow us to discriminate which is the actual correct answer. Second, traditional FOL systems cannot cope with inconsistencies in the reasoning process, because the principle of explosion allows drawing any possible conclusion from inconsistent facts. Third, traditional SRL models using FOL may be affected by the same problem, thus allowing implausible hypotheses to be inferred with near-zero scores. This presentation will focus on solving the first problem by using hierarchies to define inequality, and on how to detect inconsistencies in data using external validation. For the second problem, we will briefly introduce paraconsistent logics that refute the contradiction principle and can be exploited to reason with inconsistent data. We will leave the third problem for future work on SRL paraconsistent models. |
Summer 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Feb 16 | Dihong Gong | Scaling Integral Projection Models for Analyzing Size Demography In this talk, we study the integral projection model (IPM) for analyzing size demography of ecological systems. First, a basic version of IPM is introduced to model ecological dynamics. Further, the IPM is extended to include climate factors, such that it can be scaled to broader geographic areas. Finally, the effectiveness of IPM is investigated on the FIA dataset with focus on two example species from the year 2015 through 2010. | |
May 18 | Miguel Rodriguez | UF AIDA Summer Dev Plan Some months have passed since we started building on the AIDA project. Even though there are multiple things still in the process of being defined internally and program-wide, I will present the latest developments of the project: General AIDA architecture, UF-TA3 architecture, Inter-TA communication protocols and tools, available datasets and roadmap towards a GAIA dry run evaluation and M9 evaluation. |
Spring 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 13 | Sourav Dutta | Mining Coherent hypothesis from knowledge graph (II) Knowledge bases are very useful for storing complex structured and unstructured information. The rise of the internet has given rise to huge knowledge bases. However, the humongous amount of information we have creates a need to generate meaningful insights to understand connections between entities. In this work, the ICEWS data was modeled as a knowledge graph. The knowledge graph was further enhanced with entities from Wikidata. Since the two data sources have been curated differently, one of the major problems was aligning the entities to have a connected graph. The alignment was done by calculating similarity scores using n-grams and using rules mined from the data. Using the knowledge graph, a hypothesis has been considered as a weighted path between two entities. Here, the approach to creating the knowledge graph is presented, along with a comparison of different entity alignment techniques. Using the knowledge graph, hypotheses were generated and ranked for multiple scenarios. The observations and future work are also shared. | |
Apr 06 | Miguel Rodriguez | Mining Temporal Sequence Rules from Events. In this talk, I will discuss my ongoing research on mining sequential rules over event knowledge graphs, that is, graphs in which edges between entities have time annotations. Specifically, I will discuss the differences between mining factual knowledge bases and event knowledge bases, and present our current methods for mining sequential rules over event knowledge bases, scalability issues, and interestingness measures. I will also show some preliminary results and examples of mined rules. | |
Mar 30 | Ali Sadeghian | Evaluation of automatically generated hypothesis. We are faced with an explosion of information (and misinformation) published through different mediums like blogs, YouTube videos, newspapers, audio podcasts, etc. Knowledge bases have proven to be a great way to store complex information in a semi-structured way. One can process and convert all the information from the mentioned mediums into a single KB. The big question now is how to generate “good” hypotheses about different ongoing scenarios from this KB. To answer this, one must first define what a “good” hypothesis is. In this presentation, we will assume that a KB is built from the mentioned mediums and that there exists a system that generates hypotheses from the KB in the form of subgraphs. We will give a definition of a good hypothesis based on Grice’s maxims and propose different ways of evaluating systems that mine hypotheses from KBs. | |
Mar 16 | Sourav Dutta | Mining Coherent hypothesis from knowledge graph Knowledge bases are very useful for storing complex structured and unstructured information. The rise of the internet has given rise to huge knowledge bases. However, the humongous amount of information we have creates a need to generate meaningful insights to understand connections between entities. In this work, the ICEWS data was modeled as a knowledge graph. The knowledge graph was further enhanced with entities from WikiData. Since the two data sources have been curated differently, one of the major problems was aligning the entities to have a connected graph. The alignment was done by calculating similarity scores using n-grams and using rules mined from the data. Using the knowledge graph, a hypothesis has been considered as a weighted path between two entities. Here, the approach to creating the knowledge graph is presented, along with preliminary steps to generate hypotheses from it. Some initial observations on the entity alignment and hypothesis generation tasks are also shared. | |
Feb 23 | Victor Lin and Kevin Chow | Graph-based Anomaly Detection for Insider Threat An insider threat is a malicious threat to an organization that comes from people within the organization, such as employees, former employees, contractors or business associates, who have inside information concerning the organization's security practices, data and computer systems. Because insiders may attempt to steal property or information for personal gain, or to benefit another organization or country, the insiders committing threats may be related to each other in some way, such as coming from the same foreign country, working in the same team, having the same ex-employer, or previously serving in the same organization. Being able to utilize this kind of additional relational information makes more precise insider threat detection an interesting research topic. In this project, we focus on the CERT dataset by building upon existing attribute-based threat detection, applying graph-based models to improve our detection, and ultimately combining both for the most reliable anomaly detection. We will talk about our progress so far. | |
Feb 16 | Caleb Bryant | Medical Dialogue Systems Historically, the use of dialogue systems in the medical domain has been fairly limited. While systems have been proposed in the past (e.g. for conversing about medication), the difficulty of designing large and robust dialogue systems has prevented widespread adoption by clinicians. However, the recent rise of virtual digital assistants could help drive renewed attention to medical dialogue systems. In this talk, we explore different models and applications for dialogue systems. We examine past and current trends in the area of dialogue system research, such as FSA, Information State Update, plan-based, POMDP, and neural network dialogue systems. In particular, we see how the target applications and design decisions of dialogue systems have interacted. Focusing on work in the medical domain, we look at a number of previous medical dialogue systems and compare them to present work on Rose. Finally, we discuss the current state of medical dialogue systems as well as possible future directions. Second talk: Ali Sadeghian. Title: Temporal Reasoning Over Event Knowledge Graphs. Abstract: Many advances in the computer science field, such as semantic search, recommendation systems, question-answering, and natural language processing, are carried out with the help of large-scale knowledge bases (e.g., YAGO, NELL, DBPedia). However, many of these knowledge bases are static representations of knowledge and do not model time on its own dimension or do it only for a small portion of the graph. In contrast, projects such as GDELT and ICEWS have constructed large temporally annotated knowledge graphs of events collected from news hubs. In this paper, we study the problem of reasoning over such graphs. In particular, we transpose two well-known techniques from knowledge base reasoning to utilize the temporal dimension: rule mining and graph embeddings. We mine temporally constrained first-order inference rules using the state-of-the-art relational knowledge base model. We interpret the learned rules as event sequence rules. We also use simple embedding methods to jointly learn a universal representation of entities and time-specific representations of the knowledge graph. We present the first set of temporal rules mined over event knowledge graphs and preliminary results on using the learned embeddings in the temporal link prediction task. | |
Feb 02 | Sarvesh Soni | Patient Question Answering from Electronic Health Records using Semantic Parsing (III) In this presentation, I will talk about my thesis work progress through the winter break. Electronic Health Records (EHR) are a great source for answering questions related to patient data. The main focus of my thesis work is to convert the patient questions into logical forms using questions and their corresponding answers from EHR. These logical forms are transformed to Fast Healthcare Interoperability Resources (FHIR) query for retrieving the answer(s) from EHR. I will briefly talk about Semantic Parsing and some related works in this domain of Patient Question Answering using Semantic Parsing. Then, I will explain the various steps of my thesis work and talk about my progress during the winter break. | |
Jan 26 | Xiaofeng Zhou | Query Processing and Incremental Learning over Knowledge Bases Knowledge bases are becoming increasingly important in structuring and representing information from the web. Meanwhile, web-scale information poses significant scalability and quality challenges to knowledge base systems. To address these challenges, we develop a probabilistic knowledge base system, ARCHIMEDESONE, by scaling up the knowledge expansion and statistical inference algorithms. ARCHIMEDESONE supports knowledge expansion by applying inference rules in batches using relational operations, and query-driven inference by focusing computation on the query facts in a unified system. Today's knowledge bases are mostly continuously growing despite their large sizes. Much research effort has been put into mining inference rules from the knowledge bases, yet few works focus on the incremental aspect of those web-scale knowledge bases. We propose a parallel incremental rule mining framework based on the relational model that applies updates to large knowledge bases, propose an alternative metric that reduces computation complexity without compromising rule quality, and apply multiple optimization techniques that reduce runtime by more than 2 orders of magnitude. Experiments show that our approach can scale to web-scale knowledge bases efficiently and can easily save over 90% of the time compared to the state-of-the-art batch rule mining system. To the best of our knowledge, our incremental rule mining system is the first that handles updates to web-scale knowledge bases efficiently. | |
Jan 19 | Miguel Rodriguez | AIDA project summary and plans This talk will cover our plans for AIDA TA3 and summary of the kickoff meeting. |