Fall 2023
Date | Speaker | Title | Slides |
---|---|---|---|
Sep 21 | Jayetri Bardhan | DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries
This paper develops the first question answering dataset (DrugEHRQA) containing question-answer pairs from both structured
tables and unstructured notes from a publicly available Electronic Health Record (EHR). EHRs contain patient records, stored
in structured tables and unstructured clinical notes. The information in structured and unstructured EHRs is not strictly
disjoint: information may be duplicated, contradictory, or provide additional context between these sources. Our dataset has
medication-related queries, containing over 70,000 question-answer pairs. To provide a baseline model and help analyze the
dataset, we have used a simple model (MultimodalEHRQA) which uses the predictions of a modality selection network to
choose between EHR tables and clinical notes to answer the questions. This is used to direct the questions to the table-based or
text-based state-of-the-art QA model. In order to address the problem arising from complex, nested queries, this is the first time
Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (RAT-SQL) has been used to test the structure of query
templates in EHR data. Our goal is to provide a benchmark dataset for multi-modal QA systems, and to open up new avenues of
research in improving question answering over EHR structured data by using context from unstructured clinical data.
|
|
Sep 21 | Haodi Ma | KGSimple: Can Knowledge Graphs Simplify Text?
Knowledge Graph (KG)-to-Text Generation has seen recent improvements in generating fluent and informative sentences which
describe a given KG. As KGs are widespread across multiple domains and contain important entity-relation information, and as
text simplification aims to reduce the complexity of a text while
preserving the meaning of the original text, we propose KGSimple, a novel approach to unsupervised text simplification which
infuses KG-established techniques in order to construct a simplified
KG path and generate a concise text which preserves the original
input’s meaning. Through an iterative and sampling KG-first approach, our model is capable of simplifying text when starting from
a KG by learning to keep important information while harnessing
KG-to-text generation to output fluent and descriptive sentences.
We evaluate various settings of the KGSimple model on currently available KG-to-text datasets, demonstrating its effectiveness compared to unsupervised text simplification models which start with
a given complex text.
|
|
Sep 07 | Zelin Xu | Spatial Knowledge-Infused Hierarchical Learning: An Application in Flood Mapping on Earth Imagery
|
|
Sep 07 | Wenchong He | Physics-guided AI for Spatiotemporal Data in Scientific Applications
|
|
Aug 24 | Yang Bai | MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering
Check-worthy claim detection aims at providing plausible misinformation to the downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Many efforts have been put into how to identify check-worthy claims from a small scale of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering (QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer, then classify them into three categories: "Supporting", "Refuting", and "Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA |
|
Aug 24 | Alexander Rajender Webber | Figure Identification in Mixtec
|
Fall 2022
Date | Speaker | Title | Slides |
---|---|---|---|
Oct 21 | I Harmon |
Neuro-symbolics in Remote Sensing Species Classification
Forests cover 31% of the Earth's surface and play a vital role in sustaining life on the planet. Remote sensing is used to efficiently monitor forest health at scale. Many forest parameters, such as biomass, can be estimated more accurately when species are known or individual tree crowns can be counted. Therefore, crown delineation and species classification are important to remote sensing forest parameter estimation. However, creating robust machine learning classifiers to delineate crowns and classify species is a difficult task. In this talk we discuss how neuro-symbolics can be leveraged to improve the performance of remote sensing based tree species classifiers. |
|
Oct 6 | Anthony Colas | GAP: A Graph-aware Language Model Framework for Knowledge Graph-to-Text Generation Recent improvements in KG-to-text generation are due to additional auxiliary pre-trained tasks designed to give the fine-tune task a boost in performance. These tasks require extensive computational resources while only suggesting marginal improvements. Here, we demonstrate that by fusing graph-aware elements into existing pre-trained language models, we are able to outperform state-of-the-art models and close the gap imposed by additional pre-train tasks. We do so by proposing a mask structure to capture neighborhood information and a novel type encoder that adds a bias to the graph-attention weights depending on the connection type. Experiments on two KG-to-text benchmark datasets show these models to be superior in quality while involving fewer parameters and no additional pre-trained tasks. By formulating the problem as a framework, we can interchange the various proposed components and begin interpreting KG-to-text generative models based on the topological and type information found in a graph. | |
Sep 23 | Yifan Wang |
DBSim - Extensible Database Simulator for Fast Prototyping In-Database Algorithms
In-database analytics has become one of the most studied topics in the data science community because of its significance in reducing the gap between the management and the analytics of data, which can save much of the time spent exchanging data between databases and external analytic tools. But implementing in-database algorithms inside mainstream databases without first verifying the ideas is risky and may result in a significant waste of time. In this talk we present a testbed, DBSim, which simulates a relational database and allows users to easily extend it based on their needs, so that they can quickly prototype their ideas and estimate performance before diving into the large codebase of a real RDBMS. |
Fall 2020
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 11 | DSR Lab | Neuro-Symbolics Applications Continuation of neuro-symbolics applications. | |
Dec 4 | DSR Lab | Neuro-Symbolics Applications Neuro-symbolics is a new AI paradigm that combines neural models with symbolic reasoning. It's a promising technology that is edging AI capabilities closer to human levels. Today we will look at some of its applications. | |
Nov 20 | Yifan Wang | SystemV: A Generic Embedding-based Platform for Efficient Similarity Query Processing Similarity queries are becoming increasingly important in the age of big data where the data objects do not possess any natural order. Examples include large collections of images, text and multimodal data objects. Embeddings are widely used in processing of similarity queries for measuring semantic similarity of data. However, current database systems cannot adequately support embedding based similarity query processing. Furthermore, multiple types of embeddings may be constructed over data objects that contain multiple modalities, such as text and image, where similarity-based join queries are required but the traditional join operators do not work well on these queries. So we propose SystemV, a generic platform on top of Spark, providing scalable end-to-end solutions for efficient embedding based similarity query processing, including search on single embedding space and join on multiple embedding spaces. The goal of SystemV is to democratize AI by facilitating the use of pre-trained embeddings (e.g., learned from large amounts of data such as Wikipedia and ImageNet). Using such pre-trained models, users can focus more on their core application rather than designing the retrieval models. | |
Nov 13 | Ali Sadeghian | Relational learning and reasoning on knowledge graphs (My PhD defense dry run) In this talk we will briefly discuss knowledge graphs and the current well-studied reasoning methods on KGs. We then dive deeper into temporal knowledge graphs and temporal reasoning. This is an important topic because, despite the importance and abundance of temporal knowledge graphs, most current research has focused on reasoning on static graphs. We study the challenging problem of inference over temporal knowledge graphs, in particular the task of temporal link prediction. In general, this is a difficult task due to data non-stationarity, data heterogeneity, and its complex temporal dependencies. We propose Chronological Rotation (ChronoR), a novel model for learning representations for entities, relations, and time. Learning dense representations is frequently used as an efficient and versatile method to perform reasoning on knowledge graphs. The proposed model learns a k-dimensional rotation transformation for every relation-time pair such that each fact’s head entity, once transformed using the rotation, is close to its tail. By using high dimensional rotation as its transformation operator, ChronoR captures rich interaction between the temporal and multi-relational characteristics of a Temporal Knowledge Graph. Experimentally, we show that ChronoR is able to outperform state-of-the-art methods on the benchmark datasets for temporal knowledge graph link prediction. | |
Nov 6 | Haodi Ma | Survey on neural symbolic learning Motivated by humans' ability to learn visual concepts by jointly understanding vision and language, a recent work proposes the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, the model learns by simply looking at images and reading paired questions and answers. This work has some similarity with previous works from our lab focusing on information extraction with multimodal approaches. In this talk I will cover these works and discuss the similarities and distinctions between them for possible future research. | |
Oct 30 | Anthony Colas | Few-Shot Learning Deep learning has achieved state-of-the-art results on many tasks in various areas, including image-to-image translation and image classification in computer vision and summarization and machine translation in natural language processing (NLP). However, these types of architectures still require large amounts of data, which can be difficult to obtain or simply not available in many domains. Thus, few-shot learning can help solve these problems in domains where large data are not as copious. Few-shot learning enables one to train a model with smaller amounts of data, while also teaching the model to learn how to learn (meta-learning). These models have been shown to perform fairly well in their respective objectives. In this talk, we will introduce few-shot learning as well as the related meta- and one-shot learning. We will look at a few example architectures used in the few-shot learning task, as well as commonly used datasets. Next, we will examine how few-shot learning has been more prevalently used in computer vision for image classification and more recently, image-to-image translation. From there, we can transition into how few-shot learning has been more recently applied in NLP to tasks such as summarization and machine translation. Finally, we will conclude with how such models can be adapted for the graph-to-text task. | |
Oct 23 | Jayetri Bardhan | Question Answering on Electronic Health Records An Electronic Health Record (EHR) is an electronic version of a patient’s medical history. The state-of-the-art model, ‘Text to SQL Generation for Question Answering on Electronic Medical Records’, has some major shortcomings. The accuracy of this question answering system on the MIMIC-III database is very poor. It is very challenging to obtain good accuracy on multi-relational health records. In this project, Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (RAT-SQL) has been used for question answering over structured medical records. The talk will also cover the annotation of the drug dataset for question answering over structured and unstructured clinical records. Furthermore, this QA system could aid in applications like sepsis phenotype detection and treatment prediction. | |
Oct 16 | Yang Bai | AIDA Progress in the Summer AIDA is a DARPA-funded project that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. The Data Science Research (DSR) lab of UF is a participant in this project. We are responsible for generating hypotheses with knowledge graphs that are generated by upstream teams. HypoGator is the hypothesis generation system designed by the DSR lab for the AIDA project. It uses a search-score-rank approach to find alternative answers to complex queries over the automatically extracted event-driven multimedia knowledge graph. In this talk, I will describe the major changes and improvements made to HypoGator last summer. | |
Oct 9 | Ali Sadeghian | N/A | |
Sep 25 | I Harmon | A Survey of Weak Supervision and its Applications Machine learning models are becoming increasingly powerful but require larger datasets for optimal performance. One of the biggest problems in the model creation pipeline is the bottleneck of dataset creation. Weak supervision is one solution to this problem. We will explore exactly what weak supervision is, its taxonomy, and its applicability. We'll look at frameworks that allow for the rapid creation of datasets using weak supervision. We'll also look at some applications of weak supervision. |
Spring 2020
Date | Speaker | Title | Slides |
---|---|---|---|
Mar 13 | Ali Sadeghian | Relational learning and reasoning on knowledge graphs In this talk, we will cover two of the main reasoning methods on knowledge graphs: rule learning and embedding models. We will briefly cover how KGs are constructed and why reasoning on KGs is crucial for both completing the KG and using them in applications. We will then discuss differentiable rule mining and the benefits of rules as an interpretable way of reasoning. Finally, we will discuss embedding-based methods on KGs and dive deeper into recent advances of these latent models in temporal KG completion. | |
Feb 14 | Yang Bai | HypoGator for the AIDA Project In this talk, I’m going to give an introduction to our hypothesis generation system, HypoGator, which is designed for the AIDA project. It uses a search-score-rank approach to find alternative answers to complex queries over the automatically extracted event-driven multimedia knowledge graph. I will go through some major features of the pipeline in detail and demonstrate the results with our specially developed visualization tool. Finally, we will discuss directions for future improvement and task extensions of the system. AIDA is a DARPA-funded project that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. The Data Science Research lab of UF is a participant in this project. We are responsible for generating hypotheses with knowledge graphs that are generated by upstream teams. | |
Feb 07 | Anthony Colas | EventQA (Complex-Question Narration) Knowledge bases (KBs) present large amounts of data in a structured manner. Because of the amount of information in them, there has been much work on querying these structured repositories of data using natural language. This work extends to complex question-answering (complexQA) where the complexity of question increases by mentioning multiple entities, relations, constraints, or all three. In this talk, we discuss the problems with current complexQA systems over knowledge graphs and motivate our work in KB-QA narration. Namely, we differentiate our problem from conventional KB-QA by observing questions that cannot be solved well given the current approaches. We first introduce EventKG in order to motivate our work by giving examples of the complexQA narrative task over this dataset. To solve this new problem, we review some work in both complexQA and graph-to-text and show how our task is a combination of these two. Next, we discuss our baseline approach in generating answer narratives. Finally, we present our approach to building a novel dataset for the complex question-narratives task and discuss possible models that generate narratives from questions and KB components. This talk will serve as an introduction to works in both the KB-QA and graph-to-text translation tasks and as our roadmap to solving the new problem of KB-QA narration. | |
Jan 31 | Jayetri Bardhan | Question Answering Systems using Electronic Health Records This talk will cover the scope of the project, Question Answering Systems using Electronic Health Records. A brief introduction will be given to the different types of electronic health records, followed by the related research work that exists in this area. A detailed description of the different milestones for this project will be presented along with its different challenges. Also, a short overview will be given of the paper ‘BERT-based Ranking for Biomedical Entity Normalization’. In this paper, the authors propose an entity normalization architecture based on fine-tuning the pre-trained BERT / BioBERT / ClinicalBERT models and conduct extensive experiments to evaluate the effectiveness of the pre-trained models for biomedical entity normalization using three different types of datasets. The experimental results show that the best fine-tuned models consistently outperformed previous methods and advanced the state-of-the-art for biomedical entity normalization, with up to a 1.17% increase in accuracy. |
Fall 2019
Date | Speaker | Title | Slides |
---|---|---|---|
Nov 22 | Jayetri Bardhan | Survey on Question Answering Systems over Electronic Health Records (EHRs) The widespread adoption of electronic health records (EHRs) has enabled the secondary use of EHR data for clinical research and healthcare delivery. An Electronic Health Record (EHR) is an electronic version of a patient’s medical history and may include demographics, progress notes, problems, medications, vital signs, past medical history, and laboratory data. Natural language processing (NLP) techniques have been used over the years to develop question answering systems over structured as well as unstructured electronic health records. The following two papers will be presented. The paper ‘A Translate-Edit Model for Natural Language Question to SQL Query Generation on Multi-relational Healthcare Data’ develops a deep learning-based approach that can translate a natural language question on multi-relational EHR data into its corresponding SQL query, which is referred to as a Question-to-SQL generation task. To address the challenge of generating queries on multi-relational databases from natural language questions, the TRanslate-Edit model for Question-to-SQL query (TREQS) is proposed. The paper ‘CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using the OMOP Common Data Model’ presents a cohort retrieval system that can execute textual cohort selection queries on both structured and unstructured EHR data. CREATE is a proof-of-concept system that leverages a combination of structured queries and IR techniques on NLP results to improve cohort retrieval performance while adopting the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to enhance model portability. | |
Nov 8 | Yang Bai | HypoGator: Hypotheses Generation and Ranking over Event-Driven Multimedia Knowledge Graph This presentation is going to be a dry run for the TAC workshop next week where we are going to show our latest progress on the AIDA project during the M18 evaluation phase. It is going to include a brief introduction to our old pipeline and a detailed introduction to the improvements in our new pipeline, as well as a demo of our new visualization tool which is quite helpful in result analysis. AIDA is a DARPA-funded program that aims at automatically ingesting web documents and transforming them into a semantic space representation (Knowledge Graph) that analysts can use to query about uncertain situations and obtain a variety of related hypotheses. As a key part of the AIDA project, we are responsible for generating hypotheses with the knowledge graphs that are generated by upstream teams. | |
Nov 1 | Miguel Rodriguez | REASONING OVER MULTI-SOURCE AND DYNAMIC KNOWLEDGE GRAPHS (Dissertation Dry Run) Innovative approaches to Information Extraction (IE) have enabled the creation of large Knowledge Graphs (KGs) (e.g., YAGO, NELL, DBPedia, Wikidata) and Dynamic Knowledge Graphs (e.g., ICEWS, GDELT). These knowledge graphs have become an increasingly popular domain knowledge representation used in semantic search, recommendation systems, question-answering, natural language processing, etc. Despite the increased efforts, KGs are still predominantly incomplete and contain a high degree of uncertainty. In this dissertation, we study two approaches to fill in the gaps in existing knowledge graphs: 1) Reasoning over facts from different knowledge graphs and/or extractors to increase coverage and automatically evaluate the correctness in the aggregate; this task is known as Knowledge Fusion (KF). We propose CM Fusion, an ensemble approach that combines supervised learning and unsupervised consensus. We also propose SigmaKB, a query engine that uses CM Fusion and user feedback to integrate and improve query results compared to single KGs. 2) Reasoning over existing facts to determine possibly missing ones; moreover, we propose to do this over dynamic knowledge graphs. We propose to learn observed feature models over dynamic knowledge graphs in the form of sequential rules and use them to infer missing facts and forecast events in a streaming fashion. CM Fusion uses the Consensus Maximization algorithm to ensemble supervised classifiers learned using information extractors that can be assessed a priori and generates consensus with black box extractors that cannot be assessed beforehand. Consensus Maximization Fusion is able to promote high-quality facts and eliminate incorrect ones. We demonstrate the effectiveness of our system on the NIST Slot Filler Validation evaluation, which seeks to evaluate and aggregate multiple independent information extractors.
Our system achieved the highest F1 score relative to other system submissions. We mine temporally constrained first-order inference rules over dynamic knowledge graphs. The learned rules are the first set of temporal rules mined over dynamic knowledge graphs. The algorithm we propose uses adjusted definitions of support, confidence and head coverage metrics that consider minimal occurrences and time-windowing constraints. We ground the learned sequential rules over dynamic knowledge graphs in two tasks: temporal link prediction and streaming link prediction. Our experiments show that inference using sequential rules can outperform representation learning approaches while at the same time yielding interpretable patterns. These patterns can be further refined by experts or used by analysts in QA tasks. Finally, we show that rules and embeddings complement each other and propose to cast the ensemble as a ranking aggregation problem. We use Reciprocal Rank Fusion, an unsupervised rank aggregation model. | |
Oct 25 | Haodi Ma | Answering Complex Questions by Joining Multi-Doc Evidence with Quasi Knowledge Graph In this talk, I will first discuss the pipeline of the authors' model and then go into the details of each stage in order. The model has two main steps: graph construction and the graph algorithm. Its main contribution is casting the QA task as a Group Steiner Tree (GST) problem. I will then cover the evaluation of the model, including experimental results, and finally compare the QUEST and AIDA tasks and describe what I have accomplished so far. | |
Oct 18 | Ali Sadeghian | Embed All The Things! In this talk, we will discuss StarSpace, a method that can be used to embed multiple types of objects and solve a variety of tasks such as labeling tasks, ranking tasks, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. | |
Oct 11 | Yang Bai | Knowledge Graph Embeddings Knowledge graph (KG) embedding embeds the components of a KG, including entities and relations, into continuous vector spaces, so as to simplify manipulation while preserving the inherent structure of the KG. It can benefit a variety of downstream tasks such as KG completion and relation extraction and hence has quickly gained massive attention. In this presentation, I’m going to first give a thorough introduction to classical KG embedding methods: the translational distance models. Then, I’m going to introduce one of the latest works in this field, ‘Knowledge graph embedding via reasoning over entities, relations, and text’, which combines the classic translational distance models with an LSTM neural network to extract both latent semantic features and observable structure patterns in a unified knowledge graph. | |
Sep 27 | Jayetri Bardhan | Template-Based Complex Query The following two papers will be presented: ‘Automated template generation for question answering over knowledge graphs’ and ‘Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases’. They both emphasize different template-based approaches to handling complex question answering over knowledge graphs. Templates are an important asset for question answering over knowledge graphs, simplifying the semantic parsing of input utterances and generating structured queries for interpretable answers. The paper ‘Automated template generation for question answering over knowledge graphs’ presents QUINT, a system that automatically learns utterance-query templates solely from user questions paired with their answers. Additionally, QUINT uses basic templates to answer structurally complex compositional questions without observing such questions during training. The paper ‘Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases’ presents NEQA, a continuous learning paradigm for KB-QA. Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases. The existing methods require access to a large annotated training set that is not always readily available and fail on questions from before-unseen domains. In this paper, NEQA, when offline, automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. If a new question cannot be satisfactorily answered via templates, user feedback is used on the output of a semantic similarity function to learn a new template based on the new question. | |
Sep 20 | Miguel Rodriguez | HypoGator - M18 Eval Knowledge Graphs are widely used to represent knowledge in part because the open-world assumption, uncertainties, and noise can all coexist in them. Unfortunately, the representation flexibility of KGs is not matched by the query languages or the query engines that run on them. HypoGator is the DSR lab's hypothesis generation system in the framework of the AIDA program. HypoGator answers queries over KGs constructed by combining and aligning document-level knowledge elements and provides a ranked list of possible answers. In this talk, we describe the major changes made to the system for the summer evaluation cycle and show a KG narration to better visualize the generated hypotheses. | |
Sep 13 | Yifan Wang | AI-Supported DBMS We are going to present two papers: CognitiveDB and RankSQL. They are relevant to supporting AI in DBMS and efficient execution of top-k queries in DBMS. These are two popular topics currently in the database community. CognitiveDB is an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of their design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter-/intra-column relationships between database tokens of different types. They seamlessly integrate the word embedding model into existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. This system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice. RankSQL is a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS), by extending relational algebra and query optimization. They aim to support ranking as a first-class database construct. So they extend relational algebra by proposing a rank-relational model to capture the ranking property and introducing new and extended operators to support ranking as a first-class construct. 
Enabled by the extended algebra, they present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. Their approach can significantly reduce the cost of ranking operation in DBMS. | |
Sep 6 | Ali Sadeghian | Hotel2vec: Learning Attribute-Aware Hotel Embeddings with Self-Supervision We propose a neural network architecture for learning vector representations of hotels. Unlike previous works, which typically only use user click information for learning item embeddings, we propose a framework that combines several sources of data, including user clicks, hotel attributes (e.g., property type, star rating, average user rating), amenity information (e.g., the hotel has free Wi-Fi or free breakfast), and geographic information. During model training, a joint embedding is learned from all of the above information. We show that including structured attributes about hotels enables us to make better predictions in a downstream task than when we rely exclusively on click data. We train our embedding model on more than 40 million user click sessions from a leading online travel platform and learn embeddings for more than one million hotels. Our final learned embeddings integrate distinct sub-embeddings for user clicks, hotel attributes, and geographic information, providing an interpretable representation that can be used flexibly depending on the application. We show empirically that our model generates high-quality representations that boost the performance of a hotel recommendation system in addition to other applications. An important advantage of the proposed neural model is that it addresses the cold-start problem for hotels with insufficient historical click information by incorporating additional hotel attributes which are available for all hotels. | |
Aug 30 | Anthony Colas | Natural Language to Query Language Users often converse with chatbots in order to ask task-specific questions. Much of the information is only accessible in databases, which requires one to know a query language. Thus, when building a conversational question answering system, it is crucial to convert natural language questions into a query language so that users can obtain answers to their questions. We approach this problem in two different domains and utilize two types of query languages: SQL and SPARQL. Furthermore, we develop a novel data collection methodology to generate synthetic data for use on natural language to query language tasks. With this methodology, one can quickly and efficiently generate data to use on their models for their specific domain needs. We show promising results for our synthetically generated datasets and present further steps needed for future work. |
Spring 2019
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 5 | Ali Sadeghian Giacomo Bergami | emrQA: A Large Corpus for Question Answering on Electronic Medical Records Question answering systems remain relatively unexplored in clinical domains. This paper proposes a novel methodology to generate domain-specific large-scale question answering datasets and demonstrates an instance of this methodology in creating a large-scale QA dataset for electronic medical records. The method creates question and logical form templates through expert annotation and then uses existing annotations in clinical notes to generate questions, logical forms, and answers. The dataset’s learning potential is explored by training baseline models for question to logical form and question to answer mapping. | |
Mar 28 | Giacomo Bergami | Alternatives for generating Alternative Hypotheses over Knowledge graph In this presentation, we will provide an introduction to graph search-based algorithms in comparison with a graph query answering algorithm. While the former focus on getting paths from one entry point, the latter returns a subgraph matching the query. We will show similarities and differences between the two. After doing this, we show the output of the former evaluation of HypoGator and that of a straightforward implementation of SAMA (a distributed approximate graph matching algorithm). Different metrics will be used to evaluate these two alternative algorithms for generating alternative hypotheses answering questions over a probabilistic knowledge graph. | |
Mar 22 | Miguel Rodriguez | SERM - Sequence Rule Mining from Temporal Knowledge Bases Research efforts in reasoning over large scale Knowledge Bases (KBs) such as Freebase, YAGO or NELL have largely focused on static representations of knowledge. The recent availability of time-annotated KB facts in YAGO and Wikidata and time-stamped Event Knowledge Bases (EKBs) such as GDELT and ICEWS has ignited efforts in developing models that also reason over the temporal dimension. The majority of work in this research area is centered on representation learning for the link prediction task. In contrast, in this paper, we study the problem of learning first-order inference rules where the rule atoms occur in sequential order. In particular, we propose an algorithm to mine sequence rules from temporal knowledge bases and interestingness metrics that take into account time windowing constraints. Our experiments show that interpretable patterns can be mined from ICEWS and further used by human analysts and improved by experts. | |
Mar 1 | Rahul Sengupta Debdeep Basu | Fusing differentiable Rule-Learning and Embeddings for Knowledge Base Completion During the course of this project, we investigate the use of deep learning applied to learn probabilistic first-order logical rules in order to improve embeddings, for the task of knowledge base completion. In particular, we attempt to fuse the concepts of the paper “Differentiable Learning of Logical Rules for Knowledge Base Reasoning” with popular embedding models such as DISTMULT and TransE. We hope that a joint model will perform better overall than the individual models, with each compensating for the other's weaknesses. Evaluation Framework for Hypothesis Generation Knowledge Graphs are generated from various signals (video, audio, text, etc.) for a particular event. These small knowledge graphs are then merged into a big graph which contains the accumulated information about an event from various streams. Hypothesis generation produces a list of subsets of this merged graph. The output of hypothesis generation signifies the possible causes for the occurrence of a particular event. The evaluation framework is a web application with which users can upload their datasets, visualize statistics in that dataset and play with an interactive knowledge graph exploration tool with which they will be able to query the graph (in natural language), visualize the generated list of hypotheses and proceed to unfold the graph as and when required. This can be used as a debugging tool for HypoGator. | |
Feb 15 | Anthony Colas Hyun Choi | Implementing Question Understanding Via Template Decomposition There has been a lot of work on simple question-answering over structured knowledge bases. That is, answering questions which contain only one relation between two entities. Though there has been a lot of development in simple question-answering over structured data by using templates, syntactic parsers, or language models, solving complex questions is still an ongoing issue. In this talk, we present our progress towards implementing the approach used by Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition. We show experimental results thus far on the WebQuestions dataset (simple questions) and describe some of the challenges encountered during the process. We also discuss future steps and some statistics of the natural language pattern question templates. emrQA: A Large Corpus for Question Answering on Electronic Medical Records Question answering systems remain relatively unexplored in clinical domains. This paper proposes a novel methodology to generate domain-specific large-scale question answering datasets and demonstrates an instance of this methodology in creating a large-scale QA dataset for electronic medical records. The method utilizes existing expert annotations on clinical notes for various NLP tasks from i2b2 datasets. The corpus contains question templates, logical forms, and question-answer pairs. The dataset’s learning potential is explored by training baseline models for question to logical form and question to answer mapping. | |
Feb 8 | Yang Bai | Approximate Subgraph Matching It is increasingly common to find real-life data represented as graphs of labeled, heterogeneous entities and relations. To query these graphs, one often needs to identify the matches of a given query graph in a (typically large) graph database. Due to noise and the lack of fixed schema in many real-life graph databases, the query graph can substantially differ from its matches in the graph database in both structure and node/edge labels, thus bringing challenges to graph querying tasks. To tackle this problem, Approximate Subgraph Matching has been proposed to help users better query graph databases. Today, I am going to give a talk on this scenario based on two representative models: neighborhood-based approximate graph matching and path alignment-based approximate graph matching. | |
Feb 1 | Ali Sadeghian | DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs Logical rules play an important role in capturing interpretable patterns present in knowledge bases. An extensive body of research has focused on automatically discovering first-order logical rules by exploring a discrete search space for the rule structure and then applying statistical measures to assess the correctness of each rule. In this talk, we consider a new parametric approach for mining logical rules from knowledge graphs. We prove the limitations of the state-of-the-art differentiable technique for mining logical rules and propose a method that (1) has no theoretical restrictions on the structure of meaningful logic rules, (2) is adaptive to the data, and (3) avoids over-parametrization while finding meaningful signals in the data. We show that our method outperforms existing rule mining methods in the quality of the mined rules over benchmark datasets as well as on the link prediction task. | |
Jan 25 | Yifan Wang | Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities A cognitive Database is an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of its design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter-/intra-column relationships between database tokens of different types. For each database token, the model includes a vector that encodes contextual semantic relationships. It seamlessly integrates the word embedding model into existing SQL query infrastructure and uses it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. This paper demonstrates unique capabilities of Cognitive Databases using an Apache Spark based prototype to execute inductive reasoning CI queries over a multi-modal database containing text and images. The authors believe their first-of-a-kind system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice. | |
Jan 18 | Anthony Colas | Instructional QA In this talk, I will go over my work on a question answering problem involving tutorial videos. Previous works have focused on generating short responses or factoids as answers to a user’s question. Here, the answer is in the form of a span (from a video segment), and answers are usually multiple sentences long. To model and accomplish this task we use a dataset containing the video transcript, question, and answer. To conclude, we apply some baseline models to our task and present some future directions for the task. | |
Jan 11 | Giacomo Bergami | Schema Independent Relational Learning Learning novel relations from relational databases is an important problem with many applications. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same database may be represented under different schemas for various reasons, such as data quality, efficiency and usability. The output of current relational learning algorithms tends to vary quite substantially over the choice of schema. This variation complicates their off-the-shelf application. We introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We show that current algorithms are not schema independent. We propose Castor, a relational learning algorithm that achieves schema independence by leveraging data dependencies. |
Fall 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 7 | Giacomo Bergami | On Inconsistency Detection over Alternative Hypotheses This talk is going to focus on my previous work on Inconsistency Detection for the M9 evaluation of the AIDA project. After outlining the general AIDA project scenario and motivating the hypothesis generation assumption, I will give a brief introduction on what should be considered as an inconsistency, and which metrics have been already developed in current literature to summarise such inconsistency information. I also present the difference between the standard definition of MultiValued Dependencies with "equality" and their generalisation with "relatedness" and "is-a" relationships. I provide the benchmarks and the quality measures for both approaches and outline some future work directions. | |
Nov 30 | Ali Sadeghian | Deep learning for/and with Logical rules There are multiple methods for inference and reasoning over Knowledge Bases. Methods like mining Horn rules are useful because they can be understood by humans (interpretable) and, unlike embedding-based methods, can be applied to entities not seen before. In this talk, we will discuss various methods of rule mining. This learning problem is difficult because it requires searching a very large space. We briefly overview search-based and embedding-based methods of rule mining and their results. We then focus on a differentiable way of learning rules that learns parameters as well as the structure in a continuous space. We show how a neural control system is designed to learn to compose these operations. We also show that combining logical rules with deep neural networks can enhance their performance in several domains. | |
Nov 16 | Anthony Colas, Caleb Bryant | Complex Question Answering over Knowledge Bases There has been a lot of work on simple question-answering over structured knowledge bases. That is, questions which contain only one relation between two entities. Though there has been a lot of development in simple question-answering over structured data--by using templates, syntactic parsers, or language models--solving complex questions is still an ongoing issue. Recent work has delved into complex question-answering where there can be multiple relationships between more than two entities. For example, the question "Where was the wife of the US president born?" is a complex question that can be divided into multiple simple questions using multiple relations. In this talk, we look into recent research which has dealt with complex questions (those with more than one relationship) by using template-based methods in order to formulate a formal query from a natural language utterance. Another method which we will examine is the state-transition approach. This translates a natural language question into a semantic query graph to find answers in a KB. We will also briefly go over simple question cases in order to build a basis from which the complex questions are formulated into formal queries. This talk serves as a survey of the work on complex question answering over KBs and will motivate future work in improving the current state-of-the-art and developing new methods. | |
Nov 09 | Hyun Choi | Generalized joint attribute model to learn population dynamics in ecosystems The Generalized Joint Attribute Model (GJAM) is used to study ecological systems. The GJAM model is a joint species distribution model that can accommodate the multifarious data in ecological datasets. The model determines interspecies relationships and environmental factors to make predictions of the species response. In this talk the main focus will be on the way GJAM predicts the population dynamics in the continental United States and Florida. The accuracy of the model with different scales is also examined to study the behavior of the model with different climate inputs. | |
Nov 09 | Sergio Marconi | Data Science in Ecology: real-world hard problems dealing with multifarious and/or limited data In the last decades ecology has increasingly become a data-intensive discipline, whose challenges inherently overlap with data science problems in a real-world complexity scenario. For example, in order to forecast how tree species distribution and productivity change in an uncertain future, we need to develop generalized methods to extract information from big data, account for the uncertainty in the data source, integrate different sources into cross-scale models, and formally link biogeochemical knowledge to observed patterns from small unbalanced training sets. These real-world challenges represent cutting-edge fundamental problems that the data science community has started recognizing. In this seminar we will address what our interdisciplinary group at UF has done to address such methodological and biological issues. First, we built a Data Science Evaluation Series aiming to predict species labels for each individual tree at a scale of thousands of hectares from remote sensing data. Second, we built a fully Bayesian hierarchical model (GJAM) on a dataset of millions of individual trees sampled from a coarse grid across the country, to identify rules for how those species are distributed. The final goal is to merge these products to understand how those rules change with scale. | |
Nov 01 | Xiaofeng Zhou | Efficient Conditional Rule Mining Over Knowledge Bases Present day web-scale knowledge bases (KBs) incorporate a substantial amount of information in a structured format. Availability of this readily machine-digestible data has made KBs a desirable resource for other applications. This has motivated many to explore learning from KBs. Embedding methods and learning inference rules are examples of such methods. Rules provide great inference power and are also easily understandable. Most recent work focuses only on normal rules (where all the predicates only support variables). We explore conditional inference rules, a class of logical rules which allow predicates with constants and have more expressive power. We show their effectiveness in knowledge expansion by comparing their number of predictions and precision with those of normal rules. However, due to the larger search space, mining conditional rules is much more time-consuming than mining normal rules. Current state-of-the-art rule mining methods adapted to mine conditional rules are infeasibly slow on medium/large KBs. To address this shortcoming, we introduce a scalable conditional rule mining algorithm. Our algorithm makes it possible to mine conditional rules from web-scale KBs. | |
Oct 26 | Sourav Dutta, Ali Sadeghian | HypoGator: Alternative Hypotheses Generation and Ranking We provide an overview of HypoGator, our hypotheses generation system. HypoGator relies on the KB constructed by combining and aligning document-level knowledge elements. Using multiple features and inconsistency detection methods, it extracts multiple coherent and consistent hypotheses. It finally returns a sorted list of the alternative hypotheses relevant to the query. In this talk, we describe the major components of our system and present our experiments and results. Initial analysis of our results provides insight into some of the things that can be done by TA2-TA1 to help improve the generation of the hypotheses. | |
Oct 19 | Anthony Colas, Caleb Bryant | Template Generation for Querying Relational Databases using Natural Language In most cases, questions are asked using natural language. Because of the amount of information stored in structured knowledge bases, natural language interfaces for databases have been developed which take in a natural language query, structure the natural language to query the knowledge base, and then give a natural language answer. Two main approaches for querying knowledge bases with natural language involve sequence-to-sequence and template-based generation. This talk will focus on state-of-the-art methods for generating templates in order to query relational/structured databases with natural language-based queries. The talk will cover how to generate the templates, select the relevant templates, rank the templates, and map queries to templates. Comparisons to deep learning approaches in question answering systems will also be made. The presentation will draw from various works cited in the Querying RDBMS Using Natural Language (Li, 2017) dissertation and serves as a survey of question answering systems on structured knowledge bases for easily querying structured systems and gathering interpretable results. | |
Oct 12 | Xiaofeng Zhou | Efficient Conditional Rule Mining Over Knowledge Bases Present day web-scale knowledge bases (KBs) incorporate a substantial amount of information in a structured format. Availability of this readily machine-digestible data has made KBs a desirable resource for other applications. This has motivated many to explore learning from KBs. Embedding methods and learning inference rules are examples of such methods. Rules provide great inference power and are also easily understandable. Most recent work focuses only on normal rules (where all the predicates only support variables). We explore conditional inference rules, a class of logical rules which allow predicates with constants and have more expressive power. We show their effectiveness in knowledge expansion by comparing their number of predictions and precision with those of normal rules. However, due to the larger search space, mining conditional rules is much more time-consuming than mining normal rules. Current state-of-the-art rule mining methods adapted to mine conditional rules are infeasibly slow on medium/large KBs. To address this shortcoming, we introduce a scalable conditional rule mining algorithm. Our algorithm makes it possible to mine conditional rules from web-scale KBs. | |
Oct 05 | Yang Bai | Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks With the revival of neural networks, many studies try to adapt powerful sequential neural models, i.e., Recurrent Neural Networks (RNN), to sequential recommendation. RNN-based networks encode historical interaction records into a hidden state vector. Although the state vector is able to encode sequential dependency, it still has limited representation power in capturing complicated user preference. It is difficult to capture fine-grained user preference from the interaction sequence. Furthermore, the latent vector representation is usually hard to understand and explain. To address these issues, in this paper, we propose a novel knowledge enhanced sequential recommender. Our model integrates the RNN-based networks with Key-Value Memory Network (KV-MN). We further incorporate knowledge base (KB) information to enhance the semantic representation of KV-MN. RNN-based models are good at capturing sequential user preference, while knowledge enhanced KV-MNs are good at capturing attribute-level user preference. By using a hybrid of RNNs and KV-MNs, it is expected to be endowed with both benefits from these two components. The sequential preference representation together with the attribute-level preference representation are combined as the final representation of user preference. With the incorporation of KB information, our model is also highly interpretable. To our knowledge, it is the first time that sequential recommender is integrated with external memories by leveraging large-scale KB information. | |
Sep 28 | Caleb Bryant | Narrating a Knowledge Base Narrating structured data with a paragraph of text remains a challenging problem. In this presentation, we examine recent efforts to tackle the challenge of natural language generation from Wikipedia tables, focusing on Wang et al.'s paper, Narrating a Knowledge Base. We begin with a brief review of seq2seq neural networks, next investigating how Wang et al. successfully applied multiple types of self-attention to increase the length and completeness of their Wikipedia summaries. Finally, we assess the successes and failures of their method and propose future research directions. | |
Sep 21 | Anthony Colas | Graph Embeddings: A Review on Graph Representations Recently, there has been a lot of interest in efficiently embedding graphs based on node similarity. In this talk, I will introduce what it means to embed graphs. The talk will also compare "shallow" approaches to "deep" approaches. I will also go over some of the state of the art deep approaches used to embed graph data, including their different methodologies and results. This includes Graph Convolutional Networks, GraphSAGE, and Graph Attention Networks. Finally, I will conclude by discussing some applications of embedding graphs using "deep" approaches. | |
Sep 14 | Giacomo Bergami | Query Answering over Probabilistic KB with Alternative Hypotheses Current literature faces three main problems with inconsistency detection. First, theoretical approaches treat all the entities that are not "syntactically" the same as being not equal and perform query answering with the repair-then-query approach. These constraints are too naïve for real-world data: data may contain different descriptions at different abstraction levels, and biased data sources will not allow us to discriminate which is the actual correct answer. Second, traditional FOL systems cannot cope with inconsistencies in the reasoning process, because the principle of explosion allows drawing any possible conclusion from inconsistent facts. Third, traditional SRL models using FOL may be affected by the same problem, thus allowing the inference of implausible hypotheses with near-zero scores. This presentation will focus on solving the first problem by using hierarchies to define inequality, and on how to detect inconsistencies in data using external validation. For the second problem, we will briefly introduce paraconsistent logics that refute the contradiction principle and can be exploited to reason with inconsistent data. We will leave the third problem for future work on SRL paraconsistent models. |
Summer 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Feb 16 | Dihong Gong | Scaling Integral Projection Models for Analyzing Size Demography In this talk, we study the integral projection model (IPM) for analyzing size demography of ecological systems. First, a basic version of IPM is introduced to model ecological dynamics. Further, the IPM is extended to include climate factors, such that it can be scaled to broader geographic areas. Finally, the effectiveness of IPM is investigated on the FIA dataset with focus on two example species from the year 2015 through 2010. | |
May 18 | Miguel Rodriguez | UF AIDA Summer Dev Plan Some months have passed since we started building on the AIDA project. Even though there are multiple things still in the process of being defined internally and program-wide, I will present the latest developments of the project: General AIDA architecture, UF-TA3 architecture, Inter-TA communication protocols and tools, available datasets and roadmap towards a GAIA dry run evaluation and M9 evaluation. |
Spring 2018
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 13 | Sourav Dutta | Mining Coherent hypothesis from knowledge graph (II) Knowledge bases are very useful to store complex structured and unstructured information. The rise of the internet has given rise to huge knowledge bases. However, the humongous amount of information we have creates a need to generate meaningful insights to understand connections between entities. In this work, the ICEWS data was modeled as a knowledge graph. The knowledge graph was further enhanced with entities from Wikidata. Since the two data sources have been curated differently, one of the major problems was aligning the entities to have a connected graph. The alignment was done by calculating similarity scores using n-grams and using rules mined from the data. Using the knowledge graph, a hypothesis has been considered as a weighted path between two entities. Here, the approach to creating the knowledge graph is presented, along with a comparison of different entity alignment techniques. Using the knowledge graph, hypotheses were generated and ranked for multiple scenarios. The observations and future work are also shared. | |
Apr 06 | Miguel Rodriguez | Mining Temporal Sequence Rules from Events. In this talk, I will discuss my ongoing research on mining sequential rules over event knowledge graphs, that is, graphs in which edges between entities have time annotations. Specifically, I will discuss the differences between mining factual knowledge bases and event knowledge bases, and present our current methods for mining sequential rules over event knowledge bases, scalability issues and interestingness measures. I will also show some preliminary results and examples of mined rules. | |
Mar 30 | Ali Sadeghian | Evaluation of automatically generated hypothesis. We are faced with an explosion of information (and misinformation) published through different mediums like blogs, YouTube videos, newspapers, audio podcasts, etc. Knowledge bases have proven to be a great way to store complex information in a semi-structured way. One can process and convert all the information from the mentioned mediums into a single KB. The big question now is how to generate “good” hypotheses about different ongoing scenarios from this KB. To answer this, one must first define what a “good” hypothesis is. In this presentation, we will assume that a KB is built from the mentioned different mediums and that there exists a system that generates hypotheses from the KB in the form of subgraphs. We will give a definition of a good hypothesis based on Grice’s maxims and propose different ways of evaluating systems that mine hypotheses from KBs. | |
Mar 16 | Sourav Dutta | Mining Coherent hypothesis from knowledge graph Knowledge bases are very useful to store complex structured and unstructured information. The rise of the internet has given rise to huge knowledge bases. However, the humongous amount of information we have creates a need to generate meaningful insights to understand connections between entities. In this work, the ICEWS data was modeled as a knowledge graph. The knowledge graph was further enhanced with entities from Wikidata. Since the two data sources have been curated differently, one of the major problems was aligning the entities to have a connected graph. The alignment was done by calculating similarity scores using n-grams and using rules mined from the data. Using the knowledge graph, a hypothesis has been considered as a weighted path between two entities. Here, the approach to creating the knowledge graph is presented, along with preliminary steps to generate hypotheses from it. Some initial observations of the entity alignment and hypothesis generation tasks are also shared. | |
Feb 23 | Victor Lin and Kevin Chow | Graph-based Anomaly Detection for Insider Threat An insider threat is a malicious threat to an organization that comes from people within the organization, such as employees, former employees, contractors or business associates, who have inside information concerning the organization's security practices, data and computer systems. Because insiders may attempt to steal property or information for personal gain, or to benefit another organization or country, the insiders committing threats may be related to each other in some ways, such as coming from the same foreign country, working in the same team, having the same ex-employer, or previously serving in the same organization. Being able to utilize this kind of additional relational information makes more precise insider threat detection an interesting research topic. In this project, we focus on the CERT dataset by building upon existing attribute-based threat detection, applying graph-based models to improve our detection, and ultimately combining both for the most reliable anomaly detection. We will talk about our progress so far. | |
Feb 16 | Caleb Bryant | Medical Dialogue Systems Historically, the use of dialogue systems in the medical domain has been fairly limited. While systems have been proposed in the past (e.g. for conversing about medication), the difficulty of designing large and robust dialogue systems has prevented widespread adoption by clinicians. However, the recent rise of virtual digital assistants could help drive renewed attention to medical dialogue systems. In this talk, we explore different models and applications for dialogue systems. We examine past and current trends in the area of dialogue system research, such as FSA, Information State Update, plan-based, POMDP, and neural network dialogue systems. In particular, we see how the target applications and design decisions of dialogue systems have interacted. Focusing on work in the medical domain, we look at a number of previous medical dialogue systems and compare them to present work on Rose. Finally, we discuss the current state of medical dialogue systems as well as possible future directions. Second talk: Ali Sadeghian. Title: Temporal Reasoning Over Event Knowledge Graphs. Abstract: Many advances in the computer science field, such as semantic search, recommendation systems, question-answering, and natural language processing, are achieved with the help of large-scale knowledge bases (e.g., YAGO, NELL, DBPedia). However, many of these knowledge bases are static representations of knowledge and do not model time on its own dimension or do it only for a small portion of the graph. In contrast, projects such as GDELT and ICEWS have constructed large temporally annotated knowledge graphs of events collected from news hubs. In this paper, we study the problem of reasoning over such graphs. In particular, we transpose two well-known techniques from knowledge base reasoning to utilize the temporal dimension: rule mining and graph embeddings. 
We mine temporally constrained first-order inference rules using the state-of-the-art relational knowledge base model. We interpret the learned rules as event sequence rules. We also use simple embedding methods to jointly learn a universal representation of entities and time-specific representations of the knowledge graph. We present the first set of temporal rules mined over event knowledge graphs and preliminary results on using the learned embeddings in the temporal link prediction task. | |
Feb 02 | Sarvesh Soni | Patient Question Answering from Electronic Health Records using Semantic Parsing (III) In this presentation, I will talk about my thesis work progress through the winter break. Electronic Health Records (EHR) are a great source for answering questions related to patient data. The main focus of my thesis work is to convert patient questions into logical forms using questions and their corresponding answers from EHR. These logical forms are transformed into Fast Healthcare Interoperability Resources (FHIR) queries for retrieving the answer(s) from EHR. I will briefly talk about Semantic Parsing and some related work in this domain of Patient Question Answering using Semantic Parsing. Then, I will explain the various steps of my thesis work and talk about my progress during the winter break. | |
Jan 26 | Xiaofeng Zhou | Query Processing and Incremental Learning over Knowledge Bases Knowledge bases are becoming increasingly important in structuring and representing information from the web. Meanwhile, web-scale information poses significant scalability and quality challenges to knowledge base systems. To address these challenges, we develop a probabilistic knowledge base system, ARCHIMEDESONE, by scaling up the knowledge expansion and statistical inference algorithms. ARCHIMEDESONE supports knowledge expansion by applying inference rules in batches using relational operations, and query-driven inference by focusing computation on the query facts in a unified system. Most of today's knowledge bases keep growing despite their large sizes. Much research effort has been put into mining inference rules from knowledge bases, yet few works focus on the incremental aspect of these web-scale knowledge bases. We propose a parallel incremental rule mining framework based on the relational model and apply updates to large knowledge bases; we propose an alternative metric that reduces computational complexity without compromising rule quality; and we apply multiple optimization techniques that reduce runtime by more than 2 orders of magnitude. Experiments show that our approach can scale to web-scale knowledge bases efficiently and can easily save over 90% of the time compared to the state-of-the-art batch rule mining system. To the best of our knowledge, our incremental rule mining system is the first that handles updates to web-scale knowledge bases efficiently. | |
Jan 19 | Miguel Rodriguez | AIDA project summary and plans This talk will cover our plans for AIDA TA3 and a summary of the kickoff meeting. |