Fall 2017
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 14 | Rahul Sengupta | Memory-augmented neural networks, i.e., neural networks that can interact with (read/write/update) a data structure. Given the pervasiveness of WiFi networks within indoor spaces, there arises the possibility of using WiFi signals for indoor localization. In this work, we start by performing a literature survey of the existing approaches and then begin work on a practical and scalable system for commercial and industrial use. We use an off-the-shelf robot with commodity hardware and sensors to create the dataset. We operate it in a semi-autonomous manner in the corridors of a university office building and map unrestricted hallways on two floors to create a large dataset of over 6000 individual readings at over 200 locations, spaced 3 feet (0.9144 m) apart. The robot collects WiFi Received Signal Strength (RSS) readings of available Access Points, magnetometer headings, and scene images (captured using two fisheye cameras). We then explore the performance of various deep neural network architectures and present their results. We see that deep learning models can effectively perform indoor localization in corridors, which can be used for location estimation by people and autonomous robots operating inside large indoor complexes. (A minimal sketch of such an RSS-based classifier follows this table.) | |
Dec 8 | Sarvesh Soni | Patient Question Answering from Electronic Health Records using Semantic Parsing (II) In this presentation, I will talk about my thesis work progress during the fall semester. Electronic Health Records (EHR) are a great source for answering questions related to patient data. The main focus of my thesis work is to convert the patient questions into logical forms and to transform these logical forms to Fast Healthcare Interoperability Resources (FHIR) for retrieving the answer(s). I will briefly talk about Semantic Parsing and Sempre, a Semantic Parser by Stanford. Then I will highlight some related works in this domain of Patient Question Answering using Semantic Parsing. Finally, I will explain the various steps of my thesis work and progress during the semester. | |
Nov 30 | Dihong Gong | Towards Building Large-Scale Multimodal KBs (dry run for Ph.D. candidacy) We present a design of a multimodal knowledge base (MKB) by extending the existing knowledge graph model with multimodal entities and links. The expansion of the MKB is modeled as a multimodal link prediction (MLP) problem. In this talk, we present three different algorithms to address the MLP problem, the core idea of which is to utilize multimodal links in the graph to improve prediction accuracy. Experimental evaluation confirms our intuition that utilizing additional information from alternative modalities can improve the robustness of link prediction. We conclude the talk with a brief summary of our work as well as an introduction to future study that might make our work more impactful. | |
Nov 9 | Yang Peng | Query-Driven Knowledge Base Completion with Multimodal Fusion Over the past few years, large knowledge bases have been constructed to host massive amounts of world knowledge. However, these knowledge bases are greatly incomplete; for example, over 70% of people in Freebase have no known place of birth and 99% have no known ethnicity. To solve this problem, we propose a query-driven knowledge base completion system by fusing unstructured text and structured knowledge bases. Our system applies ensemble fusion to combine two different strategies, web-based question answering and rule inference, to achieve high knowledge base completion performance. We design a web-based question answering system employing question templates and multimodal features, which can achieve better answer ranking quality with far fewer questions than previous work. We implement a novel rule inference system combining pre-learned logical rules and question answering to infer new facts for knowledge base completion, which achieves much better performance than only using rules. By fusing web-based question answering and rule inference, our system further boosts performance. To the best of our knowledge, our paper is the first comprehensive research work leveraging both unstructured and structured data in depth for knowledge base completion. To improve efficiency, query-driven techniques are utilized to reduce the running time of our system on the fly, providing fast responses to user queries. Extensive experiments have been conducted to demonstrate the effectiveness and efficiency of our system. | |
Nov 3 | Yang Peng | Multimodal Fusion: A New Theory and Applications As data grows larger and larger, Big Data and Data Science are becoming more and more prominent in Computer Science. In Data Science, not only the volume of data but also its variety has drawn a lot of attention from researchers. In recent years, we have seen more and more complex datasets with multiple kinds of data. For example, Wikipedia is a huge dataset with unstructured text, semi-structured documents, structured knowledge, and images. We call a dataset with different types of data a multimodal dataset. This dissertation focuses on employing multimodal fusion on multimodal data to improve performance for various tasks, as well as providing scalability and high efficiency. Multimodal fusion is the use of algorithms to combine information from different kinds of data with the purpose of achieving better performance. In this dissertation, I first introduce the concepts of multimodal datasets and multimodal fusion, then propose a new theory about multimodal fusion based on correlative and complementary relations between different modalities, and then present different applications for multimodal fusion, such as information extraction, word sense disambiguation, information retrieval and knowledge base completion. Multimodal datasets studied in this dissertation include images, unstructured text and structured facts in knowledge bases. | |
Oct 27 | Anthony Colas | A literature review on hypothesis generation Many studies have focused on automatically generating hypotheses by mining a posteriori, domain-specific data to find links between different entities. Hypothesis generation is especially popular in discovering new facts from literature. In this talk, I will go over a literature review of hypothesis generation. I will first define a hypothesis and then build on this definition to define hypothesis generation. I will present a few works in hypothesis generation, specifically in the medical domain. I will then go over what each work attempts to find, its techniques, and its results. | |
Oct 20 | Dihong Gong | Towards Building Large-Scale Multimodal Knowledge Bases (II) We present a design of a multimodal knowledge base (MKB) by extending the existing knowledge graph model with multimodal entities and links. The expansion of the MKB is modeled as a multimodal link prediction (MLP) problem. In this talk, we present three different algorithms to address the MLP problem, the core idea of which is to utilize multimodal links in the graph to improve prediction accuracy. Experimental evaluation confirms our intuition that utilizing additional information from alternative modalities can improve the robustness of link prediction. We conclude the talk with a brief summary of our work as well as an introduction to future study that might make our work more impactful. | |
Oct 13 | Dihong Gong | Towards Building Large-Scale Multimodal Knowledge Bases We present a design of a multimodal knowledge base (MKB) by extending the existing knowledge graph model with multimodal entities and links. The expansion of the MKB is modeled as a multimodal link prediction (MLP) problem. In this talk, we present three different algorithms to address the MLP problem, the core idea of which is to utilize multimodal links in the graph to improve prediction accuracy. Experimental evaluation confirms our intuition that utilizing additional information from alternative modalities can improve the robustness of link prediction. We conclude the talk with a brief summary of our work as well as an introduction to future study that might make our work more impactful. | |
Sep 29 | Miguel Rodriguez | Reasoning Over Temporal Knowledge Graphs In this talk, I will discuss the topic of reasoning over knowledge graphs that change in time, that is, graphs whose edges between entities change over time. Specifically, I will introduce two lines of work on the subject: temporal latent representations from the paper "Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs", where embeddings are used in the temporal link prediction task, and "Marrying Uncertainty and Time in Knowledge Graphs", which used temporal constraint rules and Markov Logic Networks to detect temporal inconsistencies in the knowledge graph. | |
Sep 22 | Sarvesh Soni | Patient Question Answering from Electronic Health Records using Semantic Parsing In this presentation, I will talk about my proposed thesis work. Electronic Health Records (EHR) are a great source for answering questions related to patient data. The main focus of my thesis work is to convert the patient questions into logical forms and to transform these logical forms to Fast Healthcare Interoperability Resources (FHIR, an EHR query standard) for retrieving the answer(s). Firstly, I will be talking about Semantic Parsing and Sempre, a Semantic Parser by Stanford. Then this talk will highlight some related works in this domain of Patient Question Answering using Semantic Parsing. Finally, I will briefly explain the proposed steps of my thesis work along with the timeline. | |
Sep 15 | Professor Christan Grant | Ongoing research at the University of Oklahoma This talk is a preview of the upcoming and ongoing research at the University of Oklahoma. The OU Data Analytics lab focuses on research projects in data analytics, system building, and novel methods of human interaction. The majority of the talk will discuss human-over-the-loop analytics. | |
Sep 1 | Ali Sadeghian | Paper review: Observed versus latent features for knowledge base and text inference A review of an interesting paper by Kristina Toutanova and Danqi Chen on combining latent features (embeddings) and observed features (rules) for link prediction. | |
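The Dec 14 talk above frames indoor localization as learning a mapping from WiFi RSS readings to surveyed locations. Below is a minimal, hypothetical sketch of such a classifier in Keras; the layer sizes, number of access points, and synthetic data are illustrative assumptions, not the speaker's actual architecture or dataset.

```python
# Sketch of an RSS-fingerprint location classifier (hypothetical shapes).
# Each example is a vector of RSS readings, one slot per access point,
# labeled with a surveyed location id.
import numpy as np
from tensorflow import keras

NUM_APS = 120        # assumed number of distinct access points in the survey
NUM_LOCATIONS = 200  # the talk mentions over 200 surveyed locations

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(NUM_APS,)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(NUM_LOCATIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stand-in data: RSS in dBm per access point (absent APs imputed as -100),
# one location label per reading. Real fingerprints would replace these.
X = np.random.uniform(-100.0, -30.0, size=(6000, NUM_APS)).astype("float32")
y = np.random.randint(0, NUM_LOCATIONS, size=6000)
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.1)
```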
Summer 2017
Date | Speaker | Title | Slides |
---|---|---|---|
Aug 4 | Dihong Gong | Extracting Visual Knowledge from the Web with Multimodal Learning We consider the problem of automatically extracting visual objects from web images. Despite the extraordinary advancement in deep learning, visual object detection remains a challenging task. To overcome the deficiency of pure visual techniques, we propose to make use of meta text surrounding images on the Web for enhanced detection accuracy. In this paper, we present a multimodal learning algorithm to integrate text information into visual knowledge extraction. To demonstrate the effectiveness of our approach, we developed a system that takes raw webpages and a small set of training images from ImageNet as inputs, and automatically extracts visual knowledge (e.g. object bounding boxes) from tens of millions of images crawled from the Web. Experimental results based on 46 object categories show that the extraction precision is improved significantly from 73% (with state-of-the-art deep learning programs) to 81%, which is equivalent to a 31% reduction in error rates. | |
Jun 2 | Sean Goldberg | Interactive Graph Inference In Knowledge Bases With the increase in data over the last few years, there has been tremendous interest in organizing it into knowledge bases for more efficient querying and analysis. Two automated methods for introducing new information into a knowledge base are Automatic Information Extraction and Rule Inference. These machine learning methods require complex inference over probabilistic graphical models. Small, efficient human corrections can be used to guide inference in such a way as to improve the overall process. This talk addresses the fundamental problem of how to select questions to pose to a human expert or crowd in order to improve the posterior prediction accuracy of the inference algorithms; a toy illustration of one such selection strategy follows this table. I will discuss previously published work in Information Extraction and propose new research related to Rule Inference in Knowledge Bases. | |
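The Jun 2 talk above selects questions for human experts so as to maximally improve inference. A common information-theoretic instantiation is to rank predictions by entropy and ask about the most uncertain ones; the sketch below illustrates that idea with a made-up marginal table (it is not the talk's exact selection criterion).

```python
# Sketch of entropy-based question selection: route the predictions the
# model is least sure of to human annotators. Probabilities are made up.
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))

# Marginal label distributions for each extracted token, e.g. from a CRF.
token_marginals = {
    "Smith": [0.95, 0.03, 0.02],   # model is confident
    "2003":  [0.40, 0.35, 0.25],   # model is unsure -> good question
    "ACM":   [0.70, 0.20, 0.10],
}

# Ask the crowd about the k highest-entropy tokens.
k = 1
ranked = sorted(token_marginals,
                key=lambda t: entropy(token_marginals[t]), reverse=True)
print("ask the crowd about:", ranked[:k])
```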
Spring 2017
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 26 | Akash Agarwal and Roukna Sengupta | Anomaly Detection over graphical knowledge bases using HPCC The goal of our project is anomaly detection over time-evolving network graphs using HPCC Systems. Our methods consider the network as it evolves and monitor properties of the network for changes. We use HPCC Systems®, an open-source, massively parallel-processing computing platform for big data processing and analytics. In the presentation, we discuss what we learned from HPCC and evaluate its performance for querying and operating on a large dataset. We also discuss Enterprise Control Language (ECL), which is designed specifically for big data processing with HPCC. In addition, we discuss our evaluations of anomaly detection algorithms over a graphical KB, the Wikipedia revision history, where we try to detect events using a distribution-based methodology and structural changes in the graph over the time series. | |
Apr 26 | Harish Balaji | ChronoSeek: Information Extraction from temporal knowledge bases In the past, sequential pattern mining has focused more on instantaneous events than on time intervals. With the advent of KBs such as YAGO and Wikidata, the scenario is changing. With temporal annotations on their relations by way of reification, various data models, like the SPOT representation, have been proposed. This can be exploited to find patterns not only in a temporal arrangement but also in a combination of topological and temporal arrangements. This space has not been explored, and it leads to fifteen different arrangements that prove to be interesting. I will present the results of a 2-arrangement phase of the enumeration tree generated from temporal KBs. In particular, I will talk about parsing techniques and temporal data mining approaches used on YAGO and Wikipedia to mine the various arrangements from SPOT tuples. | |
Apr 26 | Karthik M. S. Subramanian, Auon H. Kazmi | Perceptron in MADlib MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. In this presentation we introduce the MADlib project, including the background that led to its beginnings, and the motivation for using Python and C++ along with Postgres. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context. We will explain the key contributions made by us to the MADlib project including the perceptron and the KNN algorithms. | |
Apr 19 | Samskruthi Padigepati, Abhinav Shankar | Biomedical Prediction Electronic health records store the medical and demographic information of patients in a digital format and can be used for advancement in clinical research. While the EHR data can be used for predicting patient-centered outcomes, challenges arise when there is missing information. In this project, we predict the missing links in the EHR data by integrating with a biomedical knowledge base. | |
Apr 19 | Caleb Bryant | The Rose Dialogue System Personal digital assistants, such as Siri and Alexa, are the most well-known examples of dialogue systems. In recent years high accuracy speech recognition and natural language processing tools have made building custom dialogue systems ever more feasible. In this talk, we will be taking an end-of-semester look at the dialogue system for Rose, a virtual health navigator whose goal is to help patients understand their medical situations. | |
Apr 14 | Nishant Agarwal, Arvind K. Sugumar | NEON DSE Project Automatic tree crown delineation has a great impact on tracking and preserving biodiversity in our world. To serve as the pre-pilot for the full DSE track, which comprises delineation, alignment, and classification, we propose using the watershed class of algorithms to implement a baseline model for the delineation task. This talk takes over from where we left off earlier; in particular, we will talk about two different approaches to improving the naive watershed segmentation: the Laplacian of Gaussian (LoG) method followed by morphological enhancement, and the region growing algorithm. We will go through the techniques we use for crown delineation, and the current progress will be demoed. The participant evaluation system will also be demoed and a sample report generated. (A minimal watershed sketch follows this table.) | |
Apr 14 | Jayson Salkey | Clinical Link Prediction with Translational Embeddings With the exponential growth of Electronic Health Records (EHR) and linked life data, merging and predicting links from medical knowledge bases allow us to further enrich and empower the medical community. Here we utilize a translational embedding technique to predict relationships between clinical concepts. (A sketch of translational scoring follows this table.) | |
Apr 7 | Yang Peng | Multimodal fusion for disambiguation, retrieval and knowledge base completion As data has grown larger and larger in recent years, Big Data and Data Science are becoming more and more important in Computer Science. In Data Science, not only the volume of data but also its variety has drawn a lot of attention from researchers. There is usually more than one type of record/item in a dataset. For example, Wikipedia is a huge dataset with unstructured text, semi-structured documents, structured knowledge, and images. We call this kind of dataset a multimodal dataset. This dissertation focuses on employing different kinds of data to improve quality for different tasks and to provide scalability for large datasets. In this dissertation, I introduce multimodal datasets and multimodal fusion for different applications, such as information extraction, word sense disambiguation, information retrieval and knowledge base completion. Multimodal fusion is the use of algorithms to combine information from different kinds of data with the purpose of achieving better performance. In this dissertation, I introduce a few applications for multimodal fusion, explain the ensemble fusion algorithm using images and text for disambiguation and retrieval, and propose the deep fusion pipeline of unstructured and structured data for knowledge base completion. | |
Mar 31 | Xiaofeng Zhou | Real-time inference query processing and incremental rule learning over Archimedes KB Knowledge bases are becoming increasingly important in structuring and representing information from the web. Meanwhile, web-scale information poses significant scalability and quality challenges to knowledge base systems. To address these challenges, we develop a probabilistic knowledge base system, ArchimedesOne, by scaling up the knowledge expansion and statistical inference algorithms. We design a web interface for users to query and update large knowledge bases. Knowledge bases are also dynamic, incorporating new knowledge continuously, which creates the need to update existing models trained from them. These dynamic knowledge bases pose significant challenges to state-of-the-art learning methods if we re-train the models for each update. We investigate the first-order rule mining problem over web-scale dynamic knowledge bases and propose a new metric to facilitate rule mining over dynamic knowledge bases in a distributed setting. | |
Mar 24 | Anthony Colas | VoiceXML This talk will cover the basics of VoiceXML and its integration with Rose. | |
Mar 24 | Miguel Rodriguez | Towards Mining Episodes from Temporal Knowledge Bases Today's knowledge bases are increasingly turning their attention to the temporal domain. KBs such as YAGO and Wikidata place temporal bounds on their relations and have proposed various data models. Even though there is interest in constructing temporal KBs, the temporal domain of existing KBs is largely unexplored compared to a-temporal KBs. I will introduce preliminary ideas on an exploratory analysis of the temporal dimension in KBs. In particular, I will talk about temporal data management and temporal data mining approaches used to mine episodes from sequential data. | |
Mar 17 | Ali Sadeghian | Deep Learning on Relational Data: A Review of Previous Work The past few years have seen escalating efforts to integrate knowledge bases with deep learning to boost performance on various tasks, from question answering to image recognition. There have also been efforts to use deep learning to enhance knowledge graphs. In the first part of this talk I will give a brief review of previous work, and in the second part I will discuss my ongoing research on rule mining. | |
Mar 10 | Samskruthi R. Padigepati, Abhinav S. Venkataraman | Biomedical Prediction We aim to predict diagnoses for a patient with the help of EHR data and biomedical knowledge bases, integrating information from a biomedical knowledge base into the EHR data. | |
Mar 3 | Akash Agarwal, Roukna Sengupta | Anomaly Detection over graphical knowledge bases using HPCC The focal point of our project is anomaly detection over time-evolving network graphs using HPCC Systems. Unlike previous work on anomaly detection in information networks that worked with a static network graph, our methods consider the network as it evolves and monitor properties of the network for changes. HPCC Systems® is an open-source, massively parallel-processing computing platform for big data processing and analytics. It has three main components: the data refinery engine Thor, the data delivery engine Roxie, and the Enterprise Control Language (ECL), all of which work together to provide concurrent querying over the data in HPCC. To demonstrate our capabilities we have chosen the Wikipedia revision history as our data. Wikipedia has become a standard source of reference online, and many people (some unknowingly) now trust this corpus of knowledge as an authority to fulfill their information requirements. In doing so they task the human contributors of Wikipedia with maintaining the accuracy of articles, a job that these contributors have been performing admirably. We try to detect events using a distribution-based methodology and structural changes in the graph over the time series. | |
Mar 3 | Auon Kazmi, Karthik Maharajan | Word2vec implementation in MADlib MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. In this presentation, we introduce the MADlib project, including the background that led to its beginnings, and the motivation for using Python and C++ along with Postgres. We provide an overview of the library’s architecture and design patterns and provide a description of various statistical methods in that context. | |
Feb 24 | Harish Balaji | ChronoSeek: The Groundwork An analysis of information extraction from temporal knowledge bases. The ability to create a timeline, or to view a knowledge base as of a particular timestamp, enables many applications, such as analysing the songs made by an artist before and after receiving an award. Current knowledge bases may contain outdated facts. Using the temporal dimension to annotate old facts can increase the utility of old knowledge bases. Knowing the time of occurrence of a fact can be used to validate whether another related fact should have happened after the former. Well-known knowledge bases like YAGO and Wikidata have a temporal dimension and promise future applications. Performing a more thorough information extraction for time than current methods (infoboxes, lists, tables, and regular patterns) can help populate knowledge bases with an improved temporal dimension, facilitating a wide variety of queries over knowledge bases with temporal data. We propose to export the improved strategy onto a larger knowledge base like Wikidata to create a big KB Time Machine, ChronoSeek. | |
Feb 24 | Jayson Salkey, Anthony Colas, Caleb J. Bryant | Virtual Health Navigation ROSE dialogue system Dialogue systems for intelligent personal assistants are an active field with applications to virtual health workers. For example, popular voice assistants Siri, Cortana, and Google Now offer services such as question answering, recommendations, and action performance. This technology could potentially be combined with an integrated biomedical knowledge base to be of use to medical professionals in medical situations. This talk will go over our current work towards creating such an application. | |
Feb 17 | Nishant Agarwal, Arvind K. Sugumar | NEON DSE Project Automatic tree crown delineation has a great impact on tracking and preserving biodiversity in our world. We use the NEON data set, which includes data products from field sampling to landscape-scale airborne remote sensing (AOP), to build a pipeline that can effectively isolate and classify tree crowns based on their species. To serve as the pre-pilot for the full DSE track, which comprises delineation, alignment, and classification, we propose using the graph cut class of algorithms to implement a baseline model for the delineation task. Right now we are using an enhanced version of the watershed algorithm coupled with some morphological transforms. We are also building the evaluation pipeline in parallel for all three tasks. This talk will go over our current work and results. | |
Feb 10 | Dihong Gong | Multimodal Knowledge Extraction (II) Multimodal techniques have attracted much research interest in recent years. There have been quite a lot of emerging research topics related to multimodal information analysis, including multimodal knowledge bases, image captioning, bidirectional text/image retrieval, and so on. Previous multimodal research mostly focuses on bridging the gap between modalities, e.g. describing the contents of an image with text, or retrieving images by text. In this talk, we present our studies on multimodal techniques from a different point of view -- fusing information from multiple modalities for better knowledge extraction quality. We propose three different types of methods to fuse information across multiple modalities, based on frequency statistics, embeddings, and graphical models respectively. Our experimental evaluations on named entity recognition and visual knowledge extraction show that making use of information from multiple modalities significantly improves the quality of knowledge extraction. | |
Feb 3 | Dihong Gong | Multimodal Knowledge Extraction Multimodal techniques have attracted much research interest in recent years. There have been quite a lot of emerging research topics related to multimodal information analysis, including multimodal knowledge bases, image captioning, bidirectional text/image retrieval, and so on. Previous multimodal research mostly focuses on bridging the gap between modalities, e.g. describing the contents of an image with text, or retrieving images by text. In this talk, we present our studies on multimodal techniques from a different point of view -- fusing information from multiple modalities for better knowledge extraction quality. We propose three different types of methods to fuse information across multiple modalities, based on frequency statistics, embeddings, and graphical models respectively. Our experimental evaluations on named entity recognition and visual knowledge extraction show that making use of information from multiple modalities significantly improves the quality of knowledge extraction. | |
Jan 27 | Daisy Zhe Wang | Archimedes: A Probabilistic Master Knowledge Base System (II) In this talk, I discuss novel system components and algorithms that we are designing and building at UF to enable a probabilistic master Knowledge Base (KB) system. In the context of the Archimedes project, I will discuss a spectrum of research directions we are exploring at the UF Data Science Research (DSR) group, including query-driven and scalable statistical inference, probabilistic data models, a state-parallel and data-parallel data analytics framework, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research, supporting analytics over automatically extracted knowledge bases, has high impact for many applications, from QA systems and situational awareness to medical informatics. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault. | |
Jan 20 | Daisy Zhe Wang | Archimedes: A Probabilistic Master Knowledge Base System (I) In this talk, I discuss novel system components and algorithms that we are designing and building at UF to enable a probabilistic master Knowledge Base (KB) system. In the context of the Archimedes project, I will discuss a spectrum of research directions we are exploring at the UF Data Science Research (DSR) group, including query-driven and scalable statistical inference, probabilistic data models, a state-parallel and data-parallel data analytics framework, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research, supporting analytics over automatically extracted knowledge bases, has high impact for many applications, from QA systems and situational awareness to medical informatics. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault. | |
Jan 13 | Daisy Zhe Wang | Data Science Tea Roundtable Semester kick-off and roundtable. | |
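The Apr 14 NEON talk above builds crown delineation on watershed segmentation with smoothing and marker selection. Here is a minimal sketch with scikit-image, using a synthetic canopy height model in place of real LiDAR-derived rasters (parameter values are illustrative assumptions):

```python
# Sketch of watershed-based crown delineation on a canopy height model (CHM):
# smooth, pick local maxima as treetop markers, then flood.
import numpy as np
from skimage.feature import peak_local_max
from skimage.filters import gaussian
from skimage.segmentation import watershed

chm = np.random.rand(200, 200)   # placeholder canopy height model
smooth = gaussian(chm, sigma=3)  # suppress noise before peak finding

# Treetops become watershed markers; min_distance limits over-segmentation.
peaks = peak_local_max(smooth, min_distance=10)
markers = np.zeros_like(smooth, dtype=int)
markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)

# Flood the inverted surface so each crown grows out from its treetop.
labels = watershed(-smooth, markers, mask=smooth > 0.2)
print("crowns found:", labels.max())
```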
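The Apr 14 clinical link prediction talk above relies on translational embeddings, which score a triple (h, r, t) by how well h + r ≈ t holds in vector space. A toy sketch of the scoring function and margin ranking loss, with random placeholder vectors standing in for learned clinical-concept embeddings:

```python
# Sketch of translational (TransE-style) triple scoring. The vectors below
# are random placeholders; a real system learns them by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entities = {n: rng.normal(size=dim) for n in ["aspirin", "headache", "fever"]}
relations = {"treats": rng.normal(size=dim)}

def score(h, r, t):
    """Lower is better: a true triple should satisfy h + r ≈ t."""
    return np.linalg.norm(entities[h] + relations[r] - entities[t])

def margin_loss(pos, neg, margin=1.0):
    """Margin ranking loss between a true triple and a corrupted one."""
    return max(0.0, margin + score(*pos) - score(*neg))

print(score("aspirin", "treats", "headache"))
print(margin_loss(("aspirin", "treats", "headache"),
                  ("aspirin", "treats", "fever")))
```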
Fall 2016
Date | Speaker | Title | Slides |
---|---|---|---|
Nov 10 | Yang Chen | Scalable Learning and Inference in Large Knowledge Bases Recent years have seen escalating efforts in the construction of web-scale knowledge bases (e.g., DBPedia, DeepDive, Freebase, Google Knowledge Graph, NELL, OpenIE, ProBase, YAGO). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, current knowledge bases are far from complete. To infer the missing knowledge, we propose the knowledge expansion and ontological pathfinding algorithms. The knowledge expansion algorithm applies first-order inference rules to infer facts from an incomplete knowledge base; the ontological pathfinding algorithm mines first-order inference rules from the knowledge bases. The knowledge expansion and ontological pathfinding algorithms form the core components of a probabilistic knowledge base system, ProbKB. The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality. The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. We support knowledge queries with spreading activation, a popular way of simulating human memory in semantic networks. We design a relational model for semantic networks and an efficient SQL-based spreading activation algorithm. We leverage the mature query engines and optimizers that generate efficient query plans for memory activation and retrieval. Our system supports human-scale memories with the massive storage capacity provided by modern database systems. We evaluate the spreading activation queries in a comprehensive experimental study using DBpedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works. Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge by scalable learning and inference. We validate ProbKB's effectiveness with web knowledge bases including Freebase and YAGO. For future work, we propose to extend the previous contributions to dynamic knowledge bases and data streams and to support other types of automatic reasoning, including abductive and defeasible reasoning. (A plain-Python sketch of spreading activation follows this table.) | |
Nov 4 | Laksshman Sundaram | Introduction to TensorFlow/Keras TensorFlow is a popular machine learning package for coding deep neural network models. In this talk, I will introduce TensorFlow and show two short demos, on an image processing task and an NLP task. I will also introduce Keras, an abstraction over TensorFlow that helps in building prototypes much more quickly than raw TensorFlow. (A minimal Keras sketch follows this table.) | |
Oct 28 | Sean Goldberg | pi-CASTLE: A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. The current state-of-the-art structured prediction methods, however, are likely to contain errors, and it’s important to be able to manage the overall uncertainty of the database. On the other hand, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. In this talk I will introduce pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving conditional random fields (CRFs). I will propose strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets, our results show an order of magnitude improvement in accuracy gain over baseline methods. | |
Oct 21 | Yang Chen | Scalable Learning and Inference in Large Knowledge Bases Recent years have seen escalating efforts in the construction of web-scale knowledge bases (e.g., DBPedia, DeepDive, Freebase, Google Knowledge Graph, NELL, OpenIE, ProBase, YAGO). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, current knowledge bases are far from complete. To infer the missing knowledge, we propose the knowledge expansion and ontological pathfinding algorithms. The knowledge expansion algorithm applies first-order inference rules to infer facts from an incomplete knowledge base; the ontological pathfinding algorithm mines first-order inference rules from the knowledge bases. The knowledge expansion and ontological pathfinding algorithms form the core components of a probabilistic knowledge base system, ProbKB. The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality. The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. We support knowledge queries with spreading activation, a popular way of simulating human memory in semantic networks. We design a relational model for semantic networks and an efficient SQL-based spreading activation algorithm. We leverage the mature query engines and optimizers that generate efficient query plans for memory activation and retrieval. Our system supports human-scale memories with the massive storage capacity provided by modern database systems. We evaluate the spreading activation queries in a comprehensive experimental study using DBpedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works. Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge by scalable learning and inference. We validate ProbKB's effectiveness with web knowledge bases including Freebase and YAGO. For future work, we propose to extend the previous contributions to dynamic knowledge bases and data streams and to support other types of automatic reasoning, including abductive and defeasible reasoning. | |
Sep 30 | Douglas W. Oard | Efficiently Searching Among Sensitive Content In the field of information retrieval, we take it as our goal to help people find what they want to see, but in this talk I will argue that it is high time that we also begin to think seriously about a complementary problem: preventing people from finding things that they should not see. Back when information was scarce, we could do this just by omitting things that should not be found from the index used by the information retrieval system. Today, however, it is increasingly common to find important information (that should be found) intermixed with sensitive information that needs to be protected for one reason or another (personal privacy, commercial interests, national security, ...). I’ll begin the talk by presenting some of our recent work on one current application of these ideas, the protection of privileged content when sharing evidence among the parties to a lawsuit. This task, referred to as “discovery” or (when the content is born-digital) “e-discovery,” follows a search-then-segregate process that was originally developed for paper records. I’ll then compare that approach with the obvious alternative: segregate-then-search, which is presently the process used to review classified records for public release after some period (e.g., 25 years). Both approaches suffer from high latency (on the order of months) and high cost, and are thus suitable only for settings in which the information to be found is expected to have high value. With that as background, I will then look to the future to sketch out how more responsive and affordable alternatives might be crafted, and what technical challenges would need to be addressed to make those alternative approaches both possible and practical. | |
Sep 23 | Miguel E. Rodríguez, Yang Chen | Multi-level fusion of Extractions and Beliefs for Knowledge Base Population In automatic knowledge base population, relation extractors such as ReVerb, CPL, etc. play a major role in transforming unstructured text into facts. In our previous research, we targeted the Knowledge Fusion problem: given the uncertainty from multiple extractors, we construct an ensemble KB with untrusted facts filtered out and probabilities calibrated. In this talk, we will explore how additional sources of uncertainty from text extractions, such as document trustworthiness and user beliefs, can be leveraged to further increase the quality of the ensemble KB. Archimedes: Efficient Query Processing over Probabilistic Knowledge Bases We present the Archimedes system for query processing over probabilistic knowledge bases. Archimedes is designed for knowledge bases containing incomplete, ambiguous, or erroneous information due to uncertainty in information extraction algorithms and limitations of human knowledge. Answering user queries over these knowledge bases requires significant efforts of probabilistic inference. In this paper, we present the design of Archimedes, which performs knowledge expansion and query-driven inference over a unified data- and state-parallel computation framework, UDA-GIST. With efficient probabilistic reasoning and inference, Archimedes produces reasonable results for queries over incomplete and uncertain knowledge bases. We use public knowledge bases including Reverb-Sherlock and Wikilinks to show Archimedes achieves real-time performance with satisfactory quality. | |
Sep 16 | Jayson Salkey | Intelligent Personal Assistants as Virtual Healthcare Workers By integrating biomedical knowledge bases, one can now ask not only whether two concepts are related, but how they are related. This opens up the possibility of utilizing query-based intelligent assistant applications as virtual healthcare workers. For example, popular voice assistants Siri, Cortana, and Google Now offer services such as question answering, recommendations, and action performance. This technology could potentially be combined with an integrated biomedical knowledge base to be of use to medical professionals in medical situations. This talk will go over my current work towards creating such an application. | |
Sep 9 | DSR Group | DSR OpenLab Discussions DSR OpenLab to discuss potential Data Science projects for Master's and undergrad students, for Projects in Data Science (Spring 2017)/Independent Studies/OPS positions/Research Credits affiliated with the DSR Lab. Potential projects include: NIST TAC/DSE, MADlib, System/Demo for Probabilistic KB research, Biomedical, Q/A systems, etc. More information can be found at: https://dsr.cise.ufl.edu/cap4773cap6779-projects-in-data-science-spring-2016 https://dsr.cise.ufl.edu/courses | |
Aug 26 | Miguel E. Rodríguez, Xiaofeng Zhou, Yang Chen | SigmaKB: Multiple Probabilistic Knowledge Base Fusion The interest in integrating web-scale knowledge bases (KBs) has intensified in the last several years. Research has focused on knowledge base completion between two KBs with complementary information, lacking any notion of uncertainty or method of handling conflicting information. We present SigmaKB, a knowledge base system that utilizes Consensus Maximization Fusion and user feedback to integrate and improve the query results of a total of 71 KBs. This paper presents the architecture and demonstration details. ArchimedesOne: Query Processing over Probabilistic Knowledge Bases Knowledge bases are becoming increasingly important in structuring and representing information from the web. Meanwhile, web-scale information poses significant scalability and quality challenges to knowledge base systems. To address these challenges, we develop a probabilistic knowledge base system, ArchimedesOne, by scaling up the knowledge expansion and statistical inference algorithms. We design a web interface for users to query and update large knowledge bases. In this paper, we demonstrate the ArchimedesOne system to showcase its efficient query and inference engines. The demonstration serves two purposes: 1) to provide an interface for users to interact with ArchimedesOne through load, search, and update queries; and 2) to validate our approaches of knowledge expansion by applying inference rules in batches using relational operations and query-driven inference by focusing computation on the query facts. We compare ArchimedesOne with state-of-the-art approaches using two knowledge bases: NELL-sports with 4.5 million facts and Reverb-Sherlock with 15 million facts. | |
Aug 19 | Yang Peng | Knowledge Base Completion using Search-Based Question Answering Current knowledge bases are incomplete, with low recall. For example, only 70% of Freebase person entities have birth place information. In order to fill in missing information for entities in knowledge bases, we propose to use search-based question answering to generate missing facts. This Knowledge Base Completion (KBC) system first rewrites structured queries into natural language questions, then asks the Web through search engines to provide relevant sentences/snippets, processes the raw sentences by linking noun phrases to Wikipedia entities, and ranks the candidate answers by classification and type checking. | |
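The Oct 21 / Nov 10 abstracts above describe spreading activation over a semantic network, implemented there in SQL over a relational model. The plain-Python sketch below conveys only the algorithm itself; the toy graph, decay factor, and threshold are illustrative assumptions:

```python
# Sketch of spreading activation: propagate activation outward from a source
# node, attenuating by a decay factor per hop and pruning below a threshold.
from collections import defaultdict

graph = {
    "Gainesville": ["Florida", "University of Florida"],
    "Florida": ["USA"],
    "University of Florida": ["Florida"],
    "USA": [],
}

def spread(source, decay=0.5, threshold=0.05):
    activation = defaultdict(float)
    frontier = {source: 1.0}
    while frontier:
        next_frontier = {}
        for node, act in frontier.items():
            activation[node] = max(activation[node], act)
            for nbr in graph.get(node, []):
                new_act = act * decay
                # Only keep spreading where activation still grows.
                if new_act > threshold and new_act > activation[nbr]:
                    next_frontier[nbr] = max(next_frontier.get(nbr, 0.0), new_act)
        frontier = next_frontier
    return dict(activation)

print(spread("Gainesville"))  # nodes nearer the source stay more active
```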
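The Nov 4 talk above introduces TensorFlow and Keras. A minimal Keras example in that spirit, a small dense classifier on MNIST (hyperparameters are illustrative, and the snippet uses the modern `tf.keras` API rather than whatever version was demoed in the talk):

```python
# Minimal Keras demo: a small dense network on MNIST.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```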
Summer 2016
Date | Speaker | Title | Slides |
---|---|---|---|
Jul 8 | Morteza Shahriari Nia | Data Science for Plant Identification with Remote Sensing Ecological sciences benefit from the huge diversity of plant species, which play an important role in large-scale ecological issues such as global warming, land cover change, CO2 emission, invasive species, and fire hazard. State-of-the-art species classification techniques utilize remote sensing data such as hyperspectral and LiDAR; however, this task involves plenty of field data collection, which is highly time-consuming and costly and can only be accomplished by ecological experts. Among the thousands of most commonly found plant species there are huge similarities from a remote sensing point of view, which makes the task of species classification very daunting; we therefore see a whole body of literature specifically dedicated to this issue, which is yet far from real-world scenarios with thousands of possible species. While this is an indicator of the importance and complexity of the issue, little has been done to tackle the problem from a computational point of view harnessing the power of "big data". Periodic airborne campaigns can generate terabytes of data on vast swaths of land. To tackle these problems we propose to use probabilistic knowledge bases, which work best when there is a large amount of uncertain data. A probabilistic knowledge base captures ecological expert knowledge in terms of probabilistic rules, which will be mapped to remote sensing data and used to infer new facts and thereby enhance species classification accuracy. | |
Jun 10 | Yang Chen | Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we are able to develop the first rule learning system that scales to Freebase--the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. | |
Jun 3 | Sean Goldberg | Submodular Functions and Applications to Machine Learning Convex functions are well-studied and useful because they exhibit certain properties that allow for efficient optimization. Submodular functions are a discrete analog for combinatorial problems that exploit a notion of "diminishing returns" to provide performance guarantees on a large class of problems. In this talk, I provide an introduction to submodular functions, discuss their utility within the machine learning community, explore a number of applications that utilize submodular functions, and consider how they might be applied to the problem of crowdsourcing knowledge base inference. (A greedy maximization sketch follows below.) | |
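The Jun 3 talk above leans on the classic result that greedy selection achieves a (1 - 1/e) approximation when maximizing a monotone submodular function under a cardinality constraint. A small sketch using set coverage as the submodular objective (the toy sets are made up):

```python
# Greedy maximization of a monotone submodular function (coverage) under a
# cardinality constraint; each pick maximizes the marginal gain.
def greedy_max_coverage(sets, k):
    """Pick k sets greedily by marginal gain in newly covered elements."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(set(s) - covered))
        chosen.append(best)
        covered |= set(best)
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, k=2)
print(chosen, covered)  # diminishing returns: each pick adds less new coverage
```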
Spring 2016
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 29 | Projects in Data Science | Projects in Data Science Posters - MADlib Contributions and Applications - Semantic Edge Labeling over Legal Citation Graph and Case Prediction - A Case for Integrated Probabilistic Biomedical Knowledge Bases - Exploring Graph Partitioning and Distributed RDF Store for Probabilistic KB | |
Apr 15 | Morteza Shahriari Nia | Data Science in Ecology through Species Identification Precision study of ecology plays a crucial role in our understanding of the environment regarding issues such as climate change, carbon emissions, invasive species, disease outbreaks, and potential fires. Traditional ecological approaches merely look at one or at most a few ecological sites, and researchers usually perform independent analyses that do not necessarily align well with each other. However, these systems are not closed domains, and a global study of our ecosystem is needed to fully understand their dynamics. In this talk we look at big-data analytics solutions that we have studied and propose for further investigation as part of the NIST DSE challenge using the NEON data collection initiative. | |
Apr 8 | Dihong Gong | Multilingual Learning for Information Extraction We consider the problem of semi-supervised learning to extract textual categories (e.g. cities, clothing, and plants) from an unstructured text corpus. Starting from a handful of seed instances, a typical system iteratively learns useful syntactic patterns like “cities such as ” or “ is a city”, and then applies these learned patterns to extract new instances. However, information extracted in this manner is usually of low quality because most of the syntactic patterns are not very reliable. To address this problem, in this paper we present a novel information extraction approach based on multilingual learning that combines information from multiple modalities (e.g. text and vision) to enhance the reliability of information extractors. Our approach is primarily motivated by the observation that multilingual information sources usually complement each other, which leads to a potentially more robust learning method: multilingual learning. | |
Mar 25 | Yang Chen | Scalable Learning and Inference in Large Knowledge Bases Recent years have seen escalating efforts in the construction of web-scale knowledge bases (e.g., DBPedia, DeepDive, Freebase, Google Knowledge Graph, NELL, OpenIE, ProBase, YAGO). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, current knowledge bases are far from complete. To infer the missing knowledge, we propose the knowledge expansion and ontological pathfinding algorithms. The knowledge expansion algorithm applies first-order inference rules to infer facts from an incomplete knowledge base; the ontological pathfinding algorithm mines first-order inference rules from the knowledge bases. The knowledge expansion and ontological pathfinding algorithms form the core components of a probabilistic knowledge base system, ProbKB. The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality. The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge by scalable learning and inference. We propose to expand Freebase by applying the 36,625 first-order inference rules and evaluate our approach using performance measures and cross validation. | slides |
Mar 11 | Ali Sadeghian | Semantic Edge Labeling over Legal Citation Graph and Case Prediction In this talk we are going to talk about two exciting projects in the field of computational law and legal analytics. In one, we are designing an intelligent system that predicts the outcome of a case brought to court. We are currently experimenting with data collected from the United States International Trade Commission's Electronic Document Information System (USITC EDIS). EDIS is a repository for documents filed in Title VII, Section 337 (Unfair Import Investigations Information System), and other investigations before the Commission. Our second project is semantic edge labeling in legal citation graphs. We tackle the problem of automatically determining the type of citation where a certain statute is being cited in another statute. This project involves analysis of the content and structure of the United States Code (USC). Our efforts involve defining, annotating and assigning each citation edge a specific semantic label. | |
Mar 10 | Babak Alipour | Contributing to Apache MADlib Analytics Library Apache MADlib (incubating) is a framework for distributed/parallel in-database Machine Learning over data processing engines such as Greenplum and HAWQ. The MADlib project was initiated in 2011 and is a collaboration among developers and researchers from Greenplum/Pivotal/EMC; the University of California, Berkeley; the University of Wisconsin; and the University of Florida. It has recently moved to Apache for extension and support by open-source developers and to encourage innovation. In this talk, we will discuss our individual experience of development over MADlib and will provide an overview of this group's efforts to contribute to the MADlib community, through issue reporting, bug fixes, development of new modules and algorithm implementations. We will also discuss challenges and next steps to resolve them. | |
Feb 26 | Mebin Jacob | Scalable SPARQL Querying of Large RDF Graphs The generation of RDF data has accelerated to the point where many data sets need to be partitioned across multiple machines in order to achieve reasonable performance. We look into one partitioning scheme, how it can be achieved, and how it performs on real-world queries such as those over DBpedia. | |
Feb 19 | Jayson Mclaughlin-Salkey | On Constructing Biomedical Knowledge Bases Knowledge bases are finding increasing use in biomedical applications. In this talk, we will discuss two state-of-the-art approaches to automatically constructing these KBs. I will also discuss my current work toward creating a biomedical KB and its possible application to the Open Science Prize challenge. | |
Feb 5 | Dr. Daisy Zhe Wang | Archimedes: A Probabilistic Knowledge Base to Combine Information Extraction from Diverse Sources In this talk, I discuss novel system components and algorithms that we are designing and building at UF to enable a probabilistic master Knowledge Base (KB) system. In the context of the Archimedes project, I will discuss a spectrum of research directions we are exploring at the UF Data Science Research (DSR) Lab, in particular probabilistic modeling and scalable inference over a probabilistic knowledge base that can integrate information extracted from diverse and multimedia data sources and systems. This line of research, supporting analytics over automatically extracted knowledge bases, is of high impact for many applications, from QA systems and situational awareness to biomedical informatics. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault. | |
Jan 29 | Dihong Gong | NIST Data Science Pre-Pilot Evaluation We entered the NIST Pre-Pilot Evaluation for two tasks: data cleaning and event prediction. More specifically, we submitted two runs for the data cleaning task and seven runs for event prediction. In this talk, we mainly focus on the feedback that we have received from the NIST group. We present the algorithmic details of each run and its corresponding scores. Our goal is to learn which algorithms work and which don't, as well as how the NIST evaluation data and scoring work, so that we can further improve our systems for future tasks. In addition, we will discuss our observations, suggestions, and feedback to NIST for the upcoming 2016/2017 DS open-track evaluations in terms of datasets, tasks, evaluation metrics, timeline, and organization. | |
Jan 22 | Xiaofeng Zhou | MADlib: Big Data Machine Learning in SQL MADlib is an open-source library of scalable in-RDBMS analytics. MADlib is now moving to Apache, gaining impact in both academia and industry. MADlib supports PostgreSQL, Greenplum DB, and HAWQ. This talk is a brief introduction to MADlib and how to contribute to it. | |
Jan 15 | Clint P. George | Comparing Probabilistic, Deterministic, and Algebraic Approaches for Exploratory Data Analysis In this talk, we discuss probabilistic, deterministic, and algebraic approaches for exploratory data analysis. We consider a corpus created from the English Wikipedia and look at various document modeling schemes to model the underlying structure of the corpus. We discuss two matrix-factorization-based approaches, Latent Semantic Analysis (LSA) and Principal Component Analysis (PCA), and a popular deterministic clustering scheme, the k-means algorithm. We also discuss the most popular probabilistic topic modeling scheme, Latent Dirichlet Allocation (LDA), and how it differs from the other three schemes. Finally, we discuss the applications of document modeling in the e-discovery project and give a short summary of experimental results on a few e-discovery datasets. |
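The rule-quality statistics at the heart of the ontological pathfinding work above (Mar 25) can be illustrated with a toy computation. The sketch below scores one candidate first-order rule by its support and confidence over a handful of made-up triples; the facts, relation names, and rule are illustrative only, and the real system evaluates millions of such candidates as parallel joins with pruning.

```python
# Minimal sketch of support/confidence scoring for a candidate rule
# body: bornIn(x, y) AND cityInCountry(y, z)  =>  head: nationality(x, z)
# Toy facts are illustrative; the real system runs this as parallel
# joins over hundreds of millions of Freebase facts.

from collections import defaultdict

facts = {
    ("bornIn", "alice", "paris"), ("bornIn", "bob", "lyon"),
    ("cityInCountry", "paris", "france"), ("cityInCountry", "lyon", "france"),
    ("nationality", "alice", "france"),
}

def index(rel):
    """Map subject -> set of objects for one relation."""
    idx = defaultdict(set)
    for r, s, o in facts:
        if r == rel:
            idx[s].add(o)
    return idx

born, city = index("bornIn"), index("cityInCountry")
known = index("nationality")

# Join the two body atoms on the shared variable y, then check the head.
support, predictions = 0, 0
for x, ys in born.items():
    for y in ys:
        for z in city.get(y, ()):
            predictions += 1
            if z in known.get(x, set()):
                support += 1

confidence = support / predictions if predictions else 0.0
print(f"support={support}, confidence={confidence:.2f}")
# A pruning strategy would discard the rule here if support or
# confidence fell below a threshold, before it is ever applied.
```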
Fall 2015
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 11 | Yang Peng | Probabilistic Ensemble Fusion for Multimodal Word Sense Disambiguation With the advent of abundant multimedia data on the Internet, there have been research efforts on multimodal machine learning to utilize data from different modalities. Current approaches mostly focus on developing models that fuse low-level features from multiple modalities and learn a unified representation across modalities. But most related work fails to justify why we should use multimodal data and multimodal fusion, and few approaches leverage the complementary relation among different modalities. In this paper, we first identify the correlative and complementary relations among multiple modalities. Then we propose a probabilistic ensemble fusion model to capture the complementary relation between two modalities (images and text); a toy fusion of two per-modality posteriors appears after this table. Experimental results on the UIUC-ISD dataset show our ensemble approach outperforms approaches using only a single modality. Word sense disambiguation (WSD) is the use case we studied to demonstrate the effectiveness of our probabilistic ensemble fusion model. | |
Dec 4 | Ali Sadeghian | Mapping and Mining Arguments In this talk, Ali will give an introduction about arguments and talk about the state-of-the-art techniques of argument mapping and argument mining. | |
Nov 20 | The DSR Group | Round-Table Discussion In today's Data Science Tea, we will have a round table discussion, talking about past/current work, progress, results, research plans, and future directions. | |
Nov 13 | Miguel Rodriguez | University of Florida DSR Lab System for KBP Slot Filler Validation 2015 Abstract We present a Slot Filler Validation (SFV) system that uses a semi-supervised ensemble learning approach to aggregate the results of multiple slot fillers from the Cold Start track. We apply Bipartite Graph-based Consensus Maximization (BGCM) to combine the output of supervised stacked ensemble methods with the output of slot filling runs that can’t be trained. By using BGCM we are also able to leverage a small set of assessed fillers to increase the performance of the system. The ensemble results outperformed the best Cold Start run, the best filtered runs, and other ensemble systems. | |
Oct 30 | Yang Chen | Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases Abstract Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP), which scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule learning system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. | |
Oct 23 | Dihong Gong, Miguel Rodriguez | The Pre-Pilot Data Science Evaluation: Traffic Data Cleaning and Traffic Events Prediction Abstract We participated in the Pre-Pilot data science evaluation by NIST, which focuses on traffic data processing, including data cleaning and prediction. The traffic data contains measurements (e.g., flow, speed, and occupancy) from traffic sensors, and event reports distributed over the DC-Baltimore area. In the data cleaning task, our goal is to correct possibly erroneous flow values in the measurements. We propose to solve this problem by verifying data integrity using various constraints, such as a smoothness constraint and a measurement constraint. For the prediction task, we are asked to predict the number of traffic events in a given geographical area within a time interval of one month. We designed a regression model followed by an ensemble for the prediction task. The major motivations are 1) to use regression models to predict the number of events based on road features that have a significant impact on event occurrence; and 2) to use an ensemble method to combine the outputs of multiple regression models for enhanced prediction performance. | |
Oct 16 | Miguel Rodriguez | Knowledge Base Population Using Ensemble Learning Abstract Knowledge Base Population (KBP) is the task of extracting triples in the form of (subject, relation, object) to populate a knowledge base. English Slot Filling (ESF) and Cold Start (CS) tasks are part of the KBP effort conducted by NIST. Following the ESF task, the Slot Filler Validation (SFV) task was created in order to use the outputs of a number of individual systems attempting the ESF task and improve upon the accuracy in the aggregate. Various approaches, both supervised and unsupervised, have been applied to improve slot filler systems including entailment, truth finding, constraint optimization, majority voting and stacked ensembles. Although these methods refine the output of individual systems, they can be computationally expensive, unsuitable for ESF’s list-valued results, or require substantial data for training. We propose to apply Bipartite Graph-based Consensus Maximization (BGCM), an ensemble learning approach that combines the outputs of supervised and unsupervised models in a semi-supervised fashion. | |
Oct 2 | Dihong Gong | Multimodal Knowledge Extraction Abstract We consider the problem of semi-supervised learning to extract text categories (e.g., persons, cities) and image object bounding boxes from web pages. Starting with a handful of handcrafted seed examples for text categories and hundreds of seed images (collected from ImageNet), our system can automatically extract useful knowledge from the web. This talk pursues the thesis that, by extracting text and images jointly, extraction accuracy can be noticeably improved. To enable this multimodal extraction scheme, we propose a graphical fusion model, which combines complementary multimodal information in a unified framework. Evaluation experiments show a noticeable improvement of the proposed multimodal extraction over its single-modality counterparts. | |
Sep 25 | Yang Peng, Dr. Andrew Moore (CMU) | The BigDAWG Polystore System Abstract The BigDAWG polystore system is designed to handle large-scale analytics, real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all," it builds on top of a variety of storage engines, each designed for a specialized use case. The system provides a new view of federated databases to address the growing need for managing information that spans multiple data models. Recent Developments in Artificial Intelligence - Lessons from the Private Sector Dr. Andrew Moore https://videocast.nih.gov Dr. Andrew Moore will discuss some of the big developments in computer science from the perspective of someone crossing over from industry to academia. He will talk about roadmaps for AI-based consumer and advice products in the commercial world and contrast them with some of the potentially viable roadmaps in healthcare. Dr. Moore will also touch on entity stores (aka knowledge graphs), question answering, and ultra-large data center architectures. Please visit the event page at https://datascience.nih.gov/community/datascience-at-nih/frontiers for more information. | |
Sep 11 | The DSR Group | Round-Table Discussion In today's Data Science Tea, we will have a round table discussion, talking about past/current work, research plans, and future directions. | |
Aug 28 | Yang Chen | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing (Google) http://research.google.com/pubs/pub42851.html Abstract Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves. | |
Aug 28 | Dr. Christof Koch, Dr. Emery Brown, Dr. Michael Stonebraker | External Lectures Towards Solutions to Experimental and Computational Challenges in Neuroscience http://videocast.nih.gov/summary.asp?live=16695&bhcp=1 Abstract Dr. Christof Koch, President and Chief Scientific Officer of the Allen Institute for Brain Science, and Dr. Emery Brown, Professor of Computational Neuroscience and Health Sciences and Technology, Department of Brain and Cognitive Sciences, MIT-Harvard Division of Health Sciences and Technology, will describe the computational or experimental challenges associated with Big Data in their respective domains of neuroscience. From the basic to applied realms, science is being transformed by the collection of data at increasingly finer resolutions, both spatially and temporally. Storing, accessing, and analyzing these data create numerous challenges as well as opportunities. Please visit the event page at https://datascience.nih.gov/events/BRAIN-BD2K for more information. Michael Stonebraker 2014 ACM A.M. Turing Lecture https://www.youtube.com/watch?v=BbGeKi6T6QI Michael Stonebraker has made fundamental contributions to database systems, which are one of the critical applications of computers today and contain much of the world's important data. He is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on Ingres introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language. Stonebraker's implementations of Ingres and Postgres demonstrated how to engineer database systems that support these concepts; he released these systems as open software, which allowed their widespread adoption, and their code bases have been incorporated into many modern database systems. Since the pathbreaking work on Ingres and Postgres, Stonebraker has continued to be a thought leader in the database community and has had a number of other influential ideas including implementation techniques for column stores and scientific databases and for supporting on-line transaction processing and stream processing. |
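A minimal sketch of the ensemble-fusion idea in the Dec 11 talk above: each modality yields a posterior over word senses, and a weighted combination chooses the final sense. The sense labels, posteriors, and weights below are invented for illustration, not taken from the UIUC-ISD experiments.

```python
# Toy probabilistic ensemble fusion for word sense disambiguation:
# combine per-modality posteriors over senses with a weighted average.
# All numbers below are invented for illustration.

senses = ["bass/fish", "bass/instrument"]

# Posteriors from a text-only classifier and an image-only classifier.
p_text  = {"bass/fish": 0.35, "bass/instrument": 0.65}
p_image = {"bass/fish": 0.80, "bass/instrument": 0.20}

def fuse(posteriors, weights):
    """Weighted-average fusion; the weights could be learned from
    held-out data to capture how much to trust each modality."""
    fused = {s: sum(w * p[s] for p, w in zip(posteriors, weights))
             for s in senses}
    z = sum(fused.values())                     # renormalize
    return {s: v / z for s, v in fused.items()}

fused = fuse([p_text, p_image], weights=[0.4, 0.6])
print(max(fused, key=fused.get), fused)
# The complementary relation shows up when the modalities disagree:
# here the image evidence overrides the ambiguous text context.
```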
Summer 2015
Date | Speaker | Title | Slides |
---|---|---|---|
Aug 14 | Miguel Rodriguez, Xiaofeng Zhou | Knowledge-Base Population using Ensemble Learning of Supervised and Unsupervised Models Abstract A wide variety of techniques have been implemented to participate in the English Slot Filling (ESF) task, part of the Knowledge Base Population (KBP) effort from NIST. The Slot Filler Validation (SFV) task was created in order to use the outputs of multiple ESF systems to improve the accuracy of individual systems. Different supervised and unsupervised approaches have been used to improve slot filler systems, including entailment, constraint optimization, majority voting, and stacked ensembles. We propose the use of Consensus Maximization, an ensemble learning approach that combines the outputs of supervised and unsupervised models. Reasoning Marginal Inference Probability on Dynamic ProbKB Abstract Knowledge bases are growing rapidly; the newly assimilated facts lead to incremental changes to the Probabilistic Knowledge Base (ProbKB), which invalidate the inferred marginal probabilities for nodes in the factor graph. Facilitated by Kun’s previous work on query-time k-hop approximate inference, we investigate how incremental information influences the marginal inference probabilities on the NELL-sport dataset. | |
Jul 30 | Dihong Gong, Yang Peng | Multimodal Knowledge Base Construction Abstract One of the major tasks in knowledge base construction (KBC) is to populate category instances (e.g., "is_a") over a predefined ontology. While the state-of-the-art KBC systems (e.g., NELL, NEIL) are all based on information extraction technologies limited to a single modality, we propose to extract information in a multimodal manner. Our system adopts a never-ending learning model similar to NELL's, which repeatedly extracts new instances from a large collection of web pages and then refines and updates the extractors using the newly extracted instances. The major contributions of our project include: 1) showing that information extracted using a multimodal fusion model has higher precision than its unimodal counterparts; and 2) showing that by combining multimodal constraints, we are able to mitigate the "semantic drift" issue of never-ending learning models. | |
Jul 17 | Christan Grant | Query-Driven Statistical Analytics for Knowledge Extraction, Resolution and Inference Abstract With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms perform extra work over full data sets when the data scientist is only interested in a particular portion of the data. In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside of a relational database, where the user can use SQL to bound the scope of their algorithm. This way, computation is in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. I describe new techniques to perform inference over knowledge bases that model uncertainty for a real scenario, and its application within question answering. | |
Jun 26 | Mebin Jacob and Miguel Rodriguez | Tutorial on Docker and Server Access Abstract In today's seminar, we will have a tutorial on Docker setup, server guidelines, and the steps to host a live demo on a web server via Docker. Expanding SigmaKB with GDELT data Abstract GDELT is a project that aims to create a global dataset of events, locations, and tone by collecting news media articles from around the world. This dataset brings together the spatial and temporal dimensions of world events, adding context such as tone, i.e., the kind of language the media uses to cover an event. Such a dataset can be used to expand factual knowledge bases such as SigmaKB/YAGO that already include spatio-temporal dimensions for entities and facts. In this talk I will discuss the nature of both datasets, possible ways to integrate them, and the advantages this can bring. | |
Jun 19 | Sean Goldberg | Knowledge Base Inference: Goals and Methods Abstract In this talk, I will review and motivate marginal inference as the prevailing task when treating knowledge bases as probabilistic graphical models. To that end, there are a number of inference algorithms that balance certain tradeoffs, including level of approximation vs. scalability and feature specificity vs. expressivity. Along with Markov Logic and Path Ranking, I will discuss a number of modifications and how the tradeoffs are affected. Additionally, whether such models treat rule features as producers of knowledge or as constraints on knowledge has far-reaching effects on our intuitive understanding of the inference results. | |
Jun 12 | Miguel Rodriguez | TAC KBP 2015 - Slot Filling Validation Track Abstract The goal of Knowledge Base Population (KBP) at TAC is to promote research into, and evaluate the ability of, automated systems to discover information about named entities and incorporate this information into a knowledge source. Specifically, given a reference knowledge base, a set of attributes (slots), and a set of entities from the reference KB, the Slot Filling (SF) task consists of mining information about (entity, slot) pairs from text to complete missing slots in the reference KB. Since 2013, a new task, Slot Filler Validation, has been proposed to focus on refining the output of SF systems by applying more intense linguistic processing or combining information from multiple systems. In this talk, the datasets used in the 2014 SFV track will be discussed, and a pipeline for a stacked ensemble system that aggregates multiple SF system outputs will be presented, along with possible ways to improve it for the 2015 SFV task (a minimal sketch of such a stacked ensemble appears after this table). | |
Jun 5 | Sean Goldberg | Probabilistic Graphical Models and Knowledge Bases: A Review Abstract This talk will serve as a brief introduction to modeling knowledge using either a probabilistic graphical model or first-order logic. After reviewing basic concepts and motivating the shortcomings of both, Markov Logic will be presented as one solution to the complexity of specifying Markov random fields and the hard determinism of first-order logic. The material presented is a precursor to next week's talk on problems and possible solutions inherent in Markov Logic. | |
May 22 | The DSR Group | SigmaKB and Probabilistic KB Fusion Abstract First, we will see a live demo of the SigmaKB system by Mugdha and Jeremy. Second, we will hear a short talk on the preliminary results of a probabilistic fusion model over NELL from Miguel. |
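A minimal sketch of the stacked-ensemble pattern referenced in the Jun 12 talk above: each candidate slot filler is represented by the confidence scores of the base SF systems, and a meta-classifier trained on a few assessed fillers decides acceptance. All scores and labels below are fabricated; the actual SFV pipeline uses richer features and the BGCM combination described elsewhere in this archive.

```python
# Toy stacked ensemble for slot filler validation: one feature per
# base slot-filling system (its confidence in a candidate filler, 0 if
# it did not return the filler); a meta-classifier accepts or rejects.
# All scores and labels below are fabricated for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: candidate fillers. Columns: confidences from 3 base SF systems.
X_train = np.array([
    [0.9, 0.8, 0.0],   # two systems agree with high confidence
    [0.2, 0.0, 0.1],
    [0.7, 0.6, 0.9],
    [0.0, 0.3, 0.0],
])
y_train = np.array([1, 0, 1, 0])  # assessor judgments: correct / wrong

stacker = LogisticRegression()
stacker.fit(X_train, y_train)

# Validate new candidate fillers from the same base systems.
X_new = np.array([[0.8, 0.0, 0.7], [0.1, 0.2, 0.0]])
print(stacker.predict_proba(X_new)[:, 1])  # probability each is correct
```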
Spring 2015
Date | Speaker | Title | Slides |
---|---|---|---|
May 15 | The DSR Group | Archimedes Discussion Discussion on the Archimedes Master Probabilistic Knowledge Base: motivation, algorithms, system architecture, user interface, data sources and evaluation. | |
May 8 | Yang Chen | Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases Abstract Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP), which scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule learning system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. | |
Apr 24 | Morteza Shahriari Nia | Applying Big Data Technology to Remote Sensing for Species Identification Abstract Species identification through remote sensing provides the means to monitor biodiversity and its inter-related dynamics at large ecological scales. With the advent of NEON, a standardized protocol for data collection in a wide range of domains will be used to collect data across the continental US for over 30 years. We use big data technologies such as probabilistic knowledge bases and deep learning to incorporate expert knowledge and feature learning to enhance species identification from remote sensing data. | |
Apr 17 | Yang Peng | Multimodal Fusion and Applications Abstract Our motivation is to utilize multimodal data to achieve better performance compared to a single modality. We will first introduce two applications of multimodal data fusion: multimodal information retrieval and multimodal word sense disambiguation. The methods used to combine images and text will be explained, as well as the experimental results showing that the multimodal approaches outperform single-modality approaches. We will discuss a few different models for combining modalities and propose a promising one. | |
Apr 10 | Kushal Arora | Neural Nets and Knowledge Bases Abstract In this talk we will discuss neural network architectures applied to multi-relational data and how they are used to solve problems like inference, expansion, and reasoning over KBs. We will touch on the basic architectures used, various objective functions, and how they are applied in the context of the problems stated above. (A minimal embedding-scoring sketch in this spirit appears after this table.) | |
Apr 3 | The DSR Group | Archimedes Discussion Discussion on the Archimedes Master Probabilistic Knowledge Base: motivation, algorithms, system architecture, user interface, data sources and evaluation. | |
Mar 27 | Kun Li | In-Database Large-scale Statistical Data Analysis Kun Li's Dissertation Practice Talk Abstract Probabilistic knowledge bases are constantly incorporating new knowledge learned from the web. With each incremental change to the KB, the naive approach to answering a marginal query is to re-run the inference algorithm (e.g., Gibbs sampling, MC-SAT), which is time consuming. We present an approach to approximate marginal inference and show that we can answer a query an order of magnitude faster with negligible error. | |
Mar 20 | Michael J. Franklin with DSR Group | Round-Table Discussion | |
Mar 13 | Christan Grant | Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources Abstract In this talk, I will be discussing a recent paper from the Knowledge Vault team at Google. In this new paper, the researchers investigate the use of facts extracted from the web as a signal for search result ranking. They build upon the previously published Knowledge Vault system to collectively model both factual errors and extraction errors in the corpora. I will discuss their techniques and present their findings. This paper, presumably, has been submitted but not yet accepted for publication. | |
Feb 27 | The DSR Group | Archimedes Round Table Discussion We discuss the Archimedes KB project and its sub-projects. | |
Feb 13 | Kushal Arora | Introduction to Deep Learning and Theano Abstract In this talk we will cover the basics of neural networks, starting with basic logistic regression, multilayer perceptrons, and autoencoders, up to deep architectures like stacked autoencoders. In addition, we will discuss the basics of the Theano framework and implementations of the discussed architectures. | |
Feb 6 | Christan Grant | Question Answering over Probabilistic KBs Abstract Question answering systems allow humans to ask questions in natural language, and the system responds with an answer in a human-recognizable way. There has been renewed interest recently in developing QA systems using knowledge graphs. In this talk, I will discuss the development of an in-house system over probabilistic knowledge bases. A probabilistic KB aims to add a trustworthiness score to traditional QA systems. I will address both the motivation for and the progress of this work. | |
Jan 23 | Sean Goldberg | Rule Learning and Inference in Knowledge Bases 2 Abstract First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. This talk explores the divide between these two approaches to the problem of fact inference and what the space of approximations somewhere in the middle may be. | |
Jan 16 | Yang Chen | Rule Learning and Inference in Knowledge Bases 1 Abstract First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. In order to scale to web-scale knowledge bases, we describe a new algorithm that scales association rule mining to today's KBs with billions of facts. | |
Jan 9 | Kun Li | In-RDBMS Query-Time Inference over Large Factor Graphs Abstract Probabilistic knowledge bases are constantly incorporating new knowledge learned from the web. With each incremental change to the KB, the current approach to answering a marginal probability query is to re-run the inference algorithm (e.g., Gibbs sampling), which is time consuming. We present an approach to approximate marginal inference and show that we can answer a query an order of magnitude faster with negligible error. |
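One concrete instance of the embedding models surveyed in the Apr 10 talk above is TransE (Bordes et al., 2013), which scores a triple by how well the tail embedding matches head plus relation. The sketch below uses random vectors as stand-ins for learned embeddings, so the printed ranking is arbitrary; it only illustrates the scoring function and is not necessarily a model covered in the talk.

```python
# Minimal TransE-style scoring sketch: a fact (h, r, t) is plausible
# when h + r ~ t in embedding space, so a smaller distance means a
# more plausible triple. Vectors here are random stand-ins.

import numpy as np

rng = np.random.default_rng(0)
dim = 16
entities  = {e: rng.normal(size=dim) for e in ["paris", "france", "tokyo"]}
relations = {r: rng.normal(size=dim) for r in ["capitalOf"]}

def score(h, r, t):
    """Negative L2 distance between (head + relation) and tail."""
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# Rank candidate tails for the query capitalOf(paris, ?).
for t in ["france", "tokyo"]:
    print(t, score("paris", "capitalOf", t))
# Training would adjust the embeddings so true triples outscore
# corrupted ones by a margin; with random vectors the ranking above
# is arbitrary.
```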
Fall 2014
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 19 | Dr. Daisy Zhe Wang | Super Knowledge Base -> ArchMind -> Archimedes!! Abstract In this talk, I discuss system and algorithmic components that we are designing and building at UF to enable a master Knowledge Base (KB). I will also discuss many research directions we are exploring at the UF Data Science Research (DSR) group, including: query-driven inference and sampling, probabilistic knowledge bases, state-parallel and data-parallel data analytics frameworks, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research is of high impact and has received funding from industry as well as the federal government, including DARPA, EMC, Amazon, and Google. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Vault. | |
Dec 12 | Dr. Kevin Dong | Design for Emotion Abstract This talk introduces current research activities in the Interaction Design Lab of Shanghai Jiao Tong University and the portfolios developed under the principle of “Form Follows Emotion”. Bio Kevin Dong is an assistant professor of interaction design at Shanghai Jiao Tong University, China. He received his doctoral degree from the College of Computer Science at Zhejiang University. He is the principal investigator of several government-funded projects, including: Universal Interaction Design of Digital-TV under an Aging Society; and Relationship between Customers’ Participation and User Experience of Customized Products. Aside from government-funded projects, Kevin also leads enterprise projects, including: A New Automobile Navigation Interface Design Based on Touch Panel & Knob. Currently, Kevin is a visiting scholar with the HCI group at the University of Florida; his research focuses on user-centered design and emotional design, which studies users’ perception of, response to, and feelings about products and interfaces. | |
Dec 5 | Christan Grant | Query-Driven Text Analytics Abstract With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Databases now store a mix of structured and unstructured data, so to support text analytics, queries over disparate data types cannot be an oversight. In this proposal I introduce query-driven text analytics: the use of declarative semantics (a query) to decrease the amount of processing in analytic systems without a major sacrifice in accuracy. I demonstrate this in three ways. First, I add text analytics inside of a parallel relational DBMS, where the user can use SQL and UDFs to choose the scope of their algorithm. Second, I alter a data mining algorithm so it uses an example query to drive computation. Finally, I propose an integrated question answering system over the different parts of the web. | |
Nov 14 | Dihong Gong | A Text-Image Search Engine for Online Shopping Abstract Text-based document classification and image retrieval are two of the most fundamental problems in data science. In the era of big data, how to efficiently search text documents and images while guaranteeing good accuracy has been one of the most interesting topics. In our project, we propose to combine these two topics for enhanced performance and possibly new applications. Currently, we have built an online search engine that has demonstrated state-of-the-art accuracy on the Oxford Buildings Dataset. In our presentation, we will focus on technical details from several aspects, including data collection, system implementation, and search algorithms. In the meantime, we will also introduce our software package for highly scalable approximate k-means clustering with OpenMPI support. | |
Nov 7 | Ian Perera | Grounding Symbols as Children Do Abstract While we train our computer vision systems with a series of images and labels, it is clear that children do not learn language this way. They are faced with a large variety of objects and behaviors visible at once, and must pull references from a jumble of words as they are still learning grammar. And yet, with a number of sometimes unintuitive learning strategies, they seem to be able to learn language grounded in their experiences faster than our top-of-the-line object recognition systems. With the field of symbol grounding becoming more popular, work at the intersection of computer vision, natural language understanding, and cognitive science is poised to discover more complete and efficient ways of learning grounded language in AI systems. Advances in grounded language learning can be applied to scene description, dialogue systems, knowledge representation, and other fields. In this talk, I will cover our work so far on SALL-E, a system that uses child language learning strategies and pragmatic inference to perceptually ground language from video demonstrations. I will also cover the challenges we faced along the way and the precautions one must take to truly create a grounded language system. | |
Oct 24 | Morteza Shahriari Nia | Hyperspectral Classification of Savannah Tree Species Using k-fold Cross-Validated Non-linear SVM and MESMA Abstract Identifying savannah species at ecological scale is a major milestone in measuring biomass and carbon reserves and in predicting drought and invasive species spread. In this talk we perform classification and geo-mapping of tree species from hyperspectral imagery collected using AVIRIS airborne sensors. We provide a thorough comparison of the effects of ATCOR and FLAASH atmospheric corrections on prediction accuracy. This study classifies common savannah tree species in the Ordway-Swisher Biological Station in north-central Florida, USA. Species classification was performed using a variety of Support Vector Machine kernels at both the pixel level and the canopy level, where the polynomial kernel outperformed the others. We also evaluate MESMA (Multiple Endmember Spectral Mixture Analysis), build a spectral library, and examine the results. In addition, we look into LiDAR (Light Detection and Ranging) airborne data and find interesting patterns in species heights. All this information, along with added expert knowledge available online, such as the USDA Plants database and many other resources, can lead to a much more informed classification of species. | |
Oct 10 | Ishan Patwa | Word Sense Disambiguation through Images Abstract The automatic disambiguation of word senses is of growing interest in the natural language processing community. The use of images to disambiguate short text with limited context is an important intermediary step in many natural language processing tasks. We are going to review our proposed method for solving the WSD problem and possible improvements on our preliminary results. | |
Oct 3 | Yang Peng | Large Scale Image Retrieval System Abstract Building a large-scale image retrieval system is a big challenge because of the rapid growth in the number of images on the web today. In this presentation, we will first give a brief introduction to image retrieval systems. Then we will show our own pipeline design for handling large-scale image retrieval using advanced parallel data processing systems, including Hadoop and Mahout. We will also talk about the severe challenges in scaling the system up and how to solve them. Finally, we will discuss our results and next steps. | |
Oct 2 | Pawel Terlecki (Tableau Software) | An analytic data engine for visualization in Tableau Abstract The talk covers the history, architecture, and capabilities of the Tableau Data Engine. It is an in-house columnar database based on the MonetDB design and developed specifically to support users with mid-size data sets and no efficient analytic back-ends. We cover important components and design decisions, and give an overview of how industrial projects of this size start and evolve. Short Bio Pawel leads the query team at Tableau. His responsibilities include the vision, design, and implementation of various query processing elements of the Tableau visualization platform. One can find his contributions in the Tableau Data Engine, caching infrastructure, and data extraction. Prior to Tableau he worked on business applications, web frameworks, database servers, in particular MS SQL Server, and data mining projects. He holds a PhD in Computer Science from Warsaw University of Technology, with a specialization in information systems and knowledge discovery, and a BS in Economics from Warsaw University. He has published several works on databases and data mining and is a frequent attendee of major conferences in these fields. Performance and building reliable solutions are his passions. | |
Sep 26 | Yang Peng | Word Sense Disambiguation using Images in Social Networks Abstract In social networks, there are several challenges for word sense disambiguation, including short context and little annotation/knowledge. Although textual information is limited, we can use multi-modal data, including images, to help disambiguate word senses. We are going to review related work and propose new methods that use multi-modal data to solve WSD problems. | |
Sep 19 | Sean Goldberg | Comparing Markov Logic to other Rule Learning Approaches Abstract Markov Logic Networks (MLNs) combine the domains of first-order logic and statistical probability by attaching weights to first-order formulas or rules. This talk will serve as an introduction to intuitively understanding MLNs, particularly how they perform inference and learn weights and structure. MLN structure learning is equivalent to weighted inference rule learning, and comparisons will be drawn with association rule mining metrics. (A small worked example of MLN semantics appears after this table.) | |
Sep 12 | Yang Chen | Rule Mining in Large Knowledge Bases Abstract Recent years have seen a tremendous research interest in knowledge base construction. These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, all existing knowledge bases are incomplete. As one potential solution to knowledge expansion, we study the problem of rule mining in such knowledge bases. In this talk, I will survey the state-of-the-art rule mining algorithms and report potential research directions, our progress, and our contributions toward the rule-based solution of the knowledge expansion problem. | |
Sep 5 | Sean Goldberg | Fact Inference through Rule Learning in Knowledge Bases: A Review Abstract Many current large-scale knowledge bases (KBs) are highly incomplete either due to errors in the construction process or because the knowledge is implicit as opposed to explicit. For example, 93.8% of people in Freebase have no birthplace and 78.5% have no nationality. The construction of inference rules from mining repeatable patterns in the KB has the potential to contribute additional knowledge to the KB. In this talk I will outline the most recent attempts at mining both structured and unstructured data for inference rules and elucidate similarities in methodologies and algorithms. Finally, I will present some ideas for future contributions to this nascent field. | |
Aug 29 | Dr. Juan Gilbert | Applications Quest: A Nominal Population Metric Approach to Diversity in Admissions Abstract In 2003, two landmark cases challenged the University of Michigan's admissions policies, one focused on Law School admissions and the other on undergraduate admissions. In Grutter v. Bollinger, which focused on the Law School, the U.S. Supreme Court ruled 5-4 in favor of the Law School. However, in Gratz v. Bollinger, by a vote of 6-3, the Court reversed in part, rejecting the University's undergraduate admissions policy of providing points for race/ethnicity. Therefore, the Court decided that race could be considered in admissions decisions, but could not be the deciding factor. Later, Michigan residents voted to adopt a ban on racial and gender preferences through Proposal 2. In 2007, the U.S. Supreme Court heard two cases on race-conscious school placement policies in Louisville and Seattle; the Court struck down both programs. In 2013, the U.S. Supreme Court heard another case on this very topic, Fisher v. Texas. In the Fisher case, the U.S. Supreme Court sent the case back to the Fifth Circuit, citing that the case had not passed strict scrutiny. Applications Quest is a data mining tool that provides preference-free, holistic diversity using a patented nominal population metric. In this talk, Dr. Gilbert will discuss the legal implications of Applications Quest and the nominal population metric. Short Bio Dr. Juan E. Gilbert is the Andrew Banks Family Preeminence Endowed Chair and the Associate Chair of Research in the Computer & Information Science & Engineering Department at the University of Florida, where he leads the Human Experience Research Lab. He is also a Fellow of the American Association for the Advancement of Science, a National Associate of the National Research Council of the National Academies, an ACM Distinguished Scientist, and a Senior Member of the IEEE. Dr. Gilbert was recently named one of the 50 most important African-Americans in Technology. Long Bio Dr. Juan E. Gilbert is the Andrew Banks Family Preeminence Endowed Chair and the Associate Chair of Research in the Computer & Information Science & Engineering Department at the University of Florida, where he leads the Human Experience Research Lab. Dr. Gilbert has research projects in spoken language systems, advanced learning technologies, usability and accessibility, Ethnocomputing (Culturally Relevant Computing), and databases/data mining. He has published more than 140 articles, given more than 200 talks, and obtained more than $24 million in research funding. He is a Fellow of the American Association for the Advancement of Science. In 2012, Dr. Gilbert received the Presidential Award for Excellence in Science, Mathematics, and Engineering Mentoring from President Barack Obama. He was recently named one of the 50 most important African-Americans in Technology. He was also named a Speech Technology Luminary by Speech Technology Magazine and a national role model by Minority Access Inc. Dr. Gilbert is also a National Associate of the National Research Council of the National Academies, an ACM Distinguished Scientist and a Senior Member of the IEEE. Recently, Dr. 
Gilbert was named a Master of Innovation by Black Enterprise Magazine, a Modern-Day Technology Leader by the Black Engineer of the Year Award Conference, and the Pioneer of the Year by the National Society of Black Engineers, and he received the Black Data Processing Association (BDPA) Epsilon Award for Outstanding Technical Contribution. In 2002, Dr. Gilbert was named one of the nation's top African-American Scholars by Diverse Issues in Higher Education. In 2013, the Black Graduate and Professional Student Association at Auburn University named their Distinguished Lecture Series in honor of Dr. Gilbert. Dr. Gilbert testified before Congress on the Bipartisan Electronic Voting Reform Act of 2008 for his innovative work in electronic voting. In 2006, Dr. Gilbert was honored with a mural painting in New York City by City Year New York, a non-profit organization that unites a diverse group of 17 to 24 year-old young people for a year of full-time, rigorous community service, leadership development, and civic engagement. Photos https://www.dropbox.com/sh/68jz6slz6qxe9my/udWuVUMIUJ |
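The Markov Logic semantics discussed in the Sep 19 talk above can be made concrete with a micro-example: one weighted ground rule, two ground atoms, and four possible worlds, with P(world) proportional to exp(w * n), where n counts satisfied rule groundings. The rule and weight below are invented for illustration.

```python
# Worked micro-example of Markov Logic semantics: one weighted rule
# Smokes(A) => Cancer(A) with weight w, two ground atoms, four worlds.
# P(world) is proportional to exp(w * n), where n counts satisfied
# groundings. The rule and weight are invented for illustration.

import itertools, math

w = 1.5  # illustrative rule weight

def n_satisfied(smokes, cancer):
    # The implication has one grounding; it is violated only when
    # Smokes(A) is true and Cancer(A) is false.
    return 0 if (smokes and not cancer) else 1

worlds = list(itertools.product([False, True], repeat=2))
weights = {wrld: math.exp(w * n_satisfied(*wrld)) for wrld in worlds}
Z = sum(weights.values())  # partition function

for (smokes, cancer), wt in weights.items():
    print(f"Smokes={smokes!s:5} Cancer={cancer!s:5}  P={wt / Z:.3f}")
# Unlike hard first-order logic, the rule-violating world keeps nonzero
# probability; raising w pushes it toward zero without determinism.
```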
Summer 2014
Date | Speaker | Title | Slides |
---|---|---|---|
Jun 20 | Kun Li | Large-Scale Graph Processing Systems Cont. Topics include GraphX and X-Stream. | |
Jun 13 | Kun Li | Large-Scale Graph Processing Systems We had a discussion on different systems for large-scale graph processing and the pros and cons of each. The systems discussed include GraphLab, distributed GraphLab, GraphChi, PowerGraph, GraphX, and GIST. | |
May 30 | Yang Chen | Knowledge Expansion over Probabilistic Knowledge Bases Abstract Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality. | |
May 23 | Xiaofeng Zhou, Morteza Shahriari Nia | Exploring Netflow Data using Hadoop We explore Netflow dataset analysis in Hadoop and characterize its performance. Hyper-spectral Classification of Savannah Tree Species Using k-fold Cross-Validated Non-linear Support Vector Machines In this paper we classify savannah tree species using AVIRIS hyper-spectral images; the pre-processing performed dramatically increased classification accuracy. | |
May 16 | Kun Li | In-RDBMS Large-scale Statistical Analysis Abstract Organizations such as companies, governments, and hospitals heavily rely on relational database management systems (RDBMSs) to store large amounts of structured and unstructured data. A deep analysis of the data stored in a database helps discover useful information, suggest conclusions, and support decision making. It helps companies make the next best decision, enables doctors to better assess their patients, and relieves lawyers of document review burdens. However, a deep and comprehensive understanding of data requires various machine learning algorithms and statistical methods. Several challenges exist in using state-of-the-art systems to perform analysis on data residing in an RDBMS. First, an expensive big-data transfer cost must be paid up front to move data between databases and external analytics systems. Second, many popular statistical packages do not scale up to production-sized datasets. Thus, enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers' pressing demands, researchers and database vendors have been pushing advanced analytics techniques into databases. This thesis makes two major contributions to the in-database analytics community. First, it contributes an in-RDBMS statistical text analysis package and introduces GPText, the Greenplum parallel statistical text analysis framework, which seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib. Second, it presents the GIST operator for large-scale statistical inference to address the limitations of current RDBMSs. The two contributions are summarized in the following two paragraphs. MADlib Text Analytics and GPText Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, governments, and hospitals, in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. We bring statistical text analysis power into MADlib, a state-of-the-art in-database analytics package that can be installed in PostgreSQL and Greenplum. We developed and contributed a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition, and we show an elegant in-RDBMS parallel implementation of CRF that achieves sub-linear scalability. We introduce GPText, the Greenplum parallel statistical text analysis framework, which seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib, and we describe an eDiscovery application built on the GPText framework. GIST: An Operator for Large Scale Statistical Inference Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some amount of setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This thesis presents the General Iterative State Transition (GIST), an RDBMS operator for large-scale inference. 
GIST is an operator that receives a state generated by a UDA and then performs rounds of transitions on the state until the state has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem, and produces a solution that is an order of magnitude faster than the state of the art for the second problem. (A minimal sketch of the UDA-plus-GIST iterate-until-convergence pattern appears after this table.) |
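A minimal sketch of the UDA-plus-GIST pattern from the May 16 abstract above: an aggregation pass builds a state, and a generic operator applies transition rounds until convergence. The PageRank-style score vector here is only a stand-in for the large inference states (e.g., Gibbs sampling over a factor graph) that GIST targets; nothing below is GIST's actual API.

```python
# Sketch of the UDA + GIST pattern: an aggregate builds the initial
# state, then a generic operator runs transition rounds to convergence.
# The PageRank-style state is a stand-in for GIST's large inference
# states; this is an illustrative analogue, not the real operator.

def uda_build_state(edges, n):
    """UDA analogue: a single pass over tuples builds the state."""
    out = [[] for _ in range(n)]
    for src, dst in edges:
        out[src].append(dst)
    scores = [1.0 / n] * n
    return out, scores

def gist_transition(out, scores, damping=0.85):
    """One transition round over the whole state."""
    n = len(scores)
    new = [(1 - damping) / n] * n
    for src, targets in enumerate(out):
        share = damping * scores[src] / max(len(targets), 1)
        for dst in targets:
            new[dst] += share
    return new

def gist(out, scores, tol=1e-8):
    """GIST analogue: iterate transitions until the state converges."""
    while True:
        new = gist_transition(out, scores)
        if sum(abs(a - b) for a, b in zip(new, scores)) < tol:
            return new
        scores = new

out, scores = uda_build_state([(0, 1), (1, 2), (2, 0), (2, 1)], n=3)
print(gist(out, scores))
```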
Spring 2014
Date | Speaker | Title | Slides |
---|---|---|---|
Apr 18 | Xiaofeng Zhou, Sahil Puri | A Short Introduction to SciDB The presentation first briefly introduces SciDB, with its architecture and array processing, then focuses on the work on NEON image import/export in SciDB. Knowledge Feedback on Prediction of Post-operative Outcomes A study was conducted to establish the requirements of an algorithm for predicting post-operative outcomes, in collaboration with UF Health. We will discuss the methodology used in this study along with a demo of the software used. The presentation will focus on the experimental data collected in the most recent version of this study and the analysis/results derived from it. eDiscovery This presentation will provide a brief description of the project "SMARTeR", being developed for document retrieval in collaboration with UF Law. We will focus on an overview of the algorithm developed and give a demo of the software that will be provided to the law school. A comparison will be presented detailing the advantages of the algorithm over existing document retrieval techniques. | |
Apr 11 | Sethuraman Sundararaman, Parthasarathy Srinivasan, Kushal Arora | Master's Projects Showcase Sethuraman Sundararaman: NLP on mobile phones. Parthasarathy Srinivasan: Efficient Representation of Large KBA Text Corpus. Kushal Arora: KB Integration. | |
Apr 4 | Kun Li | GIST: An Operator for Large Scale Statistical Inference Abstract Enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers’ pressing demands, database vendors have been pushing advanced analytics techniques into databases. Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some amount of setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This talk presents the General Iterative State Transition (GIST), an RDBMS operator for large-scale inference. GIST is an operator that receives a state generated by a UDA and then performs rounds of transitions on the state until the state has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem and produces a solution that is an order of magnitude faster than the state of the art for the second problem. | |
Mar 28 | Yang Chen | VisKB: Interactive Visualization of Web-Scale Knowledge Graphs Abstract Knowledge graphs are becoming the next big goal for the web, and researchers have devised various ways to construct them. However, the user interfaces for knowledge bases remain limited. In this talk, we present VisKB, a visual search engine that allows users to interactively query and explore web-scale knowledge graphs. VisKB visualizes only the part of the knowledge graph relevant to user queries and allows users to interact with the visualization to express further queries that expand it. In this way, VisKB avoids visualizing the entire graph without losing information. Using DBPedia as the data source, we show that it helps users discover interesting properties and relationships of the entities they are interested in. | |
Mar 21 | Vipin Kumar (University of Minnesota) | Understanding Climate Change: Opportunities and Challenges for Data Driven Research Abstract Climate change is the defining environmental challenge facing our planet, yet there is considerable uncertainty regarding its social and environmental impact due to the limited capabilities of existing physics-based models of the Earth system. This talk will present an overview of research being done in a large interdisciplinary project on the development of novel data-driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results. Speaker Bio Vipin Kumar is currently William Norris Professor and Head of Computer Science and Engineering at the University of Minnesota. His research interests include high-performance computing and data mining, and he is currently leading an NSF Expeditions project on understanding climate change using data-driven approaches. He has authored over 250 research articles and co-edited or coauthored 10 books, including the widely used textbooks "Introduction to Parallel Computing" and "Introduction to Data Mining", both published by Addison-Wesley. Kumar co-founded the SIAM International Conference on Data Mining and served as a founding co-editor-in-chief of the Journal of Statistical Analysis and Data Mining (an official journal of the American Statistical Association). Kumar is a Fellow of the ACM, IEEE, and AAAS. He received the Distinguished Alumnus Award from the Indian Institute of Technology (IIT) Roorkee (2013), the Distinguished Alumnus Award from the Computer Science Department, University of Maryland College Park (2009), and the IEEE Computer Society's Technical Achievement Award (2005). Kumar's foundational research in data mining and its applications to scientific data was honored by the ACM SIGKDD 2012 Innovation Award, which is the highest award for technical excellence in the field of Knowledge Discovery and Data Mining (KDD). | |
Mar 14 | Morteza Shahriari Nia | Building Data Storage, Retrieval and Analysis Platform for Ecological Research at Continental Scale Our goal is to build a platform for data analysis over massive amounts of ecological data, centered around remote sensing data such as hyperspectral and lidar data, for continental-scale ecological research and applications such as climate change and invasive species identification. We will specifically talk about applying state-of-the-art machine learning techniques to remote sensing data, where the goal is species classification of plants (a toy illustration appears after this table). We will also discuss existing platforms that allow scientists to easily share and query data. | |
Feb 28 | Kushal Arora | Universal Knowledge Base Ontology alignment of multiple knowledge bases to create a universal knowledge base with integrated schema and entities. This work is based on the PIDGIN paper from CMU: ontology alignment using web text as interlingua. | |
Feb 21 | Jingtao Wang (U. Pittsburgh) | MindMiner: A Mixed-Initiative Interface for Interactive Distance Metric Learning Abstract Cluster analysis is a common task in exploratory data mining that involves combining entities with similar properties into groups. However, most clustering techniques face one key challenge when used in real-world applications: the algorithms expect a quantitative, deterministic distance function to quantify the similarity between two entities, whereas in most real-world problems such similarity measurements require subjective domain knowledge that can be hard for users to explain. In this talk, we present MindMiner, a mixed-initiative interface and visualization system for capturing subjective similarity measurements via a combination of new interaction techniques and machine learning algorithms. MindMiner collects qualitative, hard-to-express similarity measurements from users via active polling with uncertainty and example-based visual constraint creation. MindMiner also formulates human prior knowledge into a set of inequalities and learns a quantitative similarity distance metric via convex optimization (a generic sketch of this idea appears after this table). In a 12-subject peer-review understanding task, we found that MindMiner was easy to learn and use, and could capture users' implicit knowledge about writing performance and cluster target entities into groups that matched subjects' mental models. We also found that MindMiner's constraint suggestions and uncertainty polling functions could improve both the efficiency and the quality of clustering. Speaker Bio Dr. Jingtao Wang is an Assistant Professor in Computer Science and the Learning Research and Development Center (LRDC) at the University of Pittsburgh. His primary research direction is Human-Computer Interaction (HCI). Jingtao's current research interests include mobile interfaces, education/learning technology, end-user programming, and machine learning and its applications in HCI. He received his Ph.D. degree in computer science from the University of California, Berkeley. Before that, Jingtao was a researcher and team lead at the IBM China Research Lab, working on large-vocabulary, online handwriting recognition technologies for Asian languages. He received his master's and bachelor's degrees from Xi'an Jiaotong University, China. | |
Feb 14 | Christan Grant | Query-Driven Statistical Text Analysis | |
Feb 7 | Yang Chen | Knowledge Expansion over Probabilistic Knowledge Bases Abstract Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer hidden facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows efficient SQL-based inference algorithms for knowledge expansion that apply inference rules in batches (a toy illustration appears after this table); 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality. | |
Jan 24 | Yang Peng, Morteza Shahriari Nia | Image Analysis and Knowledge Base Construction Yang Peng and Morteza Shahriari Nia will give short talks on their image analysis projects. We will also watch a TED talk by Greg Asner on remote sensing for ecological research. http://hyspeedblog.wordpress.com/2013/12/02/conservation-technology-mapping-our-environment-using-the-carnegie-airborne-observatory | |
Jan 17 | Armando Fox (UC Berkeley) | Using MOOCs to Reinvigorate Software Engineering Education Abstract The spectacular failure of the Affordable Care Act website ("Obamacare") has focused public attention on software engineering. Yet experienced practitioners mostly sighed and shrugged, because the historical record shows that only 10% of large (>$10M) software projects using conventional methodologies such as Waterfall are successful. In contrast, Amazon and others successfully build comparably large and complex sites with hundreds of integrated subsystems by using modern agile methods and service-oriented architecture. This contrast is one reason industry has complained that academia ignores vital software topics, leaving students unprepared upon graduation. In too many courses, well-meaning instructors teach traditional approaches to software development that are neither supported by tools students can readily use nor appropriate for projects whose scope matches a college course. Students respond by continuing to build software more or less the way they always have, which is boring for students, frustrating for instructors, and disappointing for industry. This talk explains how the confluence of cloud computing and Massive Open Online Courses (MOOCs) has allowed us to greatly improve both the effectiveness and the reach of UC Berkeley's undergraduate software engineering course. The shift toward Software as a Service has not only revolutionized the future of software but changed it in a way that makes it easier and more rewarding to teach. UC Berkeley's revised Software Engineering course leverages this productivity to let students both enhance a legacy application and develop a new app that meets the requirements of non-technical customers. By experiencing the whole software life cycle repeatedly within a single college course, and by using the same tools and techniques that professionals use, students actually use and learn to appreciate the skills that industry has long encouraged. The course is now popular with students, rewarding for faculty, and praised by industry. The technology developed for the course has also been used to offer a subset of the material as a MOOC to hundreds of thousands of students and, through an arrangement with edX, is available to classroom instructors interested in trying this approach as a SPOC (Small Private Online Course), offering instructor support far beyond what is usually available for traditional textbooks. Indeed, our experience has been that despite recent hand-wringing about MOOCs destroying higher education, appropriate use of MOOC technology can improve on-campus pedagogy, increase student throughput while raising course quality, and even reinvigorate faculty teaching. Speaker Bio Armando Fox (fox@cs.berkeley.edu) is a Professor in Berkeley's Electrical Engineering & Computer Science Department as well as the Faculty Advisor to the UC Berkeley MOOCLab. He co-designed and co-taught Berkeley's first Massive Open Online Course on Engineering Software as a Service, currently offered through edX, through which over 10,000 students worldwide have earned certificates of mastery. He also serves on edX's Technical Advisory Committee, helping to set the technical direction of their open MOOC platform. With colleagues in Computer Science and in the School of Information, he is doing research in online education, including automatic grading of students' computer programs and improving student engagement and learning outcomes in MOOCs. His other computer science research in the Berkeley ASPIRE project focuses on highly productive parallel programming. While at Stanford he received teaching and mentoring awards from the Associated Students of Stanford University, the Society of Women Engineers, and Tau Beta Pi Engineering Honor Society. He has been a "Scientific American Top 50" researcher, an NSF CAREER award recipient, a Gilbreth Lecturer at the National Academy of Engineering, a keynote speaker at the Richard Tapia Celebration of Diversity in Computing, and an ACM Distinguished Scientist. In previous lives he helped design the Intel Pentium Pro microprocessor and founded a successful startup to commercialize his UC Berkeley Ph.D. research on mobile computing. He received his other degrees in electrical engineering and computer science from MIT and the University of Illinois. He is also a classically trained musician and performer, an avid musical theater fan and freelance Music Director, and a bilingual/bicultural (Cuban-American) New Yorker living in San Francisco. | |
Jan 10 | Christan Grant | Universal Schema Discussion The discussion is based on the following papers: Relation Extraction with Matrix Factorization and Universal Schemas http://people.cs.umass.edu/~lmyao/papers/univ-schema-tacl.pdf ; Universal Schema for Entity Type Prediction http://people.cs.umass.edu/~lmyao/papers/unary_akbc13.pdf [a short paper with a good summary] ; Latent Relation Representations for Universal Schemas http://arxiv.org/pdf/1301.4293v2.pdf (a minimal factorization sketch in the universal-schema style appears after this table). | |
Jan 3 | Anastasia Ailamaki (EPFL) | Efficient Exploration of Big Brain Data Abstract Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As data management software proves inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. With the exponential growth of dataset size and complexity, however, application-specific systems no longer scale to efficiently analyse the relevant parts of their data, thereby slowing down the cycle of analysing, understanding, and preparing new experiments. I will illustrate the problem with a challenging application on brain simulation data and will show how the problems from neuroscience translate into challenges for the data management community. I will show how novel data management technology can enable today's neuroscientists to simulate and discover a meaningful percentage of the human brain at unprecedented levels of detail. Finally, I will describe the challenges of integrating simulation and medical neuroscience data to advance our understanding of the functionality of the brain. Speaker Bio Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland. Her research interests are in database systems and applications, in particular (a) strengthening the interaction between database software and emerging hardware and I/O devices, and (b) automating database management to support computationally demanding, data-intensive scientific applications. She has received an ERC Consolidator Award (2013), a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), eight best-paper awards at top conferences (2001-2011), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a senior member of the IEEE and a member of the ACM, serves as the ACM SIGMOD vice chair, and has also been a CRA-W mentor. | |
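The UDA-plus-GIST execution model from Kun Li's talk (Apr 4) can be made concrete with a minimal sketch: an aggregate folds tuples into an initial state, and a GIST-style operator runs transition rounds over that state until convergence. The class and function names below are illustrative assumptions, not the actual RDBMS operator API from the talk.

```python
# Minimal sketch of the UDA + GIST pattern (hypothetical API, not the
# actual RDBMS operator interface described in the talk).
import random

class MeanUDA:
    """UDA stand-in: folds tuples into an initial inference state."""
    def __init__(self):
        self.values = []

    def accumulate(self, tup):
        self.values.append(tup)

    def terminate(self):
        # The state handed to GIST: the raw values plus an initial estimate.
        return {"data": self.values, "mu": 0.0}

def gist(state, transition, converged, max_rounds=1000):
    """GIST stand-in: apply a state transition repeatedly until convergence."""
    for _ in range(max_rounds):
        new_state = transition(state)
        if converged(state, new_state):
            return new_state
        state = new_state
    return state

# Example transition: one damped step toward the sample mean, a toy
# stand-in for an MCMC or belief-propagation round.
def step(state):
    data, mu = state["data"], state["mu"]
    target = sum(data) / len(data)
    return {"data": data, "mu": mu + 0.5 * (target - mu)}

uda = MeanUDA()
for x in [random.gauss(3.0, 1.0) for _ in range(100)]:
    uda.accumulate(x)

result = gist(uda.terminate(), step,
              lambda a, b: abs(a["mu"] - b["mu"]) < 1e-9)
print(result["mu"])
```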
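As a toy illustration of the remote-sensing task in the Mar 14 talk, the sketch below trains an off-the-shelf classifier on synthetic "hyperspectral" pixel vectors labeled by species. The band count, species names, and choice of scikit-learn are assumptions for illustration only; the talk does not specify a particular model.

```python
# Toy species classification over synthetic hyperspectral pixels
# (illustrative only; real pipelines use calibrated reflectance bands).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bands = 224                       # assumed band count, AVIRIS-like
species = ["pine", "oak", "palm"]   # hypothetical classes

# Each species gets a distinct mean spectrum plus noise.
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(200, n_bands))
               for i in range(len(species))])
y = np.repeat(species, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```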
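The computational core of MindMiner (Feb 21), turning pairwise user feedback into inequality-style constraints and solving for a distance metric, can be sketched generically as follows. This diagonally weighted metric with a hinge penalty is an assumed stand-in, not MindMiner's actual convex formulation.

```python
# Sketch: learn a diagonal distance metric from must-link / cannot-link
# pairs via constrained optimization (generic stand-in for MindMiner).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                  # 20 entities, 4 features
similar = [(0, 1), (2, 3), (4, 5)]            # "these belong together"
dissimilar = [(0, 10), (2, 12), (4, 14)]      # "these do not"

def sq_dist(w, i, j):
    d = X[i] - X[j]
    return float(np.dot(w, d * d))            # weighted squared distance

def objective(w):
    # Pull similar pairs close; hinge-penalize dissimilar pairs that
    # fall inside a unit margin.
    pull = sum(sq_dist(w, i, j) for i, j in similar)
    push = sum(max(0.0, 1.0 - sq_dist(w, i, j)) for i, j in dissimilar)
    return pull + 10.0 * push

w0 = np.ones(X.shape[1])
res = minimize(objective, w0, bounds=[(0, None)] * X.shape[1])
print("learned feature weights:", res.x)
```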
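ProbKB's batch rule application (Feb 7) can be illustrated with a single SQL join that derives every consequence of a Horn rule at once, instead of inferring fact-at-a-time. The schema and the example rule are assumptions for illustration; the real system runs on an MPP database with probabilistic bookkeeping, not SQLite.

```python
# Sketch of batch rule application: one SQL self-join derives all new
# facts for a Horn rule in a single statement (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts (rel TEXT, subj TEXT, obj TEXT);
INSERT INTO facts VALUES
  ('bornIn',    'alice', 'paris'),
  ('locatedIn', 'paris', 'france');
""")

# Rule: bornIn(x, y) AND locatedIn(y, z) -> bornInCountry(x, z),
# applied to every matching pair of facts in one batch.
conn.execute("""
INSERT INTO facts (rel, subj, obj)
SELECT 'bornInCountry', f1.subj, f2.obj
FROM facts f1 JOIN facts f2 ON f1.obj = f2.subj
WHERE f1.rel = 'bornIn' AND f2.rel = 'locatedIn';
""")

for row in conn.execute("SELECT * FROM facts WHERE rel = 'bornInCountry'"):
    print(row)   # ('bornInCountry', 'alice', 'france')
```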
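The papers in the Jan 10 universal-schema discussion share one mechanism: factorize a sparse binary matrix of (entity pair, relation or surface pattern) observations into low-dimensional embeddings whose dot products score unseen cells. Below is a minimal logistic-factorization sketch; the sizes, sampling scheme, and hyperparameters are illustrative assumptions.

```python
# Minimal logistic matrix factorization in the universal-schema style:
# rows are entity pairs, columns are relations/surface patterns, and the
# sigmoid of a dot product scores unobserved cells.
import numpy as np

rng = np.random.default_rng(2)
n_pairs, n_rels, k = 50, 12, 8
positives = [(i, i % n_rels) for i in range(n_pairs)]  # toy observed cells

P = rng.normal(scale=0.1, size=(n_pairs, k))   # entity-pair embeddings
R = rng.normal(scale=0.1, size=(n_rels, k))    # relation embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(200):
    for i, j in positives:
        # One observed (positive) cell and one sampled negative cell.
        for col, label in ((j, 1.0), (int(rng.integers(n_rels)), 0.0)):
            g = sigmoid(P[i] @ R[col]) - label   # logistic-loss gradient
            P[i], R[col] = P[i] - lr * g * R[col], R[col] - lr * g * P[i]

print("score for an observed cell:", sigmoid(P[0] @ R[0]))
```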
Fall 2013
Date | Speaker | Title | Slides |
---|---|---|---|
Dec 13 | Ryan Cobb, Shuang Lin | Medical NLP and VizSearch | |
Dec 6 | Sean Goldberg | Using People and Machines to Learn and Evaluate Inference Rules | |
Nov 22 | DSR Group | PIDGIN: Ontology Alignment using Web Text as Interlingua | |
Nov 22 | DSR Group | Semantic Parsing on Freebase from Question-Answer Pairs | |
Nov 1 | Christan Grant | FDB: A Query Engine for Factorised Relational Databases | slides |
Oct 25 | Ryan Cobb, Sahil Puri | Medical NLP & Knowledge Exchange | |
Oct 18 | Clint George | Model selection in Bayesian Topic Models and their Applications in Electronic Discovery | |
Oct 4 | Christan Grant, Morteza Shahriari Nia, Yang Peng | KBA 2013 TREC competition | |
Sep 27 | Christan Grant | Large-scale Entity resolution on text streams | |
Sep 20 | Donghui Wu | Predictive Modeling in Healthcare (MEDai/LexisNexis) | |
Sep 13 | Michael Borish | Crowdsourcing for Virtual Humans (UF VERG) | |
Sep 13 | Sean Goldberg | CASTLE: Crowd-Assisted System for Text Labeling and Extraction | |
Aug 30 | Yang Chen | Database Backend for Description Logic (IHMC) | slides |
Aug 30 | Kun Li | Task Migration Feasibility Analysis in Distributed Systems (Google) | |
Summer 2013
Date | Speaker | Title | Slides |
---|---|---|---|
Jul 12 | Morteza Shahriari Nia | Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources | |
Jun 28 | Morteza Shahriari Nia | AMPLab Big Data Benchmark and Million Query Track (TREC) | |
Jun 21 | Christan Grant | GPText: Greenplum Parallel Statistical Text Analysis Framework | |
Jun 14 | Yang Chen & Ryan Cobb | ICML Dry Runs | |
Jun 14 | Shuang Lin | Vispedia: Interactive Visual Exploration of Wikipedia Data via Search-Based Integration | |
May 31 | Daisy Zhe Wang | Probabilistic Programming Language for Advanced Machine Learning: MLN, BLOG/Figaro & Church | |
May 24 | Ryan Cobb | Using Big Data and Machine Learning in Mahjong | |
May 17 | Dr. Wind Cowles | Coreference and Focus in Human Sentence Processing | slides |