|Dec 11||Yang Peng||Probabilistic Ensemble Fusion for Multimodal Word Sense Disambiguation|
With the advent of abundant multimedia data on the Internet, there have been research efforts on multimodal machine learning to utilize data from different modalities. Current approaches mostly focus on developing models to fuse low-level features from multiple modalities and learn a unified representation across modalities. However, most related work fails to justify why multimodal data and multimodal fusion should be used, and few efforts leverage the complementary relation among different modalities. In this paper, we first identify the correlative and complementary relations among multiple modalities. Then we propose a probabilistic ensemble fusion model to capture the complementary relation between two modalities (images and text). Experimental results on the UIUC-ISD dataset show our ensemble approach outperforms approaches using only a single modality. Word sense disambiguation (WSD) is the use case we study to demonstrate the effectiveness of our probabilistic ensemble fusion model.
|Dec 4||Ali Sadeghian||Mapping and Mining Arguments|
In this talk, Ali will give an introduction about arguments and talk about the state-of-the-art techniques of argument mapping and argument mining.
|Nov 20||The DSR Group||Round-Table Discussion|
In today's Data Science Tea, we will have a round table discussion, talking about past/current work, progress, results, research plans, and future directions.
|Nov 13||Miguel Rodriguez||University of Florida DSR Lab System for KBP Slot Filler Validation 2015|
We present a Slot Filler Validation (SFV) system that uses a semi-supervised ensemble learning approach to aggregate the results of multiple slot fillers from the Cold Start track. We apply Bipartite Graph-based Consensus Maximization (BGCM) to combine the output of supervised stacked ensemble methods with the output of slot filling runs that cannot be trained. By using BGCM we are also able to leverage a small set of assessed fillers to increase the performance of the system. The ensemble results outperformed the best Cold Start run, the best filtered runs, and other ensemble systems.
|Oct 30||Yang Chen||Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases|
Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding (OP) algorithm, which scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule learning system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.
|Oct 23||Dihong Gong, Miguel Rodriguez||The Pre-Pilot Datascience Evaluation: Traffic Data Cleaning and Traffic Events Prediction|
We participated in the Pre-Pilot data science evaluation by NIST, which focuses on traffic data processing, including data cleaning and prediction. The traffic data contains measurements (e.g. flow, speed, and occupancy) from traffic sensors, and event reports distributed over the DC-Baltimore area. In the data cleaning task, our goal is to correct possibly erroneous flow values in the measurements. We propose to solve this problem by verifying data integrity using various constraints, such as a smoothness constraint and a measurement constraint. For the prediction task, we are asked to predict the number of traffic events in a given geographical area within a time interval of one month. We designed a regression model followed by an ensemble for this task. The major motivations are: 1) use regression models to predict the number of events based on road features that have a significant impact on event occurrence; and 2) use an ensemble method to combine the outputs of multiple regression models for enhanced prediction performance.
|Oct 16||Miguel Rodriguez||Knowledge Base Population Using Ensemble Learning|
Knowledge Base Population (KBP) is the task of extracting triples in the form of (subject, relation, object) to populate a knowledge base. English Slot Filling (ESF) and Cold Start (CS) tasks are part of the KBP effort conducted by NIST. Following the ESF task, the Slot Filler Validation (SFV) task was created in order to use the outputs of a number of individual systems attempting the ESF task and improve upon the accuracy in the aggregate. Various approaches, both supervised and unsupervised, have been applied to improve slot filler systems including entailment, truth finding, constraint optimization, majority voting and stacked ensembles. Although these methods refine the output of individual systems, they can be computationally expensive, unsuitable for ESF’s list-valued results, or require substantial data for training. We propose to apply Bipartite Graph-based Consensus Maximization (BGCM), an ensemble learning approach that combines the outputs of supervised and unsupervised models in a semi-supervised fashion.
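To make the ensemble idea above concrete, here is a toy, stdlib-only sketch of the simplest point in this design space: combining several hypothetical slot-filler outputs by weighted voting. This is not BGCM itself (which optimizes label consensus over a bipartite graph of objects and model groups); the data, weights, and function names below are all invented for illustration.

```python
# Toy weighted-voting combiner for slot-filler outputs. This is NOT
# BGCM (which propagates labels over a bipartite object/group graph);
# it only illustrates the simplest end of the ensemble design space.
from collections import defaultdict

def vote(system_outputs, weights):
    """system_outputs: list of dicts mapping (entity, slot) -> filler.
    weights: one confidence value per system."""
    scores = defaultdict(float)
    for output, w in zip(system_outputs, weights):
        for query, filler in output.items():
            scores[(query, filler)] += w
    # For each query, keep the filler with the highest total weight.
    best = {}
    for (query, filler), s in scores.items():
        if query not in best or s > best[query][1]:
            best[query] = (filler, s)
    return {q: f for q, (f, _) in best.items()}

# Hypothetical runs: two systems propose Honolulu, one proposes Chicago.
runs = [
    {("Obama", "per:city_of_birth"): "Honolulu"},
    {("Obama", "per:city_of_birth"): "Honolulu"},
    {("Obama", "per:city_of_birth"): "Chicago"},
]
print(vote(runs, [0.9, 0.8, 0.6]))  # -> {('Obama', 'per:city_of_birth'): 'Honolulu'}
```

A semi-supervised scheme like BGCM goes further by also using a small set of assessed (labeled) fillers to pull the consensus toward known-correct answers.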
|Oct 2||Dihong Gong||Multimodal Knowledge Extraction|
We consider the problem of semi-supervised learning to extract text categories (e.g. persons, cities) and image object bounding boxes from web pages. Starting with a handful of handcrafted seed examples for text categories, and hundreds of seed images (collected from ImageNet), our system can automatically extract useful knowledge from the web. This talk pursues the thesis that, by extracting text and images jointly, extraction accuracy can be noticeably improved. To enable this multimodal extraction scheme, we propose a graphical fusion model, which combines complementary multimodal information into a unified framework. Evaluation experiments show noticeable improvement of the proposed multimodal extraction over its single-modal counterparts.
|Sep 25||Yang Peng, Dr. Andrew Moore (CMU)||The BigDAWG Polystore System|
The BigDAWG polystore system is designed to handle large-scale analytics, real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all," it builds on top of a variety of storage engines, each designed for a specialized use case. The system provides a new view of federated databases to address the growing need for managing information that spans multiple data models.
Recent Developments in Artificial Intelligence - Lessons from the Private Sector (Dr. Andrew Moore)
Dr. Andrew Moore will discuss some of the big developments in computer science from the perspective of someone crossing over from industry to academia. He will talk about roadmaps for AI-based consumer and advice products in the commercial world and contrast with some of the potentially viable roadmaps in healthcare. Dr. Moore will also touch on entity stores (aka knowledge graphs), question answering and ultra-large data center architectures. Please visit the event page at https://datascience.nih.gov/community/datascience-at-nih/frontiers for more information.
|Sep 11||The DSR Group||Round-Table Discussion|
In today's Data Science Tea, we will have a round table discussion, talking about past/current work, research plans, and future directions.
|Aug 28||Yang Chen||Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing (Google)|
Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.
|Aug 28||Dr. Christof Koch, Dr. Emery Brown, Dr. Michael Stonebraker||Towards Solutions to Experimental and Computational Challenges in Neuroscience|
Dr. Christof Koch, President and Chief Scientific Officer of the Allen Institute for Brain Science, and Dr. Emery Brown, Professor of Computational Neuroscience and Health Sciences and Technology, Department of Brain and Cognitive Sciences, MIT-Harvard Division of Health Sciences and Technology, will describe the computational or experimental challenges associated with Big Data in their respective domains of neuroscience. From the basic to applied realms, science is being transformed by the collection of data on increasingly finer resolutions, both spatially and temporally. Storing, accessing, and analyzing these data create numerous challenges as well as opportunities. Please visit the event page at https://datascience.nih.gov/events/BRAIN-BD2K for more information.
Michael Stonebraker 2014 ACM A.M. Turing Lecture
Michael Stonebraker has made fundamental contributions to database systems, which are one of the critical applications of computers today and contain much of the world's important data. He is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on Ingres introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language. Stonebraker's implementations of Ingres and Postgres demonstrated how to engineer database systems that support these concepts; he released these systems as open software, which allowed their widespread adoption and their code bases have been incorporated into many modern database systems. Since the pathbreaking work on Ingres and Postgres, Stonebraker has continued to be a thought leader in the database community and has had a number of other influential ideas including implementation techniques for column stores and scientific databases and for supporting on-line transaction processing and stream processing.
|Aug 14||Miguel Rodriguez||Knowledge-Base Population using Ensemble Learning of Supervised and Unsupervised Models|
A wide variety of techniques have been implemented to participate in the English Slot Filling (ESF) task, part of the Knowledge Base Population (KBP) effort from NIST. The Slot Filler Validation (SFV) task was created in order to use the outputs of multiple ESF systems to improve the accuracy of the individual systems. Different supervised and unsupervised approaches have been used to improve slot filler systems, including entailment, constraint optimization, majority voting, and stacked ensembles. We propose the use of Consensus Maximization, an ensemble learning approach that combines the outputs of supervised and unsupervised models.
Reasoning Marginal Inference Probability on Dynamic ProbKB
Knowledge bases are growing rapidly; newly assimilated facts lead to incremental changes to the Probabilistic Knowledge Base (ProbKB), which invalidate the inferred marginal probabilities for nodes in the factor graph. Building on Kun's previous work on query-time k-hop approximate inference, we investigate how incremental information influences the marginal inference probabilities on the NELL-sport dataset.
|Jul 30||Dihong Gong, Yang Peng||Multimodal Knowledge Base Construction|
One of the major tasks in knowledge base construction (KBC) is to populate category instances (e.g. "is_a") over a predefined ontology. While state-of-the-art KBC systems (e.g. NELL, NEIL) are all based on information extraction technologies limited to a single modality, we propose to extract information in a multimodal manner. Our system adopts a never-ending learning model similar to NELL's, which repeatedly extracts new instances from a large collection of web pages, and then refines and updates the extractors using the newly extracted instances. The major contributions of our project are: 1) showing that information extracted using the multimodal fusion model has higher precision than the respective unimodal versions; and 2) showing that by combining multimodal constraints, we are able to mitigate the "semantic drift" issue of never-ending learning models.
|Jul 17||Christan Grant||Query-Driven Statistical Analytics for Knowledge Extraction, Resolution and Inference|
With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, combined with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms run over entire data sets perform extra work when the data scientist is only interested in a particular portion of the data.
In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside a relational database, where the user can use SQL to bound the scope of their algorithm. This way, computation is co-located with storage, and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases: new techniques to perform inference over knowledge bases that model uncertainty in a real scenario, and their application within question answering.
|Jun 26||Mebin Jacob||Tutorial on Docker and Server Access|
In today's seminar, we will have a tutorial on Docker setup, server guidelines, and the steps to host a live demo on a web server via Docker.
Expanding SigmaKB with GDELT data
GDELT is a project that aims to create a global dataset of events, locations, and tone by collecting news media articles from around the world. This dataset brings together the spatial and temporal dimensions of world events, adding context such as tone: the kind of language the media uses to cover an event. Such a dataset can be used to expand factual knowledge bases such as SigmaKB/YAGO that already include spatio-temporal dimensions for entities and facts. In this talk, I will discuss the nature of both datasets, possible ways to integrate them, and the advantages integration can bring.
|Jun 19||Sean Goldberg||Knowledge Base Inference: Goals and Methods|
In this talk, I will review and motivate marginal inference as the prevailing task in treating knowledge bases as probabilistic graphical models. To that end, there are a number of inference algorithms that balance certain tradeoffs, including level of approximation vs. scalability and feature specificity vs. expressivity. Along with Markov Logic and Path Ranking, I will discuss a number of modifications and how the tradeoffs are affected. Additionally, whether such models treat rule features as producers of knowledge or as constraints on knowledge has far-reaching effects on our intuitive understanding of the inference results.
|Jun 12||Miguel Rodriguez||TAC KBP 2015 - Slot Filling Validation Track|
The goal of Knowledge Base Population (KBP) at TAC is to promote research and evaluate the ability of automated systems to discover information about named entities and incorporate this information into a knowledge source. Specifically, given a reference knowledge base, a set of attributes (slots), and a set of entities from the reference KB, the Slot Filling (SF) task consists of mining information about (entity, slot) pairs from text to complete missing slots in the reference KB. Since 2013, a new task, Slot Filler Validation, has focused on refining the output of SF systems by applying more intensive linguistic processing or by combining information from multiple systems. In this talk, the datasets used in the 2014 SFV track will be discussed, and a pipeline for a stacked ensemble system that aggregates multiple SF system outputs will be presented, along with possible ways to improve it for the 2015 SFV task.
|Jun 5||Sean Goldberg||Probabilistic Graphical Models and Knowledge Bases: A Review|
This talk will serve as a brief introduction to modeling knowledge using either a probabilistic graphical model or first-order logic. After reviewing basic concepts and motivating shortcomings in both, Markov Logic will be presented as one solution to combat complex specificity in Markov random fields and determinism in first-order logic. The material presented is a precursor to next week's talk on problems and possible solutions inherent in Markov Logic.
|May 22||The DSR Group||SigmaKB and Probabilistic KB Fusion|
First, we will see a live demo of the SigmaKB system by Mugdha and Jeremy. Second, we will hear a short talk on the preliminary results of a probabilistic fusion model over NELL from Miguel.
|May 15||The DSR Group||Archimedes Discussion|
Discussion on the Archimedes Master Probabilistic Knowledge Base: motivation, algorithms, system architecture, user interface, data sources and evaluation.
|May 8||Yang Chen||Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases|
Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding (OP) algorithm, which scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule learning system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.
|Apr 24||Morteza Shahriari Nia||Applying Big Data Technology to Remote Sensing for Species Identification|
Species identification through remote sensing provides the means to monitor biodiversity and its interrelated dynamics at large ecological scales. With the advent of NEON, a standardized protocol for data collection across a wide range of domains will be used to collect data across the continental US for over 30 years. We use big data technologies such as probabilistic knowledge bases and deep learning to incorporate expert knowledge and feature learning to enhance species identification from remote sensing data.
|Apr 17||Yang Peng||Multimodal Fusion and Applications|
Our motivation is to utilize multimodal data to achieve better performance than a single modality. We will first introduce two applications of multimodal data fusion: multimodal information retrieval and multimodal word sense disambiguation. The methods used to combine images and text will be explained, as well as experimental results showing that the multimodal approaches outperform single-modality approaches. We will discuss a few different models for combining modalities and propose a promising one.
|Apr 10||Kushal Arora||Neural Nets and Knowledge Bases|
In this talk we will discuss neural network architectures applied to multi-relational data and how they are used to solve problems like inference, expansion, and reasoning over KBs. We will touch on the basic architectures used, various objective functions, and how they are applied in the context of the problems stated above.
|Apr 3||The DSR Group||Archimedes Discussion|
Discussion on the Archimedes Master Probabilistic Knowledge Base: motivation, algorithms, system architecture, user interface, data sources and evaluation.
|Mar 27||Kun Li||In-Database Large-scale Statistical Data Analysis|
Kun Li's Dissertation Practice Talk
Probabilistic knowledge bases are continually incorporating new knowledge learned from the web. With each incremental change to the KB, the naive approach to answering a marginal query is to re-run the inference algorithm (e.g., Gibbs sampling, MC-SAT), which is time consuming. We present an approach to approximate marginal inference and show that it answers queries an order of magnitude faster with negligible error.
|Mar 20||Michael J. Franklin with DSR Group||Round-Table Discussion|
|Mar 13||Christan Grant||Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources|
In this talk, I will discuss a recent paper from the Knowledge Vault team at Google. In this new paper, the researchers investigate the use of facts extracted from the web as a signal for search and ranking results. They build upon the previously published Knowledge Vault system to collectively model both factual errors in the corpora and extraction errors. I will discuss their techniques and present their findings. This paper, presumably, has been submitted but not yet accepted for publication.
|Feb 27||The DSR Group||Archimedes Round Table Discussion|
We discuss the Archimedes KB project and its sub-projects.
|Feb 13||Kushal Arora||Introduction to Deep Learning and Theano|
In this talk we will cover the basics of neural networks, starting with logistic regression, multilayer perceptrons, and autoencoders, and moving up to deep architectures like stacked autoencoders. In addition, we will discuss the basics of the Theano framework and implementations of the architectures covered.
|Feb 6||Christan Grant||Question Answering over Probabilistic KBs|
Question answering systems allow humans to ask questions in natural language, and the system responds with an answer in a human recognizable way. There has been a recent renewed interest in developing QA systems using Knowledge Graphs. In this talk, I will discuss the development of an in-house system over probabilistic knowledge bases. A probabilistic KB aims to provide an additional trustworthiness score to traditional QA systems. I will address both the motivation and progress of this work.
|Jan 23||Sean Goldberg||Rule Learning and Inference in Knowledge Bases 2|
First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. This talk explores the divide between these two approaches to the problem of fact inference and what the space of approximations between them may look like.
|Jan 16||Yang Chen||Rule Learning and Inference in Knowledge Bases 1|
First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. In order to scale to web-scale knowledge bases, we describe a new algorithm that scales association rule mining to today's KBs with billions of facts.
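As a toy-scale sketch of the confidence computation at the heart of association-rule-style KB mining (not the new algorithm itself, which adds the partitioning and pruning needed for billions of facts), one can score a candidate length-2 rule by support(body and head) / support(body). The relations and facts below are invented for illustration.

```python
# Score a candidate rule r1(x,z) ^ r2(z,y) => head(x,y) by the fraction
# of body groundings whose head is already in the KB. Illustrative only;
# a real system must partition and prune to handle billions of facts.
from collections import defaultdict

facts = {
    ("bornIn", "alice", "paris"), ("bornIn", "bob", "lyon"),
    ("cityIn", "paris", "france"), ("cityIn", "lyon", "france"),
    ("citizenOf", "alice", "france"),
}

def index(facts):
    by_rel = defaultdict(set)
    for r, s, o in facts:
        by_rel[r].add((s, o))
    return by_rel

def confidence(by_rel, r1, r2, head):
    """Confidence of the rule r1(x,z) ^ r2(z,y) => head(x,y)."""
    body = {(x, y) for (x, z1) in by_rel[r1]
                   for (z2, y) in by_rel[r2] if z1 == z2}
    if not body:
        return 0.0
    return len(body & by_rel[head]) / len(body)

idx = index(facts)
# bornIn(x,z) ^ cityIn(z,y) => citizenOf(x,y):
# body = {(alice, france), (bob, france)}, head holds only for alice.
print(confidence(idx, "bornIn", "cityIn", "citizenOf"))  # 0.5
```

Fact-by-fact scoring like this ignores higher-order correlations between rules, which is precisely the tradeoff against joint Markov Logic inference described above.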
|Jan 9||Kun Li||In-RDBMS Query-Time Inference over Large Factor Graphs|
Probabilistic knowledge bases are continually incorporating new knowledge learned from the web. With each incremental change to the KB, the current approach to answering a marginal probability query is to re-run the inference algorithm (e.g., Gibbs sampling), which is time consuming. We present an approach to approximate marginal inference and show that it answers queries an order of magnitude faster with negligible error.
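For readers unfamiliar with the baseline being improved upon, here is a minimal, self-contained Gibbs sampler estimating one marginal on a two-variable toy factor graph. The factors and prior are invented for illustration; real KB factor graphs are vastly larger, which is exactly why re-running a sampler like this after every update is costly.

```python
# Toy Gibbs sampler for P(x1 = 1) on a two-variable factor graph with
# one pairwise factor (favoring agreement) and one unary prior on x0.
# Exact answer for these factors is 4/7.5 ~ 0.533.
import random

def phi_pair(a, b):   # pairwise factor: prefer x0 == x1
    return 2.0 if a == b else 1.0

def phi_prior(a):     # unary factor: prior nudging x0 toward 1
    return 1.5 if a == 1 else 1.0

def gibbs_marginal(iters=20000, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    count = 0
    for _ in range(iters):
        # Resample x0 from its conditional given x1.
        w1 = phi_prior(1) * phi_pair(1, x[1])
        w0 = phi_prior(0) * phi_pair(0, x[1])
        x[0] = 1 if rng.random() < w1 / (w0 + w1) else 0
        # Resample x1 from its conditional given x0.
        w1 = phi_pair(x[0], 1)
        w0 = phi_pair(x[0], 0)
        x[1] = 1 if rng.random() < w1 / (w0 + w1) else 0
        count += x[1]
    return count / iters  # Monte Carlo estimate of P(x1 = 1)

print(gibbs_marginal())
```

A query-time approximation avoids redoing this full pass over the graph for each incremental update, trading a small error for an order-of-magnitude speedup.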
|Dec 19||Dr. Daisy Zhe Wang||Super Knowledge Base -> ArchMind -> Archimedes!!|
In this talk, I discuss system and algorithmic components that we are designing and building at UF to enable a master Knowledge Base (KB). I will also discuss many research directions we are exploring at the UF Data Science Research (DSR) group, including: query-driven inference and sampling, probabilistic knowledge bases, state-parallel and data-parallel data analytics frameworks, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research is of high impact and has received funding from industry as well as the federal government, including DARPA, EMC, Amazon, and Google. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Vault.
|Dec 12||Dr. Kevin Dong||Design for Emotion|
This talk introduces current research activities in the Interaction Design Lab of Shanghai Jiao Tong University and portfolios developed under the principle of “Form Follows Emotion”.
Kevin Dong is an assistant professor of interaction design at Shanghai Jiao Tong University, China. He received his doctoral degree from the College of Computer Science at Zhejiang University. He is the principal investigator of several government-funded projects, including Universal Interaction Design of Digital-TV under an Aging Society and Relationship between Customers’ Participation and User Experience of Customized Products. Aside from government-funded projects, Kevin also leads enterprise projects, including A New Automobile Navigation Interface Design Based on Touch Panel & Knob. Currently, Kevin is a visiting scholar with the HCI group at the University of Florida; his research focuses on user-centered design and emotional design, studying users’ perceptions of, responses to, and feelings about products and interfaces.
|Dec 5||Christan Grant||Query-Driven Text Analytics|
With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, combined with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Databases now store a mix of structured and unstructured data. To support text analytics, queries over disparate data types cannot be an oversight.
In this proposal I introduce query-driven text analytics: the use of declarative semantics (a query) to decrease the amount of processing in analytic systems without a major sacrifice in accuracy. I demonstrate this in three ways. First, I add text analytics inside a parallel relational DBMS, where the user can use SQL and UDFs to choose the scope of their algorithm. Second, I alter a data mining algorithm so it uses an example query to drive computation. Finally, I propose an integrated question answering system over the different parts of the web.
|Nov 14||Dihong Gong||A Text-Image Search Engine for Online Shopping|
Text-based document classification and image retrieval are two of the most fundamental problems in data science. In the era of big data, how to efficiently search text documents and images while guaranteeing good accuracy has been one of the most interesting topics. In our project, we propose to combine these two topics for enhanced performance and possibly new applications. So far, we have built an online search engine that demonstrates state-of-the-art accuracy on the Oxford Buildings Dataset. In our presentation, we will focus on technical details from several aspects, including data collection, system implementation, and search algorithms. We will also introduce our software package for highly scalable approximate K-means clustering with OpenMPI support.
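For reference, the clustering step underlying such a package can be sketched as plain Lloyd's k-means; the talk's package presumably accelerates the assignment step with approximate nearest-centroid search and distributes work via OpenMPI, neither of which this stdlib-only toy attempts. The data points below are invented.

```python
# Minimal Lloyd's k-means: alternate assigning points to the nearest
# centroid and recomputing centroids as cluster means. A scalable
# variant would approximate the nearest-centroid search and shard
# points across MPI ranks.
import random

def kmeans(points, k, iters=20, seed=1):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the nearest centroid (squared Euclidean).
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute centroid; skip empty clusters
                centers[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))  # two centroids, one per cluster
```

On this toy data the algorithm converges to centroids near (0.05, 0.1) and (5.1, 4.95) regardless of which points are drawn as initial centers.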
|Nov 7||Ian Perera||Grounding Symbols as Children Do|
While we train our computer vision systems with a series of images and labels, it is clear that children do not learn language this way. They are faced with a large variety of objects and behaviors visible at once, and must pull references from a jumble of words as they are still learning grammar. And yet, with a number of sometimes unintuitive learning strategies, they seem to be able to learn language grounded in their experiences faster than our top-of-the-line object recognition systems.
With the field of symbol grounding becoming more popular, work at the intersection of computer vision, natural language understanding, and cognitive science is poised to discover more complete and efficient ways of learning grounded language in AI systems. Advances in grounded language learning can be applied to scene description, dialogue systems, knowledge representation, and other fields. In this talk, I will cover our work so far on SALL-E, a system that uses child language learning strategies and pragmatic inference to perceptually ground language from video demonstrations. I will also cover the challenges we faced along the way and the precautions one must take to truly create a grounded language system.
|Oct 24||Morteza Shahriari Nia||Hyperspectral Classification of Savanah Tree Species Using k-fold Cross-Validated Non-linear SVM and MESMA|
Identifying savannah species at ecological scale is a major milestone in measuring biomass and carbon reserves and in predicting drought and the spread of invasive species. In this talk we perform classification and geo-mapping of tree species from hyperspectral imagery collected using AVIRIS airborne sensors. We provide a thorough comparison of the effects of the ATCOR and FLAASH atmospheric corrections on prediction accuracy. This study classifies common savannah tree species in the Ordway-Swisher Biological Station in north-central Florida, USA. Species classification was performed using a variety of Support Vector Machine kernels at both the pixel level and the canopy level, where the polynomial kernel outperformed the others. We also apply MESMA (Multiple Endmember Spectral Mixture Analysis), build a spectral library, and examine the results. In addition, we look into LiDAR (Light Detection and Ranging) airborne data and find interesting patterns in species heights. All this information, along with expert knowledge available online such as the USDA Plants database and other resources, can lead to a much more informed classification of species.
|Oct 10||Ishan Patwa||Word Sense Disambiguation through Images|
The automatic disambiguation of word senses is of growing interest in the natural language processing community. Using images to disambiguate short text with limited context is an important intermediate step in many natural language processing tasks. We will review our proposed method for solving the WSD problem and possible improvements on our preliminary results.
|Oct 3||Yang Peng||Large Scale Image Retrieval System|
Building a large-scale image retrieval system is a major challenge because of the rapid growth of images on the web today. In this presentation, we will first give a brief introduction to image retrieval systems. Then we will show our own pipeline design for handling large-scale image retrieval using advanced parallel data processing systems, including Hadoop and Mahout. We will also talk about the severe challenges in scaling the system up and how to solve them. Finally, we will discuss our results and next steps.
|Oct 2||Pawel Terlecki (Tableau Software)||An analytic data engine for visualization in Tableau|
The talk covers the history, architecture, and capabilities of the Tableau Data Engine. It is an in-house columnar database based on the MonetDB design and developed specifically to support users with mid-size data sets and no efficient analytic back-ends. We cover important components and design decisions, and give an overview of how industrial projects of this size start and evolve.
Pawel leads the query team at Tableau. His responsibilities include the vision, design, and implementation of various query processing elements of the Tableau visualization platform. One can find his contributions in the Tableau Data Engine, the caching infrastructure, and data extraction. Prior to Tableau he worked on business applications, web frameworks, database servers (in particular MS SQL Server), and data mining projects. He holds a PhD in Computer Science from the Warsaw University of Technology, with a specialization in information systems and knowledge discovery, and a BS in Economics from Warsaw University. He has published several works on databases and data mining and is a frequent attendee of major conferences in these fields. Performance and building reliable solutions are his passions.
|Sep 26||Yang Peng||Word Sense Disambiguation using Images in Social Networks|
In social networks, there are several challenges for word sense disambiguation, including short context and little annotation/knowledge. While there is only limited textual information, we could use multi-modal data including images to help disambiguate word senses. We are going to review related work and propose new methods using multi-modal data to solve WSD problems.
|Sep 19||Sean Goldberg||Comparing Markov Logic to other Rule Learning Approaches|
Markov Logic Networks (MLNs) combine the domains of first order logic and statistical probability by attaching weights to first order formulas or rules. This talk will serve as an introduction to intuitively understanding MLNs, particularly how they perform inference and learn weights and structure. MLN structure learning is equivalent to weighted inference rule learning and comparisons will be drawn with association rule mining metrics.
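The association-rule metrics mentioned above can be made concrete on a toy set of ground facts. This is a hypothetical example (the relation names and facts are invented), computing support and confidence for one Horn rule of the kind a structure learner would score:

```python
# Toy knowledge base of ground facts as (relation, arg1, arg2) triples.
facts = {
    ("spouse", "ann", "bob"), ("spouse", "carl", "dee"),
    ("livesIn", "ann", "paris"), ("livesIn", "bob", "paris"),
    ("livesIn", "carl", "rome"), ("livesIn", "dee", "tokyo"),
}

def rule_support_confidence(facts):
    """Score the rule spouse(x, y) AND livesIn(x, z) -> livesIn(y, z):
    support    = groundings where both body and head hold,
    confidence = support / groundings where the body holds."""
    body, holds = 0, 0
    for rel1, x, y in facts:
        if rel1 != "spouse":
            continue
        for rel2, x2, z in facts:
            if rel2 == "livesIn" and x2 == x:
                body += 1
                if ("livesIn", y, z) in facts:
                    holds += 1
    return holds, holds / body

support, confidence = rule_support_confidence(facts)
print(support, confidence)
```

Here the body has two groundings (via ann and carl) and the head holds for one of them, giving support 1 and confidence 0.5; MLN weight learning generalizes this kind of count into a log-linear weight over all groundings.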
|Sep 12||Yang Chen||Rule Mining in Large Knowledge Bases|
Recent years have seen a tremendous research interest in knowledge base construction. These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, all existing knowledge bases are incomplete. As one potential solution to knowledge expansion, we study the problem of rule mining in such knowledge bases. In this talk, I will survey the state-of-the-art rule mining algorithms and report potential research directions, our progress, and our contributions toward the rule-based solution of the knowledge expansion problem.
|Sep 5||Sean Goldberg||Fact Inference through Rule Learning in Knowledge Bases: A Review|
Many current large-scale knowledge bases (KBs) are highly incomplete either due to errors in the construction process or because the knowledge is implicit as opposed to explicit. For example, 93.8% of people in Freebase have no birthplace and 78.5% have no nationality. The construction of inference rules from mining repeatable patterns in the KB has the potential to contribute additional knowledge to the KB. In this talk I will outline the most recent attempts at mining both structured and unstructured data for inference rules and elucidate similarities in methodologies and algorithms. Finally, I will present some ideas for future contributions to this nascent field.
|Aug 29||Dr. Juan Gilbert||Applications Quest: A Nominal Population Metric Approach to Diversity in Admissions|
In 2003, two landmark cases challenged the University of Michigan admissions policies, one focused on Law School admissions and the other on undergraduate admissions. In Grutter v. Bollinger, which focused on the Law School, the U.S. Supreme Court ruled 5-4 in favor of the Law School. However, in Gratz v. Bollinger, by a vote of 6-3, the Court reversed, in part, the University's undergraduate admissions policy of awarding points for race/ethnicity. The Court thus decided that race could be considered in admissions decisions, but could not be the deciding factor. Later, Michigan residents voted to adopt a ban on racial and gender preferences through Proposal 2. In 2007, the U.S. Supreme Court heard two cases on race-conscious school placement policies in Louisville and Seattle, and struck down the programs in both cities. In 2013, the U.S. Supreme Court heard another case on this very topic, Fisher v. Texas, and sent the case back to the 5th District Court, citing that the case had not passed strict scrutiny. Applications Quest is a data mining tool that provides preference-free, holistic diversity using a patented nominal population metric. In this talk, Dr. Gilbert will discuss the legal implications of Applications Quest and the nominal population metric.
Dr. Juan E. Gilbert is the Andrew Banks Family Preeminence Endowed Chair and the Associate Chair of Research in the Computer & Information Science & Engineering Department at the University of Florida, where he leads the Human Experience Research Lab. Dr. Gilbert has research projects in spoken language systems, advanced learning technologies, usability and accessibility, Ethnocomputing (Culturally Relevant Computing), and databases/data mining. He has published more than 140 articles, given more than 200 talks, and obtained more than $24 million in research funding. He is a Fellow of the American Association for the Advancement of Science. In 2012, Dr. Gilbert received the Presidential Award for Excellence in Science, Mathematics, and Engineering Mentoring from President Barack Obama. He was recently named one of the 50 most important African-Americans in Technology. He was also named a Speech Technology Luminary by Speech Technology Magazine and a national role model by Minority Access Inc. Dr. Gilbert is also a National Associate of the National Research Council of the National Academies, an ACM Distinguished Scientist, and a Senior Member of the IEEE. Recently, Dr. Gilbert was named a Master of Innovation by Black Enterprise Magazine, a Modern-Day Technology Leader by the Black Engineer of the Year Award Conference, and the Pioneer of the Year by the National Society of Black Engineers, and he received the Black Data Processing Association (BDPA) Epsilon Award for Outstanding Technical Contribution. In 2002, Dr. Gilbert was named one of the nation's top African-American scholars by Diverse Issues in Higher Education. In 2013, the Black Graduate and Professional Student Association at Auburn University named their Distinguished Lecture Series in honor of Dr. Gilbert. Dr. Gilbert testified before Congress on the Bipartisan Electronic Voting Reform Act of 2008 for his innovative work in electronic voting. In 2006, Dr. Gilbert was honored with a mural painting in New York City by City Year New York, a non-profit organization that unites a diverse group of 17- to 24-year-old young people for a year of full-time, rigorous community service, leadership development, and civic engagement.
|Jun 20||Kun Li||Large-Scale Graph Processing Systems Cont.|
Topics include GraphX and X-Stream.
|Jun 13||Kun Li||Large-Scale Graph Processing Systems|
We had a discussion on different systems for large-scale graph processing and the pros and cons of each. The systems discussed include GraphLab, distributed GraphLab, GraphChi, PowerGraph, GraphX, and GIST.
|May 30||Yang Chen||Knowledge Expansion over Probabilistic Knowledge Bases|
Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
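The batch rule application described above can be sketched with a relational engine. The example below is an illustrative toy, not ProbKB itself: it applies one hypothetical transitivity rule, locatedIn(a, b) AND locatedIn(b, c) -> locatedIn(a, c), to a small facts table with a single set-based SQL statement per pass, iterating to a fixpoint.

```python
import sqlite3

# Toy facts table with a uniqueness constraint so re-derived
# facts are silently skipped by INSERT OR IGNORE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (rel TEXT, arg1 TEXT, arg2 TEXT, "
             "UNIQUE(rel, arg1, arg2))")
conn.executemany("INSERT INTO facts VALUES ('locatedIn', ?, ?)",
                 [("gainesville", "florida"), ("florida", "usa"),
                  ("usa", "north_america")])

while True:
    # One pass derives every fact the rule can currently produce at once,
    # rather than grounding the rule one tuple at a time.
    cur = conn.execute("""
        INSERT OR IGNORE INTO facts
        SELECT 'locatedIn', f1.arg1, f2.arg2
        FROM facts f1 JOIN facts f2
          ON f1.rel = 'locatedIn' AND f2.rel = 'locatedIn'
         AND f1.arg2 = f2.arg1""")
    if cur.rowcount == 0:   # fixpoint: no new facts were inserted
        break

total = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
print(total)
```

Because each pass is a single set-oriented statement, the DBMS can plan, parallelize, and optimize it; this is the appeal of pushing batch inference into the database. Here the three base facts expand to six in two passes.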
|May 23||Xiaofeng Zhou, Morteza Shahriari Nia||Exploring Netflow Data using Hadoop|
Exploring Netflow Data using Hadoop
We explore analysis of a Netflow dataset in Hadoop and characterize its performance.
Hyper-spectral Classification of Savannah Tree Species Using k-fold Cross-Validated Non-linear Support Vector Machines
In this paper we classify savannah tree species using AVIRIS hyper-spectral images; the pre-processing performed dramatically increased classification accuracy.
|May 16||Kun Li||In-RDBMS Large-scale Statistical Analysis|
Organizations such as companies, governments, and hospitals rely heavily on relational database management systems (RDBMSs) to store large amounts of structured and unstructured data. Deep analysis of the data stored in a database helps discover useful information, suggest conclusions, and support decision making. It helps companies make the next best decision, enables doctors to better assess their patients, and relieves lawyers of tedious document review. However, a deep and comprehensive understanding of data requires a variety of machine learning algorithms and statistical methods, and several challenges exist in using state-of-the-art systems to analyze data residing in an RDBMS. First, an expensive big data transfer cost must be paid up front to move data between databases and external analytics systems. Second, many popular statistical packages do not scale up to production-sized datasets. Enterprise applications thus need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers' pressing demands, researchers and database vendors have been pushing advanced analytics techniques into databases. This thesis makes two major contributions to the in-database analytics community. First, it contributes an in-RDBMS statistical text analysis package and introduces GPText, a Greenplum parallel statistical text analysis framework that seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib. Second, it presents a GIST operator for large-scale statistical inference to address the limitations of current RDBMSs. The two contributions are summarized in the following two paragraphs.
MADlib Text Analytics and GPText
Text analytics has gained much attention in the big data research community due to the large amounts of text data generated in organizations such as companies, governments, and hospitals every day in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. We bring statistical text analysis power into MADlib, a state-of-the-art in-database analytics package that can be installed on PostgreSQL and Greenplum. We developed and contributed a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show an elegant in-RDBMS parallel implementation of CRF that achieves sub-linear scalability. We introduce GPText, a Greenplum parallel statistical text analysis framework that seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib. We describe an eDiscovery application built on the GPText framework.
GIST: An Operator for Large Scale Statistical Inference
Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This paper presents the General Iterative State Transition (GIST), an RDBMS operator for large-scale inference. GIST receives a state generated by a UDA and then performs rounds of transitions on the state until it has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem and produces a solution that is an order of magnitude faster than the state of the art for the second.
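The UDA-plus-GIST pattern can be sketched in a few lines. This is an illustrative toy, not the actual operator: a stand-in aggregate builds the initial state from the rows, and a simple neighborhood-smoothing update plays the role of the state transition, iterated round after round until the state converges.

```python
def build_state(rows):
    """Stand-in for the UDA: aggregate the input rows into a state."""
    return list(rows)

def transition(state):
    """One GIST round: replace each value with the mean of itself and its
    neighbors (a toy update whose fixpoint is a constant vector)."""
    n = len(state)
    out = []
    for i in range(n):
        window = state[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def gist(rows, tol=1e-9, max_rounds=10_000):
    """Build the state once, then iterate transitions to convergence."""
    state = build_state(rows)
    for _ in range(max_rounds):
        new_state = transition(state)
        if max(abs(a - b) for a, b in zip(new_state, state)) < tol:
            return new_state   # converged to the desired fixpoint
        state = new_state
    return state

result = gist([0.0, 4.0, 8.0, 4.0, 0.0])
```

The key contrast with a plain UDA is the loop: the aggregate runs once over the input, while the transition function repeatedly touches the whole state, which is exactly the shape of MCMC or loopy belief propagation.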
|Apr 18||Xiaofeng Zhou, Sahil Puri||A Short Introduction to SciDB|
A Short Introduction to SciDB
The presentation first briefly introduces SciDB, with its architecture and array processing, then focuses on the work on NEON image import/export in SciDB.
Knowledge Feedback on Prediction of Post-operative Outcomes
A study was conducted, in collaboration with UF Health, to establish the requirements of an algorithm for predicting post-operative outcomes. We will discuss the methodology used in this study along with a demo of the software used. The presentation will focus on the experimental data collected in the most recent version of this study and the analysis and results derived from it.
This presentation will provide a brief description of the project "SMARTeR", being developed for document retrieval in collaboration with UF Law. We will focus on an overview of the algorithm developed and a demo of the software that will be provided to the law school. A comparison will be presented detailing the advantages of the algorithm over existing document retrieval techniques.
|Apr 11||Sethuraman Sundararaman, Parthasarathy Srinivasan, Kushal Arora||Masters Projects Showcase|
Sethuraman Sundararaman: NLP on mobile phones.
Parthasarathy Srinivasan: Efficient Representation of Large KBA Text Corpus.
Kushal Arora: KB Integration.
|Apr 4||Kun Li||GIST: An Operator for Large Scale Statistical Inference|
Enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers' pressing demands, database vendors have been pushing advanced analytics techniques into databases. Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many of these analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This talk presents the General Iterative State Transition (GIST), an RDBMS operator for large-scale inference. GIST receives a state generated by a UDA and then performs rounds of transitions on the state until it has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem and produces a solution that is an order of magnitude faster than the state of the art for the second.
|Mar 28||Yang Chen||VisKB: Interactive Visualization of Web-Scale Knowledge Graphs|
Knowledge graphs are becoming the next big goal for the web, and researchers have devised various ways to construct them. However, the user interfaces for knowledge bases are limited. In this talk, we present VisKB, a visual search engine that allows users to interactively query and explore web-scale knowledge graphs. VisKB visualizes only the part of the knowledge graph relevant to user queries, and allows users to interact with the visualization to express further queries that expand it. In this way, VisKB avoids visualizing the entire graph without losing information. Using DBpedia as the data source, we show that it helps users discover interesting properties and relationships of the entities they are interested in.
|Mar 21||Vipin Kumar (University of Minnesota)||Understanding Climate Change: Opportunities and Challenges for Data Driven Research|
Climate change is the defining environmental challenge facing our planet, yet there is considerable uncertainty regarding the social and environmental impact due to the limited capabilities of existing physics-based models of the Earth system. This talk will present an overview of research being done in a large interdisciplinary project on the development of novel data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results.
Vipin Kumar is currently William Norris Professor and Head of Computer Science and Engineering at the University of Minnesota. His research interests include high-performance computing and data mining, and he is currently leading an NSF Expeditions project on understanding climate change using data-driven approaches. He has authored over 250 research articles and co-edited or coauthored 10 books, including the widely used textbooks "Introduction to Parallel Computing" and "Introduction to Data Mining", both published by Addison-Wesley. Kumar co-founded the SIAM International Conference on Data Mining and served as a founding co-editor-in-chief of the Journal of Statistical Analysis and Data Mining (an official journal of the American Statistical Association). Kumar is a Fellow of the ACM, IEEE, and AAAS. He received the Distinguished Alumnus Award from the Indian Institute of Technology (IIT) Roorkee (2013), the Distinguished Alumnus Award from the Computer Science Department, University of Maryland College Park (2009), and the IEEE Computer Society's Technical Achievement Award (2005). Kumar's foundational research in data mining and its applications to scientific data was honored by the ACM SIGKDD 2012 Innovation Award, the highest award for technical excellence in the field of Knowledge Discovery and Data Mining (KDD).
|Mar 14||Morteza Shahriari Nia||Building Data Storage, Retrieval and Analysis Platform for Ecological Research at Continental Scale|
We will specifically talk about applying state-of-the-art machine learning techniques to remote sensing data, where the goal is species classification of plants. We also discuss existing platforms that allow scientists to easily share and query data. Our goal is to build a platform for data analysis over massive amounts of ecological data centered around remote sensing data, such as hyperspectral and LiDAR data, for ecological research and applications such as climate change and invasive species identification at continental scale.
|Feb 28||Kushal Arora||Universal Knowledge Base|
Ontology alignment of multiple knowledge bases to create a universal knowledge base with an integrated schema and entities. This work is based on the PIDGIN paper from CMU: ontology alignment using web text as interlingua.
|Feb 21||Jingtao Wang (U. Pittsburgh)||MindMiner: A Mixed-Initiative Interface for Interactive Distance Metric Learning|
Cluster analysis is a common task in exploratory data mining that involves combining entities with similar properties into groups. However, most clustering techniques face one key challenge when used in real-world applications: the algorithms expect a quantitative, deterministic distance function to quantify the similarity between two entities, whereas in most real-world problems such similarity measurements require subjective domain knowledge that can be hard for users to explain.
In this talk, we present MindMiner, a mixed-initiative interface and visualization system for capturing subjective similarity measurements via a combination of new interaction techniques and machine learning algorithms. MindMiner collects qualitative, hard-to-express similarity measurements from users via active polling with uncertainty and example-based visual constraint creation. MindMiner also formulates human prior knowledge into a set of inequalities and learns a quantitative similarity distance metric via convex optimization. In a 12-subject peer-review understanding task, we found MindMiner was easy to learn and use, and could capture users' implicit knowledge about writing performance and cluster target entities into groups that matched subjects' mental models. We also found that MindMiner's constraint suggestions and uncertainty polling functions improved both the efficiency and the quality of clustering.
Dr. Jingtao Wang is an Assistant Professor in Computer Science and the Learning Research and Development Center (LRDC) at the University of Pittsburgh. His primary research direction is Human-Computer Interaction (HCI). Jingtao's current research interests include mobile interfaces, education/learning technology, end-user programming, and machine learning and its applications in HCI. He received his Ph.D. degree in computer science from the University of California, Berkeley. Before that, Jingtao was a researcher and team lead at the IBM China Research Lab, working on large-vocabulary, online handwriting recognition technologies for Asian languages. He received his master's and bachelor's degrees, both from Xi'an Jiaotong University, China.
|Feb 14||Christan Grant||Query-Driven Statistical Text Analysis|
|Feb 7||Yang Chen||Knowledge Expansion over Probabilistic Knowledge Bases|
Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer hidden facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows efficient SQL-based inference algorithms for knowledge expansion that apply inference rules in batches; 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
|Jan 24||Yang Peng, Morteza Shahriari Nia||Image Analysis and Knowledge Base Construction|
Yang Peng and Morteza Nia will give short talks on their image analysis projects. We will also watch a TED talk by Greg Asner on remote sensing for ecological research. http://hyspeedblog.wordpress.com/2013/12/02/conservation-technology-mapping-our-environment-using-the-carnegie-airborne-observatory
|Jan 17||Armando Fox (UC Berkeley)||Using MOOCs to Reinvigorate Software Engineering Education|
The spectacular failure of the Affordable Care Act website ("Obamacare") has focused public attention on software engineering. Yet experienced practitioners mostly sighed and shrugged, because the historical record shows that only 10% of large (>$10M) software projects using conventional methodologies such as Waterfall are successful. In contrast, Amazon and others successfully build comparably large and complex sites with hundreds of integrated subsystems by using modern agile methods and service-oriented architecture.
This contrast is one reason industry has complained that academia ignores vital software topics, leaving students unprepared upon graduation. In too many courses, well-meaning instructors teach traditional approaches to software development that are neither supported by tools students can readily use, nor appropriate for projects whose scope matches a college course. Students respond by continuing to build software more or less the way they always have, which is boring for students, frustrating for instructors, and disappointing for industry.
This talk explains how the confluence of cloud computing and Massive Open Online Courses (MOOCs) has allowed us to greatly improve both the effectiveness and the reach of UC Berkeley's undergraduate software engineering course. The shift toward Software as a Service has not only revolutionized the future of software, but changed it in a way that makes it easier and more rewarding to teach. UC Berkeley's revised Software Engineering course leverages this productivity to allow students both to enhance a legacy application and to develop a new app that matches the requirements of non-technical customers. By experiencing the whole software life cycle repeatedly within a single college course, and by using the same tools and techniques that professionals use, students actually use and learn to appreciate the skills that industry has long encouraged. The course is now popular with students, rewarding for faculty, and praised by industry.
The technology developed for the course has also been used to offer a subset of the material as a MOOC to hundreds of thousands of students, and through an arrangement with edX, is available to classroom instructors interested in trying this approach as a SPOC (Small Private Online Course) offering instructor support far beyond what is usually available for traditional textbooks. Indeed, our experience has been that despite recent hand-wringing about MOOCs destroying higher education, appropriate use of MOOC technology can improve on-campus pedagogy, increase student throughput while raising course quality, and even reinvigorate faculty teaching.
Armando Fox (firstname.lastname@example.org) is a Professor in Berkeley's Electrical Engineering & Computer Science Department as well as the Faculty Advisor to the UC Berkeley MOOCLab. He co-designed and co-taught Berkeley's first Massive Open Online Course on Engineering Software as a Service, currently offered through edX, through which over 10,000 students worldwide have earned certificates of mastery. He also serves on edX's Technical Advisory Committee, helping to set the technical direction of their open MOOC platform. With colleagues in Computer Science and in the School of Information, he is doing research in online education including automatic grading of students' computer programs and improving student engagement and learning outcomes in MOOCs. His other computer science research in the Berkeley ASPIRE project focuses on highly productive parallel programming.
While at Stanford he received teaching and mentoring awards from the Associated Students of Stanford University, the Society of Women Engineers, and Tau Beta Pi Engineering Honor Society. He has been a "Scientific American Top 50" researcher, an NSF CAREER award recipient, a Gilbreth Lecturer at the National Academy of Engineering, a keynote speaker at the Richard Tapia Celebration of Diversity in Computing, and an ACM Distinguished Scientist. In previous lives he helped design the Intel Pentium Pro microprocessor and founded a successful startup to commercialize his UC Berkeley Ph.D. research on mobile computing. He received his other degrees in electrical engineering and computer science from MIT and the University of Illinois. He is also a classically-trained musician and performer, an avid musical theater fan and freelance Music Director, and bilingual/bicultural (Cuban-American) New Yorker living in San Francisco.
|Jan 10||Christan Grant||Universal Schema Discussion|
The discussion is based on the following papers:
Relation Extraction with Matrix Factorization and Universal Schemas
Universal Schema for Entity Type Prediction
[A short paper with a good summary]
Latent Relation Representations for Universal Schemas
|Jan 3||Anastasia Ailamaki (EPFL)||Efficient Exploration of Big Brain Data|
Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As data management software proves inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. With the exponential growth of dataset size and complexity, however, application-specific systems no longer scale to efficiently analyze the relevant parts of their data, thereby slowing down the cycle of analyzing, understanding, and preparing new experiments. I will illustrate the problem with a challenging application on brain simulation data and will show how the problems from neuroscience translate into challenges for the data management community. I will show how novel data management technology can enable today's neuroscientists to simulate and discover a meaningful percentage of the human brain at unprecedented levels of detail. Finally, I will describe the challenges of integrating simulation and medical neuroscience data to advance our understanding of the functionality of the brain.
Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland. Her research interests are in database systems and applications, and in particular (a) in strengthening the interaction between the database software and emerging hardware and I/O devices, and (b) in automating database management to support computationally demanding, data-intensive scientific applications. She has received an ERC Consolidator Award (2013), a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), eight best-paper awards at top conferences (2001-2011), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a senior member of the IEEE and a member of the ACM, serves as the ACM SIGMOD vice chair, and has also been a CRA-W mentor.
|Dec 13||Ryan Cobb, Shuang Lin||Medical NLP and VizSearch|
|Dec 6||Sean Goldberg||Using People and Machines to Learn and Evaluate Inference Rules|
|Nov 22||DSR Group||PIDGIN: Ontology Alignment using Web Text as Interlingua|
|Nov 22||DSR Group||Semantic Parsing on Freebase from Question-Answer Pairs|
|Nov 1||Christan Grant||FDB: A Query Engine for Factorised Relational Databases||slides|
|Oct 25||Ryan Cobb, Sahil Puri||Medical NLP & Knowledge Exchange|
|Oct 18||Clint George||Model selection in Bayesian Topic Models and their Applications in Electronic Discovery|
|Oct 4||Christan Grant, Morteza Shahriari Nia||KBA 2013 TREC competition|
|Sep 27||Christan Grant||Large-scale Entity resolution on text streams|
|Sep 20||Donghui Wu||Predictive Modeling in Healthcare (MEDai/LexisNexis)|
|Sep 13||Michael Borish||Crowdsourcing for Virtual Humans (UF VERG)|
|Sep 13||Sean Goldberg||CASTLE: Crowd-Assisted System for Text Labeling and Extraction|
|Aug 30||Yang Chen||Database Backend for Description Logic (IHMC)||slides|
|Aug 30||Kun Li||Task Migration Feasibility Analysis in Distributed Systems (Google)|
|Jul 12||Morteza Shahriari Nia||Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources|
|Jun 28||Morteza Shahriari Nia||AMPLab Big Data Benchmark and Million Query Track (TREC)|
|Jun 21||Christan Grant||GPText: Greenplum Parallel Statistical Text Analysis Framework|
|Jun 14||Yang Chen & Ryan Cobb||ICML Dry Runs|
|Jun 14||Shuang Lin||Vispedia: Interactive Visual Exploration of Wikipedia Data via Search-Based Integration|
|May 31||Daisy Zhe Wang||Probabilistic Programming Language for Advanced Machine Learning: MLN, BLOG/Figaro & Church|
|May 24||Ryan Cobb||Use Big Data and Machine Learning in Mahjong|
|May 17||Dr. Wind Cowles||Coreference and Focus in Human Sentence Processing||slides|