CAP4773/CAP6779, Projects in Data Science, Spring 2017
Course Description (3 credit hours)
Industry and academia (e.g., medical and bioinformatics, finance, law enforcement, economics, decision support, social networks) have a growing need for big data analytics skills, including data management, data mining, machine learning and data visualization. This course addresses that need as part of the three-course series in the Data Science curriculum. The aim is to apply data science and big data analytic tools to develop domain-specific applications. The course covers advanced topics in data science and individual projects in application areas such as vision, natural language processing, computational fluid dynamics, social networks and bioinformatics.
Prerequisite
Introduction to Data Science (CAP4770/CAP5771) or equivalent.
Course Objectives
Building on the foundations of databases and data mining, this course will prepare students for a variety of individualized projects in interest areas such as bioinformatics, vision and imaging, sensor and social networks, computational neuroscience, natural language processing, medical informatics and scientific data analysis.
Instructor: Prof. Daisy Z. Wang
- Office location: CSE E456
- Telephone: (352) 505-7626
- Email address: daisyw@cise.ufl.edu
- Office hours: Fridays 3-4pm
Projects
Project members | Project Summary | Posters | External Advisors |
---|---|---|---|
Harish Balaji | ChronoSeek: Information Extraction from Temporal Knowledge Bases. Sequential pattern mining has focused more on instantaneous events than on time intervals. KBs such as YAGO and Wikidata attach temporal annotations to their relations by way of reification, and various data models such as SPOT have been proposed for them. These annotations can be exploited to find patterns not only in a temporal arrangement but also in a combination of topological and temporal arrangements. This combination has not been explored before, and it leads to fifteen different arrangements that prove to be interesting. The results of the 2-arrangement phase of the enumeration tree are generated from Wikidata. | Poster | Miguel E. Rodriguez, UF Data Science Research Lab |
Akash Agarwal, Roukna Sengupta | Anomaly Detection over Graphical KB using HPCC. The goal of our project is anomaly detection over time-evolving network graphs using HPCC Systems. Our methods follow the network as it evolves and monitor its properties for changes. We use HPCC Systems®, an open-source, massively parallel computing platform for big data processing and analytics. In the presentation, we discuss what we learned from HPCC and evaluate its performance for querying and operating on a large dataset. We also discuss Enterprise Control Language (ECL), which is designed specifically for big data processing with HPCC. Finally, we discuss our evaluation of anomaly detection algorithms over a graphical KB, the Wikipedia revision history, where we try to detect events using a distribution-based methodology and structural changes in the graph over the time series. | Poster | HPCC LexisNexis |
Auon Haidar Kazmi, Karthik Maharajan Sankara Subramanian | MADlib Analytics Library Contributions. MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. In this presentation we introduce the MADlib project, including the background that led to its beginnings and the motivation for using Python and C++ along with Postgres. We give an overview of the library’s architecture and design patterns, and describe various statistical methods in that context. We then explain our key contributions to the MADlib project, including the perceptron and the KNN algorithms. | Poster | MADlib Apache Project |
Samskruthi Padigepati, Abhinav Shankar | Link Prediction on EHR Data using Medical Knowledge Base. Electronic health records (EHRs) store the medical and demographic information of patients in a digital format and can be used to advance clinical research. While EHR data can be used for predicting patient-centered outcomes, challenges arise when information is missing. In this project, we predict the missing links in the EHR data by integrating the data with a biomedical knowledge base. | Poster | CTSI and UF Data Science Research Lab |
Arvind Kumar Sugumar, Nishant Agarwal | NEON NIST DSE – Tree Crown Delineation. Automatic tree crown delineation has a great impact on tracking and preserving biodiversity. To serve as the pre-pilot for the full DSE track, which comprises delineation, alignment and classification, we propose using the watershed class of algorithms to implement a baseline model for the delineation task. This talk picks up where we left off earlier and covers two approaches to improving the naive watershed segmentation: the Laplacian of Gaussian (LoG) method followed by morphological enhancement, and the region growing algorithm. We will go through the techniques we use for crown delineation and demo the current progress. We will also demo the participant evaluation system and generate a sample report. | Poster | UF WeEcologyLab and UF Data Science Research Lab |
Caleb Bryant | The Rose Dialogue System. Personal digital assistants, such as Siri and Alexa, are the most well-known examples of dialogue systems. In recent years, high-accuracy speech recognition and natural language processing tools have made building custom dialogue systems ever more feasible. In this talk, we will be taking an end-of-semester look at the dialogue system for Rose, a virtual health navigator whose goal is to help patients understand their medical situations. | | CTSI and UF Data Science Research Lab |
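
The MADlib project in the table above lists a perceptron among its contributions. For readers unfamiliar with the algorithm, the following is a minimal sketch of the classic perceptron learning rule in plain Python. It is not the MADlib implementation, which runs in-database; the function names `train_perceptron` and `predict`, the parameters, and the toy dataset are assumptions chosen only for illustration.

```python
def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1


def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Classic perceptron rule: adjust the weights only on misclassified points.

    samples: list of equal-length feature lists
    labels:  list of +1 / -1 class labels
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if predict(weights, bias, x) != y:
                # Move the decision boundary toward the misclassified point.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                mistakes += 1
        if mistakes == 0:
            # A full pass without mistakes means the training data is separated.
            break
    return weights, bias


# Hypothetical toy data: logical AND with labels in {-1, +1}.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [-1, -1, -1, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Because the toy data is linearly separable, training stops once a full pass makes no mistakes, and `predictions` then matches `labels`. On data that is not linearly separable, the loop simply runs for the full `epochs` budget.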