CAP4773/CAP6779, Projects in Data Science, Spring 2016
Course Description (3 credit hours)
To address the growing need in both industry and academia (e.g., medical informatics and bioinformatics, finance, law enforcement, economics, decision support, social networks) for big data analytics skills, including data management, data mining, machine learning, and data visualization, this course is offered as part of the three-course series in the Data Science curriculum. The aim is to apply data science and big data analytics tools to develop domain-specific applications. The course covers advanced topics in data science and individual projects in application areas such as vision, natural language processing, computational fluid dynamics, social networks, and bioinformatics.
Prerequisite: Introduction to Data Science (CAP4770/CAP5771) or equivalent.
Building on the foundations of databases and data mining, this course will prepare students for a variety of individualized projects in interest areas such as bioinformatics, vision and imaging, sensor and social networks, computational neuroscience, natural language processing, medical informatics and scientific data analysis.
Instructor: Prof. Daisy Z. Wang
- Office location: CSE E456
- Telephone: (352) 505-7626
- Email address: firstname.lastname@example.org
- Office hours: Fridays 3-4pm
|Project members||Project Summary||Posters||External Advisors|
|Babak Alipour, Aditya Nain, Giang Nguyen||MADlib Contributions and Applications|
Apache MADlib (incubating) is a framework for distributed, parallel, in-database machine learning over data processing engines such as Greenplum and HAWQ. In this talk, we will present our individual experiences developing modules for MADlib, as well as an application that uses MADlib to enhance the user experience. We have developed two modules for integration into MADlib: k-nearest neighbors (k-NN) and Gaussian Mixture Model (GMM), developed by team members Babak and Aditya respectively. We will discuss the challenges and issues we faced, the solutions we applied, and future directions for improvement. Further, Giang will present an enhanced blogging platform built on a MADlib-enabled database backend. Using NLP functionality built into MADlib, he will showcase a blog that automatically links names in a blog post to their corresponding Wikipedia pages.
|Poster||Dr. Milenko Petrovic, Apache MADlib community|
|Ali Sadeghian, Benjamin Grider, Laksshman Sundaram||Semantic Edge Labeling over Legal Citation Graph and Case Prediction|
We first tackle the challenging task of predicting the outcome of a legal case. We designed an intelligent system that predicts the length of a case brought to court. The system was tested on data collected from cases filed in the United States International Trade Commission's Electronic Document Information System (USITC EDIS) under Section 337 (Unfair Import Investigations).
We then focus on semantic edge labeling in legal citation graphs. Citations, in which one statute cites another, differ in meaning, and we aim to annotate each edge with a semantic label that expresses its meaning or purpose. Our efforts involve defining the set of semantic labels, annotating citations, and automatically assigning a specific label to each citation edge.
|Poster||William Hamilton at UF Law School and ICAIR|
|Jayson Salkey||A Case for Integrated Probabilistic Biomedical Knowledge Bases|
Biomedical knowledge bases have traditionally been manually curated and integrated using constraint-based merging methods. In this talk, I will present a case for applying probabilistic integration methods to combine these knowledge bases and infer new rules over them.
|Poster||UF Shands and CTSI|
|Mebin Jacob||Exploring Graph Partitioning and distributed RDF store for probabilistic KB|
Earlier we saw the benchmarking of SPARQL queries over partitioned data using the n-hop guarantee technique. In this talk, we further explore the same technique over factor graphs and suggest modifications to our existing architecture (on Archimedes X) for inference queries.
|Poster||UF Data Science Research Lab|