Daisy Zhe Wang
To address the growing need in both industry and academia (e.g., medical informatics and bioinformatics, finance, law enforcement, economics, decision support, social networks) for big data analytics skills, including data management, data mining, natural language processing, machine learning, and data visualization, we introduce a three-course series in the Data Science Curriculum:
- Introduction to Data Science (both undergrad and graduate level)
- Advanced Topics in Data Science (graduate level only)
- Projects in Data Science (both undergrad and graduate level)
Due to the interdisciplinary nature of Data Science applications, we encourage students from CS, as well as students from other majors with a CS minor, to take the first and third courses in the curriculum. We encourage CS graduate students to take the second course to explore and push forward the frontier of Data Science technology. We will begin offering the first course in the series, “Introduction to Data Science”, in Spring 2014.
The aim of the first course, “Introduction to Data Science”, is to bring students with a basic programming and data structures background up to speed with the common tools used for Data Science application development. The course introduces basic data science techniques, including programming in SQL, MapReduce, R, and Python. It also covers relational databases, data visualization, classification, clustering, regression, and parallel computing platforms. Some of the tentative topics to be covered are:
Part 0: Introduction
Part 1: Data Manipulation, at Scale
- MapReduce, Hadoop, relationship to databases, algorithms, extensions, languages
- Databases, SQL and the relational algebra
- Parallel databases, parallel query processing, in-database analytics
- Key-value stores and NoSQL; tradeoffs between SQL and NoSQL
Part 2: Statistical Analytics
- Programming in Python and R
- Basic Data Mining
- Basic statistical modeling, introduction to machine learning, overfitting
- Supervised learning: Linear and Logistic Regression, Classification
- Unsupervised learning: Clustering, Association Rule mining
Part 3: Graph/Text Data Analysis & Communicating Results
- Graph Analytics: PageRank, community detection, recursive queries, iterative processing
- Text Analytics: TF/IDF, conditional random fields
- Visualization, data products, visual data analytics
Part 4: Parallel Computing
- Concurrency and Data Decomposition
- Message-Based Parallelism – MPI
- Thread-Based Parallelism – OpenMP
The course will be mainly project-based. We encourage students to form groups and develop Data Science applications to compete in the 2nd UF Data Science Exposition. For the 1st UF Data Science Exposition, we received generous sponsorship from Google and Amazon. Stay tuned!