Data Science Course Projects in Previous Years: First Data Science Exposition, Spring 2013

1. Influence Extraction from Social Networks

Social networks are widely used to share and exchange information, and they generate huge amounts of data that reveal people’s interests, hot topics, and emerging trends. Mining knowledge from such network data in a principled manner remains an interesting challenge. This project aims to analyze large-scale social network data and identify influential users or content using visualizations and graph analytics algorithms such as PageRank and community identification.
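As a rough illustration of how PageRank can surface influential users, here is a minimal single-machine sketch using power iteration. The edge list, the user names, and the damping factor of 0.85 are assumptions chosen for the example; a real project would run on a SNAP dataset with one of the frameworks listed under Tools.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical "follows" edges: (follower, followed).
follows = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("bob", "ann"), ("dan", "cat"),
]
scores = pagerank(follows)
most_influential = max(scores, key=scores.get)
print(most_influential, round(scores[most_influential], 3))
```

The scores sum to 1, so they can be read as the share of attention each user receives. A distributed implementation would partition the same per-edge update across workers.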


Ideas:
  1. Develop efficient implementations of graph algorithms using state-of-the-art graph analytics frameworks.
  2. Demonstrate your results and insights with data visualization tools.

Dataset: Stanford Large Network Dataset Collection (SNAP) or a dataset of your choice.
Tools: MapReduce, GraphLab or Spark/GraphX, Data-Driven Documents (D3)
References:
  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine
  2. Finding and Evaluating Community Structure in Networks

2. Query Processing over Large Knowledge Bases

The objective of this project is to efficiently answer queries over large knowledge bases (KBs) such as Freebase using RDF/SPARQL. The query workload and its evaluation are key: a SPARQL query workload needs to be designed, justified, and constructed for benchmarking. The goal is to benchmark and evaluate the performance of different types of SPARQL queries over large KBs such as Freebase, stored in RDF/graph databases such as Jena, Neo4j, and others.


Ideas:
  1. You can replace a KB such as Freebase with an RDF store of chemical compound structures. An example query would return all chemical compounds that contain a three-membered carbon ring (cyclopropane).
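To make the cyclopropane idea concrete, the sketch below runs a ring query over a toy in-memory triple set and times it. It is not Jena or SPARQL itself: the matcher is a naive stand-in for a basic graph pattern evaluator, the predicate names `element`, `bondedTo`, and `partOf` are hypothetical, and hydrogen atoms are omitted to keep the data small.

```python
import time


def molecule(name, atoms, bonds):
    """Encode one compound as (subject, predicate, object) triples."""
    triples = set()
    for atom, element in atoms.items():
        triples.add((atom, "element", element))
        triples.add((atom, "partOf", name))
    for a, b in bonds:
        # Bonds are undirected, so store both directions.
        triples.add((a, "bondedTo", b))
        triples.add((b, "bondedTo", a))
    return triples


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was first bound to.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


# Carbon skeletons only (hydrogens omitted); atom ids are made up.
TRIPLES = (
    molecule("cyclopropane", {"a1": "C", "a2": "C", "a3": "C"},
             [("a1", "a2"), ("a2", "a3"), ("a3", "a1")])
    | molecule("propane", {"b1": "C", "b2": "C", "b3": "C"},
               [("b1", "b2"), ("b2", "b3")])
)

# Three mutually bonded carbon atoms form a three-membered ring.
RING_QUERY = [
    ("?x", "bondedTo", "?y"),
    ("?y", "bondedTo", "?z"),
    ("?z", "bondedTo", "?x"),
    ("?x", "element", "C"),
    ("?y", "element", "C"),
    ("?z", "element", "C"),
    ("?x", "partOf", "?compound"),
]

start = time.perf_counter()
compounds = {b["?compound"] for b in match(RING_QUERY, TRIPLES)}
elapsed = time.perf_counter() - start
print(compounds, f"{elapsed * 1000:.3f} ms")
```

In a benchmark, the same query would be issued as SPARQL against each store under test, with the timing loop repeated over many runs and query types.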

Dataset: Freebase
Tools: SPARQL, Jena
References:
  1. Big Data Benchmark
  2. Benchmarking Graph Databases

3. Scalable Image Retrieval Systems and Applications

In this project, we look into the content-based image retrieval problem and aim to build an image retrieval system. As with a search engine, the query to an image retrieval system is an image, and the result should be a ranked list of similar images from the image database. The Oxford buildings dataset can be used to assess the search quality.
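One common way to produce such a ranked list is the bag-of-visual-words approach from the references: each image becomes a histogram of quantized local descriptors, which is then weighted with TF-IDF and compared by cosine similarity. The sketch below assumes the quantization step has already been done; the image names and visual word ids are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn {image_id: [visual word ids]} into sparse TF-IDF vectors."""
    n = len(documents)
    doc_freq = Counter(w for words in documents.values() for w in set(words))
    idf = {w: math.log(n / df) for w, df in doc_freq.items()}
    vectors = {}
    for image_id, words in documents.items():
        counts = Counter(words)
        vectors[image_id] = {
            w: (c / len(words)) * idf[w] for w, c in counts.items()
        }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_words, vectors, idf):
    """Rank database images by similarity to the query image."""
    # Visual words never seen in the database carry no information.
    counts = Counter(w for w in query_words if w in idf)
    total = sum(counts.values())
    query = {w: (c / total) * idf[w] for w, c in counts.items()}
    return sorted(
        vectors,
        key=lambda image_id: cosine(query, vectors[image_id]),
        reverse=True,
    )


# Hypothetical database: each image is a list of visual word ids.
database = {
    "tower_1": [3, 3, 7, 9, 12],
    "tower_2": [3, 7, 7, 9, 15],
    "bridge_1": [1, 4, 4, 20, 21],
}
vectors, idf = tfidf_vectors(database)
results = search([3, 3, 7, 9], vectors, idf)
print(results)
```

At scale, the sparse vectors map naturally onto an inverted index, which is where text search tools become useful.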


Ideas:
  1. Use images from an online shopping dataset.
  2. Use images from a Twitter dataset.
  3. Extend the retrieval system to video data.

Dataset: Twitter datasets from SNAP, Oxford buildings dataset
Tools: Hadoop, Solr, Mahout
References:
  1. Video Google: A Text Retrieval Approach to Object Matching in Videos
  2. Bag-of-Words Models

4. Knowledge Base Construction from Text

Knowledge bases such as Freebase and YAGO are increasingly helpful for understanding human information and queries. In this project, we build a knowledge base using natural language processing (NLP) and information extraction (IE) approaches. The aim is to construct and enhance a structured knowledge base from natural-language text, in a format that machines can process and use to answer human queries.


Ideas:
  1. Use one of the state-of-the-art tools to extract knowledge from a large text corpus.
  2. Show how the knowledge base can be used to answer user queries. Example query types include keyword search, natural language questions, and visualization.
  3. Enhance an existing knowledge base by extracting new knowledge from a text corpus. Try to use the input knowledge base as background evidence for the extraction algorithms.
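The extract-then-query loop described above can be sketched in a few lines. This toy version uses two hand-written regular expressions in place of a real extractor such as Open IE, so the patterns, the relation names `capital_of` and `born_in`, and the sample sentences are all illustrative assumptions.

```python
import re

# Hand-written patterns standing in for a real extractor. Each maps a
# sentence shape to a relation name (both are illustrative assumptions).
PATTERNS = [
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) is the capital of "
                r"(?P<obj>[A-Z][\w ]*?)\.?$"), "capital_of"),
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) was born in "
                r"(?P<obj>[A-Z][\w ]*?)\.?$"), "born_in"),
]


def extract(sentences):
    """Build a set of (subject, relation, object) triples from text."""
    kb = set()
    for sentence in sentences:
        for pattern, relation in PATTERNS:
            found = pattern.match(sentence.strip())
            if found:
                kb.add((found.group("subj"), relation, found.group("obj")))
    return kb


def ask(kb, subject=None, relation=None, obj=None):
    """Return the triples that agree with every field that was given."""
    return sorted(
        triple for triple in kb
        if (subject is None or triple[0] == subject)
        and (relation is None or triple[1] == relation)
        and (obj is None or triple[2] == obj)
    )


corpus = [
    "Paris is the capital of France.",
    "Marie Curie was born in Warsaw.",
    "The weather was pleasant that day.",  # no pattern applies
]
kb = extract(corpus)
print(ask(kb, relation="capital_of"))
print(ask(kb, subject="Marie Curie"))
```

Sentences that match no pattern are simply skipped, which is the main weakness a learned extractor is meant to address.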

Dataset: Wikipedia dumps, ClueWeb12, The New York Times Annotated Corpus, DBpedia, Freebase
Tools: Stanford NLP Software, Open IE
References:
  1. Open Information Extraction: the Second Generation
  2. Toward an Architecture for Never-Ending Language Learning
  3. Reading The Web with Learned Syntactic-Semantic Inference Rules
  4. Constructing an Interactive Natural Language Interface for Relational Databases

5. Database Support for Large-scale Fast Visualizations

Most visualization software and services, such as Tableau, rely on scalable back-end database systems to do the heavy lifting of data processing and computation. The goal of this project is to survey the current literature on database support for large-scale fast visualization and to build a base on which new query processing techniques and optimizations for visualization applications can be implemented.
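A simple example of this division of labor is a histogram: instead of shipping every row to the visualization client, the database bins and counts the values and returns only one row per bin. The sketch below uses an in-memory SQLite database as a stand-in for the back end; the `readings` table, its values, and the bin width are made up, and the binning assumes non-negative values.

```python
import sqlite3


def binned_counts(conn, bin_width):
    """Let the database bin and count values; return (bin_start, count)."""
    rows = conn.execute(
        "SELECT CAST(value / ? AS INTEGER) AS bin, COUNT(*) AS n "
        "FROM readings GROUP BY bin ORDER BY bin",
        (bin_width,),
    ).fetchall()
    # Only one row per non-empty bin crosses the database boundary.
    return [(bin_index * bin_width, count) for bin_index, count in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(v,) for v in [0.5, 1.2, 1.9, 2.4, 7.7, 8.1, 9.9]],
)

histogram = binned_counts(conn, 2.0)
print(histogram)
```

The amount of data returned depends on the number of bins rather than the number of rows, which is what keeps such a chart responsive as the table grows.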

Ideas: TBA

Dataset: TBA
Tools: Tableau, D3, Prefuse
References: TBA