Data Science Research

SMARTeR : Topic Modeling, Exploration, Entity Extraction, and Applications

Electronic legal discovery (e-discovery) is the process of collecting, reviewing, and producing electronically stored information (ESI), i.e., documents either in native format (e.g., emails, attachments, social media messages) or after conversion into PDF or TIFF form, to determine its relevance to a request for production. We are interested in improving document discovery and review in the e-discovery process. Keyword-based search is a popular information retrieval scheme for finding relevant documents in a collection, but it has many shortcomings. For example, some relevant documents may not contain the exact keywords specified by a user. Concept search is an alternative to keyword-based search that can address some of these deficiencies. One way to perform concept search is to employ topic models, which are often used to infer the underlying thematic or topic structure of a document collection or corpus, and to categorize documents based on their underlying topics. Entity extraction and entity linking offer another way to address the shortcomings of keyword-based approaches. In this project, we study the effectiveness of both approaches using real-world datasets.
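The keyword-search shortcoming described above can be illustrated with a minimal sketch. The documents, the concept vocabulary, and the function names are all hypothetical; in practice the concept vocabulary would come from a topic model rather than being written by hand.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(docs, keywords):
    """Return ids of documents that contain every query keyword verbatim."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))]

def concept_search(docs, concept_terms, min_hits=1):
    """Return ids of documents sharing at least min_hits terms with a concept.

    concept_terms is a hand-written stand-in (an assumption) for the
    high-probability words of a learned topic.
    """
    return [doc_id for doc_id, text in docs.items()
            if len(concept_terms & set(tokenize(text))) >= min_hits]

# Hypothetical toy collection.
TOY_DOCS = {
    "d1": "The merger agreement was signed by both boards last week.",
    "d2": "Both companies agreed to combine operations; the acquisition deal closes in May.",
    "d3": "Lunch menu for the cafeteria this Friday.",
}
MERGER_CONCEPT = {"merger", "acquisition", "deal", "combine", "takeover"}

keyword_hits = keyword_search(TOY_DOCS, ["merger"])
concept_hits = concept_search(TOY_DOCS, MERGER_CONCEPT)
print(keyword_hits, concept_hits)
```

Here the exact-match query finds only `d1`, while the concept query also retrieves `d2`, which discusses the same subject without using the word "merger".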

 

This is a joint project with the University of Florida Law School. See the e-discovery project page at the University of Florida Levin College of Law website.

Topic Models for E-Discovery

Topic models are algorithms that can discover the main themes or concepts in large unstructured collections of documents and organize the collections according to the discovered themes. They can be adapted to many kinds of data, such as collections of text documents, images, and social networks. We apply topic models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to enhance document discovery for a given production request in e-discovery. We studied the performance of a hybrid ranking and classification model based on keyword-based indexing (via Lucene, Whoosh, etc.) and popular document modeling methods such as TF-IDF, LSA, and LDA. The major tasks are the following.

  • using topic modeling to provide greater power than commonly employed methods such as keyword search
  • using identified topics for document categorization and ranking their relevance to a given query
  • using topic modeling based algorithms to provide document summaries and improve the document review process.
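As a rough illustration of hybrid ranking, the sketch below combines a TF-IDF keyword score with a topic-similarity score. It is a minimal sketch under stated assumptions, not the project's published model: the linear mix with weight `alpha`, the function names, the toy documents, and the document-topic distributions (written by hand here, where a fitted LSA or LDA model would supply them) are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_rank(docs, doc_topics, query, query_topics, alpha=0.5):
    """Rank documents by alpha * keyword score + (1 - alpha) * topic score."""
    tokens = {d: tokenize(text) for d, text in docs.items()}
    n = len(docs)
    # Smoothed inverse document frequency over the collection.
    df = Counter(t for toks in tokens.values() for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def tfidf(toks):
        # Terms unseen in the collection are dropped.
        return {t: tf * idf[t] for t, tf in Counter(toks).items() if t in idf}

    q_vec = tfidf(tokenize(query))
    q_topics = dict(enumerate(query_topics))
    scores = {}
    for d in docs:
        keyword_score = cosine(q_vec, tfidf(tokens[d]))
        topic_score = cosine(q_topics, dict(enumerate(doc_topics[d])))
        scores[d] = alpha * keyword_score + (1 - alpha) * topic_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical collection; topic 0 stands for "corporate deals", topic 1 for "other".
RANK_DOCS = {
    "d1": "The merger agreement was signed by both boards last week.",
    "d2": "Both companies agreed to combine operations; the acquisition deal closes in May.",
    "d3": "Lunch menu for the cafeteria this Friday.",
}
RANK_DOC_TOPICS = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.05, 0.95]}

hybrid = hybrid_rank(RANK_DOCS, RANK_DOC_TOPICS, "merger agreement", [1.0, 0.0])
keyword_only = dict(hybrid_rank(RANK_DOCS, RANK_DOC_TOPICS,
                                "merger agreement", [1.0, 0.0], alpha=1.0))
print([doc_id for doc_id, _ in hybrid])
```

With `alpha=1.0` the ranking reduces to pure keyword matching, so `d2` scores zero even though it is on topic; mixing in the topic score lifts it above the unrelated `d3`.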

Furthermore, to broaden the impact of this work, we are building software tools that can serve as the basis for an open e-discovery framework. For example, see the open-source random sampler software developed by our team.

Faculty: Daisy Zhe Wang, Joseph N. Wilson
Students: Clint P. George, Sahil Puri, Srinivas Balaji
Collaborators: William (Bill) Hamilton, UF Law

TopViz and TopEx Project

From literature surveys to legal document collections, people need to organize and explore large numbers of documents. During these tasks, students and researchers often search for documents based on particular themes. We use topic models such as LDA to derive topic distributions for articles and allow users to specify a personal topic distribution to contextualize the exploration experience. We introduce three types of exploration: user-model re-weighted keyword search, topic-based search, and topic-based exploration. We demonstrate these methods using a scientific citation data set and a Wikipedia article collection.
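The first two exploration modes can be sketched as follows. This is an illustrative sketch only: the use of Hellinger distance, the multiplicative re-weighting, the function names, and the toy topic distributions and keyword scores are assumptions, not the method implemented in TopViz or TopEx.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def topic_search(doc_topics, target_topics, k=3):
    """Topic-based search: return the k documents closest to a target topic distribution."""
    ranked = sorted(doc_topics,
                    key=lambda d: hellinger(doc_topics[d], target_topics))
    return ranked[:k]

def reweighted_keyword_search(keyword_scores, doc_topics, user_topics):
    """Re-weight keyword matches by closeness to the user's topic distribution."""
    scored = {d: s * (1.0 - hellinger(doc_topics[d], user_topics))
              for d, s in keyword_scores.items() if s > 0}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical topic distributions over (databases, machine learning, biology),
# as an LDA model might produce them.
PAPER_TOPICS = {
    "paperA": [0.8, 0.1, 0.1],
    "paperB": [0.1, 0.8, 0.1],
    "paperC": [0.1, 0.1, 0.8],
}
# Hypothetical keyword-match scores for one query.
PAPER_KEYWORD_SCORES = {"paperA": 0.6, "paperB": 0.5, "paperC": 0.0}
# A user who mostly reads machine learning papers.
ML_USER = [0.1, 0.8, 0.1]

personalized = reweighted_keyword_search(PAPER_KEYWORD_SCORES, PAPER_TOPICS, ML_USER)
nearest_biology = topic_search(PAPER_TOPICS, [0.1, 0.1, 0.8], k=1)
print(personalized, nearest_biology)
```

On keyword score alone `paperA` would rank first; after re-weighting by the user's topic profile, `paperB` moves ahead of it, and `paperC` is omitted because it does not match the query at all.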

Faculty: Daisy Zhe Wang, Joseph N. Wilson
Students: Clint P. George, Christan Grant

Topic Models for Survey Clustering

We explored automatic topic extraction, categorization, and relevance ranking for surveys and their questions, written in languages such as English, Spanish, Portuguese, German, and French. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. We used machine learning algorithms such as latent semantic indexing, latent Dirichlet allocation, and fuzzy clustering in our experiments. This was a joint research project with SurveyMonkey, a large-scale online survey management system.
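To show what fuzzy clustering adds over hard assignment, here is a minimal fuzzy c-means sketch in plain Python. The two-dimensional question vectors are hypothetical stand-ins for reduced representations (for example, from latent semantic indexing), and the initial centers, fuzzifier `m`, and iteration count are illustrative assumptions rather than settings from the project.

```python
import math

def memberships_for(points, centers, m):
    """Compute each point's degree of membership in every cluster."""
    rows = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # A point sitting exactly on a center belongs fully to it.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            row = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(row)
        rows.append([v / total for v in row])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Run a fixed number of fuzzy c-means updates from the given initial centers."""
    dim = len(points[0])
    for _ in range(iters):
        u = memberships_for(points, centers, m)
        new_centers = []
        for j, old in enumerate(centers):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            if total == 0.0:
                new_centers.append(list(old))  # keep an empty cluster in place
                continue
            new_centers.append([
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in range(dim)
            ])
        centers = new_centers
    # Memberships are recomputed so they match the final centers.
    return centers, memberships_for(points, centers, m)

# Hypothetical question vectors: two about satisfaction, two about
# demographics, and one that mixes both themes.
QUESTION_VECTORS = [
    [0.9, 0.1], [1.0, 0.2],    # satisfaction-like
    [0.1, 0.9], [0.2, 1.0],    # demographics-like
    [0.55, 0.55],              # ambiguous
]
final_centers, question_memberships = fuzzy_c_means(
    QUESTION_VECTORS, centers=[[1.0, 0.0], [0.0, 1.0]])
for row in question_memberships:
    print([round(v, 3) for v in row])
```

Questions that clearly belong to one theme receive a membership close to 1 in that cluster, while the ambiguous question is split roughly evenly, which is useful when a survey question could reasonably sit in more than one category of a question bank.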

Faculty: Daisy Zhe Wang, Joseph N. Wilson
Students: Clint P. George
Collaborators: SurveyMonkey

Publications

  • A Topic-Based Search, Visualization, and Exploration System. Christan Grant, Clint P. George, Virupaksha Kanjilal, Supriya Nirkhiwale, Daisy Zhe Wang, and Joseph N. Wilson. FLAIRS-28, Hollywood, Florida, USA. May 2015.
  • SMART Electronic Legal Discovery via Topic Modeling. Clint P. George, Sahil Puri, Daisy Zhe Wang, Joseph N. Wilson, and William Hamilton. FLAIRS-27, Pensacola, Florida, USA. May 2014.
  • A Machine Learning Based Topic Exploration and Categorization on Surveys. Clint P. George, Daisy Zhe Wang, Joseph N. Wilson, Liana M. Epstein, Philip Garland, and Annabell Suh. ICMLA 2012.
