SMARTeR: Topic Modeling, Exploration, Entity Extraction, and Applications
Topic Models for E-Discovery
Topic models are algorithms that discover the main themes or concepts in large, unstructured collections of documents and organize the collections according to those themes. They can be adapted to many kinds of data, such as collections of text documents, images, and social networks. We apply topic models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to enhance document discovery for a given production request in e-discovery. We studied the performance of a hybrid ranking and classification model that combines keyword-based indexing (via Lucene, Whoosh, etc.) with popular document modeling methods such as TF-IDF, LSA, and LDA. The major tasks are the following.
- using topic modeling to provide greater power than commonly employed methods such as keyword search and LDA
- using identified topics to categorize documents and rank them by relevance to a given query
- using topic-modeling-based algorithms to provide document summaries and improve the document review process
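The idea of combining a keyword signal with a document-model signal can be illustrated with a minimal sketch. It is not the project's implementation: plain query-term overlap stands in for a Lucene or Whoosh index, TF-IDF cosine similarity stands in for the document model, and the function names, the mixing weight `alpha`, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_rank(query, docs, alpha=0.5):
    """Score each doc as alpha * keyword overlap + (1 - alpha) * TF-IDF cosine.

    alpha is an assumed mixing weight, not a value taken from the project.
    Returns (score, document index) pairs, best first.
    """
    token_lists = [tokenize(d) for d in docs]
    vectors, idf = tfidf_vectors(token_lists)
    q_tokens = tokenize(query)
    q_vec = {t: c * idf[t] for t, c in Counter(q_tokens).items() if t in idf}
    q_terms = set(q_tokens)
    scored = []
    for i, (tokens, vec) in enumerate(zip(token_lists, vectors)):
        # Fraction of distinct query terms that appear in the document.
        keyword = len(q_terms & set(tokens)) / len(q_terms) if q_terms else 0.0
        scored.append((alpha * keyword + (1 - alpha) * cosine(q_vec, vec), i))
    return sorted(scored, reverse=True)


# Hypothetical documents and query, for illustration only.
docs = [
    "The merger agreement was signed by both companies in March.",
    "Lunch menu for the cafeteria this week.",
    "Counsel reviewed the merger contract and the agreement terms.",
]
ranking = hybrid_rank("merger agreement", docs)
for score, idx in ranking:
    print(f"{score:.3f}  {docs[idx]}")
```

Replacing the TF-IDF vectors with LSA or LDA document representations would keep the same structure: only the vectors passed to `cosine` change.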
Furthermore, to broaden the reach of this work, we are also building software tools that can serve as the basis for an open e-discovery framework. For example, see the open-source random sampler software developed by our team.
TopViz and TopEx Project
From literature surveys to legal document collections, people need to organize and explore large numbers of documents. During these tasks, students and researchers search for documents based on particular themes. We use topic models such as LDA to derive topic distributions for articles and allow users to specify a personal topic distribution to contextualize the exploration experience. We introduce three types of exploration: user-model re-weighted keyword search, topic-based search, and topic-based exploration. We demonstrate these methods using a scientific citation data set and a Wikipedia article collection.
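One way to picture user-model re-weighted keyword search is to blend each keyword hit's score with the similarity between the document's topic distribution and the user's. The sketch below is only an assumption about how such re-weighting could work: the blending formula, the `weight` parameter, the use of cosine similarity, and all of the scores and topic distributions are hypothetical rather than taken from the project.

```python
import math


def cosine(p, q):
    """Cosine similarity between two dense topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def reweighted_search(keyword_scores, doc_topics, user_topics, weight=0.5):
    """Re-rank keyword hits by closeness to the user's topic distribution.

    keyword_scores: doc id -> keyword relevance in [0, 1] (assumed to come
                    from a separate keyword index).
    doc_topics:     doc id -> topic distribution, e.g. from an LDA model.
    user_topics:    the user's personal topic distribution.
    weight:         assumed share of the final score given to topic similarity.
    """
    results = []
    for doc_id, kw_score in keyword_scores.items():
        similarity = cosine(doc_topics[doc_id], user_topics)
        results.append((doc_id, (1 - weight) * kw_score + weight * similarity))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical three-topic example: "a" and "b" match the keywords equally
# well, but the user's interests are concentrated on the second topic.
doc_topics = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.1, 0.8, 0.1],
    "c": [0.1, 0.1, 0.8],
}
keyword_scores = {"a": 0.6, "b": 0.6, "c": 0.2}
user_topics = [0.1, 0.8, 0.1]

results = reweighted_search(keyword_scores, doc_topics, user_topics)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With equal keyword scores, the document whose topics match the user model moves ahead of the other, which is the contextualizing effect the user-specified distribution is meant to provide.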
Topic Models for Survey Clustering
We explored automatic topic extraction, categorization, and relevance ranking for surveys and their questions in several languages, including English, Spanish, Portuguese, German, and French. The automatically generated question and survey categories are used to build question banks and category-specific survey templates. Our experiments used machine learning algorithms such as latent semantic indexing, latent Dirichlet allocation, and fuzzy clustering. This was a joint research project with SurveyMonkey, a large-scale online survey management system.
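Fuzzy clustering lets a question belong to several categories with different degrees of membership. The sketch below shows standard fuzzy c-means applied to topic vectors. It is a generic illustration, not the project's code: the choice of fuzzy c-means specifically, the parameter values, and the two-topic vectors for five survey questions are all assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships).

    m is the fuzzifier (m > 1); larger values give softer memberships.
    memberships[i][j] is the degree to which point i belongs to cluster j.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Update each center as the membership-weighted mean of the points.
        centers = []
        for j in range(n_clusters):
            weights = [u[j] ** m for u in memberships]
            total = sum(weights)
            centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
            )
        # Update memberships from the relative distances to the centers.
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships[i] = [
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0)) for k in range(n_clusters))
                for j in range(n_clusters)
            ]
    return centers, memberships


# Hypothetical two-topic mixtures for five survey questions.
questions = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.5, 0.5],  # a question that mixes both topics
]
centers, memberships = fuzzy_c_means(questions, n_clusters=2)
for q, row in zip(questions, memberships):
    print(q, [round(u, 2) for u in row])
```

Unlike hard clustering, the membership rows make mixed questions visible, which could help when a question plausibly belongs in more than one category of a question bank.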
- A Topic-Based Search, Visualization, and Exploration System. Christan Grant, Clint P. George, Virupaksha Kanjilal, Supriya Nirkhiwale, Daisy Zhe Wang, and Joseph N. Wilson. FLAIRS-28, Hollywood, Florida, USA. May 2015.
- SMART Electronic Legal Discovery via Topic Modeling. Clint P. George, Sahil Puri, Daisy Zhe Wang, Joseph N. Wilson, and William Hamilton. FLAIRS-27, Pensacola, Florida, USA. May 2014.
- A Machine Learning Based Topic Exploration and Categorization on Surveys. Clint P. George, Daisy Zhe Wang, Joseph N. Wilson, Liana M. Epstein, Philip Garland, and Annabell Suh. ICMLA 2012.