Electronic discovery is an interesting subproblem of information retrieval in which one identifies documents that are potentially relevant to the issues and facts of a legal case from an electronically stored document collection (a corpus). In this paper, we represent documents in a topic space using well-known topic models such as latent Dirichlet allocation and latent semantic indexing, and solve the information retrieval problem by finding document similarities in the topic space rather than in the corpus vocabulary space. We also develop an iterative SMART ranking and categorization framework in which a human in the loop labels a set of seed (training) documents, which are then used to build a semi-supervised binary document classification model based on Support Vector Machines. To improve this model, we propose a method for choosing seed documents from the whole population via an active learning strategy. We report the results of our experiments on a real dataset in the electronic discovery domain.
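As a rough illustration of retrieval in a topic space, the sketch below ranks documents by cosine similarity between topic vectors. It is not the paper's implementation: the function names, the three-topic vectors, and the choice of cosine similarity are all assumptions made for illustration, and in practice the topic vectors would come from a fitted LDA or LSI model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_by_topic_similarity(query_topics, doc_topics):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine(query_topics, vec)
              for doc_id, vec in doc_topics.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical topic proportions over 3 topics (assumed, not from the paper).
doc_topics = {
    "doc_a": [0.8, 0.1, 0.1],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.5, 0.4, 0.1],
}
query_topics = [0.9, 0.05, 0.05]

ranking = rank_by_topic_similarity(query_topics, doc_topics)
print(ranking)
```

Because the comparison happens over a handful of topic dimensions instead of the full vocabulary, two documents can score as similar even when they share few exact terms.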
Authors:
Clint P. George, Sahil Puri, Daisy Zhe Wang, Joseph Wilson, William Hamilton
Bibtex:
@article{, author = "Clint P. George and Sahil Puri and Daisy Zhe Wang and Joseph Wilson and William Hamilton", title = "SMART Electronic Legal Discovery via Topic Modeling", journal = "Proceedings of the 27th International FLAIRS Conference", year = "2014" }
Download:
[pdf]