Clint P. George
This is an extension to our earlier post, SMART Electronic Discovery (SMARTeR), which describes a framework for electronic discovery (e-discovery). Discovery is a pre-trial procedure in a lawsuit or legal investigation in which each party can obtain evidence from other parties (typically via a request for production) according to the laws of civil procedure in the United States and other countries. A request for production is a short description of the discovery requirement for a case. For example, “All documents or communications that describe, discuss, refer to, report on, or relate to the Company’s engagement in structured commodity transactions known as prepay transactions.” is a sample request for production quoted from the TREC 2010 legal learning track. In electronic discovery, one collects, reviews, and produces electronically stored information (i.e. documents, emails, attachments, etc) to determine its relevance to a request for production. The primary goal of this project is document discovery—i.e. we study various methods to improve document discovery given a request for production.
In a typical keyword-based retrieval, one uses keywords (e.g. pre-pay and swap for the production request described above) to search for relevant documents. Even though, keyword-based search is the popular scheme in the e-discovery community, it has many shortcomings. For example, issues such as synonymy and polysemy of words exists in a corpus can affect the performance of keyword-based approaches (George 2015). We take the document discovery problem as a binary document classification problem: we build binary classifiers that can label each input document as relevant or irrelevant based on the document content. The classifier is trained using a set of expert (e.g. a contract attorney) labeled training documents given a request for production. These expert labeled training documents are referred to as seed documents in the e-discovery community. In our earlier post, we discussed a set of approaches for principled selection of seed documents for expert review. See George (2015)[Chapter 7], for more details about seed selection methods and and their performance analysis. Electronically stored information for a given case can be represented in any format such as PDF, plain-text, HTML, and emails. The next step after data collection in e-discovery is to pre-process them to plain text format. A challenging problem in document classification and ranking is the choice of features for documents. Considering relative frequencies of individual words in documents as features as in TF or TF-IDF models may yield a rich but very large feature space (Joachims, 1999). This may cause computational difficulties when you build classifiers. A more computationally effective approach would be to analyze documents represented in a reduced topic space extracted by topic models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For example, words such as football, quarterback, dead ball, free kick, NFL, touchdown, etc., are representative of the single topic football. Topic models such as LDA can identify these topic structures automatically from document collections.
In this project, we compare the performance of LDA to several other document modeling schemes such as TF-IDF and LSA that have been employed to model e-discovery documents. We used the Gensim package—a scalable implementation of LSA and LDA algorithms—in our experiments. The LSA implementation performs a singular value decomposition (SVD) of the TF-IDF matrix for a corpus, and then projects documents represented in the TF-IDF matrix into the LSA semantic space. The LDA implementation is based on the online variational Bayes algorithm (Hoffman et al., 2010) that reduces any document in the corpus to a fixed set of real valued features (i.e. a document in the learned topic space). For classification, we consider popular classification algorithms such as Logistic Regression (LR), SVM (RBF) (SVM-R), SVM (Linear) (SVM-L), and k-Nearest Neighbor (k-NN). We used the implementations of these classification algorithms (along with the default tuning parameters) provided in the scikit-learn package for our experiments.
For evaluation, we used the 20Newsgroups dataset, a popular dataset used in the machine learning literature for experiments in applications of text classification and clustering algorithms. This dataset contains approximately 20,000 articles that are partitioned relatively even by across 20 different newsgroups or categories. We created a set of corpora from this dataset as follows (See the below table). For each corpus, the relevant class included documents under a single relevant group and the irrelevant class included documents under a set of irrelevant groups from the 20Newsgroups dataset. The column Relevant Group gives the relevant newsgroup and the column Irrelevant Groups gives the set of irrelevant newsgroups used for each corpus. The column Rel./Irrel. gives the number of documents in the relevant class vs the number of documents in the irrelevant class, for each created corpora.
Below table gives AUC, Precision, and Recall scores of the various classification results for corpora C-Mideast, C-IBM-PC, C-Motorcycles, and C-Baseball-2. All these scores are computed using a stratified 5-fold cross-validation scheme on all four corpora.
All classification methods performed reasonably well for all feature representation approaches in terms of AUC, except for k-Nearest Neighbor classifiers, which performed poorly for all feature types. It is surprising to note that Logistic Regression and SVM (Linear) methods gave similar AUC scores for all feature types (and Precision and Recall scores are comparable). We believe this is due to the similarity of the algorithms used in the scikit-learn package to find optimal solutions, and the choice of penalties. Another interesting observation is that for classification, simpler document models such as LSA and TF-IDF outperforms LDA-based models for all the four corpora. Our guess is that selecting hyperparameters and the number of topics for the LDA model of a corpus may make a difference in the classification performance (this is part of our future work). The TF-IDF scheme may not be ideal for large datasets as it can encounter computational difficulties in training a classifier.
Furthermore, we noticed a similar pattern on an experiment in which, we evaluated various linear SVM (SVM-L) models with different number of seed (training) documents. The rest of the documents in the corpus are used for testing. The experiments are based on two e-discovery datasets employed in the TREC 2010 Legal Learning Track. AUC performance of the SVM-L modes are given below. We can see that the LSA and TF-IDF models outperforms LDA uniformly in these two datasets.
In our experience, classifiers such as SVM (RBF kernel), SVM (Linear kernel), and logistic regression show mixed classification performance for different datasets. This suggests that having identified the right feature set for documents in a corpus the choice of algorithms to build the optimal classifier is relatively insignificant. The complexity of understanding documents in a corpus also can play a role in the performance of these classifiers: we observed that (experiments not shown) for some datasets the the number support vectors used by SVM is relatively large compared to other datasets—that means SVM tries to memorize the training data, which can cause overfitting.
George (2015) provides more details about this project and further experimental analysis.