Clint P. George
Our recent work on Electronic Discovery (E-Discovery) produced a novel E-Discovery retrieval model, SMARTeR, which employs state-of-the-art document modeling algorithms for relevance ranking, classification, and prioritization of the review process, building on the traditional Computer Assisted Review (CAR) process (see Background). A critical task in categorizing documents is ranking their relevance to a given user query, which typically contains topic-specific keywords or phrases. Our idea was to exploit the topics underlying the query keywords alongside existing keyword-based search tools such as Apache Lucene.
Figure 1 shows the workflow of our system. The major module of our CAR process is the indexing engine (Star 1 in the figure). We take a hybrid approach, combining topic modeling with traditional keyword-based methods for indexing documents. We employ well-known probabilistic topic modeling algorithms such as latent Dirichlet allocation (LDA) to extract the underlying topics in a document collection. These models represent the properties of a corpus with a small set of topics or concepts, far fewer than the corpus's vocabulary size, which gives them an advantage over document modeling methods such as term-frequency inverse-document-frequency (TF-IDF) and Latent Semantic Analysis (LSA). We use the keyword-based search engine Apache Lucene to index document metadata (e.g., email attributes such as to, from, subject, and date) and textual content, after applying standard natural language processing steps such as stemming, lemmatization, and stop-word removal.
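To make the topic-indexing step concrete, here is a minimal sketch of extracting per-document topic mixtures with LDA. It assumes the gensim library and uses a toy four-document corpus with ad hoc preprocessing; these are illustrative choices, not necessarily what SMARTeR uses.

```python
# Minimal sketch: fitting LDA on a tiny corpus and reading off each
# document's topic mixture (gensim). Corpus and settings are toy examples.
from gensim import corpora, models

docs = [
    "deposition testimony scheduled for the merger case",
    "quarterly energy trading report attached",
    "lunch meeting moved to friday",
    "merger deposition exhibits and trading records",
]
stopwords = {"the", "for", "to", "and"}
texts = [[w for w in d.lower().split() if w not in stopwords] for d in docs]

dictionary = corpora.Dictionary(texts)            # vocabulary: word <-> id
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

# Fit LDA with a handful of topics, far fewer than the vocabulary size.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)

# Each document is now a sparse mixture over topics, usable as an index.
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow))
```

Each document's topic mixture can then be stored alongside its keyword-index entry, giving the two complementary representations the hybrid approach relies on.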
Initial seed documents are generated (Star 2) using methods such as stratified sampling, which divides the population into homogeneous subgroups before sampling, applied to the documents represented in the topic-model and keyword-based indices, together with clustering methods such as k-means. The intuition behind this seed selection is to find a subset that is representative of the whole population, which should improve the generalization ability of the document classifiers. The expert-labeled seed documents, along with the topic-model and keyword-based document indices, are used to build the SMARTeR ranking and classification model. Given a Boolean user query, SMARTeR generates relevancy scores and class labels for documents and displays the results to the user for verification (Stars 3-5).
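As a sketch of the seed-selection idea, one could cluster the documents' topic vectors with k-means and take the document nearest each centroid as a stratum representative. The topic vectors and cluster count below are invented for illustration; SMARTeR's actual seed-selection procedure may differ.

```python
# Minimal sketch: choosing seed documents by clustering topic vectors with
# k-means and taking the document nearest each centroid as a representative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(alpha=np.ones(5), size=100)  # 100 docs x 5 topics

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(doc_topics)

seeds = []
for c, center in enumerate(km.cluster_centers_):
    members = np.where(km.labels_ == c)[0]
    nearest = members[np.argmin(
        np.linalg.norm(doc_topics[members] - center, axis=1))]
    seeds.append(int(nearest))
print("seed document ids:", sorted(seeds))
```

Because each cluster contributes one seed, every homogeneous subgroup of the collection is represented in the training set given to the expert reviewers.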
The amount of data in each class (relevant or non-relevant) can be enormous, making manual verification of the results of the CAR process cumbersome. A typical quality control method used in the e-discovery community is random sampling, which works as follows. We review a random sample of documents from the whole document population, classify each as valid or invalid, and project the percentage of relevant documents found in the sample onto the whole population. If the sample is truly random and the sample size is sufficiently large, we can make this projection at a given Confidence Level within a certain error margin (i.e., Confidence Interval). The Confidence Level is expressed as a percentage; for example, a CL of 95% means you can be 95% certain of the estimate. A CL of 95% is commonly used and derives from the normality assumption about random samples from a population. The Confidence Interval is expressed in percentage points around the estimate and indicates the estimate's reliability at the given confidence level. For more details about the random sampling techniques used in the legal community, readers may wish to review the article by Ralph Losey.
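For concreteness, the standard sample-size calculation behind this kind of quality control, under the normal approximation and the worst-case assumption p = 0.5, can be sketched as follows. The CL, CI, and population values in the example are illustrative.

```python
# Minimal sketch: required sample size for a given confidence level and
# confidence interval, with an optional finite-population correction.
from math import ceil
from statistics import NormalDist

def sample_size(confidence_level, margin_of_error, population=None, p=0.5):
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)  # e.g., 1.96 for 95%
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite-population correction shrinks n for smaller collections.
        n = n / (1 + (n - 1) / population)
    return ceil(n)

# 95% confidence level, +/-2 percentage point interval, 100,000 documents:
print(sample_size(0.95, 0.02, population=100_000))        # about 2,345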
If the evaluation based on random sampling passes, the user can go on to generate reports; otherwise, the user can go back, edit the current query keywords, and continue the CAR process in an iterative fashion. One advantage of this review process is that it can incorporate user feedback into the retrieval model by revising the seed documents and the classifier.
We evaluated our algorithms on the datasets used in the Text REtrieval Conference (TREC) 2010 Legal Track's learning task. These datasets were created from emails and their attachments in the well-known Enron dataset. We found that ranking models built on documents represented in the topic space produced by the topic modeling algorithms yield better ranking scores than typical keyword-based ranking schemes alone. More details about our algorithms and evaluations are described in the paper "SMART Electronic Legal Discovery via Topic Modeling," in the proceedings of the FLAIRS-27 conference, Pensacola, FL, May 2014.
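The paper describes the exact ranking model; as a rough sketch of the general idea, documents can be scored by the similarity between a query's inferred topic mixture and each document's topic mixture. The mixtures below are made up, and cosine similarity is one plausible choice of similarity measure, not necessarily SMARTeR's.

```python
# Minimal sketch: ranking documents in topic space by cosine similarity
# to the query's topic mixture. All numbers here are illustrative.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query_topics = np.array([0.70, 0.05, 0.25])  # inferred from query keywords
doc_topics = np.array([
    [0.60, 0.10, 0.30],                      # doc 0
    [0.05, 0.90, 0.05],                      # doc 1
    [0.33, 0.33, 0.34],                      # doc 2
])

scores = np.array([cosine(query_topics, d) for d in doc_topics])
# In a hybrid system, this topic score could be blended with a keyword
# (e.g., Lucene) score; the blending scheme here is left unspecified.
ranking = np.argsort(scores)[::-1]
print("ranked doc ids:", ranking, "scores:", scores.round(3))
```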
We also developed a random sampler application that selects files at random from a given data population, with the sample size determined by the specified confidence interval (CI) and confidence level (CL). Both projects, the random sampler and SMARTeR, are supported by ICAIR, the International Center for Automated Research at the University of Florida Levin College of Law.
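A minimal sketch of the random-selection step itself, assuming the sample size has already been computed from the CL and CI as shown earlier (the directory path is hypothetical):

```python
# Minimal sketch: draw files uniformly at random from a population.
import random
from pathlib import Path

# Hypothetical corpus location; replace with the real collection root.
population = [p for p in Path("/data/corpus").rglob("*") if p.is_file()]
n = 2345   # e.g., from the 95% CL / +/-2pp calculation sketched earlier
sample = random.sample(population, min(n, len(population)))
print(f"drew {len(sample)} of {len(population)} files")
```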
Electronic Discovery and Computer Assisted Review Process: Background
Discovery is a pre-trial procedure in a legal investigation in which each party can obtain evidence from the other parties under the laws of civil procedure in the United States and other countries. The first step in a typical discovery is identification of data by the parties on both sides of a legal case. Potentially relevant documents are then collected; their text and metadata are extracted; and the documents are indexed and made available to expert reviewers, for example, contract attorneys. During this process, relatively simple data culling methods such as de-NIST-ing and de-duplication are performed to remove noise from the data. Expert reviewers then manually review documents for relevance to a production request and classify them as relevant, non-relevant, or privileged. Finally, relevant documents are produced to the opposing party under agreed-upon terms and conditions.
Electronic legal discovery (e-discovery) is the process of collecting, reviewing, and producing electronically stored information (ESI), i.e., documents either in native format (e.g., emails, attachments, social media messages) or after conversion into PDF or TIFF form, to determine its relevance to a request for production. ESI is fundamentally different from paper information because of its form, persistence, and additional information such as document metadata that is not available for paper documents. ESI can play a critical role in identifying evidence. On the other hand, the explosion of ESI to be dealt with in any typical case makes manual review cumbersome and expensive; see, for example, the study conducted at kCura. Moreover, expert (manual) reviewers are not foolproof; see, e.g., Lewis (2011). Litigation costs are rising and, as a result, are putting the public dispute resolution process out of reach of the average citizen and medium-sized company. Thus, legal professionals have sought to employ Computer Assisted Review (CAR), also known as Technology Assisted Review (TAR) or predictive coding, based on information retrieval and machine learning methods, to reduce manual labor and increase accuracy.
In a typical CAR process, one trains a computer to classify documents by relevance to a legal case using a set of training documents labeled by expert reviewers. The job of the classification algorithm is to propagate the domain expert's knowledge, encoded in the training document labels, to the whole document collection via various indexing, relevance-ranking, and classification methods. Finally, a validation step checks, via methods such as statistical sampling, whether the system's results are those desired by the review team. For more discussion of the CAR process and e-discovery, readers may wish to consult the Computer Assisted Review Reference Model (CARRM).
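As an illustrative sketch of this label-propagation idea (not CARRM's or SMARTeR's actual implementation), one could fit a simple classifier on the expert-labeled seeds and score the unreviewed documents. The texts, labels, and choice of TF-IDF plus logistic regression below are toy assumptions.

```python
# Minimal sketch: propagate expert labels from seed documents to the
# rest of the collection with a simple text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = ["merger deposition exhibits", "fantasy football picks",
              "trading desk audit memo", "holiday party invitation"]
seed_labels = [1, 0, 1, 0]            # 1 = relevant, 0 = non-relevant

vec = TfidfVectorizer()
X = vec.fit_transform(seed_texts)     # seed documents as TF-IDF vectors
clf = LogisticRegression().fit(X, seed_labels)

# Score unreviewed documents; high probabilities go to human review first.
unreviewed = ["audit of merger trading records", "lunch on friday?"]
scores = clf.predict_proba(vec.transform(unreviewed))[:, 1]
print(list(zip(unreviewed, scores.round(3))))
```

The predicted probabilities can double as a prioritization signal, so reviewers see the most likely relevant documents first.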