
SMART Electronic Discovery: System Evaluation

December 27, 2015 · research directions

Clint P. George

This is an extension of our earlier post, SMART Electronic Discovery (SMARTeR), which describes a framework for electronic discovery (e-discovery). Discovery is a pre-trial procedure in a lawsuit or legal investigation in which each party can obtain evidence from the other parties (typically via a request for production) according to the laws of civil procedure in the United States and other countries. A request for production is a short description of the discovery requirement for a case. For example, "All documents or communications that describe, discuss, refer to, report on, or relate to the Company's engagement in structured commodity transactions known as prepay transactions" is a sample request for production quoted from the TREC 2010 Legal Learning Track. In electronic discovery, one collects, reviews, and produces electronically stored information (e.g., documents, emails, and attachments) to determine its relevance to a request for production. The primary goal of this project is document discovery; that is, we study various methods to improve document discovery given a request for production.

In a typical keyword-based retrieval setting, one uses keywords (e.g., pre-pay and swap for the production request described above) to search for relevant documents. Although keyword-based search is the most popular scheme in the e-discovery community, it has several shortcomings. For example, the synonymy and polysemy of words in a corpus can hurt the performance of keyword-based approaches (George 2015). We treat the document discovery problem as a binary document classification problem: we build binary classifiers that label each input document as relevant or irrelevant based on its content. The classifier is trained on a set of expert-labeled (e.g., by a contract attorney) training documents for a given request for production. These expert-labeled training documents are referred to as seed documents in the e-discovery community. In our earlier post, we discussed a set of approaches for the principled selection of seed documents for expert review; see George (2015), Chapter 7, for more details about seed selection methods and their performance analysis. Electronically stored information for a given case can arrive in many formats, such as PDF, plain text, HTML, and email, so the next step after data collection in e-discovery is to pre-process it into plain text. A challenging problem in document classification and ranking is the choice of features for documents. Taking the relative frequencies of individual words in documents as features, as in TF or TF-IDF models, may yield a rich but very large feature space (Joachims, 1999), which can cause computational difficulties when building classifiers. A more computationally effective approach is to analyze documents represented in a reduced topic space extracted by topic models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For example, words such as football, quarterback, dead ball, free kick, NFL, and touchdown are representative of the single topic football. Topic models such as LDA can identify such topic structures automatically from document collections.
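As a small aside, here is one way such topics can be surfaced with Gensim's LdaModel; the four toy documents and the two-topic setting below are purely illustrative and are not the study's data:

    from gensim import corpora, models

    # Toy documents (hypothetical, not from the study's corpora).
    docs = [
        "quarterback touchdown nfl football free kick",
        "football dead ball free kick quarterback",
        "prepay transaction structured commodity swap",
        "commodity swap prepay deal transaction",
    ]
    tokens = [d.split() for d in docs]

    dictionary = corpora.Dictionary(tokens)        # word <-> integer id map
    bow = [dictionary.doc2bow(t) for t in tokens]  # bag-of-words vectors

    # Fit a two-topic LDA model; num_topics is an illustrative choice.
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=50)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)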

In this project, we compare the performance of LDA with several other document modeling schemes, such as TF-IDF and LSA, that have been employed to model e-discovery documents. We used the Gensim package, which provides scalable implementations of the LSA and LDA algorithms, in our experiments. The LSA implementation performs a singular value decomposition (SVD) of the TF-IDF matrix for a corpus and then projects documents represented in the TF-IDF matrix into the LSA semantic space. The LDA implementation is based on the online variational Bayes algorithm (Hoffman et al., 2010), which reduces any document in the corpus to a fixed set of real-valued features (i.e., a document in the learned topic space). For classification, we consider popular classification algorithms such as Logistic Regression (LR), SVM with an RBF kernel (SVM-R), SVM with a linear kernel (SVM-L), and k-Nearest Neighbors (k-NN). We used the implementations of these classification algorithms (with their default tuning parameters) provided in the scikit-learn package for our experiments.
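A minimal sketch of this pipeline, assuming a toy tokenized corpus and binary relevance labels; none of the documents, label values, or the num_topics setting below come from the study:

    import numpy as np
    from gensim import corpora, models, matutils
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in corpus (hypothetical); labels: 1 = relevant, 0 = irrelevant.
    tokenized_docs = [
        "prepay transaction swap commodity deal".split(),
        "structured commodity prepay agreement swap".split(),
        "cafeteria lunch menu friday notice".split(),
        "fantasy football league schedule notice".split(),
    ]
    y = np.array([1, 1, 0, 0])

    dictionary = corpora.Dictionary(tokenized_docs)
    bow = [dictionary.doc2bow(t) for t in tokenized_docs]

    # TF-IDF weighting followed by an LSA projection (Gensim calls it LSI).
    tfidf = models.TfidfModel(bow)
    num_topics = 2  # illustrative; in practice this is tuned per corpus
    lsa = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=num_topics)

    # Dense (num_docs x num_topics) feature matrix for scikit-learn.
    X = matutils.corpus2dense(lsa[tfidf[bow]], num_terms=num_topics).T

    clf = LogisticRegression().fit(X, y)  # or SVC / KNeighborsClassifier
    print(clf.predict(X))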

For evaluation, we used the 20Newsgroups dataset, a popular dataset in the machine learning literature for experiments with text classification and clustering algorithms. This dataset contains approximately 20,000 articles that are partitioned relatively evenly across 20 different newsgroups, or categories. We created a set of corpora from this dataset as follows (see the table below; a construction sketch follows it). For each corpus, the relevant class consists of documents from a single relevant newsgroup and the irrelevant class consists of documents from a set of irrelevant newsgroups in the 20Newsgroups dataset. The column Relevant Group gives the relevant newsgroup, and the column Irrelevant Groups gives the set of irrelevant newsgroups used for each corpus. The column Rel./Irrel. gives the number of documents in the relevant class versus the number in the irrelevant class for each corpus.

[Table: Corpora created from the 20Newsgroups dataset (George 2015)]
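For reference, a binary corpus of this shape can be assembled with scikit-learn's fetch_20newsgroups loader; the relevant/irrelevant newsgroup choices below are illustrative and do not reproduce the study's exact groupings:

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups

    # Illustrative grouping; the study's actual relevant/irrelevant splits differ.
    relevant = ["rec.motorcycles"]
    irrelevant = ["rec.autos", "sci.electronics", "comp.graphics"]

    data = fetch_20newsgroups(subset="all", categories=relevant + irrelevant,
                              remove=("headers", "footers", "quotes"))

    # Binary labels: 1 if a document belongs to the relevant newsgroup, else 0.
    relevant_ids = {data.target_names.index(name) for name in relevant}
    y = np.array([1 if t in relevant_ids else 0 for t in data.target])
    print(y.sum(), "relevant /", len(y) - y.sum(), "irrelevant documents")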


The table below gives the AUC, Precision, and Recall scores of the various classification methods for the corpora C-Mideast, C-IBM-PC, C-Motorcycles, and C-Baseball-2. All scores are computed using a stratified 5-fold cross-validation scheme on each of the four corpora (a sketch of this scheme follows the table).

[Table: Classification performance of various classification methods on all four 20Newsgroups corpora (George 2015)]
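A sketch of this evaluation scheme, assuming a feature matrix X and label vector y such as those produced by the pipeline sketched earlier:

    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.svm import SVC

    # Stratified 5-fold CV: each fold preserves the relevant/irrelevant ratio.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(SVC(kernel="linear"), X, y, cv=cv,
                            scoring=("roc_auc", "precision", "recall"))
    for m in ("test_roc_auc", "test_precision", "test_recall"):
        print(m, round(scores[m].mean(), 3))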


All classification methods performed reasonably well, in terms of AUC, for all feature representation approaches, except for the k-Nearest Neighbors classifiers, which performed poorly for all feature types. It is surprising to note that Logistic Regression and SVM (Linear) gave similar AUC scores for all feature types, with comparable Precision and Recall scores. We believe this is due to the similarity of the optimization algorithms used by the scikit-learn implementations to find optimal solutions, and to the choice of penalties. Another interesting observation is that, for classification, simpler document models such as LSA and TF-IDF outperform LDA-based models on all four corpora. Our guess is that the selection of hyperparameters and the number of topics for the LDA model of a corpus may make a difference in classification performance (this is part of our future work). On the other hand, the TF-IDF scheme may not be ideal for large datasets, as its large feature space can cause computational difficulties in training a classifier.

Furthermore, we noticed a similar pattern in an experiment in which we evaluated linear SVM (SVM-L) models trained with different numbers of seed (training) documents; the rest of the documents in each corpus were used for testing. The experiments are based on two e-discovery datasets employed in the TREC 2010 Legal Learning Track. The AUC performance of the SVM-L models is given below (a sketch of the setup follows the figures). We can see that the LSA and TF-IDF models outperform LDA uniformly on these two datasets.

[Figure: Performance of Linear SVM classifiers on the TREC 2010 query-201 dataset (168 relevant documents, 520 irrelevant documents)]


[Figure: Performance of Linear SVM classifiers on the TREC 2010 query-207 dataset (80 relevant documents, 492 irrelevant documents)]
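A minimal sketch of this seed-size experiment, again assuming features X and labels y as in the earlier sketches; the seed-set sizes are hypothetical:

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    # Train SVM-L on n_seed documents; test on all remaining documents.
    for n_seed in (50, 100, 200, 400):  # hypothetical seed-set sizes
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=n_seed, stratify=y, random_state=0)
        clf = SVC(kernel="linear").fit(X_train, y_train)
        auc = roc_auc_score(y_test, clf.decision_function(X_test))
        print(n_seed, "seeds -> AUC", round(auc, 3))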


In our experience, classifiers such as SVM (RBF kernel), SVM (linear kernel), and Logistic Regression show mixed classification performance across datasets. This suggests that, once the right feature set for the documents in a corpus has been identified, the choice of algorithm for building the classifier is relatively insignificant. The difficulty of the documents in a corpus can also play a role in the performance of these classifiers: we observed (in experiments not shown) that for some datasets the number of support vectors used by the SVM is relatively large compared to other datasets, which suggests the SVM is trying to memorize the training data and may overfit.
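One quick way to inspect this with scikit-learn, assuming the fitted SVC clf and training labels y_train from the sketch above:

    # n_support_ gives the number of support vectors per class for a fitted SVC.
    # A large fraction of training points kept as support vectors can signal
    # that the model is memorizing the training data (possible overfitting).
    n_sv = clf.n_support_.sum()
    print(n_sv, "support vectors =",
          round(100 * n_sv / len(y_train), 1), "% of training documents")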

George (2015) provides more details about this project and further experimental analysis.

Tags: document classification, document clustering, e-discovery, Electronic Discovery, Latent Dirichlet Allocation, predictive coding, topic models, TREC
