
GPText: Greenplum Parallel Statistical Text Analysis Framework

November 11, 2013 · publications, research directions

Kun Li

Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, governments, and hospitals, in the form of emails, electronic notes, and internal documents. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes. Traditional business intelligence pulls content from databases into separate, massive data warehouses to analyze the data. The typical "data movement process" moves information out of the database, analyzes it with external tools, and stores the final product back into the database. This movement process is time-consuming and can even be prohibitive.

Figure: GPText Architecture

Greenplum and our group motivate in-database text analytics with GPText, a powerful and scalable text analysis framework built on the Greenplum MPP database. GPText runs on the Greenplum database (GP), a shared-nothing, massively parallel processing (MPP) database. As shown in the GPText architecture, it is a collection of PostgreSQL instances: one master instance and multiple segment instances. The master accepts SQL queries from clients, divides the workload, and sends sub-tasks to the segments. This embarrassingly parallel processing capability, powered by the Greenplum MPP framework, lays the cornerstone that enables GPText to process production-sized text data. On top of the underlying MPP framework sit two building blocks, MADLib and Solr, as illustrated in the architecture; they distinguish GPText from many existing text analysis tools.
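To give a concrete sense of the shared-nothing layout, the sketch below shows how a table of documents could be hash-distributed across segments using Greenplum SQL. The table and column names are hypothetical illustrations, not part of GPText itself.

    -- Hypothetical documents table; rows are hash-distributed across the
    -- segments so that indexing and analysis run on every segment in parallel.
    CREATE TABLE enron_emails (
        id      BIGINT,
        author  TEXT,
        body    TEXT
    ) DISTRIBUTED BY (id);

    -- A query submitted to the master is split into sub-tasks that each
    -- segment executes over its local slice of the data.
    SELECT count(*) FROM enron_emails WHERE body ILIKE '%review%';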

MADLib makes GPText capable of sophisticated text analysis tasks, such as part-of-speech tagging, named entity recognition, document classification, and topic modeling, with a vast amount of parallelism. GPText uses the CRF package that was contributed to the MADLib open-source library. SQL and user-defined aggregates are used to implement conditional random field (CRF) methods for information extraction in parallel. The CRF modules scale sub-linearly in runtime for both learning and inference as the number of cores increases linearly.
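To illustrate the SQL-driven style this describes, per-document inference can be expressed as an ordinary query so that the database parallelizes it across segments. The sketch below assumes a hypothetical user-defined function named crf_label; it stands in for MADLib's actual CRF interface and is not its real API.

    -- Sketch only: crf_label is a hypothetical UDF that runs Viterbi inference
    -- over one document's token array using a previously trained CRF model.
    -- Because it appears in a plain SELECT, each segment labels its own
    -- documents in parallel, with no data movement out of the database.
    SELECT id,
           crf_label(tokens, 'pos_crf_model') AS pos_tags
    FROM   tokenized_emails;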

Solr is a reliable and scalable text search platform from the Apache Lucene project and has been widely deployed on web servers. Its major features include powerful full-text search, faceted search, and near-real-time indexing. As shown in the figure, GPText uses Solr to create distributed indexes. Because Solr is integrated seamlessly, GPText exposes all of Solr's features.
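A minimal sketch of what indexing and searching could look like from SQL follows. The gptext.create_index and gptext.search calls are assumptions about SQL-facing wrapper functions; the actual GPText function names and signatures may differ, and the table is the hypothetical one from the earlier sketch.

    -- Assumed wrapper functions; exact GPText signatures may differ.
    -- Build a distributed Solr index over the body column of the table.
    SELECT gptext.create_index('public', 'enron_emails', 'id', 'body');

    -- Full-text search fans out to every segment's Solr index and returns
    -- matching document ids with relevance scores, which can be joined
    -- back to the source table inside the database.
    SELECT e.id, e.author, s.score
    FROM   gptext.search('public.enron_emails', 'merger AND review') AS s
    JOIN   enron_emails e ON e.id = s.id;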

With this seamless integration of Solr and MADLib, GPText is a framework over an MPP database with a powerful search engine and advanced statistical text analysis capabilities. The functionality and scalability provided by GPText position it as a strong tool for sophisticated text analytics applications such as eDiscovery.
