Many companies keep large amounts of text data in relational databases. Several challenges arise when using state-of-the-art systems to analyze such datasets. First, an expensive data transfer cost must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale to production-sized datasets. In this paper, we introduce GPText, the Greenplum parallel statistical text analysis framework, which addresses these problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib, an open-source library for scalable in-database analytics that can be installed on PostgreSQL and Greenplum. In addition, we developed and contributed a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an eDiscovery application built on the GPText framework.
Authors:
Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras
Bibtex:
@inproceedings{Li:2013:GGP:2486767.2486774,
  author    = {Li, Kun and Grant, Christan and Wang, Daisy Zhe and Khatri, Sunny and Chitouras, George},
  title     = {GPText: Greenplum parallel statistical text analysis framework},
  booktitle = {Proceedings of the Second Workshop on Data Analytics in the Cloud},
  series    = {DanaC '13},
  year      = {2013},
  isbn      = {978-1-4503-2202-7},
  location  = {New York, New York},
  pages     = {31--35},
  numpages  = {5},
  url       = {http://doi.acm.org/10.1145/2486767.2486774},
  doi       = {10.1145/2486767.2486774},
  acmid     = {2486774},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {RDBMS, massive parallel processing, text analytics},
}
Download: