Kun Li
Text analytics has gained much attention in the big data research community because of the large amounts of text data generated every day in organizations such as companies, government agencies, and hospitals in the form of emails, electronic notes, and internal documents. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes. Traditional business intelligence pulls content from databases into separate, massive data warehouses for analysis. The typical “data movement process” involves moving information out of the database, analyzing it with external tools, and storing the final product back in the database. This movement is time consuming and, at large data volumes, can be prohibitive.
Greenplum and our group motivate in-database text analytics by presenting GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText runs on Greenplum Database (GP), a shared-nothing massively parallel processing (MPP) database. As shown in the GPText architecture, GP is a collection of PostgreSQL instances comprising one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, divides the workload, and sends sub-tasks to the segments. The parallel processing capability provided by the Greenplum MPP framework is the cornerstone that enables GPText to process production-sized text data. On top of the underlying MPP framework sit two building blocks, MADLib and Solr, as illustrated in the architecture; these distinguish GPText from many existing text analysis tools.
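The master/segment division of labor described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not Greenplum code: the function names (`distribute`, `segment_count`, `master_count`), the modulo distribution rule, and the sample rows are all hypothetical, and threads stand in for separate segment hosts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy model of a shared-nothing cluster: each segment owns a
# disjoint slice of the rows and never reads another segment's data.
NUM_SEGMENTS = 3


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Assign each (doc_id, text) row to one segment by doc_id modulo."""
    segments = [[] for _ in range(num_segments)]
    for doc_id, text in rows:
        segments[doc_id % num_segments].append((doc_id, text))
    return segments


def segment_count(local_rows, term):
    """Sub-task run on one segment: count local documents containing term."""
    return sum(1 for _, text in local_rows if term in text.lower().split())


def master_count(segments, term):
    """Master: send the same sub-task to every segment, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda rows: segment_count(rows, term), segments))
    return sum(partials)


# Made-up sample data for illustration only.
rows = [
    (1, "Quarterly report attached"),
    (2, "Patient notes from the clinic"),
    (3, "Draft report for legal review"),
    (4, "Lunch on Friday"),
]
segments = distribute(rows)
total = master_count(segments, "report")
print(total)
```

Because every segment works only on its own slice and the master only merges small partial results, adding segments spreads the scan over more workers without moving the text out of the database.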
MADLib makes GPText capable of sophisticated text analysis tasks, such as part-of-speech tagging, named entity recognition, document classification, and topic modeling, with a high degree of parallelism. GPText uses the CRF package that was contributed to the MADLib open-source library. SQL and user-defined aggregates are used to implement conditional random field (CRF) methods for information extraction in parallel. For both CRF learning and inference, the runtime of the CRF modules improves sublinearly as the number of cores increases linearly.
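To make the inference step concrete, here is a minimal sketch of Viterbi decoding, a standard dynamic program for finding the best label sequence in a linear-chain CRF. It is written in plain Python rather than SQL and user-defined aggregates, so it is not the MADLib implementation; the function name `viterbi`, the weight dictionaries, and the toy weights are assumptions made for illustration.

```python
def viterbi(tokens, labels, emission, transition):
    """Return the highest-scoring label sequence for tokens.

    emission[(token, label)] and transition[(prev_label, label)] are
    log-space feature weights; missing entries score 0.0.
    """
    if not tokens:
        return []
    # best[i][y] = best score of any labeling of tokens[:i + 1] ending in y
    best = [{y: emission.get((tokens[0], y), 0.0) for y in labels}]
    back = []  # back[i][y] = best previous label when tokens[i + 1] gets y
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition.get((p, y), 0.0))
            scores[y] = (best[-1][prev]
                         + transition.get((prev, y), 0.0)
                         + emission.get((tok, y), 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to the first token.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up weights for a three-token part-of-speech example.
labels = ["DET", "NOUN", "VERB"]
emission = {("the", "DET"): 2.0, ("dog", "NOUN"): 1.5,
            ("barks", "VERB"): 1.0, ("barks", "NOUN"): 0.8}
transition = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0, ("NOUN", "NOUN"): 0.2}
tags = viterbi(["the", "dog", "barks"], labels, emission, transition)
print(tags)
```

Each sentence is decoded independently of the others, which is what makes this kind of inference easy to spread across many cores.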
Solr is a reliable and scalable text search platform from the Apache Lucene project, and it has been widely deployed in web servers. Its major features include powerful full-text search, faceted search, and near real-time indexing. As shown in the figure, GPText uses Solr to create distributed indexes. Because Solr is integrated seamlessly into GPText, GPText has all the features that Solr has.
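The idea of a distributed index can be sketched as one small inverted index per shard, with a query fanned out to every shard and the hits merged. This is a simplified illustration in Python, not Solr's API or its index format: the helper names (`build_index`, `search_shard`, `search`) and the sample documents are hypothetical, and real Solr additionally handles text analysis, ranking, and replication.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index {term: set of doc ids} for one shard."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_shard(index, terms):
    """Return ids of local documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search(shards, query):
    """Fan the query out to every shard and merge the per-shard hits."""
    terms = query.lower().split()
    hits = set()
    for index in shards:
        hits |= search_shard(index, terms)
    return sorted(hits)


# Made-up documents split across two shards.
shard_a = build_index({1: "merger agreement draft", 2: "holiday schedule"})
shard_b = build_index({3: "signed merger agreement", 4: "merger rumors"})
results = search([shard_a, shard_b], "merger agreement")
print(results)
```

Each shard answers from its own postings lists only, so the index can live next to the data it describes.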
Through its seamless integration with Solr and MADLib, GPText is a framework over an MPP database that combines a powerful search engine with advanced statistical text analysis capabilities. The functionality and scalability of GPText make it a strong tool for sophisticated text analytics applications such as eDiscovery.