Archer: Query-Driven Machine Learning
In the Archer project we develop techniques that adapt analytics to a specific query rather than running a general computation over all of the data. Typical machine learning pipelines effectively perform a
SELECT * FROM Table; we instead integrate selection-style queries of the form
SELECT * FROM Table WHERE X into the analytics themselves.
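As a rough illustration of the contrast above, the following Python sketch applies a per-row analytic either to every row or only to the rows that satisfy a selection predicate. The names `analyze` and `word_count`, the row layout, and the call counter are all hypothetical and stand in for whatever analytic and schema a real system would use; this is not Archer's implementation.

```python
def word_count(row, stats):
    """Stand-in for an expensive per-row analytic; records each call in `stats`."""
    stats["calls"] += 1
    return len(row["text"].split())


def analyze(rows, where=None, stats=None):
    """Run the analytic over `rows`, restricted to rows matching `where` if given.

    where=None corresponds to SELECT * FROM Table;
    a predicate corresponds to SELECT * FROM Table WHERE X.
    """
    stats = stats if stats is not None else {"calls": 0}
    # Apply the selection first so the analytic never touches unselected rows.
    selected = rows if where is None else (r for r in rows if where(r))
    return {r["id"]: word_count(r, stats) for r in selected}


rows = [
    {"id": 1, "topic": "sports", "text": "a b c"},
    {"id": 2, "topic": "politics", "text": "d e"},
    {"id": 3, "topic": "sports", "text": "f"},
]
full_stats, query_stats = {"calls": 0}, {"calls": 0}
full_result = analyze(rows, stats=full_stats)
query_result = analyze(rows, where=lambda r: r["topic"] == "sports", stats=query_stats)
```

In this toy setting the saving is only the rows skipped by the predicate; the harder problem, which the sketch does not address, is analytics whose result for one row depends on other rows.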
Knowledge Base Acceleration
Wikipedia is the go-to knowledge base for information on events, people, and scores of other topics. Wikipedia is collaboratively edited, but the number of editors is far below the number of entities, so it often takes a long time for important information to be added to the knowledge base.
The Knowledge Base Acceleration (KBA) task reads streams of documents and recommends documents to be cited by knowledge base pages. Several issues are involved in this task:
- Many documents in the stream are not relevant; millions of these documents must be filtered out.
- Some documents refer to different entities that share the same name. It is important to determine which entity a document is referring to.
- Some information is not sufficient for citation. Events may have happened, but they may not be notable enough to be included in the knowledge base.
In this work we attempt to filter a stream of documents and suggest pieces of information to be added to a set of Wikipedia entities.
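The first two issues above can be sketched in Python as a small streaming filter. Everything here is an assumption made for illustration: the `Entity` record, its `context` word set, the `recommend` function, and the `min_overlap` threshold are hypothetical, and simple word overlap is a crude proxy for disambiguation, not the method used in this work. Notability is not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kb_id: str          # hypothetical knowledge base identifier
    name: str           # surface name that may be shared by several entities
    context: frozenset  # hypothetical words associated with the entity's page


def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def recommend(stream, entities, min_overlap=2):
    """Yield (document, entity) pairs worth suggesting as citations."""
    for doc in stream:
        lowered = doc.lower()
        # Step 1: drop documents that mention none of the target names.
        candidates = [e for e in entities if e.name.lower() in lowered]
        if not candidates:
            continue
        # Step 2: among same-name entities, pick the best context match.
        words = tokens(doc)
        best = max(candidates, key=lambda e: len(words & e.context))
        # Skip documents with too little supporting context.
        if len(words & best.context) >= min_overlap:
            yield doc, best


entities = [
    Entity("mj_player", "Michael Jordan", frozenset({"bulls", "points", "nba", "basketball"})),
    Entity("mj_prof", "Michael Jordan", frozenset({"berkeley", "learning", "statistics", "professor"})),
]
docs = [
    "Michael Jordan scored 40 points for the Bulls.",
    "Professor Michael Jordan of Berkeley spoke on machine learning.",
    "Stock markets fell today.",
    "Michael Jordan was seen downtown.",
]
suggestions = [(d, e.kb_id) for d, e in recommend(docs, entities)]
```

Because `recommend` is a generator, documents are processed one at a time, which matches the streaming setting described above.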
Query-Driven Entity Resolution
Entity resolution (ER) is the process of identifying the records (mentions) in a database that correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database; however, for large datasets this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: selection-driven ER and join-driven ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support these two classes of ER queries.
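To make the idea of selection-driven ER concrete, here is a minimal Python sketch that resolves only the mentions belonging to one queried entity. It deliberately swaps in a simple deterministic expansion from the query mention instead of the Metropolis-Hastings variants described above. The function names `same_person` and `resolve_selected`, and the matching rule itself, are hypothetical and assume non-empty mention strings.

```python
def same_person(a, b):
    """Hypothetical match rule: same last name and same first initial."""
    ta, tb = a.lower().split(), b.lower().split()
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]


def resolve_selected(mentions, query, match=same_person):
    """Return sorted indices of the mentions resolved to the queried entity.

    Only comparisons against the query and the mentions already attached to
    it are made; mentions of other entities are never compared to each other.
    """
    unresolved = set(range(len(mentions)))
    cluster = set()
    frontier = [query]
    while frontier:
        current = frontier.pop()
        matched = {i for i in unresolved if match(mentions[i], current)}
        unresolved -= matched
        cluster |= matched
        # Newly attached mentions may in turn match further mentions.
        frontier.extend(mentions[i] for i in matched)
    return sorted(cluster)


mentions = ["Ada Lovelace", "A. Lovelace", "Alan Turing", "A. Turing", "ada lovelace"]
lovelace = resolve_selected(mentions, "Ada Lovelace")
turing = resolve_selected(mentions, "A. Turing")
```

This plays the role of a query such as SELECT * FROM Mentions WHERE entity = 'Ada Lovelace' (a hypothetical table and column): the Turing mentions are left unresolved because the query never asks about them.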