Data Science Research

DBSim: Extensible Database Simulator for Fast Prototyping In-Database Algorithms

Yifan Wang, February 20, 2023

Many data scientists and analysts spend a large portion of their time in a routine loop: exporting data from a database, processing and analyzing it with external data science tools, and re-importing the results into the database. To break this loop and save users’ time, in-database analytics has emerged in recent years. It implements commonly used analytic algorithms and tools inside mainstream relational database management systems (RDBMS), so that users can complete most data processing and analysis tasks directly in the RDBMS without the export-import loop.

However, developing in-database algorithms raises a recurring issue: even when an algorithm works well externally, its developers usually cannot be sure it will perform as expected inside an RDBMS, and they get the answer only after completing the implementation. Because mainstream RDBMS codebases are very large, a full implementation may take a long time, so considerable development effort is at risk of being wasted if the finished implementation performs poorly. We therefore developed DBSim, a highly extensible RDBMS simulator, as a testbed where in-database algorithm developers can estimate the cost of their algorithms before implementing them in a real RDBMS, which reduces that risk.

System Overview

DBSim is designed with two goals:
  • providing a testing environment that is as similar as possible to a real RDBMS, together with accurate query cost estimation tools
  • providing enough extensibility for users to develop various data science algorithms without modifying the kernel
For the first goal, the kernel of DBSim covers all the major components of the query engine in a general RDBMS, from the query parser to the physical plan executor. The query processing workflow in DBSim is also the same as in a general RDBMS: a SQL query is submitted through the API and passed through the kernel components one by one until it is executed by the physical plan executor, and the results are returned to the user via the API. To accurately estimate the query processing cost (a straightforward measure of how well users’ algorithms perform), DBSim provides three methods: (1) a built-in cost estimator, (2) an external estimator from a real RDBMS, and (3) measurement of the actual query execution time in DBSim.

For the second goal, we designed a registry-based extension mechanism to manage user-created extensions. DBSim organizes extensions in units called Syntaxes, where each Syntax contains a specific suite of extension classes and functions; for example, SpatialSyntax contains the operations and data types needed for spatial data processing. A global registry acts as middleware between the extensions and the kernel, so that the extensions stay decoupled from the kernel. The registry provides entry points where users mount their extensions. These entry points can be enabled and disabled flexibly, so different Syntaxes can co-exist (as long as their implementations do not conflict). The entry points cover all five components of the kernel, so users can extend any of them.
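To make the registry idea above more concrete, here is a minimal sketch in Python. It is not DBSim's actual API: the `Registry` class, its `mount`, `enable`, `disable`, and `active` methods, the entry point name `executor.functions`, and the `ST_Distance` function are all hypothetical names chosen for illustration. Only the concepts (a global registry, entry points, Syntax units that can be switched on and off, and the SpatialSyntax example) come from the description above.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, Set


class Registry:
    """Hypothetical global registry sitting between extensions and the kernel."""

    def __init__(self) -> None:
        # entry point name -> syntax name -> extension name -> callable
        self._entries: Dict[str, Dict[str, Dict[str, Callable]]] = defaultdict(
            lambda: defaultdict(dict)
        )
        self._enabled: Set[str] = set()

    def mount(self, entry_point: str, syntax: str, name: str, ext: Callable) -> None:
        """Attach an extension to an entry point on behalf of a Syntax."""
        self._entries[entry_point][syntax][name] = ext

    def enable(self, syntax: str) -> None:
        self._enabled.add(syntax)

    def disable(self, syntax: str) -> None:
        self._enabled.discard(syntax)

    def active(self, entry_point: str) -> Dict[str, Callable]:
        """Return the extensions visible to the kernel at this entry point."""
        visible: Dict[str, Callable] = {}
        for syntax, extensions in self._entries[entry_point].items():
            if syntax in self._enabled:
                visible.update(extensions)
        return visible


def st_distance(a, b):
    """Illustrative spatial function: Euclidean distance between two points."""
    return math.dist(a, b)


registry = Registry()
# Mount a spatial function under an assumed executor-level entry point.
registry.mount("executor.functions", "SpatialSyntax", "ST_Distance", st_distance)

# The kernel only sees the extension while its Syntax is enabled.
registry.enable("SpatialSyntax")
distance_fn = registry.active("executor.functions")["ST_Distance"]
print(distance_fn((0.0, 0.0), (3.0, 4.0)))  # 5.0

registry.disable("SpatialSyntax")
print(registry.active("executor.functions"))  # {}
```

In this sketch the kernel never imports extension code directly. It only asks the registry what is active at a given entry point, which is one way to keep extensions decoupled from the kernel and to let several Syntaxes be toggled independently.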

Interface

The GUI of DBSim has six main blocks: (1) a query input box, (2) a query result box, (3) a query plan visualization panel, (4) a list of query optimization rules, (5) history records, and (6) a button for managing datasets. Users click the button to upload, update, or delete datasets, adjust the optimization rules through the rule list, and type a query into the input box. The results are then displayed in the result box, the query plans are drawn in the visualization panel, and a new record is appended to the history list. Please refer to our paper (CIKM 2022) for more details.