This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

optd

optd (pronounced as op-dee) is a database optimizer framework. It is a cost-based optimizer that searches the plan space using the rules that the user defines and derives the optimal plan based on the cost model and the physical properties.
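
To make the idea of rule-driven, cost-based search concrete, here is a minimal sketch in Rust. It is not optd's API and not a Cascades memo: `Plan`, `row_count`, `cost`, `commute`, and `optimize` are hypothetical names, the cost formula and the 0.1 join selectivity are made-up assumptions, and the search simply optimizes children bottom-up and then tries a single join-commutativity rule at each join.

```rust
/// Hypothetical toy plan representation (not optd's plan nodes).
#[derive(Clone, Debug, PartialEq)]
pub enum Plan {
    Scan { table: &'static str, rows: f64 },
    Join(Box<Plan>, Box<Plan>),
}

/// Estimated output rows; the 0.1 join selectivity is an arbitrary assumption.
pub fn row_count(p: &Plan) -> f64 {
    match p {
        Plan::Scan { rows, .. } => *rows,
        Plan::Join(l, r) => row_count(l) * row_count(r) * 0.1,
    }
}

/// Toy cost model: the left input is treated as the build side of a hash
/// join, so it is charged twice as much per row as the probe side.
pub fn cost(p: &Plan) -> f64 {
    match p {
        Plan::Scan { rows, .. } => *rows,
        Plan::Join(l, r) => cost(l) + cost(r) + 2.0 * row_count(l) + row_count(r),
    }
}

/// A transformation rule: join commutativity. Returns None if it does not apply.
pub fn commute(p: &Plan) -> Option<Plan> {
    match p {
        Plan::Join(l, r) => Some(Plan::Join(r.clone(), l.clone())),
        Plan::Scan { .. } => None,
    }
}

/// Optimize children first, then keep whichever of the original join and
/// its commuted alternative is cheaper under the cost model.
pub fn optimize(p: &Plan) -> Plan {
    match p {
        Plan::Scan { .. } => p.clone(),
        Plan::Join(l, r) => {
            let base = Plan::Join(Box::new(optimize(l)), Box::new(optimize(r)));
            match commute(&base) {
                Some(alt) if cost(&alt) < cost(&base) => alt,
                _ => base,
            }
        }
    }
}
```

With a 1000-row table joined to a 10-row table, `optimize` moves the smaller table to the build side because that alternative is cheaper under this toy cost model. A real Cascades-style optimizer would instead record all alternatives in a memo table and track physical properties per group.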

The primary objective of optd is to explore the challenges involved in implementing a cost-based optimizer that is effective in real-world production usage. optd implements the Columbia Cascades optimizer framework based on Yongwen Xu's master's thesis. Besides Cascades, optd also provides a heuristics optimizer implementation for testing purposes.

The other key objective is to implement a flexible optimizer framework that supports adaptive query optimization (a.k.a. reoptimization) and adaptive query execution. optd executes a query, captures runtime information, and uses this data to guide subsequent plan space searches and cost model estimations. This progressive approach lets plans improve from one execution to the next and allows the optimizer to explore a large plan space.
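
The feedback loop described above can be sketched roughly as follows. This is an illustration only, not optd's implementation: `RuntimeStats`, `Candidate`, `cost`, and `pick_best` are hypothetical names, and the cost function (the sum of intermediate result sizes) is an assumption chosen for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical store of cardinalities observed while executing earlier
/// plans, keyed by a name for the intermediate result.
#[derive(Default)]
pub struct RuntimeStats {
    observed: HashMap<String, f64>,
}

impl RuntimeStats {
    /// Record the true row count seen at runtime for an intermediate result.
    pub fn record(&mut self, subplan: &str, rows: f64) {
        self.observed.insert(subplan.to_string(), rows);
    }

    /// Prefer an observed cardinality; fall back to the static estimate.
    pub fn cardinality(&self, subplan: &str, estimate: f64) -> f64 {
        self.observed.get(subplan).copied().unwrap_or(estimate)
    }
}

/// A candidate plan: a label plus the intermediate results it produces,
/// each paired with its static cardinality estimate.
pub struct Candidate {
    pub name: &'static str,
    pub intermediates: Vec<(&'static str, f64)>,
}

/// Toy cost: total size of all intermediate results.
pub fn cost(c: &Candidate, stats: &RuntimeStats) -> f64 {
    c.intermediates
        .iter()
        .map(|(key, est)| stats.cardinality(key, *est))
        .sum()
}

/// Pick the cheapest candidate given what has been observed so far.
pub fn pick_best<'a>(cands: &'a [Candidate], stats: &RuntimeStats) -> &'a Candidate {
    cands
        .iter()
        .min_by(|a, b| cost(a, stats).total_cmp(&cost(b, stats)))
        .expect("at least one candidate plan")
}
```

In this sketch, a first run might choose the plan whose intermediate result is estimated at 100 rows. If execution shows that result actually has 10,000 rows, calling `record` with the observed count makes the next call to `pick_best` switch to a different join order.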

Currently, optd is integrated into Apache Arrow Datafusion as a physical optimizer. It receives the logical plan from Datafusion, implements various physical optimizations (e.g., determining the join order), and subsequently converts it back into the Datafusion physical plan for execution.

optd is a research project and is still evolving. It should not be used in production. The code is licensed under MIT.

Get Started

The following demos can be run with optd. More information is available in the docs.

cargo run --release --bin optd-adaptive-tpch-q8
cargo run --release --bin optd-adaptive-three-join

You can also run the Datafusion CLI to experiment with optd interactively.

cargo run --bin datafusion-optd-cli

You can also test the performance of the cost model with the "cardinality benchmarking" feature (more information is available in the docs). Before running it, you need to start Postgres manually on your machine. A CI script runs this command (TPC-H with scale factor 0.01) before every merge into main, so it should work reliably.

cargo run --release --bin optd-perfbench cardbench tpch --scale-factor 0.01

Documentation

The documentation is available in the mdbook format in the docs directory.

Structure

  • datafusion-optd-cli: A patched Apache Arrow Datafusion CLI (version 32) that calls into optd.
  • datafusion-optd-bridge: Implementation of Apache Arrow Datafusion query planner as a bridge between optd and Apache Arrow Datafusion.
  • optd-core: The core framework of optd.
  • optd-datafusion-repr: Representation of Apache Arrow Datafusion plan nodes in optd.
  • optd-adaptive-demo: Demo of the adaptive optimization capabilities of optd. More information is available in the docs.
  • optd-sqlplannertest: Planner test of optd based on risinglightdb/sqlplannertest-rs.
  • optd-gungnir: Scalable, memory-efficient, and parallelizable statistical methods for cardinality estimation (e.g. TDigest, HyperLogLog).
  • optd-perfbench: A CLI program for benchmarking performance (cardinality, throughput, etc.) against other databases.
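
As an illustration of the kind of statistical sketch mentioned for optd-gungnir, here is a minimal HyperLogLog distinct-count estimator in Rust. It is a textbook-style sketch under stated assumptions, not optd-gungnir's code: the `HyperLogLog` type and its methods are hypothetical, hashing uses the standard library's `DefaultHasher`, and the bias constant used is the usual approximation for 128 or more registers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical minimal HyperLogLog with 2^p one-byte registers.
pub struct HyperLogLog {
    p: u32,
    registers: Vec<u8>,
}

impl HyperLogLog {
    /// `p` is the number of hash bits used to select a register.
    pub fn new(p: u32) -> Self {
        // The alpha approximation in `estimate` assumes at least 128 registers.
        assert!((7..=16).contains(&p), "p must be in 7..=16");
        Self { p, registers: vec![0; 1 << p] }
    }

    pub fn insert<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let x = hasher.finish();
        // The top p bits choose the register.
        let idx = (x >> (64 - self.p)) as usize;
        // Rank = position of the first 1-bit in the remaining 64 - p bits.
        let rest = x << self.p;
        let rank = (rest.leading_zeros().min(64 - self.p) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Estimated number of distinct items inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

The sketch uses a fixed 2^p bytes regardless of input size, and re-inserting a value never changes a register, so duplicates do not affect the estimate. Two sketches built with the same `p` could be merged by taking the register-wise maximum, which is what makes this family of methods parallelizable.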

Related Works