Skip to content

PyDough DSL Benchmarker#513

Open
john-sanchez31 wants to merge 34 commits into
mainfrom
John/benchmarker
Open

PyDough DSL Benchmarker#513
john-sanchez31 wants to merge 34 commits into
mainfrom
John/benchmarker

Conversation

@john-sanchez31
Copy link
Copy Markdown
Contributor

Linked ticket

Closes #

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Docs / config

What changed and why?

Adding the new benchmarker. This new class will run a list of questions and get metrics related to the performance of pydough. All this data can be used for detecting regressions and opportunities of optimizations. This generates a new file with all collected information and adds a new row on the benchmark_metrics.csv file.

How I tested this?

This new class has been tested locally using the command python executer.py. There are not tests on our test suite for this new feature that could be discussed later.

Notes for reviewers

The workflow spins up the Postgres Docker container as a service using the image bodoai1/pydough-benchmarker:latest, installs the project dependencies using uv, and runs executer.py to execute the benchmark. Once the benchmark finishes and the metrics files are generated, it commits those files to a new branch and automatically opens a PR against main for review.
It triggers automatically on every new GitHub release published, and can also be run manually at any time from the Actions tab.

Copy link
Copy Markdown
Contributor Author

@john-sanchez31 john-sanchez31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also take a look of the generated PR for the last run of the benchmarker: #517

type: boolean
required: false
default: false
run-benchmark:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary flag so I can trigger the workflow without having to merge to main. The idea would be that the 2 main ways to execute the benchmark are through github actions tab and on published release

python-versions: ${{ github.event_name == 'workflow_dispatch'
&& needs.get-py-ver-matrix.outputs.matrix
|| '["3.10", "3.11", "3.12", "3.13"]' }}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be deleted before merging to main

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.

Comment thread benchmark/questions.csv Outdated
@@ -0,0 +1,975 @@
benchmark,question_id,pydough,sql
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about adding the questions itself here and add it to the generated files. Do you think is something that would help us or not really?

Comment thread benchmark/questions.csv Outdated
)
) AS custsale
GROUP BY cntrycode
ORDER BY cntrycode;" No newline at end of file
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to fix this

Comment thread benchmark/README.md
@@ -0,0 +1,124 @@
# PyDough Performance Benchmarker
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much is the same we have in the design document. Let me know if I need to add something else

Copy link
Copy Markdown
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, John!
Please see my comments below. Mostly about handling what happens if things go wrong and matching README with code.

python-versions: ${{ github.event_name == 'workflow_dispatch'
&& needs.get-py-ver-matrix.outputs.matrix
|| '["3.10", "3.11", "3.12", "3.13"]' }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.

@@ -0,0 +1,870 @@
[
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is the same as sample one in tests for now, right?

Comment thread benchmark/benchmarker.py

@property
def connection(self) -> Connection:
"""Independent connection to the database"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does independent connection mean in this context?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a single database connection for one specific engine

Comment thread benchmark/benchmarker.py
Comment on lines +161 to +166
if run_type == "R":
# e.g. 20260508_R_v1.2.0
return f"{today}_R_{run_id}"
else:
# e.g. 20260508_M_25526200301
return f"{today}_M_{run_id}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment clarifying what does R and M refer to.

Comment thread benchmark/benchmarker.py Outdated

def execute_pydough(self, pydough_str: str) -> tuple[DataFrame, str]:
"""
Executes the given pyodugh code using `from_string` api
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring args/return

Comment thread benchmark/benchmarker.py
"pydough_exec_time": "float",
"exec_plan": "str",
"pydough_exec_plan": "str",
"status": "str",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, where is pydough_time_ms (Time to generate SQL via PyDough)?

Comment thread benchmark/README.md Outdated
| `benchmark` | `str` | Name of the benchmark suite |
| `ground_truth` | `str` | Reference SQL query |
| `pydough_sql` | `str` | SQL generated by PyDough |
| `pydough_time_ms` | `float` | Time to generate SQL via PyDough (ms) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing from the code.

Comment thread benchmark/README.md Outdated
### `measure() -> None`
Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics.

### `execute_pydough(pydough_code: str) -> DataFrame`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### `execute_pydough(pydough_code: str) -> DataFrame`
### `execute_pydough(pydough_code: str) -> tuple[DataFrame, str]`

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment thread benchmark/README.md Outdated
Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics.

### `execute_pydough(pydough_code: str) -> DataFrame`
Executes the given PyDough DSL code using the `from_string` API and returns the result as a DataFrame.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update description here to reflect code changes.

Comment thread benchmark/README.md Outdated
Comment on lines +111 to +112
### `query_validation(pydough_result: DataFrame, ground_truth_result: DataFrame) -> bool`
Compares the PyDough result against the ground truth. Supports complex comparison logic to account for ordering and floating-point tolerances.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not in the code as its own function.
It's better to add it in the code so that if we want to change how we do validation later own, it's easier to spot.

Copy link
Copy Markdown
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-sanchez31 FANTASTIC job accomplishing this! There are a few things to tinker with, but this is definitely a great tool.

Comment thread benchmark/executor.py
@@ -0,0 +1,65 @@
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be spelled executor

Comment thread benchmark/questions.csv
Comment on lines +7 to +24
included.","result = (
lines.WHERE(ship_date <= DATETIME('1998-12-1', '-90 days'))
.PARTITION(name=""groups"", by=(return_flag, status))
.CALCULATE(
L_RETURNFLAG=return_flag,
L_LINESTATUS=status,
SUM_QTY=SUM(lines.quantity),
SUM_BASE_PRICE=SUM(lines.extended_price),
SUM_DISC_PRICE=SUM(lines.extended_price * (1 - lines.discount)),
SUM_CHARGE=SUM(
lines.extended_price * (1 - lines.discount) * (1 + lines.tax)
),
AVG_QTY=AVG(lines.quantity),
AVG_PRICE=AVG(lines.extended_price),
AVG_DISC=AVG(lines.discount),
COUNT_ORDER=COUNT(lines),
)
.ORDER_BY(L_RETURNFLAG.ASC(), L_LINESTATUS.ASC())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not something to be addressed right now, but I believe in our original discussions about the benchmarker we discussed having the LLM generate the PyDough code? I forget, did we decide against that at some point?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at some point the original discussed task was divided into 2 different task. For the LLM team I created a script for data generation and for the DSL I'm working on the benchmarker. We could discuss about adding a LLM generation for the benchmarker for a followup task.

Comment thread .github/workflows/benchmark.yml Outdated
Comment on lines +87 to +99
git commit -m "chore: update benchmark metrics $(date +'%Y-%m-%d')"
git push origin "$BRANCH"

gh pr create \
--base main \
--head "$BRANCH" \
--title "Benchmark metrics update — $(date +'%Y-%m-%d')" \
--label "benchmark" \
--label "automated" \
--body "Automated benchmark run completed successfully.

**Triggered by:** \`${{ github.event_name }}\`
**Run:** [${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ensure the commits/PR generated indicate:

  • Whether it was generated from a release or manually
  • If on a release, which release
  • If manually, on which commit of which branch

Comment thread .github/workflows/benchmark.yml Outdated
Comment on lines +79 to +80
git checkout -b "$BRANCH"
git add benchmark/metrics/*.csv benchmark/benchmark_metrics.csv
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR that was opened included all of these changes AND the metric file. Will this always happen? If so that is potentially problematic if people are doing active development on a branch while they run the workflow, since the PR will contain all of their active development changes.

Comment thread benchmark/README.md
| Benchmark | Queries | Status |
|---|---|---|
| TPC-H (Scale Factor 10) | 22 | Available |
| TPC-DS (Scale Factor 10) | 99 | Added incrementally |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to do SF-10 for DS? That database can get pretty big. You may want to experiment to see which one is more viable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I already have both TPCH and TPCDS SF-10 on the docker image. Is it worth the experimentation at this point?

Comment thread benchmark/README.md Outdated

## Automated Runs

The benchmarker runs automatically on every PyDough DSL release via GitHub Actions. Results are committed as a new PR containing a timestamped CSV file. The workflow can also be triggered manually.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a more in-depth explanation of the following:

  • The CSV file is in the format described in ## Metrics
  • The file is generated from the workflow run on release, or the manual one
  • The exact naming scheme of the generated file in those two versions

Also, perhaps we should have a comparator script that takes in 2x of these benchmark run file names and outputs a comparison file:

  • How many tests were right/wrong before/after (list all tests that went wrong->right or right->wrong)
  • The total time performance only looking at queries that were correct in both versions
  • For all queries that were correct in both versions, a breakdown of the absolute & relative % time improvement/regression (can sort this from best-to-worst)

This can be dumped into stdout or into a file for analysis.

Copy link
Copy Markdown
Contributor

@knassre-bodo knassre-bodo May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, suppose file R1 had the following data:

  • q1: correct, pydough_time=10
  • q2: correct, pydough_time=15
  • q3: correct, pydough_time=20
  • q4: incorrect, pydough_time=25
  • q5: incorrect, pydough_time=30
  • q6: correct, pydough_time=42

And suppose file R2 had the following data:

  • q1: incorrect, pydough_time=30
  • q2: correct, pydough_time=10
  • q3: correct, pydough_time=23
  • q4: correct, pydough_time=5
  • q5: incorrect, pydough_time=25
  • q6: correct, pydough_time=30

What the comparison file would tell you (formatted a bit nicer, and probably going to decimals):

  • R1: 4/6 correct
  • R2: 4/6 correct
  • Incorrect->Correct: q6
  • Correct->Incorrect: q1
  • Total Correct Time: 77->63 (-18%)
  • Time breakdown:
    • q2: 15->10 (-33%)
    • q6: 42->30 (-29%)
    • q3: 20->23 (+15%)

Could also do the same analysis, but instead of comparing the absolute times, we compare the differences relative to the refsol times (e.g. did R2 get faster/slower relative to the refsol than R1). Perhaps both of those time analyses could be in the same file, in different sections?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still applicable taking in count that for instance every query is written manually and will be correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants