PyDough DSL Benchmarker by john-sanchez31 · Pull Request #513 · bodo-ai/PyDough

john-sanchez31 · 2026-04-30T19:20:38Z

Linked ticket

Closes #

Type of change

Bug fix
New feature
Refactor
Docs / config

What changed and why?

This PR adds the new benchmarker. The new class runs a list of questions and collects metrics on PyDough's performance. This data can be used to detect regressions and to find optimization opportunities. Each run generates a new file with all the collected information and appends a new row to the benchmark_metrics.csv file.

How I tested this?

This new class has been tested locally using the command `python executer.py`. There are no tests in our test suite for this new feature yet; that can be discussed later.

Notes for reviewers

The workflow spins up the Postgres Docker container as a service using the image bodoai1/pydough-benchmarker:latest, installs the project dependencies using uv, and runs executer.py to execute the benchmark. Once the benchmark finishes and the metrics files are generated, it commits those files to a new branch and automatically opens a PR against main for review.
The workflow triggers automatically whenever a new GitHub release is published, and can also be run manually at any time from the Actions tab.

john-sanchez31

Also take a look at the PR generated by the last run of the benchmarker: #517

john-sanchez31 · 2026-05-11T22:45:48Z

        type: boolean
        required: false
        default: false
+      run-benchmark:


This is a temporary flag so I can trigger the workflow without having to merge to main. The idea is that the two main ways to run the benchmark are through the GitHub Actions tab and on a published release.

john-sanchez31 · 2026-05-11T22:46:06Z

      python-versions: ${{ github.event_name == 'workflow_dispatch'
                      && needs.get-py-ver-matrix.outputs.matrix
                      || '["3.10", "3.11", "3.12", "3.13"]' }}
+


This can be deleted before merging to main

Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.

john-sanchez31 · 2026-05-11T22:53:28Z

@@ -0,0 +1,975 @@
+benchmark,question_id,pydough,sql


I've been thinking about adding the questions themselves here and including them in the generated files. Do you think that would help us, or not really?

john-sanchez31 · 2026-05-11T22:54:07Z

+      )
+) AS custsale
+GROUP BY cntrycode
+ORDER BY cntrycode;"


Reminder to fix this

john-sanchez31 · 2026-05-11T22:55:13Z

@@ -0,0 +1,124 @@
+# PyDough Performance Benchmarker


This is pretty much the same as what we have in the design document. Let me know if I need to add anything else.

hadia206

Good job, John!
Please see my comments below. Mostly about handling what happens if things go wrong and matching README with code.

hadia206 · 2026-05-14T17:39:18Z

      python-versions: ${{ github.event_name == 'workflow_dispatch'
                      && needs.get-py-ver-matrix.outputs.matrix
                      || '["3.10", "3.11", "3.12", "3.13"]' }}
+


Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.

hadia206 · 2026-05-14T17:39:59Z

@@ -0,0 +1,870 @@
+[


I'm assuming this is the same as the sample one in the tests for now, right?

hadia206 · 2026-05-14T17:50:25Z

+
+    @property
+    def connection(self) -> Connection:
+        """Independent connection to the database"""


What does independent connection mean in this context?

Just a single database connection for one specific engine

hadia206 · 2026-05-14T17:52:05Z

+        if run_type == "R":
+            # e.g. 20260508_R_v1.2.0
+            return f"{today}_R_{run_id}"
+        else:
+            # e.g. 20260508_M_25526200301
+            return f"{today}_M_{run_id}"


Let's add a comment clarifying what R and M refer to.
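
One way the requested comment could look, as a minimal sketch. The function name `build_run_label` and the `today` parameter are hypothetical, and the reading of R as a release-triggered run and M as a manually triggered run is inferred from the two examples in the diff (`v1.2.0` looks like a release tag, `25526200301` like a workflow run id); it is not confirmed by the code shown here.

```python
from datetime import date
from typing import Optional


def build_run_label(run_type: str, run_id: str, today: Optional[date] = None) -> str:
    """Build a label such as 20260508_R_v1.2.0 or 20260508_M_25526200301.

    run_type: "R" is assumed to mean a run triggered by a release,
        "M" a run triggered manually (inferred from the diff examples).
    run_id: assumed to be the release tag for "R" runs and the
        workflow run id for "M" runs.
    """
    if run_type not in ("R", "M"):
        raise ValueError(f"unknown run type: {run_type!r}")
    stamp = (today or date.today()).strftime("%Y%m%d")
    return f"{stamp}_{run_type}_{run_id}"
```

Rejecting unknown run types up front keeps a typo from silently producing a label that looks like a manual run.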

hadia206 · 2026-05-14T17:53:07Z

+
+    def execute_pydough(self, pydough_str: str) -> tuple[DataFrame, str]:
+        """
+        Executes the given PyDough code using the `from_string` API


The docstring is missing the args/return sections.

hadia206 · 2026-05-14T18:19:28Z

+        "pydough_exec_time": "float",
+        "exec_plan": "str",
+        "pydough_exec_plan": "str",
+        "status": "str",


Also, where is pydough_time_ms (Time to generate SQL via PyDough)?

hadia206 · 2026-05-14T18:19:56Z

+| `benchmark` | `str` | Name of the benchmark suite |
+| `ground_truth` | `str` | Reference SQL query |
+| `pydough_sql` | `str` | SQL generated by PyDough |
+| `pydough_time_ms` | `float` | Time to generate SQL via PyDough (ms) |


Missing from the code.
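
A minimal sketch of how a metric like `pydough_time_ms` could be captured, assuming the SQL-generation step can be wrapped in a single call. `timed_ms` and `fake_generate_sql` are hypothetical names used only for illustration; the stand-in function is not the PyDough API.

```python
import time
from typing import Any, Callable


def timed_ms(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Hypothetical stand-in for the PyDough-to-SQL step; not the real API.
def fake_generate_sql(pydough_code: str) -> str:
    return f"SELECT 1 /* generated from {len(pydough_code)} chars */"


sql, pydough_time_ms = timed_ms(fake_generate_sql, "result = nations.CALCULATE(name)")
```

`time.perf_counter()` is used because it is monotonic and has the highest available resolution, which matters when the generation step takes only a few milliseconds.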

hadia206 · 2026-05-14T18:21:08Z

+### `measure() -> None`
+Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics.
+
+### `execute_pydough(pydough_code: str) -> DataFrame`


Suggested change

-### `execute_pydough(pydough_code: str) -> DataFrame`
+### `execute_pydough(pydough_code: str) -> tuple[DataFrame, str]`

Good catch!

hadia206 · 2026-05-14T18:21:24Z

+Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics.
+
+### `execute_pydough(pydough_code: str) -> DataFrame`
+Executes the given PyDough DSL code using the `from_string` API and returns the result as a DataFrame.


Update description here to reflect code changes.

hadia206 · 2026-05-14T18:23:16Z

+### `query_validation(pydough_result: DataFrame, ground_truth_result: DataFrame) -> bool`
+Compares the PyDough result against the ground truth. Supports complex comparison logic to account for ordering and floating-point tolerances.


This one is not in the code as its own function.
It's better to add it to the code so that if we want to change how we do validation later on, it's easier to spot.
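
A rough sketch of what a standalone validation function might look like. To stay dependency-free it compares plain row tuples rather than DataFrames, so the signature differs from the one in the README; the `ordered`, `rel_tol`, `abs_tol` and `ndigits` parameters are assumptions, not part of the PR.

```python
import math
from typing import Any, Sequence

Row = Sequence[Any]


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _cell_key(value: Any, ndigits: int) -> tuple:
    # Sort key that tolerates None and mixed types within a column.
    if value is None:
        return (0, 0)
    if _is_number(value):
        return (1, round(float(value), ndigits))
    return (2, str(value))


def _cells_match(a: Any, b: Any, rel_tol: float, abs_tol: float) -> bool:
    if _is_number(a) and _is_number(b):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def query_validation(
    pydough_rows: Sequence[Row],
    ground_truth_rows: Sequence[Row],
    ordered: bool = False,
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
    ndigits: int = 6,
) -> bool:
    """Return True if both results hold the same rows within tolerance."""
    if len(pydough_rows) != len(ground_truth_rows):
        return False
    left, right = list(pydough_rows), list(ground_truth_rows)
    if not ordered:
        # Rounded sort keys are a simplification: values straddling a
        # rounding boundary could sort differently on each side.
        def key(row: Row) -> tuple:
            return tuple(_cell_key(v, ndigits) for v in row)

        left, right = sorted(left, key=key), sorted(right, key=key)
    return all(
        len(r1) == len(r2)
        and all(_cells_match(a, b, rel_tol, abs_tol) for a, b in zip(r1, r2))
        for r1, r2 in zip(left, right)
    )
```

Keeping the comparison behind one function means a later change, for example stricter tolerances for one benchmark, touches a single place.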

knassre-bodo

@john-sanchez31 FANTASTIC job accomplishing this! There are a few things to tinker with, but this is definitely a great tool.

knassre-bodo · 2026-05-19T17:03:03Z

@@ -0,0 +1,65 @@
+"""


I think it should be spelled executor

knassre-bodo · 2026-05-19T17:05:59Z

+included.","result = (
+        lines.WHERE(ship_date <= DATETIME('1998-12-1', '-90 days'))
+        .PARTITION(name=""groups"", by=(return_flag, status))
+        .CALCULATE(
+            L_RETURNFLAG=return_flag,
+            L_LINESTATUS=status,
+            SUM_QTY=SUM(lines.quantity),
+            SUM_BASE_PRICE=SUM(lines.extended_price),
+            SUM_DISC_PRICE=SUM(lines.extended_price * (1 - lines.discount)),
+            SUM_CHARGE=SUM(
+                lines.extended_price * (1 - lines.discount) * (1 + lines.tax)
+            ),
+            AVG_QTY=AVG(lines.quantity),
+            AVG_PRICE=AVG(lines.extended_price),
+            AVG_DISC=AVG(lines.discount),
+            COUNT_ORDER=COUNT(lines),
+        )
+        .ORDER_BY(L_RETURNFLAG.ASC(), L_LINESTATUS.ASC())


Probably not something to be addressed right now, but I believe in our original discussions about the benchmarker we discussed having the LLM generate the PyDough code? I forget, did we decide against that at some point?

Yes, at some point the originally discussed task was split into two separate tasks. For the LLM team I created a script for data generation, and for the DSL I'm working on the benchmarker. We could discuss adding LLM generation to the benchmarker as a follow-up task.

knassre-bodo · 2026-05-19T17:19:12Z

+          git commit -m "chore: update benchmark metrics $(date +'%Y-%m-%d')"
+          git push origin "$BRANCH"
+
+          gh pr create \
+            --base main \
+            --head "$BRANCH" \
+            --title "Benchmark metrics update — $(date +'%Y-%m-%d')" \
+            --label "benchmark" \
+            --label "automated" \
+            --body "Automated benchmark run completed successfully.
+
+            **Triggered by:** \`${{ github.event_name }}\`
+            **Run:** [${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})


Let's ensure the commits/PR generated indicate:

Whether it was generated from a release or manually

If on a release, which release

If manually, on which commit of which branch
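
The three cases above could be turned into a message by a small helper like the sketch below. `describe_trigger` is a hypothetical name. `GITHUB_EVENT_NAME`, `GITHUB_REF_NAME` and `GITHUB_SHA` are standard GitHub Actions environment variables; the assumption that `GITHUB_REF_NAME` holds the release tag on a `release` event should be verified against the actual workflow.

```python
import os


def describe_trigger(event_name: str, ref_name: str, sha: str) -> str:
    """Summarize how a benchmark run was triggered, for commit/PR messages."""
    short_sha = sha[:7]
    if event_name == "release":
        # Assumption: ref_name is the tag of the published release.
        return f"Release run for tag {ref_name}"
    if event_name == "workflow_dispatch":
        return f"Manual run on branch {ref_name} at commit {short_sha}"
    return f"Run triggered by {event_name} on {ref_name} at commit {short_sha}"


# Outside GitHub Actions the variables are unset, so fall back to placeholders.
message = describe_trigger(
    os.environ.get("GITHUB_EVENT_NAME", "workflow_dispatch"),
    os.environ.get("GITHUB_REF_NAME", "unknown-branch"),
    os.environ.get("GITHUB_SHA", "0000000"),
)
print(message)
```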

knassre-bodo · 2026-05-19T17:20:43Z

+          git checkout -b "$BRANCH"
+          git add benchmark/metrics/*.csv benchmark/benchmark_metrics.csv


The PR that was opened included all of these changes AND the metric file. Will this always happen? If so that is potentially problematic if people are doing active development on a branch while they run the workflow, since the PR will contain all of their active development changes.

knassre-bodo · 2026-05-19T17:24:56Z

+| Benchmark | Queries | Status |
+|---|---|---|
+| TPC-H (Scale Factor 10) | 22 | Available |
+| TPC-DS (Scale Factor 10) | 99 | Added incrementally |


Are we sure we want to do SF-10 for DS? That database can get pretty big. You may want to experiment to see which one is more viable.

Well I already have both TPCH and TPCDS SF-10 on the docker image. Is it worth the experimentation at this point?

knassre-bodo · 2026-05-19T17:30:20Z

+
+## Automated Runs
+
+The benchmarker runs automatically on every PyDough DSL release via GitHub Actions. Results are committed as a new PR containing a timestamped CSV file. The workflow can also be triggered manually.


Let's have a more in-depth explanation of the following:

The CSV file is in the format described in ## Metrics

The file is generated from the workflow run on release, or the manual one

The exact naming scheme of the generated file in those two versions

Also, perhaps we should have a comparator script that takes in two of these benchmark run file names and outputs a comparison file:

How many tests were right/wrong before/after (list all tests that went wrong->right or right->wrong)

The total time, looking only at queries that were correct in both versions

For all queries that were correct in both versions, a breakdown of the absolute & relative % time improvement/regression (can sort this from best-to-worst)

This can be dumped into stdout or into a file for analysis.

For instance, suppose file R1 had the following data:

q1: correct, pydough_time=10

q2: correct, pydough_time=15

q3: correct, pydough_time=20

q4: incorrect, pydough_time=25

q5: incorrect, pydough_time=30

q6: correct, pydough_time=42

And suppose file R2 had the following data:

q1: incorrect, pydough_time=30

q2: correct, pydough_time=10

q3: correct, pydough_time=23

q4: correct, pydough_time=5

q5: incorrect, pydough_time=25

q6: correct, pydough_time=30

What the comparison file would tell you (formatted a bit nicer, and probably going to decimals):

R1: 4/6 correct

R2: 4/6 correct

Incorrect->Correct: q4

Correct->Incorrect: q1

Total Correct Time: 77->63 (-18%)

Time breakdown:

q2: 15->10 (-33%)

q6: 42->30 (-29%)

q3: 20->23 (+15%)

Could also do the same analysis, but instead of comparing the absolute times, we compare the differences relative to the refsol times (e.g. did R2 get faster/slower relative to the refsol than R1). Perhaps both of those time analyses could be in the same file, in different sections?

Is this still applicable, taking into account that, for instance, every query is written manually and will be correct?
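
The comparator proposed above could be sketched roughly as follows. It works on in-memory dictionaries holding the R1/R2 example data instead of reading the generated CSV files, since the file layout is not shown here; `Run`, `pct_change` and `compare_runs` are hypothetical names.

```python
from typing import Dict, Tuple

# question id -> (is_correct, pydough_time); values copied from the example.
Run = Dict[str, Tuple[bool, float]]

R1: Run = {"q1": (True, 10), "q2": (True, 15), "q3": (True, 20),
           "q4": (False, 25), "q5": (False, 30), "q6": (True, 42)}
R2: Run = {"q1": (False, 30), "q2": (True, 10), "q3": (True, 23),
           "q4": (True, 5), "q5": (False, 25), "q6": (True, 30)}


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; 0.0 when there is no baseline."""
    return (after - before) / before * 100.0 if before else 0.0


def compare_runs(old: Run, new: Run) -> dict:
    shared = sorted(old.keys() & new.keys())
    both = [q for q in shared if old[q][0] and new[q][0]]
    total_old = sum(old[q][1] for q in both)
    total_new = sum(new[q][1] for q in both)
    return {
        "old_correct": sum(1 for ok, _ in old.values() if ok),
        "new_correct": sum(1 for ok, _ in new.values() if ok),
        "fixed": [q for q in shared if not old[q][0] and new[q][0]],
        "broken": [q for q in shared if old[q][0] and not new[q][0]],
        "total": (total_old, total_new, pct_change(total_old, total_new)),
        # Best improvement first, worst regression last.
        "breakdown": sorted(
            ((q, old[q][1], new[q][1], pct_change(old[q][1], new[q][1]))
             for q in both),
            key=lambda item: item[3],
        ),
    }


report = compare_runs(R1, R2)
print(f"R1: {report['old_correct']}/{len(R1)} correct")
print(f"R2: {report['new_correct']}/{len(R2)} correct")
print("Incorrect->Correct:", ", ".join(report["fixed"]))
print("Correct->Incorrect:", ", ".join(report["broken"]))
print("Total Correct Time: {}->{} ({:+.0f}%)".format(*report["total"]))
for q, before, after, pct in report["breakdown"]:
    print(f"{q}: {before}->{after} ({pct:+.0f}%)")
```

With the example data this reports q4 as Incorrect->Correct, q1 as Correct->Incorrect, and a total of 77->63 (-18%) over q2, q3 and q6. The same structure could hold a second breakdown relative to the refsol times.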

john-sanchez31 added 28 commits April 30, 2026 13:07

base archi for pydough benchmarker

d6a4d74

tpch all queries working

3f052b4

fixing workflow

43b1813

adding temporary caller [run benchmark]

0dcc0b1

another try [run benchmark]

c0d25a6

fixing run with venv [run benchmark]

04d2f9b

fixing secrets [run benchmark]

094e992

passing postgres_user correctly [run benchmark]

052b266

adding job env [run benchmark]

7dc490f

adding waiting time for database [run benchmarker]

631fbba

[run benchmark]

174d946

adding waiting time [run benchmark]

2d9df1d

printing logs [run benchmark]

9cc420d

fixing container name [run benchmark]

df44e20

listening fix [run benchmark]

bd65174

name placed correctly [run benchmark]

c82ff9a

fixing name command [run benchmark]

5e5d671

fixing container name [run benchmark]

0e73c27

fixing service [run benchmark]

e41a7ad

fixing wrong commands [run benchmark]

65277e0

fixing question path [run benchmark]

040f73f

fixing used paths [run benchmark]

9f25e04

fixing metadata path [run benchmark]

5d9cdca

changing branch and pr creation steps [run benchmark]

71390d2

manual workflow work [run benchmark]

fcde326

fixing identation message [run benchmark]

1654a36

adding run label and readme [run benchmark]

960153d

creating the folder if doesn't exist [run benchmark]

582c15c

john-sanchez31 requested review from hadia206 and knassre-bodo May 11, 2026 17:23

john-sanchez31 commented May 11, 2026

hadia206 reviewed May 14, 2026

john-sanchez31 added 2 commits May 15, 2026 16:49

fixing documentation, adding validation function

53e6bcb

adding tpcds metadata

4aed114

knassre-bodo reviewed May 19, 2026

john-sanchez31 added 4 commits May 25, 2026 14:52

adding documentation, more information on the PR message [run benchmark]

ef5dcdb

fixing yml [run benchmark]

9750713

fix for questions.csv [run benchmark]

f8c241a

unmatched quotes and benchmark name fixed [run benchmark]

4e3b8ee
