PyDough DSL Benchmarker#513
Conversation
john-sanchez31
left a comment
There was a problem hiding this comment.
Also take a look of the generated PR for the last run of the benchmarker: #517
| type: boolean | ||
| required: false | ||
| default: false | ||
| run-benchmark: |
There was a problem hiding this comment.
This is a temporary flag so I can trigger the workflow without having to merge to main. The idea would be that the 2 main ways to execute the benchmark are through github actions tab and on published release
| python-versions: ${{ github.event_name == 'workflow_dispatch' | ||
| && needs.get-py-ver-matrix.outputs.matrix | ||
| || '["3.10", "3.11", "3.12", "3.13"]' }} | ||
|
|
There was a problem hiding this comment.
This can be deleted before merging to main
There was a problem hiding this comment.
Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.
| @@ -0,0 +1,975 @@ | |||
| benchmark,question_id,pydough,sql | |||
There was a problem hiding this comment.
I've been thinking about adding the questions itself here and add it to the generated files. Do you think is something that would help us or not really?
| ) | ||
| ) AS custsale | ||
| GROUP BY cntrycode | ||
| ORDER BY cntrycode;" No newline at end of file |
There was a problem hiding this comment.
Reminder to fix this
| @@ -0,0 +1,124 @@ | |||
| # PyDough Performance Benchmarker | |||
There was a problem hiding this comment.
Pretty much is the same we have in the design document. Let me know if I need to add something else
hadia206
left a comment
There was a problem hiding this comment.
Good job, John!
Please see my comments below. Mostly about handling what happens if things go wrong and matching README with code.
| python-versions: ${{ github.event_name == 'workflow_dispatch' | ||
| && needs.get-py-ver-matrix.outputs.matrix | ||
| || '["3.10", "3.11", "3.12", "3.13"]' }} | ||
|
|
There was a problem hiding this comment.
Why? I think it's good to have an option to run benchmark on demand rather than rely on releases only.
| @@ -0,0 +1,870 @@ | |||
| [ | |||
There was a problem hiding this comment.
I'm assuming this is the same as sample one in tests for now, right?
|
|
||
| @property | ||
| def connection(self) -> Connection: | ||
| """Independent connection to the database""" |
There was a problem hiding this comment.
What does independent connection mean in this context?
There was a problem hiding this comment.
Just a single database connection for one specific engine
| if run_type == "R": | ||
| # e.g. 20260508_R_v1.2.0 | ||
| return f"{today}_R_{run_id}" | ||
| else: | ||
| # e.g. 20260508_M_25526200301 | ||
| return f"{today}_M_{run_id}" |
There was a problem hiding this comment.
Let's add a comment clarifying what does R and M refer to.
|
|
||
| def execute_pydough(self, pydough_str: str) -> tuple[DataFrame, str]: | ||
| """ | ||
| Executes the given pyodugh code using `from_string` api |
| "pydough_exec_time": "float", | ||
| "exec_plan": "str", | ||
| "pydough_exec_plan": "str", | ||
| "status": "str", |
There was a problem hiding this comment.
Also, where is pydough_time_ms (Time to generate SQL via PyDough)?
| | `benchmark` | `str` | Name of the benchmark suite | | ||
| | `ground_truth` | `str` | Reference SQL query | | ||
| | `pydough_sql` | `str` | SQL generated by PyDough | | ||
| | `pydough_time_ms` | `float` | Time to generate SQL via PyDough (ms) | |
| ### `measure() -> None` | ||
| Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics. | ||
|
|
||
| ### `execute_pydough(pydough_code: str) -> DataFrame` |
There was a problem hiding this comment.
| ### `execute_pydough(pydough_code: str) -> DataFrame` | |
| ### `execute_pydough(pydough_code: str) -> tuple[DataFrame, str]` |
| Main entry point. Orchestrates the full benchmark run — loads questions, executes both SQL paths, validates results, and collects all metrics. | ||
|
|
||
| ### `execute_pydough(pydough_code: str) -> DataFrame` | ||
| Executes the given PyDough DSL code using the `from_string` API and returns the result as a DataFrame. |
There was a problem hiding this comment.
Update description here to reflect code changes.
| ### `query_validation(pydough_result: DataFrame, ground_truth_result: DataFrame) -> bool` | ||
| Compares the PyDough result against the ground truth. Supports complex comparison logic to account for ordering and floating-point tolerances. |
There was a problem hiding this comment.
This one is not in the code as its own function.
It's better to add it in the code so that if we want to change how we do validation later own, it's easier to spot.
knassre-bodo
left a comment
There was a problem hiding this comment.
@john-sanchez31 FANTASTIC job accomplishing this! There are a few things to tinker with, but this is definitely a great tool.
| @@ -0,0 +1,65 @@ | |||
| """ | |||
There was a problem hiding this comment.
I think it should be spelled executor
| included.","result = ( | ||
| lines.WHERE(ship_date <= DATETIME('1998-12-1', '-90 days')) | ||
| .PARTITION(name=""groups"", by=(return_flag, status)) | ||
| .CALCULATE( | ||
| L_RETURNFLAG=return_flag, | ||
| L_LINESTATUS=status, | ||
| SUM_QTY=SUM(lines.quantity), | ||
| SUM_BASE_PRICE=SUM(lines.extended_price), | ||
| SUM_DISC_PRICE=SUM(lines.extended_price * (1 - lines.discount)), | ||
| SUM_CHARGE=SUM( | ||
| lines.extended_price * (1 - lines.discount) * (1 + lines.tax) | ||
| ), | ||
| AVG_QTY=AVG(lines.quantity), | ||
| AVG_PRICE=AVG(lines.extended_price), | ||
| AVG_DISC=AVG(lines.discount), | ||
| COUNT_ORDER=COUNT(lines), | ||
| ) | ||
| .ORDER_BY(L_RETURNFLAG.ASC(), L_LINESTATUS.ASC()) |
There was a problem hiding this comment.
Probably not something to be addressed right now, but I believe in our original discussions about the benchmarker we discussed having the LLM generate the PyDough code? I forget, did we decide against that at some point?
There was a problem hiding this comment.
Yes, at some point the original discussed task was divided into 2 different task. For the LLM team I created a script for data generation and for the DSL I'm working on the benchmarker. We could discuss about adding a LLM generation for the benchmarker for a followup task.
| git commit -m "chore: update benchmark metrics $(date +'%Y-%m-%d')" | ||
| git push origin "$BRANCH" | ||
|
|
||
| gh pr create \ | ||
| --base main \ | ||
| --head "$BRANCH" \ | ||
| --title "Benchmark metrics update — $(date +'%Y-%m-%d')" \ | ||
| --label "benchmark" \ | ||
| --label "automated" \ | ||
| --body "Automated benchmark run completed successfully. | ||
|
|
||
| **Triggered by:** \`${{ github.event_name }}\` | ||
| **Run:** [${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) |
There was a problem hiding this comment.
Let's ensure the commits/PR generated indicate:
- Whether it was generated from a release or manually
- If on a release, which release
- If manually, on which commit of which branch
| git checkout -b "$BRANCH" | ||
| git add benchmark/metrics/*.csv benchmark/benchmark_metrics.csv |
There was a problem hiding this comment.
The PR that was opened included all of these changes AND the metric file. Will this always happen? If so that is potentially problematic if people are doing active development on a branch while they run the workflow, since the PR will contain all of their active development changes.
| | Benchmark | Queries | Status | | ||
| |---|---|---| | ||
| | TPC-H (Scale Factor 10) | 22 | Available | | ||
| | TPC-DS (Scale Factor 10) | 99 | Added incrementally | |
There was a problem hiding this comment.
Are we sure we want to do SF-10 for DS? That database can get pretty big. You may want to experiment to see which one is more viable.
There was a problem hiding this comment.
Well I already have both TPCH and TPCDS SF-10 on the docker image. Is it worth the experimentation at this point?
|
|
||
| ## Automated Runs | ||
|
|
||
| The benchmarker runs automatically on every PyDough DSL release via GitHub Actions. Results are committed as a new PR containing a timestamped CSV file. The workflow can also be triggered manually. |
There was a problem hiding this comment.
Let's have a more in-depth explanation of the following:
- The CSV file is in the format described in
## Metrics - The file is generated from the workflow run on release, or the manual one
- The exact naming scheme of the generated file in those two versions
Also, perhaps we should have a comparator script that takes in 2x of these benchmark run file names and outputs a comparison file:
- How many tests were right/wrong before/after (list all tests that went wrong->right or right->wrong)
- The total time performance only looking at queries that were correct in both versions
- For all queries that were correct in both versions, a breakdown of the absolute & relative % time improvement/regression (can sort this from best-to-worst)
This can be dumped into stdout or into a file for analysis.
There was a problem hiding this comment.
For instance, suppose file R1 had the following data:
- q1: correct, pydough_time=10
- q2: correct, pydough_time=15
- q3: correct, pydough_time=20
- q4: incorrect, pydough_time=25
- q5: incorrect, pydough_time=30
- q6: correct, pydough_time=42
And suppose file R2 had the following data:
- q1: incorrect, pydough_time=30
- q2: correct, pydough_time=10
- q3: correct, pydough_time=23
- q4: correct, pydough_time=5
- q5: incorrect, pydough_time=25
- q6: correct, pydough_time=30
What the comparison file would tell you (formatted a bit nicer, and probably going to decimals):
- R1: 4/6 correct
- R2: 4/6 correct
- Incorrect->Correct: q6
- Correct->Incorrect: q1
- Total Correct Time: 77->63 (-18%)
- Time breakdown:
- q2: 15->10 (-33%)
- q6: 42->30 (-29%)
- q3: 20->23 (+15%)
Could also do the same analysis, but instead of comparing the absolute times, we compare the differences relative to the refsol times (e.g. did R2 get faster/slower relative to the refsol than R1). Perhaps both of those time analyses could be in the same file, in different sections?
There was a problem hiding this comment.
Is this still applicable taking in count that for instance every query is written manually and will be correct?
Linked ticket
Closes #
Type of change
What changed and why?
Adding the new benchmarker. This new class will run a list of questions and get metrics related to the performance of pydough. All this data can be used for detecting regressions and opportunities of optimizations. This generates a new file with all collected information and adds a new row on the
benchmark_metrics.csvfile.How I tested this?
This new class has been tested locally using the command
python executer.py. There are not tests on our test suite for this new feature that could be discussed later.Notes for reviewers
The workflow spins up the Postgres Docker container as a service using the image
bodoai1/pydough-benchmarker:latest, installs the project dependencies using uv, and runs executer.py to execute the benchmark. Once the benchmark finishes and the metrics files are generated, it commits those files to a new branch and automatically opens a PR against main for review.It triggers automatically on every new GitHub release published, and can also be run manually at any time from the Actions tab.