Add support for Trino dialect and database context in PyDough#498
Conversation
for more information, see https://pre-commit.ci
john-sanchez31
left a comment
There was a problem hiding this comment.
Great job Kian! There are some comments below but overall almost ready
| image: bodoai1/pydough-mongo-tpch:latest | ||
| env: | ||
| # Set environment variables for MongoDB container | ||
| MONGO_HOST: 127.0.0.1 |
There was a problem hiding this comment.
I think It'd be better to create the corresponding environment variables for mongo
| MYSQL_USER: ${{ secrets.MYSQL_USERNAME }} | ||
| MYSQL_PASSWORD: ${{ secrets.MYSQL_PASSWORD }} | ||
| MONGO_HOST: mongo | ||
| MONGO_USER: "root_user" |
There was a problem hiding this comment.
Those can be reused here (avoiding typing errors)
There was a problem hiding this comment.
+1. Moving to variables an re-using
| with ID 1? Return the coupon ID and total amount transacted with it. | ||
| """ | ||
| return coupons.WHERE(merchant_id == "1").CALCULATE( | ||
| return coupons.WHERE(merchant_id == 1).CALCULATE( |
There was a problem hiding this comment.
Why this change? Does it work for all dialects? I know some (Oracle) might have trouble handling this change.
There was a problem hiding this comment.
It technically is SUPPOSED to be an integer; I don't know why it was ever a string, but doing so was causing problems since comparing integers to strings in Trino is problematic. Rather than adding implicit type casting, it was simpler to just fix the test so it made more sense.
| FROM tpch.lineitem | ||
| ORDER BY | ||
| 1 NULLS FIRST | ||
| l_discount NULLS FIRST |
There was a problem hiding this comment.
Why is it using the actual column name instead of the index?
There was a problem hiding this comment.
The actual test here was reworked so it orders it in a manner where the only row kept is the one where l_discount is 0.
| "-e", | ||
| "TPCH_TABLES=region,supplier", | ||
| "-e", | ||
| "DEFOG_TABLES=sbCustomer,salespersons,payments_received,doctors,diagnoses,concomitant_meds,merchants,wallet_user_balance_daily,user_sessions,cite,domain_author,domain_keyword,keyword,publication_keyword,location", |
There was a problem hiding this comment.
Why is this needed? When more defog db are added does this need to be updated?
There was a problem hiding this comment.
Why is this needed?
Mongo/Cassandra setup time scales a lot with the number of tables. Only having it set up some of them (the ones that Trino will use from these connectors), instead of all of them makes it faster.
When more defog db are added does this need to be updated?
Yes, unless we choose to make it so those defog db are only ever accessed from the MySQL/Postgres connectors.
| client = MongoClient( | ||
| host=MONGO_HOST, | ||
| port=MONGO_PORT, | ||
| username="root_user", |
There was a problem hiding this comment.
Maybe would be better use environment variables. I know the data in the container are not sensitive but still would be a better practice and for consistency.
| return request.param | ||
|
|
||
|
|
||
| @pytest.mark.custom_datasets |
There was a problem hiding this comment.
I agree, needed sooner or later
|
|
||
|
|
||
| @pytest.mark.skip( | ||
| "Trino does not properly handle all of the weird quoting cases; investigate at a later time." |
There was a problem hiding this comment.
We can leave it like this but just take in count that other type of tests can be added as custom_datasets not just the quoted ones.
hadia206
left a comment
There was a problem hiding this comment.
Good work, Kian!
Please see my comments below.
Also, don't forget the documentation.
| MYSQL_USER: ${{ secrets.MYSQL_USERNAME }} | ||
| MYSQL_PASSWORD: ${{ secrets.MYSQL_PASSWORD }} | ||
| MONGO_HOST: mongo | ||
| MONGO_USER: "root_user" |
There was a problem hiding this comment.
+1. Moving to variables an re-using
| run: uv venv | ||
|
|
||
| - name: Install dependencies | ||
| run: uv pip install -e ".[trino]"; uv pip install pytest |
There was a problem hiding this comment.
Do we still need the uv pip install pytest?
There was a problem hiding this comment.
This is the only/main part still under development
| return parse_one(combined_string) | ||
| return parse_one( | ||
| combined_string, | ||
| dialect=self._visitor._session.database.dialect.sqlglot_dialect, |
There was a problem hiding this comment.
Discovered the hard way while doing some Trino UDF stuff that the UDF will not always be parsed by SQLGlot correctly unless the dialect is passed in. E.g. if I use DATE_ADD inside the SQL macro text, SQLGlot parses the arguments in the wrong order unless the dialect was provided.
| cast of the original expression to string. | ||
| """ | ||
| if not isinstance(typ, StringType): | ||
| return sqlglot_expressions.Cast(this=expr, to="VARCHAR") |
There was a problem hiding this comment.
Shouldn't this be
| return sqlglot_expressions.Cast(this=expr, to="VARCHAR") | |
| return sqlglot_expressions.Cast(this=expr, to=sqlglot_expressions.DataType.build("VARCHAR")) |
| return super().convert_call_to_sqlglot(operator, args, types) | ||
|
|
||
| def convert_sum( | ||
| self, arg: SQLGlotExpression, types: list[PyDoughType] |
There was a problem hiding this comment.
| self, arg: SQLGlotExpression, types: list[PyDoughType] | |
| self, args: list[SQLGlotExpression], types: list[PyDoughType] |
Then update use of arg to args
| num_term = float(numerator.this) | ||
| denom_term = float(denominator.this) | ||
| if denom_term != 0 and num_term / denom_term == int(num_term / denom_term): | ||
| return exp.Literal.number(int(num_term / denom_term)) |
There was a problem hiding this comment.
Suggested by AI.
Reason:
sqlglot.expressions.Div can represent either integer (//) or float (/) division depending on context. Replacing 4 / 2 with the integer literal 2 silently changes the type from float to int for float-division contexts.
Additionally, the equality check float == int can produce false positives for very large numbers due to floating-point representation (e.g., 10**16 + 1 / 1 would fold incorrectly). Using num_term % denom_term == 0 with integer arithmetic (int(num_term) and int(denom_term)) where both values are whole numbers would be safer.
| num_term = float(numerator.this) | |
| denom_term = float(denominator.this) | |
| if denom_term != 0 and num_term / denom_term == int(num_term / denom_term): | |
| return exp.Literal.number(int(num_term / denom_term)) | |
| num_term = int(float(numerator.this)) | |
| denom_term = int(float(denominator.this)) | |
| if denom_term != 0 and num_term % denom_term == 0: | |
| return exp.Literal.number(num_term // denom_term) |
There was a problem hiding this comment.
Will need to verify that both terms are actually integers... e.g. I don't want to have 6.5 / 3.2 get folded into 3
There was a problem hiding this comment.
Did you verify? If yes, what's final decision?
There was a problem hiding this comment.
Changed it to this:
# Attempt to fold constants for cases like `4 / 2` -> `2`
if (
isinstance(numerator, exp.Literal)
and numerator.is_int
and isinstance(denominator, exp.Literal)
and denominator.is_int
):
num_term: int = numerator.this
denom_term: int = denominator.this
if denom_term != 0 and num_term % denom_term == 0:
return exp.Literal.number(num_term // denom_term)
| # correctly whereas the SQLGlot version gets confused with multiple | ||
| # rounds of parsing and unparsing. | ||
| Trino.Generator.TRANSFORMS[sqlglot_expressions.DayOfWeek] = ( | ||
| lambda self, e: f"(({self.func('DAY_OF_WEEK', e.this)} % 7) + 1)", |
There was a problem hiding this comment.
Assuming it's a typo. There was no , in the case written beforehand.
| lambda self, e: f"(({self.func('DAY_OF_WEEK', e.this)} % 7) + 1)", | |
| lambda self, e: f"(({self.func('DAY_OF_WEEK', e.this)} % 7) + 1)" |
| def get_table_prefix_for_dialect(dialect: DatabaseDialect) -> str: | ||
| """Return the appropriate table name prefix for the given database dialect.""" | ||
| match dialect: | ||
| case DatabaseDialect.SNOWFLAKE: |
There was a problem hiding this comment.
Let's re-add comment here in docstring or above each case.
# For Snowflake, use cross-database write (read from SNOWFLAKE_SAMPLE_DATA,
# write to E2E_TESTS_DB.PUBLIC)
| ) | ||
|
|
||
|
|
||
| def simple_week_sampler(): |
There was a problem hiding this comment.
Why was this removed? I see the tpc-h one all combined into one. but this one I don't see it re-added somewhere else or I may have missed it.
There was a problem hiding this comment.
It was all combined into one in "test_pipeline_e2e_simple_week", which uses the TPC-H version.
hadia206
left a comment
There was a problem hiding this comment.
Thanks Kian.
Some comments were not addressed and also missing the user documentation.
| connection = kwargs.pop("connection", None) | ||
| if connection: | ||
| return DatabaseConnection(connection) | ||
| required_keys = ["user", "host", "port"] |
There was a problem hiding this comment.
I meant ,are they optional for the user?
The docstring says they can have a default value. So should we used that as a default and not make it required for the user.
From the docstring
- host: Trino server host (str, default: "127.0.01"/"localhost")
- port: Trino server port (int, default: 8080)
| def convert_integer( | ||
| self, args: list[SQLGlotExpression], types: list[PyDoughType] | ||
| ) -> SQLGlotExpression: | ||
| # First cast to a DOUBLE, then to an INTEGER, for safety. |
There was a problem hiding this comment.
Let's clarify that in the comment and be specific about what is safety
|
|
||
| def generate_dataframe_item_dialect_expression( | ||
| self, item: Any, item_type: PyDoughType | ||
| ) -> SQLGlotExpression: |
| # default version, since PyDough handles the conversion logic | ||
| # correctly whereas the SQLGlot version gets confused with multiple | ||
| # rounds of parsing and unparsing. |
| num_term = float(numerator.this) | ||
| denom_term = float(denominator.this) | ||
| if denom_term != 0 and num_term / denom_term == int(num_term / denom_term): | ||
| return exp.Literal.number(int(num_term / denom_term)) |
There was a problem hiding this comment.
Did you verify? If yes, what's final decision?
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Adding support for the Trino dialect in Pydough, including testing with a docker container that has four Trino connections to other databases: MySQL, Postgres, Mongo, and Cassandra.