Skip to content

chore: bump dependencies#310

Merged
tristan-f-r merged 18 commits into
Reed-CompBio:mainfrom
tristan-f-r:dep-bump
Jul 25, 2025
Merged

chore: bump dependencies#310
tristan-f-r merged 18 commits into
Reed-CompBio:mainfrom
tristan-f-r:dep-bump

Conversation

@tristan-f-r

@tristan-f-r tristan-f-r commented Jul 1, 2025

Copy link
Copy Markdown
Collaborator

@tristan-f-r tristan-f-r added the refactor Changes that don't actually improve anything except for code quality. label Jul 1, 2025
@ntalluri

ntalluri commented Jul 2, 2025

Copy link
Copy Markdown
Collaborator

Do you have an example of this (the image) that you could add as a comment?

@ntalluri

ntalluri commented Jul 2, 2025

Copy link
Copy Markdown
Collaborator

Also could you mention what computer you are using (there is something weird with windows version 10 with scikit learn that we were dealing with 2 years ago but never fixed #135)

@tristan-f-r

tristan-f-r commented Jul 2, 2025

Copy link
Copy Markdown
Collaborator Author

NixOS 25.05.20250108.bffc22e (Warbler) on x86_64, though I do think this issue is reproducing on CI, so I doubt this is platform dependent. I'll reproduce it.

@tristan-f-r

tristan-f-r commented Jul 2, 2025

Copy link
Copy Markdown
Collaborator Author

@ntalluri

spras/test/ml/test_ml.py

Lines 83 to 95 in 54bd449

def test_pca_robustness(self):
dataframe = ml.summarize_networks([INPUT_DIR + 'test-data-s1/s1.txt', INPUT_DIR + 'test-data-s2/s2.txt', INPUT_DIR + 'test-data-s3/s3.txt'])
expected = pd.read_table(EXPECT_DIR + 'expected-pca-coordinates.tsv')
expected = expected.round(5)
for _ in range(5):
dataframe_shuffled = dataframe.sample(frac=1, axis=1) # permute the columns
ml.pca(dataframe_shuffled, OUT_DIR + 'pca-shuffled-columns.png', OUT_DIR + 'pca-shuffled-columns-variance.txt',
OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = pd.read_table(OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = coord.round(5) # round values to 5 digits to account for numeric differences across machines
coord.sort_values(by='algorithm', ignore_index=True, inplace=True)
assert coord.equals(expected)

Coord:
      algorithm      PC1      PC2
0  test-data-s1 -2.00665 -0.98659
1  test-data-s2 -1.52765  1.07995
2  test-data-s3  3.53430 -0.09336       algorithm      PC1      PC2
0  test-data-s1 -2.00665 -0.98659
1  test-data-s2 -1.52765  1.07995
2  test-data-s3  3.53430 -0.09336

Expected:
      algorithm      PC1      PC2
0  test-data-s1  2.00665 -0.98659
1  test-data-s2  1.52765  1.07995
2  test-data-s3 -3.53430 -0.09336       algorithm      PC1      PC2
0  test-data-s1 -2.00665 -0.98659
1  test-data-s2 -1.52765  1.07995
2  test-data-s3  3.53430 -0.09336

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

@ntalluri

ntalluri commented Jul 3, 2025

Copy link
Copy Markdown
Collaborator

Based on what I’m reading, bumping the version is a good idea because it improves PCA in a meaningful way. Before, PCA (in version 1.2) signs were chosen by looking at how the data looked after projection (transformed data), but now they are chosen by inspecting each component vector directly. Previously, each solver decided the signs in its own way, which meant that changing the solver could cause components to unexpectedly flip signs. While the mathematical results seem to always still be correct, this causes confusion. The new approach in version 1.7 makes the sign choice independent of the solver because it relies only on the component vector’s own values, ensuring we always get consistent component signs no matter which solver we use.

We don't allow a user to pick a solver, we have it set to auto at the moment.
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA:~:text=svd_solver%7B%E2%80%98auto%E2%80%99%2C%20%E2%80%98full,optionally%20truncated%20afterwards.

tristan-f-r and others added 2 commits July 3, 2025 09:42
untested; need to open my devcontainer
@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

Let me know if this commit is the right way to approach that. To my understanding, this does mess with our sampling test because our component vectors differ per run; hopefully, the aforementioned commit addresses that.

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

Blocked by #265 - we should discuss removing GraphSpace support in today's meeting.

Comment thread test/ml/test_ml.py
@ntalluri

ntalluri commented Jul 3, 2025

Copy link
Copy Markdown
Collaborator

Let me know if this commit is the right way to approach that. To my understanding, this does mess with our sampling test because our component vectors differ per run; hopefully, the aforementioned commit addresses that.

For that specific test, we want the output to be the same despite shuffling the rows and columns. So having 2 different outputs is not what we would want.

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

For that specific test, we want the output to be the same despite shuffling the rows and columns. So having 2 different outputs is not what we would want.

Hm, do we roll back scikit-learn then? The PCA signage is going to start depending on the content, and it seems that permutations count as a mutation over the content.

@ntalluri

ntalluri commented Jul 7, 2025

Copy link
Copy Markdown
Collaborator

Maybe we need to rethink the test then. Is it okay the values are the same but the signs change? Or should it always be the same values no matter the data? (Just some questions to ask). I will think through what we can do for this test case.

@tristan-f-r tristan-f-r marked this pull request as ready for review July 8, 2025 20:11
@agitter

agitter commented Jul 13, 2025

Copy link
Copy Markdown
Collaborator

Is it okay the values are the same but the signs change?

Can you help me get caught up on the core question? It seems valid to have a transformation of the PCA solution where the signs are flipped.

@tristan-f-r

tristan-f-r commented Jul 13, 2025

Copy link
Copy Markdown
Collaborator Author

Can you help me get caught up on the core question? It seems valid to have a transformation of the PCA solution where the signs are flipped.

The essence of the test that was broken here was row shuffling preserving PCA output:

dataframe_shuffled = dataframe.sample(frac=1, axis=1)  # permute the columns
ml.pca(dataframe_shuffled, OUT_DIR + 'pca-shuffled-columns.png', OUT_DIR + 'pca-shuffled-columns-variance.txt',
    OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = pd.read_table(OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = coord.round(5)  # round values to 5 digits to account for numeric differences across machines
coord.sort_values(by='algorithm', ignore_index=True, inplace=True)

This now needs an assertion check for two possible signings:

assert coord.equals(expected) or coord.equals(expected_other)

@tristan-f-r

tristan-f-r commented Jul 18, 2025

Copy link
Copy Markdown
Collaborator Author

Since this is just a standard dependency bump, this seems okay to merge. I do want one last review to check in that we're okay with this PCA output signage change. Will be reviewed on Monday.

@tristan-f-r tristan-f-r requested a review from ntalluri July 18, 2025 20:25
Comment thread test/ml/expected/expected-pca-coordinates-2.tsv Outdated
@ntalluri

Copy link
Copy Markdown
Collaborator

If it is the ensembling one, there was a data.pickle I added that has a toy dataset network that is needed for ensembling in general that we use to ensure we calculate recall correct. I'm not sure why that would need to be updated (if it is this pickled dataset) because it has nothing to do with the pca.

@agitter

agitter commented Jul 21, 2025

Copy link
Copy Markdown
Collaborator

Regardless, we probably should stop committing artifacts to main anyway - this causes problems; see #320 for an example of a large test refactor because of a related issue

This seems worthwhile to discuss in a new issue as a testing strategy change if it affects multiple tests.

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

#339

@tristan-f-r

tristan-f-r commented Jul 21, 2025

Copy link
Copy Markdown
Collaborator Author

@tristan-f-r Which PR was it? The ensembling one?

👍 It was #212 (origin commit). I'll prepare another PR to make that test follow #339.

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

If it is the ensembling one, there was a data.pickle I added that has a toy dataset network that is needed for ensembling in general that we use to ensure we calculate recall correct. I'm not sure why that would need to be updated (if it is this pickled dataset) because it has nothing to do with the pca.

Pickled objects contain internal representations of the object, and are subject to change throughout package versions. In this case, pandas itself changed.

@tristan-f-r

tristan-f-r commented Jul 21, 2025

Copy link
Copy Markdown
Collaborator Author

#340 should fix this.

@tristan-f-r

tristan-f-r commented Jul 23, 2025

Copy link
Copy Markdown
Collaborator Author

@ntalluri there seems to be sorting issues with the new commits making the PCA test flaky (see this commit which addresses that) - the tests only worked occasionally on my machine before that commit.

Comment thread test/ml/expected/expected-pca-coordinates-negated.tsv
Comment thread test/ml/test_ml.py Outdated
@tristan-f-r tristan-f-r requested a review from ntalluri July 23, 2025 18:23

@ntalluri ntalluri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me. However, I am going to test the environment updates locally before approving.

@tristan-f-r

tristan-f-r commented Jul 23, 2025

Copy link
Copy Markdown
Collaborator Author

By the way, some dependencies may be a little outdated (e.g. this PR outlived a pandas release cycle) - I can update them, but this PR was mostly intended to avoid all of the dependency errors that were causing me problems with pixi.

@tristan-f-r

tristan-f-r commented Jul 24, 2025

Copy link
Copy Markdown
Collaborator Author

Okay - sorry about this terribly long hill of a PR for how small it is. That should be the last commit which fixes the tests from the latest merge 👍

@tristan-f-r tristan-f-r requested a review from ntalluri July 24, 2025 19:29

@agitter agitter left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the TestML tests locally and they passed

@tristan-f-r tristan-f-r dismissed ntalluri’s stale review July 25, 2025 15:42

pre-commit is too outdated to run typos, we need to bump pre-commit

@tristan-f-r tristan-f-r merged commit e899da8 into Reed-CompBio:main Jul 25, 2025
15 checks passed
@tristan-f-r tristan-f-r deleted the dep-bump branch July 25, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor Changes that don't actually improve anything except for code quality.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Different dependency version matching between conda and python Clustering warning in sklearn version 1.2

3 participants