Object-based, multiprocess, and caching pdet and weight estimator by kcroker · Pull Request #1 · michaelzevin/selection-effects

kcroker · 2023-07-25T18:07:02Z

Implements the core functionality of pdets_from_grid() as a persistent-state object. The object caches the underlying trained regression model, so training only needs to happen on the first call. Subsequent evaluation of the model can also be distributed across a multiprocessing.Pool. Multiprocessing can be memory intensive, because every child process needs its own copy of the trained regressor, which is ~350MB. Because the regressor is complicated, getting the memory consumption down will probably require something like map_coordinates() on a shared_memory numpy array. If that approximation scheme delivers the necessary accuracy, it would likely be 10-100x faster and would also cut memory use to roughly 1/cores of the current footprint.
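A minimal sketch of the shared-memory interpolation idea mentioned above, assuming the detection probabilities can be tabulated on a regular grid. The helper names (make_shared_grid, interpolate) and the lo/hi bounds arguments are hypothetical and are not part of this PR; only scipy.ndimage.map_coordinates and multiprocessing.shared_memory are real library calls.

```python
import numpy as np
from multiprocessing import shared_memory
from scipy.ndimage import map_coordinates


def make_shared_grid(grid):
    """Copy a gridded array into a named shared-memory block (hypothetical helper)."""
    shm = shared_memory.SharedMemory(create=True, size=grid.nbytes)
    view = np.ndarray(grid.shape, dtype=grid.dtype, buffer=shm.buf)
    view[:] = grid
    return shm


def interpolate(shm_name, shape, dtype, lo, hi, points):
    """Linearly interpolate the shared grid at `points` (shape (n, ndim)).

    `lo` and `hi` are the physical coordinates of the first and last grid
    node along each axis (an assumption about how the grid is laid out).
    """
    shm = shared_memory.SharedMemory(name=shm_name)  # attach, no copy
    grid = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # Map physical coordinates to fractional grid indices
    idx = (points - lo) / (hi - lo) * (np.array(shape) - 1)
    result = map_coordinates(grid, idx.T, order=1, mode="nearest")
    # Drop the view before closing, otherwise close() raises BufferError
    del grid
    shm.close()
    return result
```

Each worker would attach to the block by name, so only one copy of the grid lives in memory regardless of the pool size. Whether linear interpolation is accurate enough for pdet is the open question raised above.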

Usage example:

import multiprocessing

from selection_effects.utils.predict_detection_probabilities import LVKWeighter

# Initialize the object with a "filename:key" string
# Defaults to loading a cached version if present
lvk = LVKWeighter("pdets_grid.hdf5:O3actual_H1L1V1")

# Estimate with some data (`data` is assumed to be loaded already)
with multiprocessing.Pool(10) as p:
    data['pdet'], data['weight'] = lvk.estimate(data, pool=p)
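The "load a cached version if present" behavior could look roughly like the sketch below. CachingEstimator, its train callback, and the cache_path argument are hypothetical stand-ins, not the PR's code. The PR pickles the LVKWeighter object itself; this sketch pickles only the attribute dictionary, which is a simplification.

```python
import os
import pickle


class CachingEstimator:
    """Hypothetical stand-in that illustrates train-once, load-afterwards caching."""

    def __init__(self, source, cache_path, train):
        state = None
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                state = pickle.load(f)
            # Ignore a cache that was built from a different "filename:key"
            if state.get("source") != source:
                state = None

        if state is not None:
            # Restore every member variable (model, bounds, source) from the cache
            self.__dict__.update(state)
            self.loaded_from_cache = True
        else:
            # First call: pay the training cost, then persist the result
            self.source = source
            self.model, self.bounds = train(source)
            with open(cache_path, "wb") as f:
                pickle.dump(self.__dict__, f)
            self.loaded_from_cache = False
```

Keeping the bounds and the source string in the cached state means a cache hit restores everything needed for evaluation, not just the regressor.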

kcroker · 2023-07-25T19:14:05Z

Apparently multiprocessing.Pool is not the preferred way to parallelize scikit-learn workloads. It may run much faster if implemented with joblib; see https://scikit-learn.org/stable/computing/parallelism.html. That page also claims joblib shares memory between workers. Not sure whether this applies only to training or also to prediction...
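A rough sketch of what chunked prediction with joblib might look like. predict_in_chunks and its parameters are hypothetical; `predict` stands in for the regressor's predict method. The thread-based backend is an assumption made here so that workers share one copy of the model. Whether it gives a speedup depends on whether the regressor releases the GIL during prediction, which has not been checked.

```python
import numpy as np
from joblib import Parallel, delayed


def predict_in_chunks(predict, X, n_jobs=2, n_chunks=8):
    """Split X row-wise, evaluate `predict` on each chunk in parallel, and rejoin.

    `predict` is any callable mapping an (n, d) array to an (n,) array,
    for example a fitted regressor's bound predict method.
    """
    chunks = np.array_split(X, n_chunks)
    # Threads share the parent's memory, so the model is not copied per worker
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(predict)(chunk) for chunk in chunks
    )
    return np.concatenate(results)
```

Results come back in submission order, so the concatenated output lines up row for row with X.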

Operator added 4 commits July 24, 2023 22:00

b5772eb  PERSISTED the scikit learned fitter

8466576  ADDED caching and 25% memory optimization

91cfcc7  FIXED bug where bounds based on the data were being used to normalize, instead of the bounds being set by the grid. BROKE caching. Will need to save the LVKWeighter object itself, so that we have access to the bounds (and other things, like the key and filename).

45a469a  ADJUSTED caching to store the LVKWeighter object directly, and set member variables from that pickle upon load from cache. REMOVED some vestigial output.
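To illustrate the normalization bug fixed in 91cfcc7, here is a small sketch assuming simple min-max scaling (the actual scaling in the code may differ). The arrays and the normalize helper are hypothetical.

```python
import numpy as np


def normalize(values, lo, hi):
    """Min-max scale `values` using the supplied bounds."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


grid_masses = np.array([5.0, 20.0, 50.0, 100.0])  # hypothetical training grid
data_masses = np.array([10.0, 20.0, 30.0])        # hypothetical samples to evaluate

# Bounds come from the grid the model was trained on
grid_lo, grid_hi = grid_masses.min(), grid_masses.max()

# Correct: samples are placed on the same scale the model saw during training
good = normalize(data_masses, grid_lo, grid_hi)

# Buggy: bounds taken from the data, so the same physical value maps to a
# different normalized value every time the input sample set changes
bad = normalize(data_masses, data_masses.min(), data_masses.max())
```

Because the grid bounds are only known at training time, they have to travel with the cached object, which is why the cache needed to store more than the regressor.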

kcroker wants to merge 4 commits into michaelzevin:main from kcroker:main
