Description
This issue proposes the design and implementation of a new report_generation step in SFtool, responsible for generating Excel reports from the selected_variants already stored in each Sample object.
The goal is to implement a reusable, extensible, and testable reporting architecture that cleanly supports:
- One or two samples
- Multiple categories (PR, RR, RR-STR, RR-SMN1-copy, PGx)
- Two reproductive-risk modes (screening and advanced)
- Per-sample reports and couple reports
The design explicitly separates data preparation, carrier logic, and Excel rendering, avoiding case-specific branching and maximizing reuse.
Motivation
The current reporting requirements introduce multiple dimensions of complexity:
- Variable number of samples (1 vs 2)
- Conditional report generation depending on RR_mode
- Multiple Excel outputs with overlapping logic
- Category-specific but structurally similar tables
Without a clear separation of responsibilities, report generation risks becoming monolithic and difficult to extend or test.
This refactor ensures:
- Minimal duplication
- Isolated carrier screening rules
- Future extensibility (e.g. new RR modes, trio analysis, alternative formats)
Current behavior
Report generation in SFtool is currently implemented in modules/misc/report_utils.py as a set of utility functions that directly produce Excel files from Sample objects and configuration parameters.
The current behaviour can be summarized as follows:
- Report generation is implemented as a utility module, not as a dedicated pipeline step:
  - Functions in report_utils.py are invoked procedurally
  - There is no explicit report_generation step analogous to other pipeline steps (e.g. variant_collection, variant_selection)
- The report logic operates directly on Sample objects:
  - Variant data are accessed from multiple sample attributes (e.g. variant collections, PGx results, STR/SMA outputs)
  - There is no intermediate, normalized report representation (tables, rows, metadata)
- Excel output is the primary abstraction:
  - Business logic, report structure (tabs), and Excel formatting are interleaved
  - Each function both decides what to report and how it is written to Excel
- Handling of different scenarios relies on conditional branching:
  - The two-sample workflow would have to be handled inside the same reporting logic as the single-sample case
  - Carrier screening behaviour depends on RR_mode and is not implemented
  - Per-category logic (PR, RR, STR, SMN1, PGx) is embedded directly in the Excel-writing functions
- Couple-level reports are not implemented.
- Individual reports for more than one sample are not implemented.
- As a consequence:
  - Reuse of reporting logic between single-sample and dual-sample workflows is limited
  - Adding new categories, new RR modes, or alternative outputs requires modifying existing conditional logic
  - Unit testing of reporting rules independently of Excel generation is difficult
  - The report implementation is tightly coupled to the current Excel format
While the current implementation is functional and produces the expected Excel outputs (only for a single sample), its utility-based and Excel-centric design makes it difficult to extend, reuse, or test as reporting requirements continue to grow.
Proposed refactor
The report generation logic will be refactored into a dedicated pipeline step, report_generation, with a clear separation between data preparation, carrier screening rules, and output rendering.
The refactor introduces an intermediate, data-driven reporting layer that decouples report logic from Excel generation and minimizes conditional branching.
1. Introduce explicit reporting data models
Define lightweight, normalized report abstractions that represent report content independently of the output format:
- ReportTable
  - Represents one logical table (e.g. PR, RR, PGx)
  - Stores rows as a list of normalized dictionaries plus optional metadata
- SampleReport
  - Groups all ReportTable objects belonging to one sample
  - One SampleReport is always generated per sample
- CoupleReport
  - Represents couple-level genetic counseling results
  - Encapsulates mode-specific metadata (RR_mode)
These models replace direct Excel-driven logic and act as the contract between variant selection and report rendering.
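As a rough illustration, the three models could be expressed as plain dataclasses. This is only a sketch: the field names (name, rows, metadata, sample_id, sample_ids, tables) and the placeholder row values are assumptions made for the example, not existing SFtool attributes.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ReportTable:
    """One logical table (e.g. PR, RR, PGx), independent of the output format."""
    name: str
    rows: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class SampleReport:
    """All ReportTable objects belonging to one sample."""
    sample_id: str  # hypothetical identifier field
    tables: list[ReportTable] = field(default_factory=list)


@dataclass
class CoupleReport:
    """Couple-level results; rr_mode records which rule set produced them."""
    sample_ids: tuple[str, str]  # hypothetical field
    rr_mode: str
    tables: list[ReportTable] = field(default_factory=list)


# Placeholder content, not real variant data.
report = SampleReport(
    sample_id="S1",
    tables=[ReportTable(name="PR", rows=[{"gene": "GENE1", "variant": "var1"}])],
)
```

Because the models carry no Excel-specific state, they can be compared and asserted on directly in unit tests.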
2. Refactor category-specific logic into table builders
Move category-specific report logic out of report_utils.py into pure, reusable table builder functions:
build_pr_table()
build_rr_table()
build_rr_str_table()
build_rr_smn1_table()
build_pgx_table()
Characteristics:
- Input: Sample.selected_variants (and minimal sample metadata if required)
- Output: ReportTable
- No Excel, file system, or configuration side effects
- Independent of the number of samples
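A builder with these characteristics might look like the sketch below. It assumes, purely for illustration, that selected_variants is a dictionary keyed by category whose values are lists of variant dictionaries, and that the PR table has the three columns named in PR_COLUMNS; neither assumption comes from the SFtool codebase.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ReportTable:
    name: str
    rows: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical column set; the real PR columns are defined by SFtool.
PR_COLUMNS = ("gene", "variant", "genotype")


def build_pr_table(selected_variants: dict[str, list[dict[str, Any]]]) -> ReportTable:
    """Pure function: reads the PR variants and returns a new ReportTable.

    It touches no files, no configuration, and does not mutate its input.
    """
    rows = [
        {col: variant.get(col) for col in PR_COLUMNS}
        for variant in selected_variants.get("PR", [])
    ]
    return ReportTable(name="PR", rows=rows, metadata={"n_variants": len(rows)})


# Placeholder input, not real variant data.
selected = {"PR": [{"gene": "GENE1", "variant": "var1", "genotype": "het", "extra": 1}]}
table = build_pr_table(selected)
```

A missing category simply yields an empty table, so the builder behaves the same regardless of which categories a sample actually has.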
3. Centralize per-sample report assembly
Implement a single assembler function:
build_sample_report(sample, enabled_categories) -> SampleReport
This function:
- Selects the appropriate table builders based on configuration
- Produces a complete per-sample report
- Is reused for both single-sample and two-sample workflows
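One way to keep the assembler free of per-category branching is a registry that maps category names to builders. The sketch below uses a stand-in Sample class and generic placeholder builders; the registry name TABLE_BUILDERS, the Sample fields, and the error handling are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ReportTable:
    name: str
    rows: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class SampleReport:
    sample_id: str
    tables: list[ReportTable] = field(default_factory=list)


@dataclass
class Sample:
    """Stand-in for the SFtool Sample object (fields are hypothetical)."""
    sample_id: str
    selected_variants: dict[str, list[dict[str, Any]]]


def _make_builder(category: str) -> Callable[[Sample], ReportTable]:
    # Placeholder builder: copies the rows of one category unchanged.
    def build(sample: Sample) -> ReportTable:
        return ReportTable(category, list(sample.selected_variants.get(category, [])))
    return build


# Registry: adding a category means adding one entry, not a new branch.
TABLE_BUILDERS = {
    cat: _make_builder(cat) for cat in ("PR", "RR", "RR-STR", "RR-SMN1-copy", "PGx")
}


def build_sample_report(sample: Sample, enabled_categories: list[str]) -> SampleReport:
    unknown = [c for c in enabled_categories if c not in TABLE_BUILDERS]
    if unknown:
        raise ValueError(f"No table builder registered for: {unknown}")
    tables = [TABLE_BUILDERS[c](sample) for c in enabled_categories]
    return SampleReport(sample.sample_id, tables)


sample = Sample("S1", {"PR": [{"gene": "GENE1"}], "PGx": []})
report = build_sample_report(sample, ["PR", "PGx"])
```

The same call serves the two-sample workflow by invoking it once per sample.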
4. Isolate couple-level genetic counseling logic using rule-based strategies
Extract genetic counseling logic from the reporting utilities into explicit, mode-specific rule classes:
- CarrierScreeningRule
- AdvancedRule
Each rule:
- Operates exclusively on RR-selected variants from two samples
- Encapsulates all mode-specific reproductive-risk logic
- Produces normalized rows for genetic counseling tables
A dedicated builder:
build_genetic_counseling_report(sample_a, sample_b, rr_mode) -> GeneticCounselingReport
handles rule selection and report assembly.
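The rule classes can follow a plain strategy pattern, sketched below. The bodies of apply() are deliberately trivial placeholders and do not represent the actual screening or advanced rules; the base class CounselingRule, the helper select_rule, and the "gene" key are likewise assumptions made for the example.

```python
from abc import ABC, abstractmethod

Variant = dict[str, str]


class CounselingRule(ABC):
    """Common interface: RR variants of both partners in, normalized rows out."""

    @abstractmethod
    def apply(self, rr_a: list[Variant], rr_b: list[Variant]) -> list[dict[str, str]]:
        ...


class CarrierScreeningRule(CounselingRule):
    def apply(self, rr_a, rr_b):
        # Placeholder logic only: genes in which both partners have an RR variant.
        shared = {v["gene"] for v in rr_a} & {v["gene"] for v in rr_b}
        return [{"gene": gene, "finding": "variant in both partners"} for gene in sorted(shared)]


class AdvancedRule(CounselingRule):
    def apply(self, rr_a, rr_b):
        # Placeholder logic only: list every RR variant, tagged by partner.
        rows = [{"gene": v["gene"], "partner": "A"} for v in rr_a]
        rows += [{"gene": v["gene"], "partner": "B"} for v in rr_b]
        return rows


def select_rule(rr_mode: str) -> CounselingRule:
    """Screening mode gets the screening rule; any other mode gets the advanced rule."""
    return CarrierScreeningRule() if rr_mode == "screening" else AdvancedRule()


rows = select_rule("screening").apply(
    [{"gene": "GENE1"}, {"gene": "GENE2"}],
    [{"gene": "GENE2"}],
)
```

Each rule can then be unit tested with two small lists of variants, without constructing samples or touching Excel.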
5. Introduce a generic Excel writer
Replace Excel-centric logic in report_utils.py with a thin rendering layer:
- write_sample_report(SampleReport)
- write_genetic_counseling_report(GeneticCounselingReport)
Responsibilities:
- Convert ReportTable.rows into DataFrames
- Apply formatting and layout
- Create Excel tabs using ReportTable.name
The writer does not implement any biological or reporting rules.
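The rendering layer can stay small if its pure parts are factored out. The sketch below is library-free and covers two such parts: deriving a valid Excel tab name from ReportTable.name (Excel limits sheet names to 31 characters and rejects the characters \ / ? * [ ] :) and flattening row dictionaries into a header-plus-values grid that could then be handed to pandas or openpyxl. Both helper names are hypothetical.

```python
import re
from typing import Any

# Characters Excel does not allow in worksheet names.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def excel_sheet_name(table_name: str) -> str:
    """Replace forbidden characters and truncate to Excel's 31-character limit."""
    return _INVALID_SHEET_CHARS.sub("_", table_name)[:31]


def table_to_grid(rows: list[dict[str, Any]]) -> list[list[Any]]:
    """Return a header row followed by value rows.

    Columns keep their order of first appearance; missing cells become None.
    """
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    if not columns:
        return []
    return [columns] + [[row.get(col) for col in columns] for row in rows]


grid = table_to_grid([{"gene": "GENE1"}, {"gene": "GENE2", "genotype": "het"}])
```

Neither helper knows anything about categories or RR modes, which keeps the writer free of reporting rules.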
6. Implement report_generation as a pipeline step
Create a new pipeline step (steps/report_generation.py) that:
- Builds one SampleReport per input sample
- Writes one Excel file per sample
- If two samples are provided, the report generation step always produces:
  - One Excel report per individual, with the same structure as the single-sample report
  - One additional RR-only Excel report generated at couple level
The semantics and rules applied to the additional RR report depend on RR_mode.
RR_mode = screening
Outputs:
- One Excel per individual (PR, RR, RR-STR, RR-SMN1, PGx tabs depending on enabled categories)
- One additional Excel for carrier screening (RR only)
The carrier screening report:
- Uses RR selected variants from both individuals
- Applies carrier screening–specific rules
RR_mode ≠ screening
Outputs:
- One Excel per individual (same structure as single-sample)
- One additional Excel for advanced RR analysis (RR only)
The advanced RR report:
- Uses RR selected variants from both individuals
- Applies a rule set different from carrier screening
This step becomes the sole consumer of Sample.selected_variants for reporting.
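The output matrix described above can be captured in one small planning function, sketched here. The function name and the file naming scheme are invented for illustration; only the number and kind of outputs follow the description.

```python
def plan_report_outputs(sample_ids: list[str], rr_mode: str) -> list[str]:
    """List the Excel files the step would write for the given samples and mode."""
    if len(sample_ids) not in (1, 2):
        raise ValueError("report_generation supports one or two samples")

    # One individual report per sample, in both workflows.
    outputs = [f"{sample_id}.report.xlsx" for sample_id in sample_ids]

    # With two samples, add exactly one couple-level RR report.
    if len(sample_ids) == 2:
        kind = "carrier_screening" if rr_mode == "screening" else "advanced_rr"
        outputs.append(f"{sample_ids[0]}_{sample_ids[1]}.{kind}.xlsx")
    return outputs


single = plan_report_outputs(["S1"], "screening")
couple = plan_report_outputs(["S1", "S2"], "advanced")
```

Keeping this decision separate from the writing itself makes the one-sample and two-sample cases testable without producing any files.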
Directory structure
steps/report_generation.py
modules/report/__init__.py
modules/report/report_manager.py
modules/report/sample_report.py
modules/report/couple_report.py
modules/writers/__init__.py
modules/writers/excel_writer.py
Additional context
- This refactor is part of the ongoing restructuring of SFtool toward a step-based, data-driven architecture using ExecutionContext and Sample objects, following recent refactors of variant_collection and variant_selection.
- The proposed design intentionally treats report generation as a pure consumer of Sample.selected_variants, avoiding any dependency on upstream annotation or selection logic.
- The refactor does not introduce new biological interpretation rules. Existing logic for PR, RR, and PGx reporting is preserved and relocated into clearer, reusable components.
- This refactor does include new implementation rules for the STR and SMN1-copy categories.
- Excel is maintained as the initial and primary output format to ensure backward compatibility with current clinical and research workflows. However, the introduction of intermediate report models enables future extensions to alternative formats (e.g. JSON, HTML, PDF) without modifying report logic.
- Genetic counseling (couple-level interpretation) is explicitly constrained to RR-selected variants and is only generated when two samples are provided, reflecting current SFtool behavior. The rule set applied depends on RR_mode and differs between screening and non-screening scenarios.
- The separation of two modes for genetic counseling rules (screening vs advanced) is intentional and aligns with existing reproductive-risk modes, allowing each rule set to evolve independently and be unit tested in isolation.
- This refactor is scoped to report generation only and does not modify:
  - Variant annotation (GeneBe, ClinVar)
  - Variant selection rules
  - Catalog definitions
  - Configuration schemas
- The proposed structure mirrors patterns already used elsewhere in the codebase (category-specific modules, thin orchestration steps, reusable utilities), minimizing cognitive overhead for maintainers.