Report generation refactor: reusable, data-driven Excel reporting for 1–2 samples

## Description
This issue proposes the design and implementation of a new `report_generation` step in **SFtool**, responsible for generating Excel reports from the `selected_variants` already stored in each `Sample` object.

The goal is to implement a **reusable, extensible, and testable reporting architecture** that cleanly supports:

- One or two samples
- Multiple categories (PR, RR, RR-STR, RR-SMN1-copy, PGx)
- Two reproductive-risk modes (`screening` and `advanced`)
- Per-sample reports and couple reports

The design explicitly separates **data preparation**, **carrier logic**, and **Excel rendering**, avoiding case-specific branching and maximizing reuse.


## Motivation
The current reporting requirements introduce multiple dimensions of complexity:

- Variable number of samples (1 vs 2)
- Conditional report generation depending on `RR_mode`
- Multiple Excel outputs with overlapping logic
- Category-specific but structurally similar tables

Without a clear separation of responsibilities, report generation risks becoming monolithic and difficult to extend or test.

This refactor ensures:
- Minimal duplication
- Isolated carrier screening rules
- Future extensibility (e.g. new RR modes, trio analysis, alternative formats)

## Current behavior
Report generation in SFtool is currently implemented in `modules/misc/report_utils.py` as a set of utility functions that directly produce Excel files from `Sample` objects and configuration parameters.

The current behavior can be summarized as follows:

- Report generation is implemented as a **utility module**, not as a dedicated pipeline step:
  - Functions in `report_utils.py` are invoked procedurally
  - There is no explicit `report_generation` step analogous to other pipeline steps (e.g. `variant_collection`, `variant_selection`)

- The report logic operates **directly on `Sample` objects**:
  - Variant data are accessed from multiple sample attributes (e.g. variant collections, PGx results, STR/SMA outputs)
  - There is no intermediate, normalized report representation (tables, rows, metadata)

- Excel output is the primary abstraction:
  - Business logic, report structure (tabs), and Excel formatting are interleaved
  - Each function both decides *what* to report and *how* it is written to Excel

- Handling of different scenarios relies on **conditional branching**:
  - The two-sample workflow is expected to be handled inside the same reporting logic as the one-sample case
  - Carrier screening behavior depends on `RR_mode` but is not implemented
  - Per-category logic (PR, RR, STR, SMN1, PGx) is embedded directly in the Excel-writing functions

- Couple-level reports are not implemented
- Individual reports for more than one sample are not implemented

- As a consequence:
  - Reuse of reporting logic between single-sample and dual-sample workflows is limited
  - Adding new categories, new RR modes, or alternative outputs requires modifying existing conditional logic
  - Unit testing of reporting rules independently of Excel generation is difficult
  - The report implementation is tightly coupled to the current Excel format

While the current implementation is functional and produces the expected Excel outputs (only for a single sample), its utility-based and Excel-centric design makes it difficult to extend, reuse, or test as reporting requirements continue to grow.

## Proposed refactor
The report generation logic will be refactored into a dedicated pipeline step, `report_generation`, with a clear separation between **data preparation**, **carrier screening rules**, and **output rendering**.

The refactor introduces an intermediate, data-driven reporting layer that decouples report logic from Excel generation and minimizes conditional branching.

### 1. Introduce explicit reporting data models

Define lightweight, normalized report abstractions that represent report content independently of the output format:

- **ReportTable**
  - Represents one logical table (e.g. PR, RR, PGx)
  - Stores rows as a list of normalized dictionaries plus optional metadata

- **SampleReport**
  - Groups all `ReportTable` objects belonging to one sample
  - One `SampleReport` is always generated per sample

- **CoupleReport**
  - Represents couple-level genetic counseling results
  - Encapsulates mode-specific metadata (`RR_mode`)

These models replace direct Excel-driven logic and act as the contract between variant selection and report rendering.
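
As a minimal sketch, the three models could be plain dataclasses. The field names `sample_id`, `sample_ids`, and `tables` are assumptions made for illustration; only `name`, `rows`, the optional metadata, and `RR_mode` come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ReportTable:
    """One logical table (e.g. PR, RR, PGx), independent of the output format."""
    name: str  # also intended as the Excel tab name
    rows: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class SampleReport:
    """All tables belonging to one sample."""
    sample_id: str  # hypothetical field name
    tables: list[ReportTable] = field(default_factory=list)


@dataclass
class CoupleReport:
    """Couple-level genetic counseling results for two samples."""
    sample_ids: tuple[str, str]  # hypothetical field name
    rr_mode: str  # e.g. "screening" or "advanced"
    tables: list[ReportTable] = field(default_factory=list)


# Placeholder values, not real variant data.
pr = ReportTable(name="PR", rows=[{"gene": "GENE1", "variant": "variant-1"}])
report = SampleReport(sample_id="S1", tables=[pr])
```

Because the models hold only plain Python values, they can be constructed and compared in unit tests without touching Excel.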

### 2. Refactor category-specific logic into table builders

Move category-specific report logic out of `report_utils.py` into pure, reusable table builder functions:

- `build_pr_table()`
- `build_rr_table()`
- `build_rr_str_table()`
- `build_rr_smn1_table()`
- `build_pgx_table()`

Characteristics:
- Input: `Sample.selected_variants` (and minimal sample metadata if required)
- Output: `ReportTable`
- No Excel, file system, or configuration side effects
- Independent of the number of samples
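
A builder with these characteristics might look like the sketch below. It assumes that `selected_variants` is a dictionary mapping category names to lists of variant dictionaries, and the column set `PR_COLUMNS` is hypothetical; neither detail is specified in this issue.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ReportTable:
    name: str
    rows: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical column set; the real PR columns come from the existing report.
PR_COLUMNS = ("gene", "variant", "genotype", "classification")


def build_pr_table(selected_variants: dict[str, list[dict[str, Any]]]) -> ReportTable:
    """Build the PR table from selected variants, with no side effects."""
    rows = [
        # Keep only the report columns; missing keys become None.
        {column: variant.get(column) for column in PR_COLUMNS}
        for variant in selected_variants.get("PR", [])
    ]
    return ReportTable(name="PR", rows=rows, metadata={"n_variants": len(rows)})
```

Since the function neither reads configuration nor writes files, it can be tested with small in-memory inputs.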

### 3. Centralize per-sample report assembly

Implement a single assembler function:

```python
def build_sample_report(sample, enabled_categories) -> SampleReport: ...
```
This function:

- Selects the appropriate table builders based on configuration
- Produces a complete per-sample report
- Is reused for both single-sample and two-sample workflows
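
One way to avoid per-category branching in the assembler is a registry that maps category names to builders. The sketch below is illustrative only: `TABLE_BUILDERS`, the pass-through builders, and the stand-in `Sample` class are assumptions, not existing SFtool code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ReportTable:
    name: str
    rows: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class SampleReport:
    sample_id: str
    tables: list[ReportTable] = field(default_factory=list)


@dataclass
class Sample:
    """Stand-in for SFtool's Sample with only the attributes used here."""
    sample_id: str
    selected_variants: dict[str, list[dict[str, Any]]]


def _passthrough_builder(category: str) -> Callable[[Sample], ReportTable]:
    """Placeholder builder that copies the selected variants of one category."""
    def build(sample: Sample) -> ReportTable:
        return ReportTable(name=category, rows=list(sample.selected_variants.get(category, [])))
    return build


# Registry of category name -> table builder; insertion order defines tab order.
TABLE_BUILDERS: dict[str, Callable[[Sample], ReportTable]] = {
    category: _passthrough_builder(category)
    for category in ("PR", "RR", "RR-STR", "RR-SMN1-copy", "PGx")
}


def build_sample_report(sample: Sample, enabled_categories) -> SampleReport:
    """Run the builder of every enabled category and collect the tables."""
    tables = [
        builder(sample)
        for category, builder in TABLE_BUILDERS.items()
        if category in enabled_categories
    ]
    return SampleReport(sample_id=sample.sample_id, tables=tables)
```

With this layout, adding a category means registering one more builder, and the same function serves both the single-sample and two-sample workflows.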

### 4. Isolate couple-level genetic counseling logic using rule-based strategies

Extract genetic counseling logic from the reporting utilities into explicit, mode-specific rule classes:

- `CarrierScreeningRule`
- `AdvancedRule`

Each rule:

- Operates exclusively on RR-selected variants from two samples
- Encapsulates all mode-specific reproductive-risk logic
- Produces normalized rows for genetic counseling tables

A dedicated builder:

`build_genetic_counseling_report(sample_a, sample_b, rr_mode) -> GeneticCounselingReport`

handles rule selection and report assembly.
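
The rule selection could follow a simple strategy pattern, sketched below. The logic inside each rule is a placeholder chosen only to make the example runnable and does not represent SFtool's actual reproductive-risk rules. The `apply` method, the `CoupleRule` protocol, the `select_rule` helper, and the `gene` key are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

Variant = dict[str, Any]


@dataclass
class GeneticCounselingReport:
    rr_mode: str
    rows: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Sample:
    """Stand-in for SFtool's Sample with only the attribute used here."""
    selected_variants: dict[str, list[Variant]]


class CoupleRule(Protocol):
    def apply(self, rr_a: list[Variant], rr_b: list[Variant]) -> list[dict[str, Any]]: ...


class CarrierScreeningRule:
    """Placeholder logic: genes in which both partners carry an RR variant."""
    def apply(self, rr_a, rr_b):
        shared = {v["gene"] for v in rr_a} & {v["gene"] for v in rr_b}
        return [{"gene": gene, "carriers": "both"} for gene in sorted(shared)]


class AdvancedRule:
    """Placeholder logic: every gene with an RR variant in either partner."""
    def apply(self, rr_a, rr_b):
        genes_a = {v["gene"] for v in rr_a}
        genes_b = {v["gene"] for v in rr_b}
        rows = []
        for gene in sorted(genes_a | genes_b):
            if gene in genes_a and gene in genes_b:
                carriers = "both"
            else:
                carriers = "A" if gene in genes_a else "B"
            rows.append({"gene": gene, "carriers": carriers})
        return rows


def select_rule(rr_mode: str) -> CoupleRule:
    """Pick the rule set for the given mode; anything but 'screening' is advanced."""
    return CarrierScreeningRule() if rr_mode == "screening" else AdvancedRule()


def build_genetic_counseling_report(sample_a, sample_b, rr_mode) -> GeneticCounselingReport:
    rule = select_rule(rr_mode)
    rows = rule.apply(
        sample_a.selected_variants.get("RR", []),
        sample_b.selected_variants.get("RR", []),
    )
    return GeneticCounselingReport(rr_mode=rr_mode, rows=rows)
```

Keeping each rule in its own class means a new mode only needs one more class and one more branch in the selector, and each rule can be unit tested with two small variant lists.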

### 5. Introduce a generic Excel writer
Replace Excel-centric logic in `report_utils.py` with a thin rendering layer:

- `write_sample_report(SampleReport)`
- `write_genetic_counseling_report(GeneticCounselingReport)`

Responsibilities:
- Convert `ReportTable.rows` into DataFrames
- Apply formatting and layout
- Create Excel tabs using `ReportTable.name`

The writer does not implement any biological or reporting rules.

### 6. Implement report_generation as a pipeline step

Create a new pipeline step (`steps/report_generation.py`) that:

1. Builds one `SampleReport` per input sample
2. Writes one Excel file per sample
3. If two samples are provided, always produces:
    - One Excel report per individual, with the same structure as the single-sample report
    - One additional RR-only Excel report generated at couple level

The semantics and rules applied to the additional RR report depend on `RR_mode`.

#### RR_mode = screening

Outputs:
- One Excel per individual (PR, RR, RR-STR, RR-SMN1, PGx tabs depending on enabled categories)
- One additional Excel for carrier screening (RR only)

The carrier screening report:
- Uses RR-selected variants from both individuals
- Applies carrier screening–specific rules

#### RR_mode ≠ screening

Outputs:
- One Excel per individual (same structure as single-sample)
- One additional Excel for advanced RR analysis (RR only)

The advanced RR report:
- Uses RR selected variants from both individuals
- Applies a rule set different from carrier screening

This step becomes the sole consumer of `Sample.selected_variants` for reporting.
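
The orchestration described in this section can be sketched as a thin function that receives the builders and the writer as arguments. The function name `run_report_generation`, its parameters, and the output file naming scheme are hypothetical and only serve the illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Sequence


def run_report_generation(
    samples: Sequence[Any],
    rr_mode: str,
    build_sample_report: Callable[[Any], Any],
    build_couple_report: Callable[[Any, Any, str], Any],
    write_report: Callable[[Any, str], None],
) -> list[str]:
    """Write one report per sample, plus one couple-level report for two samples."""
    if len(samples) not in (1, 2):
        raise ValueError("report_generation supports one or two samples")

    outputs = []
    for sample in samples:
        path = f"{sample.sample_id}.xlsx"  # hypothetical naming scheme
        write_report(build_sample_report(sample), path)
        outputs.append(path)

    if len(samples) == 2:
        # The mode only changes which couple-level report is produced.
        kind = "carrier_screening" if rr_mode == "screening" else "advanced_rr"
        path = f"{samples[0].sample_id}_{samples[1].sample_id}_{kind}.xlsx"
        write_report(build_couple_report(samples[0], samples[1], rr_mode), path)
        outputs.append(path)
    return outputs


# Usage with stub builders and an in-memory "writer" instead of real Excel output.
written = []
samples = [SimpleNamespace(sample_id="A"), SimpleNamespace(sample_id="B")]
outputs = run_report_generation(
    samples,
    "screening",
    build_sample_report=lambda s: {"sample": s.sample_id},
    build_couple_report=lambda a, b, mode: {"mode": mode},
    write_report=lambda report, path: written.append((path, report)),
)
```

Passing the builders and writer in keeps the step free of biological rules and lets the control flow be tested without creating files.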

### Directory structure
- `steps/report_generation.py`
- `modules/report/__init__.py`
- `modules/report/report_manager.py`
- `modules/report/sample_report.py`
- `modules/report/couple_report.py`
- `modules/writers/__init__.py`
- `modules/writers/excel_writer.py`



## Additional context


- This refactor is part of the ongoing restructuring of SFtool toward a step-based, data-driven architecture using `ExecutionContext` and `Sample` objects, following recent refactors of `variant_collection` and `variant_selection`.

- The proposed design intentionally treats report generation as a **pure consumer** of `Sample.selected_variants`, avoiding any dependency on upstream annotation or selection logic.

- The refactor does **not** introduce new biological interpretation rules. Existing logic for PR, RR, and PGx reporting is preserved and relocated into clearer, reusable components.

- The refactor does include new implementation rules for the STR and SMN1-copy categories.

- Excel is maintained as the initial and primary output format to ensure backward compatibility with current clinical and research workflows. However, the introduction of intermediate report models enables future extensions to alternative formats (e.g. JSON, HTML, PDF) without modifying report logic.

- Genetic counseling (couple-level interpretation) is explicitly constrained to RR-selected variants and is only generated when two samples are provided, reflecting current SFtool behavior. The rule set applied depends on `RR_mode` and differs between screening and non-screening scenarios.

- The separation of two modes for genetic counseling rules (`screening` vs `advanced`) is intentional and aligns with existing reproductive-risk modes, allowing each rule set to evolve independently and be unit tested in isolation.

- This refactor is scoped to **report generation only** and does not modify:
  - Variant annotation (GeneBe, ClinVar)
  - Variant selection rules
  - Catalog definitions
  - Configuration schemas

- The proposed structure mirrors patterns already used elsewhere in the codebase (category-specific modules, thin orchestration steps, reusable utilities), minimizing cognitive overhead for maintainers.



