Skip to content

Calculating the matrix of pairwise coalescence rates efficiently (with genetic_relatedness_vector) #3444

@nspope

Description

@nspope

TreeSequence.pair_coalescence_rates is intended for mapping from a small number of sample sets to a large number of time windows, and is not efficient when one wants to get all pairwise rates between samples for a small number of time windows.

In the latter case, a good option is to use time-windowed genetic_relatedness_vector. This works because the proportion of the sequence that is uncoalesced by a particular timepoint is proportional to the instantaneous change in branch divergence/relatedness for the pair of haplotypes. Once we implement time windowing in genetic_relatedness_vector and don't need to use decapitate per time window, I suspect this approach should be pretty optimal in terms of efficiency.

Then, the pairwise coalescence time CDF can be turned into rates via the same Kaplan-Meier estimator that is used for pair_coalescence_rates.

Here's a mock implementation of computing the CDF this way,

def pairwise_coalescence_cdf(
    ts: tskit.TreeSequence, 
    time_grid: np.ndarray, 
    samples: np.ndarray = None,
) -> np.ndarray:
    """
    Return the proportion of sequence that has coalesced between all pairs of `samples`, 
    by the times in `time_grid`. If `samples` is None, then all samples in `ts`
    are used. The output array has dimensions `(time_grid.size, samples.size, samples.size)`.
    This uses `ts.decapitate(t).genetic_relatedness_vector(..., nodes=samples)`
    under the hood.
    """
    assert time_grid[0] == 0
    assert time_grid[-1] < np.inf
    assert np.all(np.diff(time_grid) > 0)
    if samples is None: samples = np.array(list(ts.samples()))
    assert np.all(ts.nodes_time[samples] == 0)
    eye = np.eye(samples.size)
    times = np.append(np.unique(ts.nodes_time), np.inf)
    break_above = np.searchsorted(times, time_grid, side="right")
    assert break_above.min() > 0                                                                                                                                                               
    times_below = times[break_above - 1]
    times_above = times[break_above]
    pairwise_surv = np.zeros((time_grid.size, samples.size, samples.size))
    pairwise_surv[0] = 1.0
    kwargs = {"nodes": samples, "mode": "branch", "centre": False}
    for i, (a, b) in enumerate(zip(times_below, times_above)):
        if a > 0 and np.isfinite(b):
            D = (
                ts.decapitate(b).genetic_relatedness_vector(eye, **kwargs) -
                ts.decapitate(a).genetic_relatedness_vector(eye, **kwargs)
            ) / (b - a) / 2
            pairwise_surv[i] = np.add.outer(np.diag(D), np.diag(D)) - 2 * D
    return 1 - pairwise_surv

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions