Convert provgigapath embeddings to parquet by slide/tile #32

@seandavi

Description

The current prov-gigapath embeddings are stored as CSV files whose cells contain text representations of Python objects. This format makes the data very difficult to access and use.

Proposal

Convert all the tile-level and slide-level prov-gigapath embeddings to Parquet files (one per level), each with one or more metadata columns (slide id, tile location, image name) and one column holding the actual tensor data (14 x 768 array).

Advantages

  • Much easier data management: two files, one for tile-level data and one for slide-level data, cover ALL of TCGA.
  • Dataset becomes more AI/ML-ready
  • Language-agnostic representation (any language with a Parquet reader can load the files)
  • Data access code becomes trivial (a single Parquet read)
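
To illustrate the last point, here is a minimal sketch of what data access could look like once the files exist. It assumes the `pyarrow` package and a hypothetical `slide_id` metadata column; neither the column name nor the file layout is settled by this issue.

```python
def load_slide(parquet_path, slide_id):
    """Return the rows for one slide as a list of plain dicts."""
    # pyarrow is a third-party dependency; it is imported inside the function
    # so that defining this helper does not require it to be installed.
    import pyarrow.parquet as pq

    table = pq.read_table(
        parquet_path,
        # "slide_id" is a hypothetical column name used for illustration.
        filters=[("slide_id", "==", slide_id)],
    )
    return table.to_pylist()
```

Calling `load_slide` with the path to the tile-level file and a slide identifier would return only that slide's rows, so no custom parsing code is needed.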

Pseudocode

  1. Read in the embeddings from each per-sample CSV file
  2. Derive metadata (slide id, tile location, image name) for each CSV file and collect it in a data frame
  3. Convert each CSV file's embedding to a matrix and add it as a new column in the data frame from step 2
  4. Write out the full data frame as a Parquet file
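
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the CSV column names (`image_name`, `embedding`), the cell format (`tensor([[...]])`), and the use of the file name as the slide id are all hypothetical, since the issue does not specify the actual layout. Writing uses the `pyarrow` package.

```python
import ast
import csv
from pathlib import Path

# Embedding cells can be long; raise the csv module's per-field size limit.
csv.field_size_limit(10**9)


def parse_embedding(text):
    """Turn a tensor-like string into nested lists of floats.

    Assumes a cell such as "tensor([[0.1, 0.2], [0.3, 0.4]])"; the real
    files may use a different representation.
    """
    # Keep only the outermost bracketed literal, dropping any wrapper
    # such as "tensor(...)" or a trailing ", dtype=...".
    body = text[text.index("["): text.rindex("]") + 1]
    return ast.literal_eval(body)


def rows_from_csv(csv_path):
    """Yield one record per CSV row: metadata plus the parsed embedding."""
    slide_id = Path(csv_path).stem  # assumption: file name identifies the slide
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield {
                "slide_id": slide_id,
                "image_name": row["image_name"],  # hypothetical column
                "embedding": parse_embedding(row["embedding"]),  # hypothetical column
            }


def write_parquet(csv_paths, out_path):
    """Collect every CSV into one table and write a single Parquet file."""
    # Third-party dependency, imported lazily so the parsing helpers
    # above can be used and tested without it.
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = [rec for path in csv_paths for rec in rows_from_csv(path)]
    table = pa.Table.from_pylist(records)
    pq.write_table(table, out_path)
    return table.num_rows
```

This version holds every record in memory before writing, which is fine for a small test but probably not for all of TCGA; a real run would likely need to write in batches.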

Result

  1. Tile-level prov-gigapath embeddings in a Parquet file
  2. Slide-level prov-gigapath embeddings in a Parquet file

Both files are fully language-agnostic and AI/ML-ready.

Metadata

Labels: enhancement, help wanted

Project status: Backlog
