clearcode/store_scans.py fails to bootstrap shard repositories reliably #847

@Shashidar123

Summary
I found an issue in the scan storage flow in clearcode/store_scans.py.

This module appears to be responsible for taking ClearlyDefined scan data, deriving a purl-based hash, mapping that hash to a GitHub repository shard, and then storing scans in that repository. Because of that, the repo creation and clone logic here is part of an important project workflow, not just a helper utility.

The current implementation of get_or_init_repo() breaks in two common cases: when the repository already exists remotely but has no local checkout, and when it has to be created and cloned for the first time.

Problem
Current logic:

    if repo_name not in get_github_repos(user_name=user_name):
        repo_url = create_github_repo(repo_name=repo_name)

    repo_path = work_dir / repo_name
    if repo_path.exists():
        repo = Repo(repo_path)
        if pull:
            repo.origin.pull()
    else:
        repo = Repo.clone_from(repo_url, repo_path)

This has two clear failure cases:

- If the remote repository already exists but the local directory does not:
  - repo_url is never assigned
  - the Repo.clone_from(repo_url, repo_path) call raises UnboundLocalError, because repo_url is referenced before assignment
- If the remote repository does not exist:
  - create_github_repo() creates the repo through the GitHub API but does not return a clone URL
  - repo_url becomes None
  - Repo.clone_from(repo_url, repo_path) still fails, since None is not a valid clone URL

So the bootstrap flow is unreliable in both scenarios:

- cloning an existing shard repository
- creating and then cloning a new shard repository
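
The two failure modes can be reproduced without GitHub or GitPython. The snippet below is a minimal sketch, assuming repo_url is a local variable inside get_or_init_repo(); pick_clone_url and describe are hypothetical stand-ins that mirror only the assignment logic and are not functions from the project.

```python
def pick_clone_url(remote_exists):
    """Hypothetical stand-in that mirrors only the repo_url assignment."""
    if not remote_exists:
        # Mirrors create_github_repo() returning nothing, i.e. None.
        repo_url = None
    # When remote_exists is True, repo_url was never bound in this scope.
    return repo_url


def describe(remote_exists):
    """Report what the clone step would receive in each scenario."""
    try:
        return repr(pick_clone_url(remote_exists))
    except UnboundLocalError:
        return "UnboundLocalError"


print(describe(remote_exists=True))   # UnboundLocalError
print(describe(remote_exists=False))  # None
```

Neither branch produces a usable URL for the clone step.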
Additional issue
get_github_repos() yields full_name values such as org/repo, but the membership check compares them against the bare repo_name.

That makes this check unreliable:

    if repo_name not in get_github_repos(user_name=user_name):

If repo_name is something like abc and get_github_repos() yields values like my-org/abc, the condition is always true, so the code tries to create the repo even when it already exists.
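
One possible way to make the comparison consistent is to build the full name before checking membership. This is a sketch only: remote_repo_exists and its namespace parameter are hypothetical names, and it assumes the listing really yields owner/repo strings as described above.

```python
def remote_repo_exists(repo_name, namespace, full_names):
    """Return True if namespace/repo_name appears among the listed full names."""
    expected = f"{namespace}/{repo_name}"
    return any(name == expected for name in full_names)


listed = ["my-org/abc", "my-org/abd"]

# The bare-name check misses a repo that is actually listed.
print("abc" not in listed)                           # True
# Comparing like with like finds it.
print(remote_repo_exists("abc", "my-org", listed))   # True
print(remote_repo_exists("xyz", "my-org", listed))   # False
```

The alternative is to have get_github_repos() yield bare names, which would keep the call site unchanged but lose the owner information.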

Why this matters
This issue affects a core storage workflow in the project:

ClearlyDefined scan -> purl -> purl hash -> shard repo -> clone/init -> commit/push

Because this file uses purl-derived hashes to distribute scan data across many Git repositories, a bug here can block or break the scan archival process, which is why I think this is an important issue to fix.

Expected behavior
get_or_init_repo() should:

- correctly detect whether the remote shard repo already exists
- create it if it does not exist
- always have a valid clone URL before trying to clone
- clone when the local checkout is missing
- pull only when the local checkout already exists and pull=True
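
The expected behavior listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the listing, creation, clone, and open steps are injected as callables (all parameter names are hypothetical) so the control flow can be exercised without network access, and the namespace parameter and the https://github.com/<namespace>/<repo>.git URL shape are assumptions. With GitPython one would presumably pass Repo.clone_from as clone and Repo as open_local.

```python
from pathlib import Path


def get_or_init_repo_sketch(repo_name, namespace, work_dir, list_full_names,
                            create_remote, clone, open_local, pull=False):
    """Return a local repo object for one shard; all I/O is injected."""
    full_name = f"{namespace}/{repo_name}"
    if full_name in set(list_full_names()):
        # Assumed HTTPS clone URL shape for an already existing remote.
        repo_url = f"https://github.com/{full_name}.git"
    else:
        # create_remote is expected to return the new repo's clone URL.
        repo_url = create_remote(repo_name)
    if not repo_url:
        # Fail early instead of handing None to the clone step.
        raise ValueError(f"no clone URL for {full_name}")

    repo_path = Path(work_dir) / repo_name
    if not repo_path.exists():
        return clone(repo_url, repo_path)

    repo = open_local(repo_path)
    if pull:
        # Only reached when the local checkout already exists.
        repo.remote("origin").pull()
    return repo
```

The point of the sketch is that repo_url is assigned on every path and validated before the clone step, so both failure cases described earlier turn into either a successful clone or an explicit error.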
Suggested fix
A good fix would likely include:

- making create_github_repo() return the created repo clone URL
- making get_github_repos() return repo names in a format consistent with the membership check
- ensuring repo_url is always defined before Repo.clone_from(...)
- clarifying whether repo_namespace and user_name should support org repositories differently
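
For the first suggestion, the response to the GitHub "create a repository for the authenticated user" endpoint includes a clone_url field, so the function could return it directly. The sketch below assumes an HTTP client is passed in as post and behaves like requests.post; the token and post parameters and the function name are hypothetical and do not reflect the existing signature in store_scans.py.

```python
def create_github_repo_sketch(repo_name, token, post):
    """Create a repo for the authenticated user and return its clone URL."""
    response = post(
        "https://api.github.com/user/repos",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        json={"name": repo_name},
    )
    # Surface API errors (for example, a name that already exists).
    response.raise_for_status()
    # The created repository payload carries the HTTPS clone URL.
    return response.json()["clone_url"]
```

Organization repositories are created through a different endpoint (POST /orgs/{org}/repos), which is one reason the repo_namespace versus user_name question above needs a clear answer.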
References
GitHub REST API documentation:
https://docs.github.com/en/rest/repos/repos#create-a-repository-for-the-authenticated-user

GitHub REST API list repositories:
https://docs.github.com/en/rest/repos/repos#list-repositories-for-the-authenticated-user

Package URL specification:
https://github.com/package-url/purl-spec
