Python: model PyMongo read results as sources for py/sql-injection in second-order SQL construction flows #21775

@invoke1442

Description

py/sql-injection already appears to model the sink side correctly through the existing DB-API / PEP249.qll coverage for execute(...). The gap seems to be on the source side for a common second-order pattern: values read from MongoDB with PyMongo are later reused in dynamically constructed SQL. I ran into this while triaging KBase Metrics (CVE-2022-4860), but the underlying issue is broader than that one project.

A reduced example looks like this:

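from pymongo import MongoClient
import psycopg2

def sync_users(uri, dsn):
    # Source: usernames read back from MongoDB (persisted, possibly attacker-influenced).
    users = []
    for record in MongoClient(uri).auth.users.find({"role": "dev"}, {"user": 1, "_id": 0}):
        users.append(record["user"])

    # Second-order step: the stored values are concatenated into the SQL text.
    in_clause = "', '".join(users)
    sql = (
        "update user_info set active = true "
        "where username in ('" + in_clause + "')"
    )

    # Sink: execute(...) is already covered by the DB-API / PEP249.qll modeling.
    cur = psycopg2.connect(dsn).cursor()
    cur.execute(sql)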
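To make the impact of the reduced example concrete, here is a minimal, self-contained sketch of the same flow. It rests on several assumptions that are not part of the original report: `sqlite3` stands in for `psycopg2`, a plain list of dicts stands in for the documents returned by `find(...)`, and every helper name (`build_unsafe_sql`, `activate_unsafe`, `activate_safe`, `make_db`, `active_users`) is hypothetical. The malicious username is an illustrative payload, not one taken from the CVE.

```python
import sqlite3

def build_unsafe_sql(users):
    # Same string-building shape as the reduced example (1 instead of true for SQLite).
    in_clause = "', '".join(users)
    return (
        "update user_info set active = 1 "
        "where username in ('" + in_clause + "')"
    )

def activate_unsafe(conn, users):
    # Stored values become part of the SQL text.
    conn.execute(build_unsafe_sql(users))

def activate_safe(conn, users):
    # Stored values are passed as bound parameters instead.
    if not users:
        return
    placeholders = ", ".join("?" for _ in users)
    conn.execute(
        "update user_info set active = 1 where username in (" + placeholders + ")",
        list(users),
    )

def make_db():
    # Hypothetical relational table with three inactive users.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table user_info (username text, active integer)")
    conn.executemany(
        "insert into user_info values (?, 0)",
        [("alice",), ("bob",), ("carol",)],
    )
    return conn

def active_users(conn):
    rows = conn.execute(
        "select username from user_info where active = 1 order by username"
    )
    return [row[0] for row in rows]

# Stand-in for documents read back from MongoDB; the second one is attacker-controlled.
records = [{"user": "alice"}, {"user": "x') or ('1'='1"}]
users = [record["user"] for record in records]
```

With the concatenated statement, the stored payload closes the `in (...)` list and appends an always-true condition, so every row is updated. With bound parameters, the payload is treated as an ordinary username that matches nothing.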

My reading of the current modeling is that this flow falls between two existing pieces: PyMongo.qll models collection operations for NoSQL semantics, while py/sql-injection starts from active threat-model sources that do not seem to cover data read back from PyMongo collections. As a result, the query has the right sink and the right string-building path shape, but no source that can reach it.

I do not think this needs a new query or wider sink modeling. The fix seems fairly contained: add source coverage for values obtained from common PyMongo read APIs such as find, find_one, and find_one_and_*, so that those results can participate in the existing py/sql-injection flow. If widening default behavior is a concern, this could also live behind an opt-in threat-model bucket for persisted database results rather than being treated as generic local input.
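As a rough illustration of which values the proposed source coverage would need to reach, the sketch below exercises three of the read APIs named above. It is only an assumption about what a regression test input could look like, not an existing CodeQL test. `FakeCollection` is a hypothetical in-memory stand-in so the snippet runs without a MongoDB server; it mimics just the `find`, `find_one`, and `find_one_and_update` calls of a PyMongo collection, and `collect_usernames` is likewise a made-up helper.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a PyMongo collection (equality filters only)."""

    def __init__(self, docs):
        self._docs = list(docs)

    def _match(self, query):
        return [
            doc for doc in self._docs
            if all(doc.get(key) == value for key, value in query.items())
        ]

    def find(self, query=None, projection=None):
        # PyMongo returns a cursor; an iterator is enough for this sketch.
        return iter(self._match(query or {}))

    def find_one(self, query=None, projection=None):
        matches = self._match(query or {})
        return matches[0] if matches else None

    def find_one_and_update(self, query, update):
        # Like PyMongo's default, return the document as it was before the update.
        doc = self.find_one(query)
        if doc is None:
            return None
        before = dict(doc)
        doc.update(update.get("$set", {}))
        return before

def collect_usernames(collection):
    # Each appended value is one that the proposed modeling would treat as a source.
    tainted = []
    for record in collection.find({"role": "dev"}):
        tainted.append(record["user"])      # element of a find(...) result
    one = collection.find_one({"role": "dev"})
    if one is not None:
        tainted.append(one["user"])         # field of a find_one(...) result
    previous = collection.find_one_and_update(
        {"role": "dev"}, {"$set": {"seen": True}}
    )
    if previous is not None:
        tainted.append(previous["user"])    # field of a find_one_and_* result
    return tainted
```

Any of the collected values reaching a dynamically built string passed to `execute(...)` is the flow this issue asks py/sql-injection to report.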

This pattern is common in real Python codebases, especially in cron jobs, reporting jobs, migration scripts, sync workers, and ETL-style code that bridges Mongo-backed application state into relational stores. Bandit's B608 already flags the same family syntactically by recognizing SQL-shaped string construction passed to execute(), so there is at least external evidence that this is a practical and recurring pattern. CodeQL seems close to covering it already; the missing piece is semantic (data-flow) coverage for PyMongo-backed persisted data flowing into the existing SQL sinks.
