arrow-json primitive decoder never detects numeric-string/fractional-float type conflicts, so ignore_type_conflicts can't act on them #10265

Description

@pdmetcalfe

Describe the bug

When decoding a value for a numeric column, arrow-json should treat the JSON literal 1234 differently from the JSON literal "1234": one is a number token, the other is a string token that contains digits.
It does not: both tokens are accepted and produce the identical Int64 value 1234. The decoder parses the contents of a string as if it were a number instead of first checking whether the token was a number in the
first place — so no type conflict is ever raised for this case.

The option that should govern this is with_ignore_type_conflicts. Its doc comment describes precisely this scenario:

if the type is declared to be ... DataType::Int32 but the reader
encounters a string value "foo" ... false (the default): The reader
will return an error.

But that guarantee only holds when the string's contents fail to parse as a number ("foo"). When the string's contents do parse as a number ("42"), no conflict is detected at all, so ignore_type_conflicts never gets a chance to act — the value is silently accepted regardless of whether ignore_type_conflicts is true or false.
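
To make the distinction concrete, here is a minimal sketch of the two decoding strategies. It uses only the standard library. The `Token` enum and both function names are hypothetical stand-ins for illustration and are not arrow-json's actual internal types or API.

```rust
/// Hypothetical stand-in for a tokenized JSON scalar (NOT arrow-json's real type).
/// Each variant carries the raw text of the token, without surrounding quotes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token<'a> {
    Number(&'a str),
    String(&'a str),
}

/// Lenient strategy: parse the text of either token kind.
/// The string "42" and the number 42 become indistinguishable, so a
/// type conflict is only noticed when the text fails to parse ("foo").
fn decode_i64_lenient(tok: Token<'_>) -> Result<i64, String> {
    let text = match tok {
        Token::Number(s) | Token::String(s) => s,
    };
    text.parse::<i64>()
        .map_err(|e| format!("cannot parse {text:?} as i64: {e}"))
}

/// Strict strategy: check the token kind first, then parse.
/// A string token is a type conflict even when its contents are digits.
fn decode_i64_strict(tok: Token<'_>) -> Result<i64, String> {
    match tok {
        Token::Number(s) => s
            .parse::<i64>()
            .map_err(|e| format!("cannot parse {s:?} as i64: {e}")),
        Token::String(s) => Err(format!(
            "type conflict: expected a number, found string {s:?}"
        )),
    }
}
```

With these definitions, `decode_i64_lenient(Token::String("42"))` returns `Ok(42)`, while `decode_i64_strict(Token::String("42"))` returns an `Err`. Under the strict strategy an option like ignore_type_conflicts would have a conflict to act on.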

The same gap exists one level down: a JSON float with a fractional part (5.7) decoded into an Int64 column is silently truncated to 5 rather than being treated as a conflict — again irrespective of
ignore_type_conflicts. This is silent data loss, not just a lenient alternate encoding.
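
As an illustration of what treating a fractional float as a conflict could look like, here is a small standard-library sketch of a lossless f64-to-i64 conversion. The helper name `f64_to_i64_exact` is hypothetical. Whether arrow-json performs the conversion with a plain cast internally is an assumption and is not stated in this report.

```rust
/// Hypothetical helper: convert an f64 to an i64 only when no information is lost.
/// Returns None for NaN, infinities, fractional values, and out-of-range values.
fn f64_to_i64_exact(v: f64) -> Option<i64> {
    // Reject NaN, infinities, and anything with a fractional part (e.g. 5.7).
    if !v.is_finite() || v.fract() != 0.0 {
        return None;
    }
    // The i64 range is [-2^63, 2^63); both bounds are exactly representable as f64.
    const LOWER: f64 = -9_223_372_036_854_775_808.0; // -2^63, inclusive
    const UPPER: f64 = 9_223_372_036_854_775_808.0; // 2^63, exclusive
    if v < LOWER || v >= UPPER {
        return None;
    }
    // Safe: v is an integer within range, so the cast is exact.
    Some(v as i64)
}
```

For comparison, a bare Rust cast `5.7_f64 as i64` evaluates to 5, which matches the truncation described above, whereas `f64_to_i64_exact(5.7)` returns `None` and so gives the caller a conflict to report.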

To Reproduce

use arrow::array::AsArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::StructMode;
use arrow::json::reader::ReaderBuilder;
use std::sync::Arc;

fn decode_one(schema: Arc<Schema>, json_row: &str, ignore_type_conflicts: bool) -> Result<i64, String> {
    // ListOnly mode: each row is a JSON array of column values, e.g. ["42"].
    let mut decoder = ReaderBuilder::new(schema)
        .with_struct_mode(StructMode::ListOnly)
        .with_ignore_type_conflicts(ignore_type_conflicts)
        .build_decoder()
        .map_err(|e| e.to_string())?;
    // Feed the row, followed by a trailing newline as the row delimiter.
    decoder.decode(json_row.as_bytes()).map_err(|e| e.to_string())?;
    decoder.decode(b"\n").map_err(|e| e.to_string())?;
    // Flush the buffered row into a RecordBatch and read back the single Int64 value.
    let batch = decoder.flush().map_err(|e| e.to_string())?.unwrap();
    Ok(batch.column(0).as_primitive::<arrow::datatypes::Int64Type>().value(0))
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("AGE", DataType::Int64, true)]));

    // Decode a JSON string "42" into an Int64 column, with the default
    // ignore_type_conflicts=false.
    println!("{:?}", decode_one(schema.clone(), r#"["42"]"#, false));

    // Decode a JSON float 5.7 into an Int64 column, with the default
    // ignore_type_conflicts=false.
    println!("{:?}", decode_one(schema.clone(), r#"[5.7]"#, false));
}

Output on 59.0.0:

Ok(42)
Ok(5)

The same holds for Float32/Float64 targets given a numeric JSON string
("177.8" → Float32 value 177.8, "72.5" → Float64 value 72.5),
confirmed with the same harness.

Expected behavior

With ignore_type_conflicts at its default (false), both calls should return Err, not Ok:

  • r#"["42"]"# into an Int64 column: the token is a JSON string, not a number, so this should be rejected as a type conflict — the same as any other string-into-number mismatch (e.g. "foo", which is correctly
    rejected today).
  • r#"[5.7]"# into an Int64 column: 5.7 is not representable as an Int64 without loss, so this should be rejected as a type conflict rather than silently truncated to 5.

Additional context

In both cases the decoder silently accepts input that does not match the declared column type, rather than returning a decode error.