Describe the bug
When decoding a value for a numeric column, arrow-json should treat the JSON literal 1234 differently from the JSON literal "1234": one is a number token, the other is a string token that contains digits.
It does not: both tokens are accepted and produce the identical Int64 value 1234. The decoder parses the contents of a string as if it were a number instead of first checking whether the token was a number in the
first place — so no type conflict is ever raised for this case.
The option that should govern this is with_ignore_type_conflicts. Its doc comment describes precisely this scenario:
if the type is declared to be ... DataType::Int32 but the reader
encounters a string value "foo" ... false (the default): The reader
will return an error.
But that guarantee only holds when the string's contents fail to parse as a number ("foo"). When the string's contents do parse as a number ("42"), no conflict is detected at all, so ignore_type_conflicts never gets a chance to act — the value is silently accepted regardless of whether ignore_type_conflicts is true or false.
The same gap exists one level down: a JSON float with a fractional part (5.7) decoded into an Int64 column is silently truncated to 5 rather than being treated as a conflict — again irrespective of
ignore_type_conflicts. This is silent data loss, not just a lenient alternate encoding.
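To make the reported behavior concrete, here is a minimal standalone model of it. This is a sketch only, not arrow-json's actual code: the `JsonScalar` enum and the `lenient_int64` function are hypothetical names introduced for illustration, and the float fallback is an assumption chosen to match the outcomes described above.

```rust
/// Hypothetical stand-in for a tokenized JSON scalar; NOT an arrow-json type.
#[derive(Debug, Clone, Copy)]
pub enum JsonScalar<'a> {
    /// A bare number token such as `1234` or `5.7`, kept as its source text.
    Number(&'a str),
    /// A quoted string token such as `"1234"`, with the quotes removed.
    String(&'a str),
}

/// Models the lenient behavior described in this report: the token kind is
/// ignored and only the text is inspected. (Illustrative assumption, not the
/// library's implementation.)
pub fn lenient_int64(token: JsonScalar<'_>) -> Result<i64, String> {
    // Both token kinds collapse to the same text, so the type information
    // that could signal a conflict is discarded here.
    let text = match token {
        JsonScalar::Number(s) | JsonScalar::String(s) => s,
    };
    if let Ok(v) = text.parse::<i64>() {
        return Ok(v);
    }
    // Fallback: parse as a float and truncate toward zero, which is lossy.
    text.parse::<f64>()
        .map(|f| f as i64)
        .map_err(|_| format!("failed to parse {text:?} as Int64"))
}
```

In this model only `"foo"` reaches the error path, so a flag like ignore_type_conflicts has nothing to act on for `"42"` or `5.7`.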
To Reproduce
use arrow::array::AsArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::StructMode;
use arrow::json::reader::ReaderBuilder;
use std::sync::Arc;

fn decode_one(schema: Arc<Schema>, json_row: &str, ignore_type_conflicts: bool) -> Result<i64, String> {
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_struct_mode(StructMode::ListOnly)
        .with_ignore_type_conflicts(ignore_type_conflicts)
        .build_decoder()
        .map_err(|e| e.to_string())?;
    decoder.decode(json_row.as_bytes()).map_err(|e| e.to_string())?;
    decoder.decode(b"\n").map_err(|e| e.to_string())?;
    let batch = decoder.flush().map_err(|e| e.to_string())?.unwrap();
    Ok(batch.column(0).as_primitive::<arrow::datatypes::Int64Type>().value(0))
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("AGE", DataType::Int64, true)]));

    // Decode a JSON string "42" into an Int64 column, with the default
    // ignore_type_conflicts=false.
    println!("{:?}", decode_one(schema.clone(), r#"["42"]"#, false));

    // Decode a JSON float 5.7 into an Int64 column, with the default
    // ignore_type_conflicts=false.
    println!("{:?}", decode_one(schema.clone(), r#"[5.7]"#, false));
}
Output on 59.0.0:
The same holds for Float32/Float64 targets given a numeric JSON string
("177.8" → Float32 value 177.8, "72.5" → Float64 value 72.5),
confirmed with the same harness.
Expected behavior
With ignore_type_conflicts at its default (false), both calls should return Err, not Ok:
r#"["42"]"# into an Int64 column: the token is a JSON string, not a number, so this should be rejected as a type conflict — the same as any other string-into-number mismatch (e.g. "foo", which is correctly
rejected today).
r#"[5.7]"# into an Int64 column: 5.7 is not representable as an Int64 without loss, so this should be rejected as a type conflict rather than silently truncated to 5.
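The expected behavior above could be sketched as a conversion that checks the token kind before looking at its contents. This is a standalone illustration under stated assumptions, not a proposed patch: `JsonScalar` and `strict_int64` are hypothetical names that do not exist in arrow-json, and the error messages are invented.

```rust
/// Hypothetical stand-in for a tokenized JSON scalar; NOT an arrow-json type.
#[derive(Debug, Clone, Copy)]
pub enum JsonScalar<'a> {
    /// A bare number token such as `42` or `5.7`, kept as its source text.
    Number(&'a str),
    /// A quoted string token such as `"42"`, with the quotes removed.
    String(&'a str),
}

/// Strict conversion: the token kind is checked first, and a number token
/// is accepted only if its text parses exactly as an i64.
pub fn strict_int64(token: JsonScalar<'_>) -> Result<i64, String> {
    match token {
        // A string token is a type conflict even when it contains digits.
        JsonScalar::String(s) => Err(format!(
            "type conflict: expected a number for Int64, got string {s:?}"
        )),
        // No float fallback: `5.7` fails here instead of truncating to 5.
        JsonScalar::Number(s) => s
            .parse::<i64>()
            .map_err(|_| format!("type conflict: {s} is not representable as Int64")),
    }
}
```

With this shape, `strict_int64(JsonScalar::String("42"))` and `strict_int64(JsonScalar::Number("5.7"))` both return `Err`, which is the point at which ignore_type_conflicts could then decide between raising the error and substituting a null. The sketch is deliberately conservative and also rejects `5.0` and `1e3`; whether integral-valued floats should be accepted is a separate design question.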
Additional context
Both are silent acceptance of malformed input rather than a decode error.