enable customizing list inner child element name? #84

@AlJohri

When Spark writes a Parquet file, I believe it always names the inner list field "element" rather than "item":

message spark_schema {
  ....
  OPTIONAL group mylistcolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

It appears that this crate (or one of its dependencies, perhaps arrow2 itself?) always assumes that the inner field of a list is named "item" rather than "element".

Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

I'm guessing this is because of this line of code?

arrow2::datatypes::DataType::List(Box::new(<T as ArrowField>::field("item")))

  1. If this is controlled by arrow2-convert, can we perhaps customize this via an annotation on the struct member?
  2. Should the default be re-evaluated if parquet-mr / Spark uses element?
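
To make the mismatch concrete, here is a minimal sketch of one possible workaround: rewrite the inner list field name on one side before comparing schemas. The `Field` and `DataType` types below are simplified, hypothetical stand-ins and not the real arrow2 types, and `rename_list_items`, `expected_schema`, and `actual_schema` are names invented for illustration; nothing here is an existing arrow2 or arrow2-convert API.

```rust
// Hypothetical stand-ins for arrow2's `DataType` and `Field`.
// These are NOT the real arrow2 types, only enough structure to show the idea.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, is_nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, is_nullable }
    }
}

/// Returns a copy of `data_type` in which the inner field of every list
/// (at any nesting depth) is named `inner_name`. All other names are kept.
fn rename_list_items(data_type: &DataType, inner_name: &str) -> DataType {
    match data_type {
        DataType::List(inner) => DataType::List(Box::new(Field {
            name: inner_name.to_string(),
            data_type: rename_list_items(&inner.data_type, inner_name),
            is_nullable: inner.is_nullable,
        })),
        DataType::Struct(fields) => DataType::Struct(
            fields
                .iter()
                .map(|f| Field {
                    name: f.name.clone(),
                    data_type: rename_list_items(&f.data_type, inner_name),
                    is_nullable: f.is_nullable,
                })
                .collect(),
        ),
        other => other.clone(),
    }
}

/// Schema with the inner list field named "item", as in the "Expected" line above.
fn expected_schema() -> DataType {
    let inner = Field::new("item", DataType::Int32, false);
    let list = DataType::List(Box::new(inner));
    DataType::Struct(vec![Field::new("mylistcolumn", list, false)])
}

/// Schema with the inner list field named "element", as in the "Actual" line above.
fn actual_schema() -> DataType {
    let inner = Field::new("element", DataType::Int32, false);
    let list = DataType::List(Box::new(inner));
    DataType::Struct(vec![Field::new("mylistcolumn", list, false)])
}
```

With these stand-ins, `expected_schema()` and `actual_schema()` compare unequal, while `rename_list_items(&expected_schema(), "element")` equals `actual_schema()`. A per-field annotation, as asked in point 1, would amount to choosing that inner name at derive time instead of hard-coding "item".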

P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31
