
Consider using UTF-8 when encoding is unspecified #200

@danstoner


For example, attempting to ingest:

https://unhcollection.unh.edu/database/content/dwca/UNHC-UNHC_DwC-A.zip

The published Darwin Core Archive includes a meta.xml which has a blank encoding value:

encoding=""

The rest of that line looks like:

<core dateFormat="YYYY-MM-DD" encoding="" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">

The encoding value tells consumers of the occurrence file how to decode it properly.
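For illustration, here is a minimal sketch of how a consumer might read that attribute and detect the blank case. The function name `declared_core_encoding`, the sample document, and the `occurrences.csv` location are made up for this example; this is not the ingestion code itself.

```python
import xml.etree.ElementTree as ET

# Illustrative meta.xml modelled on the <core> line quoted above.
# The <files>/<location> value is a placeholder, not the real archive layout.
META_XML = r'''<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core dateFormat="YYYY-MM-DD" encoding="" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrences.csv</location></files>
  </core>
</archive>'''

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def declared_core_encoding(meta_xml):
    """Return the encoding declared on <core>, or None if missing or blank."""
    root = ET.fromstring(meta_xml)
    core = root.find(DWC_TEXT_NS + "core")
    if core is None:
        # Tolerate archives that omit the namespace declaration.
        core = root.find("core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    value = (core.get("encoding") or "").strip()
    return value or None


print(declared_core_encoding(META_XML))
```

Returning `None` for both a missing and a blank attribute lets the caller handle "unspecified" in one place.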

The data provider has been unable to resolve the situation in over a year.

https://redmine.idigbio.org/issues/3002

Consider whether it is worth assuming UTF-8 when the encoding is blank so the data can be ingested, or whether it still makes sense to hard fail, since there is a chance of "bad things" (for example, silently mis-decoded characters) if the assumed encoding does not match the file's actual encoding.
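One possible middle ground, sketched below under stated assumptions: fall back to UTF-8 only when the declared encoding is blank, decode strictly, and still hard fail if the bytes are not valid UTF-8. The names `resolve_encoding` and `decode_occurrence_bytes` are hypothetical and do not refer to existing ingestion code.

```python
import codecs


def resolve_encoding(declared, default="utf-8"):
    """Return (codec_name, assumed); assumed is True when falling back."""
    if declared and declared.strip():
        name, assumed = declared.strip(), False
    else:
        name, assumed = default, True
    codecs.lookup(name)  # raises LookupError for an unknown codec name
    return name, assumed


def decode_occurrence_bytes(raw, declared):
    """Decode strictly; fail loudly instead of ingesting garbled text."""
    name, assumed = resolve_encoding(declared)
    try:
        text = raw.decode(name, errors="strict")
    except UnicodeDecodeError as exc:
        kind = "assumed" if assumed else "declared"
        raise ValueError(
            "occurrence file is not valid %s (%s encoding): %s" % (name, kind, exc)
        ) from exc
    return text, assumed


# Usage: a blank encoding with valid UTF-8 bytes is accepted and flagged.
text, assumed = decode_occurrence_bytes("Müller".encode("utf-8"), "")
print(text, assumed)
```

Strict decoding catches many mismatches, because non-ASCII text in a legacy single-byte encoding such as Latin-1 is rarely valid UTF-8. It is not a guarantee, so logging the `assumed` flag for later review seems prudent.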
