Utf-8 decode fails on chunk if character is split #78

@Jolbas

bufs[fd] += os.read(fd, 4096).decode('UTF-8')

A multi-byte Unicode character happened to start at byte position 4095 of the chunk, so it was split across two reads and the UTF-8 decode failed:

    bufs[fd] += os.read(fd, 4096).decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 4095: unexpected end of data
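
For reference, the same error can be reproduced without any file descriptor. This is only a minimal sketch: the payload of 4095 ASCII bytes followed by 'é' is an assumption chosen to put the lead byte 0xc3 at position 4095, as in the traceback.

```python
# 'é' encodes to the two bytes 0xC3 0xA9, so the payload is 4097 bytes long.
data = ('a' * 4095 + 'é').encode('utf-8')

# A 4096-byte chunk ends with the lone lead byte 0xC3.
chunk = data[:4096]

try:
    chunk.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    # The decoder reports the incomplete sequence at the end of the chunk.
    error = exc

print(error)
```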

I'm not sure about this solution but it seems to work:

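        if sys.version_info < (3, 0):
            for fd in fds:
                bufs[fd] += os.read(fd, 4096)
        else:
            for fd in fds:
                b = os.read(fd, 4096)
                # A UTF-8 character is at most 4 bytes long, so at most
                # 3 extra bytes are needed to complete a split character.
                for i in range(4):
                    try:
                        bufs[fd] += b.decode('UTF-8')
                        break
                    except UnicodeDecodeError:
                        if i < 3:
                            b += os.read(fd, 1)
                        else:
                            # Still invalid after 3 extra bytes: a real error.
                            raise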
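
Another option that might be worth considering for the Python 3 branch is the standard library's incremental decoder (`codecs.getincrementaldecoder`), which keeps an incomplete trailing sequence internally until the next chunk arrives, so no extra one-byte reads are needed. The sketch below is not the project's code: the pipe and the test payload are assumptions made for the demonstration. In the real loop one decoder per fd would be needed, since each decoder carries state.

```python
import codecs
import os

# Demo input: 'é' (0xC3 0xA9) straddles the 4096-byte chunk boundary.
r, w = os.pipe()
os.write(w, b'a' * 4095 + 'é'.encode('utf-8'))
os.close(w)

# One stateful decoder per file descriptor.
decoder = codecs.getincrementaldecoder('utf-8')()

text = ''
while True:
    data = os.read(r, 4096)
    # An empty read means EOF; final=True then makes a dangling
    # partial character raise instead of being dropped silently.
    text += decoder.decode(data, final=not data)
    if not data:
        break
os.close(r)

print(len(text), repr(text[-1]))
```

In the loop from the issue this would amount to something like `bufs[fd] += decoders[fd].decode(os.read(fd, 4096))`, where `decoders` is a hypothetical dict mapping each fd to its own decoder.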
