[Urwid] Problem related to UTF-8 processing of Latin strings (Fatal)

Neil Tallim red.hamsterx at gmail.com
Sun Apr 16 22:14:39 EDT 2006


My encoding is UTF-8, and everything is exactly as you've described.
I'm sorry for not being clear about that in my first message.

I know that I can tell Urwid to use another encoding, but that doesn't
solve the problem I think I've found.

In the event that a Latin-1 string is evaluated as UTF-8 (my
application's fault for not validating encoding when accepting input;
I'm not blaming this part on anyone but myself), if the *final*
character is, say, 'ä', then 'text[pos+1]' will cause an IndexError to
be raised. I believe this is because that byte marks the start of a
multi-byte UTF-8 sequence, so Urwid incorrectly assumes that at least
one more byte will follow.
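
To illustrate the failure, here is a minimal sketch of a decoder that
reads the next byte without a bounds check. The function
naive_decode_one is hypothetical and is not Urwid's code; it only
mimics the unchecked 'text[pos+1]' access. It is written for Python 3
bytes objects. In Latin-1, 'ä' is the single byte 0xE4, which falls in
the range UTF-8 uses for the lead byte of a three-byte sequence.

```python
def naive_decode_one(text: bytes, pos: int):
    """Hypothetical naive decoder: return (code point, next position).

    Not Urwid's implementation; it only reproduces the unchecked
    text[pos + 1] style of access described in the message.
    """
    b1 = text[pos]
    if b1 < 0x80:                      # plain ASCII byte
        return b1, pos + 1
    if b1 & 0xE0 == 0xC0:              # lead byte of a 2-byte sequence
        b2 = text[pos + 1]             # unchecked: may run past the end
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F), pos + 2
    if b1 & 0xF0 == 0xE0:              # lead byte of a 3-byte sequence
        b2 = text[pos + 1]             # unchecked: may run past the end
        b3 = text[pos + 2]
        code = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
        return code, pos + 3
    return ord("?"), pos + 1           # anything else: placeholder


latin1_bytes = "ae = ä".encode("latin-1")   # the last byte is 0xE4
last = len(latin1_bytes) - 1

try:
    naive_decode_one(latin1_bytes, last)
    crashed = False
except IndexError:
    # 0xE4 looks like a UTF-8 lead byte, but no bytes follow it.
    crashed = True
```

With the Latin-1 input above, the lookup past the final byte raises
IndexError, which matches the crash described.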

Discarding, question-marking, or returning the ordinal value of this
trailing character, if it exists, would prevent an error from being
thrown, which would prevent the possibility of an unexpected crash.
Related lines: 88, 98, 111 in urwid/utable.py
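
The three fallbacks suggested above could look roughly like this. It
is only a sketch under assumptions: decode_tolerant and its on_error
values ("discard", "question", "ordinal") are made-up names, not part
of Urwid, and the code targets Python 3 bytes objects.

```python
def decode_tolerant(text: bytes, on_error: str = "question") -> str:
    """Decode UTF-8, tolerating truncated or invalid sequences.

    Hypothetical helper, not Urwid's API. on_error selects what to do
    with a bad byte: "discard", "question", or "ordinal".
    """
    out = []
    pos = 0
    while pos < len(text):
        b1 = text[pos]
        if b1 < 0x80:
            length = 1                  # ASCII
        elif b1 & 0xE0 == 0xC0:
            length = 2                  # 2-byte sequence
        elif b1 & 0xF0 == 0xE0:
            length = 3                  # 3-byte sequence
        elif b1 & 0xF8 == 0xF0:
            length = 4                  # 4-byte sequence
        else:
            length = 1                  # stray byte; decode fails below
        chunk = text[pos:pos + length]  # slicing never raises IndexError
        try:
            if len(chunk) < length:
                raise ValueError("truncated sequence")
            out.append(chunk.decode("utf-8"))
            pos += length
        except ValueError:              # includes UnicodeDecodeError
            if on_error == "question":
                out.append("?")
            elif on_error == "ordinal":
                out.append(chr(b1))     # treat the byte as a code point
            # "discard": emit nothing for this byte
            pos += 1
    return "".join(out)


data = b"ae = \xe4"                     # Latin-1 bytes ending in 0xE4
discarded = decode_tolerant(data, "discard")
questioned = decode_tolerant(data, "question")
ordinal = decode_tolerant(data, "ordinal")
```

Python's built-in decoder offers similar behaviour through its error
handlers, for example data.decode("utf-8", errors="replace") or
errors="ignore".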



Application/testing information:

The application I'm working on is an IRC client framework that uses
Urwid as an optional interface. (It's just a project I started so I
could learn more about Python)

This problem was discovered when someone typed 'ae = ä'.
('ä' = '\xe4' in Latin-1)

Now, this works fine when sent as UTF-8, which is usually the case,
but this time, it was sent as Latin-1. Of course, other Latin-1
strings work fine, as long as they don't end with a byte that marks
the start of a multi-byte UTF-8 sequence.
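
On the application side, the missing input validation could be
sketched as below: try strict UTF-8 first and fall back to Latin-1
when that fails. The name decode_incoming is hypothetical, and the
choice of Latin-1 as the fallback is an assumption about what other
clients send; the code targets Python 3.

```python
def decode_incoming(raw: bytes) -> str:
    """Decode a received line, preferring UTF-8 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")      # strict: raises on invalid input
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


from_utf8 = decode_incoming("ae = ä".encode("utf-8"))
from_latin1 = decode_incoming("ae = ä".encode("latin-1"))
```

Both calls return the same text, so the display layer only ever sees
properly decoded strings.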

My program is hardly something that needs to be running 24/7, but I
believed it was stable enough to leave for extended periods of time.
This crash was entirely unexpected, and I wouldn't want anyone else to
be surprised because some user finally entered a string they didn't
believe to be a problem.

-Neil

> It looks like your encoding is set to UTF-8, and you're passing plain
> strings to Urwid to display in the latin-1 encoding.  In general, Urwid
> assumes that plain strings are in the system's default encoding, so you
> can't use plain strings in an encoding other than UTF-8 when the
> system's encoding is UTF-8.
>
> If your application is designed to handle strings in the latin-1
> encoding, convert them to unicode strings before displaying them:
>
> eg:
> textwidget = Text( unicode( mystring, "latin-1" ) )
>
> You can force Urwid to disable its UTF-8 processing by calling
> urwid.set_encoding("latin-1") but if your terminal really is in UTF-8
> mode then the characters won't be displayed properly.
>
> If your system's encoding is not UTF-8 then Urwid shouldn't be trying to
> decode your strings with the utable module.  What is the output when
> you run the "locale" command on your system?
>
> Ian

