[Urwid] Problem related to UTF-8 processing of Latin strings (Fatal)
Neil Tallim
red.hamsterx at gmail.com
Sun Apr 16 22:14:39 EDT 2006
My encoding is UTF-8, and everything is exactly as you've described.
I'm sorry for not being clear about that in my first message.
I know that I can tell Urwid to use another encoding, but that doesn't
solve the problem I think I've found.
In the event that a Latin-1 string is evaluated as UTF-8 (my
application's fault for not validating encoding when accepting input;
I'm not blaming this part on anyone but myself), if the *final*
character is, say, 'ä', then 'text[pos+1]' will cause an IndexError to
be thrown. I believe this is because that byte marks the start of a
multi-byte UTF-8 sequence, so Urwid incorrectly assumes that at least
one more byte will follow.
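
A minimal sketch of how this kind of failure can arise, assuming a
decoder that trusts the lead byte and indexes ahead without checking the
string length. The function name `naive_decode_one` is hypothetical, and
this is not Urwid's actual code; it is written against Python 3 bytes
objects for illustration.

```python
def naive_decode_one(text, pos):
    """Hypothetical decoder that trusts the lead byte (NOT Urwid's code)."""
    b1 = text[pos]
    if b1 < 0x80:
        # Plain ASCII: one byte, one character.
        return b1, pos + 1
    if b1 & 0xE0 == 0xC0:
        # Lead byte of a two-byte sequence; the next byte is assumed to exist.
        b2 = text[pos + 1]
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F), pos + 2
    if b1 & 0xF0 == 0xE0:
        # Lead byte of a three-byte sequence; two more bytes are assumed.
        b2 = text[pos + 1]  # IndexError when the string ends here
        b3 = text[pos + 2]
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F), pos + 3
    return ord("?"), pos + 1


# Latin-1 input whose final byte looks like a UTF-8 lead byte.
latin1_bytes = b"ae = \xe6"

try:
    naive_decode_one(latin1_bytes, len(latin1_bytes) - 1)
    raised = False
except IndexError:
    raised = True
```

With the trailing byte 0xE6, the lookup of the byte after it runs off
the end of the string, which matches the crash described above.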
Discarding, question-marking, or returning the ordinal value of this
trailing character, if it exists, would avoid the IndexError and, with
it, the possibility of an unexpected crash.
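
The question-mark option could look roughly like the sketch below. It
assumes a decoder that checks how many continuation bytes are actually
available before reading them; `safe_decode_one` is a hypothetical name,
not a function in Urwid, and the sketch does not reject overlong
encodings.

```python
def safe_decode_one(text, pos):
    """Decode one character at pos, substituting '?' for malformed input.

    Hypothetical sketch; returns (ordinal, next_pos).
    """
    b1 = text[pos]
    if b1 < 0x80:
        return b1, pos + 1
    if b1 & 0xE0 == 0xC0:
        need, value = 1, b1 & 0x1F
    elif b1 & 0xF0 == 0xE0:
        need, value = 2, b1 & 0x0F
    elif b1 & 0xF8 == 0xF0:
        need, value = 3, b1 & 0x07
    else:
        # Stray continuation byte or invalid lead byte.
        return ord("?"), pos + 1

    # Slicing never raises, so a short tail is detected by its length.
    tail = text[pos + 1 : pos + 1 + need]
    if len(tail) < need or any(b & 0xC0 != 0x80 for b in tail):
        return ord("?"), pos + 1

    for b in tail:
        value = (value << 6) | (b & 0x3F)
    return value, pos + 1 + need
```

Because only the offending byte is consumed, the bytes after a bad lead
byte are still decoded on their own rather than being swallowed.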
Related lines: 88, 98, 111 in urwid/utable.py
Application/testing information:
The application I'm working on is an IRC client framework that uses
Urwid as an optional interface. (It's just a project I started so I
could learn more about Python.)
This problem was discovered when someone typed 'ae = æ'.
('æ' = '\xe6' in Latin-1)
Now, this works fine when sent as UTF-8, which is usually the case,
but this time, it was sent as Latin-1. Of course, other Latin-1
strings work fine, as long as they don't end with a byte that marks
the start of a multi-byte UTF-8 sequence.
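
On the application side, the missing input validation could be sketched
as below, assuming incoming IRC lines arrive as raw bytes and that
Latin-1 is an acceptable fallback when strict UTF-8 decoding fails. The
helper name `to_text` is hypothetical and the example uses Python 3
bytes objects.

```python
def to_text(raw):
    """Decode raw bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        # Strict decoding rejects truncated or malformed sequences.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


from_latin1 = to_text(b"ae = \xe6")
from_utf8 = to_text(b"ae = \xc3\xa6")
```

Both inputs end up as the same text, so the display layer only ever
sees properly decoded strings.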
My program is hardly something that needs to be running 24/7, but I
believed it was stable enough to leave for extended periods of time.
This crash was entirely unexpected, and I wouldn't want anyone else to
be surprised because some user finally entered a string they didn't
believe to be a problem.
-Neil
> It looks like your encoding is set to UTF-8, and you're passing plain
> strings to Urwid to display in the latin-1 encoding. In general, Urwid
> assumes that plain strings are in the system's default encoding, so you
> can't use plain strings in an encoding other than UTF-8 when the
> system's encoding is UTF-8.
>
> If your application is designed to handle strings in the latin-1
> encoding, convert them to unicode strings before displaying them:
>
> eg:
> textwidget = Text( unicode( mystring, "latin-1" ) )
>
> You can force Urwid to disable its UTF-8 processing by calling
> urwid.set_encoding("latin-1") but if your terminal really is in UTF-8
> mode then the characters won't be displayed properly.
>
> If your system's encoding is not UTF-8 then Urwid shouldn't be trying to
> decode your strings with the utable module. What is the output when
> you run the "locale" command on your system?
>
> Ian