Gammon Forum

On UTF-8, misc. codepages and other weird behaviour!

It is now over 60 days since the last post. This thread is closed.


Posted by Worstje   Netherlands  (899 posts)
Date Tue 07 Feb 2017 11:58 AM (UTC)
Message
I figured I'd dredge this old cow out of her well again and see if I can resuscitate the poor beast. As such, I did a couple of tests just to see what would happen... and to see if the proper behaviour can be ascertained and implemented.

All data in this post applies to MUSHclient v5.05.

Introductory image: http://imgur.com/5F2KDlZ
Note: These are the same few output lines, screenshotted at different moments! The difference is in toggling the Output UTF-8 setting.

The commands shown in the image above were executed on ARXmush, a MUSH based on the Evennia codebase. From what I was able to ascertain, it treats input and output as UTF-8, but since no codebase is without flaws, I am approaching this as if both could contain bugs.

Also worth noting is that font choices affect how characters appear. If a character is not available in the current font, it shows up as a block with a question mark. From the image above, it appears as if the Local Echo in the 'UTF-8: ON' case suffers from this. The font in question is Consolas, a quite popular font released by Microsoft several years ago.

Assumptions based on screenshots above:
1) Input control -> Local Echo conversion is bugged in some way. (The character we are looking for CAN be displayed, yet we see a ? block character instead.)
2) Input control -> Sent-Data-Over-Network is of the same encoding as what the game expects.
3) Game supports UTF-8 for input and output, although perhaps it is not implemented correctly.

Of the above, only 1) excludes the game we are connected to as a source of problems. However, we expect the local output to always visually match the input window. While it makes sense that the Local Output needs to be converted to whatever 'language' we speak to the game, other evidence suggests that the game does speak the correct language. Additionally, tests lead me to believe that this setting holds no relevance in terms of processing any kind of input. (Which makes sense.)

Conclusion: input text box codepage should be converted correctly to the output codepage.

In order to make sure things on the wire match our expectations, let's run a few packet traces.

Packet Debug (Output UTF-8: ON)

Sent  packet: 21 (13 bytes) at Tuesday, February 07, 2017, 11:02:21

pose «Test»..      70 6f 73 65 20 ab 54 65 73 74 bb 0d 0a

Incoming packet: 62 (19 bytes) at Tuesday, February 07, 2017, 11:02:22

Iona «Test».[0   49 6f 6e 61 20 c2 ab 54 65 73 74 c2 bb 1b 5b 30
m..                6d 0d 0a

(Yes, there's also encoding issues in the packet debug. But let's ignore that for now.)

Packet Debug (Output UTF-8: OFF)
Sent  packet: 12 (14 bytes) at Tuesday, February 07, 2017, 11:24:07

pose «Tosti»..     70 6f 73 65 20 ab 54 6f 73 74 69 bb 0d 0a

Incoming packet: 7 (20 bytes) at Tuesday, February 07, 2017, 11:24:07

Iona «Tosti».[   49 6f 6e 61 20 c2 ab 54 6f 73 74 69 c2 bb 1b 5b
0m..               30 6d 0d 0a

(It appears the encoding issues here are not relevant to the setting! For the sake of not accidentally confusing the packet debugs, I used 'tosti' as my test string here.)

Posted by Worstje   Netherlands  (899 posts)
Date Reply #1 on Tue 07 Feb 2017 11:59 AM (UTC)

Amended on Tue 07 Feb 2017 12:02 PM (UTC) by Worstje

Message
(Max character limit, sorry.)

The characters used in this example are:
Unicode Character 'LEFT-POINTING DOUBLE ANGLE QUOTATION MARK' (U+00AB)
Unicode Character 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK' (U+00BB)

Looking above, it appears the behaviour is that:
- SENDING outputs 'ab' and 'bb' regardless of the setting. This is indicative of an ISO 8859-1 (Latin-1) compatible codepage.
- RECEIVING outputs 'c2 ab' and 'c2 bb' regardless of the setting. This is indicative of correctly encoded UTF-8.

Many people call the former ASCII, as that is what it is based on. (The differences lie in the extended 128-255 characters range, which is also where the problem with these characters is, so it is hard to say if this is correct or incorrect.)
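
As a concrete check of the byte values above, here is a minimal Lua sketch that encodes the two guillemets both ways. The helper names (latin1_encode, utf8_encode, hex) are made up for illustration and are not MUSHclient functions; it is plain Lua and only handles code points below U+0800.

```lua
-- Hypothetical helpers, plain Lua (no MUSHclient API involved).

-- Latin-1 stores code points U+0000..U+00FF as a single byte of the same value.
local function latin1_encode(cp)
  assert(cp <= 0xFF, "code point not representable in Latin-1")
  return string.char(cp)
end

-- UTF-8, limited to the one- and two-byte forms for this sketch.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    -- two-byte form: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  end
  error("only code points below U+0800 are handled in this sketch")
end

-- Render a byte string as space-separated hex, like the packet debug does.
local function hex(s)
  local out = {}
  for i = 1, #s do
    out[i] = string.format("%02x", s:byte(i))
  end
  return table.concat(out, " ")
end

print(hex(latin1_encode(0xAB)))  --> ab
print(hex(utf8_encode(0xAB)))    --> c2 ab
print(hex(latin1_encode(0xBB)))  --> bb
print(hex(utf8_encode(0xBB)))    --> c2 bb
```

The single bytes match what the client sends in the traces, and the two-byte forms match what the game sends back.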

So this implies my former assumption was incorrect; the game appears to eat up a variant of ASCII and gladly converts it to UTF-8. Now the question becomes... is this some sort of correction because it expected UTF-8 but got something that didn't fit, so it tried ASCII instead? While not super-relevant for MUSHclient, I still want to figure it out for future tests.

Tests with code for great justice:

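-- "pose «Tosti»" plus CR LF, with the guillemets as UTF-8 (c2 ab / c2 bb)
local utf8_bytes = { 0x70, 0x6f, 0x73, 0x65, 0x20, 0xC2, 0xab, 0x54, 0x6f, 0x73, 0x74, 0x69, 0xC2, 0xbb, 0x0d, 0x0a}
-- The same line with the guillemets as single Latin-1 bytes (ab / bb)
local ascii_bytes = { 0x70, 0x6f, 0x73, 0x65, 0x20, 0xab, 0x54, 0x6f, 0x73, 0x74, 0x69, 0xbb, 0x0d, 0x0a}
-- Turn each byte list into a Lua string
local utf8_packet = string.char(unpack(utf8_bytes))
local ascii_packet = string.char(unpack(ascii_bytes))
-- Show both strings in the output window
Note("UTF-8 Encoded string: " .. utf8_packet)
Note("ASCII Encoded string: " .. ascii_packet)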

Results:
* Whilst Output->UTF-8 is ON, the UTF-8 encoded string looks CORRECT. ASCII shows the missing font glyph ?-blocks.
* Whilst Output->UTF-8 is OFF, the ASCII encoded string looks CORRECT. UTF-8 shows the extra Â characters to match the C2 bytes.

Nothing unexpected there; it completely matches up with the description of the Output->UTF-8 setting. But this was merely to frame earlier results in the context of actual codepages. Next comes the test where we send data to the game:
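
The two results can be reproduced without the client by decoding the same bytes under both interpretations. A rough sketch in plain Lua, assuming that 'UTF-8 OFF' behaves like a one-byte-per-character codepage such as Latin-1; the function names are hypothetical and only one- and two-byte UTF-8 sequences are handled:

```lua
-- Hypothetical sketch: list the code points a renderer would see for a byte
-- string, either as UTF-8 or as a one-byte-per-character codepage (Latin-1).
local function codepoints(s, as_utf8)
  local out, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local c = s:byte(i + 1)  -- nil when there is no following byte
    if not as_utf8 or b < 0x80 then
      out[#out + 1] = b                        -- one byte, one character
      i = i + 1
    elseif b >= 0xC2 and b <= 0xDF and c and c >= 0x80 and c <= 0xBF then
      out[#out + 1] = (b - 0xC0) * 0x40 + (c - 0x80)  -- two-byte sequence
      i = i + 2
    else
      out[#out + 1] = 0xFFFD                   -- not decodable in this sketch
      i = i + 1
    end
  end
  return out
end

local function show(cps)
  local t = {}
  for k, cp in ipairs(cps) do
    t[k] = string.format("U+%04X", cp)
  end
  return table.concat(t, " ")
end

print(show(codepoints("\194\171", true)))   --> U+00AB          (the guillemet)
print(show(codepoints("\194\171", false)))  --> U+00C2 U+00AB   (Â followed by the guillemet)
print(show(codepoints("\171", true)))       --> U+FFFD          (lone byte, invalid as UTF-8)
```

A lone 0xAB is a continuation byte with no lead byte, which is consistent with the ?-blocks seen when Output->UTF-8 is ON.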

SendPkt(utf8_packet):
Sent  packet: 15 (16 bytes) at Tuesday, February 07, 2017, 12:07:21

pose «Tosti»..   70 6f 73 65 20 c2 ab 54 6f 73 74 69 c2 bb 0d 0a

Incoming packet: 100 (20 bytes) at Tuesday, February 07, 2017, 12:07:21

Iona «Tosti».[   49 6f 6e 61 20 c2 ab 54 6f 73 74 69 c2 bb 1b 5b
0m..               30 6d 0d 0a


Together with the earlier traces, this suggests that the Evennia codebase indeed tries a UTF-8 interpretation first, and that it gracefully falls back to interpreting the input as an ASCII derivative in case it runs into issues.
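
What such a 'UTF-8 first, fall back otherwise' strategy might look like is sketched below. This is an assumption about the general approach, not Evennia's actual code; guess_encoding and is_valid_utf8 are hypothetical names, and the validity check is structural only (it does not reject overlong forms or surrogates).

```lua
-- Hypothetical sketch of "try UTF-8, otherwise assume a Latin-1 style codepage".
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len
    if b < 0x80 then
      len = 1
    elseif b >= 0xC2 and b <= 0xDF then
      len = 2
    elseif b >= 0xE0 and b <= 0xEF then
      len = 3
    elseif b >= 0xF0 and b <= 0xF4 then
      len = 4
    else
      return false  -- stray continuation byte or invalid lead byte
    end
    if i + len - 1 > n then
      return false  -- sequence is cut off at the end of the string
    end
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if c < 0x80 or c > 0xBF then
        return false  -- expected a continuation byte
      end
    end
    i = i + len
  end
  return true
end

local function guess_encoding(s)
  if is_valid_utf8(s) then
    return "utf-8"
  end
  return "latin-1"
end

-- The two packets sent in the traces above:
print(guess_encoding("pose \194\171Tosti\194\187\r\n"))  --> utf-8
print(guess_encoding("pose \171Tosti\187\r\n"))          --> latin-1
```

Pure ASCII input passes the UTF-8 check as well, so the fallback only matters for bytes in the 128-255 range.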

Conclusions thus far:
* Input text box -> Local Echo is converted from (whatever) into an ASCII-esque encoding, which is dumped into the Output buffer regardless of what the output buffer is supposed to be interpreted as.
* Input text box -> Send-Over-Network is similarly converted from (whatever) into an ASCII-esque encoding regardless of what the output buffer is configured to be interpreted as.
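
If the input control really hands over Latin-1 style bytes, the missing step for the local echo with Output UTF-8 ON would be a conversion along these lines. A minimal sketch in plain Lua; latin1_to_utf8 is a hypothetical helper, and it assumes strict ISO 8859-1 (a Windows codepage such as 1252 differs in the 0x80-0x9F range).

```lua
-- Hypothetical helper: re-encode a Latin-1 byte string as UTF-8.
-- Bytes below 0x80 are left alone; every byte 0x80..0xFF becomes a
-- two-byte UTF-8 sequence for the code point of the same value.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(ch)
    local b = ch:byte()
    return string.char(0xC0 + math.floor(b / 0x40), 0x80 + b % 0x40)
  end))
end

local typed  = "pose \171Tosti\187"   -- what the input box appears to produce
local echoed = latin1_to_utf8(typed)  -- what a UTF-8 output buffer would need

print(#typed, #echoed)  --> 12      14
```

The converted string carries the same c2 ab / c2 bb sequences that the game sends back, so the echo and the game output would render identically.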

Other tests show that the text box at the bottom of MUSHclient currently accepts ASCII input; I was unable to paste in any UTF-8 characters that did not have an ASCII equivalent. (Sometimes they would downgrade by losing ligatures and such.) As such, I'll assume it is not ANSI compatible.

Posted by Worstje   Netherlands  (899 posts)
Date Reply #2 on Tue 07 Feb 2017 12:00 PM (UTC)
Message
My recommendations, which are up for a lot of debate and SHOULD be questioned:

(All of this ignores the possible existence of Telnet Subnegotiation that could help select the best options for a particular game. That would be for later worry and research imho.)

Have a clear setting for both INPUT and OUTPUT character encoding.

For consistency and clarity's sake, change the Display -> UTF-8 setting into a listbox where common encodings can be selected.
This setting would establish several things:

a. The display buffer contains content of this charset. (Same as before.)
b. The content the game sends is interpreted according to this codepage.
c. Thus, ALL displayed content (local echoes, Notes) have to be compatible with this format and converted into it if they weren't that to begin with.

You could maintain the current setting, but IIRC there were issues in the past where people writing in Asian languages had trouble entering certain characters? Besides, since I am suggesting adding an INPUT character encoding, it would be a lot easier for users to understand if the INPUT and OUTPUT encodings had similar UI.

So yes.. another listbox ought to be added to the Input->Commands screen that also lets people choose encodings.
This setting would establish the following:

a. The commands we send to the game are encoded in this particular encoding.
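
To illustrate setting a., here is a sketch of how a per-world input encoding could be applied just before sending. Everything in it is hypothetical: the encoding names, the encoders table and encode_command are not MUSHclient settings or API, and it assumes the command text is held as valid UTF-8 internally, which may not match how the client stores it.

```lua
-- Hypothetical sketch: encode a command according to a configured input encoding.

-- Convert valid UTF-8 to Latin-1; characters above U+00FF become "?".
local function utf8_to_latin1(s, replacement)
  replacement = replacement or "?"
  return (s:gsub("[\194-\244][\128-\191]*", function(seq)
    local lead, cont = seq:byte(1, 2)
    if #seq == 2 and lead <= 0xC3 then
      -- lead bytes C2 and C3 cover exactly U+0080..U+00FF
      return string.char((lead - 0xC0) * 0x40 + (cont - 0x80))
    end
    return replacement
  end))
end

-- One encoder per selectable encoding (names are made up for this example).
local encoders = {
  ["utf-8"]   = function(s) return s end,
  ["latin-1"] = utf8_to_latin1,
}

local function encode_command(text, encoding)
  local enc = encoders[encoding]
  if not enc then
    error("unsupported encoding: " .. tostring(encoding))
  end
  return enc(text) .. "\r\n"
end

local cmd = "pose \194\171Tosti\194\187"
print(#encode_command(cmd, "latin-1"))  --> 14
print(#encode_command(cmd, "utf-8"))    --> 16
```

The two lengths correspond to the 14-byte and 16-byte packets in the traces earlier in the thread.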

Finally, for most capability in terms of actually entering data, the client needs to switch to a Unicode-compatible input control at some point. I am not sure what is involved; is it as simple as switching the creation to a newer version of the edit control, or does it mean switching to something more complicated?

And more importantly, how do we deal with input? Perhaps it is easiest if it accepts direct input in the codepage we have configured above? (Thus giving it a second purpose b.) Or does one just allow all sorts of entry, and leave it to the conversion routines to make the most out of it?

Currently, the behaviour seems closer to the limiting-input variety as the control seems to match the Windows-configured codepage.

The final (confusing) bugbear is in regards to how script functions interpret encodings. I haven't really collected my thoughts on that matter. (For that matter, I think a lot of my suggestions up there could be changed drastically depending on the exact kind of behaviour and user experience MUSHclient wants to offer. One could argue that maybe the output buffer should always be UTF-8 and not be coupled into game output, which would change things big time!)

Final final really final conclusion: the finale

Up to here, I babbled a lot. I also thought a lot. And I know there's a lot of work involved in changing / upgrading any of this. So please don't consider this post as a *whipcrack* GET TO WORK NICK post, as that is not my intention! :-)

Rather, I wanted to do some research and provide a somewhat compact overview of the ways MUSHclient is currently suffering from codepage/charset-encoding related pains so that future endeavours to tackle this issue can do so based on tests and facts. Although some attempts were made to improve things in the past, I found that (after checking changelogs and looking up a discussion thread here and there) the lack of insight into the scope of the issue and the patches causing side-effects or bugs for other users caused such work to be reverted.

I lack the insight into the MUSHclient codebase and associated UI libraries to provide suggestions in terms of code snippets and pull requests. But at least, I can offer you my mind to think along; I'm pretty capable where codepages and charsets are concerned. :-) As such, if I can be of help in tackling this issue, please let me know.

Posted by Fiendish   USA  (2,514 posts)   Global Moderator
Date Reply #3 on Tue 07 Feb 2017 12:36 PM (UTC)

Amended on Tue 07 Feb 2017 12:43 PM (UTC) by Fiendish

Message
I think you're making this overly complex.

It's true that local echo does not properly reflect the recent UTF8 work. That should be fixed.

https://github.com/fiendish/aardwolfclientpackage

Posted by Worstje   Netherlands  (899 posts)
Date Reply #4 on Tue 07 Feb 2017 02:06 PM (UTC)
Message
I probably am. I am not opposed to a simple solution.

However, did Nick not make an attempt to fix the behaviour a couple of patches ago, which then got reverted? Unless I misunderstood what he did from the changelog, patching the input routine is what happened back then, and apparently it caused people problems, causing it to be reverted the next version.

For whatever reason, the simple solution was not good enough. Thus my digging through all the aspects of the behaviour to try and figure out what is going on at the parts of MUSHclient dealing with encodings.

As a bonus, I did want to see what can be done to make things more flexible for the future; MUSHclient is rather old, and moving away from reliance on the codepage defined as a part of the user's locale is probably not a bad thing, as it is a mechanic that harks back all the way to DOS and the interpretation of the higher ASCII characters. Offering a Unicode-capable entry widget isn't really a luxury anymore anno 2017 imho. But that isn't a part of this very specific problem, agreed.

Posted by Nick Gammon   Australia  (22,975 posts)   Forum Administrator
Date Reply #5 on Wed 08 Feb 2017 09:20 AM (UTC)
Message
My brief response so far (there are other issues here right now) is that the client was originally not Unicode-aware (hey, I wasn't Unicode-aware either in 1995 when I first wrote it!).

It is compiled as non-Unicode which has all sorts of implications for things like edit controls (which the command window is). Attempts to convert it involved such a massive amount of work that I gave up trying.

As a partial fix, checking the UTF-8 box in the Output configuration makes the output drawing routines interpret the sequence of bytes as UTF-8. For this to fully work you also need to be using a Unicode font.

Also, miniwindows have an option to draw text in UTF-8 format.

The command (input) window is troubling, and I'm surprised, frankly, that people get it to work at all in Chinese and similar languages.

Other recent threads have convinced me that with UTF-8 off but a suitable code page selected, the output routines can handle multiple-byte sequences (from the MUD) and render them correctly. Presumably, that also applies to the input (command) window.

I think Windows is doing some kludging in the background. I suspect that they make edit windows "work" even for non-Unicode apps (for legacy support) in ways I don't fully comprehend.

I'll review your suggestions in greater detail tomorrow. I suspect there will be things that won't work as nicely as you or I might hope they will.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Worstje   Netherlands  (899 posts)
Date Reply #6 on Wed 08 Feb 2017 09:28 AM (UTC)
Message
I'll try looking into exactly what that reverted patch you tried at one point entailed and what the complaints regarding it were.

Maybe I can give a suggestion on how to improve it to at least fix the obvious discrepancies I was seeing whilst testing, even if it doesn't make the client more suitably Unicode aware.

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.
