(Max character limit, sorry.)
The characters used in this example are:
Unicode Character 'LEFT-POINTING DOUBLE ANGLE QUOTATION MARK' (U+00AB)
Unicode Character 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK' (U+00BB)
Looking above, it appears the behaviour is that:
- SENDING outputs 'ab' and 'bb' regardless of the setting. This is indicative of: ISO 8859-1 (Latin-1) compatible codepage
- RECEIVING output 'c2 ab' and 'c2 bb' regardless of the setting. This is indicative of: correctly encoded UTF-8
Many people call the former ASCII, as that is what it is based on. (The differences lie in the extended 128-255 characters range, which is also where the problem with these characters is, so it is hard to say if this is correct or incorrect.)
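To make the byte values above concrete, here is a small sketch in plain Lua 5.1 (no MUSHclient functions needed) that encodes the two code points both ways. The helpers utf8_encode and hex are names I made up for this post, and utf8_encode deliberately only covers code points below U+0800, which is all this test needs.

```lua
-- Hypothetical helper: encode a code point below U+0800 as UTF-8.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)                 -- plain ASCII, one byte
  end
  assert(cp < 0x800, "this sketch only handles 1- and 2-byte sequences")
  return string.char(0xC0 + math.floor(cp / 64), -- lead byte: 110xxxxx
                     0x80 + cp % 64)              -- continuation: 10xxxxxx
end

-- Hypothetical helper: render a string as space-separated hex bytes.
local function hex(s)
  return (s:gsub(".", function(c)
    return string.format("%02x ", c:byte())
  end):gsub(" $", ""))
end

for _, cp in ipairs({ 0xAB, 0xBB }) do
  -- In Latin-1 the byte value is simply the code point itself.
  print(string.format("U+%04X  Latin-1: %s  UTF-8: %s",
                      cp, hex(string.char(cp)), hex(utf8_encode(cp))))
end
-- U+00AB  Latin-1: ab  UTF-8: c2 ab
-- U+00BB  Latin-1: bb  UTF-8: c2 bb
```

Inside MUSHclient you would probably swap print for Note, but the byte values are the same either way.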
So this implies my former assumption was incorrect; the game appears to eat up a variety of ASCII and gladly converts it to UTF-8. Now the question becomes... is this some sort of correction because it expected UTF-8 but got something that didn't fit, so it tried ASCII instead? While not super-relevant for MUSHclient, I still want to figure it out for future tests.
Tests with code for great justice:
local utf8_bytes = { 0x70, 0x6f, 0x73, 0x65, 0x20, 0xC2, 0xab, 0x54, 0x6f, 0x73, 0x74, 0x69, 0xC2, 0xbb, 0x0d, 0x0a}
local ascii_bytes = { 0x70, 0x6f, 0x73, 0x65, 0x20, 0xab, 0x54, 0x6f, 0x73, 0x74, 0x69, 0xbb, 0x0d, 0x0a}
local utf8_packet = string.char(unpack(utf8_bytes))
local ascii_packet = string.char(unpack(ascii_bytes))
Note("UTF-8 Encoded string: " .. utf8_packet)
Note("ASCII Encoded string: " .. ascii_packet)
Results:
* Whilst Output->UTF-8 is ON, the UTF-8 encoded string looks CORRECT. ASCII shows the missing font glyph ?-blocks.
* Whilst Output->UTF-8 is OFF, the ASCII encoded string looks CORRECT. UTF-8 shows the extra Â characters to match the C2 bytes.
Nothing unexpected there; it completely matches up with the description of the Output->UTF-8 setting. But this was merely to frame earlier results in the context of actual codepages. Next comes the test where we send data to the game:
SendPkt(utf8_packet):
Sent packet: 15 (16 bytes) at dinsdag, februari 07, 2017, 12:07:21
pose «Tosti».. 70 6f 73 65 20 c2 ab 54 6f 73 74 69 c2 bb 0d 0a
Incoming packet: 100 (20 bytes) at dinsdag, februari 07, 2017, 12:07:21
Iona «Tosti».[ 49 6f 6e 61 20 c2 ab 54 6f 73 74 69 c2 bb 1b 5b
0m.. 30 6d 0d 0a
Combined with the earlier results (where the Latin-1 style bytes I sent came back as UTF-8), this shows that the Evennia codebase indeed tries a UTF-8 interpretation first, and that it gracefully falls back to interpreting as an ASCII-derivative in case it runs into issues.
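For anyone wondering what "try UTF-8 first, fall back to an ASCII-derivative" could look like, here is a rough sketch in Lua 5.1. To be clear: this is NOT Evennia's code (Evennia is written in Python and I have not read its source for this); it only models the behaviour I observed. The function names are my own, the fallback is assumed to be Latin-1, and the validity check is simplified (it checks lead and continuation bytes but does not reject every overlong or surrogate sequence).

```lua
-- Simplified structural UTF-8 check (assumption: good enough for this test).
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local extra
    if b < 0x80 then extra = 0                       -- ASCII
    elseif b >= 0xC2 and b <= 0xDF then extra = 1    -- 2-byte sequence
    elseif b >= 0xE0 and b <= 0xEF then extra = 2    -- 3-byte sequence
    elseif b >= 0xF0 and b <= 0xF4 then extra = 3    -- 4-byte sequence
    else return false end                            -- stray or invalid lead byte
    for j = i + 1, i + extra do
      local c = s:byte(j)                            -- nil if the string ends early
      if not c or c < 0x80 or c > 0xBF then return false end
    end
    i = i + extra + 1
  end
  return true
end

-- Re-encode a Latin-1 byte string as UTF-8.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
  end))
end

-- Hypothetical model of the server: keep valid UTF-8, otherwise assume Latin-1.
local function normalise_input(s)
  if is_valid_utf8(s) then
    return s, "utf-8"
  end
  return latin1_to_utf8(s), "latin-1"
end

local utf8_packet  = "pose \194\171Tosti\194\187\r\n"  -- c2 ab ... c2 bb
local ascii_packet = "pose \171Tosti\187\r\n"          -- ab ... bb

print(select(2, normalise_input(utf8_packet)))   -- utf-8
print(select(2, normalise_input(ascii_packet)))  -- latin-1
print(normalise_input(ascii_packet) == utf8_packet)  -- true
```

Under this model both packets end up as the same UTF-8 bytes, which matches what came back from the game in both tests.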
Conclusions thus far:
* Input text box -> Local Echo is converted from (whatever) into ASCII-esque encoding, which is dumped into the Output buffer regardless of what the output buffer is supposed to be interpreted as.
* Input text box -> Send-Over-Network is similarly converted from (whatever) into ASCII-esque encoding regardless of what the output buffer is supposed to be interpreted as.
Other tests show that the text box at the bottom of MUSHclient currently accepts only ASCII-esque input; I was unable to paste in any UTF-8 characters that did not have an ASCII-esque equivalent. (Sometimes they would be downgraded by losing ligatures and such.) As such, I'll assume it is not ANSI compatible.