TinyMUX 2.7 with UTF-8 and MUSHClient

It is now over 60 days since the last post. This thread is closed.

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Fri 09 Mar 2007 04:14 AM (UTC)
Message
With the UTF-8 check-box selected, MUSHClient seems to handle two-byte UTF-8 sequences, but it doesn't handle three- and four-byte sequences properly. For example, code point 8364 is a three-byte sequence.

Repro steps:
TinyMUX 2.7.0.3
> think chr(8364) (euro symbol)
€ (euro symbol)

Another alpha drop of TinyMUX 2.7 will probably happen next weekend, but we're setting up a testbed for clients.

It will probably be another year before UTF-8 is fully baked into TinyMUX, but the networking layer changes are done. Client testing will probably expose bugs on both sides.


Brazil

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Fri 09 Mar 2007 04:27 AM (UTC)
Message
I can't reproduce that.

Use Ctrl+Shift+F12 to open the "Debug Simulated World Input" dialog box.

According to my reckoning the code for 8364 in UTF-8 is hex E282AC. So, you can enter this:


test: \e2\82\ac done


This correctly displays the Euro symbol, providing you are using a font that supports Unicode. I chose Lucida Sans Unicode, and saw the symbol OK.
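As a cross-check outside the client, the same bytes can be obtained from any UTF-8 encoder. The sketch below uses Python's standard library (Python is not used anywhere in this thread; it is only a convenient way to verify the numbers) and also prints the backslash-escaped form typed into the debug dialog above.

```python
# Encode code point 8364 (the euro sign) as UTF-8 and show the bytes in hex.
euro = chr(8364)
encoded = euro.encode("utf-8")
print(encoded.hex())

# Rebuild the escaped form used in the "Debug Simulated World Input" example.
debug_form = "".join("\\%02x" % b for b in encoded)
print("test: " + debug_form + " done")
```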

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #2 on Fri 09 Mar 2007 04:45 AM (UTC)
Message
I'm using MUSHClient 3.73 which isn't quite the latest. Lucida Console Font, UTF-8 selected. Here's the debug packets:

Sent packet: 10 (17 bytes)

think chr(8364).   74 68 69 6e 6b 20 63 68 72 28 38 33 36 34 29 0d
.                  0a

Incoming packet: 4 (5 bytes)

.....              e2 82 ac 0d 0a

think chr(8364)


The issue was the font. FixedSys doesn't work. Thanks for the quick response.

However, here's a catch. I noticed that MUSHClient isn't doing charset negotiation. We were planning on looking at the TTERM and defaulting to a character set, then letting CHARSET override that, but for TTERM, we're seeing 'mushclient', which isn't enough information to determine whether the client has the UTF-8 check box selected.

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #3 on Fri 09 Mar 2007 04:46 AM (UTC)

Amended on Fri 09 Mar 2007 04:51 AM (UTC) by Nick Gammon

Message
Just to confirm those hex numbers, I looked up conversion to UTF-8, and found these operations are recommended for Unicode characters in the range 2048 to 65535:


If ud >=2048 and <=65535 (FFFF hex) then UTF-8 is 3 bytes long.
   byte 1 = 224 + (ud div 4096)
   byte 2 = 128 + ((ud div 64) mod 64)
   byte 3 = 128 + (ud mod 64)


Thus, this small bit of Lua code entered into MUSHclient confirms my conversion:


ud = 8364

byte1 = 224 + math.floor (ud / 4096)
byte2 = 128 + math.floor (ud / 64) % 64
byte3 = 128 + ud % 64

print (bit.tostring (byte1, 16))  --> E2
print (bit.tostring (byte2, 16))  --> 82
print (bit.tostring (byte3, 16))  --> AC


A quote from another web site makes it clearer:


The binary representation of the character's integer value is thus simply spread across the bytes and the number of high bits set in the lead byte announces the number of bytes in the multibyte sequence:


 bytes | bits | representation
     1 |    7 | 0vvvvvvv
     2 |   11 | 110vvvvv 10vvvvvv
     3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
     4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv


The binary sequence 11100000 is 224 in decimal, and the sequence 10000000 is 128, which is why those numbers are being added.
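The table generalises to all four lengths. The following is a minimal sketch in Python, not code from MUSHclient or TinyMUX; the function name `utf8_encode` is made up for illustration. It uses bit shifts and masks, which are equivalent to the div/mod arithmetic above (for example, `ud >> 12` is `ud div 4096`).

```python
def utf8_encode(ud: int) -> bytes:
    """Encode one Unicode code point using the lead/continuation layout."""
    # Unicode stops at 0x10FFFF and excludes the surrogate range.
    if ud < 0 or ud > 0x10FFFF or 0xD800 <= ud <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % ud)
    if ud < 0x80:        # 1 byte:  0vvvvvvv
        return bytes([ud])
    if ud < 0x800:       # 2 bytes: 110vvvvv 10vvvvvv
        return bytes([0xC0 | (ud >> 6),
                      0x80 | (ud & 0x3F)])
    if ud < 0x10000:     # 3 bytes: 1110vvvv 10vvvvvv 10vvvvvv
        return bytes([0xE0 | (ud >> 12),
                      0x80 | ((ud >> 6) & 0x3F),
                      0x80 | (ud & 0x3F)])
    # 4 bytes: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    return bytes([0xF0 | (ud >> 18),
                  0x80 | ((ud >> 12) & 0x3F),
                  0x80 | ((ud >> 6) & 0x3F),
                  0x80 | (ud & 0x3F)])


print(utf8_encode(8364).hex())
```

Comparing the result against the language's built-in encoder for a few code points of each length is an easy way to check a hand-written version like this.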


- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #4 on Fri 09 Mar 2007 04:50 AM (UTC)
Message
Quote:

FixedSys doesn't work.


Not surprised - it isn't a Unicode font.

Quote:

I noticed that MUSHClient isn't doing charset negotiation. We were planning on looking at the TTERM and defaulting to a character set, then letting CHARSET override that, but for TTERM, we're seeing 'mushclient' ...


What will you do if the client doesn't support Unicode?

I would assume that "mushclient" supports Unicode, and have some message at the start to the effect that you need to check the "UTF-8" box and select a Unicode font.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #5 on Fri 09 Mar 2007 04:52 AM (UTC)
Message
Easy there big guy. Your numbers are right and agree with the numbers the server is producing. I just happened to pick a font without that character.

The charset thing is still an interesting question though. How is the UTF-8 checkbox communicated across telnet for games that support UTF-8? On MUDs, do players manually enable it on both sides, or is there some telnet negotiation method that we need to look at?

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #6 on Fri 09 Mar 2007 05:01 AM (UTC)
Message
The database will be UTF-8. Clients can be ASCII-only, ISO 8859-1, UTF-8, or perhaps other things later, but let's stick with those for now.

Anything coming in is converted to UTF-8 according to what the client has negotiated. So, for ISO 8859-1, all the upper 128 characters are converted to their multi-byte UTF-8 counterpart.

Anything going out is down converted with things that can't be represented turned into a replacement character (i.e., '?').
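To make the two directions concrete, here is a rough sketch in Python. It is not TinyMUX code; the function names `to_database` and `to_client` are invented for illustration, and the charset names are Python codec names ("ascii", "iso-8859-1", "utf-8").

```python
def to_database(raw: bytes, client_charset: str) -> bytes:
    """Convert bytes received from a client into UTF-8 for storage."""
    return raw.decode(client_charset, errors="replace").encode("utf-8")


def to_client(stored: bytes, client_charset: str) -> bytes:
    """Down-convert stored UTF-8; unrepresentable characters become '?'."""
    return stored.decode("utf-8").encode(client_charset, errors="replace")


# Incoming: an ISO 8859-1 client sends e-acute as the single byte 0xE9.
print(to_database(b"caf\xe9", "iso-8859-1"))

# Outgoing: the euro sign exists in neither ASCII nor ISO 8859-1.
stored = "\u20ac5 caf\u00e9".encode("utf-8")
print(to_client(stored, "ascii"))
print(to_client(stored, "iso-8859-1"))
```

Note that each character that can't be represented turns into exactly one '?', however many bytes it occupied in UTF-8.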


Brazil

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #7 on Fri 09 Mar 2007 05:08 AM (UTC)
Message
Quote:

How is the UTF-8 checkbox communicated across telnet for games that support UTF-8?


I honestly don't know. This subject was covered a bit in the past (2002):

http://www.gammon.com.au/forum/bbshowpost.php?id=1777

One of the posters there mentioned the Charset negotiation (RFC 2066) with the rider that "I've not seen implemented anywhere".

I'm not sure what your fallback position is, if the client is not going to display Unicode. I suggest some instructions at the start, like "to get proper display, ensure option XYZ is set for client ABC", with an appropriate list of clients and the relevant options.

Even if I add the negotiation to the next version of MUSHclient, you will still have the problem of what to do with people with earlier versions, plus how other clients handle this issue, if at all.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #8 on Fri 09 Mar 2007 05:13 AM (UTC)
Message
Quote:

Anything going out is down converted with things that can't be represented turned into a replacement character (i.e., '?').


I don't totally see how that will work. Assuming a feature of the MUX is that you support foreign languages, it isn't really going to help if an entire sentence in (say) Japanese, is rendered as ??? ????? ?????.

I think that specifying clients that support UTF-8, plus instructions on how to configure them correctly, will have to be part of your setup instructions. For example, as people create new characters, ask if they can read a certain line of text, and if not, fiddle with their client until they can.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #9 on Fri 09 Mar 2007 05:41 AM (UTC)

Amended on Fri 09 Mar 2007 05:50 AM (UTC) by Brazil

Message
With the right locale, xterm will display UTF-8 correctly, but it declines to negotiate à la RFC 2066. Or, at least, my copy with my configuration is declining. That behavior is the same as MUSHClient's.

Perl Net::Telnet::Options supports RFC 2066.

Atlantis (a Mac client) is doing RFC 2066.

I'd argue that it's the right way. It's a clear indication from the client, and since the telnet protocol is involved, if one side doesn't support the negotiation of that option, that's also a clear indication.
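For reference, the RFC 2066 exchange on the wire is small. What follows is only a sketch in Python of building a REQUEST and reading the reply, not code from TinyMUX, Atlantis, or MUSHclient. The option and subcommand numbers are from the RFC; the function names, the ';' separator, and the charset list are illustrative choices.

```python
IAC, SE, SB, WILL, DO = 255, 240, 250, 251, 253
CHARSET = 42                          # telnet option number from RFC 2066
REQUEST, ACCEPTED, REJECTED = 1, 2, 3

# Option negotiation comes first: one side offers, the other agrees.
WILL_CHARSET = bytes([IAC, WILL, CHARSET])
DO_CHARSET = bytes([IAC, DO, CHARSET])


def build_request(charsets, sep=b";"):
    """IAC SB CHARSET REQUEST <sep>name<sep>name... IAC SE"""
    body = b"".join(sep + name.encode("ascii") for name in charsets)
    return bytes([IAC, SB, CHARSET, REQUEST]) + body + bytes([IAC, SE])


def parse_reply(packet):
    """Return the accepted charset name, or None if the peer rejected."""
    head, tail = bytes([IAC, SB, CHARSET]), bytes([IAC, SE])
    if not (packet.startswith(head) and packet.endswith(tail)):
        raise ValueError("not a CHARSET subnegotiation")
    body = packet[len(head):-len(tail)]
    if body[:1] == bytes([ACCEPTED]):
        return body[1:].decode("ascii")
    if body[:1] == bytes([REJECTED]):
        return None
    raise ValueError("unexpected CHARSET subcommand")


print(build_request(["UTF-8", "ISO-8859-1"]))
print(parse_reply(bytes([IAC, SB, CHARSET, ACCEPTED]) + b"UTF-8" + bytes([IAC, SE])))
```

A peer that doesn't know the option at all answers the initial WILL or DO with DONT or WONT, which is the "clear indication" case mentioned above.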

The open issue is then what to do about prior versions. On the server side, we could add a way to 'force' a certain character set on a per-player or per-port basis. ANSI and STRIPACCENTS provide some precedent for this, but STRIPACCENTS is imperfect, and ANSI is a matter of taste now rather than an indication of support in the client.

I can see a player using different versions of different clients from different locations (one from home, and another from work), so the automatic negotiation seems preferred.

As for substituting a replacement character, the ??? ??? seems much more polite than the alternative, which is beeping and line noise. If a game contains Japanese, and a player does not have any Japanese-worthy fonts installed, the replacement characters are the indication that he needs to install some if he intends to see those characters.

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #10 on Fri 09 Mar 2007 02:45 PM (UTC)
Message
Sparks added telnet NEW-ENVIRON negotiation and a UNICODE softcode flag to force UTF-8 on the server side. Without the flag, it relies on what it can negotiate through telnet, so there is still value in doing any appropriate negotiations, but with UNICODE, there's an answer for versions that don't.

Posted by Sparks   (7 posts)  [Biography] bio
Date Reply #11 on Sat 10 Mar 2007 08:55 AM (UTC)
Message
FWIW, I've set up a public testbed server for 2.7; if you'd like to test with it, just toss me a note (or ask Brazil), and you can get set up on there.

Meanwhile, over on my Windows box, I downloaded MUSHclient and set it UTF8 and changed the font, and I was able to get linedraw and Japanese on the testbed when I forced myself to be UTF8 server-side. It'd still be nice to autonegotiate the UTF8 support if the user has the UTF8 box checked, however. ;)

I couldn't get some of the other codepages, but I think that's simply because I don't have as many codepages (or fonts supporting them) installed on Windows by default as I do over on Mac OS X.

However, I /did/ uncover one (vaguely) related bug, apparently. Changing the font in MUSHclient while connected (at least from FixedSys to Lucida Console in particular) generates a really spurious set of NAWS values. Specifically, MUSHclient informs the server that its screen dimensions have changed from 80x26 to 786x49(!!).

Disconnecting and reconnecting causes MUSHclient to properly send the new 80x56 NAWS value for Lucida Console, so it's not a HUGE issue, but it seems to be reproducible if I change fonts while connected. Probably not terribly hard to fix. :)

Rachel 'Sparks' Blackman

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #12 on Sat 10 Mar 2007 08:19 PM (UTC)
Message
It seems there was a bug with dynamic resizing, where it was returning a pixel count rather than a character count. I hope that is fixed, and will be released in the next version.
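To illustrate the difference, here is a small sketch in Python of what a correct NAWS report looks like. This is not MUSHclient's actual code; the function names and the pixel and character-cell sizes are invented for the example. The packet layout (option 31, two 16-bit big-endian values, with any 255 byte doubled) follows RFC 1073.

```python
IAC, SB, SE, NAWS = 255, 250, 240, 31   # NAWS is telnet option 31 (RFC 1073)


def window_in_chars(pixel_w, pixel_h, char_w, char_h):
    """Convert a window size in pixels to whole character cells."""
    return pixel_w // char_w, pixel_h // char_h


def naws_packet(cols, rows):
    """IAC SB NAWS <width hi> <width lo> <height hi> <height lo> IAC SE"""
    payload = cols.to_bytes(2, "big") + rows.to_bytes(2, "big")
    payload = payload.replace(b"\xff", b"\xff\xff")   # a literal 255 is doubled
    return bytes([IAC, SB, NAWS]) + payload + bytes([IAC, SE])


# Hypothetical numbers: a 640x312 pixel window with an 8x12 pixel font.
cols, rows = window_in_chars(640, 312, 8, 12)
print(cols, rows)
print(naws_packet(cols, rows).hex())
```

Sending the pixel figures straight into `naws_packet` would produce the kind of oversized values reported above; the division by the character cell size is the step that matters.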

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #13 on Sat 10 Mar 2007 08:39 PM (UTC)
Message
Quote:

It'd still be nice to autonegotiate the UTF8 support if the user has the UTF8 box checked, however. ;)


You need the UTF-8 box checked, and a font that supports Unicode - there is probably a long list. Plus, what happens if the player changes fonts halfway through the session - or unchecks UTF-8?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Brazil   USA  (10 posts)  [Biography] bio
Date Reply #14 on Sat 10 Mar 2007 09:09 PM (UTC)

Amended on Sat 10 Mar 2007 09:11 PM (UTC) by Brazil

Message
As characters are received, the server converts them to UTF-8 as necessary. Backspace/erase backs up over one UTF-8 character in the buffer regardless of whether the character from the client was ASCII or Latin1.
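Backing up over one UTF-8 character can be sketched as follows. This is an illustration in Python rather than the server's actual code, and `erase_last_char` is a made-up name. It relies on continuation bytes always having the form 10vvvvvv, so the code skips those and then removes the lead byte.

```python
def erase_last_char(buf: bytearray) -> None:
    """Remove the final UTF-8 character (1 to 4 bytes) from buf in place."""
    # Drop trailing continuation bytes (10vvvvvv).
    while buf and (buf[-1] & 0xC0) == 0x80:
        buf.pop()
    # Drop the lead byte, or the single byte of an ASCII character.
    if buf:
        buf.pop()


line = bytearray("a\u20ac".encode("utf-8"))   # 'a' followed by the euro sign
erase_last_char(line)                          # removes all three euro bytes
print(bytes(line))
```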

The interesting part is the interaction between the telnet charset negotiation and the flow of text. RFC 2066 (http://www.faqs.org/rfcs/rfc2066.html) makes it sound as if the side attempting to negotiate charset intentionally holds back from sending output until it achieves agreement. As soon as the other side agrees, it can believe the characters it receives after the IAC-sequence are in the requested charset. Also at that point, it knows how to encode things going out.

When a client connects, the welcome screen is being sent, but charset is being negotiated at the same time. There is certainly some bug potential around this point, and I don't think the server does this right, yet. As long as the welcome screen is in ASCII, there is no problem, but I suspect as soon as people start building games, it will be reported as a bug.

On the other hand, if you use the UTF-8 checkbox and a UNICODE flag on the player, you have a worse problem. At the welcome screen, the server doesn't know which player is trying to connect because they haven't logged in yet. So, the welcome screen definitely needs to only use ASCII characters. With a proper implementation of charset (complete with delaying the flow in the other direction), the welcome screen can be in UTF-8, and the server will down-convert it for clients which don't support UTF-8.
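The "delay the flow" idea could look roughly like this. It is a sketch in Python under simplifying assumptions, not the TinyMUX implementation: the `Connection` class and its method names are hypothetical, a list stands in for the socket, and a real server would also want a timeout so a silent client isn't left waiting forever.

```python
class Connection:
    def __init__(self):
        self.charset = None   # unknown until negotiation finishes
        self.pending = []     # text held back, kept as str
        self.sent = []        # stands in for the socket

    def write(self, text):
        """Queue text until the charset is known, then encode and send."""
        if self.charset is None:
            self.pending.append(text)
        else:
            self.sent.append(text.encode(self.charset, errors="replace"))

    def negotiation_done(self, charset):
        """Call with the agreed charset, or a fallback such as 'ascii'."""
        self.charset = charset
        held, self.pending = self.pending, []
        for text in held:
            self.write(text)


conn = Connection()
conn.write("Welcome! Entry costs \u20ac5\r\n")   # held back, nothing sent yet
conn.negotiation_done("ascii")                    # client refused CHARSET
print(conn.sent)
```

With this arrangement the welcome screen can be stored in UTF-8 and still reaches an ASCII-only client in a readable, down-converted form.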

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

This is page 1 of a thread that is 3 pages long.


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.
