
Gammon Forum


Full Unicode support

It is now over 60 days since the last post. This thread is closed.



Posted by daniel perry   USA  (4 posts)  Bio
Date Sun 11 May 2003 07:13 PM (UTC)
Message
I just added Unicode functionality to the server that our MOO runs on, and I was wondering if there was any way that the full unicode set could be made to work with the MUSHclient input line. There are quite a few characters (???????????????????) that show up as question marks.

-daniel perry (aka entreido, tobias, turi, sean, anything you can think of ;) )

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #1 on Sun 11 May 2003 10:38 PM (UTC)
Message
Hmmm, I was wondering when someone would ask that. :)

I have been thinking about Unicode for a while, however it isn't as simple as making it work "with the MUSHclient input line".

Here are some of the problems:


  1. It is not just a case of changing the input line, since input can be echoed in the output window, the output window would need to support Unicode as well.
  2. Plus, the whole point presumably of entering Unicode is to send it to the MUD and get it back again, so the output window would definitely need to display Unicode.
  3. For this to work, the send/receive routines would need to send/receive Unicode rather than single-byte characters
  4. All sorts of internal things (eg. command history, saved output buffer) would need to store text as Unicode
  5. Comparisons (eg. for searching) would need to be Unicode-aware
  6. Regular expression matching (eg. for triggers, aliases) would need to support Unicode
  7. The XML parser which reads in world files, and the writer which writes them out, would need to support Unicode (in case you had a Unicode string in a trigger, for instance)
  8. It would need to still support non-Unicode MUDs
  9. It would need to know whether or not the MUD was Unicode somehow
  10. There are about 6800 internal strings in MUSHclient (eg. "you cannot connect to the world") some, but not all, of those would need to be converted to Unicode, if MUSHclient became Unicode-aware


My preliminary research indicates that probably the simplest thing would be to expect text in UTF-8 format, which would at least support existing MUDs that only use 7-bit character encoding. However, it would need fiddling if a MUD used 8-bit encoding, but not Unicode.

I am curious to know how far you have got with this server project, can you tell me ...


  1. Do you in fact send 2 bytes for each character to the client, and expect 2 bytes back?; or
  2. Do you use UTF-8 encoding?
  3. What client are you using for testing? Is it an existing MUD client? What is its name?
  4. Do you propose to support 8-bit (non-Unicode) clients as well?
  5. If so, in what way will you tell if the client is 8-bit or Unicode-aware? Also, how will you handle Unicode text being sent to a non-Unicode client?


- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Shadowfyr   USA  (1,788 posts)  Bio
Date Reply #2 on Mon 12 May 2003 03:12 AM (UTC)
Message
One would hope that unicode based muds would encode it in special sequences and allow the client to translate it, if needed. For the input/output windows, I figured that a 'major' redesign would be needed and that since it didn't appear to already exist, that no one had yet added something like a <unicode>--string--</unicode> tag to anything that a client would be expected to use. Full unicode only works if you 'know' that the server is going to use it, otherwise it would attempt to translate normal single byte sequences into unicode letters. An effect not unlike switching an old DOS client into 7 bit encoding, when the BBS was using 8-N-1 (8 data bits, no parity, 1 stop bit).

I briefly considered bringing up this issue myself, though mostly as a question about the infobar, since some fonts, including Lucida Console, have embedded unicode for special characters, including the blocks used in gauges, which currently require the use of a separate font to produce. Not to mention any special characters in dingbats and other such fonts that are 'missing' because they only exist in the unicode sections.

Posted by daniel perry   USA  (4 posts)  Bio
Date Reply #3 on Mon 12 May 2003 03:32 AM (UTC)
Message
I am actually not the one that is writing the code, it is a freely available patch for the LambdaMOO server code, written by Lao Tzu, so I do not know the exact details of it :/

The link to his software site is: http://stompstompstomp.com/software/#unicode_moo

-daniel perry (aka entreido, tobias, turi, sean, anything you can think of ;) )

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #4 on Mon 12 May 2003 04:58 AM (UTC)
Message
Hmm - the UTF-8 route eh?

I have been doing a bit more research, and it mightn't be quite as bad as I thought, for example the regexp routine (PCRE) has an update that supports UTF-8.

If I understand the spec correctly (RFC 2279) the encoding for UTF-8 means that:


  1. The first 128 characters in the ASCII set encode to themselves, basically supporting all existing messages that MUSHclient is likely to use (eg. error messages, XML sequences and so on).
  2. The encoding for the remaining characters gets progressively longer depending on what you are trying to encode (up to a maximum of 6 bytes), however every byte of a multi-byte sequence is guaranteed to have its high-order bit set, so none of them can be zero. In other words, things like strlen will continue to work (although they count bytes, not characters).


What this means is that the client could probably pass UTF-8 strings through it without really realising it, excepting that they wouldn't be displayed properly, of course.
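To illustrate the point about strlen, here is a minimal sketch in plain C. The function name utf8_char_count is hypothetical (it is not part of MUSHclient), and the code assumes the input is already valid UTF-8. It counts characters by skipping continuation bytes, whereas strlen keeps working but reports bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in a
   NUL-terminated UTF-8 string. Assumes the string is valid UTF-8.
   Continuation bytes have the form 10xxxxxx and are not counted. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;    /* ASCII byte or first byte of a sequence */
    }
    return count;
}
```

For the two Greek letters encoded as the bytes CE A4 CE B7, strlen reports 4 while utf8_char_count reports 2.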

Thus, the minimal change necessary to make Unicode work might be ...


  1. Have a "MUD uses UTF-8" flag which would distinguish between MUDs using Unicode and others that simply use characters in the range 0x80 to 0xFF
  2. Leave most of the client alone (eg. just store the UTF-8 in the command history, and output buffer in the usual way)
  3. For Unicode MUDs, switch to the UTF-8 version of the regular expression parser
  4. Where necessary (eg. doing Finds) decompose UTF-8 into Unicode for comparison purposes
  5. Change the screen output routine to display UTF-8 as Unicode where required
  6. Change the command window to handle Unicode where required, and encode into UTF-8.
  7. Look for places that might display Unicode (eg. you are about to replace "X" with "Y" in the command window) and handle it appropriately.
  8. Handle UTF-8 in the XML parser.
  9. Look for places where bytes will no longer equal characters (eg. wrapping output at column 80) and fix them appropriately
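As a rough illustration of item 9 (bytes no longer equalling characters), the sketch below finds the byte offset at which a line could be broken after a given number of characters. The name utf8_offset_for_chars is an assumption made for this example, the input is assumed to be valid UTF-8, and one code point is treated as one column, which is a simplification (combining marks and double-width characters would need more work).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte offset just past the first
   max_chars characters of a NUL-terminated UTF-8 string (or the
   string length in bytes if it is shorter than that).
   Assumes valid UTF-8; treats one code point as one column. */
size_t utf8_offset_for_chars(const char *s, size_t max_chars)
{
    size_t i = 0;
    size_t chars = 0;

    assert(s != NULL);

    while (s[i] != '\0') {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            /* this byte starts a new character */
            if (chars == max_chars)
                break;          /* never split inside a sequence */
            ++chars;
        }
        ++i;
    }
    return i;
}
```

Wrapping at "column 80" would then mean breaking at utf8_offset_for_chars (line, 80) rather than at byte 80.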


It is a reasonably big job, but interestingly, I think you might find that many MUDs would also handle UTF-8 (since it would just look like a character string to them), however I cannot test that right now because I can't find a terminal program that will actually let me send UTF-8.



For the RFC, see: http://www.faqs.org/rfcs/rfc2279.html

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #5 on Mon 12 May 2003 05:49 AM (UTC)
Message
However I see what you mean about using the command window. Even pasting in Unicode shows as ??? even before MUSHclient "gets at it" so-to-speak.

Clearly there must be some change to the input window (maybe make it a Rich Edit control) to even allow the Unicode characters to be displayed.

If anyone knows more about this than me I would be pleased to hear from them. :)

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Orange   United Kingdom  (25 posts)  Bio
Date Reply #6 on Mon 12 May 2003 09:19 AM (UTC)

Amended on Mon 12 May 2003 09:24 AM (UTC) by Orange

Message
My mud has support for a variety of character sets including Unicode/Latin1/CP1252/ASCII. It lets the user choose which one they want or otherwise autodetect based on ttype. (latin1 usually except in a couple of other cases.)

There is a telnet CHARSET negotiation option (see RFC 2066), but it's crap and unimplemented.

The ISO-2022 escape code for indicating UTF-8 is '\033%G', to indicate the end of UTF-8 is \033%@. Mushclient could detect these and switch.

Your analysis of UTF-8 is correct. There are a couple of other useful features about it. The initial byte of a multibyte-sequences is always in the range 0x80-0xbf, and the continuation bytes are always 0xc0-0xff. This means you can find the character boundaries really easily. Also, sort order is preserved.

Also, you'll not find sequences longer than 4, as the codespace >= 0x110000 has been abandoned.
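A client could watch for the two ISO-2022 escape sequences mentioned above roughly as follows. This is only a sketch: the type and function names are hypothetical, it works on one NUL-terminated chunk of text at a time, and it does not handle an escape sequence that is split across two network packets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state: which encoding the MUD is currently sending. */
typedef enum { CHARSET_DEFAULT, CHARSET_UTF8 } charset_mode;

/* Scan a NUL-terminated chunk for ESC % G (switch to UTF-8) and
   ESC % @ (switch back), and return the mode in effect at the end
   of the chunk. 'mode' is the mode in effect at the start. */
charset_mode scan_charset_switches(const char *text, charset_mode mode)
{
    const char *p;

    assert(text != NULL);

    for (p = text; *p != '\0'; ++p) {
        /* p[1] and p[2] are only read when the previous byte was
           not the terminator, so this never reads past the end */
        if (p[0] == '\033' && p[1] == '%') {
            if (p[2] == 'G') {
                mode = CHARSET_UTF8;
                p += 2;
            } else if (p[2] == '@') {
                mode = CHARSET_DEFAULT;
                p += 2;
            }
        }
    }
    return mode;
}
```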

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #7 on Wed 14 May 2003 04:30 AM (UTC)
Message
I have been experimenting with Unicode, and as far as I can see it is quite tedious to implement it. The problem is not with the UTF-8 part, that seems simple enough, but the thing I haven't got to work yet is the seemingly-simple task of showing Unicode in the command window (or any window for that matter).

It seems that if you have a non-Unicode application, however that is defined exactly, then the text windows (eg. dialog boxes, edit windows) are just "straight text" windows, with one byte per character.

To enable Unicode means trawling through the code converting hundreds and possibly thousands of strings (and code that uses them) to Unicode strings, where applicable. It isn't just a case of making everything Unicode, which itself isn't all that simple, because some stuff (eg. disk files, data from the MUD, data to the MUD, the chat system) still uses one byte per character, or possibly UTF-8. Thus it needs to be converted to/from Unicode at the appropriate point.

I'll keep experimenting, if my brain doesn't fuse first. ;)

If anyone knows how to mix Unicode windows (eg. an edit window) with a so-called non-Unicode application, please let me know.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #8 on Sun 18 May 2003 12:09 AM (UTC)

Amended on Mon 09 Feb 2004 04:16 AM (UTC) by Nick Gammon

Message
I have investigated adding Unicode support for quite some time now, and am going to abandon it for a while. There are some unanswered questions about the process, that - strangely enough - seem to be very hard to work out.

Below I will describe the problem and what I have found so far, mainly to remind myself later on what I did, so I don't spend weeks re-researching it all. Maybe someone reading this will be able to suggest the solution too. :)


The problem

Without using Unicode, the main problem is that many applications, including MUSHclient, encode text data as 8-bit character strings, like this:


char myString [] = "Nick Gammon";


The problem is that 8 bits will only hold 256 different characters, and some of those (the first 32) are already "lost" as they are used for "control" characters, like carriage-return, newline, form-feed, bell, page-feed, text-terminator (0x00) and so on. Also the last one (0xFF) is used for Telnet negotiation (the IAC character), although some programs work around that by sending it twice.

Thus only 256 - 32 (224) different characters are available. This is fine for normal English text, because the normal letters (A-Z, a-z), numbers (0-9) and punctuation fit nicely into the first 128, even including the control characters. Thus to write ordinary English text you can get away with the character range 0x00 to 0x7F.

However other languages (eg. Greek, Cyrillic, Arabic, Indic, Japanese, Chinese) have so many different characters in them they simply can't be represented in the 224 characters available.


Unicode

Unicode, as Windows implements it, solves this problem by encoding characters in 2 bytes each rather than one. In Windows the character type is WCHAR, which is defined as unsigned short, eg.


typedef unsigned short WCHAR;

WCHAR myWideString [] = L"Nick Gammon";


The "L" in front of the character string says to compile it as Unicode characters.

Because an unsigned short can contain 65,536 different values, that gives plenty of scope for encoding various languages.

Unicode is not Windows-specific, for more details see:


http://www.unicode.org


There is quite a good article about Unicode on MSDN at:


http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx



UTF-8

In order to have a "mixed" environment of Unicode and ordinary text (eg. on web pages) you can use UTF-8 which uses single bytes to store the first 128 characters (so most text can be the same as usual) but uses the high-order bit to signal the start of extra bytes. Also, as the high-order bit is always set in further bytes the text string can be processed in C programs (which use 0x00 as a text terminator) without any problems. The general encoding scheme is:



Unicode range              UTF-8 bytes

0x00000000 - 0x0000007F    0 xxxxxxx
0x00000080 - 0x000007FF    110 xxxxx 10 xxxxxx
0x00000800 - 0x0000FFFF    1110 xxxx 10 xxxxxx 10 xxxxxx
0x00010000 - 0x001FFFFF    11110 xxx 10 xxxxxx 10 xxxxxx 10 xxxxxx


You can see from this scheme that if any byte has the high-order bit clear it must be in the range 00-7F, any byte with the first two bits being '11' must be the start of a Unicode sequence (where the next bit(s) tell you how many bytes follow) and any byte with the first two bits being '10' is the middle of a Unicode sequence, the start of which can be found by scanning backwards a maximum of 3 bytes.
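The table above translates fairly directly into code. The following is a sketch of an encoder in plain C, with a hypothetical function name, that follows the four rows of the table exactly as posted. Note that the table (from RFC 2279) allows values up to 0x1FFFFF, whereas the current Unicode standard stops at 0x10FFFF, and this sketch makes no attempt to reject surrogate values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 following the
   table above. 'out' must have room for at least 4 bytes.
   Returns the number of bytes written, or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* not covered by the table */
}
```

For example, the Greek capital tau (0x3A4) comes out as the two bytes CE A4, and a space (0x20) as the single byte 20.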


Converting to/from Unicode

In Windows you can convert to and from Unicode using WideCharToMultiByte and MultiByteToWideChar.

eg. (sInput is the input string, sOutput is the output string; each fragment below is a separate example)



// convert Unicode to ANSI:

char sOutput [100];
WideCharToMultiByte (CP_ACP, 0, sInput, -1, sOutput, sizeof sOutput, NULL, NULL);

// convert Unicode to UTF-8:

char sOutput [100];
WideCharToMultiByte (CP_UTF8, 0, sInput, -1, sOutput, sizeof sOutput, NULL, NULL);

// convert ANSI to Unicode:

WCHAR sOutput [100];
MultiByteToWideChar (CP_ACP, MB_PRECOMPOSED, sInput, -1, sOutput, 
      sizeof sOutput / sizeof (WCHAR));

// convert UTF-8 to Unicode:

WCHAR sOutput [100];
MultiByteToWideChar (CP_UTF8, 0, sInput, -1, sOutput, 
      sizeof sOutput / sizeof (WCHAR));

// find the length of a UTF-8 string in wide characters
// (with -1 as the input length the count includes the terminating null, so subtract 1):

int iLength = MultiByteToWideChar (CP_UTF8, 0, sInput, -1, NULL, 0) - 1;



Byte-order marks

To identify what sort of text file you are dealing with (if data is on disk) the first 2 or 3 bytes can be used for this purpose. Since these characters won't normally occur in ordinary text this should be safe enough:


Encoding               Encoded BOM

UTF-16 big-endian      FE FF
UTF-16 little-endian   FF FE        (Windows)
UTF-8                  EF BB BF


Notepad uses this scheme to identify Unicode files.
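Checking for the byte-order marks in the table above could look something like this. It is a sketch in plain C with hypothetical names, operating on the first few bytes read from a file. A file with no BOM is reported as BOM_NONE, which says nothing about what encoding it actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the BOM check. */
typedef enum { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF8 } bom_kind;

/* Inspect the first bytes of a buffer and report which byte-order
   mark (if any) it starts with. 'len' is the number of valid bytes. */
bom_kind detect_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```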


Writing a Unicode application in Windows

OK, so much for the background. :)

Windows NT (and thus 2000 and XP) support Unicode internally, and thus you can write a Unicode application for those platforms.

Many operating system calls have an Ansi version (the A version) and a Wide (Unicode) version (the W version), eg. TextOutA to draw Ansi text and TextOutW to draw Unicode text.

When using MFC (Microsoft Foundation Class) libraries you simply have to define UNICODE and the compiler selects the appropriate routine for you from the "generic" version (in this case, TextOut), like this:


#ifdef UNICODE
 #define TextOut  TextOutW
#else
 #define TextOut  TextOutA
#endif // !UNICODE


There is one more trick if you want to make a Unicode application, you need to set the Link -> Output -> Entry point symbol to be "wWinMainCRTStartup" otherwise you get a link error.

Once you decide to compile with Unicode (or to make an application that can be compiled both ways) you need to use various generic typedefs, such as:


// non-generic typedefs
CHAR = char
WCHAR = unsigned short

// generic ones
TCHAR = CHAR or WCHAR
LPTSTR = CHAR * or WCHAR *
LPCTSTR = const CHAR * or const WCHAR *


Also, literals should be enclosed with _T("blah") which expands to either "blah" or L"blah" as appropriate.

Thus a portable Unicode/Non-Unicode application might say:


TCHAR myString [] = _T("Nick Gammon");


Also various MFC classes (like CString) automatically become the "wide" versions when compiled with Unicode.

However in the case of MUSHclient, it is extremely tedious to convert it to Unicode after it is written. For one thing there are around 6,500 text strings (like "You can't do that") which need to be inspected and have _T() put around them.

However there are some calls (like inet_addr) which do not have a wide version, and in those cases the strings being passed to them have to be downgraded to Ansi strings before they can be used.

For another, MUSHclient has to handle non-Unicode in places like disk files, chat sessions, and normal TCP/IP to a MUD. I have attempted it, and gave up, after a couple of days of fixing one compiler error, only to find the fix caused four more.

Thus, I want to make an app that is basically non-Unicode (in other words, staying much the same as it is) but to optionally (at user request) output Unicode to the output window, and accept Unicode in the command window.

The "at user request" part is because some people may want to use characters with the high-order bit set (eg. German characters with umlauts) which are not UTF-8 but simply use the characters in the range 0x80 to 0xFF.


How does Windows know whether it is a Unicode app or not?

After some research, I gather that Windows (NT) does not treat a whole application as Unicode or not, but treats individual calls on their merits. For instance, if you call TextOutW on a particular window, then you are outputting Unicode text to it. Thus, it ought to be possible to mix Unicode and non-Unicode windows in a particular application, which is what I want to do.

For example, in a test application this successfully drew Unicode in a window (once I had created the font "Lucida Sans Unicode" in the view, because that font will draw the Unicode characters):


  WCHAR sMsg [] = { 0x0443, 0x0433, 0x043e,
                    0x043c, 0x0420, 0x0020,
                    0x0448, 0x0443, 0x043c,
                    0x0435, 0x043b, 0x0442,
                    0 }; 

  pDC->SelectObject(m_font);   // select Unicode font

  TextOutW (pDC->m_hDC, 150, 150, sMsg, wcslen (sMsg));


Note the use of wcslen to find the length of a "wide" string.

In the middle of the text is an ordinary space (0x0020) demonstrating that the normal Ansi characters occupy the first 128 positions of the Unicode character space.

This particular application was not compiled with UNICODE defined, I was trying to mix Unicode and non-Unicode.


Window Procedures

There is more complexity than that in writing Unicode applications because some Windows messages handle text (eg. WM_SETTEXT) which involves text being passed around internally by Windows. Also other messages (like WM_CHAR) involve text from the user being passed to the application.

The specific problem I am trying to solve here is to create an "edit" window (in fact, the MUSHclient command window) which accepts Unicode, without having to write an edit window from scratch. Currently in a non-Unicode app (compiled without UNICODE defined) such windows just show question marks if you try to put Unicode text into them.

It appears that each window belongs to a window "class" - which has to be pre-registered with Windows before a window of that class can be created. Amongst other things, a window class defines a window procedure (WNDPROC) which handles messages for that window.

If you register a class with RegisterClassW then Windows thinks the window is a Unicode window, otherwise if you register it with RegisterClassA it becomes an Ansi window.

Here is an example of registering a Unicode window:


HINSTANCE hInst = AfxGetResourceHandle();

WNDCLASSW WndClass;   

  WndClass.style         = CS_DBLCLKS;   
  WndClass.lpfnWndProc   = (WNDPROC) MainWndProc;   
  WndClass.cbClsExtra    = 0;   
  WndClass.cbWndExtra    = 0;   
  WndClass.hInstance     = hInst;   
  WndClass.hIcon         = ::LoadIcon (hInst, MAKEINTRESOURCE (IDR_MAINFRAME));   
  WndClass.hCursor       = ::LoadCursor (NULL, IDC_ARROW);   // standard system cursor
  WndClass.hbrBackground = (HBRUSH) (COLOR_APPWORKSPACE+1);   
  WndClass.lpszMenuName  = NULL;   // no class menu
  WndClass.lpszClassName = L"MUSHclientWindow";    

  if( !RegisterClassW (&WndClass) ) 
    ::AfxMessageBox ("Could not register the class");


Note the use of WNDCLASSW to get the Wide WNDCLASS version and using RegisterClassW to register it.

Then in the MFC PreCreateWindow function you can tell it to use a different class, like this:


cs.lpszClass = "MUSHclientWindow";


However, that appears to not work, as MFC doesn't seem to like you switching to a window class it doesn't know about.

A bit more research shows you can "subclass" a window, which means that you indicate you want to have "first stab" at the messages for that window, which then get passed on to the real window procedure if you don't want to handle them. It seems that you use SetWindowLong to do that, and indeed if you use SetWindowLongW (note the W) then it registers that window (or at least, that window procedure) as one that wants Unicode. Here is an example:



// store previous window procedure here

WNDPROC oldproc = NULL;

// define our own window procedure

LRESULT CALLBACK MainWndProc ( HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam ) 
  {   

  switch( uMsg ) 
    {      

    case WM_SETTEXT:
        // handle WM_SETTEXT here ...
        break;

    }   // end of switch

  // send others to the original one

  return  CallWindowProcW (oldproc, hWnd, uMsg, wParam, lParam);

  } // end of MainWndProc


// now install it - note use of SetWindowLongW 

  oldproc = (WNDPROC) SetWindowLongW (m_hWnd, GWL_WNDPROC, (long) MainWndProc);



For more information on subclassing, see:


http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_subclas3.asp


However despite doing this, Unicode still appears as question marks, even though Spy shows that the window is now considered to be Unicode.

Another approach seems to be to "superclass" the window, which involves finding out about the original class (in this case "Edit") and then registering a new class based on it, like this:



  HINSTANCE hInst = AfxGetResourceHandle();

  WNDCLASSW WndClass;   

  // "Edit" is a system class, so pass NULL as the instance
  if (!GetClassInfoW(NULL, L"Edit", &WndClass))
      ::AfxMessageBox ("Could not GetClassInfo");
 
  WndClass.hInstance     = hInst;   
  WndClass.lpszClassName = L"MUSHclientWindow";  
  oldproc = WndClass.lpfnWndProc;
  WndClass.lpfnWndProc = MainWndProc;

  if( !RegisterClassW (&WndClass) ) 
    ::AfxMessageBox ("Could not RegisterClassW");


However whilst this works better than making a new class from scratch, Unicode still shows up as question marks.

What I think is happening is this: first, the documentation for CallWindowProc indicates that it can handle a mix of Unicode and non-Unicode in the chain of window procedures. If you change from one to the other it converts the messages (eg. WM_SETTEXT) to/from Unicode as appropriate.

Second, MFC does its own subclassing of the windows (as part of the application framework) and thus installs Ansi subclasses in the chain (because it is not a UNICODE build).

Thus what is happening is:

Unicode message --> MFC Ansi message --> Unicode message --> display

The MFC Ansi subclass in the middle there causes the Unicode to be thrown away, and adding Unicode at either end of the chain does not really help.

What seems to be needed is to somehow stop MFC from subclassing the window at all, or to make the command window one that is independent of MFC, which I am not sure how to do.

Any constructive suggestions appreciated.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #9 on Sun 18 May 2003 03:14 AM (UTC)

Amended on Sun 18 May 2003 03:21 AM (UTC) by Nick Gammon

Message
Quote:

The initial byte of a multibyte-sequences is always in the range 0x80-0xbf, and the continuation bytes are always 0xc0-0xff. This means you can find the character boundaries really easily.


Looking at the bit patterns in my earlier post, I think you have that backwards. The initial byte (11 xxxxxx) will be in the range 0xC0 to 0xFF and the continuation bytes (10 xxxxxx) will be in the range 0x80 to 0xBF.

This also makes more sense for preserving the sort order, as the initial byte (which is higher, being 0xC0 to 0xFF) is compared first, ahead of the continuation bytes, and always sorts after the single-byte characters (0x00 to 0x7F).
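Using the corrected ranges, classifying a byte and finding the start of the character it belongs to might be sketched like this. It is plain C, the names are hypothetical, and the buffer is assumed to hold valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a single byte of UTF-8 text. */
typedef enum { UTF8_ASCII, UTF8_LEAD, UTF8_CONTINUATION } utf8_byte_kind;

utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b < 0x80)
        return UTF8_ASCII;          /* 0xxxxxxx: 0x00 to 0x7F */
    if (b < 0xC0)
        return UTF8_CONTINUATION;   /* 10xxxxxx: 0x80 to 0xBF */
    return UTF8_LEAD;               /* 11xxxxxx: 0xC0 to 0xFF */
}

/* Given an index into a buffer, step backwards over continuation
   bytes to the index where that character starts. */
size_t utf8_sequence_start(const unsigned char *buf, size_t pos)
{
    assert(buf != NULL);

    while (pos > 0 && utf8_classify(buf[pos]) == UTF8_CONTINUATION)
        --pos;
    return pos;
}
```

This is the "find the character boundaries really easily" property: from any position, the start of the current character is at most a few bytes back.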

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Rooster   (2 posts)  Bio
Date Reply #10 on Tue 03 Feb 2004 09:55 AM (UTC)
Message
This is a silly question, but do you have IME installed? (The Windows component that supports non-European characters, i.e. Japanese Kanji. It can be found in Control Panel under Regional Settings.)

I've spent some time in a channel on IRC where Japanese characters were used quite a bit, and so I broke down and installed the language support so that they would stop displaying as broken two-byte characters in mIRC.

Once IME is installed, Windows itself handles the input of the characters. This includes installing multiple-language support for new fonts. (I have to be careful when choosing fonts, however.)

Without IME installed, UNICODE characters often appear as ? symbols, or are interpreted one byte per character.

---

I just tested using MushClient to send a few Japanese characters, with interesting results.

I have it currently connected to an IRC server where I have mIRC connected. I sent Japanese characters in both directions with the following results:

Sent from MushClient: Displayed correctly as Japanese characters in the command line. Displayed correctly at the other end in mIRC. (The IRC protocol does not echo back to the client that sent it.) This worked correctly.

Sent from mIRC: Known to work fine on mIRC's end, however, the characters are obviously char, not wchar, and therefore did not display correctly.

This is using Windows XP, which admittedly, handles UNICODE better than earlier versions of Windows. (And much better than 95/98/ME).

Posted by Rooster   (2 posts)  Bio
Date Reply #11 on Tue 03 Feb 2004 09:56 AM (UTC)
Message
Erg, that should read "did not display correctly in MushClient". My fault for posting at 2 AM.

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #12 on Tue 03 Feb 2004 08:52 PM (UTC)
Message
Hmm - I'm using NT 4, and don't see IME as such under regional settings, however I'm installing different input locales in the hope that will do it.

However you are right, if I can solve the problem of inputting in UTF-8 the rest should be pretty trivial.

BTW - how do I type (in German, say, or Japanese) on my ordinary US keyboard?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #13 on Tue 03 Feb 2004 10:02 PM (UTC)
Message
A bit of testing seems to indicate that installing different input locales merely makes my keyboard behave strangely. For instance, if I type "say hello" I see "saz hello" on the screen.

However if I copy some Unicode from a test file, and paste it into the input window, it still comes out as question marks.

Still, if this problem can be solved, a Unicode version should be achievable.

Are you saying that the only problem at present is that you can type (and paste?) Unicode into the command window, but it simply doesn't display properly in the upper (output) window?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,120 posts)  Bio   Forum Administrator
Date Reply #14 on Tue 03 Feb 2004 11:28 PM (UTC)

Amended on Wed 04 Feb 2004 12:26 AM (UTC) by Nick Gammon

Message
The issue is confused here because I am not sure whether my lack of results is due to:


  • Not knowing how to input (say) Japanese - I gather from my research that IME is an Input Method Editor, perhaps that would help

  • Not having a Japanese font installed anyway

  • If copying and pasting not knowing if the problem is:


    • Unicode (or UTF-8) not being copied
    • Unicode not being pasted correctly
    • Unicode not being displayed



A bit of experimentation has found some interesting results ...

On a UTF-8 sampler page (http://www.columbia.edu/kermit/utf8.html) I found, amongst other things, this phrase in Greek:

"Τη γλώσσα" (this may not display in Greek due to your web browser or other problems)

Now opening that web page in Ultra-Edit appears to indicate that it is in UTF-8 ...


00000000  ce a4 ce b7 20 ce b3 ce  bb cf 8e cf 83 cf 83 ce  |.... ...........|
00000010  b1 20 ce bc ce bf cf 85                           |. ......|


Looking at these characters, we see:


  • ce a4 - 11001110 10100100
  • ce b7 - 11001110 10110111
  • 20 - a normal space
  • ce b3 - 11001110 10110011
  • ce bb - 11001110 10111011
  • cf 8e - 11001111 10001110
  • cf 83 - 11001111 10000011


... and so on ...

Now, take out the 110 marker bits from the first byte, and the 10 marker bits from the second byte, and combine the remaining ones, you get this:


  • 1110100100 (decimal 932) - Greek capital "tau"
  • 1110110111 (decimal 951) - Greek "eta"
  • (space)
  • 1110110011 (decimal 947) - Greek "gamma"
  • 1110111011 (decimal 955) - Greek "lambda"
  • 1111001110 (decimal 974) - Greek "omega with tonos"
  • 1111000011 (decimal 963) - Greek "sigma"


(I got these names and numbers from the page: http://www.york.ac.uk/depts/maths/greekutf.htm ).

This seems to follow the UTF-8 scheme described above, the space was a single byte, the other characters were double bytes where the first three bits of the first byte were always 110 and the first two bits of the other byte were always 10. You combine the bits other than the marker bits, to get a number higher than 255, and look that up.
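The manual decoding above can be reproduced with a small routine. This is a sketch in plain C with a hypothetical function name. It handles sequences of 1 to 4 bytes and checks the marker bits, but it does not reject "overlong" encodings or otherwise fully validate its input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode the UTF-8 sequence at the start of 's'
   ('len' bytes available) into *cp. Returns the number of bytes
   consumed, or 0 if the sequence is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long value;

    assert(s != NULL && cp != NULL);

    if (len == 0)
        return 0;

    /* strip the marker bits from the first byte */
    if (s[0] < 0x80)                { value = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { value = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { value = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { value = s[0] & 0x07; need = 4; }
    else
        return 0;   /* stray continuation byte or unsupported lead byte */

    if (len < need)
        return 0;

    /* each following byte must be 10xxxxxx and contributes 6 bits */
    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }

    *cp = value;
    return need;
}
```

Feeding it the bytes CE A4 gives 932 (capital tau) and CF 8E gives 974 (omega with tonos), matching the figures worked out by hand.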

However if I copy that word onto the clipboard and paste it into MUSHclient's command window I just see "?? ???ssa" which appears to indicate it didn't paste properly (or didn't display properly).

I can copy and paste into Notepad, so the copying itself cannot be the problem.

I am using the same font in MUSHclient (Lucida Sans Unicode) that I successfully used in Notepad, so it would seem the font isn't the problem.

Interestingly, if I paste into UltraEdit I get the same results as in MUSHclient, so they have the same problem, whatever it is, that I do.

More tests show that pasting into Word works, but pasting into Wordpad doesn't. It certainly isn't consistent!

If I bring up the "find" dialog box, and paste the word into them, the following programs work: Word, Internet Explorer, however these don't: Notepad, Wordpad, UltraEdit, MUSHclient.

It is hard to know what to conclude from all this. I know the MUSHclient output routine is not designed to handle UTF-8, and changing that is not too hard. However to test it I need to get some UTF-8 into the command window (or displayed somehow) and that seems pretty tricky.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).


This is page 1, subject is 5 pages long.


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.