I have investigated adding Unicode support for quite some time now, and am going to abandon it for a while. There are some unanswered questions about the process that, strangely enough, seem to be very hard to work out.
Below I will describe the problem and what I have found so far, mainly to remind myself later on what I did, so I don't spend weeks re-researching it all. Maybe someone reading this will be able to suggest the solution too. :)
The problem
Without using Unicode, the main problem is that many applications, including MUSHclient, encode text data as 8-bit character strings, like this:
char myString [] = "Nick Gammon";
The problem is that 8 bits will only hold 256 different characters, and some of those (the first 32) are already "lost" as they are used for "control" characters, like carriage-return, newline, form-feed, bell, text-terminator (0x00) and so on. Also the last one (0xFF) is used for Telnet negotiation (the IAC character), although some programs work around that by sending it twice.
Thus only 256 - 32 (224) different characters are available. This is fine for normal English text, because the normal letters (A-Z, a-z), numbers (0-9) and punctuation fit nicely into the first 128, even including the control characters. Thus to write ordinary English text you can get away with the character range 0x00 to 0x7F.
However other languages (eg. Greek, Cyrillic, Arabic, Indic, Japanese, Chinese) have so many different characters in them they simply can't be represented in the 224 characters available.
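As an aside, the "send it twice" workaround for the IAC character mentioned above can be sketched like this. This is only an illustration in portable C++, and EscapeIAC is a name I made up for the example, not something in MUSHclient:

```cpp
#include <string>

// Telnet "Interpret As Command" byte.
const unsigned char IAC = 0xFF;

// Return a copy of 'data' with every 0xFF byte doubled, so the receiver
// treats it as a literal data byte rather than the start of a command.
std::string EscapeIAC (const std::string & data)
{
  std::string out;
  out.reserve (data.size ());
  for (std::string::size_type i = 0; i < data.size (); ++i)
  {
    out += data [i];
    if (static_cast<unsigned char> (data [i]) == IAC)
      out += data [i];   // send it twice
  }
  return out;
}
```

The receiving end would do the reverse, collapsing each pair of 0xFF bytes back into one.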
Unicode
Unicode solves this problem by giving every character its own number from a much larger range. Windows stores Unicode text in 2-byte units rather than single bytes. In Windows the character type is WCHAR, which is defined as unsigned short, eg.
typedef unsigned short WCHAR;
WCHAR myWideString [] = L"Nick Gammon";
The "L" in front of the character string says to compile it as Unicode characters.
Because an unsigned short can hold 65,536 different values, that gives plenty of scope for encoding various languages.
Unicode is not Windows-specific. For more details see:
http://www.unicode.org
There is quite a good article about Unicode on MSDN at:
http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
UTF-8
In order to have a "mixed" environment of Unicode and ordinary text (eg. on web pages) you can use UTF-8 which uses single bytes to store the first 128 characters (so most text can be the same as usual) but uses the high-order bit to signal the start of extra bytes. Also, as the high-order bit is always set in further bytes the text string can be processed in C programs (which use 0x00 as a text terminator) without any problems. The general encoding scheme is:
Unicode range              UTF-8 bytes
0x00000000 - 0x0000007F    0xxxxxxx
0x00000080 - 0x000007FF    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You can see from this scheme that if any byte has the high-order bit clear it must be in the range 00-7F, any byte with the first two bits being '11' must be the start of a Unicode sequence (where the next bit(s) tell you how many bytes follow) and any byte with the first two bits being '10' is the middle of a Unicode sequence, the start of which can be found by scanning backwards a maximum of 3 bytes.
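To make the table concrete, here is a minimal sketch of an encoder that follows it row by row, plus a helper that scans backwards to the start of a sequence. It is portable C++ rather than Windows code, the function names are my own, and it follows the table as given (it does not reject values that the Unicode standard reserves):

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8, one branch per row of the table.
std::string EncodeUTF8 (unsigned long cp)
{
  assert (cp <= 0x1FFFFF);   // largest value the 4-byte form can hold
  std::string out;
  if (cp <= 0x7F)
    out += static_cast<char> (cp);                          // 0xxxxxxx
  else if (cp <= 0x7FF)
  {
    out += static_cast<char> (0xC0 | (cp >> 6));            // 110xxxxx
    out += static_cast<char> (0x80 | (cp & 0x3F));          // 10xxxxxx
  }
  else if (cp <= 0xFFFF)
  {
    out += static_cast<char> (0xE0 | (cp >> 12));           // 1110xxxx
    out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
    out += static_cast<char> (0x80 | (cp & 0x3F));          // 10xxxxxx
  }
  else
  {
    out += static_cast<char> (0xF0 | (cp >> 18));           // 11110xxx
    out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
    out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
    out += static_cast<char> (0x80 | (cp & 0x3F));          // 10xxxxxx
  }
  return out;
}

// Given an index anywhere inside a sequence, step back over the
// continuation bytes (10xxxxxx) to the byte that starts the sequence.
std::string::size_type FindSequenceStart (const std::string & s,
                                          std::string::size_type pos)
{
  while (pos > 0 && (static_cast<unsigned char> (s [pos]) & 0xC0) == 0x80)
    --pos;
  return pos;
}
```

For example, the Cyrillic letter 0x0443 comes out as the two bytes D1 83, while plain 'A' (0x41) stays as the single byte 41.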
Converting to/from Unicode
In Windows you can convert to and from Unicode using WideCharToMultiByte and MultiByteToWideChar.
eg. (sInput is the input string, sOutput is the output string)
// convert Unicode to ANSI:
char sOutput [100];
WideCharToMultiByte (CP_ACP, 0, sInput, -1, sOutput, sizeof sOutput, NULL, NULL);
// convert Unicode to UTF-8:
char sOutput [100];
WideCharToMultiByte (CP_UTF8, 0, sInput, -1, sOutput, sizeof sOutput, NULL, NULL);
// convert ANSI to Unicode:
WCHAR sOutput [100];
MultiByteToWideChar (CP_ACP, MB_PRECOMPOSED, sInput, -1, sOutput,
sizeof sOutput / sizeof (WCHAR));
// convert UTF-8 to Unicode:
WCHAR sOutput [100];
MultiByteToWideChar (CP_UTF8, 0, sInput, -1, sOutput,
sizeof sOutput / sizeof (WCHAR));
// find the length of a UTF-8 string in characters
// (because of the -1, the count includes the terminating null):
int iLength = MultiByteToWideChar (CP_UTF8, 0, sInput, -1, NULL, 0);
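The last call above gives the length in 16-bit units. If you want the number of characters without calling Windows at all, one approach is to count every byte that is not a continuation byte. This is just a sketch: CountUTF8Chars is a name I made up, and it assumes the input is well-formed UTF-8:

```cpp
#include <cstddef>

// Count the characters (code points) in a null-terminated UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
// No validation is done, so malformed input gives a meaningless count.
std::size_t CountUTF8Chars (const char * s)
{
  std::size_t count = 0;
  for ( ; *s != '\0'; ++s)
    if ((static_cast<unsigned char> (*s) & 0xC0) != 0x80)
      ++count;
  return count;
}
```

Unlike the MultiByteToWideChar version, this count does not include the terminating null.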
Byte-order marks
To identify what sort of text file you are dealing with (if data is on disk) the first 2 or 3 bytes can be used for this purpose. Since these characters won't normally occur in ordinary text this should be safe enough:
Encoding                Encoded BOM
UTF-16 big-endian       FE FF
UTF-16 little-endian    FF FE       (Windows)
UTF-8                   EF BB BF
Notepad uses this scheme to identify Unicode files.
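Checking for those marks is simple enough. Here is a minimal sketch, where the enum and the function name DetectBOM are my own invention for the example. A file with no mark at all comes back as "unknown", in which case you have to guess or ask the user:

```cpp
#include <cstddef>

enum TextEncoding { ENC_UNKNOWN, ENC_UTF16_BE, ENC_UTF16_LE, ENC_UTF8 };

// Look at the first 2 or 3 bytes of a buffer and report which
// byte-order mark (if any) it starts with.
TextEncoding DetectBOM (const unsigned char * data, std::size_t len)
{
  if (len >= 3 && data [0] == 0xEF && data [1] == 0xBB && data [2] == 0xBF)
    return ENC_UTF8;
  if (len >= 2 && data [0] == 0xFE && data [1] == 0xFF)
    return ENC_UTF16_BE;
  if (len >= 2 && data [0] == 0xFF && data [1] == 0xFE)
    return ENC_UTF16_LE;
  return ENC_UNKNOWN;   // no mark found
}
```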
Writing a Unicode application in Windows
OK, so much for the background. :)
Windows NT (and thus 2000 and XP) support Unicode internally, and thus you can write a Unicode application for those platforms.
Many operating system calls have an Ansi version (the A version) and a Wide (Unicode) version (the W version), eg. TextOutA to draw Ansi text and TextOutW to draw Unicode text.
When using MFC (Microsoft Foundation Class) libraries you simply have to define UNICODE (and _UNICODE, which the C run-time and MFC headers test) and the compiler selects the appropriate routine for you from the "generic" version (in this case, TextOut), like this:
#ifdef UNICODE
#define TextOut TextOutW
#else
#define TextOut TextOutA
#endif // !UNICODE
There is one more trick if you want to make a Unicode application: you need to set the Link -> Output -> Entry point symbol to be "wWinMainCRTStartup", otherwise you get a link error.
Once you decide to compile with Unicode (or to make an application that can be compiled both ways) you need to use various generic typedefs, such as:
// non-generic typedefs
CHAR = char
WCHAR = unsigned short
// generic ones
TCHAR = CHAR or WCHAR
LPTSTR = CHAR * or WCHAR *
LPCTSTR = const CHAR * or const WCHAR *
Also, literals should be enclosed with _T("blah") which expands to either "blah" or L"blah" as appropriate.
Thus a portable Unicode/Non-Unicode application might say:
TCHAR myString [] = _T("Nick Gammon");
Also various MFC classes (like CString) automatically become the "wide" versions when compiled with Unicode.
However in the case of MUSHclient, it is extremely tedious to convert it to Unicode after it is written. For one thing there are around 6,500 text strings (like "You can't do that") which need to be inspected and have _T() put around them.
Also, there are some calls (like inet_addr) which do not have a wide version, and in those cases the strings being passed to them have to be downgraded to Ansi strings before they can be used.
For another, MUSHclient has to handle non-Unicode in places like disk files, chat sessions, and normal TCP/IP to a MUD. I have attempted it, and gave up, after a couple of days of fixing one compiler error, only to find the fix caused four more.
Thus, I want to make an app that is basically non-Unicode (in other words, staying much the same as it is) but to optionally (at user request) output Unicode to the output window, and accept Unicode in the command window.
The "at user request" part is because some people may want to use characters with the high-order bit set (eg. German characters with umlauts) which are not UTF-8 but simply use the characters in the range 0x80 to 0xFF.
How does Windows know whether it is a Unicode app or not?
After some research, I gather that Windows (NT) does not treat a whole application as Unicode or not, but treats individual calls on their merits. For instance, if you call TextOutW on a particular window, then you are outputting Unicode text to it. Thus, it ought to be possible to mix Unicode and non-Unicode windows in a particular application, which is what I want to do.
For example, in a test application this successfully drew Unicode in a window (once I had created the font "Lucida Sans Unicode" in the view, because that font will draw the Unicode characters):
WCHAR sMsg [] = { 0x0443, 0x0433, 0x043e,
0x043c, 0x0420, 0x0020,
0x0448, 0x0443, 0x043c,
0x0435, 0x043b, 0x0442,
0 };
pDC->SelectObject(m_font); // select Unicode font
TextOutW (pDC->m_hDC, 150, 150, sMsg, wcslen (sMsg));
Note the use of wcslen to find the length of a "wide" string.
In the middle of the text is an ordinary space (0x0020) demonstrating that the normal Ansi characters are in the first 128 positions of the Unicode character space.
This particular application was not compiled with UNICODE defined, as I was trying to mix Unicode and non-Unicode.
Window Procedures
There is more complexity than that in writing Unicode applications because some Windows messages handle text (eg. WM_SETTEXT) which involves text being passed around internally by Windows. Also other messages (like WM_CHAR) involve text from the user being passed to the application.
The specific problem I am trying to solve here is to create an "edit" window (in fact, the MUSHclient command window) which accepts Unicode, without having to write an edit window from scratch. Currently in a non-Unicode app (compiled without UNICODE defined) such windows just show question marks if you try to put Unicode text into them.
It appears that each window belongs to a window "class", which has to be pre-registered with Windows before a window of that class can be created. Amongst other things, a window class defines a window procedure (WNDPROC) which handles messages for that window.
If you register a class with RegisterClassW then Windows thinks the window is a Unicode window, otherwise if you register it with RegisterClassA it becomes an Ansi window.
Here is an example of registering a Unicode window:
HINSTANCE hInst = AfxGetResourceHandle();
WNDCLASSW WndClass;
WndClass.style = CS_DBLCLKS;
WndClass.lpfnWndProc = MainWndProc;
WndClass.cbClsExtra = 0;
WndClass.cbWndExtra = 0;
WndClass.hInstance = hInst;
WndClass.hIcon = ::LoadIcon (hInst, MAKEINTRESOURCE (IDR_MAINFRAME));
WndClass.hCursor = ::LoadCursor (NULL, IDC_ARROW); // standard system cursor
WndClass.hbrBackground = (HBRUSH) (COLOR_APPWORKSPACE+1);
WndClass.lpszMenuName = NULL; // no menu
WndClass.lpszClassName = L"MUSHclientWindow";
if (!RegisterClassW (&WndClass))
::AfxMessageBox ("Could not register the class");
Note the use of WNDCLASSW to get the Wide WNDCLASS version and using RegisterClassW to register it.
Then in the MFC PreCreateWindow function you can tell it to use a different class, like this:
cs.lpszClass = "MUSHclientWindow";
However, that appears to not work, as MFC doesn't seem to like you switching to a window class it doesn't know about.
A bit more research shows you can "subclass" a window, which means that you indicate you want to have "first stab" at the messages for that window, which then get passed on to the real window procedure if you don't want to handle them. It seems that you use SetWindowLong to do that, and indeed if you use SetWindowLongW (note the W) then it registers that window (or at least, that window procedure) as one that wants Unicode. Here is an example:
// store previous window procedure here
WNDPROC oldproc = NULL;
// define our own window procedure
LRESULT CALLBACK MainWndProc ( HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam )
{
switch( uMsg )
{
case WM_SETTEXT:
// handle WM_SETTEXT here ...
break;
} // end of switch
// send others to the original one
return CallWindowProcW (oldproc, hWnd, uMsg, wParam, lParam);
} // end of MainWndProc
// now install it - note use of SetWindowLongW
oldproc = (WNDPROC) SetWindowLongW (m_hWnd, GWL_WNDPROC, (long) MainWndProc);
For more information on subclassing, see:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwui/html/msdn_subclas3.asp
However despite doing this, Unicode still appears as question marks, even though Spy shows that the window is now considered to be Unicode.
Another approach seems to be to "superclass" the window, which involves finding out about the original class (in this case "Edit") and then registering a new class based on it, like this:
WNDCLASSW WndClass;
// "Edit" is a system class, so pass NULL as the instance
if (!GetClassInfoW (NULL, L"Edit", &WndClass))
::AfxMessageBox ("Could not GetClassInfo");
HINSTANCE hInst = AfxGetResourceHandle();
WndClass.hInstance = hInst;
WndClass.lpszClassName = L"MUSHclientWindow";
oldproc = WndClass.lpfnWndProc; // remember the original Edit procedure
WndClass.lpfnWndProc = MainWndProc;
if (!RegisterClassW (&WndClass))
::AfxMessageBox ("Could not RegisterClassW");
However whilst this works better than making a new class from scratch, Unicode still shows up as question marks.
What I think is happening is this: first, the documentation for CallWindowProc indicates that it can handle a mix of Unicode and non-Unicode in the chain of window procedures. If you change from one to the other it converts the messages (eg. WM_SETTEXT) to/from Unicode as appropriate.
Second, MFC does its own subclassing of the windows (as part of the application framework) and thus installs Ansi subclasses in the chain (because it is not a UNICODE build).
Thus what is happening is:
Unicode message --> MFC Ansi message --> Unicode message --> display
The MFC Ansi subclass in the middle there causes the Unicode to be thrown away, and adding Unicode at either end of the chain does not really help.
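To show why one Ansi link in the middle is enough to spoil things, here is a simplified model of the round trip in portable C++. The helpers ToAnsi, ToWide and ThroughAnsiLink are made up for the illustration, and I am assuming a code page where only values up to 0xFF fit in a byte. The real conversion Windows does depends on the system code page, so treat this as a sketch of the effect, not of the mechanism:

```cpp
#include <string>

// Simplified model of a Unicode -> Ansi conversion: anything that does not
// fit in a single byte is replaced by a question mark.
std::string ToAnsi (const std::wstring & wide)
{
  std::string out;
  for (std::wstring::size_type i = 0; i < wide.size (); ++i)
    out += (static_cast<unsigned long> (wide [i]) <= 0xFF)
             ? static_cast<char> (wide [i]) : '?';
  return out;
}

// Simplified model of an Ansi -> Unicode conversion: each byte is widened.
std::wstring ToWide (const std::string & ansi)
{
  std::wstring out;
  for (std::string::size_type i = 0; i < ansi.size (); ++i)
    out += static_cast<wchar_t> (static_cast<unsigned char> (ansi [i]));
  return out;
}

// What a message's text looks like after passing through one Ansi
// window procedure on its way between two Unicode ones.
std::wstring ThroughAnsiLink (const std::wstring & text)
{
  return ToWide (ToAnsi (text));
}
```

Ordinary English text survives the trip unchanged, but once a character has been turned into a question mark in the middle there is nothing the Unicode procedure at the far end can do to get it back.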
What seems to be needed is to somehow stop MFC from subclassing the window at all, or to make the command window one that is independent of MFC, which I am not sure how to do.
Any constructive suggestions appreciated.
- Nick Gammon
www.gammon.com.au, www.mushclient.com