Gammon Software Solutions forum

Subject: Localization - is it needed?
Subject review (reverse sequence)

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Tue 12 Jun 2007 06:14 AM (UTC)
Message
See follow-up thread with progress to-date:

http://www.gammon.com.au/forum/?id=7953

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 09:52 PM (UTC)
Message
I'm not sure about the L"MUSHclient" - that is displayed in the dialog box title. It is a proper name, after all, so perhaps it doesn't need translating?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 09:50 PM (UTC)
Message
Yes, UTF-8 is an encoding system, not a locale.

This is roughly what I am using:


// assumes the Windows headers are already included (eg. via stdafx.h in an MFC project)
#include <vector>

// display message box - using UTF-8
int UMessageBox (const char * lpszText, UINT nType)
  {

  // find how big table has to be (in wide characters, including the terminator)
  int iLength = MultiByteToWideChar (CP_UTF8, 0, lpszText, -1, NULL, 0);

  // conversion failed - nothing to display
  if (iLength <= 0)
    return 0;

  // vector to hold Unicode
  std::vector<WCHAR> v;

  // adjust size
  v.resize (iLength);

  // do the conversion now
  MultiByteToWideChar (CP_UTF8, 0, lpszText, -1, &v [0], iLength);

  // determine icon based on type specified
  if ((nType & MB_ICONMASK) == 0)
  {
    switch (nType & MB_TYPEMASK)
    {
    case MB_OK:
    case MB_OKCANCEL:
      nType |= MB_ICONEXCLAMATION;
      break;

    case MB_YESNO:
    case MB_YESNOCANCEL:
      nType |= MB_ICONEXCLAMATION;
      break;

    case MB_ABORTRETRYIGNORE:
    case MB_RETRYCANCEL:
      // No default icon for these types, since they are rarely used.
      // The caller should specify the icon.
      break;
    }
  }

  // display the wide-character text
  int nResult = ::MessageBoxW (NULL, &v [0], L"MUSHclient", nType);

  return nResult;

  }   // end of UMessageBox



And test like this:


   UMessageBox ("\xC9\xB3\xC9\xA8\xC9\x95\xC9\xAE", MB_OK);


That UTF-8 sequence should show 4 characters that look vaguely like "Nick".

What this is doing is taking UTF-8 input, converting to wide characters (WCHAR) using MultiByteToWideChar, and then calling MessageBoxW to display the Unicode text.
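For anyone curious what the conversion step involves, here is a portable sketch of the UTF-8 to UTF-16 decoding. It is not the Windows implementation of MultiByteToWideChar, just an illustration; the name Utf8ToUtf16 is made up for this example, and it assumes well-formed input (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into UTF-16 code units.
// Sketch only: assumes well-formed input, performs no validation.
std::u16string Utf8ToUtf16 (const std::string & utf8)
  {
  std::u16string out;
  std::size_t i = 0;

  while (i < utf8.size ())
    {
    unsigned char lead = static_cast<unsigned char> (utf8 [i]);
    char32_t cp;          // code point being assembled
    std::size_t extra;    // number of continuation bytes that follow

    if (lead < 0x80)      { cp = lead;        extra = 0; }   // 1-byte (ASCII)
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }   // 2-byte sequence
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }   // 3-byte sequence
    else                  { cp = lead & 0x07; extra = 3; }   // 4-byte sequence
    ++i;

    // each continuation byte contributes its low 6 bits
    for (std::size_t k = 0; k < extra && i < utf8.size (); ++k, ++i)
      cp = (cp << 6) | (static_cast<unsigned char> (utf8 [i]) & 0x3F);

    if (cp < 0x10000)
      out.push_back (static_cast<char16_t> (cp));
    else
      {
      // outside the Basic Multilingual Plane: emit a surrogate pair
      cp -= 0x10000;
      out.push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
      out.push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
      }
    }

  return out;
  }
```

Feeding it the test sequence above gives four code units: U+0273, U+0268, U+0255 and U+026E.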

I am basically going through the MUSHclient source changing all calls to AfxMessageBox to UMessageBox, thus facilitating the display of Unicode.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Ked   Russia  (524 posts)
Date Mon 11 Jun 2007 08:23 PM (UTC)
Message
Ok, I checked a UTF-8 sequence as you suggested. I used the following call, which might even be a wrong way to do it:

::AfxMessageBox("\xD0" "\x90");


And the dialog displayed two separate characters instead of a single Cyrillic capital "А".

This was with the multi-byte character setting, since Unicode doesn't compile.

From what I understand, in "multi-byte mode" AfxMessageBox uses the current locale's codepage to interpret an array of chars, so it will display Russian or Chinese text properly as long as the corresponding codepage is selected. But it will also treat UTF-8 bytes according to that codepage, not as UTF-8. At least the MSDN docs on setlocale() explicitly say that you cannot set the locale to UTF-7 or UTF-8.


Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 07:09 AM (UTC)
Message
Quote:

On closer inspection it turned out that I merely confirmed that Russian is displayed, not that Unicode is displayed,


What I think you need to do is look up the Unicode code points by referring to the appropriate page here:

http://www.unicode.org/charts/

Then you can use the "Debug Simulated Input" dialog in MUSHclient to convert Unicode code points into UTF-8 sequences in hex. Having done that you can plant those into a dialog box and confirm they are displayed correctly.
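If you would rather do the code point to UTF-8 conversion by hand, the following is a rough sketch of the encoding. The function names EncodeUtf8 and ToHexEscapes are invented for this example (they are not part of MUSHclient), and the encoder does not reject surrogates or values above U+10FFFF.

```cpp
#include <cstdio>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Sketch only: does not reject surrogates or values above U+10FFFF.
std::string EncodeUtf8 (char32_t cp)
  {
  std::string s;

  if (cp < 0x80)
    s += static_cast<char> (cp);                          // 1 byte
  else if (cp < 0x800)
    {
    s += static_cast<char> (0xC0 | (cp >> 6));            // 2 bytes
    s += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else if (cp < 0x10000)
    {
    s += static_cast<char> (0xE0 | (cp >> 12));           // 3 bytes
    s += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else
    {
    s += static_cast<char> (0xF0 | (cp >> 18));           // 4 bytes
    s += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
    s += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char> (0x80 | (cp & 0x3F));
    }

  return s;
  }

// Show bytes as hex escapes suitable for pasting into a C string literal
std::string ToHexEscapes (const std::string & bytes)
  {
  std::string out;
  char buf [8];

  for (unsigned char c : bytes)
    {
    std::snprintf (buf, sizeof buf, "\\x%02X", static_cast<unsigned> (c));
    out += buf;
    }

  return out;
  }
```

For example, ToHexEscapes (EncodeUtf8 (0x0410)) gives \xD0\x90 for the Cyrillic capital A. When pasting such escapes into source, bear in mind that a hex escape absorbs any hex digits that follow it, so a literal may need to be split into adjacent pieces if ordinary text comes straight after an escape.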

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 07:05 AM (UTC)
Message
Quote:

... sure enough - that spawned over 2000 compile errors, mostly having to do with conversion between LPCTSTR/LPCWSTR and char ...


Well I tried that once with similar results. You can put _T(...) around strings but then that introduces another heap of errors.

For example, functions that have a 'const char *' prototype will no longer accept the new strings. So you change those, and then they fail because they call something like fwrite (or strlen), which still expects char *.

It was about at this point when my mind started to boggle. For example, Lua uses 8-bit strings, not 16-bit strings. Also, MUSHclient world files are 8-bit data, not 16-bit data. Also, incoming text from a MUD is 8-bit data.

I think the simpler solution is to stick to UTF-8 for Unicode data, provided we can solve the problem of getting dialog boxes to display in Unicode, which judging by the earlier post, someone has done.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Ked   Russia  (524 posts)
Date Mon 11 Jun 2007 06:11 AM (UTC)
Message
Quote:
I'm a bit puzzled it was that easy - this was in the full MUSHclient compile?


Well, it's not that easy. On closer inspection it turned out that I merely confirmed that Russian is displayed, not that Unicode is displayed, but Russian characters are on most English codepages so my tests provide no proof that Japanese, for example, will also be displayed without problems.

Later yesterday I tried changing the codeset for the entire solution to Unicode and sure enough - that spawned over 2000 compile errors, mostly having to do with conversion between LPCTSTR/LPCWSTR and char. Some of those errors (those that involve literal strings in assignments and function calls) are easy to solve, but implications of fixing the rest are not as obvious to me.

Quote:
The normal AfxMessageBox function (in a non-Unicode application) expects 8-bit data.


VS8 supposedly has a "Unicode version" of MFC. At least many of the errors I mentioned above seem to indicate that AfxMessageBox expects a wide char* instead of char*, which is what it is getting right now all over the place. Converting literal strings to wide (with the L prefix) solves this chunk of errors.

Quote:
As for changing things like dialog boxes - do you think you could make a copy of the resources that use Russian characters, and I could merge them into the existing source? I don't have the .NET compiler (yet, anyway).


Sure.

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 02:44 AM (UTC)

Amended on Mon 11 Jun 2007 02:51 AM (UTC) by Nick Gammon

Message
The next interesting problem is messages with embedded variables. For example, this message:


"The %s contains %i line%s, %i word%s, %i character%s"


The first %s can be either "document" or "selection". The %i items are counts. The other %s items are either the letter "s" for plural, or the empty string, for singular.

A number of problems arise here. For a start, the word order may be different. For example, a translated version might look like this:


There are 4 lines, 5 words, 22 characters in the document.


In this example the "document" word has moved to the back.

Also the pluralization (is that a word?) might be different. The plural of "line" in German is probably not obtained by adding an "s".

To assist in the process of making a correct translation, formatted strings are handled differently, namely by calling a Lua function. Here is how that message might be handled:


formatted = {

-- TextView.cpp:589
  ["The %s contains %i line%s, %i word%s, %i character%s"] =
    function (a, b, c, d, e, f, g)
     
      return ""
    end,  -- function

-- ... and so on ..

}


Formatted messages are in a separate table. This time the item value is an unnamed function that will be called at runtime.

The function is supplied with the arguments that the original one (in the source) had.

Let us take an example:


The selection contains 1 line, 3 words, 20 characters


There are really 7 variables here, and they are automatically named a to g. Their values in this particular case would be:


  1. a = selection
  2. b = 1
  3. c = (empty)
  4. d = 3
  5. e = s
  6. f = 20
  7. g = s


The translator can now feel free to use those arguments as s/he feels fit. For example, the "s" arguments (items 3, 5 and 7) could be ignored.

The word "selection" could be converted into the equivalent.

The pluralization can be handled by examining the actual numbers and generating appropriate code. The converted function might look like this:


-- TextView.cpp:589
  ["The %s contains %i line%s, %i word%s, %i character%s"] =
    function (a, b, c, d, e, f, g)
       
       local line = "line"
       if b ~= 1 then
         line = "lines"
       end -- plural lines
       
       local word = "word"
       if d ~= 1 then
         word = "words"
       end -- plural words
       
       local character = "character"
       if f ~= 1 then
         character = "characters"
       end -- plural characters
       
      return string.format ("There are %i %s, %i %s, %i %s in the %s",
             b, line, d, word, f, character, a)
             
    end,  -- function


Although this is still English, this illustrates how I have moved the word "document" or "selection" to the end of the message (that is, argument 'a'), and re-evaluated whether to make each word plural by testing the number of each one (arguments 'b', 'd' and 'f').

The nice thing about using Lua, is that things like making numbers plural can be handled by a shared function, which you could put at the start of the translation file, and which can then be used by every function that needs it.
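To show the shape of such a shared helper, here is the same idea sketched in C++ rather than Lua. The names Plural and DescribeCounts are made up for this illustration, and the singular/plural test is the English rule only; a real translation file would use whatever rule the target language needs.

```cpp
#include <string>

// Shared helper: choose the singular or plural noun for a count (English rule)
std::string Plural (int count, const std::string & singular, const std::string & plural)
  {
  return count == 1 ? singular : plural;
  }

// Hypothetical counterpart of the Lua function above: the "document" or
// "selection" word moves to the end, and each noun is chosen from its count
std::string DescribeCounts (const std::string & what, int lines, int words, int characters)
  {
  return "There are " +
         std::to_string (lines)      + " " + Plural (lines, "line", "lines") + ", " +
         std::to_string (words)      + " " + Plural (words, "word", "words") + ", " +
         std::to_string (characters) + " " + Plural (characters, "character", "characters") +
         " in the " + what;
  }
```

Every message that needs a plural can then call the one helper instead of repeating the if-test three times.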

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 02:21 AM (UTC)

Amended on Mon 11 Jun 2007 04:12 AM (UTC) by Nick Gammon

Message
The internationalization process, in more detail.

After reading about gettext, and getting some ideas, I have been doing things a bit differently.

The basic steps to internationalization are:


  • Make it possible to display Unicode. This is already partly done if you enable UTF-8 in the output window.

  • Establish the current locale (eg. France, Spain), and - more importantly perhaps - the language that the user wants to use (eg. French, Spanish, Japanese, etc.)

  • At appropriate places in the source code, request translation of strings that are going to be displayed. For example, where formerly it read:

    
      MessageBox ("The proxy server address cannot be blank.");
    


    It now reads:


    
      MessageBox (Translate ("The proxy server address cannot be blank."));
    


    The extra function call to 'Translate' requests that the message "The proxy server address cannot be blank." be converted from English into a message suitable for the user's locale.

    The default behaviour - if you do nothing else - will be to simply return the original message. Thus, the default behaviour is to see the messages in English.

    To simplify this particular operation, which is done quite a lot, a special function call does both - translates and displays a message box (there are over 300 of them):

    
      TMessageBox ("The proxy server address cannot be blank.");
    


  • In order to facilitate the translation, an automated scan of the source code is done, to locate such messages, and write them to a disk file.

    This produces a file which contains stuff like this:

    
    #: doc.cpp:840
    msgid "The proxy server address cannot be blank."
    msgstr ""
    


  • To facilitate the translation of more complex things (like strings with embedded variables) I have decided to use Lua tables as the translation medium, so this file is now pre-processed into a big Lua table, like this:

    
    messages = {
    
    -- doc.cpp:813
      ["Cannot connect. World name not specified"] =
        "",
    
    -- doc.cpp:840
      ["The proxy server address cannot be blank."] =
        "",
    
    --- and so on ...
      }
    


    Effectively, each unique message is stored in the table, with the original message as the key. Thus a keyed lookup, which is very fast, can find the replacement.

  • This file, which is called the "template" will be distributed with MUSHclient (or made available on the web site). So far it doesn't do a huge amount that is useful, because there are no translations in it yet.

  • People who are interested in localizing - that is, making a translation into a particular language - will make a copy of that file under an appropriate name. For example:


    <MUSHclient executable directory>\locale\DE.lua


    They then edit the copy, and, for each message, devise a translation. Thus the message might now look like this:

    
    messages = {
    
    -- doc.cpp:813
      ["Cannot connect. World name not specified"] =
        "Kann nicht anschließen. Weltname nicht spezifiziert.",
    
    -- doc.cpp:840
      ["The proxy server address cannot be blank."] =
        "Die proxy serveradresse kann nicht leer sein.",
    
    --- and so on ...
      }
    


    There is no tearing rush to convert all messages - you could just do the common ones. Any that are left as an empty string will continue to be shown in English.

  • Once this localized file is updated, the next time MUSHclient is started it will read in the appropriate file, keeping the translations in memory.

  • When it is time to display a message the 'Translate' function will lookup the old message, find the new one (if it exists) and display that instead.
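The lookup-with-fallback behaviour described in the steps above can be sketched roughly as follows. This is only an illustration of the idea, not the MUSHclient code: the MessageTable type and the LookupTranslation name are invented here, and the table is passed in explicitly to keep the example self-contained.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical in-memory form of the "messages" table:
// original English message -> translation (empty if not yet translated)
typedef std::unordered_map<std::string, std::string> MessageTable;

// Return the translation if there is a non-empty one,
// otherwise fall back to the original English message
std::string LookupTranslation (const MessageTable & messages, const std::string & original)
  {
  MessageTable::const_iterator it = messages.find (original);

  if (it == messages.end () || it->second.empty ())
    return original;    // not in table, or left as "" by the translator

  return it->second;
  }
```

With the DE.lua example above loaded into such a table, looking up "The proxy server address cannot be blank." would return the German text, while any message still set to an empty string would come back unchanged in English.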

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Zeno   USA  (2,867 posts)   Moderator
Date Mon 11 Jun 2007 02:20 AM (UTC)
Message
I get:
English_United States.1252


(I find it strange that it underscores the first space but not the next)
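On the underscore: in the Windows C runtime's locale names it is a separator rather than a stand-in for a space. The name has the form language_country.codepage, so "United States" is simply a country name that happens to contain a space. A small sketch of splitting such a name apart (the LocaleParts struct and ParseLocaleName function are made up for this example):

```cpp
#include <string>

// The three parts of a name such as "English_United States.1252"
struct LocaleParts
  {
  std::string language;
  std::string country;
  std::string codepage;
  };

// Split a locale name of the form language[_country[.codepage]]
LocaleParts ParseLocaleName (const std::string & name)
  {
  LocaleParts parts;
  std::string rest = name;

  // the code page follows the last dot, if there is one
  std::string::size_type dot = rest.rfind ('.');
  if (dot != std::string::npos)
    {
    parts.codepage = rest.substr (dot + 1);
    rest.erase (dot);
    }

  // the country follows the first underscore, if there is one
  std::string::size_type underscore = rest.find ('_');
  if (underscore != std::string::npos)
    {
    parts.country = rest.substr (underscore + 1);
    rest.erase (underscore);
    }

  parts.language = rest;
  return parts;
  }
```

Applied to "English_United States.1252" this yields the language "English", the country "United States" and the code page "1252".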

Zeno McDohl,
Owner of Bleached InuYasha Galaxy
http://www.biyg.org

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Mon 11 Jun 2007 02:05 AM (UTC)
Message
Well I am making some progress with localization. Before I go any further I am interested to see what the default locale is for various users.

Can anyone who is reading this, please make Lua your scripting language (if necessary), and then enter this line into the command window:


/print (os.setlocale ("", "all"))


For me, that prints:


English_Australia.1252


If you get something else printed, please post a message pasting the exact thing it says. Don't bother if someone already has, with the same thing in it.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Sun 10 Jun 2007 08:52 PM (UTC)
Message
I'm a bit puzzled it was that easy - this was in the full MUSHclient compile?

MUSHclient isn't a Unicode application and probably cannot be made one now without heaps of work.

The normal AfxMessageBox function (in a non-Unicode application) expects 8-bit data.

I have made a workaround by writing a helper function that calls MessageBoxW (not AfxMessageBox), with UTF-8 data being supplied to the function. First it converts the data to 16-bit Unicode, and then it calls MessageBoxW with that.

This seems to work on XP - on my copy of NT at least, the system font didn't support the characters I tested.

As for changing things like dialog boxes - do you think you could make a copy of the resources that use Russian characters, and I could merge them into the existing source? I don't have the .NET compiler (yet, anyway).

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Ked   Russia  (524 posts)
Date Sun 10 Jun 2007 11:27 AM (UTC)
Message
I've performed a couple of simple tests of the Unicode issue.

I've found that trying to use Unicode with any of the early versions of VS is probably hopeless. At least VS.NET (ver. 7) failed to render Russian text properly no matter what settings I changed and what codepages I selected. Some googling revealed that VS2005 (ver. 8) does support Unicode more or less properly, so that's what I was using for testing.

Firstly, I've managed to fix the problem with menus in the resource file that I've already mentioned earlier - Russian characters that I enter in the properties window are replaced with question marks both in the design view and during runtime. This was in VS.NET. The fix in VS2005 wasn't immediately obvious either: I had to close the solution, open the RC as a separate file, edit it to replace the English text with Russian, and save it as Unicode. After that I was able to compile the solution and all characters were properly displayed during execution.

Secondly, I've tested the AfxMessageBox call. I haven't hooked up gettext yet, so, as with the resource file, I saved the individual source file (doc.cpp) as Unicode and edited one of the messages:

      //strMsg.Format ("Unable to resolve host name for \"%s\", code = %i (%s)",
      strMsg.Format ("Невозможно найти адрес \"%s\", код = %i (%s)",
                      (const char *) strWhich,
                      WSAGETASYNCERROR (lParam),
                      GetSocketError (WSAGETASYNCERROR (lParam)));
      if (App.m_bErrorNotificationToOutputWindow)
        Note (strMsg);
      else
        ::AfxMessageBox (strMsg);


This message was also properly displayed. So, at least with VS2005, Unicode can be used without too much trouble.

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Sun 10 Jun 2007 01:03 AM (UTC)

Amended on Sun 10 Jun 2007 01:05 AM (UTC) by Nick Gammon

Message
A quick scan of the source shows there are around 321 calls to AfxMessageBox - so that is around 321 warning or information messages that need translation.

Some of them (perhaps half) are parameterized, which adds the complexity of handling the parameter(s). For example:


"Replace your typing of (something) with (something else)?"


Or:


"1 trigger, 2 aliases imported"


On top of that would be other messages, like:


"Welcome to MUSHclient version 4.06!"


And, script error messages:


Compile error
World: (world name here)
Immediate execution
(nature of error here)


Then there are things like the list of plugins, where the column headings are hard-coded into the code:


 m_ctlPluginList.InsertColumn(eColumnName, "Name", LVCFMT_LEFT, iColWidth [eColumnName]);
 m_ctlPluginList.InsertColumn(eColumnPurpose, "Purpose", LVCFMT_LEFT, iColWidth [eColumnPurpose]);
 m_ctlPluginList.InsertColumn(eColumnAuthor, "Author", LVCFMT_LEFT, iColWidth [eColumnAuthor]);
 m_ctlPluginList.InsertColumn(eColumnLanguage, "Language", LVCFMT_LEFT, iColWidth [eColumnLanguage]);
 m_ctlPluginList.InsertColumn(eColumnFile, "File", LVCFMT_LEFT, iColWidth [eColumnFile]);


- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (18,770 posts)   Forum Administrator
Date Sat 09 Jun 2007 11:41 PM (UTC)
Message
A bit more reading of the gettext documentation indicates that the plural form is even more complex than I realised. In some languages, apparently, there are more than 2 forms. Conceptually it is similar to English ordinal numbers, eg.


1st      -- form A
2nd      -- form B
3rd      -- form C
4th      -- form D
5th      -- form D again
21st     -- form A again
22nd     -- form B again


Also apparently some languages have a different syntax for zero of something (eg. 0 dogs, 1 dog, 2 dogs ... would be 3 different forms in some languages).
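To make that concrete, gettext handles it by having each language supply a rule that maps a count to a form number. Below is a sketch of two such rules written as plain C++ functions. The function names are invented for this example; the Russian rule follows the formula given in the gettext manual, and forms are numbered from 0.

```cpp
// English: 2 forms (0 = singular, 1 = plural)
unsigned PluralIndexEnglish (unsigned long n)
  {
  return n != 1 ? 1 : 0;
  }

// Russian: 3 forms, chosen from the last one or two digits of the count
//   form 0: 1, 21, 31 ...  (but not 11)
//   form 1: 2-4, 22-24 ... (but not 12-14)
//   form 2: everything else (0, 5-20, 25-30 ...)
unsigned PluralIndexRussian (unsigned long n)
  {
  if (n % 10 == 1 && n % 100 != 11)
    return 0;

  if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
    return 1;

  return 2;
  }
```

A translation would then hold one string per form and pick the one at the returned index, much as the ordinal table above picks "st", "nd", "rd" or "th".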

- Nick Gammon

www.gammon.com.au, www.mushclient.com

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).