[LC++]How to generate UTF-8 output?

Chris Vine chris at cvine.freeserve.co.uk
Fri Nov 25 07:39:06 UTC 2005


On Thursday 24 November 2005 20:04, Torsten Rennett wrote:
> Hi,
>
> I want to generate UTF-8 output with C++, but it does not work.
>
> This little C program works as expected:
>
>     #include <locale.h>
>     #include <stdio.h>
>
>     int main(int argc, char *argv[])
>     {
>       (void)argc;
>       (void)argv;
>
>       if (!setlocale(LC_CTYPE, ""))
>       {
> 	fprintf(stderr, "Can't set the specified locale! "
> 		"Check LANG, LC_CTYPE, LC_ALL.\n");
> 	return 1;
>       }
>       printf("%ls\n", L"Schöne Grüße");
>       return 0;
>     }
>
> Call this program with the locale setting LANG=de_DE and the output will
> be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be
> in UTF-8.
>
> torsten@linux3:~$ LANG=de_DE print_utf8 | od -t x1
> 0000000 53 63 68 f6 6e 65 20 47 72 fc df 65 0a
> 0000015
> torsten@linux3:~$ LANG=de_DE.UTF-8 print_utf8 | od -t x1
> 0000000 53 63 68 c3 b6 6e 65 20 47 72 c3 bc c3 9f 65 0a
> 0000020
>
>
> Good, so far. Now the same thing in C++.
>
>     #include <locale>
>     #include <iostream>
>     using namespace std;
>
>     int main(int argc, char *argv[])
>     {
>       (void)argc;
>       (void)argv;
>
>       try
>       {
> 	// environment default (usually determined by $LANG)
> 	locale loc1("");
> 	cout << "loc1='" << loc1.name() << '\'' << endl;
> 	wcout.imbue(loc1);
>       }
>       catch (const exception &ex)
>       {
> 	cerr << "FAILED! " << ex.what() << endl;
>       }
>
>       wcout << L"Schöne Grüße";
>       cout << endl;
>
>       return 0;
>     }
>
> When I run this program, the output is as follows:
>     torsten@linux3:~$ LANG=de_DE print2_utf8
>     loc1='de_DE'
>     Sch
>
> As you can see, the output stops at the german Umlaut 'ö'. This is
> independent of the setting of $LANG.
>
> What's wrong?

Neither ISO 8859-1 nor UTF-8 is a wide character codeset.  Both are narrow (byte-oriented) encodings, so you should use them with narrow character streams.
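
To illustrate why UTF-8 belongs on a narrow stream, here is a minimal sketch that encodes a single Unicode code point as a sequence of ordinary chars.  The function name encode_utf8() is my own invention for illustration and not part of any library; the bit layout is the standard UTF-8 one.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 narrow
// chars (UTF-8).  Throws for surrogates and values above U+10FFFF.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored unchanged in a single byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

Calling encode_utf8(0xF6) for 'ö' gives the two bytes c3 b6 that appear in your od dump of the UTF-8 run, and the resulting std::string can be written to cout like any other narrow string.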

The other problem is that the wide string literal L"Schöne Grüße" is encoded at compile time and not at run time, so setting the locale at run time makes no difference to its character representation.

The fact that printf() turned out to do the right thing is interesting.  The "%ls" format specification caused the function (which carries out its substitutions at run time) to convert the wide characters back into narrow characters in the narrow character codeset of your runtime locale.  printf() uses wcrtomb() for this, which (according to its man page) consults the LC_CTYPE category of the current locale to convert from the wide character representation to the narrow character representation.  It therefore picked up the narrow character codeset established by the call to setlocale().
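
The same conversion can be done by hand in C++ and the result sent to a narrow stream.  The following is only a sketch of the idea, not what printf() literally does internally: to_multibyte() is a hypothetical helper name, and it assumes the target encoding is stateless (as ISO 8859-1 and UTF-8 are), so no final shift sequence is written.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cwchar>     // std::wcrtomb, std::mbstate_t
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wide string to the multibyte encoding
// selected by the LC_CTYPE category of the current C locale.
std::string to_multibyte(const std::wstring& wide)
{
    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    std::string out;
    char buf[MB_LEN_MAX];                      // room for one converted character

    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        // wcrtomb() returns the number of bytes written, or (size_t)-1
        // if the character cannot be represented in the current locale
        std::size_t n = std::wcrtomb(buf, wide[i], &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable in this locale");
        out.append(buf, n);
    }
    return out;
}
```

To use it, call std::setlocale(LC_CTYPE, "") from <clocale> first so that the environment's locale is in effect, then write to_multibyte(L"Schöne Grüße") to cout.  Without the setlocale() call the program runs in the "C" locale, where non-ASCII characters will usually make wcrtomb() fail.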

The interesting question is what wcrtomb() assumed about the wide character codeset it encountered.  It probably did the sensible thing and assumed a UCS-4 codeset if you have a 4 byte wchar_t (Linux has a 4 byte wchar_t and Windows a 2 byte wchar_t), and that happened to match the assumption your compiler made when it set up the wide character string literal at compile time.  To that extent, it looks to be a matter of luck that it worked.
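
If you want to check those assumptions on your own system, something along these lines should do.  The function names are hypothetical; __STDC_ISO_10646__ is the standard macro an implementation defines when wchar_t values are ISO 10646 (Unicode) code points, and whether it is visible in C++ depends on your compiler and C library.

```cpp
#include <cstddef>

// Size of wchar_t in bytes on this platform.
std::size_t wchar_bytes()
{
    return sizeof(wchar_t);
}

// True when the implementation states that wchar_t values are
// ISO 10646 code points.
bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// The numeric value the compiler stored for 'ö', written here as a
// universal character name so the result does not depend on the
// encoding of the source file.
unsigned long o_umlaut_value()
{
    return static_cast<unsigned long>(L'\u00f6');
}
```

If wchar_is_iso10646() returns true, o_umlaut_value() should be 0xF6, the Unicode code point for 'ö', which is the value wcrtomb() would need to see in order to produce the right bytes.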

Chris

