[LC++] How to generate UTF-8 output?

Chris Vine chris at cvine.freeserve.co.uk
Wed Nov 30 04:40:05 UTC 2005


On Tuesday 29 November 2005 13:08, Torsten Rennett wrote:
> Thank you Chris for your suggestions.
>
> On Friday 25 November 2005 00:59, Chris Vine wrote:
> > On Thursday 24 November 2005 22:49, Chris Vine wrote:
> > ..., but where you have a single language
> > application, using printf() is a good way of catering for different
> > narrow character codesets.  Did you get the example from a textbook or
> > someone else's code?
>
> Yes, the C example is from this WebPage:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
>
> The C++ example is my own attempt to do the same thing in C++.
>
> > To use C++ streams, you would probably have to imbue a code conversion
> > facet into your narrow character stream to convert wide characters to
> > narrow characters.  Perhaps your compiler already does this - have you
> > tried?
>
> Yes, I have.
>
> gcc-3.3.5 ships with 'class __enc_traits' and a partial specialization
> of class codecvt:
>   template<typename _InternT, typename _ExternT>
>   class codecvt<_InternT, _ExternT, __enc_traits>
>     : public __codecvt_abstract_base<_InternT, _ExternT, __enc_traits>
>
> You will find this in the header
> /usr/include/c++/3.3/i486-linux/bits/codecvt_specializations.h
>
> Documentation:
> http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/codecvt.html
>
> This approach tries to merge C++ 'class codecvt' and the X/Open iconv(3),
> and I really think that's the way to go. The example in the above
> mentioned documentation uses the facet and 'class __enc_traits' directly,
> but not indirectly through iostreams (+ imbue) like in
> 	wcout << L"Schöne Grüße";
>
> From what I've found so far, this will generally not be possible with a
> partial specialization!  I'll comment more on this and a possible solution
> (which I'm currently testing) in a few days, as I'm quite busy at the moment.

That's great.  I would be interested to hear how you do it.  glib has some
easy-to-use code conversion functions which I use a lot, and which you could
call from inside a code conversion facet class.
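
For illustration only, here is a minimal sketch of what a hand-written code
conversion facet for output could look like.  It is not Torsten's solution, it
does not use glib or __enc_traits, and the class name Utf8OutFacet is made up
for this example.  It assumes wchar_t holds one complete code point (as with
glibc, where wchar_t is 32 bits), it needs a C++11 compiler, and it only
implements the wide-to-narrow direction.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical facet: wide characters in, UTF-8 bytes out.
// Assumption: each wchar_t is a whole Unicode code point (no UTF-16 pairs).
// Input conversion (do_in) is left to the base class and is not UTF-8 aware.
class Utf8OutFacet : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit Utf8OutFacet(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            const unsigned long cp = static_cast<unsigned long>(*from_next);
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return error;                   // not a Unicode scalar value
            const int n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (to_end - to_next < n)
                return partial;                 // output buffer is full
            if (n == 1) {
                *to_next++ = static_cast<char>(cp);
            } else {
                // Lead byte: n high bits set, followed by the top payload bits.
                static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
                *to_next++ = static_cast<char>(lead[n] | (cp >> (6 * (n - 1))));
                // Continuation bytes: 10xxxxxx, six payload bits each.
                for (int i = n - 2; i >= 0; --i)
                    *to_next++ = static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
            }
            ++from_next;
        }
        return ok;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }    // variable width
    int do_max_length() const noexcept override { return 4; }  // bytes per wchar_t
};
```

With a wide file stream the facet would be imbued before the file is opened,
for example f.imbue(std::locale(std::locale::classic(), new Utf8OutFacet))
followed by f.open(...).  Whether imbuing has any effect on std::wcout depends
on the library implementation, particularly while wcout is synchronised with
C stdio.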

The way Unix-like systems are moving, however, is to use UTF-8 throughout
(rather than wide characters) to implement Unicode.  You can then save the
UTF-8 directly to disk as a locale-independent (and, except for some East
Asian character sets, optimally small) way of storing data, so code
conversion facets are not necessary.

If I were using wide characters in an application I would be inclined to use a
wchar_t-aware console/xterm for display, and when saving data just write the
raw wchar_t bytes directly to disk.  The stored data is then only usable on
systems with the same endian-ness, so you need a code conversion facet if you
want to deal with that.  This is more problematic for Windows systems, which
use either UCS-2 or (more recently) UTF-16 (UTF-16 being in my view the worst
of all worlds).  I think Windows now enables you to set the storage endian-ness
for its streams, although as I do not use Windows I may be wrong about that.
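
To make the endian-ness point concrete, here is a sketch of the alternative to
dumping raw wchar_t bytes: write each wide character as four bytes in a fixed
order (most significant first), so the file reads back the same on any host.
The function names are invented for this example, and it assumes every
wchar_t value fits in 32 bits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: serialise wide characters as 4 bytes each,
// big-endian, independent of the byte order of the machine running it.
std::string to_ucs4_be(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        const unsigned long c = static_cast<unsigned long>(ws[i]) & 0xFFFFFFFFUL;
        out += static_cast<char>((c >> 24) & 0xFF);
        out += static_cast<char>((c >> 16) & 0xFF);
        out += static_cast<char>((c >> 8) & 0xFF);
        out += static_cast<char>(c & 0xFF);
    }
    return out;
}

// Hypothetical helper: the inverse of to_ucs4_be.
// Trailing bytes that do not form a complete 4-byte group are ignored.
std::wstring from_ucs4_be(const std::string& bytes)
{
    std::wstring ws;
    for (std::size_t i = 0; i + 3 < bytes.size(); i += 4) {
        unsigned long c = 0;
        for (int k = 0; k < 4; ++k)
            c = (c << 8) | static_cast<unsigned char>(bytes[i + k]);
        ws += static_cast<wchar_t>(c);
    }
    return ws;
}
```

The same shifting could equally be done inside the do_out/do_in members of a
code conversion facet; plain functions just keep the sketch short.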

Chris




