[LC++] How to generate UTF-8 output?
Chris Vine
chris at cvine.freeserve.co.uk
Wed Nov 30 04:40:05 UTC 2005
On Tuesday 29 November 2005 13:08, Torsten Rennett wrote:
> Thank you Chris for your suggestions.
>
> On Friday 25 November 2005 00:59, Chris Vine wrote:
> > On Thursday 24 November 2005 22:49, Chris Vine wrote:
> > ..., but where you have a single language
> > application, using printf() is a good way of catering for different
> > narrow character codesets. Did you get the example from a textbook or
> > someone else's code?
>
> Yes, the C example is from this WebPage:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
>
> The C++ example is my own attempt to do the same thing in C++.
>
> > To use C++ streams, you would probably have to imbue a code conversion
> > facet into your narrow character stream to convert wide characters to
> > narrow characters. Perhaps your compiler already does this - have you
> > tried?
>
> Yes, I have.
>
> gcc-3.3.5 ships with 'class __enc_traits' and a partial specialization
> of class codecvt:
> template<typename _InternT, typename _ExternT>
>   class codecvt<_InternT, _ExternT, __enc_traits>
>   : public __codecvt_abstract_base<_InternT, _ExternT, __enc_traits>
>
> You will find this in the header
> /usr/include/c++/3.3/i486-linux/bits/codecvt_specializations.h
>
> Documentation:
> http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/codecvt.html
>
> This approach tries to merge the C++ 'class codecvt' with the X/Open
> iconv(3), and I really think that's the way to go. The example in the
> documentation mentioned above uses the facet and 'class __enc_traits'
> directly, rather than indirectly through iostreams (+ imbue), as in
> wcout << L"Schöne Grüße";
>
> From what I've found so far, this is generally not possible with a partial
> specialization! I'll comment more on this and on a possible solution (which
> I'm currently testing) in a few days, as I'm quite busy at the moment.

That's great. I would be interested to hear how you do it. glib has some
easy-to-use code conversion functions which I use a lot, and which you could
use in a code conversion facet class.

The way Unix-like systems are moving, however, is to use UTF-8 throughout
(rather than wide characters) to implement Unicode. You can then save the
UTF-8 directly to disk as a locale-independent (and, except for some eastern
character sets, optimally small) way of storing data, so code conversion
facets are not necessary.
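
As a rough illustration of why no facet is needed in that case, the sketch
below encodes Unicode code points as UTF-8 bytes by hand. append_utf8 and
to_utf8 are hypothetical helper names, not part of any library, and the use of
char32_t and std::u32string assumes a C++11 compiler, which is newer than the
gcc-3.3.5 discussed above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-8 string.
// Surrogates and values above U+10FFFF are rejected by the assert.
void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole string of code points as UTF-8.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        append_utf8(out, cp);
    return out;
}
```

The resulting std::string is just a byte sequence with no byte-order issue, so
it can be written as-is to std::cout or to a std::ofstream. For
"Schöne Grüße" it holds 15 bytes, against 48 for twelve 4-byte wide
characters.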

If I were using wide characters in an application I would be inclined to use a
wchar_t-aware console/xterm for display, and when saving data just write the
bytes directly to disk. The stored data is then only usable on systems with
the same endianness, so you need a code conversion facet if you want to deal
with that. This is more problematic for Windows systems, which use either
UCS-2 or (more recently) UTF-16 (UTF-16 being in my view the worst of all
worlds). I think Windows now lets you set the storage endianness for its
streams, although as I do not use Windows I may be wrong about that.
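
To make the endianness point concrete, here is a minimal sketch of the kind of
conversion such a facet would have to do: it writes each character as four
bytes, most significant byte first, whatever the host byte order. The names
to_utf32be and from_utf32be are hypothetical, and char32_t (C++11) is used
instead of wchar_t on the assumption that a fixed 32-bit type is wanted, since
the width of wchar_t differs between platforms.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: serialise code points as 4 bytes each, big-endian,
// so the stored data does not depend on the byte order of the host.
std::string to_utf32be(const std::u32string& text)
{
    std::string out;
    out.reserve(text.size() * 4);
    for (char32_t cp : text) {
        out += static_cast<char>((cp >> 24) & 0xFF);
        out += static_cast<char>((cp >> 16) & 0xFF);
        out += static_cast<char>((cp >> 8) & 0xFF);
        out += static_cast<char>(cp & 0xFF);
    }
    return out;
}

// Hypothetical helper: the reverse conversion. Trailing bytes that do not
// make up a complete 4-byte group are ignored.
std::u32string from_utf32be(const std::string& bytes)
{
    std::u32string text;
    for (std::size_t i = 0; i + 3 < bytes.size(); i += 4) {
        char32_t cp =
            (static_cast<char32_t>(static_cast<unsigned char>(bytes[i])) << 24) |
            (static_cast<char32_t>(static_cast<unsigned char>(bytes[i + 1])) << 16) |
            (static_cast<char32_t>(static_cast<unsigned char>(bytes[i + 2])) << 8) |
            static_cast<char32_t>(static_cast<unsigned char>(bytes[i + 3]));
        text += cp;
    }
    return text;
}
```

Data written this way reads back the same on big-endian and little-endian
machines, at the cost of four bytes per character.
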
Chris