From: Martin B. on 24 Jul 2010 23:31

On 24.07.2010 15:55, Stanley Friesen wrote:
> "Martin B."<0xCDCDCDCD(a)gmx.at> wrote:
>> Stanley Friesen wrote:
>>> "joe"<jc1996(a)att.net> wrote:
>>>
>>>> Francis Glassborow wrote:
>>>>> joe wrote:
>>>> [...]
>>>>> Anyway this has got very far from C++ where we certainly do need a way
>>>>> to handle text in more than just American English.
>>>> Not far at all from C++ given that it has lame support for Unicode,
>>>
>>> In C++0X there is actually considerable support. It allows many
>>> [...]
>>
>> As I see it, some support is added for better handling of unicode at
>> compile time. (Uni character literals, charXX_t, etc.)
>>
>> We are left with the same mess we always had at runtime. (modulo
>> char32_t, maybe):
>> [...]
> [...]
>> * No way to tell what character set a char* is encoded in (and this will
>> get worse with compile-time u8 constants).
>> * std::exception works only with char*
>
> Which still allows UTF-8 strings.

std::exception allows for UTF-8 strings. Yes. It already does this,
C++0x doesn't add anything in this regard.

cheers,
Martin

-- 
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Stanley Friesen on 28 Jul 2010 12:34

Mathias Gaunard <loufoque(a)gmail.com> wrote:
>On Jul 24, 2:55 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>> Mathias Gaunard <loufo...(a)gmail.com> wrote:
>> >On Jul 22, 1:26 pm, Stanley Friesen <sar...(a)friesen.net> wrote:
>>
>> >> It provides conversions between the three main
>> >> representations (UTF-8, UTF-16, and UTF-32).
>>
>> >Not really in a way that is practical to use though.
>>
>> Well, sstreams (string streams) should provide that capability, even if
>> that is a trifle clumsy.
>
>string streams do not invoke codecvt facets, only file streams do.

Oops, I forgot that point.

>Also note most current implementations do not allow N to M conversion
>with codecvt facets, and only allow one-way 1 to N (in-memory fixed
>width, in-file variable-width), so I'd be quite careful about this.

I challenge this one, however. The standard facet
codecvt<char16_t, char, mbstate_t> is required to support conversions
between UTF-16 and UTF-8 (22.4.1.4, para 3). Failure to properly
convert surrogate pairs is a failure to support UTF-16, as that is the
difference between UTF-16 and UCS-2. And the draft standard clearly
incorporates the distinction, since the "extra" facet
codecvt_utf8<Elem> is explicitly specified to convert to and from
either UCS-2 or UCS-4 (depending on Elem).

>
>The alternative is applying the codecvt facet directly, which has a
>fairly ugly interface and requires static contiguous buffers.

Yes, I agree it is a touch clumsy. The best way to use it would be to
wrap it in a simplified library interface.

>
>What we truly need is an iterator-based interface, that basically
>behaves like std::copy, or better yet, iterator adaptors that convert
>as you iterate.

Hmm, this may be tricky to specify, given the nature of the
conversions. Dereferencing such an iterator would have to resolve to
some sort of container (e.g. a specialization of basic_string), as
there is no guarantee that the result will be a single code.

>But that's not sufficient, you also need ways to segment strings
>(graphemes, words, sentences), do normalization, case conversion, etc.
>None of which are nowhere near possible in C++0x.

And I maintain that they are beyond the scope of the C++ standard.
These are things, I think, that should be supplied as domain
libraries, since different systems may well require different
performance trade-offs.

>
>
>> >> The only thing it lacks that I see as a
>> >> substantial issue is UTF-16 and/or UTF-32 iostreams. This is
>> >> unfortunate, as both Windows and modern Unix support such files at the
>> >> OS level.
>>
>> >basic_istream<char16_t> etc. should work just fine.
>>
>> That will read or write a UTF-8 file, not a UTF-16/UTF-32 file. The
>> specification is quite clear - it is required to apply the appropriate
>> codecvt facet.
>
>That's not a problem at the stream level, but at the filebuf level.
>File streams invoke codecvt facets to convert from their type to char
>because filebufs are char-based.
>

Though this means one would need to instantiate one's own type of file
buffers to get basic_istream<char16_t> to actually input from a UTF-16
external file. This goes beyond merely clumsy to manifestly
labyrinthine. This is, I maintain, something that *should* be
standardized in the language, as it is widely useful, difficult to get
right, and has few design issues that would make alternative
implementations useful.

>
>> >So GCC, the most widely used C and C++ compiler, is not a decent
>> >development environment?
>>
>> It is not a development environment at all, it is just a compiler. A
>> development environment includes build configuration, syntax-aware
>> editing, syntax-aware searches and so on.
>
>Looks like you only know the world of software development as you see
>it through your Microsoft Visual Studio window.
>

No. That also describes the Tornado development environment for
VxWorks, and its successor, as well as the development environments
for several other similar OSes. There is also Eclipse, which is an
OS- and compiler-independent development environment. It can be
configured to use gcc as the compiler.

>
>
>> >As was clearly stated in the parent message, GCC only supports
>> >inputting unicode characters in identifiers as escape codes.
>>
>> I understand. I also do not consider GCCs C++0X support complete as of
>> now.
>
>You said that any decent development environment that exists supports
>it NOW.

My point is that it is too early to judge whether the lack of support
for directly encoded Unicode-extended identifiers is going to remain
true of complete C++0X implementations. It is clear the draft standard
was written to permit such an implementation under the as-if rule (it
even explicitly says this is allowed). Whether, once the new standard
is approved and final, any major vendors will actually implement that
feature remains to be seen.

-- 
The peace of God be with you.

Stanley Friesen

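
For readers following the surrogate-pair argument in the post above, here is a minimal hand-written sketch of what "supporting UTF-16 rather than UCS-2" means in a UTF-16 to UTF-8 conversion. It is not the codecvt interface and not anything proposed in the thread; the names append_utf8 and utf16_to_utf8 are hypothetical helpers introduced only for illustration, and error handling is reduced to throwing std::invalid_argument.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a standard function): append one Unicode
// code point to `out` as 1 to 4 UTF-8 code units.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: inside the Unicode code space
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8. A UCS-2-only converter
// would skip the surrogate branch and encode each 16-bit unit on its own.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high (leading) surrogate
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            const char32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            // Combine the two 10-bit halves into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Calling utf16_to_utf8(u"\U0001F600") consumes two UTF-16 code units and produces four UTF-8 code units. That is a 2-to-4 mapping, the kind of N to M conversion the quoted text is concerned with.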
From: Mathias Gaunard on 29 Jul 2010 03:59

On Jul 29, 4:34 am, Stanley Friesen <sar...(a)friesen.net> wrote:

> Hmm, this may be tricky to specify, given the nature of the conversions.
> Dereferencing such an iterator would have to resolve to some sort of
> container (e.g. a specialization of basic_string), as there is no
> guarantee that the result will be a single code.

Huh?
Just return the results in multiple iteration steps. Iterator adaptors
do not have to be one-to-one...

See my Unicode library if you want examples.
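
The idea of an adaptor that is not one-to-one can be sketched as follows. This is only an illustrative input iterator, written under the assumption that the source range holds valid UTF-32 code points; it is not taken from the Unicode library mentioned in the post, and the names utf8_encoding_iterator and to_utf8 are hypothetical. Each source code point is expanded into a small internal buffer, and successive increments step through that buffer before advancing the underlying iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: wraps an iterator over char32_t code points and
// yields UTF-8 code units, one per iteration step.
template <typename It>
class utf8_encoding_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = char;

    utf8_encoding_iterator(It pos, It end) : pos_(pos), end_(end) { load(); }

    char operator*() const { return buf_[idx_]; }

    utf8_encoding_iterator& operator++() {
        if (++idx_ == len_) {  // buffer exhausted: move to next code point
            ++pos_;
            load();
        }
        return *this;
    }
    utf8_encoding_iterator operator++(int) {
        utf8_encoding_iterator tmp = *this;
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_encoding_iterator& a,
                           const utf8_encoding_iterator& b) {
        return a.pos_ == b.pos_ && a.idx_ == b.idx_;
    }
    friend bool operator!=(const utf8_encoding_iterator& a,
                           const utf8_encoding_iterator& b) {
        return !(a == b);
    }

private:
    // Encode the current code point (if any) into buf_.
    void load() {
        idx_ = 0;
        len_ = 0;
        if (pos_ == end_) return;
        const char32_t cp = *pos_;
        assert(cp <= 0x10FFFF);  // assumption: input is valid UTF-32
        if (cp < 0x80) {
            buf_[len_++] = static_cast<char>(cp);
        } else if (cp < 0x800) {
            buf_[len_++] = static_cast<char>(0xC0 | (cp >> 6));
            buf_[len_++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            buf_[len_++] = static_cast<char>(0xE0 | (cp >> 12));
            buf_[len_++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            buf_[len_++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            buf_[len_++] = static_cast<char>(0xF0 | (cp >> 18));
            buf_[len_++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            buf_[len_++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            buf_[len_++] = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    It pos_;
    It end_;
    char buf_[4] = {};
    std::size_t idx_ = 0;
    std::size_t len_ = 0;
};

// Hypothetical convenience wrapper: drive the adaptor through
// std::string's iterator-range constructor.
inline std::string to_utf8(const std::u32string& s) {
    using iter = utf8_encoding_iterator<std::u32string::const_iterator>;
    return std::string(iter(s.begin(), s.end()), iter(s.end(), s.end()));
}
```

Dereferencing always yields a single char, so no container is needed as the value type. The cost is that the adaptor carries the end of the source range and a four-byte buffer, and this sketch models only an input iterator.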