Scott Ferguson wrote:
> On May 12, 2006, at 3:32 AM, Petr Gladkikh wrote:
>> about lengths of UTF-8 encoded data:
>
> It's length in 16-bit characters for strings and XML. That reduces the
> computation needed on both ends when the language represents the string
> with characters like Java (as opposed to an encoded byte array).
>
> -- Scott

This is an interesting issue. For languages like Java, where a string
is a sequence of 16-bit characters, encoding the string length in units
of 16 bits is the most straightforward thing to do.

For C++ this is less straightforward. If wchar_t is 32 bits, then
sending a wchar_t string takes two passes if characters beyond
0xffff are to be supported: one pass to compute the UTF-16 length,
another to send the data (unless you can patch the length in
afterwards).
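
The two passes could look roughly like the sketch below. It uses
std::u32string as a stand-in for a 32-bit wchar_t string, and the
function names utf16_length and to_utf16 are made up for illustration.
It assumes the input is valid UTF-32 (no lone surrogates, nothing above
0x10FFFF) and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pass 1: count the UTF-16 code units needed for a UTF-32 string.
// Code points above 0xFFFF need a surrogate pair, i.e. two units.
// (Hypothetical helper; assumes valid UTF-32 input.)
std::size_t utf16_length(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        n += (c > 0xFFFF) ? 2 : 1;
    }
    return n;
}

// Pass 2: produce the UTF-16 code units themselves.
// (Hypothetical helper; assumes valid UTF-32 input.)
std::vector<std::uint16_t> to_utf16(const std::u32string& s) {
    std::vector<std::uint16_t> out;
    out.reserve(utf16_length(s));
    for (char32_t c : s) {
        if (c > 0xFFFF) {
            // Split into high and low surrogates.
            char32_t v = c - 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(c));
        }
    }
    return out;
}
```

A sender that can buffer the whole encoded string could skip pass 1 and
take the length from the buffer, at the cost of the extra memory.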

To send a native C++ string, you first have to convert it to a
wstring, since the built-in conversions are only supported
between string and wstring. So you go from 8-bit -> wchar_t -> UTF-16,
which is a multi-step process. Receiving an 8-bit string requires the
reverse UTF-16 -> wchar_t -> 8-bit conversion, also multi-step.
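
As a rough sketch of the sending direction: the first step below uses
std::mbstowcs, which interprets the bytes according to the current C
locale and stops at the first NUL byte. The helper names narrow_to_wide
and wide_to_utf16 are hypothetical, and the second step assumes that
wchar_t holds either UTF-32 code points or UTF-16 code units, which the
standard does not guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Step 1: 8-bit string -> wstring, using the current C locale.
// (Hypothetical helper; input is cut at the first embedded NUL.)
std::wstring narrow_to_wide(const std::string& s) {
    // The wide string never has more elements than the input has bytes.
    std::wstring w(s.size(), L'\0');
    if (s.empty()) return w;
    std::size_t n = std::mbstowcs(&w[0], s.c_str(), w.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence");
    }
    w.resize(n);
    return w;
}

// Step 2: wstring -> UTF-16 code units.
// Assumption: each wchar_t is a UTF-32 code point (32-bit wchar_t) or
// already a UTF-16 code unit (16-bit wchar_t, passed through as-is).
std::vector<std::uint16_t> wide_to_utf16(const std::wstring& w) {
    std::vector<std::uint16_t> out;
    for (wchar_t wc : w) {
        std::uint32_t c = static_cast<std::uint32_t>(wc);
        if (c > 0xFFFF) {
            std::uint32_t v = c - 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(c));
        }
    }
    return out;
}
```

The length to put on the wire would then be
wide_to_utf16(narrow_to_wide(s)).size(), which is only known after both
conversions have run.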
Thomas Wang
Received on Fri 12 May 2006 11:42:17 -0700