From: Robert Haas on
On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> Here's another thought. Given that JSON is actually specified to consist
>> of a string of Unicode characters, what will we deliver to the client
>> where the client encoding is, say Latin1? Will it actually be a legal
>> JSON byte stream?
>
> No, it won't.  We will *not* be sending anything but latin1 in such a
> situation, and I really couldn't care less what the JSON spec says about
> it.  Delivering wrongly-encoded data to a client is a good recipe for
> all sorts of problems, since the client-side code is very unlikely to be
> expecting that.  A datatype doesn't get to make up its own mind whether
> to obey those rules.  Likewise, data on input had better match
> client_encoding, because it's otherwise going to fail the encoding
> checks long before a json datatype could have any say in the matter.
>
> While I've not read the spec, I wonder exactly what "consist of a string
> of Unicode characters" should actually be taken to mean.  Perhaps it
> only means that all the characters must be members of the Unicode set,
> not that the string can never be represented in any other encoding.
> There's more than one Unicode encoding anyway...

See sections 2.5 and 3 of:

http://www.ietf.org/rfc/rfc4627.txt?number=4627
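
As a rough illustration of how those two sections bear on the Latin1 question: section 3 expects JSON text to be encoded in Unicode (UTF-8 by default), while section 2.5 lets any character inside a string be written as a \uXXXX escape. The Python sketch below is only an illustration under those assumptions, not backend code. It uses the standard json module's ensure_ascii option to produce a text whose bytes are the same under Latin1 and UTF-8; the sample value is made up.

```python
import json

# A value with one character Latin1 can represent (é, U+00E9)
# and one it cannot (€, U+20AC). Sample data is hypothetical.
value = {"name": "café", "price": "5 €"}

# ensure_ascii=True writes every non-ASCII character as a \uXXXX escape,
# so the resulting text contains only ASCII characters.
text = json.dumps(value, ensure_ascii=True)

# Pure ASCII encodes to identical bytes in Latin1 and in UTF-8.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

# The escaped text parses back to the original value.
roundtrip = json.loads(as_latin1.decode("latin-1"))
```

With every non-ASCII character escaped, the same byte stream satisfies a Latin1 client_encoding and still reads as legal UTF-8 JSON.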

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Mike Rylander on
On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> Here's another thought. Given that JSON is actually specified to consist
>> of a string of Unicode characters, what will we deliver to the client
>> where the client encoding is, say Latin1? Will it actually be a legal
>> JSON byte stream?
>
> No, it won't.  We will *not* be sending anything but latin1 in such a
> situation, and I really couldn't care less what the JSON spec says about
> it.  Delivering wrongly-encoded data to a client is a good recipe for
> all sorts of problems, since the client-side code is very unlikely to be
> expecting that.  A datatype doesn't get to make up its own mind whether
> to obey those rules.  Likewise, data on input had better match
> client_encoding, because it's otherwise going to fail the encoding
> checks long before a json datatype could have any say in the matter.
>
> While I've not read the spec, I wonder exactly what "consist of a string
> of Unicode characters" should actually be taken to mean.  Perhaps it
> only means that all the characters must be members of the Unicode set,
> not that the string can never be represented in any other encoding.
> There's more than one Unicode encoding anyway...

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Whether it would be easy inside the backend to convert to UTF-8, when
generating JSON from user data stored in tables in a cluster that is not
UTF-8 encoded, is something else entirely. If it /is/ easy and
safe, then it's just a matter of scanning for multi-byte sequences and
replacing those with their \uXXXX equivalents. I have some simple and
fast code I could share, if it's needed, though I suspect it's not.
:)
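
A minimal sketch of the scan-and-replace step described above, written in Python rather than backend C and not the code offered in this message. The function name escape_non_ascii is hypothetical. It assumes the input is already-valid JSON text that has been decoded to characters, so it walks code points instead of raw multi-byte sequences, and it writes characters above U+FFFF as a surrogate pair as RFC 4627 section 2.5 describes.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with JSON \\uXXXX escapes.

    Assumes `text` is already-valid JSON; quotes and backslashes
    are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Basic Multilingual Plane: one four-hex-digit escape.
            out.append("\\u%04x" % cp)
        else:
            # Beyond the BMP: split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


escaped_doc = escape_non_ascii('{"k": "é"}')
escaped_clef = escape_non_ascii("\U0001D11E")
```

Here escaped_doc holds the text {"k": "\u00e9"} and escaped_clef holds \ud834\udd1e, the same surrogate pair the RFC uses for the G clef character.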

UPDATE: Thanks, Robert, for pointing to the RFC.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker(a)esilibrary.com
| web: http://www.esilibrary.com


From: Robert Haas on
On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com> wrote:
> In practice, every parser/serializer I've used (including the one I
> helped write) allows (and, often, forces) any non-ASCII character to
> be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.
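
To put a rough number on that expansion, here is a small Python sketch using the standard json module. The 1000-character sample string is an arbitrary assumption; the point is only the ratio between native UTF-8 output and fully escaped output.

```python
import json

# Each é (U+00E9) takes two bytes in UTF-8. Sample size is arbitrary.
raw = "é" * 1000

# ensure_ascii=False keeps the characters as-is; True forces \uXXXX escapes.
native = json.dumps(raw, ensure_ascii=False).encode("utf-8")
escaped = json.dumps(raw, ensure_ascii=True).encode("utf-8")

# Ignore the two surrounding quote characters of the JSON string.
native_payload = len(native) - 2
escaped_payload = len(escaped) - 2
```

For this input native_payload is 2000 bytes and escaped_payload is 6000 bytes, the two-byte to six-byte growth described above.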

....Robert


From: Andrew Dunstan on


Robert Haas wrote:
> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com> wrote:
>
>> In practice, every parser/serializer I've used (including the one I
>> helped write) allows (and, often, forces) any non-ASCII character to
>> be encoded as \u followed by a string of four hex digits.
>>
>
> Is it correct to say that the only feasible place where non-ASCII
> characters can be used is within string constants? If so, it might be
> reasonable to disallow characters with the high-bit set unless the
> server encoding is one of the flavors of Unicode of which the spec
> approves. I'm tempted to think that when the server encoding is
> Unicode we really ought to allow Unicode characters natively, because
> turning a long string of two-byte wide chars into a long string of
> six-byte wide chars sounds pretty evil from a performance point of
> view.
>
>
>

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in utf8
is 4 (and that's pretty rare, IIRC).
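
A quick Python check of those UTF-8 widths, as a sketch; the sample characters are arbitrary picks, one for each possible length.

```python
# Expected UTF-8 byte width for one sample character of each length.
samples = {
    "A": 1,           # U+0041, ASCII
    "é": 2,           # U+00E9
    "€": 3,           # U+20AC
    "\U0001D11E": 4,  # U+1D11E, outside the Basic Multilingual Plane
}

# Measured widths, computed by actually encoding each character.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# U+10FFFF is the highest Unicode code point, so this is the upper bound.
max_width = len(chr(0x10FFFF).encode("utf-8"))
```

The measured widths match the expected ones, and max_width comes out as 4.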

cheers

andrew


From: Andrew Dunstan on


Andrew Dunstan wrote:
>
>
> Robert Haas wrote:
>> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com>
>> wrote:
>>
>>> In practice, every parser/serializer I've used (including the one I
>>> helped write) allows (and, often, forces) any non-ASCII character to
>>> be encoded as \u followed by a string of four hex digits.
>>>
>>
>> Is it correct to say that the only feasible place where non-ASCII
>> characters can be used is within string constants? If so, it might be
>> reasonable to disallow characters with the high-bit set unless the
>> server encoding is one of the flavors of Unicode of which the spec
>> approves. I'm tempted to think that when the server encoding is
>> Unicode we really ought to allow Unicode characters natively, because
>> turning a long string of two-byte wide chars into a long string of
>> six-byte wide chars sounds pretty evil from a performance point of
>> view.
>>
>>
>>
>
> We support exactly one unicode encoding on the server side: utf8.
>
> And the maximum possible size of a validly encoded unicode char in
> utf8 is 4 (and that's pretty rare, IIRC).
>
>

Sorry. Disregard this. I see what you mean.

Yeah, I think *requiring* non-ASCII characters to be escaped would be evil.

cheers

andrew
