From: Robert Haas on 28 Mar 2010 20:16

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> Here's another thought. Given that JSON is actually specified to consist
>> of a string of Unicode characters, what will we deliver to the client
>> where the client encoding is, say Latin1? Will it actually be a legal
>> JSON byte stream?
>
> No, it won't. We will *not* be sending anything but latin1 in such a
> situation, and I really couldn't care less what the JSON spec says about
> it. Delivering wrongly-encoded data to a client is a good recipe for
> all sorts of problems, since the client-side code is very unlikely to be
> expecting that. A datatype doesn't get to make up its own mind whether
> to obey those rules. Likewise, data on input had better match
> client_encoding, because it's otherwise going to fail the encoding
> checks long before a json datatype could have any say in the matter.
>
> While I've not read the spec, I wonder exactly what "consist of a string
> of Unicode characters" should actually be taken to mean. Perhaps it
> only means that all the characters must be members of the Unicode set,
> not that the string can never be represented in any other encoding.
> There's more than one Unicode encoding anyway...

See sections 2.5 and 3 of:

http://www.ietf.org/rfc/rfc4627.txt?number=4627

...Robert
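For reference, section 2.5 of that RFC covers the \uXXXX escape form (with UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane), and section 3 notes that, because the first two characters of a JSON text are always ASCII, its Unicode encoding can be detected from the pattern of NUL octets in the first four bytes. A minimal sketch of that section 3 rule, in standalone C with made-up names (not code from the PostgreSQL tree):

    /*
     * Sketch of the encoding-detection rule from RFC 4627, section 3.
     * The enum and function names are hypothetical, for illustration only.
     */
    #include <stddef.h>

    typedef enum { JSON_UTF8, JSON_UTF16BE, JSON_UTF16LE,
                   JSON_UTF32BE, JSON_UTF32LE } json_encoding;

    static json_encoding
    detect_json_encoding(const unsigned char *buf, size_t len)
    {
        if (len >= 4)
        {
            if (buf[0] == 0 && buf[1] == 0 && buf[2] == 0 && buf[3] != 0)
                return JSON_UTF32BE;        /* 00 00 00 xx */
            if (buf[0] != 0 && buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
                return JSON_UTF32LE;        /* xx 00 00 00 */
            if (buf[0] == 0 && buf[1] != 0 && buf[2] == 0 && buf[3] != 0)
                return JSON_UTF16BE;        /* 00 xx 00 xx */
            if (buf[0] != 0 && buf[1] == 0 && buf[2] != 0 && buf[3] == 0)
                return JSON_UTF16LE;        /* xx 00 xx 00 */
        }
        return JSON_UTF8;                   /* xx xx xx xx (the default) */
    }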
From: Mike Rylander on 28 Mar 2010 20:23

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> Here's another thought. Given that JSON is actually specified to consist
>> of a string of Unicode characters, what will we deliver to the client
>> where the client encoding is, say Latin1? Will it actually be a legal
>> JSON byte stream?
>
> No, it won't. We will *not* be sending anything but latin1 in such a
> situation, and I really couldn't care less what the JSON spec says about
> it. Delivering wrongly-encoded data to a client is a good recipe for
> all sorts of problems, since the client-side code is very unlikely to be
> expecting that. A datatype doesn't get to make up its own mind whether
> to obey those rules. Likewise, data on input had better match
> client_encoding, because it's otherwise going to fail the encoding
> checks long before a json datatype could have any say in the matter.
>
> While I've not read the spec, I wonder exactly what "consist of a string
> of Unicode characters" should actually be taken to mean. Perhaps it
> only means that all the characters must be members of the Unicode set,
> not that the string can never be represented in any other encoding.
> There's more than one Unicode encoding anyway...

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Whether it would be easy inside the backend, when generating JSON from
user data stored in tables that are not in a UTF-8 encoded cluster, to
convert to UTF-8, that's something else entirely. If it /is/ easy and
safe, then it's just a matter of scanning for multi-byte sequences and
replacing those with their \uXXXX equivalents. I have some simple and
fast code I could share, if it's needed, though I suspect it's not. :)

UPDATE: Thanks, Robert, for pointing to the RFC.

--
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: miker(a)esilibrary.com
 | web: http://www.esilibrary.com
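A minimal sketch of the scan-and-escape pass Mike describes, assuming the input has already been validated as UTF-8. This is illustrative standalone C, not the code he offers to share; it handles only the non-ASCII escaping and leaves escaping of quotes, backslashes, and control characters to the caller. Code points above U+FFFF are written as a surrogate pair, as RFC 4627 section 2.5 requires.

    /*
     * Hypothetical example: rewrite the non-ASCII characters of a
     * validated UTF-8 string as JSON \uXXXX escapes.  ASCII bytes are
     * copied through unchanged.
     */
    #include <stdio.h>

    static void
    escape_utf8_as_json(const unsigned char *s, FILE *out)
    {
        while (*s)
        {
            unsigned int cp;

            if (s[0] < 0x80)                /* ASCII: copy through */
            {
                fputc(s[0], out);
                s += 1;
                continue;
            }
            else if ((s[0] & 0xE0) == 0xC0) /* 2-byte sequence */
            {
                cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
                s += 2;
            }
            else if ((s[0] & 0xF0) == 0xE0) /* 3-byte sequence */
            {
                cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6)
                   | (s[2] & 0x3F);
                s += 3;
            }
            else                            /* 4-byte sequence (valid input assumed) */
            {
                cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
                   | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
                s += 4;
            }

            if (cp < 0x10000)
                fprintf(out, "\\u%04X", cp);
            else                            /* outside the BMP: surrogate pair */
            {
                cp -= 0x10000;
                fprintf(out, "\\u%04X\\u%04X",
                        0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
            }
        }
    }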
From: Robert Haas on 28 Mar 2010 20:33

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com> wrote:
> In practice, every parser/serializer I've used (including the one I
> helped write) allows (and, often, forces) any non-ASCII character to
> be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

...Robert
From: Andrew Dunstan on 28 Mar 2010 20:46

Robert Haas wrote:
> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com> wrote:
>> In practice, every parser/serializer I've used (including the one I
>> helped write) allows (and, often, forces) any non-ASCII character to
>> be encoded as \u followed by a string of four hex digits.
>
> Is it correct to say that the only feasible place where non-ASCII
> characters can be used is within string constants? If so, it might be
> reasonable to disallow characters with the high-bit set unless the
> server encoding is one of the flavors of Unicode of which the spec
> approves. I'm tempted to think that when the server encoding is
> Unicode we really ought to allow Unicode characters natively, because
> turning a long string of two-byte wide chars into a long string of
> six-byte wide chars sounds pretty evil from a performance point of
> view.

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in
utf8 is 4 (and that's pretty rare, IIRC).

cheers

andrew
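For context, the 4-byte ceiling Andrew mentions follows directly from the UTF-8 leading byte, which encodes the sequence length. A small sketch of that rule, loosely modeled on what the backend's pg_utf_mblen() computes (treat that comparison, and the name below, as an assumption):

    /* Hypothetical helper: UTF-8 sequence length from its leading byte. */
    static int
    utf8_seq_len(unsigned char first)
    {
        if (first < 0x80)
            return 1;               /* U+0000 .. U+007F    */
        if ((first & 0xE0) == 0xC0)
            return 2;               /* U+0080 .. U+07FF    */
        if ((first & 0xF0) == 0xE0)
            return 3;               /* U+0800 .. U+FFFF    */
        if ((first & 0xF8) == 0xF0)
            return 4;               /* U+10000 .. U+10FFFF */
        return -1;                  /* not a valid leading byte */
    }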
From: Andrew Dunstan on 28 Mar 2010 20:48
Andrew Dunstan wrote:
>
> Robert Haas wrote:
>> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com>
>> wrote:
>>> In practice, every parser/serializer I've used (including the one I
>>> helped write) allows (and, often, forces) any non-ASCII character to
>>> be encoded as \u followed by a string of four hex digits.
>>
>> Is it correct to say that the only feasible place where non-ASCII
>> characters can be used is within string constants? If so, it might be
>> reasonable to disallow characters with the high-bit set unless the
>> server encoding is one of the flavors of Unicode of which the spec
>> approves. I'm tempted to think that when the server encoding is
>> Unicode we really ought to allow Unicode characters natively, because
>> turning a long string of two-byte wide chars into a long string of
>> six-byte wide chars sounds pretty evil from a performance point of
>> view.
>
> We support exactly one unicode encoding on the server side: utf8.
>
> And the maximum possible size of a validly encoded unicode char in
> utf8 is 4 (and that's pretty rare, IIRC).

Sorry. Disregard this. I see what you mean. Yeah, I think *requiring*
non-ASCII characters to be escaped would be evil.

cheers

andrew