From: Mike Rylander on
On Sun, Mar 28, 2010 at 8:33 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander(a)gmail.com> wrote:
>> In practice, every parser/serializer I've used (including the one I
>> helped write) allows (and, often, forces) any non-ASCII character to
>> be encoded as \u followed by a string of four hex digits.
>
> Is it correct to say that the only feasible place where non-ASCII
> characters can be used is within string constants?

Yes. That includes object property strings -- they are quoted string literals.

> If so, it might be
> reasonable to disallow characters with the high-bit set unless the
> server encoding is one of the flavors of Unicode of which the spec
> approves.  I'm tempted to think that when the server encoding is
> Unicode we really ought to allow Unicode characters natively, because
> turning a long string of two-byte wide chars into a long string of
> six-byte wide chars sounds pretty evil from a performance point of
> view.
>

+1

As an aside, \u-encoded (escaped) characters and native multi-byte
sequences (of any RFC-allowable Unicode encoding) are exactly
equivalent in JSON -- it's a storage and transmission format, and
doesn't prescribe the application-internal representation of the data.
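
For example (purely illustrative; both of the JSON texts below decode
to the same one-character string, U+00E9):

    const char *escaped = "\"\\u00e9\"";   /* pure ASCII: "\u00e9"     */
    const char *native  = "\"\xc3\xa9\"";  /* native UTF-8 bytes C3 A9 */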

If it's faster (which it almost certainly is) to not mangle the data
when it's all staying server side, that seems like a useful
optimization. For output to the client, however, it would be useful
to provide a \u-escaping function, which (AIUI) should always be safe
regardless of client encoding.
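
A minimal sketch of such an escaper, assuming the input has already
been decoded to Unicode code points (the function name and calling
convention are mine, not anything existing):

    #include <stdio.h>

    /* Write one code point as \uXXXX; code points above the BMP
     * become a UTF-16 surrogate pair, as JSON requires.  'out' must
     * hold at least 13 bytes. */
    static void escape_codepoint(unsigned int cp, char *out)
    {
        if (cp > 0xFFFF)
        {
            cp -= 0x10000;
            sprintf(out, "\\u%04X\\u%04X",
                    0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        else
            sprintf(out, "\\u%04X", cp);
    }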

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker(a)esilibrary.com
| web: http://www.esilibrary.com


From: Joseph Adams on
On Sun, Mar 28, 2010 at 5:19 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
> <joeyadams3.14159(a)gmail.com> wrote:
>> Now my thoughts and opinions on the JSON parsing/unparsing itself:
>>
>> It should be built-in, rather than relying on an external library
>> (like XML does).
>
> Why?  I'm not saying you aren't right, but you need to make an
> argument rather than an assertion.  This is a community, so no one is
> entitled to decide anything unilaterally, and people want to be
> convinced - including me.

I apologize; I was just starting the conversation with some of my
ideas to receive feedback. I didn't want people to have to wade
through too many "I think"s. I'll be sure to use <opinion> tags in
the future :-)

My reasoning for "It should be built-in" is:
* It would be nice to have a built-in serialization format that's
available by default.
* It might be a little faster because it doesn't have to link to an
external library.
* The code to interface between JSON logic and PostgreSQL will
probably be much larger than the actual JSON encoding/decoding itself.
* The externally-maintained and packaged libjson implementations I
saw brought in lots of dependencies (e.g. glib).
* "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

Is the code in question "*really*" small? Well, not really, but it's
not enormous either. By the way, I found a bug in PHP's JSON_parser
(json_decode("true "); /* with a space */ returns null instead of
true). I'll have to get around to reporting that.
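
(For the curious, the fix amounts to skipping trailing whitespace
before treating leftover input as an error -- a sketch, not PHP's
actual code:)

    /* After parsing one top-level value, only whitespace may remain.
     * A parser that treats any leftover byte as an error will reject
     * "true " even though it accepts "true". */
    static int only_whitespace_remains(const char *p)
    {
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
        return *p == '\0';
    }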

Now, assuming JSON support is built into PostgreSQL and is enabled by
default, it is my opinion that encoding issues should not be dealt
with in the JSON code itself; rather, the JSON code should simply
assume UTF-8. I think conversions should be done to/from UTF-8 before
passing text through the JSON code, because this would likely be the
smallest way to implement it (not necessarily the fastest, though).
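
In PostgreSQL terms, that boundary conversion can lean on existing
machinery -- a sketch, with error handling omitted:

    #include "mb/pg_wchar.h"

    /* Convert a server-encoded string to UTF-8 before handing it to
     * the JSON code.  pg_do_encoding_conversion() returns its input
     * unchanged when no conversion is needed. */
    static char *
    to_utf8(char *str, int len)
    {
        return (char *) pg_do_encoding_conversion((unsigned char *) str,
                                                  len,
                                                  GetDatabaseEncoding(),
                                                  PG_UTF8);
    }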

Mike Rylander pointed out something wonderful: JSON can be stored in
plain old ASCII using \u escapes. If a target encoding supports all
of Unicode, the JSON serializer could be told not to generate \u
escapes. Otherwise, the \u escapes would be necessary.

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1. When JSON is stored to the database:
1. The client is responsible for sending a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without
needless whitespace). It is given a flag indicating it should only
output ASCII text (see the sketch below).
5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
1. ASCII is retrieved from the server.
2. If the user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.
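
To make step 4 above concrete, the serializer's entry point might look
like this (the names are hypothetical, not an existing interface):

    #include <stdbool.h>

    struct JsonValue;               /* parsed representation, elided */

    /* With escape_non_ascii set, every code point above U+007F is
     * written as a \uXXXX escape, so the result is plain ASCII and is
     * valid text in any server encoding. */
    char *json_encode(const struct JsonValue *value, bool escape_non_ascii);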

Here's a question about semantics: should converting JSON to text
guarantee that Unicode will be \u escaped, or should it render actual
Unicode whenever possible (when the client uses a Unicode-complete
charset) ?

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable. I am aiming for simple code, not fast
code. I am using malloc for structures and realloc for strings/arrays
rather than resorting to clever buffering tricks. Of course, I'll
switch it over to palloc/repalloc before migrating it to PostgreSQL.
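
That buys straightforward code like the following growable buffer (a
sketch; inside the server, palloc/repalloc would replace
malloc/realloc):

    #include <stdlib.h>
    #include <string.h>

    typedef struct
    {
        char   *data;
        size_t  len;
        size_t  cap;
    } Buffer;

    /* Append n bytes, doubling capacity as needed.
     * Returns 0 on success, -1 on out-of-memory. */
    static int buffer_append(Buffer *buf, const char *src, size_t n)
    {
        if (buf->len + n > buf->cap)
        {
            size_t  newcap = buf->cap ? buf->cap * 2 : 16;
            char   *tmp;

            while (newcap < buf->len + n)
                newcap *= 2;
            tmp = realloc(buf->data, newcap);
            if (tmp == NULL)
                return -1;
            buf->data = tmp;
            buf->cap = newcap;
        }
        memcpy(buf->data + buf->len, src, n);
        buf->len += n;
        return 0;
    }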


From: Robert Haas on
On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159(a)gmail.com> wrote:
> I apologize; I was just starting the conversation with some of my
> ideas to receive feedback.  I didn't want people to have to wade
> through too many "I think"s.  I'll be sure to use <opinion> tags in
> the future :-)

FWIW, I don't care at all whether you say "I think" or "I know"; the
point is that you have to provide backup for any position you choose
to take.

> My reasoning for "It should be built-in" is:
>  * It would be nice to have a built-in serialization format that's
> available by default.
>  * It might be a little faster because it doesn't have to link to an
> external library.

I don't think either of these reasons is valid.

>  * The code to interface between JSON logic and PostgreSQL will
> probably be much larger than the actual JSON encoding/decoding itself.

If true, this is a good argument.

>  * The externally-maintained and packaged libjson implementations I
> saw brought in lots of dependencies (e.g. glib).

As is this.

>  * "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

But this isn't.

> Is the code in question "*really*" small?  Well, not really, but it's
> not enormous either.  By the way, I found a bug in PHP's JSON_parser
> (json_decode("true "); /* with a space */ returns null instead of
> true).  I'll have to get around to reporting that.
>
> Now, assuming JSON support is built-in to PostgreSQL and is enabled by
> default, it is my opinion that encoding issues should not be dealt
> with in the JSON code itself, but that the JSON code itself should
> assume UTF-8.  I think conversions should be done to/from UTF-8 before
> passing it through the JSON code because this would likely be the
> smallest way to implement it (not necessarily the fastest, though).
>
> Mike Rylander pointed out something wonderful, and that is that JSON
> code can be stored in plain old ASCII using \u... .  If a target
> encoding supports all of Unicode, the JSON serializer could be told
> not to generate \u escapes.  Otherwise, the \u escapes would be
> necessary.
>
> Thus, here's an example of how (in my opinion) character sets and such
> should be handled in the JSON code:
>
> Suppose the client's encoding is UTF-16, and the server's encoding is
> Latin-1.  When JSON is stored to the database:
>  1. The client is responsible and sends a valid UTF-16 JSON string.
>  2. PostgreSQL checks to make sure it is valid UTF-16, then converts
> it to UTF-8.
>  3. The JSON code parses it (to ensure it's valid).
>  4. The JSON code unparses it (to get a representation without
> needless whitespace).  It is given a flag indicating it should only
> output ASCII text.
>  5. The ASCII is stored in the server, since it is valid Latin-1.
>
> When JSON is retrieved from the database:
>  1. ASCII is retrieved from the server
>  2. If user needs to extract one or more fields, the JSON is parsed,
> and the fields are extracted.
>  3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.
>
> Note that I am being biased toward optimizing code size rather than speed.

Can you comment on my proposal elsewhere on this thread and compare
your proposal to mine? In what ways are they different, and which is
better, and why?

> Here's a question about semantics: should converting JSON to text
> guarantee that Unicode will be \u escaped, or should it render actual
> Unicode whenever possible (when the client uses a Unicode-complete
> charset) ?

I feel pretty strongly that the data should be stored in the database
in the format in which it will be returned to the user -- any
conversion which is necessary should happen on the way in. I am not
100% sure to what extent we should attempt to canonicalize the input
and to what extent we should simply store it in whichever way the user
chooses to provide it.

> As for reinventing the wheel, I'm in the process of writing yet
> another JSON implementation simply because I didn't find the other
> ones I looked at palatable.  I am aiming for simple code, not fast
> code.  I am using malloc for structures and realloc for strings/arrays
> rather than resorting to clever buffering tricks.  Of course, I'll
> switch it over to palloc/repalloc before migrating it to PostgreSQL.

I'm not sure that optimizing for simplicity over speed is a good idea.
I think we can reject implementations as unpalatable because they are
slow or feature-poor or have licensing issues or are not actively
maintained, but rejecting them because they use complex code in order
to be fast doesn't seem like the right trade-off to me.

....Robert


From: Hitoshi Harada on
2010/3/29 Andrew Dunstan <andrew(a)dunslane.net>:
> Robert Haas wrote:
>> On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
>> <joeyadams3.14159(a)gmail.com> wrote:
>>> I'm wondering whether the internal representation of JSON should be
>>> plain JSON text, or some binary code that's easier to traverse and
>>> whatnot.  For the sake of code size, just keeping it in text is
>>> probably best.
>>
>> +1 for text.
>
> Agreed.

There's another choice, called BSON.

http://www.mongodb.org/display/DOCS/BSON

I haven't researched it deeply yet, but it seems a reasonable format
to store in databases, as it was invented for MongoDB.

>>> Now my thoughts and opinions on the JSON parsing/unparsing itself:
>>>
>>> It should be built-in, rather than relying on an external library
>>> (like XML does).
>>
>> Why?  I'm not saying you aren't right, but you need to make an
>> argument rather than an assertion.  This is a community, so no one is
>> entitled to decide anything unilaterally, and people want to be
>> convinced - including me.
>
> Yeah, why? We should not be in the business of reinventing the wheel (and
> then maintaining the reinvented wheel), unless the code in question is
> *really* small.

The many JSON implementations in many languages show that parsing
JSON is not so difficult to code, and that the needs vary. Hence, I
wonder if we could have our very own.

Don't take this wrongly: I'm not against the text format, nor against
using an external library.

Regards,

--
Hitoshi Harada


From: Dimitri Fontaine on
Hi,

Joseph Adams <joeyadams3.14159(a)gmail.com> writes:
> As for reinventing the wheel, I'm in the process of writing yet
> another JSON implementation simply because I didn't find the other
> ones I looked at palatable.

Even this one (ANSI C, MIT licensed)?

cJSON -- An ultra-lightweight, portable, single-file, simple-as-can-be
ANSI-C compliant JSON parser, under MIT license.

http://sourceforge.net/projects/cjson/
http://cjson.svn.sourceforge.net/viewvc/cjson/README?revision=7&view=markup
http://cjson.svn.sourceforge.net/viewvc/cjson/cJSON.c?revision=33&view=markup

And from cJSON.h we read that it could be somewhat easy to integrate
into PostgreSQL's memory management:
typedef struct cJSON_Hooks {
      void *(*malloc_fn)(size_t sz);
      void (*free_fn)(void *ptr);
} cJSON_Hooks;
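
For instance, hooking it up might be as little as this (a sketch; the
wrapper names are mine, and cJSON_InitHooks() is the installer that
cJSON.h provides):

    #include "postgres.h"           /* palloc, pfree */
    #include "cJSON.h"

    /* Thin wrappers so the signatures match cJSON_Hooks exactly.
     * Note that palloc reports out-of-memory via elog rather than
     * returning NULL. */
    static void *pg_cjson_malloc(size_t sz) { return palloc(sz); }
    static void  pg_cjson_free(void *ptr)   { pfree(ptr); }

    static void
    init_cjson_hooks(void)
    {
        cJSON_Hooks hooks;

        hooks.malloc_fn = pg_cjson_malloc;
        hooks.free_fn = pg_cjson_free;
        cJSON_InitHooks(&hooks);
    }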

Just adding some data points, hoping they add more than confusion.

Regards,
--
dim
