From: Bruce Momjian on 20 Oct 2009 01:58

Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?
>
> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.
>
> Is this change reasonable? Comments welcome.

Seems there is community support for accepting BOM:

    http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Should I add this as a TODO item?

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Itagaki Takahiro on 20 Oct 2009 02:18

Bruce Momjian <bruce(a)momjian.us> wrote:
> Itagaki Takahiro wrote:
> > When psql opens a file with -f or \i, it checks first 3 bytes of the
> > file. If they are BOM, discard the 3 bytes and change client encoding
> > to UTF8 automatically.
>
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Thank you for the information. I read that thread, in which we discussed
BOM handling in *data types*. I agree with the decision in that thread
that we should not skip BOM characters, but we can handle a BOM
differently when it appears at the head of *files* for psql and COPY input.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From: Peter Eisentraut on 20 Oct 2009 05:54

On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?

psql doesn't have a problem, but the backend's lexer doesn't parse the
BOM as whitespace. Since the lexer is byte-based, it will presumably
have problems with anything outside of ASCII that Unicode considers
whitespace.

> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.

While I see that the Unicode standard supports using a UTF-8 encoded BOM
as UTF-8 signature, I wonder if those bytes can usefully appear in a
leading position in other encodings.

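As an illustration of the question about other encodings (not part of the original message), here is a minimal Python sketch that decodes the three UTF-8 BOM bytes under a few single-byte encodings. The helper name `decode_leading_bytes` is hypothetical, and the encodings listed are only examples. It shows that the byte sequence is at least *valid* in some non-UTF-8 encodings; whether it is *useful* there is a separate question.

```python
import codecs

# The UTF-8 encoded BOM: the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def decode_leading_bytes(encoding):
    """Return the text the BOM bytes decode to, or None if they are invalid.

    Hypothetical helper for illustration only.
    """
    try:
        return BOM.decode(encoding)
    except UnicodeDecodeError:
        return None


# In ISO-8859-1 (LATIN1) the same bytes are three printable characters,
# so a LATIN1 file could in principle start with them.
for enc in ("latin-1", "cp1252", "utf-8"):
    print(enc, ascii(decode_leading_bytes(enc)))
```

Under LATIN1 the bytes decode to i-diaeresis, right guillemet, inverted question mark, which is legal text even if it is an unlikely way to start an SQL script.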
From: Tom Lane on 20 Oct 2009 10:36

Bruce Momjian <bruce(a)momjian.us> writes:
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.

In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.

regards, tom lane

From: Andrew Dunstan on 20 Oct 2009 11:13

Tom Lane wrote:
> Bruce Momjian <bruce(a)momjian.us> writes:
>> Seems there is community support for accepting BOM:
>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
>
> That discussion has approximately nothing to do with the
> much-more-invasive change that Itagaki-san is suggesting.
>
> In particular I think an automatic change of client_encoding isn't
> particularly a good idea --- wouldn't you have to change it back later,
> and is there any possibility of creating a security issue from such
> behavior? Remember that client_encoding *IS* tied to security issues
> such as backslash escape handling.

Yeah, I don't think we should be second-guessing the user on the
encoding. What I think we might sensibly do is to eat the leading BOM of
an SQL file iff the client encoding is UTF8, and otherwise treat it as
just bytes in whatever the encoding is.

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

cheers

andrew
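
The rule suggested in this message (drop a leading BOM only when the client encoding is already UTF8, and never change the encoding) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not psql code: psql is written in C, and the function name `strip_leading_bom` and its parameters are hypothetical.

```python
import codecs


def strip_leading_bom(data: bytes, client_encoding: str) -> bytes:
    """Drop a leading UTF-8 BOM only if the client encoding is UTF8.

    Hypothetical helper: the client encoding is never changed here; it
    only decides whether the first three bytes are treated as a signature
    or as ordinary data.
    """
    # Accept spellings such as "UTF8", "utf-8" and "UTF_8" (an assumption
    # made for this sketch, not a statement about psql's name matching).
    normalized = client_encoding.replace("-", "").replace("_", "").lower()
    if normalized == "utf8" and data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # Any other encoding: the bytes are just bytes in that encoding.
    return data


script = codecs.BOM_UTF8 + b"SELECT 1;\n"
print(strip_leading_bom(script, "UTF8"))
print(strip_leading_bom(script, "LATIN1"))
```

Only the first three bytes of the input are inspected, so the same check could be applied to a file, to \copy input, or to a stdin stream, which are the open questions raised above.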