From: moonhkt on
On Jan 30, 5:51 pm, Roedy Green <see_webs...(a)mindprod.com.invalid>
wrote:
> On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <moon...(a)gmail.com>
> wrote, quoted or indirectly quoted someone who said :
>
> >Hi All
> >Why, when using UTF-8, do the hex values return 51cc and 6668 ?
>
> UTF-8 is a mixture of single-byte characters and multi-byte sequences
> of two to four bytes that encode the remaining code points.
>
> To see how the algorithm works see
> http://mindprod.com/jgloss/utf.html
> http://mindprod.com/jgloss/codepoint.html
> --
> Roedy Green Canadian Mind Products http://mindprod.com
> Computers are useless. They can only give you answers.
> ~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)
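Since this thread is about Java, here is a minimal sketch of those one-, two- and three-byte cases, using only the standard charset API (the sample string is illustrative):

```java
// Show how code points become one-, two- and three-byte UTF-8 sequences.
public class Utf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "A\u00E9\u51CC";           // 'A', 'é', '凌'
        for (byte b : s.getBytes("UTF-8")) {
            System.out.printf("%02x ", b);    // Formatter prints bytes unsigned
        }
        System.out.println();
        // prints: 41 c3 a9 e5 87 8c
    }
}
```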

Hi All
Thanks for the documents on UTF-8. Actually, my company wants to use an
ISO8859-1 database to store UTF-8 data. Currently, our EDI only handles
the ISO8859-1 codepage. We want to test importing UTF-8 data. One type of
EDI with UTF-8 data can be imported, processed, and loaded into our database.
Then, exporting the data to the default codepage, IBM850, we found e5 87 8c
e6 99 a8 in the file. The export file is a mix of ISO8859-1 characters and
UTF-8 characters.

The next test is loading all possible UTF-8 characters into our database,
then exporting the loaded data into a file and comparing the two files. If
the two are identical, that may prove that loading UTF-8 into an ISO8859-1
database has no bad effects.

Our database is a character-mode Progress database running on an AIX 5.3
machine.

The next task is to build all possible UTF-8 byte sequences into a file for a
loading test.
Any suggestions?
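For what it's worth, the byte-level round trip that comparison test relies on can be sketched in Java (file handling omitted; this only shows the charset behaviour the test depends on):

```java
// Sketch of the proposed round-trip test: bytes read as ISO-8859-1,
// stored, and written back as ISO-8859-1 come out unchanged, because
// ISO-8859-1 maps every byte 0x00-0xFF to a distinct character.
import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] utf8Bytes = "\u51CC\u6668".getBytes(Charset.forName("UTF-8"));
        String stored = new String(utf8Bytes, latin1); // 6 "chars" to the DB
        byte[] exported = stored.getBytes(latin1);
        System.out.println(Arrays.equals(utf8Bytes, exported)); // true
    }
}
```

The round trip is lossless only as long as nothing between import and export does string operations (substring, upcase, trimming) on the stored value.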


From: Lew on
moonhkt wrote:

> Thank for documents for UTF-8. Actually, My company want using
> ISO8859-1 database to store UTF-8 data. Currently, our EDI just handle

That statement doesn't make sense. What makes sense would be, "My company
wants to store characters with an ISO8859-1 encoding". There is not any such
thing, really, as "UTF-8 data". What there is is character data. Others
upthread have explained this; you might wish to review what people told you
about how data in a Java 'String' is always UTF-16. You read it into the
'String' using an encoding argument to the 'Reader' to understand the encoding
of the source, and you write it to the destination using whatever encoding in
the 'Writer' that you need.
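A minimal sketch of that Reader/Writer flow (the byte values and in-memory streams here are just illustrative):

```java
// Read bytes with the source encoding, write with the target encoding;
// the String in between is always UTF-16 internally.
import java.io.*;

public class Transcode {
    public static void main(String[] args) throws IOException {
        byte[] utf8Input = {(byte) 0xe5, (byte) 0x87, (byte) 0x8c}; // U+51CC
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Input), "UTF-8"));
        String s = in.readLine();
        System.out.printf("U+%04X%n", (int) s.charAt(0)); // U+51CC

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, "ISO-8859-1");
        out.write(s);                     // U+51CC is unmappable in Latin-1,
        out.close();                      // so it is replaced with '?'
        System.out.printf("%02x%n", bytes.toByteArray()[0]); // 3f
    }
}
```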

> ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI

The term "UTF-8 data" has no meaning.

> with UTF-8 Data can be import and processed loading to our database.
> Then export the data to default codepage, IBM850, we found e5 87 8c
> e6 99 a8 in the file. The Export file are mix ISO8859-1 chars and
> UTF-8 character.

You simply map the 'String' data to the database column using JDBC. The
connection and JDBC driver handle the encoding, AIUI.
<http://java.sun.com/javase/6/docs/api/java/sql/PreparedStatement.html#setString(int,%20java.lang.String)>

> The next test is loading all possible UTF-8 character to our database
> then export the loaded data into a file, for compare two file. If two
> different, we may be proof that loading UTF-8 into ISO8859-1 database
> without any of bad effect.

There are an *awful* lot of Unicode characters, over 107,000. Most are
not encodable in ISO-8859-1, which only covers 256 characters.
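That coverage gap is easy to check from Java with 'CharsetEncoder.canEncode'; a small sketch:

```java
// ISO-8859-1 covers only U+0000..U+00FF; everything else is unmappable.
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Coverage {
    public static void main(String[] args) {
        CharsetEncoder latin1 = Charset.forName("ISO-8859-1").newEncoder();
        System.out.println(latin1.canEncode('\u00E9')); // true:  é is Latin-1
        System.out.println(latin1.canEncode('\u51CC')); // false: 凌 is not
    }
}
```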

> Our Database is Progress Database for Character mode run on AIX 5.3
> Machine.
>
> Next Task, try to build all possible UTF-8 Bit into file,for Loading
> test.
> Any suggestion ?

That'll be a rather large file.

Why don't you Google for character encoding and what different encodings can
handle?

Also:
<http://en.wikipedia.org/wiki/Unicode>
<http://en.wikipedia.org/wiki/ISO-8859-1>

--
Lew
From: moonhkt on
On Jan 31, 12:16 am, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> moonhkt wrote:
> > Actually, My company want using
> > ISO8859-1 database to store UTF-8 data.
>
> Your company should use a Unicode database to store Unicode data. The
> Progress DBMS supports Unicode.
>
> > Currently, our EDI just handle
> > ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI
> > with UTF-8 Data can be import and processed loading to our database.
> > Then export the data to default codepage, IBM850,  we found e5 87 8c
> > e6 99 a8 in the file.
>
> This seems crazy to me. The DBMS functions for working with CHAR
> datatypes will do bad things if you have misled the DBMS into treating
> UTF-8 encoded data as if it were ISO 8859-1. You will no longer be able
> to fit 10 chars in a CHAR(10) field, for example.
>
> > The Export file are mix ISO8859-1 chars and UTF-8 character.
>
> Sorry to be so negative, but this seems a recipe for disaster.
>
> > The next test is loading all possible UTF-8 character to our database
> > then export the loaded data into a file, for compare two file.  If two
> > different, we may be proof that loading UTF-8 into ISO8859-1 database
> > without any of bad effect.
>
> I think you'll have a false sense of optimism and discover bad effects
> later.
>
> > Our Database is Progress Database for Character mode run on AIX 5.3
> > Machine.
>
> A 1998 vintage document suggests the Progress DBMS can support Unicode:
> http://unicode.org/iuc/iuc13/c12/slides.ppt. Though there are a few items
> in that presentation that I find troubling.
>
> > Next Task, try to build all possible UTF-8 Bit into file,for Loading
> > test.
>
> Unicode contains combining characters; not all sequences of Unicode
> characters are valid.
>
> > Any suggestion ?
>
> Reconsider :-)
>
> --
> RGB

Thanks for your reminder. But our database already has Chinese/Japanese/
Korean coded data in it.
Those data are updated by a lookup program; e.g., when you input PEN you
get the Chinese GB2312 or BIG5 code.
We already asked Progress TS about this case, and they also suggested using
a UTF-8 database.

But we cannot move to a UTF-8 database. Only some fields have this case,
and those fields will not use substring, upcase, or other string operations
to update them. Up to now, those CJK values have caused no problems for
over 10 years.

The fact that Unicode contains combining characters is one consideration.
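On the combining-character point, Java's 'java.text.Normalizer' shows why a byte-for-byte comparison can mislead; a small sketch:

```java
// The same visible character can be one code point or a base character
// plus a combining mark; normalizing to NFC before comparing avoids
// false mismatches.
import java.text.Normalizer;

public class Combining {
    public static void main(String[] args) {
        String precomposed = "\u00E9";   // é as a single code point
        String decomposed  = "e\u0301";  // e + combining acute accent
        System.out.println(precomposed.equals(decomposed));        // false
        System.out.println(Normalizer.normalize(decomposed,
                Normalizer.Form.NFC).equals(precomposed));         // true
    }
}
```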


From: moonhkt on
On Jan 31, 12:48 am, moonhkt <moon...(a)gmail.com> wrote:
[ SNIP ]

Why is my testing using Java? I want to check the byte values of my
output in Progress.
We want to check what values Progress exports.
For the Chinese word "凌晨", the UTF-16 code points are 51CC and 6668, and
the UTF-8 byte values are e5 87 8c e6 99 a8.

In Progress, the inputted data viewed on a UTF-8 terminal shows as 凌晨. So
we felt it is not harmful to the ISO8859-1 database. Actually, the database
seems to handle the characters 0x00 to 0xFF. The number of bytes for 凌晨
turns out to be six.
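Those byte values can be double-checked from Java; a small sketch:

```java
// Check the claim: 凌晨 is U+51CC U+6668, whose UTF-8 encoding is
// e5 87 8c e6 99 a8 -- six bytes for two characters.
import java.nio.charset.Charset;

public class CheckBytes {
    public static void main(String[] args) {
        String s = "\u51CC\u6668";                       // 凌晨
        byte[] utf8 = s.getBytes(Charset.forName("UTF-8"));
        System.out.println(utf8.length);                 // 6
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) hex.append(String.format("%02x ", b));
        System.out.println(hex.toString().trim());       // e5 87 8c e6 99 a8
    }
}
```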
From: Arved Sandstrom on
Lew wrote:
[ SNIP ]
>
> The term "UTF-8 data" has no meaning.
[ SNIP ]

That's a bit nitpicky for me. If you're going to get that precise then
there's no such thing as character data either, since characters are
also an interpretation of binary bytes and words. In this view there's
no difference between a Unicode file and a PNG file and a PDF file and
an ASCII file.

Since we do routinely describe files by the only useful interpretation
of them, why not UTF-8 data files?

AHS