Read utf-8 file return utf-16 coding hex string ? [Java Programming]

Prev: Need to recompile a Java Applet as an Executable
Next: Save xls file as blob to db ?

From: Lew on 3 Feb 2010 11:34

bugbear wrote:
>> But you can store 6 bytes as 6 Latin-1 chars (as long as
>> the DB doesn't suppress the "invalid" values; most don't)
>>
>> It just won't have the right semantics.

moonhkt wrote:
> What is your problem ?

How do you mean that question? I don't see any problem from him.

> The six bytes , 3 for first character and next 3 bytes for seconding
> character.
> Actually, We tried import and export , and compare two file are same.
>
> The next task, is Extended ascii code, 80 to FF, value is not part of
> UTF-8. It is means that the Output file can not include 80 to FF bytes
> value ?

No. Those bytes can appear, and sometimes will, in a UTF-8-encoded file.

> And handle 0xBC, Fraction one quarter, 0xBD,Fraction one half
> conversion to UTF-8. or some value in Extended ASCII code to UTF-8
> conversion.

Once again, as many have mentioned, you can read into a 'String' from, say, an
ISO8859-1 source and write to a UTF-8 sink using the appropriate encoding
arguments to the constructors of your 'Reader' and your 'Writer'.
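
A minimal sketch of that read-in-one-encoding, write-in-another approach. The
class and method names are made up for illustration, and 'StandardCharsets'
assumes Java 7 or later; on older versions the charset names can be passed as
strings instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {
    // Decodes the input bytes as ISO-8859-1 and re-encodes the characters as UTF-8.
    public static void transcode(InputStream in, OutputStream out) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } // closing the writer flushes any buffered output
    }

    public static void main(String[] args) throws IOException {
        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), out);
        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: 63 61 66 C3 A9
    }
}
```

For real files, a 'FileInputStream' and 'FileOutputStream' would take the place
of the in-memory streams.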

> Below Extended ASCII code found in our Database, ISO8859-1.
> 0x85
> 0xA9
> 0xAE

So? Read them in using one encoding and write them out using another. Done.
Easy. End of story.

Why do you keep asking the same question over and over again after so many
have answered it? There must be some detail in the answers that isn't clear
to you. What exactly is that?

--
Lew

From: RedGrittyBrick on 3 Feb 2010 12:18

moonhkt wrote:
> bugbear wrote:
>> markspace wrote:
>>> moonhkt wrote:
>>>> In Progress, viewed the inputted data by UTF-8 terminal as a 凌晨. So,
>>>> we felt it is not awful to ISO8859-1 database. Actually, Database seem
>>>> to be handle 0x00 to 0xFF characters. The number of byte for 凌晨 to be
>>>> six byte.
>>> Correct. You can't fit six bytes into one. You can't store all UTF-8
>>> characters into an ISO8859-1 file. Some (most) will get truncated.
>> But you can store 6 bytes as 6 Latin-1 chars (as long as
>> the DB doesn't suppress the "invalid" values; most don't)
>>
>> It just won't have the right semantics.
>>

By which I believe bugbear means that if your database thinks the octets
are ISO-8859-1 whereas they are in reality UTF-8, then the database's
understanding of the meaning (semantics) of those octets is wrong.
That's all. The implication is that sorting (i.e. collation) and string
operations like case shifting and substring operations will often act
incorrectly.
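
A small sketch of what "wrong semantics" looks like in practice, using the two
characters from this thread. The class and method names are hypothetical, and
the ISO-8859-1 'String' only stands in for what the database believes it holds.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

public class WrongSemantics {
    // Reinterprets the UTF-8 bytes of s as if they were ISO-8859-1 characters.
    public static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Returns true if an upper-casing pass over the misread text leaves the bytes intact.
    public static boolean survivesUpperCase(String s) {
        byte[] original = s.getBytes(StandardCharsets.UTF_8);
        String shifted = misreadAsLatin1(s).toUpperCase(Locale.ROOT);
        return Arrays.equals(original, shifted.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String text = "\u51CC\u6668"; // the two CJK characters discussed above
        String misread = misreadAsLatin1(text);

        System.out.println(text.length());    // 2 characters as the user sees them
        System.out.println(misread.length()); // 6 "characters" as the database sees them

        // A plain store-and-fetch round trip keeps the bytes intact...
        byte[] back = misread.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(back, text.getBytes(StandardCharsets.UTF_8))); // true

        // ...but a case shift touches 0xE5 and 0xE6 and corrupts the data.
        System.out.println(survivesUpperCase(text)); // false
    }
}
```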

> The six bytes , 3 for first character and next 3 bytes for seconding
> character.

In UTF-8 the number of bytes per character is anywhere between one and four.
Some characters will be represented by one byte, others by two bytes ...
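
To make the one-to-four range concrete, here is a short sketch that counts the
UTF-8 bytes of a few sample code points. The samples and the helper name are
chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Returns how many bytes the given code point occupies when encoded as UTF-8.
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 'A', plain ASCII: 1
        System.out.println(utf8Length(0xE9));    // e-acute: 2
        System.out.println(utf8Length(0x51CC));  // first character from the thread: 3
        System.out.println(utf8Length(0x1F600)); // a code point outside the BMP: 4
    }
}
```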

> Actually, We tried import and export , and compare two file are same.

Which is what your objective was. Job done?

>
> The next task, is Extended ascii code, 80 to FF,

There are many different 8-bit character sets that are sometimes
labelled "extended ASCII". ISO-8859-1 is one. Windows Latin 1 is
another, Code page 850 another.

> value is not part of UTF-8.

Yes it is! As Lew said, those byte values will appear in UTF-8 encoded
character data.
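
A quick way to see this for yourself is to dump the UTF-8 bytes of a few
characters in hex. This is only a sketch with made-up helper names; the point
is that every byte of a multi-byte UTF-8 sequence lies in the range 80 to FF.

```java
import java.nio.charset.StandardCharsets;

public class HighBytesInUtf8 {
    // Encodes s as UTF-8 and formats the bytes as space-separated upper-case hex.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00BC and U+00BD, the two fractions mentioned in the question
        System.out.println(utf8Hex("\u00BC\u00BD")); // C2 BC C2 BD
        // The two CJK characters from earlier in the thread
        System.out.println(utf8Hex("\u51CC\u6668")); // E5 87 8C E6 99 A8
    }
}
```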

> It is means that the Output file can not include 80 to FF bytes
> value ?

Yes it can.

> And handle 0xBC, Fraction one quarter, 0xBD,Fraction one half
> conversion to UTF-8. or some value in Extended ASCII code to UTF-8
> conversion.

0xBC is not "Fraction one quarter" in some "extended ASCII" character
sets. For example in Code Page 850 it is a "box drawing double up and
left" character. I guess when you say "extended ASCII" you are only
considering "ISO 8859-1"?
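
The sketch below decodes the same byte under two charsets to show how much the
answer depends on the encoding. It assumes the JVM ships the "IBM850" charset,
which is common but not guaranteed, so the code checks for it first. The class
and method names are invented for the example.

```java
import java.nio.charset.Charset;

public class SameByteDifferentMeaning {
    // Decodes a single byte with the named charset and returns the resulting code point.
    public static int decodeByte(int b, String charsetName) {
        String s = new String(new byte[] { (byte) b }, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // ISO-8859-1 is always available: 0xBC is U+00BC, vulgar fraction one quarter
        System.out.printf("ISO-8859-1: U+%04X%n", decodeByte(0xBC, "ISO-8859-1"));

        // Code page 850 is optional, so guard the lookup
        if (Charset.isSupported("IBM850")) {
            // Here 0xBC is U+255D, box drawings double up and left
            System.out.printf("IBM850:     U+%04X%n", decodeByte(0xBC, "IBM850"));
        } else {
            System.out.println("IBM850 is not available on this JVM");
        }
    }
}
```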

>
> Below Extended ASCII code found in our Database, ISO8859-1.
> 0x85
> 0xA9
> 0xAE
>

Since you are using your ISO 8859-1 database as a generic byte-bucket,
you have to know what encoding was used to insert those byte sequences.

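They don't look like a valid sequence in UTF-8 encoding.
AFAIK the ISO 8859-1 characters 0x85, 0xA9 (copyright) and 0xAE (registered)
would be C2 85 C2 A9 C2 AE in UTF-8. Note that 0x85 is an ellipsis only in
Windows-1252; in ISO 8859-1 it is the C1 control character NEL.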
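
One way to test a byte sequence for UTF-8 validity is a strict decode that
reports errors instead of substituting replacement characters. This is a
sketch with a hypothetical helper name. Passing the check does not prove the
data was meant as UTF-8 (pure ASCII passes under either reading); failing it
does prove the data is not UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x85, (byte) 0xA9, (byte) 0xAE };
        byte[] encoded = { (byte) 0xC2, (byte) 0x85, (byte) 0xC2, (byte) 0xA9,
                           (byte) 0xC2, (byte) 0xAE };
        System.out.println(isValidUtf8(raw));     // false: 0x85 cannot start a sequence
        System.out.println(isValidUtf8(encoded)); // true
    }
}
```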

Maybe some of the columns in your ISO 8859-1 database do contain ISO
8859-1 encoded data, whilst other columns (or rows - eeek!) actually
contain UTF-8 encoded data.

If you don't know which columns/rows contain which encodings then you
have a problem.

In an earlier response I said that I view this as a recipe for disaster.

From: Roedy Green on 5 Feb 2010 11:44

On Sat, 30 Jan 2010 07:23:55 -0800 (PST), moonhkt <moonhkt(a)gmail.com>
wrote, quoted or indirectly quoted someone who said :

>Thank for documents for UTF-8. Actually, My company want using
>ISO8859-1 database to store UTF-8 data. Currently, our EDI just handle
>ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI
>with UTF-8 Data can be import and processed loading to our database.
>Then export the data to default codepage, IBM850, we found e5 87 8c
>e6 99 a8 in the file. The Export file are mix ISO8859-1 chars and
>UTF-8 character.
>
>The next test is loading all possible UTF-8 character to our database
>then export the loaded data into a file, for compare two file. If two
>different, we may be proof that loading UTF-8 into ISO8859-1 database
>without any of bad effect.
>
>Our Database is Progress Database for Character mode run on AIX 5.3
>Machine.
>
>Next Task, try to build all possible UTF-8 Bit into file,for Loading
>test.
>Any suggestion ?

You lied to your database and partly got away with it.

Here's the problem.

If you just look at a stream of bytes, you can't tell for sure if it
is UTF-8 or ISO-8859-1. There is no special marker. A human can make
a pretty good guess, but it is still a guess. The database just
treats the string as a slew of bits. It stores them and regurgitates
them identically. It does not really matter what encoding they are.

UNLESS you start doing some ad hoc queries not using your Java code
that is aware of the deception.

Now when you say search for c^aro (the Esperanto word for cart), the
search engine is going to look for a UTF-8-like set of bits with an
accented c. It won't find them unless the database truly is UTF-8 or
it is one of those lucky situations where UTF-8 and ISO are the same.
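
A rough illustration of that failure, simulated with plain Java strings rather
than a real database. The sample word and the names are invented; the
ISO-8859-1 'String' stands in for what the engine believes is stored.

```java
import java.nio.charset.StandardCharsets;

public class SearchMismatch {
    // Simulates a column declared ISO-8859-1 that was actually loaded with UTF-8 bytes.
    public static String storedAsSeenByDatabase(String realText) {
        return new String(realText.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String stored = storedAsSeenByDatabase("caf\u00E9"); // "cafe" with e-acute

        // An ad hoc query for the accented word finds nothing, because the
        // stored e-acute is two Latin-1 characters (0xC3 0xA9), not one.
        System.out.println(stored.contains("caf\u00E9")); // false

        // The lucky case: pure ASCII is identical in both encodings.
        System.out.println(stored.contains("caf")); // true
    }
}
```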

Telling your database engine the truth has another advantage. It can
use a more optimal compression algorithm.

Usually you store your database in UTF-8. Some legacy apps may
request some other encoding, and the database will translate for it in
and out. However, if you have lied about any of the encodings, this
translation process will go nuts.

One of the functions of a database is to hide the actual
representation of the data. It serves it up any way you like it. This
makes it possible to change the internal representation of the
database without changing all the apps at the same time.

--
Roedy Green Canadian Mind Products
http://mindprod.com

You can't have great software without a great team, and most software teams behave like dysfunctional families.
~ Jim McCarthy
