change ISO8859-1 to GB2312 to UTF-8 to EBCDIC to Big5 to ... [Java Programming]


From: RedGrittyBrick on 25 May 2010 07:02

On 25/05/2010 09:48, moonhkt wrote:

> Thank [you]. I am not testing [with] JDBC.

When you wrote "Our database is ISO8859-1 format with some GB2312 and
other non ISO8859-1 data." I got the impression that a DBMS was
involved. If you were using Hibernate or some other framework rather
than JDBC, the same principles would apply.

> But [I] tried [converting a] GB2312 file to UTF-8, then [to] BIG5

BIG5! Another character set and encoding! I think that makes seven
you've mentioned in this thread! Any more?

> 10 TEST1 |测试1
> 11 TEST2 |测试2
> 13 TEST4 |测试4
>
> [the program below] can conv[ert a file containing the above data] to UTF-8
>
> When [it] conv[erts from] UTF-8 to BIG5, [it] can not [successfully convert
> all characters]. Do you know why?

You are ignoring exceptions. Exceptions might be telling you something
you really need to know about. Don't ignore exceptions.
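
A minimal sketch of letting the conversion fail loudly, assuming the JDK in use ships the Big5 and GB2312 charsets. The class and method names (StrictEncode, encodeStrict) are made up for illustration and are not from the original program. A CharsetEncoder set to CodingErrorAction.REPORT throws a CharacterCodingException when a character has no representation in the target charset, instead of quietly substituting something else.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class StrictEncode {
    // Encodes text in the named charset and throws, rather than substituting,
    // when a character cannot be represented in that charset.
    static byte[] encodeStrict(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        try {
            encodeStrict("\u6D4B\u8BD5", "Big5"); // simplified "测试"
            System.out.println("encoded without loss");
        } catch (CharacterCodingException e) {
            // Report the failure instead of swallowing it.
            System.err.println("cannot encode: " + e);
        }
    }
}
```

The point of the design is only that the catch block says something; an empty catch would hide exactly the information needed here.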

I'm not familiar with GB2312 and Big5 but I expect that there are
characters in GB2312 that are not in Big5. It is almost certain.

GB2312 originated in the People's Republic of China, where simplified
Chinese characters were mandatory. I think this policy has been relaxed now.

I suspect Big5 originated in either the British colony of Hong Kong or
in the Republic of China (Taiwan/Formosa). In both these places,
Traditional Chinese characters were (and still are) used.

Whether the conversion from GB2312 to UTF-16 and then to Big5 can
convert a simplified character to a traditional counterpart is unknown
to me. Perhaps this causes conversion problems?

> [I] Checked [the resulting file] with IE, the BIG5 code is [displayed as] "?"

You have to tell IE what encoding to use to display the file. That was
why I wrote HTML markup containing <meta charset="gb2312">. You can
probably force an encoding using a menu option in IE. You certainly can
in Firefox.

If IE does not have access to a font containing the required glyph, it
will display a placeholder character. I don't use IE much so I'm not
certain what the placeholder IE displays, a small box, a question-mark
or something else.

If Java writes a character that is not present in the specified output
character set then I expect it might also substitute a placeholder
character.
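
A small sketch of that substitution, assuming the JDK's Big5 charset is available; the class name SilentSubstitution and the helper toHex are hypothetical. String.getBytes(Charset) is documented to replace unmappable characters with the charset's default replacement bytes rather than throw, so the output file can end up containing literal question marks before any browser is involved.

```java
import java.nio.charset.Charset;

class SilentSubstitution {
    // Formats bytes as space-separated lowercase hex, similar to "od -t x1".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");
        String simplified = "\u6D4B\u8BD5"; // "测试"
        // getBytes does not throw for unmappable characters; it writes the
        // charset's replacement bytes instead (0x3f, '?', for Big5).
        System.out.println(toHex(simplified.getBytes(big5)));
    }
}
```

If the dump shows 3f where a Chinese character was expected, the "?" was written by Java and is not a font or display problem.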

Also Big5 is weird, apparently it doesn't exactly encode characters, it
encodes logograms or parts of graphical characters. It also has to be
paired with a single-byte character-set that isn't specified in the Big5
standard. Also there are variants of Big5. Lots of scope for encoding
issues. Maybe Java and IE disagree about Big5 variants?
<http://en.wikipedia.org/wiki/Big5>

P.S. IE6 is old and a security hazard, I'd upgrade.
--
RGB

From: moonhkt on 25 May 2010 10:18

On May 25, 7:02 pm, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> On 25/05/2010 09:48, moonhkt wrote:
>
> > Thank [you]. I am not testing [with] JDBC.
>
> When you wrote "Our database is ISO8859-1 format with some GB2312 and
> other non ISO8859-1 data." I got the impression that a DBMS was
> involved. If you were using Hibernate or some other framework rather
> than JDBC, the same principles would apply.
>
> > But [I] tried [converting a] GB2312 file to UTF-8, then [to] BIG5
>
> BIG5! Another character set and encoding! I think that makes seven
> you've mentioned in this thread! Any more?
>
> > 10 TEST1 |测试1
> > 11 TEST2 |测试2
> > 13 TEST4 |测试4
>
> > [the program below] can conv[ert a file containing the above data] to UTF-8
>
> > When [it] conv[erts from] UTF-8 to BIG5, [it] can not [successfully convert
> > all characters]. Do you know why?
>
> You are ignoring exceptions. Exceptions might be telling you something
> you really need to know about. Don't ignore exceptions.
>
> I'm not familiar with GB2312 and Big5 but I expect that there are
> characters in GB2312 that are not in Big5. It is almost certain.
>
> GB2312 originated in the People's Republic of China, where simplified
> Chinese characters were mandatory. I think this policy has been relaxed now.
>
> I suspect Big5 originated in either the British colony of Hong Kong or
> in the Republic of China (Taiwan/Formosa). In both these places,
> Traditional Chinese characters were (and still are) used.
>
> Whether the conversion from GB2312 to UTF-16 and then to Big5 can
> convert a simplified character to a traditional counterpart is unknown
> to me. Perhaps this causes conversion problems?
>
> > [I] Checked [the resulting file] with IE, the BIG5 code is [displayed as] "?"
>
> You have to tell IE what encoding to use to display the file. That was
> why I wrote HTML markup containing <meta charset="gb2312">. You can
> probably force an encoding using a menu option in IE. You certainly can
> in Firefox.
>
> If IE does not have access to a font containing the required glyph, it
> will display a placeholder character. I don't use IE much so I'm not
> certain what the placeholder IE displays, a small box, a question-mark
> or something else.
>
> If Java writes a character that is not present in the specified output
> character set then I expect it might also substitute a placeholder
> character.
>
> Also Big5 is weird, apparently it doesn't exactly encode characters, it
> encodes logograms or parts of graphical characters. It also has to be
> paired with a single-byte character-set that isn't specified in the Big5
> standard. Also there are variants of Big5. Lots of scope for encoding
> issues. Maybe Java and IE disagree about Big5 variants?
> <http://en.wikipedia.org/wiki/Big5>
>
> P.S. IE6 is old and a security hazard, I'd upgrade.
> --
> RGB

Our ISO8859-1 database (a Progress database) holds some Japanese, Korean,
Simplified Chinese and Traditional Chinese data. That text is imported by a
lookup function. For example, when a user enters "G", the lookup
program fetches "Green" in the corresponding language's character set.
I also checked another Progress database that is set to GB2312: the encoded
bytes of "测试" ("TEST" in English) are the same as in the ISO8859-1
database. I checked with the Unix tool "od -ct x1 file_name".

As for the BIG5 conversion, I was only testing how to convert GB2312 to
BIG5. My boss asked me to check what the encoding values for "TEST" are in
GB2312 and in BIG5, so I wanted to convert to BIG5 to see the encoded
values there.
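
One way to look at the encoded values without going through files is sketched below, assuming the JDK provides the GB2312 and Big5 charsets. The class name EncodingValues and the method hex are made up for this illustration. It prints the bytes of a string in a given encoding as hex, much like "od -t x1" does for a file. The traditional spelling is used for Big5 because the simplified characters are not expected to be present there.

```java
import java.nio.charset.Charset;

class EncodingValues {
    // Returns the bytes of text in the named charset as space-separated hex.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String simplified = "\u6D4B\u8BD5";  // 测试
        String traditional = "\u6E2C\u8A66"; // 測試
        System.out.println("GB2312 simplified : " + hex(simplified, "GB2312"));
        System.out.println("UTF-8  simplified : " + hex(simplified, "UTF-8"));
        System.out.println("Big5   traditional: " + hex(traditional, "Big5"));
        System.out.println("UTF-8  traditional: " + hex(traditional, "UTF-8"));
    }
}
```

The Unicode escapes keep the source file independent of the editor's own encoding, which avoids adding one more conversion to the experiment.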

I will add the exception handling back.

Thanks a lot.

moonhkt

From: RedGrittyBrick on 26 May 2010 04:56

On 25/05/2010 15:18, moonhkt wrote:
> Our ISO8859-1 Database(Progress Database) have some Japanese/Korea/
> Simplified Chinese and Traditional Chinese. Those Language imported
> by lookup function. e.g. When User Input "G" in particular , the
> lookup program will get "Green" in corresponding Language Character
> set. Also, I checked other GB2312 Database(Progress Database), the
> Encoding Value of "测试" (in English "TEST") same as ISO8859-1. Checked
> by unix tool "od -ct x1 file_name".
>
> For BIG5 conversion, I just for testing how to change GB2312 to
> BIG5. My Boss ask me for check what is the encoding value for "TEST"
> in GB2312 or BIG5. So, I want convert to BIG5 to check what encoding
> value in BIG5.

"测试" is simplified Chinese.
"測試" is traditional Chinese.

So far as I know:
GB2312 is simplified Chinese.
Big5 is traditional Chinese.

Therefore:
You cannot write "测试" in Big5
You cannot write "測試" in GB2312

Unless I am mistaken.

One simplified Chinese character may correspond to several traditional
Chinese characters. Java cannot translate "测试" to "測試" because that
is a process that requires artistic skill, literary skill and an
understanding of the context.

I do not read, write, speak nor understand Chinese so I only offer the
above as my somewhat uninformed understanding of the situation.

--
RGB

From: moonhkt on 26 May 2010 10:12

On May 26, 4:56 pm, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> On 25/05/2010 15:18, moonhkt wrote:
>
> > Our ISO8859-1 Database(Progress Database) have some Japanese/Korea/
> > Simplified Chinese and Traditional Chinese. Those Language imported
> > by lookup function. e.g. When User Input "G" in particular , the
> > lookup program will get "Green" in corresponding Language Character
> > set. Also, I checked other GB2312 Database(Progress Database), the
> > Encoding Value of "测试" (in English "TEST") same as ISO8859-1. Checked
> > by unix tool "od -ct x1 file_name".
>
> > For BIG5 conversion, I just for testing how to change GB2312 to
> > BIG5. My Boss ask me for check what is the encoding value for "TEST"
> > in GB2312 or BIG5. So, I want convert to BIG5 to check what encoding
> > value in BIG5.
>
> "测试" is simplified Chinese.
> "測試" is traditional Chinese.
>
> So far as I know:
> GB2312 is simplified Chinese.
> Big5 is traditional Chinese.
>
> Therefore:
> You cannot write "测试" in Big5
> You cannot write "測試" in GB2312
>
> Unless I am mistaken.
>
> One simplified Chinese character may correspond to several traditional
> Chinese characters. Java cannot translate "测试" to "測試" because that
> is a process that requires artistic skill, literary skill and an
> understanding of the context.
>
> I do not read, write, speak nor understand Chinese so I only offer the
> above as my somewhat uninformed understanding of the situation.
>
> --
> RGB

Hi RGB,

"测试" is in GB2312 and "測試" is in BIG5.

In my test, converting GB2312 to UTF-8 works. Converting UTF-8 to BIG5
then does not.
Is something missing, or is there another reason?

As for "One simplified Chinese character may correspond to several
traditional Chinese characters": that may not be true.
Anyway, thanks for your help.

From: RedGrittyBrick on 26 May 2010 10:55

On 26/05/2010 15:12, moonhkt wrote:
> On May 26, 4:56 pm, RedGrittyBrick<RedGrittyBr...(a)spamweary.invalid>
> wrote:
>> On 25/05/2010 15:18, moonhkt wrote:
>>
>>> Our ISO8859-1 Database(Progress Database) have some Japanese/Korea/
>>> Simplified Chinese and Traditional Chinese. Those Language imported
>>> by lookup function. e.g. When User Input "G" in particular , the
>>> lookup program will get "Green" in corresponding Language Character
>>> set. Also, I checked other GB2312 Database(Progress Database), the
>>> Encoding Value of "测试" (in English "TEST") same as ISO8859-1. Checked
>>> by unix tool "od -ct x1 file_name".
>>
>>> For BIG5 conversion, I just for testing how to change GB2312 to
>>> BIG5. My Boss ask me for check what is the encoding value for "TEST"
>>> in GB2312 or BIG5. So, I want convert to BIG5 to check what encoding
>>> value in BIG5.
>>
>> "测试" is simplified Chinese.
>> "測試" is traditional Chinese.
>>
>> So far as I know:
>> GB2312 is simplified Chinese.
>> Big5 is traditional Chinese.
>>
>> Therefore:
>> You cannot write "测试" in Big5
>> You cannot write "測試" in GB2312
>>
>> Unless I am mistaken.
>>
>> One simplified Chinese character may correspond to several traditional
>> Chinese characters. Java cannot translate "测试" to "測試" because that
>> is a process that requires artistic skill, literary skill and an
>> understanding of the context.
>>
>> I do not read, write, speak nor understand Chinese so I only offer the
>> above as my somewhat uninformed understanding of the situation.
>
>
> "测试" in GB2312 and "測試" in BIG5.

Yes. Different characters. Not the same.

>
> My testing is Change GB2312 to UTF-8 (OK).

Yes. Because Unicode includes all characters that are in GB2312.
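
A rough in-memory sketch of that first step, assuming the JDK's GB2312 charset is available; the class name Gb2312ToUtf8 and the method transcode are hypothetical. The bytes are decoded to a Java String (UTF-16 internally) and then encoded as UTF-8, and nothing is lost along the way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Gb2312ToUtf8 {
    // Decodes GB2312 bytes to a String, then encodes that String as UTF-8.
    static byte[] transcode(byte[] gb2312Bytes) {
        String text = new String(gb2312Bytes, Charset.forName("GB2312"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "测试" as stored in a GB2312 file
        byte[] gb = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4};
        byte[] utf8 = transcode(gb);
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        // true when the characters survived the conversion
        System.out.println(roundTrip.equals("\u6D4B\u8BD5"));
    }
}
```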

> Then UTF-8 to BIG5, This change not OK.

No, because Big5 is a lot smaller than Unicode and does not include 测
or 试 characters*

> Is some missing or other reason ?

Yes, 测 and 试 characters are missing from Big5*

>
> One simplified Chinese character may correspond to several traditional
> Chinese characters. It may not true.

It is true for some characters. For example:
台 = 臺 or 台 or 檯 or 枱 or 颱

There is a list at
<http://en.wikipedia.org/wiki/Multiple_association_of_converting_Simplified_Chinese_to_Traditional_Chinese>

I suspect Java, for this reason, does not attempt to translate a
simplified Chinese character to a traditional Chinese character.
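
If a conversion of the characters themselves is wanted, it has to be an explicit step before encoding. The sketch below is only an illustration under loud assumptions: the class SimplifiedToTraditional and its two-entry TABLE are hand-made for the two characters in this thread, not a real conversion table, and a one-character-to-one-character map cannot handle the one-to-many cases mentioned above.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class SimplifiedToTraditional {
    // Hypothetical table covering only the two characters from this thread.
    static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u6D4B', '\u6E2C'); // 测 -> 測
        TABLE.put('\u8BD5', '\u8A66'); // 试 -> 試
    }

    // Replaces each character found in TABLE; all others pass through unchanged.
    static String toTraditional(String simplified) {
        StringBuilder sb = new StringBuilder(simplified.length());
        for (int i = 0; i < simplified.length(); i++) {
            char c = simplified.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String traditional = toTraditional("\u6D4B\u8BD51"); // "测试1"
        boolean fits = Charset.forName("Big5").newEncoder().canEncode(traditional);
        System.out.println(traditional + " encodable in Big5: " + fits);
    }
}
```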

* I haven't checked because finding Chinese characters in enormous lists
is hard work for me. So I might be wrong :-)
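
The check can be left to Java instead of searching lists by eye. This is a sketch assuming the JDK's Big5 and GB2312 charsets are installed; the class name CanEncodeCheck and the method representable are made up here. CharsetEncoder.canEncode reports whether a charset has a representation for a given character.

```java
import java.nio.charset.Charset;

class CanEncodeCheck {
    // True when the named charset can represent the character c.
    static boolean representable(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char[] chars = {'\u6D4B', '\u8BD5', '\u6E2C', '\u8A66'}; // 测 试 測 試
        for (char c : chars) {
            System.out.printf("U+%04X  GB2312=%b  Big5=%b%n",
                    (int) c,
                    representable(c, "GB2312"),
                    representable(c, "Big5"));
        }
    }
}
```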

--
RGB
