From: RedGrittyBrick on 25 May 2010 07:02
On 25/05/2010 09:48, moonhkt wrote:
> Thank [you]. I am not testing [with] JDBC.

When you wrote "Our database is ISO8859-1 format with some GB2312 and other non ISO8859-1 data." I got the impression that a DBMS was involved. If you were using Hibernate or some other framework rather than JDBC, the same principles would apply.

> But tried to GB2312 file , to UTF-8 then BIG5

BIG5! Another character set and encoding! I think that makes seven you've mentioned in this thread! Any more?

> 10 TEST1 |测试1
> 11 TEST2 |测试2
> 13 TEST4 |测试4
>
> [the program below] can conv[ert a file containing the above data] to UTF-8
>
> When [it] conv[erts from] UTF-8 to BIG5, [it] can not [successfully convert
> all characters]. Do you know why ?

You are ignoring exceptions. Exceptions might be telling you something you really need to know about. Don't ignore exceptions.

I'm not familiar with GB2312 and Big5 but I expect that there are characters in GB2312 that are not in Big5. It is almost certain.

GB2312 originated in the People's Republic of China, where simplified Chinese characters were mandatory. I think this policy has been relaxed now.

I suspect Big5 originated in either the British colony of Hong Kong or in the Republic of China (Taiwan/Formosa). In both these places, Traditional Chinese characters were (and still are) used.

Whether the conversion from GB2312 to UTF-16 and then to Big5 can convert a simplified character to a traditional counterpart is unknown to me. Perhaps this causes conversion problems?

> [I] Checked [the resulting file] with IE, the BIG5 code is [displayed as] "?"

You have to tell IE what encoding to use to display the file. That was why I wrote HTML markup containing <meta charset="gb2312">. You can probably force an encoding using a menu option in IE. You certainly can in Firefox.

If IE does not have access to a font containing the required glyph, it will display a placeholder character. I don't use IE much so I'm not certain what placeholder IE displays: a small box, a question mark or something else.

If Java writes a character that is not present in the specified output character set then I expect it might also substitute a placeholder character.

Also Big5 is weird, apparently it doesn't exactly encode characters, it encodes logograms or parts of graphical characters. It also has to be paired with a single-byte character set that isn't specified in the Big5 standard. Also there are variants of Big5. Lots of scope for encoding issues. Maybe Java and IE disagree about Big5 variants?
<http://en.wikipedia.org/wiki/Big5>

P.S. IE6 is old and a security hazard, I'd upgrade.

--
RGB
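The difference between a silently substituted placeholder and a reported error can be sketched in Java as follows. This is only an illustration: the class and method names are invented here and do not come from the program discussed in the thread, and the `main` method assumes the JDK in use ships the `Big5` charset. It relies on `String.getBytes(Charset)` replacing characters the charset cannot represent (typically with `?`), whereas a `CharsetEncoder` configured with `CodingErrorAction.REPORT` throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class; the names are illustrative only.
class StrictEncodeDemo {

    /** Lenient: unmappable characters are silently replaced (typically by '?'). */
    static byte[] encodeLenient(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** Strict: throws CharacterCodingException if any character cannot be encoded. */
    static byte[] encodeStrict(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String simplified = "\u6D4B\u8BD5"; // the two simplified characters from the test file
        // Assumption: the "Big5" charset is available in this JDK.
        byte[] lenient = encodeLenient(simplified, "Big5");
        System.out.println("lenient encode produced " + lenient.length + " byte(s)");
        try {
            encodeStrict(simplified, "Big5");
            System.out.println("strict encode succeeded");
        } catch (CharacterCodingException e) {
            // This is the information that is lost when exceptions are ignored.
            System.out.println("strict encode failed: " + e);
        }
    }
}
```

With the strict variant, a failed conversion surfaces as an exception that can be logged, rather than as a `?` that only shows up later in a browser.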
From: moonhkt on 25 May 2010 10:18

Our ISO8859-1 database (a Progress database) holds some Japanese, Korean, Simplified Chinese and Traditional Chinese data. That data is imported by a lookup function. For example, when a user enters "G", the lookup program gets "Green" in the corresponding language's character set. I also checked another database (also Progress) that uses GB2312: the encoded value of "测试" ("TEST" in English) is the same there as in the ISO8859-1 database. I checked this with the Unix tool "od -ct x1 file_name".

As for the BIG5 conversion, I am only testing how to change GB2312 to BIG5. My boss asked me to check what the encoding value of "TEST" is in GB2312 and in BIG5, so I want to convert to BIG5 to see the encoding value there.

I will add the exception handling back. Thanks a lot.

moonhkt
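The "encoding value" check done with `od -ct x1` can also be sketched directly in Java. The helper below is hypothetical (its name and layout are not from the thread), and the `main` method assumes the JDK provides the `GB2312` and `Big5` charsets. It simply prints the bytes a string produces in a given charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
class EncodingValues {

    /** Returns the bytes of text in the given charset as space-separated hex, similar to "od -t x1". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String simplified = "\u6D4B\u8BD5";  // simplified form of "test"
        String traditional = "\u6E2C\u8A66"; // traditional form of "test"
        System.out.println("UTF-8  : " + hex(simplified, StandardCharsets.UTF_8));
        // Assumption: this JDK ships the GB2312 and Big5 charsets.
        System.out.println("GB2312 : " + hex(simplified, Charset.forName("GB2312")));
        System.out.println("Big5   : " + hex(traditional, Charset.forName("Big5")));
    }
}
```

Because `getBytes` is lenient, a `3f` in the output may mean that a character was replaced by `?` rather than encoded, so it is worth comparing the byte count with what the charset should produce.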
From: RedGrittyBrick on 26 May 2010 04:56
On 25/05/2010 15:18, moonhkt wrote:
> Our ISO8859-1 Database(Progress Database) have some Japanese/Korea/
> Simplified Chinese and Traditional Chinese. Those Language imported
> by lookup function. e.g. When User Input "G" in particular , the
> lookup program will get "Green" in corresponding Language Character
> set. Also, I checked other GB2312 Database(Progress Database), the
> Encoding Value of "测试" (in English "TEST") same as ISO8859-1. Checked
> by unix tool "od -ct x1 file_name".
>
> For BIG5 conversion, I just for testing how to change GB2312 to
> BIG5. My Boss ask me for check what is the encoding value for "TEST"
> in GB2312 or BIG5. So, I want convert to BIG5 to check what encoding
> value in BIG5.

"测试" is simplified Chinese.
"測試" is traditional Chinese.

So far as I know:
GB2312 is simplified Chinese.
Big5 is traditional Chinese.

Therefore:
You cannot write "测试" in Big5
You cannot write "測試" in GB2312

Unless I am mistaken.

One simplified Chinese character may correspond to several traditional Chinese characters. Java cannot translate "测试" to "測試" because that is a process that requires artistic skill, literary skill and an understanding of the context.

I do not read, write, speak nor understand Chinese so I only offer the above as my somewhat uninformed understanding of the situation.

--
RGB
From: moonhkt on 26 May 2010 10:12

Hi RGB,

"测试" is in GB2312 and "測試" is in BIG5.

My test is to change GB2312 to UTF-8 (OK), then UTF-8 to BIG5. This second change is not OK. Is something missing, or is there another reason?

> One simplified Chinese character may correspond to several traditional
> Chinese characters.

That may not be true.

Anyway, thanks for your help.
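The two-step test described here (source bytes to Java's internal characters, then characters to target bytes) can be sketched as a single strict helper. This is a guess at the shape of such a conversion, not the program from the thread; the class and method names are invented for illustration. Each step reports an error instead of substituting a placeholder, so it is clear which step fails.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; the names are illustrative only.
class Transcode {

    /** Converts bytes from one charset to another, throwing if either step fails. */
    static byte[] transcode(byte[] input, Charset from, Charset to)
            throws CharacterCodingException {
        // Step 1: source bytes -> Java chars (UTF-16); fails on malformed input.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input));
        // Step 2: Java chars -> target bytes; fails if the target charset
        // has no mapping for one of the characters.
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }
}
```

For the case in this thread the calls would be along the lines of `transcode(gbBytes, Charset.forName("GB2312"), StandardCharsets.UTF_8)` followed by `transcode(utf8Bytes, StandardCharsets.UTF_8, Charset.forName("Big5"))`, assuming both charsets are available in the JDK. No character substitution happens in either step: the helper only re-encodes the same characters.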
From: RedGrittyBrick on 26 May 2010 10:55
On 26/05/2010 15:12, moonhkt wrote:
> "测试" in GB2312 and "測試" in BIG5.

Yes. Different characters. Not the same.

> My testing is Change GB2312 to UTF-8 (OK).

Yes. Because Unicode includes all characters that are in GB2312.

> Then UTF-8 to BIG5, This change not OK.

No, because Big5 is a lot smaller than Unicode and does not include the 测 or 试 characters*

> Is some missing or other reason ?

Yes, the 测 and 试 characters are missing from Big5*

> One simplified Chinese character may correspond to several traditional
> Chinese characters. It may not true.

It is true for some characters. For example:
台 = 臺 or 台 or 檯 or 枱 or 颱

There is a list at
<http://en.wikipedia.org/wiki/Multiple_association_of_converting_Simplified_Chinese_to_Traditional_Chinese>

I suspect Java, for this reason, does not attempt to translate a simplified Chinese character to a traditional Chinese character.

* I haven't checked because finding Chinese characters in enormous lists is hard work for me. So I might be wrong :-)

--
RGB
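Rather than searching code charts by hand, the footnoted claim could be checked with a short Java sketch like the one below. The class and method names are made up for this example, and the `main` method assumes the JDK supplies the `Big5` and `GB2312` charsets. It uses `CharsetEncoder.canEncode` to list the code points of a string that a charset has no mapping for.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
class MissingChars {

    /** Returns the code points in text (as "U+XXXX") that the named charset cannot encode. */
    static List<String> missing(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                result.add(String.format("U+%04X", cp));
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String simplified = "\u6D4B\u8BD5";  // simplified form of "test"
        String traditional = "\u6E2C\u8A66"; // traditional form of "test"
        // Assumption: this JDK ships the Big5 and GB2312 charsets.
        System.out.println("simplified, missing from Big5:    " + missing(simplified, "Big5"));
        System.out.println("traditional, missing from GB2312: " + missing(traditional, "GB2312"));
        System.out.println("traditional, missing from Big5:   " + missing(traditional, "Big5"));
    }
}
```

An empty list means every character has a mapping in that charset. The result reflects the Java implementation of the charset, which may differ from other Big5 variants.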