change ISO8859-1 to GB2312 [Java Programming]

From: RedGrittyBrick on 24 May 2010 05:57

On 24/05/2010 03:30, moonhkt wrote:
> On May 22, 6:23 am, RedGrittyBrick<RedGrittyBr...(a)SpamWeary.invalid>
> wrote:
>> On 21/05/2010 17:38, moonhkt wrote:
>>
>>> Our database is in ISO8859-1 format with some GB2312 and other non
>>> ISO8859-1 data. Now we want to print GB2312 code in work order routing.
>>> We are planning to purchase a Chinese line printer for printing GB2312.
>>> The line printer can print the file under UNIX. Why does the output
>>> file not need to be converted to GB2312 format before printing?
>>
>> You don't provide any details so I can only guess. My guess is that the
>> Database thinks it has (for example) six European letters when in fact
>> it has three Chinese characters. The database is happy to store and
>> retrieve the byte sequences that would, under 8859-1 encoding, represent
>> six European letters. When the retrieved byte sequences are sent to the
>> printer, because the printer is configured to use the GB2312 encoding,
>> it interprets those same byte sequences, not as six European letters but
>> as three Chinese characters.
>>
>> On the other hand, so far as I know, Unix/Linux printing systems like
>> CUPS allow you to specify a character encoding as an option to commands
>> like lp. They also pick it up from the locale (see environment
>> variables). This allows CUPS to do whatever is needed to print those
>> characters correctly.
>>
>>> Any suggestions? And can a Java conversion program convert my output to
>>> UTF-8?
>>
>> I'm sure it can. If a Java program knows what encodings are to be used
>> for data input and data output then the standard classes allow you to
>> handle data correctly*. How that would help in your situation I don't
>> know. If your database thinks it is handing 8859-1 encoded European
>> characters to your Java program when in fact some of that needs to be
>> interpreted as GB2312 then I expect you will have to do something ugly
>> in Java. UTF-8 is, in general, a good thing. Configuring your database,
>> your programs, your locale and your printer for UTF-8 might well be a
>> good thing to do.
>>
>
> Hi All
> Today our printer vendor suggested that we provide Hanzi EBCDIC code for
> testing Chinese printing, because all IBM hosts support Hanzi EBCDIC
> code.

You have an IBM System z?

Throwing EBCDIC code-pages into the mix with 8859-1, GB2312 and UTF-8
seems to me to be making your life more complex when you need to make it
simpler. Still, presumably your printer vendor's salesman has your best
interests at heart.

> How to convert GB2312/UTF-8 to EBCDIC?

<http://stackoverflow.com/questions/771054/utf-8-to-ebcdic-in-java>

> I tried cp1047 on cp1838; all the ASCII codes are as before. I compared
> the output using diff to check for differences.

What JCL did you use to run diff?

--
RGB

From: moonhkt on 24 May 2010 10:04

Hi All

Our system is a P630.
No. Suppose there are just two charsets in the file: ISO8859-1/GB2312, to be
converted to UTF-8 or EBCDIC.
To compare the different outputs, I use the UNIX diff command.

moonhkt

From: RedGrittyBrick on 24 May 2010 18:09

On 24/05/2010 15:04, moonhkt wrote:
> Our system is a P630.
> No. Suppose there are just two charsets in the file: ISO8859-1/GB2312, to be
> converted to UTF-8 or EBCDIC.
> To compare the different outputs, I use the UNIX diff command.

Your task can be broken down into three elements:
1) Read ISO-8859-1 encoded text from database.
2) Convert incorrectly encoded text back into Unicode UTF-16
3) Convert UTF-16 to UTF-8 (or EBCDIC)

For the first part, your JDBC drivers should provide a way to make sure
the correct encoding conversion is performed so that whatever encoding
the database is using is known to the driver and it can convert text to
the UTF-16 encoding used by Java. See your DBMS documentation.

The second part is tricky. Your database thinks the GB2312 data is
ISO-8859-1 (because you lied to it). Now Java is under the same illusion
and has done the arithmetic that would normally convert from ISO-8859-1
to Unicode/UTF-16. This has probably made an unholy mess of the GB2312
data. You have to reverse this. It's late, I'm tired and I just don't
care enough at the moment to think about how this would be done. (later)
I think I would use java.lang.String's methods to convert to byte[]
using ISO-8859-1 conversion then restore to String form using GB2312
conversion. I'm assuming the GB2312 data pretending to be ISO-8859-1 is
in a separate field in a table and hence in a separate
ResultSet.getString() result. If not ... oh dear.
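
The round trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not something tested against a real database: the class name MojibakeRepair and the method repair are made up for the example, the JVM is assumed to provide the GB2312 charset, and the input string is assumed to come from decoding GB2312 bytes as ISO-8859-1 with no bytes lost on the way. StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes a wrong ISO-8859-1 decode of GB2312 bytes.
class MojibakeRepair {

    // Assumption: the JVM ships a charset named "GB2312".
    private static final Charset GB2312 = Charset.forName("GB2312");

    /** Recovers the original bytes via ISO-8859-1, then decodes them as GB2312. */
    static String repair(String misdecoded) {
        // ISO-8859-1 maps chars 0..255 one-to-one onto bytes 0..255,
        // so this step gives back exactly the bytes the database stored.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, GB2312);
    }

    public static void main(String[] args) {
        String original = "\u6D4B\u8BD5"; // two Chinese characters

        // Simulate what happens when GB2312 bytes are read as ISO-8859-1.
        byte[] stored = original.getBytes(GB2312);
        String misdecoded = new String(stored, StandardCharsets.ISO_8859_1);

        String repaired = repair(misdecoded);
        System.out.println(repaired.equals(original)); // prints "true"
    }
}
```

In real use, the argument to repair would be the value returned by ResultSet.getString() for the affected column only. Applying it to fields that really do hold ISO-8859-1 text would corrupt any accented letters in them.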

The last part is easy - see below. I just output some GB2312 characters
using EUC-CN encoding into an HTML file because my web-browser, Firefox,
understands GB2312 - it's a convenient way to check the correctness of
the conversion. You want UTF-8 or EBCDIC not GB2312 but the principle is
the same.

-------------------------------8<------------------------------
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class TestGB2312 {

    public static void main(String[] args) {
        /*
         * Note: The fun characters are specified as Unicode escapes.
         * We later get Java to convert to GB2312 in EUC_CN encoding.
         */
        String data = "<html><head><meta charset=\"gb2312\"></head><body>"
                + "Character set:GB2312" + "Encoding: EUC_CN"
                + "Roman Numerals: \u2160\u2161\u2162\u2163"
                + "Han (Numerals): \u3220\u3221\u3222\u3223"
                + "</body></html>";

        writeFileAsGB2312("GB2312.html", data);
    }

    private static void writeFileAsGB2312(String fileName, String data) {
        PrintWriter pw;
        try {
            // The second argument names the charset used to encode the file.
            pw = new PrintWriter(fileName, "GB2312");
            pw.println(data);
            pw.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

-------------------------------8<------------------------------

Where I've got "GB2312" and "gb2312" you might want "UTF-8" and "utf8".

See
<http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html>
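
Whether a given target encoding is available depends on the JVM, so a conversion program may want to check before encoding. The following is a small sketch of that check, with made-up names (EncodeSketch, encode). It uses IBM1047 only because cp1047 was mentioned earlier in the thread; as far as I know that is a Latin EBCDIC code page and cannot represent Hanzi, and the thread does not say which EBCDIC code page the printer actually expects.

```java
import java.nio.charset.Charset;

// Hypothetical helper: encode text only if the JVM supports the charset.
class EncodeSketch {

    /** Returns the encoded bytes, or null if this JVM lacks the named charset. */
    static byte[] encode(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "\u6D4B\u8BD5"; // two Chinese characters

        // UTF-8 is required on every Java platform; each of these
        // characters takes three bytes.
        byte[] utf8 = encode(text, "UTF-8");
        System.out.println("UTF-8 bytes: " + utf8.length);

        // EBCDIC charsets are optional, so the result may be null here.
        byte[] ebcdic = encode("TEST1", "IBM1047");
        System.out.println("IBM1047 available: " + (ebcdic != null));
    }
}
```

Note that String.getBytes silently substitutes a replacement byte for characters the target charset cannot represent, so a successful call does not by itself prove the output is correct.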

I imagine you knew all the above and were hoping for help with the part
which I numbered 2.

--
RGB

From: moonhkt on 25 May 2010 04:48

Thanks. I am not testing JDBC, but I tried converting a GB2312 file to
UTF-8 and then to BIG5.

10 TEST1 |测试1
11 TEST2 |测试2
13 TEST4 |测试4

It can be converted to UTF-8.

The conversion from UTF-8 to BIG5 does not work. Do you know why?

Checked with IE, the BIG5 output shows "?".

import java.io.*;

public class Conv_cp
{
    /** Prints the expected parameters and exits. */
    public static void help ()
    {
        System.out.println("Missing parameter");
        System.out.println("1- Input file name ");
        System.out.println("2- FromCode ");
        System.out.println("3- ToCode ");
        System.exit(0);
    }

    public static void main( String[] args )
    {
        if ( args.length < 3 ) {
            help ();
        }
        new Conv_cp().recode(args[0] , args[1] , args[2] );
    }

    /** Reads fnin as charset cpf and writes it to standard output as charset cpt. */
    public void recode(String fnin, String cpf , String cpt)
    {
        final BufferedReader rin;
        final BufferedWriter owt;
        try
        {
            // Note: the input is looked up as a classpath resource,
            // not as an arbitrary file path.
            rin = new BufferedReader( new InputStreamReader(
                getClass().getResourceAsStream( fnin ), cpf ));
            owt = new BufferedWriter( new OutputStreamWriter(
                System.out, cpt ));
        }
        catch ( IOException exc )
        {
            // Also reached when cpf or cpt is not a supported charset name.
            exc.printStackTrace();
            return;
        }
        try
        {
            for ( String str; (str = rin.readLine()) != null; )
            {
                owt.write( str );
                owt.newLine();
            }
            owt.flush();
        }
        catch ( IOException exc )
        {
            exc.printStackTrace();
        }
        finally
        {
            try
            {
                rin.close();
                owt.close();
            }
            catch ( IOException exc )
            {
                exc.printStackTrace();
            }
        }
    }
}
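
One likely cause of the "?" output, offered as an assumption since nothing in the thread confirms it: the sample text is written in simplified Chinese characters, while Big5 is a traditional Chinese character set. When a Java writer meets a character the target charset cannot represent, it substitutes a replacement, normally "?". The sketch below, with made-up names (Big5Check, encodable), asks the encoder directly whether it can represent a string. It assumes the JVM provides both the GB2312 and Big5 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical check: can a charset represent every character of a string?
class Big5Check {

    /** Returns true if the named charset can encode all of text. */
    static boolean encodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String simplified = "\u6D4B\u8BD5";  // simplified forms, as in the sample file
        String traditional = "\u6E2C\u8A66"; // traditional forms of the same word

        System.out.println("GB2312, simplified:  " + encodable(simplified, "GB2312"));
        System.out.println("Big5,   simplified:  " + encodable(simplified, "Big5"));
        System.out.println("Big5,   traditional: " + encodable(traditional, "Big5"));
    }
}
```

On a JVM with both charsets this should report true, false and true. If that is the cause, recoding alone cannot produce Big5 output; the text would first need a simplified-to-traditional conversion, which the standard Java charset classes do not perform.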
