From: Praveen on 9 Feb 2010 07:14

Hi,

I am working on enhancing the IBM_DB Ruby driver (database driver for DB2 and Informix) by providing unicode support.

I tried googling with no luck to find any documents or links which talk about the Ruby C extension APIs that can be used to unleash the unicode support of Ruby-1.9 to

1) Convert Ruby string (unicode) object received in the extension API into wchar (like rb_str2cstr, in ruby-1.8)
2) Convert wchar* to a Ruby Object (like rb_str_new2, in ruby-1.8).
3) Convert string objects between different formats (UCS-2, UCS-4).

Could somebody shed some light on the above queries?

Along with the above, could you also tell me whether Ruby is compiled by default to use UCS-2, UCS-4 or some other string format, and how I can find out programmatically in the extension which format is being used?

Thanks
Praveen
From: KUBO Takehiro on 9 Feb 2010 08:06

On Tue, Feb 9, 2010 at 9:15 PM, Praveen <praveendevarao(a)gmail.com> wrote:
> Hi,
>
> I am working on enhancing the IBM_DB Ruby driver (database driver for
> DB2 and Informix) by providing unicode support.
>
> I tried googling with no luck to find any documents or links which
> talk about the Ruby C extension API's that can be used to unleash the
> unicode support of Ruby-1.9 to

Look at ruby-1.9.1-pxxx/include/ruby/encoding.h and ruby-1.9.1-pxxx/string.c.

> 1) Convert Ruby string (unicode) object received in the extension API
> into wchar (like rb_str2cstr, in ruby-1.8)

There is no generic way, because wchar's encoding is platform-dependent. As far as I know, it is UCS-2 on Windows, UCS-4 on Linux, and a locale-dependent value on Solaris.

If it is UCS-2:

    rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
    VALUE ucs2_string = rb_str_export_to_enc(string, ucs2_enc);
    /* 2-byte encoded text contains NUL bytes, and StringValueCStr rejects
       strings with embedded NULs, so take the pointer and byte length. */
    const char *ucs2_cstr = RSTRING_PTR(ucs2_string);
    long ucs2_len = RSTRING_LEN(ucs2_string);

> 2) Convert wchar* to a Ruby Object (like rb_str_new2, in ruby-1.8).

If the wchar's encoding is UCS-2:

    rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
    VALUE ucs2_string = rb_external_str_new_with_enc(cstr, len, ucs2_enc);

> 3) Convert string objects between different formats (UCS-2, UCS-4).

    rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
    rb_encoding *ucs4_enc = rb_enc_find("UCS-4");
    VALUE ucs4_string = rb_str_conv_enc(ucs2_string, ucs2_enc, ucs4_enc);
From: KUBO Takehiro on 9 Feb 2010 08:14

On Tue, Feb 9, 2010 at 10:06 PM, KUBO Takehiro <kubo(a)jiubao.org> wrote:
> rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
> rb_encoding *ucs4_enc = rb_enc_find("UCS-4");

Sorry, UCS-2 and UCS-4 are not defined in ruby 1.9.1. Use UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE instead.
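
For reference, the three conversions discussed above can also be tried at the Ruby level before wiring them into C. The following is a minimal sketch, assuming Ruby 1.9 or later. The helper names are made up for illustration and are not part of IBM_DB or the Ruby C API. String#encode plays roughly the role that rb_str_export_to_enc and rb_str_conv_enc play in the C snippets, while force_encoding only tags raw bytes with an encoding name.

```ruby
# Hypothetical helpers (names invented for this sketch), using the
# encoding names recommended above instead of "UCS-2"/"UCS-4".

# 1) Ruby string -> raw UTF-16LE bytes, e.g. for a 2-byte wchar API.
def to_utf16le_bytes(str)
  str.encode("UTF-16LE").force_encoding("BINARY")
end

# 2) Raw UTF-16LE bytes -> Ruby string in UTF-8.
#    dup first so the caller's buffer is not relabeled in place.
def from_utf16le_bytes(bytes)
  bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
end

# 3) Convert a UTF-16LE string to UTF-32LE.
def utf16le_to_utf32le(str16)
  str16.encode("UTF-32LE")
end

sample = "A\u00E9"                          # 2 characters
wide   = to_utf16le_bytes(sample)
puts wide.bytes.inspect                     # two bytes per character here
puts from_utf16le_bytes(wide) == sample     # round trip
puts utf16le_to_utf32le(sample.encode("UTF-16LE")).bytesize
```

The byte order (LE or BE) has to match what the database client library expects, which is an assumption to verify for each platform.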
From: Praveen on 9 Feb 2010 11:39

Hi Kubo,

Thanks for the information. I will give it a try and get back to you on how I progress [with doubts/success].

Thanks
Praveen
From: Praveen on 16 Feb 2010 04:13

Hi Kubo,

I tried proceeding with the APIs mentioned above. However, I am seeing some interesting results, and I am not sure I am using the right constructs.

Below is the Ruby script I am using:
======================================
#encoding: utf-8
puts "Results in C extension"
puts "----------------------"
require 'ibm_db'
str = "insert into woods (name) values ('GÃHRINGæ')"
conn = IBM_DB.connect 'DRIVER={IBM DB2 ODBC DRIVER};DATABASE=devdb;HOSTNAME=9.124.159.74;PORT=50000;PROTOCOL=TCPIP;UID=db2admin;PWD=db2admin;','',''
stmt = IBM_DB.exec conn, str
IBM_DB.close conn
print "----------------------\n\n"
puts "Results in Ruby script"
puts "----------------------"
puts "str.length is :#{str.length}"
puts "str.bytesize: #{str.bytesize}"
puts "**Forcing encoding**"
str1 = str.force_encoding("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
======================================

In the script above, IBM_DB is the C extension module. The database call itself has nothing to do with the unicode API usage; I have just reused the module to try out the unicode support.
The snippet in the C extension that uses the unicode functions is as below:
======================================
VALUE ibm_db_exec(int argc, VALUE *argv, VALUE self){
  rb_scan_args(argc, argv, "21", &connection, &stmt, &options);
  if (!NIL_P(stmt)) {
    rb_encoding *enc_received;
    rb_encoding *ucs2_enc = rb_enc_find("UTF-16LE");
    rb_encoding *ucs4_enc = rb_enc_find("UTF-32LE");
    enc_received = rb_enc_from_index(ENCODING_GET(stmt));
    printf("\nString in received format: %s\n",RSTRING_PTR(stmt));
    printf("\nrb_str_length is: %d\n",rb_str_length(stmt));
    printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt));
    printf("\nEncoding format received: %s\n",enc_received->name);
    stmt_ucs2 = rb_str_export_to_enc(stmt,ucs2_enc);
    printf("\nString in utf16 format: %s\n",RSTRING_PTR(stmt_ucs2));
    printf("\nrb_str_length is: %d\n",rb_str_length(stmt_ucs2));
    printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt_ucs2));
    printf("\nEncoding after conversion: %s\n",ucs2_enc->name);
  }
}
======================================

Running the above Ruby script produces the following output:
======================================
Results in C extension
----------------------
String in received format: insert into woods (name) values ('GÃHRINGæ')
rb_str_length is: 89
RSTRING_LEN is: 47
Encoding format received: UTF-8
String in utf16 format: i #Expected because used printf
rb_str_length is: 89
RSTRING_LEN is: 88
Encoding after conversion: UTF-16LE
----------------------

Results in Ruby script
----------------------
str.length is :44
str.bytesize: 47
**Forcing encoding**
str.length is :24
str.bytesize: 47
======================================

I am not sure why there is a difference between the length of the original string [44] (UTF-8 format) and the length of the string after changing the encoding [24] (to UTF-16LE). The same is the case with the output in the C extension: the bytesize and the length are the same (+1 or -1), and the length is different in different encoding formats.

Could you tell me what I am doing wrong?
Along with this, is there any API in the C extension that I can call to check whether a given string is in a particular encoding, or should I use rb_enc_from_index, read the name member of the struct from there, and determine it myself in the extension that I write?

Thanks
Praveen
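
The numbers in the output above can be reproduced at the Ruby level. What follows is a minimal sketch under stated assumptions: the sample text and the helper names are invented for illustration, and the reading of the results is an interpretation, not something confirmed in the thread. force_encoding only relabels the existing bytes, while encode actually transcodes them; reading 47 UTF-8 bytes as 2-byte units gives 23 full units plus one leftover byte, which would account for the reported length of 24. rb_str_length returns a VALUE rather than a C integer, and a small integer n is stored in a VALUE as 2n+1, so printing it with %d shows 89 for a length of 44. For the encoding check, comparing the encoding object is usually enough; encoding.h also declares rb_enc_get and rb_utf8_encoding, which allow the same comparison in C.

```ruby
# Sample text chosen for this sketch; it is not the string from the post.
SAMPLE = "caf\u00E9"   # 4 characters, 5 bytes in UTF-8

# encode transcodes: same characters, new byte representation.
def transcoded(str)
  str.encode("UTF-16LE")
end

# force_encoding only relabels a copy of the bytes; nothing is converted.
def relabeled(str)
  str.dup.force_encoding("UTF-16LE")
end

# How a small integer n appears when its raw VALUE is printed from C.
def fixnum_as_raw_value(n)
  (n << 1) | 1
end

# Ruby-level encoding check (hypothetical helper name).
def encoded_as?(str, name)
  str.encoding == Encoding.find(name)
end

puts "original:   length=#{SAMPLE.length} bytesize=#{SAMPLE.bytesize}"
t = transcoded(SAMPLE)
puts "transcoded: length=#{t.length} bytesize=#{t.bytesize}"
r = relabeled(SAMPLE)
puts "relabeled:  length=#{r.length} bytesize=#{r.bytesize} valid=#{r.valid_encoding?}"
puts "raw VALUE for 44: #{fixnum_as_raw_value(44)}"
puts "is UTF-8? #{encoded_as?(SAMPLE, 'UTF-8')}"
```

In the C extension, NUM2LONG(rb_str_length(stmt)) would give the character count as a plain integer, and RSTRING_LEN is a long, so %ld is the matching printf format.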