Ruby 'C' Extensions and Unicode [Ruby]

From: Praveen on 16 Feb 2010 07:53

Forgot to mention. I am on ruby version

ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux].

Let me know if you require any info

Thanks

Praveen

From: Lui Kore on 16 Feb 2010 09:11

I'm not familiar with the enc_ APIs.

But I think the easiest way is to use

rb_funcall(some_str, rb_intern("encode") ...

Praveen wrote:
> Forgot to mention. I am on ruby version
>
> ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux].
>
> Let me know if you require any info
>
> Thanks
>
> Praveen

From: Heesob Park on 16 Feb 2010 09:49

Hi,

2010/2/16 Praveen <praveendevarao(a)gmail.com>:
> Hi Kubo,
>
> I tried proceeding with the above-mentioned APIs. However, I am seeing
> some interesting results. I am not sure if I am using the right constructs.
>
> Below is the Ruby script I am using:
>
> ======================================
> #encoding: utf-8
>
> puts "Results in C extension"
> puts "----------------------"
> require 'ibm_db'
> str = "insert into woods (name) values ('GÃHRINGæ')"
>
> conn = IBM_DB.connect 'DRIVER={IBM DB2 ODBC DRIVER};DATABASE=devdb;HOSTNAME=9.124.159.74;PORT=50000;PROTOCOL=TCPIP;UID=db2admin;PWD=db2admin;','',''
> stmt = IBM_DB.exec conn, str
> IBM_DB.close conn
>
> print "----------------------\n\n"
>
> puts "Results in Ruby script"
> puts "----------------------"
>
> puts "str.length is :#{str.length}"
> puts "str.bytesize: #{str.bytesize}"
> puts "**Forcing encoding**"
> str1 = str.force_encoding("UTF-16LE")
> puts "str.length is :#{str1.length}"
> puts "str.bytesize: #{str1.bytesize}"
> ======================================
>
> In the script above, IBM_DB is the C extension module. However, the
> database call has nothing to do with the Unicode API usage. I have
> just reused the module to try out the Unicode support.
>
> The snippet in C extension that uses the unicode functions is as
> below:
>
> ======================================
> VALUE ibm_db_exec(int argc, VALUE *argv, VALUE self){
>   rb_scan_args(argc, argv, "21", &connection, &stmt, &options);
>   if (!NIL_P(stmt)) {
>     rb_encoding *enc_received;
>     rb_encoding *ucs2_enc = rb_enc_find("UTF-16LE");
>     rb_encoding *ucs4_enc = rb_enc_find("UTF-32LE");
>
>     enc_received = rb_enc_from_index(ENCODING_GET(stmt));
>
>     printf("\nString in received format: %s\n",RSTRING_PTR(stmt));
>     printf("\nrb_str_length is: %d\n",rb_str_length(stmt));
>     printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt));
>     printf("\nEncoding format received: %s\n",enc_received->name);
>
>     stmt_ucs2 = rb_str_export_to_enc(stmt,ucs2_enc);
>
>     printf("\nString in utf16 format: %s\n",RSTRING_PTR(stmt_ucs2));
>     printf("\nrb_str_length is: %d\n",rb_str_length(stmt_ucs2));
>     printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt_ucs2));
>     printf("\nEncoding after conversion: %s\n",ucs2_enc->name);
>   }
> }
>
> ======================================
>
> The above ruby script run produces the following output:
>
> ======================================
>
> Results in C extension
> ----------------------
>
> String in received format: insert into woods (name) values
> ('GÃHRINGÃ¦')
>
> rb_str_length is: 89
>
> RSTRING_LEN is: 47
>
> Encoding format received: UTF-8
>
> String in utf16 format: i #Expected because used printf
>
> rb_str_length is: 89
>
> RSTRING_LEN is: 88
>
> Encoding after conversion: UTF-16LE
> ----------------------
>
> Results in Ruby script
> ----------------------
> str.length is :44
> str.bytesize: 47
> **Forcing encoding**
> str.length is :24
> str.bytesize: 47
>
> ======================================
>
> I am not sure why there is a difference between the length of the
> original string [44] (UTF-8 format) and the length after changing the
> encoding [24] (to UTF-16LE). The same happens in the output from
> the C extension: the bytesize and the length are the same (+1 or -1),
> and the length differs between encoding formats.
>
89 is not an integer but a VALUE. A Fixnum VALUE of 89 represents the
integer 44, since (89 - 1) / 2 = 44.
> Could you tell me what is that I am doing wrong?
>
You should use String#encode instead of String#force_encoding.
force_encoding only relabels the existing bytes, whereas encode
converts them:

puts "**Converting encoding**"
str1 = str.encode("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
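
To make the difference concrete, here is a minimal sketch in plain Ruby (1.9 or later). The sample string is an assumption chosen for illustration, not the SQL statement from the script above. Note that force_encoding changes its receiver in place, hence the dup:

```ruby
# Sample text is hypothetical; "\u00EF" is a character that takes 2 bytes in UTF-8.
utf8 = "na\u00EFve"

# force_encoding keeps the same bytes and only reinterprets them as
# 16-bit units, so the reported character count changes.
relabeled = utf8.dup.force_encoding("UTF-16LE")

# encode rewrites the bytes for the target encoding, so the character
# count is preserved and only the byte size changes.
converted = utf8.encode("UTF-16LE")

puts "utf8:      length=#{utf8.length} bytesize=#{utf8.bytesize}"
puts "relabeled: length=#{relabeled.length} bytesize=#{relabeled.bytesize}"
puts "converted: length=#{converted.length} bytesize=#{converted.bytesize}"
```

The 24 reported by the original script is consistent with this: 47 bytes read as 2-byte units give 23 units plus one leftover byte.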

> Along with this, in C extension is there any API that I can call to
> check if the given string is in a particular encoding or should I use
> rb_enc_from_index and from there read the struct member name and
> determine in the extension that I write?
>
Using rb_enc_get is simpler than rb_enc_from_index, like this:
enc_received = rb_enc_get(stmt);

Also, rb_str_length returns a VALUE, not an integer, so you should
convert it with NUM2INT like this:
printf("\nrb_str_length is: %d\n",NUM2INT(rb_str_length(stmt)));
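
The 89 versus 44 discrepancy follows from how CRuby stores small integers (Fixnums) directly inside a VALUE: the integer is shifted left by one bit and the lowest bit is set as a tag. The arithmetic can be sketched in plain Ruby. The two helper names are made up for this illustration and are not part of any Ruby API:

```ruby
# Hypothetical helpers that mirror the Fixnum tagging arithmetic.
def fixnum_to_value(n)
  (n << 1) | 1   # shift left one bit, set the low tag bit
end

def value_to_fixnum(v)
  v >> 1         # drop the tag bit again
end

puts fixnum_to_value(44)  # the raw VALUE that printf("%d") displayed
puts value_to_fixnum(89)  # the character count NUM2INT would give
```

This is why both calls to rb_str_length printed 89 in the output above: the UTF-8 string and the converted UTF-16LE string each hold 44 characters.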

Regards,

Park Heesob

From: Praveen on 22 Feb 2010 11:27

Thanks, all, for your help!

I will keep you posted on how it goes.

Thanks

Praveen

From: Praveen on 18 Mar 2010 06:36

Hi,

I wanted to know if there is any function in the C extension API
(Ruby 1.9) that can be used to convert a string to the encoding
specified by the user (in the environment or by setting #encoding:
at the beginning of the .rb file).

I did find two functions, namely rb_str_export and rb_str_export_locale.
I am not sure which one will correctly convert strings to the encoding
the user has set.

Could somebody guide me?

Thanks

Praveen
