Ruby 1.9 Encodings

When I came to Ruby 1.9, the first problem I met was encodings. Gregory Brown said, in a training session at the Lone Star Ruby Conference, “Ruby 1.8 works in bytes. Ruby 1.9 works in characters.” In Ruby 1.8, you have to deal with the raw bytes yourself, and it does not provide encoding-aware string functions. In Ruby 1.9, I think you must understand how encodings work to make your life easier. Let us walk through the four encodings in Ruby 1.9 by example.

The Source File Encoding

The source file encoding is the character encoding of a given source file. It is US-ASCII by default. When you create a String literal in your code, it is assigned the encoding of your source file. So you have to change the source encoding when you want to place any non-ASCII content in a String literal.

  
  $cat no_encoding.rb
  p "中文".encoding
  $ruby no_encoding.rb
  no_encoding.rb:1: invalid multibyte char (US-ASCII)
  
  $cat encoding.rb
  #!ruby19
  # encoding: utf-8
  p "中文".encoding
  $ruby encoding.rb
  #<Encoding:UTF-8>

As you can see with no_encoding.rb, Ruby fails with “invalid multibyte char (US-ASCII)” when the source file contains a Chinese string. That is because, when no encoding is specified, Ruby defaults to US-ASCII. Once the encoding is declared with the encoding comment, it works. The encoding comment must be on the first line of the file, or on the second line when the first line is a shebang.
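To check which source encoding Ruby actually assumed for a file, a small sketch like the following may help. It relies on the `__ENCODING__` keyword, which reports the encoding of the current source file. The variable names are only illustrative. Note that the result depends on the interpreter: Ruby 1.9 defaults to US-ASCII as described above, while Ruby 2.0 and later default to UTF-8.

```ruby
# __ENCODING__ is the encoding Ruby assumed for this source file.
source_encoding = __ENCODING__

# A plain literal (no escapes, no interpolation) takes the source encoding.
literal = "plain ASCII literal"

puts "source encoding:  #{source_encoding}"
puts "literal encoding: #{literal.encoding}"
```

Adding or changing the encoding comment at the top of the file and running the script again should change both lines together.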

The String Encoding

Each string has its own encoding, which you can access with the String#encoding method:


   ruby-1.9.2-head>string = "中文"
    => "中文"
   ruby-1.9.2-head>string.encoding
    => #<Encoding:UTF-8>

You can transcode the string into a different encoding by using String#encode:

  
    ruby-1.9.2-head>string_in_gb2312 = string.encode("GB2312")
     => "\x{D6D0}\x{CEC4}"
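The same transcoding can be tried outside irb. This is a minimal sketch, assuming nothing beyond String#encode and String#bytesize. The string is written with `\u` escapes so the example does not depend on the source file encoding, and the variable names are made up for illustration.

```ruby
utf8   = "\u4E2D\u6587"          # the two characters from the irb session, as escapes
gb2312 = utf8.encode("GB2312")   # returns a new string; utf8 itself is unchanged

p utf8.encoding
p gb2312.encoding
p utf8.bytesize      # 6 (three bytes per character in UTF-8)
p gb2312.bytesize    # 4 (two bytes per character in GB2312)

# Transcoding back gives the original characters again.
round_trip = gb2312.encode("UTF-8")
p round_trip == utf8 # true
```

The characters stay the same across the round trip; only the bytes that represent them change.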
  

But the transcoding will fail if the encoding does not support all characters in your string:

 
    ruby-1.9.2-head>string_in_ascii = string.encode("us-ascii")
    Encoding::UndefinedConversionError: U+4E2D from UTF-8 to US-ASCII
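If a failed conversion should not abort the program, there are two common ways to deal with it, sketched below: rescue the exception, or pass the `:undef` and `:replace` options to String#encode so that unsupported characters are swapped for a placeholder. The variable names and the "?" placeholder are my own choices for illustration.

```ruby
text = "\u4E2D\u6587"

# Option 1: rescue the error and decide what to do with it.
begin
  text.encode("US-ASCII")
  strict_result = :converted
rescue Encoding::UndefinedConversionError => e
  strict_result = e.class
  puts "cannot convert: #{e.message}"
end

# Option 2: substitute a placeholder for every character
# that the target encoding cannot represent.
lossy = text.encode("US-ASCII", :undef => :replace, :replace => "?")
p lossy   # "??"
```

The second option loses information, so it is probably best kept for display or logging rather than for data you need to read back.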
 

The External Encoding

The encoding of the data in an IO stream is known to Ruby as the object's external encoding. The default external encoding is pulled from your environment.


   ruby-1.9.2-head>Encoding.default_external
    => #<Encoding:UTF-8>

Here is how the external encoding works:


   ruby-1.9.2-head>f = File.open("example.txt")
    => #<File:example.txt>
   ruby-1.9.2-head>f.external_encoding
    => #<Encoding:UTF-8>
   ruby-1.9.2-head>content = f.read
    => "这是一些示范文本"
   ruby-1.9.2-head>content.encoding
    => #<Encoding:UTF-8>

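If the file does not use the default external encoding, you can override it when opening the file. Ruby does not transcode here; it only labels the bytes it reads with the encoding you name. In this example the file actually holds UTF-8 bytes, so labelling them as GB2312 produces garbled output: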


   ruby-1.9.2-head>f = File.open("example.txt", "r:gb2312")
    => #<File:example.txt>
   ruby-1.9.2-head>f.external_encoding
    => #<Encoding:GB2312>
   ruby-1.9.2-head>content = f.read
    => "\x{E8BF}\x99\xE6\x98\x{AFE4}\xB8\x80\x{E4BA}\x9B\x{E7A4}\x{BAE8}\x8C\x83\xE6\x96\x87\xE6\x9C\xAC\n"
   ruby-1.9.2-head>content.encoding
    => #<Encoding:GB2312>
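To see that the external encoding only labels the bytes, here is a small self-contained sketch. It writes a temporary UTF-8 file (a stand-in for example.txt, created with Tempfile purely for illustration) and reads it back twice, once with the right label and once with the wrong one.

```ruby
require "tempfile"

# Create a small UTF-8 file to read back.
tmp = Tempfile.new("example")
tmp.close
File.open(tmp.path, "w:utf-8") { |f| f.write("\u4E2D\u6587\n") }

as_utf8   = File.open(tmp.path, "r:utf-8")  { |f| f.read }
as_gb2312 = File.open(tmp.path, "r:gb2312") { |f| f.read }

p as_utf8.encoding
p as_gb2312.encoding

# The bytes are identical; only the label attached to them differs.
p as_utf8.bytes.to_a == as_gb2312.bytes.to_a   # true

# String#valid_encoding? reports whether the bytes make sense under the label.
p as_utf8.valid_encoding?
p as_gb2312.valid_encoding?

tmp.unlink
```

For this particular data, String#valid_encoding? should come back true for the UTF-8 read and false for the GB2312 read, which is a cheap way to spot a wrong external encoding.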

The Internal Encoding

The encoding that the programmer wishes to use with the data in a stream is the internal encoding. The default internal encoding is nil unless set explicitly.


   ruby-1.9.2-head>Encoding.default_internal
    => nil

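We can specify an internal encoding when opening the file if the external encoding does not match the encoding we want to use internally. The mode string takes the form "r:external:internal", and Ruby transcodes the data from the external to the internal encoding as it reads.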


   ruby-1.9.2-head>f = File.open("example.txt", "r:utf-8:gb2312")
    => #<File:example.txt>
   ruby-1.9.2-head>f.external_encoding
    => #<Encoding:UTF-8>
   ruby-1.9.2-head>content = f.read
    => "\x{D5E2}\x{CAC7}\x{D2BB}\x{D0A9}\x{CABE}\x{B7B6}\x{CEC4}\x{B1BE}\n"
   ruby-1.9.2-head>content.encoding
    => #<Encoding:GB2312>
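The transcoding that happens on read can be checked with a short script. This is only a sketch: it writes a temporary UTF-8 file with Tempfile (my stand-in for example.txt), opens it with the "r:utf-8:gb2312" mode string, and prints the bytes that come back. IO#internal_encoding is the counterpart of IO#external_encoding.

```ruby
require "tempfile"

# Create a small UTF-8 file to read back.
tmp = Tempfile.new("example")
tmp.close
File.open(tmp.path, "w:utf-8") { |f| f.write("\u4E2D\u6587\n") }

# External encoding UTF-8 (what is on disk), internal encoding GB2312 (what we get).
file     = File.open(tmp.path, "r:utf-8:gb2312")
external = file.external_encoding
internal = file.internal_encoding
content  = file.read
file.close

p external
p internal
p content.encoding
p content.bytes.to_a.map { |b| "%02X" % b }   # ["D6", "D0", "CE", "C4", "0A"]

tmp.unlink
```

Unlike the external-only case, the bytes themselves are different from what is stored on disk, because Ruby converted them while reading.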