Saturday, March 01, 2008

Fixnum Overflow in Ruby’s Hash Implementation

The original post is on my main blog.

Ruby’s build-in Hash is the first-choice if you want to do searching. Using your own customized object as hash key is simple: define the following two method for your object:
  • hash: to get the hash code of the object.
  • eql?: to compare whether two object are equal.

When working to improve the performance of RMMSeg, I tried to implement a Substring class which can hold a reference to a big chunk of text instead of doing an expensive copy. Then I implemented the hash and eql? method. The hash value calculated is identical to the related String, and eql? is properly implemented. But the whole thing seemed not working quite well.

I though it’s my code’s fault because it’s the first time for me to write a C extension of Ruby. I use gdb to trace the program — it’s very hard to do, because Hash is a very commonly used data structure in Ruby. Many core libraries use it. :(

However, finally I figured it out (after a sleep) and created a small piece of code to reproduce it:

class MyStr
def initialize(str)
@str = str
end
def hash
@str.hash
end
def eql?(o)
@str.eql?(o)
end
end

s1 = "foo"
s2 = "This"
my_s1 = MyStr.new(s1)
my_s2 = MyStr.new(s2)
h = { s1 => "value of foo", s2 => "value of This" }

puts "h[my_s1] = #{h[my_s1].inspect}"
puts "h[my_s2] = #{h[my_s2].inspect}"

The expected output should be

h[my_s1] = "value of foo"
h[my_s2] = "value of This"

but the actual output is

h[my_s1] = "value of foo"
h[my_s2] = nil

So what’s wrong with is? Why “foo” is right but “This” is wrong? Looking at the code of Hash in Ruby answers the question. Ruby’s treating String specially. When the key is a String, it use rb_str_hash directly to calculate the hash value.

rb_str_hash returns an int. But user customized objects don’t have this special treatment. The hash method is called in the Ruby environment returning a Fixnum which finally converted to an int.

The problem is that the value range of a Fixnum is small than int. The calculated hash value of “This”, 2073740424, when converted to Fixnum and then converting back, finally becomes -73743224.

That’s the problem. When key is a String, its hash is the original 2073740424. But when not, its hash becomes the weird -73743224. It’s a bug, not only with String, but also Symbol and Fixnum. I’ve post the bug report and a suggested patch to ruby-core ML. Hope it get fixed soon. :)

No comments: