Saturday, March 01, 2008

Fixnum Overflow in Ruby’s Hash Implementation

The original post is on my main blog.

Ruby’s build-in Hash is the first-choice if you want to do searching. Using your own customized object as hash key is simple: define the following two method for your object:
  • hash: to get the hash code of the object.
  • eql?: to compare whether two object are equal.

When working to improve the performance of RMMSeg, I tried to implement a Substring class which can hold a reference to a big chunk of text instead of doing an expensive copy. Then I implemented the hash and eql? method. The hash value calculated is identical to the related String, and eql? is properly implemented. But the whole thing seemed not working quite well.

I though it’s my code’s fault because it’s the first time for me to write a C extension of Ruby. I use gdb to trace the program — it’s very hard to do, because Hash is a very commonly used data structure in Ruby. Many core libraries use it. :(

However, finally I figured it out (after a sleep) and created a small piece of code to reproduce it:

class MyStr
def initialize(str)
@str = str
end
def hash
@str.hash
end
def eql?(o)
@str.eql?(o)
end
end

s1 = "foo"
s2 = "This"
my_s1 = MyStr.new(s1)
my_s2 = MyStr.new(s2)
h = { s1 => "value of foo", s2 => "value of This" }

puts "h[my_s1] = #{h[my_s1].inspect}"
puts "h[my_s2] = #{h[my_s2].inspect}"

The expected output should be

h[my_s1] = "value of foo"
h[my_s2] = "value of This"

but the actual output is

h[my_s1] = "value of foo"
h[my_s2] = nil

So what’s wrong with is? Why “foo” is right but “This” is wrong? Looking at the code of Hash in Ruby answers the question. Ruby’s treating String specially. When the key is a String, it use rb_str_hash directly to calculate the hash value.

rb_str_hash returns an int. But user customized objects don’t have this special treatment. The hash method is called in the Ruby environment returning a Fixnum which finally converted to an int.

The problem is that the value range of a Fixnum is small than int. The calculated hash value of “This”, 2073740424, when converted to Fixnum and then converting back, finally becomes -73743224.

That’s the problem. When key is a String, its hash is the original 2073740424. But when not, its hash becomes the weird -73743224. It’s a bug, not only with String, but also Symbol and Fixnum. I’ve post the bug report and a suggested patch to ruby-core ML. Hope it get fixed soon. :)

Tuesday, February 26, 2008

[ANN] RMMSeg 0.1.2 Released

Mainly performance improvement.

rmmseg version 0.1.2
by pluskid
http://rmmseg.rubyforge.org

== DESCRIPTION

RMMSeg is an implementation of MMSEG Chinese word segmentation
algorithm. It is based on two variants of maximum matching
algorithms. Two algorithms are available for using:

* simple algorithm that uses only forward maximum matching.
* complex algorithm that uses three-word chunk maximum matching and 3
aditonal rules to solve ambiguities.

For more information about the algorithm, please refer to the
following essays:

* http://technology.chtsai.org/mmseg/
* http://pluskid.lifegoo.com/?p=261

== CHANGES

* Add cache to find_match_words: performance improved.
* Implement Chunk as a module instead of a class: performance improved.
* Don’t store unnecessary data in dictionary: memory usage reduced.

Saturday, January 05, 2008

Running RSpec in Emacs

This article is originally posted at my main blog (Mostly Chinese) : http://pluskid.lifegoo.com/?p=245

RSpec is a Behaviour Driven Development framework for Ruby. It’s output format can be customized. However, the default format works well in Emacs’s compilation-mode. Type M-x compile and input spec file_name_spec.rb. The result will be prompted at a new buffer.

Some useful information are colorized. You can even use your mouse to click on the failures to go directly to the line where spec fails (Of course there’re shortcuts like C-` available). However, we can still make it better.

More highlighting

By default the cursor is at the beginning in the newly prompted buffer with the spec results. We want it to be at the end so that we can see how many examples failed. That’s easy, in fact, it is the default behavior before Emacs 20.3:

;; keep scrolling in compilation result buffer
(setq compilation-scroll-output t)

That’s simple and cool! But I want the number be highlighted! And more highlighted when the number of failures is not zero. That’s also easy, we can add some rules to achieve this:

(add-to-list 'compilation-mode-font-lock-keywords
'("^\\([[:digit:]]+\\) examples?, \\([[:digit:]]+\\) failures?\\(?:, \\([[:digit:]]+\\) pendings?\\)?$"
(0 '(face nil message nil help-echo nil mouse-face nil) t)
(1 compilation-info-face)
(2 (if (string= "0" (match-string 2))
compilation-info-face
compilation-error-face))
(3 compilation-info-face t t)))

And here’s a screenshot:

emacs-rspec.png

Yeah! Cool! :D

Smart Compile

smart-compile.el is an extension for Emacs to guess the compilation command for different type of files. Customization is simple. Here’s my customization (I use Emacs to edit a lot of files):

(require 'smart-compile)
(setq smart-compile-alist
'(("/programming/guile/.*c$" . "gcc -Wall %f `guile-config link` -o %n")
("\\.c\\'" . "gcc -Wall %f -lm -o %n")
("\\.[Cc]+[Pp]*\\'" . "g++ -Wall %f -lm -o %n")
("\\.java$" . "javac %f")
("_spec\\.rb$" . "spec %f")
("\\.rb$" . "ruby %f")
(emacs-lisp-mode . (emacs-lisp-byte-compile))
(html-mode . (browse-url-of-buffer))
(html-helper-mode . (browse-url-of-buffer))
("\\.skb$" . "skribe %f -o %n.html")
(haskell-mode . "ghc -o %n %f")
(asy-mode . (call-interactively 'asy-compile-view))
(muse-mode . (call-interactively 'muse-project-publish))))
(global-set-key (kbd "") 'smart-compile)

I set the global shortcut to f9. Now just name your spec files with foo_spec.rb (this is the convention). When you are in the buffer, just press f9. It will prompt you the correct command to run your specs.

Wish you enjoy it! :)