Free Mind

Tuesday, February 10, 2009

New Blog

It is here: http://blog.pluskid.org , and the posts in the original blog has been archived here: http://lifegoo.pluskid.org .

Tuesday, October 28, 2008

pluskid.lifegoo.com 大迁移

事情确实发生得有点突然，昨天晚上登上 Google Talk ，Jack 突然告诉我托管的服务器在 11 月份就要拿回来了。那代表所有 lifegoo.com 上的应用都会下线吧。这个 blog 是去年五月份的时候我请求帮忙开通的，非常好用，真的要感谢 Jack 和 sishen 了！

不过，现在得到这个消息还真有些伤心，毕竟是用了这么长时间的 blog ，incoming link 也是有不少的。但是总之还是先把数据备份下来吧。我想先用 wordpress 的方式备份一份数据库，再用普通网页抓取的方式把网站爬下来，不过好像 quark 昨天晚上已经帮我爬过了，还做成了一个 chm ，待会找他要去。

备份下来之后还要考虑今后的去处呢。一时之间都没有想到哪个地方比较好用又稳定的。也许是该去注册一个域名了，不知道现在注册域名费用如何。可是如果单单是为了放一个 blog 去租一个虚拟空间的话，似乎太浪费了，而且这些价格啊、在哪里租比较好啊之类的也都还很不了解。时间比较仓促，似乎现在 blogger 还没有被封掉，于是暂时用 blogger 上的那个吧： http://pluskid.blogspot.com ，如果我找到了新的地方，会在那里贴出来。lifegoo 这里也不清楚具体哪天会下线，所以我待会把这篇文章也贴过去。

我想 blog 我还是希望继续写的，虽然现在频率比以前已经低了很多了吧。moonykily 曾经跟我说他觉得写技术 blog 的人都是脑子进水了，blog 本来就是用来抒发自己心情的。当然我是不会同意他的这个观点的，且不说写技术 blog 能够让知识得到分享这样的话吧，更自私的观点就是：如果你能把自己知道的东西给别人描述清楚的话，你会掌握得更加牢靠。而有些东西当你准备要写下来的时候，你才会发现原来自己根本没有弄清楚。我写 blog 一般会花掉半天到两天不等的时间，但是大部分时候我还是觉得是有收获的。

所以，我还是先去备份数据了，之后怎么迁移还真是个麻烦事，一时也想不清楚，不知道大家有什么好的主意没有？

Saturday, March 01, 2008

Fixnum Overflow in Ruby’s Hash Implementation

The original post is on my main blog.

Ruby’s build-in Hash is the first-choice if you want to do searching. Using your own customized object as hash key is simple: define the following two method for your object:

hash: to get the hash code of the object.
eql?: to compare whether two object are equal.

When working to improve the performance of RMMSeg, I tried to implement a Substring class which can hold a reference to a big chunk of text instead of doing an expensive copy. Then I implemented the hash and eql? method. The hash value calculated is identical to the related String, and eql? is properly implemented. But the whole thing seemed not working quite well.

I though it’s my code’s fault because it’s the first time for me to write a C extension of Ruby. I use gdb to trace the program — it’s very hard to do, because Hash is a very commonly used data structure in Ruby. Many core libraries use it.

However, finally I figured it out (after a sleep) and created a small piece of code to reproduce it:

class MyStr
 def initialize(str)
   @str = str
 end
 def hash
   @str.hash
 end
 def eql?(o)
   @str.eql?(o)
 end
end

s1 = "foo"
s2 = "This"
my_s1 = MyStr.new(s1)
my_s2 = MyStr.new(s2)
h = { s1 => "value of foo", s2 => "value of This" }

puts "h[my_s1] = #{h[my_s1].inspect}"
puts "h[my_s2] = #{h[my_s2].inspect}"

The expected output should be

h[my_s1] = "value of foo"
h[my_s2] = "value of This"

but the actual output is

h[my_s1] = "value of foo"
h[my_s2] = nil

So what’s wrong with is? Why “foo” is right but “This” is wrong? Looking at the code of Hash in Ruby answers the question. Ruby’s treating String specially. When the key is a String, it use rb_str_hash directly to calculate the hash value.

rb_str_hash returns an int. But user customized objects don’t have this special treatment. The hash method is called in the Ruby environment returning a Fixnum which finally converted to an int.

The problem is that the value range of a Fixnum is small than int. The calculated hash value of “This”, 2073740424, when converted to Fixnum and then converting back, finally becomes -73743224.

That’s the problem. When key is a String, its hash is the original 2073740424. But when not, its hash becomes the weird -73743224. It’s a bug, not only with String, but also Symbol and Fixnum. I’ve post the bug report and a suggested patch to ruby-core ML. Hope it get fixed soon.

Tuesday, February 26, 2008

[ANN] RMMSeg 0.1.2 Released

Mainly performance improvement.

rmmseg version 0.1.2
by pluskid
http://rmmseg.rubyforge.org

== DESCRIPTION

RMMSeg is an implementation of MMSEG Chinese word segmentation
algorithm. It is based on two variants of maximum matching
algorithms. Two algorithms are available for using:

* simple algorithm that uses only forward maximum matching.
* complex algorithm that uses three-word chunk maximum matching and 3
aditonal rules to solve ambiguities.

For more information about the algorithm, please refer to the
following essays:

* http://technology.chtsai.org/mmseg/
* http://pluskid.lifegoo.com/?p=261

== CHANGES

* Add cache to find_match_words: performance improved.
* Implement Chunk as a module instead of a class: performance improved.
* Don’t store unnecessary data in dictionary: memory usage reduced.

Saturday, January 05, 2008

Running RSpec in Emacs

This article is originally posted at my main blog (Mostly Chinese) : http://pluskid.lifegoo.com/?p=245

RSpec is a Behaviour Driven Development framework for Ruby. It’s output format can be customized. However, the default format works well in Emacs’s compilation-mode. Type M-x compile and input spec file_name_spec.rb. The result will be prompted at a new buffer.

Some useful information are colorized. You can even use your mouse to click on the failures to go directly to the line where spec fails (Of course there’re shortcuts like C-` available). However, we can still make it better.

More highlighting

By default the cursor is at the beginning in the newly prompted buffer with the spec results. We want it to be at the end so that we can see how many examples failed. That’s easy, in fact, it is the default behavior before Emacs 20.3:

;; keep scrolling in compilation result buffer
(setq compilation-scroll-output t)

That’s simple and cool! But I want the number be highlighted! And more highlighted when the number of failures is not zero. That’s also easy, we can add some rules to achieve this:

(add-to-list 'compilation-mode-font-lock-keywords
      '("^\\([[:digit:]]+\\) examples?, \\([[:digit:]]+\\) failures?\\(?:, \\([[:digit:]]+\\) pendings?\\)?$"
        (0 '(face nil message nil help-echo nil mouse-face nil) t)
        (1 compilation-info-face)
        (2 (if (string= "0" (match-string 2))
        compilation-info-face
      compilation-error-face))
        (3 compilation-info-face t t)))

And here’s a screenshot:

Yeah! Cool!

Smart Compile

smart-compile.el is an extension for Emacs to guess the compilation command for different type of files. Customization is simple. Here’s my customization (I use Emacs to edit a lot of files):

(require 'smart-compile)
(setq smart-compile-alist
     '(("/programming/guile/.*c$" .    "gcc -Wall %f `guile-config link` -o %n")
       ("\\.c\\'"              .       "gcc -Wall %f -lm -o %n")
       ("\\.[Cc]+[Pp]*\\'"     .       "g++ -Wall %f -lm -o %n")
       ("\\.java$"             .       "javac %f")
 ("_spec\\.rb$"          .       "spec %f")
 ("\\.rb$"               .       "ruby %f")
       (emacs-lisp-mode        .       (emacs-lisp-byte-compile))
       (html-mode              .       (browse-url-of-buffer))
       (html-helper-mode       .       (browse-url-of-buffer))
       ("\\.skb$"              .       "skribe %f -o %n.html")
       (haskell-mode           .       "ghc -o %n %f")
       (asy-mode               .       (call-interactively 'asy-compile-view))
       (muse-mode              .       (call-interactively 'muse-project-publish))))
(global-set-key (kbd "") 'smart-compile)

I set the global shortcut to f9. Now just name your spec files with foo_spec.rb (this is the convention). When you are in the buffer, just press f9. It will prompt you the correct command to run your specs.

Wish you enjoy it!