Finished my machine learning courses

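December 19, 2011

After three months, I have finally watched all the ml-class videos, completed all the review questions, and submitted all the programming exercises. It feels good. I have long been interested in data mining and related fields. Looking back, some of my university courses, such as geostatistical analysis and even remote sensing image processing, were related to machine learning, but since they were not courses on the subject itself, the coverage was not very systematic. This autumn Stanford launched its online courses, and having machine learning among them really makes up for what we amateur science enthusiasts had been missing.

Within its limited length, the course covers linear regression, logistic regression, ANN, SVM, PCA, K-Means, Anomaly Detection and more, so it is basically a complete and practical introduction. Professor Andrew Ng's explanations are accessible and go from the simple to the profound; you hardly notice any barrier to entry.

As for the online course format, this autumn's Stanford courses on artificial intelligence, databases, and machine learning broke new ground. All three courses have now finished, and the response online has been very strong. The good news is that in Q1 next year Stanford will offer more courses in more areas. Today MIT also announced its online course plan for next year, so it too will join the ranks of online course providers. Moreover, MIT's online courses will award a certification called MITx. Open courses have become the general trend; information should spread freely.

For those interested in machine learning, besides the resources on ml-class.org, you can also find Professor Ng's lecture videos on Academic Earth. That set of videos covers the material in more detail and more completely than ml-class: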
http://academicearth.org/courses/machine-learning

Now that the course is over, I have put all my ml-class programming assignments on Bitbucket. If you are interested, you can refer to these Octave programs:
https://bitbucket.org/sunng/ml-class

Next January Stanford will open more courses related to machine learning, including:

Extend slacker server with interceptors

December 18, 2011

An interceptor framework was introduced in slacker 0.3.0. It is designed to let users add custom functionality without hacking into slacker's internals.

Like many server frameworks, slacker abstracts request processing as a pipeline. The request object is modified, by adding or updating attributes, as it passes through each node of the pipeline. This makes it easy to add your own interceptor to the pipeline, giving you access to the data before and after the function is executed.
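As a rough illustration of the pipeline idea (not slacker's actual implementation), the sketch below threads a request map through a list of plain functions. The names `run-pipeline`, `log-call`, `invoke` and `mark-done`, and the keys `:args`, `:log`, `:result` and `:done`, are made up for this example; only `:fname` appears in this post.

```clojure
;; A request is a plain map; each node takes it and returns an updated map.
(defn run-pipeline [req nodes]
  (reduce (fn [r node] (node r)) req nodes))

;; Hypothetical "before" node: records the call on the request.
(defn log-call [req]
  (update-in req [:log] (fnil conj []) (str "calling " (:fname req))))

;; Hypothetical execution node: stands in for the real function invocation.
(defn invoke [req]
  (assoc req :result (apply + (:args req))))

;; Hypothetical "after" node: runs once the result is available.
(defn mark-done [req]
  (assoc req :done true))

(def response
  (run-pipeline {:fname "sum" :args [1 2 3]}
                [log-call invoke mark-done]))

(println (:result response)) ; prints 6
```

Because every node has the same shape (request in, request out), adding an interceptor in this sketch is just a matter of adding one more function to the vector.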

To create such an interceptor, use the slacker.interceptor/definterceptor macro or the slacker.interceptor/definterceptor+ macro:

(definterceptor name
:before interceptor-function
:after interceptor-function)

(definterceptor+ name [arguments]
:before interceptor-function
:after interceptor-function)

definterceptor+ accepts arguments, so you can configure the interceptor at the point where you use it.

See a simple example:

(definterceptor log-function-call
  :before (fn [req] (println (str "calling " (:fname req))) req))

(definterceptor+ log-function-call-prefixed [prefix]
  :before (fn [req] (println (str
                               (if (fn? prefix) (prefix) prefix)
                               " calling "
                               (:fname req)))
                    req))

Then add it to your slacker server:

(use '[slacker.interceptor])
(import '[java.util Date])
;; the prefix is passed as a function so the timestamp is computed on every call
(start-slacker (the-ns 'slapi) 2104
  :interceptors (interceptors log-function-call
                              (log-function-call-prefixed
                                (fn [] (.toString (Date.))))))

Now you can log every function call of your slacker server.

For more detail about the interceptor framework, especially the request data, please check the wiki page.

In slacker 0.3.0, there is a built-in interceptor that collects statistics on function calls. You can find it at slacker.interceptors.stats. The stats data is exposed via JMX, so you can also write a monitoring application to retrieve it.

More built-in interceptors are planned for 0.4.0, including function call timing stats and logging.

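Lessons learned using Clojure's threading macro

December 16, 2011

The threading macro is a powerful macro in Clojure that helps you simplify nested function calls. For example,

(str (inc (count [:a :b])))

can be shortened with the threading macro to

(-> [:a :b] count inc str)

`->>` is similar to `->`; the difference is that `->>` passes the value as the last argument of the function.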
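To make the difference concrete, here is a small sketch using only core functions. The var names `first-pos`, `last-pos` and `total` are just labels for this example.

```clojure
;; -> inserts the value as the FIRST argument of each form.
(def first-pos (-> 10 (- 3)))    ; expands to (- 10 3)

;; ->> inserts the value as the LAST argument of each form.
(def last-pos (->> 10 (- 3)))    ; expands to (- 3 10)

;; ->> suits sequence functions, which take the collection last.
(def total (->> [1 2 3] (map inc) (reduce +))) ; (reduce + (map inc [1 2 3]))

(println first-pos last-pos total) ; prints 7 -7 9
```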

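That covers the basics; next came the problem. I needed a facility that accepts one or more functions and composes them into a pipeline. Naturally, `->` looked like a good helper. Perhaps a form like `#(apply -> % [funcs])` would be all I needed. It failed: `->` is a macro, so it cannot be used with apply at all. Is there an apply-macro, then? There is, or rather there was. contrib once had an apply-macro, but its use was strongly discouraged. At that point this road was blocked, and the only way out was to move `->` out of the API and into user code.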

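Putting it in user code means writing detailed documentation and telling users that they must do it this way. In the Clojure world, however, there is a better approach: write another macro that wraps `->`. This may look redundant, but it actually keeps the API consistent. Through macros, we can extend our own API into user code. Put differently, a DSL-like macro gives some not-so-elegant APIs a buffer and leaves room for the API to evolve later.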
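A minimal sketch of such a wrapper is shown here. The macro name `defpipeline` is hypothetical and is not the macro slacker uses; the point is only that the forms are spliced into `->` at expansion time, so no apply is needed.

```clojure
;; Hypothetical wrapper: the user lists forms, the macro builds the -> call.
(defmacro defpipeline [name & forms]
  `(defn ~name [x#]
     (-> x# ~@forms)))

;; User code stays declarative and never mentions -> directly.
(defpipeline count-plus-one
  count
  inc
  str)

(println (count-plus-one [:a :b])) ; prints 3
```

If the API later needs to do something other than plain threading, only the macro body has to change; the user-facing form can stay the same.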

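A short digression on writing macros. In Clojure, "code is data" shows up mainly in macros. In most other languages, a macro cannot compute with its input. In Clojure, because code is data, it can: the input arrives as unevaluated data structures such as lists, so you can apply operations like first/reverse to it. One thing to keep in mind, though, is that the values a macro sees are not the values the code will have at run time. For example, given the literal `{:a inc}`, a macro can look up `:a` in it, but what it gets back is not a function but a symbol. Also, when debugging macros you may be puzzled by results like these:

(defmacro a [f] (println (:a f)))
(a {:a 1}) ; ==> prints 1
(def b {:a 1})
(a b) ; ==> prints nil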

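The literal works, but a var holding the same value does not. The reason is the same as before: the macro receives its arguments unevaluated, so it sees the symbol b rather than the map b refers to.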
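The following sketch makes this visible. The macros `form-class` and `lookup-a` are hypothetical names for this illustration: the first reports what kind of data the macro received, the second emits code so that the lookup happens at run time instead.

```clojure
;; Runs at macro-expansion time: reports what kind of data the macro received.
(defmacro form-class [x]
  (.getSimpleName (class x)))

;; Emits code instead of computing: the lookup happens at run time.
(defmacro lookup-a [x]
  `(:a ~x))

(def b {:a 1})

(println (form-class {:a 1})) ; a map literal arrives as a map
(println (form-class b))      ; prints Symbol: the macro sees the name, not the value
(println (lookup-a b))        ; prints 1
```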

Back to `->`. This macro is far less tame than you might think. Take a slightly more complex case:

(def m {:a inc})
(-> 2 (get m :a))

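Is this correct? (get m :a) returns inc, which is a function just like count, inc and str in the first example, so it looks right. Yet running it does not give 3: get ends up being called with the wrong arguments (with this exact form it quietly returns :a, the not-found value). So never forget that `->` is a macro: (get m :a) is not evaluated to inc here, it is swallowed by the macro as a plain list. Inside a macro, code can only be generated by combining and rearranging symbols, so one careless step and inc is out of the picture.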

So you might think that an actual function is needed here, like this:

(def m {:a inc})
(-> 2 (fn [x] ((get m :a) x)))

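Perhaps this is much better: we supply an anonymous function and do not ask the macro to evaluate anything, because the anonymous function is emitted into the generated code and the get inside it is evaluated at run time. It looks fine, yet running it still does not give the expected result. It returned an anonymous function! And calling that anonymous function gave a wrong result too! This is getting absurd.

All right, enough mystery. Let's use macroexpand to see what happened.

This is before wrapping with the anonymous function:

(macroexpand '(-> 2 (get m :a)))
;; => (get 2 m :a)

`->` simply placed 2 inside the `get` form!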

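Now look at the result after wrapping with the anonymous function:

(macroexpand-1 '(-> 2 (fn [x] ((get m :a) x))))
;; => (fn 2 [x] ((get m :a) x))

Same as before: 2 is placed in the first argument position of the first form! The result is an illegal form.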

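Since `->` simply places its first argument into the first argument position of the following form, the correct way to use this macro is actually

(def m {:a inc})
(-> 2 ((fn [x] ((get m :a) x))))

Add one more pair of parentheses!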

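(macroexpand-1 '(-> 2 ((fn [x] ((get m :a) x)))))
;; => ((fn [x] ((get m :a) x)) 2)

So although `->` is a powerful macro, a macro is still just a macro, and the difference from a function is obvious. You cannot use it entirely the way you would use a function.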
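A shorter variant of the same fix, offered as a sketch: since the extra parentheses are what matter, the expression that returns the function can be wrapped directly, without the anonymous function. The var name `result` is just a label for this example.

```clojure
(def m {:a inc})

;; The extra parentheses make (get m :a) the head of the form,
;; so -> inserts 2 as its argument: ((get m :a) 2)
(def result (-> 2 ((get m :a))))

(println (macroexpand-1 '(-> 2 ((get m :a))))) ; prints ((get m :a) 2)
(println result)                               ; prints 3
```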

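If you want to see real code, take a look at the interceptor framework in slacker 0.3.0:
https://github.com/sunng87/slacker/blob/master/src/slacker/interceptor.clj
Most of the difficulties mentioned above are ones I ran into personally while developing this framework.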

slacker 0.2.0 is out

December 10, 2011

Slacker 0.2.0 was pushed to clojars today. Connection pooling and JSON serialization are available in this release.

Connection Pool

Generally, pooling connections is a good idea in highly concurrent applications. To make slacker ready for the real world, connection pool support was a high-priority feature in its development. The new connection pool is backed by commons-pool, which you might be familiar with. To use it, create the slacker client with the new function `slackerc-pool`:

(def scp (slackerc-pool "localhost" 2104))

Then you can use this pool just like a single client.

Several options are available to configure the pool:

  • :max-active, maximum number of connections opened by the pool
  • :exhausted-action, what to do when the pool is exhausted
    • :fail, throw an exception
    • :block, block the current thread until a connection is available, or throw an exception once :max-wait is exceeded
    • :grow, automatically create a new connection and add it to the pool
  • :max-wait, maximum wait time before an exception is thrown
  • :min-idle, minimum number of idle connections the pool holds

These options are inherited from GenericObjectPool; see its javadoc for details.

JSON Serialization

slacker now supports JSON serialization, provided by clj-json. In my tests, clj-json was about 1x faster than carbonite at serialization.

(def sc (slackerc "localhost" 2104 :content-type :json))

However, with JSON serialization you may lose some Clojure types, such as keywords and sets, in type conversion. Be cautious when using JSON as the serialization method.

In the next release, I am planning to use fastjson as the JSON library. It provides an option to write type names into the JSON, so it could be a full-featured serialization for Clojure. fastjson is also claimed to be even faster than jackson.

Performance

slacker gains high performance from its non-blocking server, serialization and direct function calls. Tested on a dual 6-core server, it reaches 10000+ TPS for a single client (50 connections, 50 threads). The server used just 35% CPU, so I expect it could reach even higher TPS with two or more client machines.

If you are interested in benchmarks, you can test it with a client like this. All the requests use synchronous calls because I believe that is the most common way to use slacker.

Next steps

Inspired by a discussion on the cn-clojure mailing list, I’m going to add HTTP transport to slacker. With HTTP transport, it is easier to debug and evaluate your Clojure functions, and it also makes slacker available to ClojureScript.

Finally, thanks to Zach Tellman for reviewing my client code.

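Installing exaile-doubanfm-gnome-shell-extension from the GNOME website

December 2, 2011

GNOME recently launched the long-awaited extensions.gnome.org. The site lets you install and manage gnome-shell extensions directly from the browser, a bit like an app store; the messy ~/.local/share/gnome-shell/extensions/ finally has an official interface.

As soon as the site opened, I submitted exaile-doubanfm-gnome-shell-extension. After review and revisions, the extension has been further polished and adapted to the gnome-shell 3.2 style.

You can install and enable it directly from this address: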
https://extensions.gnome.org/extension/24/exaile-doubanfm-control/

Once exaile douban.fm starts, it shows a menu in gnome-shell, through which you can perform basic operations.

If you like it, don't forget to vote for it on extensions.gnome.org :)

Slacker 0.1.0 is out

December 2, 2011

Glad to roll out the first release of the slacker framework. Slacker is a Clojure RPC framework on top of a TCP binary protocol. It provides a set of non-invasive APIs for both server and client. The remote invocation is completely transparent to the user.

In addition to the APIs introduced in the last post, the client API supports an asynchronous approach:

(defremote remote-func :async true)
@(remote-func)

If you add the `:async` option to defremote, the function facade returns a promise, which you have to deref yourself. You can also use the `:callback` option in defremote to specify a callback function.

(defremote remote-func :callback #(println %))
(remote-func)

This gives you much more flexibility in using remote functions. But be careful: it breaks the consistency between local and remote mode.
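To show the two calling styles without a slacker server, here is a local stand-in written with plain Clojure promises and futures. None of these functions are part of slacker; `slow-timestamp`, `timestamp-async` and `timestamp-callback` are hypothetical names, and the sleep merely simulates network latency.

```clojure
;; Stand-in for a remote call; a real client would go over the network.
(defn slow-timestamp []
  (Thread/sleep 50)
  (System/currentTimeMillis))

;; Promise style: return immediately, deliver the result later.
(defn timestamp-async []
  (let [p (promise)]
    (future (deliver p (slow-timestamp)))
    p))

;; Callback style: hand the result to a function when it arrives.
(defn timestamp-callback [cb]
  (future (cb (slow-timestamp))))

(def result @(timestamp-async))        ; deref blocks until delivered
(def seen (promise))
(timestamp-callback #(deliver seen %)) ; returns at once; callback fires later

(println (number? result) (number? @seen)) ; prints true true
(shutdown-agents) ; lets the JVM exit promptly once the futures are done
```

In both styles the caller has to know that the call is asynchronous, which is exactly the loss of consistency with a local function call mentioned above.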

To use slacker, add it to your project.clj:

:dependencies [[info.sunng/slacker "0.1.0"]]

You can find examples on the github page.

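Exaile Douban FM plugin 0.0.11 released

December 1, 2011

I am glad to release another update of the Exaile Douban FM plugin after half a year. It has been a year and a half since the first version, and during that time Douban FM plugins have appeared on Rhythmbox, Banshee, and other players. As the first attempt to bring Douban FM to the desktop, I find this very gratifying :)

This update fixes the login problem that had troubled users for a long time: there is now a dedicated dialog for entering the captcha. Thanks to user DigitalPig for reporting it on github (and for the nudge). I also referred to the implementation of the Douban FM Banshee plugin, which saved me the time of researching captcha login; thanks for that. All in all, without users pushing it forward, this project would not have lasted this long.

Beyond that, the plugin now supports recommending to the new Douban Shuo (豆瓣说) and has an improved playlist loading strategy.

Also worth highlighting: the companion gnome-shell extension has reached version 0.0.2. The only change is that the album cover is now shown in the gnome-shell menu.

You can get the latest plugin and gnome-shell extension from github:

If you have any questions, leave a message on github or here.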

Clojure RPC, prototyping and early thoughts

November 27, 2011

Last week, I prototyped an RPC framework, slacker, in Clojure and for Clojure.

What I did

Suppose you have a set of Clojure functions to expose. Define them under a namespace:

(ns slapi)
(defn timestamp []
  (System/currentTimeMillis))

;; …more functions

Expose the namespace with the slacker server, on port 2104:

(use 'slacker.server)
(start-slacker-server (the-ns 'slapi) 2104)

On the client side, we use the `defremote` macro to create a facade for the `timestamp` function. This API keeps the client code consistent with local mode.

(use 'slacker.client)
(def sc (slackerc "localhost" 2104))
(defremote sc timestamp)
(timestamp)

Internally, slacker uses aleph for transport and carbonite for serialization. I forked carbonite and made it compatible with clojure 1.2 because the aleph mainline is still running on 1.2.

Going further

Functions as parameter

In Clojure, functions are first-class citizens. Within a single process, you can pass a function as a parameter to another function. However, this is not supported by the serialization framework. So is it possible to add support for it in RPC?
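The problem can be seen even without a network. In the sketch below, printing to a string and reading it back stands in for a real serializer (slacker itself uses carbonite, not this); `round-trip` is a hypothetical helper. Sending the function's name and resolving it on the other side is just one possible workaround, and it only works for functions that exist on both ends.

```clojure
;; Stand-in serializer: print to a string and read it back.
(defn round-trip [x]
  (read-string (pr-str x)))

;; Plain data survives the trip.
(def data-ok (= {:a [1 2 3]} (round-trip {:a [1 2 3]})))

;; A function object prints as an opaque token that cannot be read back.
(def fn-ok
  (try
    (round-trip inc)
    true
    (catch Exception _ false)))

;; One possible workaround: send the function's name and resolve it remotely.
(def resolved ((resolve (round-trip 'inc)) 41))

(println data-ok fn-ok resolved) ; prints true false 42
```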

Lazy sequence as parameter

This is another interesting feature of Clojure function calls: you can pass a lazy sequence to a Clojure function. Over RPC, this would require the parameters to be evaluated on the server side.

(defn get-first [& args] (first args))
(apply get-first (range))

Example copied from StackOverflow
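A sketch of why this is hard over the wire: an infinite lazy sequence can never be fully serialized, so without server-side evaluation the caller has to realize a finite prefix first. Here `pr-str` and `read-string` stand in for a real serializer, and the names `prefix` and `wire` are just labels for this example.

```clojure
(defn get-first [& args] (first args))

;; Realize only the part that is actually needed before sending it.
(def prefix (doall (take 3 (range))))

;; pr-str stands in for a real serializer here.
(def wire (pr-str prefix))

(println wire)                                 ; prints (0 1 2)
(println (apply get-first (read-string wire))) ; prints 0
```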

Coordinated states between several remote servers

With RPC, we can update state on several servers. So do we need something like a distributed dosync?

(defremote a1 update-a1-state)
(defremote a2 update-a2-state)
;; dosync-distributed is a hypothetical construct, not an existing API
(dosync-distributed
  (update-a1-state some-value)
  (update-a2-state some-value))

I’m not sure if this is a valid scenario in real world but I think it’s an interesting topic.
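For comparison, this is what the local dosync already guarantees within one process: either every ref in the transaction changes, or none does. The sketch uses only core STM functions; the refs `a1` and `a2` are local stand-ins for the state on the two servers.

```clojure
(def a1 (ref 0))
(def a2 (ref 0))

;; Both refs change together.
(dosync
  (ref-set a1 1)
  (ref-set a2 1))

;; An exception inside the transaction discards every change made in it.
(try
  (dosync
    (ref-set a1 2)
    (throw (ex-info "simulated failure" {}))
    (ref-set a2 2))
  (catch Exception _ nil))

(println @a1 @a2) ; prints 1 1
```

A distributed version would have to provide the same all-or-nothing behavior across machines, which is a much harder coordination problem than the in-memory case.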

Conclusion

RPC is a first step toward a distributed Clojure world. I will keep you updated on my prototype.

Spark in common lisp

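November 18, 2011

This is about spark again. One stone stirs up a thousand ripples, and everyone has a spark of their own in mind. The spark script actually had plenty of problems when it first came out, but because it struck a chord, many hands pitched in and the pull requests piled up. Even antirez, the author of redis, could not resist writing his own aspark in C.

So much for other people's; now for mine: clspark, a spark in Common Lisp. I originally planned to write it in Clojure, but considering the JVM's startup time, I gave this opportunity to my first Common Lisp program instead.

It is actually very simple.

Common Lisp's core library has no split, so I copied a split implementation from cl-cookbook. Frankly, I still do not quite understand how this loop is written. loop is the most awkward form in Common Lisp because it has so many variants. This problem does not exist in Clojure. A comparison shows that, at the language level, Clojure is a considerably more modern Lisp dialect.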
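For comparison, here is a minimal sketch of the same idea in Clojure, where split comes with the standard clojure.string namespace. This is not a port of clspark; the names `ticks` and `spark` and the scaling rule are assumptions made for this example, and it only handles integer input.

```clojure
(require '[clojure.string :as str])

;; The eight block characters, from lowest to highest.
(def ticks "▁▂▃▄▅▆▇█")

(defn spark
  "Maps whitespace- or comma-separated integers to block characters."
  [line]
  (let [nums (map #(Long/parseLong %) (str/split (str/trim line) #"[\s,]+"))
        lo   (apply min nums)
        hi   (apply max nums)
        span (max 1 (- hi lo))]   ; avoid dividing by zero when all values are equal
    (apply str
           (for [n nums]
             (nth ticks (quot (* (- n lo) 7) span))))))

(println (spark "1 2 3 4 5 6 7 8")) ; prints ▁▂▃▄▅▆▇█
```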

My response to Spark: Visualize your mercurial commit history from commandline

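November 15, 2011

The title is a bit long. I had better continue in my native language.

Yesterday a small script caused a stir on Hacker News; as the saying goes, a mountain need not be high and a program need not be big. In response, I wrote an even simpler Mercurial (hg) extension that prints the commit history of your hg repository. The feature is a bit like github's histogram, which I consider one of their most important features: it spurs you to keep committing.

hg summary

Installation: put this script anywhere and add the following to your hgrc:

[extensions]
summary = /path/to/your/script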

Since it is so simple, I am not releasing it the way Mercurial conventions prescribe.

