January 3rd, 2009

Computer vision research has largely set the reliability problem aside and holds to the principle that the computer never goes wrong. Even on platforms with limited computing ability, say an embedded system where memory and CPU are scarce, the system is assumed reliable by default.

Distributing a computer vision problem across multiple computers is not new. Many systems use several computers to meet real-time requirements: NASA's three-tier robotic framework was first implemented on three computers, and Stanford's winning Stanley vehicle used three PCs for decision making, environment sensing and computer vision. Small clusters (typically under 10 PCs) do not need to worry about system reliability, because at that scale the probability of a system failure is negligible.

For a large-scale cluster this is simply not true. A large-scale cluster contains at least 1,000 commodity PCs, and at that scale single-node failures happen all the time. Ideally, system failure should be taken care of by lower-level facilities. Many training processes can be implemented with a MapReduce-like mechanism and need not worry about it. But since a large part of computer vision concerns real-time tasks, taking system failure into account in the local implementation is inevitable.

A desirable low-level facility for real-time computer vision tasks has to be very flexible: it must be able to reorganize itself quickly after a single-node failure. The two-phase design of MapReduce may still be useful, but the algorithms applied to the two-phase procedure need to be reconsidered; many algorithms simply do not fit the two-phase idea.
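To make "two-phase" concrete, here is a minimal, single-process sketch of the map/reduce structure in Python. The toy task (counting detected labels across frames) and every name in it are illustrative only, not something prescribed by any framework discussed here.

```
from collections import defaultdict

# A minimal, single-process sketch of the two-phase (map / reduce) structure.
# The toy task -- counting detected object labels across frames -- and all
# names here are illustrative only.

def map_phase(frames):
    """Phase 1: emit (key, value) pairs from each input record."""
    for frame_id, labels in frames:
        for label in labels:
            yield label, 1

def reduce_phase(pairs):
    """Phase 2: group pairs by key and combine their values."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

# usage
frames = [(0, ["car", "person"]), (1, ["car"]), (2, ["tree", "car"])]
print(reduce_phase(map_phase(frames)))  # {'car': 3, 'person': 1, 'tree': 1}
```

An algorithm fits this design only when its work decomposes into independent records and key-wise reductions, which is exactly what many vision algorithms fail to do.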

Once such a highly reorganizable facility is accomplished, the remaining problems lie in the algorithm layer. A parallelized version of SVM was published in 2006, and the parallelization of many other well-known algorithms happened only a few years ago. Given how well these offline algorithms parallelize, the tricky part is not parallelization itself. Rather, turning all these offline algorithms into online ones can be a very challenging task. Even the famous metric tree (or best-bin-first search tree) cannot easily insert or delete nodes, so how can methods such as PCA or LLE become online algorithms?
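For the PCA case specifically, one known route to an online variant is Oja's rule, which refines the leading principal direction one sample at a time instead of factorizing the whole data matrix. A minimal sketch, assuming numpy; the dimension and learning rate are placeholder values.

```
import numpy as np

# A minimal sketch of online PCA via Oja's rule, assuming numpy.
# The dimension and learning rate are placeholder values.

rng = np.random.default_rng(0)
dim, lr = 128, 0.01
w = rng.normal(size=dim)
w /= np.linalg.norm(w)             # current estimate of the top principal axis

def oja_update(x):
    """Refine the leading principal direction with one new sample x."""
    global w
    y = w @ x                      # projection onto the current estimate
    w += lr * y * (x - y * w)      # Hebbian term minus a decay term
    w /= np.linalg.norm(w)         # keep the estimate on the unit sphere

# usage: stream samples one at a time; no full data matrix is ever needed
for _ in range(1000):
    oja_update(rng.normal(size=dim))
```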

With so many unknowns remaining, all these problems add up to a very bright future for a distributed computer vision framework.

December 31st, 2008

Because of the complexity of social phenomena, people who look for relationships between things usually lack the conditions for controlled experiments, and many wrong conclusions follow from this. A classic example is the so-called shoe-shine boy anecdote of the 1930s: a stockbroker, while having his shoes shined, heard that even the shoe-shine boys were talking about buying stocks, took it as a bad omen, went back and sold off everything he held, and the Great Depression of the 1930s followed right after.

Such a simple story leads people to dig out false relationships; some even argue that because foot-massage workers already know about the economic crisis, the actual crisis must be over. Treating the signal at the very last point of information arrival as the end point of the event is not a serious attitude, because the two have no theoretical relationship whatsoever.

Since most markets have a self-fulfilling element, rises and falls do follow the propagation paths of certain bullish or bearish information. Those who pride themselves on this theory usually stand near the front of the information flow, so however far downstream the endpoint they observe, the move has already been realized by then; it matches their expectation, which makes them even more convinced the theory is correct.

But in a long, drawn-out decline this false connection breaks down entirely. Information spreads according to an exponential model, while a long decline, unlike a steep rise, shows no trace of an exponential at all. So long before the decline hits bottom, the information has already reached every group. Those hoping to uncover the truth through such a simple relationship are bound to be disappointed.

Humans are always flawed, for example in decisions driven by mistaken biases. A typical example: most people believe a hurricane or a terrorist attack is more likely to kill them than an accidental fall or the flu. These biases also include overestimating the probability of winning the lottery, or of pitching an unreliable theory all the way to Stupid Money, and so on.

December 29th, 2008

After half a year in Charlottesville, I had almost forgotten how bad Beijing's winter air is. Lately I have been paying attention to far fewer things: I am reviewing Data Mining and Matrix Analysis, and I have not looked at the Libor data for a long time. My GReader sharing frequency has dropped too, to roughly one item every two days. I played with LinkedIn for a while; it is fun, but still feels unreliable, since my connection to almost everyone seems to sit at about 3rd degree. By that measure the circle is still very small, no more than 4M people.

I have never been the kind of kid who can write documents for the government; in high school the teachers handled all that material for me while I just waited around foolishly to go play. If I ever really need to do something that involves the government, I will have to find a classmate who is good at writing such documents.

There are many ways in which I do not fit this Pattern; either I am wrong, or my understanding of the Pattern is wrong.

December 23rd, 2008

Everyone lies. - House

Only trading volume cannot be faked. - Dad

Man is mortal. - Logic

Women and children can be careless, but not men. - The Godfather

Where are they? - Fermi

Fool me once, shame on you; fool me twice, shame on me. - proverb

December 17th, 2008

Several design concepts that have emerged today derive from the simple goal of “get work done asap”. The goal is so simple that these modern concepts can be found everywhere, from business models to scalable systems. The GWDA concept moves the focus away from “user-friendly”; instead, it assumes the user can understand the basic mechanism in order to get their work done. But GWDA is not only a claim about the importance of functionality: maybe, with less functionality, the user can do things better.

The definition of the user depends on what the actual work is. For a business model, it is obviously the customer. For a web UI, it is the web user or member. For a scalable system, it can be the computer farm. For rapid development, it can be the programmer.

The first important thing for the GWDA concept is to define your user. If you build the basic facilities for a massive cluster, you have to define your users as both programmers and computing facilities. For the good of the computing facility, your design should tolerate system failure, because failure is the nature of a computer farm. For the good of the programmer, your design should fail fast, log as much as possible, and decompose into tiny, reliable parts.
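As a purely illustrative sketch of the "fail fast, log as much as possible" point (the job fields and the `process` step below are hypothetical, not part of any facility described here):

```
import logging

# Illustrative only: a worker wrapper that refuses malformed work up front
# and surfaces failures instead of hiding them. Field names are hypothetical.

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")

def process(job):
    """Placeholder for the tiny, decoupled unit of actual work."""
    pass

def run_job(job):
    # fail fast: reject malformed input immediately
    for field in ("id", "input_path", "output_path"):
        if field not in job:
            raise ValueError(f"job missing required field: {field}")
    log.info("start job %s: %s -> %s",
             job["id"], job["input_path"], job["output_path"])
    try:
        process(job)
    except Exception:
        log.exception("job %s failed", job["id"])  # log as much as possible
        raise                                      # and let the caller decide
```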

{ to be continued }

December 9th, 2008
  1. A 10 Mbps hosting service in China costs 50,000 RMB per year, i.e. about 7,246 USD;

  2. $7,246 / (10 × 60 × 60 × 24 × 365 / 8 MiB) ≈ $0.0001838 USD/MiB ≈ $0.188 USD/GiB;

  3. Amazon S3 charges $0.170 USD/GiB.

Note: it is much easier to do this kind of financial estimation based on bandwidth capacity. The calculation is rough, but the numbers correspond amazingly well.
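For what it's worth, a quick Python transcript of the arithmetic above (the dollar price is simply the one implied by the post's own numbers, roughly 6.9 RMB/USD):

```
# The arithmetic above, spelled out. Assumes the link is fully utilized all
# year and treats MB and MiB interchangeably, as the rough estimate does.

price_rmb_per_year = 50_000
price_usd_per_year = 7_246           # implied exchange rate, ~6.9 RMB/USD
mbps = 10

mib_per_year = mbps * 60 * 60 * 24 * 365 / 8      # ~39,420,000
usd_per_mib = price_usd_per_year / mib_per_year
usd_per_gib = usd_per_mib * 1024

print(round(usd_per_mib, 7), round(usd_per_gib, 3))  # 0.0001838 0.188
```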

December 9th, 2008

Several attempts have been made to give semi-structured information more structure. The central concept of the semantic web is a universal form for the knowledge we have. Freebase is a highly structured information base; Wikipedia, the world's largest encyclopedia, however, only has semi-structured data. At CIKM 2008, the awarded paper was about extracting structured information from the Wikipedia database. Basically, there is more semi-structured information than fully structured information. Another problem is massive, poorly organized data, for instance photos. Flickr made a good attempt at exploiting human effort to organize data, yet only a small fraction of photos get tagged. Luckily, camera manufacturers came up with EXIF, which embeds the camera sensors' information into a photo, but the time and geo dimensions are too vague to fit specific usages. Overall, after years of effort, we have a great deal of structured or semi-structured data in hand.

The mixed data structure is organized in key-value form. An element can be described by several properties, which may be structured or unstructured; here we treat semi-structured properties as unstructured, too. Several questions remain unclear, for example: how do we form a query over a mixed data structure? How do we slice data based on its mixed properties? In this article we simply ignore these questions and jump directly to how to fulfill a query. Once a query is made, we first break up the structured information. This break-up process has been described in the fuzzy-set literature for years. We use one assumption in the process: any data relation can be expressed as a similarity. It is a very big hypothesis, and it leaves difficulties for ourselves that I will discuss later. However, describing data with only similarity simplifies the problem: to fulfill a query, we only have to sort by the similarities.

I have to raise several considerations in this process. First, in many cases structured data cannot be measured by a single similarity method. For example, to fuzzify a datetime field, we can only measure the time span between two values; then how do we compare May 11, 2008 and May 14, 2006? The two dates definitely share something in common: they are both Mother's Day. The second problem is computing time. However, the similarity matrix is very sparse, which should reduce some of the calculation time.
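To make the "sort by similarity" idea and the date problem concrete, here is a minimal sketch. It assumes records are key-value dicts, gives each property its own similarity function in [0, 1], and averages them without weights; the property names and both similarity functions are placeholder choices, not a claim about how the break-up should actually be done.

```
from datetime import date

# Illustrative sketch: rank key-value records against a query by averaging
# per-property similarities. Property names and similarity functions are
# placeholder choices.

def date_sim(a, b, scale_days=365.0):
    """Similarity from the time span only (the limitation discussed above)."""
    return 1.0 / (1.0 + abs((a - b).days) / scale_days)

def text_sim(a, b):
    """Crude token-overlap similarity for unstructured text."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

SIMS = {"taken": date_sim, "tags": text_sim}

def score(query, record):
    parts = [SIMS[k](v, record[k])
             for k, v in query.items() if k in SIMS and k in record]
    return sum(parts) / len(parts) if parts else 0.0

def fulfill(query, records, top=3):
    """Fulfill a query by sorting records on their similarity to it."""
    return sorted(records, key=lambda r: score(query, r), reverse=True)[:top]

# usage: the span-only date similarity scores the 2006 Mother's Day photo
# poorly on its date field, even though it shares the "Mother's Day"
# relation with the query -- exactly the first consideration above.
records = [
    {"taken": date(2008, 5, 11), "tags": "mother day flowers"},
    {"taken": date(2006, 5, 14), "tags": "mother day dinner"},
    {"taken": date(2008, 12, 9), "tags": "snow campus"},
]
print(fulfill({"taken": date(2008, 5, 12), "tags": "mother day"}, records))
```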

The idea of fuzziness is not new; it came from multi-valued logic and was soon adapted to computer science. The idea I suggest here is about forming queries and retrieving results from a database where the data is poorly organized.

November 11th, 2008

I discarded the artificial neural network idea a long time ago because of its over-fitting problem and the ugly formulation of the back-propagation algorithm. It is hard to call BP an elegant algorithm: it directly magnifies the influence of the error through the gradient, and the hidden-layer structure depends heavily on empirical tuning.

People are easily convinced by SVM, HMM or manifold methods; they look elegant and show great mathematical skill. Other methods such as PCA and LFD, which in fact depend largely on a linearity hypothesis, earn their credit too. For a long time the ANN method was applied only by engineers and ignored by the science community.

There are problems in existing statistical learning methods. Modern methods demand longer execution times, which in some cases are unbearable; applying a nonlinear SVM that requires many support vectors is a painful experience. Successful applications nowadays rely largely on problem-specific structure. In face detection, it is a degenerate high-dimensional surface approximation; in general recognition problems, people rely much more on good “features”, which is an ill-defined problem in itself. Thus nearly all the state-of-the-art methods in image recognition are empirical results rather than formal mathematical proofs.

Despite the over-fitting problem, which can be managed by careful testing, NN algorithms have some advantages. They can be deployed in online learning problems, whereas other statistical methods may need the whole data distribution for their calculations. Hence I am investigating some modern NN models such as the RBM these days.
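As a concrete taste of that online flavor, here is a minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM, which can be fed mini-batches as they arrive. It assumes numpy; the layer sizes and learning rate are placeholders, and this is a bare-bones illustration rather than a faithful training recipe.

```
import numpy as np

# Minimal CD-1 update for a binary RBM. Layer sizes and learning rate are
# placeholders; no momentum, weight decay or other practical refinements.

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 64, 0.05
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)                 # visible bias
b_h = np.zeros(n_hidden)                  # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step on a mini-batch v0 of shape (batch, n_visible)."""
    global W, b_v, b_h
    # positive phase: hidden activations driven by the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # negative phase: one step of Gibbs sampling back and forth
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # approximate gradient and online update
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

# usage: feed mini-batches as they arrive; no holistic view of the data needed
cd1_update((rng.random((16, n_visible)) < 0.5).astype(float))
```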

November 5th, 2008

I will comment on this in a month.

October 30th, 2008

Sent to me by HB: once, 程晨 read a letter from Zhejiang University, jointly signed by four undergraduates, in which they wrote: “史玉柱, you cannot fall. You are the idol of our generation; if you fall, you will have failed an entire generation.”