Posts from December, 2008
No comment yet
December 31st, 2008

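Because social phenomena are complex, people looking for relationships between things usually lack the conditions to run controlled-variable experiments, and many wrong conclusions result. A typical example is the so-called shoeshine boy story of the 1930s. Roughly, a stockbroker getting his shoes shined heard the shoeshine boys all talking about buying stocks, took it as a bad omen, and went back and sold all the stocks he held; the Great Depression of the 1930s followed right after.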

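Such a simple story leads people to dig up false relationships. Some even argue that the economic crisis is actually over on the grounds that even the foot-massage workers know about it. Treating the signal at the very last end of the information chain as the end point of the event is not a serious attitude, because the two have no theoretical relationship.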

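Because most markets have a self-fulfilling element, rises and falls do follow the path along which good or bad news propagates. People who rely on this theory usually stand near the front of the information chain, so however far down the chain the person they meet is, the move has already been realized by then. That matches their own expectations and makes them even more convinced that the theory is right.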

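But in a long decline, this false connection does not hold at all. Information spreads according to an exponential model, whereas a long decline, unlike a steep rise, has no trace of the exponential in it. So the information has already reached every group while the decline is still far from its bottom. Those who hope to uncover the truth through this simple relationship are likely to be disappointed.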
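The mismatch between the two time scales can be sketched with a toy calculation. This is only an illustration: the population size, the doubling rate, the prices, and the fixed drop per step are all made-up assumptions, not market data.

```python
def steps_to_saturate(population, informed=1, growth=2.0, threshold=0.99):
    """Steps until an exponentially spreading message reaches most of the population."""
    steps = 0
    while informed < threshold * population:
        # Each informed person tells others, so the informed group multiplies.
        informed = min(population, informed * growth)
        steps += 1
    return steps


def steps_to_bottom(start, bottom, drop_per_step):
    """Steps for a price that falls by a fixed amount per step to reach the bottom."""
    steps = 0
    price = start
    while price > bottom:
        price -= drop_per_step
        steps += 1
    return steps


# Hypothetical numbers, chosen only to show the difference in scale.
news = steps_to_saturate(population=1_000_000)
decline = steps_to_bottom(start=100.0, bottom=40.0, drop_per_step=0.5)
print(f"news reaches everyone in {news} steps; the decline bottoms after {decline} steps")
```

With these assumed numbers the news saturates long before the decline ends, so "everyone has heard about it" says nothing about where the bottom is.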

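Humans are always flawed, for instance in decisions driven by mistaken biases. A typical example: most people believe a hurricane or a terrorist attack is more likely to kill them than an accidental fall from a building or dying of a cold. These biases also include overestimating the odds of winning the lottery, and overestimating the odds of pitching an unreliable theory to Stupid Money.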

December 29th, 2008

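After half a year in Charlottesville, I had almost forgotten how bad the winter air in Beijing is. I have been following far fewer things lately; I am reviewing Data Mining and Matrix Analysis, and I have not looked at the Libor data for a long time. I am also sharing less often on GReader, roughly one item every two days. I played with LinkedIn for a while. It is fun, but I still find it unreliable: I seem to be about a 3rd-degree connection to everyone. By that reckoning the circle is still small, no more than 4M people.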

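I have never been a kid who can write documents for the government. In high school my teachers took care of my paperwork, and I just sat around waiting to go and have fun. If I ever really need to do something that involves the government, I will have to find a classmate who is good at writing documents.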

There are many ways in which I do not fit this Pattern; either I am wrong, or my understanding of the Pattern is wrong.

December 23rd, 2008

Everyone lies. - House

Only trading volume doesn't lie. - Dad

Man is mortal. - Logic

Women and children can be careless, but not men. - The Godfather

Where are they? - Fermi

Fool me once, shame on you; fool me twice, shame on me. - proverb

December 17th, 2008

Several design concepts that have emerged recently derive from the simple goal "get work done asap" (GWDA). The goal is so simple that these modern concepts show up everywhere, from business models to scalable systems. GWDA moves the focus away from "user-friendly"; instead, it assumes that users can understand the basic mechanism needed to get their work done. But GWDA is not only a claim about the importance of functionality. With less functionality, users may well do things better.

The definition of the user depends on what the actual work is. For a business model, it is obviously the customer. For a web UI, it is the web user or member. For a scalable system, it can be the computer farm. For rapid development, it can be the programmer.

The first important step in GWDA is to define your user. If you build the basic facilities for a massive cluster, you have to define your user as both the programmers and the computing facilities. For the good of the computing facility, your design should be resilient to system failure, since failure is in the nature of a computer farm. For the good of the programmer, your design should fail fast, log as much as possible, and decouple into tiny, reliable parts.
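As a rough illustration of serving both kinds of user at once, here is a minimal sketch. The function names (`parse_record`, `run_task`, `run_all`) and the `key=value` record format are hypothetical, invented for this example; they are not taken from any particular cluster framework.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gwda")


def parse_record(line):
    """Tiny, single-purpose part: fail fast on malformed input."""
    key, sep, value = line.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"malformed record: {line!r}")
    return key.strip(), value.strip()


def run_task(lines):
    """One unit of work; the first bad record aborts the whole task."""
    result = {}
    for line in lines:
        key, value = parse_record(line)
        result[key] = value
    log.info("task ok: %d records", len(result))
    return result


def run_all(tasks):
    """For the computer farm: a failed task is logged and skipped, the rest still run."""
    results, failed = {}, []
    for name, lines in tasks.items():
        try:
            results[name] = run_task(lines)
        except ValueError:
            log.exception("task %s failed", name)  # log as much as possible
            failed.append(name)
    return results, failed


results, failed = run_all({
    "good": ["a=1", "b = 2"],
    "bad": ["a=1", "oops"],
})
print(results, failed)
```

The programmer gets an immediate, fully logged error from the small part that broke, while the farm as a whole keeps running the remaining tasks.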

{ to be continued }

December 9th, 2008

Several attempts have been made to give semi-structured information more structure. The central concept of the semantic web is a universal form for the knowledge we have. Freebase is a highly structured information base; Wikipedia, however, the world's largest encyclopedia, has only semi-structured data. At CIKM 2008, the award-winning paper was about extracting structured information from the Wikipedia database. Basically, there is more semi-structured information than fully structured information. Another problem is massive, poorly organized data, for instance photos. Flickr made a good attempt at using human effort to organize data. However, few photos are tagged. Luckily, camera manufacturers came up with EXIF, which can embed information from the camera and its sensors into a photo. But the time and geographic dimensions are too vague to fit specific uses. Overall, after years of effort, we have a good deal of structured or semi-structured data in hand.

The mixed data structure is organized in key-value form. An element can be described by several properties, which can be structured or non-structured. Here we treat semi-structured properties as non-structured, too. Several questions remain unclear: for example, how do we form a query over a mixed data structure? How do we slice data based on its mixed properties? In this article we simply ignore these questions and jump directly to how to fulfill a query. Once a query is made, we first break up the structured information. This break-up process has been described in the fuzzy set literature for years. We make one assumption in this process: any data relation can be expressed as a similarity. It is a very strong hypothesis, and it leaves us with difficulties that I will discuss later. However, describing data with similarity alone simplifies the problem. To fulfill a query, we only have to sort by similarity.

I have to raise several considerations about this process. First, in many cases structured data cannot be measured by a single similarity method. For example, to fuzzify a datetime field, we can only measure the time span between two values. Then how do we compare May 11, 2008 and May 14, 2006? The two dates definitely have something in common: both are Mother's Day. The second problem is computing time. However, the similarity matrix is very sparse, which should reduce the calculation time.
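The idea of answering a query by sorting on similarity, and the datetime problem above, can be sketched as follows. This is only a sketch under stated assumptions: the function names, the 365-day scale, the choice to combine measures with `max`, and the sample photo records are all invented for illustration, and Mother's Day is taken to be the second Sunday of May.

```python
from datetime import date


def span_similarity(a, b, scale_days=365.0):
    """Similarity from the time span alone: 1.0 for the same day, decaying toward 0."""
    return 1.0 / (1.0 + abs((a - b).days) / scale_days)


def is_mothers_day(d):
    """Second Sunday of May (an assumed convention)."""
    return d.month == 5 and d.weekday() == 6 and 8 <= d.day <= 14


def date_similarity(a, b):
    """Combine several measures, since one measure alone is not enough."""
    holiday = 1.0 if is_mothers_day(a) and is_mothers_day(b) else 0.0
    return max(span_similarity(a, b), holiday)


def rank(records, key, target, similarity):
    """Fulfill a query by sorting records on similarity to the target value."""
    return sorted(records, key=lambda r: similarity(r[key], target), reverse=True)


# Hypothetical key-value records.
photos = [
    {"name": "a", "taken": date(2006, 5, 14)},
    {"name": "b", "taken": date(2008, 3, 1)},
    {"name": "c", "taken": date(2007, 12, 25)},
]
query = date(2008, 5, 11)

by_span = rank(photos, "taken", query, span_similarity)
by_both = rank(photos, "taken", query, date_similarity)
print([p["name"] for p in by_span])  # time span only
print([p["name"] for p in by_both])  # time span plus shared holiday
```

With the time span alone, the 2006 photo ranks last; once the shared holiday counts as a similarity, it moves to the top.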

The idea of fuzziness is not new. It came from multi-valued logic and was soon adapted to computer science. The idea I suggest here is about forming queries and retrieving results from databases where the data is poorly organized.

December 9th, 2008
  1. A 10Mbps hosting service in China costs 50,000 RMB per year, i.e. about 7,246 USD;

  2. $7246/(10*60*60*24*365/8) = $0.0001838 USD/MiB = $0.188 USD/GiB;

  3. Amazon S3 charges $0.170 USD/GiB.

Note: it is much easier to do a financial estimate based on bandwidth capacity. The calculation is rough, but the two figures correspond amazingly well.
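The arithmetic in step 2 can be reproduced with a few lines of Python. This sketch assumes the link is fully used for a whole 365-day year and, like the estimate above, treats one megabit as 1/8 of a MiB; the function name `cost_per_volume` is made up for the example.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def cost_per_volume(annual_cost_usd, mbps):
    """Return (USD per MiB, USD per GiB) for a link used at full capacity all year."""
    # Megabits per second -> megabytes transferred in one year.
    mib_per_year = mbps * SECONDS_PER_YEAR / 8
    usd_per_mib = annual_cost_usd / mib_per_year
    return usd_per_mib, usd_per_mib * 1024


usd_per_mib, usd_per_gib = cost_per_volume(annual_cost_usd=7246, mbps=10)
print(f"{usd_per_mib:.7f} USD/MiB, {usd_per_gib:.3f} USD/GiB")
```

Real utilization is well below 100%, so the true cost per GiB of the hosted link would be higher than this figure.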