The holy grail of computer vision is to understand the scene and describe it in proper language. Ideally, such a system should be able to answer questions like "how many people visited our college this afternoon", or at least return results for a query like "SELECT COUNT(*) FROM Camera1 | Camera2 | Camera3->VideoStream, DateTime->VideoStream WHERE FaceDetector LIKE (SELECT FaceDetector WHERE Tag=Face) AND Time > '12:00:00' AND Time < '23:59:59'".
In this scenario, we have extracted a goal that is achievable with existing technology. In contrast to the structured data introduced in that article, here we try to impose structure on visual data. The result of this effort is a new structured query language for visual data. It should be a subset of existing SQL and, ideally, should be able to cooperate with other SQL engines. In practice, we sacrificed compatibility with the existing SQL front end in order to get better performance; as a result, we ended up with a query syntax incompatible with SQL. In fact, at the current stage, it is better described as a procedure for finding similar visual data.
Visual data is processed by several different feature extractors when it first enters the database. The different feature representations become columns attached to every piece of visual data. The system also generates a unique fingerprint to avoid storing duplicate visual data. To interact with text, tags are introduced again: every piece of visual data can be tagged. With tags, one can write a nested query to perform classification, like the query at the beginning; it is in effect a kNN classification. The feature extractor is the atomic component of the design. Three feature extractors are provided at the start: a face extractor, a SURF extractor and an MSER extractor. The structure is flexible, and any extractor can be added later. Ideally, each extractor should provide a function that measures the similarity between two pieces of visual data, but in the current setting that would incur a huge performance penalty. To avoid it, several helper extractors are introduced. The general extractors can have very different outputs: variable-length data, binary data, serialized data, a list, or a tree. The helper extractors, however, output fixed-length float sequences, with the implicit contract that they can be compared by L1/L2 distance. The system uses helper extractors to mimic the similarities output by the general extractors. Three helper extractors are used: a global histogram, a local histogram and a Gabor filter.
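To make the helper-extractor contract concrete, here is a minimal sketch (my own illustrative code, not the actual system): a global color histogram as a fixed-length float vector, and tag-based kNN classification over those vectors by L2 distance.

```python
import numpy as np

def global_histogram(image, bins=16):
    """Helper extractor: map an image (H x W x 3, uint8) to a fixed-length
    float vector whose L2 distance can stand in for visual similarity."""
    chans = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(chans).astype(np.float64)
    return hist / (hist.sum() + 1e-9)   # normalize so image size doesn't matter

def knn_tag(query_vec, db_vecs, db_tags, k=3):
    """Classify a query by majority vote among its k nearest tagged neighbours."""
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)  # L2 to every row
    nearest = np.argsort(dists)[:k]
    votes = [db_tags[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

A tag-nested query like the one at the beginning would then reduce to calling `knn_tag` over the rows whose tag matches the inner SELECT.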
Overall, it still looks far from a query language. One may think of it as a table engine, like what MyISAM/InnoDB does for MySQL. Given time, maybe it could become more powerful; who knows?
It is a legacy of the Wei-Jin era: Chinese people to this day believe deeply in the power of language. As in the "pure conversation" of that period, people trust words over deeds. For example, if a man declares himself honest, people will implicitly assume he is not honest at all; even when his conduct fits the definition of honesty, many will still, on the strength of that one sentence, conclude he is dishonest. Knowing how to use language to muddy the waters remains enormously effective even today. If words can get a man convicted, then words can also beguile a ruler, and can also turn a profit.
I am not praising how incisive Confucius's summary of "speech that is false yet eloquent" is. After all, one cannot deny the role language plays in communication, and proving things through debate is not wrong either. The problem is that people habitually tie language to actual things, without realizing that language is only a medium and bears no necessary relation to the things themselves.
This is why three men can make a tiger. The sentence "there is a tiger" carries no meaning by itself; it neither proves nor disproves a tiger. Yet people tried to vouch for it with personal credibility, and in the end, without any evidence, conjured a tiger out of thin air. This is the sorrow of all primitive communication media: because the medium itself carries no guarantee, the credibility of information cannot be measured scientifically.
Listen to what a man says, then watch what he does: that is true at all times.
Whether a society develops in balance, and whether it can avoid being derailed by exceptional events, depends on the development of individuals. In short, the more developed the individual, the less stable the civilization, until eventually a handful of events wipes out the entire civilization.
This claim rests on two observations. First, as civilization develops, the capability an individual can acquire grows ever larger. Second, within the population as a whole, individual activity is approximately random. From the second point, the distribution of an individual's activity function f is guaranteed by randomness, while the group activity F, being the sum of the individual f's, is stable. Suppose each individual's activity function f is influenced by the activity functions f_k of its neighbours, with |f - f_k| smaller than some threshold c. By the first observation, the threshold c keeps growing, until eventually every f is significantly influenced by the entire group. Because each evaluation happens over a time interval, even if the final F is stable, it is unstable at each intermediate point. If we posit that an individual activity function f falling below some threshold z means irreversible death, then once every f can influence the whole group, a group-extinction event becomes unavoidable.
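The sketch above can be written out more explicitly (my own notation; an informal model, not a rigorous proof):

```latex
\begin{aligned}
F(t) &= \sum_{i=1}^{N} f_i(t) && \text{(group activity: stable as a sum of random terms)}\\
|f_i(t) - f_k(t)| &< c && \text{(coupling between neighbours $i$, $k$)}\\
f_i(t) &< z &&\Rightarrow\ \text{individual $i$ is irreversibly dead.}
\end{aligned}
```

As c grows, the coupling radius eventually covers all N individuals, so a single large fluctuation in one f_i can drag every f_i below z within one time interval, even though F remains stable in the mean.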
This argument suggests several possibilities for the future of civilization. Separation on the scale of planetary systems can reduce the reach of individual influence: even if some worlds are wiped out by a few events, probabilistically a few worlds can still survive and develop. Reducing the number of individuals makes group development less stable, but individual behaviour becomes easier to predict; a random model no longer applies, and the failure of the whole predictive model may itself open new possibilities. An extreme way to reduce the number of individuals is to vest capability in the group as a whole rather than in individuals: practically speaking, to design institutions that prevent any individual from acquiring capability with a significant effect on the group.
The present danger is that, through all kinds of assistance, individual capability has already grown dramatically. Operating a ship once took the coordination of dozens of people; now a single person with a computer-assisted system can fly an aircraft. The ability of an individual to decisively affect an entire civilization will appear in the near future; asymmetric terrorist attacks are merely one early manifestation of this.
I have been fighting with MSER for a week now. There is still a bug in finding the local minima, and my implementation is about 4x slower than the one reported in the original paper. It is sobering how exhausting it is to implement even a moderately complex algorithm.
The local feature descriptor (LFD) is an overwhelmingly successful method for image comparison and currently the best defence against in-plane rotation, distortion and lighting variations. However, some of its assumptions should be noted. An LFD is an appearance feature. It is more stable than a raw pixel feature, but the limitations of appearance features still constrain what an LFD can do. First, appearance features are vulnerable to lighting variation, and their descriptive power depends on the complexity of the appearance. By collecting features around key areas, LFDs counteract the interference of image variations; you can view this as a kind of extension of shape description. Still, for objects whose appearance has low complexity, LFDs achieve nothing. A test scenario would be a plain colored ball against a complex background: the LFD cannot capture any useful information about the ball, because what distinguishes the ball is its shape, not its appearance. There, LFDs can do little.
One way to solve the problem is to combine some sort of shape descriptor with our LFD mechanism. Since appearance features and shape features live in entirely different domains, there is no simple way to do this, and cases where shape is more powerful than appearance are rare. A closer look at how shape affects appearance shows that only a smooth boundary is something an LFD cannot describe well. For sharp boundaries and turning points (corner points), LFD algorithms can efficiently extract local descriptors from those key points. In our extreme case, the LFD loses its most important function: tracking the precise location of key points. For a ball, there is simply no way to decide which point on the circular boundary is a key point.
Another heavily discussed limitation is the computational complexity of LFD matching. In my previous articles, I offered some ideas on how to address it.
I knew as a child that my intelligence is mediocre and that I have no special talent, so all I could do was spend more time on one thing and try to do it well. After all these years, the habits that stuck are these: I hardly read novels, never play games, don't go out for fun, don't date or attend parties; almost all my time goes into reading and writing. I suppose that makes me a workaholic.
Local feature descriptors have an overwhelming advantage when comparing images under cropping, resizing, affine transformations and lighting variations. However, comparing the local feature descriptors of two images takes O(M log N) time, where M is the number of LFDs in the first image and N the number in the second. Bearing in mind that a classic LFD has 128 dimensions or more, one can estimate that a single image-to-image comparison takes almost 10 ms on a commodity computer. Time becomes a severe problem when scaling up to comparisons across hundreds of thousands of images.
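As a baseline for the cost being discussed, here is the brute-force matcher that the O(M log N) tree methods try to beat (an illustrative sketch with synthetic descriptors, not production matching code):

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Brute-force nearest-neighbour matching between two descriptor sets.
    desc_a: (M, D), desc_b: (N, D). Returns, for each row of desc_a, the
    index of its nearest row in desc_b -- the O(M*N*D) baseline cost."""
    # Squared L2 distances via the expansion |a - b|^2 = |a|^2 - 2ab + |b|^2
    d2 = (np.sum(desc_a ** 2, axis=1)[:, None]
          - 2.0 * desc_a @ desc_b.T
          + np.sum(desc_b ** 2, axis=1)[None, :])
    return np.argmin(d2, axis=1)
```

With M = N = 500 and D = 128, this is 500 x 500 x 128 = 32 million multiply-adds per image pair, which makes the millisecond-scale estimate above plausible on commodity hardware.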
Most current approaches try to solve the problem by reducing the complexity of comparing LFDs. LSH and tree-based methods attack it with classical data structures that reduce the time complexity of search to O(1) or O(log N). The visual vocabulary brings the classic IR model to visual search, making the time complexity independent of the number of LFDs indexed. O(log N) complexity is not good enough for billions of features, and building a visual vocabulary takes too much time clustering all the LFDs. LSH seems like a good choice, though indexing a billion features requires a compute cluster.
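To make the LSH option concrete, here is a minimal random-hyperplane LSH sketch (my own illustrative code, with untuned parameters): descriptors that hash to the same bucket are candidate near neighbours, so a query touches one bucket instead of the whole database.

```python
import numpy as np

class HyperplaneLSH:
    """Minimal random-hyperplane LSH index over real-valued descriptors."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each random hyperplane contributes one bit of the bucket key.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, vec):
        bits = (self.planes @ vec) > 0     # sign pattern w.r.t. each plane
        return bits.tobytes()

    def add(self, idx, vec):
        self.buckets.setdefault(self._key(vec), []).append(idx)

    def query(self, vec):
        return self.buckets.get(self._key(vec), [])
```

A real deployment would use several independent hash tables and probe neighbouring buckets to trade precision for recall; this sketch only shows the core trick.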
All three methods above ignore sparsity and redundancy. Empirically, classic methods discover 500-600 LFDs in an 800x600 image. In a wavelet or Fourier transform, discarding 95% of the dimensions still preserves most of the image, which means most of the information can be recovered from a 24,000-dimension space. LFDs span a 64,000-76,800-dimension feature space, which is far more than enough for feature comparison. On the other hand, a simple observation establishes sparsity. The classic comparison process finds one-to-one correspondences between two images; if we simply treat every other pairwise similarity as zero, we obtain a similarity matrix that is roughly 1/500 sparse. The reason we may set the similarity of two distant points to zero is that, in a high-dimensional euclidean space of randomly distributed points, pairwise distances tend to be very close to one another, so large distances carry no useful information. And since we only considered two images here, while in practice only a few images share corresponding points, the actual distance matrix should be much sparser.
My last article already suggested a method to reduce the indeterminate dimensionality to something lower, say 30d, hoping that the 30d data can still maintain a good approximation of the non-metric distances. After reading several MDS methods, the computational cost scared me off. Considering that many MDS methods are exemplar-based, they may be unsuitable for a very large database.
The motivation for introducing empirical methods is that most of them are fast and robust. A proper combination of classic global image features, such as gist, color histograms, wavelets and so on, may mimic a good approximation of LFD comparison. By the way, we don't expect a 90%-correct approximation; a 10% hit rate is enough, because even that dramatically reduces the computational cost of the refine stage.
All of the above is theoretical analysis. I wrote a crawler to capture some pictures from flickr to verify the theory, but the crawler framework I use has some serious memory leaks.
People from different countries really do think differently. A Russian will say: this problem can be viewed as equivalent to that problem, so a closed-form solution exists. An American will say: this problem is rather complex; gradient descent will find a reasonably good solution. A Canadian will say: first map this problem onto another problem, solve that one, then reconstruct and fix the reconstruction error. A Chinese will say: here is how other people solved this problem.
I allocated some time last week to investigate tree-based and hash-based approximate k-NN methods. Tree-based methods seemed more promising, so I implemented the spill tree in the OpenCV repository. In tests, the tree-based method suffers badly from the curse of dimensionality: 32d is fine, but at 128d it is beaten by naive search. Reducing the dimensionality first is always the right move, but even then, 4 ms per query on an indexed database of 60k entries is far from practical.
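The failure mode at 128d is easy to demonstrate: the ratio between the farthest and the nearest distance from a query to random points shrinks toward 1 as dimensionality grows, so space-partitioning trees have almost nothing to prune. A small sketch (my own illustration of the effect, not the spill-tree code):

```python
import numpy as np

def distance_contrast(dim, n=2000, seed=0):
    """Farthest-to-nearest distance ratio from a random query to n random
    points in [0,1]^dim; it collapses toward 1 as dim grows."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))
    q = rng.random(dim)
    d = np.linalg.norm(pts - q, axis=1)
    return d.max() / d.min()
```

At dim=2 the ratio is large (nearest neighbours are meaningfully near); at dim=128 it falls close to 1, which is exactly the regime where the spill tree loses to naive search.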
My recent interest centers on the local feature descriptor (LFD), which I view as the most promising method for image recognition I have ever encountered. Don't forget that even in face recognition, which is considered partly solved, dividing the face into several parts is a common technique.
Most methods for extracting LFDs produce features that lie in euclidean space. But for a single image, the resulting LFD sequence can contain hundreds or thousands of descriptors. In a single feature space, we therefore have to perform thousands of queries against a database of billions of features (a million-image database). Even scaled out to thousands of computers, and assuming network latency is negligible, the results are still poor (4 s per thousand queries).
One practical method, borrowed from full-text retrieval, is the visual vocabulary. But one thing keeps me away from this state-of-the-art method: how can one determine whether 50k words are enough to describe any image? Even with very convincing results (comparing vocabularies of 300k, 50k and other sizes), it remains unknown how the vocabulary size should grow with the number of images indexed.
Another way is to fall back to traditional content-based image retrieval, which treats the whole image as a single element. Unsuccessful attempts were made in the 1990s with histograms, local histograms, color moments and so on. That is also when tree-based approximate k-NN methods were developed, since the dimensionality of a histogram is manageable. Decades later, we should certainly add something new to the old approach. What has changed most is that the similarity measurements are no longer euclidean: the notions of mean vectors and euclidean distance are no longer valid here.
That is where the filter-refine method comes in. The idea is to map the non-metric space into a euclidean space and find an approximate nearest subset there; applying the exact distance calculation to a small subset is much cheaper. The open question is how good the mapping procedure can be. Still, it is worth a try.
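The two-stage idea can be sketched as follows (an illustrative skeleton under my own naming; `query_exact_fn` stands in for whatever expensive non-metric distance the refine stage uses):

```python
import numpy as np

def filter_refine(query_cheap, query_exact_fn, db_cheap, db_ids, keep=0.1):
    """Filter-refine search: rank the whole database with a cheap fixed-length
    feature (L2 in the mapped euclidean space), keep the top fraction, then
    apply the expensive exact distance only to that candidate subset."""
    d_cheap = np.linalg.norm(db_cheap - query_cheap, axis=1)   # filter stage
    n_keep = max(1, int(len(db_ids) * keep))
    candidates = np.argsort(d_cheap)[:n_keep]
    refined = [(db_ids[i], query_exact_fn(i)) for i in candidates]
    return sorted(refined, key=lambda t: t[1])                 # refine stage
```

The quality of the result hinges entirely on the filter stage: if the cheap mapping fails to rank the true neighbour into the kept fraction, no amount of refining recovers it, which is exactly the open question about the mapping procedure.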
At the start of 2009: it took me almost two years to understand these things.
When I do something, these are the contributions that matter (in increasing order of importance):

- money, which is actually the easiest thing to contribute;
- time and energy, which carry a real opportunity cost;
- other people's trust in me.
And right now I have no money, so for me money amounts to my parents' trust in me, on the same level as the last item.
Frankly, I lack some of the talents business requires; for instance, my sympathy overflows. A case in point: you have only ten thousand yuan on hand. Do you pay the migrant workers their wages, or entertain powerful people? I would certainly choose the former, but the better choice is the latter. The former pays for what others did in the past, so paying or not has no effect on the future, while the latter pays for future possibilities; even if what you buy is only a probability, it beats buying nothing.
Understanding these things, I also came to know my own limits.