Local feature descriptors have overwhelming advantage in comparing different images against cropping, resizing, affine trans, light vars. However, to compare local feature descriptors between two images needs O(MlogN) time where M is the LFD number of 1st image and N is the LFD number of 2nd image. Having in mind that classic LFD involves 128d or more, one can theoretically conclude that the comparing process will take almost 10ms in a commodity computer. Time becomes a severe problem when scale up to compare across hundreds of thousands images.
Most approaches nowadays try to solve the problem by reduce the complexity of comparing LFDs. LSH/Tree-based methods attack the problem with classical data structure which could reduce the time complexity of search process to O(1) or O(logN). Visual vocabulary introduced classic IR model to visual search which makes the time complexity irrelevent to number of LFD indexed. O(logN) complexity is not good enough for billions of features. Visual vocabulary takes too much time to cluster all LFDs. LSH seems like one good choice though indexing billion features needs cluster computer to support.
All three methods above ignore the factor of sparsity and redundancy. Empirically, classic method can discover 500~600 LFDs from 800*600 images. In wavelet trans or fourier trans, discard 95% dimensions can still maintain most part of the images, which means that most information can be recovered from 24,000 dimension space. LFDs support a 64,000~76,800 dimension feature space which is far more enough for feature comparision. In the other hand, by a simple observation, sparsity can be proved. The classic comparing process find one to one correlation in two images. If we simply think other's similarity is zero, a 1/500 sparse similarity matrix are obtained. The idea of setting the similarity of two points that are too far from each other to zero is because when measuring in high dimensional euclidean space of random distributed points, their distances from each other tend to be very close. The measurement observation suggests that in high-dimensional euclidean space, far distance is useless. As here we only observe two image comparision, in actual case, because only few images have correlate points, the distance matrix should be much more sparse.
My last article already suggested a method to reduce the indeterminate dimension to fewer dimension, for say, 30d. In that way, we are hoping the 30d data can still maintain good approximation of non-metric distances. After read several MDS methods, the computation cost scared me. Considering many MDS methods are examplar-based, it may be improper to apply to very large database.
The thought of introducing empirical methods because empirical methods, most of them, are fast and robust. A proper combination of classic global image features, such as gist, color histogram, wavelet and so on, maybe, can mimic a good approximation to LFD comparision. BTW, we don't expect a 90% correct approximation. a 10% hit rate is enough because that can also dramatically reduce the computation cost in refine stage.
All above are based on theoretical analysis. I programmed a crawler to capture some pics from flickr to verify my theory, but there are some serious memory leaks in the crawler framework I use.
I allocated some time last week to investigate tree-based and hash-based approximate k-NN methods. It seems that tree-based methods are more likely to be promising. As a result, I implemented the spill tree in OpenCV repository. In test, tree-based method suffers great pain from the curse of dimensionality. 32d is ok, but for 128d, it is beaten by naive search. However, to reduce the dimension is always the right choice, but even consider that, 4ms for a indexed 60k size database, it is far from practical.
My recent interests takes great concern to local feature descriptor (LFD) and view that as the most promising method for image recognition I've ever encountered. Don't forget that even for face recognition method which has been considered as partly solved, dividing to several parts is the common technique.
For most methods to extract LFD, the actually resulting features are lay in euclidean space. But for one image, the length of result LFD sequence can be hundreds or thousands. In single feature space, we have to perform thousands of queries in a billions of features database (million image database). Even scale to thousands computers and assume the network delay is negligible, the results are still poor (4s per thousand queries).
One practical method that was borrowed from full text retrieval is visual vocabulary. But there is one thing in my mind that keeps me walk away from the state-of-art method. How one can determine if 50k words is enough for describe any image? Even with a very convincing results (comparing 300k, 50k and so on size of vocabularies), it is still unknown how the word size grows along with the number of image indexed.
Another way to do this is to fall back to traditional content-based image retrieval which consider the whole image as a single element. Unsuccessful trials have been made with histogram, local histogram, color monument and so in 1990s. That is the time tree-based approximate k-NN methods developed as the dimension of histogram is manageable. After decades, certainly we should add something new to the old. One thing changed the most is, the similarity measurements are no longer in euclidean space. That means, the ideas of mean vectors, euclidean distance are not valid here.
That is why the filter-refine method comes out. The idea is mapping nonmetric space to a euclidean space, and then find the approximate nearest subset. Applying distance calculation on a subset is much easier. But question remains is how good the mapping proceduce would be. However, it worth to give a try.
Research in computer vision put the reliability problem aside and hold the principle that computer never goes wrong. Even with limited computing ability, let's say, the embedded system, although the memory resource and CPU were very limited, the system is reliable by default.
Distributing computer vision problem to multi-computer is not new. For many systems, to fulfill the real-time requirement, several computers are used. The three-tier NASA robotic framework was first implemented in three computers. Stanford winning Stanley vehicle was utilize three PCs to perform the decision-making, environment detecting and computer vision tasks. The small clusters (typically under 10 PCs) do not worry about system reliability because under that situation, the probability of system failure is ignorable.
For large-scale cluster, it is simply not true. The large-scale cluster, which contains at least 1000 commodity PCs. At that scale, single fatal failure happens all the time. In best wish, the system failure should be taken care by lower facilities. Many training process can be implemented by MapReduce like mechanism which should not be worried about. But as large part of computer vision algorithm concern about real-time task, taking account of the system failure in local implementation is inevitable.
A desirable low facility for real-time computer vision task has to be very flexible. It can be reorganized quickly after a single-node failure. The two phase design of MapReduce may be still in use, but the algorithm applying to the two phase procedure need to be reconsidered. Many algorithms just simply are not fit to the two phase idea.
When the highly reorganizable facility is accomplished, the problems are left for in the algorithm layer. The paralleled version of SVM was published in 2006. The parallelization of many well-known algorithms was just happened few years ago. But considering the good fact of offline parallel structure, the tricky part would not be the parallelization of algorithms. Contrarily, how to online all these offline algorithms can be a very challenged task. Even the famous metrics-tree (or best-bin-first search tree) cannot easily perform insert/delete node, how such as PCA/LLE become online algorithms?
As there are so many unknowns exist, all these problems forms the very bright future for distributed computer vision framework.