Highlight: document-vector is sparse, therefore, it can be recovered by compressed sensing.
Today’s robust object detection algorithm need large training dataset. For example, THU’s high accuracy face detection system uses 30k positive faces, and the negative examples collected from background images are countless. It poses a very serious problem for researchers. Because good result relies on both novelty of algorithm and size of training dataset. Completeness of dataset will help the algorithm to generate good assumptions of the subject. But the collecting of dataset is heavily labored. It requires human input for every sample, which makes the collection process not scalable.
On other hand, the photo-realistic 3D graphics is very much mature. Nowadays’ PC software can produce very real graphics. The only difference is the rendering is very computational expensive. However, one stage of effective training typically requires ten thousands positive images, it is not a small number, a 6 min length 3D movies roughly has 10,000 frames. However, only a very low-resolution image is something we really need in later training stage, a 32x32 resolution is enough for many tasks. The specific requirement makes the problem feasible.
And all hail to Turing. Who can tell me how the “Remapping PCI Memory to Memory” option affects Linux 64bit and Win7 64bit?
The constructing of visual word set is essentially a method to find clusters/exemplars in given feature space. The result by comparing a local feature vector with visual word set is the index of visual word. It compacted the reverse index to each image, but the underlying mapping is the same. It performs the calculation h(v) to convert one vector to an integer.
Though the mapping function of looking up visual word is very similar to the hashing function of locality sensitive hash scheme, the idea behind it is very different. LSH more cares about preserve the distance measurement and coverage to whole feature space. Visual word generating cares more about concentration of points. Thus, for L2 distance, visual words tend to use different diameter balls to cover existing points, but LSH uses more same diameter balls to cover the whole range.
It is easy to recognize that LSH encoded more information because it preserved the distance information through a set of hashing functions. It carved each point more precisely by providing several measurements. On the other hand, looking up visual word lost all the distance information, only yielded a categorical result. Comparing as one hashing function, visual word method is good because it obtained approximately best partition for the existing points. But the overall performance are restricted by the limited output.
I am looking into an algorithm that combines dynamic generated visual word set, the sparsity nature of self-similarity descriptor and locality sensitive hashing into an online local feature comparison.