The one major common sense shared in machine learning community is that euclidian distance is poor. To attack this problem, one way is to use another distance measurement, and the other is to learn a better distance representation. Mahanalobis distance is a good practice by linear tranform our data to a more suitable space. As it is only do one linear transformation, after the transformation, it is still a normal euclidian distance.
By finding a better linear space to retain NN, it may dramatically improve the result (>2x). However, it cannot dilute our concern to the imperfection of euclidian space. By simply turn to another "nonlinear" method cannot serve any good too. Turning a simply question to a space which has more degree of freedom and tuning a better result is a way to avoid harder and realistic problem. Stick to the linear way is not something too shy to say.
At the monment, we are still largely depended on lower-dimensional euclidian distance and hoping to find another unified way to do distance measurement.
The holy grail of computer vision is to understand the scene, and output with proper language. In ideal senerio, it should be able to answer questions like "how many people visited our college this afternoon" or at least given result to a query like "SELECT COUNT(*) FROM Camera1|Camera2|Camera3->VideoStream, DateTime->VideoStream WHERE FaceDetector LIKE (SELECT FaceDetector WHERE Tag=Face) AND Time > '12:00:00' AND Time < '23:59:59'".
In this senerio, we extracted a goal that with existing technology could be achieved. Other than to distort structural data which introduced in the article, here we try to structuralize visual data. The result of the effort is a new structural query language for visual data. It should be a subset of existing SQL and more over, ideally, it should be able to collabrate with other SQL engines. In practice, we sacrified the compatibility with exisiting SQL frontface in order to get better performance. As a result, we yield a incompatible query syntax with SQL. In fact, for the current stage, I'd better describe it as a process to find similar visual data.
Visual data is processed with several different feature extractors when the first time put in database. The different feature representations become columns for every visual data. It also generate a unique fingerprnt in order to avoid duplicate visual data. To interact with text, tags are introduced again. Every piece of visual data can be tagged. With tags, one can write a nested query to do classification like the query in the beginning, it is actually a kNN classification. The feature extractor is atomic coomponent in the construction. Three feature extractors are provided in the start: face extractor, SURF extractor and MSER extractor. The structure is fleaxible, any extractor can be added later. Ideally, the extractors should provide a function to measure similarity between two visual data. In current case, it would be a huge performance penalty. To avoid these penalities, several helper extractors are introduced. The general extractors can have very different output, it can be variable length, binary data, or serialized data. a list, or a tree. However, the helper extractors output fixed length float sequence; they also have a implication that they can be measured by L1/L2 distance. In the system, we use helper extractors to mimic the similarities that are output by general extractors. Three helper extractors are used: global histogram, local histogram and gabor filter.
Overall, it looks far from a query language. One may think it as a table engine like what MyISAM/InnoDB does. Giving it time, maybe it could be more powerful, who knows?