1. How to make a query and get associated data?
2. How to assign data to a specific query instance?
At first glance, there seem to be no big differences between the two problems. One can mimic a solution to the 2nd problem by making frequent queries to the database and getting fresh assignments for the current data. However, once we take the real-time attribute into consideration, the problem becomes very difficult. It means we cannot rely on lazy query execution or caching to ease the query load on the database backend. Every data assignment has to be done immediately as the data arrives.
The optimization techniques are quite different. In the first scenario, we rely on indexing and shrinking the qualified database size to make the query faster. In the second scenario, the most natural optimization is the left-hand optimization, which discards data based on the first few conditions within a query. Until now, my research has heavily addressed the first problem and ignored the second.
Whether the second problem is “realistic” or not remains unclear to most people. If solutions to the 2nd problem could be as efficient as those to the first, the overall architecture of the Internet could change dramatically. In many web apps, we don’t deal with changing queries; on the contrary, we deal with changing data. If we can solve the second problem, which is, by the way, the more natural fit for changing data, we don’t have to cache anything or fight the cache-expiration monster. Twitter took advantage of a distributed queue system to deliver new messages rather than querying messages for different users with different query parameters. Since real-time streaming has become the new bragging feature for web apps, in the foreseeable future we will have to solve the second problem.
A queue system is a very primitive answer to the 2nd problem. It only solves the problem of how to store the data’s relationship with queries. How to check an incoming datum’s validity against millions of queries is the real headache. We may exploit common features shared between queries; however, for complicated queries, I don’t know how to do it well.
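To make the matching problem concrete, here is a minimal sketch in Python, with entirely hypothetical queries and field names, of routing each incoming datum through a bank of standing queries where every query short-circuits on its first failing condition (the left-hand optimization mentioned above):

```python
# A minimal sketch of matching incoming data against standing queries.
# The query conditions and field names here are hypothetical examples.

standing_queries = {
    # query_id -> ordered list of predicates; the cheapest / most selective
    # conditions come first so that most data is discarded early
    # (the "left-hand optimization").
    "q1": [lambda d: d["type"] == "photo", lambda d: d["width"] > 1024],
    "q2": [lambda d: d["user"] == "alice", lambda d: "beach" in d["tags"]],
}

def route(datum):
    """Return the ids of all standing queries the incoming datum satisfies."""
    matched = []
    for qid, predicates in standing_queries.items():
        # all() short-circuits: the first failing condition discards the datum.
        if all(p(datum) for p in predicates):
            matched.append(qid)
    return matched

print(route({"type": "photo", "width": 1600, "user": "bob", "tags": ["city"]}))
# -> ['q1']
```

The loop above is linear in the number of standing queries, which is exactly where the millions-of-queries headache comes from.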
Any paper recommendations?
Fotas.net has always been proud of its “dynamic folder” technology. Now, with NDQI (Non-structural Data Query Interface), the new fotas.net collection (formerly named “dynamic folder”) will be more powerful and finally syntax-complete.
Now that the domain issue has been solved, the new fotas.net, scheduled to be released in June, will contain more than a dozen new concepts and innovations, such as a portfolio-based management layer, a new js upload API, an all-ajaxed admin page, etc.
The fascinating part of high-dimensional descriptors is that they have two faces: sparsity and density. The overall descriptor data (in my case, image patch descriptors/local feature descriptors) lie in the space with large variance. But when observing small groups of descriptor data, there are some dense areas in the space. Duplicate detections from the corner detector, similar objects, etc. may all cause this density. Reducing each dense descriptor cloud to one exemplar reduces the overall matching time. Especially for CBIR, reducing the descriptors of one image would have nearly no negative after-effect.
That is where affinity propagation comes in as a replacement for k-median. Affinity propagation is an amazingly fast and pretty good approximation to the optimal exemplar result. Its ability to work with a sparse similarity matrix can largely reduce the computational cost. Using the full neighborhood dissimilarity information and the mean dissimilarity as the preference, it reduced 1147 local patches in an image to 160 local patches. However, computing the full dissimilarity matrix is expensive, so in my experiment a best-bin-first tree was used to speed up the k-NN search, and dissimilarities were only set for the top N (N=5, 10, 20) neighbors. In that case, the time cost was reduced from 30s to less than 1s, and the number of local patches was reduced to 477 (T20), 552 (T10), 647 (T5).
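As a rough illustration of that pipeline (not the original experiment), here is a sketch using scikit-learn’s AffinityPropagation with a precomputed similarity matrix. Random vectors stand in for the real patch descriptors, the negative squared Euclidean distance is used as similarity, and the mean similarity as the preference mirrors the “mean dissimilarity as preference” setting above; the best-bin-first tree and sparse top-N neighborhoods are not reproduced here.

```python
# A minimal sketch: reducing a set of local descriptors to exemplars with
# affinity propagation. Random vectors stand in for real patch descriptors;
# the counts and dimensions are illustrative only.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((1147, 128))   # stand-in for 1147 local patches

# Similarity = negative squared Euclidean distance; preference = mean similarity.
S = -pairwise_distances(descriptors, metric="sqeuclidean")
ap = AffinityPropagation(affinity="precomputed",
                         preference=S.mean(),
                         random_state=0).fit(S)

exemplars = descriptors[ap.cluster_centers_indices_]
print(f"{len(descriptors)} patches reduced to {len(exemplars)} exemplars")
```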
A coarse observation is that reducing the number of local patches improves the accuracy of search over the database. The reduction leaves the more distinctive patches in the bank. More distinctive points reduce false positives and yield overall performance gains.
Since the affinity propagation method shows many promising aspects, the new keyword “EXAMPLAR” will be introduced in the implementation of the non-structural data query language.
Half a year ago, I read an article about how to use simple JavaScript to perform MapReduce in the browser. It is very interesting, but the author obviously ignored that data locality is what makes MapReduce so good. It is not appropriate to bring MapReduce into the browser scenario, because MapReduce solves data-intensive problems where bandwidth is critical (that is why the Reduce part was introduced).
However, the idea of making the browser do some extra work is suitable for computation-intensive tasks that require only a little data. Some people were already on this track years ago, using Java applets or Flash. With Google Gears or even setTimeout, I believe it is now very realistic to introduce browser-based grid computing with JavaScript.
More details about it will be revealed in July.
For those who don’t know what Facool is, there is a video about it: http://www.vimeo.com/1925998
It has been 3 years since Facool closed in 2006. After working on several minor startup things, I still occasionally hear people ask why Facool failed back then. I have spent a lot of spare time thinking about it. Today it still seems to be a cool idea to put face retrieval technology online, and there are many startups working on this (such as face.com, riya.com, etc.). And now I think I have a good perspective on why Facool failed.
Facool rolled out as an academic research result. It took me a while to realize the economic potential, and then I started to run it as an actual product. 2005 was the time when everyone believed search was the coolest thing, just as SNS was in 2007 and Twitter in 2009. The idea was simple: index all the faces on the web and find any of them instantly. The missing point here is that the goal was too ambitious and the resources I could use were limited.
The shortage of resources can explain many of the setbacks Facool encountered. First is the shortage of images. In 2005, Facebook had just launched. There was not much good structural representation of personal information on the Internet. After scraping 100,000 images, the detector found about 10,000 faces, and most of them were low-resolution. You had to dig into the deep web to find more useful information, and due to the lack of structural information about people, I even had to develop a new algorithm to determine a person’s name!
Lesson 1 learned: start with a small thing, and evolve along the way.
When Facool came out as a web service, I had coded the web server from scratch, which made me spend more time taking care of socket errors, concurrency problems, etc. Writing a web server is a big time sink, and even if it could gain a few percent in performance, it is not a convenient thing to start with. I actually spent 2 months coding the web server; compared with piling up a web service in 3 days with Django today, I wasted too much time on unimportant stuff.
On the contrary, I was not a huge fan of the open-source community at that time. In 2005, I had only heard of OpenCV and never put it to real use. Without trying the power of open source, I trained the face detector on my own, which, no doubt, cost another 3 months to reach a satisfactory result.
Lesson 2 learned: save time, avoid reinventing the wheel, and harness the power of open source.
When I finally finished the beta version of Facool, I had just about run out of money. I spent about $5,000 to buy a server and rent the bandwidth, leaving only a few bucks for living. It is hard to recall that just 3 years ago there was no Slicehost, no Amazon S3, and you had to start up with a $2,000 server.
By June, I didn’t have one extra penny to pay for the bandwidth, and that was pretty much it.
Lesson 3 learned: start up with cheap stuff and still have at least half of your money saved before release day.
Sometimes, I appreciate that I failed so young and have so much time to start over.
One major piece of common sense shared in the machine learning community is that Euclidean distance is poor. To attack this problem, one way is to use another distance measure, and the other is to learn a better distance representation. Mahalanobis distance is a good practice: it linearly transforms our data into a more suitable space. Since it only applies one linear transformation, the distance after the transformation is still a plain Euclidean distance.
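To make that last point explicit: a Mahalanobis distance with matrix M = L^T L is exactly the Euclidean distance measured after applying the linear map L. A tiny numpy check (with an arbitrary, made-up L rather than a learned metric) is below.

```python
# A tiny numpy check of the point above: a Mahalanobis distance with
# M = L^T L equals the ordinary Euclidean distance after the linear map L.
# L here is an arbitrary made-up transform, not a learned metric.
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
L = rng.standard_normal((5, 5))
M = L.T @ L

d_mahalanobis = np.sqrt((x - y) @ M @ (x - y))
d_euclidean_after_transform = np.linalg.norm(L @ x - L @ y)
print(np.isclose(d_mahalanobis, d_euclidean_after_transform))  # True
```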
Finding a better linear space that preserves nearest neighbors may dramatically improve the result (>2x). However, it cannot dilute our concern about the imperfection of Euclidean space. Simply turning to another “nonlinear” method does no good either. Casting a simple question into a space with more degrees of freedom and tuning for a better result is a way of avoiding the harder, more realistic problem. Sticking to the linear way is nothing to be shy about.
At the moment, we still largely depend on lower-dimensional Euclidean distance while hoping to find a more unified way to measure distance.
In some earlier analyses, the role of emotion was usually ignored in favor of pure rational analysis. Emotion was treated as an unpredictable quantity and estimated conservatively by every available means. Rational analysis alone can yield some benefit, but compared with the results obtained after introducing emotion as a controllable quantity, that benefit is far too small.
Conclusions reached this way are so conservative that they are useless in real life; but once emotion is brought into the analysis, many phenomena can be taken a bold step further.
The role of emotional analysis is not merely passive. Actively introducing emotion can also help people. Human memory is very robust, and memory access is associative, which is why memorizing an article takes great conscious effort while remembering a scene requires almost no self-involvement. Much of the time, people can hardly even tell whether their own memories are genuine, because the association is simply there, with no further evidence to check against.
This paves the way for introducing emotion. Take strong emotion as an example: if racial features are deliberately highlighted in extremely bloody and violent footage, most people exposed to it will associate the two. Of course, this is merely a propaganda technique that was used decades ago. But unlike decades ago, based on modern research we can conclude that even a very small amount of such propaganda can produce effects similar to large-scale propaganda. Therefore, the old broad-scale catalytic approach can be replaced by point-to-point delivery. In other words, a specific emotional injection can act on a specific person without being noticed.
Furthermore, if the brain can be led to form associations starting from facts it already accepts and keeps reinforcing, and the timing of that association is deliberately blurred, then once triggered, the person will reinforce this perception even more actively on their own. Through such a method, a targeted person may even defend this perception as something they produced themselves. This power should not be underestimated.
The fact that humans possess primal emotions hinders the reality of zero-sum games, which can be regarded as a genetic trait for protecting the group. At the same time, these primal emotions and the associative nature of memory also make people vulnerable to carefully designed attacks on their consciousness.