Notes

  • TiVo application: It is a TV show recommender system implemented in decentralized architecture. Users’ preference on shows was presented in the form of rating (from score -N to +N) and the preference score was rated in the client side. The TiVo server periodically aggregate the clients’ rating and send the rating table via ring mechanism to the client. Each client received the rating table utilize the collaborative filtering mechanism to predict the rating score of the shows he/she has never seen. Next, after the predicting process was completed, the show recommender at client side will make show recommendations for its user.
  • Fab Application: It is a web pages recommender system integrated both the content-based and collaborative filtering approaches. In Fab, there are three main system components, namely collection agents, selection agents and the central router. Collection agents are designed to collect web page of a specific topic and central router will aggregate the collected pages. Next, central router will forward pages which related to users’ interest profiles to system users and users’ selection agents will make  selection to the forwarded pages (e.g. discard pages have seen by users). In this phrase, content based mechanisms are implemented to make recommendation from central router to users. In next phrase, the concept of collaborative filtering mechanisms are applied. Selection agents at user side will forward the pages to the nearest neighbors according to some metrics for similarity calculations. To sum up, Fab combine both the well-known recommendation approaches to produce hybrid recommendations, which was running under the distributed architecture.

我的 Proposal 核心

Tagging 的部份 :
由於CF的運算是要採取 co-rated items 的方式去做similarity比較
如果單純以 paper 作為 item,則使用者彼此之間的co-rated的item
set 一定不會很大,甚至很可能為空集合,而無法計算similarity。
而我希望能藉由Tagging的方式,把一篇paper擴增成讓許多Tags
來描述這篇paper的特性,利用這種方式,即使是兩篇不同的paper
,但是他們屬於同一類型,很可能就會被使用者們歸類為同一個Tag。
因此比起傳統用SVD來降低Rating Table的維度(維度是要用tune的),
來增進co-rated的機會,在這邊我是利用Tag的方式來增進 co-rated
的機會(維度是由使用者的Tagging情況決定)。當然很可能Tagging
Space不一定會小於Item Space,但是我的目的只是要增進使用者間
有意義的co-rated的機會。(在此co-rated意思為兩個使用者擁有同一個tag)
而利用Tag的方式是否會改善傳統以paper(item)方式的co-rated機率
就是我要做實驗來驗證的。

Social Recommendation 的部份 :
傳統的CF運算是針對一個使用者(Target User)的行為,來找出系統中行
為跟他相似的 ” 朋友 “,進而分析他朋友喜歡的東西,來推薦給我們的Target
User。因此CF定義朋友的方式,是採取行為間的相似度高的那群人來當作
我們的 Target User 的朋友(儘管Target User實際上不認識他們),所以有
些學者也把CF歸類為Social Recommendation的一種方式。而在我的Proposal
中所提及的Social Recommendation除了會利用到CF的運算,此外還會考
慮使用者”自行加入”的朋友,在此的”朋友”就是使用者認識的朋友(很可能是現實
世界認識,或是網路世界認識)。而在CiteULike資料庫中,有提供組Group的
功能,同一個Group的人們(也許興趣相似或是不相似),彼此之間都可以當作
是朋友關係。我希望結合CF的運算與使用者自行定義的Friends的方式來增進
Recommendation List的diversity(多樣化),因為使用者自行定義的Friends很可
能在CF的運算中並不會被納入為”朋友”的考量。而我所提的Social Recommendation
的方法是否可以增進傳統用CF的方式之diversity,也是需要做實驗來驗證。

Personalized Recommendation推薦的東西比較傾向於使用者原本就喜歡的東西,
因此推薦的重點在於強調是否能準確抓住使用者的興趣,進而推薦一些他有興趣閱讀的東西。

Social Recommendation推薦的東西比較傾向於使用者 “可能” 會喜歡的東西,可能
他還沒接觸過這類型的東西,但他很有可能會喜歡,強調的是準確抓住使用者的興趣外
並且利用mining social relationship提供多樣化(diversity)的推薦列表。

在我的work中,並沒有傳統的Rating table使用者明確給予的評分來讓我跑推薦演算法。
在此,我推算的preference數值實際上只能當作baseline參考而已。因此重點就會在於
如何利用使用者的feedback來改善我們推薦系統的效能。

Popularity Measurements

How to measure the popularity of a product can be discussed in three dimensions :

1. The quality of a product

2. The frequency of a product’s being purchased

3. The strong support of a product (big fans)

Paper report 1

Paper

A. Blum and M. Furst. Fast planning through planning graph analysis. In Proceedings of the International Joint Conference of Artificial Intelligence, pages 1636-1642, August 1995.

Report

The paper proposed a mechanism for solving planning problem using a compact structure called planning graph. The planner Graphplan used in this paradigm can always return the shortest plan or answers no plan existed for any planning problem. Moreover, it can also solve partial ordering planning problems. With the great power of having high performance and guarantee soundness and completeness, Graphplan beat many well-known planners in the world.

 

Planning problems can be described of using planner to finding a sequence of actions from an initial state to achieve a target goal. Since real world planning problems can be encoded into representations that computer can understand and handle, designing a high performance planner is crucial in many real applications. The mechanism the paper proposed can be described as follows. First, it will construct a data structure called planning graph which was composed of many stages expending level by level, while each stage is composed of propositional level and action level. Second, it will add constraints (mutual exclusion) on each level, including inconsistent effects, interference, competing needs and inconsistent support. It means that if a solution returned by Graphplan, then all actions can not obey the constraints defined in this scheme. Third, it will expand the graph level by level until our goal appears, and it will use backtracking to see whether or not a valid plan exists. Finally, if no valid plan exists in this stage, it will continue to expand the graph until leveled-off. If the graph is leveled-off and some literals of the goal do not appear or are marked as mutex in the latest proposition level, the problem is unsolvable. The proposed planner only required polynomial time and space to run its algorithm and is sound and complete. From the view of experimental results, it shows that Graphplan outperforms many total-order and partial-order planners in many well-known AI planning problems. The question I was pondering about is that why literals and actions will increase monotonically and mutexes will decrease monotonically in the planning graph, and eventually level off. Moreover, if we add time factor into the planning problem, I don’t know whether or not the planner can guarantee that running this algorithm is sound and complete.

Research Issues

Netflix data set : 有好幾十億的資料,若要轉成 user-item matrix

做分析,則會因為 scale 太大而難以處理。

Solution : Apply SVD ( Singular Vector Decomposition) to the matrix.

SVD 可以把原始的 matrix 降至更低維度,並且保持一些性質 (待check)。

想法 : 幫每個movie定義一個 feature vactor,vector中的每個tuple代表

此movie某方面的性質,比如動作成分佔多高要素。相對的每個user也有

preference vector,vector中的每個tuple代表對movie中的每個性質之喜

好程度。

So let user A’s preference vector is = ( 1 , 2 , -1 )

let Movie M’s feature vector is = ( 1 , 4 , -1 )

then user A’s rating on Movie M is 1*1 + 2*4 + -1*-1 = 1 + 8 + 1 = 10

Paper Recommendation :

讓使用者把自己電腦中的 paper 交給 bibagent去取得 bibtex 後,並紀

錄每篇 paper在使用者電腦中的save time。如此就可以apply time weighted

技術來 identify目前user的working set是哪些領域的paper。為了不讓user

親自去對每篇paper做rating的動作,我們可以由網路上抓此篇paper的citation

number來當做客觀的rating,or we can do it in this way :

rating of a paper = citation number / published date

So , we can designe a web for user to upload the result of the retriving data made

by bibagent , then also let user to create the FOAF-like data sheet so that we can

use those infomation to apply CF-Algo .

Another point : 針對某篇 paper,我們是否能找出對此篇paper有興趣的user set,

so that we can get recommendations more specific to the paper from the user set .

類似找專家的概念。

Time issue in designing a RS

之前看了學長的 Time weighted CF 演算法成功了改進傳統

CF演算法的Accuracy。心中還是有些疑惑,依照學長的論

點是,一個人早期的 rating 資料的參考價值,不會比現在的

rating 資料的參考價值來的高。因此他設計了一個參考價值

依照時間飄逝 decrese 的曲線,也就是距離目前時間點愈近

則參考價值就比較高,愈遠就參考價值愈低。比如今天某位

使用者所做的 rating 參考價值是 1 ,則昨天此位 user 所做的

rating參考價值可能就是 0.9,如此把參考價值乘上實際的rating

數值所得除來的數字是比較具有參考性的。利用這種想法稍微

改了一下 CF 演算法的核心部份,準確度就上升了。但是目前

在 RS 系統上改善 Accuracy 的主要盲點是在於大家都會想出一

套新的方法來改變舊有的演算法,之後宣稱這個方法是可以提

升準確度的。但是卻沒辦法去證明說如果要改善準確度用這些

新方法就可以真的達到目的。 As known as if p then q ( p -> q) ,

we can not guarantee that q -> p 。其實這和 overfitting 有很大

的關係,儘管我們用很複雜的 model 來讓這份萬年 training data

( MovieLens ) 得到最好的 training 效果,並不意味我們拿到一份

全然不同的 training data時也會有同樣的好效果。其實真正要能

評論 Accuracy 的只有使用者才算是最客觀的標準。 但是人又是

主觀的動物,就算兩個人個性都差不多,但也並不意味著他們兩個

人的想法就都一樣。對於Accuracy的看法也不同吧。

Mail 相關術語,分析 Mail 所建立的 social networks

TO : 郵件的主要接收人

CC : 全名是 Carbon Copy ,副本。

BCC : 全名是 Blind Carbon Copy,密件副本。

在分析 Mail 所建立的 social networks 時,郵件彼此傳送之間

傳送者與接收者的重要性可以由 TO , CC , BCC 看出。 比如一

封 MAIL 如果 TO 很多人,則這封郵件對於每個人的重要性就

沒有來比TO 單一個人來的高。當然,BCC給某個人,則代表

這封郵件對那個BCC接收者顯得更唯重要,或是說寄信者相信

BCC的接受者才會利用BCC這欄位。

在分析 Social network 所做的 Visualization 目的是為了讓使用

者能更直觀的看出人與人之間的關係,比如兩個 Nodes之間的遠

近就可以表示這兩個Nodes之間的相似度或是親密度或是相關度等。

但是當 Network 的 Scale 很大時,做Visualization可能就會遭遇呈現

的問題吧,想像著要把好幾千人所組成的 Network 秀在一個小螢幕上

又可以同時讓使用者了解每個 node 是代表哪位人物的確不是很容易

做得到的。

Recommender Systems – Trust

How to make user trust our recommender system is a crucial

design goal . Image that if a RS always recommend items that

a user have never seen , so the items will be strange to the user.

In other words , the RS will not be so friendly to the user . So

how to make RS seems that it can understand the user’s psychology

is a very important point. It’s just like that RS play a friend role of the

user . In fact , it has simple way to do this : try to recommend items

that the user have ever seen , then let user feel friendly toward our

systems just like his/her friend . So , accuracy is not enough in designing

a RS , we must consider other factors like HCI perspectives in order to

let users easily control our RS.

PR Notes

Pattern recognition — the act of taking in raw data and taking an action based on the “category” of the pattern .

Segmentation operation in which the images of different fish are somehow isolated from one another and from the background. 比如把某隻魚從其他的魚中獨立出來並且去除背景,此動作叫做Segmentation。

Feature 的好壞 : 當一個 feature f1 和 feature f2 有相關性 ( 比如是正相關 ) .
則同時選擇 f1 和 f2 當 feature並沒有什麼意義,對 classification 的效果也不會提升。

用太複雜的 model 來做 decision 並不一定是好的,很難有 generalization 的效果。
也就是會造成Overfitting。簡單就是好。

Given that there truly are differences between the population of sea bass and that
of salmon, we view them as having different models — different descriptions, which
are typically mathematical in form. 盧魚和鮭魚顯然有著不同的特質,比如外觀,
長度,顏色等。諸如此類的特性,我們稱之為 Model。我們常常會用數學式來描述
Model,比如說長度 > 50 以上就是盧魚,<=50就是鮭魚。此為簡單的Model,
因此為了要區分盧魚和鮭魚所做出的decision boundary就很簡單。

Syntactic pattern recognition, where rules or grammars describe our decision. For ex-
ample we might wish to classify an English sentence as grammatical or not, and here
statistical descriptions (word frequencies, word correlations, etc.) are inapapropriate.
Syntactic的方式比較類似定 rule 來描述我們的 decision。

Pattern 表示法可以採用 vector 表示。