CARVIEW |
Every repository with this icon (

Every repository with this icon (

Description: | edit |
Homepage: | edit |
Public Clone URL: |
git://github.com/hintly/hintly_ensemble.git
Give this clone URL to anyone.
git clone git://github.com/hintly/hintly_ensemble.git
|
Your Clone URL: |
Use this clone URL yourself.
git clone git@github.com:hintly/hintly_ensemble.git
|

name | age | message | |
---|---|---|---|
![]() |
README.rdoc | Thu Sep 03 17:38:07 -0700 2009 | correct typo; spell out SA [Daniel Haran] |
![]() |
blending/ | Wed Sep 02 18:04:04 -0700 2009 | blending algo [Xiang Liang] |
![]() |
download/ | Thu Sep 03 04:14:30 -0700 2009 | data set [Xiang Liang] |
![]() |
include/ | Sun Aug 30 19:23:50 -0700 2009 | source [Xiang Liang] |
![]() |
knni-all/ | Mon Aug 31 03:05:01 -0700 2009 | item based knn with language and repos name inf... [Xiang Liang] |
![]() |
knni/ | Sun Aug 30 19:23:50 -0700 2009 | source [Xiang Liang] |
![]() |
knnu-all/ | Sun Aug 30 21:01:25 -0700 2009 | knnu all [Xiang Liang] |
![]() |
knnu/ | Sun Aug 30 19:29:16 -0700 2009 | user based knn [Xiang Liang] |
![]() |
knnui/ | Sun Aug 30 19:50:31 -0700 2009 | knnui [Xiang Liang] |
![]() |
popular/ | Mon Aug 31 03:38:40 -0700 2009 | recommend popular repos [Xiang Liang] |
![]() |
repo-all/ | Thu Sep 03 04:11:08 -0700 2009 | repo all [Xiang Liang] |
![]() |
repos/ | Mon Aug 31 05:33:56 -0700 2009 | repos and collaborators [Xiang Liang] |
![]() |
results.txt | Sun Aug 30 11:52:39 -0700 2009 | top entry so far based on xlvector's entry [Daniel Haran] |
About this entry
First, thanks to Github for putting on a great contest.
This entry was supposed to be a blend of xlvector and jeremybarnes’ solutions, but due to time it ended up being just xlvector’s.
Liang Xiang (xlvector) will explain his model shortly, Jeremy’s already explained a great deal in his README (read it, it’s great!).
The secret sauce?
More data equals better results.
We also crammed in more heuristics.
Next steps
Anyone wanting to build on this should probably pay close attention to blending. The algorithms themselves aren’t incredibly new, but there’s probably low-hanging fruit with neural networks, decision trees (or combinations) for blending ordered and weighted lists.
Algorithms
Item Based KNN
This algo is in knni directory. I use cosine similarity with inverse user frequency to measure item-item similarity.
We also use reponame and language information in measuring item-item similarity, this algo is implemented in knni-all.
User Based KNN
This algo is in knnu directory. I use cosine similarity with inverse item frequency to measure user-user similarity.
We also use reponame and language information in measuring user-user similarity, this algo is implemented in knnu-all.
Let’s take language information for example, given two user a, b. Let La[i] be the percent of repo which is written in language i user a watch. Then, the language similarity of user a, b can be calculated by cosine(La, Lb).
Hybrid User/Item KNN
This algo is in knnui directory
Removing unlike items
If we want to find the repos a user will watch, we can solve this problem by removing items he/she will not watch. In this way, I use some extremum value to remove unlike items.
For example, given a user u, if the popularity of the most popular repo he/she watch is N. Then, in the recommendation list, we should decent the likeness of the repos which popularity is larger than N. In blending/main.cpp, function userMostPopular do this task.
Further more, we know the date of the repo when it is created. Let T1 be the earliest create date of repos user u watch, and T2 be the latest create date of repos user u watch. Then, we should increase the weight of repos created between T1 and T2. In blending/main.cpp, function postProcessByDate do this task.
Result Diversity
In the contest, we find users who watch a lot of repos are hard to be predicted. This is because they have diversity interest, therefore, it is hard to cover their interest in only 10 recommendations. In this way, we should let recommendations list include different interest field of users and we have to improve the diversity of recommendations.
This algo is implemented in function diversity() in file blending/main.cpp
Extra data
repo_watch number. I download repo information by Github API, a repo have many information, but watch times is the best information. Let N1(i) be the watch time of repo i in data.txt and N2(i) be the real watch time of repo i, then, if N1(i) == N2(i), it means i will never be in the recommendation list.
collaborators information, this information is also very important, it is used in directory repos/
Without these extra data, I can only get 2600 by the data github provide. So, extra data is important in the contest.
Blending
The most important lesson I learn from Netflix Prize is one algo can not solve everything. In this way, we can get better result by blending algos I used above. I used a simple linear blending method, a divide users into 3 groups by the number of repos they watched. In every group, I use simulated annealing to learn the linear blending weight of different algos and get the best blending weight. Then, I reset these weight by submission. Github contest allow a competitor to sumbit many results every day, so the best blending weight can be learning by submission, therefore, the best result is overfitting in test set.
However, if I only using the blending weight got from SA, I can only get 253x in test set. So, I think jeremybarnes did a good job in github contest.