GitHub Contest: August 2009
A recommendation system for the github-contest by Jason Brownlee
Summary
This project provides the code used to generate a solution (results.txt) to the GitHub contest of August 2009. I participated in the contest as a learning exercise in both programming languages (Ruby and then Objective-C) and technology (collaborative filtering).
Details
The objective is to recommend repositories to users. Specifically, to predict which repositories to recommend to a user given existing known user-repository relationships.
Data
The following is a summary of the data provided in the competition; see /data/readme.txt for more detail.
- data.txt 440,237 known user-to-repository relationships
- repos.txt 120,867 known repositories (id, name, date created, parent repository)
- lang.txt language information for 73,496 repositories (id,lang;total_lines,…)
- test.txt 4,788 users from data.txt, each of which requires a prediction of (up to) 10 repositories
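For reference, here is a minimal Ruby loading sketch for these files. The field layouts follow the descriptions above, but the exact delimiters are assumed here and should be checked against data/readme.txt; this is not the project's actual loader.

```ruby
# Rough loading sketch; field layouts follow the descriptions above, but
# the delimiters are assumptions (data/readme.txt is authoritative).
watches = Hash.new { |h, k| h[k] = [] }   # user_id => [repo_id, ...]
File.foreach('data/data.txt') do |line|
  user, repo = line.strip.split(':')
  watches[user] << repo
end

repos = {}                                # repo_id => {owner:, name:, parent:}
File.foreach('data/repos.txt') do |line|
  id, rest = line.strip.split(':', 2)
  path, _created, parent = rest.split(',')
  owner, name = path.split('/')
  repos[id] = { owner: owner, name: name, parent: parent }
end

test_users = File.readlines('data/test.txt').map(&:strip)
```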
Observations
- All repos with parents have those parents in the dataset
- There are 5 repo language definitions for repos that are not in the main repo set (and therefore not currently watched by any users; perhaps they were removed in the preparation of the test set)
- There is no score for assigning these repos to any user, so I removed them (I tested this explicitly)
- All but 2 of the test users are in the main user set (the 2 are unknown or new users; perhaps all of their repos were stripped in the preparation of the test set)
- 41,444 distinct repo owner names
- 71,847 distinct repo names
- 56,521 distinct users
- All 120,867 repos are watched (excluding the additions from the language data)
- 12,064 forked repos (about 10%)
- 81,913 root repos (about 67%)
Strategy
I am an amateur in this domain (I'm a computational intelligence guy), so my strategies were mostly led by intuition, and later informed by strategy discussions from other contest submissions as well as introductory papers on recommender systems and collaborative filtering.
The prediction strategy is as follows:
- Generate lots of predictor indicators
- Generate a candidate set of repos for each test user
- Score candidate repos using weighted indicators (weighted sum)
- Maybe build models for applying the indicators toward prediction (indicator weights)
- Sort scored candidate repos and select the top 10 as user repo predictions
Indicators are at one of three levels:
- Global: probabilistic indicators across all user repo watching
- Group: users are grouped (kNN, or clustered using k-means), then probabilistic indicators are computed over the smaller user group
- User: probabilistic indicators across the repos a user watches
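To make the weighted-sum scoring and top-10 selection concrete, here is a rough Ruby sketch. The indicator wiring and weights below are illustrative placeholders, not the tuned indicators in the Objective-C code.

```ruby
# Weighted-sum scoring sketch: each indicator maps (user, repo) to a value
# in [0,1]; scores are combined linearly and the top 10 repos are returned.
def predict(user, candidates, indicators, weights)
  scored = candidates.map do |repo|
    score = indicators.sum { |name, fn| weights[name] * fn.call(user, repo) }
    [repo, score]
  end
  scored.sort_by { |_, score| -score }.first(10).map(&:first)
end

# Illustrative wiring: a single global popularity indicator backed by a
# precomputed repo_id => probability hash (placeholder values only).
global_watch_prob = Hash.new(0.0)
indicators = { global_prob_watch: ->(_user, repo) { global_watch_prob[repo] } }
weights    = { global_prob_watch: 1.0 }
```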
Findings
I focused on user-based collaborative filtering, although I started looking into item-based collaborative filtering, which turned out to be too slow for my old hardware. I also played with building global (stupid, I know) classification models using WEKA (decision trees, J48, etc.), which returned poor results.
I calculated user groups (clusters) using n-nearest neighbors, where the number of overlapping watches was the similarity measure. This resulted in assigning at least one neighbor to 3,989 of the 4,789 test users (800 neighborless). I have considered (but not implemented) less strict relations (repos in the same lineage, repos with the same language composition and size) and other distance metrics (Pearson's, cosine, Euclidean, etc.). I tried a vector distance metric that took vector length into account, which resulted in lower scores over a range of K using the current indicators. Both cluster types have been generated and stored in the /data directory.
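A brute-force sketch of the overlap-based neighbor calculation (illustrative only; the clusters actually used were precomputed and stored in /data):

```ruby
# k nearest neighbors by watch overlap: similarity is the number of repos
# both users watch. Brute force over all users, for illustration only.
def neighbors(user, watches, k)
  mine = watches[user]
  scored = (watches.keys - [user]).map { |u| [u, (mine & watches[u]).size] }
  scored.reject { |_, overlap| overlap.zero? }
        .sort_by { |_, overlap| -overlap }
        .first(k)
        .map(&:first)
end
```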
Candidate selection involved: the top 20 repos by watch count and by fork count, all repos in the parent and child trees of the user's watched repos, all repos with the same name and the same owner name as watched repos, and all repos in the user's cluster. Increasing the top-n cut-off gave a fractional score increase.
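Roughly, candidate construction looks like the following sketch; the lookup tables (top repos, lineage, name/owner indexes, cluster repos) are assumed to be precomputed, and their names here are placeholders.

```ruby
# Candidate set sketch: union of popular repos, the lineage of each watched
# repo, repos sharing a name or owner with watched repos, and the cluster's
# repos. Lookup tables are hashes defaulting to empty arrays, precomputed.
def candidates(watched, top_repos, lineage, by_name, by_owner, cluster)
  set = top_repos.dup
  watched.each do |repo|
    set |= lineage[repo]    # parent and child trees of the watched repo
    set |= by_name[repo]    # repos with the same name
    set |= by_owner[repo]   # repos with the same owner name
  end
  set | cluster             # repos watched within the user's cluster
end
```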
User Collaborative Filtering: Single Indicator Results (user kNN = 5, tested against the whole dataset on the server; each indicator weight is 1.0, of course)
- global_prob_watch (873 correct or 18.23%)
- global_prob_watch_forked (487 correct or 10.17%)
- global_prob_watch_root (359 correct or 7.497%)
- local_prob_watch (996 correct or 20.80%)
- local_prob_watch_name (1093 correct or 22.82%)
- local_prob_watch_owner (1038 correct or 21.67%)
- user_prob_watch_forked (461 correct or 9.628%)
- user_prob_watch_root (349 correct or 7.289%)
- user_prob_watch_owner (1061 correct or 22.15%)
- user_prob_watch_name (1030 correct or 21.51%)
- user_prob_watch_language (614 correct or 12.82%)
Popularity-based indicators (global and user cluster) seem useful, as do indicators based on repo name and repo owner name (at the user-cluster and user levels). Repo forking and root status seem to be poor indicators.
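To make a couple of these indicator names concrete, here is the spirit of the simplest ones in a simplified Ruby sketch; these are not the exact formulae used in /objc.

```ruby
# Simplified sketch of two of the indicators above (not the exact /objc
# formulae). watch_counts: repo_id => watcher count; owners: repo_id =>
# owner name; watched: the repos this user watches.
def global_prob_watch(repo, watch_counts, total_watches)
  watch_counts[repo].to_f / total_watches
end

def user_prob_watch_owner(repo, watched, owners)
  same_owner = watched.count { |r| owners[r] == owners[repo] }
  same_owner.to_f / watched.size
end
```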
User Collaborative Filtering: Multiple Indicator Results (tested against the whole dataset on the server; indicator weights are 1.0)
- K=3, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1830 correct or 38.22%)
- K=5, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1857 correct or 38.78%)
- K=10, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1867 correct or 38.99%)
Using just global/cluster popularity and the user name/owner probabilities, combined linearly, gives a useful result. Increasing K for the user cluster refines the local_prob_watch indicator, slightly improving results.
Given the utility of the repo owner name, and a summary by jeremybarnes of an approach for deducing user names, I deduced user names (repo owners) for those users who were the single watcher of repos. Conflicting names were skipped, and two cut-offs for repo-owner match counts were tested:
- 3+: assigns 166 repos to users, 146 of which are correct (score of 3.049%)
- 2+: assigns 229 repos to users, 204 of which are correct (score of 4.260%) [I use this one, naturally]
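The deduction heuristic, sketched below; the actual cut-off and conflict handling may differ slightly from this simplified version.

```ruby
# Deduce a user's GitHub name: if the user is the sole watcher of enough
# repos and those repos all share one owner, assume the user is that owner.
# watched: repos this user watches; watcher_counts: repo_id => watcher
# count; owners: repo_id => owner name.
def deduce_owner(watched, watcher_counts, owners, min_matches = 2)
  solo = watched.select { |repo| watcher_counts[repo] == 1 }
  names = solo.group_by { |repo| owners[repo] }
  return nil if names.size != 1          # conflicting owner names: skip
  name, matched = names.first
  matched.size >= min_matches ? name : nil
end
```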
Using this algorithm with the previous indicators (global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name), increasing K to 10, and increasing n for the top-ranked forked/watched repos in the candidate set to 50, I get a small boost (1912 correct or 39.93%).
Looking at repos ranked by global repo-name occurrence, the most common names are 'test' and 'dotfiles'. I thought stripping these repos from the candidate set would give a boost, or at least cut down the computation. Unfortunately, filtering out these repos results in a lower score (1906 correct or 39.80%) using the same parameters as before.
I generated some sample datasets using the above probabilistic indicators and fed them through some decision tree classifiers, with poor results. I also tried a series of rank-based indicators fed into a WEKA classifier, also with very poor performance. I have not tested these single indicators against the test set yet; I doubt their results would be good or much different from the probabilistic indicators used before. The rank indicators included:
- normalized_watch_rank
- normalized_fork_rank
- normalized_name_rank
- normalized_owner_rank
- normalized_group_watch_rank
- normalized_group_name_rank
- normalized_group_owner_rank
- normalized_user_name_rank
- normalized_user_owner_rank
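Each of these is essentially a repo's position in a list sorted by some count (watches, forks, name frequency), scaled into (0, 1]; the exact scaling used may differ from this sketch.

```ruby
# Normalized rank sketch: sort repos by a count and map each rank into
# (0, 1], with the highest count mapping to 1.0.
def normalized_ranks(counts)   # counts: repo_id => Integer
  ordered = counts.sort_by { |_, c| c }.map(&:first)
  n = ordered.size.to_f
  ranks = {}
  ordered.each_with_index { |repo, i| ranks[repo] = (i + 1) / n }
  ranks
end
```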
In the dash for the finish, my goal was to top 40%. I achieved this by human-annealing the key indicators. Specifically, I focused on the indicators that give a big boost (global and local popularity, user probabilities for repo name and owner) and hacked in hand-tuned indicators for repos in the user's repo parent hierarchy and for root repos. This took me first to 40%, then to 44%. I managed further improvements (around 48%) with some bug fixes in the probability calculations, hand-tuning of some weight parameters, and a weighted neighborhood probability of occurrence.
To Do
A list of tasks I’d like to tackle, time permitting or on future collaborative filtering projects:
- Dimensionality reduction using Singular Value Decomposition (SVD) – this will give faster (maybe more useful) user and repo similarity calculations
- Called Latent Semantic Analysis / Latent Semantic Indexing in the field of Information Retrieval
- Better user clustering using kmeans or even just an improved distance metric on my kNN
- Decrease similarity strength based on item popularity (I've seen equations for this)
- Use user-distance to weight indicators (a neighborhood popularity indicator, for example)
- Item-based collaborative filtering
- I started clustering repos, but it was going to take forever on my crappy 5-year-old Mac laptop
- Build classification models for each user group or even each user (slow, but might get good results)
- The current external classifier model approach performs very poorly; I'm considering using raw measures as indicators rather than probabilities and ranks (letting the decision tree model the probabilities for me)
- More indicators: brainstorm, look at other people's code, evaluate real users on the GitHub website…
- Optimize weights on indicators for global/cluster/user (gradient descent or even a genetic algorithm)
- Association rules: probabilistic co-occurrences in global/cluster
- I started on this, but the computation was going to take too long; co-occurrences need to be pre-computed or given a fast(er) lookup
- Pay for some Amazon EC2 time to do some serious number crunching; my laptop is not up to the task, and quickly proving out ideas may accelerate progress
- Offline validation testing
Validation Testing
For now I just push to the server (for strategy results) and do cross-validation for my classifiers using WEKA. My machine/strategy is too slow for any kind of automated gradient descent on weights or strategy parameters.
Code
I started with a Ruby implementation (/ruby), although I quickly found that it was simply too slow. I then moved to an Objective-C implementation (/objc). I chose these two languages because I wanted to know more about their capabilities.
To tap into WEKA (written in Java) I used Objective-C's Java bridge. It worked fine as long as I compiled and ran under a 32-bit JDK 1.5. The code is still in there but is not a part of the main execution (it was a test). Strip it out if it causes problems during compilation.
Ignore the Ruby code; the Objective-C code is the main show (the /objc folder). To configure, tweak cfg.h. To build, use 'make' (a makefile is provided). To run, call: ./main
On the futility of the contest
A goal of the contest, as listed by the contest organizers, was to generate an open source corpus of recommendation systems. I believe such efforts toward a 'white label' or 'turn-key' system are futile.
The utility of any data mining or machine learning approach comes down to the specifics of the problem being solved (scale, data, timeliness, etc.). General algorithms and even general systems are awesome, but they are only the starting point. The hard work in any such system is in identifying, selecting, and specializing existing techniques for the domain.
References
My Links
- Competition hacking with Ruby and Objective-C (blog post summarizing my involvement)
- jbrownlee (my) contest submissions
Contest Links
- Github blog, GitHub Contest almost over!
- Github blog, The 2009 GitHub Contest
- 2009 GitHub Contest
- GitHub Contest Leaderboard
- Github Contest Action (recent submission summary)
- Github contest dataset (.zip)
- About the GitHub Contest (finalizing the competition, highlighting strategies and code bases)
Other Contest Links
- Ryan Cox, Lessons Learned from the GitHub Recommender Contest (Aug 14th 2009)
Papers and Links
- Weka: Data Mining Software in Java
- Recommender Systems Resource Center
- Recommender System Algorithms
- Emmanouil Vozalis and Konstantinos G. Margaritis, Analysis of Recommender Systems’ Algorithms (2003)
- Coen Stevens, Lead Recommendations Engineer at Wakoopa, How to build a recommender system? (2008?)
- Michael H. Pryor, The Effects of Singular Value Decomposition on Collaborative Filtering Honors Thesis, 1998
- SVD Recommendation System in Ruby (2007) using linalg
- Using Linear Algebra for Intelligent Information Retrieval (1995)
- Collaborative Filtering Resources
Useful Fellow Contest Submissions
- jeremybarnes: an excellent summary of contest submission strategy, algorithms, and information processing on the dataset
- cardmagic: a terse but informative summary of initial strategy
License
(The MIT License)
Copyright © 2009 Jason Brownlee
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.