GitHub Contest: August 2009
A recommendation system for the github-contest by Jason Brownlee
Summary
This project provides the code used to generate a solution (results.txt) to the GitHub contest of August 2009. I participated in the contest as a learning exercise in both programming languages (Ruby and then Objective-C) and technology (collaborative filtering).
Details
The objective is to recommend repositories to users. Specifically, to predict which repositories to recommend to a user given existing known user-repository relationships.
Data
The following is a summary of the data provided in the competition; see /data/readme.txt for more detail.
- data.txt 440,237 known user-to-repository relationships
- repos.txt 120,867 known repositories (id, name, date created, parent repository)
- lang.txt language information for 73,496 repositories (id,lang;total_lines,…)
- test.txt 4,788 users from data.txt, each of which requires a prediction of (up to) 10 repositories
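For reference, here is a minimal Ruby loading sketch for these files. The field layouts follow the descriptions above, but the exact delimiters are assumed here and should be checked against data/readme.txt; this is not the project's actual loader.

```ruby
# Rough loading sketch; field layouts follow the descriptions above, but
# the delimiters are assumptions (data/readme.txt is authoritative).
watches = Hash.new { |h, k| h[k] = [] }   # user_id => [repo_id, ...]
File.foreach('data/data.txt') do |line|
  user, repo = line.strip.split(':')
  watches[user] << repo
end

repos = {}                                # repo_id => {owner:, name:, parent:}
File.foreach('data/repos.txt') do |line|
  id, rest = line.strip.split(':', 2)
  path, _created, parent = rest.split(',')
  owner, name = path.split('/')
  repos[id] = { owner: owner, name: name, parent: parent }
end

test_users = File.readlines('data/test.txt').map(&:strip)
```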
Observations
- All repos with parents have those parents in the dataset
- There are 5 repo language definitions for repos that are not in the main repo set (and therefore not currently watched by any users; perhaps they were removed in the preparation of the test set)
- There is no score for assigning these repos to any user, so I removed them (I tested this explicitly)
- All but 2 of the test users are in the main user set (the 2 are unknown or new users; perhaps all of their repos were stripped in the preparation of the test set)
- 41,444 distinct repo owner names
- 71,847 distinct repo names
- 56,521 distinct users
- All 120,867 repos are watched (excluding the additions from the language data)
- 12,064 forked repos (about 10%)
- 81,913 root repos (about 67%)
Strategy
I am an amateur in this domain (I'm a computational intelligence guy), so my strategies were mostly led by intuition, and later informed by strategy discussions from other contest submissions as well as introductory papers on recommender systems and collaborative filtering.
The prediction strategy is as follows:
- Generate lots of predictor indicators
- Generate a candidate set of repos for each test user
- Score candidate repos using weighted indicators (weighted sum)
- Maybe build models for applying the indicators toward prediction (indicator weights)
- Sort scored candidate repos and select the top 10 as user repo predictions
Indicators are at one of three levels:
- Global: probabilistic indicators across all user repo watching
- Group: users are grouped (kNN, or clustered using k-means), then probabilistic indicators are computed over the smaller user group
- User: probabilistic indicators across the repos a user watches
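To make the weighted-sum scoring and top-10 selection concrete, here is a rough Ruby sketch. The indicator wiring and weights below are illustrative placeholders, not the tuned indicators in the Objective-C code.

```ruby
# Weighted-sum scoring sketch: each indicator maps (user, repo) to a value
# in [0,1]; scores are combined linearly and the top 10 repos are returned.
def predict(user, candidates, indicators, weights)
  scored = candidates.map do |repo|
    score = indicators.sum { |name, fn| weights[name] * fn.call(user, repo) }
    [repo, score]
  end
  scored.sort_by { |_, score| -score }.first(10).map(&:first)
end

# Illustrative wiring: a single global popularity indicator backed by a
# precomputed repo_id => probability hash (placeholder values only).
global_watch_prob = Hash.new(0.0)
indicators = { global_prob_watch: ->(_user, repo) { global_watch_prob[repo] } }
weights    = { global_prob_watch: 1.0 }
```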
Findings
I focused on user-based collaborative filtering, although I started looking into item-based collaborative filtering, which turned out to be too slow for my old hardware. I also played with building global (stupid, I know) classification models using WEKA (decision trees, J48, etc.), which returned poor results.
I calculated user groups (clusters) using n-nearest neighbors, where the number of overlapping watches was the similarity measure. This resulted in assigning at least one neighbor to 3,989 of the 4,789 test users (800 neighborless). I have considered (but not implemented) less strict relations (repos in the same lineage, repos with the same language composition and size) and other distance metrics (Pearson's, cosine, Euclidean, etc.). I tried a vector distance metric that took vector length into account, which resulted in lower scores over a range of K using the current indicators. Both cluster types have been generated and stored in the /data directory.
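A brute-force sketch of the overlap-based neighbor calculation (illustrative only; the clusters actually used were precomputed and stored in /data):

```ruby
# k nearest neighbors by watch overlap: similarity is the number of repos
# both users watch. Brute force over all users, for illustration only.
def neighbors(user, watches, k)
  mine = watches[user]
  scored = (watches.keys - [user]).map { |u| [u, (mine & watches[u]).size] }
  scored.reject { |_, overlap| overlap.zero? }
        .sort_by { |_, overlap| -overlap }
        .first(k)
        .map(&:first)
end
```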
Candidate selection involved: the top 20 repos by watch count and by fork count, all repos in the parent and child trees of the user's watched repos, all repos with the same name and the same owner name as watched repos, and all repos in the user's cluster. Increasing the top-n cut-off gave a fractional score increase.
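Roughly, candidate construction looks like the following sketch; the lookup tables (top repos, lineage, name/owner indexes, cluster repos) are assumed to be precomputed, and their names here are placeholders.

```ruby
# Candidate set sketch: union of popular repos, the lineage of each watched
# repo, repos sharing a name or owner with watched repos, and the cluster's
# repos. Lookup tables are hashes defaulting to empty arrays, precomputed.
def candidates(watched, top_repos, lineage, by_name, by_owner, cluster)
  set = top_repos.dup
  watched.each do |repo|
    set |= lineage[repo]    # parent and child trees of the watched repo
    set |= by_name[repo]    # repos with the same name
    set |= by_owner[repo]   # repos with the same owner name
  end
  set | cluster             # repos watched within the user's cluster
end
```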
User Collaborative Filtering: Single Indicator Results (user kNN = 5, tested against the whole dataset on the server; each indicator weight is 1.0, of course)
- global_prob_watch (873 correct or 18.23%)
- global_prob_watch_forked (487 correct or 10.17%)
- global_prob_watch_root (359 correct or 7.497%)
- local_prob_watch (996 correct or 20.80%)
- local_prob_watch_name (1093 correct or 22.82%)
- local_prob_watch_owner (1038 correct or 21.67%)
- user_prob_watch_forked (461 correct or 9.628%)
- user_prob_watch_root (349 correct or 7.289%)
- user_prob_watch_owner (1061 correct or 22.15%)
- user_prob_watch_name (1030 correct or 21.51%)
- user_prob_watch_language (614 correct or 12.82%)
Popularity-based indicators (global and user cluster) seem useful, as do indicators based on repo name and repo owner name (at the user-cluster and user levels). Repo forking and root status seem to be poor indicators.
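To make a couple of these indicator names concrete, here is the spirit of the simplest ones in a simplified Ruby sketch; these are not the exact formulae used in /objc.

```ruby
# Simplified sketch of two of the indicators above (not the exact /objc
# formulae). watch_counts: repo_id => watcher count; owners: repo_id =>
# owner name; watched: the repos this user watches.
def global_prob_watch(repo, watch_counts, total_watches)
  watch_counts[repo].to_f / total_watches
end

def user_prob_watch_owner(repo, watched, owners)
  same_owner = watched.count { |r| owners[r] == owners[repo] }
  same_owner.to_f / watched.size
end
```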
User Collaborative Filtering: Multiple Indicator Results (tested against the whole dataset on the server; indicator weights are 1.0)
- K=3, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1830 correct or 38.22%)
- K=5, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1857 correct or 38.78%)
- K=10, global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name: (1867 correct or 38.99%)
Using just global/cluster popularity and the user name/owner probabilities, combined linearly, gives a useful result. Increasing K for the user cluster refines the local_prob_watch indicator, slightly improving results.
Given the utility of the repo owner name, and a summary by jeremybarnes of an approach for deducing user names, I deduced user names (repo owners) for those users who were the single watcher of repos. Conflicting names were skipped, and two cut-offs for repo-owner match counts were tested:
- 3+: assigns 166 repos to users, 146 of which are correct (score of 3.049%)
- 2+: assigns 229 repos to users, 204 of which are correct (score of 4.260%) [I use this one, naturally]
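The deduction heuristic, sketched below; the actual cut-off and conflict handling may differ slightly from this simplified version.

```ruby
# Deduce a user's GitHub name: if the user is the sole watcher of enough
# repos and those repos all share one owner, assume the user is that owner.
# watched: repos this user watches; watcher_counts: repo_id => watcher
# count; owners: repo_id => owner name.
def deduce_owner(watched, watcher_counts, owners, min_matches = 2)
  solo = watched.select { |repo| watcher_counts[repo] == 1 }
  names = solo.group_by { |repo| owners[repo] }
  return nil if names.size != 1          # conflicting owner names: skip
  name, matched = names.first
  matched.size >= min_matches ? name : nil
end
```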
Using this algorithm with the previous indicators (global_prob_watch, local_prob_watch, user_prob_watch_owner, user_prob_watch_name), increasing K to 10, and increasing n for the top-ranked forked/watched repos in the candidate set to 50, I get a small boost (1912 correct or 39.93%).
Looking at repos ranked by global repo-name occurrence, the most common names are 'test' and 'dotfiles'. I thought stripping these repos from the candidate set would give a boost, or at least cut down the computation. Unfortunately, filtering out these repos results in a lower score (1906 correct or 39.80%) using the same parameters as before.
I generated some sample datasets using the above probabilistic indicators and fed them through some decision tree classifiers, with poor results. I also tried a series of rank-based indicators fed into a WEKA classifier, also with very poor performance. I have not tested these single indicators against the test set yet; I doubt their results would be good or much different from the probabilistic indicators used before. The rank indicators included:
- normalized_watch_rank
- normalized_fork_rank
- normalized_name_rank
- normalized_owner_rank
- normalized_group_watch_rank
- normalized_group_name_rank
- normalized_group_owner_rank
- normalized_user_name_rank
- normalized_user_owner_rank
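Each of these is essentially a repo's position in a list sorted by some count (watches, forks, name frequency), scaled into (0, 1]; the exact scaling used may differ from this sketch.

```ruby
# Normalized rank sketch: sort repos by a count and map each rank into
# (0, 1], with the highest count mapping to 1.0.
def normalized_ranks(counts)   # counts: repo_id => Integer
  ordered = counts.sort_by { |_, c| c }.map(&:first)
  n = ordered.size.to_f
  ranks = {}
  ordered.each_with_index { |repo, i| ranks[repo] = (i + 1) / n }
  ranks
end
```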
In the dash for the finish, my goal was to top 40%. I achieved this by human-annealing the key indicators. Specifically, I focused on the indicators that give a big boost (global and local popularity, user probabilities for repo name and owner) and hacked in hand-tuned indicators for repos in the user's repo parent hierarchy and for root repos. This took me first to 40%, then to 44%. I managed further improvements (around 48%) with some bug fixes in the probability calculations, hand-tuning of some weight parameters, and a weighted neighborhood probability of occurrence.
To Do
A list of tasks I’d like to tackle, time permitting or on future collaborative filtering projects:
- Dimensionality reduction using Singular Value Decomposition (SVD) – this will give faster (maybe more useful) user and repo similarity calculations
- Called Latent Semantic Analysis / Latent Semantic Indexing in the field of Information Retrieval
- Better user clustering using kmeans or even just an improved distance metric on my kNN
- Decrease similarity strength based on item popularity (I've seen equations for this)
- Use user-distance to weight indicators (a neighborhood popularity indicator, for example)
- Item-based collaborative filtering
- I started clustering repos, but it was going to take forever on my crappy 5-year-old Mac laptop
- Build classification models for each user group or even each user (slow, but might get good results)
- The current external classifier model approach performs very poorly; I'm considering using raw measures as indicators rather than probabilities and ranks (letting the decision tree model the probabilities for me)
- More indicators: brainstorm, look at other people's code, evaluate real users on the GitHub website…
- Optimize weights on indicators for global/cluster/user (gradient descent or even a genetic algorithm)
- Association rules: probabilistic co-occurrences in global/cluster
- I started on this, but the computation was going to take too long; co-occurrences need to be pre-computed or given a fast(er) lookup
- Pay for some Amazon EC2 time to do some serious number crunching; my laptop is not up to the task, and quickly proving out ideas may accelerate progress
- Offline validation testing
Validation Testing
For now I just push to the server (for strategy results) and do cross-validation for my classifiers using WEKA. My machine/strategy is too slow for any kind of automated gradient descent on weights or strategy parameters.
Code
I started with a Ruby implementation (/ruby), although I quickly found that it was simply too slow. I then moved to an Objective-C implementation (/objc). I chose these two languages because I wanted to know more about their capabilities.
To tap into WEKA (written in Java) I used Objective-C's Java bridge. It worked fine as long as I compiled and ran under a 32-bit JDK 1.5. The code is still in there but is not a part of the main execution (it was a test). Strip it out if it causes problems during compilation.
Ignore the Ruby code; the Objective-C code is the main show (the /objc folder). To configure, tweak cfg.h. To build, use 'make' (a makefile is provided). To run, call: ./main
On the futility of the contest
A goal of the contest, as listed by the contest organizers, was to generate an open source corpus of recommendation systems. I believe such efforts toward a 'white label' or 'turn-key' system are futile.
The utility of any data mining or machine learning approach comes down to the specifics of the problem being solved (scale, data, timeliness, etc.). General algorithms and even general systems are awesome, but they are only the starting point. The hard work in any such system is in identifying, selecting, and specializing existing techniques for the domain.
References
My Links
- Competition hacking with Ruby and Objective-C (blog post summarizing my involvement)
- jbrownlee (my) contest submissions
Contest Links
- Github blog, GitHub Contest almost over!
- Github blog, The 2009 GitHub Contest
- 2009 GitHub Contest
- GitHub Contest Leaderboard
- Github Contest Action (recent submission summary)
- Github contest dataset (.zip)
- About the GitHub Contest (finalizing the competition, highlighting strategies and code bases)
Other Contest Links
- Ryan Cox, Lessons Learned from the GitHub Recommender Contest (Aug 14th 2009)
Papers and Links
- Weka: Data Mining Software in Java
- Recommender Systems Resource Center
- Recommender System Algorithms
- Emmanouil Vozalis and Konstantinos G. Margaritis, Analysis of Recommender Systems’ Algorithms (2003)
- Coen Stevens, Lead Recommendations Engineer at Wakoopa, How to build a recommender system? (2008?)
- Michael H. Pryor, The Effects of Singular Value Decomposition on Collaborative Filtering Honors Thesis, 1998
- SVD Recommendation System in Ruby (2007) using linalg
- Using Linear Algebra for Intelligent Information Retrieval (1995)
- Collaborative Filtering Resources
Useful Fellow Contest Submissions
- jeremybarnes: an excellent summary of contest submission strategy, algorithms, and information processing on the dataset
- cardmagic: a terse but informative summary of initial strategy
License
(The MIT License)
Copyright © 2009 Jason Brownlee
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.