GitHub Contest Entry
:Author: Mark Lodato
:Date: August, 2009
Introduction
My algorithm is very simple, and I was surprised how well it did. One evening I thought of the simplest thing I could (a naive and partial implementation of the following), and I got 20% on my first shot, half the leader's score at the time. After some more tweaking, my second attempt scored 30%, so I figured I'd continue down this path. My program is nowhere near as sophisticated as other entries, but it is really simple and runs reasonably fast - on one CPU of my Core 2 Duo 6300, it takes about a minute to train and makes about 100 suggestions/sec.
The basic idea is to make a graph of all the repositories, with one node per repository and with edge weight A->B equal to the fraction of users watching A who are also watching B. (If A is forked from B, add 1 to that weight. Let A->A = 1.) Then, normalize the weights so that the sum of all the edges going out of each node is 1. (That is, divide each weight A->B by A->X0 + A->X1 + ... + A->Xn.) This results in a Markov chain: each edge weight A->B gives the probability of transitioning from repository A to repository B. Now, to make a suggestion for user U, assign equal probability to all repositories being watched by U, step the Markov chain once (that is, perform a matrix multiply), and list the 10 highest-probability nodes, excluding those in the original set.
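As a made-up numeric illustration: if 4 users watch A and 3 of them also watch B, the raw weights are A->B = 3/4 and A->A = 1. If those were A's only outgoing edges, normalizing by their sum 7/4 would give transition probabilities A->A = 4/7 and A->B = 3/7.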
Rundown
The following is a pseudo-code description using pseudo-NumPy syntax (let ~x~ denote matrix multiplication).
init()
- data = load('data.txt')
- forked = load_forked_from('repos.txt')
prepare()
- common_watchers = data.transpose() ~x~ data
- r2r = common_watchers.copy()
- for child, parent in forked: r2r[child,parent] += r2r[child,child]
- r2r = divide each element of r2r by the row sum
- u2r = divide each element of data by the row sum
suggest(user)
- begin = u2r[user]
- state = begin ~x~ r2r
- state[begin != 0] = 0
- return indices of 10 highest non-zero entries of state
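The pseudo-code above maps almost line-for-line onto NumPy. Here is a minimal runnable sketch of it; the watch matrix and fork list are a made-up toy dataset, the matrices are dense rather than sparse (the actual entry works on sparse matrices, via sparse.py, for speed), and the `suggest(user, k)` signature is my own addition.

```python
import numpy as np

# Toy watch matrix: data[u, r] = 1 if user u watches repository r.
# (The real entry builds this from data.txt as a sparse matrix.)
data = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],   # user 3 watches only repository 3
], dtype=float)

forked = [(3, 2)]   # (child, parent): repository 3 is a fork of repository 2

# common_watchers[a, b] = number of users watching both a and b.
common_watchers = data.transpose() @ data

# Fork bonus: credit all of a child's watchers to its parent as well.
r2r = common_watchers.copy()
for child, parent in forked:
    r2r[child, parent] += r2r[child, child]

# Row-normalize (every row is nonzero in the toy data) so that each row
# is a probability distribution: a one-step Markov chain over repositories.
r2r = r2r / r2r.sum(axis=1, keepdims=True)
u2r = data / data.sum(axis=1, keepdims=True)

def suggest(user, k=10):
    begin = u2r[user]            # uniform over the user's watched repos
    state = begin @ r2r          # step the Markov chain once
    state[begin != 0] = 0        # never suggest an already-watched repo
    order = np.argsort(state)[::-1]
    return [int(r) for r in order[:k] if state[r] > 0]

print(suggest(0))   # -> [2]
```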
Base Algorithm
- To start, load data.txt from disk into a sparse matrix D, where D[u,r] is 1 if user u is watching repository r and 0 otherwise. [self.data]
- Compute W[a,b], the number of users watching both repositories a and b. This is simply D' * D. [self.common_watchers]
- Compute P[a,b], the probability of transitioning from repository a to repository b in our Markov chain. (P is right stochastic.) This is just dividing each row of W by its row sum. [self.r2r]
- Compute S[u,r], the probability of selecting repository r at random from those being watched by user u. This is just dividing each row of D by the row sum. [self.u2r]
- For each user u for which we wish to make suggestions:
- Let state = S[u], which is 1/n for each of the n repositories user u is watching, 0 otherwise.
- Step the Markov chain once by computing next = state * P.
- Set next[r] = 0 for each repository r that the user u is watching. (This is so we don't suggest something the user is already watching.)
- Return the top 10 highest values.
Adding Fork Information
It turns out that a lot of repositories are watched by only one person: the owner. But many of these are forks, and the owner of a child repository is likely to watch the parent. So, between steps 2 and 3, if p is the parent of c, add the number of users watching c (W[c,c]) to W[c,p]. This resulted in a huge improvement.
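To see the effect on a concrete row: suppose repository c is a fork of p and is watched only by its owner, who watches nothing else. Before the adjustment, row c of W is all zero except W[c,c] = 1, so after normalization all of c's probability mass stays on c. After adding W[c,c] to W[c,p], the row has W[c,c] = W[c,p] = 1, and normalization gives P[c,c] = P[c,p] = 1/2: one step of the chain now sends half of c's mass to its parent.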
Fallback
If the above returns n < 10 results, append the 10 - n most popular repositories. This did not seem to help at all.
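Continuing the sketch above, the fallback might look like the following; reading "most popular" as "most watched overall" (the column sums of data) is my assumption.

```python
def suggest_with_fallback(user, k=10):
    picks = suggest(user, k)
    if len(picks) < k:
        # "Most popular" read as most-watched overall: column sums of data.
        for r in np.argsort(data.sum(axis=0))[::-1]:
            if len(picks) == k:
                break
            if r not in picks and data[user, r] == 0:
                picks.append(int(r))
    return picks
```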
Failed Ideas
The following seem like they should help, but when I tried them, I got worse results. Perhaps I did it wrong.
- Followers: It seems like if user i is following user j, then user j's repositories should perhaps be weighted higher.
- Multiple steps: I would think that stepping multiple times, weighting each step n in some way (I tried exp(-n)) would give better results, but this worked worse for me.
- Fork chains: It seemed to me that it would be useful to relate a repository to all other repositories in its fork graph, not just its parent. I tried doing something like this, but I got worse results.
When giving suggestions for a particular user, I tried giving weights to every other user based on how much overlap there was between the repositories being watched, the thought being that I want to find other users that have similar interests. From here, for each user u, I added u's weight to each repository he was watching, and then I took the top 10 repos. This worked much worse than the above algorithm. But maybe there is promise here...
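The paragraph above leaves the user weighting unspecified; one minimal reading, which uses the raw overlap counts as the weights and reuses data from the earlier sketch, might look like this:

```python
def suggest_by_similar_users(user, k=10):
    # overlap[v] = number of repositories watched by both v and user.
    overlap = data @ data[user]
    overlap[user] = 0                 # ignore the user himself
    # Every user votes for everything he watches, weighted by his overlap.
    scores = overlap @ data
    scores[data[user] != 0] = 0       # don't suggest already-watched repos
    order = np.argsort(scores)[::-1]
    return [int(r) for r in order[:k] if scores[r] > 0]
```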
Other Ideas
If a user is only watching a single repository, and other users are watching that repository, I think my algorithm should be pretty good, since I'm calculating Pr(r|w), the probability of watching r given that the user is already watching w. But our users are watching multiple repositories, and I'm basically computing Pr(r|w0) + Pr(r|w1) + .... What I really want to find out is Pr(r|w0 w1 w2 ...), which is not the same, since these are not independent. If we had tons of data, then we could just estimate this empirically. But there is almost zero chance that any other users are watching exactly w0, w1, w2, etc. But maybe we can use pairwise probabilities somehow, as in Pr(r|wi wj) for all pairs i, j?
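One concrete reading of the pairwise idea (my sketch, not something the entry implements): estimate Pr(r|wi wj) empirically as the fraction of users watching both wi and wj who also watch r, and sum those scores over all pairs. Continuing the earlier sketch:

```python
from itertools import combinations

def pairwise_scores(user):
    watched = np.flatnonzero(data[user])
    scores = np.zeros(data.shape[1])
    for wi, wj in combinations(watched, 2):
        both = data[:, wi] * data[:, wj]   # users watching both wi and wj
        n = both.sum()
        if n > 0:
            scores += (both @ data) / n    # empirical Pr(r | wi, wj)
    scores[data[user] != 0] = 0            # exclude already-watched repos
    return scores
```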
Somehow tie in information about the owner of each repository. For example, treat repositories owned by the user differently from repositories owned by other people.