talison's ghc09 at master - GitHub
Description:      Github Contest 2009
Public Clone URL: git clone git://github.com/talison/ghc09.git
Your Clone URL:   git clone git@github.com:talison/ghc09.git
README
This is code for the 2009 Github Contest. https://contest.github.com

This project was a nice way for me to learn Python.

+++ Methodology +++

+ Conditional Probability +

My main approach was to try to use conditional probability to suggest candidate repos. I spent the first part of the project working on this approach. It's a simple model: I count how many times repo J was watched together with repo I. Under this model, the probability of watching repo J conditioned on watching repo I is:

    collocations(J,I) / freq(I)

Calculating this for every collocation does not take terribly long, but it uses a lot of memory. Originally I tried pickling the object so I could quickly read it in later, but unpickling it took longer than recalculating it. Either way, the whole matrix ends up in memory, which is inefficient. Ultimately I stored the probabilities in a Tokyo Cabinet B-Tree database. The keys were of the form "I,J" and the values were of the form "collocations,conditional_probability". This works much better because whenever I want to find all the repos related to repo I, I do a simple range query for all keys with the prefix "I,".

The next step was to sort the suggestions that came out of the probability model. Sorting on collocation count would have biased the suggestions toward popular repos. Sorting on probability alone biases toward low-frequency repos (e.g. if repo X is watched by only two people and repo Y is watched by one of them, the conditional probability is a high 0.5 on very little evidence). The best results came from log weighting (thanks joestelmach) the candidates by collocation count, which lets some of the more popular repos bubble up a bit. Using that strategy alone yielded a score in the 23% range, but lots of users had fewer than 10 suggestions.

+ Filling +

I then focused on filling suggestions for users who had too few of them. I computed the ancestry of each repo, so that each repo had a list of its ancestors (parents, grandparents, etc.) and descendants (children, grandchildren, etc.). One thing I noticed was that the repos.txt file had a few inconsistencies: a few forked repos had a creation date earlier than the repo they were forked from. This broke my lineage calculation, so I hand-fixed those entries; the diff is included as repos-fix.diff.

For each user with fewer than 10 suggestions, I add all of the ancestors and descendants. If they still have fewer than 10 suggestions, I take all of their existing suggestions and use the conditional probability database to get more candidates. If the user still has fewer than 10 suggestions, I look for similarly named repos and add those plus anything related to them in the probability matrix. Finally, if there are still not enough candidates, I fall back on the top ten repos. This methodology brought the score up to around 35 or 36%.

+ Blending +

I knew from willbailey that adding all unwatched ancestor repos would give a boost. Rather than write my own algorithm, I used danielharan's blend_unwatched_sources.rb script (https://github.com/danielharan/github_resys/tree/master). This works best with at least 20 candidate suggestions, so I went back and generated results files with 20 candidates. danielharan's blending then brought my score up to 43%. I found there was a slight additional boost when only using the first 5 unwatched candidates.
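The conditional-probability model, the "I,J" key scheme, and the log-weighted ranking described above can be pictured with a short sketch. This is not the ghc.py source: the plain dict below stands in for the Tokyo Cabinet B-Tree, the function names (build_model, candidates_for, rank) are made up for illustration, and the exact log-weighting formula (probability scaled by log of collocation count) is an assumption about what the README means.

    import math
    from collections import defaultdict
    from itertools import combinations

    def build_model(watchers):
        # watchers: dict mapping user id -> set of repo ids that user watches
        freq = defaultdict(int)      # freq[I]: how many users watch repo I
        colloc = defaultdict(int)    # colloc[(I, J)]: users watching both I and J
        for repos in watchers.values():
            for r in repos:
                freq[r] += 1
            for a, b in combinations(sorted(repos), 2):
                colloc[(a, b)] += 1
                colloc[(b, a)] += 1
        # "I,J" -> (collocations, P(J given I)), the same shape as the B-Tree values
        model = {}
        for (i, j), c in colloc.items():
            model["%s,%s" % (i, j)] = (c, float(c) / freq[i])
        return model

    def candidates_for(model, repo_i):
        # stand-in for the Tokyo Cabinet range query over keys prefixed with "I,"
        prefix = "%s," % repo_i
        return dict((key.split(",")[1], val)
                    for key, val in model.items() if key.startswith(prefix))

    def rank(candidates):
        # assumed log weighting: conditional probability times log(1 + collocations),
        # so popular repos can bubble up past low-frequency flukes
        return sorted(candidates.items(),
                      key=lambda item: item[1][1] * math.log(1 + item[1][0]),
                      reverse=True)

The appeal of the "I,J" prefix scheme is that a B-Tree keeps keys sorted, so all candidates for repo I sit in one contiguous key range instead of requiring the whole matrix in memory.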
+++ The Code +++

The code is written in Python; I used a Python 2.6 interpreter. Tokyo Cabinet must be installed. I used these instructions: https://michael.susens-schurter.com/tokyotalk/tokyotalk.html. The simplejson module should also be installed. The code does extensive logging, and some of the data structures contain more data than I actually used.

To run:

    python ghc.py
    ruby blend_unwatched_sources.rb > results.txt

See LICENSE for the license that governs this code.
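For reference, the fallback chain from the Filling section above could be organised roughly like the sketch below. Every name in it (fill, ancestry, related, similarly_named, top_repos, want) is hypothetical; ghc.py may structure this quite differently, and as a simplification the sketch caps each step at the target count, whereas the README says it adds all ancestors and descendants.

    def fill(suggestions, ancestry, related, similarly_named, top_repos, want=10):
        # suggestions:     this user's candidate repo ids, best first (grown in place)
        # ancestry:        repo id -> list of its ancestors and descendants
        # related:         callable returning probability-model candidates for a repo id
        # similarly_named: repo ids whose names resemble the user's watched repos
        # top_repos:       the overall top ten repos, used as the last resort
        seen = set(suggestions)

        def add(repos):
            for r in repos:
                if r not in seen and len(suggestions) < want:
                    suggestions.append(r)
                    seen.add(r)

        # 1. ancestors and descendants of the repos already suggested
        for repo in list(suggestions):
            add(ancestry.get(repo, []))
        # 2. widen through the conditional probability database
        if len(suggestions) < want:
            for repo in list(suggestions):
                add(related(repo))
        # 3. similarly named repos, plus their probability-model relatives
        if len(suggestions) < want:
            for repo in similarly_named:
                add([repo])
                add(related(repo))
        # 4. final fallback: the top ten repos overall
        if len(suggestions) < want:
            add(top_repos)
        return suggestions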