talison's ghc09 at master - GitHub
Description:      Github Contest 2009
Public Clone URL: git clone git://github.com/talison/ghc09.git
Your Clone URL:   git clone git@github.com:talison/ghc09.git
README
This is code for the 2009 Github Contest. https://contest.github.com

This project was a nice way for me to learn Python.

+++ Methodology +++

+ Conditional Probability +

My main approach was to try to use conditional probability to suggest candidate repos. I spent the first part of the project working on this approach. It's a simple model: I count how many times repo J was watched together with repo I. Under this model, the probability of watching repo J conditioned on watching repo I is:

    collocations(J,I) / freq(I)

Calculating this for every collocation does not take terribly long, but it uses a lot of memory. Originally I tried pickling the object so I could quickly read it in later, but unpickling it took longer than recalculating it. Either way, the whole matrix ends up in memory, which is inefficient. Ultimately I stored the probabilities in a Tokyo Cabinet B-Tree database. The keys were of the form "I,J" and the values were of the form "collocations,conditional_probability". This works much better because whenever I want to find all the repos related to repo I, I do a simple range query for all keys with the prefix "I,".

The next step was to sort the suggestions that came out of the probability model. Sorting on collocation count would have biased the suggestions toward popular repos. Sorting on probability alone biases toward low-frequency repos (e.g. if repo X is watched by only two people and repo Y is watched by one of them, the conditional probability is a high 0.5 on very little evidence). The best results came from log weighting (thanks joestelmach) the candidates by collocation count, which lets some of the more popular repos bubble up a bit. Using that strategy alone yielded a score in the 23% range, but lots of users had fewer than 10 suggestions.

+ Filling +

I then focused on filling suggestions for users who had too few of them. I computed the ancestry of each repo, so that each repo had a list of its ancestors (parents, grandparents, etc.) and descendants (children, grandchildren, etc.). One thing I noticed was that the repos.txt file had a few inconsistencies: a few forked repos had a creation date earlier than the repo they were forked from. This broke my lineage calculation, so I hand-fixed those entries; the diff is included as repos-fix.diff.

For each user with fewer than 10 suggestions, I add all of the ancestors and descendants. If they still have fewer than 10 suggestions, I take all of their existing suggestions and use the conditional probability database to get more candidates. If the user still has fewer than 10 suggestions, I look for similarly named repos and add those plus anything related to them in the probability matrix. Finally, if there are still not enough candidates, I fall back on the top ten repos. This methodology brought the score up to around 35 or 36%.

+ Blending +

I knew from willbailey that adding all unwatched ancestor repos would give a boost. Rather than write my own algorithm, I used danielharan's blend_unwatched_sources.rb script (https://github.com/danielharan/github_resys/tree/master). This works best with at least 20 candidate suggestions, so I went back and generated results files with 20 candidates. danielharan's blending then brought my score up to 43%. I found there was a slight additional boost when only using the first 5 unwatched candidates.
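The conditional-probability model, the "I,J" key scheme, and the log-weighted ranking described above can be pictured with a short sketch. This is not the ghc.py source: the plain dict below stands in for the Tokyo Cabinet B-Tree, the function names (build_model, candidates_for, rank) are made up for illustration, and the exact log-weighting formula (probability scaled by log of collocation count) is an assumption about what the README means.

    import math
    from collections import defaultdict
    from itertools import combinations

    def build_model(watchers):
        # watchers: dict mapping user id -> set of repo ids that user watches
        freq = defaultdict(int)      # freq[I]: how many users watch repo I
        colloc = defaultdict(int)    # colloc[(I, J)]: users watching both I and J
        for repos in watchers.values():
            for r in repos:
                freq[r] += 1
            for a, b in combinations(sorted(repos), 2):
                colloc[(a, b)] += 1
                colloc[(b, a)] += 1
        # "I,J" -> (collocations, P(J given I)), the same shape as the B-Tree values
        model = {}
        for (i, j), c in colloc.items():
            model["%s,%s" % (i, j)] = (c, float(c) / freq[i])
        return model

    def candidates_for(model, repo_i):
        # stand-in for the Tokyo Cabinet range query over keys prefixed with "I,"
        prefix = "%s," % repo_i
        return dict((key.split(",")[1], val)
                    for key, val in model.items() if key.startswith(prefix))

    def rank(candidates):
        # assumed log weighting: conditional probability times log(1 + collocations),
        # so popular repos can bubble up past low-frequency flukes
        return sorted(candidates.items(),
                      key=lambda item: item[1][1] * math.log(1 + item[1][0]),
                      reverse=True)

The appeal of the "I,J" prefix scheme is that a B-Tree keeps keys sorted, so all candidates for repo I sit in one contiguous key range instead of requiring the whole matrix in memory.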
+++ The Code +++

The code is written in Python; I used a Python 2.6 interpreter. Tokyo Cabinet must be installed. I used these instructions: https://michael.susens-schurter.com/tokyotalk/tokyotalk.html. The simplejson module should also be installed. The code does extensive logging, and some of the data structures contain more data than I actually used.

To run:

    python ghc.py
    ruby blend_unwatched_sources.rb > results.txt

See LICENSE for the license that governs this code.
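For reference, the fallback chain from the Filling section above could be organised roughly like the sketch below. Every name in it (fill, ancestry, related, similarly_named, top_repos, want) is hypothetical; ghc.py may structure this quite differently, and as a simplification the sketch caps each step at the target count, whereas the README says it adds all ancestors and descendants.

    def fill(suggestions, ancestry, related, similarly_named, top_repos, want=10):
        # suggestions:     this user's candidate repo ids, best first (grown in place)
        # ancestry:        repo id -> list of its ancestors and descendants
        # related:         callable returning probability-model candidates for a repo id
        # similarly_named: repo ids whose names resemble the user's watched repos
        # top_repos:       the overall top ten repos, used as the last resort
        seen = set(suggestions)

        def add(repos):
            for r in repos:
                if r not in seen and len(suggestions) < want:
                    suggestions.append(r)
                    seen.add(r)

        # 1. ancestors and descendants of the repos already suggested
        for repo in list(suggestions):
            add(ancestry.get(repo, []))
        # 2. widen through the conditional probability database
        if len(suggestions) < want:
            for repo in list(suggestions):
                add(related(repo))
        # 3. similarly named repos, plus their probability-model relatives
        if len(suggestions) < want:
            for repo in similarly_named:
                add([repo])
                add(related(repo))
        # 4. final fallback: the top ten repos overall
        if len(suggestions) < want:
            add(top_repos)
        return suggestions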