===========================
====== README =======
===========================
This package contains the code underlying the SCAN model as described in
Frermann and Lapata (2016). The code can be used to (a) create binary files
containing time-stamped input documents (the required input to SCAN); and (b)
to train SCAN models.
This bundle contains three parts:
(1) main.zip
the main functions. Run code from this directory. Also contains example
input and output
(2) scan_lib.zip
the core code underlying SCAN (my code)
(3) other_lib.zip
third-party libraries that are used but are not part of the core Go libraries. Each of these must
be placed in the same directory as scan_lib.zip
To run the code, navigate into your 'main' directory and run
EITHER the pre-compiled binary (should work out of the box)
./dynamic-senses -parameter_file=/path/to/parameters/ [-create_corpus] -store={true,false}
OR the code itself (requires Go to be installed)
go run *.go -parameter_file=/path/to/parameters/ [-create_corpus] -store={true,false}
The command-line parameters are:
- parameter_file
path to a text file containing all parameters (see below)
- create_corpus
an optional parameter. If it is set, a binary corpus will be created; otherwise a model will be trained
- store
indicates whether target word-specific corpora should be stored
To run the code itself (the second option above, i.e., not just the binary), Go (https://golang.org/doc/)
must be installed. I use version go1.7.4 (it might be safest to use the same). All required
packages to run this code should be part of this bundle.
----------------------------------
----- running out of the box -----
----------------------------------
I include examples of all necessary test files:
- a parameters.txt (which has my own hard-coded paths, MUST BE CHANGED)
- a corpus file under main/test_input/corpus.txt
- a file with target words under main/test_input/targets.txt
Change the paths in the parameters.txt to your own paths (to the corpus / targets files),
and run
(a) to create a binary corpus
./dynamic_senses -parameter_file=path/to/parameters.txt -create_corpus -store=true
(b) to train models
./dynamic_senses -parameter_file=path/to/parameters.txt -store=true
---------------------------
--- the parameters file ---
---------------------------
All parameters are specified in a parameters file which must be passed as input to
the program. See the included 'parameters.txt' example file (a rough sketch is also given below). This includes
- paths to underlying text corpora, target word sets, etc
- model parameters (see paper for explanation)
- sampler parameters (number of iterations)
- parameters regarding the time start/end/intervals of interest
- parameters to optionally restrict the minimum number of available documents (to ignore highly
infrequent words) and / or the maximum number of available documents per time interval
(to get manageable-size corpora for very frequent words)
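As a rough, hypothetical sketch of what such a file covers (the key names below are the ones used in this README,
but the key=value layout and every concrete value are my assumptions; the bundled parameters.txt is the
authoritative example):

text_corpus=/path/to/corpus.txt
target_words=/path/to/targets.txt
window_size=5
bin_corpus_store=/path/to/corpus.bin
full_corpus_path=/path/to/corpus.bin
word_corpus_path=/path/to/word_corpora/
output_path=/path/to/output/
start_time=1700
end_time=2009
time_interval=20
kappaF=...
kappaK=...
a0=...
b0=...
num_top=...
iterations=...
min_doc_per_word=500
max_docs_per_slice=...

Values shown as '...' are not stated in this README; take them from the bundled parameters.txt or the paper.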
------------------------------
Creating binary input corpora
------------------------------
Takes a text file, a list of target words, and a 'document length' specification. Outputs
a binary corpus with target-word-specific, time-stamped documents. Document length refers to
the size of the context window considered around the target word, e.g., 5 words.
[The corpus has words mapped to unique IDs, and contains dictionaries mapping from
word strings to IDs and back]
It takes the following parameters (all specified in parameters.txt)
- text_corpus
path to a text file in which each line contains a number indicating
a year (of origin), followed by a \tab\ character, followed by the corresponding text
from that year of origin. The same year can be listed multiple times (a small parsing sketch is given after this list):
YEAR \tab\ text ....
YEAR \tab\ text ....
....
- target_words
path to a text file containing all target words of interest
whose meaning should be tracked, one word per line.
- window_size
the size of the context window to consider around the target word (i.e., the
'document length' explained above)
- bin_corpus_store
path to the location where the binary corpus will be stored
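To make the input format concrete, here is a minimal Go sketch that reads lines in the YEAR \tab\ text format
and cuts out a context window around a target word. It is an illustration only, not the code in scan_lib.zip:
the function names are made up, and the assumption that window_size words are taken on each side of the target
(rather than in total) should be checked against the actual implementation.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCorpusLine splits one "YEAR<TAB>text" line into its year and text.
// (Hypothetical helper, not taken from scan_lib.zip.)
func parseCorpusLine(line string) (int, string, error) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 {
		return 0, "", fmt.Errorf("not in YEAR<TAB>text format: %q", line)
	}
	year, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return 0, "", err
	}
	return year, parts[1], nil
}

// contextWindows returns, for every occurrence of target, up to size tokens
// on each side of it (assumed interpretation of window_size).
func contextWindows(text, target string, size int) [][]string {
	tokens := strings.Fields(text)
	var docs [][]string
	for i, tok := range tokens {
		if tok != target {
			continue
		}
		lo, hi := i-size, i+size+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(tokens) {
			hi = len(tokens)
		}
		docs = append(docs, append([]string(nil), tokens[lo:hi]...))
	}
	return docs
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // corpus lines can be long
	for scanner.Scan() {
		year, text, err := parseCorpusLine(scanner.Text())
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		for _, doc := range contextWindows(text, "power", 5) {
			fmt.Println(year, strings.Join(doc, " "))
		}
	}
}

Piping main/test_input/corpus.txt through this sketch would print one line per extracted context window for
the target 'power', tagged with its year.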
------------------------------
Training a SCAN model
------------------------------
Once we have a binary corpus of time-stamped documents as explained above (full_corpus_path), we can train SCAN
models that track meaning change of individual target words. To do this we
(1) extract a target word-specific corpus from the underlying binary corpus. It contains only time-tagged
documents with the specified target word. It converts the absolute times in the underlying corpus (e.g., 1764, 1993, ...)
to time intervals (0, 1, ..., T) based on the start_time, end_time and time_interval parameters (see below).
It takes the following parameters (all specified in parameters.txt):
- full_corpus_path
path to the underlying binary corpus
- start_time
the earliest time stamp in the underlying corpus to be considered
- end_time
the latest time stamp in the underlying corpus to be considered
- time_interval
the length of the time intervals into which the span [start_time, end_time] is to be split,
e.g., if start_time=1700, end_time=2000, time_interval=10 then documents are binned into 10-year bins
and all documents from before 1700 and after 2000 are ignored. Documents from 1700-1709 are assigned to
bin 0, documents from 1710-1719 are assigned to bin 1, and so on (see the sketch after this list)
- word_corpus_path
path to a location where word-specific corpora are stored.
The filename reflects the choice of start_time / end_time / time_interval,
e.g., corpus_s1700_e2009_i10.bin for the example above
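As a minimal sketch of the binning rule just described (illustrative only; timeBin is a hypothetical helper,
not a function from scan_lib.zip, and whether end_time itself is still included is an assumption):

// timeBin maps an absolute year to a time-interval index, or ok=false if the
// year falls outside [startTime, endTime] and the document is ignored.
func timeBin(year, startTime, endTime, timeInterval int) (bin int, ok bool) {
	if year < startTime || year > endTime {
		return 0, false
	}
	return (year - startTime) / timeInterval, true
}

For instance, with start_time=1700, end_time=2009 and time_interval=20 (the settings used for the included
example output), a document from 1764 lands in bin (1764-1700)/20 = 3, the 1760-1779 interval.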
(2) pass this corpus to the model and train the model with MCMC inference. It creates a model and human-readable output:
- model.bin the trained model binary and
- output.dat human-readable model output, namely for each time slice its distribution over senses,
and for each sense in each time slice, its distribution over words (as the set of most
highly associated words).
It takes the following parameters (all specified in parameters.txt)
- output_path
directory in which files output.dat and model.bin are to be stored
If the output directory and these files already exist (from a previous run), the old files are moved
to output_old.dat and model_old.bin
- kappaF, kappaK, a0, b0, num_top
model parameters; check paper (or ask me!) for explanations
- iterations
number of training iterations
- min_doc_per_word
the model doesn't work well if only very few documents are available for a target word. You may want to
only learn models for target words that occur at least N (~ 500?) times in the data
- max_docs_per_slice
some words occur extremely often. To get a manageable-size input corpus you can restrict the number of documents
to consider per time interval
-------------------------------
Understanding the output
-------------------------------
The program creates a human-readable output file for each target word in [output_path]/word/output.dat
The included directories under main/test_input/output/ contain output for models trained on the corpus in
main/test_input/corpus.txt for the target words in main/test_input/targets.txt
Models were trained
-- on the corpus.txt file (containing text from between 1700 and 2009)
-- with K=8 senses per word
-- with start_time=1700, end_time=2009, time_interval=20 --> obtaining 16 20-year time intervals in total
In each output.dat file:
-- K indexes the sense ID
-- T indexes the time slice ID
-- After each sense / time: The top 10 words with highest probability under each sense are listed
-- The bottom line of each block (p(w|s)....) sums over all senses, i.e., shows the most highly associated words
for a particular time, ignoring sense information.
The file output.dat contains the same information in two ways.
*** per type ***
First, I list *by sense* the representation of the
same sense (k=0...K) for each time slice. The first number indicates the sense's prevalence at that
particular time (as a probability, between 0 and 1). Look for example at main/example_input/output/power/output.dat.
We can see that sense 2 (the block with K=2) seems to relate to the 'electricity' sense of power (it has
highly associated words 'battery', 'plant', etc., especially towards later times, and its prevalence increases
towards later time slices).
*** per time ***
This lists the senses associated with each time interval (the content of the lines is the same as above). E.g.,
the senses associated with T=0 (the first block under 'per_time') show that sense K=7 has high probability
and sense K=2 (the 'electricity' sense of power, including e.g., the word 'dynamo') has low probability. Sense K=7 refers to
the 'mental' power.
Output for the targets 'battery' and 'transport' is also included.