Rosetta Code · twitter/scalding Wiki

samthebest edited this page Jun 26, 2014 · 27 revisions
A collection of MapReduce tasks translated (from Pig, Hive, MapReduce streaming, etc.) into Scalding. For fully runnable code, see the repository here.
Word Count

Hadoop Streaming (Ruby):

```ruby
# Emit (word, count) pairs.
def mapper
  STDIN.each_line do |line|
    line.split.each do |word|
      puts [word, 1].join("\t")
    end
  end
end

# Aggregate all (word, count) pairs for a particular word.
#
# In Hadoop Streaming (unlike standard Hadoop), the reducer receives
# rows from the mapper *one at a time*, though the rows are guaranteed
# to be sorted by key (and every row associated with a particular key
# will be sent to the same reducer).
def reducer
  curr_word = nil
  curr_count = 0
  STDIN.each_line do |line|
    word, count = line.strip.split("\t")
    if word != curr_word
      # Flush the previous key, skipping the initial nil key.
      puts [curr_word, curr_count].join("\t") unless curr_word.nil?
      curr_word = word
      curr_count = 0
    end
    curr_count += count.to_i
  end
  puts [curr_word, curr_count].join("\t") unless curr_word.nil?
end
```
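The mapper/reducer pair above can be sanity-checked without a cluster. The following is a minimal Python sketch (not part of the original wiki; all names are illustrative) that simulates the `map | sort | reduce` pipeline Hadoop Streaming runs, using `sorted` plus `groupby` to reproduce the guarantee that all rows for a key arrive contiguously:

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) pairs, one per token, like the streaming mapper above.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop Streaming delivers rows sorted by key; sorted() + groupby
    # mimics the "all rows for a key arrive contiguously" guarantee.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["hello world", "hello scalding"])))
```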
Hive (the tokenizer is a Python 2 script invoked via TRANSFORM):

```python
# tokenizer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print word
```

```sql
CREATE TABLE tweets (text STRING);
LOAD DATA LOCAL INPATH 'tweets.tsv' OVERWRITE INTO TABLE tweets;

SELECT word, COUNT(*) AS count
FROM (
  SELECT TRANSFORM(text) USING 'python tokenizer.py' AS word
  FROM tweets
) t
GROUP BY word;
```
Pig:

```pig
tweets = LOAD 'tweets.tsv' AS (text:chararray);
words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
word_counts = FOREACH word_groups GENERATE group AS word, COUNT(words) AS count;
STORE word_counts INTO 'word_counts.tsv';
```
Cascalog:

```clojure
(cascalog.repl/bootstrap)

(?<- (hfs-textline "word_counts.tsv") [?word ?count]
     ((hfs-textline "tweets.tsv") ?text)
     ((mapcatfn [text] (.split text "\\s+")) ?text :> ?word)
     (c/count ?count))
```
Scalding:

```scala
import com.twitter.scalding._

class ScaldingTestJob(args: Args) extends Job(args) {
  Tsv("tweets.tsv", 'text)
    .flatMap('text -> 'word) { text: String => text.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv("word_counts.tsv"))
}
```
Distributed Grep

Hadoop Streaming (Ruby):

```ruby
PATTERN = /.*hello.*/

# Emit lines that match the pattern.
def mapper
  STDIN.each_line do |line|
    puts line if line =~ PATTERN
  end
end

# Identity reducer.
def reducer
  STDIN.each_line do |line|
    puts line
  end
end
```
Pig:

```pig
%declare PATTERN '.*hello.*';

tweets = LOAD 'tweets.tsv' AS (text:chararray);
results = FILTER tweets BY (text MATCHES '$PATTERN');
```
Cascalog:

```clojure
(def pattern #".*hello.*")

(deffilterop matches-pattern? [text pattern]
  (re-matches pattern text))

(defn distributed-grep [input pattern]
  (<- [?textline]
      (input ?textline)
      (matches-pattern? ?textline pattern)))

(?- (stdout) (distributed-grep (hfs-textline "tweets.tsv") pattern))
```
Scalding:

```scala
val Pattern = ".*hello.*"

Tsv("tweets.tsv", 'text).filter('text) { text: String => text.matches(Pattern) }
```
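Since the grep reducer is the identity, the whole job reduces to a map-side filter. Here is a quick local stand-in in Python (illustrative only; the names are not from the wiki) showing the same filter logic:

```python
import re

# Same pattern the streaming job above declares.
PATTERN = re.compile(r".*hello.*")

def grep_mapper(lines):
    # Emit matching lines unchanged; the reducer would be the identity.
    return [line for line in lines if PATTERN.match(line)]

matches = grep_mapper(["say hello", "goodbye", "hello again"])
```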
Inverted Index

Hadoop Streaming (Ruby):

```ruby
# Emit (word, tweet_id) pairs.
def mapper
  STDIN.each_line do |line|
    tweet_id, text = line.strip.split("\t")
    text.split.each do |word|
      puts [word, tweet_id].join("\t")
    end
  end
end

# Aggregate all (word, tweet_id) pairs for a particular word.
#
# In Hadoop Streaming (unlike standard Hadoop), the reducer receives
# rows from the mapper *one at a time*, though the rows are guaranteed
# to be sorted by key (and every row associated with a particular key
# will be sent to the same reducer).
def reducer
  curr_word = nil
  curr_inv_index = []
  STDIN.each_line do |line|
    word, tweet_id = line.strip.split("\t")
    if word != curr_word
      # New key: flush the index for the previous word (skip the nil key).
      puts [curr_word, curr_inv_index.join(",")].join("\t") unless curr_word.nil?
      curr_word = word
      curr_inv_index = []
    end
    curr_inv_index << tweet_id
  end
  puts [curr_word, curr_inv_index.join(",")].join("\t") unless curr_word.nil?
end
```
Pig:

```pig
tweets = LOAD 'tweets.tsv' AS (tweet_id:int, text:chararray);
words = FOREACH tweets GENERATE tweet_id, FLATTEN(TOKENIZE(text)) AS word;
word_groups = GROUP words BY word;
inverted_index = FOREACH word_groups GENERATE group AS word, words.tweet_id;
```
Cascalog:

```clojure
;; define the data
(def index [
  [0 "Hello World"]
  [101 "The quick brown fox jumps over the lazy dog"]
  [42 "Answer to the Ultimate Question of Life, the Universe, and Everything"]
])

;; the tokenize function
(defmapcatop tokenize [text]
  (seq (.split text "\\s+")))

;; ensure inverted index is distinct per word
(defbufferop distinct-vals [tuples]
  (list (set (map first tuples))))

;; run the query on data
(?<- (stdout) [?word ?ids]
     (index ?id ?text)
     (tokenize ?text :> ?word)
     (distinct-vals ?id :> ?ids))
```
Scalding:

```scala
val invertedIndex =
  Tsv("tweets.tsv", ('id, 'text))
    .flatMap(('id, 'text) -> ('word, 'tweetId)) { pair: (Long, String) =>
      val (tweetId, text) = pair
      text.split("\\s+").map { word => (word, tweetId) }
    }
    .groupBy('word) { _.toList[Long]('tweetId -> 'tweetIds) }
```
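The core of every version above is "group (word, tweet_id) pairs by word, collecting distinct ids". As a local sanity check, here is a small Python sketch of that logic (not from the wiki; names and sample data are illustrative):

```python
from collections import defaultdict

def inverted_index(tweets):
    # tweets: iterable of (tweet_id, text) pairs.
    index = defaultdict(set)  # a set keeps the ids distinct per word
    for tweet_id, text in tweets:
        for word in text.split():
            index[word].add(tweet_id)
    # Sort ids so the output is deterministic.
    return {word: sorted(ids) for word, ids in index.items()}

idx = inverted_index([(1, "hello world"), (2, "hello again")])
```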