This is a collection of Jupyter
notebooks intended to train the reader on different Apache Spark concepts, from
basic to advanced, by using the R language.
If you are interested in an introduction to some basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications
using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.
Instructions
For this series of notebooks, we have used Jupyter with the IRkernel R kernel. You can find installation instructions for your specific setup here. Also have a look at Andrie de Vries' post Using R with Jupyter Notebooks, which includes instructions for installing Jupyter and IRkernel together.
A good way of using these notebooks is to first clone the repo and then
start Jupyter in pySpark mode. For example,
if we have a standalone Spark installation running on localhost with a
maximum of 6 GB per node assigned to IPython:
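A minimal sketch of such a launch, assuming a Spark 1.x installation where `bin/pyspark` honors the `IPYTHON_OPTS` variable. The master URL, the `IPYTHON_OPTS` value, the default install location, and the function name `start_spark_notebook` are illustrative assumptions, not values taken from this repository:

```shell
# Sketch only: start the notebook server through pyspark against a local
# standalone master. Everything below is a placeholder to adapt:
#   - SPARK_HOME default ($HOME/spark) is a hypothetical install location
#   - the master URL assumes a standalone master on the default port
#   - IPYTHON_OPTS="notebook" assumes a Spark 1.x pyspark launcher
start_spark_notebook() {
  spark_home="${SPARK_HOME:-$HOME/spark}"
  MASTER="spark://127.0.0.1:7077" \
  SPARK_EXECUTOR_MEMORY="6G" \
  IPYTHON_OPTS="notebook" \
  "$spark_home/bin/pyspark" "$@"
}
```

Calling `start_spark_notebook` from the cloned repository directory would then start the server with the notebooks available, provided the assumptions above match your setup.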
Note that the path to the pyspark command depends on your specific
installation. As a requirement, you need to have
Spark installed on
the same machine where you are going to start the IPython notebook server.
For more Spark options see here. As a general rule, an option
described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when
calling IPython/pySpark.
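The naming rule can be sketched as a small shell helper. `to_env_name` is a hypothetical function written for illustration, not part of Spark, and it only encodes the convention as stated here; not every Spark property necessarily has an environment-variable counterpart:

```shell
# Convert a Spark property name to the environment-variable form:
# lowercase letters become uppercase and dots become underscores.
# to_env_name is a hypothetical helper, not a Spark command.
to_env_name() {
  printf '%s\n' "$1" | tr 'a-z.' 'A-Z_'
}

to_env_name spark.executor.memory   # prints SPARK_EXECUTOR_MEMORY
```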
Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million
households are asked detailed questions about who they are and how they live. Many topics are covered, including
ancestry, education, work, transportation, internet use, and residency. You can go directly to
the source
in order to know more about the data and get files for different years, longer periods, individual states, etc.
In any case, the starting up notebook
will download the 2013 data locally for later use with the rest of the notebooks.
The idea of using this dataset came from its recent announcement on Kaggle
as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site while sharing your results with other Kaggle
users. Highly recommended!
Different operations we can use with SparkR and DataFrame objects, such as data selection and filtering, aggregations, and sorting. These are the basis for exploratory data analysis and machine learning.
This repository contains a variety of content: some developed by Jose A. Dianes, and some from third parties. The third-party content is distributed under the license provided by those parties.
The content developed by Jose A. Dianes is distributed under the following license:
Copyright 2016 Jose A Dianes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
About
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks