Data Science With Python

Data Science With Python Mosky

Data Science
➤ = Extract knowledge or insights from data.
➤ Data Science ⊃
  ➤ Visualization
  ➤ Statistics
  ➤ Machine Learning
  ➤ Big Data
  ➤ Etc.
➤ ≈ Data Mining

Statistics vs. Machine Learning
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
➤ Machine Learning ⊃ Deep Learning
➤ The models may be the same, but the focuses are different.
➤ Good predictions usually need good inferences on the dataset.

Science, Analysis, Scientist, and Engineering
➤ Data Engineering / Data Engineer
  ➤ Prepare the data infrastructure so that others can work with the data.
➤ Data Analysis / Data Analyst
  ➤ Analyze to support the company's decisions.
➤ Data Scientist
  ➤ Create software to optimize the company's operations.

Mosky
➤ Backend Lead at Pinkoi.
➤ Has spoken at: PyCons in TW, JP, SG, HK, KR, MY, COSCUPs, and TEDx, etc.
➤ Has spent countless hours teaching Python.
➤ Author of Python packages: ZIPCodeTW, etc.
➤ https://mosky.tw/

Outline
1. Exploratory (EDA, Exploratory Data Analysis)
  ➤ Correlation Analysis, PCA, FA, etc.
2. Inference (Statistical Inference)
  ➤ Hypothesis Testing, OLS, Logit, etc.
3. Preprocessing
  ➤ By pandas, scikit-learn, etc.
4. Prediction (Machine Learning Prediction)
  ➤ SVM, Trees, KNN, K-Means, etc.
5. Models of Models
  ➤ Cross-Validation & Pipeline, Model Development, etc.

PDF & Notebooks
➤ The PDF and notebooks are available here:
  ➤ https://github.com/moskytw/data-science-with-python
➤ A good notebook reader:
  ➤ https://nbviewer.jupyter.org/
➤ Or run them on your own computer:
  ➤ Prepare Python and Pipenv.
  ➤ $ pipenv sync

Datasets
➤ The handouts are based on:
  ➤ American National Election Survey 1996 (944×10)
➤ You may play with:
  ➤ Extramarital Affairs Dataset (1978; 6366×9)
  ➤ Star98 Educational Dataset (1998; 303×13)
➤ Handout: datasets.ipynb
➤ The context matters:
  ➤ 1970s – Wikipedia, 1990s – Wikipedia.
  ➤ 1996 United States presidential election – Wikipedia.

Exploration

Correlation Analysis
➤ Measures the bivariate linear “tightness”.
  ← Pearson's Correlation Coefficient (r)
➤ All pairs → correlation matrix.
➤ Handout: correlation_analysis.ipynb
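
As a minimal sketch of the “tightness” idea, the snippet below builds a correlation matrix with pandas.DataFrame.corr. The data is synthetic and the column names (x, tight, loose) are illustrative assumptions, not taken from the handout.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): two columns share the
# same slope against x but differ in how tightly they follow the line.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "tight": 2 * x + rng.normal(scale=0.1, size=200),  # little noise
    "loose": 2 * x + rng.normal(scale=5.0, size=200),  # a lot of noise
})

# All pairs -> correlation matrix of Pearson's r.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Both columns have the same slope against x, yet r differs a lot: r reflects how tightly the points hug a line, not how steep the line is.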

PCA & FA
➤ Maps into a lower-dim space.
  ← Principal Component Analysis (PCA)
➤ Visualize quickly, usually.
➤ Factor Analysis (FA)
  ➤ Assume lower-number unobserved variables (factors) exist.
➤ Handouts:
  ➤ pca.ipynb, pca_3d.ipynb, ipywidgets.ipynb, fa.ipynb
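
A minimal sketch of both ideas with scikit-learn, assuming synthetic data generated from two hidden factors. The handouts may use different libraries or datasets; the variable names here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Synthetic data (an assumption): 5 observed columns driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 5))

# PCA: map the 5-dim data into a 2-dim space, e.g., for a quick scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(explained, 3))

# FA: assume 2 unobserved factors exist and estimate their loadings.
fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.shape)  # one row of loadings per factor
```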

See Also
➤ seaborn
  ➤ For drawing attractive and informative statistical graphics.
➤ Plotly
  ➤ Makes interactive graphs.
➤ pandas.DataFrame.corr
  ➤ Also has Kendall's τ (tau) and Spearman's ρ (rho).
➤ Isomap – scikit-learn
  ➤ Seeks a lower-dimensional embedding which maintains geodesic distances between all points.
➤ Dimensionality reduction – scikit-learn

Inference

Hypothesis Testing
➤ Given a hypothesis, calculate the probability of observing data at least as extreme as the data you have (the p-value).
➤ The hypothesis may be:
  ➤ “the means are the same”
  ➤ “the medians are the same”
  ➤ “the proportions are the same, e.g., conversion rates”, etc.
➤ For example, testing whether model A and model B perform differently.
➤ Handout: hypothesis_testing.ipynb
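
A minimal sketch of testing “the means are the same” with scipy.stats.ttest_ind. The two samples stand in for hypothetical accuracy scores of model A and model B. The numbers are synthetic, and Welch's variant (equal_var=False) is an assumption rather than the handout's choice.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores of two models over 50 runs each (synthetic).
rng = np.random.default_rng(0)
a = rng.normal(loc=0.70, scale=0.05, size=50)
b = rng.normal(loc=0.75, scale=0.05, size=50)

# Null hypothesis: the means are the same.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(a, b, equal_var=False)
p_value = result.pvalue
print(result.statistic, p_value)

# A small p-value means such data would be unlikely if the means were the same.
```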

OLS & Logit
➤ Measures the “steepness”: how much y changes as x changes.
➤ With various assumptions:
  ➤ Linear: OLS
  ➤ y is {0, 1}: Logit
  ➤ y is {0, 1, ...}: Poisson, etc.
  ← Logit Regression
➤ Useful for understanding the dataset, and may reveal insights directly.
➤ Handouts: ols.ipynb, logit.ipynb

See Also
➤ Statistical functions – SciPy
  ➤ Includes most of the hypothesis testing functions.
➤ User Guide – statsmodels
  ➤ Includes many more models for statistical inference.
➤ Hypothesis Testing With Python
  ➤ Answers questions like “how large a sample is enough?”
➤ Statistical Regression With Python
  ➤ Answers questions like “how to understand a regression summary?”

Preprocessing

Preprocessing
➤ Make the models understand the data by various methods.
  ← MinMaxScaler
➤ Handouts: pandas_preprocessing.ipynb, sqlite.ipynb, sklearn_preprocessing.ipynb
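
A minimal sketch of MinMaxScaler from scikit-learn, assuming a tiny made-up table of ages and incomes. The scaler learns each column's min and max from the training rows only, then reuses them for new rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training rows: [age, income]. The columns have very different scales.
X_train = np.array([[18.0, 20000.0], [30.0, 50000.0], [60.0, 120000.0]])
X_new = np.array([[45.0, 70000.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max, then scale to [0, 1]
X_new_scaled = scaler.transform(X_new)          # reuse the learned min/max

print(X_train_scaled)
print(X_new_scaled)
```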

See Also
➤ Text feature extraction & Image feature extraction – scikit-learn
➤ patsy: describes models by formulas, e.g., y ~ age + C(gender).
➤ imbalanced-learn: balances the classes more carefully.
  ➤ The class_weight='balanced' option in scikit-learn may also be helpful.
➤ Rather than pandas:
  ➤ Polars: faster.
  ➤ Spark: more scalable.
  ➤ Database-like ops benchmark – H2O.ai
➤ Feature Engineering: create features by domain knowledge.

Prediction

Prediction Support-Vector Machines (SVM)

SVM With Radial Basis Function (RBF) Kernel

Prediction Decision Tree

Prediction
➤ Predict the category or continuous value.
➤ By various models:
  ↑ SVM
  ↑ Tree
  ← Linear Discriminant Analysis (LDA)
  ➤ KNN & K-Means
➤ Handouts: svm.ipynb, trees.ipynb, logistic_and_lda.ipynb, knn.ipynb, kmeans.ipynb
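
A minimal sketch of the shared fit / predict pattern in scikit-learn, assuming the bundled iris dataset (not the dataset used in the handouts) and default hyperparameters. K-Means is included for contrast: it clusters without labels rather than predicting a known category.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Supervised models: fit on the training set, score on the held-out test set.
models = {
    "svm_rbf": SVC(kernel="rbf"),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(scores)

# Unsupervised: K-Means groups the rows into 3 clusters without seeing y.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```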

See Also
➤ LightGBM: the most popular choice in Kaggle in 2019 [ref].
➤ Approximate Nearest Neighbor (ANN) Benchmark
➤ Recommender Systems in Practice – Towards Data Science
➤ Association Rules – mlxtend
➤ Voting & Stacking – scikit-learn

Models of Models

Data Leakage
➤ The training data that leads to high performance is not available at prediction time.
  ➤ Not the “data breach” of the security field.
➤ Two major types: [ref]
  ➤ Train-Test Contamination: e.g., backfilling the training set with values from the test set.
  ➤ Target Leakage: e.g., y is “diseased” while X contains “treated”.
➤ Solutions:
  ➤ Pipeline
  ➤ Explanation (+ Domain Knowledge)

Overfitting
➤ A model fits the training data too well, and then fails to predict.
➤ It happens because of the nature of some models, like trees, or over-tuning the hyperparameters.
  ← Green may be an overfit.
➤ Solutions:
  ➤ train_score / test_score should be around 1.
  ➤ Train-Test Split
  ➤ Cross-Validation
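
A minimal sketch of the train_score / test_score check, assuming a noisy synthetic classification problem. The helper name score_ratio is hypothetical. An unrestricted decision tree memorizes the training set, so its ratio rises well above 1, while a depth-limited tree typically stays closer to 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (an assumption): flip_y randomizes 20% of the labels.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def score_ratio(model):
    """Fit on the training set and return (train_score, test_score, ratio)."""
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    return train_score, test_score, train_score / test_score

deep = score_ratio(DecisionTreeClassifier(random_state=0))  # no depth limit
shallow = score_ratio(DecisionTreeClassifier(max_depth=3, random_state=0))
print("deep:", deep)
print("shallow:", shallow)
```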

Spurious Relationship
➤ The model uses a false relationship to predict.
  ← Gets 90% accuracy by “the background is snowy, so the animal is a Husky.” [ref]
➤ Solution:
  ➤ Explanation (+ Domain Knowledge)
[Images: Husky, Wolf]

Model-Market Fit
➤ Like “Product-Market Fit”.
➤ “Hey, this house is super similar to the one you just bought, buy one more?”
➤ “I spent ten years building this model, please buy one!”
➤ Solution:
  ➤ Model Development

Pipeline
➤ Predefine the steps and run the fit / transform (predict) separately to avoid data leakage.
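
A minimal sketch with sklearn.pipeline.Pipeline, assuming the bundled breast cancer dataset and a scaler plus logistic regression as the predefined steps (illustrative choices, not necessarily those in the handout). Calling fit only on the training set means the scaler never sees the test rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Predefine the steps once.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit: every step learns from the training set only.
pipe.fit(X_train, y_train)

# score: the test set is only transformed and predicted, never fitted.
test_score = pipe.score(X_test, y_test)
print(test_score)

# The scaler's learned mean comes from the training rows alone.
print(np.allclose(pipe.named_steps["scale"].mean_, X_train.mean(axis=0)))
```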

Cross-Validation
➤ Train-Test Split is simple, but can't use the data fully.
➤ Use the data fully by various strategies.
  ← K-Fold Cross-Validation (K-Fold CV)

➤ Train-Test Split? Keep one set clean from fitting to evaluate the performance correctly.
➤ Cross-Validation? Also rotate the 2 sets to cover all of the data.
➤ Train-Valid-Test Split? Keep another set clean from model selection, e.g., selecting from Logistic Regression, SVM, Random Forest.
➤ Nested Cross-Validation? Also rotate the 3 sets.
➤ Handout: pipe_and_cv.ipynb
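
A minimal sketch of K-Fold and nested cross-validation with scikit-learn, assuming the iris dataset, an SVM pipeline, and a small hypothetical grid for C. The outer folds rotate the test set. In the nested version, GridSearchCV rotates a validation set inside each outer training fold for model selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# K-Fold CV: rotate train / test so that every row is tested exactly once.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=outer)
print(scores.mean())

# Nested CV: the inner folds select C, and the outer folds evaluate the selection.
inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=inner)
nested_scores = cross_val_score(search, X, y, cv=outer)
print(nested_scores.mean())
```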

See Also
➤ Cross validation iterators – scikit-learn
  ➤ Choose by the data-generating process, e.g., grouped data.
➤ Exhaustive Grid Search – scikit-learn
  ➤ Searches for the best hyperparameters automatically.
➤ AWS Data Pipeline
  ➤ It's a different “pipeline”, but it's also important in data engineering.

Model Development
➤ Like “Software Development”.
➤ How to reach model-market fit? Delight people with fast releases!
➤ People must like your model:
  ➤ Domain experts.
  ➤ Colleagues.
  ➤ Users.
➤ Release faster, then learn faster; ideally every 1–2 weeks.

See Also
➤ The Analysis Steps
  ➤ A suggested method for conducting an analysis, whether for building models or reviewing models.
➤ The Study Designs
  ➤ Besides A/B testing, some less costly methods.
➤ The Mini-Scrum
  ➤ How to work with a team efficiently.

Time Series
➤ A Spurious Relationship arises naturally between independent non-stationary variables, e.g., when the mean varies over time.
➤ The methods and libraries for time series:
  ➤ plot_acf & plot_pacf – statsmodels
  ➤ tsa & statespace – statsmodels
  ➤ ADF test – statsmodels
  ➤ pmdarima: brings R's auto.arima to Python.
  ➤ Prophet: uses a Bayesian-based method.
  ➤ Cross validation of time series data – scikit-learn
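
A minimal sketch of both points, assuming two synthetic random walks as the independent non-stationary variables. The levels can look correlated by accident, while their differences (which are stationary here) should not. TimeSeriesSplit from scikit-learn then shows cross-validation folds that never train on the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Two independent random walks (synthetic): non-stationary by construction.
rng = np.random.default_rng(0)
a = rng.normal(size=500).cumsum()
b = rng.normal(size=500).cumsum()

r_levels = np.corrcoef(a, b)[0, 1]                  # may be far from 0 by accident
r_diffs = np.corrcoef(np.diff(a), np.diff(b))[0, 1] # differences are independent noise
print(r_levels, r_diffs)

# Time-aware CV: each training fold ends before its test fold begins.
X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    print(train, "->", test)
```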

Recap
➤ Exploration, like PCA, helps you understand the data.
➤ Inference, like statistical regression, finds the insights.
➤ Preprocessing is for feeding easy-to-digest data to models.
➤ Inference helps prediction.
➤ Delight people with fast releases! 😊

Image Credits
➤ “Linear PCA vs. Nonlinear Principal Manifolds”: https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:Elmap_breastcancer_wiki.png
➤ “SVM”: https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:SVM_margin.png
➤ “SVM With RBF Kernel”: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
➤ “Tree”: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Cart_tree_kyphosis.png
➤ “PCA vs. LDA”: https://sebastianraschka.com/Articles/2014_python_lda.html
➤ “Overfitting”: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
➤ “Data Leakage”: https://www.kaggle.com/dansbecker/data-leakage
➤ “Husky”: https://en.wikipedia.org/wiki/Husky
➤ “Wolf”: https://en.wikipedia.org/wiki/Wolf#/media/File:Front_view_of_a_resting_Canis_lupus_ssp.jpg
➤ “Houses”: https://unsplash.com/photos/vZEPXDQHR4s
➤ “K-Fold Cross-Validation”: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.svg
➤ “Pipeline”: https://unsplash.com/photos/KP6XQIEjjPA
➤ “Smile”: https://unsplash.com/photos/g1Kr4Ozfoac
