Data Science With Python
“Data science” is a big term; however, we still try to capture all of the topics, hoping to be a lighthouse which points the way you need.
It covers the clarification of confusing terminology, correlation analysis, principal component analysis (PCA), hypothesis testing, ordinary least squares (OLS), logistic regression, pandas, support vector machines (SVM), the tree methods (random forest and gradient boosted decision trees), KNN for recommendation, k-means for clustering, cross-validation, pipelining, and more.
And the most important thing: all are introduced in plain Python!
The notebooks are available on https://github.com/moskytw/data-science-with-python .
More Decks by Mosky Liu
Transcript
-
Data Science
➤ = Extract knowledge or insights from data.
➤ Data Science ⊃
  ➤ Visualization
  ➤ Statistics
  ➤ Machine Learning
  ➤ Big Data
  ➤ Etc.
➤ ≈ Data Mining
-
Statistics vs. Machine Learning
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
➤ Machine Learning ⊃ Deep Learning
➤ The models may be the same, but the focuses are different.
➤ Good predictions usually need good inferences on the dataset.
-
Science, Analysis, Scientist, and Engineering
➤ Data Engineering / Data Engineer
  ➤ Prepare the data infrastructure that enables others to work with the data.
➤ Data Analysis / Data Analyst
  ➤ Analyze data to support the company's decisions.
➤ Data Scientist
  ➤ Create software to optimize the company's operations.
-
Mosky
➤ Backend Lead at Pinkoi.
➤ Has spoken at: PyCons in TW, JP, SG, HK, KR, MY, COSCUPs, and TEDx, etc.
➤ Countless hours spent teaching Python.
➤ Own Python packages: ZIPCodeTW, etc.
➤ https://mosky.tw/
-
Outline
1. Exploratory (EDA, Exploratory Data Analysis)
  ➤ Correlation Analysis, PCA, FA, etc.
2. Inference (Statistical Inference)
  ➤ Hypothesis Testing, OLS, Logit, etc.
3. Preprocessing
  ➤ By pandas, scikit-learn, etc.
4. Prediction (Machine Learning Prediction)
  ➤ SVM, Trees, KNN, K-Means, etc.
5. Models of Models
  ➤ Cross-Validation & Pipeline, Model Development, etc.
-
PDF & Notebooks
➤ The PDF and notebooks are available here:
  ➤ https://github.com/moskytw/data-science-with-python
➤ A good notebook reader:
  ➤ https://nbviewer.jupyter.org/
➤ Or run it on your own computer:
  ➤ Prepare Python and Pipenv.
  ➤ $ pipenv sync
-
Datasets
➤ The handouts are based on:
  ➤ American National Election Survey 1996 (944×10)
➤ You may play with:
  ➤ Extramarital Affairs Dataset (1978; 6366×9)
  ➤ Star98 Educational Dataset (1998; 303×13)
➤ Handout: datasets.ipynb
➤ The context matters:
  ➤ 1970s – Wikipedia, 1990s – Wikipedia.
  ➤ 1996 United States presidential election – Wikipedia.
-
Correlation Analysis
➤ Measures the bivariate linear “tightness”.
← Pearson's Correlation Coefficient (r)
➤ All pairs → correlation matrix.
➤ Handout: correlation_analysis.ipynb
-
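As a rough illustration of the Correlation Analysis slide, the sketch below builds a correlation matrix with pandas.DataFrame.corr. The data is synthetic and the column names (age, income, hours) are made up for this example; the handout itself works on the election survey dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the handout's dataset).
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
income = 1.5 * age + rng.normal(0, 5, n)   # linearly related to age, plus noise
hours = rng.normal(35, 5, n)               # unrelated to the other two columns

df = pd.DataFrame({"age": age, "income": income, "hours": hours})

# All pairs -> correlation matrix. Pearson's r is the default method.
pearson = df.corr()
# Rank-based alternative, less sensitive to outliers and non-linearity.
spearman = df.corr(method="spearman")

print(pearson.round(2))
```

The age/income entry should be close to 1 (a “tight” linear relationship), while the entries involving hours should be close to 0.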
PCA & FA
➤ Maps into a lower-dim space.
← Principal Component Analysis (PCA)
  ➤ Visualize quickly, usually.
➤ Factor Analysis (FA)
  ➤ Assumes that a smaller number of unobserved variables (factors) exist.
➤ Handouts:
  ➤ pca.ipynb, pca_3d.ipynb, ipywidgets.ipynb, fa.ipynb
-
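The PCA idea from the slide above, mapping into a lower-dimensional space, might look like this with scikit-learn. The 5-dimensional data here is synthetic and constructed to lie near a 2-dimensional plane; that construction is an assumption for the sketch, not something taken from the handouts.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 300 points in 5 dims that mostly vary along 2 hidden directions.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Project onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape)            # 2 columns, ready for a scatter plot
print(round(explained, 3))   # share of total variance kept by the 2 components
```

When the explained variance ratio is high, a 2-D scatter plot of X_2d is a reasonable quick look at the data.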
See Also
➤ seaborn
  ➤ For drawing attractive and informative statistical graphics.
➤ Plotly
  ➤ Makes interactive graphs.
➤ pandas.DataFrame.corr
  ➤ Also has Kendall's τ (tau) and Spearman's ρ (rho).
➤ Isomap – scikit-learn
  ➤ Seeks a lower-dimensional embedding which maintains geodesic distances between all points.
➤ Dimensionality reduction – scikit-learn
-
Hypothesis Testing
➤ Given a hypothesis, calculate the probability of observing the data.
➤ The hypothesis may be:
  ➤ “the means are the same”
  ➤ “the medians are the same”
  ➤ “the proportions are the same, e.g., conversion rates”, etc.
➤ For example, comparing the performance of model A and model B.
➤ Handout: hypothesis_testing.ipynb
-
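To make the “the means are the same” hypothesis from the slide above concrete, here is a minimal sketch using Welch's t-test from scipy.stats. The two score arrays are synthetic stand-ins for repeated evaluations of a hypothetical model A and model B; the handout may use different tests and data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical accuracy scores from 100 repeated runs of two models (synthetic).
scores_a = rng.normal(loc=0.70, scale=0.05, size=100)
scores_b = rng.normal(loc=0.75, scale=0.05, size=100)

# Null hypothesis: the two means are the same.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

# The p-value is the probability of data at least this extreme if the null holds.
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.2g}, reject null: {p_value < alpha}")
```

A small p-value means the observed difference would be unlikely if the means were truly equal.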
OLS & Logit
➤ Measures the “steepness”.
➤ With various assumptions:
  ➤ Linear: OLS
  ➤ y is {0, 1}: Logit
  ➤ y is {0, 1, ...}: Poisson, etc.
← Logit Regression
➤ Helps to understand the dataset, and may find insights directly.
➤ Handouts: ols.ipynb, logit.ipynb
-
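The “steepness” that OLS measures is the slope of the fitted line. The sketch below estimates it with NumPy's least-squares solver on synthetic data with a known slope; this is a swapped-in minimal approach for illustration, and a statistical package would additionally report standard errors and p-values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known relationship: y = 2 + 3x + noise.
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 200)

# Design matrix: a column of ones for the intercept, then x.
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize ||A @ coef - y||^2.
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The estimated slope should land near the true value of 3, which is the sense in which the regression recovers how steeply y changes with x.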
See Also
➤ Statistical functions – SciPy
  ➤ Includes most of the hypothesis testing functions.
➤ User Guide – statsmodels
  ➤ Includes many more models for statistical inference.
➤ Hypothesis Testing With Python
  ➤ Answers like “how large a sample is enough?”
➤ Statistical Regression With Python
  ➤ Answers like “how to understand a regression summary?”
-
Preprocessing
➤ Make the models understand the data by various methods.
← MinMaxScaler
➤ Handouts: pandas_preprocessing.ipynb, sqlite.ipynb, sklearn_preprocessing.ipynb
-
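The scaler shown on the Preprocessing slide rescales each column to the range [0, 1]. A minimal sketch with scikit-learn's MinMaxScaler follows; the small age/income table is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: [age, income]. The two columns have very different scales.
X = np.array([
    [18.0, 20_000.0],
    [35.0, 50_000.0],
    [60.0, 120_000.0],
])

# Per column: (value - column_min) / (column_max - column_min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(3))
```

After scaling, both columns run from 0 to 1, so distance-based models such as KNN or SVM no longer let income dominate age purely because of its units.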
See Also
➤ Text feature extraction & Image feature extraction – scikit-learn
➤ patsy: describes models by formulas, e.g., y ~ age + C(gender).
➤ imbalanced-learn: balances the classes more carefully.
  ➤ The class_weight='balanced' in scikit-learn may also be helpful.
➤ Rather than pandas:
  ➤ Polars: faster.
  ➤ Spark: more scalable.
  ➤ Database-like ops benchmark – H2O.ai
➤ Feature Engineering: create features by domain knowledge.
-
Prediction
➤ Predict the category or continuous value.
➤ By various models:
  ↑ SVM
  ↑ Tree
  ← Linear Discriminant Analysis (LDA)
  ➤ KNN & K-Means
➤ Handouts: svm.ipynb, trees.ipynb, logistic_and_lda.ipynb, knn.ipynb, kmeans.ipynb
-
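As a small sketch of two of the models named on the Prediction slide, the code below fits an SVM classifier and a K-Means clustering with scikit-learn. It uses the iris dataset bundled with scikit-learn, which is an assumption for this example and not the dataset used in the handouts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Supervised: predict the category, evaluated on rows the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="rbf").fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"SVM test accuracy: {accuracy:.2f}")

# Unsupervised: K-Means groups rows into 3 clusters without using y at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
n_clusters_found = len(set(km.labels_))
print(f"K-Means clusters: {n_clusters_found}")
```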
See Also
➤ LightGBM: the most popular choice on Kaggle in 2019 [ref].
➤ Approximate Nearest Neighbor (ANN) Benchmark
➤ Recommender Systems in Practice – Towards Data Science
➤ Association Rules – mlxtend
➤ Voting & Stacking – scikit-learn
-
Data Leakage
➤ Data that leads to high performance during training is not available at prediction time. Not the “data breach” in the security area.
➤ Two major types: [ref]
  ➤ Train-Test Contamination: like backfilling the training set with the test set.
  ➤ Target Leakage: like diseased is y, and treated is in X.
➤ Solutions:
  ➤ Pipeline
  ➤ Explanation (+ Domain Knowledge)
-
Overfitting
➤ A model fits the training data too well, and then fails to predict.
➤ It happens because of the nature of models, like trees, or over-tuning the hyperparameters.
← Green may be an overfit.
➤ Solutions:
  ➤ train_score / test_score should be around 1.
  ➤ Train-Test Split
  ➤ Cross-Validation
-
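The train_score / test_score check from the Overfitting slide can be sketched as follows. The noisy synthetic dataset and the helper name score_ratio are assumptions for this example; an unrestricted decision tree is used because trees are the slide's example of a model that overfits by nature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y randomizes a share of the labels.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def score_ratio(model):
    """Hypothetical helper: train accuracy divided by test accuracy."""
    return model.score(X_train, y_train) / model.score(X_test, y_test)


# No depth limit: the tree can memorize the training set, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting depth is one simple way to restrain the model.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train = deep.score(X_train, y_train)
deep_ratio = score_ratio(deep)
shallow_ratio = score_ratio(shallow)
print(f"deep: train={deep_train:.2f}, ratio={deep_ratio:.2f}")
print(f"shallow: ratio={shallow_ratio:.2f}")
```

A ratio well above 1 for the deep tree signals that its training score overstates how well it predicts unseen data.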
Spurious Relationship
➤ The model uses a false relationship to predict.
← Gets 90% accuracy by “the background is snowy, so the animal is Husky.” [ref]
➤ Solution:
  ➤ Explanation (+ Domain Knowledge)
(Images: Husky, Wolf)
-
Model-Market Fit
➤ Like “Product-Market Fit”.
➤ “Hey, this house is super similar to the one you just bought, buy one more?”
➤ “I spent ten years building this model, please buy one!”
➤ Solution:
  ➤ Model Development
-
Pipeline
➤ Predefine the steps and run the fit / transform (predict) separately to avoid data leakage.
-
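A minimal sketch of the Pipeline slide's point, using scikit-learn's make_pipeline on the bundled iris dataset (the dataset and the choice of steps are assumptions for this example). Calling fit on the pipeline with the training set means the scaler learns its statistics from training rows only, so nothing from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Predefine the steps: scale, then classify.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit: every step is fitted on the training set only.
pipe.fit(X_train, y_train)

# score: the test set is only transformed and predicted, never fitted on.
test_accuracy = pipe.score(X_test, y_test)

# The scaler's learned means match the training set, not the full dataset.
scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(f"test accuracy: {test_accuracy:.2f}, train-only scaling: {uses_train_only}")
```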
Cross-Validation
➤ Train-Test Split is simple, but can't use the data fully.
➤ Use the data fully by various strategies.
← K-Fold Cross-Validation (K-Fold CV)
-
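K-Fold Cross-Validation from the slide above could be run like this with scikit-learn. The iris dataset and the SVC model are placeholders chosen for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5 folds: each row is used for testing exactly once and for training 4 times.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)

mean_score = scores.mean()
print(scores.round(2))
print(f"mean accuracy: {mean_score:.2f}")
```

Unlike a single train-test split, the mean over folds uses all of the data for evaluation, and the spread of the scores hints at how stable the estimate is.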
➤ Train-Test Split? Keep a set clean from fitting to evaluate the performance correctly.
➤ Cross-Validation? Also rotate the 2 sets to cover all of the data.
➤ Train-Valid-Test Split? Keep another set clean from the model selection, e.g., selecting from Logistic Regression, SVM, Random Forest.
➤ Nested Cross-Validation? Also rotate the 3 sets.
➤ Handout: pipe_and_cv.ipynb
-
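Nested Cross-Validation, the last strategy listed above, can be sketched by wrapping a grid search (the inner loop, which does the model selection) inside cross_val_score (the outer loop, which does the evaluation). The dataset, the pipeline steps, and the parameter grid are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
# make_pipeline names each step after its class, hence the "svc__" prefix.
param_grid = {"svc__C": [0.1, 1, 10]}

# Inner loop: rotates train/valid sets to select the best hyperparameters.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: rotates the test set, which the selection never sees.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

nested_mean = nested_scores.mean()
print(f"nested CV accuracy: {nested_mean:.2f}")
```

Because the hyperparameters are chosen inside each outer fold, the outer scores are not inflated by the selection step.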
See Also
➤ Cross validation iterators – scikit-learn
  ➤ Choose by the data-generating process, e.g., groups.
➤ Exhaustive Grid Search – scikit-learn
  ➤ Search the best hyperparameters automatically.
➤ AWS Data Pipeline
  ➤ It's a different “pipeline”, but it's also important in data engineering.
-
Model Development
➤ Like “Software Development”.
➤ How to “model-market fit”? Delight people with fast release!
➤ People must like your model:
  ➤ Domain experts.
  ➤ Colleagues.
  ➤ Users.
➤ Release faster; then learn faster, ideally 1–2 weeks.
-
See Also
➤ The Analysis Steps
  ➤ A suggested method for conducting an analysis, whether for building models or reviewing models.
➤ The Study Designs
  ➤ Besides A/B testing, some less costly methods.
➤ The Mini-Scrum
  ➤ How to work with a team efficiently.
-
Time Series
➤ A Spurious Relationship happens between independent non-stationary variables naturally, e.g., when the mean varies over time.
➤ The methods and libraries for time series:
  ➤ plot_acf & plot_pacf – statsmodels
  ➤ tsa & statespace – statsmodels
  ➤ ADF test – statsmodels
  ➤ pmdarima: brings R's auto.arima to Python.
  ➤ Prophet: uses a Bayesian-based method.
  ➤ Cross validation of time series data – scikit-learn
-
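The spurious relationship described on the Time Series slide can be reproduced with a few lines of NumPy. The two series below are synthetic: their noise is independent, but both have a mean that rises over time. Differencing is used here as one simple way to remove the trend; it is an illustration, not the only remedy.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)

# Two series with independent noise but a shared upward drift in the mean.
x = 0.1 * t + rng.normal(size=500)
y = 0.1 * t + rng.normal(size=500)

# On the raw levels, the common trend dominates and the correlation looks strong.
corr_levels = np.corrcoef(x, y)[0, 1]

# First differences remove the trend, leaving only the independent noise.
corr_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]

print(f"levels: {corr_levels:.2f}, differences: {corr_diffs:.2f}")
```

The correlation on levels is close to 1 even though neither series influences the other, while the correlation on differences is close to 0.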
Recap
➤ Exploratory analysis, like PCA, helps to understand the data.
➤ Inference, like statistical regressions, finds the insights.
➤ Preprocessing is for feeding easy-to-digest data to models.
➤ Inference helps prediction.
➤ Delight people with fast release! 😊
-
Image Credits
➤ “Linear PCA vs. Nonlinear Principal Manifolds”: https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:Elmap_breastcancer_wiki.png
➤ “SVM”: https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:SVM_margin.png
➤ “SVM With RBF Kernel”: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
➤ “Tree”: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Cart_tree_kyphosis.png
➤ “PCA vs. LDA”: https://sebastianraschka.com/Articles/2014_python_lda.html
➤ “Overfitting”: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
➤ “Data Leakage”: https://www.kaggle.com/dansbecker/data-leakage
➤ “Husky”: https://en.wikipedia.org/wiki/Husky
➤ “Wolf”: https://en.wikipedia.org/wiki/Wolf#/media/File:Front_view_of_a_resting_Canis_lupus_ssp.jpg
➤ “Houses”: https://unsplash.com/photos/vZEPXDQHR4s
➤ “K-Fold Cross-Validation”: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.svg
➤ “Pipeline”: https://unsplash.com/photos/KP6XQIEjjPA
➤ “Smile”: https://unsplash.com/photos/g1Kr4Ozfoac