R codes to implement sparse SEmiparametric Discriminant Analysis (SEDA) for high-dimensional zero-inflated data.
Chung, H.C., Ni, Y., & Gaynanova, I. Sparse semiparametric discriminant analysis for high-dimensional zero-inflated data. Journal of Machine Learning Research (to appear)
Yoon G., Carroll R.J. & Gaynanova I. (2020). Sparse semiparametric canonical correlation analysis for data of mixed types Biometrika, 107(3), 609-625.
Vandeputte, D., Kathagen, G., D’hoe, K., Vieira-Silva, S., Valles-Colomer, M., Sabino, J., … & Raes, J. (2017). Quantitative microbiome profiling links gut community variation to microbial load. Nature, 551(7681), 507-511.
To implement, (1) clone this repository; (2) open SEDA.Rproj; (3) load
packages and function located in the folder ./functions/ as follows
# Set Paths
dataPath <- "./Data/atac/"
funcPath <- "./functions/"
# Load packages and source functions
files.sources = list.files(path=funcPath)
extension <- substring(files.sources,nchar(files.sources)-1,nchar(files.sources))
lets.source <- paste(funcPath, files.sources[extension==".R"], sep="")
invisible( mapply(source, lets.source) )
# "checkPD.R" modifies tmvtnorm:::checkSymmetricPositiveDefinite
environment( checkSymmetricPositiveDefiniteCustom ) <- asNamespace("tmvtnorm")
assignInNamespace( "checkSymmetricPositiveDefinite", checkSymmetricPositiveDefiniteCustom, ns="tmvtnorm" )This example assumes that you have your own dataset xTrain, yTrain,
xTest, and yTest. In this example, we use
# Training data sample size and dimension
dim(xTrain) ## [1] 100 300
# Test data sample size and dimension
dim(xTest) ## [1] 200 300
# Class sizes (training)
table(yTrain)## yTrain
## 0 1
## 51 49
# Class sizes (testing)
table(yTest)## yTest
## 0 1
## 102 98
The function kfold.cv_seda trains SEDA classifier using K-fold
cross-validation.
# Monte-Carlo sample size for the posterior probability estimation
nmc <- 100
# Tuning grid for the intercept (Delta_y, the class label threshold).
delta1_grid <- seq(-1.5, 1.5, l=100)
# Number of tuning grid points for the sparsity parameter (lambda).
ngrid_lambda <- 100
# Number of folds
K <- 5
fit <- kfold.cv_seda(K=K, xTrain,yTrain, delta1_grid, ngrid_lambda = ngrid_lambda, nmc = nmc)Condition means are approximated with a Monte-Carlo sample of size
nmc. The function zhat_cnd returns latent Gaussian level test data
with truncated variables replaced with respective conditional means.
## nmc : MC sample size
## xTrain : training x
## xTest : test x
## fit$Rx.train: sample correlation matrix based on training x
zhat.test_seda <- zhat_cnd(nmc = nmc, xTrain = xTrain, xValid = xTest, Rx = fit$Rx.train )$zhatThe function test_seda returns the misclassification error rate
comparing predicted class labels and true class label yTest. The
function requires (1) tuned classification direction fit$tuned.beta,
(2) intercept fit$tuned.delta1, (3) latent Gaussian level test data
zhat_test_seda, and (4) class label of test data yTest.
## fit$tuned_beta: tuned classification direction
## fit$tuned_delta1: tuned intercept
## zhat.test_seda: estimated latent Gaussian level test data, and
## yTest: test data class label for misclassification rate calculation.
seda.test.error <- test_seda(fit$tuned.beta, fit$tuned.delta1, zhat.test_seda, yTest, plot=TRUE )