Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data | bioRxiv
New Results
Remi Torracinta, Laurent Mesnard, Susan Levine, Rita Shaknovich, Maureen Hanson, Susan Levine
doi: https://doi.org/10.1101/079087

ABSTRACT
A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as on pure simulations. On independent semi-simulated data, the models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sensitivity 0.4 and specificity 0.9 at a cutoff of P ≥ 0.9), albeit with lower overall performance (AUC = 0.737). Interestingly, while the models generalize across assays, training on RNA-Seq data lowers the confidence for a group of mutations. HaloPlex exome-specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of the experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design, in which germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (https://goby.campagnelab.org).
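The idea of planting mutations and then scoring calls at a probability cutoff can be illustrated with a minimal Python sketch. It is not the Goby implementation. The per-base count dictionaries, the function names `plant_mutation`, `make_examples`, and `sensitivity_specificity`, and the 0.1 to 0.5 allele-fraction range are all assumptions made for illustration; only the general idea (plant a mutation in a sample where none is expected, label the site, evaluate at a cutoff such as P ≥ 0.9) comes from the abstract.

```python
import random

BASES = "ACGT"

def plant_mutation(counts, ref_base, allele_fraction, rng):
    """Return (mutated_counts, alt_base) without modifying `counts`.

    Each read supporting `ref_base` is reassigned to one randomly chosen
    alternate base with probability `allele_fraction` (hypothetical scheme).
    """
    alt_base = rng.choice([b for b in BASES if b != ref_base])
    moved = sum(1 for _ in range(counts[ref_base]) if rng.random() < allele_fraction)
    mutated = dict(counts)
    mutated[ref_base] -= moved
    mutated[alt_base] += moved
    return mutated, alt_base

def make_examples(sites, mutation_rate, rng):
    """Build labeled examples from sites where no mutation is expected.

    `sites` is a list of (ref_base, germline_counts, somatic_counts).
    A site is labeled 1 only if planting actually changed the counts.
    """
    examples = []
    for ref_base, germline, somatic in sites:
        label = 0
        if rng.random() < mutation_rate:
            # Assumed allele-fraction range; not taken from the paper.
            fraction = rng.uniform(0.1, 0.5)
            planted, _ = plant_mutation(somatic, ref_base, fraction, rng)
            label = int(planted != somatic)
            somatic = planted
        examples.append((germline, somatic, label))
    return examples

def sensitivity_specificity(probabilities, labels, cutoff=0.9):
    """Score calls made when the predicted probability is >= cutoff."""
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        called = p >= cutoff
        if y == 1 and called:
            tp += 1
        elif y == 1:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

In such a setup, the labeled examples would be converted to feature vectors for the classifier, and the classifier's output probabilities would be passed to `sensitivity_specificity` together with the planted labels.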
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY 4.0 International license.