Python SDK
At Bystro, we believe natural language is the right interface for genetic and proteomic analysis. We are building the world's first LLM-powered natural language analysis engine, which takes your questions about complex genetic and proteomic datasets and converts them into statistical answers with easy-to-understand summaries and visualizations.
This is our open-source collection of machine learning methods for high-dimensional statistics, along with some applications in genomics and proteomics. We're working to integrate these methods into the Bystro natural language analysis platform for genetics and proteomics. Our current platform automates analyses such as polygenic risk scores (PRS), ancestry calculation, and QC for genetics data.
Installation
Install the Bystro Python package:
pip install bystro
Machine Learning Methods
Covariance Matrix Estimation and Hypothesis Testing
from bystro.covariance import *
- Regularized covariance matrix estimation methods, well suited to small-sample regimes where n << p (far fewer samples n than features p)
- Covariance matrix hypothesis tests, such as the two-sample covariance test (from bystro.random_matrix_theory.rmt4ds_cov_test import two_sample_cov_test)
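To illustrate why regularization matters when n << p, here is a minimal NumPy sketch of linear shrinkage toward a scaled identity matrix. It is a generic textbook estimator, not the bystro.covariance API; the function name shrunk_covariance and the fixed shrinkage weight alpha are assumptions made for illustration.

```python
import numpy as np


def shrunk_covariance(X, alpha=0.1):
    """Linear shrinkage of the sample covariance toward a scaled identity.

    Hypothetical helper for illustration only; alpha in [0, 1] is chosen by
    hand here rather than estimated from the data.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # center each feature
    S = Xc.T @ Xc / (n - 1)            # unbiased sample covariance (p x p)
    mu = np.trace(S) / p               # average variance, scales the identity target
    return (1.0 - alpha) * S + alpha * mu * np.eye(p)


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))     # n = 20 samples, p = 100 features

S = np.cov(X, rowvar=False)
S_shrunk = shrunk_covariance(X, alpha=0.2)

rank_S = np.linalg.matrix_rank(S)                 # rank-deficient because n < p
min_eig = np.linalg.eigvalsh(S_shrunk).min()      # strictly positive after shrinkage
print(rank_S, min_eig)
```

With fewer samples than features the sample covariance is singular and cannot be inverted, whereas the shrunk estimate is positive definite for any alpha > 0.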
Random Matrix Theory Methods
from bystro.random_matrix_theory import *
Random Matrix Theory modules that are foundational for significance tests, such as our two_sample_cov_test
Stochastic Gradient Langevin
from bystro.stochastic_gradient_langevin import *
Implementation of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm described in https://www.ics.uci.edu/~welling/publications/papers/stoclangevin
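As a rough illustration of the update rule, the sketch below samples the posterior over the mean of a unit-variance Gaussian using minibatch gradients plus injected Gaussian noise. It uses plain NumPy rather than the bystro.stochastic_gradient_langevin module; the function name, the Gaussian prior, and the constant step size (the paper uses a decreasing schedule) are assumptions made to keep the example short.

```python
import numpy as np


def sgld_gaussian_mean(data, n_steps=5000, batch_size=32,
                       step_size=1e-4, prior_var=10.0, seed=0):
    """Hypothetical SGLD sampler for the mean of a N(theta, 1) likelihood.

    Prior: theta ~ N(0, prior_var). A constant step size is an assumption
    made for brevity and introduces some discretization bias.
    """
    rng = np.random.default_rng(seed)
    N = data.shape[0]
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        grad_prior = -theta / prior_var
        # Minibatch gradient of the log-likelihood, rescaled to the full dataset
        grad_lik = (N / batch_size) * np.sum(batch - theta)
        noise = rng.normal(0.0, np.sqrt(step_size))   # injected noise, variance = step size
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples[t] = theta
    return samples


rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=1000)
samples = sgld_gaussian_mean(data)
posterior_mean_estimate = samples[1000:].mean()   # discard the first 1000 steps as burn-in
print(posterior_mean_estimate, data.mean())
```

The average of the retained samples should land close to the sample mean of the data, which is nearly the exact posterior mean under this weak prior.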
Fair Machine Learning and Supervised PPCA / Variational Principal Component Regression
from bystro.supervised_ppca import *
supervised_ppca is a collection of generative methods:
- Probabilistic PCA (PPCA)
- Supervised PPCA (also known as Variational Principal Component Regression): A novel method for network analysis that can pick up dynamics of interest in low-variance components. It is also competitive with Elastic Net in regression, but applies shrinkage in the latent space rather than to the covariates. See our recent publication: https://arxiv.org/abs/2409.02327
- Adversarial Probabilistic PCA: A fair ML method that removes the influence of M sensitive variables (confounding factors) from high-dimensional data
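For readers new to the generative view, the following sketch fits plain Probabilistic PCA using the standard closed-form maximum-likelihood solution (loadings from the top eigenvectors, noise variance from the average discarded eigenvalue). It is written in NumPy and does not use or mirror the bystro.supervised_ppca classes; the function name ppca_ml and the simulated data are assumptions for illustration.

```python
import numpy as np


def ppca_ml(X, n_components):
    """Closed-form maximum-likelihood PPCA (hypothetical helper).

    Returns loadings W (p x q) and isotropic noise variance sigma2, so that
    the model covariance is W @ W.T + sigma2 * I.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                          # ML sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[n_components:].mean()     # average of discarded eigenvalues
    scale = np.sqrt(np.maximum(eigvals[:n_components] - sigma2, 0.0))
    W = eigvecs[:, :n_components] * scale      # scale each retained eigenvector
    return W, sigma2


# Simulate data from a 2-factor model with noise variance 0.25
rng = np.random.default_rng(0)
n, p, q = 5000, 10, 2
W_true = 2.0 * rng.standard_normal((p, q))
X = rng.standard_normal((n, q)) @ W_true.T + 0.5 * rng.standard_normal((n, p))

W, sigma2 = ppca_ml(X, n_components=q)
cov_true = W_true @ W_true.T + 0.25 * np.eye(p)
cov_model = W @ W.T + sigma2 * np.eye(p)
rel_err = np.linalg.norm(cov_model - cov_true) / np.linalg.norm(cov_true)
print(sigma2, rel_err)
```

The loadings are identified only up to rotation, so the comparison is made on the implied covariance rather than on W itself.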
Applications in Proteomics
For proteomics analyses, we have four modular steps that can be applied alone or in combination:
- Imputation - Recommended if your data has missing values; demo here: Soft Impute Demo
- Batch Correction - Recommended for TMT data with a control per batch, to correct for batch effects when batches are small; demo here: Small sample batch correction Demo
- Removal of Confounding Factors - Recommended if you want to remove the influence of confounding factors such as sex or ancestry from your data; demo here: Fair PCA Demo
- Network Analysis - Recommended if you want to discover predictive networks in your proteomic data; demo here: Learning Predictive Network Demo
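To give a feel for the imputation step, here is a small NumPy sketch of the general soft-impute idea: repeatedly fill the missing entries with the current low-rank estimate, then soft-threshold the singular values. It is not the Bystro implementation shown in the demo; the function name soft_impute, the penalty lam, and the stopping rule are assumptions for illustration.

```python
import numpy as np


def soft_impute(X, lam=1.0, n_iters=200, tol=1e-5):
    """Fill NaN entries of X via iterative soft-thresholded SVD (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)                           # current low-rank estimate
    for _ in range(n_iters):
        filled = np.where(observed, X, Z)          # keep observed values, fill the rest from Z
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thresh = np.maximum(s - lam, 0.0)        # soft-threshold singular values
        Z_new = (U * s_thresh) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(observed, X, Z)


# Rank-2 matrix with about 20% of entries hidden
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 20))
mask = rng.random(M.shape) < 0.2
X = M.copy()
X[mask] = np.nan

X_hat = soft_impute(X, lam=5.0)
rmse_imputed = np.sqrt(np.mean((X_hat[mask] - M[mask]) ** 2))
rmse_zero_fill = np.sqrt(np.mean(M[mask] ** 2))    # baseline: impute zeros
print(rmse_imputed, rmse_zero_fill)
```

The penalty lam trades off rank against fit: larger values converge faster but shrink the imputed values toward zero, so in practice it would be tuned, for example by holding out some observed entries.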
Applications in Genetics
Make genetic results more generalizable by removing information from confounding factors:
- Remove ancestry-related information in multi-ancestry cohorts to reduce bias
- Remove batch effects in meta-analyses
- See demo here: Fair PCA Demo
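As a point of comparison, the simplest way to remove confounder information is ordinary least-squares residualization, sketched below in NumPy. This is a swapped-in baseline, not the adversarial PPCA method from the Fair PCA demo, and it only removes linear effects; the function name regress_out and the simulated covariates are assumptions for illustration.

```python
import numpy as np


def regress_out(X, C):
    """Return X with the linear effect of covariates C removed (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    C1 = np.column_stack([np.ones(C.shape[0]), C])   # add an intercept column
    beta, *_ = np.linalg.lstsq(C1, X, rcond=None)    # least-squares fit per feature
    return X - C1 @ beta                             # residuals


# Simulated data: 200 samples, 2 confounders (e.g. ancestry components), 30 features
rng = np.random.default_rng(0)
C = rng.standard_normal((200, 2))
effects = rng.standard_normal((2, 30))
X = C @ effects + rng.standard_normal((200, 30))

X_clean = regress_out(X, C)
max_cross = np.abs(C.T @ X_clean).max()   # residuals are orthogonal to the confounders
print(max_cross)
```

After residualization, every feature is uncorrelated with each confounder in the sample, although nonlinear dependence can remain.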
Combined (Multi-omics) Applications
Combine genomic and proteomic data for downstream analysis or data exploration. Read more here: Proteomics combine README.md
Publications
Citation
If you use the Bystro Python SDK in your research, please cite:
Kotlar et al. "Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale." Genome Biology 19, 14 (2018). https://doi.org/10.1186/s13059-018-1387-3
How is this different from the CLI and API?
- CLI: Command-line tools for file processing, uploads, and data management
- API: REST endpoints for integrating Bystro services into web applications
- Python SDK: Advanced machine learning methods for statistical analysis, dimensionality reduction, and fair ML research in genomics and proteomics
Use the Python SDK when you want to perform advanced statistical analysis, implement fair machine learning techniques, or conduct research with cutting-edge algorithms in your Python environment.