OSS Algorithms

At Bystro, we believe natural language is the right interface for genetic and proteomic analysis. We are building the world's first LLM-powered natural language analysis engine that takes your questions about complex genetic and proteomic datasets and converts them into statistical answers with easy-to-understand summaries and visualizations.

This is our open-source collection of machine learning methods for high-dimensional statistics, with applications in genomics and proteomics. We're working to integrate these methods into the Bystro natural language analysis platform. Our current platform automates analyses such as polygenic risk scores (PRS), ancestry calculation, and quality control (QC) for genetics data.

Installation

Install the Bystro Python package:

pip install bystro

Machine Learning Methods

Covariance Matrix Estimation and Hypothesis Testing

from bystro.covariance import *

Regularized covariance matrix estimation methods, well suited to small-sample regimes where the number of samples n is much smaller than the number of features p (n << p).
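To illustrate why regularization matters when n << p, here is a minimal NumPy sketch of linear shrinkage toward a scaled identity. It is not the bystro.covariance API: the function name shrinkage_covariance and the fixed weight alpha are assumptions made for this example, and practical estimators usually choose the weight from the data.

```python
import numpy as np


def shrinkage_covariance(X, alpha=0.5):
    """Shrink the sample covariance toward a scaled identity matrix.

    Hypothetical helper for illustration only (not part of bystro).
    X has shape (n_samples, n_features); alpha in [0, 1] is the shrinkage weight.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # center each feature
    S = Xc.T @ Xc / (n - 1)            # unbiased sample covariance
    mu = np.trace(S) / p               # average variance sets the target scale
    return (1.0 - alpha) * S + alpha * mu * np.eye(p)


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))          # n = 10 samples, p = 50 features

S = np.cov(X, rowvar=False)            # rank-deficient when n << p
S_shrunk = shrinkage_covariance(X, alpha=0.3)

sample_rank = np.linalg.matrix_rank(S)
min_eig = np.linalg.eigvalsh(S_shrunk).min()
print(f"rank of sample covariance: {sample_rank} of {S.shape[0]}")
print(f"smallest eigenvalue after shrinkage: {min_eig:.3f}")
```

The sample covariance cannot be inverted here because its rank is at most n - 1, while the shrunk estimate has strictly positive eigenvalues for any alpha > 0.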

Covariance matrix hypothesis tests, including the two-sample covariance test:

from bystro.random_matrix_theory.rmt4ds_cov_test import two_sample_cov_test
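For readers unfamiliar with the question a two-sample covariance test answers, the sketch below uses a generic permutation test on the Frobenius norm of the difference between two sample covariances. This is a stand-in for intuition only: it is not the random-matrix-theory test implemented by two_sample_cov_test, whose signature is not documented here, and the names cov_diff_stat and permutation_cov_test are hypothetical.

```python
import numpy as np


def cov_diff_stat(X, Y):
    """Frobenius norm of the difference between two sample covariance matrices."""
    return np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False), ord="fro")


def permutation_cov_test(X, Y, n_perm=200, seed=0):
    """Permutation p-value for H0: both groups share one covariance matrix.

    Hypothetical helper for illustration; not the bystro implementation.
    """
    rng = np.random.default_rng(seed)
    # Center each group so the test targets covariance rather than mean shifts.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    observed = cov_diff_stat(X, Y)
    pooled = np.vstack([X, Y])
    n_x = X.shape[0]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        stat = cov_diff_stat(pooled[perm[:n_x]], pooled[perm[n_x:]])
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)


rng = np.random.default_rng(1)
X = rng.normal(scale=1.0, size=(100, 5))   # group 1: unit variance
Y = rng.normal(scale=3.0, size=(100, 5))   # group 2: nine times the variance
p_value = permutation_cov_test(X, Y)
print(f"permutation p-value: {p_value:.4f}")
```

Permutation tests become expensive and lose power as the number of features grows, which is one motivation for tests built on random matrix theory.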

Random Matrix Theory Methods

from bystro.random_matrix_theory import *

Foundational modules for significance tests, including two_sample_cov_test.

Stochastic Gradient Langevin

from bystro.stochastic_gradient_langevin import *

Implementation of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm. Read the paper →
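As a rough picture of the update rule, the following NumPy sketch applies stochastic gradient Langevin updates to a toy problem: sampling the posterior of a Gaussian mean with known unit variance. It does not use the bystro.stochastic_gradient_langevin API; the function sgld_gaussian_mean, the prior, and the constant step size are all assumptions chosen to keep the example small.

```python
import numpy as np


def sgld_gaussian_mean(data, n_steps=2000, batch_size=50, step_size=1e-4,
                       prior_std=10.0, seed=0):
    """Draw approximate posterior samples of a Gaussian mean (unit variance).

    Hypothetical toy example. Each step uses a minibatch gradient of the
    log posterior plus Gaussian noise with variance equal to the step size.
    """
    rng = np.random.default_rng(seed)
    n_total = data.shape[0]
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        grad_prior = -theta / prior_std**2
        # Rescale the minibatch gradient to estimate the full-data gradient.
        grad_lik = (n_total / batch_size) * np.sum(batch - theta)
        noise = rng.normal(scale=np.sqrt(step_size))
        theta = theta + 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples[t] = theta
    return samples


rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

samples = sgld_gaussian_mean(data)
posterior_mean_estimate = samples[500:].mean()   # discard burn-in
print(f"SGLD estimate: {posterior_mean_estimate:.3f}, sample mean: {data.mean():.3f}")
```

A constant step size is used here for simplicity; the original formulation of the algorithm uses a decaying step-size schedule.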

Fair ML / Supervised PPCA / Variational Principal Component Regression

from bystro.supervised_ppca import *

supervised_ppca is a collection of generative methods:

Probabilistic PCA (PPCA): Standard probabilistic formulation of PCA.
Supervised PPCA: Also known as Variational Principal Component Regression. Novel method for network analysis that picks up dynamics of interest in low-variance components. Competitive with Elastic Net in regression contexts without shrinking covariates directly. Read the paper →
Adversarial Probabilistic PCA: Fair ML method that removes the influence of M sensitive variables (confounding factors) from high-dimensional data.
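To make the first method in the list above concrete, here is a minimal NumPy sketch of the closed-form maximum-likelihood solution for standard PPCA (Tipping and Bishop). It is independent of bystro.supervised_ppca, which may fit its models differently; the function name fit_ppca and its return values are assumptions for illustration.

```python
import numpy as np


def fit_ppca(X, n_components):
    """Closed-form maximum-likelihood PPCA.

    Hypothetical helper for illustration; requires n_components < n_features.
    Returns the loading matrix W (n_features, n_components) and the isotropic
    noise variance sigma2, so the model covariance is W @ W.T + sigma2 * I.
    """
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False, bias=True)        # ML covariance (divides by n)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    q = n_components
    sigma2 = eigvals[q:].mean()                   # average discarded variance
    scale = np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    W = eigvecs[:, :q] * scale                    # scale each retained direction
    return W, sigma2


rng = np.random.default_rng(0)
n, p, q = 500, 10, 2
W_true = rng.normal(size=(p, q))
Z = rng.normal(size=(n, q))                       # latent factors
X = Z @ W_true.T + 0.5 * rng.normal(size=(n, p))  # noise variance 0.25

W, sigma2 = fit_ppca(X, n_components=q)
S = np.cov(X, rowvar=False, bias=True)
C = W @ W.T + sigma2 * np.eye(p)
rel_error = np.linalg.norm(C - S) / np.linalg.norm(S)
print(f"estimated noise variance: {sigma2:.3f}, relative covariance error: {rel_error:.3f}")
```

The loadings are identified only up to a rotation of the latent space, so W need not match W_true column by column even when the implied covariance fits well.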

Applications in Proteomics

Four modular steps that can be applied alone or combined:

Imputation: Recommended if there are missing values in your data. (Soft Impute demo)
Batch Correction: Recommended for TMT data with a control per batch. (Small sample batch correction demo)
Removal of Confounding Factors: Recommended if there are confounding factors such as sex or ancestry to remove. (Fair PCA demo)
Network Analysis: Recommended to discover predictive networks in proteomic data. (Learning Predictive Network demo)
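For the imputation step, the sketch below shows the general idea behind soft-impute style matrix completion: repeatedly fill the missing entries with a low-rank estimate obtained by soft-thresholding singular values. It is a minimal NumPy version written for this README, not the code used in the Soft Impute demo; the function soft_impute and the value of lam are assumptions.

```python
import numpy as np


def soft_impute(X, lam=1.0, n_iters=200, tol=1e-5):
    """Fill NaN entries of X with a low-rank estimate.

    Hypothetical helper for illustration. lam is the soft-threshold applied
    to the singular values at every iteration.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)                          # current low-rank estimate
    for _ in range(n_iters):
        filled = np.where(observed, X, Z)         # keep data, fill gaps from Z
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thresh = np.maximum(s - lam, 0.0)       # shrink singular values
        Z_new = (U * s_thresh) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(observed, X, Z)               # observed values are untouched


rng = np.random.default_rng(0)
X_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))   # rank-2 matrix
missing = rng.random(X_true.shape) < 0.2                       # hide about 20% of entries
X_obs = X_true.copy()
X_obs[missing] = np.nan

completed = soft_impute(X_obs, lam=1.0)
rmse_soft = np.sqrt(np.mean((completed[missing] - X_true[missing]) ** 2))
rmse_zero = np.sqrt(np.mean(X_true[missing] ** 2))             # zero-fill baseline
print(f"RMSE on missing entries: soft-impute {rmse_soft:.3f}, zero fill {rmse_zero:.3f}")
```

In practice lam is typically tuned, for example by holding out a subset of observed entries and comparing reconstruction error.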

Applications in Genetics

Make genetic results more generalizable by removing information from confounding factors:

Remove ancestry-related information in multi-ancestry cohorts to reduce bias.

Remove the effect of batch in meta-analyses.

For a worked example, see the Fair PCA Demo →
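As a point of comparison for confounder removal, the simplest baseline is to regress each feature on the confounders and keep the residuals. The NumPy sketch below does exactly that. It is a plain linear stand-in, not Adversarial Probabilistic PCA and not the bystro API; the function residualize and the simulated sex and ancestry variables are assumptions for illustration.

```python
import numpy as np


def residualize(X, C):
    """Remove the linear effect of confounders C from every column of X.

    Hypothetical helper for illustration. X has shape (n_samples, n_features)
    and C has shape (n_samples, n_confounders).
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    C1 = np.column_stack([np.ones(C.shape[0]), C])    # add an intercept column
    beta, *_ = np.linalg.lstsq(C1, X, rcond=None)     # least-squares fit per feature
    return X - C1 @ beta


rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n).astype(float)        # simulated binary confounder
ancestry_pc = rng.normal(size=n)                      # simulated continuous confounder
C = np.column_stack([sex, ancestry_pc])

effects = rng.normal(size=(2, 6))                     # confounder effects on 6 features
X = C @ effects + rng.normal(size=(n, 6))

X_clean = residualize(X, C)
leak = np.abs(C.T @ X_clean).max()                    # remaining linear association
print(f"largest remaining confounder association: {leak:.2e}")
```

Residualization removes only linear dependence on the confounders, one feature at a time, so nonlinear effects can remain in the data.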

Combined (Multi-omics) Applications

Combine genomic and proteomic data for downstream analyses or data exploration. Read the proteomics README →

Publications

Citation

If you use the Bystro Python package in your research, please cite:

Kotlar et al. "Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale." Genome Biology 19, 14 (2018). https://doi.org/10.1186/s13059-018-1387-3