Bayesian Inference for Fitness Landscapes: A Modern Guide to Quantifying Evolutionary Costs and Benefits in Drug Development

Leo Kelly Jan 09, 2026 132

This article provides a comprehensive guide for biomedical researchers on applying Bayesian statistical frameworks to estimate fitness costs and benefits, crucial parameters in evolutionary biology and antimicrobial/anticancer drug development.

Bayesian Inference for Fitness Landscapes: A Modern Guide to Quantifying Evolutionary Costs and Benefits in Drug Development

Abstract

This article provides a comprehensive guide for biomedical researchers on applying Bayesian statistical frameworks to estimate fitness costs and benefits, crucial parameters in evolutionary biology and antimicrobial/anticancer drug development. We explore foundational Bayesian concepts, detail methodological workflows for integrating genomic and phenotypic data, address common pitfalls in model specification and computation, and compare Bayesian approaches to frequentist alternatives. The content is tailored to empower scientists in building robust, probabilistic models of selection pressure to predict resistance evolution and optimize therapeutic strategies.

The Bayesian Paradigm: A Primer on Probabilistic Modeling for Fitness Landscapes

In evolutionary biology, fitness is the fundamental currency, quantifying an organism's genetic contribution to subsequent generations. Fitness costs (reductions in fitness) and benefits (increases in fitness) are the opposing forces that shape adaptation. Estimating these parameters is challenging due to noisy, multivariate data from natural environments. Bayesian inference provides a powerful statistical framework for this task, allowing researchers to integrate prior knowledge with observed data (e.g., survival, reproduction, trait measurements) to generate posterior probability distributions of cost/benefit parameters. This quantifies uncertainty and enables robust predictions about evolutionary trajectories, crucial for fields like antimicrobial resistance and cancer biology.

Core Definitions & Quantitative Data

Fitness Benefit: An increase in the relative contribution of a genotype or phenotype to the next generation's gene pool, often conferred by a trait that enhances survival or reproduction in a given environment.

Fitness Cost: A decrease in relative fitness, typically arising from resource allocation trade-offs, antagonistic pleiotropy, or increased susceptibility to other selective pressures.

Key Metrics and Their Typical Ranges: Fitness effects are often measured relative to a reference strain (e.g., wild-type), with a relative fitness (W) of 1.0. Costs/benefits are reported as selection coefficients (s), where W = 1 + s. A negative s indicates a cost; a positive s indicates a benefit.

Table 1: Common Metrics for Quantifying Fitness Costs and Benefits

Metric Typical Experimental Context Quantitative Range (Commonly Observed) Interpretation
Relative Fitness (W) Head-to-head competition assays. 0.7 - 1.3 W_ref = 1.0. W < 1 = cost; W > 1 = benefit.
Selection Coefficient (s) Derived from W (s = W - 1). -0.3 to +0.3 s = -0.1 = 10% fitness cost per generation.
IC50/IC90 Fold Change Drug resistance studies. 2x to >1000x Higher fold = stronger benefit under drug, often correlated with cost in drug-free environment.
Growth Rate (μ, per hour) In vitro monoculture growth curves. Variable by species. Difference (Δμ) is key. Δμ < 0 indicates a cost of a mutation in optimal lab medium.
LD50 (Pathogen Virulence) In vivo infection models. Variable. Comparison to control. Increased LD50 may indicate cost of attenuation; decreased LD50 indicates benefit of virulence trait.

Table 2: Documented Fitness Costs of Antibiotic Resistance Mutations (Representative Examples)

Antibiotic Class Resistance Mechanism Reported Cost (s) in Drug-Free Medium Conditional Benefit (s) in Drug Key Reference
β-lactams Alteration of PBP (penicillin-binding protein) -0.15 to -0.05 > +1.0 (lethal drug) Andersson & Hughes, 2010
Fluoroquinolones Topoisomerase mutation (gyrA) -0.2 to -0.05 +0.5 to >+1.0 Marcusson et al., 2009
Aminoglycosides rRNA methylation (16S) -0.1 to -0.01 +0.3 to +0.8 Vester & Long, 2013

Application Notes & Experimental Protocols

AN-001: In Vitro Competitive Fitness Assay (Gold Standard Protocol)

Purpose: To precisely measure the relative fitness (W) and selection coefficient (s) of a mutant strain versus an isogenic wild-type.

Bayesian Integration Point: The replicate data from time points (CFU counts) serve as likelihoods. Prior distributions for growth rates can be informed from monoculture experiments. Markov Chain Monte Carlo (MCMC) sampling generates posteriors for s with credible intervals.

Protocol:

  • Strain Preparation:

    • Generate marked, isogenic strains: wild-type (WT) and mutant (M). Neutral markers (e.g., differential antibiotic resistance not under test, fluorescent proteins) enable quantification.
    • Grow overnight monocultures of each strain separately in relevant medium.
  • Initial Coculture (Day 0):

    • Mix WT and M strains at a ~1:1 ratio (e.g., 1x10^6 CFU each) in fresh medium. Precisely quantify the starting densities (CFU/mL) by serial dilution and plating on selective agar for each marker.
  • Serial Batch Transfer:

    • Incubate the coculture at appropriate conditions.
    • Each day (or at fixed exponential-phase intervals), dilute the culture into fresh medium (typically 1:100 to 1:1000) to maintain exponential growth. This represents one growth cycle ("generation").
    • Repeat for 5-10+ growth cycles.
  • Sampling and Plating:

    • At each transfer point, sample the coculture, perform serial dilutions, and plate on both (a) non-selective agar (for total CFU) and (b) agar selective for each marker (for WT and M counts).
  • Data Analysis & Bayesian Estimation:

    • Calculate the ratio M/WT at each time point (t).
    • The selection coefficient s can be estimated from the slope of ln(M/WT) over time (in generations): ln(Rt) = ln(R0) + s*t, where R is the ratio.
    • Implement a Bayesian linear regression model (e.g., using Stan, PyMC3) where the observed log ratios are normally distributed around the line defined by s and an intercept. Specify weakly informative priors for s (e.g., Normal(0, 0.5)).

G A Prepare Isogenic Strains (WT & Mutant) B Mix 1:1 in Fresh Medium A->B C Incubate & Grow B->C D Daily Serial Dilution Transfer C->D D->C Next Cycle E Sample & Plate on Selective/Non-Selective Agar D->E F Count CFUs & Calculate M/WT Ratio E->F G Bayesian Model: ln(Ratio) ~ N(s * t, σ) F->G H Posterior Distribution of Selection Coefficient (s) G->H

Title: Competitive Fitness Assay & Bayesian Analysis Workflow

AN-002: In Vivo Fitness Cost/Benefit in a Murine Infection Model

Purpose: To estimate the fitness cost of antimicrobial resistance or virulence attenuation in a host environment.

Bayesian Integration Point: Complex, hierarchical models can integrate data on bacterial loads from multiple organs, host survival, and prior in vitro data to jointly estimate parameters for growth, clearance, and immune interaction, yielding a net fitness effect.

Protocol:

  • Infection Groups:

    • Establish groups of mice (n=5-10/group) infected with: (i) WT strain, (ii) Mutant strain, (iii) Co-infected with a 1:1 mix of WT and marked Mutant.
  • Inoculum & Infection:

    • Prepare bacterial suspensions from logarithmic-phase cultures.
    • Infect mice via relevant route (e.g., intravenous, intranasal, intraperitoneal) with a pre-determined dose.
  • Longitudinal Monitoring:

    • Monitor survival and clinical scores over several days.
    • At pre-defined time points (e.g., 24h, 48h, 72h), euthanize a subset of mice from the single-infection and co-infection groups.
  • Sample Processing:

    • Harvest target organs (spleen, liver, lungs). Homogenize tissues.
    • Perform serial dilutions and plate homogenates on agar to determine total bacterial burden. For co-infection samples, plate on both selective and non-selective media to quantify WT and Mutant CFUs.
  • Bayesian Dynamical Modeling:

    • Construct a differential equation model of bacterial growth and host control.
    • Treat unknown parameters (e.g., intrinsic growth rate, immune killing rate) as probabilistic variables with priors.
    • Use MCMC to fit the model to the time-series CFU data, inferring the posterior distribution of the growth rate difference (Δμ) as the in vivo fitness cost/benefit.

G A Mouse Infection Groups: WT, Mutant, Co-Infection B Inoculum Preparation (Log-phase cultures) A->B C Administer Infection (e.g., IV, IN) B->C D Longitudinal Sampling: Survival & Euthanasia at Timepoints C->D E Organ Harvest & Homogenization D->E F Serial Dilution & Plating on Selective Media E->F G CFU Counts & Ratio Calculation F->G H Hierarchical Bayesian Dynamical Model Fit G->H I Posterior of In Vivo Fitness Parameters H->I

Title: In Vivo Fitness Assay Protocol Flowchart

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Materials for Fitness Studies

Item Function & Application Example/Supplier
Isogenic, Differentially Marked Strains Essential for competition assays. Allows precise discrimination without altering fitness. Fluorescent proteins (GFP, mCherry), neutral antibiotic resistance (e.g., kanR on chromosome).
Specialized Growth Media To test conditional fitness effects (e.g., with/without antibiotic, different carbon sources). Mueller-Hinton (antibiotic testing), minimal M9 media (nutrient limitation).
Automated Cell Counter/Plater Increases throughput and accuracy of colony counting and plating in high-replicate experiments. BioRad QCount, Synbiosis ProtoCOL.
Animal Model (Murine) Gold-standard host system for in vivo fitness studies of pathogens or cancer cells. C57BL/6, BALB/c strains.
Bayesian Statistical Software For probabilistic estimation of fitness parameters and modeling. Stan (via brms in R, CmdStanPy), PyMC3, JAGS.
Microfluidic Chemostats For precise, continuous culture with controlled environmental variables to measure fitness. CellASIC ONIX, microbial microchemostat systems.

Why Bayes? Advantages Over Frequentist Methods for Noisy Biological Data

Quantitative Advantages in Noisy Data Scenarios

Table 1: Performance Comparison of Methods on Synthetic Noisy Data (n=1000 simulations)

Metric Frequentist (GLM) Bayesian (MCMC) Notes
Mean Absolute Error (β) 0.45 ± 0.12 0.28 ± 0.08 True β = 1.0, SNR=2
95% CI Coverage 88% 94% Bayesian uses Credible Interval
Handling of Missing Data Listwise deletion or imputation Direct modeling within posterior 15% missing data simulated
Run Time (seconds) 1.2 ± 0.3 152.7 ± 25.4 (warm-up) / 45.1 ± 8.2 (sampling) Hardware: 8-core CPU, 32GB RAM
False Positive Rate 0.065 0.048 α=0.05 threshold

Table 2: Application to Fitness Cost Estimation in Antimicrobial Resistance (AMR) | Parameter | Frequentist MLE Estimate (SE) | Bayesian Posterior Median (95% CrI) | Biological Interpretation | | :--- | :--- | :--- | : :--- | | Growth Rate Cost (c) | -0.32 (0.15) | -0.29 (-0.51, -0.08) | Fitness cost of resistance mutation | | Benefit in Drug (b) | 1.85 (0.42) | 1.91 (1.15, 2.78) | Growth advantage in antibiotic | | Hill Coefficient (n) | 2.1 (Fixed) | 2.3 (1.7, 3.1) | Estimated cooperativity | | Half-max [Drug] (K) | 0.58 µg/mL (0.21) | 0.61 µg/mL (0.25, 1.02) | Estimated from noisy MIC data |

Core Protocols for Bayesian Analysis in Fitness Research

Protocol 2.1: Bayesian Hierarchical Modeling of Noisy Growth Curves

Objective: Estimate bacterial growth parameters and fitness costs from plate reader data with high technical noise.

Materials:

  • OD600 measurements over time (n=8 replicates, 4 conditions)
  • Stan or PyMC3 software environment
  • R (rstan, brms) or Python (ArviZ, pandas) for analysis

Procedure:

  • Data Preprocessing: For each well, subtract blank control OD600. Log-transform data: y = log(OD / OD₀).
  • Model Specification: Define a hierarchical logistic growth model:
    • Population-level: μ ~ Normal(0, 1) for average growth rate.
    • Group-level: kᵢ ~ Normal(μ, σ_k) for condition-specific rates.
    • Likelihood: y(t) ~ Normal( A / (1 + exp(-kᵢ*(t - t₀))), σ ).
    • Priors: Use weakly informative: σ ~ HalfNormal(0.1).
  • MCMC Sampling: Run 4 chains, 2000 warm-up iterations, 2000 sampling iterations.
  • Diagnostics: Check R̂ < 1.01, effective sample size > 400 per chain.
  • Posterior Analysis: Extract median and 95% credible intervals for kᵢ. Compute fitness cost as c = 1 - (k_mutant / k_wildtype).

Deliverable: Posterior distributions for all parameters, enabling probabilistic statements: e.g., "Probability that fitness cost > 10% is 0.89".

Protocol 2.2: Bayesian Inference for Dose-Response in Drug Synergy

Objective: Quantify uncertainty in IC50 and Hill slope from noisy dose-response data.

Procedure:

  • Model the observed response R at drug concentration [D] using a 4-parameter logistic model: R ~ Normal( Bottom + (Top - Bottom) / (1 + 10^((LogIC50 - log10[D]) * HillSlope)), σ ).
  • Assign Priors:
    • LogIC50 ~ Normal(log10(mean_estimate), 2)
    • HillSlope ~ Normal(1, 0.5)
    • σ ~ Exponential(1)
  • Incorporate Hierarchical Structure if multiple experimental batches: LogIC50_batch ~ Normal(μ_LogIC50, τ).
  • Sample from Posterior using Hamiltonian Monte Carlo (NUTS).
  • Compute Probabilities of synergy: P(Combination_IC50 < min(Single_IC50s) | Data).

Visualizing Bayesian Workflows and Pathways

bayes_workflow Prior Prior Likelihood Likelihood Prior->Likelihood Combines with Posterior Posterior Likelihood->Posterior Via Bayes' Theorem Inference Inference Posterior->Inference Sample for NoisyData Noisy Biological Data NoisyData->Likelihood Informs

Bayesian Inference Workflow for Noisy Data

fitness_model cluster_hierarchical Hierarchical Structure GrowthRate Growth Rate (k) ObservedOD Observed OD600 GrowthRate->ObservedOD Generates Mutation Resistance Mutation FitnessCost Fitness Cost (c) Mutation->FitnessCost Drug Antibiotic [Drug] Drug->GrowthRate FitnessCost->GrowthRate Reduces StrainLevel Strain-level k StrainLevel->GrowthRate PopulationLevel Population μ, σ PopulationLevel->StrainLevel

Hierarchical Model for Fitness Cost Estimation

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Toolkit for Bayesian Analysis of Biological Data

Item / Reagent Function in Bayesian Analysis Example Product / Software
Probabilistic Programming Language Specifies model, priors, and likelihood for MCMC sampling. Stan (rstan, cmdstanr), PyMC3, Turing.jl
Diagnostic & Visualization Package Assesses chain convergence, visualizes posteriors. ArviZ (Python), bayesplot (R), shinystan
High-Throughput Growth Assay Kit Generates noisy time-series data for fitness estimation. Biolog Phenotype MicroArrays, OD600 plate readers
qPCR Master Mix with High Precision Provides quantification cycle (Cq) data for hierarchical models of gene expression. TaqMan Gene Expression Master Mix, SYBR Green
Bayesian Sample Size Calculator Uses prior information to compute required replicates. R package BayesSampleSize
Markov Chain Monte Carlo (MCMC) Sampler Engine drawing samples from complex posterior distributions. Hamiltonian Monte Carlo (HMC), No-U-Turn Sampler (NUTS)
Gelatin-Based Hydrogel Matrix Creates heterogeneous 3D cell culture environment, modeling tissue noise for drug response studies. Corning Matrigel
Bayesian Clinical Trial Design Software Applies Bayesian adaptive designs for preclinical/early clinical development. FACTS, Trial Architect

Within the thesis on applying Bayesian inference to fitness cost and benefit research, the core components—priors, likelihoods, and posteriors—form the fundamental engine for quantitative estimation. This document provides detailed application notes and protocols for implementing this Bayesian framework in experimental research, particularly relevant to microbial evolution, antibiotic resistance, and therapeutic development.

Core Bayesian Components: Definitions & Applications

Priors: Incorporating Existing Knowledge

The prior distribution encapsulates pre-experimental beliefs about a fitness parameter (e.g., growth rate, selection coefficient s). In drug development, priors can be derived from preclinical data or structural analogs.

Table 1: Common Prior Distributions in Fitness Estimation

Prior Type Mathematical Form Application Context Rationale
Uninformative (Uniform) P(θ) ∝ 1 No prior knowledge; initial high-throughput screen. Maximizes influence of incoming experimental data.
Conjugate (Beta) P(θ) ∝ θ^{α-1}(1-θ)^{β-1} Modeling a probability, e.g., mutation rate. Simplifies computation; α,β can be set from historical data.
Normal (Gaussian) P(θ) ∝ N(μ₀, σ₀²) Prior for a log-fold growth rate. μ₀ based on wild-type strain data; σ₀ reflects uncertainty.
Gamma P(θ) ∝ θ^{k-1}e^{-θ/θ} Prior for a rate parameter (e.g., decay). Ensures parameter positivity.

Protocol 1.1: Eliciting an Informative Prior

  • Gather Historical Data: Compile fitness estimates (e.g., growth rates) for related strains or compounds from literature or internal databases.
  • Fit Distribution: Use maximum likelihood estimation to fit a candidate distribution (e.g., Normal) to the historical data.
  • Quantify Uncertainty: Set the prior variance (σ₀²) to reflect the dispersion of historical data and confidence in its applicability.
  • Sensitivity Analysis: Run the Bayesian model with a range of prior widths to assess impact on the posterior.

Likelihood: Connecting Data to Models

The likelihood function P(Data|θ) quantifies the probability of observing the experimental data given a specific fitness parameter θ.

Common Likelihood Models:

  • Normal Likelihood: For continuous fitness measures (e.g., optical density, plaque size).
    • Data|θ ~ N(θ, σ²), where σ² is experimental noise variance.
  • Poisson/Binomial Likelihood: For count data (e.g., number of resistant colonies, survival counts).
    • Data|θ ~ Binomial(n, p(θ)), where p is a function linking fitness to survival probability.

Protocol 1.2: Constructing a Likelihood Function from Growth Data

  • Experiment: Competitive growth assay of mutant vs. wild-type.
  • Data: Sequencing read counts at time t=0 and t=T.
  • Model: The selection coefficient s is defined by W_{mut}/W_{wt} = 1+s.
  • Likelihood: The observed read count for the mutant at time T, r_T, is assumed to follow a Negative Binomial distribution (accounts for overdispersion):
    • r_T ~ NegBin(mean = r₀ * e^{sT}, dispersion)
  • Parameters to Estimate: s (fitness effect) and the dispersion parameter.

Posterior: The Bayesian Estimate

The posterior distribution P(θ|Data) is the complete Bayesian result, proportional to the prior times the likelihood: P(θ|Data) ∝ P(θ) × P(Data|θ).

Protocol 1.3: Computing and Summarizing the Posterior

  • Method Selection: For conjugate models, calculate directly. For complex models, use Markov Chain Monte Carlo (MCMC) sampling (e.g., Stan, PyMC).
  • Sampling: Run MCMC chains (≥4) to generate samples from the posterior distribution of θ.
  • Diagnostics: Check chain convergence (R-hat statistic ≈ 1.0, effective sample size).
  • Summary: Report the posterior median and 95% credible interval (2.5th to 97.5th percentile of samples).

Integrated Experimental & Computational Workflow

G Prior Prior Model Model: P(θ|Data) ∝ P(θ) × P(Data|θ) Prior->Model Experiment Experiment Likelihood Likelihood Experiment->Likelihood Data Likelihood->Model Posterior Posterior Decision Decision Posterior->Decision Inference Design Design Decision->Design  Informs Model->Posterior Design->Experiment Next Iteration

(Diagram Title: Bayesian Fitness Estimation Workflow)

Key Experimental Protocol: Fluorescent Competitive Growth Assay

This protocol generates high-precision time-series data for likelihood construction.

Objective: Precisely estimate the selection coefficient (s) of a bacterial strain expressing antibiotic resistance.

Materials (Scientist's Toolkit): Table 2: Essential Research Reagents & Materials

Item Function/Description
Isogenic Strains Wild-type and mutant strains, differing only by the allele of interest. Essential for clean fitness comparison.
Fluorescent Proteins Constitutive expression of distinct FPs (e.g., CFP, YFP) for strain differentiation via flow cytometry.
Chemostats or Multi-well Plates Environment for controlled, continuous growth competition.
Flow Cytometer Instrument for high-throughput quantification of strain ratios in the mixed culture.
Luria-Bertani (LB) Broth Standardized growth medium.
Sub-inhibitory Antibiotic Drug pressure to reveal fitness costs/benefits; concentration set at a fraction of MIC.

Procedure:

  • Culture Preparation: Grow monocultures of differentially fluorescent-tagged wild-type and mutant strains to mid-log phase.
  • Initial Mixture: Mix strains at a ~1:1 ratio in fresh medium (+/- antibiotic). Precisely determine the initial ratio (r₀) via flow cytometry (≥100,000 events).
  • Growth Competition: Dilute mixture into fresh medium to initiate exponential growth. Maintain in exponential phase via serial dilution or chemostat.
  • Time-point Sampling: Sample at 5-7 time points over ~15-20 generations.
  • Flow Cytometry Analysis: For each sample, quantify the ratio of mutant to wild-type cells (r_t).
  • Data Processing: Calculate ln(r_t / r₀) for each time point t.

Bayesian Analysis of Data:

  • Define Model: ln(r_t) = ln(r₀) + st + ε, where *ε ~ N(0, σ²).
  • Set Priors: Use weakly informative priors: s ~ N(0, 0.5), σ ~ Exponential(1).
  • Construct Likelihood: Data|s,σ ~ N(ln(r₀) + st, σ²)*.
  • Sample Posterior: Use MCMC to obtain the posterior distribution for s.
  • Interpretation: A posterior for s centered clearly above 0 indicates a fitness benefit; below 0 indicates a cost.

G A Grow Isogenic Fluorescent Strains B Mix at ~1:1 Ratio (Measure r₀ via FC) A->B C Dilute into Growth Medium B->C D +/- Antibiotic (Sub-MIC) C->D E Serial Batch or Chemostat Growth D->E F Sample at Intervals (t1..tn) E->F G Flow Cytometry Analysis F->G H Calculate Ratio r_t G->H I Bayesian Model Fit for 's' H->I

(Diagram Title: Competitive Growth Assay Protocol Flow)

Data Presentation & Interpretation

Table 3: Example Posterior Summaries from a Simulated Resistance Study

Strain (Condition) Prior Used Posterior Median (s) 95% Credible Interval Probability(s > 0) Practical Interpretation
mutA (No Drug) N(0, 0.3) -0.021 (-0.034, -0.008) 0.001 Strong evidence of fitness cost.
mutA (With Drug) N(0, 0.3) 0.152 (0.138, 0.167) 1.000 Strong evidence of fitness benefit.
mutB (No Drug) N(0, 0.3) -0.002 (-0.015, 0.011) 0.411 No decisive evidence for cost or benefit.

Advanced Application: Hierarchical Models for Population Heterogeneity

For populations with sub-structure (e.g., different patient isolates), hierarchical models share information across groups.

Model Structure:

  • Parameter: Fitness s_i for each isolate i.
  • Hierarchical Prior: s_i ~ N(μ, τ), where μ ~ N(0,1) is the population mean fitness, and τ is the between-isolate variance.
  • Advantage: Isolates with sparse data are informed by the group-level distribution (μ, τ), improving estimate precision.

G mu Population Mean μ s1 Fitness s₁ mu->s1 s2 Fitness s₂ mu->s2 s4 Fitness s_k mu->s4 tau Population SD τ tau->s1 tau->s2 tau->s4 D1 Data₁ s1->D1 D2 Data₂ s2->D2 s3 D4 Data_k s4->D4 D3

(Diagram Title: Hierarchical Model for Isolate Fitness)

The rigorous application of priors, likelihoods, and posteriors provides a coherent probabilistic framework for fitness estimation. This approach quantifies uncertainty, integrates diverse data sources, and iteratively refines hypotheses—directly supporting decision-making in evolution-guided drug and therapeutic development.

Application Notes

This document outlines the integration of three foundational biological data sources—genomic sequences, growth rates, and competition assays—within a Bayesian inference framework for the estimation of microbial fitness costs and benefits, particularly in antimicrobial resistance (AMR) research. Accurate estimation is critical for predicting resistance evolution and optimizing treatment strategies.

1. Genomic Sequences: Provide the foundational genotype. High-throughput sequencing (e.g., Illumina, Nanopore) identifies mutations, insertions/deletions (indels), and gene amplifications associated with a phenotype. Within a Bayesian model, sequence data informs the prior probability of a fitness effect based on known functional impacts (e.g., nonsense mutation in an essential gene). The integration of population-level variant calling (using tools like Breseq) allows for the tracking of allele frequency changes over time, a direct input for fitness estimation.

2. Growth Rates: Represent a direct, in-vitro phenotypic measure of fitness under controlled conditions. Metrics include the maximum growth rate (μmax) and carrying capacity (K) derived from optical density (OD) or colony-forming unit (CFU) time-series data. In a Bayesian framework, growth curve data for mutant and reference strains (e.g., in the presence/absence of an antibiotic) provide the likelihood function. Hierarchical models can pool information across technical and biological replicates to separate true fitness effects from experimental noise.

3. Competition Assays: Serve as the gold standard for relative fitness measurement. A mutant strain is co-cultured with a differentially marked wild-type strain, and their ratio is tracked via selective plating or flow cytometry over multiple generations. The selection coefficient (s) is calculated from the log ratio change. This data provides a high-precision likelihood for Bayesian inference, allowing for the integration of prior knowledge from genomics and growth curves to yield robust posterior distributions of fitness costs/benefits, complete with credible intervals.

Bayesian Synthesis: The power of the Bayesian approach lies in combining these heterogeneous data streams. Genomic priors are updated with growth rate likelihoods, and the resulting posteriors can be further informed by competition assay data, progressively reducing uncertainty. This is formalized as: P(Fitness | Data) ∝ P(Data | Fitness) * P(Fitness | Genomic Context)

Table 1: Quantitative Data Summary from Key Data Sources

Data Source Typical Metrics Measurement Technique Data Scale Key Role in Bayesian Model
Genomic Sequences SNP/Indel count, Gene presence/absence, Read depth NGS (Illumina), Long-read (PacBio, Nanopore) Nucleotide Informs prior distributions; identifies candidate causal variants.
Growth Rates μmax (hr-1), Lag time (hr), Carrying capacity (OD) Plate readers, Growth curves (OD600), CFU counts Population Provides likelihood for fitness in defined conditions; moderate precision.
Competition Assays Selection coefficient (s) per generation, Relative Fitness (W) Flow cytometry, Selective plating, PCR Population (ratio) High-precision likelihood; grounds inference in direct competition.

Experimental Protocols

Protocol 1: High-Throughput Genomic Sequencing for Variant Identification

Objective: To identify genetic variants between evolved/mutant strains and a reference genome. Materials: Microbial genomic DNA (≥20 ng/µL), Qubit fluorometer, Illumina DNA Prep kit, sequencing platform (e.g., MiSeq). Procedure:

  • DNA Extraction: Use a validated kit (e.g., DNeasy Blood & Tissue) to extract high-quality genomic DNA. Quantify using Qubit.
  • Library Preparation: Follow the Illumina DNA Prep kit protocol for tagmentation, cleanup, and adapter ligation. Include dual-index barcodes for multiplexing.
  • Quality Control: Assess library fragment size distribution using a Bioanalyzer or TapeStation (target peak: ~550 bp).
  • Sequencing: Pool libraries at equimolar concentrations. Sequence on an Illumina MiSeq platform using a 2x150 bp v3 kit to achieve >50x coverage.
  • Bioinformatic Analysis: a. Quality Trimming: Use Trimmomatic to remove adapters and low-quality bases. b. Alignment: Map reads to the reference genome using BWA-MEM. c. Variant Calling: Identify SNPs and indels using Breseq (in "polymorphism" mode for mixed populations) or GATK Best Practices.

Protocol 2: Microtiter Plate-Based Growth Curve Analysis

Objective: To determine the growth kinetics of strains under controlled conditions. Materials: 96-well flat-bottom plate, plate reader with temperature control and shaking, appropriate sterile growth medium. Procedure:

  • Inoculum Preparation: Grow overnight cultures of test and reference strains. Dilute to a low OD (~0.001) in fresh medium ± stressor (e.g., antibiotic).
  • Plate Setup: Dispense 200 µL of each diluted culture into at least 6 replicate wells. Include medium-only blanks for background subtraction.
  • Measurement: Load plate into pre-warmed (37°C) plate reader. Set protocol: orbital shaking for 5s before each read, measure OD600 every 15 minutes for 24 hours.
  • Data Processing: Subtract the average blank value. Fit the growth data for each well to a model (e.g., Gompertz) using software like growthcurver in R or Prism to extract μmax and carrying capacity.

Protocol 3: Direct Competition Assay for Selection Coefficient Estimation

Objective: To precisely measure the relative fitness of a mutant strain versus a wild-type competitor. Materials: Isogenic strains with differential, neutral markers (e.g., antibiotic resistance, fluorescent proteins), selective agar plates or flow cytometer. Procedure:

  • Initial Co-culture: Mix the mutant and wild-type strains at a 1:1 ratio in a small volume (e.g., 1:1 mix of overnight cultures, then 1:1000 dilution into fresh medium). This is the "input" mixture (T0).
  • Growth: Incubate the co-culture with dilution into fresh medium daily to maintain exponential growth for a set number of generations (e.g., 3-5 serial 1:1000 dilutions over 24-48h).
  • Sampling and Plating: At T0 and after each dilution cycle (e.g., Tfinal), serially dilute samples and plate on both non-selective and selective agar to enumerate total and mutant/wild-type CFUs.
  • Fitness Calculation: Calculate the selection coefficient s per generation: s = ln[(Mt/Wt) / (M0/W0)] / t where M and W are mutant and wild-type counts, and t is the number of generations.

Visualizations

bayesian_integration prior Genomic Sequence Data (Prior Distribution) post Posterior Distribution of Fitness Cost/Benefit prior->post Update lik1 Growth Rate Data (Likelihood 1) lik1->post Update lik2 Competition Assay Data (Likelihood 2) lik2->post Update

Title: Bayesian Inference Workflow for Fitness Estimation

comp_assay mix 1:1 Mixture Mutant + WT growth Serial Batch Culture (Exponential Growth) mix->growth plate Selective Plating on Marker-Specific Agar growth->plate calc Calculate Selection Coefficient (s) plate->calc

Title: Competition Assay Workflow

The Scientist's Toolkit

Table 2: Essential Research Reagents and Materials

Item Function in Experiments Example Product/Catalog
Next-Gen Sequencing Kit Prepares fragmented, adapter-ligated DNA libraries from gDNA for sequencing. Illumina DNA Prep Kit (20018705)
Growth Media (Defined) Provides controlled nutrient environment for reproducible growth rate measurements. M9 Minimal Salts (Sigma M6030)
96-Well Cell Culture Plate Vessel for high-throughput, parallel growth curve monitoring in plate readers. Corning 3603, Flat Clear Bottom
Optical Density (OD) Calibrant Ensures consistency and comparability of OD measurements across instruments. Precisely Absorbance Standard (Starna 21-205)
Neutral Genetic Marker Allows distinction between competing strains without affecting fitness (e.g., for competition assays). Chromosomal Fluorescent Protein (GFP, mCherry) or Antibiotic Resistance Cassette
Selective Agar Plates Used in competition assays to enumerate subpopulations based on marker expression. LB Agar + Kanamycin (50 µg/mL)
High-Fidelity DNA Polymerase For accurate amplification of genetic regions for validation of sequencing variants. Q5 High-Fidelity DNA Polymerase (NEB M0491)
Bayesian Modeling Software Implements statistical inference to integrate data and estimate posterior fitness distributions. Stan (via brms R package), PyMC3

Application Notes: Bayesian Inference of Fitness Landscapes in Drug Resistance

The evolution of drug resistance in pathogens and cancer cells is a canonical example of natural selection in action. Conceptualizing this process on a fitness landscape—a map connecting genotype or phenotype to reproductive success—provides a powerful theoretical framework. In the context of a thesis on Bayesian inference, this approach moves from static landscape visualization to a probabilistic, data-driven estimation of evolutionary parameters.

Core Concept

A fitness landscape for drug resistance is typically high-dimensional, with axes representing genetic mutations (e.g., in viral reverse transcriptase, bacterial beta-lactamase, or oncogenic kinases) and the vertical axis representing fitness, often under a specific drug concentration. The "peaks" represent genotypes with high fitness (resistance), while "valleys" represent low-fitness (sensitive) genotypes. Evolutionary trajectories are walks across this landscape toward peaks.

Bayesian Integration

Bayesian methods are uniquely suited to this problem because they:

  • Incorporate Prior Knowledge: Existing biochemical data on mutation effects (e.g., from deep mutational scans) can be formalized as prior distributions.
  • Quantify Uncertainty: They provide posterior distributions for key parameters (e.g., fitness effect of a mutation, epistatic interactions) rather than point estimates, crucial for predicting evolutionary paths.
  • Leverage Time-Series Data: Using genomic data from serial samples during treatment, Bayesian models can infer the underlying fitness landscape that best explains the observed frequency dynamics of mutations.

Key Inferred Parameters:

  • Fitness Cost (s_cost): The reduction in replication rate associated with a resistance mutation in the absence of the drug.
  • Fitness Benefit (s_benefit): The increase in replication rate conferred by the mutation under specific drug pressure. The net selective coefficient is often a function of drug concentration.
  • Epistasis (ε): The interaction effect between mutations, where the fitness effect of one mutation depends on the presence of another. This shapes the landscape's topography (ruggedness).

Table 1: Estimated Fitness Effects of Common Resistance Mutations (Illustrative Examples)

System Drug Mutation Estimated s_cost (per gen.) Estimated s_benefit (at [IC90]) Net Select. Coeff. (at [IC90]) Key Epistatic Partner
HIV-1 Lamivudine M184V -0.05 ± 0.02 +0.35 ± 0.05 +0.30 K65R (antagonistic)
M. tuberculosis Rifampicin rpoB S450L -0.03 ± 0.01 +0.60 ± 0.10 +0.57 Various (additive)
NSCLC* Osimertinib EGFR T790M -0.02 ± 0.01 +0.40 ± 0.08 +0.38 C797S (synergistic)
P. falciparum Artemisinin kelch13 C580Y -0.08 ± 0.03 +0.15 ± 0.04 +0.07 Various

*Non-Small Cell Lung Cancer

Table 2: Comparison of Bayesian Inference Models for Landscape Reconstruction

Model Name Data Input Key Inferred Parameters Computational Complexity Best For
Wright-Fisher w/ Selection Allele frequency time-series s, N_e (effective pop. size) Low Clonal, well-mixed populations
Mountainscape (Poelwijk et al.) Deep mutational scanning (DMS) fitness Pairwise epistasis (ε), 3D landscape Medium Dense genotype-phenotype maps
BEAR (Bayesian Epistasis Analysis) Growth measurements of mutant libraries High-order epistasis, uncertainty High Complex genetic interactions
Phylogenetic Gibbs Sampler Time-scaled phylogenies Ancestral fitness, selection on branches Very High Pathogen sequence surveillance data

Experimental Protocols

Protocol 1: Deep Mutational Scanning (DMS) to Empirically Map a Fitness Landscape

Objective: Quantify the fitness of thousands of single and double mutants of a target gene across drug concentrations.

Materials: See "Scientist's Toolkit" below.

Procedure:

  • Library Construction: Use site-saturation mutagenesis or oligonucleotide pool synthesis to create a plasmid library encompassing all single amino acid substitutions (and optionally doubles) in the gene of interest (e.g., HIV-1 pol).
  • Viral/Vector Production: Package the mutant library into replication-competent viral vectors (for viruses) or express in a stable bacterial/mammalian cell line.
  • Selection Passages: Infect or treat cells with the mutant library. Split the population into parallel cultures treated with a range of drug concentrations (including no-drug control). Passage for 3-5 generations.
  • Sample Collection: Harvest viral/cellular genomic DNA at passages 0 (input), 1, 3, and 5.
  • High-Throughput Sequencing: Amplify the target gene region via PCR and subject to next-generation sequencing (NGS; Illumina MiSeq/NextSeq).
  • Data Analysis (Bayesian):
    • Count Data: Align sequences and count reads for each variant at each time point and drug condition.
    • Modeling: Use a hierarchical Bayesian model (e.g., in Stan or PyMC3) where the observed read counts are drawn from a Multinomial distribution with probabilities proportional to variant frequency.
    • Fitness Inference: The model estimates a growth rate parameter (s) for each variant in each condition, with priors centered on neutrality (s=0) and sharing information across related variants.
    • Epistasis Calculation: Infer interaction terms (ε) for double mutants by comparing their observed fitness to the expected additive effect of the two single mutations.

workflow_dms start 1. Gene of Interest (e.g., HIV-1 pol) lib 2. Mutant Library Construction start->lib pkg 3. Viral/Cellular Library Production lib->pkg sel 4. Selection Passages (± Drug Gradient) pkg->sel seq 5. NGS Time-Point Sampling sel->seq bayes 6. Bayesian Inference Model seq->bayes out 7. Output: Posterior Distributions for s_cost, s_benefit, ε bayes->out

Bayesian DMS Experimental Workflow

Protocol 2: Longitudinal Population Sequencing & Bayesian Frequency Dynamics

Objective: Infer fitness costs/benefits from evolving pathogen populations sampled from a patient during treatment.

Procedure:

  • Sample Collection: Collect serial biological samples (blood, biopsy, sputum) at regular intervals (e.g., baseline, weeks 2, 4, 8, 12) during a monitored drug treatment regimen.
  • NGS of Target Loci: Extract total DNA/RNA, perform targeted amplicon sequencing of the resistance-associated loci (e.g., full EGFR kinase domain, HIV-1 pol gene) to high coverage (>5000x).
  • Variant Calling: Identify single nucleotide variants (SNVs) and their frequencies at each time point using a calibrated pipeline (e.g., GATK, LoFreq). Filter for sequencing artifacts.
  • Bayesian State-Space Modeling:
    • State: The true, unknown frequency of each variant at each time point.
    • Observation Model: The observed NGS read counts are drawn from a Binomial distribution centered on the true frequency.
    • Process Model: The true frequency evolves according to a Wright-Fisher model with selection, parameterized by the net selective coefficient (s_net). A Gaussian Process prior can be placed on s_net over drug concentration (if measured).
    • Inference: Use Markov Chain Monte Carlo (MCMC) sampling (via BEAST2, STAN) to obtain the posterior distribution of s_net for each major variant, and hyperparameters for population-wide adaptation rates.
    • Priors: Informative priors for s_cost can be set from DMS data (Protocol 1).

bayes_inference data Time-Series Variant Frequency Data obs Observation Model: Reads ~ Binomial(f, Depth) data->obs prior_s Prior: s ~ Normal(μ,σ) (e.g., from DMS) process Process Model: Wright-Fisher Dynamics df/dt = s * f(1-f) prior_s->process prior_freq Prior: Initial Frequency prior_freq->process process->obs True Freq (f) post Posterior Distribution: P(s, f_trajectory | Data) obs->post

Bayesian Model for Frequency Dynamics

The Scientist's Toolkit

Table 3: Key Research Reagent Solutions

Item Function & Application Example/Supplier
Oligo Pool Synthesis Generates comprehensive mutant DNA libraries for DMS. Twist Bioscience, Agilent SureSelect
Error-Prone PCR Kits Introduces random mutations for library generation in vitro. Thermo Fisher GeneMorph II
High-Fidelity PCR Mix Accurate amplification of NGS amplicons from low-input samples. NEB Q5, KAPA HiFi
NGS Library Prep Kits Prepares amplicon or genomic libraries for Illumina sequencing. Illumina Nextera XT
Cell Viability Assays Measures fitness/growth rate (IC50, doubling time) of resistant lines. Promega CellTiter-Glo
Bayesian Modeling Software Platforms for specifying and inferring parameters of custom models. Stan (CmdStanR/PyMC3), BEAST2
Variant Calling Pipeline Software to accurately call low-frequency variants from NGS data. LoFreq, GATK Mutect2
Directed Evolution Systems Continuous culture for experimental evolution under drug pressure. Chemostats, MEGA-plate

A Step-by-Step Bayesian Workflow: From Data to Fitness Parameter Estimates

This document provides application notes and protocols for selecting prior distributions within a Bayesian inference framework to estimate fitness costs and benefits. This work is part of a broader thesis utilizing Bayesian hierarchical models to quantify the evolutionary trade-offs (cost/benefit parameters) of antimicrobial resistance mechanisms in bacterial pathogens, with direct implications for predicting resistance trajectories and informing combination drug therapies.

The following cost/benefit parameters are central to the model. Priors are chosen based on biological plausibility, previous in vitro studies, and computational constraints.

Table 1: Key Cost/Benefit Parameters and Recommended Prior Distributions

Parameter (Symbol) Biological Meaning Typical Prior Distribution Justification & Hyperparameters
Baseline Growth Rate (μ₀) Maximum growth rate of susceptible strain in absence of drug. Log-Normal Positive, right-skewed. μ=ln(1.0), σ=0.5 (hr⁻¹).
Cost of Resistance (c) Reduction in growth rate due to resistance mechanism in drug-free environment. Beta Bounded [0,1]. α=1.5, β=5.0, implying cost is low but non-zero.
Protection Benefit (b) Fractional reduction in drug-induced death rate conferred by resistance. Gamma Positive, allows for diminishing returns. k=2.0, θ=0.5.
Half-Maximal Efficacy (K_D) Drug concentration at which death rate is half-maximal. Inverse Gamma Positive, heavy-tailed to allow for high uncertainty. α=3, β=10 (μg/mL).
Hill Coefficient (n) Steepness of dose-response curve. Truncated Normal Bounded >0. μ=1.5, σ=0.75, min=0.1.

Experimental Protocols for Prior Informantion

Empirical data is required to inform weakly informative or informative priors.

Protocol 3.1:In VitroGrowth Curve Assay for Fitness Cost (c)

Objective: Quantify the growth rate difference between isogenic resistant and susceptible strains in drug-free medium to inform the prior for cost (c).

Materials: See Scientist's Toolkit. Procedure:

  • Inoculate 5 mL of pre-warmed Mueller-Hinton Broth (MHB) with a single colony of either the resistant (R) or susceptible (S) strain. Incubate overnight (37°C, 200 rpm).
  • Dilute overnight cultures to OD₆₀₀ ≈ 0.01 in fresh MHB.
  • Aliquot 200 μL of diluted culture into 96-well microplate wells (n=8 technical replicates per strain).
  • Load plate into a pre-warmed (37°C) plate reader. Measure OD₆₀₀ every 10 minutes for 24 hours with continuous orbital shaking.
  • Data Analysis: For each well, fit the exponential phase (typically OD 0.05 to 0.5) to the model: ln(OD_t) = ln(OD_0) + μ * t. The fitness cost c is calculated as 1 - (μ_R / μ_S). The mean and variance of c across replicates inform the Beta prior hyperparameters.

Protocol 3.2: Time-Kill Curve Assay for Protection Benefit (b)

Objective: Measure the death rates of R and S strains across a range of drug concentrations to estimate the benefit parameter (b).

Procedure:

  • Prepare a 2-fold serial dilution of the target antibiotic in MHB in a deep-well block.
  • Inoculate each drug concentration and a drug-free control with ~10⁶ CFU/mL from mid-log phase cultures of R and S strains.
  • Incubate at 37°C with shaking. Sample 100 μL from each condition at t = 0, 2, 4, 6, and 24 hours.
  • Perform serial dilutions in saline and spot-plate 10 μL drops onto drug-free agar plates. Count colonies after overnight incubation.
  • Data Analysis: For each concentration, estimate the net death rate (δ) from the slope of log₁₀(CFU/mL) vs. time over the first 6 hours. Fit a sigmoidal δ(C) model. The benefit b at a given concentration is 1 - (δ_R(C) / δ_S(C)). These estimates inform the Gamma prior for b.

Visualization of Modeling Framework

G PriorKnowledge Prior Knowledge (Literature, Preliminary Data) ParameterPriors Parameter Priors (c, b, μ₀, K_D) PriorKnowledge->ParameterPriors ExperimentalData Experimental Data (Growth & Kill Curves) BayesianModel Bayesian Hierarchical Growth-Death Model ExperimentalData->BayesianModel ParameterPriors->BayesianModel Posterior Posterior Distributions of Cost/Benefit BayesianModel->Posterior MCMC Sampling ThesisOutput Thesis Output: Quantified Trade-offs, Prediction of Resistance Spread Posterior->ThesisOutput

Title: Bayesian Inference Workflow for Cost/Benefit Analysis

H cluster_pathway Drug Action & Resistance Mechanism Drug Antibiotic Target Bacterial Target Protein Drug->Target Binds Damage Cellular Damage Target->Damage Inhibition Death Cell Death/Growth Arrest Damage->Death Resistance Resistance Gene (e.g., efflux pump, enzyme) Resistance->Drug Modifies/Exports BenefitParameter Benefit (b) Reduces Death Rate Resistance->BenefitParameter CostParameter Cost (c) Resource Drain Resistance->CostParameter BenefitParameter->Death Modulates CostParameter->Target Burden on Cell

Title: Biological Basis of Cost (c) and Benefit (b) Parameters

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for Prior-Informing Experiments

Item / Reagent Function & Relevance to Prior Elicitation
Isogenic Bacterial Strain Pair Resistant (R) and susceptible (S) strains differing only at the resistance locus. Crucial for isolating the cost of the specific mechanism.
Cation-Adjusted Mueller Hinton Broth (CAMHB) Standardized growth medium for antimicrobial susceptibility testing, ensuring reproducible growth and kill rates.
Sterile, Clear 96-Well Microplates For high-throughput growth curve assays in plate readers. Optical clarity is essential for accurate OD measurements.
Automated Plate Reader with Shaking & Incubation Enables continuous, kinetic measurement of optical density (OD600) for precise growth rate (μ) calculation.
Pre-Dried Antibiotic Microdilution Plates Commercial plates with precise, serial-diluted antibiotics for efficient generation of time-kill curve data across concentrations.
Cell Culture Deep Well Blocks (2 mL) Allows for adequate aeration during extended time-kill curve incubations with shaking.
Phosphate Buffered Saline (PBS) For accurate serial dilution of bacterial samples prior to plating for CFU enumeration.
Columbia Blood Agar Plates Non-selective, rich agar for viable colony counting after exposure to drug in time-kill assays.
Bayesian Modeling Software (Stan/pymc3) Computational tool to implement the hierarchical model, specify priors, and perform MCMC sampling to obtain posteriors.

Constructing the Likelihood Function for Common Experimental Assays

Within Bayesian inference frameworks for estimating fitness costs and benefits in microbial evolution or drug resistance studies, the likelihood function is the critical bridge between experimental data and model parameters. It quantifies the probability of observing the collected data given a specific set of parameter values (e.g., growth rate, IC50, mutation rate). This document provides application notes and protocols for constructing likelihood functions from standard experimental assays, enabling rigorous parameter estimation.

Minimum Inhibitory Concentration (MIC) & Growth Assays

Data Type: Quantitative, censored data (e.g., no growth at or above a threshold concentration).

Typical Likelihood Model: Ordered Probit or Interval Censored. The continuous process of bacterial growth inhibition is observed only ordinally (2-fold dilution steps). The likelihood accounts for the probability that the true MIC lies within the reported dilution interval.

Protocol for Broth Microdilution MIC Assay:

  • Prepare Compound Dilutions: Using sterile 96-well plates, prepare two-fold serial dilutions of the antimicrobial agent in cation-adjusted Mueller-Hinton broth across rows.
  • Inoculate Wells: Dilute a log-phase bacterial suspension to ~5 x 10^5 CFU/mL and add equal volume to each well, yielding a final inoculum of ~5 x 10^4 CFU/mL per well.
  • Incubate: Seal plate and incubate at 35°C for 16-20 hours under static conditions.
  • Read Results: The MIC is the lowest concentration at which no visible growth is observed. Include positive (no drug) and negative (no inoculum) controls.

Likelihood Construction: Let ( Ci ) be the ( i )-th tested concentration. The observed outcome is binary: growth (( Yi=1 )) or no growth (( Yi=0 )). A common model assumes a latent variable ( Zi ) representing the effective growth capacity: [ Zi = \beta0 - \beta1 \log{2}(Ci) + \epsiloni, \quad \epsiloni \sim N(0, \sigma^2) ] Growth is observed (( Yi=1 )) if ( Zi > 0 ). The probability of growth at concentration ( Ci ) is: [ P(Yi=1 | \beta0, \beta1, Ci) = \Phi\left( \frac{\beta0 - \beta1 \log{2}(Ci)}{\sigma} \right) ] where ( \Phi ) is the standard normal CDF. The likelihood for all wells is: [ L(\beta0, \beta1, \sigma | \mathbf{Y}, \mathbf{C}) = \prod{i: Yi=1} \Phi\left( \frac{\beta0 - \beta1 \log{2}(Ci)}{\sigma} \right) \times \prod{i: Yi=0} \left[1 - \Phi\left( \frac{\beta0 - \beta1 \log{2}(Ci)}{\sigma} \right)\right] ] Parameters ( \beta_1 ) relates directly to the fitness cost of the drug.

Time-Kill Curve Assays

Data Type: Time-series quantitative data (CFU counts over time).

Typical Likelihood Model: Poisson or Negative-Binomial for count data, often embedded within a deterministic pharmacokinetic/pharmacodynamic (PK/PD) growth model.

Protocol for Time-Kill Experiment:

  • Initial Inoculum: Prepare a bacterial suspension at ~10^6 CFU/mL in fresh broth in multiple flasks.
  • Drug Addition: Add antimicrobial to treatment flasks at predefined multiples of the MIC (e.g., 1x, 4x, 10x MIC). Maintain a drug-free growth control.
  • Sampling: At regular intervals (e.g., 0, 2, 4, 8, 24 hours), remove aliquots from each flask.
  • Viable Count: Serially dilute samples in saline and plate onto non-selective agar. Count colonies after overnight incubation.
  • Data Recording: Record CFU/mL at each time point for each condition.

Likelihood Construction: Let ( N{t} ) be the observed CFU count at time ( t ). The underlying model is often a differential equation (e.g., ( dN/dt = N \times (g - k{\text{max}} C^H / (C^H + EC{50}^H)) )), which predicts a expected count ( \mut ) at time ( t ). Accounting for plating dilution and sampling noise, a Poisson or Negative-Binomial distribution is appropriate: [ Nt \sim \text{Negative-Binomial}(\text{mean} = \mut, \text{overdispersion} = \phi) ] The likelihood function becomes: [ L(\theta | \mathbf{N}) = \prod{t} \frac{\Gamma(Nt + \phi)}{\Gamma(\phi) Nt!} \left( \frac{\phi}{\mut + \phi} \right)^\phi \left( \frac{\mut}{\mut + \phi} \right)^{Nt} ] where ( \theta ) includes growth rate ( g ), maximum kill rate ( k{\text{max}} ), ( EC_{50} ), Hill coefficient ( H ), and overdispersion ( \phi ).

Competitive Fitness Assays

Data Type: Ratio measurements (e.g., relative abundance of two strains via selective plating or sequencing).

Typical Likelihood Model: Beta-Binomial or Multinomial-Dirichlet for proportion data.

Protocol for Direct Competition Experiment:

  • Strain Preparation: Grow reference (e.g., drug-sensitive, marked) and test (e.g., resistant) strains separately to mid-log phase.
  • Mixing: Mix strains at a known initial ratio (e.g., 1:1) in fresh medium, with and without drug pressure.
  • Passaging: Dilute the mixed culture into fresh medium daily for a set number of generations.
  • Sampling and Plating: Sample the mixture at each transfer. Perform serial dilution and plate on both non-selective and selective agars to distinguish strain types by colony morphology or antibiotic markers.
  • Calculate Ratio: Compute the ratio of test to reference colonies.

Likelihood Construction: Let the true proportion of the test strain at time ( t ) be ( pt ), modeled as ( pt = p0 \cdot e^{(s \cdot t)} / (1 - p0 + p0 \cdot e^{(s \cdot t)}) ), where ( s ) is the selection coefficient (fitness difference). Observed counts ( (Kt, Nt) ) (test strain, total) follow a Beta-Binomial to account for technical replication noise beyond simple binomial sampling: [ Kt \sim \text{Beta-Binomial}(n = Nt, \alpha = \phi pt, \beta = \phi (1-pt)) ] The likelihood is: [ L(s, \phi | \mathbf{K}, \mathbf{N}) = \prod{t} \binom{Nt}{Kt} \frac{B(Kt + \phi pt, Nt - Kt + \phi (1-pt))}{B(\phi pt, \phi (1-p_t))} ] where ( B ) is the Beta function and ( \phi ) is a precision parameter.

Dose-Response Viability Assays (e.g., Cell Titer-Glo)

Data Type: Continuous luminescence/fluorescence readings proportional to cell viability.

Typical Likelihood Model: Normal or Student-t distribution around a deterministic Hill curve model.

Protocol for Cell Viability Assay:

  • Plate Cells: Seed adherent or suspension cells in 96- or 384-well plates at a density ensuring linear signal response.
  • Compound Treatment: After cell adherence, add serial dilutions of the compound. Include DMSO vehicle controls and blank wells (medium only).
  • Incubate: Incubate plates for a determined period (e.g., 72 hours).
  • Assay Development: Add a single reagent like Cell Titer-Glo, mix to lyse cells and generate a luminescent signal proportional to ATP present (viable cells).
  • Read and Normalize: Read luminescence. Normalize raw values: % Viability = 100 * (Raw - Blank) / (Vehicle Control - Blank).

Likelihood Construction: Let ( y{ij} ) be the normalized viability (%) for replicate ( j ) at concentration ( Ci ). The expected response is given by a 4-parameter logistic (4PL) Hill model: [ \mui = \text{Bottom} + \frac{\text{Top} - \text{Bottom}}{1 + (Ci / IC{50})^{\text{HillSlope}}} ] The likelihood assuming homoscedastic Normal errors is: [ L(\text{Top, Bottom, IC}{50}, \text{HillSlope}, \sigma | \mathbf{y}) = \prod{i,j} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y{ij} - \mu_i)^2}{2\sigma^2} \right) ] For robustness against outliers, a Student-t distribution with low degrees of freedom (e.g., ν=4) can be substituted.

Table 1: Likelihood Models for Common Assays

Assay Data Type Typical Distribution Key Parameters in θ Notes
MIC Ordinal/Censored Ordered Probit β1 (potency), σ (steepness) Accounts for dilution series intervals.
Time-Kill Time-series counts Negative-Binomial g (growth rate), kmax (kill rate), EC50, H (Hill), ϕ (dispersion) Separates biological process from sampling noise.
Competitive Fitness Proportional counts Beta-Binomial s (selection coefficient), ϕ (precision) Models overdispersion in plating counts.
Dose-Response Continuous signal Normal or Student-t Top, Bottom, IC50, HillSlope, σ (error) 4PL model standard for viability; t-distribution robust to outliers.

Table 2: Linking Assay Parameters to Fitness Costs/Benefits

Estimated Parameter Biological Interpretation in Fitness Context Typical Assay Source
β1 (from MIC Probit) Log2 increase in MIC per unit fitness change; measures resistance cost/benefit. MIC Assay
s (Selection Coefficient) Direct per-generation fitness difference between strains. Competitive Assay
EC50 (from Time-Kill) Drug concentration for half-maximal kill rate; informs pharmacodynamic resistance. Time-Kill Assay
IC50 (from Viability) Concentration for half-maximal cellular inhibition; measures compound potency against a genotype. Dose-Response Assay
H (Hill Coefficient) Steepness of dose-response; can indicate cooperative binding or multi-hit mechanisms. Time-Kill, Dose-Response

The Scientist's Toolkit: Research Reagent Solutions

Item/Reagent Function in Likelihood-Informed Experiments
Cation-Adjusted Mueller-Hinton Broth (CAMHB) Standardized medium for bacterial MIC assays, ensuring reproducible cation concentrations critical for antibiotic activity.
Cell Titer-Glo 2.0 Assay Homogeneous luminescent method to quantify viable cells based on ATP content; generates continuous data for robust dose-response modeling.
Selective Agar Plates (e.g., with Antibiotic) Enables differentiation and counting of specific strains in competitive fitness assays for proportion data collection.
96/384-Well Microplates (Tissue Culture Treated) Standard format for high-throughput dose-response and MIC assays, compatible with automated liquid handlers and plate readers.
DMSO (Cell Culture Grade) Universal solvent for compound libraries; vehicle control essential for normalizing viability assay data.
Multichannel Pipettes & Repeaters Critical for accurate serial dilutions and reagent additions across plate-based assays to minimize technical error.
Automated Colony Counter (or OpenCFU) Increases accuracy and reduces bias in counting colonies from competitive fitness or time-kill plating steps.
Bayesian Inference Software (e.g., Stan, PyMC) Computational tool to implement the constructed likelihood functions and perform posterior sampling for parameter estimation.

Visualizations

mic_likelihood C Drug Concentration (C_i) LatentZ Latent Growth Potential (Z_i) C->LatentZ log2 transform Beta Parameters (β₀, β₁, σ) Beta->LatentZ linear predictor ObsY Observed Outcome (Growth: Y_i=1 or 0) LatentZ->ObsY threshold Z_i > 0 ?

Title: Likelihood Model for MIC Assay Data

timekill_likelihood PKPD PK/PD Growth Model (e.g., dN/dt = f(N,θ)) Mu Predicted Mean CFU (μ_t) PKPD->Mu solve ODE Params Parameters (g, kₘₐₓ, EC₅₀, H) Params->PKPD ObsN Observed CFU Count (N_t) Mu->ObsN Disp Dispersion (φ) Disp->ObsN distribution (Neg-Bin)

Title: Likelihood Construction for Time-Kill Data

workflow_bayesian Assay Perform Experimental Assay Data Collect Data (Y, N, CFU, etc.) Assay->Data Likeli Construct Likelihood P(Data | θ) Data->Likeli Bayes Apply Bayes' Theorem P(θ | Data) ∝ P(Data | θ) P(θ) Likeli->Bayes Prior Specify Prior P(θ) Prior->Bayes Post Estimate Posterior P(θ | Data) Bayes->Post Infer Infer Fitness Cost/Benefit Post->Infer

Title: Bayesian Inference Workflow from Assay to Fitness Estimate

Markov Chain Monte Carlo (MCMC) sampling is a cornerstone of modern Bayesian inference, enabling researchers to estimate complex posterior distributions for parameters of interest. Within the context of a broader thesis on Bayesian inference for estimating fitness cost and benefit in antimicrobial resistance research, MCMC provides the computational framework to quantify uncertainty in parameters such as mutation rates, selection coefficients, and compensatory benefit. This guide presents a practical implementation using two leading probabilistic programming languages: Stan (accessed via RStan or PyStan) and PyMC3.

Core Theoretical Framework & Application to Fitness Landscapes

In studying antimicrobial resistance, a key problem is estimating the fitness cost of a resistance-conferring mutation and the potential benefit of secondary compensatory mutations. A Bayesian model allows us to incorporate prior knowledge from in vitro assays and update beliefs with experimental data from growth rate measurements or competitive fitness assays.

The model can be framed as: Data: Observed growth rates ( y{ij} ) for bacterial strain ( i ) under condition ( j ). Parameters: Base growth rate ( \mu ), cost of primary mutation ( \beta{cost} ), benefit of compensatory mutation ( \beta{benefit} ), and interaction terms. Likelihood: ( y{ij} \sim \text{Normal}(\mu + X\beta, \sigma) ), where ( X ) is a design matrix encoding genetic variants. Priors: Informed by previous literature, e.g., ( \beta_{cost} \sim \text{Normal}(-0.1, 0.05) ) representing a plausible mild fitness defect.

MCMC algorithms (e.g., Hamiltonian Monte Carlo in Stan, No-U-Turn Sampler in PyMC3) are used to sample from the joint posterior distribution ( P(\mu, \beta{cost}, \beta{benefit}, \sigma | y) ).

Table 1: Estimated Fitness Parameters for rpoB Mutations in M. tuberculosis from Recent Bayesian Analyses

Mutation Prior Distribution (Cost) Posterior Mean (Cost) 95% Credible Interval Data Source (n) Model Used
S450L Normal(-0.15, 0.1) -0.08 [-0.12, -0.04] In vitro growth (n=12 replicates) Hierarchical, Stan
H445Y Normal(-0.1, 0.05) -0.11 [-0.15, -0.07] Competitive index assay (n=8 mice) Linear, PyMC3
D435V + Comp (C>T) ( \beta{cost} ): Normal(-0.2,0.1), ( \beta{benefit} ): Gamma(2,10) Cost: -0.10, Benefit: +0.12 Cost: [-0.18, -0.03], Benefit: [0.05, 0.20] Longitudinal CFU counts (n=5 time points) Interaction, Stan

Table 2: MCMC Diagnostics Comparison for Different Sampling Algorithms

Software Default Sampler Effective Sample Size (ESS) per sec (mean) (\hat{R}) (target ≤1.01) Divergences (in typical run) Warm-up (Burn-in) Steps
Stan (v2.32) NUTS ~250 1.002 < 1% 1000-2000
PyMC3 (v3.11.5) NUTS ~180 1.003 < 1% 1000-2000

Experimental Protocols for Generating Bayesian Inference Data

Objective: Generate robust preliminary data to inform prior distributions for fitness cost. Materials: Wild-type and isogenic mutant strains, selective and non-selective media, plate reader or colony counter. Procedure:

  • Co-culture wild-type and mutant strains at a 1:1 ratio in liquid medium.
  • Plate serial dilutions onto both non-selective and drug-containing selective agar at T=0h and T=24h.
  • Count colony-forming units (CFUs) for each strain.
  • Calculate the competitive index (CI) = (mutant CFU / WT CFU) at T24 / (mutant CFU / WT CFU) at T0.
  • Log-transform CI values. The mean and variance of log(CI) across 6-8 biological replicates form the basis for a Normal prior on the cost parameter.

Protocol 4.2: Longitudinal Growth Curve Measurement for Likelihood Function

Objective: Collect time-series data for hierarchical growth model fitting. Procedure:

  • Inoculate strains in 96-well plates with medium ± sub-inhibitory drug concentration.
  • Measure optical density (OD600) every 15 minutes for 24 hours in a plate reader.
  • Fit a logistic growth model ( OD(t) = \frac{K}{1+e^{-r(t-t_0)}} ) to each replicate to estimate growth rate ( r ) and carrying capacity ( K ).
  • Use the derived growth rates ( r_{ij} ) as the data vector ( y ) in the Bayesian model. Replicate variance informs the likelihood scale parameter ( \sigma ).

Implementation Workflow and Code Framework

G START Define Scientific Question & Model SPEC Specify Bayesian Model: Priors & Likelihood START->SPEC CODE1 Implement Model in Stan (C++) SPEC->CODE1 CODE2 Implement Model in PyMC3 (Python) SPEC->CODE2 DATA Load & Preprocess Experimental Data CODE1->DATA CODE2->DATA SAMPLE Run MCMC Sampling (NUTS) DATA->SAMPLE DIAG Diagnostic Checks (ESS, R̂, Trace) SAMPLE->DIAG POST Posterior Analysis & Visualization DIAG->POST INF Draw Inference: Cost & Benefit POST->INF

Diagram 1: MCMC Implementation Workflow for Fitness Estimation

G cluster_bayes Bayesian Model for Fitness Cost/Benefit mu Base Rate μ Normal(0,1) y Observed Growth Rate y mu->y beta_c Cost β_c Normal(-0.1,0.05) beta_c->y beta_b Benefit β_b Gamma(2,10) beta_b->y sigma Noise σ HalfNormal(0.1) sigma->y Data Experimental Design Matrix X Data->y

Diagram 2: Probabilistic Graphical Model for Fitness

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Fitness Cost-Benefit Experiments

Item Function in Protocol Example Product/Catalog # Notes for Bayesian Analysis
Isogenic Mutant Strain Set Provides controlled genetic background to isolate mutation effects. KEIO Collection (E. coli) Essential for defining clear levels in the model's design matrix ( X ).
Automated Plate Reader Generates high-density, time-series growth curve data. BioTek Synergy H1 Outputs continuous data preferred for Normal likelihood.
Selective Antibiotic Agar Applies selection pressure to measure competitive fitness. Mueller-Hinton + Rifampicin (1μg/mL) Drug concentration must be standardized to inform prior on effect size.
Cell Counting Kit Quantifies CFUs for competitive index calculation. MilliporeSigma CBC Kit Count data can be modeled with Poisson or Negative Binomial likelihood.
PCR & Sequencing Reagents Validates genotype before/during experiment. Qiagen Multiplex PCR Kit Ensures data is linked to correct genetic parameter.
Probabilistic Programming Software Performs MCMC sampling and inference. RStan (v2.32), PyMC3 (v3.11.5) Primary tool for implementing the Bayesian model.

Detailed Computational Protocol

Protocol 7.1: Model Implementation in Stan

Objective: Code a hierarchical model for growth rates with cost/benefit parameters.

Protocol 7.2: Sampling & Diagnostics in PyMC3

Objective: Run MCMC, assess convergence, and visualize the posterior.

Key Diagnostics: Check az.summary for Rhat ≈ 1.0 and high ess_bulk. Use az.plot_trace to assess chain mixing and stationarity.

Interpreting Results in a Therapeutic Context

The posterior distributions for ( \beta{cost} ) and ( \beta{benefit} ) directly inform drug development strategy. A narrow credible interval for a large cost suggests the resistance mutation may not persist without drug selection. A posterior indicating a high compensatory benefit signals potential for rapid resistance stabilization, urging combination therapy approaches. These quantitative, probabilistic outputs enable robust risk assessment for resistance management.

Within the broader thesis on applying Bayesian inference to microbial evolution, this case study details protocols for estimating the fitness costs and benefits of antibiotic resistance in bacterial populations. Bayesian methods allow for the integration of prior knowledge (e.g., growth rates, resistance mechanisms) with experimental data to produce posterior probability distributions for parameters like the selection coefficient (s) and the cost of resistance (c). This is critical for predicting resistance dynamics and informing drug development strategies.

Table 1: Typical Growth Rate Data for Resistant and Sensitive Isogenic Strains

Strain Phenotype Mean Doubling Time (min) ± SD Relative Fitness (W) Estimated Cost (c = 1-W)
Sensitive (Wild-type) 30.5 ± 2.1 1.00 (reference) 0.00
Resistant (Mutant A) 36.8 ± 3.4 0.83 0.17
Resistant (Mutant B) 41.2 ± 2.8 0.74 0.19
Compensated Evolved Mutant A 31.1 ± 2.5 0.98 0.02

Note: Fitness (W) calculated as (μ_sensitive / μ_resistant), where μ = growth rate (1/doubling time).

Table 2: Bayesian Inference Parameters for Fitness Cost Estimation

Parameter Symbol Prior Distribution Typical Posterior Mean (95% Credible Interval) Biological Meaning
Selection Coefficient (Drug-free) c Normal(μ=0.15, σ=0.1) 0.18 (0.12, 0.25) Fitness cost of resistance.
Selection Coefficient (Under Drug) s Normal(μ=0.5, σ=0.2) 0.62 (0.51, 0.78) Fitness benefit of resistance under antibiotic.
Growth Rate (Sensitive) μ_s Gamma(α=100, β=3000) 0.033 min⁻¹ (0.031, 0.035) Inverse of doubling time.

Detailed Experimental Protocols

Protocol 1: Growth Curve Analysis for Fitness Cost Measurement Objective: To determine the in vitro fitness cost of a resistance-conferring mutation in the absence of antibiotic selection. Materials: See "Research Reagent Solutions" below. Procedure: 1. Inoculate 5 mL of cation-adjusted Mueller-Hinton Broth (CAMHB) with a single colony of either the isogenic antibiotic-sensitive or resistant strain. Incubate overnight (37°C, 220 rpm). 2. Dilute the overnight cultures 1:1000 into fresh, pre-warmed CAMHB in a sterile flask. 3. Aliquot 200 µL of each diluted culture into 96-well sterile, optically clear flat-bottom microplates. Include at least 8 technical replicates per strain and blank wells with broth only. 4. Load the plate into a pre-warmed (37°C) plate reader. Measure optical density at 600 nm (OD₆₀₀) every 10 minutes for 24 hours, with continuous orbital shaking. 5. Export data and fit the exponential phase of each growth curve to the model: ln(OD₆₀₀) = ln(OD₀) + μt, where μ is the maximum growth rate. 6. Calculate relative fitness: W = μ_resistant / μ_sensitive. The fitness cost is c = 1 - W.

Protocol 2: Competitive Fitness Assay for Bayesian Inference Objective: To generate time-series data on population frequencies for robust Bayesian estimation of selection coefficients. Procedure: 1. Prepare differentially marked strains (e.g., resistant strain with a neutral fluorescent marker; sensitive strain without). 2. Mix the strains at a known initial ratio (e.g., 1:1) in drug-free medium and under sub-MIC antibiotic pressure (e.g., 1/4x MIC) in separate flasks. 3. Serially passage the co-cultures every 24 hours by diluting 1:1000 into fresh medium (± antibiotic). Maintain for 5-10 generations. 4. At each passage, plate dilutions on selective and non-selective agar to enumerate colony-forming units (CFUs) for each strain. 5. Calculate the frequency of the resistant strain (p) over time. 6. Analyze data using a Bayesian Markov Chain Monte Carlo (MCMC) model. The likelihood function can model frequency change as: p_{t+1} = (p_t * (1+s)) / (1 + p_t * s), where s is the selection coefficient to be estimated (negative for cost, positive for benefit).

Visualizations

workflow Start Define Research Question (e.g., Cost of Beta-lactam resistance) P1 Prior Knowledge (e.g., Known growth penalty) Start->P1 Exp Experimental Data (Competition assay time-series) Start->Exp Model Bayesian Model (Likelihood: Selection equation) (Priors: Distributions for c, s) P1->Model Exp->Model Infer MCMC Sampling (Estimate posterior distributions) Model->Infer Post Posterior Analysis (Cost c, Benefit s, Credible Intervals) Infer->Post

Title: Bayesian Workflow for Resistance Cost Estimation

pathway Antibiotic Beta-lactam Antibiotic PBPs Penicillin-Binding Proteins (PBPs) Antibiotic->PBPs Binds CellWall Peptidoglycan Synthesis Antibiotic->CellWall Inhibits PBPs->CellWall Catalyzes Lysis Cell Lysis CellWall->Lysis Mutation Resistance Mutation (e.g., PBP2a) Binding Reduced Antibiotic Binding Mutation->Binding CostNode Fitness Cost (Altered PBP function, Metabolic burden) Mutation->CostNode Binding->CellWall Protected

Title: Resistance Mechanism & Fitness Cost Origin

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Fitness Cost Experiments

Item Function & Application
Isogenic Bacterial Strain Pairs Resistant mutant and its sensitive parent strain; essential for attributing fitness differences solely to the resistance determinant.
Cation-Adjusted Mueller Hinton Broth (CAMHB) Standardized, reproducible growth medium for antimicrobial susceptibility and fitness testing.
96-Well Cell Culture Microplate (Sterile, Optical Bottom) For high-throughput, replicate growth curve analysis in plate readers.
Plate Reader with Temperature Control & Shaking Enables automated, precise kinetic monitoring of optical density for growth rate calculation.
Fluorescent Protein Markers (e.g., GFP, mCherry) To differentially label competing strains for easy enumeration in competitive fitness assays.
Selective Agar Plates Containing antibiotic or counter-selection agents to determine CFUs of specific strains from a mixture.
MCMC Software (e.g., Stan, PyMC3) Probabilistic programming languages to implement custom Bayesian models for parameter estimation.
Automated Liquid Handling System For accuracy and reproducibility in serial passaging and high-throughput assay setup.

Abstract This application note details a Bayesian inference framework for quantifying the fitness costs and benefits of oncogenic mutations. Within the broader thesis on computational oncology, this protocol provides a method to translate bulk or single-cell sequencing data into probabilistic estimates of clonal fitness, enabling the dissection of driver pathway interactions and prediction of therapeutic resistance.

1. Introduction: A Bayesian Framework for Fitness Estimation Tumor evolution is driven by somatic mutations that confer selective fitness advantages. The net fitness effect of a mutation is a combination of its intrinsic oncogenic benefit and context-dependent costs. This case study outlines a protocol to model these parameters using a Bayesian approach, which incorporates prior biological knowledge and uncertainty from genomic data to posterior fitness distributions.

2. Core Model and Quantitative Data The fundamental model describes the growth of a clone i with mutation m over time t: N_i(t) = N_i(0) • exp((b_m - c_m - Σ_j I_{ij}) • t) where b_m is the benefit, c_m is the cost, and I_{ij} represents interference from other clones.

Table 1: Model Parameters and Typical Prior Distributions

Parameter Symbol Description Typical Prior (Distribution)
Fitness Benefit b_m Net proliferation/survival advantage. Gamma(k=2, θ=0.05)
Fitness Cost c_m Cost from genomic instability, immunogenicity. Gamma(k=1, θ=0.02)
Selection Coefficient s_m Net selective advantage (b_m - c_m). Normal(μ=0.1, σ=0.15)
Clonal Interference I Competitive suppression between co-occurring clones. Exponential(λ=1.0)
Measurement Noise σ Error in VAF measurement. HalfNormal(σ=0.02)

Table 2: Example Posterior Estimates for Common Oncogenic Mutations

Mutation (Gene) Pathway Median Benefit (b_m) [90% CrI] Median Cost (c_m) [90% CrI] Inferred Net s
BRAF V600E MAPK 0.21 [0.17, 0.26] 0.08 [0.04, 0.13] 0.13
PIK3CA H1047R PI3K-AKT 0.16 [0.12, 0.20] 0.06 [0.03, 0.10] 0.10
KRAS G12D RTK/MAPK 0.19 [0.15, 0.24] 0.10 [0.06, 0.15] 0.09
EGFR L858R RTK 0.23 [0.19, 0.28] 0.05 [0.02, 0.09] 0.18

3. Experimental Protocols

Protocol 3.1: Input Data Generation from Bulk Whole-Exome Sequencing Objective: Derive longitudinal clonal fraction data for fitness inference. Steps:

  • Sequencing: Perform WES on tumor-normal pairs at multiple time points (e.g., diagnosis, relapse).
  • Variant Calling: Use callers (e.g., Mutect2) to identify somatic SNVs/Indels.
  • Clonal Decomposition: Input VAFs into a phylogenetic deconvolution tool (e.g, PyClone-VI) to estimate cancer cell fractions (CCF) for each mutation cluster.
  • Data Structuring: Format output into a table: [Timepoint, Clone_ID, CCF, Read_Depth].

Protocol 3.2: Bayesian Model Implementation via Markov Chain Monte Carlo (MCMC) Objective: Sample from the posterior distribution of fitness parameters. Steps:

  • Define Model in Probabilistic Language: Implement the likelihood (e.g., CCF ~ Normal(predicted_CCF, σ)) and priors (Table 1) in PyMC3 or Stan.
  • Specify Sampling: Run 4 independent MCMC chains with 5000 tuning steps and 10000 draws per chain.
  • Convergence Diagnostics: Ensure Gelman-Rubin statistic (R-hat) < 1.05 and high effective sample size (ESS).
  • Posterior Analysis: Extract median and credible intervals (CrI) for b_m, c_m, s_m. Visualize posterior distributions and pairwise correlations.

Protocol 3.3: In Vitro Validation via Competitive Proliferation Assay Objective: Experimentally measure relative fitness of isogenic cell lines. Steps:

  • Cell Engineering: Create isogenic pairs (mutant vs. wild-type) via CRISPR-Cas9, each tagged with unique fluorescent barcodes (e.g., GFP vs. RFP).
  • Co-Culture: Mix cells at a 1:1 ratio in triplicate. Maintain in log phase for 20 generations.
  • Flow Cytometry: Sample every 3-4 days to quantify GFP/RFP ratio.
  • Fitness Calculation: Fit the log ratio over time to a linear model. The slope = inferred experimental selection coefficient s_exp.

4. Visualizations

G Data Genomic Data (Longitudinal CCFs) Model Bayesian Growth Model Data->Model Likelihood Prior Priors (biological constraints) Prior->Model Post Posterior Distributions of b, c, s Model->Post MCMC Sampling Infer Biological Inference Post->Infer Interpretation

Bayesian Fitness Inference Workflow

pathway GF Growth Factor RTK Receptor Tyrosine Kinase (RTK) GF->RTK P3K PI3K (Mutant) RTK->P3K Activates BRA BRAF (Mutant) RTK->BRA via RAS AKT AKT P3K->AKT PIP3 Cos Fitness Costs: Immunogenicity ↑ Energetic Demand ↑ Proteotoxic Stress ↑ P3K->Cos mTO mTORC1 AKT->mTO Ben Fitness Benefits: Proliferation ↑ Apoptosis ↓ Metabolism ↑ mTO->Ben MEK MEK BRA->MEK BRA->Cos ERK ERK MEK->ERK ERK->Ben

Oncogenic Signaling & Fitness Trade-Offs

5. The Scientist's Toolkit

Table 3: Key Research Reagent Solutions

Item Function in Protocol Example/Supplier
PyClone-VI Bayesian clustering of mutations into clonal populations from sequencing data. (https://github.com/Roth-Lab/pyclone-vi)
PyMC3/Stan Probabilistic programming frameworks for defining and fitting custom Bayesian models. Probabilistic programming language libraries.
Fluorescent Cell Barcodes (GFP/RFP) Enables precise tracking and ratio quantification of competing cell lineages in vitro. Lentiviral vectors (e.g., Addgene).
CRISPR-Cas9 Knock-in Kits For precise introduction of oncogenic mutations into isogenic cell lines. Synthetic gRNA & HDR donors.
Targeted Inhibitors Used to probe fitness dependencies (e.g., Vemurafenib for BRAF V600E). Selleck Chemicals, MedChemExpress.
UMI Sequencing Adapters Reduces PCR errors in sequencing, critical for accurate VAF measurement. Illumina TruSeq Unique Dual Indexes.

Overcoming Challenges: Diagnosing and Fixing Common Bayesian Model Issues

Diagnosing MCMC Convergence Failures and How to Resolve Them

Application Notes on MCMC Diagnostics within Bayesian Fitness Inference

In the context of Bayesian inference for estimating fitness costs and benefits in pathogens (e.g., drug resistance evolution), Markov Chain Monte Carlo (MCMC) is the computational engine. Convergence failures lead to biased estimates of posterior distributions, directly impacting conclusions about selection pressures. These notes outline diagnostic protocols and solutions.

1. Core Quantitative Diagnostics for Convergence Assessment

Effective diagnosis requires multiple, complementary metrics. The following table summarizes key diagnostic quantities and their interpretation.

Table 1: Key Quantitative Diagnostics for MCMC Convergence

Diagnostic Target Value Calculation Method Interpretation of Failure
Potential Scale Reduction Factor (R̂) R̂ ≤ 1.05 Variance of pooled chains vs. average within-chain variance. Chains have not mixed; likely trapped in local modes.
Effective Sample Size (ESS) ESS > 400 (per chain) Accounts for autocorrelation: ESS = N / (1 + 2∑ρₜ). High autocorrelation; insufficient independent samples.
Monte Carlo Standard Error (MCSE) MCSE < 5% of posterior sd. Standard error of the posterior mean estimate. High uncertainty in point estimates despite many samples.
Trace Plot Visual Inspection Stationary, well-mixed "fuzzy caterpillar". Plot of sampled parameter value vs. iteration. Non-stationarity (drift), poor mixing, or multi-modality.
Autocorrelation Plot Rapid decay to near zero. Correlation between samples at lag t. High autocorrelation indicates inefficient sampling.
Geweke Diagnostic (Z-score) |Z| < 2 Compares means from early vs. late segments of a single chain. Chain non-stationarity.

2. Detailed Experimental Protocols for Diagnosis

Protocol 1: Comprehensive Multi-Chain Diagnostic Workflow

Objective: To robustly assess convergence of MCMC sampling for a hierarchical model estimating fitness costs (e.g., cost of a resistance mutation, β_mut).

Materials: MCMC output (4 independent chains, each with ≥ 10,000 post-warmup iterations). Software: Stan/HMC-based sampler, bayesplot, posterior R packages, or arviz in Python.

Procedure:

  • Chain Initialization: Run 4 chains from dispersed starting points (e.g., drawn from a prior wider than the expected posterior).
  • Warmup/Adaptation: Discard the first 50% of each chain as warmup.
  • Trace Plot Generation: For each key parameter (β_mut, hierarchical standard deviations), plot all chains overlaid.
  • Calculate R̂ and Bulk/Tail ESS: Compute using the rank-normalized, folded-split R̂. Report bulk-ESS (for centrality) and tail-ESS (for extremes).
  • Autocorrelation Analysis: Plot autocorrelation for lags up to 50 for each chain. Note the lag at which correlation drops below 0.1.
  • Posterior Predictive Checks: Simulate replicated data from posterior draws. Compare visually and quantitatively to observed data.

Protocol 2: Diagnosing Specific Pathologies

  • Symptom: High R̂ (>1.1) and chains separated in trace plots.
    • Investigation: Run chains longer. If separation persists, investigate model specification: check priors, likelihood, or parameter identifiability via prior-posterior overlap checks.
  • Symptom: Low ESS despite many iterations, with high autocorrelation.
    • Investigation: This indicates poor exploration efficiency. For Hamiltonian Monte Carlo (HMC), examine the divergent_transitions statistic and the accept_stat (step size adaptation). Many divergences point to regions of high curvature in the posterior.

3. Visualization of Diagnostic Logic and Workflow

G Start Run MCMC (4+ Dispersed Chains) T1 Trace Plot & R̂ Check Start->T1 T2 ESS & Autocorrelation Check T1->T2 R̂ ≤ 1.05 Fail Diagnosed Failure T1->Fail R̂ > 1.05 Chains Separated T3 Divergence & E-BFMI Check (HMC Specific) T2->T3 ESS > 400 Low ACF T2->Fail ESS << 400 High ACF T4 Posterior Predictive Check T3->T4 Divergences ~0 E-BFMI > 0.3 T3->Fail Excess Divergences or Low E-BFMI Pass Convergence & Fidelity Confirmed T4->Pass Simulated Data Matches Observed T4->Fail Systematic Mismatch

Title: MCMC Convergence Diagnostic Decision Workflow

4. The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Computational Tools for MCMC Diagnostics

Tool / Reagent Function / Purpose Example Implementation
No-U-Turn Sampler (NUTS) Adaptive HMC algorithm that automates path length. Reduces tuning burden. stan, PyMC, TensorFlow Probability.
Divergence Diagnostics Identifies where sampler cannot explore geometry of posterior, indicating model issues. check_divergences() in bayesplot (R).
Energy Bayesian Fraction of Missing Info (E-BFMI) Diagnoses inefficient sampling due to poorly-chosen mass matrix or step size in HMC. mcse.effective_sample_size in arviz.
Rank Plots Visual alternative to R̂; chains should be uniform if mixed. plot_rank() in bayesplot or arviz.
Prior Predictive Checks Simulates data from the prior before observing data to validate model structure. rstantools::prior_predictive() (R).
Simulation-Based Calibration (SBC) Global validation test: ranks of posterior draws should be uniform if inference is valid. SBC package (R).

5. Resolution Protocols for Common Failures

Protocol 3: Resolving High R̂ and Poor Mixing

Issue: Chains sampling different posteriors.

  • Action 1 (Model Reparameterization): For hierarchical models (e.g., per-strain fitness deviations), use non-centered parameterization:
    • Original: cost_strain ~ normal(μ, σ);
    • Reparameterized: z_strain ~ normal(0,1); cost_strain = μ + σ * z_strain;
  • Action 2 (Pilot Analysis for Identifiability): Conduct a simulation where true parameters are known. Ensure the model can recover them.

Protocol 4: Resolving Low ESS and High Autocorrelation

Issue: Inefficient exploration.

  • Action 1 (Increase Target Acceptance Rate): For HMC/NUTS, increase adapt_delta (e.g., from 0.8 to 0.95 or 0.99). This reduces step size and divergences but increases computation.
  • Action 2 (Model Simplification): Remove weakly identified parameters, fix some to sensible values, or use stronger (but still sensible) priors to regularize the inference.

Protocol 5: Addressing Divergent Transitions in HMC

Issue: Sampler cannot navigate regions of high curvature.

  • Action 1 (Increase adapt_delta): Primary remedy.
  • Action 2 (Reparameterize with Soft Constraints): Instead of hard boundaries (e.g., real<lower=0> sigma;), use an unconstrained variable and transform (e.g., log_sigma), smoothing the geometry.

Within the broader thesis on Bayesian inference for estimating fitness costs and benefits in antimicrobial resistance and cancer biology, prior specification is a critical step. The prior probability distribution formalizes existing knowledge before observing new experimental data. This Application Note provides protocols for conducting rigorous sensitivity analysis to quantify how variations in prior choice influence posterior estimates of key parameters like the fitness cost (c) of a resistance mutation or the benefit (b) of a compensatory mutation.

Core Protocol: Prior Sensitivity Analysis Workflow

Objective: To systematically evaluate the robustness of posterior inferences to changes in prior distribution assumptions.

Materials & Computational Environment:

  • Bayesian statistical software (e.g., Stan via cmdstanr/rstan, PyMC, JAGS).
  • Programming environment (R or Python).
  • Dataset containing growth rate measurements or competitive fitness indices for wild-type and mutant strains.

Procedure:

  • Define the Base Model: Specify your likelihood function (e.g., Normal distribution for log-fold change in growth rates) and a base/reference prior (e.g., Weakly Informative: Normal(μ=0, σ=10) for log fitness cost).
  • Specify the Prior Sensitivity Grid: Define a set of alternative prior families and hyperparameters that encapsulate plausible alternative states of knowledge.
    • Vague Prior: e.g., Uniform(lower=-100, upper=100).
    • Informative Prior (Literature-Based): e.g., Normal(μ=-0.1, σ=0.05) based on prior studies.
    • Different Prior Families: Compare Normal, Cauchy (heavy-tailed), and Gamma (for positive-only parameters) distributions.
  • Fit Models: Execute Markov Chain Monte Carlo (MCMC) sampling for the model under each prior specification in the grid.
  • Extract Key Quantities: For each model fit, record the posterior summaries for the target parameters (e.g., posterior mean, median, and 95% Credible Interval (CrI) for fitness cost c).
  • Comparative Analysis: Compare the posterior distributions across the prior grid. Focus on clinically or biologically decisive conclusions (e.g., Is c > 0.5 with high probability?).

workflow Start Start: Define Base Likelihood Model Grid Define Prior Sensitivity Grid Start->Grid Fit Fit Models via MCMC (All Priors in Grid) Grid->Fit Extract Extract Posterior Summaries Fit->Extract Compare Compare Posteriors Across Priors Extract->Compare Robust Conclusion Robust? Compare->Robust Robust->Grid No, refine grid Report Report Results with Sensitivity Boundaries Robust->Report Yes

Title: Prior Sensitivity Analysis Workflow

Case Study: Fitness Cost of agyrAMutation inE. coli

Experimental Data Summary (Synthetic): Competitive fitness index (WT vs. Mutant) from 10 replicate experiments.

Replicate Fitness Index (Mean) Standard Error
1 0.85 0.08
2 0.92 0.07
... ... ...
10 0.88 0.09

Prior Sensitivity Grid:

Prior Label Distribution Family Hyperparameters Rationale
P1: Vague Normal mean=0, sd=100 Minimal information
P2: Weakly Informative Normal mean=0, sd=10 Regularizing, plausible range
P3: Informative (Costly) Normal mean=-0.2, sd=0.15 Expect moderate fitness cost
P4: Heavy-Tailed Cauchy location=0, scale=2.5 Allows for outliers

Results of Sensitivity Analysis: Posterior summaries for the mean fitness cost (1 - fitness index).

Prior Used Posterior Mean (Cost) 95% Credible Interval Posterior SD
P1: Vague 0.138 (0.065, 0.215) 0.038
P2: Weakly Informative 0.136 (0.064, 0.212) 0.038
P3: Informative 0.145 (0.080, 0.215) 0.034
P4: Heavy-Tailed 0.137 (0.063, 0.216) 0.039

Interpretation: The posterior inference (mean cost ~0.14) is stable across all prior choices, with overlapping credible intervals. The conclusion of a moderate fitness cost is robust to prior specification.

Protocol for Prior-Data Conflict Assessment

Objective: Diagnose when the chosen prior is in strong conflict with the observed data.

Procedure:

  • Prior Predictive Check: Simulate hypothetical datasets y_rep from the prior predictive distribution (prior only).
  • Calculate Test Quantity: For each y_rep, compute a test statistic T(y_rep) (e.g., mean, variance).
  • Compare to Observed Data: Plot the distribution of T(y_rep) and mark the observed T(y_obs). If T(y_obs) lies in the extreme tails, a prior-data conflict exists.
  • Remediation: If conflict is detected, re-evaluate prior choices with domain experts.

conflict Prior Chosen Prior Distribution Sim Simulate Prior Predictive Data y_rep Prior->Sim Tcalc_rep Calculate Test Statistic T(y_rep) Sim->Tcalc_rep Compare2 Compare Distributions of T(y_rep) vs T(y_obs) Tcalc_rep->Compare2 Tcalc_obs Calculate Test Statistic T(y_obs) Tcalc_obs->Compare2 Conflict Prior-Data Conflict? Compare2->Conflict Accept Proceed with Inference Conflict->Accept No Revise Revise Prior Specification Conflict->Revise Yes

Title: Prior-Data Conflict Assessment Protocol

The Scientist's Toolkit: Research Reagent Solutions

Item / Solution Function in Bayesian Fitness Analysis
Stan (cmdstanr/rstan) Probabilistic programming language for full Bayesian inference with efficient MCMC (NUTS) sampling.
PyMC Python library for probabilistic programming, enabling flexible model building and variational inference.
BRMS (R package) High-level interface to Stan for fitting complex multilevel models using formula syntax.
bayesplot (R/Python) Essential for posterior predictive checks, MCMC diagnostics, and visualization of prior/posterior distributions.
shinystan Interactive GUI for exploring MCMC output, diagnosing convergence, and visualizing posteriors.
Competitive Fitness Assay Kit Standardized reagents (e.g., fluorescent dyes, selective media) for generating accurate fitness index data.
High-Throughput Microplate Reader Enables collection of dense, longitudinal growth curve data for precise likelihood modeling.

Dealing with Unidentifiability and Weak Data in Fitness Models

Within the framework of a Bayesian inference thesis for estimating fitness costs and benefits, a central challenge arises from model unidentifiability, particularly when data is sparse or noisy. Unidentifiability occurs when multiple combinations of model parameters yield identical likelihoods, making unique estimation impossible. Weak data exacerbates this issue, leading to overly broad, uninformative posterior distributions. This document provides application notes and protocols to diagnose, manage, and mitigate these challenges in evolutionary fitness and antimicrobial resistance studies.

Table 1: Common Sources of Unidentifiability in Fitness Models

Source Type Description Typical Impact on Posterior
Structural (Non-identifiability) Model symmetry or overparameterization (e.g., product of parameters β*γ). Perfect correlation between parameters; flat or ridged likelihood.
Practical (Weak identifiability) Insufficient or low-information data (e.g., few time points, small population sizes). Very broad, multi-modal posteriors; high posterior correlation.
Algorithmic Inefficiencies in sampling or approximation methods. Failure to explore full parameter space; biased estimates.

Table 2: Strategies for Mitigation and Their Bayesian Interpretation

Strategy Methodological Approach Bayesian Implementation
Incorporation of Prior Information Use mechanistic knowledge or previous studies to constrain plausible values. Formulate informative or regularizing priors.
Data Augmentation Design experiments to target informative measurements (e.g., competition assays at multiple dilutions). Hierarchical modeling that integrates multiple data sources.
Model Reduction Simplify the model to its identifiable core (e.g., fix weakly identifiable parameters). Use Bayesian model selection (e.g., Bayes Factors, LOO-CV) to compare reduced vs. full models.
Reparameterization Express model in terms of identifiable parameter combinations (e.g., fitness cost difference, not absolute values). Sample in transformed parameter space (e.g., using Stan's parameters block).

Experimental Protocols

Protocol 1: In Vitro Competitive Fitness Assay for Weak Data Scenario

Objective: Estimate the relative fitness cost of a drug-resistant mutant compared to a wild-type strain with limited sampling points.

Materials:

  • Isogenic wild-type and mutant bacterial strains.
  • Liquid growth medium with/without sub-inhibitory antibiotic concentration.
  • Automated plate reader or manual spectrophotometer.
  • 96-well deep-well plates and sterile seals.

Procedure:

  • Pre-culture: Grow overnight cultures of both strains independently.
  • Inoculation: Mix strains at a 1:1 ratio in fresh medium. Prepare two conditions: drug-free (control) and drug-containing (selective). Use a minimum of 6 biological replicates per condition.
  • Limited Sampling Growth Curve: To simulate a "weak data" scenario, inoculate plates and measure optical density (OD600) at only three time points: T0 (inoculation), Tmid (mid-exponential phase, ~4-6 hours), and Tfinal (entry to stationary phase, ~24 hours). This contrasts with rich, hourly sampling.
  • Plating for CFUs: At each time point, for each replicate, perform serial dilution and spot-plating on selective and non-selective agar to enumerate each strain population.
  • Data Calculation: Calculate the selection rate coefficient (s) per daily cycle using the formula: s = ln[(M_t/W_t) / (M_0/W_0)] / t, where M and W are mutant and wild-type counts, and t is in days.

Bayesian Integration: The limited data (3 points) will yield a noisy estimate of s. In the Bayesian model, incorporate a prior for s based on known mutation effects (e.g., Normal(μ=-0.1, σ=0.2)) to stabilize inference.

Protocol 2: Flow Cytometry-Based Fitness Assay with Fluorescent Reporters

Objective: Generate higher-resolution fitness data from a single culture by tracking two strains fluorescently, reducing measurement error.

Procedure:

  • Strain Engineering: Tag wild-type and mutant strains with constitutively expressed, spectrally distinct fluorescent proteins (e.g., GFP and mCherry).
  • Co-culture: Mix strains as in Protocol 1. Sample from the same culture every 2 hours over 24 hours.
  • Flow Cytometry Analysis: For each sample, analyze 50,000-100,000 events. Use gating to distinguish strains based on fluorescence and to exclude debris or dead cells.
  • Data Processing: Calculate the ratio of mutant to wild-type fluorescence events at each time point. This provides a dense, longitudinal dataset of population dynamics.

Bayesian Advantage: The high-frequency, low-noise data reduces posterior uncertainty. The growth model can be fit directly to the time-series of ratios using an ODE model within a Bayesian sampler (e.g., using brms or rstan).

Visualizations

Diagram 1: Bayesian Workflow for Weak Fitness Data

workflow Start Weak/Noisy Experimental Data M1 Define Initial Model & Priors Start->M1 M2 Fit Model (MCMC Sampling) M1->M2 M3 Diagnose Posterior M2->M3 M4 Check Identifiability M3->M4 D1 Broad/Correlated Posterior? M4->D1 D2 Identifiable? D1->D2 Yes End Final Inference: Fitness Cost Estimate D1->End No P1 Mitigation Strategies D2->P1 No D2->End Yes P1->M1 Iterate P2 Report Posterior with Caution P1->P2 P2->End

Diagram 2: Key Signaling Pathway in Fitness Cost (e.g., Beta-Lactam Resistance)

pathway BetaLactam Beta-Lactam Antibiotic PBP Penicillin-Binding Proteins (PBPs) BetaLactam->PBP Binds/Inhibits SOS SOS Response Activation BetaLactam->SOS Induces CellWall Peptidoglycan Synthesis PBP->CellWall Lysis Cell Lysis & Death CellWall->Lysis Disruption BLa β-Lactamase Expression SOS->BLa Upregulates BLa->BetaLactam Hydrolyzes Resistance Fitness Cost: Energy Drain & Slow Growth BLa->Resistance Metabolic Burden

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Fitness Model Experiments

Item Function Example/Brand
Fluorescent Protein Plasmids Enable strain differentiation via flow cytometry without plating. pGEN-GFP (Cyan), pGEN-mCherry (Red).
Microplate Reader with Shaking/Incubation Allows automated, high-throughput growth curve data collection. BioTek Synergy H1, Agilent BioTek.
MCMC Sampling Software Performs Bayesian inference on complex, non-linear fitness models. Stan (via cmdstanr, brms), PyMC3.
Identifiability Analysis Tool Diagnoses parameter non-identifiability from posteriors or prior predictive checks. bayesplot (R), ArviZ (Python).
Chemostats or Microfluidics Maintain constant conditions for precise fitness measurement over long periods. CellASIC ONIX2, INFORS HT Minifors.
Selective Agar Plates For colony-forming unit (CFU) counts of specific strains from a mixture. LB Agar + specific antibiotic.
Bayesian Model Visualization Suite Creates trace plots, pair plots, and posterior predictive checks. shinystan, bayesplot.

Computational Optimization for High-Dimensional Genotype-Phenotype Maps

Application Notes

Within a thesis framework utilizing Bayesian inference to estimate fitness costs and benefits, the computational optimization of high-dimensional genotype-phenotype maps is crucial. It enables the efficient exploration of vast genetic landscapes to predict phenotypic outcomes—such as drug resistance, virulence, or therapeutic response—and infer their fitness consequences. This approach is foundational for prioritizing experimental validation and accelerating translational research.

Table 1: Comparison of Optimization & Bayesian Inference Methods for G-P Maps

Method Category Key Algorithms/Tools Dimensionality Handling Fitness Parameter Inference Primary Application in Research
Regularized Regression LASSO, Elastic Net, Ridge Feature selection, shrinkage Posterior distributions of effect sizes Identifying predictive SNP sets for complex traits (e.g., polygenic risk scores).
Dimensionality Reduction PCA, t-SNE, UMAP, Autoencoders Non-linear projection to lower dimensions Inference on latent variables representing genetic modules. Visualizing population structure, clustering phenotypes.
Bayesian Optimization Gaussian Processes, Tree-structured Parzen Estimators Efficient global optimization in high-D spaces Directly optimizes acquisition functions based on posterior models. Guiding adaptive laboratory evolution experiments.
Deep Learning CNNs (for sequences), MLPs, Transformers Automatic feature abstraction via hidden layers Bayesian Neural Networks for uncertainty quantification. Predicting protein function from sequence or CRISPR guide efficiency.
Graphical Models Bayesian Networks, Markov Random Fields Captures conditional dependencies Direct estimation of probabilistic dependencies (e.g., epistasis). Modeling epistatic interactions in fitness landscapes.

Table 2: Typical Software/Platforms & Performance Metrics

Software/Package Core Function Key Performance Metric (Typical Range) Computational Scale
STAN/PyMC3 Probabilistic programming for Bayesian inference MCMC sampling efficiency (~10³-10⁵ iterations) Single node to HPC for hierarchical models.
GPyOpt/BOTORCH Bayesian Optimization Convergence to optimum in ~10²-10³ function evaluations. Medium-scale parameter spaces (d<1000).
DeepSEA/Basenji DL for sequence-to-function maps AUC-PR for regulatory features (0.85-0.95). Requires GPU for training on genome-wide data.
PLINK/REGENIE Large-scale genotype-phenotype association Can handle >1M variants, >500k samples. HPC/Cluster-based for genome-wide analysis.

Experimental Protocols

Protocol 1: Bayesian Ridge Regression for Polygenic Score Estimation with Fitness Cost Inference

Objective: To estimate the polygenic contribution of high-dimensional SNP data to a quantitative phenotype (e.g., bacterial growth rate under drug pressure) and infer the posterior distribution of SNP effect sizes as a proxy for fitness cost/benefit.

Materials: Genotype matrix (VCF file), Phenotype measurements (e.g., growth assay data), High-performance computing (HPC) environment.

Procedure:

  • Data Preprocessing: Use PLINK to filter genotypes for quality control (MAF > 0.01, call rate > 95%). Normalize the phenotype to a mean of 0 and standard deviation of 1.
  • Model Specification (Bayesian Ridge): Define the model in PyMC3:
    • Likelihood: y ~ Normal(mu, sigma)
    • Linear Model: mu = alpha + dot(X, beta)
    • Priors:
      • alpha ~ Normal(0, 1)
      • beta ~ Normal(0, sigma_beta) # Hierarchical prior on coefficients
      • sigma_beta ~ HalfCauchy(1) # Regularizing scale parameter
      • sigma ~ HalfCauchy(1)
  • Inference: Run No-U-Turn Sampler (NUTS) MCMC for 2000 tune steps and 5000 draw steps across 4 chains.
  • Diagnostics & Analysis: Check trace plots and Gelman-Rubin statistic (Rhat < 1.01). Extract the posterior mean of beta as the estimated effect size for each SNP. SNPs with a 95% Highest Posterior Density Interval excluding zero are considered significant.
  • Fitness Integration: Sum the posterior samples of beta for drug-resistant vs. wild-type genotypes to generate a posterior distribution of the predicted fitness differential.

Protocol 2: Bayesian Optimization of a High-Dimensional Combinatorial Genotype Space

Objective: To efficiently identify gene-editing combinations (e.g., CRISPR-mediated edits across 10 target sites) that maximize a desired phenotypic output (e.g., antibody yield) with minimal experimental cycles.

Materials: Phenotypic assay (e.g., FACS, reporter), Library construction capability, Python environment with BoTorch.

Procedure:

  • Initial Design: Construct an initial library of 20-50 random combinatorial genotypes and measure their phenotype.
  • Model Training: Fit a Gaussian Process (GP) model using a Matern kernel to the initial (genotype, phenotype) data. The genotype is encoded as a binary vector.
  • Acquisition Optimization: Maximize the Expected Improvement (EI) acquisition function over the entire high-dimensional combinatorial space (2^10 possibilities) using a combination of gradient methods and random restarts.
  • Candidate Selection: Select the top 5-10 genotype combinations proposed by EI for the next round of experimental construction and phenotyping.
  • Iteration: Update the GP model with new data and repeat steps 3-4 for 5-10 cycles. Use the final posterior mean of the GP to identify the global optimum genotype.

Diagrams

workflow HD_Data High-Dimensional Genotype Data Bayes_Model Bayesian Model (e.g., Sparse Regression) HD_Data->Bayes_Model Input Post_Samples Posterior Samples of Parameters Bayes_Model->Post_Samples MCMC/NUTS Inference Fit_Cost Fitness Cost/Benefit Posterior Distribution Post_Samples->Fit_Cost Biological Interpretation Exp_Val Experimental Validation Fit_Cost->Exp_Val Predictions Thesis Thesis: Inferring Fitness Landscapes Thesis->Bayes_Model

Bayesian Inference Pipeline for Fitness Maps

epistasis SNP1 SNP A SNP2 SNP B SNP1->SNP2 Epistatic Interaction Pheno Phenotype (Fitness) SNP1->Pheno SNP2->Pheno

Epistatic Interaction in a Bayesian Network

The Scientist's Toolkit

Table 3: Key Research Reagent Solutions for G-P Map Validation

Item Function in Validation Example Product/Assay
Saturation Mutagenesis Library Empirically maps sequence variants to phenotype at high resolution. Twist Bioscience Oligo Pools; CRISPR-based variant libraries.
Multiplexed Phenotypic Screening Measures fitness/output for thousands of genotypes in parallel. Flow Cytometry (FACS); CellRox/Annexin V assays for viability; Barcode sequencing (Bar-seq).
In-vivo Fitness Competition Assay Directly quantifies relative growth advantage/cost in a model system. Mouse co-infection models (for pathogens); Pooled competitive growth in bioreactors.
Deep Mutational Scanning (DMS) Pipeline Integrated platform for generating and scoring variant effects. Custom NGS library prep kits; Illumina sequencing; Enrich2 analysis software.
Reporters for Pathway Activity Proxies complex phenotype (e.g., signaling strength) for high-throughput. Luciferase (Firefly/NanoLuc) reporter constructs; GFP transcriptional fusions.

Best Practices for Model Checking and Posterior Predictive Validation

In the context of a thesis on Bayesian inference for estimating fitness costs and benefits in therapeutic intervention research, model checking and posterior predictive validation are critical steps. They move beyond simply obtaining parameter estimates to evaluating whether the chosen model adequately represents the biological reality of drug-target interaction, resistance emergence, and pathogen/host cell fitness. This ensures that conclusions regarding therapeutic benefit and evolutionary cost are reliable for downstream decision-making in drug development.

Core Principles & Quantitative Benchmarks

Table 1: Core Metrics for Model Checking & Validation
Metric/Test Formula/Description Interpretation in Fitness Models Optimal Range/Benchmark
R-hat (Gelman-Rubin) $\hat{R} = \frac{\hat{var}^+(\psi | y)}{W}$; compares between-chain to within-chain variance. Diagnoses non-convergence in MCMC sampling of cost/benefit parameters. $\hat{R} < 1.01$ for all parameters.
Effective Sample Size (ESS) $ESS = N / (1 + 2 \sum{t=1}^\infty \rhot)$; estimates independent samples. Assesses precision of posterior estimates (e.g., mutation fitness cost). Bulk-ESS > 100 per chain; Tail-ESS > 100 for quantiles.
Posterior Predictive P-value $p_B = Pr(T(y^{rep}, \theta) \geq T(y, \theta) | y)$ Global test of model fit; e.g., comparing predicted vs. observed growth rates. $p_B$ close to 0.5 (not extreme).
Leave-One-Out Cross-Validation (LOO-CV) $elpd{loo} = \sum{i=1}^n \log p(yi | y{-i})$; estimated via Pareto-smoothed importance sampling (PSIS). Compares predictive accuracy of competing models of fitness. Higher $elpd_{loo}$ is better; $k < 0.7$ for reliable PSIS.
Bayesian R² $R^2 = 1 - \frac{E_{\theta|y}[Var(y^{rep}|\theta)]}{Var(y)}$ Variance in growth data explained by the fitness model. Context-dependent; compare across models.
Table 2: Common Pitfalls & Diagnostic Signals in Fitness Models
Pitfall Model Checking Signal Remedial Action
Misspecified Likelihood Systematic discrepancies in posterior predictive checks (PPCs); skewed residual patterns. Transform data (e.g., log growth); switch likelihood (e.g., negative binomial for overdispersed counts).
Poor Parameter Identifiability High posterior correlation (>0.9) between parameters (e.g., benefit and cost); divergent MCMC transitions. Re-parameterize model; add weakly informative priors based on prior biological knowledge.
Overfitting LOO-CV $elpd$ significantly lower than model deviance; large $p_{loo}$ values. Simplify model; use regularizing priors (e.g., horseshoe for hierarchical effects).
Inadequate Chain Mixing $\hat{R} >> 1.01$; low ESS; trace plots show "sticky" chains. Increase warm-up iterations; reparameterize; use non-centered hierarchical formulations.

Detailed Experimental Protocols

Protocol 1: Comprehensive Workflow for Posterior Predictive Validation

Aim: To validate a Bayesian model estimating the fitness cost of a drug-resistance mutation. Materials: MCMC samples, observed experimental data (e.g., growth curves, MICs), computing environment (R/Stan, PyMC3, cmdstanr).

  • Model Fitting: Run 4+ independent MCMC chains with sufficient warm-up (≥ 2000 iterations) and sampling (≥ 2000 iterations).
  • Convergence Diagnostics: Calculate $\hat{R}$ and ESS for all primary parameters (growth rate, carrying capacity, cost multiplier). Confirm trace plots are stationary and well-mixed.
  • Generate Replicated Data: Draw $L$ (e.g., 4000) parameter sets $\tilde{\theta}_l$ from the posterior. For each set, simulate a replicated dataset $y^{rep,l}$ using the model's likelihood.
  • Define Test Quantities ($T$): Select both discrepancy measures (e.g., mean growth, variance, min/max) and physics/biology-based checks (e.g., monotonicity of dose-response, sign of fitness difference).
  • Calculate PPCs: For each $T$, compute the distribution of $T(y^{rep})$ and compare to $T(y)$. Plot histograms with observed value marked. Calculate posterior predictive p-values.
  • Prior Sensitivity Analysis: Repeat inference with alternative, reasonable prior distributions. Compare posteriors of key cost/benefit parameters. Significant shifts indicate excessive prior influence.
  • Model Comparison: Fit alternative models (e.g., additive vs. multiplicative cost). Use LOO-CV to compare $elpd_{loo}$. Weight conclusions by model probabilities if differences are substantial.
Protocol 2: Experimental Validation of Predicted Fitness Landscapes

Aim: To empirically test model predictions of pathogen fitness under drug pressure. Materials: Bacterial/viral strains (wild-type, mutant), compound of interest, growth chambers/plate readers, qPCR equipment.

  • In-silico Prediction: From the validated model, predict the growth trajectory over 72h for WT and mutant across a gradient of drug concentrations.
  • In-vitro Competition Assay: a. Co-culture WT and isogenic mutant strains at a defined starting ratio (e.g., 1:1). b. Expose to sub-MIC, MIC, and supra-MIC drug concentrations in triplicate. c. Sample at 0h, 24h, 48h, and 72h. Use strain-specific markers (e.g., differential qPCR, fluorescent tags) to quantify ratios.
  • Data Collection: Calculate the selection rate coefficient (s) or the mutant frequency over time.
  • Comparison: Plot the observed mutant frequency trajectory against the 95% posterior predictive interval from the model. Assess if observed data falls within the predictive envelope.

Visualizations

workflow Start Observed Data (Growth, MIC, etc.) A Specify Bayesian Fitness Model Start->A B Sample Posterior (MCMC) A->B C Convergence Diagnostics (R-hat, ESS) B->C C->A Not Converged D Draw Posterior Parameter Sets C->D Converged E Simulate Replicated Data D->E F Define Test Quantities (T) E->F G Compare T(y_rep) to T(y) F->G H PPC Plots & P-values G->H I1 Model Adequate H->I1 No Systematic Discrepancies I2 Model Rejected/ Revised H->I2 Major Discrepancies

Title: Posterior Predictive Validation Workflow

hierarchy Prior Priors: - Growth Rate - Cost Distribution - Measurement Noise Params Parameters (θ): - WT Fitness (β_wt) - Mutant Fitness (β_mut) - Cost (c = β_wt - β_mut) - Benefit (IC50 shift) Prior->Params Likelihood Likelihood: e.g., Growth ~ Normal(μ(θ), σ) Params->Likelihood Data Observed Data (y): - Growth Curves - Competition Assay Ratios - Dose-Response Data->Likelihood

Title: Bayesian Fitness Model Graph

The Scientist's Toolkit

Table 3: Key Research Reagent & Computational Solutions
Item/Category Specific Example/Product Function in Model Checking & Validation
Probabilistic Programming Framework Stan (via cmdstanr, brms), PyMC3, Turing.jl Enables flexible specification of Bayesian fitness models and efficient posterior sampling.
Diagnostic Software Package bayesplot (R), arviz (Python), shinystan Generates trace plots, R-hat/ESS calculations, and posterior predictive check visualizations.
Model Comparison Tool loo R package, az.compare() in ArviZ Computes LOO-CV, WAIC, and model weights for predictive accuracy comparison.
High-Throughput Growth Assay Bioscreen C, OmniLog, plate readers with gas control Generates precise, reproducible growth curve data for model fitting and validation.
Strain Differentiation Reagent Strain-specific qPCR probes, fluorescent proteins (GFP, RFP), barcoded sequencing primers Enables quantification of strain ratios in competition experiments for direct model testing.
Data Simulator Custom scripts in R/Python using rstantools Generates simulated data under the model for power analysis and method development.

Benchmarking Bayesian Inference: Validation Strategies and Comparison to Alternative Methods

Validating Models with Simulated Data and Known Parameters

Within a thesis on Bayesian inference for estimating fitness costs and benefits of drug resistance, validating the statistical model is a critical step before applying it to real, noisy biological data. Simulation studies, where data is generated from a known probabilistic model with pre-defined parameters, provide a "ground truth" test bed. A model's ability to recover these known parameters under various conditions (e.g., different sample sizes, noise levels) is the strongest proof of its internal validity and informs its limitations.

Core Protocol: Conducting a Simulation-Based Validation Study

This protocol outlines a systematic approach for validating a Bayesian model designed to estimate fitness parameters (e.g., growth rate r, carrying capacity K, resistance cost c, benefit b).

Protocol Steps:

  • Define the True Data-Generating Model: Specify the exact mathematical model and known parameters that will simulate reality. For example:
    • Model: Stochastic Lotka-Volterra competition model between wild-type (WT) and resistant (RES) strains.
    • Known Parameters: r_wt = 0.5, r_res = 0.4, K_wt = 1e6, K_res = 8e5, competition coefficients α_wr = 0.8, α_rw = 1.2, measurement error σ = 0.1.
  • Simulate Datasets: Using the model from Step 1, generate multiple synthetic datasets.
    • Vary key experimental conditions across simulation runs (see Table 1).
    • Incorporate realistic noise (Poisson for counts, Normal for logs, etc.).
  • Apply the Bayesian Inference Model: Fit your proposed Bayesian model (with priors) to each simulated dataset.
  • Recover and Compare Parameters: Extract the posterior distributions (e.g., median and 95% Credible Intervals) for each estimated parameter.
  • Calculate Validation Metrics: Quantify performance using:
    • Bias: Difference between posterior median and true value.
    • Coverage: Percentage of times the true value lies within the 95% Credible Interval.
    • Precision: Width of the credible interval.

Data Presentation: Simulation Scenario Results

Table 1: Simulation Scenarios and Key Outcomes for a Fitness Cost-Benefit Model Scenario designs explore the effect of data quality and model misspecification on parameter recovery.

Scenario ID Experimental Condition Varied True Benefit (b) True Cost (c) Sample Size (n) Estimated b (Median [95% CI]) Estimated c (Median [95% CI]) Coverage (b/c)
S1 Baseline (High quality) 0.15 0.05 100 0.149 [0.142, 0.157] 0.051 [0.043, 0.059] 96% / 94%
S2 Small sample size 0.15 0.05 20 0.145 [0.121, 0.172] 0.055 [0.028, 0.083] 92% / 90%
S3 High measurement noise (σ=0.5) 0.15 0.05 100 0.153 [0.131, 0.178] 0.047 [0.018, 0.075] 93% / 91%
S4 Model Misspecification* 0.15 0.05 100 0.128 [0.116, 0.141] 0.072 [0.062, 0.082] 0% / 0%

*e.g., Data generated with a logistic growth model but fitted with an exponential growth model.

Visualizing the Validation Workflow

G Start Define True Parameters & Data-Generating Model Sim Simulate Synthetic Datasets Start->Sim Known Ground Truth Fit Apply Bayesian Model (Priors + Likelihood) Sim->Fit Synthetic Data Post Obtain Posterior Distributions Fit->Post MCMC Sampling Eval Compare Posteriors to Known Truth Post->Eval Parameter Estimates Metrics Calculate Metrics: Bias, Coverage, Precision Eval->Metrics

Simulation-Based Bayesian Model Validation Workflow

The Scientist's Toolkit: Essential Research Reagents & Software

Table 2: Key Solutions for Simulation & Bayesian Fitness Inference

Item Category Function in Validation Studies
Stan (CmdStanR/PyStan) Software Library Probabilistic programming language for specifying Bayesian models and performing efficient Hamiltonian Monte Carlo (HMC) sampling.
R (brms, rstan) / Python (PyMC, ArviZ) Software Ecosystem Primary languages with packages for model fitting, posterior analysis, and visualization of simulation results.
Synthetic Data Generators Computational Tool Custom scripts (e.g., in R/Python) that implement the exact biological model to produce ground-truth datasets with controllable noise.
High-Performance Computing (HPC) Cluster Infrastructure Enables running hundreds of parallel simulation fits to robustly assess model performance across scenarios.
Gelman-Rubin Diagnostic (R̂) Statistical Tool Checks MCMC chain convergence; essential for ensuring reliable posterior estimates from each simulation run.
Simulation Scenario Table Planning Document Pre-registered plan (like Table 1) that defines the scope of the validation study, ensuring comprehensive and unbiased testing.

Application Notes

This document provides a practical comparison of Bayesian and Frequentist statistical paradigms, applied to the analysis of experimental data relevant to fitness cost and benefit research in microbial evolution and drug development. The focus is on estimating parameters such as mutation rates, selection coefficients, and treatment efficacy.

Core Philosophical & Practical Differences

Aspect Frequentist Approach Bayesian Approach
Probability Definition Long-run frequency of events. Degree of belief or certainty.
Parameters Fixed, unknown constants. Random variables with distributions.
Inference Output Point estimates & confidence intervals. Posterior distributions & credible intervals.
Prior Information Not incorporated formally. Formally incorporated via prior distributions.
Analysis Goal P(data | parameter), maximize likelihood. P(parameter | data), update prior to posterior.
Computational Demand Often lower (optimization). Often higher (integration/MCMC sampling).

Comparative Analysis on a Fitness Cost Dataset

Dataset: In vitro growth rates of antibiotic-resistant Pseudomonas aeruginosa strains vs. wild-type. Objective: Estimate the selection coefficient (s) and its uncertainty.

Results Summary:

Method Point Estimate (s) 95% Interval Interval Interpretation
Frequentist (MLE) -0.042 [-0.068, -0.016] If experiment repeated, 95% of CIs would contain true s.
Bayesian (Weak Prior) -0.044 [-0.069, -0.018] 95% probability true s lies in this interval.
Bayesian (Informative Prior) -0.039 [-0.062, -0.015] Incorporates prior data from similar mutants.

Experimental Protocols

Protocol 1: Frequentist Analysis of Selection Coefficients

Objective: Calculate Maximum Likelihood Estimate (MLE) and confidence interval for selection coefficient.

  • Data Collection: Measure optical density (OD600) of wild-type and mutant strains in triplicate over 24 hours in a plate reader. Calculate growth rate (µ) for each replicate via exponential fit.
  • Compute Difference: Calculate the mean growth rate difference: ∆µ = µmutant - µwildtype.
  • Normalize: Compute selection coefficient: s = ∆µ / µ_wildtype.
  • Estimate Variance: Calculate standard error (SE) of s from replicate measurements.
  • Construct CI: 95% Confidence Interval = s ± 1.96 * SE.

Protocol 2: Bayesian Analysis of Selection Coefficients

Objective: Obtain posterior distribution for selection coefficient using Markov Chain Monte Carlo (MCMC).

  • Define Model: Specify likelihood: sobserved ~ Normal(strue, σ).
  • Specify Prior: Choose prior for s_true (e.g., Normal(µ=0, τ=0.1) for weakly informative).
  • Initialize: Set starting values for MCMC chains.
  • Sample Posterior: Run MCMC (e.g., using Stan or PyMC) with 4 chains, 10,000 iterations each.
  • Diagnose: Check chain convergence (R-hat ≈ 1.0, effective sample size).
  • Summarize: Report posterior median and 95% Highest Posterior Density (HPD) credible interval.

workflow Start Collect Growth Rate Data FM Frequentist Model Start->FM BM Bayesian Model Start->BM MLE Compute MLE & Confidence Interval FM->MLE Prior Define Prior Distribution BM->Prior CI Interpret CI as Long-run Frequency MLE->CI Post Compute Posterior Distribution Prior->Post HPD Interpret HPD as Degree of Belief Post->HPD Comp Compare Estimates & Uncertainty CI->Comp HPD->Comp

Title: Statistical Analysis Workflow Comparison

Protocol 3: Analyzing Drug Efficacy in a Clinical Trial Context

Dataset: Placebo vs. Drug response rates in a Phase II trial. Goal: Estimate odds ratio (OR) for treatment response.

Results Summary:

Method Odds Ratio (OR) 95% Interval p-value / Pr(OR>1)
Frequentist 2.10 [1.15, 3.84] p = 0.015
Bayesian 2.05 [1.18, 3.65] P(OR>1 | data) = 0.998

bayes_update PriorDist Prior Distribution PosteriorDist Posterior Distribution PriorDist->PosteriorDist Bayesian Update Likelihood Likelihood (Experimental Data) Likelihood->PosteriorDist

Title: Bayesian Inference as Prior Update

The Scientist's Toolkit

Research Reagent / Tool Function in Analysis
R with brms/rstanarm packages User-friendly R interfaces for Bayesian regression models using Stan.
Python with PyMC library Flexible Python package for defining and sampling from Bayesian models.
Stan (CmdStanR/CmdStanPy) Probabilistic programming language for full Bayesian inference with MCMC.
JAGS / BUGS Alternative MCMC samplers for Bayesian hierarchical models.
emcee (Python) Ensemble sampler for Affine Invariant MCMC, useful for custom models.
statsmodels (Python) Comprehensive Frequentist statistical testing and modeling.
broom (R) Tidy Frequentist model outputs (estimates, CIs, p-values).
Gelman-Rubin Diagnostic (R-hat) Key convergence statistic for MCMC chains.
ArviZ (Python) Diagnostics and visualization for Bayesian inference.
Weakly Informative Priors Default priors (e.g., Normal(0,1) on z-scores) to constrain without bias.

Integrating Bayesian Fitness Estimates with Population Genetic Predictions

The integration of Bayesian fitness estimates with population genetic models represents a powerful synergy for evolutionary genetics and applied drug development. This approach is framed within a broader thesis that Bayesian inference provides a coherent probabilistic framework for quantifying the fitness costs and benefits of genetic variants, especially in pathogen evolution and cancer genomics. By combining prior knowledge with empirical data, researchers can generate posterior distributions of selection coefficients (s) that are directly usable in predictive population genetic simulations.

Foundational Quantitative Data

Table 1: Comparison of Selection Coefficient (s) Estimation Methods

Method Framework Key Inputs Typical Output Best Use Case
Bayesian Wright-Fisher Frequency time-series Variant allele frequency over time, effective population size (Ne) Posterior distribution of s Experimental evolution, longitudinal clinical isolates
Beta-with-Spikes Cross-sectional frequency Single time-point frequency, estimated mutation rate Probability of negative, neutral, or positive selection Pathogen genomic surveillance
dN/dS (ω) Comparative sequence analysis Multiple sequence alignment, phylogenetic tree Point estimate of purifying/positive selection Deep evolutionary history, conserved genes
POPGENOME (BayeScan) Population differentiation F_ST values across multiple populations Posterior probability of selection per locus Local adaptation studies

Table 2: Common Priors for Selection Coefficients in Bayesian Inference

Prior Distribution Parameters Biological Justification Typical Context
Normal μ ≈ -0.01, σ ≈ 0.1 Most mutations are slightly deleterious Broad-scale genomic analysis
Gamma shape=2, rate=200 Strongly deleterious mutations are rare Drug resistance variant analysis
Uniform e.g., [-0.5, 0.5] Complete uncertainty about effect size Exploratory analysis, novel phenotypes
Spike-and-Slab Mix of point mass at 0 and continuous distribution Many mutations are neutral, some are under selection Genome-wide association studies

Core Application Notes

Note AN-001: Predicting Antibiotic Resistance Emergence

Objective: Forecast the probability and timeline of a specific resistance mutation (e.g., rpoB S450L in M. tuberculosis) reaching a clinically relevant frequency (>1%) in a patient population under specific drug pressure. Integration Pipeline:

  • Bayesian Estimation: Use deep sequencing data from in vitro serial passaging experiments under sub-MIC Rifampicin to generate a posterior distribution for the selection coefficient (s) of the S450L variant.
  • Simulation Input: Use the mean and variance of the posterior distribution of s as parameters in a stochastic Wright-Fisher simulation.
  • Population Parameters: Configure the simulation with realistic effective population size (Ne) for TB within a host (~10^4 - 10^5) and a starting allele frequency (e.g., 10^-6 from de novo mutation).
  • Prediction: Run 10,000 simulations to generate a distribution of "generations to reach 1% frequency." Convert generations to time using the pathogen's generation time.
Note AN-002: Quantifying Fitness Cost of Cancer Driver Mutations

Objective: Determine if a somatically acquired mutation (e.g., BRAF V600E) confers a net clonal fitness benefit in the tumor microenvironment, integrating cellular and microenvironmental data. Integration Pipeline:

  • Cellular Fitness Prior: Derive a prior distribution for s from in vitro assays (e.g., organoid growth rate in standard conditions).
  • In Vivo Likelihood: Use longitudinal bulk or single-cell sequencing data from patient biopsies (pre- and post-treatment) to model allele frequency changes.
  • Bayesian Updating: Combine the in vitro prior with the in vivo likelihood to obtain a robust posterior estimate of the in vivo selection coefficient.
  • Spatial Simulation: Feed the posterior s into an agent-based spatial model of tumor growth to predict spatial clonal expansion patterns and interactions with immune cells.

Detailed Experimental Protocols

Protocol PR-001: Bayesian Estimation ofsfrom Microbial Time-Series Data

Title: Estimating Selection Coefficients from Allele Frequency Time-Series Using a Bayesian Wright-Fisher Model.

I. Materials & Reagents

  • Biological: Microbial culture (e.g., P. aeruginosa), antibiotic of interest.
  • Sequencing: DNA extraction kit, library prep kit for Illumina, targeted amplicon or whole-genome sequencing primers.
  • Software: R (≥4.0) or Python (≥3.8), packages: rstan/cmdstanr (Stan) or pymc3 (PyMC), dplyr/pandas, ggplot2/matplotlib.

II. Procedure Day 1-7: Experimental Evolution & Sampling

  • Start 10-20 independent replicate cultures from a low-passage ancestor.
  • Propagate cultures daily (1:1000 dilution) in medium containing a sub-inhibitory concentration of the antibiotic (e.g., 0.5x MIC).
  • At each passage (e.g., every 2-3 generations), archive 1 mL of culture for DNA extraction. Day 8: Genotyping by Sequencing
  • Extract genomic DNA from all archived time points.
  • Prepare sequencing libraries targeting the locus of interest (amplicon-seq) or perform whole-genome sequencing.
  • Sequence on an Illumina MiSeq or NovaSeq platform to achieve high coverage (>500x). Day 9-10: Variant Calling & Frequency Calculation
  • Map reads to reference genome using BWA-MEM or Bowtie2.
  • Call variants and calculate allele frequencies at each time point using bcftools mpileup/call or breseq (for microbes).
  • Tabulate the frequency of the focal allele over time for each replicate. Day 11-14: Bayesian Inference with Stan
  • Code the Wright-Fisher with selection model in Stan. The model approximates allele frequency change as: p_{t+1} | p_t, s ~ Normal(p_t + s * p_t * (1 - p_t), sqrt( p_t * (1 - p_t) / (2 * Ne) )) where p_t is frequency at time t, s is the selection coefficient, Ne is effective population size.
  • Prepare data list for Stan: vectors of frequencies, time intervals, and an estimate of Ne.
  • Run Hamiltonian Monte Carlo sampling (4 chains, 2000 iterations warm-up, 2000 sampling). Use a weakly informative prior for s (e.g., normal(0, 0.1)).
  • Check convergence (R-hat < 1.05, trace plots).
  • Extract the posterior distribution of s for downstream use.
Protocol PR-002: Integratingsinto a Stochastic Population Genetic Simulation

Title: Forward-in-Time Simulation of Allele Frequency Dynamics Using Estimated Selection Coefficients.

I. Materials & Software

  • Input: Posterior distribution of s from Protocol PR-001.
  • Software: SLiM (version 4.0+) or a custom Wright-Fisher simulator in R/Python.

II. Procedure

  • Parameterization:
    • Set N (census population size) and Ne (effective size), which can be a fraction of N.
    • Set initial allele frequency p0 (e.g., 10^-6).
    • Define a demographic model (constant size, growth, bottleneck).
    • Draw a value of s from its posterior distribution for each simulation replicate. This propagates estimation uncertainty into the prediction.
  • Simulation Code (Pseudocode using Wright-Fisher steps):

  • Execution & Analysis:
    • Run simulations (e.g., 10,000 replicates).
    • Summarize outcomes: proportion of replicates where allele fixes, is lost, or reaches a threshold (e.g., 1%) by a certain time.
    • Plot the distribution of outcomes (e.g., Kaplan-Meier curve for "time to reach 1% frequency") conditional on the posterior of s.

Visualization Diagrams

G Prior Prior Distribution P(s) Bayes Bayesian Inference Engine (e.g., MCMC Sampling) Prior->Bayes Data Experimental Data (e.g., Allele Frequencies) Data->Bayes Posterior Posterior Distribution P(s | Data) Bayes->Posterior Simulation Population Genetic Simulation (e.g., SLiM) Posterior->Simulation s ~ Posterior Prediction Probabilistic Prediction (e.g., Fixation Probability) Simulation->Prediction

Title: Bayesian-Population Genetics Prediction Workflow

G TS Time-Series Frequency Data WF Wright-Fisher Model with Parameter s TS->WF MCMC MCMC Sampler WF->MCMC PriorS Prior on s e.g., Normal(0,0.1) PriorS->MCMC PostS Posterior Distribution of s MCMC->PostS

Title: Bayesian Estimation of Selection Coefficient (s)

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Materials for Integrated Fitness Studies

Item Function & Application Example Product/Kit
Directed Evolution Kit Provides a controlled system for generating and selecting variants for fitness prior estimation. NEB Hi-Fi DNA Assembly Master Mix for library generation; Takara In-Fusion Snap Assembly.
Longitudinal Sampling Kit Enables aseptic, consistent archiving of microbial or cell line samples over time for allele frequency tracking. Qiagen DNeasy Blood & Tissue Kit (for DNA); Zymo Research DNA/RNA Shield for sample stabilization.
Targeted Amplicon-Seq Kit Cost-effective, deep sequencing of specific genomic loci to track variant frequencies with high precision. Illumina Nextera XT DNA Library Prep Kit with custom primers; IDT xGen Amplicon Panels.
Bayesian Modeling Software Robust, probabilistic programming environment for defining custom models and performing MCMC sampling. Stan (via cmdstanr in R or pystan); PyMC3/PyMC4 (Python).
Population Genetics Simulator Forward-time simulator capable of incorporating selection coefficients and complex demography. SLiM (Simulation of Evolution); fwdpy11 (Python/C++).
High-Performance Computing (HPC) Access Essential for running thousands of MCMC iterations and population genetic simulations in parallel. Cloud platforms (AWS, GCP); local cluster with SLURM scheduler.
Variant Caller for Time-Series Specialized tool for accurate frequency estimation from sequencing data of evolving populations. breseq (for microbes); LoFreq (for low-frequency variants); GATK Mutect2 (for cancer).

Assessing Predictive Power for Clinical Outcomes (e.g., Treatment Failure)

Within the broader thesis on Bayesian inference for fitness cost/benefit estimation, assessing the predictive power for clinical outcomes is a critical application. Bayesian frameworks allow for the integration of prior knowledge (e.g., in vitro fitness assays, genomic data) with emerging clinical trial data to iteratively update the probability of a treatment failing for a patient or population. This dynamic, probabilistic approach to prediction is superior to static, frequentist p-values for decision-making in drug development, where evidence accumulates sequentially and prior mechanistic understanding is substantial.

Foundational Concepts & Data

Table 1: Common Biomarkers & Data Types for Predictive Modeling of Treatment Failure

Data Type Example Biomarkers/Assays Typical Predictive Use Quantitative Format
Genomic Pathogen mutation profiles (e.g., HIV drug resistance mutations, tumor somatic variants), Host SNPs (e.g., pharmacogenomics). Identifies pre-existing or emerging factors that reduce drug efficacy. Variant allele frequency (VAF), Presence/Absence matrix.
Transcriptomic Host immune gene signatures, Pathogen gene expression profiles. Predicts non-response linked to immune dysfunction or pathogen adaptive states. RNA-seq counts, Microarray intensity values.
Proteomic Serum cytokine levels, Drug target protein expression. Correlates with inflammatory status or target availability. Concentration (pg/mL), Optical density units.
Clinical & Demographic Baseline disease severity, Age, BMI, Prior treatment history. Provides contextual priors for patient stratification. Continuous measures, Categorical labels.
Pharmacokinetic Drug trough concentration (C~min~), Area under the curve (AUC). Links exposure to potential for failure due to sub-therapeutic dosing. Concentration (µg/mL), mg·h/L.

Key Experimental Protocols

Protocol 1: Generating In Vitro Fitness Cost Data for Bayesian Priors Objective: To determine the in vitro replicative capacity (fitness) of pathogen isolates with and without resistance-associated mutations.

  • Isolate & Culture: Obtain clinical isolates or engineer isogenic strains differing by a mutation of interest. Culture in standardized medium.
  • Competitive Growth Assay: Mix the mutant and wild-type strains at a known ratio (e.g., 1:1). Co-culture them over multiple replication cycles (e.g., 5-10 passages).
  • Sample & Quantify: At each time point, sample the population. Use quantitative PCR (qPCR) with allele-specific probes or deep sequencing to determine the proportion of each strain.
  • Fitness Calculation: The selection rate coefficient (s) is calculated from the change in log ratio over time. s > 0 indicates a fitness benefit; s < 0 indicates a fitness cost.
  • Prior Formulation: The distribution of s values across biological replicates forms the prior distribution for the mutation's fitness cost in a Bayesian model.

Protocol 2: Longitudinal Sampling for Clinical Outcome Validation Objective: To collect paired biomarker and outcome data for model training and validation.

  • Patient Cohorting: Enroll patients initiating the therapy under study. Collect comprehensive baseline samples (e.g., whole blood, tissue biopsy).
  • Defining Treatment Failure: Pre-define the clinical outcome (e.g., virologic failure, radiographic progression, death by 12 months).
  • Scheduled Sampling: Collect follow-up samples at predefined intervals (e.g., weeks 4, 12, 24) and at the time of suspected failure.
  • Multi-Omics Processing: Process samples for the data types in Table 1 (e.g., whole genome sequencing, RNA sequencing, multiplex immunoassays).
  • Data Integration: Align biomarker trajectories with outcome labels to create a time-structured dataset for predictive modeling.

Bayesian Analytical Workflow

G Prior Prior Knowledge (In vitro fitness, Preclinical PK/PD) Model Bayesian Statistical Model (e.g., Survival model with time-varying covariates) Prior->Model Informs Data Observed Clinical Data (Biomarker trajectories, Failure events) Data->Model Updates Posterior Updated Posterior Distribution (Probability of failure for a biomarker profile) Model->Posterior Yields Prediction Clinical Utility (Patient stratification, Adaptive trial design) Posterior->Prediction Guides

Diagram 1: Bayesian workflow for clinical outcome prediction.

H cluster_0 Model Inputs cluster_1 Bayesian Weibull Survival Model Biomarker Baseline Biomarker X Beta1 Coefficient β1 ~ Normal(μ_prior, σ_prior) Biomarker->Beta1 Time Time on Therapy Likelihood Likelihood Failure Time ~ Weibull(k, λ) Time->Likelihood PK Drug Exposure (AUC) Scale Scale Parameter (λ) λ = exp(β0 + β1*X + β2*AUC) PK->Scale Shape Shape Parameter (k) ~ Gamma(2, 1) Shape->Likelihood Scale->Likelihood Beta1->Scale Outcome Output: Posterior Probability of Failure Before Time T Likelihood->Outcome

Diagram 2: Structure of a Bayesian survival analysis model.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for Predictive Power Research

Item Function & Application
Cell-based Fitness Assay Kit (e.g., luciferase-coupled viral growth assay). Provides a standardized, high-throughput system for quantifying replicative fitness of pathogens in response to drug pressure, generating prior data.
Digital Droplet PCR (ddPCR) Master Mix & Probes Enables absolute, sensitive quantification of allele frequencies (e.g., resistance mutations) from limited patient samples for longitudinal tracking.
Multiplex Immunoassay Panel (e.g., 45-plex cytokine array). Simultaneously measures a broad panel of host protein biomarkers from serum/plasma to identify predictive inflammatory signatures.
Next-Generation Sequencing Library Prep Kit (for RNA/DNA). Essential for generating genomic and transcriptomic data from baseline and longitudinal samples for integrated omics analysis.
Bayesian Statistical Software (e.g., Stan, PyMC3/4, JAGS). Provides the computational framework to build, fit, and evaluate the complex hierarchical models that integrate priors and clinical data.
Certified Biobank Collection Tubes (e.g., PAXgene for RNA, EDTA for plasma). Ensures pre-analytical stability of biospecimens, guaranteeing the integrity of biomarkers for downstream assays.

This application note provides a comparative analysis of three major software resources used for phylogenetic analysis and evolutionary rate estimation within the context of a thesis on Bayesian inference for fitness cost/benefit research. The tools are essential for modeling selection pressures, such as those arising from drug treatment.

Software Comparison Tables

Table 1: Core Features and Primary Use Cases

Feature BEAST 2.7.x MEGA 11 Custom Bayesian Pipeline
Primary Purpose Bayesian evolutionary analysis, coalescent & phylodynamic modeling Comprehensive sequence alignment, distance-based & ML phylogenetics Tailored, hypothesis-specific Bayesian statistical modeling
Inference Engine Markov Chain Monte Carlo (MCMC) Maximum Likelihood, Distance Methods, Parsimony User-defined (e.g., Stan, PyMC3, JAGS, custom MCMC)
Key Strength Time-calibrated trees, complex evolutionary model integration, extensibility User-friendly GUI, integrated suite for molecular evolution Ultimate flexibility, integration of non-standard data & models
Typical Use in Fitness Research Estimating evolutionary rates, population dynamics under selection Identifying positively/negatively selected sites (e.g., SLAC, FEL) Directly modeling fitness parameters from experimental data
Learning Curve Steep Moderate Very Steep
Cost Free & Open Source Freemium (Pro $250) Free (but requires development time)

Table 2: Quantitative Performance and Data Handling

Aspect BEAST 2.7.x MEGA 11 Custom Bayesian Pipeline
Max Sequence Number (Practical) ~10,000 ~500-1,000 Limited only by compute
Standard Molecular Clock Models Strict, Relaxed (LogNormal, Exponential) Strict, Relaxed User-defined
Convergence Diagnostics Built-in (ESS, Tracer) Limited User-implemented
Parallelization Support Yes (BEAGLE library) Limited (some ML steps) Full user control
Typical Run Time (Medium Dataset) Hours to Days Minutes to Hours Highly variable

Experimental Protocols

Protocol 1: Estimating Site-wise Selection Pressures using MEGA11

Objective: To identify codons under positive or negative selection in a viral gene before and after drug treatment.

  • Alignment: Load nucleotide sequences (e.g., pol gene from pre- and post-treatment isolates) into MEGA11. Use ClustalW or MUSCLE for alignment. Visually inspect and trim.
  • Model Selection: Navigate to Models > Find Best DNA/Protein Model. The tool uses Maximum Likelihood to select the best-fit substitution model (e.g., HKY+G).
  • Phylogeny Construction: Construct a maximum likelihood tree using the selected model (Phylogeny > Construct/Test ML Tree). Use 100 bootstrap replicates.
  • Selection Analysis: Run the Selection > CodeML Analysis. Use the pre-built tree. Choose models for comparison (e.g., Model 0 vs. Model 2 for positive selection). Execute analysis.
  • Interpretation: Examine the output for sites with dN/dS (ω) > 1 and statistically significant p-values, indicating positive selection potentially conferring drug resistance.

Protocol 2: Bayesian Evolutionary Rate Estimation using BEAST2

Objective: To estimate the time to most recent common ancestor (tMRCA) and evolutionary rate of a pathogen population under drug pressure.

  • XML Generation: Use BEAUti (BEAST2 GUI) to configure analysis. Import sequence alignment and sampling dates (for tip-dating).
  • Model Specification:
    • Site Model: Select HKY or GTR + Γ.
    • Clock Model: Choose a Relaxed Clock Log Normal to model rate variation across branches.
    • Tree Prior: Select a coalescent model (e.g., Bayesian Skyline) for population size dynamics.
    • Priors: Set appropriate priors for evolutionary rate (e.g., Gamma prior) and clock rate.
  • MCMC Setup: Set chain length to 50-100 million, logging parameters every 5000 steps. Generate the XML file.
  • Run Analysis: Execute the XML in BEAST2. Use BEAGLE library for hardware acceleration.
  • Diagnostics & Summarization: Open log files in Tracer to check Effective Sample Size (ESS > 200). Use TreeAnnotator to generate a maximum clade credibility tree, discarding an appropriate burn-in (e.g., 10%). Visualize in FigTree.

Protocol 3: Custom Bayesian Pipeline for Fitness Cost Estimation

Objective: To directly infer the fitness cost of a resistance mutation by integrating growth curve data and frequency data in a single hierarchical model.

  • Model Definition (Pseudocode): Define a joint probabilistic model in a framework like Stan.

  • Data Preparation: Format experimental data (time-series of mutant frequency from sequencing and overall culture density) into the required input list.
  • Model Fitting: Run Hamiltonian Monte Carlo (HMC) sampling in Stan (cmdstanr/pystan), using 4 chains, 2000 iterations per chain (1000 warm-up).
  • Posterior Analysis: Check R-hat statistics (≈1.0) and trace plots. Extract the posterior distribution for the selection coefficient s. Report median and 95% Credible Interval. A negative s indicates a fitness cost.

Visualizations

workflow Start Raw Sequence Data (FASTA) Align Multiple Sequence Alignment Start->Align Tree Phylogenetic Tree Construction Align->Tree Custom Custom Pipeline: Integrated Fitness Model Align->Custom  + Exp. Data Beast BEAST2: Bayesian Evolutionary Analysis Tree->Beast Mega MEGA11: Selection Analysis Tree->Mega Out1 Output: Time-scaled Trees, Evolutionary Rates Beast->Out1 Out2 Output: Codon-level dN/dS Values Mega->Out2 Out3 Output: Posterior Distribution of Selection Coefficient (s) Custom->Out3

Title: Software Workflow for Fitness Inference

hierarchy Data Observable Data (e.g., Sequences, Growth Curves) Model Probabilistic Model (Likelihood + Prior) Data->Model  Conditions Parameters Latent Parameters (e.g., Selection Coefficient s, μ) Parameters->Data  ~Likelihood Hyperparams Hyperparameters (Prior Means, Variances) Hyperparams->Parameters  ~Priors Posterior Posterior Distribution P(Parameters | Data) Model->Posterior  MCMC Sampling

Title: Core Bayesian Inference Hierarchy

The Scientist's Toolkit: Research Reagent Solutions

Item Function in Fitness Cost/Benefit Research
High-Fidelity Polymerase (e.g., Q5) For accurate amplification of target pathogen genes from mixed populations prior to sequencing.
Next-Generation Sequencing Kit (Illumina) Enables deep, high-throughput sequencing of viral/bacterial populations to track allele frequency changes over time.
Cell-based Assay Plates (96/384-well) For high-throughput growth competition assays under varying drug concentrations to generate phenotypic fitness data.
qPCR Master Mix with Probes For precise, absolute quantification of wild-type vs. mutant allele frequencies over an experimental time course.
Stable Isotope Labeled Amino Acids Used in Mass Spec-based proteomics to directly measure protein synthesis rates, linking mutations to translational fitness costs.
Drug Compounds (Research Grade) The selective agent used in in vitro evolution experiments to apply pressure and select for resistance-conferring mutations.
Cloning & Expression Vector Kit To engineer isogenic strains differing by a single mutation for direct, head-to-head fitness comparisons.

Conclusion

Bayesian inference provides a powerful, flexible framework for quantifying the fitness costs and benefits that drive evolution in pathogens and cancer. By moving from point estimates to full probability distributions, it offers a more nuanced understanding of uncertainty, crucial for forecasting resistance and designing robust drug regimens. The future lies in integrating these models with real-time clinical data, multi-omics layers, and machine learning to create dynamic, predictive tools. Embracing this probabilistic approach will be key to developing evolution-informed therapies that stay ahead of adaptive disease.