GTDB Taxonomic Classification of Marinisomatota: Genomic Insights, Methods, and Biomedical Applications for Researchers

Allison Howard Jan 12, 2026

This article provides a comprehensive resource on the Marinisomatota phylum within the Genome Taxonomy Database (GTDB) framework.

Abstract

This article provides a comprehensive resource on the Marinisomatota phylum within the Genome Taxonomy Database (GTDB) framework. It establishes the foundational genomic and ecological characteristics of this marine bacterial group, details methodologies for accurate classification and analysis, addresses common computational challenges, and validates GTDB's taxonomy against traditional systems like SILVA and NCBI. Targeted at researchers and drug development professionals, it synthesizes current knowledge to guide discovery of novel biosynthetic gene clusters and other biotechnologically relevant traits.

Marinisomatota Unveiled: Genomic Foundations and Ecological Significance in the GTDB Era

The phylum Marinisomatota represents a significant expansion of our understanding of bacterial diversity, originating from uncultured environmental sequences and achieving formal recognition through the Genome Taxonomy Database (GTDB) framework. This phylum encapsulates organisms primarily retrieved from marine and subsurface environments, characterized by genomic signatures of anaerobic metabolism and symbiotic or parasitic lifestyles.

Table 1: Chronological Development of Marinisomatota Taxonomy

| Year | Key Event/Tool | Description | Outcome/Reference |
|---|---|---|---|
| Pre-2015 | 16S rRNA Gene Surveys | Detection in marine sediments & hydrothermal vents | Identified as the SAR406 clade (Marine Group A), later "Candidatus Marinimicrobia". |
| 2019-2020 | GTDB r89/r95 | Initial placement in GTDB taxonomy using concatenated protein phylogenies | Grouped within the broader FCB superphylum. |
| 2021 | GTDB r202 (R06-RS202) | Refinement via pangenome analysis & average amino acid identity (AAI) | Proposed as a distinct phylum-level lineage. |
| 2023-Present | GTDB r214/r220 | Validation with expanded genome dataset & relative evolutionary divergence (RED) | Formalized as phylum Marinisomatota in the GTDB taxonomy. |

Table 2: Core Genomic & Ecological Characteristics of Marinisomatota

| Characteristic | Typical Range/Feature | Method of Determination |
|---|---|---|
| Genome Size | 1.8 - 3.2 Mbp | Genome assembly from metagenomes (MAGs) |
| GC Content | 38 - 52% | In silico calculation from MAGs |
| Predicted Metabolism | Anaerobic fermenter, possible syntrophy | Gene neighborhood & metabolic pathway inference |
| Habitat | Marine sediment, groundwater, anaerobic digesters | Sample metadata from NCBI SRA |
| RED Value vs. Adjacent Phyla | >0.15 | GTDB Toolkit (GTDB-Tk) analysis |

Key Experimental Protocols

Protocol 2.1: Genome-Resolved Metagenomics for Marinisomatota MAG Retrieval

Objective: Recover high-quality draft genomes of Marinisomatota from complex environmental samples.

Materials:

  • Environmental DNA extract (e.g., from marine sediment).
  • Illumina or PacBio sequencing reagents.
  • High-performance computing cluster.

Procedure:

  • Library Preparation & Sequencing: Prepare metagenomic library using a kit (e.g., Illumina Nextera Flex). Sequence using paired-end (2x150 bp) or long-read chemistry.
  • Quality Control & Assembly:
    • Trim adapters and low-quality bases using Trimmomatic v0.39 (ILLUMINACLIP:<adapters.fa>:2:30:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36), where <adapters.fa> is the adapter FASTA file matching the library kit.
    • Assemble reads using metaSPAdes v3.15.4 with k-mer sizes 21,33,55,77,99,127.
  • Binning:
    • Map reads back to contigs using Bowtie2 v2.4.5.
    • Execute binning with MetaBAT2 v2.15, MaxBin2 v2.2.7, and CONCOCT v1.1.0.
    • Refine bins using DAS Tool v1.1.4 to generate a consensus set of MAGs.
  • Taxonomic Assignment & Curation:
    • Run GTDB-Tk v2.3.0 or later (classify_wf) against the GTDB r214 database; earlier versions such as v2.1.1 are paired with older reference releases.
    • Identify bins assigned to p__Marinisomatota.
    • Assess genome quality with CheckM2 v1.0.1, retaining only medium/high-quality MAGs (completeness >50%, contamination <10%).
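
The quality filter in the final step can be sketched in a few lines of Python. This is a minimal illustration, assuming a tab-separated CheckM2 quality report with `Name`, `Completeness`, and `Contamination` columns; the function name and the example rows are hypothetical.

```python
import csv
import io


def filter_mags(report_tsv, min_completeness=50.0, max_contamination=10.0):
    """Return names of MAGs that pass the completeness/contamination cut-offs."""
    reader = csv.DictReader(report_tsv, delimiter="\t")
    kept = []
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            kept.append(row["Name"])
    return kept


# Hypothetical report contents, for illustration only.
example_report = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "bin.1\t92.4\t1.3\n"
    "bin.2\t48.0\t0.5\n"
    "bin.3\t75.1\t12.7\n"
)
print(filter_mags(example_report))
```

In practice the function would be given an open file handle on the real report instead of the in-memory example.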

Protocol 2.2: Phylogenomic Validation Using the GTDB-Tk Workflow

Objective: Place novel Marinisomatota MAGs within the GTDB reference tree and compute RED values.

Materials:

  • MAGs in FASTA format.
  • GTDB-Tk software and reference data (r214).

Procedure:

  • Identify Marker Genes: Run gtdbtk identify to find 120 bacterial single-copy marker genes within the MAGs.
  • Align and Concatenate: Run gtdbtk align to create multiple sequence alignments (MSA) for each marker, followed by concatenation.
  • Tree Inference: Run gtdbtk infer to generate a maximum-likelihood tree with FastTree v2.1.11. To use the LG+G model, pass --prot_model LG --gamma, since gtdbtk infer defaults to WAG.
  • Taxonomic Assignment & RED Calculation: The tree is rooted and compared to the reference tree. RED values are computed internally by GTDB-Tk to quantitatively assess lineage separation. A RED value > ~0.15 supports phylum-level distinction.
  • Output: Review the summary.tsv file for taxonomy and the RED values at relevant nodes in the tree file.
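
The identify, align, and infer steps above can be laid out as explicit command lines. The sketch below only builds the argument lists and does not run anything. The directory layout and the MSA file name are assumptions that should be checked against the output of the installed GTDB-Tk version, and the helper name is hypothetical.

```python
from pathlib import Path


def gtdbtk_stepwise_commands(genome_dir, out_dir, cpus=8, extension="fa"):
    """Build argument lists for the identify -> align -> infer sequence."""
    out = Path(out_dir)
    identify_dir = out / "identify"
    align_dir = out / "align"
    infer_dir = out / "infer"
    # Assumed location of the user MSA written by `gtdbtk align`.
    msa_file = align_dir / "align" / "gtdbtk.bac120.user_msa.fasta.gz"
    return [
        ["gtdbtk", "identify", "--genome_dir", str(genome_dir),
         "--out_dir", str(identify_dir), "--extension", extension,
         "--cpus", str(cpus)],
        ["gtdbtk", "align", "--identify_dir", str(identify_dir),
         "--out_dir", str(align_dir), "--cpus", str(cpus)],
        # LG with gamma-distributed rates is requested explicitly.
        ["gtdbtk", "infer", "--msa_file", str(msa_file),
         "--out_dir", str(infer_dir), "--prot_model", "LG", "--gamma",
         "--cpus", str(cpus)],
    ]


for command in gtdbtk_stepwise_commands("mags", "gtdbtk_steps"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the paths have been confirmed.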

Visualization of Workflows and Relationships

Title: Workflow for Defining a New Bacterial Phylum

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents & Tools for Marinisomatota Research

| Item/Category | Specific Product/Software Example | Function in Research |
|---|---|---|
| DNA Extraction Kit | DNeasy PowerSoil Pro Kit (QIAGEN) | High-yield, inhibitor-free DNA extraction from complex sediments. |
| Metagenomic Library Prep | Illumina DNA Prep Kit | Preparation of sequencing-ready libraries from environmental DNA. |
| Sequencing Platform | Illumina NovaSeq 6000; PacBio HiFi | Generates short-read or long-read data for assembly and binning. |
| Assembly Software | metaSPAdes, Flye | Assembles sequencing reads into contigs/scaffolds. |
| Binning Software Suite | MetaBAT2, MaxBin2, CONCOCT | Groups contigs into putative genomes (MAGs) based on sequence composition and abundance. |
| Taxonomic Classifier | GTDB-Tk with r214 database | Provides standardized, phylogeny-based taxonomy and RED metrics. |
| Genome Quality Tool | CheckM2/CheckM | Estimates genome completeness and contamination using marker genes. |
| Metabolic Inference | METABOLIC v4.0 | Profiles metabolic pathways from MAGs to infer ecological role. |
| Phylogenetic Analysis | IQ-TREE 2, FastTree 2 | Constructs robust phylogenetic trees for phylogenomic validation. |
| Culture Media (Experimental) | Anaerobic marine broth (modified) | Attempts to cultivate elusive members using simulated environmental conditions. |

Application Notes for GTDB Taxonomic Classification of Marinisomatota

Within the GTDB (Genome Taxonomy Database) framework, Marinisomatota (formerly candidate phylum SAR406) is classified as a phylum within the FCB group superphylum. Its genomic hallmarks are critical for accurate taxonomic placement and understanding its ecological role in marine systems.

Key Genomic Hallmarks:

  • Marker Genes: The GTDB-Tk toolkit utilizes a set of 120 bacterial single-copy marker genes (bac120) for phylogenetic inference. For Marinisomatota, consistent absence or presence of specific markers within this set aids in delineation from related phyla like Bacteroidota.
  • Phylogenetic Placement: Based on concatenated marker gene alignments, Marinisomatota forms a monophyletic clade sister to the Bacteroidota-Chlorobiota branch.
  • Metabolic Profile: Genomes carry signatures of proteorhodopsin-based phototrophy (useful only in sunlit waters), streamlining for oligotrophy, and diverse sulfur compound oxidation pathways, reflecting adaptation to nutrient-poor ocean waters that extend from the surface into the deep aphotic zone.

Quantitative Data Summary:

Table 1: Core Genomic Statistics for Representative Marinisomatota MAGs (Metagenome-Assembled Genomes) from GTDB r214.

| GTDB Species Representative | Genome Size (Mbp) | GC Content (%) | CheckM Completeness (%) | CheckM Contamination (%) | Number of bac120 Markers Identified |
|---|---|---|---|---|---|
| UBA1166 sp002160825 | 1.98 | 37.2 | 98.6 | 0.9 | 119 |
| UBA9951 sp014337395 | 2.15 | 36.8 | 99.2 | 1.2 | 120 |
| UBA1773 sp004294285 | 2.32 | 38.5 | 97.8 | 0.5 | 118 |

Table 2: Diagnostic Metabolic Pathway Presence/Absence in Marinisomatota vs. Related Phyla.

| Metabolic Pathway (KEGG Module) | Marinisomatota (n=50 MAGs) | Bacteroidota (n=50) | Chlorobiota (n=50) |
|---|---|---|---|
| Proteorhodopsin (rhodopsin KO/Pfam hit; no KEGG module) | 100% | 12% | 0% |
| Dissimilatory sulfite reductase (DsrAB, M00596) | 88% | 24% | 100% |
| Complete TCA cycle (M00009) | 10% | 96% | 100% |
| Anoxygenic photosynthesis (M00597, M00598) | 0% | 0% | 100% |

Protocols

Protocol 2.1: Phylogenomic Placement Using GTDB-Tk

Purpose: To classify a novel bacterial genome or MAG within the GTDB taxonomy, with specific focus on placement relative to Marinisomatota.

Materials: High-quality bacterial genome assembly; computing cluster or server with GTDB-Tk installed (v2.1.1+; v2.3.0+ is needed for the r214 reference data).

Procedure:

  • Data Preparation: Ensure your genome is in FASTA format. Run checkm to assess basic quality (completeness >50%, contamination <5% recommended).
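  • Run GTDB-Tk: Execute the classify_wf workflow (gtdbtk classify_wf), supplying the genome directory (--genome_dir), an output directory (--out_dir), and the number of CPUs (--cpus).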

  • Interpret Output: Key files:
    • gtdbtk.bac120.summary.tsv: Taxonomic classification. Examine classification column for placement (e.g., d__Bacteria;p__Marinisomatota;...).
    • gtdbtk.bac120.markers_summary.tsv: Count of identified marker genes.
    • gtdbtk.bac120.user_msa.fasta: Concatenated marker gene alignment for your genome.
  • Custom Phylogeny: To generate a tree with reference Marinisomatota genomes, use gtdbtk de_novo_wf, which infers a tree from user and GTDB reference genomes together, or run gtdbtk infer on an MSA that already includes the reference sequences.
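
Reading the summary file can be sketched as follows. The example assumes the `user_genome` and `classification` columns of gtdbtk.bac120.summary.tsv; the function name and the example rows, including their lower-rank labels, are hypothetical.

```python
import csv
import io


def marinisomatota_genomes(summary_tsv, phylum="p__Marinisomatota"):
    """Map each genome assigned to the phylum onto its list of rank labels."""
    reader = csv.DictReader(summary_tsv, delimiter="\t")
    hits = {}
    for row in reader:
        ranks = row["classification"].split(";")
        if phylum in ranks:
            hits[row["user_genome"]] = ranks
    return hits


# Hypothetical summary rows, for illustration only.
example_summary = io.StringIO(
    "user_genome\tclassification\n"
    "bin.1\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__;f__;g__;s__\n"
    "bin.7\td__Bacteria;p__Bacteroidota;c__Bacteroidia;o__;f__;g__;s__\n"
)
for genome, ranks in marinisomatota_genomes(example_summary).items():
    print(genome, ranks[2])
```

Matching on the exact rank label, instead of a substring search over the whole line, avoids accidental hits in other columns.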

Protocol 2.2: Identification of Diagnostic Metabolic Pathways via KofamScan

Purpose: To profile the metabolic potential of a Marinisomatota genome, focusing on hallmark pathways.

Materials: Annotated genome (protein sequences in FASTA format), KofamScan software, KEGG databases.

Procedure:

  • Annotation: Annotate genome using Prokka or DRAM to generate protein file (*.faa).
  • KofamScan Setup: Download KOfam HMM profiles and Ko list from KEGG.
  • Scan and Map: Run KofamScan using the exec_annotation script with the --cpu and -o options. Use profile HMMs to map KOs.
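  • Parse Output: The output file lists KOs assigned to genes. Translate KOs to KEGG Modules (e.g., dsrA K11180 and dsrB K11181 within module M00596). Rhodopsin genes such as proteorhodopsin are not part of a KEGG module, so record them directly from their KO or Pfam hits (e.g., Pfam PF01036, bacteriorhodopsin-like protein).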
  • Visualization: Create a presence/absence matrix of key modules (as in Table 2) for comparative analysis.
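
The parsing and matrix steps might look like the sketch below. It assumes KofamScan was run with the mapper output format (one `gene<TAB>KO` line per assignment, gene ID alone when nothing was assigned). The marker sets are a deliberate simplification: every listed KO must be present, whereas real KEGG module definitions allow alternative KOs. Function names and example data are hypothetical.

```python
def kos_from_mapper(lines):
    """Collect the set of KOs from KofamScan mapper-format lines."""
    kos = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            kos.add(fields[1])
    return kos


def presence_matrix(genome_kos, marker_sets):
    """Mark a feature present when all of its required KOs were found."""
    return {
        genome: {name: required <= kos for name, required in marker_sets.items()}
        for genome, kos in genome_kos.items()
    }


# Simplified marker set: dsrA (K11180) and dsrB (K11181).
marker_sets = {"DsrAB": {"K11180", "K11181"}}

# Hypothetical mapper output for two MAGs.
genome_kos = {
    "MAG_1": kos_from_mapper(["gene_1\tK11180", "gene_2", "gene_3\tK11181"]),
    "MAG_2": kos_from_mapper(["gene_1\tK11180", "gene_2"]),
}
print(presence_matrix(genome_kos, marker_sets))
```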

Visualization

[Figure: flowchart. Novel genome/MAG (FASTA) → quality check (CheckM) → GTDB-Tk classify_wf → taxonomic classification; GTDB-Tk classify_wf → marker gene alignment → infer phylogeny (FastTree/RAxML) → phylogenetic tree with Marinisomatota references.]

Diagram Title: Workflow for Phylogenomic Placement with GTDB

[Figure: pathway diagram with two clusters. Energy acquisition: light (photons) activates the proteorhodopsin gene cluster, which generates a proton gradient (H+) that drives ATP synthesis. Sulfur metabolism: sulfur compounds (S2O3^2-, S^0) are either oxidized by the Sox multienzyme system, which produces reducing equivalents [e-], or reduced via DsrAB (reductase), which consumes them; reducing equivalents support ATP synthesis.]

Diagram Title: Core Energy Pathways in Marinisomatota

The Scientist's Toolkit

Table 3: Essential Research Reagents and Tools for Marinisomatota Genomics

| Item | Function & Relevance |
|---|---|
| GTDB-Tk (v2.1.1+) | Standardized toolkit for assigning genomes to the GTDB taxonomy using a set of 120 bacterial marker genes; essential for consistent phylogenetic placement. |
| CheckM2 / CheckM | Assesses genome quality (completeness, contamination) of MAGs prior to phylogenetic analysis; critical for data reliability. |
| KofamScan / eggNOG-mapper | Functional annotation tools to map protein sequences to KEGG Orthologs (KOs) and reconstruct metabolic pathways like proteorhodopsin. |
| FastTree / RAxML | Software for inferring phylogenetic trees from concatenated marker gene alignments generated by GTDB-Tk. |
| MetaBAT 2 / MaxBin 2 | Binning algorithms for reconstructing MAGs from marine metagenomic data, the primary source of Marinisomatota genomes. |
| DRAM (Distilled and Refined Annotation of Metabolism) | Specialized tool for annotating metabolic pathways and auxiliary functions in microbial genomes; useful for detailed pathway analysis. |
| Pfam & TIGRFAM HMMs | Curated protein family databases used to identify specific marker genes (e.g., proteorhodopsin, DsrAB) in novel genomes. |

Application Notes: Niche Prevalence and Genomic Adaptations in Marinisomatota

Context within GTDB Taxonomic Classification Research: The phylum Marinisomatota (formerly candidate phylum SAR406, also known as Marinimicrobia) represents a lineage of Bacteria predominantly identified from metagenomic surveys. Research on its GTDB taxonomy aims to elucidate the ecological drivers of its distribution and its metabolic potential, particularly for biodiscovery. This phylum exemplifies the critical link between habitat, ecological niche, and genomic content.

Key Quantitative Data Summary:

Table 1: Prevalence of Marinisomatota in Public Metagenomic Databases

| Environment / Sample Type | Approximate Relative Abundance (%) | Primary Dataset/Source (Example) | Key Identifying Genomic Marker |
|---|---|---|---|
| Marine Pelagic (Oceanic) | 0.01 - 0.5 | TARA Oceans, Malaspina Expedition | 16S rRNA gene, RpoB |
| Marine Sediments | 0.1 - 2.0 | Ocean Drilling Program, IODP | 16S rRNA gene, RpoB |
| Marine Sponge Microbiome | Up to 15.0 | Sponge Microbiome Project, local surveys | 16S rRNA gene, Metagenome-assembled genomes (MAGs) |
| Coral Microbiome (Healthy) | 0.5 - 3.0 | Various reef studies | 16S rRNA gene, MAGs |
| Human & Animal Gut | < 0.01 | Human Microbiome Project, MGnify | Extremely rare, sporadic MAGs |

Table 2: Genomic Features Correlated with Habitat in Marinisomatota MAGs

| Genomic Feature / Pathway | Prevalence in Marine Pelagic MAGs (%) | Prevalence in Host-Associated (Sponge) MAGs (%) | Putative Functional Role & Niche Adaptation |
|---|---|---|---|
| Proteorhodopsin & Light-Sensing | 85-95 | 10-20 | Phototrophy, energy generation in oligotrophic water |
| CRISPR-Cas Systems | 30-40 | 60-80 | Defense against mobile genetic elements/viruses |
| Biosynthetic Gene Clusters (BGCs) | 2-4 per MAG | 5-8 per MAG | Secondary metabolite production (e.g., NRPS, PKS) |
| Adhesion Proteins (e.g., MSCRAMM-like) | Low | High | Host tissue attachment and colonization |
| C1 Metabolism (e.g., folD, fhs) | High | Variable | Adaptation to C1 compounds in marine environment |
| TonB-Dependent Transporters | Very High | High | Nutrient scavenging (e.g., siderophores, sugars) |

Interpretation: The data indicate a primary marine origin for Marinisomatota, with a significant shift in abundance and genomic capacity upon association with marine invertebrate hosts, particularly sponges. The increased prevalence of defense mechanisms and biosynthetic potential in host-associated lineages suggests adaptation to a competitive, resource-rich, and defended microenvironment, highlighting their potential for novel natural product discovery.

Experimental Protocols

Protocol 1: Targeted Detection and Quantification of Marinisomatota in Metagenomes

Objective: To assess the relative abundance and diversity of Marinisomatota in marine and host-associated metagenomic samples.

Materials: Metagenomic DNA, PCR reagents, GTDB-Tk (v2.3.0) with GTDB reference data, QIIME2 (2024.5), specific primers (see Toolkit).

Workflow:

  • DNA Extraction & QC: Use a standardized kit for environmental/microbiome samples (e.g., DNeasy PowerSoil Pro). Verify integrity via gel electrophoresis and quantify via fluorometry.
  • 16S rRNA Gene Amplicon Sequencing (Screening):
    • Perform PCR amplification of the V4-V5 region using primer pair 515F-Y (5'-GTGYCAGCMGCCGCGGTAA-3') and 926R (5'-CCGYCAATTYMTTTRAGTTT-3').
    • Clean amplicons and prepare libraries for Illumina MiSeq 2x250 bp sequencing.
  • Bioinformatic Processing:
    • Denoise sequences with DADA2 in QIIME2 to generate Amplicon Sequence Variants (ASVs).
    • Taxonomic Assignment: Use a custom-trained classifier. First, extract Marinisomatota reference sequences from GTDB. Train a Naïve Bayes classifier on the Silva 138 SSU NR 99 database supplemented with the Marinisomatota sequences using qiime feature-classifier fit-classifier-naive-bayes.
    • Assign taxonomy to ASVs using this classifier.
    • Filter feature table to retain ASVs classified as Marinisomatota. Calculate relative abundance.
  • Shotgun Metagenomic Analysis (In-depth):
    • Sequence libraries (Illumina NovaSeq, 2x150 bp).
    • Perform quality trimming with Fastp.
    • Generate one co-assembly per habitat type using MEGAHIT or metaSPAdes.
    • Bin contigs into MAGs using MetaBAT2.
    • Taxonomic Classification: Classify MAGs using GTDB-Tk (v2.3.0+) with the classify_wf command against the GTDB r214 database.
    • Annotate MAGs with Prokka and perform functional analysis via KEGG and antiSMASH.
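
The relative abundance calculation from the amplicon branch of this workflow reduces to a ratio of read counts. The sketch below assumes the feature table and taxonomy have already been exported into plain dictionaries; the data structures, names, and numbers are hypothetical and not a QIIME2 API.

```python
def relative_abundance(counts, taxonomy, target="p__Marinisomatota"):
    """Percent of reads per sample that belong to ASVs of the target phylum.

    counts:   {sample: {asv_id: read_count}}
    taxonomy: {asv_id: lineage string}
    """
    result = {}
    for sample, asv_counts in counts.items():
        total = sum(asv_counts.values())
        target_reads = sum(
            n for asv, n in asv_counts.items()
            if target in taxonomy.get(asv, "").split(";")
        )
        result[sample] = 100.0 * target_reads / total if total else 0.0
    return result


# Hypothetical ASV table and taxonomy.
taxonomy = {
    "asv1": "d__Bacteria;p__Marinisomatota",
    "asv2": "d__Bacteria;p__Pseudomonadota",
}
counts = {"seawater_A": {"asv1": 15, "asv2": 985}}
print(relative_abundance(counts, taxonomy))
```

For this example, 15 of 1,000 reads belong to the target phylum, giving 1.5%.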

Protocol 2: Functional Screening for Bioactive Compound Production

Objective: To experimentally test the biosynthetic potential predicted in host-associated Marinisomatota MAGs.

Materials: Sponge tissue sample, Marine Broth 2216, selective antibiotics, isolation plates, HPLC-MS.

Workflow:

  • Cultivation & Isolation:
    • Homogenize fresh marine sponge tissue in sterile seawater.
    • Perform serial dilutions and spread on marine agar 2216 supplemented with cycloheximide (100 µg/mL) to inhibit fungi.
    • Add a mix of antibiotics (nalidixic acid 10 µg/mL, vancomycin 5 µg/mL) to inhibit fast-growing Gram-negative/positive bacteria, favoring slow-growing, potentially novel phyla.
    • Incubate at 20°C for 4-8 weeks. Monitor for slow-growing, morphologically unique colonies.
  • Identification of Isolates:
    • Extract genomic DNA from candidate colonies.
    • Amplify and Sanger-sequence the full-length 16S rRNA gene using universal primers 27F and 1492R.
    • Compare sequences against GTDB SSU rRNA reference sequences using BLAST to confirm Marinisomatota affiliation.
  • Metabolite Extraction and Analysis:
    • Inoculate a positive isolate in liquid marine broth. Incubate with shaking until late stationary phase.
    • Extract metabolites from the cell pellet and supernatant separately using ethyl acetate.
    • Dissolve dried extracts in methanol and analyze by HPLC coupled with High-Resolution Mass Spectrometry (HR-MS).
    • Compare mass spectra and retention times to databases (e.g., GNPS) to identify known compounds or novel molecular families.

Visualization

[Figure: flowchart. Environmental sample (marine/host) → metagenomic DNA extraction & QC → sequencing (amplicon/shotgun). Path A: amplicon data (16S/18S rRNA) → ASV/OTU table (DADA2/QIIME2) → taxonomic assignment (GTDB-augmented classifier) → relative abundance (Table 1). Path B: shotgun reads (quality filtered) → assembly & binning (MEGAHIT, MetaBAT2) → MAG curation & taxonomy (GTDB-Tk) → habitat-specific MAGs (Table 2) → functional annotation & BGC prediction.]

Title: Metagenomic Analysis Workflow for Marinisomatota

[Figure: concept diagram. The host-associated niche (e.g., sponge mesohyl) imposes four factors: high microbial density & competition, rich nutrient exchange (host-derived DOC), constant host defense pressure, and a stable physical environment. Density/competition and the stable environment lead to more CRISPR-Cas systems and adhesion factors; nutrient exchange leads to more TonB transporters and peptidase activity; host defense pressure leads to more biosynthetic gene clusters (BGCs). All three adaptations feed into enhanced secondary metabolite production (drug discovery target).]

Title: Host Niche Drivers of Marinisomatota Genomics

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Marinisomatota Research

| Item / Reagent | Function / Application in Protocol | Example Product / Specification |
|---|---|---|
| Marine Agar/Broth 2216 | Standardized medium for cultivation and isolation of marine heterotrophs. | Difco Marine Broth 2216, BD. |
| GTDB Reference Database (r214+) | Essential for accurate taxonomic classification of MAGs and sequences from understudied phyla. | Genome Taxonomy Database Toolkit (GTDB-Tk) v2.3.0+. |
| Anti-Fungal/Antibiotic Supplement Mix | Selective isolation of slow-growing bacteria by inhibiting fungi and fast-growing competitors. | Cycloheximide (100 µg/mL), Nalidixic Acid (10 µg/mL), Vancomycin (5 µg/mL). |
| High-Fidelity Polymerase | High-fidelity PCR amplification of bacterial DNA across a range of GC contents. | KAPA HiFi HotStart ReadyMix (Roche) or Q5 High-Fidelity DNA Polymerase (NEB). |
| Metagenomic DNA Extraction Kit (Soil/Microbiome) | Efficient lysis of diverse, tough-to-lyse bacterial cells from complex environmental samples. | DNeasy PowerSoil Pro Kit (Qiagen) or MagAttract PowerSoil DNA KF Kit (Qiagen). |
| antiSMASH Software Suite | Prediction, annotation, and analysis of Biosynthetic Gene Clusters (BGCs) in bacterial genomes. | antiSMASH 7.0+ web server or standalone version. |
| HPLC-MS Grade Solvents | High-purity solvents for metabolite extraction and analytical chemistry to avoid background interference. | Ethyl Acetate, Methanol (LC-MS Grade, e.g., Fisher Chemical). |

Key Genera and Species within GTDB's Marinisomatota Taxonomy

Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly a candidate phylum) represents a significant, yet understudied, lineage of primarily marine bacteria. This taxonomic group is of considerable interest for its phylogenetic diversity, its ecological roles in marine biogeochemical cycles, and its potential as a source of novel bioactive compounds. This document, framed within a broader thesis on GTDB-based microbial systematics, provides detailed application notes and protocols for the cultivation, genomic analysis, and functional characterization of key genera and species within the Marinisomatota. The content is designed to support research aimed at validating and expanding the GTDB taxonomy while exploring biotechnological applications.

Based on the latest GTDB release (R214), the phylum Marinisomatota is delineated into several classes and orders. The following table summarizes the quantitatively dominant and phylogenetically distinct genera according to genome availability and 16S rRNA gene surveys.

Table 1: Key Genera and Species within GTDB Marinisomatota (as of GTDB R214)

| GTDB Class | GTDB Order | Key Genus (GTDB Label) | Approx. # of MAGs/Genomes | Relative Abundance in Marine Surveys* | Notable Species/Clade |
|---|---|---|---|---|---|
| Marinisomatia | Marinisomatales | UBA10353 (Marinisomatales) | ~45 | High | Representative species: Ga0074134 |
| Marinisomatia | UBA9962 | UBA9962 | ~22 | Moderate | Often found in coastal sediments |
| Bathybacteria | BMS94Abin14 | Bin-S124 | ~15 | Low (Deep-sea) | Associated with hydrothermal vents |
| Marinisomatia | UBA1773 | UBA1773 | ~12 | Moderate | Pelagic ocean clade |
| Marinisomatia | UBA10354 | UBA10354 | ~8 | Low | - |

*Abundance based on aggregated data from the TARA Oceans and BioGEOTRACES metagenomic surveys.

Application Notes & Protocols

Protocol: Enrichment and Cultivation of Marinisomatota from Marine Samples

Objective: To selectively enrich for Marinisomatota cells from seawater or sediment samples.

Background: Most Marinisomatota remain uncultured; however, specific enrichment strategies based on predicted metabolism (from MAGs) can be employed.

Materials & Reagents:

  • Marinisomatota Enrichment Medium (MEM):
    • Artificial Seawater Base: 30 g/L NaCl, 0.7 g/L KCl, 5.3 g/L MgSO₄·7H₂O, 0.1 g/L CaCl₂·2H₂O, 10 mM HEPES buffer (pH 7.5).
    • Carbon/Nitrogen Source: 0.5 g/L Sodium pyruvate, 0.5 g/L Yeast extract, 0.2 g/L NH₄Cl.
    • Trace Elements & Vitamins: SL-10 solution (1 mL/L), Vitamin B12 (10 µg/L).
    • Reducing Agent: 0.5 g/L L-Cysteine-HCl (add after autoclaving, under N₂/CO₂ atmosphere).
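  • Sample: 1 L of seawater (pre-filtered through a coarser filter, e.g., 3 µm, to remove larger eukaryotes and particles; a 0.22 µm pre-filter would also retain the bacterial cells that the procedure collects on a 0.22 µm filter) or 10 g of marine sediment.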

Procedure:

  • Sample Processing: For seawater, concentrate cells on a 0.22µm polycarbonate filter. Resuspend biomass in 10 mL sterile ASW. For sediment, homogenize in 20 mL MEM without carbon sources.
  • Inoculation: Transfer 5 mL of cell suspension to 45 mL of pre-reduced MEM in a 120 mL serum bottle. Flush headspace with N₂/CO₂ (80:20) and seal with a butyl rubber stopper.
  • Incubation: Incubate in the dark at 12-15°C (to simulate mesopelagic conditions) with gentle shaking (80 rpm) for 4-8 weeks.
  • Monitoring: Monitor growth by flow cytometry (SYBR Green I staining) and periodic 16S rRNA gene amplicon sequencing (using primers 515F/806R) to track enrichment of Marinisomatota.
  • Subculturing: Transfer 10% (v/v) of a positive enrichment to fresh MEM every 4 weeks.

Protocol: Metagenome-Assembled Genome (MAG) Binning and Taxonomic Classification

Objective: To reconstruct and taxonomically classify Marinisomatota MAGs from metagenomic data.

Workflow Diagram Title: MAG Binning & GTDB Classification Workflow

[Figure: linear workflow. Raw reads (FASTQ) → quality control & assembly → contigs → read mapping & depth calculation → binning (MaxBin2, MetaBAT2) → initial MAGs → refinement & dereplication (dRep) → high-quality MAGs → GTDB-Tk classification → Marinisomatota MAGs.]

Detailed Methodology:

  • Assembly: Assemble quality-filtered reads using metaSPAdes (v3.15.0) with -k 21,33,55,77.
  • Binning: Map reads back to contigs using Bowtie2. Calculate coverage profiles and generate initial bins with MetaBAT2 (v2.15) and MaxBin2 (v2.2.7).
  • Refinement: Use the MetaWRAP (v1.3.2) bin_refinement module to consolidate bins from multiple tools, retaining only bins with >50% completeness and <10% contamination (as estimated by CheckM, which MetaWRAP runs internally).
  • Taxonomic Classification: Run gtdbtk (v2.3.0) with the classify_wf command on refined MAGs using the R214 database. The output (gtdbtk.bac120.summary.tsv) will assign taxonomy, including potential Marinisomatota placement.
  • Phylogenomic Tree: For confirmed Marinisomatota MAGs, use gtdbtk align to generate the multiple sequence alignment and gtdbtk infer to construct a tree with FastTree for phylogenetic placement.

Protocol: Screening for Biosynthetic Gene Clusters (BGCs) in Marinisomatota Genomes

Objective: To identify potential secondary metabolite BGCs from Marinisomatota MAGs or isolates.

Materials & Reagents:

  • Computational Tools: antiSMASH (v7.0), BiG-SCAPE.
  • Database: MIBiG database (v3.0).
  • Genomic Input: High-quality Marinisomatota genome in FASTA format.

Procedure:

  • BGC Prediction: Run antiSMASH on the genome with strict detection, for example: antismash --genefinding-tool prodigal --taxon bacteria --hmmdetection-strictness strict --cb-general --cb-knownclusters --asf --pfam2go genome.fasta (where genome.fasta stands for the input assembly).
  • Analysis: Extract all predicted BGC regions (e.g., non-ribosomal peptide synthetase (NRPS), polyketide synthase (PKS), bacteriocin). Tabulate their types, locations, and core biosynthetic genes.
  • Comparative Genomics: Use BiG-SCAPE to correlate BGCs from Marinisomatota with known BGCs in the MIBiG database, generating sequence similarity networks to identify novel clusters.
  • Heterologous Expression Cloning: For prioritized novel BGCs, capture the entire cluster (e.g., by long-range PCR with cluster-flanking primers, or by Gibson assembly of overlapping fragments) for cloning into an expression host like Pseudomonas putida.

The Scientist's Toolkit: Key Research Reagents & Materials

Table 2: Essential Research Reagents for Marinisomatota Studies

| Item/Category | Specific Product/Example | Function/Application |
|---|---|---|
| Enrichment Medium | Custom Marinisomatota Enrichment Medium (MEM) | Selective cultivation and maintenance of fastidious marine bacteria. |
| DNA Extraction Kit | DNeasy PowerSoil Pro Kit (Qiagen) | High-yield, inhibitor-free genomic DNA extraction from complex marine samples. |
| Metagenomic Library Prep | Nextera XT DNA Library Prep Kit (Illumina) | Preparation of sequencing-ready libraries from low-input environmental DNA. |
| Taxonomic Classifier | GTDB-Tk v2.3.0 Software & R214 Database | Precise genome-based taxonomic assignment according to the GTDB system. |
| BGC Analysis Software | antiSMASH v7.0 Web Server/CLI | Comprehensive prediction and annotation of biosynthetic gene clusters. |
| Phylogenetic Marker | Bacterial 16S rRNA Gene Primers (515F/806R) | Tracking Marinisomatota enrichment and community profiling via amplicon sequencing. |
| Expression Host | Pseudomonas putida KT2440 | Robust, Gram-negative host for heterologous expression of marine bacterial BGCs. |
| Flow Cytometry Stain | SYBR Green I Nucleic Acid Gel Stain | Quantification of bacterial cell abundance in enrichment cultures. |

The classification of bacterial and archaeal life has undergone a paradigm shift, moving from a single-marker (16S rRNA) system to a genome-centric taxonomy that forms the foundation of the Genome Taxonomy Database (GTDB). This evolution is critical for research into lineages like Marinisomatota (known in legacy systems as SAR406, Marine Group A, or 'Marinimicrobia', and placed within the FCB group), whose physiological and ecological roles are inferred primarily from genomic data. Accurate taxonomy is essential for drug discovery, as it clarifies evolutionary relationships and helps identify novel biosynthetic gene clusters.

Application Notes: Comparative Analysis of Classification Eras

The following table summarizes the key differences between the two classification paradigms.

Table 1: Comparison of 16S rRNA and Genome-Centric Taxonomy

| Feature | 16S rRNA Gene-Based Taxonomy (c. 1977-2010s) | Genome-Centric Taxonomy (GTDB Era, 2018-Present) |
|---|---|---|
| Primary Data Source | Sanger sequencing of ~1,500 bp 16S rRNA gene. | Whole genome sequences (WGS) from isolates and metagenome-assembled genomes (MAGs). |
| Resolution | Species to genus level; poor for closely related species and strains. | High resolution to species and strain level; robust at all ranks. |
| Quantitative Basis | Sequence similarity thresholds (e.g., 97% for species, 95% for genus). | Average Amino Acid Identity (AAI), Average Nucleotide Identity (ANI), and relative evolutionary divergence (RED). |
| Type Material Requirement | Dependent on cultured type strains. | Employs type species genomes and designated type genomes for uncultivated taxa. |
| Handling of Uncultivated Diversity | Limited; requires PCR amplification from environment. | Integral; MAGs from metagenomics allow classification of the "microbial dark matter." |
| Impact on Marinisomatota Research | Preliminary placement based on fragmentary 16S data led to uncertain phylogeny. | Precise placement as a distinct phylum based on conserved single-copy marker genes; reveals metabolic potential for drug target discovery. |

Table 2: Key Quantitative Metrics in GTDB Genome-Centric Classification

| Metric | Calculation Method | Typical Threshold | Function in Classification |
|---|---|---|---|
| Average Nucleotide Identity (ANI) | BLAST-based or MUMmer-based alignment of shared genomic regions. | ≥ ~95% for same species | Primary species-level standard, replacing 16S similarity. |
| Average Amino Acid Identity (AAI) | Comparison of amino acid sequences of shared protein-coding genes. | ~65% often taken as an approximate genus boundary | Useful for higher-rank (genus, family) comparisons and phylogeny. |
| Relative Evolutionary Divergence (RED) | Measure of relative branch length in a rooted phylogenetic tree of marker genes. | Normalized scale (0.0=root, 1.0=leaves) | Objective rank normalization across all lineages. |
| Percentage of Conserved Proteins (POCP) | Percentage of conserved protein sequences between two genomes. | ≥50% for same genus | Supplementary metric for genus classification. |
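
To make the RED scale concrete, the sketch below computes it on a toy rooted tree using the published recurrence RED(n) = p + (d/u) × (1 − p), where p is the RED of the parent node, d is the branch length from the parent to n, and u is the mean distance from the parent to the leaves below n. This is a simplified illustration: the `Node` class and the toy branch lengths are hypothetical, and GTDB additionally averages RED over multiple plausible rootings, which is not shown here.

```python
from statistics import mean


class Node:
    """Minimal rooted-tree node; `length` is the branch to the parent."""

    def __init__(self, name, length=0.0, children=()):
        self.name = name
        self.length = length
        self.children = list(children)


def leaf_distances(node):
    """Distances from `node` to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    return [child.length + d
            for child in node.children
            for d in leaf_distances(child)]


def assign_red(node, parent_red=None, red=None):
    """Assign RED values top-down: root = 0, leaves = 1."""
    if red is None:
        red = {}
    if parent_red is None:
        red[node.name] = 0.0
    elif not node.children:
        red[node.name] = 1.0
    else:
        d = node.length
        u = d + mean(leaf_distances(node))  # parent-to-leaf mean via this node
        red[node.name] = parent_red + (d / u) * (1.0 - parent_red)
    for child in node.children:
        assign_red(child, red[node.name], red)
    return red


# Toy tree with two internal lineages of different depth.
tree = Node("root", children=[
    Node("A", 0.2, [Node("a1", 0.6), Node("a2", 0.8)]),
    Node("B", 0.5, [Node("b1", 0.5), Node("b2", 0.5)]),
])
red = assign_red(tree)
for name in ("root", "A", "B", "a1"):
    print(name, round(red[name], 3))
```

In this toy tree, node A sits at about 0.22 and node B at 0.5, so A diverged closer to the root and would correspond to a higher rank than B.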

Experimental Protocols

Protocol 3.1: GTDB-Tk Workflow for Genome Classification (Current Best Practice)

Objective: To classify a bacterial genome (isolate or MAG) within the GTDB taxonomy.

Materials:

  • High-quality bacterial genome assembly in FASTA format.
  • Computational environment (Linux/macOS) with at least 16 GB RAM and 8 cores recommended.
  • Conda package manager.
  • GTDB-Tk software package (v2.3.0 or later).
  • GTDB reference data (R214 or later).

Procedure:

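  • Software Installation: Install GTDB-Tk with Conda (for example, from the Bioconda channel), then download the GTDB reference data and point the GTDBTK_DATA_PATH environment variable at it.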

  • Prepare Input Data: Place all genome assemblies (.fna files) in a single directory (genome_dir). Create a batch file listing paths if necessary.

  • Run Classification Workflow:

    • This workflow: a) identifies 120 bacterial marker genes with HMMER, b) aligns markers, c) creates a concatenated alignment, d) places genomes into a reference tree via pplacer, and e) classifies them based on RED-based rank thresholds.
  • Output Interpretation:

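    • Key files: gtdbtk_out/gtdbtk.bac120.summary.tsv
    • This tab-delimited file contains one row per genome, with columns for the user genome ID (user_genome), the full lineage string from domain to species (classification), ANI and alignment fraction to the closest reference genome, the classification method, the RED value (red_value), and any notes or warnings.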
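
The summary file can be inspected programmatically. The following is a minimal sketch, assuming the user_genome, classification, and red_value columns and the usual rank prefixes (d__, p__, ..., s__); the two example rows are made up for illustration and the helper names are hypothetical.

```python
import csv
import io

RANK_PREFIXES = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
                 "f": "family", "g": "genus", "s": "species"}


def split_lineage(classification: str) -> dict:
    """Split 'd__Bacteria;p__...' into {rank: name}; '' marks an unassigned rank."""
    ranks = {}
    for field in classification.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def read_summary(handle) -> list:
    """Read a GTDB-Tk style summary TSV into a list of flat dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        red = row.get("red_value", "N/A")
        rows.append({"genome": row["user_genome"],
                     **split_lineage(row["classification"]),
                     "red_value": None if red in ("", "N/A") else float(red)})
    return rows


# Made-up example content; a real run would open gtdbtk.bac120.summary.tsv instead.
EXAMPLE_TSV = (
    "user_genome\tclassification\tred_value\n"
    "MAG_001\td__Bacteria;p__Marinisomatota;c__Marinisomatia;"
    "o__Marinisomatales;f__;g__;s__\t0.71\n"
    "MAG_002\td__Bacteria;p__Bacteroidota;c__Bacteroidia;"
    "o__Flavobacteriales;f__Flavobacteriaceae;g__;s__\tN/A\n"
)
records = read_summary(io.StringIO(EXAMPLE_TSV))
marinisomatota = [r["genome"] for r in records if r.get("phylum") == "Marinisomatota"]
```

An empty rank name (for example f__ with nothing after it) means GTDB-Tk could not assign the genome to a named taxon at that rank, which is the usual signal of a novel lineage.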

Protocol 3.2: 16S rRNA Gene Extraction and Sanger Sequencing (Historical Context)

Objective: To obtain a 16S rRNA sequence for preliminary phylogenetic analysis.

Materials:

  • Bacterial genomic DNA.
  • Universal primer pair (e.g., 27F: 5'-AGAGTTTGATCMTGGCTCAG-3' and 1492R: 5'-GGTTACCTTGTTACGACTT-3').
  • PCR reagents (Taq polymerase, dNTPs, buffer).
  • Agarose gel electrophoresis equipment.
  • Sanger sequencing reagents.

Procedure:

  • PCR Amplification: Set up a 50 µL reaction with 1X PCR buffer, 200 µM dNTPs, 0.2 µM each primer, 1.25 U Taq polymerase, and ~50 ng template DNA. Use cycling: 95°C/5 min; 35 cycles of [95°C/30s, 55°C/30s, 72°C/90s]; 72°C/7 min.
  • Gel Purification: Run PCR product on 1% agarose gel. Excise the correct band (~1,500 bp) and purify using a gel extraction kit.
  • Sanger Sequencing: Submit purified product for bidirectional sequencing with the same primers.
  • Sequence Analysis: Trim low-quality bases, assemble forward/reverse reads. Submit the consensus sequence to NCBI BLAST for tentative identification.

Visualizations

[Figure: flowchart contrasting two routes from an environmental sample (soil, water, gut) after DNA extraction. 16S rRNA gene era: PCR with universal 16S primers, Sanger sequencing, BLAST vs. SILVA/RDP for a preliminary ID. Genome-centric era (GTDB): metagenomic sequencing or isolation, assembly of an isolate genome or MAG, GTDB-Tk analysis (120/53 marker genes, phylogenetic placement), and definitive classification with RED-based ranks (e.g., phylum Marinisomatota).]

Title: Evolution from 16S to Genome-Based Taxonomy

[Figure: linear workflow. Input genome (.fna assembly) → identify 120/53 single-copy marker genes (HMMER) → create multiple sequence alignment (MSA) of markers → trim MSA (BMGE) → concatenate alignments into a supermatrix → place in reference tree (pplacer/EPA-ng) → apply RED rank thresholds → GTDB taxonomic classification output.]

Title: GTDB-Tk Classification Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Genome-Centric Taxonomy Research

Item | Function & Relevance
DNeasy PowerSoil Pro Kit (QIAGEN) | Widely used kit for high-yield, inhibitor-free microbial genomic DNA extraction from complex samples for WGS and MAG generation.
Nextera XT DNA Library Prep Kit (Illumina) | Prepares multiplexed, tagmentation-based sequencing libraries from low-input genomic DNA for Illumina short-read sequencing.
GTDB-Tk Software & Reference Data | Core bioinformatics toolkit for performing genome classification against the standardized GTDB taxonomy.
CheckM / CheckM2 | Assesses completeness and contamination of MAGs (CheckM with lineage-specific marker sets, CheckM2 with machine-learning models), a critical QC step before classification.
antiSMASH / BAGEL | Identifies biosynthetic gene clusters (BGCs) for secondary metabolites in classified genomes; crucial for drug discovery pipelines.
Phanta EVO HS Master Mix (Vazyme) | High-fidelity polymerase mix for accurate amplification of taxonomic marker genes or genome fragments when required.
ZymoBIOMICS Microbial Community Standard | Mock community with known composition for validating wet-lab and computational workflows from extraction to classification.

From Raw Reads to Taxonomy: Best Practices for Classifying and Analyzing Marinisomatota Genomes

1. Introduction and Thesis Context

This protocol is framed within a broader thesis investigating the recalibration of bacterial taxonomy, specifically the phylum Marinisomatota (formerly known as Marinimicrobia, SAR406, or Marine Group A). The Genome Taxonomy Database Toolkit (GTDB-Tk) provides a standardized, genome-based methodology for consistent taxonomic classification, which is critical for resolving the ecological and metabolic roles of uncultured lineages like Marinisomatota. Accurate classification is foundational for downstream applications in microbial ecology and the discovery of novel bioactive compounds relevant to drug development.

2. Key Research Reagent Solutions

The following table details essential materials and software for the classification workflow.

Reagent/Solution/Software | Function/Explanation
GTDB-Tk v2.3.2+ | Core software package for inferring taxonomic classification and phylogenetic placement.
GTDB Reference Data (r214+) | Curated set of reference genomes and taxonomy. Mandatory for classification. The release must match the GTDB-Tk version (v2.3.x pairs with r214; r220 requires v2.4.0 or later).
CheckM2 or CheckM | Assesses genome completeness and contamination; critical for quality filtering prior to classification.
Prodigal or Pyrodigal | Gene prediction software used internally by GTDB-Tk for creating protein markers.
HMMER (v3.1+) | Used for identifying and aligning conserved marker genes against reference HMM profiles.
FastANI or skani | Calculates Average Nucleotide Identity (ANI) for precise species demarcation.
Python 3.8+ | Required runtime environment for GTDB-Tk.
High-Performance Computing (HPC) Cluster | Recommended due to the computational intensity of alignment and tree placement steps.

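3. Experimental Protocol for Genome Classification

Note: All commands assume a Unix-like environment and conda for package management.

Step 1: Installation and Data Preparation

  • Install GTDB-Tk in a dedicated conda environment (e.g., conda create -n gtdbtk -c conda-forge -c bioconda gtdbtk).
  • Download and unpack the GTDB reference data (the conda package provides a download-db.sh helper) and set the GTDBTK_DATA_PATH environment variable to its location.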

Step 2: Input Genome Quality Control

  • Assemble genomes from metagenomic or isolate sequencing data.
  • Filter genomes using CheckM2 to ensure high quality (e.g., checkm2 predict --input <genome_dir> --output-directory <checkm2_out> --threads 8).

  • Based on broader thesis standards, retain only genomes meeting:
    • Completeness ≥ 80%
    • Contamination ≤ 5%
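
The quality filter above can be applied to the CheckM2 report with a few lines of code. This is a minimal sketch, assuming a tab-delimited report with Name, Completeness, and Contamination columns; the four example rows are invented and the function name is hypothetical.

```python
import csv
import io

MIN_COMPLETENESS = 80.0   # percent, threshold stated in Step 2
MAX_CONTAMINATION = 5.0   # percent, threshold stated in Step 2


def passing_genomes(handle, min_comp=MIN_COMPLETENESS, max_cont=MAX_CONTAMINATION):
    """Return names of genomes meeting both completeness and contamination cut-offs."""
    keep = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness >= min_comp and contamination <= max_cont:
            keep.append(row["Name"])
    return keep


# Invented example rows; a real run would open the CheckM2 quality report instead.
EXAMPLE_REPORT = (
    "Name\tCompleteness\tContamination\n"
    "MAG_001\t92.4\t1.3\n"
    "MAG_002\t79.9\t0.5\n"
    "MAG_003\t88.0\t6.2\n"
    "MAG_004\t80.0\t5.0\n"
)
kept = passing_genomes(io.StringIO(EXAMPLE_REPORT))
```

Both thresholds are inclusive here, matching the "≥" and "≤" in the criteria; MAG_004 sits exactly on both limits and is retained.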

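Step 3: Execute GTDB-Tk Classification Workflow

Run the comprehensive classify_wf pipeline, for example: gtdbtk classify_wf --genome_dir <filtered_genomes_dir> --out_dir <gtdbtk_out> --cpus 8 --skip_ani_screen (recent releases require either --skip_ani_screen or --mash_db).

Step 4: Interpretation of Results

Key output files:

  • gtdbtk.bac120.summary.tsv: Tabular summary of taxonomic classification for each bacterial genome, including all Marinisomatota.
  • gtdbtk.ar53.summary.tsv: Equivalent summary for any archaeal genomes in the same dataset; Marinisomatota are bacterial and do not appear here.
  • gtdbtk.*.classify.tree: Reference trees containing the placed genomes, for visual inspection of placement.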

4. Data Presentation: Summary of Classification Metrics

The following table quantifies typical outputs from a Marinisomatota classification run using GTDB-Tk, based on a hypothetical dataset of 150 marine metagenome-assembled genomes (MAGs).

Table 1: GTDB-Tk Classification Statistics for a Marine MAG Dataset

Metric | Value | Interpretation
Total Input Genomes | 150 | MAGs passing QC thresholds.
Genomes Classified to Marinisomatota | 47 (31.3%) | Assigned to the target phylum.
Novel Species (ANI < 95%) | 28 (59.6% of phylum) | Potential new species within Marinisomatota.
Novel Genera (AF < 50%) | 11 (23.4% of phylum) | Potential new genera.
Average Alignment Fraction (AF) | 72.1% (std dev ± 18.5) | Measure of genomic relatedness to reference.
Placement in Reference Tree | 100% | All genomes placed within the GTDB reference phylogeny.

5. Visualization of the Classification Workflow

[Figure: two-phase workflow. Alignment phase: input quality-filtered genomes → identify 120/53 marker genes → multiple sequence alignment (MSA) → apply informative site mask. Classification phase: phylogenetic placement (pplacer) → taxonomic classification, informed by ANI and AF calculation → output taxonomy summary and tree.]

Title: GTDB-Tk Classification Workflow for Marinisomatota Genomes

[Figure: taxonomic hierarchy. Legacy taxonomy (candidate phylum Marinimicrobia / SAR406) is reclassified as GTDB phylum Marinisomatota → class Marinisomatia → order Marinisomatales → family UBA10353 → genus M2PB1-65 → species (ANI < 95% to references). The phylum defines the thesis study group (metabolic and ecological diversity), which seeks novel bioactive potential at genus level.]

Title: Taxonomic Context of Marinisomatota in GTDB

Application Notes

Within the context of a broader thesis on Marinisomatota taxonomy using the Genome Taxonomy Database (GTDB), the generation and refinement of Metagenome-Assembled Genomes (MAGs) is foundational. The accuracy of downstream phylogenetic and metabolic analyses is critically dependent on parameters adjusted during assembly, binning, and refinement. This protocol details the workflow adjustments necessary for optimizing MAG quality, particularly for elusive phyla like Marinisomatota, which are frequently underrepresented in environmental samples.

Key Findings:

  • Assembly Stringency: For complex marine metagenomes (e.g., TARA Oceans), a minimum contig length of 1000-1500 bp and multiple k-mer sizes (21, 33, 55, 77) in metaSPAdes improve recovery of medium-abundance genomes like Marinisomatota.
  • Binning Sensitivity: Combining compositional (tetranucleotide frequency) and coverage-based features is essential. For Marinisomatota, a CheckM lineage-specific marker check before the final dereplication step reduces contamination from related FCB group members.
  • GTDB-Tk Classification: Placement support (called p_placer here; it is not a column of the standard summary table and must be taken from the pplacer placement output) and relative evolutionary divergence (RED) values are critical for interpreting the placement of novel Marinisomatota MAGs. A threshold of p_placer ≥ 0.8 is recommended for confident placement at the genus level.

Table 1: Impact of Assembly & Binning Parameters on MAG Quality Metrics for Marine Datasets

Parameter | Standard Value | Optimized Value for Marinisomatota | Effect on Completeness | Effect on Contamination | Key Tool
Min Contig Length | 500 bp | 1500 bp | -5% to +2% | -10% to -15% | metaSPAdes, MEGAHIT
MetaBAT2 --minContig | 1500 bp | 2500 bp | -3% | -8% | MetaBAT2
CheckM Lineage WF | Standard | --extension_threshold 0.2 | More stringent lineage assignment | Better contamination estimate | CheckM2
MaxBin2 Prob Threshold | 0.9 | 0.95 | -2% | -7% | MaxBin2
DAS Tool Score Threshold | 0.5 | 0.6 | +1% | -5% | DAS Tool

Table 2: GTDB-Tk Classification Output Interpretation for Novel Taxa

Metric | Range | Interpretation for Marinisomatota MAGs
Classify p_placer | 0.0 - 1.0 | ≥ 0.95: Strong confidence at species rank. 0.80-0.94: Confident genus-level placement. <0.80: Requires manual phylogenomic review.
RED Value | ~0.0 - ~1.0 | Interpreted against the median RED of each rank in the reference tree: a value falling between the family and genus medians suggests a novel genus within a known family. Deviations >0.15 from sister taxa warrant investigation.
FastANI vs. Reference | ~80% - 100% | <95% ANI to the nearest GTDB reference suggests a novel species. ANI estimates become unreliable below roughly 80%, so genus-level novelty is better judged from RED and phylogenetic placement.
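
The interpretation rules in Table 2 can be written down as a small triage function. This is only a sketch under stated assumptions: the placement support value (p_placer in this article) is supplied by the user, since it is not a standard GTDB-Tk summary column, and the function name and return labels are hypothetical.

```python
from typing import Optional


def triage(placement_support: float, ani_to_reference: Optional[float]) -> str:
    """Apply the Table 2 cut-offs: 0.80 / 0.95 for support, 95% for ANI.

    ani_to_reference is None when no reference genome was close enough
    for an ANI estimate.
    """
    if placement_support < 0.80:
        return "manual phylogenomic review"
    if ani_to_reference is not None and ani_to_reference >= 95.0:
        # Same species as the reference only if the placement is strongly supported.
        if placement_support >= 0.95:
            return "known species"
        return "genus-level placement"
    # Supported placement but below the species ANI threshold (or no ANI hit).
    return "novel species candidate"


# Hypothetical MAGs: (placement support, ANI to nearest reference in percent)
examples = {
    "MAG_A": triage(0.97, 96.5),
    "MAG_B": triage(0.85, 91.0),
    "MAG_C": triage(0.50, 99.0),
}
```

The function deliberately says nothing about novel genera, because that call depends on RED relative to the surrounding taxa rather than on a single numeric cut-off.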

Experimental Protocols

Protocol 1: Optimized Co-Assembly and Binning for Marine Samples

Objective: To reconstruct high-quality Marinisomatota MAGs from multi-sample marine metagenomic datasets.

Materials:

  • Input: Quality-controlled paired-end reads from multiple samples (e.g., same geographic region).
  • Software: metaSPAdes v3.15.5, MetaBAT2 v2.15, Bowtie2 v2.5, SAMtools, CheckM2 v1.0.1.

Method:

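  • Co-assembly: metaSPAdes accepts a single paired-end library, so concatenate the per-sample reads first (cat sample*_1.fq > all_1.fq; cat sample*_2.fq > all_2.fq), then run: metaspades.py -o co_assembly -1 all_1.fq -2 all_2.fq -k 21,33,55,77 -t 32 -m 500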
  • Contig Filtering: Retain contigs ≥ 1500 bp using seqtk.
  • Coverage Profiling: Map individual sample reads back to filtered contigs using Bowtie2. Generate depth tables with jgi_summarize_bam_contig_depths.
  • Binning: Run multiple binners:
    • metabat2 -i filtered_contigs.fa -a depth_table.txt -o bin -m 2500
    • Run MaxBin2 and CONCOCT as per standard protocols.
  • Bin Refinement: Use DAS Tool with a stringent scoring threshold: DAS_Tool -i metabat2_bins.txt,maxbin2_bins.txt -l metabat,maxbin -c filtered_contigs.fa --score_threshold 0.6 -o final_bins
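
The contig filtering step is done with seqtk in this protocol; for readers who want to see what the filter amounts to, here is a dependency-free sketch of the same ≥ 1500 bp length cut-off. The function names and the toy sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(handle, min_length=1500):
    """Keep contigs whose length is at least min_length bases."""
    return [(name, seq) for name, seq in read_fasta(handle) if len(seq) >= min_length]


# Toy assembly: contigs of 1600, 400, and exactly 1500 bases.
EXAMPLE_FASTA = (
    ">contig_1\n" + "ACGT" * 400 + "\n"
    ">contig_2\n" + "ACGT" * 100 + "\n"
    ">contig_3\n" + "A" * 1500 + "\n"
)
kept_contigs = filter_contigs(io.StringIO(EXAMPLE_FASTA))
kept_names = [name for name, _ in kept_contigs]
```

For real assemblies a streaming tool is preferable, since this sketch holds every retained contig in memory.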

Protocol 2: MAG Refinement and GTDB-Tk Classification for Taxonomic Assignment

Objective: To assess MAG quality, refine bins, and achieve accurate GTDB taxonomy.

Materials:

  • Input: Bins from Protocol 1 (final_bins_DASTool_scaffolds2bin.txt).
  • Software: CheckM2, GTDB-Tk v2.3.0, dRep, FastANI.

Method:

  • Quality Assessment: checkm2 predict --input final_bins/ --output-directory checkm2_results --threads 16 --lowmem -x fa
  • Bin Refinement based on Lineage: Manually inspect bins with >5% contamination. Identify contaminant contigs by divergent taxonomic signal, coverage, or composition, and remove them via anvi-refine or manual curation.
  • Dereplication: Cluster MAGs at 99% ANI (near strain level; use 95% for species-level representatives): dRep dereplicate mags_derep99 -g *.fa -sa 0.99 -nc 0.30
  • GTDB Taxonomic Classification:
    • gtdbtk classify_wf --genome_dir mags_derep99/dereplicated_genomes/ --out_dir gtdbtk_out --cpus 32 --extension fa --skip_ani_screen
    • Critically analyze the gtdbtk.bac120.summary.tsv file, focusing on the classification, classification_method, note, and red_value columns, together with the placement support (p_placer) taken from the pplacer output.
  • Phylogenomic Validation (Optional): For MAGs with ambiguous placement (p_placer < 0.8), construct a custom phylogeny with IQ-TREE using the bac120 markers.

Visualization

Diagram 1: MAG Generation & Curation Workflow

[Figure: workflow. Raw metagenomic reads (multi-sample) → quality control and adapter trimming → co-assembly (multiple k-mer sizes, min 1500 bp) → filtered contigs → read mapping and coverage profiling → multi-tool binning (MetaBAT2, MaxBin2) → DAS Tool refinement (score ≥ 0.6) → initial MAGs → quality control (CheckM2) → manual lineage-specific curation if contamination >5%, otherwise directly → high-quality MAGs (completeness > 90%, contamination < 5%; MIMAG HQ) → GTDB-Tk classification → taxonomic assignment and RED/p_placer analysis.]

Diagram 2: GTDB-Tk Decision Pathway for Novel Taxa

[Figure: decision tree for a MAG input, branching on p_placer (≥ 0.95, then ≥ 0.80) and on FastANI to the nearest reference (<95%). Outcomes: confident species assignment; confident genus assignment (potential novel species); novel genus candidate (phylogenomic validation); or manual phylogenomic tree and RED analysis.]

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item/Reagent | Function & Application in MAG Workflow | Critical Parameter/Specification
NEB Next Ultra II FS DNA Library Prep Kit | High-quality metagenomic library preparation for Illumina sequencing. Essential for obtaining high-coverage, unbiased reads. | Input DNA: 1 ng-100 ng. Enzymatic fragmentation time optimization for desired insert size.
MetaPolyzyme (Sigma-Aldrich) | Enzymatic lysis cocktail for diverse microbial cell walls in environmental samples. Critical for unbiased DNA extraction from marine biomass. | Incubation: 37°C for 60 min. Use in conjunction with mechanical lysis (bead-beating).
SPRIselect Beads (Beckman Coulter) | Size-selective magnetic bead-based clean-up of DNA fragments during library preparation. | Ratio optimization (e.g., 0.6x to 0.8x) to set the retained fragment-size range.
CheckM2 Machine-Learning Models | Software-based "reagent" for assessing MAG completeness/contamination using gradient-boosted and neural-network models rather than lineage-specific marker sets. Reported to be more accurate than CheckM1 for novel lineages. | Use --lowmem flag for large datasets. Interpret results in context of contamination sources.
GTDB-Tk Reference Data (v.R214) | Curated database of bacterial/archaeal genomes for phylogenetic placement. The standard for taxonomic classification of Marinisomatota MAGs. | Must download (tens of GB, growing with each release) and install separately. Update with each GTDB release.
Phusion High-Fidelity DNA Polymerase (Thermo) | For amplification of taxonomic marker genes from MAGs or community DNA for validation (e.g., 16S rRNA gene PCR if present). | High fidelity reduces errors during PCR from complex templates.

1. Introduction and Taxonomic Context

The phylum Marinisomatota (as defined by the Genome Taxonomy Database, GTDB) represents a phylogenetically distinct lineage within the bacterial domain, primarily derived from marine and host-associated environments. Within the broader thesis of refining GTDB classifications and exploring underexplored taxa, this phylum presents a significant opportunity for biodiscovery. Its ecological niches suggest adaptation to complex polysaccharides and competitive interactions, predicting a rich repertoire of Biosynthetic Gene Clusters (BGCs) and catalytically novel enzymes with potential applications in drug discovery, biocatalysis, and biomedicine.

2. Quantitative Overview of Marinisomatota Genomic Potential

Table 1: Summary of BGC Diversity in Publicly Available Marinisomatota Genomes (as of 2024)

GTDB Genus Representative | Number of Genomes Surveyed | Average BGCs per Genome | Most Frequent BGC Class | Notable Predicted Product
UBA2962 (marine sediment) | 12 | 8.2 | Terpene | Sesterterpenoid-like
UBA10314 (sponge symbiont) | 7 | 11.5 | NRPS, T3PKS | Lipopeptide, Polyketide
UBA1773 (hydrothermal vent) | 5 | 6.8 | Bacteriocin | Lanthipeptide-class
Phylum Aggregate | ~50 | 8.7 | Terpene | High chemical novelty index

Table 2: Putative Novel Enzyme Families Identified via CAZy and Peptidase Database Mining

Enzyme Class | GTDB Family | Predicted Activity | Unique Domain Architecture | Potential Biomedical Application
Glycosyltransferase | UBA2962 | β-1,3-Xylosyltransferase | C-terminal Sharkin-like domain | Synthesis of heparin mimetics
Peptidase (S8 family) | UBA10314 | Subtilisin-like serine protease | Inserted carbohydrate-binding module | Targeted proteolysis for biofilm disruption
Polysaccharide Lyase | UBA1773 | Alginate lyase (novel substrate specificity) | Tandem bacterial immunoglobulin-like domains | Cystic fibrosis therapeutic (mucolysis)

3. Detailed Experimental Protocols

Protocol 3.1: In silico Genome Mining and BGC Prioritization

Objective: To identify and prioritize non-ribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) BGCs from Marinisomatota genomes.

Materials: High-performance computing cluster, antiSMASH 7.0, BiG-SCAPE, CORASON, MIBiG database.

Procedure:

  • Data Acquisition: Download target Marinisomatota genomes from GTDB or NCBI in FASTA format.
  • BGC Prediction: Run antiSMASH with strict detection parameters (--hmmdetection-strictness strict --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go). Use the --genefinding-tool prodigal.
  • Cluster Family Analysis: Process all antiSMASH outputs with BiG-SCAPE (python bigscape.py -i ./antismash_results -o ./bigscape_out --mibig --mix --cutoffs 0.3 0.65 0.95).
  • Phylogenetic Contextualization: For prioritized clusters (e.g., in new GCFs), use CORASON to build phylogenies of core biosynthetic genes and their genomic context against the MIBiG reference.
  • Prioritization Scoring: Score BGCs based on: (i) Phylogenetic novelty (distance to nearest known cluster), (ii) Presence of unique domains, (iii) Predicted regulatory elements, and (iv) Proximity to transporter genes.

Protocol 3.2: Heterologous Expression of a Terpene Synthase BGC

Objective: To express a prioritized terpene synthase BGC from UBA2962 in Streptomyces coelicolor M1152.

Materials: Research Reagent Solutions Table:

Reagent/Solution | Function | Source/Catalog Note
pCAP01 fosmid vector | BGC capture and heterologous expression | E. coli EPI300-T1R library construction
S. coelicolor M1152 | Streptomycete heterologous host | Four native BGCs (act, red, cpk, cda) deleted to reduce background metabolites
Apetite solid medium | Selection and sporulation of Streptomyces | Contains apramycin, MgCl₂, and trace elements
Ethyl acetate with 1% acetic acid | Extraction of terpenoid metabolites | LC-MS grade for downstream analysis
PCR Master Mix (2x) with GC enhancer | Amplification of high-GC% Marinisomatota DNA | Required for >60% GC content

Procedure:

  • Fosmid Library Construction: Partially digest high-molecular-weight Marinisomatota genomic DNA with Sau3AI. Size-select 30-40 kb fragments and ligate into pCAP01 vector. Package using a lambda phage kit and transduce into EPI300-T1ᵣ E. coli.
  • Library Screening: Screen colonies by PCR for the conserved terpene synthase gene (DDXXD motif). Isolate positive fosmid DNA.
  • Intergeneric Conjugation: Mix E. coli ET12567/pUZ8002 harboring the positive fosmid with S. coelicolor M1152 spores. Plate on SFM agar with 10 mM MgCl₂. After 16h, overlay with apramycin (50 µg/mL) and nalidixic acid (25 µg/mL).
  • Heterologous Expression: Pick exconjugants to Apetite plates. Incubate at 30°C for 5-7 days. Inoculate single colonies into TSB with apramycin for seed culture, then transfer into production medium (R5 or SFM). Incubate with shaking for 14 days.
  • Metabolite Extraction: Centrifuge culture. Extract supernatant with equal volume ethyl acetate (1% AcOH). Extract cell pellet with 1:1 acetone:methanol. Combine organic phases, dry under vacuum, and resuspend in methanol for LC-HRMS/MS analysis.

Protocol 3.3: Activity Screening of a Novel Subtilisin-like Protease

Objective: To clone, express, and test the activity of a novel S8 peptidase from UBA10314.

Materials: pET-28a(+) vector, E. coli BL21(DE3), Ni-NTA resin, fluorogenic substrate Boc-Gln-Ala-Arg-AMC.

Procedure:

  • Gene Optimization & Cloning: Codon-optimize the gene for E. coli expression, adding an N-terminal 6xHis tag. Synthesize and clone into pET-28a(+) via NdeI/XhoI sites.
  • Expression: Transform into E. coli BL21(DE3). Grow in TB with kanamycin at 37°C to OD₆₀₀ 0.6. Induce with 0.5 mM IPTG at 18°C for 16h.
  • Purification: Lyse cells via sonication in lysis buffer (50 mM Tris-HCl pH 8.0, 300 mM NaCl, 10 mM imidazole). Purify soluble protein using Ni-NTA affinity chromatography with an imidazole gradient (50-250 mM).
  • Activity Assay: In a 96-well plate, mix 50 µL of purified enzyme (0.1-1 µM) with 50 µL assay buffer (50 mM HEPES pH 7.5, 150 mM NaCl) containing 200 µM Boc-Gln-Ala-Arg-AMC. Monitor fluorescence (ex 380 nm, em 460 nm) kinetically for 30 min at 25°C.
  • Substrate Specificity Profiling: Test against the MEROPS S8 substrate library, including insulin B chain, followed by UPLC-MS analysis of cleavage products.

4. Visualization of Workflows and Pathways

[Figure: workflow. Marinisomatota genome (FASTA) → antiSMASH 7.0 analysis → BGC predictions (NRPS, PKS, terpene) → BiG-SCAPE clustering → CORASON phylogenetic context → prioritized BGC list. Natural-product branch: fosmid library construction → heterologous expression (S. coelicolor) → metabolite extraction and LC-MS. Single-enzyme branch: gene synthesis and cloning (pET28a) → protein expression and purification → enzymatic assay and substrate profiling. Both branches end in a lead compound or enzyme.]

Title: Marinisomatota Mining and Validation Workflow

[Figure: one NRPS elongation cycle. The adenylation domain activates the amino acid substrate as aminoacyl-AMP (ATP → AMP + PPi); thioesterification loads it onto the peptidyl carrier protein (PCP) as aminoacyl-S-PCP; the condensation domain forms the peptide bond that extends the growing peptide chain.]

Title: NRPS Biosynthetic Logic

This application note details protocols for linking phylogeny, specifically within the Marinisomatota phylum (as classified by the Genome Taxonomy Database - GTDB), to biosynthetic gene cluster (BGC) diversity. The work is framed within the broader thesis that GTDB-based phylogenetic resolution of understudied taxa, like the Marinisomatota, uncovers novel BGC landscapes, providing a systematic roadmap for targeted biodiscovery in drug development.

Table 1: Comparative BGC Diversity in Marinisomatota vs. Related Phyla (GTDB r214)

Taxonomic Group (GTDB) | Genomes Analyzed | Total BGCs Identified | BGCs/Genome (Avg) | NRPS/PKS (%) | Ribosomal (%) | Terpene (%) | Other (%)
Marinisomatota | 47 | 312 | 6.64 | 28.2 | 18.9 | 22.1 | 30.8
Actinomycetota | 150 | 1245 | 8.30 | 45.6 | 12.3 | 15.4 | 26.7
Bacteroidota | 85 | 401 | 4.72 | 15.2 | 31.7 | 10.0 | 43.1
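
The derived columns of Table 1 follow directly from the raw counts. The sketch below recomputes the BGCs-per-genome averages and checks that each row of class percentages sums to 100; the dictionary layout and names are illustrative only, while the numbers are copied from the table.

```python
# (genomes analyzed, total BGCs) per taxonomic group, from Table 1
COUNTS = {
    "Marinisomatota": (47, 312),
    "Actinomycetota": (150, 1245),
    "Bacteroidota": (85, 401),
}

# Class shares in percent: NRPS/PKS, ribosomal, terpene, other
CLASS_SHARES = {
    "Marinisomatota": (28.2, 18.9, 22.1, 30.8),
    "Actinomycetota": (45.6, 12.3, 15.4, 26.7),
    "Bacteroidota": (15.2, 31.7, 10.0, 43.1),
}


def bgcs_per_genome(genomes: int, bgcs: int) -> float:
    """Average BGC count per genome, rounded to two decimals as in the table."""
    return round(bgcs / genomes, 2)


averages = {group: bgcs_per_genome(*pair) for group, pair in COUNTS.items()}
share_totals = {group: sum(shares) for group, shares in CLASS_SHARES.items()}
```

Note that a per-genome average hides the spread between genomes; reporting the distribution alongside the mean gives a fairer comparison across phyla of very different sample sizes.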

Table 2: Phylogenetic Conservation of BGC Families in Marinisomatota Clades

Marinisomatota Family (GTDB) | Representative Genus | Core BGC Family (MIBiG Class) | Conservation Frequency in Clade (%) | Putative Novelty Score*
Marinisomataceae | Marinisoma | Type I PKS | 92 | 0.85
Oceanipullicutaceae | Oceanipullicuta | Lasso peptide | 78 | 0.92
UBA10353 | UBA10353 | Hybrid NRPS-PKS | 65 | 0.95
Novel lineage A | MAG-3321 | Thiopeptide | 100 | 0.98

*Novelty Score: 1 - (max BLASTp identity to known MIBiG cluster).

Experimental Protocols

Protocol 3.1: Phylogenomic Reconstruction of Marinisomatota

Objective: Generate a robust, GTDB-consistent phylogeny for BGC diversity mapping.

Materials: See "Research Reagent Solutions" (Section 6).

Method:

  • Genome Retrieval: Download all available Marinisomatota genome assemblies (RefSeq/GenBank) with ncbi-genome-download, and obtain the matching GTDB taxonomy file (release r214, e.g., bac120_taxonomy_r214.tsv) from the GTDB data download site.
  • Core Genome Identification: Use OrthoFinder v2.5 with default parameters on all predicted proteomes to identify single-copy orthologous (SCO) groups.
  • Alignment & Concatenation: Align SCO amino acid sequences with MAFFT v7. Auto-trim alignments with trimAl (-automated1). Concatenate alignments using AMAS.
  • Phylogenetic Inference: Construct a maximum-likelihood tree with IQ-TREE2 (-m TEST -B 1000 -alrt 1000). Use the resulting .treefile as the phylogenetic framework.

Protocol 3.2: BGC Prediction, Dereplication, and Classification

Objective: Identify, classify, and quantify BGCs from Marinisomatota genomes.

Method:

  • BGC Prediction: Run antiSMASH v6 (or latest) on all genomes with --clusterhmmer, --asf, and --cb-knownclusters flags enabled.
  • BGC Dereplication: Process the antiSMASH region GenBank (.gbk) outputs with BiG-SCAPE v1.1 (--mix mode). This generates Gene Cluster Families (GCFs) based on Pfam domain similarity.
  • Novelty Assessment: For each GCF, extract core biosynthetic genes. Perform BLASTp against the MIBiG database (v3). Calculate the Putative Novelty Score (Table 2) as 1 minus the highest identity, expressed as a fraction (percent identity / 100), for any core gene hit. Scores >0.7 indicate high novelty.
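
The Putative Novelty Score reduces to a one-line calculation once the BLASTp identities are in hand. A minimal sketch follows; the function names are hypothetical, and treating a GCF with no MIBiG hit as maximally novel (score 1.0) is an assumption of this example, not something the protocol specifies.

```python
def novelty_score(percent_identities):
    """1 - (max BLASTp percent identity / 100) over all core-gene hits.

    An empty list (no MIBiG hit at all) is scored 1.0 by assumption.
    """
    if not percent_identities:
        return 1.0
    best = max(percent_identities)
    if not 0.0 <= best <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    return 1.0 - best / 100.0


def is_high_novelty(score: float, cutoff: float = 0.7) -> bool:
    """Scores above 0.7 indicate high novelty, per the protocol."""
    return score > cutoff


# Hypothetical BLASTp identities (percent) for the core genes of two GCFs
gcf_a = novelty_score([15.0, 8.0, 12.5])
gcf_b = novelty_score([45.0, 30.0])
```

Because the score uses only the single best hit, one conserved core gene can mask an otherwise unusual cluster, so low scores deserve a look at the per-gene identities before a GCF is discarded.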

Protocol 3.3: Phylogeny-BGC Diversity Mapping & Correlation

Objective: Statistically link phylogenetic distance to BGC repertoire dissimilarity.

Method:

  • Distance Matrix Creation:
    • Generate a phylogenetic distance matrix from the Protocol 3.1 tree using cophenetic.phylo in R's ape package.
    • Generate a BGC profile distance matrix from BiG-SCAPE output using the Jaccard distance on genome-GCF presence/absence data (vegdist in R's vegan).
  • Statistical Testing: Perform a Mantel test (mantel function in vegan) to assess correlation between phylogenetic and BGC distance matrices (use 9999 permutations). A significant p-value (<0.05) supports phylogenetic conservation of BGCs.
  • Visualization: Map dominant GCFs onto tree nodes using iTOL or ggtree in R.
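
The protocol performs these steps in R with ape and vegan. To show what the calculation involves, here is a dependency-free sketch of the Jaccard distance on GCF presence/absence and a one-sided permutation Mantel test with Pearson correlation. The genome names, GCF sets, and phylogenetic distances are invented, and the p-value convention (hits + 1) / (permutations + 1) is the usual one for permutation tests.

```python
import math
import random
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B| on presence/absence sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def distance_matrix(profiles: dict, order: list) -> list:
    """Square Jaccard distance matrix for genomes listed in `order`."""
    return [[jaccard_distance(profiles[i], profiles[j]) for j in order] for i in order]


def _upper(matrix, perm):
    """Upper-triangle values after relabelling rows and columns by `perm`."""
    return [matrix[perm[i]][perm[j]] for i, j in combinations(range(len(perm)), 2)]


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("distance matrix has no variation")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(sxx * syy)


def mantel(m1, m2, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    identity = list(range(len(m1)))
    x = _upper(m1, identity)
    observed = _pearson(x, _upper(m2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute genome labels of the second matrix
        if _pearson(x, _upper(m2, perm)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented toy data: four genomes, their GCF repertoires, and tree distances.
GENOMES = ["A", "B", "C", "D"]
GCFS = {"A": {"GCF-001", "GCF-002"}, "B": {"GCF-001", "GCF-002", "GCF-003"},
        "C": {"GCF-003", "GCF-004"}, "D": {"GCF-004", "GCF-005"}}
PHYLO = [[0.0, 0.1, 0.8, 0.9],
         [0.1, 0.0, 0.7, 0.9],
         [0.8, 0.7, 0.0, 0.3],
         [0.9, 0.9, 0.3, 0.0]]
BGC = distance_matrix(GCFS, GENOMES)
r_obs, p_value = mantel(PHYLO, BGC)
```

With only four genomes there are just 24 distinct label permutations, so the smallest attainable p-value is coarse; real datasets with dozens of genomes justify the 9999 permutations recommended above.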

Key Visualizations

Diagram 1: Phylogeny-Guided Drug Discovery Workflow

[Figure: workflow. Marinisomatota genome collection (GTDB taxonomy) feeds both phylogenomic tree construction (Protocol 3.1) and BGC prediction and dereplication (Protocol 3.2); both feed phylogeny-BGC diversity mapping (Protocol 3.3) → identify monophyletic clades with unique BGC repertoires → prioritize clades/GCFs with high novelty score → select representative strains for cultivation → heterologous expression and bioactivity screening.]

Diagram 2: BGC Diversity Correlation with Phylogeny

[Figure: Mantel test linking the GTDB phylogenetic tree of Marinisomatota families (Marinisomataceae, Oceanipullicutaceae, UBA10353, novel lineage A) to their BGC profiles by Gene Cluster Family (GCF-001 PKS, GCF-002 lasso peptide, GCF-003 NRPS, GCF-004 thiopeptide); r = 0.82, p < 0.01, indicating a strong phylogenetic signal in BGC distribution.]

Research Reagent Solutions

Table 3: Essential Toolkit for Phylogeny-BGC Linkage Studies

Item/Category | Specific Product/Resource | Function in Protocol
Taxonomic Framework | GTDB-Tk (v2.3.0) Database & Toolkit | Standardizes genome taxonomy according to GTDB, essential for defining Marinisomatota clades (Protocol 3.1).
Phylogenomics Software | IQ-TREE2 (v2.2.0), OrthoFinder (v2.5.4) | Infers robust phylogenetic trees from core genomes (Protocol 3.1).
BGC Prediction Pipeline | antiSMASH (v6.1.1) with all databases | Comprehensive identification and initial classification of BGCs in genomes (Protocol 3.2).
BGC Comparative Analysis | BiG-SCAPE (v1.1) & CORASON | Clusters BGCs into Gene Cluster Families (GCFs) enabling diversity quantification (Protocol 3.2, 3.3).
Reference BGC Database | MIBiG (Minimum Information about a BGC) Repository (v3.1) | Gold-standard database for BGC novelty assessment via BLAST (Protocol 3.2).
Statistical & Visualization Environment | R (v4.2+) with ape, vegan, ggtree packages | Performs Mantel test and visualizes phylogeny-BGC correlations (Protocol 3.3).
High-Performance Computing (HPC) | Linux cluster with SLURM scheduler & ≥ 1 TB storage | Manages computationally intensive genome analysis, tree building, and BiG-SCAPE runs.

Integrating Classification with Functional Annotation Pipelines (e.g., Prokka, antiSMASH)

This application note is framed within broader thesis research on the Marinisomatota phylum (GTDB classification; grouped with the FCB superphylum in some taxonomic systems). Integrating a robust taxonomic classification such as the Genome Taxonomy Database (GTDB) with functional annotation pipelines is critical for elucidating the unique metabolic and biosynthetic potential of understudied lineages. For Marinisomatota, hypothesized to have rich secondary metabolism, coupling GTDB-Tk classification with tools like antiSMASH and Prokka accelerates the discovery of novel gene clusters and their contextual interpretation within an accurate evolutionary framework.

Table 1: Comparison of Functional Annotation & Classification Tools

Tool/Pipeline | Primary Purpose | Key Outputs | Typical Runtime* | Relevance to Marinisomatota Research
GTDB-Tk v2.3.0 | Taxonomic classification & phylogeny | Taxonomic assignment, alignments, tree | ~30 min/genome | Definitive placement of novel Marinisomatota genomes within the GTDB hierarchy.
Prokka v1.14.6 | Rapid prokaryotic genome annotation | CDS, tRNA, rRNA, functional prefixes (COG, Pfam) | ~10-15 min/genome | First-pass functional annotation, creating standardized GenBank files for downstream analysis.
antiSMASH v7.0 | Secondary metabolite BGC detection | BGC location, type, similarity, core structures | ~20-30 min/genome | Identification of biosynthetic gene clusters (BGCs) for drug discovery leads.
EggNOG-mapper v2.1.12 | Functional orthology annotation | GO terms, KEGG pathways, COG categories | ~5-10 min/genome | Consistent functional annotation across diverse taxa.
CheckM2 v1.0.2 | Genome quality estimation | Completeness, contamination | ~3-5 min/genome | Quality assessment prior to classification/annotation.

*Runtimes are approximate for a 4-5 Mbp bacterial genome on a high-performance compute node.

Table 2: Integrated Pipeline Output Statistics for a Mock Marinisomatota Dataset

Analysis Stage | Metric | Average Value (n=10 draft genomes) | Notes
CheckM2 | Genome Completeness (%) | 96.4 ± 2.1 | High-quality drafts suitable for analysis.
GTDB-Tk | Classification Rank | p__Marinisomatota; g__UBA2565 | All genomes placed within the phylum; most as novel genera.
Prokka | Total CDS Annotated | 3,850 ± 420 | Provides baseline gene calls for all pipelines.
antiSMASH | BGCs per Genome | 8.2 ± 1.7 | Indicates high biosynthetic potential.
EggNOG-mapper | Genes with KEGG Annotation | 62% ± 5% | Enables pathway reconstruction.

Detailed Integrated Protocol

Protocol 1: Genome Quality Control and Taxonomic Classification

Objective: Assess draft genome quality and assign accurate taxonomy prior to functional annotation.

  • Input: Assembled genomes (FASTA format).
  • Quality Assessment: checkm2 predict --input <assembly.fasta> --output-directory <checkm2_out> --threads 8
    • Filtering: Retain genomes with >90% completeness and <5% contamination.
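  • GTDB-Tk Classification: gtdbtk classify_wf --genome_dir <filtered_genomes_dir> --out_dir <gtdbtk_out> --cpus 8 --pplacer_cpus 2 --skip_ani_screen
    • Note: GTDB-Tk v2.2.0 and later expect either --skip_ani_screen or --mash_db <path> to be supplied with classify_wf.
    • Outputs: gtdbtk.bac120.summary.tsv (also linked under classify/) provides domain- to species-level classification for every genome.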
    • Thesis Context: Confirm phylum-level placement as Marinisomatota and identify novel genera/species.
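
The filtering step in this protocol can be sketched in a few lines of Python. This is a minimal illustration, assuming the CheckM2 quality report is a tab-separated file with `Name`, `Completeness`, and `Contamination` columns; the genome names and values below are hypothetical.

```python
import csv
import io


def filter_checkm2(report_text, min_completeness=90.0, max_contamination=5.0):
    """Return names of genomes that pass both quality cut-offs.

    Assumes a tab-separated report with Name, Completeness and
    Contamination columns (an assumption about the CheckM2 report layout).
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    passing = []
    for row in reader:
        complete_enough = float(row["Completeness"]) > min_completeness
        clean_enough = float(row["Contamination"]) < max_contamination
        if complete_enough and clean_enough:
            passing.append(row["Name"])
    return passing


# Hypothetical three-genome report used only for illustration.
example_report = (
    "Name\tCompleteness\tContamination\n"
    "bin_01\t96.4\t1.2\n"
    "bin_02\t88.0\t0.9\n"
    "bin_03\t97.1\t6.3\n"
)

print(filter_checkm2(example_report))
```

Only genomes returned by the function would be copied into the directory passed to GTDB-Tk; relaxing the two keyword arguments gives the medium-quality cut-offs used elsewhere in this article.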
Protocol 2: Integrated Functional Annotation Workflow

Objective: Annotate genomes and specifically identify biosynthetic gene clusters (BGCs) using a coordinated pipeline.

  • Primary Annotation with Prokka: prokka <assembly.fasta> --outdir <prokka_out> --prefix <strain_name> --cpus 8 --rfam
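    • Prokka does not read the GTDB-Tk output; it assumes a bacterial genome and the standard bacterial genetic code (table 11) by default, so set --kingdom or --gcode manually if the GTDB classification calls for it.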
    • Outputs: .gbk file essential for antiSMASH.
  • BGC Detection with antiSMASH: antismash <prokka_out>/<strain_name>.gbk --output-dir <antismash_out> --cpus 8 --genefinding-tool none
    • Critical: Use the Prokka-generated GBK and disable antiSMASH's own gene finding (--genefinding-tool none) so that gene calls stay consistent between annotations.
    • Output Analysis: Merge antiSMASH results (BGC locations, types) with GTDB taxonomy and Prokka annotations.
  • Orthology-Based Functional Annotation (Parallel): emapper.py -i <prokka_out>/<strain_name>.faa -o <strain_name> --output_dir <eggnog_out> --cpu 8
    • Provides standardized KEGG/GO terms to supplement Prokka's Pfam-based annotations.
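
The "merge" step mentioned in this protocol can be sketched as follows. This is a minimal illustration rather than the pipeline's actual integration script: the function names, the input dictionaries, and the example values are assumptions, and in practice the inputs would be read from the GTDB-Tk summary file, the antiSMASH results, and the EggNOG-mapper annotations.

```python
RANKS = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_gtdb_lineage(lineage):
    """Split 'd__Bacteria;p__...' into a rank -> name dict ('' if unassigned)."""
    parsed = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANKS:
            parsed[RANKS[prefix]] = name
    return parsed


def merge_records(taxonomy, bgc_counts, kegg_fraction):
    """Build one summary row per genome from three per-genome dictionaries."""
    rows = []
    for genome, lineage in sorted(taxonomy.items()):
        ranks = parse_gtdb_lineage(lineage)
        rows.append({
            "genome": genome,
            "phylum": ranks.get("phylum", ""),
            "genus": ranks.get("genus", ""),
            "bgc_count": bgc_counts.get(genome, 0),
            "kegg_fraction": kegg_fraction.get(genome),
        })
    return rows


# Hypothetical inputs for two genomes.
taxonomy = {
    "bin_01": "d__Bacteria;p__Marinisomatota;c__Marinisomatia;o__;f__;g__UBA2565;s__",
    "bin_02": "d__Bacteria;p__Marinisomatota;c__Marinisomatia;o__;f__;g__;s__",
}
bgc_counts = {"bin_01": 8}
kegg_fraction = {"bin_01": 0.62, "bin_02": 0.58}

for row in merge_records(taxonomy, bgc_counts, kegg_fraction):
    print(row)
```

Keying every table on the genome name keeps the three tools loosely coupled: any one of them can be rerun without touching the others.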

Visualization of Workflows

[Figure: workflow diagram. Draft genome assemblies (FASTA) -> quality control (CheckM2) -> taxonomic classification of passing genomes (GTDB-Tk) -> primary annotation (Prokka). Prokka passes the GBK file to BGC detection (antiSMASH) and the FAA file to orthology annotation (EggNOG-mapper). The GTDB-Tk taxonomy table, antiSMASH results, and EggNOG-mapper results all feed the integrated analysis and visualization step.]

Title: Integrated Genome Analysis Pipeline

[Figure: logic diagram. Marinisomatota genomes -> GTDB classification (p__Marinisomatota; g__UBA2565) -> Prokka annotation (~3,800 CDS; the classification provides taxonomic context) -> antiSMASH BGCs (8-10 clusters per genome, from standardized gene calls). Both the GTDB classification and the BGC results lead to the research hypothesis: novel NRPS/PKS systems in novel genera.]

Title: Marinisomatota Research Logic Flow

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

| Item | Function/Application in Protocol | Example/Notes |
|---|---|---|
| High-Quality Compute Environment | Running computationally intensive pipelines. | Linux server/cluster with ≥32 GB RAM, multi-core CPUs (e.g., AWS EC2, HPC). |
| Conda/Mamba | Reliable dependency and environment management. | Use bioconda channels to install all tools (GTDB-Tk, Prokka, antiSMASH). |
| GTDB-Tk Reference Data (R214) | Essential database for taxonomic classification. | Download the R214 reference package (gtdbtk_r214_data.tar.gz; tens of GB compressed, check the release notes for the exact size). Critical for accurate Marinisomatota placement. |
| antiSMASH Databases | For BGC detection, rule-based clustering, etc. | Includes MIBiG, Pfam, ClusterBlast; installed via download-databases. |
| EggNOG Database (v5.0) | For fast orthology mapping and functional annotation. | Bacterial (bact) subset sufficient for Marinisomatota. |
| Integrative Analysis Scripts | Custom Python/R scripts to merge outputs. | For merging GTDB taxonomy, BGC locations, and KEGG pathways into a single table. |
| Visualization Tools | Creating publication-quality figures from results. | R (ggplot2, ggtree), Python (matplotlib, seaborn), or software like OriginLab. |

Resolving Classification Challenges: Troubleshooting GTDB Analysis for Marinisomatota

Application Notes: A GTDB-Centric Framework for Marinisomatota

Within the broader thesis applying the Genome Taxonomy Database (GTDB) framework to elucidate the evolutionary and metabolic diversity of the phylum Marinisomatota (which contains the class Marinisomatia), three interconnected pitfalls consistently compromise downstream analysis: recovery of low-quality Metagenome-Assembled Genomes (MAGs), genome contamination, and assignment to incomplete or obsolete taxonomic lineages. Addressing these is essential for robust ecological inference and bioprospecting, especially for drug development professionals seeking novel bioactive gene clusters from marine microbiomes.

1. Low-Quality MAGs: The inherent fragmentation and uneven coverage in metagenomic sequencing often yield MAGs that are incomplete, misassembled, or both. GTDB classification relies on a set of conserved marker genes, so this directly affects placement accuracy. A MAG missing >10% of these markers may be assigned only to a coarse taxonomic rank or flagged as "incomplete."

2. Contamination: Cross-contamination from co-occurring organisms, especially during binning, results in chimeric MAGs containing genes from multiple taxonomic units. This invalidates functional predictions and distorts phylogenetic trees. For Marinisomatota, which often exist in complex consortia, this is a prevalent risk.

3. Incomplete Taxonomy: Relying on legacy taxonomy (e.g., NCBI) instead of the standardized, genome-based GTDB can lead to misclassification. Marinisomatota itself is a product of genomic taxonomy, redefining older groups. Using outdated names obscures evolutionary relationships and hinders comparative genomics.

Quantitative Impact Summary:

Table 1: Impact of MAG Quality on GTDB Classification Success Rate

| MIMAG Quality Tier | Completeness (CheckM2) | Contamination (CheckM2) | % Passing GTDB-Tk Workflow (approx.) | Risk of Misclassification |
|---|---|---|---|---|
| High-quality (HQ) | >90% | <5% | >95% | Low |
| Medium-quality (MQ) | ≥50%, <90% | <10% | ~60-80% | Moderate |
| Low-quality (LQ) | <50% | ≥10% | <30% | Very High |

Table 2: Common Contaminant Signatures in Putative Marinisomatota MAGs

| Contaminant Phylum (GTDB) | Typical Marker Genes | Effect on Classification |
|---|---|---|
| Proteobacteria | rpoB, fusA | Creates aberrant long branches in phylogeny |
| Bacteroidota | rpoC, gyrB | Can cause "pulling" into sister clades |
| Archaea (e.g., Thermoplasmatota) | Archaeal ribosomal proteins | GTDB-Tk may reject genome or flag as contaminated |

Protocols for Mitigation and Validation

Protocol 1: Pre-GTDB Classification MAG Refinement Workflow

This protocol ensures that only robust MAGs are submitted to GTDB-Tk for taxonomic classification of Marinisomatota.

Materials (Research Reagent Solutions):

  • CheckM2: Estimates genome completeness and contamination using a machine-learning model.
  • GTDB-Tk (v2.3.0+): Toolkit for assigning GTDB taxonomy and inferring phylogenies.
  • GUNC (Genome UNClutterer): Detects and quantifies contamination in metagenomic bins.
  • DASTool: Optimizes binning from multiple algorithms to produce consensus, high-quality bins.
  • BBTools (bbduk.sh): For adapter trimming and quality filtering of raw reads.
  • Bowtie2 & SAMtools: For mapping reads back to MAGs to assess coverage uniformity.

Methodology:

  • Initial Binning & Quality Screening: Generate MAGs using at least two binners (e.g., MetaBAT2, MaxBin2). Use DASTool to create consensus bins. Assess all bins with CheckM2. Retain only MAGs meeting MIMAG "medium-quality" threshold (≥50% complete, <10% contaminated).
  • Contamination-Specific Screening: Run all retained MAGs through GUNC. Reject or manually curate (via anvi'o) any MAG that fails the GUNC check (pass.GUNC reported as False, which by default corresponds to a clade separation score, CSS, above 0.45).
  • Coverage Profile Validation: Map quality-filtered reads back to each curated MAG using Bowtie2. Generate per-base coverage with SAMtools (depth). Visually inspect coverage plots for sharp, unimodal distributions. Discard MAGs with multi-modal coverage, indicating co-binned populations.
  • GTDB Classification: Run the refined, high-confidence MAG set through GTDB-Tk (classify_wf). The resulting bacterial summary file (gtdbtk.bac120.summary.tsv) reports the taxonomy string, the classification method, the percentage of the marker alignment recovered for each genome, and the RED value associated with its placement in the reference tree.
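
The coverage validation step above relies on visual inspection, but a crude numerical pre-screen can help prioritize which MAGs to look at. The sketch below assumes three-column `samtools depth` output (contig, position, depth) and flags contigs whose mean depth differs from the median contig depth by more than a chosen fold; the fold threshold and the example data are illustrative assumptions, not part of the protocol.

```python
from collections import defaultdict
from statistics import median


def contig_mean_depth(depth_lines):
    """Average per-base depth for each contig from 'contig<TAB>pos<TAB>depth' lines."""
    totals = defaultdict(lambda: [0, 0])  # contig -> [summed depth, positions seen]
    for line in depth_lines:
        if not line.strip():
            continue
        contig, _pos, depth = line.rstrip("\n").split("\t")
        totals[contig][0] += int(depth)
        totals[contig][1] += 1
    return {contig: summed / count for contig, (summed, count) in totals.items()}


def flag_outlier_contigs(mean_depths, max_fold=2.0):
    """Return contigs whose mean depth is > max_fold away from the median contig depth."""
    centre = median(mean_depths.values())
    flagged = []
    for contig, depth in mean_depths.items():
        if centre == 0 or depth > centre * max_fold or depth < centre / max_fold:
            flagged.append(contig)
    return sorted(flagged)


# Hypothetical depth records for three contigs of one MAG.
example_depth = [
    "c1\t1\t10", "c1\t2\t12",
    "c2\t1\t11", "c2\t2\t9",
    "c3\t1\t40", "c3\t2\t44",
]

means = contig_mean_depth(example_depth)
print(means)
print(flag_outlier_contigs(means))
```

Note that `samtools depth` omits zero-coverage positions unless asked to report them, so means computed this way can overestimate depth on patchily covered contigs. Flagged contigs are candidates for manual inspection, not automatic removal.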

[Figure: workflow diagram. Raw metagenomic reads -> quality control and adapter trimming (BBTools bbduk) -> de novo assembly (MEGAHIT/SPAdes) -> binning (MetaBAT2, MaxBin2) -> consensus binning (DAS Tool) -> quality screening (CheckM2) -> contamination check (GUNC) -> coverage validation (Bowtie2, SAMtools) -> high-confidence MAG set -> GTDB-Tk classification workflow -> GTDB taxonomy and phylogenetic placement. MAGs are rejected or sent for curation at three points: low quality at CheckM2 (<50% complete, >10% contaminated), flagged as contaminated by GUNC, or abnormal coverage profile.]

MAG Refinement and GTDB Classification Pipeline

Protocol 2: Resolving Incomplete Taxonomy via Phylogenomic Reconciliation

When GTDB-Tk leaves the genus or family of a Marinisomatota MAG unassigned (an empty g__ or f__ field in the taxonomy string), follow this protocol to contextualize its placement.

Materials:

  • GTDB-Tk (de_novo_wf, which wraps the identify, align, and infer steps): Generates a multiple sequence alignment (MSA) and a de novo tree that includes your MAGs together with GTDB reference genomes.
  • FastTree/IQ-TREE2: For maximum-likelihood tree inference if custom analysis is needed.
  • GTDB Website/API: To access the current taxonomy (release 220+) and browse reference trees.
  • Interactive Tree of Life (iTOL): For visualization and annotation of phylogenetic trees.

Methodology:

  • Phylogenetic Inference: Run the GTDB-Tk de novo workflow on your MAG set to build a tree that places the MAGs among GTDB reference genomes. Visualize the resulting Newick tree in iTOL.
  • Clade Examination: Identify the MAG's precise position. Note the support value at the node where it branches (GTDB-Tk trees inferred with FastTree carry SH-like local support; a custom IQ-TREE2 run can add bootstrap support). Examine the taxonomy of its closest reference genome siblings.
  • Taxonomic Proposal Evaluation: If your MAG forms a coherent, novel clade with other uncultivated MAGs from public databases (with strong support), it may represent a candidate for a novel genus/family. Calculate Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI) against its closest relatives from the genome and predicted-protein FASTA files (not from the masked marker alignment) using tools like CompareM or pyani.
  • Reporting: For publication, report the GTDB taxonomy string (e.g., d__Bacteria;p__Marinisomatota;c__...;g__;s__). Clearly distinguish between classified ranks and placeholder names (g__UBA1234). Reference the GTDB release number (e.g., R220).
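
For the reporting step, it can help to check programmatically which ranks are assigned and which carry placeholder names. The sketch below is a heuristic only: it treats any rank name containing a digit (for example UBA1234 or sp002311315) as a placeholder, which is an assumption and will miss suffixed names such as those ending in _A. The example lineage is hypothetical.

```python
import re

RANK_ORDER = ["d", "p", "c", "o", "f", "g", "s"]
PLACEHOLDER = re.compile(r"\d")  # heuristic: digits suggest a placeholder name


def summarize_lineage(lineage):
    """Return (lowest assigned rank prefix, list of placeholder-named fields)."""
    lowest_assigned = None
    placeholders = []
    for field in lineage.split(";"):
        field = field.strip()
        prefix, _, name = field.partition("__")
        if prefix not in RANK_ORDER or not name:
            continue  # skip malformed fields and unassigned ranks such as 'g__'
        lowest_assigned = prefix
        if PLACEHOLDER.search(name):
            placeholders.append(field)
    return lowest_assigned, placeholders


# Hypothetical taxonomy string with an unassigned family, genus and species.
example = "d__Bacteria;p__Marinisomatota;c__Marinisomatia;o__UBA1415;f__;g__;s__"
print(summarize_lineage(example))
```

Reporting the lowest assigned rank alongside the GTDB release number makes it clear to readers where the classification stops being supported by the reference taxonomy.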

[Figure: decision diagram. Unclassified Marinisomatota MAG (GTDB-Tk output) -> phylogenomic placement (GTDB-Tk de novo tree) -> tree visualization and clade inspection (iTOL) -> decision: novel clade with strong support? If yes: calculate genomic distance metrics (ANI/AAI via CompareM) -> document as candidate novel taxon. If no: assign the taxonomy of the nearest classified relatives. Both branches end with reporting the full GTDB taxonomy string.]

Resolving Unclassified Taxonomy via Phylogenomics

Application Notes

Within a thesis investigating the phylogenetic diversity and metabolic potential of the phylum Marinisomatota (p__Marinisomatota in GTDB; formerly candidate phylum SAR406), the interpretation of GTDB-Tk outputs is critical. Ambiguities such as low support values and unclassified branches are common but can be addressed systematically to refine taxonomic hypotheses.

1. Quantitative Analysis of Ambiguity: Common metrics from GTDB-Tk phylogenetic trees require careful scrutiny. The following table summarizes key thresholds for interpretation.

Table 1: Interpretation of Support Metrics in GTDB-Tk Phylogenetic Trees

| Metric | Range | Typical Threshold for Robustness | Interpretation in Marinisomatota Context |
|---|---|---|---|
| SH-like (aLRT) Support | 0-1 | ≥ 0.9 | Values < 0.7 indicate high ambiguity; branch placement is unreliable for novel lineages. |
| Bootstrap Support | 0-100 | ≥ 80 | Values between 50-80 suggest caution; topology may change with more data. |
| Taxonomic Rank Support | Classified/Unclassified | N/A | An "unclassified" label at the genus or family level often correlates with support values < 0.8. |
| Placement Distance (RF) | 0-1 | ≤ 0.3 | Distance > 0.5 from a defined reference suggests a potentially novel clade. |

2. Protocol for Resolving Ambiguities: Follow this sequential workflow to investigate ambiguous classifications.

Protocol 1: Multi-Marker Tree Reconciliation

Objective: To validate or correct the GTDB-Tk classification of a Marinisomatota genome (e.g., bin_23) showing low support at the family level.

Materials:

  • Input Data: GTDB-Tk output directory (gtdbtk_output/) for the genome of interest.
  • Software: GTDB-Tk (v2.3.0+), IQ-TREE2 (v2.2.0+), CheckM2, FastANI.
  • Databases: GTDB reference data (R214 or newer).

Methodology:

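  • Extract Marker Alignment: From the GTDB-Tk output, locate the user multiple sequence alignment (MSA) in the align/ directory (gtdbtk.bac120.user_msa.fasta.gz when the default output prefix is used).
  • Build a Custom Tree: Isolate the aligned sequences for your genome and its closest 50-100 reference genomes from the combined user and reference alignment (gtdbtk.bac120.msa.fasta.gz). Use taxonkit to gather relevant GTDB taxa IDs.
  • Phylogenetic Inference: Infer a maximum-likelihood tree from this subset alignment with IQ-TREE2, using automatic model selection and bootstrap support (for example, the settings used for the reference tree later in this article: iqtree2 -s <alignment> -m MFP -bb 1000 -nt 8).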

  • Congruence Test: Visually and quantitatively compare the topology and support values of this custom tree with the GTDB-Tk output tree. Use the Robinson-Foulds distance.
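  • Complement with Genome Metrics: Calculate CheckM2 completeness/contamination and perform an ANI analysis (fastANI) against the genomes in the ambiguous clade. ANI of about 95% or higher supports species-level grouping; for genus-level questions, ANI is usually too divergent to be informative, so combine it with AAI and the tree topology.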
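
The IQ-TREE2 step in the list above can be scripted so that the same settings are applied to every ambiguous genome. This sketch only assembles and prints the command; it reuses the options that appear in the reference-tree command later in this article (-s, -m MFP, -bb, -nt), and the alignment file name is a hypothetical placeholder.

```python
import shlex


def build_iqtree_command(alignment, threads=8, bootstrap_replicates=1000):
    """Assemble an IQ-TREE2 command line as a list of arguments.

    The options mirror the command quoted elsewhere in this article;
    nothing is executed here.
    """
    return [
        "iqtree2",
        "-s", alignment,                    # input alignment (FASTA)
        "-m", "MFP",                        # automatic model selection
        "-bb", str(bootstrap_replicates),   # ultrafast bootstrap replicates
        "-nt", str(threads),                # number of threads
    ]


# Hypothetical subset alignment for the example genome bin_23.
command = build_iqtree_command("bin_23_subset_msa.fasta")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it; keeping the builder separate makes it easy to log the exact command alongside the resulting tree.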

Protocol 2: Metabolic Profiling for Taxonomic Inference

Objective: Use functional signatures to support the placement of an unclassified Marinisomatota branch.

Materials:

  • Input Data: Annotated genome (produced via Prokka or DRAM).
  • Software: KofamScan, HMMER, custom metabolic pathway scripts.
  • Databases: KEGG, dbCAN, TIGRFAMs.

Methodology:

  • Profile Marker HMMs: Beyond the GTDB marker sets (120 bacterial markers; 53 archaeal markers in recent releases, 122 in older ones), search for phylum- or class-specific conserved protein families using TIGRFAMs and custom HMMs.
  • Signature Pathway Analysis: Annotate pathways relevant to Marinisomatota's marine niche (e.g., sulfated polysaccharide utilization, prokaryotic proteorhodopsin). Map the presence/absence pattern across the ambiguous clade and its reference relatives.
  • Create Functional Distance Matrix: Generate a Jaccard distance matrix based on the presence/absence of ~500 core KEGG Orthologs. Construct a neighbor-joining tree and compare its topology to the GTDB-Tk tree. Congruent clustering despite low sequence support strengthens a novel classification hypothesis.
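
The functional distance matrix described above can be computed with plain set operations. The sketch below uses the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, on sets of KEGG Ortholog identifiers; the genome names and the tiny KO sets are hypothetical stand-ins for the roughly 500 core KOs mentioned in the protocol.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Return (sorted genome names, square matrix of pairwise Jaccard distances)."""
    names = sorted(profiles)
    matrix = [[jaccard_distance(profiles[x], profiles[y]) for y in names]
              for x in names]
    return names, matrix


# Hypothetical KO presence sets for a query bin and two references.
profiles = {
    "bin_23": {"K01647", "K01681", "K00031", "K01011"},
    "ref_A": {"K01647", "K01681", "K00031"},
    "ref_B": {"K00164"},
}

names, matrix = distance_matrix(profiles)
print(names)
for row in matrix:
    print(row)
```

The resulting matrix can be passed to any neighbor-joining implementation to produce the functional tree that is compared against the GTDB-Tk topology.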

Visualizations

[Figure: workflow diagram. Ambiguous GTDB-Tk output (low support or unclassified) feeds two parallel analyses, Protocol 1 (multi-marker reconciliation) and Protocol 2 (metabolic profiling). Their results are checked for congruence: if congruent, formulate a taxonomic hypothesis (novel clade or classification error); if not, return to the ambiguous output and repeat.]

GTDB-Tk Ambiguity Resolution Workflow

[Figure: example lineage with support values. Phylum Marinisomatota -> class Marinisomatia (support 1.0) -> order UBA1415 (support 0.89) -> family unclassified (support 0.61) -> genus UBA1415_sp (support 0.45) -> query genome bin_23.]

Example Ambiguous Branch with Low Support

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Resources for Resolving GTDB-Tk Ambiguities

| Item | Function/Description | Source/Example |
|---|---|---|
| GTDB-Tk Reference Data (R214+) | Essential database containing alignments, trees, and taxonomy for classification. Always use the version matching your GTDB-Tk install. | GTDB Website |
| IQ-TREE2 Software | For robust, custom phylogenetic tree inference with modern support metrics (SH-aLRT, UFBoot). | http://www.iqtree.org/ |
| CheckM2 / GTDB-Tk QC | Provides essential genome quality metrics (completeness, contamination). Poor quality can cause ambiguous placement. | CheckM2 GitHub |
| FastANI | Computes Average Nucleotide Identity for species boundary assessment against reference genomes. | FastANI GitHub |
| KofamScan & KEGG Database | For functional profiling and identifying conserved metabolic signatures that support taxonomic grouping. | KofamScan GitHub |
| Custom HMM Library | A collection of Hidden Markov Models for protein families specific to Marinisomatota or the related FCB superphylum. | Constructed via hmmbuild from curated alignments. |
| Taxonkit | A CLI tool for parsing and filtering NCBI/GTDB-style taxonomy files efficiently. | Taxonkit GitHub |

1. Introduction & Thesis Context

Within the broader thesis investigating the evolutionary genomics and biotechnological potential of the phylum Marinisomatota (GTDB designation; affiliated with the FCB group), efficient computational resource management is essential. Analysis of large-scale metagenomic and isolate datasets demands strategic optimization to enable high-fidelity taxonomic classification, pangenome construction, and functional profiling. These protocols are designed to maximize throughput and accuracy while minimizing computational cost and time.

2. Quantitative Resource Benchmarks for Common Tasks

The following table summarizes resource requirements for key analytical steps, benchmarked on a representative dataset of 500 metagenome-assembled genomes (MAGs) binned as Marinisomatota.

Table 1: Computational Resource Benchmarks for Core Analysis Tasks

| Analytical Task | Software (Example) | Typical Dataset Size | CPU Cores | Recommended RAM (GB) | Wall Time (HH:MM) | Storage I/O |
|---|---|---|---|---|---|---|
| Quality Control & Adapter Trimming | fastp v0.23.4 | 1B PE reads (2x150 bp) | 16 | 32 | 02:30 | High |
| Metagenome Assembly | MEGAHIT v1.2.9 | 1B PE reads | 64 | 512 | 24:00+ | Very High |
| Genome Binning | MetaBAT2 v2.15 | 500 contigs (>2.5 kbp) | 8 | 64 | 04:00 | Medium |
| GTDB-Tk Classification | GTDB-Tk v2.3.0 | 500 MAGs | 16 | 128 | 06:00 | Medium |
| Pangenome Analysis | Anvi'o v7.1 | 50 Marinisomatota genomes | 32 | 256 | 12:00 | High |
| Functional Annotation | Prokka v1.14.6 | 1 MAG (~5 Mbp) | 4 | 16 | 01:00 | Low |

3. Detailed Experimental Protocols

Protocol 3.1: Optimized GTDB Taxonomic Classification Pipeline

Objective: To classify putative Marinisomatota MAGs using the Genome Taxonomy Database Toolkit (GTDB-Tk) with resource-efficient prioritization.

  • Pre-classification Filtering: Filter MAGs using CheckM2 to select only those with >50% completeness and <10% contamination. This reduces downstream compute time on low-quality bins.
  • Batch Job Configuration: Package MAGs into batches of 50-100 genomes per SLURM/Job Scheduler array job.
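  • GTDB-Tk Execution: Run gtdbtk classify_wf once per batch, giving each batch its own genome directory and output directory so that array jobs do not overwrite one another.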

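  • Post-processing: Concatenate the per-batch summary files (gtdbtk.bac120.summary.tsv and, if any archaeal genomes were present, gtdbtk.ar53.summary.tsv) and filter for classifications within the Marinisomatota phylum (p__Marinisomatota, plus any suffixed placeholder lineages such as p__Marinisomatota_A if they occur in the release used).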
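
The batching and execution steps above can be sketched as follows. This is an illustration under stated assumptions: the directory layout (`batches/batch_000`, `gtdbtk_out/batch_000`) and function names are hypothetical, the commands are only assembled rather than run, and `--skip_ani_screen` is included on the assumption that the installed GTDB-Tk release requires either that flag or `--mash_db`.

```python
def make_batches(genomes, batch_size=50):
    """Split a list of genome paths into consecutive batches of at most batch_size."""
    return [genomes[i:i + batch_size] for i in range(0, len(genomes), batch_size)]


def gtdbtk_command(genome_dir, out_dir, cpus=16, pplacer_cpus=2):
    """Assemble a classify_wf command for one batch (not executed here)."""
    return [
        "gtdbtk", "classify_wf",
        "--genome_dir", genome_dir,
        "--out_dir", out_dir,
        "--cpus", str(cpus),
        "--pplacer_cpus", str(pplacer_cpus),
        "--skip_ani_screen",  # assumption: required by recent releases unless --mash_db is given
    ]


# Hypothetical set of 120 MAG file names.
mags = [f"mag_{i:03d}.fa" for i in range(120)]
batches = make_batches(mags, batch_size=50)

for index, batch in enumerate(batches):
    # Each batch would be staged in its own directory before submission.
    cmd = gtdbtk_command(f"batches/batch_{index:03d}", f"gtdbtk_out/batch_{index:03d}")
    print(len(batch), " ".join(cmd))
```

In a scheduler array job, the array index would select the batch directory, so each task runs exactly one of these commands.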

Protocol 3.2: Resource-Aware Comparative Genomics Workflow

Objective: To construct a pangenome from curated Marinisomatota genomes without exhausting memory.

  • Dereplication: Use dRep v3.4.0 to cluster genomes at 99% ANI to reduce redundancy.

  • Annotation with Prokka (Parallelized): Use GNU Parallel to annotate dereplicated genomes simultaneously across allocated nodes.

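  • Pangenome Construction: Use Roary v3.13.0, setting the MCL inflation value explicitly (-iv 1.5, which is also Roary's default) so that the clustering granularity is recorded; raise it for finer clusters or lower it for coarser ones. Note that Roary's -i option is the minimum BLASTP identity, not the inflation value.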
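
The three steps of this workflow can be chained from a small driver script. The sketch below is one possible arrangement, not the original commands: it swaps GNU Parallel for Python's `ThreadPoolExecutor`, the file and directory names are hypothetical, and the `runner` argument defaults to `subprocess.run` but is replaced by `print` in the example so that nothing is executed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def drep_command(genomes, work_dir, ani=0.99, threads=16):
    """dRep dereplication at the given secondary ANI threshold."""
    return ["dRep", "dereplicate", work_dir,
            "-sa", str(ani), "-p", str(threads), "-g", *genomes]


def prokka_command(genome, out_root, cpus=4):
    """Prokka annotation of one genome into its own output directory."""
    name = Path(genome).stem
    return ["prokka", "--outdir", str(Path(out_root) / name),
            "--prefix", name, "--cpus", str(cpus), genome]


def roary_command(gff_files, out_dir, threads=32, inflation=1.5):
    """Roary pangenome run with an explicit MCL inflation value (-iv)."""
    return ["roary", "-f", out_dir, "-p", str(threads),
            "-iv", str(inflation), *gff_files]


def annotate_in_parallel(genomes, out_root, jobs=4, runner=subprocess.run):
    """Run one Prokka command per genome, at most `jobs` at a time."""
    commands = [prokka_command(g, out_root) for g in genomes]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(runner, commands))
    return commands


# Dry run on two hypothetical genomes: commands are printed, not executed.
genomes = ["genomes/mag1.fa", "genomes/mag2.fa"]
print(drep_command(genomes, "drep_out"))
annotate_in_parallel(genomes, "annot", jobs=2, runner=print)
print(roary_command(["annot/mag1/mag1.gff", "annot/mag2/mag2.gff"], "pangenome"))
```

Limiting `jobs` so that jobs multiplied by per-job CPUs stays within the node's core count avoids oversubscription, which is the main risk when annotating many genomes at once.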

4. Mandatory Visualization

[Figure: workflow diagram. Raw metagenomic reads -> QC and trimming (fastp) -> de novo assembly (MEGAHIT) -> genome binning (MetaBAT2) -> metagenome-assembled genomes (MAGs) -> quality filtering (CheckM2). MAGs that fail are discarded; MAGs that pass (>50% completeness, <10% contamination) -> taxonomic classification (GTDB-Tk) -> Marinisomatota MAG dataset -> comparative genomics and pangenome (Roary) -> hypotheses for drug target screening.]

Diagram Title: Marinisomatota MAG Analysis and Classification Pipeline

[Figure: decision tree. Start of analysis batch -> Q1: dataset > 1 TB or 1B reads? Yes: use a high-memory node (64 cores, 512 GB RAM). No -> Q2: more than 500 genomes to process? No: use a standard node (16 cores, 64 GB RAM). Yes -> Q3: memory-intensive task (pangenome/assembly)? Yes: use a high-CPU node (32 cores, 128 GB RAM). No: use job arrays for parallel batches on standard nodes.]

Diagram Title: Computational Resource Decision Tree
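
The decision tree above translates directly into a small helper that can be called from a job-submission script. This is a sketch of the logic as drawn; the function name, argument names, and returned dictionary layout are illustrative assumptions.

```python
def choose_node(dataset_tb, reads_billion, n_genomes, memory_intensive):
    """Pick a node type following the three questions of the decision tree."""
    # Q1: very large input goes straight to the high-memory node.
    if dataset_tb > 1 or reads_billion > 1:
        return {"node": "high-mem", "cores": 64, "ram_gb": 512, "job_array": False}
    # Q2: small genome sets run on a standard node.
    if n_genomes <= 500:
        return {"node": "standard", "cores": 16, "ram_gb": 64, "job_array": False}
    # Q3: many genomes; memory-intensive tasks get the high-CPU node,
    # everything else is split into job arrays on standard nodes.
    if memory_intensive:
        return {"node": "high-cpu", "cores": 32, "ram_gb": 128, "job_array": False}
    return {"node": "standard", "cores": 16, "ram_gb": 64, "job_array": True}


# Three hypothetical workloads.
print(choose_node(dataset_tb=2.0, reads_billion=0.5, n_genomes=10, memory_intensive=True))
print(choose_node(dataset_tb=0.1, reads_billion=0.2, n_genomes=800, memory_intensive=False))
print(choose_node(dataset_tb=0.1, reads_billion=0.2, n_genomes=50, memory_intensive=False))
```

Encoding the tree once keeps resource requests consistent across the project and makes the thresholds easy to revise when the cluster configuration changes.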

5. The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Computational Tools & Data Resources for Marinisomatota Research

| Item Name | Type | Primary Function in Analysis | Resource Optimization Tip |
|---|---|---|---|
| GTDB-Tk v2.3.0+ | Software/Reference Data | Assigns robust taxonomy using GTDB reference tree. Critical for placing novel Marinisomatota. | Use --scratch_dir (pointing to fast local SSD) to reduce pplacer's memory footprint when RAM is limiting; expect some loss of speed in exchange. |
| CheckM2 | Software/Model | Rapid assessment of MAG quality (completeness/contamination). | Use the pre-trained model; runs significantly faster with lower memory than CheckM1. |
| dRep | Software | Dereplicates genome sets based on ANI. Reduces computational load for downstream steps. | Adjust -nc (minimum alignment coverage between genomes) to suit MAG completeness, so that fragmented but related genomes are still compared. |
| Roary | Software | Rapid large-scale prokaryote pangenome analysis. Identifies core/accessory genes. | Tune the MCL inflation value with -iv (default 1.5); -i sets the minimum BLASTP identity and may need lowering for diverse sets. |
| Prokka | Software | Rapid annotation of bacterial genomes. Provides standard GFF3 for downstream tools. | Use --metagenome flag and --mincontiglen to optimize for MAG annotation. |
| GTDB R214 | Reference Database | Provides the standardized taxonomic framework and alignments for classification. | Download to a shared, high-performance filesystem to avoid redundant copies. |
| PFAM & TIGRFAM | HMM Database | For functional annotation of protein families within Marinisomatota genomes. | Combine with tools like anvi-run-hmms for efficient, parallelized annotation. |
| Slurm / SGE | Job Scheduler | Manages resource allocation on HPC clusters for parallelizable workflows. | Implement job arrays for classifying or annotating 100s of genomes efficiently. |

Application Notes & Protocols

Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly candidate phylum SAR406) presents unique challenges in taxonomic placement due to its deep evolutionary branching and the frequent recovery of genomes that appear to bridge it with other candidate phyla such as Muirbacteria, Uhrbacteria, and Gribaldobacteria. Accurate classification is critical for interpreting its ecological role in marine systems and assessing its potential in bioprospecting for novel enzymes or bioactive compounds.

1. Quantitative Data Summary: Key Genomic & Phylogenetic Markers

Table 1: Core Genome & Phylogenetic Marker Comparison Across Bridging Phyla

| Feature / Marker | Marinisomatota (GTDB r214) | Muirbacteria (GTDB) | Uhrbacteria (GTDB) | Bridging Genome Example (Bin.123) |
|---|---|---|---|---|
| Average Genome Size (Mbp) | 1.8 - 2.3 | 1.5 - 1.9 | 1.6 - 2.1 | 2.05 |
| Average GC Content (%) | 44 - 48 | 50 - 54 | 38 - 42 | 46.2 |
| tRNA Count (avg.) | 33 | 35 | 32 | 34 |
| 16S rRNA Identity to Marinisomatota (%) | 100 (ref) | 78.2 - 81.5 | 75.8 - 79.1 | 92.3 |
| Concatenated Marker (120) AAI to Marinisomatota (%) | 100 (ref) | 60.5 - 62.8 | 58.9 - 61.2 | 85.7 |
| CheckM2 Completeness (%) | >95 (high-quality) | >90 | >90 | 96.4 |
| CheckM2 Contamination (%) | <5 | <5 | <5 | 1.2 |
| Presence of Diagnostic Pathway | Yes (Partial TCA) | No | No | Yes |

Table 2: Diagnostic Metabolic Pathway Gene Presence/Absence

| Pathway Gene | Marinisomatota Consensus | Bridging Genome Annotation | Function & Taxonomic Relevance |
|---|---|---|---|
| Fumarate hydratase (class II) [K01679] | + | + | Key TCA cycle enzyme; conserved in Marinisomatota. |
| Rhodanese-domain protein [K01011] | + | + | Sulfur metabolism; a phylum-associated trait. |
| Group 3 [NiFe] hydrogenase [K06281, K06282] | + | + | Energy metabolism in anoxic environments. |
| Archaeal-like Rubisco (rbcL) | - | - | Distinguishes from photosynthetic relatives. |

2. Experimental Protocols

Protocol 1: Integrated Phylogenomic Placement of Ambiguous Genomes

Objective: To resolve classification of a genome bridging Marinisomatota and related phyla using the GTDB toolkit and supplementary analysis.

Materials: High-quality metagenome-assembled genome (MAG), GTDB-Tk v2.3.0, CheckM2, Python environment with scikit-bio, FastTree, IQ-TREE2.

Procedure:

  • Quality Assessment: Run checkm2 predict --input <mag.fasta> ... to assess completeness & contamination. Proceed only if completeness >90% & contamination <5%.
  • GTDB-Tk Default Classification: Execute gtdbtk classify_wf --genome_dir <dir> --out_dir <output> --cpus 8 --skip_ani_screen (GTDB-Tk v2.2.0 and later expect either --skip_ani_screen or --mash_db). Record the classification string, the classification method, and the RED value from the summary file; GTDB-Tk does not report a posterior probability for each rank.
  • Marker Extraction & Tree Building: If placement is ambiguous (e.g., the genome is left unassigned at a high rank or sits on a long, poorly supported branch), extract the 120 bacterial marker genes using gtdbtk identify and gtdbtk align, then create a custom concatenated alignment.
  • Reference Tree Construction: Build a reference tree with IQ-TREE2 (iqtree2 -s concat.align -m MFP -bb 1000 -nt 8) using a curated set of reference genomes from Marinisomatota, Muirbacteria, Uhrbacteria, and an outgroup.
  • Placement of Query Genome: Place the bridging genome onto the reference tree using pplacer (the placement engine behind the GTDB-Tk classify step) or EPA-ng in conjunction with the reference alignment.
  • Average Amino Acid Identity (AAI) Calculation: Calculate AAI between the query and all reference genomes using comparem aai_wf (https://github.com/dparks1134/CompareM). An AAI >80% suggests phylum-level affiliation; 60-80% indicates separate but related phyla.
  • Consensus Classification: Synthesize results from the GTDB-Tk classification, phylogenetic placement, and AAI. A genome is classified as Marinisomatota if it: a) clusters within the Marinisomatota monophyletic clade with >70% bootstrap support, b) shares AAI >80% with Marinisomatota references, and c) retains key metabolic markers (Table 2).
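
The consensus rule in the final step can be written down explicitly so that each criterion is reported separately. This is a minimal sketch: the class and function names are hypothetical, the AAI and marker values for Bin.123 are taken from Table 1, and the bootstrap value is an invented example because the table does not report one.

```python
from dataclasses import dataclass


@dataclass
class PlacementEvidence:
    in_marinisomatota_clade: bool      # clusters inside the monophyletic clade
    bootstrap_support: float           # support (%) for that clade
    aai_to_references: float           # AAI (%) to Marinisomatota references
    diagnostic_markers_present: bool   # key metabolic markers retained (Table 2)


def classify_consensus(evidence, min_bootstrap=70.0, min_aai=80.0):
    """Return (is_marinisomatota, per-criterion results) for the three-part rule."""
    checks = {
        "phylogeny": evidence.in_marinisomatota_clade
        and evidence.bootstrap_support > min_bootstrap,
        "aai": evidence.aai_to_references > min_aai,
        "metabolism": evidence.diagnostic_markers_present,
    }
    return all(checks.values()), checks


# Bin.123: AAI and marker status from Table 1; bootstrap value is hypothetical.
bin_123 = PlacementEvidence(True, 88.0, 85.7, True)
print(classify_consensus(bin_123))
```

Returning the individual checks alongside the verdict shows which line of evidence failed when a genome is not assigned to the phylum.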

Protocol 2: Validation via Diagnostic Metabolic Profiling

Objective: To validate phylogenetic placement by confirming the presence of Marinisomatota-diagnostic metabolic pathways.

Materials: Annotated genome (e.g., using Prokka or DRAM), KEGG database, HMMER suite, custom HMM profiles for diagnostic genes.

Procedure:

  • Functional Annotation: Annotate the bridging genome using prokka --outdir <dir> --prefix <mag> <mag.fasta>.
  • Target Gene HMM Search: Using hmmsearch with an E-value cutoff of 1e-20, search the translated proteome against custom HMMs built for diagnostic genes (e.g., fumarate hydratase class II, rhodanese-domain protein).
  • Pathway Reconstruction: Use the annotated KO terms to map genes to KEGG modules (e.g., M00009, TCA cycle). Manual curation is required to confirm pathway completeness and identify phylum-specific variants.
  • Comparative Analysis: Compare the reconstructed pathways to the consensus profiles in Table 2. A bridging genome showing a Marinisomatota-like profile provides functional evidence supporting its classification.
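
The HMM search and comparison steps can be post-processed with a short parser. The sketch below assumes HMMER's whitespace-delimited `--tblout` table, in which the first column is the target sequence, the third is the query profile, and the fifth is the full-sequence E-value; the profile names, protein identifiers, and scores in the example are hypothetical.

```python
def parse_tblout(text, max_evalue=1e-20):
    """Collect hits per query profile that meet the E-value cut-off."""
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            hits.setdefault(query, []).append((target, evalue))
    return hits


def profile_matches(hits, expected_genes):
    """Presence/absence of each expected diagnostic gene among the hits."""
    return {gene: gene in hits for gene in expected_genes}


# Hypothetical tblout excerpt for a bridging genome.
example_tblout = """\
# target name  accession  query name  accession  E-value  score  bias
prot_0012  -  fumC_classII  -  3.1e-85  280.4  0.1
prot_0431  -  rhodanese  -  2.0e-12  45.0  0.0
prot_0977  -  rhodanese  -  4.4e-31  102.3  0.2
"""

hits = parse_tblout(example_tblout)
print(hits)
print(profile_matches(hits, ["fumC_classII", "rhodanese", "NiFe_group3"]))
```

The resulting presence/absence dictionary can be compared directly against the consensus column of Table 2.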

3. Visualizations

[Figure: workflow diagram. Input MAG (bridging genome) -> quality filter (CheckM2) -> GTDB-Tk classify, which feeds three parallel analyses: phylogenomic tree placement, AAI analysis (CompareM), and diagnostic metabolic profile. All three feed data synthesis and the consensus decision, with two possible outputs: classified as Marinisomatota, or novel/sibling phylum.]

Title: Workflow for Resolving Phylogenomic Classification

[Figure: pathway diagram. Acetyl-CoA + oxaloacetate -> citrate (citrate synthase, K01647) -> isocitrate (aconitase, K01681) -> alpha-ketoglutarate (isocitrate dehydrogenase, K00031) -> succinyl-CoA (2-oxoglutarate dehydrogenase, K00164) -> succinate (succinyl-CoA synthetase, K01902) -> fumarate (succinate dehydrogenase, K00240) -> malate (class II fumarase, K01679) -> oxaloacetate (malate dehydrogenase, K00026).]

Title: Diagnostic Partial TCA Cycle in Marinisomatota

4. The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

| Item / Tool | Function in Analysis | Example / Note |
|---|---|---|
| GTDB-Tk (v2.3.0+) | Standardized taxonomic classification relative to GTDB phylogeny. | Uses ~120 bacterial marker genes & pplacer for placement. |
| CheckM2 | Estimates genome completeness & contamination rapidly. | Superior for genomes from novel lineages vs. CheckM1. |
| CompareM | Calculates Average Amino Acid Identity (AAI) & ANI. | Critical for quantifying genomic relatedness between phyla. |
| IQ-TREE2 | Phylogenetic inference with model testing & fast bootstrapping. | For building robust reference trees. |
| Prokka / DRAM | Rapid genome annotation & metabolic profiling. | DRAM specializes in metabolic pathway distillation for microbes. |
| Custom HMM Profiles | Detects conserved, phylum-diagnostic protein families. | Build using hmmbuild from curated alignments of target genes. |
| KEGG MODULE Database | Reference for pathway completeness assessment. | Manual curation required due to pathway variability in uncultivated lineages. |
| PhyloPhlAn 3.0 | Alternative for phylogeny using ~400 universal markers. | Useful as an orthogonal method to GTDB-Tk. |

Best Practices for Curation and Submission of Novel Marinisomatota Genomes

Application Notes

The accurate classification of novel genomes within the phylum Marinisomatota (formerly known as KS3-B09 or SAR406) is critical for advancing our understanding of their role in marine biogeochemical cycles and for exploring their biosynthetic potential. As per the Genome Taxonomy Database (GTDB) taxonomy (release 220), Marinisomatota is a distinct bacterial phylum primarily comprising uncultivated lineages from oceanic and deep-sea environments. Curation and submission of genomes from this group present unique challenges because they are frequently assembled from complex metagenomic datasets and branch deeply in the bacterial tree. Adherence to standardized practices ensures genomic data integrity, facilitates reproducible taxonomy, and enables downstream drug discovery pipelines to accurately target novel enzymatic pathways from these enigmatic organisms.

Protocols

1. Genome Assembly and Curation Protocol

Objective: To reconstruct high-quality Marinisomatota genomes from metagenomic sequence data.

Detailed Methodology:

  1. Sequence Pre-processing: Use fastp (v0.23.4) with parameters --detect_adapter_for_pe --trim_poly_g --cut_front --cut_tail to remove adapters and low-quality bases.
  2. Co-assembly: Perform de novo assembly on quality-filtered reads using MEGAHIT (v1.2.9) with the meta-large preset, or SPAdes (v3.15.5) in --meta mode for higher complexity samples.
  3. Binning: Execute binning using multiple tools: MetaBAT2 (v2.15), MaxBin2 (v2.2.7), and CONCOCT (v1.1.0). Generate a consensus set of bins using DAS Tool (v1.1.6).
  4. Taxonomic Assignment: Assign preliminary taxonomy to bins using the GTDB-Tk classify_wf command with database release R220. This requires GTDB-Tk v2.4.0 or later; v2.3.2 is paired with R214.
  5. Genome Refinement: For bins classified as Marinisomatota, perform manual refinement in Anvi'o (v7.1). Map reads back to the bin, inspect coverage and tetranucleotide frequency outliers, and remove contaminating contigs.
  6. Completeness/Contamination Assessment: Run CheckM2 (v1.0.1) to estimate genome completeness and contamination. Proceed only with medium-quality (MQG; ≥50% complete, <10% contaminated) or high-quality (HQG; ≥90% complete, <5% contaminated) genomes.

2. Phylogenomic Placement and Classification Protocol

Objective: To determine the precise taxonomic position of a novel Marinisomatota genome within the GTDB framework.

Detailed Methodology:

  1. Protein Marker Extraction: Use GTDB-Tk's identify and align commands to extract and align 120 bacterial single-copy marker genes.
  2. Reference Tree Placement: The classify step places the novel genome into the GTDB reference tree with pplacer. The infer command can additionally build a de novo tree from the alignment when the placement needs independent confirmation.
  3. Relative Evolutionary Divergence (RED) Calculation: The GTDB-Tk classify workflow automatically calculates the RED value, a quantitative measure of phylogenetic divergence.
  4. Taxonomic Assignment: Assign taxonomy based on the genome's position relative to defined RED thresholds for each rank. Novelty is indicated by placeholder names (e.g., "UBA..." for uncultivated bacterium).

3. Genome Submission and Annotation Protocol

Objective: To submit curated genomes to public repositories with standardized annotations.

Detailed Methodology:

  • Functional Annotation: Annotate the genome using Prokka (v1.14.6) for rapid gene calling, or a more comprehensive pipeline: DRAM (v1.4.4) for metabolism and KofamScan for KEGG orthologs.
  • Biosynthetic Gene Cluster (BGC) Identification: Run antiSMASH (v7.0) or DeepBGC to identify potential secondary metabolite BGCs, a key interest for drug development.
  • Metadata Collection: Compile minimal and contextual metadata following the Genomic Standards Consortium MIxS checklists, emphasizing environmental parameters (depth, salinity, temperature).
  • Submission: Submit the genome assembly, annotated features, and raw reads to the International Nucleotide Sequence Database Collaboration (INSDC) via the NCBI, ENA, or DDBJ submission portals.

Data Presentation

Table 1: Genomic Quality Standards for Marinisomatota Submissions

Quality Tier Completeness Contamination # of Contigs N50 (kb) GTDB Designation
High Quality ≥ 90% < 5% < 500 > 20 HQG
Medium Quality ≥ 50% < 10% < 1000 > 10 MQG
Low Quality < 50% ≥ 10% Not Applicable Not Applicable Exclude from taxonomy
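The completeness and contamination thresholds in Table 1 can be applied programmatically. The sketch below is a minimal illustration in Python, assuming only those two criteria are used (the contig-count and N50 columns are not checked); the function name `quality_tier` is hypothetical.

```python
def quality_tier(completeness: float, contamination: float) -> str:
    """Classify a genome using the completeness/contamination cut-offs of Table 1.

    Both arguments are percentages (0-100). Contig count and N50 are
    deliberately ignored in this simplified sketch.
    """
    if completeness >= 90 and contamination < 5:
        return "HQG"      # high-quality genome
    if completeness >= 50 and contamination < 10:
        return "MQG"      # medium-quality genome
    return "exclude"      # low quality: exclude from taxonomy


if __name__ == "__main__":
    # Hypothetical example values, not real genomes.
    for name, comp, cont in [("bin_01", 92.5, 1.2), ("bin_02", 78.9, 5.5), ("bin_03", 45.0, 2.0)]:
        print(name, quality_tier(comp, cont))
```

A genome that is highly complete but moderately contaminated (for example 95% complete, 7% contaminated) falls into the medium-quality tier under this logic.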

Table 2: Key GTDB Metrics for Novel Marinisomatota Classification

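Taxonomic Rank Typical Criterion Action for Novel Genome
Species Cluster ANI of roughly 95% or more to a species representative (RED is not used at this rank) Species left unnamed by GTDB-Tk if outside every existing cluster; GTDB assigns an spXXXXXXX placeholder label.
Genus Median RED ≈ 0.93 Genus left unnamed if the genome falls outside every named genus; GTDB curators assign placeholder names (e.g., 'UBA...').
Family Median RED ≈ 0.77 Family left unnamed if the genome represents a novel family-level lineage; placeholder names are assigned by GTDB.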

Mandatory Visualizations

Workflow diagram: Raw Metagenomic Reads → Quality Control & Adapter Trimming → De Novo Assembly → Binning (Multiple Tools) → Consensus Binning (DAS Tool) → GTDB-Tk: Preliminary Classification → Manual Curation & Refinement → CheckM2: Quality Assessment → High/Medium Quality Genome? If no, re-bin or re-assemble from the raw reads. If yes → Functional & BGC Annotation → INSDC Submission & Metadata → Publicly Available Marinisomatota Genome.

Title: Genome Curation & Submission Workflow

Workflow diagram: Novel Marinisomatota Genome → Extract 120 Bacterial Marker Proteins → Align to GTDB Reference Set → Place in Reference Phylogeny → Calculate Relative Evolutionary Divergence (RED) → Compare RED to Rank Thresholds → Assign GTDB Taxonomy (Species, Genus, Family) → Apply 'UBA' Prefix if Novel at Rank.

Title: GTDB Phylogenomic Classification Protocol

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools and Resources for Marinisomatota Genome Research

Item / Tool Function / Purpose Source / Example
GTDB-Tk (v2.3.2+) Standardized toolkit for phylogenomic classification using the GTDB database. Essential for taxonomy. GitHub: ecogenomics/gtdbtk
CheckM2 Rapid and accurate estimation of genome completeness and contamination in bacterial genomes. GitHub: chklovski/CheckM2
DRAM (Distilled & Refined Annotation of Metabolism) Comprehensive functional annotation pipeline, highlighting metabolic potential and virulence. GitHub: WrightonLabCSU/DRAM
antiSMASH Identifies Biosynthetic Gene Clusters (BGCs) for secondary metabolites; crucial for drug discovery screens. https://antismash.secondarymetabolites.org
Anvi'o Interactive platform for microbial 'omics, essential for manual bin refinement and visualization. http://merenlab.org/software/anvio/
MIXS Checklists Standardized metadata reporting formats to ensure data reproducibility and integration. Genomic Standards Consortium
NCBI Prokaryotic Genome Annotation Pipeline (PGAP) Recommended for consistent structural and functional annotation prior to INSDC submission. NCBI GitHub

GTDB vs. Legacy Systems: Validating Marinisomatota Taxonomy and Comparative Genomics Insights

Within the broader thesis on the genomic and metabolic diversity of the Marinisomatota phylum (formerly known as Marinimicrobia), accurate taxonomic classification is a foundational challenge. This phylum, prevalent in marine environments, exhibits significant metabolic versatility with implications for biogeochemical cycling and potential biotechnological applications. The Genome Taxonomy Database (GTDB) taxonomy, based on conserved single-copy marker genes and relative evolutionary divergence, often conflicts with the established but sometimes phenotypically influenced NCBI taxonomy. This discrepancy is particularly pronounced for Marinisomatota, where numerous reclassifications and the delineation of new candidate lineages have been proposed. This application note provides a protocol for benchmarking these two classification systems for Marinisomatota clades, enabling researchers to critically evaluate genomic data within a consistent framework for downstream ecological, evolutionary, and drug discovery research.

Core Quantitative Comparison

Table 1: High-Level Taxonomic Comparison for Marinisomatota

Taxonomic Rank NCBI Taxonomy (as of latest update) GTDB Release R214 (April 2023) Notes/Implications
Phylum Marinimicrobia (PRI) p__Marinisomatota GTDB uses the name Marinisomatota.
Class Level Multiple candidate classes (e.g., SAR406 clade) c__Marinisomatia (and others split into separate phyla) GTDB splits the group into multiple phylum-level taxa.
Order Level Not consistently defined o__Marinisomatales (within p__Marinisomatota) Clearer hierarchical structure in GTDB.
Representative Genus Marinimicrobium, Candidatus Litoricolaceae Multiple genera under Marinisomatales (e.g., UBA10353, MSA-10) Genus-level assignments differ radically.
Number of Genome Assemblies ~500+ labeled under Marinimicrobia ~400+ classified under p__Marinisomatota and related new phyla. Counts vary due to reclassification.

Table 2: Benchmarking Metrics for a Representative Clade (e.g., SAR406)

Metric NCBI Taxonomy Classification Result GTDB Taxonomy Classification Result Benchmarking Advantage
Average Amino Acid Identity (AAI) within group 65.2% ± 5.1% 72.8% ± 3.5% GTDB classification yields more genomically coherent groups.
Percentage of Conserved Single-Copy Marker Genes 89% 98% GTDB groups maintain higher essential gene content.
Relative Evolutionary Divergence (RED) Score Not applied 0.65 (clearly delineated from sister phyla) Provides quantitative rank normalization.
Congruence with 16S rRNA Gene Tree Moderate (long-branch attraction issues) High for defined taxa (uses 120 concatenated marker proteins) Improved phylogenetic resolution.

Experimental Protocols

Protocol 3.1: Genome Dataset Curation and Taxonomic Labeling

Objective: To assemble a balanced genome dataset with dual (NCBI & GTDB) labels for benchmarking.

Materials:

  • High-performance computing cluster or server.
  • ncbi-genome-download tool.
  • GTDB-Tk v2.3.0 software package & corresponding R214 data files.
  • Custom Python scripts for data parsing (available in thesis repository).

Procedure:

  • NCBI Genome Retrieval: Using ncbi-genome-download, download all bacterial genomes associated with the NCBI taxon ID for Marinimicrobia. Restrict the download with the --assembly-levels complete,chromosome,scaffold filter.
  • GTDB Classification: Run GTDB-Tk (classify_wf) on the downloaded genome assemblies. This will assign GTDB taxonomy based on the R214 reference tree.
  • Create Mapping Table: Parse the NCBI assembly reports and the GTDB-Tk output to generate a master table with columns: Assembly_Accession, NCBI_Phylum, NCBI_Class, GTDB_Phylum, GTDB_Class, GTDB_Red_Value.
  • Filter & Balance: Filter out low-quality genomes (<90% completeness, >5% contamination as assessed by CheckM2). Balance the dataset to include representative genomes from each major clade in both systems.
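The mapping-table step can be sketched in Python as follows. This is an illustrative outline rather than the thesis scripts: it assumes the GTDB-Tk bacterial summary file provides `user_genome`, `classification`, and `red_value` columns, and it takes the NCBI labels as a plain dictionary (how that dictionary is built from assembly reports is left open). The function names are hypothetical.

```python
import csv
import io

RANKS = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_gtdb_lineage(lineage: str) -> dict:
    """Split 'd__Bacteria;p__Marinisomatota;...' into {rank: name}."""
    ranks = {}
    for token in lineage.split(";"):
        prefix, _, name = token.strip().partition("__")
        if prefix in RANKS:
            ranks[RANKS[prefix]] = name  # empty string if the rank is unassigned
    return ranks


def build_mapping(summary_tsv: str, ncbi_labels: dict) -> list:
    """Combine GTDB-Tk summary rows with NCBI labels into master-table rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(summary_tsv), delimiter="\t"):
        gtdb = parse_gtdb_lineage(rec["classification"])
        ncbi = ncbi_labels.get(rec["user_genome"], {})
        red = rec.get("red_value", "N/A")
        rows.append({
            "Assembly_Accession": rec["user_genome"],
            "NCBI_Phylum": ncbi.get("phylum", ""),
            "NCBI_Class": ncbi.get("class", ""),
            "GTDB_Phylum": gtdb.get("phylum", ""),
            "GTDB_Class": gtdb.get("class", ""),
            "GTDB_Red_Value": None if red in ("", "N/A") else float(red),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical one-genome example.
    demo = ("user_genome\tclassification\tred_value\n"
            "GCA_000000001.1\td__Bacteria;p__Marinisomatota;c__Marinisomatia;"
            "o__Marinisomatales;f__;g__;s__\t0.71\n")
    ncbi = {"GCA_000000001.1": {"phylum": "Marinimicrobia", "class": ""}}
    print(build_mapping(demo, ncbi))
```

Keeping unassigned ranks as empty strings makes it easy to spot genomes that GTDB-Tk could not resolve below a given rank.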

Protocol 3.2: Phylogenomic Tree Reconciliation Analysis

Objective: To visualize and quantify the discordance between classification systems.

Materials:

  • IQ-TREE 2.2.0 for maximum likelihood phylogeny.
  • bac120 marker gene set from GTDB or a custom set of 74 universal single-copy genes.
  • ETE3 Python toolkit for tree analysis and visualization.

Procedure:

  • Marker Gene Extraction & Alignment: Identify and concatenate the bac120 marker genes from each curated genome using GTDB-Tk or HMMER with custom profiles.
  • Phylogenomic Inference: Construct a maximum-likelihood tree using IQ-TREE with model LG+F+G and 1000 ultrafast bootstrap replicates.
  • Tree Annotation: Use ETE3 to map the NCBI and GTDB taxonomy labels onto the tree leaf nodes. Color-code branches based on phylum-level assignment from each system.
  • Discordance Metric Calculation: Calculate the Robinson-Foulds distance between the phylogenomic tree topology and the hierarchical "tree" implied by each taxonomy system. A lower distance indicates better congruence with the genomic data.
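To make the discordance metric concrete, the following Python sketch computes a rooted (clade-based) Robinson-Foulds distance between two trees written as nested tuples of leaf names. It is a simplified stand-in under stated assumptions: ETE3 ships its own Robinson-Foulds routine, which would normally be used on real Newick trees, and the taxonomy "tree" here is just a hand-written multifurcating hierarchy. The names `clades` and `rooted_rf` are hypothetical.

```python
def clades(tree) -> set:
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    A tree is a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root_leaves = walk(tree)
    found.discard(root_leaves)  # the root clade is shared by definition
    return found


def rooted_rf(tree_a, tree_b) -> int:
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


if __name__ == "__main__":
    phylogeny = (("A", "B"), ("C", "D"))   # hypothetical genome tree
    taxonomy_1 = (("A", "B"), ("C", "D"))  # taxonomy matching the tree
    taxonomy_2 = (("A", "C"), ("B", "D"))  # taxonomy conflicting with the tree
    print(rooted_rf(phylogeny, taxonomy_1), rooted_rf(phylogeny, taxonomy_2))
```

Because taxonomies are usually multifurcating, raw distances are best compared between systems on the same genome tree rather than interpreted in absolute terms.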

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item Function/Benefit in Benchmarking Source/Example
GTDB-Tk Software Standardized toolkit for assigning GTDB taxonomy to genomes; ensures reproducibility. https://github.com/ecogenomics/gtdbtk
CheckM2 Rapid, accurate assessment of genome completeness and contamination; critical for quality filtering. https://github.com/chklovski/CheckM2
bac120 / ar53 Marker Sets Curated sets of 120 bacterial and 53 archaeal single-copy genes (ar53 replaced the earlier ar122 set); provide standardized data for phylogenomics. Included with GTDB-Tk.
IQ-TREE Efficient software for maximum likelihood phylogenetic inference with model selection. http://www.iqtree.org/
ETE3 Toolkit Python environment for analyzing, manipulating, and visualizing trees and taxonomies. http://etetoolkit.org/
NCBI Datasets CLI Programmatic access to download NCBI genome assemblies and associated metadata. https://www.ncbi.nlm.nih.gov/datasets/

Mandatory Visualizations

Workflow diagram: Raw Genome Assemblies (Public Repositories) feed two parallel steps, NCBI Genome Download & Metadata Extraction and the GTDB-Tk Classification Workflow. Both pass through Quality Control (CheckM2), yielding NCBI Taxonomy Labels and GTDB Taxonomy Labels (R214). The two label sets are merged into a Unified Mapping Table → Phylogenomic Tree Construction (IQ-TREE) → Benchmarking Analysis (Metrics & Visualization).

Workflow for Taxonomic Benchmarking

Diagram: Labels from the GTDB Taxonomy System and the NCBI Taxonomy System are each mapped onto the Reference Phylogenomic Tree (based on 120 proteins). Each system is scored by Robinson-Foulds Distance and by Clade Coherence (AAI, RED), and the two metrics together give the Quantified Classification Discordance.

Taxonomic System Comparison Logic

Application Notes

Within the GTDB taxonomic framework, the phylum Marinisomatota (formerly candidate phylum SAR406) represents a deep-branching lineage distinct from its phenotypically and ecologically similar neighbor, Bacteroidota. This analysis highlights key genomic and metabolic features that delineate their evolutionary divergence, critical for interpreting ocean carbon cycling and guiding bioprospecting efforts.

Table 1: Core Genomic & Metabolic Divergence Metrics

Feature Marinisomatota (Avg.) Bacteroidota (Avg.) Implication for Divergence
Genome Size (Mbp) 2.1 - 2.8 4.2 - 6.5 Streamlined, oligotrophic adaptation in Marinisomatota
GC Content (%) 34 - 38 39 - 48 Distinct nucleotide composition & codon bias
16S rRNA Identity (%) < 75% Reference Phylum-level taxonomic separation (GTDB)
Glycoside Hydrolases (GHs) Low count, specific types High count, diverse (e.g., GH13, GH16) Limited polysaccharide diversity in Marinisomatota
Respiratory Chain Predicted HiPIP → bc1 complex Diverse (e.g., fumarate reduct., flavin-based) Unique electron transport via high-potential iron-sulfur protein
Carbon Fixation RuBisCO-like protein (RLP) Absent in most Potential for CO2 metabolism in dark ocean
Nitrogen Metabolism Nitrate/nitrite transporters Urease, peptidases N-source specialization; Marinisomatota targets inorganic N

Table 2: Diagnostic Marker Genes for Phylogenetic Delineation

Gene/Protein Family Presence in Marinisomatota Presence in Bacteroidota Use as Phylogenetic Marker
RNA Polymerase Beta Subunit (rpoB) Unique conserved inserts Canonical sequence GTDB backbone tree placement
Conserved Signature Proteins (CSPs) 21 unique CSPs identified 45 unique CSPs identified Phylum-specific molecular synapomorphies
HiPIP (High-potential iron-sulfur) Widespread, conserved Rare, not conserved Functional marker for electron transport
Porphobilinogen deaminase (HemC) Specific variant (MVG) Specific variant (LAG) Amino acid motif diagnostic

Protocols

Protocol 1: In Silico Phylogenomic Delineation Using GTDB-Tk

Objective: To reconstruct the phylogenetic position of Marinisomatota genomes relative to Bacteroidota and other adjacent phyla.

Materials:

  • High-quality metagenome-assembled genomes (MAGs).
  • Computational cluster with ≥ 32 GB RAM.
  • GTDB-Tk software, v2.4.0 or later as required for release 220 reference data (https://github.com/ecogenomics/gtdbtk).
  • Reference data pack (release 220).

Procedure:

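  • Genome Preparation: Place bacterial genome files (.faa for proteins, .fna for nucleotides) in a single directory. Ensure MAG quality (completeness > 50%, contamination < 10%).
  • Run GTDB-Tk Classify: Run the classify_wf workflow on the genome directory, writing results to a dedicated output directory.

  • Tree Inference: Run the infer command on the multiple sequence alignment produced during classification to build a de novo tree, then root the tree with an appropriate outgroup (for example, using the GTDB-Tk root command).

  • Analysis: Visualize the tree (e.g., in iTOL). Note the monophyletic clustering of Marinisomatota separate from the Bacteroidota clade, and check that the separating branches have high support values.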
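The classification and tree-inference steps above can be scripted. The Python sketch below only assembles the command lines; it assumes GTDB-Tk is installed and its reference data configured, and the directory and file paths are hypothetical placeholders. Recent GTDB-Tk releases expect either `--skip_ani_screen` or a Mash database option for `classify_wf`; the sketch uses the former as an assumption about the intended setup.

```python
import subprocess


def classify_wf_cmd(genome_dir: str, out_dir: str, extension: str = "fna", cpus: int = 8) -> list:
    """Build the argument list for the GTDB-Tk classify workflow."""
    return ["gtdbtk", "classify_wf",
            "--genome_dir", genome_dir,
            "--out_dir", out_dir,
            "--extension", extension,
            "--cpus", str(cpus),
            "--skip_ani_screen"]  # assumption: no Mash pre-screen database is used


def infer_cmd(msa_file: str, out_dir: str, cpus: int = 8) -> list:
    """Build the argument list for de novo tree inference from an alignment."""
    return ["gtdbtk", "infer",
            "--msa_file", msa_file,
            "--out_dir", out_dir,
            "--cpus", str(cpus)]


def run(cmd: list) -> None:
    """Execute a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Printing instead of running keeps the example safe without GTDB-Tk installed.
    print(" ".join(classify_wf_cmd("mags/", "gtdbtk_out/")))
    print(" ".join(infer_cmd("gtdbtk_out/align/user_msa.fasta.gz", "tree_out/")))
```

The alignment path passed to `infer_cmd` is illustrative; the actual file name should be taken from the align output of the classification run.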

Protocol 2: Metabolic Pathway Discrepancy Analysis via KEGG Decoder

Objective: To compare and visualize the completeness of core metabolic pathways between the phyla.

Materials:

  • Annotated genomes (e.g., using PROKKA or DRAM).
  • KEGG Decoder script (https://github.com/bjtully/BioData/tree/master/KEGGDecoder).
  • Python3 with matplotlib and seaborn.

Procedure:

  • Annotation: Annotate all genomes uniformly. Generate KEGG Orthology (KO) assignments using kofamscan.
  • Generate Input: Create a binary matrix of KOs per genome using custom scripts.
  • Run KEGG Decoder: Run KEGG Decoder on the KO table to score pathway completeness for each genome.

  • Visualize: The script generates heatmaps. Key divergent pathways to highlight: Oxidative Phosphorylation (presence of petABC for bc1 complex in Marinisomatota), Glycan Metabolism (depleted in Marinisomatota), and Carbon Fixation (presence of RLP genes).
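The idea behind the pathway comparison can be illustrated with a small Python sketch that scores the fraction of a pathway's KOs present in each genome. This is a deliberate simplification and not the KEGG Decoder algorithm, which encodes pathway-specific rules. The KO identifiers listed (cytochrome bc1 subunits and the RuBisCO large subunit) are given as illustrative assumptions and should be checked against the current KEGG release; note that a KO hit alone does not distinguish RuBisCO from RuBisCO-like proteins.

```python
# Assumed KO sets for illustration; verify against KEGG before use.
PATHWAYS = {
    "cytochrome_bc1_complex": {"K00411", "K00412", "K00413"},
    "rubisco_large_subunit": {"K01601"},
}


def pathway_completeness(genome_kos: set, pathway_kos: set) -> float:
    """Fraction of the pathway's KOs that were found in the genome."""
    if not pathway_kos:
        return 0.0
    return len(genome_kos & pathway_kos) / len(pathway_kos)


def completeness_table(genomes: dict) -> dict:
    """Return {genome: {pathway: completeness}} for every genome and pathway."""
    return {
        name: {pw: pathway_completeness(kos, pw_kos) for pw, pw_kos in PATHWAYS.items()}
        for name, kos in genomes.items()
    }


if __name__ == "__main__":
    # Hypothetical KO assignments for two MAGs.
    demo = {
        "marinisomatota_mag": {"K00411", "K00412", "K00413", "K01601"},
        "bacteroidota_mag": {"K00412"},
    }
    for genome, scores in completeness_table(demo).items():
        print(genome, scores)
```

The resulting nested dictionary maps directly onto a genome-by-pathway matrix suitable for a heatmap.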

Diagrams

Workflow diagram: Input MAG Collections (GTDB classified) are split into Marinisomatota Genomes and Bacteroidota Genomes, and both sets enter two workflows. Phylogenomic Analysis: Marker Gene Alignment (120) → Maximum-Likelihood Tree Inference → Branch Support (Bootstraps). Pathway Comparison: KO Assignment (kofamscan) → Pathway Completeness (KEGG Decoder) → Divergence Heatmap. Both workflows feed the output, Diagnostic Divergence Features.

Title: Phylogenomic & Metabolic Analysis Workflow

Diagram: Deep Ocean Niche (Low Nutrient, Dark) → Marinisomatota Core Divergence Traits, which comprise (1) Unique Electron Transport: HiPIP → bc1 Complex; (2) CO2 Metabolism Potential: RuBisCO-like Protein (RLP); and (3) Limited Glycan Breakdown: Low GH Count & Diversity. All three lead to a Divergent Evolutionary Path vs. Bacteroidota.

Title: Key Divergence Traits of Marinisomatota

The Scientist's Toolkit

Table 3: Essential Research Reagents & Solutions

Item Function in Comparative Genomics Example Product/Reference
GTDB-Tk Reference Data Provides standardized bacterial/archaeal marker set & taxonomy for consistent phylogenomic placement. GTDB Release 220 (R220)
KEGG KofamScan Database Profile HMM database for accurate KEGG Orthology (KO) assignment from protein sequences. KEGG Release (e.g., 2024-01)
CheckM2 / BUSCO Assess genome completeness & contamination of MAGs prior to comparative analysis. CheckM2 (v1.0.2)
FastTree / IQ-TREE2 Software for rapid & accurate maximum-likelihood phylogenetic inference on marker alignments. IQ-TREE2 (v2.2.6)
DRAM (Distilled & Refined Annotations of Metabolism) Tool to annotate MAGs & distill metabolic profiles, highlighting pathways like vitamin synthesis & carbon utilization. DRAM (v1.5)
Anti-HiPIP Antibodies For experimental validation of the predicted unique electron transport chain component via western blot. Custom polyclonal (e.g., GenScript)
Defined Oligotrophic Media For cultivation attempts, mimicking deep-sea conditions (low organic C, high pressure, NO3- as N source). AMS1 media recipe modifications

Introduction & Thesis Context

Within the broader thesis research on the phylum Marinisomatota (GTDB nomenclature; corresponding to Marinimicrobia or SAR406 in NCBI and legacy taxonomies), a critical challenge is translating standardized genomic taxonomy into ecological understanding. The Genome Taxonomy Database (GTDB) provides a phylogenetically consistent framework, but ecological inferences drawn from its classifications require validation through independent metagenomic surveys. This protocol outlines a method to cross-reference GTDB-derived lineages against environmental metagenomic datasets to confirm habitat associations, co-occurrence patterns, and putative metabolic roles, thereby grounding taxonomic revisions in ecological reality.

Application Notes & Protocols

Protocol 1: Creating a Curated GTDB Reference Package for Marinisomatota

  • Data Retrieval: Access the latest GTDB release (e.g., R220) via the gtdb-tk software package or the GTDB website. Extract all genomes classified within the phylum Marinisomatota.
  • Quality Filtering: Filter genomes based on GTDB quality criteria (CheckM completeness >50%, contamination <10%). Retain only representative or "dereplicated" genomes as defined by GTDB to reduce redundancy.
  • Metadata Compilation: For each retained genome, compile associated metadata: GTDB taxonomy (e.g., p__Marinisomatota, c__Marinisomatia, o__UBA10353), NCBI biome and feature annotations, and genomic characteristics (genome size, GC content).
  • Format for Downstream Analysis: Create a BLAST or Bowtie2 database of the filtered genome sequences. Structure the associated metadata into a tab-delimited table.

Table 1: Example Curated GTDB Marinisomatota Reference Set (Hypothetical Data)

GTDB Genome ID GTDB Taxonomy (Phylum to Genus) CheckM Completeness (%) CheckM Contamination (%) NCBI Isolation Source Genome Size (Mbp)
GB_GCA_123456 p__Marinisomatota; c__Marinisomatia; o__UBA10353; f__UBA10353; g__UBA10353 92.5 1.2 Marine sediment 4.8
GB_GCA_789012 p__Marinisomatota; c__P2B42; o__UBA10234; f__UBA10234; g__JAAOCX01 78.9 5.5 Activated sludge 6.1
RS_GCF_345678 p__Marinisomatota; c__P2B42; o__P2B42; f__P2B42; g__P2B42 86.7 2.8 Human gut 3.9

Protocol 2: Metagenomic Read Recruitment & Taxonomic Binning

  • Metagenome Selection: Select public or in-house metagenomic studies from target environments (e.g., marine, freshwater, human gut, bioreactors) from repositories like the SRA or MG-RAST.
  • Read Mapping: Use bowtie2 or BWA to map quality-filtered metagenomic reads against the curated Marinisomatota reference database. Use sensitive parameters (--very-sensitive for bowtie2).

  • Abundance Estimation: Use samtools and custom scripts to calculate depth of coverage and breadth of coverage for each reference genome. Normalize by genome length and total metagenome reads to estimate relative abundance (RPKM or TPM).
  • Taxonomic Profiling: Perform independent taxonomic profiling of the same metagenomes using a marker-based profiler such as MetaPhlAn 3 or mOTUs to obtain a community profile. These tools report NCBI-style taxonomy by default, so map their output to GTDB names before comparing the presence/absence of Marinisomatota clades with the recruitment results.
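The abundance-estimation step can be sketched in Python as below. It is a minimal illustration assuming that mapped-read counts and per-base depths have already been extracted (for example from samtools output); the function names are hypothetical. RPKM here is reads per kilobase of reference genome per million total metagenome reads.

```python
def rpkm(mapped_reads: int, genome_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of genome per million reads in the metagenome."""
    kilobases = genome_length_bp / 1_000
    millions = total_reads / 1_000_000
    return mapped_reads / (kilobases * millions)


def breadth_of_coverage(per_base_depth: list, min_depth: int = 1) -> float:
    """Fraction of reference positions covered by at least `min_depth` reads."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


def mean_depth(per_base_depth: list) -> float:
    """Average depth of coverage across all reference positions."""
    if not per_base_depth:
        return 0.0
    return sum(per_base_depth) / len(per_base_depth)


if __name__ == "__main__":
    # Hypothetical numbers: 500 reads on a 2 Mbp genome in a 10 M read metagenome.
    print(rpkm(500, 2_000_000, 10_000_000))
    print(breadth_of_coverage([0, 3, 5, 0]), mean_depth([0, 3, 5, 0]))
```

Reporting breadth alongside RPKM helps separate genuinely present genomes from those recruiting reads only to a few conserved regions.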

Table 2: Cross-Referencing Results from a Hypothetical Marine Metagenome

Detected Taxon (GTDB) Read Recruitment Abundance (RPKM) MetaPhlAn3 Relative Abundance (%) Concordance (Y/N/Partial) Inferred Primary Habitat from Cross-Reference
g__UBA10353 (Marinisomatia) 45.2 0.05 Y Marine sediment
g__JAAOCX01 (P2B42) 0.8 <0.001 Partial (Low detection) Possibly transient / not native
g__P2B42 (P2B42) 0.1 0.0 N Non-marine; likely contamination

Protocol 3: Phylogenomic Validation of Ecological Clustering

  • Phylogenetic Tree Construction: Build a reference tree from the curated Marinisomatota genomes using a set of >100 conserved single-copy marker genes (via GTDB-Tk de_novo_wf).
  • Environmental Metadata Mapping: Map the habitat source (from metagenomic recruitment) onto the tree leaves as a discrete trait.
  • Analysis: Visually and statistically (e.g., using CAST or consenTRAIT) assess if specific phylogenetic clades are significantly associated with specific environments (e.g., marine vs. terrestrial).

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials and Tools for Validation Workflow

Item Function & Explanation
GTDB-Tk (v2.3.0+) Software toolkit to classify genomes into the GTDB taxonomy and generate phylogenomic trees. Essential for standardizing input genomes.
CheckM2 Assesses genome quality (completeness, contamination) for filtering the reference database. More accurate than CheckM1 for diverse bacteria.
Bowtie2 / BWA Read mapping tools for recruiting metagenomic reads to the reference genome database. Critical for quantifying environmental presence.
MetaPhlAn 3 Profiler for metagenomic taxonomic composition using clade-specific marker genes. Provides an independent community profile for cross-validation once its taxonomy is mapped to GTDB names.
Non-Redundant GTDB Reference Database (RefSeq & GenBank genomes) Provides the standardized, de-replicated genome set. The foundation for creating a phylum-specific reference package.
SRA Toolkit Downloads raw metagenomic sequencing data from the NCBI Sequence Read Archive for analysis.
iTOL / ggtree Interactive Tree of Life web tool or R package for visualizing phylogenetic trees with annotated metadata (e.g., habitat).

Diagrams

Workflow diagram: Start: GTDB Release & Metagenome SRA Studies → 1. Build Curated Marinisomatota Reference DB. The same reference feeds 2. Read Recruitment (Mapping), which yields Quantitative Data: Coverage & Abundance, and 3. Independent Taxonomic Profiling, which yields Community Profile: Taxonomic Abundance. Both outputs enter Cross-Reference & Concordance Analysis → 4. Phylogenomic Tree with Habitat Traits → Validated Ecological Inference for Clades.

Cross-Referencing Validation Workflow

Data-flow diagram: The GTDB Reference DB supplies the Curated Marinisomatota Genomes. A Metagenomic Study (SRA) and the curated genomes both feed Read Mapping, producing the Recruitment Abundance Table, and Marker Gene Profiling, producing the Community Profile Table. The two tables are combined into an Integrated Habitat-Taxon Matrix. The curated genomes provide the genes for the Phylogenomic Tree (Habitat Labels) and the matrix provides its traits; the tree leads to Validated Habitat Inference & Ecological Role.

Data Integration for Ecological Inference

The Genome Taxonomy Database (GTDB) provides a standardized, genome-based taxonomy that frequently reclassifies microbial lineages, including the phylum Marinisomatota (known as Marinimicrobia in NCBI lineages, and as candidate phylum SAR406 or Marine Group A in historical literature). This reclassification presents both a challenge and an opportunity for researchers. Legacy data, published literature, and associated metabolic models or drug target identifications become semantically disconnected from current genomic understanding. These Application Notes provide a framework for reconciling historical data with the GTDB taxonomy to ensure robust, reproducible science in marine microbiology and marine natural product discovery.

Quantitative Impact Analysis of Reclassification

Table 1: Comparative Taxonomy of Key Marinisomatota Lineages: GTDB r220 vs. NCBI/SILVA Legacy Systems

GTDB r220 Taxonomy (Phylum/Class/Order) Approximate Legacy NCBI/SILVA Equivalent Notable Phenotypic/Metabolic Traits (from Literature) Key Publications Affected (Example Count)
p__Marinisomatota (full phylum) Candidate phylum SAR406, Marinimicrobia Oligotrophic, deep-sea adapted, putative role in sulfur & carbon cycling. >500 (metagenomic surveys, oceanography)
c__Marinisomatia Marine Group A, SAR406 clade Abundant in oxygen minimum zones, genome indicates autotrophic potential. ~300 (biogeochemical studies)
c__Aureabacteria Uncultivated descendant of SAR406 Found in saline lakes, distinct genomic repertoire. ~50 (extreme environment studies)
o__UBA1416 Sub-clade within SAR406 Associated with particulate organic matter. ~75 (carbon flux research)

Table 2: Protocol for Taxonomic Reconciliation of Existing Data and Models

Step Protocol Description Tools/Resources Expected Output
1. Identifier Mapping Cross-reference legacy genome/OTU IDs (e.g., from NCBI) with GTDB using canonical correspondence files. GTDB-Tk, gtdb_to_taxdump.tsv file from GTDB, EBI Metagenomics. Table linking NCBI accession to GTDB accession & taxonomy.
2. Literature Re-annotation Systematically search and tag existing literature with updated GTDB nomenclature using text-mining. Custom Python scripts with BioPython & PubMed API, Zotero/Mendeley. Annotated reference library with dual nomenclature.
3. Metabolic Model Validation Remap reaction annotations (KEGG, MetaCyc) in legacy metabolic models to genomes in GTDB reference tree. ModelSEED, KBase, PATRIC, RAST toolkit. Updated genome-scale metabolic models (GEMs) under GTDB taxonomy.
4. Phylogenetic Contextualization Place legacy sequence data within the GTDB reference tree via phylogenetic placement. GTDB-Tk classify_wf, EPA-ng, pplacer. Newick tree with query sequences placed within GTDB framework.
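The identifier-mapping step in Table 2 can be sketched in Python. The example assumes a GTDB metadata table with `accession`, `gtdb_taxonomy`, and `ncbi_taxonomy` columns and relies on GTDB accessions being NCBI assembly accessions prefixed with `RS_` (RefSeq) or `GB_` (GenBank). The function names and the inline demo data are hypothetical.

```python
import csv
import io


def ncbi_accession(gtdb_accession: str) -> str:
    """Strip the GTDB 'RS_'/'GB_' prefix to recover the NCBI assembly accession."""
    for prefix in ("RS_", "GB_"):
        if gtdb_accession.startswith(prefix):
            return gtdb_accession[len(prefix):]
    return gtdb_accession


def build_lookup(metadata_tsv: str) -> dict:
    """Return {NCBI accession: {'gtdb_accession', 'gtdb_taxonomy', 'ncbi_taxonomy'}}."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        lookup[ncbi_accession(row["accession"])] = {
            "gtdb_accession": row["accession"],
            "gtdb_taxonomy": row["gtdb_taxonomy"],
            "ncbi_taxonomy": row["ncbi_taxonomy"],
        }
    return lookup


if __name__ == "__main__":
    # Hypothetical two-column excerpt of a metadata file.
    demo = ("accession\tgtdb_taxonomy\tncbi_taxonomy\n"
            "GB_GCA_000000001.1\td__Bacteria;p__Marinisomatota\t"
            "d__Bacteria;p__Candidatus Marinimicrobia\n")
    table = build_lookup(demo)
    print(table["GCA_000000001.1"]["gtdb_taxonomy"])
```

Keying the lookup by NCBI accession lets legacy tables be joined to GTDB taxonomy without editing the legacy identifiers themselves.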

Detailed Experimental Protocols

Protocol 1: Reclassifying Amplicon Sequence Variant (ASV) Data Using GTDB

Objective: To re-annotate existing 16S rRNA gene amplicon datasets (often classified against SILVA) with GTDB taxonomy.

Materials: ASV table (BIOM or CSV format), representative ASV sequences (FASTA), QIIME 2 (2024.2+), GTDB reference package (r220).

Procedure:

  • Download Reference Data: Obtain the GTDB bacterial reference sequences and taxonomy file for release r220.
  • Train Classifier: Use q2-feature-classifier to fit a naive Bayes classifier on the GTDB reference sequences. Command: qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gtdb_seqs.qza --i-reference-taxonomy gtdb_tax.qza --o-classifier gtdb_classifier.qza.
  • Classify ASVs: Run taxonomy classification on your ASV sequences. Command: qiime feature-classifier classify-sklearn --i-classifier gtdb_classifier.qza --i-reads rep_seqs.qza --o-classification taxonomy.qza.
  • Collate Data: Merge new taxonomy with the ASV table and analyze.
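The final collation step can be sketched in Python as follows. The example assumes the classification has been exported to a tab-separated file with `Feature ID` and `Taxon` columns and that ASV counts are available as a nested dictionary; both the data layout and the function names are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict


def phylum_of(taxon: str) -> str:
    """Extract the phylum name from a 'd__...; p__...; ...' taxonomy string."""
    for token in taxon.split(";"):
        token = token.strip()
        if token.startswith("p__") and len(token) > 3:
            return token[3:]
    return "Unassigned"


def counts_by_phylum(taxonomy_tsv: str, asv_counts: dict) -> dict:
    """Sum ASV counts per sample for each phylum.

    asv_counts maps ASV id -> {sample name: read count}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(taxonomy_tsv), delimiter="\t"):
        phylum = phylum_of(row["Taxon"])
        for sample, count in asv_counts.get(row["Feature ID"], {}).items():
            totals[phylum][sample] += count
    return {phylum: dict(samples) for phylum, samples in totals.items()}


if __name__ == "__main__":
    # Hypothetical taxonomy export and ASV table.
    taxonomy = ("Feature ID\tTaxon\tConfidence\n"
                "asv1\td__Bacteria; p__Marinisomatota; c__Marinisomatia\t0.98\n"
                "asv2\td__Bacteria; p__Bacteroidota\t0.91\n"
                "asv3\td__Bacteria\t0.75\n")
    counts = {"asv1": {"S1": 10, "S2": 4}, "asv2": {"S1": 7}, "asv3": {"S2": 1}}
    print(counts_by_phylum(taxonomy, counts))
```

ASVs without a phylum-level label are pooled under "Unassigned" so that totals per sample are preserved.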

Protocol 2: Validating a Putative Drug Target Gene in Reclassified Genomes

Objective: To assess the conservation and phylogenetic distribution of a previously identified essential gene (e.g., dnaN) across reclassified Marinisomatota genomes.

Materials: List of GTDB genome accession IDs for Marinisomatota, target gene protein sequence, anvi'o (v7.1), HMMER suite.

Procedure:

  • Genome Retrieval: Download the listed genomes from the GTDB data repository, or from NCBI using their assembly accessions.
  • Gene Calling & Functional Annotation: Process all genomes through a consistent pipeline (e.g., Prokka, Bakta) to generate standardized GFF3 and amino acid FASTA files.
  • Build Target HMM: Create a Hidden Markov Model (HMM) profile for the target gene using reference sequences from trusted databases. Command: hmmbuild target_gene.hmm alignment.fasta.
  • Search & Extract: Use hmmsearch against the concatenated protein database of all Marinisomatota genomes. Parse results to extract hits above a strict e-value threshold (e.g., 1e-30).
  • Phylogenetic Profiling: Map the presence/absence and sequence variants of the target gene onto the GTDB phylogeny using anvi'o's pangenomics workflow to visualize conservation.
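The search-and-extract step can be sketched in Python by parsing the tabular output that hmmsearch writes with its `--tblout` option, where the fifth whitespace-separated column is the full-sequence E-value. The genome-to-protein naming convention (`GENOME|protein_id`) is a hypothetical assumption introduced for this example, as are the function names.

```python
def parse_tblout(text: str, max_evalue: float = 1e-30) -> list:
    """Return (target, e-value, bit score) for hits at or below the E-value cut-off."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(None, 18)  # the last field is a free-text description
        target, evalue, score = fields[0], float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return hits


def presence_by_genome(hits: list, genomes: list) -> dict:
    """Map each genome to True/False depending on whether it has a passing hit.

    Assumes protein IDs are formatted as 'GENOME|protein_id' (hypothetical).
    """
    with_hit = {target.split("|", 1)[0] for target, _, _ in hits}
    return {genome: genome in with_hit for genome in genomes}


if __name__ == "__main__":
    # Hypothetical excerpt of an hmmsearch --tblout file.
    demo = (
        "# target name  accession  query name  accession  E-value  score  bias ...\n"
        "MAG_A|p0012 - dnaN - 1.2e-85 280.1 0.0 1.5e-85 279.8 0.0 1.0 1 0 0 1 1 1 1 beta clamp\n"
        "MAG_B|p0450 - dnaN - 3.0e-12 45.3 0.1 4.1e-12 44.9 0.1 1.1 1 0 0 1 1 1 1 hypothetical\n"
    )
    print(presence_by_genome(parse_tblout(demo), ["MAG_A", "MAG_B", "MAG_C"]))
```

The resulting presence/absence dictionary can then be attached to tree leaves as a binary trait.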

Mandatory Visualizations

Workflow diagram: Legacy Data & Literature (NCBI/SILVA 'SAR406') and the GTDB Toolkit & Reference Taxonomy (r220+) both feed Step 1: Identifier Mapping (Cross-Reference Accessions) → Step 2: Phylogenetic Placement (GTDB-Tk classify_wf) → Step 3: Functional Re-annotation (Consistent HMM/PFAM) → Reconciled Dataset (GTDB Taxonomy Applied).

Title: Data Reconciliation Workflow for GTDB Reclassification

Diagram: GTDB Reclassification causes three problems: Literature Semantic Gap, Database Identifier Misalignment, and Metabolic Model Inconsistencies. The literature gap and identifier misalignment call for Action: Systematic Re-annotation; identifier misalignment and model inconsistencies call for Action: Phylogenetic Re-contextualization. Both actions lead to the Outcome: Unified Research Framework.

Title: Reclassification Impacts and Required Actions

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for Marinisomatota Research Post-GTDB Reclassification

Item Name Supplier/Resource Function & Application Notes
GTDB-Tk v2.4.0+ (https://github.com/ecogenomics/gtdbtk) Core software toolkit for assigning GTDB taxonomy to genome bins and placing them in the reference tree. Critical for all reclassification work; v2.4.0 or later is needed for r220 reference data.
GTDB r220 Reference Data GTDB FTP Site Genome sequence and taxonomy files. Required for any classification or phylogenetic analysis aligned with GTDB.
CheckM2 (https://github.com/chklovski/CheckM2) Rapid, accurate assessment of genome completeness and contamination. Essential for quality control before taxonomic classification.
anvi'o v7.1+ (http://anvio.org) Integrated platform for pangenomics, phylogenomics, and metabolic modeling. Ideal for comparing reclassified genomes.
KBase (Microbiome Modeling) (https://www.kbase.us) Cloud platform for constructing and analyzing metabolic models from genomes, facilitating functional re-annotation post-reclassification.
MEMOTE Suite (https://memote.io) For testing and reporting standard compliance of genome-scale metabolic models, ensuring updated models are robust.
Custom HMM Profiles (e.g., TIGRFAM, PFAM) Curated protein family profiles for targeting specific metabolic pathways (e.g., sulfur oxidation) in functional screens of reclassified genomes.

Application Notes and Protocols

Thesis Context: Within a broader thesis investigating the phylogenetic novelty and metabolic potential of lineages such as Marinisomatota (formerly candidate phylum Marinimicrobia) within the GTDB (Genome Taxonomy Database) classification framework, accurate phylogenetic placement is paramount. Inferring the evolutionary relationships of these often-fragmentary metagenome-assembled genomes (MAGs) requires robust assessment of genome quality to prevent erroneous taxonomic assignment.

1. Core Quality Metrics for Phylogenetic Trustworthiness

The integrity of a phylogenetic inference is directly contingent on the quality of the input genomes. The following metrics, popularized by tools like CheckM and CheckM2, are non-negotiable for pre-placement screening.

Table 1: Core Metrics for Genome Quality Assessment

| Metric | Definition | Ideal Range for Trustworthy Placement | Interpretation in Marinisomatota Context |
| --- | --- | --- | --- |
| Completeness | Percentage of conserved, single-copy marker genes (SCMGs) found in the genome. | >90% (High Quality); >50% (Draft) | High completeness ensures adequate phylogenetic signal. Low completeness in Marinisomatota MAGs may indicate novel lineages with divergent markers. |
| Contamination | Estimated percentage of SCMGs present in multiple copies, suggesting multiple strains/species. | <5% (High Quality); <10% (Acceptable) | High contamination leads to chimeric phylogenetic signals, misplacing the genome. Critical for novel phylum assignment in GTDB. |
| Strain Heterogeneity | Evidence of multiple sequence variants among SCMGs, indicating unresolved strains. | Low (close to 0%) | High heterogeneity complicates assembly and placement; it may require bin refinement or indicate a population. |
| Genome Size & N50 | Total assembly length, and the contig length at which 50% of the genome is assembled. | Context-dependent | Significantly deviant sizes may flag contamination or incompleteness. Useful for comparing against known relatives. |
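
The N50 definition and the two-tier thresholds in Table 1 can be made concrete with a short Python sketch. The function names n50 and quality_tier are hypothetical, and combining the completeness and contamination columns into "high", "draft", and "fail" tiers is an assumption made for illustration, not a rule stated in the table.

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def quality_tier(completeness, contamination):
    """Map completeness and contamination (both in percent) to a tier.

    Assumption: the 'high' tier needs both high-quality cutoffs from Table 1,
    and the 'draft' tier needs both relaxed cutoffs.
    """
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness > 50 and contamination < 10:
        return "draft"
    return "fail"
```

For contigs of 100, 80, 60, 40, and 20 kb, the two longest contigs already cover more than half of the 300 kb total, so the N50 is 80 kb.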

Protocol 1.1: Standardized Quality Assessment with CheckM2

Objective: To calculate completeness and contamination for a set of Marinisomatota MAGs prior to phylogenetic analysis. CheckM2 does not report strain heterogeneity; where that metric is needed, it can be obtained from the original CheckM.

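  • Input Preparation: Collect all MAGs in FASTA format (e.g., *.fna files) in a single directory.
  • Database Setup: Install CheckM2 (e.g., pip install checkm2), then download its reference database once with checkm2 database --download. The pre-trained models ship with the package, but the DIAMOND database used for annotation must be fetched separately.
  • Run Quality Assessment: Run checkm2 predict, passing the MAG directory as --input, a results directory as --output-directory, and -x fna so that the extension matches the input files.
  • Output Interpretation: The primary results are in quality_report.tsv in the output directory. Filter MAGs using the Table 1 thresholds, or a project-specific cutoff (e.g., Completeness >70%, Contamination <5%), before downstream phylogenetic placement.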
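
The run and filter steps of Protocol 1.1 can be sketched in Python as follows. This is a minimal illustration, not a definitive pipeline: the helper names (checkm2_command, run_checkm2, filter_quality_report) are hypothetical, and the column names Name, Completeness, and Contamination are assumed to match the header of quality_report.tsv in the installed CheckM2 version.

```python
import csv
import subprocess


def checkm2_command(mag_dir, out_dir, threads=8, extension="fna"):
    """Build the `checkm2 predict` argument list for a directory of MAGs."""
    return [
        "checkm2", "predict",
        "--input", mag_dir,
        "--output-directory", out_dir,
        "--threads", str(threads),
        "-x", extension,
    ]


def run_checkm2(mag_dir, out_dir, threads=8):
    """Run CheckM2; requires checkm2 on PATH and its database downloaded."""
    subprocess.run(checkm2_command(mag_dir, out_dir, threads), check=True)


def filter_quality_report(tsv_path, min_completeness=70.0, max_contamination=5.0):
    """Return the names of MAGs that pass both cutoffs.

    Assumption: the report is tab-separated with 'Name', 'Completeness',
    and 'Contamination' columns.
    """
    passing = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if (float(row["Completeness"]) > min_completeness
                    and float(row["Contamination"]) < max_contamination):
                passing.append(row["Name"])
    return passing
```

A typical call would be run_checkm2("mags/", "checkm2_out/") followed by filter_quality_report("checkm2_out/quality_report.tsv"), where both paths are placeholders. The default cutoffs mirror the example values in the protocol and can be tightened to the high-quality thresholds.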

2. Phylogenetic Placement-Specific Assessments

Beyond general metrics, specific checks are needed to ensure the phylogenetic signal is reliable.

Table 2: Placement-Specific Diagnostic Metrics

| Metric | Protocol/Method | Purpose & Relevance |
| --- | --- | --- |
| Marker Gene Concordance | Phylogeny of individual SCMGs (e.g., via PhyloPhlAn) vs. concatenated tree. | Detects hidden contamination or horizontal gene transfer that concatenated trees may obscure. Incongruent gene trees can invalidate placement. |
| Coverage Uniformity | Analysis of read mapping depth across contigs (e.g., using bowtie2 and samtools). | Contigs whose depth deviates strongly from the rest of the bin may be mis-binned (contamination). Uniform coverage supports a coherent genome. |
| Taxonomic Consistency | Compare taxonomic assignments of the predicted genes (e.g., via CAT/BAT) with the genome-level assignment (e.g., from GTDB-Tk classify). | A high percentage of genes agreeing with the dominant lineage boosts confidence. Many genes from divergent phyla signal contamination. |
| Reference Tree Robustness | Placement on a stable, well-curated reference tree (e.g., GTDB backbone tree). | Ensures placement is not an artifact of a poor or biased reference dataset. |
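
As an illustration of the coverage uniformity check, the sketch below flags contigs whose mean read depth differs from the bin median by more than a chosen fold change. The function name flag_coverage_outliers and the two-fold default are assumptions, not an established standard; the per-contig mean depths are assumed to have been computed beforehand (for example from the meandepth column of samtools coverage).

```python
from statistics import median


def flag_coverage_outliers(contig_depth, max_fold=2.0):
    """Return contigs whose mean depth is more than max_fold above or below
    the median depth of the bin.

    contig_depth maps contig name -> mean read depth.
    """
    if not contig_depth:
        return []
    mid = median(contig_depth.values())
    flagged = []
    for contig, depth in contig_depth.items():
        # A zero depth (or zero median) cannot be expressed as a fold
        # change, so such contigs are flagged for manual inspection.
        if depth <= 0 or mid <= 0:
            flagged.append(contig)
            continue
        fold = max(depth / mid, mid / depth)
        if fold > max_fold:
            flagged.append(contig)
    return sorted(flagged)
```

Flagged contigs are candidates for removal during bin refinement rather than proof of contamination, since genuine features such as multi-copy elements can also shift depth.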

Protocol 2.1: Assessing Taxonomic Consistency with CAT/BAT

Objective: To evaluate gene-level taxonomic agreement within a Marinisomatota MAG.

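  • Gene Prediction: Predict protein sequences from the MAG using prodigal, with the MAG FASTA as input (-i) and a protein FASTA as output (-a).
  • Run CAT/BAT: Perform taxonomic classification of the predicted proteins against a prepared CAT database and taxonomy folder. For a single MAG this is the BAT mode of the tool.
  • Analyze Output: Examine the mag.lineage file. A trustworthy MAG for placement will show a high proportion of proteins classified to a coherent lineage (e.g., candidate phylum Marinisomatota), with limited classification to unrelated phyla. Lineage names follow the taxonomy of the CAT database in use, so an NCBI-based database may report this phylum under its NCBI name rather than the GTDB name.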
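
The gene prediction and consistency steps of Protocol 2.1 can be sketched in Python as below. The helper names are hypothetical. Parsing of the CAT/BAT output is deliberately left out because the column layout depends on the CAT version; the sketch assumes the per-protein phylum labels have already been extracted into a list, and the set of labels treated as unclassified is also an assumption.

```python
from collections import Counter


def prodigal_command(mag_fasta, protein_fasta):
    """Build the prodigal argument list: -i is the input genome,
    -a is the output file of protein translations."""
    return ["prodigal", "-i", mag_fasta, "-a", protein_fasta]


def lineage_consistency(protein_phyla, unclassified=("", "NA", "no support")):
    """Return (dominant phylum, fraction of classified proteins agreeing).

    protein_phyla is a list with one phylum label per predicted protein.
    Labels listed in `unclassified` are ignored when computing the fraction.
    """
    classified = [p for p in protein_phyla if p not in unclassified]
    if not classified:
        return None, 0.0
    phylum, count = Counter(classified).most_common(1)[0]
    return phylum, count / len(classified)
```

A MAG in which most classified proteins share one phylum supports a coherent bin; what fraction counts as sufficient is a project-level decision that the protocol does not fix.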

3. Visualization of the Assessment Workflow

A standardized workflow integrates these metrics to gatekeep genomes for trustworthy phylogenetic placement.

[Figure: workflow diagram. Input MAGs pass through a core quality check (CheckM2) and are filtered by completeness (> X%) and contamination (< Y%) thresholds. Genomes that pass proceed to placement diagnostics (coverage, gene concordance, taxonomic consistency); those that pass all diagnostics go on to phylogenetic placement (GTDB-Tk, PhyloPhlAn) and yield a trustworthy placement for GTDB. Genomes that fail either filter are sent to bin refinement or re-assembly and then re-evaluated from the start. The diagram also includes an "Exclude from Placement" endpoint.]

Title: Workflow for Trustworthy Phylogenetic Placement
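
The gatekeeping logic of this workflow can be expressed as a small decision function. This is a sketch under stated assumptions: the name triage is hypothetical, the default thresholds stand in for the unspecified X% and Y% cutoffs, and the diagnostics are assumed to have been reduced to pass/fail flags beforehand. How many refinement rounds a genome gets before exclusion is not specified in the workflow and is not modeled here.

```python
def triage(completeness, contamination, diagnostics,
           min_completeness=90.0, max_contamination=5.0):
    """Return (decision, reasons) for one MAG.

    decision is 'place' when the genome passes the quality filter and all
    diagnostics, and 'refine' otherwise. diagnostics maps a diagnostic
    name (e.g. 'coverage') to True (passed) or False (failed).
    """
    # First filter: core quality thresholds.
    reasons = []
    if completeness <= min_completeness:
        reasons.append("completeness")
    if contamination >= max_contamination:
        reasons.append("contamination")
    if reasons:
        return "refine", reasons

    # Second filter: every placement diagnostic must pass.
    failed = [name for name, passed in diagnostics.items() if not passed]
    if failed:
        return "refine", failed
    return "place", []
```

Returning the failed checks alongside the decision makes it easier to choose between bin refinement and re-assembly for a rejected genome.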

4. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools and Databases for Quality Assessment

| Item | Function & Relevance | Typical Source/Implementation |
| --- | --- | --- |
| CheckM2 | Rapid tool for estimating completeness and contamination using machine learning. Essential first-pass filter. | https://github.com/chklovski/CheckM2 |
| GTDB-Tk | Toolkit for assigning GTDB taxonomy; includes classify_wf, which performs internal quality checks and reference tree placement. | https://github.com/ecogenomics/gtdbtk |
| PhyloPhlAn | For constructing highly accurate phylogenies with SCMGs and assessing marker gene concordance. | https://github.com/biobakery/phylophlan |
| BUSCO | Alternative to CheckM based on benchmarking universal single-copy orthologs. Useful for eukaryotes and specific lineages. | https://busco.ezlab.org/ |
| CAT/BAT | Protein-based taxonomic classifier. Critical for evaluating gene-level consistency within a MAG. | https://github.com/dutilh/CAT |
| Bowtie2 & SAMtools | For mapping reads back to assemblies to compute coverage uniformity and validate binning. | http://bowtie-bio.sourceforge.net/bowtie2, http://www.htslib.org/ |
| GTDB Reference Data (r214+) | Curated genome database and trees. The gold-standard reference for bacterial and archaeal phylogenetic placement. | https://data.gtdb.ecogenomic.org/ |
| CIAlign | Tool to clean and interpret multiple sequence alignments, removing noisy regions that can distort phylogeny. | https://github.com/KatyBrown/CIAlign/ |

Conclusion

The GTDB framework provides a robust, genome-based taxonomy that has significantly refined our understanding of the Marinisomatota phylum, clarifying its evolutionary boundaries and internal diversity. Mastery of the associated tools and an awareness of classification nuances are essential for accurately placing new genomes and interpreting their biological significance. The validated genomic distinctiveness of Marinisomatota, particularly its prevalence in marine systems, underscores its potential as a reservoir for novel natural products and enzymes. Future directions should focus on isolating representative strains, functionally characterizing predicted biosynthetic pathways, and exploring the phylum's role in marine biogeochemical cycles and host-microbe interactions. For biomedical research, integrating GTDB classification with metabolomic and phenotypic data will be crucial for translating genomic novelty into therapeutic leads.