This article provides a comprehensive resource on the Marinisomatota phylum within the Genome Taxonomy Database (GTDB) framework. It establishes the foundational genomic and ecological characteristics of this marine bacterial group, details methodologies for accurate classification and analysis, addresses common computational challenges, and validates GTDB's taxonomy against traditional systems like SILVA and NCBI. Targeted at researchers and drug development professionals, it synthesizes current knowledge to guide discovery of novel biosynthetic gene clusters and other biotechnologically relevant traits.
The phylum Marinisomatota represents a significant expansion of our understanding of bacterial diversity, originating from uncultured environmental sequences and achieving formal recognition through the Genome Taxonomy Database (GTDB) framework. This phylum encapsulates organisms primarily retrieved from marine and subsurface environments, characterized by genomic signatures of anaerobic metabolism and symbiotic or parasitic lifestyles.
Table 1: Chronological Development of Marinisomatota Taxonomy
| Year | Key Event/Tool | Description | Outcome/Reference |
|---|---|---|---|
| Pre-2015 | 16S rRNA Gene Surveys | Detection in marine sediments & hydrothermal vents | Identified as candidate division SAR406 (Marine Group A, later "Marinimicrobia"). |
| 2019-2020 | GTDB r89/r95 | Initial placement in GTDB taxonomy using concatenated protein phylogenies | Grouped within the broader FCB superphylum. |
| 2021 | GTDB R06-RS202 | Refinement via pangenome analysis & average amino acid identity (AAI) | Proposed as a distinct phylum-level lineage. |
| 2023-Present | GTDB r214/r220 | Validation with expanded genome dataset & relative evolutionary divergence (RED) | Formalized as phylum Marinisomatota in the GTDB taxonomy. |
Table 2: Core Genomic & Ecological Characteristics of Marinisomatota
| Characteristic | Typical Range/Feature | Method of Determination |
|---|---|---|
| Genome Size | 1.8 - 3.2 Mbp | Genome assembly from metagenomes (MAGs) |
| GC Content | 38 - 52% | In silico calculation from MAGs |
| Predicted Metabolism | Anaerobic fermenter, possible syntrophy | Gene neighborhood & metabolic pathway inference |
| Habitat | Marine sediment, groundwater, anaerobic digesters | Sample metadata from NCBI SRA |
| RED Value vs. Adjacent Phyla | >0.15 | GTDB Toolkit (GTDB-Tk) analysis |
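The "in silico calculation from MAGs" behind the genome size and GC content rows can be illustrated with a short Python sketch. It assumes a plain nucleotide FASTA stream; the function names and the toy sequences are hypothetical, and the range check simply reuses the values quoted in Table 2.

```python
from io import StringIO


def genome_size_and_gc(fasta_handle):
    """Return (total_bp, gc_percent) over all sequences in a FASTA stream."""
    total = 0
    gc = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip headers and blank lines
        seq = line.upper()
        total += len(seq)  # ambiguous bases (e.g. N) count toward length
        gc += seq.count("G") + seq.count("C")
    gc_percent = 100.0 * gc / total if total else 0.0
    return total, gc_percent


def within_table_ranges(total_bp, gc_percent):
    """Compare against the ranges quoted in Table 2 (1.8-3.2 Mbp, 38-52% GC)."""
    return 1.8e6 <= total_bp <= 3.2e6 and 38.0 <= gc_percent <= 52.0


# Toy input for illustration only, not a real MAG.
example = StringIO(">contig_1\nATGCGC\n>contig_2\nATAT\n")
size_bp, gc_pct = genome_size_and_gc(example)
print(size_bp, gc_pct)
```

A MAG falling outside these ranges is not necessarily misclassified; the check is only a quick sanity screen before phylogenomic placement.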
Objective: Recover high-quality draft genomes of Marinisomatota from complex environmental samples.
Materials:
Procedure:
- Trim reads with Trimmomatic (ILLUMINACLIP:2:30:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36).
- Classify the resulting bins with GTDB-Tk (classify_wf) against the GTDB r214 database.
- Retain bins assigned to p__Marinisomatota.

Objective: Place novel Marinisomatota MAGs within the GTDB reference tree and compute RED values.
Materials:
Procedure:
- Run gtdbtk identify to find 120 bacterial single-copy marker genes within the MAGs.
- Run gtdbtk align to create multiple sequence alignments (MSA) for each marker, followed by concatenation.
- Run gtdbtk infer to generate a maximum-likelihood tree with FastTree v2.1.11 under the LG+G model.
- Inspect the summary.tsv file for taxonomy and the RED values at relevant nodes in the tree file.

Title: Workflow for Defining a New Bacterial Phylum
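The identify, align, and infer steps of the procedure above can be sketched as a small Python helper that only assembles the command lines, so they can be reviewed before anything is run. The directory layout and the MSA path are assumptions for illustration; the alignment file name depends on the installed GTDB-Tk version and should be checked against the actual align output.

```python
import shlex


def gtdbtk_denovo_commands(genome_dir, out_dir, msa_file, cpus=8):
    """Build argument lists for gtdbtk identify, align, and infer."""
    identify_dir = f"{out_dir}/identify"  # hypothetical layout
    align_dir = f"{out_dir}/align"
    infer_dir = f"{out_dir}/infer"
    return [
        ["gtdbtk", "identify", "--genome_dir", genome_dir,
         "--out_dir", identify_dir, "--cpus", str(cpus)],
        ["gtdbtk", "align", "--identify_dir", identify_dir,
         "--out_dir", align_dir, "--cpus", str(cpus)],
        # LG substitution model with gamma rescaling, matching "LG+G" above.
        ["gtdbtk", "infer", "--msa_file", msa_file,
         "--out_dir", infer_dir, "--prot_model", "LG", "--gamma",
         "--cpus", str(cpus)],
    ]


# Hypothetical paths; the MSA location is an assumption, not a fixed name.
commands = gtdbtk_denovo_commands(
    "mags", "gtdbtk_out", "gtdbtk_out/align/user_msa.fasta.gz"
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths are confirmed.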
Table 3: Essential Reagents & Tools for Marinisomatota Research
| Item/Category | Specific Product/Software Example | Function in Research |
|---|---|---|
| DNA Extraction Kit | DNeasy PowerSoil Pro Kit (QIAGEN) | High-yield, inhibitor-free DNA extraction from complex sediments. |
| Metagenomic Library Prep | Illumina DNA Prep Kit | Preparation of sequencing-ready libraries from environmental DNA. |
| Sequencing Platform | Illumina NovaSeq 6000; PacBio HiFi | Generates short-read or long-read data for assembly and binning. |
| Assembly Software | metaSPAdes, Flye | Assembles sequencing reads into contigs/scaffolds. |
| Binning Software Suite | MetaBAT2, MaxBin2, CONCOCT | Groups contigs into putative genomes (MAGs) based on sequence composition and abundance. |
| Taxonomic Classifier | GTDB-Tk with r214 database | Provides standardized, phylogeny-based taxonomy and RED metrics. |
| Genome Quality Tool | CheckM2/CheckM | Estimates genome completeness and contamination using marker genes. |
| Metabolic Inference | METABOLIC v4.0 | Profiles metabolic pathways from MAGs to infer ecological role. |
| Phylogenetic Analysis | IQ-TREE 2, FastTree 2 | Constructs robust phylogenetic trees for phylogenomic validation. |
| Culture Media (Experimental) | Anaerobic marine broth (modified) | Attempts to cultivate elusive members using simulated environmental conditions. |
Within the GTDB (Genome Taxonomy Database) framework, Marinisomatota (formerly candidate phylum SAR406) is classified as a phylum within the FCB group superphylum. Its genomic hallmarks are critical for accurate taxonomic placement and understanding its ecological role in marine systems.
Key Genomic Hallmarks:
Quantitative Data Summary:
Table 1: Core Genomic Statistics for Representative *Marinisomatota* MAGs (Metagenome-Assembled Genomes) from GTDB r214.
| GTDB Species Representative | Genome Size (Mbp) | GC Content (%) | CheckM Completeness (%) | CheckM Contamination (%) | Number of bac120 Markers Identified |
|---|---|---|---|---|---|
| UBA1166 sp002160825 | 1.98 | 37.2 | 98.6 | 0.9 | 119 |
| UBA9951 sp014337395 | 2.15 | 36.8 | 99.2 | 1.2 | 120 |
| UBA1773 sp004294285 | 2.32 | 38.5 | 97.8 | 0.5 | 118 |
Table 2: Diagnostic Metabolic Pathway Presence/Absence in *Marinisomatota* vs. Related Phyla.
| Metabolic Pathway (KEGG Module) | Marinisomatota (n=50 MAGs) | Bacteroidota (n=50) | Chlorobiota (n=50) |
|---|---|---|---|
| Proteorhodopsin (M00597) | 100% | 12% | 0% |
| Dissimilatory sulfite reductase (DsrAB, M00596) | 88% | 24% | 100% |
| Complete TCA cycle (M00009) | 10% | 96% | 100% |
| Anoxygenic photosynthesis (M00116) | 0% | 0% | 100% |
Purpose: To classify a novel bacterial genome or MAG within the GTDB taxonomy, with specific focus on placement relative to Marinisomatota.

Materials: High-quality bacterial genome assembly, computing cluster or server with GTDB-Tk (v2.1.1+) installed.

Procedure:
- Run checkm to assess basic quality (completeness >50%, contamination <5% recommended).
- Run the classify_wf workflow:
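The original command for this step did not survive, so the following Python sketch is a stand-in rather than the authors' invocation. It assembles a `gtdbtk classify_wf` call from its common options; the directory names are hypothetical. Some newer GTDB-Tk releases expect either `--skip_ani_screen` or a Mash database option, so that switch is exposed as an assumption to verify against the installed version.

```python
import shlex


def classify_wf_command(genome_dir, out_dir, cpus=8, extension="fna",
                        skip_ani_screen=False):
    """Build the argument list for a gtdbtk classify_wf run."""
    cmd = [
        "gtdbtk", "classify_wf",
        "--genome_dir", genome_dir,
        "--out_dir", out_dir,
        "--extension", extension,
        "--cpus", str(cpus),
    ]
    if skip_ani_screen:
        # Assumption: only valid on releases that include the ANI pre-screen.
        cmd.append("--skip_ani_screen")
    return cmd


# Hypothetical input and output directories.
classify_cmd = classify_wf_command("genomes", "gtdbtk_out", cpus=16)
print(shlex.join(classify_cmd))
```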
- gtdbtk.bac120.summary.tsv: Taxonomic classification. Examine the classification column for placement (e.g., d__Bacteria;p__Marinisomatota;...).
- gtdbtk.bac120.markers_summary.tsv: Count of identified marker genes.
- gtdbtk.bac120.user_msa.fasta: Concatenated marker gene alignment for your genome.
- Run the infer workflow with the user MSA and the GTDB reference package.

Purpose: To profile the metabolic potential of a Marinisomatota genome, focusing on hallmark pathways.

Materials: Annotated genome (protein sequences in FASTA format), KofamScan software, KEGG databases.

Procedure:
- Prepare the protein sequences in FASTA format (*.faa).
- Run the exec_annotation script with the --cpu and -o options. Use profile HMMs to map KOs.
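Once KO assignments exist, screening for a hallmark pathway is a simple lookup. The sketch below assumes KofamScan was run with its tab-separated "mapper" output (gene name, then KO when one was assigned) and checks only for the dissimilatory sulfite reductase subunits (K11180 for dsrA, K11181 for dsrB). The function names and the example lines are hypothetical.

```python
# KEGG Orthology identifiers for the DsrAB subunits.
DSR_KOS = {"K11180": "dsrA", "K11181": "dsrB"}


def parse_mapper(lines):
    """Map gene name -> set of KOs from mapper-style lines."""
    hits = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Genes without an assignment have no second column.
        if len(fields) >= 2 and fields[1]:
            hits.setdefault(fields[0], set()).add(fields[1])
    return hits


def detected_markers(hits, marker_kos=DSR_KOS):
    """Return the sorted marker gene names whose KO appears in the hits."""
    found = set().union(*hits.values())
    return sorted(name for ko, name in marker_kos.items() if ko in found)


# Illustrative lines, not real KofamScan output.
example_lines = ["gene_1\tK11180", "gene_2", "gene_3\tK11181", "gene_4\tK00001"]
markers = detected_markers(parse_mapper(example_lines))
print(markers)
```

The same lookup extends to other pathways by passing a different `marker_kos` dictionary.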
Diagram Title: Workflow for Phylogenomic Placement with GTDB
Diagram Title: Core Energy Pathways in Marinisomatota
Table 3: Essential Research Reagents and Tools for Marinisomatota Genomics
| Item | Function & Relevance |
|---|---|
| GTDB-Tk (v2.1.1+) | Standardized toolkit for assigning genomes to the GTDB taxonomy using a set of 120 bacterial marker genes; essential for consistent phylogenetic placement. |
| CheckM2 / CheckM | Assesses genome quality (completeness, contamination) of MAGs prior to phylogenetic analysis; critical for data reliability. |
| KofamScan / eggNOG-mapper | Functional annotation tools to map protein sequences to KEGG Orthologs (KOs) and reconstruct metabolic pathways like proteorhodopsin. |
| FastTree / RAxML | Software for inferring phylogenetic trees from concatenated marker gene alignments generated by GTDB-Tk. |
| MetaBAT 2 / MaxBin 2 | Binning algorithms for reconstructing MAGs from marine metagenomic data, the primary source of Marinisomatota genomes. |
| DRAM (Distilled and Refined Annotation of Metabolism) | Specialized tool for annotating metabolic pathways and auxiliary functions in microbial genomes; useful for detailed pathway analysis. |
| Pfam & TIGRFAM HMMs | Curated protein family databases used to identify specific marker genes (e.g., proteorhodopsin, DsrAB) in novel genomes. |
Context within GTDB Taxonomic Classification Research: The phylum Marinisomatota (formerly candidate phylum SAR406, also called Marinimicrobia) represents a lineage of Bacteria predominantly identified from metagenomic surveys. Research within a broader thesis on GTDB taxonomy aims to elucidate the ecological drivers of its distribution and its metabolic potential, particularly for biodiscovery. This phylum exemplifies the critical link between habitat, ecological niche, and genomic content.
Key Quantitative Data Summary:
Table 1: Prevalence of Marinisomatota in Public Metagenomic Databases
| Environment / Sample Type | Approximate Relative Abundance (%) | Primary Dataset/Source (Example) | Key Identifying Genomic Marker |
|---|---|---|---|
| Marine Pelagic (Oceanic) | 0.01 - 0.5 | TARA Oceans, Malaspina Expedition | 16S rRNA gene, RpoB |
| Marine Sediments | 0.1 - 2.0 | Ocean Drilling Program, IODP | 16S rRNA gene, RpoB |
| Marine Sponge Microbiome | Up to 15.0 | Sponge Microbiome Project, local surveys | 16S rRNA gene, Metagenome-assembled genomes (MAGs) |
| Coral Microbiome (Healthy) | 0.5 - 3.0 | Various reef studies | 16S rRNA gene, MAGs |
| Human & Animal Gut | < 0.01 | Human Microbiome Project, MGnify | Extremely rare, sporadic MAGs |
Table 2: Genomic Features Correlated with Habitat in Marinisomatota MAGs
| Genomic Feature / Pathway | Prevalence in Marine Pelagic MAGs (%) | Prevalence in Host-Associated (Sponge) MAGs (%) | Putative Functional Role & Niche Adaptation |
|---|---|---|---|
| Proteorhodopsin & Light-Sensing | 85-95 | 10-20 | Phototrophy, energy generation in oligotrophic water |
| CRISPR-Cas Systems | 30-40 | 60-80 | Defense against mobile genetic elements/viruses |
| Biosynthetic Gene Clusters (BGCs) | 2-4 per MAG | 5-8 per MAG | Secondary metabolite production (e.g., NRPS, PKS) |
| Adhesion Proteins (e.g., MSCRAMM-like) | Low | High | Host tissue attachment and colonization |
| C1 Metabolism (e.g., folD, fhs) | High | Variable | Adaptation to C1 compounds in marine environment |
| TonB-Dependent Transporters | Very High | High | Nutrient scavenging (e.g., siderophores, sugars) |
Interpretation: The data indicate a primary marine origin for Marinisomatota, with a significant shift in abundance and genomic capacity upon association with marine invertebrate hosts, particularly sponges. The increased prevalence of defense mechanisms and biosynthetic potential in host-associated lineages suggests adaptation to a competitive, resource-rich, and defended microenvironment, highlighting their potential for novel natural product discovery.
Objective: To assess the relative abundance and diversity of Marinisomatota in marine and host-associated metagenomic samples.
Materials: Metagenomic DNA, PCR reagents, GTDB-Tk (v2.3.0) with the GTDB reference database, QIIME2 (2024.5), specific primers (see Toolkit).
Workflow:
- Train a taxonomic classifier with qiime feature-classifier fit-classifier-naive-bayes.
- Classify recovered MAGs with the classify_wf command against the GTDB r214 database.

Objective: To experimentally test the biosynthetic potential predicted in host-associated Marinisomatota MAGs.
Materials: Sponge tissue sample, Marine Broth 2216, selective antibiotics, isolation plates, HPLC-MS.
Workflow:
Title: Metagenomic Analysis Workflow for Marinisomatota
Title: Host Niche Drivers of Marinisomatota Genomics
Table 3: Essential Materials for Marinisomatota Research
| Item / Reagent | Function / Application in Protocol | Example Product / Specification |
|---|---|---|
| Marine Agar/Broth 2216 | Standardized medium for cultivation and isolation of marine heterotrophs. | Difco Marine Broth 2216, BD. |
| GTDB Reference Database (r214+) | Essential for accurate taxonomic classification of MAGs and sequences from understudied phyla. | Genome Taxonomy Database Toolkit (GTDB-Tk) v2.3.0+. |
| Anti-Fungal/Antibiotic Supplement Mix | Selective isolation of slow-growing bacteria by inhibiting fungi and fast-growing competitors. | Cycloheximide (100 µg/mL), Nalidixic Acid (10 µg/mL), Vancomycin (5 µg/mL). |
| Polymerase for GC-Rich Templates | High-fidelity PCR amplification of bacterial DNA, often with high GC content common in Marinisomatota. | KAPA HiFi HotStart ReadyMix (Roche) or Q5 High-Fidelity DNA Polymerase (NEB). |
| Metagenomic DNA Extraction Kit (Soil/Microbiome) | Efficient lysis of diverse, tough-to-lyse bacterial cells from complex environmental samples. | DNeasy PowerSoil Pro Kit (Qiagen) or MagAttract PowerSoil DNA KF Kit (Qiagen). |
| antiSMASH Software Suite | Prediction, annotation, and analysis of Biosynthetic Gene Clusters (BGCs) in bacterial genomes. | antiSMASH 7.0+ web server or standalone version. |
| HPLC-MS Grade Solvents | High-purity solvents for metabolite extraction and analytical chemistry to avoid background interference. | Ethyl Acetate, Methanol (LC-MS Grade, e.g., Fisher Chemical). |
Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly a candidate phylum) represents a significant, yet understudied, lineage of primarily marine bacteria. This taxonomic group is of considerable interest for its phylogenetic diversity, its ecological roles in marine biogeochemical cycles, and its potential as a source of novel bioactive compounds. This document, framed within a broader thesis on GTDB-based microbial systematics, provides detailed application notes and protocols for the cultivation, genomic analysis, and functional characterization of key genera and species within the Marinisomatota. The content is designed to support research aimed at validating and expanding the GTDB taxonomy while exploring biotechnological applications.
Based on the latest GTDB release (R214), the phylum Marinisomatota is delineated into several classes and orders. The following table summarizes the quantitatively dominant and phylogenetically distinct genera according to genome availability and 16S rRNA gene surveys.
Table 1: Key Genera and Species within GTDB Marinisomatota (as of GTDB R214)
| GTDB Class | GTDB Order | Key Genus (GTDB Label) | Approx. # of MAGs/Genomes | Relative Abundance in Marine Surveys* | Notable Species/Clade |
|---|---|---|---|---|---|
| Marinisomatia | Marinisomatales | UBA10353 (Marinisomatales) | ~45 | High | Representative species: Ga0074134 |
| Marinisomatia | UBA9962 | UBA9962 | ~22 | Moderate | Often found in coastal sediments |
| Bathybacteria | BMS94Abin14 | Bin-S124 | ~15 | Low (Deep-sea) | Associated with hydrothermal vents |
| Marinisomatia | UBA1773 | UBA1773 | ~12 | Moderate | Pelagic ocean clade |
| Marinisomatia | UBA10354 | UBA10354 | ~8 | Low | - |
*Abundance based on aggregated data from the TARA Oceans and BioGEOTRACES metagenomic surveys.
Objective: To selectively enrich for Marinisomatota cells from seawater or sediment samples.

Background: Most Marinisomatota remain uncultured; however, specific enrichment strategies based on predicted metabolism (from MAGs) can be employed.
Materials & Reagents:
Procedure:
Objective: To reconstruct and taxonomically classify Marinisomatota MAGs from metagenomic data.
Workflow Diagram Title: MAG Binning & GTDB Classification Workflow
Detailed Methodology:
- Assemble reads with metaSPAdes (v3.15.0) with -k 21,33,55,77.
- Map reads to contigs with Bowtie2. Calculate coverage profiles and generate initial bins with MetaBAT2 (v2.15) and MaxBin2 (v2.2.7).
- Use the MetaWRAP (v1.3.2) bin_refinement module to consolidate bins from multiple tools, retaining only bins with >50% completeness and <10% contamination (CheckM2 criteria).
- Run gtdbtk (v2.3.0) with the classify_wf command on refined MAGs using the R214 database. The output (gtdbtk.bac120.summary.tsv) will assign taxonomy, including potential Marinisomatota placement.
- Run gtdbtk infer to generate a multiple sequence alignment and construct a tree with FastTree for phylogenetic placement.

Objective: To identify potential secondary metabolite BGCs from Marinisomatota MAGs or isolates.
Materials & Reagents:
- antiSMASH (v7.0), BiG-SCAPE.

Procedure:

- Run antiSMASH on the genome using strict detection parameters: antismash --genefinding-tool prodigal --taxon bacteria --cb-general --cb-knownclusters --asf --pfam2go.
- Use BiG-SCAPE to correlate BGCs from Marinisomatota with known BGCs in the MIBiG database, generating sequence similarity networks to identify novel clusters.

Table 2: Essential Research Reagents for Marinisomatota Studies
| Item/Category | Specific Product/Example | Function/Application |
|---|---|---|
| Enrichment Medium | Custom Marinisomatota Enrichment Medium (MEM) | Selective cultivation and maintenance of fastidious marine bacteria. |
| DNA Extraction Kit | DNeasy PowerSoil Pro Kit (Qiagen) | High-yield, inhibitor-free genomic DNA extraction from complex marine samples. |
| Metagenomic Library Prep | Nextera XT DNA Library Prep Kit (Illumina) | Preparation of sequencing-ready libraries from low-input environmental DNA. |
| Taxonomic Classifier | GTDB-Tk v2.3.0 Software & R214 Database | Precise genome-based taxonomic assignment according to the GTDB system. |
| BGC Analysis Software | antiSMASH v7.0 Web Server/CLI | Comprehensive prediction and annotation of biosynthetic gene clusters. |
| Phylogenetic Marker | Bacterial 16S rRNA Gene Primers (515F/806R) | Tracking Marinisomatota enrichment and community profiling via amplicon sequencing. |
| Expression Host | Pseudomonas putida KT2440 | Robust, Gram-negative host for heterologous expression of marine bacterial BGCs. |
| Flow Cytometry Stain | SYBR Green I Nucleic Acid Gel Stain | Quantification of bacterial cell abundance in enrichment cultures. |
The classification of bacterial and archaeal life has undergone a paradigm shift, moving from a single-marker (16S rRNA) system to a genome-centric taxonomy that forms the foundation of the Genome Taxonomy Database (GTDB). This evolution is critical for research into candidate phyla like Marinisomatota (known in legacy systems as SAR406 or Marinimicrobia, within the FCB group), whose physiological and ecological roles are inferred primarily from genomic data. Accurate taxonomy is essential for drug discovery, as it clarifies evolutionary relationships and identifies novel biosynthetic gene clusters.
The following table summarizes the key differences between the two classification paradigms.
Table 1: Comparison of 16S rRNA and Genome-Centric Taxonomy
| Feature | 16S rRNA Gene-Based Taxonomy (c. 1977-2010s) | Genome-Centric Taxonomy (GTDB Era, 2018-Present) |
|---|---|---|
| Primary Data Source | Sanger sequencing of ~1,500 bp 16S rRNA gene. | Whole genome sequences (WGS) from isolates and metagenome-assembled genomes (MAGs). |
| Resolution | Species to genus level; poor for closely related species and strains. | High resolution to species and strain level; robust at all ranks. |
| Quantitative Basis | Sequence similarity thresholds (e.g., 97% for species, 95% for genus). | Average Amino Acid Identity (AAI), Average Nucleotide Identity (ANI), and relative evolutionary divergence (RED). |
| Type Material Requirement | Dependent on cultured type strains. | Employs type species genomes and designated type genomes for uncultivated taxa. |
| Handling of Uncultivated Diversity | Limited; requires PCR amplification from environment. | Integral; MAGs from metagenomics allow classification of the "microbial dark matter." |
| Impact on Marinisomatota Research | Preliminary placement based on fragmentary 16S data led to uncertain phylogeny. | Precise placement as a distinct phylum based on conserved single-copy marker genes; reveals metabolic potential for drug target discovery. |
Table 2: Key Quantitative Metrics in GTDB Genome-Centric Classification
| Metric | Calculation Method | Typical Threshold for Species Demarcation | Function in Classification |
|---|---|---|---|
| Average Nucleotide Identity (ANI) | BLAST-based or MUMmer-based alignment of shared genomic regions. | ≥ ~95% | Primary species-level standard, replacing 16S similarity. |
| Average Amino Acid Identity (AAI) | Comparison of amino acid sequences of shared protein-coding genes. | ~60% for same phylum | Useful for higher-rank (family, phylum) assignments and phylogeny. |
| Relative Evolutionary Divergence (RED) | Measure of relative branch length in a rooted phylogenetic tree of marker genes. | Normalized scale (0.0=root, 1.0=leaves) | Objective rank normalization across all lineages. |
| Percentage of Conserved Proteins (POCP) | Percentage of conserved protein sequences between two genomes. | ≥50% for same genus | Supplementary metric for genus classification. |
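The POCP row in Table 2 can be made concrete with the commonly used definition POCP = (C1 + C2) / (T1 + T2) × 100, where C1 and C2 are the numbers of conserved proteins found in each genome and T1 and T2 are the total protein counts. The Python sketch below applies that formula and the ≥50% genus guideline from the table; the function names and input counts are hypothetical.

```python
def pocp(conserved_1, conserved_2, total_1, total_2):
    """Percentage of conserved proteins between two genomes."""
    if total_1 <= 0 or total_2 <= 0:
        raise ValueError("total protein counts must be positive")
    return 100.0 * (conserved_1 + conserved_2) / (total_1 + total_2)


def same_genus_by_pocp(value, threshold=50.0):
    """Apply the >=50% genus guideline quoted in Table 2."""
    return value >= threshold


# Hypothetical counts for two MAGs.
value = pocp(conserved_1=1200, conserved_2=1180, total_1=2000, total_2=2100)
print(round(value, 1), same_genus_by_pocp(value))
```

As the table notes, POCP is a supplementary metric; GTDB rank assignments rest primarily on RED and ANI.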
Objective: To classify a bacterial genome (isolate or MAG) within the GTDB taxonomy.
Materials:
Procedure:
Prepare Input Data: Place all genome assemblies (.fna files) in a single directory (genome_dir). Create a batch file listing paths if necessary.
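Where a batch file is preferred over a genome directory, it can be generated with a few lines of Python. This sketch assumes the two-column, tab-separated layout GTDB-Tk reads for batch input (FASTA path, then genome identifier) and uses the file stem as the identifier; the function name and the temporary demo files are hypothetical.

```python
from pathlib import Path
import tempfile


def write_batchfile(genome_dir, batchfile, extension=".fna"):
    """Write 'path<TAB>genome_id' rows for every matching FASTA file."""
    rows = []
    for fasta in sorted(Path(genome_dir).glob(f"*{extension}")):
        rows.append(f"{fasta.resolve()}\t{fasta.stem}")
    Path(batchfile).write_text("\n".join(rows) + ("\n" if rows else ""))
    return len(rows)


# Demo on throwaway files so the sketch runs without real assemblies.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mag_001.fna", "mag_002.fna", "notes.txt"):
        (Path(tmp) / name).write_text(">contig\nACGT\n")
    n_written = write_batchfile(tmp, Path(tmp) / "batch.tsv")
    batch_lines = (Path(tmp) / "batch.tsv").read_text().splitlines()

print(n_written)
```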
Run Classification Workflow:
The workflow a) identifies marker genes using HMMER, b) aligns markers, c) creates a concatenated alignment, d) places genomes into a reference tree via pplacer, and e) classifies them based on RED-based rank thresholds.

Output Interpretation:

- gtdbtk_out/gtdbtk.bac120.summary.tsv

Objective: To obtain a 16S rRNA sequence for preliminary phylogenetic analysis.
Materials:
Procedure:
Title: Evolution from 16S to Genome-Based Taxonomy
Title: GTDB-Tk Classification Workflow
Table 3: Essential Materials for Genome-Centric Taxonomy Research
| Item | Function & Relevance |
|---|---|
| DNeasy PowerSoil Pro Kit (QIAGEN) | Gold-standard for high-yield, inhibitor-free microbial genomic DNA extraction from complex samples for WGS and MAG generation. |
| Nextera XT DNA Library Prep Kit (Illumina) | Prepares multiplexed, adapter-ligated sequencing libraries from low-input genomic DNA for Illumina short-read sequencing. |
| GTDB-Tk Software & Reference Data | Core bioinformatics toolkit for performing genome classification against the standardized GTDB taxonomy. |
| CheckM / CheckM2 | Assesses completeness and contamination of MAGs using lineage-specific marker sets, a critical QC step before classification. |
| antiSMASH / BAGEL | Identifies biosynthetic gene clusters (BGCs) for secondary metabolites in classified genomes; crucial for drug discovery pipelines. |
| Phanta EVO HS Master Mix (Vazyme) | High-fidelity polymerase mix for accurate amplification of taxonomic marker genes or genome fragments when required. |
| ZymoBIOMICS Microbial Community Standard | Mock community with known composition for validating wet-lab and computational workflows from extraction to classification. |
1. Introduction and Thesis Context

This protocol is framed within a broader thesis investigating the recalibration of bacterial taxonomy, specifically the phylum Marinisomatota (formerly known as SAR406 or Marine Group A, also called Marinimicrobia). The Genome Taxonomy Database Toolkit (GTDB-Tk) provides a standardized, genome-based methodology for consistent taxonomic classification, which is critical for resolving the ecological and metabolic roles of uncultured lineages like Marinisomatota. Accurate classification is foundational for downstream applications in microbial ecology and the discovery of novel bioactive compounds relevant to drug development.
2. Key Research Reagent Solutions

The following table details essential materials and software for the classification workflow.

| Reagent/Solution/Software | Function/Explanation |
|---|---|
| GTDB-Tk v2.3.2+ | Core software package for inferring taxonomic classification and phylogenetic placement. |
| GTDB Reference Data (r220+) | Curated set of reference genomes and taxonomy (e.g., r220_data.tar.gz). Mandatory for classification. |
| CheckM2 or CheckM | Assesses genome completeness and contamination; critical for quality filtering prior to classification. |
| Prodigal or Pyrodigal | Gene prediction software used internally by GTDB-Tk for creating protein markers. |
| HMMER (v3.1+) | Used for aligning conserved marker genes to reference HMM profiles. |
| FastANI (or skani in recent GTDB-Tk releases) | Calculates Average Nucleotide Identity (ANI) for precise species demarcation. |
| Python 3.8+ | Required runtime environment for GTDB-Tk. |
| High-Performance Computing (HPC) Cluster | Recommended due to the computational intensity of alignment and tree placement steps. |
3. Experimental Protocol for Genome Classification

Note: All commands assume a Unix-like environment and conda for package management.
Step 1: Installation and Data Preparation
Step 2: Input Genome Quality Control
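A quality filter for this step could look like the Python sketch below. It assumes a CheckM2 `quality_report.tsv` with `Name`, `Completeness`, and `Contamination` columns, and borrows the >50% completeness and <10% contamination cut-offs used for bin refinement elsewhere in this document; adjust them to the stringency the study requires. The function name and example rows are hypothetical.

```python
import csv
from io import StringIO


def passing_genomes(report_handle, min_completeness=50.0,
                    max_contamination=10.0):
    """Return genome names that satisfy both quality thresholds."""
    reader = csv.DictReader(report_handle, delimiter="\t")
    keep = []
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            keep.append(row["Name"])
    return keep


# Illustrative report content, not real CheckM2 output.
report = StringIO(
    "Name\tCompleteness\tContamination\n"
    "mag_001\t98.6\t0.9\n"
    "mag_002\t45.0\t1.0\n"
    "mag_003\t80.0\t12.5\n"
)
kept = passing_genomes(report)
print(kept)
```

Only the genomes in `kept` would then be passed on to the classification step.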
Step 3: Execute GTDB-Tk Classification Workflow
Run the comprehensive classify_wf pipeline:
Step 4: Interpretation of Results

Key output files:

- gtdbtk.bac120.summary.tsv: Tabular summary of taxonomic classification for each genome.
- gtdbtk.ar53.summary.tsv: Equivalent summary for archaeal genomes (not expected for Marinisomatota, which is a bacterial phylum).
- gtdbtk.<marker_set>.tree: Phylogenetic tree for visual placement.

4. Data Presentation: Summary of Classification Metrics

The following table quantifies typical outputs from a Marinisomatota classification run using GTDB-Tk, based on a hypothetical dataset of 150 marine metagenome-assembled genomes (MAGs).
Table 1: GTDB-Tk Classification Statistics for a Marine MAG Dataset
| Metric | Value | Interpretation |
|---|---|---|
| Total Input Genomes | 150 | MAGs passing QC thresholds. |
| Genomes Classified to Marinisomatota | 47 (31.3%) | Assigned to the target phylum. |
| Novel Species (ANI < 95%) | 28 (59.6% of phylum) | Potential new species within Marinisomatota. |
| Novel Genera (AF < 50%) | 11 (23.4% of phylum) | Potential new genera. |
| Average Alignment Fraction (AF) | 72.1% (std dev ± 18.5) | Measure of genomic relatedness to reference. |
| Placement in Reference Tree | 100% | All genomes placed within the GTDB reference phylogeny. |
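Counts like those in Table 1 can be tallied directly from the bacterial summary file. The Python sketch below assumes the `user_genome` and `classification` columns of `gtdbtk.bac120.summary.tsv` and treats a lineage ending in an empty `s__` as lacking a species assignment; the lineage strings are illustrative, not real GTDB records. It also re-derives the percentages quoted in the table.

```python
import csv
from io import StringIO


def summarize_phylum(summary_handle, phylum="p__Marinisomatota"):
    """Return (total, in_phylum, without_species) from a summary TSV."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    total = in_phylum = without_species = 0
    for row in reader:
        total += 1
        ranks = row["classification"].split(";")
        if phylum in ranks:
            in_phylum += 1
            if ranks[-1] == "s__":  # empty species rank
                without_species += 1
    return total, in_phylum, without_species


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1) if whole else 0.0


# Illustrative rows only.
summary = StringIO(
    "user_genome\tclassification\n"
    "mag_001\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__UBA1773;f__UBA1773;g__UBA1773;s__UBA1773 sp004294285\n"
    "mag_002\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__UBA1773;f__UBA1773;g__UBA1773;s__\n"
    "mag_003\td__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__Polaribacter;s__\n"
)
counts = summarize_phylum(summary)
print(counts)

# Percentages from Table 1: 47 of 150 genomes, then 28 and 11 of those 47.
table_percentages = (percent(47, 150), percent(28, 47), percent(11, 47))
print(table_percentages)
```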
5. Visualization of the Classification Workflow
Title: GTDB-Tk Classification Workflow for Marinisomatota Genomes
Title: Taxonomic Context of Marinisomatota in GTDB
Within the context of a broader thesis on Marinisomatota taxonomy using the Genome Taxonomy Database (GTDB), the generation and refinement of Metagenome-Assembled Genomes (MAGs) is foundational. The accuracy of downstream phylogenetic and metabolic analyses is critically dependent on parameters adjusted during assembly, binning, and refinement. This protocol details the workflow adjustments necessary for optimizing MAG quality, particularly for elusive phyla like Marinisomatota, which are frequently underrepresented in environmental samples.
Key Findings:
- Placement confidence (p_placer) and relative evolutionary divergence (RED) values from GTDB-Tk are critical for interpreting the placement of novel Marinisomatota MAGs. A threshold of p_placer ≥ 0.8 is recommended for confident placement at the genus level.

Table 1: Impact of Assembly & Binning Parameters on MAG Quality Metrics for Marine Datasets
| Parameter | Standard Value | Optimized Value for Marinisomatota | Effect on Completeness | Effect on Contamination | Key Tool |
|---|---|---|---|---|---|
| Min Contig Length | 500 bp | 1500 bp | -5% to +2% | -10% to -15% | metaSPAdes, MEGAHIT |
| Metabat2 --minContig | 1500 bp | 2500 bp | -3% | -8% | MetaBAT2 |
| CheckM Lineage WF | Standard | --extension_threshold 0.2 | More stringent lineage assignment | Better contamination estimate | CheckM2 |
| MaxBin2 Prob Threshold | 0.9 | 0.95 | -2% | -7% | MaxBin2 |
| DAS Tool Score Threshold | 0.5 | 0.6 | +1% Completeness | -5% Contamination | DAS Tool |
Table 2: GTDB-Tk Classification Output Interpretation for Novel Taxa
| Metric | Range | Interpretation for Marinisomatota MAGs |
|---|---|---|
| Classify p_placer | 0.0 - 1.0 | ≥ 0.95: Strong confidence at species rank. 0.80-0.94: Confident genus-level placement. <0.80: Require manual phylogenomic review. |
| RED Value | ~0.0 - ~1.0 | Values close to 0.5 for a new MAG suggest a novel genus within a known family. Deviations >0.15 from sister taxa warrant investigation. |
| FastANI vs. Reference | 85% - 100% | <95% ANI to nearest GTDB reference suggests novel species; <~70% suggests novel genus. |
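The ANI and RED guidelines in Table 2 translate into a small decision helper. The Python sketch below encodes only the thresholds stated in the table (95% ANI for species, roughly 70% for genus, and a RED deviation above 0.15 from sister taxa as a flag for review); the function names are hypothetical and the cut-offs are the table's guidelines, not fixed GTDB rules.

```python
def interpret_ani(ani_percent, species_cutoff=95.0, genus_cutoff=70.0):
    """Label a MAG by its ANI to the nearest GTDB reference genome."""
    if ani_percent >= species_cutoff:
        return "same species as reference"
    if ani_percent >= genus_cutoff:
        return "novel species candidate"
    return "novel genus candidate"


def red_needs_review(red_value, sister_red, max_deviation=0.15):
    """Flag placements whose RED deviates strongly from sister taxa."""
    return abs(red_value - sister_red) > max_deviation


print(interpret_ani(97.2))
print(interpret_ani(88.0))
print(red_needs_review(0.50, 0.70))
```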
Objective: To reconstruct high-quality Marinisomatota MAGs from multi-sample marine metagenomic datasets.
Materials:
Method:
- Co-assembly: metaspades.py -o co_assembly -1 sample1_1.fq,sample2_1.fq -2 sample1_2.fq,sample2_2.fq -k 21,33,55,77 -t 32 -m 500
- Filter contigs by length with seqtk.
- Compute contig depth across samples with jgi_summarize_bam_contig_depths.
- Binning: metabat2 -i filtered_contigs.fa -a depth_table.txt -o bin -m 2500
- Bin consolidation: DAS_Tool -i metabat2_bins.txt,maxbin2_bins.txt -l metabat,maxbin -c filtered_contigs.fa --score_threshold 0.6 -o final_bins

Objective: To assess MAG quality, refine bins, and achieve accurate GTDB taxonomy.
Materials:
- Bin assignments from DAS Tool (final_bins_DASTool_scaffolds2bin.txt).

Method:

- Quality assessment: checkm2 predict --input final_bins/ --output checkm2_results -t 16 --lowmem
- Refine bins with anvi-refine or manual curation.
- Dereplication: derep -i *.fa -o mags_derep99 -ani 0.99 -nc 0.30
- Classification: gtdbtk classify_wf --genome_dir mags_derep99/ --out_dir gtdbtk_out --cpus 32 --extension_threshold 0.2
- Inspect the gtdbtk.bac120.summary.tsv file, focusing on classification, p_placer, and red_value.
- For low-confidence placements (p_placer < 0.8), construct a custom phylogeny with IQ-TREE using the bac120 markers.

Table 3: Essential Research Reagent Solutions & Materials
| Item/Reagent | Function & Application in MAG Workflow | Critical Parameter/Specification |
|---|---|---|
| NEBNext Ultra II FS DNA Library Prep Kit | High-quality metagenomic library preparation for Illumina sequencing. Essential for obtaining high-coverage, unbiased reads. | Input DNA: 1 ng-100 ng. Enzymatic fragmentation time optimization for desired insert size. |
| MetaPolyzyme (Sigma-Aldrich) | Enzymatic lysis cocktail for diverse microbial cell walls in environmental samples. Critical for unbiased DNA extraction from marine biomass. | Incubation: 37°C for 60 min. Use in conjunction with mechanical lysis (bead-beating). |
| SPRIselect Beads (Beckman Coulter) | Size-selective magnetic bead-based clean-up for post-assembly contig filtering and size selection. | Ratio optimization (e.g., 0.6x to 0.8x) to retain contigs >1500 bp post-assembly. |
| CheckM2 Lineage-Specific Marker Set | Software-based "reagent" for assessing MAG completeness/contamination using a random forest model. More accurate than CheckM1. | Use --lowmem flag for large datasets. Interpret results in context of contamination sources. |
| GTDB-Tk Reference Data (v.R214) | Curated database of bacterial/archaeal genomes for phylogenetic placement. The standard for taxonomic classification of Marinisomatota MAGs. | Must download (~50 GB) and install separately. Update with each GTDB release. |
| Phusion High-Fidelity DNA Polymerase (Thermo) | For amplification of taxonomic marker genes from MAGs or community DNA for validation (e.g., 16S rRNA gene PCR if present). | High fidelity reduces chimera formation during PCR from complex templates. |
1. Introduction and Taxonomic Context

The phylum Marinisomatota (as defined by the Genome Taxonomy Database, GTDB) represents a phylogenetically distinct lineage within the bacterial domain, primarily derived from marine and host-associated environments. Within the broader thesis of refining GTDB classifications and exploring underexplored taxa, this phylum presents a significant opportunity for biodiscovery. Its ecological niches suggest adaptation to complex polysaccharides and competitive interactions, predicting a rich repertoire of Biosynthetic Gene Clusters (BGCs) and catalytically novel enzymes with potential applications in drug discovery, biocatalysis, and biomedicine.

2. Quantitative Overview of Marinisomatota Genomic Potential

Table 1: Summary of BGC Diversity in Publicly Available Marinisomatota Genomes (as of 2024)

| GTDB Genus Representative | Number of Genomes Surveyed | Average BGCs per Genome | Most Frequent BGC Class | Notable Predicted Product |
|---|---|---|---|---|
| UBA2962 (marine sediment) | 12 | 8.2 | Terpene | Sesterterpenoid-like |
| UBA10314 (sponge symbiont) | 7 | 11.5 | NRPS, T3PKS | Lipopeptide, Polyketide |
| UBA1773 (hydrothermal vent) | 5 | 6.8 | Bacteriocin | Lanthipeptide-class |
| Phylum Aggregate | ~50 | 8.7 | Terpene | High chemical novelty index |
Table 2: Putative Novel Enzyme Families Identified via CAZy and Peptidase Database Mining
| Enzyme Class | GTDB Family | Predicted Activity | Unique Domain Architecture | Potential Biomedical Application |
|---|---|---|---|---|
| Glycosyltransferase | UBA2962 | β-1,3-Xylosyltransferase | C-terminal Sharkin-like domain | Synthesis of heparin mimetics |
| Peptidase (S8 family) | UBA10314 | Subtilisin-like serine protease | Inserted carbohydrate-binding module | Targeted proteolysis for biofilm disruption |
| Polysaccharide Lyase | UBA1773 | Alginate lyase (novel substrate specificity) | Tandem bacterial immunoglobulin-like domains | Cystic fibrosis therapeutic (mucolysis) |
3. Detailed Experimental Protocols
Protocol 3.1: In silico Genome Mining and BGC Prioritization
Objective: To identify and prioritize non-ribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) BGCs from Marinisomatota genomes.
Materials: High-performance computing cluster, antiSMASH 7.0, BiG-SCAPE, CORASON, MIBiG database.
Procedure:
--strict --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go). Use the --genefinding-tool prodigal.
python bigscape.py -i ./antismash_results -o ./bigscape_out --mibig --mix --cutoffs 0.3 0.65 0.95).
Protocol 3.2: Heterologous Expression of a Terpene Synthase BGC
Objective: To express a prioritized terpene synthase BGC from UBA2962 in Streptomyces coelicolor M1152.
Materials: Research Reagent Solutions Table:
| Reagent/Solution | Function | Source/Catalog Note |
|---|---|---|
| pCAP01 fosmid vector | BGC capture and heterologous expression | E. coli EPI300-T1ᵣ library construction |
| S. coelicolor M1152 | Streptomycete heterologous host | Lack of native PKS and NRPS clusters |
| Apetite solid medium | Selection and sporulation of Streptomyces | Contains apramycin, MgCl₂, and trace elements |
| Ethyl acetate with 1% acetic acid | Extraction of terpenoid metabolites | LC-MS grade for downstream analysis |
| PCR Master Mix (2x) with GC enhancer | Amplification of high-GC% Marinisomatota DNA | Required for >60% GC content |
Procedure:
DDXXD motif). Isolate positive fosmid DNA.
Protocol 3.3: Activity Screening of a Novel Subtilisin-like Protease
Objective: To clone, express, and test the activity of a novel S8 peptidase from UBA10314.
Materials: pET-28a(+) vector, E. coli BL21(DE3), Ni-NTA resin, fluorogenic substrate Boc-Gln-Ala-Arg-AMC.
Procedure:
4. Visualization of Workflows and Pathways
Title: Marinisomatota Mining and Validation Workflow
Title: NRPS Biosynthetic Logic
This application note details protocols for linking phylogeny, specifically within the Marinisomatota phylum (as classified by the Genome Taxonomy Database - GTDB), to biosynthetic gene cluster (BGC) diversity. The work is framed within the broader thesis that GTDB-based phylogenetic resolution of understudied taxa, like the Marinisomatota, uncovers novel BGC landscapes, providing a systematic roadmap for targeted biodiscovery in drug development.
Table 1: Comparative BGC Diversity in Marinisomatota vs. Related Phyla (GTDB r214)
| Taxonomic Group (GTDB) | Genomes Analyzed | Total BGCs Identified | BGCs/Genome (Avg) | NRPS/PKS (%) | Ribosomal (%) | Terpene (%) | Other (%) |
|---|---|---|---|---|---|---|---|
| Marinisomatota | 47 | 312 | 6.64 | 28.2 | 18.9 | 22.1 | 30.8 |
| Actinomycetota | 150 | 1245 | 8.30 | 45.6 | 12.3 | 15.4 | 26.7 |
| Bacteroidota | 85 | 401 | 4.72 | 15.2 | 31.7 | 10.0 | 43.1 |
Table 2: Phylogenetic Conservation of BGC Families in Marinisomatota Clades
| Marinisomatota Family (GTDB) | Representative Genus | Core BGC Family (MIBiG Class) | Conservation Frequency in Clade (%) | Putative Novelty Score* |
|---|---|---|---|---|
| Marinisomataceae | Marinisoma | Type I PKS | 92 | 0.85 |
| Oceanipullicutaceae | Oceanipullicuta | Lasso peptide | 78 | 0.92 |
| UBA10353 | UBA10353 | Hybrid NRPS-PKS | 65 | 0.95 |
| Novel lineage A | MAG-3321 | Thiopeptide | 100 | 0.98 |
*Novelty Score: 1 - (max BLASTp identity to known MIBiG cluster).
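The novelty score defined in the footnote can be sketched in a few lines of Python. This is a minimal illustration, assuming BLASTp identities are reported as percentages (0 to 100) and that a cluster with no MIBiG hit at all is treated as maximally novel; the function names and the handling of empty hit lists are assumptions, not part of the original protocol.

```python
def novelty_score(percent_identities):
    """Return 1 - (max BLASTp identity / 100) for one BGC.

    percent_identities: identities (0-100) of core-gene hits against MIBiG.
    Assumption: a BGC with no hits is scored 1.0 (maximally novel).
    """
    if not percent_identities:
        return 1.0
    best = max(percent_identities)
    if not 0.0 <= best <= 100.0:
        raise ValueError("identities must be percentages between 0 and 100")
    return 1.0 - best / 100.0


def is_high_novelty(score, threshold=0.7):
    """Apply the >0.7 cutoff that the protocol uses to flag high novelty."""
    return score > threshold


if __name__ == "__main__":
    # Hypothetical identities for the core genes of a single BGC.
    score = novelty_score([15.0, 8.0, 12.5])
    print(round(score, 2), is_high_novelty(score))
```

Keeping the score on a 0 to 1 scale makes it directly comparable to the values in Table 2.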
Objective: Generate a robust, GTDB-consistent phylogeny for BGC diversity mapping.
Materials: See "Research Reagent Solutions" (Section 6).
Method:
ncbi-genome-download and gtdb-tk.
OrthoFinder v2.5 with default parameters on all predicted proteomes to identify single-copy orthologous (SCO) groups.
MAFFT v7. Auto-trim alignments with trimAl (-automated1). Concatenate alignments using AMAS.
IQ-TREE2 (-m TEST -B 1000 -alrt 1000). Use the resulting .treefile as the phylogenetic framework.
Objective: Identify, classify, and quantify BGCs from Marinisomatota genomes.
Method:
antiSMASH v6 (or latest) on all genomes with --clusterhmmer, --asf, and --cb-knownclusters flags enabled.
antiSMASH JSON outputs with BiG-SCAPE v1.1 (--mix mode). This generates Gene Cluster Families (GCFs) based on Pfam domain similarity.
BLASTp against the MIBiG database (v3). Calculate the Putative Novelty Score (Table 2) as 1 minus the highest identity (expressed as a fraction, i.e., percent identity divided by 100) for any core gene hit. Scores >0.7 indicate high novelty.
Objective: Statistically link phylogenetic distance to BGC repertoire dissimilarity.
Method:
cophenetic.phylo in R's ape package.
vegdist in R's vegan).
mantel function in vegan) to assess correlation between phylogenetic and BGC distance matrices (use 9999 permutations). A significant p-value (<0.05) supports phylogenetic conservation of BGCs.
iTOL or ggtree in R.
Diagram 1: Phylogeny-Guided Drug Discovery Workflow
Diagram 2: BGC Diversity Correlation with Phylogeny
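The Mantel test in the method above is run with vegan in R; the logic behind it can be sketched in plain Python to make the permutation step explicit. This is an illustrative re-implementation under stated assumptions (Pearson correlation, one-sided test, rows and columns of the second matrix permuted together), not a substitute for the vegan function, and the function names are hypothetical.

```python
import math
import random


def _upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering rows/columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel_test(dist_a, dist_b, permutations=9999, seed=0):
    """Return (observed r, one-sided permutation p-value).

    dist_a, dist_b: square symmetric distance matrices (lists of lists)
    with genomes in the same order, e.g. phylogenetic vs. BGC distances.
    """
    n = len(dist_a)
    identity = list(range(n))
    a_vec = _upper_triangle(dist_a, identity)
    observed = _pearson(a_vec, _upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel genomes in the second matrix only
        if _pearson(a_vec, _upper_triangle(dist_b, order)) >= observed:
            hits += 1
    # The observed arrangement counts as one permutation, hence the +1 terms.
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Toy example: five hypothetical genomes placed on a line.
    points = [0, 1, 3, 6, 10]
    dist = [[abs(a - b) for b in points] for a in points]
    print(mantel_test(dist, dist, permutations=999))
```

Permuting whole rows and columns, rather than individual matrix cells, preserves the dependence structure among distances that share a genome, which is the reason a Mantel test is used instead of an ordinary correlation test.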
Table 3: Essential Toolkit for Phylogeny-BGC Linkage Studies
| Item/Category | Specific Product/Resource | Function in Protocol |
|---|---|---|
| Taxonomic Framework | GTDB-Tk (v2.3.0) Database & Toolkit | Standardizes genome taxonomy according to GTDB, essential for defining Marinisomatota clades (Protocol 3.1). |
| Phylogenomics Software | IQ-TREE2 (v2.2.0), OrthoFinder (v2.5.4) | Infers robust phylogenetic trees from core genomes (Protocol 3.1). |
| BGC Prediction Pipeline | antiSMASH (v6.1.1) with all databases | Comprehensive identification and initial classification of BGCs in genomes (Protocol 3.2). |
| BGC Comparative Analysis | BiG-SCAPE (v1.1) & CORASON | Clusters BGCs into Gene Cluster Families (GCFs) enabling diversity quantification (Protocol 3.2, 3.3). |
| Reference BGC Database | MIBiG (Minimum Information about a BGC) Repository (v3.1) | Gold-standard database for BGC novelty assessment via BLAST (Protocol 3.2). |
| Statistical & Visualization Environment | R (v4.2+) with ape, vegan, ggtree packages | Performs Mantel test and visualizes phylogeny-BGC correlations (Protocol 3.3). |
| High-Performance Computing (HPC) | Linux cluster with SLURM scheduler & >= 1TB storage | Manages computationally intensive genome analysis, tree building, and BiG-SCAPE runs. |
This application note is framed within broader thesis research on the Marinisomatota phylum (GTDB classification; associated with the FCB superphylum in some taxonomic systems). Integrating a robust taxonomic classification such as the Genome Taxonomy Database (GTDB) with functional annotation pipelines is critical for elucidating the unique metabolic and biosynthetic potential of understudied lineages. For Marinisomatota, hypothesized to have rich secondary metabolism, coupling GTDB-Tk classification with tools like antiSMASH and Prokka accelerates the discovery of novel gene clusters and their interpretation within an accurate evolutionary framework.
Table 1: Comparison of Functional Annotation & Classification Tools
| Tool/Pipeline | Primary Purpose | Key Outputs | Typical Runtime* | Relevance to Marinisomatota Research |
|---|---|---|---|---|
| GTDB-Tk v2.3.0 | Taxonomic classification & phylogeny | Taxonomic assignment, alignments, tree | ~30 min/genome | Definitive placement of novel Marinisomatota genomes within the GTDB hierarchy. |
| Prokka v1.14.6 | Rapid prokaryotic genome annotation | CDS, tRNA, rRNA, functional prefixes (COG, Pfam) | ~10-15 min/genome | First-pass functional annotation, creating standardized GenBank files for downstream analysis. |
| antiSMASH v7.0 | Secondary metabolite BGC detection | BGC location, type, similarity, core structures | ~20-30 min/genome | Identification of biosynthetic gene clusters (BGCs) for drug discovery leads. |
| EggNOG-mapper v2.1.12 | Functional orthology annotation | GO terms, KEGG pathways, COG categories | ~5-10 min/genome | Consistent functional annotation across diverse taxa. |
| CheckM2 v1.0.2 | Genome quality estimation | Completeness, contamination | ~3-5 min/genome | Quality assessment prior to classification/annotation. |
*Runtimes are approximate for a 4-5 Mbp bacterial genome on a high-performance compute node.
Table 2: Integrated Pipeline Output Statistics for a Mock Marinisomatota Dataset
| Analysis Stage | Metric | Average Value (n=10 draft genomes) | Notes |
|---|---|---|---|
| CheckM2 | Genome Completeness (%) | 96.4 ± 2.1 | High-quality drafts suitable for analysis. |
| GTDB-Tk | Classification Rank | p__Marinisomatota; g__UBA2565 | All genomes placed within the phylum; most as novel genera. |
| Prokka | Total CDS Annotated | 3,850 ± 420 | Provides baseline gene calls for all pipelines. |
| antiSMASH | BGCs per Genome | 8.2 ± 1.7 | Indicates high biosynthetic potential. |
| EggNOG-mapper | Genes with KEGG Annotation | 62% ± 5% | Enables pathway reconstruction. |
Objective: Assess draft genome quality and assign accurate taxonomy prior to functional annotation.
checkm2 predict --input <assembly.fasta> --output-directory <checkm2_out> --threads 8
gtdbtk classify_wf --genome_dir <filtered_genomes_dir> --out_dir <gtdbtk_out> --cpus 8 --pplacer_cpus 2
classify/<genome>.summary.tsv provides domain- to species-level classification.
Objective: Annotate genomes and specifically identify biosynthetic gene clusters (BGCs) using a coordinated pipeline.
prokka <assembly.fasta> --outdir <prokka_out> --prefix <strain_name> --cpus 8 --rfam
.gbk file essential for antiSMASH.
antismash <prokka_out>/<strain_name>.gbk --output-dir <antismash_out> --cpus 8 --genefinding-tool prodigal-m
emapper.py -i <prokka_out>/<strain_name>.faa -o <eggnog_out> --cpu 8
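The quality gate between CheckM2 and GTDB-Tk (the step that produces the filtered genome directory) can be sketched as a small filter over the CheckM2 report. This is a minimal sketch, assuming the report is a tab-separated file with `Name`, `Completeness`, and `Contamination` columns; the >90% / <5% defaults are illustrative thresholds rather than values fixed by this protocol, and the function name is hypothetical.

```python
import csv
import io


def passing_genomes(report_handle, min_completeness=90.0, max_contamination=5.0):
    """Return genome names that pass the completeness/contamination gate.

    report_handle: open text handle on a CheckM2-style TSV report.
    Assumed columns: Name, Completeness, Contamination (percent values).
    """
    keep = []
    for row in csv.DictReader(report_handle, delimiter="\t"):
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            keep.append(row["Name"])
    return keep


if __name__ == "__main__":
    # Hypothetical report content for three bins.
    report = io.StringIO(
        "Name\tCompleteness\tContamination\n"
        "bin_01\t96.4\t1.2\n"
        "bin_02\t72.0\t3.1\n"
        "bin_03\t95.0\t8.5\n"
    )
    print(passing_genomes(report))
```

The returned names can then be used to copy or symlink the corresponding assemblies into the directory passed to `gtdbtk classify_wf`.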
Title: Integrated Genome Analysis Pipeline
Title: Marinisomatota Research Logic Flow
Table 3: Essential Research Reagent Solutions & Materials
| Item | Function/Application in Protocol | Example/Notes |
|---|---|---|
| High-Quality Compute Environment | Running computationally intensive pipelines. | Linux server/cluster with ≥32GB RAM, multi-core CPUs (e.g., AWS EC2, HPC). |
| Conda/Mamba | Reliable dependency and environment management. | Use bioconda channels to install all tools (GTDB-Tk, Prokka, antiSMASH). |
| GTDB-Tk Reference Data (v214) | Essential database for taxonomic classification. | Download reference214.tar.gz (∼54 GB). Critical for accurate Marinisomatota placement. |
| antiSMASH Databases | For BGC detection, rule-based clustering, etc. | Includes MIBiG, Pfam, ClusterBlast; installed via download-antismash-databases. |
| EggNOG Database (v5.0) | For fast orthology mapping and functional annotation. | Bacterial (bact) subset sufficient for Marinisomatota. |
| Integrative Analysis Scripts | Custom Python/R scripts to merge outputs. | For merging GTDB taxonomy, BGC locations, and KEGG pathways into a single table. |
| Visualization Tools | Creating publication-quality figures from results. | R (ggplot2, ggtree), Python (matplotlib, seaborn), or software like OriginLab. |
Within the broader thesis applying the Genome Taxonomy Database (GTDB) framework to elucidate the evolutionary and metabolic diversity of the phylum Marinisomatota (synonymous with Marinisomatia in some classifications), three critical, interconnected pitfalls consistently compromise downstream analysis. These are the recovery of low-quality Metagenome-Assembled Genomes (MAGs), genome contamination, and assignment to incomplete or obsolete taxonomic lineages. Addressing these is paramount for robust ecological inference and bioprospecting, especially for drug development professionals seeking novel bioactive gene clusters from marine microbiomes.
1. Low-Quality MAGs: The inherent fragmentation and uneven coverage in metagenomic sequencing often yield MAGs that are incomplete and/or misassembled. For GTDB classification, which relies on a set of conserved marker genes, this directly impacts the placement accuracy. A MAG missing >10% of these markers may be assigned to an imprecise taxonomic rank or flagged as "incomplete."
2. Contamination: Cross-contamination from co-occurring organisms, especially during binning, results in chimeric MAGs containing genes from multiple taxonomic units. This invalidates functional predictions and distorts phylogenetic trees. For Marinisomatota, which often exist in complex consortia, this is a prevalent risk.
3. Incomplete Taxonomy: Relying on legacy taxonomy (e.g., NCBI) instead of the standardized, genome-based GTDB can lead to misclassification. Marinisomatota itself is a product of genomic taxonomy, redefining older groups. Using outdated names obscures evolutionary relationships and hinders comparative genomics.
Quantitative Impact Summary:
Table 1: Impact of MAG Quality on GTDB Classification Success Rate
| MIMAG Quality Tier | Completeness (CheckM2) | Contamination (CheckM2) | % Passing GTDB-Tk Workflow (approx.) | Risk of Misclassification |
|---|---|---|---|---|
| High-quality (HQ) | >90% | <5% | >95% | Low |
| Medium-quality (MQ) | ≥50%, <90% | <10% | ~60-80% | Moderate |
| Low-quality (LQ) | <50% | ≥10% | <30% | Very High |
Table 2: Common Contaminant Signatures in Putative Marinisomatota MAGs
| Contaminant Phylum (GTDB) | Typical Marker Genes | Effect on Classification |
|---|---|---|
| Proteobacteria | rpoB, fusA | Creates aberrant long branches in phylogeny |
| Bacteroidota | rpoC, gyrB | Can cause "pulling" into sister clades |
| Archaea (e.g., Thermoplasmatota) | Archaeal ribosomal proteins | GTDB-Tk may reject genome or flag as contaminated |
This protocol ensures only robust MAGs are submitted to GTDB-Tk for taxonomic classification of Marinisomatota.
Materials (Research Reagent Solutions):
bbduk.sh): For adapter trimming and quality filtering of raw reads.
Methodology:
depth). Visually inspect coverage plots for sharp, unimodal distributions. Discard MAGs with multi-modal coverage, indicating co-binned populations.
classify_wf). The resulting bacterial classification file (gtdbtk.bac120.summary.tsv) provides the taxonomy, classification confidence (based on marker gene support), and place in the reference tree.
MAG Refinement and GTDB Classification Pipeline
When GTDB-Tk assigns a Marinisomatota MAG to an "unclassified" genus or family, follow this protocol to contextualize its placement.
Materials:
infer workflow): Generates a multiple sequence alignment (MSA) and tree including your MAGs and the full GTDB reference.
Methodology:
infer workflow on your MAG set to place them within the GTDB reference tree. Visualize the resulting tree (.treefile) in iTOL.
msa and mask files to calculate Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI) versus its closest relatives using tools like compareM or PyANI.
d__Bacteria;p__Marinisomatota;c__...;g__;s__). Clearly distinguish between classified ranks and placeholder names (g__UBA1234). Reference the GTDB release number (e.g., R220).
Resolving Unclassified Taxonomy via Phylogenomics
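When reporting a lineage string of the form shown in the methodology, it helps to separate classified ranks, empty ranks, and placeholder names programmatically. The sketch below parses a GTDB-style lineage into ranks and reports the lowest rank that carries a name. The placeholder check (any name containing a digit, as in UBA1234) is a heuristic assumed here for illustration, and the example lineage, including the class name, is hypothetical.

```python
import re

# GTDB rank prefixes in order from domain to species.
RANKS = [
    ("d__", "domain"), ("p__", "phylum"), ("c__", "class"), ("o__", "order"),
    ("f__", "family"), ("g__", "genus"), ("s__", "species"),
]
PREFIX_TO_RANK = dict(RANKS)


def parse_gtdb_taxonomy(lineage):
    """Split 'd__...;p__...;...' into a {rank: name} dict (empty string if unassigned)."""
    ranks = {}
    for field in lineage.split(";"):
        field = field.strip()
        prefix, name = field[:3], field[3:]
        if prefix not in PREFIX_TO_RANK:
            raise ValueError(f"unexpected rank prefix in field: {field!r}")
        ranks[PREFIX_TO_RANK[prefix]] = name
    return ranks


def lowest_classified_rank(ranks):
    """Return the most specific rank with a non-empty name, or None."""
    lowest = None
    for _, rank in RANKS:
        if ranks.get(rank):
            lowest = rank
    return lowest


def is_placeholder(name):
    """Heuristic (assumption): placeholder names such as UBA1234 contain digits."""
    return bool(re.search(r"\d", name))


if __name__ == "__main__":
    # Hypothetical lineage for an unclassified MAG.
    lineage = "d__Bacteria;p__Marinisomatota;c__UBA2242;o__;f__;g__;s__"
    ranks = parse_gtdb_taxonomy(lineage)
    print(lowest_classified_rank(ranks), is_placeholder(ranks["class"]))
```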
Application Notes
Within a thesis investigating the phylogenetic diversity and metabolic potential of the Marinisomatota phylum (p__Marinisomatota in GTDB), the interpretation of GTDB-Tk outputs is critical. Ambiguities, such as low support values and unclassified branches, are common but can be systematically addressed to refine taxonomic hypotheses.
1. Quantitative Analysis of Ambiguity: Common metrics from GTDB-Tk phylogenetic trees require careful scrutiny. The following table summarizes key thresholds for interpretation.
Table 1: Interpretation of Support Metrics in GTDB-Tk Phylogenetic Trees
| Metric | Range | Typical Threshold for Robustness | Interpretation in Marinisomatota Context |
|---|---|---|---|
| SH-like (aLRT) Support | 0-1 | ≥ 0.9 | Values < 0.7 indicate high ambiguity; branch placement is unreliable for novel lineages. |
| Bootstrap Support | 0-100 | ≥ 80 | Values between 50-80 suggest caution; topology may change with more data. |
| Taxonomic Rank Support | Classified/Unclassified | N/A | An "unclassified" label at the genus or family level often correlates with support values < 0.8. |
| Placement Distance (RF) | 0-1 | ≤ 0.3 | Distance > 0.5 from a defined reference suggests a potentially novel clade. |
2. Protocol for Resolving Ambiguities: Follow this sequential workflow to investigate ambiguous classifications.
Protocol 1: Multi-Marker Tree Reconciliation
Objective: To validate or correct the GTDB-Tk classification of a Marinisomatota genome (e.g., bin_23) showing low support at the family level.
Materials:
gtdbtk_output/) for the genome of interest.
Methodology:
[bin_id].bac120.user_msa.fasta).
bac120.msa file. Use taxonkit to gather relevant GTDB taxa IDs.
fastANI) against the genomes in the ambiguous clade to confirm or refute species-level grouping (threshold ~95% ANI).
Protocol 2: Metabolic Profiling for Taxonomic Inference
Objective: Use functional signatures to support the placement of an unclassified Marinisomatota branch.
Materials:
Methodology:
Visualizations
GTDB-Tk Ambiguity Resolution Workflow
Example Ambiguous Branch with Low Support
The Scientist's Toolkit: Research Reagent Solutions
Table 2: Essential Resources for Resolving GTDB-Tk Ambiguities
| Item | Function/Description | Source/Example |
|---|---|---|
| GTDB-Tk Reference Data (R214+) | Essential database containing alignments, trees, and taxonomy for classification. Always use the version matching your GTDB-Tk install. | GTDB Website |
| IQ-TREE2 Software | For robust, custom phylogenetic tree inference with modern support metrics (SH-aLRT, UFBoot). | http://www.iqtree.org/ |
| CheckM2 / GTDB-Tk QC | Provides essential genome quality metrics (completeness, contamination). Poor quality can cause ambiguous placement. | CheckM2 GitHub |
| FastANI | Computes Average Nucleotide Identity for precise genus/species boundary assessment against reference genomes. | FastANI GitHub |
| KofamScan & KEGG Database | For functional profiling and identifying conserved metabolic signatures that support taxonomic grouping. | KofamScan GitHub |
| Custom HMM Library | A collection of Hidden Markov Models for protein families specific to Marinisomatota or the related FCB superphylum. | Constructed via hmmbuild from curated alignments. |
| Taxonkit | A powerful CLI tool for parsing and filtering NCBI/GTDB-style taxonomy files efficiently. | Taxonkit GitHub |
1. Introduction & Thesis Context
Within the broader thesis investigating the evolutionary genomics and biotechnological potential of the Marinisomatota phylum (GTDB designation, formerly part of FCB group or Sphingobacteria), efficient computational resource management is paramount. Analysis of large-scale metagenomic and isolate datasets demands strategic optimization to enable high-fidelity taxonomic classification, pangenome construction, and functional profiling. These protocols are designed to maximize throughput and accuracy while minimizing computational cost and time.
2. Quantitative Resource Benchmarks for Common Tasks
The following table summarizes resource requirements for key analytical steps, benchmarked on a representative dataset of 500 metagenome-assembled genomes (MAGs) binned as Marinisomatota.
Table 1: Computational Resource Benchmarks for Core Analysis Tasks
| Analytical Task | Software (Example) | Typical Dataset Size | CPU Cores Recommended | RAM (GB) | Wall Time (HH:MM) | Storage I/O |
|---|---|---|---|---|---|---|
| Quality Control & Adapter Trimming | Fastp v0.23.4 | 1B PE reads (2x150bp) | 16 | 32 | 02:30 | High |
| Metagenome Assembly | MEGAHIT v1.2.9 | 1B PE reads | 64 | 512 | 24:00+ | Very High |
| Genome Binning | MetaBat2 v2.15 | 500 contigs (>2.5kbp) | 8 | 64 | 04:00 | Medium |
| GTDB-Tk Classification | GTDB-Tk v2.3.0 | 500 MAGs | 16 | 128 | 06:00 | Medium |
| Pangenome Analysis | Anvi'o v7.1 | 50 Marinisomatota genomes | 32 | 256 | 12:00 | High |
| Functional Annotation | Prokka v1.14.6 | 1 MAG (~5 Mbp) | 4 | 16 | 01:00 | Low |
3. Detailed Experimental Protocols
Protocol 3.1: Optimized GTDB Taxonomic Classification Pipeline
Objective: To classify putative Marinisomatota MAGs using the Genome Taxonomy Database Toolkit (GTDB-Tk) with resource-efficient prioritization.
- Post-processing: Concatenate batch results (gtdbtk.bac120.summary.tsv, gtdbtk.ar53.summary.tsv) and filter for classifications within the Marinisomatota phylum (e.g., p__Marinisomatota_A, p__Marinisomatota_B).
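The post-processing step can be sketched as follows: read each batch summary, pull the phylum field out of the classification string, and keep rows whose phylum starts with `p__Marinisomatota` (which also matches suffixed names such as `p__Marinisomatota_A`). This is a minimal sketch, assuming tab-separated summaries with `user_genome` and `classification` columns; the function name and example rows are hypothetical.

```python
import csv
import io


def marinisomatota_rows(handles, phylum_prefix="p__Marinisomatota"):
    """Collect (genome, classification) pairs assigned to the target phylum.

    handles: iterable of open text handles, one per batch summary TSV.
    Assumed columns: user_genome, classification.
    """
    hits = []
    for handle in handles:
        for row in csv.DictReader(handle, delimiter="\t"):
            fields = row["classification"].split(";")
            # Find the phylum field; unclassified genomes simply yield "".
            phylum = next((f for f in fields if f.startswith("p__")), "")
            if phylum.startswith(phylum_prefix):
                hits.append((row["user_genome"], row["classification"]))
    return hits


if __name__ == "__main__":
    # Two hypothetical batch summaries.
    batch_1 = io.StringIO(
        "user_genome\tclassification\n"
        "bin_1\td__Bacteria;p__Marinisomatota_A;c__;o__;f__;g__;s__\n"
        "bin_2\td__Bacteria;p__Bacteroidota;c__;o__;f__;g__;s__\n"
    )
    batch_2 = io.StringIO(
        "user_genome\tclassification\n"
        "bin_3\td__Bacteria;p__Marinisomatota;c__;o__;f__;g__;s__\n"
    )
    for genome, lineage in marinisomatota_rows([batch_1, batch_2]):
        print(genome, lineage)
```

Matching on the parsed phylum field, rather than searching the whole string, avoids accidental hits on lower-rank names that happen to contain the same text.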
Protocol 3.2: Resource-Aware Comparative Genomics Workflow
Objective: To construct a pangenome from curated Marinisomatota genomes without exhausting memory.
- Dereplication: Use dRep v3.4.0 to cluster genomes at 99% ANI to reduce redundancy.
- Annotation with Prokka (Parallelized): Use GNU Parallel to annotate dereplicated genomes simultaneously across allocated nodes.
- Pangenome Construction: Use Roary v3.13.0 with a strict MCL inflation parameter (1.5) for clearer core/accessory separation.
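The parallel annotation step names GNU Parallel; where a single multi-core node is available, the same fan-out can be approximated from Python with a thread pool that launches one Prokka process per genome. The sketch below is that substitute, not the GNU Parallel workflow itself. It uses only the Prokka options that appear elsewhere in this article (`--outdir`, `--prefix`, `--cpus`), defaults to a dry run that just returns the commands, and all function names and paths are hypothetical.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def prokka_command(fasta, out_root, cpus=4):
    """Build the Prokka argument list for one genome (output dir named after the file)."""
    name = Path(fasta).stem
    return [
        "prokka", str(fasta),
        "--outdir", str(Path(out_root) / name),
        "--prefix", name,
        "--cpus", str(cpus),
    ]


def _run(command):
    """Run one command and return its exit code (non-zero means failure)."""
    return subprocess.run(command, check=False).returncode


def annotate_all(fastas, out_root, workers=4, cpus=4, dry_run=True):
    """Annotate genomes concurrently; with dry_run=True only return the command lines."""
    commands = [prokka_command(f, out_root, cpus) for f in fastas]
    if dry_run:
        return [shlex.join(c) for c in commands]
    # Threads suffice here because the work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_run, commands))


if __name__ == "__main__":
    for line in annotate_all(["mags/bin_1.fa", "mags/bin_2.fa"], "prokka_out"):
        print(line)
```

When choosing `workers` and `cpus`, keep their product at or below the number of cores allocated to the job so that concurrent Prokka runs do not oversubscribe the node.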
4. Mandatory Visualization
Diagram Title: Marinisomatota MAG Analysis and Classification Pipeline
Diagram Title: Computational Resource Decision Tree
5. The Scientist's Toolkit: Essential Research Reagents & Solutions
Table 2: Key Computational Tools & Data Resources for Marinisomatota Research
| Item Name | Type | Primary Function in Analysis | Resource Optimization Tip |
|---|---|---|---|
| GTDB-Tk v2.3.0+ | Software/Reference Data | Assigns robust taxonomy using GTDB reference tree. Critical for placing novel Marinisomatota. | Use --scratch_dir on a fast local SSD to reduce pplacer memory usage (at some cost in speed). |
| CheckM2 | Software/Model | Rapid assessment of MAG quality (completeness/contamination). | Use the pre-trained model; runs significantly faster with lower memory than CheckM1. |
| dRep | Software | Dereplicates genome sets based on ANI. Reduces computational load for downstream steps. | Adjust -nc (minimum alignment coverage between genomes in secondary comparisons) to retain relevant diversity. |
| Roary | Software | Rapid large-scale prokaryote pangenome analysis. Identifies core/accessory genes. | Use -iv (MCL inflation) >1.2 for more conservative, less noisy clustering in diverse sets. |
| Prokka | Software | Rapid annotation of bacterial genomes. Provides standard GFF3 for downstream tools. | Use --metagenome flag and --mincontiglen to optimize for MAG annotation. |
| GTDB R214 | Reference Database | Provides the standardized taxonomic framework and alignments for classification. | Download to a shared, high-performance filesystem to avoid redundant copies. |
| PFAM & TIGRFAM | HMM Database | For functional annotation of protein families within Marinisomatota genomes. | Combine with tools like anvi-run-hmms for efficient, parallelized annotation. |
| Slurm / SGE | Job Scheduler | Manages resource allocation on HPC clusters for parallelizable workflows. | Implement job arrays for classifying or annotating 100s of genomes efficiently. |
Application Notes & Protocols
Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly candidate phylum SAR406) presents unique challenges in taxonomic placement due to its deep evolutionary branching and frequent genomic bridging to related candidate phyla like Muirbacteria, Uhrbacteria, and Gribaldobacteria. Accurate classification is critical for interpreting its ecological role in marine systems and assessing its potential in bioprospecting for novel enzymes or bioactive compounds.
1. Quantitative Data Summary: Key Genomic & Phylogenetic Markers
Table 1: Core Genome & Phylogenetic Marker Comparison Across Bridging Phyla
| Feature / Marker | Marinisomatota (GTDB r214) | Muirbacteria (GTDB) | Uhrbacteria (GTDB) | Bridging Genome Example (Bin.123) |
|---|---|---|---|---|
| Average Genome Size (Mbp) | 1.8 - 2.3 | 1.5 - 1.9 | 1.6 - 2.1 | 2.05 |
| Average GC Content (%) | 44 - 48 | 50 - 54 | 38 - 42 | 46.2 |
| tRNA Count (avg.) | 33 | 35 | 32 | 34 |
| 16S rRNA Identity to Marinisomatota (%) | 100 (ref) | 78.2 - 81.5 | 75.8 - 79.1 | 92.3 |
| Concatenated Marker (120) AAI to Marinisomatota (%) | 100 (ref) | 60.5 - 62.8 | 58.9 - 61.2 | 85.7 |
| CheckM2 Completeness (%) | >95 (high-quality) | >90 | >90 | 96.4 |
| CheckM2 Contamination (%) | <5 | <5 | <5 | 1.2 |
| Presence of Diagnostic Pathway | Yes (Partial TCA) | No | No | Yes |
Table 2: Diagnostic Metabolic Pathway Gene Presence/Absence
| Pathway Gene | Marinisomatota Consensus | Bridging Genome Annotation | Function & Taxonomic Relevance |
|---|---|---|---|
| Fumarate hydratase (class II) [K01676] | + | + | Key TCA cycle enzyme; conserved in Marinisomatota. |
| Rhodanese-domain protein [K01011] | + | + | Sulfur metabolism; a phylum-associated trait. |
| Group 3 [NiFe] hydrogenase [K06281, K06282] | + | + | Energy metabolism in anoxic environments. |
| Archaeal-like Rubisco (rbcL) | - | - | Distinguishes from photosynthetic relatives. |
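The presence/absence comparison in Table 2 can be automated once a genome has KO assignments (for example from KofamScan or EggNOG-mapper output). The sketch below checks a set of annotated KO identifiers against the diagnostic KOs exactly as listed in Table 2; the function names and the example annotation set are hypothetical, and the rbcL row is omitted because the table gives no KO identifier for it.

```python
# Diagnostic KOs as listed in Table 2 of this section.
DIAGNOSTIC_KOS = {
    "K01676": "Fumarate hydratase (class II)",
    "K01011": "Rhodanese-domain protein",
    "K06281": "Group 3 [NiFe] hydrogenase",
    "K06282": "Group 3 [NiFe] hydrogenase",
}


def diagnostic_profile(annotated_kos):
    """Return {KO: True/False} for each diagnostic KO given a genome's KO set."""
    found = set(annotated_kos)
    return {ko: ko in found for ko in DIAGNOSTIC_KOS}


def fraction_present(profile):
    """Fraction of diagnostic KOs detected in the genome."""
    return sum(profile.values()) / len(profile)


if __name__ == "__main__":
    # Hypothetical KO assignments for a bridging genome.
    genome_kos = {"K01676", "K01011", "K06281", "K00001"}
    profile = diagnostic_profile(genome_kos)
    for ko, present in profile.items():
        print(ko, DIAGNOSTIC_KOS[ko], "+" if present else "-")
    print(fraction_present(profile))
```

Because MAGs are often incomplete, a missing KO is weaker evidence than a detected one; interpreting the fraction alongside CheckM2 completeness avoids over-reading absences.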
2. Experimental Protocols
Protocol 1: Integrated Phylogenomic Placement of Ambiguous Genomes
Objective: To resolve classification of a genome bridging Marinisomatota and related phyla using GTDB toolkit and supplementary analysis.
Materials: High-quality metagenome-assembled genome (MAG), GTDB-Tk v2.3.0, CheckM2, Python environment with SciKit-bio, FastTree, IQ-TREE2.
Procedure:
checkm2 predict --input <mag.fasta> ... to assess completeness & contamination. Proceed only if completeness >90% & contamination <5%.
gtdbtk classify_wf --genome_dir <dir> --out_dir <output> --cpus 8. Record the classification and posterior probability for all ranks.
gtdbtk identify and align. Create a custom concatenated alignment.
iqtree2 -s concat.align -m MFP -bb 1000 -nt 8) using a curated set of reference genomes from Marinisomatota, Muirbacteria, Uhrbacteria, and an outgroup.
--place function in GTDB-Tk or using EPA-ng in conjunction with the reference alignment.
comparem aai_wf (https://github.com/dparks1134/CompareM). An AAI >80% suggests phylum-level affiliation; 60-80% indicates separate but related phyla.
Protocol 2: Validation via Diagnostic Metabolic Profiling
Objective: To validate phylogenetic placement by confirming the presence of Marinisomatota-diagnostic metabolic pathways.
Materials: Annotated genome (e.g., using PROKKA or DRAM), KEGG database, HMMER suite, custom HMM profiles for diagnostic genes.
Procedure:
prokka --outdir <dir> --prefix <mag> <mag.fasta>.
hmmsearch with an E-value cutoff of 1e-20, search the translated proteome against custom HMMs built for diagnostic genes (e.g., fumarate hydratase class II, rhodanese-domain protein).
3. Visualizations
Title: Workflow for Resolving Phylogenomic Classification
Title: Diagnostic Partial TCA Cycle in Marinisomatota
4. The Scientist's Toolkit
Table 3: Essential Research Reagent Solutions & Materials
| Item / Tool | Function in Analysis | Example / Note |
|---|---|---|
| GTDB-Tk (v2.3.0+) | Standardized taxonomic classification relative to GTDB phylogeny. | Uses ~120 bacterial marker genes & pplacer for placement. |
| CheckM2 | Estimates genome completeness & contamination rapidly. | Superior for genomes from novel lineages vs. CheckM1. |
| CompareM | Calculates Average Amino Acid Identity (AAI) & ANI. | Critical for quantifying genomic relatedness between phyla. |
| IQ-TREE2 | Phylogenetic inference with model testing & fast bootstrapping. | For building robust reference trees. |
| PROKKA / DRAM | Rapid genome annotation & metabolic profiling. | DRAM specializes in metabolic pathway distillation for microbes. |
| Custom HMM Profiles | Detects conserved, phylum-diagnostic protein families. | Build using hmmbuild from curated alignments of target genes. |
| KEGG MODULE Database | Reference for pathway completeness assessment. | Manual curation required due to pathway variability in DPANN/CPR. |
| PhyloPhlAn 3.0 | Alternative for phylogeny using ~400 universal markers. | Useful as an orthogonal method to GTDB-Tk. |
Best Practices for Curation and Submission of Novel Marinisomatota Genomes
Application Notes
The accurate classification of novel genomes within the phylum Marinisomatota (formerly known as KS3-B09 or SAR406) is critical for advancing our understanding of their role in marine biogeochemical cycles and for exploring their biosynthetic potential. As per the Genome Taxonomy Database (GTDB) taxonomy (release 220), Marinisomatota is a distinct bacterial phylum primarily comprising uncultivated lineages from oceanic and deep-sea environments. Curation and submission of genomes from this group present unique challenges due to their frequent assembly from complex metagenomic datasets and their phylogenetic depth. Adherence to standardized practices ensures genomic data integrity, facilitates reproducible taxonomy, and enables downstream drug discovery pipelines to accurately target novel enzymatic pathways from these enigmatic organisms.
Protocols
1. Genome Assembly and Curation Protocol
Objective: To reconstruct high-quality Marinisomatota genomes from metagenomic sequence data.
Detailed Methodology:
1. Sequence Pre-processing: Use fastp (v0.23.4) with parameters --detect_adapter_for_pe --trim_poly_g --cut_front --cut_tail to remove adapters and low-quality bases.
2. Co-assembly: Perform de novo assembly on quality-filtered reads using MEGAHIT (v1.2.9) with meta-large presets or SPAdes (v3.15.5) in --meta mode for higher complexity samples.
3. Binning: Execute binning using multiple tools: MetaBAT2 (v2.15), MaxBin2 (v2.2.7), and CONCOCT (v1.1.0). Generate a consensus set of bins using DAS Tool (v1.1.6).
4. Taxonomic Assignment: Assign preliminary taxonomy to bins using GTDB-Tk (v2.3.2) with the classify_wf command and database release R220.
5. Genome Refinement: For bins classified as Marinisomatota, perform manual refinement in Anvi'o (v7.1). Map reads back to the bin, inspect coverage and tetranucleotide frequency outliers, and remove contaminating contigs.
6. Completeness/Contamination Assessment: Run CheckM2 (v1.0.1) to estimate genome completeness and contamination. Proceed only with medium-quality (MQG; ≥50% complete, <10% contaminated) or high-quality (HQG; ≥90% complete, <5% contaminated) genomes.
2. Phylogenomic Placement and Classification Protocol
Objective: To determine the precise taxonomic position of a novel Marinisomatota genome within the GTDB framework.
Detailed Methodology:
1. Protein Marker Extraction: Use GTDB-Tk's identify and align commands to extract and align 120 bacterial single-copy marker genes.
2. Reference Tree Placement: Generate a rooted phylogenetic tree with the infer command, which places the novel genome within the GTDB reference tree of type genomes.
3. Relative Evolutionary Divergence (RED) Calculation: The GTDB-Tk classify workflow automatically calculates the RED value, a quantitative measure of phylogenetic divergence.
4. Taxonomic Assignment: Assign taxonomy based on the genome's position relative to defined RED thresholds for each rank. Novelty is indicated by prefixes (e.g., "UBA..." for uncultivated bacterium).
3. Genome Submission and Annotation Protocol
Objective: To submit curated genomes to public repositories with standardized annotations.
Detailed Methodology:
1. Functional Annotation: Annotate the genome using PROKKA (v1.14.6) for rapid gene calling, or a comprehensive pipeline: DRAM (v1.4.4) for metabolism and KofamScan for KEGG orthologs.
2. Biosynthetic Gene Cluster (BGC) Identification: Run antiSMASH (v7.0) or DeepBGC to identify potential secondary metabolite BGCs, a key interest for drug development.
3. Metadata Collection: Compose minimal and contextual metadata as per the Genomic Standards Consortium (MIxS) checklist, emphasizing environmental parameters (depth, salinity, temperature).
4. Submission: Submit the genome assembly, annotated features, and raw reads to the International Nucleotide Sequence Database Collaboration (INSDC) via the NCBI, ENA, or DDBJ submission portals.
Data Presentation
Table 1: Genomic Quality Standards for Marinisomatota Submissions
| Quality Tier | Completeness | Contamination | # of Contigs | N50 (kb) | GTDB Designation |
|---|---|---|---|---|---|
| High Quality | ≥ 90% | < 5% | < 500 | > 20 | HQG |
| Medium Quality | ≥ 50% | < 10% | < 1000 | > 10 | MQG |
| Low Quality | < 50% | ≥ 10% | Not Applicable | Not Applicable | Exclude from taxonomy |
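The tiers in Table 1 can be applied programmatically when screening many assemblies. The sketch below is one possible reading of the table, not an official GTDB rule: it treats contig count and N50 as optional extra criteria, and it assigns any genome that fails the medium-quality thresholds to the excluded tier. The function name `quality_tier` is hypothetical.

```python
def quality_tier(completeness, contamination, n_contigs=None, n50_kb=None):
    """Return 'HQG', 'MQG' or 'exclude' using the thresholds of Table 1.

    completeness and contamination are percentages. n_contigs and n50_kb
    are optional; when omitted, only completeness and contamination are
    checked (an assumption of this sketch).
    """
    def meets(min_comp, max_cont, max_contigs, min_n50_kb):
        if completeness < min_comp or contamination >= max_cont:
            return False
        if n_contigs is not None and n_contigs >= max_contigs:
            return False
        if n50_kb is not None and n50_kb <= min_n50_kb:
            return False
        return True

    if meets(90.0, 5.0, 500, 20.0):
        return "HQG"
    if meets(50.0, 10.0, 1000, 10.0):
        return "MQG"
    return "exclude"
```

For example, `quality_tier(95, 2, n_contigs=700, n50_kb=45)` falls back to the medium-quality tier because the contig count exceeds the high-quality limit.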
Table 2: Key GTDB Metrics for Novel Marinisomatota Classification
| Taxonomic Rank | Typical RED Threshold | Action for Novel Genome |
|---|---|---|
| Species Cluster | ~0.06 | Assign spXXXXXXX label if outside existing cluster. |
| Genus | ~0.30 | Prefix with 'UBA' or 'GCA' if RED > type genus threshold. |
| Family | ~0.50 | Prefix with 'UBA' if novel lineage at family level. |
Mandatory Visualizations
Title: Genome Curation & Submission Workflow
Title: GTDB Phylogenomic Classification Protocol
The Scientist's Toolkit: Research Reagent Solutions
Table 3: Essential Tools and Resources for Marinisomatota Genome Research
| Item / Tool | Function / Purpose | Source / Example |
|---|---|---|
| GTDB-Tk (v2.3.2+) | Standardized toolkit for phylogenomic classification using the GTDB database. Essential for taxonomy. | GitHub: ecogenomics/gtdbtk |
| CheckM2 | Rapid and accurate estimation of genome completeness and contamination in bacterial genomes. | GitHub: chklovski/CheckM2 |
| DRAM (Distilled & Refined Annotation of Metabolism) | Comprehensive functional annotation pipeline, highlighting metabolic potential and virulence. | GitHub: WrightonLabCSU/DRAM |
| antiSMASH | Identifies Biosynthetic Gene Clusters (BGCs) for secondary metabolites; crucial for drug discovery screens. | https://antismash.secondarymetabolites.org |
| Anvi'o | Interactive platform for microbial 'omics, essential for manual bin refinement and visualization. | http://merenlab.org/software/anvio/ |
| MIxS Checklists | Standardized metadata reporting formats to ensure data reproducibility and integration. | Genomic Standards Consortium |
| NCBI Prokaryotic Genome Annotation Pipeline (PGAP) | Recommended for consistent structural and functional annotation prior to INSDC submission. | NCBI GitHub |
Within the broader thesis on the genomic and metabolic diversity of the Marinisomatota phylum (formerly known as Marinimicrobia), accurate taxonomic classification is a foundational challenge. This phylum, prevalent in marine environments, exhibits significant metabolic versatility with implications for biogeochemical cycling and potential biotechnological applications. The recent adoption of the Genome Taxonomy Database (GTDB) taxonomy, based on conserved single-copy marker genes and relative evolutionary divergence, often conflicts with the established but sometimes phenotypically influenced NCBI taxonomy. This discrepancy is particularly pronounced for Marinisomatota, where numerous reclassifications and the delineation of new candidate phyla (e.g., Candidatus Uhrbacteria) have been proposed. This application note provides a protocol for benchmarking these two classification systems for Marinisomatota clades, enabling researchers to critically evaluate genomic data within a consistent framework for downstream ecological, evolutionary, and drug discovery research.
Table 1: High-Level Taxonomic Comparison for Marinisomatota
| Taxonomic Rank | NCBI Taxonomy (as of latest update) | GTDB Release R214 (April 2023) | Notes/Implications |
|---|---|---|---|
| Phylum | Marinimicrobia (PRI) | p__Marinisomatota | GTDB uses the name Marinisomatota. |
| Class Level | Multiple candidate classes (e.g., SAR406 clade) | c__Marinisomatia (and others split into separate phyla) | GTDB splits the group into multiple phylum-level taxa. |
| Order Level | Not consistently defined | o__Marinisomatales (within p__Marinisomatota) | Clearer hierarchical structure in GTDB. |
| Representative Genus | Marinimicrobium, Candidatus Litoricolaceae | Multiple genera under Marinisomatales (e.g., UBA10353, MSA-10) | Genus-level assignments differ radically. |
| Number of Genome Assemblies | ~500+ labeled under Marinimicrobia | ~400+ classified under p__Marinisomatota and related new phyla. | Counts vary due to reclassification. |
Table 2: Benchmarking Metrics for a Representative Clade (e.g., SAR406)
| Metric | NCBI Taxonomy Classification Result | GTDB Taxonomy Classification Result | Benchmarking Advantage |
|---|---|---|---|
| Average Amino Acid Identity (AAI) within group | 65.2% ± 5.1% | 72.8% ± 3.5% | GTDB classification yields more genomically coherent groups. |
| Percentage of Conserved Single-Copy Marker Genes | 89% | 98% | GTDB groups maintain higher essential gene content. |
| Relative Evolutionary Divergence (RED) Score | Not applied | 0.65 (clearly delineated from sister phyla) | Provides quantitative rank normalization. |
| Congruence with 16S rRNA Gene Tree | Moderate (long-branch attraction issues) | High for defined taxa (uses 120 marker proteins) | Improved phylogenetic resolution. |
Objective: To assemble a balanced genome dataset with dual (NCBI & GTDB) labels for benchmarking.
Materials:
- ncbi-genome-download tool.
Procedure:
1. Using ncbi-genome-download, download all bacterial genomes associated with the NCBI taxon ID for Marinimicrobia. Use the --assembly-level complete,chromosome,scaffold filter.
2. Run GTDB-Tk (classify_wf) on the downloaded genome assemblies. This will assign GTDB taxonomy based on the R214 reference tree.
3. Compile a table with the columns Assembly_Accession, NCBI_Phylum, NCBI_Class, GTDB_Phylum, GTDB_Class, GTDB_Red_Value.
Objective: To visualize and quantify the discordance between classification systems.
Materials:
- bac120 marker gene set from GTDB or a custom set of 74 universal single-copy genes.
- ETE3 Python toolkit for tree analysis and visualization.
Procedure:
1. Extract the bac120 marker genes from each curated genome using GTDB-Tk or HMMER with custom profiles.
2. Infer a maximum-likelihood tree using the model LG+F+G and 1000 ultrafast bootstrap replicates.
3. Use ETE3 to map the NCBI and GTDB taxonomy labels onto the tree leaf nodes. Color-code branches based on phylum-level assignment from each system.
Table 3: Essential Research Reagent Solutions & Materials
| Item | Function/Benefit in Benchmarking | Source/Example |
|---|---|---|
| GTDB-Tk Software | Standardized toolkit for assigning GTDB taxonomy to genomes; ensures reproducibility. | https://github.com/ecogenomics/gtdbtk |
| CheckM2 | Rapid, accurate assessment of genome completeness and contamination; critical for quality filtering. | https://github.com/chklovski/CheckM2 |
| bac120 / ar122 Marker Set | Curated set of 120 bacterial single-copy genes; provides standardized data for phylogenomics. | Included with GTDB-Tk. |
| IQ-TREE | Efficient software for maximum likelihood phylogenetic inference with model selection. | http://www.iqtree.org/ |
| ETE3 Toolkit | Python environment for analyzing, manipulating, and visualizing trees and taxonomies. | http://etetoolkit.org/ |
| NCBI Datasets CLI | Programmatic access to download NCBI genome assemblies and associated metadata. | https://www.ncbi.nlm.nih.gov/datasets/ |
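Once the dual-label table with the Assembly_Accession, NCBI_Phylum, and GTDB_Phylum columns exists, the discordance between the two systems can be summarized with a simple cross-tabulation. The sketch below assumes the table has already been loaded as a list of dictionaries (for instance with `csv.DictReader`); the function names and the example rows, including the placeholder phylum name, are hypothetical.

```python
from collections import Counter, defaultdict


def phylum_crosstab(rows):
    """Count genomes for each (NCBI phylum, GTDB phylum) label pair."""
    return Counter((row["NCBI_Phylum"], row["GTDB_Phylum"]) for row in rows)


def split_ncbi_labels(rows):
    """Return NCBI phylum labels whose genomes land in more than one GTDB phylum."""
    targets = defaultdict(set)
    for row in rows:
        targets[row["NCBI_Phylum"]].add(row["GTDB_Phylum"])
    return {ncbi: sorted(gtdb) for ncbi, gtdb in targets.items() if len(gtdb) > 1}


# Hypothetical rows; accessions and the second GTDB phylum name are invented.
example_rows = [
    {"Assembly_Accession": "GCA_000000001.1", "NCBI_Phylum": "Marinimicrobia",
     "GTDB_Phylum": "p__Marinisomatota"},
    {"Assembly_Accession": "GCA_000000002.1", "NCBI_Phylum": "Marinimicrobia",
     "GTDB_Phylum": "p__Marinisomatota"},
    {"Assembly_Accession": "GCA_000000003.1", "NCBI_Phylum": "Marinimicrobia",
     "GTDB_Phylum": "p__Hypothetical_phylum_X"},
]
```

Labels returned by `split_ncbi_labels` are the ones worth inspecting on the tree, since they mark NCBI groups that GTDB distributes across several phyla.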
Workflow for Taxonomic Benchmarking
Taxonomic System Comparison Logic
Within the GTDB taxonomic framework, the phylum Marinisomatota (formerly candidate phylum SAR406) represents a deep-branching lineage distinct from its phenotypically and ecologically similar neighbor, Bacteroidota. This analysis highlights key genomic and metabolic features that delineate their evolutionary divergence, critical for interpreting ocean carbon cycling and guiding bioprospecting efforts.
| Feature | Marinisomatota (Avg.) | Bacteroidota (Avg.) | Implication for Divergence |
|---|---|---|---|
| Genome Size (Mbp) | 2.1 - 2.8 | 4.2 - 6.5 | Streamlined, oligotrophic adaptation in Marinisomatota |
| GC Content (%) | 34 - 38 | 39 - 48 | Distinct nucleotide composition & codon bias |
| 16S rRNA Identity (%) | < 75% | Reference | Phylum-level taxonomic separation (GTDB) |
| Glycoside Hydrolases (GHs) | Low count, specific types | High count, diverse (e.g., GH13, GH16) | Limited polysaccharide diversity in Marinisomatota |
| Respiratory Chain | Predicted HiPIP → bc1 complex | Diverse (e.g., fumarate reduct., flavin-based) | Unique electron transport via high-potential iron-sulfur protein |
| Carbon Fixation | RuBisCO-like protein (RLP) | Absent in most | Potential for CO2 metabolism in dark ocean |
| Nitrogen Metabolism | Nitrate/nitrite transporters | Urease, peptidases | N-source specialization; Marinisomatota targets inorganic N |
| Gene/Protein Family | Presence in Marinisomatota | Presence in Bacteroidota | Use as Phylogenetic Marker |
|---|---|---|---|
| RNA Polymerase Beta Subunit (rpoB) | Unique conserved inserts | Canonical sequence | GTDB backbone tree placement |
| Conserved Signature Proteins (CSPs) | 21 unique CSPs identified | 45 unique CSPs identified | Phylum-specific molecular synapomorphies |
| HiPIP (High-potential iron-sulfur) | Widespread, conserved | Rare, not conserved | Functional marker for electron transport |
| Porphobilinogen deaminase (HemC) | Specific variant (MVG) | Specific variant (LAG) | Amino acid motif diagnostic |
Objective: To reconstruct the phylogenetic position of Marinisomatota genomes relative to Bacteroidota and other adjacent phyla.
Materials:
Procedure:
Tree Inference: Use the infer workflow on the classified markers to generate a rooted tree:
Analysis: Visualize the tree (e.g., in iTOL). Note the monophyletic clustering of Marinisomatota separate from the Bacteroidota clade, supported by bootstrap values.
Objective: To compare and visualize the completeness of core metabolic pathways between the phyla.
Materials:
Procedure:
kofamscan.
Title: Phylogenomic & Metabolic Analysis Workflow
Title: Key Divergence Traits of Marinisomatota
| Item | Function in Comparative Genomics | Example Product/Reference |
|---|---|---|
| GTDB-Tk Reference Data | Provides standardized bacterial/archaeal marker set & taxonomy for consistent phylogenomic placement. | GTDB Release 220 (R220) |
| KEGG KofamScan Database | Profile HMM database for accurate KEGG Orthology (KO) assignment from protein sequences. | KEGG Release (e.g., 2024-01) |
| CheckM2 / BUSCO | Assess genome completeness & contamination of MAGs prior to comparative analysis. | CheckM2 (v1.0.2) |
| FastTree / IQ-TREE2 | Software for rapid & accurate maximum-likelihood phylogenetic inference on marker alignments. | IQ-TREE2 (v2.2.6) |
| DRAM (Distilled & Refined Annotations of Metabolism) | Tool to annotate MAGs & distill metabolic profiles, highlighting pathways like vitamin synthesis & carbon utilization. | DRAM (v1.5) |
| Anti-HiPIP Antibodies | For experimental validation of the predicted unique electron transport chain component via western blot. | Custom polyclonal (e.g., GenScript) |
| Defined Oligotrophic Media | For cultivation attempts, mimicking deep-sea conditions (low organic C, high pressure, NO3- as N source). | AMS1 media recipe modifications |
Introduction & Thesis Context
Within the broader thesis research on the phylum Marinisomatota (GTDB nomenclature; labeled Marinimicrobia or SAR406 in NCBI-based taxonomies), a critical challenge is translating standardized genomic taxonomy into ecological understanding. The Genome Taxonomy Database (GTDB) provides a phylogenetically consistent framework, but ecological inferences drawn from its classifications require validation through independent metagenomic surveys. This protocol outlines a method to cross-reference GTDB-derived lineages against environmental metagenomic datasets to confirm habitat associations, co-occurrence patterns, and putative metabolic roles, thereby grounding taxonomic revisions in ecological reality.
Application Notes & Protocols
Protocol 1: Creating a Curated GTDB Reference Package for Marinisomatota
1. Obtain the GTDB reference data using the gtdb-tk software package or the GTDB website. Extract all genomes classified within the phylum Marinisomatota.
Table 1: Example Curated GTDB Marinisomatota Reference Set (Hypothetical Data)
| GTDB Genome ID | GTDB Taxonomy (Phylum to Genus) | CheckM Completeness (%) | CheckM Contamination (%) | NCBI Isolation Source | Genome Size (Mbp) |
|---|---|---|---|---|---|
| GBGCA123456 | p__Marinisomatota; c__Marinisomatia; o__UBA10353; f__UBA10353; g__UBA10353 | 92.5 | 1.2 | Marine sediment | 4.8 |
| GBGCA789012 | p__Marinisomatota; c__P2B42; o__UBA10234; f__UBA10234; g__JAAOCX01 | 78.9 | 5.5 | Activated sludge | 6.1 |
| RSGCF345678 | p__Marinisomatota; c__P2B42; o__P2B42; f__P2B42; g__P2B42 | 86.7 | 2.8 | Human gut | 3.9 |
Protocol 2: Metagenomic Read Recruitment & Taxonomic Binning
1. Use bowtie2 or BWA to map quality-filtered metagenomic reads against the curated Marinisomatota reference database. Use sensitive parameters (--very-sensitive for bowtie2).
2. Use samtools and custom scripts to calculate depth of coverage and breadth of coverage for each reference genome. Normalize by genome length and total metagenome reads to estimate relative abundance (RPKM or TPM).
3. Run MetaPhlAn 3 or mOTUs to obtain a community profile. Compare the presence/absence of Marinisomatota clades with the recruitment results.
Table 2: Cross-Referencing Results from a Hypothetical Marine Metagenome
| Detected Taxon (GTDB) | Read Recruitment Abundance (RPKM) | MetaPhlAn3 Relative Abundance (%) | Concordance (Y/N/Partial) | Inferred Primary Habitat from Cross-Reference |
|---|---|---|---|---|
| g__UBA10353 (Marinisomatia) | 45.2 | 0.05 | Y | Marine sediment |
| g__JAAOCX01 (P2B42) | 0.8 | <0.001 | Partial (Low detection) | Possibly transient / not native |
| g__P2B42 (P2B42) | 0.1 | 0.0 | N | Non-marine; likely contamination |
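Protocol 2 normalizes recruited reads by genome length and by the total number of metagenome reads. A minimal sketch of that arithmetic follows, assuming the mapped-read counts and per-base depths have already been extracted from the alignments (for example from samtools output); the function names are illustrative.

```python
def rpkm(mapped_reads, genome_length_bp, total_reads):
    """Reads per kilobase of genome per million metagenome reads."""
    if genome_length_bp <= 0 or total_reads <= 0:
        raise ValueError("genome length and total reads must be positive")
    return mapped_reads * 1e9 / (genome_length_bp * total_reads)


def breadth_of_coverage(per_base_depth, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    depths = list(per_base_depth)
    if not depths:
        raise ValueError("no positions supplied")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)
```

Reporting breadth alongside RPKM helps separate genuine presence from reads piling up on a few conserved regions: a genome with appreciable RPKM but very low breadth is more likely a cross-mapping artifact than a resident population.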
Protocol 3: Phylogenomic Validation of Ecological Clustering
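1. Build a phylogenomic tree of the curated reference genomes (e.g., with the GTDB-Tk de_novo_wf).
2. Using trait-conservation methods (e.g., CAST or consenTRAIT), assess if specific phylogenetic clades are significantly associated with specific environments (e.g., marine vs. terrestrial).
The Scientist's Toolkit: Key Research Reagent Solutions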
Table 3: Essential Materials and Tools for Validation Workflow
| Item | Function & Explanation |
|---|---|
| GTDB-Tk (v2.3.0+) | Software toolkit to classify genomes into the GTDB taxonomy and generate phylogenomic trees. Essential for standardizing input genomes. |
| CheckM2 | Assesses genome quality (completeness, contamination) for filtering the reference database. More accurate than CheckM1 for diverse bacteria. |
| Bowtie2 / BWA | Read mapping tools for recruiting metagenomic reads to the reference genome database. Critical for quantifying environmental presence. |
| MetaPhlAn 3 | Profiler for metagenomic taxonomic composition using clade-specific marker genes. Provides an independent community profile for cross-validation. |
| Non-Redundant GTDB Reference Database (RS & RG) | Provides the standardized, de-replicated genome set. The foundation for creating a phylum-specific reference package. |
| SRA Toolkit | Downloads raw metagenomic sequencing data from the NCBI Sequence Read Archive for analysis. |
| iTOL / ggtree | Interactive Tree of Life or R package for visualizing phylogenetic trees with annotated metadata (e.g., habitat). |
Diagrams
Cross-Referencing Validation Workflow
Data Integration for Ecological Inference
The Genome Taxonomy Database (GTDB) provides a standardized, genome-based taxonomy that frequently reclassifies microbial lineages, including the phylum Marinisomatota (labeled Marinimicrobia in NCBI lineages and known as the candidate phylum SAR406 in historical literature). This reclassification presents both a challenge and an opportunity for researchers. Legacy data, published literature, and associated metabolic models or drug target identifications become semantically disconnected from current genomic understanding. These Application Notes provide a framework for reconciling historical data with the GTDB taxonomy to ensure robust, reproducible science in marine microbiology and marine natural product discovery.
Table 1: Comparative Taxonomy of Key Marinisomatota Lineages: GTDB r220 vs. NCBI/SILVA Legacy Systems
| GTDB r220 Taxonomy (Phylum/Class/Order) | Approximate Legacy NCBI/SILVA Equivalent | Notable Phenotypic/Metabolic Traits (from Literature) | Key Publications Affected (Example Count) |
|---|---|---|---|
| p__Marinisomatota (full phylum) | Candidate phylum SAR406, Marinimicrobia | Oligotrophic, deep-sea adapted, putative role in sulfur & carbon cycling. | >500 (metagenomic surveys, oceanography) |
| c__Marinisomatia | Marine Group A, SAR406 clade | Abundant in oxygen minimum zones; genomes indicate autotrophic potential. | ~300 (biogeochemical studies) |
| c__Aureabacteria | Uncultivated descendant of SAR406 | Found in saline lakes, distinct genomic repertoire. | ~50 (extreme environment studies) |
| o__UBA1416 | Sub-clade within SAR406 | Associated with particulate organic matter. | ~75 (carbon flux research) |
Table 2: Protocol for Taxonomic Reconciliation of Existing Data and Models
| Step | Protocol Description | Tools/Resources | Expected Output |
|---|---|---|---|
| 1. Identifier Mapping | Cross-reference legacy genome/OTU IDs (e.g., from NCBI) with GTDB using canonical correspondence files. | GTDB-Tk, gtdb_to_taxdump.tsv file from GTDB, EBI Metagenomics. | Table linking NCBI accession to GTDB accession & taxonomy. |
| 2. Literature Re-annotation | Systematically search and tag existing literature with updated GTDB nomenclature using text-mining. | Custom Python scripts with BioPython & PubMed API, Zotero/Mendeley. | Annotated reference library with dual nomenclature. |
| 3. Metabolic Model Validation | Remap reaction annotations (KEGG, MetaCyc) in legacy metabolic models to genomes in GTDB reference tree. | ModelSEED, KBase, PATRIC, RAST toolkit. | Updated genome-scale metabolic models (GEMs) under GTDB taxonomy. |
| 4. Phylogenetic Contextualization | Place legacy sequence data within the GTDB reference tree via phylogenetic placement. | GTDB-Tk classify_wf, EPA-ng, pplacer. | Newick tree with query sequences placed within GTDB framework. |
Protocol 1: Reclassifying Amplicon Sequence Variant (ASV) Data Using GTDB
Objective: To re-annotate existing 16S rRNA gene amplicon datasets (often classified against SILVA) with GTDB taxonomy.
Materials: ASV table (BIOM or CSV format), representative ASV sequences (FASTA), QIIME2 (2024.2+), GTDB reference package (r220).
Procedure:
1. Use q2-feature-classifier to fit a naive Bayes classifier on the GTDB reference sequences. Command: qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gtdb_seqs.qza --i-reference-taxonomy gtdb_tax.qza --o-classifier gtdb_classifier.qza
2. Classify the representative ASV sequences with the trained classifier. Command: qiime feature-classifier classify-sklearn --i-classifier gtdb_classifier.qza --i-reads rep_seqs.qza --o-classification taxonomy.qza
Protocol 2: Validating a Putative Drug Target Gene in Reclassified Genomes
Objective: To assess the conservation and phylogenetic distribution of a previously identified essential gene (e.g., dnaN) across reclassified Marinisomatota genomes.
Materials: List of GTDB genome accession IDs for Marinisomatota, target gene protein sequence, anvi'o (v7.1), HMMER suite.
Procedure:
1. Use gtdb-tk to generate a genome set or download the genomes from the GTDB FTP site.
2. Build a profile HMM from an alignment of the target gene: hmmbuild target_gene.hmm alignment.fasta
3. Run hmmsearch against the concatenated protein database of all Marinisomatota genomes. Parse results to extract hits above a strict e-value threshold (e.g., 1e-30).
4. Use the anvi'o pangenomics workflow to visualize conservation.
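Parsing the hmmsearch results can be scripted when the search is run with the `--tblout` option, which writes one whitespace-delimited line per target with the full-sequence E-value in the fifth column. The sketch below applies the 1e-30 cutoff mentioned above. It assumes protein identifiers follow a `genome|protein` convention so that hits can be rolled up per genome; that naming scheme and the function names are assumptions of this example, not something HMMER imposes.

```python
def parse_tblout(lines, max_evalue=1e-30):
    """Return {target_name: (evalue, bit_score)} for hits passing the cutoff.

    Expects lines from an hmmsearch --tblout file. Comment lines start
    with '#'. Columns 5 and 6 hold the full-sequence E-value and score.
    """
    hits = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(maxsplit=18)
        target = fields[0]
        evalue = float(fields[4])
        score = float(fields[5])
        if evalue > max_evalue:
            continue
        best = hits.get(target)
        if best is None or evalue < best[0]:
            hits[target] = (evalue, score)
    return hits


def genomes_with_hit(hits, separator="|"):
    """Collect genome IDs, assuming targets are named 'genome|protein'."""
    return {target.split(separator, 1)[0] for target in hits}


def conservation(hits, all_genomes, separator="|"):
    """Fraction of the supplied genomes that carry at least one passing hit."""
    genomes = set(all_genomes)
    if not genomes:
        raise ValueError("no genomes supplied")
    return len(genomes_with_hit(hits, separator) & genomes) / len(genomes)
```

Because the phylum is represented largely by incomplete MAGs, an absent hit does not prove absence of the gene; interpreting conservation alongside each genome's estimated completeness avoids over-reading gaps.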
Title: Data Reconciliation Workflow for GTDB Reclassification
Title: Reclassification Impacts and Required Actions
Table 3: Essential Resources for Marinisomatota Research Post-GTDB Reclassification
| Item Name | Supplier/Resource | Function & Application Notes |
|---|---|---|
| GTDB-Tk v2.3.0+ | (https://github.com/ecogenomics/gtdbtk) | Core software toolkit for assigning GTDB taxonomy to genome bins and placing them in the reference tree. Critical for all reclassification work. |
| GTDB r220 Reference Data | GTDB FTP Site | Genome sequence and taxonomy files. Required for any classification or phylogenetic analysis aligned with GTDB. |
| CheckM2 | (https://github.com/chklovski/CheckM2) | Rapid, accurate assessment of genome completeness and contamination. Essential for quality control before taxonomic classification. |
| anvi'o v7.1+ | (http://anvio.org) | Integrated platform for pangenomics, phylogenomics, and metabolic modeling. Ideal for comparing reclassified genomes. |
| KBase (Microbiome Modeling) | (https://www.kbase.us) | Cloud platform for constructing and analyzing metabolic models from genomes, facilitating functional re-annotation post-reclassification. |
| MEMOTE Suite | (https://memote.io) | For testing and reporting standard compliance of genome-scale metabolic models, ensuring updated models are robust. |
| Custom HMM Profiles | (e.g., TIGRFAM, PFAM) | Curated protein family profiles for targeting specific metabolic pathways (e.g., sulfur oxidation) in functional screens of reclassified genomes. |
Application Notes and Protocols
Thesis Context: Within a broader thesis investigating the phylogenetic novelty and metabolic potential of candidate phyla like Marinisomatota (formerly candidate phylum Marinimicrobia) under the GTDB (Genome Taxonomy Database) classification framework, accurate phylogenetic placement is paramount. Inferring the evolutionary relationships of these often-fragmentary, metagenome-assembled genomes (MAGs) requires robust assessment of genome quality to prevent erroneous taxonomic assignment.
1. Core Quality Metrics for Phylogenetic Trustworthiness
The integrity of a phylogenetic inference is directly contingent on the quality of the input genomes. The following metrics, popularized by tools like CheckM and CheckM2, are non-negotiable for pre-placement screening.
Table 1: Core Metrics for Genome Quality Assessment
| Metric | Definition | Ideal Range for Trustworthy Placement | Interpretation in Marinisomatota Context |
|---|---|---|---|
| Completeness | Percentage of conserved, single-copy marker genes (SCMGs) found in the genome. | >90% (High Quality) >50% (Draft) | High completeness ensures adequate phylogenetic signal. Low completeness in Marinisomatota MAGs may indicate novel lineages with divergent markers. |
| Contamination | Estimated percentage of SCMGs present in multiple copies, suggesting multiple strains/species. | <5% (High Quality) <10% (Acceptable) | High contamination leads to chimeric phylogenetic signals, misplacing the genome. Critical for novel phylum assignment in GTDB. |
| Strain Heterogeneity | Evidence of multiple sequence variants among SCMGs, indicating unresolved strains. | Low (Close to 0%) | High heterogeneity complicates assembly and placement, may require bin refinement or indicate a population. |
| Genome Size & N50 | Total assembly length and contig length at which 50% of the genome is assembled. | Context-dependent | Significantly deviant sizes may flag contamination or incompleteness. Useful for comparing against known relatives. |
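Genome size and N50 from Table 1 can be computed directly from an assembly FASTA. The following sketch uses only the Python standard library: it reads contig lengths from FASTA lines and returns the length of the contig at which the cumulative sum, taken from the longest contig downward, first reaches half of the total assembly length. The function names are illustrative.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which half of the total assembly length is reached."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs supplied")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
```

In practice the lines would come from an open file handle, for example `n50(contig_lengths(open("mag.fna")))`, where `mag.fna` is a placeholder file name.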
Protocol 1.1: Standardized Quality Assessment with CheckM2
Objective: To calculate completeness and contamination for a set of Marinisomatota MAGs prior to phylogenetic analysis. Note that strain heterogeneity is reported by the original CheckM rather than by CheckM2.
1. Place all MAG assemblies (*.fna files) in a single directory.
2. Install CheckM2, for example with pip install checkm2. The program uses pre-trained models but still requires its reference database, which is fetched once with checkm2 database --download.
3. Run checkm2 predict on the input directory.
4. Review quality_report.tsv. Filter MAGs based on Table 1 thresholds (e.g., Completeness >70%, Contamination <5%) for downstream phylogenetic placement.
2. Phylogenetic Placement-Specific Assessments
Beyond general metrics, specific checks are needed to ensure the phylogenetic signal is reliable.
Table 2: Placement-Specific Diagnostic Metrics
| Metric | Protocol/Method | Purpose & Relevance |
|---|---|---|
| Marker Gene Concordance | Phylogeny of individual SCMGs (e.g., via PhyloPhlAn) vs. concatenated tree. | Detects hidden contamination or horizontal gene transfer that concatenated trees may obscure. Incongruent gene trees can invalidate placement. |
| Coverage Uniformity | Analysis of read mapping depth across contigs (e.g., using bowtie2 and samtools). | Large coverage drops may indicate mis-binned contigs (contamination). Uniform coverage supports a coherent genome. |
| Taxonomic Consistency | Compare taxonomic assignments of all predicted genes (e.g., via CAT or GTDB-Tk classify). | A high percentage of genes agreeing with the dominant lineage boosts confidence. Many genes from divergent phyla signal contamination. |
| Reference Tree Robustness | Placement on a stable, well-curated reference tree (e.g., GTDB backbone tree). | Ensures placement is not an artifact of a poor or biased reference dataset. |
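The taxonomic consistency metric in Table 2 reduces to a simple proportion: among genes that received a classification, what share agrees with the dominant lineage. A minimal sketch follows, assuming per-gene phylum labels have already been extracted from the classifier output; that parsing step is not shown, and the function name and example labels are hypothetical.

```python
from collections import Counter


def lineage_consistency(gene_phyla):
    """Return (dominant_phylum, fraction_of_classified_genes_agreeing).

    gene_phyla is an iterable of phylum labels, one per predicted gene.
    Unclassified genes (None or empty string) are ignored.
    """
    assigned = [phylum for phylum in gene_phyla if phylum]
    if not assigned:
        return None, 0.0
    dominant, count = Counter(assigned).most_common(1)[0]
    return dominant, count / len(assigned)


# Hypothetical per-gene labels for one MAG.
labels = ["Marinisomatota"] * 8 + ["Pseudomonadota"] * 2 + [None]
```

Any cutoff applied to this fraction is a judgment call. For deeply branching lineages with few reference genomes, many genes may go unclassified or hit distant phyla, so a modest value does not by itself prove contamination.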
Protocol 2.1: Assessing Taxonomic Consistency with CAT/BAT
Objective: To evaluate gene-level taxonomic agreement within a Marinisomatota MAG.
1. Predict protein-coding genes with prodigal.
2. Classify the predicted proteins with CAT/BAT.
3. Inspect the resulting mag.lineage file. A trustworthy MAG for placement will show a high proportion of proteins classified to a coherent lineage (e.g., candidate phylum Marinisomatota), with limited classification to unrelated phyla.
3. Visualization of the Assessment Workflow
A standardized workflow integrates these metrics to gatekeep genomes for trustworthy phylogenetic placement.
Title: Workflow for Trustworthy Phylogenetic Placement
4. The Scientist's Toolkit: Research Reagent Solutions
Table 3: Essential Tools and Databases for Quality Assessment
| Item | Function & Relevance | Typical Source/Implementation |
|---|---|---|
| CheckM2 | Rapid tool for estimating completeness and contamination using machine learning. Essential first-pass filter. | https://github.com/chklovski/CheckM2 |
| GTDB-Tk | Toolkit for assigning GTDB taxonomy, includes classify_wf which performs internal quality checks and reference tree placement. | https://github.com/ecogenomics/gtdbtk |
| PhyloPhlAn | For constructing highly accurate phylogenies with SCMGs and assessing marker gene concordance. | https://github.com/biobakery/phylophlan |
| BUSCO | Alternative to CheckM using universal orthologous benchmarks. Useful for eukaryotes and specific lineages. | https://busco.ezlab.org/ |
| CAT/BAT | Protein-based taxonomic classifier. Critical for evaluating gene-level consistency within a MAG. | https://github.com/dutilh/CAT |
| Bowtie2 & SAMtools | For mapping reads back to assemblies to compute coverage uniformity and validate binning. | http://bowtie-bio.sourceforge.net/bowtie2, http://www.htslib.org/ |
| GTDB Reference Data (r214+) | Curated genome database and trees. The gold-standard reference for bacterial and archaeal phylogenetic placement. | https://data.gtdb.ecogenomic.org/ |
| CIAlign | Tool to clean and interpret multiple sequence alignments, removing noisy regions that can distort phylogeny. | https://github.com/KatyBrown/CIAlign/ |
The GTDB framework provides a robust, genome-based taxonomy that has significantly refined our understanding of the Marinisomatota phylum, clarifying its evolutionary boundaries and internal diversity. Mastery of the associated tools and an awareness of classification nuances are essential for accurately placing new genomes and interpreting their biological significance. The validated genomic distinctiveness of Marinisomatota, particularly its prevalence in marine systems, underscores its potential as a reservoir for novel natural products and enzymes. Future directions should focus on isolating representative strains, functionally characterizing predicted biosynthetic pathways, and exploring the phylum's role in marine biogeochemical cycles and host-microbe interactions. For biomedical research, integrating GTDB classification with metabolomic and phenotypic data will be crucial for translating genomic novelty into therapeutic leads.