GTDB Taxonomic Classification of Marinisomatota: Genomic Insights, Methods, and Biomedical Applications for Researchers

Allison Howard, Jan 12, 2026

Abstract

This article provides a comprehensive resource on the Marinisomatota phylum within the Genome Taxonomy Database (GTDB) framework. It establishes the foundational genomic and ecological characteristics of this marine bacterial group, details methodologies for accurate classification and analysis, addresses common computational challenges, and validates GTDB's taxonomy against traditional systems like SILVA and NCBI. Targeted at researchers and drug development professionals, it synthesizes current knowledge to guide discovery of novel biosynthetic gene clusters and other biotechnologically relevant traits.

Marinisomatota Unveiled: Genomic Foundations and Ecological Significance in the GTDB Era

The phylum Marinisomatota represents a significant expansion of our understanding of bacterial diversity, originating from uncultured environmental sequences and achieving formal recognition through the Genome Taxonomy Database (GTDB) framework. This phylum encapsulates organisms primarily retrieved from marine and subsurface environments, characterized by genomic signatures of anaerobic metabolism and symbiotic or parasitic lifestyles.

Table 1: Chronological Development of Marinisomatota Taxonomy

Year	Key Event/Tool	Description	Outcome/Reference
Pre-2015	16S rRNA Gene Surveys	Detection in marine sediments & hydrothermal vents	Identified as candidate division SAR406 (Marine Group A), later proposed as "Marinimicrobia".
2019-2020	GTDB r89/r95	Initial placement in GTDB taxonomy using concatenated protein phylogenies	Grouped within the broader FCB superphylum.
2021	GTDB R06-RS202	Refinement via pangenome analysis & average amino acid identity (AAI)	Proposed as a distinct phylum-level lineage.
2023-Present	GTDB r214/r220	Validation with expanded genome dataset & relative evolutionary divergence (RED)	Formalized as phylum Marinisomatota in the GTDB taxonomy.

Table 2: Core Genomic & Ecological Characteristics of Marinisomatota

Characteristic	Typical Range/Feature	Method of Determination
Genome Size	1.8 - 3.2 Mbp	Genome assembly from metagenomes (MAGs)
GC Content	38 - 52%	In silico calculation from MAGs
Predicted Metabolism	Anaerobic fermenter, possible syntrophy	Gene neighborhood & metabolic pathway inference
Habitat	Marine sediment, groundwater, anaerobic digesters	Sample metadata from NCBI SRA
RED Value vs. Adjacent Phyla	>0.15	GTDB Toolkit (GTDB-Tk) analysis
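
The genome size and GC content rows above are simple in silico calculations. A minimal sketch of how they might be computed from a MAG in FASTA format follows; the function name and the toy contigs are illustrative assumptions, and ambiguous bases such as N are excluded from both totals.

```python
from io import StringIO


def genome_size_and_gc(fasta_handle):
    """Return (unambiguous_bases, gc_percent) for all sequences in a FASTA stream."""
    total = 0
    gc = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip headers and blank lines
        seq = line.upper()
        # Count only A, C, G, T so that N and other ambiguity codes do not skew GC%.
        total += sum(seq.count(base) for base in "ACGT")
        gc += seq.count("G") + seq.count("C")
    gc_percent = 100.0 * gc / total if total else 0.0
    return total, gc_percent


# Toy two-contig "MAG" (hypothetical); real input would be an open .fna file.
toy_mag = StringIO(">contig_1\nATGCGC\n>contig_2\nATATNN\n")
size_bp, gc_pct = genome_size_and_gc(toy_mag)
print(f"{size_bp} bp, {gc_pct:.1f}% GC")
```

For a real MAG, replace the `StringIO` object with `open("bin.fna")` and divide the base count by 1e6 to report Mbp.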

Key Experimental Protocols

Protocol 2.1: Genome-Resolved Metagenomics for Marinisomatota MAG Retrieval

Objective: Recover high-quality draft genomes of Marinisomatota from complex environmental samples.

Materials:

Environmental DNA extract (e.g., from marine sediment).
Illumina or PacBio sequencing reagents.
High-performance computing cluster.

Procedure:

Library Preparation & Sequencing: Prepare metagenomic library using a kit (e.g., Illumina Nextera Flex). Sequence using paired-end (2x150 bp) or long-read chemistry.
Quality Control & Assembly:
- Trim adapters and low-quality bases using Trimmomatic v0.39 (ILLUMINACLIP:<adapters.fa>:2:30:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36), where <adapters.fa> is the adapter FASTA file matching the library kit.
- Assemble reads using metaSPAdes v3.15.4 with k-mer sizes 21,33,55,77,99,127.
Binning:
- Map reads back to contigs using Bowtie2 v2.4.5.
- Execute binning with MetaBAT2 v2.15, MaxBin2 v2.2.7, and CONCOCT v1.1.0.
- Refine bins using DAS Tool v1.1.4 to generate a consensus set of MAGs.
Taxonomic Assignment & Curation:
- Run GTDB-Tk v2.1.1 (classify_wf) against the GTDB r214 database.
- Identify bins assigned to p__Marinisomatota.
- Assess genome quality with CheckM2 v1.0.1, retaining only medium/high-quality MAGs (completeness >50%, contamination <10%).
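
The quality filter in the final step can be applied programmatically. The sketch below assumes a CheckM2-style tab-delimited quality report with `Name`, `Completeness`, and `Contamination` columns; the bin names and values are invented for illustration, and the thresholds are the ones stated in the protocol.

```python
import csv
from io import StringIO


def filter_mags(report_handle, min_completeness=50.0, max_contamination=10.0):
    """Return names of MAGs with completeness > min and contamination < max."""
    reader = csv.DictReader(report_handle, delimiter="\t")
    kept = []
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            kept.append(row["Name"])
    return kept


# Hypothetical report; a real run would open the quality report written by CheckM2.
toy_report = StringIO(
    "Name\tCompleteness\tContamination\n"
    "bin.1\t98.6\t0.9\n"
    "bin.2\t47.3\t2.1\n"
    "bin.3\t81.0\t12.4\n"
)
retained = filter_mags(toy_report)
print(retained)
```

Here bin.2 fails on completeness and bin.3 on contamination, so only bin.1 is carried forward to classification.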

Protocol 2.2: Phylogenomic Validation Using the GTDB-Tk Workflow

Objective: Place novel Marinisomatota MAGs within the GTDB reference tree and compute RED values.

Materials:

MAGs in FASTA format.
GTDB-Tk software and reference data (r214).

Procedure:

Identify Marker Genes: Run gtdbtk identify to find 120 bacterial single-copy marker genes within the MAGs.
Align and Concatenate: Run gtdbtk align to create multiple sequence alignments (MSA) for each marker, followed by concatenation.
Tree Inference: Run gtdbtk infer to generate a maximum-likelihood tree with FastTree v2.1.11 under the LG+G model.
Taxonomic Assignment & RED Calculation: The tree is rooted and compared to the reference tree. RED values are computed internally by GTDB-Tk to quantitatively assess lineage separation. A RED value > ~0.15 supports phylum-level distinction.
Output: Review the GTDB-Tk summary file (gtdbtk.bac120.summary.tsv) for taxonomy and RED values, and inspect the relevant nodes in the tree file.
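
The identify, align, and infer steps above can be assembled as three command lines. This is a sketch under stated assumptions: the directory layout, the helper name, and the location of the concatenated MSA written by the align step are hypothetical, and the flags (including `--prot_model LG --gamma` for the LG+G model) should be checked against `gtdbtk <step> --help` for the installed version. The code only builds and prints the commands; it does not run GTDB-Tk.

```python
import shlex


def gtdbtk_stepwise_commands(genome_dir, out_dir, extension="fna", cpus=8, msa_file=None):
    """Build argument lists for gtdbtk identify, align, and infer."""
    identify_dir = f"{out_dir}/identify"
    align_dir = f"{out_dir}/align"
    infer_dir = f"{out_dir}/infer"
    if msa_file is None:
        # Assumed path of the user MSA produced by `gtdbtk align`; verify in your output.
        msa_file = f"{align_dir}/align/gtdbtk.bac120.user_msa.fasta.gz"
    return [
        ["gtdbtk", "identify", "--genome_dir", genome_dir, "--out_dir", identify_dir,
         "--extension", extension, "--cpus", str(cpus)],
        ["gtdbtk", "align", "--identify_dir", identify_dir, "--out_dir", align_dir,
         "--cpus", str(cpus)],
        ["gtdbtk", "infer", "--msa_file", msa_file, "--out_dir", infer_dir,
         "--prot_model", "LG", "--gamma", "--cpus", str(cpus)],
    ]


commands = gtdbtk_stepwise_commands("mags", "gtdbtk_out")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been confirmed.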

Visualization of Workflows and Relationships

Title: Workflow for Defining a New Bacterial Phylum

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents & Tools for Marinisomatota Research

Item/Category	Specific Product/Software Example	Function in Research
DNA Extraction Kit	DNeasy PowerSoil Pro Kit (QIAGEN)	High-yield, inhibitor-free DNA extraction from complex sediments.
Metagenomic Library Prep	Illumina DNA Prep Kit	Preparation of sequencing-ready libraries from environmental DNA.
Sequencing Platform	Illumina NovaSeq 6000; PacBio HiFi	Generates short-read or long-read data for assembly and binning.
Assembly Software	metaSPAdes, Flye	Assembles sequencing reads into contigs/scaffolds.
Binning Software Suite	MetaBAT2, MaxBin2, CONCOCT	Groups contigs into putative genomes (MAGs) based on sequence composition and abundance.
Taxonomic Classifier	GTDB-Tk with r214 database	Provides standardized, phylogeny-based taxonomy and RED metrics.
Genome Quality Tool	CheckM2/CheckM	Estimates genome completeness and contamination using marker genes.
Metabolic Inference	METABOLIC v4.0	Profiles metabolic pathways from MAGs to infer ecological role.
Phylogenetic Analysis	IQ-TREE 2, FastTree 2	Constructs robust phylogenetic trees for phylogenomic validation.
Culture Media (Experimental)	Anaerobic marine broth (modified)	Attempts to cultivate elusive members using simulated environmental conditions.

Application Notes for GTDB Taxonomic Classification of Marinisomatota

Within the GTDB (Genome Taxonomy Database) framework, Marinisomatota (formerly candidate phylum SAR406) is classified as a phylum within the FCB group superphylum. Its genomic hallmarks are critical for accurate taxonomic placement and understanding its ecological role in marine systems.

Key Genomic Hallmarks:

Marker Genes: The GTDB-Tk toolkit utilizes a set of 120 bacterial single-copy marker genes (bac120) for phylogenetic inference. For Marinisomatota, consistent absence or presence of specific markers within this set aids in delineation from related phyla like Bacteroidota.
Phylogenetic Placement: Based on concatenated marker gene alignments, Marinisomatota forms a monophyletic clade sister to the Bacteroidota-Chlorobiota branch.
Metabolic Profile: Genomes carry signatures of proteorhodopsin-based phototrophy, genome streamlining for oligotrophy, and diverse sulfur compound oxidation pathways, reflecting adaptation to nutrient-poor waters ranging from the sunlit surface to the deep, aphotic ocean.

Quantitative Data Summary:

Table 1: Core Genomic Statistics for Representative Marinisomatota MAGs (Metagenome-Assembled Genomes) from GTDB r214.

GTDB Species Representative	Genome Size (Mbp)	GC Content (%)	CheckM Completeness (%)	CheckM Contamination (%)	Number of bac120 Markers Identified
UBA1166 sp002160825	1.98	37.2	98.6	0.9	119
UBA9951 sp014337395	2.15	36.8	99.2	1.2	120
UBA1773 sp004294285	2.32	38.5	97.8	0.5	118

Table 2: Diagnostic Metabolic Pathway Presence/Absence in Marinisomatota vs. Related Phyla.

Metabolic Pathway (KEGG Module)	Marinisomatota (n=50 MAGs)	Bacteroidota (n=50)	Chlorobiota (n=50)
Proteorhodopsin (M00597)	100%	12%	0%
Dissimilatory sulfite reductase (DsrAB, M00596)	88%	24%	100%
Complete TCA cycle (M00009)	10%	96%	100%
Anoxygenic photosynthesis (M00116)	0%	0%	100%

Protocols

Protocol 2.1: Phylogenomic Placement Using GTDB-Tk

Purpose: To classify a novel bacterial genome or MAG within the GTDB taxonomy, with specific focus on placement relative to Marinisomatota.

Materials: High-quality bacterial genome assembly; computing cluster or server with GTDB-Tk (v2.1.1+) installed.

Procedure:

Data Preparation: Ensure your genome is in FASTA format. Run checkm to assess basic quality (completeness >50%, contamination <5% recommended).
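Run GTDB-Tk: Execute the classify_wf workflow (gtdbtk classify_wf) on the directory containing your genome, writing the results to a dedicated output directory.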
Interpret Output: Key files:
- gtdbtk.bac120.summary.tsv: Taxonomic classification. Examine classification column for placement (e.g., d__Bacteria;p__Marinisomatota;...).
- gtdbtk.bac120.markers_summary.tsv: Count of identified marker genes.
- gtdbtk.bac120.user_msa.fasta: Concatenated marker gene alignment for your genome.
Custom Phylogeny: To generate a tree with reference Marinisomatota genomes, use the infer workflow with the user MSA and the GTDB reference package.
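
To illustrate the output interpretation step, the sketch below reads a GTDB-Tk summary table and keeps the genomes whose classification string places them in p__Marinisomatota. It assumes the `user_genome` and `classification` columns of gtdbtk.bac120.summary.tsv; the three rows are invented for the example.

```python
import csv
from io import StringIO


def phylum_of(classification):
    """Extract the phylum name from a GTDB lineage string, or '' if absent."""
    for rank in classification.split(";"):
        if rank.startswith("p__"):
            return rank[3:]
    return ""


def select_phylum(summary_handle, phylum="Marinisomatota"):
    """Return user genome IDs assigned to the requested phylum."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return [row["user_genome"] for row in reader
            if phylum_of(row["classification"]) == phylum]


# Hypothetical summary rows; a real run would open gtdbtk.bac120.summary.tsv.
toy_summary = StringIO(
    "user_genome\tclassification\n"
    "bin.1\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__Marinisomatales;f__;g__;s__\n"
    "bin.2\td__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__;s__\n"
    "bin.3\tUnclassified Bacteria\n"
)
marinisomatota_bins = select_phylum(toy_summary)
print(marinisomatota_bins)
```

Matching on the parsed `p__` rank rather than on a substring avoids false hits from lower ranks that share a name stem with the phylum.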

Protocol 2.2: Identification of Diagnostic Metabolic Pathways via KofamScan

Purpose: To profile the metabolic potential of a Marinisomatota genome, focusing on hallmark pathways.

Materials: Annotated genome (protein sequences in FASTA format), KofamScan software, KEGG databases.

Procedure:

Annotation: Annotate genome using Prokka or DRAM to generate protein file (*.faa).
KofamScan Setup: Download KOfam HMM profiles and Ko list from KEGG.
Scan and Map: Run KofamScan using the exec_annotation script with the --cpu and -o options. Use profile HMMs to map KOs.
Parse Output: The output file lists KOs assigned to genes. Translate KOs to KEGG Modules (e.g., Proteorhodopsin: KEGG KO K15789, K15790 -> Module M00597).
Visualization: Create a presence/absence matrix of key modules (as in Table 2) for comparative analysis.
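
The parsing and matrix-building steps can be sketched as follows. The example assumes KofamScan output in its "mapper" format (one gene per line, followed by a tab and a KO when a hit exists) and treats a module as present only when every listed KO is found, which is a simplification of KEGG module logic. The KO identifiers and module names are placeholders, not real KEGG entries.

```python
from io import StringIO


def read_kofam_mapper(handle):
    """Collect the set of KOs from mapper-format output (gene<TAB>KO per line)."""
    kos = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            kos.add(fields[1])  # genes without a hit have no second field
    return kos


def module_matrix(genome_kos, module_definitions):
    """Return {genome: {module: 1 or 0}}; 1 means all required KOs are present."""
    return {
        genome: {module: int(required <= kos)
                 for module, required in module_definitions.items()}
        for genome, kos in genome_kos.items()
    }


# Placeholder KO identifiers and module names, for illustration only.
modules = {"module_A": {"K_a1", "K_a2"}, "module_B": {"K_b1"}}
mag_1 = read_kofam_mapper(StringIO("gene_1\tK_a1\ngene_2\ngene_3\tK_a2\n"))
mag_2 = read_kofam_mapper(StringIO("gene_1\tK_b1\ngene_2\tK_a1\n"))
matrix = module_matrix({"MAG_1": mag_1, "MAG_2": mag_2}, modules)
print(matrix)
```

Averaging the 0/1 values across the MAGs of a phylum yields percentages of the kind reported in Table 2.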

Visualization

Diagram Title: Workflow for Phylogenomic Placement with GTDB

Diagram Title: Core Energy Pathways in Marinisomatota

The Scientist's Toolkit

Table 3: Essential Research Reagents and Tools for Marinisomatota Genomics

Item	Function & Relevance
GTDB-Tk (v2.1.1+)	Standardized toolkit for assigning genomes to the GTDB taxonomy using a set of 120 bacterial marker genes; essential for consistent phylogenetic placement.
CheckM2 / CheckM	Assesses genome quality (completeness, contamination) of MAGs prior to phylogenetic analysis; critical for data reliability.
KofamScan / eggNOG-mapper	Functional annotation tools to map protein sequences to KEGG Orthologs (KOs) and reconstruct metabolic pathways like proteorhodopsin.
FastTree / RAxML	Software for inferring phylogenetic trees from concatenated marker gene alignments generated by GTDB-Tk.
MetaBAT 2 / MaxBin 2	Binning algorithms for reconstructing MAGs from marine metagenomic data, the primary source of Marinisomatota genomes.
DRAM (Distilled and Refined Annotation of Metabolism)	Specialized tool for annotating metabolic pathways and auxiliary functions in microbial genomes; useful for detailed pathway analysis.
Pfam & TIGRFAM HMMs	Curated protein family databases used to identify specific marker genes (e.g., proteorhodopsin, DsrAB) in novel genomes.

Application Notes: Niche Prevalence and Genomic Adaptations in Marinisomatota

Context within GTDB Taxonomic Classification Research: The phylum Marinisomatota (formerly candidate phylum SAR406; designated Marinisomatota in GTDB r214) represents a lineage of Bacteria predominantly identified from metagenomic surveys. Research within a broader thesis on GTDB taxonomy aims to elucidate the ecological drivers of its distribution and its metabolic potential, particularly for biodiscovery. This phylum exemplifies the critical link between habitat, ecological niche, and genomic content.

Key Quantitative Data Summary:

Table 1: Prevalence of Marinisomatota in Public Metagenomic Databases

Environment / Sample Type	Approximate Relative Abundance (%)	Primary Dataset/Source (Example)	Key Identifying Genomic Marker
Marine Pelagic (Oceanic)	0.01 - 0.5	TARA Oceans, Malaspina Expedition	16S rRNA gene, RpoB
Marine Sediments	0.1 - 2.0	Ocean Drilling Program, IODP	16S rRNA gene, RpoB
Marine Sponge Microbiome	Up to 15.0	Sponge Microbiome Project, local surveys	16S rRNA gene, Metagenome-assembled genomes (MAGs)
Coral Microbiome (Healthy)	0.5 - 3.0	Various reef studies	16S rRNA gene, MAGs
Human & Animal Gut	< 0.01	Human Microbiome Project, MGnify	Extremely rare, sporadic MAGs

Table 2: Genomic Features Correlated with Habitat in Marinisomatota MAGs

Genomic Feature / Pathway	Prevalence in Marine Pelagic MAGs (%)	Prevalence in Host-Associated (Sponge) MAGs (%)	Putative Functional Role & Niche Adaptation
Proteorhodopsin & Light-Sensing	85-95	10-20	Phototrophy, energy generation in oligotrophic water
CRISPR-Cas Systems	30-40	60-80	Defense against mobile genetic elements/viruses
Biosynthetic Gene Clusters (BGCs)	2-4 per MAG	5-8 per MAG	Secondary metabolite production (e.g., NRPS, PKS)
Adhesion Proteins (e.g., MSCRAMM-like)	Low	High	Host tissue attachment and colonization
C1 Metabolism (e.g., folD, fhs)	High	Variable	Adaptation to C1 compounds in marine environment
TonB-Dependent Transporters	Very High	High	Nutrient scavenging (e.g., siderophores, sugars)

Interpretation: The data indicate a primary marine origin for Marinisomatota, with a significant shift in abundance and genomic capacity upon association with marine invertebrate hosts, particularly sponges. The increased prevalence of defense mechanisms and biosynthetic potential in host-associated lineages suggests adaptation to a competitive, resource-rich, and defended microenvironment, highlighting their potential for novel natural product discovery.

Experimental Protocols

Protocol 1: Targeted Detection and Quantification of Marinisomatota in Metagenomes

Objective: To assess the relative abundance and diversity of Marinisomatota in marine and host-associated metagenomic samples.

Materials: Metagenomic DNA, PCR reagents, GTDB-Tk (v2.3.0) with GTDB reference data, QIIME 2 (2024.5), specific primers (listed in the workflow below).

Workflow:

DNA Extraction & QC: Use a standardized kit for environmental/microbiome samples (e.g., DNeasy PowerSoil Pro). Verify integrity via gel electrophoresis and quantify via fluorometry.
16S rRNA Gene Amplicon Sequencing (Screening):
- Perform PCR amplification of the V4-V5 region using primer pair 515F-Y (5'-GTGYCAGCMGCCGCGGTAA-3') and 926R (5'-CCGYCAATTYMTTTRAGTTT-3').
- Clean amplicons and prepare libraries for Illumina MiSeq 2x250 bp sequencing.
Bioinformatic Processing:
- Denoise sequences with DADA2 in QIIME2 to generate Amplicon Sequence Variants (ASVs).
- Taxonomic Assignment: Use a custom-trained classifier. First, extract Marinisomatota reference sequences from GTDB. Train a Naïve Bayes classifier on the Silva 138 SSU NR 99 database supplemented with the Marinisomatota sequences using qiime feature-classifier fit-classifier-naive-bayes.
- Assign taxonomy to ASVs using this classifier.
- Filter feature table to retain ASVs classified as Marinisomatota. Calculate relative abundance.
Shotgun Metagenomic Analysis (In-depth):
- Sequence libraries (Illumina NovaSeq, 2x150 bp).
- Perform quality trimming with Fastp.
- Assemble co-assemblies per habitat type using MEGAHIT or metaSPAdes.
- Bin contigs into MAGs using MetaBAT2.
- Taxonomic Classification: Classify MAGs using GTDB-Tk (v2.3.0+) with the classify_wf command against the GTDB r214 database.
- Annotate MAGs with Prokka and perform functional analysis via KEGG and antiSMASH.
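
The relative abundance calculation in the amplicon branch of this workflow reduces to a ratio of classified counts to total counts per sample. A minimal sketch, assuming the feature table and taxonomy have already been exported from QIIME 2 into plain Python dictionaries (the sample names, ASV identifiers, and counts are invented):

```python
def relative_abundance(feature_table, taxonomy, target="p__Marinisomatota"):
    """Return {sample: percent of reads from ASVs whose lineage contains target}.

    feature_table: {sample: {asv_id: read_count}}
    taxonomy:      {asv_id: lineage string}
    """
    result = {}
    for sample, counts in feature_table.items():
        total = sum(counts.values())
        hits = sum(n for asv, n in counts.items() if target in taxonomy.get(asv, ""))
        result[sample] = 100.0 * hits / total if total else 0.0
    return result


# Hypothetical taxonomy and counts for two samples.
taxonomy = {
    "asv_1": "d__Bacteria;p__Marinisomatota;c__Marinisomatia",
    "asv_2": "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
}
feature_table = {
    "sponge_A": {"asv_1": 150, "asv_2": 850},
    "seawater_A": {"asv_1": 5, "asv_2": 995},
}
abundance = relative_abundance(feature_table, taxonomy)
print(abundance)
```

Unclassified ASVs stay in the denominator, so the percentages are relative to all reads in the sample rather than to classified reads only.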

Protocol 2: Functional Screening for Bioactive Compound Production

Objective: To experimentally test the biosynthetic potential predicted in host-associated Marinisomatota MAGs.

Materials: Sponge tissue sample, Marine Broth 2216, selective antibiotics, isolation plates, HPLC-MS.

Workflow:

Cultivation & Isolation:
- Homogenize fresh marine sponge tissue in sterile seawater.
- Perform serial dilutions and spread on marine agar 2216 supplemented with cycloheximide (100 µg/mL) to inhibit fungi.
- Add a mix of antibiotics (nalidixic acid 10 µg/mL, vancomycin 5 µg/mL) to inhibit fast-growing Gram-negative/positive bacteria, favoring slow-growing, potentially novel phyla.
- Incubate at 20°C for 4-8 weeks. Monitor for slow-growing, morphologically unique colonies.
Identification of Isolates:
- Extract genomic DNA from candidate colonies.
- Amplify and Sanger-sequence the full-length 16S rRNA gene using universal primers 27F and 1492R.
- Compare sequences against GTDB 16S rRNA reference sequences using BLAST to confirm Marinisomatota affiliation.
Metabolite Extraction and Analysis:
- Inoculate a positive isolate in liquid marine broth. Incubate with shaking until late stationary phase.
- Extract metabolites from the cell pellet and supernatant separately using ethyl acetate.
- Dissolve dried extracts in methanol and analyze by HPLC coupled with High-Resolution Mass Spectrometry (HR-MS).
- Compare mass spectra and retention times to databases (e.g., GNPS) to identify known compounds or novel molecular families.

Visualization

Title: Metagenomic Analysis Workflow for Marinisomatota

Title: Host Niche Drivers of Marinisomatota Genomics

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Marinisomatota Research

Item / Reagent	Function / Application in Protocol	Example Product / Specification
Marine Agar/Broth 2216	Standardized medium for cultivation and isolation of marine heterotrophs.	Difco Marine Broth 2216, BD.
GTDB Reference Database (r214+)	Essential for accurate taxonomic classification of MAGs and sequences from understudied phyla.	Genome Taxonomy Database Toolkit (GTDB-Tk) v2.3.0+.
Anti-Fungal/Antibiotic Supplement Mix	Selective isolation of slow-growing bacteria by inhibiting fungi and fast-growing competitors.	Cycloheximide (100 µg/mL), Nalidixic Acid (10 µg/mL), Vancomycin (5 µg/mL).
High-Fidelity Polymerase	High-fidelity PCR amplification of bacterial DNA from mixed environmental templates.	KAPA HiFi HotStart ReadyMix (Roche) or Q5 High-Fidelity DNA Polymerase (NEB).
Metagenomic DNA Extraction Kit (Soil/Microbiome)	Efficient lysis of diverse, tough-to-lyse bacterial cells from complex environmental samples.	DNeasy PowerSoil Pro Kit (Qiagen) or MagAttract PowerSoil DNA KF Kit (Qiagen).
antiSMASH Software Suite	Prediction, annotation, and analysis of Biosynthetic Gene Clusters (BGCs) in bacterial genomes.	antiSMASH 7.0+ web server or standalone version.
HPLC-MS Grade Solvents	High-purity solvents for metabolite extraction and analytical chemistry to avoid background interference.	Ethyl Acetate, Methanol (LC-MS Grade, e.g., Fisher Chemical).

Key Genera and Species within GTDB's Marinisomatota Taxonomy

Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly a candidate phylum) represents a significant, yet understudied, lineage of primarily marine bacteria. This taxonomic group is of considerable interest for its phylogenetic diversity, its ecological roles in marine biogeochemical cycles, and its potential as a source of novel bioactive compounds. This document, framed within a broader thesis on GTDB-based microbial systematics, provides detailed application notes and protocols for the cultivation, genomic analysis, and functional characterization of key genera and species within the Marinisomatota. The content is designed to support research aimed at validating and expanding the GTDB taxonomy while exploring biotechnological applications.

Based on the latest GTDB release (R214), the phylum Marinisomatota is delineated into several classes and orders. The following table summarizes the quantitatively dominant and phylogenetically distinct genera according to genome availability and 16S rRNA gene surveys.

Table 1: Key Genera and Species within GTDB Marinisomatota (as of GTDB R214)

GTDB Class	GTDB Order	Key Genus (GTDB Label)	Approx. # of MAGs/Genomes	Relative Abundance in Marine Surveys*	Notable Species/Clade
Marinisomatia	Marinisomatales	UBA10353 (Marinisomatales)	~45	High	Representative species: Ga0074134
Marinisomatia	UBA9962	UBA9962	~22	Moderate	Often found in coastal sediments
Bathybacteria	BMS94Abin14	Bin-S124	~15	Low (Deep-sea)	Associated with hydrothermal vents
Marinisomatia	UBA1773	UBA1773	~12	Moderate	Pelagic ocean clade
Marinisomatia	UBA10354	UBA10354	~8	Low	-

*Abundance based on aggregated data from the TARA Oceans and BioGEOTRACES metagenomic surveys.

Application Notes & Protocols

Protocol: Enrichment and Cultivation of Marinisomatota from Marine Samples

Objective: To selectively enrich for Marinisomatota cells from seawater or sediment samples.

Background: Most Marinisomatota remain uncultured; however, specific enrichment strategies based on predicted metabolism (from MAGs) can be employed.

Materials & Reagents:

Marinisomatota Enrichment Medium (MEM):
- Artificial Seawater Base: 30 g/L NaCl, 0.7 g/L KCl, 5.3 g/L MgSO₄·7H₂O, 0.1 g/L CaCl₂·2H₂O, 10 mM HEPES buffer (pH 7.5).
- Carbon/Nitrogen Source: 0.5 g/L Sodium pyruvate, 0.5 g/L Yeast extract, 0.2 g/L NH₄Cl.
- Trace Elements & Vitamins: SL-10 solution (1 mL/L), Vitamin B12 (10 µg/L).
- Reducing Agent: 0.5 g/L L-Cysteine-HCl (add after autoclaving, under N₂/CO₂ atmosphere).
Sample: 1 L of seawater (pre-filtered through a pore size coarser than the 0.22 µm collection filter, to remove eukaryotes while retaining bacterial cells) or 10 g of marine sediment.

Procedure:

Sample Processing: For seawater, concentrate cells on a 0.22µm polycarbonate filter. Resuspend biomass in 10 mL sterile ASW. For sediment, homogenize in 20 mL MEM without carbon sources.
Inoculation: Transfer 5 mL of cell suspension to 45 mL of pre-reduced MEM in a 120 mL serum bottle. Flush headspace with N₂/CO₂ (80:20) and seal with a butyl rubber stopper.
Incubation: Incubate in the dark at 12-15°C (to simulate mesopelagic conditions) with gentle shaking (80 rpm) for 4-8 weeks.
Monitoring: Monitor growth by flow cytometry (SYBR Green I staining) and periodic 16S rRNA gene amplicon sequencing (using primers 515F/806R) to track enrichment of Marinisomatota.
Subculturing: Transfer 10% (v/v) of a positive enrichment to fresh MEM every 4 weeks.

Protocol: Metagenome-Assembled Genome (MAG) Binning and Taxonomic Classification

Objective: To reconstruct and taxonomically classify Marinisomatota MAGs from metagenomic data.

Workflow Diagram Title: MAG Binning & GTDB Classification Workflow

Detailed Methodology:

Assembly: Assemble quality-filtered reads using metaSPAdes (v3.15.0) with -k 21,33,55,77.
Binning: Map reads back to contigs using Bowtie2. Calculate coverage profiles and generate initial bins with MetaBAT2 (v2.15) and MaxBin2 (v2.2.7).
Refinement: Use MetaWRAP (v1.3.2) bin_refinement module to consolidate bins from multiple tools, retaining only bins with >50% completeness and <10% contamination (CheckM2 criteria).
Taxonomic Classification: Run gtdbtk (v2.3.0) with the classify_wf command on refined MAGs using the R214 database. The output (gtdbtk.bac120.summary.tsv) will assign taxonomy, including potential Marinisomatota placement.
Phylogenomic Tree: For confirmed Marinisomatota MAGs, pass the multiple sequence alignment produced by gtdbtk align to gtdbtk infer, which constructs a tree with FastTree for phylogenetic placement.

Protocol: Screening for Biosynthetic Gene Clusters (BGCs) in Marinisomatota Genomes

Objective: To identify potential secondary metabolite BGCs from Marinisomatota MAGs or isolates.

Materials & Reagents:

Computational Tools: antiSMASH (v7.0), BiG-SCAPE.
Database: MIBiG database (v3.0).
Genomic Input: High-quality Marinisomatota genome in FASTA format.

Procedure:

BGC Prediction: Run antiSMASH on the genome with Prodigal gene finding and the cluster-comparison analyses enabled: antismash --genefinding-tool prodigal --taxon bacteria --cb-general --cb-knownclusters --asf --pfam2go <genome.fasta>.
Analysis: Extract all predicted BGC regions (e.g., non-ribosomal peptide synthetase (NRPS), polyketide synthase (PKS), bacteriocin). Tabulate their types, locations, and core biosynthetic genes.
Comparative Genomics: Use BiG-SCAPE to correlate BGCs from Marinisomatota with known BGCs in the MIBiG database, generating sequence similarity networks to identify novel clusters.
Heterologous Expression Cloning: For prioritized novel BGCs, design PCR primers to amplify the entire cluster (using long-range PCR or Gibson assembly of cosmids) for cloning into an expression host like Pseudomonas putida.
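
The tabulation in the analysis step can be sketched as a simple tally. The example assumes that predicted regions have already been extracted from the antiSMASH results into (MAG, product class) pairs; the extraction itself is not shown, and the MAG names and product classes are illustrative.

```python
from collections import Counter


def summarize_bgcs(regions, all_mags):
    """Tally BGC regions per MAG and per product class.

    regions:  iterable of (mag_id, product_class) tuples
    all_mags: every MAG screened, so MAGs without any BGC are counted as zero
    """
    regions = list(regions)
    per_mag = Counter({mag: 0 for mag in all_mags})
    per_mag.update(mag for mag, _ in regions)
    per_class = Counter(product for _, product in regions)
    mean_per_mag = sum(per_mag.values()) / len(per_mag) if per_mag else 0.0
    return per_mag, per_class, mean_per_mag


# Hypothetical predictions for three screened MAGs.
regions = [
    ("sponge_MAG_1", "NRPS"),
    ("sponge_MAG_1", "T1PKS"),
    ("pelagic_MAG_1", "terpene"),
]
mags = ["sponge_MAG_1", "pelagic_MAG_1", "pelagic_MAG_2"]
per_mag, per_class, mean_per_mag = summarize_bgcs(regions, mags)
print(dict(per_mag), dict(per_class), mean_per_mag)
```

Including MAGs with zero predicted clusters keeps the per-MAG mean comparable across habitats.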

The Scientist's Toolkit: Key Research Reagents & Materials

Table 2: Essential Research Reagents for Marinisomatota Studies

Item/Category	Specific Product/Example	Function/Application
Enrichment Medium	Custom Marinisomatota Enrichment Medium (MEM)	Selective cultivation and maintenance of fastidious marine bacteria.
DNA Extraction Kit	DNeasy PowerSoil Pro Kit (Qiagen)	High-yield, inhibitor-free genomic DNA extraction from complex marine samples.
Metagenomic Library Prep	Nextera XT DNA Library Prep Kit (Illumina)	Preparation of sequencing-ready libraries from low-input environmental DNA.
Taxonomic Classifier	GTDB-Tk v2.3.0 Software & R214 Database	Precise genome-based taxonomic assignment according to the GTDB system.
BGC Analysis Software	antiSMASH v7.0 Web Server/CLI	Comprehensive prediction and annotation of biosynthetic gene clusters.
Phylogenetic Marker	Bacterial 16S rRNA Gene Primers (515F/806R)	Tracking Marinisomatota enrichment and community profiling via amplicon sequencing.
Expression Host	Pseudomonas putida KT2440	Robust, Gram-negative host for heterologous expression of marine bacterial BGCs.
Flow Cytometry Stain	SYBR Green I Nucleic Acid Gel Stain	Quantification of bacterial cell abundance in enrichment cultures.

The classification of bacterial and archaeal life has undergone a paradigm shift, moving from a single-marker (16S rRNA) system to a genome-centric taxonomy that forms the foundation of the Genome Taxonomy Database (GTDB). This evolution is critical for research into lineages like Marinisomatota (known in legacy systems as SAR406 or Marinimicrobia, within the FCB group), whose physiological and ecological roles are inferred primarily from genomic data. Accurate taxonomy is essential for drug discovery, as it clarifies evolutionary relationships and helps prioritize lineages likely to carry novel biosynthetic gene clusters.

Application Notes: Comparative Analysis of Classification Eras

The following table summarizes the key differences between the two classification paradigms.

Table 1: Comparison of 16S rRNA and Genome-Centric Taxonomy

Feature	16S rRNA Gene-Based Taxonomy (c. 1977-2010s)	Genome-Centric Taxonomy (GTDB Era, 2018-Present)
Primary Data Source	Sanger sequencing of ~1,500 bp 16S rRNA gene.	Whole genome sequences (WGS) from isolates and metagenome-assembled genomes (MAGs).
Resolution	Species to genus level; poor for closely related species and strains.	High resolution to species and strain level; robust at all ranks.
Quantitative Basis	Sequence similarity thresholds (e.g., 97% for species, 95% for genus).	Average Amino Acid Identity (AAI), Average Nucleotide Identity (ANI), and relative evolutionary divergence (RED).
Type Material Requirement	Dependent on cultured type strains.	Employs type species genomes and designated type genomes for uncultivated taxa.
Handling of Uncultivated Diversity	Limited; requires PCR amplification from environment.	Integral; MAGs from metagenomics allow classification of the "microbial dark matter."
Impact on Marinisomatota Research	Preliminary placement based on fragmentary 16S data led to uncertain phylogeny.	Precise placement as a distinct phylum based on conserved single-copy marker genes; reveals metabolic potential for drug target discovery.

Table 2: Key Quantitative Metrics in GTDB Genome-Centric Classification

Metric	Calculation Method	Typical Threshold or Scale	Function in Classification
Average Nucleotide Identity (ANI)	BLAST-based or MUMmer-based alignment of shared genomic regions.	≥ ~95%	Primary species-level standard, replacing 16S similarity.
Average Amino Acid Identity (AAI)	Comparison of amino acid sequences of shared protein-coding genes.	~60% for same phylum	Useful for higher-rank (family, phylum) assignments and phylogeny.
Relative Evolutionary Divergence (RED)	Measure of relative branch length in a rooted phylogenetic tree of marker genes.	Normalized scale (0.0=root, 1.0=leaves)	Objective rank normalization across all lineages.
Percentage of Conserved Proteins (POCP)	Percentage of conserved protein sequences between two genomes.	≥50% for same genus	Supplementary metric for genus classification.
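
Two of the metrics in the table reduce to simple arithmetic once the pairwise comparison has been run. The sketch below uses the commonly cited POCP definition, POCP = 100 × (C1 + C2) / (T1 + T2), where C1 and C2 are the conserved proteins found in each genome and T1 and T2 are the total protein counts, together with the thresholds listed in the table. The function names and input numbers are illustrative assumptions.

```python
def pocp(conserved_1, conserved_2, total_1, total_2):
    """Percentage of conserved proteins between two genomes."""
    return 100.0 * (conserved_1 + conserved_2) / (total_1 + total_2)


def same_species_by_ani(ani_percent, threshold=95.0):
    """Apply the approximate 95% ANI species boundary from the table."""
    return ani_percent >= threshold


def same_genus_by_pocp(pocp_percent, threshold=50.0):
    """Apply the 50% POCP genus guideline from the table."""
    return pocp_percent >= threshold


# Hypothetical pair of genomes with 2000 and 2100 predicted proteins.
value = pocp(conserved_1=1100, conserved_2=1050, total_1=2000, total_2=2100)
print(round(value, 2), same_genus_by_pocp(value), same_species_by_ani(91.3))
```

In this toy case the pair would fall within one genus by POCP while remaining separate species by ANI.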

Experimental Protocols

Protocol 3.1: GTDB-Tk Workflow for Genome Classification (Current Best Practice)

Objective: To classify a bacterial genome (isolate or MAG) within the GTDB taxonomy.

Materials:

High-quality bacterial genome assembly in FASTA format.
Computational environment (Linux/macOS) with at least 16 GB RAM and 8 cores recommended.
Conda package manager.
GTDB-Tk software package (v2.3.0 or later).
GTDB reference data (R214 or later).

Procedure:

Software Installation: Install GTDB-Tk with conda and configure it to use the downloaded GTDB reference data.
Prepare Input Data: Place all genome assemblies (.fna files) in a single directory (genome_dir). Create a batch file listing paths if necessary.
Run Classification Workflow: Run gtdbtk classify_wf on genome_dir, writing the results to gtdbtk_out.
- This workflow: a) identifies 120 bacterial marker genes with HMMER, b) aligns markers, c) creates a concatenated alignment, d) places genomes into a reference tree via pplacer, and e) classifies them based on RED-based rank thresholds.
Output Interpretation:
- Key files: gtdbtk_out/gtdbtk.bac120.summary.tsv
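- This tab-delimited file includes the user genome ID (user_genome), a semicolon-delimited classification string covering domain to species (classification), the RED value (red_value), and columns describing the classification method, notes, and warnings.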
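
A minimal sketch of how the classification command for this protocol might be assembled, using the genome_dir and gtdbtk_out names from the steps above. The helper name is hypothetical, and the flags reflect recent GTDB-Tk 2.x releases as commonly documented (`--skip_ani_screen` bypasses the ANI pre-screen); confirm them with `gtdbtk classify_wf --help`. The code builds and prints the command without executing it.

```python
import shlex


def classify_wf_command(genome_dir, out_dir, extension="fna", cpus=8, skip_ani_screen=True):
    """Build the argument list for a gtdbtk classify_wf run."""
    cmd = [
        "gtdbtk", "classify_wf",
        "--genome_dir", genome_dir,
        "--out_dir", out_dir,
        "--extension", extension,
        "--cpus", str(cpus),
    ]
    if skip_ani_screen:
        # Skips the ANI pre-screen; omit this only if a pre-screen database is configured.
        cmd.append("--skip_ani_screen")
    return cmd


classify_cmd = classify_wf_command("genome_dir", "gtdbtk_out")
print(shlex.join(classify_cmd))
```

The list can be passed to `subprocess.run(classify_cmd, check=True)` inside the conda environment where GTDB-Tk is installed.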

Protocol 3.2: 16S rRNA Gene Extraction and Sanger Sequencing (Historical Context)

Objective: To obtain a 16S rRNA sequence for preliminary phylogenetic analysis.

Materials:

Bacterial genomic DNA.
Universal primer pair (e.g., 27F: 5'-AGAGTTTGATCMTGGCTCAG-3' and 1492R: 5'-GGTTACCTTGTTACGACTT-3').
PCR reagents (Taq polymerase, dNTPs, buffer).
Agarose gel electrophoresis equipment.
Sanger sequencing reagents.

Procedure:

PCR Amplification: Set up a 50 µL reaction with 1X PCR buffer, 200 µM dNTPs, 0.2 µM each primer, 1.25 U Taq polymerase, and ~50 ng template DNA. Use cycling: 95°C/5 min; 35 cycles of [95°C/30s, 55°C/30s, 72°C/90s]; 72°C/7 min.
Gel Purification: Run PCR product on 1% agarose gel. Excise the correct band (~1,500 bp) and purify using a gel extraction kit.
Sanger Sequencing: Submit purified product for bidirectional sequencing with the same primers.
Sequence Analysis: Trim low-quality bases, assemble forward/reverse reads. Submit the consensus sequence to NCBI BLAST for tentative identification.
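
Before the forward and reverse Sanger reads can be assembled into a consensus, the reverse read has to be reverse-complemented. A small sketch of that operation, handling the IUPAC degenerate bases that appear in the primers above (the function name is an assumption):

```python
# IUPAC complements: R<->Y, M<->K, B<->V, D<->H; S, W, and N are self-complementary.
_COMPLEMENT = str.maketrans(
    "ACGTRYMKSWBDHVNacgtrymkswbdhvn",
    "TGCAYRKMSWVHDBNtgcayrkmswvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence with IUPAC codes."""
    return seq.translate(_COMPLEMENT)[::-1]


primer_1492r = "GGTTACCTTGTTACGACTT"
primer_27f = "AGAGTTTGATCMTGGCTCAG"
# The 1492R primer site as it reads on the forward strand.
site_on_forward_strand = reverse_complement(primer_1492r)
print(site_on_forward_strand)
```

Applying the function to the trimmed reverse read puts it in the same orientation as the forward read, after which the overlapping region can be aligned.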

Visualizations

Title: Evolution from 16S to Genome-Based Taxonomy

Title: GTDB-Tk Classification Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Genome-Centric Taxonomy Research

Item	Function & Relevance
DNeasy PowerSoil Pro Kit (QIAGEN)	Gold-standard for high-yield, inhibitor-free microbial genomic DNA extraction from complex samples for WGS and MAG generation.
Nextera XT DNA Library Prep Kit (Illumina)	Prepares multiplexed, adapter-ligated sequencing libraries from low-input genomic DNA for Illumina short-read sequencing.
GTDB-Tk Software & Reference Data	Core bioinformatics toolkit for performing genome classification against the standardized GTDB taxonomy.
CheckM / CheckM2	Assesses completeness and contamination of MAGs using lineage-specific marker sets, a critical QC step before classification.
antiSMASH / BAGEL	Identifies biosynthetic gene clusters (BGCs) for secondary metabolites in classified genomes; crucial for drug discovery pipelines.
Phanta EVO HS Master Mix (Vazyme)	High-fidelity polymerase mix for accurate amplification of taxonomic marker genes or genome fragments when required.
ZymoBIOMICS Microbial Community Standard	Mock community with known composition for validating wet-lab and computational workflows from extraction to classification.

From Raw Reads to Taxonomy: Best Practices for Classifying and Analyzing Marinisomatota Genomes

1. Introduction and Thesis Context

This protocol is framed within a broader thesis investigating the recalibration of bacterial taxonomy, specifically the phylum Marinisomatota (formerly known as Marinimicrobia, SAR406, or Marine Group A). The Genome Taxonomy Database Toolkit (GTDB-Tk) provides a standardized, genome-based methodology for consistent taxonomic classification, which is critical for resolving the ecological and metabolic roles of uncultured lineages like Marinisomatota. Accurate classification is foundational for downstream applications in microbial ecology and the discovery of novel bioactive compounds relevant to drug development.

2. Key Research Reagent Solutions

The following table details essential materials and software for the classification workflow.

Reagent/Solution/Software	Function/Explanation
GTDB-Tk v2.4.0+	Core software package for inferring taxonomic classification and phylogenetic placement; the R220 reference data requires v2.4.0 or later.
GTDB Reference Data (r220+)	Curated set of reference genomes and taxonomy (e.g., `gtdbtk_r220_data.tar.gz`). Mandatory for classification.
CheckM2 or CheckM	Assesses genome completeness and contamination; critical for quality filtering prior to classification.
Prodigal or Pyrodigal	Gene prediction software used internally by GTDB-Tk for creating protein markers.
HMMER (v3.1+)	Used for aligning conserved marker genes to reference HMM profiles.
skani or FastANI	Calculates Average Nucleotide Identity (ANI) for precise species demarcation.
Python 3.8+	Required runtime environment for GTDB-Tk.
High-Performance Computing (HPC) Cluster	Recommended due to the computational intensity of alignment and tree placement steps.

3. Experimental Protocol for Genome Classification

Note: All commands assume a Unix-like environment and conda for package management.

Step 1: Installation and Data Preparation

Step 2: Input Genome Quality Control

Assemble genomes from metagenomic or isolate sequencing data.
Filter genomes using CheckM2 to ensure high quality:

Based on broader thesis standards, retain only genomes meeting:
- Completeness ≥ 80%
- Contamination ≤ 5%
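
The quality filter above can be sketched in Python as follows. This is a minimal illustration, assuming a CheckM2-style tab-separated report with `Name`, `Completeness`, and `Contamination` columns; the bin names and values in the example are made up.

```python
import csv
import io


def filter_checkm2(report_handle, min_completeness=80.0, max_contamination=5.0):
    """Return names of genomes that meet both QC thresholds.

    report_handle: open text handle on a tab-separated quality report.
    The column names used here are an assumption about the report layout.
    """
    reader = csv.DictReader(report_handle, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]


# Hypothetical three-bin report used only to exercise the function.
example_report = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "bin_001\t92.4\t1.3\n"   # passes both thresholds
    "bin_002\t78.9\t0.8\n"   # fails completeness
    "bin_003\t88.0\t6.2\n"   # fails contamination
)
passing_bins = filter_checkm2(example_report)
```

With a real report, open the file with `open(path, newline="")` and pass the handle in place of the in-memory example.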

Step 3: Execute GTDB-Tk Classification Workflow

Run the comprehensive classify_wf pipeline:

Step 4: Interpretation of Results

Key output files:

gtdbtk.bac120.summary.tsv: Tabular summary of taxonomic classification for each genome.
gtdbtk.ar53.summary.tsv: Equivalent summary for archaeal genomes. Marinisomatota is a bacterial phylum, so this file is only relevant if the input set also contains archaeal genomes.
gtdbtk.<marker_set>.tree: Phylogenetic tree for visual placement.
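
As a sketch of how the bacterial summary file might be queried, the function below selects genomes assigned to a phylum and flags those without a species assignment. It assumes the `user_genome` and `classification` columns of the summary file; the genome names and lineage strings in the example are illustrative only.

```python
import csv
import io


def summarize_phylum(summary_handle, phylum="p__Marinisomatota"):
    """Return (genomes in phylum, subset of those lacking a species name)."""
    in_phylum, no_species = [], []
    for row in csv.DictReader(summary_handle, delimiter="\t"):
        # Lineages are semicolon-separated, e.g. d__...;p__...;...;s__...
        ranks = [r.strip() for r in row["classification"].split(";")]
        if phylum in ranks:
            in_phylum.append(row["user_genome"])
            # An empty "s__" field means no species-level assignment.
            if ranks[-1] == "s__":
                no_species.append(row["user_genome"])
    return in_phylum, no_species


# Hypothetical rows; real files contain many more columns.
example_summary = io.StringIO(
    "user_genome\tclassification\n"
    "MAG_01\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__;f__;g__;s__\n"
    "MAG_02\td__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;"
    "f__Flavobacteriaceae;g__;s__\n"
    "MAG_03\tUnclassified Bacteria\n"
)
marinisomatota_mags, candidate_novel_species = summarize_phylum(example_summary)
```

A genome without a species name is only a candidate for novelty; the ANI and tree-placement evidence still needs to be checked.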

4. Data Presentation: Summary of Classification Metrics

The following table quantifies typical outputs from a Marinisomatota classification run using GTDB-Tk, based on a hypothetical dataset of 150 marine metagenome-assembled genomes (MAGs).

Table 1: GTDB-Tk Classification Statistics for a Marine MAG Dataset

Metric	Value	Interpretation
Total Input Genomes	150	MAGs passing QC thresholds.
Genomes Classified to Marinisomatota	47 (31.3%)	Assigned to the target phylum.
Novel Species (ANI < 95%)	28 (59.6% of phylum)	Potential new species within Marinisomatota.
Novel Genera (AF < 50%)	11 (23.4% of phylum)	Potential new genera.
Average Alignment Fraction (AF)	72.1% (std dev ± 18.5)	Measure of genomic relatedness to reference.
Placement in Reference Tree	100%	All genomes placed within the GTDB reference phylogeny.

5. Visualization of the Classification Workflow

Title: GTDB-Tk Classification Workflow for Marinisomatota Genomes

Title: Taxonomic Context of Marinisomatota in GTDB

Application Notes

Within the context of a broader thesis on Marinisomatota taxonomy using the Genome Taxonomy Database (GTDB), the generation and refinement of Metagenome-Assembled Genomes (MAGs) is foundational. The accuracy of downstream phylogenetic and metabolic analyses is critically dependent on parameters adjusted during assembly, binning, and refinement. This protocol details the workflow adjustments necessary for optimizing MAG quality, particularly for elusive phyla like Marinisomatota, which are frequently underrepresented in environmental samples.

Key Findings:

Assembly Stringency: For complex marine metagenomes (e.g., TARA Oceans), a minimum contig length of 1000-1500 bp and multiple k-mer sizes (21, 33, 55, 77) in metaSPAdes significantly improve recovery of medium-abundance genomes like Marinisomatota.
Binning Sensitivity: Both compositional (tetranucleotide frequency) and coverage-based features should be used. For Marinisomatota, integrating CheckM lineage-specific markers before the final dereplication step reduces contamination from related FCB group members.
GTDB-Tk Classification: The pplacer placement support (denoted p_placer here) and relative evolutionary divergence (RED) values from GTDB-Tk are critical for interpreting the placement of novel Marinisomatota MAGs. A threshold of p_placer ≥ 0.8 is recommended for confident placement at the genus level.

Table 1: Impact of Assembly & Binning Parameters on MAG Quality Metrics for Marine Datasets

Parameter	Standard Value	Optimized Value for Marinisomatota	Effect on Completeness	Effect on Contamination	Key Tool
Min Contig Length	500 bp	1500 bp	-5% to +2%	-10% to -15%	metaSPAdes, MEGAHIT
Metabat2 --minContig	1500 bp	2500 bp	-3%	-8%	MetaBAT2
CheckM Lineage WF	Standard	`--extension_threshold 0.2`	More stringent lineage assignment	Better contamination estimate	CheckM2
MaxBin2 Prob Threshold	0.9	0.95	-2%	-7%	MaxBin2
DAS Tool Score Threshold	0.5	0.6	+1%	-5%	DAS Tool

Table 2: GTDB-Tk Classification Output Interpretation for Novel Taxa

Metric	Range	Interpretation for Marinisomatota MAGs
Classify `p_placer`	0.0 - 1.0	≥ 0.95: Strong confidence at species rank. 0.80-0.94: Confident genus-level placement. <0.80: Require manual phylogenomic review.
RED Value	~0.0 - ~1.0	Values close to 0.5 for a new MAG suggest a novel genus within a known family. Deviations >0.15 from sister taxa warrant investigation.
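FastANI vs. Reference	85% - 100%	<95% ANI to the nearest GTDB reference suggests a novel species. FastANI estimates are unreliable below roughly 80% ANI, so genus-level novelty should be judged from RED and tree placement rather than from ANI alone.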

Experimental Protocols

Protocol 1: Optimized Co-Assembly and Binning for Marine Samples

Objective: To reconstruct high-quality Marinisomatota MAGs from multi-sample marine metagenomic datasets.

Materials:

Input: Quality-controlled paired-end reads from multiple samples (e.g., same geographic region).
Software: metaSPAdes v3.15.5, MetaBAT2 v2.15, Bowtie2 v2.5, SAMtools, CheckM2 v1.0.1.

Method:

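Co-assembly: metaSPAdes accepts a single paired-end library, so concatenate the per-sample reads first (cat sample1_1.fq sample2_1.fq > all_1.fq; cat sample1_2.fq sample2_2.fq > all_2.fq), then run: metaspades.py -o co_assembly -1 all_1.fq -2 all_2.fq -k 21,33,55,77 -t 32 -m 500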
Contig Filtering: Retain contigs ≥ 1500 bp using seqtk.
Coverage Profiling: Map individual sample reads back to filtered contigs using Bowtie2. Generate depth tables with jgi_summarize_bam_contig_depths.
Binning: Run multiple binners:
- metabat2 -i filtered_contigs.fa -a depth_table.txt -o bin -m 2500
- Run MaxBin2 and CONCOCT as per standard protocols.
Bin Refinement: Use DAS Tool with a stringent scoring threshold: DAS_Tool -i metabat2_bins.txt,maxbin2_bins.txt -l metabat,maxbin -c filtered_contigs.fa --score_threshold 0.6 -o final_bins
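
The Contig Filtering step is normally done with seqtk, but the logic is simple enough to sketch in plain Python for clarity. The function and variable names here are illustrative and not part of any tool named above.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(handle, min_len=1500):
    """Keep contigs whose total length is at least min_len bases."""
    return [(n, s) for n, s in read_fasta(handle) if len(s) >= min_len]


# Synthetic example: contig_1 is 1600 bp over two lines, contig_2 is 1499 bp.
example_fasta = io.StringIO(
    ">contig_1\n" + "A" * 1000 + "\n" + "C" * 600 + "\n"
    ">contig_2\n" + "G" * 1499 + "\n"
)
kept_contigs = filter_contigs(example_fasta)
```

For assemblies with millions of contigs, a streaming tool such as seqtk remains the practical choice; this sketch only makes the threshold explicit.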

Protocol 2: MAG Quality Assessment, Refinement, and GTDB Classification

Objective: To assess MAG quality, refine bins, and achieve accurate GTDB taxonomy.

Materials:

Input: Bins from Protocol 1 (final_bins_DASTool_scaffolds2bin.txt).
Software: CheckM2, GTDB-Tk v2.3.0, dRep, FastANI.

Method:

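Quality Assessment: checkm2 predict --input final_bins/ --output-directory checkm2_results --threads 16 --lowmem (add -x fa if the bins use the .fa extension)
Bin Refinement based on Lineage: Manually inspect bins with >5% contamination. CheckM2 itself is model-based, so use marker-gene, taxonomic, and coverage signals (e.g., CheckM lineage-specific markers) to identify and remove contaminant contigs via anvi-refine or manual curation.
Dereplication: Cluster MAGs at 99% ANI (near strain level; species-level clustering would use 95% ANI): dRep dereplicate mags_derep99 -g *.fa -sa 0.99 -nc 0.30
GTDB Taxonomic Classification:
- gtdbtk classify_wf --genome_dir mags_derep99/dereplicated_genomes/ --out_dir gtdbtk_out --cpus 32 --extension fa --skip_ani_screen (GTDB-Tk v2.2+ requires either --skip_ani_screen or --mash_db)
- Critically analyze the gtdbtk.bac120.summary.tsv file, focusing on the classification, classification_method, red_value, and warnings columns.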
Phylogenomic Validation (Optional): For MAGs with ambiguous placement (p_placer < 0.8), construct a custom phylogeny with IQ-TREE using the bac120 markers.

Visualization

Diagram 1: MAG Generation & Curation Workflow

Diagram 2: GTDB-Tk Decision Pathway for Novel Taxa

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item/Reagent	Function & Application in MAG Workflow	Critical Parameter/Specification
NEB Next Ultra II FS DNA Library Prep Kit	High-quality metagenomic library preparation for Illumina sequencing. Essential for obtaining high-coverage, unbiased reads.	Input DNA: 1ng-100ng. Enzymatic fragmentation time optimization for desired insert size.
MetaPolyzyme (Sigma-Aldrich)	Enzymatic lysis cocktail for diverse microbial cell walls in environmental samples. Critical for unbiased DNA extraction from marine biomass.	Incubation: 37°C for 60 min. Use in conjunction with mechanical lysis (bead-beating).
SPRIselect Beads (Beckman Coulter)	Size-selective magnetic bead-based clean-up of DNA fragments during library preparation. Contig length filtering is done computationally after assembly, not with beads.	Ratio optimization (e.g., 0.6x to 0.8x) to set the lower fragment-size cutoff of the library.
CheckM2 Machine-Learning Models	Software-based "reagent" for assessing MAG completeness/contamination using gradient-boosting and neural-network models rather than lineage-specific marker sets. Generally more accurate than CheckM1 for novel lineages.	Use `--lowmem` flag for large datasets. Interpret results in context of contamination sources.
GTDB-Tk Reference Data (R214)	Curated database of bacterial/archaeal genomes for phylogenetic placement. The standard for taxonomic classification of Marinisomatota MAGs.	Must download (~50 GB) and install separately. Update with each GTDB release.
Phusion High-Fidelity DNA Polymerase (Thermo)	For amplification of taxonomic marker genes from MAGs or community DNA for validation (e.g., 16S rRNA gene PCR if present).	High fidelity reduces chimera formation during PCR from complex templates.

1. Introduction and Taxonomic Context

The phylum Marinisomatota (as defined by the Genome Taxonomy Database, GTDB) represents a phylogenetically distinct lineage within the bacterial domain, primarily derived from marine and host-associated environments. Within the broader thesis of refining GTDB classifications and exploring underexplored taxa, this phylum presents a significant opportunity for biodiscovery. Its ecological niches suggest adaptation to complex polysaccharides and competitive interactions, predicting a rich repertoire of Biosynthetic Gene Clusters (BGCs) and catalytically novel enzymes with potential applications in drug discovery, biocatalysis, and biomedicine.

2. Quantitative Overview of Marinisomatota Genomic Potential

Table 1: Summary of BGC Diversity in Publicly Available Marinisomatota Genomes (as of 2024)

GTDB Genus Representative	Number of Genomes Surveyed	Average BGCs per Genome	Most Frequent BGC Class	Notable Predicted Product
UBA2962 (marine sediment)	12	8.2	Terpene	Sesterterpenoid-like
UBA10314 (sponge symbiont)	7	11.5	NRPS, T3PKS	Lipopeptide, Polyketide
UBA1773 (hydrothermal vent)	5	6.8	Bacteriocin	Lanthipeptide-class
Phylum Aggregate	~50	8.7	Terpene	High chemical novelty index

Table 2: Putative Novel Enzyme Families Identified via CAZy and Peptidase Database Mining

Enzyme Class	GTDB Family	Predicted Activity	Unique Domain Architecture	Potential Biomedical Application
Glycosyltransferase	UBA2962	β-1,3-Xylosyltransferase	C-terminal Sharkin-like domain	Synthesis of heparin mimetics
Peptidase (S8 family)	UBA10314	Subtilisin-like serine protease	Inserted carbohydrate-binding module	Targeted proteolysis for biofilm disruption
Polysaccharide Lyase	UBA1773	Alginate lyase (novel substrate specificity)	Tandem bacterial immunoglobulin-like domains	Cystic fibrosis therapeutic (mucolysis)

3. Detailed Experimental Protocols

Protocol 3.1: In silico Genome Mining and BGC Prioritization

Objective: To identify and prioritize non-ribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) BGCs from Marinisomatota genomes.

Materials: High-performance computing cluster, antiSMASH 7.0, BiG-SCAPE, CORASON, MIBiG database.

Procedure:

Data Acquisition: Download target Marinisomatota genomes from GTDB or NCBI in FASTA format.
BGC Prediction: Run antiSMASH with strict detection (--hmmdetection-strictness strict --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go). Use --genefinding-tool prodigal.
Cluster Family Analysis: Process all antiSMASH outputs with BiG-SCAPE (python bigscape.py -i ./antismash_results -o ./bigscape_out --mibig --mix --cutoffs 0.3 0.65 0.95).
Phylogenetic Contextualization: For prioritized clusters (e.g., in new GCFs), use CORASON to build phylogenies of the core biosynthetic genes and their genomic context against the MIBiG reference.
Prioritization Scoring: Score BGCs based on: (i) Phylogenetic novelty (distance to nearest known cluster), (ii) Presence of unique domains, (iii) Predicted regulatory elements, and (iv) Proximity to transporter genes.
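
The four prioritization criteria can be combined into a single ranking score. The sketch below is one possible weighting, not the scheme used in the source work: the weights, the `BGC` record, and the example clusters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BGC:
    name: str
    novelty: float             # (i) 0-1, distance to nearest known cluster
    unique_domains: bool       # (ii) unusual domain architecture present
    regulatory_elements: bool  # (iii) predicted regulators present
    transporter_nearby: bool   # (iv) transporter gene in proximity


# Assumed weights; they sum to 1 so the score also lies in 0-1.
WEIGHTS = {
    "novelty": 0.4,
    "unique_domains": 0.2,
    "regulatory_elements": 0.2,
    "transporter_nearby": 0.2,
}


def priority_score(bgc, weights=WEIGHTS):
    """Weighted sum of the four criteria (booleans count as 0 or 1)."""
    return (
        weights["novelty"] * bgc.novelty
        + weights["unique_domains"] * bgc.unique_domains
        + weights["regulatory_elements"] * bgc.regulatory_elements
        + weights["transporter_nearby"] * bgc.transporter_nearby
    )


# Hypothetical candidates, ranked from highest to lowest priority.
candidates = [
    BGC("cluster_b", 0.5, False, True, False),
    BGC("cluster_a", 0.9, True, False, True),
]
ranked = sorted(candidates, key=priority_score, reverse=True)
```

The weights should be tuned to the goals of the screen; for example, a campaign focused on secreted compounds might weight transporter proximity more heavily.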

Protocol 3.2: Heterologous Expression of a Terpene Synthase BGC

Objective: To express a prioritized terpene synthase BGC from UBA2962 in Streptomyces coelicolor M1152.

Materials: Research Reagent Solutions Table:

Reagent/Solution	Function	Source/Catalog Note
pCAP01 fosmid vector	BGC capture and heterologous expression	E. coli EPI300-T1ᵣ library construction
S. coelicolor M1152	Streptomycete heterologous host	Lack of native PKS and NRPS clusters
Apetite solid medium	Selection and sporulation of Streptomyces	Contains apramycin, MgCl₂, and trace elements
Ethyl acetate with 1% acetic acid	Extraction of terpenoid metabolites	LC-MS grade for downstream analysis
PCR Master Mix (2x) with GC enhancer	Amplification of high-GC% Marinisomatota DNA	Required for >60% GC content

Procedure:

Fosmid Library Construction: Partially digest high-molecular-weight Marinisomatota genomic DNA with Sau3AI. Size-select 30-40 kb fragments and ligate into pCAP01 vector. Package using a lambda phage kit and transduce into EPI300-T1ᵣ E. coli.
Library Screening: Screen colonies by PCR for the conserved terpene synthase gene (DDXXD motif). Isolate positive fosmid DNA.
Intergeneric Conjugation: Mix E. coli ET12567/pUZ8002 harboring the positive fosmid with S. coelicolor M1152 spores. Plate on SFM agar with 10 mM MgCl₂. After 16h, overlay with apramycin (50 µg/mL) and nalidixic acid (25 µg/mL).
Heterologous Expression: Pick exconjugants to Apetite plates. Incubate at 30°C for 5-7 days. Inoculate single colonies into TSB with apramycin for seed culture, then transfer into production medium (R5 or SFM). Incubate with shaking for 14 days.
Metabolite Extraction: Centrifuge culture. Extract supernatant with equal volume ethyl acetate (1% AcOH). Extract cell pellet with 1:1 acetone:methanol. Combine organic phases, dry under vacuum, and resuspend in methanol for LC-HRMS/MS analysis.

Protocol 3.3: Activity Screening of a Novel Subtilisin-like Protease

Objective: To clone, express, and test the activity of a novel S8 peptidase from UBA10314.

Materials: pET-28a(+) vector, E. coli BL21(DE3), Ni-NTA resin, fluorogenic substrate Boc-Gln-Ala-Arg-AMC.

Procedure:

Gene Optimization & Cloning: Codon-optimize the gene for E. coli expression, adding an N-terminal 6xHis tag. Synthesize and clone into pET-28a(+) via NdeI/XhoI sites.
Expression: Transform into E. coli BL21(DE3). Grow in TB with kanamycin at 37°C to OD₆₀₀ 0.6. Induce with 0.5 mM IPTG at 18°C for 16h.
Purification: Lyse cells via sonication in lysis buffer (50 mM Tris-HCl pH 8.0, 300 mM NaCl, 10 mM imidazole). Purify soluble protein using Ni-NTA affinity chromatography with an imidazole gradient (50-250 mM).
Activity Assay: In a 96-well plate, mix 50 µL of purified enzyme (0.1-1 µM) with 50 µL assay buffer (50 mM HEPES pH 7.5, 150 mM NaCl) containing 200 µM Boc-Gln-Ala-Arg-AMC. Monitor fluorescence (ex 380 nm, em 460 nm) kinetically for 30 min at 25°C.
Substrate Specificity Profiling: Test against the MEROPS S8 substrate library, including insulin B chain, followed by UPLC-MS analysis of cleavage products.
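
The kinetic read-out from the Activity Assay step is usually reduced to an initial rate by fitting a line to the early, linear part of the fluorescence trace. The sketch below does this with an ordinary least-squares slope. The AMC calibration factor (`rfu_per_um`) is an assumed value that would come from a standard curve on the same plate reader, and the example trace is synthetic.

```python
def initial_rate(times_min, rfu, rfu_per_um):
    """Initial rate in µM AMC released per minute.

    times_min:  time points in minutes (linear phase only)
    rfu:        fluorescence readings at those time points
    rfu_per_um: calibration slope, RFU per µM free AMC (assumed known)
    """
    n = len(times_min)
    if n < 2 or n != len(rfu):
        raise ValueError("need at least two matched time/fluorescence points")
    mean_t = sum(times_min) / n
    mean_f = sum(rfu) / n
    # Ordinary least-squares slope of fluorescence against time.
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times_min, rfu))
    slope_rfu_per_min = sxy / sxx
    return slope_rfu_per_min / rfu_per_um


# Synthetic, perfectly linear trace: 50 RFU/min with 25 RFU per µM AMC.
example_rate = initial_rate([0, 1, 2, 3, 4], [100, 150, 200, 250, 300], 25.0)
```

Subtract the slope of a no-enzyme control before converting, since AMC substrates can hydrolyze slowly on their own.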

4. Visualization of Workflows and Pathways

Title: Marinisomatota Mining and Validation Workflow

Title: NRPS Biosynthetic Logic

This application note details protocols for linking phylogeny, specifically within the Marinisomatota phylum (as classified by the Genome Taxonomy Database - GTDB), to biosynthetic gene cluster (BGC) diversity. The work is framed within the broader thesis that GTDB-based phylogenetic resolution of understudied taxa, like the Marinisomatota, uncovers novel BGC landscapes, providing a systematic roadmap for targeted biodiscovery in drug development.

Table 1: Comparative BGC Diversity in Marinisomatota vs. Related Phyla (GTDB r214)

Taxonomic Group (GTDB)	Genomes Analyzed	Total BGCs Identified	BGCs/Genome (Avg)	NRPS/PKS (%)	Ribosomal (%)	Terpene (%)	Other (%)
Marinisomatota	47	312	6.64	28.2	18.9	22.1	30.8
Actinomycetota	150	1245	8.30	45.6	12.3	15.4	26.7
Bacteroidota	85	401	4.72	15.2	31.7	10.0	43.1

Table 2: Phylogenetic Conservation of BGC Families in Marinisomatota Clades

Marinisomatota Family (GTDB)	Representative Genus	Core BGC Family (MIBiG Class)	Conservation Frequency in Clade (%)	Putative Novelty Score*
Marinisomataceae	Marinisoma	Type I PKS	92	0.85
Oceanipullicutaceae	Oceanipullicuta	Lasso peptide	78	0.92
UBA10353	UBA10353	Hybrid NRPS-PKS	65	0.95
Novel lineage A	MAG-3321	Thiopeptide	100	0.98

*Novelty Score: 1 - (max BLASTp identity to known MIBiG cluster), with identity expressed as a fraction between 0 and 1.

Experimental Protocols

Protocol 3.1: Phylogenomic Reconstruction of Marinisomatota

Objective: Generate a robust, GTDB-consistent phylogeny for BGC diversity mapping.

Materials: See "Research Reagent Solutions" (Section 6).

Method:

Genome Retrieval: Download all available Marinisomatota genome assemblies (RefSeq/GenBank) with ncbi-genome-download and obtain the associated GTDB taxonomy files (release r214) from the GTDB download site.
Core Genome Identification: Use OrthoFinder v2.5 with default parameters on all predicted proteomes to identify single-copy orthologous (SCO) groups.
Alignment & Concatenation: Align SCO amino acid sequences with MAFFT v7. Auto-trim alignments with trimAl (-automated1). Concatenate alignments using AMAS.
Phylogenetic Inference: Construct a maximum-likelihood tree with IQ-TREE2 (-m TEST -B 1000 -alrt 1000). Use the resulting .treefile as the phylogenetic framework.

Protocol 3.2: BGC Prediction, Dereplication, and Classification

Objective: Identify, classify, and quantify BGCs from Marinisomatota genomes.

Method:

BGC Prediction: Run antiSMASH v6 (or latest) on all genomes with --clusterhmmer, --asf, and --cb-knownclusters flags enabled.
BGC Dereplication: Process the antiSMASH region GenBank (.gbk) outputs with BiG-SCAPE v1.1 (--mix mode). This generates Gene Cluster Families (GCFs) based on Pfam domain similarity.
Novelty Assessment: For each GCF, extract core biosynthetic genes. Perform BLASTp against the MIBiG database (v3). Calculate the Putative Novelty Score (Table 2) as 1 minus the highest percent identity for any core gene hit. Scores >0.7 indicate high novelty.
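
The Putative Novelty Score reduces to a one-line calculation once the BLASTp hits are parsed. A minimal sketch follows; treating a GCF with no MIBiG hit as maximally novel is an assumption made here, not something stated in the protocol.

```python
def novelty_score(percent_identities):
    """1 minus the best BLASTp identity (as a fraction) across core genes.

    percent_identities: iterable of % identity values (0-100) for all
    core-gene hits of one GCF against MIBiG.
    """
    identities = list(percent_identities)
    if not identities:
        # Assumption: no detectable hit is scored as fully novel.
        return 1.0
    return 1.0 - max(identities) / 100.0


def is_high_novelty(score, threshold=0.7):
    """Apply the >0.7 cut-off used in the protocol."""
    return score > threshold


# Hypothetical GCF whose best core-gene hit is 15% identical to MIBiG.
example_score = novelty_score([15.0, 8.0, 12.5])
```

Because the score depends only on the single best hit, one conserved core gene can mask an otherwise unusual cluster, so low scores are worth a manual look.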

Protocol 3.3: Phylogeny-BGC Diversity Mapping & Correlation

Objective: Statistically link phylogenetic distance to BGC repertoire dissimilarity.

Method:

Distance Matrix Creation:
- Generate a phylogenetic distance matrix from the Protocol 3.1 tree using cophenetic.phylo in R's ape package.
- Generate a BGC profile distance matrix from BiG-SCAPE output using the Jaccard distance on genome-GCF presence/absence data (vegdist in R's vegan).
Statistical Testing: Perform a Mantel test (mantel function in vegan) to assess correlation between phylogenetic and BGC distance matrices (use 9999 permutations). A significant p-value (<0.05) supports phylogenetic conservation of BGCs.
Visualization: Map dominant GCFs onto tree nodes using iTOL or ggtree in R.
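
The protocol runs the Mantel test in R with vegan. To make the logic explicit, the sketch below re-implements the same idea in dependency-free Python: Jaccard distances on GCF presence/absence, a Pearson correlation between the two distance matrices, and a one-sided permutation p-value. It is an illustration rather than a replacement for `vegan::mantel`, and the genomes, GCF sets, and phylogenetic distances are invented.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of GCF identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p) for two symmetric distance matrices in the same order.

    Rows and columns of d2 are permuted together; p is the one-sided
    fraction of permutations with r at least as large as observed.
    """
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    r_obs = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        y = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical data for four genomes.
genomes = ["A", "B", "C", "D"]
gcfs = {"A": {"g1", "g2"}, "B": {"g1", "g2", "g3"}, "C": {"g4"}, "D": {"g4", "g5"}}
phylo = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.85, 0.9],
    [0.8, 0.85, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
bgc = [[jaccard_distance(gcfs[a], gcfs[b]) for b in genomes] for a in genomes]
r_value, p_value = mantel(phylo, bgc)
```

With only four genomes the number of distinct permutations is tiny, so the p-value cannot be small; real analyses need far more genomes for the test to be informative.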

Key Visualizations

Diagram 1: Phylogeny-Guided Drug Discovery Workflow

Diagram 2: BGC Diversity Correlation with Phylogeny

Research Reagent Solutions

Table 3: Essential Toolkit for Phylogeny-BGC Linkage Studies

Item/Category	Specific Product/Resource	Function in Protocol
Taxonomic Framework	GTDB-Tk (v2.3.0) Database & Toolkit	Standardizes genome taxonomy according to GTDB, essential for defining Marinisomatota clades (Protocol 3.1).
Phylogenomics Software	IQ-TREE2 (v2.2.0), OrthoFinder (v2.5.4)	Infers robust phylogenetic trees from core genomes (Protocol 3.1).
BGC Prediction Pipeline	antiSMASH (v6.1.1) with all databases	Comprehensive identification and initial classification of BGCs in genomes (Protocol 3.2).
BGC Comparative Analysis	BiG-SCAPE (v1.1) & CORASON	Clusters BGCs into Gene Cluster Families (GCFs) enabling diversity quantification (Protocol 3.2, 3.3).
Reference BGC Database	MIBiG (Minimum Information about a BGC) Repository (v3.1)	Gold-standard database for BGC novelty assessment via BLAST (Protocol 3.2).
Statistical & Visualization Environment	R (v4.2+) with `ape`, `vegan`, `ggtree` packages	Performs Mantel test and visualizes phylogeny-BGC correlations (Protocol 3.3).
High-Performance Computing (HPC)	Linux cluster with SLURM scheduler & >= 1TB storage	Manages computationally intensive genome analysis, tree building, and BiG-SCAPE runs.

Integrating Classification with Functional Annotation Pipelines (e.g., Prokka, antiSMASH)

This application note is framed within broader thesis research on the Marinisomatota phylum (GTDB classification; placed in the FCB superphylum in some taxonomic systems). Integrating a robust taxonomic framework such as the Genome Taxonomy Database (GTDB) with functional annotation pipelines is critical for elucidating the unique metabolic and biosynthetic potential of understudied lineages. For Marinisomatota, hypothesized to have rich secondary metabolism, coupling GTDB-Tk classification with tools like antiSMASH and Prokka accelerates the discovery of novel gene clusters and their contextual interpretation within an accurate evolutionary framework.

Table 1: Comparison of Functional Annotation & Classification Tools

Tool/Pipeline	Primary Purpose	Key Outputs	Typical Runtime*	Relevance to Marinisomatota Research
GTDB-Tk v2.3.0	Taxonomic classification & phylogeny	Taxonomic assignment, alignments, tree	~30 min/genome	Definitive placement of novel Marinisomatota genomes within the GTDB hierarchy.
Prokka v1.14.6	Rapid prokaryotic genome annotation	CDS, tRNA, rRNA, functional prefixes (COG, Pfam)	~10-15 min/genome	First-pass functional annotation, creating standardized GenBank files for downstream analysis.
antiSMASH v7.0	Secondary metabolite BGC detection	BGC location, type, similarity, core structures	~20-30 min/genome	Identification of biosynthetic gene clusters (BGCs) for drug discovery leads.
EggNOG-mapper v2.1.12	Functional orthology annotation	GO terms, KEGG pathways, COG categories	~5-10 min/genome	Consistent functional annotation across diverse taxa.
CheckM2 v1.0.2	Genome quality estimation	Completeness, contamination	~3-5 min/genome	Quality assessment prior to classification/annotation.

*Runtimes are approximate for a 4-5 Mbp bacterial genome on a high-performance compute node.

Table 2: Integrated Pipeline Output Statistics for a Mock Marinisomatota Dataset

Analysis Stage	Metric	Average Value (n=10 draft genomes)	Notes
CheckM2	Genome Completeness (%)	96.4 ± 2.1	High-quality drafts suitable for analysis.
GTDB-Tk	Classification Rank	p__Marinisomatota; g__UBA2565	All genomes placed within the phylum; most as novel genera.
Prokka	Total CDS Annotated	3,850 ± 420	Provides baseline gene calls for all pipelines.
antiSMASH	BGCs per Genome	8.2 ± 1.7	Indicates high biosynthetic potential.
EggNOG-mapper	Genes with KEGG Annotation	62% ± 5%	Enables pathway reconstruction.

Detailed Integrated Protocol

Protocol 1: Genome Quality Control and Taxonomic Classification

Objective: Assess draft genome quality and assign accurate taxonomy prior to functional annotation.

Input: Assembled genomes (FASTA format).
Quality Assessment: checkm2 predict --input <assembly.fasta> --output-directory <checkm2_out> --threads 8
- Filtering: Retain genomes with >90% completeness and <5% contamination.
GTDB-Tk Classification: gtdbtk classify_wf --genome_dir <filtered_genomes_dir> --out_dir <gtdbtk_out> --cpus 8 --pplacer_cpus 2 --skip_ani_screen (GTDB-Tk v2.2+ requires either --skip_ani_screen or --mash_db)
- Outputs: gtdbtk.bac120.summary.tsv provides domain- to species-level classification.
- Thesis Context: Confirm phylum-level placement as Marinisomatota and identify novel genera/species.

Protocol 2: Integrated Functional Annotation Workflow

Objective: Annotate genomes and specifically identify biosynthetic gene clusters (BGCs) using a coordinated pipeline.

Primary Annotation with Prokka: prokka <assembly.fasta> --outdir <prokka_out> --prefix <strain_name> --cpus 8 --rfam
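- Prokka does not read the GTDB classification itself; set --kingdom and --gcode from the GTDB-Tk result (domain and translation_table column) so that the appropriate genetic code is used.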
- Outputs: .gbk file essential for antiSMASH.
BGC Detection with antiSMASH: antismash <prokka_out>/<strain_name>.gbk --output-dir <antismash_out> --cpus 8 --genefinding-tool none
- Critical: Use the Prokka-generated GBK to ensure consistent gene calls between annotations.
- Output Analysis: Merge antiSMASH results (BGC locations, types) with GTDB taxonomy and Prokka annotations.
Orthology-Based Functional Annotation (Parallel): emapper.py -i <prokka_out>/<strain_name>.faa -o <eggnog_out> --cpu 8
- Provides standardized KEGG/GO terms to supplement Prokka's Pfam-based annotations.
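
The merge of pipeline outputs mentioned above can be sketched as a small join keyed on genome name. The input structures below are assumptions about what a parsing step would produce, and all names and values in the example are hypothetical.

```python
from collections import Counter


def merge_annotations(taxonomy, bgc_records, kegg_fraction):
    """Build one summary row per genome.

    taxonomy:      {genome: GTDB lineage string}
    bgc_records:   [(genome, bgc_type), ...] parsed from antiSMASH results
    kegg_fraction: {genome: fraction of genes with a KEGG annotation}
    """
    counts = Counter(genome for genome, _ in bgc_records)
    types = {}
    for genome, bgc_type in bgc_records:
        types.setdefault(genome, set()).add(bgc_type)
    rows = []
    for genome in sorted(taxonomy):
        rows.append({
            "genome": genome,
            "gtdb_lineage": taxonomy[genome],
            "bgc_count": counts.get(genome, 0),
            "bgc_types": ",".join(sorted(types.get(genome, ()))),
            "kegg_fraction": kegg_fraction.get(genome),  # None if missing
        })
    return rows


# Hypothetical parsed inputs for two genomes.
merged = merge_annotations(
    taxonomy={
        "MAG_01": "d__Bacteria;p__Marinisomatota;c__;o__;f__;g__;s__",
        "MAG_02": "d__Bacteria;p__Marinisomatota;c__;o__;f__;g__;s__",
    },
    bgc_records=[("MAG_01", "terpene"), ("MAG_01", "NRPS"), ("MAG_01", "terpene")],
    kegg_fraction={"MAG_01": 0.62},
)
```

The resulting list of dictionaries can be written out with `csv.DictWriter` or loaded into a data frame for plotting.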

Visualization of Workflows

Title: Integrated Genome Analysis Pipeline

Title: Marinisomatota Research Logic Flow

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item	Function/Application in Protocol	Example/Notes
High-Quality Compute Environment	Running computationally intensive pipelines.	Linux server/cluster with ≥32GB RAM, multi-core CPUs (e.g., AWS EC2, HPC).
Conda/Mamba	Reliable dependency and environment management.	Use `bioconda` channels to install all tools (GTDB-Tk, Prokka, antiSMASH).
GTDB-Tk Reference Data (R214)	Essential database for taxonomic classification.	Download `gtdbtk_r214_data.tar.gz` (∼54 GB). Critical for accurate Marinisomatota placement.
antiSMASH Databases	For BGC detection, rule-based clustering, etc.	Includes MIBiG, Pfam, ClusterBlast; installed via `download-antismash-databases`.
EggNOG Database (v5.0)	For fast orthology mapping and functional annotation.	Bacterial (`bact`) subset sufficient for Marinisomatota.
Integrative Analysis Scripts	Custom Python/R scripts to merge outputs.	For merging GTDB taxonomy, BGC locations, and KEGG pathways into a single table.
Visualization Tools	Creating publication-quality figures from results.	R (ggplot2, ggtree), Python (matplotlib, seaborn), or software like OriginLab.

Resolving Classification Challenges: Troubleshooting GTDB Analysis for Marinisomatota

Application Notes: A GTDB-Centric Framework for Marinisomatota

Within the broader thesis applying the Genome Taxonomy Database (GTDB) framework to elucidate the evolutionary and metabolic diversity of the phylum Marinisomatota (also known as Marinimicrobia or SAR406 in other classifications), three critical, interconnected pitfalls consistently compromise downstream analysis. These are the recovery of low-quality Metagenome-Assembled Genomes (MAGs), genome contamination, and assignment to incomplete or obsolete taxonomic lineages. Addressing these is paramount for robust ecological inference and bioprospecting, especially for drug development professionals seeking novel bioactive gene clusters from marine microbiomes.

1. Low-Quality MAGs: The inherent fragmentation and uneven coverage in metagenomic sequencing often yield MAGs that are incomplete and/or misassembled. For GTDB classification, which relies on a set of conserved marker genes, this directly impacts the placement accuracy. A MAG missing >10% of these markers may be assigned to an imprecise taxonomic rank or flagged as "incomplete."

2. Contamination: Cross-contamination from co-occurring organisms, especially during binning, results in chimeric MAGs containing genes from multiple taxonomic units. This invalidates functional predictions and distorts phylogenetic trees. For Marinisomatota, which often exist in complex consortia, this is a prevalent risk.

3. Incomplete Taxonomy: Relying on legacy taxonomy (e.g., NCBI) instead of the standardized, genome-based GTDB can lead to misclassification. Marinisomatota itself is a product of genomic taxonomy, redefining older groups. Using outdated names obscures evolutionary relationships and hinders comparative genomics.

Quantitative Impact Summary:

Table 1: Impact of MAG Quality on GTDB Classification Success Rate

MIMAG Quality Tier	Completeness (CheckM2)	Contamination (CheckM2)	% Passing GTDB-tk Workflow (approx.)	Risk of Misclassification
High-quality (HQ)	>90%	<5%	>95%	Low
Medium-quality (MQ)	≥50%, <90%	<10%	~60-80%	Moderate
Low-quality (LQ)	<50%	≥10%	<30%	Very High
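
The tier boundaries in Table 1 can be expressed as a small classifier. This is a simplified sketch using only the completeness and contamination cut-offs shown in the table; the full MIMAG high-quality definition also requires rRNA and tRNA genes, which are not checked here.

```python
def mimag_tier(completeness, contamination):
    """Assign a quality tier from completeness and contamination (both in %).

    Simplification: rRNA/tRNA criteria of the MIMAG standard are ignored.
    """
    if completeness > 90.0 and contamination < 5.0:
        return "HQ"
    if completeness >= 50.0 and contamination < 10.0:
        return "MQ"
    return "LQ"


# Hypothetical MAGs mapped to their tiers.
example_tiers = {
    name: mimag_tier(comp, cont)
    for name, (comp, cont) in {
        "MAG_01": (95.0, 2.0),
        "MAG_02": (70.0, 4.0),
        "MAG_03": (92.0, 7.0),   # complete but too contaminated for HQ
        "MAG_04": (40.0, 1.0),
        "MAG_05": (80.0, 12.0),
    }.items()
}
```

Applying the tier before classification makes it easy to report how many genomes of each quality class reach each taxonomic rank.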

Table 2: Common Contaminant Signatures in Putative Marinisomatota MAGs

Contaminant Phylum (GTDB)	Typical Marker Genes	Effect on Classification
Pseudomonadota (formerly Proteobacteria)	rpoB, fusA	Creates aberrant long branches in phylogeny
Bacteroidota	rpoC, gyrB	Can cause "pulling" into sister clades
Archaea (e.g., Thermoplasmatota)	Archaeal ribosomal proteins	GTDB-tk may reject genome or flag as contaminated

Protocols for Mitigation and Validation

Protocol 1: MAG Refinement and GTDB Classification

This protocol ensures only robust MAGs are submitted to GTDB-Tk for taxonomic classification of Marinisomatota.

Materials (Research Reagent Solutions):

CheckM2: Estimates genome completeness and contamination using a machine-learning model.
GTDB-Tk (v2.3.0+): Toolkit for assigning GTDB taxonomy and inferring phylogenies.
GUNC (Genome UNClutterer): Detects and quantifies contamination in metagenomic bins.
DASTool: Optimizes binning from multiple algorithms to produce consensus, high-quality bins.
BBTools (bbduk.sh): For adapter trimming and quality filtering of raw reads.
Bowtie2 & SAMtools: For mapping reads back to MAGs to assess coverage uniformity.

Methodology:

Initial Binning & Quality Screening: Generate MAGs using at least two binners (e.g., MetaBAT2, MaxBin2). Use DASTool to create consensus bins. Assess all bins with CheckM2. Retain only MAGs meeting MIMAG "medium-quality" threshold (≥50% complete, <10% contaminated).
Contamination-Specific Screening: Run all retained MAGs through GUNC. Reject or manually curate (via anvi'o) any MAG that fails the GUNC check (pass.GUNC = False, i.e., a clade separation score, CSS, above the default 0.45 cutoff).
Coverage Profile Validation: Map quality-filtered reads back to each curated MAG using Bowtie2. Generate per-base coverage with SAMtools (depth). Visually inspect coverage plots for sharp, unimodal distributions. Discard MAGs with multi-modal coverage, indicating co-binned populations.
GTDB Classification: Run the refined, high-confidence MAG set through GTDB-Tk (classify_wf). The resulting bacterial classification file (gtdbtk.bac120.summary.tsv) provides the taxonomy, classification confidence (based on marker gene support), and place in the reference tree.

MAG Refinement and GTDB Classification Pipeline

Protocol 2: Resolving Incomplete Taxonomy via Phylogenomic Reconciliation

When GTDB-tk assigns a Marinisomatota MAG to an "unclassified" genus or family, follow this protocol to contextualize its placement.

Materials:

GTDB-Tk (de_novo_wf, which runs the identify, align, and infer steps): Generates a multiple sequence alignment (MSA) and tree including your MAGs and the GTDB reference genomes.
FastTree/IQ-TREE2: For maximum-likelihood tree inference if custom analysis is needed.
GTDB Website/API: To access the current taxonomy (release 220+) and browse reference trees.
Interactive Tree of Life (iTOL): For visualization and annotation of phylogenetic trees.

Methodology:

Phylogenetic Inference: Run the GTDB-Tk de_novo_wf workflow on your MAG set to infer a tree that contains the MAGs together with the GTDB reference genomes. Visualize the resulting Newick tree in iTOL.
Clade Examination: Identify the MAG's precise position. Note the bootstrap support or posterior probability at the node where it branches. Examine the taxonomy of its closest reference genome siblings.
Taxonomic Proposal Evaluation: If your MAG forms a coherent, novel clade with other uncultivated MAGs from public databases (with strong support), it may represent a candidate for a novel genus/family. Use the GTDB's msa and mask files to calculate Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI) versus its closest relatives using tools like compareM or PyANI.
Reporting: For publication, report the GTDB taxonomy string (e.g., d__Bacteria;p__Marinisomatota;c__...;g__;s__). Clearly distinguish between classified ranks and placeholder names (g__UBA1234). Reference the GTDB release number (e.g., R220).
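
When preparing these reports, it can help to split the taxonomy string programmatically. The following is a minimal sketch, assuming the standard seven-rank GTDB string format; the function names and the example lineage (including the f__UBA1234 placeholder) are hypothetical, and the digit-based placeholder check is only a heuristic.

```python
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb_taxonomy(lineage):
    """Map each rank name to its label, or None when the rank is unassigned."""
    parsed = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix not in RANKS:
            raise ValueError(f"unexpected rank prefix in {field!r}")
        parsed[RANKS[prefix]] = name or None
    return parsed


def unassigned_ranks(parsed):
    """List ranks left empty by the classifier, from domain down to species."""
    return [rank for rank in RANKS.values() if parsed.get(rank) is None]


def looks_like_placeholder(name):
    """Heuristic: placeholder names such as UBA1234 contain digits."""
    return name is not None and any(ch.isdigit() for ch in name)


lineage = ("d__Bacteria;p__Marinisomatota;c__Marinisomatia;"
           "o__Marinisomatales;f__UBA1234;g__;s__")
parsed = parse_gtdb_taxonomy(lineage)
print(unassigned_ranks(parsed))                   # ranks with no assignment
print(looks_like_placeholder(parsed["family"]))   # placeholder family name?
```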

Resolving Unclassified Taxonomy via Phylogenomics

Application Notes

Within a thesis investigating the phylogenetic diversity and metabolic potential of the Marinisomatota phylum (the GTDB name for the lineage formerly known as Marinimicrobia or SAR406), the interpretation of GTDB-Tk outputs is critical. Ambiguities, such as low support values and unclassified branches, are common but can be systematically addressed to refine taxonomic hypotheses.

1. Quantitative Analysis of Ambiguity: Common metrics from GTDB-Tk phylogenetic trees require careful scrutiny. The following table summarizes key thresholds for interpretation.

Table 1: Interpretation of Support Metrics in GTDB-Tk Phylogenetic Trees

Metric	Range	Typical Threshold for Robustness	Interpretation in Marinisomatota Context
SH-like (aLRT) Support	0-1	≥ 0.9	Values < 0.7 indicate high ambiguity; branch placement is unreliable for novel lineages.
Bootstrap Support	0-100	≥ 80	Values between 50-80 suggest caution; topology may change with more data.
Taxonomic Rank Support	Classified/Unclassified	N/A	An "unclassified" label at the genus or family level often correlates with support values < 0.8.
Placement Distance (RF)	0-1	≤ 0.3	Distance > 0.5 from a defined reference suggests a potentially novel clade.

2. Protocol for Resolving Ambiguities: Follow this sequential workflow to investigate ambiguous classifications.

Protocol 1: Multi-Marker Tree Reconciliation

Objective: To validate or correct the GTDB-Tk classification of a Marinisomatota genome (e.g., bin_23) showing low support at the family level.

Materials:

Input Data: GTDB-Tk output directory (gtdbtk_output/) for the genome of interest.
Software: GTDB-Tk (v2.3.0+), IQ-TREE2 (v2.2.0+), CheckM2, FastANI.
Databases: GTDB reference data (R214 or newer).

Methodology:

Extract Marker Alignment: From the GTDB-Tk output, locate the multiple sequence alignment (MSA) file containing your genome (e.g., gtdbtk.bac120.user_msa.fasta, possibly gzipped, in the align/ subdirectory).
Build a Custom Tree: Isolate the MSA for your genome and its closest 50-100 reference genomes from the GTDB bac120.msa file. Use taxonkit to gather relevant GTDB taxa IDs.
Phylogenetic Inference: Infer a maximum-likelihood tree with IQ-TREE2, for example: iqtree2 -s custom_msa.fasta -m MFP -B 1000 -alrt 1000 -T 8 (custom_msa.fasta is a placeholder for your subset alignment).

Congruence Test: Visually and quantitatively compare the topology and support values of this custom tree with the GTDB-Tk output tree. Use the Robinson-Foulds distance.
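Complement with Genome Metrics: Calculate CheckM2 completeness/contamination and perform an ANI analysis (fastANI) against the genomes in the ambiguous clade to confirm or refute species-level grouping (threshold ~95% ANI). Because genomes from different species within a genus typically fall well below this value, assess genus-level grouping with AAI and tree position rather than ANI alone.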
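
The congruence test above relies on the Robinson-Foulds (RF) distance, which counts the bipartitions present in one tree but not the other. The sketch below shows the calculation on toy trees written as nested tuples of leaf names; this representation, the function names, and the example leaves are illustrative assumptions, and in practice a tree library would parse the Newick files first.

```python
def leaf_set(node):
    """Return the set of leaf names below a node (a str leaf or a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def bipartitions(tree):
    """Collect the non-trivial splits induced by internal branches."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        rest = all_leaves - clade
        # Skip trivial splits that isolate fewer than two leaves on one side.
        if len(clade) > 1 and len(rest) > 1:
            splits.add(frozenset([clade, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Return (RF distance, normalized RF) for two trees on the same leaves."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same leaves")
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    rf = len(a ^ b)
    max_rf = len(a) + len(b)
    return rf, (rf / max_rf if max_rf else 0.0)


# Toy example: bin_23 pairs with ref_A in one tree and with ref_B in the other.
gtdbtk_tree = ((("bin_23", "ref_A"), "ref_B"), ("ref_C", "ref_D"))
custom_tree = ((("bin_23", "ref_B"), "ref_A"), ("ref_C", "ref_D"))
rf, normalized_rf = robinson_foulds(gtdbtk_tree, custom_tree)
print(rf, normalized_rf)
```

Splits are stored as unordered pairs of leaf sets, so the result does not depend on where either tree happens to be rooted.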

Protocol 2: Metabolic Profiling for Taxonomic Inference

Objective: Use functional signatures to support the placement of an unclassified Marinisomatota branch.

Materials:

Input Data: Annotated genome (produced via Prokka or DRAM).
Software: KofamScan, HMMER, custom metabolic pathway scripts.
Databases: KEGG, dbCAN, TIGRFAMs.

Methodology:

Profile Marker HMMs: Beyond the standard GTDB marker sets (bac120 for bacteria, ar53 for archaea), search for phylum- or class-specific conserved protein families using TIGRFAMs and custom HMMs.
Signature Pathway Analysis: Annotate pathways relevant to Marinisomatota's marine niche (e.g., sulfated polysaccharide utilization, prokaryotic proteorhodopsin). Map the presence/absence pattern across the ambiguous clade and its reference relatives.
Create Functional Distance Matrix: Generate a Jaccard distance matrix based on the presence/absence of ~500 core KEGG Orthologs. Construct a neighbor-joining tree and compare its topology to the GTDB-Tk tree. Congruent clustering despite low sequence support strengthens a novel classification hypothesis.
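
The functional distance step can be illustrated with a short Jaccard calculation over KO presence/absence sets. This is a minimal sketch: the genome names and KO identifiers are placeholders, and a real analysis would use the roughly 500 core KOs described above rather than a handful.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of KO identifiers."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Return sorted genome names and a square Jaccard distance matrix."""
    names = sorted(profiles)
    matrix = [
        [jaccard_distance(profiles[x], profiles[y]) for y in names]
        for x in names
    ]
    return names, matrix


# Placeholder KO sets for one query bin and two reference genomes.
profiles = {
    "bin_23": {"K00001", "K00002", "K00003"},
    "ref_A": {"K00002", "K00003", "K00004"},
    "ref_B": {"K00005"},
}
names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

The resulting matrix can be passed to any neighbor-joining implementation to build the functional tree for comparison with the GTDB-Tk topology.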

Visualizations

GTDB-Tk Ambiguity Resolution Workflow

Example Ambiguous Branch with Low Support

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Resources for Resolving GTDB-Tk Ambiguities

Item	Function/Description	Source/Example
GTDB-Tk Reference Data (R214+)	Essential database containing alignments, trees, and taxonomy for classification. Always use the version matching your GTDB-Tk install.	GTDB Website
IQ-TREE2 Software	For robust, custom phylogenetic tree inference with modern support metrics (SH-aLRT, UFBoot).	http://www.iqtree.org/
CheckM2 / GTDB-Tk QC	Provides essential genome quality metrics (completeness, contamination). Poor quality can cause ambiguous placement.	CheckM2 GitHub
FastANI	Computes Average Nucleotide Identity for precise genus/species boundary assessment against reference genomes.	FastANI GitHub
KofamScan & KEGG Database	For functional profiling and identifying conserved metabolic signatures that support taxonomic grouping.	KofamScan GitHub
Custom HMM Library	A collection of Hidden Markov Models for protein families specific to Marinisomatota or the related FCB superphylum.	Constructed via hmmbuild from curated alignments.
Taxonkit	A powerful CLI tool for parsing and filtering NCBI/GTDB-style taxonomy files efficiently.	Taxonkit GitHub

1. Introduction & Thesis Context

Within the broader thesis investigating the evolutionary genomics and biotechnological potential of the Marinisomatota phylum (GTDB designation; formerly Marinimicrobia/SAR406, affiliated with the FCB group), efficient computational resource management is paramount. Analysis of large-scale metagenomic and isolate datasets demands strategic optimization to enable high-fidelity taxonomic classification, pangenome construction, and functional profiling. These protocols are designed to maximize throughput and accuracy while minimizing computational cost and time.

2. Quantitative Resource Benchmarks for Common Tasks

The following table summarizes resource requirements for key analytical steps, benchmarked on a representative dataset of 500 metagenome-assembled genomes (MAGs) binned as Marinisomatota.

Table 1: Computational Resource Benchmarks for Core Analysis Tasks

Analytical Task	Software (Example)	Typical Dataset Size	CPU Cores Recommended	RAM (GB)	Wall Time (HH:MM)	Storage I/O
Quality Control & Adapter Trimming	fastp v0.23.4	1B PE reads (2x150bp)	16	32	02:30	High
Metagenome Assembly	MEGAHIT v1.2.9	1B PE reads	64	512	24:00+	Very High
Genome Binning	MetaBAT2 v2.15	500 contigs (>2.5kbp)	8	64	04:00	Medium
GTDB-Tk Classification	GTDB-Tk v2.3.0	500 MAGs	16	128	06:00	Medium
Pangenome Analysis	Anvi'o v7.1	50 Marinisomatota genomes	32	256	12:00	High
Functional Annotation	Prokka v1.14.6	1 MAG (~5 Mbp)	4	16	01:00	Low

3. Detailed Experimental Protocols

Protocol 3.1: Optimized GTDB Taxonomic Classification Pipeline
Objective: To classify putative Marinisomatota MAGs using the Genome Taxonomy Database Toolkit (GTDB-Tk) with resource-efficient prioritization.

Pre-classification Filtering: Filter MAGs using CheckM2 to select only those with >50% completeness and <10% contamination. This reduces downstream compute time on low-quality bins.
Batch Job Configuration: Package MAGs into batches of 50-100 genomes per SLURM/Job Scheduler array job.
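GTDB-Tk Execution Command: For each batch, run gtdbtk classify_wf --genome_dir <batch_dir> --out_dir <batch_out> --cpus 16. GTDB-Tk v2.2 and later additionally require either --mash_db <path> or --skip_ani_screen.
Post-processing: Concatenate batch results (gtdbtk.bac120.summary.tsv, gtdbtk.ar53.summary.tsv) and filter for classifications within the Marinisomatota phylum (e.g., p__Marinisomatota_A, p__Marinisomatota_B).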
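
The batching and filtering steps of this protocol can be sketched as follows. The helper names, directory names, and file paths are hypothetical, and the command-line flags reflect GTDB-Tk v2.3 as best understood here; confirm them with gtdbtk classify_wf -h for your installed version before launching jobs.

```python
def make_batches(paths, batch_size=50):
    """Split a list of MAG paths into consecutive batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def classify_wf_command(genome_dir, out_dir, cpus=16, extension="fa"):
    """Build an argument list for one GTDB-Tk classify_wf batch job."""
    return [
        "gtdbtk", "classify_wf",
        "--genome_dir", genome_dir,
        "--out_dir", out_dir,
        "--cpus", str(cpus),
        "--extension", extension,
        "--skip_ani_screen",  # assumption: no Mash database is configured
    ]


def is_marinisomatota(classification):
    """True when the phylum field starts with p__Marinisomatota."""
    return any(
        field.strip().startswith("p__Marinisomatota")
        for field in classification.split(";")
    )


# 120 hypothetical MAG files split into array-job batches of 50.
mag_paths = [f"mags/bin_{i:03d}.fa" for i in range(120)]
batches = make_batches(mag_paths, batch_size=50)
commands = [
    classify_wf_command(f"batch_{n:03d}", f"gtdbtk_out_{n:03d}")
    for n, _ in enumerate(batches)
]
print(len(batches), " ".join(commands[0]))
```

Each argument list can be written into a scheduler array script or passed to subprocess.run(command, check=True) once the batch directories have been populated.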

Protocol 3.2: Resource-Aware Comparative Genomics Workflow
Objective: To construct a pangenome from curated Marinisomatota genomes without exhausting memory.

Dereplication: Use dRep v3.4.0 to cluster genomes at 99% ANI to reduce redundancy.
Annotation with Prokka (Parallelized): Use GNU Parallel to annotate dereplicated genomes simultaneously across allocated nodes.
Pangenome Construction: Use Roary v3.13.0 and set the MCL inflation value explicitly with -iv (1.5 is the default; higher values split gene families into finer clusters) to control core/accessory separation.
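
The three steps above can be sketched as command builders. These are illustrative argument lists, not the original commands from this protocol: the file names and output directories are hypothetical, and the flags shown (dRep -sa and -p, Prokka --outdir/--prefix/--cpus, Roary -p/-f/-iv) should be checked against each tool's help text for the installed version.

```python
def drep_command(genomes, out_dir, ani=0.99, threads=16):
    """dRep dereplication at the given secondary ANI threshold (-sa)."""
    return ["dRep", "dereplicate", out_dir, "-g", *genomes,
            "-sa", str(ani), "-p", str(threads)]


def prokka_command(genome, out_dir, prefix, cpus=4):
    """Prokka annotation of one dereplicated genome."""
    return ["prokka", "--outdir", out_dir, "--prefix", prefix,
            "--cpus", str(cpus), genome]


def roary_command(gff_files, out_dir, threads=8, inflation=1.5):
    """Roary pangenome run; -iv sets the MCL inflation value."""
    return ["roary", "-p", str(threads), "-f", out_dir,
            "-iv", str(inflation), *gff_files]


genomes = ["mag_a.fa", "mag_b.fa"]  # hypothetical input genomes
drep = drep_command(genomes, "drep_out")
prokka = prokka_command("mag_a.fa", "prokka_mag_a", "mag_a")
roary = roary_command(["mag_a.gff", "mag_b.gff"], "roary_out")
for command in (drep, prokka, roary):
    print(" ".join(command))
```

Note that Roary's -i flag sets the minimum BLASTP percent identity rather than the MCL inflation value, so the two should not be confused when tuning clustering for a diverse phylum.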




4. Mandatory Visualization

Diagram Title: Marinisomatota MAG Analysis and Classification Pipeline

Diagram Title: Computational Resource Decision Tree

5. The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Computational Tools & Data Resources for Marinisomatota Research



Item Name	Type	Primary Function in Analysis	Resource Optimization Tip
GTDB-Tk v2.3.0+	Software/Reference Data	Assigns robust taxonomy using GTDB reference tree. Critical for placing novel Marinisomatota.	Use `--scratch_dir` (ideally on fast local SSD) to reduce pplacer memory usage, at some cost in speed.
CheckM2	Software/Model	Rapid assessment of MAG quality (completeness/contamination).	Use the pre-trained model; runs significantly faster with lower memory than CheckM1.
dRep	Software	Dereplicates genome sets based on ANI. Reduces computational load for downstream steps.	Adjust `-nc` (minimum alignment coverage between genome pairs) to suit MAG fragmentation and completeness so that relevant diversity is retained.
Roary	Software	Rapid large-scale prokaryote pangenome analysis. Identifies core/accessory genes.	Use `-iv` to set the MCL inflation value (default 1.5); `-i` sets the minimum BLASTP percent identity (default 95) and may need lowering for diverse sets.
Prokka	Software	Rapid annotation of bacterial genomes. Provides standard GFF3 for downstream tools.	Use `--metagenome` flag and `--mincontiglen` to optimize for MAG annotation.
GTDB R214	Reference Database	Provides the standardized taxonomic framework and alignments for classification.	Download to a shared, high-performance filesystem to avoid redundant copies.
PFAM & TIGRFAM	HMM Database	For functional annotation of protein families within Marinisomatota genomes.	Combine with tools like `anvi-run-hmms` for efficient, parallelized annotation.
Slurm / SGE	Job Scheduler	Manages resource allocation on HPC clusters for parallelizable workflows.	Implement job arrays for classifying or annotating 100s of genomes efficiently.

Application Notes & Protocols

Within the Genome Taxonomy Database (GTDB) framework, the phylum Marinisomatota (formerly candidate phylum SAR406) presents unique challenges in taxonomic placement due to its deep evolutionary branching and frequent genomic bridging to related candidate phyla like Muirbacteria, Uhrbacteria, and Gribaldobacteria. Accurate classification is critical for interpreting its ecological role in marine systems and assessing its potential in bioprospecting for novel enzymes or bioactive compounds.

1. Quantitative Data Summary: Key Genomic & Phylogenetic Markers

Table 1: Core Genome & Phylogenetic Marker Comparison Across Bridging Phyla

Feature / Marker	Marinisomatota (GTDB r214)	Muirbacteria (GTDB)	Uhrbacteria (GTDB)	Bridging Genome Example (Bin.123)
Average Genome Size (Mbp)	1.8 - 2.3	1.5 - 1.9	1.6 - 2.1	2.05
Average GC Content (%)	44 - 48	50 - 54	38 - 42	46.2
tRNA Count (avg.)	33	35	32	34
16S rRNA Identity to Marinisomatota (%)	100 (ref)	78.2 - 81.5	75.8 - 79.1	92.3
Concatenated Marker (120) AAI to Marinisomatota (%)	100 (ref)	60.5 - 62.8	58.9 - 61.2	85.7
CheckM2 Completeness (%)	>95 (high-quality)	>90	>90	96.4
CheckM2 Contamination (%)	<5	<5	<5	1.2
Presence of Diagnostic Pathway	Yes (Partial TCA)	No	No	Yes

Table 2: Diagnostic Metabolic Pathway Gene Presence/Absence

Pathway Gene	Marinisomatota Consensus	Bridging Genome Annotation	Function & Taxonomic Relevance
Fumarate hydratase (class II, fumC) `[K01679]`	+	+	Key TCA cycle enzyme; conserved in Marinisomatota.
Rhodanese-domain protein `[K01011]`	+	+	Sulfur metabolism; a phylum-associated trait.
Group 3 [NiFe] hydrogenase `[K06281, K06282]`	+	+	Energy metabolism in anoxic environments.
Archaeal-like Rubisco (rbcL)	-	-	Distinguishes from photosynthetic relatives.

2. Experimental Protocols

Protocol 1: Integrated Phylogenomic Placement of Ambiguous Genomes

Objective: To resolve classification of a genome bridging Marinisomatota and related phyla using GTDB toolkit and supplementary analysis.

Materials: High-quality metagenome-assembled genome (MAG), GTDB-Tk v2.3.0, CheckM2, Python environment with SciKit-bio, FastTree, IQ-TREE2.

Procedure:

Quality Assessment: Run checkm2 predict --input <mag.fasta> ... to assess completeness & contamination. Proceed only if completeness >90% & contamination <5%.
GTDB-Tk Default Classification: Execute gtdbtk classify_wf --genome_dir <dir> --out_dir <output> --cpus 8 (GTDB-Tk v2.2 and later additionally require either --mash_db <path> or --skip_ani_screen). Record the classification, the classification method, and the RED value reported in the summary file.
Marker Extraction & Tree Building: If placement is ambiguous (e.g., weakly supported or conflicting between methods), extract and align the 120 bacterial marker genes using gtdbtk identify and gtdbtk align. Create a custom concatenated alignment.
Reference Tree Construction: Build a reference tree with IQ-TREE2 (iqtree2 -s concat.align -m MFP -B 1000 -T 8) using a curated set of reference genomes from Marinisomatota, Muirbacteria, Uhrbacteria, and an outgroup.
Placement of Query Genome: Place the bridging genome onto the reference tree using pplacer (the placement engine GTDB-Tk runs internally) or EPA-ng in conjunction with the reference alignment.
Average Amino Acid Identity (AAI) Calculation: Calculate AAI between the query and all reference genomes using comparem aai_wf (https://github.com/dparks1134/CompareM). An AAI >80% suggests phylum-level affiliation; 60-80% indicates separate but related phyla.
Consensus Classification: Synthesize results from the GTDB-Tk classification, phylogenetic placement, and AAI. A genome is classified as Marinisomatota if it: a) clusters within the Marinisomatota monophyletic clade with >70% bootstrap support, b) shares AAI >80% with Marinisomatota references, and c) retains key metabolic markers (Table 2).
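
The three consensus criteria can be encoded as an explicit decision rule, which makes the evidence behind each call auditable. In this sketch the Evidence record and its field names are hypothetical, "retains key metabolic markers" is interpreted as all expected markers being present (an assumption), and the bootstrap value in the example is made up; the AAI value is the Bin.123 figure from Table 1.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence gathered for one query genome (hypothetical record layout)."""
    in_marinisomatota_clade: bool
    bootstrap_support: float   # percent support for the clade
    aai_to_references: float   # percent AAI to Marinisomatota references
    markers_present: int       # diagnostic markers detected
    markers_expected: int      # diagnostic markers in the consensus profile


def consensus_call(ev, min_bootstrap=70.0, min_aai=80.0):
    """Return (is_marinisomatota, per-criterion results)."""
    checks = {
        "phylogeny": ev.in_marinisomatota_clade
        and ev.bootstrap_support > min_bootstrap,
        "aai": ev.aai_to_references > min_aai,
        "markers": ev.markers_present == ev.markers_expected,
    }
    return all(checks.values()), checks


bin_123 = Evidence(
    in_marinisomatota_clade=True,
    bootstrap_support=88.0,   # illustrative value only
    aai_to_references=85.7,
    markers_present=3,
    markers_expected=3,
)
accepted, checks = consensus_call(bin_123)
print(accepted, checks)
```

Returning the per-criterion dictionary alongside the verdict makes it easy to report which line of evidence failed when a genome is not accepted.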

Protocol 2: Validation via Diagnostic Metabolic Profiling

Objective: To validate phylogenetic placement by confirming the presence of Marinisomatota-diagnostic metabolic pathways.

Materials: Annotated genome (e.g., using PROKKA or DRAM), KEGG database, HMMER suite, custom HMM profiles for diagnostic genes.

Procedure:

Functional Annotation: Annotate the bridging genome using prokka --outdir <dir> --prefix <mag> <mag.fasta>.
Target Gene HMM Search: Using hmmsearch with an E-value cutoff of 1e-20, search the translated proteome against custom HMMs built for diagnostic genes (e.g., fumarate hydratase class II, rhodanese-domain protein).
Pathway Reconstruction: Use the annotated KO terms to map genes to KEGG modules (e.g., M00009, TCA cycle). Manual curation is required to confirm pathway completeness and identify phylum-specific variants.
Comparative Analysis: Compare the reconstructed pathways to the consensus profiles in Table 2. A bridging genome showing a Marinisomatota-like profile provides functional evidence supporting its classification.

3. Visualizations

Title: Workflow for Resolving Phylogenomic Classification

Title: Diagnostic Partial TCA Cycle in Marinisomatota

4. The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item / Tool	Function in Analysis	Example / Note
GTDB-Tk (v2.3.0+)	Standardized taxonomic classification relative to GTDB phylogeny.	Uses ~120 bacterial marker genes & pplacer for placement.
CheckM2	Estimates genome completeness & contamination rapidly.	Superior for genomes from novel lineages vs. CheckM1.
CompareM	Calculates Average Amino Acid Identity (AAI) & ANI.	Critical for quantifying genomic relatedness between phyla.
IQ-TREE2	Phylogenetic inference with model testing & fast bootstrapping.	For building robust reference trees.
PROKKA / DRAM	Rapid genome annotation & metabolic profiling.	DRAM specializes in metabolic pathway distillation for microbes.
Custom HMM Profiles	Detects conserved, phylum-diagnostic protein families.	Build using `hmmbuild` from curated alignments of target genes.
KEGG MODULE Database	Reference for pathway completeness assessment.	Manual curation required due to pathway variability in uncultivated lineages.
PhyloPhlAn 3.0	Alternative for phylogeny using ~400 universal markers.	Useful as an orthogonal method to GTDB-Tk.

Best Practices for Curation and Submission of Novel Marinisomatota Genomes

Application Notes

The accurate classification of novel genomes within the phylum Marinisomatota (formerly known as KS3-B09 or SAR406) is critical for advancing our understanding of their role in marine biogeochemical cycles and for exploring their biosynthetic potential. As per the Genome Taxonomy Database (GTDB) taxonomy (release 220), Marinisomatota is a distinct bacterial phylum primarily comprising uncultivated lineages from oceanic and deep-sea environments. Curation and submission of genomes from this group present unique challenges due to their frequent assembly from complex metagenomic datasets and their phylogenetic depth. Adherence to standardized practices ensures genomic data integrity, facilitates reproducible taxonomy, and enables downstream drug discovery pipelines to accurately target novel enzymatic pathways from these enigmatic organisms.

Protocols

1. Genome Assembly and Curation Protocol

Objective: To reconstruct high-quality Marinisomatota genomes from metagenomic sequence data.

Detailed Methodology:

Sequence Pre-processing: Use fastp (v0.23.4) with parameters --detect_adapter_for_pe --trim_poly_g --cut_front --cut_tail to remove adapters and low-quality bases.
Co-assembly: Perform de novo assembly on quality-filtered reads using MEGAHIT (v1.2.9) with meta-large presets or SPAdes (v3.15.5) in --meta mode for higher complexity samples.
Binning: Execute binning using multiple tools: MetaBAT2 (v2.15), MaxBin2 (v2.2.7), and CONCOCT (v1.1.0). Generate a consensus set of bins using DAS Tool (v1.1.6).
Taxonomic Assignment: Assign preliminary taxonomy to bins using GTDB-Tk (v2.3.2) with the classify_wf command and database release R220.
Genome Refinement: For bins classified as Marinisomatota, perform manual refinement in Anvi'o (v7.1). Map reads back to the bin, inspect coverage and tetranucleotide frequency outliers, and remove contaminating contigs.
Completeness/Contamination Assessment: Run CheckM2 (v1.0.1) to estimate genome completeness and contamination. Proceed only with medium-quality (MQG; ≥50% complete, <10% contaminated) or high-quality (HQG; ≥90% complete, <5% contaminated) genomes.

2. Phylogenomic Placement and Classification Protocol

Objective: To determine the precise taxonomic position of a novel Marinisomatota genome within the GTDB framework.

Detailed Methodology:

Protein Marker Extraction: Use GTDB-Tk's identify and align commands to extract and align 120 bacterial single-copy marker genes.
Reference Tree Placement: Infer a phylogenetic tree with the infer command and root it (e.g., with the root command), so that the novel genome is positioned among the GTDB reference genomes.
Relative Evolutionary Divergence (RED) Calculation: The GTDB-Tk classify workflow automatically calculates the RED value, a quantitative measure of phylogenetic divergence.
Taxonomic Assignment: Assign taxonomy based on the genome's position relative to defined RED thresholds for each rank. Novel lineages are indicated by alphanumeric placeholder names (e.g., "UBA...", derived from the Uncultivated Bacteria and Archaea genome set).

3. Genome Submission and Annotation Protocol

Objective: To submit curated genomes to public repositories with standardized annotations.

Detailed Methodology:

Functional Annotation: Annotate the genome using PROKKA (v1.14.6) for rapid gene calling, or a comprehensive pipeline: DRAM (v1.4.4) for metabolism and KofamScan for KEGG orthologs.
Biosynthetic Gene Cluster (BGC) Identification: Run antiSMASH (v7.0) or DeepBGC to identify potential secondary metabolite BGCs, a key interest for drug development.
Metadata Collection: Compile minimal and contextual metadata as per the Genomic Standards Consortium MIxS checklists, emphasizing environmental parameters (depth, salinity, temperature).
Submission: Submit the genome assembly, annotated features, and raw reads to the International Nucleotide Sequence Database Collaboration (INSDC) via the NCBI, ENA, or DDBJ submission portals.

Data Presentation

Table 1: Genomic Quality Standards for Marinisomatota Submissions

Quality Tier	Completeness	Contamination	# of Contigs	N50 (kb)	GTDB Designation
High Quality	≥ 90%	< 5%	< 500	> 20	HQG
Medium Quality	≥ 50%	< 10%	< 1000	> 10	MQG
Low Quality	< 50%	≥ 10%	Not Applicable	Not Applicable	Exclude from taxonomy

Table 2: Key GTDB Metrics for Novel Marinisomatota Classification

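Taxonomic Rank	Approximate Median RED (release-dependent)	Action for Novel Genome
Species Cluster	Not RED-based (~95% ANI to the species representative)	Assign spXXXXXXX label if outside existing cluster.
Genus	~0.93	Assign a placeholder name (e.g., 'UBA' or 'GCA' based) if the genome falls outside existing genera.
Family	~0.77	Assign a placeholder name (e.g., 'UBA' based) if novel lineage at family level.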

Mandatory Visualizations

Title: Genome Curation & Submission Workflow

Title: GTDB Phylogenomic Classification Protocol

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools and Resources for Marinisomatota Genome Research

Item / Tool	Function / Purpose	Source / Example
GTDB-Tk (v2.3.2+)	Standardized toolkit for phylogenomic classification using the GTDB database. Essential for taxonomy.	GitHub: ecogenomics/gtdbtk
CheckM2	Rapid and accurate estimation of genome completeness and contamination in bacterial genomes.	GitHub: chklovski/CheckM2
DRAM (Distilled & Refined Annotation of Metabolism)	Comprehensive functional annotation pipeline, highlighting metabolic potential and virulence.	GitHub: WrightonLabCSU/DRAM
antiSMASH	Identifies Biosynthetic Gene Clusters (BGCs) for secondary metabolites; crucial for drug discovery screens.	https://antismash.secondarymetabolites.org
Anvi'o	Interactive platform for microbial 'omics, essential for manual bin refinement and visualization.	http://merenlab.org/software/anvio/
MIxS Checklists	Standardized metadata reporting formats to ensure data reproducibility and integration.	Genomic Standards Consortium
NCBI Prokaryotic Genome Annotation Pipeline (PGAP)	Recommended for consistent structural and functional annotation prior to INSDC submission.	NCBI GitHub

GTDB vs. Legacy Systems: Validating Marinisomatota Taxonomy and Comparative Genomics Insights

Within the broader thesis on the genomic and metabolic diversity of the Marinisomatota phylum (formerly known as Marinimicrobia), accurate taxonomic classification is a foundational challenge. This phylum, prevalent in marine environments, exhibits significant metabolic versatility with implications for biogeochemical cycling and potential biotechnological applications. The recent adoption of the Genome Taxonomy Database (GTDB) taxonomy, based on conserved single-copy marker genes and relative evolutionary divergence, often conflicts with the established but sometimes phenotypically influenced NCBI taxonomy. This discrepancy is particularly pronounced for Marinisomatota, where numerous reclassifications and the delineation of new candidate phyla (e.g., Candidatus Uhrbacteria) have been proposed. This application note provides a protocol for benchmarking these two classification systems for Marinisomatota clades, enabling researchers to critically evaluate genomic data within a consistent framework for downstream ecological, evolutionary, and drug discovery research.

Core Quantitative Comparison

Table 1: High-Level Taxonomic Comparison for Marinisomatota

Taxonomic Rank	NCBI Taxonomy (as of latest update)	GTDB Release R214 (April 2023)	Notes/Implications
Phylum	Marinimicrobia (PRI)	p__Marinisomatota	GTDB uses the name Marinisomatota.
Class Level	Multiple candidate classes (e.g., SAR406 clade)	c__Marinisomatia (and others split into separate phyla)	GTDB splits the group into multiple phylum-level taxa.
Order Level	Not consistently defined	o__Marinisomatales (within p__Marinisomatota)	Clearer hierarchical structure in GTDB.
Representative Genus	Marinimicrobium, Candidatus Litoricolaceae	Multiple genera under Marinisomatales (e.g., UBA10353, MSA-10)	Genus-level assignments differ radically.
Number of Genome Assemblies	~500+ labeled under Marinimicrobia	~400+ classified under p__Marinisomatota and related new phyla.	Counts vary due to reclassification.

Table 2: Benchmarking Metrics for a Representative Clade (e.g., SAR406)

Metric	NCBI Taxonomy Classification Result	GTDB Taxonomy Classification Result	Benchmarking Advantage
Average Amino Acid Identity (AAI) within group	65.2% ± 5.1%	72.8% ± 3.5%	GTDB classification yields more genomically coherent groups.
Percentage of Conserved Single-Copy Marker Genes	89%	98%	GTDB groups maintain higher essential gene content.
Relative Evolutionary Divergence (RED) Score	Not applied	0.65 (clearly delineated from sister phyla)	Provides quantitative rank normalization.
Congruence with 16S rRNA Gene Tree	Moderate (long-branch attraction issues)	High for defined taxa (uses 120 concatenated proteins)	Improved phylogenetic resolution.

Experimental Protocols

Protocol 3.1: Genome Dataset Curation and Taxonomic Labeling

Objective: To assemble a balanced genome dataset with dual (NCBI & GTDB) labels for benchmarking.

Materials:

High-performance computing cluster or server.
ncbi-genome-download tool.
GTDB-Tk v2.3.0 software package & corresponding R214 data files.
Custom Python scripts for data parsing (available in thesis repository).

Procedure:

NCBI Genome Retrieval: Using ncbi-genome-download, download all bacterial genomes associated with the NCBI taxon ID for Marinimicrobia. Use the --assembly-levels complete,chromosome,scaffold filter.
GTDB Classification: Run GTDB-Tk (classify_wf) on the downloaded genome assemblies. This will assign GTDB taxonomy based on the R214 reference tree.
Create Mapping Table: Parse the NCBI assembly reports and the GTDB-Tk output to generate a master table with columns: Assembly_Accession, NCBI_Phylum, NCBI_Class, GTDB_Phylum, GTDB_Class, GTDB_Red_Value.
Filter & Balance: Filter out low-quality genomes (<90% completeness, >5% contamination as assessed by CheckM2). Balance the dataset to include representative genomes from each major clade in both systems.
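
The mapping-table step can be sketched with the standard csv module. The example assumes the GTDB-Tk summary columns user_genome, classification, and red_value; the accession, the NCBI labels, and the helper names are hypothetical, and the NCBI labels are assumed to have been parsed from assembly reports beforehand.

```python
import csv
import io


def gtdb_rank(classification, prefix):
    """Extract one rank (e.g. 'p' for phylum) from a GTDB taxonomy string."""
    tag = prefix + "__"
    for field in classification.split(";"):
        field = field.strip()
        if field.startswith(tag):
            return field[len(tag):]
    return ""


def build_mapping(summary_tsv_text, ncbi_labels):
    """Join GTDB-Tk summary rows with NCBI labels keyed by accession."""
    rows = []
    reader = csv.DictReader(io.StringIO(summary_tsv_text), delimiter="\t")
    for record in reader:
        accession = record["user_genome"]
        ncbi = ncbi_labels.get(accession, {})
        rows.append({
            "Assembly_Accession": accession,
            "NCBI_Phylum": ncbi.get("phylum", ""),
            "NCBI_Class": ncbi.get("class", ""),
            "GTDB_Phylum": gtdb_rank(record["classification"], "p"),
            "GTDB_Class": gtdb_rank(record["classification"], "c"),
            "GTDB_Red_Value": record["red_value"],
        })
    return rows


# One made-up summary row and its made-up NCBI labels.
summary_text = (
    "user_genome\tclassification\tred_value\n"
    "GCA_000000001.1\td__Bacteria;p__Marinisomatota;c__Marinisomatia;"
    "o__Marinisomatales;f__;g__;s__\t0.71\n"
)
ncbi_labels = {"GCA_000000001.1": {"phylum": "Marinimicrobia", "class": ""}}
mapping = build_mapping(summary_text, ncbi_labels)
print(mapping[0])
```

The RED value is kept as text because GTDB-Tk may report a non-numeric entry for genomes classified by ANI alone.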

Protocol 3.2: Phylogenomic Tree Reconciliation Analysis

Objective: To visualize and quantify the discordance between classification systems.

Materials:

IQ-TREE 2.2.0 for maximum likelihood phylogeny.
bac120 marker gene set from GTDB or a custom set of 74 universal single-copy genes.
ETE3 Python toolkit for tree analysis and visualization.

Procedure:

Marker Gene Extraction & Alignment: Identify and concatenate the bac120 marker genes from each curated genome using GTDB-Tk or HMMER with custom profiles.
Phylogenomic Inference: Construct a maximum-likelihood tree using IQ-TREE with model LG+F+G and 1000 ultrafast bootstrap replicates.
Tree Annotation: Use ETE3 to map the NCBI and GTDB taxonomy labels onto the tree leaf nodes. Color-code branches based on phylum-level assignment from each system.
Discordance Metric Calculation: Calculate the Robinson-Foulds distance between the phylogenomic tree topology and the hierarchical "tree" implied by each taxonomy system. A lower distance indicates better congruence with the genomic data.

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions & Materials

Item	Function/Benefit in Benchmarking	Source/Example
GTDB-Tk Software	Standardized toolkit for assigning GTDB taxonomy to genomes; ensures reproducibility.	https://github.com/ecogenomics/gtdbtk
CheckM2	Rapid, accurate assessment of genome completeness and contamination; critical for quality filtering.	https://github.com/chklovski/CheckM2
bac120 / ar53 Marker Sets	Curated sets of 120 bacterial and 53 archaeal single-copy genes (the archaeal set was ar122 in older releases); provide standardized data for phylogenomics.	Included with GTDB-Tk.
IQ-TREE	Efficient software for maximum likelihood phylogenetic inference with model selection.	http://www.iqtree.org/
ETE3 Toolkit	Python environment for analyzing, manipulating, and visualizing trees and taxonomies.	http://etetoolkit.org/
NCBI Datasets CLI	Programmatic access to download NCBI genome assemblies and associated metadata.	https://www.ncbi.nlm.nih.gov/datasets/

Mandatory Visualizations

Workflow for Taxonomic Benchmarking

Taxonomic System Comparison Logic

Application Notes

Within the GTDB taxonomic framework, the phylum Marinisomatota (formerly candidate phylum SAR406) represents a deep-branching lineage distinct from its phenotypically and ecologically similar neighbor, Bacteroidota. This analysis highlights key genomic and metabolic features that delineate their evolutionary divergence, critical for interpreting ocean carbon cycling and guiding bioprospecting efforts.

Table 1: Core Genomic & Metabolic Divergence Metrics

Feature	Marinisomatota (Avg.)	Bacteroidota (Avg.)	Implication for Divergence
Genome Size (Mbp)	2.1 - 2.8	4.2 - 6.5	Streamlined, oligotrophic adaptation in Marinisomatota
GC Content (%)	34 - 38	39 - 48	Distinct nucleotide composition & codon bias
16S rRNA Identity (%)	< 75%	Reference	Phylum-level taxonomic separation (GTDB)
Glycoside Hydrolases (GHs)	Low count, specific types	High count, diverse (e.g., GH13, GH16)	Limited polysaccharide diversity in Marinisomatota
Respiratory Chain	Predicted HiPIP → bc1 complex	Diverse (e.g., fumarate reduct., flavin-based)	Unique electron transport via high-potential iron-sulfur protein
Carbon Fixation	RuBisCO-like protein (RLP)	Absent in most	Potential for CO2 metabolism in dark ocean
Nitrogen Metabolism	Nitrate/nitrite transporters	Urease, peptidases	N-source specialization; Marinisomatota targets inorganic N

Table 2: Diagnostic Marker Genes for Phylogenetic Delineation

Gene/Protein Family	Presence in Marinisomatota	Presence in Bacteroidota	Use as Phylogenetic Marker
RNA Polymerase Beta Subunit (rpoB)	Unique conserved inserts	Canonical sequence	GTDB backbone tree placement
Conserved Signature Proteins (CSPs)	21 unique CSPs identified	45 unique CSPs identified	Phylum-specific molecular synapomorphies
HiPIP (High-potential iron-sulfur)	Widespread, conserved	Rare, not conserved	Functional marker for electron transport
Porphobilinogen deaminase (HemC)	Specific variant (MVG)	Specific variant (LAG)	Amino acid motif diagnostic

Protocols

Protocol 1: In Silico Phylogenomic Delineation Using GTDB-Tk

Objective: To reconstruct the phylogenetic position of Marinisomatota genomes relative to Bacteroidota and other adjacent phyla.

Materials:

High-quality metagenome-assembled genomes (MAGs).
Computational cluster with ≥ 32 GB RAM.
GTDB-Tk v2.4.0 or later (https://github.com/ecogenomics/gtdbtk); the release 220 reference data listed below is not compatible with v2.3.x.
Reference data pack (release 220).

Procedure:

Genome Preparation: Place bacterial genome files (.faa for proteins, .fna for nucleotides) in a single directory. Ensure MAG quality (completion > 50%, contamination < 10%).
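Run GTDB-Tk Classify: Run the classify_wf workflow on the genome directory. Command (directory names are placeholders): gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_out --extension fna --skip_ani_screen --cpus 16.
Tree Inference: Run the infer step on the multiple sequence alignment written by the align step, then root the result with an outgroup (e.g., with gtdbtk root), since infer on its own produces an unrooted tree. Command: gtdbtk infer --msa_file <msa.fasta.gz> --out_dir gtdbtk_tree.
Analysis: Visualize the tree (e.g., in iTOL). Check that the Marinisomatota genomes form a monophyletic cluster separate from the Bacteroidota clade, and report the branch support for that split. FastTree, which GTDB-Tk uses for inference, reports SH-like local support values rather than bootstrap values.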
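Once classification has finished, the genomes assigned to each phylum can be pulled from the GTDB-Tk summary table before the tree is inspected. The following is a minimal sketch, assuming the standard tab-delimited summary with user_genome and classification columns; the two example rows are hypothetical.

```python
import csv
import io

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def parse_lineage(classification):
    """Split a 'd__Bacteria;p__X;...' string into a {rank: name} dict."""
    names = [field.strip().split("__", 1)[-1] for field in classification.split(";")]
    return dict(zip(RANKS, names))


def genomes_in_phylum(summary_handle, phylum):
    """Return user_genome IDs whose GTDB phylum equals `phylum`."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return [
        row["user_genome"]
        for row in reader
        if parse_lineage(row["classification"]).get("phylum") == phylum
    ]


# Hypothetical summary content, for illustration only.
example = io.StringIO(
    "user_genome\tclassification\n"
    "mag_001\td__Bacteria;p__Marinisomatota;c__Marinisomatia;o__;f__;g__;s__\n"
    "mag_002\td__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__;g__;s__\n"
)
print(genomes_in_phylum(example, "Marinisomatota"))
```

In practice the handle would be an open file pointing at the bacterial summary written to the classify output directory.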

Protocol 2: Metabolic Pathway Discrepancy Analysis via KEGG Decoder

Objective: To compare and visualize the completeness of core metabolic pathways between the phyla.

Materials:

Annotated genomes (e.g., using PROKKA or DRAM).
KEGG Decoder script (https://github.com/bjtully/BioData/tree/master/KEGGDecoder).
Python3 with matplotlib and seaborn.

Procedure:

Annotation: Annotate all genomes uniformly. Generate KEGG Orthology (KO) assignments using kofamscan.
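Generate Input: Combine the per-genome KO assignments into the tab-delimited gene-to-KO list that KEGG Decoder reads, with each gene ID prefixed by its genome name. A binary matrix of KOs per genome, built with custom scripts, is useful alongside it for direct comparison.
Run KEGG Decoder: Command (file names are placeholders): KEGG-decoder --input ko_assignments.tsv --output pathway_completeness.tsv --vizoption static.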
Visualize: The script generates heatmaps. Key divergent pathways to highlight: Oxidative Phosphorylation (presence of petABC for bc1 complex in Marinisomatota), Glycan Metabolism (depleted in Marinisomatota), and Carbon Fixation (presence of RLP genes).
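The binary KO-per-genome matrix mentioned in the procedure can be built with a few lines of Python. This is an illustrative sketch rather than part of KEGG Decoder; the function name, genome labels, and KO identifiers in the example are placeholders.

```python
from collections import defaultdict


def ko_presence_matrix(assignments):
    """Build a presence/absence matrix from (genome, ko) pairs.

    Returns (genomes, kos, matrix), where matrix[i][j] is 1 if genome i
    carries KO j and 0 otherwise. Rows and columns are sorted by name.
    """
    per_genome = defaultdict(set)
    for genome, ko in assignments:
        per_genome[genome].add(ko)
    genomes = sorted(per_genome)
    kos = sorted(set().union(*per_genome.values())) if per_genome else []
    matrix = [[1 if ko in per_genome[g] else 0 for ko in kos] for g in genomes]
    return genomes, kos, matrix


# Placeholder genome labels and KO identifiers.
pairs = [
    ("marinisomatota_mag", "K00001"),
    ("marinisomatota_mag", "K00002"),
    ("bacteroidota_ref", "K00002"),
    ("bacteroidota_ref", "K00003"),
]
genomes, kos, matrix = ko_presence_matrix(pairs)
for genome, row in zip(genomes, matrix):
    print(genome, row)
```

Duplicate assignments of the same KO within a genome collapse to a single 1, which is what a presence/absence comparison needs.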

Diagrams

Title: Phylogenomic & Metabolic Analysis Workflow

Title: Key Divergence Traits of Marinisomatota

The Scientist's Toolkit

Table 3: Essential Research Reagents & Solutions

Item	Function in Comparative Genomics	Example Product/Reference
GTDB-Tk Reference Data	Provides standardized bacterial/archaeal marker set & taxonomy for consistent phylogenomic placement.	GTDB Release 220 (R220)
KEGG KofamScan Database	Profile HMM database for accurate KEGG Orthology (KO) assignment from protein sequences.	KEGG Release (e.g., 2024-01)
CheckM2 / BUSCO	Assess genome completeness & contamination of MAGs prior to comparative analysis.	CheckM2 (v1.0.2)
FastTree / IQ-TREE2	Software for rapid & accurate maximum-likelihood phylogenetic inference on marker alignments.	IQ-TREE2 (v2.2.6)
DRAM (Distilled & Refined Annotations of Metabolism)	Tool to annotate MAGs & distill metabolic profiles, highlighting pathways like vitamin synthesis & carbon utilization.	DRAM (v1.5)
Anti-HiPIP Antibodies	For experimental validation of the predicted unique electron transport chain component via western blot.	Custom polyclonal (e.g., GenScript)
Defined Oligotrophic Media	For cultivation attempts, mimicking deep-sea conditions (low organic C, high pressure, NO3- as N source).	AMS1 media recipe modifications

Introduction & Thesis Context

Within the broader thesis research on the phylum Marinisomatota (GTDB nomenclature; corresponding to Marinimicrobia or candidate phylum SAR406 in NCBI and older literature), a critical challenge is translating standardized genomic taxonomy into ecological understanding. The Genome Taxonomy Database (GTDB) provides a phylogenetically consistent framework, but ecological inferences drawn from its classifications require validation through independent metagenomic surveys. This protocol outlines a method to cross-reference GTDB-derived lineages against environmental metagenomic datasets to confirm habitat associations, co-occurrence patterns, and putative metabolic roles, thereby grounding taxonomic revisions in ecological evidence.

Application Notes & Protocols

Protocol 1: Creating a Curated GTDB Reference Package for Marinisomatota

Data Retrieval: Access the latest GTDB release (e.g., R220) via the gtdb-tk software package or the GTDB website. Extract all genomes classified within the phylum Marinisomatota.
Quality Filtering: Filter genomes based on GTDB quality criteria (CheckM completeness >50%, contamination <10%). Retain only representative or "dereplicated" genomes as defined by GTDB to reduce redundancy.
Metadata Compilation: For each retained genome, compile associated metadata: GTDB taxonomy (e.g., p__Marinisomatota, c__Marinisomatia, o__UBA10353), NCBI biome and feature annotations, and genomic characteristics (genome size, GC content).
Format for Downstream Analysis: Create a BLAST or Bowtie2 database of the filtered genome sequences. Structure the associated metadata into a tab-delimited table.

Table 1: Example Curated GTDB Marinisomatota Reference Set (Hypothetical Data)

GTDB Genome ID	GTDB Taxonomy (Phylum to Genus)	CheckM Completeness (%)	CheckM Contamination (%)	NCBI Isolation Source	Genome Size (Mbp)
GB_GCA_123456	p__Marinisomatota; c__Marinisomatia; o__UBA10353; f__UBA10353; g__UBA10353	92.5	1.2	Marine sediment	4.8
GB_GCA_789012	p__Marinisomatota; c__P2B42; o__UBA10234; f__UBA10234; g__JAAOCX01	78.9	5.5	Activated sludge	6.1
RS_GCF_345678	p__Marinisomatota; c__P2B42; o__P2B42; f__P2B42; g__P2B42	86.7	2.8	Human gut	3.9

Protocol 2: Metagenomic Read Recruitment & Taxonomic Binning

Metagenome Selection: Select public or in-house metagenomic studies from target environments (e.g., marine, freshwater, human gut, bioreactors) from repositories like the SRA or MG-RAST.
Read Mapping: Use bowtie2 or BWA to map quality-filtered metagenomic reads against the curated Marinisomatota reference database. Use sensitive parameters (--very-sensitive for bowtie2).

Abundance Estimation: Use samtools and custom scripts to calculate depth of coverage and breadth of coverage for each reference genome. Normalize by genome length and total metagenome reads to estimate relative abundance (RPKM or TPM).
Taxonomic Profiling: Perform independent taxonomic profiling of the same metagenomes using a marker-based tool such as MetaPhlAn 3 or mOTUs to obtain a community profile. These profilers report NCBI-derived names by default, so map their output to GTDB nomenclature first. Then compare the presence/absence of Marinisomatota clades with the recruitment results.
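The normalization described in the abundance estimation step reduces to a few arithmetic operations. The sketch below shows RPKM, TPM, and breadth of coverage under the usual definitions; function and variable names are illustrative, and the read counts and genome lengths are made up.

```python
def rpkm(mapped_reads, genome_length_bp, total_reads):
    """Reads per kilobase of genome per million reads in the metagenome."""
    return mapped_reads * 1e9 / (genome_length_bp * total_reads)


def tpm(mapped_reads_by_genome, length_by_genome):
    """Length-normalized abundance scaled so all genomes sum to one million."""
    rates = {
        genome: mapped_reads_by_genome[genome] / length_by_genome[genome]
        for genome in mapped_reads_by_genome
    }
    total = sum(rates.values())
    return {genome: (rate / total * 1e6 if total else 0.0) for genome, rate in rates.items()}


def breadth_of_coverage(per_base_depth):
    """Fraction of reference positions covered by at least one read."""
    if not per_base_depth:
        return 0.0
    return sum(1 for depth in per_base_depth if depth > 0) / len(per_base_depth)


# Made-up numbers for illustration.
reads = {"genome_a": 1000, "genome_b": 250}
lengths = {"genome_a": 2_000_000, "genome_b": 2_500_000}
print(rpkm(reads["genome_a"], lengths["genome_a"], total_reads=10_000_000))
print(tpm(reads, lengths))
print(breadth_of_coverage([0, 3, 5, 0, 1]))
```

Reporting breadth next to depth helps separate genuine presence from a few reads piling onto one conserved region.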

Table 2: Cross-Referencing Results from a Hypothetical Marine Metagenome

Detected Taxon (GTDB)	Read Recruitment Abundance (RPKM)	MetaPhlAn3 Relative Abundance (%)	Concordance (Y/N/Partial)	Inferred Primary Habitat from Cross-Reference
g__UBA10353 (Marinisomatia)	45.2	0.05	Y	Marine sediment
g__JAAOCX01 (P2B42)	0.8	<0.001	Partial (Low detection)	Possibly transient / not native
g__P2B42 (P2B42)	0.1	0.0	N	Non-marine; likely contamination

Protocol 3: Phylogenomic Validation of Ecological Clustering

Phylogenetic Tree Construction: Build a reference tree from the curated Marinisomatota genomes using a set of >100 conserved single-copy marker genes (via GTDB-Tk de_novo_wf).
Environmental Metadata Mapping: Map the habitat source (from metagenomic recruitment) onto the tree leaves as a discrete trait.
Analysis: Visually and statistically (e.g., using CAST or consenTRAIT) assess if specific phylogenetic clades are significantly associated with specific environments (e.g., marine vs. terrestrial).
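As a quick first check before running a dedicated trait-conservation method, the association between one clade and one habitat can be tested with a one-sided Fisher's exact test on a 2x2 table. This is a simpler stand-in, not consenTRAIT, and it ignores the phylogenetic non-independence of genomes; the counts in the example are invented.

```python
from math import comb


def enrichment_p_value(clade_in, clade_out, other_in, other_out):
    """One-sided Fisher's exact test for habitat enrichment within a clade.

    clade_in / clade_out: clade genomes found in / outside the habitat.
    other_in / other_out: remaining genomes found in / outside the habitat.
    Returns P(X >= clade_in) under the hypergeometric null.
    """
    n_clade = clade_in + clade_out
    n_habitat = clade_in + other_in
    total = n_clade + other_in + other_out
    upper = min(n_clade, n_habitat)
    tail = sum(
        comb(n_habitat, k) * comb(total - n_habitat, n_clade - k)
        for k in range(clade_in, upper + 1)
    )
    return tail / comb(total, n_clade)


# Invented counts: 8 of 10 clade genomes are marine, 5 of 30 others are marine.
print(enrichment_p_value(8, 2, 5, 25))
```

A small p-value only flags a clade worth a closer look; a tree-aware method is still needed to support a claim of habitat conservatism.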

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials and Tools for Validation Workflow

Item	Function & Explanation
GTDB-Tk (v2.3.0+)	Software toolkit to classify genomes into the GTDB taxonomy and generate phylogenomic trees. Essential for standardizing input genomes.
CheckM2	Assesses genome quality (completeness, contamination) for filtering the reference database. More accurate than CheckM1 for diverse bacteria.
Bowtie2 / BWA	Read mapping tools for recruiting metagenomic reads to the reference genome database. Critical for quantifying environmental presence.
MetaPhlAn 3	Profiler for metagenomic taxonomic composition using clade-specific marker genes. Provides an independent community profile for cross-validation; its NCBI-derived names must be mapped to GTDB before comparison.
Non-Redundant GTDB Reference Database (RS & RG)	Provides the standardized, de-replicated genome set. The foundation for creating a phylum-specific reference package.
SRA Toolkit	Downloads raw metagenomic sequencing data from the NCBI Sequence Read Archive for analysis.
iTOL / ggtree	Interactive Tree of Life web tool or R package for visualizing phylogenetic trees with annotated metadata (e.g., habitat).

Diagrams

Cross-Referencing Validation Workflow

Data Integration for Ecological Inference

The Genome Taxonomy Database (GTDB) provides a standardized, genome-based taxonomy that frequently reclassifies microbial lineages, including the phylum Marinisomatota (listed as Marinimicrobia in some NCBI lineages and known as candidate phylum SAR406 in historical literature). This reclassification presents both a challenge and an opportunity for researchers. Legacy data, published literature, and associated metabolic models or drug target identifications become semantically disconnected from current genomic understanding. These Application Notes provide a framework for reconciling historical data with the GTDB taxonomy to ensure robust, reproducible science in marine microbiology and marine natural product discovery.

Quantitative Impact Analysis of Reclassification

Table 1: Comparative Taxonomy of Key Marinisomatota Lineages: GTDB r220 vs. NCBI/SILVA Legacy Systems

GTDB r220 Taxonomy (Phylum/Class/Order)	Approximate Legacy NCBI/SILVA Equivalent	Notable Phenotypic/Metabolic Traits (from Literature)	Key Publications Affected (Example Count)
p__Marinisomatota (full phylum)	Candidate phylum SAR406, Marinimicrobia	Oligotrophic, deep-sea adapted, putative role in sulfur & carbon cycling.	>500 (metagenomic surveys, oceanography)
c__Marinisomatia	Marine Group A, SAR406 clade	Abundant in oxygen minimum zones; genomes indicate autotrophic potential.	~300 (biogeochemical studies)
c__Aureabacteria	Uncultivated descendant of SAR406	Found in saline lakes, distinct genomic repertoire.	~50 (extreme environment studies)
o__UBA1416	Sub-clade within SAR406	Associated with particulate organic matter.	~75 (carbon flux research)

Table 2: Protocol for Taxonomic Reconciliation of Existing Data and Models

Step	Protocol Description	Tools/Resources	Expected Output
1. Identifier Mapping	Cross-reference legacy genome/OTU IDs (e.g., from NCBI) with GTDB using canonical correspondence files.	GTDB-Tk, `gtdb_to_taxdump.tsv` file from GTDB, EBI Metagenomics.	Table linking NCBI accession to GTDB accession & taxonomy.
2. Literature Re-annotation	Systematically search and tag existing literature with updated GTDB nomenclature using text-mining.	Custom Python scripts with BioPython & PubMed API, Zotero/Mendeley.	Annotated reference library with dual nomenclature.
3. Metabolic Model Validation	Remap reaction annotations (KEGG, MetaCyc) in legacy metabolic models to genomes in GTDB reference tree.	ModelSEED, KBase, PATRIC, RAST toolkit.	Updated genome-scale metabolic models (GEMs) under GTDB taxonomy.
4. Phylogenetic Contextualization	Place legacy sequence data within the GTDB reference tree via phylogenetic placement.	GTDB-Tk `classify_wf`, EPA-ng, pplacer.	Newick tree with query sequences placed within GTDB framework.
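For the identifier mapping step, it helps that GTDB genome accessions are NCBI assembly accessions carrying an RS_ (RefSeq) or GB_ (GenBank) prefix. The sketch below builds a lookup keyed on the NCBI accession; the function names and example records are hypothetical.

```python
def gtdb_to_ncbi_accession(gtdb_accession):
    """Strip the GTDB 'RS_' or 'GB_' prefix to recover the NCBI assembly accession."""
    prefix, _, rest = gtdb_accession.partition("_")
    if prefix in ("RS", "GB") and rest.startswith(("GCF_", "GCA_")):
        return rest
    raise ValueError(f"not a GTDB genome accession: {gtdb_accession!r}")


def build_lookup(records):
    """Map NCBI accession -> (GTDB accession, GTDB taxonomy).

    `records` is an iterable of (gtdb_accession, gtdb_taxonomy) pairs,
    for example read from a GTDB taxonomy or metadata table.
    """
    return {
        gtdb_to_ncbi_accession(accession): (accession, taxonomy)
        for accession, taxonomy in records
    }


# Hypothetical records for illustration.
lookup = build_lookup(
    [
        ("GB_GCA_000123456.1", "d__Bacteria;p__Marinisomatota;c__Marinisomatia"),
        ("RS_GCF_000654321.1", "d__Bacteria;p__Bacteroidota;c__Bacteroidia"),
    ]
)
print(lookup["GCA_000123456.1"])
```

Legacy tables keyed on GCA/GCF accessions can then be joined to GTDB taxonomy without re-running classification.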

Detailed Experimental Protocols

Protocol 1: Reclassifying Amplicon Sequence Variant (ASV) Data Using GTDB

Objective: To re-annotate existing 16S rRNA gene amplicon datasets (often classified against SILVA) with GTDB taxonomy.

Materials: ASV table (BIOM or CSV format), representative ASV sequences (FASTA), QIIME2 (2024.2+), GTDB reference package (r220).

Procedure:

Download Reference Data: Obtain the GTDB SSU rRNA reference sequences and the matching taxonomy file for release r220, and import both as QIIME 2 artifacts.
Train Classifier: Use q2-feature-classifier to fit a naive Bayes classifier on the GTDB reference sequences. Command: qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gtdb_seqs.qza --i-reference-taxonomy gtdb_tax.qza --o-classifier gtdb_classifier.qza.
Classify ASVs: Run taxonomy classification on your ASV sequences. Command: qiime feature-classifier classify-sklearn --i-classifier gtdb_classifier.qza --i-reads rep_seqs.qza --o-classification taxonomy.qza.
Collate Data: Merge new taxonomy with the ASV table and analyze.
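The collation step can be sketched as summing ASV counts under the GTDB label at a chosen rank. This is a minimal illustration that assumes taxonomy strings in the semicolon-separated, rank-prefixed GTDB style; the ASV IDs and counts are invented.

```python
from collections import Counter


def collapse_by_rank(asv_counts, asv_taxonomy, rank_prefix="p__"):
    """Sum ASV counts per taxon at the rank identified by `rank_prefix`.

    ASVs with no label at that rank are pooled under 'Unassigned'.
    """
    totals = Counter()
    for asv, count in asv_counts.items():
        fields = [field.strip() for field in asv_taxonomy.get(asv, "").split(";")]
        label = next((f for f in fields if f.startswith(rank_prefix)), rank_prefix)
        if label == rank_prefix:  # rank missing or present but empty
            label = "Unassigned"
        totals[label] += count
    return dict(totals)


# Invented ASV table and taxonomy for illustration.
counts = {"asv1": 10, "asv2": 5, "asv3": 2}
taxonomy = {
    "asv1": "d__Bacteria; p__Marinisomatota; c__Marinisomatia",
    "asv2": "d__Bacteria; p__Marinisomatota",
    "asv3": "d__Bacteria",
}
print(collapse_by_rank(counts, taxonomy))
```

Running the same function on the legacy SILVA labels gives a side-by-side view of how much signal moved between names.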

Protocol 2: Validating a Putative Drug Target Gene in Reclassified Genomes

Objective: To assess the conservation and phylogenetic distribution of a previously identified essential gene (e.g., dnaN) across reclassified Marinisomatota genomes.

Materials: List of GTDB genome accession IDs for Marinisomatota, target gene protein sequence, anvi'o (v7.1), HMMER suite.

Procedure:

Genome Retrieval: Use gtdb-tk to generate a genome set or download from GTDB ftp.
Gene Calling & Functional Annotation: Process all genomes through a consistent pipeline (e.g., Prokka, Bakta) to generate standardized GFF3 and amino acid FASTA files.
Build Target HMM: Create a Hidden Markov Model (HMM) profile for the target gene using reference sequences from trusted databases. Command: hmmbuild target_gene.hmm alignment.fasta.
Search & Extract: Use hmmsearch against the concatenated protein database of all Marinisomatota genomes. Parse results to extract hits above a strict e-value threshold (e.g., 1e-30).
Phylogenetic Profiling: Map the presence/absence and sequence variants of the target gene onto the GTDB phylogeny using anvi'o's pangenomics workflow to visualize conservation.
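The parsing in the search-and-extract step can be sketched against the tabular output that hmmsearch writes with --tblout, where the target name is the first column and the full-sequence E-value and score are the fifth and sixth. The "genome|protein" naming of sequences is an assumption made here for illustration, as are the example lines.

```python
def parse_tblout(lines, max_evalue=1e-30):
    """Return (target, evalue, score) for hits at or below `max_evalue`."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split(maxsplit=18)  # the last field is free-text description
        target, evalue, score = fields[0], float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return hits


def genomes_with_hit(hits, separator="|"):
    """Collect genome IDs, assuming targets are named '<genome>|<protein>'."""
    return {target.split(separator, 1)[0] for target, _, _ in hits}


# Invented tblout lines for illustration.
example = [
    "# target name  accession  query name ...",
    "MAG1|prot_0001 - dnaN - 1.2e-80 270.1 0.1 1.5e-80 269.8 0.1 1.0 1 0 0 1 1 1 1 DNA polymerase III subunit beta",
    "MAG2|prot_0417 - dnaN - 3.0e-12 45.3 0.0 4.1e-12 44.9 0.0 1.1 1 0 0 1 1 1 1 hypothetical protein",
]
strict_hits = parse_tblout(example)
print(genomes_with_hit(strict_hits))
```

Genomes absent from the returned set are candidates for a second look with a relaxed threshold before the gene is called missing, since MAGs are often incomplete.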

Mandatory Visualizations

Title: Data Reconciliation Workflow for GTDB Reclassification

Title: Reclassification Impacts and Required Actions

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for Marinisomatota Research Post-GTDB Reclassification

Item Name	Supplier/Resource	Function & Application Notes
GTDB-Tk v2.3.0+	(https://github.com/ecogenomics/gtdbtk)	Core software toolkit for assigning GTDB taxonomy to genome bins and placing them in the reference tree. Critical for all reclassification work.
GTDB r220 Reference Data	GTDB FTP Site	Genome sequence and taxonomy files. Required for any classification or phylogenetic analysis aligned with GTDB.
CheckM2	(https://github.com/chklovski/CheckM2)	Rapid, accurate assessment of genome completeness and contamination. Essential for quality control before taxonomic classification.
anvi'o v7.1+	(http://anvio.org)	Integrated platform for pangenomics, phylogenomics, and metabolic modeling. Ideal for comparing reclassified genomes.
KBase (Microbiome Modeling)	(https://www.kbase.us)	Cloud platform for constructing and analyzing metabolic models from genomes, facilitating functional re-annotation post-reclassification.
MEMOTE Suite	(https://memote.io)	For testing and reporting standard compliance of genome-scale metabolic models, ensuring updated models are robust.
Custom HMM Profiles	(e.g., TIGRFAM, PFAM)	Curated protein family profiles for targeting specific metabolic pathways (e.g., sulfur oxidation) in functional screens of reclassified genomes.

Application Notes and Protocols

Thesis Context: Within a broader thesis investigating the phylogenetic novelty and metabolic potential of lineages like Marinisomatota (formerly candidate phylum Marinimicrobia/SAR406) within the GTDB (Genome Taxonomy Database) classification framework, accurate phylogenetic placement is paramount. Inferring the evolutionary relationships of these often-fragmentary metagenome-assembled genomes (MAGs) requires robust assessment of genome quality to prevent erroneous taxonomic assignment.

1. Core Quality Metrics for Phylogenetic Trustworthiness

The integrity of a phylogenetic inference is directly contingent on the quality of the input genomes. The following metrics, popularized by tools like CheckM and CheckM2, are non-negotiable for pre-placement screening.

Table 1: Core Metrics for Genome Quality Assessment

Metric	Definition	Ideal Range for Trustworthy Placement	Interpretation in Marinisomatota Context
Completeness	Percentage of conserved, single-copy marker genes (SCMGs) found in the genome.	>90% (High Quality) >50% (Draft)	High completeness ensures adequate phylogenetic signal. Low completeness in Marinisomatota MAGs may indicate novel lineages with divergent markers.
Contamination	Estimated percentage of SCMGs present in multiple copies, suggesting multiple strains/species.	<5% (High Quality) <10% (Acceptable)	High contamination leads to chimeric phylogenetic signals, misplacing the genome. Critical for novel phylum assignment in GTDB.
Strain Heterogeneity	Evidence of multiple sequence variants among SCMGs, indicating unresolved strains.	Low (Close to 0%)	High heterogeneity complicates assembly and placement, may require bin refinement or indicate a population.
Genome Size & N50	Total assembly length and contig length at which 50% of the genome is assembled.	Context-dependent	Significantly deviant sizes may flag contamination or incompleteness. Useful for comparing against known relatives.

Protocol 1.1: Standardized Quality Assessment with CheckM2

Objective: To calculate completeness and contamination for a set of Marinisomatota MAGs prior to phylogenetic analysis. CheckM2 does not report strain heterogeneity; run CheckM (v1) as well if that metric is needed.

Input Preparation: Collect all MAGs in FASTA format (e.g., *.fna files) in a single directory.
Database Setup: Install CheckM2 (e.g., pip install checkm2). The program uses pre-trained models, but it also needs its DIAMOND reference database, which must be downloaded once with checkm2 database --download.
Run Quality Assessment: Execute the command (paths are placeholders): checkm2 predict --input mags/ --output-directory checkm2_out -x fna --threads 16.
Output Interpretation: The primary results are in quality_report.tsv. Filter MAGs based on Table 1 thresholds (e.g., Completeness >70%, Contamination <5%) for downstream phylogenetic placement.
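The filtering step can be scripted directly against quality_report.tsv. The sketch below assumes the Name, Completeness, and Contamination columns that CheckM2 writes and uses the example thresholds from the step above; the report content shown is invented.

```python
import csv
import io


def passing_mags(report_handle, min_completeness=70.0, max_contamination=5.0):
    """Return names of MAGs that exceed the completeness floor and stay under the contamination cap."""
    reader = csv.DictReader(report_handle, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) > min_completeness
        and float(row["Contamination"]) < max_contamination
    ]


# Invented report content for illustration.
report = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "mag_001\t92.5\t1.2\n"
    "mag_002\t78.9\t5.5\n"
    "mag_003\t48.0\t0.4\n"
)
print(passing_mags(report))
```

Keeping the thresholds as arguments makes it easy to rerun with the looser draft-quality cutoffs (>50% and <10%) when too few genomes survive.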

2. Phylogenetic Placement-Specific Assessments

Beyond general metrics, specific checks are needed to ensure the phylogenetic signal is reliable.

Table 2: Placement-Specific Diagnostic Metrics

Metric	Protocol/Method	Purpose & Relevance
Marker Gene Concordance	Phylogeny of individual SCMGs (e.g., via PhyloPhlAn) vs. concatenated tree.	Detects hidden contamination or horizontal gene transfer that concatenated trees may obscure. Incongruent gene trees can invalidate placement.
Coverage Uniformity	Analysis of read mapping depth across contigs (e.g., using `bowtie2` and `samtools`).	Large coverage drops may indicate mis-binned contigs (contamination). Uniform coverage supports a coherent genome.
Taxonomic Consistency	Compare taxonomic assignments of all predicted genes (e.g., via `CAT` or `GTDB-Tk` classify).	A high percentage of genes agreeing with the dominant lineage boosts confidence. Many genes from divergent phyla signal contamination.
Reference Tree Robustness	Placement on a stable, well-curated reference tree (e.g., GTDB backbone tree).	Ensures placement is not an artifact of a poor or biased reference dataset.

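Protocol 2.1: Assessing Taxonomic Consistency with CAT/BAT

Objective: To evaluate gene-level taxonomic agreement within a Marinisomatota MAG.

Gene Prediction: Predict protein sequences from the MAG using prodigal. Command (file names are placeholders): prodigal -i mag.fna -a mag.faa -o mag.gff -f gff.
Run CAT/BAT: Perform taxonomic classification of the predicted proteins with BAT, the bin-level mode of CAT, against its reference database and taxonomy files.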
Analyze Output: Examine the mag.lineage file. A trustworthy MAG for placement will show a high proportion of proteins classified to a coherent lineage (e.g., candidate phylum Marinisomatota), with limited classification to unrelated phyla.
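The "high proportion of proteins classified to a coherent lineage" criterion can be made concrete as the fraction of classified ORFs that agree with the most common phylum. The sketch below works on a plain list of per-ORF phylum labels that would first be extracted from the classifier output; the labels in the example are invented, and no particular output file layout is assumed.

```python
from collections import Counter


def dominant_lineage_fraction(orf_phyla):
    """Return (dominant phylum, fraction of classified ORFs assigned to it).

    Entries that are None or empty are treated as unclassified and ignored.
    """
    classified = [phylum for phylum in orf_phyla if phylum]
    if not classified:
        return None, 0.0
    name, count = Counter(classified).most_common(1)[0]
    return name, count / len(classified)


# Invented per-ORF labels for illustration.
labels = ["Marinisomatota"] * 3 + ["Pseudomonadota", None]
print(dominant_lineage_fraction(labels))
```

Because novel lineages often leave many ORFs unclassified, the number of classified ORFs is worth reporting alongside the fraction.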

3. Visualization of the Assessment Workflow

A standardized workflow integrates these metrics to gatekeep genomes for trustworthy phylogenetic placement.

Title: Workflow for Trustworthy Phylogenetic Placement

4. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools and Databases for Quality Assessment

Item	Function & Relevance	Typical Source/Implementation
CheckM2	Rapid tool for estimating completeness and contamination using machine learning. Essential first-pass filter.	https://github.com/chklovski/CheckM2
GTDB-Tk	Toolkit for assigning GTDB taxonomy, includes `classify_wf` which performs internal quality checks and reference tree placement.	https://github.com/ecogenomics/gtdbtk
PhyloPhlAn	For constructing highly accurate phylogenies with SCMGs and assessing marker gene concordance.	https://github.com/biobakery/phylophlan
BUSCO	Alternative to CheckM using universal orthologous benchmarks. Useful for eukaryotes and specific lineages.	https://busco.ezlab.org/
CAT/BAT	Protein-based taxonomic classifier. Critical for evaluating gene-level consistency within a MAG.	https://github.com/dutilh/CAT
Bowtie2 & SAMtools	For mapping reads back to assemblies to compute coverage uniformity and validate binning.	http://bowtie-bio.sourceforge.net/bowtie2, http://www.htslib.org/
GTDB Reference Data (r214+)	Curated genome database and trees. The gold-standard reference for bacterial and archaeal phylogenetic placement.	https://data.gtdb.ecogenomic.org/
CIAlign	Tool to clean and interpret multiple sequence alignments, removing noisy regions that can distort phylogeny.	https://github.com/KatyBrown/CIAlign/

Conclusion

The GTDB framework provides a robust, genome-based taxonomy that has significantly refined our understanding of the Marinisomatota phylum, clarifying its evolutionary boundaries and internal diversity. Mastery of the associated tools and an awareness of classification nuances are essential for accurately placing new genomes and interpreting their biological significance. The validated genomic distinctiveness of Marinisomatota, particularly its prevalence in marine systems, underscores its potential as a reservoir for novel natural products and enzymes. Future directions should focus on isolating representative strains, functionally characterizing predicted biosynthetic pathways, and exploring the phylum's role in marine biogeochemical cycles and host-microbe interactions. For biomedical research, integrating GTDB classification with metabolomic and phenotypic data will be crucial for translating genomic novelty into therapeutic leads.
