Tutorial: Annotating a Phage Genome from Scratch
Time to complete: ~20 minutes | Skill level: Beginner
What you will learn: How to run a full annotation, understand confidence scores, read the HTML report, and identify genes worth following up on.
🧬 What we're doing, and why it matters
We'll annotate the Lambda phage genome (NC_001416), one of the most-studied viruses in biology. It's 48.5 kb with 92 genes: small enough to finish quickly, complex enough to be interesting.
Lambda phage is a great teaching example because:
- Some genes have very well-known functions (e.g., the repressor cI)
- Some have weak or ambiguous annotations
- Some are completely unknown: the "Dark Matter" genes
By the end of this tutorial you'll see exactly how GIAE handles all three cases and how to decide what to do next.
Don't have GIAE installed yet?
Run pip install giae and come back. See the Quickstart for help.
Step 1: Download the genome
Open your terminal and run:
curl -o lambda_phage.gb "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_001416&rettype=gb&retmode=text"
This downloads the Lambda phage GenBank file from NCBI. You should see a file called lambda_phage.gb appear in your current folder.
Quick check: make sure the file is there, for example by listing it with ls -lh lambda_phage.gb.
If the file exists and has a size around 185 KB, you're good.
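
The same check can be scripted. The following is a minimal Python sketch, assuming the download was saved as lambda_phage.gb in the current folder. The helper name check_genbank_file is hypothetical and not part of GIAE; it relies only on the fact that GenBank flat files begin with a LOCUS line, so an HTML error page saved by mistake is caught as well.

```python
from pathlib import Path


def check_genbank_file(path: str) -> tuple[bool, str]:
    """Return (ok, message) for a downloaded GenBank flat file.

    Hypothetical helper for this tutorial; not part of GIAE.
    """
    p = Path(path)
    if not p.is_file():
        return False, f"{path} not found"
    size_kb = p.stat().st_size / 1024
    # GenBank flat files start with a LOCUS line.
    with p.open("r", encoding="utf-8", errors="replace") as handle:
        first_line = handle.readline()
    if not first_line.startswith("LOCUS"):
        return False, f"{path} does not start with a LOCUS line"
    return True, f"{path}: {size_kb:.0f} KB, header looks like GenBank"


if __name__ == "__main__":
    ok, message = check_genbank_file("lambda_phage.gb")
    print("OK" if ok else "PROBLEM", "-", message)
```

If the script reports a problem, re-run the curl command from this step before moving on.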
Step 2: Run GIAE
Now run the full annotation with the HTML report (this is the richest output format):
You'll see a progress bar in your terminal:
🔬 GIAE - Genome Interpretation and Annotation Engine
Genome: Lambda phage (NC_001416) | 48,502 bp | 92 genes
Analysing...
[■■■■■■■■■■■■■■■■■■■■] 78/92 (85%)
Fetching homology data from UniProt...
Running motif scan (PROSITE)...
Scoring hypotheses...
✅ Complete - 92 genes interpreted in 41s
📄 Report: lambda_report.html
🟢 High confidence: 51 genes (55%)
🟡 Medium confidence: 22 genes (24%)
🔴 Low confidence: 11 genes (12%)
🌑 Novel / Dark Matter: 8 genes (9%)
This takes 30 to 60 seconds on a normal laptop
Most of the time is spent fetching UniProt data. Use --no-uniprot --no-interpro for a faster offline run (fewer evidence sources, but near-instant results).
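
Each percentage in the summary is that category's share of the 92 genes, rounded to the nearest whole number. A quick check of the arithmetic, using only the counts printed in the terminal output above:

```python
# Gene counts per category, copied from the terminal summary above.
counts = {"High": 51, "Medium": 22, "Low": 11, "Novel / Dark Matter": 8}

total = sum(counts.values())  # 92 genes in total

# Share of each category, rounded to a whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} genes ({pct}%)")
```

The four counts add up to 92, so every gene falls into exactly one category.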
Step 3: Open the HTML report
Open lambda_report.html in your browser (double-click it, or drag it into Chrome/Firefox).
You'll see a page like this:
┌─────────────────────────────────────────────────┐
│ GIAE - Lambda phage (NC_001416)                 │
│ 92 genes | 41s | Generated 2025-03-21           │
│                                                 │
│ ███████████████████████████████████             │
│ 55% HIGH  24% MEDIUM  12% LOW  9% NOVEL         │
├─────────────────────────────────────────────────┤
│ [🟢 cI]  Repressor protein CI             87%   │
│ [🟢 O]   Replication protein O            91%   │
│ [🟢 P]   Replication protein P            83%   │
│ [🟡 ren] Putative regulatory protein      61%   │
│ [🌑 B]   Unknown function               NOVEL   │
│ ...                                             │
└─────────────────────────────────────────────────┘
Each gene is colour-coded:
- 🟢 Green = HIGH confidence: you can trust this annotation
- 🟡 Yellow = MEDIUM confidence: probably right, but double-check
- 🔴 Red = LOW confidence: treat as a hypothesis, not a fact
- 🌑 Black/Dark = NOVEL: no known function found
Click any gene to expand its full evidence panel.
Step 4: Understanding a HIGH confidence gene
Click on cI (the first gene). You'll see:
──────────────────────────────────────────────
cI | Lambda repressor protein CI | 87% ✅
──────────────────────────────────────────────
Evidence collected:
✓ UniProt P03034 - 96% identity (Lambda repressor CI)
✓ Motif: HTH_3 (Helix-Turn-Helix DNA binding domain)
✓ Domain: cl21500 (Lambda repressor N-terminal domain)
Reasoning:
"Three independent evidence sources (sequence homology,
structural motif, and domain classification) all agree.
This gene is the lysogeny repressor. HIGH confidence."
Alternative hypotheses considered:
→ Cro-like repressor (23%): same HTH motif, lower UniProt match
What to look for here:
- Multiple evidence sources agree: this is the gold standard. When UniProt homology, a motif hit, and a domain classification all point to the same function, you can be very confident.
- The reasoning is transparent: GIAE tells you why it made this call, not just what it decided.
- Alternative hypotheses are shown: GIAE considered a Cro-like repressor but rejected it, so it actively ruled out other possibilities.
→ No action needed for this gene. You can record it as confirmed.
Step 5: Understanding a MEDIUM confidence gene
Now click on ren (shown in yellow). You'll see something like:
──────────────────────────────────────────────
ren | Putative regulatory protein | 61% 🟡
──────────────────────────────────────────────
Evidence collected:
✓ UniProt P03040 - 71% identity (Phage ren protein, "putative")
✗ No motif hits found
✗ No conserved domain
Reasoning:
"Sequence similarity to a protein of unknown function in UniProt.
The match is significant but the reference protein is itself
not well characterised. Function is inferred, not confirmed."
Alternative hypotheses considered:
→ Transcriptional activator (38%): weak, rejected
What to look for here:
- Only one evidence source: when GIAE has only UniProt homology, with no motif or domain support, it can't be as confident.
- The source itself is "putative": the UniProt entry it matched uses the word putative, which means even the reference database isn't sure. GIAE flags this.
- No motif or domain hit: these would normally add more confidence.
What to do with this gene:
- If you're writing it up, say "putative regulatory protein, 61% confidence", not "confirmed"
- Consider running HMMER locally (giae db download pfam) for extra domain evidence
- If you have lab access, this is a candidate for functional assays
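
The difference between cI and ren comes down to how many independent kinds of evidence back the annotation. A minimal Python sketch of that check, assuming gene records shaped like the JSON example in Step 7 (an "evidence" list whose items carry a "type" field). The helper names are hypothetical, and the ren record is an assumption built from the evidence panel above rather than real GIAE output.

```python
def evidence_types(gene: dict) -> set[str]:
    """Distinct evidence types recorded for one gene."""
    return {item["type"] for item in gene.get("evidence", [])}


def is_single_source(gene: dict) -> bool:
    """True when an annotation rests on exactly one kind of evidence."""
    return len(evidence_types(gene)) == 1


# cI as it appears in the JSON example in Step 7 (abbreviated).
ci = {
    "gene_id": "cI",
    "evidence": [
        {"type": "uniprot_homology", "description": "P03034 - 96% identity"},
        {"type": "motif", "description": "HTH_3 domain detected"},
    ],
}

# ren in the same shape: an assumed record based on the panel above,
# not actual GIAE output.
ren = {
    "gene_id": "ren",
    "evidence": [
        {"type": "uniprot_homology", "description": "P03040 - 71% identity"},
    ],
}

for gene in (ci, ren):
    flag = "single source" if is_single_source(gene) else "multiple sources"
    print(gene["gene_id"], sorted(evidence_types(gene)), flag)
```

Genes flagged as single source are the natural candidates for the extra validation suggested above.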
Step 6: Understanding Dark Matter genes
Click on gene B (shown as 🌑 NOVEL). This is where it gets interesting:
──────────────────────────────────────────────
B | Unknown function | NOVEL 🌑
──────────────────────────────────────────────
Gene length: 533 amino acids (long for a phage gene)
Novelty score: 94%
Evidence collected:
✗ No UniProt homology found (e-value > 1e-5)
✗ No motif hits found
✗ No domain detected
Reasoning:
"No sequence-based evidence available. This gene has no
detectable homologs in current databases. It may represent
a genuinely novel protein class."
GIAE Suggestions:
🔬 Run AlphaFold2 on this sequence for structural prediction
🔬 Check expression data: is this gene expressed at all?
🔬 Candidate for wet-lab biochemical screening
What does "Dark Matter" mean?
Phage genomes are notorious for containing genes that have no detectable relationship to anything in current databases. These "ORFans" make up 9-40% of phage gene content.
GIAE calls these Dark Matter because:
- It genuinely doesn't know what they do
- The tools to answer the question go beyond sequence comparison
- Structure prediction (AlphaFold) or laboratory experiments are the next step
Why is this useful? Most annotation tools would either skip these genes or label them "hypothetical protein" and move on. GIAE gives you:
- A novelty score to prioritise which ones to investigate
- A suggested next step based on gene length and properties
- Explicit confirmation that sequence methods have been exhausted
Gene B is flagged as HIGH PRIORITY novelty because at 533 amino acids, it's unusually large for a phage gene. Long unknown proteins are often worth structural investigation.
Step 7: Exporting results for downstream analysis
You can also export the results as JSON for downstream computational analysis:
The JSON output includes all evidence, scores, and reasoning in a structured format:
{
  "genome_id": "NC_001416",
  "genes": [
    {
      "gene_id": "cI",
      "interpretation": "Repressor protein CI",
      "confidence": 0.87,
      "confidence_label": "HIGH",
      "evidence": [
        {
          "type": "uniprot_homology",
          "description": "P03034 - 96% identity",
          "confidence": 0.96
        },
        {
          "type": "motif",
          "description": "HTH_3 domain detected",
          "confidence": 0.82
        }
      ],
      "reasoning": "Three independent evidence sources agree...",
      "novel": false,
      "novelty_score": 0.03
    }
  ]
}
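
Once the results are in this shape, triage takes a few lines of Python. The sketch below uses only the field names shown in the example above (genes, gene_id, confidence_label, novel, novelty_score). The function names are hypothetical, and the output file name lambda_report.json in the comment is an assumption, so adjust it to whatever your export is called.

```python
import json
from collections import defaultdict

# Abbreviated copy of the example above, so the sketch runs on its own.
SAMPLE = """
{
  "genome_id": "NC_001416",
  "genes": [
    {"gene_id": "cI", "interpretation": "Repressor protein CI",
     "confidence": 0.87, "confidence_label": "HIGH",
     "novel": false, "novelty_score": 0.03}
  ]
}
"""


def group_by_label(report: dict) -> dict[str, list[str]]:
    """Map each confidence label to the gene IDs that carry it."""
    groups = defaultdict(list)
    for gene in report["genes"]:
        groups[gene["confidence_label"]].append(gene["gene_id"])
    return dict(groups)


def novel_genes(report: dict) -> list[dict]:
    """Genes flagged as novel, highest novelty score first."""
    novel = [g for g in report["genes"] if g.get("novel")]
    return sorted(novel, key=lambda g: g["novelty_score"], reverse=True)


# For a real run, load your export instead (file name is an assumption):
#     with open("lambda_report.json") as fh:
#         report = json.load(fh)
report = json.loads(SAMPLE)

print(group_by_label(report))
print([g["gene_id"] for g in novel_genes(report)])
```

Sorting the novel genes by novelty_score gives a simple priority list for structural prediction or lab follow-up.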
Summary: What to do with your results
| Gene type | What GIAE tells you | Recommended action |
|---|---|---|
| 🟢 HIGH confidence | Strong multi-source evidence | Accept annotation, cite the evidence |
| 🟡 MEDIUM confidence | Some uncertainty | Note as "putative", consider extra validation |
| 🔴 LOW confidence | Weak single-source evidence | Treat as hypothesis only, do not cite as fact |
| 🌑 NOVEL / Dark Matter | No known function | Structural prediction or lab follow-up |
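
The table can be applied to exported results automatically. A minimal sketch, assuming gene records shaped like the JSON example in Step 7. Only the label "HIGH" appears in that example, so the spellings "MEDIUM" and "LOW" are assumptions, and recommended_action is a hypothetical helper rather than part of GIAE.

```python
# Recommended actions, copied from the summary table.
# "MEDIUM" and "LOW" are assumed label spellings; only "HIGH" is shown
# in the JSON example.
ACTIONS = {
    "HIGH": "Accept annotation, cite the evidence",
    "MEDIUM": 'Note as "putative", consider extra validation',
    "LOW": "Treat as hypothesis only, do not cite as fact",
    "NOVEL": "Structural prediction or lab follow-up",
}


def recommended_action(gene: dict) -> str:
    """Look up the suggested next step for one gene record."""
    # The "novel" flag takes priority over any confidence label.
    if gene.get("novel"):
        return ACTIONS["NOVEL"]
    return ACTIONS[gene["confidence_label"]]


ci = {"gene_id": "cI", "confidence_label": "HIGH", "novel": False}
print(ci["gene_id"], "->", recommended_action(ci))
```

Checking the novel flag first means a Dark Matter gene is always routed to structural or lab follow-up, whatever label it carries.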
Next Steps
- Install local BLAST+ for more sensitive homology: giae db download swissprot
- Install HMMER/Pfam for domain detection: giae db download pfam
- API Reference: use GIAE in your Python scripts
- 🗺️ Roadmap: what's coming next (HTML comparison tools, caching, more!)