Annotation Field Descriptions

Comprehensive reference for all annotation fields in Bystro's output files. Understanding these fields helps you interpret variant annotations, filter results effectively, and extract meaningful insights from your genomic data.

Field Notation

Italicized fields are custom Bystro fields. All others are sourced from public databases as described.

Missing data: marked by '!'
Multiple values: separated by ';' (e.g., transcripts)
Indel annotations: separated by '|'
Multiallelic sites: appear on separate lines
Output order: matches original input file order
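
The delimiter conventions above can be sketched as a small parser. This is a minimal illustration, not Bystro's own code: the helper name parse_field is hypothetical, the nesting order ('|' as the outer level, ';' as the inner level) is an assumption, and both '!' and the NA marker used elsewhere in this reference are treated as missing.

```python
# Hypothetical helper: split one output cell into nested values.
# Assumptions: '|' separates per-position indel annotations (outer level),
# ';' separates multiple values such as transcripts (inner level),
# and '!' or 'NA' marks missing data.
MISSING_MARKERS = frozenset({"!", "NA"})


def parse_field(raw, missing=MISSING_MARKERS):
    """Return a list (one entry per '|' part) of lists (one entry per ';' value)."""
    def clean(token):
        # Missing markers become None so downstream code can test for them.
        return None if token in missing else token

    return [[clean(token) for token in part.split(";")] for part in raw.split("|")]


if __name__ == "__main__":
    print(parse_field("NM_000001;NM_000002"))
    print(parse_field("intronic;exonic|!"))
```

Returning None for missing values keeps them distinguishable from empty strings when filtering.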

Basic Fields

These fields are sourced from the input file or calculated from input fields.

Position & Variant Information

chrom- Chromosome, always prepended with "chr"
pos- Genomic position after Bystro normalizes variant representations
  • Positions always correspond to the first affected base
type- The type of variant
  • VCF format types: SNP, INS, DEL, MULTIALLELIC
  • SNP format types: SNP, INS, DEL, MULTIALLELIC, DENOVO
  • MNPs are decomposed into separate "SNP" rows (future releases will label these as "MNPs" with linkage properties)
  • Multiallelics are decomposed into separate rows but retain the "MULTIALLELIC" type
inputRef- The reference base (always 1 base long)
  • Generated by the input file pre-processor
  • Always the affected reference base at that position
alt- The alternate/nonreference allele
  • VCF multi-allelic and MNP sites are decomposed into individual entries
  • Genotypes are properly segregated per allele
ref- The Bystro-annotated UCSC reference
  • For insertions: always 2 bases long (base before + base after insertion)
  • For deletions: as long as the deletion (up to 32 bases), with 1 annotation per deleted base

Population Genetics

trTv- Transition:transversion ratio for your dataset at this position
heterozygotes- The heterozygous sample labels
heterozygosity- Fraction of samples that are heterozygous for the alternate allele
homozygotes- The homozygous sample labels
homozygosity- Fraction of samples that are homozygous for the alternate allele
missingGenos- Samples that did not have a genotype (e.g., ".")
missingness- Fraction of samples with missing genotypes
ac- The alternate allele count
an- The total non-missing allele count
sampleMaf- The in-sample alternate allele frequency
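
The relationships between these fields can be illustrated with a short calculation. This is a sketch under stated assumptions, not Bystro's implementation: samples are diploid, a single alternate allele is considered, and the fractions for heterozygosity and homozygosity use non-missing samples as the denominator (this reference does not state the denominator). The function name site_stats is hypothetical.

```python
# Hypothetical helper: derive per-site summary values from sample label lists.
# Assumptions: diploid genotypes, one alternate allele, and non-missing samples
# as the denominator for heterozygosity and homozygosity.
def site_stats(heterozygotes, homozygotes, missing_genos, n_samples):
    called = n_samples - len(missing_genos)  # samples with a genotype call
    if called <= 0:
        return None  # nothing to summarize when every genotype is missing

    ac = len(heterozygotes) + 2 * len(homozygotes)  # alternate allele count
    an = 2 * called  # total non-missing allele count
    return {
        "heterozygosity": len(heterozygotes) / called,
        "homozygosity": len(homozygotes) / called,
        "missingness": len(missing_genos) / n_samples,
        "ac": ac,
        "an": an,
        "sampleMaf": ac / an,
    }


if __name__ == "__main__":
    print(site_stats(["s1"], ["s2"], ["s3"], n_samples=4))
```

With one heterozygote, one homozygote, and one missing sample out of four, ac is 3 and an is 6 under these assumptions.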

File Metadata

vcfPos- Original VCF POS, unaffected by Bystro normalization
id- The VCF ID field
discordant- True if the input VCF reference does not match the Bystro-annotated UCSC reference

RefSeq Annotations

refSeq.* annotations are based on RefSeq transcripts. See UCSC refGene and kgXref for details.

Note: When a site is intergenic, all refSeq annotations will be NA. Consequences are annotated for all overlapping RefSeq transcripts and can be matched to their corresponding transcript names.

Functional Effects

refSeq.siteType- Effect type on transcript
  • Types: intronic, exonic, UTR3, UTR5, spliceAcceptor, spliceDonor, ncRNA
refSeq.exonicAlleleFunction- Coding effect of the variant
  • Values: synonymous, nonSynonymous, indel-nonFrameshift, indel-frameshift, stopGain, stopLoss, startLoss
  • NA for non-coding siteTypes

Protein Impact

refSeq.refCodon- Reference codon from in silico transcription
refSeq.altCodon- In silico transcribed codon after alt allele modification
refSeq.refAminoAcid- Amino acid from in silico translation of reference
refSeq.altAminoAcid- In silico translated amino acid after alt allele
refSeq.codonPosition- Position within codon (1, 2, 3)
refSeq.codonNumber- Codon number within transcript
refSeq.strand- Transcript strand, positive or negative (Watson/Crick)
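
To show how the codon and amino acid fields relate to refSeq.exonicAlleleFunction for single-base changes, the sketch below translates a reference and an alternate codon with the standard genetic code and labels the change. It is an illustration only: the function names are hypothetical, single-letter amino acid codes with '*' for stop are an assumption about representation, the startLoss rule (codon number 1, methionine lost) is a simplification, and indel categories are not covered.

```python
# Standard genetic code, built from the compact TCAG-ordered layout.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(codon):
    """Translate one DNA codon to a single-letter amino acid ('*' for stop)."""
    return CODON_TABLE[codon.upper()]


def classify_substitution(ref_codon, alt_codon, codon_number):
    """Hypothetical classifier for single-codon substitutions (no indels)."""
    ref_aa, alt_aa = translate(ref_codon), translate(alt_codon)
    if ref_aa == alt_aa:
        return "synonymous"
    if codon_number == 1 and ref_aa == "M":
        return "startLoss"  # simplification: first codon loses its methionine
    if alt_aa == "*":
        return "stopGain"
    if ref_aa == "*":
        return "stopLoss"
    return "nonSynonymous"


if __name__ == "__main__":
    print(classify_substitution("GAA", "GTA", codon_number=6))
```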

Gene & Transcript Identifiers

refSeq.name- RefSeq transcript ID
refSeq.name2- RefSeq gene symbol
refSeq.description- Long form description of RefSeq transcript
refSeq.kgID- UCSC's Known Genes ID
refSeq.mRNA- mRNA ID (transcript ID starting with NM_)
refSeq.ensemblID- Ensembl transcript ID
refSeq.isCanonical- Whether this is the canonical transcript for the gene
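
Because consequences are reported for every overlapping transcript, the refSeq.* columns of one row can be read as parallel arrays. The sketch below pairs them up by position. It assumes values are ';'-separated in matching order and, as a defensive assumption not confirmed by this reference, broadcasts a column holding a single value to every transcript. The function name per_transcript is hypothetical.

```python
# Hypothetical helper: align ';'-separated refSeq.* columns by position.
MISSING_MARKERS = frozenset({"!", "NA"})
DEFAULT_FIELDS = (
    "refSeq.name",
    "refSeq.name2",
    "refSeq.siteType",
    "refSeq.exonicAlleleFunction",
)


def per_transcript(row, fields=DEFAULT_FIELDS):
    """Return one dict per transcript, keyed by field name."""
    columns = {field: row[field].split(";") for field in fields}
    n = max(len(values) for values in columns.values())

    for field, values in columns.items():
        if len(values) == 1:
            columns[field] = values * n  # assumption: broadcast a lone value
        elif len(values) != n:
            raise ValueError(f"{field} has {len(values)} values, expected {n}")

    return [
        {
            field: (None if columns[field][i] in MISSING_MARKERS else columns[field][i])
            for field in fields
        }
        for i in range(n)
    ]


if __name__ == "__main__":
    example = {
        "refSeq.name": "NM_1;NM_2",
        "refSeq.name2": "GENE1",
        "refSeq.siteType": "exonic;intronic",
        "refSeq.exonicAlleleFunction": "nonSynonymous;NA",
    }
    for record in per_transcript(example):
        print(record)
```

The transcript IDs and gene symbol in the usage example are placeholders, not real accessions.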

External Database Links

refSeq.spID- UniProt protein accession number
refSeq.spDisplayID- UniProt display ID
refSeq.protAcc- NCBI protein accession number
refSeq.rfamAcc- Rfam accession number
refSeq.tRnaName- Name from tRNA track

Proximity Annotations

nearest.refSeq

Nearest transcript(s) by txStart, txEnd boundaries

nearest.refSeq.name2- Gene symbol
nearest.refSeq.name- Transcript ID
nearest.refSeq.dist- Distance to transcript

nearestTss.refSeq

Nearest transcript(s) by distance to transcription start site

nearestTss.refSeq.name2- Gene symbol
nearestTss.refSeq.name- Transcript ID
nearestTss.refSeq.dist- Distance to TSS

External Database Annotations

ClinVar (clinvarVcf)

Clinical significance annotations from ClinVar VCF dataset

clinvarVcf.id- ClinVar VCF ID
clinvarVcf.alt- ALT allele for this site
clinvarVcf.CLNSIG- Germline classification
clinvarVcf.CLNDN- Preferred disease name
clinvarVcf.CLNDNINCL- Disease name for included variants
clinvarVcf.CLNREVSTAT- Review status
clinvarVcf.CLNHGVS- HGVS expression
clinvarVcf.CLNSIGCONF- Conflicting classifications
clinvarVcf.ALLELEID- ClinVar Allele ID
clinvarVcf.AF_ESP- GO-ESP frequencies
clinvarVcf.AF_EXAC- ExAC frequencies
clinvarVcf.AF_TGP- 1000 Genomes frequencies
clinvarVcf.CLNVCSO- Sequence Ontology variant type
clinvarVcf.DBVARID- dbVar NSV accessions
clinvarVcf.ORIGIN- Allele origin
clinvarVcf.SSR- Suspect reason codes
clinvarVcf.RS- dbSNP ID (rs number)

gnomAD Exomes (gnomad.exomes)

Population frequencies from gnomAD exome dataset

Basic Fields:
gnomad.exomes.alt- ALT allele
gnomad.exomes.id- gnomAD VCF ID
gnomad.exomes.AN- Total allele number
gnomad.exomes.AF- Overall allele frequency
gnomad.exomes.AN_female- Female allele number
gnomad.exomes.AF_female- Female allele frequency
Filtered Datasets:
gnomad.exomes.non_cancer_AN- Non-cancer AN
gnomad.exomes.non_cancer_AF- Non-cancer AF
gnomad.exomes.non_neuro_AN- Non-neuro AN
gnomad.exomes.non_neuro_AF- Non-neuro AF
gnomad.exomes.non_topmed_AN- Non-TOPMed AN
gnomad.exomes.non_topmed_AF- Non-TOPMed AF
gnomad.exomes.controls_AN- Controls AN
gnomad.exomes.controls_AF- Controls AF
Population-specific:
gnomad.exomes.AN_nfe_seu- Southern European AN
gnomad.exomes.AF_nfe_seu- Southern European AF
gnomad.exomes.AN_nfe_bgr- Bulgarian AN
gnomad.exomes.AF_nfe_bgr- Bulgarian AF
gnomad.exomes.AN_afr- African/African-American AN
gnomad.exomes.AF_afr- African/African-American AF
gnomad.exomes.AN_sas- South Asian AN
gnomad.exomes.AF_sas- South Asian AF
gnomad.exomes.AN_nfe_onf- Other Non-Finnish European AN
gnomad.exomes.AF_nfe_onf- Other Non-Finnish European AF
gnomad.exomes.AN_amr- Latino/Admixed American AN
gnomad.exomes.AF_amr- Latino/Admixed American AF
gnomad.exomes.AN_eas- East Asian AN
gnomad.exomes.AF_eas- East Asian AF
gnomad.exomes.AN_nfe_swe- Swedish AN
gnomad.exomes.AF_nfe_swe- Swedish AF
gnomad.exomes.AN_nfe_nwe- Northwest European AN
gnomad.exomes.AF_nfe_nwe- Northwest European AF
gnomad.exomes.AN_eas_jpn- Japanese AN
gnomad.exomes.AF_eas_jpn- Japanese AF
gnomad.exomes.AN_eas_kor- Korean AN
gnomad.exomes.AF_eas_kor- Korean AF

gnomAD Genomes (gnomad.genomes)

Population frequencies from gnomAD v4 (hg38) or v2.1.1 (hg19) whole-genome dataset

Basic Fields:
gnomad.genomes.alt- ALT allele
gnomad.genomes.id- gnomAD VCF ID
gnomad.genomes.AN- Total allele number
gnomad.genomes.AF- Overall allele frequency
gnomad.genomes.AN_female- Female allele number
gnomad.genomes.AF_female- Female allele frequency
Filtered Datasets:
gnomad.genomes.non_neuro_AN- Non-neuro AN
gnomad.genomes.non_neuro_AF- Non-neuro AF
gnomad.genomes.non_topmed_AN- Non-TOPMed AN
gnomad.genomes.non_topmed_AF- Non-TOPMed AF
gnomad.genomes.controls_AN- Controls AN
gnomad.genomes.controls_AF- Controls AF
Population-specific:
gnomad.genomes.AN_nfe_seu- Southern European AN
gnomad.genomes.AF_nfe_seu- Southern European AF
gnomad.genomes.AN_afr- African/African-American AN
gnomad.genomes.AF_afr- African/African-American AF
gnomad.genomes.AN_nfe_onf- Other Non-Finnish European AN
gnomad.genomes.AF_nfe_onf- Other Non-Finnish European AF
gnomad.genomes.AN_amr- Latino/Admixed American AN
gnomad.genomes.AF_amr- Latino/Admixed American AF
gnomad.genomes.AN_eas- East Asian AN
gnomad.genomes.AF_eas- East Asian AF
gnomad.genomes.AN_nfe_nwe- Northwest European AN
gnomad.genomes.AF_nfe_nwe- Northwest European AF
gnomad.genomes.AN_nfe_est- Estonian AN
gnomad.genomes.AF_nfe_est- Estonian AF
gnomad.genomes.AN_nfe- Non-Finnish European AN
gnomad.genomes.AF_nfe- Non-Finnish European AF
gnomad.genomes.AN_fin- Finnish AN
gnomad.genomes.AF_fin- Finnish AF
gnomad.genomes.AN_asj- Ashkenazi Jewish AN
gnomad.genomes.AF_asj- Ashkenazi Jewish AF
gnomad.genomes.AN_oth- Other ancestry AN
gnomad.genomes.AF_oth- Other ancestry AF
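
A common use of the population-specific fields is to find the highest allele frequency across populations for a variant. The sketch below does this for a row represented as a dict of column name to string. The function name, the chosen subset of populations, and the assumption that each AF cell holds a single numeric value or a missing marker are all illustrative.

```python
# Hypothetical helper: highest population allele frequency for one row.
# Assumption: each AF cell is a single number or a missing marker.
MISSING_MARKERS = frozenset({"!", "NA", ""})
POPULATIONS = ("afr", "amr", "eas", "nfe", "fin", "asj", "oth")


def max_population_af(row, prefix="gnomad.genomes", populations=POPULATIONS):
    """Return (population, AF) for the largest AF, or None if none is present."""
    best = None
    for pop in populations:
        raw = row.get(f"{prefix}.AF_{pop}")
        if raw is None or raw in MISSING_MARKERS:
            continue  # population not reported for this variant
        af = float(raw)
        if best is None or af > best[1]:
            best = (pop, af)
    return best


if __name__ == "__main__":
    row = {
        "gnomad.genomes.AF_afr": "0.012",
        "gnomad.genomes.AF_nfe": "0.0004",
        "gnomad.genomes.AF_fin": "NA",
    }
    print(max_population_af(row))
```

Filtering on the maximum rather than the overall AF avoids discarding variants that are rare globally but common in one population.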

dbSNP

dbSNP 155 annotations with population frequencies from multiple studies

Basic Fields:
dbSNP.id- dbSNP VCF ID
dbSNP.alt- ALT allele
dbSNP.GnomAD- gnomAD v3 frequencies
dbSNP.GnomAD_exomes- gnomAD exome frequencies
dbSNP.1000Genomes- 1000 Genomes frequencies
dbSNP.TOPMED- TOPMed frequencies
dbSNP.ExAC- ExAC frequencies
dbSNP.GoESP- NHLBI ESP frequencies
dbSNP.HapMap- HapMap frequencies
dbSNP.dbGaP_PopFreq- dbGaP aggregated frequencies
Asian Populations:
dbSNP.TOMMO- Tohoku Medical Megabank
dbSNP.Korea1K- Korea1K dataset
dbSNP.KOREAN- Korean Reference Genome
dbSNP.Vietnamese- Kinh Vietnamese database
European Populations:
dbSNP.GoNL- Genome of Netherlands
dbSNP.GENOME_DK- Danish reference pan genome
dbSNP.NorthernSweden- Northern Sweden samples
dbSNP.TWINSUK- TwinsUK cohort
dbSNP.ALSPAC- ALSPAC cohort
Other Populations:
dbSNP.Siberian- Siberian populations
dbSNP.Qatari- Qatar Genome dataset
dbSNP.MGP- Spanish population (MGP)
dbSNP.PRJEB37584- Project PRJEB37584
dbSNP.SGDP_PRJ- Simons Genome Diversity Project

Deleteriousness Scores

CADD Scores

Combined Annotation Dependent Depletion (CADD) scores are non-negative, and higher values indicate greater predicted deleteriousness. Variants with CADD > 15 are more likely to be deleterious.

cadd (SNPs)

cadd- CADD score for SNPs

caddIndel (Indels & MNPs)

caddIndel.alt- ALT allele
caddIndel.PHRED- CADD PHRED score

Note: Since Bystro decomposes MNPs into "SNP" records, caddIndel may occasionally be populated for SNPs that are part of MNPs.
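
Since a row may carry its score in either cadd or caddIndel.PHRED, a filter can check both. The sketch below is one way to do that, assuming rows are dicts of strings, that multiple values may be separated by ';' or '|', and that taking the maximum of several values is acceptable. The function names and the choice of maximum are illustrative, not part of Bystro.

```python
import re

# Hypothetical helpers: pick a CADD score from whichever field is populated.
MISSING_MARKERS = frozenset({"!", "NA", ""})


def cadd_score(row):
    """Return the highest available score from cadd, then caddIndel.PHRED."""
    for field in ("cadd", "caddIndel.PHRED"):
        raw = row.get(field)
        if raw is None or raw in MISSING_MARKERS:
            continue
        # Assumption: several values may be joined with ';' or '|'.
        values = [float(v) for v in re.split(r"[;|]", raw) if v not in MISSING_MARKERS]
        if values:
            return max(values)
    return None


def likely_deleterious(row, threshold=15.0):
    """Apply the CADD > 15 rule of thumb described above."""
    score = cadd_score(row)
    return score is not None and score > threshold


if __name__ == "__main__":
    print(likely_deleterious({"cadd": "23.1", "caddIndel.PHRED": "NA"}))
    print(likely_deleterious({"cadd": "!", "caddIndel.PHRED": "12.0|18.5"}))
```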

Using Field Descriptions

  • Filter effectively: Use these field descriptions to build precise queries
  • Understand relationships: Match transcript annotations to gene symbols using array ordering
  • Population context: Compare your sample frequencies to public database frequencies
  • Clinical relevance: Combine ClinVar significance with deleteriousness scores
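
The last two points can be combined into a simple prioritization rule. The sketch below keeps variants that are rare in gnomAD and either classified as pathogenic in ClinVar or above the CADD threshold. The function names and the 1% frequency cutoff are hypothetical choices, rows are assumed to be dicts of strings, and the CLNSIG matching is a rough substring test rather than a full parser of ClinVar classifications.

```python
# Hypothetical prioritization sketch combining ClinVar, CADD, and gnomAD fields.
MISSING_MARKERS = frozenset({"!", "NA", ""})


def to_float(raw):
    """Convert a cell to float, returning None for missing values."""
    if raw is None or raw in MISSING_MARKERS:
        return None
    return float(raw)


def clinvar_is_pathogenic(clnsig):
    """Rough test: matches Pathogenic and Likely_pathogenic, skips conflicts."""
    text = (clnsig or "").lower()
    return "pathogenic" in text and "conflicting" not in text


def prioritize(row, max_af=0.01, cadd_threshold=15.0):
    """Keep rare variants with ClinVar or CADD support (assumed cutoffs)."""
    af = to_float(row.get("gnomad.genomes.AF"))
    cadd = to_float(row.get("cadd"))
    rare = af is None or af <= max_af  # absent from gnomAD counts as rare
    damaging = cadd is not None and cadd > cadd_threshold
    return rare and (clinvar_is_pathogenic(row.get("clinvarVcf.CLNSIG")) or damaging)


if __name__ == "__main__":
    row = {"clinvarVcf.CLNSIG": "Pathogenic", "cadd": "NA", "gnomad.genomes.AF": "0.0001"}
    print(prioritize(row))
```

Treating variants absent from gnomAD as rare is a design choice; stricter pipelines may prefer to handle them separately.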