Mapping annotations through a pangenome graph

By Sergio Zavala · July 16, 2025

Mapping annotations through a pangenome graph (Summer Evaluation)

Abstract

Annotation pipelines are automated workflows that use annotated genomes as a guide to transfer functional features onto a new genome. Annotations help researchers identify gene structures and active genes, which supports work in areas such as disease research and diagnostics. Pangenome graphs provide a framework for capturing genomic variation, including single-nucleotide variants, structural rearrangements, and haplotype diversity, across multiple individuals in a single structure, overcoming the biases and blind spots inherent in single linear references. Current annotation tools project annotations from one assembly to another by leveraging whole-genome alignments; however, no existing solution directly traverses and annotates the complex topology of a pangenome graph. To address this, we are developing a prototype tool that walks through a pangenome graph, identifies the nodes that correspond to a reference assembly, and maps that assembly's annotations onto every genome path. The goal of the prototype is an algorithm that serves as a guide for a full implementation, enabling us to answer important questions, such as how SNPs and structural variants affect transcript expression. The prototype will be tested on a graph constructed from 462 high-quality, haplotype-resolved human assemblies generated by the Human Pangenome Reference Consortium, laying the groundwork for a scalable, full-featured implementation capable of accurately and efficiently annotating pangenome graphs.

Background

During my summer as an undergraduate researcher at UC Santa Cruz, I took on the research question, “Can we develop an efficient algorithm that can handle annotating genes through a pangenome graph at a scale of hundreds of genomes?” The first two weeks were the onboarding stage, which consisted of being added to the network cluster, reading research papers, and learning the common terms used in computational genomics, since my background is in computer science. Once I was settled, my first task was to familiarize myself with GFA. GFA (Graphical Fragment Assembly) is a format for representing genome assemblies as graphs. Genome assemblies are constructed by piecing DNA sequencing reads back together to reconstruct the original genome sequence.

In this example, segments (S) are nodes that contain sequences from our assemblies, along with other optional tags, such as sequence ID, start position, and more.

In my case, I was given a small pangenome GFA file built from CHM13 and two other haplotype assemblies. The file consists of Segments (nodes), Links (edges that connect nodes), and Walks (the full path of each genome through the graph, one per chromosome). I also had to explore GFF (General Feature Format), which is used to store the annotations of a genome.
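To make the three record types concrete, here is a minimal sketch of a GFA reader. It is not the parser used in this project; the function name `parse_gfa`, the tiny example graph, and the sample name `sampleA` are made up for illustration. It assumes tab-separated GFA 1.1 records, where a Walk line lists oriented segments such as `>1>2>4`.

```python
import re
from typing import Dict, Iterable, List, Tuple

Step = Tuple[str, str]  # (segment_id, orientation), orientation is ">" or "<"


def parse_gfa(lines: Iterable[str]):
    """Collect Segments, Links, and Walks from GFA 1.1 lines (hypothetical helper)."""
    segments: Dict[str, str] = {}                    # segment_id -> sequence
    links: List[Tuple[str, str, str, str]] = []      # (from, from_orient, to, to_orient)
    walks: Dict[Tuple[str, str, str], List[Step]] = {}  # (sample, haplotype, seq_id) -> steps
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "W":
            sample, hap, seq_id = fields[1], fields[2], fields[3]
            # The walk string is in column 7; split it into oriented steps.
            steps = [(m.group(2), m.group(1))
                     for m in re.finditer(r"([<>])([^<>]+)", fields[6])]
            walks[(sample, hap, seq_id)] = steps
    return segments, links, walks


# Illustrative graph: one SNP bubble (segment 2 vs. segment 3).
example = [
    "H\tVN:Z:1.1",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "W\tCHM13\t0\tchr20\t0\t8\t>1>2>4",
    "W\tsampleA\t1\tchr20\t0\t8\t>1>3>4",
]
segments, links, walks = parse_gfa(example)
```

In this toy graph the two walks share segments 1 and 4 and differ only in the middle segment, which is the situation the later mapping steps have to handle.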

In this example, the sequence ID is ‘ctg123’. We have gene, mRNA, and TF_binding_site features. We have the start and end positions of each feature, along with additional tags.
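A GFF3 record is a single tab-separated line with nine columns, so reading one is short. The sketch below is illustrative only: the `Feature` type and `parse_gff3_line` are hypothetical names, and the sample line is written in the style of the example above rather than taken from our data. It converts the 1-based, inclusive GFF3 coordinates to 0-based, half-open ones, which is an assumption made here to simplify overlap arithmetic.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    type: str
    start: int                  # 0-based, inclusive
    end: int                    # 0-based, exclusive
    strand: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 record; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    # GFF3 stores 1-based inclusive coordinates; shift start to get [start, end).
    return Feature(cols[0], cols[2], int(cols[3]) - 1, int(cols[4]), cols[6], attributes)


sample = "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001;Name=EDEN.1"
feature = parse_gff3_line(sample)
```

The `Parent` attribute is what ties an mRNA to its gene and an exon to its mRNA, so keeping the attributes as a dictionary makes that hierarchy easy to rebuild later.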

Understanding these two files and the information they provide is crucial for the first step of developing our annotation system. Our first question was, “How can we use gene positions from an annotated genome to map the correct gene coordinates onto our reference genome in a GFA file?”

Gene Coordinate Mapping on Reference

Brute-Force Approach

I initially worked on a brute-force algorithm to make sure I understood what our algorithm was required to do. Our workflow starts with segment mapping, a function that stores important genomic information from each segment on our reference genome. We apply the same idea to the annotated genome: for each chromosome, we map every feature to a dictionary of genomic information taken from its annotation record.

Example of our format for each mapping

With these maps, we can now begin developing an algorithm that can transfer the coordinates from our annotations onto the segments of our reference genome. To map coordinates, we must search for overlapping annotations within our segments and label them accordingly. Overlaps help us identify which segments in our graph are within the coordinates of a gene.

Snippet of our gene coordinate mapping process onto reference segments.
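For readers without the original snippet, the brute-force overlap search can be sketched as follows. This is a simplified stand-in, not the project code: the function name, the tuple layouts, and the example coordinates are assumptions, and all intervals are treated as 0-based and half-open.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]        # (segment_id, start, end) on the reference
Annotation = Tuple[str, int, int]     # (feature_id, start, end) on the reference
Hit = Tuple[str, str, int, int]       # (feature_id, segment_id, overlap_start, overlap_end)


def brute_force_overlaps(segments: List[Segment], features: List[Annotation]) -> List[Hit]:
    """Compare every feature against every segment: O(F * S) comparisons."""
    hits: List[Hit] = []
    for feat_id, f_start, f_end in features:
        for seg_id, s_start, s_end in segments:
            # Two half-open intervals overlap when each starts before the other ends.
            if s_start < f_end and f_start < s_end:
                hits.append((feat_id, seg_id, max(s_start, f_start), min(s_end, f_end)))
    return hits


segments_on_ref = [("s1", 0, 10), ("s2", 10, 25), ("s3", 25, 40)]
features_on_ref = [("geneA", 5, 30), ("exon1", 12, 20)]
hits = brute_force_overlaps(segments_on_ref, features_on_ref)
```

The inner loop scans every segment for every feature, which is the cost that becomes prohibitive once a chromosome has many segments and annotations.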

Issues

The first issue with this approach was its time complexity. The mapping functions themselves ran quickly, but the coordinate-mapping step relied on too many nested loops and ran for hours. The next step was to optimize our approach so that segments could be matched to annotations faster.

What can we optimize?

Before we can focus on optimizing our approach, we must identify what we can’t optimize. To map annotations onto our reference genome, we must iterate through every sequence ID in the annotations, every gene for each sequence ID, and every feature for each gene. What we can change is the overlap search itself: our current approach loops through every candidate segment for each feature, so is there a way to make that search faster?

Interval Trees

Interval trees are typically built on self-balancing binary search trees. Each node stores an interval (start, end), which lets an overlap search skip entire subtrees: finding all k intervals that overlap a query takes O(log n + k) time instead of a scan over all n segments. The idea is to have our segment mapping stage return an interval tree of segments instead of a map, accepting a slower mapping stage (building the tree takes O(n log n) time) in exchange for much faster position lookups.

The new interval tree structure for each sequence ID
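The following is a minimal sketch of such a tree, not the structure used in our tool. It assumes the segments are known up front, so instead of rebalancing on insert it builds a balanced tree once from the sorted intervals. The names `Node`, `build`, and `query` are illustrative, and intervals are 0-based and half-open.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int, str]  # (start, end, segment_id), half-open [start, end)


@dataclass
class Node:
    interval: Interval
    max_end: int                      # largest end coordinate in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(intervals: List[Interval]) -> Optional[Node]:
    """Build a balanced tree keyed on start by always splitting at the median."""
    items = sorted(intervals)

    def rec(lo: int, hi: int) -> Optional[Node]:
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = Node(items[mid], items[mid][1])
        node.left = rec(lo, mid)
        node.right = rec(mid + 1, hi)
        for child in (node.left, node.right):
            if child is not None:
                node.max_end = max(node.max_end, child.max_end)
        return node

    return rec(0, len(items))


def query(node: Optional[Node], start: int, end: int,
          out: Optional[List[Interval]] = None) -> List[Interval]:
    """Return all stored intervals overlapping [start, end), ordered by start."""
    if out is None:
        out = []
    # Nothing in this subtree ends after the query starts, so skip it entirely.
    if node is None or node.max_end <= start:
        return out
    query(node.left, start, end, out)
    s, e, _ = node.interval
    if s < end and start < e:
        out.append(node.interval)
    # Everything to the right starts at or after s; only descend if s < end.
    if s < end:
        query(node.right, start, end, out)
    return out


tree = build([(0, 10, "s1"), (10, 25, "s2"), (25, 40, "s3"), (40, 60, "s4")])
overlapping = [seg_id for _, _, seg_id in query(tree, 8, 30)]
```

The `max_end` field is what makes the pruning possible: a whole subtree is skipped as soon as none of its intervals can reach the query.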

Result

Our initial function parses the GFA reference nodes and the GFF annotation file. It then maps each annotation from the annotated reference genome onto the pangenome reference genome, identifying nodes that overlap with specific features. Finally, the function returns a mapping of each overlapping node and its corresponding feature information, such as overlapping sites and related attributes.

Visual representation of our reference’s features being mapped onto our pangenome.

Gene Coordinate Mapping onto Targets

Brute-Force Approach

After transferring our annotations onto our pangenome graph, we can use those gene coordinates to transfer features onto the remaining target genomes. How do we transfer annotations from our reference nodes onto our target nodes? Our function traverses each target sample ID (our two haplotypes), each sequence ID (chr1, chr2, etc.), and every annotation for that sequence ID. For each annotation, we must confirm that its node exists in the target path; if a segment ID exists in the reference but not in the target, the target carries a variant at that position. If the segment ID does match, we calculate the feature's start and end points on the target from the feature's relative start on our reference and record the location.

Brute force approach showcasing our segment matching process and gene coordinate transfer
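The segment-matching idea can be sketched in a few lines. This is a simplified illustration rather than our implementation: `walk_offsets` and `transfer` are hypothetical names, every step of the walk is assumed to be in forward orientation, and a segment that appears more than once in a walk is placed at its first occurrence only.

```python
from typing import Dict, List, Tuple

RefHit = Tuple[str, str, int, int]  # (feature_id, segment_id, rel_start, rel_end) within the segment


def walk_offsets(walk: List[str], seg_len: Dict[str, int]) -> Dict[str, int]:
    """Map each segment ID to the coordinate where it starts along a walk."""
    offsets: Dict[str, int] = {}
    pos = 0
    for seg_id in walk:
        offsets.setdefault(seg_id, pos)  # simplification: keep the first visit only
        pos += seg_len[seg_id]
    return offsets


def transfer(ref_hits: List[RefHit], target_walk: List[str], seg_len: Dict[str, int]):
    """Place each reference hit on the target if the target walk shares the segment."""
    offsets = walk_offsets(target_walk, seg_len)
    mapped, missing = [], []
    for feat_id, seg_id, rel_start, rel_end in ref_hits:
        if seg_id not in offsets:
            # Segment is on the reference but not the target: a variant site.
            missing.append((feat_id, seg_id))
            continue
        base = offsets[seg_id]
        mapped.append((feat_id, seg_id, base + rel_start, base + rel_end))
    return mapped, missing


seg_len = {"1": 4, "2": 1, "3": 3, "4": 3}
ref_hits = [("exon1", "1", 1, 4), ("exon1", "2", 0, 1), ("exon1", "4", 0, 2)]
mapped, missing = transfer(ref_hits, ["1", "3", "4"], seg_len)
```

In the example the target replaces segment 2 with the longer segment 3, so the piece of `exon1` on segment 4 lands two bases further along the target than it would on the reference walk, and the piece on segment 2 is reported as missing.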

Issues

The issue with this approach is that it never verifies the exact gene coordinates on the target. The transferred positions are predictions based on shared segments, so some gene annotations end up slightly inaccurate.

How can we improve accuracy?

Exon-Level Mapping

Since exons are the parts of a transcript that remain in the mature RNA after splicing, they provide a much more precise basis for coordinate mapping. Our refined approach performs coordinate transfer only when the annotation feature type is an exon. We first calculate the predicted start and end points of the feature on the target. Then, using FASTA files (plain-text records of full genome sequences) for both the target and reference genomes, we add padding around the predicted target interval and align the padded target sequence to the corresponding reference exon sequence, which lets us extract the precise coordinates of the feature.

Diagram showcasing how adding padding to our predicted start/end can help extract precise coordinates
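A small sketch of the padding idea follows. It does not reproduce the aligner or parameters used in our tool; it swaps in a plain edit-distance "fitting" alignment written in pure Python, which finds the substring of the padded target window that best matches the whole reference exon. The names `fit_align` and `refine_exon`, the default padding of 10 bases, and the toy sequences are assumptions for illustration.

```python
from typing import Tuple


def fit_align(query: str, window: str) -> Tuple[int, int, int]:
    """Best match of the whole query inside window: (edit_distance, start, end)."""
    # prev[j] = (cost, start) for aligning the query prefix so that it ends at window[:j].
    # Starting anywhere in the window is free, so row 0 costs nothing.
    prev = [(0, j) for j in range(len(window) + 1)]
    for i in range(1, len(query) + 1):
        cur = [(i, 0)]
        for j in range(1, len(window) + 1):
            diag_cost, diag_start = prev[j - 1]
            match = (diag_cost + (query[i - 1] != window[j - 1]), diag_start)
            skip_query = (prev[j][0] + 1, prev[j][1])          # query base left unmatched
            skip_window = (cur[j - 1][0] + 1, cur[j - 1][1])   # extra base in the window
            cur.append(min(match, skip_query, skip_window))
        prev = cur
    # Ending anywhere in the window is free too: take the cheapest end position.
    end = min(range(len(window) + 1), key=lambda j: prev[j][0])
    cost, start = prev[end]
    return cost, start, end


def refine_exon(ref_exon: str, target_seq: str, pred_start: int, pred_end: int,
                pad: int = 10) -> Tuple[int, int, int]:
    """Pad the predicted interval, align the reference exon, return refined coordinates."""
    lo = max(0, pred_start - pad)
    hi = min(len(target_seq), pred_end + pad)
    cost, start, end = fit_align(ref_exon, target_seq[lo:hi])
    return lo + start, lo + end, cost


ref_exon = "ACGTGGATCC"
target = "TTTTTTTT" + ref_exon + "AAAAAAAA"   # the exon truly sits at [8, 18)
refined = refine_exon(ref_exon, target, pred_start=6, pred_end=16, pad=5)
```

Here the prediction is off by two bases, and the padded window is wide enough for the alignment to recover the true interval. The dynamic program costs time proportional to exon length times window length, so a production version would more likely call an optimized aligner.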

Result

After testing our tool on chromosome 20 of a target haplotype, we observed improved accuracy in gene coordinate predictions. However, the method still does not correctly map all exons from the target paths, achieving approximately 80% accuracy in gene coordinate mappings.

Pangenome Annotation Toolkit

Comparative Annotation Toolkit

Future Work

Our current goal is to modify the algorithm to recover missing exons in the target assemblies by chaining them with neighboring exons from the reference. Once we achieve 100% accuracy, the next step will be to scale the algorithm to support an arbitrary number of reference annotations while effectively handling duplicates and related cases. The diagram below illustrates two annotated genomes being mapped onto our pangenome graph. Ultimately, the algorithm should be robust enough to transfer multiple annotations across all target genome paths.