The sight of mistletoe hanging in trees this winter will be especially poignant for Darwin Tree of Life scientists who spent many months finding innovative ways to generate this giant genome.

Image: Human genome (top left) compared to the first chromosome of the European mistletoe genome. (Image: Genome Reference Informatics Team, Wellcome Sanger Institute)

Darwin Tree of Life genomicists at the Wellcome Sanger Institute and University of Edinburgh have certainly earned an end-of-year break, having spent much of 2022 tackling a fittingly festive species. The European mistletoe (Viscum album) has the largest genome of any species from Britain and Ireland. It has now had its DNA sequenced, its genome assembled to top chromosomal-level quality, and - following a thorough final check of our work - will be submitted by the DToL project to public databases in the new year.

A giant among genomes

At around 90 gigabase pairs (Gbp) the mistletoe genome is 30 times larger than our own human genome, and easily the largest reference genome assembled thus far. Surprisingly, most of this genetic material is stored in just 10 enormous chromosome pairs - remember that humans have 23. Even the smallest of these mistletoe chromosomes is the same size as roughly three entire human genomes, at over 9 Gbp in size.

Image: Human genome (again, top left) compared to the entire European mistletoe genome. (Image: Genome Reference Informatics Team, Wellcome Sanger Institute)

For comparison, the images here show Hi-C maps of the entire Homo sapiens genome and just the first chromosome of Viscum album. Our bioinformaticians use Hi-C visualisations to manually check and edit our genome assemblies, with the diagonal line representing genome length and each darker square representing a chromosome. The sheer scale of that map makes curating this genome particularly mind-boggling. But this process only comes towards the end of a series of massive challenges.
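As a rough check on the size comparisons above, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: the 3 Gbp human genome size is an assumption implied by the "30 times larger" figure rather than a number given in this article, and the constant and function names are purely illustrative.

```python
# Back-of-the-envelope check of the genome sizes quoted in this article.
# All values are approximate, in gigabase pairs (Gbp).

MISTLETOE_GENOME_GBP = 90.0     # "around 90 Gbp" (from the article)
SMALLEST_CHROMOSOME_GBP = 9.0   # "over 9 Gbp" (from the article, a lower bound)
HUMAN_GENOME_GBP = 3.0          # ASSUMPTION: round figure, not stated in the article


def human_genome_equivalents(size_gbp: float) -> float:
    """Return how many human genomes fit into a sequence of the given size."""
    return size_gbp / HUMAN_GENOME_GBP


whole_genome = human_genome_equivalents(MISTLETOE_GENOME_GBP)
smallest_chromosome = human_genome_equivalents(SMALLEST_CHROMOSOME_GBP)

print(f"Whole mistletoe genome: about {whole_genome:.0f} human genomes")
print(f"Smallest mistletoe chromosome: about {smallest_chromosome:.0f} human genomes")
```

With a more precise human genome size the ratios shift slightly, so the article's figures are best read as approximations.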
Starting big

The decision to sequence the mistletoe genome was made very early in the Darwin Tree of Life (DToL) project, which launched in late 2019. Thanks to years of research into plant genome size, not least at DToL partner Kew Gardens, our scientists knew mistletoe dwarfed other species.

Size really does matter when sequencing the genomes of all life in Britain and Ireland.

Viscum album does not have the largest plant genome in the world; that record is held by Paris japonica (150 Gbp). But the closest British and Irish species, members of the lily and onion families, trail far behind on 30 to 40 Gbp.

Alex Twyford, senior lecturer at the University of Edinburgh and a parasitic plant specialist, was one of those early DToL decision makers.

“For me, Darwin Tree of Life is huge in scale with so many different species, but it’s important we’re tackling some of the most challenging species right from the start. And if we want to face some of those problems early on, why not go for the largest genome in Britain and Ireland?”

If DToL could sequence mistletoe, the basic fact of a genome’s size would not pose a problem for future species. But this was also an opportunity for the project. Mistletoe helped us stress test our equipment and work out whether the variability of early results was due to our processes or the species we were sequencing. To do this testing and tweaking over time, we needed a whole load of genetic data to play with, ideally from a single specimen to make trials repeatable.

“To get Darwin Tree of Life up and running, we needed a test case. Mistletoe was an obvious candidate.”
Alex Twyford, senior lecturer, Institute of Ecology and Evolution, and parasitic plant specialist

Last Christmas… we sequenced this plant

The first mistletoe samples were collected from a female plant in September 2020. It still grows on a hawthorn (Crataegus sp.) near Kew - easy for DToL botanists to return to for new samples.
Image: Mistletoe (Viscum album) growing on the Wellcome Genome Campus wetlands nature reserve. (Image: Wellcome Sanger Institute)

The mistletoe samples were sent to the Wellcome Sanger Institute where high molecular weight DNA was extracted from their cells. Powerful machines are then used to turn physical DNA molecules into long strings of ACGT code on huge computer text files.

Several different types of data are required to build high-quality reference genomes. One team at Sanger specialises in producing long-read sequence data using machines called Sequel IIe systems, made by California-based company Pacific Biosciences (PacBio).

The PacBio machines are based around a technology called SMRT cells (pronounced ‘smart’). Genome size matters for these machines: any species with a genome below 1 Gbp can be sequenced using one SMRT cell over a day or less.

For mistletoe, the team ran all 12 of Sanger’s Sequel IIe systems for a week to get the amount of data required. This was winter 2021, and the team found themselves facing a festive challenge.

“It wasn’t on purpose, but we found ourselves racing to sequence the mistletoe genome before Christmas. We knew that meant a lot of SMRT cells. So there was a bit of excitement there, and some relief once we’d completed it,” says James Watts, who leads one of the long-read teams.

In total, 10 terabytes of DNA sequence data was generated for the mistletoe. DToL’s botanists like to point out that, although many of the project’s first genomes are of insects, this one plant required about as much sequence data as 100 insect species combined.

Hundreds of jobs running

By February 2022 all the DNA sequence data had reached Shane McCarthy and the Tree of Life Assembly team. Their role is to first check the quality of the data they receive. They then assemble the data into long, contiguous pieces and ‘scaffold’ that into chromosome-sized blocks.
Much of this is done using automated tools, many developed by Shane and his team.

“Knowing the genome size ahead of time is helpful for knowing the kind of compute resource you’re going to use. For mistletoe we did a lot of special things because we knew the genome would be so large.”
Dr Shane A. McCarthy, Tree of Life Assembly Team Lead

Three terabytes of storage was needed just to get the mistletoe’s raw data on disk. Only two machines at Sanger had the memory to actually do the assembly. Then, to map the genome, the team has lots of small jobs running in parallel on different computers.

“For a smaller genome you might have 10 jobs running. For the mistletoe it was hundreds,” says Shane.

Image: Mistletoe berries. (Image: Luke Lythgoe, Wellcome Sanger Institute)

Christmas time, mistletoe in line

By June an assembly had been generated that the scientists were happy with. The next stage of the process is known as curation, which involves going chromosome by chromosome to check every little detail, confirming any translocations or potential errors.

“It’s kind of like crafting the genome. You see what comes out of the [genome production] pipeline, like out of a box, and using the genome biology and the data you try to improve it,” explains Lucia Campos-Dominguez from the University of Edinburgh, who tackled the curation of the mistletoe. “This was a very intensive part. Scrolling through the mistletoe genome, chromosome by chromosome, correcting things by hand.”

With small genomes, for example butterflies and moths, you might make a few edits per chromosome. Lucia ended up making hundreds of edits on each of the huge mistletoe chromosomes.

Image: Mistletoe in a tree on the Wellcome Genome Campus grounds; the snow fell shortly after Lucia finished curating the genome. (Image: Luke Lythgoe, Wellcome Sanger Institute)

One issue that quickly became apparent was the resolution available for the Hi-C maps on standard software.
It was so low at mistletoe’s scale that only large blocks could be moved around and no editing of finer detail was possible. The solution was to split the chromosomes into separate files and edit them there, which made it more arduous to find the smaller bits of misplaced sequence and assign them to the correct chromosome.

“Since the summer, mistletoe has been my main task,” says Lucia. “I received training from the Sanger team and we worked out a way around the resolution issues. Then we got the mistletoe data and I worked on it for three months straight.”

To put this timeframe into perspective, Lucia curated two other plant genomes - the box (Buxus sempervirens) and a moss (Polytrichum commune) - in a single week before embarking on the mistletoe.

Curation was finished in early December 2022, another milestone achieved just before Christmas.

“There is a lot of decision making in curation, which I think is the hardest part,” Lucia reflects. “Is this the right set of sequences, or should I change the order? Is this an inversion? These kinds of structural changes you are supposed to make to the genome, it feels deep because you’re actually altering the results. I got a lot of reassurance from the Sanger team who helped out a lot.”

New year, new genome

A few final challenges remain before the Viscum album genome assembly data is uploaded to public databases for scientists worldwide to freely access. For example, although genomes as large as 100 Gbp can be uploaded, the databases cannot take individual continuous sequences of DNA larger than 2.14 Gbp. Since the mistletoe chromosomes are so large, they will need to be split into six pieces each, but will all still be part of the same genome submission.

Nevertheless, the finish line for this marathon genome is in sight.

“There were numerous challenges along the path, but it was well worth it. I’d do it again,” says Alex Twyford.

But where do you go from the biggest genome?
Well, plant genomes do lots of complicated things. One is called polyploidy, where the plant has duplicated its genome at different points in its history. The mistletoe has not done this; it is a straightforward diploid, meaning it only has pairs of chromosomes like humans. Some plants have many more copies of the same chromosome. The adder’s tongue fern (Ophioglossum sp.) has done this to such an extent that it has well over a thousand chromosomes in each of its cells.

“Now Darwin Tree of Life has sequenced the largest genome, I’d be keen to tackle this second challenge - the polyploidy issue. With some of those polyploid genomes you’re sequencing two, or four, or eight genomes in one. Trying to untangle those is the next frontier.”
Alex Twyford, senior lecturer, Institute of Ecology and Evolution, and parasitic plant specialist

Publication date: 20 Dec, 2022