Genomics and Bioinformatics Team | 2021 Progress Report

Cucurbit Coordinated Agricultural Project Annual Progress Report | bioinformatics platforms, databases, pan-genomic analyses

CucCAP researchers and stakeholders met on October 27, 28 & 29, 2021 to present and discuss the grant’s accomplishments, ongoing research, plans and expectations.

Team members:

  • Zhangjun Fei (Boyce Thompson Institute)
  • Shan Wu (Boyce Thompson Institute)
  • Amnon Levi (USDA, ARS)
  • Yiqun Weng (USDA, ARS)
  • Michael Mazourek (Cornell University)
  • Jim McCreight (USDA, ARS)
  • Rebecca Grumet (Michigan State University)

CucCAP Affiliated Postdocs and Graduate Students

  • Jingyin Yu – Postdoc at Boyce Thompson Institute (Fei, Wu)
  • Honghe Sun – Graduate Student at Cornell Plant Biology (Fei, Wu)

Objectives

Develop novel advanced bioinformatic, pan-genome, and genetic mapping tools for cucurbits.

Work in progress and plans

1.1. Develop genomic and bioinformatic platforms for cucurbit crops.

1.1.1. Development of high-resolution genotyping platforms for cucurbits.

Our main goal here is to construct single-base resolution genome variation maps (variomes), including both SNPs and SVs, through genome resequencing of all accessions in the core collections of cucumber, watermelon, melon and squash, and to make them publicly available to the community. Currently, plants of the cucumber core collection have been grown, and tissue collection and DNA extraction are under way.

Under CucCAP1, genome resequencing data for 21 accessions in the Cucurbita pepo core collection and for 133 accessions in the cucumber core at an average depth of ~15× were generated and processed.

1.1.2. Development of novel, advanced genome and pan-genome platforms for cucurbit species.

Because sequencing costs have fallen and our Chinese collaborators have contributed genome resequencing and de novo genome sequencing, we have changed our plan from generating one or two additional reference genomes to developing multiple reference genomes (several dozen) for each crop (mainly cucumber and watermelon) using HiFi sequencing technology.

For cucumber, we have selected 25 accessions for HiFi sequencing, including five wild Cucumis sativus var. hardwickii, four semi-wild Xishuangbanna and 16 cultivated cucumbers. Ten of these 25 accessions are from the core collection. Plants of all 25 accessions are being grown in the BTI greenhouse. High molecular weight (HMW) DNA has been extracted for six accessions and was sent to Mount Sinai in early October for sequencing. In addition, in collaboration with the Chinese group, we have sequenced and assembled reference genomes for another 11 cucumber accessions using PacBio CLR reads, including seven cultivated, one Xishuangbanna and three wild cucumbers.

For watermelon, in collaboration with the Chinese group, we selected a total of 127 accessions for reference genome development, including one Citrullus naudinianus, one C. rehmii, one C. ecirrhosus, five C. colocynthis, 13 C. amarus, five C. mucosospermus, eight landraces, and 93 cultivated lines. Eight of these accessions are in the core collection (five C. amarus, one landrace, and two cultivated). Sequencing of two accessions has been completed, and leaf tissues of 122 accessions have been sent to the company for DNA extraction, library preparation and HiFi sequencing, while leaf tissues for the remaining three accessions should be ready in a month. In addition, we have assembled a C. mucosospermus (USVL531-MDR) genome using PacBio CLR reads, resulting in 78 contigs with a total size of 365.3 Mb and an N50 contig size of 27.58 Mb; 99.4% of the contigs were anchored and ordered to the 11 watermelon chromosomes. We have also assembled a Kordofan melon (C. lanatus subsp. cordophanus) genome using PacBio CLR reads. The assembled genome contained 86 contigs with a total size of 367.9 Mb and an N50 contig size of 9.34 Mb, and 98.94% of the contigs were clustered into 11 pseudomolecules.
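The N50 contig size quoted for these assemblies is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal Python sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions, not data or code from this project.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), not data from this report.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150, half of the 300 total
```

In practice the lengths would be read from the assembly FASTA or its index; the same function applies unchanged.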

For squash, in collaboration with the Chinese group, three accessions, two from Cucurbita pepo ssp. texana (also known as ssp. ovifera) and one from Cucurbita pepo ssp. pepo, have been selected for HiFi sequencing. The sequencing and genome assembly of these three accessions are expected to be completed later this month (October 2021). We are also generating improved reference genomes for Cucurbita maxima Rimu and Cucurbita moschata Rifu using HiFi and Hi-C sequencing. Both HiFi and Hi-C sequencing have been completed and genome assembly is underway.

1.1.3. De novo genome assembly and pan-genome construction

We have evaluated how HiFi read depth affects the assembly of cucurbit genomes, using watermelon as the example. We generated ~30× HiFi reads with average lengths of 18.8 kb and 16.4 kb for watermelon cultivars LvWangTuo and SP5, respectively, randomly sampled the reads to different depths, and assembled them using HiCanu. We found that ~20× HiFi reads, which correspond to the throughput of half a SMRT cell on the Sequel IIe system, are sufficient for a high-quality reference genome assembly. Using ~20× HiFi reads, we obtained assemblies with total sizes of 368.2 Mb and 367.6 Mb and N50 contig sizes of 14.0 Mb and 14.3 Mb for LvWangTuo and SP5, respectively. Quality evaluation using Merqury indicated that the two assemblies have high base accuracy (QV score of 47, corresponding to two errors in 100,000 bases) and completeness (99.7%). Using RagTag and the 97103 genome as the reference, ~99% of the assemblies could be anchored to the 11 chromosomes for both genomes. Based on these analyses, we propose to pool two samples and sequence them on one SMRT cell of the Sequel IIe system.
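The figures in this paragraph can be checked with a little arithmetic. The Python sketch below assumes the standard Phred scaling for the QV score (QV = -10 log10 of the per-base error probability) and takes 1 Mb as 1,000,000 bp; the function names are illustrative and are not part of any pipeline described here.

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV to a per-base error probability."""
    return 10 ** (-qv / 10)


def bases_for_depth(genome_size_bp, depth):
    """Total sequenced bases needed to reach a given depth."""
    return genome_size_bp * depth


def downsample_fraction(current_depth, target_depth):
    """Fraction of reads to keep when sampling down to a target depth."""
    if target_depth > current_depth:
        raise ValueError("target depth exceeds the available depth")
    return target_depth / current_depth


# QV 47 expressed as expected errors per 100,000 bases (about 2).
errors_per_100kb = qv_to_error_rate(47) * 100_000
print(round(errors_per_100kb, 2))

# Bases needed for ~20x depth of the 368.2 Mb LvWangTuo assembly (about 7.4 Gb).
print(bases_for_depth(368_200_000, 20))

# Sampling a ~30x read set down to ~20x keeps about two thirds of the reads.
print(round(downsample_fraction(30, 20), 3))
```

Because HiFi read lengths vary, sampling a fraction of reads only approximates the target depth; summing the sampled read lengths and dividing by the genome size gives the depth actually achieved.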

We have established genome assembly, quality evaluation, pseudochromosome construction and genome annotation pipelines for the cucurbit species. Multiple high-quality reference genomes will be used to construct graph-based pan-genomes that can be further used to facilitate gene discovery and variant detection.

Genome resequencing data were generated for 29 C. colocynthis (30×), 30 C. mucosospermus (30×), 115 C. amarus (15×), and 414 other watermelon accessions. These data have been processed and assembled to identify additional novel genes in the pan-genome. The data will also be used to identify novel genes and genome variants (SNPs and SVs) based on the constructed graph-based pan-genomes.

1.1.4. Breeder-friendly web-based database for phenotypic, genotypic and QTL information

During CucCAP1 we developed the Cucurbit Genomics Database (CuGenDB), a critical resource for cucurbit genomics that is widely used by the community. However, the current CuGenDB (v1.0) has one major drawback: adding a new genome to the database takes too long (weeks, sometimes more than a month). To accommodate the growing number of cucurbit genomes developed during the past couple of years, and the many more expected in the near future, we are re-implementing CuGenDB (as CuGenDB v2.0) with the updated Tripal module (v3.0), which takes only a couple of days to add a new genome. We have collected a total of 43 cucurbit genomes published to date, of which 31, from 25 different species/subspecies, are included in CuGenDB v2.0; nine of these are also included in CuGenDB v1.0. Of the 12 genomes not included in CuGenDB v2.0, five are either of low quality or lack annotation files, and seven are old versions (all seven are included in the current CuGenDB). We expect to officially release CuGenDB v2.0 by the end of 2021.

Phenotype data have been generated for the melon and cucumber core collections. A total of 33 vegetative, flower and fruit characters and two disease resistance traits were evaluated for the melon core collection. For the cucumber core collection, 15 external and internal characteristics are being recorded for immature and mature fruit of plants grown in 2019 and 2021. Examples of the cucumber phenotypic data are shown in Figure 1. These phenotypic data are currently being used to develop visualization and analysis tools in CuGenDB v2.0. Developing a breeding information management module to integrate phenotypic and genotypic data will be the main focus of CuGenDB v2.0 development in 2022.

1.2. Perform seed multiplication and sequencing analysis of core collections of the four species, and provide community resources for genome-wide association studies (GWAS).

1.2.1. Seed multiplication of core collections

For cucumber, five companies will help increase seed of the 399 lines in the core collection. Seeds have been shipped to three companies, and seeds for the remaining two companies will be shipped within the next month.

For watermelon, HM.Clause are increasing the seeds for the 384 accessions in the core collection. We shipped 249 seed packs (accessions-PIs) to HM.Clause, with 50 S2 seeds in each pack. The seeds were tested for the presence of bacterial fruit blotch using an RT-PCR procedure before being shipped to the HM.Clause station in Davis, California. HM.Clause conducted seed health testing in California and the lots were then shipped to the HM.Clause station in Thailand. The first 80 accessions are being increased there at this time. HM.Clause are planning to provide us with roughly 3-4 selfed seed lots per accession, enough to reach the 1,000 seed/accession target. The S1 and S2 seeds are being increased at the U.S. Vegetable Laboratory and at the University of Georgia.

For melon, we have increased seed of 312 accessions to date and will harvest fruit from another 21 later this month.

For Cucurbita pepo, we have selfed seed from 229 accessions that represent ~99% of the genetic diversity of the species. These will be increased starting this winter at Linda Vista in Costa Rica.

1.2.2. Population genetics and phenotype-genotype association analysis

Nothing to report for this period.