Genomics and Bioinformatics Team | 2023 Progress Report

View the PDF version of the 2023 Genomics and Bioinformatics Report.

Team members:

  • Zhangjun Fei (Boyce Thompson Institute)
  • Michael Mazourek (Cornell University)
  • Shan Wu (Boyce Thompson Institute)
  • Jim McCreight (USDA, ARS)
  • Amnon Levi (USDA, ARS)
  • Rebecca Grumet (Michigan State University)
  • Yiqun Weng (USDA, ARS)


Develop novel advanced bioinformatic, pan-genome, and genetic mapping tools for cucurbits.

Develop genomic and bioinformatic platforms for cucurbit crops:
Development of high-resolution genotyping platforms for cucurbits

Genome resequencing of the cucumber core collection (388 accessions) and the squash core collection (207 Cucurbita pepo accessions) has been completed, except for one accession in the squash core. The average sequencing depths of the cucumber and squash cores are 44.7× and 53.3×, respectively. These data have been processed for SNP and small indel calling using the Gy14 genome (v2.1) as the reference for the cucumber core and the MU‐CU‐16 genome (v4.1) as the reference for the squash core. Statistics of the called variants are summarized in Table 1. Biallelic variants with MAF>0.01 from the cucumber and squash core collections are available for mining at CuGenDBv2.
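The "biallelic variants with MAF>0.01" criterion can be illustrated with a minimal sketch. It assumes each genotype is stored as the count of alternate alleles in a diploid accession (0, 1, or 2, with None for a missing call). The function names and this encoding are illustrative assumptions, not the pipeline used by the project.

```python
from typing import Optional, Sequence


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> Optional[float]:
    """Return the minor allele frequency of one biallelic site.

    Each genotype is the number of alternate alleles in a diploid sample
    (0, 1 or 2); None marks a missing call (assumed encoding).
    Returns None when every call is missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_filter(n_alt_alleles: int,
                  genotypes: Sequence[Optional[int]],
                  maf_threshold: float = 0.01) -> bool:
    """Keep a site only if it is biallelic and its MAF exceeds the threshold."""
    if n_alt_alleles != 1:  # more than one alternate allele: not biallelic
        return False
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf > maf_threshold
```

In a panel of 100 accessions, a site carried by a single heterozygote has MAF 1/200 = 0.005 and would be dropped, whereas a site carried by three heterozygotes (MAF 0.015) would be kept.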

For the melon core collection (384 accessions), genome sequencing of 308 accessions has been completed. SNPs and small indels have been called for these 308 samples and will be updated once sequencing of the remaining accessions in the core is completed. Sample collection and DNA preparation for the remaining 76 accessions and for all accessions in the watermelon core are currently underway (Table 1). In addition, we have completed genome resequencing of 26 C. maxima and seven C. moschata accessions.

Table 1. Summary of genome sequencing of cucurbit core collections

| | Cucumber | Squash | Melon | Watermelon |
|---|---|---|---|---|
| No. accessions | 388 | 207 | 384 | 366 |
| No. DNA prepared | 388 | 207 | 314 | 313 |
| No. sequenced | 388 | 206 | 308 | 0 |
| Average sequencing depth (×) | 44.7 | 53.3 | 47.3 | |
| No. raw SNPs | 5,332,225 | 5,007,376 | 21,022,493 | |
| No. biallelic SNPs with MAF>0.01 | 2,513,882 | 4,104,452 | 10,294,239 | |
| No. raw indels | 1,385,149 | 2,008,251 | 4,481,966 | |
| No. biallelic indels with MAF>0.01 | 490,882 | 1,264,224 | 1,540,406 | |

Development of novel, advanced genome and pan-genome platforms for cucurbit species

For cucumber, we have selected 25 accessions for PacBio HiFi sequencing, including five wild Cucumis sativus var. hardwickii, four semi-wild Xishuangbanna, and 16 cultivated cucumbers. Ten of these 25 accessions are from the core collection. HiFi sequences have been generated for all 25 accessions.

For watermelon, we selected a total of 135 accessions for reference-grade genome development, including one Citrullus naudinianus, one C. rehmii, two C. ecirrhosus, five C. colocynthis, 13 C. amarus, nine C. mucosospermus, four C. lanatus var. cordophanus, seven landraces, 88 cultivars, and five interspecific hybrids. HiFi sequences have been generated for 124 accessions, and DNA of the remaining 11 accessions (four cordophanus, two C. amarus, two C. colocynthis, two C. ecirrhosus, and one C. rehmii) has been prepared and sent to the sequencing facility for HiFi read generation.

For squash, three accessions, two from Cucurbita pepo ssp. texana (also known as ssp. ovifera) and one from C. pepo ssp. pepo, have been selected for HiFi sequencing. HiFi sequences of these three accessions have been generated. We have also generated HiFi sequences for C. maxima Rimu and C. moschata Rifu.

For melon, a total of 16 representative accessions have been selected for HiFi sequencing, including eight C. melo ssp. melo and eight C. melo ssp. agrestis accessions. Of these, eight are from India/Pakistan, two from Turkey, two from the Americas, and one each from Africa, Central/West Asia, East Asia, and Europe. Sample collection, DNA preparation, and HiFi sequencing are underway.

De novo genome assembly and pan-genome construction

We have finished assembling chromosome-scale genomes of the 25 cucumber accessions. One accession (WI5551) had an unexpectedly large assembled genome (~610 Mb), possibly due to sample contamination, and was therefore discarded. The assembled genome sizes of the remaining 24 accessions range from 272.0 Mb to 318.5 Mb (average: 294.9 Mb), and N50 sizes range from 5.07 Mb to 23.28 Mb (average: 13.45 Mb). Protein-coding genes are being predicted in these genomes. Using the assembled ‘Poinsett 76’ genome as the reference/backbone, large structural variants (SVs) are being called and integrated for the other 23 assembled genomes and for an additional 11 previously published chromosome-scale cucumber genomes (seven cultivated, one Xishuangbanna, and three wild hardwickii). A graph pan-genome will be constructed using the ‘Poinsett 76’ genome and the called SVs.
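For readers unfamiliar with the N50 statistic quoted for these assemblies, the following is a minimal sketch of the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the toy lengths are illustrative assumptions and are not taken from the project's data.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig (or scaffold) lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    return lengths[-1]  # unreachable for positive lengths; kept for safety
```

For example, contigs of lengths 10, 9, 8, 7, 6, 5, 4, 3, and 2 sum to 54; the three longest already cover 27, so the N50 is 8.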

For watermelon, we have finished chromosome-scale genome assemblies and gene predictions for 124 accessions. The assembled genome sizes range from 368.6 Mb to 406.7 Mb (average: 377.5 Mb), and N50 sizes are all greater than 20 Mb (20.37-35.64 Mb; average: 30.49 Mb). The numbers of predicted protein-coding genes range from 21,209 to 23,314 (average: 21,948). Using the newly assembled ‘97103’ genome as the backbone, SVs are being called in the other 123 watermelon accessions. Once the HiFi data for the remaining 11 accessions are received, genome assembly, gene prediction, and SV calling will be performed. The final SVs and the ‘97103’ reference genome will be used to construct a Citrullus graph pan-genome.

For Cucurbita species, we have finished genome assemblies and gene predictions for three squash (C. pepo) accessions, and genome assemblies for C. maxima Rimu and C. moschata Rifu (Table 2). Annotation of the Rimu and Rifu genomes is underway. SVs are being called, and a Cucurbita graph pan-genome will be constructed.

Table 2. Statistics of Cucurbita genome assemblies

| | C. maxima Rimu | C. moschata Rifu | C. pepo ssp. texana C31 | C. pepo ssp. texana C38 | C. pepo ssp. pepo C39 |
|---|---|---|---|---|---|
| Assembly size (bp) | 350,631,597 | 311,872,014 | 349,507,311 | 351,024,241 | 378,453,046 |
| N50 (bp) | 12,573,384 | 9,281,623 | 7,690,470 | 9,175,707 | 10,726,264 |
| No. genes | | | 31,528 | 30,412 | 31,327 |

All of the constructed pan-genomes will be used as references to genotype SVs in the core collections and other populations by mapping genome resequencing reads.

In addition, we have constructed four species-level watermelon pan-genomes and a Citrullus super-pangenome with the ‘map-to-pan’ strategy using the four genome assemblies (one from each of the four species, C. lanatus, C. mucosospermus, C. amarus and C. colocynthis) and genome resequencing data we previously generated. The resequencing data are from 547 accessions, including 349 C. lanatus (243 cultivars, 88 landraces and 18 C. lanatus subsp. cordophanus), 31 C. mucosospermus, 131 C. amarus and 36 C. colocynthis. The species-level pan-genomes contain 2,288, 583, 1,922 and 2,521 novel genes that are not present in reference genomes of C. lanatus, C. mucosospermus, C. amarus and C. colocynthis, respectively. Analysis of presence/absence variations (PAVs) of genes in the Citrullus super-pangenome identified many genes showing differential presence frequencies between different populations, including 17 genes related to disease resistance that are completely absent or present at very low frequencies in domesticated watermelons while present at very high frequencies in at least one of the three wild species populations.
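The comparison of gene presence frequencies between populations can be sketched as follows. The report gives no numeric cutoffs for "very low" and "very high", so the thresholds here (0.05 and 0.90), the input layout, and the function names are all illustrative assumptions rather than the analysis actually performed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def presence_frequencies(pav: Dict[str, Dict[str, bool]],
                         population_of: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Compute, for each gene, the fraction of accessions carrying it per population.

    pav maps gene -> {accession: present?}; population_of maps accession -> population.
    """
    result: Dict[str, Dict[str, float]] = {}
    for gene, calls in pav.items():
        present = defaultdict(int)
        total = defaultdict(int)
        for accession, is_present in calls.items():
            pop = population_of[accession]
            total[pop] += 1
            present[pop] += int(is_present)
        result[gene] = {pop: present[pop] / total[pop] for pop in total}
    return result


def differential_genes(freqs: Dict[str, Dict[str, float]],
                       domesticated: str,
                       wild_pops: Iterable[str],
                       low: float = 0.05,    # hypothetical "very low" cutoff
                       high: float = 0.90    # hypothetical "very high" cutoff
                       ) -> List[str]:
    """List genes rare in the domesticated population but common in at least one wild one."""
    wild = list(wild_pops)
    hits = []
    for gene, by_pop in freqs.items():
        if domesticated not in by_pop:
            continue  # no calls for the domesticated population: cannot judge
        rare_in_domesticated = by_pop[domesticated] <= low
        common_in_a_wild_pop = any(by_pop.get(p, 0.0) >= high for p in wild)
        if rare_in_domesticated and common_in_a_wild_pop:
            hits.append(gene)
    return hits
```

A real analysis would typically add a statistical test per gene, since raw frequency cutoffs are sensitive to small population sizes.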

Breeder-friendly web-based database for phenotypic, genotypic and QTL information

We updated CuGenDB to version 2 (CuGenDBv2), which was officially released in April 2022. CuGenDBv2 currently hosts 34 reference genomes from 27 cucurbit species/subspecies belonging to 10 genera. Protein-coding genes from all 34 genomes (total: 919,903; average: 27,056 per genome) have been comprehensively annotated, and the annotated genes can be queried and extracted in the database. Genomic synteny blocks and syntenic gene pairs have been identified between every pair of, and within each of, the 34 cucurbit genome assemblies (595 pairwise genome comparisons). In total, 391,379 synteny blocks and 12,130,719 syntenic gene pairs (average: 31 per synteny block) have been identified. The ‘Synteny Viewer’ module has been re-implemented in CuGenDBv2 to process and display these large-scale synteny data more efficiently.
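The counts quoted above can be checked with a few lines of arithmetic. The 595 comparisons are the 561 unordered pairs of distinct genomes plus 34 self-comparisons; the variable names below are illustrative only.

```python
from math import comb

n_genomes = 34

# Comparisons between two different genomes, plus one within-genome
# comparison per assembly.
between_genomes = comb(n_genomes, 2)      # 561
within_genomes = n_genomes                # 34
total_comparisons = between_genomes + within_genomes

# Averages reported in the text, recomputed from the stated totals.
genes_per_genome = 919_903 / n_genomes
pairs_per_block = 12_130_719 / 391_379

print(total_comparisons, round(genes_per_genome), round(pairs_per_block))
```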

A new ‘Genotype’ module has been developed in CuGenDBv2. The module provides a suite of functions that allow users to mine, analyze, extract, and download variants, including SNPs and small indels, from large-scale population genome sequencing projects. Currently, variants (SNPs and small indels) called for the melon and squash core collections and the watermelon resequencing panel, as well as SNPs called from the GBS data generated under CucCAP1 for watermelon, melon, cucumber, C. pepo, C. maxima, and C. moschata, are available in the database for query and mining.

The ‘Expression’ module in CuGenDBv2 has been redesigned to provide a complete cucurbit gene expression atlas based on publicly available cucurbit RNA-Seq datasets. Currently, raw RNA-Seq data from 221 projects, comprising 1,513 distinct samples and 3,560 runs (libraries), have been downloaded from NCBI and processed to derive expression values. These values can be queried in CuGenDBv2 to display expression profiles of genes of interest across tissues, developmental stages, and treatment conditions.

Phenotype data have been generated for the melon and cucumber core collections. A total of 33 vegetative, flower, and fruit characters and two disease resistance traits have been evaluated for the melon core collection. For the cucumber core collection, 15 external and internal characteristics have been recorded for immature and mature fruit of plants grown in 2019 and 2021. These phenotypic data will be used to develop visualization and analysis tools in CuGenDBv2.

Perform seed multiplication and sequencing analysis of core collections of the four species, and provide community resources for genome-wide association studies (GWAS).
Seed multiplication of core collections

For cucumber, seed increases of the 388 accessions in the core collection are being carried out by five participating seed companies. As of March 2023, seeds have been received for 106 accessions, each with more than 1,000 seeds. Seed increase for the majority of the 388 accessions is expected to be completed by the end of 2023.

For watermelon, HM.Clause is increasing seeds for 249 accessions in the core collection. We shipped 249 seed packs (one per accession/PI), each containing 50 S2 seeds, to HM.Clause. Before being shipped to the HM.Clause station in Davis, California, the seeds were tested for the presence of bacterial fruit blotch using an RT-PCR procedure. HM.Clause conducted seed health testing in California, and the lots were then shipped to HM.Clause in Thailand. HM.Clause plans to provide 3-4 self-pollinated seed lots per accession, enough to reach the 1,000 seed/accession target. S2 seeds of 167 additional PIs are being increased at the U.S. Vegetable Laboratory and at the University of Georgia, with enough seed lots per accession to reach the 500 seed/accession target to be produced during 2023-2024.

For melon, we have increased seed for 312 accessions to date and will harvest fruit from another 21 later this month.

For squash, we have finished seed increase for 130 accessions, of which 100 have >1,000 seeds and 30 have 500-1,000 seeds. We are harvesting seeds from an additional 20 accessions currently growing in the greenhouse and are waiting for increased seeds from Villa Plant for another 50 accessions. The remaining seven accessions in the core will be increased this summer. The status of the core populations is summarized in Table 3.

Table 3. Status of CucCAP Core (CCC) population development

Columns: (Cucumis sativus) | (Cucumis melo) | (Citrullus lanatus) [C. amarus, C. colocynthis] | (Cucurbita pepo) [C. moschata, C. maxima]
No. accessions listed in NPGS 1335 (available) 2083 (available) 1870 [1619,77,151,23] 743 [302,614]
No. sequenced by GBS 1234 2083 1365 [1211,52,102] 829 [314,534]
No. chosen for core collection 395 384 377 [306,23,38,10] 229 [7, 26]
Portion of genetic diversity represented in core collection 96% 99% 96% >99%
No. accessions selfed (Generations of selfing prior to sequencing) 388 2-3 generations S1: 55 S2: 296 366 2-3 generations 207 [7,26] 2-3 generations
Seed multiplication in progress BASF, East-West, Enza Zaden, Vilmorin, VoloAgri Sakata, and tentatively: BASF & United Genetics H.M. Clause Sakata In-House MazLab; Villa Plants CR
No. of accessions with seed (>1000 seed/accession) 106 remainder in progress 249 increased 126 still to be increased Almost done: 130 >1000 seed; 30, 500-1000 seed; remainder in progress

Population genetics and phenotype-genotype association analysis

Phylogenies of accessions in the cucumber and melon cores have been inferred using LD-pruned SNPs at four-fold degenerate sites. The resulting phylogenies are largely consistent with the geographic origins of the accessions.
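A four-fold degenerate site is a third codon position at which any of the four nucleotides encodes the same amino acid, so substitutions there are synonymous. The sketch below locates such sites in a coding sequence under the standard genetic code. The function name and the assumption that the input is an in-frame coding sequence are illustrative, and the sketch does not reproduce the project's site selection or LD-pruning steps.

```python
from typing import List

# First two bases of the eight codon families that are four-fold degenerate
# in the standard genetic code: Leu (CTN), Val (GTN), Ser (TCN), Pro (CCN),
# Thr (ACN), Ala (GCN), Arg (CGN) and Gly (GGN).
FOURFOLD_PREFIXES = frozenset({"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"})


def fourfold_site_offsets(cds: str) -> List[int]:
    """Return 0-based offsets of four-fold degenerate sites in an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    offsets = []
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        # Skip codons with ambiguous bases (e.g. N) at the third position.
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            offsets.append(start + 2)
    return offsets
```

For the sequence ATGGCTTTAGGG (codons ATG, GCT, TTA, GGG), the function returns offsets 5 and 11, the third positions of GCT (Ala) and GGG (Gly).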

Preliminary GWAS analysis of cucumber fruit traits identified several QTL, including some previously reported in the literature (e.g., Wang et al., 2020, 2021; Sheng et al., 2020), which are highlighted in light blue in Fig. 1. Additional analyses are in progress.

Figure 1. Manhattan plots of GWAS of internal traits. (A) Fruit diameter; (B) Flesh thickness; (C) Seed cavity size; (D) Fruit hollowness. Blue and red lines indicate FDR 0.05 and 0.01, respectively.

View Figure 1 in the PDF version of the 2023 CucCAP Bioinformatics and Genomics Team Report.
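The report does not state how the FDR 0.05 and 0.01 lines in Figure 1 were derived. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up procedure, a common choice for GWAS, to find the largest p-value that would be declared significant at a given FDR level. The function name and the toy p-values are assumptions.

```python
from typing import Optional, Sequence


def bh_p_value_cutoff(p_values: Sequence[float], fdr: float) -> Optional[float]:
    """Return the largest p-value significant under Benjamini-Hochberg.

    Sorts the p-values, compares the k-th smallest with fdr * k / m, and
    keeps the largest p-value that satisfies the inequality. Returns None
    when no test is significant at the requested FDR level.
    """
    ranked = sorted(p_values)
    m = len(ranked)
    cutoff = None
    for rank, p in enumerate(ranked, start=1):
        if p <= fdr * rank / m:
            cutoff = p  # step-up: a later, larger p-value can still qualify
    return cutoff


toy_p_values = [0.0005, 0.008, 0.039, 0.041]   # made-up values
print(bh_p_value_cutoff(toy_p_values, 0.05))
print(bh_p_value_cutoff(toy_p_values, 0.01))
```

On a Manhattan plot, the corresponding horizontal line would sit at the negative base-10 logarithm of the returned cutoff.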