Genomics and Bioinformatics Team members:
- Zhangjun Fei (Boyce Thompson Institute)
- Shan Wu (Boyce Thompson Institute)
- Amnon Levi (USDA, ARS)
- Yiqun Weng (USDA, ARS)
- Michael Mazourek (Cornell University)
- Jim McCreight (USDA, ARS)
- Rebecca Grumet (Michigan State University)
Objectives: Develop novel advanced bioinformatic, pan-genome, and genetic mapping tools for cucurbits.
1.1. Develop genomic and bioinformatic platforms for cucurbit crops.
1.1.1. Development of high-resolution genotyping platforms for cucurbits.
Genome resequencing of the cucumber (388 accessions) and the squash (207 Cucurbita pepo accessions) core collections has been completed. The average depths of cleaned sequences of the cucumber and squash cores are 49.7× and 49.9×, respectively. For the melon (384 accessions) and watermelon (372 accessions) cores, genome sequencing of 313 and 301 accessions, respectively, has been completed. In addition, we have completed genome resequencing for 26 C. maxima and seven C. moschata accessions.
The sequence data of the cucumber, squash, watermelon and melon cores have been processed for SNP and small indel calling using the Gy14 genome (v2.1), the MU‐CU‐16 genome (v4.1), the 97103 genome (v2.5) and the DHL92 genome (v4) as the references, respectively. Statistics of called variants are summarized in Table 1. Raw sequencing data and called variants have been distributed to the industry partners who requested access to the data. Biallelic variants with MAF>0.01 from the cucumber and squash core collections are publicly available for mining at CuGenDBv2 (cucurbitgenomics.org). Sample collection and DNA preparation are under way for the remaining accessions in the melon and watermelon cores, which will then be sequenced. Of the remaining 71 accessions in the watermelon core, DNA has been prepared for 45 accessions, while the other 26 accessions did not germinate. Variants will be updated for the watermelon and melon cores once new sequences are available.
Table 1 Summary of genome sequencing of cucurbit core collections
We recently found that 58 accessions in the cucumber core have high proportions of missing SNPs (5-35%) due to the poor quality of the sequencing libraries. These libraries were constructed during CucCAP1 using a low-cost protocol. Sequencing of these accessions is being redone. DNA has been prepared for 45 accessions, while the remaining 13 accessions did not germinate. Variants will be updated with new sequences when available.
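The two quality measures mentioned above, minor allele frequency (MAF) per variant and the proportion of missing SNP calls per accession, can be illustrated with a minimal Python sketch. The genotype coding (0, 1, 2 alternate-allele copies, with None for a missing call) and all function names are assumptions made for illustration; this is not the pipeline used by the project.

```python
from typing import Optional, Sequence

# Assumed coding for a diploid, biallelic site: number of alternate alleles
# (0, 1 or 2); None marks a missing genotype call.
Genotype = Optional[int]


def minor_allele_frequency(site: Sequence[Genotype]) -> float:
    """Return the MAF at one biallelic site, ignoring missing calls."""
    called = [g for g in site if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_maf(site: Sequence[Genotype], threshold: float = 0.01) -> bool:
    """True if the site would be kept under a MAF > threshold filter."""
    return minor_allele_frequency(site) > threshold


def missing_rate(accession_calls: Sequence[Genotype]) -> float:
    """Return the fraction of sites without a call for one accession."""
    if not accession_calls:
        return 0.0
    return sum(g is None for g in accession_calls) / len(accession_calls)
```

Under these assumptions, an accession whose `missing_rate` falls between 0.05 and 0.35 would correspond to the 5-35% range reported for the affected cucumber libraries.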
1.1.2. Development of novel, advanced genome and pan-genome platforms for cucurbit species.
For cucumber, we have selected 25 accessions for PacBio HiFi sequencing, including five wild Cucumis sativus var. hardwickii, four semi-wild Xishuangbanna and 16 cultivated cucumbers. Ten of these 25 accessions are from the core collection. HiFi sequences have been generated for all 25 accessions, with an average depth of 33.4×.
For watermelon, we selected a total of 135 accessions for reference-grade genome development, including one Citrullus naudinianus, one C. rehmii, two C. ecirrhosus, five C. colocynthis, 16 C. amarus, seven C. mucosospermus, five C. lanatus var. cordophanus, seven landraces, 82 cultivars, and nine interspecific hybrids. HiFi sequences have been generated for all 135 accessions, with an average depth of 30.3×.
For melon, a total of 27 representative accessions have been selected for HiFi sequencing, including 14 C. melo ssp. melo and 13 C. melo ssp. agrestis accessions. Of these, 13 are from India/Pakistan, four from Central/West Asia, three from the Americas, two each from Turkey, Africa, and East Asia, and one from Europe. HiFi sequences have been generated for 22 of the 27 accessions, with an average depth of 33.7×.
For squash, three accessions, two from Cucurbita pepo ssp. texana (also known as ssp. ovifera) and one from C. pepo ssp. pepo, have been selected for HiFi sequencing. HiFi sequences of these three accessions have been generated. We have also generated HiFi sequences for C. maxima Rimu and C. moschata Rifu.
1.1.3. De novo genome assembly and pan-genome construction
We have finished assembling chromosome-scale genomes of the 25 cucumber accessions. The assembled genome sizes range from 259.0 Mb to 302.3 Mb (average: 287.43 Mb) and N50 contig sizes from 5.25 Mb to 22.98 Mb (average: 15.46 Mb). The BUSCO completeness rate of these genome assemblies ranges from 96.4% to 98.8%, with an average of 98.4%. An average of 95.5% of the contigs (ranging from 90.3% to 97.8%) are assigned to the seven cucumber chromosomes. Protein-coding genes have been predicted in these genomes, as well as in an additional 11 previously published chromosome-scale cucumber genomes (seven cultivated, one Xishuangbanna and three wild hardwickii). The number of predicted genes ranges from 21,347 to 22,551, with an average of 21,870. The BUSCO completeness rate of genes predicted from each of these 36 cucumber genome assemblies ranges from 93.0% to 97.0%, with an average of 96.0%. Using the newly assembled WI7631 (‘Chinese long’) genome as the reference/backbone, large structural variants (SVs) have been called for the other 24 assembled genomes and the 11 previously published genomes (Table 2). A graph pan-genome has been constructed from the WI7631 genome and the called SVs, and has been used to genotype these SVs in the core collection using the resequencing short reads.
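For readers less familiar with the N50 statistic reported here, the following minimal Python sketch computes it from a list of contig lengths: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly. The function name and the example lengths are illustrative assumptions, not project data.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the cumulative length
        # to at least half of the assembly size.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6]
example_n50 = n50(example_lengths)
```

With the hypothetical lengths above, the total is 20 and the two longest contigs (6 and 5) already cover more than half of it, so the N50 is 5.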
For watermelon, we have finished chromosome-scale genome assemblies and gene prediction for all 135 accessions. The assembled genome sizes range from 368.6 Mb to 406.7 Mb (average: 377.5 Mb) and N50 sizes are all greater than 20 Mb (20.37-35.64 Mb; average: 30.49 Mb). The BUSCO completeness rate of these genome assemblies ranges from 93.9% to 99.2%, with an average of 99.0%. An average of 99.2% of the contigs (ranging from 96.2% to 99.9%) are assigned to the 11 watermelon chromosomes. The number of predicted protein-coding genes ranges from 20,834 to 23,330 (average: 21,785). The BUSCO completeness rate of genes predicted from each of these 135 watermelon genome assemblies ranges from 91.6% to 96.6%, with an average of 95.9%. Using the newly assembled ‘97103’ genome as the backbone, SVs have been called in the other 134 watermelon accessions, as well as in three previously published long-read assemblies (Table 2). The final SVs and the ‘97103’ genome have been used to construct a Citrullus graph pan-genome, which has been used to genotype these SVs in the core collection and other accessions using the resequencing short reads (a total of 756 accessions, including 436 cultivars, 114 landraces, 13 cordophanus, 39 mucosospermus, 120 amarus, 33 colocynthis and one rehmii).
Table 2 Summary statistics of SVs identified in cucumber and watermelon across 36 and 138 genome assemblies, respectively.
For melon, we have finished the chromosome-level assemblies of 22 accessions. The assembled genome sizes range from 355.7 Mb to 387.0 Mb (average: 371.7 Mb) and N50 contig sizes from 9.41 Mb to 19.60 Mb (average: 13.85 Mb). The BUSCO completeness rate of these genome assemblies ranges from 93.7% to 97.9%, with an average of 97.3%. An average of 97.2% of the contigs (ranging from 92.4% to 99.5%) are assigned to the 12 melon chromosomes. Protein-coding genes have been predicted in 21 of the 22 assembled genomes, and the number of genes predicted in each genome ranges from 23,108 to 27,678 (average: 24,570). The BUSCO completeness rate of genes predicted from each of these 21 melon genomes ranges from 95.5% to 97.6%, with an average of 96.6%.
For Cucurbita species, we have finished genome assemblies and gene predictions of three squash (C. pepo) accessions, and C. maxima Rimu and C. moschata Rifu (Table 3).
Table 3 Statistics of Cucurbita genome assemblies.
1.1.4. Breeder-friendly web-based database for phenotypic, genotypic and QTL information
We updated CuGenDB to version 2 (CuGenDBv2) and officially released it in April 2022. CuGenDBv2 currently hosts 34 reference genomes from 27 cucurbit species/subspecies belonging to 10 different genera. Protein-coding genes from all 34 genomes (total: 919,903; average: 27,056) have been comprehensively annotated, and the annotated genes can be queried and extracted in the database. Genomic synteny blocks and syntenic gene pairs have been identified between any two and within each of the 34 cucurbit genome assemblies (595 pairwise genome comparisons). A total of 391,379 synteny blocks and 12,130,719 syntenic gene pairs (average: 31 per synteny block) have been identified among the 34 cucurbit genomes. The ‘Synteny Viewer’ module has been re-implemented in CuGenDBv2 to improve the efficiency of processing and displaying the large-scale synteny data.
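The counts in this paragraph can be checked with simple arithmetic: 34 genomes give 561 between-genome pairs plus 34 within-genome comparisons, for 595 in total. The short Python sketch below reproduces this and the two reported averages; the variable names are illustrative only.

```python
from math import comb

n_genomes = 34

# Comparisons between any two distinct genomes, plus each genome against itself.
between_genomes = comb(n_genomes, 2)
within_genomes = n_genomes
total_comparisons = between_genomes + within_genomes

# Averages reported in the text, recomputed from the stated totals.
total_genes = 919_903
avg_genes_per_genome = round(total_genes / n_genomes)

synteny_blocks = 391_379
syntenic_gene_pairs = 12_130_719
avg_pairs_per_block = round(syntenic_gene_pairs / synteny_blocks)

print(total_comparisons, avg_genes_per_genome, avg_pairs_per_block)
```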
A new ‘Genotype’ module has been developed in CuGenDBv2. The module provides a suite of functions that allow users to mine, analyze, extract, and download variants, including SNPs and small indels, from large-scale population genome sequencing projects. Currently, variants (SNPs and small indels) called for the cucumber and squash core collections and the watermelon resequencing panel, and SNPs called from the GBS data generated under CucCAP1 for watermelon, melon, cucumber, C. pepo, C. maxima and C. moschata, are available in the database for query and mining.
The ‘Expression’ module in CuGenDBv2 has been redesigned to provide a complete cucurbit gene expression atlas, using the publicly available cucurbit RNA-Seq datasets. Currently, raw RNA-Seq data from a total of 221 projects, 1,513 distinct samples and 3,560 runs (or libraries) have been downloaded from NCBI and processed to derive expression values. These values can be queried in CuGenDBv2 to display expression profiles of genes of interest in different tissues, at different developmental stages, and under different treatment conditions.
Phenotype data have been generated for the melon and cucumber core collections. A total of 33 vegetative, flower and fruit characters and two disease resistance traits have been evaluated for the melon core collection. For the cucumber core collection, data on 15 external and internal characteristics have been collected for immature and mature fruit of plants grown in 2019 and 2021. A tool to display the fruit images of cucumber core accessions has been developed (cucurbitgenomics.org). Additional tools to visualize and analyze the phenotypic data will be developed in CuGenDBv2.
1.2. Perform seed multiplication and sequencing analysis of core collections of the four species, and provide community resources for genome-wide association studies (GWAS).
1.2.1. Seed multiplication of core collections
For cucumber, seed increases of the 388 accessions in the core collection were carried out by five participating seed companies. As of March 2024, more than 1,000 seeds per accession have been received for 310 accessions.
For watermelon, HM.CLAUSE is increasing seeds for 293 accessions in the core collection provided by USDA-ARS. HM.CLAUSE has already shipped S3 seeds of 177 accessions (about 1,000 seeds per accession) to the USDA, ARS, U.S. Vegetable Laboratory and, during 2024, will ship S3 seeds of the other 116 accessions it committed to increase. S2 seeds of an additional 39 accessions will be sent by the University of Georgia to HM.CLAUSE for increase. During 2024, S2 seeds of an additional 167 PIs (mainly Citrullus amarus) will be increased at the USDA, ARS, U.S. Vegetable Laboratory to reach 500 S3 seeds per accession.
Three seed companies assisted in advancing the melon core set in 2023: 259 of the 384 melon core lines were sent to these cooperators, and seed was obtained from 180 of those lines. United Genetics advanced 13 S0 lines to S1 and three S1 lines to S2. Nunhems advanced 13 S0 lines to S1 (Table 4). Sakata advanced 151 S2 lines to S3, with seed counts per line, estimated from seed weight, ranging from 21 to 3,100; only 57 lines produced 1,000 or more S3 seeds (Table 5).
Table 4 Seed multiplication status of melon core
Table 5 Estimated number of seeds per S3 melon core line (based on seed weight) by Sakata
For the C. pepo squash core increase, we expect to receive the last of the seed this summer. All of the squash core will be increased by a professional nursery, Villa Plants, and will have robust phytosanitary documentation. One line may have some IP restrictions and may be dropped from the core.
1.2.2. Population genetics and phenotype-genotype association analysis
Phylogenies of accessions in the cucumber, melon, squash, and watermelon cores have been inferred using the LD-pruned SNPs at four-fold degenerate sites. The phylogenies of cucumber and melon core accessions are largely consistent with their geographic origins and the phylogeny of watermelon accessions is consistent with their species classifications, while no clear separations were observed for squash accessions related to their geographic origins or improvement status.
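As background on the site class used for these phylogenies, a four-fold degenerate site is a codon position where any of the four nucleotides encodes the same amino acid. The minimal Python sketch below derives such sites from the standard genetic code; the function names are illustrative assumptions, and the sketch covers only site classification, not the LD pruning or tree inference steps of the actual analysis.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_prefixes() -> set:
    """Return the two-base codon prefixes whose third base is fully redundant."""
    prefixes = set()
    for b1, b2 in product(BASES, repeat=2):
        encoded = {CODON_TABLE[b1 + b2 + b3] for b3 in BASES}
        if len(encoded) == 1:
            prefixes.add(b1 + b2)
    return prefixes


FOURFOLD_PREFIXES = fourfold_prefixes()


def is_fourfold_site(codon: str, position: int) -> bool:
    """True if the 0-based position within the codon is four-fold degenerate.

    In the standard code only third positions can be four-fold degenerate.
    """
    return position == 2 and codon[:2].upper() in FOURFOLD_PREFIXES
```

Such sites are commonly used for phylogenetic inference because substitutions at them do not change the encoded protein.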
Phenotype-genotype association analysis has been performed for the cucumber core. The cucumber core accessions were grown in the field at the Michigan State University Horticulture Teaching and Research Center in 2019-2022. Young and mature fruits were harvested at ~5-7 and 30-40 days post pollination, respectively. The following traits were measured for mature fruit: fruit length, diameter, fruit shape index, carpel number, seed cavity, flesh thickness, hollowness, curvature, tapering, skin color, flesh color and netting; and the following for young fruit: fruit shape index, curvature, tapering, skin color, and spine density. Genome-wide association studies (GWAS) were performed on these fruit traits using different models, including FarmCPU, BLINK, MLMM, and MLM (Fig. 1). Chromosomal locations of the significantly associated SNPs are illustrated in Fig. 2. QTLs for some of the traits were closely clustered. For example, SNPs for several highly correlated fruit size and shape traits, including mature fruit length, young fruit shape index, carpel number, and seed cavity size, were located close together on chromosome 1 at ~10 Mb. Multiple external fruit traits, such as netting, spine density, and young fruit color R/G values, were also mapped to the same region on chromosome 1. Several significant SNPs identified by GWAS were also in close vicinity (within 1 Mb) to previously identified fruit trait QTL and candidate genes.
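The proximity check described in the last sentence (a significant SNP lying within 1 Mb of a previously identified QTL or candidate gene) can be sketched as follows. This is a minimal Python illustration; the data structures, function names, and coordinates are hypothetical and do not represent project results.

```python
from typing import List, NamedTuple, Optional


class Locus(NamedTuple):
    chrom: str
    pos: int  # base-pair position


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def distance_to_interval(snp: Locus, region: Interval) -> Optional[int]:
    """Distance in bp from a SNP to a region; None if on another chromosome."""
    if snp.chrom != region.chrom:
        return None
    if snp.pos < region.start:
        return region.start - snp.pos
    if snp.pos > region.end:
        return snp.pos - region.end
    return 0  # the SNP lies inside the region


def snps_near_regions(
    snps: List[Locus], regions: List[Interval], window: int = 1_000_000
) -> List[Locus]:
    """Return the SNPs lying within `window` bp of at least one region."""
    hits = []
    for snp in snps:
        for region in regions:
            dist = distance_to_interval(snp, region)
            if dist is not None and dist <= window:
                hits.append(snp)
                break
    return hits


# Hypothetical coordinates, for illustration only.
example_snps = [
    Locus("chr1", 10_200_000),
    Locus("chr1", 25_000_000),
    Locus("chr3", 10_200_000),
]
example_qtl = [Interval("chr1", 9_500_000, 9_800_000)]
example_hits = snps_near_regions(example_snps, example_qtl)
```

In this hypothetical example only the first SNP is reported, because it lies 400 kb from the region on the same chromosome.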