
RESOURCE

The Chimonanthus salicifolius genome provides insight into magnoliid evolution and flavonoid biosynthesis

Qundan Lv1,†, Jie Qiu2,†, Jie Liu2,†, Zheng Li3, Wenting Zhang2, Qin Wang2, Jie Fang1, Junjie Pan1, Zhengdao Chen1, Wenliang Cheng1, Michael S. Barker3, Xuehui Huang2, Xin Wei2,* and Kejun Cheng1,*

1Chemical Biology Center, Lishui Institute of Agriculture and Forestry Sciences, Lishui, China,

2Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, China, and

3Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, USA

Received 4 November 2019; revised 25 May 2020; accepted 2 June 2020.

*For correspondence (e-mail [email protected]; [email protected]).

†These authors contributed equally to this work.

SUMMARY

Chimonanthus salicifolius, a member of the Calycanthaceae of magnoliids, is one of the most famous medicinal plants in Eastern China. Here, we report a chromosome-level genome assembly of C. salicifolius, comprising 820.1 Mb of genomic sequence with a contig N50 of 2.3 Mb and containing 36 651 annotated protein-coding genes. Phylogenetic analyses revealed that magnoliids were sister to the eudicots. Two rounds of ancient whole-genome duplication were inferred in the C. salicifolius genome: one is shared by Calycanthaceae after its divergence from Lauraceae, and the other is in the ancestry of Magnoliales and Laurales. Notably, long genes (> 20 kb) were much more prevalent in the magnoliid genomes than in other angiosperms, which could be caused by the expansion of introns through transposable element insertions. Homologous genes within the flavonoid pathway of C. salicifolius were identified, and correlation of gene expression with the contents of flavonoid metabolites revealed potential critical genes involved in flavonoid biosynthesis. This study not only provides an additional whole-genome sequence from the magnoliids, but also opens the door to functional genomic research and molecular breeding of C. salicifolius.

Keywords: Chimonanthus salicifolius, de novo genome assembly, magnoliids, evolution, long genes, gene expression.

INTRODUCTION

Magnoliids represent the third largest group of angiosperms and include approximately 10 000 species (Palmer et al., 2004; Massoni et al., 2015). Numerous useful plants are magnoliids, such as avocado, nutmeg, bay laurel, black pepper, star anise, wintersweet and camphor tree. They provide fruit, spices, traditional medicine, industrial raw materials and ornamental trees for human use. The availability of genomes for more than 300 monocots and eudicots has greatly accelerated phylogenetic reconstruction and genetic research in those groups. Despite the importance of magnoliids, few magnoliid genomes have been sequenced, and their phylogenetic position remains uncertain (Chaw et al., 2019; Chen et al., 2019; Hu et al., 2019; Rendón-Anaya et al., 2019).

The phylogenetic position of magnoliids has been debated for decades. Based on different genomic components, including plastid genes, mitochondrial genes, nuclear genes and plastomic inverted repeat regions, three main phylogenetic topologies have been proposed: (i) sister to the monocots (Endress and Doyle, 2009); (ii) sister to the clade containing monocots and eudicots (Moore et al., 2007; Qiu et al., 2010); and (iii) sister to the eudicots (Zeng et al., 2014; One Thousand Plant Transcriptomes Initiative, 2019). In the Angiosperm Phylogeny Group (APG) system, the phylogenetic position of magnoliids is not consistent among the four versions (The Angiosperm Phylogeny Group, 1998, 2003, 2009, 2016). Recently, four magnoliid genomes were released: Liriodendron chinense (Chen et al., 2019), Cinnamomum kanehirae (Chaw et al., 2019), Persea americana (Rendón-Anaya et al., 2019) and Piper nigrum (Hu et al., 2019). These genomes greatly facilitated our understanding of magnoliid evolution (Soltis and Soltis, 2019). However, the phylogenetic positions of the sequenced species revealed by these genomes differed, adding to the confusion regarding the genome evolution of magnoliids. In addition, the relative timing of the whole-genome duplication (WGD) events in different species and the divergence times between magnoliid plants remain ambiguous (Cui et al., 2006; Chaw et al., 2019; Chen et al., 2019; Rendón-Anaya et al., 2019).

© 2020 Society for Experimental Biology and John Wiley & Sons Ltd. This article has been contributed to by US Government employees and their work is in the public domain in the USA.

The Plant Journal (2020) doi: 10.1111/tpj.14874

Chimonanthus salicifolius (Chinese name ‘Liu-Ye-La-Mei’) is a shrub that belongs to the Calycanthaceae in the Laurales. The leaves of C. salicifolius have been used as traditional medicine to relieve diarrhea symptoms by people of the She nationality for hundreds of years in Eastern China, and its definite curative effects have earned C. salicifolius the title of “the uncrowned king of traditional She nationality medicine”. A large number of secondary metabolites, such as flavonoids, coumarins, alkaloids and terpenoids, which might be the active components responsible for these curative effects, have been identified in the leaves of C. salicifolius (Ma et al., 2015; Li et al., 2016; Wang et al., 2016, 2019). Ethanolic extracts of C. salicifolius show significant antimicrobial and antibiotic-mediating activity (Wang et al., 2018). Moreover, the young leaves of C. salicifolius are processed into tea, which is commonly consumed in Eastern China. Despite the commercial interest in and increasing demand for C. salicifolius, basic biological research and genetic improvement of the species remain quite limited. The lack of genome information has hindered identification of the flavonoid biosynthetic genes.

In this study, the genome of C. salicifolius was sequenced using the Illumina and PacBio platforms, and scaffolded using 10× Genomics and Hi-C technologies. Approximately 820 megabases (Mb) of genome sequence were assembled with a contig N50 of 2.3 Mb. This high-quality genome provides a resource for inferring the phylogeny of magnoliids, and for identifying the key genes responsible for flavonoid biosynthesis and the genes underlying complex agronomic traits such as flowering at low temperatures. Comparative genomic analysis was performed with three published magnoliid genomes, L. chinense (Chen et al., 2019), C. kanehirae (Chaw et al., 2019) and P. americana (Rendón-Anaya et al., 2019). In addition, a prevalence of long genes was discovered in the genomes of C. salicifolius and other magnoliids. Overall, our results shed light on the phylogeny of magnoliids, and lay a foundation for understanding the mechanism of flavonoid biosynthesis and for molecular breeding of high-flavonoid-content varieties.

RESULTS

Genome sequencing, assembly and annotation

The genomic DNA of C. salicifolius was extracted from one individual plant collected in Eastern China, and sequenced using both the Illumina and PacBio platforms. For the initial contig assembly, a total of 82.3 gigabases (Gb) of PacBio data were generated, representing approximately 101.5-fold coverage of the 810.6-Mb genome, a size predicted by a 17-mer analysis (Figure S1). Based on the flow cytometry survey, the genome size of C. salicifolius was estimated to be approximately 835.5 Mb (Figure S2), which is close to the size estimated by the k-mer strategy. The contigs were assembled using Falcon (Chin et al., 2013) with the PacBio data, and were error corrected by Pilon (Walker et al., 2014) with 98.5 Gb (121.5-fold) of clean Illumina reads. Afterwards, the contigs were scaffolded by FragScaff (Adey et al., 2014) using 127.8 Gb (157.7-fold) of 10× Genomics data, and the resulting assembly size was 851.7 Mb. A total of 1741 contigs and 1531 scaffolds were assembled.
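The fold-coverage figures quoted in this paragraph are consistent with dividing the sequenced bases by the 810.6-Mb k-mer estimate of genome size. A minimal Python sketch of that arithmetic follows; the function name `fold_coverage` is illustrative and is not part of the authors' pipeline.

```python
def fold_coverage(data_gb: float, genome_mb: float) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    return data_gb * 1000.0 / genome_mb


GENOME_MB = 810.6  # genome size estimated by the 17-mer analysis (from the text)

# Data volumes (Gb) reported in the text for each library type.
datasets = {"PacBio": 82.3, "Illumina": 98.5, "10x Genomics": 127.8}

depths = {name: round(fold_coverage(gb, GENOME_MB), 1) for name, gb in datasets.items()}
for name, depth in depths.items():
    print(f"{name}: {depth}-fold")
```

Using the flow-cytometry estimate (835.5 Mb) as the denominator instead would give slightly lower depths, so the choice of genome-size estimate should be stated whenever coverage is reported.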

To assemble the scaffolds into pseudochromosomes, a high-throughput chromosome conformation capture (Hi-C) library was constructed and sequenced, resulting in 148.1 Gb (182.7-fold) of data. Using LACHESIS (Burton et al., 2013) to cluster, order and orient the assembled scaffolds, we anchored them into 11 clusters with a total size of 820.1 Mb. The assembled genome size was very close (98.2% coverage) to the estimated genome size (835.5 Mb) obtained from the flow cytometry analysis. The number of clusters corresponded to the haploid chromosome number of C. salicifolius (2n = 22). The lengths of the pseudochromosomes ranged from 51.2 to 97.7 Mb (Table 1). The N50 values of the pseudochromosomes and contigs were 96.3 and 2.3 Mb, respectively. The Hi-C contact matrix based on the assembled genome is visualized in Figure S3.
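For readers unfamiliar with the N50 statistic used here, the sketch below computes it in Python under the usual definition: sort sequences from longest to shortest and report the length at which the running total first reaches half of the assembly. The toy lengths are hypothetical and are not the real C. salicifolius contigs; only the 820.1-Mb and 835.5-Mb values come from the text.

```python
def n50(lengths):
    """Return the N50: the sequence length at which the running total of
    lengths, sorted longest first, reaches at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (Mb), for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)

# Fraction of the flow-cytometry genome size captured by the anchored assembly.
assembly_fraction = 820.1 / 835.5

print(toy_n50, f"{assembly_fraction:.1%}")
```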

The short reads generated from Illumina sequencing were aligned to our assembled genome. We found that 95.6% of the reads could be mapped back to the genome and that they covered 99.9% of it. Single nucleotide polymorphism (SNP) calling showed that the heterozygosity rate was 0.46%. The assembled genome was evaluated by BUSCO (Benchmarking Universal Single Copy Orthologs; Simao et al., 2015), and the ‘complete’ percentage was 95.1% (Table S1). The GC content of the C. salicifolius genome was 36.9%.

Based on a combined gene prediction strategy that considered evidence from de novo prediction, protein homology and transcriptomic support, we predicted 36 651 protein-coding gene models with an average gene length of 6593.9 bp and an average coding sequence length of 1069 bp. Of the 36 651 genes, 34 119 (93.1%) were supported by either the identification of homologs in other species or the RNA-seq data. The gene density ranged from 0 to 110 genes per megabase across the chromosomes (Figure 1). Although the density of protein-coding genes seemed generally complementary to that of the repetitive elements across the genome, we found that long genes (> 20 kb) tended to be distributed more in the repetitive regions (Figures 1 and S4). Meanwhile, heterozygous SNPs were more likely to appear in regions with low repeat sequence density (Figures 1 and S5).

The annotation revealed that the proportion of repetitive sequences in the C. salicifolius genome was 57.5%, comparable to L. chinense (61.6%) but much higher than C. kanehirae (48.0%). Interspersed repeat sequences accounted for 56.6% of the repetitive sequences (Table S2). As in most sequenced plant genomes, long terminal repeats (LTRs) were the most abundant type of interspersed repeat, occupying the largest share (30.0%) of the repeat sequences, followed by DNA transposons at 9.7% (Table 1). With regard to non-coding sequences, 174 microRNAs (miRNAs), 695 transfer RNAs (tRNAs), 254 small nuclear RNAs (snRNAs), 1052 small nucleolar RNAs (snoRNAs) and 283 ribosomal RNAs (rRNAs) were predicted in the C. salicifolius genome.

Phylogeny and comparative genomic analysis

To infer the phylogenetic position of C. salicifolius, we analyzed the genomes of an outgroup species, Selaginella moellendorffii, and of 15 other angiosperms: Amborella trichopoda, three monocots (Musa acuminata, Zea mays and Oryza sativa), eight eudicots (Daucus carota, Mimulus guttatus, Vitis vinifera, Prunus mume, Arabidopsis thaliana, Populus trichocarpa, Aquilegia coerulea and Nelumbo nucifera) and three magnoliids (L. chinense, P. americana and C. kanehirae). Based on 103 single-copy orthologous genes, we constructed a phylogeny of these 17 plant species (Figure 2a). The phylogenetic tree shows C. salicifolius clustered with the three other magnoliids. In addition, the magnoliids are sister to the eudicots, rather than sister to monocots or to both monocots and eudicots. We further used a coalescence-based approach to analyze 1420 gene trees to help reduce the impact of incomplete lineage sorting. The coalescence-based phylogenetic tree shows the same topology for magnoliids as the concatenation tree (Figure S6). Therefore, we concluded that the magnoliids are likely to be sister to eudicots rather than sister to monocots or sister to the clade of eudicots and monocots.

Table 1 Summary of assembly, annotation and repeat sequences of the Chimonanthus salicifolius genome

Sequencing and assembly
Subgroup            N50 size (Mb)   N90 size (Mb)   Longest (Mb)   Total size (Mb)
Pseudochromosome    96.3            67.6            97.7           820.1
Contig              2.3             0.3             11.9           820.0

Genome annotation
Protein-coding gene   Gene models   Gene size (bp)   Supported by expression   Supported by expression & homolog
                      36 651        6593.9           90.1%                     93.0%
Exon                  No. exons     Exons per gene
                      169 342       4.6

Repetitive elements (%)
LTR     LINEs & SINEs   DNA transposons   Total
20.8    4.4             5.6               57.7

Figure 1. Landscape of the Chimonanthus salicifolius genome. Circos plot of the C. salicifolius genome assembly. Circles from the outside inwards: (a) pseudochromosomes, (b) repeat density, (c) gene density, (d) long gene (> 20 kb) density and (e) heterozygous single nucleotide polymorphism (SNP) density. These density metrics were calculated with 1-Mb sliding windows. The syntenic genomic blocks (> 300 kb) are illustrated with gray lines.

We further compared the gene numbers among the four magnoliid plants. A total of 8896 gene families were shared by L. chinense, C. salicifolius, P. americana and C. kanehirae, suggesting that these may be core gene families for the four magnoliids. We found that 490 gene families were specific to C. salicifolius (Figure 2d), more than in the other three magnoliid genomes.

The numbers of Pfam protein families in the genomes of the four magnoliids and 11 other plants (three monocots and eight eudicots) were examined and compared (Figure 2e; Table S3). In magnoliids, genes carrying Pfam domains such as PF14432 (DYW family of nucleic acid deaminases) and PF01397 (terpene synthase, N-terminal domain) are commonly expanded compared with monocots and eudicots, whereas genes carrying PF01565 (FAD binding domain), PF00646 (F-box domain) and PF00931 (NB-ARC domain) are commonly reduced. In C. salicifolius, the specifically expanded domains include PF13041 (PPR repeat family), PF00313 (‘Cold-shock’ DNA-binding domain) and PF03105 (SPX domain), whereas the specifically reduced domains include PF13410 (glutathione S-transferase, C-terminal domain) and PF00891 (O-methyltransferase).

Ancient whole-genome duplications in Chimonanthus

Self-alignment of the C. salicifolius genome showed clear syntenic evidence for ancient WGD events (Figure 1). To infer ancient WGDs in C. salicifolius, we examined the age distribution of duplicate genes and fitted a mixture model, implemented in the mixtools R package, to identify significant peaks of gene duplication consistent with WGDs. The mixture model identified two peaks at about Ks 0.53 and 0.86 (Figures 3a and S7). The more recent peak of duplication in Chimonanthus has a median Ks of about 0.53, younger than the ortholog divergence between Chimonanthus and Cinnamomum (Ks 0.64) and that between Chimonanthus and Liriodendron (Ks 0.76; Figure 3b). The older peak of duplication in Chimonanthus has a median Ks of about 0.86, which exceeds both ortholog divergences. This suggests that the older duplication most likely occurred before the divergence of Chimonanthus and other magnoliids.
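The idea of locating Ks peaks with a mixture model can be illustrated with a small, self-contained Python sketch. It is not the mixtools analysis run in this study: it fits a plain two-component Gaussian mixture by expectation-maximisation to simulated Ks values, with the simulated peaks placed at the reported positions (0.53 and 0.86) and an assumed spread of 0.05. The function names, starting values and sample sizes are all illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, means=(0.4, 1.0), sds=(0.1, 0.1), n_iter=50):
    """Fit a two-component Gaussian mixture by expectation-maximisation."""
    means, sds, weights = list(means), list(sds), [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each value.
        resp = []
        for x in values:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            resp.append(p0 / (p0 + p1))
        # M-step: update weight, mean and standard deviation of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            total = sum(r)
            weights[k] = total / len(values)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / total
            sds[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return means, sds, weights


# Simulated Ks values (assumption): two peaks at the reported positions.
rng = random.Random(0)
ks_values = [rng.gauss(0.53, 0.05) for _ in range(1000)]
ks_values += [rng.gauss(0.86, 0.05) for _ in range(1000)]

peak_means, peak_sds, peak_weights = fit_two_gaussians(ks_values)
print([round(m, 2) for m in peak_means])
```

Real Ks distributions also contain a large background of small-scale duplicates and are often closer to log-normal in shape, so the number and form of components would need to be chosen and tested on the actual data.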

To confirm that C. salicifolius has undergone two rounds of ancient WGD, we compared the syntenic depth ratio between the C. salicifolius genome and the genomes of A. trichopoda and V. vinifera. We observed an overall four-to-one syntenic depth ratio between C. salicifolius and A. trichopoda (Figure 3c); that is, a single A. trichopoda region could be aligned to four genomic regions in the C. salicifolius genome (Figure 3d). Given that the A. trichopoda genome has not experienced any WGD after the ancestral angiosperm WGD (Amborella Genome Project, 2013), the overall four-to-one syntenic depth ratio suggests that C. salicifolius experienced two rounds of ancient WGD. We also compared C. salicifolius with the V. vinifera genome, which was most recently duplicated by the eudicot hexaploidy event (Jaillon et al., 2007). Consistent with our hypothesis, we found a four-to-three syntenic depth ratio between Chimonanthus and Vitis (Figure S8). For comparison with Chimonanthus, we also analyzed the syntenic depth ratios of C. kanehirae and L. chinense to Amborella and Vitis (Figures S7 and S8). In Liriodendron, we found a two-to-one syntenic depth ratio to Amborella, and a two-to-three ratio to Vitis. In Cinnamomum, we recovered a four-to-one ratio to Amborella, and a four-to-three ratio to Vitis. Previous synteny analyses inferred one round of ancient WGD in L. chinense, and two rounds of ancient WGD in C. kanehirae (Chaw et al., 2019; Chen et al., 2019). Our results are consistent with these previous studies, and indicate that two rounds of ancient WGD occurred in Chimonanthus.
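The step from a syntenic depth ratio to a number of WGD rounds can be written out explicitly. The Python sketch below assumes that every event is a genome doubling and that the reference (here Amborella, depth 1) has had no lineage-specific WGD; the function name `wgd_rounds` is hypothetical.

```python
import math


def wgd_rounds(query_depth: int, reference_depth: int = 1) -> float:
    """Number of genome doublings implied by a syntenic depth ratio,
    assuming each WGD doubles the number of syntenic copies."""
    return math.log2(query_depth / reference_depth)


# Depth ratios to A. trichopoda reported in the text.
ratios_to_amborella = {"C. salicifolius": 4, "C. kanehirae": 4, "L. chinense": 2}

rounds = {species: wgd_rounds(depth) for species, depth in ratios_to_amborella.items()}
for species, n in rounds.items():
    print(f"{species}: {n:.0f} round(s) of WGD")
```

The four-to-three ratio to Vitis fits the same picture: four copies in Chimonanthus from two doublings, against three copies in Vitis from the eudicot hexaploidy.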

Figure 2. Phylogenetics and gene families of Chimonanthus salicifolius.

(a) The phylogenetic tree including 17 species was constructed based on 103 single-copy genes, with Selaginella moellendorffii as an outgroup.

(b) The percentage of genes in each length range (< 5 kb, 5–10 kb, 10–20 kb and > 20 kb) is illustrated for each species on the tree. The golden bar indicates the percentage of long genes (> 20 kb).

(c) The average lengths (bp) of the genes, introns and coding regions of each species are shown. The lengths of the long genes (dark blue) are mainly contributed by the length of introns (light blue) rather than that of the CDSs (gray).

(d) Gene families in the genomes of Persea americana, Cinnamomum kanehirae, C. salicifolius and Liriodendron chinense. Labels in the panel: C. kanehirae, 11,156; C. salicifolius, 11,630; L. chinense, 11,153; P. americana, 11,315; shared by all four, 8,896.

(e) Scatter plot showing the number of genes with specific Pfam domains in the C. salicifolius genome compared with those of monocot and eudicot species. Labelled domains: PF14432 (DYW family of nucleic acid deaminases), PF01535 (PPR repeat), PF13041 (PPR repeat family), PF00931 (NB-ARC domain) and PF00646 (F-box domain).

To corroborate the inferences and phylogenetic placements of these ancient WGDs, we also used the Multi-tAxon Paleopolyploidy Search (MAPS) tool (Li et al., 2015, 2018; Li and Barker, 2020). The MAPS algorithm filters collections of gene trees for subtrees consistent with given relationships at each node in the species tree. Based on these filtered subtrees, MAPS reports the number of gene duplications shared by descendant taxa at each node. It then compares the observed results with null and positive simulations of WGDs. For our MAPS analysis, we selected six species: the four magnoliids (Chimonanthus, Cinnamomum, Persea and Liriodendron), with Oryza and Amborella as outgroups. We observed a burst of shared gene duplications at nodes N2, N3 and N4 (Figure 3e; Table S4). This signal comprised significantly more shared gene duplications than expected under the null simulations (P < 0.05). To further characterize these significant bursts, we simulated an additional set of gene trees with a WGD at the phylogenetic location of the duplication bursts. At the node representing the most recent common ancestor of Laurales and Magnoliales (N3), we found that this episodic burst of shared gene duplication was statistically consistent with our positive simulations of WGD. These MAPS inferences corroborated the results of our Ks plots and ortholog divergence analyses, and provided evidence consistent with an ancient WGD shared among Laurales and Magnoliales. Based on a substitution rate of 3.02E-9 for the magnoliids (Cui et al., 2006), we estimated that these two ancient polyploidy events in the Chimonanthus genome date back to approximately 87 million years ago (Mya) and 142 Mya.
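The two dates are reproduced by the standard conversion T = Ks / (2r), where r is the synonymous substitution rate per site per year. The text does not state the formula, so this relation is an assumption; the Python sketch below simply shows that it yields the reported values from the Ks peaks and the rate of Cui et al. (2006).

```python
RATE = 3.02e-9  # synonymous substitutions per site per year (Cui et al., 2006)


def ks_to_mya(ks: float, rate: float = RATE) -> float:
    """Convert a Ks value to an age in million years, assuming T = Ks / (2 * rate)."""
    return ks / (2.0 * rate) / 1e6


# Ks peaks of the two WGD events reported in the text.
ages = {ks: ks_to_mya(ks) for ks in (0.53, 0.86)}
for ks, age in ages.items():
    print(f"Ks {ks}: about {age:.1f} Mya")
```

The factor of two reflects that substitutions accumulate independently along both duplicate copies after the duplication event.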

Large number of long genes in magnoliids

Analysis of gene length in C. salicifolius revealed an interesting phenomenon: long genes (> 20 kb) were far more numerous in C. salicifolius (2737) than in most monocots and eudicots, such as rice (45), maize (910), Arabidopsis (4) and poplar (66). We investigated the lengths of all the genes in the other plant genomes and found that long genes were also common in magnoliids and A. trichopoda but not in S. moellendorffii, suggesting that long genes might be a specific genomic characteristic of magnoliids and some other angiosperms (Figure 2b). Further measurement of the lengths of the coding regions and introns revealed that the average coding region lengths in all 17 plant genomes are similar (ranging from 825 to 1396 bp), whereas the lengths of introns vary greatly (ranging from 556 to 10 202 bp; Figure 2c). The genes in the four magnoliid genomes have much longer introns (7144 bp on average) than those in the three monocots (2435 bp on average) and the eight eudicots (2933 bp on average), suggesting that the long genes are due to the extension of introns rather than of coding regions.

According to gene length, we divided the genes in C. salicifolius into four groups: < 5 kb (short genes), 5–10 kb, 10–20 kb and > 20 kb (long genes). We further characterized the LTR content of genes in the different length ranges, and found that a much higher percentage of LTRs, including Gypsy and Copia, existed in the long-gene group (Figure 4a). When examining paralogous genes in the C. salicifolius genome, we found that many long genes were paralogs of short and moderate-length genes. Calculation of the Ka/Ks values for paralogous gene pairs of different lengths revealed a significantly higher Ka/Ks ratio (one-sided Wilcoxon test) for the short–long comparison than for the short–moderate (P-value: 4.37E-07) and moderate–long (P-value: 2.98E-10) comparisons. This may indicate that, in addition to dramatic changes in the intron regions, more non-synonymous variants in the coding regions exist between long and short genes (Figure 4b).
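The length classes used in this analysis can be expressed as a small classifier. The Python sketch below is illustrative only: the text does not say how genes of exactly 5, 10 or 20 kb are assigned, so the boundary handling is an assumption, and the example lengths are hypothetical.

```python
from collections import Counter


def length_class(gene_length_bp: int) -> str:
    """Assign a gene to one of the four length classes used in the text.
    Boundary handling (which class gets exactly 5, 10 or 20 kb) is assumed."""
    if gene_length_bp < 5_000:
        return "<5 kb"
    if gene_length_bp < 10_000:
        return "5-10 kb"
    if gene_length_bp <= 20_000:
        return "10-20 kb"
    return ">20 kb"


# Hypothetical gene lengths in bp, for illustration only.
example_lengths = [1_200, 6_594, 15_000, 20_000, 48_000]
class_counts = Counter(length_class(n) for n in example_lengths)
print(dict(class_counts))
```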

Figure 3. Ancient whole-genome duplications (WGDs) and ortholog divergences in four magnoliid plants.

(a) The ortholog divergence in four magnoliid plants.

(b) The frequency distributions of synonymous substitutions (Ks) for orthologs between four magnoliid plants [Persea americana (Pam), Cinnamomum kanehirae (Cka), Chimonanthus salicifolius (Csa) and Liriodendron chinense (Lch)] and Amborella trichopoda (Atr).

(c) The syntenic depth ratio of C. salicifolius compared with A. trichopoda (4 : 1), plotted as percentage of genome against the number of Csa blocks per Atr gene and the number of Atr blocks per Csa gene.

(d) Genomic syntenic blocks between C. salicifolius and A. trichopoda are shown, with blue wedges as a case highlighting a typical ancestral region of Amborella that can be tracked to four genomic regions of C. salicifolius.

(e) A schematic diagram summarizing WGD events for the four currently genome-sequenced magnoliid species. The estimated times of polyploidization based on Ks are shown with brown ovals. The divergence times obtained from TimeTree are shown with the symbol ‘#’. The percentage of subtrees by MAPS analysis, which contained a gene duplication (red line) shared by descendant species for each node (N1 to N4), is shown on the right. The symbol ‘**’ indicates that the observed value of percentage is significantly higher than the null simulations. Labels in the panel: WGD at 142 MYA (Ks 0.86), 87 MYA (Ks 0.53) and 76 MYA (Ks 0.46); divergences at 125 MYA (Ks 0.76), 106 MYA (Ks 0.64) and 8 MYA (Ks 0.05); TimeTree estimates of 173–199 MYA and 148–166 MYA.

The expression levels of all the genes were analyzed using transcriptomic data. The average expression level of the long genes was similar to that of the other genes (Figure 4c). In addition, 20 long genes were randomly selected and amplified by polymerase chain reaction (PCR) from cDNA of several C. salicifolius tissues. Fourteen genes were successfully amplified in leaf, seed, stem and flower (Figure S9), and their gene sequences were validated by Sanger sequencing (Table S5). These results strongly indicated that most predicted long genes in the C. salicifolius genome are functional. GO enrichment analysis of the long genes revealed that they played important roles in cell components, plant growth and development, and energy metabolism, as well as in nucleotide binding, catalytic activity and hydrolase activity (Figure 4d; Table S6).

The gene encoding the tryptophan synthase alpha chain in A. thaliana (AT3G54640) is reported to have multiple biological roles, including auxin and tryptophan biosynthesis, defense response to bacteria and gravitropism (Zhang et al., 2008). The introns of the orthologous gene in C. salicifolius (Cs05g01228) are markedly longer than those in O. sativa, A. thaliana and even A. trichopoda. LTRs were detected in the introns of Cs05g01228 but not in those of any orthologous gene in O. sativa, A. thaliana or A. trichopoda (Figure 4e), indicating that LTR insertion might be involved in the intron expansion of genes in C. salicifolius.

Transcriptomic profiles of Chimonanthus salicifolius tissues and flowers during development

To obtain a global transcriptomic map of different tissues

and flower development in C. salicifolius, we selected tis-

sues including root, stem, seed, pericarp, leaf and flower.

Three stages of flowers (bud, blooming and withering) as

well as three stages of leaves (bud, young and senescent)

were collected from the wild C. salicifolius individuals

grown in Kaihua County in Zhejiang Province for RNA-seq

analysis (Figure S10). Each tissue was collected from three

individuals as three biological replicates. RNA-seq was

performed on these 10 tissues, and comprehensive expres-

sion profiles of C. salicifolius genes were obtained

(Table S7). According to a principal component analysis

(PCA), different tissues could be separated quite well. The

individuals from different stages of the same tissues

grouped together (Figure S11). In young tissues, such as

leaf buds, flower buds and developing roots, higher gene

expression was observed (Figure S12).

In Eastern China, C. salicifolius flowers from the end of

autumn to the beginning of winter; few plants can flower

in these low-temperature conditions. To dissect this phenomenon, RNA from flowers at the bud, blooming and withering stages was collected, and RNA-seq was performed.

A MapMan analysis showed that many types of transcrip-

tion factors were significantly up- or downregulated (Fig-

ure 5). When comparing the transcriptomes of flowers

between the bud and blooming stages, most genes in the

blooming stage were downregulated (Figure 5a). The GO

terms of the downregulated genes were enriched in terms

of catalytic activity, response to stress and transcription

regulator activity (Figure S13a). Notably, we found that the

expression of WRKY genes was significantly upregulated

in the blooming stage, which might reflect a response to the rapid drop in temperature at this stage (Figure S13b). Numerous studies have revealed that WRKY

genes are involved in cold or chilling tolerance in plants

(Lafuente et al., 2017; Li et al., 2017; Luo et al., 2017). In

addition, two gene families containing domain PF00313

(‘cold-shock’ DNA-binding domain) and PF04180 (low-tem-

perature viability protein) were significantly expanded in

the genome of C. salicifolius (Figure S14a), and might con-

tribute to the cold tolerance of C. salicifolius. One gene

containing the PF00313 domain, Cs11g02011, showed high

expression in low-temperature conditions, and might be

related to the cold tolerance of C. salicifolius (Figure S14b).

The expression of genes related to flavonoid metabolism

and phenylpropanoid synthesis observably increased in

[Figure 4 graphic omitted. Labels recoverable from the image: (a) y-axis ‘Percentage’, x-axis ‘Gene length’ (<5 kb, 5-10 kb, 10-20 kb, >20 kb), series Gypsy, Copia, unknown and All LTRs; (b) ‘Ka/Ks’ for length groups <5 kb, 5-20 kb and >20 kb, with values 0.32 (185), 0.22 (865) and 0.19 (545); (c) y-axis ‘Log2(FPKM + 1)’, x-axis ‘Gene length’ (0-20 kb, >20 kb); (d) ‘No. of genes’; (e) exon, intron and LTR structure of Cs05g01228, AT3G54640.1, AT4G02610.1, LOC_Os03g58320.1, LOC_Os03g58300.1, LOC_Os03g58290.1, LOC_Os07g08430.1, LOC_Os03g58260.2, AmTr v1.0 scaffold2.531 and AmTr v1.0 scaffold2.532 (scale bar 1 kb).]

Figure 4. Characterization of long genes.

(a) Percentage of long terminal repeats (LTRs) within genes of different lengths.

(b) Values of Ka/Ks between paralogous genes in the Chimonanthus salicifolius genome. The paralogous genes are categorized into three groups based on their lengths, and the median values of Ka/Ks between different groups and within each group are shown. The values in the bracket are the number of gene pairs.

(c) The relationship between the length of genes and their expression levels. The expression level for one gene is the average FPKM values from the 30 RNA-seq samples generated in this study.

(d) Top 10 enriched GO terms for long genes (> 20 kb).

(e) Phylogenetic tree and gene structure of genes with PF00290 (tryptophan synthase alpha chain) domain in four species, Amborella trichopoda, Arabidopsis thaliana, Oryza sativa and C. salicifolius. The gene of C. salicifolius is indicated by a red circle. The bootstrap values above 70 are shown on the nodes.


the withering stage, suggesting that secondary metabolite

accumulation in seeds began at this stage (Figures 5b and

S15).

Investigation of genes encoding enzymes of the flavonoid biosynthetic pathways

Genes involved in the flavonoid biosynthetic pathways have been identified and characterized in plants (Saito et al., 2013), including phenylalanine ammonia-lyase (PAL), cinnamate 4-hydroxylase (C4H), 4-coumaroyl:CoA ligase (4CL), chalcone synthase (CHS), chalcone isomerase (CHI), flavanone 3-hydroxylase (F3H), flavonoid 3′-hydroxylase (F3′H), flavonol synthase (FLS), UDP glucose:flavonoid 3-O-glycosyltransferase (UFGT), flavonol 3-O-glucoside rhamnosyltransferase (GRT), dihydroflavonol 4-reductase (DFR) and anthocyanidin synthase (ANS; Figure 6a). UFGT and GRT directly affect the content of the final flavonoid products. Homologs of UFGT and GRT, which carry the PF00201 (UDP-glucoronosyl and UDP-glucosyl transferase) domain, were identified in the C. salicifolius genome. A phylogenetic tree was constructed from the Arabidopsis UDP-glucosyltransferase multigene family and the C. salicifolius genes with a PF00201 domain (Figure S16). The genes that clustered with UGT79 and UGT91 were taken to be UFGT and GRT genes, respectively. In total, 82 homologs of genes within the flavonoid biosynthetic pathway were identified in the C. salicifolius genome (Figure S17; Table S8).

Gene expression analysis showed that the genes in the

flavonoid biosynthetic pathways had the highest expres-

sion level in the leaf buds (Figure 6b), indicating that more

flavonoids might be produced at this stage. It is worth noting that a CHS (Cs04g01186) and two FLS (Cs07g00635 and

Cs07g00727) genes had extremely high expression levels

(FPKM > 3000) in the leaf buds. The products of these

genes might be functional bioactivators related to the ther-

apeutic effects of C. salicifolius.

The leaves of C. salicifolius have been used as tradi-

tional medicine in Eastern China for hundreds of years (Ma

et al., 2017). A previous study showed that six flavonoids,

including kaempferol, kaempferol-3-O-glucoside, kaemp-

ferol-3-O-rutinoside, quercetin, isoquercetin and rutin,

were rich in the leaves of plants in the Calycanthus (Yang

et al., 2018). Flavonoids in different tissues of C. salicifolius

were detected by high-performance liquid chromatography

(HPLC), and contents of these flavonoids were obtained

(Table S9). In general, these flavonoids were much more abundant in leaves than in other tissues, except flower buds.

Flower buds had the highest content of the upstream flavo-

noids (kaempferol and quercetin), while the leaf buds had

the highest content of downstream flavonoids (kaemp-

ferol-3-O-rutinoside and rutin).

In total, 34 UFGT, GRT and FLS homologs were identified in the C. salicifolius genome. However, which of these genes are involved in the biosynthesis of the six flavonoids in C. salicifolius was unclear. The correlation between flavonoid contents and the expression values of the homologous genes was therefore estimated (Table S10). Two FLS homologs (Cs07g00635 and Cs07g00727), four GRT homologs (Cs01g03503, Cs01g03506, Cs03g03359 and Cs04g02810) and two UFGT homologs (Cs03g02624 and Cs03g02680) showed a significant positive correlation (P < 0.05) with rutin content (Figure 6c), indicating that these eight genes might be involved in the synthesis of rutin in C. salicifolius. In addition, the two FLS homologs and three GRT homologs showed a strong positive correlation with kaempferol-3-O-rutinoside content. This is in line with the fact that the GRT-encoded flavonol 3-O-glucoside rhamnosyltransferase synthesizes both rutin and kaempferol-3-O-rutinoside.
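The correlation screen described above can be sketched as follows. This is a minimal illustration that assumes a Pearson coefficient computed across tissues; the text does not state which coefficient was used. The function name and all numbers are hypothetical and are not taken from Tables S9 or S10.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-tissue values: expression (FPKM) of one candidate gene
# and rutin content in the same tissues (arbitrary units).
fpkm = [10.0, 20.0, 30.0, 40.0]
rutin = [1.0, 2.1, 2.9, 4.2]
r = pearson_r(fpkm, rutin)
```

A gene would then be retained as a candidate when its coefficient is positive and the associated P value falls below the chosen threshold; the significance test itself is left out of this sketch.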

To investigate the upstream genes that were involved in

biosynthesis of the six flavonoids, correlation of the

expression of the eight rutin-related genes and other

homologs was analyzed (Figure 6d). The CHS gene

Figure 5. Transcriptomic profiles of Chimonanthus salicifolius flowers during different developmental stages.

An overview of the dynamic expression changes in diverse pathways for blooming versus bud (a) and withering versus blooming (b). Color intensity corresponds to the expression fold change at the log2 scale (red: upregulated, blue: downregulated).


(Cs04g01186) with an extremely high expression value showed significant correlation (r > 0.8, P < 0.05) with the homologous genes involved in rutin synthesis, indicating that this CHS gene might be an important upstream gene in the rutin biosynthetic pathway. Therefore, we concluded that these nine genes might be involved in rutin synthesis. Furthermore, we found that genes within the flavonoid biosynthesis pathways were well correlated in expression. For example, the genes acting upstream of chalcone, including PAL, 4CL and C4H, showed significantly positive coexpression patterns. Similarly, the genes acting upstream of dihydroquercetin, including CHS, CHI and F3H, were also positively coexpressed. In addition, genes acting after the divergence of the two flavonoid biosynthetic pathways, including FLS, DFR and ANS, showed negatively correlated expression, suggesting that as flavonoids accumulated under the influence of FLS expression, the synthesis of anthocyanins (affected by DFR and ANS) was suppressed.

DISCUSSION

Chimonanthus salicifolius is likely to be sister to eudicots

The phylogenetic position of the magnoliids was different

in the four versions of the APG system (The Angiosperm

Phylogeny Group, 1998, 2003, 2009, 2016). Magnoliids are

sister to both monocots and eudicots in APG I, sister to

monocots in APG II, and sister to the clade containing both

monocots and eudicots in APG III and IV. Although several

genomes of magnoliids had been published and phyloge-

nomic analysis of magnoliids had been carried out with

the L. chinense, C. kanehirae, P. americana and P. nigrum

genomes, respectively, the phylogenetic placement of

magnoliids was still inconclusive. Based on the phylogeny

constructed by 211 strictly single-copy genes in 13 seed

plants, Chaw et al. (2019) found that C. kanehirae (repre-

senting magnoliids) was sister to the eudicots with strong

bootstrap support (bootstrap value was 100). Based on the

phylogenetic tree that was constructed by 502 low-copy

orthogroups in 11 plant species, Chen et al. (2019) found

that L. chinense (representing magnoliids) was sister to

monocots and eudicots with weak bootstrap support (boot-

strap value was 50). Rendon-Anaya et al. (2019) suggested

P. americana as sister to the enormous monocot and eudi-

cot lineages according to the phylogenetic tree that was

constructed by 176 single-copy genes in 19 angiosperms.

In this study, phylogenetic analyses based on both con-

catenated alignments and coalescent-based approaches

revealed that magnoliids had a closer relationship to eudi-

cots than monocots, suggesting that C. salicifolius is likely

to be sister to the eudicots. This result disagrees with the

[Figure 6 graphic omitted. Labels recoverable from the image: (a) pathway compounds phenylalanine, cinnamic acid, coumaric acid, coumaroyl-CoA, naringenin chalcone, naringenin, dihydrokaempferol, dihydroquercetin, kaempferol, quercetin, isoquercetin, kaempferol 3-O-glucoside, kaempferol 3-O-rutinoside, rutin, leucocyanidin, cyanidin, anthocyanin and proanthocyanidin, with enzymes PAL, C4H, 4CL, CHS, CHI, F3H, F3′H, FLS, UFGT, GRT, DFR and ANS, and boxes marked ‘Flower’ and ‘Leaf’; (b) FPKM of PAL, 4CL, C4H, CHS, CHI, F3H, F3′H, FLS, DFR, ANS, UGT91 (GRT) and UGT79 (UFGT) homologs in stem, pericarp, seed, root, leaf (bud, young, senescent) and flower (bud, blooming, withering), with Cs04g01186, Cs07g00635 and Cs07g00727 labelled; (c) correlation coefficient with rutin for the same gene families; (d) coexpression matrix of PAL, 4CL, C4H, CHS, CHI, F3H, F3′H, FLS, DFR, ANS, UGT91 and UGT79 homologs.]

Figure 6. Genes involved in the flavonoid biosynthetic pathway.

(a) The flavonoid biosynthesis pathway in Chimonanthus salicifolius. The flavonoids that are mainly synthesized in leaf and flower are indicated by green and red boxes, respectively.

(b) The expression levels of different paralogous genes encoding different enzymes in the flavonoid biosynthesis pathway.

(c) The correlation between the expression of flavonoid biosynthetic genes and rutin content in different tissues. P < 0.05 is indicated by a red circle.

(d) The coexpression coefficient matrix for genes in the flavonoid biosynthesis pathway. The yellow arrow indicates the CHS gene (Cs04g01186) strongly related to the genes that are involved in rutin synthesis.


resolution of APG III and IV, which placed magnoliids as

sister to a clade containing both monocots and eudicots.

However, it is in line with a previous analysis of 59 low-

copy nuclear genes in 26 Mesangiospermae (Zeng et al.,

2014), and a phylogeny constructed by orthologous low-

copy nuclear genes in 115 plant species (Zhang et al.,

2020). In addition, it was also supported by the phyloge-

nomic framework constructed by 410 single-copy nuclear

gene families extracted from genome and transcriptome

data from 1153 species (One Thousand Plant Transcrip-

tomes Initiative, 2019).

Ancient whole-genome duplications in the Chimonanthus genome

In this study, we inferred and placed two rounds of ancient

WGD in the genome of C. salicifolius by incorporating Ks

plots and ortholog divergences, synteny analyses, and the

MAPS phylogenomic approach. We show evidence for an

ancient polyploidy event only found in the Chimonanthus

genome, and not shared with Cinnamomum and Lirioden-

dron. This WGD was not inferred in the 1KP project (One

Thousand Plant Transcriptomes Initiative, 2019). This is

likely due to difficulties in detecting two highly overlapping

WGD peaks with mixture models from duplicate gene age

distributions. Based on the similarity of Ks distribution of

Idiospermum australiense and Calycanthus floridus from

the 1KP study, this Chimonanthus WGD is likely shared by

the Calycanthaceae. We also show evidence consistent

with an ancient WGD shared among Laurales and Magno-

liales. A previous study has shown the ancient WGD

inferred in the Liriodendron genome likely predated the

divergence of Magnoliaceae and Lauraceae (Chen et al.,

2019). Based on the genome of Cinnamomum and transcriptomes of 17 Laurales and Magnoliales from the 1KP project, previous studies inferred an ancient polyploidy event shared by Lauraceae and another round of ancient WGD in the ancestry of Laurales and Magnoliales (Chaw et al., 2019; One Thousand Plant Transcriptomes Initiative,

2019). Consistent with these studies, we found further evi-

dence for the placement of this ancient WGD shared by

Laurales and Magnoliales by our MAPS phylogenomic

approach. Overall, our ancient WGD analyses are largely

consistent with previous findings, and provide clear evi-

dence for two rounds of ancient WGDs in Chimonanthus.

The magnoliid genomes contain a large number of long genes

In total, 2737 long genes were identified in the C. salicifolius genome, far more than in monocot and eudicot genomes. Long genes with long introns (> 10 kb) have also

been detected in animals, such as humans, Rattus norvegi-

cus, Danio rerio and Drosophila. In the human genome,

the number of introns longer than 24 kb was more than

8000, and the super-long-introns (> 100 kb) numbered

more than 1200 (Shepard et al., 2009). Previous research

on the long introns of Drosophila revealed that some of

the long introns underwent recursive splicing (Hatton

et al., 1998; Conklin et al., 2005; Sibley et al., 2015). Muta-

tions that occurred in the recursive splicing sites resulted

in many human diseases (Chabot and Shkreta, 2016). The

recursive splicing is a splicing phenomenon difficult to

capture, and requires nascent RNA sequencing, which can

profile pre-mRNA transcripts shortly after they are transcribed (Pai et al., 2018). With data from purpose-designed transcriptomic experiments, further characteristics of the long genes (such as recursive splicing and other mechanisms) in magnoliid genomes could be explored in the future.

The Chimonanthus salicifolius genome benefits functional genomics research and molecular breeding of Chimonanthus salicifolius

The genus Chimonanthus is widely grown in Asia, America

and Europe. Chimonanthus salicifolius is distributed

mainly in central and eastern China. It is collected and

used as a traditional medicine. The plants of this species

show vigorous growth, tolerance to several abiotic and

biotic stresses, and flowering at low temperatures. Despite its importance, C. salicifolius remains underutilized, and basic research on the species is lacking.

Based on the high quality of the reference genome, gen-

ome-wide association studies (GWAS) and genome-wide

linkage mapping could be performed to quickly and com-

prehensively identify quantitative trait loci (QTLs) that are

related to the yield (yield of leaf buds) and quality (content

of flavones and other bioactive secondary metabolites) of

C. salicifolius. Using gene annotation and gene expression

information, candidate genes in the QTL regions could be

identified. Genome-editing and genetic complementation

experiments, which will also benefit from this genome by

using gene sequences, could be carried out to validate the

candidate genes. These genes can be further utilized in the

molecular breeding for high-yield and superior-quality

C. salicifolius cultivars.

Thus, an accurate reference genome of C. salicifolius

will provide a platform for elucidating the genomic evolu-

tion of the Chimonanthus genus and understanding the

genes responsible for biosynthesis of the various flavo-

noids made in C. salicifolius as well as laying a foundation

for the molecular breeding of C. salicifolius.

EXPERIMENTAL PROCEDURES

Plant materials, genomic DNA extraction and sequencing

The individual C. salicifolius that was used for genome sequencing was originally collected from Liandu District (28°27′53″N, 119°55′31″E), Lishui City, Zhejiang Province in Eastern China, and preserved in the Lishui Institute of Agriculture and Forestry Sciences. The RNA samples were collected from a wild population


of C. salicifolius in the natural environment at Kaihua County (29°14′26″N, 118°27′57″E), Zhejiang Province, Eastern China.

Genomic DNA was extracted from young leaves of C. salicifolius plants using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) according to the user manual. Further treatment and preparation of the genomic DNA for Illumina sequencing followed the description in Wei et al. (2016). PacBio SMRTbell libraries (20 kb inserts) were prepared with a Template Prep Kit (Pacific Biosciences, Menlo Park, CA, USA), and 12 SMRT cells were run on the PacBio Sequel system with P6-C4 chemistry (Chin et al., 2013).

RNA extraction and sequencing

Tissues of roots, stems, leaf buds and seeds, as well as flowers and leaves in three developmental stages were collected from three individuals, and total RNA was extracted from each sample using the RNeasy Plant Mini Kit (Qiagen) according to the user manual. The cDNA was synthesized from 20 μg total RNA using ReverTra Ace (TOYOBO, Osaka, Japan) with oligo(dT) primer following the manufacturer’s protocol. High-throughput sequencing was then performed on the Illumina HiSeq X Ten platform.

Genome size estimation

Flow cytometry was used to determine the nuclear DNA content of C. salicifolius as described by Doležel et al. (2007). Samples were prepared by homogenizing young leaves of C. salicifolius and O. sativa ssp. japonica cv. Nipponbare (as an internal standard, 0.91 pg/2C; Ammiraju et al., 2006) on ice in Galbraith’s buffer (5 mM sodium metabisulfite and 5 μl β-mercaptoethanol complemented) with 50 μg ml⁻¹ propidium iodide, and then analyzed on a MoFlo XDP Cell Sorter (excitation 488 nm, emission 620 nm; Beckman Coulter, Hialeah, FL, USA) after filtration. The data were further analyzed with FlowJo_V10.4.0 software. The nuclear DNA content of C. salicifolius was estimated as follows, with 1 pg of DNA assumed to be equivalent to 9.78 × 10⁸ bp: sample 1C value = reference 1C value × sample 2C mean peak position / reference 2C mean peak position. Genome size estimation based on Illumina short reads was conducted via a 17-bp k-mer frequency analysis with ‘kmerfreq’ as implemented in SOAPdenovo2 (Luo et al., 2012).
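The flow cytometry calculation above can be written out as a short sketch. The conversion factor and the rice standard (0.91 pg/2C) come from the text; the function names and the peak positions are hypothetical placeholders and are not the measurements obtained in this study.

```python
BP_PER_PG = 9.78e8  # 1 pg of DNA assumed equivalent to 9.78 x 10^8 bp (from the text)

def sample_1c_pg(reference_2c_pg, sample_peak, reference_peak):
    """Sample 1C DNA content (pg) from the 2C mean peak positions.

    Implements: sample 1C = reference 1C * sample 2C peak / reference 2C peak.
    """
    if reference_peak <= 0:
        raise ValueError("reference peak position must be positive")
    reference_1c_pg = reference_2c_pg / 2.0
    return reference_1c_pg * sample_peak / reference_peak

def pg_to_mb(pg):
    """Convert a DNA mass in pg to megabases."""
    return pg * BP_PER_PG / 1e6

# Rice internal standard: 0.91 pg/2C. Peak positions below are hypothetical.
c_value_pg = sample_1c_pg(0.91, sample_peak=300.0, reference_peak=200.0)
genome_size_mb = pg_to_mb(c_value_pg)
```

With these placeholder peaks the sample 1C value is 1.5 times the reference 1C value; substituting the measured mean peak positions gives the actual estimate.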

De novo assembly and genome evaluation

De novo assembly of C. salicifolius was performed using Falcon v1.87 (Chin et al., 2016) software. After the process of base error correction, overlap graphs were built, and consensus contigs were constructed based on raw PacBio long reads. Contig sequences were aligned against each other to remove redundant sequences with more than 85% similarity and overlap. The Illumina data were aligned with the assembly contigs by bwa (Li and Durbin, 2009), and SNP and indel errors were corrected using Pilon v1.22 (Walker et al., 2014).

The contigs were scaffolded by FragScaff v140324.1 (Adey et al., 2014) using 10× Genomics data. Based on Hi-C data, scaffolds were anchored to 11 pseudomolecules using LACHESIS software (Burton et al., 2013). The completeness of the assembled genome was evaluated by BUSCO v3 using the ‘embryophyta_odb9’ database (Simao et al., 2015).

Repeat and gene annotation

We constructed a C. salicifolius genome repeat library using RepeatModeler v1.0.11 with the default parameters (Chen, 2004). The constructed C. salicifolius repeat library was further used to run RepeatMasker v4.0.7 (Chen, 2004) for whole-genome repeat annotation.

The combination of ab initio gene prediction, protein homolog evidence and transcriptomic evidence was used for the prediction of protein-coding genes. AUGUSTUS v3.0.3 (Stanke and Waack, 2003), SNAP v5.0 (Leskovec and Sosic, 2016) and GeneMark-ET v4.212 (Lomsadze et al., 2014) were used in ab initio gene prediction. The protein sequences of Arabidopsis were aligned to the assembled C. salicifolius genome by Exonerate (Slater and Birney, 2005) to achieve evidence for gene structure. The open reading frames (ORFs) in the transcripts from the RNA-seq data were predicted by PASA v2.0.1 (Haas et al., 2003). Finally, all the predictions were combined into consensus gene models using EVM (Haas et al., 2008).

The predicted C. salicifolius gene models were aligned against the Swiss-Prot and NR protein databases for functional annotation (BLASTP, E-value ≤ 1E-5). InterProScan v5 (Zdobnov and Apweiler, 2001) was then applied for the prediction of protein domains and GO terms for each gene model with the setting ‘-appl PfamA -goterms -pa’. Non-coding RNAs were predicted by the Infernal program using the default parameters (Nawrocki and Eddy, 2013).

Phylogenetic analysis and estimation of divergence time

A total of 17 plant species, including four magnoliids (C. salicifolius, P. americana, C. kanehirae, L. chinense), three monocots (O. sativa, Z. mays, M. acuminata), eight eudicots (A. thaliana, P. trichocarpa, P. mume, V. vinifera, D. carota, M. guttatus, A. coerulea, N. nucifera), A. trichopoda and S. moellendorffii were selected for building the phylogenetic tree. Except for N. nucifera, all the genomes were downloaded from the ftp site of JGI (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v12.0/). Paralogs and orthologs among the 17 species were identified using the OrthoFinder pipeline with the parameter ‘-M msa -oa’ (Emms and Kelly, 2015), and the protein sequences of the identified 103 single-copy genes were used for phylogenetic tree construction. RAxML v8 (Stamatakis, 2014) was used for the tree construction, with the parameters ‘-m PROTGAMMAAUTO --auto-prot=bic’ to automatically select the best protein model. A total of 100 bootstrap replicates were performed. The phylogenetic tree was visualized using MEGA v5 (Tamura et al., 2011). In addition, ASTRAL-III v5.7.3 (Zhang et al., 2018) was applied to infer the coalescence-based species tree with 1420 gene trees (Figure S6).

Estimation of divergence and ancient whole-genome duplications

DupPipe analyses of ancient whole-genome duplications. For each genome, we used the DupPipe pipeline to construct gene families and estimate the age distribution of gene duplications (Barker et al., 2008, 2010). We translated DNA sequences and identified ORFs by comparing the Genewise (Birney et al., 2004) alignment to the best-hit protein from a collection of proteins from 25 plant genomes from Phytozome (Goodstein et al., 2012). For all DupPipe runs, we used protein-guided DNA alignments to align our nucleic acid sequences while maintaining the ORFs. We estimated synonymous divergence (Ks) using PAML with the F3X4 model (Yang, 2007) for each node in the gene family phylogenies. We then used mixture modeling to identify significant peaks consistent with a potential WGD and to estimate their median paralog Ks values. Significant peaks were identified using a likelihood ratio test in the boot.comp function of the package mixtools in R (Benaglia et al., 2009).

Estimating orthologous divergence. To place putative WGDs in relation to lineage divergence, we estimated the synonymous


divergence of orthologs among pairs of species that may share a WGD based on their phylogenetic position and evidence from the within-species Ks plots. We used the RBH Orthologue pipeline (Barker et al., 2010) to estimate the mean and median synonymous divergence of orthologs, and compared those with the synonymous divergence of inferred paleopolyploid peaks. We identified orthologs as reciprocal best blast hits in pairs of transcriptomes. Using protein-guided DNA alignments, we estimated the pairwise synonymous divergence for each pair of orthologs using PAML with the F3X4 model (Yang, 2007).
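The reciprocal-best-hit criterion used to call orthologs can be illustrated with a small sketch. It is not the RBH Orthologue pipeline itself: it assumes that BLAST results have already been reduced to bit scores per query and subject, and the function names and gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {query_id: {subject_id: bit_score}}
    """
    return {q: max(subj, key=subj.get) for q, subj in scores.items() if subj}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical bit scores between two species A and B.
a_vs_b = {
    "a1": {"b1": 200.0, "b2": 50.0},
    "a2": {"b2": 90.0, "b1": 80.0},
    "a3": {"b1": 150.0},
}
b_vs_a = {
    "b1": {"a1": 210.0, "a3": 140.0},
    "b2": {"a2": 95.0},
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` is dropped because its best hit `b1` prefers `a1`; only pairs that pass in both directions would go on to the Ks estimation step.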

Synteny analyses and dating of ancient whole-genome duplications and orthology divergence. The genomic collinearity blocks for intra- and interspecies comparisons for magnoliids were identified by the MCscan program (Tang et al., 2008). We performed all-against-all LAST (Kielbasa et al., 2011) and chained the LAST hits with a distance cutoff of 10 genes, requiring at least 5 gene pairs per synteny block. The syntenic ‘depth’ function implemented in MCscan was applied to estimate the duplication history in respective genomes. The genomic synteny was visualized by the python version of MCScan (Tang et al., 2008) and Circos (Krzywinski et al., 2009). The dating of ancient WGDs and orthology divergence were estimated using the formula T = Ks/(2R), where Ks refers to the synonymous substitutions per site, and R (3.02 × 10⁻⁹) is the synonymous substitution rate for magnoliids estimated by Cui et al. (2006). Estimation of the divergence times for A. trichopoda – O. sativa and O. sativa – magnoliids was based on TimeTree (Kumar et al., 2017).
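As a worked example of the dating formula T = Ks/(2R), the sketch below converts a Ks value into millions of years. The rate is the value quoted in the text, taken here to be per site per year; the function name and the Ks input are hypothetical and do not correspond to a peak reported in this study.

```python
SUBSTITUTION_RATE = 3.02e-9  # R: synonymous substitutions per site per year (from the text)

def ks_to_mya(ks, rate=SUBSTITUTION_RATE):
    """Convert synonymous divergence Ks to an age in million years: T = Ks / (2R)."""
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    if rate <= 0:
        raise ValueError("substitution rate must be positive")
    return ks / (2.0 * rate) / 1e6

# Hypothetical Ks value chosen only to make the arithmetic easy to check.
age_mya = ks_to_mya(0.604)
```

The factor of 2 reflects that substitutions accumulate along both diverging copies, so a Ks of 0.604 at this rate corresponds to about 100 million years.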

MAPS analyses of whole-genome duplications from genomes of multiple species. To determine the WGD node across the magnoliid phylogeny, the MAPS tool (Li et al., 2015, 2018) was applied. Six species were selected: the four magnoliids (P. americana, C. kanehirae, C. salicifolius, L. chinense), with one monocot species (O. sativa) and A. trichopoda as outgroups. Orthologous groups for the six species were obtained from Orthofinder (Emms and Kelly, 2015). We chose gene families with a maximum gene family size of 20, and obtained a total of 8437 gene families. The phylogenetic trees for the 8437 gene families constructed by FastTree (Price et al., 2009) were analyzed by the MAPS program. Both null and positive simulations of the background gene birth and death rates were performed to compare with the observed number of duplications at each node.

For null simulations, we estimated the gene birth rate (λ) and death rate (μ) for the selected six species with WGDgc (Rabier et al., 2014). Gene count data of each gene family for the six species were obtained from Orthofinder (Emms and Kelly, 2015). The estimated parameters (λ = 1.355; μ = 0.050) were configured in the MAPS program, and the gene trees were then simulated within the species tree using the GuestTreeGen program from GenPhyloData (Sjostrand et al., 2013). For each species tree, we simulated 3000 gene trees with at least one tip per species: 1000 gene trees at the estimated λ and μ, 1000 gene trees at half of the estimated λ and μ, and 1000 trees at three times λ and μ, according to the settings in the 1KP program (One Thousand Plant Transcriptomes Initiative, 2019; Li and Barker, 2020). We then randomly resampled 1000 trees without replacement from the total pool of gene trees 100 times to provide a measure of uncertainty on the percentage of subtrees at each node. A Fisher’s exact test was used to identify locations with significant increases in gene duplication compared with a null simulation.
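The comparison of observed and simulated duplication counts at a node can be sketched with a one-sided Fisher's exact test. This is an illustration only, not the MAPS implementation: the one-sided alternative, the function name and the counts are assumptions. It relies on `math.comb`, which requires Python 3.8 or later.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the chance of seeing at least `a` in the top-left cell.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return tail / total

# Hypothetical counts at one node:
# rows = observed vs null-simulated subtrees,
# columns = with vs without a shared gene duplication.
p_value = fisher_exact_greater(40, 60, 15, 85)
```

A small P value at a node would indicate more shared duplications than expected from background gene birth and death alone.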

For positive simulations, we simulated gene trees using the same methods described above. However, we incorporated WGDs at the location in the MAPS phylogeny with significantly larger numbers of gene duplications compared with the null simulation. We allowed at least 20% of the genes to be retained following the simulated WGD to account for biased gene retention and loss.

Identification and validation of long genes

The lengths of all genes were screened, and genes longer than 20 kb were selected. Twenty long genes were randomly selected, and their coding sequences were amplified from the cDNA of different C. salicifolius tissues using KOD-FX Plus (TOYOBO). The primers used for cloning long genes are listed in Table S11. The amplified fragments were ligated into the pMD18-T cloning vector using the pMD™ 18-T Vector Cloning Kit (TAKARA, Shiga, Japan) after A-tailing with the DNA A-Tailing Kit (TAKARA). Positive single bacterial colonies were selected for plasmid extraction and sequencing. The sequences were aligned with those of the long genes.
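The in silico part of this screen (length filter followed by random selection) can be sketched in Python as follows. The sketch assumes that gene length is the genomic span computed from 1-based inclusive start and end coordinates; the gene identifiers, coordinates and random seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

LONG_GENE_THRESHOLD_BP = 20_000  # 20 kb cutoff stated in the text
N_VALIDATION_GENES = 20          # number of genes chosen for cloning


def select_long_genes(
    genes: Dict[str, Tuple[int, int]],
    threshold: int = LONG_GENE_THRESHOLD_BP,
) -> List[str]:
    """Return IDs of genes whose genomic span exceeds the threshold.

    genes maps gene ID to (start, end), 1-based and inclusive.
    """
    return sorted(
        gene_id
        for gene_id, (start, end) in genes.items()
        if end - start + 1 > threshold
    )


def sample_for_validation(
    long_genes: List[str],
    n: int = N_VALIDATION_GENES,
    seed: int = 1,
) -> List[str]:
    """Randomly pick up to n long genes without replacement."""
    rng = random.Random(seed)
    return rng.sample(long_genes, min(n, len(long_genes)))


# Hypothetical gene coordinates, for illustration only.
toy_genes = {
    "gene_a": (1_000, 26_500),   # 25,501 bp: long
    "gene_b": (40_000, 43_000),  # 3,001 bp: short
    "gene_c": (50_000, 70_000),  # 20,001 bp: just over the cutoff
}
long_genes = select_long_genes(toy_genes)
print(long_genes)
print(sample_for_validation(long_genes))
```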

Gene expression profiling

The raw paired-end RNA-seq reads were filtered into clean data by FASTP (Chen et al., 2018). The clean reads were aligned to our generated C. salicifolius genome reference by Hisat2 (Kim et al., 2015), and StringTie (Pertea et al., 2015) was adopted for quantification of expression. The differential expression analysis was performed with Cuffdiff in the Cufflinks package (Trapnell et al., 2010). The gene coexpression pattern was visualized using the R package ‘corrplot’.

The MapMan software was used to investigate the transcriptomic profiles of different developmental stages of flowers and leaves. A functional annotation database was constructed with Mercator (Lohse et al., 2014). The list of significantly differentially expressed genes was loaded into MapMan to analyze the significantly up- and downregulated pathways. GO enrichment analysis was performed using agriGO (Tian et al., 2017), with the GO terms identified with InterProScan as the species background. The ‘Plant GO slim’ option was selected, and a false discovery rate (FDR) threshold of 0.05 was used to identify enriched GO terms.

Identification of genes involved in the flavonoid pathway

To identify the candidate genes involved in the flavonoid pathway in the C. salicifolius genome, we collected the genes in A. thaliana that were documented in the flavonoid pathway (Saito et al., 2013). The protein sequences of genes for four species (A. trichopoda, O. sativa, A. thaliana, C. salicifolius) were combined into a database. Using each gene of A. thaliana in the flavonoid pathway as a query sequence, BLASTP was applied to scan homologous genes (E-value threshold: 1E-10). Phylogenetic trees were constructed for the homologous genes of the four species by RAxML v8 (Stamatakis, 2014), and further used for identification of candidate orthologous genes.
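The homolog screening step can be illustrated with a short Python sketch that filters BLASTP hits by E-value. It assumes that the search results were written in the standard 12-column tabular format (BLAST+ output format 6), in which the E-value is the 11th column; the text does not state which output format was used, and the sequence identifiers below are hypothetical.

```python
from typing import Dict, Iterable, Set

EVALUE_THRESHOLD = 1e-10  # threshold stated in the text


def collect_homologs(
    blast_lines: Iterable[str],
    max_evalue: float = EVALUE_THRESHOLD,
) -> Dict[str, Set[str]]:
    """Map each query ID to the set of subject IDs with E-value <= max_evalue.

    Expects tab-separated lines with the columns: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    homologs: Dict[str, Set[str]] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            homologs.setdefault(query, set()).add(subject)
    return homologs


# Hypothetical hits, for illustration only.
toy_hits = [
    "\t".join(["AT_query1", "Csa_g1", "78.2", "390", "80", "2", "1", "389", "3", "392", "1e-150", "520"]),
    "\t".join(["AT_query1", "Csa_g2", "35.0", "120", "70", "5", "10", "129", "40", "160", "2e-05", "48.1"]),
    "\t".join(["AT_query2", "Osa_g7", "60.1", "300", "110", "3", "5", "304", "8", "310", "3e-90", "290"]),
]
print(collect_homologs(toy_hits))
```

The retained subject sequences per query would then be the input for multiple alignment and tree building, which this sketch does not cover.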

Evaluation of flavonoid content in different tissues

The tissues used in the flavonoid evaluation were the same as the samples used for RNA-seq. These samples were collected and dried at 60°C. The dried samples were ground into powder, and were filtered by passing through 80–100 mesh. HPLC analysis was carried out on an Agilent 1260 instrument following the method described previously (Yang et al., 2018). Contents of six flavonoids, including kaempferol, kaempferol-3-O-glucoside, kaempferol-3-O-rutinoside, quercetin, isoquercitrin and rutin, were analyzed with commercial reference standards. The Pearson correlation coefficient was calculated between the expression values of each gene in the identified flavonoid pathway and the measured flavonoid contents.
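The correlation between gene expression and flavonoid content can be sketched in plain Python as follows. This is a minimal illustration of the Pearson coefficient, assuming one expression value and one content value per tissue, paired in the same order; the numbers are made up and the software actually used for this step is not stated in the text.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up values across five tissues, for illustration only.
expression = [2.1, 8.4, 15.0, 30.2, 4.7]  # expression of one pathway gene
rutin_content = [0.3, 1.1, 2.0, 3.9, 0.6]  # content of one flavonoid
print(round(pearson_r(expression, rutin_content), 3))
```

Repeating the calculation for every pathway gene against each of the six flavonoids yields a gene-by-metabolite correlation table of the kind listed in Table S10.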

ACKNOWLEDGEMENTS

This work is financially supported by the Zhejiang Major Science & Technology Project of New Agricultural Varieties (2016C02058), the Zhejiang Province Major Science & Technology Project (2012C12014-1), the National Natural Science Foundation of China (31671282), Shanghai Science and Technology Committee Rising-Star Program (19QA1406500), and Shanghai Engineering Research Center of Plant Germplasm Resources (17DZ2252700). The authors thank Nextomics Biosciences Co., Ltd (Wuhan) for the help in genome assembly, Dr Qiang Zhao from National Center for Gene Research of Chinese Academy of Sciences for assistance in genome annotation, Dr Yunpeng Zhao (Zhejiang University), and Dr Jun Yang (Chinese Academy of Sciences) for the discussion and providing valuable suggestions to the manuscript. The authors gratefully acknowledge the support of the IBM high-performance computing cluster of Analysis Center of Agrobiology and Environmental Sciences, Zhejiang University.

AUTHOR CONTRIBUTIONS

KC, XH and XW conceived and coordinated the project. QL, JL, QW, JF, JP, ZC and WC prepared the materials and conducted the experiments. JL and JQ performed the assembly and annotation of the genome. JQ, XW, ZL, WZ and JL carried out the phylogenetic, comparative genomics and transcriptome analysis. XW, JQ, ZL, QL, MB and KC wrote the manuscript.

CONFLICT OF INTEREST

The authors declare no competing financial interests.

DATA AVAILABILITY STATEMENT

The assembled C. salicifolius genome and its related data have been deposited under NCBI BioProject accession PRJNA602413. The genome assembly has been assigned the accession number JAAGOE000000000. The SRA accession numbers for the raw sequencing data (PacBio, Illumina, 10× Genomics, and Hi-C) are SRR11127589-SRR11127597 and SRR11191851-SRR11191853. The transcriptomic data generated in this study are under accession numbers SRR11109013-SRR11109042. The C. salicifolius genome assembly and the annotated genes are accessible at http://xhhuanglab.cn/data/Chimonanthus_salicifolius.html.

SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article.

Figure S1. Genome survey of C. salicifolius based on K-mer analysis using Illumina sequencing data.

Figure S2. Genome size estimation based on flow cytometry using O. sativa as an internal reference.

Figure S3. Hi-C contact map of the 11 constructed pseudochromosomes.

Figure S4. Percentage of long genes and all genes that are present in genomic regions of different repetitive levels.

Figure S5. Heterozygous SNP distribution in the repeat sequence regions.

Figure S6. Coalescent-based phylogenetic tree constructed by 1420 orthologous genes retrieved from 15 plants.

Figure S7. Distribution of Ks among paralogs in four magnoliid plants.

Figure S8. Genomic syntenic depth ratio between magnoliids against A. trichopoda and V. vinifera.

Figure S9. Long genes validated by PCR amplification.

Figure S10. Tissues used for RNA-seq.

Figure S11. PCA based on the expression profile of all genes for different tissues of C. salicifolius.

Figure S12. Transcriptomic profiles for different tissues of C. salicifolius.

Figure S13. GO and MapMan terms for the significantly differentially expressed genes between bud and blooming stages.

Figure S14. Expansion of two gene families related to cold tolerance.

Figure S15. Transcriptomic profile for metabolism-related genes visualized by MapMan.

Figure S16. Classification of UDP-glucosyltransferase multigene family in the C. salicifolius genome.

Figure S17. Distribution of flavonoid pathway genes in the C. salicifolius genome.

Table S1. Assessment of the completeness of the genome assembly by BUSCO analysis.

Table S2. Summary statistics of repeat sequences in the C. salicifolius genome.

Table S3. Comparison of number of genes with specific protein domains in C. salicifolius and magnoliids against 11 monocot and eudicot plants.

Table S4. MAPS result for placements of WGDs for magnoliids and their simulated distributions.

Table S5. Long genes that were successfully amplified from cDNA of C. salicifolius tissues.

Table S6. GO enrichment terms of long genes.

Table S7. Summary of RNA-seq data generated in this study.

Table S8. Homologous genes involved in flavone biosynthetic pathways in C. salicifolius.

Table S9. Flavonoid content of different tissues in C. salicifolius.

Table S10. Correlation of flavonoid content and the expression of flavonoid biosynthetic genes.

Table S11. Primers used in PCR amplification for validation of long genes.

REFERENCES

Adey, A., Kitzman, J.O., Burton, J.N. et al. (2014) In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 24, 2041–2049.

Amborella Genome Project (2013) The Amborella genome and the evolution of flowering plants. Science, 342, 1241089.

Ammiraju, J.S.S., Luo, M.Z., Goicoechea, J.L. et al. (2006) The Oryza bacterial artificial chromosome library resource: construction and analysis of 12 deep-coverage large-insert BAC libraries that represent the 10 genome types of the genus Oryza. Genome Res. 16, 140–147.

Barker, M.S., Kane, N.C., Matvienko, M., Kozik, A., Michelmore, R.W., Knapp, S.J. and Rieseberg, L.H. (2008) Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol. Biol. Evol. 25, 2445–2455.


Barker, M.S., Dlugosch, K.M., Dinh, L., Challa, R.S., Kane, N.C., King, M.G. and Rieseberg, L.H. (2010) EvoPipes.net: bioinformatic tools for ecological and evolutionary genomics. Evol. Bioinform. Online 6, 143–149.

Benaglia, T., Chauveau, D., Hunter, D.R. and Young, D.S. (2009) mixtools: an R package for analyzing finite mixture models. J. Stat. Softw. 32, 1–29.

Birney, E., Clamp, M. and Durbin, R. (2004) GeneWise and genomewise. Genome Res. 14, 988–995.

Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O. and Shendure, J. (2013) Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125.

Chabot, B. and Shkreta, L. (2016) Defective control of pre-messenger RNA splicing in human disease. J. Cell Biol. 212, 13–27.

Chaw, S.M., Liu, Y.C., Wu, Y.W. et al. (2019) Stout camphor tree genome fills gaps in understanding of flowering plant genome evolution. Nat. Plants, 5, 63–73.

Chen, J.H., Hao, Z.D., Guang, X.M. et al. (2019) Liriodendron genome sheds light on angiosperm phylogeny and species-pair differentiation. Nat. Plants, 5, 18–25.

Chen, N. (2004) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics, Chapter 4, Unit 4.10. https://doi.org/10.1002/0471250953.bi0410s25

Chen, S., Zhou, Y., Chen, Y. and Gu, J. (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34, i884–i890.

Chin, C.S., Alexander, D.H., Marks, P. et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10, 563–569.

Chin, C.S., Peluso, P., Sedlazeck, F.J. et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods, 13, 1050–1054.

Conklin, J.F., Goldman, A. and Lopez, A.J. (2005) Stabilization and analysis of intron lariats in vivo. Methods, 37, 368–375.

Cui, L.Y., Wall, P.K., Leebens-Mack, J.H. et al. (2006) Widespread genome duplications throughout the history of flowering plants. Genome Res. 16, 738–749.

Doležel, J., Greilhuber, J. and Suda, J. (2007) Estimation of nuclear DNA content in plants using flow cytometry. Nat. Protoc. 2, 2233–2244.

Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157.

Endress, P.K. and Doyle, J.A. (2009) Reconstructing the ancestral angiosperm flower and its initial specializations. Am. J. Bot. 96, 22–66.

Goodstein, D.M., Shu, S., Howson, R. et al. (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186.

Haas, B.J., Delcher, A.L., Mount, S.M. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666.

Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R. and Wortman, J.R. (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7.

Hatton, A.R., Subramaniam, V. and Lopez, A.J. (1998) Generation of alternative ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon-exon junctions. Mol. Cell, 2, 787–796.

Hu, L., Xu, Z., Wang, M. et al. (2019) The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis. Nat. Commun. 10, 4702.

Jaillon, O., Aury, J.M., Noel, B. et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 449, 463–467.

Kielbasa, S.M., Wan, R., Sato, K., Horton, P. and Frith, M.C. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493.

Kim, D., Langmead, B. and Salzberg, S.L. (2015) HISAT: a fast spliced aligner with low memory requirements. Nat. Methods, 12, 357–360.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J. and Marra, M.A. (2009) Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645.

Kumar, S., Stecher, G., Suleski, M. and Hedges, S.B. (2017) TimeTree: a resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819.

Lafuente, M.T., Estables-Ortiz, B. and Gonzalez-Candelas, L. (2017) Insights into the molecular events that regulate heat-induced chilling tolerance in Citrus Fruits. Front. Plant Sci. 8, 1113.

Leskovec, J. and Sosic, R. (2016) SNAP: a general purpose network analysis and graph mining library. ACM Trans. Intell. Syst. Technol. 8, 1–20.

Li, D., Jiang, Y.Y., Jin, Z.M., Li, H.Y., Xie, H.J., Wu, B. and Wang, K.W. (2016) Isolation and absolute configurations of diastereomers of 8alpha-hydroxy-T-muurolol and (1alpha,6beta,7beta)-cadinane-4-en-8alpha,10alpha-diol from Chimonanthus salicifolius. Phytochemistry, 122, 294–300.

Li, D., Liu, P., Yu, J., Wang, L., Dossa, K., Zhang, Y., Zhou, R., Wei, X. and Zhang, X. (2017) Genome-wide analysis of WRKY gene family in the sesame genome and identification of the WRKY genes involved in responses to abiotic stresses. BMC Plant Biol. 17, 152.

Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

Li, Z., Baniaga, A.E., Sessa, E.B., Scascitelli, M., Graham, S.W., Rieseberg, L.H. and Barker, M.S. (2015) Early genome duplications in conifers and other seed plants. Sci. Adv. 1, e1501084.

Li, Z., Tiley, G.P., Galuska, S.R., Reardon, C.R., Kidder, T.I., Rundell, R.J. and Barker, M.S. (2018) Multiple large-scale gene and genome duplications during the evolution of hexapods. Proc. Natl. Acad. Sci. USA, 115, 4713–4718.

Li, Z. and Barker, M.S. (2020) Inferring putative ancient whole-genome duplications in the 1000 Plants (1KP) initiative: access to gene family phylogenies and age distributions. GigaScience 9(2). https://doi.org/10.1093/gigascience/giaa004

Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., Zrenner, R., Tohge, T., Fernie, A.R., Stitt, M. and Usadel, B. (2014) Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data. Plant Cell Environ. 37, 1250–1258.

Lomsadze, A., Burns, P.D. and Borodovsky, M. (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 42, e119.

Luo, D.L., Ba, L.J., Shan, W., Kuang, J.F., Lu, W.J. and Chen, J.Y. (2017) Involvement of WRKY transcription factors in abscisic-acid-induced cold tolerance of banana fruit. J. Agric. Food Chem. 65, 3627–3635.

Luo, R., Liu, B., Xie, Y. et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18.

Ma, G.L., Yang, G.X., Xiong, J., Cheng, W.L., Cheng, K.J. and Hu, J.F. (2015) Salicifoxazines A and B, new cytotoxic tetrahydro-1,2-oxazine-containing tryptamine-derived alkaloids from the leaves of Chimonanthus salicifolius. Tetrahedron Lett. 56, 4071–4075.

Ma, S.J., Lv, Q.D., Zhou, H., Fang, J., Cheng, W.L., Jiang, C.X., Cheng, K.J. and Yao, H. (2017) Identification of traditional She medicine Shi-Liang tea species and closely related species using the ITS2 barcode. Appl. Sci. 7, 195.

Massoni, J., Couvreur, T.L.P. and Sauquet, H. (2015) Five major shifts of diversification through the long evolutionary history of Magnoliidae (angiosperms). BMC Evol. Biol. 15, 49.

Moore, M.J., Bell, C.D., Soltis, P.S. and Soltis, D.E. (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc. Natl. Acad. Sci. USA, 104, 19 363–19 368.

Nawrocki, E.P. and Eddy, S.R. (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29, 2933–2935.

One Thousand Plant Transcriptomes Initiative (2019) One thousand plant transcriptomes and the phylogenomics of green plants. Nature, 574, 679–685.

Pai, A.A., Paggi, J.M., Yan, P., Adelman, K. and Burge, C.B. (2018) Numerous recursive sites contribute to accuracy of splicing in long introns in flies. PLoS Genet. 14, e1007588.

Palmer, J.D., Soltis, D.E. and Chase, M.W. (2004) The plant tree of life: an overview and some points of view. Am. J. Bot. 91, 1437–1445.

Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T. and Salzberg, S.L. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295.

Price, M.N., Dehal, P.S. and Arkin, A.P. (2009) FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650.

Qiu, Y.L., Li, L.B., Wang, B., Xue, J.Y., Hendry, T.A., Li, R.Q., Brown, J.W., Liu, Y., Hudson, G.T. and Chen, Z.D. (2010) Angiosperm phylogeny inferred from sequences of four mitochondrial genes. J. Syst. Evol. 48, 391–425.


Rabier, C.E., Ta, T. and Ane, C. (2014) Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol. Biol. Evol. 31, 750–762.

Rendon-Anaya, M., Ibarra-Laclette, E., Mendez-Bravo, A. et al. (2019) The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogen-influenced gene space adaptation. Proc. Natl. Acad. Sci. USA, 116, 17 081–17 089.

Saito, K., Yonekura-Sakakibara, K., Nakabayashi, R., Higashi, Y., Yamazaki, M., Tohge, T. and Fernie, A.R. (2013) The flavonoid biosynthetic pathway in Arabidopsis: structural and genetic diversity. Plant Physiol. Biochem. 72, 21–34.

Shepard, S., McCreary, M. and Fedorov, A. (2009) The peculiarities of large intron splicing in animals. PLoS One, 4, e7853.

Sibley, C.R., Emmett, W., Blazquez, L. et al. (2015) Recursive splicing in long vertebrate genes. Nature, 521, 371–375.

Simao, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V. and Zdobnov, E.M. (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31, 3210–3212.

Sjostrand, J., Arvestad, L., Lagergren, J. and Sennblad, B. (2013) GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics, 14, 209.

Slater, G.S. and Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31.

Soltis, D.E. and Soltis, P.S. (2019) Nuclear genomes of two magnoliids. Nat. Plants, 5, 6–7.

Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.

Stanke, M. and Waack, S. (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl 2), ii215–ii225.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. and Kumar, S. (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28, 2731–2739.

Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M. and Paterson, A.H. (2008) Synteny and collinearity in plant genomes. Science, 320, 486–488.

The Angiosperm Phylogeny Group (1998) An ordinal classification for the families of flowering plants. Ann. Mo. Bot. Gard. 85, 531–553.

The Angiosperm Phylogeny Group (2003) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG II. Bot. J. Linn. Soc. 141, 399–436.

The Angiosperm Phylogeny Group (2009) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Bot. J. Linn. Soc. 161, 105–121.

The Angiosperm Phylogeny Group (2016) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot. J. Linn. Soc. 181, 1–20.

Tian, T., Liu, Y., Yan, H., You, Q., Yi, X., Du, Z., Xu, W. and Su, Z. (2017) agriGO v2.0: a GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. 45, W122–W129.

Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J. and Pachter, L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.

Walker, B.J., Abeel, T., Shea, T. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9, e112963.

Wang, K.W., Li, D., Wu, B. and Cao, X.J. (2016) New cytotoxic dimeric and trimeric coumarins from Chimonanthus salicifolius. Phytochem. Lett. 16, 115–120.

Wang, N., Chen, H., Xiong, L., Liu, X., Li, X., An, Q., Ye, X.M. and Wang, W.J. (2018) Phytochemical profile of ethanolic extracts of Chimonanthus salicifolius S. Y. Hu. leaves and its antimicrobial and antibiotic-mediating activity. Ind. Crop. Prod. 125, 328–334.

Wang, X.X., Zhang, H.J., Li, D. and Wang, K.W. (2019) Coumarin and flavone constituents of Chimonanthus salicifolius with antioxidant activities. Chem. Nat. Compd. 55, 534–537.

Wei, X., Zhu, X., Yu, J., Wang, L., Zhang, Y., Li, D., Zhou, R. and Zhang, X. (2016) Identification of sesame genomic variations from genome comparison of landrace and variety. Front. Plant Sci. 7, 1169.

Yang, N., Zhao, K., Li, X., Zhao, R., Aslam, M.Z., Yu, L. and Chen, L. (2018) Comprehensive analysis of wintersweet flower reveals key structural genes involved in flavonoid biosynthetic pathway. Gene, 676, 279–289.

Yang, Z. (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591.

Zdobnov, E.M. and Apweiler, R. (2001) InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17, 847–848.

Zeng, L.P., Zhang, Q., Sun, R.R., Kong, H.Z., Zhang, N. and Ma, H. (2014) Resolution of deep angiosperm phylogeny using conserved nuclear genes and estimates of early divergence times. Nat. Commun. 5, 4956.

Zhang, C., Rabiee, M., Sayyari, E. and Mirarab, S. (2018) ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19, 153.

Zhang, L., Chen, F., Zhang, X. et al. (2020) The water lily genome and the early evolution of flowering plants. Nature, 577, 79–84.

Zhang, R., Wang, B., Jian, O.Y., Li, J.Y. and Wang, Y.H. (2008) Arabidopsis indole synthase, a homolog of tryptophan synthase alpha, is an enzyme involved in the Trp-independent indole-containing metabolite biosynthesis. J. Integr. Plant Biol. 50, 1070–1077.

© 2020 Society for Experimental Biology and John Wiley & Sons Ltd. This article has been contributed to by US Government employees and their work is in the public domain in the USA. The Plant Journal, (2020), doi: 10.1111/tpj.14874