Molecular taxonomy via DNA barcodes for species identification in selected genera of Fabaceae

Fabaceae is a plant family of considerable ecological and economic importance, for example as a source of food, bio-fertilizer, and medicine. However, several members of this family have been overexploited in Indonesia, and several of its species are now critically endangered. Therefore, it is essential to support conservation efforts to ensure the overall survival of this plant family. We conducted a molecular survey of Fabaceae in converted landscapes of Indonesia through DNA barcoding and aimed to evaluate the effectiveness of the core chloroplast barcoding markers matK and rbcL, and their combination (matK+rbcL), as DNA barcodes for species identification in Fabaceae. We generated DNA barcodes of the matK and rbcL regions from 51 species belonging to 28 genera and 47 species belonging to 31 genera, respectively. The results showed that the highest accuracy level for species identification was at 90%

Conservation efforts are urgently needed to prevent a further decrease in species diversity within the Fabaceae family. The success of such efforts depends on accurate species identification, which is carried out using conventional taxonomic methods and molecular techniques. However, many species are similar in morphological appearance, which makes them difficult to distinguish. According to Elansary et al. (2017), morphological identification is not effective, especially for complex taxonomic groups, such as Argyreia (Convolvulaceae) (Traiperm et al., 2017), Cuscuta (Convolvulaceae) (Park et al., 2019), Pulsatilla (Ranunculaceae) (Li et al., 2019), and Vicia (Fabaceae) (Han et al., 2021). Moreover, morphological characters are influenced by the environment, and some reproductive traits are only seasonally available, which makes morphological identification less accurate in the absence of reproductive structures (Hikmah et al., 2016). Therefore, the potential of molecular techniques needs to be explored for the proper identification of specimens belonging to Fabaceae.
DNA barcoding is a molecular technique that identifies species based on DNA sequence similarity, in combination with morphological characters, which minimizes errors from conventional identification (Liu et al., 2017). Its basic principle is identification using a short DNA sequence "barcode" from a standardized part of the genome of the specimen being studied (Hebert et al., 2003). The unknown barcode sequence is compared with known reference barcode sequences, and the specimen is identified when the query sequence matches a reference sequence with a high percentage of identity and similarity (Lis et al., 2016). In addition, DNA barcoding may reveal morphological misidentifications or even allow the identification of cryptic species (Hajibabaei et al., 2007).
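The matching principle described above can be illustrated with a minimal Python sketch. It assumes that the query and the reference barcodes are already aligned to the same length; the function names and the toy sequences are hypothetical and are not taken from any database or tool used in this study.

```python
def percent_identity(query: str, reference: str) -> float:
    """Percentage of identical bases between two aligned sequences.

    Alignment columns in which either sequence has a gap ("-") are skipped.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for q, r in zip(query.upper(), reference.upper()):
        if q == "-" or r == "-":
            continue
        compared += 1
        if q == r:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def best_match(query: str, references: dict[str, str]) -> tuple[str, float]:
    """Return the reference name with the highest identity to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in references.items())
    return max(scored, key=lambda item: item[1])


# Toy reference library (hypothetical sequences, for illustration only).
toy_references = {
    "Species A": "ATGCATGCAT",
    "Species B": "ATGCTTGCAA",
}
print(best_match("ATGCATGCAT", toy_references))
```

In practice, database searches such as BLAST use local alignment and additional scoring statistics, so this sketch only conveys the idea of ranking references by identity.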
The Consortium for the Barcode of Life (CBOL, 2009) stated that plant identification generally uses the chloroplast DNA regions maturase K (matK) and ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL), as well as the combination matK+rbcL (Hollingsworth et al., 2011). Amandita et al. (2019) reported that the use of two plastid markers, matK and rbcL, is efficient in identifying flowering plants from the lowland rainforest of Sumatra to the genus level. Meanwhile, Gao et al. (2011) reported that the matK marker correctly identified approximately 80% and 96% of Fabaceae specimens at the species and genus level, respectively. Saadullah et al. (2016) stated that the combination of matK+rbcL markers was the best method for identifying 62 specimens of the Fabaceae family originating from Pakistan.
In addition to species identification, DNA barcoding is also useful for determining the genetic relatedness of species by constructing a phylogenetic tree, a representation of evolutionary relationships in a group of organisms with a common ancestor (Ochieng et al., 2007; Patwardhan et al., 2014). Hartvig et al. (2015) stated that the maximum parsimony and neighbor-joining methods were the best approaches for the genus Dalbergia. According to Saadullah et al. (2016), neighbor-joining is an appropriate approach to identify specimens at the Fabaceae family level. This study aimed to investigate the ability of the DNA barcodes matK and rbcL to identify Fabaceae plant species, as well as to evaluate their accuracy in reconstructing the phylogenetic relationships among the sampled species.

DNA Barcode Sequences
A total of 43 matK sequences and 106 rbcL sequences were derived from the CRC990-EFForTS project in cooperation with IPB University (Bogor, Indonesia), Jambi University (Jambi, Indonesia), Tadulako University (Palu, Indonesia), and the University of Göttingen (Göttingen, Germany), as summarized in Table 1 (Amandita et al., 2019). Furthermore, 156 sequences of matK and 112 sequences of rbcL were obtained from the Barcode of Life Data System (BOLD; Ratnasingham and Hebert, 2007) database to increase the sample size and enhance species representation. Sequences of matK and rbcL originating from the same sample, as indicated by the sample ID, were concatenated (Vaidya et al., 2011) to form matK+rbcL, resulting in a total of 35 sequences. The overall data consisted of 123 species from 48 different genera of Fabaceae. Two species of Malvaceae, namely Ceiba speciosa and Adansonia digitata, were added to each of the matK, rbcL, and matK+rbcL datasets as an outgroup. In addition, two species from the Polygalaceae family, namely Monnina aestuans and Polygala chamaebuxus, were added as a sister group (Doyle et al., 2000).
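The concatenation step can be sketched as follows. This is a minimal illustration, not the tool cited above: it assumes that each marker is stored as a dictionary mapping a sample ID to its aligned sequence, and the sample IDs and sequences shown are hypothetical.

```python
def concatenate_by_sample(matk: dict[str, str], rbcl: dict[str, str]) -> dict[str, str]:
    """Join matK and rbcL sequences that share the same sample ID.

    Samples present in only one of the two marker datasets are left out,
    which is why the combined dataset is smaller than either single-marker one.
    """
    shared_ids = sorted(matk.keys() & rbcl.keys())
    return {sample_id: matk[sample_id] + rbcl[sample_id] for sample_id in shared_ids}


# Hypothetical sample IDs and short toy sequences.
matk_seqs = {"S001": "ATGGCA", "S002": "ATGGCT", "S003": "ATGACA"}
rbcl_seqs = {"S002": "TTGCAA", "S003": "TTGCAG", "S004": "TTGCAC"}

combined = concatenate_by_sample(matk_seqs, rbcl_seqs)
for sample_id, sequence in combined.items():
    print(sample_id, sequence)
```

Only samples S002 and S003 appear in both toy datasets, so only they contribute a combined matK+rbcL sequence.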

Editing and Alignment
Each sample's forward and reverse sequences were aligned using CodonCode Aligner software (http://www.codoncode.com/) and combined into a consensus sequence (contig). Multiple alignments were performed using MEGA7 software (Tamura et al., 2016) to determine the similarity level and align the bases among the contigs. Gaps (the sign "-") were added where necessary to align the bases and were interpreted as deletions (missing nucleotide bases in the DNA sequence) (Christinawati et al., 2010). When differences between paired sequences from the same specimen were found, the affected bases were corrected by checking the chromatogram of the respective sequence in CodonCode Aligner and comparing it to reference sequences of similar species from BOLD.

Data Analysis
The multiple alignment results were used for further analysis, namely identification suitability analysis, barcoding gap analysis, and phylogenetic analysis. The identification suitability analysis was carried out using only the sequences obtained from the CRC990-EFForTS project, in order to compare the morphological identification by the affiliated taxonomist with the molecular identification obtained with the Basic Local Alignment Search Tool (BLAST) against the National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/; Porter and Hajibabaei, 2018). The top BLAST result was taken as the best match for specimen identification when the similarity percentage was at least 80%. The identification suitability percentage was calculated at the species, genus, and family level.
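The calculation of the identification suitability percentage can be sketched as below. The record layout, the function names, and the example records are hypothetical. The sketch also assumes that each sample is counted once, at the finest taxonomic rank at which the top BLAST hit agrees with the morphological identification; the text does not state how the ranks were tallied, so this is only one plausible reading.

```python
from collections import Counter

MIN_SIMILARITY = 80.0  # threshold for accepting the top BLAST hit, as stated above
RANKS = ("species", "genus", "family")


def finest_matching_rank(morphological: dict, top_hit: dict) -> str:
    """Finest rank at which the top hit agrees with the morphological ID."""
    if top_hit["similarity"] < MIN_SIMILARITY:
        return "none"
    for rank in RANKS:
        if morphological[rank] == top_hit[rank]:
            return rank
    return "none"


def suitability_percentages(samples: list[tuple[dict, dict]]) -> dict[str, float]:
    """Percentage of samples matching at each rank (or not matching at all)."""
    counts = Counter(finest_matching_rank(morph, hit) for morph, hit in samples)
    total = len(samples)
    if total == 0:
        return {rank: 0.0 for rank in (*RANKS, "none")}
    return {rank: 100.0 * counts[rank] / total for rank in (*RANKS, "none")}


def taxon(species: str, genus: str, family: str, similarity: float = 100.0) -> dict:
    return {"species": species, "genus": genus, "family": family, "similarity": similarity}


# Made-up records: (morphological identification, top BLAST hit).
example_samples = [
    (taxon("Genus1 alpha", "Genus1", "Fabaceae"), taxon("Genus1 alpha", "Genus1", "Fabaceae", 99.5)),
    (taxon("Genus1 beta", "Genus1", "Fabaceae"), taxon("Genus1 alpha", "Genus1", "Fabaceae", 98.0)),
    (taxon("Genus2 gamma", "Genus2", "Fabaceae"), taxon("Genus3 delta", "Genus3", "Fabaceae", 95.0)),
    (taxon("Genus2 gamma", "Genus2", "Fabaceae"), taxon("Genus4 eta", "Genus4", "Malvaceae", 97.0)),
]
print(suitability_percentages(example_samples))
```

A sample whose top hit belongs to a different family, as in the last toy record, falls into the "none" category, which corresponds to a complete mismatch between the two identifications.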
A barcoding gap analysis was carried out after obtaining the intraspecific and interspecific genetic distances using MEGA7 software with the Kimura 3 Parameter model (Tamura et al., 2016) and ExcaliBAR (Alibadian et al., 2014). Barcoding gaps for each marker were visualized by generating distribution bar charts of intraspecific and interspecific distances using Microsoft Excel. ANOVA and t-tests were also carried out using SPSS software (Brady et al., 2015) to determine whether the intraspecific and interspecific distances differed significantly.
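The logic of the barcoding gap analysis can be sketched in a few lines of Python. Note that this sketch swaps in the simple uncorrected p-distance (proportion of differing sites) instead of the Kimura model computed in MEGA7, and that the species labels and sequences are hypothetical; it illustrates only how pairwise distances are split into intraspecific and interspecific sets and how a gap is detected.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def split_distances(records: list[tuple[str, str]]) -> tuple[list[float], list[float]]:
    """Split all pairwise distances into intraspecific and interspecific lists.

    Each record is a (species label, aligned sequence) pair.
    """
    intra, inter = [], []
    for (species_a, seq_a), (species_b, seq_b) in combinations(records, 2):
        target = intra if species_a == species_b else inter
        target.append(p_distance(seq_a, seq_b))
    return intra, inter


def has_barcoding_gap(intra: list[float], inter: list[float]) -> bool:
    """A gap exists when every interspecific distance exceeds every intraspecific one."""
    return bool(intra) and bool(inter) and max(intra) < min(inter)


# Toy alignment with two hypothetical species.
toy_records = [
    ("sp1", "AAAAAAAAAA"),
    ("sp1", "AAAAAAAAAT"),
    ("sp2", "AAAAATTTTT"),
    ("sp2", "AAAAATTTTT"),
]
intra_d, inter_d = split_distances(toy_records)
print(max(intra_d), min(inter_d), has_barcoding_gap(intra_d, inter_d))
```

When the two distributions overlap, that is, when the largest intraspecific distance is not smaller than the smallest interspecific distance, the function reports that no gap is present.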
The last analysis was conducted to evaluate the resolution of each phylogenetic tree reconstructed with the Maximum Parsimony (MP), Neighbor Joining (NJ), and Maximum Likelihood (ML) algorithms using MEGA7 software with 1,000 bootstrap replicates. The bootstrap values were categorized as high (>85%), moderate (70-85%), weak (50-69%), or very weak (<50%) following Kress et al. (2002). The percentage of monophyletic clades in each tree was calculated at the species and genus level.
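The bootstrap categories can be expressed as a small helper function. This is a sketch under two assumptions: the "high" category is read as values above 85%, and non-integer values between 69% and 70% are assigned to the weak category. The function name is hypothetical.

```python
def bootstrap_category(support_percent: float) -> str:
    """Classify a bootstrap value (in percent) into the four support categories."""
    if not 0 <= support_percent <= 100:
        raise ValueError("bootstrap support must lie between 0 and 100")
    if support_percent > 85:
        return "high"
    if support_percent >= 70:
        return "moderate"
    if support_percent >= 50:
        return "weak"
    return "very weak"


for value in (99, 85, 70, 69, 50, 12):
    print(value, bootstrap_category(value))
```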

Comparison of Morphological and Molecular Identification
The percentage of corresponding molecular and morphological identifications (identification suitability) of the samples obtained from the CRC990-EFForTS project is shown in Table 2 for individual and combined markers. These samples were morphologically identified by comparing their herbarium specimens with the LIPI herbarium collection. The highest percentage was obtained at the species level for all markers, in contrast to Gao et al. (2011), who reported higher identification suitability at the genus level for Fabaceae. Molecular identification of Fabaceae species in this study performed better with matK than with rbcL, and the use of the multilocus matK+rbcL improved the identification performance. Other similar studies (Kolondam et al., 2012; Amandita et al., 2019; Alasmari, 2020) also reported the superiority of matK over rbcL for plant identification. Meanwhile, 3.96% of the molecular identifications did not match the morphological identification at all and were thus classified as mislabeling, meaning that the samples were probably mislabeled during field collection or laboratory analysis.

Barcoding Gap Analysis
A barcoding gap analysis was performed to evaluate whether the investigated markers were sufficiently variable to discriminate between different species. Table 3 shows that the average interspecific genetic distances of matK and rbcL are 0.134 and 0.047, respectively, which are significantly higher than the intraspecific genetic distances (0.003 and 0.001). These figures are in accordance with Saadullah et al. (2016), who reported the discriminatory power of matK and rbcL on 22 species of Fabaceae, as well as with results for other families, such as Myristicaceae (Newmaster et al., 2008) and Rosaceae (Pang et al., 2010). Moreover, the low resolution of rbcL compared to matK might be due to the low mutation rate of this gene, as reported by Frascaria-Lacoste et al. (1993) and Stenøien (2008). The one-way ANOVA shown in Table 4 indicates that the interspecific genetic distances were significantly different among the three markers tested, but this was not the case for the intraspecific genetic distances, except for the comparison between matK and rbcL. Furthermore, the intra- and interspecific genetic distances of matK+rbcL were intermediate, as the distances contributed by matK and rbcL offset each other. Despite the significant differences between the intra- and interspecific genetic distances of the investigated markers, Figure 1 shows that none of the markers used in this study revealed a clear barcoding gap. The absence of a barcoding gap, caused by the overlap of intra- and interspecific genetic distances, might indicate that the marker is not a suitable DNA barcode for the taxa in question. However, other factors such as sample size and taxonomic representation also influence the distribution of the intra- and interspecific variation within the dataset (Meyer and Paulay, 2005).

Phylogenetic Tree Reconstruction
Phylogenetic trees are important tools for acquiring information on biodiversity and genetic classification, and for studying evolutionary relationships. In this study, nine phylogenetic trees were reconstructed based on the aligned sequences of matK, rbcL, and matK+rbcL using the Neighbor Joining, Maximum Parsimony, and Maximum Likelihood algorithms. Figures 2-4 show the phylogenetic trees constructed using the Neighbor Joining approach, which was the best algorithm for providing highly resolved phylogenetic relationships in the Fabaceae family, whereas the phylogenetic trees reconstructed using Maximum Parsimony and Maximum Likelihood are presented in the Supplementary Material (Figures S1-S6). A "good" phylogenetic tree in biosystematics needs to be monophyletic, dichotomous, and consistent, with high bootstrap values, no polytomies, and well-resolved clades. A monophyletic group originates from a single ancestor; therefore, its members have similar traits, genetic patterns, and biochemistry (Rahayu and Jannah, 2019). The topologies of the phylogenetic trees reconstructed based on matK and rbcL in this study were generally congruent, but there were some differences in the clade positions and bootstrap values. The resolution of the trees was evaluated based on the percentage of monophyletic clades at the species and genus level, as shown in Table 5.
Monophyletic clades with bootstrap values less than 0.7 were excluded from the estimation because they were considered unreliable (Hillis and Bull, 1993). Both matK and rbcL showed high species-level resolution (92-95%), meaning that most of the species included in the dataset were resolved as monophyletic clades with bootstrap values of at least 0.7. The percentage of monophyletic clades in the matK+rbcL phylogenetic trees was not calculated, as this dataset is relatively limited compared to those of matK and rbcL. However, the phylogenetic visualization of this combined marker confirmed the results based on the single markers. As an overview of the effectiveness of matK and rbcL as plant barcodes, this study showed that these two plastid markers worked well in identifying plant species of Fabaceae, at least for the selected genera included, which is particularly important for expanding the knowledge of Indonesian floral composition.
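The percentage of supported monophyletic clades can be computed along the following lines. This is a minimal sketch with a hypothetical tree representation (a `Node` class with a species label on the tips and a bootstrap proportion on internal nodes). It further assumes that species represented by a single tip are not counted, since their monophyly cannot be tested; the text does not specify how such species were handled.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    species: Optional[str] = None   # set on tips only
    support: float = 0.0            # bootstrap proportion of an internal node
    children: list["Node"] = field(default_factory=list)


def tip_species(node: Node) -> list[str]:
    """Species labels of all tips below a node."""
    if not node.children:
        return [node.species]
    return [label for child in node.children for label in tip_species(child)]


def all_nodes(node: Node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from all_nodes(child)


def percent_monophyletic(root: Node, min_support: float = 0.7) -> float:
    """Percentage of multi-tip species forming a clade with sufficient support."""
    counts = Counter(tip_species(root))
    testable = {species for species, n in counts.items() if n > 1}
    supported = set()
    for node in all_nodes(root):
        if not node.children or node.support < min_support:
            continue
        members = tip_species(node)
        # The clade must contain all tips of one species and nothing else.
        if len(set(members)) == 1 and len(members) == counts[members[0]]:
            supported.add(members[0])
    return 100.0 * len(supported & testable) / len(testable) if testable else 0.0


def tip(name: str) -> Node:
    return Node(species=name)


# Toy tree: A and E are supported clades, B has low support, C is not monophyletic.
toy_tree = Node(children=[
    Node(support=0.95, children=[tip("A"), tip("A")]),
    Node(support=0.60, children=[tip("B"), tip("B")]),
    Node(support=0.90, children=[tip("C"), Node(support=0.80, children=[tip("D"), tip("C")])]),
    Node(support=0.99, children=[tip("E"), tip("E")]),
])
print(percent_monophyletic(toy_tree))
```

The same procedure applies at the genus level if the tips are labeled with genus names instead of species names.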

CONCLUSION
Molecular identification with DNA barcodes can be effectively applied to Fabaceae species, with matK and matK+rbcL achieving higher accuracy than rbcL. Neighbor Joining is recommended as the phylogenetic approach for the Fabaceae family, as it was more informative in phylogenetic tree reconstruction. Future studies should include supplementary markers, such as psbA-trnH or ITS/ITS2, in combination with matK and rbcL.

Table 1
DNA sequences of matK, rbcL, and matK+rbcL used in the study

Table 2
Identification suitability percentage of each marker used for CRC990-EFForTS project samples

Table 3
Average values of intraspecific and interspecific distances of each marker (***: significant)

Table 4
One-way ANOVA results for each marker

Table 5
Percentage of monophyletic clades in the phylogenetic trees