U.Taxonstand: An R package for standardizing scientific names of plants and animals
Jian Zhang a,**, Hong Qian b,*
a. Center for Global Change and Complex Ecosystems, Zhejiang Tiantong Forest Ecosystem National Observation and Research Station, School of Ecological and Environmental Sciences, East China Normal University, 200241, Shanghai, China;
b. Research and Collections Center, Illinois State Museum, 1011 East Ash Street, Springfield, IL, 62703, USA
Abstract: The scientific names of organisms are key identifiers of plants and animals. Correctly treating scientific names is a prerequisite for biodiversity research and documentation. Here, we present an R package, 'U.Taxonstand', which can standardize and harmonize scientific names in plant and animal species lists quickly and with a high rate of matching success. Unlike most similar R packages, each of which works with only one taxonomic database, U.Taxonstand can work with any taxonomic database, as long as it is properly formatted. Multiple databases for plants and animals that can be used directly by U.Taxonstand, covering bryophytes, vascular plants, amphibians, birds, fishes, mammals, and reptiles, are available online. U.Taxonstand can be a very useful tool for botanists, zoologists, ecologists and biogeographers to standardize and harmonize scientific names of organisms.
Keywords: Biodiversity informatics; Scientific names; Species name matching; Taxonomic harmonization; Taxonomic tool; U.Taxonstand
1. Introduction

The scientific names of organisms are key identifiers of plants and animals. Many studies in biology, ecology, and biogeography need to standardize and harmonize scientific names of organisms from different sources and then use the standardized names (Cayuela et al., 2012; Freiberg et al., 2020; Kindt, 2020; Jin and Qian, 2022). Because the same species may be documented under different scientific names in different sources [e.g. Eupatorium adenophorum Spreng. in Li and Xie (2002) versus Ageratina adenophora (Spreng.) R.M. King & H. Rob. in Weber (2017); Eupatorium odoratum L. in Li and Xie (2002) versus Chromolaena odorata (L.) R.M. King & H. Rob. in Weber (2017)], a species list integrated from different sources without resolving synonyms would necessarily over-count the number of species and fail to integrate biological and ecological information for the same species correctly. Correctly treating taxon names is a prerequisite for biodiversity research (Grenié et al., 2023).

It is fortunate that botanists and zoologists have compiled global databases for major groups of plants and animals that include the relationships between synonyms and their accepted names (Grenié et al., 2023). For example, global plant databases with such information include The Plant List (TPL; www.theplantlist.org), World Flora Online (WFO; www.worldfloraonline.org), Plants of the World Online (POWO; www.plantsoftheworldonline.org), The Leipzig Catalogue of Vascular Plants (LCVP; Freiberg et al., 2020), and World Plants (WP; www.worldplants.de). The synonyms that can be directly matched with names in a chosen database may be resolved.

However, a substantial proportion of the scientific names in a species list commonly cannot be directly matched with any name in a selected taxonomic database because of spelling errors or variants in the scientific names, their authorities, or both. For example, Euonymus alata (Thunb.) Siebold, Euonymus alatus (Thunb.) Siebold, Evonymus alata (Thunb.) Siebold, and Evonymus alatus (Thunb.) Siebold appear as accepted names in different literature sources, but they represent a single species. When these four names are compared with a particular plant taxonomic database, some of them may not be found in the database. As a result, it is a great challenge to standardize taxonomic nomenclature for scientific names with spelling errors or variants in the names or their authorities, particularly for a species list with a large number of misspelled or variant names. Some computer-based name-matching packages address this issue (e.g. Taxonstand, Cayuela et al., 2012; WorldFlora, Kindt, 2020; lcvplants, Freiberg et al., 2020), but their limitations are substantial because of either a low rate of matching success or a low speed of matching (see below for examples). In most cases, a particular package can work with only one database (e.g. Taxonstand works only with the TPL database; lcvplants works only with the LCVP database).
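The gap between exact and approximate matching for the four Euonymus spellings above can be illustrated with a minimal base R sketch. It uses the generalized Levenshtein distance from `adist()`, not the matching algorithm of any package discussed here; the single database spelling and the `threshold` value are assumptions chosen only for illustration.

```r
# Four spelling variants of one species, as cited in the text
variants <- c("Euonymus alata", "Euonymus alatus",
              "Evonymus alata", "Evonymus alatus")

# Assumption for illustration: the database holds only this spelling
db_name <- "Euonymus alatus"

# Exact matching succeeds only for the identical string
exact <- variants == db_name

# Edit distance of every variant to the database spelling (base R, utils::adist)
d <- as.vector(adist(variants, db_name))
names(d) <- variants

# Hypothetical tolerance: accept a variant within two single-character edits
threshold <- 2
matched <- d <= threshold

print(exact)    # TRUE only for "Euonymus alatus"
print(d)        # distances: 2, 0, 3, 1
print(matched)
```

With exact matching only one of the four names is found, whereas a distance tolerance recovers further variants at the cost of a higher risk of false matches, which is why the choice of tolerance matters.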

Here, we present an R package that can standardize and harmonize scientific names of both plants and animals in different taxonomic databases at a high rate of matching success and at a fast speed. The package is called 'U.Taxonstand' ('U' stands for 'universal'; 'Taxonstand' stands for 'taxonomic standardization'). We test U.Taxonstand on three species lists (with 19,580, 14,505, and 9,775 names) against three different global plant databases. In each test, none of the scientific names in a tested species list can be directly matched with any names in the plant database under consideration. The results of our tests show that the rate of matching success is over 98% and the time of matching execution is less than 2 h in each test. They also show that the rate of matching success of U.Taxonstand is higher, and its speed of matching much faster, than those of the other R packages tested in this study (see below for details). Multiple databases for plants and animals that can be directly used by U.Taxonstand are available online (https://github.com/nameMatch/Database). They include global taxonomic databases for bryophytes, vascular plants, amphibians, birds, fishes, mammals, and reptiles. Global databases for other groups of organisms may be placed on the website in the future. In addition to global databases, regional databases that can be used by U.Taxonstand may be available at the website (e.g., China's plant database).

2. Description of the U.Taxonstand package and relevant files

2.1. The U.Taxonstand package

This package includes a main function called nameMatch, which applies a fuzzy matching algorithm and many other matching approaches. Details of all matching approaches used by U.Taxonstand are included in the source code of the package. The U.Taxonstand package was written in RStudio v.2022.02.2 (RStudio Team, 2022) and the R language v.4.2.0 (R Core Team, 2022). Running U.Taxonstand requires a standard installation of R and the R packages "plyr" and "magrittr". The U.Taxonstand package has been tested by different users on different computers with more than 10 different versions of R, ranging from v.3.5.0 to v.4.2.0, under both Windows and Mac systems, and all the tests were completed successfully. If the user of U.Taxonstand encounters a problem when installing the package, the problem may be resolved by installing any of the above-mentioned versions or a newer version of R. An R script running the sample species list provided with this article is available in Box 1.

Box 1 R script for using U.Taxonstand to standardize scientific names in a sample plant species list ("Sample_species_list.xlsx"), which is included in Supplementary data with the present article, and the TPL database, which is available at https://github.com/nameMatch/Database. The R script in this box is also available in the file "R_ script_U.Taxonstand_Test.r", which is included in Supplementary data.
# load the database
library(openxlsx)
dat1 <- read.xlsx("Plants_TPL_database_part1.xlsx")
dat2 <- read.xlsx("Plants_TPL_database_part2.xlsx")
dat3 <- read.xlsx("Plants_TPL_database_part3.xlsx")
database <- rbind(dat1, dat2, dat3)
rm(dat1, dat2, dat3)
# load the additional genus pairs for matching (optional)
genusPairs_example <- read.xlsx("Plants_genus_list.xlsx")
# load the species list to match with
splist <- read.xlsx("Sample_species_list.xlsx")
# load the package
library(U.Taxonstand)
# run the main function of name matching
res <- nameMatch(spList=splist, spSource=database, author=TRUE, max.distance=1, genusPairs=genusPairs_example, Append=FALSE)
# save the result in an xlsx file
write.xlsx(res, "Result_from_U.Taxonstand.xlsx", overwrite=TRUE)
Note: (1) The taxonomic database used in this script was included in three files. If a taxonomic database includes only one file, the lines starting with 'dat2' and 'dat3' should be removed and 'dat2' and 'dat3' in the subsequent line should also be removed. (2) This script includes a list of genera with alternate spellings ("Plants_genus_list.xlsx"), which is included in Supplementary data, but using the list of genera is optional. In the case that no genus list is used, the line starting with "genusPairs_example" should be removed, and "genusPairs = genusPairs_example, " should also be removed. (3) In the case that the species list supplied by the user does not have the authorities of scientific names, the user might change "author = TRUE" to "author = FALSE", which may save a lot of time for name matching. (4) When "Append = FALSE" is changed to "Append = TRUE", all the columns of the taxonomic database will be included in the output file.

U.Taxonstand is an open-source package (published under GPL-2). The R package of U.Taxonstand is available from GitHub (https://github.com/ecoinfor/U.Taxonstand). The package can be installed in R using the install_github function in the 'devtools' package (Wickham et al., 2018), as follows: devtools::install_github("ecoinfor/U.Taxonstand").

2.2. Taxonomic databases and other files

Prior to running U.Taxonstand, the user needs to prepare a database with taxonomic information for the group of organisms of interest. Each database is stored in one or more Excel files (with the .xlsx file extension) with six columns (ID, NAME, AUTHOR, RANK, ACCEPTED_ID, FAMILY), as shown in Table 1 and the file "Sample for a taxonomic database.xlsx" in Supplementary data. Generating such a database is relatively easy for plants and for many major groups of animals, because many groups of organisms have taxonomic information online (e.g., www.theplantlist.org, www.worldfloraonline.org, www.plantsoftheworldonline.org and www.worldplants.de for plants; www.departments.bucknell.edu/biology/resources/msw3 for mammals; www.fishbase.se for fishes, https://ebird.org/science/use-ebird-data/the-ebird-taxonomy for birds, https://amphibiansoftheworld.amnh.org for amphibians, http://www.reptile-database.org for reptiles, and https://www.checklistbank.org for many groups of organisms). We have generated several global plant and animal databases that can be directly used by U.Taxonstand as taxonomic databases, which can be downloaded from the website: https://github.com/nameMatch/Database. For example, Plants_TPL_database and Plants_WFO_database include bryophyte and vascular plant names and taxonomic nomenclature from TPL and WFO, respectively. In addition to the taxonomic databases that have already been uploaded to the website, we may generate more global taxonomic databases for plants and animals that can be used by U.Taxonstand, and will upload them to the website in the future.

Table 1 Description of a taxonomic database.
Column Description
ID Identifier for name as in the original database or created by the person who generated the database.
NAME Scientific name.
AUTHOR Authority of the scientific name. For an animal scientific name, the year of publication may be specified.
RANK Taxonomic rank as a numerical value. 2 = binomial name, e.g. Jurinea karategini; Achillea X claudiopolitana; 3 = trinomial name, e.g. Abalon albiflorum var. obovatum; Ilex X makinoi var. laevis. No rank value is assigned to a name that is neither binomial nor trinomial (e.g. Ruellia caroliniensis ssp. ciliosa var. cinerascens; Lespedeza eriocarpa var. chinensis subvar. polyantha f. leiocarpa). If this column does not exist in the database, U.Taxonstand generates it in R from the information in the column NAME. To determine the rank of a name, U.Taxonstand first removes the hybrid sign before or after the genus name ("X" or "x" or "×"; e.g. Achillea X claudiopolitana; Ilex X makinoi var. laevis; X Elyhordeum X dutillyanum) and then determines the rank from the number of whitespaces in the name (e.g. one for rank = 2, three for rank = 3).
ACCEPTED_ID A value of 0 indicates that the matched name is an unresolved name without any suggested accepted name in the database. If this column of a given name has a non-zero value or code, the name is a synonym and the ID of its accepted name is shown in this column. This column is empty when the name is an accepted name.
FAMILY Family name.
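The rank rule described for the RANK column in Table 1 can be sketched in base R as follows. This is a simplified illustration, not the package's own code: `derive_rank()` is a hypothetical helper, and it drops any standalone hybrid sign token rather than checking its position relative to the genus name.

```r
# derive_rank() is a hypothetical helper, not a function of U.Taxonstand
derive_rank <- function(name) {
  vapply(strsplit(trimws(name), "\\s+"), function(tokens) {
    # remove standalone hybrid signs ("X", "x" or the multiplication sign)
    tokens <- tokens[!tokens %in% c("X", "x", "\u00D7")]
    # whitespaces left in the name = number of remaining words minus one
    n_spaces <- length(tokens) - 1L
    if (n_spaces == 1L) 2L else if (n_spaces == 3L) 3L else NA_integer_
  }, integer(1))
}

# Example names taken from Table 1
example_names <- c("Jurinea karategini",
                   "Achillea X claudiopolitana",
                   "Ilex X makinoi var. laevis",
                   "X Elyhordeum X dutillyanum",
                   "Ruellia caroliniensis ssp. ciliosa var. cinerascens")
print(data.frame(NAME = example_names, RANK = derive_rank(example_names)))
```

Names that are neither binomial nor trinomial receive `NA` here, mirroring the statement that no rank value is assigned to them.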

A typical species list with scientific names to be matched against a taxonomic database includes four columns (i.e. Sorter, Name, Author, Rank), which are described in Table 2. The column Name, which includes the scientific names to be matched, is mandatory, and the other three columns are optional. Providing the authorities of the scientific names in the column Author could increase the rate of matching success.

Table 2 Description for columns in user's species list used by U.Taxonstand.
Column Description
Sorter Sequential numbers (e.g. 1, 2, 3, …) or identifiers for names in user's species list may be placed in this column. If user's species list does not have this column, U.Taxonstand will add sequential numbers for names in user's species list when data are uploaded into R.
Name Scientific name in user's species list.
Author Authority of the name in user's species list. This column is optional, but we recommend providing authorship information whenever possible, because it decreases errors during name matching. Taxon authorship information can disambiguate between accepted names and synonyms for scientific names with the same spelling.
Rank Taxonomic rank in numerical value (see Table 1 for details). If user's species list does not have this column, U.Taxonstand will generate this column based on the information in the Name column of species list supplied by the user.

Some genera have one or more variants in spelling (e.g. Euodia versus Evodia, Eccremis versus Excremis, Ziziphus versus Zizyphus). When a list of such genera is available, U.Taxonstand can use the information in the list as additional data to match names between the user's species list and the taxonomic database. The file with such information should include two columns (Genus01 and Genus02), as shown in the "Plants_genus_list.xlsx" file in Supplementary data and as a data file named "genusPairs_Plants" in U.Taxonstand. The file with the information of genus variants is optional when running U.Taxonstand, but the rate of matching success may increase when such information is provided. In addition, we used the data file "lista" from R package Taxonstand (Cayuela et al., 2012) for fuzzy matching of plant genus names.
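How a two-column list of genus variants could supply alternative spellings for a submitted name is sketched below in base R. The pairs are those mentioned above; `swap_genus()` is a hypothetical helper written for this illustration and does not reproduce how U.Taxonstand uses the list internally. The species "Zizyphus jujuba" is used only as an illustrative input.

```r
# Genus variants from the text, in the two-column layout (Genus01, Genus02)
genus_pairs <- data.frame(
  Genus01 = c("Euodia", "Eccremis", "Ziziphus"),
  Genus02 = c("Evodia", "Excremis", "Zizyphus"),
  stringsAsFactors = FALSE
)

# swap_genus() is hypothetical: it returns the name rewritten with each
# alternative genus spelling found in either column of the pair list
swap_genus <- function(name, pairs) {
  parts <- strsplit(name, " ", fixed = TRUE)[[1]]
  genus <- parts[1]
  alt <- c(pairs$Genus02[pairs$Genus01 == genus],
           pairs$Genus01[pairs$Genus02 == genus])
  if (length(alt) == 0) return(character(0))
  vapply(alt, function(g) paste(c(g, parts[-1]), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

print(swap_genus("Zizyphus jujuba", genus_pairs))  # "Ziziphus jujuba"
print(swap_genus("Euonymus alatus", genus_pairs))  # character(0): no listed variant
```

Looking up both columns makes the pair list symmetric, so it does not matter which spelling the user's list or the database happens to use.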

The output file includes 22 columns, which are described in Table 3. If the taxonomic database in use contains extra information that the user wants in the output file, U.Taxonstand provides an option to include all the columns of the taxonomic database in the output file (see Box 1 for details).

Table 3 Description for columns in an output file from U.Taxonstand.
Column Description
Sorter As in the column Sorter in Table 2.
Submitted_Name Scientific name as in user's species list.
Submitted_Author Authority as in user's species list.
Submitted_Genus Genus as in user's species list.
Submitted_Rank Taxonomic rank in user's species list.
Name_in_database Scientific name as in database.
Author_in_database Authority as in database.
Genus_in_database Genus as in database.
Rank_in_database Taxonomic rank as in database.
ID_in_database Name ID as in database.
Name_set If a name in user's species list matches only one name in database, the match is recorded as 1 in this column. If a name in user's species list matches two or more names in database, the matches are numbered 1, 2, 3, … in this column according to their values in the column Score (from the largest to the smallest value). If a name in user's species list does not match any name in database, 1 is recorded in this column. The number of rows with value 1 is equal to the number of names in user's species list.
Fuzzy 'TRUE' means the matching result is fully or partially based on fuzzy matching algorithm; 'FALSE' means the matching result is not based on fuzzy matching algorithm.
Score Matching score (ranging from 0 to 1 when the author of a scientific name is provided in user's species list, or from 0 to 0.8 when the author is not provided). A larger value represents a better match.
name.dist The distance between Submitted_Name and Name_in_database.
author.dist The distance between Submitted_Author and Author_in_database.
New_name If this column is empty and the column Score has a value > 0, the matched name is treated as an accepted name in database. If the matched name is treated as a synonym in database, the name shown in this column is the accepted name of the synonym based on database. For a name treated as an unresolved name, or as a synonym whose accepted name cannot be identified in database, the text "[Accepted name needs to be determined]" is shown in this column.
New_author Authority of the name in the column New_name based on database.
New_ID ID of the name in the column New_name based on database.
Family Family as in database.
Name_spLev If the input taxon name cannot be matched at the same rank level, the species-level matching name shows here. For example, for the name Agropyron caesium subsp. caesium var. caesium, U.Taxonstand can only match it at species level. In this case, U.Taxonstand puts Agropyron caesium in this column, and the column "Name_in_database" is empty.
Accepted_SPNAME If the name in the column "Submitted_Name" is matched with a name in the database, either as an accepted name or a synonym, the species name of the accepted name of the matched name will be listed here.
NOTE This column contains some notes, e.g. no matching or multiple matching results for a given name, or only matching at species level for a trinomial name (e.g. Euonymus alatus for Euonymus alata var. sarmentosa).
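The New_name rule in Table 3, combined with the ACCEPTED_ID convention in Table 1, can be sketched on a toy database in base R. This is an illustration only: `resolve_new_name()` is a hypothetical helper, the IDs are invented, the unresolved entry "Examplea incerta" is a made-up placeholder name, and treating Ageratina adenophora as the accepted name is an assumption made for the example.

```r
# Toy database in the Table 1 layout (IDs and the last row are invented)
db <- data.frame(
  ID          = c("a1", "a2", "a3", "a4"),
  NAME        = c("Ageratina adenophora", "Eupatorium adenophorum",
                  "Chromolaena odorata", "Examplea incerta"),
  ACCEPTED_ID = c(NA, "a1", NA, "0"),  # empty = accepted, "0" = unresolved
  stringsAsFactors = FALSE
)

# resolve_new_name() is hypothetical, not part of U.Taxonstand
resolve_new_name <- function(db) {
  # name whose ID equals ACCEPTED_ID (NA when there is no such ID)
  accepted <- db$NAME[match(db$ACCEPTED_ID, db$ID)]
  is_accepted <- is.na(db$ACCEPTED_ID) | db$ACCEPTED_ID == ""
  ifelse(is_accepted, "",
         ifelse(is.na(accepted),
                "[Accepted name needs to be determined]",
                accepted))
}

db$New_name <- resolve_new_name(db)
print(db[, c("NAME", "ACCEPTED_ID", "New_name")])
```

Accepted names keep an empty New_name, the synonym is pointed to its accepted name, and the unresolved entry receives the placeholder text.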
3. Testing U.Taxonstand in comparison with other similar R packages

When a database is large (e.g. over 1,000,000 names in each of the databases that we generated for vascular plants), loading and processing the database in U.Taxonstand may take a few minutes, regardless of whether the list of species supplied by the user is short or long. We ran U.Taxonstand on a list of 72,688 names using the WP database, and the task took 12.5 h. To determine the rate of matching success, we ran U.Taxonstand on three additional large species lists and compared the results with those from Taxonstand, WorldFlora and lcvplants. The same computer was used for each pair of tests (i.e. U.Taxonstand versus Taxonstand, U.Taxonstand versus WorldFlora, and U.Taxonstand versus lcvplants). After each test was completed, we classified the names in the result of a tested species list into three classes based on the result of matching: A = correct match, B = incorrect match, C = no match. The results of the tests and the matching class of each name in each test are shown in Supplementary data (see below for details).

3.1. U.Taxonstand versus Taxonstand

We extracted from the World Plants database (www.worldplants.de) a set of scientific names that cannot be directly matched with any names in The Plant List database (version 1.1; www.theplantlist.org). This tested species list included 14,505 names. We first ran U.Taxonstand on this species list using the TPL database, and then ran Taxonstand (v.2.4) on the same species list. The test on U.Taxonstand took ~1 h, and the proportions of names in classes A, B, and C were 99.14%, 0.51%, and 0.35%, respectively, whereas the test on Taxonstand took 5.2 h, and the proportions of names in classes A, B, and C were 85.16%, 4.25%, and 10.60%, respectively (Table 4). Thus, U.Taxonstand was roughly 5 times as fast as Taxonstand, and its rate of matching success was about 14 percentage points higher (99.14% versus 85.16%). The original results from the two tests are presented in "Comparison of results from tests (1).xlsx" in Supplementary data.

Table 4 Comparison of the results of U.Taxonstand and Taxonstand for the sample species list containing 14,505 names. The results were divided into three classes: class A, correct match; class B, incorrect match; class C, no match. Rows give the class assigned by U.Taxonstand and columns the class assigned by Taxonstand. The original results from the two tests were presented in "Comparison of results from tests (1).xlsx" in Supplementary data.
                      Taxonstand (%)
U.Taxonstand (%)      Class A    Class B    Class C    Total
Class A               84.85      4.10       10.18      99.14
Class B               0.14       0.07       0.30       0.51
Class C               0.17       0.07       0.12       0.35
Total                 85.16      4.24       10.60      100.00
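A cross-classification of this kind can be produced from two vectors of per-name classes, as in the base R sketch below. The eight class labels are invented toy data, not values from the tests reported here; only the layout (rows for one package, columns for the other, percentages with margins) follows Table 4.

```r
# Invented toy classes for eight names (A = correct, B = incorrect, C = no match)
lv <- c("A", "B", "C")
class_u     <- factor(c("A", "A", "A", "B", "C", "A", "A", "A"), levels = lv)
class_other <- factor(c("A", "B", "C", "C", "A", "A", "A", "C"), levels = lv)

# Counts per combination of classes
tab <- table(U.Taxonstand = class_u, Other = class_other)

# Percentages of all names, with row and column totals added
pct <- round(100 * prop.table(tab), 2)
print(addmargins(pct))
```

The row totals then give the class proportions for one package and the column totals those for the other, which is how the "Total" margins of the table are read.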
3.2. U.Taxonstand versus WorldFlora

The species list for this pair of tests was the same species list used in the previous pair of tests. We first ran U.Taxonstand on the 14,505 names of this species list using the World Flora Online database (WFO, v.2022.04), and then ran WorldFlora (Kindt, 2020) on the same species list using the same WFO database. The test on U.Taxonstand took 1.3 h, whereas the test on WorldFlora took 44.5 h. Some of the 14,505 names can be directly matched with names in the WFO database, and some names were matched with two or more names by WorldFlora without a synthesized matching score; these names were excluded when calculating the rate of matching success. This left 9,775 names. The proportions of names in classes A, B, and C were 98.37%, 1.04%, and 0.58%, respectively, for U.Taxonstand, and were 97.73%, 1.88%, and 0.39%, respectively, for WorldFlora (Table 5). Although the rate of matching success was similarly high for both U.Taxonstand and WorldFlora (i.e. 98.37% and 97.73%, respectively) for the tested species list, the name matching process of U.Taxonstand was more than 30 times as fast as that of WorldFlora. The original results from the two tests are presented in "Comparison of results from tests (2).xlsx" in Supplementary data.

Table 5 Comparison of the results of U.Taxonstand and WorldFlora for the sample species list containing 9,775 names. The results were divided into three classes: class A, correct match; class B, incorrect match; class C, no match. Rows give the class assigned by U.Taxonstand and columns the class assigned by WorldFlora. The original results from the two tests were presented in "Comparison of results from tests (2).xlsx" in Supplementary data.
                      WorldFlora (%)
U.Taxonstand (%)      Class A    Class B    Class C    Total
Class A               97.52      0.63       0.21       98.37
Class B               0.10       0.82       0.12       1.04
Class C               0.10       0.43       0.05       0.58
Total                 97.73      1.88       0.39       100.00
3.3. U.Taxonstand versus lcvplants

We extracted from The Plant List database (www.theplantlist.org) a set of scientific names that cannot be directly matched with any names in the LCVP database (Freiberg et al., 2020); this set included 19,580 names. We first ran U.Taxonstand on this species list using the LCVP database, and then ran lcvplants (v.2.0.0; https://github.com/idiv-biodiversity/lcvplants) on the same species list. The test on U.Taxonstand took 1.3 h, and the proportions of names in classes A, B, and C were 99.75%, 0.08%, and 0.16%, respectively, whereas the test on the function "lcvp_fuzzy_search" of lcvplants took 5.3 h, and the proportions of names in classes A, B, and C were 82.61%, 12.47%, and 4.92%, respectively (Table 6). Thus, U.Taxonstand ran about 4 times as fast as lcvplants, and its rate of matching success was about 17 percentage points higher (Table 6). The original results from the two tests are presented in "Comparison of results from tests (3).xlsx" in Supplementary data.

Table 6 Comparison of the results of U.Taxonstand and lcvplants for the sample species list containing 19,580 names. The results were divided into three classes: class A, correct match; class B, incorrect match; class C, no match. Rows give the class assigned by U.Taxonstand and columns the class assigned by lcvplants. The original results from the two tests were presented in "Comparison of results from tests (3).xlsx" in Supplementary data.
                      lcvplants (%)
U.Taxonstand (%)      Class A    Class B    Class C    Total
Class A               82.48      12.45      4.83       99.75
Class B               0.07       0.01       0.01       0.08
Class C               0.06       0.02       0.09       0.16
Total                 82.61      12.47      4.92       100.00
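The speed and accuracy comparisons in Sections 3.1-3.3 can be recomputed from the running times and class A percentages reported in the text, as in the short base R sketch below. The column names are chosen for this illustration, and the first U.Taxonstand running time is entered as 1.0 h although the text gives it only approximately (~1 h).

```r
# Running times (h) and class A rates (%) as reported in Sections 3.1-3.3
tests <- data.frame(
  other_package = c("Taxonstand", "WorldFlora", "lcvplants"),
  hours_u       = c(1.0, 1.3, 1.3),      # U.Taxonstand; first value is approximate
  hours_other   = c(5.2, 44.5, 5.3),
  classA_u      = c(99.14, 98.37, 99.75),
  classA_other  = c(85.16, 97.73, 82.61),
  stringsAsFactors = FALSE
)

# How many times longer the other package ran
tests$time_ratio <- round(tests$hours_other / tests$hours_u, 1)

# Difference in matching success, in percentage points
tests$gain_points <- round(tests$classA_u - tests$classA_other, 2)

print(tests[, c("other_package", "time_ratio", "gain_points")])
```

The differences in matching success are expressed in percentage points, that is, as plain differences between two percentages rather than as relative changes.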
4. Summary

We developed an R package, U.Taxonstand, that can standardize and harmonize scientific names in plant and animal species lists at a fast speed and at a high rate of matching success. Unlike most similar R packages, each of which works with only one taxonomic database [e.g. lcvplants works only with the LCVP database (Freiberg et al., 2020); WorldFlora works only with the World Flora Online database (Kindt, 2020); Taxonstand works only with The Plant List database (Cayuela et al., 2012)], U.Taxonstand works with any taxonomic database, as long as it is properly formatted (see Table 1 and "Sample for a taxonomic database.xlsx" in Supplementary data with the present article). The results of our tests show that U.Taxonstand executes much faster and achieves a higher rate of matching success than the similar R packages tested. We believe that U.Taxonstand is a very useful tool for botanists, zoologists, ecologists and biogeographers to standardize and harmonize scientific names of organisms.

Acknowledgement

We thank the anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (32030068) and the Shanghai Municipal Natural Science Foundation (20ZR1418100) to J.Z.

Author contributions

H.Q. developed the idea of the software; J.Z. produced the package and wrote all R code; H.Q. and J.Z. prepared databases and wrote the paper.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2022.09.001.

References
Cayuela, L., Cerda, Í.G.-d.l., Albuquerque, F.S., et al., 2012. TAXONSTAND: an R package for species names standardisation in vegetation databases. Methods Ecol. Evol., 3: 1078-1083. DOI:10.1111/j.2041-210X.2012.00232.x
Freiberg, M., Winter, M., Gentile, A., et al., 2020. LCVP, the Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci. Data, 7: 416. DOI:10.1038/s41597-020-00702-z
Grenié, M., Berti, E., Carvajal-Quintero, J., et al., 2023. Harmonizing taxon names in biodiversity data: a review of tools, databases and best practices. Methods Ecol. Evol., 14: 12-25. DOI:10.1111/2041-210X.13802
Jin, Y., Qian, H., 2022. V.PhyloMaker2: An updated and enlarged R package that can generate very large phylogenies for vascular plants. Plant Divers., 44: 335-339. DOI:10.1016/j.pld.2022.05.005
Kindt, R., 2020. WorldFlora: an R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Appl. Plant Sci., 8: e11388.
Li, Z.-Y., Xie, Y., 2002. Invasive Alien Species in China. Beijing: China Forestry Publishing House.
R Core Team, 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
RStudio Team, 2022. RStudio: integrated development for R. RStudio, Inc. http://www.rstudio.com.
Weber, E., 2017. Invasive Plant Species of the World: A Reference Guide to Environmental Weeds, Second ed. CABI Publishing, Wallingford, UK.
Wickham, H., Hester, J., Chang, W., 2018. devtools: tools to make developing R packages easier. https://CRAN.R-project.org/package=devtools.