How to fill the biodiversity data gap: Is it better to invest in fieldwork or curation?
Carlos A. Vargas a,b,*, Marius Bottin c, Tiina Sarkinen d, James E. Richardson a,d,e, Marcela Celis f, Boris Villanueva b, Adriana Sanchez a
a. Programa de Biología, Facultad de Ciencias Naturales, Universidad del Rosario, Bogotá, D.C., Colombia;
b. Subdirección Científica, Jardín Botánico de Bogotá "José Celestino Mutis", Bogotá, D.C., Colombia;
c. Independent Researcher, Bogotá, D.C., Colombia;
d. Tropical Diversity Section, Royal Botanic Garden Edinburgh, UK;
e. School of Biological, Earth and Environmental Sciences, University College Cork, Cork, Ireland;
f. Departamento de Química y Biología, Universidad del Norte, Km. 5 Vía Puerto Colombia, Área Metropolitana de Barranquilla 081007, Colombia
Abstract: Data gaps and biases are two important issues that affect the quality of biodiversity information and downstream results. Understanding how best to fill existing gaps and account for biases is necessary to improve our current information most effectively. Two current main approaches for obtaining and improving data include (1) curation of biological collections, and (2) fieldwork. However, the comparative effectiveness of these approaches in improving biodiversity data remains little explored. We used the Flora de Bogotá project to study the magnitude of change in species richness, spatial coverage, and sample coverage of plant records based on curation versus fieldwork. The process of curation resulted in a decrease in species richness (synonym and error removal), but it significantly increased the number of records per species. Fieldwork contributed to a slight increase in species richness, via accumulation of new records. Additionally, curation led to increases in spatial coverage, species observed by locality, the number of plant records by species, and localities by species compared to fieldwork. Overall, curation was more efficient in producing new information compared to fieldwork, mainly because of the large number of records available in herbaria. We recommend intensive curatorial work as the first step in increasing biodiversity data quality and quantity, to identify bias and gaps at the regional scale that can then be targeted with fieldwork. The stepwise strategy would enable fieldwork to be planned more cost-effectively given the limited resources for biodiversity exploration and characterization.
Keywords: Colombia; Flora de Bogotá; Sample coverage; Species richness; Tropical Andes
1. Introduction

Identifying spatial patterns of biodiversity distribution is fundamental to understanding how they were established. They are also essential for designing effective conservation and management strategies. However, it is well known that biodiversity information is incomplete or biased, limiting the generalization of results and the predictive power of models (Feeley and Silman, 2011a; García Márquez et al., 2012; Sousa-Baena et al., 2013; Vargas et al., 2022).

Although researchers are generally aware of the deficiencies of biological data, and different strategies have been developed to reduce their impact (Elith et al., 2006; Syfert et al., 2013; Engemann et al., 2015), many agree on the need to improve data availability (Graham et al., 2004; Feeley and Silman, 2010; Feeley, 2015; Ball-Damerow et al., 2019) and to generate more data through field explorations (Hopkins, 2007; Feeley and Silman, 2011a, 2011b). Museums and biological collections centralise a significant amount of biodiversity data that is currently available through global public online repositories (e.g., GBIF, BIEN). The online information is used for different purposes, including the study of evolutionary processes, the causes and limitations of species distributions, the response of species to climate change, and the design of protected areas (Soberón and Peterson, 2004; Bebber et al., 2010; Feeley and Silman, 2010; Gaira et al., 2011). This availability is possible due to the mass digitalisation of biological collections through standard formats, which can be handled with different information analysis programs. Digitalisation opens up centuries of data and the possibility of studying biodiversity dynamics, evaluating changes through time, and modelling future scenarios for management, use and conservation purposes (Feeley, 2012; Morueta-Holme et al., 2015; Nualart et al., 2017). However, around 40%–80% of biological records available online or databased are discarded because of taxonomic (e.g., poor determination, undetermined material and nomenclatural issues), geographical (e.g., poor georeference precision or samples assigned to a centroid), or temporal (i.e., without date) data deficiencies (Gueta and Carmel, 2016; Meyer et al., 2016; Daru et al., 2018). These data can be rescued by curatorial work (Feeley and Silman, 2011b), although this is a time-consuming effort. Additionally, not all museum collections are digitised, particularly small and local collections, and much work remains to be done in curating these collections to a high taxonomic standard.

Fieldwork is an alternative way to increase the amount and the accuracy of data. However, it requires the investment of financial resources by institutions whose role is to describe biodiversity (e.g., governmental agencies, botanic gardens). Given the limited financial resources for research, it is necessary to analyze where funds could best be invested (O'Connell et al., 2004; Franco et al., 2007; Gardner et al., 2008; Targetti et al., 2014), so that resources are allocated where they give the best return.

Biological collections contain large amounts of data that could be retrieved through curatorial work. Curatorial work has the potential to increase species richness through the discovery of new species (Goodwin et al., 2015), expanding species distributions, increasing the environmental envelope of species (Feeley and Silman, 2011b), categorizing threatened species (Nualart et al., 2017), and/or filling geographical gaps (Daru et al., 2018). At the regional scale, curatorial work has the potential to increase the geographical coverage of collections by filling gaps in the collection information. However, it is not well known how curatorial work impacts biodiversity knowledge compared with fieldwork.

In this paper, we explore how curatorial work of herbarium collections increases biodiversity data quality and quantity in contrast to fieldwork. Using the Flora de Bogotá project as a model, we analyze the change in species richness, spatial coverage, and sample coverage of plant records in Bogotá (capital of Colombia). We evaluate the impact of both curation and fieldwork on increasing the taxonomic and geographical robustness of biodiversity information and highlight their unique contributions.

2. Materials and methods

2.1. Study area

Bogotá, the capital of Colombia and the most populated city in the country (ca. 7.5 million people over ca. 1630 km²; accessed on 9th Sep 2022), is located in the Colombian Cordillera Oriental between 2510 and 3780 m elevation (Fig. 1). The climate is typical of tropical mountains, with daily temperatures varying between 6 and 22 ℃ and low annual seasonality (Secretaría Distrital de Ambiente, 2007). The rainfall regime is bimodal, with two peaks occurring in April–May and October–November. Bogotá is situated on two physiographic units: a flat area north of the city's urban area, where an enormous lake disappeared 30,000 years ago (Van der Hammen, 1986), and a mountainous area surrounding the metropolitan area to the east and south, which includes the most extensive area of páramo vegetation on Earth (Sumapaz Páramo). Seventy-five per cent of Bogotá's territory is rural, while the remaining 25% is occupied by the urban area, where 80% of the population lives. Urban ecosystems are represented mainly by metropolitan parks and the wetland system associated with the Bogotá River. Natural ecosystems are concentrated in the city's rural areas, where the páramo ecosystem predominates and relicts of Andean and high Andean forests are also found.

Fig. 1 Map of Bogotá (satellite image, red line left panel) in the context of South America (top left) and Colombia (left). Grey area corresponds to the Colombian Andes.
2.2. Data

The Flora de Bogotá database is an initiative of the Jardín Botánico de Bogotá to study the plant diversity of the city. The database was established in 2013 to gather information on Bogotá's plant records deposited in herbaria and contains 37,468 plant records obtained from local and worldwide herbaria. Additionally, 5401 new plant records collected during fieldwork by Jardín Botánico de Bogotá between 2011 and 2016 are included in the database. The total database consists of 42,869 plant records (Table 1).

Table 1 Data sources of vascular plant records gathered in the Flora de Bogotá database.

| Source | Number of records |
| --- | --- |
| Herbario Nacional Colombiano (COL) | 12,869 |
| Bibliographic review | 10,966 |
| Herbario Jardín Botánico de Bogotá (JBB) | 8671 |
| Herbario Forestal Gilberto Emilio Mahecha (UDBC) | 1533 |
| Missouri Botanical Garden (MO) | 1186 |
| Pontificia Universidad Javeriana (HPUJ) | 1077 |
| Instituto de investigación Alexander von Humboldt (FMB) | 544 |
| Smithsonian Institution (US) | 452 |
| New York Botanical Garden (NY) | 170 |
| Jardín Botánico de Bogotá Fieldwork | 5401 |
| Total | 42,869 |

For this study, records with identical collection numbers were screened, leaving only one of each in the database. Plant records with coordinates outside of Bogotá were excluded. This reduced the dataset to 21,926 plant records that, after curatorial work and fieldwork, represent 2384 species, 903 genera and 187 families. All specimens of this research are available in open databases (see Vargas et al., 2022). The Herbario Nacional Colombiano (COL) and Jardín Botánico de Bogotá (JBB) hold high-resolution images of many of the specimens that support the current manuscript. The list of the Flora de Bogotá is online, and the names are supported by specimens that can be checked by users. The flora of Bogotá is also continuously revised and improved. Therefore, the two main datasets used in this manuscript are digitized and available for checking by specialists.
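
The screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the field names (`collection_number`, `lat`, `lon`) are hypothetical, every record is assumed to already carry coordinates, and the rectangular bounding box uses placeholder values where a real workflow would test points against the city's boundary polygon.

```python
# Hypothetical bounding box for the study area (decimal degrees); placeholder
# values for illustration only, not the official city limits.
BBOX = {"lat_min": 3.7, "lat_max": 4.9, "lon_min": -74.5, "lon_max": -73.9}


def drop_duplicates(records):
    """Keep the first record seen for each collection number."""
    seen = set()
    unique = []
    for rec in records:
        key = rec["collection_number"]
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def inside_bbox(rec, bbox=BBOX):
    """True when the record's coordinates fall inside the bounding box."""
    return (bbox["lat_min"] <= rec["lat"] <= bbox["lat_max"]
            and bbox["lon_min"] <= rec["lon"] <= bbox["lon_max"])


def screen(records, bbox=BBOX):
    """Deduplicate first, then discard records located outside the box."""
    return [rec for rec in drop_duplicates(records) if inside_bbox(rec, bbox)]


# Made-up example records (not taken from the Flora de Bogotá database)
example = [
    {"collection_number": "A-001", "lat": 4.60, "lon": -74.08},
    {"collection_number": "A-001", "lat": 4.60, "lon": -74.08},  # duplicate
    {"collection_number": "B-017", "lat": 6.25, "lon": -75.56},  # outside box
    {"collection_number": "C-230", "lat": 4.30, "lon": -74.20},
]
kept = screen(example)
print([rec["collection_number"] for rec in kept])
```

Deduplicating before the spatial filter follows the order in which the two steps are described in the text; with a single record kept per collection number, the order only matters when duplicates carry different coordinates.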

2.3. Data treatment

To study the effect of curatorial work vs. fieldwork on the biodiversity patterns of Bogotá, we analyzed the change in species richness, spatial coverage of plant records and completeness at four different stages of data collection during the first stage of the Flora de Bogotá project (2012–2016).

(1). Raw dataset: Plant records of the Flora of Bogotá database obtained from herbaria without nomenclatural and coordinate corrections.

(2). Curated dataset: Plant records in the Flora database obtained from herbaria that were corrected for nomenclatural and coordinate metadata. The taxonomic work consisted of revising herbarium specimens, correcting obvious orthographic and spelling errors in the names assigned to plant records, and screening for synonymy. The taxon names were standardized using the Catálogo de plantas y líquenes de Colombia (Bernal et al., 2016). Geographical work was carried out on every plant record by correcting and standardizing coordinates. Plant records without coordinates were georeferenced using the locality information from the specimen label, following the point-radius method (Wieczorek et al., 2004). Specimens with insufficient locality data were excluded from the analysis; these included specimens collected before 1900, mainly by José Jerónimo Triana, who reported specimens from "Bogotá province", a wider region than the current area of Bogotá city.

(3). Fieldwork dataset: Plant records from the fieldwork conducted between 2012 and 2016 by Jardín Botánico in different parts of Bogotá to characterize areas without data. Those areas were characterized by collecting plants in reproductive condition. The plant records were deposited in the Jardín Botánico de Bogotá herbarium and identified to species level by the botanical team. The geographical and taxonomic data were checked to correct any mistakes.

(4). Total (Curated–fieldwork): Curated dataset in addition to plant records obtained from fieldwork done between 2012 and 2016 by Jardín Botánico in Bogotá city.
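
The nomenclatural part of the curation step can be illustrated with a short Python sketch. It is a simplified stand-in for the manual work described in stage (2): the `standardize` helper and the tiny synonym table are hypothetical, and the table reuses only the family-level synonyms that the Discussion cites as examples. A real workflow would draw accepted names from a reference catalogue and would also need to handle spelling errors, which this sketch does not attempt.

```python
def standardize(name, synonyms):
    """Normalize whitespace and capitalization, then map a synonym to its accepted name."""
    cleaned = " ".join(name.split())
    cleaned = cleaned[:1].upper() + cleaned[1:].lower()
    return synonyms.get(cleaned, cleaned)


# Family-level synonyms mentioned in the Discussion (synonym -> accepted name)
SYNONYMS = {
    "Compositae": "Asteraceae",
    "Palmae": "Arecaceae",
    "Gramineae": "Poaceae",
}

# Made-up raw name list containing synonyms and a formatting variant
raw_names = ["Compositae", "Asteraceae", " gramineae", "Poaceae", "Palmae"]

raw_distinct = set(raw_names)
curated_distinct = {standardize(name, SYNONYMS) for name in raw_names}

print(len(raw_distinct), "names before cleaning")
print(len(curated_distinct), "names after cleaning:", sorted(curated_distinct))
```

The example shows why cleaning lowers the name count without removing any record: several raw strings collapse onto one accepted name, so the records attached to them are pooled under that name.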

2.4. Data analysis

For the analysis, the taxonomic and geographical changes in plant records were used to evaluate the differences in species richness, spatial coverage and sample coverage of the Flora of Bogotá produced by curatorial work compared with fieldwork. To analyze the effect of curatorial work (carried out on the database and herbarium specimens) and fieldwork on the data quality of the Flora of Bogotá, we evaluated the changes between 2012 and 2016. We conducted the taxonomic and spatial analyses for each dataset and compared the datasets to evaluate the improvement of data through both curatorial work and fieldwork.

The analysis was conducted at two levels.

2.4.1. Taxonomic

To understand the change in taxonomic quality through curatorial and fieldwork, we calculated the number of taxon names and plant records by species at every data stage. We compared between stages using non-parametric tests (Kruskal–Wallis and the Wilcoxon test) in R 3.6.1 (R Development Core Team, 2019).
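
For readers unfamiliar with the test, the Kruskal–Wallis statistic can be computed by hand as in the sketch below. This is a plain-Python illustration, not the R code used in the study; the function names are hypothetical and the three small samples are invented. The p-value helper covers only the three-group case, where the chi-square upper tail with two degrees of freedom has the closed form exp(-H/2).

```python
import math


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Invented records-per-species samples for the three data stages
raw = [1, 1, 2, 3, 1, 2]
curated = [2, 3, 4, 5, 3, 6]
total = [3, 4, 6, 7, 5, 9]

h_stat = kruskal_wallis_h(raw, curated, total)
print(h_stat, p_value_three_groups(h_stat))
```

A significant Kruskal–Wallis result only says that at least one stage differs; pairwise Wilcoxon tests, as used in the paper, are then needed to locate the difference.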

2.4.2. Spatial and sample coverage

To understand the geographical contribution of curatorial work and fieldwork, we created a grid with a cell size of 1 km by 1 km over the city. We analyzed the change in spatial coverage (number of grid cells with plant records), record density (number of plant records per 1 km × 1 km grid cell), observed richness (number of species observed per 1 km × 1 km grid cell), and sample coverage in the raw, curated and total datasets. The same analysis was conducted at the ecosystem level using the Colombian ecosystem map (Etter, 1998) to delimit Bogotá's ecosystems.

We conducted a spatial coverage analysis to assess how representative the plant records are of Bogotá, calculating spatial coverage as the proportion of 1 km × 1 km grid cells with plant records over the total number of grid cells in Bogotá (1842 grid cells of 1 km × 1 km) at the four stages of the data:

$$\text{Spatial coverage} = \frac{N_r}{N}$$

where $N$ is the total number of grid cells in Bogotá and $N_r$ is the number of grid cells with plant records at every stage of data treatment.
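
The grid-based measures defined above (spatial coverage, record density and observed richness per cell) can be sketched as follows. This is an illustration under assumptions, not the workflow used in the study: coordinates are assumed to be already projected to metres, the grid origin is arbitrary, and the function names and example points are hypothetical. Only the total of 1842 cells comes from the text.

```python
import math
from collections import Counter, defaultdict

CELL_SIZE_M = 1000   # 1 km x 1 km cells
TOTAL_CELLS = 1842   # number of grid cells covering Bogotá, as given in the text


def cell_index(x_m, y_m, origin=(0.0, 0.0), size=CELL_SIZE_M):
    """Column and row of the grid cell containing a projected point (metres)."""
    return (math.floor((x_m - origin[0]) / size),
            math.floor((y_m - origin[1]) / size))


def summarise(records, total_cells=TOTAL_CELLS):
    """records: iterable of (x_m, y_m, species) tuples.

    Returns spatial coverage (occupied cells / total cells), the number of
    records per cell, and the number of distinct species per cell.
    """
    density = Counter()
    species_by_cell = defaultdict(set)
    for x_m, y_m, species in records:
        cell = cell_index(x_m, y_m)
        density[cell] += 1
        species_by_cell[cell].add(species)
    richness = {cell: len(names) for cell, names in species_by_cell.items()}
    coverage = len(density) / total_cells
    return coverage, density, richness


# Made-up points: two records fall in cell (0, 0), one in cell (1, 0)
points = [(100, 100, "sp_a"), (900, 900, "sp_b"), (1500, 200, "sp_a")]
coverage, density, richness = summarise(points, total_cells=4)
print(coverage, dict(density), richness)

# Coverage of the raw dataset reported in the Results: 364 occupied cells
print(round(100 * 364 / TOTAL_CELLS, 1))  # 19.8
```

The last line reproduces the 19.8% reported for the raw dataset from the cell counts given in the Results.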

We also analyzed the effect of curatorial work and fieldwork on record density per km², as a first step in describing the collection patterns across the territory (Soberón et al., 2007), and on observed richness and sample coverage for every grid cell. For the sample coverage, rarefaction was calculated in cells with more than 20 plant records. Sample coverage is a measure of sample completeness, giving the proportion of the total number of individuals in a community that belong to the species represented in the sample (Chao and Jost, 2012). It is defined as the total relative abundance of the observed species in the sample and ranges from 0 to 1. Sample completeness was estimated using the iNEXT R package (Hsieh et al., 2016).
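
To make the sample coverage measure concrete, the sketch below implements the coverage estimator of Chao and Jost (2012) for an observed sample, using the number of species recorded once (f1) and twice (f2). It is written in Python for illustration and is not the iNEXT code used in the study; the function names are hypothetical, the treatment of the f2 = 0 case is an assumption about how that edge case is commonly handled, and records per species in a cell stand in for abundances.

```python
from collections import Counter


def sample_coverage(counts):
    """Estimated sample coverage from per-species record counts in one cell."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if n == 0:
        raise ValueError("no records")
    f1 = sum(1 for c in counts if c == 1)  # species recorded exactly once
    f2 = sum(1 for c in counts if c == 2)  # species recorded exactly twice
    if f1 == 0:
        return 1.0
    if f2 > 0:
        a = (n - 1) * f1 / ((n - 1) * f1 + 2 * f2)
    else:
        # Assumed fallback when no species is recorded exactly twice
        a = (n - 1) * (f1 - 1) / ((n - 1) * (f1 - 1) + 2)
    return 1 - (f1 / n) * a


def coverage_by_cell(records, min_records=20):
    """records: iterable of (cell_id, species) pairs.

    Only cells with more than `min_records` records are evaluated,
    following the threshold stated in the text.
    """
    per_cell = {}
    for cell, species in records:
        per_cell.setdefault(cell, Counter())[species] += 1
    return {
        cell: sample_coverage(c.values())
        for cell, c in per_cell.items()
        if sum(c.values()) > min_records
    }


# Made-up cell: 13 records, three species seen once and one seen twice
print(sample_coverage([5, 3, 2, 1, 1, 1]))
```

A cell dominated by species recorded only once gets a low coverage value, which is why many sparsely collected cells score well below 0.5 even after curation.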

Finally, we tested for differences in richness and sample coverage between the raw, curated and total datasets using non-parametric tests (Kruskal–Wallis and the Wilcoxon test) in R 3.6.1 (R Development Core Team, 2019) and illustrated the results in maps created in QGIS (QGIS Development Team, 2015).

3. Results

3.1. Taxonomic changes following cleaning and fieldwork

Taxonomic cleaning decreased the number of taxon names, while fieldwork added new ones to the Flora de Bogotá database. As a result, taxa decreased by 24% for family and species levels, and 7% for genera (Table 2). On the other hand, fieldwork added 83 (3.5% of total species diversity) new names at the species level, most of which were herbs and epiphytes (Table S1).

Table 2 Number of species, genera and families in the Flora de Bogotá database at four data stages: raw data, curated data, fieldwork and total (combination of curated and fieldwork datasets), showing the change in data quality resulting from curation and fieldwork.

| Taxonomic rank | Raw | Curated | Fieldwork | Total |
| --- | --- | --- | --- | --- |
| Family | 225 | 187 | 110 | 187 |
| Genera | 967 | 904 | 312 | 904 |
| Species | 2878 | 2301 | 749 | 2384 |

The curatorial process and fieldwork significantly increased the number of records per species (p < 0.05) (Fig. S1): the probability of a species having fewer than five plant records decreased from 0.35 in the raw dataset to 0.21 in the curated dataset and 0.19 in the total dataset (curated–fieldwork). The number of records increased for 744 species through fieldwork, for 690 species through the curatorial process, for 1157 species through either of the two approaches (curatorial work or fieldwork), and for 277 species through the combined effect of both. Vaccinium floribundum (Ericaceae) had the highest number of plant records (143), and species with more than 100 plant records represented 0.6% (15 species) of the species recorded for Bogotá (Table S2).

3.2. Spatial representation

The spatial distribution of species differed significantly between datasets (p < 0.05) (Fig. S2): the probability of a species being recorded in only one cell decreased from the raw to the curated and total (curated–fieldwork) datasets. Fieldwork added grid cells to 28% of species, while the curatorial process added grid cells to 26% of species. The combination of cleaning and fieldwork added grid cells to 46% of the species reported in the Flora database (Table S3). Gaultheria anastomosans (Ericaceae) was the most widely distributed species, recorded in 78 (4.2%) grid cells in Bogotá.

3.3. Spatial and ecosystem changes by curatorial and fieldwork

3.3.1. Grid cells and density records

The curatorial work on georeferences of plant records and fieldwork increased the number of plant records with coordinates in the Flora de Bogotá database. The number of plant records with coordinates increased by 77% from raw to total datasets, but the main contribution was due to the curatorial work that added coordinates to 59% of records, while fieldwork only added 18% (Fig. 2). Additionally, curatorial and fieldwork increased the number of cells with plant records from 364 in the raw dataset (19.8% of total grid cells) to 753 (41%) in the total dataset (curated–fieldwork). However, fieldwork only added three new grid cells (0.1%) (Fig. 3).

Fig. 2 Plant records with species names and coordinates available in the different datasets of the vascular plants of the Flora of Bogotá: (a) raw dataset (red dots), (b) curated dataset (yellow dots), and (c) fieldwork dataset (blue dots). The red line delimits the area of Bogotá.

Fig. 3 Number of 1 km² grid cells covered by the different datasets of the vascular plants of the Flora of Bogotá: (a) raw dataset (red dots), (b) curated dataset (yellow dots), (c) fieldwork dataset (blue dots), and (d) total dataset (combination of curated and fieldwork datasets). The red line delimits the area of Bogotá.

On the other hand, cells with a low density of records predominated at the three stages of data, although georeferencing and fieldwork slightly increased density. The number of grid cells with very low (1–10 plant records per grid cell), low (11–100) and medium (101–500) density increased by 11%, 8% and 2%, respectively, from the raw dataset to the total dataset. The number of grid cells with high (501–1000 plant records) and very high (1001–1500 plant records) density also increased, although only by 0.1% and 0.2%, respectively (Fig. 4).

Fig. 4 Number of plant records per 1 km² grid cell, with panels indicating the change of density through curatorial and fieldwork in the vascular plant dataset of Flora of Bogotá: (a) raw dataset, (b) curated dataset, and (c) total dataset (combination of curated and fieldwork datasets). The blue–red scale indicates the spatial variation of record density in every stage of data: low density is indicated in blue while high density is indicated in red. The red line delimits the area of Bogotá.

At the ecosystem level, the number of plant records in Bogotá increased as a result of curatorial work and fieldwork. Furthermore, ecosystems that were not represented in the raw dataset were represented by plant records after curatorial work and fieldwork. The mean contribution to the records of each ecosystem was higher for curatorial work (57.1%) than for fieldwork (10%) or the raw data (32.9%). For instance, with curatorial work the density of plant records increased in all ecosystems, and a lake ecosystem appeared in the floristic record (i.e., La Regadera lake). In contrast, fieldwork increased the density of records in just five ecosystems, and in only one of them (the western dry páramo of the city) did it account for most (73%) of the plant records (Table 3).

Table 3 Changes in ecosystem assignment when analysing plant records at the different stages of data (raw data, clean data, fieldwork and total [combination of curated and fieldwork datasets]). Every stage shows the input in plant records and its contribution to the total number of records by ecosystem (in %). Although several ecosystems are repeated (i.e., Dry Andean forest, Mixed agroecosystems, Humid Andean forest and Lake), they are considered as independent in the ecosystem classification of Bogotá (they occur at different locations). For a spatial representation of the ecosystems, please refer to Fig. S4.

| Ecosystem | Ecosystem code | Area (km²) | Raw, records (%) | Clean, records (%) | Fieldwork, records (%) | Total, records (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Rural areas transformed by human activities | II | 266.20 | 1439 (20.4) | 2987 (42.4) | 2614 (37.1) | 7040 (100) |
| Humid páramos | 19 | 582.54 | 1034 (22.4) | 3351 (72.7) | 222 (4.8) | 4607 (100) |
| Dry Andean forest | 18b | 42.96 | 748 (21.1) | 2795 (78.9) | 0 (0) | 3543 (100) |
| Urban Area | U | 316.03 | 423 (17.9) | 1912 (81.0) | 26 (1.1) | 2361 (100) |
| Mix agrosystems | C3 | 133.22 | 766 (39.1) | 538 (27.5) | 655 (33.4) | 1959 (100) |
| Dry páramos | 20 | 40.31 | 248 (24.8) | 750 (75.2) | 0 (0) | 998 (100) |
| Dry Andean forest | 18b | 16.81 | 87 (16.5) | 382 (72.5) | 58 (11.0) | 527 (100) |
| Dry páramos | 20 | 15.37 | 102 (24.9) | 9 (2.2) | 298 (72.9) | 409 (100) |
| Milky agrosystems | C4 | 62.37 | 47 (28.8) | 116 (71.2) | 0 (0) | 163 (100) |
| Humid Andean forest | 18a | 29.66 | 14 (12.5) | 98 (87.5) | 0 (0) | 112 (100) |
| Oak Andean forest | 18c | 31.90 | 37 (41.1) | 53 (58.9) | 0 (0) | 90 (100) |
| Mix agrosystems | C3 | 10.93 | 54 (98.2) | 1 (1.8) | 0 (0) | 55 (100) |
| Humid Andean forest | 18a | 2.49 | 20 (90.9) | 2 (9.1) | 0 (0) | 22 (100) |
| Humid Andean forest | 18a | 36.45 | 3 (17.6) | 14 (82.4) | 0 (0) | 17 (100) |
| Lake | La | 1.86 | 6 (50) | 6 (50) | 0 (0) | 12 (100) |
| Lake | La | 1.86 | 0 (0) | 11 (100) | 0 (0) | 11 (100) |
| Humid Andean forest | 18a | 1.53 | 0 (0) | 0 (0) | 0 (0) | 0 (0) |

Overall, 80% of the Flora of Bogotá species are found in just four ecosystems. Two of those (humid páramo and dry high Andean Forest) are natural ecosystems, while the others consist of ecosystems with human intervention around the urban area. The most conserved ecosystems (e.g., cloud and high Andean humid forests; wet páramos) are located in the rural area, some distance from the urban area of Bogotá.

3.4. Richness and completeness changes through curatorial and fieldwork

3.4.1. Richness

Curatorial processes and fieldwork increased the median number of species observed in grid cells; however, there were no significant differences between datasets. While the median observed richness in the raw dataset was four, the medians in the curated dataset and total dataset (curated–fieldwork) were five and six, respectively. Seventy-five per cent of grid cells recorded up to 11 species in the raw dataset, 15 in the curated dataset and 18 in the total dataset. On the other hand, very few grid cells showed a high number of observed species. For example, 4% of grid cells contained more than 50 species in the raw dataset, compared with 7% of grid cells in the curated dataset and 9% in the total dataset (Fig. S3).

3.4.2. Sample coverage

This analysis discarded many grid cells with plant records because of their low sample size, even though the curatorial process and fieldwork added new ones. For example, while 19.8% of grid cells had plant records in the raw dataset, only 2.8% reached the threshold (20 plant records per cell) for the sample coverage analysis. In the curated dataset, 40.9% of grid cells had plant records, but only 8.8% were valid for this analysis; in the total dataset, 41% of grid cells had plant records, but only 10.3% were valid. Sample coverage values did not differ significantly between data stages (p > 0.05). Median values per grid cell were 0.22 in the raw dataset, 0.21 in the curated dataset and 0.24 in the total dataset. Seventy-five per cent of grid cells showed sample coverage values below 0.5, with a maximum of 0.6 in the raw dataset and 0.91 in the curated and total datasets (Fig. 5). Significant differences were found, however, in the grid cells where fieldwork was done (p < 0.05). During the study period, fieldwork was carried out in only ten grid cells. In those grid cells, sample coverage did not differ between the raw and curated datasets (0.25 each), but it increased to 0.68 in the total dataset (curated–fieldwork), showing the contribution of fieldwork.

Fig. 5 Sample coverage variation by curatorial and fieldwork in the vascular plant dataset of Flora of Bogotá: (a) raw dataset, (b) curated dataset, and (c) total dataset (combination of curated and fieldwork datasets). The blue–red scale indicates the spatial variation of record density in every stage of data: low density is indicated in blue while high density is indicated in red. The red line delimits the area of Bogotá.

At the ecosystem level, sample coverage increased mainly in the curated dataset, with a slight further increase from fieldwork. In only one ecosystem (C3) was the increase in sample coverage from fieldwork higher than that from curation. Although the overall sample coverage of ecosystems improved more in the curated dataset than through fieldwork, the differences between datasets were not significant (p > 0.05) (Fig. S4).

4. Discussion

In this study, we analyzed the magnitude of change in the floristic knowledge and information quality of Bogotá city through both curatorial work (nomenclatural cleaning and georeferencing of plant records) and fieldwork in a time window between 2012 and 2016. These two activities were carried out simultaneously in the first stage of the Flora de Bogotá project. We also evaluated their impact on the taxonomic, geographical, richness and sample coverage aspects of biodiversity information. We found that the largest change in Bogotá's floristic data was due to curatorial processes rather than fieldwork. Alpha diversity decreased through taxonomic curatorial work, because synonyms and orthographical mistakes in species names were removed. At the same time, spatial coverage increased because geographical curatorial work increased the number of plant records with georeferences.

4.1. Taxonomic changes resulting from curatorial and fieldwork

Our work significantly improved the taxonomic quality of the Flora of Bogotá database and decreased the number of species names by 24% by removing synonyms and orthographical errors. The loss of names is not surprising given that the data came from different sources with distinct curatorial levels, a problem that is also evident in open databases such as GBIF, where much inaccurate and incorrect information is published (Maldonado et al., 2015). Different classification systems were found in the Flora of Bogotá database that refer to the same taxon by a different name (e.g., Compositae = Asteraceae, Palmae = Arecaceae, and Gramineae = Poaceae); together with the orthographical mistakes, these had inflated the alpha diversity of Bogotá. Although checking for nomenclatural issues is the first step in obtaining an accurate list of species for a region, this is not obvious, since local and regional species inventories are full of nomenclatural mistakes that artificially increase diversity. For example, Cardoso et al. (2017) indicate that for the Amazon basin, almost 7% of the species reported were mistakes, because individual species were listed more than once as synonyms and spelling variants. The problem is worsened because much of the information reported in open databases is not reviewed by experts (Goodwin et al., 2015) and does not use up-to-date nomenclature.

On the other hand, the contribution of fieldwork to Bogotá plant species richness was small (new species amounted to only 3.5% of the Bogotá species list) compared with the data obtained from collection databases. Several factors contribute to the low rate of new species records obtained by fieldwork. For instance, all the analyzed fieldwork collections were biased towards grid cells that were already intensively sampled. Additional factors related to the low detection rate of new species in the Bogotá area include collector expertise (Ahrends et al., 2011), the sensitivity of sampling methods that exclude some groups of plants, collector preference (e.g., preference for angiosperms over ferns or non-vascular plants) (Daru et al., 2018), detectability of plants (e.g., depending on phenology or life form) (Chen et al., 2009), species density (McCarthy et al., 2013), and sampling bias.

This study found that fieldwork had limited reach compared to the data from collections, which resulted from the combined efforts of multiple explorers and researchers over a long period of time. The data registered in collections reflects contributions from many collectors and the exploration of diverse locations. As a result, improving the data associated with existing herbarium specimens through curatorial work may be more effective in producing a more accurate representation of the flora. This can be achieved by adding new collection sites and enhancing the representation of species' climatic niches (Feeley and Silman, 2011a), as well as expanding geographical coverage. Alternatively, the curated information could assist in planning efficient fieldwork strategies for areas and plant groups with limited data.

4.2. Geographical changes resulting from curatorial and fieldwork

4.2.1. Spatial and environmental changes

Our study showed an important increase in the spatial coverage of the Flora through curatorial work. The geographical dimension of biological records has been an important issue because most biological records, especially old ones (e.g., collections made before 1990, when GPS was not widely used) (Feeley and Silman, 2010), are deposited in collections without coordinates. As a result, many records are discarded from ecological analyses, even though they could represent new areas and environmental combinations. Curatorial work allowed the recovery of important floristic information, such as ecosystems not previously represented in Bogotá that were revealed through georeferencing (e.g., La Regadera lake). Parts of the environmental and climatic spectrum of the Flora of Bogotá that were not represented in the raw data were revealed through georeferencing (clean dataset). In contrast, fieldwork was carried out in ecosystems and grid cells that had already been sampled, resulting in a low number of new samples. This finding suggests that there were flaws in the sampling design at the regional scale, as supported by previous information.

4.2.2. Taxonomic perspective changes

The recovery of data from collections could improve species niche information by adding new environmental variables not previously recorded, which would help to improve species distribution models (Feeley and Silman, 2011a). As expected, our work improved species distribution data. However, we found that the species that gained distinct localities (new species records in new grid cells) through curatorial work and fieldwork were the most common ones. In contrast, for many species, particularly rare ones, no new localities were added after cleaning and fieldwork. It is possible that rare species are not represented in collections because of their low detectability or because of collection preferences. However, curatorial work could help identify rare species, and this information could be used to target fieldwork and add new environmental information.

4.2.3. Richness perspective

Although the number of species names in the database decreased as a result of curatorial work, those corrections improved the taxonomic quality, resulting in a more reliable species list for the city. Taxonomic corrections increased the number of records for some species, expanding their known geographic range. Fieldwork added new species to the Flora checklist, but the increase was small (83 species, corresponding to 3.5% of the total species diversity). The low rate of new species recorded by fieldwork could be due to an already exhaustive sampling of Bogotá. However, our analysis showed low sampling rates in the grid cells with plant records and a high proportion of grid cells without records (59% of the grid cells of Bogotá). Spatial sampling bias, collection preferences for some taxonomic groups (e.g., collections made by experts who prefer angiosperms to ferns), and collector expertise could explain the low rate of recorded novel species (i.e., undescribed species or species newly recorded for the area). In our case, we found biased sampling around the urban area of Bogotá city, especially "Cerros Orientales" and some places such as the páramo of Sumapaz (e.g., Laguna de Chisacá and Nazareth). Only in one place, "Páramos de Pasquilla", where the Flora of Bogotá project undertook fieldwork intensively, did the sampling increase significantly, and the observed richness and the completeness increased above 40% with grid cells over 80%.

Locally (i.e., at the grid-cell level), curatorial work significantly improved the number of species observed in the grid cells with plant records. In addition, 50% of new grid cells were represented by plant records, increasing the number of species observed in areas previously without plant information. However, many grid cells showed no change in record density or observed richness, especially those far from the urban area, where lack of access and social conflict (e.g., guerrilla presence) can make them difficult to reach (Negret et al., 2017).
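The grid-cell quantities discussed here (records per cell, observed richness per cell, and the share of cells without records) can be illustrated with a short Python sketch. It assumes a regular rectangular grid defined by an origin, a cell size and a number of columns and rows; these parameters, the function names and the example records are hypothetical and do not reproduce the grid used in this study.

```python
import math
from collections import defaultdict


def cell_id(lon, lat, origin, cell_size):
    """Return the (column, row) index of the grid cell containing a point."""
    return (
        math.floor((lon - origin[0]) / cell_size),
        math.floor((lat - origin[1]) / cell_size),
    )


def summarize_cells(records, origin, cell_size, n_cols, n_rows):
    """Summarize records per cell, species per cell and the share of empty cells."""
    counts = defaultdict(int)
    species = defaultdict(set)
    for name, lon, lat in records:
        col, row = cell_id(lon, lat, origin, cell_size)
        # Ignore records that fall outside the study grid.
        if not (0 <= col < n_cols and 0 <= row < n_rows):
            continue
        counts[(col, row)] += 1
        species[(col, row)].add(name)
    richness = {cid: len(names) for cid, names in species.items()}
    empty_share = 1 - len(counts) / (n_cols * n_rows)
    return dict(counts), richness, empty_share


# Hypothetical records: (species label, longitude, latitude) on a 2 x 2 grid.
records = [("a", 0.2, 0.3), ("b", 0.8, 0.1), ("a", 1.5, 1.5), ("c", 9.0, 9.0)]
counts, richness, empty_share = summarize_cells(
    records, origin=(0.0, 0.0), cell_size=1.0, n_cols=2, n_rows=2
)
```

Running the same summary on the raw, curated and post-fieldwork datasets would show which cells changed in record density or observed richness and which remained empty.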

Georeferencing increased the coverage of plant records, the number of species observed and sample coverage at local and regional scales in Bogotá. Fieldwork (Fig. 5c), on the other hand, produced significant changes at the local scale. For example, in the grid cells where fieldwork was performed, few plant records were present at the raw data stage, and few were recovered by curatorial work. After fieldwork, completeness in those grid cells increased significantly, reaching values above 0.8. Sample coverage analysis and richness estimators depend on sample size (Gotelli and Colwell, 2011). Although curatorial work increased the number of plant records at the regional scale, at the local scale (i.e., grid cells), 80% of grid cells had fewer than 10 plant records (Fig. 4) and only 20% of grid cells had more than 20 plant records (the threshold used to calculate sample coverage). The main changes were observed in the grid cells where fieldwork was undertaken.
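To make the dependence on sample size concrete, the sketch below computes the sample coverage estimator of Chao and Jost (2012) for a single grid cell from the number of records per species. It is an illustrative Python version, not the code used in this study. The fallback used when there are no doubletons follows a common bias-corrected convention, and the `min_records` cut-off (set here to 20, with cells below it left unestimated) is an assumption about how the threshold is applied.

```python
from collections import Counter


def sample_coverage(abundances, min_records=20):
    """Estimate sample coverage (Chao and Jost, 2012) for one grid cell.

    `abundances` holds the number of records per species in the cell.
    Returns None when the cell has fewer than `min_records` records
    (assumed threshold) or no records at all.
    """
    counts = [c for c in abundances if c > 0]
    n = sum(counts)
    if n == 0 or n < min_records:
        return None
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]  # species recorded exactly once / exactly twice
    if f1 == 0:
        return 1.0
    if f2 > 0:
        a = (n - 1) * f1 / ((n - 1) * f1 + 2 * f2)
    else:
        # Assumed bias-corrected form for the case without doubletons.
        a = (n - 1) * (f1 - 1) / ((n - 1) * (f1 - 1) + 2)
    return 1 - (f1 / n) * a


# Hypothetical cell: 21 records spread over six species.
cell = [6, 3, 1, 1, 2, 8]
coverage = sample_coverage(cell)

# A sparsely sampled cell is left unestimated rather than given a coverage value.
sparse = sample_coverage([1, 1, 1])
```

Because the estimate is driven by the proportion of species recorded only once, cells with few records tend to give unstable values, which is why a minimum number of records is required before coverage is reported.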

Our study fixed taxonomic and geographical issues in the recovered data, which increased observed richness and completeness locally. Given the limited resources available to explore the territory, especially in low- or middle-income countries, it is essential to invest those scarce resources carefully in order to obtain as much information as possible. Biological collections compile the efforts of many researchers and projects over the years. As many researchers have pointed out, physical collections hold vast amounts of data that are not usable because of issues in three basic dimensions (taxonomy, geography and time) (Lavoie, 2013; Feeley, 2015; Hortal et al., 2015; Meyer et al., 2016). As we showed in this study, investing in curatorial work (both physical and digital) as the first step in describing biodiversity could unveil those aspects that are necessary to make the most efficient use of the few resources available for biodiversity studies. Fieldwork is a crucial activity for studying biodiversity but should be targeted to under-collected areas, which can be defined by improving collections data through curatorial work.

However, despite the great efforts that have been made with digitalization, further improvements could be made that would enhance biodiversity studies. Small herbaria such as that of the Jardín Botánico de Bogotá would also benefit from citizen science contributions. Label digitization, for example, could allow data to be read from home by retirees. At the Jardín Botánico de Bogotá herbarium, volunteers contribute to scanning and photographing the specimen collection, as well as to mounting specimens. We believe that involvement in these activities can inspire new taxonomists who will contribute to enhancing herbarium research. Research could also be assisted by new technologies based on Artificial Intelligence approaches that can check the consistency of identifications and flag problematic specimens that require expert review (see, for example, Hussein et al., 2022). All of these approaches can contribute to the concept of the 'global meta-herbarium', linking digitized specimens with other digital data (Davis, 2023).

5. Conclusions

Curatorial work on biodiversity collections and fieldwork are not distinct processes, but rather exist in continuous feedback that generates and improves biodiversity knowledge. Therefore, to maximize the scarce resources invested by research organizations and institutions in biodiversity, it is crucial that this circular process continually informs the subsequent steps. From the point of view of research institutes in charge of biodiversity characterization, curatorial work, facilitated by digitalization, is an investment that would yield a large amount of improved data that could be retrieved from biological collections at relatively low cost and in little time (Suarez and Tsutsui, 2004; Lavoie, 2013). However, many herbaria have reduced their investment in curators, and care for collections decreases every day (Vogel et al., 2017), especially in small, local herbaria.

In contrast, fieldwork requires high investment in personnel, logistics, preparation and time (Suarez and Tsutsui, 2004), with limited capacity to capture new data. Although fieldwork remains essential for acquiring biodiversity information, its significant cost underscores the importance of maintaining constant feedback between curatorial work and fieldwork. This enables better identification of collection areas or taxonomic groups that require further investigation. In order to improve baseline biodiversity information, we advocate for increased investment in curation and maintenance of herbaria, both in terms of trained personnel and infrastructure. This investment is particularly necessary in smaller, local herbaria that have been shown to contain important collections that contribute to a better understanding of the overall distribution of biodiversity (e.g., Marsico et al., 2020; Monfils et al., 2020).


Acknowledgements

We are grateful to all the collaborators and contributors of the Jardín Botánico de Bogotá (Proyecto Flora de Bogotá), especially Diego Moreno, manager of the Flora de Bogotá database. We would like to thank the group "Genética evolutiva, filogeografía y ecología de biodiversidad Neotropical" and the High-Performance Computing service of the Universidad del Rosario for hosting our PostgreSQL database on their servers. This study would not have been possible without the support of MinCiencias Doctoral funds and the support of Universidad del Rosario. We would also like to thank Iván Jiménez (curator at the Missouri Botanical Garden) for his valuable comments on the document, and Domingos Cardoso and an anonymous reviewer for their valuable comments on the manuscript. This project was supported by Colciencias Doctoral funding (727-2015) and Universidad del Rosario, through a teaching assistantship and a doctoral grant.

Author contributions

CV, TS, JER, AS conceived and designed the research. CV, MB, MC, BV obtained and processed the plant records. CV, MB analyzed the data. CV, AS wrote the first draft. CV, TS, JER, AS edited a revised version of the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at

References

Ahrends, A., Rahbek, C., Bulling, M.T., et al., 2011. Conservation and the botanist effect. Biol. Conserv., 144: 131-140. DOI:10.1016/j.biocon.2010.08.008
Ball-Damerow, J.E., Brenskelle, L., Barve, N., et al., 2019. Research applications of primary biodiversity databases in the digital age. bioRxiv: 1-26. DOI:10.1101/605071
Bebber, D.P., Carine, M.A., Wood, J.R.I., et al., 2010. Herbaria are a major frontier for species discovery. Proc. Natl. Acad. Sci. U.S.A., 107: 22169-22171. DOI:10.1073/pnas.1011841108
Bernal, R., Gradstein, S.R., Celis, M. (Eds.), 2016. Catálogo de plantas y líquenes de Colombia. Editorial Universidad Nacional de Colombia, Bogotá.
Cardoso, D., Särkinen, T., Alexander, S., et al., 2017. Amazon plant diversity revealed by a taxonomically verified species list. Proc. Natl. Acad. Sci. U.S.A., 114: 10695-10700. DOI:10.1073/pnas.1706756114
Chao, A., Jost, L., 2012. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology, 93: 2533-2547. DOI:10.1890/11-1952.1
Chen, G., Kéry, M., Zhang, J., et al., 2009. Factors affecting detection probability in plant distribution studies. J. Ecol., 97: 1383-1389. DOI:10.1111/j.1365-2745.2009.01560.x
Daru, B.H., Park, D.S., Primack, R.B., et al., 2018. Widespread sampling biases in herbaria revealed from large-scale digitization. New Phytol., 217: 939-955. DOI:10.1111/nph.14855
Elith, J., Graham, C.H., Anderson, R.P., et al., 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography, 29: 129-151. DOI:10.1111/j.2006.0906-7590.04596.x
Engemann, K., Enquist, B.J., Sandel, B., et al., 2015. Limited sampling hampers "big data" estimation of species richness in a tropical biodiversity hotspot. Ecol. Evol., 5: 807-820. DOI:10.1002/ece3.1405
Etter, A., 1998. Mapa general de ecosistemas de Colombia (1:2,000,000). Instituto Alexander von Humboldt y PNUD, Bogotá.
Feeley, K.J., 2015. Are we filling the data void? An assessment of the amount and extent of plant collection records and census data available for tropical South America. PLoS One, 10: 1-17. DOI:10.1371/journal.pone.0125629
Feeley, K.J., 2012. Distributional migrations, expansions, and contractions of tropical plant species as revealed in dated herbarium records. Global Change Biol., 18: 1335-1341. DOI:10.1111/j.1365-2486.2011.02602.x
Feeley, K.J., Silman, M.R., 2011a. Keep collecting: accurate species distribution modelling requires more collections than previously thought. Divers. Distrib., 17: 1132-1140. DOI:10.1111/j.1472-4642.2011.00813.x
Feeley, K.J., Silman, M.R., 2011b. The data void in modelling current and future distributions of tropical species. Global Change Biol., 17: 626-630. DOI:10.1111/j.1365-2486.2010.02239.x
Feeley, K.J., Silman, M.R., 2010. Modelling the responses of Andean and Amazonian plant species to climate change: the effects of georeferencing errors and the importance of data filtering. J. Biogeogr., 37: 733-740. DOI:10.1111/j.1365-2699.2009.02240.x
Franco, A.M.A., Palmeirim, J.M., Sutherland, W.J., 2007. A method for comparing effectiveness of research techniques in conservation and applied ecology. Biol. Conserv., 134: 96-105. DOI:10.1016/j.biocon.2006.08.008
Gaira, K.S., Dhar, U., Belwal, O.K., 2011. Potential of herbarium records to sequence phenological pattern: a case study of Aconitum heterophyllum in the Himalaya. Biodivers. Conserv., 20: 2201-2210. DOI:10.1007/s10531-011-0082-4
García Márquez, J., Dormann, C., Sommer, J.H., et al., 2012. A methodological framework to quantify the spatial quality of biological databases. Biodivers. Ecol., 4: 25-39. DOI:10.7809/b-e.00057
Gardner, T.A., Barlow, J., Araujo, I.S., et al., 2008. The cost-effectiveness of biodiversity surveys in tropical forests. Ecol. Lett., 11: 139-150. DOI:10.1111/j.1461-0248.2007.01133.x
Goodwin, Z.A., Harris, D.J., Filer, D., et al., 2015. Widespread mistaken identity in tropical plant collections. Curr. Biol., 25: R1066-R1067. DOI:10.1016/j.cub.2015.10.002
Gotelli, N.J., Colwell, R.K., 2011. Estimating species richness. In: Biological Diversity. Frontiers in Measurement and Assessment. Oxford University Press, New York.
Graham, C.H., Ferrier, S., Huettman, F., et al., 2004. New developments in museum-based informatics and applications in biodiversity analysis. Trends Ecol. Evol., 19: 497-503. DOI:10.1016/j.tree.2004.07.006
Gueta, T., Carmel, Y., 2016. Quantifying the value of user-level data cleaning for big data: a case study using mammal distribution models. Ecol. Inf., 34: 139-145. DOI:10.1016/j.ecoinf.2016.06.001
Hopkins, M.J.G., 2007. Modelling the known and unknown plant biodiversity of the Amazon Basin. J. Biogeogr., 34: 1400-1411. DOI:10.1111/j.1365-2699.2007.01737.x
Hortal, J., de Bello, F., Diniz-Filho, J.A.F., et al., 2015. Seven shortfalls that beset large-scale knowledge of biodiversity. Annu. Rev. Ecol. Evol. Syst., 46: 523-549. DOI:10.1146/annurev-ecolsys-112414-054400
Hsieh, T.C., Ma, K.H., Chao, A., 2016. iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods Ecol. Evol., 7: 1451-1456. DOI:10.1111/2041-210X.12613
Lavoie, C., 2013. Biological collections in an ever changing world: herbaria as tools for biogeographical and environmental studies. Perspect. Plant Ecol. Evol. Syst., 15: 68-76. DOI:10.1016/j.ppees.2012.10.002
Maldonado, C., Molina, C.I., Zizka, A., et al., 2015. Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases? Global Ecol. Biogeogr., 24: 973-984. DOI:10.1111/geb.12326
Marsico, T.D., Krimmel, E.R., Carter, J.R., et al., 2020. Small herbaria contribute unique biogeographic records to county, locality, and temporal scales. Am. J. Bot., 107: 1577-1587. DOI:10.1002/ajb2.1563
McCarthy, M.A., Moore, J.L., Morris, W.K., et al., 2013. The influence of abundance on detectability. Oikos, 122: 717-726. DOI:10.1111/j.1600-0706.2012.20781.x
Meyer, C., Weigelt, P., Kreft, H., et al., 2016. Multidimensional biases, gaps and uncertainties in global plant occurrence information. Ecol. Lett., 19: 992-1006. DOI:10.1111/ele.12624
Monfils, A.K., Krimmel, E.R., Bates, J.M., et al., 2020. Regional collections are an essential component of biodiversity research infrastructure. Bioscience, 70: 1045-1047. DOI:10.1093/biosci/biaa102
Morueta-Holme, N., Engemann, K., Sandoval-Acuña, P., et al., 2015. Strong upslope shifts in Chimborazo's vegetation over two centuries since Humboldt. Proc. Natl. Acad. Sci. U.S.A., 112: 12741-12745. DOI:10.1073/pnas.1509938112
Negret, P.J., Allan, J., Braczkowski, A., et al., 2017. Need for conservation planning in postconflict Colombia. Conserv. Biol., 31: 499-500. DOI:10.1111/cobi.12935
Nualart, N., Ibáñez, N., Soriano, I., et al., 2017. Assessing the relevance of herbarium collections as tools for conservation biology. Bot. Rev., 83: 303-325. DOI:10.1007/s12229-017-9188-z
O'Connell, A.F., Gilbert, A.T., Hatfield, J.S., 2004. Contribution of natural history collection data to biodiversity assessment in national parks. Conserv. Biol., 18: 1254-1261. DOI:10.1111/j.1523-1739.2004.00336.x
QGIS Development Team, 2015. QGIS Geographic Information System, Open Source Geospatial Foundation Project, version 3.8.0.
R Development Core Team, 2019. R: A Language and Environment for Statistical Computing (Version 3.6.1).
Secretaría Distrital de Ambiente, 2007. Atlas ambiental de Bogotá D.C. Imprenta Nacional de Colombia, Bogotá (Colombia).
Soberón, J., Jiménez, R., Golubov, J., et al., 2007. Assessing completeness of biodiversity databases at different spatial scales. Ecography, 30: 152-160. DOI:10.1111/j.0906-7590.2007.04627.x
Soberón, J., Peterson, A.T., 2004. Biodiversity informatics: managing and applying primary biodiversity data. Philos. Trans. R. Soc. B-Biol. Sci., 359: 689-698. DOI:10.1098/rstb.2003.1439
Sousa-Baena, M.S., Couto, L., Townsend, A., 2013. Completeness of digital accessible knowledge of the plants of Brazil and priorities for survey and inventory. Divers. Distrib., 20: 1-13. DOI:10.1111/ddi.12136
Suarez, A.V., Tsutsui, N.D., 2004. The value of museum collections for research and society. Bioscience, 54: 66-74.
Syfert, M.M., Smith, M.J., Coomes, D.A., 2013. The effects of sampling bias and model complexity on the predictive performance of MaxEnt species distribution models. PLoS One, 8. DOI:10.1371/journal.pone.0055158
Targetti, S., Herzog, F., Geijzendorffer, I.R., et al., 2014. Estimating the cost of different strategies for measuring farmland biodiversity: evidence from a Europe-wide field evaluation. Ecol. Indicat., 45: 434-443. DOI:10.1016/j.ecolind.2014.04.050
Van der Hammen, T., 1986. La Sabana de Bogotá y su lago en el Pleniglacial Medio. Caldasia, 15: 249-262.
Vargas, C.A., Bottin, M., Särkinen, T., et al., 2022. Environmental and geographical biases in plant specimen data from the Colombian Andes. Bot. J. Linn. Soc., 200: 451-464. DOI:10.1093/botlinnean/boac035
Vogel, C., Bordignon, S.A. de L., Trevisan, R., et al., 2017. Implications of poor taxonomy in conservation. J. Nat. Conserv., 36: 10-13. DOI:10.1016/j.jnc.2017.01.003
Wieczorek, J., Guo, Q., Hijmans, R.J., 2004. The point-radius method for georeferencing locality descriptions and calculating associated uncertainty. Int. J. Geogr. Inf. Sci., 18: 745-767. DOI:10.1080/13658810412331280211