pyIFPNI: A package for querying and downloading plant fossil data from the IFPNI
Bailong Zhao a,b,*
a. State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China;
b. University of Chinese Academy of Sciences, Beijing, 100049, China

The International Fossil Plant Names Index (IFPNI, 2014-onwards) serves both as an online gateway to the fossil plant name registry for the global scientific community and as a comprehensive, dynamic archive of fossil plants (Doweld, 2016, 2022). Since its establishment in 2014, the IFPNI has accumulated more than 80,000 fossil plant name entries, along with information on over 10,000 documents and over 6,000 paleobotanists (IFPNI, 2014-onwards). Because it meticulously traces the record of fossil plant names, the IFPNI is an indispensable resource for both paleobotanists and systematic botanists. However, unlike other palaeobotanical databases (e.g., the Paleobiology Database (PBDB) and the Geobiodiversity Database (GBDB); Fan et al., 2011, 2013; Deng et al., 2020), the IFPNI lacks a download function. Storing information from the IFPNI therefore requires manual data entry for each record, which is a long and tedious process. An efficient method for preserving the data obtained from the IFPNI is thus needed.

Python is a popular, versatile, and user-friendly programming language. In this project, I developed pyIFPNI, an open-source Python package that simplifies retrieving fossil plant name records from the IFPNI. The package supports the range of search options available on the IFPNI and converts the results into a structured pandas dataframe, which can be conveniently saved as a CSV file. This accelerates data collection and preservation and leaves the data in a form that is ready for subsequent analysis.
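To make the tabular output concrete, the following is a minimal sketch of the last step of that workflow, writing query results to CSV. It uses only Python's standard csv module and placeholder rows (not real IFPNI records), and the column names are assumptions made for illustration. With an actual pandas dataframe, DataFrame.to_csv() would serve the same purpose.

```python
import csv
import io

# Placeholder rows standing in for query results. The column names and
# values are illustrative assumptions, not real IFPNI records.
records = [
    {"name": "placeholder name 1", "rank": "Family", "year": 1753},
    {"name": "placeholder name 2", "rank": "Order", "year": 2023},
]


def records_to_csv(rows):
    """Serialize a list of dict records to CSV text, one row per record."""
    if not rows:
        return ""
    buffer = io.StringIO(newline="")
    # The header is taken from the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file object to csv.DictWriter instead would write the same content to disk.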

Researchers can query fossil plant name records at different taxonomic levels. Each level is served by a dedicated method, as follows (Fig. 1):

Fig. 1 Schematic representation of the methods in the pyIFPNI package. The overview illustrates the complete framework of the pyIFPNI package and presents a detailed breakdown of the specific parameters for each method.

Supragenus: The supragenus() function queries fossil plant name records at taxonomic ranks above the genus level, such as families and orders.

Genus: The genus() function queries fossil plant name records at the genus level.

Infragenus: The infragenus() function refines the search to ranks below the genus level but above the species level.

Species: The species() function queries fossil plant name records at the species level. It can be used to search for a particular species or for all species names within a genus.

Infraspecies: The infraspecies() function queries fossil plant name records below the species level, such as subspecies, varieties, or forms.

When utilizing the supragenus(), genus(), infragenus(), species(), and infraspecies() methods in pyIFPNI, users have the flexibility to include additional parameters for more specific queries. These parameters include the following:

Author: Users can specify the author parameter to search for fossil plant name records attributed to specific authors.

Rank List: By specifying the rank_list parameter, users can narrow their search to specific taxonomic ranks within the supragenus, genus, infragenus, species, or infraspecies methods.

Original Spelling: Users have the option to include the "original spelling" parameter when searching for fossil plant names. This parameter allows users to find names based on their original spellings, which can be helpful for historical or variant name inquiries.

Publication Year Range: Users can restrict a search to names published within a given period by setting the yearFrom and yearTo parameters, allowing for more targeted research.

Paleoregion: The "paleoregion" parameter can also be utilized by users when using the species() and infraspecies() methods. This parameter allows researchers to apply a filter to their search, narrowing down the results to specific paleoregions of interest.
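The applicability of these parameters can be summarized in a small lookup. The sketch below is not part of pyIFPNI; the helper allowed_parameters() is hypothetical, and the parameter labels are descriptive placeholders rather than confirmed keyword names (only rank_list, yearFrom, and yearTo are named in the text).

```python
# Parameters described in the text as shared by all five taxonomic methods.
# The labels are descriptive placeholders, not confirmed keyword names.
COMMON_PARAMETERS = frozenset(
    {"author", "rank_list", "original_spelling", "year_range"}
)

# According to the text, paleoregion applies only to these two methods.
PALEOREGION_METHODS = frozenset({"species", "infraspecies"})

TAXONOMIC_METHODS = (
    "supragenus", "genus", "infragenus", "species", "infraspecies"
)


def allowed_parameters(method):
    """Return the optional parameters described for a taxonomic method."""
    if method not in TAXONOMIC_METHODS:
        raise ValueError(f"unknown taxonomic method: {method!r}")
    if method in PALEOREGION_METHODS:
        return COMMON_PARAMETERS | {"paleoregion"}
    return COMMON_PARAMETERS


for name in TAXONOMIC_METHODS:
    print(name, sorted(allowed_parameters(name)))
```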

The publication(), book(), and journal() methods query the bibliographic sources recorded in the IFPNI. They retrieve references according to the type of publication, as explained briefly below:

publication(): This method retrieves all types of publications, including books, journal articles, conference proceedings, reports, theses, etc. When using the publication() method, you generally get a broader range of results encompassing various publication types.

book(): This method specifically retrieves references that belong to books. It is useful when searching for information contained within published books. Using the book() method helps narrow down your search to relevant book titles and their associated details such as author(s), title, publisher, year, etc.

journal(): This method focuses on retrieving references from academic journals. If you are specifically interested in scholarly articles published in journals, using the journal() method can help filter out other types of publications.

The pyIFPNI library also provides a dedicated author() method for exploring the paleobotanists listed in the IFPNI database. With it, researchers can efficiently retrieve the information recorded about specific authors and their contributions to paleobotany as documented in the IFPNI.

In addition to the basic methods above, pyIFPNI provides 15 advanced search methods. These cover exploring higher taxonomic groups related to the target taxon, investigating lower taxonomic groups associated with the target taxon, searching for taxa published in a target publication, and querying taxa and publications attributed to a target paleobotanist. Detailed information and related examples of these methods can be found at https://github.com/WDragon101/pyIFPNI.

The primary objective of pyIFPNI is to retrieve records of fossil plant names through query methods based on taxonomic hierarchy. Each method operates within a specific taxonomic search range, which can be divided into two groups. The first group, including supragenus(), genus(), and infragenus(), explicitly targets names above the species level (excluding the species level), resulting in a limited number of outcomes. The second group consists of species() and infraspecies() methods, intended to search for names at and below the species level, potentially yielding more outcomes than the previous group. In this section, I will briefly introduce the usage and examples of two fundamental functions, supragenus() and species(), which are more commonly used. The remaining functions, including both basic and advanced, are documented at https://github.com/WDragon101/pyIFPNI.

Suppose the target taxon is Berberidaceae. As a family, it ranks above the genus level, so the supragenus() method applies. To retrieve all name records, the search range is set from 1753 to 2023, covering everything from the earliest record to the present day. This can be done either by explicitly setting yearFrom to 1753 and yearTo to 2023 or by omitting both parameters.

Another crucial parameter, rank_list, plays a pivotal role in further refining the scope of our search. In this specific example, the rank_list parameter encompasses the values ["Family", "Order"], thereby restricting the search solely to these particular taxonomic ranks. Naturally, if required, rank_list can accommodate multiple taxonomic ranks. To determine the permissible values for the rank_list parameter in the supragenus() method, you may utilize IFPNI.rank_supragenus.keys(). Subsequently, select the desired combination of ranks accordingly (Fig. S1).
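The parameter logic described here can be sketched without contacting the IFPNI. The helper build_query() and the mapping DEMO_RANK_SUPRAGENUS below are hypothetical stand-ins, not pyIFPNI's API; the mapping lists only the two ranks used in this example, with placeholder values. Falling back to all listed ranks when rank_list is omitted is likewise an assumption of this sketch.

```python
# Demo stand-in for the rank mapping mentioned in the text. Only the two
# ranks used in the example are listed; the values are placeholders.
DEMO_RANK_SUPRAGENUS = {"Family": "family", "Order": "order"}

EARLIEST_YEAR = 1753  # start of the full range used in the text
LATEST_YEAR = 2023    # end of the full range used in the text


def build_query(name, rank_list=None,
                yearFrom=EARLIEST_YEAR, yearTo=LATEST_YEAR):
    """Validate search parameters and return them as a plain dict."""
    if yearFrom > yearTo:
        raise ValueError("yearFrom must not exceed yearTo")
    # Assumption of this sketch: no rank_list means all listed ranks.
    if rank_list is None:
        ranks = list(DEMO_RANK_SUPRAGENUS)
    else:
        ranks = list(rank_list)
    unknown = [rank for rank in ranks if rank not in DEMO_RANK_SUPRAGENUS]
    if unknown:
        raise ValueError(f"unsupported rank(s): {unknown}")
    return {
        "name": name,
        "rank_list": ranks,
        "yearFrom": yearFrom,
        "yearTo": yearTo,
    }


# Stating the full year range explicitly and omitting it are equivalent.
explicit = build_query("Berberidaceae", ["Family", "Order"],
                       yearFrom=1753, yearTo=2023)
implicit = build_query("Berberidaceae", ["Family", "Order"])
print(explicit == implicit)  # True
print(sorted(DEMO_RANK_SUPRAGENUS.keys()))
```

Validating rank_list against the keys of the mapping mirrors the advice to inspect the permissible ranks before choosing a combination.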

Searching for species names with the species() method parallels the use of the supragenus() method. A practical application of species() is retrieving all names within a genus, which is not achievable with the methods of the first group. For instance, species() can be applied to extract all species names in the genus Berberis, with the remaining parameters set as in the supragenus() example. This yields the entire catalog of fossil species assigned to Berberis (Fig. S2).

To use the pyIFPNI package in R, you can follow these steps.

1. Install the reticulate package: install.packages("reticulate")

2. Load the reticulate package: library(reticulate)

3. Specify the Python installation to use: use_python("<path_to_python>"), where <path_to_python> is the path to your Python installation.

4. Install the pyIFPNI package: py_install("pyIFPNI")

5. Import the pyIFPNI module: pyIFPNI <- import("pyIFPNI")

6. The functions provided by pyIFPNI can be accessed using the pyIFPNI$function_name() syntax. For example, the supragenus() function can be used by calling pyIFPNI$supragenus().

One drawback is that there is currently no equivalent R package available for pyIFPNI. As a temporary solution, the aforementioned method can be used to invoke pyIFPNI in R. However, recognizing the widespread use of R in the scientific community, I am committed to developing an R version and will soon release relevant updates on https://github.com/WDragon101/pyIFPNI.

I developed pyIFPNI, a Python library that enables users to search and download fossil plant names from the International Fossil Plant Names Index. The library supports searches across taxonomic ranks and provides advanced search criteria comparable to those on the IFPNI website. Search results are returned in a structured dataframe format, which streamlines data analysis and lets researchers conveniently store their findings for future use. To make the library more useful to paleobotanists and botanists, I aim to strengthen the search capabilities of pyIFPNI by implementing depth search and taxon search algorithms, which should yield more exhaustive and precise results.

Acknowledgment

I thank the anonymous reviewers for their helpful comments.

Author contributions

B.Z. developed the idea of the software, produced the package, wrote all Python code, and wrote the paper.

Declaration of competing interest

The author declares no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2023.12.001.

References
Deng, Y.Y., Fan, J.X., Wang, Y., et al., 2020. Current status of paleontological databases and data-driven research in paleontology. Geol. J. China Univ., 26: 361-383.
Doweld, A.B., 2016. The International Fossil Plant Names Index (IFPNI): a global registry of scientific names of fossil organisms started. J. Palaeosci., 65: 203-208. DOI:10.54991/jop.2016.311
Doweld, A.B., 2022. The International Fossil Plant Names Index (IFPNI): a new step in the development of palaeobotany. Geophytology, 50: 1-10.
Fan, J.X., Zhang, H., Hou, X.D., et al., 2011. Quantitative research trends in palaeobiology and stratigraphy—construction of the Geobiodiversity Database (GBDB) using escience technology. Acta Palaeontol. Sin., 50: 141-153.
Fan, J.X., Hou, X.D., Chen, Z.Y., et al., 2013. Geobiodiversity database and its application in stratigraphic research. J. Stratigr., 37: 400-409.
IFPNI, 2014-onwards. The International Fossil Plant Names Index. Global registry of scientific names of fossil organisms covered by the International Code of Nomenclature for Algae, Fungi, and Plants and International Code of Zoological Nomenclature. Available from: http://ifpni.org/. (Accessed 11 August 2023).