This document provides best practices and recommendations, ordered by area of interoperability. It follows the general approach of the paper “Recommendations for interoperability among infrastructures” and the more specific technical aspects of the BiCIKL report “Best practice manual for findability, re-use and accessibility of infrastructures”. The best practices and recommendations were elaborated, with input from external experts, during a series of meetings of the Technical Research Infrastructures Forum of the BiCIKL project, with the aim of achieving better findability, re-use and accessibility of the infrastructures’ services. Several outcomes from the discussions in the Biodiversity Services and Clients Interest Group (BSCI) of TDWG were also incorporated in this document.
The best practices focus on API services, which play a key role in linking data between the infrastructures and creating a network of knowledge, but also describe more generic best practices on e.g. modalities of access, building communities and trust. For most best practices one or more recommendations are given for their implementation.
List of abbreviations
|ABCD||Access to Biological Collection Data, a standard for access to and exchange of data about specimens and observations|
|AGU||American Geophysical Union|
|API||Application Programming Interface, a software interface for two or more computer programs to communicate with each other|
|BiCIKL||Biodiversity Community Integrated Knowledge Library, a project funded by the European Union’s Horizon 2020 Research and Innovation Action under grant agreement No 101007492|
|BKH||Biodiversity Knowledge Hub, a one-stop access point to guidelines, standards, data and services from 15 research infrastructures, under development in the BiCIKL project.|
|CC-BY-NC||Creative Commons Attribution-NonCommercial, a license for sharing material|
|CC-Zero||Creative Commons Zero, a ‘no rights reserved’ public domain mark.|
|CETAF||Consortium of European Taxonomic Facilities|
|CSV||Comma-Separated Values, a delimited text file that uses a comma to separate values|
|DOI||Digital Object Identifier, a persistent identifier or handle used to uniquely identify various objects, standardized by the International Organization for Standardization (ISO)|
|DOIP(v2)||Digital Object Interface Protocol, a protocol that specifies a standard way for clients to interact with digital objects|
|DwC||Darwin Core, a standard to facilitate the sharing of information about biological diversity|
|EOSC||European Open Science Cloud, Europe’s vision to deliver a web of FAIR data and related services for research|
|FAIR||Four foundational principles to improve the Findability, Accessibility, Interoperability and Reusability of digital assets, intended as a guide for data producers|
|GBIF||Global Biodiversity Information Facility|
|GUID||Globally Unique Identifier (also known as Universally Unique Identifier, or UUID), a 128-bit value usually represented as a 36-character string of hexadecimal digits and hyphens that follows the RFC 4122 specification.|
|HTML||HyperText Markup Language, the standard markup language for documents designed to be displayed in a web browser|
|HTTP||Hypertext Transfer Protocol, a set of rules for transferring files over the web|
|HTTPS||Hypertext Transfer Protocol Secure, an extension of the Hypertext Transfer Protocol (HTTP) used for secure communication over a computer network|
|ISO||International Organization for Standardization|
|MIME type||A media type (Multipurpose Internet Mail Extensions) that indicates the nature and format of a document, file, or assortment of bytes|
|OAuth2||OAuth 2.0, an industry-standard protocol for authorization, widely used for APIs|
|OGC||Open Geospatial Consortium|
|PID||Persistent Identifier, a long-lasting reference to a document, file, web page, or other object that is globally unique, persistent and resolvable.|
|REST||Representational State Transfer, a software architectural style to describe a machine-to-machine interface|
|TDWG||Biodiversity Information Standards (Taxonomic Databases Working Group)|
|URI||Uniform Resource Identifier, a unique sequence of characters that identifies a logical or physical resource used by web technologies.|
|URL||Uniform Resource Locator, a web address, a reference to a web resource that specifies its location on a computer network|
|UTF-8||Unicode Transformation Format – 8-bit, a variable-length character encoding used for electronic communication|
Modalities of access
- Primary scientific data needs to be provided as open as possible and only as closed as necessary for legal or sensitive data purposes.
- It is recommended to provide metadata always under a public domain dedication (indicated as CC-Zero, or CC-0).
- It is recommended to provide data under a public domain dedication or under a Creative Commons license that is Open Access compatible, e.g. CC-BY. NonCommercial (NC) or NoDerivatives (ND) licenses are not recommended for data intended for scholarly or scientific use, see: https://creativecommons.org/faq/.
- It is recommended to provide the license statement in a machine readable format. This allows search engines and software systems to be able to detect the CC license. Machine readable HTML code for CC licenses can be obtained from the CC license chooser.
- It is recommended to include a data quality assessment when data is provided.
- A Research Infrastructure providing data should ensure at a minimum a data discovery service plus CSV style data downloads, or RESTful endpoints to allow for programmatic access to the data. JSON is preferred over CSV because it can handle hierarchical information that the CSV format cannot.
- It is recommended to provide as many different modalities of access as possible, including access through packages for programming languages that are commonly used to work with data, such as Python or R. Different use cases may require different kinds of APIs.
- It is recommended to provide APIs suitable for (future) machine-to-machine interaction, such as a DOIPv2 protocol implementation.
- No person providing data should need to be contacted to obtain open access data except for cases like the need for very large amounts of data for which extraction through an API might not be efficient or appropriate.
- Public APIs plus the data they serve should be fully documented and the documentation should be openly available and up to date.
- It is recommended that the API documentation (e.g. OpenAPI) covers common use cases and provides examples.
- It is recommended to provide machine-readable documentation, e.g. by using OpenAPI 3.x, which can present the documentation in both human-readable (HTML) and machine-actionable (JSON) formats.
- It is recommended to provide human-friendly descriptions and a beginner’s guide to the API(s).
- It is recommended to document the API versioning strategy and to follow that strategy precisely.
- Public APIs served by a research infrastructure should be easy to find.
- It is recommended to provide multiple ways to discover public APIs such as listing them on the RI’s website and registering them in dedicated service catalogues.
- At a minimum, the links to the API(s) and their documentation should be displayed on the RI’s website for a straightforward discovery and access to the service.
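The recommendation on machine-readable documentation can be illustrated with a minimal sketch of an OpenAPI 3.x description, built here as a Python dictionary and serialised to JSON. The API title, the path `/api/v1/occurrences` and the `scientificName` parameter are hypothetical and only serve the illustration; a real description would cover all endpoints, common use cases and examples.

```python
import json


def build_openapi_description() -> dict:
    """Return a minimal OpenAPI 3.0 document for a hypothetical occurrence API."""
    return {
        "openapi": "3.0.3",
        "info": {
            "title": "Example Occurrence API",  # hypothetical service name
            "version": "1.0.0",
            "description": "Search occurrence records (illustrative only).",
            # Machine-readable license statement for the API description.
            "license": {
                "name": "CC0 1.0",
                "url": "https://creativecommons.org/publicdomain/zero/1.0/",
            },
        },
        "paths": {
            "/api/v1/occurrences": {  # hypothetical endpoint
                "get": {
                    "summary": "Search occurrence records",
                    "parameters": [
                        {
                            "name": "scientificName",  # hypothetical parameter
                            "in": "query",
                            "required": False,
                            "schema": {"type": "string"},
                            "description": "Scientific name to search for.",
                        }
                    ],
                    "responses": {
                        "200": {
                            "description": "A list of matching occurrence records",
                            "content": {
                                "application/json": {
                                    "schema": {
                                        "type": "array",
                                        "items": {"type": "object"},
                                    }
                                }
                            },
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # The JSON output is the machine-actionable form; documentation tools
    # can render the same document as HTML for human readers.
    print(json.dumps(build_openapi_description(), indent=2))
```

Publishing such a file next to the human-friendly guide, and linking both from the RI’s website, would address the documentation and discoverability recommendations together.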
Building communities and trust
- APIs must be simple, easy to use, pragmatic, and designed with all major stakeholder groups in mind, including users, providers, aggregators, and architects.
- It is recommended to be as transparent as possible: every parameter in the request and response bodies should be defined and compromises should be thoughtful and documented.
- Issues and requests for new API features should be easily reported and encouraged.
- It is recommended to have an open forum for issue reporting and discussion.
- It is recommended to provide development roadmaps openly.
- It is recommended to provide a mechanism for mass communication for developers to subscribe to notices about updates, downtimes, etc.
- A mechanism for user support with clear response times should be provided.
- It is recommended to provide a free user support option.
- It is recommended for public APIs to have a service status page providing information about e.g. historic uptime.
- In case of write services, a sandbox or user acceptance test environment to allow users to contribute and test changes or to trial a service should be provided.
- It is recommended for sandbox environments to indicate which part of the data is available with a clear policy on how to ‘reset’ the data.
- It is recommended to make a clear distinction between data that is ‘public’ and data which (still) is under restricted access.
- It is recommended to use a framework for service testing (such as JMeter).
- In general, it is recommended to provide well tested services which build trust.
- It is recommended to test performance of the service.
- For data services (where sensible), a full dump of the (open) data served through the API should be deposited in a trusted data repository at regular intervals (e.g. once a year).
- It is recommended to include a fair use policy that describes when a service may be throttled to protect availability for other users.
- It is recommended to protect personal data according to the Code of Conduct for Service Providers, a common standard for the research and higher education sector.
Technology and standards
- Invest in standards compliance and work with organisations and communities to enhance existing standards or develop new ones.
- It is recommended to provide data that adheres to the FAIR principles (having a PID, detailed metadata, data usage license, etc).
- For additional vocabularies that are in use with standards, it is recommended to have PIDs for the terms with their corresponding term descriptions.
- For additional vocabularies that are in use with standards, it is recommended to have a clear process in place for proposing additional terms.
- It is recommended to provide metrics that give credit to people (e.g. data providers) for work on standard compliance and development.
- Whenever possible, APIs providing biodiversity data need to use terms defined by the TDWG standards (e.g. DwC, ABCD, Audubon Core), provided that these terms are exact matches.
- It is recommended to give preference to using DwC terms when similar alternatives in other standards exist.
- It is recommended to declare the namespaces with the terms. Terms from existing standards and vocabularies should be prefixed with their respective namespace abbreviations. Those prefixes should be defined and referenced accordingly. This allows for both humans and machines to easily identify linked terms.
- Data needs to be provided in UTF-8 encoding if possible.
- Data needs to be provided in a structured format.
- It is recommended to provide at least a JSON serialisation.
- It is recommended to provide JSON as JSON-LD where it makes sense to do so, to conform to Linked Open Data. JSON-LD is often not suitable for the biodiversity datasets themselves, but it can still be used to format request and response metadata.
- It is recommended that JSON responses are formatted following a published set of best practices such as those established by IIIF or OGC and that the design is consistent with it. For relational data it is recommended to use JSON:API, which specifies that endpoints are named as nouns rather than verbs. JSON:API is more a schema than a set of best practices though.
- It is recommended to serve data as ‘flat’ as possible, e.g. having at maximum two levels of nesting in JSON responses.
- It is recommended to support staged (‘chunked’) or queued (asynchronous) upload or download of very large files (where appropriate).
- RESTful services need to properly use/recognize HTTP headers for requests and responses and return correct HTTP response codes accompanied with meaningful information in a human readable format. The API should return the status codes that cover all erroneous response types.
- The desired response format should be specified through request headers (e.g. the Accept header) rather than by expressing it in the URI.
- It is recommended that services provide content negotiation with at a minimum a serialisation in HTML for users (default) and in JSON for machines.
- RESTful services requiring authentication need to provide access through HTTPS.
- It is recommended to always provide and require access through HTTPS rather than HTTP.
- It is recommended to use authentication only when really needed, such as for throttling or security.
- If authentication is required, it is recommended to provide it through OAuth2, e.g. through a token instead of the API getting the user’s email address or password. There is a cost though: using OAuth2 can allow the third party provider access to the API activity.
- RESTful service URIs need to indicate that they are part of an API either via a subdomain or a URL segment.
- It is recommended that endpoints follow the naming conventions as specified in JSON:API and/or OGC API Specification.
- For discoverability, the API needs to be described such that the description can be indexed and found by search engines.
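The content negotiation recommendations above (format selected through request headers, HTML as the default for users, JSON for machines, and a correct status code otherwise) can be sketched as follows. This is a simplified illustration rather than a complete implementation of HTTP content negotiation: it honours `q` values and wildcards but ignores the specificity rules of the HTTP specification. The function names and the set of supported types are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Serialisations offered by the hypothetical service; the first is the default.
SUPPORTED_TYPES = ("text/html", "application/json")


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_type, quality) pairs, best first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not acceptable"
        entries.append((media_type, quality))
    # The sort is stable, so entries with equal quality keep header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(accept_header: Optional[str]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, chosen media type) for a request's Accept header."""
    if not accept_header or not accept_header.strip():
        return 200, SUPPORTED_TYPES[0]  # no preference stated: serve HTML
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type in SUPPORTED_TYPES:
            return 200, media_type
        if media_type == "*/*":
            return 200, SUPPORTED_TYPES[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in SUPPORTED_TYPES:
                if candidate.startswith(prefix):
                    return 200, candidate
    return 406, None  # 406 Not Acceptable: nothing the client asked for is offered


if __name__ == "__main__":
    print(negotiate("application/json"))
    print(negotiate("application/xml"))
```

In a real service the 406 response would also carry a short human-readable message listing the available serialisations.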
Versioning of APIs and their data
- API services should have an explicit version history with documentation about changes.
- For REST-style APIs it is recommended to include the (major) version number in the URL path of the access point.
- Data(sets) provided by API services need to have version information in their metadata, with a last modified timestamp or the date when the data was retrieved as a minimum.
- It is recommended to use an ISO 8601 date for last modified timestamps.
- It is recommended to explicitly define the resources the API is built upon and to indicate whether and how they are updated.
- Production versions of an API should be stable. APIs should rarely change as this may break existing implementations.
- It is recommended not to change API endpoints. API endpoints should remain persistent; deprecation should be done through a versioning process where previous versions are preserved and the latest version is posted at a ‘versioned’ URL.
- There should be a documented strategy for keeping older API versions online, e.g. with a deprecation calendar or schedule.
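As a minimal sketch of two of the versioning recommendations above, the following example derives a URL path containing the major version number and builds dataset metadata with an ISO 8601 last modified timestamp. The version string, the `/api/v{major}/` path layout and the metadata field names are assumptions chosen for illustration, not a prescribed scheme.

```python
from datetime import datetime, timezone

API_VERSION = "2.3.1"  # hypothetical version of the service


def versioned_path(resource: str, api_version: str = API_VERSION) -> str:
    """Build a URL path that carries only the major version number."""
    major = api_version.split(".")[0]
    return f"/api/v{major}/{resource.strip('/')}"


def dataset_metadata(dataset_id: str, last_modified: datetime) -> dict:
    """Return minimal version metadata with an ISO 8601 timestamp in UTC."""
    if last_modified.tzinfo is None:
        # A naive datetime would make the timestamp ambiguous for consumers.
        raise ValueError("last_modified must be timezone-aware")
    return {
        "id": dataset_id,
        "apiVersion": api_version_label(),
        "lastModified": last_modified.astimezone(timezone.utc).isoformat(
            timespec="seconds"
        ),
    }


def api_version_label() -> str:
    """Expose the full version so clients can record what they retrieved from."""
    return API_VERSION


if __name__ == "__main__":
    print(versioned_path("occurrences"))
    print(dataset_metadata("dataset-1", datetime.now(timezone.utc)))
```

Keeping only the major version in the path means that minor, backwards-compatible releases do not change endpoint URLs, which fits the recommendation that endpoints remain persistent.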
Bi-directional linking between infrastructures
- To enable bidirectional linking between infrastructures, resolvable PIDs need to be implemented for the data objects and provided through the APIs.
- If direct linking cannot be supported between infrastructures, then it is recommended to use data brokers like Wikidata to store links. Open linkage brokers provide a simple way to allow two-way links between infrastructures, without having to co-organize this between many different organisations.
- It is recommended to store created bi-directional links at both infrastructures between which the linkages are made.
- It is recommended to provide provenance of the created linkages, for example, who/what made the link, why and when.
- APIs need to provide or accept as input identifiers traditionally held by other relevant organisations, including legacy identifiers where possible.
- It is recommended to provide provenance information about established links in such a way that a data supplier can discover which of their data got enriched by linkages.
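The recommendations on storing links at both infrastructures and recording their provenance can be sketched with a small data structure. The record layout below (who or what made the link, why, and when) and all identifiers are hypothetical; the example only shows how one link and its inverse could be derived from the same provenance, and how a data supplier could find out which of its objects were enriched.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LinkRecord:
    """A directed link between two PIDs together with its provenance."""

    source_pid: str
    target_pid: str
    created_by: str  # who or what made the link
    reason: str      # why the link was made
    created_at: str  # when, as an ISO 8601 timestamp

    def inverse(self) -> "LinkRecord":
        """Return the same link seen from the other infrastructure."""
        return LinkRecord(
            self.target_pid, self.source_pid,
            self.created_by, self.reason, self.created_at,
        )


def make_link(source_pid: str, target_pid: str, created_by: str,
              reason: str, when: Optional[datetime] = None) -> LinkRecord:
    """Create a link record, timestamped now (UTC) unless a time is given."""
    when = when or datetime.now(timezone.utc)
    return LinkRecord(source_pid, target_pid, created_by, reason,
                      when.isoformat(timespec="seconds"))


def enriched_objects(links: Iterable[LinkRecord], pid_prefix: str) -> List[str]:
    """List the supplier's PIDs (matched by prefix) that have outgoing links."""
    return sorted({link.source_pid for link in links
                   if link.source_pid.startswith(pid_prefix)})


if __name__ == "__main__":
    link = make_link(
        "https://specimens.example.org/1",   # hypothetical specimen PID
        "https://sequences.example.org/A1",  # hypothetical sequence PID
        created_by="matching-service",
        reason="accession number cited in material citation",
    )
    # Each infrastructure stores the direction that starts at its own object.
    print(link)
    print(link.inverse())
```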
API design and naming conventions
- APIs should provide predictable and consistent behaviour, following the established best practices for this. It is recommended to never alter standard HTTP headers.
- It is recommended to validate your response structures against a schema (where applicable).
- It is recommended to use nouns instead of verbs in paths.
- It is recommended to use easy-to-understand path elements in English; sometimes it may be beneficial to also distinguish singular from plural, e.g. /occurrences?query=… returning a list of occurrences versus /occurrence/123 returning one occurrence with id=123.
- It is recommended that the technology used to produce the endpoints is hidden (e.g. /search vs /search.php).
- It is recommended to make the API RESTful, i.e., to implement GET, PATCH, PUT, POST, DELETE and HEAD where relevant. Further RESTful API design recommendations should be consulted for more information.
- It is recommended to accommodate and use 301, 302, 303 redirects when possible and appropriate.
- Request formats should be implemented in a non-ambiguous way.
- It is recommended to enforce strict validation rules for request parameters and give hints if the validation fails. Validation errors should be returned in both machine and human readable format with human readable instructions on how to rectify the error.
- It is recommended to use a consistent request structure.
- It is recommended to use only key value pairs for query parameters where possible.
- The API provider should recommend how to atomise your data before you send individual requests.
- The API provider should provide clear and identifiable responses.
- It is recommended not to repeat response bodies. Each request should return a unique set of information in a response body. Two or more requests that return the same response body should be avoided.
- It is recommended to include PIDs or GUIDs in response bodies where appropriate and within the proper context.
- It is recommended to avoid deeply nested responses.
- It is recommended to use HTTP response codes and, in addition, to show errors in human-readable text that provides some context (except for context that would expose sensitive information, e.g. to attackers), and to provide a verbose option where relevant.
- If a client requests a content type, it is recommended to return that content type.
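The recommendation on strict validation with errors in both machine-readable and human-readable form can be sketched as follows. The parameter names (`scientificName`, `limit`), the upper bound and the error codes are assumptions made for this example; the point is that every error carries a stable code for clients and a message telling a person how to rectify the request.

```python
from typing import Dict, List, Tuple

ALLOWED_PARAMETERS = ("limit", "scientificName")  # hypothetical parameters
MAX_LIMIT = 1000  # hypothetical upper bound for page size


def validate_query(params: Dict[str, str]) -> Tuple[int, dict]:
    """Validate query parameters and return (HTTP status, response body)."""
    errors: List[dict] = []

    # Strict validation: reject parameters the API does not define.
    for name in sorted(set(params) - set(ALLOWED_PARAMETERS)):
        errors.append({
            "code": "unknown_parameter",  # machine-readable
            "parameter": name,
            "message": (  # human-readable, with a hint on how to fix it
                f"Parameter '{name}' is not supported. "
                f"Use one of: {', '.join(ALLOWED_PARAMETERS)}."
            ),
        })

    if "limit" in params:
        raw = str(params["limit"])
        valid = raw.isascii() and raw.isdigit() and 1 <= int(raw) <= MAX_LIMIT
        if not valid:
            errors.append({
                "code": "invalid_value",
                "parameter": "limit",
                "message": (
                    f"'limit' must be a whole number between 1 and {MAX_LIMIT}; "
                    f"received '{raw}'."
                ),
            })

    if errors:
        return 400, {"errors": errors}  # 400 Bad Request
    return 200, {"query": dict(params)}


if __name__ == "__main__":
    print(validate_query({"limit": "0", "colour": "red"}))
    print(validate_query({"scientificName": "Puma concolor", "limit": "10"}))
```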
Some useful tips for developers
The following suggestions might considerably improve API-based web services:
- To test the API, it would be helpful to include an option during development for GET queries to include the query or request parameters in the response.
- Headers should be included in responses in test mode.
- Other best practice documentation for API developers should also be considered (response codes, RESTful design).
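One possible reading of the first two tips is a test mode in which the response echoes the query parameters and headers the server received. The sketch below illustrates that reading; the `debug` field, the function name and the decision to leave out credential-bearing headers are assumptions for this example and not part of the original recommendations.

```python
from typing import Dict, Mapping

# Headers that are never echoed back, to avoid leaking credentials
# (an assumption of this sketch).
SENSITIVE_HEADERS = {"authorization", "cookie", "proxy-authorization"}


def build_response(data: object,
                   query_params: Mapping[str, str],
                   request_headers: Mapping[str, str],
                   test_mode: bool = False) -> Dict[str, object]:
    """Build a response body; in test mode, echo the request for debugging."""
    body: Dict[str, object] = {"data": data}
    if test_mode:
        body["debug"] = {
            "query": dict(query_params),
            "headers": {
                name: value
                for name, value in request_headers.items()
                if name.lower() not in SENSITIVE_HEADERS
            },
        }
    return body


if __name__ == "__main__":
    print(build_response(
        data=[],
        query_params={"scientificName": "Puma concolor"},
        request_headers={"Accept": "application/json",
                         "Authorization": "Bearer example-token"},
        test_mode=True,
    ))
```

Echoing the request only during development keeps production responses small and avoids exposing details of the request handling.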
Suggestions to improve bi-directional linking for further exploration
While discussing best practices for findability, re-use and accessibility of infrastructures, several suggestions have been made for further investigation towards development of bi-directional linking between infrastructures. These are listed here for completeness:
- Options for using CETAF specimen identifiers when citing data through services like GBIF should be explored.
- Author guidelines on “How to cite specimens” (hyperlinked specimen IDs) when developed should be tested in pilot journals.
- The ways DiSSCo and INSDC/ENA could harvest and link back to literature citations of specimens and sequences in their infrastructures should be explored.
- Users may want to obtain the links and PIDs to a taxon name and all its synonyms when they search for it, especially when different RIs use different taxonomic backbones (for example, when BOLD submits sequences to INSDC and the taxon names have to be matched).
- Material citations which contain sequences should be used to link these to the treatment name. Accession numbers mentioned within a material citation could provide a link between the specimen and the sequence taken from it. Links between specimens and sequences can also be managed in collection management software.
- It would be beneficial if a reference treatment could be added to each identification of a sequence.
- Elaborating and publishing annotated guidelines on how to publish material citations and tables that include accession codes should be considered.
- COL taxon names should be mapped together with the context in which they are described/detected (higher taxon categories), to promote usage by ecologists.
- It may be useful to develop a semantic, preferably event-based, model for linking, which should take into account legacy data from historical collections.
- It would be beneficial if standard identifiers for taxon names (or taxonomic concepts) such as those provided by COL, are used for specimens from different locations and if taxon names are always linked to a verified taxon concept.
- Taxonomic treatments should also be cited along with taxon names since treatments connect a name (unique) or concept (unique) to specimens (multiple), establishing the unity between them.
- It would be useful to include type specimen information in taxon name aggregators such as COL, and for collections to provide the typified name of type specimens in their GBIF dataset.
- Online writing tools and publishing systems such as ARPHA should implement strict journal editorial policies and instructive guidelines to ensure that specimen PIDs are inserted by the authors.
- The visibility of the nomenclators (e.g. IPNI, or ZooBank) in COL should be raised and the addition of the links to the type specimens which are captured in the nomenclator databases should be explored.
- It would be good to harmonise the GBIF and COL name matching services and make these available for all datasets registered in checklistbank.org.
- When a newly described species-level taxon is introduced, it is highly recommended to link the record of a name in COL back to its primary publication source, treatment and holotype.
- It would be useful to have a mechanism to link a COL taxon name ID to the annual version and make it available via API to automatically hyperlink a taxon name to its status in COL, e.g. Taxon name ID + year of its citation = disambiguated citation of a taxon concept from a particular annual version of COL.
- It is recommended that PIDs contain metadata about how the identifier should be cited.
The information in this document is based on the BiCIKL report “Best practice manual for findability, re-use and accessibility of infrastructures”, authored by Wouter Addink (Naturalis Biodiversity Center, Leiden, Netherlands), Niki Kyriakopoulou (Naturalis Biodiversity Center, Leiden, Netherlands), Lyubomir Penev (Pensoft Publishers & Institute of Biodiversity and Ecosystem Research, Bulgarian Academy of Sciences, Sofia, Bulgaria), David Fichtmueller (Botanic Garden and Botanical Museum Berlin Dahlem, Berlin, Germany), Ben Norton (North Carolina Museum of Natural Sciences, Raleigh, NC, US), David Shorthouse (Agriculture and Agri-Food Canada, Ottawa, Ontario, Canada). The authors would like to thank everyone who reviewed and commented or otherwise contributed to this document. We would like to thank project partners who participated in the technical RI forum discussions towards this document, TDWG members who provided input and external experts, in particular Nicky Nicolson, Franck Michel and Sam Leeflang.