Data infrastructure | Emodnet Biology

Data infrastructure

The data infrastructure of EMODnet Biology can handle different data protocols and data standards for the exchange of marine biodiversity data.

  • Biodiversity data can be made available using the Darwin Core standard, which is used by the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). This biogeographic data scheme handles data on the annual, seasonal, and spatial distribution of species composition and occurrence, as well as abundance and biomass data from the water column and the seabed. The Integrated Publishing Toolkit (IPT), a freely available open source web application built on the Darwin Core standard, makes it easy to share biodiversity-related data and information with the EMODnet portal. The data portal also supports Distributed Generic Information Retrieval (DiGIR), a web protocol for distributed data retrieval across remote biological collection databases that is used extensively within the EurOBIS distributed database system.
    The addition of a new extension, “MeasurementsOrFacts”, which conforms to the corresponding Darwin Core format, will allow EurOBIS, and with it EMODnet, to capture biological measurements and related abiotic data, such as length and weight information of taxa, stomach content data, or the sediment composition at the time of sampling. This was previously not possible. MeasurementsOrFacts is the category of information pertaining to measurements, facts, characteristics, or assertions about a resource (an instance of a data record, such as Occurrence or Taxon). The extension includes the terms measurementID, measurementType, measurementValue, measurementAccuracy, measurementUnit, measurementDeterminedDate, measurementDeterminedBy, measurementMethod, and measurementRemarks. The terms measurementType, measurementUnit, and measurementMethod will be standardized to allow common searches of measurements across datasets and taxa.
  • A specific data format has been set up that enables National Oceanographic Data Centres (NODCs) to make biological data accessible through the SeaDataNet infrastructure. It is a general, higher-level format that does not necessarily contain all specifics of each data type, but focuses on the information elements common to marine biological data. At the same time, the format is flexible and extensible enough to be applicable to at least part of the variety of biological data. The data format has been published as SeaDataNet Deliverable D8.4b.
  • The geospatial data can be made accessible through the EMODnet biological portal using several OGC web services. These services provide an interface to the portal that allows requests for geographic “resources” across the web using platform-independent calls. Besides the Catalogue Service for the Web (CSW), which will enable access to metadata resources, three other OGC web services will be implemented: 1) Web Feature Service (WFS), which allows requests for geographic features across the web and can be used to transfer the vector data on the portal; 2) Web Coverage Service (WCS), which allows requests for gridded data across the web and can be used to transfer the raster data on the portal; 3) Web Map Service (WMS), which allows requests for maps across the web and can be used to transfer the GIS maps on the portal.
  • In specific cases, large data providers that use web services developed in house are also able to deliver data. In these cases, the available data are examined in detail and mapped to the Darwin Core scheme, so that as much data and information as possible is captured.
  • The data infrastructure of EMODnet Biology is based upon the infrastructure and data flow developed under EurOBIS. Data submitted to EurOBIS go through a series of quality control procedures before being made available online.
    • Metadata: the data management team checks whether the data and the supplied metadata match and whether all necessary metadata fields are filled in correctly and as completely as possible. If important information is missing, a notification is sent to the data provider asking them to complete the metadata.
    • Required data fields: if the required data fields are not properly filled, a notification will be sent to the data provider. These records will not be uploaded until the required fields are completed.
    • Taxonomy: all taxon names are linked to the World Register of Marine Species (WoRMS). Unmatched taxa are sent back to the data provider for a second check. Taxa with uncertain identifications are matched to the first suitable higher taxonomic level. The originally provided taxon names are stored in the database, so the original name can always be retrieved. When no match can be made, the taxon name is added to an 'annotation list', which keeps track of the editors' comments on why a taxon cannot be added to the World Register of Marine Species (see also the section on standards and quality control).
    • Geography: all supplied coordinates are converted to the WGS84 coordinate system and expressed as decimal degrees. Next, the coordinates are checked for possible positioning errors, such as sampling locations on land or in a different region from the one given in the supplied metadata. These errors can be caused by accidentally swapping latitude and longitude or by a wrong minus sign. Any possible errors or doubts are communicated to the data provider, so that the necessary corrections can be made.
    • Depth: two checks are performed on depth: (1) is the documented depth value possible when compared with the General Bathymetric Chart of the Oceans (GEBCO), and (2) is the documented depth value possible when compared with the known depth range of the species?
    • Units: if abundance and/or biomass data are supplied, the presence of the corresponding units is checked. If these units are missing, the data cannot be used in comparisons between different datasets.
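
The geography, depth, and units checks described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the EurOBIS implementation: the names BBox, check_position, check_depth, and check_units, and the record keys such as abundance_unit, are hypothetical. The seabed depth is assumed to have been looked up beforehand (for example from GEBCO) and is passed in as a plain number, and the "on land" check is left out because it needs coastline data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BBox:
    """Region stated in the dataset metadata, in WGS84 decimal degrees (hypothetical type)."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def check_position(lat: float, lon: float, bbox: BBox) -> list[str]:
    """Flag a position outside the metadata region and guess a likely cause."""
    if bbox.contains(lat, lon):
        return []
    issues = ["position outside the region given in the metadata"]
    if bbox.contains(lon, lat):
        issues.append("latitude and longitude may be swapped")
    elif (bbox.contains(-lat, lon) or bbox.contains(lat, -lon)
          or bbox.contains(-lat, -lon)):
        issues.append("possible minus-sign error")
    return issues


def check_depth(depth_m: float, seabed_depth_m: float,
                species_range_m: Optional[tuple[float, float]] = None) -> list[str]:
    """Compare a sampling depth with the local seabed depth and a species depth range.

    Depths are positive metres below the surface; seabed_depth_m is assumed to
    come from a bathymetry lookup done elsewhere.
    """
    issues = []
    if depth_m < 0 or depth_m > seabed_depth_m:
        issues.append("depth is not consistent with the bathymetry")
    if species_range_m is not None:
        shallowest, deepest = species_range_m
        if not shallowest <= depth_m <= deepest:
            issues.append("depth outside the known range of the species")
    return issues


def check_units(record: dict) -> list[str]:
    """Flag abundance or biomass values that come without a unit."""
    issues = []
    for field in ("abundance", "biomass"):
        if record.get(field) is not None and not record.get(f"{field}_unit"):
            issues.append(f"{field} supplied without a unit")
    return issues


# Example with a made-up metadata region and a record whose coordinates are swapped.
region = BBox(min_lat=51.0, max_lat=52.0, min_lon=2.0, max_lon=4.0)
print(check_position(3.1, 51.4, region))
print(check_units({"abundance": 12}))
```

Each function returns a list of messages rather than raising an error, which mirrors the workflow in the text: doubts are collected and communicated to the data provider instead of being corrected silently.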