How To Deliver My Data?

ESPON M4D

Revision History
Revision 2svn rev 2420 build 2014-12-19 17:33:22

Abstract

This technical report proposes a description of the expectations for the integration of different ESPON projects data.


Table of Contents

Introduction
1. The Key Indicators
1.1. Concepts behind the key indicators delivery
1.1.1. Rule 1 – Limited number of indicators
1.1.2. Rule 2 – Innovative indicators
1.1.3. Rule 3 - High level of metadata
1.1.4. Rule 4 - Promote the core database strategy
1.1.5. Rule 5 - A good completeness of the indicator
1.2. Key Indicators Delivery
1.2.1. ESPON Data and Metadata Specifications
1.3. The Data Delivery Process
1.3.1. Automatized Checks
1.3.1.1. Online Checks at Upload
1.3.1.2. Offline Syntactic Check
1.3.2. Semantic check
1.3.3. Quality control
1.3.4. Integration into the database
2. The Case-study delivery
2.1. The Case-study Delivery Strategy
2.2. Expected delivery
2.2.1. Data file
2.2.2. Case-study data and metadata
2.2.2.1. The dataset sheet
2.2.2.2. The indicator sheet
2.2.3. Geometry file (optional)
2.3. What happens to my data? The case-study integration process
3. The Background Data of the Database
3.1. Strategy for the Background Data
3.2. Expected delivery
3.2.1. How structuring the background data .zip file? General recommendations
3.2.2. A .xls template available for filling quickly background metadata
3.3. What happens to my Data?
4. Resources of interest and advices for data creation in ESPON
4.1. Resources of interest
4.1.1. XLS Template with Examples
4.1.2. Frequently Asked Questions (FAQ)
4.1.3. M4D Presentation (ESPON Seminar in Cyprus, December 2012)
4.1.4. M4D Newsletters
4.2. Advice for a perfect management of the data process
4.3. A good practice for filling data and metadata
A. References
B. Data Flow Process of the Key Indicators
B.1. Syntactic Check
B.2. Semantics Check
B.3. Semantic Check Approval
B.4. Outliers Check
B.5. Outliers Check Approval
B.6. ESPON CU Approval
C. About

List of Figures

1. Three possibilities to deliver my data
1.1. Header of the ESPON Data and Metadata Specification
1.2. Excel Data Model for Key indicators
1.3. Example of the Label field description
1.4. Dataset Integration Tracking Details
1.5. Syntactic check: example of an invalid input
1.6. Spatial check: example of invalid spatial units
1.7. Syntactic check: example of a valid input despites warnings
1.8. Indicator code/name/abstract Triplet Conflict
1.9. Access to the executable version of the Syntactic check
1.10. Execute the DatasetCheck.jar
1.11. Example of a Semantic Check Report
1.12. Semantic Check Example: Input Information
1.13. References for a Semantic Check Expertise
1.14. Semantic Check Example: Fixed Information
1.15. Outlier check output
2.1. Overview page of Case Studies
2.2. Information Page of a Case Study
2.3. Data Model Example for Zoom-in projects
2.4. Case Study Dataset Sheet
2.5. Several strategies for locating the flag of a case-study
2.6. Case Study Indicator Sheet
2.7. Example of a Case Study Geometries Input
2.8. Mapping of Geometries Codes in Data
2.9. Case-study dataset file upload page
2.10. Case-study geometry file upload page
2.11. Tracking tool activation and database administrator functions
2.12. Case-study integration
3.1. Background data part of the ESPON Database Portal
3.2. Project Database Metadata Sheet
3.3. Background Data upload
4.1. On-line availability of the xls templates
4.2. Header of the FAQ
4.3. Starting point: a table with empty values
4.4. Resulting dataset with estimated values and associated labels
4.5. Description of the label 1 in the metadata
4.6. Description of the label 13 in the metadata
4.7. Description of the label TE6b in the metadata
B.1. Upload Form
B.2. Tracking Overview: SYNTAX_CHECKED checked
B.3. Semantics check form
B.4. Tracking Overview: SEMANTICS_CHECKED state
B.5. Dataset details: SEMANTICS_CHECKED state
B.6. Form for the semantics check approval
B.7. New state for the dataset: "SEMANTICS_ACCEPTED"
B.8. Outliers check form
B.9. New state for the dataset: OUTLIERS_CHECKED
B.10. Tracking Overview: OUTLIERS_CHECKED state
B.11. Dataset details: OUTLIERS_CHECKED state
B.12. Form for the outliers check approval
B.13. New state of the dataset: OUTLIERS_APPROVED
B.14. Tracking Overview: OUTLIERS_APPROVED state
B.15. Dataset Details: OUTLIERS_CHECKED state
B.16. Tracking - Step 4 - ESPON CU Approval Form
B.17. Tracking - Dataset Details Page After an Integration Attempt

Introduction

This document contains many references to the Database Administrator, who is the contact person for any question regarding the ESPON Database Portal. Until December 2014, tha administration of the ESPON Database portal is ensured by the M4D manager: . At the end of the M4D activities, in January 2015, this will be managed by the ESPON Coordination Unit:

The ESPON Database 1 project (2008-2011) has experienced a lot of difficulties to overcome the heterogeneity of information provided by ESPON Projects (integration of local data, integration of sophisticated indicators with little metadata description…). In order to improve this non-sustainable situation, the M4D Project has tried to better define what is expected from ESPON Projects in terms of data deliveries.

The ESPON P1, P2 and P3 Projects are obliged to deliver all data collected and produced within their project. These data should be delivered in the form of three types:

  • Key indicators, covering the entire ESPON Space and for a limited number of territorial nomenclatures (NUTS, SNUTS, FUAs, MUAs, UMZ) ;

  • Case-study data, which does not cover the entire ESPON Space and/or is not described in the territorial nomenclatures of the Search Interface and/or is located out of the ESPON Space ;

  • Background data, covering all data produced by the project, whatever their format.

As you can see in Figure 1, the dataflow depends on the nature of the delivery.

Figure 1. Three possibilities to deliver my data

Three possibilities to deliver my data

This figure shows the three possibilities for data to be integrated into the database depending on the nature of itself.


This document proposes useful information for these three different types of ESPON projects:

  • THE KEY INDICATORS, further described in the chapter entitled The Key Indicators

    The key indicators are innovative indicators highly relevant for policy making and should cover the entire ESPON Space (EU27+4). These indicators will be the only ones searchable from the query interface.

    The ESPON projects deliver in principle the indicators related to the maps included in Part B of the (Draft) Final Report - around 10 indicators. In case a typology or composite indicator is included, the data and methodology used to build it should also be delivered.

    The requirement in terms of data and metadata is high for this delivery and the ESPON Projects are requested to upload the data via the Upload page

    The key indicators delivery has to follow the ESPON Data and Metadata specifications. To build a strong and efficient query interface, these indicators will be checked in depth before integration.

    This process includes three steps:

    1. Syntactic check, metadata format analysis (are all the mandatory fields filed?). This check is processed automatically when the dataset is uploaded in the ESPON Database Portal (more details in Section 1.3 of this document ;

    2. Semantic check, metadata content analysis (are the metadata understandable?), realised by the database administrator;

    3. Outlier detection, Outlier detection (are there unusual values in the dataset?), realised by the database administrator.

  • CASE-STUDY DATA, further described in the chapter entitled The Zoom-in Delivery

    Besides the key indicators delivery, some ESPON Projects (in particular for Targeted Analysis, but not only) analyze specific territories inside or outside the ESPON Area. To make this kind of complementary and very interesting data easy accessible, a case-study interface is available.

    To set up this interface, the projects are requested to deliver their most representative data, their geometries (in a shape file format, if adapted) and a documentation highlighting the content of the data and geometries (following a dedicated template).

    Regarding to this delivery, the database administrator will only checks if all mandatory fields of the documentation file are correctly filled. If geometries are provided, the database administrator checks if it possible to map the results (e.g., linkage between the geometry and the data codes).

  • BACKGROUND DATA, further described in the chapter entitled The Background Data of the Database

    In order to fill their contractual obligations and to make all data as a coherent set available, each ESPON Project has to deliver a zip file that contains all data, metadata and geometries (if different than the usual ones delivered via ESPON) used in the project.

    This zip file is considered as an annex to the final report of the project and is stored in the resource part of the ESPON Database Portal.

[Caution]

General considerations while uploading files to the portal

Whatever the type of the dataset is, the upload requires a maximum file size of 100 MB. Please contact the administrator for bigger files.

For interoperability, please avoid any space and accentuated characters in the uploaded dataset filename.

Chapter 1. The Key Indicators

The 10 best indicators delivery is probably the most restrictive one: taking into account that ESPON is a community where knowledge and material is shared, it needs to define some basics to ensure the harmonization of the ESPON identity. Of course, it concerns reports (50 pages maximum by report, following some required typographic styles), maps (following the map-kit template) or reporting (inception report, interim report(s), draft final report, final report). It also concerns data and metadata.

To be useful for ESPON projects and other end-users, data should always be accompanied by metadata, including information about their quality and sources. It is also particularly important that the metadata should be compliant with international (ISO) and European (INSPIRE) standards so as to ensure the use of the database in the longer-run and to make it compatible with other national and international database initiatives.

To ensure correct data processing and integration into the ESPON Database Portal, the ESPON Metadata Specifications provided by M4D project must be carefully respected by all the data providers participating to the project and by the organizations/persons who intend to create new software implementations interacting with the ESPON Database.

The ESPON Metadata is relatively complex, but quite complete.  As a result, the metadata creation in ESPON is a huge work BUT only concerns a limited number of indicators. It implies that TPGs should take into consideration at the very beginning of the implementation of the project.

In Section 1.1 of this chapter, we firstly describe the concepts behind the key indicators delivery, or "What shall I deliver?". In Section 1.2, we detail the available specifications regarding metadata description and nomenclature integration, "How shall I deliver my data?". Finally, Section 1.3 is dedicated to the data flow process, or "What happens to my data?".

1.1. Concepts behind the key indicators delivery

Before delivering the key indicators, four basic rules are to be kept in mind. The M4D Project has defined rules in order to give a common understanding of the future content of the ESPON Database and to avoid the integration of too much heterogeneous information. It is the unique way to propose a database that could be managed in the future. Four basic rules are described below with concrete situations of good or bad practices.

1.1.1. Rule 1 – Limited number of indicators

Each ESPON Project has to choose its most representative indicators covering all the ESPON Area at NUTS level (e.g. basically the indicators displayed on maps in the draft final report - without annexes). With this basic rule, we want to limit the discrepancy between projects, which deliver hundreds indicators (residuals of statistical models, generally not very well explained in metadata) and other projects, which deliver few indicators, embedded in a monstrous information flow. In general terms, we prefer to include into the database a single indicator with a real added value, rather than hundreds of indicators which may never be queried by users of the database.

[Note]

Good practices:

  • If the main result of the project is a typology, please provide all the indicators used for calculating it (e.g. if the typology is based on population, age group 20-39, age group 65+, natural population increase and net migration, deliver all these indicators if they are not already included into the database).

  • Deliver indicator that could be helpful for the ESPON Community in the future, e.g. policy-makers, researchers and practitioners.

[Important]

Bad practices:

  • Deliver GDP per capita and all its statistical derivates (which could be automatically calculated further): GDP per capita EU27=100, GDP per capita ESPON Area=100 etc.

  • Provide all the residuals of a complex statistical model.

1.1.2. Rule 2 – Innovative indicators

By the past, the ESPON M4D project has received ten indicators describing total population in 2006! This kind of figure makes the database impossible to use (which indicator to download?). This is why, in the key indicators delivery, we kindly ask the project to propose innovative indicators that are not yet into the ESPON Database.

[Note]

Good practices:

  • Before collecting data, look into the ESPON Database to see if the indicators you are looking for are not already available. In that order, it is possible to have a look to the database overview), summarizing all the indicators included in the ESPON Database.

  • If mistakes are detected in the ESPON Database, please notice the ESPON Database administrator and propose a revision of the dataset.

[Important]

Bad practices:

  • Deliver an indicator already contained in the database without explaining the added value of the indicator you propose (estimations of better quality, mistakes corrected).

1.1.3. Rule 3 - High level of metadata

The metadata related to indicators must be very well explained. If you propose indicators derived from statistical analysis or models, make sure your data is understandable by non-specialists users!

[Note]

Good practices:

  • Take time to correctly fill each field of the metadata model.

  • Reference all the sources you use to create your dataset. In that way, the user will be able to define which data is coming from official data sources (Eurostat, national statistical institutes, ...) and which one you have estimated. The total population 1990-2010 file, available in the ESPON Database is a good example of systematic description of the data source.

  • Make sure that it is possible to rebuild the indicator your propose in the database.  Use the methodology property (part 1.7.2 of the specifications [1]) to describe your calculation methodology.

  • Enclose to your data delivery methodological notes (field URI of the indicator description).

[Important]

Bad practices:

  • Put in the methodology field of the indicators: “cf Final Report for further explanations”…

  • Deliver indicators that will never be updated in the future without your TPG knowledge (e.g. composite indicators based on a data model which is your property and not diffusible).

  • In the source part of the metadata, mention your project as data provider (generally the dataset is a combination of data coming from Eurostat, national sources and estimations).

1.1.4. Rule 4 - Promote the core database strategy

Out of the key indicators, each project can suggest the inclusion into the "Core Database" of indicators of interest for territorial monitoring (time series, added value for the database), which could be updated and maintained in the future, out of your project.

[Note]

Good practices:

  • The M4D Project proposes total population at NUTS0, 1, 2 and 3 levels for the period 1990-2010. A good practice could be to extend this temporal coverage to the period 1980-2010.

  • The M4D Project proposes age structure data (5 years age-class) at NUTS 0, 1, 2 levels. A good practice could be to extend the hierarchical coverage to the NUTS3 level.

  • The M4D Project has collected total area and population for the UMZ (Urban Morphological Zones): Extend the thematic coverage to other indicators (Land Use, etc). 

[Important]

Bad practices:

  • Deliver a derived indicator (for example, unemployment rate) without delivering the count data behind this indicator (e.g. unemployed population and active population).

  • Deliver a dataset with a high number of missing values.

1.1.5. Rule 5 - A good completeness of the indicator

At the moment, the ESPON Database supports several nomenclatures: NUTS division in the 1995, 1999, 2003, 2006 and 2010 revisions for the ESPON Area at regional level ; UMZ, FUAs and MUAs for cities, SNUTS division for the regional neighbourhood of the ESPON Area . Whatever the nomenclature used, the degree of completeness of the indicator must be relatively good. Ideally, most of the missing values must be estimated with a description of the method used. In that order, a guidance paper has been written by the M4D project, proposing a set of estimation methods [2].

The key indicators concern Applied research projects (ESPON Priority 1) and projects from the Scientific Platform (Priority 3). For targeted analysis, most of the data will be integrated in the Case-Study interface (cf The Zoom-in Delivery).

[Note]

Good practices:

  • If Eurostat (main data provider) does not provide data for some territorial units of the 10 best indicators, look at external data sources (National Statistical Institutes) if the data exists.

  •  When no data is available, estimate it and refer systematically in metadata the methodology used for the estimation.

[Important]

Bad practices:

  •  Deliver data for three countries of the ESPON Area: In this case of figure, go to the Case-Study interface (part 2 of the Technical Report)

  • No description of the estimation made.  

1.2. Key Indicators Delivery

This section details the expected deliveries and available resources to fill ESPON Data and metadata.

In order to ensure an efficient way to create data and metadata in the ESPON format, the M4D Project has produced some useful guidance documents (available from the help menu of the ESPON Database Web site at http://database.espon.eu).

1.2.1. ESPON Data and Metadata Specifications

The document entitled ESPON Data and Metadata Specification [1], whose header is shown in Figure 1.1, is the reference document for the Key indicators datasets. It proposes a specification of the metadata model. Firstly, it describes the generic conceptual model of the ESPON Metadata (called as the Abstract Metadata Model). Secondly, it presents the implementation of the abstract model using the international standards (ISO-19115 and INSPIRE Directive). Finally, it explains the implementation of the abstract model in a tabular file format.

Figure 1.1. Header of the ESPON Data and Metadata Specification

Header of the ESPON Data and Metadata Specification

This figure shows the header of the on-line HTML document, available on the ESPON Database Portal [3].


Please find below some advices to use these specifications:

  • Do not be impressed by the 150 pages of the paper format document! From the user point of view, the first, the second and the third parts of the metadata model specifications explain in a different way (conceptually, in a xml version, in a tabular version, e.g. Excel) the same topic: description of all the fields of the ESPON Metadata model.

  • To begin with, we strongly advise you to carefully read the introduction of the Metadata specifications, explaining the main concepts and also the third part, showing the tabular model and all the fields to be filled with concrete examples.

  • Download the metadata template (requires login) from the "Upload" menu (see Figure 4.1). On the basis of this .xls document, fill your metadata. For example, Figure 1.2 shows how colors and comments in this template help at filling cells. When something is not clear, please refer to the metadata specifications: as an example, Figure 1.3 shows the description of the Label field.

Following Figure 1.2 and Figure 1.3 illustrate an example of a good practise by using the metadata specifications.

Figure 1.2. Excel Data Model for Key indicators

Excel Data Model for Key indicators

I want to reference my data. First of all, I want to know what kind of information is mandatory. On the right part of each cell, a description box (in red on the figure) helps me to answer to this question. Each cell colored in green needs to be filled.

When going to the source part of the metadata template, I do not understand the meaning of the label field (in orange). When looking on the right part of the cell, one can see that this element is described in the part 1.6.1 of the Specifications. When going to the ESPON Metadata specifications, shown in Figure 1.3, the label property gives a full description of the element.


Figure 1.3. Example of the Label field description

Example of the Label field description

This figure is an extract of the specification. It shows the description of the Label field.


In case of doubts, the use of pre-filled templates with concrete examples is especially useful (cf Section 4.1).

1.3. The Data Delivery Process

This section aims at responding to the following question: "What happens to my data?"

The data integration process aims to apply a very steady quality control of datasets delivered by ESPON projects. This process is divided in 5 steps. When the TPG integrates its key indicators, he activates a dedicated module in the ESPON Data Portal ("Upload" menu): the Tracking Tool.

The tracking tool is being developed to follow the state of advancement of the data integration process (Figure 1.4). Please note that this tool requires to be logged in. For further information about the integration workflow (Who? When? etc), please consult Data Flow Process of the Key Indicators.

Figure 1.4. Dataset Integration Tracking Details

Dataset Integration Tracking Details

This screen allows to consult details on the achieved and pending activities concerning the dataset integration. The "Semantics" and "Outlier detection" reports of this dataset are available here.


The data integration process is composed of the main steps that are described in the following sub-sections.

1.3.1. Automatized Checks

1.3.1.1. Online Checks at Upload

When uploading a Key Indicator Dataset to the ESPON Database Portal, a first check consists in checking the syntax of the dataset. This syntactic check verifies if the dataset is well-formed, if all mandatory fields are filled, to be short, if the dataset is compliant with the metadata and data specifications.

[Note]

This syntactic check can also be performed offline with the DatasetCheck.jar software, see Section 1.3.1.2.

Since 2014, the syntactic check performed at upload has been completed with:

  • a spatial check: all spatial units referenced in the sheet entitled Data must belong to the nomenclature defined in the sheet entitled Dataset, and this nomenclature must be available in the ESPON Database.

  • a "code/name/abstract" indicator triplet check: for each indicator contained in the dataset, this step aims at avoiding conflicts with already integrated indicators. Thus, the "code/name/abstract" triplet is valid in the following cases:

    • the combination of the code AND the name AND the abstract of the given indicator is identical to an existing triplet in database.

    • the code AND the name of the given indicator are not already assigned to an existing indicator in the database.

    In case of an invalid triplet, the error is displayed in the "log box", as shown in Figure 1.8.

This set of controls aims at avoiding the most frequent errors, which previously used to be detected at the last integration step. They are automatically done when the project uploads its datasets from the "Upload" menu of the Web application. This is the only compulsory step of the data integration process. Once successfully checked, the dataset is saved on the server. A notification is sent to ESPON CU for the next step.

The syntactic check step is performed on all uploaded datasets. As shown in Figure 1.5, the page displays all the necessary information to fix eventual syntactic errors or warnings. Three types of messages are displayed in the log boxes:

  • INF prefix indicates an information message, e.g. some information about the syntactic check process.

  • WRN prefix indicates a warning message. Warning messages are triggered for ambiguous values that may be problematic during the next steps of the integration. Nevertheless, warning messages do not make the syntactic check fail. As shown in Figure 1.7, the TGP is invited to eventually review his dataset, though he can also submit it to the semantic check.

  • ERR prefix indicates an error message. Error messages refer to missing values or errors in mandatory fields of the metadata. These errors constraint the user to review his dataset that can no pass this step and continue the integration process.

    Besides the data and metadata syntactic check, spatial units must be available and consistent with the available nomenclatures in the ESPON Database. Figure 1.6 shows an example of a valid data/metadata syntax, but an invalid set of spatial references.

Figure 1.5. Syntactic check: example of an invalid input

Syntactic check: example of an invalid input

This screen shows the information messages (prefixed with [INF]), warning ([WRN]) and error messages ([ERR]) returned by the syntactic parser. Example:

1 WRN No value found for the indicator 'IXP'. Skipping data validation for this indicator.
2 ERR The 'Temporal Extent' property is null.
3 ERR The 'Dataset Information' element is not valid.
4 ERR The 'Temporal Reference' element is not valid.
5 ERR Unable to check the global temporal extent, because it is null.
6 ERR The 'Temporal Reference' property is not valid.
7 ERR The 'Temporal Reference' property is not valid.
8 ERR The 'Lineage' property is not valid.


Figure 1.6. Spatial check: example of invalid spatial units

Spatial check: example of invalid spatial units

This screenshot shows the example of an uploaded Key Indicator dataset with a valid data/metadata syntax. Nevertheless, the spatial check failed, seven spatial units referenced in the sheet "Data" are invalid according to the defined nomenclature in the sheet "Dataset" (here, NUTS 1999).


Figure 1.7. Syntactic check: example of a valid input despites warnings

Syntactic check: example of a valid input despites warnings

This screen shows that the uploaded file is valid (no errors) but still contains warnings. The user can pass this step or fix the dataset by clicking respective buttons at the bottom of the page.


Figure 1.8. Indicator code/name/abstract Triplet Conflict

Indicator code/name/abstract Triplet Conflict

This screenshot shows an example of an invalid Key Indicators dataset upload. It highlights the log error message displaying the filename of the dataset that is responsible of a code/name/abstract indicator conflict. In this uploaded dataset, an indicator has for code "WPFNF8PC". Yet, an indicator already exists in the ESPON Database, its properties are following:

  • CODE: "WPFNF8PC"

  • NAME: "Forest and non-forest area wind energy potential per capita at 8 c/kWh"

  • ABSTRACT: "Wind energy potential per capita (GWh/year/inhabitants) in forest and non-forest areas at a price of 8 c/kWh."

Consequently, the new dataset can not define the same indicator for the same period and statistical units. The error log message also displays the origin of the integrated dataset containing this indicator: "2014-09-18-08-23-44_GREECO_GREECO_WP_N06_2_2009_syntaxChecked.xls". The administrator can remove this old dataset or fix the new version.


1.3.1.2. Offline Syntactic Check

Before uploading a given dataset in the tracking tool, it is generally interesting to test the syntactic validity of the file produced. The DatasetCheck software has been created for that purpose and is currently available as a Java standalone application executable jar file. For cross-platform compatibility issues, this software must be executed from the command line via aconsole like Dos on Windows, Terminal on Unix-based systems (Linux, Mac OS). Figure 1.9 and Figure 1.10 show how executing locally the syntactic check. More information is available in the guidelines document included in the .zip file downloaded. Please note that the local version of the syntactic check does not check the validity of the spatial units included in the dataset (cf previous sub-section).

Figure 1.9. Access to the executable version of the Syntactic check

Access to the executable version of the Syntactic check

The DatasetCheck is available in the central page of the upload part of the ESPON Database Portal. The .zip file includes a guidelines document and the DatasetCheck.jar itself.


Figure 1.10. Execute the DatasetCheck.jar

Execute the DatasetCheck.jar

The output of the DatasetCheck executable jar file is two .txt files (parsing and validation) where all the ERR, WRN and INF are mentioned. In this example, my dataset does not include any errors. It is ready to be uploaded on the ESPON Database Portal!


1.3.2. Semantic check

After the syntactic check step, the dataset is transferred to the database administrator to be checked semantically.

This step aims at analyzing the content of the data and metadata (and namely the free-text fields). The aim of this step is to analyze if all the indicators of the dataset are correctly described and understandable by a large public. The result of this expert check is achieved by the edition of a semantic report.

Note that this semantic report feedback does not forbid the data integration process, but the project is sollicitated to consult this report and to decide to follow up the integration process, or to fix his dataset according to this expertise.

An example of such a semantic report, filled with annotations, warnings and remarks, is shown in Figure 1.11.

Figure 1.11. Example of a Semantic Check Report

Example of a Semantic Check Report

This example semantic check report extract proposes annotations remarks and suggestions besides problematic cells.


This step is an expertise. In other terms, if the TPG is not able (or does not want) to correct his metadata, the dataset can be submitted to the next step of the integration process.

Following screenshots illustrate an example of the semantic check expertise performed by the M4D Team on a problematic dataset. Figure 1.12 shows the initially received information. Figure 1.13 shows the consulted documents to help at understanding and fixing the received information. Figure 1.14 shows proposal of correction returned to the TGP.

[Important]

The database administrator is not in charge of filling this kind of information! He/she supports you in the process but please make sure that your delivered indicators are understandable by external users!

[Important]

The semantic check is performed by the M4D project until the end of 2014. Afterwards, the ESPON Coordination Unit will manage this task.

Figure 1.12. Semantic Check Example: Input Information

Semantic Check Example: Input Information

This figure shows a lack of information in the initially received metadata. This kind of description (4-digit classes) is not enough to understand how the indicator has been build.


Figure 1.13. References for a Semantic Check Expertise

References for a Semantic Check Expertise

This figure shows the material available (TGP report) to complete the missing information.


Figure 1.14. Semantic Check Example: Fixed Information

Semantic Check Example: Fixed Information

This figure shows the fixed information returned to the TGP in the proposal of correction document.


1.3.3. Quality control

At this stage, an outlier detection is proceeded on the key indicators. The M4D Project carries out some checks on the data values themselves. Some of these checks can be simple. For example: an indicator whose metadata tells us it’s a percentage should have values between 0 and 100; counts should be positive integers. If the metadata states that the values of a typology are 1, 2, 3 and 4, then there should be no other values. Are any data values unexpectedly missing?

The M4D project carry out other more complex checks: are any values for an indicator extraordinary high or low? M4D examines indicators singly and, where appropriate, in groups. As many projects use spatial units in the NUTS hierarchy, we check whether data values are unusual compared with their neighbours. The ESPON metadata properties help us to proceed the appropriate tests: some indicators are ratios, some are counts, some are typologies, and some form time series. Actually, checks depends on the statistical property of the indicator.

When M4D finds an unusual or extraordinary value, M4D flags it. If a NUTS region has many unusual values, then M4D will seek assurance that these values are correct – often they are (major cities can be very different from surrounding settlements).

The output from the quality check is a report (cf Figure 1.15) which M4D share with ESPON TPGs as a data supplier. When all agree that the dataset is as correct as we can make it, it is it is finally loaded and made available into the ESPON Database (last step).

[Important]

This check is an expertise and must be considered as an added-value in the data integration process: ESPON TPGs can validate or not the result of the check after consulting this report.

[Important]

Note that the outlier check is ideally adapted to basic indicators (count data, ratios) but not necessarily to complex typologies or indexes resulting from a huge methodological background.

[Important]

The semantic check is performed by the M4D project until the end of 2014. Afterwards, the ESPON Coordination Unit will manage this task.

Figure 1.15. Outlier check output

Outlier check output

This figure shows the structure of an outlier check. This report underlines missing and anomalous values.


1.3.4. Integration into the database

Previous checks and steps of the dataflow give a strong expertise on the quality of the datasets delivered by ESPON projects. On the basis of all these reports, the database administrator makes the decision to integrate or not the dataset in the database.

After its integration into the database, it will be possible to dynamically query the database composed by the indicators of the dataset in the search interface of the ESPON Database Portal. If metadata are very well described, it gives a real added value to the indicators.

Chapter 2. The Case-study delivery

2.1. The Case-study Delivery Strategy

The ESPON M4D considers as a “Case-study delivery” a dataset that does not cover the entire ESPON Area (EU27+4). In practice, it includes several cases of figures:

  • Local data for a region or a group of regions (e.g. Greater Manchester at LAU2 level, Ile-de-France at employment basin level etc.)

  • Data which are not desribed in the available nomenclatures of the Search Interface (airports, water basins etc.)

  • Non ESPON Area and non ESPON Neighbourhood data (e.g. data on American, Brazilian or Japanese regions).

The M4D project has developed as specific graphical user interface for querying such data. The data is stored following a simple template (in a zip format, cf Section 2.2 for further explanations) and will be downloadable following the two proposed pages shown in Figure 2.1 (overview) and Figure 2.2 (details).

Figure 2.1. Overview page of Case Studies

Overview page of Case Studies

This overview page of case studies is a proposal that will be further improved, but it presents some clear advantages for the users:

  • A clear overview of the location of case studies produced within the ESPON Program.

  • Data integration is not limited to Europe and it is easily possible to integrate data coming from case studies outside Europe (USA, China, etc)

  • It is a simple solution for displaying in a homogeneous way the heterogeneity of the ESPON production.

Then, when selecting a project pin, the user is redirected to the case study information page shown in Figure 2.2.

[Note]

The pins solution to see case studies data is certainly not the best way to display the one in cross-border areas (Grande Région), large areas (North Calotte) etc. But taking into account the heterogeneity of case studies data and the difficulty to predict by advance what kind of geometries could be proposed by ESPON Projects, the M4D Project has chosen this solution, which may be improved in a future version of the interface.


Figure 2.2. Information Page of a Case Study

Information Page of a Case Study

This figure shows the information page of a case study, previously selected from the list in the Overview page (Figure 2.1).

Five main parts compose the page:

  1. General information related to the ESPON data provider (aim of the data collection, contact, upload date of the datasets).

  2. Data information: a listing of the available indicators, temporal extent of the indicators.

  3. Study area: location and name of the case study, nomenclatures used to collect data.

  4. Data source: name of the data provider(s), URL, precaution of use.

  5. Downloads: this part of the page proposes to download separately the data (.zip format), the geometries (as a .zip), and the metadata page as a .pdf file. Note that the download rights may be specified and restricted, particularly for the geometries not free of use, for example the Eurogeographics data.


2.2. Expected delivery

To feed the Case-study interface and the metadata page, the ESPON Database needs two main mandatory deliveries from the ESPON Projects: data and documentation, and an optional one: the geometries. The following sub-sections describe each of these elements.

2.2.1. Data file

The format of the data file, shown in Figure 2.3, is not significantly different than the one proposed for the key indicators. The element which differ from the Key indicators template is the source column, on the right column of each indicator: it has been deleted. It means that the source description is made at the level of the dataset.

Figure 2.3. Data Model Example for Zoom-in projects

Data Model Example for Zoom-in projects

This figure shows an example of the expected data model for Case Studies projects.


2.2.2. Case-study data and metadata

Case-study metadata is certainly the most important to fill, since it aims at providing the information that is finally available to end-users on the page shown in Figure 2.2. The file structure is inspired from the metadata specifications of the key indicators with some simplifications and adjustments linked to the specificities of such a project delivery.

In the xls template, mandatory fields must be filled in two sheets, these mandatory cells are indicated with a green backgound color in Figure 2.4 and Figure 2.6.

Following sub-sections describe each of the sheets.

2.2.2.1. The dataset sheet

Figure 2.4. Case Study Dataset Sheet

Case Study Dataset Sheet

This figure shows the dataset sheet of the TeDi Case Study data file. The green color shows mandatory fields. The purple color shows optional fields.


The expected information in the dataset sheet is:

  • Name: name of the delivery. It is to give an idea of the dataset content. We encourage all dataset providers to produce the most short and meaningful dataset names that directly reflect the data semantics.  

  • Project: ESPON project in which the dataset was produced. This should be an acronym of one of the existing ESPON projects. If this property is not specified, the default project "ESPON 2013 Database" will be applied.

  • Abstract: Free-text description of the contents of the dataset, in a way to make understandable the aim of the case study (both geographical coverage and thematic scope of the delivery).  

  • Access classification: Classification of the access rule applied to the dataset/geometries separately. Three possibilities can be mentioned in this field:

    1. unclassified - available for general disclosure (public access)

    2. restricted - not for general disclosure (for registered users only, e.g. belonging to the ESPON Program). This possibility has to be used when the geometries comes from Eurogeographics, which cannot be diffused out of ESPON. But as far as possible, try to create your own geometries with no limitations of use…

    3. confidential - available for someone who can be entrusted with information (for the administrator of the database only, e.g. ESPON Coordination Unit and the ESPON Database administrator)

  • Use restriction: Information useful to know for the future user of the dataset. It might be incoherencies between indicators definition (e.g. “be careful to the unemployment rate definition for Belgian territorial units”), content of the dataset (e.g. data are not available for the same year) etc.

  • Responsible party: Organization or person responsible for the entire dataset. Name, organization and email contact are required.

  • Metadata contact: Organization or person who created the metadata for the dataset. Name, organization and email contact are required.

  • Spatial binding: Describes the spatial link between the data part of the dataset and the territorial units used. Four elements are required: the name of the case study and its country of belonging, the latitude and the longitude location of the case study (cf Figure 2.5 ); and information related to the geographical level of analysis (nomenclature name and/or version and/or level). The number of case studies per dataset is not limited.

Figure 2.5. Several strategies for locating the flag of a case-study

Several strategies for locating the flag of a case-study

There is never a single solution for locating the X/Y coordinates of a case-study. Taking the theoretical example of Ile-de-France (on the left), it is possible to locate the flag in the centroid of the Ile-de-France ploygon (geometric perspective, blue flag), or in the capital city of the region (thematic perspective, red flag). The theoretical case of Baltic see regions (on the right, covering 8 countries) raises other questions: it is possible to mention one flag by country (exhaustive view, red flags) or one flag for the all study area (synthetic view, blue flag). The location of the geographical coordinates depends on case-study metadata. This choice is very important to made since it defines the location of the flag on the case-study interface.


2.2.2.2. The indicator sheet

Figure 2.6. Case Study Indicator Sheet

Case Study Indicator Sheet

This figure shows the indicator sheet of the TeDi Case Study data file. The green color shows mandatory fields. The purple color shows optional fields.


The expected fields in the indicator sheet is:

  • Code: A short acronym that reflects the meaning of the indicator

  • Name: A short expression that reflects the meaning of the indicator

  • Abstract: The abstract of the indicator. This property must describe the indicator in a more extended way than it is done by the Name property. The abstract must not repeat only the name of the indicator, but propose more information about it, that is not given by the Name.

  • Methodology description (optional): Describes the methodology used to produce indicator values. This methodology can concern a particular indicator independently of data sources or be specific to a particular source that provided indicator values (e.g. when a typology is produced, explain the cluster method used and the meaning of values shown in the data file – 1 for decreasing; 2 for increasing).

  • Methodology URI (optional): Reference to the resource where a detailed description of the methodology is made. This may be a reference to an online/paper publication or to the name of a file attached to the dataset. If this property specifies a file name, it must be present in the package delivered to the data processors; otherwise the data provider will be requested to supply this file.

  • Temporal extent:  groups temporal references of periods or instances covered by the values of an indicator in the dataset. When the indicator is available at different time period (e.g. DNS_1a indicator on the figure 15), add several temporal extents.

  • Provider: Refers to the data provider of the indicator value. The provider may be an institution or even a person who is the originator of the data. This property should not be confused with the reference to the publication source: the data provider is the  actor  who contributed to the data production or publication.

  • Provider URI (optional): Official Uniform Resource Identifier (URI) of the data provider. In most cases, this is the URL (Internet address) of the data provider's site home page. This property must not represent a reference to the publication, but to the  organization or the person  who provided the data. For example, this property can take the value "http://ec.europa.eu/eurostat", which refers to the home page of Eurostat

  • Publication title (optional): Title of the publication or name of the source where data were taken from, if it exists (for instance  "Switzerland Statistics Public Database")

  • Publication URI (optional): Official Uniform Resource Identifier (URI) of the publication. In most cases, this is the URL (Internet address) where the data is available online or can be accessed or obtained. This can also be an ISBN if the source is a paper publication (for instance http://www.espon.eu/reports/report001.pdf).

  • Publication reference (optional): Indicates the element of the referenced publication (page, part, chapter etc) to refer to. (for instance. p.50, chapter 2).

  • Methodology description (optional):  This property describes a source-specific methodological details that make the data from this source distinct from the data coming from other sources of the dataset (for instance “coming from heterogeneous data provider, the data has been harmonized using Eurostat data”). Cf the Technical Report on Core indicators, which proposes some examples of estimation methods.

  • Methodology URI (optional):  Reference to the resource where a detailed description of the methodology is made. This may be a reference to an online/paper publication or to the name of a file attached to the dataset. If this property specifies a file name, it must be present in the package delivered to the data processors, otherwise the data provider will be requested to supply this file.

  • Copyright (optional): Text describing the copyright rules and/or restrictions applied to the data associated with this source. The default value of this property is "(c) ESPON 2013 Database".

2.2.3. Geometry file (optional)

When delivering a case-study, the geometry file is not mandatory. This being said, it is strongly recommended to attach geometries if the data deals with territorial data (LAU2). It allows to fully ensure the reproducibility of the maps made within a given case-study.

Geometries must be delivered in a .zip format. This .zip file must include at least the following files: .shp, .dbf, .prj, .sbn, .shx (e.g. georeferenced information, systematically generated when editing a layer using a GIS). Geometries have to be delivered in a .zip archive whose filename is name_of_the_project_geom.zip.

The information contained in the .dbf linked to a shape file has to be at least a code (ID) that is similar than the one contained in the data files (Figure 2.8). Thus, it is possible for the user to:

  1. Analyse the exact territorial coverage of each case study.

  2. Build some maps thanks to the data gathered for each case study of the ESPON Community.

Figure 2.7. Example of a Case Study Geometries Input

Example of a Case Study Geometries Input

This figure is an example of the ESPON TeDi Project Case Study, available at LAU 2 level.


Figure 2.8. Mapping of Geometries Codes in Data

Mapping of Geometries Codes in Data

This figure shows the full correspondance between geometries and data files codes.


2.3. What happens to my data? The case-study integration process

The case-study must be delivered in the upload part of the ESPON Database Portal (when logged, case-study sub-part). The data and metadata file must be delivered in a .zip archive (Figure 2.9 ). Then, the data provider can upload the geometries of the case-study, which have to be included in a .zip archive (the delivery of geometries is optional). The upload of data/metadata (and hopefully geometries) is the first step of the case-study integration process.

Let us now take a closer look at this case-study dataset upload step. Once logged in, the data provider can access the case-study upload page (see the Figure 2.9 below). The case-study dataset upload is composed of two steps: the upload of a dataset file and optionaly, the upload of a geometry file.

Figure 2.9. Case-study dataset file upload page

Case-study dataset file upload page

The page allows a registered user to upload a case-study dataset file.


The file uploaded must not exceed 100MB. The file must be a zip file containing at least one .xls or .xlsx file (no sub directory allowed, no other extension than .xls or .xlsx accepted).

Once the dataset uploaded, the data provider can optionaly upload a geometry file (see the Figure 2.10 below).

Figure 2.10. Case-study geometry file upload page

Case-study geometry file upload page

The second step of the case-study dataset upload process is optional. The project can attach or not a geometry file to the previous dataset uploaded.


The file uploaded must not exceed 100MB. The file must be a zip file containing at least one triplet : .shp, .shx, .dbf with the same base name. If it contains more then one .shp file : there must be .shx or .dbf corresponding files with the same base name, in the same directory than the .shp file. The zip file can contain other files.

Once the dataset file and optionaly the geometry file uploaded, the case-study tracking is activated (Figure 2.11). Afterwards, the database administrator checks if all mandatory fields of the case-study metadata are correctly filled. If it is not the case, the data provider is invited to correct her/his dataset on the basis of the database administrator remarks. If metadata are correctly filled, the database administrator proceeds to the metadata edition (second step), the case-study creation and the overall check of the case-study. The data provider can consult the state of advancement of the case-study integration by checking the tracking tool.

Figure 2.11. Tracking tool activation and database administrator functions

Tracking tool activation and database administrator functions

This figure shows the tracking tool page of the case-studies and displays the step 2 and 3 of the data integration process.


When the ESPON Database administrator valids the step 2 and 3, the case-study delivery is available in the Search Interface of the ESPON Database Portal. (Figure 2.12). It means that the large public can consult data and metadata in a user-friendly way. If geometries are also provided, a given user can hopefully also create new maps on the basis of the data provider material.

Figure 2.12. Case-study integration

Case-study integration

Finally, it is possible to query publicly the case-study delivery in the case-study part of the Search Interface of the ESPON Database Portal.


Chapter 3. The Background Data of the Database

3.1. Strategy for the Background Data

ESPON TPGs may have produced a lot of data useful for specialists (e.g. residuals of a regression model) but not for ordinary (e.g. non-expert) users, such a policy makers or practitioners. Or TPGs may produce results in formats (e.g. grid data) or nomenclatures (LAU2) not compliant with the specifications of the Search Interface of the ESPON Database Portal. In other words, Bacground data is a good opportunity for ESPON Projects to disseminate all useful material dealing with data within their project, whatever their format.

In such a case, the M4D Project proposes to ESPON TPGs to provide their database in the Background part of the ESPON Database Portal. This sub-section (Figure 3.1) follows the organization of ESPON Transnational Projects Groups: Applied Research (Priority 1 projects), Targeted Analysis (Priority 2 projects) and Scientific Platform (Priority 3 projects). The data provided by ESPON TPGs can be downloaded in a .zip format.

Figure 3.1. Background data part of the ESPON Database Portal

Background data part of the ESPON Database Portal

All the background data delivered by ESPON TPGs are available in the Resource part of the ESPON Database Portal.


3.2. Expected delivery

3.2.1. How structuring the background data .zip file? General recommendations

As described in Figure 1, no checks are made on Background data. ESPON TPGs are free to organize their .zip file as they consider the most appropriate. We only suggest to TPGs to structure this file in a comprehensive way for external users. In that order, several general remarks are to keep in mind before delivering the Background data:

  • The size constraint for background data is 100 Mb by .zip file. If the file exceeds this threshold, contact the database administrator.

  • A good practice consists by structuring the background data by intelligible folders and providing a documentation file explaining how using and understanding the .zip file at the root folder. A good example on how structuring the background data file can be found with the GEOSPECS background data

  • To ensure the INSPIRE compliance your rasters, it is strongly recommended to use the INSPIRE Metadata editor to edit an xml file, which is especially adapted to disseminate this kind of data.

  • It is not a problem to duplicate the information included in the search interface/case-study interface and in the background data part of the ESPON Database Portal: the background data as to be considered as the "entire TPG database".

  • A template is proposed by M4D.

3.2.2. A .xls template available for filling quickly background metadata

For territorial datasets not included in the key indicator delivery, the M4D Project has produced a simplified data and metadata template derived from the Metadata Specifications of key indicators. The aim of this template is to propose to external users the minimal piece of information useful to understand the meaning of the indicator, the origin of data and some precisions on the data producer. In fact, this template helps to define harmonised information related to data.

The XLS template developed in that order is quite easy and not time-consuming to feed. It is structured in two parts. One is dedicated to data and the other one to metadata.

The data template is structured as the one proposed for case-study data (cf Section 2.2.1), and has to be delivered as a .xls file including a single sheet entitle data.

The metadata file contains 10 compulsory fields (Figure 3.2) and has to be delivered as a .xls file including a single sheet entitled metadata. This sheet is structured in columns (one for each indicator). The first part is dedicated to the indicator definition, the second part to the data sources.

Figure 3.2. Project Database Metadata Sheet

Project Database Metadata Sheet

This figure shows the content of the metadata sheet expected for Background project data file.


3.3. What happens to my Data?

Background data must be uploaded under the upload part of the ESPON Database Portal (when logged), in the "Background Data" sub-section (Figure 3.3) After that, a minor compliance check is done by the database administrator. The aim of this check is just to check the coherence of the Background data delivered. No further checks (semantic or outlilier) will run on this delivery.

At the end, the background data will be publicly available under the Resource part of the ESPON Database Portal.

Figure 3.3. Background Data upload

Background Data upload

This figure shows the ESPON Database Portal page dedicated to the upload of Background data.


Chapter 4. Resources of interest and advice for data creation in ESPON

To conclude, this chapter proposes some advice for managing the data flow inside each ESPON Project, along with complementary information.

4.1. Resources of interest

In order to make the creation of data and metadata efficient, the M4D project has produced some useful guidance documents. These documents of interest are described in this section.

4.1.1. XLS Template with Examples

As shown in Figure 4.1, several XLS templates are available for download under the "Upload" menu of the ESPON Database Web site (login required). They cover the following cases:

  1. Key indicators: an XLS template fully compatible with the ESPON Metadata specifications [1] is available for download. It includes all the required information described in the metadata specification.

  2. Case-study data: an XLS template adapted to case-study data. It includes fewer mandatory fields than the key indicator template.

  3. Background data: an XLS template is recommended. However, taking into account the potentially huge heterogeneity of ESPON TPG deliveries (raster data, etc.), it is not mandatory to follow the organization of this file exactly.

These XLS templates are provided with concrete examples of datasets already integrated in the ESPON Database.

Figure 4.1. On-line availability of the xls templates

On-line availability of the xls templates

The data and metadata templates, available from the Upload section of the ESPON Database Portal.


4.1.2. Frequently Asked Questions (FAQ)

In the past, the M4D project has had to respond to many questions regarding data integration. We have tried to capitalize on all these exchanges by writing a FAQ, available on-line from the Help menu of the Web application [3] since February 2012. As shown in Figure 4.2, questions are ordered by topic:

  1. What is M4D?

  2. The ESPON Database Portal

  3. Restricted part of the ESPON Database Portal

  4. Data delivery

  5. Metadata processing

  6. Support to data creation

  7. Mapkit

  8. Local/urban data

Please check the content of the FAQ before asking your question(s) to the database administrator!

Figure 4.2. Header of the FAQ

Header of the FAQ

This figure shows the header of the FAQ available on-line from the Help menu of the Web application [3].


4.1.3. M4D Presentation (ESPON Seminar in Cyprus, December 2012)

The aim of this presentation was to summarize this written documentation in a clear way. The presentation is available in the Upload part of the ESPON Database Portal.

4.1.4. M4D Newsletters

From 2012 to 2013, three Newsletters were published by the ESPON M4D Project. Among other things, they explain the strategy followed by this project for building the ESPON Database Portal. All the ESPON M4D Newsletters are available on the home page of the ESPON Database Portal.

4.2. Advice for a perfect management of the data process

The following advice results from the experience gained in following up previous ESPON Projects, some of which have experienced difficulties in following the data and metadata specification and delivering their data.

  1. A limited number of persons in charge of data/metadata/GIS creation in each TPG.

    Ideally, each project should dedicate one of its team members to data and metadata creation. This makes it possible to:

    1. Centralise all the data of the project

    2. Harmonize data and metadata creation

    3. Give a single delivery at the end of the project (a bad practice would be for each partner of the TPG to deliver its own key indicators without any control by the consortium).

  2. Set up the question of data delivery very early in the lifetime of a project.

    Regarding the expected deliveries, some basic questions need to be discussed inside each project very early:

    1. What key indicators will be delivered to the database?

    2. How to organize the data delivery of our case study?

    3. What kind of innovative indicator could we propose to the ESPON Community, which could be updatable in the future?

    It is important to keep in mind that waiting until the end of the project to take care of the data delivery process may lead to integration problems and a significant loss of time.

  3. Do not lose information; use the metadata templates as soon as possible!

    This way, you can be sure that you will not forget any mandatory fields and that you will not have to go through a tedious copy/paste of your datasets into the templates at the end of your project.

4.3. A good practice for filling data and metadata

This example is derived from a concrete case encountered by the M4D project during the data collection of one of the core indicators (total population 1990-2011, available under the search interface). One of the aims of the core database strategy is to provide complete time-series at NUTS levels for the ESPON Area for a set of basic count data. Among other things, this implies estimating some missing values and referencing precisely in the metadata the methodology used to fill the holes contained in the dataset.

Taking Denmark as a starting point, total population is available for 2007 and 2008 on the Eurostat website. It refers to the label "1", which is described in the metadata file as shown in Figure 4.3.

Figure 4.3. Starting point: a table with empty values

Starting point: a table with empty values

This figure shows a common situation: a table with empty values which need to be estimated.


When looking at other data sources, this information is available only for two territorial units on the National Statistical Website of Denmark (due to the change of the NUTS definition). The only way to obtain data for the rest of the territorial units is to proceed to a data estimation (temporal retropolation in this case).

The question is: how to reference this in the metadata file?

The only solution to avoid a loss of information is to reference this estimation immediately in the source metadata of the dataset! Figure 4.4, Figure 4.5, Figure 4.6, and Figure 4.7 propose a way to proceed in order to ensure high-quality metadata.

Figure 4.4. Resulting dataset with estimated values and associated labels

Resulting dataset with estimated values and associated labels

This figure shows the resulting table with estimated values. Each estimated value has a label (source column of the total population 2005 and 2006) explaining the methodology used to create the estimation. Of course, the values of the labels (TE6b, 13) are different from those of the starting table (label 1, source of the total population 2007 and 2008). In concrete terms, using two labels (TE6b, 13) means that two different methods have been used to estimate the missing values. These labels have to be described in the source part of the metadata immediately.
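
As a rough sketch of this workflow, the following Python/pandas example fills a missing year by a simple backward extrapolation and immediately attaches a source label to the estimated column, mirroring the label columns visible in Figure 4.4. The figures, the extrapolation formula and the column names are assumptions for illustration only; they do not reproduce the actual estimation methods behind labels 13 and TE6b, which must be documented in the metadata.

  # Minimal sketch, assuming a simple linear backward extrapolation as one
  # possible "temporal retropolation"; values and column names are invented.
  import pandas as pd

  df = pd.DataFrame({
      "unit": ["DK011", "DK012"],
      "pop_2005": [None, None],          # missing values to estimate
      "pop_2007": [601000.0, 640000.0],  # illustrative figures, source label "1" (Eurostat)
      "pop_2008": [608000.0, 648000.0],
  })

  growth = df["pop_2008"] - df["pop_2007"]
  df["pop_2005"] = df["pop_2007"] - 2 * growth  # extrapolate two years back from the 2007-2008 trend
  df["src_pop_2005"] = "TE6b"  # estimation label, to be described at once in the metadata source part

  print(df)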


Figure 4.5. Description of the label 1 in the metadata

Description of the label 1 in the metadata

This figure shows the metadata associated with label 1. The data source is Eurostat and the data has not been estimated (false value in the estimation field). Taking into account the regular updates of Eurostat tables, a good practice consists in specifying the date of upload of the table (2011-07-26 in this case) and its exact name (demo_r_gind3).


Figure 4.6. Description of the label 13 in the metadata

Description of the label 13 in the metadata

This figure shows the metadata associated with label 13. This data comes from the Danish National Statistical Institute. As a consequence, the label must not be the same as the one related to Eurostat data (label 1).


Figure 4.7. Description of the label TE6b in the metadata

Description of the label TE6b in the metadata

This figure shows that data related to the label TE6b has been estimated (true value in the estimation field). When data is estimated, it is very important to describe in the methodology fields (description, formula or URI) how the estimation was conducted.


Appendix A. References

Table of Contents

[1] Anton Telechev and Benoit Le Rubrus. ESPON Data and Metadata Specification. Full text in HTML (last visit: 2012-05-20).

[2] Claude Grasland and Ronan Ysebaert. ESPON Technical Report - The Core Database Strategy – A new paradigm for data collection at regional level. December 2011.

[3] LIG STeamer. ESPON Database Web Application. Version February 2012. http://database.espon.eu (last visit: 2012-05-20).

[4] ESRI. ESRI Shape File Technical Description. An ESRI White Paper - July 1998. Full text in PDF (last visit: 2012-03-23).

Appendix B. Data Flow Process of the Key Indicators

This appendix presents the different steps of the Key Indicators dataset integration dataflow. The example of a test dataset integration illustrates, via screenshots, the tracking functionality of the ESPON Database Portal.

B.1. Syntactic Check

This "syntactic" check consists of a data/metadata file upload. The ESPON TPG is invited on this first form to specify if this upload concerns a new Key Indicators Dataset (left side in Figure B.1) or an update of an existing one (right side in Figure B.1).

Figure B.1. Upload Form

Upload Form

Left: first submission of the dataset. Right: update of an already tracked dataset. The user is invited to select the name of the previous version of the dataset.


Once submitted, the uploaded file is automatically checked. If the syntax of the delivered file is correct, the dataset appears in the tracking table as shown in Figure B.2. The M4D Contact Team in charge of the project receives a notification email to perform the syntactic check.
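
As an indication of what such an automatic verification typically covers, here is a minimal Python sketch that checks the structure of an uploaded workbook (required sheets present, non-empty header row). It is not the actual check run by the Portal or by the DatasetCheck tool, and the sheet names are assumptions for illustration.

  # Minimal sketch of a structural ("syntactic") verification; NOT the actual
  # DatasetCheck tool. Sheet names below are assumptions for illustration.
  from openpyxl import load_workbook

  def syntactic_check(path):
      errors = []
      wb = load_workbook(path, read_only=True)
      for sheet in ("data", "metadata"):          # assumed required sheets
          if sheet not in wb.sheetnames:
              errors.append(f"missing sheet: {sheet}")
      if "data" in wb.sheetnames:
          first_row = next(wb["data"].iter_rows(max_row=1), ())
          if not any(cell.value for cell in first_row):
              errors.append("data sheet has no header row")
      return errors

  print(syntactic_check("my_key_indicators.xlsx") or "syntax OK")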

Figure B.2. Tracking Overview: SYNTAX_CHECKED checked

Tracking Overview: SYNTAX_CHECKED checked

B.2. Semantics Check

Figure B.3. Semantics check form

Semantics check form

The database administrator completes this form to deliver the status of the performed semantics check expertise. An evaluation overview and a comment are mandatory. A more complete report file can also be attached, for example with suggestions of corrections, as shown in Figure 1.11.


Once the "semantic check form" submitted by the database administrator, the ESPON TPG is notified by email that the state of the dataset becomes SEMANTICS_CHECKED as shown in the "Tracking" overview table (Figure B.4). The comment and eventual report file are available from the "Dataset Details" page shown in Figure B.5. The notification email received by the ESPON TPG invites him to login the ESPON Database Portal, to consult the semantics expertise before the next step described in Section B.3.

Figure B.4. Tracking Overview: SEMANTICS_CHECKED state

Tracking Overview: SEMANTICS_CHECKED state

The state of the dataset is "SEMANTICS_CHECKED".


Figure B.5. Dataset details: SEMANTICS_CHECKED state

Dataset details: SEMANTICS_CHECKED state

The detailed page of the dataset integration displays the semantic check expertise evaluation and comment; the optional report file can be downloaded.


B.3. Semantic Check Approval

When the database administrator has delivered the report about the semantics check, the TPG is notified. It is invited to consult the report and can then choose either to fix its delivery or to forward it to the next step of the integration.

Taking into account the remarks in the semantic report, the ESPON TPG continues the integration or decides to review the dataset. This decision is recorded via the simple form shown in Figure B.6.

Figure B.6. Form for the semantics check approval

Form for the semantics check approval

If the TPG answers "YES" to this "semantic check approval" question form, the state of the dataset becomes "SEMANTICS_ACCEPTED" (Figure B.7).

Figure B.7. New state for the dataset: "SEMANTICS_ACCEPTED"

New state for the dataset: "SEMANTICS_ACCEPTED"

This screen is displayed to the TPG when it clicks "Yes" on the semantics check approval form shown in Figure B.6. The NCG M4D Team is notified by email that the "Outliers" step can be started on this dataset. See Section B.4.


B.4. Outliers Check

This step mainly consists in detecting outliers and checking the quality of data. An outliers report is delivered at the end of this expertise.
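
As an indication of the kind of screening involved, the following Python sketch flags candidate outliers in an indicator series with a simple interquartile-range rule. This is only one common screening method, assumed here for illustration; the actual checks performed by the database administrator may differ.

  # Minimal sketch of an interquartile-range screen for candidate outliers;
  # the values and unit codes are invented for illustration.
  import pandas as pd

  values = pd.Series(
      [1.2, 1.4, 1.3, 1.5, 9.8, 1.1],
      index=["DK011", "DK012", "DK013", "DK014", "DK021", "DK022"],
  )

  q1, q3 = values.quantile([0.25, 0.75])
  iqr = q3 - q1
  candidates = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
  print(candidates)  # spatial units whose values deserve a closer look in the outliers report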

The form shown in Figure B.8 targets the database administrator, who is in charge of managing the outliers detection and statistics. Once it has been submitted, the state of the dataset is "OUTLIERS_CHECKED" (Figure B.9).

Figure B.8. Outliers check form

Outliers check form

Figure B.9. New state for the dataset: OUTLIERS_CHECKED

New state for the dataset: OUTLIERS_CHECKED

This screen is displayed once the form in Figure B.8 has been completed.


Once this form has been submitted, the ESPON TPG is notified by email that the dataset has passed the "OUTLIERS_CHECKED" step, as shown in the Tracking overview table (Figure B.10). The comments and any additional outliers result files are available from the detailed page of the dataset integration (Figure B.11).

Figure B.10. Tracking Overview: OUTLIERS_CHECKED state

Tracking Overview: OUTLIERS_CHECKED state

The state of the dataset is "OULIERS_CHECKED". Consequently, the "ESPON CU Approval" step is pending.


Figure B.11. Dataset details: OUTLIERS_CHECKED state

Dataset details: OUTLIERS_CHECKED state

The detailed page of the dataset integration displays the outliers comments; the TPG can download the outliers report file, if any.


B.5. Outliers Check Approval

When the database administrator has delivered the outliers report, the TPG is notified. It is invited to consult the report and then to decide whether to continue the integration process or to review its data.

The approval of the Outliers expertise by the TPG is similar to the "semantics approval" step, in which the TPG is invited to approve or abandon the integration of its dataset. When submitting the "outliers approval form" shown in Figure B.12, the TPG has read the outliers report and decides either to fix the dataset or to continue its integration.

Figure B.12. Form for the outliers check approval

Form for the outliers check approval

In this example, the TPG wants to continue the integration. The state of the dataset is now OUTLIERS_APPROVED (Figure B.13).


Figure B.13. New state of the dataset: OUTLIERS_APPROVED

New state of the dataset: OUTLIERS_APPROVED

This page is displayed to the TPG when it positively submits the form in Figure B.12: the outliers check is approved.


If the TPG answers "YES" to this "outliers check approval" form, the state of the dataset is "OUTLIERS_APPROVED", as shown in the Tracking overview table (Figure B.14) and dataset details page (Figure B.15).

Figure B.14. Tracking Overview: OUTLIERS_APPROVED state

Tracking Overview: OUTLIERS_APPROVED state

The state of the dataset is "OUTLIERS_APPROVED". The "ESPON CU" validation step is pending.


Figure B.15. Dataset Details: OUTLIERS_APPROVED state

Dataset Details: OUTLIERS_APPROVED state

The dataset details page displays the comments for each step of the integration.


Once the dataset has been approved by the TPG, the ESPON CU is notified by email. This step is described in Section B.6.

B.6. ESPON CU Approval

The "ESPON CU Approval" step 4 of the tracking workflow invites ESPON CU to read the expertise reports produced during the semantic and outlier checks. Then, ESPON CU decides to integrate or not the dataset into the ESPON Database. The form shown in Figure B.16 allows ESPON CU to abandon this version of the dataset ("Resubmit" option). Approving the dataset ("Next Step" option) immediately attempts its integration into the database.

Figure B.16. Tracking - Step 4 - ESPON CU Approval Form

Tracking - Step 4 - ESPON CU Approval Form

This screenshot shows the form targeting the ESPON CU in step 4 of the Key Indicators dataset tracking workflow. At this step, selecting the "Next Step" option will attempt the integration of the dataset into the ESPON Database.


Please note that, depending on the size of the dataset and on the server performance, the integration may take a while (more than 3 hours for the largest, 15 MB, dataset). Nevertheless, once the integration has started, the user can leave the page and close his/her Web browser; the process continues on the server side, and then:

  • In the case of a successful integration, the state of the dataset becomes "Integrated". This information is shown in the tracking overview table by a dedicated icon.

  • In the case of a failed integration, the state of the dataset becomes "Abandoned at step 5". In the tracking overview table, this information is shown by a dedicated icon.

In both cases, the integration log file is available in the "Tracking - Dataset Details" page, as shown in Figure B.17.

Figure B.17. Tracking - Dataset Details Page After an Integration Attempt

Tracking - Dataset Details Page After an Integration Attempt

This screenshot shows the "Dataset Details" page after the online integration attempt. In this example, the integration succeeded. An automatic comment indicates the integration duration, the integration log file is available for download.


Appendix C. About

This document is part of the ESPON 2013 Database Phase 2 project, also known as M4D (Multi Dimension Database Design and Development). It was generated on 2014-12-19 17:33:27 from the sources of the m4d forge imag project at svn rev 2420.

The main author of this document is Ronan Ysebaert (UMS RIATE), with the help and contribution of UMS RIATE and LIG STeamer M4D Partners.

For any comment, question or suggestion, please contact .

Colophon

Based on DocBook technology [1], this document is written in XML format; sources are validated against the DocBook DTD 4.5CR3 and then transformed to HTML and PDF formats using the DocBook XSLT 1.73.2 stylesheets. The generation of the documents is automated thanks to the docbench LIG STeamer project, which is based on Ant [2], Java [3], and the Xalan [4] and FOP [5] processors. Note that the standard XSLT stylesheets are customized in order to get a better image resolution for admonition icons in the generated PDF output: the generated sizes of these icons were changed from 30 to 12 pt.



[1] [on line] DocBook.org (last visit: July 2011)

[2] [on line] Apache Ant - Welcome. Version 1.7.1 (last visit: July 2011)

[3] [on line] Developer Resources For Java Technology (last visit: July 2011). Version 1.6.0_03-b05.

[4] [on line] Xalan-Java Version 2.7.1 (last visit: 18 November 2009). Version 2.7.1.

[5] [on line] Apache FOP (last visit: July 2011). Version 0.94.