Revision History | |
---|---|
Revision 2 | svn rev 2420 build 2014-12-19 17:33:22 |
Abstract
This technical report proposes a description of the expectations for the integration of different ESPON projects data.
Table of Contents
This document contains many references to the Database Administrator, who is the contact person for any question regarding the ESPON Database Portal. Until December 2014, the administration of the ESPON Database Portal is ensured by the M4D manager: <[email protected]>. From January 2015, at the end of the M4D activities, it will be managed by the ESPON Coordination Unit: <[email protected]>.
The ESPON Database 1 project (2008-2011) had great difficulty overcoming the heterogeneity of the information provided by ESPON Projects (integration of local data, integration of sophisticated indicators with little metadata description…). To improve this unsustainable situation, the M4D Project has tried to define more precisely what is expected from ESPON Projects in terms of data deliveries.
The ESPON P1, P2 and P3 Projects are obliged to deliver all data collected and produced within their project. These data should be delivered as three types of delivery:
Key indicators, covering the entire ESPON Space and for a limited number of territorial nomenclatures (NUTS, SNUTS, FUAs, MUAs, UMZ);
Case-study data, which do not cover the entire ESPON Space and/or are not described in the territorial nomenclatures of the Search Interface and/or are located outside the ESPON Space;
Background data, covering all data produced by the project, whatever their format.
As shown in Figure 1, the data flow depends on the nature of the delivery.
This document provides useful information on these three types of delivery:
THE KEY INDICATORS, further described in the chapter entitled The Key Indicators
The key indicators are innovative indicators highly relevant for policy making and should cover the entire ESPON Space (EU27+4). These indicators will be the only ones searchable from the query interface.
The ESPON projects deliver in principle the indicators related to the maps included in Part B of the (Draft) Final Report - around 10 indicators. In case a typology or composite indicator is included, the data and methodology used to build it should also be delivered.
The requirement in terms of data and metadata is high for this delivery, and the ESPON Projects are requested to upload the data via the Upload page.
The key indicators delivery has to follow the ESPON Data and Metadata specifications. To build a strong and efficient query interface, these indicators will be checked in depth before integration.
This process includes three steps:
Syntactic check, metadata format analysis (are all the mandatory fields filled?). This check is processed automatically when the dataset is uploaded to the ESPON Database Portal (more details in Section 1.3 of this document);
Semantic check, metadata content analysis (are the metadata understandable?), carried out by the database administrator;
Outlier detection (are there unusual values in the dataset?), carried out by the database administrator.
CASE-STUDY DATA, further described in the chapter entitled The Zoom-in Delivery
Besides the key indicators delivery, some ESPON Projects (in particular Targeted Analyses, but not only) analyze specific territories inside or outside the ESPON Area. To make this complementary and very interesting data easily accessible, a case-study interface is available.
To set up this interface, the projects are requested to deliver their most representative data, their geometries (in shapefile format, where appropriate) and documentation describing the content of the data and geometries (following a dedicated template).
For this delivery, the database administrator only checks that all mandatory fields of the documentation file are correctly filled. If geometries are provided, the database administrator checks whether it is possible to map the results (e.g., linkage between the geometry and the data codes).
BACKGROUND DATA, further described in the chapter entitled The Background Data of the Database
In order to fulfil their contractual obligations and to make all data available as a coherent set, each ESPON Project has to deliver a zip file that contains all data, metadata and geometries (if different from the usual ones delivered via ESPON) used in the project.
This zip file is considered as an annex to the final report of the project and is stored in the resource part of the ESPON Database Portal.
General considerations while uploading files to the portal: whatever the type of dataset, the uploaded file must not exceed 100 MB. Please contact the administrator for larger files. For interoperability, please avoid spaces and accented characters in the filename of the uploaded dataset.
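The upload constraints above can be checked before submitting a file. The following is a minimal sketch only: the function name is hypothetical, the 100 MB limit is interpreted as 100 × 1024 × 1024 bytes (an assumption), and "accented characters" is approximated by rejecting any non-ASCII character.

```python
# Hypothetical pre-upload check; not part of the ESPON Database Portal.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as binary megabytes


def upload_problems(filename: str, size_bytes: int) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file larger than 100 MB: contact the administrator")
    if " " in filename:
        problems.append("filename contains a space")
    if not filename.isascii():
        # Approximation: any non-ASCII character is treated as "accented".
        problems.append("filename contains accented or non-ASCII characters")
    return problems


if __name__ == "__main__":
    print(upload_problems("my_project_data.zip", 5_000_000))
    print(upload_problems("données 2010.zip", 5_000_000))
```

An empty list means that none of the three constraints mentioned above is violated; it says nothing about the content of the file.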
The 10 best indicators delivery is probably the most restrictive one: taking into account that ESPON is a community where knowledge and material is shared, it needs to define some basics to ensure the harmonization of the ESPON identity. Of course, it concerns reports (50 pages maximum by report, following some required typographic styles), maps (following the map-kit template) or reporting (inception report, interim report(s), draft final report, final report). It also concerns data and metadata.
To be useful for ESPON projects and other end-users, data should always be accompanied by metadata, including information about their quality and sources. It is also particularly important that the metadata should be compliant with international (ISO) and European (INSPIRE) standards so as to ensure the use of the database in the longer-run and to make it compatible with other national and international database initiatives.
To ensure correct data processing and integration into the ESPON Database Portal, the ESPON Metadata Specifications provided by the M4D project must be carefully respected by all the data providers participating in the project and by the organizations or persons who intend to create new software implementations interacting with the ESPON Database.
The ESPON Metadata model is relatively complex, but quite complete. As a result, creating metadata in ESPON is a substantial task, BUT it only concerns a limited number of indicators. TPGs should therefore take it into consideration from the very beginning of the project.
In Section 1.1 of this chapter, we firstly describe the concepts behind the key indicators delivery, or "What shall I deliver?". In Section 1.2, we detail the available specifications regarding metadata description and nomenclature integration, "How shall I deliver my data?". Finally, Section 1.3 is dedicated to the data flow process, or "What happens to my data?".
Before delivering the key indicators, four basic rules are to be kept in mind. The M4D Project has defined these rules in order to give a common understanding of the future content of the ESPON Database and to avoid the integration of too much heterogeneous information. It is the only way to offer a database that can be managed in the future. The four basic rules are described below with concrete situations of good or bad practices.
Each ESPON Project has to choose its most representative indicators covering all the ESPON Area at NUTS level (basically the indicators displayed on maps in the draft final report, without annexes). With this basic rule, we want to limit the discrepancy between projects that deliver hundreds of indicators (residuals of statistical models, generally not very well explained in metadata) and other projects that deliver few indicators, embedded in an enormous information flow. In general terms, we prefer to include in the database a single indicator with a real added value, rather than hundreds of indicators which may never be queried by users of the database.
Good practices:
|
Bad practices:
|
In the past, the ESPON M4D project has received ten indicators describing total population in 2006! This kind of situation makes the database impossible to use (which indicator to download?). This is why, in the key indicators delivery, we kindly ask the project to propose innovative indicators that are not yet in the ESPON Database.
Good practices:
|
Bad practices:
|
The metadata related to indicators must be very well explained. If you propose indicators derived from statistical analysis or models, make sure your data is understandable by non-specialist users!
Good practices:
|
Bad practices:
|
Beyond the key indicators, each project can suggest the inclusion in the "Core Database" of indicators of interest for territorial monitoring (time series, added value for the database), which could be updated and maintained in the future, outside your project.
Good practices:
|
Bad practices:
|
At the moment, the ESPON Database supports several nomenclatures: the NUTS division in its 1995, 1999, 2003, 2006 and 2010 revisions for the ESPON Area at regional level; UMZ, FUAs and MUAs for cities; and the SNUTS division for the regional neighbourhood of the ESPON Area. Whatever the nomenclature used, the degree of completeness of the indicator must be relatively good. Ideally, most of the missing values should be estimated, with a description of the method used. To that end, a guidance paper has been written by the M4D project, proposing a set of estimation methods [2].
The key indicators concern Applied research projects (ESPON Priority 1) and projects from the Scientific Platform (Priority 3). For targeted analysis, most of the data will be integrated in the Case-Study interface (cf The Zoom-in Delivery).
Good practices:
|
Bad practices:
|
This section details the expected deliveries and available resources to fill ESPON Data and metadata.
In order to ensure an efficient way to create data and metadata in the ESPON format, the M4D Project has produced some useful guidance documents (available from the help menu of the ESPON Database Web site at http://database.espon.eu).
The document entitled ESPON Data and Metadata Specification [1], whose header is shown in Figure 1.1, is the reference document for the Key indicators datasets. It specifies the metadata model. Firstly, it describes the generic conceptual model of the ESPON Metadata (called the Abstract Metadata Model). Secondly, it presents the implementation of the abstract model using the international standards (ISO-19115 and INSPIRE Directive). Finally, it explains the implementation of the abstract model in a tabular file format.
Please find below some advice on using these specifications:
Do not be intimidated by the 150 pages of the paper version of the document! From the user point of view, the first, second and third parts of the metadata model specifications explain the same topic in different ways (conceptually, in an XML version, and in a tabular version, e.g. Excel): the description of all the fields of the ESPON Metadata model.
To begin with, we strongly advise you to carefully read the introduction of the Metadata specifications, explaining the main concepts and also the third part, showing the tabular model and all the fields to be filled with concrete examples.
Download the metadata template (requires login) from the "Upload" menu (see Figure 4.1).
On the basis of this .xls document, fill in your metadata.
For example, Figure 1.2 shows how the colors and comments in this template help in filling the cells. When something is not clear, please refer to the metadata specifications: as an example, Figure 1.3 shows the description of the Label field.
Together, Figure 1.2 and Figure 1.3 illustrate good practice in using the metadata specifications.
In case of doubt, the pre-filled templates with concrete examples are especially useful (cf Section 4.1).
This section aims at responding to the following question: "What happens to my data?"
The data integration process aims to apply a consistent quality control to the datasets delivered by ESPON projects. This process is divided into 5 steps. When the TPG uploads its key indicators, it activates a dedicated module in the ESPON Data Portal ("Upload" menu): the Tracking Tool.
The tracking tool is being developed to follow the progress of the data integration process (Figure 1.4). Please note that this tool requires the user to be logged in. For further information about the integration workflow (Who? When? etc.), please consult Data Flow Process of the Key Indicators.
The data integration process is composed of the main steps that are described in the following sub-sections.
When a Key Indicator Dataset is uploaded to the ESPON Database Portal, the first check concerns the syntax of the dataset. This syntactic check verifies whether the dataset is well-formed and whether all mandatory fields are filled; in short, whether the dataset is compliant with the metadata and data specifications.
This syntactic check can also be performed offline with the DatasetCheck.jar software, see Section 1.3.1.2.
Since 2014, the syntactic check performed at upload has been complemented by:
a spatial check: all spatial units referenced in the sheet entitled Data must belong to the nomenclature defined in the sheet entitled Dataset, and this nomenclature must be available in the ESPON Database.
a "code/name/abstract" indicator triplet check: for each indicator contained in the dataset, this step aims at avoiding conflicts with already integrated indicators. The "code/name/abstract" triplet is valid in the following cases:
the combination of the code AND the name AND the abstract of the given indicator is identical to an existing triplet in the database.
the code AND the name of the given indicator are not already assigned to an existing indicator in the database.
In case of an invalid triplet, the error is displayed in the "log box", as shown in Figure 1.8.
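The two validity cases of the triplet check can be expressed as a small function. This is an illustrative sketch, not the portal's implementation: the function name and the in-memory list of existing triplets are hypothetical, and the second case is read here as "neither the code nor the name is already in use", which is one possible interpretation of the rule.

```python
# Hypothetical sketch of the "code/name/abstract" triplet rule described above.

def triplet_is_valid(code, name, abstract, existing):
    """existing: iterable of (code, name, abstract) tuples already integrated."""
    existing = list(existing)
    # Case 1: the exact same triplet already exists in the database.
    if (code, name, abstract) in existing:
        return True
    # Case 2 (interpretation): neither the code nor the name is assigned yet.
    used_codes = {c for c, _, _ in existing}
    used_names = {n for _, n, _ in existing}
    return code not in used_codes and name not in used_names


if __name__ == "__main__":
    # Illustrative values only, not taken from the ESPON Database.
    db = [("pop_t", "Total population", "Total population on 1 January")]
    print(triplet_is_valid("pop_t", "Total population", "Total population on 1 January", db))
    print(triplet_is_valid("pop_t", "Total population", "A different abstract", db))
    print(triplet_is_valid("gdp_pps", "GDP in PPS", "Gross domestic product in PPS", db))
```

Under this reading, reusing an existing code or name with a different abstract is rejected, which is the conflict the check is meant to prevent.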
This set of controls aims at avoiding the most frequent errors, which previously used to be detected at the last integration step. They are automatically done when the project uploads its datasets from the "Upload" menu of the Web application. This is the only compulsory step of the data integration process. Once successfully checked, the dataset is saved on the server. A notification is sent to ESPON CU for the next step.
The syntactic check step is performed on all uploaded datasets. As shown in Figure 1.5, the page displays all the information necessary to fix any syntactic errors or warnings. Three types of messages are displayed in the log boxes:
INF prefix indicates an information message, e.g. some information about the syntactic check process.
WRN prefix indicates a warning message. Warning messages are triggered for ambiguous values that may be problematic during the next steps of the integration. Nevertheless, warning messages do not make the syntactic check fail. As shown in Figure 1.7, the TPG is invited to review its dataset if needed, though it can also submit it to the semantic check.
ERR prefix indicates an error message. Error messages refer to missing values or errors in mandatory fields of the metadata. These errors oblige the user to review the dataset, which cannot pass this step and continue the integration process.
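The pass/fail logic of the three message types can be sketched as follows. The exact layout of the log lines is not documented here, so this example simply assumes that each line starts with one of the three prefixes; the function name is hypothetical.

```python
# Hypothetical summary of a syntactic-check log; the line format is an assumption.
from collections import Counter

PREFIXES = ("INF", "WRN", "ERR")


def summarize_log(lines):
    """Return (passed, counts): the check passes when no ERR line is present."""
    counts = Counter()
    for line in lines:
        prefix = line.strip()[:3].upper()
        if prefix in PREFIXES:
            counts[prefix] += 1
    passed = counts["ERR"] == 0  # warnings do not make the check fail
    return passed, {p: counts[p] for p in PREFIXES}


if __name__ == "__main__":
    log = ["INF check started", "WRN ambiguous value", "ERR mandatory field empty"]
    print(summarize_log(log))
```

Only ERR lines block the dataset; WRN lines are counted so that the provider can decide whether to review the dataset before the semantic check.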
Besides the data and metadata syntactic check, spatial units must be available and consistent with the available nomenclatures in the ESPON Database. Figure 1.6 shows an example of a valid data/metadata syntax, but an invalid set of spatial references.
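The spatial check amounts to a set-membership test between the unit codes used in the Data sheet and the codes of the declared nomenclature. The sketch below assumes both are available as plain lists of strings; the function name and the sample codes are illustrative only.

```python
# Hypothetical spatial check: which codes of the Data sheet are unknown
# in the nomenclature declared in the Dataset sheet?

def unknown_units(data_codes, nomenclature_codes):
    """Return the sorted list of unit codes absent from the nomenclature."""
    known = set(nomenclature_codes)
    return sorted({code for code in data_codes if code not in known})


if __name__ == "__main__":
    nomenclature = ["AT11", "AT12", "AT13"]  # illustrative codes
    data = ["AT11", "AT13", "AT99"]
    print(unknown_units(data, nomenclature))
```

An empty result means every spatial unit referenced in the data belongs to the nomenclature; any returned code would correspond to an invalid spatial reference of the kind shown in Figure 1.6.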
Before uploading a given dataset in the tracking tool, it is generally useful to test the syntactic validity of the file produced. The DatasetCheck software has been created for that purpose and is currently available as a standalone Java application (executable jar file). For cross-platform compatibility, this software must be executed from the command line in a console such as DOS on Windows or Terminal on Unix-based systems (Linux, Mac OS). Figure 1.9 and Figure 1.10 show how to run the syntactic check locally. More information is available in the guidelines document included in the downloaded .zip file. Please note that the local version of the syntactic check does not check the validity of the spatial units included in the dataset (cf previous sub-section).
After the syntactic check step, the dataset is transferred to the database administrator to be checked semantically.
This step analyzes the content of the data and metadata (in particular the free-text fields). Its aim is to determine whether all the indicators of the dataset are correctly described and understandable by a wide audience. The result of this expert check is a semantic report.
Note that this semantic report does not block the data integration process, but the project is asked to consult the report and to decide either to continue the integration process or to fix its dataset according to this assessment.
An example of such a semantic report, filled with annotations, warnings and remarks, is shown in Figure 1.11.
This step is an expert review. In other words, if the TPG is not able (or does not want) to correct its metadata, the dataset can still be submitted to the next step of the integration process.
The following screenshots illustrate an example of the semantic check performed by the M4D Team on a problematic dataset. Figure 1.12 shows the information initially received. Figure 1.13 shows the documents consulted to help understand and fix the received information. Figure 1.14 shows the proposed correction returned to the TPG.
The database administrator is not in charge of filling in this kind of information! He/she supports you in the process, but please make sure that your delivered indicators are understandable by external users!
The semantic check is performed by the M4D project until the end of 2014. Afterwards, the ESPON Coordination Unit will manage this task.
At this stage, outlier detection is carried out on the key indicators. The M4D Project carries out some checks on the data values themselves. Some of these checks are simple. For example: an indicator whose metadata says it is a percentage should have values between 0 and 100; counts should be positive integers. If the metadata states that the values of a typology are 1, 2, 3 and 4, then there should be no other values. Are any data values unexpectedly missing?
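The simple checks listed above translate directly into a few rules per indicator type. The following is a sketch under stated assumptions: the function name and the "kind" labels are hypothetical, a count of zero is accepted (an assumption, since the text speaks of positive integers), and missing values are represented by None or NaN.

```python
# Hypothetical value checks driven by the indicator type declared in the metadata.
import math


def check_values(kind, values, allowed=None):
    """Return a list of (index, value, reason) for every suspicious value."""
    issues = []
    for i, v in enumerate(values):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            issues.append((i, v, "missing"))
        elif kind == "percentage" and not (0 <= v <= 100):
            issues.append((i, v, "outside 0-100"))
        elif kind == "count" and (v < 0 or not float(v).is_integer()):
            # Assumption: zero is accepted as a valid count.
            issues.append((i, v, "not a non-negative integer"))
        elif kind == "typology" and allowed is not None and v not in allowed:
            issues.append((i, v, "not an allowed class"))
    return issues


if __name__ == "__main__":
    print(check_values("percentage", [12.5, 101, None]))
    print(check_values("typology", [1, 2, 5], allowed={1, 2, 3, 4}))
```

Each reported tuple points to one value to be reviewed; whether a missing value is "unexpected" still has to be judged against the metadata.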
The M4D project carries out other, more complex checks: are any values for an indicator extraordinarily high or low? M4D examines indicators singly and, where appropriate, in groups. As many projects use spatial units in the NUTS hierarchy, we check whether data values are unusual compared with their neighbours. The ESPON metadata properties help us to apply the appropriate tests: some indicators are ratios, some are counts, some are typologies, and some form time series. The checks therefore depend on the statistical properties of the indicator.
When M4D finds an unusual or extraordinary value, M4D flags it. If a NUTS region has many unusual values, then M4D will seek assurance that these values are correct – often they are (major cities can be very different from surrounding settlements).
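The text does not state which statistical test M4D uses to flag unusual values. As one possible illustration only, the sketch below uses a robust rule based on the median and the median absolute deviation (MAD); the function name, the 3.5 threshold and the sample values are assumptions and do not describe the M4D procedure.

```python
# Illustrative outlier flagging with a median/MAD rule (not the M4D method).
from statistics import median


def flag_unusual(values_by_unit, threshold=3.5):
    """Return the sorted unit codes whose value is far from the median."""
    values = list(values_by_unit.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: nothing can be flagged with this rule
    return sorted(
        unit
        for unit, v in values_by_unit.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


if __name__ == "__main__":
    sample = {"A": 10, "B": 11, "C": 9, "D": 10, "E": 12, "F": 95}
    print(flag_unusual(sample))
```

A flagged unit is only a candidate for review: as noted above, major cities can legitimately differ strongly from the surrounding regions.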
The output from the quality check is a report (cf Figure 1.15) which M4D shares with the ESPON TPG as data supplier. When all agree that the dataset is as correct as we can make it, it is finally loaded and made available in the ESPON Database (last step).
This check is an expert review and must be considered as an added value in the data integration process: ESPON TPGs can validate or not the result of the check after consulting this report.
Note that the outlier check is ideally suited to basic indicators (count data, ratios) but not necessarily to complex typologies or indexes resulting from an extensive methodological background.
The semantic check is performed by the M4D project until the end of 2014. Afterwards, the ESPON Coordination Unit will manage this task.
The previous checks and steps of the data flow provide a solid expert assessment of the quality of the datasets delivered by ESPON projects. On the basis of all these reports, the database administrator decides whether or not to integrate the dataset into the database.
After integration, the indicators of the dataset can be queried dynamically in the search interface of the ESPON Database Portal. Well-described metadata give a real added value to the indicators.
The ESPON M4D project considers as a “Case-study delivery” a dataset that does not cover the entire ESPON Area (EU27+4). In practice, it includes several situations:
Local data for a region or a group of regions (e.g. Greater Manchester at LAU2 level, Ile-de-France at employment basin level etc.)
Data which are not described in the available nomenclatures of the Search Interface (airports, water basins etc.)
Non ESPON Area and non ESPON Neighbourhood data (e.g. data on American, Brazilian or Japanese regions).
The M4D project has developed a specific graphical user interface for querying such data. The data is stored following a simple template (in a zip format, cf Section 2.2 for further explanations) and will be downloadable from the two pages shown in Figure 2.1 (overview) and Figure 2.2 (details).
To feed the Case-study interface and the metadata page, the ESPON Database needs two main mandatory deliveries from the ESPON Projects: data and documentation, and an optional one: the geometries. The following sub-sections describe each of these elements.
The format of the data file, shown in Figure 2.3, is not significantly different from the one proposed for the key indicators. The only element that differs from the Key indicators template is the source column to the right of each indicator column: it has been deleted. This means that the source description is made at the level of the dataset.
Case-study metadata is certainly the most important part to fill in, since it provides the information that is finally available to end-users on the page shown in Figure 2.2. The file structure is inspired by the metadata specifications of the key indicators, with some simplifications and adjustments linked to the specificities of such a project delivery.
In the xls template, mandatory fields must be filled in two sheets; these mandatory cells are indicated with a green background color in Figure 2.4 and Figure 2.6.
The following sub-sections describe each of the sheets.
The expected information in the dataset sheet is:
Name: name of the delivery. It should give an idea of the dataset content. We encourage all dataset providers to choose short and meaningful dataset names that directly reflect the data semantics.
Project: ESPON project in which the dataset was produced. This should be the acronym of one of the existing ESPON projects. If this property is not specified, the default project "ESPON 2013 Database" will be applied.
Abstract: free-text description of the contents of the dataset, written so as to make the aim of the case study understandable (both geographical coverage and thematic scope of the delivery).
Access classification: classification of the access rule applied to the dataset and to the geometries separately. Three values are possible in this field:
unclassified - available for general disclosure (public access)
restricted - not for general disclosure (for registered users only, e.g. belonging to the ESPON Program). This value has to be used when the geometries come from Eurogeographics, which cannot be distributed outside ESPON. As far as possible, however, try to create your own geometries with no limitations of use.
confidential - available for someone who can be entrusted with information (for the administrator of the database only, e.g. ESPON Coordination Unit and the ESPON Database administrator)
Use restriction: information that the future user of the dataset should know. This may concern inconsistencies between indicator definitions (e.g. “be careful with the unemployment rate definition for Belgian territorial units”), the content of the dataset (e.g. data are not available for the same year) etc.
Responsible party: organization or person responsible for the entire dataset. Name, organization and email contact are required.
Metadata contact: organization or person who created the metadata for the dataset. Name, organization and email contact are required.
Spatial binding: describes the spatial link between the data part of the dataset and the territorial units used. Four elements are required: the name of the case study and its country, the latitude and longitude of the case study (cf Figure 2.5), and information related to the geographical level of analysis (nomenclature name and/or version and/or level). The number of case studies per dataset is not limited.
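Two of the rules above lend themselves to an automatic check: the three allowed access classifications and the default project name. The sketch below assumes the dataset sheet has already been read into a plain dictionary keyed by field name; that representation and the function name are hypothetical.

```python
# Hypothetical normalisation of a case-study "dataset" sheet held as a dict.
ACCESS_LEVELS = {"unclassified", "restricted", "confidential"}
DEFAULT_PROJECT = "ESPON 2013 Database"  # default stated in the field description


def normalize_dataset_sheet(sheet: dict) -> dict:
    """Apply the default project and validate the access classification."""
    result = dict(sheet)
    if not result.get("Project"):
        result["Project"] = DEFAULT_PROJECT
    access = str(result.get("Access classification", "")).strip().lower()
    if access not in ACCESS_LEVELS:
        raise ValueError(
            "Access classification must be one of: " + ", ".join(sorted(ACCESS_LEVELS))
        )
    result["Access classification"] = access
    return result


if __name__ == "__main__":
    print(normalize_dataset_sheet({"Name": "Demo", "Access classification": "Restricted"}))
```

The function returns a copy, so the original sheet content is left untouched; any other value than the three listed levels raises an error.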
The expected fields in the indicator sheet are:
Code: a short acronym that reflects the meaning of the indicator.
Name: a short expression that reflects the meaning of the indicator.
Abstract: the abstract of the indicator. This property must describe the indicator in a more extended way than the Name property does. The abstract must not merely repeat the name of the indicator, but provide more information about it that is not given by the Name.
Methodology description (optional): describes the methodology used to produce indicator values. This methodology can concern a particular indicator independently of data sources or be specific to a particular source that provided indicator values (e.g. when a typology is produced, explain the cluster method used and the meaning of values shown in the data file: 1 for decreasing; 2 for increasing).
Methodology URI (optional): reference to the resource where a detailed description of the methodology is given. This may be a reference to an online/paper publication or the name of a file attached to the dataset. If this property specifies a file name, the file must be present in the package delivered to the data processors; otherwise the data provider will be requested to supply this file.
Temporal extent: groups the temporal references of periods or instants covered by the values of an indicator in the dataset. When the indicator is available for different time periods (e.g. the DNS_1a indicator in figure 15), add several temporal extents.
Provider: refers to the data provider of the indicator value. The provider may be an institution or even a person who is the originator of the data. This property should not be confused with the reference to the publication source: the data provider is the actor who contributed to the data production or publication.
Provider URI (optional): official Uniform Resource Identifier (URI) of the data provider. In most cases, this is the URL (Internet address) of the home page of the data provider's site. This property must not refer to the publication, but to the organization or the person who provided the data. For example, this property can take the value "http://ec.europa.eu/eurostat", which refers to the home page of Eurostat.
Publication title (optional): title of the publication or name of the source from which the data were taken, if it exists (for instance "Switzerland Statistics Public Database").
Publication URI (optional): official Uniform Resource Identifier (URI) of the publication. In most cases, this is the URL (Internet address) where the data is available online or can be accessed or obtained (for instance http://www.espon.eu/reports/report001.pdf). This can also be an ISBN if the source is a paper publication.
Publication reference (optional): indicates the element of the referenced publication (page, part, chapter etc.) to refer to (for instance p.50, chapter 2).
Methodology description (optional): this property describes source-specific methodological details that make the data from this source distinct from the data coming from other sources of the dataset (for instance “coming from heterogeneous data providers, the data has been harmonized using Eurostat data”). Cf the Technical Report on Core indicators, which proposes some examples of estimation methods.
Methodology URI (optional): reference to the resource where a detailed description of the methodology is given. This may be a reference to an online/paper publication or the name of a file attached to the dataset. If this property specifies a file name, the file must be present in the package delivered to the data processors; otherwise the data provider will be requested to supply this file.
Copyright (optional): text describing the copyright rules and/or restrictions applied to the data associated with this source. The default value of this property is "(c) ESPON 2013 Database".
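A provider can pre-check one row of the indicator sheet against the field list above. This sketch treats every field not marked "(optional)" as mandatory, which is an inference from the list rather than a documented rule; the dictionary representation and the function name are hypothetical.

```python
# Hypothetical check of one indicator row held as a dict keyed by field name.
MANDATORY = ("Code", "Name", "Abstract", "Temporal extent", "Provider")  # inferred
DEFAULT_COPYRIGHT = "(c) ESPON 2013 Database"  # default stated in the field list


def check_indicator(row: dict):
    """Return (problems, completed_row) for one indicator description."""
    problems = [
        f"missing {field}" for field in MANDATORY
        if not str(row.get(field) or "").strip()
    ]
    name = str(row.get("Name") or "").strip().lower()
    abstract = str(row.get("Abstract") or "").strip().lower()
    if name and abstract and name == abstract:
        problems.append("Abstract only repeats the Name")
    completed = dict(row)
    if not completed.get("Copyright"):
        completed["Copyright"] = DEFAULT_COPYRIGHT
    return problems, completed


if __name__ == "__main__":
    row = {"Code": "DNS_1a", "Name": "Density", "Abstract": "Density",
           "Temporal extent": "2010", "Provider": "Eurostat"}
    print(check_indicator(row))
```

The comparison between Name and Abstract only catches exact repetition; judging whether the abstract is informative enough remains a task for the semantic review.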
When delivering a case-study, the geometry file is not mandatory. That said, it is strongly recommended to attach geometries if the data deals with territorial data (LAU2). This makes it possible to fully reproduce the maps made within a given case-study.
Geometries must be delivered in a .zip format. This .zip file must include at least the following files: .shp, .dbf, .prj, .sbn, .shx (i.e. the georeferenced information systematically generated when editing a layer using a GIS).
Geometries have to be delivered in a .zip archive whose filename is name_of_the_project_geom.zip.
The .dbf file linked to a shapefile has to contain at least a code (ID) that matches the one contained in the data files (Figure 2.8).
Thus, it is possible for the user to:
Analyse the exact territorial coverage of each case study.
Build some maps thanks to the data gathered for each case study of the ESPON Community.
The case-study must be delivered in the upload part of the ESPON Database Portal (when logged in, case-study sub-part). The data and metadata file must be delivered in a .zip archive (Figure 2.9). Then, the data provider can upload the geometries of the case-study, which have to be included in a .zip archive (the delivery of geometries is optional). The upload of data/metadata (and, ideally, geometries) is the first step of the case-study integration process.
Let us now take a closer look at this case-study dataset upload step. Once logged in, the data provider can access the case-study upload page (see Figure 2.9). The case-study dataset upload is composed of two steps: the upload of a dataset file and, optionally, the upload of a geometry file.
The file uploaded must not exceed 100MB. The file must be a zip file containing at least one .xls or .xlsx file (no sub directory allowed, no other extension than .xls or .xlsx accepted).
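The constraints on the dataset archive can be verified locally with Python's standard zipfile module. This is a sketch, not the portal's validation code: the function names are hypothetical, the archive is passed as bytes for simplicity, and 100 MB is read as 100 × 1024 × 1024 bytes (an assumption).

```python
# Hypothetical local check of a case-study dataset archive.
import io
import zipfile

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as binary megabytes


def dataset_zip_problems(data: bytes) -> list[str]:
    """Return the rule violations found in a dataset .zip given as bytes."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("archive exceeds 100 MB")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    if any("/" in name for name in names):
        problems.append("sub-directories are not allowed")
    files = [name for name in names if not name.endswith("/")]
    excel = [n for n in files if n.lower().endswith((".xls", ".xlsx"))]
    if len(excel) != len(files):
        problems.append("only .xls or .xlsx files are accepted")
    if not excel:
        problems.append("no .xls or .xlsx file found")
    return problems


def zip_bytes(names):
    """Demonstration helper: build an in-memory zip with dummy members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "dummy")
    return buffer.getvalue()


if __name__ == "__main__":
    print(dataset_zip_problems(zip_bytes(["data.xls"])))
    print(dataset_zip_problems(zip_bytes(["sub/data.xls", "readme.txt"])))
```

For a real file, the bytes can be obtained with `pathlib.Path("my_case_study.zip").read_bytes()` (the filename is illustrative).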
Once the dataset is uploaded, the data provider can optionally upload a geometry file (see Figure 2.10).
The file uploaded must not exceed 100MB. The file must be a zip file containing at least one triplet: .shp, .shx, .dbf with the same base name. If it contains more than one .shp file, there must be corresponding .shx or .dbf files with the same base name, in the same directory as the .shp file. The zip file can contain other files.
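The triplet rule for the geometry archive can be sketched in the same way. The example below takes the stricter reading that every .shp file needs both a .shx and a .dbf companion with the same base name in the same directory; this reading, the case-insensitive comparison and the function names are assumptions.

```python
# Hypothetical local check of a case-study geometry archive.
import io
import zipfile

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as binary megabytes


def geometry_zip_problems(data: bytes) -> list[str]:
    """Return the rule violations found in a geometry .zip given as bytes."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("archive exceeds 100 MB")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = {name.lower() for name in archive.namelist()}
    shapefiles = sorted(name for name in names if name.endswith(".shp"))
    if not shapefiles:
        problems.append("no .shp file found")
    for shp in shapefiles:
        stem = shp[: -len(".shp")]  # keeps the directory part of the path
        for extension in (".shx", ".dbf"):
            if stem + extension not in names:
                problems.append(f"{shp}: missing {extension} companion")
    return problems


def zip_bytes(names):
    """Demonstration helper: build an in-memory zip with dummy members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "dummy")
    return buffer.getvalue()


if __name__ == "__main__":
    print(geometry_zip_problems(zip_bytes(["lau2.shp", "lau2.shx", "lau2.dbf", "lau2.prj"])))
    print(geometry_zip_problems(zip_bytes(["lau2.shp", "lau2.shx"])))
```

Because the base name includes the directory path, a companion file stored in another folder is reported as missing, in line with the same-directory requirement.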
Once the dataset file and, optionally, the geometry file are uploaded, the case-study tracking is activated (Figure 2.11). Afterwards, the database administrator checks whether all mandatory fields of the case-study metadata are correctly filled. If this is not the case, the data provider is invited to correct her/his dataset on the basis of the database administrator's remarks. If the metadata are correctly filled, the database administrator proceeds to the metadata edition (second step), the case-study creation and the overall check of the case-study. The data provider can follow the progress of the case-study integration in the tracking tool.
When the ESPON Database administrator validates steps 2 and 3, the case-study delivery becomes available in the Search Interface of the ESPON Database Portal (Figure 2.12). This means that the general public can consult data and metadata in a user-friendly way. If geometries are also provided, users can also create new maps on the basis of the data provider's material.
ESPON TPGs may have produced a lot of data useful for specialists (e.g. residuals of a regression model) but not for ordinary (i.e. non-expert) users, such as policy makers or practitioners. TPGs may also produce results in formats (e.g. grid data) or nomenclatures (e.g. LAU2) not compliant with the specifications of the Search Interface of the ESPON Database Portal. In other words, Background data is a good opportunity for ESPON Projects to disseminate all useful data-related material produced within their project, whatever its format.
In such a case, the M4D Project invites ESPON TPGs to deliver their database in the Background part of the ESPON Database Portal. This sub-section (Figure 3.1) follows the organization of ESPON Transnational Project Groups: Applied Research (Priority 1 projects), Targeted Analysis (Priority 2 projects) and Scientific Platform (Priority 3 projects). The data provided by ESPON TPGs can be downloaded in a .zip format.
As described in Figure 1, no checks are made on Background data. ESPON TPGs are free to organize their .zip file as they see fit. We only suggest that TPGs structure this file in a way that is comprehensible for external users. To that end, several general remarks should be kept in mind before delivering the Background data:
The size limit for background data is 100 MB per .zip file. If the file exceeds this threshold, contact the database administrator.
A good practice consists in structuring the background data in intelligible folders and providing, at the root folder, a documentation file explaining how to use and understand the .zip file. A good example of how to structure the background data file can be found in the GEOSPECS background data.
To ensure the INSPIRE compliance of your rasters, it is strongly recommended to use the INSPIRE Metadata editor to edit an xml file, which is especially suited to disseminating this kind of data.
It is not a problem to duplicate the information included in the search interface/case-study interface and in the background data part of the ESPON Database Portal: the background data has to be considered as the "entire TPG database".
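As an illustration of the remarks above, the following sketch packages a project folder into a single .zip file while insisting on a documentation file at the root. The function name, the `README.txt` file name and the folder layout are assumptions made for the example, not requirements of the portal.

```python
import zipfile
from pathlib import Path


def build_background_zip(source_dir: Path, target: Path,
                         readme_name: str = "README.txt") -> int:
    """Zip source_dir into target, refusing to proceed without a root documentation file.

    Returns the archive size in bytes so the caller can compare it to the size limit.
    """
    if not (source_dir / readme_name).is_file():
        raise FileNotFoundError(f"{readme_name} is missing at the root of {source_dir}")
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            # Skip directories, and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != target.resolve():
                # Store paths relative to the root so the folder structure is preserved.
                archive.write(path, path.relative_to(source_dir).as_posix())
    return target.stat().st_size
```

The returned size can then be compared with the 100 MB threshold before contacting the database administrator.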
A template is proposed by M4D.
For territorial datasets not included in the key indicator delivery, the M4D Project has produced a simplified data and metadata template derived from the Metadata Specifications of key indicators. The aim of this template is to give external users the minimum information needed to understand the meaning of the indicator, the origin of the data and some details about the data producer. In practice, this template helps to define harmonised information related to data.
The XLS template developed for that purpose is easy and quick to fill in. It is structured in two parts: one dedicated to data and the other to metadata.
The data template is structured like the one proposed for case-study data (cf. Section 2.2.1), and has to be delivered as a .xls file including a single sheet entitled data.
The metadata file contains 10 compulsory fields (Figure 3.2) and has to be delivered as a .xls file including a single sheet entitled metadata.
This sheet is structured in columns (one for each indicator). The first part is dedicated to the indicator definition, the second part to the data sources.
Background data must be uploaded in the upload part of the ESPON Database Portal (once logged in), in the "Background Data" sub-section (Figure 3.3). After that, a minor compliance check is done by the database administrator. The sole aim of this check is to verify the coherence of the Background data delivered. No further checks (semantic or outlier) will be run on this delivery.
At the end, the background data will be publicly available under the Resource part of the ESPON Database Portal.
By way of conclusion, this chapter offers some advice on managing the data flow inside each ESPON Project, along with complementary information.
To help create data and metadata efficiently, the M4D project has produced some useful guidance documents, which are described in this section.
As shown in Figure 4.1, several XLS templates are available for download under the "Upload" menu of the ESPON Database Web site (login required), organized by type of delivery:
Key indicators: an XLS template fully compatible with the ESPON Metadata specifications [1] is available to download. It includes all the required information described in the metadata specification.
Case-study data: an XLS template adapted to Case-Study data. It includes fewer mandatory fields than the key indicator template.
Background data: an XLS template is recommended. However, given the potentially huge heterogeneity of ESPON TPG deliveries (raster data etc.), it is not mandatory to follow the organization of this file exactly.
In the past, the M4D project has had to answer many questions regarding data integration. We have tried to build on all these exchanges by writing a FAQ, available on-line from the help menu of the Web application [3] since February 2012. As shown in Figure 4.2, questions are ordered by topic:
What is M4D?
The ESPON Database Portal
Restricted part of the ESPON Database Portal
Data delivery
Metadata processing
Support to data creation
Mapkit
Local/urban data
Please check the content of the FAQ before asking your question(s) to the database administrator!
The aim of this presentation is to summarize this written documentation in a clear way. The presentation is available in the upload part of the ESPON Database Portal.
The following advice results from experience gained in the follow-up of previous ESPON Projects, which experienced some difficulties in following the data and metadata specification for their deliveries.
A limited number of persons in charge of data/metadata/GIS creation in each TPG.
Ideally, each project should dedicate one member of its team to data and metadata creation. This makes it possible to:
Centralise all the data of the project
Harmonize data and metadata creation
Give a single delivery at the end of the project (a bad practice would be for each partner of the TPG to deliver its own key indicators without any control by the consortium).
Raise the question of data delivery very early in the lifetime of a project.
Regarding the expected deliveries, some basic questions need to be discussed inside each project very early:
What key indicators will be delivered to the database?
How to organize the data delivery of our case study?
What kind of innovative indicator could we propose to the ESPON Community, which could be updatable in the future?
Do not lose information; use the metadata templates as soon as possible!
That way, you can be sure not to forget any mandatory field, and you will not have to go through a tedious copy/paste of your datasets into the templates at the end of your project.
This example is derived from a concrete case encountered by the M4D project during the data collection for one of the core indicators (total population 1990-2011, available under the search interface). One of the aims of the core database strategy is to provide complete time-series at NUTS levels for the ESPON Area for a set of basic count data. Among other things, this implies estimating some missing values and documenting precisely in the metadata the methodology used to fill the gaps in the dataset.
Starting with Denmark, total population is available for 2007 and 2008 on the Eurostat website. These values carry the label "1", which is described in the metadata file as shown in Figure 4.3.
When looking at other data sources, this information is available only for two territorial units on the National Statistical Website of Denmark (due to the change of NUTS definition). The only way to obtain data for the rest of the territorial units is to proceed to a data estimation (temporal retropolation in this case).
The question is: how to reference this in the metadata file?
The only way to avoid a loss of information is to reference this estimation immediately in the metadata source of the dataset! Figure 4.4, Figure 4.5, Figure 4.6, and Figure 4.7 propose a way to proceed in order to ensure high-quality metadata.
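The principle of labelling each value with its source at the moment it is created can be sketched as follows. This is an illustration only: the `Source` class, the `fill_missing` function, the label "2", the unit code and the population figures are all hypothetical, and the actual layout of labels in the XLS template is the one shown in the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    label: str
    description: str
    estimated: bool = False


def fill_missing(values: dict, labels: dict, estimates: dict,
                 source: Source, sources: dict) -> None:
    """Insert estimated values and record, cell by cell, which source they come from."""
    # Register the estimation method in the metadata before touching the data.
    sources[source.label] = source
    for key, value in estimates.items():
        if key not in values:  # never overwrite an observed value
            values[key] = value
            labels[key] = source.label


# Hypothetical unit code and figures, keyed by (territorial unit, year).
sources = {"1": Source("1", "Eurostat")}
values = {("XX01", 2007): 100.0, ("XX01", 2008): 101.0}
labels = {("XX01", 2007): "1", ("XX01", 2008): "1"}

retropolation = Source("2", "Estimation by temporal retropolation", estimated=True)
fill_missing(values, labels, {("XX01", 2006): 99.0, ("XX01", 2007): 99.5},
             retropolation, sources)
```

Because the label is written in the same operation as the value, an estimated cell can never end up in the dataset without a documented source.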
[1] ESPON Data and Metadata Specification. Full text in HTML (last visit: 2012-05-20) .
[2] ESPON Technical Report - The Core Database Strategy – A new paradigm for data collection at regional level. December 2011.
[3] ESPON Database Web Application. Version February 2012. http://database.espon.eu (last visit: 2012-05-20) .
[4] ESRI Shape File Technical Description. An ESRI White Paper - July 1998. Full text in PDF (last visit: 2012-03-23) .
This appendix presents the different steps of the Key Indicators Datasets integration dataflow. The example of a test dataset integration illustrates via screenshots the ESPON Database Portal tracking functionality.
This "syntactic" check consists of a data/metadata file upload. The ESPON TPG is invited on this first form to specify if this upload concerns a new Key Indicators Dataset (left side in Figure B.1) or an update of an existing one (right side in Figure B.1).
Once submitted, the uploaded file is automatically checked. If the syntax of the delivered file is correct, the dataset appears in the tracking table as shown in Figure B.2. The M4D Contact Team in charge of the project receives a notification email to perform the syntactic check.
Once the "semantic check form" has been submitted by the database administrator, the ESPON TPG is notified by email that the state of the dataset has become SEMANTICS_CHECKED, as shown in the "Tracking" overview table (Figure B.4). The comment and any report file are available from the "Dataset Details" page shown in Figure B.5. The notification email invites the ESPON TPG to log in to the ESPON Database Portal and consult the semantic expertise before the next step described in Section B.3.
When the database administrator has delivered the report about the semantics check, the TPG is notified. She/he is invited to consult the report and can then choose to fix her/his delivery or to forward it to the next step of the integration.
Taking into account the remarks in the semantic report, the ESPON TPG continues the integration or decides to review the dataset. This decision is recorded through the simple form shown in Figure B.6.
If the TPG answers "YES" to this "semantic check approval" question form, the state of the dataset becomes "SEMANTICS_ACCEPTED" (Figure B.7).
This step mainly consists of detecting outliers and checking the quality of data. An outlier report is delivered at the end of this expertise.
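This report does not specify which detection method is applied during this step. For data providers who want to screen their own values before delivery, a common generic technique is the interquartile range rule, sketched below; the function name, the unit codes and the figures are hypothetical, and this is not presented as the method used by the database administrator.

```python
import statistics


def iqr_outliers(values: dict[str, float], k: float = 1.5) -> dict[str, float]:
    """Return the entries lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values.values(), n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {unit: v for unit, v in values.items() if v < low or v > high}


# Hypothetical indicator values keyed by hypothetical territorial unit codes.
sample = {"A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "F": 15, "G": 16, "H": 200}
suspicious = iqr_outliers(sample)
```

A flagged value is not necessarily wrong; it is simply worth checking against its source before the dataset is submitted.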
The form shown in Figure B.8 is intended for the database administrator, who is in charge of managing the outlier detection and statistics. Once it is submitted, the state of the dataset is "OUTLIERS_CHECKED" (Figure B.9).
Once this form has been submitted, the ESPON TPG is notified by email that the dataset has passed the "OUTLIERS_CHECKED" step, as shown in the Tracking overview table (Figure B.10). The comments and any additional outlier result files are available from the detailed page of the dataset integration (Figure B.11).
When the database administrator has delivered the outliers report, the TPG is notified. She/he is invited to consult the report, then to decide whether to continue the integration process or to review her/his data.
The approval of the outliers expertise by the TPG is similar to the "semantics approbation" step, in which the TPG is invited to approve or abandon the integration of her/his dataset. By submitting the "outliers approbation form" shown in Figure B.12, the TPG confirms having read the outliers report and decides either to fix the dataset or to continue its integration.
If the TPG answers "YES" to this "outliers check approval" form, the state of the dataset is "OUTLIERS_APPROVED", as shown in the Tracking overview table (Figure B.14) and dataset details page (Figure B.15).
Once the TPG has approved, the ESPON CU is notified by email. This step is described in Section B.6.
The "ESPON CU Approval" step 4 of the tracking workflow invites ESPON CU to read the expertise reports produced during the semantic and outlier checks. ESPON CU then decides whether or not to integrate the dataset into the ESPON Database. The form shown in Figure B.16 allows ESPON CU to abandon this version of the dataset ("Resubmit" option). Approving the dataset ("Next Step" option) immediately triggers an attempt to integrate it into the database.
Please note that, depending on the size of the dataset and on server performance, the integration may take a while (more than 3 hours for the largest dataset, 15 megabytes). Nevertheless, once the integration has started, the user can leave the page and close her/his Web browser: the integration continues on the server side. Then:
In the case of a successful integration, the state of the dataset becomes "Integrated". This information is shown in the tracking overview table by a dedicated icon.
In the case of a failed integration, the state of the dataset becomes "Abandoned at step 5". In the tracking overview table, this information is also shown by a dedicated icon.
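The succession of states described in this appendix can be summarized as a small state machine. The sketch below is a reading aid, not the portal's implementation: the state name `UPLOADED` is hypothetical (the report does not name the state that follows a successful upload), and only the forward path is modelled, since the states reached when a delivery is sent back for correction are not named here.

```python
from enum import Enum


class State(Enum):
    UPLOADED = "UPLOADED"  # hypothetical name, not taken from the portal
    SEMANTICS_CHECKED = "SEMANTICS_CHECKED"
    SEMANTICS_ACCEPTED = "SEMANTICS_ACCEPTED"
    OUTLIERS_CHECKED = "OUTLIERS_CHECKED"
    OUTLIERS_APPROVED = "OUTLIERS_APPROVED"
    INTEGRATED = "Integrated"
    ABANDONED_AT_STEP_5 = "Abandoned at step 5"


# Forward path, with the actor whose form submission triggers each transition.
NEXT_STEP = {
    State.UPLOADED: ("database administrator", State.SEMANTICS_CHECKED),
    State.SEMANTICS_CHECKED: ("TPG", State.SEMANTICS_ACCEPTED),
    State.SEMANTICS_ACCEPTED: ("database administrator", State.OUTLIERS_CHECKED),
    State.OUTLIERS_CHECKED: ("TPG", State.OUTLIERS_APPROVED),
}


def advance(state: State, integration_succeeded: bool = True) -> State:
    """Return the state reached when the current step is approved."""
    if state in NEXT_STEP:
        return NEXT_STEP[state][1]
    if state is State.OUTLIERS_APPROVED:
        # ESPON CU chose "Next Step": the integration is attempted immediately.
        return State.INTEGRATED if integration_succeeded else State.ABANDONED_AT_STEP_5
    raise ValueError(f"{state.value} is a final state")
```

Reading `NEXT_STEP` from top to bottom shows the alternation between the database administrator, who produces each expertise, and the TPG, who approves it, before ESPON CU takes the final decision.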
This document is part of the ESPON 2013 Database Phase 2 project, also known as M4D (Multi Dimension Database Design and Development). It was generated on 2014-12-19 17:33:27 from the sources of the m4d forge imag project, at svn rev 2420.
The main author of this document is Ronan Ysebaert (UMS RIATE), with the help and contribution of UMS RIATE and LIG STeamer M4D Partners.
For any comment, question or suggestion, please contact <[email protected]>.
Colophon
Based on DocBook technology [1], this document is written in XML format. Sources are validated with DocBook DTD 4.5CR3, then transformed to HTML and PDF formats using DocBook xslt 1.73.2 stylesheets. The generation of the documents is automated thanks to the docbench LIG STeamer project, which is based on Ant [2], java [3], and the processors Xalan [4] and FOP [5].
Note that the standard XSLT stylesheets are customized in order to get a better image resolution for admonition icons in the generated PDF output: the generated sizes of these icons were changed from 30 to 12 pt.
[1] [on line] DocBook.org (last visit: July 2011)
[2] [on line] Apache Ant - Welcome. Version 1.7.1 (last visit: July 2011)
[3] [on line] Developer Resources For Java Technology (last visit: July 2011). Version 1.6.0_03-b05.
[4] [on line] Xalan-Java Version 2.7.1 (last visit: 18 november 2009). Version 2.7.1.
[5] [on line] Apache FOP (last visit: July 2011). Version 0.94.