ESPON 2020 Core Data Updating Strategy

Introduction

The Updating module is another application in the ESPON Data Portal that allows a user to insert the detected virtual changes into the final database. That module is fully integrated in the ESPON Database and it takes profit of other applications that have been developed in another working packages. All the source code can be found in the official git repository in a module called Updating[1] . Two user expert roles can be derived from Updating module and Updating strategy: one role to run server scripts and another role to update indicators in a local machine with Excel Updating Suite.

Required python packages:

python == 3.5.2
django == 1.11.22
psycopg2 == 2.7.1
pandas == 0.22
xlwings == 0.15.1
xlsxwriter == 1.0.5
xlrd == 1.1.0.
jinja2 == 2.10.1

[1] https://git.unepgrid.ch/espon-egtc/edp/src/branch/update_base_indicator/django-edp/Updating

Updating tables

To make the Updating module work properly, five tables under the updating namespace in the ESPON database are needed: tdta_metadata, tdta_eurostat, tdta_missing_pattern, tdta_nulls y tdta_report. The three first tables are the most important because they store data that is relevant for a correct working of the tool; last two tables are optional

tdta_metadata
- This table contains information that is relevant to a given ESPON indicator, basically metadata and a unique identifier to group multi indicators
tdta_eurostat
- This table contains information about Eurostat table names as well as parameters that are needed to extract the counterpart ESPON indicators from those datasets
tdta_missing_pattern
- This table contains information related to missing patterns, i.e. a missing matrix as well several parameters to automatize the updating procedure in a local workstation
tdta_report
- This table contains a detailed report about the updates for a given indicator, providing the number of rows and affected tables in the ESPON database
tdta_nulls
- This table contains information about detected nulls for a given indicator

Updating command scripts

The Updating module also contains seven Django commands, namely python scripts, to be launched by an expert user in the ESPON server via line commands. The name of each script reflects the order of execution. These scripts are located in a folder named management and the name of each script reflects the order of execution. Moreover, there are other scripts useful in different stages of the updating procedure. The location folder and the main purpose of different scripts are detailed in Table 1.

Table 1: Command scripts (in blue) and auxiliary functions (in green) employed in Updating module

File name	Folder	Description
1st_insert_metadata.py	management	This script will populate the tdta_metadata table
2nd_download_eurostat.py	management	This script will download the Eurostat tables from Eurostat services and store those tsv files in a private folder in the server
3rd_insert_eurostat.py	management	This script will populate the tdta_eurostat table
4th_insert_nulls.py	management	This script will populate the tdta_nulls table
5th_insert_mp.py	management	This script will populate the tdta_missing_pattern table
6th_create_EUS.py	management	This script will build the Excel Updating Suite as a zip package
7th_importer2ESPONDB.py	management	This script will update several tables: tdta_data_item, trel_data_item_source, trel_data_item_method and tdta_metadata
BaseIndicator.py		This file contains a class to get data from ESPON database in a homogeneous way, irrespective from their nature indicator (single, multi).
eurostat_equivalences.py	data	This file contains a dictionary of useful parameters to get the counterpart values from Eurostat tables
utils_fn.py	utils	A collection of utilities in a module for several purposes: transform both ESPON and Eurostat data to pandas-like structures, helpers functions useful to another functions in different modules
nuts_related_fn.py	utils	A collection of functions to transform NUTS units in a hierarchical way, specially to transverse ESPON time series and generate relationships between rows

Table 2: Parametrized templates to generate at runtime the macros for the Excel Updating Suite

File name	Folder	Description
approval_and_launcher.tpl.py	macros	Parametrized script to launch detected virtual changes into ESPON Excel sheets
diff_dataframes_man_chg.tpl.py	macros	Parametrized script to start a new manual updating session and to make differences between ESPON time series
outliers.tpl.py	macros	Parametrized script to launch an Outlier detection method on initial/modified ESPON data
save_local_db.py	macros	Parametrized script to save changes into a local database file
spatial_matrix.tpl.py	macros	Parametrized script to compute spatial matrix
thematic_matrix.tpl.py	macros	Parametrized script to compute thematic matrix

Updating folders

To plug this module into the Database Portal an accessible folder in the server with writing/reading permissions is needed. The files generated by this application will be stored in this folder, in a nested structure with paths relative to root directory, to be further retrieved on demand by the application.

eurostat
- The location of the folder in which datasets from Eurostat web services are downloaded. Not empty
macros
- The location of the folder in which macros are generated for creating the Excel Updating Suite. A nested subfolder structure is generated. Initially empty.
eus
- An initially empty folder but EUS suite is generated in a subfolder called eus, as well as xlsx files are stored in another subfolder called xslx to make the zip automatically. A nested subfolder structure is generated to get zip files per country basis or per all countries
db_updates
- The location of the folder in which user uploads database dumps in sqlite3 format
db_reports
- The location of the folder in which a report is generated in text format is generated when user launch updates files against the ESPON database. Initially empty

How to use the Updating module

After connecting to ESPON server, an expert user will be able to launch the command scripts. A list of all ESPON Base indicators with their unique identifiers (ind_id) as stored in he ESPON database can be found in Table 4.

The execution procedure will be as follow:

Run 1st_insert_metadata.py script

python manage.py 1st_insert_metadata –indicator_id <ind_id>

This script will read metadata from ESPON 2020 Database and will populate the majority of all fields in the tdta_metadata table. This is achieved through a BaseIndicator class (in Updating/BaseIndicator.py) that populates all fields except for number_of_null

Run 2nd_download_eurostat.py script

python manage.py 2nd_download_eurostat –indicator_id <ind_id>

This script will start the downloading process of tsv files from Eurostat downloading services[1]:

gets the name of the Eurostat dataset as read from a EUROSTAT_EQUIVALENCES dictionary key (eurostat_file) in eurostat_equivalences.py script
checks the lastest version of downloaded tsv file (by reading the metadata.txt file with latest date). If current date (Y-M-Dformat), tsv file is not downloaded; otherwise, is downloaded
if downloaded, moves those tsv files to eurostat directory with a created directory according to some categories (DEMOGRAPHY, ECONOMY, EDUCATION, EMPLOYMENT, ENERGY, ITS, R&D, SOCIETY)

[1] https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data/

Run 3rd_insert_eurostat.py script

python manage.py 3rd_insert_eurostat –indicator_id <ind_id>

This script will populate the tdta_eurostat table that will be used in the Updating application to get data from Eurostat and compare with pair-wise ESPON indicator so as to get the so-called virtual changes (later with 6th_create_EUS.py script)

Run 4th_insert_nulls.py script (optional)

python manage.py 4th_insert_nulls --indicator_id <ind_id> --country_code <ctr_code>

This script will populate the tdta_nulls table that will be useful to get an idea of existing nulls in Base Indicators. Moreover, a field in tdta_metadata called number_nulls is populated after this scripts is run.

Run 5th_insert_mp script

python manage.py 5th_insert_mp --indicator_id <ind_id> --country_code <ctr_code> This script will populate the tdta_missing_pattern table that will be useful to get the missing pattern for a given ESPON indicator and a given country. This information will be very valuable later, when 6th_create_EUS.py script is run. One completed, there is no need to re-run until a new bunch update of Base Indicator is injected in ESPON database. These five first scripts have only to be launched once and this is the only requirement to start with the updating procedure.

Run 6th_create_EUS.py script

python manage.py 6th_create_EUS --indicator_id <ind_id> --country_code <ctr_code> --create_zip True

This script will create the so-called Excel Updating Suite that consists of a zip file with an Excel Dashboard and a bunch of files/macros, gathering information from Updating database (previous steps). The location of created zips is eus/zips folder, with either subpath to country-basis folder (for instance, inds/145/By_Country/) if country_code parameter, or subpath to all-country-basis folder (for instance, inds/145/All_countries/).

These macros are generated from parametrised python templates that are located in the macros/templates folder of the Updating module. See Table 2

Once expert user has built the EUS zip files for all countries for a given indicator in the server, an accessible point to be downloaded will be ready. With all Excels files and macros in his local machine, the updating procedure controlled by an expert user can be start in a local workstation following the workflow as specified in the Table 3.

When this updating procedure is finished, a local sqlite3 file with all virtual changes for all the countries should be generated (see step 12, Table 3) and the expert user should upload that file to a folder (db_updates) in the ESPON server.

Run 7th_importer2ESPONDB.py script

python manage.py 7th_importer2ESPONDB --indicator_id <ind_id>

This script will read the uploaded sqlite3 file and all virtual changes will be integrated in the final ESPON database. A txt report is generated and stored in database containing relevant information about data to be inserted in ESPON database.

Table 3. Sequential steps in the Updating workflow to updating a base indicator

#Step	Action	Description
1	Preparation of the data for the update (creation of the Excel Updating Suite)	After selecting the indicator to be updated, a process is run on the server to generate an Excel Updating Suite for each country, containing, in different sheets, the current indicator figures, the virtual changes derived from the comparison with Eurostat data, and a specific sheet with macro buttons to launch the following steps. Those Excel sheets are downloaded from the server within a ZIP file.
2	Detection of potential changes from Eurostat data, which are compliant with the spatial and thematic constraints of the specific indicator, when estimated data are involved.	After unzipping the Excel sheets, the process starts with one of the countries. In order to detect the potential and compliant changes, the corresponding macro should be run from the Excel Updating Suite, i.e. by pressing the Launch button
3	Creation of spatial and thematic validation matrices for the whole time-series.	In order to create the matrices, the corresponding buttons (Spatial matrix and Thematic matrix) in the Excel Updating Suite should be pressed.
4	Creation of supportive charts (line plots) of the indicator time-series.	Optional. In case of need, the expert in charge of the update can generate automatic line plots of the data, to have a graphical overview of the indicator time-series by pressing the Charts button.
5	Review of the resulting matrices to decide eventually if potential changes should be integrated or rejected.	The matrices provide the deviation of potential changes with regard to the spatial and thematic rules. It is up to the expert in charge of the update, to decide whether a specific deviation is acceptable or not.
6	Manual editing of potential changes to be rejected or modified.	All the potential changes not edited (or removed) in the corresponding Excel sheet will be eventually actual changes in the database. If some changes are considered as unacceptable, should be removed from the corresponding Excel sheet at this stage.
7	Detection of potential outliers in ESPON time series.	In case of need, the expert in charge of the update can launch two algorithms to detect potential outliers in the ESPON time series by pressing the Outliersbutton.
8	Save potential changes to the local DB.	By means of the Save button, the potential changes that remain in the Excel sheets are stored in a local database.
9	Introduction of additional figures from other sources and save the changes to the local DB.	There is the possibility to add manually potential changes coming from other sources (including estimated values). This is done in this step by pressing the corresponding buttons (Start manual session and Detect manual changes). If manual changes have been introduced, the potential changes need to be saved to the local DB again.
10	Repeat processes 1 to 9 for all the countries covered by the specific indicator.
11	Prepare a single file for the specific indicator including all the countries.
12	Upload the file to the ESPON DB Server and actually update the indicator.

Table 4. ESPON Base Indicators and unique indicator identifier as stored in ESPON database

Category	Indicator	ind_id
Demography	Population, total, by broad age (0-14, 14-65, >65) groups and gender	145
	Life expectancy, by age, gender	115
	Population, total, by broad age groups (by 5s) and gender	157
	Old age dependence ratio	223
	Gender imbalances	218
	Ageing index	215
	Deaths by age groups and gender (NUTS2)	1
	Deaths by age groups and gender (NUTS3)	58
	Live Births	313
	Total Population	314
	Total Deaths	315
	Natural Change of population	311
	Total Net Migration	312
Employment	Employment rates by sex & broad age groups (%)	244
	Employment by sex & broad age groups (1000)	256
	Employment by age, economic activity (Econ. sections C, F, I)	733,738,763
	Unemployment by sex & age (1000)	226
	Unemployment by sex & age (%)	340
	Employment by age, economic activity (Econ. sections A, F and K)	277
	Long term unemployment	316
Education	Early leavers from education and training (18-24 years) by sex	308
	Participation rate in education and training (age group 25 to 64 years)	304
	Population aged 25-64, 30-34 by educational attainment level, sex (%)	282
Economy	Disposable income of private households	317
	Gross value added at basic prices, million EURO	319
	Households with broandband access (percentage of households)	318
	Gross domestic product (GDP) at current market prices by Million PPS	321
	Gross domestic product (GDP) at current market prices by Million Euro	322
	Gross domestic product (GDP) at current market prices by Purchasing Power Standard per inhabitant	320
Energy	Share of energy from renewable sources (in gross final energy consumption)	324
	Greenhouse gas emissions by source sector (EEA)	327
	Primary energy consumption	325
	Energy intensity of the economy	326
Society	People at risk of poverty or social exclusion (% of total population)	328
	Severe material deprivation rate by NUTS2 regions (% of population)	331
	People living in households with very low work intensity (population aged 0 to 59 years)	330
	At-risk-of-poverty rate by NUTS2 regions (% of population)	329
R&D	Human resources in science and technology	334
	Patent applications to the EPO by priority year by NUTS 3 regions per million inhabitants	333
	Total intramural R & D expenditure (GERD) by sectors of performance (Percentage of GDP)	332
	Total R&D personnel and researchers by sectors of performance, sex (Percentage of active population)	335
ITS	Individuals who used the internet for interaction with public authorities	337
	Individuals who have never used a computer (% of population)	338
	Individuals who ordered goods or services over the Internet for private use (% of population)	336

In this Page

On this section

|