ESPON 2020 Core Data Updating Strategy
Introduction
The Updating module is another application in the ESPON Data Portal that allows a user to insert the detected virtual changes into the final database. That module is fully integrated in the ESPON Database and it takes profit of other applications that have been developed in another working packages. All the source code can be found in the official git repository in a module called Updating[1] . Two user expert roles can be derived from Updating module and Updating strategy: one role to run server scripts and another role to update indicators in a local machine with Excel Updating Suite.
Required python packages:
- python == 3.5.2
- django == 1.11.22
- psycopg2 == 2.7.1
- pandas == 0.22
- xlwings == 0.15.1
- xlsxwriter == 1.0.5
- xlrd == 1.1.0.
- jinja2 == 2.10.1
[1] https://git.unepgrid.ch/espon-egtc/edp/src/branch/update_base_indicator/django-edp/Updating
Updating tables
To make the Updating module work properly, five tables under the updating namespace in the ESPON database are needed: tdta_metadata, tdta_eurostat, tdta_missing_pattern, tdta_nulls y tdta_report. The three first tables are the most important because they store data that is relevant for a correct working of the tool; last two tables are optional
- tdta_metadata
- This table contains information that is relevant to a given ESPON indicator, basically metadata and a unique identifier to group multi indicators
- tdta_eurostat
- This table contains information about Eurostat table names as well as parameters that are needed to extract the counterpart ESPON indicators from those datasets
- tdta_missing_pattern
- This table contains information related to missing patterns, i.e. a missing matrix as well several parameters to automatize the updating procedure in a local workstation
- tdta_report
- This table contains a detailed report about the updates for a given indicator, providing the number of rows and affected tables in the ESPON database
- tdta_nulls
- This table contains information about detected nulls for a given indicator
Updating command scripts
The Updating module also contains seven Django commands, namely python scripts, to be launched by an expert user in the ESPON server via line commands. The name of each script reflects the order of execution. These scripts are located in a folder named management and the name of each script reflects the order of execution. Moreover, there are other scripts useful in different stages of the updating procedure. The location folder and the main purpose of different scripts are detailed in Table 1.
Table 1: Command scripts (in blue) and auxiliary functions (in green) employed in Updating module
File name | Folder | Description |
1st_insert_metadata.py | management | This script will populate the tdta_metadata table |
2nd_download_eurostat.py | management | This script will download the Eurostat tables from Eurostat services and store those tsv files in a private folder in the server |
3rd_insert_eurostat.py | management | This script will populate the tdta_eurostat table |
4th_insert_nulls.py | management | This script will populate the tdta_nulls table |
5th_insert_mp.py | management | This script will populate the tdta_missing_pattern table |
6th_create_EUS.py | management | This script will build the Excel Updating Suite as a zip package |
7th_importer2ESPONDB.py | management | This script will update several tables: tdta_data_item, trel_data_item_source, trel_data_item_method and tdta_metadata |
BaseIndicator.py |
| This file contains a class to get data from ESPON database in a homogeneous way, irrespective from their nature indicator (single, multi). |
eurostat_equivalences.py | data | This file contains a dictionary of useful parameters to get the counterpart values from Eurostat tables |
utils_fn.py | utils | A collection of utilities in a module for several purposes: transform both ESPON and Eurostat data to pandas-like structures, helpers functions useful to another functions in different modules |
nuts_related_fn.py | utils | A collection of functions to transform NUTS units in a hierarchical way, specially to transverse ESPON time series and generate relationships between rows |
Table 2: Parametrized templates to generate at runtime the macros for the Excel Updating Suite
File name | Folder | Description |
approval_and_launcher.tpl.py | macros | Parametrized script to launch detected virtual changes into ESPON Excel sheets |
diff_dataframes_man_chg.tpl.py | macros | Parametrized script to start a new manual updating session and to make differences between ESPON time series |
outliers.tpl.py | macros | Parametrized script to launch an Outlier detection method on initial/modified ESPON data |
save_local_db.py | macros | Parametrized script to save changes into a local database file |
spatial_matrix.tpl.py | macros | Parametrized script to compute spatial matrix |
thematic_matrix.tpl.py | macros | Parametrized script to compute thematic matrix |
Updating folders
To plug this module into the Database Portal an accessible folder in the server with writing/reading permissions is needed. The files generated by this application will be stored in this folder, in a nested structure with paths relative to root directory, to be further retrieved on demand by the application.
- eurostat
- The location of the folder in which datasets from Eurostat web services are downloaded. Not empty
- macros
- The location of the folder in which macros are generated for creating the Excel Updating Suite. A nested subfolder structure is generated. Initially empty.
- eus
- An initially empty folder but EUS suite is generated in a subfolder called eus, as well as xlsx files are stored in another subfolder called xslx to make the zip automatically. A nested subfolder structure is generated to get zip files per country basis or per all countries
- db_updates
- The location of the folder in which user uploads database dumps in sqlite3 format
- db_reports
- The location of the folder in which a report is generated in text format is generated when user launch updates files against the ESPON database. Initially empty
How to use the Updating module
After connecting to ESPON server, an expert user will be able to launch the command scripts. A list of all ESPON Base indicators with their unique identifiers (ind_id) as stored in he ESPON database can be found in Table 4.
The execution procedure will be as follow:
Run 1st_insert_metadata.py script
python manage.py 1st_insert_metadata –indicator_id <ind_id>
This script will read metadata from ESPON 2020 Database and will populate the majority of all fields in the tdta_metadata table. This is achieved through a BaseIndicator class (in Updating/BaseIndicator.py) that populates all fields except for number_of_null
Run 2nd_download_eurostat.py script
python manage.py 2nd_download_eurostat –indicator_id <ind_id>
This script will start the downloading process of tsv files from Eurostat downloading services[1]:
- gets the name of the Eurostat dataset as read from a EUROSTAT_EQUIVALENCES dictionary key (eurostat_file) in eurostat_equivalences.py script
- checks the lastest version of downloaded tsv file (by reading the metadata.txt file with latest date). If current date (Y-M-Dformat), tsv file is not downloaded; otherwise, is downloaded
- if downloaded, moves those tsv files to eurostat directory with a created directory according to some categories (DEMOGRAPHY, ECONOMY, EDUCATION, EMPLOYMENT, ENERGY, ITS, R&D, SOCIETY)
[1] https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data/
Run 3rd_insert_eurostat.py script
python manage.py 3rd_insert_eurostat –indicator_id <ind_id>
This script will populate the tdta_eurostat table that will be used in the Updating application to get data from Eurostat and compare with pair-wise ESPON indicator so as to get the so-called virtual changes (later with 6th_create_EUS.py script)
Run 4th_insert_nulls.py script (optional)
python manage.py 4th_insert_nulls --indicator_id <ind_id> --country_code <ctr_code>
This script will populate the tdta_nulls table that will be useful to get an idea of existing nulls in Base Indicators. Moreover, a field in tdta_metadata called number_nulls is populated after this scripts is run.
Run 5th_insert_mp script
python manage.py 5th_insert_mp --indicator_id <ind_id> --country_code <ctr_code> This script will populate the tdta_missing_pattern table that will be useful to get the missing pattern for a given ESPON indicator and a given country. This information will be very valuable later, when 6th_create_EUS.py script is run. One completed, there is no need to re-run until a new bunch update of Base Indicator is injected in ESPON database. These five first scripts have only to be launched once and this is the only requirement to start with the updating procedure.
Run 6th_create_EUS.py script
python manage.py 6th_create_EUS --indicator_id <ind_id> --country_code <ctr_code> --create_zip True
This script will create the so-called Excel Updating Suite that consists of a zip file with an Excel Dashboard and a bunch of files/macros, gathering information from Updating database (previous steps). The location of created zips is eus/zips folder, with either subpath to country-basis folder (for instance, inds/145/By_Country/) if country_code parameter, or subpath to all-country-basis folder (for instance, inds/145/All_countries/).
These macros are generated from parametrised python templates that are located in the macros/templates folder of the Updating module. See Table 2
Once expert user has built the EUS zip files for all countries for a given indicator in the server, an accessible point to be downloaded will be ready. With all Excels files and macros in his local machine, the updating procedure controlled by an expert user can be start in a local workstation following the workflow as specified in the Table 3.
When this updating procedure is finished, a local sqlite3 file with all virtual changes for all the countries should be generated (see step 12, Table 3) and the expert user should upload that file to a folder (db_updates) in the ESPON server.
Run 7th_importer2ESPONDB.py script
python manage.py 7th_importer2ESPONDB --indicator_id <ind_id>
This script will read the uploaded sqlite3 file and all virtual changes will be integrated in the final ESPON database. A txt report is generated and stored in database containing relevant information about data to be inserted in ESPON database.
Table 3. Sequential steps in the Updating workflow to updating a base indicator
#Step | Action | Description |
1 | Preparation of the data for the update (creation of the Excel Updating Suite) | After selecting the indicator to be updated, a process is run on the server to generate an Excel Updating Suite for each country, containing, in different sheets, the current indicator figures, the virtual changes derived from the comparison with Eurostat data, and a specific sheet with macro buttons to launch the following steps. Those Excel sheets are downloaded from the server within a ZIP file. |
2 | Detection of potential changes from Eurostat data, which are compliant with the spatial and thematic constraints of the specific indicator, when estimated data are involved. | After unzipping the Excel sheets, the process starts with one of the countries. In order to detect the potential and compliant changes, the corresponding macro should be run from the Excel Updating Suite, i.e. by pressing the Launch button |
3 | Creation of spatial and thematic validation matrices for the whole time-series. | In order to create the matrices, the corresponding buttons (Spatial matrix and Thematic matrix) in the Excel Updating Suite should be pressed. |
4 | Creation of supportive charts (line plots) of the indicator time-series. | Optional. In case of need, the expert in charge of the update can generate automatic line plots of the data, to have a graphical overview of the indicator time-series by pressing the Charts button. |
5 | Review of the resulting matrices to decide eventually if potential changes should be integrated or rejected. | The matrices provide the deviation of potential changes with regard to the spatial and thematic rules. It is up to the expert in charge of the update, to decide whether a specific deviation is acceptable or not. |
6 | Manual editing of potential changes to be rejected or modified. | All the potential changes not edited (or removed) in the corresponding Excel sheet will be eventually actual changes in the database. If some changes are considered as unacceptable, should be removed from the corresponding Excel sheet at this stage. |
7 | Detection of potential outliers in ESPON time series. | In case of need, the expert in charge of the update can launch two algorithms to detect potential outliers in the ESPON time series by pressing the Outliersbutton. |
8 | Save potential changes to the local DB. | By means of the Save button, the potential changes that remain in the Excel sheets are stored in a local database. |
9 | Introduction of additional figures from other sources and save the changes to the local DB. | There is the possibility to add manually potential changes coming from other sources (including estimated values). This is done in this step by pressing the corresponding buttons (Start manual session and Detect manual changes). If manual changes have been introduced, the potential changes need to be saved to the local DB again. |
10 | Repeat processes 1 to 9 for all the countries covered by the specific indicator. | |
11 | Prepare a single file for the specific indicator including all the countries. | |
12 | Upload the file to the ESPON DB Server and actually update the indicator. |
Table 4. ESPON Base Indicators and unique indicator identifier as stored in ESPON database
Category | Indicator | ind_id |
Demography | Population, total, by broad age (0-14, 14-65, >65) groups and gender | 145 |
Life expectancy, by age, gender | 115 | |
Population, total, by broad age groups (by 5s) and gender | 157 | |
Old age dependence ratio | 223 | |
Gender imbalances | 218 | |
Ageing index | 215 | |
Deaths by age groups and gender (NUTS2) | 1 | |
Deaths by age groups and gender (NUTS3) | 58 | |
Live Births | 313 | |
Total Population | 314 | |
Total Deaths | 315 | |
Natural Change of population | 311 | |
Total Net Migration | 312 | |
Employment | Employment rates by sex & broad age groups (%) | 244 |
Employment by sex & broad age groups (1000) | 256 | |
Employment by age, economic activity (Econ. sections C, F, I) | 733,738,763 | |
Unemployment by sex & age (1000) | 226 | |
Unemployment by sex & age (%) | 340 | |
Employment by age, economic activity (Econ. sections A, F and K) | 277 | |
Long term unemployment | 316 | |
Education | Early leavers from education and training (18-24 years) by sex | 308 |
Participation rate in education and training (age group 25 to 64 years) | 304 | |
Population aged 25-64, 30-34 by educational attainment level, sex (%) | 282 | |
Economy | Disposable income of private households | 317 |
Gross value added at basic prices, million EURO | 319 | |
Households with broandband access (percentage of households) | 318 | |
Gross domestic product (GDP) at current market prices by Million PPS | 321 | |
Gross domestic product (GDP) at current market prices by Million Euro | 322 | |
Gross domestic product (GDP) at current market prices by Purchasing Power Standard per inhabitant | 320 | |
Energy | Share of energy from renewable sources (in gross final energy consumption) | 324 |
Greenhouse gas emissions by source sector (EEA) | 327 | |
Primary energy consumption | 325 | |
Energy intensity of the economy | 326 | |
Society | People at risk of poverty or social exclusion (% of total population) | 328 |
Severe material deprivation rate by NUTS2 regions (% of population) | 331 | |
People living in households with very low work intensity (population aged 0 to 59 years) | 330 | |
At-risk-of-poverty rate by NUTS2 regions (% of population) | 329 | |
R&D | Human resources in science and technology | 334 |
Patent applications to the EPO by priority year by NUTS 3 regions per million inhabitants | 333 | |
Total intramural R & D expenditure (GERD) by sectors of performance (Percentage of GDP) | 332 | |
Total R&D personnel and researchers by sectors of performance, sex (Percentage of active population) | 335 | |
ITS | Individuals who used the internet for interaction with public authorities | 337 |
Individuals who have never used a computer (% of population) | 338 | |
Individuals who ordered goods or services over the Internet for private use (% of population) | 336 |
|