ESPON 2013 Data Migration

Introduction

To perform the data migration, an automatized procedure has been developed that reads the whole information (data and metadata) stored in Excel files for all indicators, transforms it and integrate into the final ESPON database. As a result, an initial state of the database will be achieved with the storage of selected ESPON 2013 indicators (in their version as of ESPON M4D). The procedure is similar to the first import of the ESPON Base indicators but completely different from the Updating strategy. This strategy is a well-documented procedure to follow when the updating of a given indicator stored in database is desired.

Required python packages:

  • Python 2.7.4
  • psycopg2 == 2.7.1
  • pandas == 0.22
  • xlrd == 1.1.0.

During the execution of the program, five phases/components can be distinguished, following the next sequential workflow:

  1. Pre-population
  2. Reading
  3. Parsing
  4. Building
  5. Insertion
  6. Export

Pre-population

In this first stage, the program connects to the database and populates information that is considered key to the indicators. This kind of information is not modifiable and refers to distinct pieces of indicator metadata such as:

  • Project
  • Keywords & keyword Set (Theme/Category )
  • Contact role
  • Constraints (Use/Class)
  • Spatial extension
  • Core type
  • Data type
  • Indicator Type
  • Nat Type
  • Nomenclature
  • Geometries

Except for the project attribute, the rest of pre-fixed pieces have been extracted from existing ESPON M4D documentation, see reference:

 http://database.espon.eu/db2/jsf/MetadataSpecifications/MetadataSpecifications_onehtml/index.html

As output, the pre-populated tables are the following:

  • edp_core_tdta_project
  • edp_core_tdta_gis
  • edp_core_tkey_keyword
  • edp_core_tkey_keyword_set
  • edp_core_tkey_contact_role
  • edp_core_tkey_constraint_class
  • edp_core_tkey_constraint_use
  • edp_core_tkey_spatial_ext
  • edp_core_tkey_dataitem_type
  • edp_core_tkey_indicator_type
  • edp_core_tkey_core_type
  • edp_core_tkey_nattype
  • edp_core_tkey_nomenclatures 

Reading

Once the initial tables are populated, the program reads each of the Excel files located inside a given folder in the system. Later, it stores in-memory each piece of data and metadata information coded in different sheets inside the Excel file: Dataset, Indicator, Source and Data, before that file is directly parsed.

Moreover, a json file (multiindicators_core.json) is in charge of helping the program to interpret different metadata after reading, for instance dimension-class hierarchies for multi- indicators, the file name or the sheet names (depending on the indicator, data can be found either in one or several Data sheets, organised by years or by indicators, etc.). That file has been created after a visual exploration for all indicators.

Parsing

Once all files are read, the system is capable to extract the information as programmatically declared. The system can extract data chunks related to individual indicators, referred to distinct pieces of information and found in unique cells within one sheet or in distinct cells spread in different sheets. This information points to :

  • Meta-information on the indicator
    • Name
    • Abstract
    • Time periods
  • Indicator type
    • Single
    • Multi
  • Dimension class
  • Genealogy relations
  • Spatial resolution
  • Keywords
  • Contact info
  • Sources and Methods
  • Spatial units
  • Raw data
    • Data values
    • Data types

This information will be useful to establish relations with the pre-populated tables from previous stage.

Building

The program will make transformations on the data extracted in the previous stage, in order to adjust it to the ESPON 2020 database schema. In reference to that schema, the following tables of the ESPON 2020 database will be sequentially populated:

  • edp_core_tdta_object
  • edp_core_tdta_indicator
  • edp_core_ tdta _contact
  • edp_core_tdta_source
  • edp_core_tdta_method
  • edp_core_tdta_data_item

In addition, subsequent relationships with the other records from both edp_core_tdta_object and edp_core_tdta_data_item tables will be created:

  • edp_core_trel_object_contact
  • edp_core_trel_object_keyword
  • edp_core_trel_object_source
  • edp_core_trel_object_method
  • edp_core_trel_data_item_source
  • edp_core_trel_data_item_method

edp_core_trel_tunit_hierarchy and edp_core_trel_genealogy tables, suitable for relations between different levels in ESPON nomenclatures and hierarchies for multidimensional time series respectively, will be separately populated.

Insertion

The program will connect to the database server and it will start the insertion of records in each of the specified tables, following the right order and keeping relational integrity.

The main difficulty in the program design phase has been related to reading the indicator data and metadata from Excel files. Even though indicators can be classified as either single or multi - namely, multidimensional statistics tables, they are heterogeneous in terms of their nature and structure, i.e. contained dimensions, number of indicators associated to a certain dimension, genealogy. Despite a certain standardization of metadata fields (column names and sheet names in the Excel files to be migrated), it is a challenge to parse all the information in a semi automatized procedure for each indicator, in particular due to the various ways the data sheets are organised (single or multiple data sheets).

In order to enable in the program to parse all Excel files correctly, a configuration json file was created that specifies not only which relevant information is contained but also in which sheets the data has been stored for all indicators.

Export

The final output of the program is a dump file to be finally imported in an empty database at UNIGE.

As mentioned above, the most challenging part is the multi indicator concept, in which data for multiple class indicators are available inside a given inner dimension. Furthermore, several multi-indicators contain more than one dimension.

Distinct relationships between several tables should be pre-arranged before batch insertion in the database, in particular relations between data value for every class and their distinct reporting units, with the respective year and source. This operation, involving several joins between tables, is computationally intensive.

List of migration scripts

The Table below present a list of python scripts for ESPON 2013 migration.

Table: Main files in ESPON 2013 migration tool. Root means current directory and any folder is relative to the path of that root folder. Different colours for different migration stages: pre-population (blue), reading (green), parsing & building & insertion (red)

File name

Folder

Description


conf

Contain config parameters to connect to database

tables.py

(root)

Contains table names as created in ESPON database, making a difference between prepopulated and post-populated

class_data_to_prepopulate_db.py

utils

Contains a class to prepopulate some ESPON tables. This class uses several functions located in utils/helpers_aux

class_data_extract_from_excel_noncore.py

utils

Contains a class to read Excel files for ESPON 2013 projects and return an object with all parsed metadata infconfig_db_params.py

o. This class uses several functions located in utils/helpers_ind and read Excel files located in excel_data folder

build_non_core.py

(root)

This is an script that builds properly records to be inserted later into the database

tests.py

tests_suite/

This is an script acting as a caller for another functions

running_tests.py

(root)

This is an script that control the execution of other scripts and run for all ESPON 2013 projects

In this Page

On this section


|