Data Ingestion

This document describes how data are imported into the OPDM relational DB from a variety of sources.

Setup

The database must be restored from a previous dump. Dumps are stored on AWS S3. The following procedure needs to be followed:

  1. Identify a date and store it in a DATE environment variable.

  2. Close all connections to the running DB.

    ssh root@my-server
    dc exec -e DATE=$DATE -upostgres postgres psql opdm
    
    pg_sql:opdm> SELECT pg_terminate_backend(pg_stat_activity.pid)
        FROM pg_stat_activity
        WHERE datname = current_database()
          AND pid <> pg_backend_pid();
    
  3. Download the dump, drop the db, re-create it and restore the data from the dump.

    ssh root@my-server
    cd /var/lib/docker/volumes/opdmservice_pg_data/_data/
    aws --profile s3 s3 cp s3://opdm-service-data/opdm_$DATE.sql.gz ./
    dc exec -e DATE=$DATE -upostgres postgres bash \
        -c "cd ~/data; dropdb opdm; createdb opdm; zcat opdm_$DATE.sql.gz | psql opdm; rm opdm_$DATE.sql.gz; exit"
    

All management tasks are available at https://service.opdm.openpolis.io/admin/taskmanager/task/, and can be launched or scheduled from the admin user interface.

Areas

Areas are taken from the ISTAT web site, along with all their historical variations.

The cadastral code is used to identify municipalities, as it is immutable with respect to province changes.

Dedicated management tasks are responsible for keeping the areas updated.

The inhabitants of provinces, regions (regioni) and macro-regions (ripartizioni) are computed as sums over the lower-level entities and stored in the inhabitants field by the script_update_inhabitants_cache management task.
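
As an illustration only, this is a minimal sketch of the kind of aggregation the task performs, assuming a hypothetical Area model with a children relation, a classification field and an inhabitants field (these names are assumptions, not the actual OPDM schema):

# Illustrative sketch only: aggregate inhabitants up the area hierarchy.
# The Area model, its "children" relation and the field names used here
# are assumptions, not the actual OPDM schema.
from django.db.models import Sum

def update_inhabitants_cache(Area):
    # provinces sum their municipalities, regions their provinces,
    # ripartizioni their regions: levels are processed bottom-up
    for classification in ("province", "region", "ripartizione"):
        for area in Area.objects.filter(classification=classification):
            total = area.children.aggregate(total=Sum("inhabitants"))["total"]
            area.inhabitants = total or 0
            area.save(update_fields=["inhabitants"])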

Organizations

Organizations are taken from the BDAP open data catalog, at https://bdap-opendata.mef.gov.it/content/anagrafica-enti-anagrafica-ente, and from JSON files produced by other parsers.

Data from BDAP can be imported with the following command:

python manage.py import_orgs_from_bdap -v2

Data from JSON files can be imported with:

python manage.py import_orgs_from_json $URL_TO_JSON_FILE -v2

The argument to this task can be either a readable remote URL or an absolute or relative path on the server.
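
For illustration, a minimal sketch of how such an argument can be resolved as either a remote URL or a local path; the function name and details are assumptions, not the actual implementation:

# Illustrative sketch only: the task argument can be a readable remote URL
# or a path on the server; names here are assumptions.
import json
from urllib.parse import urlparse
from urllib.request import urlopen

def read_json_source(source):
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source) as response:   # remote, readable URL
            return json.load(response)
    with open(source) as f:                 # absolute or relative path
        return json.load(f)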

Persons and details

Persons, together with their Memberships, Ownerships and Posts, are imported from a variety of sources. Many of the tasks described in this section are macro-tasks that call a sequence of two or more sub-commands.

The strategy is to parse the data from the original source and emit a JSON file compatible with the generic import_persons_from_json management task, which is also the model for parsers implemented by third parties.

After that, the data are loaded into OPDM using that generic import subcommand.

Governo and Parlamento

Data coming from government and parliament are imported, using the dedicated import_governo_parlamento_memberships macro-task:

# governo
python manage.py import_governo_parlamento_memberships \
  --latest --clear-parser-cache --clear-loader-cache --context=governo --verbosity=2

# senato
python manage.py import_governo_parlamento_memberships \
  --latest --clear-parser-cache --clear-loader-cache --context=senato --verbosity=2

# camera
python manage.py import_governo_parlamento_memberships \
  --latest --clear-parser-cache --clear-loader-cache --context=camera --verbosity=2

Removing the --latest parameter, it is possible to import all historical data; clearly, this was only needed once.

Internally, the task is composed of two sub-tasks:

  • parser_[camera|senato|governo]_to_json: the parser, which reads data from the SPARQL endpoints and emits a JSON file;

  • import_persons_from_json: the generic importer.

All data come from the Camera's SPARQL endpoint.
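
As a rough, hypothetical sketch of how such a macro-task can chain its two sub-tasks (the option names passed to the sub-commands are assumptions, not the real ones):

# Hypothetical sketch of the macro-task internals: run the context-specific
# parser, then feed its JSON output to the generic importer. Option names
# passed to the sub-commands are assumptions, not the real ones.
from django.core.management import call_command

def import_memberships(context, latest=True, verbosity=2):
    json_path = f"/tmp/{context}_memberships.json"   # illustrative output path
    call_command(f"parser_{context}_to_json",
                 latest=latest, output=json_path, verbosity=verbosity)
    call_command("import_persons_from_json", json_path, verbosity=verbosity)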

Keyevents

Keyevents need to be inserted or updated before calling these scripts, so that parliamentary memberships can be associated with the right legislature (a legislature is a key event in our lingo).

python manage.py import_key_events_from_json $KE_URL.json -v2

Minint

Data for local administrators come from the Ministry of the Interior's open data web page.

The management task can be used to import data in various contexts:

# municipalities
python manage.py import_memberships_from_minint_current --context=comuni -v2

# provinces (deprecated)
python manage.py import_memberships_from_minint_current --context=province -v2

# metropolitan areas
python manage.py import_memberships_from_minint_current --context=metro -v2

# regions
python manage.py import_memberships_from_minint_current --context=regioni -v2

Internally, the task compares the data fetched from the remote CSV file with a local copy kept from the last import, computes a diff, and processes only the records it identifies as new or needing an update.

This behaviour can be turned off with the --clear-cache argument, but processing all the data when there is no real need takes a very long time.
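
A minimal sketch of this kind of diff, assuming rows are keyed by a unique identifier column (the column name and function are illustrative, not the actual code):

# Illustrative sketch of the diff between the remote CSV and the local copy:
# only new or changed rows are returned for processing.
# The key column name is an assumption.
import csv

def diff_csv(remote_path, cached_path, key="codice"):
    def rows_by_key(path):
        with open(path, newline="", encoding="utf-8") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    remote = rows_by_key(remote_path)
    cached = rows_by_key(cached_path)
    new_rows = [row for k, row in remote.items() if k not in cached]
    changed_rows = [row for k, row in remote.items()
                    if k in cached and cached[k] != row]
    return new_rows, changed_rows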

For historical data

A different task is available to import historical data: see the import_memberships_from_minint_year task.

Data from Minint do not contain end dates: when a membership ends, the record simply disappears. Tracking memberships' end dates this way is hard, so another task parses the administrative history pages of the institutions, which do contain that information.

As this is a proper scraper, it takes a long time and is launched only once a week, over segments of institutions.

# municipalities from 100k to 700k inhabitants, from 2012
python manage.py scrape_histadmin_minint COM --inhabitants-range=100k:700k --min-electoral-date=2012 -v2

# all regions from 2012
python manage.py scrape_histadmin_minint REG --min-electoral-date=2012 -v2

Arguments

Arguments can be used to target single institutions and verify issues.

ATOKA

Data for companies (privately or publicly owned) are taken from ATOKA's API.

There are two different macro-tasks available to fetch data from this source:

  1. import_atoka_organisations - import data starting from organisations

    Data are fetched starting from a given shares level, where:

    • 0 stands for institution (not a company);

    • 1 stands for companies directly owned, even partially, by institutions;

    • 2 stands for companies indirectly owned by institutions;

    Details are extracted from organisations at the given shares_level:

    • name, identifiers, contacts,

    • members in apical posts,

    • owners and ownerships

    All persons, organisations, memberships and ownerships are then identified, de-duplicated and loaded into the OPDM database.

    Level 0

    When shares-level=0, no members or owners are extracted, only ownerships.

    Internals

    Internally, this is done in a series of steps; the code snippet below is self-explanatory:

    # assign partecipation level to organizations at level share_level
    # this is needed before identifying orgs to start from
    self.assign_partecipation_level(self.shares_level, overwrite=self.overwrite_part_level)
    
    self.extract_json()
    
    self.transform(['organizations'])
    self.load_organizations()
    if self.shares_level == 0:
        self.transform(['ownerships', ])
        self.load_ownerships()
    else:
        self.transform(['ownerships', 'persons_memberships_ownerships'])
        self.load_ownerships()
        self.load_persons_memberships_ownerships()
    
    # assign next partecipation level to organizations,
    # using ownerships info just loaded in opdm
    self.assign_partecipation_level(self.shares_level + 1)
    

    Before and after the ETL steps, the participation level is computed and assigned to organisations. This computation takes into account the nature of the organisation, in order to identify institutions (those having shares_level = 0) and, at higher levels, the shareholders (see the sketch after this list).

  2. import_atoka_persons - import data starting from persons

    For each politically exposed person, i.e. a person holding memberships in institutions, their network of interests is fetched:

    • companies in which the person has a post;

    • companies that the person owns;

    • apical members or owners of these companies;

    • companies owning or owned by those companies.

    Internals

    This is also internally performed in a series of steps:

    self.extract_json()
    
    self.transform(['organizations'])
    self.load_organizations()
    self.transform(['persons_memberships_ownerships', ])
    self.load_persons_memberships_ownerships()
    self.transform(['organizations_ownerships', ])
    self.load_organizations_ownerships()
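
To make the participation-level assignment described for import_atoka_organisations more concrete, here is a purely illustrative sketch, where the model, field and relation names are assumptions: level 0 marks institutions, and each subsequent level marks companies owned by organisations at the previous level.

# Purely illustrative sketch of the participation-level assignment used by
# import_atoka_organisations; model, field and relation names are assumptions.
def assign_participation_level(Organization, level):
    if level == 0:
        # level 0: institutions (public bodies, not companies)
        qs = Organization.objects.filter(classification="institution")
    else:
        # level n: companies owned, even partially, by level n-1 organisations,
        # using the ownerships already loaded in OPDM
        qs = Organization.objects.filter(
            owned_by__owner__participation_level=level - 1
        )
    qs.update(participation_level=level)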
    

To close memberships or ownerships that are no longer present in ATOKA, the script_close_memberships management task can be used. Given a persons_memberships_ownerships_XX.json file containing the details of all persons extracted from ATOKA, with their memberships and organisations, the script computes the difference for each of the organisations and closes the relationships that are present in OPDM but not in ATOKA.
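
A hedged sketch of this closing logic follows; the JSON structure, model and field names are illustrative, not the actual script_close_memberships code:

# Illustrative sketch: close OPDM memberships that no longer appear in the
# ATOKA export for the organisations it covers. The JSON layout, model and
# field names are assumptions.
import json
from datetime import date

def close_missing_memberships(Membership, atoka_json_path):
    with open(atoka_json_path) as f:
        persons = json.load(f)

    # (person_id, organization_id) pairs still present in the ATOKA export
    still_open = {
        (m["person_id"], m["organization_id"])
        for person in persons
        for m in person.get("memberships", [])
    }
    atoka_org_ids = {org_id for _, org_id in still_open}

    # only consider organisations covered by the export, then close the
    # relationships present in OPDM but not in ATOKA
    for membership in Membership.objects.filter(
        end_date__isnull=True, organization_id__in=atoka_org_ids
    ):
        key = (membership.person_id, membership.organization_id)
        if key not in still_open:
            membership.end_date = date.today()
            membership.save(update_fields=["end_date"])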

Prerequisites

In order to work, two prerequisites must be met:

  • all similarities must have been resolved (otherwise persons cannot be correctly matched);

  • persons must have been consolidated (use script_consolidate_persons), otherwise the identification of persons will not work or, worse, will produce duplicates or wrong results.

Todo

Ownerships should be closed by a script_close_ownerships management task.

Openpolitici

Note

This is now deprecated, as the database is not updated any longer.

The original data from the old Openpolitici database were imported with:

python manage.py import_posts_and_persons_from_op [c1 c2 c3] -v2

One or more contexts can be specified as command line arguments. The accepted values are:

  • eu - European Commission and Parliament

  • it - Italian Government and Parliament (Camera and Senato)

  • reg - Regional governments and councils

  • prov - Provincial or Metropolitan areas governments and councils (TODO)

  • com - Municipalities governments and councils