Data Ingestion¶
This document describes how data are imported into the OPDM relational DB from a variety of sources.
Setup¶
The database must be restored from a previous dump. Dumps are found on AWS. The following procedure needs to be followed:
Identify a date and store it into a DATE env variable.
Close all connections to the running DB.
ssh root@my-server
dc exec -e DATE=$DATE -upostgres postgres psql opdm
opdm=> SELECT pg_terminate_backend(pg_stat_activity.pid)
       FROM pg_stat_activity
       WHERE datname = current_database() AND pid <> pg_backend_pid();
Download the dump, drop the db, re-create it and restore the data from the dump.
ssh root@my-server
cd /var/lib/docker/volumes/opdmservice_pg_data/_data/
aws --profile s3 s3 cp s3://opdm-service-data/opdm_$DATE.sql.gz ./
dc exec -e DATE=$DATE -upostgres postgres bash \
  -c "cd ~/data; dropdb opdm; createdb opdm; zcat opdm_$DATE.sql.gz | psql opdm; rm opdm_$DATE.sql.gz; exit"
All management tasks are available at https://service.opdm.openpolis.io/admin/taskmanager/task/, and can be launched or scheduled from the admin user interface.
Areas¶
Areas are taken from the ISTAT web site, along with all their historical variations.
The cadastral code is used to identify municipalities, as it remains immutable with respect to province changes.
The management tasks responsible for keeping the areas updated are:
The number of inhabitants for province, regioni and ripartizioni areas is computed as the sum over the lower-level entities and stored into the inhabitants field by the script_update_inhabitants_cache management task.
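As a purely illustrative sketch of this aggregation, assuming an Area model with a parent relation, an istat_classification field and an inhabitants field (these names are assumptions, not necessarily the actual OPDM schema):
from django.db.models import Sum
from popolo.models import Area  # assumed model location, for illustration only

def update_inhabitants_for(parent_kind, child_kind):
    # store on each parent area the sum of its children's inhabitants
    for parent in Area.objects.filter(istat_classification=parent_kind):
        total = parent.children.filter(
            istat_classification=child_kind
        ).aggregate(total=Sum("inhabitants"))["total"] or 0
        parent.inhabitants = total
        parent.save(update_fields=["inhabitants"])

# provinces from municipalities, regions from provinces, ripartizioni from regions
for parent_kind, child_kind in (("PROV", "COM"), ("REG", "PROV"), ("RIP", "REG")):
    update_inhabitants_for(parent_kind, child_kind)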
Todo
Add data from the ANPR archive: https://www.anpr.interno.it/portale/anpr-archivio-comuni.csv
Organizations¶
Organizations are taken from the BDAP open data catalog, at https://bdap-opendata.mef.gov.it/content/anagrafica-enti-anagrafica-ente, and from JSON files produced by other parsers.
Data from BDAP can be imported with the following code block:
python manage.py import_orgs_from_bdap -v2
Data from json files can be imported with:
python manage.py import_orgs_from_json $URL_TO_JSON_FILE -v2
The argument to this task can be a remote, readable URL, or an absolute or relative path on the server.
Persons and details¶
Persons, along with their Memberships, Ownerships and Posts, are imported from a variety of sources. Many of the tasks described in this section are macro-tasks calling a sequence of two or more sub-commands.
The strategy here is to parse the data from the original source and emit a json file compatible with the generic import_persons_from_json management task, which is also the model for parsers implemented by third parties. After that, the data are loaded into OPDM using that generic import subcommand.
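As a purely illustrative sketch, a parser might emit its output like this; the field names below are assumptions loosely inspired by Popolo-style data, while the authoritative schema is whatever import_persons_from_json accepts:
import json

# hypothetical parser output step; structure and field names are illustrative only
payload = {
    "persons": [
        {
            "name": "Mario Rossi",
            "birth_date": "1970-01-01",
            "memberships": [
                {
                    "organization": "Comune di Esempio",
                    "role": "Sindaco",
                    "start_date": "2020-01-01",
                }
            ],
        }
    ]
}

with open("persons.json", "w") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)

# the file can then be fed to the generic importer:
#   python manage.py import_persons_from_json persons.json -v2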
Governo and Parlamento¶
Data coming from the government and parliament are imported using the dedicated import_governo_parlamento_memberships macro-task:
# governo
python manage.py import_governo_parlamento_memberships \
--latest --clear-parser-cache --clear-loader-cache --context=governo --verbosity=2
# senato
python manage.py import_governo_parlamento_memberships \
--latest --clear-parser-cache --clear-loader-cache --context=senato --verbosity=2
# camera
python manage.py import_governo_parlamento_memberships \
--latest --clear-parser-cache --clear-loader-cache --context=camera --verbosity=2
Removing the --latest parameter, it is possible to import all historical data; clearly, this was only needed once.
The task, internally, is composed of the two sub-tasks:
parser_[camera|senato|governo]_to_json: the parser, reading data from the SPARQL endpoints and emitting a json file;
import_persons_from_json: the generic importer.
All data come from the Camera's SPARQL endpoint.
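As a purely illustrative sketch of reading from that endpoint, using a placeholder query rather than the ones actually used by the parsers (the endpoint URL is the public one; everything else is an assumption):
from SPARQLWrapper import SPARQLWrapper, JSON

# illustrative only: fetch a handful of triples from the Camera's SPARQL endpoint
sparql = SPARQLWrapper("http://dati.camera.it/sparql")
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])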
Keyevents
Keyevents need to be inserted or updated before calling these scripts, so that memberships for the parliament can be associated with the right legislature (that’s a key event in our lingo).
python manage.py import_key_events_from_json $KE_URL.json -v2
Minint¶
Data for local administrators come from the Ministry of the Interior's open data web page.
The management task can be used to import data in various contexts:
# municipalities
python manage.py import_memberships_from_minint_current --context=comuni -v2
# provinces (deprecated)
python manage.py import_memberships_from_minint_current --context=province -v2
# metropolitan areas
python manage.py import_memberships_from_minint_current --context=metro -v2
# regions
python manage.py import_memberships_from_minint_current --context=regioni -v2
Internally, the task compares the data fetched from the remote CSV file with a local version kept from the last import, computing a diff and processing only the records it identifies as new or needing an update.
This behaviour can be turned off using the --clear-cache argument, but the time needed to process all data (without real need) will be very long.
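As a purely illustrative sketch of the diff logic, assuming the records are keyed by a unique identifier column and the CSV uses a semicolon delimiter (both assumptions; the real implementation and column names may differ):
import csv

def diff_rows(remote_path, cached_path, key="codice_ente"):
    # load both CSV files, indexing rows by the (assumed) unique key column
    def load(path):
        with open(path, newline="", encoding="utf-8") as f:
            return {row[key]: row for row in csv.DictReader(f, delimiter=";")}

    remote, cached = load(remote_path), load(cached_path)
    # keep only rows that are new or whose content changed since the last import
    return [row for k, row in remote.items() if cached.get(k) != row]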
For historical data
A different task is also available to import historical data: see the import_memberships_from_minint_year task.
Data from Minint do not contain end dates: when a membership ends, the record simply disappears. Tracking memberships' end dates this way is hard, so there is another task that parses the administrative history pages of the institutions, which contain that information.
As this is a proper scraper, it takes a long time and is launched only once a week, over segments of institutions.
# municipalities from 100k to 700k inhabitants, from 2012
python manage.py scrape_histadmin_minint COM --inhabitants-range=100k:700k --min-electoral-date=2012 -v2
# all regions from 2012
python manage.py scrape_histadmin_minint REG --min-electoral-date=2012 -v2
Arguments
Command-line arguments can be used to target single institutions and verify issues.
ATOKA¶
Data for companies (privately or publicly owned) are taken from ATOKA's API.
There are two different macro-tasks available to fetch data from this source:
import_atoka_organisations - import data starting from organisations, starting from a given shares level, where:
0 stands for institution (not a company);
1 stands for companies directly owned, even partially, by institutions;
2 stands for companies indirectly owned by institutions.
Details are extracted from organisations with the given shares_level: name, identifiers, contacts, members in apical posts, owners and ownerships.
All persons, organisations, memberships and ownerships are then identified, de-duplicated and loaded into the OPDM database.
Level 0
When shares-level=0, no members or owners are extracted, only ownerships.
Internals
This is internally done in a series of steps, and the code snippet here is self-explanatory:
# assign partecipation level to organizations at level share_level
# this is needed before identifying orgs to start from
self.assign_partecipation_level(self.shares_level, overwrite=self.overwrite_part_level)

self.extract_json()
self.transform(['organizations'])
self.load_organizations()

if self.shares_level == 0:
    self.transform(['ownerships', ])
    self.load_ownerships()
else:
    self.transform(['ownerships', 'persons_memberships_ownerships'])
    self.load_ownerships()
    self.load_persons_memberships_ownerships()

# assign next partecipation level to organizations,
# using ownerships info just loaded in opdm
self.assign_partecipation_level(self.shares_level + 1)
Before and after the ETL steps, the participation level is computed and assigned to Organisations. This computation takes into account the nature of the organisation, in order to identify institutions (that is, those having shares_level = 0) and the shareholders at greater levels.
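A minimal sketch of how such a level assignment could work, assuming hypothetical Organization and Ownership models and field names (not the actual OPDM code):
from myapp.models import Organization, Ownership  # assumed models, for illustration only

def assign_partecipation_level(level):
    if level == 0:
        # institutions are recognised by their nature and start at level 0
        Organization.objects.filter(is_institution=True).update(partecipation_level=0)
    else:
        # organisations owned by a level (n-1) organisation get level n
        owned_ids = Ownership.objects.filter(
            owner_organization__partecipation_level=level - 1
        ).values_list("owned_organization_id", flat=True)
        Organization.objects.filter(
            id__in=owned_ids, partecipation_level__isnull=True
        ).update(partecipation_level=level)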
import_atoka_persons - import data starting from persons. For each politically exposed person, i.e. a person having memberships in institutions, her network of interests is fetched:
- companies in which the person has a post
- companies that the person owns
- apical members or owners of these companies
- companies owning or owned by those companies
Internals
This is also internally performed in a series of steps:
self.extract_json()

self.transform(['organizations'])
self.load_organizations()

self.transform(['persons_memberships_ownerships', ])
self.load_persons_memberships_ownerships()

self.transform(['organizations_ownerships', ])
self.load_organizations_ownerships()
To close memberships or ownerships no longer present in ATOKA, the script_close_memberships management task can be used. Given a persons_memberships_ownerships_XX.json file containing the details of all persons extracted from ATOKA, with their memberships and organisations, the script computes the difference for each one of the organisations, and closes the relationships present only in OPDM and not in ATOKA.
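As an illustrative sketch of that closing step, assuming a hypothetical Membership model and a hypothetical JSON layout (the real task and schema may differ):
import json
from datetime import date

from myapp.models import Membership  # assumed model, for illustration only

def close_missing_memberships(json_path):
    with open(json_path) as f:
        persons = json.load(f)

    # collect, per organisation, the persons that ATOKA still reports as members
    still_members = {}
    for person in persons:
        for membership in person.get("memberships", []):
            still_members.setdefault(membership["organization_id"], set()).add(person["id"])

    # close any open OPDM membership in those organisations whose person
    # is no longer reported by ATOKA
    for org_id, person_ids in still_members.items():
        Membership.objects.filter(
            organization_id=org_id, end_date__isnull=True
        ).exclude(person_id__in=person_ids).update(end_date=date.today())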
Prerequisites
In order for this to work, there are two prerequisites:
all similarities should have been resolved (otherwise persons cannot be correctly matched);
persons should have been consolidated (use script_consolidate_persons), or the identification of persons will not work or, worse, will produce duplicates or wrong results.
Todo
Ownerships should be closed by a script_close_ownerships management task.
Openpolitici¶
Note
This is now deprecated, as the database is not updated any longer.
The original data from the old Openpolitici database were imported with:
python manage.py import_posts_and_persons_from_op [c1 c2 c3] -v2
One or more contexts can be specified as command line arguments. The accepted values are:
eu - European Commission and Parliament
it - Italian Government and Parliament (Camera and Senato)
reg - Regional governments and councils
prov - Provincial or Metropolitan areas governments and councils (TODO)
com - Municipalities governments and councils