preprocessing#

data_sources#

class DataSource#

Bases: object

Base class for creating custom DataSources. Subclasses must implement get_attribute and __len__. If the DataSource handles geospatial data, subclasses must also implement to_crs, get_geometry and get_bounding_box.

get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box() Tuple[float, float, float, float] | None#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) dict | None#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

to_crs(crs: str | int) None#

Convert the source geometry data to the coordinate reference system specified in the crs argument

Parameters:

crs – The CRS to convert to, either a CRS string (e.g. “WGS 84” or “EPSG:28992”) or an EPSG code integer (e.g. 4326).
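
As an illustration, a minimal sketch of a custom non-geospatial DataSource backed by plain Python lists (the class name and storage layout are assumptions, not part of the library):

import typing as t

class ListDataSource(DataSource):
    # hypothetical DataSource over a dict of equal-length lists:
    # one list per property, one entry per feature
    def __init__(self, data: t.Dict[str, list]):
        self.data = data

    def get_attribute(self, name: str):
        # return one value per feature for the requested property
        return list(self.data[name])

    def __len__(self):
        # number of features in the source
        return len(next(iter(self.data.values()), []))

Since such a source holds no geospatial data, to_crs, get_geometry and get_bounding_box can be left unimplemented.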

class GeopandasSource(geodataframe: GeoDataFrame)#

Bases: DataSource

DataSource for querying a geopandas.GeoDataFrame

static feature_type_or_raise(feature, expected)#
classmethod from_source_info(source_info)#
get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box()#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells'])#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

get_lines(geom)#
get_points(geom)#
get_polygons(geom)#
to_crs(crs: str | int)#

Convert the source geometry data to the coordinate reference system specified in the crs argument

Parameters:

crs – The CRS to convert to, either a CRS string (e.g. “WGS 84” or “EPSG:28992”) or an EPSG code integer (e.g. 4326).
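
A typical usage sketch (the file name and attribute column are assumptions):

import geopandas as gpd

gdf = gpd.read_file("roads.geojson")     # hypothetical input file
source = GeopandasSource(gdf)
source.to_crs("EPSG:28992")              # reproject before extracting geometry
geometry = source.get_geometry("lines")  # e.g. {"geometry.linestring_2d": [...]}
names = source.get_attribute("name")     # assumes a "name" column in the file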

class NetCDFGridSource(file: Path | str, x_var='gridCellX', y_var='gridCellY', time_var='time')#

Bases: DataSource

cells: ndarray = None#
classmethod from_source_info(source_info)#
get_attribute(name: str, time_idx=0)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box() Tuple[float, float, float, float] | None#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) dict | None#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

get_timestamps()#
points: ndarray = None#
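
A usage sketch (the file and variable names are assumptions; x_var and y_var default to gridCellX and gridCellY per the signature above):

source = NetCDFGridSource("water_levels.nc")              # hypothetical NetCDF file
timestamps = source.get_timestamps()
grid = source.get_geometry("cells")                       # grid cell geometry
values = source.get_attribute("water_level", time_idx=0)  # values at the first timestamp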
class NumpyDataSource(data: Mapping[str, ndarray])#

Bases: DataSource

DataSource for non-geospatial Numpy or pandas data

Parameters:

data – Either a dictionary (typing.Dict[str, np.ndarray]) with property names as keys and property data arrays as values, or a pandas DataFrame

get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

PandasDataSource#

alias of NumpyDataSource
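
Both NumpyDataSource and its PandasDataSource alias can be constructed directly from in-memory data, for example:

import numpy as np
import pandas as pd

source = NumpyDataSource({
    "capacity": np.array([100, 200, 150]),
    "label": np.array(["a", "b", "c"]),
})
capacities = source.get_attribute("capacity")  # one entry per feature

# the same class also accepts a pandas DataFrame
df_source = PandasDataSource(pd.DataFrame({"capacity": [100, 200, 150]}))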

dataset_creator#

class AttributeDataLoading(config)#

Bases: DatasetOperation

Extracts the actual data from the DataSources into the attribute arrays. It also supports transforming the raw data using so-called loaders in the attribute config. Currently supported loaders are: json, csv, bool, int, float and str. See create_dataset() for more information on the available loaders.
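
A hypothetical attribute config entry using a loader might look as follows; the field names are assumptions and should be verified against get_dataset_creator_schema():

attr_config = {
    "property": "shape",  # column in the data source holding raw JSON strings
    "loaders": ["json"],  # parse each raw string value as JSON
}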

get_attribute_data(attr_config: dict, primary_source_name: str)#
get_data(entity_config: dict)#
get_geometry(geom_type: Literal['points', 'lines', 'polygons', 'cells'], source_name) dict | None#
get_loaders(attr_config)#
get_source(source_name)#
loaders = {'bool': <function load_primitive.<locals>._to_primitive_helper>, 'csv': <function load_csv>, 'float': <function load_primitive.<locals>._to_primitive_helper>, 'int': <function load_primitive.<locals>._to_primitive_helper>, 'json': <built-in function loads>, 'str': <function load_primitive.<locals>._to_primitive_helper>}#
static nan_loader(val)#
sources: Dict[str, DataSource]#
class BoundingBoxCalculation(config)#

Bases: DatasetOperation

Calculate the bounding box of the entire dataset

get_active_sources(sources: Dict[str, DataSource])#
class CRSTransformation(config, default_crs='EPSG:28992')#

Bases: DatasetOperation

The CRSTransformation operation converts every DataSource into the target CRS specified in the config

DEFAULT_CRS = 'EPSG:28992'#
class ConstantValueAssigning(config)#

Bases: DatasetOperation

Assign a constant value for every entity in an entity group. This operation must come after IDGeneration, because only then is the number of entities guaranteed to be known

class DatasetCreator(operations: Sequence[Type[DatasetOperation]], sources: Dict[str, DataSource] | None = None, validate_config=True)#

Bases: object

Use DatasetCreator to convert different DataSources into an entity-based Dataset.

Parameters:
  • operations – The sequence of desired operation types

  • sources – (Optional) a dictionary with configured DataSources

  • validate_config – whether to validate dataset creator configs

create(config: dict)#
static default_operations()#
classmethod with_default_operations(**kwargs)#

Alternative initializer that creates a DatasetCreator configured with all DatasetOperations needed for full dataset-creation functionality. This is the preferred way of instantiating DatasetCreator.
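
For example, given a dataset creator config dict config:

creator = DatasetCreator.with_default_operations()
dataset = creator.create(config)  # returns the entity-based dataset as a dict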

class DatasetOperation(config)#

Bases: object

class EnumConversion(config)#

Bases: DatasetOperation

The EnumConversion operation is responsible for converting and validating enumerated attributes (indicated by the enum field in the attribute config). It converts strings into integers matching the position of the value in the enum’s array. If the values are already integers, it validates that each value matches an enum value
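
The conversion rule can be illustrated in isolation (plain Python, not the actual implementation):

enum_values = ["residential", "commercial"]        # the enum's array
raw = ["commercial", "residential", "commercial"]  # string attribute values
converted = [enum_values.index(v) for v in raw]    # -> [1, 0, 1]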

convert_enums(attr, enum_name: str)#
enums: Dict[str, EnumInfo]#
get_enums(dataset)#
iter_enum_attributes(dataset) Tuple[str, list]#
set_enums(dataset: dict)#
class EnumInfo(name: str, enum_values: Sequence[str])#

Bases: object

add(text: str) int#
ensure(text: str) int#
get_pos(text: str) int | None#
to_list()#
class IDGeneration(config)#

Bases: DatasetOperation

Generate ids for every entity

get_entity_count_from_meta(entity_meta: dict, sources: dict) int#
class IDLinking(config)#

Bases: DatasetOperation

Some attributes may reference another entity’s id in the same dataset using the id_link field in the attribute config. The IDLinking operation looks up the correct id for an entity’s id-link and places the correct id in the attribute. See the documentation on creating datasets for more information.

classmethod get_indexed_values_or_raise(values, indexers)#
static get_single_indexed_value_or_raise(value, indexers)#
index: Dict[Tuple[str, str], dict]#
class MetadataSetup(config)#

Bases: DatasetOperation

MetadataSetup copies the metadata fields from the config into the dataset, and/or fills them with their respective default values

keys = (('general', <object object>), ('name', <object object>), ('display_name', <object object>), ('type', <object object>), ('version', 4))#
class SourcesSetup(config)#

Bases: DatasetOperation

The SourcesSetup operation is responsible for reading the __sources__ field of the config and creating DataSources from it

static get_file_path(path_str)#
make_source(source_info)#
class SpecialValueCollection(config)#

Bases: DatasetOperation

SpecialValueCollection compiles the general.special field in the dataset from the config. If a special value is defined both in the attribute config and in the general.special field of the config, then the field in the config takes precedence

extract_special_values(config: dict, key=None, level=0)#
create_dataset(config: dict, sources: Dict[str, DataSource] | None = None)#

Shorthand function to create an entity-based Dataset from a dataset creator config dictionary. This is the preferred way of creating Datasets from a dataset creator config as it requires the least amount of boilerplate code. DataSources are created from the config’s __sources__ field. However, it is also possible to provide (additional) DataSources through the optional sources argument.

Parameters:
  • config – a dataset creator config

  • sources – (Optional) a dictionary with configured DataSources

Returns:

An entity-based dataset in dictionary format
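
A minimal sketch of calling create_dataset with explicitly provided sources; the config layout shown is an assumption and should be validated against get_dataset_creator_schema():

import geopandas as gpd

config = {
    "name": "my_dataset",
    "data": {
        # hypothetical entity groups and attribute configs go here,
        # referencing the "roads" source by name
    },
}
sources = {"roads": GeopandasSource(gpd.read_file("roads.geojson"))}
dataset = create_dataset(config, sources=sources)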

deep_get(obj, *path: str | int, default=None)#
get_dataset_creator_schema()#
load_csv(obj: str)#
load_primitive(prim)#
pipe(operations: Sequence[callable], initial, **kwargs)#

tapefile#

Create a tapefile from one or more CSV files that contain a yearly changing attribute. The CSV files should be in the form “<key>, <year_1>, <year_2>, …”, such as:

name, 2020, 2025, 2030
e1,    100,  110,  120
e2,    100,   90,   85

One or more of these csv files can then be linked to an Attribute and a tapefile can be created that outputs a value for every year (linearly interpolating between years that do not exist in the csv files). The entities are matched based on a reference_attribute and the <key> column in the csv files.

If multiple csv files are given, every update will contain the interpolated values for all time dependent attributes, even if the corresponding timestamp is not defined in that attribute’s csv file. For example, if csv file “a” defines 2020 as a year, but csv file “b” starts at 2025, then the tapefile will be generated starting from the timestamp at 2020. For all years earlier than 2025, the values from csv file “b” at 2025 will be used.
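
Putting this together, a usage sketch (file, dataset and attribute names are hypothetical):

attribute = TimeDependentAttribute(name="some_attribute", csv_file="a.csv", key="name")
tapefile = InterpolatingTapefile(
    entity_data=entity_data,  # dict with "id" and the reference attribute
    dataset_name="my_dataset",
    entity_group_name="entity_group_name",
    reference="name",         # attribute matched against the csv <key> column
    tapefile_name="yearly_values",
    attributes=[attribute],
)
tapefile.dump(file="tapefile.json")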

class InterpolatingTapefile(entity_data: 'dict', dataset_name: 'str', entity_group_name: 'str', reference: 'str', tapefile_name: 'str', tapefile_display_name: 't.Optional[str]' = None, metadata: 'dict' = None, attributes: 't.List[TimeDependentAttribute]' = <factory>)#

Bases: object

add_attribute(attribute: TimeDependentAttribute)#
attributes: List[TimeDependentAttribute]#
create_content(interpolators: Dict[str, Interpolator])#
create_update(values: Dict[str, list])#

example:

{
    "entity_group_name": {
        "id": [4, 5, 6],
        "some_attribute": [102, 40, 201],
        "some_other_attribute": [7, 6, 21]
    }
}
dataset_name: str#
dump(file: str | Path)#
dump_dict()#
ensure_csv_completeness()#
entity_data: dict#
entity_group_name: str#
get_interpolators()#
get_merged_df(attribute: TimeDependentAttribute)#
get_scaffold()#
static get_seconds(year: int, reference: int)#
Parameters:
  • year – the year to convert, e.g. 2024

  • reference – the reference year, e.g. 2019

Returns:

seconds since the reference year

init_data: DataFrame#
metadata: dict = None#
read_initial_data() DataFrame#
reference: str#
tapefile_display_name: str | None = None#
tapefile_name: str#
class Interpolator(df: DataFrame)#

Bases: object

infer_years()#
interpolate(year)#
max_year: int#
min_year: int#
years: List[int]#
class TimeDependentAttribute(name: 'str', csv_file: 't.Union[Path, str]', key: 'str')#

Bases: object

csv_file: Path | str#
property dataframe#
key: str#
name: str#

Module contents#

class DataSource#

Bases: object

Base class for creating custom DataSources. Subclasses must implement get_attribute and __len__. If the DataSource handles geospatial data, subclasses must also implement to_crs, get_geometry and get_bounding_box.

get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box() Tuple[float, float, float, float] | None#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) dict | None#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

to_crs(crs: str | int) None#

Convert the source geometry data to the coordinate reference system specified in the crs argument

Parameters:

crs – The CRS to convert to, either a CRS string (e.g. “WGS 84” or “EPSG:28992”) or an EPSG code integer (e.g. 4326).

class DatasetCreator(operations: Sequence[Type[DatasetOperation]], sources: Dict[str, DataSource] | None = None, validate_config=True)#

Bases: object

Use DatasetCreator to convert different DataSources into an entity-based Dataset.

Parameters:
  • operations – The sequence of desired operation types

  • sources – (Optional) a dictionary with configured DataSources

  • validate_config – whether to validate dataset creator configs

create(config: dict)#
static default_operations()#
classmethod with_default_operations(**kwargs)#

Alternative initializer that creates a DatasetCreator configured with all DatasetOperations needed for full dataset-creation functionality. This is the preferred way of instantiating DatasetCreator.

class GeopandasSource(geodataframe: GeoDataFrame)#

Bases: DataSource

DataSource for querying a geopandas.GeoDataFrame

static feature_type_or_raise(feature, expected)#
classmethod from_source_info(source_info)#
get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box()#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells'])#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

get_lines(geom)#
get_points(geom)#
get_polygons(geom)#
to_crs(crs: str | int)#

Convert the source geometry data to the coordinate reference system specified in the crs argument

Parameters:

crs – The CRS to convert to, either a CRS string (e.g. “WGS 84” or “EPSG:28992”) or an EPSG code integer (e.g. 4326).

class InterpolatingTapefile(entity_data: 'dict', dataset_name: 'str', entity_group_name: 'str', reference: 'str', tapefile_name: 'str', tapefile_display_name: 't.Optional[str]' = None, metadata: 'dict' = None, attributes: 't.List[TimeDependentAttribute]' = <factory>)#

Bases: object

add_attribute(attribute: TimeDependentAttribute)#
attributes: List[TimeDependentAttribute]#
create_content(interpolators: Dict[str, Interpolator])#
create_update(values: Dict[str, list])#

example:

{
    "entity_group_name": {
        "id": [4, 5, 6],
        "some_attribute": [102, 40, 201],
        "some_other_attribute": [7, 6, 21]
    }
}
dataset_name: str#
dump(file: str | Path)#
dump_dict()#
ensure_csv_completeness()#
entity_data: dict#
entity_group_name: str#
get_interpolators()#
get_merged_df(attribute: TimeDependentAttribute)#
get_scaffold()#
static get_seconds(year: int, reference: int)#
Parameters:
  • year – the year to convert, e.g. 2024

  • reference – the reference year, e.g. 2019

Returns:

seconds since the reference year

init_data: DataFrame#
metadata: dict = None#
read_initial_data() DataFrame#
reference: str#
tapefile_display_name: str | None = None#
tapefile_name: str#
class NetCDFGridSource(file: Path | str, x_var='gridCellX', y_var='gridCellY', time_var='time')#

Bases: DataSource

cells: ndarray = None#
classmethod from_source_info(source_info)#
get_attribute(name: str, time_idx=0)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

get_bounding_box() Tuple[float, float, float, float] | None#

Return the bounding box that envelops all geospatial features in the source data

Returns:

A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y) or None in case no bounding box can be calculated

get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) dict | None#

Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:

  • points: geometry.x, geometry.y and optionally geometry.z

  • lines: either geometry.linestring_2d or geometry.linestring_3d

  • polygons: geometry.polygon

See Geometries for more information on geometry attributes.

This method may raise an Exception if a geometry_type is requested that does not match the source geometry.

Parameters:

geometry_type – One of points, lines, polygons or cells

get_timestamps()#
points: ndarray = None#
class NumpyDataSource(data: Mapping[str, ndarray])#

Bases: DataSource

DataSource for non-geospatial Numpy or pandas data

Parameters:

data – Either a dictionary (typing.Dict[str, np.ndarray]) with property names as keys and property data arrays as values, or a pandas DataFrame

get_attribute(name: str)#

Return a property as a list of values from the source data, one entry per feature.

Parameters:

name – The property name

PandasDataSource#

alias of NumpyDataSource

class TimeDependentAttribute(name: 'str', csv_file: 't.Union[Path, str]', key: 'str')#

Bases: object

csv_file: Path | str#
property dataframe#
key: str#
name: str#
create_dataset(config: dict, sources: Dict[str, DataSource] | None = None)#

Shorthand function to create an entity-based Dataset from a dataset creator config dictionary. This is the preferred way of creating Datasets from a dataset creator config as it requires the least amount of boilerplate code. DataSources are created from the config’s __sources__ field. However, it is also possible to provide (additional) DataSources through the optional sources argument.

Parameters:
  • config – a dataset creator config

  • sources – (Optional) a dictionary with configured DataSources

Returns:

An entity-based dataset in dictionary format

get_dataset_creator_schema()#