preprocessing#
data_sources#
- class DataSource#
Bases: object
Base class for creating custom DataSources. Subclasses must implement get_attribute and __len__. In case the DataSource handles geospatial data, subclasses must also implement to_crs, get_geometry and get_bounding_box.
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box() → Tuple[float, float, float, float] | None#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) → dict | None#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- to_crs(crs: str | int) → None#
Convert the source geometry data to the coordinate reference system specified in the crs argument.
- Parameters:
crs – The CRS to convert to, either a CRS string (eg. “WGS 84” or “EPSG:28992”) or an EPSG code integer (eg. 4326).
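The subclass contract above can be sketched as follows. This is a minimal, standalone illustration: the stand-in DataSource base below only mirrors the documented interface, and in practice the real base class should be imported from this package instead.

```python
# Stand-in for the real base class so this sketch runs standalone
# (in practice, import DataSource from this package instead).
class DataSource:
    def get_attribute(self, name: str):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError


class ListDataSource(DataSource):
    """A minimal non-geospatial DataSource backed by a list of feature dicts."""

    def __init__(self, features):
        self.features = features

    def get_attribute(self, name: str):
        # One entry per feature, as required by the base class contract
        return [feature[name] for feature in self.features]

    def __len__(self):
        return len(self.features)
```

A geospatial subclass would additionally implement to_crs, get_geometry and get_bounding_box.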
- class GeopandasSource(geodataframe: GeoDataFrame)#
Bases: DataSource
DataSource for querying a geopandas.GeoDataFrame
- static feature_type_or_raise(feature, expected)#
- classmethod from_source_info(source_info)#
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box()#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells'])#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- get_lines(geom)#
- get_points(geom)#
- get_polygons(geom)#
- to_crs(crs: str | int)#
Convert the source geometry data to the coordinate reference system specified in the crs argument.
- Parameters:
crs – The CRS to convert to, either a CRS string (eg. “WGS 84” or “EPSG:28992”) or an EPSG code integer (eg. 4326).
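As a sketch of the geometry dictionaries that get_geometry returns, the points and lines cases might be built like this. This is a pure-Python stand-in: only the geometry.* attribute names come from the docs above, while the helper functions themselves are hypothetical.

```python
def points_geometry(coords):
    """Build a 'points' geometry dict: one geometry.x/geometry.y entry per feature."""
    xs = [x for x, y in coords]
    ys = [y for x, y in coords]
    return {"geometry.x": xs, "geometry.y": ys}


def lines_geometry(linestrings):
    """Build a 2D 'lines' geometry dict: each entry is a list of [x, y] vertices."""
    return {
        "geometry.linestring_2d": [[list(vertex) for vertex in ls] for ls in linestrings]
    }
```

A 3D source would use geometry.z and geometry.linestring_3d instead, as listed above.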
- class NetCDFGridSource(file: Path | str, x_var='gridCellX', y_var='gridCellY', time_var='time')#
Bases: DataSource
- cells: ndarray = None#
- classmethod from_source_info(source_info)#
- get_attribute(name: str, time_idx=0)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box() → Tuple[float, float, float, float] | None#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) → dict | None#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- get_timestamps()#
- points: ndarray = None#
- class NumpyDataSource(data: Mapping[str, ndarray])#
Bases: DataSource
DataSource for non-geospatial Numpy or pandas data.
- Parameters:
data – Either a dictionary typing.Dict[str, np.ndarray], with the keys being the property names and the values being the property data arrays, or a Pandas dataframe
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- PandasDataSource#
alias of NumpyDataSource
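The dict-of-arrays contract can be illustrated with a small stand-in. The real NumpyDataSource lives in this package; the sketch below only mirrors the behaviour documented above.

```python
import numpy as np


class NumpyDataSourceSketch:
    """Stand-in mirroring NumpyDataSource: property name -> array, one value per feature."""

    def __init__(self, data):
        self.data = data

    def get_attribute(self, name: str):
        # Return the property as a plain list, one entry per feature
        return list(self.data[name])

    def __len__(self):
        # All property arrays share the feature count
        return len(next(iter(self.data.values())))
```

For example, NumpyDataSourceSketch({"capacity": np.array([100, 90])}) yields [100, 90] for get_attribute("capacity").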
dataset_creator#
- class AttributeDataLoading(config)#
Bases: DatasetOperation
Extracts the actual data from the DataSources into the attribute arrays. It also supports transforming the raw data using so-called loaders in the attribute config. Currently supported loaders are: json, csv, bool, int, float and str. See create_dataset() for more information on the available loaders.
- get_attribute_data(attr_config: dict, primary_source_name: str)#
- get_data(entity_config: dict)#
- get_geometry(geom_type: Literal['points', 'lines', 'polygons', 'cells'], source_name) dict | None #
- get_loaders(attr_config)#
- get_source(source_name)#
- loaders = {'bool': <function load_primitive.<locals>._to_primitive_helper>, 'csv': <function load_csv>, 'float': <function load_primitive.<locals>._to_primitive_helper>, 'int': <function load_primitive.<locals>._to_primitive_helper>, 'json': <built-in function loads>, 'str': <function load_primitive.<locals>._to_primitive_helper>}#
- static nan_loader(val)#
- sources: Dict[str, DataSource]#
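A hypothetical attribute config entry using loaders might look like the fragment below. The loader names come from the list above; the surrounding field names are assumptions for illustration only, see create_dataset() for the authoritative schema.

```python
# Hypothetical attribute config fragment (field names are assumptions,
# only the loader names "csv" and "float" come from the documented list):
attribute_config = {
    "name": "shape.capacity",
    "property": "capacity",
    # loaders are applied in order: parse the raw CSV cell, then cast to float
    "loaders": ["csv", "float"],
}
```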
- class BoundingBoxCalculation(config)#
Bases: DatasetOperation
Calculate the bounding box of the entire dataset.
- get_active_sources(sources: Dict[str, DataSource])#
- class CRSTransformation(config, default_crs='EPSG:28992')#
Bases: DatasetOperation
The CRSTransformation operation converts every DatasetSource into the target CRS specified in the config.
- DEFAULT_CRS = 'EPSG:28992'#
- class ConstantValueAssigning(config)#
Bases: DatasetOperation
Assign a constant value to every entity in an entity group. This Operation must come after IDGeneration, because only then is the number of entities guaranteed to be known.
- class DatasetCreator(operations: Sequence[Type[DatasetOperation]], sources: Dict[str, DataSource] | None = None, validate_config=True)#
Bases: object
Use DatasetCreator to convert different DataSources into an entity-based Dataset.
- Parameters:
operations – The sequence of desired operation types
sources – (Optional) a dictionary with configured DataSources
validate_config – whether to validate any dataset creator configs
- create(config: dict)#
- static default_operations()#
- classmethod with_default_operations(**kwargs)#
Alternative initializer that creates a DatasetCreator with all DatasetOperations configured, providing full functionality to create datasets. This is the preferred way of instantiating DatasetCreator.
- class DatasetOperation(config)#
Bases: object
- class EnumConversion(config)#
Bases: DatasetOperation
The EnumConversion operation is responsible for converting and validating enumerated attributes (indicated by the enum field in the attribute config). It converts strings into integers matching the position of the value in the enum's array. If the values are already integers, it validates whether they match an enum value.
- convert_enums(attr, enum_name: str)#
- get_enums(dataset)#
- iter_enum_attributes(dataset) → Tuple[str, list]#
- set_enums(dataset: dict)#
- class EnumInfo(name: str, enum_values: Sequence[str])#
Bases: object
- add(text: str) → int#
- ensure(text: str) → int#
- get_pos(text: str) → int | None#
- to_list()#
- class IDGeneration(config)#
Bases: DatasetOperation
Generate ids for every entity.
- get_entity_count_from_meta(entity_meta: dict, sources: dict) → int#
- class IDLinking(config)#
Bases: DatasetOperation
Some attributes may reference another entity's id in the same dataset using the id_link field in the attribute config. The IDLinking operation looks up the correct id for an entity's id-link and places it in the attribute. See <create datasets> for more information.
- classmethod get_indexed_values_or_raise(values, indexers)#
- get_indices_link_index(link_config: dict, dataset: dict, sources: Dict[str, DataSource])#
- get_link_index(link_config: dict, dataset: dict, sources: Dict[str, DataSource])#
- static get_single_indexed_value_or_raise(value, indexers)#
- index: Dict[Tuple[str, str], dict]#
- link_attribute(entity_type, attribute, link_config: list | dict, dataset: dict, sources: Dict[str, DataSource])#
- link_attribute_by_values(link_config: list | dict, values: list, dataset: dict, sources: Dict[str, DataSource], values_are_indices=False) List[int] #
- link_geometry_attribute(metadata, entity_type, link_config: list | dict, dataset: dict, sources: Dict[str, DataSource])#
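Conceptually, the lookup that IDLinking performs can be sketched as follows. This is a simplified stand-in, not the real implementation: the real operation reads the id_link config and works on the full dataset structure.

```python
def link_ids(values, link_values, link_entity_ids):
    """Map each attribute value to the id of the entity whose linked property matches it.

    Conceptual sketch of id linking: link_values/link_entity_ids describe the
    referenced entity group, values are the raw id-link values to resolve.
    """
    index = {value: eid for value, eid in zip(link_values, link_entity_ids)}
    try:
        return [index[value] for value in values]
    except KeyError as exc:
        raise ValueError(f"No entity found for linked value {exc.args[0]!r}") from exc
```

For example, resolving ["a", "b", "a"] against an entity group whose key property is ["a", "b"] with ids [10, 11] yields [10, 11, 10].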
- class MetadataSetup(config)#
Bases: DatasetOperation
MetadataSetup copies the metadata fields from the config into the dataset, and/or fills them with their respective default values.
- keys = (('general', <object object>), ('name', <object object>), ('display_name', <object object>), ('type', <object object>), ('version', 4))#
- class SourcesSetup(config)#
Bases: DatasetOperation
The SourcesSetup operation is responsible for reading the __sources__ field of the config and creating DatasetSources from it.
- static get_file_path(path_str)#
- make_source(source_info)#
- class SpecialValueCollection(config)#
Bases: DatasetOperation
SpecialValueCollection compiles the general.special field in the dataset from the config. If a special value is defined both in the attribute and in the general.special field of the config, the field in the config takes precedence.
- extract_special_values(config: dict, key=None, level=0)#
- create_dataset(config: dict, sources: Dict[str, DataSource] | None = None)#
Shorthand function to create an entity-based Dataset from a dataset creator config dictionary. This is the preferred way of creating Datasets from a dataset creator config, as it requires the least amount of boilerplate code. DataSources are created from the config's __sources__ field. However, it is also possible to provide (additional) DataSources through the optional sources argument.
- Parameters:
config – a dataset creator config
sources – (Optional) a dictionary with configured DataSources
- Returns:
An entity-based dataset in dictionary format
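A call to create_dataset might look like the sketch below. Only the __sources__ field name is taken from the docs above; the source-info keys and all other config fields are assumptions for illustration.

```python
# Hypothetical dataset creator config (only the __sources__ field name is
# documented above; all other keys and values are assumptions):
config = {
    "__sources__": {
        "roads": {
            "source_type": "file",           # assumed source-info key
            "path": "road_network.geojson",  # assumed source-info key
        },
    },
    "name": "road_network",
    "general": {},
    "data": {},
}

# dataset = create_dataset(config)  # returns an entity-based dataset dict
```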
- deep_get(obj, *path: str | int, default=None)#
- get_dataset_creator_schema()#
- load_csv(obj: str)#
- load_primitive(prim)#
- pipe(operations: Sequence[callable], initial, **kwargs)#
tapefile#
Create a tapefile from one or more CSV files that contain a yearly changing attribute. The CSV files should be in the form “<key>, <year_1>, <year_2>, …”, such as:
name, 2020, 2025, 2030
e1, 100, 110, 120
e2, 100, 90, 85
One or more of these CSV files can then be linked to an Attribute, and a tapefile can be created that outputs a value for every year (linearly interpolating between years that do not exist in the CSV files). The entities are matched based on a reference_attribute and the <key> column in the CSV files.
If multiple CSV files are given, every update will contain the interpolated values for all time dependent attributes, even if the corresponding timestamp is not defined for that tapefile. For example, if CSV file “a” defines 2020 as a year, but CSV file “b” starts at 2025, then a tapefile will be generated starting from the timestamp at 2020, and for all years earlier than 2025 the values from CSV file “b” at 2025 will be used.
- class InterpolatingTapefile(entity_data: 'dict', dataset_name: 'str', entity_group_name: 'str', reference: 'str', tapefile_name: 'str', tapefile_display_name: 't.Optional[str]' = None, metadata: 'dict' = None, attributes: 't.List[TimeDependentAttribute]' = <factory>)#
Bases: object
- add_attribute(attribute: TimeDependentAttribute)#
- attributes: List[TimeDependentAttribute]#
- create_content(interpolators: Dict[str, Interpolator])#
- create_update(values: Dict[str, list])#
example:
{
  "entity_group_name": {
    "id": [4, 5, 6],
    "some_attribute": [102, 40, 201],
    "some_other_attribute": [7, 6, 21]
  }
}
- dataset_name: str#
- dump(file: str | Path)#
- dump_dict()#
- ensure_csv_completeness()#
- entity_data: dict#
- entity_group_name: str#
- get_interpolators()#
- get_merged_df(attribute: TimeDependentAttribute)#
- get_scaffold()#
- static get_seconds(year: int, reference: int)#
- Parameters:
year – eg. 2024
reference – eg. 2019
- Returns:
seconds since reference
- init_data: DataFrame#
- metadata: dict = None#
- read_initial_data() DataFrame #
- reference: str#
- tapefile_display_name: str | None = None#
- tapefile_name: str#
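The interpolation behaviour described above (linear between defined years, clamped to the nearest defined year outside the range) can be sketched as:

```python
def interpolate_year(years, values, year):
    """Linearly interpolate a yearly attribute; clamp outside the defined range.

    Sketch of the tapefile interpolation described above, not the real
    InterpolatingTapefile implementation (which works on CSV dataframes).
    """
    if year <= years[0]:
        return values[0]
    if year >= years[-1]:
        return values[-1]
    for i in range(len(years) - 1):
        y0, y1 = years[i], years[i + 1]
        if y0 <= year <= y1:
            v0, v1 = values[i], values[i + 1]
            # Linear interpolation between the two surrounding defined years
            return v0 + (v1 - v0) * (year - y0) / (y1 - y0)
```

With the example CSV above, entity e1 at 2022 interpolates to 104.0, and any year before 2020 clamps to the 2020 value of 100.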
Module contents#
- class DataSource#
Bases: object
Base class for creating custom DataSources. Subclasses must implement get_attribute and __len__. In case the DataSource handles geospatial data, subclasses must also implement to_crs, get_geometry and get_bounding_box.
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box() → Tuple[float, float, float, float] | None#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) → dict | None#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- to_crs(crs: str | int) → None#
Convert the source geometry data to the coordinate reference system specified in the crs argument.
- Parameters:
crs – The CRS to convert to, either a CRS string (eg. “WGS 84” or “EPSG:28992”) or an EPSG code integer (eg. 4326).
- class DatasetCreator(operations: Sequence[Type[DatasetOperation]], sources: Dict[str, DataSource] | None = None, validate_config=True)#
Bases: object
Use DatasetCreator to convert different DataSources into an entity-based Dataset.
- Parameters:
operations – The sequence of desired operation types
sources – (Optional) a dictionary with configured DataSources
validate_config – whether to validate any dataset creator configs
- create(config: dict)#
- static default_operations()#
- classmethod with_default_operations(**kwargs)#
Alternative initializer that creates a DatasetCreator with all DatasetOperations configured, providing full functionality to create datasets. This is the preferred way of instantiating DatasetCreator.
- class GeopandasSource(geodataframe: GeoDataFrame)#
Bases: DataSource
DataSource for querying a geopandas.GeoDataFrame
- static feature_type_or_raise(feature, expected)#
- classmethod from_source_info(source_info)#
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box()#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells'])#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- get_lines(geom)#
- get_points(geom)#
- get_polygons(geom)#
- to_crs(crs: str | int)#
Convert the source geometry data to the coordinate reference system specified in the crs argument.
- Parameters:
crs – The CRS to convert to, either a CRS string (eg. “WGS 84” or “EPSG:28992”) or an EPSG code integer (eg. 4326).
- class InterpolatingTapefile(entity_data: 'dict', dataset_name: 'str', entity_group_name: 'str', reference: 'str', tapefile_name: 'str', tapefile_display_name: 't.Optional[str]' = None, metadata: 'dict' = None, attributes: 't.List[TimeDependentAttribute]' = <factory>)#
Bases: object
- add_attribute(attribute: TimeDependentAttribute)#
- attributes: List[TimeDependentAttribute]#
- create_content(interpolators: Dict[str, Interpolator])#
- create_update(values: Dict[str, list])#
example:
{
  "entity_group_name": {
    "id": [4, 5, 6],
    "some_attribute": [102, 40, 201],
    "some_other_attribute": [7, 6, 21]
  }
}
- dataset_name: str#
- dump(file: str | Path)#
- dump_dict()#
- ensure_csv_completeness()#
- entity_data: dict#
- entity_group_name: str#
- get_interpolators()#
- get_merged_df(attribute: TimeDependentAttribute)#
- get_scaffold()#
- static get_seconds(year: int, reference: int)#
- Parameters:
year – eg. 2024
reference – eg. 2019
- Returns:
seconds since reference
- init_data: DataFrame#
- metadata: dict = None#
- read_initial_data() DataFrame #
- reference: str#
- tapefile_display_name: str | None = None#
- tapefile_name: str#
- class NetCDFGridSource(file: Path | str, x_var='gridCellX', y_var='gridCellY', time_var='time')#
Bases: DataSource
- cells: ndarray = None#
- classmethod from_source_info(source_info)#
- get_attribute(name: str, time_idx=0)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- get_bounding_box() → Tuple[float, float, float, float] | None#
Return the bounding box that envelops all geospatial features in the source data.
- Returns:
A bounding box as a tuple of four values: (min_x, min_y, max_x, max_y), or None in case no bounding box can be calculated
- get_geometry(geometry_type: Literal['points', 'lines', 'polygons', 'cells']) → dict | None#
Return the geometry of the source features as a dictionary of attribute lists. The resulting dictionary should have attributes based on the geometry_type:
points: geometry.x, geometry.y and optionally geometry.z
lines: either geometry.linestring_2d or geometry.linestring_3d
polygons: geometry.polygon
See Geometries for more information on geometry attributes.
This method may raise an Exception if a geometry_type is requested that does not match the source geometry.
- Parameters:
geometry_type – One of points, lines, polygons or cells
- get_timestamps()#
- points: ndarray = None#
- class NumpyDataSource(data: Mapping[str, ndarray])#
Bases: DataSource
DataSource for non-geospatial Numpy or pandas data.
- Parameters:
data – Either a dictionary typing.Dict[str, np.ndarray], with the keys being the property names and the values being the property data arrays, or a Pandas dataframe
- get_attribute(name: str)#
Return a property as a list of values from the source data, one entry per feature.
- Parameters:
name – The property name
- PandasDataSource#
alias of NumpyDataSource
- class TimeDependentAttribute(name: 'str', csv_file: 't.Union[Path, str]', key: 'str')#
Bases: object
- csv_file: Path | str#
- property dataframe#
- key: str#
- name: str#
- create_dataset(config: dict, sources: Dict[str, DataSource] | None = None)#
Shorthand function to create an entity-based Dataset from a dataset creator config dictionary. This is the preferred way of creating Datasets from a dataset creator config, as it requires the least amount of boilerplate code. DataSources are created from the config's __sources__ field. However, it is also possible to provide (additional) DataSources through the optional sources argument.
- Parameters:
config – a dataset creator config
sources – (Optional) a dictionary with configured DataSources
- Returns:
An entity-based dataset in dictionary format
- get_dataset_creator_schema()#