Data Preparation¶

Movici Datasets, or init data, need to be provided in the Movici Data Format. The easiest way to create these dataset files is to use a DatasetCreator, or more specifically, the shorthand function create_dataset().

This function can create a Dataset from a dataset creator config, a json object following a specific schema. A simple example of a dataset creator config is as following:

Dataset Creator Config¶

{
  "__meta__": {
    "crs": "EPSG:28992"
  },
  "__sources__": {
    "my_source": {
      "source_type": "file",
      "path": "/path/to/my_source.geojson"
    }
  },
  "name": "my_dataset",
  "display_name": "My Dataset",
  "version": 4,
  "general": {},
  "data": {
    "my_entities": {
      "__meta__": {
        "source": "my_source",
        "geometry": "points"
      },
      "my_attribute": {
        "property": "prop",
        "loaders": ["int"]
      }
    }
  }
}

Let’s look at the config piece by piece to show what everything means.

Global meta data¶

We start with the global metadata:

{
  "name": "my_dataset",
  "display_name": "My Dataset",
  "version": 4,
  "general": {},
}

These fields will be copied into the resulting dataset as is. The general section is currently empty, and may be omitted. However, it may also be filled with a special and enum fields, more on that later. The currently only supported version is 4, and is an optional field.

meta section¶

{
  "__meta__": {
    "crs": "EPSG:28992"
  }
}

The __meta__ field contains addtional metadata about the dataset creator needs to know about. In this case it contains a crs field, indicating the desired coordinate reference system of the dataset. This may be different from the crs of the sources (and the sources may each have their own crs), but the geospatial coordinates of the source entities will be transformed into the value of crs. If omitted, the default value is EPSG:28992, which corresponds with Amersfoort / RD New

Data Sources¶

{
  "__sources__": {
    "my_source": {
      "source_type": "file",
      "path": "/path/to/my_source.geojson"
    }
  }
}

The __sources__ field contains definitions of data sources. Keys are identifiers that may be used later on to reference a specific source. The values give information about the source. source_type gives the type of the source. Currently, only files are allowed. path gives the location to the source file. Since only files are supported, the above snippet may be simplified as following:

{
  "__sources__": {
    "my_source": "/path/to/my_source.geojson"
  }
}

Data source files are read by geopandas (which uses Fiona under the hood) and can be of any format that geopandas supports, typically GeoJSON and Shapefile.

Data section¶

{
  "data": {
    "my_entities": {
      "__meta__": {
        "source": "my_source",
        "geometry": "points"
      },
      "my_attribute": {
        "property": "prop"
        "loaders": ["int"]
      }
    }
  }
}

The data section of the dataset creator config loosely follows the same structure as the data section of the resulting dataset. Top-level keys inside the data section represent entity groups, and inside entity groups, most keys represent attributes. In this case there is a single entity group my_entities which will have an attribute my_attribute. An entity group also requires a __meta__ key with additional information on how to construct the entity group. Here, it has a source field, which identifies one of the source in the __sources__ field (see above). It also specifies a geometry, in this case points. Other supported geometries are lines and polygons. For more information about the supported geometries and how they map to common (geojson) feature types see Geometries. An entity group does not need to have a geometry, but most do.

The my_attribute field will result in an attribute my_attribute for the my_entities entity group. The data for this attribute (array) is taken from the source specified in the __meta__ field and the property field in my_attribute.

An attribute config may also specify a loaders key. See Loaders for more information about loaders and an overview of the supported loaders. In this case, the int loader ensures that the resulting attribute data has an integer data type.

ID generation¶

You may have noticed that the above example does not specify an id for the my_entities entity group. In fact, specifying an id attribute is not allowed. ids are always generated for you by the dataset creator. This way, it can be always be ensured that ids are unique within a dataset. In case you need to keep track of which entity belongs to which source asset, you may use the reference attribute and fill it with a (unique) string belonging to the source asset.

Loaders¶

An attribute config may specify one or more loaders. Loaders are processing operations done on source property values before they are written to the dataset. A loader acts on each value for each source asset separately. The table below shows the supported currently supported loaders:

Loader	Description
`bool`	Convert the value into a boolean. Follows the Python rules for truthyness, ie. everything is `true` except `false` , `0` ,``0.0`` and `""`.
`int`	Convert the value into an integer, input may be a numeric type or a string with a literal integer. Floating point numbers are converted to int by rounding down.
`float`	Convert a value into a floating point.
`str`	Convert a value into a string.
`json`	Parse a string value as json. Supports only primitive types and (multidimensional) lists. Lists with more than one dimension, must have a uniform length in all but first dimension. See also Attributes with array-like data types
`csv`	Parse a string as an array of comma (`,`) separated values. The resulting values are still of type `str` but may be converted using other loaders. See also Attributes with array-like data types

Dataset Creator¶

Now that we have established the contents of the dataset creator config, let’s have a look on how to use this to actually create datasets. As mentioned earlier, the most straight-forward method is to use the create_dataset() function:

import json
from movici_simulation_core.preprocessing.dataset_creator import create_dataset

dataset = create_dataset(config)
with open("dataset.json", 'w') as file:
    json.dump(dataset, file)

This will use the config to create a dataset. It is also possible to supply additional DataSources as using the sources arguments. These are then merged with any sources defined in the config’s __sources__ key:

import pandas as pd
from movici_simulation_core.preprocessing import create_dataset, PandasDataSource

additional_sources = {
  "source_a": PandasDataSource(pd.read_csv('source_a.csv')),
}

dataset = create_dataset(config, sources=additional_sources)

Recipes¶

Below are a number of recipes showcasing the various functionalities of the Dataset Creator

Enumerations¶

Enumerated attributes can be specified as following (not all required fields are given, such as source):

{
  "general": {
    "enum": {
      "label": ["first", "second", "third"]
    }
  },
  "data" {
    "my_entities": {
      "attr": {
        "property": "enum_prop"
        "enum": "label"
      }
    }
  }
}

The above config specifies an enum with name label and three values: first, second and third. The resulting dataset will make use of this enum in the my_entities.attr attribute. The attribute will have integer values in the range [0-2], matching up with the position in the enum list.

The source property can contain either strings or integers, integers must be a valid number in the enum range. When providing string values, the general.enum section is optional. Any values not present as a valid enum are simply appended to the list of enum values.

Enums are matched after all loaders are applied. This means that it is possble for example to have enum values in array-like attributes by supplying the source data as json or csv, and using their respective attribute loader.

Attributes with array-like data types¶

Movici datasets support array-like values for a single entity’s attribute. However, supplying such data in source properties can be tricky. While geojson supports arrays as a feature property, the underlying machinery of the DatasetCreator (geopandas and Fiona) does not. Instead, array-like data may be supplied as a string, either as comma-separated-values or a json string. These can then be converted into their array-like data using respectively the csv or the json loader. For example, consider a geosjon feature with the following property:

{
  "properties": {
    "layout": "0,1,1,0"
  }
}

An entity group config that reads this property may look like this:

{
  "__meta__": "{...}",
  "transport.layout": {
    "property": "layout",
    "loaders": ["csv", "int"]
  }
}

These loaders parse the csv string into their components. The output from csv loader still has a type str so it must be further converted into integers. The resulting entity group in the dataset will look like this:

{
  "id": [0]
  "transport.layout": [[0,1,1,0]]
}

Similarly, if the source property would be json-encoded (eg. "[0,1,1,0]"), you would use the json loader. As a bonus, when loading json data, all values are converted to their respective data type automatically.

Read attributes from a different source¶

Sometimes an entity group combines data from multiple data sources. For this case it is possible to override the entity group’s primary source in an attribute config:

{
  "__meta__": {
    "source": "source_a"
  },
  "some_attribute_using_the_default_source": {
    "property": "prop"
  }
  "some_attribute_using_source_b": {
    "source": "source_b",
    "property": "prop"
  }
}

The second source (source_b) must contain the same number of features as the primary source (source_a) and the order of features must also be equal.

Linking entities by id¶

Sometimes it is necessary to link entities together within a dataset, for example when specifying a network dataset, in which edges are connected to nodes (See also Common Attributes ). These connections are done on an id-basis, which means there is an attribute in an entity group with integer values that reference ids of other entities (in the same dataset). Since it is not allowed to specify the id for entities created using a dataset creator, the fact that a certain attribute references ids in other entity groups, must be specified separately. Let’s look at an example dataset creator config:

{
  "__sources__": {
    "nodes": "nodes.geojson",
    "edges": "edges.geojson",
  },
  "data": {
    "node_entities": {
      "__meta__": {
        "source": "nodes"
      }
    },
    "edge_entities": {
      "__meta__": {
        "source": "edges"
      }
      "topology.from_node_id": {
        "property": "from_node_ref"
        "id_link": {
          "entity_group": "node_entities",
          "property": "ref"
        }
      },
      "topology.to_node_id": {
        "property": "to_node_ref"
        "id_link": {
          "entity_group": "node_entities",
          "property": "ref"
        }
      }
    }
  }
}

This config will interpret the source data in the following way:

the nodes source is expected to have features with a ref property. Every feature must have unique ref property
the edges source is expected to have features with a from_node_ref and a to_node_ref property. Values for these properties are expected to match a ref field of a feature in the nodes source. This information links a single edge to two nodes; in the attribute config for topology.from_node_id and topology.to_node_id this link is specified using the id_link field
After generating the ids for every entity, the dataset creator revisits attributes with an id_link. It looks up the source for a linked entity group (in this case “node_entities”) and maps it to a unique id, which it places in the linking attribute (in this case topology.from_node_id and topology.to_node_id). Note that it is not required to have the linking source property (ie: ref) be available as an attribute in the linked entity group. (ie. node_entities), it will read the data directly from the source.

As an example, consider the following input data. it represents two points that are connected by a linestring. Processing this data using the above dataset creator config will result in the below output data:

nodes.geojson¶

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point"
        "coordinates": [0, 0]
      },
      "properties": {
        "ref": "1"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Point"
        "coordinates": [1, 1]
      },
      "properties": {
        "ref": "2"
      }
    },
  ],
}

edges.geojson¶

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "LineString"
        "coordinates": [[0, 0], [1 ,1]]
      },
      "properties": {
        "from_node_ref": "1",
        "to_node_ref": "2",
      }
    },
  ],
}

dataset.json¶

{
  "data": {
    "node_entities": {
      "id": [0, 1]
    },
    "edge_entities": {
      "id": [2],
      "topology.from_node_id": [0],
      "topology.to_node_id": [1]
    }
  }
}

Sometimes the linked entity may exist in one of multiple entity groups, for example if there are more than one groups of nodes inside the network. In that case the id_link field may contain an array of entries:

{
  "topology.from_node_id": {
    "property": "from_node_ref",
    "id_link": [
        {
          "entity_group": "node_entities",
          "property": "ref"
        },
        {
          "entity_group": "other_node_entities",
          "property": "other_ref"
        }
    ]
  }
}

The values of “ref” and “other_ref” must be unique within their respective source, but there may be duplicates between the sources

Entities without attributes¶

It is possible to create entities in an entity group that do not have any attributes (except for id). There are two ways to do this, depending on whether there is an associated source:

With source¶

{
  "__meta__": {
    "source": "my_source"
  }
}

In case there is a source, the dataset creator simply looks at the number of features of the source and creates the same amount of entities. If there is no source available, you can set a fixed number of entities using the count field:

Without source¶

{
  "__meta__": {
    "count": 10
  }
}

Undefined attribute values and dealing with NaN¶

A property does not have to be defined for every feature within a source. In case a property is not defined, this results in a null / None in the dataset for that entity. For bool and str data types this works as expected. However, for int and float there are a few caveats

The default data source uses pandas under the hood. Any None in a float property array is converted into NaN by pandas. As a consequence, any NaN encountered by the Dataset Creator is converted into None, which means the value is undefined for that entity. In case you have data for which NaN has a specific meaning, you will first need to convert any NaN values to a non-NaN Special value.

When None exists within a property of type int, pandas converts the property array to float and inserts these None values to NaN. This means that in case you have None values for an int property, it is recommended to add the int loader to your attribute config to ensure the correct data type.

Preprocess data and custom data sources¶

Sometimes, it is necessary to perform additional preprocessing to the geospatial data before converting it to a Movici dataset. The preferred, and most flexible, way to do this is to first read the data in a geopandas.GeoDataFrame and perform any operations you want directly on the dataframe. You can then hand over the dataframe to a DatasetCreator and use it to create the Movici dataset. Consider the following example:

# The following example shows you how to preprocess any raw spatial data (geojson, shapefile)
# before using that data to create a Movici dataset

import geopandas

from movici_simulation_core.preprocessing import GeopandasSource, create_dataset

# Here we create a GeoJSON on the fly. Alternatively, you can read an existing GeoJSON or
# shapefile by using ``geopandas.read_file(<filename>)``
gdf = geopandas.GeoDataFrame.from_features(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [0, 0]},
                "properties": {"a": 1, "b": 2},
            },
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [1, 1]},
                "properties": {"a": 3, "b": 4},
            },
        ],
    },
    crs="WGS84",
)
# We can now do any preprocessing / dataframe operations that we want
gdf["c"] = gdf["a"] + gdf["b"]

# We can now use the source and the new property to create our dataset
config = {
    "__meta__": {"crs": "WGS84"},
    "name": "customized_data",
    "data": {
        "point_entities": {
            "__meta__": {
                "source": "my_custom_source",
                "geometry": "points",
            },
            "some_attribute": {"property": "c"},
        }
    },
}
dataset = create_dataset(config, sources={"my_custom_source": GeopandasSource(gdf)})

Read a grid from a NetCDF file¶

NetCDF files are supported if they contain a grid according to certain specifications. DatasetCreator can extract this grid in case of the following:

The grid is given as grid cells with x and y coordinates in NetCDF variables. x and y are stored in separate variables
The x and y variable have the first dimension as the individual cells, and the second dimension is for every vertex point of the cell
The x and y variable have a default name of gridCellX and gridCellY respectively, but this can be customized
The CRS can not be converted for netcdf data sources. The CRS must be specified to be the same CRS as the netcdf data source
the grid cell entities can read additional attributes from the netcdf, the point entities cannot

Below is given an example of a dataset creator config snippet

{
  "__meta__": {
    "crs": "EPSG:3414"
  },
  "__sources__": {
    "netcdf_grid": {
      "source_type": "netcdf",
      "path": "/path/to/netcdf.nc"
    }
  },
  "data": {
    "grid_points": {
      "__meta__": {
        "geometry": "points",
        "source": "netcdf_grid"
      }
    },
    "grid_cells": {
      "__meta__": {
        "geometry": "cells",
        "source": "netcdf_grid"
      },
      "some_attribute": {
        "property": "additional_netcdf_var"
      }
    }
  }
}

Read an EPANET INP file¶

EPANET INP files are supported via WNTR. An epanet source is a multi-entity source that exposes the six WNTR collections as separate sub-sources, each selected from the parent source using dot notation "<source_name>.<entity_type>":

Node types: junctions, tanks, reservoirs
Link types: pipes, pumps, valves

Properties available on each sub-source match the attributes of the corresponding WNTR object (e.g. elevation, diameter, start_node_name, base_head). The synthetic name property yields the WNTR element name and is useful as a reference attribute and as an id_link key for resolving link-to-node topology.

Caveats:

The CRS can not be converted for epanet data sources. The CRS must be specified to be the same CRS as the coordinates stored in the INP file.
Link sub-sources (pipes, pumps, valves) produce line geometries from the coordinates of their start and end nodes. They do not contribute a bounding box.
WNTR converts INP values to SI units on read according to the file’s flow-units header (e.g. in LPS, pipe diameters expressed in millimeters become meters). Map properties from the WNTR representation, not the raw INP values.

Below is an example of a dataset creator config snippet. A single epanet source is declared once, and each entity group references the appropriate sub-source through dot notation:

{
  "__meta__": {
    "crs": "EPSG:28992"
  },
  "__sources__": {
    "network": {
      "source_type": "epanet",
      "path": "/path/to/network.inp"
    }
  },
  "data": {
    "water_junction_entities": {
      "__meta__": {"source": "network.junctions", "geometry": "points"},
      "reference": {"property": "name"},
      "water.elevation": {"property": "elevation"}
    },
    "water_reservoir_entities": {
      "__meta__": {"source": "network.reservoirs", "geometry": "points"},
      "reference": {"property": "name"}
    },
    "water_tank_entities": {
      "__meta__": {"source": "network.tanks", "geometry": "points"},
      "reference": {"property": "name"}
    },
    "water_pipe_entities": {
      "__meta__": {"source": "network.pipes", "geometry": "lines"},
      "reference": {"property": "name"},
      "water.diameter": {"property": "diameter"},
      "topology.from_node_id": {
        "property": "start_node_name",
        "id_link": [
          {"entity_group": "water_junction_entities", "property": "name"},
          {"entity_group": "water_reservoir_entities", "property": "name"},
          {"entity_group": "water_tank_entities", "property": "name"}
        ]
      },
      "topology.to_node_id": {
        "property": "end_node_name",
        "id_link": [
          {"entity_group": "water_junction_entities", "property": "name"},
          {"entity_group": "water_reservoir_entities", "property": "name"},
          {"entity_group": "water_tank_entities", "property": "name"}
        ]
      }
    }
  }
}

Dataset Creator Config Schema Reference¶

DatasetCreatorConfig¶

type: object

properties:: __meta__: DatasetCreatorMetaData

__sources__: DatasetCreatorDatasetSources

name: string, a snake_case dataset name (required)

display_name: string, a human readable name suitable for displaying

type: string, a snake_case dataset_type

version: literal 4, only dataset version v4 is supported

general: DatasetCreatorGeneralSection

data: DatasetCreatorDataSection (required)

DatasetCreatorMetaData¶

type: object

properties:: crs: string or integer indicating the Coordinate Reference System. Can be anything that is supported by geopandas as a valid CRS identifier. Default: EPSG:28992 (Amersfoort/RD new)

DatasetCreatorDatasetSources¶

type: object

additionalProperties:: keys: Source name

values: DatasetCreatorSource

DatasetCreatorSource¶

One of:

type: string location on disk to a source file

type: object

properties:: source_type: One of file, netcdf, inp. Use netcdf for NetCDF-files, inp for EPANET INP files (read via WNTR), file for any other supported geospatial data file

path: string location on disk to a source file

Source files of source_type file are read using geopandas which uses Fiona under the hood and may be any file that Fiona supports as geospatial data, such as geojson or shapefile

DatasetCreatorGeneralSection¶

type: object

properties:: special: DatasetCreatorSpecialValues

enum: DatasetCreatorEnums

DatasetCreatorSpecialValues¶

type: object

additionalProperties:: keys: Path to an attribute such as <entity_group>.<attribute>

values: number | integer | string Primitive type of the attribute

DatasetCreatorEnums¶

type: object

additionalProperties:: keys: Enum name

values: DatasetCreatorEnumItems

DatasetCreatorEnumItems¶

type: array
items: string
minItems: 1

DatasetCreatorDataSection¶

type: object

additionalProperties:: keys: Entity types that will reflect entity groups in the dataset

values: DatasetCreatorEntityGroup

DatasetCreatorEntityGroup¶

type: object

properties:: __meta__: DatasetCreatorEntityGroupMeta
additionalProperties:: keys: attribute names that will reflect attributes in the dataset

values: DatasetCreatorAttribute

DatasetCreatorEntityGroupMeta¶

One of:

type: object

properties:: source: string name of a data source (required)

geometry: string one of points, lines, polygons or cells

type: object

properties:: count: integer number of required entities (in case there are no additional attributes) (required)

DatasetCreatorAttribute¶

type: object

properties:: source: string Source name. Can be used to override the entity group’s default source

property: string Name of the property in the data source

value: boolean | number | string Constant attribute value when no source is given

id_link: DatasetCreatorIDLink

special: number | integer | string The attribute’s special value

enum: string Enum name

loaders: DatasetCreatorAttributeLoaders

DatasetCreatorIDLink¶

One of

type: DatasetCreatorIDLinkItem

type: array
items: DatasetCreatorIDLinkItem
minItems: 1

DatasetCreatorIDLinkItem¶

type: object

properties:: entity_group: string The entity type to link to (target) (required)

property: string the property (in the target’s source data) to match (required)

DatasetCreatorAttributeLoaders¶

type: array
items: string
values: json, csv, bool, int, float, str

Data Preparation¶

Dataset Creator Config¶

Global meta data¶

__meta__ section¶

Data Sources¶

Data section¶

ID generation¶

Loaders¶

Dataset Creator¶

Recipes¶

Enumerations¶

Attributes with array-like data types¶

Read attributes from a different source¶

Linking entities by id¶

Entities without attributes¶

Undefined attribute values and dealing with NaN¶

Preprocess data and custom data sources¶

Read a grid from a NetCDF file¶

Read an EPANET INP file¶

Dataset Creator Config Schema Reference¶

DatasetCreatorConfig¶

DatasetCreatorMetaData¶

DatasetCreatorDatasetSources¶

DatasetCreatorSource¶

DatasetCreatorGeneralSection¶

DatasetCreatorSpecialValues¶

DatasetCreatorEnums¶

DatasetCreatorEnumItems¶

DatasetCreatorDataSection¶

DatasetCreatorEntityGroup¶

DatasetCreatorEntityGroupMeta¶

DatasetCreatorAttribute¶

DatasetCreatorIDLink¶

DatasetCreatorIDLinkItem¶

DatasetCreatorAttributeLoaders¶

meta section¶