Data Preparation#
Movici Datasets, or init data, need to be provided in the Movici Data Format. The easiest way to create these dataset files is to use a DatasetCreator, or more specifically, the shorthand function create_dataset(). This function can create a Dataset from a dataset creator config, a JSON object following a specific schema. A simple example of a dataset creator config looks as follows:
Dataset Creator Config#
{
"__meta__": {
"crs": "EPSG:28992"
},
"__sources__": {
"my_source": {
"source_type": "file",
"path": "/path/to/my_source.geojson"
}
},
"name": "my_dataset",
"display_name": "My Dataset",
"version": 4,
"general": {},
"data": {
"my_entities": {
"__meta__": {
"source": "my_source",
"geometry": "points"
},
"my_attribute": {
"property": "prop",
"loaders": ["int"]
}
}
}
}
Let’s look at the config piece by piece to show what everything means.
Global meta data#
We start with the global metadata:
{
"name": "my_dataset",
"display_name": "My Dataset",
"version": 4,
"general": {},
}
These fields are copied into the resulting dataset as-is. The general section is currently empty and may be omitted; it may also be filled with special and enum fields, more on that later. The version field is optional, and 4 is currently the only supported value.
__meta__ section#
{
"__meta__": {
"crs": "EPSG:28992"
}
}
The __meta__ field contains additional metadata that the dataset creator needs to know about. In this case it contains a crs field, indicating the desired coordinate reference system of the dataset. This may differ from the crs of the sources (and the sources may each have their own crs); the geospatial coordinates of the source entities are transformed into the given crs. If omitted, the default value is EPSG:28992, which corresponds to Amersfoort / RD New.
Data Sources#
{
"__sources__": {
"my_source": {
"source_type": "file",
"path": "/path/to/my_source.geojson"
}
}
}
The __sources__ field contains definitions of data sources. Keys are identifiers that may be used later on to reference a specific source; the values give information about the source. source_type gives the type of the source; currently, the supported source types are file and netcdf. path gives the location of the source file. For file sources, the above snippet may be simplified as follows:
{
"__sources__": {
"my_source": "/path/to/my_source.geojson"
}
}
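The shorthand simply expands to the full source definition. A minimal sketch of this expansion (normalize_source is a hypothetical helper for illustration, not part of the Movici API):

```python
def normalize_source(source):
    """Expand the plain-path shorthand into the full source dict form."""
    if isinstance(source, str):
        # A bare string is interpreted as a "file" source with that path
        return {"source_type": "file", "path": source}
    return source

assert normalize_source("/path/to/my_source.geojson") == {
    "source_type": "file",
    "path": "/path/to/my_source.geojson",
}
```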
Data source files are read by geopandas
(which uses Fiona
under the hood) and can be of
any format that geopandas
supports, typically GeoJSON and Shapefile.
Data section#
{
"data": {
"my_entities": {
"__meta__": {
"source": "my_source",
"geometry": "points"
},
"my_attribute": {
"property": "prop"
"loaders": ["int"]
}
}
}
}
The data section of the dataset creator config loosely follows the same structure as the data section of the resulting dataset. Top-level keys inside the data section represent entity groups, and inside entity groups, most keys represent attributes. In this case there is a single entity group my_entities which will have an attribute my_attribute. An entity group also requires a __meta__ key with additional information on how to construct the entity group. Here, it has a source field, which identifies one of the sources in the __sources__ field (see above). It also specifies a geometry, in this case points. Other supported geometries are lines and polygons. For more information about the supported geometries and how they map to common (GeoJSON) feature types see Geometries. An entity group does not need to have a geometry, but most do.
The my_attribute field will result in an attribute my_attribute for the my_entities entity group. The data for this attribute (array) is taken from the source specified in the __meta__ field and the property field in my_attribute.
An attribute config may also specify a loaders key. See Loaders for more information about loaders and an overview of the supported loaders. In this case, the int loader ensures that the resulting attribute data has an integer data type.
ID generation#
You may have noticed that the above example does not specify an id for the my_entities entity group. In fact, specifying an id attribute is not allowed; ids are always generated for you by the dataset creator. This way, it can always be ensured that ids are unique within a dataset. In case you need to keep track of which entity belongs to which source asset, you may use the reference attribute and fill it with a (unique) string belonging to the source asset.
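The id-generation rule described above amounts to numbering entities sequentially across all entity groups in the dataset. A minimal sketch (assign_ids is a hypothetical name, not the actual implementation):

```python
from itertools import count

def assign_ids(entity_groups):
    """entity_groups maps an entity group name to its number of entities.

    A single counter is shared by all groups, so generated ids are
    unique within the dataset.
    """
    counter = count()
    return {name: [next(counter) for _ in range(n)] for name, n in entity_groups.items()}

ids = assign_ids({"node_entities": 2, "edge_entities": 1})
assert ids == {"node_entities": [0, 1], "edge_entities": [2]}
```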
Loaders#
An attribute config may specify one or more loaders. Loaders are processing operations applied to source property values before they are written to the dataset. A loader acts on each value for each source asset separately. The table below shows the currently supported loaders:

Loader | Description
---|---
bool | Convert the value into a boolean. Follows the Python rules for truthiness, i.e. everything is True except False, None, 0 and empty strings/collections.
int | Convert the value into an integer; input may be a numeric type or a string with a literal integer. Floating point numbers are converted to int by rounding down.
float | Convert a value into a floating point number.
str | Convert a value into a string.
json | Parse a string value as JSON. Supports only primitive types and (multidimensional) lists. Lists with more than one dimension must have a uniform length in all but the first dimension. See also Attributes with array-like data types.
csv | Parse a string as an array of comma (,) separated values.
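The loader semantics can be sketched in plain Python. LOADERS and apply_loaders are hypothetical names for illustration; the real DatasetCreator ships its own loader implementations:

```python
import json
import math

# Hypothetical mapping of loader name -> conversion function
LOADERS = {
    "bool": bool,                             # Python truthiness rules
    "int": lambda v: math.floor(float(v)),    # floats are rounded down
    "float": float,
    "str": str,
    "json": json.loads,
    "csv": lambda v: [part.strip() for part in str(v).split(",")],
}

def apply_loaders(value, loaders):
    # Loaders are applied left to right; list results are converted element-wise
    for name in loaders:
        loader = LOADERS[name]
        value = [loader(v) for v in value] if isinstance(value, list) else loader(value)
    return value

assert apply_loaders("0,1,1,0", ["csv", "int"]) == [0, 1, 1, 0]
assert apply_loaders("[0,1,1,0]", ["json"]) == [0, 1, 1, 0]
```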
Dataset Creator#
Now that we have established the contents of the dataset creator config, let's look at how to use it to actually create datasets. As mentioned earlier, the most straightforward method is to use the create_dataset() function:
import json
from movici_simulation_core.preprocessing.dataset_creator import create_dataset
dataset = create_dataset(config)
with open("dataset.json", 'w') as file:
json.dump(dataset, file)
This will use the config to create a dataset. It is also possible to supply additional DataSources using the sources argument. These are then merged with any sources defined in the config's __sources__ key:
import pandas as pd
from movici_simulation_core.preprocessing import create_dataset, PandasDataSource
additional_sources = {
"source_a": PandasDataSource(pd.read_csv('source_a.csv')),
}
dataset = create_dataset(config, sources=additional_sources)
Recipes#
Below are a number of recipes showcasing the various functionalities of the Dataset Creator.
Enumerations#
Enumerated attributes can be specified as follows (not all required fields, such as source, are given):
{
"general": {
"enum": {
"label": ["first", "second", "third"]
}
},
"data" {
"my_entities": {
"attr": {
"property": "enum_prop"
"enum": "label"
}
}
}
}
The above config specifies an enum named label with three values: first, second and third. The resulting dataset will make use of this enum in the my_entities.attr attribute. The attribute will have integer values in the range [0-2], matching the position in the enum list.
The source property can contain either strings or integers; integers must be a valid number in the enum range. When providing string values, the general.enum section is optional. Any values not present in the enum are simply appended to the list of enum values.
Enums are matched after all loaders are applied. This means that it is possible, for example, to have enum values in array-like attributes by supplying the source data as JSON or CSV strings and using the respective loader.
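The matching rules described above can be sketched as follows (resolve_enum is a hypothetical function for illustration, not the actual implementation):

```python
def resolve_enum(values, enum_values):
    """Map source values to enum indices.

    Integers must already be valid indices; unknown strings are appended
    to the enum, as described above.
    """
    enum_values = list(enum_values)
    result = []
    for value in values:
        if isinstance(value, int):
            if not 0 <= value < len(enum_values):
                raise ValueError(f"{value} is not a valid index into {enum_values}")
            result.append(value)
        else:
            if value not in enum_values:
                enum_values.append(value)  # unknown values extend the enum
            result.append(enum_values.index(value))
    return result, enum_values

indices, enum = resolve_enum(["first", "third", "fourth"], ["first", "second", "third"])
assert indices == [0, 2, 3]
assert enum == ["first", "second", "third", "fourth"]
```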
Attributes with array-like data types#
Movici datasets support array-like values for a single entity's attribute. However, supplying such data in source properties can be tricky. While GeoJSON supports arrays as a feature property, the underlying machinery of the DatasetCreator (geopandas and Fiona) does not. Instead, array-like data may be supplied as a string, either as comma-separated values or as a JSON string. These can then be converted into their array-like form using the csv or json loader, respectively. For example, consider a GeoJSON feature with the following property:
{
"properties": {
"layout": "0,1,1,0"
}
}
An entity group config that reads this property may look like this:
{
"__meta__": "{...}",
"transport.layout": {
"property": "layout",
"loaders": ["csv", "int"]
}
}
These loaders parse the CSV string into its components. The output from the csv loader still has type str, so it must be further converted into integers. The resulting entity group in the dataset will look like this:
{
"id": [0]
"transport.layout": [[0,1,1,0]]
}
Similarly, if the source property were JSON-encoded (e.g. "[0,1,1,0]"), you would use the json loader. As a bonus, when loading JSON data, all values are converted to their respective data types automatically.
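The json loader's behaviour reduces to standard JSON parsing, which already yields native Python data types:

```python
import json

# Scalars and (nested) lists come out with their native Python types
assert json.loads("[0,1,1,0]") == [0, 1, 1, 0]
assert json.loads("[[0, 1], [1, 0]]") == [[0, 1], [1, 0]]
assert json.loads("1.5") == 1.5
```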
Read attributes from a different source#
Sometimes an entity group combines data from multiple data sources. For this case it is possible to override the entity group’s primary source in an attribute config:
{
"__meta__": {
"source": "source_a"
},
"some_attribute_using_the_default_source": {
"property": "prop"
},
"some_attribute_using_source_b": {
"source": "source_b",
"property": "prop"
}
}
The second source (source_b) must contain the same number of features as the primary source (source_a), and the order of features must also be equal.
Linking entities by id#
Sometimes it is necessary to link entities together within a dataset, for example when specifying a network dataset in which edges are connected to nodes (see also Common Attributes). These connections are made on an id basis: an attribute in one entity group contains integer values that reference the ids of other entities (in the same dataset). Since it is not allowed to specify the id for entities created using a dataset creator, the fact that a certain attribute references ids in other entity groups must be specified separately. Let's look at an example dataset creator config:
{
"__sources__": {
"nodes": "nodes.geojson",
"edges": "edges.geojson",
},
"data": {
"node_entities": {
"__meta__": {
"source": "nodes"
}
},
"edge_entities": {
"__meta__": {
"source": "edges"
},
"topology.from_node_id": {
"property": "from_node_ref"
"id_link": {
"entity_group": "node_entities",
"property": "ref"
}
},
"topology.to_node_id": {
"property": "to_node_ref"
"id_link": {
"entity_group": "node_entities",
"property": "ref"
}
}
}
}
}
This config will interpret the source data in the following way:

- The nodes source is expected to have features with a ref property. Every feature must have a unique ref property.
- The edges source is expected to have features with a from_node_ref and a to_node_ref property. Values for these properties are expected to match a ref field of a feature in the nodes source. This information links a single edge to two nodes; in the attribute config for topology.from_node_id and topology.to_node_id this link is specified using the id_link field.
- After generating the ids for every entity, the dataset creator revisits attributes with an id_link. It looks up the source for a linked entity group (in this case node_entities) and maps it to a unique id, which it places in the linking attribute (in this case topology.from_node_id and topology.to_node_id). Note that it is not required for the linking source property (i.e. ref) to be available as an attribute in the linked entity group (i.e. node_entities); the data is read directly from the source.
As an example, consider the following input data for the nodes and edges sources. It represents two points that are connected by a linestring. Processing this data using the above dataset creator config will result in the output dataset shown below:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point"
"coordinates": [0, 0]
},
"properties": {
"ref": "1"
}
},
{
"type": "Feature",
"geometry": {
"type": "Point"
"coordinates": [1, 1]
},
"properties": {
"ref": "2"
}
}
]
}
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "LineString"
"coordinates": [[0, 0], [1 ,1]]
},
"properties": {
"from_node_ref": "1",
"to_node_ref": "2",
}
}
]
}
{
"data": {
"node_entities": {
"id": [0, 1]
},
"edge_entities": {
"id": [2],
"topology.from_node_id": [0],
"topology.to_node_id": [1]
}
}
}
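The id_link resolution in this example can be reproduced with a small sketch: first map each node's ref to its generated id, then translate the edge properties through that mapping.

```python
# Values taken from the example sources and generated ids above
node_refs = ["1", "2"]   # "ref" property of the nodes source
node_ids = [0, 1]        # ids generated for node_entities
ref_to_id = dict(zip(node_refs, node_ids))

edge_from_refs = ["1"]   # "from_node_ref" property of the edges source
edge_to_refs = ["2"]     # "to_node_ref" property of the edges source

# Translate references into ids for the linking attributes
from_node_id = [ref_to_id[ref] for ref in edge_from_refs]
to_node_id = [ref_to_id[ref] for ref in edge_to_refs]

assert from_node_id == [0]
assert to_node_id == [1]
```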
Sometimes the linked entity may exist in one of multiple entity groups, for example if there is more than one group of nodes inside the network. In that case the id_link field may contain an array of entries:
{
"topology.from_node_id": {
"property": "from_node_ref",
"id_link": [
{
"entity_group": "node_entities",
"property": "ref"
},
{
"entity_group": "other_node_entities",
"property": "other_ref"
}
]
}
}
The values of "ref" and "other_ref" must be unique within their respective source, but there may be duplicates between the sources.
Entities without attributes#
It is possible to create entities in an entity group that do not have any attributes (except for
id
). There are two ways to do this, depending on whether there is an associated source:
{
"__meta__": {
"source": "my_source"
}
}
In case there is a source, the dataset creator simply looks at the number of features in the source and creates the same number of entities. If there is no source available, you can set a fixed number of entities using the count field:
{
"__meta__": {
"count": 10
}
}
Undefined attribute values and dealing with NaN#
A property does not have to be defined for every feature within a source. In case a property is not defined, this results in a null / None in the dataset for that entity. For bool and str data types this works as expected. However, for int and float there are a few caveats.
The default data source uses pandas under the hood. Any None in a float property array is converted into NaN by pandas. As a consequence, any NaN encountered by the Dataset Creator is converted into None, which means the value is undefined for that entity. In case you have data for which NaN has a specific meaning, you first need to convert any NaN values to a non-NaN special value.
When None exists within a property of type int, pandas converts the property array to float and converts these None values to NaN. This means that in case you have None values for an int property, it is recommended to add the int loader to your attribute config to ensure the correct data type.
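The NaN-to-None conversion described above can be sketched with the standard library (nan_to_none is a hypothetical helper; if NaN carries meaning in your data, replace it with a sentinel value before this step):

```python
import math

def nan_to_none(values):
    """Replace NaN entries with None, i.e. 'undefined' in a Movici dataset."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]

assert nan_to_none([1.0, float("nan"), 2.5]) == [1.0, None, 2.5]
```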
Preprocess data and custom data sources#
Sometimes it is necessary to perform additional preprocessing on the geospatial data before converting it to a Movici dataset. The preferred, and most flexible, way to do this is to first read the data into a geopandas.GeoDataFrame and perform any operations you want directly on the dataframe. You can then hand over the dataframe to a DatasetCreator and use it to create the Movici dataset. Consider the following example:
# The following example shows you how to preprocess any raw spatial data (geojson, shapefile)
# before using that data to create a Movici dataset
import geopandas
from movici_simulation_core.preprocessing import GeopandasSource, create_dataset
# Here we create a GeoJSON on the fly. Alternatively, you can read an existing GeoJSON or
# shapefile by using ``geopandas.read_file(<filename>)``
gdf = geopandas.GeoDataFrame.from_features(
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {"type": "Point", "coordinates": [0, 0]},
"properties": {"a": 1, "b": 2},
},
{
"type": "Feature",
"geometry": {"type": "Point", "coordinates": [1, 1]},
"properties": {"a": 3, "b": 4},
},
],
},
crs="WGS84",
)
# We can now do any preprocessing / dataframe operations that we want
gdf["c"] = gdf["a"] + gdf["b"]
# We can now use the source and the new property to create our dataset
config = {
"__meta__": {"crs": "WGS84"},
"name": "customized_data",
"data": {
"point_entities": {
"__meta__": {
"source": "my_custom_source",
"geometry": "points",
},
"some_attribute": {"property": "c"},
}
},
}
dataset = create_dataset(config, sources={"my_custom_source": GeopandasSource(gdf)})
Read a grid from a NetCDF file#
NetCDF files are supported if they contain a grid according to certain specifications. DatasetCreator can extract this grid in the following case:

- The grid is given as grid cells with x and y coordinates in NetCDF variables; x and y are stored in separate variables.
- The x and y variables have the individual cells as their first dimension and the vertex points of each cell as their second dimension.
- The x and y variables have default names of gridCellX and gridCellY respectively, but this can be customized.
- The CRS cannot be converted for netcdf data sources. The specified CRS must be the same as the CRS of the netcdf data source.
- The grid cell entities can read additional attributes from the NetCDF file; the point entities cannot.
Below is an example of a dataset creator config snippet:
{
"__meta__": {
"crs": "EPSG:3414"
},
"__sources__": {
"netcdf_grid": {
"source_type": "netcdf",
"path": "/path/to/netcdf.nc"
}
},
"data": {
"grid_points": {
"__meta__": {
"geometry": "points",
"source": "netcdf_grid"
}
},
"grid_cells": {
"__meta__": {
"geometry": "cells",
"source": "netcdf_grid"
},
"some_attribute": {
"property": "additional_netcdf_var"
}
}
}
}
Dataset Creator Config Schema Reference#

DatasetCreatorConfig#
type: object
properties:
- __meta__: DatasetCreatorMetaData
- __sources__: DatasetCreatorDatasetSources
- name: string, a snake_case dataset name (required)
- display_name: string, a human readable name suitable for displaying
- type: string, a snake_case dataset type
- version: literal 4, only dataset version 4 is supported
- general: DatasetCreatorGeneralSection
- data: DatasetCreatorDataSection (required)

DatasetCreatorMetaData#
type: object
properties:
- crs: string or integer indicating the Coordinate Reference System. Can be anything that is supported by geopandas as a valid CRS identifier. Default: EPSG:28992 (Amersfoort / RD New)

DatasetCreatorDatasetSources#
type: object
additionalProperties:
- keys: source name
- values: DatasetCreatorSource

DatasetCreatorSource#
One of:
- type: string, location on disk of a source file
- type: object
  properties:
  - source_type: one of file, netcdf. Use netcdf for NetCDF files, file for any other supported geospatial data file
  - path: string, location on disk of a source file

Source files are read using geopandas, which uses Fiona under the hood, and may be any file that Fiona supports as geospatial data, such as GeoJSON or Shapefile.

DatasetCreatorGeneralSection#
type: object
properties:
- special: DatasetCreatorSpecialValues
- enum: DatasetCreatorEnums

DatasetCreatorSpecialValues#
type: object
additionalProperties:
- keys: path to an attribute, such as <entity_group>.<attribute>
- values: number | integer | string, the primitive type of the attribute

DatasetCreatorEnums#
type: object
additionalProperties:
- keys: enum name
- values: DatasetCreatorEnumItems

DatasetCreatorEnumItems#
type: array
items: string
minItems: 1

DatasetCreatorDataSection#
type: object
additionalProperties:
- keys: entity types that will reflect entity groups in the dataset
- values: DatasetCreatorEntityGroup

DatasetCreatorEntityGroup#
type: object
properties:
- __meta__: DatasetCreatorEntityGroupMeta
additionalProperties:
- keys: attribute names that will reflect attributes in the dataset
- values: DatasetCreatorAttribute

DatasetCreatorEntityGroupMeta#
One of:
- type: object
  properties:
  - source: string, name of a data source (required)
  - geometry: string, one of points, lines, polygons or cells
- type: object
  properties:
  - count: integer, number of required entities (in case there are no additional attributes) (required)

DatasetCreatorAttribute#
type: object
properties:
- source: string, source name. Can be used to override the entity group's default source
- property: string, name of the property in the data source
- value: boolean | number | string, constant attribute value when no source is given
- id_link: DatasetCreatorIDLink
- special: number | integer | string, the attribute's special value
- enum: string, enum name
- loaders: DatasetCreatorAttributeLoaders

DatasetCreatorIDLink#
One of:
- DatasetCreatorIDLinkItem
- array of DatasetCreatorIDLinkItem

DatasetCreatorIDLinkItem#
type: object
properties:
- entity_group: string, the entity type to link to (target) (required)
- property: string, the property (in the target's source data) to match (required)

DatasetCreatorAttributeLoaders#
type: array
items: string
values: json, csv, bool, int, float, str