xagg package

Submodules

xagg.auxfuncs module

xagg.auxfuncs.find_rel_area(df)[source]

Find the relative area of each row in a geodataframe

xagg.auxfuncs.fix_ds(ds, var_cipher={'Lat': {'Lat': 'lat', 'Lon': 'lon'}, 'Latitude': {'Latitude': 'lat', 'Longitude': 'lon'}, 'Y': {'X': 'lon', 'Y': 'lat'}, 'latitude': {'latitude': 'lat', 'longitude': 'lon'}, 'latitude_1': {'latitude_1': 'lat', 'longitude_1': 'lon'}, 'nav_lat': {'nav_lat': 'lat', 'nav_lon': 'lon'}, 'y': {'x': 'lon', 'y': 'lat'}}, chg_bnds=True)[source]

Puts the input ds into a format compatible with the rest of the package

  1. grid variables are renamed “lat” and “lon”

  2. the lon dimension is made -180:180 to be consistent with most geographic data (as well as any lon_bnds variable if chg_bnds=True (by default))

  3. the dataset is sorted in ascending order in both lat and lon

NOTE: there probably should be a safeguard in case “y” and “x” are multiindex dimension names instead of lat/lon names… maybe a warning for now… (TO DO)

Parameters:
ds : xarray.Dataset

an input xarray.Dataset, which may or may not need adjustment to be compatible with this package

var_cipher : dict, optional

a dict of dicts for renaming lat/lon variables to “lat”/“lon”. The form is {search_str:{lat_name:'lat',lon_name:'lon'},...}; the code looks for search_str in the dimensions of the ds and, based on that, renames lat_name to ‘lat’ and lon_name to ‘lon’. Common names for these variables (‘latitude’, ‘Latitude’, ‘Lat’, ‘latitude_1’, ‘nav_lat’, ‘Y’) are included out of the box.

chg_bnds : bool, optional, default = True

if True, variables with “_bnd” in their names are assumed to be dimension bound variables, and are renamed as well if the rest of their name matches ‘o’ (for lon) or ‘a’ (for lat). ## DOES THIS WORK FOR “X” and “Y”?

Returns:
ds : xarray.Dataset

a dataset with lat/lon variables in the format necessary for this package to function
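The longitude handling in steps 2-3 above can be sketched with plain numpy (an illustration of the documented behavior, not xagg's internal code):

```python
import numpy as np

# Wrap a 0:360-style longitude grid to -180:180, then sort ascending,
# as fix_ds does for the lon dimension.
lon = np.array([0., 90., 180., 270.])       # a 0:360-style grid
lon_wrapped = ((lon + 180.) % 360.) - 180.  # map each value into -180:180
order = np.argsort(lon_wrapped)             # ascending order
lon_sorted = lon_wrapped[order]             # -> [-180., -90., 0., 90.]
# data variables would be reindexed along lon with the same `order`
```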

xagg.auxfuncs.get_bnds(ds, wrap_around_thresh='dynamic', break_window_width=3, break_thresh_x=2, silent=False)[source]

Builds vectors of lat/lon bounds if not present in ds

Assumes a regular rectangular grid - so each lat/lon bound is 0.5*(gap between pixels) over to the next pixel.

Parameters:
ds : xarray.Dataset

an xarray dataset that may or may not contain variables “lat_bnds” and “lon_bnds”

silent : bool, by default False

if True, then suppresses standard output

wrap_around_thresh : numeric or str, optional, default = 'dynamic'

the minimum distance between the last grid cell in longitude and the ‘edges’ of the coordinate system for which the pixels are ‘wrapped around’. By default “dynamic”, which sets this to twice the median difference between longitude values.

In either case, if the lowest and highest lon values have opposite sign, and are both within wrap_around_thresh of -180 or 180, then the grid is assumed to “wrap around”, and pixels will line up around (or across, depending on the grid) the antimeridian.

break_thresh_x : numeric, default = 2

If the difference between consecutive coordinate values is break_thresh_x times the surrounding coordinate steps (determined by break_window_width), then a break in the grid is assumed, and the corresponding grid cell is assumed to be as wide as the preceding one instead of reaching all the way across the “break”.

break_window_width : numeric, default = 3

Window used to determine how anomalous a difference between consecutive coordinates is when determining the location of breaks in the grid.

Returns:
ds : xarray.Dataset

the same dataset as inputted, unchanged if “lat/lon_bnds” already existed, or with new variables “lat_bnds” and “lon_bnds” if not.
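On a regular grid, the bound construction can be sketched as follows (a minimal 1-D illustration with no wrap-around or grid breaks; xagg's internal handling also covers those cases):

```python
import numpy as np

# Each cell edge sits halfway between neighboring cell centers; the two
# outer edges are extrapolated by half the adjacent grid step.
lat = np.array([-1.0, 0.0, 1.0, 2.0])  # cell centers
edges = np.concatenate((
    [lat[0] - 0.5 * (lat[1] - lat[0])],    # lower outer edge
    0.5 * (lat[:-1] + lat[1:]),            # interior edges (midpoints)
    [lat[-1] + 0.5 * (lat[-1] - lat[-2])]  # upper outer edge
))
# lat_bnds: one (min, max) pair per cell, shape (nlat, 2)
lat_bnds = np.stack([edges[:-1], edges[1:]], axis=1)
```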

xagg.auxfuncs.list_or_first(ser)[source]
xagg.auxfuncs.normalize(a, drop_na=False)[source]

Normalizes the vector a

The vector a is divided by its sum. If all non-np.nan elements of a are 0, then a vector with only np.nan elements is returned.

Parameters:
a : array_like

A vector to be normalized

drop_na : bool, optional, default = False

If drop_na = True, and there are nans in the vector a, then the normalization is calculated using only the non-nan locations in a, and the vector is returned with the nans in their original locations. In other words, np.nansum(normalize(a,drop_na=True)) == 1.0

If drop_na = False, and nans are present in a, then normalize just returns a vector of np.nan the same length as a.

Returns:
a : vector

a, but normalized.
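The documented behavior can be sketched in numpy (normalize_sketch is a hypothetical stand-in, not the packaged function):

```python
import numpy as np

def normalize_sketch(a, drop_na=False):
    # Divide a by its sum; return all-nan if the sum is 0, or (with
    # drop_na=False) if any nans are present.
    a = np.asarray(a, dtype=float)
    if drop_na:
        total = np.nansum(a)       # sum over non-nan entries only
        if total == 0:
            return np.full_like(a, np.nan)
        return a / total           # nans stay in their original places
    total = a.sum()                # nan if any nans are present
    if np.isnan(total) or total == 0:
        return np.full_like(a, np.nan)
    return a / total

normalize_sketch([1.0, 3.0])  # -> [0.25, 0.75]
```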

xagg.auxfuncs.subset_find(ds0, ds1, silent=False)[source]

Finds the grid of ds1 in ds0, and subsets ds0 to the grid in ds1

Parameters:
ds0 : xarray.Dataset

an xarray Dataset to be subset based on the grid of ds1; must contain grid variables “lat” or “lon” (could add a fix_ds call)

ds1 : xarray.Dataset, xarray.DataArray

either an xarray structure (Dataset, DataArray) with “lat” / “lon” variables, or a dictionary with DataArrays [‘lat’] and [‘lon’]. IMPORTANT: ds1 HAS TO BE BROADCAST - i.e. one value of lat, lon for each coordinate, with lat and lon vectors of equal length. This can be done e.g. using ds1.stack(loc=('lat','lon')).

silent : bool, by default False

if True, then suppresses standard output

Returns:
ds0 : xarray.Dataset

The input ds0, subset to the locations in ds1.
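The subsetting idea can be sketched with numpy (hypothetical grids, not the xagg code, which works on the xarray structures directly):

```python
import numpy as np

# Keep only the ds0 grid points whose lat/lon values also appear in
# ds1's (broadcast) lat/lon vectors.
lat0 = np.array([-2., -1., 0., 1., 2.])  # ds0 grid
lon0 = np.array([10., 20., 30., 40.])
lat1 = np.array([-1., 0., 1.])           # ds1 grid (a subset of ds0's)
lon1 = np.array([20., 30.])

lat_sub = lat0[np.isin(lat0, lat1)]      # -> [-1., 0., 1.]
lon_sub = lon0[np.isin(lon0, lon1)]      # -> [20., 30.]
```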

xagg.classes module

class xagg.classes.aggregated(agg, source_grid, geometry, ds_in, weights='nowghts')[source]

Bases: object

Class for aggregated data, output from xagg.core.aggregate()

Methods

to_csv(fn[, silent])

Save as csv

to_dataframe([loc_dim])

Convert to pandas dataframe.

to_dataset([loc_dim])

Convert to xarray dataset.

to_geodataframe()

Convert to geopandas geodataframe.

to_netcdf(fn[, loc_dim, silent])

Save as netcdf

to_shp(fn[, silent])

Save as shapefile

to_csv(fn, silent=False)[source]

Save as csv

Parameters:
fn : str

The target filename

silent : bool, by default False

If True, silences standard out

to_dataframe(loc_dim='poly_idx')[source]

Convert to pandas dataframe.

to_dataset(loc_dim='poly_idx')[source]

Convert to xarray dataset.

to_geodataframe()[source]

Convert to geopandas geodataframe.

to_netcdf(fn, loc_dim='poly_idx', silent=False)[source]

Save as netcdf

Parameters:
fn : str

The target filename

loc_dim : str, by default ‘poly_idx’

What to name the polygon dimension

silent : bool, by default False

If True, silences standard out

to_shp(fn, silent=False)[source]

Save as shapefile

Parameters:
fn : str

The target filename

silent : bool, by default False

If True, silences standard out

class xagg.classes.weightmap(agg, source_grid, geometry, overlap_da=None, weights='nowghts')[source]

Bases: object

Class for mapping from pixels to polygons, output from xagg.wrappers.pixel_overlaps()

Methods

diag_fig(poly_id, ds)

Create a diagnostic figure showing overlap between pixels and a given polygon

to_file(fn[, overwrite])

Save a copy of the weightmap, to avoid recalculating it

diag_fig(poly_id, ds)[source]

Create a diagnostic figure showing overlap between pixels and a given polygon

See xagg.diag.diag_fig() for more info.

to_file(fn, overwrite=False)[source]

Save a copy of the weightmap, to avoid recalculating it

xagg.core module

exception xagg.core.NoOverlapError[source]

Bases: Exception

Exception for when there’s no overlap between pixels and polygons

xagg.core.aggregate(ds, wm, impl='for_loop', silent=False)[source]

Aggregate raster variable(s) to polygon(s)

Aggregates (N-D) raster variables in ds to the polygons in gdf_out - in other words, gives the weighted average of the values in [ds] based on each pixel’s relative area of overlap with the polygons.

The values will be additionally weighted if a weight was inputted into xagg.core.create_raster_polygons()

The code checks whether the input lat/lon grid in ds is equivalent to the linearly indexed grid in wm, or if it can be cropped to that grid.

Parameters:
ds : xarray.Dataset

an xarray.Dataset containing one or more variables with dimensions lat, lon (and possibly more). The dataset’s geographic grid has to include the lat/lon coordinates used in determining the pixel overlaps in xagg.core.get_pixel_overlaps() (and saved in wm['source_grid'])

wm : xagg.classes.weightmap

the output of xagg.core.get_pixel_overlaps(); a xagg.classes.weightmap object containing

  • ['agg']

    a dataframe, with one row per polygon, and the columns pix_idxs and rel_area, giving the linear indices and the relative area of each pixel over the polygon, respectively

  • ['source_grid']

    the lat/lon grid on which the aggregating parameters were calculated (and on which the linear indices are based)

impl : str, optional, default = 'for_loop'

which aggregation calculation method to use, either of:

  • 'for_loop'

    default behavior, aggregation loops through all polygons in a for loop, requires less memory

  • 'dot_product'

    aggregation is calculated using a dot product, requires much more memory (due to broadcasting of variables) but may be faster in certain circumstances

silent : bool, default = False

if True, then no status updates are printed to std out

Returns:
agg_out : xagg.classes.aggregated

an xagg.classes.aggregated object with the aggregated variables
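The aggregation arithmetic that both impl options compute can be sketched with toy numbers (illustrative names only; xagg operates on the weightmap's dataframe):

```python
import numpy as np

# For each polygon, the aggregated value is the rel_area-weighted
# average of the overlapping pixel values. 'for_loop' and 'dot_product'
# compute the same quantity.
pix_vals = np.array([10., 20., 30., 40.])  # flattened (linearly indexed) pixels

# Per-polygon overlap info, as in the agg dataframe's pix_idxs / rel_area
polys = [
    {"pix_idxs": [0, 1], "rel_area": [0.5, 0.5]},
    {"pix_idxs": [2, 3], "rel_area": [0.25, 0.75]},
]

# 'for_loop'-style: loop over polygons, weighted sum per polygon
agg_loop = [np.sum(np.array(p["rel_area"]) * pix_vals[p["pix_idxs"]])
            for p in polys]                 # -> [15.0, 37.5]

# 'dot_product'-style: one (npoly x npix) weights matrix, then a matmul
W = np.zeros((len(polys), pix_vals.size))
for i, p in enumerate(polys):
    W[i, p["pix_idxs"]] = p["rel_area"]
agg_dot = W @ pix_vals                      # same result, more memory
```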

xagg.core.create_raster_polygons(ds, mask=None, subset_bbox=None, weights=None, weights_target='ds', wrap_around_thresh=5, silent=False)[source]

Create polygons for each pixel in a raster

Note: ‘lat_bnds’ and ‘lon_bnds’ can be created through the xagg.auxfuncs.get_bnds() function if they are not already included in the input raster file.

Note: Currently this code only supports regular rectangular grids (so where every pixel side is a straight line in lat/lon space). Future versions may include support for irregular grids.

Parameters:
ds : xarray.Dataset

an xarray dataset with the variables ‘lat_bnds’ and ‘lon_bnds’, which are both lat/lon x 2 arrays giving the min and max values of lat and lon for each pixel given by lat/lon

subset_bbox : geopandas.GeoDataFrame, optional, default = None

if a geopandas.GeoDataFrame is entered, the bounding box around the geometries in the gdf is used to mask the grid, reducing the number of pixel polygons created

Returns:
pix_agg: dict

a dictionary containing:

  • 'gdf_pixels'

    a geopandas.GeoDataFrame containing a ‘geometry’ giving the pixel boundaries for each ‘lat’ / ‘lon’ pair

  • 'source_grid'

    a dictionary containing the original lat and lon inputs under the keys “lat” and “lon” (just the xarray.DataArray of those variables in the input ds)
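How a rectangular pixel polygon follows from lat_bnds / lon_bnds can be sketched as a corner ring (corner ordering here is illustrative; xagg builds shapely Polygons internally):

```python
import numpy as np

lat_bnds = np.array([[0.0, 1.0]])    # one pixel spanning lat 0..1
lon_bnds = np.array([[10.0, 11.0]])  # ...and lon 10..11

def pixel_corners(latb, lonb):
    # Counter-clockwise ring of (lon, lat) corners for one pixel.
    (la0, la1), (lo0, lo1) = latb, lonb
    return [(lo0, la0), (lo1, la0), (lo1, la1), (lo0, la1)]

corners = pixel_corners(lat_bnds[0], lon_bnds[0])
# -> [(10.0, 0.0), (11.0, 0.0), (11.0, 1.0), (10.0, 1.0)]
```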

xagg.core.get_pixel_overlaps(gdf_in, pix_agg, impl='for_loop')[source]

Get, for each polygon, the pixels that overlap and their area of overlap

Finds, for each polygon in gdf_in, which pixels intersect it, and by how much.

Note: Uses EASE-Grid 2.0 on the WGS84 datum to calculate relative areas (see https://nsidc.org/data/ease)

Parameters:
gdf_in : geopandas.GeoDataFrame

a geopandas.GeoDataFrame giving the polygons over which the variables should be aggregated. Can be just a read shapefile (with the added column of “poly_idx”, which is just the index as a column).

pix_agg : dict

the output of xagg.core.create_raster_polygons(); a dict containing:

  • 'gdf_pixels'

    a geopandas.GeoDataFrame giving for each row the columns “lat” and “lon” (with coordinates) and a polygon giving the boundary of the pixel given by lat/lon

  • 'source_grid'

    [da.lat,da.lon] of the grid used to create the pixel polygons

impl : str

which aggregation calculation method the output will be used for (the dot-product method needs a slightly different format), either of:

  • 'for_loop' (default behavior)

  • 'dot_product' (to set up for impl='dot_product' in xagg.core.aggregate())

Returns:
wm_out: dict

A dictionary containing:

  • 'agg':

    a dataframe containing all the fields of gdf_in (except geometry) and the additional columns:

    • coords: the lat/lon coordinates of all pixels that overlap the polygon of that row

    • pix_idxs: the linear indices of those pixels within the gdf_pixels grid

    • rel_area: the relative area of each of the overlaps between the pixels and the polygon (summing to 1 - e.g. if the polygon is exactly the size and location of two pixels, their rel_areas would be 0.5 each)

  • 'source_grid'

    a dictionary with keys ‘lat’ and ‘lon’ giving the original lat/lon grid whose overlaps with the polygons was calculated

  • 'geometry'

    just the polygons from gdf_in
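The rel_area column is each pixel's overlap area with the polygon, normalized by the polygon's total overlapped area (a toy sketch of the normalization, not the EASE-Grid area computation itself):

```python
import numpy as np

# e.g. a polygon exactly covering two equal-area pixels
overlap_areas = np.array([2.0, 2.0])            # raw overlap areas
rel_area = overlap_areas / overlap_areas.sum()  # -> [0.5, 0.5], sums to 1
```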

xagg.core.make_multipoly(pts)[source]

Split pixel overlapping the antimeridian into MultiPolygon with each sub-Polygon in its own hemisphere

NB: to be used in create_raster_polygons()

xagg.core.process_weights(ds, weights=None, target='ds', silent=False)[source]

Process weights - including regridding

If target == 'ds', regrid weights to ds. If target == 'weights', regrid ds to weights.

Parameters:
ds : xarray.Dataset, xarray.DataArray

an xarray.Dataset/xarray.DataArray to regrid

weights : xarray.DataArray, optional, default = None

an xarray.DataArray containing a weight (numeric) at each location

target : str, optional, default = 'ds'

whether weights should be regridded to the ds grid (by default) or vice-versa (not yet supported, raises NotImplementedError)

silent : bool, default = False

if True, then no status updates are printed to std out

Returns:
ds : xarray.Dataset, xarray.DataArray

the input xarray.Dataset/xarray.DataArray, with a new variable weights specifying weights for each pixel

weights_info : dict

a dictionary storing information about the weights regridding process, with the fields:

  • target: showing which of the two grids was retained

  • ds_grid: a dictionary with the grid {"lat":ds.lat,"lon":ds.lon}

  • weights_grid: a dictionary with the grid {"lat":weights.lat,"lon":weights.lon}

xagg.core.read_wm(path)[source]

Load temporary weightmap files from wm.to_file()

Builds a weightmap out of saved weightmap component files. Particularly useful if the weightmap took a long time to calculate (e.g., if the grid is high resolution).

Assumes the files are generated from wm.to_file(); i.e., the files are all in a directory name:

  • name/name.shp

    the geometry of the input polygons

  • name/name.agg

    the dataframe with the pixel overlap data

  • name/name_lat.nc, name/name_lon.nc

    the source grid of the raster data

  • name/name_weights.nc :

    the additional weights grid, if used (this file is optional; if no file with this name is found, no weights are assumed, and wm.weights='nowghts')

Parameters:
path : str

The directory in which the files are stored. They are assumed to follow the filename convention of sharing the name of the directory (i.e., the last part of this path.)

Returns:
wm : xagg.classes.weightmap

xagg.diag module

xagg.diag.diag_fig(wm, poly_id, pix_overlap_info, cmap='magma_r', max_title_depth=5, fig=None, ax=None)[source]

Create a diagnostic figure showing overlap between pixels and a given polygon

Parameters:
wm : xagg.classes.weightmap

the output of xagg.core.get_pixel_overlaps()

poly_id : int, list, or dict

which polygon to plot. If an int, then just the polygon in that indexed row of wm.agg is plotted. If a list of ints, then those polygons are plotted. If a dict, then all matches in the geodataframe are plotted (e.g., poly_id = {'adm1_code':'URY-8'}).

pix_overlap_info : xarray.Dataset, xarray.DataArray, or the output of xagg.core.create_raster_polygons()

If a Dataset or DataArray, the original dataset used to calculate the wm; it needs to be re-entered here because wm does not keep the original pixel polygons. Otherwise, pass in the output of xagg.core.create_raster_polygons(ds).

cmap : str, by default ‘magma_r’

colormap; must be the name of a matplotlib-recognized colormap

max_title_depth : int, by default 5

if only showing one polygon, then the plot title is ‘, ‘.join() called on the first max_title_depth columns of wm.agg that aren’t added by xagg

fig : mpl.figure.Figure or None, by default None

if not None, then this figure handle is used

ax : mpl.axis.Axis or None, by default None

if not None, then this axis handle is used

Returns:
fig, ax : the matplotlib figure and axis handles

xagg.export module

xagg.export.export_weightmap(wm_obj, fn, overwrite=False)[source]

Save a copy of the weightmap, to avoid recalculating it

xagg.export.output_data(agg_obj, output_format, output_fn, loc_dim='poly_idx', silent=False)[source]

Wrapper for prep_for_* functions

Parameters:
agg_obj : xagg.classes.aggregated

the object to be exported

output_format : str

‘netcdf’, ‘csv’, or ‘shp’

output_fn : str

the output filename

loc_dim : str, optional, default = 'poly_idx'

the name of the dimension with location indices; used by xagg.export.prep_for_nc() for nc, csv output

Returns:
the variable that gets saved, depending on the output_format:
  • “netcdf”: the xarray.Dataset on which .to_netcdf was called

  • “csv”: the pandas.DataFrame on which .to_csv was called (uses the xarray ds.to_dataframe() functionality)

  • “shp”: the geopandas.GeoDataFrame on which .to_file was called

xagg.export.prep_for_csv(agg_obj, add_geom=False)[source]

Preps aggregated data for output as a csv

Concretely, aggregated data is placed in a new pandas dataframe and expanded wide - each aggregated variable is placed in new columns, one column per coordinate in each dimension that isn’t the location (polygon). So, for example, a lat x lon x time variable “tas”, aggregated to location x time, would be reshaped wide to columns “tas0”, “tas1”, “tas2”, … for timestep 0, 1, etc.

Note: Currently no support for variables with more than one extra dimension beyond their location dimensions. Potential options: a multi-index column name, so [var]0-0, [var]0-1, etc…

Parameters:
agg_obj : xagg.classes.aggregated

the output from aggregate()

add_geom : bool, default = False

if True, adds the geometry from the original shapefile/ geodataframe back into the prepped output; this is for the .to_geodataframe() conversion option.

Returns:
df : pandas.DataFrame

a pandas dataframe containing all the fields from the original location polygons + columns containing the values of the aggregated variables at each location. This can then easily be exported as a csv directly (using df.to_csv) or to shapefiles by first turning into a geodataframe.
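The wide reshape described above can be sketched with plain numpy and a dict of columns (column names like "tas0" follow the documented convention; the real function builds a pandas dataframe):

```python
import numpy as np

# One aggregated variable with an extra time dimension: rows are
# polygons, columns are timesteps.
tas = np.array([[280., 281., 282.],   # polygon 0, timesteps 0..2
                [290., 291., 292.]])  # polygon 1

# Expand wide: one column per timestep, named tas0, tas1, ...
wide = {f"tas{t}": tas[:, t].tolist() for t in range(tas.shape[1])}
# wide["tas0"] -> [280.0, 290.0]
```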

xagg.export.prep_for_nc(agg_obj, loc_dim='poly_idx')[source]

Preps aggregated data for output as a netcdf

Concretely, aggregated data is placed in a new xarray dataset with dimensions of location (the different polygons in gdf_out) and any other dimension(s) in the original input raster data. All fields from the input polygons are kept as variables with dimension of location.

Parameters:
agg_obj : xagg.classes.aggregated
loc_dim : str, optional

the name of the location dimension; by default ‘poly_idx’. Values of that dimension are currently only an integer index (with further information given by the field variables). Future versions may allow, if loc_dim is set to the name of a field in the input polygons, replacing the dimension with the values of that field.

xagg.wrappers module

xagg.wrappers.pixel_overlaps(ds, gdf_in, weights=None, weights_target='ds', subset_bbox=True, impl='for_loop', silent=False)[source]

Wrapper function for determining overlaps between grid and polygon

For a geodataframe gdf_in, takes an xarray structure ds (Dataset or DataArray) and for each polygon in gdf_in provides a list of pixels given by the ds grid which overlap that polygon, in addition to their relative area of overlap with the polygon.

The output is then ready to be fed into xagg.core.aggregate(), which aggregates the variables in ds to the polygons in gdf_in using area- (and optionally other) weights.

(NB: the wrapper uses subset_bbox = True in xagg.core.create_raster_polygons())

Parameters:
ds : xarray.Dataset, xarray.DataArray

an xarray Dataset or DataArray containing at least grid variables (“lat”/“lon”, though several other names are supported; see docs for xagg.auxfuncs.fix_ds()) and at least one variable on that grid

gdf_in : geopandas.GeoDataFrame

a geopandas GeoDataFrame containing polygons (and any other fields, for example fields from shapefiles)

weights : xarray.DataArray or None, optional, default = None

if additional weights are desired (for example, weighting pixels by population in addition to by area overlap), weights is an xarray.DataArray containing that information. It does not have to be on the same grid as ds - the grids will be homogenized (see below).

weights_target : str, optional

if ‘ds’, then weights are regridded to the grid in ds; if ‘weights’, then the ds variables are regridded to the grid in weights (LATTER NOT SUPPORTED YET, raises a NotImplementedError)

impl : str, optional, default = 'for_loop'

whether to use the 'for_loop' or 'dot_product' method for aggregating; the former uses less memory, the latter may be faster in certain circumstances

silent : bool, default = False

if True, then no status updates are printed to std out

Returns:
wm_out : xagg.classes.weightmap

the output of xagg.core.get_pixel_overlaps(), which gives the mapping of pixels to polygons for aggregation; to be input into xagg.core.aggregate().

Module contents