toad.TOAD¶
- class toad.TOAD(data, time_dim='time', log_level='INFO', engine='netcdf4')¶
Bases:
object
Main object for interacting with TOAD.
TOAD (Tipping and Other Abrupt events Detector) is a framework for detecting and clustering patterns in spatio-temporal data.
- Parameters:
data (Dataset | str) – The input data. Can be either an xarray Dataset or a path to a netCDF file.
time_dim (str) – The name of the time dimension. Defaults to ‘time’.
log_level (str) – The logging level. Choose from ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’. Defaults to ‘INFO’.
engine (str) – The engine to use to open the netCDF file. Defaults to ‘netcdf4’.
- Raises:
ValueError – If the input file path does not exist or if data dimensions are not 3D.
- __init__(data, time_dim='time', log_level='INFO', engine='netcdf4')¶
- Parameters:
data (Dataset | str)
time_dim (str)
log_level (str)
engine (str)
Methods
__init__(data[, time_dim, log_level, engine])
apply_cluster_mask(var, apply_to_var, cluster_id) – Apply the cluster mask to a variable.
apply_cluster_mask_spatial(var, ...) – Apply the spatial cluster mask to a variable.
apply_cluster_mask_temporal(var, ...) – Apply the temporal cluster mask to a variable.
cluster_vars_for_var(var) – Get the cluster variables for a given variable.
compute_clusters([var, method, ...]) – Apply clustering to a dataset's temporal shifts using a sklearn-compatible clustering algorithm.
compute_shifts([var, method, ...]) – Apply an abrupt shift detection algorithm to a dataset along the specified temporal dimension.
get_active_clusters_count_per_timestep(var) – Get number of active clusters for each timestep.
get_base_var(var) – Get the base variable for a given variable.
get_cluster_counts(var[, exclude_noise]) – Returns sorted dictionary with number of cells in both space and time for each cluster.
get_cluster_data(var, cluster_id) – Get raw data for specified cluster(s) with mask applied.
get_cluster_density_spatial(var[, cluster_id]) – Calculate the spatial density of a cluster across all grid cells.
get_cluster_density_temporal(var, cluster_id) – Calculate the temporal density of a cluster at each grid cell.
get_cluster_ids(var[, exclude_noise]) – Return list of cluster ids sorted by total number of cells in each cluster.
get_cluster_mask(var, cluster_id[, ...]) – Returns a 3D boolean mask (time x space x space) indicating which points belong to the specified cluster(s).
get_cluster_mask_permanent(var, cluster_id) – Create a mask for cells that always have the same cluster label (such as completely unclustered cells by passing -1).
get_cluster_mask_permanent_noise(var) – Create the spatial mask for cells that are always unclustered (i.e. -1).
get_cluster_mask_spatial(var, cluster_id) – Returns a 2D boolean mask indicating which grid cells belonged to the specified cluster at any point in time.
get_cluster_mask_temporal(var, cluster_id) – Calculate a temporal footprint indicating cluster presence at each timestep.
get_cluster_timeseries(var[, cluster_id, ...]) – Get time series for cluster, optionally aggregated across space.
get_clusters(var) – Get cluster xr.DataArray for the specified variable.
get_shifts(var[, label_suffix]) – Get shifts xr.DataArray for the specified variable.
numeric_time_values_unit() – Get the unit of the numeric time values.
preprocess() – Access preprocessing methods.
save([suffix, path]) – Save the TOAD object to a netCDF file.
set_log_level(level) – Sets the logging level for the TOAD logger.
shift_vars_for_var(var) – Get the shift variables for a given variable.
stats([var]) – Access statistics about clusters and their properties, such as time, space, and general metrics.
Attributes
aggregate – Access aggregation methods.
base_vars – Gets the list of base variables in the dataset.
cluster_vars – Get the list of cluster variables in the dataset.
numeric_time_values – Get numeric time values.
plot – Access plotting methods.
shift_vars – Gets the list of shift variables in the dataset.
- property aggregate: Aggregation¶
Access aggregation methods.
- apply_cluster_mask(var, apply_to_var, cluster_id, numeric_times=False)¶
Apply the cluster mask to a variable.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
apply_to_var (str) – The variable to apply the mask to.
cluster_id (int) – The cluster id to apply the mask for.
numeric_times (bool) – If True, returns result with numeric time coordinates instead of original time format. Defaults to False.
- Returns:
The masked variable.
- Return type:
DataArray
- apply_cluster_mask_spatial(var, apply_to_var, cluster_id)¶
Apply the spatial cluster mask to a variable.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
apply_to_var (str) – The variable to apply the mask to.
cluster_id (int) – The cluster id to apply the mask for.
- Returns:
All data (regardless of cluster) masked by the spatial extent of the specified cluster.
- Return type:
DataArray
- apply_cluster_mask_temporal(var, apply_to_var, cluster_id)¶
Apply the temporal cluster mask to a variable.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
apply_to_var (str) – The variable to apply the mask to.
cluster_id (int) – The cluster id to apply the mask for.
- Returns:
All data (regardless of cluster) masked by the temporal extent of the specified cluster.
- Return type:
DataArray
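The three masking variants differ only in which points they keep. The sketch below illustrates this on plain nested lists instead of xarray objects. The `labels[t][cell]` layout, the helper names, and the use of NaN for masked-out values are assumptions made for illustration, not TOAD's implementation.

```python
import math

NAN = math.nan

# Hypothetical cluster labels, indexed as labels[t][cell]; -1 marks noise.
labels = [
    [-1, 0, -1],
    [0, 0, -1],
    [-1, -1, 1],
]
# Hypothetical values of the variable the mask is applied to (same layout).
values = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

def mask_full(values, labels, cluster_id):
    """Keep a value only where that exact (time, cell) point carries the label."""
    return [
        [v if lab == cluster_id else NAN for v, lab in zip(vrow, lrow)]
        for vrow, lrow in zip(values, labels)
    ]

def mask_by_spatial_extent(values, labels, cluster_id):
    """Keep every timestep of each cell that carried the label at any time."""
    n_cells = len(labels[0])
    in_extent = [any(row[c] == cluster_id for row in labels) for c in range(n_cells)]
    return [[v if in_extent[c] else NAN for c, v in enumerate(vrow)] for vrow in values]

def mask_by_temporal_extent(values, labels, cluster_id):
    """Keep every cell at each timestep where the label occurs somewhere."""
    return [
        [v if cluster_id in lrow else NAN for v in vrow]
        for vrow, lrow in zip(values, labels)
    ]

full = mask_full(values, labels, 0)
spatial = mask_by_spatial_extent(values, labels, 0)
temporal = mask_by_temporal_extent(values, labels, 0)
```

Here `full` keeps only the three points labelled 0, `spatial` keeps the whole time series of the first two cells, and `temporal` keeps all cells for the first two timesteps.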
- property base_vars: list[str]¶
Gets the list of base variables in the dataset.
Base variables are those that have not been derived from shift detection or clustering. A variable is considered a base variable if either:
It has no ‘variable_type’ attribute, or
Its ‘variable_type’ is neither ‘shift’ nor ‘cluster’
- Returns:
A list of strings containing the base variable names in the dataset.
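The classification rule above can be sketched with plain dictionaries standing in for variable attributes. The variable names and the `"static"` type are hypothetical; only the `variable_type` attribute and its `shift` and `cluster` values come from this page.

```python
# Hypothetical variable attributes, keyed by variable name.
attrs_by_var = {
    "temperature": {},
    "temperature_dts": {"variable_type": "shift"},
    "temperature_cluster": {"variable_type": "cluster"},
    "elevation": {"variable_type": "static"},
}

def is_base_var(attrs):
    """True if 'variable_type' is absent or is neither 'shift' nor 'cluster'."""
    return attrs.get("variable_type") not in ("shift", "cluster")

base_vars = [name for name, attrs in attrs_by_var.items() if is_base_var(attrs)]
shift_vars = [n for n, a in attrs_by_var.items() if a.get("variable_type") == "shift"]
cluster_vars = [n for n, a in attrs_by_var.items() if a.get("variable_type") == "cluster"]
```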
- property cluster_vars: list[str]¶
Get the list of cluster variables in the dataset.
Cluster variables are those that have been derived from clustering. A variable is considered a cluster variable if it has a ‘variable_type=”cluster”’ attribute.
- Returns:
List of cluster variable names in the dataset
- Return type:
list[str]
- cluster_vars_for_var(var)¶
Get the cluster variables for a given variable.
- Parameters:
var (str) – The variable to get cluster variables for. Can be either:
- A base variable (e.g. ‘temperature’)
- A shift variable (e.g. ‘temperature_dts’)
Cannot be a cluster variable.
- Returns:
List of cluster variables associated with the given variable.
For base variables: Returns cluster variables that have this as their base variable
For shift variables: Returns cluster variables that were derived from this shift variable
- Raises:
ValueError – If var is a cluster variable. This function can only get cluster variables for base or shift variables.
- compute_clusters(var=None, method=HDBSCAN(), shift_threshold=0.5, shift_direction='both', shift_selection='local', scaler=None, time_weight=1, regridder=None, disable_regridder=False, output_label_suffix='', output_label=None, overwrite=False, sort_by_size=True, optimize=False, optimize_params={'min_cluster_size': (10, 25), 'time_weight': (0.5, 1.5)}, optimize_objective='combined_spatial_nonlinearity', optimize_n_trials=50, optimize_direction='maximize', optimize_log_level=30, optimize_progress_bar=True)¶
Apply clustering to a dataset’s temporal shifts using a sklearn-compatible clustering algorithm.
- Parameters:
var (str | None) – Name of the shifts variable to cluster, or name of the base variable whose shifts should be clustered. If None, TOAD will attempt to infer which shifts to use. A ValueError is raised if the shifts variable cannot be uniquely determined.
method (ClusterMixin | type) – The clustering method to use. Choose methods from sklearn.cluster or create your own by inheriting from sklearn.base.ClusterMixin. Defaults to HDBSCAN().
shift_threshold (float) – The minimum magnitude a shift must reach to be included in clustering. Raising this threshold filters out less significant shifts and helps focus clustering on the most meaningful events, while reducing it will include more subtle (and potentially noisier) shifts. Default is 0.5, which effectively excludes most noise when using ASDETECT.
shift_direction (Literal['both', 'positive', 'negative'] | str) – The sign of the shift. Options are “both”, “positive”, “negative”. Defaults to “both”.
shift_selection (Literal['local', 'global', 'all'] | str) – How shift values are selected for clustering. All options respect shift_threshold and shift_direction:
- “local”: Finds peaks within individual shift episodes. Cluster only local maxima within each contiguous segment where abs(shift) > shift_threshold.
- “global”: Finds the overall strongest shift per grid cell. Cluster only the single maximum shift value per grid cell where abs(shift) > shift_threshold.
- “all”: Cluster all shift values that meet the threshold and direction criteria. Includes all data points above threshold, not just peaks.
Defaults to “local”.
scaler (StandardScaler | MinMaxScaler | RobustScaler | MaxAbsScaler | None) – The scaling method to apply to the data before clustering. StandardScaler(), MinMaxScaler(), RobustScaler() and MaxAbsScaler() from sklearn.preprocessing are supported. Defaults to None. This option will be removed in the future. Set scaler=None to use recommended temporal scaling only.
time_weight (float) – Controls the relative influence of time in clustering. By default, time values are automatically scaled to match the standard deviation of the spatial coordinates. Increasing time_weight gives more emphasis to the temporal dimension, resulting in clusters that are tighter in time (shorter delays between abrupt events). Decreasing it emphasizes the spatial dimensions, allowing clusters to span a wider range of shift times. Defaults to 1.
regridder (BaseRegridder | None) – The regridding method to use from toad.clustering.regridding. Defaults to None. If None and coordinates are lat/lon, a HealPixRegridder will be created automatically.
disable_regridder (bool) – Whether to disable the regridder. Defaults to False.
output_label_suffix (str) – A suffix to add to the output label. Defaults to “”.
overwrite (bool) – Whether to overwrite existing variable. Defaults to False.
sort_by_size (bool) – Whether to reorder clusters by size. Defaults to True.
optimize (bool) – Whether to optimize the clustering parameters. Defaults to False.
optimize_params (dict) – Parameters for the optimization. Defaults to clustering.default_opt_params.
optimize_objective (Callable | Literal['median_heaviside', 'mean_heaviside', 'mean_consistency', 'mean_spatial_autocorrelation', 'mean_nonlinearity', 'combined_spatial_nonlinearity'] | str) – The objective function to optimize. Defaults to combined_spatial_nonlinearity. Can be one of:
- callable: Custom objective function taking (td, output_label) as arguments
- “median_heaviside”: Median heaviside score across clusters
- “mean_heaviside”: Mean heaviside score across clusters
- “mean_consistency”: Mean consistency score across clusters
- “mean_spatial_autocorrelation”: Mean spatial autocorrelation score
- “mean_nonlinearity”: Mean nonlinearity score across clusters
optimize_n_trials (int) – Number of trials to run for optimization. Defaults to 50.
optimize_direction (str) – The direction of the optimization. Defaults to “maximize”.
optimize_log_level (int) – The log level for the optimization. Defaults to optuna.logging.WARNING.
optimize_progress_bar (bool) – Whether to show the progress bar for the optimization. Defaults to True.
output_label (str | None)
- Returns:
None.
- Raises:
ValueError – If data is invalid or required parameters are missing
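The three shift_selection options can be illustrated for a single grid cell with a short pure-Python sketch. The function name is hypothetical, and reading “local” as one peak (largest absolute shift) per contiguous above-threshold run is an assumption about the description above, not a statement of TOAD's actual implementation.

```python
def select_shifts(shifts, threshold=0.5, direction="both", selection="local"):
    """Return the indices of shift values selected for one grid cell."""
    def passes(value):
        if abs(value) <= threshold:
            return False
        if direction == "positive":
            return value > 0
        if direction == "negative":
            return value < 0
        return True  # "both"

    candidates = [i for i, v in enumerate(shifts) if passes(v)]
    if selection == "all":
        return candidates
    if selection == "global":
        # Single strongest shift for the whole cell, if any value passes.
        return [max(candidates, key=lambda i: abs(shifts[i]))] if candidates else []
    if selection == "local":
        # One peak per contiguous run of candidate indices.
        peaks, run = [], []
        for i in candidates:
            if run and i != run[-1] + 1:
                peaks.append(max(run, key=lambda j: abs(shifts[j])))
                run = []
            run.append(i)
        if run:
            peaks.append(max(run, key=lambda j: abs(shifts[j])))
        return peaks
    raise ValueError(f"unknown selection: {selection!r}")

# Hypothetical shift values along time for one grid cell.
shifts = [0.1, 0.6, 0.9, 0.7, 0.2, -0.8, -0.6, 0.3]
selected_all = select_shifts(shifts, selection="all")
selected_global = select_shifts(shifts, selection="global")
selected_local = select_shifts(shifts, selection="local")
```

In this example there are two shift episodes (indices 1 to 3 and 5 to 6), so “local” keeps one index per episode while “global” keeps only the strongest one overall.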
Notes
For global datasets, use toad.regridding.HealPixRegridder to ensure equal spacing between data points and prevent biased clustering at high latitudes.
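To make the effect of time_weight more concrete, the following sketch rescales a time axis so that its standard deviation matches that of the spatial coordinates and then applies the weight. Pooling all spatial coordinate values into one list and using the population standard deviation are assumptions for illustration; the exact scaling used by TOAD may differ.

```python
from statistics import pstdev

def scale_time(times, spatial_coords, time_weight=1.0):
    """Rescale times to the spread of the spatial coordinates, then weight them."""
    time_std = pstdev(times)
    if time_std == 0:
        raise ValueError("time values must not all be identical")
    factor = pstdev(spatial_coords) / time_std
    return [t * factor * time_weight for t in times]

# Hypothetical coordinates of the points passed to the clustering algorithm.
times = [0.0, 10.0, 20.0, 30.0, 40.0]
spatial = [0.0, 1.0, 2.0, 3.0, 4.0]

balanced = scale_time(times, spatial)                        # same spread as space
time_heavy = scale_time(times, spatial, time_weight=2.0)     # time distances count double
```

With a larger weight, points that are far apart in time end up farther apart in the clustering space, which is consistent with the tighter temporal clusters described for time_weight.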
- compute_shifts(var=None, method=<toad.shifts.methods.asdetect.ASDETECT object>, output_label_suffix='', overwrite=False)¶
Apply an abrupt shift detection algorithm to a dataset along the specified temporal dimension.
- Parameters:
var (str | None) – Name of the base variable to analyze for abrupt shifts. If None and only one base variable exists, that variable will be used automatically. If None and multiple base variables exist, raises a ValueError. Defaults to None.
method (ShiftsMethod) – The abrupt shift detection algorithm to use. Choose from predefined method objects in toad.shifts (e.g., ASDETECT), or create your own by subclassing ShiftsMethod from toad.shifts. Defaults to ASDETECT().
output_label_suffix (str) – A suffix to add to the output label. Defaults to “”.
overwrite (bool) – Whether to overwrite existing variable. Defaults to False.
- Raises:
ValueError – If data is invalid or required parameters are missing
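A typical call sequence, using only the method signatures documented on this page, might look like the sketch below. The helper name `detect_and_cluster` is hypothetical, and `td` is assumed to be a toad.TOAD instance (or any object exposing the same three methods).

```python
def detect_and_cluster(td, var, shift_threshold=0.5, overwrite=False):
    """Run shift detection and clustering on `td`, then return the cluster ids."""
    # Detect abrupt shifts in the base variable (default method: ASDETECT).
    td.compute_shifts(var, overwrite=overwrite)
    # Cluster the detected shifts (default method: HDBSCAN).
    td.compute_clusters(var, shift_threshold=shift_threshold, overwrite=overwrite)
    # Cluster ids, sorted by total number of cells, noise excluded by default.
    return td.get_cluster_ids(var)
```

With the package installed, this could be called as `detect_and_cluster(toad.TOAD("data.nc"), "temperature")`, where the file name and variable name are placeholders.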
- data: Dataset¶
- get_active_clusters_count_per_timestep(var)¶
Get number of active clusters for each timestep.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
- Returns:
Number of active clusters for each timestep.
- Return type:
DataArray
- get_base_var(var)¶
Get the base variable for a given variable.
- Parameters:
var (str)
- Return type:
str | None
- get_cluster_counts(var, exclude_noise=True)¶
Returns sorted dictionary with number of cells in both space and time for each cluster.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
exclude_noise (bool) – Whether to exclude noise points (cluster ID -1). Defaults to True.
- Returns:
Dictionary mapping cluster IDs to their total cell counts, sorted by count in descending order.
- Return type:
dict
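The counting described above amounts to tallying labels over all space-time cells. A minimal sketch on nested lists follows; the `labels[t][cell]` layout and the function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical cluster labels, indexed as labels[t][cell]; -1 marks noise.
labels = [
    [-1, 0, 0],
    [1, 0, -1],
    [1, 0, -1],
]

def cluster_counts(labels, exclude_noise=True):
    """Map each cluster id to its number of space-time cells, largest first."""
    counts = Counter(lab for row in labels for lab in row)
    if exclude_noise:
        counts.pop(-1, None)
    return dict(counts.most_common())

counts = cluster_counts(labels)
ids_by_size = list(counts)  # analogous to the ordering used by get_cluster_ids
```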
- get_cluster_data(var, cluster_id)¶
Get raw data for specified cluster(s) with mask applied.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int | List[int]) – Single cluster ID or list of cluster IDs.
- Returns:
Full dataset masked by the cluster id.
- Return type:
Dataset
Note
If cluster_id == -1, returns the unclustered mask.
If cluster_id is a list, returns the union of the masks for each cluster id.
- get_cluster_density_spatial(var, cluster_id=None)¶
Calculate the spatial density of a cluster across all grid cells.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int | None) – The cluster id to calculate density for. If None, calculates density for all clusters combined (excluding noise points, cluster ID -1).
- Returns:
1D timeseries containing the fraction (0-1) of grid cells that belonged to the specified cluster (or all clusters if cluster_id is None) at each timestep.
- Return type:
DataArray
- get_cluster_density_temporal(var, cluster_id)¶
Calculate the temporal density of a cluster at each grid cell.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int) – The cluster id to calculate density for.
- Returns:
2D spatial array where each grid cell contains a fraction (0-1) representing the proportion of timesteps that cell belonged to the specified cluster.
- Return type:
DataArray
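The two density measures reduce the cluster labels along different axes. The sketch below shows this on nested lists. Space is flattened to a single cell axis here, so the “2D spatial array” becomes a 1D list; that simplification and the function names are assumptions for illustration.

```python
# Hypothetical labels[t][cell]: 3 timesteps, 4 grid cells, -1 marks noise.
labels = [
    [-1, 0, 0, -1],
    [0, 0, 1, -1],
    [-1, 1, 1, -1],
]

def density_spatial(labels, cluster_id=None):
    """Fraction of cells in the cluster (or in any cluster) at each timestep."""
    def member(lab):
        return lab != -1 if cluster_id is None else lab == cluster_id
    return [sum(member(lab) for lab in row) / len(row) for row in labels]

def density_temporal(labels, cluster_id):
    """Fraction of timesteps that each cell spent in the cluster."""
    n_steps = len(labels)
    n_cells = len(labels[0])
    return [sum(row[c] == cluster_id for row in labels) / n_steps for c in range(n_cells)]

per_timestep = density_spatial(labels, 0)
per_timestep_any = density_spatial(labels)
per_cell = density_temporal(labels, 0)
```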
- get_cluster_ids(var, exclude_noise=True)¶
Return list of cluster ids sorted by total number of cells in each cluster.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
exclude_noise (bool) – Whether to exclude noise points (cluster ID -1). Defaults to True.
- Returns:
List of cluster ids.
- Return type:
ndarray
- get_cluster_mask(var, cluster_id, numeric_times=False)¶
Returns a 3D boolean mask (time x space x space) indicating which points belong to the specified cluster(s).
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int | List[int]) – Cluster id(s) to apply the mask for.
numeric_times (bool) – If True, returns mask with numeric time coordinates instead of original time format. Defaults to False.
- Returns:
Mask for the cluster label.
- Return type:
DataArray
- get_cluster_mask_permanent(var, cluster_id)¶
Create a mask for cells that always have the same cluster label (such as completely unclustered cells by passing -1).
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int) – The cluster id.
- Returns:
Boolean mask where True indicates cells that always belonged to the specified cluster.
- Return type:
DataArray
- get_cluster_mask_permanent_noise(var)¶
Create the spatial mask for cells that are always unclustered (i.e. -1).
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
- Returns:
Boolean mask where True indicates cells that were never clustered (always had value -1).
- Return type:
DataArray
- get_cluster_mask_spatial(var, cluster_id)¶
Returns a 2D boolean mask indicating which grid cells belonged to the specified cluster at any point in time.
I.e. a grid cell is True if it belonged to the specified cluster at any point in time during the entire timeseries.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int | List[int]) – Cluster id to apply the mask for.
- Returns:
Mask for the cluster id.
- Return type:
DataArray
- get_cluster_mask_temporal(var, cluster_id)¶
Calculate a temporal footprint indicating cluster presence at each timestep.
For each timestep, returns a boolean mask indicating whether any grid cell belonged to the specified cluster. This is useful for determining when a cluster was active, regardless of its spatial extent.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int) – The cluster ID to calculate the temporal footprint for.
- Returns:
Boolean array with True indicating timesteps where the cluster existed somewhere in the spatial domain.
- Return type:
DataArray
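The spatial, temporal, and permanent masks can each be read as an `any` or `all` reduction of the full cluster mask. A minimal sketch on nested lists, with space flattened to one cell axis and hypothetical function names:

```python
# Hypothetical labels[t][cell]: 3 timesteps, 4 grid cells, -1 marks noise.
labels = [
    [-1, 0, 0, -1],
    [0, 0, 1, -1],
    [-1, 1, 1, -1],
]

def footprint_spatial(labels, cluster_id):
    """Per cell: did it belong to the cluster at any timestep?"""
    return [any(row[c] == cluster_id for row in labels) for c in range(len(labels[0]))]

def footprint_temporal(labels, cluster_id):
    """Per timestep: did any cell belong to the cluster?"""
    return [cluster_id in row for row in labels]

def always_in_cluster(labels, cluster_id):
    """Per cell: did it carry this label at every timestep?"""
    return [all(row[c] == cluster_id for row in labels) for c in range(len(labels[0]))]

spatial_mask = footprint_spatial(labels, 0)
temporal_mask = footprint_temporal(labels, 0)
never_clustered = always_in_cluster(labels, -1)  # analogous to the permanent noise mask
```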
- get_cluster_timeseries(var, cluster_id=None, cluster_var=None, aggregation='raw', percentile=None, normalize=None, keep_full_timeseries=True)¶
Get time series for cluster, optionally aggregated across space.
If cluster_id is None, returns all data from the dataset in timeseries format.
- Parameters:
var (str) – Variable name to extract time series from.
cluster_var (str | None) – Variable name to extract cluster ids from. Defaults to None, in which case it is inferred from var.
cluster_id (int | List[int] | None) – Single cluster ID, list of cluster IDs, or None to return all data.
aggregation (Literal['raw', 'mean', 'sum', 'std', 'median', 'percentile', 'max', 'min'] | str) – How to aggregate spatial data:
- “mean”: Average across space
- “median”: Median across space
- “sum”: Sum across space
- “std”: Standard deviation across space
- “percentile”: Percentile across space (requires percentile arg)
- “max”: Maximum across space
- “min”: Minimum across space
- “raw”: Return data for each grid cell separately
percentile (float | None) – Percentile value between 0-1 when using percentile aggregation.
normalize (Literal['max', 'max_each'] | None | str) – How to normalize the data:
- “max”: Normalize by the maximum value
- “max_each”: Normalize each trajectory by its own maximum value
- None: Do not normalize
keep_full_timeseries (bool) – If True, returns full time series of cluster cells. If False, values outside cluster bounds will be nan. Ignored when cluster_id is None.
- Returns:
The time series data for the specified cluster(s), or all data if cluster_id is None.
- Return type:
DataArray
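The difference between the aggregation options and between “max” and “max_each” normalization can be sketched on a few hypothetical trajectories. The helper names are illustrative, only a subset of the aggregations is shown, and whether TOAD normalizes before or after aggregating is not stated here, so the two steps are kept separate.

```python
from statistics import mean, median

# Hypothetical trajectories: one list of values over time per grid cell.
series = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [3.0, 3.0, 3.0],
]

def aggregate(series, how="mean"):
    """Collapse the cell axis, returning one value per timestep."""
    funcs = {"mean": mean, "median": median, "sum": sum, "max": max, "min": min}
    return [funcs[how](column) for column in zip(*series)]

def normalize(series, how=None):
    """Scale trajectories by the global maximum or by each trajectory's own maximum."""
    if how is None:
        return [list(s) for s in series]
    if how == "max":
        top = max(max(s) for s in series)
        return [[v / top for v in s] for s in series]
    if how == "max_each":
        return [[v / max(s) for v in s] for s in series]
    raise ValueError(f"unknown normalization: {how!r}")

mean_series = aggregate(series, "mean")
by_global_max = normalize(series, "max")
by_own_max = normalize(series, "max_each")
```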
- get_clusters(var)¶
Get cluster xr.DataArray for the specified variable.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
- Returns:
The clusters xr.DataArray for the specified variable.
- Raises:
ValueError – Failed to find valid cluster xr.DataArray for the given var. An xr.DataArray is only considered a cluster label if it contains _cluster in its name.
- Return type:
DataArray
- get_shifts(var, label_suffix='')¶
Get shifts xr.DataArray for the specified variable.
- Parameters:
var – Base variable name (e.g. ‘temperature’, will look for ‘temperature_dts’) or custom shifts variable name.
label_suffix (str) – If you added a suffix to the shifts variable, help the function find it. Defaults to “”.
- Returns:
The shifts xr.DataArray for the specified variable.
- Raises:
ValueError – Failed to find valid shifts xr.DataArray for the given var.
- Return type:
DataArray
- property numeric_time_values¶
Get numeric time values. Defined as property since this might change if user changes the time resolution.
- Returns:
Array of numeric time values in seconds relative to first time point
- Return type:
numpy.ndarray
- numeric_time_values_unit()¶
Get the unit of the numeric time values.
- Return type:
str
- path: str | None = None¶
- property plot: Plotter¶
Access plotting methods.
Examples
>>> td.plot.overview()
>>> td.plot.map()
>>> td.plot.timeseries(cluster_ids=range(6))
- preprocess()¶
Access preprocessing methods.
- Return type:
Preprocess
- save(suffix=None, path=None)¶
Save the TOAD object to a netCDF file.
- Parameters:
suffix (str | None) – Optional string to append to filename before extension
path (str | None) – Optional path to save file to. If not provided, uses self.path
- Raises:
ValueError – If neither path nor self.path is set
ValueError – If using self.path without a suffix (to prevent overwriting)
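The path rules above can be sketched as a small helper. The function name is hypothetical, and two details are assumptions rather than documented behavior: the suffix is concatenated to the file stem without a separator, and it is also applied when an explicit path is given.

```python
from pathlib import Path

def resolve_save_path(stored_path=None, suffix=None, path=None):
    """Work out the output file, refusing to silently overwrite the stored path."""
    if path is None:
        if stored_path is None:
            raise ValueError("neither path nor a stored path is set")
        if not suffix:
            raise ValueError("a suffix is required when reusing the stored path")
        target = Path(stored_path)
    else:
        target = Path(path)
    if suffix:
        # Insert the suffix between the file stem and its extension.
        target = target.with_name(f"{target.stem}{suffix}{target.suffix}")
    return target

target = resolve_save_path("run/data.nc", suffix="_clustered")
```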
- set_log_level(level)¶
Sets the logging level for the TOAD logger.
Sets the logging level and configures handlers for the TOAD logger instance. Available levels are ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’.
Examples
- Used like this:
>>> logger.debug("This is a debug message.")
>>> logger.info("This is an info message.")
>>> logger.warning("This is a warning message.")
>>> logger.error("This is an error message.")
>>> logger.critical("This is a critical message.")
- In sub-modules get logger like this:
>>> logger = logging.getLogger("TOAD")
- Parameters:
level (str) – The logging level to set
- Raises:
ValueError – If level is not one of the valid logging levels
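The validation and level change described above can be approximated with the standard logging module alone. This sketch uses a hypothetical helper name and omits the handler configuration mentioned in the description; it is not TOAD's implementation.

```python
import logging

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")

def set_level(level, logger_name="TOAD"):
    """Validate the level name and apply it to the named logger."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    logger = logging.getLogger(logger_name)
    logger.setLevel(getattr(logging, level))
    return logger

logger = set_level("WARNING")
shows_errors = logger.isEnabledFor(logging.ERROR)  # ERROR is at or above WARNING
shows_info = logger.isEnabledFor(logging.INFO)     # INFO is below WARNING
```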
- property shift_vars: list[str]¶
Gets the list of shift variables in the dataset.
Shift variables are those that have been derived from shift detection. A variable is considered a shift variable if its ‘variable_type’ attribute is ‘shift’.
- Returns:
A list of strings containing the shift variable names in the dataset.
- shift_vars_for_var(var)¶
Get the shift variables for a given variable.
- Parameters:
var (str) – The variable to get shift variables for. Can be either:
- A base variable (e.g. ‘temperature’)
- A cluster variable (e.g. ‘temperature_cluster’)
Cannot be a shift variable.
- Returns:
List of shift variables associated with the given variable.
For base variables: Returns all shift variables that have this as their base variable
For cluster variables: Returns the shift variable used to create this cluster
- Raises:
ValueError – If var is a shift variable, or if no shift variables are found.
- property space_dims¶
- stats(var=None)¶
Access statistics about clusters and their properties, such as time, space, and general metrics.
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
- Returns:
Stats object for analyzing cluster statistics.
- Return type:
Examples
>>> td.stats(var="temperature").time.start(cluster_id=0)
>>> td.stats(var="temperature").space.mean(cluster_id=0)
>>> td.stats(var="temperature").general.score_heaviside(cluster_id=0)