toad.TOAD¶
- class toad.TOAD(data, time_dim='time', log_level='INFO', engine='netcdf4')¶
Bases: object
Main object for interacting with TOAD.
TOAD (Tipping and Other Abrupt events Detector) is a framework for detecting and clustering patterns in spatio-temporal data.
- Parameters:
data (Dataset | str) – The input data. Can be either an xarray Dataset or a path to a netCDF file.
time_dim (str) – The name of the time dimension. Defaults to ‘time’.
log_level (str) – The logging level. Choose from ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’. Defaults to ‘INFO’.
engine (str) – The engine to use to open the netCDF file. Defaults to ‘netcdf4’.
- Raises:
ValueError – If the input file path does not exist or if data dimensions are not 3D.
- __init__(data, time_dim='time', log_level='INFO', engine='netcdf4')¶
- Parameters:
data (Dataset | str)
time_dim (str)
log_level (str)
engine (str)
Methods
__init__(data[, time_dim, log_level, engine])
cluster_vars_for_var(var) – Get the cluster variables for a given variable.
compute_clusters([var, method, ...]) – Apply clustering to a dataset's temporal shifts using a sklearn-compatible clustering algorithm.
compute_shifts([var, method, ...]) – Apply an abrupt shift detection algorithm to a dataset along the specified temporal dimension.
drop_clusters() – Remove all cluster variables from the dataset.
drop_shifts() – Remove all shift variables from the dataset.
get_base_var([var]) – Get the base variable for a given variable.
get_cluster_counts(var[, exclude_noise]) – Returns sorted dictionary with number of cells in both space and time for each cluster.
get_cluster_ids([var, exclude_noise]) – Return list of cluster ids sorted by total number of cells in each cluster.
get_cluster_mask([var, cluster_id, ...]) – Returns a 3D boolean mask (time x space x space) indicating which points belong to the specified cluster(s).
get_cluster_mask_spatial([var, cluster_id]) – Returns a 2D boolean mask indicating which grid cells belonged to the specified cluster at any point in time.
get_cluster_times([var, cluster_ids, numeric]) – Extract all time values when/where the cluster is present.
get_cluster_timeseries(var[, cluster_id]) – Deprecated alias for get_timeseries().
get_clusters([var]) – Get cluster xr.DataArray for the specified variable.
get_shifts([var, label_suffix]) – Get shifts xr.DataArray for the specified variable.
get_timeseries([var, cluster_id, ...]) – Get time series for cluster, optionally aggregated across space.
numeric_time_values_unit() – Get the unit of the numeric time values.
remove_cluster(cluster_id[, var]) – Remove a cluster from the dataset.
save([suffix, path]) – Save the TOAD object to a netCDF file.
set_log_level(level) – Sets the logging level for the TOAD logger.
shift_vars_for_var(var) – Get the shift variables for a given variable.
sort_clusters([var, sort_by, order]) – Sort cluster IDs by a given criterion (largest/earliest becomes ID 0).
Attributes
aggregate – Access aggregation methods.
base_vars – Gets the list of base variables in the dataset.
cluster_vars – Get the list of cluster variables in the dataset.
numeric_time_values – Get numeric time values.
plot – Access plotting methods.
preprocess – Access preprocessing methods.
shift_vars – Gets the list of shift variables in the dataset.
stats – Access statistics about clusters and their properties, such as time, space, and general metrics.
- property aggregate: Aggregation¶
Access aggregation methods.
- property base_vars: list[str]¶
Gets the list of base variables in the dataset.
Base variables are those that have not been derived from shift detection or clustering. A variable is considered a base variable if either:
It has no ‘variable_type’ attribute, or
Its ‘variable_type’ is neither ‘shift’ nor ‘cluster’.
- Returns:
A list of strings containing the base variable names in the dataset.
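The classification rule described for base_vars can be sketched on plain dictionaries. This is a minimal illustration, not TOAD's implementation: the helper name `split_variables`, the sample variable names, and the ‘static’ type are hypothetical, and the attribute dicts stand in for the attrs of each variable in an xarray Dataset.

```python
def split_variables(attrs_by_var):
    """Group variable names by their 'variable_type' attribute.

    attrs_by_var maps a variable name to its attribute dict (an assumed
    stand-in for the per-variable attrs of an xarray Dataset).
    """
    groups = {"base": [], "shift": [], "cluster": []}
    for name, attrs in attrs_by_var.items():
        kind = attrs.get("variable_type")
        # Anything not explicitly marked as shift or cluster counts as base.
        groups[kind if kind in ("shift", "cluster") else "base"].append(name)
    return groups


example = {
    "temperature": {},  # no 'variable_type' attribute at all
    "temperature_dts": {"variable_type": "shift"},
    "temperature_dts_cluster": {"variable_type": "cluster"},
    "elevation": {"variable_type": "static"},  # hypothetical other type
}
groups = split_variables(example)
```

Under these assumptions, both `temperature` and `elevation` end up in the base group, because neither carries a ‘shift’ or ‘cluster’ type.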
- property cluster_vars: list[str]¶
Get the list of cluster variables in the dataset.
Cluster variables are those that have been derived from clustering. A variable is considered a cluster variable if its ‘variable_type’ attribute is ‘cluster’.
- Returns:
List of cluster variable names in the dataset
- Return type:
list[str]
- cluster_vars_for_var(var)¶
Get the cluster variables for a given variable.
- Parameters:
var (str) – The variable to get cluster variables for. Can be either a base variable (e.g. ‘temperature’) or a shift variable (e.g. ‘temperature_dts’). Cannot be a cluster variable.
- Returns:
List of cluster variables associated with the given variable.
For base variables: Returns cluster variables that have this as their base variable
For shift variables: Returns cluster variables that were derived from this shift variable
- Raises:
ValueError – If var is a cluster variable. This function can only get cluster variables for base or shift variables.
- compute_clusters(var=None, method=HDBSCAN(), shift_threshold=0.5, shift_direction='both', shift_selection='local', time_weight=1, regridder=None, disable_regridder=False, output_label_suffix='', output_label=None, overwrite=False, sort_by_size=True, optimize=False, optimize_params={'min_cluster_size': (10, 25), 'time_weight': (0.5, 1.5)}, optimize_objective='combined_spatial_nonlinearity', optimize_n_trials=50, optimize_direction='maximize', optimize_log_level=30, optimize_progress_bar=True)¶
Apply clustering to a dataset’s temporal shifts using a sklearn-compatible clustering algorithm.
- Parameters:
var (str | None) – Name of the shifts variable to cluster, or name of the base variable whose shifts should be clustered. If None, TOAD will attempt to infer which shifts to use. A ValueError is raised if the shifts variable cannot be uniquely determined.
method (ClusterMixin | type) – The clustering method to use. Choose a method from sklearn.cluster or create your own by inheriting from sklearn.base.ClusterMixin. Defaults to HDBSCAN().
shift_threshold (float) – The minimum magnitude a shift must reach to be included in clustering. Raising this threshold filters out less significant shifts and helps focus clustering on the most meaningful events, while reducing it will include more subtle (and potentially noisier) shifts. Default is 0.5, which effectively excludes most noise when using ASDETECT.
shift_direction (Literal['both', 'positive', 'negative'] | str) – The sign of the shift. Options are “both”, “positive”, “negative”. Defaults to “both”.
shift_selection (Literal['local', 'global', 'all'] | str) – How shift values are selected for clustering. All options respect shift_threshold and shift_direction. Defaults to “local”.
“local”: Finds peaks within individual shift episodes. Clusters only local maxima within each contiguous segment where abs(shift) > shift_threshold.
“global”: Finds the overall strongest shift per grid cell. Clusters only the single maximum shift value per grid cell where abs(shift) > shift_threshold.
“all”: Clusters all shift values that meet the threshold and direction criteria. Includes all data points above threshold, not just peaks.
time_weight (float) – Controls the relative influence of time in clustering. By default, time values are automatically scaled to match the standard deviation of the spatial coordinates. Increasing time_weight gives more emphasis to the temporal dimension, resulting in clusters that are tighter in time (shorter delays between abrupt events). Decreasing it emphasizes the spatial dimensions, allowing clusters to span a wider range of shift times. Defaults to 1.
regridder (BaseRegridder | None) – The regridding method to use from toad.clustering.regridding. Defaults to None. If None and coordinates are lat/lon, a HealPixRegridder will be created automatically.
disable_regridder (bool) – Whether to disable the regridder. Defaults to False.
output_label_suffix (str) – A suffix to add to the output label. Defaults to “”.
overwrite (bool) – Whether to overwrite existing variable. Defaults to False.
sort_by_size (bool) – Whether to reorder clusters by size. Defaults to True.
optimize (bool) – Whether to optimize the clustering parameters. Defaults to False.
optimize_params (dict) – Parameters for the optimization. Defaults to clustering.default_opt_params.
optimize_objective (Callable | Literal['median_heaviside', 'mean_heaviside', 'mean_consistency', 'mean_spatial_autocorrelation', 'mean_nonlinearity', 'combined_spatial_nonlinearity'] | str) – The objective function to optimize. Defaults to combined_spatial_nonlinearity. Can be one of:
callable: Custom objective function taking (td, output_label) as arguments
“median_heaviside”: Median heaviside score across clusters
“mean_heaviside”: Mean heaviside score across clusters
“mean_consistency”: Mean consistency score across clusters
“mean_spatial_autocorrelation”: Mean spatial autocorrelation score
“mean_nonlinearity”: Mean nonlinearity score across clusters
optimize_n_trials (int) – Number of trials to run for optimization. Defaults to 50.
optimize_direction (str) – The direction of the optimization. Defaults to “maximize”.
optimize_log_level (int) – The log level for the optimization. Defaults to optuna.logging.WARNING.
optimize_progress_bar (bool) – Whether to show the progress bar for the optimization. Defaults to True.
output_label (str | None)
- Returns:
None.
- Raises:
ValueError – If data is invalid or required parameters are missing
Notes
For global datasets, use toad.regridding.HealPixRegridder to ensure equal spacing between data points and prevent biased clustering at high latitudes.
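The three shift_selection modes described in the parameter list can be illustrated for a single grid cell. The sketch below is an interpretation of the documented behaviour, not TOAD's code: the function name `select_shifts` is hypothetical, and it assumes that “local” keeps one peak (largest absolute value) per contiguous above-threshold segment.

```python
def select_shifts(shifts, threshold=0.5, direction="both", selection="local"):
    """Return the time indices of one grid cell's shifts kept for clustering."""
    if selection not in ("local", "global", "all"):
        raise ValueError(f"unknown selection: {selection!r}")

    def passes(value):
        # Threshold and direction apply to every selection mode.
        if abs(value) <= threshold:
            return False
        if direction == "positive":
            return value > 0
        if direction == "negative":
            return value < 0
        return True

    kept = [i for i, v in enumerate(shifts) if passes(v)]
    if selection == "all" or not kept:
        return kept
    if selection == "global":
        # Single strongest shift for this grid cell.
        return [max(kept, key=lambda i: abs(shifts[i]))]

    # "local": split kept indices into contiguous segments, one peak each.
    segments, current = [], [kept[0]]
    for i in kept[1:]:
        if i == current[-1] + 1:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return [max(seg, key=lambda i: abs(shifts[i])) for seg in segments]


series = [0.1, 0.6, 0.9, 0.7, 0.2, -0.8, -0.6, 0.3]
local_idx = select_shifts(series, selection="local")
global_idx = select_shifts(series, selection="global")
all_idx = select_shifts(series, selection="all")
positive_idx = select_shifts(series, direction="positive", selection="local")
```

In this toy series there are two episodes above the threshold, so “local” keeps one index per episode, “global” keeps only the strongest of them, and “all” keeps every above-threshold index.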
- compute_shifts(var=None, method=<toad.shifts.methods.asdetect.ASDETECT object>, output_label_suffix='', overwrite=False, run_parallel=True, n_jobs=-1, show_progress=True)¶
Apply an abrupt shift detection algorithm to a dataset along the specified temporal dimension.
- Parameters:
var (str | None) – Name of the base variable to analyze for abrupt shifts. If None and only one base variable exists, that variable will be used automatically. If None and multiple base variables exist, raises a ValueError. Defaults to None.
method (ShiftsMethod) – The abrupt shift detection algorithm to use. Choose from predefined method objects in toad.shifts (e.g., ASDETECT), or create your own by subclassing ShiftsMethod from toad.shifts. Defaults to ASDETECT().
output_label_suffix (str) – A suffix to add to the output label. Defaults to “”.
overwrite (bool) – Whether to overwrite existing variable. Defaults to False.
run_parallel (bool) – Whether to run the shift detection in parallel. Defaults to True.
n_jobs (int) – Number of jobs to run in parallel. Defaults to -1 (use all available cores).
show_progress (bool) – Whether to show a progress bar during parallel processing. Defaults to True.
- Raises:
ValueError – If data is invalid or required parameters are missing
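The two compute steps can be chained into a short workflow. The sketch below only uses calls and parameters listed on this page; the wrapper name `run_toad_pipeline`, the variable name “temperature”, the HDBSCAN setting, and the “_clustered” suffix are illustrative assumptions, and the import path `toad.shifts.ASDETECT` is taken from the method description of compute_shifts.

```python
def run_toad_pipeline(path, var="temperature"):
    """Detect shifts, cluster them, and save the result (illustrative only)."""
    # Imports are local so the sketch can be read without TOAD installed.
    from sklearn.cluster import HDBSCAN
    from toad import TOAD
    from toad.shifts import ASDETECT

    td = TOAD(path, time_dim="time", log_level="INFO")

    # Step 1: abrupt shift detection along the time dimension.
    td.compute_shifts(var, method=ASDETECT())

    # Step 2: cluster the detected shifts in space and time.
    td.compute_clusters(
        var,
        method=HDBSCAN(min_cluster_size=10),  # assumed setting, tune per dataset
        shift_threshold=0.5,
        time_weight=1,
    )

    cluster_ids = td.get_cluster_ids(var)
    # save() rejects self.path without a suffix, so one is passed here.
    td.save(suffix="_clustered")
    return td, cluster_ids
```

A call such as `run_toad_pipeline("my_data.nc")` would then return the TOAD object together with the cluster IDs, assuming the file exists and contains a 3D variable with that name.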
- data: Dataset¶
- drop_clusters()¶
Remove all cluster variables from the dataset.
This method drops all variables identified as cluster variables from the underlying data object.
- drop_shifts()¶
Remove all shift variables from the dataset.
This method drops all variables identified as shift variables from the underlying data object.
- get_base_var(var=None)¶
Get the base variable for a given variable.
- Parameters:
var (str | None) – Base variable name, cluster variable name, or shift variable name. If None, returns the single base variable when only one exists.
- Return type:
str | None
- get_cluster_counts(var, exclude_noise=True)¶
Returns sorted dictionary with number of cells in both space and time for each cluster.
- Parameters:
var (str) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
exclude_noise (bool) – Whether to exclude noise points (cluster ID -1). Defaults to True.
- Returns:
Dictionary mapping cluster IDs to their total cell counts, sorted by count in descending order.
- Return type:
dict
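As a rough illustration of the returned structure, the counting can be sketched with the standard library. This is not TOAD's implementation; the helper name `cluster_counts` is hypothetical, and the flat list of labels (with None for unlabeled cells) stands in for the flattened cluster DataArray.

```python
from collections import Counter


def cluster_counts(labels, exclude_noise=True):
    """Count cells per cluster ID, largest cluster first.

    labels is a flat iterable of cluster IDs, one per (time, y, x) cell;
    None marks cells without a label and -1 marks noise.
    """
    counts = Counter(label for label in labels if label is not None)
    if exclude_noise:
        counts.pop(-1, None)
    # most_common() sorts by count in descending order.
    return dict(counts.most_common())


labels = [0, 0, 1, -1, 1, 1, None, 2, -1, 1]
counts = cluster_counts(labels)
counts_with_noise = cluster_counts(labels, exclude_noise=False)
```

Because Python dictionaries preserve insertion order, iterating over `counts` visits the largest cluster first.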
- get_cluster_ids(var=None, exclude_noise=True)¶
Return list of cluster ids sorted by total number of cells in each cluster.
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
exclude_noise (bool) – Whether to exclude noise points (cluster ID -1). Defaults to True.
- Returns:
List of cluster ids.
- Return type:
ndarray
- get_cluster_mask(var=None, cluster_id=None, numeric_times=False)¶
Returns a 3D boolean mask (time x space x space) indicating which points belong to the specified cluster(s).
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
cluster_id (int | List[int] | range | None) – Cluster id(s) to apply the mask for.
numeric_times (bool) – If True, returns mask with numeric time coordinates instead of original time format. Defaults to False.
- Returns:
Mask for the cluster label.
- Return type:
DataArray
- get_cluster_mask_spatial(var=None, cluster_id=None)¶
Returns a 2D boolean mask indicating which grid cells belonged to the specified cluster at any point in time.
I.e. a grid cell is True if it belonged to the specified cluster at any point in time during the entire timeseries.
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name. If None, infers the variable.
cluster_id (int | list[int] | range | None) – Cluster id(s) to apply the mask for. If None, uses all clusters.
- Returns:
Mask for the cluster id.
- Return type:
DataArray
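The “at any point in time” reduction can be sketched on nested lists. This is a minimal illustration under the assumption that labels are indexed as (time, y, x); the helper name `spatial_mask` is hypothetical and the real method returns an xarray DataArray.

```python
def spatial_mask(labels, cluster_id):
    """Collapse a (time, y, x) label cube into a 2D membership mask.

    labels is a nested list indexed as labels[t][y][x]; a cell is True if it
    carried cluster_id at any time step.
    """
    n_y, n_x = len(labels[0]), len(labels[0][0])
    return [
        [any(step[y][x] == cluster_id for step in labels) for x in range(n_x)]
        for y in range(n_y)
    ]


cube = [
    [[-1, 0], [-1, -1]],  # t = 0
    [[-1, 0], [1, -1]],   # t = 1
    [[0, -1], [1, -1]],   # t = 2
]
mask_0 = spatial_mask(cube, cluster_id=0)
mask_1 = spatial_mask(cube, cluster_id=1)
```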
- get_cluster_times(var=None, cluster_ids=None, numeric=True)¶
Extract all time values when/where the cluster is present.
- Parameters:
var (str | None) – Base variable name or custom cluster variable name.
cluster_ids (int | list[int] | range | None) – Single cluster ID, list of IDs, range, or None for all clusters.
numeric (bool) – If True (default), return numeric time values (e.g. seconds). If False, return native time coordinate (e.g. datetime64 or cftime).
- Returns:
Flattened array of time values for every (time, y, x) cell in the cluster.
- Return type:
ndarray
- get_cluster_timeseries(var, cluster_id=None, **kwargs)¶
Deprecated alias for get_timeseries().
- Parameters:
var (str)
cluster_id (int | List[int] | None)
- Return type:
DataArray
- get_clusters(var=None)¶
Get cluster xr.DataArray for the specified variable.
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’, will look for ‘temperature_cluster’) or custom cluster variable name.
- Returns:
The clusters xr.DataArray for the specified variable.
- Raises:
ValueError – Failed to find valid cluster xr.DataArray for the given var. An xr.DataArray is only considered a cluster label if it contains _cluster in its name.
- Return type:
DataArray
- get_shifts(var=None, label_suffix='')¶
Get shifts xr.DataArray for the specified variable.
- Parameters:
var (str | None) – Base variable name (e.g. ‘temperature’), cluster variable name, or None to infer when only one base variable exists.
label_suffix (str) – If you added a suffix to the shifts variable, help the function find it. Defaults to “”.
- Returns:
The shifts xr.DataArray for the specified variable.
- Raises:
ValueError – Failed to find valid shifts xr.DataArray for the given var.
- Return type:
DataArray
- get_timeseries(var=None, cluster_id=None, cluster_var=None, aggregation='raw', percentile=None, normalize=None, keep_full_timeseries=True)¶
Get time series for cluster, optionally aggregated across space.
If cluster_id is None, returns all data from the dataset in timeseries format.
- Parameters:
var (str | None) – Variable name to extract time series from, or None to infer when only one base variable exists. Can be a base variable (e.g., ‘thk’) or a cluster variable (e.g., ‘thk_dts_cluster’). If a cluster variable is passed, the base variable is auto-inferred.
cluster_var (str | None) – Variable name to extract cluster ids from. Defaults to None, in which case it is inferred from var.
cluster_id (int | List[int] | None) – Single cluster ID, list of cluster IDs, or None to return all data.
aggregation (Literal['raw', 'mean', 'sum', 'std', 'median', 'percentile', 'max', 'min'] | str) – How to aggregate spatial data:
“mean”: Average across space
“median”: Median across space
“sum”: Sum across space
“std”: Standard deviation across space
“percentile”: Percentile across space (requires percentile arg)
“max”: Maximum across space
“min”: Minimum across space
“raw”: Return data for each grid cell separately
percentile (float | None) – Percentile value between 0 and 1, used when aggregation is “percentile”.
normalize (Literal['max', 'max_each'] | None | str) – How to normalize the data:
“max”: Normalize by the maximum value
“max_each”: Normalize each trajectory by its own maximum value
None: Do not normalize
keep_full_timeseries (bool) – If True, returns full time series of cluster cells. If False, values outside cluster bounds will be nan. Ignored when cluster_id is None.
- Returns:
The time series data for the specified cluster(s), or all data if cluster_id is None.
- Return type:
DataArray
Note
If var is a cluster variable (e.g., ‘thk_dts_cluster’), the base variable is automatically inferred from its attributes and used for data extraction, while the cluster variable is used for masking. This ensures you get actual data values rather than cluster labels.
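The difference between the aggregation and normalize options can be shown on a toy example. The sketch below is an interpretation, not TOAD's code: the helper name `aggregate_cells` is hypothetical, normalization is assumed to happen before aggregation, “std” is assumed to be the population standard deviation, and the “percentile” option is left out.

```python
import statistics


def aggregate_cells(cells, aggregation="mean", normalize=None):
    """Aggregate per-cell time series across space.

    cells is a list of equally long time series, one per grid cell.
    """
    if normalize == "max":
        # One common scale: the largest value over all cells and times.
        peak = max(max(series) for series in cells)
        cells = [[v / peak for v in series] for series in cells]
    elif normalize == "max_each":
        # Each trajectory is scaled by its own maximum.
        cells = [[v / max(series) for v in series] for series in cells]

    if aggregation == "raw":
        return cells
    reducers = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "sum": sum,
        "std": statistics.pstdev,
        "max": max,
        "min": min,
    }
    reducer = reducers[aggregation]
    # zip(*cells) yields, for each time step, the values of all cells.
    return [reducer(values) for values in zip(*cells)]


cells = [[1.0, 2.0, 4.0], [2.0, 2.0, 2.0]]
mean_ts = aggregate_cells(cells, "mean")
max_ts = aggregate_cells(cells, "max")
norm_global = aggregate_cells(cells, "raw", normalize="max")
norm_each = aggregate_cells(cells, "raw", normalize="max_each")
```

With “max” the flat second trajectory stays at half the scale of the first, while “max_each” brings every trajectory to a peak of 1, which makes shapes comparable across cells with different magnitudes.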
- property numeric_time_values¶
Get numeric time values. Defined as a property since the values might change if the user changes the time resolution.
- Returns:
Array of numeric time values in seconds relative to first time point
- Return type:
numpy.ndarray
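“Seconds relative to first time point” can be illustrated with standard-library datetimes. This is only a sketch of the conversion idea; the helper name `to_numeric_seconds` is hypothetical, and TOAD itself works on the dataset's time coordinate (for example datetime64 or cftime values).

```python
from datetime import datetime


def to_numeric_seconds(times):
    """Convert datetimes to seconds elapsed since the first time point."""
    start = times[0]
    return [(t - start).total_seconds() for t in times]


times = [datetime(2000, 1, 1), datetime(2000, 1, 2), datetime(2000, 2, 1)]
seconds = to_numeric_seconds(times)
```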
- numeric_time_values_unit()¶
Get the unit of the numeric time values.
- Return type:
str
- path: str | None = None¶
- property plot: Plotter¶
Access plotting methods.
Examples
>>> td.plot.overview()
>>> td.plot.map()
>>> td.plot.timeseries(cluster_ids=range(6))
- property preprocess: Preprocess¶
Access preprocessing methods.
- remove_cluster(cluster_id, var=None)¶
Remove a cluster from the dataset.
- Parameters:
cluster_id (int) – The cluster ID to remove.
var (str | None) – The variable to remove the cluster from. If None, the cluster variable will be inferred automatically.
- save(suffix=None, path=None)¶
Save the TOAD object to a netCDF file.
- Parameters:
suffix (str | None) – Optional string to append to filename before extension
path (str | None) – Optional path to save file to. If not provided, uses self.path
- Raises:
ValueError – If neither path nor self.path is set
ValueError – If using self.path without a suffix (to prevent overwriting)
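The two error conditions and the suffix placement can be sketched as a small path-resolution helper. This is a hedged reading of the rules listed for save(), not TOAD's implementation: the function name `resolve_save_path` is hypothetical, and applying the suffix to an explicitly given path as well is an assumption.

```python
from pathlib import Path


def resolve_save_path(self_path=None, suffix=None, path=None):
    """Pick the output file name following the documented rules."""
    if path is None:
        if self_path is None:
            raise ValueError("Provide a path, or create the object from a file.")
        if not suffix:
            # Reusing the input path unchanged would overwrite the source file.
            raise ValueError("A suffix is required when reusing the input path.")
        path = self_path
    target = Path(path)
    if suffix:
        # Append the suffix to the file name, before the extension.
        target = target.with_name(f"{target.stem}{suffix}{target.suffix}")
    return target


saved_as = resolve_save_path(self_path="data/run.nc", suffix="_v2")
explicit = resolve_save_path(path="out.nc")
```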
- set_log_level(level)¶
Sets the logging level for the TOAD logger.
Sets the logging level and configures handlers for the TOAD logger instance. Available levels are ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’.
Examples
Used like this:
>>> logger.debug("This is a debug message.")
>>> logger.info("This is an info message.")
>>> logger.warning("This is a warning message.")
>>> logger.error("This is an error message.")
>>> logger.critical("This is a critical message.")
In sub-modules get logger like this:
>>> logger = logging.getLogger("TOAD")
- Parameters:
level (str) – The logging level to set
- Raises:
ValueError – If level is not one of the valid logging levels
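The validation step can be sketched with the standard logging module. This is a minimal illustration rather than TOAD's implementation: the helper name `set_level` is hypothetical and the handler configuration mentioned above is omitted.

```python
import logging

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def set_level(level, logger_name="TOAD"):
    """Validate a level name and apply it to the named logger."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    logger = logging.getLogger(logger_name)
    # logging exposes each level name as a module-level integer constant.
    logger.setLevel(getattr(logging, level))
    return logger


logger = set_level("WARNING")
```

Since `logging.getLogger("TOAD")` always returns the same logger object, sub-modules that fetch the logger by name pick up the new level without further wiring.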
- property shift_vars: list[str]¶
Gets the list of shift variables in the dataset.
Shift variables are those that have been derived from shift detection. A variable is considered a shift variable if its ‘variable_type’ attribute is ‘shift’.
- Returns:
A list of strings containing the shift variable names in the dataset.
- shift_vars_for_var(var)¶
Get the shift variables for a given variable.
- Parameters:
var (str) – The variable to get shift variables for. Can be either a base variable (e.g. ‘temperature’) or a cluster variable (e.g. ‘temperature_cluster’). Cannot be a shift variable.
- Returns:
List of shift variables associated with the given variable.
For base variables: Returns all shift variables that have this as their base variable
For cluster variables: Returns the shift variable used to create this cluster
- Raises:
ValueError – If var is a shift variable, or if no shift variables are found.
- sort_clusters(var=None, *, sort_by='size', order=None)¶
Sort cluster IDs by a given criterion (largest/earliest becomes ID 0).
Keeps NaN values unchanged and preserves noise label -1 if present. Useful after filtering/removing clusters to restore contiguous cluster IDs.
- Parameters:
var (str | None) – Base variable or cluster variable. If None, inferred automatically.
sort_by (Literal['size', 'footprint_cumulative_area', 'median_shift_magnitude', 'median_shift_time', 'start_shift_time']) – Criterion for sorting when order is None. Options:
“size” or “footprint_cumulative_area”: by cluster cell count (largest first). Equivalent.
“median_shift_magnitude”: by median magnitude change (largest first).
“median_shift_time”: by median time of shifts (earliest first).
“start_shift_time”: by start time of cluster (earliest first).
order (list[int] | None) – Manual order: list of current cluster IDs in the order they should become 0, 1, 2, … When provided, sort_by is ignored. Must be a permutation of existing cluster IDs (each ID exactly once).
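The relabeling implied by order can be sketched on a flat list of labels. This is an illustration of the documented contract (permutation check, NaN and -1 untouched), not TOAD's code; the helper names `is_cluster_id` and `relabel` are hypothetical.

```python
import math
from collections import Counter


def is_cluster_id(value):
    """True for real cluster IDs, False for NaN and the noise label -1."""
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != -1


def relabel(labels, order):
    """Renumber cluster IDs so that order[0] becomes 0, order[1] becomes 1, ...

    NaN entries and the noise label -1 are passed through unchanged.
    """
    existing = {v for v in labels if is_cluster_id(v)}
    if sorted(order) != sorted(existing):
        raise ValueError("order must list every existing cluster ID exactly once")
    new_id = {old: new for new, old in enumerate(order)}
    return [new_id.get(v, v) for v in labels]


labels = [5, 5, 2, -1, float("nan"), 9, 2, 5]

# Manual order: cluster 2 becomes 0, cluster 9 becomes 1, cluster 5 becomes 2.
manual = relabel(labels, order=[2, 9, 5])

# Size-based order (largest first), comparable to sort_by="size".
sizes = Counter(v for v in labels if is_cluster_id(v))
by_size = relabel(labels, order=[cid for cid, _ in sizes.most_common()])
```

After either call the cluster IDs are contiguous again (0, 1, 2), which is the situation described for restoring IDs after removing clusters.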
- property space_dims¶
- property stats: _StatsAccessor¶
Access statistics about clusters and their properties, such as time, space, and general metrics.
Use as a property when you have a single base variable (var is inferred):
>>> td.stats.time.start(cluster_id=0)
Call with a variable name when you have multiple base variables or need to specify:
>>> td.stats("temperature").time.start(cluster_id=0)
>>> td.stats(var="temperature").space.mean(cluster_id=0)
- Returns:
callable for explicit var, or use .time/.space/.general for inferred var.
- Return type:
StatsAccessor