scmidas.utils

class scmidas.utils.BaseSink[source]#

Bases: object

A sink receives minibatch outputs. It may keep them in memory or write to disk.

finalize() → Any[source]#: Return final outputs (e.g., nested dict for MemorySink, or manifest for DiskSink).

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#: For small non-tensor metadata (optional).

class scmidas.utils.DiskSink(cfg: DiskSinkConfig)[source]#

Bases: BaseSink

Stream-to-disk sink that follows the behavior of the older code: each minibatch is saved to disk as soon as it is written. finalize() returns a manifest describing where each output was written.

finalize() → Dict[str, Any][source]#: Return final outputs (e.g., nested dict for MemorySink, or manifest for DiskSink).

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#: For small non-tensor metadata (optional).

class scmidas.utils.DiskSinkConfig(save_dir: str, save_format: str = 'npy', fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}')[source]#

Bases: object

fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}'#

save_dir: str#

save_format: str = 'npy'#
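
The default fname_pattern is an ordinary Python format string. The sketch below expands it with str.format to show the kind of relative path it describes. The field values used here ("batch_0", "z", "joint", index 3) are illustrative assumptions, and this reference does not show how DiskSink itself fills the fields.

```python
# Default pattern copied from the DiskSinkConfig signature.
fname_pattern = "{batch}/{var}/{key}/{i:06d}.{ext}"

# Hypothetical field values, chosen only for illustration.
rel_path = fname_pattern.format(batch="batch_0", var="z", key="joint", i=3, ext="npy")
print(rel_path)
```

The {i:06d} field zero-pads the index to six digits, so files written in sequence also sort in sequence by name.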

class scmidas.utils.MemorySink[source]#

Bases: BaseSink

finalize() → Dict[str, Any][source]#: Return final outputs (e.g., nested dict for MemorySink, or manifest for DiskSink).

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#: For small non-tensor metadata (optional).
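
To make the sink interface concrete, here is a minimal pure-Python sketch of an in-memory sink that follows the same three method names. ToyMemorySink is a hypothetical stand-in, not the scmidas implementation: it stores plain Python values instead of tensors, and appending per-minibatch values to a list at the end of the path is an assumption.

```python
from typing import Any, Dict, List


class ToyMemorySink:
    """Hypothetical stand-in that mimics the BaseSink method names."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def _parent(self, batch_name: str, path: List[str]) -> Dict[str, Any]:
        # Walk (and create) nested dicts down to the dict holding the last key.
        node = self._store.setdefault(batch_name, {})
        for key in path[:-1]:
            node = node.setdefault(key, {})
        return node

    def write(self, batch_name: str, path: List[str], value: Any) -> None:
        # Minibatch outputs accumulate in a list under the final key.
        self._parent(batch_name, path).setdefault(path[-1], []).append(value)

    def write_meta(self, batch_name: str, path: List[str], value: Any) -> None:
        # Small metadata is stored once rather than accumulated.
        self._parent(batch_name, path)[path[-1]] = value

    def finalize(self) -> Dict[str, Any]:
        return self._store


sink = ToyMemorySink()
sink.write("batch_0", ["z", "joint"], [0.1, 0.2])  # first minibatch
sink.write("batch_0", ["z", "joint"], [0.3, 0.4])  # second minibatch
sink.write_meta("batch_0", ["mask", "rna"], [1, 1, 0])
out = sink.finalize()
```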

class scmidas.utils.OnlineMeanByGroup(dim: int)[source]#

Bases: object

Compute global mean and per-group means WITHOUT storing all samples.

finalize_centroid() → Tensor[source]#: Choose the group mean closest to global mean (L2), return that group’s mean.

update(x: Tensor, g: Tensor)[source]#: Accumulate one minibatch. x has shape (N, D); g has shape (N,) and holds integer-like group labels.
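
The idea of computing means without storing samples can be sketched with running sums and counts. The class below is a hypothetical pure-Python illustration that uses lists in place of tensors; it is not the scmidas implementation, and details such as tie-breaking are assumptions.

```python
from typing import Dict, List


class ToyOnlineMeanByGroup:
    """Hypothetical sketch: keeps running sums and counts, never the samples."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.total_sum = [0.0] * dim
        self.total_n = 0
        self.group_sum: Dict[int, List[float]] = {}
        self.group_n: Dict[int, int] = {}

    def update(self, x: List[List[float]], g: List[int]) -> None:
        # x has N rows of length dim; g has one group label per row.
        for row, label in zip(x, g):
            sums = self.group_sum.setdefault(label, [0.0] * self.dim)
            for j, v in enumerate(row):
                sums[j] += v
                self.total_sum[j] += v
            self.group_n[label] = self.group_n.get(label, 0) + 1
            self.total_n += 1

    def finalize_centroid(self) -> List[float]:
        global_mean = [s / self.total_n for s in self.total_sum]
        means = {k: [s / self.group_n[k] for s in v] for k, v in self.group_sum.items()}

        def sq_dist(k: int) -> float:
            # Squared L2 distance from a group mean to the global mean.
            return sum((a - b) ** 2 for a, b in zip(means[k], global_mean))

        return means[min(means, key=sq_dist)]


acc = ToyOnlineMeanByGroup(dim=2)
acc.update([[0.0, 0.0], [2.0, 2.0]], [0, 1])
acc.update([[10.0, 10.0]], [2])
centroid = acc.finalize_centroid()
```

Here the global mean is (4, 4), and the mean of group 1, (2, 2), is the closest of the three group means, so that is the value returned.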

scmidas.utils.convert_tensor_to_list(data: Tensor | List[List[Any]]) → List[List[Any]][source]#

Convert a 2D tensor or list into a 2D list.

Parameters:

data – Union[torch.Tensor, List[List[Any]]] Input data to be converted.

Returns:

Converted 2D list.

Return type:

List[List[Any]]

scmidas.utils.convert_tensors_to_cuda(x: Dict[str, Any], device: device) → Dict[str, Any][source]#

Recursively move all tensors in a dictionary to the specified device (for example, CUDA).

Parameters:

x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.
device – torch.device Device to move the tensors to (e.g., CUDA or CPU).

Returns:

A new dictionary with all tensors moved to the specified device.

Return type:

Dict[str, Any]

scmidas.utils.detach_tensors(x: Dict[str, Any]) → Dict[str, Any][source]#

Recursively detach all tensors in a dictionary.

Parameters:

x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.

Returns:

A new dictionary with all tensors detached.

Return type:

Dict[str, Any]
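
convert_tensors_to_cuda and detach_tensors are both described as recursive walks over a dictionary that return a new dictionary. The traversal pattern can be sketched as follows, with plain floats standing in for tensors. map_leaves is a hypothetical helper, not part of scmidas; with PyTorch, the per-leaf function would typically be something like tensor.to(device) or tensor.detach().

```python
from typing import Any, Callable, Dict


def map_leaves(x: Dict[str, Any], fn: Callable[[Any], Any]) -> Dict[str, Any]:
    """Return a new dict with fn applied to every non-dict value."""
    out: Dict[str, Any] = {}
    for key, value in x.items():
        if isinstance(value, dict):
            out[key] = map_leaves(value, fn)  # recurse into nested dictionaries
        else:
            out[key] = fn(value)
    return out


nested = {"z": {"joint": 1.5}, "x": {"rna": 2.0, "adt": 3.0}}
doubled = map_leaves(nested, lambda v: v * 2)
```

Building a fresh dictionary at every level leaves the input structure untouched, which matches the "new dictionary" wording in the descriptions above.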

scmidas.utils.ensure_dir(p: str)[source]#

scmidas.utils.exp(x: Tensor, eps: float = 1e-12) → Tensor[source]#

Compute a numerically stable exponential transformation.

Handles negative and positive values to avoid numerical instability.

Parameters:

x – torch.Tensor Input tensor.
eps – float, optional A small epsilon value to avoid division by zero, by default 1e-12.

Returns:

Transformed tensor with the exponential applied.

Return type:

torch.Tensor

scmidas.utils.extract_params(config: dict, prefix: str) → dict[source]#

Extract parameters from a configuration dictionary with a specific prefix.

Removes the specified prefix from the keys in the resulting dictionary.

Parameters:

config – dict Configuration dictionary containing various parameters.
prefix – str Prefix to filter and remove from the keys.

Returns:

A new dictionary containing the filtered parameters with the prefix removed.

Return type:

dict
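
A minimal sketch of the behavior described for extract_params, assuming that "with a specific prefix" means the key starts with the prefix. The function name and the example keys are hypothetical.

```python
def extract_params_sketch(config: dict, prefix: str) -> dict:
    # Keep keys that start with the prefix and drop the prefix from each kept key.
    return {k[len(prefix):]: v for k, v in config.items() if k.startswith(prefix)}


config = {"optim_lr": 1e-4, "optim_beta": 0.9, "dim_c": 32}
params = extract_params_sketch(config, "optim_")
```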

scmidas.utils.extract_values(x: List[Any] | Tuple[Any] | Dict[Any, Any] | Any) → List[Any][source]#

Recursively extract all values from a tuple, list, or dictionary.

Parameters:

x – list, tuple, dict, or any type The input structure containing nested values.

Returns:

A flattened list of all values extracted from the input.

Return type:

List[Any]
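
The recursive flattening can be sketched as below. extract_values_sketch is a hypothetical re-implementation for illustration; the traversal order (dictionary insertion order, depth first) is an assumption.

```python
from typing import Any, List


def extract_values_sketch(x: Any) -> List[Any]:
    if isinstance(x, dict):
        items = list(x.values())
    elif isinstance(x, (list, tuple)):
        items = list(x)
    else:
        return [x]  # a leaf value
    flat: List[Any] = []
    for item in items:
        flat.extend(extract_values_sketch(item))
    return flat


values = extract_values_sketch({"a": [1, (2, 3)], "b": {"c": 4}})
```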

scmidas.utils.filter_keys(d: Dict[str, Any], substring: str) → Dict[str, Any][source]#

Filter a dictionary to include only keys that contain a specific substring.

Parameters:

d – Dict[str, Any] The input dictionary to filter.
substring – str The substring to look for in the keys.

Returns:

A new dictionary containing only the keys from the original dictionary that include the specified substring.

Return type:

Dict[str, Any]
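
The filtering described above amounts to a substring test on each key. A minimal sketch, with a hypothetical function name and example keys:

```python
from typing import Any, Dict


def filter_keys_sketch(d: Dict[str, Any], substring: str) -> Dict[str, Any]:
    # Keep only the entries whose key contains the substring.
    return {k: v for k, v in d.items() if substring in k}


stats = {"rna_mean": 1, "rna_std": 2, "adt_mean": 3}
means = filter_keys_sketch(stats, "mean")
```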

scmidas.utils.generate_all_combinations(mods: List[str]) → List[Tuple[Tuple[str, ...], List[str]]][source]#

Generate all possible input-output combinations for a given list of modalities.

For N modalities, generate all combinations of size r (1 <= r < N) as input, and the remaining modalities as output.

Parameters:

mods – List[str] List of modality names.

Returns:

A list of tuples, where each tuple contains:

A tuple of input modalities.
A list of output modalities.

Return type:

List[Tuple[Tuple[str, …], List[str]]]
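
The enumeration described above can be sketched with itertools.combinations. The function name is hypothetical, and the ordering of the results (by input size, then by position in mods) is an assumption about the real function.

```python
from itertools import combinations
from typing import List, Tuple


def all_combinations_sketch(mods: List[str]) -> List[Tuple[Tuple[str, ...], List[str]]]:
    result = []
    for r in range(1, len(mods)):  # input sizes 1 .. N-1
        for inputs in combinations(mods, r):
            outputs = [m for m in mods if m not in inputs]  # the remaining modalities
            result.append((inputs, outputs))
    return result


pairs = all_combinations_sketch(["rna", "adt", "atac"])
```

For three modalities this yields six input-output pairs: three with a single input modality and three with two.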

scmidas.utils.get_filenames(directory: str, extension: str) → List[str][source]#

Get sorted filenames with the given extension in the specified directory.

Parameters:

directory – str The directory to search for files.
extension – str The file extension to filter by.

Returns:

Sorted list of filenames with the specified extension.

Return type:

List[str]
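
A sketch of the listing step using only the standard library. It assumes that the extension is matched as a filename suffix and that bare names (not full paths) are returned; neither detail is confirmed by this reference.

```python
import os
import tempfile
from typing import List


def get_filenames_sketch(directory: str, extension: str) -> List[str]:
    # Keep entries whose name ends with the extension, then sort them.
    return sorted(f for f in os.listdir(directory) if f.endswith(extension))


with tempfile.TemporaryDirectory() as tmp:
    for name in ["2.csv", "0.csv", "1.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = get_filenames_sketch(tmp, "csv")
```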

scmidas.utils.get_name_fmt(file_num: int) → str[source]#

Generate a format string for filenames based on the total number of files.

Parameters:

file_num – int Total number of files to be named.

Returns:

Format string for filenames, e.g., '%03d' for three-digit naming.

Return type:

str
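
One way such a format string could be derived is sketched below. The rule used here (pad to the number of decimal digits in file_num) is an assumption; this reference does not state how get_name_fmt chooses the width.

```python
def name_fmt_sketch(file_num: int) -> str:
    # Assumption: pad to the number of decimal digits in file_num.
    width = len(str(file_num))
    return "%0" + str(width) + "d"


fmt = name_fmt_sketch(250)
name = fmt % 7  # apply the format to a file index
```

Zero-padded names keep lexicographic and numeric order in agreement, which matters when filenames are later read back in sorted order.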

scmidas.utils.get_pred_dirs(pred_dir: str, combs: List[List[str]], joint_latent: bool, mod_latent: bool, impute: bool, batch_correct: bool, translate: bool, input: bool) → Dict[int, Dict[str, Dict[str, str]]][source]#

Generate directory paths for predictions based on configurations.

Parameters:

pred_dir – str Base directory for predictions.
combs – list of list of str Combinations of modalities for each batch.
joint_latent – bool Include joint latent variables.
mod_latent – bool Include modality-specific latent variables.
impute – bool Include imputed data.
batch_correct – bool Include batch-corrected data.
translate – bool Include translated data.
input – bool Include input data.

Returns:

Dictionary of directories for each batch and variable.

Return type:

Dict[int, Dict[str, Dict[str, str]]]

scmidas.utils.get_s_joint_mods(combs: List[List[str]]) → Tuple[List[Dict[str, int]], List[str]][source]#

Generate s_joint and mods from a list of modality combinations.

Parameters:

combs – List[List[str]] A list where each element is a list of strings representing combinations of modalities for a specific batch.

Returns:

s_joint: A list of dictionaries, where each dictionary maps the modalities to their corresponding indices for each batch.
mods: A list of all unique modalities across the dataset.

Return type:

Tuple[List[Dict[str, int]], List[str]]

scmidas.utils.load_csv(filename: str) → list[source]#

Load a CSV file and return its contents as a list of rows.

Parameters:

filename – str Path to the CSV file.

Returns:

A list of rows, where each row is a list of strings.

Return type:

list

scmidas.utils.load_mtx(filename: str) → list[source]#

Load a Matrix Market (MTX) file and convert it to a csr_matrix.

Parameters:

filename – str Path to the MTX file.

scmidas.utils.load_predicted(save_dir: str, *, save_format: str = 'npy', dim_c: int | None = None, batch_names: List[str] | None = None, var_names: List[str] | None = None, split_z: bool = True, return_manifest: bool = False) → Dict[str, Any][source]#

Load predictions saved by the streaming DiskSink.

The function reads prediction results saved to disk during predict(…, save_dir=…) and reconstructs them into a prediction dictionary similar to the in-memory output format.

Parameters:

save_dir – str Root directory where predictions were saved by predict(…, save_dir=…).
save_format – {"npy", "csv"} File format used for saved prediction arrays.
dim_c – int, optional Dimension of the content latent space (z_c). Required if split_z=True and latent variable z exists.
batch_names – List[str], optional If provided, only the specified batches will be loaded (in the given order).
joint_latent – bool, default=True Whether to include joint latent representations. If False, z_*["joint"] will be removed from the output.
split_z – bool, default=True If True, split latent variable z into z_c (content latent representation) and z_u (technical latent representation) using dim_c. If False, keep the raw z arrays.
return_manifest – bool, default=False Whether to include a manifest containing the file paths used to reconstruct the predictions.

Returns:

pred_b – Prediction dictionary organized by batch. The structure matches the in-memory prediction output, for example:

pred_b[batch]["z_c"][key]
pred_b[batch]["z_u"][key]
pred_b[batch]["x_impt"][modality]
pred_b[batch]["x_bc"][modality]
pred_b[batch]["x_trans"][translation_key]

If metadata was saved, modality masks will be stored as:

pred_b[batch]["mask"][modality]

Return type:

Dict[str, Any]
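
The split_z option divides each latent vector into z_c and z_u using dim_c. A minimal sketch of such a split, using nested lists in place of arrays. It assumes that the first dim_c columns form z_c and the remaining columns form z_u; the column order is not stated in this reference, and split_z_sketch is a hypothetical name.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_z_sketch(z: Matrix, dim_c: int) -> Tuple[Matrix, Matrix]:
    # Assumption: the first dim_c columns are z_c, the remaining columns are z_u.
    z_c = [row[:dim_c] for row in z]
    z_u = [row[dim_c:] for row in z]
    return z_c, z_u


z = [[0.1, 0.2, 0.3, 9.0], [0.4, 0.5, 0.6, 8.0]]
z_c, z_u = split_z_sketch(z, dim_c=3)
```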

scmidas.utils.log(x: Tensor, eps: float = 1e-12) → Tensor[source]#

Compute a numerically stable logarithm transformation.

Ensures numerical stability by adding a small epsilon.

Parameters:

x – torch.Tensor Input tensor.
eps – float, optional A small epsilon value to avoid log(0), by default 1e-12.

Returns:

Transformed tensor with the logarithm applied.

Return type:

torch.Tensor
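
The effect of the epsilon can be seen with a scalar stand-in built on the standard math module. log_sketch is a hypothetical name; the description above only says that a small epsilon is added, so placing it inside the logarithm is an assumption.

```python
import math


def log_sketch(x: float, eps: float = 1e-12) -> float:
    # Adding eps keeps the argument strictly positive when x == 0.
    return math.log(x + eps)


at_zero = log_sketch(0.0)  # finite, whereas math.log(0.0) raises ValueError
at_one = log_sketch(1.0)   # very close to 0
```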

scmidas.utils.mkdir(directory: str, remove_old: bool = False)[source]#

Create a directory, optionally removing the old one.

Parameters:

directory – str Path to the directory.
remove_old – bool, optional Whether to remove the old directory if it exists, by default False.
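
A sketch of the documented behavior using os and shutil. mkdir_sketch is a hypothetical re-implementation; whether the real function tolerates an existing directory when remove_old is False is an assumption.

```python
import os
import shutil
import tempfile


def mkdir_sketch(directory: str, remove_old: bool = False) -> None:
    if remove_old and os.path.isdir(directory):
        shutil.rmtree(directory)           # drop the old directory and its contents
    os.makedirs(directory, exist_ok=True)  # no error if it already exists


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "pred")
    mkdir_sketch(target)
    open(os.path.join(target, "old.txt"), "w").close()
    mkdir_sketch(target)                   # keeps old.txt
    kept = os.listdir(target)
    mkdir_sketch(target, remove_old=True)  # starts again from an empty directory
    after = os.listdir(target)
```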

scmidas.utils.mkdirs(directories: str | List[str] | Dict[str, Any], remove_old: bool = False)[source]#

Recursively create directories.

Parameters:

directories – Union[str, List[str], Dict[str, Any]] Path(s) to directories to create.
remove_old – bool Whether to remove old directories if they exist, by default False.

scmidas.utils.ref_sort(x: List[str], ref: List[str]) → List[str][source]#

Sort the elements of x based on the order defined in ref.

Parameters:

x – list of str List of elements to be sorted.
ref – list of str Reference list defining the sort order.

Returns:

A sorted list of elements from x that appear in ref, maintaining the order of ref.

Return type:

List[str]
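
The ordering rule in the Returns description can be sketched as a single pass over ref. The function name and example values are hypothetical.

```python
from typing import List


def ref_sort_sketch(x: List[str], ref: List[str]) -> List[str]:
    # Walk ref in order and keep the entries that also occur in x.
    return [r for r in ref if r in x]


ordered = ref_sort_sketch(["adt", "unknown", "rna"], ["rna", "adt", "atac"])
```

In this sketch, "unknown" is dropped because it does not appear in ref, which is consistent with the description of the return value.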

scmidas.utils.reverse_dict(original_dict: Dict[str, Dict[str, Any]]) → Dict[str, Dict[str, Any]][source]#

Reverse the keys and sub-keys of a nested dictionary.

Parameters:

original_dict – Dict[str, Dict[str, Any]] The original nested dictionary to be reversed.

Returns:

A reconstructed dictionary where the keys and sub-keys are swapped.

Return type:

Dict[str, Dict[str, Any]]
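
Swapping keys and sub-keys means that d[a][b] becomes d[b][a]. A minimal sketch, with a hypothetical function name and example data:

```python
from typing import Any, Dict


def reverse_dict_sketch(original: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    reversed_dict: Dict[str, Dict[str, Any]] = {}
    for outer, inner in original.items():
        for sub, value in inner.items():
            # The sub-key becomes the outer key and the outer key the sub-key.
            reversed_dict.setdefault(sub, {})[outer] = value
    return reversed_dict


swapped = reverse_dict_sketch({"batch_0": {"rna": 1, "adt": 2}, "batch_1": {"rna": 3}})
```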

scmidas.utils.reverse_trsf(name: str, data: ndarray, **kwargs) → ndarray[source]#

Apply a reverse transformation to the given data.

Parameters:

name – str Name of the transformation to reverse (e.g., 'log1p').
data – np.ndarray Data to transform.
kwargs – dict Additional transformation parameters.

Returns:

Transformed data.

Return type:

np.ndarray
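
For the 'log1p' example named above, the mathematical inverse is expm1, since expm1(log1p(x)) = x. The sketch below shows that round trip on a plain list with the standard math module. It is an illustration of the inverse only: how reverse_trsf implements it, and which kwargs it accepts, are not stated in this reference.

```python
import math
from typing import List


def reverse_log1p_sketch(data: List[float]) -> List[float]:
    # expm1(y) = exp(y) - 1 undoes log1p(x) = log(1 + x).
    return [math.expm1(v) for v in data]


counts = [0.0, 1.0, 9.0]
transformed = [math.log1p(c) for c in counts]
restored = reverse_log1p_sketch(transformed)
```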

scmidas.utils.rmdir(directory: str)[source]#

Remove a directory if it exists.

Parameters:

directory – str Path to the directory to remove.

scmidas.utils.safe_append(pred: dict, batch_id: int, key_path: list, value: Any)[source]#

Append a value to a nested dictionary structure.

Parameters:

pred – dict The nested dictionary structure to append to.
batch_id – int The batch ID to use as the key for the nested dictionary.
key_path – list of str The path of keys to follow in the nested dictionary.
value – Any The value to append to the nested dictionary.
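
A sketch of appending into a nested dictionary while creating missing levels on the way. safe_append_sketch is hypothetical; storing a list at the end of key_path is an assumption based on the word "append".

```python
from typing import Any, List


def safe_append_sketch(pred: dict, batch_id: int, key_path: List[str], value: Any) -> None:
    node = pred.setdefault(batch_id, {})
    for key in key_path[:-1]:
        node = node.setdefault(key, {})  # create intermediate levels on demand
    node.setdefault(key_path[-1], []).append(value)


pred: dict = {}
safe_append_sketch(pred, 0, ["z", "joint"], "chunk_a")
safe_append_sketch(pred, 0, ["z", "joint"], "chunk_b")
safe_append_sketch(pred, 1, ["x_impt", "rna"], "chunk_c")
```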

scmidas.utils.save_list_to_csv(data: List[List[Any]], filename: str, delimiter: str = ',')[source]#

Save a 2D list to a CSV file.

Parameters:

data – List[List[Any]] Input data to be saved.
filename – str Path to the CSV file.
delimiter – str Delimiter to separate values in the CSV file, by default ','.
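
The save and load steps for 2D lists can be sketched as a round trip with the standard csv module. This is an illustration under the assumption that the files are ordinary delimited text; the file name and rows are hypothetical, and the scmidas functions may differ in details such as quoting.

```python
import csv
import os
import tempfile

rows = [["cell", "batch"], ["c1", "0"], ["c2", "1"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cells.csv")
    # Write the 2D list, one inner list per CSV row.
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter=",").writerows(rows)
    # Read it back: every row comes back as a list of strings.
    with open(path, newline="") as f:
        loaded = [row for row in csv.reader(f, delimiter=",")]
```

Values always come back as strings, so numeric columns need an explicit conversion after loading.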

scmidas.utils.save_list_to_mtx(data: Tensor, filename: str)[source]#

Save a 2D list or tensor to a Matrix Market (MTX) file.

Parameters:

data – torch.Tensor Input data to be saved.
filename – str Path to the MTX file.

scmidas.utils.save_tensor_to_csv(data: Tensor, filename: str, delimiter: str = ',')[source]#

Save a 2D tensor to a CSV file.

Parameters:

data – torch.Tensor Input tensor to be saved.
filename – str Path to the CSV file.
delimiter – str, optional Delimiter to separate values in the CSV file, by default ','.

scmidas.utils.save_tensor_to_mtx(data: Tensor, filename: str)[source]#

Save a 2D tensor to a Matrix Market (MTX) file.

Parameters:

data – torch.Tensor Input tensor to be saved.
filename – str Path to the MTX file.

scmidas.utils.to_numpy(t: Tensor) → ndarray[source]#

scmidas.utils.z_to_adata_or_mdata(pred, sparse_threshold=10000)[source]#

Convert prediction dictionary to AnnData (single modality) or MuData (multi-modality).

If only one modality is present, an AnnData object will be returned. If multiple modalities are present, a MuData object will be constructed with one AnnData object per modality.

Parameters:

pred – Dict[str, Any] Prediction results generated by predict() or load_predicted().
sparse_threshold – int, default=10000 If the number of features exceeds this threshold, the data matrix will be converted to a sparse CSR matrix to reduce memory usage.

Returns:

adata_or_mdata – AnnData if a single modality is present; MuData if multiple modalities are present.

Return type:

Union[AnnData, MuData]

Notes

The batch label is added to both:
- adata.obs["batch"] for each modality
- mdata.obs["batch"] at the top level (so sc.pl.umap(mdata, color="batch") works)

Latent embeddings are stored as:
- adata.obsm["z_c"], adata.obsm["z_u"] for single-modality data
- mdata.obsm["z_c"], mdata.obsm["z_u"] for multi-modality data

Modality masks are stored in:
- adata.uns["mask"] for single-modality data
- adata.uns["mask_<modality>"] or adata.uns["mask"] depending on the context
