scmidas.utils#

class scmidas.utils.BaseSink[source]#

Bases: object

A sink receives minibatch outputs. It may keep them in memory or write to disk.

finalize() Any[source]#

Return final outputs (e.g., nested dict for MemorySink, or manifest for DiskSink).

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#

For small non-tensor metadata (optional).

class scmidas.utils.DiskSink(cfg: DiskSinkConfig)[source]#

Bases: BaseSink

Stream-to-disk sink (old-code style): each minibatch is saved immediately. Produces a manifest describing where things were written.

finalize() Dict[str, Any][source]#

Return the manifest describing where the outputs were written.

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#

For small non-tensor metadata (optional).

class scmidas.utils.DiskSinkConfig(save_dir: str, save_format: str = 'npy', fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}')[source]#

Bases: object

fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}'#
save_dir: str#
save_format: str = 'npy'#
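
The default fname_pattern is written with Python format-string fields. The snippet below is a minimal sketch of how such a pattern expands, assuming the sink fills it with str.format and that i is a running minibatch index. Both points are assumptions, and the field values are made up for illustration.

```python
# Default pattern copied from DiskSinkConfig; the field values are illustrative only.
pattern = "{batch}/{var}/{key}/{i:06d}.{ext}"

# {i:06d} zero-pads the integer to six digits.
relative_path = pattern.format(batch="batch_0", var="z", key="joint", i=3, ext="npy")
print(relative_path)  # batch_0/z/joint/000003.npy
```

Zero-padding keeps the files of one key in minibatch order under a plain lexicographic sort.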

class scmidas.utils.MemorySink[source]#

Bases: BaseSink

finalize() Dict[str, Any][source]#

Return the nested dict of outputs collected in memory.

write(batch_name: str, path: List[str], value: Tensor | ndarray)[source]#

write_meta(batch_name: str, path: List[str], value: Any)[source]#

For small non-tensor metadata (optional).
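
To make the sink interface concrete, here is a rough sketch of an in-memory sink. ToyMemorySink is a hypothetical stand-in, not the scmidas implementation: it only mirrors the three method names above and assumes that path addresses a nested dict and that write accumulates one entry per minibatch.

```python
from typing import Any, Dict, List


class ToyMemorySink:
    """Hypothetical stand-in that mirrors the BaseSink method names."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def _parent(self, batch_name: str, path: List[str]) -> Dict[str, Any]:
        # Walk (and create) nested dicts down to the parent of the last key.
        node = self._store.setdefault(batch_name, {})
        for key in path[:-1]:
            node = node.setdefault(key, {})
        return node

    def write(self, batch_name: str, path: List[str], value: Any) -> None:
        # Accumulate one entry per minibatch under the final key.
        self._parent(batch_name, path).setdefault(path[-1], []).append(value)

    def write_meta(self, batch_name: str, path: List[str], value: Any) -> None:
        # Metadata is stored once rather than accumulated.
        self._parent(batch_name, path)[path[-1]] = value

    def finalize(self) -> Dict[str, Any]:
        return self._store


sink = ToyMemorySink()
sink.write("batch_0", ["z", "joint"], [0.1, 0.2])
sink.write("batch_0", ["z", "joint"], [0.3, 0.4])
sink.write_meta("batch_0", ["mask", "rna"], [1, 0, 1])
result = sink.finalize()
```

A disk-backed variant would keep the same three methods but save each value on write and return file locations from finalize.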

class scmidas.utils.OnlineMeanByGroup(dim: int)[source]#

Bases: object

Compute global mean and per-group means WITHOUT storing all samples.

finalize_centroid() Tensor[source]#

Choose the group mean closest to global mean (L2), return that group’s mean.

update(x: Tensor, g: Tensor)[source]#

Accumulate one minibatch. x holds the samples, shape (N, D); g holds one integer-like group label per sample, shape (N,).
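
The idea of computing means without storing samples can be sketched with running sums and counts. ToyOnlineMeanByGroup below is a hypothetical pure-Python analogue that uses lists instead of tensors. It assumes finalize_centroid compares group means to the global mean by squared L2 distance, which selects the same group as the L2 distance.

```python
from typing import Dict, List, Sequence


class ToyOnlineMeanByGroup:
    """Hypothetical analogue that keeps only sums and counts."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.total_sum = [0.0] * dim
        self.total_n = 0
        self.group_sum: Dict[int, List[float]] = {}
        self.group_n: Dict[int, int] = {}

    def update(self, x: Sequence[Sequence[float]], g: Sequence[int]) -> None:
        # x: N rows of length dim; g: one group label per row.
        for row, label in zip(x, g):
            label = int(label)
            sums = self.group_sum.setdefault(label, [0.0] * self.dim)
            for d, v in enumerate(row):
                sums[d] += v
                self.total_sum[d] += v
            self.group_n[label] = self.group_n.get(label, 0) + 1
            self.total_n += 1

    def finalize_centroid(self) -> List[float]:
        global_mean = [s / self.total_n for s in self.total_sum]
        means = {k: [s / self.group_n[k] for s in v] for k, v in self.group_sum.items()}
        # Pick the group whose mean is closest to the global mean.
        best = min(
            means,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(means[k], global_mean)),
        )
        return means[best]


acc = ToyOnlineMeanByGroup(dim=2)
acc.update([[0.0, 0.0], [2.0, 2.0]], [0, 0])  # group 0 mean: (1, 1)
acc.update([[10.0, 10.0]], [1])               # group 1 mean: (10, 10)
centroid = acc.finalize_centroid()            # global mean is (4, 4), so group 0 wins
```

Memory use grows with the number of groups and the dimension, not with the number of samples.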

scmidas.utils.convert_tensor_to_list(data: Tensor | List[List[Any]]) List[List[Any]][source]#

Convert a 2D tensor or list into a 2D list.

Parameters:

data – Union[torch.Tensor, List[List[Any]]] Input data to be converted.

Returns:

Converted 2D list.

Return type:

List[List[Any]]

scmidas.utils.convert_tensors_to_cuda(x: Dict[str, Any], device: device) Dict[str, Any][source]#

Recursively move all tensors in a dictionary to the specified device.

Parameters:
  • x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.

  • device – torch.device Device to move the tensors to (e.g., CUDA or CPU).

Returns:

A new dictionary with all tensors moved to the specified device.

Return type:

Dict[str, Any]

scmidas.utils.detach_tensors(x: Dict[str, Any]) Dict[str, Any][source]#

Recursively detach all tensors in a dictionary.

Parameters:

x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.

Returns:

A new dictionary with all tensors detached.

Return type:

Dict[str, Any]

scmidas.utils.ensure_dir(p: str)[source]#

scmidas.utils.exp(x: Tensor, eps: float = 1e-12) Tensor[source]#

Compute a numerically stable exponential transformation.

Handles negative and positive values to avoid numerical instability.

Parameters:
  • x – torch.Tensor Input tensor.

  • eps – float, optional A small epsilon value to avoid division by zero, by default 1e-12.

Returns:

Transformed tensor with the exponential applied.

Return type:

torch.Tensor

scmidas.utils.extract_params(config: dict, prefix: str) dict[source]#

Extract parameters from a configuration dictionary with a specific prefix.

Removes the specified prefix from the keys in the resulting dictionary.

Parameters:
  • config – dict Configuration dictionary containing various parameters.

  • prefix – str Prefix to filter and remove from the keys.

Returns:

A new dictionary containing the filtered parameters with the prefix removed.

Return type:

dict
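
The behavior described above can be sketched in a few lines. extract_params_sketch is a hypothetical re-implementation; it assumes the prefix must match at the start of the key, and the config keys are invented for the example.

```python
from typing import Any, Dict


def extract_params_sketch(config: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    # Keep keys that start with the prefix and drop the prefix from each kept key.
    return {k[len(prefix):]: v for k, v in config.items() if k.startswith(prefix)}


config = {"optim_lr": 1e-4, "optim_betas": (0.9, 0.999), "batch_size": 256}
optim_kwargs = extract_params_sketch(config, "optim_")
```

Here optim_kwargs holds only lr and betas, ready to be passed on as keyword arguments.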

scmidas.utils.extract_values(x: List[Any] | Tuple[Any] | Dict[Any, Any] | Any) List[Any][source]#

Recursively extract all values from a tuple, list, or dictionary.

Parameters:

x – list, tuple, dict, or any type The input structure containing nested values.

Returns:

A flattened list of all values extracted from the input.

Return type:

List[Any]
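
A recursive flattening of this kind might look like the following. extract_values_sketch is a hypothetical re-implementation; it assumes dictionary keys are discarded and that values are visited in insertion order.

```python
from typing import Any, List


def extract_values_sketch(x: Any) -> List[Any]:
    if isinstance(x, dict):
        children = list(x.values())   # keys are dropped, values are kept
    elif isinstance(x, (list, tuple)):
        children = list(x)
    else:
        return [x]                    # a leaf value
    flat: List[Any] = []
    for child in children:
        flat.extend(extract_values_sketch(child))
    return flat


flat = extract_values_sketch({"a": [1, (2, 3)], "b": {"c": 4}, "d": 5})
```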

scmidas.utils.filter_keys(d: Dict[str, Any], substring: str) Dict[str, Any][source]#

Filter a dictionary to include only keys that contain a specific substring.

Parameters:
  • d – Dict[str, Any] The input dictionary to filter.

  • substring – str The substring to look for in the keys.

Returns:

A new dictionary containing only the keys from the original dictionary that include the specified substring.

Return type:

Dict[str, Any]

scmidas.utils.generate_all_combinations(mods: List[str]) List[Tuple[Tuple[str, ...], List[str]]][source]#

Generate all possible input-output combinations for a given list of modalities.

For N modalities, generate all combinations of size r (1 <= r < N) as input, and the remaining modalities as output.

Parameters:

mods – List[str] List of modality names.

Returns:

A list of tuples, where each tuple contains:
  • A tuple of input modalities.

  • A list of output modalities.

Return type:

List[Tuple[Tuple[str, …], List[str]]]
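
The enumeration described above maps directly onto itertools.combinations. generate_all_combinations_sketch is a hypothetical re-implementation; the ordering of the result and of the output modalities is an assumption.

```python
from itertools import combinations
from typing import List, Tuple


def generate_all_combinations_sketch(
    mods: List[str],
) -> List[Tuple[Tuple[str, ...], List[str]]]:
    result = []
    for r in range(1, len(mods)):              # input sizes 1 <= r < N
        for inputs in combinations(mods, r):
            # Everything not used as input becomes an output.
            outputs = [m for m in mods if m not in inputs]
            result.append((inputs, outputs))
    return result


pairs = generate_all_combinations_sketch(["rna", "adt", "atac"])
for inputs, outputs in pairs:
    print(inputs, "->", outputs)
```

For three modalities this yields six pairs: three with one input modality and three with two.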

scmidas.utils.get_filenames(directory: str, extension: str) List[str][source]#

Get sorted filenames with the given extension in the specified directory.

Parameters:
  • directory – str The directory to search for files.

  • extension – str The file extension to filter by.

Returns:

Sorted list of filenames with the specified extension.

Return type:

List[str]

scmidas.utils.get_name_fmt(file_num: int) str[source]#

Generate a format string for filenames based on the total number of files.

Parameters:

file_num – int Total number of files to be named.

Returns:

Format string for filenames, e.g., ‘%03d’ for three-digit naming.

Return type:

str
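
As an illustration, one plausible rule is to pad to the number of decimal digits in file_num. get_name_fmt_sketch is hypothetical, and the exact width rule used by the library is an assumption here.

```python
def get_name_fmt_sketch(file_num: int) -> str:
    # Assumed rule: width equals the number of decimal digits in file_num.
    width = len(str(file_num))
    return f"%0{width}d"


fmt = get_name_fmt_sketch(250)  # three digits, so '%03d'
name = fmt % 7                  # '007'
```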

scmidas.utils.get_pred_dirs(pred_dir: str, combs: List[List[str]], joint_latent: bool, mod_latent: bool, impute: bool, batch_correct: bool, translate: bool, input: bool) Dict[int, Dict[str, Dict[str, str]]][source]#

Generate directory paths for predictions based on configurations.

Parameters:
  • pred_dir – str Base directory for predictions.

  • combs – list of list of str Combinations of modalities for each batch.

  • joint_latent – bool Include joint latent variables.

  • mod_latent – bool Include modality-specific latent variables.

  • impute – bool Include imputed data.

  • batch_correct – bool Include batch-corrected data.

  • translate – bool Include translated data.

  • input – bool Include input data.

Returns:

Dictionary of directories for each batch and variable.

Return type:

Dict[int, Dict[str, Dict[str, str]]]

scmidas.utils.get_s_joint_mods(combs: List[List[str]]) Tuple[List[Dict[str, int]], List[str]][source]#

Generate s_joint and mods from a list of modality combinations.

Parameters:

combs – List[List[str]] A list where each element is a list of strings representing combinations of modalities for a specific batch.

Returns:

  • s_joint: A list of dictionaries, where each dictionary maps the modalities to their corresponding indices for each batch.

  • mods: A list of all unique modalities across the dataset.

Return type:

Tuple[List[Dict[str, int]], List[str]]

scmidas.utils.load_csv(filename: str) list[source]#

Load a CSV file and return its contents as a list of rows.

Parameters:

filename – str Path to the CSV file.

Returns:

A list of rows, where each row is a list of strings.

Return type:

list

scmidas.utils.load_mtx(filename: str) list[source]#

Load a Matrix Market (MTX) file and convert it to a csr_matrix.

Parameters:

filename – str Path to the mtx file.

scmidas.utils.load_predicted(save_dir: str, *, save_format: str = 'npy', dim_c: int | None = None, batch_names: List[str] | None = None, var_names: List[str] | None = None, split_z: bool = True, return_manifest: bool = False) Dict[str, Any][source]#

Load predictions saved by the streaming DiskSink.

The function reads prediction results saved to disk during predict(…, save_dir=…) and reconstructs them into a prediction dictionary similar to the in-memory output format.

Parameters:
  • save_dir – str Root directory where predictions were saved by predict(…, save_dir=…).

  • save_format – {“npy”, “csv”} File format used for saved prediction arrays.

  • dim_c – int, optional Dimension of the content latent space (z_c). Required if split_z=True and latent variable z exists.

  • batch_names – List[str], optional If provided, only the specified batches will be loaded (in the given order).

  • joint_latent – bool, default=True Whether to include joint latent representations. If False, z_*[“joint”] will be removed from the output.

  • split_z – bool, default=True If True, split latent variable z using dim_c into:

    • z_c : content latent representation

    • z_u : technical latent representation

    If False, keep the raw z arrays.

  • return_manifest – bool, default=False Whether to include a manifest containing the file paths used to reconstruct the predictions.

Returns:

Prediction dictionary (pred_b) organized by batch. The structure matches the in-memory prediction output, for example:

  • pred_b[batch][“z_c”][key]

  • pred_b[batch][“z_u”][key]

  • pred_b[batch][“x_impt”][modality]

  • pred_b[batch][“x_bc”][modality]

  • pred_b[batch][“x_trans”][translation_key]

If metadata was saved, modality masks will be stored as:

pred_b[batch][“mask”][modality]

Return type:

Dict[str, Any]
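
The split_z option can be pictured as a column split of z at dim_c. split_z_sketch below is a hypothetical illustration on plain lists; it assumes z_c occupies the first dim_c columns and z_u the rest, which the reference text does not state explicitly.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_z_sketch(z: Matrix, dim_c: int) -> Tuple[Matrix, Matrix]:
    # Assumed layout: content dimensions first, technical dimensions after.
    z_c = [row[:dim_c] for row in z]
    z_u = [row[dim_c:] for row in z]
    return z_c, z_u


z = [[0.1, 0.2, 0.3, 9.0], [0.4, 0.5, 0.6, 8.0]]
z_c, z_u = split_z_sketch(z, dim_c=3)
```

This is why dim_c is required when split_z=True: without it the boundary between the two parts is unknown.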

scmidas.utils.log(x: Tensor, eps: float = 1e-12) Tensor[source]#

Compute a numerically stable logarithm transformation.

Ensures numerical stability by adding a small epsilon.

Parameters:
  • x – torch.Tensor Input tensor.

  • eps – float, optional A small epsilon value to avoid log(0), by default 1e-12.

Returns:

Transformed tensor with the logarithm applied.

Return type:

torch.Tensor

scmidas.utils.mkdir(directory: str, remove_old: bool = False)[source]#

Create a directory, optionally removing the old one.

Parameters:
  • directory – str Path to the directory.

  • remove_old – bool, optional Whether to remove the old directory if it exists, by default False.

scmidas.utils.mkdirs(directories: str | List[str] | Dict[str, Any], remove_old: bool = False)[source]#

Recursively create directories.

Parameters:
  • directories – Union[str, List[str], Dict[str, Any]] Path(s) to directories to create.

  • remove_old – bool Whether to remove old directories if they exist, by default False.

scmidas.utils.ref_sort(x: List[str], ref: List[str]) List[str][source]#

Sort the elements of x based on the order defined in ref.

Parameters:
  • x – list of str List of elements to be sorted.

  • ref – list of str Reference list defining the sort order.

Returns:

A sorted list of elements from x that appear in ref, maintaining the order of ref.

Return type:

List[str]
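
Per the description, the result follows the order of ref and keeps only elements that also occur in x. ref_sort_sketch is a hypothetical re-implementation of that behavior; the modality names are made up.

```python
from typing import List


def ref_sort_sketch(x: List[str], ref: List[str]) -> List[str]:
    present = set(x)
    # Walk ref in order and keep the entries that also occur in x.
    return [item for item in ref if item in present]


ordered = ref_sort_sketch(["adt", "rna"], ["rna", "atac", "adt"])
print(ordered)  # ['rna', 'adt']
```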

scmidas.utils.reverse_dict(original_dict: Dict[str, Dict[str, Any]]) Dict[str, Dict[str, Any]][source]#

Reverse the keys and sub-keys of a nested dictionary.

Parameters:

original_dict – Dict[str, Dict[str, Any]] The original nested dictionary to be reversed.

Returns:

A reconstructed dictionary where the keys and sub-keys are swapped.

Return type:

Dict[str, Dict[str, Any]]
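
Swapping the two key levels can be sketched as follows. reverse_dict_sketch is a hypothetical re-implementation, and the batch and modality names are invented for the example.

```python
from typing import Any, Dict


def reverse_dict_sketch(original: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    reversed_dict: Dict[str, Dict[str, Any]] = {}
    for outer_key, inner in original.items():
        for inner_key, value in inner.items():
            # The inner key becomes the outer key and vice versa.
            reversed_dict.setdefault(inner_key, {})[outer_key] = value
    return reversed_dict


by_batch = {"batch_0": {"rna": 1, "adt": 2}, "batch_1": {"rna": 3}}
by_modality = reverse_dict_sketch(by_batch)
```

Inner dictionaries may end up with different key sets, as with adt above, which exists only for batch_0.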

scmidas.utils.reverse_trsf(name: str, data: ndarray, **kwargs) ndarray[source]#

Apply a reverse transformation to the given data.

Parameters:
  • name – str Name of the transformation to reverse (e.g., ‘log1p’).

  • data – np.ndarray Data to transform.

  • kwargs – dict Additional transformation parameters.

Returns:

Transformed data.

Return type:

np.ndarray
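
For the ‘log1p’ example named above, the mathematical inverse is expm1, since expm1(log1p(x)) = x. The sketch below shows that round trip with the standard library only; reverse_log1p_sketch is hypothetical, and whether reverse_trsf handles ‘log1p’ exactly this way is an assumption.

```python
import math
from typing import List


def reverse_log1p_sketch(values: List[float]) -> List[float]:
    # expm1(v) = exp(v) - 1 undoes log1p(x) = log(1 + x).
    return [math.expm1(v) for v in values]


counts = [0.0, 1.0, 9.0]
transformed = [math.log1p(c) for c in counts]
restored = reverse_log1p_sketch(transformed)
```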

scmidas.utils.rmdir(directory: str)[source]#

Remove a directory if it exists.

Parameters:

directory – str Path to the directory to remove.

scmidas.utils.safe_append(pred: dict, batch_id: int, key_path: list, value: Any)[source]#

Append a value to a nested dictionary structure.

Parameters:
  • pred – dict The nested dictionary structure to append to.

  • batch_id – int The batch ID to use as the key for the nested dictionary.

  • key_path – list of str The path of keys to follow in the nested dictionary.

  • value – Any The value to append to the nested dictionary.

scmidas.utils.save_list_to_csv(data: List[List[Any]], filename: str, delimiter: str = ',')[source]#

Save a 2D list to a CSV file.

Parameters:
  • data – List[List[Any]] Input data to be saved.

  • filename – str Path to the CSV file.

  • delimiter – str Delimiter to separate values in the CSV file, by default ‘,’.

scmidas.utils.save_list_to_mtx(data: Tensor, filename: str)[source]#

Save a 2D list or tensor to a Matrix Market (MTX) file.

Parameters:
  • data – torch.Tensor Input data to be saved.

  • filename – str Path to the MTX file.

scmidas.utils.save_tensor_to_csv(data: Tensor, filename: str, delimiter: str = ',')[source]#

Save a 2D tensor to a CSV file.

Parameters:
  • data – torch.Tensor Input tensor to be saved.

  • filename – str Path to the CSV file.

  • delimiter – str, optional Delimiter to separate values in the CSV file, by default ‘,’.

scmidas.utils.save_tensor_to_mtx(data: Tensor, filename: str)[source]#

Save a 2D tensor to a Matrix Market (MTX) file.

Parameters:
  • data – torch.Tensor Input tensor to be saved.

  • filename – str Path to the MTX file.

scmidas.utils.to_numpy(t: Tensor) ndarray[source]#

scmidas.utils.z_to_adata_or_mdata(pred, sparse_threshold=10000)[source]#

Convert prediction dictionary to AnnData (single modality) or MuData (multi-modality).

If only one modality is present, an AnnData object will be returned. If multiple modalities are present, a MuData object will be constructed with one AnnData object per modality.

Parameters:
  • pred – Dict[str, Any] Prediction results generated by predict() or load_predicted().

  • sparse_threshold – int, default=10000 If the number of features exceeds this threshold, the data matrix will be converted to a sparse CSR matrix to reduce memory usage.

Returns:

adata_or_mdata:
  • AnnData if a single modality is present.

  • MuData if multiple modalities are present.

Return type:

Union[AnnData, MuData]

Notes

  • The batch label is added to both:
    • adata.obs[“batch”] for each modality

    • mdata.obs[“batch”] at the top level (so sc.pl.umap(mdata, color=”batch”) works)

  • Latent embeddings are stored as:
    • adata.obsm[“z_c”], adata.obsm[“z_u”] for single-modality data

    • mdata.obsm[“z_c”], mdata.obsm[“z_u”] for multi-modality data

  • Modality masks are stored in:
    • adata.uns[“mask”] for single-modality data

    • adata.uns[“mask_<modality>”] or adata.uns[“mask”] depending on the context