scmidas.utils
- class scmidas.utils.BaseSink
Bases: object
A sink receives minibatch outputs. It may keep them in memory or write them to disk.
- class scmidas.utils.DiskSink(cfg: DiskSinkConfig)
Bases: BaseSink
Stream-to-disk sink (old-code style): each minibatch is saved to disk immediately. Produces a manifest describing where each output was written.
- class scmidas.utils.DiskSinkConfig(save_dir: str, save_format: str = 'npy', fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}')
Bases: object
- fname_pattern: str = '{batch}/{var}/{key}/{i:06d}.{ext}'
- save_dir: str
- save_format: str = 'npy'
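Example (illustrative). The default fname_pattern is an ordinary Python format string. The snippet below fills it with made-up values to show what kind of relative path it yields. The field values ("batch_0", "z", "joint") and the idea that the sink expands the pattern with str.format are assumptions for illustration; this page does not confirm either.

```python
# Default pattern copied from DiskSinkConfig; all field values are hypothetical.
fname_pattern = "{batch}/{var}/{key}/{i:06d}.{ext}"

relative_path = fname_pattern.format(
    batch="batch_0",  # assumed batch name
    var="z",          # assumed variable name
    key="joint",      # assumed key
    i=3,              # minibatch index, zero-padded to six digits by {i:06d}
    ext="npy",        # matches the default save_format
)
print(relative_path)  # batch_0/z/joint/000003.npy
```

Zero-padding the index keeps the per-minibatch files in order when they are sorted by name.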
- class scmidas.utils.MemorySink
Bases: BaseSink
- class scmidas.utils.OnlineMeanByGroup(dim: int)
Bases: object
Compute the global mean and per-group means without storing all samples.
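Example (illustrative). The methods of OnlineMeanByGroup are not listed on this page, so the following is only a sketch of the general technique: keep running sums and counts, fold each minibatch in, and divide at the end. The class name RunningGroupMean and its update, global_mean, and group_means methods are hypothetical and written in plain Python rather than with tensors.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


class RunningGroupMean:
    """Hypothetical stand-in: stores only running sums and counts."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.total_sum: List[float] = [0.0] * dim
        self.total_count = 0
        self.group_sum: Dict[Hashable, List[float]] = defaultdict(lambda: [0.0] * dim)
        self.group_count: Dict[Hashable, int] = defaultdict(int)

    def update(self, rows: Sequence[Sequence[float]], groups: Sequence[Hashable]) -> None:
        # Fold one minibatch into the accumulators; the rows are not kept.
        for row, group in zip(rows, groups):
            if len(row) != self.dim:
                raise ValueError("row length does not match dim")
            for j, value in enumerate(row):
                self.total_sum[j] += value
                self.group_sum[group][j] += value
            self.total_count += 1
            self.group_count[group] += 1

    def global_mean(self) -> List[float]:
        return [s / self.total_count for s in self.total_sum]

    def group_means(self) -> Dict[Hashable, List[float]]:
        return {
            g: [s / self.group_count[g] for s in sums]
            for g, sums in self.group_sum.items()
        }


acc = RunningGroupMean(dim=2)
acc.update([[1.0, 2.0], [5.0, 6.0]], groups=["a", "b"])  # first minibatch
acc.update([[3.0, 4.0]], groups=["a"])                   # second minibatch
print(acc.global_mean())  # [3.0, 4.0]
print(acc.group_means())  # {'a': [2.0, 3.0], 'b': [5.0, 6.0]}
```

Memory use depends on dim and the number of groups, not on the number of samples seen.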
- scmidas.utils.convert_tensor_to_list(data: Tensor | List[List[Any]]) → List[List[Any]]
Convert a 2D tensor or list into a 2D list.
- Parameters:
data – Union[torch.Tensor, List[List[Any]]] Input data to be converted.
- Returns:
Converted 2D list.
- Return type:
List[List[Any]]
- scmidas.utils.convert_tensors_to_cuda(x: Dict[str, Any], device: device) → Dict[str, Any]
Recursively move all tensors in a dictionary to the given device (CUDA or CPU).
- Parameters:
x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.
device – torch.device Device to move the tensors to (e.g., CUDA or CPU).
- Returns:
A new dictionary with all tensors moved to the specified device.
- Return type:
Dict[str, Any]
- scmidas.utils.detach_tensors(x: Dict[str, Any]) → Dict[str, Any]
Recursively detach all tensors in a dictionary.
- Parameters:
x – Dict[str, Any] Dictionary containing tensors or nested dictionaries.
- Returns:
A new dictionary with all tensors detached.
- Return type:
Dict[str, Any]
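Example (illustrative). convert_tensors_to_cuda and detach_tensors are both described as recursive walks that return a new dictionary. The sketch below shows that traversal pattern in plain Python so it runs without PyTorch. The helper map_leaves and its is_leaf and fn arguments are hypothetical; with tensors one would presumably test isinstance(v, torch.Tensor) and apply v.to(device) or v.detach().

```python
from typing import Any, Callable, Dict


def map_leaves(
    x: Dict[str, Any],
    fn: Callable[[Any], Any],
    is_leaf: Callable[[Any], bool],
) -> Dict[str, Any]:
    """Return a new dict with fn applied to every value selected by is_leaf."""
    out: Dict[str, Any] = {}
    for key, value in x.items():
        if isinstance(value, dict):
            out[key] = map_leaves(value, fn, is_leaf)  # recurse into nested dicts
        elif is_leaf(value):
            out[key] = fn(value)                       # transform matching leaves
        else:
            out[key] = value                           # pass everything else through
    return out


nested = {"x": {"rna": 1.5, "adt": 2.5}, "name": "batch_0"}
doubled = map_leaves(nested, fn=lambda v: v * 2, is_leaf=lambda v: isinstance(v, float))
print(doubled)  # {'x': {'rna': 3.0, 'adt': 5.0}, 'name': 'batch_0'}
```

Building a fresh dictionary at each level leaves the input structure unmodified.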
- scmidas.utils.exp(x: Tensor, eps: float = 1e-12) → Tensor
Compute a numerically stable exponential transformation.
Handles negative and positive values to avoid numerical instability.
- Parameters:
x – torch.Tensor Input tensor.
eps – float, optional A small epsilon value to avoid division by zero, by default 1e-12.
- Returns:
Transformed tensor with the exponential applied.
- Return type:
torch.Tensor
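Example (illustrative). One common way to make an exponential safe against overflow is to evaluate exp(x) directly only for negative inputs and to use 1 / (exp(-x) + eps) otherwise. The scalar sketch below follows that idea. It is an assumed formulation, not confirmed to be what scmidas.utils.exp does, and stable_exp is a hypothetical name.

```python
import math


def stable_exp(x: float, eps: float = 1e-12) -> float:
    """Scalar sketch of an overflow-safe exponential (assumed formulation)."""
    if x < 0:
        # exp of a negative number cannot overflow; at worst it underflows to 0.0.
        return math.exp(x)
    # For x >= 0, exp(x) == 1 / exp(-x); eps keeps the denominator away from zero.
    return 1.0 / (math.exp(-x) + eps)


small = stable_exp(-1000.0)  # underflows to 0.0 instead of raising
large = stable_exp(1000.0)   # saturates near 1 / eps instead of overflowing
```

The trade-off in this formulation is that very large inputs saturate at roughly 1 / eps rather than growing without bound.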
- scmidas.utils.extract_params(config: dict, prefix: str) → dict
Extract parameters from a configuration dictionary with a specific prefix.
Removes the specified prefix from the keys in the resulting dictionary.
- Parameters:
config – dict Configuration dictionary containing various parameters.
prefix – str Prefix to filter and remove from the keys.
- Returns:
A new dictionary containing the filtered parameters with the prefix removed.
- Return type:
dict
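Example (illustrative). The behaviour described above can be sketched as a dictionary comprehension. The sketch assumes that "with a specific prefix" means the key starts with the prefix; extract_params_sketch and the configuration keys are hypothetical.

```python
from typing import Any, Dict


def extract_params_sketch(config: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep keys that start with prefix and strip the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in config.items()
        if key.startswith(prefix)
    }


config = {"optim_lr": 1e-4, "optim_betas": (0.9, 0.999), "dim_c": 32}
optim_kwargs = extract_params_sketch(config, "optim_")
print(optim_kwargs)  # {'lr': 0.0001, 'betas': (0.9, 0.999)}
```

A result of this shape can be passed on as keyword arguments to the component that owns the prefix.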
- scmidas.utils.extract_values(x: List[Any] | Tuple[Any] | Dict[Any, Any] | Any) → List[Any]
Recursively extract all values from a tuple, list, or dictionary.
- Parameters:
x – list, tuple, dict, or any type The input structure containing nested values.
- Returns:
A flattened list of all values extracted from the input.
- Return type:
List[Any]
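Example (illustrative). A recursive flattening of this kind might look like the sketch below. It assumes that dictionary keys are discarded, that strings count as single values, and that values come out in traversal order; extract_values_sketch is a hypothetical name.

```python
from typing import Any, List


def extract_values_sketch(x: Any) -> List[Any]:
    """Flatten nested lists, tuples, and dict values into one list."""
    if isinstance(x, dict):
        children = list(x.values())  # keys are dropped, only values are kept
    elif isinstance(x, (list, tuple)):
        children = list(x)
    else:
        return [x]                   # any other type is treated as a leaf
    flat: List[Any] = []
    for child in children:
        flat.extend(extract_values_sketch(child))
    return flat


flat = extract_values_sketch({"a": [1, (2, 3)], "b": {"c": 4}, "d": "rna"})
print(flat)  # [1, 2, 3, 4, 'rna']
```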
- scmidas.utils.filter_keys(d: Dict[str, Any], substring: str) → Dict[str, Any]
Filter a dictionary to include only keys that contain a specific substring.
- Parameters:
d – Dict[str, Any] The input dictionary to filter.
substring – str The substring to look for in the keys.
- Returns:
A new dictionary containing only the keys from the original dictionary that include the specified substring.
- Return type:
Dict[str, Any]
- scmidas.utils.generate_all_combinations(mods: List[str]) → List[Tuple[Tuple[str, ...], List[str]]]
Generate all possible input-output combinations for a given list of modalities.
For N modalities, generate all combinations of size r (1 <= r < N) as input, and the remaining modalities as output.
- Parameters:
mods – List[str] List of modality names.
- Returns:
A list of tuples, where each tuple contains a tuple of input modalities and a list of output modalities.
- Return type:
List[Tuple[Tuple[str, …], List[str]]]
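Example (illustrative). The enumeration described above can be sketched with itertools.combinations. The ordering of the results and of the remaining modalities is an assumption, and all_input_output_splits is a hypothetical name.

```python
from itertools import combinations
from typing import List, Tuple


def all_input_output_splits(mods: List[str]) -> List[Tuple[Tuple[str, ...], List[str]]]:
    """For every subset size r with 1 <= r < N, pair inputs with the remaining mods."""
    splits = []
    for r in range(1, len(mods)):
        for inputs in combinations(mods, r):
            outputs = [m for m in mods if m not in inputs]
            splits.append((inputs, outputs))
    return splits


splits = all_input_output_splits(["rna", "adt", "atac"])
for inputs, outputs in splits:
    print(inputs, "->", outputs)
```

For N distinct modalities this gives 2^N - 2 pairs, since the empty set and the full set are excluded as inputs; with three modalities that is six pairs.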
- scmidas.utils.get_filenames(directory: str, extension: str) → List[str]
Get sorted filenames with the given extension in the specified directory.
- Parameters:
directory – str The directory to search for files.
extension – str The file extension to filter by.
- Returns:
Sorted list of filenames with the specified extension.
- Return type:
List[str]
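Example (illustrative). A listing of this kind can be sketched with os.listdir. Whether the library returns bare names or full paths, and whether extension is expected with or without the leading dot, is not stated here; the sketch assumes bare names and a suffix match. list_files_with_extension is a hypothetical name.

```python
import os
import tempfile
from typing import List


def list_files_with_extension(directory: str, extension: str) -> List[str]:
    """Return the sorted names in directory that end with extension."""
    return sorted(name for name in os.listdir(directory) if name.endswith(extension))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv", "notes.txt"):
        # Create empty files so there is something to list.
        open(os.path.join(tmp, name), "w").close()
    found = list_files_with_extension(tmp, ".csv")

print(found)  # ['a.csv', 'b.csv']
```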
- scmidas.utils.get_name_fmt(file_num: int) → str
Generate a format string for filenames based on the total number of files.
- Parameters:
file_num – int Total number of files to be named.
- Returns:
Format string for filenames, e.g., ‘%03d’ for three-digit naming.
- Return type:
str
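Example (illustrative). A format string such as '%03d' pads an index with leading zeros. The sketch below assumes the width equals the number of digits in file_num; the exact rule used by get_name_fmt is not given on this page, and name_fmt_sketch is a hypothetical name.

```python
def name_fmt_sketch(file_num: int) -> str:
    """Build a zero-padded printf-style format wide enough for file_num."""
    width = len(str(file_num))  # assumed rule: one slot per decimal digit
    return "%0" + str(width) + "d"


fmt = name_fmt_sketch(250)
print(fmt)      # %03d
print(fmt % 7)  # 007
```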
- scmidas.utils.get_pred_dirs(pred_dir: str, combs: List[List[str]], joint_latent: bool, mod_latent: bool, impute: bool, batch_correct: bool, translate: bool, input: bool) → Dict[int, Dict[str, Dict[str, str]]]
Generate directory paths for predictions based on configurations.
- Parameters:
pred_dir – str Base directory for predictions.
combs – list of list of str Combinations of modalities for each batch.
joint_latent – bool Include joint latent variables.
mod_latent – bool Include modality-specific latent variables.
impute – bool Include imputed data.
batch_correct – bool Include batch-corrected data.
translate – bool Include translated data.
input – bool Include input data.
- Returns:
Dictionary of directories for each batch and variable.
- Return type:
Dict[int, Dict[str, Dict[str, str]]]
- scmidas.utils.get_s_joint_mods(combs: List[List[str]]) → Tuple[List[Dict[str, int]], List[str]]
Generate s_joint and mods from a list of modality combinations.
- Parameters:
combs – List[List[str]] A list where each element is a list of strings representing combinations of modalities for a specific batch.
- Returns:
s_joint: A list of dictionaries, where each dictionary maps the modalities to their corresponding indices for each batch.
mods: A list of all unique modalities across the dataset.
- Return type:
Tuple[List[Dict[str, int]], List[str]]
- scmidas.utils.load_csv(filename: str) → list
Load a CSV file and return its contents as a list of rows.
- Parameters:
filename – str Path to the CSV file.
- Returns:
A list of rows, where each row is a list of strings.
- Return type:
list
- scmidas.utils.load_mtx(filename: str) → list
Load a Matrix Market (MTX) file and convert it to a csr_matrix.
- Parameters:
filename – str Path to the mtx file.
- scmidas.utils.load_predicted(save_dir: str, *, save_format: str = 'npy', dim_c: int | None = None, batch_names: List[str] | None = None, var_names: List[str] | None = None, split_z: bool = True, return_manifest: bool = False) → Dict[str, Any]
Load predictions saved by the streaming DiskSink.
The function reads prediction results saved to disk during predict(…, save_dir=…) and reconstructs them into a prediction dictionary similar to the in-memory output format.
- Parameters:
save_dir – str Root directory where predictions were saved by predict(…, save_dir=…).
save_format – {“npy”, “csv”} File format used for saved prediction arrays.
dim_c – int, optional Dimension of the content latent space (z_c). Required if split_z=True and latent variable z exists.
batch_names – List[str], optional If provided, only the specified batches will be loaded (in the given order).
joint_latent – bool, default=True Whether to include joint latent representations. If False, z_*[“joint”] will be removed from the output.
split_z – bool, default=True If True, split latent variable z into z_c (content latent representation) and z_u (technical latent representation) using dim_c. If False, keep the raw z arrays.
return_manifest – bool, default=False Whether to include a manifest containing the file paths used to reconstruct the predictions.
- Returns:
pred_b – Prediction dictionary organized by batch. The structure matches the in-memory prediction output, for example:
pred_b[batch]["z_c"][key]
pred_b[batch]["z_u"][key]
pred_b[batch]["x_impt"][modality]
pred_b[batch]["x_bc"][modality]
pred_b[batch]["x_trans"][translation_key]
If metadata was saved, modality masks are stored as:
pred_b[batch]["mask"][modality]
- Return type:
Dict[str, Any]
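Example (illustrative). The split_z option divides each latent vector into a content part and a technical part using dim_c. The sketch below shows such a split on plain lists. It assumes the first dim_c columns hold z_c and the remaining columns hold z_u, which this page does not state; split_latent is a hypothetical name.

```python
from typing import List, Tuple


def split_latent(
    z_rows: List[List[float]], dim_c: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split each row of z into (z_c, z_u), assuming the content dims come first."""
    z_c = [row[:dim_c] for row in z_rows]
    z_u = [row[dim_c:] for row in z_rows]
    return z_c, z_u


z = [[0.1, 0.2, 0.3, 9.0], [0.4, 0.5, 0.6, 8.0]]  # two cells, four latent dims
z_c, z_u = split_latent(z, dim_c=3)
print(z_c)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(z_u)  # [[9.0], [8.0]]
```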
- scmidas.utils.log(x: Tensor, eps: float = 1e-12) → Tensor
Compute a numerically stable logarithm transformation.
Ensures numerical stability by adding a small epsilon.
- Parameters:
x – torch.Tensor Input tensor.
eps – float, optional A small epsilon value to avoid log(0), by default 1e-12.
- Returns:
Transformed tensor with the logarithm applied.
- Return type:
torch.Tensor
- scmidas.utils.mkdir(directory: str, remove_old: bool = False)
Create a directory, optionally removing the old one.
- Parameters:
directory – str Path to the directory.
remove_old – bool, optional Whether to remove the old directory if it exists, by default False.
- scmidas.utils.mkdirs(directories: str | List[str] | Dict[str, Any], remove_old: bool = False)
Recursively create directories.
- Parameters:
directories – Union[str, List[str], Dict[str, Any]] Path(s) to directories to create.
remove_old – bool Whether to remove old directories if they exist, by default False.
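Example (illustrative). Creating directories from a string, a list, or a nested dictionary can be sketched as a small recursive function over os.makedirs. The traversal rules (dictionary values are visited, keys are ignored) are assumptions, and make_dirs and the example layout are hypothetical.

```python
import os
import shutil
import tempfile
from typing import Any


def make_dirs(directories: Any, remove_old: bool = False) -> None:
    """Create every path found in a str, list, or nested dict of paths."""
    if isinstance(directories, dict):
        for value in directories.values():
            make_dirs(value, remove_old)
    elif isinstance(directories, (list, tuple)):
        for item in directories:
            make_dirs(item, remove_old)
    else:
        if remove_old and os.path.isdir(directories):
            shutil.rmtree(directories)  # discard the previous contents first
        os.makedirs(directories, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "batch_0": {"z": os.path.join(tmp, "batch_0", "z")},
        "batch_1": [
            os.path.join(tmp, "batch_1", "x_impt"),
            os.path.join(tmp, "batch_1", "x_bc"),
        ],
    }
    make_dirs(layout)
    created = sorted(os.listdir(tmp))
    batch_1_dirs = sorted(os.listdir(os.path.join(tmp, "batch_1")))

print(created)       # ['batch_0', 'batch_1']
print(batch_1_dirs)  # ['x_bc', 'x_impt']
```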
- scmidas.utils.ref_sort(x: List[str], ref: List[str]) → List[str]
Sort the elements of x based on the order defined in ref.
- Parameters:
x – list of str List of elements to be sorted.
ref – list of str Reference list defining the sort order.
- Returns:
A sorted list of elements from x that appear in ref, maintaining the order of ref.
- Return type:
List[str]
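Example (illustrative). Sorting by a reference order can be sketched by walking ref and keeping the entries that occur in x. How the library treats duplicates, or elements of x that are missing from ref, is not specified; the sketch drops the latter. ref_sort_sketch is a hypothetical name.

```python
from typing import List


def ref_sort_sketch(x: List[str], ref: List[str]) -> List[str]:
    """Return the elements of x that appear in ref, in ref's order."""
    present = set(x)
    return [item for item in ref if item in present]


ordered = ref_sort_sketch(["adt", "rna"], ref=["atac", "rna", "adt"])
print(ordered)  # ['rna', 'adt']
```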
- scmidas.utils.reverse_dict(original_dict: Dict[str, Dict[str, Any]]) → Dict[str, Dict[str, Any]]
Reverse the keys and sub-keys of a nested dictionary.
- Parameters:
original_dict – Dict[str, Dict[str, Any]] The original nested dictionary to be reversed.
- Returns:
A reconstructed dictionary where the keys and sub-keys are swapped.
- Return type:
Dict[str, Dict[str, Any]]
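Example (illustrative). Swapping the two key levels of a nested dictionary can be sketched as follows. The example keys are made up, and reverse_dict_sketch is a hypothetical name rather than the library function.

```python
from typing import Any, Dict


def reverse_dict_sketch(original: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Turn original[outer][inner] into result[inner][outer]."""
    result: Dict[str, Dict[str, Any]] = {}
    for outer_key, inner in original.items():
        for inner_key, value in inner.items():
            result.setdefault(inner_key, {})[outer_key] = value
    return result


by_modality = {"rna": {"b0": 1, "b1": 2}, "adt": {"b0": 3}}
by_batch = reverse_dict_sketch(by_modality)
print(by_batch)  # {'b0': {'rna': 1, 'adt': 3}, 'b1': {'rna': 2}}
```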
- scmidas.utils.reverse_trsf(name: str, data: ndarray, **kwargs) → ndarray
Apply a reverse transformation to the given data.
- Parameters:
name – str Name of the transformation to reverse (e.g., ‘log1p’).
data – np.ndarray Data to transform.
kwargs – dict Additional transformation parameters.
- Returns:
Transformed data.
- Return type:
np.ndarray
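Example (illustrative). For the 'log1p' case named above, the mathematical inverse is expm1, because log1p(x) = log(1 + x). The sketch below dispatches on the transformation name using plain Python lists instead of NumPy arrays. Which names the library supports and how it uses kwargs are not stated here; reverse_transform_sketch is a hypothetical name.

```python
import math
from typing import List


def reverse_transform_sketch(name: str, data: List[float]) -> List[float]:
    """Undo a named transformation; only 'log1p' is handled in this sketch."""
    if name == "log1p":
        # log1p(x) = log(1 + x), so the inverse is expm1(y) = exp(y) - 1.
        return [math.expm1(value) for value in data]
    raise ValueError(f"unsupported transformation: {name!r}")


counts = [0.0, 1.0, 9.0]
transformed = [math.log1p(c) for c in counts]
restored = reverse_transform_sketch("log1p", transformed)
```

Here restored matches counts up to floating-point rounding.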
- scmidas.utils.rmdir(directory: str)
Remove a directory if it exists.
- Parameters:
directory – str Path to the directory to remove.
- scmidas.utils.safe_append(pred: dict, batch_id: int, key_path: list, value: Any)
Append a value to a nested dictionary structure.
- Parameters:
pred – dict The nested dictionary structure to append to.
batch_id – int The batch ID to use as the key for the nested dictionary.
key_path – list of str The path of keys to follow in the nested dictionary.
value – Any The value to append to the nested dictionary.
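Example (illustrative). Appending into a nested dictionary while creating missing levels on the way can be sketched with dict.setdefault. The sketch assumes the value is appended to a list stored at the end of key_path, under pred[batch_id]. safe_append_sketch and the key names are hypothetical.

```python
from typing import Any, Dict, List


def safe_append_sketch(pred: Dict[int, Any], batch_id: int, key_path: List[str], value: Any) -> None:
    """Append value to the list at pred[batch_id][k1][k2]..., creating levels as needed."""
    node = pred.setdefault(batch_id, {})
    for key in key_path[:-1]:
        node = node.setdefault(key, {})             # descend, creating dicts on demand
    node.setdefault(key_path[-1], []).append(value)  # the last key holds a list


pred: Dict[int, Any] = {}
safe_append_sketch(pred, 0, ["z", "joint"], [0.1, 0.2])
safe_append_sketch(pred, 0, ["z", "joint"], [0.3, 0.4])
safe_append_sketch(pred, 1, ["x_impt", "rna"], [1.0])
print(pred)
```

Each call adds one minibatch output, so the list under a key grows by one entry per minibatch.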
- scmidas.utils.save_list_to_csv(data: List[List[Any]], filename: str, delimiter: str = ',')
Save a 2D list to a CSV file.
- Parameters:
data – List[List[Any]] Input data to be saved.
filename – str Path to the CSV file.
delimiter – str Delimiter to separate values in the CSV file, by default ‘,’.
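Example (illustrative). Writing a 2D list to CSV and reading it back as rows of strings can be sketched with the standard csv module. Whether the library uses csv internally is not stated; save_rows and load_rows are hypothetical stand-ins for save_list_to_csv and load_csv.

```python
import csv
import os
import tempfile
from typing import Any, List


def save_rows(data: List[List[Any]], filename: str, delimiter: str = ",") -> None:
    """Write each inner list as one CSV row."""
    with open(filename, "w", newline="") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(data)


def load_rows(filename: str) -> List[List[str]]:
    """Read a CSV file back as a list of rows; every field comes back as a string."""
    with open(filename, newline="") as handle:
        return list(csv.reader(handle))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cells.csv")
    save_rows([["cell", "count"], ["c1", 3]], path)
    rows = load_rows(path)

print(rows)  # [['cell', 'count'], ['c1', '3']]
```

The integer 3 comes back as the string '3', consistent with load_csv returning rows of strings.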
- scmidas.utils.save_list_to_mtx(data: Tensor, filename: str)
Save a 2D list or tensor to a Matrix Market (MTX) file.
- Parameters:
data – torch.Tensor Input data to be saved.
filename – str Path to the MTX file.
- scmidas.utils.save_tensor_to_csv(data: Tensor, filename: str, delimiter: str = ',')
Save a 2D tensor to a CSV file.
- Parameters:
data – torch.Tensor Input tensor to be saved.
filename – str Path to the CSV file.
delimiter – str, optional Delimiter to separate values in the CSV file, by default ‘,’.
- scmidas.utils.save_tensor_to_mtx(data: Tensor, filename: str)
Save a 2D tensor to a Matrix Market (MTX) file.
- Parameters:
data – torch.Tensor Input tensor to be saved.
filename – str Path to the MTX file.
- scmidas.utils.z_to_adata_or_mdata(pred, sparse_threshold=10000)
Convert prediction dictionary to AnnData (single modality) or MuData (multi-modality).
If only one modality is present, an AnnData object will be returned. If multiple modalities are present, a MuData object will be constructed with one AnnData object per modality.
- Parameters:
pred – Dict[str, Any] Prediction results generated by predict() or load_predicted().
sparse_threshold – int, default=10000 If the number of features exceeds this threshold, the data matrix will be converted to a sparse CSR matrix to reduce memory usage.
- Returns:
adata_or_mdata – AnnData if a single modality is present, MuData if multiple modalities are present.
- Return type:
Union[AnnData, MuData]
Notes
- The batch label is added to both:
adata.obs["batch"] for each modality
mdata.obs["batch"] at the top level (so sc.pl.umap(mdata, color="batch") works)
- Latent embeddings are stored as:
adata.obsm["z_c"], adata.obsm["z_u"] for single-modality data
mdata.obsm["z_c"], mdata.obsm["z_u"] for multi-modality data
- Modality masks are stored in:
adata.uns["mask"] for single-modality data
adata.uns["mask_<modality>"] or adata.uns["mask"] depending on the context