scmidas.data

class scmidas.data.BasicModDataset

Bases: Dataset

Base class for modality data.

__getitem__(idx: int) -> Any

Retrieve the data item at the specified index (not implemented in base class).

Parameters:

idx – int The index of the data item.

__len__() -> int

Return the number of samples in the dataset.

Returns:

Number of samples.

Return type:

int

class scmidas.data.CSVDataset(csv_file: str)

Bases: BasicModDataset

Dataset for CSV-based data.

Parameters:

csv_file – str Path to the CSV or gzip-compressed CSV file (.csv.gz).

__getitem__(idx: int) -> ndarray

Retrieve the matrix row at the specified index.

Parameters:

idx – int The index of the matrix row.

Returns:

The matrix row as a NumPy array.

Return type:

np.ndarray

__len__() -> int

Return the number of rows in the matrix dataset.

Returns:

Number of rows in the dataset.

Return type:

int

class scmidas.data.MTXDataset(mtx_file: str)

Bases: BasicModDataset

Dataset for Matrix Market (.mtx) data.

Parameters:

mtx_file – str Path to the .mtx file.

__getitem__(idx: int) -> ndarray

Retrieve the matrix row at the specified index.

Parameters:

idx – int The index of the matrix row.

Returns:

The matrix row as a NumPy array.

Return type:

np.ndarray

__len__() -> int

Return the number of rows in the matrix dataset.

Returns:

Number of rows in the dataset.

Return type:

int

get_all() -> ndarray

Return all data in the dataset as a NumPy array.

Returns:

All data in the dataset as a NumPy array.

Return type:

np.ndarray
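
To illustrate row-wise access to Matrix Market data, the sketch below parses a small coordinate-format .mtx file into dense rows using only the standard library. It is a stand-in under stated assumptions, not the MTXDataset implementation: the class name TinyMTXRows and the helper write_demo_mtx are hypothetical, rows are plain Python lists rather than NumPy arrays, and only the "coordinate real general" layout is handled.

```python
import os
import tempfile


class TinyMTXRows:
    """Hypothetical stand-in exposing __getitem__, __len__, and get_all."""

    def __init__(self, mtx_file: str):
        with open(mtx_file) as fh:
            # Drop the banner and comment lines, which start with '%'.
            lines = [ln for ln in fh if ln.strip() and not ln.startswith("%")]
        # The first remaining line holds: n_rows n_cols n_nonzero.
        n_rows, n_cols, _nnz = (int(tok) for tok in lines[0].split())
        self._rows = [[0.0] * n_cols for _ in range(n_rows)]
        for ln in lines[1:]:
            i, j, v = ln.split()
            # Matrix Market indices are 1-based.
            self._rows[int(i) - 1][int(j) - 1] = float(v)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> list:
        return self._rows[idx]

    def get_all(self) -> list:
        return self._rows


def write_demo_mtx(path: str) -> None:
    """Write a 2 x 3 matrix with two non-zero entries."""
    with open(path, "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate real general\n")
        fh.write("2 3 2\n")
        fh.write("1 1 5.0\n")
        fh.write("2 3 7.5\n")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mtx")
    write_demo_mtx(demo_path)
    demo = TinyMTXRows(demo_path)
    demo_len = len(demo)
    demo_second_row = demo[1]
    demo_all = demo.get_all()
```

Here each index addresses one matrix row, which matches the documented behaviour of __getitem__ and __len__; get_all returns every row at once.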

class scmidas.data.MultiBatchSampler(data_source: Dataset, shuffle: bool = True, batch_size: int = 1, n_max: int = 10000)

Bases: Sampler

Custom sampler for multi-batch sampling across multiple datasets.

Parameters:
  • data_source – Dataset The dataset to sample from.

  • shuffle – bool Whether to shuffle the samples within each dataset, default is True.

  • batch_size – int Number of samples per batch, default is 1.

  • n_max – int Maximum number of samples to draw from each dataset, default is 10000.

__iter__() -> Iterator[int]

Iterate over the dataset indices in a multi-batch sampling manner.

Returns:

An iterator over sampled indices.

Return type:

Iterator[int]

__len__() -> int

Calculate the total number of samples across all sub-datasets.

Returns:

The total number of samples.

Return type:

int
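
The roles of shuffle, batch_size, and n_max can be sketched in plain Python. This is a minimal illustration, not the MultiBatchSampler implementation. It assumes that the data source is a concatenation of sub-datasets with known sizes and that each batch draws from a single sub-dataset. The function name multi_batch_plan and its seed argument are hypothetical, and batches are returned as lists for readability, whereas the documented __iter__ yields a flat stream of integer indices.

```python
import random
from typing import List


def multi_batch_plan(sizes: List[int], batch_size: int = 1, n_max: int = 10000,
                     shuffle: bool = True, seed: int = 0) -> List[List[int]]:
    """Group global indices into batches that never mix sub-datasets."""
    rng = random.Random(seed)
    batches = []
    offset = 0  # global index of the first sample in the current sub-dataset
    for size in sizes:
        local = list(range(size))
        if shuffle:
            rng.shuffle(local)
        local = local[:n_max]  # cap the number of samples drawn from this sub-dataset
        for start in range(0, len(local), batch_size):
            batches.append([offset + i for i in local[start:start + batch_size]])
        offset += size
    if shuffle:
        rng.shuffle(batches)  # interleave batches from different sub-datasets
    return batches


# Two sub-datasets of 5 and 3 samples; at most 4 samples are drawn from each.
batches = multi_batch_plan([5, 3], batch_size=2, n_max=4, seed=1)
```

With these inputs the plan covers 7 indices in total: 4 from the first sub-dataset (capped by n_max) and all 3 from the second.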

class scmidas.data.MultiModalDataset(mod_dict: Dict[str, str], mod_id_dict: Dict[str, int], file_type: Dict[str, str], masks: Dict[str, str] | None = None, transform: Dict[str, str] | None = None)

Bases: Dataset

A dataset class for handling multi-modal data with optional masking and transformations.

Parameters:
  • mod_dict – Dict[str, str] A dictionary mapping modality names to their respective file paths.

  • mod_id_dict – Dict[str, int] A dictionary mapping modality names to their unique identifiers.

  • file_type – Dict[str, str] A dictionary mapping modality names to their file types (e.g., ‘vec’, ‘csv’, ‘mtx’).

  • masks – Optional[Dict[str, str]] A dictionary mapping modality names to their mask file paths or mask values, default is None.

  • transform – Optional[Dict[str, str]] A dictionary specifying transformations to apply to each modality, default is None.

__getitem__(idx: int) -> Dict[str, Dict[str, Any]]

Retrieves the data at the specified index across all modalities.

Parameters:

idx – int The index of the sample to retrieve.

Returns:

A dictionary containing the following keys:
  • 'x': Modality data at the given index, with optional transformations applied.

  • 's': Modality IDs.

  • 'e': Masking information, if available.

Return type:

Dict[str, Dict[str, Any]]

__len__() -> int

Returns the size of the dataset.

Returns:

The number of samples in the dataset.

Return type:

int
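
The constructor arguments and the nested return value can be pictured with plain dictionaries. The sketch below does not call the library. The modality names, file paths, and the helpers same_modalities and mock_item are hypothetical. Keying the inner dictionaries by modality name is an assumption, since only the top-level keys 'x', 's', and 'e' are documented, and whether the library requires mod_dict and file_type to share keys is likewise an assumption.

```python
from typing import Any, Dict, List

# Hypothetical modality names and file paths, for illustration only.
mod_dict = {"rna": "data/rna.csv", "adt": "data/adt.csv"}
mod_id_dict = {"rna": 0, "adt": 1}
file_type = {"rna": "csv", "adt": "csv"}


def same_modalities(*configs: Dict[str, Any]) -> bool:
    """Return True when every config dict names the same set of modalities."""
    key_sets = [set(cfg) for cfg in configs]
    return all(keys == key_sets[0] for keys in key_sets)


def mock_item(values: Dict[str, List[float]],
              ids: Dict[str, int]) -> Dict[str, Dict[str, Any]]:
    """Build a dict with the documented top-level keys 'x', 's', and 'e'."""
    return {
        "x": dict(values),  # per-modality data
        "s": dict(ids),     # modality IDs
        "e": {m: [1] * len(v) for m, v in values.items()},  # all-ones placeholder mask
    }


config_ok = same_modalities(mod_dict, file_type)
item = mock_item({"rna": [0.0, 2.0], "adt": [1.0]}, mod_id_dict)
```

A check like same_modalities is a cheap way to catch a path that was added to one dictionary but not the other before any file is opened.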

class scmidas.data.MyDistributedSampler(dataset: Dataset, num_replicas: int | None = None, rank: int | None = None, shuffle: bool = True, seed: int = 0, batch_size: int = 256, n_max: int = 10000)

Bases: DistributedSampler

A custom distributed sampler for datasets split across multiple replicas.

Parameters:
  • dataset – Dataset The dataset to sample from.

  • num_replicas – Optional[int] Number of replicas in the distributed setup, default is determined by torch.distributed.

  • rank – Optional[int] The rank of the current process, default is determined by torch.distributed.

  • shuffle – bool Whether to shuffle the data, default is True.

  • seed – int Random seed for shuffling, default is 0.

  • batch_size – int Number of samples per batch, default is 256.

  • n_max – int Maximum number of samples per dataset, default is 10000.

__iter__() -> Iterator[_T_co]

Iterate over the distributed dataset, ensuring balanced sampling across replicas.

Two RNG streams are used to keep DDP correct under mosaic data:
  • g_shared (same seed on every rank) drives the dataset-visit order, so all ranks process the same sub-batch at the same training step. With non-uniform per-sub-batch modality combinations, this is what keeps the encoder graph identical across ranks and avoids NCCL all-reduce hangs under find_unused_parameters=False.

  • g_local (rank-specific seed) shuffles each rank’s own indices within a dataset; these are disjoint across ranks by construction, so divergence here is fine.

Returns:

Iterator over indices for the current replica.

Return type:

Iterator

__len__() -> int

Calculate the number of samples in the sampler.

Returns:

Number of samples across all datasets.

Return type:

int
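
The two-stream idea described for __iter__ can be sketched with the standard library. This is an illustration under stated assumptions, not the sampler's code: the function rank_plan, the strided split across ranks, and the way the rank-specific seed is derived are all hypothetical, and random.Random stands in for whatever generator the library uses.

```python
import random
from typing import List, Tuple


def rank_plan(sizes: List[int], rank: int, num_replicas: int,
              seed: int = 0) -> Tuple[List[int], List[List[int]]]:
    """Return the dataset-visit order and this rank's shuffled shard per dataset."""
    # Shared stream: the same seed on every rank, so the visit order matches.
    g_shared = random.Random(seed)
    # Local stream: a rank-specific seed, used only inside this rank's own shard.
    g_local = random.Random(seed * 1000 + rank + 1)

    visit_order = list(range(len(sizes)))
    g_shared.shuffle(visit_order)

    shards = []
    for d in visit_order:
        # A strided split gives each rank a disjoint slice of dataset d.
        shard = list(range(rank, sizes[d], num_replicas))
        g_local.shuffle(shard)
        shards.append(shard)
    return visit_order, shards


sizes = [8, 6, 4]
order0, shards0 = rank_plan(sizes, rank=0, num_replicas=2)
order1, shards1 = rank_plan(sizes, rank=1, num_replicas=2)
```

Both ranks visit the sub-datasets in the same order, while the indices they draw inside each sub-dataset are disjoint, so different local shuffles cannot put the ranks on different sub-datasets at the same step.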

class scmidas.data.VECDataset(path: str)

Bases: BasicModDataset

Dataset for vector-based data.

Parameters:

path – str Directory containing vector-based data files.

__getitem__(idx: int) -> ndarray

Retrieve the vector data at the specified index.

Parameters:

idx – int The index of the vector file.

Returns:

The vector data as a NumPy array.

Return type:

np.ndarray

__len__() -> int

Return the number of files in the vector dataset.

Returns:

Number of vector files in the dataset.

Return type:

int

class scmidas.data.adataDataset(adata: AnnData, use_layer='X')

Bases: BasicModDataset

Dataset for AnnData-based data.

Parameters:
  • adata (AnnData) – An AnnData object.

  • use_layer – Layer of data to use, default is 'X'.

__getitem__(idx: int) -> ndarray

Retrieve the data at the specified index from the selected layer.

scmidas.data.download_data(name: str, des: str = './')

Downloads the specified dataset and extracts it.

Parameters:
  • name – str Name of the dataset to download (e.g., ‘teadog_mosaic_4k’).

  • des – str Destination path to save the dataset (default is the current directory).

scmidas.data.download_file(url: str, dest_path)

Helper function to download a file from a URL with progress display.

Parameters:
  • url – str URL of the file to download.

  • dest_path – str or pathlib.Path Destination path for the downloaded file. Strings are accepted for convenience and converted internally to pathlib.Path (the earlier signature documented only str, yet the function called dest_path.name).
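
The string-to-Path conversion mentioned for dest_path can be sketched as follows. The helper normalize_dest and the example path are hypothetical and only show why the conversion makes attributes such as .name available. This is not the body of download_file.

```python
from pathlib import Path
from typing import Union


def normalize_dest(dest_path: Union[str, Path]) -> Path:
    """Coerce a string or Path into a Path so that .name and .parent work."""
    return Path(dest_path)


from_str = normalize_dest("downloads/archive.zip")
from_path = normalize_dest(Path("downloads/archive.zip"))
file_name = from_str.name  # a plain str has no .name attribute
```

Passing a Path through Path() returns an equal path, so callers can supply either type without branching.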

scmidas.data.download_models(name: str, des: str = './')

Downloads the specified model.

Parameters:
  • name – str Name of the model to download (e.g., ‘wnn_mosaic_8batch_mtx’).

  • des – str Destination path to save the model (default is the current directory).

scmidas.data.download_script(name: str, des: str = './')

Downloads the specified script.

Parameters:
  • name – str Name of the script to download (e.g., ‘wnn_bimodal.R’).

  • des – str Destination path to save the script (default is the current directory).
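
A possible way to call the three download helpers together is sketched below. The wrapper fetch_demo_assets is hypothetical. The dataset, model, and script names are the examples given in the parameter descriptions above. Running it assumes that scmidas is installed, that network access is available, and that those names are still served.

```python
def fetch_demo_assets(des: str = "./") -> None:
    """Download one example dataset, model, and script into des."""
    # Imported lazily so that defining this function does not require scmidas.
    from scmidas.data import download_data, download_models, download_script

    download_data("teadog_mosaic_4k", des=des)
    download_models("wnn_mosaic_8batch_mtx", des=des)
    download_script("wnn_bimodal.R", des=des)
```

Calling fetch_demo_assets("./assets") would place all three downloads under the same destination directory.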

scmidas.data.unzip_file(zip_path: str, extract_to: str)

Helper function to unzip a file.

Parameters:
  • zip_path – str Path to the zip file.

  • extract_to – str Directory to extract the contents into.
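
A helper with this signature can be sketched with the standard zipfile module. The function unzip_sketch below is a hypothetical stand-in, and whether unzip_file is implemented this way is an assumption. The demo builds a small archive in a temporary directory so it runs without any external files.

```python
import os
import tempfile
import zipfile


def unzip_sketch(zip_path: str, extract_to: str) -> None:
    """Extract every member of zip_path into the directory extract_to."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(extract_to)


with tempfile.TemporaryDirectory() as tmp:
    # Create a one-file archive to extract.
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("hello.txt", "hi")

    out_dir = os.path.join(tmp, "out")
    unzip_sketch(zip_path, out_dir)
    with open(os.path.join(out_dir, "hello.txt")) as fh:
        extracted_text = fh.read()
```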