scmidas.datasets
Bundled example datasets shipped inside the scmidas wheel.
These are toy-sized subsets of real datasets, designed to make the README quickstart runnable in under a minute on a single GPU. They are NOT meant for benchmarking — see the basics tutorials for full-size data.
- scmidas.datasets.from_dir(dir_path: str | Path, label_dir: str | Path | None = None, label_col: str = 'label') -> MuData
Load a MIDAS directory-format dataset as a MuData.
The directory format (used by the basics tutorials) lays each batch's counts out as MatrixMarket .mtx files plus per-feature mask CSVs:
    dir_path/
        feat/feat_dims.toml        # per-modality chunk sizes
        <batch>/
            cell_names.csv         # cell IDs (1 column, no header beyond default)
            mat/<modality>.mtx     # (n_cells, n_features), Matrix Market
            mask/<modality>.csv    # 1-row CSV, n_features columns (0/1 mask)
        ...
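To make the layout above concrete, here is a minimal stdlib-only sketch that walks such a directory and reports the shape of each count matrix. It is not part of scmidas: `mtx_shape` and `check_layout` are hypothetical helper names, the batch name `batch_0` is made up, and the content of `feat_dims.toml` is a placeholder because its schema is not described here.

```python
import tempfile
from pathlib import Path


def mtx_shape(path):
    """Return (n_rows, n_cols) from a Matrix Market file's size line."""
    with open(path) as fh:
        for line in fh:
            # Lines starting with '%' are the banner or comments.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols = line.split()[:2]
            return int(n_rows), int(n_cols)
    raise ValueError(f"no size line found in {path}")


def check_layout(dir_path):
    """Map (batch, modality) -> (n_cells, n_features, has_mask)."""
    shapes = {}
    for batch_dir in sorted(Path(dir_path).iterdir()):
        if not (batch_dir / "mat").is_dir():
            continue  # skips feat/ and anything else that is not a batch
        for mtx in sorted((batch_dir / "mat").glob("*.mtx")):
            n_cells, n_features = mtx_shape(mtx)
            has_mask = (batch_dir / "mask" / f"{mtx.stem}.csv").exists()
            shapes[(batch_dir.name, mtx.stem)] = (n_cells, n_features, has_mask)
    return shapes


# Build a toy directory (one hypothetical batch, one modality) and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "feat").mkdir()
    # Placeholder only: the real feat_dims.toml schema is not shown here.
    (root / "feat" / "feat_dims.toml").write_text("# placeholder\n")
    batch = root / "batch_0"
    (batch / "mat").mkdir(parents=True)
    (batch / "mask").mkdir()
    # 2 cells x 3 features with 2 non-zero entries, coordinate format.
    (batch / "mat" / "rna.mtx").write_text(
        "%%MatrixMarket matrix coordinate integer general\n"
        "2 3 2\n"
        "1 1 5\n"
        "2 3 1\n"
    )
    (batch / "mask" / "rna.csv").write_text("1,1,0\n")
    shapes = check_layout(root)

print(shapes)
```

A check like this can catch a misplaced `mat/` or `mask/` folder before handing the directory to `from_dir`. It deliberately does not parse the mask or cell-name CSVs, since their exact header conventions are not spelled out above.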
The returned MuData has:
- One modality per modality file present in feat/feat_dims.toml.
- mdata[m].obs['batch'] set to the source batch name.
- mdata[m].uns[f'mask_{batch}'] for any per-batch feature masks that exist (matches the lookup in MIDAS.get_info_from_mdata).
- mdata.uns['feat_dims'] mirroring feat_dims.toml so callers can pass dims_x=mdata.uns['feat_dims'] to setup_mudata (needed for ATAC chromosome chunking).
If label_dir is given, mdata[m].obs[label_col] is filled in from label_dir/<batch>.csv (matched positionally to cells in that batch).
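Positional matching means the i-th label is assigned to the i-th cell of the batch, regardless of any cell IDs. The sketch below illustrates what that implies for callers; it is not the scmidas implementation, and `attach_labels` plus the cell IDs and labels are made up for illustration.

```python
def attach_labels(cell_ids, labels):
    """Pair labels with cell IDs by position, refusing mismatched lengths."""
    if len(cell_ids) != len(labels):
        raise ValueError(
            f"{len(labels)} labels for {len(cell_ids)} cells; "
            "positional matching needs one label per cell"
        )
    return dict(zip(cell_ids, labels))


# Hypothetical cell IDs and labels for one batch, in file order.
cells = ["AAAC-1", "AAAG-1", "AACT-1"]
labels = ["B", "CD4 T", "NK"]
by_cell = attach_labels(cells, labels)

# Reordering the label file changes the result without raising an error,
# which is why the label CSV must keep the same row order as the cells.
shuffled = attach_labels(cells, list(reversed(labels)))

print(by_cell)
print(shuffled)
```

The practical consequence is that a label CSV should be written in the same row order as the batch's cells, because a reordered file would be accepted without complaint.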
- Parameters:
dir_path – str or Path. Path to the data/ directory described above.
label_dir – str or Path, optional. Path to the sibling label/ directory; one CSV per batch.
label_col – str. Name of the obs column to write labels under.
- Returns:
One AnnData per modality, indexed by batch.
- Return type:
MuData
Examples
>>> import scmidas
>>> mdata = scmidas.datasets.from_dir(
...     'dataset/teadog_mosaic_mtx/data',
...     label_dir='dataset/teadog_mosaic_mtx/label',
... )
>>> scmidas.MIDAS.setup_mudata(mdata, dims_x=mdata.uns['feat_dims'])
>>> model = scmidas.MIDAS(mdata)
- scmidas.datasets.quickstart() -> MuData
Load the bundled quickstart MuData (PBMC RNA+ADT mosaic, 1600 cells).
The dataset is a hand-tuned subset of the WNN PBMC mosaic dataset: 4 batches × 400 cells each (RNA-only, ADT-only, two paired) with 500 RNA HVGs + 224 ADT features, sized so that scmidas.integrate(...) finishes in roughly one minute on a single mid-range GPU. It is intended for the quickstart only; its size and feature count are not appropriate for serious analysis.
- Returns:
A MuData with two modalities ('rna', 'adt') and the following obs columns at top level: 'batch' and 'celltype'.
- Return type:
MuData
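The mosaic structure described for the quickstart dataset can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers stated above; the batch names are hypothetical placeholders, not the names stored in the bundled data.

```python
N_CELLS_PER_BATCH = 400
N_FEATURES = {"rna": 500, "adt": 224}

# Which modalities each batch measures (batch names are made up).
BATCHES = {
    "rna_only": ("rna",),
    "adt_only": ("adt",),
    "paired_1": ("rna", "adt"),
    "paired_2": ("rna", "adt"),
}

total_cells = N_CELLS_PER_BATCH * len(BATCHES)

# Cells that actually carry measurements for each modality.
cells_per_modality = {
    modality: sum(
        N_CELLS_PER_BATCH for mods in BATCHES.values() if modality in mods
    )
    for modality in N_FEATURES
}

total_features = sum(N_FEATURES.values())

print(total_cells)
print(cells_per_modality)
print(total_features)
```

Under these numbers, each modality is observed in three of the four batches, so a quarter of the cells lack RNA and a different quarter lack ADT. That missing-modality pattern is what makes the dataset a mosaic.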