scmidas.datasets

Bundled example datasets shipped inside the scmidas wheel.

These are toy-sized subsets of real datasets, designed to make the README quickstart runnable in under a minute on a single GPU. They are NOT meant for benchmarking — see the basics tutorials for full-size data.

scmidas.datasets.from_dir(dir_path: str | Path, label_dir: str | Path | None = None, label_col: str = 'label') -> MuData

Load a MIDAS directory-format dataset as a MuData.

The directory format (used by the basics tutorials) lays each batch’s counts out as MatrixMarket .mtx files plus per-feature mask CSVs:

dir_path/
    feat/feat_dims.toml          # per-modality chunk sizes
    <batch>/
        cell_names.csv           # cell IDs (1 column, no header beyond default)
        mat/<modality>.mtx       # (n_cells, n_features), Matrix Market
        mask/<modality>.csv      # 1-row CSV, n_features columns (0/1 mask)
    ...
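Before calling from_dir on a new dataset, it can help to confirm that the directory follows the layout above. The following is a minimal sketch using only the standard library; summarize_layout and read_mtx_shape are hypothetical helpers, not part of scmidas. It assumes every sub-directory other than feat/ is a batch and treats mask files as optional, and it reads only the size line of each Matrix Market file.

```python
from pathlib import Path


def read_mtx_shape(path):
    """Return (n_rows, n_cols) from the size line of a Matrix Market file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Lines starting with '%' are the banner or comments; skip them.
            if line.startswith("%") or not line.strip():
                continue
            fields = line.split()
            return int(fields[0]), int(fields[1])
    raise ValueError(f"no size line found in {path}")


def summarize_layout(dir_path):
    """Map each batch to {modality: {'shape': ..., 'has_mask': ...}}.

    Hypothetical helper (not a scmidas function). Assumes every
    sub-directory other than 'feat' is a batch.
    """
    root = Path(dir_path)
    if not (root / "feat" / "feat_dims.toml").is_file():
        raise FileNotFoundError("missing feat/feat_dims.toml")
    summary = {}
    for batch_dir in sorted(root.iterdir()):
        if not batch_dir.is_dir() or batch_dir.name == "feat":
            continue
        if not (batch_dir / "cell_names.csv").is_file():
            raise FileNotFoundError(f"{batch_dir.name}: missing cell_names.csv")
        modalities = {}
        for mtx in sorted((batch_dir / "mat").glob("*.mtx")):
            mask = batch_dir / "mask" / f"{mtx.stem}.csv"
            modalities[mtx.stem] = {
                "shape": read_mtx_shape(mtx),  # (n_cells, n_features)
                "has_mask": mask.is_file(),
            }
        summary[batch_dir.name] = modalities
    return summary
```

Calling summarize_layout('dataset/teadog_mosaic_mtx/data') would then list, per batch, which modalities are present and their matrix shapes, which makes a missing file or an unexpected feature count visible before loading.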

The returned MuData has:

  • One modality per modality listed in feat/feat_dims.toml.

  • mdata[m].obs['batch'] set to the source batch name.

  • mdata[m].uns[f'mask_{batch}'] for any per-batch feature masks that exist (matches the lookup in MIDAS.get_info_from_mdata).

  • mdata.uns['feat_dims'] mirroring feat_dims.toml so callers can pass dims_x=mdata.uns['feat_dims'] to setup_mudata (needed for ATAC chromosome chunking).

  • If label_dir is given, mdata[m].obs[label_col] is filled in from label_dir/<batch>.csv (matched positionally to cells in that batch).
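Because labels are matched positionally rather than by cell ID, the order of rows in each label file has to follow the order of cells in that batch. The sketch below illustrates the idea with plain lists; assign_labels_by_position is a hypothetical helper, not the scmidas implementation, and the cell IDs and labels are made-up values.

```python
def assign_labels_by_position(cell_ids, labels):
    """Pair the i-th cell with the i-th label (hypothetical helper).

    A length mismatch is the only error positional matching can detect;
    a reordered label file would be accepted silently.
    """
    if len(cell_ids) != len(labels):
        raise ValueError(
            f"{len(cell_ids)} cells but {len(labels)} labels; "
            "positional matching needs one label per cell"
        )
    return dict(zip(cell_ids, labels))


# Made-up example values for one batch.
cells = ["AAAC-1", "AAAG-1", "AACT-1"]
labels = ["B", "CD4 T", "NK"]
assigned = assign_labels_by_position(cells, labels)
```

If a label file was sorted or filtered independently of the count matrices, it is worth re-deriving it in cell order before passing label_dir.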

Parameters:
  • dir_path (str or Path) – Path to the data/ directory described above.

  • label_dir (str or Path, optional) – Path to the sibling label/ directory; one CSV per batch.

  • label_col (str) – Name of the obs column to write labels under.

Returns:

A MuData holding one AnnData per modality, with each cell's source batch recorded in obs['batch'].

Return type:

MuData

Examples

>>> import scmidas
>>> mdata = scmidas.datasets.from_dir(
...     'dataset/teadog_mosaic_mtx/data',
...     label_dir='dataset/teadog_mosaic_mtx/label',
... )
>>> scmidas.MIDAS.setup_mudata(mdata, dims_x=mdata.uns['feat_dims'])
>>> model = scmidas.MIDAS(mdata)

scmidas.datasets.quickstart() -> MuData

Load the bundled quickstart MuData (PBMC RNA+ADT mosaic, 1600 cells).

The dataset is a hand-tuned subset of the WNN PBMC mosaic dataset: 4 batches × 400 cells each (RNA-only, ADT-only, two paired) with 500 RNA HVGs + 224 ADT features, sized so that scmidas.integrate(...) finishes in roughly one minute on a single mid-range GPU. It is intended for the quickstart only; its size and feature count are not appropriate for serious analysis.
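The mosaic composition described above can be checked with a few lines of arithmetic. The batch names in this sketch are hypothetical placeholders; only the batch count, the cells per batch, the modality pattern, and the feature counts are taken from the description.

```python
# Hypothetical batch names; only the composition comes from the description.
CELLS_PER_BATCH = 400
BATCH_MODALITIES = {
    "rna_only": {"rna"},
    "adt_only": {"adt"},
    "paired_1": {"rna", "adt"},
    "paired_2": {"rna", "adt"},
}
N_FEATURES = {"rna": 500, "adt": 224}

# Total cells across all batches.
total_cells = CELLS_PER_BATCH * len(BATCH_MODALITIES)

# Number of cells in which each modality was actually measured.
cells_with = {
    modality: CELLS_PER_BATCH
    * sum(modality in mods for mods in BATCH_MODALITIES.values())
    for modality in N_FEATURES
}

# Combined feature count across both modalities.
total_features = sum(N_FEATURES.values())
```

Under these numbers each modality is observed in three of the four batches, so a quarter of the cells lack RNA and a quarter lack ADT.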

Returns:

A MuData with two modalities ('rna', 'adt') and the following obs columns at top level: 'batch' and 'celltype'.

Return type:

MuData

scmidas.datasets.quickstart_path() -> Path

Return the on-disk path of the bundled quickstart .h5mu file.

Returns:

Absolute path to quickstart_pbmc_mosaic.h5mu inside the installed scmidas package.

Return type:

Path