Data layout (input and output)#

This page describes the data contract between your data and MIDAS — what shape MIDAS expects on input, and where it writes its results on output.

Recommended path: bring a MuData. The legacy directory format is still supported but is now considered an advanced / reproducibility option (see the end of this page).

Input: a MuData#

MIDAS accepts a single mudata.MuData containing one anndata.AnnData per modality. Three things must be set:

Where

What

mdata[m].X

Per-modality counts (or whatever trsf_before_enc_<m> expects). MIDAS applies its own log1p / binarize internally, so for RNA / ADT / ATAC just store raw counts here.

mdata[m].obs[batch_key]

A column (default name 'batch') identifying the source batch. Required even when there is only one batch — MIDAS uses it to know how cells partition across batches.

mdata[m].uns[f'mask_{batch}'] (optional)

A 1-D float array of length n_features. 1 keeps a feature, 0 masks it out for that batch / modality combination. Use this for cross-batch feature alignment when not all batches share the same feature set. If absent, MIDAS treats every feature as present.

For ATAC encoded by chromosome chunk, also set:

mdata.uns['feat_dims'] = {'atac': [chunk1_size, chunk2_size, ...]}

and pass dims_x=mdata.uns['feat_dims'] to MIDAS.setup_mudata().

Quickstart#

The minimal “I have a MuData, run MIDAS” pipeline:

import scmidas

scmidas.MIDAS.setup_mudata(mdata, batch_key='batch')
model = scmidas.MIDAS(mdata)
model.train(max_epochs=2000)

mdata.obsm['X_midas']   = model.get_latent_representation()           # biological c
mdata.obsm['X_midas_u'] = model.get_latent_representation(kind='u')   # technical u

If you do not yet have a MuData, see the Preparing your data tutorial for a full scanpy-native pipeline starting from raw 10x output.

Output: written back to the MuData#

MIDAS writes its results to standard scanpy locations on the MuData, so any downstream tool that reads mdata.obsm works out of the box.

Key

What

mdata.obsm['X_midas']

Biological joint latent z_c of shape (n_obs, dim_c), written by scmidas.integrate() or model.get_latent_representation(kind='c'). Pass directly to sc.pp.neighbors(use_rep='X_midas').

mdata.obsm['X_midas_u']

Technical joint latent z_u of shape (n_obs, dim_u).

imputed counts

Returned as an array by model.get_imputed_values(modality='rna'), shape (n_obs, n_features). Assign wherever you prefer (e.g. mdata['rna'].layers['imputed'] for the cells that were in that modality, or mdata.obsm['rna_imputed'] for all cells).

For the full prediction surface (per-modality latents, batch-corrected reconstructions, modality translation), see scmidas.MIDAS.predict().

Bridging from your AnnData to MuData#

If your data is already in scanpy form, the bridge is a one-liner per modality plus the MuData constructor:

import mudata as mu

# adata_rna, adata_adt: your already-QC'd, HVG-selected AnnDatas
adata_rna.obs['batch'] = adata_rna.obs['donor']      # whatever your batch col is named
adata_adt.obs['batch'] = adata_adt.obs['donor']

mdata = mu.MuData({'rna': adata_rna, 'adt': adata_adt})

For the QC / normalization / HVG steps that lead up to this point, see the Preparing your data tutorial.

Bridging from a MIDAS directory dataset to a MuData#

The repository includes a helper that loads the legacy directory format (mat/<modality>.mtx, mask/<modality>.csv, feat/feat_dims.toml) into a MuData directly:

mdata = scmidas.datasets.from_dir(
    'dataset/wnn_full_8batch_mtx/data',
    label_dir='dataset/wnn_full_8batch_mtx/label',  # optional cell labels
)
scmidas.MIDAS.setup_mudata(mdata, batch_key='batch',
                           dims_x=mdata.uns['feat_dims'])  # only if ATAC chunks
model = scmidas.MIDAS(mdata)

Advanced: directory format#

The directory format below is the on-disk layout produced by the preprocessing scripts on the reproducibility branch, and is what the bundled demo1 / demo2 / demo3 datasets ship as. For most users, scmidas.datasets.from_dir() (above) is the simplest way to consume it; but if you want to read the directory format directly, MIDAS.configure_data_from_dir() accepts it:

./dataset_path/
    feat/
        feat_dims.toml
    batch_0/
        mat/<modality>.mtx
        mask/<modality>.csv  # optional, 1-row 0/1 CSV per modality
        cell_names.csv       # optional, one cell ID per row
    batch_1/
        ...

Note

MIDAS.configure_data_from_mdata() and MIDAS.configure_data_from_dir() are kept for backwards compatibility but emit a DeprecationWarning. Use MIDAS.setup_mudata() + MIDAS for new code.