Data layout (input and output)
This page describes the data contract between your dataset and MIDAS: the shape MIDAS expects on input, and where it writes its results on output.
Recommended path: bring a MuData. The legacy directory format is still
supported but is now considered an advanced / reproducibility option (see the
end of this page).
Input: a MuData
MIDAS accepts a single mudata.MuData containing one
anndata.AnnData per modality. Three things must be set:
| Where | What |
|---|---|
|  | Per-modality counts (or whatever |
|  | A column (default name |
|  | A 1-D float array of length |
For ATAC encoded by chromosome chunk, also set:
mdata.uns['feat_dims'] = {'atac': [chunk1_size, chunk2_size, ...]}
and pass dims_x=mdata.uns['feat_dims'] to MIDAS.setup_mudata().
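The chunk sizes can be sanity-checked before training. The sketch below assumes, which this page does not state explicitly, that the chunks partition the ATAC feature axis end to end, so their sizes should sum to the number of ATAC features. `check_feat_dims` and `chunk_slices` are hypothetical helpers for illustration, not part of scmidas.

```python
def check_feat_dims(n_features, chunk_sizes):
    """Return True if the chunk sizes are all positive and sum to n_features.

    Assumption: chunks cover the ATAC feature axis end to end, no gaps.
    """
    if not chunk_sizes:
        return False
    if any(size <= 0 for size in chunk_sizes):
        return False
    return sum(chunk_sizes) == n_features


def chunk_slices(chunk_sizes):
    """Turn chunk sizes into (start, stop) column ranges, in order."""
    slices, start = [], 0
    for size in chunk_sizes:
        slices.append((start, start + size))
        start += size
    return slices
```

With a real MuData, a call along the lines of `check_feat_dims(mdata['atac'].n_vars, mdata.uns['feat_dims']['atac'])` would apply the same check to the values set above.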
Quickstart
The minimal “I have a MuData, run MIDAS” pipeline:
import scmidas
scmidas.MIDAS.setup_mudata(mdata, batch_key='batch')
model = scmidas.MIDAS(mdata)
model.train(max_epochs=2000)
mdata.obsm['X_midas'] = model.get_latent_representation() # biological c
mdata.obsm['X_midas_u'] = model.get_latent_representation(kind='u') # technical u
If you do not yet have a MuData, see the Preparing your data tutorial for a full scanpy-native pipeline starting from raw 10x output.
Output: written back to the MuData
MIDAS writes its results to standard scanpy locations on the MuData, so any
downstream tool that reads mdata.obsm works out of the box.
| Key | What |
|---|---|
|  | Biological joint latent |
|  | Technical joint latent |
| imputed counts | Returned as an array by |
For the full prediction surface (per-modality latents, batch-corrected
reconstructions, modality translation), see scmidas.MIDAS.predict().
Bridging from your AnnData to MuData
If your data is already in scanpy form, the bridge is a one-liner per
modality plus the MuData constructor:
import mudata as mu
# adata_rna, adata_adt: your already-QC'd, HVG-selected AnnDatas
adata_rna.obs['batch'] = adata_rna.obs['donor'] # whatever your batch col is named
adata_adt.obs['batch'] = adata_adt.obs['donor']
mdata = mu.MuData({'rna': adata_rna, 'adt': adata_adt})
For the QC / normalization / HVG steps that lead up to this point, see the Preparing your data tutorial.
Bridging from a MIDAS directory dataset to a MuData
The repository includes a helper that loads the legacy directory format
(mat/<modality>.mtx, mask/<modality>.csv, feat/feat_dims.toml)
into a MuData directly:
import scmidas

# Load the legacy directory dataset into a MuData
mdata = scmidas.datasets.from_dir(
    'dataset/wnn_full_8batch_mtx/data',
    label_dir='dataset/wnn_full_8batch_mtx/label',  # optional cell labels
)
scmidas.MIDAS.setup_mudata(
    mdata,
    batch_key='batch',
    dims_x=mdata.uns['feat_dims'],  # only if ATAC chunks
)
model = scmidas.MIDAS(mdata)
Advanced: directory format
The directory format below is the on-disk layout produced by the
preprocessing scripts on the
reproducibility branch,
and is what the bundled demo1 / demo2 / demo3 datasets ship as.
For most users, scmidas.datasets.from_dir() (above) is the simplest
way to consume it. To read the directory format directly instead,
use MIDAS.configure_data_from_dir(), which accepts this layout:
./dataset_path/
    feat/
        feat_dims.toml
    batch_0/
        mat/<modality>.mtx
        mask/<modality>.csv    # optional, 1-row 0/1 CSV per modality
        cell_names.csv         # optional, one cell ID per row
    batch_1/
        ...
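Before handing such a directory to a loader, it can be useful to see which batches and modalities it actually contains. The sketch below uses only the standard library and assumes the layout shown above, with batch directories named `batch_*`, each holding a `mat/` folder of `.mtx` files. `scan_midas_dir` is a hypothetical helper, not an scmidas function, and it does not read the matrices, masks, or `feat_dims.toml`.

```python
from pathlib import Path


def scan_midas_dir(root):
    """Map each batch directory name to the sorted modalities found in mat/."""
    root = Path(root)
    layout = {}
    # Note: names sort lexicographically, so batch_10 comes before batch_2.
    for batch_dir in sorted(root.glob("batch_*")):
        if not batch_dir.is_dir():
            continue
        mat_dir = batch_dir / "mat"
        if mat_dir.is_dir():
            modalities = sorted(p.stem for p in mat_dir.glob("*.mtx"))
        else:
            modalities = []
        layout[batch_dir.name] = modalities
    return layout
```

A batch that maps to an empty list has no `mat/*.mtx` files, which usually points at a wrong root path or an incomplete export.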
Note
MIDAS.configure_data_from_mdata() and
MIDAS.configure_data_from_dir() are kept for backwards
compatibility but emit a DeprecationWarning. Use
MIDAS.setup_mudata() + MIDAS for new code.
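When migrating older scripts, it can help to detect whether a code path still triggers a DeprecationWarning, since Python hides these by default in many contexts. The sketch below relies only on the standard `warnings` module. `legacy_configure` is a hypothetical stand-in for a deprecated call such as MIDAS.configure_data_from_dir(), and its warning message is invented for illustration.

```python
import warnings


def legacy_configure():
    """Hypothetical stand-in for a deprecated entry point."""
    warnings.warn("use setup_mudata() instead", DeprecationWarning, stacklevel=2)
    return "configured"


def uses_deprecated_api(func):
    """Run func and report whether it emitted a DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let default filters hide it
        func()
    return any(issubclass(w.category, DeprecationWarning) for w in caught)
```

In a test suite, the same effect can be had by running Python with `-W error::DeprecationWarning`, which turns any remaining legacy call into an exception.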