Standardize and append a batch of data#
Here, we’ll learn how to standardize a less well-curated collection and how to append it to the growing versioned collection.
import lamindb as ln
import bionty as bt
ln.settings.verbosity = "hint"
bt.settings.auto_save_parents = False
💡 connected lamindb: testuser1/test-scrna
ln.transform.stem_uid = "ManDYgmftZ8C"
ln.transform.version = "1"
ln.track()
💡 Assuming editor is Jupyter Lab.
💡 notebook imports: bionty==0.42.4 lamindb==0.69.2
💡 saved: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type=notebook, updated_at=2024-03-28 12:10:27 UTC, created_by_id=1)
💡 saved: Run(uid='nPlAwWlPBNq66gRBfb3W', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_nPlAwWlPBNq66gRBfb3W.txt
Standardize a data shard#
Let’s now consider a collection with less well-curated features:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
We are still working with human data, and can globally set an organism:
bt.settings.organism = "human"
annotate = ln.Annotate.from_anndata(adata, var_field=bt.Gene.symbol, obs_fields={"cell_type": bt.CellType.name})
❗ 3 non-validated features are not registered with Feature.name: ['n_genes', 'percent_mito', 'louvain']!
→ to lookup categories, use .lookup().['feature']
→ to register, run register_features(validated_only=False)
✅ registered 5 labels from public with Gene.symbol: ['GPX1', 'SOD2', 'RN7SL1', 'SNORD3B-2', 'IGLL5']
❗ 11 non-validated labels are not registered with Gene.symbol: ['RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5']!
→ to lookup categories, use .lookup().['variables']
→ to register, set validated_only=False
Standardize & validate genes #
This data shard is indexed by gene symbols, which we’ll want to map to Ensembl IDs.
Let’s convert the symbols to Ensembl IDs via standardize(). Note that this is an ambiguous mapping: the first match is kept because the keep arg of .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
# We only want to register data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
💡 standardized 754/765 terms
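To make the "first match is kept" behavior and the follow-up filtering concrete, here is a minimal pure-Python sketch. It does not call bionty: the lookup table, the symbols, the IDs, and the helper names standardize_first and validate_ids are all invented for illustration, and it assumes that symbols without a match pass through unchanged, which is why a validation step is needed afterwards.

```python
# Illustrative stand-in for a symbol -> ID lookup table.
# Symbols and IDs are invented; they are not real Ensembl records.
LOOKUP = [
    ("GENE_A", "ID_001"),
    ("GENE_B", "ID_002"),
    ("GENE_B", "ID_003"),  # second ID for the same symbol: ambiguous mapping
    ("GENE_C", "ID_004"),
]


def standardize_first(symbols, lookup):
    """Map each symbol to its first matching ID; leave unmatched symbols as-is."""
    first_match = {}
    for symbol, identifier in lookup:
        # setdefault stores only the first ID seen for each symbol
        first_match.setdefault(symbol, identifier)
    return [first_match.get(symbol, symbol) for symbol in symbols]


def validate_ids(values, lookup):
    """Return one boolean per value: True if the value is a known ID."""
    known_ids = {identifier for _, identifier in lookup}
    return [value in known_ids for value in values]


symbols = ["GENE_A", "GENE_B", "GENE_X"]
mapped = standardize_first(symbols, LOOKUP)  # GENE_X has no match
mask = validate_ids(mapped, LOOKUP)          # boolean mask, like `validated`
kept = [value for value, ok in zip(mapped, mask) if ok]
```

The boolean mask plays the same role as validated in the cell above: it is what selects the columns that survive into adata_validated.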
Here, we’ll also keep .raw, subset to the validated genes and re-indexed by Ensembl ID:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
annotate = ln.Annotate.from_anndata(adata_validated, var_field=bt.Gene.ensembl_gene_id, obs_fields={"cell_type": bt.CellType.name})
❗ 3 non-validated features are not registered with Feature.name: ['n_genes', 'percent_mito', 'louvain']!
→ to lookup categories, use .lookup().['feature']
→ to register, run register_features(validated_only=False)
annotate.validate()
💡 inspecting 'variables' by Gene.ensembl_gene_id
✅ all variabless are validated
💡 inspecting 'cell_type' by CellType.name
❗ 9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
→ register terms via .update_registry('cell_type')
False
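Conceptually, validation here is an exact-match membership test against the names in a registry. The sketch below is a simplified stand-in, not lamindb's implementation: the inspect helper is hypothetical, and the registry contains just two standardized names that appear later in this guide.

```python
def inspect(values, registry):
    """Split the unique values into validated and non-validated lists (order preserved)."""
    unique = []
    for value in values:
        if value not in unique:
            unique.append(value)
    validated = [value for value in unique if value in registry]
    non_validated = [value for value in unique if value not in registry]
    return validated, non_validated


# Hypothetical registry content; matching is exact and case-sensitive
registry_names = {"dendritic cell", "B cell, CD19-positive"}
obs_cell_types = ["Dendritic cells", "CD19+ B", "Dendritic cells", "CD34+"]

validated, non_validated = inspect(obs_cell_types, registry_names)
all_valid = not non_validated  # plays the role of the boolean returned above
```

Because the comparison is exact, a name such as 'Dendritic cells' does not validate against 'dendritic cell', which is why the names need to be standardized first.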
Standardize & validate cell types #
None of the cell types can be automatically registered:
annotate.update_registry("cell_type")
❗ 9 non-validated labels are not registered with CellType.name: ['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']!
→ to lookup categories, use .lookup().['cell_type']
→ to register, set validated_only=False
Let’s search the public ontology for each cell type name and add the name found in the AnnData
object as a synonym to the top match.
bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
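The search-and-map loop above can be sketched without any ontology backend. The example below is only an illustration: the token-overlap score is a deliberately naive stand-in and not the algorithm bionty's search uses, the ontology IDs are placeholders, and the helper names tokens and search are hypothetical.

```python
# Placeholder ontology: (id, name) pairs with invented IDs
ONTOLOGY = [
    ("X:0001", "dendritic cell"),
    ("X:0002", "B cell, CD19-positive"),
    ("X:0003", "cytotoxic T cell"),
]


def tokens(text):
    """Lowercase a name and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())


def search(query, ontology):
    """Return the (id, name) entry sharing the most tokens with the query.

    Ties go to the earliest entry, because max() keeps the first maximum.
    """
    query_tokens = tokens(query)
    return max(ontology, key=lambda entry: len(query_tokens & tokens(entry[1])))


name_mapper = {}  # original name -> standardized name
synonyms = {}     # ontology id -> original names recorded as synonyms
for name in ["Dendritic cells", "CD19+ B", "CD8+ Cytotoxic T"]:
    ontology_id, standard_name = search(name, ONTOLOGY)
    name_mapper[name] = standard_name
    synonyms.setdefault(ontology_id, []).append(name)
```

A top match is only the best-scoring candidate, not necessarily the correct term, so it is worth reviewing name_mapper before applying it to the data.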
We can now standardize cell type names using the search-based mapper:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
Now, all cell types are validated:
annotate.validate()
💡 inspecting 'variables' by Gene.ensembl_gene_id
✅ all variabless are validated
💡 inspecting 'cell_type' by CellType.name
✅ all cell_types are validated
True
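When a dictionary is passed to pandas' Series.map, values missing from the dictionary become NaN rather than raising an error, so an incomplete mapper can silently drop labels. A small guard, sketched here in pure Python with a hypothetical apply_mapper helper and an example mapper, makes that failure explicit before any data is overwritten.

```python
def apply_mapper(values, mapper):
    """Map every value through mapper, failing loudly if any value is unmapped."""
    unmapped = sorted(set(values) - set(mapper))
    if unmapped:
        raise KeyError(f"no standardized name for: {unmapped}")
    return [mapper[value] for value in values]


# Example mapper; the standardized names are taken from the output in this guide
name_mapper = {
    "Dendritic cells": "dendritic cell",
    "CD19+ B": "B cell, CD19-positive",
}

standardized = apply_mapper(["CD19+ B", "Dendritic cells", "CD19+ B"], name_mapper)

# A value without an entry triggers the guard instead of becoming a missing value
try:
    apply_mapper(["CD34+"], name_mapper)
    guard_triggered = False
except KeyError:
    guard_triggered = True
```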
Register #
artifact = annotate.register_artifact(description="10x reference adata")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/hcUbEXbalc3FN8L7TAhB.h5ad')
✅ storing artifact 'hcUbEXbalc3FN8L7TAhB' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/hcUbEXbalc3FN8L7TAhB.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅ 754 terms (100.00%) are validated for ensembl_gene_id
✅ linked: FeatureSet(uid='s8CHXzK9oEAS9ea58B4h', n=754, type='number', registry='bionty.Gene', hash='j8QkIeLBgJwsscY4vVPx', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 1 term (25.00%) is validated for name
❗ 3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅ linked: FeatureSet(uid='mXVynNw0wz6zhcuNNBQA', n=1, registry='core.Feature', hash='MeMDEJKhjEl8HpY4r3UL', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ registered artifact in testuser1/test-scrna
artifact.view_lineage()
Append the shard to the collection#
Query the previous collection:
collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()
Create a new version of the collection by sharding it across the new artifact
and the artifact underlying version 1 of the collection:
collection_v2 = ln.Collection(
    [artifact, collection_v1.artifact],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
✅ loaded: FeatureSet(uid='lS5y9QC982mHqHqWIyGo', n=4, registry='core.Feature', hash='xdku76mnTOIeueg6lA4G', updated_at=2024-03-28 12:10:19 UTC, created_by_id=1)
💡 adding collection [1] as input for run 2, adding parent transform 1
💡 adding artifact [1] as input for run 2, adding parent transform 1
✅ saved 1 feature set for slot: 'var'
💡 transferring cell_type
💡 transferring donor
💡 transferring tissue
💡 transferring cell_type
💡 transferring assay
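The versioning pattern used above (a new version that holds the previous shards plus a new one and inherits labels from both) can be modeled with a small data structure. This is a conceptual sketch only: Shard, VersionedCollection, and new_version are hypothetical names, and nothing here reflects how lamindb stores collections internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Shard:
    name: str
    labels: set = field(default_factory=set)


@dataclass
class VersionedCollection:
    shards: list
    version: str = "1"
    previous: Optional["VersionedCollection"] = None
    labels: set = field(default_factory=set)

    def new_version(self, extra_shard):
        """Return a successor holding the new shard plus all existing shards.

        The current version is left untouched, so older versions stay queryable.
        """
        return VersionedCollection(
            shards=[extra_shard, *self.shards],
            version=str(int(self.version) + 1),
            previous=self,
            # labels are the union of the new shard's and the previous version's
            labels=set(extra_shard.labels) | set(self.labels),
        )


v1 = VersionedCollection(
    shards=[Shard("shard_1", {"blood"})],
    labels={"blood"},
)
v2 = v1.new_version(Shard("shard_2", {"dendritic cell"}))
```

Keeping a reference to the previous version, rather than mutating it, is what allows version 1 to be queried exactly as before while version 2 grows.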
Version 2 of the collection covers significantly more conditions: 40 cell types, 17 tissues, 3 assays, and 12 donors.
collection_v2.describe()
Collection(uid='k8AG1SK3O5uwnd13RYSP', name='My versioned scRNA-seq collection', version='2', hash='HNR3VFV60_yqRnUka11E', visibility=1, updated_at=2024-03-28 12:10:49 UTC)
Provenance:
💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type=notebook, updated_at=2024-03-28 12:10:27 UTC, created_by_id=1)
👣 run: Run(uid='nPlAwWlPBNq66gRBfb3W', started_at=2024-03-28 12:10:27 UTC, is_consecutive=True, transform_id=2, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-28 12:08:16 UTC)
Features:
var: FeatureSet(uid='wup3ifTGwiHX9Tljnmne', n=36508, type='number', registry='bionty.Gene', hash='b5NMddLHEyZqn-vSYvBI', updated_at=2024-03-28 12:10:47 UTC, created_by_id=1)
'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
obs: FeatureSet(uid='lS5y9QC982mHqHqWIyGo', n=4, registry='core.Feature', hash='xdku76mnTOIeueg6lA4G', updated_at=2024-03-28 12:10:19 UTC, created_by_id=1)
🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🔗 cell_type (40, bionty.CellType): 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
🔗 assay (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1'
Labels:
🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🏷️ cell_types (40, bionty.CellType): 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
🏷️ experimental_factors (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1'
🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
View data lineage:
collection_v2.view_lineage()