Calculations of mito directly generate columns with dtypes that break qc calculations on subsequent filtering #116

Open
pcm32 opened this issue Sep 21, 2022 · 3 comments

pcm32 commented Sep 21, 2022

Running a first filter step (genes or cells) when no mito column is given as part of the cell metadata generates a mito column that pandas probably treats as logical (rather than the categorical it may become when read from the metadata file). This leads to the following error:

Traceback (most recent call last):
  File "/usr/local/tools/_conda/envs/[email protected]/bin/scanpy-filter-genes", line 10, in <module>
    sys.exit(FILTER_CMD())
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scanpy_scripts/cmd_utils.py", line 46, in cmd
    func(adata, **kwargs)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scanpy_scripts/cmd_utils.py", line 288, in matrix_function
    func(
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scanpy_scripts/lib/_filter.py", line 75, in filter_anndata
    sc.pp.calculate_qc_metrics(
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scanpy/preprocessing/_qc.py", line 306, in calculate_qc_metrics
    obs_metrics = describe_obs(
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scanpy/preprocessing/_qc.py", line 123, in describe_obs
    X[:, adata.var[qc_var].values].sum(axis=1)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
    row, col = self._validate_indices(key)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scipy/sparse/_index.py", line 147, in _validate_indices
    col = self._asindices(col, N)
  File "/usr/local/tools/_conda/envs/[email protected]/lib/python3.9/site-packages/scipy/sparse/_index.py", line 169, in _asindices
    if max_indx >= length:
TypeError: '>=' not supported between instances of 'str' and 'int'

Most likely categorical columns (judging by their pandas dtype) get excluded from that qc_vars list, while boolean/logical ones possibly do not (or the other way around).

pcm32 commented Sep 21, 2022

The object that fails has the following dtypes:

>>> wmi.var.dtypes
gene_symbols               object
mito                     category
n_cells_by_counts           int64
mean_counts               float32
log1p_mean_counts         float32
pct_dropout_by_counts     float64
total_counts              float32
log1p_total_counts        float32
n_counts                  float32
n_cells                     int64
dtype: object

The AnnData object where the gene metadata (with mito) is loaded a priori, and which doesn't fail, looks like this:

>>> wm_ni.var.dtypes
gene_symbols              object
mito                        bool
n_cells_by_counts          int64
mean_counts              float32
log1p_mean_counts        float32
pct_dropout_by_counts    float64
total_counts             float32
log1p_total_counts       float32
n_counts                 float32
n_cells                    int64
dtype: object

So it seems that the subsequent QC call accepts bool but not category (the code is actually setting that column to category at https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/develop/scanpy_scripts/lib/_filter.py#L40).
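
To illustrate the dtype difference described above, here is a minimal sketch using a toy DataFrame as a hypothetical stand-in for the `.var` table (the column names `mito_bool` and `mito_cat` and the gene labels are made up for the example). It only shows what reaches the indexing step in each case; it does not call scanpy or scipy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adata.var: the same flag stored two ways.
var = pd.DataFrame(
    {
        "mito_bool": [False, True, False],
        "mito_cat": pd.Categorical(["False", "True", "False"]),
    },
    index=["geneA", "geneB", "geneC"],
)

# What an indexing routine effectively receives once the column is
# turned into a plain NumPy array.
bool_values = np.asarray(var["mito_bool"].values)
cat_values = np.asarray(var["mito_cat"].values)

print(bool_values.dtype)        # a boolean mask, usable for column selection
print(type(cat_values[0]))      # Python str: 'False' / 'True' labels, not a mask
```

With the categorical column, the values handed to the indexer are the strings 'False' and 'True', which is consistent with the `'>=' not supported between instances of 'str' and 'int'` message in the traceback.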

pcm32 commented Sep 21, 2022

And this line then reproduces the error:

>>> wm_ni.X[:, wm_ni.var['mito'].values].sum(axis=1)
matrix([[34.],
        [40.],
        [42.],
        ...,
        [24.],
        [54.],
        [25.]], dtype=float32)
>>> wmi.X[:, wmi.var['mito'].values].sum(axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
    row, col = self._validate_indices(key)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 168, in _validate_indices
    col = self._asindices(col, N)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 190, in _asindices
    if max_indx >= length:
TypeError: '>=' not supported between instances of 'str' and 'int'
>>> wmi.var['mito'].dtypes
CategoricalDtype(categories=['False', 'True'], ordered=False)

Now, the question is why we might be explicitly setting that var column to categorical. At least I can say that switching to bool there doesn't seem to break the SCXA main workflow downstream.
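
One possible workaround, sketched under assumptions: convert the 'True'/'False' categorical back to a boolean mask before it is used for column selection. The helper name `mito_as_bool` is hypothetical and not part of scanpy-scripts, and a small dense NumPy array stands in for the sparse `X` matrix to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def mito_as_bool(column: pd.Series) -> pd.Series:
    """Hypothetical helper: map a bool or 'True'/'False' categorical column to bool."""
    # astype(str) yields 'True'/'False' labels for both bool and categorical input.
    return column.astype(str) == "True"


# Toy data: 2 cells x 3 genes, with the last two genes flagged as mito.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
var = pd.DataFrame(
    {"mito": pd.Categorical(["False", "True", "True"])},
    index=["geneA", "geneB", "geneC"],
)

mask = mito_as_bool(var["mito"]).to_numpy()

# Per-cell sum over the flagged genes, mirroring the failing expression above.
mito_counts = X[:, mask].sum(axis=1)
print(mito_counts)
```

Whether this conversion is compatible with the negative filtering use case that motivated the categorical dtype would still need to be checked.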

pcm32 commented Sep 21, 2022

The change was introduced at https://github.com/ebi-gene-expression-group/scanpy-scripts/pull/70/files#diff-d4f03c482ed8ddbd6f6e9754d2e42001963362aa3958ee56918f9210747ef2f4R39 to allow negative filtering searches, as attempted in #69.
