first draft for a potential JOSS paper #496

Open
wants to merge 9 commits into
base: main

Saransh-cpp
Member

@Saransh-cpp Saransh-cpp commented Aug 16, 2024

Description

The draft PDF can be downloaded from - https://github.com/scikit-hep/vector/actions/runs/10420364323

JOSS paper format - https://joss.readthedocs.io/en/latest/paper.html
JOSS submission guidelines - https://joss.readthedocs.io/en/latest/submitting.html

Checklist

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't any other open Pull Requests for the required change?
  • Does your submission pass pre-commit? ($ pre-commit run --all-files or $ nox -s lint)
  • Does your submission pass tests? ($ pytest or $ nox -s tests)
  • Does the documentation build with your changes? ($ cd docs; make clean; make html or $ nox -s docs)
  • Does your submission pass the doctests? ($ pytest --doctest-plus src/vector/ or $ nox -s doctests)

Before Merging

  • Summarize the commit messages into a brief review of the Pull request.

codecov bot commented Aug 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.85%. Comparing base (13a6370) to head (c66629a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #496   +/-   ##
=======================================
  Coverage   86.85%   86.85%           
=======================================
  Files          96       96           
  Lines       11919    11919           
=======================================
  Hits        10352    10352           
  Misses       1567     1567           

Member

@jpivarski jpivarski left a comment

Looking good!

Are you sure you want to write the paper in the Vector repo itself, rather than as a separate repo just for the paper? How is it normally done?

paper/paper.md Outdated

# Summary

Vector algebra is a crucial component of data analysis pipelines in high energy
Member

"Vector algebra" could mean different things to different people. For me, "algebra" makes me think "abstract algebra," so I'd be looking for addition-like and multiplication-like operators with some properties like associativity or distributivity and some sense of closure. That would lead me to think it's a linear algebra library, like BLAS. That's not what you mean here!

Depending on the reader's background, "vector" can mean

  • a 2D, 3D, or 4D physical space vector, like what the Vector library is actually about,
  • an N×1 or 1×N matrix without physical interpretation, as in ordinary linear algebra,
  • an N×1 or 1×N vector or covector ("1-form") of geometric algebra, which could live in a non-Euclidean metric (as our 4D Lorentz vectors already do),
  • a member of an infinite-dimensional space, like a Hilbert space, such as a quantum state represented by a bra <x| or ket |x>,
  • the input or output of a machine learning model, consisting of a fixed number of features or predictions,
  • the direction that an airplane is flying,
  • a collection-type data structure that isn't quite an array because it has variable length, like a C++ std::vector,
  • a collection-type data structure that isn't quite an array because it's immutable, like a Lisp vector or a vector in other functional libraries,
  • an organism, object, or environmental current that carries disease from one population to another,
  • a plasmid or virus that carries genetic material into a host cell in genetic engineering,
  • graphic primitives that use precisely positioned elements with infinite resolution, as in SVG or PDF file formats, rather than rasterized images like PNG or JPG,
  • the direction and magnitude of a literary or cultural trend, a philosophical argument or line of thought, a narrative in fiction, or an architectural design.

So you'll need to narrow in quickly and let the reader know that this is about 2D and 3D Euclidean vectors and 4D Lorentz vectors that can be used as physical quantities, such as position, momentum, and forces. Instead of "algebra," a word like "common operations" or "mathematical manipulations"?
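To make the intended meaning concrete, here is a minimal sketch in plain Python (deliberately not the Vector library's API) of the difference between a Euclidean vector and a Lorentz vector. The function names and the (+, -, -, -) metric signature are assumptions chosen for illustration.

```python
import math


def euclidean_norm(*components):
    """Length of a 2D or 3D Euclidean vector: sqrt(x^2 + y^2 [+ z^2])."""
    return math.sqrt(sum(c * c for c in components))


def invariant_mass(E, px, py, pz):
    """Invariant mass of a 4D Lorentz (energy-momentum) vector.

    Assumes the (+, -, -, -) signature, so m^2 = E^2 - (px^2 + py^2 + pz^2).
    Intended for timelike or lightlike vectors; a negative m^2 is clamped to zero.
    """
    m2 = E * E - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))


length = euclidean_norm(3.0, 4.0)            # Euclidean: components add in quadrature
mass = invariant_mass(5.0, 3.0, 0.0, 0.0)    # Lorentz: spatial part is subtracted
```

The only difference between the two functions is the sign given to the spatial components, which is the non-Euclidean metric mentioned in the list above.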

paper/paper.md Outdated
physics, enabling physicists to transform raw data into meaningful results that
can be visualized. Given that high energy physics data is not uniform, the
vector algebra frameworks or libraries are expected to work readily on
non-uniform or jagged data, allowing users to perform operations on an entire
Member

A definition of "jagged" will be needed. Ever since I started using this word, I've found that "ragged" is more common. A potentially confusing thing is that sometimes a "vector" is a collection-type data structure, and what we have here is a collection that contains (non-collection) vectors.
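A small illustration might help with that definition. The sketch below uses plain Python lists rather than Awkward Array or Vector, and the event contents are made up: the outer collection is "ragged" because each event holds a different number of particles, while each particle is a fixed-size physical vector.

```python
import math

# Three hypothetical events with 2, 0, and 1 particles.
# Each particle is a (px, py) pair, i.e. a 2D momentum vector.
events = [
    [(3.0, 4.0), (1.0, 0.0)],
    [],
    [(0.0, 2.0)],
]


def transverse_momenta(events):
    """Compute sqrt(px^2 + py^2) per particle, preserving the ragged structure."""
    return [[math.hypot(px, py) for px, py in event] for event in events]


pt = transverse_momenta(events)
```

The result has the same ragged shape as the input, with one number per particle and an empty list for the empty event. This is the "collection that contains vectors" situation, which is distinct from a vector that is itself a collection.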

paper/paper.md Outdated
Comment on lines 55 to 56
scientific or engineering application. The library houses 3+2 numerical
backends for experimental physicists and 1 symbolic backend for theoretical
Member

I'm not sure what 3+2 numerical backends means. It could be worthwhile to present all of the numerical backends and their purposes in a bulleted list. A strong point is the diversity of types of backends, from scalars (builtin), to collection types (NumPy and Awkward), to symbolic (SymPy).

paper/paper.md Outdated
Comment on lines 67 to 68
Vector has become the de facto library for vector algebra in Python based high
energy physics data analysis pipelines. The library has been installed over
Member

de facto library

That's too strong: many high energy physics data analyses use ROOT's new-style LorentzVectors or TLorentzVector. The latter has been deprecated for decades, but people still use it.

I see that you specify "Python", but there's PyROOT, so I'm not sure that the Vector uses outnumber the PyROOT-TLorentzVector uses.

It's enough to say that it's widely used, and you quote some numbers below.

paper/paper.md Outdated

Vector has become the de facto library for vector algebra in Python based high
energy physics data analysis pipelines. The library has been installed over
2 million times and 314 GitHub repositories use it as a dependency at the time
Member

Download count is a notoriously misleading metric of how often software is used. ("Notorious" because even though its problems are known, people still use it. It's hard to do anything better.) In particular,

  • continuous testing frameworks and some highly parallel workloads will pip install vector or conda install vector as a first step, which inflates the numbers,
  • it doesn't capture the difference between
    • users who download it once, never update versions, but use it every day and
    • (non-)users who update it daily with all the rest of the software on their computer, but never use it,
  • it unintentionally captures the difference between
    • periods in which you release patches frequently (and users who live at head download every one of them),
    • periods in which there are no bugs so you don't release new versions at all (but users are still using it, all the same).

If you want to get quantitative in this paper, you might want to consider following the method of https://github.com/jpivarski-talks/2023-08-14-awkward-stats-update to

  1. use GitHub's dependency graph to get a list of all the repos that might be using Vector,
  2. git clone them all,
  3. search them for "import vector": egrep -ral "(import\b.*\bvector|vector\b.*\bimport)" * --include="*.py" --include="*.ipynb",
  4. use git log --format=%cd "$z" on all the matching files to find out when they were last touched,
  5. make a plot like this (the plot image is not included here).

This would quantify the number of direct users of Vector, the people who know that they are using it, as opposed to the people who get it through another library. Pretty soon, this won't be a good metric anymore because of indirect users through Coffea, but you could do a Vector + Coffea-vector plot then.
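Steps 3 and 4 above can be sketched in Python roughly as follows. This is an illustrative sketch, not the code from the linked repository: the function names are hypothetical, the regular expression is copied from the egrep command and applied line by line as grep does, and the `-1` flag is an addition so that `git log` returns only the most recent commit date.

```python
import re
import subprocess
from pathlib import Path

# Pattern taken from the egrep command in step 3. It is permissive by design,
# so it can also match lines such as "import vectorize".
IMPORT_VECTOR = re.compile(r"(import\b.*\bvector|vector\b.*\bimport)")
SUFFIXES = {".py", ".ipynb"}


def files_importing_vector(root):
    """Return the .py and .ipynb files under root that contain a matching line."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in SUFFIXES or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if any(IMPORT_VECTOR.search(line) for line in text.splitlines()):
            matches.append(path)
    return matches


def last_touched(repo, path):
    """Date of the most recent commit that touched path (step 4).

    Assumes git is installed and repo is a cloned repository containing path.
    """
    result = subprocess.run(
        ["git", "-C", str(repo), "log", "-1", "--format=%cd", "--",
         str(Path(path).resolve())],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `files_importing_vector` on each cloned repository and then `last_touched` on every match would give the dates needed for the plot in step 5.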

Another benefit of an analysis like this is that you can find out how users are using your interface—which functions they use most, and in what ways. I talked about that in this presentation and I did a similar analysis for Numba.

Member

I'm not sure "inflate the numbers" is correct - The number is "downloads", not "number of users". It isn't a measure of number of users due to the issues listed above, but it does measure interest in/usefulness of the package in some form. A CI job doing something with vector still means someone is doing something with vector.

Also, on the flip side, CI jobs may use caches (uv and pixi both have fairly popular actions that cache by default), so that might hide "downloads" by not actually downloading from PyPI.

Other analyses are very useful, but the download count is still useful as well. They wouldn't replace it; it's just a different metric.

Member Author

Thanks for the resources here! I went through them and also through several other JOSS papers. I did not see statistics in any of them, so I think I will just remove them entirely.

paper/paper.md Outdated
Comment on lines 48 to 50
Vector is currently the only Lorentz vector library providing a Pythonic
interface but a C++ (through Awkward Array [@Pivarski:2018]) computational
backend. Vector integrates seamlessly with the existing high energy physics
Member

I'm not sure about this statement. NumPy is a compiled backend (though C instead of C++). And PyROOT is Python backed by C++. I think I'd rework it a bit to state what it is, and not focus on the "only".

@Saransh-cpp
Member Author

Thanks for the reviews, @jpivarski and @henryiii! I have made corrections, but could you please review it again whenever you get the time?

Are you sure you want to write the paper in the Vector repo itself, rather than as a separate repo just for the paper? How is it normally done?

The JOSS submission guidelines say -

  • Your paper (paper.md and BibTeX files, plus any figures) must be hosted in a Git-based repository together with your software.
  • The paper may be in a short-lived branch which is never merged with the default, although if you do this, make sure this branch is created from the default so that it also includes the source code of your submission.

I would prefer not to merge this in and to use this PR only to review the content. Once everything is reviewed, I will submit the paper from the short-lived branch and delete the branch once the paper is published.

* better definition of vector algebra
* don't use only/de-facto - mention PyROOT, fix language
* expand on the backends
* jagged -> ragged + a definition for ragged