first draft for a potential JOSS paper #496
bad61ad to c66629a
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #496   +/-   ##
=======================================
  Coverage   86.85%   86.85%
=======================================
  Files          96       96
  Lines       11919    11919
=======================================
  Hits        10352    10352
  Misses       1567     1567
Looking good!
Are you sure you want to write the paper in the Vector repo itself, rather than as a separate repo just for the paper? How is it normally done?
paper/paper.md
Outdated
# Summary
Vector algebra is a crucial component of data analysis pipelines in high energy
"Vector algebra" could mean different things to different people. For me, "algebra" makes me think "abstract algebra," so I'd be looking for addition-like and multiplication-like operators with some properties like associativity or distributivity and some sense of closure. That would lead me to think it's a linear algebra library, like BLAS. That's not what you mean here!
Depending on the reader's background, "vector" can mean
- a 2D, 3D, or 4D physical space vector, like what the Vector library is actually about,
- an N×1 or 1×N matrix without physical interpretation, as in ordinary linear algebra,
- an N×1 or 1×N vector or covector ("1-form") of geometric algebra, which could live in a non-Euclidean metric (as our 4D Lorentz vectors already do),
- a member of an infinite-dimensional space, like a Hilbert space, such as a quantum state represented by a bra `<x|` or ket `|x>`,
- the input or output of a machine learning model, consisting of a fixed number of features or predictions,
- the direction that an airplane is flying,
- a collection-type data structure that isn't quite an array because it has variable length, like a C++ `std::vector`,
- a collection-type data structure that isn't quite an array because it's immutable, like a Lisp vector or a vector in other functional libraries,
- an organism, object, or environmental current that carries disease from one population to another,
- a plasmid or virus that carries genetic material into a host cell in genetic engineering,
- graphic primitives that use precisely positioned elements with infinite resolution, as in SVG or PDF file formats, rather than rasterized images like PNG or JPG,
- the direction and magnitude of a literary or cultural trend, a philosophical argument or line of thought, a narrative in fiction, or an architectural design.
So you'll need to narrow in quickly and let the reader know that this is about 2D and 3D Euclidean vectors and 4D Lorentz vectors that can be used as physical quantities, such as position, momentum, and forces. Instead of "algebra," a word like "common operations" or "mathematical manipulations"?
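As a rough illustration of the distinction being asked for, the two kinds of "length" can be written side by side. This is only a sketch in one common convention (natural units with c = 1 and metric signature (+, -, -, -)); the symbols are assumptions for illustration, not wording taken from the paper:

```latex
\begin{align}
  \lvert \vec{v} \rvert^2 &= x^2 + y^2 + z^2
    && \text{(3D Euclidean vector)} \\
  m^2 &= E^2 - p_x^2 - p_y^2 - p_z^2
    && \text{(4D Lorentz vector, } c = 1\text{)}
\end{align}
```

The minus signs in the second line are what make the 4D case non-Euclidean.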
paper/paper.md
Outdated
physics, enabling physicists to transform raw data into meaningful results that
can be visualized. Given that high energy physics data is not uniform, the
vector algebra frameworks or libraries are expected to work readily on
non-uniform or jagged data, allowing users to perform operations on an entire
A definition of "jagged" will be needed. Ever since I started using this word, I've found that "ragged" is more common. A potentially confusing thing is that sometimes a "vector" is a collection-type data structure, and what we have here is a collection that contains (non-collection) vectors.
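One possible way to pin down "jagged" for readers is a tiny plain-Python picture: a collection whose entries have different lengths, where each innermost item is a single physical vector. The sketch below uses only the standard library; the function name `transverse_momenta` and the numbers are made up for illustration and are not part of Vector's API.

```python
import math

# Each inner list is one event. Events hold different numbers of particles,
# so the outer collection is "jagged" (also called "ragged").
# Each particle is a 2D momentum vector (px, py); the values are invented.
events = [
    [(3.0, 4.0), (6.0, 8.0)],  # event with two particles
    [],                        # event with no particles
    [(5.0, 12.0)],             # event with one particle
]


def transverse_momenta(events):
    """Return the magnitude of every (px, py) vector, keeping the jagged shape."""
    return [[math.hypot(px, py) for px, py in event] for event in events]


pt = transverse_momenta(events)
print([len(event) for event in pt])  # lengths differ per event: [2, 0, 1]
```

The result has the same nesting as the input, which is the property a jagged-aware library needs to preserve.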
paper/paper.md
Outdated
scientific or engineering application. The library houses 3+2 numerical
backends for experimental physicists and 1 symbolic backend for theoretical
I'm not sure what "3+2 numerical backends" means. It could be worthwhile to present all of the numerical backends and their purposes in a bulleted list. A strong point is the diversity of types of backends, from scalars (builtin), to collection types (NumPy and Awkward), to symbolic (SymPy).
paper/paper.md
Outdated
Vector has become the de facto library for vector algebra in Python based high
energy physics data analysis pipelines. The library has been installed over
de facto library
That's too strong: many high energy physics data analyses use ROOT's new-style LorentzVectors and TLorentzVector, which has been deprecated for decades, but people still use it.
I see that you specify "Python", but there's PyROOT, so I'm not sure that the Vector uses outnumber the PyROOT-TLorentzVector uses.
It's enough to say that it's widely used, and you quote some numbers below.
paper/paper.md
Outdated
Vector has become the de facto library for vector algebra in Python based high
energy physics data analysis pipelines. The library has been installed over
2 million times and 314 GitHub repositories use it as a dependency at the time
Download count is a notoriously misleading metric of how often software is used. ("Notorious" because even though its problems are known, people still use it. It's hard to do anything better.) In particular,
- continuous testing frameworks and some highly parallel workloads will `pip install vector` or `conda install vector` as a first step, which inflates the numbers,
- it doesn't capture the difference between
  - users who download it once, never update versions, but use it every day and
  - (non-)users who update it daily with all the rest of the software on their computer, but never use it,
- it unintentionally captures the difference between
  - periods in which you release patches frequently (and users who live at head download every one of them),
  - periods in which there are no bugs so you don't release new versions at all (but users are still using it, all the same).
If you want to get quantitative in this paper, you might want to consider following the method of https://github.com/jpivarski-talks/2023-08-14-awkward-stats-update to
- use GitHub's dependency graph to get a list of all the repos that might be using Vector,
- `git clone` them all,
- search them for "import vector": `egrep -ral "(import\b.*\bvector|vector\b.*\bimport)" * --include="*.py" --include="*.ipynb"`,
- use `git log --format=%cd "$z"` on all the matching files to find out when they were last touched,
- make a plot like this:
This would quantify the number of direct users of Vector, the people who know that they are using it, as opposed to the people who get it through another library. Pretty soon, this won't be a good metric anymore because of indirect users through Coffea, but you could do a Vector + Coffea-vector plot then.
Another benefit of an analysis like this is that you can find out how users are using your interface—which functions they use most, and in what ways. I talked about that in this presentation and I did a similar analysis for Numba.
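The search and dating steps in the list above could also be scripted in Python. This is a minimal sketch, not the method used in the linked repository: it assumes the candidate repos are already cloned under one directory, reuses the regular expression from the `egrep` command, and adds `-1` to `git log` so that only the most recent commit date is returned. The function names `files_importing_vector` and `last_commit_date` are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Same pattern as the egrep command quoted in the list above.
IMPORT_VECTOR = re.compile(r"(import\b.*\bvector|vector\b.*\bimport)")
SUFFIXES = {".py", ".ipynb"}


def files_importing_vector(root):
    """Return sorted .py/.ipynb paths under `root` with a line matching the pattern."""
    matches = []
    for path in Path(root).rglob("*"):
        if path.suffix not in SUFFIXES or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if any(IMPORT_VECTOR.search(line) for line in text.splitlines()):
            matches.append(path)
    return sorted(matches)


def last_commit_date(path):
    """Ask git for the commit date of the latest commit touching `path`.

    Assumes `path` lives inside a git working tree and `git` is on PATH.
    """
    path = Path(path)
    result = subprocess.run(
        ["git", "log", "-1", "--format=%cd", "--", path.name],
        cwd=path.parent,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `last_commit_date` on each path returned by `files_importing_vector` would give the dates needed for the plot; like the `egrep` pattern, this also matches names that merely start with "vector", so the hits are an upper bound.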
I'm not sure "inflate the numbers" is correct - The number is "downloads", not "number of users". It isn't a measure of number of users due to the issues listed above, but it does measure interest in/usefulness of the package in some form. A CI job doing something with vector still means someone is doing something with vector.
Also, on the flip side, CI jobs may use caches (uv and pixi both have fairly popular actions that cache by default), so that might hide "downloads" by not actually downloading from PyPI.
Other analyses are very useful, but having the download count is still useful as well and it wouldn't replace it, it's just a different metric.
Thanks for the resources here! I went through them and also through several other JOSS papers. I did not see statistics in any of them, so I think I will just remove them entirely.
paper/paper.md
Outdated
Vector is currently the only Lorentz vector library providing a Pythonic
interface but a C++ (through Awkward Array [@Pivarski:2018]) computational
backend. Vector integrates seamlessly with the existing high energy physics
I'm not sure about this statement. NumPy is a compiled backend (though C instead of C++). And PyROOT is Python backed by C++. I think I'd rework it a bit to something to state what it is, and not focus on the "only".
7785117 to 459dcd7
Thanks for the reviews, @jpivarski and @henryiii! I have made corrections, but could you please review it again whenever you get the time?
The JOSS submission guidelines say -
I would prefer not to merge this in and to use this PR only to review the content. Once everything is reviewed, I will submit the paper from the short-lived branch and delete the branch once the paper is published.
Description
The draft PDF can be downloaded from - https://github.com/scikit-hep/vector/actions/runs/10420364323
JOSS paper format - https://joss.readthedocs.io/en/latest/paper.html
JOSS submission guidelines - https://joss.readthedocs.io/en/latest/submitting.html
Checklist
- ($ pre-commit run --all-files or $ nox -s lint)
- ($ pytest or $ nox -s tests)
- ($ cd docs; make clean; make html or $ nox -s docs)
- ($ pytest --doctest-plus src/vector/ or $ nox -s doctests)
Before Merging