Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backend] Add search terms #554

Closed
wants to merge 2 commits into from

Conversation

valeriocos
Copy link
Member

This PR proposes to extend the metadata attributes by adding a new field (search_terms) with the purpose of simplifying query operations and avoding inspecting the data attribute.

The search terms are included in a dict with the following shape:

        {
            'term-1': 'value-1',
            'term-2': 'value-2',
            'term-3': 'value-3',
        }

The search terms are added to the metadata information of each item in search_terms attribute. If search terms are not defined, search_terms will be set to None.

This new feature is showcased with the Jira backend.

This code enhances the metadata information with
a set of search terms, which simplify query operations
and avoid the manual inspections of the Perceval items.
The search terms are included in a dict with the
following shape:
        {
            'term-1': 'value-1',
            'term-2': 'value-2',
            'term-3': 'value-3',
        }

The search terms are added to the metadata information
of each item in `search_terms` attributes. If `search_terms`
is not set, it will be set to `None`.

Tests have been added accordingly.
The backend version is set to 0.9.0.

Signed-off-by: Valerio Cosentino <[email protected]>
This code extends the Jira backend by including
a set of search terms to simplify query operations.
The search terms introduced are:
`project_id`, `project_key` and `project_name`.

Tests have been added accordingly.
The backend version is set to 0.13.0.

Signed-off-by: Valerio Cosentino <[email protected]>
@valeriocos valeriocos requested a review from sduenas July 19, 2019 15:23
@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 97.096% when pulling b0f4095 on valeriocos:add-search-terms into 0514c3c on chaoss:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 97.096% when pulling b0f4095 on valeriocos:add-search-terms into 0514c3c on chaoss:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 97.096% when pulling b0f4095 on valeriocos:add-search-terms into 0514c3c on chaoss:master.

@valeriocos
Copy link
Member Author

valeriocos commented Jul 20, 2019

An initial evaluation of the new feature has been done to measure the impact on the raw indexes in GrimoireLab. The evaluation consisted in running the raw collection on a Jira with >10000 issues using
the old and search terms approach, and then measure the size of the two indexes. The table below (derived using _cat/indices) shows that around 32% of disk storage would be saved with the search terms approach.

index docs.count size
jira_search 10796 73.8mb
jira_old 10796 109.1mb

Furthermore, the additional benefits of the search terms are:

  • it's not needed anymore to index all information coming from a data source (but just the one at metadata level). See below an example of new mapping:
{
     "dynamic":true,
     "properties": {
         "data": {
             "dynamic":false,
             "properties": {}
            }
      }
}
  • the raw index mapping in GrimoireLab would be homogeneous across all data sources (see below), thus opening the doors to new ways of storing raw data.
  • the codebase of ELK would be reduced (mappings and _fix_item method would not be needed anymore).
  • errors such immense term exceptions, too many attribute to index, mapping attribute inconsistencies would disappear, thus making the platform more robust.

The next evaluation will focus on the impact of the search tems approach on `filter-raw' data.

@valeriocos
Copy link
Member Author

valeriocos commented Jul 20, 2019

To use the search_terms in the filter raw a minor change is needed in ELK (FILTER_DATA_ATTR = 'search_terms.'). Beyond this, the configurations are pretty the same:

"jira": [
            "https://xxx.jira.com --filter-raw=search_terms.project_key:ACME"
        ]
"jira": [
            "https://xxx.jira.com --filter-raw=data.fields.project.key:ACME"
        ]

In both cases, the same number of items are enriched and there is no significant variation in the execution times.

@valeriocos
Copy link
Member Author

The PR is ELK is available at: chaoss/grimoirelab-elk#665

@valeriocos valeriocos marked this pull request as ready for review July 24, 2019 15:40
Copy link
Member

@sduenas sduenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks great but I have some minor comments to change the implementation. Please look at them and tell me if they make sense or not. I think it will be better to use the same behaviour that we have for classified_fields.

@@ -74,14 +74,27 @@ class Backend:
Classified data filtering and archiving are not compatible to prevent
data leaks or security issues.

Each backend can also provides a set of search terms to simplify query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would changed this with:

Suggested change
Each backend can also provides a set of search terms to simplify query
Each backend might provide a set of search terms to simplify query

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the use of might (which expresses permission) over can (which expresses ability/capability) in this sentence.
I'll fix the typo (provides)

perceval/backend.py Show resolved Hide resolved
perceval/backend.py Show resolved Hide resolved
perceval/backend.py Show resolved Hide resolved
project_name = item['fields']['project']['name']

terms = {
'project_id': project_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does make sense to include the id of the item too? The one returned by metadata_id().

tests/test_jira.py Show resolved Hide resolved
perceval/backend.py Show resolved Hide resolved
@valeriocos
Copy link
Member Author

I have prepared another PR addressing your comments: #560, so we can better evaluate the two solutions.

@valeriocos
Copy link
Member Author

Closing this PR since PR #560 seems better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants