
[BZ-2141422] Automated a new test case [OCS-6280] to verify mds cache trim in standby-replay mode. #10892

Open: wants to merge 7 commits into base: master

Conversation

nagendra202 (Contributor)

…trim in standby-replay mode.

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 self-assigned this Nov 19, 2024
@nagendra202 nagendra202 requested a review from a team as a code owner November 19, 2024 09:28
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Nov 19, 2024

openshift-ci bot commented Nov 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nagendra202

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nagendra202 nagendra202 requested review from a team and PrasadDesala November 20, 2024 07:08
@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-n19-1
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-n19-1
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

@PrasadDesala PrasadDesala added team/e2e E2E team related issues/PRs Customer defects Defects automated aspart of GSS closed loop labels Nov 21, 2024

@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(
self, run_metadata_io_with_cephfs, threading_lock
Contributor:

Is threading_lock needed? I don't see it used anywhere else.

Contributor Author:

removed it.

pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
)


Contributor:

You need to skip this on external mode clusters.

Contributor Author:

done
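
For context, a minimal sketch of what the skip could look like, assuming ocs-ci's skipif_external_mode marker (the marker name and import path are assumptions, not lines from this PR's diff):

# Sketch only: marker name and import path follow ocs-ci conventions and are assumed.
import pytest
from ocs_ci.framework.pytest_customization.marks import skipif_external_mode


@skipif_external_mode  # MDS daemons are not managed by ODF on external mode clusters
@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(run_metadata_io_with_cephfs):
    ...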

sr_mds_node = cluster.get_mds_standby_replay_info()["node_name"]
worker_nodes = get_worker_nodes()
target_node = []
ceph_health_check()
Contributor:

Is the Ceph health check needed?

Contributor Author:

Yes. Before running the metadata IO, I should make sure Ceph is healthy, so that we avoid creating more problems by adding load to the cluster while Ceph is in a bad state.

Contributor Author:

Sometimes the active and standby MDS may not be available due to continuous flipping from active to standby and vice versa. If we start this IO again in that situation, the cluster will go bad.

Comment on lines 58 to 59
metaio_executor.submit(
pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
Contributor:

Don't we need to check on this thread later?

Contributor Author:

No need to check. It will keep running in the background, creating files and performing file operations. We check the memory utilisation in the test function and wait for the given time to reach the targeted load.
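
For context, a minimal sketch of the fire-and-forget pattern described here; pod_objs stands in for the pods created by the fixture and is an assumed name:

# Sketch of the pattern: submit the metadata IO and do not wait on the Future;
# the workload keeps creating files and doing file operations in the background
# while the test function polls MDS memory utilisation.
from concurrent.futures import ThreadPoolExecutor

metaio_executor = ThreadPoolExecutor(max_workers=len(pod_objs))
for pod_obj in pod_objs:
    metaio_executor.submit(
        pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
    )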



@pytest.fixture(scope="function")
def run_metadata_io_with_cephfs(dc_pod_factory):
Contributor:

Is this the same code that we are using for the MDS memory and CPU alert feature? If yes, should we move it to a common place and call that function in the test?

Contributor Author:

done.
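
As a rough sketch of what moving it to a common place could look like (the module location and helper name are hypothetical):

# Hypothetical shared helper so the MDS mem/cpu alert test and this test can
# reuse the same metadata IO load generator.
from concurrent.futures import ThreadPoolExecutor


def start_metadata_io(pod_objs, script="meta_data_io.py"):
    """Kick off metadata-heavy IO on each pod in the background and return the executor."""
    executor = ThreadPoolExecutor(max_workers=len(pod_objs))
    for pod_obj in pod_objs:
        executor.submit(pod_obj.exec_sh_cmd_on_pod, command=f"python3 {script}")
    return executor

The function-scoped fixture in each test file would then only create the CephFS pods via dc_pod_factory and call this helper.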


@tier2
@bugzilla("2141422")
@magenta_squad
Contributor:

This is not part of the magenta squad; please add the relevant marker.

Contributor Author:

done

@bugzilla("2141422")
@magenta_squad
@skipif_ocs_version("<4.15")
@skipif_ocp_version("<4.15")
Contributor:

Is the bug dependent on the OCP version as well?

Contributor Author:

There is no mention of OCP in the BZ, so I removed it.

"1 MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"

if active_mds_mem_util > sr_mds_mem_util:


Reconsider this if-else block based on Venky's clarification.

Contributor Author:

done

"1 MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"

if active_mds_mem_util > sr_mds_mem_util:
Contributor Author:

This condition can be removed, as the standby-replay MDS memory may sometimes go beyond the active MDS memory.

Contributor Author:

done

log.info(f"Standby-replay MDS memory utilization: {sr_mds_mem_util}%")
ceph_health_detail = cluster.ceph_health_detail()
assert (
"1 MDSs report oversized cache" not in ceph_health_detail
Contributor Author:

Remove the "1" from the warning string so the check matches regardless of how many MDSs report it.

Contributor Author:

Identify which MDS the warning is from.

Contributor Author:

done
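
A rough sketch combining both points: match the warning without the leading "1", and if it appears, log which MDS daemons are named (the health-detail line format assumed here is illustrative):

# Sketch: the exact 'ceph health detail' text layout is an assumption.
ceph_health_detail = cluster.ceph_health_detail()
if "MDSs report oversized cache" in ceph_health_detail:
    # Lines such as "mds.<daemon-name>(...): MDS cache is too large ..." name the
    # offending daemon, so we can tell whether the active or the standby-replay
    # MDS failed to trim its cache.
    offenders = [
        line.strip()
        for line in ceph_health_detail.splitlines()
        if line.strip().startswith("mds.")
    ]
    log.error(f"Oversized cache reported by: {offenders}")
assert (
    "MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"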

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 requested a review from a team as a code owner December 10, 2024 09:11
@nagendra202 nagendra202 (Contributor Author) left a comment

Addressed review comments.




@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-d10-01
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

break
else:
log.warning("MDS memory consumption is not yet reached target")


How are you making sure cache utilization is maximum?

Contributor Author:

Previously, the MDS cache oversized warning was triggered at 75% of allocated memory utilization (which corresponds to 150% of cache utilisation). If such a warning appears, it should now only occur after memory utilization exceeds 75%.
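
A simplified sketch of the wait-for-load loop this refers to; get_mds_memory_utilisation() is a hypothetical stand-in for however the test reads MDS memory usage, and the 75% threshold comes from the explanation above:

import time

# Illustrative sketch only; log is the module logger.
target_util = 75               # % of allocated memory at which the oversized-cache warning can appear
end_time = time.time() + 3600  # give the metadata IO time to build up cache pressure
while time.time() < end_time:
    if get_mds_memory_utilisation() >= target_util:  # hypothetical helper
        log.info("Targeted MDS memory utilisation reached")
        break
    log.warning("MDS memory consumption has not yet reached the target")
    time.sleep(60)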

@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(self, dc_pod_factory):
"""
Verifies whether the MDS cache is trimmed or not in standby-replay mode.


Please add brief steps here.

Contributor Author:

done

@tier2
@bugzilla("2141422")
@brown_squad
@skipif_ocs_version("<4.15")


The skip version doesn't seem to be right. The fix for BZ 2141422 is in RHCS 6.1.4 (4.14.5 and above) and RHCS 5.3.6 (4.13.8 and above). Please check again. See: https://bugzilla.redhat.com/show_bug.cgi?id=2141422#c40

Contributor Author:

Fixed In Version: 4.15.0-123

The Ceph version was upgraded to 6.1.4 in 4.14, 4.13, and 4.12. I am not sure whether it works in all versions of the older releases or only in a few z-streams of those older versions.

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 (Contributor Author) left a comment

Review comments addressed.

Labels
Customer defects Defects automated aspart of GSS closed loop size/L PR that changes 100-499 lines team/e2e E2E team related issues/PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants