[BZ-2141422] Automated a new test case [OCS-6280] to verify mds cache trim in standby-replay mode. #10892
base: master
Conversation
…trim in standby-replay mode. Signed-off-by: nagendra202 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: nagendra202
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: nagendra202 <[email protected]>
Signed-off-by: nagendra202 <[email protected]>
PR validation on existing cluster
Cluster Name: nagreddy-n19-1
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master
Job UNSTABLE (some or all tests failed).
@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(
    self, run_metadata_io_with_cephfs, threading_lock
Is threading_lock needed? I don't see it used anywhere else.
removed it.
    pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
)
You need to skip this on external mode clusters.
done
sr_mds_node = cluster.get_mds_standby_replay_info()["node_name"]
worker_nodes = get_worker_nodes()
target_node = []
ceph_health_check()
Is the ceph health check needed?
Yes, before running the metadata IO I should make sure Ceph is healthy, so that we avoid creating more problems by putting more load on the cluster while Ceph is in a bad state.
Sometimes the active and standby MDS may not be available due to continuous flipping from active to standby and vice versa. If we start this IO in that situation, the cluster will go bad.
metaio_executor.submit(
    pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
Don't we need to check on this thread later?
No need to check. It will be running in the background, creating files and performing file operations. We check the memory utilisation in the test function and wait for the given time to reach the targeted load.
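For context, here is a minimal sketch of the fire-and-forget pattern described in this reply. It assumes the dc_pod_factory fixture and a meta_data_io.py script already copied into the pod (both from this PR's fixture); the import paths and the interface argument are assumptions based on typical ocs-ci layout, and the actual fixture body in the PR may differ.

```python
from concurrent.futures import ThreadPoolExecutor

import pytest

from ocs_ci.ocs import constants  # assumed ocs-ci module path


@pytest.fixture(scope="function")
def run_metadata_io_with_cephfs(dc_pod_factory):
    """Start metadata-heavy IO in the background on a CephFS-backed pod.

    The future returned by submit() is intentionally not joined: the IO keeps
    creating files and performing file operations for the whole test, and the
    test itself only samples MDS memory utilization later.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    pod_obj = dc_pod_factory(interface=constants.CEPHFILESYSTEM)
    executor.submit(pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py")
    yield pod_obj
```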
@pytest.fixture(scope="function")
def run_metadata_io_with_cephfs(dc_pod_factory):
Is this the same code that we are using for the MDS mem and CPU alert feature? If yes, should we move it to some common place and call that function in the test?
done.
@tier2
@bugzilla("2141422")
@magenta_squad
This is not part of the magenta squad; please add a relevant marker.
done
@bugzilla("2141422") | ||
@magenta_squad | ||
@skipif_ocs_version("<4.15") | ||
@skipif_ocp_version("<4.15") |
Is the bug dependent on the OCP version as well?
There is no mention of OCP in the BZ, so I removed it.
"1 MDSs report oversized cache" not in ceph_health_detail | ||
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}" | ||
|
||
if active_mds_mem_util > sr_mds_mem_util: |
Reconsider this if-else block based on Venky's clarification
done
"1 MDSs report oversized cache" not in ceph_health_detail | ||
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}" | ||
|
||
if active_mds_mem_util > sr_mds_mem_util: |
This condition can be removed, as the standby-replay MDS memory may sometimes go beyond the active MDS.
done
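A minimal sketch of the simplification implied here: with the comparison removed, both utilization values are only logged for debugging and the pass/fail decision comes from the oversized-cache health check shown in the diff. Variable names (active_mds_mem_util, sr_mds_mem_util, log) mirror the diff context and are otherwise assumed.

```python
# The active vs. standby-replay comparison is intentionally gone: the
# standby-replay MDS can legitimately use more memory than the active one.
log.info(f"Active MDS memory utilization: {active_mds_mem_util}%")
log.info(f"Standby-replay MDS memory utilization: {sr_mds_mem_util}%")
```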
log.info(f"Standby-replay MDS memory utilization: {sr_mds_mem_util}%") | ||
ceph_health_detail = cluster.ceph_health_detail() | ||
assert ( | ||
"1 MDSs report oversized cache" not in ceph_health_detail |
remove "1"
Can you identify which MDS the warning is from?
done
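A hedged sketch of what the adjusted assertion could look like after dropping the hard-coded count: match the generic warning text and log the full ceph health detail, which names the reporting daemon, so the failing MDS can be identified. Names mirror the diff context; the PR's final wording may differ.

```python
ceph_health_detail = cluster.ceph_health_detail()
if "MDSs report oversized cache" in ceph_health_detail:
    # The health detail output includes the daemon name, which tells us
    # whether the active or the standby-replay MDS reported the warning.
    log.error(f"Oversized cache reported:\n{ceph_health_detail}")
assert (
    "MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"
```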
Signed-off-by: nagendra202 <[email protected]>
Signed-off-by: nagendra202 <[email protected]>
Signed-off-by: nagendra202 <[email protected]>
Addressed review comments.
PR validation on existing cluster
Cluster Name: nagreddy-d10-01
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master
        break
else:
    log.warning("MDS memory consumption is not yet reached target")
How are you making sure cache utilization is maximum?
Previously, the MDS cache oversized warning was triggered at 75% of allocated memory utilization (which means 150% cache utilisation). If such a warning appears, it should now only occur after memory utilization exceeds 75%.
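A hedged sketch of the wait loop behind the break/else shown above: keep sampling MDS memory utilization until it crosses the target derived from that 75% figure, or give up after a bounded number of attempts. get_mds_memory_utilization is a hypothetical stand-in for the PR's actual sampling helper, and the interval and retry counts are illustrative only.

```python
import logging
import time

log = logging.getLogger(__name__)

TARGET_MEM_UTIL_PERCENT = 75   # illustrative threshold, see the comment above
SAMPLE_INTERVAL_SECONDS = 60   # illustrative polling interval
MAX_SAMPLES = 30               # illustrative upper bound on waiting


def wait_for_mds_memory_load(get_mds_memory_utilization):
    """Poll until MDS memory utilization reaches the targeted load."""
    for _ in range(MAX_SAMPLES):
        mem_util = get_mds_memory_utilization()
        if mem_util >= TARGET_MEM_UTIL_PERCENT:
            log.info(f"MDS memory utilization reached {mem_util}%")
            return True
        log.warning("MDS memory consumption has not yet reached the target")
        time.sleep(SAMPLE_INTERVAL_SECONDS)
    return False
```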
@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(self, dc_pod_factory):
    """
    Verifies whether the MDS cache is trimmed or not in standby-replay mode.
please add brief steps here
done
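A sketch of the requested brief steps, written as the test docstring; the step list is reconstructed from this conversation, so the wording in the merged PR may differ.

```python
def test_mds_cache_trim_on_standby_replay(self, run_metadata_io_with_cephfs):
    """
    Verifies whether the MDS cache is trimmed or not in standby-replay mode.

    Steps:
    1. Confirm Ceph is healthy, then start metadata-heavy IO on a CephFS pod.
    2. Wait until MDS memory utilization reaches the targeted load.
    3. Check 'ceph health detail' and make sure the
       "MDSs report oversized cache" warning is not raised, which would mean
       the standby-replay MDS cache was not trimmed.
    """
```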
@tier2
@bugzilla("2141422")
@brown_squad
@skipif_ocs_version("<4.15")
The skip version doesn't seem to be right. The fix for the BZ 2141422 is in RHCS-6.1.4(4.14.5 and above) and RHCS-5.3.6(4.13.8 and above). Please check again. See: https://bugzilla.redhat.com/show_bug.cgi?id=2141422#c40
Fixed In Version: 4.15.0-123. The Ceph version was upgraded to 6.1.4 in 4.14, 4.13, and 4.12, but I am not sure whether it works in all versions of those older releases or only in a few z-streams of them.
Signed-off-by: nagendra202 <[email protected]>
review comments addressed.
Test case: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-6280
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2141422
Created a new test case in Polarion and automated the same.