
[BZ-2141422] Automated a new test case [OCS-6280] to verify mds cache trim in standby-replay mode. #10892

Open: wants to merge 7 commits into base: master

Conversation

nagendra202 (Contributor)

…trim in standby-replay mode.

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 self-assigned this Nov 19, 2024
@nagendra202 nagendra202 requested a review from a team as a code owner November 19, 2024 09:28
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Nov 19, 2024

openshift-ci bot commented Nov 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nagendra202

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nagendra202 nagendra202 requested review from a team and PrasadDesala November 20, 2024 07:08
@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-n19-1
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-n19-1
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

@PrasadDesala PrasadDesala added team/e2e E2E team related issues/PRs Customer defects Defects automated aspart of GSS closed loop labels Nov 21, 2024

@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(
self, run_metadata_io_with_cephfs, threading_lock
Contributor:

Is threading_lock needed? I don't see it used anywhere else.

Contributor Author:

removed it.

pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
)


Contributor:

You need to skip this on external mode clusters.

Contributor Author:

done
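
For context, a minimal sketch of what the skip could look like, assuming ocs-ci's skipif_external_mode marker (the marker name and import path are assumptions, not lines from this PR's diff):

# Sketch only: marker name and import path follow ocs-ci conventions and are assumed.
import pytest
from ocs_ci.framework.pytest_customization.marks import skipif_external_mode


@skipif_external_mode  # MDS daemons are not managed by ODF on external mode clusters
@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(run_metadata_io_with_cephfs):
    ...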

sr_mds_node = cluster.get_mds_standby_replay_info()["node_name"]
worker_nodes = get_worker_nodes()
target_node = []
ceph_health_check()
Contributor:

Is the Ceph health check needed?

Contributor Author:

Yes. Before running the metadata IO, I should make sure Ceph is healthy, so that we avoid creating more problems by adding load to the cluster while Ceph is in a bad state.

Contributor Author:

Sometimes the active and standby MDS may not be available due to continuous flipping from active to standby and vice versa. If we start this IO again in that situation, the cluster will go bad.

Comment on lines 58 to 59
metaio_executor.submit(
pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
Contributor:

Don't we need to check on this thread later?

Contributor Author:

No need to check. It will keep running in the background, creating files and performing file operations. We check the memory utilisation in the test function and wait for the given time to reach the targeted load.
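
For context, a minimal sketch of the fire-and-forget pattern described here; pod_objs stands in for the pods created by the fixture and is an assumed name:

# Sketch of the pattern: submit the metadata IO and do not wait on the Future;
# the workload keeps creating files and doing file operations in the background
# while the test function polls MDS memory utilisation.
from concurrent.futures import ThreadPoolExecutor

metaio_executor = ThreadPoolExecutor(max_workers=len(pod_objs))
for pod_obj in pod_objs:
    metaio_executor.submit(
        pod_obj.exec_sh_cmd_on_pod, command="python3 meta_data_io.py"
    )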



@pytest.fixture(scope="function")
def run_metadata_io_with_cephfs(dc_pod_factory):
Contributor:

Is this the same code that we are using for the MDS memory and CPU alert feature? If yes, should we move it to a common place and call that function in the test?

Contributor Author:

done.
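
As a rough sketch of what moving it to a common place could look like (the module location and helper name are hypothetical):

# Hypothetical shared helper so the MDS mem/cpu alert test and this test can
# reuse the same metadata IO load generator.
from concurrent.futures import ThreadPoolExecutor


def start_metadata_io(pod_objs, script="meta_data_io.py"):
    """Kick off metadata-heavy IO on each pod in the background and return the executor."""
    executor = ThreadPoolExecutor(max_workers=len(pod_objs))
    for pod_obj in pod_objs:
        executor.submit(pod_obj.exec_sh_cmd_on_pod, command=f"python3 {script}")
    return executor

The function-scoped fixture in each test file would then only create the CephFS pods via dc_pod_factory and call this helper.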


@tier2
@bugzilla("2141422")
@magenta_squad
Contributor:

This is not part of the magenta squad; please add the relevant marker.

Contributor Author:

done

@bugzilla("2141422")
@magenta_squad
@skipif_ocs_version("<4.15")
@skipif_ocp_version("<4.15")
Contributor:

Is the bug dependent on the OCP version as well?

Contributor Author:

There is no mention of OCP in the BZ, so I removed it.

"1 MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"

if active_mds_mem_util > sr_mds_mem_util:


Reconsider this if-else block based on Venky's clarification.

Contributor Author:

done

"1 MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"

if active_mds_mem_util > sr_mds_mem_util:
Contributor Author:

This condition can be removed, as the standby-replay MDS memory may sometimes go beyond the active MDS memory.

Contributor Author:

done

log.info(f"Standby-replay MDS memory utilization: {sr_mds_mem_util}%")
ceph_health_detail = cluster.ceph_health_detail()
assert (
"1 MDSs report oversized cache" not in ceph_health_detail
Contributor Author:

Remove the "1" from the warning string so the check matches regardless of how many MDSs report it.

Contributor Author:

Identify which MDS the warning is from.

Contributor Author:

done
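
A rough sketch combining both points: match the warning without the leading "1", and if it appears, log which MDS daemons are named (the health-detail line format assumed here is illustrative):

# Sketch: the exact 'ceph health detail' text layout is an assumption.
ceph_health_detail = cluster.ceph_health_detail()
if "MDSs report oversized cache" in ceph_health_detail:
    # Lines such as "mds.<daemon-name>(...): MDS cache is too large ..." name the
    # offending daemon, so we can tell whether the active or the standby-replay
    # MDS failed to trim its cache.
    offenders = [
        line.strip()
        for line in ceph_health_detail.splitlines()
        if line.strip().startswith("mds.")
    ]
    log.error(f"Oversized cache reported by: {offenders}")
assert (
    "MDSs report oversized cache" not in ceph_health_detail
), f"Oversized cache warning found in Ceph health: {ceph_health_detail}"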

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 requested a review from a team as a code owner December 10, 2024 09:11
@nagendra202 nagendra202 (Contributor Author) left a comment

Addressed review comments.




@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: nagreddy-d10-01
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/pod_and_daemons/test_mds_cache_trim_standby.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

break
else:
log.warning("MDS memory consumption is not yet reached target")


How are you making sure cache utilization is maximum?

Contributor Author:

Previously, the MDS cache oversized warning was triggered at 75% of allocated memory utilization (which corresponds to 150% of cache utilisation). If such a warning appears, it should now only occur after memory utilization exceeds 75%.
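
A simplified sketch of the wait-for-load loop this refers to; get_mds_memory_utilisation() is a hypothetical stand-in for however the test reads MDS memory usage, and the 75% threshold comes from the explanation above:

import time

# Illustrative sketch only; log is the module logger.
target_util = 75               # % of allocated memory at which the oversized-cache warning can appear
end_time = time.time() + 3600  # give the metadata IO time to build up cache pressure
while time.time() < end_time:
    if get_mds_memory_utilisation() >= target_util:  # hypothetical helper
        log.info("Targeted MDS memory utilisation reached")
        break
    log.warning("MDS memory consumption has not yet reached the target")
    time.sleep(60)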

@pytest.mark.polarion_id("OCS-6280")
def test_mds_cache_trim_on_standby_replay(self, dc_pod_factory):
"""
Verifies whether the MDS cache is trimmed or not in standby-replay mode.


Please add brief steps here.

Contributor Author:

done

@tier2
@bugzilla("2141422")
@brown_squad
@skipif_ocs_version("<4.15")


The skip version doesn't seem to be right. The fix for BZ 2141422 is in RHCS 6.1.4 (4.14.5 and above) and RHCS 5.3.6 (4.13.8 and above). Please check again. See: https://bugzilla.redhat.com/show_bug.cgi?id=2141422#c40

Contributor Author:

Fixed In Version: 4.15.0-123

The Ceph version was upgraded to 6.1.4 in 4.14, 4.13, and 4.12. I am not sure whether it works in all versions of the older releases or only in a few z-streams of those older versions.

Signed-off-by: nagendra202 <[email protected]>
@nagendra202 nagendra202 (Contributor Author) left a comment

Review comments addressed.

Labels
Customer defects Defects automated aspart of GSS closed loop size/L PR that changes 100-499 lines team/e2e E2E team related issues/PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants