From: Huang Ying <ying.huang(a)intel.com>
ANBZ: #80
cherry-picked from
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?…
In memory tiering benchmark testing with THP (Transparent Huge Page)
enabled, we found that demotion/promotion between the fast and slow
memory nodes may sometimes pause for several hundred seconds, even
after the live lock caused by THP isolation failures has been fixed.
Further digging found that the pages in the fast memory node sometimes
become so hot that kswapd fails to reclaim any page. Eventually, the
kswapd failure count (pgdat->kswapd_failures) reaches its maximum
value, and kswapd will not be woken up until a direct reclaim succeeds.
For the general use case, this isn't a big problem, because memory
users will eventually enter direct reclaim, which either succeeds or
triggers OOM, and that fixes the issue. But in a memory tiering system,
demotion and promotion avoid creating too much memory pressure on the
fast memory node, so direct reclaim will not be triggered to resolve
the issue. Even if the access pattern of the applications has changed,
we cannot adjust the placement of the hot/cold pages to optimize the
performance.
To resolve the issue, with this patch, in memory tiering mode, kswapd
is woken up every 10 seconds to try to free some pages and recover
from kswapd failures.
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
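The sleep decision described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions, not the
kernel code: struct node_model, struct wake_request, choose_sleep(),
reset_request_on_timeout() and MODEL_ZONE_MOVABLE are hypothetical names
introduced here, and the tiering flag is reduced to a plain bool.
MAX_RECLAIM_RETRIES is given the value 16 that mm/internal.h uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RECLAIM_RETRIES 16  /* same value as in mm/internal.h */
#define WAKE_INTERVAL_SECS  10  /* the patch uses schedule_timeout(10 * HZ) */
#define MODEL_ZONE_MOVABLE  3   /* hypothetical stand-in for ZONE_MOVABLE */

/* Hypothetical, simplified stand-in for the fields of pg_data_t used here. */
struct node_model {
	bool toptier;         /* models node_is_toptier(pgdat->node_id) */
	int kswapd_failures;  /* models pgdat->kswapd_failures */
};

/* Hypothetical stand-in for pgdat->kswapd_order / kswapd_classzone_idx. */
struct wake_request {
	int order;
	int classzone_idx;
};

enum sleep_kind {
	SLEEP_UNTIL_WOKEN,   /* plain schedule(): wait for an explicit wakeup */
	SLEEP_WITH_TIMEOUT,  /* schedule_timeout(): also wake after the interval */
};

/*
 * A timed sleep is chosen only when all three conditions hold: tiering
 * mode is on, the node is a top-tier (fast) node, and kswapd has given
 * up after MAX_RECLAIM_RETRIES failed attempts.
 */
static enum sleep_kind choose_sleep(bool tiering_enabled,
				    const struct node_model *node)
{
	assert(node != NULL);
	if (tiering_enabled && node->toptier &&
	    node->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return SLEEP_WITH_TIMEOUT;
	return SLEEP_UNTIL_WOKEN;
}

/*
 * schedule_timeout() returns 0 when the full timeout elapsed, i.e. nobody
 * woke kswapd early. In that case the model requests an order-0 reclaim
 * with ZONE_MOVABLE as the class zone, as the patch does.
 */
static void reset_request_on_timeout(long remaining, struct wake_request *req)
{
	assert(req != NULL);
	if (remaining == 0) {
		req->classzone_idx = MODEL_ZONE_MOVABLE;
		req->order = 0;
	}
}
```

In this model a node with kswapd_failures == MAX_RECLAIM_RETRIES gets the
timed sleep only when it is top-tier and tiering is enabled; every other
combination keeps the original untimed schedule() behaviour.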
mm/vmscan.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa9e268..abb4579 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4002,8 +4002,23 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
*/
set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
- if (!kthread_should_stop())
- schedule();
+ if (!kthread_should_stop()) {
+ /*
+ * In memory tiering mode, try harder to recover from
+ * kswapd failures, because direct reclaiming may be
+ * not triggered.
+ */
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ node_is_toptier(pgdat->node_id) &&
+ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) {
+ remaining = schedule_timeout(10 * HZ);
+ if (!remaining) {
+ pgdat->kswapd_classzone_idx = ZONE_MOVABLE;
+ pgdat->kswapd_order = 0;
+ }
+ } else
+ schedule();
+ }
set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
} else {
--
1.8.3.1