Hi, All,
We have just released version 0.8 of the memory tiering kernel at the
following URL,
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=t…
The main changes are as follows (also listed in README-tiering.txt).
Updates in tiering-0.8:
- Rebased on v5.15
- Remove cgroup v1 support; we will switch to cgroup v2 support in a
future version. If you need cgroup v1 support, please stick with
v0.72.
- Increase the hot threshold more quickly if too few pages pass the threshold
- Reset hot threshold if workload change is detected
- Batch migrate_pages() to reduce TLB shootdown IPIs
- Support decreasing the hot threshold if the pages just demoted are hot
- Support promoting pages asynchronously
- Support waking up kswapd earlier to make promotion smoother
- Add more sysctl knobs for experimenting with new features
- Change the interface to enable NUMA balancing for MPOL_PREFERRED_MANY
The recommended configuration has changed too; see README-tiering.txt.
The patchset targets upstream, so it follows a rebase policy and
refreshes the patchset directly instead of changing it incrementally.
This makes it harder for Anolis to use the patchset ...
Best Regards,
Huang, Ying
Hi,
This patch set adds THP migration statistics, reduces TLB flushes during
page migration, and fixes the page refcount failure statistics. Please
help to review. Thanks.
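
For reference, here is a minimal sketch (not part of the series) of how
the new events can be observed; it assumes the thp_migration_success/
fail/split counters from the vmstat patch below are exported via
/proc/vmstat:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* Print only the THP migration counters added by this series. */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "thp_migration_", 14))
			fputs(line, stdout);
	fclose(f);
	return 0;
}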
Anshuman Khandual (1):
mm/vmstat: add events for THP migration without split
Baolin Wang (1):
anolis: mm: migrate: Move the page refcount failure statistics to the
correct place
Huang Ying (1):
NUMA balancing: reduce TLB flush via delaying mapping on hint page
fault
Zi Yan (1):
mm/migrate: correct thp migration stats
Documentation/vm/page_migration.rst | 27 ++++++++++++++++
include/linux/vm_event_item.h | 3 ++
include/trace/events/migrate.h | 17 ++++++++--
mm/memory.c | 53 ++++++++++++++++++-------------
mm/migrate.c | 62 +++++++++++++++++++++++++++++--------
mm/vmstat.c | 3 ++
6 files changed, 127 insertions(+), 38 deletions(-)
--
1.8.3.1
ANBZ: #80
commit 37bc3cb9bbef86d1ddbbc789e55b588c8a2cac26 upstream
Commit c843966c556d ("mm: allow swappiness that prefers reclaiming
anon over the file workingset") extended the swappiness range so that
swap can be preferred on some systems. We should also relax the memcg
swappiness restriction to allow a memcg to be swap-preferred.
Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.16276262…
Fixes: c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
---
Note: In MySQL testing, we found that the page cache pages used to record
logs cause thrashing. Increasing the swappiness mitigates the thrashing
by raising the proportion of anon pages scanned during demotion, which
improves performance by about 2%.
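
As a usage sketch only (the cgroup path and the value 150 are
illustrative assumptions, not a recommendation from this patch), a
per-memcg swappiness above 100 could then be set via the cgroup v1
memory.swappiness file:

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/memory/mygroup/memory.swappiness";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Values above 100 (up to 200) make anon reclaim preferred. */
	fprintf(f, "150\n");
	fclose(f);
	return 0;
}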
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df08e95..580ab02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4520,7 +4520,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- if (val > 100 || val < -1 || (css->parent && val < 0))
+ if (val > 200 || val < -1 || (css->parent && val < 0))
return -EINVAL;
if (css->parent)
--
1.8.3.1
ANBZ: #80
When sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING,
memory can be migrated between fast and slow nodes, and pages in slow
memory reuse the cpupid field. But this causes a problem when
NUMA_BALANCING_MEMORY_TIERING is turned off dynamically:
should_numa_migrate_memory() still decides whether slow memory should be
migrated to fast memory, but it fails to obtain a valid node from the
reused cpupid field of slow-memory pages, which triggers a panic.
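
For illustration of the "turned off dynamically" case (a sketch only;
the sysctl path and the value 0 are assumptions, not part of this
patch), the mode can be cleared at runtime roughly like this:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

	if (!f) {
		perror("numa_balancing");
		return 1;
	}
	/* Disable NUMA balancing (and the tiering mode bit) while pages
	 * in slow memory may still carry reused cpupid state. */
	fputs("0\n", f);
	fclose(f);
	return 0;
}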
Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0184145..6afa935 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3016,6 +3016,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
/*
+ * Migration between fast and slow memory nodes is turned off when
+ * sysctl_numa_balancing_mode disables the feature dynamically.
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ !node_is_toptier(src_nid))
+ return false;
+
+ /*
* Allow first faults or private faults to migrate immediately early in
* the lifetime of a task. The magic number 4 is based on waiting for
* two full passes of the "multi-stage node selection" test that is
--
1.8.3.1
From: Huang Ying <ying.huang(a)intel.com>
ANBZ: #80
commit bfe9d006c971a5daefe7a8b27819ccd497090fd8 upstream
When zone_watermark_ok() is called in migrate_balanced_pgdat() to check
migration target node, the parameter classzone_idx (for requested zone)
is specified as 0 (ZONE_DMA). But when allocating memory for autonuma
in alloc_misplaced_dst_page(), the requested zone from GFP flags is
ZONE_MOVABLE. That is, the requested zone is different, and the size of
lowmem_reserve differs for the different requested zones. This may
cause some issues.
For example, in the zoneinfo of a test machine as below,
Node 0, zone DMA32
pages free 61592
min 29
low 454
high 879
spanned 1044480
present 442306
managed 425921
protection: (0, 0, 62457, 62457, 62457)
The free page number of ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always
return true. So, autonuma may not stop even when memory pressure in
node 0 is heavy.
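
To make the numbers concrete, here is a simplified sketch (not the
actual mm/page_alloc.c code, which also accounts for reserved pages and
allocation flags) of the comparison zone_watermark_ok() effectively
performs:

#include <stdbool.h>
#include <stdio.h>

/* Simplified: a zone is balanced if free pages exceed the watermark
 * plus the lowmem_reserve entry for the requested (classzone) zone. */
static bool watermark_ok_sketch(unsigned long free_pages,
				unsigned long watermark,
				unsigned long lowmem_reserve)
{
	return free_pages > watermark + lowmem_reserve;
}

int main(void)
{
	/* ZONE_DMA32 numbers from the zoneinfo above. */
	unsigned long free = 61592, high = 879;

	/* classzone_idx = 0 (ZONE_DMA): 61592 > 879 + 0 -> balanced. */
	printf("vs ZONE_DMA:     %d\n", watermark_ok_sketch(free, high, 0));
	/* classzone_idx = ZONE_MOVABLE: 61592 > 879 + 62457 -> not balanced. */
	printf("vs ZONE_MOVABLE: %d\n", watermark_ok_sketch(free, high, 62457));
	return 0;
}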
To fix the issue, ZONE_MOVABLE is used as the parameter to call
zone_watermark_ok() in migrate_balanced_pgdat(). This makes it the same
as the requested zone in alloc_misplaced_dst_page(), so that
migrate_balanced_pgdat() returns false when memory pressure is heavy.
Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Acked-by: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Fengguang Wu <fengguang.wu(a)intel.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
Note: this patch fixes the problem that the DRAM node's kswapd
is not woken up in time, and improves MySQL performance by about 12%
in testing.
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 6d25ea0..e2dbf24 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1969,7 +1969,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order)
/* Avoid waking kswapd by allocating pages to migrate. */
if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone),
- 0, 0))
+ ZONE_MOVABLE, 0))
continue;
return true;
}
--
1.8.3.1