From: Huang Ying <ying.huang(a)intel.com>
ANBZ: #80
cherry-picked from
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?…
In the current implementation, the memory tiering promotion threshold is adjusted in fixed steps (16) so that it converges reasonably quickly. Now that the threshold can be reduced (made stricter) much more quickly, we can use a different way to adjust it, which enlarges the range of possible threshold values and makes the threshold more accurate.
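For context, here is a minimal user-space sketch of the fixed-step rule being replaced (the function name is illustrative, and the 1000 ms default reference threshold is an assumption taken from the rest of the patchset, not stated here). The step size is also the floor, which is where the 62 ms figure below comes from:

  /* Old scheme (sketch): move by a fixed step of ref_th / 16 per adjustment. */
  static unsigned long adjust_threshold_old(unsigned long th, unsigned long ref_th,
                                            unsigned long diff_cand, unsigned long ref_cand)
  {
      unsigned long unit_th = ref_th / 16;        /* 1000 / 16 = 62 ms (assumed default) */

      if (diff_cand > ref_cand * 11 / 10)         /* too many candidates: tighten */
          th = (th > 2 * unit_th) ? th - unit_th : unit_th;
      else if (diff_cand < ref_cand * 9 / 10)     /* too few candidates: relax */
          th = (th + unit_th < ref_th) ? th + unit_th : ref_th;
      return th;                                  /* ~16 possible values, never below unit_th */
  }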
In this patch, if the number of pages that pass the threshold is more than 110% of the target number, the threshold is decreased to 90% of its previous value, down to a minimum of 1 ms. Conversely, if the number is less than 90% of the target, the threshold is increased to 110% of its previous value, up to at most twice the reference threshold. In this way, the minimal threshold is 1 ms (vs. 62 ms by default originally, i.e. the old fixed step ref_th / 16), and the number of possible threshold values becomes much larger. Both changes make it possible to adjust the threshold more accurately.
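A minimal user-space sketch of the new multiplicative rule (illustrative names; the real code operates on pgdat->numa_threshold under pgdat->numa_lock). Taking min(th * 9 / 10, th - 1) and max(th * 11 / 10, th + 1) guarantees the threshold moves by at least 1 ms per adjustment even when th is small:

  #include <stdio.h>

  /* New scheme (sketch): multiply by 0.9 or 1.1, moving at least 1 ms per step. */
  static unsigned long adjust_threshold(unsigned long th, unsigned long ref_th,
                                        unsigned long diff_cand, unsigned long ref_cand)
  {
      if (diff_cand > ref_cand * 11 / 10) {               /* too many candidates: tighten */
          th = (th * 9 / 10 < th - 1) ? th * 9 / 10 : th - 1;
          th = (th > 1) ? th : 1;                         /* floor: 1 ms */
      } else if (diff_cand < ref_cand * 9 / 10) {         /* too few candidates: relax */
          th = (th * 11 / 10 > th + 1) ? th * 11 / 10 : th + 1;
          th = (th < ref_th * 2) ? th : ref_th * 2;       /* cap: 2 * ref_th */
      }
      return th;
  }

  int main(void)
  {
      unsigned long th = 1000, ref_th = 1000;

      /* Pretend we always see 20% more candidates than the target. */
      for (int i = 0; i < 10; i++) {
          th = adjust_threshold(th, ref_th, 120, 100);
          printf("%lu ", th);       /* 900 810 729 656 590 531 477 429 386 347 */
      }
      printf("\n");
      return 0;
  }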
One downside is that if the PMEM pages suddenly become much colder, it will take more time to adjust the threshold to the new situation, so the promotion throughput will be lower than expected for a while. But if the PMEM pages become colder, it is also less urgent to promote them, so it should be OK to promote them more slowly. Eventually, the threshold will be adjusted to the proper value.
In a test with the pmbench memory accessing benchmark on a 2-socket server machine with Optane DCPMM, the promotion threshold is adjusted to a smaller value (33 ms vs. 62 ms originally). The pmbench score decreases by 2.8%, mainly because the promotion throughput decreases by 4.9%: the number of pages that can be promoted is not distributed evenly over time, while the rate limit is enforced uniformly over time. This issue will be resolved by the following patches in the patchset.
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5806316..74dec2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2927,14 +2927,12 @@ int sysctl_numa_balancing_threshold(struct ctl_table *table, int write, void *bu
return err;
}
-#define NUMA_MIGRATION_ADJUST_STEPS 16
-
static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
unsigned long rate_limit,
unsigned long ref_th)
{
unsigned long now = jiffies, last_th_ts, th_period;
- unsigned long unit_th, th, oth;
+ unsigned long th, oth;
unsigned long last_nr_cand, nr_cand, ref_cand, diff_cand;
th_period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
@@ -2952,13 +2950,15 @@ static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
}
pgdat->numa_threshold_ts = now;
pgdat->numa_threshold_nr_candidate = nr_cand;
- unit_th = ref_th / NUMA_MIGRATION_ADJUST_STEPS;
oth = pgdat->numa_threshold;
th = oth ? : ref_th;
- if (diff_cand > ref_cand * 11 / 10)
- th = max(th - unit_th, unit_th);
- else if (diff_cand < ref_cand * 9 / 10)
- th = min(th + unit_th, ref_th);
+ if (diff_cand > ref_cand * 11 / 10) {
+ th = min(th * 9 / 10, th - 1);
+ th = max(th, 1UL);
+ } else if (diff_cand < ref_cand * 9 / 10) {
+ th = max(th * 11 / 10, th + 1);
+ th = min(th, ref_th * 2);
+ }
pgdat->numa_threshold = th;
spin_unlock(&pgdat->numa_lock);
trace_autonuma_threshold(pgdat->node_id, diff_cand, th);
--
1.8.3.1