On 2022/2/11 7:58 H, Huang, Ying wrote:
zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
> On 2022/2/10 3:03 H, Huang, Ying wrote:
>> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:
>>
>>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>
>>>>> On 2022/2/10 8:58 H, Huang, Ying wrote:
>>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>>>
>>>>>>> Currently, a MySQL testcase shows that a large number of THPs are
>>>>>>> migrated from the pmem node to the toptier node, which brings in more
>>>>>>> pgpromote_demoted and migration failures. Because pmem node memory is
>>>>>>> marked as prot_none, it will be migrated on CPU access as soon as it
>>>>>>> becomes hot, and it is unnecessary to migrate THPs to DRAM when DRAM
>>>>>>> memory is not enough, which will bring in more demotions and promotions.
>>>>>>>
>>>>>>> Hence, the patch forbids THPs from being produced in the pmem node. The
>>>>>>> results show about a 3% improvement. The relevant statistics are as
>>>>>>> follows.
>>>>>>>
>>>>>>> before applying patch:
>>>>>>> mysql prepare:
>>>>>>> pgpromote_demoted 908267
>>>>>>> pgmigrate_fail_dst_node_fail 428223
>>>>>>> pgmigrate_fail_numa_isolate_fail 460480
>>>>>>>
>>>>>>> mysql run:
>>>>>>> pgpromote_demoted 2901105
>>>>>>> pgmigrate_fail_dst_node_fail 5653776
>>>>>>> pgmigrate_fail_numa_isolate_fail 5686052
>>>>>>>
>>>>>>> after applying patch:
>>>>>>> mysql prepare:
>>>>>>> pgpromote_demoted 839297
>>>>>>> pgmigrate_fail_dst_node_fail 36585
>>>>>>> pgmigrate_fail_numa_isolate_fail 36585
>>>>>>>
>>>>>>> mysql run:
>>>>>>> pgpromote_demoted 913828
>>>>>>> pgmigrate_fail_dst_node_fail 235863
>>>>>>> pgmigrate_fail_numa_isolate_fail 235870
>>>>>>>
>>>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
>>>>>>> ---
>>>>>>> mm/page_alloc.c | 14 ++++++++++++++
>>>>>>> 1 file changed, 14 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 8cfce92..4fff3cd 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
>>>>>>> return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK);
>>>>>>> }
>>>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned int order)
>>>>>>> +{
>>>>>>> + if (node_is_toptier(nid))
>>>>>>> + return true;
>>>>>>> +
>>>>>>> + if (order != HPAGE_PMD_ORDER)
>>>>>>> + return true;
>>>>>>> +
>>>>>>> + return false;
>>>>>>> +}
>>>>>>> +
>>>>>>> /**
>>>>>>> * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
>>>>>>> * @page: The page within the block of interest
>>>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>>>>>>> }
>>>>>>> }
>>>>>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>>>>>> + continue;
>>>>>>> +
>>>>>> It appears that this will disable node reclaiming for THP allocation.
>>>>>> So more pages will be allocated in PMEM node because of allocation
>>>>>> fallback?
>>>>> We only allow normal pages to be allocated in the pmem node; hence, THP
>>>>> allocation will fall back to producing more normal pages.
>>>>>
>>>>> The MySQL testcase shows that too many THPs are promoted to toptier.
>>>>> Because toptier memory is not enough, this makes the
>>>>> pgpromote_demoted and dst_node_full counters increase. In that case, we
>>>>> prefer remote access rather than frequently migrating THPs between the
>>>>> pmem and toptier nodes, which would decrease performance.
>>>> Maybe we are looking at different source code :-). In the latest upstream
>>>> code, zone_allows_reclaim() controls node reclaiming (or zone
>>>> reclaim) only. Which repo should I look at?
>>> I think you misunderstood the change, the change is in
>>> get_page_from_freelist(), not in zone_allows_reclaim().
>> OK, I see. I think the `diff` program fooled me:
>>
>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>> }
>> }
>> + if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>> + continue;
>> +
>> if (no_fallback && nr_online_nodes > 1 &&
>> zone != ac->preferred_zoneref->zone) {
>> int local_nid;
>>
>>
>>> From my understanding, Zhongjiang is trying to disable the memory
>>> allocation fallback for THP, right?
>> I think so too now.
>>
>>> But that will cause more demotion if we cannot fall back to the PMEM node?
>> If a THP fails to be allocated, normal pages will be allocated instead.
>> And it appears that if a THP fails to be demoted (with this patch, it
>> will always fail), the THP will be split too. So we may have far fewer THPs
>> in the system with the patch. Zhongjiang, can you check it?
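
For readers following the thread, the effect of the check on allocation fallback can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: struct model_node, model_node_is_toptier() and model_alloc() are hypothetical stand-ins invented for illustration, HPAGE_PMD_ORDER is hard-coded to 9 (x86-64 with 4K base pages), and the loop is a simplification of the zonelist walk, not the real get_page_from_freelist(). Only the decision inside allow_hugepage_allocation() follows the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_ORDER 9	/* assumption: x86-64, 4K base pages */

/* Hypothetical stand-in for per-node state; not a kernel structure. */
struct model_node {
	bool toptier;				/* true for DRAM, false for PMEM */
	bool has_free[HPAGE_PMD_ORDER + 1];	/* free block available, per order */
};

/* Stand-in for node_is_toptier(); the real helper takes only a nid. */
static bool model_node_is_toptier(const struct model_node *nodes, int nid)
{
	return nodes[nid].toptier;
}

/* Same decision as the helper in the patch: only PMD-order
 * requests on non-toptier nodes are refused. */
static bool allow_hugepage_allocation(const struct model_node *nodes,
				      int nid, unsigned int order)
{
	if (model_node_is_toptier(nodes, nid))
		return true;

	if (order != HPAGE_PMD_ORDER)
		return true;

	return false;
}

/* Simplified fallback walk: nodes are tried in index order.
 * Returns the nid that satisfies the request, or -1 on failure. */
static int model_alloc(const struct model_node *nodes, size_t nr_nodes,
		       unsigned int order)
{
	if (order > HPAGE_PMD_ORDER)
		return -1;

	for (size_t nid = 0; nid < nr_nodes; nid++) {
		if (!allow_hugepage_allocation(nodes, (int)nid, order))
			continue;	/* skip PMEM for THP-sized requests */
		if (nodes[nid].has_free[order])
			return (int)nid;
	}
	return -1;
}
```

With DRAM exhausted and PMEM free, a PMD-order request fails in this model while an order-0 request still lands on the PMEM node, which matches the "normal pages will be allocated instead" behaviour described above.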
> The patch aims to prevent THP allocation in the pmem node. I have
> checked that no THP is created in the pmem node, which is intended.
> The DRAM node still has a lot of THPs, and pages there can still be
> collapsed.
>
>> Another choice is to split THP if migration fails. That's always a
>> question to prefer THP or local/hot normal pages.
> Test performance will decrease if there are a large number of THPs in the
> pmem node: promotion will fail more frequently than with normal page
> allocation, because DRAM memory is not enough, resulting in kswapd
> being woken up.
>
> Hence the effect is too many promotion failures and a higher
> pgpromote_demoted. And maybe the testcase does not really need the
> whole THP, but only a subpage of it.
Yes. So I suggest trying to fall back to splitting the THP upon THP
allocation failure on DRAM. Just disable the nosplit logic in migrate_pages().
Upstream already does as you said. It falls back to splitting the THP
into normal pages when promotion fails to allocate a THP on DRAM.
Not for NUMA balancing, because migrate_pages() has:

    bool nosplit = (reason == MR_NUMA_MISPLACED);
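
To make the point concrete, here is a minimal userspace sketch of how that flag gates the split-and-retry path. It assumes the flag is consulted together with an is_thp check when allocating the target page fails, which is my reading of migrate_pages() and is not shown in this thread. The enum below is a reduced, illustrative subset of the kernel's migrate_reason, and model_split_thp_on_alloc_failure() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Reduced subset of the kernel's enum migrate_reason;
 * the numeric values here are illustrative only. */
enum migrate_reason {
	MR_COMPACTION,
	MR_NUMA_MISPLACED,
	MR_DEMOTION,
};

/* Hypothetical helper: after failing to allocate a huge target page,
 * would the source THP be split and retried as base pages? */
static bool model_split_thp_on_alloc_failure(enum migrate_reason reason,
					     bool is_thp)
{
	/* Line quoted from migrate_pages() in the mail above. */
	bool nosplit = (reason == MR_NUMA_MISPLACED);

	/* Assumption: splitting is attempted only for THPs, and only
	 * when the migration reason does not forbid it. */
	return is_thp && !nosplit;
}
```

Under this model, a THP promoted by NUMA balancing (MR_NUMA_MISPLACED) is never split on allocation failure, while the same THP migrated for another reason such as MR_DEMOTION would be. Disabling the nosplit logic, as suggested, amounts to making the first case behave like the second.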