mm, swap: fix race between swap count continuation operations

One page may store a set of entries of the sis->swap_map
(swap_info_struct->swap_map) in multiple swap clusters.

If some of the entries has sis->swap_map[offset] > SWAP_MAP_MAX,
multiple pages will be used to store the set of entries of the
sis->swap_map.  And the pages are linked with page->lru.  This is called
swap count continuation.  To access the pages which store the set of
entries of the sis->swap_map simultaneously, previously, sis->lock is
used.  But to improve the scalability of __swap_duplicate(), swap
cluster lock may be used in swap_count_continued() now.  This may race
with add_swap_count_continuation() which operates on a nearby swap
cluster, in which the sis->swap_map entries are stored in the same page.

The race can cause wrong swap count in practice, thus cause unfreeable
swap entries or software lockup, etc.

To fix the race, a new spin lock called cont_lock is added to struct
swap_info_struct to protect the swap count continuation page list.  This
is a lock at the swap device level, so the scalability isn't very well.
But it is still much better than the original sis->lock, because it is
only acquired/released when swap count continuation is used.  Which is
considered rare in practice.  If it turns out that the scalability
becomes an issue for some workloads, we can split the lock into some
more fine grained locks.

Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <>
Cc: Johannes Weiner <>
Cc: Shaohua Li <>
Cc: Tim Chen <>
Cc: Michal Hocko <>
Cc: Aaron Lu <>
Cc: Dave Hansen <>
Cc: Andi Kleen <>
Cc: Minchan Kim <>
Cc: Hugh Dickins <>
Cc: <>	[4.11+]
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>
2 files changed