Christian Brauner 41ae496278 pnode: terminate at peers of source
commit 11933cf1d9 upstream.

The propagate_mnt() function handles mount propagation when creating
mounts and propagates the source mount tree @source_mnt to all
applicable nodes of the destination propagation mount tree headed by
@dest_mnt.

Unfortunately it contains a bug where it fails to terminate at peers of
@source_mnt when looking up copies of the source mount that become
masters for copies of the source mount tree mounted on top of slaves in
the destination propagation tree causing a NULL dereference.

Once the mechanics of the bug are understood it's easy to trigger.
Because of unprivileged user namespaces it is available to unprivileged
users.

While fixing this bug we've gotten confused multiple times due to
unclear terminology or missing concepts. So let's start this with some
clarifications:

* The terms "master" or "peer" denote a shared mount. A shared mount
  belongs to a peer group.

* A peer group is a set of shared mounts that propagate to each other.
  They are identified by a peer group id. The peer group id is available
  in @shared_mnt->mnt_group_id.
  Shared mounts within the same peer group have the same peer group id.
  The peers in a peer group can be reached via @shared_mnt->mnt_share.

* The terms "slave mount" or "dependent mount" denote a mount that
  receives propagation from a peer in a peer group. IOW, shared mounts
  may have slave mounts and slave mounts have shared mounts as their
  master. Slave mounts of a given peer in a peer group are listed on
  that peers slave list available at @shared_mnt->mnt_slave_list.

* The term "master mount" denotes a mount in a peer group. IOW, it
  denotes a shared mount or a peer mount in a peer group. The term
  "master mount" - or "master" for short - is mostly used when talking
  in the context of slave mounts that receive propagation from a master
  mount. A master mount of a slave identifies the closest peer group a
  slave mount receives propagation from. The master mount of a slave can
  be identified via @slave_mount->mnt_master. Different slaves may point
  to different masters in the same peer group.

* Multiple peers in a peer group can have non-empty ->mnt_slave_lists.
  Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to
  ensure all slave mounts of a peer group are visited the
  ->mnt_slave_lists of all peers in a peer group have to be walked.

* Slave mounts point to a peer in the closest peer group they receive
  propagation from via @slave_mnt->mnt_master (see above). Together with
  these peers they form a propagation group (see below). The closest
  peer group can thus be identified through the peer group id
  @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave
  mount receives propagation from.

* A shared-slave mount is a slave mount to a peer group pg1 while also
  a peer in another peer group pg2. IOW, a peer group may receive
  propagation from another peer group.

  If a peer group pg1 is a slave to another peer group pg2 then all
  peers in peer group pg1 point to the same peer in peer group pg2 via
  ->mnt_master. IOW, all peers in peer group pg1 appear on the same
  ->mnt_slave_list. IOW, they cannot be slaves to different peer groups.

* A pure slave mount is a slave mount that is a slave to a peer group
  but is not a peer in another peer group.

* A propagation group denotes the set of mounts consisting of a single
  peer group pg1 and all slave mounts and shared-slave mounts that point
  to a peer in that peer group via ->mnt_master. IOW, all slave mounts
  such that @slave_mnt->mnt_master->mnt_group_id is equal to
  @shared_mnt->mnt_group_id.

  The concept of a propagation group makes it easier to talk about a
  single propagation level in a propagation tree.

  For example, in propagate_mnt() the immediate peers of @dest_mnt and
  all slaves of @dest_mnt's peer group form a propagation group propg1.
  So a shared-slave mount that is a slave in propg1 and that is a peer
  in another peer group pg2 forms another propagation group propg2
  together with all slaves that point to that shared-slave mount in
  their ->mnt_master.

* A propagation tree refers to all mounts that receive propagation
  starting from a specific shared mount.

  For example, for propagate_mnt() @dest_mnt is the start of a
  propagation tree. The propagation tree ecompasses all mounts that
  receive propagation from @dest_mnt's peer group down to the leafs.

With that out of the way let's get to the actual algorithm.

We know that @dest_mnt is guaranteed to be a pure shared mount or a
shared-slave mount. This is guaranteed by a check in
attach_recursive_mnt(). So propagate_mnt() will first propagate the
source mount tree to all peers in @dest_mnt's peer group:

for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
        ret = propagate_one(n);
        if (ret)
               goto out;
}

Notice, that the peer propagation loop of propagate_mnt() doesn't
propagate @dest_mnt itself. @dest_mnt is mounted directly in
attach_recursive_mnt() after we propagated to the destination
propagation tree.

The mount that will be mounted on top of @dest_mnt is @source_mnt. This
copy was created earlier even before we entered attach_recursive_mnt()
and doesn't concern us a lot here.

It's just important to notice that when propagate_mnt() is called
@source_mnt will not yet have been mounted on top of @dest_mnt. Thus,
@source_mnt->mnt_parent will either still point to @source_mnt or - in
the case @source_mnt is moved and thus already attached - still to its
former parent.

For each peer @m in @dest_mnt's peer group propagate_one() will create a
new copy of the source mount tree and mount that copy @child on @m such
that @child->mnt_parent points to @m after propagate_one() returns.

propagate_one() will stash the last destination propagation node @m in
@last_dest and the last copy it created for the source mount tree in
@last_source.

Hence, if we call into propagate_one() again for the next destination
propagation node @m, @last_dest will point to the previous destination
propagation node and @last_source will point to the previous copy of the
source mount tree and mounted on @last_dest.

Each new copy of the source mount tree is created from the previous copy
of the source mount tree. This will become important later.

The peer loop in propagate_mnt() is straightforward. We iterate through
the peers copying and updating @last_source and @last_dest as we go
through them and mount each copy of the source mount tree @child on a
peer @m in @dest_mnt's peer group.

After propagate_mnt() handled the peers in @dest_mnt's peer group
propagate_mnt() will propagate the source mount tree down the
propagation tree that @dest_mnt's peer group propagates to:

for (m = next_group(dest_mnt, dest_mnt); m;
                m = next_group(m, dest_mnt)) {
        /* everything in that slave group */
        n = m;
        do {
                ret = propagate_one(n);
                if (ret)
                        goto out;
                n = next_peer(n);
        } while (n != m);
}

The next_group() helper will recursively walk the destination
propagation tree, descending into each propagation group of the
propagation tree.

The important part is that it takes care to propagate the source mount
tree to all peers in the peer group of a propagation group before it
propagates to the slaves to those peers in the propagation group. IOW,
it creates and mounts copies of the source mount tree that become
masters before it creates and mounts copies of the source mount tree
that become slaves to these masters.

It is important to remember that propagating the source mount tree to
each mount @m in the destination propagation tree simply means that we
create and mount new copies @child of the source mount tree on @m such
that @child->mnt_parent points to @m.

Since we know that each node @m in the destination propagation tree
headed by @dest_mnt's peer group will be overmounted with a copy of the
source mount tree and since we know that the propagation properties of
each copy of the source mount tree we create and mount at @m will mostly
mirror the propagation properties of @m. We can use that information to
create and mount the copies of the source mount tree that become masters
before their slaves.

The easy case is always when @m and @last_dest are peers in a peer group
of a given propagation group. In that case we know that we can simply
copy @last_source without having to figure out what the master for the
new copy @child of the source mount tree needs to be as we've done that
in a previous call to propagate_one().

The hard case is when we're dealing with a slave mount or a shared-slave
mount @m in a destination propagation group that we need to create and
mount a copy of the source mount tree on.

For each propagation group in the destination propagation tree we
propagate the source mount tree to we want to make sure that the copies
@child of the source mount tree we create and mount on slaves @m pick an
ealier copy of the source mount tree that we mounted on a master @m of
the destination propagation group as their master. This is a mouthful
but as far as we can tell that's the core of it all.

But, if we keep track of the masters in the destination propagation tree
@m we can use the information to find the correct master for each copy
of the source mount tree we create and mount at the slaves in the
destination propagation tree @m.

Let's walk through the base case as that's still fairly easy to grasp.

If we're dealing with the first slave in the propagation group that
@dest_mnt is in then we don't yet have marked any masters in the
destination propagation tree.

We know the master for the first slave to @dest_mnt's peer group is
simple @dest_mnt. So we expect this algorithm to yield a copy of the
source mount tree that was mounted on a peer in @dest_mnt's peer group
as the master for the copy of the source mount tree we want to mount at
the first slave @m:

for (n = m; ; n = p) {
        p = n->mnt_master;
        if (p == dest_master || IS_MNT_MARKED(p))
                break;
}

For the first slave we walk the destination propagation tree all the way
up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy
can be walked by walking up the @mnt->mnt_master hierarchy of the
destination propagation tree @m. We will ultimately find a peer in
@dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master.

Btw, here the assumption we listed at the beginning becomes important.
Namely, that peers in a peer group pg1 that are slaves in another peer
group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are
peers in peer group pg1 point to the same peer in peer group pg2 via
their ->mnt_master. Otherwise the termination condition in the code
above would be wrong and next_group() would be broken too.

So the first iteration sets:

n = m;
p = n->mnt_master;

such that @p now points to a peer or @dest_mnt itself. We walk up one
more level since we don't have any marked mounts. So we end up with:

n = dest_mnt;
p = dest_mnt->mnt_master;

If @dest_mnt's peer group is not slave to another peer group then @p is
now NULL. If @dest_mnt's peer group is a slave to another peer group
then @p now points to @dest_mnt->mnt_master points which is a master
outside the propagation tree we're dealing with.

Now we need to figure out the master for the copy of the source mount
tree we're about to create and mount on the first slave of @dest_mnt's
peer group:

do {
        struct mount *parent = last_source->mnt_parent;
        if (last_source == first_source)
                break;
        done = parent->mnt_master == p;
        if (done && peers(n, parent))
                break;
        last_source = last_source->mnt_master;
} while (!done);

We know that @last_source->mnt_parent points to @last_dest and
@last_dest is the last peer in @dest_mnt's peer group we propagated to
in the peer loop in propagate_mnt().

Consequently, @last_source is the last copy we created and mount on that
last peer in @dest_mnt's peer group. So @last_source is the master we
want to pick.

We know that @last_source->mnt_parent->mnt_master points to
@last_dest->mnt_master. We also know that @last_dest->mnt_master is
either NULL or points to a master outside of the destination propagation
tree and so does @p. Hence:

done = parent->mnt_master == p;

is trivially true in the base condition.

We also know that for the first slave mount of @dest_mnt's peer group
that @last_dest either points @dest_mnt itself because it was
initialized to:

last_dest = dest_mnt;

at the beginning of propagate_mnt() or it will point to a peer of
@dest_mnt in its peer group. In both cases it is guaranteed that on the
first iteration @n and @parent are peers (Please note the check for
peers here as that's important.):

if (done && peers(n, parent))
        break;

So, as we expected, we select @last_source, which referes to the last
copy of the source mount tree we mounted on the last peer in @dest_mnt's
peer group, as the master of the first slave in @dest_mnt's peer group.
The rest is taken care of by clone_mnt(last_source, ...). We'll skip
over that part otherwise this becomes a blogpost.

At the end of propagate_mnt() we now mark @m->mnt_master as the first
master in the destination propagation tree that is distinct from
@dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master.

By marking @dest_mnt or one of it's peers we are able to easily find it
again when we later lookup masters for other copies of the source mount
tree we mount copies of the source mount tree on slaves @m to
@dest_mnt's peer group. This, in turn allows us to find the master we
selected for the copies of the source mount tree we mounted on master in
the destination propagation tree again.

The important part is to realize that the code makes use of the fact
that the last copy of the source mount tree stashed in @last_source was
mounted on top of the previous destination propagation node @last_dest.
What this means is that @last_source allows us to walk the destination
propagation hierarchy the same way each destination propagation node @m
does.

If we take @last_source, which is the copy of @source_mnt we have
mounted on @last_dest in the previous iteration of propagate_one(), then
we know @last_source->mnt_parent points to @last_dest but we also know
that as we walk through the destination propagation tree that
@last_source->mnt_master will point to an earlier copy of the source
mount tree we mounted one an earlier destination propagation node @m.

IOW, @last_source->mnt_parent will be our hook into the destination
propagation tree and each consecutive @last_source->mnt_master will lead
us to an earlier propagation node @m via
@last_source->mnt_master->mnt_parent.

Hence, by walking up @last_source->mnt_master, each of which is mounted
on a node that is a master @m in the destination propagation tree we can
also walk up the destination propagation hierarchy.

So, for each new destination propagation node @m we use the previous
copy of @last_source and the fact it's mounted on the previous
propagation node @last_dest via @last_source->mnt_master->mnt_parent to
determine what the master of the new copy of @last_source needs to be.

The goal is to find the _closest_ master that the new copy of the source
mount tree we are about to create and mount on a slave @m in the
destination propagation tree needs to pick. IOW, we want to find a
suitable master in the propagation group.

As the propagation structure of the source mount propagation tree we
create mirrors the propagation structure of the destination propagation
tree we can find @m's closest master - i.e., a marked master - which is
a peer in the closest peer group that @m receives propagation from. We
store that closest master of @m in @p as before and record the slave to
that master in @n

We then search for this master @p via @last_source by walking up the
master hierarchy starting from the last copy of the source mount tree
stored in @last_source that we created and mounted on the previous
destination propagation node @m.

We will try to find the master by walking @last_source->mnt_master and
by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If
we find @p then we can figure out what earlier copy of the source mount
tree needs to be the master for the new copy of the source mount tree
we're about to create and mount at the current destination propagation
node @m.

If @last_source->mnt_master->mnt_parent and @n are peers then we know
that the closest master they receive propagation from is
@last_source->mnt_master->mnt_parent->mnt_master. If not then the
closest immediate peer group that they receive propagation from must be
one level higher up.

This builds on the earlier clarification at the beginning that all peers
in a peer group which are slaves of other peer groups all point to the
same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the
closest peer group that they receive propagation from.

However, terminating the walk has corner cases.

If the closest marked master for a given destination node @m cannot be
found by walking up the master hierarchy via @last_source->mnt_master
then we need to terminate the walk when we encounter @source_mnt again.

This isn't an arbitrary termination. It simply means that the new copy
of the source mount tree we're about to create has a copy of the source
mount tree we created and mounted on a peer in @dest_mnt's peer group as
its master. IOW, @source_mnt is the peer in the closest peer group that
the new copy of the source mount tree receives propagation from.

We absolutely have to stop @source_mnt because @last_source->mnt_master
either points outside the propagation hierarchy we're dealing with or it
is NULL because @source_mnt isn't a shared-slave.

So continuing the walk past @source_mnt would cause a NULL dereference
via @last_source->mnt_master->mnt_parent. And so we have to stop the
walk when we encounter @source_mnt again.

One scenario where this can happen is when we first handled a series of
slaves of @dest_mnt's peer group and then encounter peers in a new peer
group that is a slave to @dest_mnt's peer group. We handle them and then
we encounter another slave mount to @dest_mnt that is a pure slave to
@dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's
peer group as its master. Consequently, the new copy of the source mount
tree will need to have @source_mnt as it's master. So we walk the
propagation hierarchy all the way up to @source_mnt based on
@last_source->mnt_master.

So terminate on @source_mnt, easy peasy. Except, that the check misses
something that the rest of the algorithm already handles.

If @dest_mnt has peers in it's peer group the peer loop in
propagate_mnt():

for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
        ret = propagate_one(n);
        if (ret)
                goto out;
}

will consecutively update @last_source with each previous copy of the
source mount tree we created and mounted at the previous peer in
@dest_mnt's peer group. So after that loop terminates @last_source will
point to whatever copy of the source mount tree was created and mounted
on the last peer in @dest_mnt's peer group.

Furthermore, if there is even a single additional peer in @dest_mnt's
peer group then @last_source will __not__ point to @source_mnt anymore.
Because, as we mentioned above, @dest_mnt isn't even handled in this
loop but directly in attach_recursive_mnt(). So it can't even accidently
come last in that peer loop.

So the first time we handle a slave mount @m of @dest_mnt's peer group
the copy of the source mount tree we create will make the __last copy of
the source mount tree we created and mounted on the last peer in
@dest_mnt's peer group the master of the new copy of the source mount
tree we create and mount on the first slave of @dest_mnt's peer group__.

But this means that the termination condition that checks for
@source_mnt is wrong. The @source_mnt cannot be found anymore by
propagate_one(). Instead it will find the last copy of the source mount
tree we created and mounted for the last peer of @dest_mnt's peer group
again. And that is a peer of @source_mnt not @source_mnt itself.

IOW, we fail to terminate the loop correctly and ultimately dereference
@last_source->mnt_master->mnt_parent. When @source_mnt's peer group
isn't slave to another peer group then @last_source->mnt_master is NULL
causing the splat below.

For example, assume @dest_mnt is a pure shared mount and has three peers
in its peer group:

===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
(@dest_mnt) mnt_master[216]              309        297               shared:216
    \
     (@source_mnt) mnt_master[218]:      609        609               shared:218

(1) mnt_master[216]:                     607        605               shared:216
    \
     (P1) mnt_master[218]:               624        607               shared:218

(2) mnt_master[216]:                     576        574               shared:216
    \
     (P2) mnt_master[218]:               625        576               shared:218

(3) mnt_master[216]:                     545        543               shared:216
    \
     (P3) mnt_master[218]:               626        545               shared:218

After this sequence has been processed @last_source will point to (P3),
the copy generated for the third peer in @dest_mnt's peer group we
handled. So the copy of the source mount tree (P4) we create and mount
on the first slave of @dest_mnt's peer group:

===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
    mnt_master[216]                      309        297               shared:216
   /
  /
(S0) mnt_slave                           483        481               master:216
  \
   \    (P3) mnt_master[218]             626        545               shared:218
    \  /
     \/
    (P4) mnt_slave                       627        483               master:218

will pick the last copy of the source mount tree (P3) as master, not (S0).

When walking the propagation hierarchy via @last_source's master
hierarchy we encounter (P3) but not (S0), i.e., @source_mnt.

We can fix this in multiple ways:

(1) By setting @last_source to @source_mnt after we processed the peers
    in @dest_mnt's peer group right after the peer loop in
    propagate_mnt().

(2) By changing the termination condition that relies on finding exactly
    @source_mnt to finding a peer of @source_mnt.

(3) By only moving @last_source when we actually venture into a new peer
    group or some clever variant thereof.

The first two options are minimally invasive and what we want as a fix.
The third option is more intrusive but something we'd like to explore in
the near future.

This passes all LTP tests and specifically the mount propagation
testsuite part of it. It also holds up against all known reproducers of
this issues.

Final words.
First, this is a clever but __worringly__ underdocumented algorithm.
There isn't a single detailed comment to be found in next_group(),
propagate_one() or anywhere else in that file for that matter. This has
been a giant pain to understand and work through and a bug like this is
insanely difficult to fix without a detailed understanding of what's
happening. Let's not talk about the amount of time that was sunk into
fixing this.

Second, all the cool kids with access to
unshare --mount --user --map-root --propagation=unchanged
are going to have a lot of fun. IOW, triggerable by unprivileged users
while namespace_lock() lock is held.

[  115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  115.848967] #PF: supervisor read access in kernel mode
[  115.849386] #PF: error_code(0x0000) - not-present page
[  115.849803] PGD 0 P4D 0
[  115.850012] Oops: 0000 [#1] PREEMPT SMP PTI
[  115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3
[  115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[  115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[  115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[  115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[  115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[  115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[  115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[  115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[  115.856859] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[  115.857531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[  115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  115.860099] Call Trace:
[  115.860358]  <TASK>
[  115.860535]  propagate_mnt+0x14d/0x190
[  115.860848]  attach_recursive_mnt+0x274/0x3e0
[  115.861212]  path_mount+0x8c8/0xa60
[  115.861503]  __x64_sys_mount+0xf6/0x140
[  115.861819]  do_syscall_64+0x5b/0x80
[  115.862117]  ? do_faccessat+0x123/0x250
[  115.862435]  ? syscall_exit_to_user_mode+0x17/0x40
[  115.862826]  ? do_syscall_64+0x67/0x80
[  115.863133]  ? syscall_exit_to_user_mode+0x17/0x40
[  115.863527]  ? do_syscall_64+0x67/0x80
[  115.863835]  ? do_syscall_64+0x67/0x80
[  115.864144]  ? do_syscall_64+0x67/0x80
[  115.864452]  ? exc_page_fault+0x70/0x170
[  115.864775]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  115.865187] RIP: 0033:0x7f92c92b0ebe
[  115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89
01 48
[  115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX:
00000000000000a5
[  115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe
[  115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620
[  115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[  115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620
[  115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076
[  115.870581]  </TASK>
[  115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0
sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr
intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev
soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic
pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  115.875288] CR2: 0000000000000010
[  115.875641] ---[ end trace 0000000000000000 ]---
[  115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[  115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[  115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[  115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[  115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[  115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[  115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[  115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[  115.881548] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[  115.882234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[  115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: f2ebb3a921 ("smarter propagate_mnt()")
Fixes: 5ec0811d30 ("propogate_mnt: Handle the first propogated copy being a slave")
Cc: <stable@vger.kernel.org>
Reported-by: Ditang Chen <ditang.c@gmail.com>
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-29 09:27:25 +09:00
2023-06-29 09:27:25 +09:00
2005-09-10 10:06:29 -07:00
2023-06-29 09:26:30 +09:00
2023-06-29 09:27:15 +09:00

        Linux kernel release 4.x <http://kernel.org/>

These are the release notes for Linux version 4.  Read them carefully,
as they tell you what this is all about, explain how to install the
kernel, and what to do if something goes wrong.

WHAT IS LINUX?

  Linux is a clone of the operating system Unix, written from scratch by
  Linus Torvalds with assistance from a loosely-knit team of hackers across
  the Net. It aims towards POSIX and Single UNIX Specification compliance.

  It has all the features you would expect in a modern fully-fledged Unix,
  including true multitasking, virtual memory, shared libraries, demand
  loading, shared copy-on-write executables, proper memory management,
  and multistack networking including IPv4 and IPv6.

  It is distributed under the GNU General Public License - see the
  accompanying COPYING file for more details.

ON WHAT HARDWARE DOES IT RUN?

  Although originally developed first for 32-bit x86-based PCs (386 or higher),
  today Linux also runs on (at least) the Compaq Alpha AXP, Sun SPARC and
  UltraSPARC, Motorola 68000, PowerPC, PowerPC64, ARM, Hitachi SuperH, Cell,
  IBM S/390, MIPS, HP PA-RISC, Intel IA-64, DEC VAX, AMD x86-64, AXIS CRIS,
  Xtensa, Tilera TILE, AVR32, ARC and Renesas M32R architectures.

  Linux is easily portable to most general-purpose 32- or 64-bit architectures
  as long as they have a paged memory management unit (PMMU) and a port of the
  GNU C compiler (gcc) (part of The GNU Compiler Collection, GCC). Linux has
  also been ported to a number of architectures without a PMMU, although
  functionality is then obviously somewhat limited.
  Linux has also been ported to itself. You can now run the kernel as a
  userspace application - this is called UserMode Linux (UML).

DOCUMENTATION:

 - There is a lot of documentation available both in electronic form on
   the Internet and in books, both Linux-specific and pertaining to
   general UNIX questions.  I'd recommend looking into the documentation
   subdirectories on any Linux FTP site for the LDP (Linux Documentation
   Project) books.  This README is not meant to be documentation on the
   system: there are much better sources available.

 - There are various README files in the Documentation/ subdirectory:
   these typically contain kernel-specific installation notes for some
   drivers for example. See Documentation/00-INDEX for a list of what
   is contained in each file.  Please read the Changes file, as it
   contains information about the problems, which may result by upgrading
   your kernel.

 - The Documentation/DocBook/ subdirectory contains several guides for
   kernel developers and users.  These guides can be rendered in a
   number of formats:  PostScript (.ps), PDF, HTML, & man-pages, among others.
   After installation, "make psdocs", "make pdfdocs", "make htmldocs",
   or "make mandocs" will render the documentation in the requested format.

INSTALLING the kernel source:

 - If you install the full sources, put the kernel tarball in a
   directory where you have permissions (e.g. your home directory) and
   unpack it:

     xz -cd linux-4.X.tar.xz | tar xvf -

   Replace "X" with the version number of the latest kernel.

   Do NOT use the /usr/src/linux area! This area has a (usually
   incomplete) set of kernel headers that are used by the library header
   files.  They should match the library, and not get messed up by
   whatever the kernel-du-jour happens to be.

 - You can also upgrade between 4.x releases by patching.  Patches are
   distributed in the xz format.  To install by patching, get all the
   newer patch files, enter the top level directory of the kernel source
   (linux-4.X) and execute:

     xz -cd ../patch-4.x.xz | patch -p1

   Replace "x" for all versions bigger than the version "X" of your current
   source tree, _in_order_, and you should be ok.  You may want to remove
   the backup files (some-file-name~ or some-file-name.orig), and make sure
   that there are no failed patches (some-file-name# or some-file-name.rej).
   If there are, either you or I have made a mistake.

   Unlike patches for the 4.x kernels, patches for the 4.x.y kernels
   (also known as the -stable kernels) are not incremental but instead apply
   directly to the base 4.x kernel.  For example, if your base kernel is 4.0
   and you want to apply the 4.0.3 patch, you must not first apply the 4.0.1
   and 4.0.2 patches. Similarly, if you are running kernel version 4.0.2 and
   want to jump to 4.0.3, you must first reverse the 4.0.2 patch (that is,
   patch -R) _before_ applying the 4.0.3 patch. You can read more on this in
   Documentation/applying-patches.txt

   Alternatively, the script patch-kernel can be used to automate this
   process.  It determines the current kernel version and applies any
   patches found.

     linux/scripts/patch-kernel linux

   The first argument in the command above is the location of the
   kernel source.  Patches are applied from the current directory, but
   an alternative directory can be specified as the second argument.

 - Make sure you have no stale .o files and dependencies lying around:

     cd linux
     make mrproper

   You should now have the sources correctly installed.

SOFTWARE REQUIREMENTS

   Compiling and running the 4.x kernels requires up-to-date
   versions of various software packages.  Consult
   Documentation/Changes for the minimum version numbers required
   and how to get updates for these packages.  Beware that using
   excessively old versions of these packages can cause indirect
   errors that are very difficult to track down, so don't assume that
   you can just update packages when obvious problems arise during
   build or operation.

BUILD directory for the kernel:

   When compiling the kernel, all output files will per default be
   stored together with the kernel source code.
   Using the option "make O=output/dir" allows you to specify an alternate
   place for the output files (including .config).
   Example:

     kernel source code: /usr/src/linux-4.X
     build directory:    /home/name/build/kernel

   To configure and build the kernel, use:

     cd /usr/src/linux-4.X
     make O=/home/name/build/kernel menuconfig
     make O=/home/name/build/kernel
     sudo make O=/home/name/build/kernel modules_install install

   Please note: If the 'O=output/dir' option is used, then it must be
   used for all invocations of make.

CONFIGURING the kernel:

   Do not skip this step even if you are only upgrading one minor
   version.  New configuration options are added in each release, and
   odd problems will turn up if the configuration files are not set up
   as expected.  If you want to carry your existing configuration to a
   new version with minimal work, use "make oldconfig", which will
   only ask you for the answers to new questions.

 - Alternative configuration commands are:

     "make config"      Plain text interface.

     "make menuconfig"  Text based color menus, radiolists & dialogs.

     "make nconfig"     Enhanced text based color menus.

     "make xconfig"     Qt based configuration tool.

     "make gconfig"     GTK+ based configuration tool.

     "make oldconfig"   Default all questions based on the contents of
                        your existing ./.config file and asking about
                        new config symbols.

     "make silentoldconfig"
                        Like above, but avoids cluttering the screen
                        with questions already answered.
                        Additionally updates the dependencies.

     "make olddefconfig"
                        Like above, but sets new symbols to their default
                        values without prompting.

     "make defconfig"   Create a ./.config file by using the default
                        symbol values from either arch/$ARCH/defconfig
                        or arch/$ARCH/configs/${PLATFORM}_defconfig,
                        depending on the architecture.

     "make ${PLATFORM}_defconfig"
                        Create a ./.config file by using the default
                        symbol values from
                        arch/$ARCH/configs/${PLATFORM}_defconfig.
                        Use "make help" to get a list of all available
                        platforms of your architecture.

     "make allyesconfig"
                        Create a ./.config file by setting symbol
                        values to 'y' as much as possible.

     "make allmodconfig"
                        Create a ./.config file by setting symbol
                        values to 'm' as much as possible.

     "make allnoconfig" Create a ./.config file by setting symbol
                        values to 'n' as much as possible.

     "make randconfig"  Create a ./.config file by setting symbol
                        values to random values.

     "make localmodconfig" Create a config based on current config and
                           loaded modules (lsmod). Disables any module
                           option that is not needed for the loaded modules.

                           To create a localmodconfig for another machine,
                           store the lsmod of that machine into a file
                           and pass it in as a LSMOD parameter.

                   target$ lsmod > /tmp/mylsmod
                   target$ scp /tmp/mylsmod host:/tmp

                   host$ make LSMOD=/tmp/mylsmod localmodconfig

                           The above also works when cross compiling.

     "make localyesconfig" Similar to localmodconfig, except it will convert
                           all module options to built in (=y) options.

   You can find more information on using the Linux kernel config tools
   in Documentation/kbuild/kconfig.txt.

 - NOTES on "make config":

    - Having unnecessary drivers will make the kernel bigger, and can
      under some circumstances lead to problems: probing for a
      nonexistent controller card may confuse your other controllers

    - A kernel with math-emulation compiled in will still use the
      coprocessor if one is present: the math emulation will just
      never get used in that case.  The kernel will be slightly larger,
      but will work on different machines regardless of whether they
      have a math coprocessor or not.

    - The "kernel hacking" configuration details usually result in a
      bigger or slower kernel (or both), and can even make the kernel
      less stable by configuring some routines to actively try to
      break bad code to find kernel problems (kmalloc()).  Thus you
      should probably answer 'n' to the questions for "development",
      "experimental", or "debugging" features.

COMPILING the kernel:

 - Make sure you have at least gcc 3.2 available.
   For more information, refer to Documentation/Changes.

   Please note that you can still run a.out user programs with this kernel.

 - Do a "make" to create a compressed kernel image. It is also
   possible to do "make install" if you have lilo installed to suit the
   kernel makefiles, but you may want to check your particular lilo setup first.

   To do the actual install, you have to be root, but none of the normal
   build should require that. Don't take the name of root in vain.

 - If you configured any of the parts of the kernel as `modules', you
   will also have to do "make modules_install".

 - Verbose kernel compile/build output:

   Normally, the kernel build system runs in a fairly quiet mode (but not
   totally silent).  However, sometimes you or other kernel developers need
   to see compile, link, or other commands exactly as they are executed.
   For this, use "verbose" build mode.  This is done by passing
   "V=1" to the "make" command, e.g.

     make V=1 all

   To have the build system also tell the reason for the rebuild of each
   target, use "V=2".  The default is "V=0".

 - Keep a backup kernel handy in case something goes wrong.  This is
   especially true for the development releases, since each new release
   contains new code which has not been debugged.  Make sure you keep a
   backup of the modules corresponding to that kernel, as well.  If you
   are installing a new kernel with the same version number as your
   working kernel, make a backup of your modules directory before you
   do a "make modules_install".

   Alternatively, before compiling, use the kernel config option
   "LOCALVERSION" to append a unique suffix to the regular kernel version.
   LOCALVERSION can be set in the "General Setup" menu.

 - In order to boot your new kernel, you'll need to copy the kernel
   image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
   to the place where your regular bootable kernel is found.

 - Booting a kernel directly from a floppy without the assistance of a
   bootloader such as LILO, is no longer supported.

   If you boot Linux from the hard drive, chances are you use LILO, which
   uses the kernel image as specified in the file /etc/lilo.conf.  The
   kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
   /boot/bzImage.  To use the new kernel, save a copy of the old image
   and copy the new image over the old one.  Then, you MUST RERUN LILO
   to update the loading map! If you don't, you won't be able to boot
   the new kernel image.

   Reinstalling LILO is usually a matter of running /sbin/lilo.
   You may wish to edit /etc/lilo.conf to specify an entry for your
   old kernel image (say, /vmlinux.old) in case the new one does not
   work.  See the LILO docs for more information.

   After reinstalling LILO, you should be all set.  Shutdown the system,
   reboot, and enjoy!

   If you ever need to change the default root device, video mode,
   ramdisk size, etc.  in the kernel image, use the 'rdev' program (or
   alternatively the LILO boot options when appropriate).  No need to
   recompile the kernel to change these parameters.

 - Reboot with the new kernel and enjoy.

IF SOMETHING GOES WRONG:

 - If you have problems that seem to be due to kernel bugs, please check
   the file MAINTAINERS to see if there is a particular person associated
   with the part of the kernel that you are having trouble with. If there
   isn't anyone listed there, then the second best thing is to mail
   them to me (torvalds@linux-foundation.org), and possibly to any other
   relevant mailing-list or to the newsgroup.

 - In all bug-reports, *please* tell what kernel you are talking about,
   how to duplicate the problem, and what your setup is (use your common
   sense).  If the problem is new, tell me so, and if the problem is
   old, please try to tell me when you first noticed it.

 - If the bug results in a message like

     unable to handle kernel paging request at address C0000010
     Oops: 0002
     EIP:   0010:XXXXXXXX
     eax: xxxxxxxx   ebx: xxxxxxxx   ecx: xxxxxxxx   edx: xxxxxxxx
     esi: xxxxxxxx   edi: xxxxxxxx   ebp: xxxxxxxx
     ds: xxxx  es: xxxx  fs: xxxx  gs: xxxx
     Pid: xx, process nr: xx
     xx xx xx xx xx xx xx xx xx xx

   or similar kernel debugging information on your screen or in your
   system log, please duplicate it *exactly*.  The dump may look
   incomprehensible to you, but it does contain information that may
   help debugging the problem.  The text above the dump is also
   important: it tells something about why the kernel dumped code (in
   the above example, it's due to a bad kernel pointer). More information
   on making sense of the dump is in Documentation/oops-tracing.txt

 - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump
   as is, otherwise you will have to use the "ksymoops" program to make
   sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred).
   This utility can be downloaded from
   ftp://ftp.<country>.kernel.org/pub/linux/utils/kernel/ksymoops/ .
   Alternatively, you can do the dump lookup by hand:

 - In debugging dumps like the above, it helps enormously if you can
   look up what the EIP value means.  The hex value as such doesn't help
   me or anybody else very much: it will depend on your particular
   kernel setup.  What you should do is take the hex value from the EIP
   line (ignore the "0010:"), and look it up in the kernel namelist to
   see which kernel function contains the offending address.

   To find out the kernel function name, you'll need to find the system
   binary associated with the kernel that exhibited the symptom.  This is
   the file 'linux/vmlinux'.  To extract the namelist and match it against
   the EIP from the kernel crash, do:

     nm vmlinux | sort | less

   This will give you a list of kernel addresses sorted in ascending
   order, from which it is simple to find the function that contains the
   offending address.  Note that the address given by the kernel
   debugging messages will not necessarily match exactly with the
   function addresses (in fact, that is very unlikely), so you can't
   just 'grep' the list: the list will, however, give you the starting
   point of each kernel function, so by looking for the function that
   has a starting address lower than the one you are searching for but
   is followed by a function with a higher address you will find the one
   you want.  In fact, it may be a good idea to include a bit of
   "context" in your problem report, giving a few lines around the
   interesting one.

   If you for some reason cannot do the above (you have a pre-compiled
   kernel image or similar), telling me as much about your setup as
   possible will help.  Please read the REPORTING-BUGS document for details.

 - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you
   cannot change values or set break points.) To do this, first compile the
   kernel with -g; edit arch/x86/Makefile appropriately, then do a "make
   clean". You'll also need to enable CONFIG_PROC_FS (via "make config").

   After you've rebooted with the new kernel, do "gdb vmlinux /proc/kcore".
   You can now use all the usual gdb commands. The command to look up the
   point where your system crashed is "l *0xXXXXXXXX". (Replace the XXXes
   with the EIP value.)

   gdb'ing a non-running kernel currently fails because gdb (wrongly)
   disregards the starting offset for which the kernel is compiled.

Description
No description provided
Readme 7.8 GiB
Languages
C 97.7%
Assembly 1.1%
Shell 0.4%
Makefile 0.3%
Python 0.2%