ima_parse_buf() takes as input the buffer start and end pointers, and
stores the result in a static array of ima_field_data structures,
where the len field contains the length parsed from the buffer, and
the data field contains the address of the buffer just after the length.
Optionally, the function returns the current value of the buffer pointer
and the number of array elements written.
A bitmap has been added as parameter of ima_parse_buf() to handle
the cases where the length is not prepended to data. Each bit corresponds
to an element of the ima_field_data array. If a bit is set, the length
is not parsed from the buffer, but is read from the corresponding element
of the array (the length must be set before calling the function).
ima_parse_buf() can perform three checks upon request by callers,
depending on the enforce mask passed to it:
- ENFORCE_FIELDS: matching of number of fields (length-data combination)
- there must be enough data in the buffer to parse the number of fields
requested (output: current value of buffer pointer)
- ENFORCE_BUFEND: matching of buffer end
- the ima_field_data array must be large enough to contain lengths and
data pointers for the amount of data requested (output: number
of fields written)
- ENFORCE_FIELDS | ENFORCE_BUFEND: matching of both
Use cases
- measurement entry header: ENFORCE_FIELDS | ENFORCE_BUFEND
- four fields must be parsed: pcr, digest, template name, template data
- ENFORCE_BUFEND is enforced only for the last measurement entry
- template digest (Crypto Agile): ENFORCE_BUFEND
- since only the total template digest length is known, the function
parses length-data combinations until the buffer end is reached
- template data: ENFORCE_FIELDS | ENFORCE_BUFEND
- since the number of fields and the total template data length
are known, the function can perform both checks
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
cgroups2 is beginning to show up in wider usage. Add it to the default
nomeasure/noappraise list like other filesystems.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
While reading the code, I noticed that these #endif comments don't match
how they're actually nested. This patch fixes that.
Signed-off-by: Tycho Andersen <tycho@docker.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
IMA uses the hash algorithm too early to be able to use a module.
Require the selected hash algorithm to be built-in.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Only return enabled if in enforcing mode, not fix or log modes.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Changes:
- Define is_ima_appraise_enabled() as a bool (Thiago Bauermann)
The builtin "ima_appraise_tcb" policy should require file signatures for
at least a few of the hooks (eg. kernel modules, firmware, and the kexec
kernel image), but changing it would break the existing userspace/kernel
ABI.
This patch defines a new builtin policy named "secure_boot", which
can be specified on the "ima_policy=" boot command line, independently
or in conjunction with the "ima_appraise_tcb" policy, by specifing
ima_policy="appraise_tcb | secure_boot". The new appraisal rules
requiring file signatures will be added prior to the "ima_appraise_tcb"
rules.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Changelog:
- Reference secure boot in the new builtin policy name. (Thiago Bauermann)
Add support for providing multiple builtin policies on the "ima_policy="
boot command line. Use "|" as the delimitor separating the policy names.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Pull more ufs fixes from Al Viro:
"More UFS fixes, unfortunately including build regression fix for the
64-bit s_dsize commit. Fixed in this pile:
- trivial bug in signedness of 32bit timestamps on ufs1
- ESTALE instead of ufs_error() when doing open-by-fhandle on
something deleted
- build regression on 32bit in ufs_new_fragments() - calculating that
many percents of u64 pulls libgcc stuff on some of those. Mea
culpa.
- fix hysteresis loop broken by typo in 2.4.14.7 (right next to the
location of previous bug).
- fix the insane limits of said hysteresis loop on filesystems with
very low percentage of reserved blocks. If it's 5% or less, just
use the OPTSPACE policy.
- calculate those limits once and mount time.
This tree does pass xfstests clean (both ufs1 and ufs2) and it _does_
survive cross-builds.
Again, my apologies for missing that, especially since I have noticed
a related percentage-of-64bit issue in earlier patches (when dealing
with amount of reserved blocks). Self-LART applied..."
* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix the logics for tail relocation
ufs_iget(): fail with -ESTALE on deleted inode
fix signedness of timestamps on ufs1
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
New features:
- Add support to measure SMI cost in 'perf stat' (Kan Liang)
- Add support for unwinding callchains in powerpc with libdw (Paolo Bonzini)
Fixes:
- Fix message: cpu list option is -C not -c (Adrian Hunter)
- Fix 'perf script' message: field list option is -F not -f (Adrian Hunter)
- Intel PT fixes: (Adrian Hunter)
o Fix missing stack clear
o Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
o Fix last_ip usage
o Ensure never to set 'last_ip' when packet 'count' is zero
o Clear FUP flag on error
o Fix transactions_sample_type
Infrastructure changes:
- Intel PT cleanups/refactorings (Adrian Hunter)
o Use FUP always when scanning for an IP
o Add missing __fallthrough
o Remove redundant initial_skip checks
o Allow decoding with branch tracing disabled
o Add default config for pass-through branch enable
o Add documentation for new config terms
o Add decoder support for ptwrite and power event packets
o Add reserved byte to CBR packet payload
o Add decoder support for CBR events
- Move find_process() to the only place that uses it, skimming some
more fat from util.[ch] (Arnaldo Carvalho de Melo)
- Do parameter validation earlier on fetch_kernel_version() (Arnaldo Carvalho de Melo)
- Remove unused _ALL_SOURCE define (Arnaldo Carvalho de Melo)
- Add sysfs__write_int function (Kan Liang)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix expand_upwards() on architectures with an upward-growing stack (parisc,
metag and partly IA-64) to allow the stack to reliably grow exactly up to
the address space limit given by TASK_SIZE.
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since blk_mq_quiesce_queue_nowait() can be called from interrupt
context, make this safe. Since this function is not in the hot
path, uninline it.
Fixes: commit f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid that building with W=1 causes the compiler to complain that
a declaration for bounce_bio_set and bounce_bio_split is missing.
References: commit a8821f3f32 ("block: Improvements to bounce-buffer handling")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the
end of unmapped_area_topdown(). Linus points out how MAP_FIXED
(which does not have to respect our stack guard gap intentions)
could result in gap_end below gap_start there. Fix that, and
the similar case in its alternative, unmapped_area().
Cc: stable@vger.kernel.org
Fixes: 1be7107fbe ("mm: larger stack guard gap, between vmas")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Debugged-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From 2c06e795162cb306c9707ec51d3e1deadb37f573 Mon Sep 17 00:00:00 2001
From: Dennis Zhou <dennisz@fb.com>
Date: Wed, 21 Jun 2017 10:17:09 -0700
Commit 30a5b5367e ("percpu: expose statistics about percpu memory via
debugfs") introduces percpu memory statistics. pcpu_stats_chunk_alloc
takes the spin lock and disables/enables irqs on creation of a chunk. Irqs
are not enabled when the first chunk is initialized and thus kernels are
failing to boot with kernel debugging enabled. Fixed by changing _irq to
_irqsave and _irqrestore.
Fixes: 30a5b5367e ("percpu: expose statistics about percpu memory via debugfs")
Signed-off-by: Dennis Zhou <dennisz@fb.com>
Reported-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also use for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.
This multiple usage can lead to deadlocks. It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.
This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.
Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so request using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that iO.
Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The Cortina Systems Gemini (SL3516/CS3516) has an on-chip clock
controller that derive all clocks from a single crystal, using some
documented and some undocumented PLLs, half dividers, counters and
gates. This is a best attempt to construct a clock driver for the
clocks so at least we can gate off unused hardware and driver the
PCI bus clock.
Acked-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[sboyd@codeaurora.org: Fix devm_ioremap_resource() return value
checking]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
All the enum flags for FTRACE_OPS has a comment except for the RCU one. Add
the comment for that.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
btrfs_del_root_ref calls btrfs_search_slot and reads name from root_ref.
Call btrfs_is_name_len_valid before memcmp.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_name, there's btrfs_search_slot and reads name from
inode_ref/root_ref.
Call btrfs_is_name_len_valid in btrfs_get_name.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since iterate_dir_item checks name_len in its own way,
so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict
name_len check.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switched ENAMETOOLONG to EIO ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_log_inode, btrfs_search_forward gets the buffer and then
btrfs_check_ref_name_override will read name from ref/extref for the
first time.
Call btrfs_is_name_len_valid before reading name.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
replay_xattr_deletes calls btrfs_search_slot to get buffer and reads
name.
Call verify_dir_item to check name_len in replay_xattr_deletes to avoid
reading out of boundary.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
replay_one_buffer first reads buffers and dispatches items accroding to
the item type.
In this patch, add_inode_ref handles inode_ref and inode_extref.
Then add_inode_ref calls ref_get_fields and extref_get_fields to read
ref/extref name for the first time.
So checking name_len before reading those two is fine.
add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
The call graph includes btrfs_match_dir_item_name to read dir_item name
in the parent dir.
Checking first dir_item is not enough. Change it to verify every
dir_item while doing matches.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function btrfs_is_name_len_valid.
The function compares parameter @name_len with item boundary then
returns true if name_len is valid.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.
Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).
The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.
Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.
Signed-off-by: David Sterba <dsterba@suse.com>
If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.
Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues, if we don't have to.
Max reports that this restores performance. Before this patch, 4K
IOPS was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.
Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If only a subset of the devices associated with multiple regions support
a given special operation (eg. DISCARD) then the dec_count() that is
used to set error for the region must increment the io->count.
Otherwise, when the dec_count() is called it can cause the dm-io
caller's bio to be completed multiple times. As was reported against
the dm-mirror target that had mirror legs with a mix of discard
capabilities.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=196077
Reported-by: Zhang Yi <yizhan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
From 4a42ecc735cff0015cc73c3d87edede631f4b885 Mon Sep 17 00:00:00 2001
From: Dennis Zhou <dennisz@fb.com>
Date: Wed, 21 Jun 2017 08:07:15 -0700
Add error message to out of space failure for atomic allocations in
percpu allocation path to fix -Wmaybe-uninitialized.
Signed-off-by: Dennis Zhou <dennisz@fb.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use spin_lock_irqsave and spin_unlock_irqrestore rather than
spin_{lock,unlock}_irq in submit_flush_bio().
Otherwise lockdep issues the following warning:
DEBUG_LOCKS_WARN_ON(current->hardirq_context)
WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 trace_hardirqs_on_caller+0x107/0x180
Reported-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
for connected socket, the incoming_cpu field in the sock struct
is not going to change frequently, but we are setting it
unconditionally for each packet.
Since sk_incoming_cpu and sk_flags share the same cacheline,
and the latter is access by udp_recvmsg(), this cause a cache
miss for each packet for UDP connected socket.
With this patch, we set the incoming cpu field only when the
ingress cpu really changes.
This gives a small but measurable performance improvement for
connected UDP socket.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
retrieve the auxiliary groups of the remote peer. It is designed to
naturally extend SO_PEERCRED. That is, the underlying data is from the
same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
is, if the provided buffer is too small, ERANGE is returned and @optlen
is updated. Otherwise, the information is copied, @optlen is set to the
actual size, and 0 is returned.
While SO_PEERCRED (and thus `struct ucred') already returns the primary
group, it lacks the auxiliary group vector. However, nearly all access
controls (including kernel side VFS and SYSVIPC, but also user-space
polkit, DBus, ...) consider the entire set of groups, rather than just
the primary group. But this is currently not possible with pure
SO_PEERCRED. Instead, user-space has to work around this and query the
system database for the auxiliary groups of a UID retrieved via
SO_PEERCRED.
Unfortunately, there is no race-free way to query the auxiliary groups
of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
solution is to use getgrouplist(3p), which itself falls back to NSS and
whatever is configured in nsswitch.conf(3). This effectively checks
which groups we *would* assign to the user if it logged in *now*. On
normal systems it is as easy as reading /etc/group, but with NSS it can
resort to quering network databases (eg., LDAP), using IPC or network
communication.
Long story short: Whenever we want to use auxiliary groups for access
checks on IPC, we need further IPC to talk to the user/group databases,
rather than just relying on SO_PEERCRED and the incoming socket. This
is unfortunate, and might even result in dead-locks if the database
query uses the same IPC as the original request.
So far, those recursions / dead-locks have been avoided by using
primitive IPC for all crucial NSS modules. However, we want to avoid
re-inventing the wheel for each NSS module that might be involved in
user/group queries. Hence, we would preferably make DBus (and other IPC
that supports access-management based on groups) work without resorting
to the user/group database. This new SO_PEERGROUPS ioctl would allow us
to make dbus-daemon work without ever calling into NSS.
Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On UDP packets processing, if the BH is the bottle-neck, it
always sees a cache miss while updating rmem_alloc; try to
avoid it prefetching the value as soon as we have the socket
available.
Performances under flood with multiple NIC rx queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~10% performance improvement.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_QED_RDMA isn't defined, we'd hit the following:
/include/linux/qed/qede_rdma.h:84:19:
warning: ‘qede_rdma_dev_add’ used but never defined [enabled by default]
static inline int qede_rdma_dev_add(struct qede_dev *dev);
Fixes: bbfcd1e8e1 ("qed*: Set rdma generic functions prefix")
Signed-off-by: Chad Dupuis <chad.dupuis@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>