This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
Userspace Copies in Systemtap

From: Martin Hunt <hunt at redhat dot com>
To: "systemtap at sources dot redhat dot com" <systemtap at sources dot redhat dot com>
Date: Sat, 11 Feb 2006 17:32:48 -0800
Subject: Userspace Copies in Systemtap
Organization: Red Hat Inc
Userspace Copies in Systemtap

Code executed within a kprobe may not sleep.  Copying from userspace to
kernel address space might need to sleep if the required pages are not
currently present.  Because of this, copies from userspace cannot be made
100% reliable.  That is acceptable for Systemtap, as long as the failures
are rare and detectable.  Most importantly, the copies must be 100% safe
and never corrupt or crash the system.

Last year, after some discussion, I briefly thought the correct solution
involved locking the page table and checking that a particular page was
present before attempting to access it.  However, that approach was too
complicated and too slow.  After some more kernel reading, I realized
the solution was simple: just use the existing kernel calls, with a very
minor change to one of them.  Here is why.

How do userspace copies fail?
1. The address is bad.  Maybe the address is memory-mapped hardware and
initiating a read from it will do Bad Things to your system.
2. The requested address is not in memory and an exception is generated.

Fixing #1 is simple enough.  There is a function called access_ok()
which checks the address against the valid address range for the
process.  We need to call that for every memory access to userspace,
just like the kernel does. All the copy functions in runtime/copy.c do
that.
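
As a rough illustration of the kind of range check involved, here is a
small userspace model.  It is a sketch only, not the kernel's
access_ok(): the names model_access_ok() and MODEL_USER_LIMIT are made
up, and the limit assumes a 3G/1G i386 address split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process address limit.  Assumes a
 * 3G/1G i386 split, where user addresses end at 0xC0000000. */
#define MODEL_USER_LIMIT UINT32_C(0xC0000000)

/* Returns true when [addr, addr + size) lies entirely below the limit.
 * Written so that addr + size is never computed and cannot wrap. */
static bool model_access_ok(uint32_t addr, uint32_t size)
{
	if (size > MODEL_USER_LIMIT)
		return false;
	return addr <= MODEL_USER_LIMIT - size;
}

/* Usage example: a few ranges checked against the model. */
static void model_access_ok_examples(void)
{
	/* An ordinary user address is accepted. */
	assert(model_access_ok(UINT32_C(0x08048000), 4096));
	/* A range that runs past the limit is rejected. */
	assert(!model_access_ok(UINT32_C(0xBFFFFFF0), 0x20));
	/* A range whose end would wrap around is rejected too. */
	assert(!model_access_ok(UINT32_C(0xFFFFFFF0), 0x20));
}
```

Note that a check like this only validates the range.  It says nothing
about whether the pages are actually present, which is failure #2.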

Fixing #2 is a bit more complicated, but all the difficult work has
already been done for us.

When a page fault exception happens, do_page_fault() is called to handle
it.  It immediately calls notify_die(DIE_PAGE_FAULT,...).  This runs the
notifier chain, which calls kprobe_exceptions_notify(), which in turn
calls kprobe_fault_handler().  kprobe_fault_handler() checks whether there
is a specific fault handler for that kprobe, and if there is, calls it.
The next part is currently broken, but what needs to happen next is that
kprobe_fault_handler() calls fixup_exception() and returns 1.  This tells
do_page_fault() that the exception has been handled, and execution
resumes at the fixup address that fixup_exception() supplies.
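
The intended control flow can be modelled in plain C.  This is a
userspace sketch of the decision logic only.  Every name in it
(model_regs, model_probe, MODEL_FAULT_IP and so on) is hypothetical, and
the fixup lookup is reduced to a single hard-coded address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel's structures. */
struct model_regs {
	unsigned long ip;	/* address of the faulting instruction */
};

struct model_probe {
	/* Optional per-probe fault handler; returns 1 if it handled it. */
	int (*fault_handler)(struct model_probe *p, struct model_regs *regs);
};

/* Stand-in for the exception table: one faulting address, one fixup. */
#define MODEL_FAULT_IP 0x1000UL
#define MODEL_FIXUP_IP 0x2000UL

/* Model of fixup_exception(): on a hit, redirect execution to the fixup. */
static int model_fixup_exception(struct model_regs *regs)
{
	if (regs->ip == MODEL_FAULT_IP) {
		regs->ip = MODEL_FIXUP_IP;
		return 1;
	}
	return 0;
}

/* Model of the desired behaviour: try the probe's own handler first,
 * then the fixup.  Returning 1 means "handled, do not page or sleep". */
static int model_kprobe_fault_handler(struct model_probe *p,
				      struct model_regs *regs)
{
	assert(regs != NULL);
	if (p && p->fault_handler && p->fault_handler(p, regs))
		return 1;
	if (model_fixup_exception(regs))
		return 1;
	return 0;	/* not ours: fall back to normal fault handling */
}
```

In the model, a fault at MODEL_FAULT_IP comes back as "handled" with the
instruction pointer moved to MODEL_FIXUP_IP; any other address is passed
on untouched.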

So no actual paging was done and no sleeping either.  The copy function
that attempted the userspace copy will return an error code.  How did
that happen? Let's take a look at the copy_from_user() function for
i386.

Looking in arch/i386/lib/usercopy.c:
unsigned long
copy_from_user(void *to, const void __user *from, unsigned long n)
{
	might_sleep();
	BUG_ON((long) n < 0);
	if (access_ok(VERIFY_READ, from, n))
		n = __copy_from_user(to, from, n);
	else
		memset(to, 0, n);
	return n;
}

might_sleep() prints a warning to the system log if it is run from a
context that cannot sleep. This is the case for kprobes, so we must
avoid calling might_sleep().
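
To make the distinction concrete, here is a toy userspace model.  It is
only a sketch: model_in_atomic, model_warnings and the two copy
functions are invented names, and memcpy() stands in for the real copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model state: are we in a context that must not sleep? */
static bool model_in_atomic;
/* Number of warnings the model has "logged". */
static int model_warnings;

/* Model of might_sleep(): it never sleeps itself, it only complains
 * when called from a context where sleeping is forbidden. */
static void model_might_sleep(void)
{
	if (model_in_atomic)
		model_warnings++;
}

/* Model of a copy routine annotated as possibly sleeping. */
static void model_copy_may_sleep(void *to, const void *from, size_t n)
{
	assert(to != NULL && from != NULL);
	model_might_sleep();
	memcpy(to, from, n);
}

/* Model of an "inatomic" variant: same copy, no annotation. */
static void model_copy_inatomic(void *to, const void *from, size_t n)
{
	assert(to != NULL && from != NULL);
	memcpy(to, from, n);
}
```

Both functions copy the same bytes.  The only difference in the model is
that the annotated one bumps the warning count when called with
model_in_atomic set.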

copy_from_user() calls access_ok(), which is good, then it calls
__copy_from_user() which is defined in asm/uaccess.h.

static inline unsigned long
__copy_from_user(void *to, const void __user *from, unsigned long n)
{
       might_sleep();
       return __copy_from_user_inatomic(to, from, n);
}

Again there is a might_sleep(), which we don't want.  Next we look at
__copy_from_user_inatomic() and __copy_from_user_ll():

static inline unsigned long
__copy_from_user_inatomic(void *to, const void __user *from, unsigned long n)
{
	if (__builtin_constant_p(n)) {
		unsigned long ret;

		switch (n) {
		case 1:
			__get_user_size(*(u8 *)to, from, 1, ret, 1);
			return ret;
		case 2:
			__get_user_size(*(u16 *)to, from, 2, ret, 2);
			return ret;
		case 4:
			__get_user_size(*(u32 *)to, from, 4, ret, 4);
			return ret;
		}
	}
	return __copy_from_user_ll(to, from, n);
}

unsigned long
__copy_from_user_ll(void *to, const void __user *from, unsigned long n)
{
	BUG_ON((long)n < 0);
	if (movsl_is_ok(to, from, n))
		__copy_user_zeroing(to, from, n);
	else
		n = __copy_user_zeroing_intel(to, from, n);
	return n;
}

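And, finally, the generic copy macro (__copy_user_zeroing, called above,
relies on the same __ex_table mechanism):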

/* Generic arbitrary sized copy.  */
#define __copy_user(to,from,size)					\
do {									\
	int __d0, __d1, __d2;						\
	__asm__ __volatile__(						\
		"	cmp  $7,%0\n"					\
		"	jbe  1f\n"					\
		"	movl %1,%0\n"					\
		"	negl %0\n"					\
		"	andl $7,%0\n"					\
		"	subl %0,%3\n"					\
		"4:	rep; movsb\n"					\
		"	movl %3,%0\n"					\
		"	shrl $2,%0\n"					\
		"	andl $3,%3\n"					\
		"	.align 2,0x90\n"				\
		"0:	rep; movsl\n"					\
		"	movl %3,%0\n"					\
		"1:	rep; movsb\n"					\
		"2:\n"							\
		".section .fixup,\"ax\"\n"				\
		"5:	addl %3,%0\n"					\
		"	jmp 2b\n"					\
		"3:	lea 0(%3,%0,4),%0\n"				\
		"	jmp 2b\n"					\
		".previous\n"						\
		".section __ex_table,\"a\"\n"				\
		"	.align 4\n"					\
		"	.long 4b,5b\n"					\
		"	.long 0b,3b\n"					\
		"	.long 1b,2b\n"					\
		".previous"						\
		: "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2)	\
		: "3"(size), "0"(size), "1"(to), "2"(from)		\
		: "memory");						\
} while (0)

Note the section __ex_table.  This is the exception table.  Each entry
pairs the address of an instruction that might fault with the address of
its fixup code.  When modules are loaded, their exception tables are
merged with the kernel's exception table, which is what lets
fixup_exception() find the fixups for module code as well.
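
The lookup itself can be sketched in a few lines of userspace C.  This
is a model under assumptions, not the kernel's implementation: the
struct and function names are invented, and the table is assumed to be
sorted by instruction address so that a binary search works.

```c
#include <assert.h>
#include <stddef.h>

/* One entry, mirroring a ".long insn,fixup" pair in __ex_table. */
struct model_ex_entry {
	unsigned long insn;	/* instruction that may fault */
	unsigned long fixup;	/* where to resume if it does */
};

/* Binary search over a table sorted by insn.
 * Returns the matching entry, or NULL if ip has no fixup. */
static const struct model_ex_entry *
model_search_extable(const struct model_ex_entry *tab, size_t n,
		     unsigned long ip)
{
	size_t lo = 0, hi = n;

	assert(tab != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].insn == ip)
			return &tab[mid];
		if (tab[mid].insn < ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/* Model of the fixup step: on a hit, redirect *ip and report success. */
static int model_fixup(const struct model_ex_entry *tab, size_t n,
		       unsigned long *ip)
{
	const struct model_ex_entry *e = model_search_extable(tab, n, *ip);

	if (!e)
		return 0;
	*ip = e->fixup;
	return 1;
}
```

A fault at an address listed in the table resumes at the paired fixup
address.  A fault anywhere else is not recoverable this way and is left
alone.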

So, no locks, no waits. Errors are recovered efficiently using the MMU
and the exception table.

What about other copy functions?  There are get_user() and
strncpy_from_user().  The get_user() family of functions is simple and
safe to use.  strncpy_from_user() calls __do_strncpy_from_user(), which
unfortunately has a might_sleep() embedded in it.

#define __do_strncpy_from_user(dst,src,count,res)			   \
do {									   \
	int __d0, __d1, __d2;						   \
	might_sleep();							   \
	__asm__ __volatile__(						   \
		"	testl %1,%1\n"					   \
		"	jz 2f\n"					   \
		"0:	lodsb\n"					   \
		"	stosb\n"					   \
		"	testb %%al,%%al\n"				   \
		"	jz 1f\n"					   \
		"	decl %1\n"					   \
		"	jnz 0b\n"					   \
		"1:	subl %1,%0\n"					   \
		"2:\n"							   \
		".section .fixup,\"ax\"\n"				   \
		"3:	movl %5,%0\n"					   \
		"	jmp 2b\n"					   \
		".previous\n"						   \
		".section __ex_table,\"a\"\n"				   \
		"	.align 4\n"					   \
		"	.long 0b,3b\n"					   \
		".previous"						   \
		: "=d"(res), "=c"(count), "=&a" (__d0), "=&S" (__d1),	   \
		  "=&D" (__d2)						   \
		: "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
		: "memory");						   \
} while (0)

Because the kprobes fault handler intercepts page faults, we never sleep
here, so the might_sleep() is erroneous.  I have therefore copied this
function, without the might_sleep() call, into runtime/copy.c.  Note that
might_sleep() only produces a warning, and that it appears here only on
i386 and x86_64.
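
For readers who do not want to trace the assembly, the return values of
the loop above can be modelled in portable C.  This is a userspace
sketch, not the code in runtime/copy.c: struct model_user_buf and
model_strncpy_from_user() are invented names, and a "page fault" is
simulated by a limit on how many source bytes may be read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a user buffer: only the first 'readable' bytes
 * can be read; touching anything past that stands in for a page fault. */
struct model_user_buf {
	const char *data;
	size_t readable;
};

/* Copies at most 'count' bytes, stopping after a NUL.
 * Returns the string length (not counting the NUL), or 'count' if no
 * NUL was found within 'count' bytes, or -EFAULT on a simulated fault. */
static long model_strncpy_from_user(char *dst,
				    const struct model_user_buf *src,
				    long count)
{
	long copied = 0;

	assert(dst != NULL && src != NULL);
	while (copied < count) {
		char c;

		if ((size_t)copied >= src->readable)
			return -EFAULT;	/* the fixup path: report the fault */
		c = src->data[copied];
		dst[copied] = c;
		if (c == '\0')
			return copied;	/* NUL copied, length returned */
		copied++;
	}
	return copied;			/* count exhausted before a NUL */
}
```

As in the assembly, nothing here can block.  A read that would fault
simply turns into an -EFAULT return that the caller can detect.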

So are there any problems left to deal with?  Yes: people writing
embedded C code can bypass the access_ok() checks, or call functions such
as copy_from_user() instead of stpcopy_from_user().  Bypassing
access_ok() would rarely be a problem, but could lead to lockups.  Calling
copy_from_user(), or any other function that calls might_sleep(), will
write junk to the system log but is otherwise harmless.

I'm attaching a corrected version of the memory copy test program that I
posted a few days ago. It can distinguish between copies that were bad
because access_ok() rejected them and copies that failed on a page
fault.
Attachment: mem_test.stp
Description: Text document