Sysenter Based System Call Mechanism in Linux 2.6 (original translation) - Linux System Administration



Original article:
http://articles.
Starting with version 2.5, the Linux kernel introduced a new system call entry mechanism for Pentium II and later processors.

Because the existing software interrupt method performed poorly on Pentium IV processors, an alternative system call entry mechanism was implemented using the SYSENTER/SYSEXIT instructions available on Pentium II and later processors.

This article explores the new mechanism. The discussion is limited to the x86 architecture, and all source code listings are based on Linux kernel 2.6.15.6.

1. What are system calls?

System calls give userland processes a way to request services from the kernel.

What kind of services?

Services managed by the operating system, such as storage, memory, networking, and process management.

For example, a user process that wants to read a file has to make the 'open' and 'read' system calls.

Processes generally do not invoke system calls directly. The C library provides a wrapper interface to all system calls.

2. What happens in a system call?

A piece of kernel code runs at the request of a user process.

This code runs in ring 0 (with current privilege level, CPL, 0), which is the highest level of privilege in the x86 architecture.

All user processes run in ring 3 (CPL 3).

So, to implement a system call mechanism, we need 1) a way to call ring 0 code from ring 3 and 2) some kernel code to service the request.

3. Good old way of doing it

Until some time back, Linux implemented system calls on all x86 platforms using software interrupts.

To execute a system call, a user process copies the desired system call number to %eax and executes 'int 0x80'.

This generates interrupt 0x80, and an interrupt service routine is called.

For interrupt 0x80, this routine is an "all system calls handling" routine.

This routine executes in ring 0. As defined in the file /usr/src/linux/arch/i386/kernel/entry.S, it saves the current state and calls the appropriate system call handler based on the value in %eax.

4. New shiny way of doing it

It turned out that this software interrupt method was much slower on Pentium IV processors.

To solve this issue, Linus implemented an alternative system call mechanism that takes advantage of the SYSENTER/SYSEXIT instructions provided by all Pentium II and later processors.

Before going further with this new way of doing it, let's get more familiar with these instructions.

4.1. SYSENTER/SYSEXIT instructions:

Let's look at the authoritative source, the Intel manual itself. It says:

The SYSENTER instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor.

The SYSENTER instruction is optimized to provide the maximum performance for transitions to protection ring 0 (CPL = 0).

The SYSENTER instruction sets the following registers according to values specified by the operating system in certain model-specific registers.

CS register set to the value of (SYSENTER_CS_MSR)
EIP register set to the value of (SYSENTER_EIP_MSR)
SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
ESP register set to the value of (SYSENTER_ESP_MSR)

It looks like the processor is trying to help us. Let's take a quick look at SYSEXIT as well:

The SYSEXIT instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor.

The SYSEXIT instruction is optimized to provide the maximum performance for transitions to protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0).

The SYSEXIT instruction sets the following registers according to values specified by the operating system in certain model-specific or general purpose registers.

CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)
EIP register set to the value contained in the EDX register
SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)
ESP register set to the value contained in the ECX register

SYSENTER_CS_MSR, SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR are not really the names of the registers. Intel just defines the addresses of these registers as:

SYSENTER_CS_MSR   174h
SYSENTER_ESP_MSR  175h
SYSENTER_EIP_MSR  176h

In Linux these registers are named as:

/usr/src/linux/include/asm/msr.h:
#define MSR_IA32_SYSENTER_CS            0x174
#define MSR_IA32_SYSENTER_ESP           0x175
#define MSR_IA32_SYSENTER_EIP           0x176

4.2. How does Linux 2.6 use these instructions?

Linux sets up these registers during initialization.

/usr/src/linux/arch/i386/kernel/sysenter.c:
        wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
        wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);

Please note that 'tss' refers to the Task State Segment (TSS), so tss->esp1 points to the kernel mode stack. [4] explains the use of the TSS in Linux as:

The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts.

Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

- When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS.

- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

So during initialization the kernel sets up these registers such that, after a SYSENTER instruction, ESP points to the kernel mode stack and EIP points to sysenter_entry.

The kernel also sets up the system call entry/exit points for user processes.

The kernel creates a single page in memory and maps it into the address space of every process when the process is loaded.

This page contains the actual implementation of the system call entry/exit mechanism.

The definition of this page can be found in the file /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S.

The kernel calls this page the virtual dynamic shared object (vdso). Its existence can be confirmed by looking at the output of cat /proc/<pid>/maps:

slax ~ # cat /proc/self/maps
08048000-0804c000 r-xp 00000000 07:00 13         /bin/cat
0804c000-0804d000 rwxp 00003000 07:00 13         /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0          [heap]
b7ea0000-b7ea1000 rwxp b7ea0000 00:00 0
b7ea1000-b7fca000 r-xp 00000000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fca000-b7fcb000 r-xp 00128000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fcb000-b7fce000 rwxp 00129000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fce000-b7fd1000 rwxp b7fce000 00:00 0
b7fe7000-b7ffd000 r-xp 00000000 07:03 1730       /lib/ld-2.3.6.so
b7ffd000-b7fff000 rwxp 00015000 07:03 1730       /lib/ld-2.3.6.so
bffe7000-bfffd000 rwxp bffe7000 00:00 0          [stack]
ffffe000-fffff000 ---p 00000000 00:00 0          [vdso]

For binaries that use shared libraries, this page also shows up in the output of ldd:

slax ~ # ldd /bin/ls
       linux-gate.so.1 =>  (0xffffe000)
       librt.so.1 => /lib/tls/librt.so.1 (0xb7f5f000)
       ...
Observe linux-gate.so.1. This is not a physical file. The content of this vdso can be dumped as follows:

==> dd if=/proc/self/mem of=linux-gate.dso bs=4096 skip=1048574 count=1
1+0 records in
1+0 records out
==> objdump -d --start-address=0xffffe400 --stop-address=0xffffe414 linux-gate.dso
ffffe400 <__kernel_vsyscall>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter
...
ffffe40d:       90                      nop
ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret
In all listings, ... stands for omitted irrelevant code.

Initiation: Userland processes (or the C library on their behalf) call __kernel_vsyscall to execute system calls.

The address of __kernel_vsyscall is not fixed. The kernel passes this address to userland processes through the AT_SYSINFO ELF parameter.

AT_ ELF parameters, a.k.a. ELF auxiliary vectors, are loaded onto the process stack at startup, along with the process arguments and the environment variables.

See [1] for more information on ELF auxiliary vectors.

After control moves to this address, registers %ecx, %edx and %ebp are saved on the user stack, and %esp is copied to %ebp before sysenter executes.

This %ebp later helps the kernel restore the userland stack. After the sysenter instruction executes, the processor starts execution at sysenter_entry. sysenter_entry is defined in /usr/src/linux/arch/i386/kernel/entry.S as: (See my comments in [ ])
   179 ENTRY(sysenter_entry)
   180         movl TSS_sysenter_esp0(%esp),%esp
   181 sysenter_past_esp:
   182         sti
   183         pushl $(__USER_DS)
   184         pushl %ebp            [%ebp contains userland %esp]
   185         pushfl
   186         pushl $(__USER_CS)
   187         pushl $SYSENTER_RETURN        [userland return addr]
   188
        ....
   201         pushl %eax
   202         SAVE_ALL            [pushes registers on to stack]
   203         GET_THREAD_INFO(%ebp)
   204
   205         /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
   206         testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
   207         jnz syscall_trace_entry
   208         cmpl $(nr_syscalls), %eax
   209         jae syscall_badsys
   210         call *sys_call_table(,%eax,4)
   211         movl %eax,EAX(%esp)
        ......
Inside sysenter_entry: between lines 183 and 202, the kernel saves the current state by pushing register values onto the stack.

Observe that $SYSENTER_RETURN is the userland return address, as defined in /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S, and that %ebp contains the userland ESP, because %esp was copied to %ebp before sysenter was executed.

After saving the state, the kernel validates the system call number stored in %eax. Finally, the appropriate system call is invoked using the instruction:

210 call *sys_call_table(,%eax,4)

This is very similar to the old way.

After the system call completes, the processor resumes execution at line 211. Looking further into the sysenter_entry definition:

   210         call *sys_call_table(,%eax,4)
   211         movl %eax,EAX(%esp)
   212         cli
   213         movl TI_flags(%ebp), %ecx
   214         testw $_TIF_ALLWORK_MASK, %cx
   215         jne syscall_exit_work
   216 /* if something modifies registers it must also disable sysexit */
   217         movl EIP(%esp), %edx            (EIP is 0x28)
   218         movl OLDESP(%esp), %ecx            (OLD ESP is 0x34)
   219         xorl %ebp,%ebp
   220         sti
   221         sysexit
Line 211 copies the value in %eax to the stack. The userland return address (the to-be EIP) and the userland ESP are then copied from the kernel stack to %edx and %ecx respectively.

Observe that the userland return address, $SYSENTER_RETURN, was pushed onto the stack at line 187.

After that, 0x28 bytes were pushed onto the stack. That is why 0x28(%esp) points to $SYSENTER_RETURN.

Then the SYSEXIT instruction is executed. As we know from the previous section, sysexit copies the value in %edx to EIP and the value in %ecx to ESP. sysexit returns the processor to ring 3, and execution resumes in userland.

5. Some Code

Code:
#include <stdio.h>

int pid;

int main() {
        __asm__(
                "movl $20, %eax    \n"
                "call *%gs:0x10    \n"   /* offset 0x10 is not fixed across the systems */
                "movl %eax, pid    \n"
        );
        printf("pid is %d\n", pid);
        return 0;
}

This makes the getpid() system call (__NR_getpid is 20) through __kernel_vsyscall instead of int 0x80.

Why %gs:0x10?

Parsing the process stack to find the value of AT_SYSINFO can be cumbersome.

So, when libc.so (the C library) is loaded, it copies the value of AT_SYSINFO from the process stack to the TCB (Thread Control Block).

The segment register %gs refers to the TCB.

Please note that the offset 0x10 is not fixed across systems.

I found it for my system using GDB. A system-independent way to find AT_SYSINFO is given in [1].

Note: This example is taken from http://www.win.tue.nl/~aeb/linux/lk/lk-4.html after a little modification to make it work on my system.

6. References

Here are some references that helped me understand this.
1. About ELF auxiliary vectors, by Manu Garg
2. What is linux-gate.so.1?, by Johan Petersson
3. The Linux kernel: System Calls, by Andries Brouwer
4. Understanding the Linux Kernel, by Daniel P. Bovet and Marco Cesati
5. Linux kernel source code

[ Last edited by madfrogme on 2012-7-20 00:18 ]
