Improving performance (简体中文)

From ArchWiki


This article covers the basics of diagnosing system performance, as well as concrete steps to optimize it.

The basics

Know your system

The best way to tune a system is to find the bottleneck, i.e. the subsystem that is the main cause of slowness. The system specifications can help identify it, but the following checks also provide clues:

  • If the system stutters when several large applications (such as LibreOffice and Firefox) run at the same time, check whether the amount of RAM is sufficient. Use the following command and check the value in the "available" column:
$ free -m
  • If boot time is very long and applications are slow to load the first time (but run smoothly once started), the hard drive is probably the problem. Drive speed can be measured with the hdparm command while the drive is idle:
Note: hdparm only indicates the raw read speed of the drive and is not a valid benchmark. An idle read speed above 40 MB/s is sufficient for most systems.
$ hdparm -t /dev/sdx
  • If CPU load is consistently high even though plenty of RAM is available, try to reduce it by stopping processes or disabling daemons. CPU load can be monitored in several ways, for example with htop, pstree or any other system monitor:
$ htop
  • If applications that use direct rendering are slow (i.e. those that use the GPU, such as video players, games, or the window manager), improving GPU performance should help. The first step is to check whether direct rendering is actually enabled, using the glxinfo command from mesa-demos:
$ glxinfo | grep "direct rendering"
direct rendering: Yes

Benchmarking

To quantify the results of optimization, use benchmarks.

Storage devices

How drives are connected

An internal hardware path refers to how the storage device is connected to the motherboard: for example over TCP/IP through a NIC, or as a plug-and-play device over PCIe/PCI, FireWire, a RAID card, USB, and so on. Spreading storage devices evenly across these interfaces maximizes what the motherboard can deliver; for example, connecting six drives over USB is slower than connecting three over USB and three over FireWire. The reason is that each interface point on the motherboard is like a pipe, and there is a limit to how much data can flow through a pipe at any one time. Fortunately, motherboards usually have several such pipes, for example:

  1. Directly to the motherboard using PCI/PCIe/ATA.
  2. Using an external enclosure to house the disk over USB/FireWire.
  3. Turning the device into network storage over TCP/IP.

Also, suppose your computer has two USB ports on the front and four on the back: plugging two drives into the front and two into the back should be faster than one in the front and three in the back. This is because the front ports are probably a separate root hub, which means more data can be sent at the same time. Use the following commands to determine whether your machine has multiple paths:

USB device tree
$ lsusb -tv
PCI device tree
$ lspci -tv

Partitioning

Make sure that your partitions are properly aligned.
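For example, alignment of an existing partition can be checked with parted; this is a sketch assuming the first partition of a hypothetical /dev/sda, where "1 aligned" indicates optimal alignment:

# parted /dev/sda align-check optimal 1
1 aligned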

Multiple drives

If you have multiple drives available, setting them up as RAID can improve speed.
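As an illustration only (assuming two hypothetical partitions /dev/sda1 and /dev/sdb1; see the RAID article for a proper setup), a striped RAID0 array can be created with mdadm:

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1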

Creating swap space on a separate drive can also help, especially if swap is used frequently.
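As a sketch, assuming hypothetical swap partitions /dev/sda2 and /dev/sdb2, giving both the same priority in fstab lets the kernel stripe swap I/O across the two drives:

/etc/fstab
/dev/sda2 none swap defaults,pri=100 0 0
/dev/sdb2 none swap defaults,pri=100 0 0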

Layout on HDDs

If you are using a traditional spinning hard drive, your partition layout can affect system performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end (simple physics). Also, a smaller partition requires less movement of the drive head, which speeds up disk operations. It is therefore advised to create a small partition for your system (10 GB, more or less depending on your needs) as close to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition; this is usually achieved by separating the home directory (/home/user) from the root directory (/).

Choosing a filesystem

Choosing the best filesystem for a specific system is very important because each has its own strengths. The File systems article provides a short summary of the most popular ones. You can also find relevant articles in Category:File systems.

Mount options

The noatime option is known to improve performance of the filesystem.
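A minimal fstab sketch, assuming a hypothetical ext4 root filesystem on /dev/sda1:

/etc/fstab
/dev/sda1 / ext4 defaults,noatime 0 1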

Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:

Reiserfs

The data=writeback mount option improves speed, but may corrupt data during power loss. The notail mount option increases the space used by the filesystem by about 5%, but also improves overall speed. You can also reduce disk load by putting the journal and data on separate drives. This is done when creating the filesystem:

# mkreiserfs -j /dev/sda1 /dev/sdb1

Replace /dev/sda1 with the partition reserved for the journal, and /dev/sdb1 with the partition for data. You can learn more about reiserfs with this article.

Tuning kernel parameters

There are several key tunables affecting the performance of block devices, see sysctl#Virtual memory for more information.
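As an illustrative sketch (the values below are examples, not recommendations), writeback-related settings can be adjusted through a sysctl drop-in file and loaded with # sysctl --system or a reboot:

/etc/sysctl.d/99-writeback.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10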

I/O scheduling

Background information

The input/output (I/O) scheduler is the kernel component that decides in which order the block I/O operations are submitted to storage devices. It is useful to recall here some characteristics of the two main drive types, because the goal of the I/O scheduler is to optimize the way these deal with read requests:

  • An HDD has spinning disks and a head that moves physically to the required location. Therefore, random latency is quite high ranging between 3 and 12ms (whether it is a high end server drive or a laptop drive and bypassing the disk controller write buffer) while sequential access provides much higher throughput. The typical HDD throughput is about 200 I/O operations per second (IOPS).
  • An SSD does not have moving parts, random access is as fast as sequential one, typically under 0.1ms, and it can handle multiple concurrent requests. The typical SSD throughput is greater than 10,000 IOPS, which is more than needed in common workload situations.

If there are many processes making I/O requests to different storage parts, thousands of IOPS can be generated while a typical HDD can handle only about 200 IOPS. There is a queue of requests that have to wait for access to the storage. This is where the I/O scheduler plays an optimization role.

The scheduling algorithms

One way to improve throughput is to linearize access: by ordering waiting requests by their logical address and grouping the closest ones. Historically this was the first Linux I/O scheduler called elevator.

One issue with the elevator algorithm is that it is not optimal for a process doing sequential access: reading a block of data, processing it for several microseconds then reading next block and so on. The elevator scheduler does not know that the process is about to read another block nearby and, thus, moves to another request by another process at some other location. The anticipatory I/O scheduler overcomes the problem: it pauses for a few milliseconds in anticipation of another close-by read operation before dealing with another request.

While these schedulers try to improve total throughput, they might leave some unlucky requests waiting for a very long time. As an example, imagine the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end of storage. This potentially infinite postponement of the process is called starvation. To improve fairness, the deadline algorithm was developed. It has a queue ordered by address, similar to the elevator, but if some request sits in this queue for too long then it moves to an "expired" queue ordered by expire time. The scheduler checks the expire queue first and processes requests from there and only then moves to the elevator queue. Note that this fairness has a negative impact on overall throughput.

The Completely Fair Queuing (CFQ) scheduler approaches the problem differently by allocating a timeslice and a number of allowed requests per queue depending on the priority of the process submitting them. It supports cgroups, which allow reserving some amount of I/O for a specific collection of processes. This is in particular useful for shared and cloud hosting: users who paid for some IOPS want to get their share whenever needed. It also idles at the end of synchronous I/O waiting for other nearby operations, taking over this feature from the anticipatory scheduler and bringing some enhancements. Both the anticipatory and the elevator schedulers were decommissioned from the Linux kernel and replaced by the more advanced alternatives presented below.

The Budget Fair Queuing (BFQ) scheduler is based on CFQ code and brings some enhancements. It does not grant the disk to each process for a fixed time-slice but assigns a "budget" measured in number of sectors to the process and uses heuristics. It is a relatively complex scheduler and may be better suited to rotational drives and slow SSDs, because its high per-operation overhead, especially when combined with a slow CPU, can slow down fast devices. The objective of BFQ on personal systems is that for interactive tasks, the storage device is virtually as responsive as if it were idle. In its default configuration it focuses on delivering the lowest latency rather than achieving the maximum throughput.

Kyber is a recent scheduler inspired by active queue management techniques used for network routing. The implementation is based on "tokens" that serve as a mechanism for limiting requests. A queuing token is required to allocate a request; this is used to prevent starvation of requests. A dispatch token is also needed and limits the operations of a certain priority on a given device. Finally, a target read latency is defined and the scheduler tunes itself to reach this latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.

Kernel's I/O schedulers

While some of the early algorithms have now been decommissioned, the official Linux kernel supports a number of I/O schedulers which can be split into two categories:

  • The multi-queue schedulers are available by default with the kernel. The Multi-Queue Block I/O Queuing Mechanism (blk-mq) maps I/O queries to multiple queues, the tasks are distributed across threads and therefore CPU cores. Within this framework the following schedulers are available:
    • None, where no queuing algorithm is applied.
    • mq-deadline, the adaptation of the deadline scheduler (see below) to multi-threading.
    • Kyber
    • BFQ
  • The single-queue schedulers are legacy schedulers:
    • NOOP is the simplest scheduler, it inserts all incoming I/O requests into a simple FIFO queue and implements request merging. In this algorithm, there is no re-ordering of the request based on the sector number. Therefore it can be used if the ordering is dealt with at another layer, at the device level for example, or if it does not matter, for SSDs for instance.
    • Deadline
    • CFQ
Note: Single-queue schedulers were removed from the kernel in Linux 5.0.

Changing the I/O scheduler

Note: The best choice of scheduler depends on both the device and the exact nature of the workload. Also, throughput in MB/s is not the only measure of performance: meeting deadlines or fairness deteriorates overall throughput but may improve system responsiveness. Benchmarking may be useful to compare the performance of each I/O scheduler.

To list the available schedulers for a device and the active scheduler (in brackets):

$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none

To list the available schedulers for all devices:

$ grep "" /sys/block/*/queue/scheduler
/sys/block/pktcdvd0/queue/scheduler:none
/sys/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/block/sr0/queue/scheduler:[mq-deadline] kyber bfq none

To change the active I/O scheduler to bfq for device sda, use:

# echo bfq > /sys/block/sda/queue/scheduler

The process of changing the I/O scheduler, depending on whether the disk is rotating or not, can be automated and made to persist across reboots. For example, the udev rule below sets the scheduler to none for NVMe, mq-deadline for SSD/eMMC, and bfq for rotational drives:

/etc/udev/rules.d/60-ioschedulers.rules
# set scheduler for NVMe
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
# set scheduler for SSD and eMMC
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Reboot or force udev#Loading new rules.

Tuning the I/O scheduler

Each of the kernel's I/O schedulers has its own tunables, such as the latency time, the expiry time or the FIFO parameters. They are helpful in adjusting the algorithm to a particular combination of device and workload, typically to achieve a higher throughput or a lower latency for a given utilization. The tunables and their descriptions can be found within the kernel documentation.

To list the available tunables for a device (in the example below, sdb, which is using deadline), use:

$ ls /sys/block/sdb/queue/iosched
fifo_batch  front_merges  read_expire  write_expire  writes_starved

To improve deadline's throughput at the cost of latency, one can increase fifo_batch with the command:

# echo 32 > /sys/block/sdb/queue/iosched/fifo_batch

Power management configuration

When dealing with traditional rotational disks (HDDs) you may want to lower or disable power saving features completely.
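For instance, a sketch using hdparm on a hypothetical /dev/sda: -B 254 selects the most performance-oriented APM level short of disabling power management, and -S 0 disables the spin-down timeout. Note that these settings do not persist across reboots:

# hdparm -B 254 -S 0 /dev/sda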

Reducing disk reads/writes

Avoiding unnecessary access to slow storage drives is good for performance and also increases the lifetime of the devices, although on modern hardware the difference in life expectancy is usually negligible.

Note: A 32 GB SSD with a mediocre 10x write amplification factor, a standard 10,000 write/erase cycle, and 10 GB of data written per day would get a life expectancy of about 8 years. It gets better with bigger SSDs and modern controllers with less write amplification. Also compare [1] when considering whether any particular strategy to limit disk writes is actually needed.

Showing disk writes

The iotop package can sort by disk writes, and show how much and how frequently programs are writing to the disk. See iotop(8) for details.
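For example, a minimal usage sketch that shows only processes actually performing I/O and accumulates totals since iotop was started:

# iotop -o -a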

Relocating files to tmpfs

Relocate files, such as your browser profile, to a tmpfs file system to improve application response times, since all the files are then stored in RAM.
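One crude sketch, assuming a hypothetical user whose home directory is /home/user, is to mount that user's cache directory as tmpfs via fstab, accepting that its contents are lost on every reboot:

/etc/fstab
tmpfs /home/user/.cache tmpfs rw,noatime,nosuid,nodev,size=1G 0 0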

Filesystems

Refer to the corresponding filesystem page for performance improvement instructions, e.g. Ext4#Improving performance and XFS#Performance.

Swap space

See Swap#Performance.

Sync and buffer size

See Sysctl#Virtual memory for details.

Storage I/O scheduling with ionice

Many tasks, such as backups, do not rely on short storage I/O delays or high storage I/O bandwidth to fulfil their task; they can be classified as background tasks. On the other hand, quick I/O is necessary for good UI responsiveness on the desktop. Therefore it is beneficial to reduce the amount of storage bandwidth available to background tasks while other tasks need storage I/O. This can be achieved by making use of the Linux I/O scheduler CFQ, which allows setting different priorities for processes.

The I/O priority of a background process can be reduced to the "Idle" level by starting it with

# ionice -c 3 command

See ionice(1) and [2] for more information.

CPU

Overclocking

Overclocking means increasing the actual running frequency of the CPU. It is a complex and risky operation, so do not attempt it blindly. The best way to overclock is through the BIOS. Commonly used tools such as acpi_cpufreq cannot read the frequency of an overclocked i5 or i7 processor; use the i7z tool from the community repository instead.

Automatic frequency scaling

See CPU frequency scaling.
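As a quick sketch, assuming the cpupower package is installed, the performance governor can be selected with:

# cpupower frequency-set -g performance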

Real-time kernel

Some applications such as running a TV tuner card at full HD resolution (1080p) may benefit from using a realtime kernel.

Adjusting priorities of processes

See also nice(1) and renice(1).

Ananicy

Ananicy is a daemon, available in the ananicy-gitAUR package, for auto adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources.

cgroups

See cgroups.

Cpulimit

Cpulimit is a program to limit the CPU usage percentage of a specific process. After installing cpulimit, you can limit the CPU usage of a process by its PID on a scale of 0 to 100 times the number of CPU cores in the computer. For example, with eight CPU cores the percentage range is 0 to 800. Usage:

$ cpulimit -l 50 -p 5081

irqbalance

The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be controlled by the provided irqbalance.service.
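For example, a minimal sketch to start it now and at every boot:

# systemctl enable --now irqbalance.service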

Graphics cards

Xorg.conf configuration

Graphics performance depends heavily on /etc/X11/xorg.conf; see the NVIDIA, ATI and Intel articles for how to adjust it. Note that improper settings may stop Xorg from working, so proceed with caution.

Driconf

driconfAUR is a small utility that can change the direct rendering settings of open source drivers. Enabling HyperZ can significantly improve performance.

GPU overclocking

Overclocking the GPU is much easier than overclocking the CPU, because the GPU clock frequency can be adjusted on the fly in software.

The overclocking settings can be saved in ~/.xinitrc so that they are applied automatically every time X starts, although it is safer to apply them only on demand.

RAM and virtual memory

Relocating temporary files to tmpfs

If there is plenty of RAM, locations such as /tmp, /dev/shm or the browser cache can be moved to tmpfs; the files are then kept in RAM, which speeds up application response. Scripts can make this easy to set up.
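For example, /tmp can be mounted as tmpfs through fstab (a sketch; the size limit is an example value):

/etc/fstab
tmpfs /tmp tmpfs rw,nosuid,nodev,size=2G 0 0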

Swappiness

See Swap#Swappiness.
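As a sketch, a lower swappiness value (the default is 60) can be set persistently with a sysctl drop-in file:

/etc/sysctl.d/99-swappiness.conf
vm.swappiness = 10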

zram or zswap

The zram kernel module (formerly called compcache) provides a compressed block device in RAM. If you use it as swap space, the RAM can hold much more data, at the cost of more CPU usage; it is still much faster than swap on a hard drive. If a system often falls back to swap, using zram can improve responsiveness. Using zram also reduces disk reads and writes, which extends the lifetime of an SSD when swap would otherwise be placed on it.

zswap brings similar benefits (and similar costs). The difference is that zswap compresses pages and swaps them out to a backing swap device, while zram swaps them into RAM.

For example, to set up a 32 GB zram device with the lz4 compression algorithm and a high priority:

# modprobe zram
# echo lz4 > /sys/block/zram0/comp_algorithm
# echo 32G > /sys/block/zram0/disksize
# mkswap --label zram0 /dev/zram0
# swapon --priority 100 /dev/zram0

To disable it again, either reboot or run:

# swapoff /dev/zram0
# rmmod zram

A detailed description is provided in the official documentation of the zram module.

zram-generator provides a systemd-zram-setup@.service unit to automatically initialize zram devices. The unit does not need to be enabled or started manually. See its documentation for the information needed to use it.

According to its documentation, "the generator will be invoked by systemd early at boot", so using it only requires creating a configuration file and rebooting. A simple configuration example is provided at /usr/share/doc/zram-generator/zram-generator.conf.example. You can check the state of zram by checking the swap status or the status of systemd-zram-setup@zramN.service, where /dev/zramN is the device defined in the configuration file.
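A minimal configuration sketch (the values are examples) that creates one zram device of half the RAM size compressed with zstd could look like this:

/etc/systemd/zram-generator.conf
[zram0]
zram-size = ram / 2
compression-algorithm = zstd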

Alternatively, zramdAUR sets up zram automatically, using the zstd algorithm by default. Its configuration file is located at /etc/default/zramd, and the zramd.service unit needs to be enabled.

Swap on zram using a udev rule

The example below describes how to set up swap on zram automatically at boot with a single udev rule. No extra package should be needed to make this work.

First, enable the module:

/etc/modules-load.d/zram.conf
zram

Configure the number of /dev/zram nodes you need.

/etc/modprobe.d/zram.conf
options zram num_devices=2

Create the udev rule as shown in the example.

/etc/udev/rules.d/99-zram.rules
KERNEL=="zram0", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram0", TAG+="systemd"
KERNEL=="zram1", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram1", TAG+="systemd"

Add /dev/zram to your fstab.

/etc/fstab
/dev/zram0 none swap defaults 0 0
/dev/zram1 none swap defaults 0 0

Using video RAM

If your system has very little RAM but a surplus of video RAM, see Swap on video RAM for how to place swap space on the video memory.

Preloading

Preloading programs and libraries into memory can effectively speed up application start times. Preloading is usually used for frequently started applications such as a browser.

Go-preload

gopreload-gitAUR is a preload daemon that originates from Gentoo. After installing it, collect preload data for a program with:

# gopreload-prepare program

Run the program that should be preloaded, and press Enter when it has finished loading.

This generates a preload list in /usr/share/gopreload/enabled. Enable gopreload at boot in /etc/rc.conf, and it will preload the listed programs on every boot. To stop preloading a program, simply delete its entry from /usr/share/gopreload/enabled, or move it to /usr/share/gopreload/disabled.

Preload

Compared to Go-preload, Preload is more automated (although this goes against KISS): simply add the daemon to /etc/rc.conf and you are done.

System startup

See: Improving performance/Boot process.

Suspend

The best way to speed up system startup is not to shut the computer down at all, but to suspend it instead. Of course, for the sake of sustainability (or at least the electricity bill), it is still better to power it off when it is not in use.
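Assuming systemd (the default on Arch), suspending to RAM is a single command:

$ systemctl suspend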

Compiling a custom kernel

Compiling your own kernel with unneeded modules removed can reduce boot time and memory usage. However, it is usually a time-consuming and tedious process, you may run into all kinds of errors, and the boot time saved may well be less than the time you spend on it. That said, compiling a kernel teaches you a lot. See: here.

Network