macOS Kernel: CPU Usage Information

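When building apps for iOS or the Mac and adding performance monitoring, we often need CPU information to help track problems down: for example, whether the CPU is saturated right now and making the app stutter badly.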

1. How CPU usage is obtained on iOS

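Because of system restrictions, on a non-jailbroken iOS device you cannot get CPU information for the whole system. You can only enumerate the threads of your own app and add up their CPU time to get an approximate figure for reference. Tencent's open-source Matrix has an implementation. The code is long, so we only look at the core part: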

    // Get basic info for the current task; the CPU figure does not depend on this call
    kr = task_info(mach_task_self(), TASK_BASIC_INFO, (task_info_t) tinfo, &task_info_count);

    // Get all threads of the current task
    kr = task_threads(mach_task_self(), &thread_list, &thread_count);

    // Walk every thread and collect its CPU time
    for (j = 0; j < thread_count; j++) {
        // Fetch the basic info for this thread
        thread_info_count = THREAD_INFO_MAX;
        kr = thread_info(thread_list[j], THREAD_BASIC_INFO,
                         (thread_info_t) thinfo, &thread_info_count);

        basic_info_th = (thread_basic_info_t) thinfo;

        // Accumulate time and CPU usage; cpu_usage must be divided by the TH_USAGE_SCALE scale factor
        if (!(basic_info_th->flags & TH_FLAGS_IDLE)) {
            tot_sec = tot_sec + basic_info_th->user_time.seconds + basic_info_th->system_time.seconds;
            // user + system microseconds (the original added system_time twice)
            tot_usec = tot_usec + basic_info_th->user_time.microseconds + basic_info_th->system_time.microseconds;
            tot_cpu = tot_cpu + basic_info_th->cpu_usage / (float) TH_USAGE_SCALE * 100.0;
        }

    }

    // Finally release the thread list
    kr = vm_deallocate(mach_task_self(), (vm_offset_t) thread_list, thread_count * sizeof(thread_t));

DiDi's open-source DoraemonKit does essentially the same thing, only without the task_info() call and the user_time / system_time accumulation.

Note that the value read from cpu_usage has to be divided by TH_USAGE_SCALE before it is meaningful. Why? What is this constant for?
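As a minimal sketch of that division, assuming the per-thread cpu_usage values have already been read into a plain array: the helper name total_cpu_percent and the local USAGE_SCALE macro are hypothetical stand-ins (the real constant is TH_USAGE_SCALE), chosen so the snippet builds without the Mach headers.

```c
#include <stddef.h>

/* Stand-in for TH_USAGE_SCALE (1000), redefined so the sketch builds anywhere. */
#define USAGE_SCALE 1000

/* Sum per-thread cpu_usage values (each in 0..USAGE_SCALE) into a percentage.
 * `usages` stands in for the cpu_usage fields read from thread_basic_info. */
double total_cpu_percent(const int *usages, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++) {
        total += (double)usages[i] / USAGE_SCALE * 100.0;
    }
    return total;
}
```

With three threads reporting 250, 500 and 1000, the sum is about 175%, which is expected: the figure is per core, so a process with several busy threads can exceed 100%.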

1.1 What is TH_USAGE_SCALE

Let's look directly at how darwin-xnu implements thread_info(). The function itself only takes a lock; the real work is in thread_info_internal(), located in osfmk/kern/thread.c.

For the THREAD_BASIC_INFO flavor it calls retrieve_thread_basic_info(). That function first reads the thread's timers into user_time and system_time, and then comes the main part:

#define TH_USAGE_SCALE 1000

    /*
     *  To calculate cpu_usage, first correct for timer rate,
     *  then for 5/8 ageing.  The correction factor [3/5] is
     *  (1/(5/8) - 1).
     */
    basic_info->cpu_usage = 0;
#if defined(CONFIG_SCHED_TIMESHARE_CORE)
    if (sched_tick_interval) {
        basic_info->cpu_usage = (integer_t)(((uint64_t)thread->cpu_usage
                                    * TH_USAGE_SCALE) / sched_tick_interval);
        basic_info->cpu_usage = (basic_info->cpu_usage * 3) / 5;
    }
#endif

    if (basic_info->cpu_usage > TH_USAGE_SCALE)
        basic_info->cpu_usage = TH_USAGE_SCALE;

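The CONFIG_SCHED_TIMESHARE_CORE macro appears to mean that threads are scheduled by the timeshare scheduler, and sched_tick_interval is a global variable declared in osfmk/kern/sched.h. It is assigned when the timeshare scheduler initializes its timebase: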

// void sched_timeshare_timebase_init(void)

/* scheduler tick interval */
// #define USEC_PER_SEC 1000000ull /* microseconds per second */
// #define SCHED_TICK_SHIFT 3
clock_interval_to_absolutetime_interval(USEC_PER_SEC >> SCHED_TICK_SHIFT,
                                                NSEC_PER_USEC, &abstime);
assert((abstime >> 32) == 0 && (uint32_t)abstime != 0);
sched_tick_interval = (uint32_t)abstime;

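This value is the interval between two scheduler ticks under time-sharing scheduling: USEC_PER_SEC >> SCHED_TICK_SHIFT is 1,000,000 / 8 = 125,000 microseconds, converted to absolute-time units. For background on the FreeBSD time-sharing model, see the InformIT article listed in the references.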
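To make the arithmetic concrete, here is a small sketch that mirrors the two calculations quoted above. The function names and the trailing-underscore macros are hypothetical, and it assumes the raw usage and the tick interval are in the same unit; it does not model how the scheduler ages thread->cpu_usage.

```c
#include <stdint.h>

#define USEC_PER_SEC_      1000000ull  /* microseconds per second */
#define SCHED_TICK_SHIFT_  3
#define USAGE_SCALE        1000        /* stand-in for TH_USAGE_SCALE */

/* Scheduler tick length in microseconds: one second shifted right by 3. */
uint64_t sched_tick_usec(void)
{
    return USEC_PER_SEC_ >> SCHED_TICK_SHIFT_;
}

/* Same steps as the quoted snippet: scale by USAGE_SCALE / tick interval,
 * apply the 3/5 correction, then clamp to USAGE_SCALE.
 * Intended for small illustrative inputs only. */
int32_t reported_cpu_usage(uint32_t raw_usage, uint32_t tick_interval)
{
    int32_t usage = 0;
    if (tick_interval) {
        usage = (int32_t)(((uint64_t)raw_usage * USAGE_SCALE) / tick_interval);
        usage = (usage * 3) / 5;
    }
    if (usage > USAGE_SCALE) {
        usage = USAGE_SCALE;
    }
    return usage;
}
```

A raw value equal to one tick interval comes out as 600, and anything large enough to exceed the scale after correction is clamped to 1000, which is why the value handed to user space never goes above TH_USAGE_SCALE.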

void
clock_interval_to_absolutetime_interval(uint32_t   interval,
                                        uint32_t   scale_factor,
                                        uint64_t * result)
{
    uint64_t nanosecs = (uint64_t) interval * scale_factor;
    uint64_t t64;

    *result = (t64 = nanosecs / NSEC_PER_SEC) * rtclock_sec_divisor;
    nanosecs -= (t64 * NSEC_PER_SEC);
    *result += (nanosecs * rtclock_sec_divisor) / NSEC_PER_SEC;
}

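NSEC_PER_SEC is the number of nanoseconds in one second (see the Apple documentation entry in the references). nanosecs / NSEC_PER_SEC therefore yields whole seconds.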
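The conversion can be tried outside the kernel with a portable sketch. Here rtclock_sec_divisor becomes an explicit ticks_per_sec parameter, and interval_to_abs is a hypothetical name; 24000000 is used only because the article later quotes it as the default arm timebase frequency.

```c
#include <stdint.h>

#define NSEC_PER_SEC_ 1000000000ull

/* Convert (interval * scale_factor) nanoseconds into timebase ticks.
 * Whole seconds and the sub-second remainder are converted separately,
 * so the intermediate product stays below 2^64. */
uint64_t interval_to_abs(uint32_t interval, uint32_t scale_factor,
                         uint32_t ticks_per_sec)
{
    uint64_t nanosecs = (uint64_t)interval * scale_factor;
    uint64_t secs = nanosecs / NSEC_PER_SEC_;
    uint64_t result = secs * ticks_per_sec;

    nanosecs -= secs * NSEC_PER_SEC_;
    result += (nanosecs * ticks_per_sec) / NSEC_PER_SEC_;
    return result;
}
```

For the scheduler tick, interval_to_abs(125000, 1000, 24000000) gives 3,000,000 ticks; with a 1 GHz timebase the same call gives 125,000,000, that is, the interval in nanoseconds.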

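rtclock_sec_divisor is more interesting. First, RTC stands for real-time clock: a small clock chip, usually on the motherboard, powered by the CMOS battery. If you have ever built a PC you will have seen a coin-cell holder on the motherboard; that battery powers the RTC module.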

The value of rtclock_sec_divisor comes from the following function:

static void
timebase_callback(struct timebase_freq_t * freq)

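The freq argument is produced differently on each platform. When the clock module initializes, the kernel registers a callback with PE_register_timebase_callback(timebase_callback);. The arm implementation holds on to this callback, reads the relevant information from hardware, and passes it back through the callback: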

void
PE_call_timebase_callback(void)
{
    struct timebase_freq_t timebase_freq;

    timebase_freq.timebase_num = gPEClockFrequencyInfo.timebase_frequency_hz;
    timebase_freq.timebase_den = 1;

    if (gTimebaseCallback)
        gTimebaseCallback(&timebase_freq);
}

The timebase_freq_t struct is defined as follows:

struct timebase_freq_t {
  unsigned long timebase_num; // numerator
  unsigned long timebase_den; // denominator
};

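This way of expressing time is called a time base (note that it differs slightly from the time base of an oscilloscope; here it mainly serves as a unit of timekeeping). As mentioned above, the computer's timekeeping is built on the RTC module, and the core of that module is a clock oscillator. Today it is usually made from a quartz crystal with a frequency of 32.768 kHz (2^15 Hz).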

In the arm (iPhone) implementation the denominator of timebase_freq is hardcoded to 1.

i386 (Mac) takes the bus frequency and computes the following:

void PE_call_timebase_callback(void)
{
  struct timebase_freq_t timebase_freq;
  unsigned long          num, den, cnt;

  num = gPEClockFrequencyInfo.bus_clock_rate_num * gPEClockFrequencyInfo.bus_to_dec_rate_num;
  den = gPEClockFrequencyInfo.bus_clock_rate_den * gPEClockFrequencyInfo.bus_to_dec_rate_den;

  cnt = 2;
  while (cnt <= den) {
    if ((num % cnt) || (den % cnt)) {
      cnt++;
      continue;
    }

    num /= cnt;
    den /= cnt;
  }

  timebase_freq.timebase_num = num;
  timebase_freq.timebase_den = den;

  if (gTimebaseCallback) gTimebaseCallback(&timebase_freq);
}

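The contents of gPEClockFrequencyInfo are passed in from outside at boot and are presumably hardware information. The arm implementation also contains a pile of per-hardware conversions, for example for Samsung's s3c2410 processor and the OMAP3430. I do not know what they are for, but The iPhone Wiki offers a clue: in 2009 someone posted photos of an iPhone prototype on MacRumors, which sparked a discussion. Because /System/Library/Caches/com.apple.kernelcaches contains handling for some other CPUs, the guess is that Apple had not yet decided which CPU to use at the time and that this is leftover code. It cannot be verified, but it sounds plausible.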

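After checking that series of architectures, if none matches, timebase_frequency_hz is set to the default value 24000000, and then timebase-frequency is read from the device tree with DTGetProperty():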

/* Find the time base frequency first. */
if (DTGetProperty(cpu, "timebase-frequency", (void **)&value, &size) == kSuccess) {
    /*
     * timebase_frequency_hz is only 32 bits, and
     * the device tree should never provide 64
     * bits so this if should never be taken.
     */
    if (size == 8)
        gPEClockFrequencyInfo.timebase_frequency_hz = *(unsigned long long *)value;
    else
        gPEClockFrequencyInfo.timebase_frequency_hz = *value;
}

The i386 implementation is simpler; the values essentially come along with the boot arguments boot_args_start in vstart().

gPEClockFrequencyInfo.timebase_frequency_hz = 1000000000;
gPEClockFrequencyInfo.bus_frequency_hz      =  100000000;
gPEClockFrequencyInfo.bus_clock_rate_hz = gPEClockFrequencyInfo.bus_frequency_hz;
gPEClockFrequencyInfo.dec_clock_rate_hz = gPEClockFrequencyInfo.timebase_frequency_hz;

gPEClockFrequencyInfo.bus_clock_rate_num = gPEClockFrequencyInfo.bus_clock_rate_hz;
gPEClockFrequencyInfo.bus_clock_rate_den = 1;

gPEClockFrequencyInfo.bus_to_dec_rate_num = 1;
gPEClockFrequencyInfo.bus_to_dec_rate_den =
gPEClockFrequencyInfo.bus_clock_rate_hz / gPEClockFrequencyInfo.dec_clock_rate_hz;

So bus_clock_rate_num is 100000000 and bus_clock_rate_den is 1.

bus_to_dec_rate_num is 1, bus_clock_rate_hz is 100000000 and dec_clock_rate_hz is 1000000000, so mathematically bus_to_dec_rate_den would be 0.1. But in the expression gPEClockFrequencyInfo.bus_clock_rate_hz / gPEClockFrequencyInfo.dec_clock_rate_hz both operands are unsigned long, so the integer division yields 0. Hence:

// 100000000*1
num = gPEClockFrequencyInfo.bus_clock_rate_num * gPEClockFrequencyInfo.bus_to_dec_rate_num;

// 1*0
den = gPEClockFrequencyInfo.bus_clock_rate_den * gPEClockFrequencyInfo.bus_to_dec_rate_den;

So on i386 the time base has a numerator of 100000000 and a denominator of 0. This puzzles me a lot, because den is used in further computation below:

cnt = 2;
while (cnt <= den) {
    if ((num % cnt) || (den % cnt)) {
      cnt++;
      continue;
    }

    num /= cnt;
    den /= cnt;
}

With den equal to 0 this loop never runs, and in the implementation of timebase_callback(struct timebase_freq_t * freq) a denominator of 0 is rejected as invalid:

static void
timebase_callback(struct timebase_freq_t * freq)
{
    unsigned long numer, denom;
    uint64_t      t64_1, t64_2;
    uint32_t      divisor;

    if (freq->timebase_den < 1 || freq->timebase_den > 4 ||
        freq->timebase_num < freq->timebase_den)
        panic("rtclock timebase_callback: invalid constant %ld / %ld",
              freq->timebase_num, freq->timebase_den);

    denom = freq->timebase_num;
    numer = freq->timebase_den * NSEC_PER_SEC;
    // reduce by the greatest common denominator to minimize overflow
    if (numer > denom) {
        t64_1 = numer;
        t64_2 = denom;
    } else {
        t64_1 = denom;
        t64_2 = numer;
    }
    while (t64_2 != 0) {
        uint64_t temp = t64_2;
        t64_2 = t64_1 % t64_2;
        t64_1 = temp;
    }
    numer /= t64_1;
    denom /= t64_1;

    rtclock_timebase_const.numer = (uint32_t)numer;
    rtclock_timebase_const.denom = (uint32_t)denom;
    divisor = (uint32_t)(freq->timebase_num / freq->timebase_den);

    rtclock_sec_divisor = divisor;
    rtclock_usec_divisor = divisor / USEC_PER_SEC;
}

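To rule out a mistake in my mental arithmetic I copied the code and ran it: bus_to_dec_rate_den is indeed 0. That being so, the answer cannot be known without finding whoever is responsible for this part of the kernel.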
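The experiment can be reproduced with a portable sketch. It uses only the constants quoted above; the struct and function names are hypothetical, and the validity check returns false at the point where the kernel code would panic. The same helper also shows what the reduction produces for the 24 MHz arm default.

```c
#include <stdint.h>
#include <stdbool.h>

struct timebase { unsigned long num; unsigned long den; };

/* The i386 path with the quoted constants. */
struct timebase i386_timebase(void)
{
    unsigned long bus_hz = 100000000UL;
    unsigned long dec_hz = 1000000000UL;
    unsigned long bus_to_dec_den = bus_hz / dec_hz;  /* integer division: 0 */
    struct timebase tb = { bus_hz * 1UL, 1UL * bus_to_dec_den };
    return tb;
}

/* The range check at the top of timebase_callback(). */
bool timebase_is_valid(struct timebase tb)
{
    return !(tb.den < 1 || tb.den > 4 || tb.num < tb.den);
}

/* Reduce (den * NSEC_PER_SEC) / num by their greatest common divisor. */
bool reduce_timebase(struct timebase tb, uint32_t *numer, uint32_t *denom)
{
    if (!timebase_is_valid(tb)) {
        return false;
    }
    uint64_t n = (uint64_t)tb.den * 1000000000ull;
    uint64_t d = tb.num;
    uint64_t a = n > d ? n : d;
    uint64_t b = n > d ? d : n;
    while (b != 0) {          /* Euclid's algorithm */
        uint64_t t = b;
        b = a % b;
        a = t;
    }
    *numer = (uint32_t)(n / a);
    *denom = (uint32_t)(d / a);
    return true;
}
```

For a timebase of 24000000 / 1 the reduced constant is 125 / 3, while the i386 values fail the check because den is 0.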

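Either way, we now know that sched_tick_interval is the time interval used by thread scheduling and that it depends on the hardware clock frequency. As for the original question, TH_USAGE_SCALE is a value used together with the ageing algorithm when the kernel handles thread scheduling. It is hardcoded to 1000, and dividing by it gives a CPU usage percentage: basic_info_th->cpu_usage / (float) TH_USAGE_SCALE * 100.0. This touches on thread priority scheduling and the ageing algorithm, which I have not fully understood yet; see the book Mac OS X Internals: A Systems Approach.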

2. How CPU usage is obtained on the Mac

On macOS the kernel interface host_processor_info() returns the CPU load info. It is declared in mach_host.h and implemented in osfmk/kern/host.c.

The interface is defined as follows:

kern_return_t
host_processor_info(host_t host,
                    processor_flavor_t flavor,
                    natural_t * out_pcount,
                    processor_info_array_t * out_array,
                    mach_msg_type_number_t * out_array_count)

host is a Mach port; passing mach_host_self() is enough. If you do not know what a Mach port is, see section 1.1 of the previous article in this macOS kernel series.

2.1 How mach_host_self() creates its own Mach port

A short digression on the implementation of mach_host_self().

// libsyscall/mach/mach_legacy.c
mach_port_t
mach_host_self(void)
{
    return host_self_trap();
}

// osfmk/kern/ipc_host.c
mach_port_name_t
host_self_trap(
    __unused struct host_self_trap_args *args)
{
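    // Get the task that issued the system call as a `task_t` (what user space sees as a `mach_port_t`). See 2.2.
    task_t self = current_task();
    // The open-source code has no definition of `ipc_port_t` but does have `ipc_port`; read literally, this is the send-side mach port
    ipc_port_t sright;
    // Port name; think of it as an ID
    mach_port_name_t name;

    // A mutex used inside the kernel; take the lock
    itk_lock(self);
    // Copy the port passed in: if it is active its reference count is incremented, otherwise it is marked DEAD
    // itk_host is a special port the kernel assigns when the task is created, as mentioned in the previous article. Its creation traces back to `ipc_init()`, whose ultimate caller is each platform's boot entry point, e.g. `i386_init()` on i386, so this is presumably done right after power-on.
    sright = ipc_port_copy_send(self->itk_host);
    itk_unlock(self);
    // There is a notion of a space here; see the explanation of `current_space()` below.
    // The name is looked up from the space and sright, a number of tables are updated internally, and the name is returned
    name = ipc_port_copyout_send(sright, current_space());
    // Finally return it to the caller
    return name;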
    return name;
}

That is how the kernel creates a Mach port of its own and returns it to the caller.

While we are here, the implementation of current_space():

// osfmk/kern/ipc_tt.c
kr = ipc_space_create(&ipc_table_entries[0], &space);

// osfmk/ipc/ipc_space.h
#define    current_space_fast()    (current_task_fast()->itk_space)
#define current_space()        (current_space_fast())

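An ipc_space_t mainly stores a table, and that table records a collection of IPC entries of type ipc_entry_t. As far as my limited understanding goes, it holds a key-value mapping between names and entries that can be queried in both directions. As we said before, a name does not need to be globally unique; the kernel can find the matching task by itself, presumably through the table maintained by this space.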

2.2 The slightly puzzling current_task()

// bsd/kern/kern_prot.c
#include <kern/task.h>     /* for current_task() */

// libsyscall/mach/mach/mach_init.h
extern mach_port_t  mach_task_self_;
#define    mach_task_self() mach_task_self_
#define    current_task()  mach_task_self()

// libsyscall/mach/mach_init.c
mach_port_t mach_task_self_ = MACH_PORT_NULL;

void
mach_init_doit(void)
{
    // Initialize cached mach ports defined in mach_init.h
    mach_task_self_ = task_self_trap();
    // ...
}

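What is puzzling about current_task() is that, following it all the way, it turns out to be defined in terms of task_self_trap(), and that function begins by calling current_task(): an infinite loop.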

// osfmk/kern/ipc_tt.c
mach_port_name_t
task_self_trap(
    __unused struct task_self_trap_args *args)
{
    task_t task = current_task();
    //…
}

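However, libsyscall/mach/mach_init.c refers to the declaration extern mach_port_name_t task_self_trap(void); from osfmk/mach/mach_traps.h, so the implementation it calls may not be the one in ipc_tt.c, although I could not find it. A likely reading is that the user-space task_self_trap() is only a Mach trap stub that enters the kernel, while the current_task() called in ipc_tt.c is the kernel's own function (the one kern/task.h is included for above) rather than the libsyscall macro, in which case there is no real loop.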

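2.3 Getting CPU information with host_processor_info()

Back to host_processor_info(). The first argument is the Mach port generated by the kernel as described above, used for IPC; the second argument has the following definitions: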

/*
 *  Currently defined information.
 */
typedef int processor_flavor_t;
#define    PROCESSOR_BASIC_INFO    1       /* basic information */
#define    PROCESSOR_CPU_LOAD_INFO 2   /* cpu load information */
#define    PROCESSOR_PM_REGS_INFO  0x10000001  /* performance monitor register info */
#define    PROCESSOR_TEMPERATURE   0x10000002  /* Processor core temperature */

We want CPU usage, so we pick the second one, PROCESSOR_CPU_LOAD_INFO. The remaining three parameters are all out parameters; just pass pointers.

processor_info_array_t cpuInfo;
mach_msg_type_number_t numCpuInfo;
natural_t numCPUsU = 0U;
kern_return_t err = host_processor_info(mach_host_self(), PROCESSOR_CPU_LOAD_INFO, &numCPUsU, &cpuInfo, &numCpuInfo);

The four flavors return different information, but all of it comes back as a processor_info_array_t, which is a variable-sized inline array:

/* processor_info_t: variable-sized inline array that can
 * contain:
 * processor_basic_info_t:   (5 ints) see PROCESSOR_BASIC_INFO_COUNT
 * processor_cpu_load_info_t:(4 ints) at most CPU_STATE_MAX
 * processor_machine_info_t :(12 ints)
 * If other processor_info flavors are added, this definition
 * may need to be changed. (See mach/processor_info.h) */
type processor_flavor_t     = int;
type processor_info_t       = array[*:12] of integer_t;
type processor_info_array_t = ^array[] of integer_t;

The array indices for the CPU load states are defined as follows:

#define CPU_STATE_MAX      4

#define CPU_STATE_USER     0
#define CPU_STATE_SYSTEM   1
#define CPU_STATE_IDLE     2
#define CPU_STATE_NICE     3

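Macs today are mostly multi-core. My Intel Core i7, for instance, has four cores and eight hardware threads, so this interface returns 4 states for each hardware thread, 32 values in total. We can read them with a for loop: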

for (unsigned i = 0U; i < numCPUs; ++i) {
    uint32_t inUser   = (uint32_t)cpuInfo[(CPU_STATE_MAX * i) + CPU_STATE_USER];
    uint32_t inSystem = (uint32_t)cpuInfo[(CPU_STATE_MAX * i) + CPU_STATE_SYSTEM];
    uint32_t inNice   = (uint32_t)cpuInfo[(CPU_STATE_MAX * i) + CPU_STATE_NICE];
    uint32_t inIdle   = (uint32_t)cpuInfo[(CPU_STATE_MAX * i) + CPU_STATE_IDLE];
}

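numCPUs here is 8, the number of logical CPUs. It can be obtained from sysctl() with hw.ncpu, or taken from the out_pcount value returned by host_processor_info() itself (numCPUsU above). For the sysctl() interface see an earlier article; it is not repeated here.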
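The flattened layout can be exercised without the Mach headers. In this sketch a plain int32_t array stands in for processor_info_array_t, the CPU_STATE_* macros repeat the values quoted above, and sum_state is a hypothetical helper that adds up one state over all logical CPUs.

```c
#include <stdint.h>
#include <stddef.h>

#define CPU_STATE_MAX      4
#define CPU_STATE_USER     0
#define CPU_STATE_SYSTEM   1
#define CPU_STATE_IDLE     2
#define CPU_STATE_NICE     3

/* Entry for (cpu i, state s) lives at index CPU_STATE_MAX * i + s. */
uint64_t sum_state(const int32_t *cpu_info, size_t num_cpus, int state)
{
    uint64_t total = 0;
    for (size_t i = 0; i < num_cpus; i++) {
        /* Same cast as in the loop above: the ticks are unsigned 32-bit counts. */
        total += (uint32_t)cpu_info[CPU_STATE_MAX * i + state];
    }
    return total;
}
```

For two CPUs with ticks {10, 20, 70, 0} and {5, 5, 90, 0}, the user total is 15 and the idle total is 160.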

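Aside: Hyper-Threading

CPUs used to have one hardware thread per physical core. A thread here is not the same concept as a thread in an application. An application can start hundreds of threads, but a CPU with a single core can only slice its time among those logical threads; it switches so fast that you do not notice. Intel later developed Hyper-Threading, which presents two hardware threads per physical core. To the kernel this looks like twice as many cores, which is why an i7 reports 8 CPUs through sysctl().

user is CPU time spent in user space, system is time spent in the kernel, nice is a legacy attribute from older systems that is now hardcoded to return 0 although the source has not been removed, and idle is idle CPU time.

Following the usual style we would dive straight into the source, but let me hold that back for a moment. The values returned by host_processor_info() are all integers. Intuitively, the sum of user + system + idle over all cores is the whole CPU, and the busy share of that total is the CPU usage.

Perfectly reasonable and well-founded, so let's try it. The resulting percentage is odd: it stays around 7%. Even when Xcode is building a large project and iStat Menus shows 100%, the result is still 7%. Something must be wrong.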

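So I consulted the Hammerspoon code and the htop code and confirmed that fetching the CPU load info itself was fine. The problem had to be in how I processed the data.

The Hammerspoon documentation for cpuUsageTicks() says that the data returned by this interface is in ticks since the most recent system boot.

Earlier I only said that the array from host_processor_info() is full of integers, without saying what the unit is. So what are ticks?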

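2.4 CPU ticks

Strictly speaking these are not CPU ticks but clock ticks, a unit for measuring CPU time. Typically the system implements a clock that raises a CPU interrupt at a very short fixed interval and increments the tick count by one.

But the numbers returned by host_processor_info() are not that large. When the CPU is mostly idle, for example, idle is around 121033877. That is far too small compared with the CPU's clock frequency per second. The real counts could of course grow large enough to overflow a UInt64, so the kernel must be scaling them. How exactly does the kernel do it?

2.5 The implementation of host_processor_info()

The main work is done by the following function in osfmk/kern/processor.c: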

kern_return_t
processor_info(
    processor_t processor,
    processor_flavor_t      flavor,
    host_t                  *host,
    processor_info_t        info,
    mach_msg_type_number_t  *count)

A switch-case dispatches on the flavor; for PROCESSOR_CPU_LOAD_INFO it reads the corresponding values directly.

cpu_load_info = (processor_cpu_load_info_t) info;
if (precise_user_kernel_time) {
    // #define PROCESSOR_DATA(processor, member)    \
    //              (processor)->processor_data.member
    // processor is reached through the global defined in osfmk/kern/processor.h; this is equivalent to reading processor->processor_data.user_state
    // timer_data_t         user_state;
    // After reading user_state, divide by hz_tick_interval
    // In osfmk/kern/clock.c hz_tick_interval corresponds to NSEC_PER_SEC / 100 nanoseconds, i.e. 1/100 of a second
    cpu_load_info->cpu_ticks[CPU_STATE_USER] =
                    (uint32_t)(timer_grab(&PROCESSOR_DATA(processor, user_state)) / hz_tick_interval);
    cpu_load_info->cpu_ticks[CPU_STATE_SYSTEM] =
                    (uint32_t)(timer_grab(&PROCESSOR_DATA(processor, system_state)) / hz_tick_interval);
} else {
    uint64_t tval = timer_grab(&PROCESSOR_DATA(processor, user_state)) +
        timer_grab(&PROCESSOR_DATA(processor, system_state));

    cpu_load_info->cpu_ticks[CPU_STATE_USER] = (uint32_t)(tval / hz_tick_interval);
    cpu_load_info->cpu_ticks[CPU_STATE_SYSTEM] = 0;
}

hz_tick_interval = 1000000000ull / 100, i.e. 10^7, so the result we get has been scaled down by a factor of 10^7, which explains why the numbers are so small.
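A small sketch of that scaling, assuming the accumulated timer value is in nanoseconds (which matches the 10^7 figure above but may not hold on every platform). abs_ns_to_ticks and the macro names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC_        1000000000ull
#define HZ_TICK_INTERVAL_NS  (NSEC_PER_SEC_ / 100)  /* 10 ms per tick */

/* Scale an accumulated time in nanoseconds down to 1/100 s ticks,
 * truncating to 32 bits like the kernel's (uint32_t) cast. */
uint32_t abs_ns_to_ticks(uint64_t accumulated_ns)
{
    return (uint32_t)(accumulated_ns / HZ_TICK_INTERVAL_NS);
}
```

One hour of accumulated time becomes 360,000 ticks. Because of the 32-bit cast the count wraps after 2^32 ticks, which at 100 ticks per second is roughly 497 days of accumulated time in one state.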

Update 2019-11-01: I later found that my understanding of ticks here was off

The numbers returned by host_processor_info() above are kernel clock ticks, which XNU hardcodes as:

/*
 * The hz hardware interval timer.
 */

int             hz = 100;                /* GET RID OF THIS !!! */
int             tick = (1000000 / 100);  /* GET RID OF THIS !!! */

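That is, there are 100 ticks per second, counted separately for each (logical) CPU core. Taking the data for one of them I computed 3.8 hr, while uptime printed 4 hr 56 min, so the tick-based figure is somewhat lower. This is because the CPU does not accumulate ticks while the system sleeps. So the calculation is correct, and the tick rate is currently hardcoded to 100 per second.

Incidentally, these two GET RID OF THIS !!! comments are as oddly humorous as the other XXX comments.

2.6 How idle is computed

processor_info() also contains this comment: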

/*
 * We capture the accumulated idle time twice over
 * the course of this function, as well as the timestamps
 * when each were last updated. Since these are
 * all done using non-atomic racy mechanisms, the
 * most we can infer is whether values are stable.
 * timer_grab() is the only function that can be
 * used reliably on another processor's per-processor
 * data.
 */

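In short, because a processor in the idle state does not update its own idle time very often, the function checks whether the processor is currently idle, reads the idle time and the timestamp twice, compares them, and then returns a value to the caller.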

// Get the idle timer
idle_state = &PROCESSOR_DATA(processor, idle_state);
// First read of the accumulated idle time
idle_time_snapshot1 = timer_grab(idle_state);
// First read of the timestamp
idle_time_tstamp1 = idle_state->tstamp;

if (PROCESSOR_DATA(processor, current_state) != idle_state) {
    // The core is not idle right now, so it is busy; a busy core updates often, so the value can be trusted and used directly
    cpu_load_info->cpu_ticks[CPU_STATE_IDLE] =
                    (uint32_t)(idle_time_snapshot1 / hz_tick_interval);
} else if ((idle_time_snapshot1 != (idle_time_snapshot2 = timer_grab(idle_state))) ||
           (idle_time_tstamp1 != (idle_time_tstamp2 = idle_state->tstamp))){
    // The core is idle: read the time and timestamp again and check whether they agree
    // They differ, so the data may be changing concurrently; the second read is newer and probably more trustworthy, so use it
    cpu_load_info->cpu_ticks[CPU_STATE_IDLE] =
                    (uint32_t)(idle_time_snapshot2 / hz_tick_interval);
} else {
    // Also idle, but nothing changed between the two reads, so the data is most likely stable;
    // add the time elapsed since the last timestamp and use the result
    idle_time_snapshot1 += mach_absolute_time() - idle_time_tstamp1;

    cpu_load_info->cpu_ticks[CPU_STATE_IDLE] =
        (uint32_t)(idle_time_snapshot1 / hz_tick_interval);
}

Now we have both the busy data and the idle data; nice is the hardcoded 0:

cpu_load_info->cpu_ticks[CPU_STATE_NICE] = 0;
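The three-way decision for idle can be isolated into a pure function for experimentation. This is only a sketch of the branching shown above: struct idle_sample and idle_ticks are hypothetical names, and the two reads, the current time and the tick interval are passed in as plain values instead of being read from a live processor.

```c
#include <stdint.h>
#include <stdbool.h>

/* One read of the idle timer: accumulated idle time plus its timestamp. */
struct idle_sample {
    uint64_t accumulated;
    uint64_t tstamp;
};

uint32_t idle_ticks(bool currently_idle,
                    struct idle_sample first, struct idle_sample second,
                    uint64_t now, uint64_t tick_interval)
{
    if (!currently_idle) {
        /* Busy core: the first snapshot is used as is. */
        return (uint32_t)(first.accumulated / tick_interval);
    }
    if (first.accumulated != second.accumulated || first.tstamp != second.tstamp) {
        /* The two reads disagree: prefer the newer one. */
        return (uint32_t)(second.accumulated / tick_interval);
    }
    /* Stable while idle: add the idle time not yet folded into the timer. */
    return (uint32_t)((first.accumulated + (now - first.tstamp)) / tick_interval);
}
```

With a tick interval of 10, a stable idle sample of 100 taken at timestamp 5 and read at time 45 yields (100 + 40) / 10 = 14 ticks, whereas the same sample on a busy core yields 10.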

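About nice

Historically Unix systems have a nice value that expresses the execution priority of a process: -20 is the highest and 19 the lowest. Apple's Darwin XNU has deprecated it, though. I checked htop: on the Mac the NI column is all 0, while on Ubuntu the NI column shows 0, -20, 19, 5 and other values. For further reading see Wikipedia or the AppSignal article in the references.

2.7 About timer_grab()

Note this sentence in the comment above: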

timer_grab() is the only function that can be used reliably on another processor's per-processor data.

That is, timer_grab() is the only function that can reliably read another processor's per-processor data, i.e. processor->processor_data. But why? Why is timer_grab() the only reliable one?

Let's look at the definition of timer_grab():

/*
 * Read the accumulated time of `timer`.
 */
#if defined(__LP64__)
static inline
uint64_t
timer_grab(timer_t timer)
{
    return timer->all_bits;
}
#else /* defined(__LP64__) */
uint64_t timer_grab(timer_t timer);
#endif /* !defined(__LP64__) */

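On 64-bit systems it is implemented in the header as a static inline function that simply returns all_bits. On non-64-bit systems it is only declared. I searched the whole XNU open-source code and did not find an implementation, but there is another version that can serve as a reference: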

static uint64_t safe_grab_timer_value(struct timer *t)
{
#if   defined(__LP64__)
  return t->all_bits;
#else
  uint64_t time = t->high_bits;    /* endian independent grab */
  time = (time << 32) | t->low_bits;
  return time;
#endif
}

The if-else here only reflects the difference between 64-bit and 32-bit:

struct timer {
    uint64_t tstamp;
#if defined(__LP64__)
    uint64_t all_bits;
#else /* defined(__LP64__) */
    /* A check word on the high portion allows atomic updates. */
    uint32_t low_bits;
    uint32_t high_bits;
    uint32_t high_bits_check;
#endif /* !defined(__LP64__) */
};

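On 32-bit systems the kernel records the high and low halves in two separate uint32_t fields and joins them into one 64-bit uint64_t when returning. At first I assumed timer_grab() was about thread safety, but everyone here only reads the value and nobody writes, and the safe version above does nothing about thread safety. So it is probably just for compatibility that timer_grab() is the only function. That said, the high_bits_check field and its comment in the struct above hint that on 32-bit the two halves cannot be read in one atomic load, so a check word may be what lets a reader detect a concurrent update.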
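The struct above carries a high_bits_check field whose comment says a check word on the high portion allows atomic updates. Below is a minimal sketch of how such a check word can be used in general. It is not taken from XNU: timer32_store and timer32_try_grab are hypothetical names, and real concurrent code would also need memory barriers, which are left out here.

```c
#include <stdint.h>
#include <stdbool.h>

struct timer32 {
    uint32_t low_bits;
    uint32_t high_bits;
    uint32_t high_bits_check;
};

/* Writer: publish the check word first, then the low half, then the high half. */
void timer32_store(volatile struct timer32 *t, uint64_t value)
{
    t->high_bits_check = (uint32_t)(value >> 32);
    t->low_bits = (uint32_t)value;
    t->high_bits = (uint32_t)(value >> 32);
}

/* Reader: read high, low, check in that order. If high and check differ,
 * a store was in progress and the caller should retry. */
bool timer32_try_grab(const volatile struct timer32 *t, uint64_t *out)
{
    uint32_t high = t->high_bits;
    uint32_t low = t->low_bits;
    uint32_t check = t->high_bits_check;

    if (high != check) {
        return false;
    }
    *out = ((uint64_t)high << 32) | low;
    return true;
}
```

A caller would loop on timer32_try_grab() until it succeeds. The single-attempt form is used here so that a torn state can be observed without spinning forever.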

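2.8 How a timer keeps time

Timers are updated in quite a few places, and I would need to understand how the kernel clock works before I could follow the details, so here we only take a rough look at the timer data structure and API.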

struct timer {
    uint64_t tstamp;
    uint64_t all_bits;
};

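The non-64-bit variant is skipped; the principle is the same and only the storage layout differs. The key field is tstamp, the timestamp. timer_start() records the current timestamp; timer_stop(), timer_update() and timer_switch() all call timer_advance(), which computes the difference between two timestamps and adds it to all_bits.

Put simply, whenever the CPU is handed to user or system code, the corresponding timer starts counting; when execution switches between the two, or the CPU goes idle, the timer state changes and the accumulated time is updated.

The time passed in comes from mach_absolute_time().

Its implementation differs between arm and i386.

On i386 it eventually ends up here: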
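As a toy model of this bookkeeping, the sketch below keeps only the two fields shown above and takes timestamps as plain arguments. The sketch_timer_* functions are hypothetical simplifications, not the kernel's implementations.

```c
#include <stdint.h>

struct sketch_timer {
    uint64_t tstamp;    /* when the timer was last started */
    uint64_t all_bits;  /* accumulated time */
};

void sketch_timer_start(struct sketch_timer *t, uint64_t now)
{
    t->tstamp = now;
}

void sketch_timer_advance(struct sketch_timer *t, uint64_t delta)
{
    t->all_bits += delta;
}

void sketch_timer_stop(struct sketch_timer *t, uint64_t now)
{
    sketch_timer_advance(t, now - t->tstamp);
}

/* Stop charging `from` and start charging `to` at the same instant. */
void sketch_timer_switch(struct sketch_timer *from, uint64_t now,
                         struct sketch_timer *to)
{
    sketch_timer_stop(from, now);
    sketch_timer_start(to, now);
}
```

Starting a user timer at 100, switching to system at 250, back to user at 300 and stopping at 400 leaves 250 units on the user timer and 50 on the system timer.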

static inline uint64_t
rtc_nanotime_read(void)
{
    return  _rtc_nanotime_read(&pal_rtc_nanotime_info);
}

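_rtc_nanotime_read() has no C implementation, though; it may be implemented in assembly. Either way, what it reads is the current RTC time, in nanoseconds.

The arm implementation is: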

uint64_t
mach_absolute_time(void)
{
    return ml_get_timebase();
}

uint64_t
ml_get_timebase()
{
    return (ml_get_hwclock() + getCpuDatap()->cpu_base_timebase);
}

Why add the two? Because cpu_base_timebase is assigned during initialization like this:

if (!from_boot && (cdp == &BootCpuData)) {
    /*
     * When we wake from sleep, we have no guarantee about the state
     * of the hardware timebase.  It may have kept ticking across sleep, or
     * it may have reset.
     *
     * To deal with this, we calculate an offset to the clock that will
     * produce a timebase value wake_abstime at the point the boot
     * CPU calls cpu_timebase_init on wake.
     *
     * This ensures that mach_absolute_time() stops ticking across sleep.
     */
    rtclock_base_abstime = wake_abstime - ml_get_hwclock();
}
cdp->cpu_base_timebase = rtclock_base_abstime;

rtclock_base_abstime is the RTC time as a uint64_t, stored in the rtc_base struct of rtclock_data_t, also in nanoseconds.

extern rtclock_data_t                   RTClockData;
#define rtclock_base_abstime           RTClockData.rtc_base.abstime

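This initialization function, void cpu_timebase_init(boolean_t from_boot), is called more than once. At boot it can use rtclock_base_abstime directly, but when waking from sleep the clock may have stopped running, so an offset has to be computed.

Initially rtclock_base_abstime is 0. When all cores go to sleep, ml_arm_sleep(void) records a time in wake_abstime. That value is obtained from ml_get_timebase(); if the system has never slept before, it equals the hardware clock time ml_get_hwclock().

When the CPU wakes up, the difference wake_abstime - ml_get_hwclock() is computed and saved to cpu_base_timebase.

From then on, reading ml_get_timebase() adds this offset, so the result starts from the saved wake_abstime, as if ticking simply continued from the point where the last sleep began.

The comment says the hardware clock may or may not keep ticking during sleep and therefore needs correcting. I am not yet sure why that behaviour is wanted, beyond the comment's own statement that it makes mach_absolute_time() stop ticking across sleep. Perhaps the kernel needs this time for something.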
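The offset trick can be simulated with plain integers. In this sketch the hardware clock value is passed in as an argument, struct clock_state stands in for cpu_base_timebase, and both function names are hypothetical. It relies on unsigned 64-bit wraparound, so it covers both cases named in the kernel comment: a hardware clock that reset and one that kept ticking.

```c
#include <stdint.h>

struct clock_state {
    uint64_t base;  /* stands in for cpu_base_timebase */
};

/* What a read of the timebase returns for a given hardware clock value. */
uint64_t timebase_read(const struct clock_state *c, uint64_t hwclock)
{
    return hwclock + c->base;
}

/* On wake: pick the offset that makes the timebase read wake_abstime again.
 * If hwclock_at_wake is larger than wake_abstime the subtraction wraps,
 * and the later addition wraps back, which is intended. */
void timebase_rebase_on_wake(struct clock_state *c, uint64_t wake_abstime,
                             uint64_t hwclock_at_wake)
{
    c->base = wake_abstime - hwclock_at_wake;
}
```

If the timebase read 1000 when the machine went to sleep, then after waking it reads 1000 again and 1050 fifty ticks later, whether the hardware clock restarted at 10 or had advanced to 5000 in the meantime.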

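2.9 Finally, fixing the usage calculation

Back to the original problem that usage computed from host_processor_info() was inaccurate: we were using cumulative data since boot. What matters is CPU data over a short window, so take the CPU load at time t1 and again at time t2 and subtract. The difference reflects how the CPU was used between t1 and t2. So the fix is to take two samples, subtract them, and divide the busy ticks by the total ticks of the difference to get the CPU usage.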
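A minimal sketch of the two-sample calculation, assuming the four tick counts of one CPU have been copied into a small struct at t1 and t2. struct cpu_ticks and usage_percent are hypothetical names; the unsigned 32-bit subtraction also tolerates a counter that wrapped between the samples.

```c
#include <stdint.h>

struct cpu_ticks {
    uint32_t user;
    uint32_t system;
    uint32_t idle;
    uint32_t nice;
};

/* Busy share of the ticks accumulated between two samples, in percent. */
double usage_percent(struct cpu_ticks t1, struct cpu_ticks t2)
{
    uint32_t user = t2.user - t1.user;
    uint32_t system = t2.system - t1.system;
    uint32_t idle = t2.idle - t1.idle;
    uint32_t nice = t2.nice - t1.nice;

    uint64_t busy = (uint64_t)user + system + nice;
    uint64_t total = busy + idle;
    if (total == 0) {
        return 0.0;  /* no ticks elapsed between the samples */
    }
    return (double)busy / (double)total * 100.0;
}
```

With samples {1000, 500, 8500, 0} and {1060, 520, 8520, 0} the window contains 80 busy ticks out of 100, so about 80%, whereas the cumulative totals of the second sample alone would suggest under 16%. That gap is the same effect as the constant 7% seen earlier.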

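Hammerspoon provides a simple sampling helper wrapped in Lua, hs.host.cpuUsage([period], [callback]) -> table.

Its source can be found through the Hammerspoon entry in the references; the relevant part is: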

local convertToPercentages = function(result1, result2)
    local result = {}
    for k,v in pairs(result2) do
        if k == "n" then
            result.n = v
        else
            result[k] = {}
            for k2, v2 in pairs(v) do
                result[k][k2] = v2 - result1[k][k2]
            end
            local total = result[k].active + result[k].idle
            for k2, _ in pairs(result[k]) do
                result[k][k2] = (result[k][k2] / total) * 100.0
            end
        end
    end
    for i,v in pairs(result) do
        if tostring(i) ~= "n" then
            result[i] = setmetatable(v, { __tostring = __tostring_for_tables })
        end
    end
    return result
end

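It simply takes the difference of the two results.

3. Summary

Starting from the interfaces for reading CPU usage on iOS and the Mac, this article briefly covered the time base concept, the RTC, how the kernel maintains spaces and tables to record Mach port and task information, CPU ticks, and other things used at the kernel level.

The lower you go in an operating system, the more you deal with hardware design. Day-to-day work on user-facing apps rarely touches any of this. For something like CPU usage you can copy code from Stack Overflow and it will work. There is nothing wrong with that, but digging into how a system interface is implemented, and finding out why it works, is also a lot of fun.

Some parts of a kernel call for excellent algorithmic skills, such as the thread scheduling model; some parts pursue stability above all; and some use C/C++ syntactic sugar and the like, which makes them hard to read. In essence, though, the approach is the same as building an everyday app feature: analyze a problem and find a way to solve it.

Reading and understanding kernel code is the easy part, of course; actually writing a kernel is extraordinarily hard. It takes not only strong algorithmic skills but also the ability to manage a large project. So although I cannot write a kernel, looking at the implementation behind these mysterious APIs is still very interesting.

Update: the code under the osfmk directory is the Mach part of the kernel. Because tasks are implemented in Mach, we can obtain the related information through Mach kernel interfaces. The host_info() family of interfaces is provided by Mach.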

References

System time - Wikipedia
Hammerspoon docs: hs.host
hishamhm/htop: htop is an interactive text-mode process viewer for Unix systems. It aims to be a better 'top'.
Historically on Unix based systems, the nice cpu state represents processes for which the execution priority has been reduced to allow other higher priority processes access to more system resources.
Understanding CPU statistics | AppSignal Blog
Tencent/matrix: Matrix is a plugin style, non-invasive APM system developed by WeChat.
didi/DoraemonKit: "DoKit" for short. A full-featured development assistant for client apps (iOS, Android, WeChat mini programs).
apple/darwin-xnu: The Darwin Kernel (mirror)
NSEC_PER_SEC - Dispatch | Apple Developer Documentation
4.4 Thread Scheduling | Process Management in the FreeBSD Operating System | InformIT
Real-time clock - Wikipedia