Basic Statistics

基本的 relation-level统计信息存储在系统目录的 pg_class 表中,包含以下数据:

  • relation中的元组(记录)数量(reltuples)
  • relation的大小,以页(page)为单位(relpages)
  • 在可见性映射(visibility map)中被标记的页数(relallvisible)

下面是 flights 表对应的这些值:

1
2
3
4
5
6
=> SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'flights';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
214867 | 2624 | 2624
(1 row)

如果查询未施加任何过滤条件,则 reltuples 的值将作为基数(cardinality)估计:

1
2
3
4
5
=> EXPLAIN SELECT * FROM flights;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

统计信息是在表分析(无论是手动还是自动)过程中收集的。此外,由于基本统计信息至关重要,这些数据也会在其他操作中计算(如 VACUUM FULL 和 CLUSTER,以及 CREATE INDEX 和 REINDEX),并在常规 VACUUM 过程中进一步精化。

为了分析目的,会从表中随机抽取 300 × default_statistics_target 行进行采样。为了构建具有特定精度的统计信息,所需的样本量与被分析数据的总体量关系不大,因此表的大小不会被考虑在内。

抽样行是从同样数量的随机页(300 × default_statistics_target 页)中选出的。显然,如果表本身较小,则可能读取的页数更少,选取用于分析的行数也会相应减少。

对于大表,统计信息收集不会包含所有行,因此估算值可能与实际值存在偏差。这是完全正常的:如果数据在不断变化,统计信息本身也不可能始终准确。通常,只要估算精度在数量级上足够,即可用来选择合理的查询执行计划。

我们创建一个 flights 表的副本,并禁用 autovacuum,这样我们就可以控制自动分析(autoanalysis)的启动时间:

1
=> CREATE TABLE flights_copy(LIKE flights) WITH (autovacuum_enabled = false);

目前新表没有统计信息

1
2
3
4
5
=> SELECT reltuples, relpages, relallvisible FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
1 | 0 | 0
(1 row)

当 reltuples = −1 时,用于区分尚未分析的表和真正没有行的空表。

表创建后,很可能会马上插入一些行。由于优化器无法获知表的当前实际状态,它会假设该表包含 10 个页:

1
2
3
4
5
=> EXPLAIN SELECT * FROM flights_copy;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights_copy (cost=0.00..14.10 rows=410 width=170)
(1 row)

行数的估算是基于单行的大小(在执行计划中显示为 width)。行宽通常是在分析过程中计算的平均值,但由于此时尚未收集任何统计信息,这里只是根据列的数据类型做出的近似估算。

现在,我们从 flights 表拷贝数据同时完成分析:

1
2
=> INSERT INTO flights_copy SELECT * FROM flights; INSERT 0 214867
=> ANALYZE flights_copy;

收集到的统计信息反映了实际的行数(表足够小,分析器可以对所有数据收集统计信息):

1
2
3
4
5
=> SELECT reltuples, relpages, relallvisible FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
214867 | 2624 | 0
(1 row)

relallvisible 值用于估算 index-only scan 的成本。该值由 VACUUM 更新:

1
2
3
4
5
6
=> VACUUM flights_copy;
=> SELECT relallvisible FROM pg_class WHERE relname = 'flights_copy';
relallvisible
−−−−−−−−−−−−−−−
2624
(1 row)

现在我们将行数翻倍,但不更新统计信息,检查查询计划中的基数估算:

1
2
3
4
5
6
7
8
9
10
11
12
=> INSERT INTO flights_copy SELECT * FROM flights; 
=> SELECT count(*) FROM flights_copy;
count
−−−−−−−−
429734
(1 row)

=> EXPLAIN SELECT * FROM flights_copy;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights_copy (cost=0.00..9545.34 rows=429734 width=63)
(1 row)

尽管 pg_class 中的数据已经过时,估算结果仍然准确:

1
2
3
4
5
6
=> SELECT reltuples, relpages
FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages
−−−−−−−−−−−+−−−−−−−−−−
214867 | 2624
(1 row)

问题在于,如果优化器发现 relpages 与实际文件大小之间存在差距,它可以通过比例调整 reltuples 的值来提高估算精度。由于文件大小相比 relpages 翻了一倍,优化器会调整估算的行数,同时假设数据密度保持不变:

1
2
3
4
5
6
7
=> SELECT reltuples *
(pg_relation_size('flights_copy') / 8192) / relpages AS tuples
FROM pg_class WHERE relname = 'flights_copy';
tuples
−−−−−−−−
429734
(1 row)

当然,这种调整并不总是有效(例如,如果删除了一些行,估算值将保持不变),但在某些情况下,它可以让优化器在下一次分析触发之前仍然保持合理的估算。