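Base R offers several functions that help convert a numeric variable into a factor, such as cut, split, quantile, and .bincode. This article covers the binning helpers provided by ggplot2.
cut_interval() splits x into n groups of equal range; cut_number() splits x into n groups with (approximately) equal numbers of observations; cut_width() splits x into groups whose width is given by the width argument.
The syntax is: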
# cut_interval(x, n = NULL, length = NULL, ...)
#
# cut_number(x, n = NULL, ...)
#
# cut_width(
#   x,
#   width,
#   center = NULL,
#   boundary = NULL,
#   closed = c("right", "left"),
#   ...
# )
cut_interval example
Split the values into 6 groups of equal range, then use table() to count the members of each group as a check:
table(cut_interval(1:10, 6))
#  [1,2.5]  (2.5,4]  (4,5.5]  (5.5,7]  (7,8.5] (8.5,10]
#        2        2        1        2        1        2
table(cut_interval(1:10, 5))
#   [1,2.8] (2.8,4.6] (4.6,6.4] (6.4,8.2]  (8.2,10]
#         2         2         2         2         2
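For intuition, the equal-range split can be reproduced with base R alone. The sketch below assumes that cut_interval() places n + 1 evenly spaced break points between min(x) and max(x) and hands them to cut() with include.lowest = TRUE; the variable names are illustrative only.

```r
# Sketch: equal-range binning with base R only (no ggplot2 needed).
x <- 1:10
n <- 6

# n + 1 evenly spaced break points spanning the range of x
brks <- seq(min(x), max(x), length.out = n + 1)

# include.lowest = TRUE closes the first interval on the left: [1,2.5]
groups <- cut(x, breaks = brks, include.lowest = TRUE)

table(groups)
```

The counts should agree with the cut_interval(1:10, 6) result shown earlier.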
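cut_number example
Split the values so that each group contains the same number of elements (the input is random and no seed is set, so the break points differ from run to run):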
table(cut_number(runif(100), 10))
# [0.00693,0.17]   (0.17,0.305]   (0.305,0.38]   (0.38,0.477]   (0.477,0.58]   (0.58,0.688]  (0.688,0.771]
#             10             10             10             10             10             10             10
#   (0.771,0.83]   (0.83,0.922]  (0.922,0.993]
#             10             10             10
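The equal-count split can also be sketched with base R. This assumes cut_number() takes its break points from evenly spaced sample quantiles, which is an assumption about the implementation rather than something stated above; the seed and variable names are illustrative.

```r
# Sketch: equal-count binning with base R only.
set.seed(42)    # fixed seed so the run is reproducible
x <- runif(100)
n <- 10

# Break points at the 0%, 10%, ..., 100% sample quantiles
brks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)

# include.lowest = TRUE keeps min(x) inside the first interval
groups <- cut(x, breaks = brks, include.lowest = TRUE)

table(groups)
```

With heavily tied data some quantiles can coincide, and cut() rejects duplicated breaks, which is one reason the group sizes are only approximately equal in general.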
cut_width example
Bin 100 uniformly distributed values using a bin width of 0.1:
table(cut_width(runif(100), 0.1))
# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75]
#            3           14           10            6           11           11           13            8
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05]
#            9            7            8
table(cut_width(runif(100), 0.1, boundary = 0))
#   [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]
#        10         6        13        11         8        11         9        11        11        10
table(cut_width(runif(100), 0.1, center = 0))
# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75]
#            5           16           12           11            8           11            7            8
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05]
#           13            5            4
table(cut_width(runif(100), 0.1, labels = FALSE))
#  1  2  3  4  5  6  7  8  9 10 11
#  9  8 13 12  7 10  9 11  8  9  4
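boundary sets the position of a bin edge and center sets the position of a bin center; only one of the two may be given. All bins are aligned, so fixing one bin fixes every other bin. If neither is specified, boundary defaults to half of width, which is why the first two examples above and the center = 0 example produce the same intervals. To center the bins on integers, use width = 1 and center = 0.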
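The alignment rule is easier to see when the bin edges are computed by hand. The following is a simplified base R sketch, not ggplot2's actual code: width_breaks() is a hypothetical helper, and the small 1e-8 correction is an assumption made here to avoid an extra empty bin at the top.

```r
# Sketch: deriving bin edges from width plus boundary/center (base R only).
# width_breaks() is a hypothetical helper, not a ggplot2 function.
width_breaks <- function(x, width, center = NULL, boundary = NULL) {
  stopifnot(is.null(center) || is.null(boundary))  # give at most one
  if (is.null(boundary)) {
    # Default edge at width / 2; a center implies an edge half a bin below it
    boundary <- if (is.null(center)) width / 2 else center - width / 2
  }
  # Shift the edge by whole bins until it is the last edge at or below min(x)
  first <- boundary + floor((min(x) - boundary) / width) * width
  # Extend by just under one bin so max(x) is covered without an extra bin
  last <- max(x) + (1 - 1e-8) * width
  seq(first, last, by = width)
}

x <- c(0.2, 1.1, 2.6)
width_breaks(x, 0.5)                # default: edges at -0.25, 0.25, ..., 2.75
width_breaks(x, 0.5, boundary = 0)  # edges at 0, 0.5, ..., 3
width_breaks(x, 0.5, center = 0)    # same edges as the default
```

Passing these edges to cut(x, breaks, include.lowest = TRUE) would then yield the grouping.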
labels
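labels is passed through to base cut() and sets the labels of the resulting factor levels. By default the levels are labeled in the form "(a,b]". With labels = FALSE, plain integer codes are returned instead of a factor.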
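Since labels is handled by base cut(), its behavior can be tried without ggplot2. A minimal sketch with made-up values and break points:

```r
# Sketch: how cut() treats the labels argument (base R only).
x <- c(1, 3, 6, 9)
brks <- c(0, 5, 10)

default_groups <- cut(x, brks)                           # levels "(0,5]" "(5,10]"
named_groups <- cut(x, brks, labels = c("low", "high"))  # custom level names
codes <- cut(x, brks, labels = FALSE)                    # integer codes 1 1 2 2
```

The same labels values can be supplied to cut_interval(), cut_number(), and cut_width(), which forward extra arguments to cut().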
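Application example
Count how diamond weights (carat) are distributed in the diamonds dataset:
library("ggplot2") # provides the diamonds data and cut_width()
library("dplyr")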
diamonds %>% count(cut_width(carat, 0.5))
# A tibble: 11 x 2
#    `cut_width(carat, 0.5)`     n
#    <fct>                   <int>
#  1 [-0.25,0.25]              785
#  2 (0.25,0.75]             29498
#  3 (0.75,1.25]             15977
#  4 (1.25,1.75]              5313
#  5 (1.75,2.25]              2002
#  6 (2.25,2.75]               322
#  7 (2.75,3.25]                32
#  8 (3.25,3.75]                 5
#  9 (3.75,4.25]                 4
# 10 (4.25,4.75]                 1
# 11 (4.75,5.25]                 1
ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
geom_histogram() can also set the bin width, through its binwidth argument.
Reference: https://blog.csdn.net/neweastsun/article/details/121070379