
Discretizing Continuous Variables into Categorical Variables in R

2022-07-19 webabcd

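Base R has several built-in functions that help turn a numeric variable into a factor: cut, split, quantile, and .bincode. This article focuses on the binning functions provided by ggplot2.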
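For comparison with the ggplot2 helpers, here is a minimal base R sketch using cut and .bincode. The vector x and the break points brk are made-up illustration values, not taken from the original article.

```r
# Hypothetical sample values and break points, chosen only for illustration
x <- c(1, 5, 7, 10)
brk <- c(0, 5, 10)

# cut() returns a factor whose levels are interval labels
f <- cut(x, breaks = brk)
levels(f)

# .bincode() returns only the integer bin index of each value
codes <- .bincode(x, breaks = brk)
codes
```

With base R the break points have to be supplied by hand; the ggplot2 functions described below work them out from n or width.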

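cut_interval() splits x into n groups of equal range; cut_number() splits x into n groups with (approximately) equal numbers of observations; cut_width() splits x into groups of the width given by the width argument.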

The function signatures are:

# cut_interval(x, n = NULL, length = NULL, ...) 
#  
# cut_number(x, n = NULL, ...) 
#  
# cut_width( 
#   x, 
#   width, 
#   center = NULL, 
#   boundary = NULL, 
#   closed = c("right", "left"), 
#   ... 
# ) 

cut_interval example

Split into 6 groups of equal range, then use table to count the members of each group as a check:

library(ggplot2) 
table(cut_interval(1:10, 6)) 
 
 # [1,2.5]  (2.5,4]  (4,5.5]  (5.5,7]  (7,8.5] (8.5,10]  
 #       2        2        1        2        1        2  
 
table(cut_interval(1:10, 5)) 
  # [1,2.8] (2.8,4.6] (4.6,6.4] (6.4,8.2]  (8.2,10]  
  #       2         2         2         2         2  
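The "equal range" rule can be reproduced by hand with base R, which may make the break points in the output above easier to read. This is a sketch of the idea, assuming evenly spaced breaks between min(x) and max(x); it is not presented as the exact ggplot2 implementation.

```r
library(ggplot2)

x <- 1:10
n <- 5

# n groups of equal range need n + 1 evenly spaced break points
brk <- seq(min(x), max(x), length.out = n + 1)

by_hand <- cut(x, breaks = brk, include.lowest = TRUE)
by_ggplot <- cut_interval(x, n)

# Compare the group membership value by value
all(as.integer(by_hand) == as.integer(by_ggplot))
```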

cut_number example

Split so that each group contains the same number of elements:

table(cut_number(runif(100), 10)) 
# [0.00693,0.17]   (0.17,0.305]   (0.305,0.38]   (0.38,0.477]   (0.477,0.58]   (0.58,0.688]  (0.688,0.771]  
#             10             10             10             10             10             10             10  
#   (0.771,0.83]   (0.83,0.922]  (0.922,0.993]  
#             10             10             10  
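Because runif draws new values on every call, the interval labels above change from run to run. The sketch below fixes a seed (an arbitrary choice, not from the original) and also builds the groups by hand from sample quantiles, on the assumption that equal-count groups correspond to quantile break points.

```r
library(ggplot2)

set.seed(42)   # arbitrary seed, only to make the run repeatable
x <- runif(100)

# Break points at the 0%, 10%, ..., 100% sample quantiles
brk <- quantile(x, probs = seq(0, 1, length.out = 11))
by_hand <- cut(x, breaks = brk, include.lowest = TRUE)

counts <- table(cut_number(x, 10))
counts
```

When the data contain many tied values the groups can only be approximately equal in size, which is why the description says "approximately".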
 

cut_width example

Bin 100 uniformly distributed values into groups of width 0.1:

table(cut_width(runif(100), 0.1)) 
 
# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75]  
#            3           14           10            6           11           11           13            8  
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05]  
#            9            7            8  
table(cut_width(runif(100), 0.1, boundary = 0)) 
  # [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]  
  #      10         6        13        11         8        11         9        11        11        10  
 
table(cut_width(runif(100), 0.1, center = 0)) 
# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75]  
#            5           16           12           11            8           11            7            8  
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05]  
#           13            5            4  
 
table(cut_width(runif(100), 0.1, labels = FALSE)) 
# 1  2  3  4  5  6  7  8  9 10 11  
# 9  8 13 12  7 10  9 11  8  9  4  

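boundary sets the position of a bin edge; if neither boundary nor center is given, boundary defaults to half of width. center sets the position of a bin center; center = 0 puts the bin centers on multiples of width (0, 0.1, 0.2, ...), which in this example gives the same bins as the default. Only one of center and boundary may be specified.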
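A small deterministic sketch of how boundary moves the bin edges. The vector x holds made-up values chosen for illustration, so the results do not depend on random draws.

```r
library(ggplot2)

# Hypothetical values between 0 and 1, chosen only for illustration
x <- c(0.02, 0.12, 0.48, 0.51, 0.97)

# Default: bin edges fall at -0.05, 0.05, 0.15, ...
default_bins <- cut_width(x, 0.1)

# boundary = 0: bin edges fall at 0, 0.1, 0.2, ...
edge_bins <- cut_width(x, 0.1, boundary = 0)

levels(default_bins)[1]
levels(edge_bins)[1]

# center and boundary cannot be combined; this call is expected to fail
both <- try(cut_width(x, 0.1, center = 0, boundary = 0), silent = TRUE)
inherits(both, "try-error")
```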

labels
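labels specifies the labels for the levels of the result. By default, labels use the interval notation "(a,b]". With labels = FALSE, plain integer codes are returned instead of a factor.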
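The three labelling options can be compared side by side. This is a sketch on made-up values; the names "low", "mid", and "high" are hypothetical labels picked for the example.

```r
library(ggplot2)

x <- c(0.1, 0.4, 0.9, 1.3)   # hypothetical values

# Default: a factor with interval labels
cut_width(x, 0.5, boundary = 0)

# labels = FALSE: plain integer bin codes
codes <- cut_width(x, 0.5, boundary = 0, labels = FALSE)
codes

# Custom labels: one name per bin (these values fall into three bins)
named <- cut_width(x, 0.5, boundary = 0, labels = c("low", "mid", "high"))
named
```

A custom labels vector must have exactly one entry per bin, so it is most practical when the number of bins is known in advance.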

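Application example

Count how diamond weights (carat) are distributed in the diamonds data set:

library("ggplot2")  # provides the diamonds data and cut_width 
library("dplyr") 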
diamonds %>% count(cut_width(carat, 0.5))  
 
# A tibble: 11 x 2 
#    `cut_width(carat, 0.5)`     n 
#    <fct>                   <int> 
#  1 [-0.25,0.25]              785 
#  2 (0.25,0.75]             29498 
#  3 (0.75,1.25]             15977 
#  4 (1.25,1.75]              5313 
#  5 (1.75,2.25]              2002 
#  6 (2.25,2.75]               322 
#  7 (2.75,3.25]                32 
#  8 (3.25,3.75]                 5 
#  9 (3.75,4.25]                 4 
# 10 (4.25,4.75]                 1 
# 11 (4.75,5.25]                 1 
 
ggplot(data = diamonds, mapping = aes(x = carat)) + 
  geom_histogram(binwidth = 0.5) 

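[Figure: histogram of carat in the diamonds data, binwidth = 0.5]
geom_histogram can also set the bin width, through its binwidth argument.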
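Beyond counting, the binned variable can serve as a grouping key. The sketch below computes the mean price per carat bin; the column names carat_bin and mean_price are hypothetical names introduced for this example.

```r
library(ggplot2)
library(dplyr)

# Group diamonds by 0.5-carat bins, then summarise each bin
carat_summary <- diamonds %>%
  group_by(carat_bin = cut_width(carat, 0.5)) %>%
  summarise(n = n(), mean_price = mean(price))

carat_summary
```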


Reference link for this article: https://blog.csdn.net/neweastsun/article/details/121070379