
Elasticsearch Aggregation Analysis in Practice (2)

Posted 2022-07-19 by jiqing9006


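Building on the previous article, this one takes a closer look at metric aggregations and bucket aggregations. The sample data can be downloaded from the link in the original post.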

1. Environment setup

The dataset contains 1,000 employee documents, which can be bulk-inserted into Elasticsearch with the POST /employees/_bulk command.
Run GET /_cat/indices?v to verify that the employees index was created.

yellow open   employees  LDqYniJMRy2wVvB5O0oxEA   1   1       1000            0    458.9kb        458.9kb 

A sample document looks like this:

{ 
    "name" :"Randi Howell", 
    "age":19, 
    "salary":20144, 
    "gender":"female", 
    "email":"randihowell@comvex.com", 
    "phone":"+1 (870) 408-2828", 
    "street":"135 Montieth Street", 
    "city":"Diaperville", 
    "state":"Montana, 299" 
} 

2. Employee data analysis examples

2.1. Metric aggregations

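Compute the average salary of all employees. The example below does this with a metric aggregation. aggs introduces the aggregations section, avg_salary is the name assigned to the result, avg is the metric aggregation type that computes a mean, and field names the field to compute over. Because an aggregation-only query does not need the matching documents themselves, size is set to 0.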

GET /employees/_search 
{ 
  "size": 0,  
  "aggs": { 
    "avg_salary": { 
      "avg": {"field": "salary"} 
    } 
  } 
} 

Response:

{ 
  "took" : 62, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "avg_salary" : { 
      "value" : 44966.073 
    } 
  } 
} 

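Aggregation scoped by a query: compute the average salary of employees who are female or whose state matches Mississippi. The aggregation runs only over the documents matched by the query. Because the bool query contains only should clauses, a document must match at least one of them: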

GET /employees/_search 
{ 
  "size": 0,  
  "query" : { 
    "bool" : { 
          "should": [ 
              { "match": { "state": "Mississippi" } }, 
              { "match": { "gender": "female" } } 
            ] 
     } 
  }, 
  "aggs": { 
    "avg_salary": { 
      "avg": {"field": "salary"} 
    } 
  } 
} 

Response:

{ 
  "took" : 95, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 483, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "avg_salary" : { 
      "value" : 44222.351966873706 
    } 
  } 
} 
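The effect of the query scope can be sketched in plain Python. This is only a rough model: the tokens helper is a hypothetical stand-in for how the default analyzer might split a text field, and the three documents are made up rather than taken from the employees dataset.

```python
import re


def tokens(text):
    """Very rough stand-in for text analysis: lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(doc, field, term):
    """Approximate a single-term match query on one field."""
    return term.lower() in tokens(doc[field])


# Hypothetical documents, for illustration only.
docs = [
    {"gender": "female", "state": "Montana, 299", "salary": 20000},
    {"gender": "male", "state": "Mississippi, 101", "salary": 30000},
    {"gender": "male", "state": "Texas, 7", "salary": 90000},
]

# bool/should with no other clauses: keep documents matching at least one clause.
selected = [
    d for d in docs
    if matches(d, "state", "Mississippi") or matches(d, "gender", "female")
]

# The avg aggregation then only sees the selected documents.
avg_salary = sum(d["salary"] for d in selected) / len(selected)
```

Here the third document matches neither clause, so it contributes neither to the hit count nor to the average.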

Stats aggregation: a single request returns several metrics at once. The query below uses the stats aggregation to get count, min, max, avg and sum for the salary field:

GET /employees/_search 
{ 
    "size" : 0, 
    "aggs" : { 
        "salary_stats" : { 
             "stats" : { 
                 "field" : "salary" 
             } 
         } 
    } 
} 

Response:

{ 
  "took" : 47, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "salary_stats" : { 
      "count" : 1000, 
      "min" : 10026.0, 
      "max" : 79968.0, 
      "avg" : 44966.073, 
      "sum" : 4.4966073E7 
    } 
  } 
} 
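As a sanity check on what these five numbers mean, the same statistics can be computed by hand. The sketch below is plain Python over a made-up list of salaries; it does not talk to Elasticsearch, and the stats function is a hypothetical helper written for this illustration.

```python
from typing import Dict, List


def stats(values: List[float]) -> Dict[str, float]:
    """Mimic the fields of a stats aggregation for a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),  # avg is simply sum / count
        "sum": total,
    }


# Hypothetical salaries, not taken from the employees dataset.
sample_salaries = [20144.0, 35000.0, 50000.0]
result = stats(sample_salaries)
```

The same relation holds in the response above: the sum of 4.4966073E7 divided by the count of 1000 gives the avg of 44966.073.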

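Number of unique values of a field (its cardinality, that is, the count of distinct values): the query below uses the cardinality aggregation to count the distinct values of age.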

GET /employees/_search 
{ 
   "size" : 0, 
    "aggs" : { 
        "age_count" : { 
             "cardinality" : { 
                 "field" : "age" 
             } 
         } 
    } 
} 

Response:

{ 
  "took" : 84, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "age_count" : { 
      "value" : 58 
    } 
  } 
} 
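Conceptually, cardinality is a distinct count. The sketch below shows the exact version in plain Python on made-up ages. Note that Elasticsearch's cardinality aggregation is approximate (it is based on the HyperLogLog++ algorithm), although for a field with as few distinct values as age the result is expected to be accurate.

```python
def exact_cardinality(values):
    """Exact distinct count; Elasticsearch approximates this for large value sets."""
    return len(set(values))


# Hypothetical ages, not taken from the employees dataset.
sample_ages = [19, 25, 19, 40, 25, 61]
distinct_ages = exact_cardinality(sample_ages)
```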

Note: fielddata is disabled by default on text fields. Running a cardinality aggregation on the gender field while it is mapped as text makes Elasticsearch throw this exception: "Fielddata is disabled on text fields by default. Set fielddata=true on [gender] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."

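What is fielddata?
Fielddata is an in-memory data structure that Elasticsearch builds for a text field by uninverting its inverted index. With fielddata enabled, a text field can be used for aggregations, sorting and similar operations. Because fielddata can consume a large amount of heap memory, it is disabled by default.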

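So how do you enable fielddata on a text field?
In many cases you do not need to. When string values are indexed without an explicit mapping, Elasticsearch (since 5.x) maps the field as text and also creates a sub-field named <field>.keyword of type keyword, which can be aggregated on directly. Alternatively, you can update the mapping by hand and set fielddata to true. The better practice is to define the field types before importing the data: a field such as gender should simply be mapped as keyword.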

Mapping update request:

PUT /employees/_mapping 
{ 
  "properties": { 
    "gender": { 
      "type":     "text", 
      "fielddata": true 
    } 
  } 
} 

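The cardinality aggregation now works on gender directly (the request keeps the aggregation name age_count from the previous example, although it now counts gender values):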

GET /employees/_search 
{ 
   "size" : 0, 
    "aggs" : { 
        "age_count" : { 
             "cardinality" : { 
                 "field" : "gender" 
             } 
         } 
    } 
} 

If you would rather not change the mapping, use gender.keyword in place of gender.
Response:

{ 
  "took" : 73, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "age_count" : { 
      "value" : 2 
    } 
  } 
} 

As expected, the cardinality of gender is 2.
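The advice above to map fields such as gender as keyword before importing could look roughly like the following. This is a minimal sketch that only builds the request body in Python and prints it as JSON. It assumes Elasticsearch 7.x typeless mappings and that the body would be sent with PUT /employees before the bulk import. The variable name is hypothetical, and fields not listed here would still be mapped dynamically.

```python
import json

# Assumed request: PUT /employees (run before the bulk import).
create_index_body = {
    "mappings": {
        "properties": {
            # keyword fields are not analyzed and can be aggregated without fielddata
            "gender": {"type": "keyword"},
        }
    }
}

# Print the JSON body that would be pasted into the console request.
print(json.dumps(create_index_body, indent=2))
```

With such a mapping in place, aggregating on gender needs neither fielddata nor the .keyword sub-field.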

2.2. Bucket aggregations

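In practice you often need to group documents by some criterion and analyze each group. Elasticsearch calls this a bucket aggregation. This article treats a bucket simply as a group of documents, which the author finds easier to follow than the literal term.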

Grouping by field value: the example below runs a terms aggregation, which splits all documents into two groups by gender:

GET /employees/_search 
{ 
   "size" : 0, 
   "aggs" : { 
        "gender_bucket" : { 
             "terms" : { 
                 "field" : "gender" 
             } 
         } 
    } 
} 

Response:

{ 
  "took" : 55, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "gender_bucket" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "male", 
          "doc_count" : 524 
        }, 
        { 
          "key" : "female", 
          "doc_count" : 476 
        } 
      ] 
    } 
  } 
} 

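Range aggregation: groups documents by ranges of a field's values. Each range includes its from value and excludes its to value. With keyed: true, the buckets are returned as an object in which each range has its own string key, rather than as an array: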

GET /employees/_search 
{ 
   "size" : 0, 
   "aggs" : { 
       "age_ranges" : { 
           "range" : { 
               "field" : "age", 
               "keyed" : true, 
               "ranges" : [ 
                   { "to" : 30 }, 
                   { "from" : 30, "to" : 40 }, 
                   { "from" : 40, "to" : 55 }, 
                   { "from" : 55 } 
                ] 
            } 
        } 
     } 
} 

Response:

{ 
  "took" : 1, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "age_ranges" : { 
      "buckets" : { 
        "*-30.0" : { 
          "to" : 30.0, 
          "doc_count" : 216 
        }, 
        "30.0-40.0" : { 
          "from" : 30.0, 
          "to" : 40.0, 
          "doc_count" : 174 
        }, 
        "40.0-55.0" : { 
          "from" : 40.0, 
          "to" : 55.0, 
          "doc_count" : 248 
        }, 
        "55.0-*" : { 
          "from" : 55.0, 
          "doc_count" : 362 
        } 
      } 
    } 
  } 
} 
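The bucketing rule (from is inclusive, to is exclusive) and the default keys seen in this response can be imitated in a few lines of Python. This is a sketch on made-up ages. The range_key and range_buckets helpers are hypothetical names, and the key format simply copies what the response above shows.

```python
from typing import Dict, List, Optional, Tuple

# (from, to); None means the range is unbounded on that side.
Range = Tuple[Optional[float], Optional[float]]


def range_key(lo: Optional[float], hi: Optional[float]) -> str:
    """Build a key such as '*-30.0' or '30.0-40.0', as seen in the response."""
    left = "*" if lo is None else str(float(lo))
    right = "*" if hi is None else str(float(hi))
    return f"{left}-{right}"


def range_buckets(values: List[float], ranges: List[Range]) -> Dict[str, int]:
    """Count values per range; 'from' is inclusive and 'to' is exclusive."""
    counts = {range_key(lo, hi): 0 for lo, hi in ranges}
    for v in values:
        for lo, hi in ranges:
            # No early exit: a value is counted in every range that contains it.
            if (lo is None or v >= lo) and (hi is None or v < hi):
                counts[range_key(lo, hi)] += 1
    return counts


age_ranges = [(None, 30), (30, 40), (40, 55), (55, None)]
sample_ages = [19, 30, 39, 40, 55, 70]  # hypothetical ages
bucket_counts = range_buckets(sample_ages, age_ranges)
```

An age of exactly 30 lands in the 30.0-40.0 bucket, not in *-30.0, because the upper bound is exclusive.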

You can also set the key name of each range explicitly:

GET /employees/_search 
{ 
   "size" : 0, 
   "aggs" : { 
       "age_ranges" : { 
           "range" : { 
               "field" : "age", 
               "keyed" : true, 
               "ranges" : [ 
                   { "key": "young", "to" : 35 }, 
                   { "key": "quarter-aged", "from" : 35, "to" : 45 }, 
                   { "key": "middle-aged", "from" : 45, "to" : 65 }, 
                   { "key": "senior", "from" : 55 } 
                ] 
            } 
        } 
     } 
} 

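The response has the same structure, except that the buckets carry the key names given in the request, and the counts differ because the boundaries differ from the previous example. Note that the middle-aged (45 to 65) and senior (from 55) ranges above overlap, so employees aged 55 to 64 are counted in both buckets. The range aggregation allows overlapping ranges.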

2.3. Nested aggregations (sub-aggregations)

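A metric aggregation can be embedded in a bucket aggregation, for example to get the average age per gender. The example below first groups the documents and then computes the average age of each group. The outer aggs groups by gender; the inner aggs computes the average age.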

GET /employees/_search 
{ 
   "size" : 0, 
   "aggs" : { 
        "gender_bucket" : { 
             "terms" : { 
                 "field" : "gender" 
             }, 
             "aggs": { 
                 "average_age": { 
                      "avg": { 
                          "field": "age" 
                      } 
                 } 
              } 
         } 
    } 
} 

Response:

{ 
  "took" : 69, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "gender_bucket" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "male", 
          "doc_count" : 524, 
          "average_age" : { 
            "value" : 47.333969465648856 
          } 
        }, 
        { 
          "key" : "female", 
          "doc_count" : 476, 
          "average_age" : { 
            "value" : 45.71848739495798 
          } 
        } 
      ] 
    } 
  } 
} 
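The two-level structure (group first, then compute a metric per group) can be sketched in plain Python. The terms_with_avg function and the five documents below are hypothetical. The sort mirrors the default terms ordering by descending doc_count.

```python
from collections import defaultdict


def terms_with_avg(docs, group_field, metric_field):
    """Group docs by group_field, then average metric_field inside each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "average": sum(vals) / len(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first, like the default order of a terms aggregation.
    buckets.sort(key=lambda b: b["doc_count"], reverse=True)
    return buckets


# Hypothetical documents, for illustration only.
docs = [
    {"gender": "female", "age": 25},
    {"gender": "male", "age": 20},
    {"gender": "male", "age": 30},
    {"gender": "female", "age": 35},
    {"gender": "male", "age": 40},
]
buckets = terms_with_avg(docs, "gender", "age")
```

Each entry of buckets corresponds to one element of the buckets array in the response, with the nested metric attached to its bucket.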

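Average age per age range within each gender: the outermost aggregation groups by gender, the second level groups by age range, and the innermost level computes the average age of each group. The ranges in this request leave ages 50 to 54 uncovered, which is why the range buckets in the response do not add up to each gender's doc_count.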

GET /employees/_search 
{ 
   "size" : 0, 
   "aggs" : { 
        "gender_bucket" : { 
             "terms" : { 
                 "field" : "gender" 
             }, 
             "aggs" : { 
                 "age_ranges" : { 
                     "range" : { 
                         "field" : "age", 
                         "keyed" : true, 
                         "ranges" : [ 
                             { "key": "young", "to" : 35 }, 
                             { "key": "middle-aged", "from" : 35, "to" : 50 }, 
                             { "key": "senior", "from" : 55 } 
                          ] 
                      }, 
                      "aggs": { 
                          "average_age": { 
                               "avg": { 
                                   "field": "age" 
                               } 
                          } 
                       } 
                  } 
               } 
         } 
    } 
} 

Response:

{ 
  "took" : 37, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "gender_bucket" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "male", 
          "doc_count" : 524, 
          "age_ranges" : { 
            "buckets" : { 
              "young" : { 
                "to" : 35.0, 
                "doc_count" : 145, 
                "average_age" : { 
                  "value" : 26.048275862068966 
                } 
              }, 
              "middle-aged" : { 
                "from" : 35.0, 
                "to" : 50.0, 
                "doc_count" : 135, 
                "average_age" : { 
                  "value" : 42.19259259259259 
                } 
              }, 
              "senior" : { 
                "from" : 55.0, 
                "doc_count" : 192, 
                "average_age" : { 
                  "value" : 65.734375 
                } 
              } 
            } 
          } 
        }, 
        { 
          "key" : "female", 
          "doc_count" : 476, 
          "age_ranges" : { 
            "buckets" : { 
              "young" : { 
                "to" : 35.0, 
                "doc_count" : 157, 
                "average_age" : { 
                  "value" : 25.840764331210192 
                } 
              }, 
              "middle-aged" : { 
                "from" : 35.0, 
                "to" : 50.0, 
                "doc_count" : 112, 
                "average_age" : { 
                  "value" : 41.517857142857146 
                } 
              }, 
              "senior" : { 
                "from" : 55.0, 
                "doc_count" : 170, 
                "average_age" : { 
                  "value" : 65.5 
                } 
              } 
            } 
          } 
        } 
      ] 
    } 
  } 
} 

2.4. Filter aggregation

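Compute the average salary of employees from Minnesota. This is also a nested aggregation: the filter is applied first, and the avg aggregation then computes the mean salary of the documents that pass it. (The inner aggregation is named avg_age, but it averages the salary field.)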

GET /employees/_search 
{ 
    "size" : 0, 
    "aggs" : { 
        "state" : { 
            "filter" : { "term": { "state": "minnesota" } }, 
            "aggs" : { 
                "avg_age" : { "avg" : { "field" : "salary" } } 
            } 
        } 
    } 
} 

Response:

{ 
  "took" : 7, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 1000, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "state" : { 
      "doc_count" : 17, 
      "avg_age" : { 
        "value" : 43855.35294117647 
      } 
    } 
  } 
} 
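Note that in this response hits.total is still 1000 while the bucket's doc_count is 17: a filter aggregation narrows only the documents seen by its own sub-aggregations, not the search hits. A minimal Python sketch of that behaviour, using a hypothetical helper and made-up documents:

```python
def filter_then_avg(docs, predicate, field):
    """Return (doc_count, average) for the docs that satisfy predicate."""
    bucket = [d for d in docs if predicate(d)]
    # An empty bucket has no average; None stands in for a null value here.
    average = sum(d[field] for d in bucket) / len(bucket) if bucket else None
    return len(bucket), average


# Hypothetical documents, for illustration only.
docs = [
    {"state": "minnesota", "salary": 40000},
    {"state": "minnesota", "salary": 50000},
    {"state": "montana", "salary": 90000},
]

hits_total = len(docs)  # the search itself still matches every document
doc_count, avg_salary = filter_then_avg(
    docs, lambda d: d["state"] == "minnesota", "salary"
)
```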

3. Summary

This article worked through examples on sample employee data, covering metric aggregations, bucket aggregations and nested aggregations.


Reference: https://blog.csdn.net/neweastsun/article/details/104324747