Elasticsearch Aggregation Analysis in Practice (1)

July 19, 2022 · 128 · me-sa

This article explores Elasticsearch aggregations through practical examples.

1. Introduction to Aggregations

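Aggregations fall into two main categories: metrics aggregations and bucket aggregations. Other types are not covered in this article.
A metrics aggregation computes values (such as an average) over a set of documents; a bucket aggregation groups documents according to a grouping criterion.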
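
As a rough illustration in plain Python (not Elasticsearch code), the difference can be sketched as follows. The three documents are taken from the sample data used later in this article, with ratings written as integers; the helper names avg_rating and group_by_sport are made up for this sketch.

```python
from collections import defaultdict

# Three documents from the sample data set, reduced to two fields.
docs = [
    {"sport": "Baseball", "rating": [5, 4]},
    {"sport": "Baseball", "rating": [3, 4]},
    {"sport": "Golf", "rating": [6, 4]},
]

def avg_rating(documents):
    """Metrics-style: reduce all rating values to a single number."""
    values = [v for d in documents for v in d["rating"]]
    return sum(values) / len(values)

def group_by_sport(documents):
    """Bucket-style: split the documents into one group per sport."""
    buckets = defaultdict(list)
    for d in documents:
        buckets[d["sport"]].append(d)
    return dict(buckets)

overall_avg = avg_rating(docs)   # one value for the whole set
buckets = group_by_sport(docs)   # one group of documents per sport
```

Applying avg_rating to a single bucket, for example avg_rating(buckets["Golf"]), corresponds to nesting a metrics aggregation inside a bucket aggregation.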

1.1. Sample Data

Define the sports index. The name and sport fields are mapped as keyword so that they can be used for term-based analysis.

PUT sports 
{ 
   "mappings": { 
     "properties": { 
        "birthdate": { 
           "type": "date", 
           "format": "date_optional_time" 
        }, 
        "location": { 
           "type": "geo_point" 
        }, 
        "name": { 
           "type": "keyword" 
        }, 
        "rating": { 
           "type": "integer" 
        }, 
        "sport": { 
           "type": "keyword" 
        } 
     } 
  } 
} 

Bulk-insert the data:

POST /sports/_bulk 
{"index":{}} 
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"} 
{"index":{}} 
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"],  "location":"45.21,-68.35"} 
{"index":{}} 
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"],  "location":"45.16,-63.58" } 
{"index":{}} 
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"} 
{"index":{}} 
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"],  "location":"46.22,-68.85"} 
{"index":{}} 
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.12,-68.35"} 
{"index":{}} 
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"} 
{"index":{}} 
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"],  "location":"46.22,-68.45"} 
{"index":{}} 
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"],  "location":"45.21,-68.35"} 
{"index":{}} 
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.16,-63.58" } 
{"index":{}} 
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"} 
{"index":{}} 
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"],  "location":"46.22,-68.85"} 
{"index":{}} 
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"],  "location":"45.12,-68.35"} 
{"index":{}} 
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"} 
{"index":{}} 
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" } 
{"index":{}} 
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" } 
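
When the documents come from application code rather than the console, the same bulk body can be generated programmatically. The following is a minimal sketch using only the Python standard library; build_bulk_body is a hypothetical helper name, and only two of the documents above are included.

```python
import json

def build_bulk_body(documents):
    """Return an NDJSON body: one action line plus one source line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {}}))  # action line, ID auto-generated
        lines.append(json.dumps(doc))            # document source line
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

sample = [
    {"name": "Michael", "birthdate": "1989-10-1", "sport": "Baseball",
     "rating": ["5", "4"], "location": "46.22,-68.45"},
    {"name": "Bob", "birthdate": "1989-11-2", "sport": "Baseball",
     "rating": ["3", "4"], "location": "45.21,-68.35"},
]
bulk_body = build_bulk_body(sample)
```

The resulting string would be sent as the body of POST /sports/_bulk with the Content-Type header set to application/x-ndjson.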
 

1.2. Syntax

The general structure of an aggregation request is as follows.

"aggregations" : { 
    "<aggregation_name>" : { 
        "<aggregation_type>" : {  
            <aggregation_body> 
        }, 
        ["aggregations" : { [<sub_aggregation>]* } ] 
    } 
    [,"<aggregation_name_2>" : { ... } ]* 
} 

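The aggregations keyword can also be written as "aggs". An aggregation consists of three parts: a name, a type, and a body. <aggregation_name> is a user-defined name that uniquely identifies the aggregation in the response.

<aggregation_type> is typically the first key inside the aggregation and determines its type, for example terms, stats, or geo_distance.

<aggregation_body> is defined inside <aggregation_type> and specifies the properties the aggregation needs; different aggregation types take different properties.

Two further parts are optional. Sub-aggregations can be supplied to analyze the results of the parent aggregation, and a query can contain several aggregations (aggregation_name_2) as independent top-level aggregations. There is no limit on the nesting depth, but aggregations cannot be nested under a metrics aggregation.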
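
The structure above can be assembled in code as nested dictionaries. The sketch below is one possible way to do it in Python; the helper agg and the aggregation names by_sport, rating_avg, and rating_stats are chosen for illustration only and do not come from any client library.

```python
def agg(agg_type, body, sub_aggs=None):
    """Build the value stored under one <aggregation_name> key."""
    node = {agg_type: body}          # <aggregation_type>: <aggregation_body>
    if sub_aggs:
        node["aggs"] = sub_aggs      # optional sub-aggregations
    return node

request = {
    "size": 0,
    "aggs": {
        # A bucket aggregation with a nested metrics aggregation.
        "by_sport": agg(
            "terms", {"field": "sport"},
            sub_aggs={"rating_avg": agg("avg", {"field": "rating"})},
        ),
        # A second, independent top-level aggregation.
        "rating_stats": agg("stats", {"field": "rating"}),
    },
}
```

Note that the metrics aggregations (avg and stats) are leaves here, which matches the rule that nothing can be nested under a metrics aggregation.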

1.3. Value Sources

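Some aggregations operate on values taken from the aggregated documents. These values can come from a specific document field or be generated by a script that runs against each document. In the example below, the terms aggregation is based on the name field, but its order is determined by the value of the sub-aggregation rating_avg. In other words, a nested metrics sub-aggregation is used to sort the buckets of the parent bucket aggregation.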

POST /sports/_search 
{ 
  "size": 0, 
  "aggs": { 
    "the_name": { 
       "terms": { 
          "field": "name", 
          "order": { 
             "rating_avg": "desc" 
          } 
       }, 
       "aggs": { 
          "rating_avg": { 
             "avg": { 
                "field": "rating" 
             } 
          } 
       } 
    } 
  } 
} 
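
To make the ordering concrete, here is a small Python simulation of what this request computes. It is only a sketch: it uses four of the sample documents, writes the ratings as integers (Elasticsearch coerces the quoted values for the integer field), and the function name terms_ordered_by_avg is made up.

```python
docs = [
    {"name": "Michael", "rating": [5, 4]},
    {"name": "Bob", "rating": [3, 4]},
    {"name": "Bingo", "rating": [10, 7]},
    {"name": "Wayne", "rating": [10, 10]},
]

def terms_ordered_by_avg(documents, field, metric_field):
    """Group by `field`, average `metric_field` per group, sort by that average."""
    grouped = {}
    for d in documents:
        grouped.setdefault(d[field], []).extend(d[metric_field])
    buckets = [
        {"key": key, "rating_avg": sum(values) / len(values)}
        for key, values in grouped.items()
    ]
    # Equivalent of "order": {"rating_avg": "desc"}
    buckets.sort(key=lambda b: b["rating_avg"], reverse=True)
    return buckets

ordered = terms_ordered_by_avg(docs, "name", "rating")
```

For these four documents the bucket keys come out as Wayne, Bingo, Michael, Bob.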

1.4. Multiple Top-Level Aggregations

Here two top-level aggregations are defined in the same request, the_name and type_cnt; the_name also contains the sub-aggregation rating_avg.

POST /sports/_search 
{ 
  "size": 0, 
  "aggs": { 
    "the_name": { 
       "terms": { 
          "field": "name", 
          "order": { 
             "rating_avg": "desc" 
          } 
       }, 
       "aggs": { 
          "rating_avg": { 
             "avg": { 
                "field": "rating" 
             } 
          } 
       } 
    }, 
    "type_cnt":{ 
      "terms": { 
        "field": "sport" 
      } 
    } 
  } 
} 
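
Outside the console, a request like this is an ordinary HTTP POST with a JSON body. The sketch below prepares it with Python's standard library without sending anything. The base URL http://localhost:9200 and the helper name build_search_request are assumptions; a real cluster may also require authentication or TLS settings.

```python
import json
import urllib.request

def build_search_request(body, base_url="http://localhost:9200", index="sports"):
    """Prepare (but do not send) a POST /<index>/_search request."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

body = {
    "size": 0,
    "aggs": {
        "the_name": {
            "terms": {"field": "name", "order": {"rating_avg": "desc"}},
            "aggs": {"rating_avg": {"avg": {"field": "rating"}}},
        },
        "type_cnt": {"terms": {"field": "sport"}},
    },
}
req = build_search_request(body)

# To run it against a live cluster (assumed to be reachable at base_url):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(list(result["aggregations"]))
```

Each top-level aggregation name appears as its own key under "aggregations" in the response.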

2. Metrics Aggregations

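Metrics aggregations compute metrics over the whole set of documents. The result can be a single value (such as an average) or several values (such as stats). A simple metrics aggregation is value_count, which returns the total number of values for a given field. The following example returns the number of sport values.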

POST /sports/_search 
{ 
   "size": 0, 
   "aggs": { 
      "sport_count": { 
         "value_count": { 
            "field": "sport" 
         } 
      } 
   } 
} 

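Note that value_count counts every value, not the number of distinct values. Because each document here has exactly one sport value, the result equals the number of indexed documents (22).
A metrics aggregation cannot contain another metrics aggregation, and doing so would not be meaningful anyway. Embedding metrics aggregations inside bucket aggregations, however, is very useful. Later sections cover this, but first we need to look at bucket aggregations.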
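
The difference between counting values and counting distinct values is easy to check in plain Python. This sketch rebuilds the sport values of the 22 sample documents; the variable names are illustrative.

```python
# The sport value of each of the 22 sample documents.
sports = (
    ["Baseball"] * 16
    + ["Golf"] * 2
    + ["Basketball"]
    + ["Hockey"]
    + ["Football"] * 2
)

value_count = len(sports)          # what value_count reports: every value
distinct_count = len(set(sports))  # number of different sports
```

Here value_count is 22 and distinct_count is 5. In Elasticsearch, the distinct count would come from the cardinality aggregation, which returns an approximate result.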

3. Bucket Aggregations

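A bucket aggregation is a mechanism for grouping documents. Each bucket type has its own way of classifying documents; the simplest is the terms aggregation. The following example groups documents by the value of the sport field and counts each group, similar to a GROUP BY on that field followed by a count in SQL.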

POST /sports/_search 
{ 
   "size": 0, 
   "aggregations": { 
      "sport": { 
         "terms": { 
            "field": "sport" 
         } 
      } 
   } 
} 

Response:

{ 
  ...... 
  "aggregations" : { 
    "sport" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "Baseball", 
          "doc_count" : 16 
        }, 
        { 
          "key" : "Football", 
          "doc_count" : 2 
        }, 
        { 
          "key" : "Golf", 
          "doc_count" : 2 
        }, 
        { 
          "key" : "Basketball", 
          "doc_count" : 1 
        }, 
        { 
          "key" : "Hockey", 
          "doc_count" : 1 
        } 
      ] 
    } 
  } 
} 
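
The same grouping can be reproduced locally with collections.Counter. This is a sketch of the semantics, not of how Elasticsearch implements it; terms_buckets is a hypothetical helper, and the tie-breaking by key is chosen to match the order seen in the response above.

```python
from collections import Counter

sports = (
    ["Baseball"] * 16
    + ["Golf"] * 2
    + ["Basketball"]
    + ["Hockey"]
    + ["Football"] * 2
)

def terms_buckets(values, size=10):
    """Count each distinct value; sort by doc_count descending, then key ascending."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # `size` mirrors the terms aggregation's limit on returned buckets (10 by default).
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

buckets = terms_buckets(sports)
```

The resulting list has the same keys, counts, and order as the buckets array in the response.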

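The geo_distance aggregation is more interesting. It has many options, but in the simplest case it defines distance ranges around an origin point and counts how many documents fall within each ring. The following request counts the documents within 20 miles of the point "46.12,-68.55":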

POST /sports/_search 
{ 
   "size": 0, 
   "aggregations": { 
      "baseball_player_ring": { 
         "geo_distance": { 
            "field": "location", 
            "origin": "46.12,-68.55", 
            "unit": "mi", 
            "ranges": [ 
               { 
                  "from": 0, 
                  "to": 20 
               } 
            ] 
         } 
      } 
   } 
} 

Response:

{ 
  ...... 
 
  "aggregations" : { 
    "baseball_player_ring" : { 
      "buckets" : [ 
        { 
          "key" : "*-20.0", 
          "from" : 0.0, 
          "to" : 20.0, 
          "doc_count" : 14 
        } 
      ] 
    } 
  } 
} 
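
The count of 14 can be cross-checked with a great-circle distance calculation. The sketch below uses the haversine formula with an approximate mean Earth radius, so distances may differ slightly from what Elasticsearch computes; with this data no point lies near the 20-mile boundary, so the count is not sensitive to that. The locations dictionary lists each distinct location in the sample data with the number of documents at it.

```python
import math

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles (assumption)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# (lat, lon) -> number of sample documents at that location
locations = {
    (46.22, -68.45): 2,
    (45.21, -68.35): 2,
    (45.16, -63.58): 2,
    (45.22, -68.53): 2,
    (46.22, -68.85): 2,
    (45.12, -68.35): 2,
    (46.12, -68.55): 2,
    (46.25, -68.55): 8,
}

origin = (46.12, -68.55)
# Range semantics: "from" is inclusive, "to" is exclusive.
in_ring = sum(
    n for (lat, lon), n in locations.items()
    if 0 <= haversine_miles(origin[0], origin[1], lat, lon) < 20
)
```

With these inputs in_ring evaluates to 14, in agreement with the doc_count above.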

4. Nested Aggregations

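The most powerful feature of bucket aggregations is that they can be nested. You first define a top-level bucket aggregation, then define a second-level aggregation inside it that operates on each bucket of the parent. Nesting can go as many levels deep as needed.

Continuing the example above, first find the documents within the distance range, then count how many of them were born before 1990 and how many in 1990 or later: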

POST /sports/_search 
{ 
   "size": 0, 
   "aggs": { 
      "baseball_player_ring": { 
         "geo_distance": { 
            "field": "location", 
            "origin": "46.12,-68.55", 
            "unit": "mi", 
            "ranges": [ 
               { 
                  "from": 0, 
                  "to": 20 
               } 
            ] 
         }, 
         "aggs": { 
            "ring_age_ranges": { 
               "range": { 
                 "field": "birthdate",  
                  "ranges": [ 
                      {"key":"~90", "to": "1990-1-1"}, 
                      {"key":"90~", "from": "1990-1-1" } 
                  ] 
               } 
            } 
         } 
      } 
   } 
} 

Response:

  "aggregations" : { 
    "baseball_player_ring" : { 
      "buckets" : [ 
        { 
          "key" : "*-20.0", 
          "from" : 0.0, 
          "to" : 20.0, 
          "doc_count" : 14, 
          "ring_age_ranges" : { 
            "buckets" : [ 
              { 
                "key" : "~90", 
                "doc_count" : 10 
              }, 
              { 
                "key" : "90~", 
                "doc_count" : 4 
              } 
            ] 
          } 
        } 
      ] 
    } 
  } 
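
The split into 10 and 4 can be reproduced by bucketing the birthdates of the 14 documents that fall inside the ring. This is a plain-Python sketch; the list birthdates_in_ring was assembled by hand from the sample data, and the half-open comparison mirrors the range semantics ("from" inclusive, "to" exclusive).

```python
from datetime import datetime

def parse(date_text):
    """Parse dates such as "1989-10-1" (month and day may be unpadded)."""
    return datetime.strptime(date_text, "%Y-%m-%d")

# Birthdates of the 14 sample documents within 20 miles of the origin.
birthdates_in_ring = (
    ["1989-10-1"] * 2    # Michael, Mick
    + ["1992-2-28"] * 2  # Tim, Duke
    + ["1990-4-1"] * 2   # Jeff, Charge
    + ["1988-3-1"] * 8   # Will, Barry, Bank, Bingo, James, Wayne, Brady, Lewis
)

cutoff = parse("1990-1-1")
age_buckets = {"~90": 0, "90~": 0}
for text in birthdates_in_ring:
    # Dates before the cutoff go to "~90"; the cutoff itself and later go to "90~".
    key = "~90" if parse(text) < cutoff else "90~"
    age_buckets[key] += 1
```

The result is {"~90": 10, "90~": 4}, matching the doc_count values of the inner buckets.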
 

Next, we apply stats, a multi-value metrics aggregation, to the innermost buckets.

POST /sports/_search 
{ 
   "size": 0, 
   "aggs": { 
      "baseball_player_ring": { 
         "geo_distance": { 
            "field": "location", 
            "origin": "46.12,-68.55", 
            "unit": "mi", 
            "ranges": [ 
               { 
                  "from": 0, 
                  "to": 20 
               } 
            ] 
         }, 
         "aggs": { 
            "ring_age_ranges": { 
               "range": { 
                 "field": "birthdate",  
                  "ranges": [ 
                      {"key":"~90", "to": "1990-1-1"}, 
                      {"key":"90~", "from": "1990-1-1" } 
                  ] 
               }, 
               "aggs": { 
                  "rating_stats": { 
                     "stats": { 
                        "field": "rating" 
                     } 
                  } 
               } 
            } 
         } 
      } 
   } 
}

Response:

{ 
  ...... 
  "aggregations" : { 
    "baseball_player_ring" : { 
      "buckets" : [ 
        { 
          "key" : "*-20.0", 
          "from" : 0.0, 
          "to" : 20.0, 
          "doc_count" : 14, 
          "ring_age_ranges" : { 
            "buckets" : [ 
              { 
                "key" : "~90", 
                "doc_count" : 10, 
                "rating_stats" : { 
                  "count" : 20, 
                  "min" : 2.0, 
                  "max" : 10.0, 
                  "avg" : 6.8, 
                  "sum" : 136.0 
                } 
              }, 
              { 
                "key" : "90~", 
                "doc_count" : 4, 
                "rating_stats" : { 
                  "count" : 8, 
                  "min" : 2.0, 
                  "max" : 5.0, 
                  "avg" : 2.875, 
                  "sum" : 23.0 
                } 
              } 
            ] 
          } 
        } 
      ] 
    } 
  } 
} 
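
The rating_stats values can be verified by hand. Because rating is a multi-valued field, every value of every document in a bucket is counted, which is why count is twice doc_count. The sketch below lists the ratings of the documents in each bucket (collected manually from the sample data, written as integers) and recomputes the statistics; the function name stats is only illustrative.

```python
def stats(values):
    """Compute the same five figures the stats aggregation reports."""
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": sum(values) / len(values),
        "sum": float(sum(values)),
    }

# Ratings of in-ring documents born before 1990 (10 documents, 2 values each).
before_1990 = [5, 4, 3, 4, 4, 4, 5, 2, 6, 4, 10, 7, 10, 8, 10, 10, 10, 10, 10, 10]
# Ratings of in-ring documents born in 1990 or later (4 documents, 2 values each).
from_1990 = [3, 3, 5, 2, 2, 3, 3, 2]

stats_before = stats(before_1990)
stats_from = stats(from_1990)
```

Both results agree with the response: an average of 6.8 over 20 values for the first bucket and 2.875 over 8 values for the second.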

As we can see, buckets can be nested inside buckets to build complex analyses.

5. Summary

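This article introduced Elasticsearch aggregations. It covered the aggregation syntax and its parts, and used examples to demonstrate metrics aggregations, bucket aggregations, and nested aggregations.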


Reference: https://blog.csdn.net/neweastsun/article/details/104298675