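Elasticsearch Aggregation Analysis in Practice (1)
This article walks through Elasticsearch aggregations using practical examples.
1. Introduction to Aggregations
Aggregations fall into two main categories: metrics aggregations and bucket aggregations. Other types are not covered in this article.
A metrics aggregation computes values (such as an average) over a set of documents; a bucket aggregation groups documents according to a grouping criterion.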
1.1. Sample Data
Define the sports index. The name and sport fields are mapped as keyword so that they can be aggregated as exact terms.
PUT sports
{
"mappings": {
"properties": {
"birthdate": {
"type": "date",
"format": "dateOptionalTime"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "keyword"
},
"rating": {
"type": "integer"
},
"sport": {
"type": "keyword"
}
}
}
}
Bulk-insert the data:
POST /sports/_bulk
{"index":{}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"], "location":"46.22,-68.45"}
{"index":{}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"], "location":"45.21,-68.35"}
{"index":{}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"], "location":"45.16,-63.58" }
{"index":{}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"], "location":"45.22,-68.53"}
{"index":{}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"], "location":"46.22,-68.85"}
{"index":{}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"], "location":"45.12,-68.35"}
{"index":{}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"], "location":"46.22,-68.45"}
{"index":{}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"], "location":"45.21,-68.35"}
{"index":{}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"], "location":"45.16,-63.58" }
{"index":{}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"], "location":"45.22,-68.53"}
{"index":{}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"], "location":"46.22,-68.85"}
{"index":{}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"], "location":"45.12,-68.35"}
{"index":{}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
1.2. Syntax
The general structure of an aggregation request looks like this:
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
},
["aggregations" : { [<sub_aggregation>]* } ]
}
[,"<aggregation_name_2>" : { ... } ]*
}
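The aggregations keyword can also be written as "aggs". Each aggregation has three main parts: a name, a type, and a body.
<aggregation_name> is a user-defined name that uniquely identifies the aggregation in the response.
<aggregation_type> is usually the first key inside the aggregation and determines its type, such as terms, stats, or geo_distance.
<aggregation_body> is defined inside <aggregation_type> and specifies the properties the aggregation needs; different aggregations take different properties.
Two further parts are optional. Sub-aggregations analyze the results of their parent aggregation. Additional aggregations (aggregation_name_2) can be supplied in the same request as independent top-level aggregations. There is no hard limit on nesting depth, but an aggregation cannot be nested under a metrics aggregation.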
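To make the name/type/body structure concrete, here is a minimal Python sketch that assembles a request body as a plain dictionary. The helper `aggregation` and the aggregation names `by_sport` and `sport_count` are hypothetical names chosen for illustration; only the JSON shape follows the syntax described above.

```python
import json


def aggregation(agg_type, body, sub_aggs=None):
    """Return one aggregation definition: {<type>: <body>, "aggs": {...}}.

    Hypothetical helper for illustration; it only builds a dict.
    """
    definition = {agg_type: body}
    if sub_aggs:
        # Sub-aggregations sit next to the type key, not inside the body.
        definition["aggs"] = sub_aggs
    return definition


request_body = {
    "size": 0,  # return aggregation results only, no search hits
    "aggs": {
        # <aggregation_name>: <aggregation_type> plus <aggregation_body>
        "by_sport": aggregation(
            "terms",
            {"field": "sport"},
            sub_aggs={"rating_avg": aggregation("avg", {"field": "rating"})},
        ),
        # A second, independent top-level aggregation.
        "sport_count": aggregation("value_count", {"field": "sport"}),
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be pasted as the body of a POST /sports/_search request. Note that the metrics aggregation `sport_count` carries no "aggs" key, in line with the rule that nothing nests under a metrics aggregation.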
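1.3. Value Sources
Some aggregations operate on values taken from the aggregated documents. These values can come from a specific document field or be generated per document by a script. In the example below, the terms aggregation buckets on the name field, but its order is based on the value of the sub-aggregation rating_avg. In other words, a nested metrics sub-aggregation is used to sort the buckets of its parent bucket aggregation.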
POST /sports/_search
{
"size": 0,
"aggs": {
"the_name": {
"terms": {
"field": "name",
"order": {
"rating_avg": "desc"
}
},
"aggs": {
"rating_avg": {
"avg": {
"field": "rating"
}
}
}
}
}
}
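As a rough illustration of what this request computes, the following Python sketch groups a few of the sample documents by name, averages all rating values in each group, and sorts the groups by that average in descending order. It is an in-memory approximation, not Elasticsearch code; the subset of documents is chosen arbitrarily, and the ratings are written as integers here, whereas the bulk data sends them as strings that are coerced by the integer mapping.

```python
from collections import defaultdict
from statistics import mean

# A small subset of the sample documents (ratings written as integers).
docs = [
    {"name": "Michael", "rating": [5, 4]},
    {"name": "Bob", "rating": [3, 4]},
    {"name": "Jim", "rating": [3, 2]},
    {"name": "Bingo", "rating": [10, 7]},
    {"name": "Wayne", "rating": [10, 10]},
]

# Bucket step: collect every rating value under its name.
values_by_name = defaultdict(list)
for doc in docs:
    values_by_name[doc["name"]].extend(doc["rating"])

# Metric step plus ordering: average per bucket, highest average first.
ordered = sorted(
    ((name, mean(values)) for name, values in values_by_name.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, rating_avg in ordered:
    print(name, rating_avg)
```

Because rating is multi-valued, the average is taken over every value in the bucket, not over one value per document.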
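1.4. Multiple Top-Level Aggregations
The following request defines two top-level aggregations, the_name and type_cnt. The the_name aggregation also contains the sub-aggregation rating_avg.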
POST /sports/_search
{
"size": 0,
"aggs": {
"the_name": {
"terms": {
"field": "name",
"order": {
"rating_avg": "desc"
}
},
"aggs": {
"rating_avg": {
"avg": {
"field": "rating"
}
}
}
},
"type_cnt":{
"terms": {
"field": "sport"
}
}
}
}
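2. Metrics Aggregations
A metrics aggregation computes a metric over the whole set of documents. The result can be a single value (such as an average) or several values (such as stats). A simple metrics aggregation is value_count, which returns the total number of values for a given field. The following example returns the number of sport values.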
POST /sports/_search
{
"size": 0,
"aggs": {
"sport_count": {
"value_count": {
"field": "sport"
}
}
}
}
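Note that the result is the total number of values, not the number of distinct values. Each document here has exactly one sport value, so the count equals the number of indexed documents (22).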
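The difference between counting values and counting distinct values can be sketched in a few lines of Python. The list below reproduces the sport values of the sample data; the variable names are illustrative only. In Elasticsearch, the counterpart for distinct values is the cardinality aggregation, which returns an approximate count.

```python
# The sport value of each of the 22 sample documents.
sports = (
    ["Baseball"] * 16
    + ["Golf"] * 2
    + ["Basketball"]
    + ["Hockey"]
    + ["Football"] * 2
)

value_count = len(sports)          # what value_count reports: every value
distinct_count = len(set(sports))  # distinct values only

print(value_count, distinct_count)
```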
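A metrics aggregation cannot contain another metrics aggregation, and doing so would not be meaningful anyway. Nesting a metrics aggregation inside a bucket aggregation, however, is very useful. Later sections cover this, but first we need to look at bucket aggregations.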
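3. Bucket Aggregations
A bucket aggregation is a mechanism for grouping documents. Each bucket type has its own way of assigning documents to buckets; the simplest is the terms aggregation. The following example groups documents by the value of the sport field and counts each group, similar to grouping by that field and then counting in SQL.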
POST /sports/_search
{
"size": 0,
"aggregations": {
"sport": {
"terms": {
"field": "sport"
}
}
}
}
Response:
{
......
"aggregations" : {
"sport" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Baseball",
"doc_count" : 16
},
{
"key" : "Football",
"doc_count" : 2
},
{
"key" : "Golf",
"doc_count" : 2
},
{
"key" : "Basketball",
"doc_count" : 1
},
{
"key" : "Hockey",
"doc_count" : 1
}
]
}
}
}
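The bucket counts above can be reproduced with a short in-memory sketch. It assumes the default terms ordering of descending document count with ties broken by key in ascending order, which is consistent with the response shown; the variable names are illustrative.

```python
from collections import Counter

# The sport value of each of the 22 sample documents.
sports = (
    ["Baseball"] * 16
    + ["Golf"] * 2
    + ["Basketball"]
    + ["Hockey"]
    + ["Football"] * 2
)

# One bucket per distinct term, with the number of documents in it.
counts = Counter(sports)

# Sort by doc_count descending, then by key ascending for ties.
buckets = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for key, doc_count in buckets:
    print(key, doc_count)
```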
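The geo_distance aggregation is more interesting. It has many options, but the simplest case defines distance ranges around an origin point and counts how many documents fall inside each ring. The following example counts the documents within 20 miles of the point "46.12,-68.55":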
POST /sports/_search
{
"size": 0,
"aggregations": {
"baseball_player_ring": {
"geo_distance": {
"field": "location",
"origin": "46.12,-68.55",
"unit": "mi",
"ranges": [
{
"from": 0,
"to": 20
}
]
}
}
}
}
Response:
{
......
"aggregations" : {
"baseball_player_ring" : {
"buckets" : [
{
"key" : "*-20.0",
"from" : 0.0,
"to" : 20.0,
"doc_count" : 14
}
]
}
}
}
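To see where the count of 14 comes from, the sketch below recomputes the distances with the haversine formula. This is an approximation under stated assumptions: a spherical Earth with a mean radius of 3958.8 miles, and a range that includes `from` and excludes `to`. Elasticsearch's own distance calculation may differ slightly, but the sample points are far from the 20-mile boundary. The function name `haversine_mi` is hypothetical.

```python
import math

EARTH_RADIUS_MI = 3958.8  # assumed mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


# (lat, lon, number of sample documents at that point), from the bulk data.
locations = [
    (46.22, -68.45, 2),
    (45.21, -68.35, 2),
    (45.16, -63.58, 2),
    (45.22, -68.53, 2),
    (46.22, -68.85, 2),
    (45.12, -68.35, 2),
    (46.12, -68.55, 2),
    (46.25, -68.55, 8),
]

origin = (46.12, -68.55)

# Count documents whose distance d satisfies 0 <= d < 20 miles.
doc_count = sum(
    n
    for lat, lon, n in locations
    if 0 <= haversine_mi(origin[0], origin[1], lat, lon) < 20
)

print(doc_count)
```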
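4. Nested Aggregations
The most powerful feature of bucket aggregations is that they can be nested. Define a top-level bucket aggregation first, then define a second-level aggregation inside it that operates on each parent bucket; nesting can go as many levels deep as needed.
Continuing the example above, first find the documents within the distance range, then count how many were born before 1990 and how many in 1990 or later: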
POST /sports/_search
{
"size": 0,
"aggs": {
"baseball_player_ring": {
"geo_distance": {
"field": "location",
"origin": "46.12,-68.55",
"unit": "mi",
"ranges": [
{
"from": 0,
"to": 20
}
]
},
"aggs": {
"ring_age_ranges": {
"range": {
"field": "birthdate",
"ranges": [
{"key":"~90", "to": "1990-1-1"},
{"key":"90~", "from": "1990-1-1" }
]
}
}
}
}
}
}
Response:
"aggregations" : {
"baseball_player_ring" : {
"buckets" : [
{
"key" : "*-20.0",
"from" : 0.0,
"to" : 20.0,
"doc_count" : 14,
"ring_age_ranges" : {
"buckets" : [
{
"key" : "~90",
"doc_count" : 10
},
{
"key" : "90~",
"doc_count" : 4
}
]
}
}
]
}
}
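The second-level split can be sketched in Python as well. The list below contains the 14 sample documents that fall inside the 20-mile ring, with birth dates taken from the bulk data; the sketch assumes that a range bucket includes its `from` value and excludes its `to` value.

```python
from datetime import date

# The 14 documents inside the ring: (name, birthdate).
in_ring = [
    ("Michael", date(1989, 10, 1)),
    ("Mick", date(1989, 10, 1)),
    ("Tim", date(1992, 2, 28)),
    ("Duke", date(1992, 2, 28)),
    ("Jeff", date(1990, 4, 1)),
    ("Charge", date(1990, 4, 1)),
    ("Will", date(1988, 3, 1)),
    ("Barry", date(1988, 3, 1)),
    ("Bank", date(1988, 3, 1)),
    ("Bingo", date(1988, 3, 1)),
    ("James", date(1988, 3, 1)),
    ("Wayne", date(1988, 3, 1)),
    ("Brady", date(1988, 3, 1)),
    ("Lewis", date(1988, 3, 1)),
]

cutoff = date(1990, 1, 1)

# "~90": birthdate < cutoff; "90~": birthdate >= cutoff.
age_buckets = {"~90": 0, "90~": 0}
for _name, birthdate in in_ring:
    key = "~90" if birthdate < cutoff else "90~"
    age_buckets[key] += 1

print(age_buckets)
```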
Next, apply stats, a multi-value metrics aggregation, to the innermost buckets.
POST /sports/_search
{
"size": 0,
"aggs": {
"baseball_player_ring": {
"geo_distance": {
"field": "location",
"origin": "46.12,-68.55",
"unit": "mi",
"ranges": [
{
"from": 0,
"to": 20
}
]
},
"aggs": {
"ring_age_ranges": {
"range": {
"field": "birthdate",
"ranges": [
{"key":"~90", "to": "1990-1-1"},
{"key":"90~", "from": "1990-1-1" }
]
},
"aggs": {
"rating_stats": {
"stats": {
"field": "rating"
}
}
}
}
}
}
}
}
Response:
{
......
"aggregations" : {
"baseball_player_ring" : {
"buckets" : [
{
"key" : "*-20.0",
"from" : 0.0,
"to" : 20.0,
"doc_count" : 14,
"ring_age_ranges" : {
"buckets" : [
{
"key" : "~90",
"doc_count" : 10,
"rating_stats" : {
"count" : 20,
"min" : 2.0,
"max" : 10.0,
"avg" : 6.8,
"sum" : 136.0
}
},
{
"key" : "90~",
"doc_count" : 4,
"rating_stats" : {
"count" : 8,
"min" : 2.0,
"max" : 5.0,
"avg" : 2.875,
"sum" : 23.0
}
}
]
}
}
]
}
}
}
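The stats figures in this response can be checked by hand. The sketch below lists the rating values of the documents in each birth-date bucket, taken from the bulk data and written as integers, and computes the same five numbers; the `stats` function is a hypothetical stand-in for the aggregation, not Elasticsearch code.

```python
# Rating values per birth-date bucket for the 14 documents inside the ring.
ratings = {
    "~90": [5, 4, 3, 4, 4, 4, 5, 2, 6, 4, 10, 7, 10, 8, 10, 10, 10, 10, 10, 10],
    "90~": [3, 3, 5, 2, 2, 3, 3, 2],
}


def stats(values):
    """Compute count, min, max, avg and sum over a list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


for key, values in ratings.items():
    print(key, stats(values))
```

The count is 20 for 10 documents and 8 for 4 documents because each document carries two rating values, and stats counts values rather than documents.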
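As this shows, buckets can be nested inside buckets to build complex analyses.
5. Summary
This article introduced Elasticsearch aggregations: the request syntax and its parts, followed by examples of metrics aggregations, bucket aggregations, and nested aggregations.
Reference link for this article: https://blog.csdn.net/neweastsun/article/details/104298675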