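Elasticsearch Aggregation Analysis in Practice (2)
Building on the previous article, this post takes a closer look at metric aggregations and bucket aggregations. The sample data is available for download.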
1. Environment setup
The data set contains 1,000 employee records. You can bulk-insert them into Elasticsearch with the POST /employees/_bulk request.
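The bulk endpoint expects newline-delimited JSON: an action line, then the document source, with a newline after the last line. Below is a minimal sketch of building such a payload in Python. The helper name `build_bulk_body` and the second sample record are illustrative assumptions, not part of the original data set.

```python
import json

def build_bulk_body(docs):
    """Return an NDJSON payload for POST /employees/_bulk (hypothetical helper)."""
    lines = []
    for doc in docs:
        # Action line: the index name comes from the URL, so the metadata is empty.
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"name": "Randi Howell", "age": 19, "salary": 20144, "gender": "female"},
    {"name": "Example Person", "age": 42, "salary": 50000, "gender": "male"},  # made-up record
]
body = build_bulk_body(docs)
print(body)
```

The resulting string can be sent as the request body with the `Content-Type: application/x-ndjson` header.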
Run GET /_cat/indices?v to verify that the employees index was created successfully:
yellow open employees LDqYniJMRy2wVvB5O0oxEA 1 1 1000 0 458.9kb 458.9kb
A sample document looks like this:
{
"name" :"Randi Howell",
"age":19,
"salary":20144,
"gender":"female",
"email":"randihowell@comvex.com",
"phone":"+1 (870) 408-2828",
"street":"135 Montieth Street",
"city":"Diaperville",
"state":"Montana, 299"
}
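2. Employee data analysis examples
2.1. Metric aggregations
Calculate the average salary of all employees. The following example uses a metric aggregation to do so. aggs introduces the aggregation part of the request, avg_salary is the name assigned to the result, and avg is the metric aggregation type that computes the mean. field specifies which field the calculation runs on. Because an aggregation query does not need the individual documents, size is set to 0.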
GET /employees/_search
{
"size": 0,
"aggs": {
"avg_salary": {
"avg": {"field": "salary"}
}
}
}
Result:
{
"took" : 62,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"avg_salary" : {
"value" : 44966.073
}
}
}
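The aggregation result sits under `aggregations`, keyed by the name chosen in the request. As a small sketch (the response is pasted in as a trimmed JSON string rather than fetched from a cluster), the value could be read in Python like this:

```python
import json

# Trimmed copy of the response shown above.
raw = (
    '{"hits": {"total": {"value": 1000, "relation": "eq"}, "hits": []},'
    ' "aggregations": {"avg_salary": {"value": 44966.073}}}'
)
response = json.loads(raw)

# The key "avg_salary" is the name given to the aggregation in the request.
avg_salary = response["aggregations"]["avg_salary"]["value"]
matched = response["hits"]["total"]["value"]
print(f"{matched} employees, average salary {avg_salary:.2f}")
# prints: 1000 employees, average salary 44966.07
```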
Aggregation with a condition: compute the average salary of employees who are female or whose state is Mississippi:
GET /employees/_search
{
"size": 0,
"query" : {
"bool" : {
"should": [
{ "match": { "state": "Mississippi" } },
{ "match": { "gender": "female" } }
]
}
},
"aggs": {
"avg_salary": {
"avg": {"field": "salary"}
}
}
}
Response:
{
"took" : 95,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 483,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"avg_salary" : {
"value" : 44222.351966873706
}
}
}
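The query narrows down the documents that the aggregation sees, which is why hits.total drops to 483 here. The sketch below imitates that behavior in plain Python on four made-up documents (not the real data set): a bool query that contains only should clauses matches a document when at least one clause matches.

```python
# Made-up sample documents, not the real data set.
employees = [
    {"gender": "female", "state": "Montana", "salary": 30000},
    {"gender": "male", "state": "Mississippi", "salary": 50000},
    {"gender": "male", "state": "Texas", "salary": 70000},
    {"gender": "female", "state": "Mississippi", "salary": 40000},
]

def matches(doc):
    # bool/should without must or filter clauses: at least one clause has to match.
    return doc["state"] == "Mississippi" or doc["gender"] == "female"

# Only the matching documents are handed to the aggregation.
selected = [doc for doc in employees if matches(doc)]
avg_salary = sum(doc["salary"] for doc in selected) / len(selected)
print(len(selected), avg_salary)  # 3 40000.0
```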
Stats aggregation: compute several metrics in a single query. The following query runs the stats aggregation on the salary field and returns count, min, max, avg, and sum:
GET /employees/_search
{
"size" : 0,
"aggs" : {
"salary_stats" : {
"stats" : {
"field" : "salary"
}
}
}
}
Response:
{
"took" : 47,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"salary_stats" : {
"count" : 1000,
"min" : 10026.0,
"max" : 79968.0,
"avg" : 44966.073,
"sum" : 4.4966073E7
}
}
}
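To make the five figures concrete, here is a rough Python equivalent of what stats reports. `salary_stats` is an illustrative helper. The sample list contains only three salaries that appear in this article, not the full data set. The last lines check that the avg in the response equals sum divided by count.

```python
def salary_stats(values):
    """Compute the five figures the stats aggregation reports (illustrative helper)."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

sample = [10026, 20144, 79968]  # three salaries seen in the article, not the full data set
stats = salary_stats(sample)
print(stats)

# Consistency check on the response above: avg should equal sum / count.
response_avg = 4.4966073e7 / 1000
print(response_avg)
```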
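Number of unique values of a field (its cardinality, that is, the count of distinct values): the following query uses the cardinality aggregation to count the distinct values of age.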
GET /employees/_search
{
"size" : 0,
"aggs" : {
"age_count" : {
"cardinality" : {
"field" : "age"
}
}
}
}
Response:
{
"took" : 84,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_count" : {
"value" : 58
}
}
}
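Conceptually, cardinality counts distinct values. The sketch below does this exactly with a Python set on made-up documents. Elasticsearch itself computes an approximate count based on the HyperLogLog++ algorithm. For a low-cardinality field such as age the result is expected to be accurate, but that is an assumption about this data set rather than a guarantee.

```python
def exact_cardinality(docs, field):
    """Count the distinct values of a field exactly (illustrative helper)."""
    # Documents that lack the field are skipped.
    return len({doc[field] for doc in docs if field in doc})

# Made-up documents: three distinct ages and one document without an age.
sample = [{"age": 19}, {"age": 42}, {"age": 19}, {"age": 63}, {"name": "no age"}]
print(exact_cardinality(sample, "age"))  # 3
```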
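Note: fielddata is disabled on text fields by default. Running a cardinality query against the gender field makes Elasticsearch throw the exception "Fielddata is disabled on text fields by default. Set fielddata=true on [gender] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
What is fielddata?
Fielddata is an in-memory data structure that Elasticsearch builds by uninverting the inverted index of a text field, so that the field's terms can be looked up per document. With fielddata enabled, a text field can be used for aggregations, sorting, and similar operations. Because loading fielddata can consume a large amount of heap memory, it is disabled by default.
So how do you make a text field usable for aggregations?
When string data is indexed with the default dynamic mapping, Elasticsearch (5.x and later) maps it as a text field and also generates a sub-field named <field>.keyword of type keyword, which can be aggregated directly. Alternatively, you can update the mapping by hand and set fielddata to true. It is advisable to define the appropriate field types before importing data; a field such as gender is better mapped as keyword from the start.
Mapping update request: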
PUT /employees/_mapping
{
"properties": {
"gender": {
"type": "text",
"fielddata": true
}
}
}
Now the cardinality aggregation can be run on gender directly:
GET /employees/_search
{
"size" : 0,
"aggs" : {
"age_count" : {
"cardinality" : {
"field" : "gender"
}
}
}
}
If you prefer not to change the mapping, you can use gender.keyword in place of gender.
Response:
{
"took" : 73,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_count" : {
"value" : 2
}
}
}
As expected, the cardinality of gender is 2.
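For the keyword variant mentioned above, the request body only differs in the field name. The sketch below builds it in Python and prints it in Console form. It assumes the index was created through the default dynamic mapping, so that the gender.keyword sub-field exists. The aggregation name `gender_count` is an illustrative choice.

```python
import json

# Hypothetical aggregation name; the field assumes the default dynamic mapping,
# which adds a keyword sub-field next to each text field.
body = {
    "size": 0,
    "aggs": {
        "gender_count": {
            "cardinality": {"field": "gender.keyword"}
        }
    },
}
print("GET /employees/_search")
print(json.dumps(body, indent=2))
```

Because keyword fields use doc values stored on disk, this variant avoids loading fielddata into heap memory.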
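2.2. Bucket aggregations
In practice you often need to group documents by some criterion and analyze each group. Elasticsearch calls this a bucket aggregation. This article treats each bucket as a group of documents, which the author finds easier to follow than a literal reading of "bucket".
Grouping by field value: the following example runs a terms aggregation that splits all documents into two groups by gender (this works on gender because fielddata was enabled on it above; gender.keyword can be used instead):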
GET /employees/_search
{
"size" : 0,
"aggs" : {
"gender_bucket" : {
"terms" : {
"field" : "gender"
}
}
}
}
Response:
{
"took" : 55,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"gender_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 524
},
{
"key" : "female",
"doc_count" : 476
}
]
}
}
}
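A terms aggregation is essentially a group-by with a count per distinct value, ordered by descending doc_count by default. The following sketch reproduces that on made-up documents with `collections.Counter`. The helper name `terms_buckets` is illustrative, and its `size` default mirrors the default of 10 buckets.

```python
from collections import Counter

def terms_buckets(docs, field, size=10):
    """Group documents by a field value and count them (illustrative helper)."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # most_common sorts by count in descending order, like the default terms ordering.
    return [{"key": key, "doc_count": count} for key, count in counts.most_common(size)]

sample = [{"gender": "male"}, {"gender": "female"}, {"gender": "male"}]  # made-up documents
print(terms_buckets(sample, "gender"))
# [{'key': 'male', 'doc_count': 2}, {'key': 'female', 'doc_count': 1}]
```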
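Range aggregation: group documents by ranges of a field's value. Each range includes its from value and excludes its to value. With keyed set to true, the buckets are returned as an object in which every range has a unique string key, rather than as an array: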
GET /employees/_search
{
"size" : 0,
"aggs" : {
"age_ranges" : {
"range" : {
"field" : "age",
"keyed" : true,
"ranges" : [
{ "to" : 30 },
{ "from" : 30, "to" : 40 },
{ "from" : 40, "to" : 55 },
{ "from" : 55 }
]
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_ranges" : {
"buckets" : {
"*-30.0" : {
"to" : 30.0,
"doc_count" : 216
},
"30.0-40.0" : {
"from" : 30.0,
"to" : 40.0,
"doc_count" : 174
},
"40.0-55.0" : {
"from" : 40.0,
"to" : 55.0,
"doc_count" : 248
},
"55.0-*" : {
"from" : 55.0,
"doc_count" : 362
}
}
}
}
}
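The bucketing rule can be sketched in a few lines of Python: a value falls into a range when it is greater than or equal to from and strictly less than to. The helper `range_buckets` and the ages are made up for illustration. The default key format imitates the keys seen in the response above.

```python
def range_buckets(values, ranges):
    """Count values per range; from is inclusive, to is exclusive (illustrative helper)."""
    buckets = {}
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        # Imitate the default key format seen in the response, e.g. "*-30.0".
        default_key = f"{'*' if lo is None else float(lo)}-{'*' if hi is None else float(hi)}"
        key = r.get("key", default_key)
        count = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        buckets[key] = {"doc_count": count}
    return buckets

ages = [19, 29, 30, 39, 40, 54, 55, 70]  # made-up ages
ranges = [{"to": 30}, {"from": 30, "to": 40}, {"from": 40, "to": 55}, {"from": 55}]
result = range_buckets(ages, ranges)
print(result)
```

Note how 30 lands in "30.0-40.0" and 55 in "55.0-*": the boundary value always belongs to the range that starts with it.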
You can also name the key of each range explicitly:
GET /employees/_search
{
"size" : 0,
"aggs" : {
"age_ranges" : {
"range" : {
"field" : "age",
"keyed" : true,
"ranges" : [
{ "key": "young", "to" : 35 },
{ "key": "quarter-aged", "from" : 35, "to" : 45 },
{ "key": "middle-aged", "from" : 45, "to" : 55 },
{ "key": "senior", "from" : 55 }
]
}
}
}
}
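The response has the same structure, except that each bucket is keyed by the name given in the request.
2.3. Nested aggregations
A metric aggregation can be nested inside a bucket aggregation, for example to find the average age for each gender. The following example first groups the documents and then computes the average age of each group. The outer aggs groups by gender; the inner aggs computes the average age.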
GET /employees/_search
{
"size" : 0,
"aggs" : {
"gender_bucket" : {
"terms" : {
"field" : "gender"
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
Response:
{
"took" : 69,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"gender_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 524,
"average_age" : {
"value" : 47.333969465648856
}
},
{
"key" : "female",
"doc_count" : 476,
"average_age" : {
"value" : 45.71848739495798
}
}
]
}
}
}
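In a nested response, every bucket carries its sub-aggregation under the name used in the request. The sketch below walks the two buckets copied from the response above. It also derives the overall average age as a doc_count-weighted mean, which is an extra calculation and not something Elasticsearch returns here.

```python
# Buckets copied from the response above.
buckets = [
    {"key": "male", "doc_count": 524, "average_age": {"value": 47.333969465648856}},
    {"key": "female", "doc_count": 476, "average_age": {"value": 45.71848739495798}},
]

# Map each bucket key to its nested metric value.
avg_age_by_gender = {b["key"]: round(b["average_age"]["value"], 2) for b in buckets}
print(avg_age_by_gender)  # {'male': 47.33, 'female': 45.72}

# Weight each bucket's average by its document count to get the overall average.
total_docs = sum(b["doc_count"] for b in buckets)
overall = sum(b["doc_count"] * b["average_age"]["value"] for b in buckets) / total_docs
print(total_docs, round(overall, 3))
```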
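Average age per age range, broken down by gender: the outermost aggregation groups by gender, the second level groups by age range, and the innermost level computes the average age of each group. Note that the ranges below leave ages 50 to 54 uncovered, so the range counts within a gender add up to less than that gender's doc_count.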
GET /employees/_search
{
"size" : 0,
"aggs" : {
"gender_bucket" : {
"terms" : {
"field" : "gender"
},
"aggs" : {
"age_ranges" : {
"range" : {
"field" : "age",
"keyed" : true,
"ranges" : [
{ "key": "young", "to" : 35 },
{ "key": "middle-aged", "from" : 35, "to" : 50 },
{ "key": "senior", "from" : 55 }
]
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
}
}
Response:
{
"took" : 37,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"gender_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 524,
"age_ranges" : {
"buckets" : {
"young" : {
"to" : 35.0,
"doc_count" : 145,
"average_age" : {
"value" : 26.048275862068966
}
},
"middle-aged" : {
"from" : 35.0,
"to" : 50.0,
"doc_count" : 135,
"average_age" : {
"value" : 42.19259259259259
}
},
"senior" : {
"from" : 55.0,
"doc_count" : 192,
"average_age" : {
"value" : 65.734375
}
}
}
}
},
{
"key" : "female",
"doc_count" : 476,
"age_ranges" : {
"buckets" : {
"young" : {
"to" : 35.0,
"doc_count" : 157,
"average_age" : {
"value" : 25.840764331210192
}
},
"middle-aged" : {
"from" : 35.0,
"to" : 50.0,
"doc_count" : 112,
"average_age" : {
"value" : 41.517857142857146
}
},
"senior" : {
"from" : 55.0,
"doc_count" : 170,
"average_age" : {
"value" : 65.5
}
}
}
}
}
]
}
}
}
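A two-level response like this is often easier to read once flattened into rows. The sketch below does that for the buckets copied from the response above (from/to fields omitted for brevity). It also counts, per gender, how many documents fall outside all three ranges, because the request leaves ages 50 to 54 uncovered.

```python
# Buckets copied from the response above, without the from/to fields.
response_buckets = [
    {"key": "male", "doc_count": 524, "age_ranges": {"buckets": {
        "young": {"doc_count": 145, "average_age": {"value": 26.048275862068966}},
        "middle-aged": {"doc_count": 135, "average_age": {"value": 42.19259259259259}},
        "senior": {"doc_count": 192, "average_age": {"value": 65.734375}},
    }}},
    {"key": "female", "doc_count": 476, "age_ranges": {"buckets": {
        "young": {"doc_count": 157, "average_age": {"value": 25.840764331210192}},
        "middle-aged": {"doc_count": 112, "average_age": {"value": 41.517857142857146}},
        "senior": {"doc_count": 170, "average_age": {"value": 65.5}},
    }}},
]

# One row per (gender, age range) pair.
rows = []
for gender in response_buckets:
    for name, bucket in gender["age_ranges"]["buckets"].items():
        rows.append((gender["key"], name, bucket["doc_count"],
                     round(bucket["average_age"]["value"], 1)))
for row in rows:
    print(row)

# Documents of each gender that fall into none of the ranges.
uncovered = {
    g["key"]: g["doc_count"] - sum(b["doc_count"] for b in g["age_ranges"]["buckets"].values())
    for g in response_buckets
}
print(uncovered)  # {'male': 52, 'female': 37}
```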
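2.4. Filter aggregation
Calculate the average salary of employees from Minnesota. This is also a form of nested aggregation: the filter is applied first, and the avg aggregation then computes the average salary of the documents that pass it. (The sub-aggregation is named avg_age in the request, but it averages the salary field.)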
GET /employees/_search
{
"size" : 0,
"aggs" : {
"state" : {
"filter" : { "term": { "state": "minnesota" } },
"aggs" : {
"avg_age" : { "avg" : { "field" : "salary" } }
}
}
}
}
Response:
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"state" : {
"doc_count" : 17,
"avg_age" : {
"value" : 43855.35294117647
}
}
}
}
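The filter above keeps only the Minnesota documents (17 of 1,000). To average over everyone except Minnesota employees instead, one option is to wrap the term query in a bool query with must_not. The sketch below builds such a request body in Python. The helper name and the aggregation name `other_states` are illustrative assumptions. Documents without a state value would also pass this filter.

```python
import json

def avg_salary_excluding_state(state):
    """Build a search body averaging salary over employees NOT in `state` (hypothetical helper)."""
    return {
        "size": 0,
        "aggs": {
            "other_states": {  # illustrative aggregation name
                # must_not inverts the term query: documents matching it are dropped.
                "filter": {"bool": {"must_not": [{"term": {"state": state}}]}},
                "aggs": {"avg_salary": {"avg": {"field": "salary"}}},
            }
        },
    }

body = avg_salary_excluding_state("minnesota")
print("GET /employees/_search")
print(json.dumps(body, indent=2))
```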
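3. Summary
This article worked through examples on the sample employee data, covering metric aggregations, bucket aggregations, nested aggregations, and filter aggregations.
Reference: https://blog.csdn.net/neweastsun/article/details/104324747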