
Elasticsearch Nested Aggregations and Global Aggregations

July 19, 2022 · freeliver54

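This series already has several articles on aggregations. This one focuses on nested aggregations and global aggregations; for completeness, it first reviews terms aggregations and sub-aggregations.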

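1. Preparing the Data

For the demonstration we first need a model and some data.

1.1. Model

Assume a web application for registering pets in cities. It has the following entities:

  • City(city, city_type)
  • Citizen(occupation, age)
  • Pet(kind, name, age)

A city has many citizens, and a citizen has many registered pets. Each root document in the index stands for one office of a city, so a city can span several documents.

Start by creating the index mapping: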

PUT city 
{ 
  "settings": { 
    "number_of_shards": 1 
  }, 
  "mappings": { 
    "properties": { 
      "city": { 
        "type": "keyword" 
      }, 
      "city_type": { 
        "type": "keyword" 
      }, 
      "citizens": { 
        "type": "nested", 
        "properties": { 
          "occupation": { 
            "type": "keyword" 
          }, 
          "age": { 
            "type": "integer" 
          }, 
          "pets": { 
            "type": "nested", 
            "properties": { 
              "kind": { 
                "type": "keyword" 
              }, 
              "name": { 
                "type": "keyword" 
              }, 
              "age": { 
                "type": "integer" 
              } 
            } 
          } 
        } 
      } 
    } 
  } 
} 

The request is acknowledged:

{ 
  "acknowledged" : true, 
  "shards_acknowledged" : true, 
  "index" : "city" 
} 

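This confirms that the city index was created. But why model the entity relationships as nested objects?

The nested type is a specialized version of the object data type that allows arrays of objects to be indexed so that each object can be queried independently of the others. Lucene has no built-in concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values. An example makes this clearer.

Suppose a city has two citizens, [{"occupation": "Dentist", "age": 35}, {"occupation": "Developer", "age": 30}]. With the object data type, Elasticsearch merges the values of each sub-field and loses the association between them: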

{ 
  "citizens": { 
    "occupation": ["Dentist", "Developer"], 
    "age": [35, 30] 
  } 
} 

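As a result, a search for a "Dentist" aged 30 matches this document even though the dentist is 35. The nested type instead indexes each object in the array as a separate hidden document, so each nested object can be queried independently of the others.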
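
To make the difference concrete, here is a minimal Python sketch that only mimics the behavior; it is not Elasticsearch code, and the helper names flatten_objects, matches_flattened, and matches_nested are made up for this illustration.

```python
# Illustrative simulation only; Elasticsearch does this internally.
citizens = [
    {"occupation": "Dentist", "age": 35},
    {"occupation": "Developer", "age": 30},
]


def flatten_objects(objects):
    """Merge an array of objects into one multi-valued list per key,
    roughly what the plain object type does."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flattened(flat, occupation, age):
    """Each condition is checked against the merged value lists, so the
    link between an occupation and its age is lost."""
    return occupation in flat["occupation"] and age in flat["age"]


def matches_nested(objects, occupation, age):
    """Each object is checked on its own, like a hidden nested document."""
    return any(
        obj["occupation"] == occupation and obj["age"] == age
        for obj in objects
    )


flat = flatten_objects(citizens)
print(matches_flattened(flat, "Dentist", 30))  # True: a false positive
print(matches_nested(citizens, "Dentist", 30))  # False: no such citizen
```

The flattened check succeeds because "Dentist" and 30 both appear somewhere in the merged lists, which is exactly the cross-object match the nested type avoids.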

1.2. Data

Some sample data is provided. After downloading it, run the following command in the same directory to bulk-load it:

curl -s -H "Content-Type: application/x-ndjson" -XPOST http://localhost:9200/_bulk --data-binary "@nested-data.json"

Replace http://localhost:9200 with the address of your own Elasticsearch instance.
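
The contents of nested-data.json are not shown in this article. As a rough sketch of the format the _bulk endpoint expects (an action line followed by a source line, with the body ending in a newline), the Python snippet below builds such a payload. The helper to_bulk_ndjson and the sample document are hypothetical.

```python
import json


def to_bulk_ndjson(index, documents):
    """Serialize documents as action/source line pairs for the _bulk API.
    The body must end with a newline."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical sample document shaped like the mapping above.
sample_docs = [
    {
        "city": "Athens",
        "city_type": "primary",
        "citizens": [
            {
                "occupation": "Dentist",
                "age": 35,
                "pets": [{"kind": "Dog", "name": "Rex", "age": 3}],
            }
        ],
    }
]

payload = to_bulk_ndjson("city", sample_docs)
print(payload, end="")
```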

Use the following request to check that the data has been indexed:

GET city/_search?size=0 

The response includes:

  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  } 

2. Aggregations Recap

Every aggregation is embedded in a search request, using the following syntax:

GET <index_name>/_search 
{ 
  "query": { ... }, 
  "aggs": { 
    "<aggregation name>": { 
      "<aggregation type>": { <aggregation properties> } 
    } 
  } 
} 
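
  • aggregation name

    The name you give the aggregation. It is used later, when parsing the response, to locate the result of that particular aggregation.

  • aggregation type

    The type of aggregation. Elasticsearch 7.x groups its many aggregations into four families.

  • aggregation properties

    The options specific to the chosen aggregation type.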
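
The three placeholders map directly onto three levels of the request body. A small Python sketch, assuming a hypothetical helper named build_agg_request, shows where each one ends up:

```python
import json


def build_agg_request(name, agg_type, properties, query=None):
    """Assemble a search body with one named aggregation.

    name       -> "<aggregation name>", the key used to find the result later
    agg_type   -> "<aggregation type>", for example "terms"
    properties -> "<aggregation properties>", the options for that type
    """
    body = {"size": 0, "aggs": {name: {agg_type: properties}}}
    if query is not None:
        body["query"] = query
    return body


body = build_agg_request("cities", "terms", {"field": "city"})
print(json.dumps(body, indent=2))
```

The printed body is the same shape as the requests used in the rest of this article, with size set in the body rather than in the URL.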

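2.1. Terms Aggregation

A terms aggregation reveals the distinct values of a given field and how many documents carry each value. Consider the following request: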

GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { "field": "city" } 
    } 
  } 
} 

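A brief explanation:

We search the city index with a terms aggregation on the city field, which returns the distinct values of city together with the number of documents for each. The result: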

{ 
  "took" : 45, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 37, 
      "buckets" : [ 
        { 
          "key" : "Amsterdam", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "London", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Oslo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Paris", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "San Francisco", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Tokyo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Athens", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Barcelona", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Chicago", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Madrid", 
          "doc_count" : 7 
        } 
      ] 
    } 
  } 
} 

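Because size=0 is set on the request, hits contains no search results; the aggregation results are under aggregations, where cities is exactly the name we defined. Before explaining the odd-looking sum_other_doc_count property, let's check the bucket values.

The city field takes several distinct values. Those with 8 documents are Amsterdam, London, Oslo, Paris, San Francisco, and Tokyo; those with 7 are Athens, Barcelona, Chicago, and Madrid. Together these account for 76 documents (6 × 8 + 4 × 7), but the response total is 113. The difference, 113 - 76 = 37, equals sum_other_doc_count.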
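
The arithmetic can be checked with a few lines of Python. The bucket counts are copied from the response above; the check assumes every document has exactly one city value.

```python
# Bucket counts copied from the response above (the 10 returned cities).
top_buckets = {
    "Amsterdam": 8, "London": 8, "Oslo": 8, "Paris": 8,
    "San Francisco": 8, "Tokyo": 8,
    "Athens": 7, "Barcelona": 7, "Chicago": 7, "Madrid": 7,
}
total_hits = 113

returned = sum(top_buckets.values())  # documents covered by the 10 buckets
other = total_hits - returned         # documents in buckets not returned
print(returned, other)  # 76 37
```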

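Here is why: the terms aggregation has its own size property, which defaults to 10, so Elasticsearch returns only the 10 buckets with the most documents. The 37 is the number of documents that fall into buckets outside those 10. Let's raise size and run the request again: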

GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { "field": "city" ,"size": 50} 
    } 
  } 
} 

The response:

{ 
  "took" : 32, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "Amsterdam", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "London", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Oslo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Paris", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "San Francisco", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Tokyo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Athens", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Barcelona", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Chicago", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Madrid", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "New York", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Warsaw", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Berlin", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Budapest", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Melbourne", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Prague", 
          "doc_count" : 5 
        } 
      ] 
    } 
  } 
} 
 

Good: the cities that were missing now show up, and sum_other_doc_count is 0.

Next, add another aggregation, this time on the city_type field:

GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      } 
    }, 
    "city_type": { 
      "terms": { 
        "field": "city_type" 
      } 
    } 
  } 
} 

This adds a second terms aggregation on city_type with the default size, since we know it has fewer than 10 distinct values. The response:

{ 
  "took" : 30, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "Amsterdam", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "London", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Oslo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Paris", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "San Francisco", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Tokyo", 
          "doc_count" : 8 
        }, 
        { 
          "key" : "Athens", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Barcelona", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Chicago", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Madrid", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "New York", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Warsaw", 
          "doc_count" : 7 
        }, 
        { 
          "key" : "Berlin", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Budapest", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Melbourne", 
          "doc_count" : 6 
        }, 
        { 
          "key" : "Prague", 
          "doc_count" : 5 
        } 
      ] 
    }, 
    "city_type" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "primary", 
          "doc_count" : 57 
        }, 
        { 
          "key" : "secondary", 
          "doc_count" : 56 
        } 
      ] 
    } 
  } 
} 
 

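primary and secondary come to 57 and 56 respectively, so nothing is missing (57 + 56 = 113). Note, however, that when a query is defined alongside the aggregations, the aggregations run only over the documents that match the query.
For example, add a term query that matches only the "Athens" documents: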

GET city/_search?size=0 
{ 
  "size": 0, 
  "query": { 
    "term": { 
      "city": { 
        "value": "Athens" 
      } 
    } 
  }, 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      } 
    }, 
    "office_types": { 
      "terms": { 
        "field": "city_type", 
        "size": 10 
      } 
    } 
  } 
} 

The response:

  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "Athens", 
          "doc_count" : 7 
        } 
      ] 
    }, 
    "office_types" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "secondary", 
          "doc_count" : 5 
        }, 
        { 
          "key" : "primary", 
          "doc_count" : 2 
        } 
      ] 
    } 
  } 

Athens has 5 secondary and 2 primary offices.
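
The scoping rule (query first, aggregations over the hits only) can be imitated in plain Python. This is a toy model with a hypothetical four-document index, not how Elasticsearch is implemented; the functions terms and search are illustrative names.

```python
from collections import Counter

# Hypothetical miniature index; the real data set has 113 documents.
docs = [
    {"city": "Athens", "city_type": "secondary"},
    {"city": "Athens", "city_type": "primary"},
    {"city": "London", "city_type": "primary"},
    {"city": "London", "city_type": "primary"},
]


def terms(documents, field):
    """Count documents per distinct value, like a terms aggregation."""
    return dict(Counter(doc[field] for doc in documents))


def search(documents, term_query=None):
    """Apply the optional term query first, then aggregate over the hits."""
    hits = documents
    if term_query:
        field, value = term_query
        hits = [doc for doc in documents if doc[field] == value]
    return {
        "total": len(hits),
        "cities": terms(hits, "city"),
        "office_types": terms(hits, "city_type"),
    }


print(search(docs))
print(search(docs, ("city", "Athens")))
```

With the term query in place, the cities aggregation sees only Athens, just as in the response above.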

Suppose we want the type breakdown for every city, without restricting the results, roughly like this:

- [ ] Amsterdam (8) 
    - [ ] Primary (4) 
    - [ ] Secondary (4) 
- [ ] London (8) 
    - [ ] Primary (6) 
    - [ ] Secondary (2) 
- [ ] Athens (7) 
    - [ ] Primary (2) 
    - [ ] Secondary (5) 

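That is what sub-aggregations are for.

2.2. Sub-aggregations

The terms aggregation (like other bucket aggregations) supports sub-aggregations. A sub-aggregation groups the documents inside each parent bucket; here, each city bucket is broken down further by type. First the syntax: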

GET <index_name>/_search 
{ 
  "query": { ... }, 
  "aggs": { 
    "<aggregation name>": { 
      "<aggregation type>": { <aggregation properties> }, 
      "aggs": { 
        "<sub-aggregation name>": { 
          "<sub-aggregation type>": { <sub-aggregation properties> } 
        } 
      } 
    } 
  } 
} 

For the example above, the request looks like this:

GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      }, 
      "aggs": { 
        "city_types": { 
          "terms": { 
            "field": "city_type" 
          } 
        } 
      } 
    }, 
    "office_types": { 
      "terms": { 
        "field": "city_type", 
        "size": 10 
      } 
    } 
  } 
} 

The response:

{ 
  "took": 12, 
  "timed_out": false, 
  "_shards": { 
    "total": 1, 
    "successful": 1, 
    "skipped": 0, 
    "failed": 0 
  }, 
  "hits": { 
    "total": 113, 
    "max_score": 0, 
    "hits": [] 
  }, 
  "aggregations": { 
    "cities": { 
      "doc_count_error_upper_bound": 0, 
      "sum_other_doc_count": 0, 
      "buckets": [ 
        { 
          "key": "Amsterdam", 
          "doc_count": 8, 
          "city_types": { 
            "doc_count_error_upper_bound": 0, 
            "sum_other_doc_count": 0, 
            "buckets": [ 
              { 
                "key": "primary", 
                "doc_count": 4 
              }, 
              { 
                "key": "secondary", 
                "doc_count": 4 
              } 
            ] 
          } 
        }, 
        { 
          "key": "San Francisco", 
          "doc_count": 8, 
          "city_types": { 
            "doc_count_error_upper_bound": 0, 
            "sum_other_doc_count": 0, 
            "buckets": [ 
              { 
                "key": "primary", 
                "doc_count": 4 
              }, 
              { 
                "key": "secondary", 
                "doc_count": 4 
              } 
            ] 
          } 
        }, 
        ... 
      ] 
    }, 
    "office_types": { 
      "doc_count_error_upper_bound": 0, 
      "sum_other_doc_count": 0, 
      "buckets": [ 
        { 
          "key": "primary", 
          "doc_count": 57 
        }, 
        { 
          "key": "secondary", 
          "doc_count": 56 
        } 
      ] 
    } 
  } 
} 
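
A client could turn a response like this into the checkbox tree sketched earlier. The following Python snippet is one possible way to do it; render_tree is a hypothetical helper, and the sample is a trimmed copy of the response containing only the Amsterdam bucket.

```python
def render_tree(response, agg_name="cities", sub_name="city_types"):
    """Turn terms buckets and their sub-buckets into indented checkbox lines."""
    lines = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        lines.append(f"- [ ] {bucket['key']} ({bucket['doc_count']})")
        for sub in bucket[sub_name]["buckets"]:
            label = sub["key"].capitalize()
            lines.append(f"    - [ ] {label} ({sub['doc_count']})")
    return lines


# Trimmed copy of the response above, Amsterdam bucket only.
tree_sample = {
    "aggregations": {
        "cities": {
            "buckets": [
                {
                    "key": "Amsterdam",
                    "doc_count": 8,
                    "city_types": {
                        "buckets": [
                            {"key": "primary", "doc_count": 4},
                            {"key": "secondary", "doc_count": 4},
                        ]
                    },
                }
            ]
        }
    }
}

print("\n".join(render_tree(tree_sample)))
```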

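To break the results down by further attributes (building type, say), add more sub-aggregations under cities >> city_types. Now let's turn to nested aggregations.

3. Nested Aggregation

As mentioned earlier, documents in the city index contain nested objects, and aggregating on nested objects requires the nested aggregation. The official definition is:

A special single bucket aggregation that enables aggregating nested documents.

The syntax: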

GET <index_name>/_search 
{ 
 
  "aggs" : { 
    "<aggregation-name>" : { 
      "nested" : { 
        "path" : "<nested-object-path>" 
      }, 
      "aggs" : { 
        "<nested-aggregation-name>": { 
          "<aggregation-type>" : { <aggregation-properties> }   
        } 
      } 
    } 
  } 
} 

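nested-object-path is the root path of the nested objects to traverse. For example, set it to citizens to aggregate on citizens, or to citizens.pets to aggregate on pets. The following example groups citizens by occupation: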

GET city/_search?size=0 
{ 
  "size": 0, 
  "aggs": { 
    "citizens": { 
      "nested": { 
        "path": "citizens" 
      }, 
      "aggs": { 
        "occupations": { 
          "terms": { 
            "field": "citizens.occupation", 
            "size": 50 
          } 
        } 
      } 
    } 
  } 
}  

The response:

{ 
... 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "citizens" : { 
      "doc_count" : 3966, 
      "occupations" : { 
        "doc_count_error_upper_bound" : 0, 
        "sum_other_doc_count" : 0, 
        "buckets" : [ 
          { 
            "key" : "Hairdresser", 
            "doc_count" : 243 
          }, 
          { 
            "key" : "Microbiologist", 
            "doc_count" : 241 
          }, 
          { 
            "key" : "Farmer", 
            "doc_count" : 234 
          }, 
          { 
            "key" : "Marketing Manager", 
            "doc_count" : 231 
          }, 
          { 
            "key" : "Clinical Laboratory Technician", 
            "doc_count" : 230 
          }, 
          { 
            "key" : "Librarian", 
            "doc_count" : 230 
          }, 
          { 
            "key" : "Editor", 
            "doc_count" : 229 
          }, 
          { 
            "key" : "Statistician", 
            "doc_count" : 227 
          }, 
          { 
            "key" : "Dancer", 
            "doc_count" : 226 
          }, 
          { 
            "key" : "Software Developer", 
            "doc_count" : 224 
          }, 
          { 
            "key" : "Photographer", 
            "doc_count" : 214 
          }, 
          { 
            "key" : "Environmental scientist", 
            "doc_count" : 210 
          }, 
          ... 
        ] 
      } 
    } 
  } 
} 

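A total of 3,966 citizens registered their pets; 243 of them are hairdressers, 241 are microbiologists, and so on.
Note that a terms aggregation on a nested field must reference the field by its full path (citizens.occupation, not occupation).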
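
One way to avoid forgetting the full path is to build the aggregation body in code and check the field name up front. The sketch below is an illustration under that assumption; nested_terms is a hypothetical helper and the prefix check is a client-side convention of this example, not something Elasticsearch provides.

```python
def nested_terms(agg_name, path, sub_name, field, size=10):
    """Build a nested aggregation that wraps a terms aggregation.

    Raises ValueError when `field` is not written as a full path under `path`.
    """
    prefix = path + "."
    if not field.startswith(prefix):
        raise ValueError(f"field {field!r} must start with {prefix!r}")
    return {
        agg_name: {
            "nested": {"path": path},
            "aggs": {sub_name: {"terms": {"field": field, "size": size}}},
        }
    }


aggs = nested_terms(
    "citizens", "citizens", "occupations", "citizens.occupation", size=50
)
print(aggs)
```

The resulting dictionary matches the aggs section of the request above.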

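Good, but I also want to know how many offices these registered citizens are spread across. In other words, how many offices have marketing managers, and how many have librarians?

3.1. Reverse Nested Aggregation

That question calls for the reverse nested aggregation. The official definition:

A special single bucket aggregation that enables aggregating on parent documents from nested documents. Effectively, this aggregation can break out of the nested block structure and link to other nested structures or the root document, which allows nesting other aggregations that are not part of the nested object inside a nested aggregation. The reverse nested aggregation must be defined inside a nested aggregation.

This sounds complicated, but it is not; the sample data makes its behavior easier to grasp. First the syntax: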

GET <index_name>/_search 
{ 
 
  "aggs" : { 
    "<aggregation-name>" : { 
      "nested" : { 
        "path" : "<nested-object-path>" 
      }, 
      "aggs" : { 
        "<nested-aggregation-name>": { 
          "<aggregation-type>" : { <aggregation-properties> }, 
          "aggs": { 
            "in_offices": { 
              "reverse_nested": { <reverse-nested-options> } 
            } 
          } 
        } 
      } 
    } 
  } 
} 

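As you can see, a reverse nested aggregation is always defined as a sub-aggregation inside a nested aggregation. Now let's see how many offices each occupation is spread across: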

GET city/_search?size=0 
{ 
  "aggs": { 
    "citizens": { 
      "nested": { 
        "path": "citizens" 
      }, 
      "aggs": { 
        "occupations": { 
          "terms": { 
            "field": "citizens.occupation", 
            "size": 50 
          }, 
          "aggs": { 
            "in_offices": { 
              "reverse_nested": {} 
            } 
          } 
        } 
      } 
    } 
  } 
} 

The response:

{ 
  ... 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "citizens" : { 
      "doc_count" : 3966, 
      "occupations" : { 
        "doc_count_error_upper_bound" : 0, 
        "sum_other_doc_count" : 0, 
        "buckets" : [ 
          { 
            "key" : "Hairdresser", 
            "doc_count" : 243, 
            "in_offices" : { 
              "doc_count" : 98 
            } 
          }, 
          { 
            "key" : "Microbiologist", 
            "doc_count" : 241, 
            "in_offices" : { 
              "doc_count" : 98 
            } 
          }, 
          { 
            "key" : "Farmer", 
            "doc_count" : 234, 
            "in_offices" : { 
              "doc_count" : 99 
            } 
          }, 
          { 
            "key" : "Marketing Manager", 
            "doc_count" : 231, 
            "in_offices" : { 
              "doc_count" : 91 
            } 
          }, 
          ... 
        ] 
      } 
    } 
  } 
} 
 

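We see that the 243 hairdressers are registered at 98 offices, the 241 microbiologists at 98 offices, the 234 farmers at 99 offices, and so on.

The reverse nested aggregation has a single option, path. It defines how far back up the document hierarchy Elasticsearch should go before computing the aggregation, that is, which nested level to join back to. In our example citizens are direct children of the office document, so path can be left undefined, which means the occupation buckets are counted against the root documents (the offices). Feeling a bit lost? Don't worry, the pet data below should make it clearer.

Next question: how many citizens registered each kind of pet?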

GET city/_search?size=0 
{ 
  "aggs": { 
    "citizens": { 
      "nested": { 
        "path": "citizens.pets" 
      }, 
      "aggs": { 
        "kinds": { 
          "terms": { 
            "field": "citizens.pets.kind", 
            "size": 10 
          }, 
          "aggs": { 
            "per_citizen": { 
              "reverse_nested": {} 
            } 
          } 
        } 
      } 
    } 
  } 
} 

We still have not set the path option. The response:

{ 
  ... 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "citizens" : { 
      "doc_count" : 11845, 
      "kinds" : { 
        "doc_count_error_upper_bound" : 0, 
        "sum_other_doc_count" : 0, 
        "buckets" : [ 
          { 
            "key" : "Dog", 
            "doc_count" : 2421, 
            "per_citizen" : { 
              "doc_count" : 113 
            } 
          }, 
          { 
            "key" : "Hamster", 
            "doc_count" : 2403, 
            "per_citizen" : { 
              "doc_count" : 113 
            } 
          }, 
          { 
            "key" : "Cat", 
            "doc_count" : 2380, 
            "per_citizen" : { 
              "doc_count" : 113 
            } 
          }, 
          { 
            "key" : "Bird", 
            "doc_count" : 2330, 
            "per_citizen" : { 
              "doc_count" : 113 
            } 
          }, 
          { 
            "key" : "Rabbit", 
            "doc_count" : 2311, 
            "per_citizen" : { 
              "doc_count" : 113 
            } 
          } 
        ] 
      } 
    } 
  } 
} 

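There are 2,421 registered dogs, 2,403 hamsters, 2,380 cats, and so on. But the per_citizen buckets look wrong. Does the number 113 sound familiar? Right, that is how many offices (root documents) we have. Because no path is set on the reverse nested aggregation, Elasticsearch counts root documents. Let's modify the example: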

GET city/_search?size=0 
{ 
  "aggs": { 
    "citizens": { 
      "nested": { 
        "path": "citizens.pets" 
      }, 
      "aggs": { 
        "kinds": { 
          "terms": { 
            "field": "citizens.pets.kind", 
            "size": 10 
          }, 
          "aggs": { 
            "per_citizen": { 
              "reverse_nested": { 
                "path": "citizens" 
              } 
            } 
          } 
        } 
      } 
    } 
  } 
} 

The response:

{ 
  ... 
  "hits" : { 
    "total" : { 
      "value" : 113, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "citizens" : { 
      "doc_count" : 11845, 
      "kinds" : { 
        "doc_count_error_upper_bound" : 0, 
        "sum_other_doc_count" : 0, 
        "buckets" : [ 
          { 
            "key" : "Dog", 
            "doc_count" : 2421, 
            "per_citizen" : { 
              "doc_count" : 1864 
            } 
          }, 
          { 
            "key" : "Hamster", 
            "doc_count" : 2403, 
            "per_citizen" : { 
              "doc_count" : 1852 
            } 
          }, 
          { 
            "key" : "Cat", 
            "doc_count" : 2380, 
            "per_citizen" : { 
              "doc_count" : 1823 
            } 
          }, 
          { 
            "key" : "Bird", 
            "doc_count" : 2330, 
            "per_citizen" : { 
              "doc_count" : 1803 
            } 
          }, 
          { 
            "key" : "Rabbit", 
            "doc_count" : 2311, 
            "per_citizen" : { 
              "doc_count" : 1800 
            } 
          } 
        ] 
      } 
    } 
  } 
} 
 

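The 2,421 dogs were registered by 1,864 citizens, the 2,403 hamsters by 1,852 citizens, the 2,380 cats by 1,823 citizens, and so on.

Note: a citizen can own more than one pet, and the pets can be of different kinds. That is why the per_citizen counts add up to 9,142, far more than the 3,966 citizens.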
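
The three different counts (pets, citizens, root documents) can be reproduced on a tiny in-memory data set. The Python sketch below is only a model of the counting semantics, with hypothetical data and a hypothetical helper count_by_kind; it does not call Elasticsearch.

```python
# Hypothetical miniature data: 2 root documents (offices), 3 citizens, 5 pets.
offices = [
    {"citizens": [
        {"pets": [{"kind": "Dog"}, {"kind": "Dog"}, {"kind": "Cat"}]},
        {"pets": [{"kind": "Dog"}]},
    ]},
    {"citizens": [
        {"pets": [{"kind": "Cat"}]},
    ]},
]


def count_by_kind(offices):
    """For each pet kind return (pets, citizens, offices), mirroring
    doc_count, reverse_nested with path "citizens", and reverse_nested {}."""
    pets, owners, roots = {}, {}, {}
    for office_id, office in enumerate(offices):
        for citizen_id, citizen in enumerate(office["citizens"]):
            for pet in citizen["pets"]:
                kind = pet["kind"]
                pets[kind] = pets.get(kind, 0) + 1
                # Sets de-duplicate: a citizen with two dogs counts once.
                owners.setdefault(kind, set()).add((office_id, citizen_id))
                roots.setdefault(kind, set()).add(office_id)
    return {k: (pets[k], len(owners[k]), len(roots[k])) for k in pets}


result = count_by_kind(offices)
print(result)  # {'Dog': (3, 2, 1), 'Cat': (2, 2, 2)}
```

Here the citizen counts add up to 4 although there are only 3 citizens, because one citizen owns both a dog and a cat. The same effect explains the totals in the real response.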

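Let's run one more nested aggregation to answer the following question:

For each city and each citizen occupation, how many pets of each kind were registered, by how many citizens, and across how many offices?

Breaking the problem down:

  • The results are per city, so we need a terms aggregation on the city field.
  • The results are per citizen occupation, so we add a terms sub-aggregation on the citizens.occupation field.
    • Citizens are nested objects, so that aggregation must sit inside a nested sub-aggregation with path set to citizens.
  • The results are per pet kind, so we add a terms sub-aggregation on the citizens.pets.kind field.
    • Pets are nested objects too, so that aggregation must sit inside a nested sub-aggregation with path set to citizens.pets.
  • To count citizens and offices for each kind, we add two reverse nested sub-aggregations: one with path set to citizens, and one without a path.
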
GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      }, 
      "aggs": { 
        "citizens": { 
          "nested": { 
            "path": "citizens" 
          }, 
          "aggs": { 
            "occupations": { 
              "terms": { 
                "field": "citizens.occupation", 
                "size": 50 
              }, 
              "aggs": { 
                "pets": { 
                  "nested": { 
                    "path": "citizens.pets" 
                  }, 
                  "aggs": { 
                    "kinds": { 
                      "terms": { 
                        "field": "citizens.pets.kind", 
                        "size": 10 
                      }, 
                      "aggs": { 
                        "per_occupation": { 
                          "reverse_nested": { 
                            "path": "citizens" 
                          } 
                        }, 
                        "per_office": { 
                          "reverse_nested": {} 
                        } 
                      } 
                    } 
                  } 
                } 
              } 
            } 
          } 
        } 
      } 
    } 
  } 
} 

The response:

{ 
  ... 
  "hits": { 
    "total": 113, 
    "max_score": 0, 
    "hits": [] 
  }, 
  "aggregations": { 
    "cities": { 
      "doc_count_error_upper_bound": 0, 
      "sum_other_doc_count": 0, 
      "buckets": [ 
        { 
          "key": "Amsterdam", 
          "doc_count": 8, 
          "citizens": { 
            "doc_count": 230, 
            "occupations": { 
              "doc_count_error_upper_bound": 0, 
              "sum_other_doc_count": 0, 
              "buckets": [ 
                { 
                  "key": "Dancer", 
                  "doc_count": 19, 
                  "pets": { 
                    "doc_count": 49, 
                    "kinds": { 
                      "doc_count_error_upper_bound": 0, 
                      "sum_other_doc_count": 0, 
                      "buckets": [ 
                        { 
                          "key": "Cat", 
                          "doc_count": 13, 
                          "per_office": { 
                            "doc_count": 5 
                          }, 
                          "per_occupation": { 
                            "doc_count": 9 
                          } 
                        }, 
                        { 
                          "key": "Rabbit", 
                          "doc_count": 11, 
                          "per_office": { 
                            "doc_count": 5 
                          }, 
                          "per_occupation": { 
                            "doc_count": 10 
                          } 
                        }, 
                        { 
                          "key": "Bird", 
                          "doc_count": 9, 
                          "per_office": { 
                            "doc_count": 5 
                          }, 
                          "per_occupation": { 
                            "doc_count": 7 
                          } 
                        }, 
                        { 
                          "key": "Dog", 
                          "doc_count": 8, 
                          "per_office": { 
                            "doc_count": 5 
                          }, 
                          "per_occupation": { 
                            "doc_count": 7 
                          } 
                        }, 
                        { 
                          "key": "Hamster", 
                          "doc_count": 8, 
                          "per_office": { 
                            "doc_count": 3 
                          }, 
                          "per_occupation": { 
                            "doc_count": 7 
                          } 
                        } 
                      ] 
                    } 
                  } 
                }, 
                ... 
              ] 
            } 
          } 
        }, 
        ... 
      ] 
    } 
  } 
} 

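Let's walk through the result. Amsterdam has 8 offices, where 230 citizens registered pets. Among them:

  • 19 dancers registered 49 pets, of which:
    13 cats were registered by 9 dancers across 5 offices
    11 rabbits were registered by 10 dancers across 5 offices
    9 birds were registered by 7 dancers across 5 offices
    8 dogs were registered by 7 dancers across 5 offices
    8 hamsters were registered by 7 dancers across 3 offices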
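
Reading such a deeply nested response by eye gets tedious. As a sketch, the Python generator below flattens it into one row per city, occupation, and pet kind. flatten_rows is a hypothetical helper, and the sample is a trimmed copy of the response above (Amsterdam, Dancer, Cat only).

```python
def flatten_rows(response):
    """Yield (city, occupation, kind, pets, citizens, offices) tuples."""
    for city in response["aggregations"]["cities"]["buckets"]:
        for occ in city["citizens"]["occupations"]["buckets"]:
            for kind in occ["pets"]["kinds"]["buckets"]:
                yield (
                    city["key"],
                    occ["key"],
                    kind["key"],
                    kind["doc_count"],                    # pets of this kind
                    kind["per_occupation"]["doc_count"],  # citizens
                    kind["per_office"]["doc_count"],      # root documents
                )


# Trimmed copy of the response above.
sample_response = {
    "aggregations": {"cities": {"buckets": [{
        "key": "Amsterdam",
        "citizens": {"occupations": {"buckets": [{
            "key": "Dancer",
            "pets": {"kinds": {"buckets": [{
                "key": "Cat",
                "doc_count": 13,
                "per_office": {"doc_count": 5},
                "per_occupation": {"doc_count": 9},
            }]}},
        }]}},
    }]}}
}

rows = list(flatten_rows(sample_response))
print(rows)  # [('Amsterdam', 'Dancer', 'Cat', 13, 9, 5)]
```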

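4. Global Aggregation

Finally, the global aggregation. The official definition:

Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you are searching on, but is not influenced by the search query itself.

An example: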

GET city/_search?size=0 
{ 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      }, 
      "aggs": { 
        "office_types": { 
          "terms": { 
            "field": "city_type" 
          } 
        } 
      } 
    }, 
    "office_types": { 
      "terms": { 
        "field": "city_type", 
        "size": 10 
      } 
    } 
  } 
} 

The result can be rendered as a form like this:

- [ ] Amsterdam (8) 
  - [ ] Primary (4) 
  - [ ] Secondary (4) 
- [ ] London (8) 
  - [ ] Primary (6) 
  - [ ] Secondary (2) 
- [ ] Athens (7) 
  - [ ] Primary (2) 
  - [ ] Secondary (5) 

When the form is rendered, the expected behavior is that ticking a checkbox, London for example, re-runs the search with that condition:

GET city/_search?size=0 
{ 
  "query": { 
    "term": { 
      "city": { 
        "value": "London" 
      } 
    } 
  }, 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      }, 
      "aggs": { 
        "office_types": { 
          "terms": { 
            "field": "city_type", 
            "size": 10 
          } 
        } 
      } 
    } 
  } 
} 

The response:

{ 
  "took" : 23, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 8, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "London", 
          "doc_count" : 8, 
          "office_types" : { 
            "doc_count_error_upper_bound" : 0, 
            "sum_other_doc_count" : 0, 
            "buckets" : [ 
              { 
                "key" : "primary", 
                "doc_count" : 6 
              }, 
              { 
                "key" : "secondary", 
                "doc_count" : 2 
              } 
            ] 
          } 
        } 
      ] 
    } 
  } 
} 

Rendering this result shows only the London documents:

- [ ] London (8) 
  - [ ] Primary (6) 
  - [ ] Secondary (2) 

It would be better, though, to keep showing the other options as well, just unselected:

- [ ] Amsterdam 
  - [ ] Primary 
  - [ ] Secondary 
- [*] London 
  - [ ] Primary (6) 
  - [ ] Secondary (2) 
- [ ] Athens 
  - [ ] Primary 
  - [ ] Secondary 

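One way to get there is to keep the aggregation results of a search without a query and compare them with the results of the search triggered when the user narrows things down by ticking a checkbox. To avoid running that extra search request, we can use a global aggregation instead: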

GET city/_search?size=0 
{ 
  "query": { 
    "term": { 
      "city": { 
        "value": "London" 
      } 
    } 
  }, 
  "aggs": { 
    "cities": { 
      "terms": { 
        "field": "city", 
        "size": 50 
      }, 
      "aggs": { 
        "office_types": { 
          "terms": { 
            "field": "city_type", 
            "size": 10 
          } 
        } 
      } 
    }, 
    "unfiltered": { 
      "global": {}, 
      "aggs": { 
        "cities": { 
          "terms": { 
            "field": "city", 
            "size": 50 
          }, 
          "aggs": { 
            "office_types": { 
              "terms": { 
                "field": "city_type", 
                "size": 10 
              } 
            } 
          } 
        } 
      } 
    } 
  } 
} 
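
On the client side, the two sections of the response can then be merged into the form. The Python sketch below is one possible approach, not part of Elasticsearch: render_facets is a hypothetical helper that treats every city present in the filtered cities section as selected, and the sample is a hand-trimmed subset of what this request returns.

```python
def render_facets(aggregations):
    """List every city from the global (unfiltered) section; show counts
    only for cities that also appear in the filtered section."""
    filtered = {b["key"]: b for b in aggregations["cities"]["buckets"]}
    lines = []
    for bucket in aggregations["unfiltered"]["cities"]["buckets"]:
        selected = filtered.get(bucket["key"])
        mark = "*" if selected else " "
        lines.append(f"- [{mark}] {bucket['key']}")
        counts = {}
        if selected:
            counts = {
                t["key"]: t["doc_count"]
                for t in selected["office_types"]["buckets"]
            }
        for office_type in bucket["office_types"]["buckets"]:
            key = office_type["key"]
            suffix = f" ({counts[key]})" if key in counts else ""
            lines.append(f"  - [ ] {key.capitalize()}{suffix}")
    return lines


sample_aggs = {
    "cities": {"buckets": [
        {"key": "London", "doc_count": 8, "office_types": {"buckets": [
            {"key": "primary", "doc_count": 6},
            {"key": "secondary", "doc_count": 2},
        ]}},
    ]},
    "unfiltered": {"doc_count": 113, "cities": {"buckets": [
        {"key": "Amsterdam", "doc_count": 8, "office_types": {"buckets": [
            {"key": "primary", "doc_count": 4},
            {"key": "secondary", "doc_count": 4},
        ]}},
        {"key": "London", "doc_count": 8, "office_types": {"buckets": [
            {"key": "primary", "doc_count": 6},
            {"key": "secondary", "doc_count": 2},
        ]}},
    ]}},
}

print("\n".join(render_facets(sample_aggs)))
```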

The response now also contains the unfiltered section, which matches the form we want to render:

{ 
  "took" : 4, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 8, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "cities" : { 
      "doc_count_error_upper_bound" : 0, 
      "sum_other_doc_count" : 0, 
      "buckets" : [ 
        { 
          "key" : "London", 
          "doc_count" : 8, 
          "office_types" : { 
            "doc_count_error_upper_bound" : 0, 
            "sum_other_doc_count" : 0, 
            "buckets" : [ 
              { 
                "key" : "primary", 
                "doc_count" : 6 
              }, 
              { 
                "key" : "secondary", 
                "doc_count" : 2 
              } 
            ] 
          } 
        } 
      ] 
    }, 
    "unfiltered" : { 
      "doc_count" : 113, 
      "cities" : { 
        "doc_count_error_upper_bound" : 0, 
        "sum_other_doc_count" : 0, 
        "buckets" : [ 
          { 
            "key" : "Amsterdam", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "London", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 6 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 2 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Oslo", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Paris", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "San Francisco", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Tokyo", 
            "doc_count" : 8, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Athens", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "secondary", 
                  "doc_count" : 5 
                }, 
                { 
                  "key" : "primary", 
                  "doc_count" : 2 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Barcelona", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 3 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Chicago", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "secondary", 
                  "doc_count" : 5 
                }, 
                { 
                  "key" : "primary", 
                  "doc_count" : 2 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Madrid", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "primary", 
                  "doc_count" : 3 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "New York", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 3 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Warsaw", 
            "doc_count" : 7, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 3 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Berlin", 
            "doc_count" : 6, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "secondary", 
                  "doc_count" : 4 
                }, 
                { 
                  "key" : "primary", 
                  "doc_count" : 2 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Budapest", 
            "doc_count" : 6, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 3 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 3 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Melbourne", 
            "doc_count" : 6, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "primary", 
                  "doc_count" : 5 
                }, 
                { 
                  "key" : "secondary", 
                  "doc_count" : 1 
                } 
              ] 
            } 
          }, 
          { 
            "key" : "Prague", 
            "doc_count" : 5, 
            "office_types" : { 
              "doc_count_error_upper_bound" : 0, 
              "sum_other_doc_count" : 0, 
              "buckets" : [ 
                { 
                  "key" : "secondary", 
                  "doc_count" : 3 
                }, 
                { 
                  "key" : "primary", 
                  "doc_count" : 2 
                } 
              ] 
            } 
          } 
        ] 
      } 
    } 
  } 
} 
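The response above is deeply nested, so client code usually has to walk each bucket together with its sub-aggregation. Below is a minimal Python sketch, assuming the list of city buckets has already been pulled out of the response. The function name `flatten_buckets` and the variable `sample` are hypothetical; only the `key`, `doc_count`, and `office_types` fields visible in the response are relied on.

```python
from typing import Any, Dict, List, Tuple


def flatten_buckets(
    buckets: List[Dict[str, Any]], sub_agg: str = "office_types"
) -> List[Tuple[str, int, Dict[str, int]]]:
    """Turn terms buckets with one sub-aggregation into (key, total, counts) rows."""
    rows = []
    for bucket in buckets:
        # A bucket without the sub-aggregation simply yields an empty dict.
        sub_buckets = bucket.get(sub_agg, {}).get("buckets", [])
        sub_counts = {sub["key"]: sub["doc_count"] for sub in sub_buckets}
        rows.append((bucket["key"], bucket["doc_count"], sub_counts))
    return rows


# Two buckets copied from the response above (bookkeeping fields omitted).
sample = [
    {
        "key": "London",
        "doc_count": 8,
        "office_types": {
            "buckets": [
                {"key": "primary", "doc_count": 6},
                {"key": "secondary", "doc_count": 2},
            ]
        },
    },
    {
        "key": "Prague",
        "doc_count": 5,
        "office_types": {
            "buckets": [
                {"key": "secondary", "doc_count": 3},
                {"key": "primary", "doc_count": 2},
            ]
        },
    },
]

for city, total, by_type in flatten_buckets(sample):
    print(city, total, by_type)
```

In these two buckets the sub-bucket counts add up to the parent `doc_count`, which makes a quick sanity check when reading such a response.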
 

Summary

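In this article we first reviewed the terms bucket aggregation, then covered the nested and reverse nested bucket aggregations, and finished with the global bucket aggregation.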


Reference for this article: https://blog.csdn.net/neweastsun/article/details/105447064