
Understanding Elasticsearch Pipeline Aggregations in Depth (1)

July 19, 2022 · dflying

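Metric and bucket aggregations operate on fields of the indexed documents (typically numeric ones), whereas the pipeline aggregations discussed here operate on the output of other aggregations. They therefore work on intermediate values rather than on the raw document data, which makes them useful for computing more complex statistics and mathematical measures such as cumulative sums, derivatives (rates of change), and moving averages.

This article covers the two basic types of pipeline aggregation and walks through examples of common ones: average, derivative, minimum, maximum, sum, and cumulative sum.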

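1. Types of pipeline aggregation

Pipeline aggregations fall into two families: parent and sibling.
A parent pipeline aggregation takes the output of its parent aggregation and uses it to compute new buckets or new aggregations that are added to the existing buckets. The derivative and cumulative sum aggregations are two common examples.

A sibling pipeline aggregation, by contrast, takes the output of a sibling aggregation and computes a new aggregation that sits at the same level as that sibling.

A pipeline aggregation needs the path to the aggregation it reads from. The buckets_path parameter supplies that path and identifies the metric to use. It follows this syntax: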

AGG_SEPARATOR       =  '>' ; 
METRIC_SEPARATOR    =  '.' ; 
AGG_NAME            =  <the name of the aggregation> ; 
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ; 
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ; 

For example, my_bucket>my_stats.sum refers to the sum value of the my_stats metric, which is contained in the my_bucket bucket aggregation.
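The grammar above can be read as "zero or more enclosing aggregation names separated by '>', then a final aggregation name with an optional '.metric' suffix". The following is a minimal Python sketch of that reading. The function name parse_buckets_path is hypothetical, and the sketch covers only the grammar shown here, not any other path forms Elasticsearch may accept.

```python
def parse_buckets_path(path):
    """Split a buckets_path into (list of aggregation names, metric or None)."""
    *parents, last = path.split(">")
    # Only the final segment may carry a ".metric" suffix.
    agg_name, _, metric = last.partition(".")
    return parents + [agg_name], (metric or None)


print(parse_buckets_path("my_bucket>my_stats.sum"))
print(parse_buckets_path("the_sum"))
```

The first call yields the aggregation chain my_bucket, my_stats together with the metric sum; the second yields a single aggregation name and no metric.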
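Note that paths are relative to the position of the pipeline aggregation, so a path cannot go back up the aggregation tree. In the following example, a derivative pipeline aggregation is embedded in a date_histogram and references its sibling metric the_sum: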

{ 
    "aggs": { 
        "total_monthly_visits":{ 
            "date_histogram":{ 
                "field":"date", 
                "interval":"month" 
            }, 
            "aggs":{ 
                "the_sum":{ 
                    "sum":{ "field": "visits" }  
                }, 
                "the_derivative":{ 
                    "derivative":{ "buckets_path": "the_sum" }  
                } 
            } 
        } 
    } 
} 

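A sibling pipeline aggregation can also be placed next to a series of buckets rather than inside them. In that case, the path must be spelled out in full, including the enclosing bucket aggregation: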

{ 
  "aggs": { 
    "visits_per_month": { 
      "date_histogram": { 
        "field": "date", 
        "interval": "month" 
      }, 
      "aggs": { 
        "total_visits": { 
          "sum": { 
            "field": "visits" 
          } 
        } 
      } 
    }, 
    "avg_monthly_visits": { 
      "avg_bucket": { 
        "buckets_path": "visits_per_month>total_visits"  
      } 
    } 
  } 
} 

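In the example above, avg_monthly_visits reaches the total_visits metric through the visits_per_month date histogram, which is its own sibling. The full path is therefore visits_per_month>total_visits.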
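To make the path resolution concrete, here is a rough Python sketch of how a two-part path such as visits_per_month>total_visits selects one value per bucket. It is not Elasticsearch code: the helper collect_values and the two-bucket sample with round numbers are made up for illustration.

```python
def collect_values(aggregations, buckets_path):
    """Return one metric value per bucket for a two-part path 'bucket_agg>metric'."""
    bucket_agg, metric = buckets_path.split(">")
    return [bucket[metric]["value"] for bucket in aggregations[bucket_agg]["buckets"]]


# Hypothetical, shortened aggregation output with only two monthly buckets.
aggregations = {
    "visits_per_month": {
        "buckets": [
            {"key_as_string": "2018-10-01", "total_visits": {"value": 100.0}},
            {"key_as_string": "2018-11-01", "total_visits": {"value": 300.0}},
        ]
    }
}

values = collect_values(aggregations, "visits_per_month>total_visits")
# A sibling aggregation such as avg_bucket then reduces these values to a single number.
avg_monthly_visits = sum(values) / len(values)
print(values, avg_monthly_visits)
```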

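An important point to remember is that pipeline aggregations cannot have sub-aggregations. Some of them, such as the derivative aggregation, can however reference another pipeline aggregation in their buckets_path, which allows pipeline aggregations to be chained. For example, chaining two first-order derivatives yields a second-order derivative (the derivative of a derivative, that is, the rate of change of a rate of change).

Metric and bucket aggregations handle missing data with the missing parameter. Pipeline aggregations use the gap_policy parameter instead, to deal with cases where documents lack the required field or where no documents match the query for one or more buckets. The parameter supports the following policies:

  • skip

Treats missing data as if the bucket did not exist. The aggregation skips the empty bucket and continues with the next available value.

  • insert_zeros

Replaces every missing value with 0, and the pipeline computation then proceeds as normal.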
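The gap_policy parameter is set next to buckets_path in the body of the pipeline aggregation. The Python sketch below only builds and prints such a body as JSON; the helper derivative_agg is hypothetical and does not talk to Elasticsearch. It assumes skip as the default, which is the documented default policy.

```python
import json


def derivative_agg(buckets_path, gap_policy="skip"):
    """Build the body of a derivative aggregation with an explicit gap_policy."""
    # Only the two policies described above are accepted in this sketch.
    if gap_policy not in ("skip", "insert_zeros"):
        raise ValueError(f"unsupported gap_policy: {gap_policy}")
    return {"derivative": {"buckets_path": buckets_path, "gap_policy": gap_policy}}


body = derivative_agg("the_sum", gap_policy="insert_zeros")
print(json.dumps(body, indent=2))
```

The printed object can be used as the value of a pipeline aggregation such as the_derivative in the first example above.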

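2. Worked examples

Test environment: Elasticsearch 7.x and Kibana 7.x. The queries below use the interval parameter of date_histogram; from 7.2 onwards it is deprecated in favour of calendar_interval and fixed_interval, but it still works in 7.x.

2.1. Preparing the test environment

Create the following index, whose mapping has three fields: date, visits, and max_time_spent.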

PUT /traffic_stats 
{ 
 "mappings": { 
       "properties": { 
          "date": { 
             "type": "date", 
             "format": "dateOptionalTime" 
          }, 
          "visits": { 
             "type": "integer" 
          }, 
           "max_time_spent": { 
               "type": "integer" 
           } 
       } 
    } 
} 

Insert the test data:

POST /traffic_stats/_bulk 
{"index":{}} 
{"visits":"488", "date":"2018-10-1", "max_time_spent":"900"} 
{"index":{}} 
{"visits":"783", "date":"2018-10-6", "max_time_spent":"928"} 
{"index":{}} 
{"visits":"789", "date":"2018-10-12", "max_time_spent":"1834"} 
{"index":{}} 
{"visits":"1299", "date":"2018-11-3", "max_time_spent":"592"} 
{"index":{}} 
{"visits":"394", "date":"2018-11-6", "max_time_spent":"1249"} 
{"index":{}} 
{"visits":"448", "date":"2018-11-24", "max_time_spent":"874"} 
{"index":{}} 
{"visits":"768", "date":"2018-12-18", "max_time_spent":"876"} 
{"index":{}} 
{"visits":"1194", "date":"2018-12-24", "max_time_spent":"1249"} 
{"index":{}} 
{"visits":"987", "date":"2018-12-28", "max_time_spent":"1599"} 
{"index":{}} 
{"visits":"872", "date":"2019-01-1", "max_time_spent":"828"} 
{"index":{}} 
{"visits":"972", "date":"2019-01-5", "max_time_spent":"723"} 
{"index":{}} 
{"visits":"827", "date":"2019-02-5", "max_time_spent":"1300"} 
{"index":{}} 
{"visits":"1584", "date":"2019-02-15", "max_time_spent":"1500"} 
{"index":{}} 
{"visits":"1604", "date":"2019-03-2", "max_time_spent":"1488"} 
{"index":{}} 
{"visits":"1499", "date":"2019-03-27", "max_time_spent":"1399"} 
{"index":{}} 
{"visits":"1392", "date":"2019-04-8", "max_time_spent":"1294"} 
{"index":{}} 
{"visits":"1247", "date":"2019-04-15", "max_time_spent":"1194"} 
{"index":{}} 
{"visits":"984", "date":"2019-05-15", "max_time_spent":"1184"} 
{"index":{}} 
{"visits":"1228", "date":"2019-05-18", "max_time_spent":"1485"} 
{"index":{}} 
{"visits":"1423", "date":"2019-06-14", "max_time_spent":"1452"} 
{"index":{}} 
{"visits":"1238", "date":"2019-06-24", "max_time_spent":"1329"} 
{"index":{}} 
{"visits":"1388", "date":"2019-07-14", "max_time_spent":"1542"} 
{"index":{}} 
{"visits":"1499", "date":"2019-07-24", "max_time_spent":"1742"} 
{"index":{}} 
{"visits":"1523", "date":"2019-08-13", "max_time_spent":"1552"} 
{"index":{}} 
{"visits":"1443", "date":"2019-08-19", "max_time_spent":"1511"} 
{"index":{}} 
{"visits":"1587", "date":"2019-09-14", "max_time_spent":"1497"} 
{"index":{}} 
{"visits":"1534", "date":"2019-09-27", "max_time_spent":"1434"} 

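With the environment and data in place, we start with the average bucket pipeline aggregation.

2.2. Average bucket pipeline aggregation

avg_bucket is a typical sibling pipeline aggregation. It computes the mean of a specified metric across all the buckets of a sibling aggregation. Two requirements apply: the sibling aggregation must be a multi-bucket aggregation, and the specified metric must be numeric.

To understand how a pipeline aggregation works, it helps to split the computation into stages. The query below has three. First, Elasticsearch builds a date histogram that buckets the documents by their date field at monthly intervals; the histogram produces several buckets, each containing several documents. Next, the sum sub-aggregation adds up the visits field within each monthly bucket. Finally, the avg_bucket pipeline aggregation reads those per-bucket sums and averages them across all buckets, giving the average number of blog visits per month.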

GET /traffic_stats/_search?size=0 
{ 
  "aggs": { 
    "visits_per_month": { 
      "date_histogram": { 
        "field": "date", 
        "interval": "month" 
      }, 
      "aggs": { 
        "total_visits": { 
          "sum": { 
            "field": "visits" 
          } 
        } 
      } 
    }, 
    "avg_monthly_visits": { 
      "avg_bucket": { 
        "buckets_path": "visits_per_month>total_visits"  
      } 
    } 
  } 
} 

Response:

{ 
  "took" : 1184, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          } 
        } 
      ] 
    }, 
    "avg_monthly_visits" : { 
      "value" : 2582.8333333333335 
    } 
  } 
} 
 

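The average number of blog visits per month is 2582.83. Tracing the steps described above should make the processing flow clear: a pipeline aggregation takes the intermediate results of bucket or metric aggregations and adds a further computed result on top of them.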
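The three stages can be reproduced outside Elasticsearch as a sanity check. The sketch below is plain Python, not the Elasticsearch implementation: it groups the test documents by the year-month prefix of their date (a simplification of what the date histogram does), sums visits per group, and averages the sums.

```python
from collections import defaultdict

# (date, visits) pairs copied from the bulk request in section 2.1.
docs = [
    ("2018-10-1", 488), ("2018-10-6", 783), ("2018-10-12", 789),
    ("2018-11-3", 1299), ("2018-11-6", 394), ("2018-11-24", 448),
    ("2018-12-18", 768), ("2018-12-24", 1194), ("2018-12-28", 987),
    ("2019-01-1", 872), ("2019-01-5", 972),
    ("2019-02-5", 827), ("2019-02-15", 1584),
    ("2019-03-2", 1604), ("2019-03-27", 1499),
    ("2019-04-8", 1392), ("2019-04-15", 1247),
    ("2019-05-15", 984), ("2019-05-18", 1228),
    ("2019-06-14", 1423), ("2019-06-24", 1238),
    ("2019-07-14", 1388), ("2019-07-24", 1499),
    ("2019-08-13", 1523), ("2019-08-19", 1443),
    ("2019-09-14", 1587), ("2019-09-27", 1534),
]

# Stages 1 and 2: bucket by month (the "YYYY-MM" prefix) and sum visits per bucket.
total_visits = defaultdict(int)
for date, visits in docs:
    total_visits[date[:7]] += visits

# Stage 3: average the per-bucket sums, as avg_bucket does.
avg_monthly_visits = sum(total_visits.values()) / len(total_visits)
print(dict(total_visits))
print(round(avg_monthly_visits, 2))
```

The per-month sums match the total_visits values in the response, and the rounded average is 2582.83.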

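2.3. Derivative pipeline aggregation

This is a parent pipeline aggregation that computes the derivative of a specified metric in a parent histogram or date histogram. It has two requirements:

  • The metric must be numeric; otherwise no derivative can be computed.
  • The min_doc_count of the enclosing histogram must be 0 (the default for histogram aggregations). If min_doc_count is greater than 0, some buckets are omitted, which leads to wrong or confusing derivative values.

Mathematically, the derivative of a function measures how sensitive the function's value (output) is to changes in its argument (input); in other words, it describes how fast the function changes with respect to a variable. For our data, the derivative aggregation computes the change in a metric relative to the previous period. We start with the first-order derivative, which tells us whether the metric is rising or falling and by how much: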

GET /traffic_stats/_search?size=0 
{ 
  "aggs" : { 
      "visits_per_month" : { 
          "date_histogram" : { 
              "field" : "date", 
              "interval" : "month" 
          }, 
          "aggs": { 
              "total_visits": { 
                  "sum": { 
                      "field": "visits" 
                  } 
              }, 
              "visits_deriv": { 
                  "derivative": { 
                      "buckets_path": "total_visits"  
                  } 
              } 
          } 
      } 
  } 
} 

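buckets_path tells the derivative aggregation to use the output of total_visits, a metric that sits next to it inside the parent date histogram. Because the derivative is a parent pipeline aggregation, it must be nested inside that histogram, and the path is relative to it. The response is as follows: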

{ 
  "took" : 61, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          }, 
          "visits_deriv" : { 
            "value" : 81.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          }, 
          "visits_deriv" : { 
            "value" : 808.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          }, 
          "visits_deriv" : { 
            "value" : -1105.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          }, 
          "visits_deriv" : { 
            "value" : 567.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          }, 
          "visits_deriv" : { 
            "value" : 692.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          }, 
          "visits_deriv" : { 
            "value" : -464.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          }, 
          "visits_deriv" : { 
            "value" : -427.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          }, 
          "visits_deriv" : { 
            "value" : 449.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          }, 
          "visits_deriv" : { 
            "value" : 226.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          }, 
          "visits_deriv" : { 
            "value" : 79.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          }, 
          "visits_deriv" : { 
            "value" : 155.0 
          } 
        } 
      ] 
    } 
  } 
} 

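Comparing two adjacent buckets shows that the derivative is the current bucket's value minus the previous bucket's value (the first bucket has no predecessor and therefore no derivative). For example: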

 { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          }, 
          "visits_deriv" : { 
            "value" : 81.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          }, 
          "visits_deriv" : { 
            "value" : 808.0 
          } 
        } 

December's total is 2949 and November's is 2141, so the derivative for December is their difference, 808.
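The same differences can be checked with a few lines of Python. This is a simplified model of the calculation for the data in this article (no gaps, no unit normalization), not Elasticsearch's own code; the function name derivative is chosen for illustration.

```python
# Monthly total_visits values from the response above, October 2018 to September 2019.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103, 2639, 2212, 2661, 2887, 2966, 3121]


def derivative(values):
    """Current value minus previous value; None where either one is missing."""
    result, previous = [], None
    for current in values:
        if previous is None or current is None:
            result.append(None)
        else:
            result.append(current - previous)
        previous = current
    return result


visits_deriv = derivative(monthly_totals)
print(visits_deriv)
```

The first entry is None because October 2018 has no predecessor; the remaining entries match the visits_deriv values in the response.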

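2.4. Second-order derivative pipeline aggregation

A second-order derivative is the derivative of a derivative: it measures how the rate of change of a quantity is itself changing. In Elasticsearch, it can be computed by chaining one derivative pipeline aggregation to another. The first-order derivative is computed first, and the second-order derivative is then computed from it: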

GET /traffic_stats/_search?size=0 
{ 
    "aggs" : { 
        "visits_per_month" : { 
            "date_histogram" : { 
                "field" : "date", 
                "interval" : "month" 
            }, 
            "aggs": { 
                "total_visits": { 
                    "sum": { 
                        "field": "visits" 
                    } 
                }, 
                "visits_deriv": { 
                    "derivative": { 
                        "buckets_path": "total_visits" 
                    } 
                }, 
                "visits_2nd_deriv": { 
                    "derivative": { 
                        "buckets_path": "visits_deriv"  
                    } 
                } 
            } 
        } 
    } 
} 

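The first-order derivative uses the path total_visits, so it depends on the sum aggregation, while the second-order derivative uses the path visits_deriv, that is, the first-order derivative. The second-order derivative is therefore a chain of two pipeline aggregations. Response: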

{ 
  "took" : 6, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          }, 
          "visits_deriv" : { 
            "value" : 81.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          }, 
          "visits_deriv" : { 
            "value" : 808.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 727.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          }, 
          "visits_deriv" : { 
            "value" : -1105.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : -1913.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          }, 
          "visits_deriv" : { 
            "value" : 567.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 1672.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          }, 
          "visits_deriv" : { 
            "value" : 692.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 125.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          }, 
          "visits_deriv" : { 
            "value" : -464.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : -1156.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          }, 
          "visits_deriv" : { 
            "value" : -427.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 37.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          }, 
          "visits_deriv" : { 
            "value" : 449.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 876.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          }, 
          "visits_deriv" : { 
            "value" : 226.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : -223.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          }, 
          "visits_deriv" : { 
            "value" : 79.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : -147.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          }, 
          "visits_deriv" : { 
            "value" : 155.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 76.0 
          } 
        } 
      ] 
    } 
  } 
} 

Compare two adjacent buckets:

        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          }, 
          "visits_deriv" : { 
            "value" : 79.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : -147.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          }, 
          "visits_deriv" : { 
            "value" : 155.0 
          }, 
          "visits_2nd_deriv" : { 
            "value" : 76.0 
          } 
        } 

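The first-order derivatives for August and September are 79 and 155, so the second-order derivative for September is their difference, 76.

In principle, three or more derivative aggregations could be chained to compute third-order, fourth-order, or even higher-order derivatives, but these are of little value for most data. Note that the first two buckets have no second-order derivative, because at least two first-order derivative values are needed to compute one.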
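Chaining can be modelled in Python by applying the same difference step repeatedly. As before, this is an illustrative sketch for the gap-free data in this article rather than Elasticsearch's implementation, and the names derivative, visits_2nd_deriv, and visits_3rd_deriv are chosen for the example.

```python
# Monthly total_visits values, October 2018 to September 2019.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103, 2639, 2212, 2661, 2887, 2966, 3121]


def derivative(values):
    """Current value minus previous value; None where either one is missing."""
    result, previous = [], None
    for current in values:
        if previous is None or current is None:
            result.append(None)
        else:
            result.append(current - previous)
        previous = current
    return result


visits_deriv = derivative(monthly_totals)        # first order
visits_2nd_deriv = derivative(visits_deriv)      # second order: derivative of the derivative
visits_3rd_deriv = derivative(visits_2nd_deriv)  # third order, shown only to illustrate chaining

print(visits_2nd_deriv)
print(visits_3rd_deriv)
```

Each additional level loses one more leading bucket: the second-order list starts with two None entries and the third-order list with three.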

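2.5. Min and max bucket pipeline aggregations

max_bucket is a sibling pipeline aggregation that finds the bucket (or buckets) with the largest value of a specified metric in a sibling aggregation and outputs both that value and the keys of the matching buckets. The metric must be numeric, and the sibling aggregation must be a multi-bucket aggregation.

In the following example, max_bucket finds the largest monthly total among all the months produced by the date histogram. It reads the results of the sum aggregation total_visits inside the sibling visits_per_month aggregation.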

GET /traffic_stats/_search?size=0 
{ 
  "aggs": { 
    "visits_per_month": { 
      "date_histogram": { 
        "field": "date", 
        "interval": "month" 
      }, 
      "aggs": { 
        "total_visits": { 
          "sum": { 
            "field": "visits" 
          } 
        } 
      } 
    }, 
    "max_monthly_visits": { 
      "max_bucket": { 
        "buckets_path": "visits_per_month>total_visits"  
      } 
    } 
  } 
} 

Response:

{ 
  "took" : 8, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          } 
        } 
      ] 
    }, 
    "max_monthly_visits" : { 
      "value" : 3121.0, 
      "keys" : [ 
        "2019-09-01T00:00:00.000Z" 
      ] 
    } 
  } 
} 
 

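The sum aggregation computes the total visits for each monthly bucket, and the max_bucket pipeline aggregation then finds the bucket with the largest total: 3121, in the bucket for September 2019 (key 2019-09-01).

min_bucket works the same way. We only need to replace max_bucket with min_bucket in the query (and rename the aggregation accordingly):
"min_monthly_visits": { "min_bucket": { "buckets_path": "visits_per_month>total_visits" } }

Result: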

"min_monthly_visits" : { 
    "value" : 1844.0, 
    "keys" : [ 
    "2019-01-01T00:00:00.000Z" 
    ] 
} 
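Both results can be cross-checked with a short Python sketch over the monthly totals. The helper extreme_bucket is hypothetical and uses shortened date keys; it only mirrors the idea of returning the extreme value together with every bucket key that holds it.

```python
# Bucket key (shortened to the date part) -> total_visits, taken from the response above.
monthly_totals = {
    "2018-10-01": 2060, "2018-11-01": 2141, "2018-12-01": 2949,
    "2019-01-01": 1844, "2019-02-01": 2411, "2019-03-01": 3103,
    "2019-04-01": 2639, "2019-05-01": 2212, "2019-06-01": 2661,
    "2019-07-01": 2887, "2019-08-01": 2966, "2019-09-01": 3121,
}


def extreme_bucket(totals, pick):
    """Return the value chosen by `pick` (max or min) and every key that has it."""
    value = pick(totals.values())
    return {"value": value, "keys": [key for key, v in totals.items() if v == value]}


max_monthly_visits = extreme_bucket(monthly_totals, max)
min_monthly_visits = extreme_bucket(monthly_totals, min)
print(max_monthly_visits)
print(min_monthly_visits)
```

The keys field is a list because several buckets can share the same extreme value; in this data set each extreme occurs only once.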

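2.6. Sum and cumulative sum pipeline aggregations

Sometimes we need the sum of a value across all the buckets produced by another aggregation. The sum_bucket pipeline aggregation, a sibling aggregation, does this. The following query computes the sum of all monthly visit totals: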

GET /traffic_stats/_search?size=0 
{ 
  "aggs": { 
    "visits_per_month": { 
      "date_histogram": { 
        "field": "date", 
        "interval": "month" 
      }, 
      "aggs": { 
        "total_visits": { 
          "sum": { 
            "field": "visits" 
          } 
        } 
      } 
    }, 
    "sum_monthly_visits": { 
      "sum_bucket": { 
        "buckets_path": "visits_per_month>total_visits"  
      } 
    } 
  } 
} 

The pipeline aggregation reads total_visits, the number of visits per month, through the sibling visits_per_month aggregation. Response:

{ 
  "took" : 6, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          } 
        } 
      ] 
    }, 
    "sum_monthly_visits" : { 
      "value" : 30994.0 
    } 
  } 
} 

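sum_bucket simply adds up the visit totals of all months, that is, the intermediate results produced by the sum aggregation inside the sibling histogram.

The cumulative sum aggregation takes a different approach. A cumulative sum is the sequence of partial sums of a given sequence: for the sequence {a, b, c, …}, the cumulative sums are a, a+b, a+b+c, …

cumulative_sum is a parent pipeline aggregation that computes the cumulative sum of a specified metric in a parent histogram (or date histogram) aggregation. As with other parent pipeline aggregations, the metric must be numeric and the min_doc_count of the enclosing histogram must be 0 (the default).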

GET /traffic_stats/_search?size=0 
{ 
    "aggs" : { 
        "visits_per_month" : { 
            "date_histogram" : { 
                "field" : "date", 
                "interval" : "month" 
            }, 
            "aggs": { 
                "total_visits": { 
                    "sum": { 
                        "field": "visits" 
                    } 
                }, 
                "cumulative_visits": { 
                    "cumulative_sum": { 
                        "buckets_path": "total_visits"  
                    } 
                } 
            } 
        } 
    } 
} 

Response:

{ 
  "took" : 8, 
  "timed_out" : false, 
  "_shards" : { 
    "total" : 1, 
    "successful" : 1, 
    "skipped" : 0, 
    "failed" : 0 
  }, 
  "hits" : { 
    "total" : { 
      "value" : 27, 
      "relation" : "eq" 
    }, 
    "max_score" : null, 
    "hits" : [ ] 
  }, 
  "aggregations" : { 
    "visits_per_month" : { 
      "buckets" : [ 
        { 
          "key_as_string" : "2018-10-01T00:00:00.000Z", 
          "key" : 1538352000000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2060.0 
          }, 
          "cumulative_visits" : { 
            "value" : 2060.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-11-01T00:00:00.000Z", 
          "key" : 1541030400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2141.0 
          }, 
          "cumulative_visits" : { 
            "value" : 4201.0 
          } 
        }, 
        { 
          "key_as_string" : "2018-12-01T00:00:00.000Z", 
          "key" : 1543622400000, 
          "doc_count" : 3, 
          "total_visits" : { 
            "value" : 2949.0 
          }, 
          "cumulative_visits" : { 
            "value" : 7150.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-01-01T00:00:00.000Z", 
          "key" : 1546300800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 1844.0 
          }, 
          "cumulative_visits" : { 
            "value" : 8994.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-02-01T00:00:00.000Z", 
          "key" : 1548979200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2411.0 
          }, 
          "cumulative_visits" : { 
            "value" : 11405.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-03-01T00:00:00.000Z", 
          "key" : 1551398400000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3103.0 
          }, 
          "cumulative_visits" : { 
            "value" : 14508.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-04-01T00:00:00.000Z", 
          "key" : 1554076800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2639.0 
          }, 
          "cumulative_visits" : { 
            "value" : 17147.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-05-01T00:00:00.000Z", 
          "key" : 1556668800000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2212.0 
          }, 
          "cumulative_visits" : { 
            "value" : 19359.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-06-01T00:00:00.000Z", 
          "key" : 1559347200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2661.0 
          }, 
          "cumulative_visits" : { 
            "value" : 22020.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-07-01T00:00:00.000Z", 
          "key" : 1561939200000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2887.0 
          }, 
          "cumulative_visits" : { 
            "value" : 24907.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-08-01T00:00:00.000Z", 
          "key" : 1564617600000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 2966.0 
          }, 
          "cumulative_visits" : { 
            "value" : 27873.0 
          } 
        }, 
        { 
          "key_as_string" : "2019-09-01T00:00:00.000Z", 
          "key" : 1567296000000, 
          "doc_count" : 2, 
          "total_visits" : { 
            "value" : 3121.0 
          }, 
          "cumulative_visits" : { 
            "value" : 30994.0 
          } 
        } 
      ] 
    } 
  } 
} 

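The aggregation starts with the value of the first bucket, adds the value of the second bucket to it, then adds the next bucket's value to that result, and so on. The value in the last bucket, 30994, is therefore the sum over all buckets.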
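The relationship between the two aggregations in this section can be checked in Python with itertools.accumulate, which produces exactly the partial sums a, a+b, a+b+c, … This sketch works on the monthly totals only and is not how Elasticsearch computes the result internally.

```python
from itertools import accumulate

# Monthly total_visits values, October 2018 to September 2019.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103, 2639, 2212, 2661, 2887, 2966, 3121]

# Running total per bucket, as cumulative_sum reports it.
cumulative_visits = list(accumulate(monthly_totals))

# Single total over all buckets, as sum_bucket reports it.
sum_monthly_visits = sum(monthly_totals)

print(cumulative_visits)
print(sum_monthly_visits)
```

The last running total equals the sum_bucket result, which is one way to confirm that the two responses above agree.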

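3. Summary

Pipeline aggregations implement more complex calculations over the intermediate results produced by other aggregations. They make it possible to compute measures such as derivatives, second-order derivatives, and moving averages, which are not calculated directly from the document data but through several intermediate steps.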


Reference: https://blog.csdn.net/neweastsun/article/details/104395294