Elasticsearch-68-Aggregation Analysis in Depth I

Background

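This article uses TV sales data from a home-appliance store as its running example and analyzes that data with aggregations from several angles.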

Preparing the data

Create the index tvs

PUT /tvs
{
  "mappings": {
    "sales": {
      "properties": {
        "price": {
          "type": "long"
        },
        "color": {
          "type": "keyword"
        },
        "brand": {
          "type": "keyword"
        },
        "sold_date": {
          "type": "date"
        }
      }
    }
  }
}

Add test data

POST /tvs/sales/_bulk
{ "index": {}}
{ "price" : 1000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-10-28" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 3000, "color" : "绿色", "brand" : "小米", "sold_date" : "2016-05-18" }
{ "index": {}}
{ "price" : 1500, "color" : "蓝色", "brand" : "TCL", "sold_date" : "2016-07-02" }
{ "index": {}}
{ "price" : 1200, "color" : "绿色", "brand" : "TCL", "sold_date" : "2016-08-19" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 8000, "color" : "红色", "brand" : "三星", "sold_date" : "2017-01-01" }
{ "index": {}}
{ "price" : 2500, "color" : "蓝色", "brand" : "小米", "sold_date" : "2017-02-12" }
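The bulk request above is newline-delimited JSON: each document takes two lines, an action line followed by the document source, and the body ends with a newline. For readers who generate such a body from a script, here is a minimal Python sketch. The helper name build_bulk_body is hypothetical, and only two of the eight documents are included to keep it short.

```python
import json

# Two of the documents from the bulk request above.
docs = [
    {"price": 1000, "color": "红色", "brand": "长虹", "sold_date": "2016-10-28"},
    {"price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05"},
]


def build_bulk_body(documents):
    """Return an NDJSON string: one action line, then one source line, per document."""
    lines = []
    for doc in documents:
        # Empty index action: index name, type and id come from the URL / are auto-generated.
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps the Chinese values readable instead of \u escapes.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(docs)
print(body)
```

The resulting string can be pasted after the POST /tvs/sales/_bulk line or sent with any HTTP client.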

Count sales by color

Request body

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      }
    }
  }
}

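The parameters in the request body:

size: when set to 0, only the aggregation results are returned; the matching documents themselves are not
aggs: fixed keyword that introduces the aggregations to run over the data
popular_colors: every aggregation needs a name; the name is user-defined
terms: group (bucket) the documents by the values of a field
field: the field to group on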
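To make the roles of terms and field concrete, here is a minimal Python sketch that reproduces the same grouping locally on the eight test documents. It only illustrates the idea of one bucket per distinct value; it is not how Elasticsearch implements the aggregation, and the function name terms_buckets is made up for this example.

```python
from collections import Counter

# (price, color, brand) for the eight test documents indexed above.
SALES = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def terms_buckets(rows):
    """Create one bucket per distinct color and count the documents in it."""
    counts = Counter(color for _price, color, _brand in rows)
    return [{"key": key, "doc_count": n} for key, n in counts.items()]


buckets = terms_buckets(SALES)
for bucket in buckets:
    print(bucket)
```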

Response to the request above:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红色",
          "doc_count": 4
        },
        {
          "key": "绿色",
          "doc_count": 2
        },
        {
          "key": "蓝色",
          "doc_count": 2
        }
      ]
    }
  }
}

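The fields in the response:

hits.hits: empty because we set size to 0; otherwise the matching source documents would be returned here (10 by default)
aggregations: the aggregation results
popular_colors: the name we specified in the request
buckets: the buckets created from the field we specified
key: the field value that the bucket represents
doc_count: the number of documents in the bucket

By default the buckets are sorted by doc_count in descending order.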
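The default ordering can be pictured as a sort over the buckets. The sketch below sorts by doc_count descending and, as an assumption chosen to match the response shown above, breaks ties by key in ascending order.

```python
# The three buckets from the response above, deliberately shuffled.
buckets = [
    {"key": "蓝色", "doc_count": 2},
    {"key": "红色", "doc_count": 4},
    {"key": "绿色", "doc_count": 2},
]

# Largest doc_count first; the tie-break on key is an assumption for illustration.
ordered = sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))

for bucket in ordered:
    print(bucket["key"], bucket["doc_count"])
```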

Compute the average price of each color

Request body:

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "price_avg": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

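As before, the documents are bucketed by color, and each color bucket reports its document count (doc_count). That part is purely a bucket operation; doc_count is in effect a built-in metric that Elasticsearch computes for every bucket by default.

The average computed in the request above is a metric aggregation that runs on each bucket.

Looking at the request body: inside the named aggregation colors, at the same level as the bucket operation (terms), there is a second aggs. This inner aggs also gets a name (price_avg) and runs a metric operation, avg, which averages the specified field over the documents in each of the buckets.

The following part of the request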

"aggs": {
    "price_avg": {
      "avg": {
        "field": "price"
      }
    }
  }

is the metric operation, which is run on every bucket produced by the grouping.
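The split between the bucket step and the metric step can be sketched in plain Python: first group the prices by color, then compute the average inside each group. This is an illustration on the test data only; the function name avg_price_by_color is hypothetical.

```python
from collections import defaultdict

# (price, color, brand) for the eight test documents.
SALES = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def avg_price_by_color(rows):
    """Bucket step: collect prices per color. Metric step: average each bucket."""
    prices = defaultdict(list)
    for price, color, _brand in rows:
        prices[color].append(price)
    return {
        color: {"doc_count": len(p), "price_avg": sum(p) / len(p)}
        for color, p in prices.items()
    }


result = avg_price_by_color(SALES)
for color, stats in result.items():
    print(color, stats)
```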

Response to the request:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红色",
          "doc_count": 4,
          "price_avg": {
            "value": 3250
          }
        },
        {
          "key": "绿色",
          "doc_count": 2,
          "price_avg": {
            "value": 2100
          }
        },
        {
          "key": "蓝色",
          "doc_count": 2,
          "price_avg": {
            "value": 2000
          }
        }
      ]
    }
  }
}

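Looking at the response, each bucket now contains, in addition to key and doc_count:

price_avg: the name we chose when sending the request
value: the result of the metric, that is, the average of the price field over the documents in the bucket

Expressed as SQL, this request is roughly:

select avg(price) from tvs.sales group by color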
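To check the SQL analogy against the test data, the query can be run in an in-memory SQLite database from Python. This is only a sketch: the table is called sales rather than tvs.sales, and the color and COUNT(*) columns plus the ORDER BY clause are additions made here so that the output lines up with the Elasticsearch buckets.

```python
import sqlite3

# (price, color, brand) for the eight test documents.
SALES = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (price INTEGER, color TEXT, brand TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)

# GROUP BY plays the role of the terms bucket, AVG the role of the avg metric.
rows = conn.execute(
    "SELECT color, COUNT(*), AVG(price) FROM sales "
    "GROUP BY color ORDER BY COUNT(*) DESC, color"
).fetchall()
conn.close()

for row in rows:
    print(row)
```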

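Drill-down analysis

Requirement: drill down from color to brand, computing the average price of each color as well as the average price of each brand within each color.

Drilling down means that the data has already been grouped once and the data inside each group is then grouped again. In this example, after grouping by color, each color group can be further grouped by brand, and the aggregation is finally run on each of the finest-grained groups. That is drill-down analysis.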
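The idea of grouping inside a group can be sketched locally as a dictionary of dictionaries: the outer key is the color, the inner key is the brand. The sketch below runs on the test data; the function name drill_down and the result layout are made up for illustration, while color_avg_price mirrors the name used in the search request.

```python
from collections import defaultdict

# (price, color, brand) for the eight test documents.
SALES = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def drill_down(rows):
    """Group by color, then by brand inside each color, averaging at both levels."""
    by_color = defaultdict(lambda: defaultdict(list))
    for price, color, brand in rows:
        by_color[color][brand].append(price)

    result = {}
    for color, brands in by_color.items():
        all_prices = [p for prices in brands.values() for p in prices]
        result[color] = {
            # Average over every document of this color.
            "color_avg_price": sum(all_prices) / len(all_prices),
            # Average per brand within this color.
            "brands": {b: sum(ps) / len(ps) for b, ps in brands.items()},
        }
    return result


result = drill_down(SALES)
for color, stats in result.items():
    print(color, stats)
```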

Search request:

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "color_avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand":{
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "brand_avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

Response:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_color": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红色",
          "doc_count": 4,
          "color_avg_price": {
            "value": 3250
          },
          "group_by_brand": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "长虹",
                "doc_count": 3,
                "brand_avg_price": {
                  "value": 1666.6666666666667
                }
              },
              {
                "key": "三星",
                "doc_count": 1,
                "brand_avg_price": {
                  "value": 8000
                }
              }
            ]
          }
        },
        {
          "key": "绿色",
          "doc_count": 2,
          "color_avg_price": {
            "value": 2100
          },
          "group_by_brand": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "TCL",
                "doc_count": 1,
                "brand_avg_price": {
                  "value": 1200
                }
              },
              {
                "key": "小米",
                "doc_count": 1,
                "brand_avg_price": {
                  "value": 3000
                }
              }
            ]
          }
        },
        {
          "key": "蓝色",
          "doc_count": 2,
          "color_avg_price": {
            "value": 2000
          },
          "group_by_brand": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "TCL",
                "doc_count": 1,
                "brand_avg_price": {
                  "value": 1500
                }
              },
              {
                "key": "小米",
                "doc_count": 1,
                "brand_avg_price": {
                  "value": 2500
                }
              }
            ]
          }
        }
      ]
    }
  }
}

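Looking at the search request first: inside each color bucket it computes the average price, and it also groups a second time, by brand (group_by_brand). Within each brand bucket it then computes the average price for that combination of color and brand.

The response has essentially the same structure as the request: the buckets by color come first, and nested inside each of them are the buckets by brand.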
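Because the response is nested the same way as the request, client code usually walks it with one loop per level. Below is a minimal Python sketch that flattens the nested buckets into (color, brand, doc_count, average price) rows. The function name flatten is hypothetical, and sample is a trimmed copy of the 红色 bucket from the response above.

```python
def flatten(response):
    """Turn nested color -> brand buckets into flat (color, brand, count, avg) rows."""
    rows = []
    for color_bucket in response["aggregations"]["group_by_color"]["buckets"]:
        for brand_bucket in color_bucket["group_by_brand"]["buckets"]:
            rows.append((
                color_bucket["key"],
                brand_bucket["key"],
                brand_bucket["doc_count"],
                brand_bucket["brand_avg_price"]["value"],
            ))
    return rows


# Trimmed copy of the response above: only the 红色 bucket is kept.
sample = {
    "aggregations": {
        "group_by_color": {
            "buckets": [
                {
                    "key": "红色",
                    "doc_count": 4,
                    "color_avg_price": {"value": 3250},
                    "group_by_brand": {
                        "buckets": [
                            {"key": "长虹", "doc_count": 3,
                             "brand_avg_price": {"value": 1666.6666666666667}},
                            {"key": "三星", "doc_count": 1,
                             "brand_avg_price": {"value": 8000}},
                        ]
                    },
                }
            ]
        }
    }
}

flat_rows = flatten(sample)
for row in flat_rows:
    print(row)
```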