starry
2023-08-03

ElasticSearch

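# Overview

# What is Elasticsearch

Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured data. It is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. The Elastic Stack is commonly called the ELK Stack (after Elasticsearch, Logstash, and Kibana). It now also includes a rich collection of lightweight shipping agents, known as Beats, for sending data to Elasticsearch.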

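# What is Elasticsearch used for?

Elasticsearch is fast, scales well, and can index many types of content, so it fits a number of use cases:

  • Application search
  • Website search
  • Enterprise search
  • Logging and log analytics
  • Infrastructure metrics and container monitoring
  • Application performance monitoring
  • Geospatial data analysis and visualization
  • Security analytics
  • Business analytics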

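# How does Elasticsearch work?

Raw data flows into Elasticsearch from a variety of sources, including logs, system metrics, and web applications. Data ingestion is the process by which this raw data is parsed, normalized, and enriched before it is indexed in Elasticsearch. Once the data is indexed, users can run complex queries against it and use aggregations to retrieve complex summaries of it. In Kibana, users can create powerful visualizations of their data, share dashboards, and manage the Elastic Stack.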

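# What is an Elasticsearch index?

An Elasticsearch index is a collection of documents that are related to each other. Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names of fields or properties) with their corresponding values (strings, numbers, Booleans, dates, arrays of values, geolocations, or other types of data).

Elasticsearch uses a data structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents in which each word occurs.

During indexing, Elasticsearch stores documents and builds an inverted index so that the document data is searchable in near real time. Indexing is initiated through the index API, which lets you add a JSON document to a specific index or update one already in it.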

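# What is Kibana used for?

Kibana is a data visualization and management tool for Elasticsearch that provides real-time histograms, line graphs, pie charts, and maps. Kibana also includes advanced applications such as Canvas, which lets users create custom dynamic infographics based on their data, and Elastic Maps, which visualizes geospatial data.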

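# Why use Elasticsearch?

**Elasticsearch is fast.** Because it is built on top of Lucene, it excels at full-text search. It is also a near-real-time search platform: the latency between indexing a document and that document becoming searchable is very short, typically one second. This makes Elasticsearch well suited to time-sensitive use cases such as security analytics and infrastructure monitoring.

**Elasticsearch is distributed by nature.** The documents stored in Elasticsearch are distributed across containers called shards, which can be replicated to provide redundant copies of the data in case of hardware failure. Its distributed nature lets Elasticsearch scale out to hundreds (or even thousands) of servers and handle petabytes of data.

**Elasticsearch comes with a wide set of features.** In addition to speed, scalability, and resiliency, Elasticsearch has many powerful built-in features, such as data rollups and index lifecycle management, that make storing and searching data more efficient.

**The Elastic Stack simplifies data ingestion, visualization, and reporting.** Integration with Beats and Logstash makes it easy to process data before indexing it into Elasticsearch. Kibana provides real-time visualization of Elasticsearch data as well as UIs for quickly accessing application performance monitoring (APM), logs, and infrastructure metrics data.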

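# What is Logstash used for?

Logstash is one of the core products of the Elastic Stack. It aggregates and processes data and sends it to Elasticsearch. Logstash is an open source, server-side data processing pipeline that lets you ingest data from multiple sources simultaneously, then enrich and transform it before it is indexed into Elasticsearch.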

# Installing ES

  • Running Elasticsearch requires an installed and configured JDK
    • Set $JAVA_HOME
  • Java requirements by version
    • Elasticsearch 5 requires Java 8 or later
    • Elasticsearch supports Java 11 from 6.5 onward
    • From 7.0 onward, a Java runtime is bundled
    • https://www.elastic.co/cn/support/matrix#matrix_jvm
  • Download and unpack
    • https://www.elastic.co/cn/downloads/past-releases#elasticsearch
    • https://www.elastic.co/guide/en/elasticsearch/reference/7.1/targz.html

# Running

  • Running ./bin/elasticsearch as root fails at startup, because ES refuses to run as the root user
    • Create an elsearch group and an elsearch user:
groupadd elsearch
useradd elsearch -g elsearch
passwd elsearch
  • Change the owner and group of the elasticsearch directory and everything in it to elsearch:elsearch
cd /usr/local/src
chown -R elsearch:elsearch  elasticsearch-7.1.0
  • Switch to the elsearch user, then start ES
su elsearch
cd /usr/local/src/elasticsearch-7.1.0
./bin/elasticsearch
  • Test: curl http://localhost:9200

# JVM configuration

  • config/jvm.options
    • The default heap size in 7.1 is 1 GB
  • Recommendations
    • Set Xms and Xmx to the same value
    • Keep Xmx at no more than 50% of the machine's memory
    • Do not exceed 30 GB: https://www.elastic.co/cn/blog/a-heap-of-trouble
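
The sizing rules above can be expressed as a small calculation. This is only a sketch: the function names are made up for illustration, and the 30 GB cap is taken from the recommendation in these notes rather than from any exact JVM threshold.

```python
def recommended_heap_gb(machine_ram_gb: float, cap_gb: float = 30.0) -> float:
    """Heap size in GB: half of the machine's RAM, capped at cap_gb."""
    if machine_ram_gb <= 0:
        raise ValueError("machine_ram_gb must be positive")
    return min(machine_ram_gb / 2, cap_gb)


def jvm_heap_options(machine_ram_gb: float) -> list:
    """Return matching -Xms/-Xmx lines (same value for both) for jvm.options."""
    heap_mb = int(recommended_heap_gb(machine_ram_gb) * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


if __name__ == "__main__":
    for line in jvm_heap_options(16):
        print(line)
```

For a 16 GB machine this suggests an 8 GB heap, and for a 128 GB machine the cap applies, so the heap stays at 30 GB.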

# Installing and listing plugins

  • List the installed plugins
    • ./bin/elasticsearch-plugin list
    • On a fresh install, no plugins are listed
  • Install a plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/installation.html
    • ./bin/elasticsearch-plugin install analysis-icu (analysis-icu is an internationalization analysis plugin)
    • List the plugins again: ./bin/elasticsearch-plugin list
    • Restart ES (./bin/elasticsearch, or bin\elasticsearch.bat on Windows), then check with curl http://localhost:9200/_cat/plugins

# Running multiple ES instances on one machine

./bin/elasticsearch -E node.name=node0 -E cluster.name=mycluster -E path.data=node0_data -d
./bin/elasticsearch -E node.name=node1 -E cluster.name=mycluster -E path.data=node1_data -d
./bin/elasticsearch -E node.name=node2 -E cluster.name=mycluster -E path.data=node2_data -d
./bin/elasticsearch -E node.name=node3 -E cluster.name=mycluster -E path.data=node3_data -d

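node.name: the node name; cluster.name: the name of the cluster the node joins; path.data: where the node stores its data; -d: run in the background as a daemon

Because the ES instance started earlier was still running, port 9200 was already taken. Each new node tries the following ports until it finds a free one, which here is 9201. Use _cat/nodes to list the nodes: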

$ curl http://localhost:9201/_cat/nodes
127.0.0.1 34 98  2 0.79 0.32 0.18 mdi - node3
127.0.0.1 35 98 10 0.79 0.32 0.18 mdi * node0
127.0.0.1 32 98  8 0.79 0.32 0.18 mdi - node1
127.0.0.1 10 98  8 0.79 0.32 0.18 mdi - node2
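
To pick out the elected master from this output programmatically, the text can be split into columns. The sketch below assumes the default `_cat/nodes` column order (ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name) and node names without spaces. The helper names are illustrative only.

```python
CAT_NODES_COLUMNS = [
    "ip", "heap.percent", "ram.percent", "cpu",
    "load_1m", "load_5m", "load_15m", "node.role", "master", "name",
]

SAMPLE = """\
127.0.0.1 34 98  2 0.79 0.32 0.18 mdi - node3
127.0.0.1 35 98 10 0.79 0.32 0.18 mdi * node0
127.0.0.1 32 98  8 0.79 0.32 0.18 mdi - node1
127.0.0.1 10 98  8 0.79 0.32 0.18 mdi - node2
"""


def parse_cat_nodes(text):
    """Turn the whitespace-separated _cat/nodes output into a list of dicts."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != len(CAT_NODES_COLUMNS):
            continue  # skip lines that do not match the assumed layout
        nodes.append(dict(zip(CAT_NODES_COLUMNS, fields)))
    return nodes


def elected_master(nodes):
    """The elected master is the row marked with '*' in the master column."""
    for node in nodes:
        if node["master"] == "*":
            return node["name"]
    return None


if __name__ == "__main__":
    print(elected_master(parse_cat_nodes(SAMPLE)))
```

In the sample output, the `*` marks node0 as the elected master. The role string `mdi` means each node is master-eligible (m), a data node (d), and an ingest node (i).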

Stop all ES processes

 ps -ef |grep elasticsearch |awk '{print $2}'|xargs kill -9

# Installing Kibana

https://www.elastic.co/guide/en/kibana/current/targz.html

wget https://artifacts.elastic.co/downloads/kibana/kibana-7.1.0-linux-x86_64.tar.gz
tar -zxvf kibana-7.1.0-linux-x86_64.tar.gz

Edit the configuration file. By default Kibana only accepts local connections

vim ./config/kibana.yml
# Set server.host to 0.0.0.0 so that any IP can connect
server.host: "0.0.0.0"

Open http://192.168.83.130:5601/. Commands can be run directly in Kibana's Dev Tools

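# Keyboard shortcuts

  • Ctrl/Cmd + I: auto-indent the current request
  • Ctrl/Cmd + /: open the documentation for the current request
  • Ctrl + Space: open autocomplete (even when not typing)
  • Ctrl/Cmd + Enter: submit the request
  • Ctrl/Cmd + Up/Down: jump to the start or end of the previous/next request
  • Ctrl/Cmd + Alt + L: collapse/expand the current scope
  • Ctrl/Cmd + Option + 0: collapse all scopes except the current one; add Shift to expand
  • Down arrow: move focus to the autocomplete menu; use the arrow keys to select a term
  • Enter/Tab: choose the currently selected or topmost term in the autocomplete menu
  • Esc: close the autocomplete menu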

# Running ES, Kibana, and Cerebro in Docker

docker-compose.yml

version: '2.2'
services:
  cerebro:
    image: lmenezes/cerebro:0.8.3
    container_name: cerebro
    ports:
      - "9000:9000"
    command:
      - -Dhosts.0.host=http://elasticsearch:9200
    networks:
      - es7net
  kibana:
    image: docker.elastic.co/kibana/kibana:7.1.0
    container_name: kibana7
    environment:
      - I18N_LOCALE=zh-CN
      - XPACK_GRAPH_ENABLED=true
      - TIMELION_ENABLED=true
      - XPACK_MONITORING_COLLECTION_ENABLED="true"
    ports:
      - "5601:5601"
    networks:
      - es7net
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_01
    environment:
      - cluster.name=geektime
      - node.name=es7_01
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_01,es7_02
      - cluster.initial_master_nodes=es7_01,es7_02
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es7data1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - es7net
  elasticsearch2:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_02
    environment:
      - cluster.name=geektime
      - node.name=es7_02
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_01,es7_02
      - cluster.initial_master_nodes=es7_01,es7_02
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es7data2:/usr/share/elasticsearch/data
    networks:
      - es7net


volumes:
  es7data1:
    driver: local
  es7data2:
    driver: local

networks:
  es7net:
    driver: bridge

Increase the virtual memory limit

 vim /etc/sysctl.conf
 # Add this setting
 vm.max_map_count=655360
 # Reload the system configuration
 sysctl -p
 # Restart Docker
 service docker restart

Start the services defined in docker-compose.yml: docker-compose up


Related reading

  • Install Docker: https://www.docker.com/products/docker-desktop
  • Install docker-compose: https://docs.docker.com/compose/install/
  • How to build your own Docker image: https://www.elastic.co/cn/blog/how-to-make-a-dockerfile-for-elasticsearch
  • How to install Elasticsearch plugins in a Docker image: https://www.elastic.co/cn/blog/elasticsearch-docker-plugin-management
  • How to set up Docker networking: https://www.elastic.co/cn/blog/docker-networking
  • Cerebro source code: https://github.com/lmenezes/cerebro

Check that everything is running

  • elasticsearch: http://192.168.83.130:9200/ and http://192.168.83.130:9200/_cat/nodes
  • kibana: http://192.168.83.130:5601/
  • cerebro: http://192.168.83.130:9000/

# Importing data into ES with Logstash

Download: https://www.elastic.co/cn/downloads/logstash or https://www.elastic.co/cn/downloads/past-releases#logstash

sudo wget https://artifacts.elastic.co/downloads/logstash/logstash-7.1.0.tar.gz

Download the smallest MovieLens test dataset from https://grouplens.org/datasets/movielens/. The CSV file to import is movies.csv. Then write the Logstash configuration file logstash.conf (see https://www.elastic.co/guide/en/logstash/current/configuration.html):

input {
  file {
    path => "/usr/local/src/logstash-7.1.0/mydata/movies.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    columns => ["id","content","genre"]
  }

  mutate {
    split => { "genre" => "|" }
    remove_field => ["path", "host","@timestamp","message"]
  }

  mutate {

    split => ["content", "("]
    add_field => { "title" => "%{[content][0]}"}
    add_field => { "year" => "%{[content][1]}"}
  }

  mutate {
    convert => {
      "year" => "integer"
    }
    strip => ["title"]
    remove_field => ["path", "host","@timestamp","message","content"]
  }

}
output {
   elasticsearch {
     hosts => "http://localhost:9200"
     index => "movies"
     document_id => "%{id}"
   }
  stdout {}
}
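
To see what the filter stage does to a single CSV line, the same steps can be mimicked in plain Python. This is an approximation written for illustration, not how Logstash works internally. The `transform` helper is hypothetical, and it strips the trailing `)` from the year explicitly instead of relying on Logstash's integer conversion.

```python
import csv


def transform(line):
    """Approximate the csv + mutate filters above for one movies.csv line."""
    # csv filter: columns => ["id", "content", "genre"]
    doc_id, content, genre = next(csv.reader([line]))[:3]

    # mutate split: "content" on "(" gives [title, "year)"]
    parts = content.split("(")
    title = parts[0].strip()  # mutate strip => ["title"]

    # mutate convert: "year" => "integer" (None if no usable year is found)
    year = None
    if len(parts) > 1:
        digits = parts[1].rstrip(") ").strip()
        if digits.isdigit():
            year = int(digits)

    # mutate split: "genre" on "|"
    return {"id": doc_id, "title": title, "year": year, "genre": genre.split("|")}


if __name__ == "__main__":
    print(transform("1,Toy Story (1995),Adventure|Animation|Children"))
```

Like the configuration above, the sketch takes the text after the first `(` as the year. Titles that contain an earlier parenthesis therefore end up without a usable year.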

Start Logstash

sudo ./bin/logstash -f ./config/logstash.conf 

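# ES terminology

# Document

A document is a JSON document stored in Elasticsearch. It is like a row in a table in a relational database. Each document is stored in an index and has a type and an id. A document is a JSON object (also known in other languages as a hash, hashmap, or associative array) that contains zero or more fields, or key-value pairs. The original JSON document that was indexed is stored in the _source field, which is returned by default when getting or searching for a document.

  • ES is document-oriented, and a document is the smallest unit of searchable data
    • A log entry in a log file
    • The details of a movie / the details of a record album
    • A song in an MP3 player / the contents of a PDF document
  • Documents are serialized as JSON and stored in ES
    • A JSON object is made up of fields
    • Each field has a field type (string / numeric / boolean / date / binary / range)
  • Every document has a unique ID
    • You can specify the ID yourself
    • Or let ES generate it automatically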

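# Document metadata

Metadata annotates a document with information about it

  • _index: the name of the index the document belongs to
  • _type: the name of the type the document belongs to
  • _id: the unique document ID
  • _source: the original JSON of the document
  • _all: the contents of all fields merged into one field; deprecated
  • _version: the version of the document
  • _score: the relevance score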

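# Index

  • An index is a container for documents, a collection of documents of the same kind
    • An index is a logical concept: each index has its own Mapping definition, which defines the field names and field types of the documents it contains
    • A shard is a physical concept: the data in an index is spread across shards
  • Mapping and Setting of an index
    • Mapping defines the types of the document fields
    • Setting defines how the data is distributed

The different meanings of "index"

  • Noun: an ES cluster can hold many different indices
  • Verb: saving a document to ES is also called indexing
    • In ES, this is the process of building an inverted index
  • Noun: a B-tree index, an inverted index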

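# Type

  • Before 7.0, an index could define multiple Types (indices created in 6.x are already limited to a single Type)
  • Types are deprecated starting with 6.0. From 7.0 on, an index can have only one Type, _doc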

# Abstraction and analogy

# Testing the API

Run the requests in Kibana's Dev Tools

#View information about the index
GET kibana_sample_data_ecommerce

#Count the documents in the index
GET kibana_sample_data_ecommerce/_count

#View the first 10 documents to see the document format
POST kibana_sample_data_ecommerce/_search
{
}

#_cat indices API
#List indices
GET /_cat/indices/kibana*?v&s=index

#List indices whose health is green
GET /_cat/indices?v&health=green

#Sort by document count
GET /_cat/indices?v&s=docs.count:desc

#Show specific columns
GET /_cat/indices/kibana*?pri&v&h=health,index,pri,rep,docs.count,mt

#How much memory is used per index?
GET /_cat/indices?v&h=i,tm&s=tm:desc

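Distributed architecture of ES

  • Clusters are told apart by name; the default name is elasticsearch
  • The name can be changed in the configuration file, or set on the command line with -E cluster.name=mycluster
  • A cluster can have one or more nodes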

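# Node

  • A node is an ES instance
    • In essence it is a Java process
    • One machine can run several ES processes, but in production it is generally recommended to run only one ES instance per machine
  • Every node has a name, set in the configuration file or at startup with -E node.name=node1
  • After it starts, each node is assigned a UID, which is saved in the data directory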

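# Master-eligible nodes and Master Node

  • By default, every node is master-eligible once it starts
    • Set node.master: false to disable this
  • Master-eligible nodes can take part in master election and become the Master node
  • When the first node starts, it elects itself as the Master node
  • Every node stores the cluster state, but only the Master node can modify it
    • The Cluster State holds the essential information about a cluster
      • All node information
      • All indices with their Mapping and Setting information
      • Shard routing information
    • If any node could modify this information, the data would become inconsistent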

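# Data Node & Coordinating Node

  • Data Node
    • A node that can hold data is called a Data Node. It stores shard data and plays a key role in scaling data capacity
  • Coordinating Node
    • Receives client requests, forwards them to the appropriate nodes, and finally gathers the results
    • Every node acts as a Coordinating Node by default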

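# Other node types

  • Hot & Warm Node
    • Data Nodes with different hardware configurations, used to build a Hot & Warm architecture and lower the cost of a cluster deployment
  • Machine Learning Node
    • Runs machine learning jobs, used for anomaly detection
  • Tribe Node
    • Connects to different ES clusters and lets them be treated as a single cluster (replaced by Cross Cluster Search from 5.3 on)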

# Configuring node types

  • In development, one node can take on multiple roles
  • In production, nodes should be configured with a single role (dedicated node)

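# Shards (Primary Shard & Replica Shard)

  • Primary shards solve horizontal scaling: they let the data be distributed across all nodes in the cluster
    • A shard is a running Lucene instance
    • The number of primary shards is set when the index is created and cannot be changed afterwards, except by reindexing
  • Replica shards solve high availability: a replica is a copy of a primary shard
    • The number of replicas can be adjusted dynamically
    • Adding replicas can also improve service availability to some extent (higher read throughput)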

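# Shard settings

Shard settings in production require capacity planning up front

  • Too few shards
    • Nodes added later cannot be used to scale out horizontally
    • Each shard holds too much data, so reallocating data takes a long time
  • Too many shards (from 7.0 on, the default number of primary shards is 1, which addresses over-sharding)
    • Affects the relevance scoring of search results and the accuracy of statistics
    • Too many shards on a single node waste resources and also hurt performance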

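# Checking cluster health

  • Green: all primary and replica shards are allocated
  • Yellow: all primary shards are allocated, but at least one replica shard is not
  • Red: at least one primary shard is not allocated
    • For example, when a new index is created while disk usage on the server is above 85%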
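
The three colors follow directly from which shards are unassigned. A minimal sketch of that rule, where the `(kind, assigned)` tuple representation is an assumption made for this example and not an ES data structure:

```python
def cluster_health(shards):
    """Derive green/yellow/red from a list of (kind, assigned) tuples.

    kind is "primary" or "replica"; assigned is True when the shard
    has been allocated to a node.
    """
    if any(kind == "primary" and not assigned for kind, assigned in shards):
        return "red"      # at least one primary is unassigned
    if any(kind == "replica" and not assigned for kind, assigned in shards):
        return "yellow"   # all primaries assigned, some replica is not
    return "green"        # everything is assigned


if __name__ == "__main__":
    # A one-node cluster with 1 primary and 1 replica: the replica cannot be
    # placed on the same node as its primary, so the cluster is yellow.
    print(cluster_health([("primary", True), ("replica", False)]))
```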

View the cluster in Cerebro: http://192.168.83.130:9000/

# Document CRUD

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docs.html

# Single documents

# Create a document with an auto-generated id
POST users/_doc
{
  "user": "starry",
  "post_date": "2020-02-02T11:11:11",
  "message": "trying out kibana"
}

# Create a document with a given id (op_type=create). Fails if the id already exists
PUT users/_doc/1?op_type=create
{
  "user": "starry2",
  "post_date": "2022-02-02T11:11:11",
  "message": "trying out es"
}

# Create a document with a given id (_create endpoint). Fails if the id already exists
PUT users/_create/1
{
  "user": "starry2",
  "post_date": "2022-02-02T11:11:11",
  "message": "trying out es"
}

# Get a document by id
GET users/_doc/1

# Index: if the id exists, the old document is deleted and a new one is written (full replacement)
PUT users/_doc/1
{
  "user": "starry2"
}


GET users/_doc/1
# Update: add fields to the existing document
POST users/_update/1
{
  "doc":{
    "post_date": "2022-02-02T11:11:11",
    "message": "trying out es!"
  }
}

# Delete a document
DELETE users/_doc/1

# Multiple documents

### Bulk operations
# index, create, delete, and update
POST _bulk
{"index": {"_index": "test", "_id": "1"}}
{"field1": "value1"}
{"delete": {"_index": "test", "_id": "2"}}
{"create": {"_index": "test2", "_id": "3"}}
{"field1": "value3"}
{"update": {"_id": "1", "_index": "test"}}
{"doc": {"field2": "value2"}}


### Multi-get (mget)
GET _mget
{
  "docs": [
      {
        "_index": "test",
        "_id": "1"
      },
      {
        "_index": "test",
        "_id": "2"
      }
    ]
}

# Specify the index in the URI
GET test/_mget
{
  "docs": [
      {
        "_id": "1"
      },
      {
        "_id": "2"
      }
    ]
}


# Filter the returned fields with _source
GET _mget
{
  "docs": [
      {
       "_index": "test",
       "_id": "1",
       "_source": false
      },
      {
        "_index": "test",
        "_id": "2",
        "_source": ["field3","field4"]
      },
      {
        "_index": "test",
        "_id": "3",
        "_source": {
          "include": ["user"],
          "exclude": ["user.location"]
        }
      }
    ]
}


### msearch
POST kibana_sample_data_ecommerce/_msearch
{}
{"query": {"match_all": {}}, "size": 1}
{"index": "kibana_sample_data_flights"}
{"query": {"match_all": {}}, "size": 2}


### Clean up the test data
DELETE users
DELETE test
DELETE test2


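# Inverted index

An inverted index has two parts

  • Term Dictionary: records every term in all documents and maps each term to its posting list
    • The term dictionary is usually large. It can be implemented with a B+ tree or a hash table with chaining to support fast inserts and lookups
  • Posting List: records the set of documents that contain the term, and is made up of postings
    • Posting
      • Document ID
      • Term frequency (TF): how many times the term appears in the document, used for relevance scoring
      • Position: the position of the term among the document's tokens, used for phrase queries
      • Offset: the start and end character offsets of the term, used for highlighting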
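
The structure described above can be illustrated with a toy in-memory index. This is a teaching sketch, not how Lucene stores its data: a plain dict stands in for the term dictionary, the tokenizer is a simple regular expression, and the three sample documents are made up.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: posting}, where a posting holds the term
    frequency, the token positions, and the (start, end) character offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            posting["tf"] += 1
            posting["positions"].append(position)
            posting["offsets"].append((match.start(), match.end()))
    return dict(index)


if __name__ == "__main__":
    docs = {
        1: "Mastering Elasticsearch",
        2: "Elasticsearch Server",
        3: "Elasticsearch Essentials",
    }
    index = build_inverted_index(docs)
    # Looking up a term returns the documents that contain it.
    print(sorted(index["elasticsearch"]))
    print(index["elasticsearch"][1])
```

A lookup goes from term to posting list, so finding every document that contains a word does not require scanning the documents themselves.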


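# Analysis and Analyzer

# What is Analysis

Analysis (text analysis) is the process of converting full text into a series of terms (tokens). It is also called tokenization. Analysis is carried out by an Analyzer

  • You can use the analyzers built into ES or define custom ones as needed

Terms are produced not only when data is written. At query time, the query string must be analyzed with the same analyzer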

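# Components of an Analyzer

An analyzer is the component dedicated to tokenization. It has three parts

  • Character Filters: process the raw text, for example stripping HTML tags
  • Tokenizer: splits the text into tokens according to rules
  • Token Filter: post-processes the tokens, for example lowercasing them, removing stopwords, or adding synonyms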
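
The three stages run in order, each feeding the next. A toy pipeline in Python makes the flow concrete. The regular expressions and the stopword list here are simplifications chosen for the example and do not reproduce any built-in ES analyzer.

```python
import re

STOPWORDS = {"the", "a", "is"}


def char_filter(text):
    """Character filter: strip HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def tokenizer(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filter: lowercase the tokens and drop stopwords."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


if __name__ == "__main__":
    print(analyze("<b>The Quick</b> brown-foxes"))
```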


# Built-in ES analyzers

  • Standard Analyzer: the default analyzer; splits on word boundaries and lowercases
  • Simple Analyzer: splits on non-letters (symbols are dropped) and lowercases
  • Stop Analyzer: lowercases and removes stop words (the, a, is)
  • Whitespace Analyzer: splits on whitespace and does not lowercase
  • Keyword Analyzer: no tokenization; the input is emitted as a single token
  • Pattern Analyzer: splits by regular expression, \W+ by default (non-word characters)
  • Language: analyzers for more than 30 common languages
  • Custom Analyzer: a user-defined analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/indices-analyze.html https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

#Sample text: 2 running Quick brown-foxes leap over lazy dogs in the summer evening

#Compare the output of different analyzers
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理”"
}


POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理”"
}


POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}

# Chinese tokenization

# Search API

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-search.html

  • URI Search
    • Query with parameters in the URL
  • Request Body Search
    • A more complete, JSON-based Query Domain Specific Language (DSL)

# Measuring relevance

# URI Search

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-uri-request.html

GET movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile": "true"
}
  • q: the query string, written in Query String Syntax
  • df: the default field to use when the query does not specify a field prefix
  • sort: sorting / from and size: pagination
  • profile: shows how the query is executed
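
Outside Kibana, the same request is an ordinary URL. The sketch below builds it with Python's standard library. The host `http://localhost:9200` and the helper name are assumptions made for this example.

```python
from urllib.parse import urlencode


def uri_search_url(base_url, index, params):
    """Build a URI Search URL; ':' is kept unescaped so sort=year:desc stays readable."""
    return f"{base_url}/{index}/_search?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    url = uri_search_url(
        "http://localhost:9200",  # assumed local node
        "movies",
        {"q": "2012", "df": "title", "sort": "year:desc",
         "from": 0, "size": 10, "timeout": "1s"},
    )
    print(url)
```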

# Field query / generic query

### Query a specific field
GET movies/_search?q=title:2012
{
  "profile": "true"
}
## response
# "type" : "TermQuery",
# "description" : "title:2012"


### Generic query (all fields)
GET movies/_search?q=2012
{
  "profile": "true"
}
## response
# "type" : "DisjunctionMaxQuery",
# "description" : "(title.keyword:2012 | titile.keyword:2012 | id.keyword:2012 | titile:2012 | year:[2012 TO 2012] | genre:2012 | @version:2012 | @version.keyword:2012 | id:2012 | genre.keyword:2012 | title:2012)"

# Term / Phrase

### Phrase query
GET movies/_search?q=title:"beautiful mind"
{
  "profile": "true"
}
## response
# "type" : "PhraseQuery",
# "description" : """title:"beautiful mind""""

### Term query: without quotes, mind becomes a generic query across all fields
GET movies/_search?q=title:beautiful mind
{
  "profile": "true"
}
## response
# "type" : "DisjunctionMaxQuery",
# "description" : """(title.keyword:mind | titile.keyword:mind | id.keyword:mind | titile:mind | MatchNoDocsQuery("failed [year] query, caused by number_format_exception:[For input string: "mind"]") | genre:mind | @version:mind | @version.keyword:mind | id:mind | genre.keyword:mind | title:mind)""",

Without quotes, title:beautiful mind is treated as title:beautiful OR mind, where mind is searched across all fields

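# Grouping

The field prefix applies only to the terms inside the parentheses

### Grouping produces a Bool query; the terms are combined with OR (a space between terms defaults to OR)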
GET movies/_search?q=title:(beautiful mind)
{
  "profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "title:beautiful title:mind",

# Boolean operators

AND, OR, NOT, or equivalently &&, ||, !

### AND
GET movies/_search?q=title:(beautiful AND mind)
{
  "profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "+title:beautiful +title:mind"

### OR
GET movies/_search?q=title:(beautiful OR mind)
{
  "profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "title:beautiful title:mind",

### NOT
GET movies/_search?q=title:(beautiful NOT mind)
{
  "profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "title:beautiful -title:mind",

Grouping with + and -

  • + means must
  • - means must_not
### Grouping: + must be escaped as %2B in the URL
GET movies/_search?q=title:(%2Bbeautiful %2Bmind)
{
"profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "+title:beautiful +title:mind",

### Grouping: - does not need to be escaped
GET movies/_search?q=title:(-beautiful -mind)
{
"profile": "true"
}
## response
# "type" : "BooleanQuery",
# "description" : "-title:beautiful -title:mind #*:*"
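
The reason `+` has to be written as `%2B` is that a literal `+` in a URL query string is decoded as a space. The sketch below shows the encoding with Python's standard library. The helper name is hypothetical, and keeping `:`, `(` and `)` unescaped is a readability choice for this example.

```python
from urllib.parse import quote_plus


def encode_query_string(q):
    """Percent-encode a Query String Syntax expression for the q= parameter.

    '+' becomes %2B and spaces become '+', while ':', '(' and ')' are
    left as they are.
    """
    return quote_plus(q, safe=":()")


if __name__ == "__main__":
    print(encode_query_string("title:(+beautiful +mind)"))
    print(encode_query_string("title:(-beautiful -mind)"))
```

The `-` operator is an unreserved URL character, so it passes through unchanged.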

# Range queries

### Range query
GET movies/_search?q=title:beautiful AND year:[2002 TO 2018]
{
	"profile":"true"
}
## response
# "type" : "BooleanQuery",
# "description" : "+title:beautiful +year:[2002 TO 2018]",

GET movies/_search?q=title:beautiful AND year:>2010
{
	"profile":"true"
}
GET movies/_search?q=title:beautiful AND year:(>2010 AND <=2018)
{
	"profile":"true"
}
GET movies/_search?q=title:beautiful AND year:(+>2010 +<=2018)
{
	"profile":"true"
}

# Wildcards

  • ?: exactly one character
  • *: zero or more characters
### Wildcard query
GET movies/_search?q=title:beaut*
{
	"profile":"true"
}

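# Fuzzy matching and proximity queries

# Fuzzy matching & proximity matching
GET /movies/_search?q=title:beautifl~1
{
	"profile":"true"
}

GET /movies/_search?q=title:"Lord Rings"~2
{
	"profile":"true"
}
  • In the first query, beautifl is a misspelling of beautiful; ~1 allows the term to differ from an indexed term by one character
  • In the second query, ~2 lets Lord of the Rings match. With ~1, Lord of Rings would match but Lord of the Rings would not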
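
Both numbers can be worked out by hand. The sketch below uses plain Levenshtein distance for the fuzzy case, a simplification that ignores transpositions. For the proximity case it uses a deliberately narrow slop calculation that only handles two query terms appearing once each, in order. Both helpers are illustrative and are not the Lucene implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def slop_needed(first, second, title):
    """Positions the second term must move so that "first second" matches title.

    Simplified: assumes both terms occur once in title, in this order.
    """
    tokens = title.lower().split()
    return tokens.index(second.lower()) - tokens.index(first.lower()) - 1


if __name__ == "__main__":
    print(edit_distance("beautifl", "beautiful"))
    print(slop_needed("lord", "rings", "Lord of the Rings"))
    print(slop_needed("lord", "rings", "Lord of Rings"))
```

beautifl is one insertion away from beautiful, so ~1 is enough. In Lord of the Rings two words sit between the query terms, so the phrase needs a slop of 2.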

# RequestBody

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-request-body.html

# Ignore the unavailable index 404_idx
# Match all documents
POST movies,404_idx/_search?ignore_unavailable=true
{
  "profile": "true",
  "query": {
    "match_all": {}
  }
}

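# Pagination

Results start at offset 0 and 10 hits are returned by default. Fetching pages deep in the result set is expensive

# Pagination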
POST kibana_sample_data_ecommerce/_search
{
  "from": 10,
  "size": 20,
  "query": {
    "match_all": {}
  }
}
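
Deep pages are expensive because every shard has to produce its own top from + size hits before the coordinating node can merge them. The arithmetic below is a rough sketch of that cost. The function names are made up for the example, and 10000 is the default value of index.max_result_window.

```python
MAX_RESULT_WINDOW = 10000  # default index.max_result_window


def docs_collected_per_shard(from_, size):
    """Each shard must return its top (from + size) hits."""
    return from_ + size


def docs_merged_by_coordinator(from_, size, shards):
    """The coordinating node merges the hits from every shard."""
    return shards * docs_collected_per_shard(from_, size)


def within_result_window(from_, size):
    """from + size beyond the window is rejected under the default settings."""
    return from_ + size <= MAX_RESULT_WINDOW


if __name__ == "__main__":
    # The request above: from=10, size=20, on an index with 5 shards (assumed)
    print(docs_merged_by_coordinator(10, 20, shards=5))
    # A deep page: from=9990, size=20
    print(docs_merged_by_coordinator(9990, 20, shards=5), within_result_window(9990, 20))
```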

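# Sorting

Sort on numeric and date fields where possible. When sorting on a multi-valued or analyzed field, the system picks one value to sort by, and you cannot tell which one

# Sort by date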
POST kibana_sample_data_ecommerce/_search
{
  "sort": [{"order_date": "desc"}],
  "query": {
    "match_all": {}
  }
}

# _source filtering

  • If _source is not stored, only the metadata of the matching documents is returned
  • _source supports wildcards, for example "_source": ["*name"]
# _source filtering
POST kibana_sample_data_ecommerce/_search
{
  "_source": ["order_date"],
  "query": {
    "match_all": {}
  }
}

# Script fields

# Script field
POST kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "new_field": {
      "script": {
        "lang": "painless",
        "source": "doc['order_date'].value + 'hello'"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

# Match query expressions

# or
POST movies/_search
{
  "query": {
    "match": {
      "title": "beautiful mind"
    }
  }
}

# and
POST movies/_search
{
  "query": {
    "match": {
      "title": {
        "query": "beautiful mind",
        "operator": "and"
      }
    }
  }
}
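
The difference between the two operators comes down to any versus all. The toy matcher below works on lowercased word tokens and ignores scoring entirely. It is a sketch of the boolean logic only, with a hypothetical `match` helper.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match(query, field_value, operator="or"):
    """Return True if the field matches the query under the given operator."""
    field_terms = set(tokens(field_value))
    hits = [term in field_terms for term in tokens(query)]
    return all(hits) if operator == "and" else any(hits)


if __name__ == "__main__":
    print(match("beautiful mind", "Beautiful Girls"))                  # or
    print(match("beautiful mind", "Beautiful Girls", operator="and"))  # and
    print(match("beautiful mind", "A Beautiful Mind", operator="and"))
```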

# Phrase search with Match Phrase

# PhraseQuery
POST movies/_search
{
  "profile": "true", 
  "query": {
    "match_phrase": {
      "title": "beautiful mind"
    }
  }
}

# slop: the maximum number of positions the terms may be moved and still count as a match
POST movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "one love",
        "slop": 1
      }
    }
  }
}

POST movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "one love"
      }
    }
  }
}

# _source filtering with a wildcard field pattern
POST kibana_sample_data_ecommerce/_search
{
  "_source": ["*name"], 
  "query": {
    "match_all": {}
  }
}

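# QueryString and SimpleQueryString

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-query-string-query.html

  • QueryString uses a query syntax similar to that of URI Search

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-simple-query-string-query.html

  • SimpleQueryString is similar to QueryString, but it ignores invalid syntax and supports only part of the query syntax
  • AND, OR and NOT are not supported as operators; they are treated as plain strings
  • The default relationship between terms is OR; this can be changed with default_operator
  • A subset of the logic is supported through symbols
    • + replaces AND
    • | replaces OR
    • - replaces NOT
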
PUT users/_doc/1
{
  "name": "zhang san",
  "about": "c, java, go, python, es"
}

PUT users/_doc/2
{
  "name": "li si",
  "about": "nginx"
} 

POST users/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "zhang AND san"
    }
  }
}

POST users/_search
{
  "query": {
    "query_string": {
      "fields": ["name","about"], 
      "query": "(zhang AND san) AND (c AND java)"
    }
  }
}

POST users/_search
{
  "profile": "true", 
  "query": {
    "simple_query_string": {
      "query": "zhang san",
      "fields": ["name"],
      "default_operator": "AND"
    }
  }
}

GET movies/_search
{
  "profile": "true",
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "beautiful AND mind"
    }
  }
}

GET movies/_search
{
  "profile": "true",
  "query": {
    "query_string": {
      "fields": [
          "title",
          "year"
        ],
      "query": "2012"
    }
  }
}

GET movies/_search
{
  "profile": "true",
  "query": {
    "simple_query_string": {
      "query": "beautiful -mind",
      "fields": ["title"]
    }
  }
}

# Mapping

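https://www.elastic.co/guide/en/elasticsearch/reference/7.1/dynamic-mapping.html

# What is Mapping

  • A Mapping is similar to a schema definition in a database. It:
    • defines the names of the fields in an index
    • defines each field's data type, such as string, number, boolean...
    • configures how each field is handled by the inverted index (analyzed or not analyzed, which Analyzer)
  • A Mapping maps JSON documents into the flat format that Lucene requires
  • A Mapping belongs to a Type of an index
    • Every document belongs to a Type
    • Each Type has one Mapping definition
    • Since 7.0, the Mapping definition no longer needs to specify type information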

# Field data types

  • Simple types
    • Text / Keyword
    • Date
    • Integer / Floating point
    • Boolean
    • IPv4 & IPv6
  • Complex types
    • Object / Nested
  • Special types
    • geo_point & geo_shape / percolator

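# What is Dynamic Mapping

  • When a document is written to an index that does not exist, the index is created automatically
  • Thanks to Dynamic Mapping, we do not have to define Mappings by hand; ES infers each field's type from the document
  • The inference is sometimes wrong, for example for geo-location data
  • When a type is set incorrectly, some features, such as Range queries, will not work properly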

# The mapping is created automatically
PUT mapping_test/_doc/1
{
  "firstName": "Chan",
  "lastName": "Jackie",
  "loginDate": "2020-02-02T11:11:11"
}

# View the mapping
GET mapping_test/_mapping

# Delete the index
DELETE mapping_test

# Dynamic Mapping: infer the field types
PUT mapping_test/_doc/1
{
    "uid" : "123",
    "isVip" : false,
    "isAdmin": "true",
    "age":19,
    "heigh":180
}

GET mapping_test/_mapping

DELETE mapping_test

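# Can the field type in a Mapping be changed?

There are two cases

  • Adding a new field
    • With dynamic set to true, the Mapping is updated as soon as a document with a new field is written
    • With dynamic set to false, the Mapping is not updated; the new field's data is not indexed, but it still appears in _source
    • With dynamic set to strict, writing the document fails
  • An existing field
    • Once data has been written, the field's definition can no longer be changed
    • The inverted index built by Lucene cannot be modified once it has been generated
  • To change a field's type, you must rebuild the index with the Reindex API

Reason

  • Changing a field's data type would make data that has already been indexed unsearchable
  • Adding a new field has no such effect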
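
The three values of the dynamic setting can be compared with a toy Python model. `index_document` is a hypothetical helper written for this note; it only tracks which field names are mapped and does not reflect how ES stores anything internally.

```python
def index_document(mapped_fields, doc, dynamic="true"):
    """Toy model of the `dynamic` mapping setting.

    mapped_fields: set of field names already in the mapping (updated in place
    when dynamic == "true"). Returns (searchable_fields, source).
    """
    new_fields = set(doc) - mapped_fields
    if new_fields and dynamic == "strict":
        # strict: the write is rejected outright
        raise ValueError(f"strict mapping rejects new fields: {sorted(new_fields)}")
    if dynamic == "true":
        # true: the mapping grows to include the new fields
        mapped_fields |= new_fields
    # false: new fields are kept in _source but are not searchable
    searchable = {field for field in doc if field in mapped_fields}
    return searchable, dict(doc)

mapping = {"field1"}
searchable, source = index_document(mapping, {"field2": "someValue"}, dynamic="false")
print(searchable, source)
```

With dynamic set to false the returned set of searchable fields is empty while the source still contains field2, which matches the behaviour described in the list above.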



# Mappings are dynamic by default: write a document that contains a new field
PUT dynamic_mapping_test/_doc/1
{
  "field1": "someValue"
}

# The field is searchable and the data also appears in _source
POST dynamic_mapping_test/_search
{
  "query": {
    "match": {
      "field1": "someValue"
    }
  }
}

# Set dynamic to false
PUT dynamic_mapping_test/_mapping
{
  "dynamic": false
}

# Add a new field
PUT dynamic_mapping_test/_doc/2
{
  "field2": "someValue"
}

# No hits: dynamic is now false, so field2 was not indexed
POST dynamic_mapping_test/_search
{
  "query": {
    "match": {
      "field2": "someValue"
    }
  }
}

GET dynamic_mapping_test/_doc/2

# Set dynamic to strict
PUT dynamic_mapping_test/_mapping
{
  "dynamic": "strict"
}

# Writing the document now fails
PUT dynamic_mapping_test/_doc/3
{
  "field3": "someValue"
}

DELETE dynamic_mapping_test

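# Defining a custom Mapping

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/mapping-params.html

  • Create a temporary index and write some sample data to it
  • Call the Mapping API to get the dynamic Mapping definition of that temporary index
  • Delete the temporary index
  • Modify the definition as needed and use it to create your own index

# Add sample data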
PUT mapping_test/_doc/1
{
    "uid" : "123",
    "isVip" : false,
    "isAdmin": "true",
    "age":19,
    "heigh":180
}

# View the automatically generated mapping
GET mapping_test/_mapping

# Delete the index
DELETE mapping_test

# Create the index with the custom mapping
PUT mapping_test
{
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "heigh" : {
          "type" : "long"
        },
        "isAdmin" : {
          "type" : "boolean"
        },
        "isVip" : {
          "type" : "boolean"
        },
        "uid" : {
          "type" : "long"
        }
      }
    }
}

# Add the sample data again and view the mapping
PUT mapping_test/_doc/1
{
    "uid" : "123",
    "isVip" : false,
    "isAdmin": "true",
    "age":19,
    "heigh":180
}
GET mapping_test/_mapping

# Controlling whether a field is indexed

  • index: controls whether the field is indexed. The default is true; when set to false, the field cannot be searched

PUT users
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "text",
        "index": false,
        "index_options": "docs"
      }
    }
  }
}

PUT users/_doc/1
{
  "firstName": "zhang",
  "lastName": "san",
  "mobile": "123321"
}

# Fails: mobile is not indexed, so it cannot be searched
GET users/_search
{
  "query": {
    "match": {
      "mobile": "123321"
    }
  }
}

DELETE users


# null_value

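https://www.elastic.co/guide/en/elasticsearch/reference/7.1/null-value.html

A null value cannot be indexed or searched. When a field is set to null (or to an empty array, or an array of null values), it is treated as though the field has no value. The null_value parameter lets you replace explicit null values with a specified value so that the field can be indexed and searched.

  • Use it when null values need to be searchable
  • text fields do not accept null_value; a keyword field, as used here, does

PUT users
{
  "mappings": {
    "properties": {
      "firstName": {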
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "keyword",
        "null_value": "NULL"
      }
    }
  }
}

PUT users/_doc/1
{
  "firstName": "zhang",
  "lastName": "san",
  "mobile": null
}

GET users/_search?q=mobile:NULL

POST users/_search
{
  "query": {
    "match": {
      "mobile": "NULL"
    }
  }
}

DELETE users

# copy_to

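https://www.elastic.co/guide/en/elasticsearch/reference/7.1/copy-to.html

The copy_to parameter lets you copy the values of multiple fields into a group field, which can then be queried as a single field. For example, the first_name and last_name fields can be copied to a full_name field.

  • _all was removed in ES 7; copy_to takes its place
  • Serves certain specific search needs
  • copy_to copies a field's value into the target field, giving an effect similar to _all
  • The copy_to target field does not appear in _source

PUT users
{
  "mappings": {
    "properties": {
      "firstName": {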
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName": {
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}


PUT users/_doc/1
{
  "firstName": "zhang",
  "lastName": "san"
}

GET users/_search?q=fullName:(zhang san)

DELETE users

# Array type

ES has no dedicated array type. Any field can, however, contain multiple values of the same type.

PUT users/_doc/1
{
  "name": "zhang san",
  "interests": "reading"
}

PUT users/_doc/2
{
  "name": "li si",
  "interests": ["reading","music"]
}

GET users/_mapping
GET users/_search

DELETE users

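# Multi-fields

https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For example, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations.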

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { 
            "type":  "keyword"
          }
        }
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "city": "New York"
}

PUT my-index-000001/_doc/2
{
  "city": "York"
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

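A multi-field mapping is completely separate from the parent field's mapping. A multi-field does not inherit any mapping options from its parent field. Multi-fields do not change the original _source field.

Another use case for multi-fields is to analyze the same field in different ways for better relevance. For example, we could index a field with the standard analyzer, which breaks text up into words, and again with the english analyzer, which stems words into their root form:

# The index already exists from the previous example, so delete it before recreating it
DELETE my-index-000001

PUT my-index-000001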
{
  "mappings": {
    "properties": {
      "text": { 
        "type": "text",
        "fields": {
          "english": { 
            "type":     "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

PUT my-index-000001/_doc/1
{ "text": "quick brown fox" } 

PUT my-index-000001/_doc/2
{ "text": "quick brown foxes" } 

GET my-index-000001/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown foxes",
      "fields": [ 
        "text",
        "text.english"
      ],
      "type": "most_fields" 
    }
  }
}

# Exact Values & Full Text

image.png image.png

# Custom analysis

When the analyzers that ship with ES are not enough, you can define a custom analyzer by freely combining different components:

  • Character Filter
  • Tokenizer
  • Token Filter
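
How the three components chain together can be sketched in plain Python. This is a toy pipeline, not the ES implementation: every function name is hypothetical, and the stop word list is a tiny illustrative set rather than the built-in _english_ list.

```python
def mapping_char_filter(text, mappings):
    """Character filter: rewrite the raw text before it is tokenized."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

STOPWORDS = frozenset({"a", "an", "the", "in", "are", "this"})  # illustrative only

def stop_filter(tokens):
    """Token filter: drop stop words (case-sensitive comparison)."""
    return [token for token in tokens if token not in STOPWORDS]

def analyze(text):
    # Order matters: char filters -> tokenizer -> token filters.
    text = mapping_char_filter(text, {":)": "happy", ":(": "sad"})
    tokens = whitespace_tokenizer(text)
    return stop_filter(lowercase_filter(tokens))

print(analyze("The weather is nice :)"))
print(stop_filter(whitespace_tokenizer("The girls")))
```

The second call skips the lowercase step, so the capitalized "The" survives the stop filter; the order of token filters therefore changes the result.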

# Character Filter

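https://www.elastic.co/guide/en/elasticsearch/reference/7.1/analysis-charfilters.html

  • Processes the text before the Tokenizer, for example by adding, removing or replacing characters. Several Character Filters can be configured. They affect the position and offset information seen by the Tokenizer
  • Some built-in Character Filters
    • HTML Strip: removes HTML tags
    • Mapping: string replacement
    • Pattern Replace: regular-expression replacement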

# Tokenizer

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/analysis-tokenizers.html

  • Splits the original text into words (terms or tokens) according to a set of rules
  • Tokenizers built into ES
    • whitespace
    • standard
    • uax_url_email
    • pattern
    • keyword
    • path_hierarchy
    • ......
  • You can also develop a plugin in Java to implement your own Tokenizer

# Token Filter

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/analysis-tokenfilters.html

  • Adds, modifies or removes the words (terms) output by the Tokenizer
  • Built-in Token Filters
    • Lowercase
    • Stop
    • Synonym (adds synonyms)
    • ......

PUT logs/_doc/1
{
  "level": "DEBUG"
}

GET logs/_mapping

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world<p>"
}

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/src/es"
}

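# Character replacement (the sample text is illustrative; any text containing hyphens will do)
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}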

# Replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am feeling :)", "Feeling:( today"]
}

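# The stop filter runs without lowercase, so the capitalized "The" is not removed
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}

# With lowercase applied first, "the" is removed as a stop word
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}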

# Regular-expression replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

Custom Analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/analysis-custom-analyzer.html

When the built-in analyzers do not meet your needs, you can create a custom analyzer that uses the appropriate combination of:

  • zero or more character filters
  • a tokenizer (required)
  • zero or more token filters

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}
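
  • Assigns the index a default custom analyzer, my_custom_analyzer. This analyzer uses a custom tokenizer, character filter and token filter that are defined later in the request.
  • Defines the custom punctuation tokenizer.
  • Defines the custom emoticons character filter.
  • Defines the custom english_stop token filter.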

# Index Template

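https://www.elastic.co/guide/en/elasticsearch/reference/7.1/indices-templates.html

An Index Template defines Mappings and Settings according to rules we specify, and is applied automatically to newly created indices that match.

  • A template only takes effect when an index is newly created. Changing a template does not affect indices that already exist
  • Multiple index templates can be defined; their settings are merged together
  • The order value controls the merging process

Rules applied when an index is newly created:

  • Apply the default ES settings and mappings
  • Apply the settings from Index Templates with a low order value
  • Apply the settings from Index Templates with a higher order value, overriding the earlier ones
  • Apply the settings and mappings specified by the user when creating the index, overriding the template definitions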
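
The merge order above can be modelled in a few lines of Python. This is a sketch under stated assumptions: `resolve_settings` is a hypothetical helper, the defaults dict stands in for the ES defaults, and `fnmatchcase` only approximates how index_patterns wildcards are matched.

```python
from fnmatch import fnmatchcase

def resolve_settings(index_name, templates, request_settings, defaults):
    """Merge settings in the order: defaults, templates by ascending order, request."""
    resolved = dict(defaults)
    matching = [
        template for template in templates
        if any(fnmatchcase(index_name, pattern) for pattern in template["index_patterns"])
    ]
    for template in sorted(matching, key=lambda t: t["order"]):
        resolved.update(template["settings"])  # higher order overrides lower order
    resolved.update(request_settings)          # the explicit request always wins
    return resolved

templates = [
    {"index_patterns": ["*"], "order": 0,
     "settings": {"number_of_shards": 1, "number_of_replicas": 1}},
    {"index_patterns": ["test*"], "order": 1,
     "settings": {"number_of_shards": 1, "number_of_replicas": 2}},
]
defaults = {"number_of_shards": 1, "number_of_replicas": 1}

print(resolve_settings("test_template", templates, {}, defaults))
print(resolve_settings("test_my", templates, {"number_of_replicas": 5}, defaults))
```

The two calls mirror the test_template and test_my indices used in the requests of this section: the first ends up with 2 replicas from the higher-order template, the second with the 5 replicas given at creation time.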
PUT my_template/_doc/1
{
  "someNumber": "1",
  "someDate": "2020-11-11"
}

GET my_template/_mapping

PUT _template/template_default
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

PUT _template/template_test
{
  "index_patterns": ["test*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "date_detection": false,
    "numeric_detection": true
  }
}

GET _template/template_default
GET _template/temp*

PUT test_template/_doc/1
{
  "someNumber": "1",
  "someDate": "2020-02-02"
}

GET test_template/_mapping
GET test_template/_settings

PUT test_my
{
  "settings": {
    "number_of_replicas": 5
  }
}

PUT test_my/_doc/1
{
  "key1": "value1"
}

GET test_my/_settings

DELETE test_my
DELETE _template/template_default
DELETE _template/template_test

# Dynamic Template

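https://www.elastic.co/guide/en/elasticsearch/reference/7.1/dynamic-templates.html

Dynamic templates set a field's type dynamically, based on the data type ES detects combined with the field name. For example:

  • Map every string to keyword, or turn off the keyword sub-field
  • Map fields whose names start with is to boolean
  • Map fields whose names start with long_ to long
  • ......

Notes

  • A Dynamic Template is defined in the mapping of a specific index
  • Each template must be given a name
  • The matching rules form an array and are tried in order
  • The template sets the mapping of the fields it matches

Match parameters

  • match_mapping_type

The following data types can be detected automatically:

  • boolean, when true or false is encountered.
  • date, when date detection is enabled and a string matching any configured date format is found.
  • double, for numbers with a decimal part.
  • long, for numbers without a decimal part.
  • object, for objects, also called hashes.
  • string, for character strings.

* can also be used to match all data types.

  • match and unmatch

The match parameter uses a pattern to match on the field name, while unmatch uses a pattern to exclude fields matched by match.

  • match_pattern: switches match from simple wildcards to regular expressions (match_pattern: regex)
  • path_match and path_unmatch

The path_match and path_unmatch parameters work in the same way as match and unmatch, but operate on the full dotted path to the field, not just the final name.

  • {name} and {dynamic_type}

The {name} and {dynamic_type} placeholders are replaced in the mapping with the field name and the detected dynamic type.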
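
The "first matching template wins" rule can be sketched in Python. `pick_mapping` is a hypothetical helper that covers only match_mapping_type, match and unmatch with simple wildcards; path_match, match_pattern and the placeholders are left out of this sketch.

```python
from fnmatch import fnmatchcase

def pick_mapping(field_name, detected_type, dynamic_templates):
    """Return (template_name, mapping) of the first template whose conditions all hold."""
    for entry in dynamic_templates:
        name, rule = next(iter(entry.items()))  # each entry holds one named template
        wanted_type = rule.get("match_mapping_type", "*")
        if wanted_type not in ("*", detected_type):
            continue
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        return name, rule["mapping"]
    return None, None  # no template applies; normal dynamic mapping takes over

dynamic_templates = [
    {"my_template1": {"match_mapping_type": "string", "match": "is*",
                      "mapping": {"type": "boolean"}}},
    {"my_template2": {"match_mapping_type": "string",
                      "mapping": {"type": "keyword"}}},
]

print(pick_mapping("isVIP", "string", dynamic_templates))
print(pick_mapping("firstName", "string", dynamic_templates))
```

Because the templates are tried in order, isVIP is claimed by my_template1 even though my_template2 would also match it.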

PUT my_index/_doc/1
{
  "firstName": "ttt",
  "isVIP": "true"
}

GET my_index/_mapping
DELETE my_index

# Match on field name and type
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "my_template1": {
          "match_mapping_type": "string",
          "match": "is*",
          "mapping": {
            "type": "boolean"
          }
        }
      },
      {
        "my_template2": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
      
    ]
  }
}

PUT my_index/_doc/1
{
  "firstName": "ttt",
  "isVIP": "true"
}

GET my_index/_mapping

DELETE my_index

# Match on path
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "my_temp": {
          "path_match": "name.*",
          "path_unmatch": "*.middle",
          "mapping": {
            "type": "text",
            "copy_to": "full_name"
        }
      }
      }
    ]
  }
}

PUT my_index/_doc/1
{
  "name": {
    "first": "aaa",
    "middle": "bbb",
    "last": "ccc"
  }
}
GET my_index/_mapping

GET my_index/_search?q=full_name:aaa



# Aggregations

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations.html

# Types of aggregations

https://learnku.com/docs/elasticsearch73/7.3/article-11/6889

  • Bucket Aggregation: a collection of documents that meet a given criterion
  • Metric Aggregation: mathematical operations that compute statistics over document fields
  • Pipeline Aggregation: aggregates the results of other aggregations
  • Matrix Aggregation: operates on multiple fields and produces a result matrix


# Bucket

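A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, every bucket criterion is evaluated against each document in the context, and when a criterion matches, the document is considered to "fall into" the relevant bucket. At the end of the aggregation process, we have a list of buckets, each with the set of documents that "belong" to it.

# Metric

Aggregations that keep track of and compute metrics over a set of documents.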
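
The relationship between the two can be illustrated in Python: a bucket aggregation groups documents, and a metric aggregation computes numbers inside each group. The `flights` list is made-up sample data and `terms_with_stats` is a hypothetical helper, loosely mirroring a terms aggregation with avg, min and max sub-aggregations.

```python
from collections import defaultdict

# Made-up documents; field names follow the kibana_sample_data_flights index.
flights = [
    {"DestCountry": "IT", "AvgTicketPrice": 100.0},
    {"DestCountry": "IT", "AvgTicketPrice": 300.0},
    {"DestCountry": "US", "AvgTicketPrice": 250.0},
]

def terms_with_stats(docs, bucket_field, metric_field):
    """Bucket docs by bucket_field, then compute metrics over metric_field per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])  # doc "falls into" a bucket
    return {
        key: {
            "doc_count": len(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in buckets.items()
    }

result = terms_with_stats(flights, "DestCountry", "AvgTicketPrice")
print(result)
```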

# Bucket the flights by destination country
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "my_aggs": {
      "terms": {
        "field": "DestCountry"
      }
    }
  }
}

# Statistics per destination: average, maximum and minimum ticket price
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "my_aggs": {
      "terms": {
        "field": "DestCountry"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "AvgTicketPrice"
          }
        },
        "max_price": {
          "max": {
            "field": "AvgTicketPrice"
          }
        },
        "min_price": {
          "min": {
            "field": "AvgTicketPrice"
          }
        }
        
      }
    }
  }
}

# Price statistics plus weather information
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "flight_dest": {
      "terms": {
        "field": "DestCountry"
      },
      "aggs": {
        "stats_price": {
          "stats": {
            "field": "AvgTicketPrice"
          }
        },
        "weather_dest": {
          "terms": {
            "field": "DestWeather",
            "size": 3
          }
        }
      }
    }
  }
}

# Summary and review

image.png image.png image.png image.png

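# Term-based & full-text search

# Term-based queries

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/term-level-queries.html

  • Why Terms matter
    • A Term is the smallest unit that carries meaning. Both search and natural language processing with statistical language models have to work with Terms
  • Characteristics
    • Term Level Query: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
    • In ES, a Term query does not analyze its input. It treats the input as a whole, looks up that exact term in the inverted index, and uses the relevance scoring formula to score every document that contains the term. For example: "Apple Store"
    • A Constant Score query can turn the query into a filter, which skips scoring and uses the cache to improve performance

# Full-text queries

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/full-text-queries.html

  • Full-text search
    • Match Query / Match Phrase Query / Query String Query
  • Characteristics
    • Text is analyzed both at index time and at search time. The query string is first passed to a suitable analyzer, which produces the list of terms to search for
    • At search time, the input is analyzed first, a low-level query is then run for each term, and finally the results are merged and a score is produced for each document. For example, a query for "Matrix reload" returns every result that contains Matrix or reload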
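
The difference between the two query families comes down to whether the input is analyzed before the inverted index is consulted. A minimal Python sketch, assuming a crude stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase); all names are hypothetical and scoring is ignored:

```python
import re
from collections import defaultdict

def standard_analyze(text):
    """Rough approximation of the standard analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in standard_analyze(text):
            index[term].add(doc_id)
    return index

def term_query(index, value):
    """Term-level query: the input is looked up as-is, with no analysis."""
    return set(index.get(value, set()))

def match_query(index, text):
    """Full-text query: analyze the input, then OR the per-term lookups."""
    hits = set()
    for term in standard_analyze(text):
        hits |= index.get(term, set())
    return hits

index = build_index({1: "iPhone", 2: "iPad", 3: "MBP"})
print(term_query(index, "iPhone"))   # uppercase P never made it into the index
print(term_query(index, "iphone"))
print(match_query(index, "iPhone"))
```

The first lookup returns nothing because the index only holds the lowercased term, while the match query lowercases its input first and therefore finds document 1.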


DELETE products

PUT products/_bulk
{ "index": {"_id": 1} }
{ "productId":"XHDK-A-1293-#fJ3","desc": "iPhone" }
{ "index": {"_id": 2} }
{ "productId": "KDKE-B-9947-#kL5","desc": "iPad" }
{ "index": {"_id": 3}}
{ "productId": "JODL-X-1937-#pV7","desc": "MBP" }

GET products/_search
GET products

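# Searching for "iPhone" finds nothing: a term query does not analyze its input, so it looks for the term with an uppercase P, whereas ES applied the default analyzer to the text field at index time and stored the lowercase term iphone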
POST products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone"
        "value": "iphone"
      }
    }
  }
}

POST products/_search
{
  "query": {
    "term": {
      "productId": {
        //"value": "xhdk-a-1293-#fj3"
        "value": "xhdk"
        //"value": "XHDK-A-1293-#fJ3"
        
      }
    }
  }
}

POST _analyze
{
  "analyzer": "standard",
  "text": ["XHDK-A-1293-#fJ3"]
}

# Exact match on the keyword sub-field
POST products/_search
{
  "query": {
    "term": {
      "productId.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

POST products/_search
{
  "explain": true, 
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone"
        //"value": "iphone"
      }
    }
  }
}

# constant_score with a filter skips score calculation
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productId.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

### Full text

POST movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": "Matrix reloaded"
    }
  }
}

POST movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reloaded",
        "operator": "and"
      }
    }
  }
}

POST movies/_search
{
  "profile": "true",
  "query": {
    "match_phrase": {
      "title": {
        "query": "Matrix reloaded",
        "slop": 1
      }
    }
  }
}

DELETE test
PUT test/_doc/1
{
  "content":"Hello World"
}

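# Finds the document
# Default analyzer: the text is tokenized and lowercased at index time
# The match query also analyzes its input with the default analyzer at search time
# content:hello content:world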
POST test/_search
{
  "profile": "true",
  "query": {
    "match": {
      "content": "Hello World"
    }
  }
}

# Same as above
POST test/_search
{
  "profile": "true",
  "query": {
    "match": {
      "content": "hello world"
    }
  }
}

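# Finds the document
# The target field is a keyword, so the match query is rewritten to a term query
# Nothing is analyzed at index or search time; the value is matched as-is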
POST test/_search
{
  "profile": "true",
  "query": {
    "match": {
      "content.keyword": "Hello World"
    }
  }
}

# Finds nothing: the value differs from what was indexed
POST test/_search
{
  "profile": "true",
  "query": {
    "match": {
      "content.keyword": "hello world"
    }
  }
}

# Finds nothing
# The terms were lowercased at index time, but the query value has uppercase letters
# content:Hello World
POST test/_search
{
  "profile": "true",
  "query": {
    "term": {
      "content": "Hello World"
    }
  }
}

# Finds nothing
# The text was tokenized at index time, but the query value is not tokenized
# content:hello world
POST test/_search
{
  "profile": "true",
  "query": {
    "term": {
      "content": "hello world"
    }
  }
}

# Finds the document: keyword fields are not analyzed
POST test/_search
{
  "profile": "true",
  "query": {
    "term": {
      "content.keyword": "Hello World"
    }
  }
}


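# Structured search

https://www.elastic.co/guide/en/elasticsearch/guide/master/structured-search.html

# Structured data

  • Structured search is search over structured data
    • Dates, booleans and numbers are all structured
  • Text can be structured too
    • For example, colored pens come in a discrete set of colors: red, green, blue
    • A blog post may be tagged with labels such as distributed and search
    • Products on an e-commerce site have UPCs (Universal Product Codes) or other unique identifiers, which must follow a strictly defined, structured format

# Structured search in ES

  • Structured data such as booleans, times, dates and numbers has a precise format on which logical operations can be performed, including comparing ranges of numbers or times, or determining which of two values is larger.
  • Structured text can be matched exactly or partially
    • Term query / Prefix query
  • A structured result has only two possible values: yes or no
    • Depending on the scenario, you can decide whether a structured search needs scoring

# Structured search, exact matching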
DELETE products
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

GET products/_search

# term query on a boolean field, with scoring
POST products/_search
{
  "query": {
    "term": {
      "avaliable": true
    }
  }
}

# Boolean field, without scoring
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

# term query on a numeric field
POST products/_search
{
  "query": {
    "term": {
      "price": 30
    }
  }
}

# range query on a numeric field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 20,
            "lte": 30
          }
        }
      },
      "boost": 1.2
    }
  }
}

# range query on a date field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            // from 4 years ago onward
            "gte": "now-4y"
          }
        }
      },
      "boost": 1.2
    }
  }
}

# exists query: documents that contain the given field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      },
      "boost": 1.2
    }
  }
}

### Handling multi-valued fields
POST movies/_bulk
{"index": {"_id": 1}}
{"title": "Father of the Bridge Part II", "year": 1995,"genre": "Comedy"}
{"index": {"_id": 2}}
{"title": "Dave","year":"1993","genre":["Comedy","Romance"]}

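# On a multi-valued field, a term query means "contains", not "equals"
# For an exact match, add a field that stores the number of values and include it in the query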
POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genre.keyword": "Comedy"
        }
      },
      "boost": 1.2
    }
  }
}

# terms query on strings: the values are combined with OR
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "productID.keyword": [
            "QQPX-R-3956-#aD8",
            "JODL-X-1937-#pV7"
          ]
        }
      },
      "boost": 1.2
    }
  }
}


# Relevance scoring

image.png image.png image.png image.png image.png image.png image.png

PUT testscore
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is calculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }


POST /testscore/_search
{
  //"explain": true,
  "query": {
    "match": {
      //"content":"you"
      //"content": "elasticsearch"
      "content":"the"
      //"content": "the elasticsearch"
    }
  }
}

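# The boosting query is used when the scores of two queries need to be adjusted: it wraps the two queries together and lowers the score contributed by one of them. It has three parts: positive, negative and negative_boost. Scores from the positive query are left unchanged, documents that match the negative query have their score lowered, and negative_boost is the factor by which they are lowered.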
# https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-boosting-query.html
POST testscore/_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "term" : {
                    "content" : "elasticsearch"
                }
            },
            "negative" : {
                 "term" : {
                     "content" : "like"
                }
            },
            "negative_boost" : 0.2
        }
    }
}

# Recommend documents with similar text content
# https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-mlt-query.html
POST movies/_search
{
  "_source": ["title","genre"],
  "query": {
    "more_like_this": {
      "fields": [
        "title^10","genre"
      ],
      "like": [{"_id":"1"}],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

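# Multi-string, multi-field queries

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html

ES has two different contexts, Query and Filter:

  • Query Context: computes a relevance score
  • Filter Context: no scoring (yes or no), can use the cache, and therefore performs better

# bool query

  • A bool query is a combination of one or more query clauses
    • There are 4 kinds of clause in total: 2 affect scoring and 2 do not
  • Relevance is not exclusive to full-text search. It also applies to yes | no clauses: the more clauses match, the higher the relevance score. When several query clauses are merged into one compound query, such as a bool query, the score computed by each clause is merged into the overall relevance score.

| Clause | Behavior |
| --- | --- |
| must | Must match. Contributes to the score |
| should | Optionally matches. Contributes to the score |
| must_not | Filter Context clause; must not match |
| filter | Filter Context; must match, but does not contribute to the score |

# Boosting query

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-boosting-query.html

  • Returns documents that match the positive query, while reducing the relevance score of documents that also match the negative query.
  • You can use the boosting query to demote certain documents without excluding them from the search results.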
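
The effect on a single document's score can be written out as a short Python sketch. `boosting_score` is a hypothetical helper and the scores below are made-up numbers, not real ES output; the only rule modelled is that a document matching the negative query has its positive score multiplied by negative_boost.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Score of a document that matched the positive query of a boosting query."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost is expected to lie between 0 and 1")
    if matches_negative:
        return positive_score * negative_boost  # demoted, but still returned
    return positive_score

# (made-up positive score, matches the negative query "juice"?)
candidates = {
    "Apple Mac": (1.0, False),
    "Apple iPad": (1.0, False),
    "Apple employee like Apple Pie and Apple Juice": (1.5, True),
}
ranked = sorted(
    candidates,
    key=lambda title: boosting_score(*candidates[title], negative_boost=0.5),
    reverse=True,
)
print(ranked)
```

The juice document stays in the result list but drops to the bottom, which is the difference from excluding it with must_not.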
DELETE products

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

# Basic syntax
POST products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {"price": 30}
      },
      "filter": {
        "term": {"avaliable": true} 
      },
      "must_not": {
        "range": {
          "price": {
            "lte": 10
          }
        }
      },
      "should": [
        {"term": {"productID.keyword": "JODL-X-1937-#pV7"}},
        {"term": {"productID.keyword": "XHDK-A-1293-#fJ3"}}
      ],
      "minimum_should_match": 1
    }
  }
}

# Add a genre_count field to allow exact matching on arrays
POST newmovies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy","genre_count":1 }
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }

# must: scored
POST newmovies/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": { "genre.keyword": {"value": "Comedy"}}},
        {"term": { "genre_count": {"value": 1}}}
      ]
    }
  }
}

# filter: not scored
POST newmovies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": { "genre.keyword": {"value": "Comedy"}}},
        {"term": { "genre_count": {"value": 1}}} 
      ]
    }
  }
}

# filter context
POST products/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "avaliable": true
        }
      },
      "must_not": {
        "range": {
          "price":{
            "lte": 10
          }
        }
      }
    }
  }
}

# query context
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

POST products/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"productID.keyword":"JODL-X-1937-#pV7"}},
        {"term": {"avaliable":true}}
      ]
    }
  }
}

# Nested bool queries
POST products/_search
{
  "query": {
    "bool": {
      "must": [{"term": {"price": {"value": 30}}}],
      "should": [
        {"bool": {
          "must_not": [
            {"term": {
              "avaliable": {
                "value": true
              }
            }}
          ]
        }}
      ],
      "minimum_should_match": 1
    }
  }
}

# Equal weight (term values are lowercase because term queries are not analyzed)
POST movies/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"title": "mind"}},
        {"term": {"title": "beautiful"}},
        {"term": {"title": "game"}},
        {"term": {"title": "spy"}}
      ]
    }
  }
}

# Grouped weight: game and spy share a single should clause
POST movies/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"title": "mind"}},
        {"term": {"title": "beautiful"}},
        {"bool": {
          "should": [
            {"term": {"title": "game"}},
            {"term": {"title": "spy"}}
          ]
        }}
        
      ]
    }
  }
}

POST /blogs/_bulk
{ "index": { "_id": 1 }}
{"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
{ "index": { "_id": 2 }}
{"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }


# Adjust relevance with boost
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              // raise the relevance of title
              "query": "apple,ipad",
              "boost": 1.1
            }
          }
        },
        {
          "match": {
            "content":  {
              // lower the relevance of content
              "query": "apple,ipad",
              "boost": 0.9
            }
          }
        }
      ]
    }
  }
}

DELETE news
POST /news/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple iPad" }
{ "index": { "_id": 3 }}
{ "content":"Apple employee like Apple Pie and Apple Juice" }


POST news/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {
          "content": "apple"
        }}
      ]
    }
  }
}

# Show only products of Apple, the company
POST news/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {
          "content": "apple"
        }}
      ],
      "must_not": [
        {"match": {
          "content": "juice"
        }}
      ]
    }
  }
}

# Rank Apple products first by lowering the relevance of documents that mention juice
POST news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "juice"
        }
      },
      "negative_boost": 0.5
    }
  }
}

# 单字符串多字段查询

# Dis Max Query

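https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-dis-max-query.html

Returns documents that match one or more wrapped queries, called query clauses or clauses. If a returned document matches multiple query clauses, the dis_max query assigns it the highest relevance score from any matching clause, plus a tie-breaking increment for each additional matching subquery.

Rather than summing the scores of all matching fields, dis_max uses the score of the single best-matching field as the relevance score. If a document matches multiple clauses, the dis_max query computes its relevance score as follows:

  • Take the relevance score of the matching clause with the highest score.
  • Multiply the score of every other matching clause by the tie_breaker value.
  • Add the highest score to those products.

tie_breaker: a value between 0 and 1

  • 0: use only the best match
  • 1: all clauses are equally important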
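
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch code: the function name `dis_max_score` and the sample scores are made up for the example.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores following the dis_max steps listed above."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)              # step 1: the highest-scoring clause
    others = sum(matching) - best     # scores of all other matching clauses
    return best + tie_breaker * others  # steps 2 and 3


# Hypothetical scores for the title and body clauses of one document.
print(dis_max_score([1.2, 0.5], tie_breaker=0.0))  # best field only
print(dis_max_score([1.2, 0.5], tie_breaker=0.2))  # other clauses count a little
print(dis_max_score([1.2, 0.5], tie_breaker=1.0))  # same as summing every clause
```

With tie_breaker set to 1 the result equals the plain sum, which is why that value means "all clauses are equally important".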

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

# bool/should: the scores of all matching clauses are added together
POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

# dis_max: only the highest score counts
POST blogs/_search
{
  "query": {
    "dis_max": {
      //"tie_breaker": 0.7,
      //"boost": 1.2,
      "queries": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body":  "Brown fox" }}
      ]
    }
  }
}

# tie_breaker lets the other matching clauses contribute as well
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}

# Multi Match

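https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-multi-match-query.html

  • Best Fields
    • Use when fields compete with each other yet are related, such as title and body. The score comes from the best-matching field
  • Most Fields
    • A common technique for English content: the main field (English Analyzer) stems words and adds synonyms so that more documents match, while a sub-field holding the same text (Standard Analyzer) provides more precise matching. The additional fields act as signals that raise the relevance of matching documents; the more fields match, the better
  • Cross Fields
    • For some entities, such as person names, addresses, or book information, the information has to be identified across several fields, and each field is only one part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields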
 POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}
# Equivalent to the query above; multi_match defaults to best_fields
POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields", 
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2
    }
  }
}
# Field names support wildcards; ^ boosts a field's score
POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields", 
      "query": "Quick pets",
      "fields": ["*title","body^2"],
      "tie_breaker": 0.2
    }
  }
}



DELETE titles
PUT titles
{
  "mappings" : {
      "properties" : {
        "title" : {
          "type" : "text",
          "analyzer": "english"
        }
      }
    }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

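# The english analyzer strips tenses and plurals, so the query becomes the two terms bark and dog; document 1 matches better because it is shorter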
GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

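# Add a sub-field with a different analyzer (the index already exists, so run DELETE titles first and re-index the documents afterwards)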
PUT titles
{
  "settings": {"number_of_shards": 1},
  "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english",
          "fields": {
            "std": {
              "type": "text",
              "analyzer": "standard"
            }
          }
        }
      }
  }
}

# most_fields: the scores are added together, then divided by the number of match clauses
POST titles/_search
{
  "query": {
    "multi_match": {
      "query": "bark dog",
      "type": "most_fields", 
      "fields": ["title","title.std"]
    }
  }
}
# most_fields cannot match across fields; copy_to is a workaround but wastes space
POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields", 
      "query": "Quick pets",
      //"operator": "and", 
      "fields": ["title","body"]
    }
  }
}

# cross_fields supports searching across fields
POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "cross_fields", 
      "query": "Quick pets",
      "operator": "and", 
      "fields": ["title","body"]
    }
  }
}



# Multilingual content, Chinese word segmentation, and retrieval

Related resources

  • Elasticsearch IK analysis plugin https://github.com/medcl/elasticsearch-analysis-ik/releases
  • Elasticsearch HanLP analysis plugin https://github.com/KennFalcon/elasticsearch-analysis-hanlp
  • Survey of word segmentation algorithms https://zhuanlan.zhihu.com/p/50444885

Some word segmentation tools, for reference

  • NLPIR (Institute of Computing Technology, Chinese Academy of Sciences) http://ictclas.nlpir.org/nlpir/
  • ansj segmenter https://github.com/NLPchina/ansj_seg
  • LTP (Harbin Institute of Technology) https://github.com/HIT-SCIR/ltp
  • THULAC (Tsinghua University) https://github.com/thunlp/THULAC
  • Stanford segmenter https://nlp.stanford.edu/software/segmenter.shtml
  • HanLP segmenter https://github.com/hankcs/HanLP
  • Jieba segmenter https://github.com/yanyiwu/cppjieba
  • KCWS segmenter (character embeddings + Bi-LSTM + CRF) https://github.com/koth/kcws
  • ZPar https://github.com/frcchang/zpar/releases
  • IKAnalyzer https://github.com/wks/ik-analyzer

Installing the analyzers

  • hanlp: elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip
  • ik: elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
  • pinyin: elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.1.0/elasticsearch-analysis-pinyin-7.1.0.zip

DELETE my_index

PUT my_index/_doc/1
{ "title": "I'm happy for this fox" }

POST my_index/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": "not happy fox"
    }
  }
}

DELETE my_index
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "eng": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

POST my_index/_search
{
  "profile": "true", 
  "query": {
    "multi_match": {
      "query": "not happy foxes",
      "fields": ["title","title.eng"]
    }
  }
}


#ik_max_word
#ik_smart
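#hanlp: HanLP default segmentation
#hanlp_standard: standard segmentation
#hanlp_index: index segmentation
#hanlp_nlp: NLP segmentation
#hanlp_n_short: N-shortest-path segmentation
#hanlp_dijkstra: shortest-path segmentation
#hanlp_crf: CRF segmentation (deprecated since HanLP 1.6.6)
#hanlp_speed: high-speed dictionary segmentation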

POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}     

# Pinyin
PUT /artists/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "user_name_analyzer": {
          "tokenizer": "whitespace",
          "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
      },
      "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": false,
          "keep_none_chinese": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "trim_whitespace": true,
          "keep_none_chinese_in_first_letter": true
        }
      }
    }
  }
}

GET /artists/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}

# Search Template

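https://www.elastic.co/guide/en/elasticsearch/reference/master/search-template.html

A search template is a stored search that you can run with different variables. If you use Elasticsearch as a search backend, you can pass user input from a search bar as parameters to a search template, which lets you run searches without exposing the Elasticsearch query syntax to users. If you use Elasticsearch for a custom application, search templates let you change your searches without modifying the application's code. In short, they decouple development from operations.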

POST _scripts/my_search_template
{
  "script": {
    "lang": "mustache",
    "source": {
      "_source": [
        "id",
        "title"
      ],
      "query": {
        "match": {
          "title": "{{my_var}}"
        }
      },
      "from": "{{from}}",
      "size": "{{size}}"
    }
  }
}


DELETE _scripts/my_search_template
GET _scripts/my_search_template
GET /_cluster/state/metadata?filter_path=metadata.stored_scripts&pretty

POST movies/_search/template
{
  "id": "my_search_template",
  "params": {
    "from": 0,
    "size": 20,
    "my_var": "beautiful mind"
  }
}

# Aliases

https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-aliases.html

Giving an index an alias makes zero-downtime operations possible.


PUT movies-2019/_doc/1
{
  "name":"the matrix",
  "rating":5
}

PUT movies-2019/_doc/2
{
  "name":"Speed",
  "rating":3
}

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies-2019",
        "alias": "movies-lastest"
      }
    }
  ]
}

POST movies-lastest/_search

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies-2019",
        "alias": "movies-lastest-higtrate",
        "filter": {
          "range": {
            "rating": {
              "gte": 4
            }
          }
        }
      }
    }
  ]
}

POST movies-lastest-higtrate/_search

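# Tuning scores with Function Score Query

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-function-score-query.html

Scoring and sorting

  • By default, ES sorts results by document relevance
  • Results can instead be sorted on one or more specified fields
  • Sorting by the relevance score (_score) alone cannot satisfy some requirements
    • It gives no further control over the ordering beyond relevance

Function Score Query

  • After the query finishes, it can re-score every matching document with a series of functions and sort by the newly computed scores
  • Several built-in scoring functions are provided
    • Weight: sets a simple, non-normalized weight for each document
    • Field Value Factor: uses a field's value to modify _score, for example taking "popularity" or "number of likes" into account
    • Random Score: gives each user a different, random scoring result
    • Decay functions: score by distance from a reference value of a field; the closer a document is to that value, the higher its score
    • Script Score: a custom script gives full control over the logic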

DELETE blogs
PUT /blogs/_doc/1
{
  "title":   "About popularity",
  "content": "In this post we will talk about...",
  "votes":   0
}

PUT /blogs/_doc/2
{
  "title":   "About popularity",
  "content": "In this post we will talk about...",
  "votes":   100
}

PUT /blogs/_doc/3
{
  "title":   "About popularity",
  "content": "In this post we will talk about...",
  "votes":   1000000
}

# Let a field influence the score; by default the two values are multiplied
POST blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": ["title","content"]
        }
      },
      "field_value_factor": {
        // _score*votes
        "field": "votes"
      }
    }
  }
}

# https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-function-score-query.html#function-field-value-factor
# Smoothing: brings the scores relatively closer together
POST blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": ["title","content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        // _score*log(1+votes)
        "modifier": "log1p"
      }
    }
  }
}
# Additionally apply a factor
POST blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": ["title","content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        // _score * log(1 + factor * votes)
        "factor": 0.1
      }
    }
  }
}

POST blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": ["title","content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        "factor": 0.1
      },
      // how the function score is combined with the query score
      "boost_mode": "sum",
      // upper limit for the function score
      "max_boost": 2
    }
  }
}
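
The arithmetic behind the four requests above can be sketched in Python. This is an illustration under stated assumptions, not Elasticsearch code: the helper names are made up, only the `none` and `log1p` modifiers and the `multiply` and `sum` boost modes are modelled, and `log1p` is taken to be the base-10 logarithm of 1 plus the value, as the 7.x reference describes it.

```python
import math


def field_value_factor(value, factor=1.0, modifier="none"):
    """Function score derived from a numeric field (hypothetical helper)."""
    scaled = factor * value
    if modifier == "none":
        return scaled
    if modifier == "log1p":
        return math.log10(1 + scaled)  # assumed: common logarithm of 1 + value
    raise ValueError(f"modifier not modelled in this sketch: {modifier}")


def combine(query_score, func_score, boost_mode="multiply", max_boost=float("inf")):
    """Cap the function score with max_boost, then apply boost_mode."""
    capped = min(func_score, max_boost)
    if boost_mode == "multiply":
        return query_score * capped
    if boost_mode == "sum":
        return query_score + capped
    raise ValueError(f"boost_mode not modelled in this sketch: {boost_mode}")


# The three sample documents have 0, 100 and 1000000 votes; 0.5 is a made-up _score.
for votes in (0, 100, 1_000_000):
    raw = combine(0.5, field_value_factor(votes))
    smooth = combine(0.5, field_value_factor(votes, 0.1, "log1p"))
    capped = combine(0.5, field_value_factor(votes, 0.1, "log1p"), "sum", max_boost=2)
    print(votes, raw, round(smooth, 4), round(capped, 4))
```

Note that with the default multiply mode a document with 0 votes ends up with a score of 0, which is one reason to switch boost_mode to sum.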
# Consistent random score between 0 and 1
POST blogs/_search
{
  "query": {
    "function_score": {
      "random_score": {
        // the random score stays the same as long as the seed is unchanged
        "seed": 314159265359,
        "field": "votes"
      }
    }
  }
}


# Suggester

https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters.html#search-suggesters

DELETE articles

POST articles/_bulk
{ "index" : { } }
{ "body": "lucene is very cool"}
{ "index" : { } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "body": "Elasticsearch rocks"}
{ "index" : { } }
{ "body": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "body": "Elk stack rocks"}
{ "index" : {} }
{  "body": "elasticsearch is rock solid"}

POST _analyze
{
  "analyzer": "standard",
  "text": ["Elk stack rocks rock"]
}

# missing: only generate suggestions for terms that are not in the shard
POST articles/_search
{
  "size": 1,
  "query": {
    "match": {
      "body": "lucen rock"
    }
  },
  "suggest": {
    "my_suggestion": {
      "text": "lucen rock",
      "term": {
        "field": "body",
        "suggest_mode": "missing"
      }
    }
  }
}
# popular: only suggest terms that occur in more documents than the original suggest text term
POST articles/_search
{
  "suggest": {
    "my_suggestion": {
      "text": "lucen rock",
      "term": {
        "field": "body",
        "suggest_mode": "popular"
      }
    }
  }
}
# always: suggest any matching suggestions based on the terms in the suggest text
POST articles/_search
{
  "suggest": {
    "my_suggestion": {
      "text": "lucen rock",
      "term": {
        "field": "body",
        "suggest_mode": "always"
      }
    }
  }
}
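# prefix_length: the minimum number of prefix characters that must match for a term to be a suggestion candidate. Defaults to 1. Increasing it improves spellcheck performance, since typos usually do not occur at the beginning of a term.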
POST articles/_search
{
  "suggest": {
    "my_suggestion": {
      "text": "lucen hock",
      "term": {
        "field": "body",
        "suggest_mode": "always",
        // minimum prefix length that must match, default 1
        "prefix_length": 0
      }
    }
  }
}

# phrase suggest
# https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters.html#phrase-suggester
POST articles/_search
{
  "suggest": {
    "YOUR_SUGGESTION": {
      "text": "lucne and elasticsear rock hello world",
      "phrase": {
        "field": "body",
        // maximum number of terms that may be treated as misspellings
        "max_errors": 2,
        "direct_generator": [{
          "field": "body",
          "suggest_mode": "always"
        }],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

# Autocomplete & context-based suggestions

https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters.html#completion-suggester

DELETE articles
PUT articles
{
  "mappings": {
    "properties": {
      "title_completion":{
        "type": "completion"
      }
    }
  }
}

POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}

# Prefix matching
POST articles/_search
{
  "size": 0,
  "suggest": {
    "YOUR_SUGGESTION": {
      //"prefix": "el",
      "prefix": "elk",
      "completion": {
        "field": "title_completion"
      }
    }
  }
}

DELETE comments
PUT comments

# Define the mapping
# https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters.html#context-suggester
PUT comments/_mapping
{
  "properties": {
    "my_autocomplete": {
      "type": "completion",
      "contexts": [{
        // context type: category or geo
        "type": "category",
        // name of the context
        "name": "comment_category"
      }]
    }
  }
}

# Index documents and assign each a category context
POST comments/_doc
{
  "comment": "I love the star war movies",
  "my_autocomplete": {
    // the suggestion input
    "input": ["star wars"],
    "contexts": {
      "comment_category": "movies"
    }
  }
}

GET comments/_search

POST comments/_doc
{
  "comment": "Where can I find a Starbucks",
  "my_autocomplete": {
    "input": ["starbucks"],
    "contexts": {
      "comment_category": "coffee"
    }
  }
}

POST comments/_search
{
  "suggest": {
    "YOUR_SUGGESTION": {
      "prefix": "sta",
      "completion": {
        "field": "my_autocomplete",
        "contexts": {
          // category to filter by
          //"comment_category": "coffee"
          "comment_category": "movies"
        }
      }
    }
  }
}


# Cross-cluster search

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/remote-clusters-connect.html

# Start 3 clusters with one node each

bin/elasticsearch -E node.name=cluster0node -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
bin/elasticsearch -E node.name=cluster1node -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
bin/elasticsearch -E node.name=cluster2node -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302


# Apply this setting dynamically on every cluster so that each cluster knows about the others
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster0": {
          "seeds": [
            "192.168.83.130:9300"
          ],
          "transport.ping_schedule": "30s"
        },
        "cluster1": {
          "seeds": [
            "192.168.83.130:9301"
          ],
          "transport.compress": true,
          "skip_unavailable": true
        },
        "cluster2": {
          "seeds": [
            "192.168.83.130:9302"
          ]
        }
      }
    }
  }
}

#cURL
curl -XPUT "http://192.168.83.130:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["192.168.83.130:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["192.168.83.130:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["192.168.83.130:9302"]}}}}}'

curl -XPUT "http://192.168.83.130:9201/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["192.168.83.130:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["192.168.83.130:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["192.168.83.130:9302"]}}}}}'

curl -XPUT "http://192.168.83.130:9202/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["192.168.83.130:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["192.168.83.130:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["192.168.83.130:9302"]}}}}}'


# Create test data
curl -XPOST "http://192.168.83.130:9200/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user1","age":10}'

curl -XPOST "http://192.168.83.130:9201/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user2","age":20}'

curl -XPOST "http://192.168.83.130:9202/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user3","age":30}'


# Query
GET /users,cluster1:users,cluster2:users/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 20,
        "lte": 40
      }
    }
  }
}

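# Distributed cluster model, master election, and split-brain

https://www.elastic.co/cn/blog/a-new-era-for-cluster-coordination-in-elasticsearch

# Distributed characteristics

  • Benefits of ES's distributed architecture
    • Storage scales horizontally and supports PB-scale data
    • Higher availability: the cluster keeps serving when some nodes stop
  • ES's distributed architecture
    • Clusters are distinguished by name; the default name is "elasticsearch"
    • The name is set in the configuration file or on the command line with -E cluster.name=my_cluster_name

# Nodes

  • A node is an ES instance
    • It is essentially a Java process
    • One machine can run several ES processes, but in production it is recommended to run only one ES instance per machine
  • Every node has a name, set in the configuration file or at startup with -E node.name=node01
  • After startup each node is assigned a UID, which is stored in the data directory

# Coordinating Node

  • The node that handles a request is called the Coordinating Node
    • It routes the request to the right node; for example, a create-index request has to be routed to the Master node
  • Every node is a Coordinating Node by default
  • Setting all other node types to false makes a node a Dedicated Coordinating Node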

# Cerebro visualization

https://github.com/lmenezes/cerebro/releases

Cerebro is a visualization tool for ES. Start one node:

bin/elasticsearch -E node.name=node1 -E cluster.name=geektime -E path.data=node1_data -E http.port=9200 -E cluster.initial_master_nodes=node1

Start Cerebro:

sudo ./cerebro-0.9.4/bin/cerebro

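Open http://192.168.83.130:9000/ and enter the node address http://192.168.83.130:9200/ to connect. From there you can create an index, here with 3 shards and 1 replica. The cluster turns yellow (degraded): the primary shards work, but the replicas cannot be allocated because there is only one node. Add another node: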

bin/elasticsearch -E node.name=node2 -E cluster.name=geektime -E path.data=node2_data -E http.port=9201 -E cluster.initial_master_nodes=node1

The cluster automatically turns green: primary and replica shards are all working.

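# Data Node

  • A node that can store data is called a Data Node
    • A node is a data node by default after startup. Set node.data: false to disable this
  • Responsibilities of a Data Node
    • Stores shard data and plays a crucial role in scaling data (the Master Node decides how shards are distributed across the data nodes)
  • Adding data nodes
    • Scales data horizontally and removes the single point of failure for data

# Master Node

  • Responsibilities of the Master Node
    • Handles index creation and deletion requests / decides which node each shard is allocated to
    • Maintains and updates the Cluster State
  • Best practices for Master Nodes
    • The Master node is critical, so the deployment must avoid a single point of failure
    • Configure multiple Master nodes per cluster / have each of them take on the Master role only

# Master Eligible Nodes & the election process

  • A cluster can be configured with multiple Master Eligible nodes. When necessary (for example when the Master node fails or the network fails), these nodes take part in the election and one of them becomes the Master node
  • Every node is a Master eligible node by default after startup
    • Set node.master: false to disable this
  • When the first Master eligible node in a cluster starts, it elects itself as the Master node

# Cluster State

  • The Cluster State holds the essential information about a cluster
    • Information about all nodes
    • All indices with their Mapping and Setting information
    • Shard routing information
  • Every node stores the cluster state
  • However, only the Master node may modify the cluster state, and it is responsible for syncing it to the other nodes
    • If any node could modify it, the Cluster State would become inconsistent

# Master Eligible Nodes & how the election works

  • Nodes ping each other; the node with the lowest Node Id is elected
  • The other nodes join the cluster but do not take on the Master role. As soon as the elected master is found to be lost, a new Master node is elected

# Split-brain

Split-brain is a classic network problem in distributed systems: a network failure leaves one node unable to reach the others

  • Node 2 and Node 3 elect a new Master
  • Node 1 still acts as Master, forms a cluster of its own, and keeps updating the Cluster State
  • The result is 2 masters maintaining different cluster states. When the network recovers, there is no way to pick the correct one to restore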

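# How to avoid split-brain

  • Set an election condition, a quorum: an election can only take place when the number of reachable Master eligible nodes is at least the quorum
    • Quorum = (number of master eligible nodes / 2) + 1, using integer division
    • With 3 master eligible nodes, setting discovery.zen.minimum_master_nodes to 2 avoids split-brain
  • From 7.0 on, this setting is no longer needed
    • The minimum_master_nodes parameter was removed; Elasticsearch itself chooses the nodes that can form a quorum.
    • A typical master election now completes in a very short time. Scaling a cluster up or down is safer and easier, and there are fewer configuration options that could lead to data loss.
    • Nodes log their state more clearly, which helps diagnose why they cannot join a cluster or why no master can be elected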
# Configuring node types

By default a node is a Master eligible, data, and ingest node.

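# Shards and cluster failover

# Primary Shard - increasing storage capacity

  • Shards are the cornerstone of Elasticsearch's distributed storage
    • Primary shards / replica shards
  • Primary shards distribute the data across all nodes
    • With Primary Shards, the data of one index can be spread over multiple Data Nodes, which scales storage horizontally
    • The number of primary shards is set when the index is created and by default cannot be changed afterwards; changing it requires reindexing

# Replica Shard - improving data availability

  • Data availability
    • Replica shards improve data availability. If a primary shard is lost, a replica shard can be promoted to primary. The number of replica shards can be adjusted dynamically, so that every node can hold a complete copy of the data. Without replica shards, a hardware failure on one node may cause data loss
  • Better read performance
    • Replica shards are synced from the primary shard. Adding more replicas can increase read throughput to some extent

# Cluster health

  • Green: healthy; all primary and replica shards are available
  • Yellow: degraded; all primary shards are available, but some replica shards are not
  • Red: unhealthy; some primary shards are unavailable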
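
The three colours follow directly from which shard copies are active. A small Python sketch of that rule, using a made-up `(is_primary, is_active)` tuple per shard copy rather than any real Elasticsearch data structure:

```python
def cluster_health(shard_copies):
    """Derive the health colour from (is_primary, is_active) pairs."""
    if any(is_primary and not active for is_primary, active in shard_copies):
        return "red"      # at least one primary shard is unavailable
    if any(not active for _, active in shard_copies):
        return "yellow"   # all primaries are fine, some replica is not
    return "green"        # every primary and replica is available


# One index with 3 primaries and 1 replica each, running on a single node:
single_node = [(True, True)] * 3 + [(False, False)] * 3
print(cluster_health(single_node))
```

This matches the single-node experiment earlier in these notes, where the replicas had no second node to be allocated to.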
// http://192.168.83.130:9200/_cluster/health

{
  "cluster_name": "geektime",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 3,
  "active_shards": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

# Testing failover

Start three nodes

bin/elasticsearch -E node.name=node1 -E cluster.name=geektime -E path.data=node1_data -E http.port=9200 -E cluster.initial_master_nodes=node1
bin/elasticsearch -E node.name=node2 -E cluster.name=geektime -E path.data=node2_data -E http.port=9201 -E cluster.initial_master_nodes=node1
bin/elasticsearch -E node.name=node3 -E cluster.name=geektime -E path.data=node3_data -E http.port=9202 -E cluster.initial_master_nodes=node1

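Create an index with 3 primary shards and 1 replica (a replica is an extra copy, so together with the primary every piece of data exists twice). The cluster runs normally. After one node is shut down, two replicas are temporarily unassigned. After a short while they are reallocated automatically (failover).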

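# Distributed document storage

# Documents are stored on shards

  • A document is stored on one specific primary shard and its replica shards: for example, document 1 is stored on shards P0 and R0
  • The document-to-shard mapping algorithm
    • It must spread documents evenly over all shards so that hardware is fully used, instead of leaving some machines idle while others are busy
    • Candidate algorithms
      • Random / Round Robin. To look up document 1 when there are many shards, several shards may have to be queried before it is found
      • Maintaining a document-to-shard mapping table. With a large number of documents the maintenance cost is high
      • Computing the shard on the fly: given document 1, calculate which shard holds it

The document-to-shard routing algorithm

  • shard = hash(_routing) % number_of_primary_shards
    • The hash function ensures documents are spread evenly across the shards
    • The default _routing value is the document id
    • You can set your own routing value, for example to place all products of the same country on a given shard https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-routing-field.html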
PUT my-index-000001/_doc/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}

GET my-index-000001/_doc/1?routing=user1 
  • This formula is the fundamental reason why the number of primary shards cannot be changed freely once the Index Settings are in place
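
A short Python sketch shows why the routing formula ties documents to the primary shard count. It is an illustration only: `zlib.crc32` stands in for the hash function Elasticsearch uses internally, and the document ids are made up.

```python
import zlib


def route(routing, number_of_primary_shards):
    """shard = hash(_routing) % number_of_primary_shards (crc32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]  # hypothetical ids used as _routing

# The same routing value always maps to the same shard for a fixed shard count.
print(route("user1", 3), route("user1", 3))

# Changing the shard count from 3 to 4 sends most documents to a different shard,
# so lookups by id would go to the wrong shard unless everything is reindexed.
moved = sum(route(d, 3) != route(d, 4) for d in doc_ids)
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```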

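# Updating and deleting documents

Update and delete requests are both forwarded to the Primary Shard, which performs the operation and then forwards the request to the Replica Shards so that they do the same.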

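# Shards and their lifecycle

# Shard internals

  • What is an ES shard
    • The smallest unit of work in ES / a Lucene Index
  • Some questions:
    • Why is ES search near real-time (a document becomes searchable after 1 second)
    • How does ES make sure no data is lost on power failure
    • Why does deleting a document not free disk space immediately

# Immutability of the inverted index

  • The inverted index follows an Immutable Design: once written, it cannot be changed
  • Immutability brings these benefits:
    • No need to handle concurrent writes to a file, which avoids the performance cost of locking
    • Once read into the kernel's file system cache, the index stays there. As long as the file system cache has enough space, most requests are served from memory without hitting the disk, which improves performance considerably
    • Caches are easy to build and maintain / data can be compressed
  • Immutability also brings a challenge: to make a new document searchable, the whole index would have to be rebuilt.

# Lucene Index

  • In Lucene, a single inverted index file is called a Segment. A Segment is self-contained and immutable. Multiple Segments together form a Lucene Index, which corresponds to a Shard in ES
  • When new documents are written, a new Segment is created. A query searches all Segments and merges the results. Lucene keeps a file that records all Segments, called the Commit Point
  • Information about deleted documents is kept in the ".del" file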

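# Refresh

  • Writing the Index Buffer into a Segment is called Refresh. Refresh does not perform an fsync
  • Refresh frequency: once per second by default, configurable through index.refresh_interval. After a Refresh the data is searchable, which is why Elasticsearch is called a near real-time search engine
  • If the system receives a large volume of writes, many Segments are produced
  • A full Index Buffer also triggers a Refresh; its default size is 10% of the JVM heap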

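# Transaction Log

https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-translog.html

  • Writing a Segment to disk is relatively slow, so on Refresh the Segment is first written to the file system cache to make it searchable
  • To make sure no data is lost, every document that is indexed is also written to the Transaction Log. In recent versions the Transaction Log is persisted to disk by default (on every request by default, or every 5s). Each shard has its own Transaction Log
  • On an ES Refresh the Index Buffer is emptied, but the Transaction Log is not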

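# Flush

ES Flush & Lucene Commit

  • Calls Refresh, which empties the Index Buffer
  • Calls fsync, which writes the Segments in the cache to disk
  • Clears (deletes) the Transaction Log
  • Runs every 30 minutes by default
  • Also runs when the Transaction Log is full (512 MB by default)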
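
How Refresh, Flush and the Transaction Log interact can be sketched with a toy in-memory model. This is a simplification for illustration, not how Lucene is implemented: the class `ToyShard` and its fields are hypothetical, and "disk" is just a counter.

```python
class ToyShard:
    """Toy model of one shard's write path (hypothetical, in-memory only)."""

    def __init__(self):
        self.index_buffer = []       # newly indexed docs, not searchable yet
        self.segments = []           # searchable segments (file system cache)
        self.translog = []           # replayed after a crash
        self.committed_segments = 0  # segments that have been fsynced

    def index(self, doc):
        self.index_buffer.append(doc)
        self.translog.append(doc)    # written alongside the buffer

    def refresh(self):
        # Turns the buffer into a searchable segment; no fsync, translog kept.
        if self.index_buffer:
            self.segments.append(list(self.index_buffer))
            self.index_buffer.clear()

    def flush(self):
        self.refresh()                                # 1. refresh
        self.committed_segments = len(self.segments)  # 2. fsync segments
        self.translog.clear()                         # 3. clear the translog

    def search(self):
        return [doc for segment in self.segments for doc in segment]


shard = ToyShard()
shard.index({"content": "good"})
print(shard.search())       # nothing yet: the document is still in the buffer
shard.refresh()
print(shard.search())       # searchable after refresh
print(len(shard.translog))  # the translog still holds the operation
shard.flush()
print(len(shard.translog))  # cleared once the segments are on disk
```

The model also hints at the earlier question about power failure: anything that is not yet in a committed segment can be replayed from the translog.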


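# Merge

  • Over time the number of segments on disk keeps growing, so they have to be merged regularly.
  • ES and Lucene merge automatically, combining segments and purging documents that were marked as deleted.
  • A merge can also be triggered manually: POST index/_forcemerge.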

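# Distributed search and relevance scoring

How distributed search works: an Elasticsearch search runs in two phases

  • Phase 1 - Query
  • Phase 2 - Fetch

# Query phase

  • The user sends a search request to an ES node. The node receiving it acts as the Coordinating node, picks 3 of the 6 primary and replica shards at random (one copy of each shard), and sends them the query
  • Each selected shard runs the query and sorts its results. It then returns From + Size sorted document Ids and sort values to the Coordinating node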

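# Fetch phase

  • The Coordinating Node re-sorts the sorted document Id lists obtained from each shard in the Query phase and selects the Ids of the documents from From to From + Size
  • It then retrieves the full documents from the corresponding shards with a multi get request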
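
The two phases can be sketched in Python with plain lists standing in for shards. This is a simplified illustration with made-up scores and document ids: every shard returns its top from + size hits, and the coordinating node merges them and cuts out the requested page.

```python
import heapq


def query_phase(shard_hits, from_, size):
    """Each shard returns its top (from + size) hits as (score, doc_id) pairs."""
    return heapq.nlargest(from_ + size, shard_hits)


def fetch_phase(per_shard_hits, from_, size):
    """The coordinating node merges, re-sorts and selects the requested page."""
    merged = sorted((hit for hits in per_shard_hits for hit in hits), reverse=True)
    return [doc_id for _, doc_id in merged[from_:from_ + size]]


# Three shards holding hypothetical (score, doc_id) pairs.
shards = [
    [(0.9, "a"), (0.4, "b"), (0.1, "c")],
    [(0.8, "d"), (0.7, "e")],
    [(0.6, "f")],
]
per_shard = [query_phase(hits, from_=2, size=2) for hits in shards]
print(sum(len(hits) for hits in per_shard))   # entries the coordinator must sort
print(fetch_phase(per_shard, from_=2, size=2))
```

The first printed number grows with both the shard count and from + size, which is the root of the deep pagination cost.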

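# Potential problems with Query Then Fetch

  • Performance
    • Number of documents each shard has to collect = from + size
    • The coordinating node has to process number_of_shards * (from + size) entries in the end
    • Deep pagination
  • Relevance scoring
    • Every shard computes relevance from its own data only, so scores on different shards are independent of each other. This can skew the scores, especially when there is little data. With few documents in total and more than 1 primary shard, the more primary shards there are, the less accurate the relevance scores become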

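# Fixing inaccurate scores

  • With a small amount of data, set the number of primary shards to 1
    • Once there is enough data, and documents are spread evenly over the shards, the results are generally not skewed
  • Use DFS Query Then Fetch
    • Add the parameter to the search URL: _search?search_type=dfs_query_then_fetch
    • It first collects the term frequencies and document frequencies from every shard and then computes relevance once over the complete statistics. This costs more CPU and memory and performs poorly, so it is generally not recommended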
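
Why per-shard statistics skew the score can be seen from the IDF part of BM25 alone. The sketch below assumes the Lucene BM25 formula idf = ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term; the shard layout is a made-up example with three documents each landing on a different shard.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF as used by BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Three documents that all contain the term "good", one document per shard.
per_shard_idf = bm25_idf(doc_count=1, doc_freq=1)  # what each shard sees locally
global_idf = bm25_idf(doc_count=3, doc_freq=3)     # what dfs_query_then_fetch uses

print(round(per_shard_idf, 4), round(global_idf, 4))
```

Each shard believes the term is rarer than it really is, so local scores differ from the scores computed over global statistics.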
DELETE message

POST message/_doc
{
  "content":"good"
}

POST message/_doc
{
  "content":"good morning"
}

POST message/_doc
{
  "content":"good morning everyone"
}

POST message/_search
{
  //"explain": true,
  "query": {
    "match_all": {}
  }
}


POST message/_search
{
  //"explain": true,
  "query": {
    "term": {
      "content": {
        "value": "good"
      }
    }
  }
}

DELETE message
PUT message
{
  "settings": {
    "number_of_shards": 20
  }
}

GET message

POST message/_doc?routing=1
{
  "content":"good"
}

POST message/_doc?routing=2
{
  "content":"good morning"
}

POST message/_doc?routing=3
{
  "content":"good morning everyone"
}

POST message/_search
{
  "explain": true,
  "query": {
    "term": {
      "content": {
        "value": "good"
      }
    }
  }
}

POST message/_search?search_type=dfs_query_then_fetch
{

  "query": {
    "term": {
      "content": {
        "value": "good"
      }
    }
  }
}

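# Sorting

https://www.elastic.co/guide/en/elasticsearch/reference/master/sort-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html

By default Elasticsearch sorts results by relevance score in descending order

  • The sort parameter lets you define your own ordering
  • If _score is not among the sort criteria, the score is returned as null

How sorting works

  • Sorting operates on the original field content, so the inverted index is of no use
  • A forward index is needed: it returns the original field content quickly given a document Id and a field
  • Elasticsearch has two implementations
    • Fielddata
    • Doc Values (columnar storage, not available for the Text type)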

# Sort by a single field
POST /kibana_sample_data_ecommerce/_search
{
  "size": 5,
  "query": {
    "match_all": {

    }
  },
  "sort": [
    {"order_date": {"order": "desc"}}
  ]
}

# Sort by multiple fields
POST /kibana_sample_data_ecommerce/_search
{
  "size": 5,
  "query": {
    "match_all": {

    }
  },
  "sort": [
    {"order_date": {"order": "desc"}},
    {"_doc":{"order": "asc"}},
    {"_score":{ "order": "desc"}}
  ]
}

GET kibana_sample_data_ecommerce/_mapping

# Sorting on a text field fails by default; fielddata has to be enabled
POST /kibana_sample_data_ecommerce/_search
{
  "size": 5,
  "query": {
    "match_all": {

    }
  },
  "sort": [
    {"customer_full_name": {"order": "desc"}}
  ]
}

# Enable fielddata on the text field
PUT kibana_sample_data_ecommerce/_mapping
{
  "properties": {
    "customer_full_name" : {
          "type" : "text",
          "fielddata": true,
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
  }
}

# Disable doc values on a keyword field
PUT test_keyword
PUT test_keyword/_mapping
{
  "properties": {
    "user_name":{
      "type": "keyword",
      "doc_values":false
    }
  }
}

DELETE test_keyword

PUT test_text
# Not supported: text fields cannot use doc values
PUT test_text/_mapping
{
  "properties": {
    "intro":{
      "type": "text",
      "doc_values":true
    }
  }
}

DELETE test_text


DELETE temp_users
PUT temp_users
PUT temp_users/_mapping
{
  "properties": {
    "name":{"type": "text","fielddata": true},
    "desc":{"type": "text","fielddata": true}
  }
}

POST temp_users/_doc
{"name":"Jack","desc":"Jack is a good boy!","age":10}

# With fielddata enabled, inspect the docvalue_fields data
POST  temp_users/_search
{
  "docvalue_fields": [
    "name","desc"
    ]
}

# Inspect the doc values of an integer field
POST  temp_users/_search
{
  "docvalue_fields": [
    "age"
    ]
}

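# Pagination and traversal

By default a query sorts by relevance score and returns the first 10 records

  • From: the starting position
  • Size: the number of documents to return

# Deep pagination in a distributed system

  • ES is distributed by design. A query has to gather data that is stored across multiple shards on multiple machines, and ES still has to return the results sorted (by relevance score)
  • For a query with From = 990, Size = 10
    • Each shard first collects 1000 documents. The Coordinating Node then aggregates all the results and sorts them to select the top 1000 documents, of which the last 10 are returned
    • The deeper the page, the more memory is used. To limit the memory cost of deep pagination, ES has a setting, index.max_result_window, which by default caps from + size at 10000 documents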

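# Search After: avoiding deep pagination

https://www.elastic.co/guide/en/elasticsearch/reference/master/paginate-search-results.html#search-after

  • Avoids the performance problems of deep pagination and fetches the next page of documents in real time
    • Does not support jumping to a page (From)
    • Can only move forward
  • The first search must specify sort, and the sort values must be unique (adding _id guarantees uniqueness)
  • Each following search passes the sort values of the last document from the previous response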
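
The paging logic can be imitated in Python over an in-memory list. This is a sketch of the idea only: the helper `search_after_page` and the document ids are made up, and the sort order mirrors the age descending, _id ascending order of the requests in this section.

```python
def search_after_page(docs, size, after=None):
    """Return one page plus the sort values of its last hit."""

    def sort_key(doc):
        return (-doc["age"], doc["_id"])  # age desc, then _id asc as tiebreaker

    ordered = sorted(docs, key=sort_key)
    if after is not None:
        after_key = (-after[0], after[1])
        # Keep only documents that sort strictly after the previous last hit.
        ordered = [d for d in ordered if sort_key(d) > after_key]
    hits = ordered[:size]
    last_sort = [hits[-1]["age"], hits[-1]["_id"]] if hits else None
    return hits, last_sort


users = [
    {"_id": "a", "name": "user1", "age": 10},
    {"_id": "b", "name": "user2", "age": 11},
    {"_id": "c", "name": "user3", "age": 12},
    {"_id": "d", "name": "user4", "age": 13},
]

cursor = None
while True:
    page, cursor = search_after_page(users, size=1, after=cursor)
    if not page:
        break
    print(page[0]["name"], cursor)
```

Each request only needs the previous cursor, so the cost per page stays the same no matter how deep the traversal goes.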
# Fails by default: from + size must not exceed index.max_result_window (10000)
POST /kibana_sample_data_ecommerce/_search
{
  "from": 10000,
  "size": 1
}

#Search After
DELETE users

POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":11}
POST users/_doc
{"name":"user3","age":12}
POST users/_doc
{"name":"user4","age":13}

GET users/_count

POST users/_search
{
    "size": 1,
    "query": {
        "match_all": {}
    },
    "sort": [
        {"age": "desc"} ,
        {"_id": "asc"}    
    ]
}

POST users/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  // the sort criteria must stay the same
  "sort": [
    {"age": "desc"},
    {"_id": "asc"}
  ],
  // pass the sort values of the last hit from the previous response
  "search_after": [
    10,
    "6Q3ozIABWtHVGnC1DW90"
  ]
}

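# Scroll API

  • Creates a snapshot; data written afterwards cannot be found through it
  • After each query, pass in the Scroll Id returned by the previous one

The official documentation no longer recommends the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).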

#Scroll API
DELETE users
POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":20}
POST users/_doc
{"name":"user3","age":30}
POST users/_doc
{"name":"user4","age":40}

GET users/_count

POST /users/_search?scroll=5m
{
    "size": 2,
    "query": {
        "match_all" : {
        }
    }
}

# Add a document after the snapshot has been created
POST users/_doc
{"name":"user5","age":50}

POST /_search/scroll
{
    // extend the lifetime of the scroll context
    "scroll" : "3m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAZqwWaDdob2NCRDBRVjIybktoc1A0SWRnQQ=="
}
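
The snapshot behaviour can be pictured with a toy model. The class `ScrollContext` below is a hypothetical name and a deliberately simplified sketch: it freezes a copy of the data when the scroll is opened, which is the observable effect described above, not how Elasticsearch stores its search contexts.

```python
class ScrollContext:
    """Toy scroll: iterates over a frozen copy of the index contents."""

    def __init__(self, index, size):
        self._snapshot = list(index)  # later writes to `index` are not seen
        self._size = size
        self._pos = 0

    def next_batch(self):
        batch = self._snapshot[self._pos:self._pos + self._size]
        self._pos += len(batch)
        return batch


index = ["user1", "user2", "user3", "user4"]
scroll = ScrollContext(index, size=2)
first = scroll.next_batch()
index.append("user5")          # written after the snapshot was taken
second = scroll.next_batch()
third = scroll.next_batch()    # empty: user5 is never returned
print(first, second, third)
```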

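# Handling Concurrent Reads and Writes

https://www.elastic.co/guide/en/elasticsearch/reference/master/optimistic-concurrency-control.html

When two web applications update the same document at the same time without effective concurrency control, changes can be lost

  • Pessimistic concurrency control
    • Assumes that conflicting changes are likely and locks the resource to prevent them, for example database row locks
  • Optimistic concurrency control
    • Assumes that conflicts will not happen and does not block the operation being attempted. If the data was modified between the read and the write, the update fails. The application decides how to resolve the conflict, for example by retrying the update, using the new data, or reporting the error to the user
  • ES uses optimistic concurrency control

Documents in ES are immutable. When you update a document, the old document is marked as deleted and a brand-new document is added; the document's version field is incremented by 1

  • Internal version control
    • if_seq_no + if_primary_term
  • External versions (when another database is the primary data store)
    • version + version_type=external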
DELETE products

PUT products/_doc/1
{
  "title":"iphone",
  "count":100
}

GET products/_doc/1
# "_seq_no" : 0,
# "_primary_term" : 1,

PUT products/_doc/1?if_seq_no=0&if_primary_term=1
{
  "title":"iphone",
  "count":99
}

PUT products/_doc/1?if_seq_no=1&if_primary_term=1
{
  "title":"iphone",
  "count":99
}
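
The compare-and-set idea behind if_seq_no and if_primary_term can be sketched as follows. `TinyIndex` and `VersionConflict` are hypothetical names for a toy in-memory model; the real check happens inside Elasticsearch, and this sketch only mirrors the observable behaviour of the requests above.

```python
class VersionConflict(Exception):
    """Raised when the caller's view of the document is stale."""


class TinyIndex:
    """Toy in-memory index; not how Elasticsearch is implemented."""

    def __init__(self):
        self._docs = {}
        self._next_seq_no = 0
        self._primary_term = 1

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            seen = (if_seq_no, if_primary_term)
            actual = (current["_seq_no"], current["_primary_term"]) if current else None
            if seen != actual:
                raise VersionConflict(f"expected {seen}, found {actual}")
        doc = {
            "_seq_no": self._next_seq_no,
            "_primary_term": self._primary_term,
            "_source": source,
        }
        self._next_seq_no += 1
        self._docs[doc_id] = doc
        return doc


index = TinyIndex()
index.put("1", {"title": "iphone", "count": 100})          # gets _seq_no 0
index.put("1", {"title": "iphone", "count": 99}, 0, 1)     # matches, gets _seq_no 1
try:
    index.put("1", {"title": "iphone", "count": 98}, 0, 1)  # stale _seq_no
    conflict = False
except VersionConflict:
    conflict = True
print(conflict)
```

A writer that hits the conflict would typically re-read the document, reapply its change, and retry with the fresh _seq_no and _primary_term.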


# Pass in an external version number, e.g. a custom version column in MySQL or a timestamp (it must keep increasing)
PUT products/_doc/1?version=30000&version_type=external
{
  "title":"iphone",
  "count":100
}

PUT products/_doc/1?version=30001&version_type=external
{
  "title":"iphone",
  "count":100
}
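
With version_type=external the rule is simpler: a write is accepted only if its version is strictly greater than the stored one. The following sketch assumes a plain dict as the store; `index_external` is a hypothetical helper, not a client API.

```python
def index_external(store, doc_id, source, version):
    """Accept the write only if `version` is greater than the stored version."""
    current = store.get(doc_id)
    if current is not None and version <= current["_version"]:
        raise ValueError(
            f"version conflict: {version} is not greater than {current['_version']}"
        )
    store[doc_id] = {"_version": version, "_source": source}


store = {}
index_external(store, "1", {"title": "iphone", "count": 100}, 30000)
index_external(store, "1", {"title": "iphone", "count": 100}, 30001)
try:
    index_external(store, "1", {"title": "iphone", "count": 100}, 30000)
    rejected = False
except ValueError:
    rejected = True
print(store["1"]["_version"], rejected)
```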

# Bucket & Metric

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations-metrics.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations-bucket.html

# Metric Aggregation

  • Single-value analysis: outputs a single result
    • min, max, avg, sum
    • Cardinality (similar to distinct count)
  • Multi-value analysis: outputs several results
    • stats, extended stats
    • percentile, percentile rank
    • top hits (the top-ranked sample documents)
DELETE /employees
PUT /employees/
{
  "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "gender" : {
          "type" : "keyword"
        },
        "job" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 50
            }
          }
        },
        "name" : {
          "type" : "keyword"
        },
        "salary" : {
          "type" : "integer"
        }
      }
    }
}

PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

# Metric aggregation: find the lowest salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "min_salary": {
      "min": {
        "field":"salary"
      }
    }
  }
}

# Metric aggregation: find the highest salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field":"salary"
      }
    }
  }
}

# Several metric aggregations: find the lowest, highest, and average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    },
    "min_salary": {
      "min": {
        "field": "salary"
      }
    },
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}

# One aggregation that outputs multiple values
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": {
        "field": "salary"
      }
    }
  }
}
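
To make the single-value and multi-value metrics concrete, here is what they compute for the first four sample employees. This is a plain-Python sketch; `stats` and `cardinality` are hypothetical helper names, and the real cardinality aggregation is approximate for large sets, unlike the exact set count used here.

```python
# Salary and gender of the first four sample employees (Emma, Underwood, Tran, Rivera).
salaries = [35000, 50000, 18000, 22000]
genders = ["female", "male", "male", "female"]


def stats(values):
    """Multi-value metric: the same five numbers the stats aggregation reports."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def cardinality(values):
    """Single-value metric: number of distinct values (exact in this sketch)."""
    return len(set(values))


print(stats(salaries))
print(cardinality(genders))
```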

# Bucket

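  • Documents are assigned to different buckets according to a rule, which classifies them. Some common Bucket Aggregations provided by ES:
    • Terms
    • Numeric types
      • Range / Date Range
      • Histogram / Date Histogram
    • Nesting is supported: a bucket can be split again into sub-buckets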
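
The three bucketing rules listed above can be sketched in a few lines of Python. The helper names are hypothetical and the data is the first five sample employees; the point is only to show how each rule maps a document to a bucket key.

```python
from collections import Counter

employees = [
    {"name": "Emma", "job": "Product Manager", "salary": 35000},
    {"name": "Underwood", "job": "Dev Manager", "salary": 50000},
    {"name": "Tran", "job": "Web Designer", "salary": 18000},
    {"name": "Rivera", "job": "Web Designer", "salary": 22000},
    {"name": "Rose", "job": "QA", "salary": 18000},
]


def terms_buckets(docs, field):
    # Terms: one bucket per distinct value.
    return dict(Counter(d[field] for d in docs))


def range_buckets(docs, field, ranges):
    # Range: `from` is inclusive, `to` is exclusive; a missing bound is open-ended.
    result = {}
    for key, lower, upper in ranges:
        result[key] = sum(
            1 for d in docs
            if (lower is None or d[field] >= lower) and (upper is None or d[field] < upper)
        )
    return result


def histogram_buckets(docs, field, interval):
    # Histogram: the bucket key is the value rounded down to a multiple of interval.
    return dict(Counter((d[field] // interval) * interval for d in docs))


print(terms_buckets(employees, "job"))
print(range_buckets(employees, "salary", [
    ("salary<10000", None, 10000),
    ("10000<=salary<20000", 10000, 20000),
    ("salary>=20000", 20000, None),
]))
print(histogram_buckets(employees, "salary", 5000))
```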

# Terms Aggregation

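  • A field needs doc_values or fielddata before it can be used in a Terms Aggregation
  • Keyword fields support doc_values by default
  • Text fields need fielddata enabled in the Mapping; buckets are then built from the analyzed terms
# Terms aggregation on a keyword field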
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}


# Terms aggregation on a text field: fails
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}

# Enable fielddata on the text field so that it supports terms aggregation
PUT employees/_mapping
{
  "properties" : {
    "job":{
       "type":     "text",
       "fielddata": true
    }
  }
}


# Terms aggregation on the text field: buckets are the analyzed terms
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      }
    }
  }
}


# Cardinality of job.keyword vs. job: the total number of buckets differs
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs_dis": {
      "cardinality": {
        "field": "job.keyword"
      }
    }
  }
}

POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": {
        "field": "job"
      }
    }
  }
}

# Bucket Size & Top Hits

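  • Use case: after bucketing, get the list of top matching documents inside each bucket
  • Size: bucket by age and return only the specified number of buckets
  • Top Hits: for each job type, show the 3 oldest employees
# Terms aggregation on the gender keyword field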
POST employees/_search
{
  "size": 0,
  "aggs": {
    "gender": {
      "terms": {
        "field":"gender"
      }
    }
  }
}


# Specify the number of buckets with size
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field":"age",
        //"size":5
        "size":3
      }
    }
  }
}



# With size: details of the 3 oldest employees in each job type
POST employees/_search
{
  "size": 0,
  "aggs": {
    "top3_jobs": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "top3": {
          "top_hits": {
            "size": 3,
            "sort": [
              {
                "age": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

# Range & Histogram

Buckets documents by numeric range. In a Range Aggregation you can define a custom key for each bucket


#Salary ranges: bucket by custom ranges
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field":"salary",
        "ranges":[
          {
            "key": "salary<10000",
            "to":10000
          },
          {
            "key": "10000<=salary<20000",
            "from":10000,
            "to":20000
          },
          {
            "key":"salary>=20000",
            "from":20000
          }
        ]
      }
    }
  }
}


#Salary histogram: salaries from 0 to 100000, one bucket per interval of 5000
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histogram": {
      "histogram": {
        "field":"salary",
        "interval":5000,
        "extended_bounds":{
          "min":0,
          "max":100000

        }
      }
    }
  }
}


# Nested aggregation 1: bucket by job type and compute salary statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_salary_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}

# Multiple levels of nesting: bucket by job type, then by gender, then compute salary statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "agg1": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "agg3": {
              "stats": {
                "field": "salary"
              }
            }
          }
        }
      }
    }
  }
}

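# Pipeline Aggregations

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations-pipeline.html

  • The pipeline concept: run a further aggregation over the results of another aggregation
  • The result of a pipeline aggregation is added to the original result. Depending on where it is placed, there are two kinds
    • Sibling - the result sits at the same level as the existing aggregation result
      • Max, Min, Avg & Sum Bucket
      • Stats, Extended Stats Bucket
      • Percentiles Bucket
    • Parent - the result is embedded inside the existing aggregation result
      • Derivative
      • Cumulative Sum
      • Moving Function (sliding window)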
DELETE employees
PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

## Sibling
# The job type with the lowest average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

# The job type with the highest average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "max_salary_by_job": {
      "max_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

# The average of the per-job average salaries
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "avg_salary_by_job": {
      "avg_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

# Statistics over the per-job average salaries
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "stats_salary_by_job": {
      "stats_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}



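# Percentiles of the per-job average salaries
# Percentiles Bucket returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points. The percentiles are calculated exactly, not approximated (unlike the percentiles metric).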
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "percentiles_salary_by_job": {
      "percentiles_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


## Parent
# Derivative of the average salary by age (difference between adjacent buckets)
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "interval": 1,
        "min_doc_count": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "derivative_avg_salary": {
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

# Cumulative sum over the buckets
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "interval": 1,
        "min_doc_count": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "cumulative_salary": {
          "cumulative_sum": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

# Sliding window
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "interval": 1,
        "min_doc_count": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "moving_avg_salary": {
          "moving_fn": {
            "buckets_path": "avg_salary",
            "window": 3,
            "script": "MovingFunctions.min(values)"
          }
        }
      }
    }
    
  }
}
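
The difference between sibling and parent pipelines is mostly about where the result lands. The sketch below uses illustrative bucket values (not taken from the sample data) and hypothetical helper names: `min_bucket` returns one value next to the bucket list, while `derivative` and `cumulative_sum` write a new value into each bucket.

```python
# Illustrative histogram buckets, each already holding an avg_salary metric.
buckets = [
    {"key": 20, "avg_salary": 9000.0},
    {"key": 21, "avg_salary": 16000.0},
    {"key": 22, "avg_salary": 20000.0},
]


def min_bucket(buckets, path):
    """Sibling pipeline: one result placed beside the bucket list."""
    lowest = min(b[path] for b in buckets)
    return {"value": lowest, "keys": [b["key"] for b in buckets if b[path] == lowest]}


def derivative(buckets, path, out):
    """Parent pipeline: difference to the previous bucket, stored in each bucket."""
    previous = None
    for b in buckets:
        if previous is not None:  # the first bucket has no derivative
            b[out] = b[path] - previous
        previous = b[path]


def cumulative_sum(buckets, path, out):
    """Parent pipeline: running total, stored in each bucket."""
    total = 0
    for b in buckets:
        total += b[path]
        b[out] = total


lowest = min_bucket(buckets, "avg_salary")
derivative(buckets, "avg_salary", "derivative_avg_salary")
cumulative_sum(buckets, "avg_salary", "cumulative_salary")
print(lowest)
print(buckets)
```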

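# Aggregation Scope and Sorting

  • By default, an ES aggregation runs over the result set of the query
  • ES also supports the following ways of changing the scope of an aggregation
    • Filter: filter before aggregating
    • Post_Filter: filter after aggregating
    • Global: ignore the query filter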
DELETE /employees
PUT /employees/
{
  "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "gender" : {
          "type" : "keyword"
        },
        "job" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 50
            }
          }
        },
        "name" : {
          "type" : "keyword"
        },
        "salary" : {
          "type" : "integer"
        }
      }
    }
}

PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}



# Aggregate over the result of the query
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}


# Filter: filter before aggregating
POST employees/_search
{
  "size": 0,
  "aggs": {
    "some_person": {
      "filter": {
        "range": {
          "age": {
            "from": 35,
            "to": 40
          }
        }
      },
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword"
          }
        }
      }
    },
    "all_person": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}



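#Post filter. One request returns all job types and also the hits that match the condition applied after aggregation
# Use case: get the aggregation information plus the matching documents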
POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": {
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}


# global: not affected by the outermost query
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "salary_avg": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
  }
}
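
The four scopes can be compared side by side in a small model. This is a sketch with a hypothetical `search` function over four of the sample employees; predicates stand in for ES queries, and the result keys are made up for the example.

```python
from collections import Counter

employees = [
    {"name": "Emma", "age": 32, "job": "Product Manager", "salary": 35000},
    {"name": "Underwood", "age": 41, "job": "Dev Manager", "salary": 50000},
    {"name": "Tran", "age": 25, "job": "Web Designer", "salary": 18000},
    {"name": "Rivera", "age": 26, "job": "Web Designer", "salary": 22000},
]


def search(docs, query, agg_filter=None, post_filter=None):
    matched = [d for d in docs if query(d)]
    # Default scope: aggregations see exactly the documents matched by the query.
    jobs = Counter(d["job"] for d in matched)
    # Filter aggregation: narrows the scope of this one aggregation only.
    filtered = Counter(d["job"] for d in matched if agg_filter is None or agg_filter(d))
    # Global aggregation: ignores the query and sees every document.
    global_avg = sum(d["salary"] for d in docs) / len(docs)
    # post_filter: trims the returned hits after the aggregations were computed.
    hits = [d["name"] for d in matched if post_filter is None or post_filter(d)]
    return {
        "hits": hits,
        "jobs": dict(jobs),
        "filtered_jobs": dict(filtered),
        "global_avg_salary": global_avg,
    }


result = search(
    employees,
    query=lambda d: d["age"] >= 30,
    agg_filter=lambda d: d["age"] < 40,
    post_filter=lambda d: d["job"] == "Dev Manager",
)
print(result)
```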

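Aggregation sorting

  • By default, buckets are sorted by count in descending order
  • Specifying size controls how many buckets are returned
#Sorting with order
# Sort by _count, then by _key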
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "_count": "asc"
          },
          {
            "_key": "desc"
          }
        ]
      }
    }
  }
}


#Sorting with order
# Sort by a single-value sub-aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "avg_salary": "desc"
          }
        ]
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
  }
}


#Sorting with order
# Sort by one specific value of a multi-value sub-aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "stats_salary.min": "desc"
          }
        ]
      },
      "aggs": {
        "stats_salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}

# Accuracy of Aggregations

https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-terms-aggregation.html#_calculating_document_count_error
https://github.com/elastic/elasticsearch/issues/35987

DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}


POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

GET kibana_sample_data_flights/_count
GET my_flights/_count

GET kibana_sample_data_flights/_search

# primary
# "doc_count_error_upper_bound" : 0,
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":5,
        "show_term_doc_count_error":true
      }
    }
  }
}

# shard_size: 1  -> "doc_count_error_upper_bound" : 2514
# shard_size: 10 -> "doc_count_error_upper_bound" : 0
GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":1,
        "shard_size":1,
        //"shard_size":10,
        "show_term_doc_count_error":true
      }
    }
  }
}
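
Why a small shard_size makes terms counts inaccurate can be shown with a two-shard simulation. This is a sketch under simplifying assumptions: `terms_agg` is a hypothetical function, the term counts are made up, and the error bound is computed as the sum of the last count returned by every shard that had to cut terms off.

```python
from collections import Counter


def terms_agg(shards, size, shard_size):
    """Each shard reports only its top `shard_size` terms; the coordinator merges them."""
    merged = Counter()
    error_upper_bound = 0
    for counts in shards:
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]
        merged.update(dict(top))
        if len(counts) > shard_size and top:
            # A term this shard did not report can have at most this many docs.
            error_upper_bound += top[-1][1]
    buckets = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return buckets, error_upper_bound


# Made-up per-shard term counts: "B" is the true winner with 10 docs overall.
shards = [
    {"A": 6, "B": 5, "C": 1},
    {"C": 6, "B": 5, "A": 1},
]
print(terms_agg(shards, size=1, shard_size=1))  # wrong winner, large error bound
print(terms_agg(shards, size=1, shard_size=3))  # exact result, error bound 0
```

With shard_size=1 neither shard reports "B", so the merged result misses the real top term; raising shard_size trades memory and network cost for accuracy.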

# Objects and Nested Objects

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-nested-query.html

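# Normalized Design in Relational Databases

The main goal of normalization is to "reduce unnecessary updates"

  • Side effect: a fully normalized database often suffers from slow queries
  • The more normalized the database, the more tables have to be joined
  • Normalization saves storage space, but storage keeps getting cheaper
  • Normalization simplifies updates, but reading the data may take more operations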

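# Denormalization

Denormalized design

  • Data "flattening": instead of using relations, redundant copies of the data are stored in the document
  • Advantage: no Joins to process, so reads are fast
  • Elasticsearch compresses the _source field to reduce the disk space overhead
  • Disadvantage: not suitable when the data is modified frequently
  • A change to one piece of data (a user name) may require updating many documents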

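# Handling Relationships in Elasticsearch

  • In a relational database you usually normalize the data; in Elasticsearch you usually denormalize it
  • Benefits of denormalizing: faster reads / no table joins / no row locks
  • Elasticsearch is not good at handling relationships. There are generally four ways to handle them
    • Object type
    • Nested Object
    • Parent / Child relationship
    • Application-side joins

# Object Type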

DELETE blog
# Set the Mapping of blog
PUT /blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "time": {
        "type": "date"
      },
      "user": {
        "properties": {
          "city": {
            "type": "text"
          },
          "userid": {
            "type": "long"
          },
          "username": {
            "type": "keyword"
          }
        }
      }
    }
  }
}


# Insert one blog document
PUT blog/_doc/1
{
  "content":"I like Elasticsearch",
  "time":"2019-01-01T00:00:00",
  "user":{
    "userid":1,
    "username":"Jack",
    "city":"Shanghai"
  }
}


# Query the blog document
POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"content": "Elasticsearch"}},
        {"match": {"user.username": "Jack"}}
      ]
    }
  }
}


DELETE my_movies

# Mapping for movies
PUT my_movies
{
      "mappings" : {
      "properties" : {
        "actors" : {
          "properties" : {
            "first_name" : {
              "type" : "keyword"
            },
            "last_name" : {
              "type" : "keyword"
            }
          }
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}


# Index one movie document
POST my_movies/_doc/1
{
  "title":"Speed",
  "actors":[
    {
      "first_name":"Keanu",
      "last_name":"Reeves"
    },

    {
      "first_name":"Dennis",
      "last_name":"Hopper"
    }

  ]
}

# Query the movie: the result is not what you would expect (the document matches)
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"actors.first_name": "Keanu"}},
        {"match": {"actors.last_name": "Hopper"}}
      ]
    }
  }

}
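  • When the document is stored, the boundaries of the inner objects are not taken into account; the JSON is flattened into key-value pairs
  • Querying several fields at once therefore produces unexpected results
  • The Nested Data Type solves this problem

Internally, Elasticsearch (or rather Lucene) knows nothing about the structure of each object; it only knows fields and values. The document ends up being indexed like this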

{
  "title":"Speed",
  "actors.first_name":["Keanu","Dennis"],
  "actors.last_name":["Reeves","Hopper"]
}
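
The cross-object match can be reproduced with a short sketch. The helpers `flatten`, `object_match`, and `nested_match` are hypothetical names: the first two imitate the flattened representation shown above, the third checks each inner object on its own, which is the behaviour a nested type is meant to provide.

```python
movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}


def flatten(doc, field):
    """Collapse an array of objects into multi-valued 'field.key' entries."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def object_match(doc, field, conditions):
    # Object type: every condition is checked against the flattened value lists.
    flat = flatten(doc, field)
    return all(value in flat.get(f"{field}.{key}", []) for key, value in conditions.items())


def nested_match(doc, field, conditions):
    # Nested type: all conditions must hold inside the same inner object.
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in doc[field]
    )


wrong_pair = {"first_name": "Keanu", "last_name": "Hopper"}
print(flatten(movie, "actors"))
print(object_match(movie, "actors", wrong_pair))  # matches across two actors
print(nested_match(movie, "actors", wrong_pair))  # no single actor matches
```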

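# Nested Type

Nested data type: allows the objects in an array of objects to be indexed independently

  • Using the nested and properties keywords, all actors are indexed into separate documents
  • Internally, the nested documents are stored as separate Lucene documents (two in this example) and joined at query time

To avoid this kind of cross-object matching, you can use the nested type, which indexes the events into separate Lucene documents. In both cases the JSON document of the group looks exactly the same, and the application indexes it in the same way. The difference lies in the mapping, which causes Elasticsearch to index the nested inner objects next to each other while keeping them as independent Lucene documents.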


DELETE my_movies
# Create the Mapping with a nested object
PUT my_movies
{
      "mappings" : {
      "properties" : {
        "actors" : {
          "type": "nested",
          "properties" : {
            "first_name" : {"type" : "keyword"},
            "last_name" : {"type" : "keyword"}
          }},
        "title" : {
          "type" : "text",
          "fields" : {"keyword":{"type":"keyword","ignore_above":256}}
        }
      }
    }
}


POST my_movies/_doc/1
{
  "title":"Speed",
  "actors":[
    {
      "first_name":"Keanu",
      "last_name":"Reeves"
    },

    {
      "first_name":"Dennis",
      "last_name":"Hopper"
    }

  ]
}

# Nested query
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "Speed"}},
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  {"match": {
                    "actors.first_name": "Keanu"
                  }},

                  {"match": {
                    "actors.last_name": "Hopper"
                  }}
                ]
              }
            }
          }
        }
      ]
    }
  }
}


# Nested Aggregation
POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": {
        "path": "actors"
      },
      "aggs": {
        "actor_name": {
          "terms": {
            "field": "actors.first_name",
            "size": 10
          }
        }
      }
    }
  }
}


# A regular aggregation does not work on a nested field
POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "actors.first_name",
        "size": 10
      }
    }
  }
}

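In some use cases, cramming all the data into a single document, as object and nested types do, is not necessarily wise. Take the example of groups and events: if all the data of a group lives in one document, then whenever the group creates a new event you have to re-index the whole document for that one event. Depending on how large the document is and how frequent the operation, this can hurt performance and concurrency.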

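# Parent-Child Relationship

  • Limitation of objects and nested objects
    • Every update requires re-indexing the whole object (the root object and the nested objects)
  • ES provides an implementation similar to Join in relational databases. With the Join data type you maintain a Parent / Child relationship and thereby separate the two objects
    • The parent document and the child document are two independent documents
    • Updating the parent does not require re-indexing the children. Adding, updating, or deleting a child does not affect the parent or the other children

Setting the Mapping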

DELETE my_blogs

# Define the Parent/Child Mapping
PUT my_blogs
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      // field name
      "blog_comments_relation": {
        // join type: parent/child relationship
        "type": "join",
        "relations": {
          // the key is the parent name, the value is the child name
          "my_parent": "my_child"
        }
      },
      "content": {
        "type": "text"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}

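Indexing documents

  • The parent document and its child documents must reside on the same shard
    • This keeps join queries fast
  • When indexing a child document, the ID of its parent document must be specified
    • Use the routing parameter to make sure both are assigned to the same shard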
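
Why the routing value puts a child on its parent's shard can be sketched like this. The function `shard_for` is hypothetical, and `zlib.crc32` is only a stand-in for the hash Elasticsearch uses internally; the point is that the shard is derived from the routing value, which defaults to the document ID.

```python
import zlib


def shard_for(doc_id, number_of_shards, routing=None):
    """Pick a shard from the routing value, falling back to the document ID."""
    key = routing if routing is not None else doc_id
    # Stand-in hash for illustration only; Elasticsearch uses a different function.
    return zlib.crc32(key.encode("utf-8")) % number_of_shards


number_of_shards = 2
parent_shard = shard_for("p2", number_of_shards)
child_shard = shard_for("c2", number_of_shards, routing="p2")
print(parent_shard == child_shard)
```

Without the routing parameter the child would be hashed by its own ID and could land on a different shard than its parent.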

#Index a parent document
PUT my_blogs/_doc/p1
{
  "title":"Learning Elasticsearch",
  "content":"learning ELK @ geektime",
  "blog_comments_relation":{
    // relation name of this document (parent)
    "name":"my_parent"
  }
}

#Index a parent document
PUT my_blogs/_doc/p2
{
  "title":"Learning Hadoop",
  "content":"learning Hadoop",
    "blog_comments_relation":{
    "name":"my_parent"
  }
}


#Index a child document
# Specify routing so that it is indexed on the same shard as its parent document
PUT my_blogs/_doc/c1?routing=p1
{
  "comment":"I am learning ELK",
  "username":"Jack",
  "blog_comments_relation":{
    // this is a child document
    "name":"my_child",
    // ID of the parent document
    "parent":"p1"
  }
}

#Index a child document
PUT my_blogs/_doc/c2?routing=p2
{
  "comment":"I like Hadoop!!!!!",
  "username":"Jack",
  "blog_comments_relation":{
    "name":"my_child",
    "parent":"p2"
  }
}

#Index a child document
PUT my_blogs/_doc/c3?routing=p2
{
  "comment":"Hello Hadoop",
  "username":"Bob",
  "blog_comments_relation":{
    "name":"my_child",
    "parent":"p2"
  }
}

Querying


# Query all documents
POST my_blogs/_search
{

}


#Get a document by the parent document ID
GET my_blogs/_doc/p2

# parent_id

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-parent-id-query.html

# Parent Id query
# Returns the child documents joined to a specific parent document.
# Here: all child documents of the parent whose ID is p2
POST my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "my_child",
      "id": "p2"
    }
  }
}

# has_child

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-has-child-query.html

# Has Child query: returns parent documents
# Find the parent documents that have a child whose username matches jack
POST my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": {
        "match": {
          "username": "Jack"
        }
      }
    }
  }
}
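
# has_parent

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-has-parent-query.html

# Has Parent query: returns the related child documents
# Find the child documents whose parent's title matches Learning Hadoop (title is a keyword field, so the whole value must match)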
POST my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "my_parent",
      "query": {
        "match": {
          "title": "Learning Hadoop"
        }
      }
    }
  }
}

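Updating

# Fetching a child document requires the routing parameter (the same routing value as its parent)
GET my_blogs/_doc/c3?routing=p2

# Update a child document; the parent document is not affected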
PUT my_blogs/_doc/c3?routing=p2
{
    "comment": "Hello Hadoop??",
    "blog_comments_relation": {
      "name": "my_child",
      "parent": "p2"
    }
}

# Update By Query & Reindex API

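https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docs-reindex.html

An index typically needs to be rebuilt in the following situations:

  • The index Mappings change: a field type changes, or an analyzer or its dictionary is updated
  • The index Settings change: the number of primary shards changes
  • Data has to be migrated within a cluster or between clusters

APIs built into Elasticsearch

  • Update By Query: rebuilds the documents in place, on the existing index
  • Reindex: rebuilds the data into a different index

DELETE blogs/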

# Index a document
PUT blogs/_doc/1
{
  "content":"Hadoop is cool",
  "keyword":"hadoop"
}

# View the Mapping
GET blogs/_mapping

# Update the Mapping: add a sub-field that uses the english analyzer
PUT blogs/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}


# Index a document
PUT blogs/_doc/2
{
  "content": "Elasticsearch rocks",
  "keyword": "elasticsearch"
}

# Search for the newly indexed document
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Elasticsearch"
    }
  }

}

# Search for the document indexed before the Mapping change
# No hits: the sub-field did not exist when that document was indexed
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}


# Run Update By Query on all documents so that they are re-indexed with the new Mapping
POST blogs/_update_by_query
{

}

# Search for the earlier document again; it is now found
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}


# View the Mapping
GET blogs/_mapping

# Changing the type of an existing field is not allowed, so this request fails
PUT blogs/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    },
    "keyword": {
      "type": "keyword"
    }
  }
}



DELETE blogs_fix

# Create a new index with the desired Mapping
PUT blogs_fix/
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      },
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

# Reindex API
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix"
  }
}

GET  blogs_fix/_doc/1

# Test a Terms Aggregation
POST blogs_fix/_search
{
  "size": 0,
  "aggs": {
    "blog_keyword": {
      "terms": {
        "field": "keyword",
        "size": 10
      }
    }
  }
}


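# Reindex API, version_type internal
# Documents are written as ordinary index operations: a document with the same ID is overwritten and its version is incremented
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "version_type": "internal"
  }
}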

# The version number of the document in blogs_fix has increased
GET  blogs_fix/_doc/1
GET  blogs/_doc/1


DELETE blogs_fix
# Reindex API, version_type external
# The destination keeps the version number of the source document
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "version_type": "external"
  }
}


# Reindex API, conflicts proceed
# With "conflicts": "proceed", _reindex continues past version conflicts and returns a count of the conflicts it encountered
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "version_type": "external"
  },
  "conflicts": "proceed"
}



# Reindex API, op_type create
# Only creates documents that are missing from the destination; a document that already exists causes a version conflict
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "op_type": "create"
  }
}

# Run asynchronously; the response contains a task ID
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix"
  }
}

# Check the execution status by task ID
GET _tasks/N8-ExBj-S1K7rptd2x3Fcw:148346
# List all reindex tasks
GET _tasks?detailed=true&actions=*reindex
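
The interplay of version_type, op_type, and conflicts in the requests above can be modelled with a small Python sketch. It is a toy model under stated assumptions, not the real algorithm: an index is a plain dict of `id -> (version, body)`, the function `reindex` and its `stats` keys are hypothetical, and raising an exception on an unhandled conflict is a simplification (the real API reports failures in its response).

```python
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, str]]  # doc id -> (version, body)


def reindex(source: Index, dest: Index, version_type: str = "internal",
            op_type: str = "index", conflicts: str = "abort") -> Dict[str, int]:
    """Toy model of how _reindex treats documents that already exist in dest."""
    stats = {"created": 0, "updated": 0, "version_conflicts": 0}
    for doc_id, (src_version, body) in source.items():
        existing = dest.get(doc_id)
        if existing is None:
            # New document: external keeps the source version, internal starts at 1.
            dest[doc_id] = (src_version if version_type == "external" else 1, body)
            stats["created"] += 1
            continue
        if op_type == "create":
            conflict = True  # create never overwrites an existing document
        elif version_type == "external":
            conflict = src_version <= existing[0]  # needs a strictly newer source
        else:
            conflict = False  # internal: overwrite unconditionally
        if conflict:
            stats["version_conflicts"] += 1
            if conflicts != "proceed":
                # Simplification: abort by raising instead of returning failures.
                raise RuntimeError(f"version conflict on document {doc_id}")
            continue
        new_version = src_version if version_type == "external" else existing[0] + 1
        dest[doc_id] = (new_version, body)
        stats["updated"] += 1
    return stats


blogs: Index = {"1": (2, "Hadoop is cool"), "2": (2, "Elasticsearch rocks")}
blogs_fix: Index = {}

print(reindex(blogs, blogs_fix, version_type="external"))                      # both created
print(reindex(blogs, blogs_fix, version_type="external", conflicts="proceed"))  # both conflict
print(reindex(blogs, blogs_fix, op_type="create", conflicts="proceed"))         # both conflict
print(reindex(blogs, blogs_fix))                                                # both overwritten
```

Running the same external reindex twice conflicts on every document because the source version is no longer newer than the destination version; the internal run at the end overwrites regardless.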

# Ingest Pipeline and Painless Script

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/ingest.html

# Ingest Node

  • A node type introduced in Elasticsearch 5.0. With the default configuration, every node is an Ingest Node
    • It can pre-process data by intercepting Index and Bulk API requests
    • It transforms the data and hands it back to the Index or Bulk API
  • Data can be pre-processed without Logstash, for example
    • Set a default value for a field; rename a field; split a field value
    • Run Painless scripts for more complex processing

# Pipeline & Processor

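  • Pipeline: processes the data (documents) passing through it, applying its processors in order
  • Processor: Elasticsearch's abstraction that wraps a single processing step
  • Elasticsearch ships with many built-in Processors, and custom Processors can be added through plugins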
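
The pipeline/processor relationship can be sketched as an ordered list of functions applied to a document. This is a conceptual illustration only; `split`, `set_value`, and `run_pipeline` are hypothetical helpers that mimic the idea of the split and set processors, not Elasticsearch code.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Callable[[Doc], Doc]


def split(field: str, separator: str) -> Processor:
    """Build a processor that turns a string field into a list."""
    def run(doc: Doc) -> Doc:
        doc[field] = doc[field].split(separator)
        return doc
    return run


def set_value(field: str, value: Any) -> Processor:
    """Build a processor that sets a field to a fixed value."""
    def run(doc: Doc) -> Doc:
        doc[field] = value
        return doc
    return run


def run_pipeline(processors: List[Processor], doc: Doc) -> Doc:
    """Apply every processor in order to a shallow copy of the document."""
    out = dict(doc)
    for processor in processors:
        out = processor(out)
    return out


blog_pipeline = [split("tags", ","), set_value("views", 0)]
result = run_pipeline(blog_pipeline, {"title": "Introducing cloud computing",
                                      "tags": "openstack,k8s"})
print(result["tags"], result["views"])  # ['openstack', 'k8s'] 0
```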

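Some built-in Processors

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/ingest-processors.html

  • Split Processor (e.g. split a field value into an array)
  • Remove / Rename Processor (e.g. remove a field or rename a field)
  • Append (e.g. add a new tag to a product)
  • Convert (e.g. convert a product price from a string to a float)
  • Date / JSON (e.g. convert a date format; parse a string into a JSON object)
  • Date Index Name Processor (e.g. route the documents passing through it to an index named after a date pattern)
  • Fail Processor (raises an exception so that the error message specified in the pipeline is returned to the user)
  • Foreach Processor (for array fields: applies the same processor to every element of the array)
  • Grok Processor (e.g. extract the date from a log line)
  • Gsub / Join / Split (string replacement / array to string / string to array)
  • Lowercase / Uppercase (case conversion)

#########Demo for Pipeline###############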

DELETE tech_blogs

# Blog data with 3 fields; tags are comma-separated
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You know, for big data"
}


# Test splitting the tags
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}


# Also add a field to each document: the view count of the blog
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}



# Add a Pipeline to Elasticsearch
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
}

# View the Pipeline
GET _ingest/pipeline/blog_pipeline


# Test the pipeline
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}

# Index a document without the pipeline
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You know, for big data"
}


# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computing",
  "tags": "openstack,k8s",
  "content": "You know, for cloud"
}


# View both documents: one was processed by the pipeline, the other (the older one) was not
POST tech_blogs/_search
GET tech_blogs/_mapping

# Running update_by_query on all documents fails
# Some documents were already processed by the pipeline, so only the unprocessed ones should be selected
POST tech_blogs/_update_by_query?pipeline=blog_pipeline


# Add a query to update_by_query
# so that only the documents not yet processed are run through the pipeline
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}
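
Why the unfiltered run fails and the filtered one works can be shown with a short Python sketch. The names `blog_pipeline` and `update_by_query` below are hypothetical stand-ins for the requests above, and the raised `TypeError` only mimics the idea that a split step rejects a field that is already an array; it is not Elasticsearch's actual error.

```python
from typing import Any, Callable, Dict, Optional

Doc = Dict[str, Any]


def blog_pipeline(doc: Doc) -> Doc:
    """Stand-in for blog_pipeline: split tags on commas, then set views to 0."""
    if not isinstance(doc["tags"], str):
        # Splitting only works on a string; an already-split array is rejected.
        raise TypeError("tags is not a string")
    return {**doc, "tags": doc["tags"].split(","), "views": 0}


def update_by_query(index: Dict[str, Doc], pipeline: Callable[[Doc], Doc],
                    query: Optional[Callable[[Doc], bool]] = None) -> int:
    """Run the pipeline over every document matched by query; return the count."""
    matched = [doc_id for doc_id, doc in index.items() if query is None or query(doc)]
    for doc_id in matched:
        index[doc_id] = pipeline(index[doc_id])
    return len(matched)


tech_blogs = {
    "1": {"tags": "hadoop,elasticsearch,spark"},      # indexed without the pipeline
    "2": {"tags": ["openstack", "k8s"], "views": 0},  # already processed
}

# Equivalent of must_not + exists on "views": only unprocessed documents match.
updated = update_by_query(tech_blogs, blog_pipeline, lambda doc: "views" not in doc)
print(updated, tech_blogs["1"]["tags"])  # 1 ['hadoop', 'elasticsearch', 'spark']
```

Calling `update_by_query(tech_blogs, blog_pipeline)` without the filter would raise on document 2, whose tags are already a list.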

# Ingest Node vs. Logstash

https://www.elastic.co/cn/blog/should-i-use-logstash-or-elasticsearch-ingest-nodes

# Introduction to Painless

https://www.elastic.co/guide/en/elasticsearch/painless/7.1/painless-lang-spec.html
https://www.elastic.co/guide/en/elasticsearch/painless/7.1/painless-api-reference.html

  • Introduced in Elasticsearch 5.x and designed specifically for Elasticsearch; its syntax extends Java's.
  • Since 6.0, Groovy, JavaScript, and Python are no longer supported as scripting languages; scripts are written in Painless
  • Painless supports all Java data types and a subset of the Java API
  • Painless Script has the following characteristics
    • High performance / secure
    • Supports both explicit types and dynamically defined types

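# Uses of Painless

  • Processing document fields
  • Updating or deleting fields, and processing data in aggregations
  • Script Field: computing a value for a returned field
  • Function Score: adjusting the score of documents
  • Running scripts in an Ingest Pipeline
  • Processing data during Reindex API and Update By Query operations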

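# Accessing fields from a Painless script

How a field is accessed depends on the context, as the demo below shows:

  • Ingest Pipeline: ctx.field_name
  • Update: ctx._source.field_name
  • Search (Script Field): doc["field_name"]

#########Demo for Painless###############

# Add a Script Processor
# Triple quotes (""") wrap a multi-line code block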
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if(ctx.containsKey("content")){
            ctx.content_length = ctx.content.length();
          }else{
            ctx.content_length=0;
          }
          """
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}


DELETE tech_blogs
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You know, for big data",
  "views":0
}

POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views":100
    }
  }
}

# Check the views count
POST tech_blogs/_search

# Store the script in the Cluster State
POST _scripts/update_views
{
  "script":{
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}

POST tech_blogs/_update/1
{
  "script": {
    "id": "update_views",
    "params": {
      "new_views":1000
    }
  }
}

GET tech_blogs/_search

# Apply a script to the documents returned by a search
GET tech_blogs/_search
{
  "script_fields": {
    "rnd_views": {
      "script": {
        "lang": "painless",
        "source": """
          java.util.Random rnd = new Random();
          doc['views'].value+rnd.nextInt(1000);
        """
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

# Script caching

  • Compiling a script is relatively expensive
  • Elasticsearch caches scripts after compiling them
    • Both inline scripts and Stored Scripts are cached
    • By default, 100 scripts are cached
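
The effect of the cache can be illustrated with a small Python sketch. It rests on several assumptions: `ScriptCache` is a hypothetical class, Python's built-in `compile` stands in for Painless compilation, and least-recently-used eviction is assumed for illustration. The point it shows is that a script whose source stays constant (values passed via params) compiles once, while hard-coding values into the source produces a new cache entry each time.

```python
from collections import OrderedDict


class ScriptCache:
    """Toy cache of compiled scripts, keyed by script source."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self.compilations = 0
        self._cache = OrderedDict()

    def get(self, source: str):
        if source in self._cache:
            self._cache.move_to_end(source)  # mark as recently used
            return self._cache[source]
        self.compilations += 1               # cache miss: pay the compile cost
        compiled = compile(source, "<script>", "eval")
        self._cache[source] = compiled
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return compiled


cache = ScriptCache()

# Parameterized script: the source is identical, only the params differ.
for new_views in (100, 1000):
    code = cache.get("views + new_views")
    eval(code, {}, {"views": 0, "new_views": new_views})

# Hard-coded values: every call has a different source, so each one compiles.
for new_views in (100, 1000):
    cache.get(f"views + {new_views}")

print(cache.compilations)  # 3
```

This mirrors why the update examples above pass new_views through params instead of embedding the number in the script source.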

# Cluster authentication and user authorization

A default Elasticsearch installation provides no security protection of any kind

X-Pack Basic

  • Starting with ES 6.8 and ES 7.1, Security is included in the Basic tier of X-Pack, so the core security features are free to use
  • https://www.elastic.co/what-is/elastic-stack-security

Authentication

  • Types of credentials
    • A username and password
    • A key or a Kerberos ticket
  • Realms: the authentication services in X-Pack
    • Built-in Realms (free)
      • File / Native (usernames and passwords are stored by Elasticsearch itself)
    • External Realms (paid)
      • LDAP / Active Directory / PKI / SAML / Kerberos

RBAC - user authorization

  • What is RBAC: Role Based Access Control. You define a role and assign it a set of permissions, which cover operations at the index, field, and cluster levels. Assigning the role to a user then grants the user those permissions
    • User: the authenticated user
    • Role: a named set of permissions
    • Permission: a set of one or more privileges against a secured resource
    • Privilege: a named group of one or more actions that a user may execute against a secured resource
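
The user, role, and privilege relationship can be sketched as two lookup tables and one check. Everything here is hypothetical and for illustration only: the role name `read_only_orders`, the `USERS` and `ROLES` tables, and `is_allowed` are not part of Elasticsearch, and real roles also cover cluster-level and field-level permissions.

```python
from typing import Dict, List, Set

# Role -> index name -> privileges granted on that index (hypothetical data).
ROLES: Dict[str, Dict[str, Set[str]]] = {
    "read_only_orders": {"orders": {"read"}},
    "orders_admin": {"orders": {"read", "write"}},
}

# User -> roles assigned to that user (hypothetical data).
USERS: Dict[str, List[str]] = {
    "test": ["read_only_orders"],
    "admin": ["orders_admin"],
}


def is_allowed(user: str, index: str, privilege: str) -> bool:
    """A user holds a privilege if any of the user's roles grants it on the index."""
    for role in USERS.get(user, []):
        if privilege in ROLES.get(role, {}).get(index, set()):
            return True
    return False


print(is_allowed("test", "orders", "read"))   # True
print(is_allowed("test", "orders", "write"))  # False
```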

Built-in roles and users

[Figures: built-in roles; built-in users]

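# Enabling and configuring X-Pack authentication and authorization

  • Enable authentication and authorization, either in the configuration file or on the command line
bin/elasticsearch -E node.name=node0 -E cluster.name=geektime -E path.data=node0_data -E http.port=9200 -E xpack.security.enabled=true
  • Set the passwords for the built-in users
bin/elasticsearch-setup-passwords interactive
  • Accessing ES at http://192.168.83.130:9200/ now requires authentication; log in as the superuser elastic
  • Configure the username and password for Kibana: vim config/kibana.yml
elasticsearch.username: "elastic"
elasticsearch.password: "starry"
  • Opening the Kibana web page at http://192.168.83.130:5601/ now requires logging in; use the elastic account
  • Insert some test data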
POST orders/_bulk
{"index":{}}
{"product":"1","price":18,"payment":"master","card":"9876543210123456","name":"jack"}
{"index":{}}
{"product":"2","price":99,"payment":"visa","card":"1234567890123456","name":"bob"}

GET orders/_search

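# Creating a role

[Screenshot: creating a role in Kibana]

# Creating a user

[Screenshot: creating a user in Kibana]

Log out, then log in to Kibana with the test account

GET orders/_search

# Fails: the test user has no write privilege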
POST orders/_doc/1
{
  "name": "hello"
}

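# Secure communication inside the cluster

https://www.elastic.co/guide/en/elasticsearch/reference/current/configuring-tls.html

How do you prevent a forged node from joining the cluster?

  • Create certificates for the nodes
  • TLS
    • TLS requires X.509 certificates issued by a trusted Certificate Authority (CA)
  • Levels of certificate verification
    • Certificate: a joining node must present a certificate signed by the same CA
    • Full Verification: a joining node must present a certificate signed by the same CA, and its host name or IP address is verified as well
    • No Verification: any node can join; intended for diagnostics in development environments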
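
As an analogy only, the three verification levels correspond to familiar TLS client settings. The sketch below uses Python's standard `ssl` module, not Elasticsearch code; the helper `context_for` and its mode strings are hypothetical names chosen to mirror the levels listed above.

```python
import ssl


def context_for(verification_mode: str) -> ssl.SSLContext:
    """Build a client-side TLS context that mirrors one verification level."""
    ctx = ssl.create_default_context()  # default: verify the CA chain and the host name
    if verification_mode == "full":
        pass                            # CA chain + host name / IP check
    elif verification_mode == "certificate":
        ctx.check_hostname = False      # CA chain only
    elif verification_mode == "none":
        ctx.check_hostname = False      # must be disabled before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE # accept any certificate
    else:
        raise ValueError(f"unknown verification mode: {verification_mode}")
    return ctx


for mode in ("full", "certificate", "none"):
    ctx = context_for(mode)
    print(mode, ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

The weaker the mode, the less a peer has to prove, which is why the no-verification level is only suitable for diagnostics.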

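# Generating node certificates

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/configuring-tls.html

  • Create a certificate authority for the Elasticsearch cluster
bin/elasticsearch-certutil ca

This generates the file elastic-stack-ca.p12 in the current directory

  • Generate a certificate and private key for the nodes in the cluster
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
  • Add the configuration

Copy the node certificate to an appropriate location

mkdir config/certs
mv elastic-certificates.p12 config/certs/
# Restrict access: the keystore contains the private key
# (assumes the file is owned by the user that runs Elasticsearch)
chmod 600 config/certs/elastic-certificates.p12

Either edit the configuration file or pass the settings on the command line at startup

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

Start 3 nodes: two with the certificate paths configured and one without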

bin/elasticsearch -E node.name=node0 -E cluster.name=geektime -E path.data=node0_data -E http.port=9200 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate -E xpack.security.transport.ssl.keystore.path=certs/elastic-certificates.p12 -E xpack.security.transport.ssl.truststore.path=certs/elastic-certificates.p12
bin/elasticsearch -E node.name=node1 -E cluster.name=geektime -E path.data=node1_data -E http.port=9201 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate -E xpack.security.transport.ssl.keystore.path=certs/elastic-certificates.p12 -E xpack.security.transport.ssl.truststore.path=certs/elastic-certificates.p12


# A node that does not provide a certificate cannot join
bin/elasticsearch -E node.name=node2 -E cluster.name=geektime -E path.data=node2_data -E http.port=9202 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate
  • Check the nodes: http://192.168.83.130:9200/_cat/nodes

# Secure communication between the cluster and external clients

https://www.elastic.co/guide/en/elasticsearch/reference/current/security-basic-setup-https.html#encrypt-http-communication

# Configuring HTTPS for ES

Configuration file

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.http.ssl.truststore.path: certs/elastic-certificates.p12

Start and test

bin/elasticsearch -E node.name=node0 -E cluster.name=geektime -E path.data=node0_data -E http.port=9200

https://192.168.83.130:9200/_cat/nodes

# Connecting Kibana to ES over HTTPS

Convert the previously issued .p12 certificate to a .pem certificate

# Prompts for a password: use the one set when the certificate was issued, or press Enter if none was set
openssl pkcs12 -in elastic-certificates.p12 -cacerts -nokeys -out elastic-ca.pem
# Move it to a suitable location
mkdir -p /usr/local/src/kibana-7.1.0-linux-x86_64/config/certs
mv elastic-ca.pem /usr/local/src/kibana-7.1.0-linux-x86_64/config/certs/

Edit kibana.yml

elasticsearch.hosts: ["https://localhost:9200"]
elasticsearch.ssl.certificateAuthorities: [ "/usr/local/src/kibana-7.1.0-linux-x86_64/config/certs/elastic-ca.pem" ]
elasticsearch.ssl.verificationMode: certificate

Start Kibana with bin/kibana, open http://192.168.83.130:5601/, and test in Dev Tools with GET _cat/indices

# Accessing Kibana over HTTPS

# Generate the certificate
bin/elasticsearch-certutil ca --pem
# Unzip it into the target directory
unzip -d /usr/local/src/kibana-7.1.0-linux-x86_64/config/certs/ elastic-stack-ca.zip

Edit kibana.yml

server.ssl.enabled: true
server.ssl.certificate: /usr/local/src/kibana-7.1.0-linux-x86_64/config/certs/ca/ca.crt
server.ssl.key: /usr/local/src/kibana-7.1.0-linux-x86_64/config/certs/ca/ca.key

https://192.168.83.130:5601/

# Common cluster deployment patterns

Last updated: 2024/03/03, 08:36:37
Theme by Vdoing | Copyright © 2023-2024 Starry | MIT License