ElasticSearch Full-Text Search

July 19, 2022 · java哥

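Full-text search runs queries made of one or more words or phrases against documents and returns the results that match.
ElasticSearch is a search engine built on Apache Lucene, a free, open-source information retrieval library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.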

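This article explores the ElasticSearch REST API and demonstrates basic query operations over HTTP.

Setup

To install ElasticSearch, follow the official installation guide.

The RESTful API listens on port 9200. Use the following curl command to check that the instance is running: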

curl -XGET 'http://localhost:9200/' 

If you see a response like the following, the instance has started successfully:

{
  "name": "USER-20170915LA",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "7k8ij9iPT1uAkVqJxanAYQ",
  "version": {
    "number": "7.1.1",
    "build_flavor": "oss",
    "build_type": "zip",
    "build_hash": "7a013de",
    "build_date": "2019-05-23T14:04:00.380842Z",
    "build_snapshot": false,
    "lucene_version": "8.0.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}

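Installing Elasticsearch Head

To run the commands in this article we can install Elasticsearch Head. It is also available as a Chrome extension that can be found and installed from the Chrome Web Store. Start ElasticSearch first, then open the Head extension.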

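Indexing Documents

ElasticSearch is a document-oriented NoSQL store, used mainly to store and index documents. Indexing creates or updates a document; once a document is indexed, it can be searched, sorted, and filtered as a whole document, not just as rows and columns of data. This is a fundamentally different way of thinking about data, and it is one of the reasons ElasticSearch can perform complex full-text search.

Documents are represented as JSON objects because JSON is simple, concise, and easy to read. JSON serialization is supported by most programming languages and has become the standard format for NoSQL software.

We will use some random text to run the full-text searches below: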

{ 
  "title": "He went", 
  "random_text": "He went such dare good fact. The small own seven saved man age." 
} 
  
{ 
  "title": "He oppose", 
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
} 
  
{ 
  "title": "Repulsive questions", 
  "random_text": "Repulsive questions contented him few extensive supported." 
} 
  
{ 
  "title": "Old education", 
  "random_text": "Old education him departure any arranging one prevailed." 
} 

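Before indexing a document we need to decide where to store it. ElasticSearch can hold multiple indices, and each index can hold many documents. We will use the following scheme:

text: the index name
article: the type name
id: a unique ID for each text entry

A document can be indexed with the following command:

PUT http://localhost:9200/text/article/1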
{ 
  "title": "He went", 
  "random_text": "He went such dare good fact. The small own seven saved man age." 
} 

Here we use id=1. The other text entries can be added with the same command, incrementing the id each time.
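The repetition described above can be scripted. The following is a minimal sketch using only the Python standard library; it assumes a node at http://localhost:9200 and the text/article scheme from this article, and the helper names build_index_requests and send are hypothetical, introduced only for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node on the default port

DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. "
                    "Announcing impression unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_requests(documents, index="text", doc_type="article"):
    """Return one PUT request per document, using ids 1, 2, 3, ..."""
    requests = []
    for doc_id, doc in enumerate(documents, start=1):
        url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
        body = json.dumps(doc).encode("utf-8")
        requests.append(urllib.request.Request(
            url, data=body, method="PUT",
            headers={"Content-Type": "application/json"},
        ))
    return requests


def send(request):
    """Send one request; needs a running node, otherwise urlopen raises URLError."""
    with urllib.request.urlopen(request) as response:
        return response.status


index_requests = build_index_requests(DOCUMENTS)
for request in index_requests:
    # Only print what would be sent; call send(request) to actually index.
    print(request.get_method(), request.full_url)
```

Building the requests separately from sending them makes it easy to inspect the URLs and bodies before anything touches the cluster.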

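Alternatively, we can create the mapping by hand before indexing any documents, stating explicitly which fields need full-text search (the text type) and which only need exact matching (the keyword type). The instance shown earlier reports version 7.1.1, where mappings are typeless by default, so the request passes include_type_name=true in order to name the article type. Example:

PUT /text?include_type_name=true
{
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text" },
        "random_text": { "type": "text" }
      }
    }
  }
}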

Retrieving Documents

Having added four documents, we can check how many documents the index holds with the following command:

GET http://localhost:9200/text/_count/ 
{ 
  "query": { 
    "match_all": {} 
  } 
} 

Response:

{
  "count": 4,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  }
}

This matches the number of documents we inserted. Next, fetch a specific document:

GET http://localhost:9200/text/article/1

The result is:

{ 
    "_index": "text", 
    "_type": "article",
    "_id": "1", 
    "_version": 1, 
    "_seq_no": 0, 
    "_primary_term": 1, 
    "found": true, 
    "_source": { 
        "title": "He went",
        "random_text": "He went such dare good fact. The small own seven saved man age."
    } 
} 

The result is the document we added earlier with ID 1.

Querying Documents

Now let's test full-text search with the following command:

GET http://localhost:9200/text/article/_search
{ 
  "query": { 
    "match": { 
      "random_text": "him departure" 
    } 
  } 
} 

The response is:

{ 
  "took": 32, 
  "timed_out": false, 
  "_shards": { 
    "total": 5, 
    "successful": 5, 
    "failed": 0 
  }, 
  "hits": { 
    "total": 2, 
    "max_score": 1.4513469, 
    "hits": [ 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "4", 
        "_score": 1.4513469, 
        "_source": { 
          "title": "Old education", 
          "random_text": "Old education him departure any arranging one prevailed." 
        } 
      }, 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "3", 
        "_score": 0.28582606, 
        "_source": { 
          "title": "Repulsive questions", 
          "random_text": "Repulsive questions contented him few extensive supported." 
        } 
      } 
    ] 
  } 
} 

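We searched for "him departure" and got two results with different scores. The first result is the obvious one: it contains the full query text, and its score is 1.4513469.

The second result was returned because the document contains the single word "him"; its score is 0.28582606.

By default ElasticSearch sorts results by relevance score, that is, by how well each document matches the query. Note that the second result scores lower than the first, which indicates lower relevance.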
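To make it concrete why exactly these two documents come back, here is a toy sketch of term matching in Python. It is not how ElasticSearch is implemented: the tokenizer is a rough stand-in for a real analyzer, the function names are hypothetical, and real relevance scoring is considerably more involved than counting matched terms. It only illustrates that a match query, by default, returns a document when it contains any one of the query terms.

```python
import re

DOCUMENTS = {
    1: "He went such dare good fact. The small own seven saved man age.",
    2: "He oppose at thrown desire of no. "
       "Announcing impression unaffected day his are unreserved indulgence.",
    3: "Repulsive questions contented him few extensive supported.",
    4: "Old education him departure any arranging one prevailed.",
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough analyzer stand-in)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matched_terms(query, documents):
    """Map each document id to the query terms it contains (OR semantics)."""
    query_terms = set(tokenize(query))
    hits = {}
    for doc_id, text in documents.items():
        found = query_terms & set(tokenize(text))
        if found:
            hits[doc_id] = found
    return hits


hits = matched_terms("him departure", DOCUMENTS)
# List documents that matched more query terms first.
for doc_id in sorted(hits, key=lambda i: len(hits[i]), reverse=True):
    print(doc_id, sorted(hits[doc_id]))
```

In this sketch document 4 matches both terms and document 3 matches only "him", which lines up with the ordering of the two hits above.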

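Fuzzy Queries

A fuzzy query treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what fuzzy means.
Elasticsearch supports a maximum edit distance of 2, set through the fuzziness parameter. The fuzziness parameter can also be set to AUTO, in which case the allowed edit distance depends on the length of the term: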

  • 0 for strings of one or two characters
  • 1 for strings of three, four, or five characters
  • 2 for strings of more than five characters
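The AUTO rule listed above can be written as a small function. This is a sketch of the rule as stated in this article, not code taken from ElasticSearch; the name auto_fuzziness is hypothetical.

```python
def auto_fuzziness(term):
    """Allowed edit distance for a term under the AUTO rule listed above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three to five characters: one edit allowed
    return 2      # more than five characters: two edits allowed


for term in ("he", "him", "departure"):
    print(term, auto_fuzziness(term))
```

For the query used in this article, "him" would be allowed one edit and "departure" two.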

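With an edit distance of 2, the results can appear unrelated. For better results and better performance, use an edit distance of 1. Distance here means the Levenshtein distance, a string metric that measures the difference between two sequences. The following runs a fuzzy search with fuzziness set to 2: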

GET http://localhost:9200/text/article/_search
{
  "query": {
    "match": {
      "random_text": {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}

Response:

{ 
  "took": 88, 
  "timed_out": false, 
  "_shards": { 
    "total": 5, 
    "successful": 5, 
    "failed": 0 
  }, 
  "hits": { 
    "total": 4, 
    "max_score": 1.5834423, 
    "hits": [ 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "4", 
        "_score": 1.4513469, 
        "_source": { 
          "title": "Old education", 
          "random_text": "Old education him departure any arranging one prevailed." 
        } 
      }, 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "2", 
        "_score": 0.41093433, 
        "_source": { 
          "title": "He oppose", 
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        } 
      }, 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "3", 
        "_score": 0.2876821, 
        "_source": { 
          "title": "Repulsive questions", 
          "random_text": "Repulsive questions contented him few extensive supported." 
        } 
      }, 
      { 
        "_index": "text", 
        "_type": "article", 
        "_id": "1", 
        "_score": 0.0, 
        "_source": { 
          "title": "He went", 
          "random_text": "He went such dare good fact. The small own seven saved man age." 
        } 
      } 
    ] 
  } 
} 

We see that the fuzzy query returns more results. Use fuzzy queries with care, because they can return results that are entirely irrelevant.
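The extra hits can be understood by computing edit distances by hand. The sketch below implements the classic Levenshtein distance in Python and lists which words of documents 1 and 2 fall within a given number of edits of "him". It is an illustration under simplifying assumptions: the tokenizer is a crude stand-in for a real analyzer, the function names are hypothetical, and ElasticSearch's own fuzzy matching may differ in detail (for example, it can also count a swap of two adjacent characters as a single edit, which this sketch does not).

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def fuzzy_terms(query_term, text, max_edits):
    """Words of `text` within `max_edits` edits of `query_term`."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(t for t in tokens if levenshtein(query_term, t) <= max_edits)


DOC_1 = "He went such dare good fact. The small own seven saved man age."
DOC_2 = ("He oppose at thrown desire of no. "
         "Announcing impression unaffected day his are unreserved indulgence.")

print(fuzzy_terms("him", DOC_1, 2))
print(fuzzy_terms("him", DOC_2, 2))
print(fuzzy_terms("him", DOC_2, 1))
```

Under these assumptions, "he" is two edits away from "him" and "his" is one edit away, which is consistent with documents 1 and 2 appearing in the fuzzy results. With a maximum of one edit, only "his" still qualifies, so document 1 would no longer match on this term.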

Summary

In this article we explained how to index documents and how to run full-text search queries with the ElasticSearch REST API.


Reference: https://blog.csdn.net/neweastsun/article/details/93337446