---
title: Query_DSL
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Elasticsearch Query DSL
### Queries can be classified into three types
1. Filtering by exact values
2. Searching on analyzed text
3. A combination of the two
Every __document field__ can be classified as either:
- an exact value
- analyzed text (also called full text)
## Exact values
Exact values are fields like `user_id`, `date` and `email_addresses`.
Querying documents can be done by specifying __filters over exact values__. Whether a document gets returned is a __binary__ yes or no.

---
## Analyzed search
__Analyzed text__ is text data like `product_description` or `email_body`.
- Querying documents by searching analyzed text returns results based on __relevance__ (score)
- It is a highly complex operation that involves different __analyzer packages__ depending on the type of text data
  - The default analyzer package is the _standard analyzer_, which splits text on word boundaries, lowercases it and removes most punctuation
- It is less performant than just filtering by exact values
## Expensive queries
1. Linear scans
   - script queries
2. High up-front cost
   - fuzzy queries
   - regexp queries
   - prefix queries without `index_prefixes`
   - wildcard queries
   - range queries on text and keyword fields
3. Joining queries
4. Queries on deprecated geo shapes
5. High per-document cost
   - script score queries
   - percolate queries

The execution of such queries can be prevented by setting the value of the `search.allow_expensive_queries` setting to `false` (defaults to `true`).
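
As a rough sketch of how that setting might be changed: `search.allow_expensive_queries` is a dynamic cluster setting, so it can be updated through the cluster settings API (`PUT _cluster/settings`). The snippet below only builds the request body. The helper name `expensive_queries_body` and the choice of `persistent` over `transient` are assumptions, and sending the request is left to whatever HTTP client you use.

```python
import json

def expensive_queries_body(allow, scope="persistent"):
    """Build the body for PUT _cluster/settings.

    `scope` is "persistent" or "transient"; defaulting to "persistent"
    is an assumption made for this sketch.
    """
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    return {scope: {"search.allow_expensive_queries": allow}}

body = expensive_queries_body(False)
print(json.dumps(body))
# {"persistent": {"search.allow_expensive_queries": false}}
```
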
Query clauses behave differently depending on whether they run in **query context** or **filter context**:

| Query context  | Filter context |
| -------------- | -------------- |
| Fuzzy, scoring | Boolean        |
| Slower         | Faster         |
| Not cacheable  | Cacheable      |
## Scoring queries
By default, Elasticsearch sorts matching search results by **relevance score**, which measures how well each document matches a query. Whether a score is calculated depends on whether the query clause is executed in **query** or **filter** context.
## => Query context
“*How well does this document match this query clause?*” The relevance is stored in the **_score** metadata field.
Query context is in effect whenever a query clause is passed to the `query` parameter.
## => Filter context
“*Does this document match this query clause?*” The answer is true or false. No score is calculated: a query that consists only of filters gives every matching document a score of 0.
Mostly used for filtering structured data, e.g.
- Does this timestamp fall in the range ...?
- Is the status field set to "text value"?

Frequently used filters will be cached.
Filter context is in effect when a query clause is passed to a filter parameter, such as:
- the `filter` or `must_not` parameters in a `bool` query
- the `filter` parameter in a `constant_score` query
- the `filter` aggregation
Example
```json
GET /_search
{
  "query": {                 // query context
    "bool": {                // query context, together with the match clauses: they score how well documents match
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [            // filter context
        { "term": { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}
```
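
Filter context can also be sketched with a `constant_score` query, which wraps a filter and gives every matching document the same score (the `boost` value, 1.0 by default). The helper name `constant_score` and the sample `status` value are illustrative assumptions; only the request body is built here.

```python
import json

def constant_score(filter_clause, boost=1.0):
    """Wrap a filter so every matching document gets the score `boost`."""
    return {"query": {"constant_score": {"filter": filter_clause, "boost": boost}}}

# Hypothetical field and value, mirroring the example above.
body = constant_score({"term": {"status": "published"}}, boost=1.2)
print(json.dumps(body, indent=2))
```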
---
### Difference term vs match
- match: analyzes the search text before searching, by default with the same analyzer that was applied to the field when the data was stored
- term: does not apply any analyzer, so it looks in the inverted index for exactly the value you pass
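
The difference can be illustrated with a toy inverted index. This is only a sketch: `analyze` is a crude stand-in for the standard analyzer (lowercase, split into words), and `build_index`, `term_query` and `match_query` are hypothetical helpers, not Elasticsearch APIs.

```python
import re

def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase and split into words."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each analyzed token to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, value):
    """term: look the value up exactly as given, without analysis."""
    return set(index.get(value, set()))

def match_query(index, text):
    """match: analyze the search text first, then OR the resulting tokens."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

index = build_index({1: "Quick Brown Fox", 2: "Lazy dog"})
print(term_query(index, "Quick"))   # set(): the index only holds "quick"
print(term_query(index, "quick"))   # {1}
print(match_query(index, "Quick"))  # {1}
```

This is why a `term` query on an analyzed text field often finds nothing: the stored tokens are lowercased, while the search value is used as typed.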
## The Query DSL
Elasticsearch queries are composed of one or more __leaf query clauses__. Query clauses can be combined to create other query clauses, called __compound query clauses__. All query clauses have one of these two formats:
```json
{
  QUERY_CLAUSE: {  // match, match_all, multi_match, term, terms, exists, range, bool
    ARGUMENT: VALUE,
    ARGUMENT: VALUE,...
  }
}

{
  QUERY_CLAUSE: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
    }
  }
}
```
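
To make the two shapes concrete, here is a small sketch with two hypothetical helpers: `clause` for the argument-only form and `field_clause` for the form keyed by field name. The helper names and the sample fields are assumptions for illustration.

```python
def clause(name, **arguments):
    """First format: {QUERY_CLAUSE: {ARGUMENT: VALUE, ...}}."""
    return {name: dict(arguments)}

def field_clause(name, field, **arguments):
    """Second format: {QUERY_CLAUSE: {FIELD_NAME: {ARGUMENT: VALUE, ...}}}."""
    return {name: {field: dict(arguments)}}

print(clause("exists", field="title"))
# {'exists': {'field': 'title'}}
print(field_clause("range", "age", gte=10, lte=20))
# {'range': {'age': {'gte': 10, 'lte': 20}}}
```
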
Query clauses can be __repeatedly nested__ inside other query clauses
```json
{
  QUERY_CLAUSE: {
    QUERY_CLAUSE: {
      QUERY_CLAUSE: {
        QUERY_CLAUSE: {
          ARGUMENT: VALUE,
          ARGUMENT: VALUE,...
        }
      }
    }
  }
}
```
## Two types of Query DSL clauses (Leaf and Compound)
### Leaf query clause
Looks for a particular value in a particular field. These queries can be used by themselves. Examples are the **match**, **term** and **range** queries.
### Compound query clause
Wraps other leaf or compound queries. Compound queries are used to combine multiple queries in a logical fashion (such as **bool** or **dis_max**) or to alter their behaviour (such as **constant_score**).
- bool => must, must_not, should, filter, minimum_should_match
  - combines multiple leaf or compound query clauses
  - **must**, **should** => run in query context; their scores are combined and contribute to the final score
  - **must_not**, **filter** => run in filter context; they do not contribute to the score
  - **must** ==> like logical **AND**
  - **should** ==> like logical **OR**

You can use the `minimum_should_match` parameter to specify the number or percentage of `should` clauses returned documents *must* match.
If the `bool` query includes at least one `should` clause and no `must` or `filter` clauses, the default value is `1`. Otherwise, the default value is `0`.
```json
POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}
```
- boosting query
- constant_score query
- dis_max query
- function_score query
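
The default for `minimum_should_match` in a `bool` query, as described above, can be sketched as a small function. `default_minimum_should_match` is a hypothetical helper that only inspects the body of a `bool` query; it is not part of any Elasticsearch client.

```python
def default_minimum_should_match(bool_body):
    """Return the default minimum_should_match for the body of a bool query.

    1 when there is at least one should clause and no must or filter clause,
    otherwise 0.
    """
    has_should = bool(bool_body.get("should"))
    has_must_or_filter = bool(bool_body.get("must")) or bool(bool_body.get("filter"))
    return 1 if has_should and not has_must_or_filter else 0

only_should = {"should": [{"term": {"tag": "wow"}}]}
with_must = {
    "must": {"term": {"user": "kimchy"}},
    "should": [{"term": {"tag": "wow"}}],
}

print(default_minimum_should_match(only_should))  # 1
print(default_minimum_should_match(with_must))    # 0
```

In other words, `should` clauses are optional score boosters as soon as a `must` or `filter` clause is present, unless `minimum_should_match` is set explicitly.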
## Match Query Clause
The match query clause is the most generic and commonly used query clause:
- when run on an analyzed text field, it performs an analyzed search on the text
- when run on an exact value field, it performs a filter
- it calculates the score

Example:
```json
{ "match": { "description": "Fourier analysis signals processing" }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "visible": true }}
```
## The Match All Query Clause
Returns all documents.
```json
{ "match_all": {} }
```
## Term/Terms Query Clause
The term and terms query clauses are used to **filter** exact value fields by a single value or by multiple values, respectively. In the case of multiple values, the logical connection is OR.
```json
{
  "query": {
    "term": { "tag": "math" }
  }
}

{
  "query": {
    "terms": { "tag": ["math", "second"] }
  }
}
```
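
A small sketch of choosing between the two clauses: the hypothetical helper `exact_filter` emits `term` for a single value and `terms` for a list of values. The field name `tag` comes from the example above; the helper itself is an assumption.

```python
def exact_filter(field, values):
    """Return a term clause for one value, a terms clause for a list or tuple."""
    if isinstance(values, (list, tuple)):
        return {"terms": {field: list(values)}}
    return {"term": {field: values}}

print(exact_filter("tag", "math"))
# {'term': {'tag': 'math'}}
print(exact_filter("tag", ["math", "second"]))
# {'terms': {'tag': ['math', 'second']}}
```
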
## Multi Match Query Clause
Runs a match query across multiple fields instead of just one.
```json
{
  "query": {
    "multi_match": {
      "query": "probability theory",   // value
      "fields": ["title^3", "*body"],  // fields, with wildcard *
                                       // no fields == *
                                       // title is 3 times as important
      "type": "best_fields"
    }
  }
}
```
[Other types](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#multi-match-types)
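
The `^` notation in `fields` attaches a boost to a single field. As an illustration only, the hypothetical `parse_field` below splits such an entry into a field pattern and its boost, with 1.0 assumed when no boost is given.

```python
def parse_field(entry):
    """Split 'title^3' into ('title', 3.0); no '^' means a boost of 1.0."""
    name, sep, boost = entry.partition("^")
    return (name, float(boost) if sep else 1.0)

print([parse_field(f) for f in ["title^3", "*body"]])
# [('title', 3.0), ('*body', 1.0)]
```
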
## Exists and Missing Filters Query Clause
- The exists filter checks that documents have a value at a specified field
```json
{
  "query": {
    "exists": {
      "field": "*installCount"  // also with wildcards
    }
  }
}
```
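- There is no longer a `missing` filter: it was removed in Elasticsearch 5.0. To check that documents do not have a value at a specified field, wrap `exists` in the `must_not` clause of a `bool` query
```json
{
  "bool": {
    "must_not": {
      "exists": {
        "field": "title"
      }
    }
  }
}
```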
## Range Filter Query Clause
Matches number and date fields within a range, using the operators `gt`, `gte`, `lt` and `lte`.
```json
{ "range" : { "age" : { "gt" : 30 } } }

{
  "range": {
    "born" : {
      "gte": "01/01/2012",
      "lte": "2013",
      "format": "dd/MM/yyyy||yyyy"
    }
  }
}
```
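
The four operators behave like the usual comparison operators. Below is a toy check for numeric values, with the hypothetical helper `in_range` standing in for what a `range` clause decides per document. Date math and `format` handling are left out of this sketch.

```python
import operator

# Map each range operator to the comparison it stands for.
OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """Return True when value satisfies every bound, e.g. {"gte": 10, "lte": 20}."""
    return all(OPERATORS[op](value, limit) for op, limit in bounds.items())

print(in_range(30, {"gt": 30}))              # False
print(in_range(31, {"gt": 30}))              # True
print(in_range(20, {"gte": 10, "lte": 20}))  # True
```
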
## Query in filter context
### No scores are calculated: yes or no
The __query__ parameter indicates query context.
The __bool__ and two __match__ clauses are used in query context, which means that they are used to score how well each document matches.
The __filter__ parameter indicates __*filter context*__. Its range clause is used in filter context. It will filter out documents which do not match, but it will __*not affect the score*__ for matching documents.
The __must__ clause is not required; a `bool` query with only a __filter__ clause gives every matching document a score of 0.0.
```json
GET /.kibana/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"type" : "ui-metric"}},
        {"match": {"ui-metric.count" : "1"}}
      ],
      "filter": [
        {"range": {"updated_at": {"gte": "2020-04-01"}}}
      ]
    }
  }
}
```
## Bool Query Clause
Query clauses that are built from other query clauses are called compound query clauses. <sup>Note that compound query clauses can also be composed of other compound query clauses, allowing for multi-layer nesting.</sup>
The three boolean operators used here are __must__ (and), __must_not__ (not) and __should__ (or).
```json
{
  "bool": {
    "must": { "term": { "tag": "math" }},
    "must_not": { "term": { "tag": "probability" }},
    "should": [
      { "term": { "favorite": true }},
      { "term": { "unread": true }}
    ]
  }
}
```
## Combining Analyzed Search With Filters
Example: a query that finds all posts by performing an analyzed search for “Probability Theory”, keeping only posts with 20 or more upvotes and excluding those with the tag “frequentist”. The `filtered` query that was once used for this was removed in Elasticsearch 5.0; a `bool` query with a `filter` clause does the same job.
```json
{
  "bool": {
    "must": { "match": { "body": "Probability Theory" }},
    "filter": {
      "range": { "upvotes" : { "gte" : 20 } }
    },
    "must_not": { "term": { "tag": "frequentist" } }
  }
}
```
[Source: Understanding the Elasticsearch Query DSL](https://medium.com/@User3141592/understanding-the-elasticsearch-query-dsl-ce1d67f1aa5b)