---
title: Analysis
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Analysis
### Analysis is performed by an analyzer
- character filter: transforms the raw text before tokenization (e.g. strips HTML)
- tokenizer: breaks the text into tokens and records each token's position; can be language-specific
- token filter: modifies or removes tokens, e.g. filters out stopwords

Reader -> character filter -> tokenizer -> token filter -> tokens
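
To see this pipeline in action, the `_analyze` API can combine components ad hoc, without any index. A minimal sketch using only built-in components (`html_strip`, `standard`, `lowercase`, `stop`):

```json
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": ["<b>The QUICK Brown Foxes</b>"]
}
```

This should return roughly the tokens `quick`, `brown`, `foxes`: the character filter strips the tags, the standard tokenizer splits on word boundaries, and the token filters lowercase everything and drop the stopword "the".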
### Where are analyzers used?
- in a query
- as a mapping parameter (per field)
- in the index settings

The analyzer is defined in the index settings and then referenced per field in the mapping, as the examples below show.
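
As a sketch of the query-time case: a `match` query can override the analyzer per request. The index, field, and analyzer names here come from the examples further down in this note:

```json
GET /test_analyzer/_search
{
  "query": {
    "match": {
      "spreker_1": {
        "query": "this builders",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```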
### Analyzers
1. standard
   - max_token_length (default 255)
   - stopwords (defaults to \_none_)
   - stopwords_path (path to a file containing stopwords)
   - keeps numeric values
2. simple
   - lowercases
   - splits on special characters (e.g. dog's -> [dog, s])
   - removes numeric values
3. whitespace
   - breaks text into terms whenever it encounters a whitespace character
   - no lowercase transformation
   - takes terms as they are
   - keeps special characters
4. keyword
   - no configuration
   - takes the whole text as a single term
5. stop
   - stopwords, stopwords_path
6. pattern
   - stopwords, stopwords_path, pattern, lowercase
   - splits on a regular expression
7. custom
   - tokenizer, char_filter, filter (see the sketch after this list)
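
A minimal sketch of option 7, a custom analyzer assembled from built-in parts (the index and analyzer names `test_custom` and `my_custom_analyzer` are made up for illustration):

```json
PUT /test_custom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```

The order matters: the char_filter runs first, then the tokenizer, then the token filters in the listed order.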
### Example with the standard analyzer
```json
PUT /test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "spreker_1": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```
The analyzer is set per field, so each field can use a different analyzer. Note that the field must be of type `text`; `keyword` fields are not analyzed and do not accept an `analyzer` parameter.
```json
GET /test_analyzer/_analyze
{
  "field": "spreker_1",
  "text": ["What is the this builders"]
}
```
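
Because `spreker_1` is mapped to `my_analyzer`, the `field` parameter is enough here. With the settings above this should return roughly the tokens `what`, `build`, and `ers`: "is", "the", and "this" are dropped as English stopwords, and "builders" is split because of `max_token_length: 5`.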
### Without a mapping: pattern tokenizer in a custom analyzer
If the index from the previous example still exists, run `DELETE /test_analyzer` first.
```json
PUT /test_analyzer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_words": {
          "type": "pattern",
          "pattern": "\\W|_|[a-c]"
        }
      },
      "analyzer": {
        "rebuild_pattern": {
          "tokenizer": "split_on_words",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
The pattern splits on any non-word character, on `_`, or on the letters a, b, and c. Lowercasing is done by the `lowercase` token filter: a `lowercase` option exists on the pattern *analyzer*, but not on the pattern *tokenizer*.
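
A quick test of this analyzer (the sample text is made up for illustration):

```json
GET /test_analyzer/_analyze
{
  "analyzer": "rebuild_pattern",
  "text": ["Hello_World example"]
}
```

This should yield roughly `hello`, `world`, `ex`, and `mple`: the underscore and the space separate the first two tokens, and the letter "a" inside "example" acts as a separator as well.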