Analyzer

hanseom 2025. 3. 18. 05:00

Analyzer

A component that processes text during indexing and querying. The analyzer breaks down text into a stream of tokens or terms (usually words) and can apply various transformations like lowercasing, removing stop words, and more.

Analyzer Components

Tokenizer: Splits the text into individual words or tokens. (e.g., words)
Character Filters: Transform the text at the character level before it is passed to the tokenizer. (e.g., removing HTML tags)
Token Filters: Transform or filter the tokens produced by the tokenizer. (e.g., lowercasing, stemming)
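
To make the order of the three components concrete, here is a minimal pure-Python sketch of the pipeline: character filters run first, then the tokenizer, then the token filters. It is not how Elasticsearch implements analysis; the function names (strip_html, whitespace_tokenizer, lowercase, analyze) are made up for this illustration, and the tag stripping is much cruder than the real html_strip character filter.

```python
import html
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Character filter (hypothetical): drop tags, then decode entities like &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer (hypothetical): split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter (hypothetical): lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    # 1. Character filters see the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer turns the string into tokens.
    tokens = tokenizer(text)
    # 3. Token filters run in the order they are listed.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Quick <b>Brown</b> Foxes</p>",
    [strip_html],
    whitespace_tokenizer,
    [lowercase],
)
print(tokens)  # ['quick', 'brown', 'foxes']
```

An analyzer always has exactly one tokenizer, while the two filter lists may be empty or hold several entries.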

Common Built-in Analyzers

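standard: The default analyzer. Splits text on word boundaries (Unicode Text Segmentation), drops most punctuation, and lowercases the tokens. (Tokenizer: standard)
simple: Splits text at every non-letter character, so tokens consist of letters only, and lowercases them. (Tokenizer: lowercase)
whitespace: Splits text on whitespace only. It does not lowercase, so case is preserved. (Tokenizer: whitespace)
stop: Works like simple (splits at non-letter characters and lowercases) and additionally removes stop words. (Tokenizer: lowercase)
keyword: Returns the entire input text as a single token. (Tokenizer: keyword)
pattern: Splits text with a regular expression (default: \W+) and lowercases the tokens. (Tokenizer: pattern)
Official guide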
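
The differences between these analyzers are easiest to see side by side. The sketch below imitates five of them in plain Python on one sample string. These are rough approximations for ASCII text, not the Lucene implementations; the *_like function names are invented here, and STOP_WORDS is a deliberately abbreviated list.

```python
import re
from typing import List

# Abbreviated stop list for illustration only; the real English list is longer.
STOP_WORDS = {"a", "an", "and", "is", "the"}


def simple_like(text: str) -> List[str]:
    """Letters only, lowercased; digits and punctuation act as separators."""
    return re.findall(r"[^\W\d_]+", text.lower())


def whitespace_like(text: str) -> List[str]:
    """Split on whitespace; case and punctuation are left untouched."""
    return text.split()


def stop_like(text: str) -> List[str]:
    """Same tokens as simple_like, minus stop words."""
    return [t for t in simple_like(text) if t not in STOP_WORDS]


def keyword_like(text: str) -> List[str]:
    """The whole input becomes one token."""
    return [text]


def pattern_like(text: str, pattern: str = r"\W+") -> List[str]:
    """Split on a regex (non-word characters by default), lowercased."""
    return [t for t in re.split(pattern, text.lower()) if t]


sample = "The 2 QUICK Brown-Foxes jumped!"
for fn in (simple_like, whitespace_like, stop_like, keyword_like, pattern_like):
    print(f"{fn.__name__:16} {fn(sample)}")
```

Note how only whitespace_like and keyword_like keep the original casing, and how the digit 2 survives in whitespace_like and pattern_like but not in simple_like.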

Standard Analyzer

PUT /my_index_standard
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_standard_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "custom_standard_analyzer"
      }
    }
  }
}

# Analyze
POST /my_index_standard/_analyze
{
  "analyzer": "custom_standard_analyzer",
  "text": "The quick brown fox jumps over the lazy dog."
}
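
The standard analyzer has stop word removal turned off by default, which is why the index above sets "stopwords": "_english_" explicitly. The Python sketch below approximates the effect of that setting on the sample sentence. It assumes that \w+ is a good enough stand-in for the standard tokenizer on plain English text (it is not in general, for example around apostrophes), and that ENGLISH_STOP_WORDS matches the predefined _english_ list.

```python
import re
from typing import FrozenSet, List

# Assumed to match the predefined _english_ stop word list.
ENGLISH_STOP_WORDS: FrozenSet[str] = frozenset(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def standard_like(text: str, stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """Rough stand-in: word characters, lowercased, optional stop word removal."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


text = "The quick brown fox jumps over the lazy dog."

# Default behavior: no stop words are removed, so both "the" tokens remain.
print(standard_like(text))

# With the English list: both "the" tokens disappear, "over" stays.
print(standard_like(text, ENGLISH_STOP_WORDS))
```

With the stop list applied, the sketch yields quick, brown, fox, jumps, over, lazy, dog, which is the token list to look for in the _analyze response.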

Simple Analyzer

PUT /my_index_simple
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_simple_analyzer": {
          "type": "simple"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "custom_simple_analyzer"
      }
    }
  }
}

# Analyze
POST /my_index_simple/_analyze
{
  "analyzer": "custom_simple_analyzer",
  "text": "The quick brown fox jumps over the lazy dog."
}

Custom Analyzer

A custom analyzer is defined by specifying a tokenizer and, optionally, one or more character filters and token filters.

PUT /my_custom_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "stop",
            "my_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}

POST /my_custom_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "The quick brown foxes are jumping over the lazy dogs."
}
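
Two details of this configuration are easy to miss: token filters run in the listed order, and the whitespace tokenizer leaves punctuation attached to the tokens. The sketch below replays only the whitespace, lowercase, and stop steps in plain Python. The stemmer step is left out on purpose, the helper names are hypothetical, and STOP_WORDS holds only the entries this sentence needs. It assumes the stop filter compares case-sensitively, which is its behavior unless ignore_case is enabled.

```python
from typing import Callable, List

TokenFilter = Callable[[List[str]], List[str]]

# Only the stop words that occur in the sample sentence.
STOP_WORDS = {"the", "are"}


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def stop(tokens: List[str]) -> List[str]:
    # Case-sensitive comparison: "The" does not match "the".
    return [t for t in tokens if t not in STOP_WORDS]


def run(text: str, filters: List[TokenFilter]) -> List[str]:
    tokens = text.split()  # whitespace tokenizer stand-in
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The quick brown foxes are jumping over the lazy dogs."

as_configured = run(text, [lowercase, stop])
swapped = run(text, [stop, lowercase])

print(as_configured)  # ['quick', 'brown', 'foxes', 'jumping', 'over', 'lazy', 'dogs.']
print(swapped)        # ['the', 'quick', 'brown', 'foxes', 'jumping', 'over', 'lazy', 'dogs.']
```

With the filters swapped, the capitalized "The" slips past the stop filter. In both cases the last token is "dogs." with the period attached; if that is unwanted, the standard tokenizer is one option since it drops most punctuation.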

Changing Index Settings

# 1. close
POST /my_custom_index/_close

# 2. update
PUT /my_custom_index/_settings
{
  "analysis": {
    "analyzer": {
      "second_custom_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "lowercase",
          "stop"
        ]
      }
    }
  }
}

# 3. open
POST /my_custom_index/_open

# 4. verify the settings
GET /my_custom_index/_settings

# 5. reindex existing documents so that new mappings or settings take effect
POST /my_custom_index/_update_by_query?conflicts=proceed
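
Analysis settings are static, so they can only be changed while the index is closed; that is the reason for the close/open pair around the update. If the sequence needs to be scripted, it could look roughly like the Python sketch below. The helper names (analyzer_update_steps, send) and the base URL in the usage note are assumptions for illustration, and there is no authentication or error handling.

```python
import json
import urllib.request
from typing import List, Optional, Tuple

# (HTTP method, path, optional JSON body)
Step = Tuple[str, str, Optional[dict]]


def analyzer_update_steps(index: str, analysis: dict) -> List[Step]:
    """Return the five requests from the walkthrough, in order."""
    return [
        ("POST", f"/{index}/_close", None),                        # 1. close
        ("PUT", f"/{index}/_settings", {"analysis": analysis}),    # 2. update
        ("POST", f"/{index}/_open", None),                         # 3. open
        ("GET", f"/{index}/_settings", None),                      # 4. verify
        ("POST", f"/{index}/_update_by_query?conflicts=proceed", None),  # 5. reindex
    ]


def send(base_url: str, step: Step) -> dict:
    """Send one step to a cluster (hypothetical helper, no auth, no retries)."""
    method, path, body = step
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


steps = analyzer_update_steps(
    "my_custom_index",
    {
        "analyzer": {
            "second_custom_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "stop"],
            }
        }
    },
)
for method, path, _ in steps:
    print(method, path)
```

Against a local test cluster, each step could then be sent with something like send("http://localhost:9200", step). Keep in mind that adding second_custom_analyzer does not by itself change how the content field is analyzed; a field mapping still has to reference the new analyzer.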

[References]

Elasticsearch with a Silicon Valley Engineer (실리콘밸리 엔지니어와 함께하는 Elasticsearch)