Automatic learning of the morphology of medical language using information compression. Academic Article

Overview

MeSH

  • Algorithms
  • Information Theory
  • Natural Language Processing

MeSH Major

  • Artificial Intelligence
  • Data Compression
  • Unified Medical Language System

abstract

  • Conversion of free-text strings in a natural language to a standard representation (codes) is an important recurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). Manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
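
The MDL idea mentioned in the abstract can be illustrated with a toy two-part code: the cost of a segmentation is the number of bits needed to spell out the lexicon of distinct morphs plus the number of bits needed to encode the corpus using that lexicon. The sketch below is not the article's algorithm. The function name `description_length`, the simple cost model, and the example words are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Toy two-part MDL cost in bits (hypothetical cost model, not the article's).

    segmented_corpus: list of words, each word a list of morph strings.
    """
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)

    # Corpus cost: Shannon code length of every morph token,
    # using each morph's relative frequency as its probability.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    # Lexicon cost: spell each distinct morph once, character by character,
    # with a uniform code over the observed alphabet plus an end-of-morph marker.
    alphabet = {ch for morph in counts for ch in morph}
    bits_per_char = math.log2(len(alphabet) + 1)
    lexicon_bits = sum((len(morph) + 1) * bits_per_char for morph in counts)

    return lexicon_bits + corpus_bits

# Assumed example data: four medical terms, stored whole versus split into
# reusable stems and suffixes.
unsegmented = [["gastritis"], ["hepatitis"], ["gastrectomy"], ["hepatectomy"]]
segmented = [["gastr", "itis"], ["hepat", "itis"],
             ["gastr", "ectomy"], ["hepat", "ectomy"]]

print(description_length(unsegmented))
print(description_length(segmented))
```

Under this assumed cost model the segmented version gets the lower total, because the shared morphs shrink the lexicon by more than the extra tokens add to the corpus cost. An MDL-based learner would search over candidate segmentations for one that minimizes such a total.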

publication date

  • 2003

has subject area

  • Algorithms
  • Artificial Intelligence
  • Data Compression
  • Information Theory
  • Natural Language Processing
  • Unified Medical Language System

Research

keywords

  • Journal Article

Identity

Language

  • eng

PubMed Central ID

  • PMC1480252

PubMed ID

  • 14728443

Additional Document Info

start page

  • 938