Automatic learning of the morphology of medical language using information compression.


MeSH Major

  • Artificial Intelligence
  • Data Compression
  • Unified Medical Language System


abstract

  • Conversion of free-text strings in a natural language to a standard representation (codes) is an important recurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). Manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
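
The MDL idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, only a generic two-part description-length score under stated assumptions: the function name `description_length`, the flat cost of 5 bits per character, the one-symbol end marker per lexicon entry, and the toy corpus are all hypothetical. The score adds the cost of spelling out the lexicon to the cost of encoding the corpus with that lexicon, so a segmentation that reuses morphemes should come out shorter.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_char=5.0):
    """Two-part MDL cost in bits: lexicon cost plus corpus cost given the lexicon.

    segmented_corpus is a list of words, each word a list of morph strings.
    bits_per_char is an assumed flat cost for spelling one character.
    """
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: spell each distinct morph once, plus one end-marker symbol.
    lexicon_bits = sum((len(morph) + 1) * bits_per_char for morph in counts)
    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# Toy corpus (hypothetical): the same four words, segmented two ways.
unsegmented = [["gastritis"], ["hepatitis"], ["gastrectomy"], ["hepatectomy"]]
segmented = [
    ["gastr", "itis"],
    ["hepat", "itis"],
    ["gastr", "ectomy"],
    ["hepat", "ectomy"],
]

dl_whole = description_length(unsegmented)
dl_morph = description_length(segmented)
print(dl_whole, dl_morph)
```

On this toy data the morpheme-level segmentation gives the shorter description, because the four shared morphs are cheaper to list than four whole words. A learning procedure built on this kind of score would search over candidate segmentations and keep the one with the lowest total.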

publication date

  • January 2003



  • Academic Article



  • eng

PubMed Central ID

  • PMC1480252

PubMed ID

  • 14728443

Additional Document Info

start page

  • 938