libunibreak  4.3
linebreakdef.h File Reference
#include "unibreakdef.h"

Data Structures

struct  LineBreakProperties
 Struct for entries of line break properties.
 
struct  LineBreakPropertiesLang
 Struct for association of language-specific line breaking properties with language names.
 
struct  LineBreakContext
 Context representing internal state of the line breaking algorithm.
 

Enumerations

enum  LineBreakClass {
  LBP_Undefined, LBP_OP, LBP_CL, LBP_CP,
  LBP_QU, LBP_GL, LBP_NS, LBP_EX,
  LBP_SY, LBP_IS, LBP_PR, LBP_PO,
  LBP_NU, LBP_AL, LBP_HL, LBP_ID,
  LBP_IN, LBP_HY, LBP_BA, LBP_BB,
  LBP_B2, LBP_ZW, LBP_CM, LBP_WJ,
  LBP_H2, LBP_H3, LBP_JL, LBP_JV,
  LBP_JT, LBP_RI, LBP_EB, LBP_EM,
  LBP_ZWJ, LBP_CB, LBP_AI, LBP_BK,
  LBP_CJ, LBP_CR, LBP_LF, LBP_NL,
  LBP_SA, LBP_SG, LBP_SP, LBP_XX
}
 Line break classes.
 

Functions

void lb_init_break_context (struct LineBreakContext *lbpCtx, utf32_t ch, const char *lang)
 Initializes line breaking context for a given language.
 
int lb_process_next_char (struct LineBreakContext *lbpCtx, utf32_t ch)
 Updates LineBreakContext for the next codepoint and returns the detected break.
 
void set_linebreaks (const void *s, size_t len, const char *lang, char *brks, get_next_char_t get_next_char)
 Sets the line breaking information for a generic input string.
 

Variables

const struct LineBreakProperties lb_prop_default []
 Default line breaking properties, taken from the Unicode website.
 
const struct LineBreakPropertiesLang lb_prop_lang_map []
 Association data of language-specific line breaking properties with language names.
 

Detailed Description

Definitions of internal data structures, declarations of global variables, and function prototypes for the line breaking algorithm.

Author
Wu Yongwei
Petr Filipsky

Enumeration Type Documentation

◆ LineBreakClass

Line break classes.

This is a mapping of Table 1 of Unicode Standard Annex 14.

Enumerator
LBP_Undefined 

Undefined.

LBP_OP 

Opening punctuation.

LBP_CL 

Closing punctuation.

LBP_CP 

Closing parenthesis.

LBP_QU 

Ambiguous quotation.

LBP_GL 

Glue.

LBP_NS 

Non-starters.

LBP_EX 

Exclamation/Interrogation.

LBP_SY 

Symbols allowing break after.

LBP_IS 

Infix separator.

LBP_PR 

Prefix.

LBP_PO 

Postfix.

LBP_NU 

Numeric.

LBP_AL 

Alphabetic.

LBP_HL 

Hebrew letter.

LBP_ID 

Ideographic.

LBP_IN 

Inseparable characters.

LBP_HY 

Hyphen.

LBP_BA 

Break after.

LBP_BB 

Break before.

LBP_B2 

Break on either side (but not within a pair).

LBP_ZW 

Zero-width space.

LBP_CM 

Combining marks.

LBP_WJ 

Word joiner.

LBP_H2 

Hangul LV.

LBP_H3 

Hangul LVT.

LBP_JL 

Hangul L Jamo.

LBP_JV 

Hangul V Jamo.

LBP_JT 

Hangul T Jamo.

LBP_RI 

Regional indicator.

LBP_EB 

Emoji base.

LBP_EM 

Emoji modifier.

LBP_ZWJ 

Zero width joiner.

LBP_CB 

Contingent break.

LBP_AI 

Ambiguous (alphabetic or ideograph).

LBP_BK 

Break (mandatory).

LBP_CJ 

Conditional Japanese starter.

LBP_CR 

Carriage return.

LBP_LF 

Line feed.

LBP_NL 

Next line.

LBP_SA 

South-East Asian.

LBP_SG 

Surrogates.

LBP_SP 

Space.

LBP_XX 

Unknown.

Function Documentation

◆ lb_init_break_context()

void lb_init_break_context ( struct LineBreakContext *  lbpCtx,
utf32_t  ch,
const char *  lang 
)

Initializes line breaking context for a given language.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]  ch  the first character to process
[in]  lang  language of the input
Postcondition
the line breaking context is initialized

◆ lb_process_next_char()

int lb_process_next_char ( struct LineBreakContext *  lbpCtx,
utf32_t  ch 
)

Updates LineBreakContext for the next codepoint and returns the detected break.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]  ch  Unicode codepoint
Returns
break result, one of LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, and LINEBREAK_NOBREAK
Postcondition
the line breaking context is updated
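
The init-then-step calling pattern described above can be illustrated without linking the library. The following is only a toy sketch: every name prefixed with demo_ or Demo is hypothetical, the result codes are stand-ins for the LINEBREAK_* macros, and the break rule (mandatory after a line feed, allowed after a space) is deliberately simplistic and is not the UAX #14 algorithm. It also assumes that the value returned for a codepoint describes the position just before that codepoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t utf32_t;            /* assumed to match the library's typedef */

/* Hypothetical result codes standing in for the LINEBREAK_* macros. */
enum { DEMO_MUSTBREAK, DEMO_ALLOWBREAK, DEMO_NOBREAK };

/* Toy context: remembers only the previous codepoint. */
struct DemoContext { utf32_t prev; };

/* Counterpart of the init call: seed the context with the first codepoint. */
void demo_init(struct DemoContext *ctx, utf32_t first)
{
    assert(ctx != NULL);
    ctx->prev = first;
}

/* Counterpart of the step call: report the break status between the
 * previous codepoint and ch, then update the context. */
int demo_next(struct DemoContext *ctx, utf32_t ch)
{
    int brk;
    assert(ctx != NULL);
    if (ctx->prev == '\n')
        brk = DEMO_MUSTBREAK;
    else if (ctx->prev == ' ' && ch != ' ')
        brk = DEMO_ALLOWBREAK;
    else
        brk = DEMO_NOBREAK;
    ctx->prev = ch;
    return brk;
}

/* Driver: brks[i] receives the break status after text[i]. */
void demo_scan(const utf32_t *text, size_t n, char *brks)
{
    struct DemoContext ctx;
    size_t i;
    if (n == 0)
        return;
    demo_init(&ctx, text[0]);
    for (i = 1; i < n; i++)
        brks[i - 1] = (char)demo_next(&ctx, text[i]);
    brks[n - 1] = DEMO_MUSTBREAK;    /* end of text is treated as a break */
}
```

With the real API the loop has the same shape: one call to lb_init_break_context for the first codepoint, then one call to lb_process_next_char for each codepoint that follows.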

◆ set_linebreaks()

void set_linebreaks ( const void *  s,
size_t  len,
const char *  lang,
char *  brks,
get_next_char_t  get_next_char 
)

Sets the line breaking information for a generic input string.

Currently, this implementation has customization for the following ISO 639-1 language codes (for lang):

  • de (German)
  • en (English)
  • es (Spanish)
  • fr (French)
  • ja (Japanese)
  • ko (Korean)
  • ru (Russian)
  • zh (Chinese)

In addition, the suffix "-strict" may be appended to the language code to request strict (as opposed to normal) line-breaking behaviour. See the Conditional Japanese Starter section of UAX #14 for more details.

Parameters
[in]  s  input string
[in]  len  length of the input
[in]  lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
[in]  get_next_char  function to get the next UTF-32 character
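
The get_next_char parameter is what makes set_linebreaks generic over encodings. As a hedged sketch, the callback below decodes UTF-16 code units into UTF-32 codepoints. The callback signature (string, length, in/out index) and the end-of-string sentinel are assumptions intended to mirror get_next_char_t and its companion definitions in unibreakdef.h, so check that header before relying on them. All demo_ and DEMO_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t utf32_t;            /* assumed to match the library's typedef */
#define DEMO_EOS 0xFFFFFFFFu         /* assumed end-of-string sentinel */

/* Assumed callback shape: returns the codepoint at *ip and advances *ip. */
typedef utf32_t (*demo_get_next_char_t)(const void *s, size_t len, size_t *ip);

/* Decodes one codepoint from UTF-16 code units, combining surrogate pairs.
 * An unpaired surrogate is returned unchanged in this sketch. */
utf32_t demo_next_utf16(const void *s, size_t len, size_t *ip)
{
    const uint16_t *p = (const uint16_t *)s;
    utf32_t ch;
    assert(ip != NULL);
    if (*ip >= len)
        return DEMO_EOS;
    ch = p[(*ip)++];
    if (ch >= 0xD800 && ch < 0xDC00 && *ip < len) {
        uint16_t lo = p[*ip];
        if (lo >= 0xDC00 && lo < 0xE000) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (utf32_t)(lo - 0xDC00);
            (*ip)++;
        }
    }
    return ch;
}

/* Counts codepoints by repeatedly calling the supplied callback. */
size_t demo_count_chars(const void *s, size_t len, demo_get_next_char_t next)
{
    size_t pos = 0, count = 0;
    while (next(s, len, &pos) != DEMO_EOS)
        count++;
    return count;
}
```

Because one codepoint may span several input units, the output array needs a value for the units that fall inside a character, which is the role LINEBREAK_INSIDEACHAR plays in brks.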

Variable Documentation

◆ lb_prop_default

const struct LineBreakProperties lb_prop_default[]

Default line breaking properties, taken from the Unicode website.

◆ lb_prop_lang_map

const struct LineBreakPropertiesLang lb_prop_lang_map[]

Association data of language-specific line breaking properties with language names.

This is the definition for the static data in this file. If you want more flexibility, or do not need the data here, you may want to redefine lb_prop_lang_map in your C source file.
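
To make the idea of a language-to-properties association concrete, here is a minimal sketch of such a table and a lookup over it. It does not use the library's types: the struct layouts, field names, class values, prefix-matching rule, and NULL terminator are all assumptions chosen for illustration, and every Demo/demo_ name is hypothetical. Consult linebreakdef.h for the actual members of LineBreakProperties and LineBreakPropertiesLang before writing a real replacement table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t utf32_t;            /* assumed to match the library's typedef */

/* Hypothetical property entry: a codepoint range and a class value. */
struct DemoProps {
    utf32_t start;
    utf32_t end;
    int cls;
};

/* Hypothetical association of a language name with a property table. */
struct DemoPropsLang {
    const char *lang;                /* language name, e.g. "fr" */
    size_t namelen;                  /* number of leading characters to match */
    const struct DemoProps *props;   /* table used for that language */
};

/* Illustrative overrides only; the class numbers carry no real meaning. */
const struct DemoProps demo_props_fr[] = {
    { 0x00AB, 0x00AB, 1 },
    { 0x00BB, 0x00BB, 2 },
    { 0xFFFFFFFFu, 0xFFFFFFFFu, 0 }  /* end marker for this sketch */
};

const struct DemoPropsLang demo_lang_map[] = {
    { "fr", 2, demo_props_fr },
    { NULL, 0, NULL }                /* end marker for this sketch */
};

/* Returns the table for the first entry whose name is a prefix of lang,
 * or NULL when there is no language-specific table. */
const struct DemoProps *demo_find_props(const char *lang)
{
    const struct DemoPropsLang *it;
    if (lang == NULL)
        return NULL;
    for (it = demo_lang_map; it->lang != NULL; ++it)
        if (strncmp(lang, it->lang, it->namelen) == 0)
            return it->props;
    return NULL;
}
```

A caller that finds no language-specific table would fall back to the default properties, which is the role lb_prop_default plays in this file.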