zensols.nlparse.locale

Feature utility functions. See zensols.nlparse.feature.lang.

lang-code-to-locale

(lang-code-to-locale lang-code)

Return a java.util.Locale for a two-letter language code.
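As a rough illustration of the mapping between a language code and a Locale, the following sketch uses only java.util.Locale interop. The helper name code->locale is hypothetical, and this is not necessarily how lang-code-to-locale is implemented.

```clojure
(import 'java.util.Locale)

;; Hypothetical helper (not part of this namespace): build a Locale
;; from a two-letter language code using the standard JDK factory.
(defn code->locale [code]
  (Locale/forLanguageTag code))

;; Reading the code back from the Locale gives the original value.
(.getLanguage (code->locale "de"))
```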

locale-counts

(locale-counts text & {:keys [best-match?], :or {best-match? false}})

Return counts of the characters in text that belong to a language mapping (locale). See unicode-for-char.

Keys

  • :best-match? if true, return only the best match (i.e. language over partial alphabet) for each Unicode range

locale-keys

(locale-keys)

Return a sequence of nominals that could be returned via

locale-to-lang-code

(locale-to-lang-code loc)

Return the two-letter language code for a Locale.

name-by-locale

(name-by-locale)

Return a map of language name to two-letter language code. This iterates over java.util.Locale/getAvailableLocales using .getDisplayName. See java.util.Locale/getAvailableLocales.
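The iteration described above might look roughly like the sketch below. The function name display-name->lang-code is hypothetical, and the use of .getLanguage for the code is an assumption. The actual implementation may differ.

```clojure
(import 'java.util.Locale)

;; Hypothetical sketch: map each available locale's display name to
;; its language code. The root locale has an empty display name, so
;; it is skipped.
(defn display-name->lang-code []
  (->> (Locale/getAvailableLocales)
       (map (fn [^Locale loc] [(.getDisplayName loc) (.getLanguage loc)]))
       (remove (fn [[n _]] (empty? n)))
       (into {})))
```

The display names depend on the JVM's default locale, so the keys of the resulting map can vary between environments.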

name-to-locale

(name-to-locale)
(name-to-locale remove-langs lang-map)

Like name-by-locale, but remove the languages in remove-langs (set difference) and use the mapping in locale-map when it exists. Each key (also used for lookup in remove-langs and locale-map) has the parenthetical part of the language name stripped (e.g. Arabic (Lebanon) -> Arabic).

The value (language code) uses com.neovisionaries.i18n.LanguageCode/findByName.
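The key normalization described above can be sketched with a regular expression. The helper name base-language-name and the exact pattern are assumptions made for illustration, not the library's code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: drop a trailing parenthetical qualifier from a
;; display name, e.g. "Arabic (Lebanon)" becomes "Arabic".
(defn base-language-name [display-name]
  (str/replace display-name #"\s*\([^)]*\)\s*$" ""))

(base-language-name "Arabic (Lebanon)")
```

Names without a parenthetical pass through unchanged.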

set-name-to-locale-fn

(set-name-to-locale-fn name-to-locale-fn)

Set the function to use when mapping a language name to a java.util.Locale. The default uses name-to-locale.

unicode-counts

(unicode-counts text & {:keys [best-match?], :or {best-match? false}})

Return counts of all characters in text. See unicode-for-char.

Keys

  • :best-match? see unicode-for-char; note that counts will differ and won’t necessarily sum to all combinations of disjoint Unicode ranges/sets

unicode-for-char

(unicode-for-char c)
(unicode-for-char c best-match?)

Return a sequence of Unicode info maps that are in range for character c. Each map has the following keys:

  • :name the name of the Unicode record, which is one of
    • Unicode range name as defined in the unicode-ranges.csv resource
    • Particular set of Unicode characters (e.g. umlaut)
    • Language name mapped from the java.util.Locale.
  • :range the numeric Unicode range (if this is missing :set isn’t)
  • :set a hash of Unicode characters (if this is missing :range isn’t)
  • :locale the java.util.Locale assigned to the range (if any)

If best-match? is true then return only the best match (i.e. language over partial alphabet) per each Unicode match (in range or member of set). Another way to say this is that there will not be any overlapping Unicode range/set data returned, and thus, results are disjoint.
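To make the :range versus :set distinction concrete, here is a minimal self-contained sketch of matching a character against such records. The records, the two-element vector form of :range, and the names in-record? and records-for-char are all assumptions for illustration. The real data comes from the library's resources and may be shaped differently.

```clojure
;; Hypothetical records: one matched by numeric range, one by set.
(def records
  [{:name "Latin-1 Supplement" :range [0x0080 0x00FF]}
   {:name "umlaut" :set #{\ä \ö \ü \Ä \Ö \Ü}}])

;; A record matches when the character's code point falls in its
;; :range, or when the character is a member of its :set.
(defn in-record? [{r :range s :set} c]
  (if r
    (let [[lo hi] r] (<= lo (int c) hi))
    (contains? s c)))

;; Return every record that matches c (overlaps are kept).
(defn records-for-char [c]
  (filter #(in-record? % c) records))

(map :name (records-for-char \ü))
```

In this sketch both records match ü, which is the kind of overlap that best-match? is described as resolving.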