zensols.nlparse.locale
Feature utility functions. See zensols.nlparse.feature.lang.
lang-code-to-locale
(lang-code-to-locale lang-code)
Return a java.util.Locale for a two-letter language code.
locale-counts
(locale-counts text & {:keys [best-match?], :or {best-match? false}})
Return counts, over all characters in text, of the characters that belong to each language mapping (locale). See unicode-for-char.
Keys
- :best-match? if true, return only the best match (i.e. language over partial alphabet) for each Unicode range
locale-keys
(locale-keys)
Return a sequence of nominals that could be returned via
locale-to-lang-code
(locale-to-lang-code loc)
Return the two-letter language code for a java.util.Locale.
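The round trip between a language code and a Locale can be sketched with plain JDK interop. This is a minimal illustration, not the library's implementation: code->locale and locale->code are hypothetical stand-ins for lang-code-to-locale and locale-to-lang-code, and the library's functions may handle edge cases differently.

```clojure
(import 'java.util.Locale)

;; Hypothetical stand-in for lang-code-to-locale (assumption: the real
;; function may construct the Locale differently).
(defn code->locale [^String lang-code]
  (Locale/forLanguageTag lang-code))

;; Hypothetical stand-in for locale-to-lang-code.
(defn locale->code [^Locale loc]
  (.getLanguage loc))

;; Round trip: code -> Locale -> code
(locale->code (code->locale "fr")) ; => "fr"
```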
name-by-locale
(name-by-locale)
Return a map of language name to two-letter language code. This iterates over java.util.Locale/getAvailableLocales using .getDisplayName. See java.util.Locale/getAvailableLocales.
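A rough sketch of the iteration described above, using only the JDK. The function name display-name->lang-code is hypothetical, and passing Locale/ENGLISH to .getDisplayName is an assumption made here so the keys do not depend on the JVM's default locale; the library may do neither.

```clojure
(import 'java.util.Locale)

;; Hypothetical sketch: map each available locale's display name to its
;; language code, skipping entries with an empty name or code (e.g. the
;; root locale).
(defn display-name->lang-code []
  (->> (Locale/getAvailableLocales)
       (map (fn [^Locale l]
              [(.getDisplayName l Locale/ENGLISH) (.getLanguage l)]))
       (remove (fn [[nm code]] (or (empty? nm) (empty? code))))
       (into {})))

;; Look up a language code by its English display name.
(get (display-name->lang-code) "English")
```

Note that the keys include region-qualified names such as "English (United States)", which is the form name-to-locale simplifies.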
name-to-locale
(name-to-locale)
(name-to-locale remove-langs lang-map)
Like name-by-locale, but removes the languages in remove-langs (set difference) and uses the mapping in locale-map when it exists. Each key (also used for lookup in remove-langs and locale-map) drops the parenthetical part of the language name (e.g. Arabic (Lebanon) -> Arabic).
The value (language code) uses com.neovisionaries.i18n.LanguageCode/findByName.
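The key handling described above (dropping the parenthetical and removing unwanted languages) can be sketched as follows. Both function names are hypothetical, the regular expression is an assumption about what "parenthetical" covers, and the lookup through LanguageCode/findByName is deliberately left out.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: "Arabic (Lebanon)" -> "Arabic".
(defn strip-parenthetical [lang-name]
  (str/trim (str/replace lang-name #"\s*\([^)]*\)" "")))

;; Hypothetical helper: simplify every key, then drop any language whose
;; simplified name is in the remove-langs set.
(defn simplify-names [name->code remove-langs]
  (->> name->code
       (map (fn [[nm code]] [(strip-parenthetical nm) code]))
       (remove (fn [[nm _]] (contains? remove-langs nm)))
       (into {})))

(simplify-names {"Arabic (Lebanon)" "ar", "French (France)" "fr"}
                #{"French"})
;; => {"Arabic" "ar"}
```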
set-name-to-locale-fn
(set-name-to-locale-fn name-to-locale-fn)
Set the function to use when mapping a language name to a java.util.Locale. The default uses name-to-locale.
unicode-counts
(unicode-counts text & {:keys [best-match?], :or {best-match? false}})
Return counts of all characters in text. See unicode-for-char.
Keys
- :best-match? see unicode-for-char; note that counts will differ and won’t necessarily sum to all combinations of disjoint Unicode ranges/sets
unicode-for-char
(unicode-for-char c)
(unicode-for-char c best-match?)
Return a sequence of Unicode info maps that are in range for character c. Each map has the following keys:
- :name the name of the Unicode record, which is one of
  - Unicode range name as defined in the unicode-ranges.csv resource
  - Particular set of Unicode characters (e.g. umlaut)
  - Language name mapped from the java.util.Locale
- :range the numeric Unicode range (if this is missing :set isn’t)
- :set a hash of Unicode characters (if this is missing :range isn’t)
- :locale the java.util.Locale assigned to the range (if any)
If best-match? is true, then return only the best match (i.e. language over partial alphabet) for each Unicode match (in range or member of set). Put another way, no overlapping Unicode range/set data is returned, so the results are disjoint.
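The shape of the info maps, and why results can overlap when best-match? is false, can be illustrated with a small sketch. Everything here is hypothetical: the records, the two-element vector used for :range, and the functions matches? and info-for-char are assumptions made for illustration, not the library's data or code.

```clojure
;; Hypothetical records: each has either :range or :set, as described above.
(def records
  [{:name "Basic Latin"        :range [0x0000 0x007F]}
   {:name "Latin-1 Supplement" :range [0x0080 0x00FF]}
   {:name "umlaut"             :set #{\ä \ö \ü}}])

;; True if character c falls in the record's range or is a member of its set.
(defn matches? [{rng :range chars :set} c]
  (if rng
    (let [[lo hi] rng] (<= lo (int c) hi))
    (contains? chars c)))

;; Hypothetical analogue of (unicode-for-char c) without best-match?.
(defn info-for-char [c]
  (filter #(matches? % c) records))

(map :name (info-for-char \ä))
;; => ("Latin-1 Supplement" "umlaut")
```

Here \ä matches both a range record and a set record; with best-match? set to true, the library would return only one of such overlapping matches.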