zensols.nlparse.feature.char

Character feature creation functions.

capital-feature-metas

(capital-feature-metas)

capital-features

(capital-features tokens)

Return features based on capitalization counts over tokens. Features returned include (counts are integers):

  • :caps-first-char-count number of tokens whose first character is a capital (e.g. Yes, YEs, YES)
  • :caps-first-char-ratio number of tokens whose first character is a capital (e.g. Yes, YEs, YES) as a ratio to all characters across all tokens
  • :caps-capitalized-count number of capitalized tokens (e.g. Yes)
  • :caps-capitalized-ratio number of capitalized tokens (e.g. Yes) as a ratio to all other characters across all tokens
  • :caps-all-count number of all-caps tokens (e.g. YES)
  • :caps-all-ratio number of all-caps tokens (e.g. YES) as a ratio to all other characters across all tokens
  • :cap-utterance true if any token contains a capital, false otherwise
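
The counts above can be illustrated with a minimal sketch over plain strings. This is not the library's implementation: capital-counts-sketch is a hypothetical name, the input is assumed to be a sequence of token strings rather than parsed token maps, and every ratio is assumed to divide by the total character count across all tokens.

```clojure
;; Hypothetical sketch; not the code behind capital-features.
(defn- upper? [c]
  (Character/isUpperCase (char c)))

(defn capital-counts-sketch
  "Compute capitalization features over a sequence of token strings."
  [tokens]
  (let [tokens      (remove empty? tokens)
        total-chars (reduce + (map count tokens))
        ;; first character is a capital: Yes, YEs, YES
        first-cap   (count (filter #(upper? (first %)) tokens))
        ;; capitalized: first character upper, no other capitals: Yes
        capitalized (count (filter #(and (upper? (first %))
                                         (not-any? upper? (rest %)))
                                   tokens))
        ;; all caps: YES
        all-caps    (count (filter #(every? upper? %) tokens))
        ;; assumption: the denominator is the total character count
        ratio       (fn [n]
                      (if (pos? total-chars)
                        (double (/ n total-chars))
                        0.0))]
    {:caps-first-char-count  first-cap
     :caps-first-char-ratio  (ratio first-cap)
     :caps-capitalized-count capitalized
     :caps-capitalized-ratio (ratio capitalized)
     :caps-all-count         all-caps
     :caps-all-ratio         (ratio all-caps)
     :cap-utterance          (boolean (some #(some upper? %) tokens))}))

(capital-counts-sketch ["Yes" "YEs" "YES" "no"])
```

For the four tokens above the sketch counts three tokens with a capital first character, one capitalized token and one all-caps token. Note that under this definition a single-character token such as "A" counts as both capitalized and all caps.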

char-dist-feature-metas

(char-dist-feature-metas)

char-dist-features

(char-dist-features text)

Return the following features generated from text:

  • :char-dist-unique number of unique characters
  • :char-dist-unique-ratio ratio of unique characters to non-unique
  • :char-dist-count character length
  • :char-dist-variance variance of character counts
  • :char-dist-mean mean of character counts
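
A rough sketch of how such distribution features could be derived from per-character frequencies follows. The function name char-dist-sketch is hypothetical, the variance is assumed to be the population variance of the per-character counts, and the unique ratio is assumed to be unique characters over total length; the library may define either differently.

```clojure
;; Hypothetical sketch; not the code behind char-dist-features.
(defn char-dist-sketch
  "Compute character distribution features for a string."
  [text]
  (let [counts   (vals (frequencies text))   ; occurrences per distinct char
        unique   (count counts)
        len      (count text)
        mean     (if (pos? unique)
                   (/ (reduce + counts) (double unique))
                   0.0)
        ;; assumption: population variance of the per-character counts
        variance (if (pos? unique)
                   (/ (reduce + (map #(let [d (- % mean)] (* d d)) counts))
                      unique)
                   0.0)]
    {:char-dist-unique       unique
     ;; assumption: unique characters divided by total length
     :char-dist-unique-ratio (if (pos? len) (/ unique (double len)) 0.0)
     :char-dist-count        len
     :char-dist-variance     variance
     :char-dist-mean         mean}))

(char-dist-sketch "aabbbc")
```

For "aabbbc" the per-character counts are 2, 3 and 1, so the sketch reports three unique characters, a length of 6 and a mean count of 2.0.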

latin-non-alpha-numeric

Characters in the Latin character set that are not alphanumeric.

lrs-feature-metas

(lrs-feature-metas count)

lrs-features

(lrs-features text unique-char-repeats)

Return the following features:

  • :lrs-len longest repeating string length
  • :lrs-unique-chars the number of unique characters in the longest repeating string
  • :lrs-occurs-N the number of times the string with N unique consecutive characters repeats
  • :lrs-length-N the length of the string with N unique consecutive characters

Here N ranges from 1 to unique-char-repeats and is the number of unique consecutive characters in the repeated grouping. For example, the string:

          1         2         3         4         5
01234567890123456789012345678901234567890123456789012
abcabc aabb aaaaaa abcabcabcabc abcdefgabcdefgabcdefg

yields:

{:lrs-len 14,           ; abcdefgabcdefgabcdefg (TODO: should be 21)
 :lrs-unique-chars 7,   ; abcdefg
 :lrs-length-1 1,       ; 'a'
 :lrs-occurs-1 6,       ; 'aaaaaa' at index 12
 :lrs-length-2 3,       ; ' aa'
 :lrs-occurs-2 1,       ; index: 7
 :lrs-length-3 3,       ; 'abcabc'
 :lrs-occurs-3 4,       ; indexes: 0, 19, 25
 :lrs-length-4 4,       ; ' abc'
 :lrs-occurs-4 1,
 :lrs-length-5 5,       ; 'cdefg' (has to be consecutive/non-overlapping)
 :lrs-occurs-5 1,
 :lrs-length-6 6,       ; 'bcdefg'
 :lrs-occurs-6 1,
 :lrs-length-7 7,       ; 'abcdefg'
 :lrs-occurs-7 3}       ; indexes: 32, 39, 46
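
The idea behind the occurs-N values can be sketched as counting how often an N-character unit made of N distinct characters repeats back to back. This is a simplified illustration only, with the hypothetical helpers consecutive-repeats and max-repeats-sketch; it is not the library's algorithm and makes no attempt to reproduce every value in the map above.

```clojure
;; Hypothetical sketch; not the code behind lrs-features.
(defn consecutive-repeats
  "Count back-to-back, non-overlapping repeats of the n-character unit
  starting at index i (the unit itself counts as 1)."
  [text i n]
  (let [unit (subs text i (+ i n))]
    (loop [j (+ i n), cnt 1]
      (if (and (<= (+ j n) (count text))
               (= unit (subs text j (+ j n))))
        (recur (+ j n) (inc cnt))
        cnt))))

(defn max-repeats-sketch
  "Find the n-character unit with n distinct characters that repeats
  consecutively the most. Returns nil when no such unit exists."
  [text n]
  (let [candidates (for [i (range 0 (inc (- (count text) n)))
                         :let [unit (subs text i (+ i n))]
                         ;; keep only units whose characters are all distinct
                         :when (= n (count (set unit)))]
                     {:unit unit
                      :index i
                      :occurs (consecutive-repeats text i n)})]
    (when (seq candidates)
      (apply max-key :occurs candidates))))

(def example-text
  "abcabc aabb aaaaaa abcabcabcabc abcdefgabcdefgabcdefg")

(max-repeats-sketch example-text 1) ; 'a' repeated 6 times from index 12
(max-repeats-sketch example-text 3) ; 'abc' repeated 4 times from index 19
(max-repeats-sketch example-text 7) ; 'abcdefg' repeated 3 times from index 32
```

When several units tie for the highest repeat count, which one max-key returns is an implementation detail, so the sketch is only dependable where the maximum is unique, as in the three calls shown.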

punctuation

Natural language punctuation characters, covering several languages.

punctuation-features

(punctuation-features text)

Return the following features from text:

punctuation-metas

(punctuation-metas)

unicode-feature-metas

(unicode-feature-metas nth-best-unicodes)

unicode-features

(unicode-features text nth-best-unicodes)

Create features based on the Unicode values of text:

  • :unicode-variance variance of the character counts per Unicode range
  • :unicode-range-name-N name of the Nth best (highest count) Unicode range
  • :unicode-range-ratio-N character ratio of the Nth best (highest count) Unicode range

nth-best-unicodes is the number of range name/ratio features to generate for the Unicode ranges found across the characters in text.
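
The ranking of Unicode ranges might be sketched as follows. Several things here are assumptions rather than facts about the library: that a range corresponds to a java.lang.Character$UnicodeBlock, that N starts at 1, that ties are broken by block name, and that the ratio divides by the number of characters with a known block. The function name unicode-range-sketch is hypothetical, and :unicode-variance is left out.

```clojure
;; Hypothetical sketch; not the code behind unicode-features.
(defn unicode-range-sketch
  "Return name/ratio features for the n most frequent Unicode blocks in text."
  [text n]
  (let [;; block name per character; unassigned code units are skipped
        blocks (keep #(some-> (java.lang.Character$UnicodeBlock/of (int %)) str)
                     text)
        total  (count blocks)
        ;; highest count first; ties broken by block name (assumption)
        ranked (->> (frequencies blocks)
                    (sort-by (fn [[block cnt]] [(- cnt) block]))
                    (take n))]
    (into {}
          (mapcat (fn [i [block cnt]]
                    [[(keyword (str "unicode-range-name-" i)) block]
                     [(keyword (str "unicode-range-ratio-" i))
                      (/ cnt (double total))]])
                  (iterate inc 1)   ; assumption: N is 1-based
                  ranked))))

;; six Basic Latin characters (including the space) and three Cyrillic ones
(unicode-range-sketch "hello \u043c\u0438\u0440" 2)
```

If text contains fewer than n distinct blocks, the sketch simply returns fewer keys; how the library pads missing ranks is not covered here.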