zensols.nlparse.feature.char
Character feature creation functions.
capital-features
(capital-features tokens)
Return features based on counts of capitalization of tokens. Features returned include (the counts are integers):
- :caps-first-char-count number of tokens whose first character is capital (e.g. Yes, YEs, YES)
- :caps-first-char-ratio number of tokens whose first character is capital (e.g. Yes, YEs, YES) as a ratio to all characters across all tokens
- :caps-capitalized-count number of capitalized tokens (e.g. Yes)
- :caps-capitalized-ratio number of capitalized tokens (e.g. Yes) as a ratio to all other characters across all tokens
- :caps-all-count number of all caps tokens (e.g. YES)
- :caps-all-ratio number of all caps tokens (e.g. YES) as a ratio to all other characters across all tokens
- :cap-utterance true if there exist any capitals in any tokens or false otherwise
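The capitalization features above can be approximated with a short sketch. This is not the library's implementation: the function name capital-features-sketch and its helpers are hypothetical, and it assumes every ratio is the count divided by the total number of characters across all tokens.

```clojure
;; Sketch only: a hypothetical approximation, not the zensols.nlparse code.
(defn- upper-char? [c] (Character/isUpperCase (char c)))
(defn- lower-char? [c] (Character/isLowerCase (char c)))

(defn capital-features-sketch
  "Count capitalization patterns over a sequence of token strings.
  Assumption: ratios divide by the total character count of all tokens."
  [tokens]
  (let [tokens      (remove empty? tokens)
        char-count  (reduce + (map count tokens))
        ratio       (fn [n] (if (zero? char-count) 0.0 (double (/ n char-count))))
        ;; first character is upper case: Yes, YEs, YES
        first-cap   (count (filter #(upper-char? (first %)) tokens))
        ;; upper-case first character followed only by lower case: Yes
        capitalized (count (filter #(and (upper-char? (first %))
                                         (seq (rest %))
                                         (every? lower-char? (rest %)))
                                   tokens))
        ;; every character is upper case: YES
        all-caps    (count (filter #(every? upper-char? %) tokens))]
    {:caps-first-char-count  first-cap
     :caps-first-char-ratio  (ratio first-cap)
     :caps-capitalized-count capitalized
     :caps-capitalized-ratio (ratio capitalized)
     :caps-all-count         all-caps
     :caps-all-ratio         (ratio all-caps)
     :cap-utterance          (boolean (some #(some upper-char? %) tokens))}))
```

For the tokens ["Yes" "YEs" "YES" "no"] this sketch counts three tokens with a capital first character, one capitalized token and one all caps token.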
char-dist-features
(char-dist-features text)
Return the following features generated from text:
- :char-dist-unique number of unique characters
- :char-dist-unique-ratio ratio of unique characters to non-unique
- :char-dist-count character length
- :char-dist-variance variance of character counts
- :char-dist-mean mean of character counts
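As an illustration of the character distribution features, here is a minimal sketch built on clojure.core/frequencies. The name char-dist-features-sketch is hypothetical, and two details are assumptions rather than documented behavior: the variance is the population variance of the per-character counts, and the unique ratio is the number of unique characters divided by the text length.

```clojure
;; Sketch only: a hypothetical approximation, not the zensols.nlparse code.
(defn char-dist-features-sketch
  "Compute character distribution features for a string.
  Assumptions: population variance; unique ratio = unique / length."
  [text]
  (let [counts   (vals (frequencies text))   ; occurrences per distinct char
        n        (count text)
        uniq     (count counts)
        mean     (if (zero? uniq) 0.0 (double (/ n uniq)))
        variance (if (zero? uniq)
                   0.0
                   (/ (reduce + (map #(let [d (- % mean)] (* d d)) counts))
                      uniq))]
    {:char-dist-unique       uniq
     :char-dist-unique-ratio (if (zero? n) 0.0 (double (/ uniq n)))
     :char-dist-count        n
     :char-dist-variance     variance
     :char-dist-mean         mean}))
```

For the text "aabbbc" the per-character counts are 2, 3 and 1, so the sketch reports 3 unique characters, a length of 6 and a mean count of 2.0.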
latin-non-alpha-numeric
Characters in the Latin character set that are not alphanumeric.
lrs-features
(lrs-features text unique-char-repeats)
Return the following features:
- :lrs-len longest repeating string length
- :lrs-unique-characters the number of unique characters in the longest repeating string
- :lrs-occurs-N the number of times the string repeated that has N unique consecutive characters
- :lrs-length-N the length of the string that has N unique consecutive characters
In all of these, N ranges from 1 to unique-char-repeats and is the number of unique consecutive characters in the repeated grouping. For example, the string:
          1         2         3         4         5
01234567890123456789012345678901234567890123456789012
abcabc aabb aaaaaa abcabcabcabc abcdefgabcdefgabcdefg
yields:
{:lrs-len 14, ; abcdefgabcdefgabcdefg (TODO: should be 21)
:lrs-unique-chars 7, ; abcdefg
:lrs-length-1 1, ; 'a'
:lrs-occurs-1 6, ; 'aaaaaa' at index 12
:lrs-length-2 3, ; ' aa'
:lrs-occurs-2 1, ; index: 7
:lrs-length-3 3, ; 'abcabc'
:lrs-occurs-3 4, ; indexes: 0, 19, 25
:lrs-length-4 4, ; ' abc'
:lrs-occurs-4 1,
:lrs-length-5 5, ; 'cdefg' (has to be consecutive/non-overlapping)
:lrs-occurs-5 1,
:lrs-length-6 6, ; 'bcdefg'
:lrs-occurs-6 1,
:lrs-length-7 7, ; 'abcdefg'
:lrs-occurs-7 3} ; indexes: 32, 39, 49
punctuation
Natural language punctuation for several languages.
punctuation-features
(punctuation-features text)
Return the following features from text:
- :punctuation-count the count of punctuation
- :punctuation-ratio the ratio of count of punctuation to the length
- :latin-non-alpha-numeric-count like :punctuation-count but with latin-non-alpha-numeric
- :latin-non-alpha-numeric-ratio like :punctuation-ratio but with latin-non-alpha-numeric
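The punctuation features can be sketched as follows. Everything here is illustrative: punctuation-sketch is a small hypothetical stand-in for the library's punctuation set, latin-non-alpha-numeric-sketch? assumes "Latin" means printable ASCII other than the space, and ratios are assumed to divide by the length of text.

```clojure
;; Sketch only: a hypothetical approximation, not the zensols.nlparse code.
(def punctuation-sketch
  "Hypothetical stand-in for the library's punctuation set."
  (set ".,;:!?'\"()-"))

(defn latin-non-alpha-numeric-sketch?
  "Assumption: printable ASCII (excluding space) that is not a letter or digit."
  [c]
  (and (< 32 (int c) 127)
       (not (Character/isLetterOrDigit (char c)))))

(defn punctuation-features-sketch
  "Count punctuation and Latin non-alphanumeric characters in a string."
  [text]
  (let [n     (count text)
        ratio (fn [k] (if (zero? n) 0.0 (double (/ k n))))
        punct (count (filter punctuation-sketch text))
        lnan  (count (filter latin-non-alpha-numeric-sketch? text))]
    {:punctuation-count             punct
     :punctuation-ratio             (ratio punct)
     :latin-non-alpha-numeric-count lnan
     :latin-non-alpha-numeric-ratio (ratio lnan)}))
```

With the text "Hi, there! #1" the comma and exclamation mark count as punctuation, while the Latin non-alphanumeric count also includes the "#".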
unicode-features
(unicode-features text nth-best-unicodes)
Create features based on the Unicode values of text:
- :unicode-variance variance of Unicode (range) character counts
- :unicode-range-name-N name of the Nth best (highest count) Unicode range
- :unicode-range-ratio-N character ratio of the Nth best (highest count) Unicode range
nth-best-unicodes is the number of range name/ratio features to create for the Unicode ranges across characters in text.
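One way to picture these features is to group characters by their Unicode block using the JDK's java.lang.Character$UnicodeBlock. The sketch below is an assumption-laden illustration, not the library's code: the name unicode-features-sketch is hypothetical, "range" is taken to mean a JDK Unicode block, N is assumed to start at 1, the variance is the population variance of the per-block counts, and ties are broken by block name.

```clojure
;; Sketch only: a hypothetical approximation, not the zensols.nlparse code.
(defn unicode-features-sketch
  "Rank Unicode blocks by how many characters of text fall in each.
  Assumptions: JDK Unicode blocks as ranges, 1-based N, population variance."
  [text nth-best-unicodes]
  (let [n        (count text)
        ;; [block-name count] pairs, highest count first
        blocks   (->> text
                      (map #(str (java.lang.Character$UnicodeBlock/of (char %))))
                      frequencies
                      (sort-by (fn [[block-name cnt]] [(- cnt) block-name])))
        counts   (map second blocks)
        k        (count counts)
        mean     (if (zero? k) 0.0 (/ (double (reduce + counts)) k))
        variance (if (zero? k)
                   0.0
                   (/ (reduce + (map #(let [d (- % mean)] (* d d)) counts)) k))]
    (into {:unicode-variance variance}
          (mapcat (fn [i [block-name cnt]]
                    [[(keyword (str "unicode-range-name-" i)) block-name]
                     [(keyword (str "unicode-range-ratio-" i)) (double (/ cnt n))]])
                  (iterate inc 1)
                  (take nth-best-unicodes blocks)))))
```

Calling it with a text of three Basic Latin letters and two CJK ideographs and nth-best-unicodes of 2 ranks the Basic Latin block first, followed by the CJK Unified Ideographs block.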