zensols.deepnlp.index package

Submodules

zensols.deepnlp.index.domain module

Contains a base class for vectorizers that index documents.

class zensols.deepnlp.index.domain.DocumentIndexVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)[source]

Bases: FeatureDocumentVectorizer, PersistableContainer, Primeable

A vectorizer that generates vectorized features based on the indexed documents of the training set. For example, latent Dirichlet allocation may be used to generate a distribution of the likelihood that a document belongs to each topic.

Subclasses of this abstract class are both vectorizers and models. The model is created once and then cached. To clear the cache and force the model to be retrained, use clear().

The method _create_model() must be implemented.

See:

TopicModelDocumentIndexerVectorizer
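
The create-once, cache, and clear behavior described above can be sketched as follows. This is a minimal stand-in using only the standard library, not the class's actual implementation: CachedModelIndexer, VocabularyIndexer, _training_docs() and the vocab.dat file name are hypothetical names introduced for illustration, and the sketch assumes the model is pickled to index_path.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Iterable


class CachedModelIndexer(ABC):
    """Hypothetical stand-in that mimics the documented caching contract."""

    def __init__(self, index_path: Path):
        self.index_path = index_path
        self._model = None

    @abstractmethod
    def _create_model(self, docs: Iterable[str]) -> Any:
        """Return a picklable, implementation-specific model."""

    def _training_docs(self) -> Iterable[str]:
        # Stands in for doc_factory.create_training_docs().
        return ("the cat sat", "the dog ran")

    @property
    def model(self) -> Any:
        # Train once, then reuse the in-memory or on-disk copy.
        if self._model is None:
            if self.index_path.exists():
                with open(self.index_path, "rb") as f:
                    self._model = pickle.load(f)
            else:
                self._model = self._create_model(self._training_docs())
                with open(self.index_path, "wb") as f:
                    pickle.dump(self._model, f)
        return self._model

    def clear(self) -> None:
        # Drop both copies so the next access retrains the model.
        self._model = None
        if self.index_path.exists():
            self.index_path.unlink()


class VocabularyIndexer(CachedModelIndexer):
    """Toy subclass whose 'model' is just a sorted vocabulary."""

    def __init__(self, index_path: Path):
        super().__init__(index_path)
        self.train_count = 0

    def _create_model(self, docs: Iterable[str]) -> Any:
        self.train_count += 1
        return sorted({tok for doc in docs for tok in doc.split()})


index_path = Path(tempfile.mkdtemp()) / "vocab.dat"
indexer = VocabularyIndexer(index_path)
first = indexer.model   # trains and writes the cache file
second = indexer.model  # served from the cached copy
indexer.clear()
third = indexer.model   # retrained after clear()
```

In this sketch the subclass only supplies _create_model(); when training happens and where the result is stored stay in the base class.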

abstract _create_model(docs)[source]

Create the model for this indexer. The model is implementation specific. The model must be picklable and is cached as model.

Return type:

Any

__init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)
clear()[source]
doc_factory: IndexedDocumentFactory

The document factory used to create training documents for the model vectorizer.

static feat_to_tokens(docs)[source]

Create a tuple of string tokens from a set of documents suitable for document indexing. The strings are the lemmas of the tokens.

Important: this method must remain static since the LSI instance of this class uses it as a factory function in a vectorizer.

Return type:

Tuple[str, ...]
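
As an illustration of the flattening that feat_to_tokens() is documented to perform, the sketch below collects lemmas from every token of every document into a single tuple. Token and Document are hypothetical stand-ins for the library's feature classes, and the lemma attribute name is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Token:
    # Hypothetical stand-in: only the lemma of a feature token is modeled.
    lemma: str


@dataclass
class Document:
    # Hypothetical stand-in for a feature document.
    tokens: Tuple[Token, ...]


def feat_to_tokens(docs: Iterable[Document]) -> Tuple[str, ...]:
    """Flatten the documents into one tuple of lemma strings."""
    return tuple(tok.lemma for doc in docs for tok in doc.tokens)


docs = (
    Document((Token("dog"), Token("run"))),
    Document((Token("cat"),)),
)
lemmas = feat_to_tokens(docs)
```

Because the function takes no instance state, it can be handed to other components as a plain callable, which is consistent with the requirement that the method remain static.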

index_path: Path

The path to the pickled cache file of the trained model.

property model

Return the trained model for this vectorizer. See the class docs on how it is cached and cleared.

prime()[source]
class zensols.deepnlp.index.domain.IndexedDocumentFactory[source]

Bases: ABC

Creates training documents used to generate indexed features (e.g. latent Dirichlet allocation or latent semantic indexing).

See:

DocumentIndexVectorizer

__init__()
abstract create_training_docs()[source]

Create the documents used to index in the model during training.

Return type:

Iterable[FeatureDocument]

zensols.deepnlp.index.lda module

class zensols.deepnlp.index.lda.TopicModelDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)[source]

Bases: DocumentIndexVectorizer

Train a model using LDA for topic modeling.

Citation:

Hoffman, M., Bach, F., and Blei, D. 2010. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23.

Shape:

(topics,) when decode_as_flat is True, otherwise (, topics)

See:

gensim.models.ldamodel.LdaModel

DESCRIPTION = 'latent semantic indexing'
FEATURE_TYPE = 2
__init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)
decode_as_flat: bool = True

If True, flatten the tensor after decoding.

query(tokens)[source]

Return a distribution over the topics for a query set of tokens.

Parameters:

tokens (Tuple[str]) – the string tokens to use for inference in the model

Return type:

Tuple[float]

Returns:

a list of tuples in the form (topic_id, probability)

topics: int = 20

The number of topics (usually denoted K).

zensols.deepnlp.index.lsi module

A Deerwester latent semantic index vectorizer implementation.

class zensols.deepnlp.index.lsi.LatentSemanticDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)[source]

Bases: DocumentIndexVectorizer

Train a latent semantic indexing (LSI, aka LSA) model from:

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman,
R. 1990.  Indexing by Latent Semantic Analysis. Journal of the American
Society for Information Science; New York, N.Y. 41, 6, 391–407.

This class can also be used solely to index TF/IDF: to skip the LSI training, set iterations to zero.

Shape:

(1,)

See:

sklearn.decomposition.TruncatedSVD

DESCRIPTION = 'latent semantic indexing'
FEATURE_TYPE = 2
__init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)
components: int = 100

The number of components for the output.

iterations: int = 10

The number of iterations for the randomized SVD solver if greater than 0 (see class docs).

property lsa: Pipeline

The LSA pipeline trained on the document set.

similarity(a, b)[source]

Return the semantic similarity between two documents.

Return type:

float

property vectorizer: TfidfVectorizer

The vectorizer trained on the document set.

vectorizer_params: Dict[str, Any]

Additional parameters passed to TfidfVectorizer when vectorizing TF/IDF features.
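
The relationship between the vectorizer, lsa and similarity members can be sketched with scikit-learn directly. This is an illustrative pipeline under stated assumptions, not the class's implementation: the toy corpus, the two-component setting and the use of cosine similarity are all choices made for this example, since the documentation does not state how similarity() compares two documents.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

train_docs = [
    "the dog barks at the cat",
    "the cat sleeps on the mat",
    "stocks fell as the market closed",
    "the market traded stocks and bonds",
]

components = 2   # analogous to the components attribute
iterations = 10  # analogous to the iterations attribute

# TF/IDF features followed by truncated SVD form the LSA pipeline.
vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=components, n_iter=iterations, random_state=0)
lsa = make_pipeline(vectorizer, svd)
lsa.fit(train_docs)


def similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts in the reduced space (an assumption)."""
    vecs = lsa.transform([a, b])
    return float(cosine_similarity(vecs[:1], vecs[1:])[0, 0])


score = similarity("the dog barks", "the cat sleeps")
same = similarity("the dog barks", "the dog barks")
```

Extra keyword arguments, as with vectorizer_params, would be passed to the TfidfVectorizer constructor in a setup like this one.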

Module contents

Contains classes for vectorizers that index documents.