zensols.deepnlp.index package¶
Submodules¶
zensols.deepnlp.index.domain module¶
Contains a base class for vectorizers that index documents.
- class zensols.deepnlp.index.domain.DocumentIndexVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)[source]¶
Bases: FeatureDocumentVectorizer, PersistableContainer, Primeable
A vectorizer that generates vectorized features based on the indexed documents of the training set. For example, latent Dirichlet allocation may be used to generate a distribution of the likelihood that a document belongs to a topic.
Subclasses of this abstract class are both vectorizers and models. The model is created once and then cached; to clear the cache and force it to be retrained, use clear(). The method _create_model() must be implemented (see the sketch after this class's members).
- abstract _create_model(docs)[source]¶
Create the model for this indexer. The model is implementation specific and must be picklable; it is cached as model.
- Return type:
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)¶
-
doc_factory: IndexedDocumentFactory¶
The document factory used to create training documents for the model vectorizer.
- static feat_to_tokens(docs)[source]¶
Create a tuple of string tokens from a set of documents suitable for document indexing. The strings are the lemmas of the tokens.
Important: this method must remain static since the LSI instance of this class uses it as a factory function in a vectorizer.
- property model¶
Return the trained model for this vectorizer. See the class docs on how it is cached and cleared.
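Because subclasses are both vectorizers and models, the minimal obligation of a subclass is to return a picklable object from _create_model(). The sketch below illustrates only that contract; the class name, its DESCRIPTION value and the counting "model" are hypothetical, and any encode/decode hooks the vectorizer framework requires are omitted.
```python
from dataclasses import dataclass
from collections import Counter
from zensols.deepnlp.index.domain import DocumentIndexVectorizer


@dataclass
class LemmaCountVectorizer(DocumentIndexVectorizer):
    """A hypothetical indexer whose model is a lemma frequency table."""
    DESCRIPTION = 'lemma counts'

    def _create_model(self, docs):
        # any picklable object works; it is cached at ``index_path`` and
        # surfaced by the ``model`` property until ``clear()`` is called
        return Counter(self.feat_to_tokens(docs))
```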
- class zensols.deepnlp.index.domain.IndexedDocumentFactory[source]¶
Bases: ABC
Creates training documents used to generate indexed features (e.g. latent Dirichlet allocation, latent semantic indexing).
- __init__()¶
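This excerpt documents only the ABC and its no-argument __init__, so the factory sketch below is an assumption for illustration: the class name and the create_training_docs method are hypothetical; consult the class itself for the actual abstract method(s) to implement.
```python
from zensols.deepnlp.index.domain import IndexedDocumentFactory


class InMemoryDocumentFactory(IndexedDocumentFactory):
    """A hypothetical factory that hands back pre-parsed documents."""

    def __init__(self, docs):
        # ``docs``: parsed FeatureDocument instances the model trains on
        super().__init__()
        self.docs = docs

    def create_training_docs(self):
        # hypothetical method name used here only to show the factory's role
        return self.docs
```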
zensols.deepnlp.index.lda module¶
- class zensols.deepnlp.index.lda.TopicModelDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)[source]¶
Bases: DocumentIndexVectorizer
Train a model using LDA for topic modeling.
Citation:
Hoffman, M., Bach, F., and Blei, D. 2010. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23.
- Shape:
(topics,) when decode_as_flat is True, otherwise (, topics)
- See:
- DESCRIPTION = 'latent semantic indexing'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)¶
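The citation above refers to online LDA. As a rough, standalone illustration of what a per-document topic distribution of length topics looks like, the following uses plain scikit-learn with made-up data and parameter values; it is not the zensols implementation.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy corpus; the real vectorizer trains on documents from ``doc_factory``
docs = ['the cat sat on the mat',
        'stocks rallied as markets closed higher',
        'the dog chased the cat']
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5,            # cf. ``topics``
                                learning_method='online',  # Hoffman et al. 2010
                                random_state=0)
dist = lda.fit_transform(counts)
print(dist[0].shape)  # (5,): one likelihood per topic, cf. shape (topics,)
```
Per the shape note above, decode_as_flat controls whether the decoded distribution is flattened to (topics,) or keeps a leading dimension.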
zensols.deepnlp.index.lsi module¶
A Deerwester latent semantic index vectorizer implementation.
- class zensols.deepnlp.index.lsi.LatentSemanticDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)[source]¶
Bases: DocumentIndexVectorizer
Train a latent semantic indexing (LSI, aka LSA) model from:
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science; New York, N.Y. 41, 6, 391–407.
This class can also be used to index only TF/IDF; to skip the LSI training, set iterations to zero.
- Shape:
(1,)
- See:
- DESCRIPTION = 'latent semantic indexing'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)¶
-
iterations: int = 10¶
Number of iterations for the randomized SVD solver if greater than 0 (see class docs).
- property vectorizer: TfidfVectorizer¶
The vectorizer trained on the document set.
-
vectorizer_params: Dict[str, Any]¶
Additional parameters passed to TfidfVectorizer when vectorizing TF/IDF features.
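Given the documented fields (a trained TfidfVectorizer, components and iterations for a randomized SVD solver, and vectorizer_params forwarded to TfidfVectorizer), the standalone scikit-learn sketch below approximates the kind of index such a model builds; the corpus and parameter values are made up and this is not the library's own code.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ['the cat sat on the mat',
        'stocks rallied as markets closed higher',
        'the dog chased the cat']

# example values only; ``vectorizer_params`` is forwarded to TfidfVectorizer
tfidf = TfidfVectorizer(**{'sublinear_tf': True, 'min_df': 1})
X = tfidf.fit_transform(docs)

# ``components`` sizes the latent space; ``iterations`` drives the randomized
# SVD solver -- setting the vectorizer's iterations to zero skips this step,
# leaving a TF/IDF-only index
svd = TruncatedSVD(n_components=2, n_iter=10, algorithm='randomized')
lsa = svd.fit_transform(X)
print(lsa.shape)  # (n_docs, n_components)
```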
Module contents¶
Contains vectorizer classes for indexing documents.