zensols.deepnlp.index package¶
Submodules¶
zensols.deepnlp.index.domain module¶
Contains a base class for vectorizers that index documents.
- class zensols.deepnlp.index.domain.DocumentIndexVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)[source]¶
 Bases:
FeatureDocumentVectorizer, PersistableContainer, Primeable
A vectorizer that generates vectorized features based on the index documents of the training set. For example, latent Dirichlet allocation may be used to generate a distribution of the likelihood that a document belongs to each topic.
Subclasses of this abstract class are both vectorizers and models. The model is created once and then cached. To clear the cache and force the model to be retrained, use clear().
The method _create_model() must be implemented.
- abstract _create_model(docs)[source]¶
 Create the model for this indexer. The model is implementation specific. The model must be picklable and is cached as model.
- Return type:
 
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)¶
 
- doc_factory: IndexedDocumentFactory¶
 The document factory used to create training documents for the model vectorizer.
- static feat_to_tokens(docs)[source]¶
 Create a tuple of string tokens from a set of documents suitable for document indexing. The strings are the lemmas of the tokens.
Important: this method must remain static since the LSI implementation of this class uses it as a factory function in a vectorizer.
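The token flattening described above can be sketched as follows. The Token, Document, and Indexer classes are hypothetical stand-ins invented for illustration and are not the library's types; only the method name feat_to_tokens and the use of lemmas come from the description. Because the method is static, it can be called without an instance and handed to another component as a plain function.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Token:
    """Hypothetical token with a surface form and a lemma."""
    text: str
    lemma: str


@dataclass
class Document:
    """Hypothetical document holding a tuple of tokens."""
    tokens: Tuple[Token, ...]


class Indexer:
    """Hypothetical stand-in for a document index vectorizer."""

    @staticmethod
    def feat_to_tokens(docs: Iterable[Document]) -> Tuple[str, ...]:
        # Flatten every document into one tuple of lemma strings.
        return tuple(tok.lemma for doc in docs for tok in doc.tokens)


docs = [
    Document((Token("Dogs", "dog"), Token("ran", "run"))),
    Document((Token("cats", "cat"),)),
]

# No instance is needed, so the function can be passed around freely.
tokens = Indexer.feat_to_tokens(docs)
print(tokens)
```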
- property model¶
 Return the trained model for this vectorizer. See the class docs on how it is cached and cleared.
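The create-once, cache, and clear behavior described for this class could look roughly like the sketch below. It is a self-contained illustration using only the standard library, not the library's implementation: CachedModelIndexer, VocabularyIndexer, the _training_docs helper, and the pickle-to-file strategy are all assumptions. Only the names _create_model, model, clear, and index_path come from the documentation above.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class CachedModelIndexer(ABC):
    """Hypothetical stand-in that caches a trained model on disk."""

    def __init__(self, index_path: Path):
        self.index_path = index_path
        self._model = None

    @abstractmethod
    def _create_model(self, docs):
        """Return a picklable, implementation-specific model."""

    def _training_docs(self):
        # Placeholder for what a document factory would supply.
        return [("topic", "model"), ("semantic", "index")]

    @property
    def model(self):
        if self._model is None:
            if self.index_path.exists():
                # Reuse the model cached by an earlier call or run.
                with open(self.index_path, "rb") as f:
                    self._model = pickle.load(f)
            else:
                # Train once, then persist for later use.
                self._model = self._create_model(self._training_docs())
                with open(self.index_path, "wb") as f:
                    pickle.dump(self._model, f)
        return self._model

    def clear(self):
        # Drop both the in-memory and on-disk copies to force retraining.
        self._model = None
        if self.index_path.exists():
            self.index_path.unlink()


class VocabularyIndexer(CachedModelIndexer):
    """Toy subclass whose 'model' is just a sorted vocabulary."""

    def __init__(self, index_path: Path):
        super().__init__(index_path)
        self.train_count = 0

    def _create_model(self, docs):
        self.train_count += 1
        return sorted({tok for doc in docs for tok in doc})


with tempfile.TemporaryDirectory() as tmp:
    indexer = VocabularyIndexer(Path(tmp) / "model.pkl")
    first = indexer.model          # trains and caches
    second = indexer.model         # served from the cache
    count_before_clear = indexer.train_count
    indexer.clear()
    third = indexer.model          # retrained after clear()
    count_after_clear = indexer.train_count
```

In this sketch the second access returns the same object without retraining, and only clear() triggers another call to _create_model().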
- class zensols.deepnlp.index.domain.IndexedDocumentFactory[source]¶
 Bases:
ABC
Creates training documents used to generate indexed features (e.g. latent Dirichlet allocation or latent semantic indexing).
- __init__()¶
 
zensols.deepnlp.index.lda module¶
- class zensols.deepnlp.index.lda.TopicModelDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)[source]¶
 Bases:
DocumentIndexVectorizer
Train a model using LDA for topic modeling.
Citation:
Hoffman, M., Bach, F., and Blei, D. 2010. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23.
- Shape:
 (topics,) when decode_as_flat is True, otherwise (, topics)
- See:
 
- DESCRIPTION = 'latent semantic indexing'¶
 
- FEATURE_TYPE = 2¶
 
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)¶
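To illustrate what a per-document topic distribution of shape (topics,) looks like, here is a minimal sketch that uses scikit-learn's LatentDirichletAllocation directly. Whether this class wraps that estimator is an assumption, and the corpus, the CountVectorizer step, and the variable names are invented for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; the real training documents would come from doc_factory.
corpus = [
    "stocks markets trading shares investors",
    "markets investors bonds trading",
    "football match goal players team",
    "team players coach match season",
]
n_topics = 2  # plays the role of the `topics` parameter

# LDA operates on term counts rather than raw text.
counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# One row per document, one column per topic; each row sums to 1.
doc_topics = lda.fit_transform(counts)

# The flat form for a single document has shape (topics,).
flat = doc_topics[0]
print(flat.shape)
```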
 
zensols.deepnlp.index.lsi module¶
A Deerwester latent semantic index vectorizer implementation.
- class zensols.deepnlp.index.lsi.LatentSemanticDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)[source]¶
 Bases:
DocumentIndexVectorizer
Train a latent semantic indexing (LSI, aka LSA) model from:
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science; New York, N.Y. 41, 6, 391–407.
This class can also be used to index TF/IDF only. To skip the LSI training, set iterations to zero.
- Shape:
 (1,)
- See:
 
- DESCRIPTION = 'latent semantic indexing'¶
 
- FEATURE_TYPE = 2¶
 
- __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)¶
 
- iterations: int = 10¶
 Number of iterations for the randomized SVD solver if greater than 0 (see class docs).
- property vectorizer: TfidfVectorizer¶
 The vectorizer trained on the document set.
- vectorizer_params: Dict[str, Any]¶
 Additional parameters passed to TfidfVectorizer when vectorizing TF/IDF features.
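The relationship between components, iterations, and vectorizer_params can be sketched with scikit-learn as below. This is an illustration under assumptions rather than the class's code: the train_index helper is hypothetical, and pairing TfidfVectorizer with TruncatedSVD (whose n_iter controls the randomized solver) is one common way to build LSI. A non-positive iterations value skips the SVD step and leaves only the TF/IDF index.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def train_index(texts, components=2, iterations=10, vectorizer_params=None):
    """Hypothetical helper: fit TF/IDF and, optionally, an LSI projection."""
    vectorizer = TfidfVectorizer(**(vectorizer_params or {}))
    tfidf = vectorizer.fit_transform(texts)
    if iterations <= 0:
        # Skip LSI training and keep only the TF/IDF index.
        return vectorizer, None, tfidf
    svd = TruncatedSVD(
        n_components=components,
        algorithm="randomized",
        n_iter=iterations,
        random_state=0,
    )
    # Project the sparse TF/IDF matrix onto the latent components.
    reduced = svd.fit_transform(tfidf)
    return vectorizer, svd, reduced


corpus = [
    "stocks markets trading shares investors",
    "markets investors bonds trading",
    "football match goal players team",
    "team players coach match season",
]

vec, svd, reduced = train_index(corpus, components=2, iterations=10)
vec_only, no_svd, tfidf = train_index(
    corpus, iterations=0, vectorizer_params={"lowercase": True}
)
```

Here reduced holds one dense row of length components per document, while tfidf is the sparse TF/IDF matrix returned when the SVD step is skipped.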
Module contents¶
Contains classes for vectorizers that index documents.