zensols.deepnlp.index package¶
Submodules¶
zensols.deepnlp.index.domain module¶
Contains a base class for vectorizers that index documents.
- class zensols.deepnlp.index.domain.DocumentIndexVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)[source]¶
- Bases: FeatureDocumentVectorizer, PersistableContainer, Primeable
- A vectorizer that generates vectorized features based on the index documents of the training set. For example, latent Dirichlet allocation may be used to generate a distribution of the likelihood that a document belongs to each topic.
- Subclasses of this abstract class are both vectorizers and models. The model is created once and then cached. To clear the cache and force the model to be retrained, use clear().
- The method _create_model() must be implemented.
 - abstract _create_model(docs)[source]¶
- Create the model for this indexer. The model is implementation specific. The model must be picklable and is cached as model.
- Return type:
 
 - __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path)¶
 - 
doc_factory: IndexedDocumentFactory¶
- The document factory used to create training documents for the model vectorizer. 
 - static feat_to_tokens(docs)[source]¶
- Create a tuple of string tokens from a set of documents suitable for document indexing. The strings are the lemmas of the tokens.
- Important: this method must remain static since the LSI instance of this class uses it as a factory function in a vectorizer. 
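The flattening described for feat_to_tokens can be sketched roughly as follows. This is a minimal illustration, not the library's code: Token, Document and LemmaIndexer are hypothetical stand-ins for the parsed document types and the vectorizer class.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token."""
    text: str
    lemma: str


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document."""
    tokens: Tuple[Token, ...]


class LemmaIndexer:
    """Hypothetical class that only hosts the static helper."""

    @staticmethod
    def feat_to_tokens(docs: Iterable[Document]) -> Tuple[str, ...]:
        # Flatten every document into one tuple of lemma strings
        return tuple(tok.lemma for doc in docs for tok in doc.tokens)


docs = [
    Document((Token("Dogs", "dog"), Token("ran", "run"))),
    Document((Token("cats", "cat"),)),
]
lemmas = LemmaIndexer.feat_to_tokens(docs)

# Because the method is static, it can be passed around as a plain
# function without binding it to an instance
tokenizer = LemmaIndexer.feat_to_tokens
```

Keeping the helper static means it carries no instance state, which is what allows another component to hold a reference to it as an ordinary callable.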
 - property model¶
- Return the trained model for this vectorizer. See the class docs on how it is cached and cleared. 
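The create-once, cache, and clear behavior described in the class docs might look like the sketch below. It is an assumption-laden illustration that uses only the standard library: CachedModelIndexer, VocabularyIndexer, _training_docs and train_count are hypothetical names, and the real class may persist its model differently.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Sequence


class CachedModelIndexer(ABC):
    """Hypothetical stand-in, not the library's implementation."""

    def __init__(self, index_path: Path):
        self.index_path = index_path
        self._model = None

    @abstractmethod
    def _create_model(self, docs: Sequence[str]) -> Any:
        """Return a picklable, implementation-specific model."""

    def _training_docs(self) -> Sequence[str]:
        # Assumed source of training documents for this sketch
        return ("a b", "b c")

    @property
    def model(self) -> Any:
        if self._model is None:
            if self.index_path.exists():
                # Reuse the model pickled by an earlier run
                with open(self.index_path, "rb") as f:
                    self._model = pickle.load(f)
            else:
                # Train once, then persist for later use
                self._model = self._create_model(self._training_docs())
                with open(self.index_path, "wb") as f:
                    pickle.dump(self._model, f)
        return self._model

    def clear(self) -> None:
        # Drop the in-memory and on-disk copies to force retraining
        self._model = None
        if self.index_path.exists():
            self.index_path.unlink()


class VocabularyIndexer(CachedModelIndexer):
    train_count = 0

    def _create_model(self, docs: Sequence[str]) -> Any:
        self.train_count += 1
        return sorted({word for doc in docs for word in doc.split()})


tmp_dir = tempfile.mkdtemp()
indexer = VocabularyIndexer(Path(tmp_dir) / "model.dat")
first = indexer.model    # trains and caches
second = indexer.model   # served from memory, no retraining
indexer.clear()
third = indexer.model    # retrained after clear()
```

The model must be picklable precisely because the cache is written to disk; a model holding open files or lambdas would fail at the pickle.dump step.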
 
- class zensols.deepnlp.index.domain.IndexedDocumentFactory[source]¶
- Bases: ABC
- Creates training documents used to generate indexed features (e.g. latent Dirichlet allocation, latent semantic indexing, etc.).
 - __init__()¶
 
zensols.deepnlp.index.lda module¶
- class zensols.deepnlp.index.lda.TopicModelDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)[source]¶
- Bases: DocumentIndexVectorizer
- Train a model using LDA for topic modeling.
- Citation: Hoffman, M., Bach, F., and Blei, D. 2010. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23.
- Shape:
- (topics,) when decode_as_flat is True, otherwise (, topics)
- See:
 - DESCRIPTION = 'latent semantic indexing'¶
 - FEATURE_TYPE = 2¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, topics=20, decode_as_flat=True)¶
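To make the flat (topics,) shape concrete, here is a small sketch that fits an LDA topic model with scikit-learn and reads off the topic distribution of one document. It assumes scikit-learn is installed and does not claim to mirror how this class trains or decodes its model; the corpus, the topics value and the variable names are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus (illustrative only)
train_docs = [
    "the dog chased the cat",
    "the cat climbed the tree",
    "stocks fell as markets closed",
    "markets rallied as stocks rose",
]
topics = 3

# LDA is fit on raw term counts
counts = CountVectorizer()
term_counts = counts.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=topics, random_state=0)
lda.fit(term_counts)

# Topic distribution for a single document: scikit-learn returns one
# row per document, so the first row is the flat per-document vector
doc_topics = lda.transform(counts.transform(["the dog chased the cat"]))
flat = doc_topics[0]
```

Each entry of flat is the estimated likelihood that the document belongs to the corresponding topic, so the entries sum to one.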
 
zensols.deepnlp.index.lsi module¶
A Deerwester latent semantic index vectorizer implementation.
- class zensols.deepnlp.index.lsi.LatentSemanticDocumentIndexerVectorizer(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)[source]¶
- Bases: DocumentIndexVectorizer
- Train a latent semantic indexing (LSI, aka LSA) model from:
- Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science; New York, N.Y. 41, 6, 391–407.
- This class can also be used to index TF/IDF only. To skip the LSI training, set iterations to zero.
- Shape:
- (1,)
- See:
 - DESCRIPTION = 'latent semantic indexing'¶
 - FEATURE_TYPE = 2¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, doc_factory, index_path, components=100, iterations=10, vectorizer_params=<factory>)¶
 - 
iterations: int= 10¶
- Number of iterations for randomized SVD solver if greater than 0 (see class docs). 
 - property vectorizer: TfidfVectorizer¶
- The vectorizer trained on the document set. 
 - 
vectorizer_params: Dict[str,Any]¶
- Additional parameters passed to TfidfVectorizer when vectorizing TF/IDF features.
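The interplay of components, iterations and vectorizer_params can be sketched with scikit-learn directly. This is a hedged illustration of the general TF/IDF plus truncated SVD recipe, not this class's implementation: it assumes scikit-learn is installed, and the corpus, parameter values and the mapping of iterations onto n_iter are assumptions made for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (illustrative only)
train_docs = [
    "the dog chased the cat",
    "the cat climbed the tree",
    "stocks fell as markets closed",
    "markets rallied as stocks rose",
]
components = 2
iterations = 10
# Extra keyword arguments forwarded to TfidfVectorizer
vectorizer_params = {"lowercase": True, "stop_words": "english"}

vectorizer = TfidfVectorizer(**vectorizer_params)
tfidf = vectorizer.fit_transform(train_docs)

if iterations > 0:
    # Randomized SVD over the TF/IDF matrix yields the LSI space
    svd = TruncatedSVD(
        n_components=components, n_iter=iterations, random_state=0)
    reduced = svd.fit_transform(tfidf)
else:
    # With zero iterations only the TF/IDF index is kept
    reduced = None
```

Setting iterations to zero in this sketch leaves only the fitted TfidfVectorizer, which corresponds to the TF/IDF-only use mentioned in the class docs.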
 
Module contents¶
Contains classes for vectorizers that index documents.