zensols.cnndmdb package#

Submodules#

zensols.cnndmdb.app#

Inheritance diagram of zensols.cnndmdb.app

Creates a SQLite database of the CNN and DailyMail summarization dataset.

class zensols.cnndmdb.app.Application(config_factory, corpus)[source]#

Bases: object

Creates a SQLite database of the CNN and DailyMail summarization dataset.

__init__(config_factory, corpus)#
config_factory: ConfigFactory#

Used to create objects for load().

corpus: Corpus#

The corpus, which contains a stash that creates instances of Article.

load()[source]#

Load the SQLite database with the CNN/DailyMail corpus.

write_article(key, key_type=_KeyType.org, format=_Format.text, output_file=PosixPath('-'))[source]#

Write an article.

Parameters
  • key (str) – the key to the article

  • key_type (_KeyType) – db for the numeric database key, org for the original corpus ID, short for the Kth shortest article

  • format (_Format) – the output format

Return type

Article

write_keys(limit=1)[source]#

Print the keys of the corpus.

Parameters

limit (int) – the max number of keys to write
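The limit parameter caps how many keys are printed. A minimal self-contained sketch of that behavior over a stash-like mapping (the dict here is an illustrative stand-in, not the package's stash):

```python
from itertools import islice

# a stand-in for the corpus stash's key space
stash_like = {'1': 'first article', '2': 'second', '3': 'third'}

def write_keys(stash, limit: int = 1):
    """Print at most `limit` keys, mirroring the documented default."""
    keys = list(islice(stash.keys(), limit))
    for k in keys:
        print(k)
    return keys

assert write_keys(stash_like, limit=2) == ['1', '2']
```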

class zensols.cnndmdb.app.PrototypeApplication(app)[source]#

Bases: object

CLI_META = {'is_usage_visible': False}#
__init__(app)#
app: Application#
proto()[source]#

Prototype test.

zensols.cnndmdb.cli#

Inheritance diagram of zensols.cnndmdb.cli

Command line entry point to the application.

class zensols.cnndmdb.cli.ApplicationFactory(*args, **kwargs)[source]#

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]#
classmethod get_corpus()[source]#

Return the corpus using the app context.

Return type

Corpus

zensols.cnndmdb.cli.main(args=sys.argv, **kwargs)[source]#
Return type

ActionResult

zensols.cnndmdb.corpus#

Inheritance diagram of zensols.cnndmdb.corpus

Data access objects (DAO) for the CNN/DailyMail news summarization corpus, which is sourced from a Tensorflow dataset instance, which in turn uses the Abi See GitHub repo.

GitHub: https://github.com/abisee/cnn-dailymail

class zensols.cnndmdb.corpus.Article(id, corp_id, split, publisher, text, highlights=None)[source]#

Bases: Dictable

Represents an article from the CNN/DailyMail corpus.

__init__(id, corp_id, split, publisher, text, highlights=None)#
asflatdict(*args, **kwargs)[source]#

Like asdict() but flattened into a data structure suitable for writing to JSON or YAML.

Return type

Dict[str, Any]

corp_id: str#

The original corpus unique identifier.

highlights: Tuple[str, ...] = None#

The highlights, or summarization, of the article.

id: int#

The database unique identifier.

publisher: Publisher#

The source of the article.

split: Split#

The dataset split of the article.

text: str#

The article’s (story) text.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
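For experimenting without a loaded database, the documented fields above can be mirrored in a minimal stand-in dataclass (an illustrative sketch only, not the package's Dictable-based implementation; the field types follow the attribute documentation above):

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class ArticleSketch:
    """Stand-in mirroring the documented Article fields."""
    id: int                 # database unique identifier
    corp_id: str            # original corpus unique identifier (40 characters)
    split: str              # split code: 't', 'r', or 'v'
    publisher: str          # publisher code: 'c' or 'd'
    text: str               # the article's (story) text
    highlights: Optional[Tuple[str, ...]] = None  # the summarization

art = ArticleSketch(1, '0' * 40, 'r', 'c',
                    'some story text', ('a highlight',))
flat = asdict(art)
assert flat['id'] == 1 and flat['split'] == 'r'
```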

class zensols.cnndmdb.corpus.Corpus(persister, stash)[source]#

Bases: object

Contains access to the CNN/DailyMail corpus.

__init__(persister, stash)#
get_by_corp_id(name)[source]#

Get an article using the original corpus ID.

Parameters

name (str) – the 40-character unique identifier from the original corpus

Return type

Article
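Under the hood, the persister resolves a corpus ID with a SQL lookup against the SQLite file. A rough self-contained sketch of that pattern using the standard library's sqlite3 (the table and column names here are assumptions for illustration, not the package's actual schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table article (id integer primary key, '
             'corp_id text unique, text text)')
conn.execute("insert into article values (1, ?, 'story text')", ('0' * 40,))

def get_by_corp_id(corp_id: str):
    """Fetch the row matching the original 40-character corpus ID."""
    cur = conn.execute(
        'select id, corp_id, text from article where corp_id = ?',
        (corp_id,))
    return cur.fetchone()

row = get_by_corp_id('0' * 40)
assert row[0] == 1
```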

get_kth_shortest(k)[source]#

Return the kth shortest article in the corpus.
Return type

Article

persister: BeanDbPersister#

The DB access object.

stash: Stash#

A stash for accessing and mapping the corpus as Article instances.

class zensols.cnndmdb.corpus.Publisher(value)[source]#

Bases: Enum

The source of the article.

cnn = 'c'#
daily_mail = 'd'#
class zensols.cnndmdb.corpus.Split(value)[source]#

Bases: Enum

The split of the news article.

test = 't'#
train = 'r'#
validation = 'v'#
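The single-character enum values above are what is stored in the database. A self-contained sketch re-declaring the enums from the documented values (not importing the package) shows how the stored codes decode:

```python
from enum import Enum

class Publisher(Enum):
    """The source of the article (values as stored in the database)."""
    cnn = 'c'
    daily_mail = 'd'

class Split(Enum):
    """The split of the news article."""
    test = 't'
    train = 'r'
    validation = 'v'

# decode single-character codes as read back from a database row
assert Publisher('c') is Publisher.cnn
assert Split('r') is Split.train
```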

zensols.cnndmdb.load#

Inheritance diagram of zensols.cnndmdb.load

Classes to populate the database. For data sources see zensols.stash.

class zensols.cnndmdb.load.DatabaseLoader(persister, chunk_size, dataset_name='cnn_dailymail', split_spec=None)[source]#

Bases: object

Loads the CNN/DailyMail corpus into a new SQLite database file. If the file already exists, it is deleted. This takes about 2 to load.

__init__(persister, chunk_size, dataset_name='cnn_dailymail', split_spec=None)#
chunk_size: int#

Number of rows to insert into SQLite at a time.

dataset_name: str = 'cnn_dailymail'#

The name of the dataset to load from.

property db_file: Path#

The SQLite file.

Return type

Path

load()[source]#

Load the SQLite database with the CNN/DailyMail corpus.

persister: BeanDbPersister#

The DB access object.

split_spec: Dict[str, str] = None#

Used to create the split format for loading the dataset.
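chunk_size batches the inserts so the entire corpus need not be held in memory at once. A self-contained sketch of that chunked-insert pattern with the standard library's sqlite3 (illustrative only; the real loader reads rows from the Tensorflow dataset and uses the persister):

```python
import sqlite3
from itertools import islice

def load_chunked(conn, rows, chunk_size: int):
    """Insert rows in batches of chunk_size using executemany."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        conn.executemany('insert into doc (text) values (?)', chunk)
        conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('create table doc (id integer primary key, text text)')
load_chunked(conn, ((f'article {i}',) for i in range(10)), chunk_size=3)
count = conn.execute('select count(*) from doc').fetchone()[0]
assert count == 10
```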

Module contents#