Describe and optimize data

PyPI Python 3.11 Python 3.12 Build Status

In this package, Pythonic objects are used to easily (un)serialize to create LaTeX tables, figures and Excel files. The API and command-line program describes data in tables with metadata and using YAML and CSV files and integrates with Pandas. The paths to the CSV files to create tables from and their metadata is given as a YAML configuration file.

Features:

  • Create LaTeX tables (with captions) and Excel files (with notes) of tabular metadata from CSV files.

  • Create LaTeX friendly encapsulated postscript (.eps) files from CSV files.

  • Data and metadata is viewable in a nice format with paging in a web browser using the Render program.

  • Usable as an API during data collection for research projects.

Table of Contents

Documentation

See the full documentation. The API reference is also available.

Obtaining

The library can be installed with pip from the pypi repository:

pip3 install zensols.datdesc

Binaries are also available on pypi.

Usage

The library can be used as a Python API to programmatically create tables, figures, and/or represent tabular data. However, it also has a very robust command-line that is intended by be used by GNU make. The command-line can be used to create on the fly LaTeX .sty files that are generated as commands and figures are generated as Encapsulated Postscript (.eps) files.

The YAML file format is used to create both tables and figures. Parameters are both files or both directories when using directories, only files that match *-table.yml are considered on the command line. In addition, the described data can be hyperparameter metadata, which can be optimized with the hyperparameter module.

Tables

First create the table’s configuration file. For example, to create a Latex .sty file from the CSV file test-resources/section-id.csv using the first column as the index (makes that column go away) using a variable size and placement, use:

intercodertab:
  type: one_column
  path: test-resources/section-id.csv
  caption: >-
    Krippendorff’s ...
  single_column: true
  uses: zentable
  read_params:
    index_col: 0
  tabulate_params:
    disable_numparse: true
  replace_nan: ' '
  blank_columns: [0]
  bold_cells: [[0, 0], [1, 0], [2, 0], [3, 0]]

Some of these fields include:

  • index_col: clears column 0 and

  • bold_cells: make certain cells bold

  • disable_numparse tells the tabulate module not reformat numbers

See the Table class for a full listing of options.

Figures

Figures can be generated in any format supported by matplotlib (namely .eps, .svg, and .pdf). Figures are configured in a very similar fashion to tables. The configuration also points to a CSV file, but describes the plot.

The primary difference is that the YAML is parsed using the Zensols parsing rules so the string path: target will be given to a new Plot instance as a pathlib.Path.

A bar plot is configured below:

irisFig:
  image_dir: 'path: target'
  seaborn:
    style:
      style: darkgrid
      rc:
        axes.facecolor: 'str: .9'
    context:
      context: 'paper'
      font_scale: 1.3
  plots:
    - type: bar
      data: 'dataframe: test-resources/fig/iris.csv'
      title: 'Iris Splits'
      x_column_name: ds_type
      y_column_name: count
      code: |
        df = df.groupby('ds_type').agg({'ds_type': 'count'}).\
          rename(columns={'ds_type': 'count'}).reset_index()

This configuration meaning:

  • The top level irisFig creates a Figure instance, and when used with the command line, outputs this root level string as the name in the image_dir directory.

  • The image_dir tells where to write the image. This should be left out when invoking from the command-line to allow it to decide where to write the file.

  • The seaborn section configures the seaborn module.

  • The plots are a list of Plot instances that, like the Figure level, are populated with all the values.

  • The code (optionally) allows the massaging of the Pandas dataframe (pointed to by data). This feature also exists for Table.

See the Figure and Plot classes for a full listing of options.

Hyperparameters

Hyperparameter metadata is largely isomorphic to datdesc tables. This package was designed for the following purposes:

  • Provide a basic scaffolding to update model hyperparameters such as hyperopt.

  • Generate LaTeX tables of the hyperparamers and their descriptions for academic papers.

Access to the hyperparameters via the API is done by calling the set or model levels with a dotted path notation string. For example, svm.C first navigates to model svm, then to the hyperparameter named C.

A command line access to create LaTeX tables from the hyperparameter definitions is available with the hyper action. An example of a hyperparameter set (a grouping of models that in turn have hyperparameters) follows:

svm:
  doc: 'support vector machine'
  params:
    kernel:
      type: choice
      choices: [radial, linear]
      doc: 'maps the observations into some feature space'
    C:
      type: float
      doc: 'regularization parameter'
    max_iter:
      type: int
      doc: 'number of iterations'
      value: 20
      interval: [1, 30]

In the example, the svm model has hyperparameters kernel, C and max_iter. The kernel type is set as a choice, which is a string that has the constraints of matching a string in the list. The C hyperparameter is a floating point number, and the max_iter is an integer that must be between 1 and 30.

In this next example, the k_means model uses the string k-means in human readable documentation, which can be Python generated code in a dataclass.

k_means:
  desc: k-means
  doc: 'k-means clustering'
  params:
    n_clusters:
      type: int
      doc: 'number of clusters'
    copy_x:
      type: bool
      value: True
      doc: 'When pre-computing distances it is more numerically accurate to center the data first'
    strata:
      type: list
      doc: 'An array of stratified hyperparameters (made up for test cases).'
      value: [1, 2]
    kwargs:
      type: dict
      doc: 'Model keyword arguments (made up for test cases).'
      value:
        learning_rate: 0.01
        epochs: 3

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input is welcome.

License

MIT License

Copyright (c) 2023 - 2025 Paul Landes