Updated: 2021-08-08
Wrapping up our code into reusable Python packages allow us to take more full advantage of one of the great features of software: extensibility. It allows us to bundle up code that can be used again and again, often in novel extensions of the original use case. We can distribute our code more widely, using an index such as PyPI. Creating packages also encourages us to write more robust, better documented, and easy to understand code.
Your package is a piece of software that should enable you and your users to achieve some task. You should design it with the user's needs in mind. So, before you start, document these needs and design a package workflow that not only makes sense within the package, but also within the broader workflow that a user is likely using.
To design a useful package, have empathy for your users
Remember: you are writing code not so that it is easy for you to use and maintain, but so that it is easy for others to use and maintain.
In addition to a clear and meaningful workflow, a useful and robust package is (a) modular, (b) well documented, and (c) extensively and automatically tested.
Strive for modulatiry. Writing functions that are small and specific--modular--enables easier testing, maintenence, and extensibility. The unix philosophy can help guide our function design:
Make each [function] do one thing well.
The same rule holds for packages, make the package scope small and clearly defined. A really good reference to dig in deeper is John Ousterhout's A Philosophy of Software Design (2018).
Make your package easily interoperable. Design your package to accept standard inputs from other packages. Expect its outputs to become inputs to other programs. The simplest inputs/outputs are text strings. Self explainable pandas data frames or CSVs (just formatted text strings) are likely good input/output format choices Python package that works with data.
Don't forget to document your package such that someone who was not involved in the package development (and yourself in the future) can fully understand it. This includes documenting:
If your package is well defined in scope and the functions are focused, then the documentation should be easy to write. If you find yourself struggling to write documentation, this is a red flag that your functions/package are too complex.1
Don't forget to test your package. Test driven development--where you write and automate the test for the correctness of your code before fully writing the code--enables you to create more robust code that is easier to maintain. It is more robust, because you have tests validating your code's correctness. It is more maintainable, because changes both to the package and its dependencies are automatically evaluated before deployment. Ideally your tests allow you to "fail fast" in that the tests fail close to the source of the problem. This makes the debugging process faster.
Take advantage of version control and CI/CD tools such as Git, GitHub, and GitHub actions. These tools allow you to develop new features while being confident that they only make it to the "release" version of the package once they have passed your tests.
Assuming we have a package purpose and workflow designed, these are important implementation details you'll need to create, test, and distribute your package.
First create a new directory for your package and commit it to some version control system such as GitHub. In this example, the package functions will be in the mean_var.py file or module. Imagine we are creating a package called stats_batch.2 Our initial file structure for the stats_batch package could look like this:
x
1stats_batch2├── CHANGELOG3├── .github4│ └── workflow5├── LICENSE6├── stats_batch7│ ├── __init__.py8│ └── mean-var.py9├── README.md10└── setup.py11└── testsThe __init__.py file is empty. It denotes the directory as a Python package. Use the setuptools package to create the setup.py file. This file contains key metadata for your package. For example:
xxxxxxxxxx221from setuptools import setup2
3setup(4 name='stats_batch',5 version='0.0.9000', 6 description='Find statistics (e.g. mean and variance) using batch updating algorithms',7 url='https://github.com/christophergandrud/batch-stats',8 author='Christopher Gandrud',9 author_email='christopher.gandrud@gmail.com',10 license='MIT',11 packages=['stats_batch'],12 install_requires=['numpy',13 'scipy' 14 ],15
16 classifiers=[17 'Development Status :: 1 - Planning',18 'Intended Audience :: Science/Research',19 'License :: OSI Approved :: MIT License', 20 'Programming Language :: Python :: 3.5',21 ],22)See the Python Packaging project for details.
Once you have these files in place, you can install your package. Go to the terminal. Make your package the working directory (this is assumed for all following terminal examples). Then use:
x
1pip3 install .The README.md file is the introduction to your package and package workflow. It should include at least the package's:
Function documentation in Python is built from docstrings. For example:
1def add_two(x:int) -> int:2 """3 Add two to an integer4 5 Parameters6 ----------7 x: int8 An integer to add 2 to.9 10 Returns11 -------12 An integer that is x + 2.13 14 Examples15 --------16 >>> add_two(10)17 1218 """19 return x + 2The full function documentation is included in the function definition and denoted by three double quotes. The first line of the docstring describes the purpose of the function. Then we document the function's one argument x, describe what the function returns, and provide examples. Users can access this documentation by calling help(add_two).
There are multiple style guides for Python docstrings. The numpy/scipy style guide is a good one to use.
Each time you make a release of your package, you should document the changes in a CHANGELOG.
In the docstring example above, we defined the functions arguments and what it returns:
xxxxxxxxxx11def add_two(x:int) -> int:Notice that we defined the types of the parameters--int--and what the function returns--also int. Each time a user runs the fuction, Python will check to ensure that the inputs and outputs are integers.3 Type checking is useful for:
As you develop your package, build its suite of automated tests. To do this, create a directory called tests. In this directory, place Python files that begin with test. For example, here is test_batch_mean.py
1import stats_batch as sb2import numpy as np3
4# Test batch_mean returns the mean if prior_mean and prior_sample_size are missing5def test_batch_mean_missing_prior_mean_prior_sample_size():6 x = list(range(1, 100))7 assert sb.batch_mean(x) == np.mean(x) Then in the terminal, use the pytest function (from the pytest package) to run all of the tests:
1pytestNote: all of the test function names need to begin with test_.
You can get an overview of how much of your code is covered by tests with the coverage package. Code coverage provide a quick indication of where there might be blind spots in your current set of tests.
Here is an example of how to use coverage in your terminal:
x
1# Run tests and record code coverage2coverage run -m pytest3
4# Show report5coverage reportYou can also create a badge to report your code coverage, for example in the README, with the coverage-bage package. In the terminal:
x
1coverage-badge -o coverage.svg -fThen in your README add:
11In the rendered version of the README, e.g. on GitHub, you will now have see a badge with the code coverage percentage for the package.
Your package should be in a version control system like GitHub. The main branch should be your "protected branch". Develop and test new code in other branches. Only merge code into the main branch after it has passed the automated tests and (if it is headed for production) review by a peer. This is sometimes called the "four eyes" principle. On GitHub you can enforce this discipline with branch control.
Rules of thumb for this process include:
The public GitHub has a really good built in CI/CD platform called GitHub Actions. This will build your package and run all of the tests (if you set it up to) on multiple platforms. In your package, add the directory .github/workflow. Then place an Actions YAML file in this directory. Here is an example to get started. Each time you push a commit, GitHub will run your tests. Click on the Actions tab on your repo's GitHub website to see the outcome of the tests.
Once you have your package built and tested, you can distribute it through the PyPI package index. The official tutorial has easy to follow instructions for how to do this. It is a good idea to try it out on the test index first.
Assuming:
your package pasts its tests
you have a (test) PyPI account and API token
have the build and twine packages installed,
follow a workflow like this in your terminal:
1# Build package2python3 -m build3
4# Upload built package to testpypi5python3 -m twine upload --repository testpypi dist/*6
If successful, you should be given a URL for the packages directory.
You could also include the build and publish process as part of your CI/CD pipeline. For more information see here.