Since the advent of type hints in Python 3.5, statically typing a DataFrame has generally been limited to specifying just the type:
def process(f: DataFrame) -> Series: ...
This is inadequate, as it ignores the types contained within the container. A DataFrame might have string column labels and three columns of integer, string, and floating-point values; these characteristics define the type. A function argument with such type hints provides developers, static analyzers, and runtime checkers with all the information needed to understand the expectations of the interface. StaticFrame 2 (an open-source project of which I am a lead developer) now permits this:
from typing import Any
import numpy as np
from static_frame import Frame, Index, TSeriesAny

def process(f: Frame[ # type of the container
    Any, # type of the index labels
    Index[np.str_], # type of the column labels
    np.int_, # type of the first column
    np.str_, # type of the second column
    np.float64, # type of the third column
    ]) -> TSeriesAny: ...
All major StaticFrame containers now support generic specifications. While statically checkable, a new decorator, @CallGuard.check, permits runtime validation of these type hints on function interfaces. Further, using Annotated generics, the new Require class defines a family of powerful runtime validators, permitting per-column or per-row data checks. Finally, each container exposes a new via_type_clinic interface to derive and validate type hints. Together, these tools offer a cohesive approach to type-hinting and validating DataFrames.
Requirements of a generic DataFrame
Python’s built-in generic types (e.g., tuple or dict) require specification of component types (e.g., tuple[int, str, bool] or dict[str, int]). Defining component types permits more accurate static analysis. While the same is true for DataFrames, there have been few attempts to define comprehensive type hints for DataFrames.
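As a small illustration of what "component types" means for the built-in generics mentioned above, the sketch below inspects two parameterized aliases at runtime. The alias names Row and Lookup are illustrative only and do not come from StaticFrame.

```python
from typing import get_args, get_origin

# Parameterizing a built-in generic records the component types on the alias.
Row = tuple[int, str, bool]
Lookup = dict[str, int]

# get_origin returns the container type; get_args returns the component types.
print(get_origin(Row), get_args(Row))
print(get_origin(Lookup), get_args(Lookup))
```

A static analyzer reads the same component information without running the code, which is what makes the more precise analysis possible.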
Pandas, even with the pandas-stubs package, does not permit specifying the types of a DataFrame’s components. The Pandas DataFrame, which permits extensive in-place mutation, may not be sensible to type statically. Fortunately, immutable DataFrames are available in StaticFrame.
Additionally, Python’s tools for defining generics were, until recently, not well suited to DataFrames. That a DataFrame has a variable number of heterogeneous columnar types poses a challenge for generic specification. Typing such a structure became easier with the new TypeVarTuple, introduced in Python 3.11 (and back-ported in the typing_extensions package).
A TypeVarTuple permits defining generics that accept a variable number of types. (See PEP 646 for details.) With this new type variable, StaticFrame can define a generic Frame with a TypeVar for the index, a TypeVar for the columns, and a TypeVarTuple for zero or more columnar types.
A generic Series is defined with a TypeVar for the index and a TypeVar for the values. The StaticFrame Index and IndexHierarchy are also generic, the latter again taking advantage of TypeVarTuple to define a variable number of component Index for each depth level.
StaticFrame uses NumPy types to define the columnar types of a Frame, or the values of a Series or Index. This permits narrowly specifying sized numerical types, such as np.uint8 or np.complex128; or broadly specifying categories of types, such as np.integer or np.inexact. As StaticFrame supports all NumPy types, the correspondence is direct.
Interfaces defined with generic DataFrames
Extending the example above, the function interface below shows a Frame with three columns transformed into a dictionary of Series. With so much more information provided by component type hints, the function’s purpose is almost obvious.
from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth

def process(f: Frame[
    Any,
    Index[np.str_],
    np.int_,
    np.str_,
    np.float64,
    ]) -> dict[
        int,
        Series[ # type of the container
            IndexYearMonth, # type of the index labels
            np.float64, # type of the values
        ],
    ]: ...
This function processes a signal table from an Open Source Asset Pricing (OSAP) dataset (firm-level characteristics / individual-level predictors). Each table has three columns: the security identifier (labeled “permno”), the year and month (labeled “yyyymm”), and the signal (with a name specific to the signal).
The function ignores the index of the provided Frame (typed as Any) and creates groups defined by the np.int_ values of the first column, “permno”. A dictionary keyed by “permno” is returned, where each value is a Series of np.float64 values for that “permno”; the index is an IndexYearMonth created from the np.str_ “yyyymm” column. (StaticFrame uses NumPy datetime64 values to define unit-typed indices: IndexYearMonth stores datetime64[M] labels.)
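The grouping described above can be sketched with plain Python containers. This is only an analogue under stated assumptions: the rows are invented sample data, group_by_permno is a hypothetical helper, and plain dictionaries stand in for Series (so the “yyyymm” strings are kept as-is rather than converted to datetime64 labels).

```python
from collections import defaultdict

# Invented rows shaped like an OSAP signal table: (permno, yyyymm, signal).
rows = [
    (10001, "192004", 0.3),
    (10001, "192005", -0.4),
    (10002, "192004", 0.1),
]


def group_by_permno(records):
    """Return {permno: {yyyymm: signal}}, a plain-dict analogue of dict[int, Series]."""
    grouped = defaultdict(dict)
    for permno, yyyymm, signal in records:
        grouped[permno][yyyymm] = signal
    return dict(grouped)


result = group_by_permno(rows)
print(result[10001])
```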
Instead of returning a dict, the function below returns a Series with a hierarchical index. The IndexHierarchy generic specifies a component Index for each depth level; here, the outer depth is an Index[np.int_] (derived from the “permno” column), and the inner depth is an IndexYearMonth (derived from the “yyyymm” column).
from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy

def process(f: Frame[
    Any,
    Index[np.str_],
    np.int_,
    np.str_,
    np.float64,
    ]) -> Series[ # type of the container
        IndexHierarchy[ # type of the index labels
            Index[np.int_], # type of index depth 0
            IndexYearMonth], # type of index depth 1
        np.float64, # type of the values
    ]: ...
Rich type hints provide a self-documenting interface that makes functionality explicit. Even better, these type hints can be used for static analysis with Pyright (now) and Mypy (pending full TypeVarTuple support). For example, calling this function with a Frame of two columns of np.float64 will fail a static analysis type check or deliver a warning in an editor.
Runtime type validation
Static type checking may not be sufficient: runtime evaluation provides even stronger constraints, particularly for dynamic values or with incomplete (or incorrect) type hints.
Building on a new runtime type checker called TypeClinic, StaticFrame 2 introduces @CallGuard.check, a decorator for runtime validation of type-hinted interfaces. All StaticFrame and NumPy generics are supported, and most built-in Python types are supported, even when deeply nested. The function below adds the @CallGuard.check decorator.
from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard

@CallGuard.check
def process(f: Frame[
    Any,
    Index[np.str_],
    np.int_,
    np.str_,
    np.float64,
    ]) -> Series[
        IndexHierarchy[Index[np.int_], IndexYearMonth],
        np.float64,
    ]: ...
Now decorated with @CallGuard.check, if the function above is called with an unlabelled Frame of two columns of np.float64, a ClinicError exception will be raised, illustrating that, where three columns were expected, two were provided, and where string column labels were expected, integer labels were provided. (To issue warnings instead of raising exceptions, use the @CallGuard.warn decorator.)
ClinicError:
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Expected Frame has 3 dtype, provided Frame has 2 dtype
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Index[str_]
        └── Expected str_, provided int64 invalid
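The general idea of a decorator that validates arguments against type hints at call time can be sketched with the standard library alone. The check_simple decorator below is a hypothetical, heavily simplified stand-in: it only handles hints that are plain classes, and it is not how CallGuard or TypeClinic are implemented.

```python
import functools
import inspect
from typing import get_type_hints


def check_simple(func):
    """Hypothetical decorator: isinstance-check arguments whose hints are plain classes."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # This sketch only handles plain classes such as int or float.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name}: expected {expected.__name__}, "
                    f"provided {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_simple
def scale(value: float, factor: int) -> float:
    return value * factor


print(scale(1.5, 2))  # passes validation
try:
    scale(1.5, "2")  # the second argument violates its hint
except TypeError as error:
    print(error)
```

A full runtime checker must additionally recurse into generic aliases and collect every failure, as the nested ClinicError report above suggests.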
Data validation at runtime
Other characteristics can be validated at runtime: for example, the shape or name attributes, or the sequence of labels on the index or columns. The StaticFrame Require class provides a family of configurable validators.
Require.Name: Validate the “name” attribute of the container.
Require.Len: Validate the length of the container.
Require.Shape: Validate the “shape” attribute of the container.
Require.LabelsOrder: Validate the ordering of the labels.
Require.LabelsMatch: Validate inclusion of labels, independent of order.
Require.Apply: Apply a Boolean-returning function to the container.
Aligning with a growing trend, these objects are provided within type hints as one or more additional arguments to an Annotated generic. (See PEP 593 for details.) The type referenced by the first Annotated argument is the target of the validators given in subsequent arguments. For example, if an Index[np.str_] type hint is replaced with an Annotated[Index[np.str_], Require.Len(20)] type hint, the runtime length validation is applied to the index associated with the first argument.
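The mechanism of carrying validators inside Annotated can be sketched with the standard library. Everything named here is hypothetical: MinLen is an illustrative validator (not Require.Len), and validate_args is a toy driver (not CallGuard) that simply runs any callable metadata against the matching argument.

```python
from typing import Annotated, get_type_hints


class MinLen:
    """Hypothetical validator; illustrates the idea, not StaticFrame's Require.Len."""

    def __init__(self, minimum: int) -> None:
        self.minimum = minimum

    def __call__(self, value) -> bool:
        return len(value) >= self.minimum


def validate_args(func, **kwargs) -> list[str]:
    """Run every callable found in Annotated metadata against its argument."""
    failures = []
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        hint = hints.get(name)
        # Annotated[T, m1, m2, ...] exposes its extra arguments as __metadata__.
        for validator in getattr(hint, "__metadata__", ()):
            if callable(validator) and not validator(value):
                failures.append(f"{name} failed {type(validator).__name__}")
    return failures


def process(labels: Annotated[list[str], MinLen(3)]) -> None: ...


print(validate_args(process, labels=["permno", "yyyymm"]))
print(validate_args(process, labels=["permno", "yyyymm", "Mom12m"]))
```

Static analyzers treat Annotated[T, ...] as T, so the metadata adds runtime checks without changing what the type checker sees.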
Extending the example of processing an OSAP signal table, we might validate our expectation of column labels. The Require.LabelsOrder validator can define a sequence of labels, optionally using ... for contiguous regions of zero or more unspecified labels. To specify that the first two columns of the table are labeled “permno” and “yyyymm”, while the third label is variable (depending on the signal), the following Require.LabelsOrder can be defined within an Annotated generic:
from typing import Any, Annotated
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard, Require

@CallGuard.check
def process(f: Frame[
    Any,
    Annotated[
        Index[np.str_],
        Require.LabelsOrder('permno', 'yyyymm', ...),
    ],
    np.int_,
    np.str_,
    np.float64,
    ]) -> Series[
        IndexHierarchy[Index[np.int_], IndexYearMonth],
        np.float64,
    ]: ...
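The ordering rule, with ... standing for zero or more unspecified labels, can be sketched as a small recursive matcher. The function labels_in_order is a hypothetical illustration of that rule and makes no claim about how Require.LabelsOrder is implemented.

```python
def labels_in_order(labels, pattern):
    """Return True if labels match pattern, where ... spans zero or more labels."""
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if head is ...:
        # Try letting the ellipsis absorb 0, 1, 2, ... leading labels.
        return any(
            labels_in_order(labels[i:], rest) for i in range(len(labels) + 1)
        )
    return bool(labels) and labels[0] == head and labels_in_order(labels[1:], rest)


pattern = ("permno", "yyyymm", ...)
print(labels_in_order(("permno", "yyyymm", "Mom12m"), pattern))  # True
print(labels_in_order(("yyyymm", "permno", "Mom12m"), pattern))  # False
print(labels_in_order(("permno", "yyyymm"), pattern))            # True
```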
If the interface expects a small collection of OSAP signal tables, we can validate the third column with the Require.LabelsMatch validator. This validator can specify required labels, sets of labels (from which at least one must match), and regular expression patterns. If tables from only three files are expected (i.e., “Mom12m.csv”, “Mom6m.csv”, and “LRreversal.csv”), we can validate the labels of the third column by defining Require.LabelsMatch with a set:
@CallGuard.check
def process(f: Frame[
    Any,
    Annotated[
        Index[np.str_],
        Require.LabelsOrder('permno', 'yyyymm', ...),
        Require.LabelsMatch({'Mom12m', 'Mom6m', 'LRreversal'}),
    ],
    np.int_,
    np.str_,
    np.float64,
    ]) -> Series[
        IndexHierarchy[Index[np.int_], IndexYearMonth],
        np.float64,
    ]: ...
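The three kinds of specifier described above (a required label, a set of alternatives, and a regular expression) can be sketched as an order-independent check. The helper labels_match is hypothetical and only mirrors the described behavior; it is not the Require.LabelsMatch implementation.

```python
import re


def labels_match(labels, *specifiers):
    """Return the specifiers that no label satisfies, regardless of label order."""
    unmatched = []
    for spec in specifiers:
        if isinstance(spec, (set, frozenset)):
            # A set is satisfied when at least one of its members is present.
            found = any(label in spec for label in labels)
        elif isinstance(spec, re.Pattern):
            found = any(spec.search(label) for label in labels)
        else:
            found = spec in labels
        if not found:
            unmatched.append(spec)
    return unmatched


signals = {"Mom12m", "Mom6m", "LRreversal"}
ok = labels_match(("permno", "yyyymm", "Mom12m"), "permno", signals, re.compile(r"^yyyy"))
bad = labels_match(("permno", "yyyymm", "Beta"), signals)

print(ok)        # every specifier found a label
print(len(bad))  # the set of signal names found no label
```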
Both Require.LabelsOrder and Require.LabelsMatch can associate functions with label specifiers to validate data values. If the validator is applied to column labels, a Series of column values will be provided to the function; if the validator is applied to index labels, a Series of row values will be provided to the function.
Similar to the usage of Annotated, the label specifier is replaced with a list, where the first item is the label specifier and the remaining items are row- or column-processing functions that return a Boolean.
To expand on the previous example, we could validate that all “permno” values are greater than zero and that all signal values (“Mom12m”, “Mom6m”, “LRreversal”) are greater than or equal to -1.
from typing import Any, Annotated
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard, Require

@CallGuard.check
def process(f: Frame[
    Any,
    Annotated[
        Index[np.str_],
        Require.LabelsOrder(
            ['permno', lambda s: (s > 0).all()],
            'yyyymm',
            ...,
        ),
        Require.LabelsMatch(
            [{'Mom12m', 'Mom6m', 'LRreversal'}, lambda s: (s >= -1).all()],
        ),
    ],
    np.int_,
    np.str_,
    np.float64,
    ]) -> Series[
        IndexHierarchy[Index[np.int_], IndexYearMonth],
        np.float64,
    ]: ...
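The pairing of a label with value-checking functions can be sketched with plain Python. Here a dictionary of lists stands in for a table, and failing_checks, all_positive, and at_least_minus_one are hypothetical names invented for the illustration; the data values are made up.

```python
def all_positive(values):
    return all(v > 0 for v in values)


def at_least_minus_one(values):
    return all(v >= -1 for v in values)


def failing_checks(columns, specifier):
    """Apply the functions in a [label, func, ...] specifier to that column's values."""
    label, *checks = specifier
    values = columns[label]
    return [check.__name__ for check in checks if not check(values)]


# Invented data: the second signal value violates the ">= -1" expectation.
table = {"permno": [10001, 10001], "Mom12m": [0.3, -1.4]}

print(failing_checks(table, ["permno", all_positive]))
print(failing_checks(table, ["Mom12m", at_least_minus_one]))
```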
If a validation fails, @CallGuard.check will raise an exception. For example, if the function above is called with a Frame that has an unexpected third-column label, the following exception will be raised:
ClinicError:
In args of (f: Frame[Any, Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])], int64, str_, float64]
    └── Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])]
        └── LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])
            └── Expected label to match frozenset({'Mom12m', 'LRreversal', 'Mom6m'}), no provided match
The expressive power of TypeVarTuple
As shown above, TypeVarTuple permits specifying a Frame with zero or more heterogeneous columnar types. For example, we can provide type hints for a Frame of two float types or six mixed types:
>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf
>>> f1: sf.Frame[Any, Any, np.float64, np.float64]
>>> f2: sf.Frame[Any, Any, np.bool_, np.float64, np.int8, np.int8, np.str_, np.datetime64]
While this accommodates diverse DataFrames, type-hinting wide DataFrames, such as those with hundreds of columns, would be unwieldy. Python 3.11 introduces a new syntax to provide a variable range of types in TypeVarTuple generics: star expressions of tuple generic aliases. For example, to type-hint a Frame with a date index, string column labels, and any configuration of columnar types, we can star-unpack a tuple of zero or more Any:
>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf
>>> f: sf.Frame[sf.Index[np.datetime64], sf.Index[np.str_], *tuple[Any, ...]]
The tuple star expression can go anywhere in a list of types, but there can be only one. For example, the type hint below defines a Frame that must start with Boolean and string columns but has a flexible specification for any number of subsequent np.float64 columns.
>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf
>>> f: sf.Frame[Any, Any, np.bool_, np.str_, *tuple[np.float64, ...]]
Utilities for type hints
Working with such detailed type hints can be challenging. To aid users, StaticFrame provides convenient utilities for runtime type hinting and checking. All StaticFrame 2 containers now feature a via_type_clinic interface, permitting access to TypeClinic functionality.
First, utilities are provided to translate a container, such as a complete Frame, into a type hint. The string representation of the via_type_clinic interface provides a string representation of the container’s type hint; alternatively, the to_hint() method returns a complete generic alias object.
>>> import static_frame as sf
>>> f = sf.Frame.from_records(((3, '192004', 0.3), (3, '192005', -0.4)), columns=('permno', 'yyyymm', 'Mom3m'))
>>> f.via_type_clinic
Frame[Index[int64], Index[str_], int64, str_, float64]
>>> f.via_type_clinic.to_hint()
static_frame.core.frame.Frame[static_frame.core.index.Index[numpy.int64], static_frame.core.index.Index[numpy.str_], numpy.int64, numpy.str_, numpy.float64]
Second, utilities are provided for runtime testing of type hints. The via_type_clinic.check() function permits validating the container against a provided type hint.
>>> import typing as tp
>>> import numpy as np
>>> f.via_type_clinic.check(sf.Frame[sf.Index[np.str_], sf.TIndexAny, *tuple[tp.Any, ...]])
ClinicError:
In Frame[Index[str_], Index[Any], Unpack[Tuple[Any, ...]]]
└── Index[str_]
    └── Expected str_, provided int64 invalid
To support gradual typing, StaticFrame defines several generic aliases configured with Any for every component type. For example, TFrameAny can be used for any Frame, and TSeriesAny for any Series. As expected, TFrameAny will validate the Frame created above.
>>> f.via_type_clinic.check(sf.TFrameAny)
Conclusion
Better type hints for DataFrames are long overdue. With modern Python typing tools and a DataFrame built on an immutable data model, StaticFrame 2 meets this need, providing powerful resources for engineers who prioritize maintainability and verifiability.