Awkward Arrays: Working With Nontabular Data at Scale

Much of the data used in analysis is held in memory as tables and arrays, which Python and its ecosystem handle well. However, a massive amount of data lives in nested, variable-length structures, such as JSON objects, and does not fit that model. Typically, a programmer has to handwrite slow, brittle Python code to analyze such data, or write glue code to convert and regularize it before use.
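
To make the problem concrete, here is a minimal sketch of the kind of handwritten loop described above. The records, the field names (`sensor`, `readings`), and the helper `total_per_record` are hypothetical and chosen only for illustration; the point is that every missing or empty field needs its own special case, and the work runs one record at a time in the Python interpreter.

```python
import json

# Hypothetical input: each record carries a variable number of readings,
# and one record omits the field entirely.
raw = """
[
  {"sensor": "a", "readings": [1, 2, 3]},
  {"sensor": "b", "readings": []},
  {"sensor": "c"},
  {"sensor": "d", "readings": [4, 5]}
]
"""
records = json.loads(raw)


def total_per_record(records):
    """Sum the readings of each record with an explicit Python loop."""
    totals = []
    for rec in records:
        # A missing key and an empty list both have to be handled by hand.
        readings = rec.get("readings") or []
        totals.append(sum(readings))
    return totals


totals = total_per_record(records)
print(totals)
```

Each new irregularity in the input (a null value, a deeper level of nesting) tends to add another branch to a loop like this, which is where the brittleness comes from.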

The new Awkward Array project provides a library for operating on nested, variable-length data structures with NumPy-like idioms. This webinar gives an overview of two Anaconda-developed projects that provide native support for Awkward Arrays in the broader Python data analysis ecosystem. Dask-awkward lets you scale up and distribute workflows on partitioned Awkward Arrays using the parallel processing library Dask, while awkward-pandas integrates Awkward Arrays into the extremely popular pandas data science library. Awkward-pandas makes it easy to use Awkward Arrays in semi-tabular workflows and enables massive acceleration in processing nontabular data. In this webinar, Martin Durant and Doug Davis will show how these projects plug into the existing analysis landscape and present compelling use cases.