Mar 09, 2024 - 2 min read

C API design guidelines learned from lzma

I have been patching CPython recently to add support for new BCJ filters in the lzma module (see PR), and I found something interesting along the way: the lzma C library is a good guide for designing new C APIs.

Memory allocation

The lzma library does not call malloc() or free() directly. It goes through wrapper functions instead.

// lzma_alloc() wrapper
extern void *
lzma_alloc(size_t size, const lzma_allocator *allocator)
{
    if (size == 0)
        size = 1;

    void *ptr;

    if (allocator != NULL && allocator->alloc != NULL)
        ptr = allocator->alloc(allocator->opaque, 1, size);
    else
        ptr = malloc(size);

    return ptr;
}

// lzma_allocator
typedef struct {
    void *(*alloc)(void *opaque, size_t nmemb, size_t size);
    void (*free)(void *opaque, void *ptr);
    void *opaque;
} lzma_allocator;

We can see that the library calls lzma_alloc() to allocate memory. lzma_alloc() uses the custom allocator passed in by the user, if any, and otherwise falls back to the C standard library. This is good for flexibility and portability: a user may want to plug in a custom allocator, and some embedded environments may not implement malloc() or free().
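To make the pattern concrete, here is a minimal sketch of both sides of such an API. It does not link against liblzma: demo_allocator, demo_alloc(), demo_free(), alloc_stats and the counting_* hooks are hypothetical names that only mirror the shape of lzma_allocator shown above. The caller-side hooks count live allocations through the opaque pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mirror of lzma_allocator, so the sketch builds without liblzma. */
typedef struct {
    void *(*alloc)(void *opaque, size_t nmemb, size_t size);
    void (*free)(void *opaque, void *ptr);
    void *opaque;
} demo_allocator;

/* Library side: use the caller's hooks if present, otherwise fall back to libc. */
static void *demo_alloc(size_t size, const demo_allocator *a)
{
    if (size == 0)
        size = 1;  /* never request a zero-sized block */
    if (a != NULL && a->alloc != NULL)
        return a->alloc(a->opaque, 1, size);
    return malloc(size);
}

static void demo_free(void *ptr, const demo_allocator *a)
{
    if (a != NULL && a->free != NULL)
        a->free(a->opaque, ptr);
    else
        free(ptr);
}

/* Caller side: per-allocator state reached through the opaque pointer. */
typedef struct {
    size_t live;   /* blocks allocated and not yet freed */
    size_t total;  /* blocks ever allocated */
} alloc_stats;

static void *counting_alloc(void *opaque, size_t nmemb, size_t size)
{
    alloc_stats *s = opaque;
    void *p = calloc(nmemb, size);
    if (p != NULL) {
        s->live++;
        s->total++;
    }
    return p;
}

static void counting_free(void *opaque, void *ptr)
{
    alloc_stats *s = opaque;
    if (ptr != NULL) {
        assert(s->live > 0);
        s->live--;
    }
    free(ptr);
}

/* Allocates and frees one block through the hooks; returns the live count. */
static size_t demo_roundtrip(void)
{
    alloc_stats stats = {0, 0};
    demo_allocator a = { counting_alloc, counting_free, &stats };
    void *p = demo_alloc(64, &a);
    demo_free(p, &a);
    return stats.live;
}
```

Passing NULL as the allocator takes the malloc()/free() path, so callers who do not care about allocation pay nothing for the extra parameter.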

ABI compatibility

A library should keep its ABI stable, so that a program compiled against one version keeps working with another.

See example in CPython/Modules/_lzmamodule.c:

static void *
parse_filter_spec_lzma(_lzma_state *state, PyObject *spec)
{
    static char *optnames[] = {"id", "preset", "dict_size", "lc", "lp",
                               "pb", "mode", "nice_len", "mf", "depth", NULL};
    PyObject *id;
    PyObject *preset_obj;
    uint32_t preset = LZMA_PRESET_DEFAULT;
    lzma_options_lzma *options;

    /* First, fill in default values for all the options using a preset.
       Then, override the defaults with any values given by the caller. */

    if (PyMapping_GetOptionalItemString(spec, "preset", &preset_obj) < 0) {
        return NULL;
    }
    if (preset_obj != NULL) {
        int ok = uint32_converter(preset_obj, &preset);
        Py_DECREF(preset_obj);
        if (!ok) {
            return NULL;
        }
    }

    options = (lzma_options_lzma *)PyMem_Calloc(1, sizeof *options);
    if (options == NULL) {
        return PyErr_NoMemory();
    }

    if (lzma_lzma_preset(options, preset)) {
        PyMem_Free(options);
        PyErr_Format(state->error, "Invalid compression preset: %u", preset);
        return NULL;
    }

    if (!PyArg_ParseTupleAndKeywords(state->empty_tuple, spec,
                                     "|OOO&O&O&O&O&O&O&O&", optnames,
                                     &id, &preset_obj,
                                     uint32_converter, &options->dict_size,
                                     uint32_converter, &options->lc,
                                     uint32_converter, &options->lp,
                                     uint32_converter, &options->pb,
                                     lzma_mode_converter, &options->mode,
                                     uint32_converter, &options->nice_len,
                                     lzma_mf_converter, &options->mf,
                                     uint32_converter, &options->depth)) {
        PyErr_SetString(PyExc_ValueError,
                        "Invalid filter specifier for LZMA filter");
        PyMem_Free(options);
        return NULL;
    }

    return options;
}

The allocation in options = (lzma_options_lzma *)PyMem_Calloc(1, sizeof *options); looks dangerous: what would happen if CPython were built against the headers of lzma version A but ran with the shared library of version B, and the lzma_options_lzma structure had changed in between? If the structure had grown, the library could read or write past the end of the buffer. However, this is safe here because lzma keeps its ABI stable:

typedef struct {
    uint32_t dict_size;
    const uint8_t *preset_dict;
    // CW note: some members omitted
    lzma_mode mode;

    uint32_t nice_len;

    lzma_match_finder mf;

    /*
     * Reserved space to allow possible future extensions without
     * breaking the ABI. You should not touch these, because the names
     * of these variables may change. These are and will never be used
     * with the currently supported options, so it is safe to leave these
     * uninitialized.
     */

    /** \private     Reserved member. */
    uint32_t reserved_int4;
    // CW note: some members omitted
    /** \private     Reserved member. */
    uint32_t reserved_int8;

    /** \private     Reserved member. */
    lzma_reserved_enum reserved_enum1;
    // CW note: some members omitted
    /** \private     Reserved member. */
    lzma_reserved_enum reserved_enum4;

    /** \private     Reserved member. */
    void *reserved_ptr1;

    /** \private     Reserved member. */
    void *reserved_ptr2;

} lzma_options_lzma;

We can see that the structure reserves some room for future extensions, which is great. When a new option is added, it takes over one of the reserved members instead of being appended as a new member, so the size and layout of the structure stay the same.
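The effect of reserved members can be checked at compile time. The sketch below uses two hypothetical structures, demo_options_a and demo_options_b, standing in for the same options type as seen by an older and a newer header; new_flags is an invented field, not a real lzma option. It assumes the caller zero-fills the structure, as the PyMem_Calloc() call above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Version A": what an older header declares (hypothetical). */
typedef struct {
    uint32_t dict_size;
    uint32_t nice_len;
    uint32_t reserved_int1;
    uint32_t reserved_int2;
    void *reserved_ptr1;
} demo_options_a;

/* "Version B": a new option takes over one reserved slot (hypothetical). */
typedef struct {
    uint32_t dict_size;
    uint32_t nice_len;
    uint32_t new_flags;      /* was reserved_int1 */
    uint32_t reserved_int2;
    void *reserved_ptr1;
} demo_options_b;

/* The ABI contract: same size, same offsets for every existing member. */
static_assert(sizeof(demo_options_a) == sizeof(demo_options_b),
              "struct size must not change");
static_assert(offsetof(demo_options_a, nice_len)
              == offsetof(demo_options_b, nice_len),
              "existing members must not move");
static_assert(offsetof(demo_options_a, reserved_int1)
              == offsetof(demo_options_b, new_flags),
              "new member must reuse a reserved slot");

/* "Version B" library code reading a buffer laid out by a "version A"
   caller. memcpy() stands in for the library boundary. */
static uint32_t demo_read_new_flags(const void *options)
{
    demo_options_b view;
    memcpy(&view, options, sizeof view);
    return view.new_flags;
}

/* An old caller that knows nothing about new_flags. */
static uint32_t demo_old_caller_new_library(void)
{
    demo_options_a opts;
    memset(&opts, 0, sizeof opts);   /* zero-fill, like PyMem_Calloc() */
    opts.dict_size = 1u << 20;
    return demo_read_new_flags(&opts);
}
```

Because the old caller zero-filled the whole structure, the newer code sees new_flags as 0. This only works if 0 is defined to mean "feature not requested", which is a design decision the library has to make when it claims the slot.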

Another question: why does the Python module build the LZMA options by hand when the xz source tree already has a parser for them?

/// \brief      Parser for LZMA options
///
/// \return     Pointer to allocated options structure.
///             Doesn't return on error.
extern lzma_options_lzma *options_lzma(const char *str);

Using options_lzma() would save the caller from allocating the options object, but the caller has to provide a string. Python passes the options as keyword arguments, so some parsing is needed either way. Instead of serializing the arguments to a string and having options_lzma() parse it again, it is simpler to fill in a lzma_options_lzma object directly.

However, I did find a minor defect: the lzma_options_lzma structure contains enums, which do not have a fixed size. C does not guarantee that an enum has the same size as int. I would rather use uint32_t, even though the type information is lost.
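A minimal sketch of that alternative, with hypothetical names (demo_mode, demo_options, demo_mode_is_valid()): the enum still names the values, but the structure stores them in a fixed-width field. Since the field type no longer restricts the value, the library validates it explicitly.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical mode constants; the enum only gives the values names. */
enum demo_mode {
    DEMO_MODE_FAST = 1,
    DEMO_MODE_NORMAL = 2
};

typedef struct {
    uint32_t dict_size;
    uint32_t mode;   /* holds an enum demo_mode value, always 32 bits wide */
} demo_options;

/* The width of the field no longer depends on how the compiler sizes enums. */
static_assert(sizeof(((demo_options *)0)->mode) * CHAR_BIT == 32,
              "mode must be exactly 32 bits");

/* Type information is lost, so check the value at the API boundary. */
static int demo_mode_is_valid(uint32_t mode)
{
    return mode == DEMO_MODE_FAST || mode == DEMO_MODE_NORMAL;
}
```

The trade-off is the one mentioned above: the compiler can no longer warn about a wrong enum type being assigned to the field, so the check moves to run time.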

Conclusion

Allow user-provided allocators
Keep the ABI stable