C API design guidelines learned from lzma
I have recently been patching CPython to add support for new BCJ filters in the lzma module (see PR), and I found something interesting along the way: the lzma C library is a good guide for designing new C APIs.
Memory allocation
The lzma library does not call malloc() or free() directly to allocate or free memory. It uses wrapper functions instead.
// Usage
extern void *
lzma_alloc(size_t size, const lzma_allocator *allocator)
{
    if (size == 0)
        size = 1;
    void *ptr;
    if (allocator != NULL && allocator->alloc != NULL)
        ptr = allocator->alloc(allocator->opaque, 1, size);
    else
        ptr = malloc(size);
    return ptr;
}
// lzma_allocator
typedef struct {
    void *(*alloc)(void *opaque, size_t nmemb, size_t size);
    void (*free)(void *opaque, void *ptr);
    void *opaque;
} lzma_allocator;
We can see that the library calls lzma_alloc() to allocate memory. The lzma_alloc() function uses a custom allocator passed in by the user, if there is one, and falls back to the C standard library otherwise.
This is good for flexibility and portability: users may want to plug in their own allocators, and some embedded environments may not implement malloc() or free().
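To make the pattern concrete, here is a minimal sketch of how a library could expose the same kind of hook. All names prefixed with demo_ or counting_ are hypothetical and are not part of liblzma; only the shape of the struct is borrowed from lzma_allocator above. The example hooks count live allocations through the opaque pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook, mirroring the shape of lzma_allocator. */
typedef struct {
    void *(*alloc)(void *opaque, size_t nmemb, size_t size);
    void (*free)(void *opaque, void *ptr);
    void *opaque;
} demo_allocator;

/* Allocate through the user's hook if present, else fall back to malloc(). */
static void *
demo_alloc(size_t size, const demo_allocator *allocator)
{
    if (size == 0)
        size = 1;  /* some malloc() variants return NULL for size 0 */
    if (allocator != NULL && allocator->alloc != NULL)
        return allocator->alloc(allocator->opaque, 1, size);
    return malloc(size);
}

/* Free through the user's hook if present, else fall back to free(). */
static void
demo_free(void *ptr, const demo_allocator *allocator)
{
    if (allocator != NULL && allocator->free != NULL)
        allocator->free(allocator->opaque, ptr);
    else
        free(ptr);
}

/* Example user hook: allocate and bump the counter stored in opaque. */
static void *
counting_alloc(void *opaque, size_t nmemb, size_t size)
{
    void *ptr = calloc(nmemb, size);
    if (ptr != NULL)
        ++*(size_t *)opaque;
    return ptr;
}

/* Example user hook: release and decrement the counter stored in opaque. */
static void
counting_free(void *opaque, void *ptr)
{
    if (ptr != NULL) {
        assert(*(size_t *)opaque > 0);
        --*(size_t *)opaque;
        free(ptr);
    }
}
```

A caller would fill a demo_allocator with counting_alloc, counting_free and a pointer to a size_t counter, then pass it to demo_alloc() and demo_free(). Passing NULL instead selects the standard library, so simple users pay nothing for the extra flexibility.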
ABI compatibility
We should maintain ABI stability.
See example in CPython/Modules/_lzmamodule.c:
static void *
parse_filter_spec_lzma(_lzma_state *state, PyObject *spec)
{
    static char *optnames[] = {"id", "preset", "dict_size", "lc", "lp",
                               "pb", "mode", "nice_len", "mf", "depth", NULL};
    PyObject *id;
    PyObject *preset_obj;
    uint32_t preset = LZMA_PRESET_DEFAULT;
    lzma_options_lzma *options;
    /* First, fill in default values for all the options using a preset.
       Then, override the defaults with any values given by the caller. */
    if (PyMapping_GetOptionalItemString(spec, "preset", &preset_obj) < 0) {
        return NULL;
    }
    if (preset_obj != NULL) {
        int ok = uint32_converter(preset_obj, &preset);
        Py_DECREF(preset_obj);
        if (!ok) {
            return NULL;
        }
    }
    options = (lzma_options_lzma *)PyMem_Calloc(1, sizeof *options);
    if (options == NULL) {
        return PyErr_NoMemory();
    }
    if (lzma_lzma_preset(options, preset)) {
        PyMem_Free(options);
        PyErr_Format(state->error, "Invalid compression preset: %u", preset);
        return NULL;
    }
    if (!PyArg_ParseTupleAndKeywords(state->empty_tuple, spec,
                                     "|OOO&O&O&O&O&O&O&O&", optnames,
                                     &id, &preset_obj,
                                     uint32_converter, &options->dict_size,
                                     uint32_converter, &options->lc,
                                     uint32_converter, &options->lp,
                                     uint32_converter, &options->pb,
                                     lzma_mode_converter, &options->mode,
                                     uint32_converter, &options->nice_len,
                                     lzma_mf_converter, &options->mf,
                                     uint32_converter, &options->depth)) {
        PyErr_SetString(PyExc_ValueError,
                        "Invalid filter specifier for LZMA filter");
        PyMem_Free(options);
        return NULL;
    }
    return options;
}
The allocation options = (lzma_options_lzma *)PyMem_Calloc(1, sizeof *options); looks dangerous: what happens if CPython is built against the headers of lzma version A but runs with the DLL of version B, and the lzma_options_lzma structure changed between the two? If the structure grew, the library could read or write past the end of the buffer. However, this is safe here thanks to lzma maintaining ABI stability:
typedef struct {
    uint32_t dict_size;
    const uint8_t *preset_dict;
    // CW note: some members omitted
    lzma_mode mode;
    uint32_t nice_len;
    lzma_match_finder mf;
    /*
     * Reserved space to allow possible future extensions without
     * breaking the ABI. You should not touch these, because the names
     * of these variables may change. These are and will never be used
     * with the currently supported options, so it is safe to leave these
     * uninitialized.
     */
    /** \private Reserved member. */
    uint32_t reserved_int4;
    // CW note: some members omitted
    /** \private Reserved member. */
    uint32_t reserved_int8;
    /** \private Reserved member. */
    lzma_reserved_enum reserved_enum1;
    // CW note: some members omitted
    /** \private Reserved member. */
    lzma_reserved_enum reserved_enum4;
    /** \private Reserved member. */
    void *reserved_ptr1;
    /** \private Reserved member. */
    void *reserved_ptr2;
} lzma_options_lzma;
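We can see that the structure reserves some room for future extensions, which is great. When a new option is added, it takes over one of the reserved members instead of being appended as a new member, so the size of the structure and the offsets of the existing members stay the same.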
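The reserved-member technique can be sketched with two versions of a small options struct. The struct and function names below (demo_options_v1, demo_options_v2, caller_v1_make_options, library_v2_read_flags) and the ext_flags member are made up for illustration and do not reflect the real layout of lzma_options_lzma. The point is only that renaming a reserved slot keeps the size and member offsets unchanged, so a buffer allocated by a caller built against the old header is still large enough for the new library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* "Version A" of a hypothetical options struct, with reserved slots. */
typedef struct {
    uint32_t dict_size;
    uint32_t nice_len;
    uint32_t reserved_int1;
    uint32_t reserved_int2;
    void *reserved_ptr1;
} demo_options_v1;

/* "Version B": a new option takes over the slot of reserved_int1. */
typedef struct {
    uint32_t dict_size;
    uint32_t nice_len;
    uint32_t ext_flags;      /* was reserved_int1 */
    uint32_t reserved_int2;
    void *reserved_ptr1;
} demo_options_v2;

/* Size and offsets are unchanged between the two versions. */
_Static_assert(sizeof(demo_options_v1) == sizeof(demo_options_v2),
               "struct size must not change");
_Static_assert(offsetof(demo_options_v1, reserved_int2)
                   == offsetof(demo_options_v2, reserved_int2),
               "member offsets must not change");

/* A caller compiled against the version A header: zeroed allocation. */
static demo_options_v1 *
caller_v1_make_options(uint32_t dict_size)
{
    demo_options_v1 *opts = calloc(1, sizeof *opts);
    if (opts != NULL)
        opts->dict_size = dict_size;
    return opts;
}

/* A version B library reading a buffer that a version A caller made. */
static uint32_t
library_v2_read_flags(const void *options)
{
    const demo_options_v2 *opts = options;
    return opts->ext_flags;
}
```

Because the old caller zero-fills the buffer, the new library sees ext_flags as 0 here, which a real design would treat as "feature not requested". Had the new member been appended instead, sizeof would have grown and the read would have gone past the end of the old caller's buffer.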
Another question: why does the Python module construct the lzma options manually here when the C library already provides a constructor?
/// \brief Parser for LZMA options
///
/// \return Pointer to allocated options structure.
/// Doesn't return on error.
extern lzma_options_lzma *options_lzma(const char *str);
Using options_lzma() would save the caller from allocating the options object, but the caller has to provide a string. Python provides keyword arguments here, so some parsing is needed either way.
Instead of serializing the keyword arguments to a string and having options_lzma() parse it back, filling in a lzma_options_lzma object directly is preferred.
However, I did find a minor defect: the lzma_options_lzma structure contains enums, which are not fixed-size. C does not guarantee that an enum has the same size as int. I would rather use uint32_t, even though type information is lost.
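As a rough sketch of that alternative, the struct below stores the value in a uint32_t field and keeps the enum only as a source of named constants. Every name here (demo_mode, demo_options, demo_mode_is_valid, demo_options_set_mode) is hypothetical and this is not how liblzma declares its options. Since the compiler can no longer check the type, the sketch validates the value at the API boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode values; the enum only names the constants. */
enum demo_mode {
    DEMO_MODE_FAST = 1,
    DEMO_MODE_NORMAL = 2
};

/* The value lives in a fixed-width field, so the struct layout does
   not depend on how the compiler sizes enum demo_mode. */
typedef struct {
    uint32_t dict_size;
    uint32_t mode;   /* holds an enum demo_mode value */
} demo_options;

/* Type information is lost, so check the value explicitly. */
static bool
demo_mode_is_valid(uint32_t mode)
{
    return mode == DEMO_MODE_FAST || mode == DEMO_MODE_NORMAL;
}

/* Store the mode only if it is a known value; report success. */
static bool
demo_options_set_mode(demo_options *opts, uint32_t mode)
{
    assert(opts != NULL);
    if (!demo_mode_is_valid(mode))
        return false;
    opts->mode = mode;
    return true;
}
```

The trade-off is the one mentioned above: the field width is now the same on every compiler, but a caller can pass any integer, so the library has to reject unknown values itself.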
Conclusion
- Allow user provided allocator
- Keep ABI stable