Understanding pickle in Python

2021-08-14 | Tech

The pickle module that ships with Python can be used for general-purpose object serialization and de-serialization. It has been widely adopted or recommended as a backend in scenarios such as state persistence or IPC.

Although many well-known frameworks employ it, the magic behind it still seems vague to everyday users, especially those new to the language. People come across “unpicklable” errors from time to time without knowing the reason, or re-invent state persistence themselves even where pickle would be up to the job. People sometimes write error-prone code merely because they are afraid of pickle or unaware of it.

This post thus attempts to clarify the usage of the pickle module in an easy-to-understand way, by answering three questions.

What kind of object is picklable?

Before stepping in, let’s take a quick look at the design of the pickle module.

pickle separates the logic of serialization / de-serialization into two parts, using a kind of intermediate representation (IR). During serialization, an object is first transformed into IR, and then dumped into a byte stream by pickle. De-serialization works in reverse. Most of the time, we users need only care about the first part, that is, the conversion between Python objects and IR.

To ease the burden of using pickle, the module sets up a few principles that automatically extend the concept “picklable” to most user-defined objects. Users then need not deal with IR directly. The principles can be roughly summarized as follows:

1. Most built-in Python types and their instances are picklable.
2. Containers like lists, dicts or sets with only picklable elements are picklable.
3. Top-level classes or functions of a module are picklable (if configured properly).
4. Objects with picklable __dict__ and picklable type are picklable.
5. Objects holding external resources (open(), socket.socket(), etc.) are usually NOT picklable.

Principles 1 and 2 ensure that we can easily pickle built-in types ¹, values ² and their compositions ³, which covers a large variety of everyday objects.

If you want to pickle custom objects, Principles 3 and 4 come into play. If you define your class at the top level of a module (and usually you should), the class is automatically picklable. An object instantiated from that class, if it holds only picklable attributes (stored in __dict__), is also picklable.

How convenient! Nothing more is needed than a simple pickle.dump or pickle.load to enjoy what pickle offers.
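A minimal sketch of Principles 1 to 4 might look like the following. The names Point and manhattan are made up for illustration, and the snippet assumes it is run as a script, so that both live at the top level of __main__.

```python
import pickle


class Point:
    """Hypothetical top-level class (Principle 3) with plain attributes (Principle 4)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def manhattan(p):
    """Hypothetical top-level function; pickled by reference to its module and name."""
    return abs(p.x) + abs(p.y)


# Built-in types and values (Principle 1) nested in containers (Principle 2),
# next to a user-defined class, function and instance.
payload = {
    "type": int,
    "values": [42, "foobar"],
    "cls": Point,
    "func": manhattan,
    "obj": Point(1, -2),
}

restored = pickle.loads(pickle.dumps(payload))

# Classes and functions come back as the very same objects ...
print(restored["cls"] is Point, restored["func"] is manhattan)
# ... while the instance is a fresh copy carrying the same attributes.
print(restored["obj"] is payload["obj"], vars(restored["obj"]))
```

Note that the class and the function are stored by name only; the code itself is not part of the byte stream, so the loading side must be able to import the same definitions.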

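Principle 3 comes with the remark “if configured properly”. Technically, module-level values are picklable if they have the correct __qualname__. Such a value is pickled by reference: pickle stores its __module__ and __qualname__, and on loading looks that name up in the module, so __qualname__ must match the name of the variable that holds the value. This holds automatically if the value is a class or function: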

class C: pass
C.__qualname__  # 'C'
def f(): pass
f.__qualname__  # 'f'

which enables pickling for user-defined classes or functions.

But this does not apply to lambdas. A lambda by default has __qualname__ == "<lambda>", which does not match the name of the variable that holds it. pickle thus cannot handle it normally:

>>> import pickle
>>> f = lambda: 1
>>> pickle.dumps(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7f7ed81ad3a0>: attribute lookup <lambda> on __main__ failed
>>> f.__qualname__
'<lambda>'

To make it work, one could assign __qualname__ manually:

>>> f.__qualname__ = 'f'
>>> pickle.dumps(f)
b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x01f\x94\x93\x94.'
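The same lookup rule explains why a function defined inside another function cannot be pickled: no module-level name leads back to it. A small sketch, where make_adder is a hypothetical helper; the exact exception class seems to differ between Python versions, hence both candidates are caught:

```python
import pickle


def make_adder(n):
    # Hypothetical factory returning a nested function.
    def add(x):
        return x + n

    return add


add_one = make_adder(1)
# The qualified name contains '<locals>', which cannot be resolved from the module.
print(add_one.__qualname__)

try:
    pickle.dumps(add_one)
except (pickle.PicklingError, AttributeError) as exc:
    failed = True
    print("not picklable:", exc)
else:
    failed = False
```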

Principle 5 might look weird at first sight, but it is indeed reasonable.

Suppose a socket object could be dumped, and we loaded it at some later point. Many things could make the de-serialization fail: the peer host malfunctions, the link layer breaks, or the system runs out of resources. Even if it succeeds, the loaded object is not identical to the original one, since it refers to an external resource with a different internal state.
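The refusal is easy to observe with an open file, which is used here instead of a socket only to keep the sketch free of networking:

```python
import os
import pickle

# os.devnull is used so that the example needs no real file on disk.
with open(os.devnull, "rb") as fh:
    try:
        pickle.dumps(fh)
    except TypeError as exc:
        refused = True
        print("refused:", exc)
    else:
        refused = False
```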

pickle therefore refuses to handle such objects by default. If you wish, though, you can take control by customizing the serialization / de-serialization behavior. Here we come to

What if my object is not picklable?

From now on you need to learn something about the IR, as well as the object protocol of pickling.

Fundamentally, an object’s __reduce__() is called when it is being serialized. The method’s expected behavior is described in detail by the documentation. Unofficially, we can regard the return value as a kind of IR, which dictates either how to locate the object in a specific module, or what information is necessary to restore the object from scratch.

According to the docs, __reduce__() should return either a string or a tuple. We are going to look at the two cases in turn.

If a string is returned, it should be the name of a global variable, namely the object’s local name relative to its module. Remember the __qualname__ trick above? They do the same job. pickle treats the returned string as the __qualname__ of the object and stores it along with the module name __module__ at serialization. The de-serializer afterwards looks up an attribute with that name in the module and retrieves the object directly, without creating it from scratch.

This is useful for singleton objects, i.e., objects that are instantiated once and survive the whole program lifetime, for example, a database connection pool. We then need not struggle with how to record the internal states of these objects.
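The string form could be sketched as follows. ConnectionPool and POOL are hypothetical stand-ins for a real pool, and the snippet assumes it runs as a script so that POOL is a module-level name in __main__.

```python
import pickle


class ConnectionPool:
    """Hypothetical stand-in for an expensive, process-wide resource."""

    def __init__(self):
        self.connections = []

    def __reduce__(self):
        # Name of the module-level variable that holds the only instance.
        return "POOL"


POOL = ConnectionPool()

data = pickle.dumps({"pool": POOL})
restored = pickle.loads(data)

# The loader found the existing object by name instead of building a new one.
print(restored["pool"] is POOL)
```

Only the module name and the string "POOL" end up in the byte stream, so none of the pool's internal state has to be picklable.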

When a tuple is returned, well, things get a bit more complicated. The tuple should contain 2~6 elements ⁴. The meaning of the elements is described later; before that, let’s get some understanding of how an object is created.

In the pickle module’s model, an object is made up of a skeleton and states. A skeleton is an initial version of the object, returned by a callable called the “constructor”. Usually the constructor ends up calling the class’s __new__() method, but not necessarily; it could also be some factory function. States refer to all other attributes or elements that the object holds, which could be simple Python objects or external resources. At de-serialization, pickle first sets up a skeleton by calling the constructor, and then fills in the states. Either step can be customized to match your application logic.
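The two steps can be replayed by hand on a plain object. This is only a sketch of the default behavior for pickle protocol 2 and above, with a hypothetical Point class; a real unpickler does the same work from the byte stream.

```python
import copyreg
import pickle


class Point:
    """Hypothetical class relying on the default pickling behavior."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


original = Point(1, 2)

# Ask the object for its reduce value, as a pickler would.
constructor, args, state, *rest = original.__reduce_ex__(pickle.DEFAULT_PROTOCOL)

# Step 1: build the skeleton. By default the constructor is copyreg.__newobj__,
# which calls Point.__new__(Point) without running __init__.
skeleton = constructor(*args)
print(constructor is copyreg.__newobj__, vars(skeleton))

# Step 2: fill in the state, here by updating __dict__.
skeleton.__dict__.update(state)
print(vars(skeleton))
```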

Now we can talk about the layout of the returned tuple.

The 1st and 2nd elements describe how to set up a skeleton. The 1st element is the constructor callable, and the 2nd is a tuple of positional arguments for the constructor. Both elements are required. If no argument is needed, one should pass an empty tuple.

The remaining elements are optional and describe the states. pickle employs several strategies for restoring states. If the 6th element is provided, it should be a callable with signature (obj, state), which updates the state given the object and the 3rd element as arguments. If it is not provided, pickle looks up a method named __setstate__ on the object, which takes the state as its argument, and uses it as the state updater if found. Otherwise, pickle expects the 3rd element to be a dict, whose items are then added to the object’s __dict__. The 4th and 5th elements are specialized for list- or dict-like objects and are less used. If supplied, they should be an iterator of items and an iterator of key-value pairs, which populate the object through .append() / .extend() and obj[key] = value, respectively.
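A three-element tuple could look like this in practice. Sensor and make_sensor are hypothetical names; the sketch uses a factory function as the constructor and lets pickle merge the state dict into __dict__, since no __setstate__ is defined.

```python
import pickle


def make_sensor(name):
    # Hypothetical factory, used as the "constructor" (1st element).
    return Sensor(name)


class Sensor:
    def __init__(self, name):
        self.name = name
        self.readings = []
        self.cache = {}  # derived data, deliberately not persisted

    def __reduce__(self):
        # (constructor, positional args, state dict)
        return (make_sensor, (self.name,), {"readings": self.readings})


sensor = Sensor("t1")
sensor.readings.append(20.5)
sensor.cache["mean"] = 20.5

restored = pickle.loads(pickle.dumps(sensor))

# The skeleton came from make_sensor("t1"); the readings were restored from
# the state dict, while the cache starts out empty again.
print(restored.name, restored.readings, restored.cache)
```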

Directly implementing __reduce__() can be error-prone. pickle thus provides other object protocols to simplify the task. A list of the special methods can be found in the pickle documentation. Users can implement some of them to serve the same purpose, e.g., __getnewargs_ex__() or __getnewargs__() for the 2nd element, and __getstate__() for the 3rd element.
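As a sketch of the simpler route, the hypothetical Account class below holds a lock, which pickle cannot serialize. __getstate__() drops it from the state and __setstate__() recreates it on loading; the snippet assumes it runs as a script so that Account is a top-level class.

```python
import pickle
import threading


class Account:
    """Hypothetical class holding an unpicklable attribute."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the attributes and leave the lock out of the state.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


original = Account(10)
restored = pickle.loads(pickle.dumps(original))

with restored._lock:
    print(restored.balance)
```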

Now let’s get back to the title: what if my object is not picklable? The answer is to implement your own pickling / unpickling logic via __reduce__() or other special methods. The docs showcase a good example, where an object maintains a file which should be re-opened and re-sought at de-serialization.

So far we have discussed serializing our own custom types. Now what if the object to be pickled is out of our control?

What if the object is out of my control?

In some cases, one might need to alter the pickling behavior for a specific type, either because it is not supported or because the serialized byte stream is not efficient enough. The type is maintained by some library and is out of your control, so you cannot change its __reduce__() to fit your requirements. pickle introduces some other interfaces to tackle the problem from different angles.

Dispatch Tables

This is the recommended way to customize pickling without disturbing any external code. Besides looking up special methods on the object, pickle also consults the copyreg module for reducers. The function copyreg.pickle(type, reducer) registers the callable reducer as the reduction function for type type. reducer takes the object to be pickled as its only argument and returns IR just like a __reduce__() method, and it takes precedence over the original __reduce__() of the type. The documentation illustrates this with a simple example.
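For instance, memoryview is a built-in type that pickle does not support out of the box. A possible reducer, sketched under the assumption that copying the viewed bytes is acceptable (reduce_memoryview is a made-up name):

```python
import copyreg
import pickle


def reduce_memoryview(view):
    # Hypothetical reducer: rebuild the view from a bytes copy of its content.
    # Returns (constructor, args), just like __reduce__() would.
    return memoryview, (view.tobytes(),)


view = memoryview(b"pickle")

try:
    pickle.dumps(view)
except TypeError as exc:
    print("before registering:", exc)

# Register the reducer globally for the memoryview type.
copyreg.pickle(memoryview, reduce_memoryview)

restored = pickle.loads(pickle.dumps(view))
print(type(restored).__name__, restored.tobytes())
```

copyreg.pickle() changes a process-wide table. When that is too broad, a Pickler instance can be given its own dispatch_table attribute instead.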

Persistent ID

Sometimes you would like to persist external objects that can be uniquely identified by some IDs. Think about entries of a database table, where foreign keys act as those IDs. Here storing the IDs directly would be better and more efficient than turning the objects into IR. pickle has persistent IDs for this purpose.

Unlike dispatch tables, the resolution of persistent IDs is not defined by pickle itself. One should subclass the Pickler and Unpickler classes and override persistent_id() and persistent_load(), respectively. An example of pickling table entries with persistent IDs can be found in the pickle documentation.
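A toy version of the idea, with an in-memory dict standing in for the database table (User, REGISTRY and the two subclasses are hypothetical names):

```python
import io
import pickle


class User:
    def __init__(self, uid, name):
        self.uid = uid
        self.name = name


# Hypothetical "table": the objects live here, outside of the pickle.
REGISTRY = {1: User(1, "alice"), 2: User(2, "bob")}


class RegistryPickler(pickle.Pickler):
    def persistent_id(self, obj):
        if isinstance(obj, User):
            # Store only a tag and the key instead of the object itself.
            return ("User", obj.uid)
        return None  # everything else is pickled as usual


class RegistryUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        tag, uid = pid
        if tag == "User":
            return REGISTRY[uid]
        raise pickle.UnpicklingError(f"unsupported persistent id: {pid!r}")


buf = io.BytesIO()
RegistryPickler(buf).dump({"owner": REGISTRY[1], "members": [REGISTRY[1], REGISTRY[2]]})

buf.seek(0)
restored = RegistryUnpickler(buf).load()
print(restored["owner"] is REGISTRY[1], [u.name for u in restored["members"]])
```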

reducer_override()

Knowing only the type, users might still be unable to decide the pickling behavior for some objects. If you want to pickle user-defined classes (not their instances) via dispatch tables, you run into the problem that dispatch tables are keyed by type, and in Python all user-defined classes are by default instances of the same metaclass, type, so they cannot be told apart. Python 3.8 introduces the reducer_override() method on the Pickler class to handle custom pickling under arbitrary conditions. Technical details and examples can be found in the pickle documentation.
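A sketch in the spirit of the example in the docs: a class created at run time is pickled by value, i.e. as a recipe for calling type() again. make_config_class and ConfigClassPickler are hypothetical names, and Python 3.8 or later is assumed.

```python
import io
import pickle


def make_config_class(debug):
    # Hypothetical factory: the class is created at run time and never bound
    # to a module-level name "Config", so it cannot be pickled by reference.
    return type("Config", (), {"debug": debug})


class ConfigClassPickler(pickle.Pickler):
    def reducer_override(self, obj):
        if isinstance(obj, type) and obj.__name__ == "Config":
            # Pickle the class by value: record how to call type(...) again.
            return type, (obj.__name__, obj.__bases__, {"debug": obj.debug})
        # Everything else goes through the regular machinery.
        return NotImplemented


cfg_cls = make_config_class(debug=True)

buf = io.BytesIO()
ConfigClassPickler(buf).dump(cfg_cls)

restored = pickle.loads(buf.getvalue())
# A new class with the same name and attributes, not the original object.
print(restored.__name__, restored.debug, restored is cfg_cls)
```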

Personally, I find dispatch tables the most useful. You can always emulate persistent IDs using IR, and handling several different kinds of objects within the same reducer function is actually not that bad. Dispatch tables, in contrast, can enable customization globally, which is useful at times.

Conclusion

It has been a long way, but we made it!

We’ve seen some basic usage of pickle, the fundamental object protocols of pickling, as well as some other interfaces for customization. I hope this gives you a rough idea of the module and helps when you deal with pickling problems.

¹ int, str, etc.
² 42, "foobar", etc.
³ such as [int, {42: "foobar"}]
⁴ in Python with version lower than 3.8, this would be 2~5 elements

Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.

