Python Garbage Collection: Everything You Need to Know

Jan 17, 2025 - 16:42

I. Overview of Garbage Collection

  • In computer science, garbage collection (GC) is an automatic memory management mechanism. When memory occupied by a program can no longer be reached by that program, the garbage collector reclaims it so that it can be reused.
  • A garbage collector reduces the burden on programmers and helps prevent memory-related errors. The technique originated in LISP.
  • Today many languages, including Smalltalk, Java, C#, Go, and D, provide a garbage collector.
  • As the automatic memory management mechanism of modern programming languages, GC mainly performs two tasks:
    • Identify garbage, that is, objects in memory that are no longer needed.
    • Reclaim that garbage and release the memory for other objects to use.
  • It frees programmers from much of the work of resource management, letting them concentrate on business logic. Programmers still benefit from understanding GC, because it helps them write more robust code.

II. Common Garbage Collection Algorithms

  • Reference Counting:
    • Maintain a reference count for each object. The count is incremented when a new reference to the object is created and decremented when a reference is removed or destroyed. When the count reaches zero, the object is reclaimed.
    • Representative languages: Python, PHP, Swift.
    • Advantages: Objects are reclaimed as soon as they become unreferenced, rather than only when memory runs low or a threshold is reached.
    • Disadvantages: It cannot reclaim circular references on its own, and updating the count on every reference change incurs overhead.
  • Mark-Sweep:
    • Starting from the root objects, traverse all reachable objects and mark them, then reclaim the objects that were not marked.
    • Representative languages: Go (tri-color marking), Python (auxiliary).
    • Advantages: It handles circular references, which reference counting cannot.
    • Disadvantages: A basic implementation requires STW (stop-the-world) pauses, during which the program is temporarily suspended.
  • Generational Collection:
    • Divide the heap into generations according to object lifespan. Long-lived objects are placed in the old generation, and short-lived objects are placed in the young (new) generation. Different generations can use different collection algorithms and frequencies.
    • Representative languages: Java, Python (auxiliary).
    • Advantages: Good recycling performance.
    • Disadvantages: The algorithm is complex.
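
The reference counting scheme described above can be observed directly in CPython. The following is a minimal sketch, assuming CPython and using a hypothetical placeholder class Node that is not part of the original text. sys.getrefcount reports one more than you might expect, because the call itself holds a temporary reference, so the sketch compares counts rather than relying on absolute values.

```python
import sys


class Node:
    """Hypothetical placeholder class used only for this illustration."""


obj = Node()
base = sys.getrefcount(obj)         # includes the temporary reference held by the call

alias = obj                         # a second name now refers to the same object
after_alias = sys.getrefcount(obj)  # one higher than base

del alias                           # removing the name removes that reference
after_del = sys.getrefcount(obj)    # back to base

print(after_alias - base, after_del - base)
```

Other Python implementations may not use reference counting at all, so the exact numbers should be treated as a CPython detail.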

III. Python's Garbage Collection Mechanism

  • Description from Official Documentation:
    • The details of Python's memory management depend on the implementation.
    • CPython uses reference counting to detect inaccessible objects and employs another mechanism to collect reference cycles. It periodically executes a cycle detection algorithm to find inaccessible cycles and deletes the involved objects.
    • The gc module provides functions for performing garbage collection, obtaining debugging statistics, and tuning collector parameters.
    • Other implementations (such as Jython or PyPy) may rely on different mechanisms, such as a full-blown garbage collector. Python code that depends on the behavior of the reference counting implementation may therefore run into portability issues.
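
As a rough illustration of the gc module functions mentioned above, the sketch below checks whether the collector is enabled, forces a collection, and reads the per-generation statistics. It assumes CPython; the variable names are chosen for this example only.

```python
import gc

enabled = gc.isenabled()      # True unless the cycle collector has been disabled
unreachable = gc.collect()    # force a full collection; returns the number of unreachable objects found
stats = gc.get_stats()        # a list with one statistics dict per generation

for generation, info in enumerate(stats):
    # Each dict reports how many collections ran and how many objects were collected.
    print(generation, info["collections"], info["collected"], info["uncollectable"])
```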
  • Reference Counting:
    • The primary garbage collection mechanism in CPython is reference counting, which was first proposed by George E. Collins in 1960 and is still used by many programming languages today.
    • Principle: Each object maintains an ob_refcnt field that records how many references currently point to the object. When a new reference points to the object, ob_refcnt is incremented by one. When a reference goes away, ob_refcnt is decremented by one. When the count reaches zero, the object is reclaimed immediately and the memory it occupied is released.
    • Disadvantages: It requires additional space to maintain the reference count and cannot solve the "circular reference" of objects. For instance:
a = {}  # Object A (the first dict) has a reference count of 1
b = {}  # Object B (the second dict) has a reference count of 1
a['b'] = b  # B's count rises to 2: the name b plus the entry stored in A
b['a'] = a  # A's count rises to 2: the name a plus the entry stored in B
del a  # A's count drops to 1; only the entry in B still refers to it
del b  # B's count drops to 1; only the entry in A still refers to it

    • In the above example, objects A and B refer to each other, forming a circular reference. After the del statements run there are no external references left, yet neither reference count is zero, so reference counting alone never reclaims them. Without an additional mechanism, this would be a memory leak.
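
The following sketch shows the same kind of cycle being reclaimed by CPython's cycle collector. It assumes CPython and uses a hypothetical Node class instead of dicts, because plain dicts cannot be weakly referenced and a weak reference is a convenient way to observe whether an object still exists. Automatic collection is disabled for the duration so the timing is predictable.

```python
import gc
import weakref


class Node:
    """Hypothetical container class used only for this illustration."""

    def __init__(self):
        self.other = None


gc.disable()                 # keep automatic collections from interfering with the demo

a = Node()
b = Node()
a.other = b                  # a -> b
b.other = a                  # b -> a, completing the cycle
probe = weakref.ref(a)       # does not keep the object alive

del a
del b
alive_after_del = probe() is not None      # True: the cycle keeps both counts above zero

gc.collect()                 # run the cycle detector explicitly
alive_after_collect = probe() is not None  # False: the cycle has been reclaimed

gc.enable()
print(alive_after_del, alive_after_collect)
```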
  • Mark-Sweep:
    • It is a garbage collection algorithm based on tracing GC and consists of two phases:
      • Marking phase: Mark all "active objects".
      • Sweeping phase: Reclaim the "inactive objects" that were not marked.
    • Starting from root objects (such as global variables, the call stack, and registers), the collector traverses objects along directed edges, that is, along references. Reachable objects are marked as active; unreachable objects are treated as inactive and cleared.
    • As an auxiliary garbage collection technique in Python, mark-sweep mainly handles container objects (such as list, dict, tuple, and class instances), because string and numeric objects cannot hold references to other objects and therefore cannot take part in circular references.
    • Python uses a doubly linked list to organize these container objects.
    • Disadvantages: The entire heap must be scanned sequentially before inactive objects can be cleared. Even if only a small number of active objects remain, all objects need to be scanned.
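
The two phases can be sketched on a toy object graph. This is a textbook-style illustration under stated assumptions, not CPython's actual implementation: objects are represented by strings, and the names heap, edges, roots, mark, and sweep are all hypothetical.

```python
def mark(roots, edges):
    """Marking phase: return the set of objects reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in marked:
            continue                         # already visited, e.g. via a cycle
        marked.add(node)
        stack.extend(edges.get(node, ()))    # follow outgoing references
    return marked


def sweep(heap, marked):
    """Sweeping phase: split the heap into survivors and garbage."""
    survivors = [obj for obj in heap if obj in marked]
    garbage = [obj for obj in heap if obj not in marked]
    return survivors, garbage


heap = ["A", "B", "C", "D", "E"]
edges = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]}  # C <-> D is an unreachable cycle
roots = ["A"]

marked = mark(roots, edges)
survivors, garbage = sweep(heap, marked)
print(survivors)  # ['A', 'B']
print(garbage)    # ['C', 'D', 'E']
```

Note that the sweep visits every object in the heap regardless of how few survive, which is the cost described in the disadvantages above.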
  • Generational Recycling:
    • It trades space for time. Memory is divided into sets according to how long objects have survived, and each set is treated as a generation. CPython has traditionally used three generations: the young generation (generation 0), the middle generation (generation 1), and the old generation (generation 2), each backed by its own linked list. The longer objects have survived, the less frequently they are examined.
    • Newly created objects are allocated to the young generation. When the number of objects tracked in the young generation's list exceeds its threshold, garbage collection is triggered. Recyclable objects are reclaimed, and surviving objects are moved to the middle generation, and so on. Objects in the old generation have survived the longest.
    • Generational recycling builds on mark-sweep and also serves as an auxiliary garbage collection technique in Python for handling container objects.
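
The per-generation thresholds and counters can be inspected and tuned through the gc module. The sketch below assumes CPython; the value 1000 is an arbitrary number chosen for illustration, and the defaults differ between Python versions, so none are shown here.

```python
import gc

original = gc.get_threshold()   # tuple of three thresholds, one per generation
counts = gc.get_count()         # tuple of three current collection counters

# Raise the young-generation threshold so collections are triggered less often.
gc.set_threshold(1000, original[1], original[2])
changed = gc.get_threshold()

# Restore the previous settings so the rest of the program is unaffected.
gc.set_threshold(*original)
restored = gc.get_threshold()

print(original, counts, changed, restored)
```

A higher young-generation threshold means fewer but larger collections, which is one way to trade memory for time.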

IV. Memory Leak

  • Memory leaks are relatively rare in daily Python usage.
  • Situations where CPython does not release all memory upon exit:
    • Objects referenced from the global namespaces of Python modules are not always deallocated when Python exits, which can happen when circular references are involved. Some memory allocated by the C library may also not be released.
    • Python nevertheless cleans up memory on exit and attempts to destroy every object.
    • If you want to force Python to delete certain things on exit, you can use the atexit module to register a function that performs the deletion.
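  • Code Example: code that relies on reference counting to release resources promptly can leak them on other implementations.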
# In some Python implementations, the following code (which works well in CPython) may exhaust file descriptors
for file in very_long_list_of_files:
    f = open(file)
    c = f.read(1)
  • Improved Solution:
# You should explicitly close the file or use the with statement, which is effective regardless of the memory management scheme
for file in very_long_list_of_files:
    with open(file) as f:
        c = f.read(1)
