How to Extract Text from DOCX Files Using Python and python-docx

In this post, let’s learn how to extract text from DOCX files in Python using the python-docx library. It’s a good starting point for beginners exploring document automation and text processing.

Why Python and `python-docx`?

Python is known for its simplicity and powerful libraries. When it comes to handling DOCX files, python-docx lets you extract text without working with the underlying XML format directly.

Getting Started

First things first, make sure you have python-docx installed. You can install it using pip if you haven’t already:

pip install python-docx

Step-by-Step Example

1. Importing the library

To start, import the Document class. Note that the package is installed as python-docx but imported as docx:

from docx import Document

2. Reading a docx file

Now, let’s create a simple function that reads a DOCX file and returns its paragraph text as a single string:

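def read_docx(file_path):
    """Return the text of all body paragraphs in a DOCX file, one per line."""
    doc = Document(file_path)
    full_text = []
    # doc.paragraphs holds the paragraphs of the document body, in order
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)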

3. Using the function

You can now use the read_docx function to extract text from any DOCX file:

file_path = 'path/to/your/document.docx'
extracted_text = read_docx(file_path)
print(extracted_text)

Here is the full implementation:

from docx import Document

def read_docx(file_path):
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

if __name__ == "__main__":
    file_path = 'path/to/your/document.docx'
    extracted_text = read_docx(file_path)
    print(extracted_text)

Learn more about python-docx in its official documentation.
