How to Extract Text from DOCX Files Using Python and python-docx
In this post, we'll learn how to extract text from DOCX files in Python using the python-docx library. It's a good starting point for beginners exploring document automation and text processing.
Why Python and python-docx?
Python is known for its simplicity and powerful libraries. When it comes to handling DOCX files, python-docx makes it easy to extract text without dealing with the underlying file format (a DOCX file is a ZIP archive of XML parts).
Getting Started
First things first, make sure you have python-docx installed. You can install it using pip if you haven't already:
pip install python-docx
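To confirm the installation worked, a quick check like the one below can help. It is a minimal sketch that uses only the standard library (Python 3.8 or newer); the helper name installed_version is made up for illustration. Note that the package is installed as python-docx but imported as docx.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("python-docx")
if version is None:
    print("python-docx is not installed; run: pip install python-docx")
else:
    print("python-docx version:", version)
```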
Step-by-Step Example
1. Importing the library
To start, import Document from python-docx. Note that the package is imported under the name docx:
from docx import Document
2. Reading a DOCX file
Now, let's create a simple function to read a DOCX file and extract its text:
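def read_docx(file_path):
    # Open the DOCX file
    doc = Document(file_path)
    full_text = []
    # Collect the text of each body paragraph, in document order
    for para in doc.paragraphs:
        full_text.append(para.text)
    # Join the paragraphs into one string, one paragraph per line
    return '\n'.join(full_text)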
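One thing to keep in mind is that doc.paragraphs only covers body paragraphs, so text inside tables is not returned by read_docx. The sketch below shows one possible way to include it, assuming python-docx's doc.tables, table.rows, row.cells and cell.text attributes. The function names document_text and read_docx_with_tables are hypothetical, and table rows are simply appended after the paragraphs rather than kept in their original position in the document.

```python
def document_text(doc):
    """Collect paragraph text, then table text, from a python-docx document."""
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay distinguishable.
            lines.append('\t'.join(cell.text for cell in row.cells))
    return '\n'.join(lines)


def read_docx_with_tables(file_path):
    # Imported here so document_text() can be tried without python-docx installed.
    from docx import Document
    return document_text(Document(file_path))
```

Calling read_docx_with_tables('path/to/your/document.docx') would then return the paragraph text followed by one tab-separated line per table row.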
3. Using the function
You can now use the read_docx function to extract text from any DOCX file:
file_path = 'path/to/your/document.docx'
extracted_text = read_docx(file_path)
print(extracted_text)
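If the path is wrong or the file is not really a DOCX, opening it will fail, so it can be worth checking the path first. The following is a minimal sketch using only the standard library; the helper name looks_like_docx is hypothetical. It relies on the fact that a DOCX file is a ZIP archive, so it only rules out obvious problems and does not validate the document itself.

```python
import zipfile
from pathlib import Path


def looks_like_docx(file_path):
    """Return True if the path is an existing .docx file that is a ZIP archive."""
    path = Path(file_path)
    return (
        path.is_file()
        and path.suffix.lower() == '.docx'
        and zipfile.is_zipfile(path)
    )


file_path = 'path/to/your/document.docx'
if looks_like_docx(file_path):
    print('Ready to pass to read_docx:', file_path)
else:
    print('Not a readable DOCX file:', file_path)
```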
Here is the full implementation:
from docx import Document


def read_docx(file_path):
    # Open the DOCX file and collect the text of each body paragraph
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)


if __name__ == "__main__":
    file_path = 'path/to/your/document.docx'
    extracted_text = read_docx(file_path)
    print(extracted_text)
Learn more about python-docx in its official documentation.