💡 Challenge

Extracting text from a Microsoft Word (DOCX) file using Power Automate can be challenging, especially when avoiding third-party tools.

✅ Solution

Leverage Power Automate to extract text directly from a DOCX file, understanding that it’s essentially a ZIP archive containing various XML files.

🔧 How It’s Done

Here’s how to do it:

  1. Recognize that a DOCX file is a ZIP archive.
    🔸 Rename the .docx extension to .zip to inspect its structure.
    🔸 Identify the document.xml file inside the word folder.
  2. Use Power Automate to extract the archive.
    🔸 Use the “Extract archive to folder” action on the DOCX file stored in OneDrive or SharePoint.
    🔸 Store the extracted files in a temporary folder.
  3. Read and parse the document.xml file.
    🔸 Use “Get file content using path” to retrieve document.xml.
    🔸 Use a “Compose” or “Parse XML” action to extract the text nodes.

🎉 Result

A streamlined method to extract text from Word documents using standard Power Automate features, keeping the process simple and entirely within the platform.

🌟 Key Advantages

🔸 No need for third-party tools.
🔸 Utilizes native Power Automate actions.
🔸 Directly parses XML for accurate text extraction.


🎥 Video Tutorial


🛠️ FAQ

1. Do I need premium connectors to extract the DOCX archive?
No, the archive extraction actions are available with standard OneDrive or SharePoint connectors.

2. How can I automate this for multiple files?
Use an “Apply to each” loop over the list of DOCX files, then repeat the extraction steps for each file.

3. How do I strip XML tags to get only plain text?
After parsing the XML, use the “Html to text” action or string expressions in “Compose” to remove any residual markup.

Marcel Lehmann

Marcel Lehmann

Microsoft MVP Microsoft MVP

BizzApps MVP from Switzerland 🇨🇭 - PowerPlatform Expert & Evangelist & MVP - Turning passion into expertise

MVP since 2023 Power Platform Expert since 2017

Leave a comment