How to extract URLs from an XML Sitemap

Written By Ed Roberts

A so-called SEO expert. 

This guide will show you three quick and easy methods to extract the URLs from a Sitemap into a CSV file. We will look at how to do this using Python, Google Sheets & Screaming Frog. This can be incredibly useful if you want to pull a list of all submitted pages quickly, or if you want to keep checks on what your competitors are updating/adding to their sites.

How to extract sitemap URLs using Python

The first method I’m going to teach you uses every SEO’s favourite programming language, Python!

Don’t worry if you find coding daunting, because I’ve made a free Google Colab file with the script that you can run yourself straight away. No coding is required!

Step 1: Install the required modules

For this script, the first thing we want to do is install and import a few packages. The first of these is Pandas, which comes prepackaged with Google Colab, so we only need to import it.

import pandas as pd

Pandas is a library specifically for data manipulation and we will be using it to create a data frame from our sitemap and output it as a CSV file.

Next, we will be installing urllib3, an HTTP client for Python, which will be used to request the sitemap file.

!pip install urllib3
import urllib3

If you’re using Google Colab, then you will need to use the !pip command to install the packages.

Alongside this, we will also be installing and importing certifi to be able to send verified HTTPS requests, something that, while not strictly necessary, is strongly advised from a security perspective.

!pip install certifi
import certifi

Now we have the required modules installed, let’s get to work!

Step 2: Request the sitemap file using urllib3

To extract the URLs from a sitemap, the first thing we need to do is request the .xml file in question. 

To do this using Python we will be using urllib3’s request() function, but first we need to set up a PoolManager() instance.

https = urllib3.PoolManager()

To ensure that we’re sending verified HTTPS requests we will be adding a snippet of code, using the certifi module we installed at the beginning.

https = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())

You will then want to specify the URL of the sitemap file. If you’re simply using the Colab file I’ve shared, all you need to do is replace the example URL with your own.

url = "https://example.com/sitemap.xml"

This does need to be a .xml file, and make sure it’s a sitemap of URLs and not a sitemap index, as otherwise you will just end up with a list of sitemaps instead of pages.

TIP: If you’re having difficulty finding the sitemap file on your site, try checking your robots.txt as it will often (though not always) specify the sitemap locations.

Next, we want to use urllib3’s request() function to send a GET request for the URL.

response = https.request('GET', url)

And there you go, we’ve successfully requested our sitemap file.
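Before moving on, it’s worth checking that the request actually succeeded. Here’s a minimal sketch, assuming `response` is the object returned by `https.request()` above (the helper name is my own, not part of urllib3):

```python
# A small helper (my own naming, not part of urllib3) to guard against a
# failed request before we try to parse the response body.
def is_success(status: int) -> bool:
    """Return True only for a successful (2xx) HTTP status code."""
    return 200 <= status < 300

# For example, run this check before moving on to the parsing step:
# if not is_success(response.status):
#     raise RuntimeError("Sitemap request failed")
print(is_success(200))
```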

Step 3: Convert the XML file into a dictionary

For this next step we’re going to turn the XML file into a dictionary so we can easily extract the URLs into a data frame. To do this we will be using xmltodict to parse the file, which, like urllib3, needs installing and importing first.

!pip install xmltodict
import xmltodict

We can then parse the response using the code below.

sitemap = xmltodict.parse(response.data)

We now have a handy dictionary named ‘sitemap’ which contains all the data from the XML file. It will look something like this:

{'urlset': {'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', '@xmlns:image': 'http://www.google.com/schemas/sitemap-image/1.1', '@xsi:schemaLocation': 'http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd', '@xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9', 'url': [{'loc': 'https://pagenotfound.online/'}, 
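To see how those keys nest, here is a tiny hand-built dictionary in the same shape xmltodict produces (the URLs are illustrative placeholders), with the page locations pulled out by hand:

```python
# A hand-built dictionary mirroring the shape xmltodict returns for a
# sitemap; the URLs are just illustrative placeholders.
sitemap = {
    "urlset": {
        "@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
        "url": [
            {"loc": "https://pagenotfound.online/"},
            {"loc": "https://pagenotfound.online/about/"},
        ],
    }
}

# Each <url> element became a dict, so the page addresses live at
# sitemap['urlset']['url'][n]['loc'].
locs = [entry["loc"] for entry in sitemap["urlset"]["url"]]
print(locs)
```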

Step 4: Turn the dictionary into a data frame and export it as a CSV

Now that we have our sitemap stored as a dictionary we can easily retrieve the data that we need (the URLs) and store them in a data frame. 

To do this, we’re going to create a variable called sitemap_df (df for data frame) and use Pandas’ (pd) DataFrame.from_dict() function to extract just the URL entries, specifying the keys we need: ‘urlset’ and ‘url’. The final code looks like this.

sitemap_df = pd.DataFrame.from_dict(sitemap['urlset']['url'])
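One edge case worth knowing about: when a sitemap contains only a single <url> entry, xmltodict returns a dict rather than a list of dicts, which trips up the data frame construction. A small defensive sketch, with a helper name of my own invention:

```python
import pandas as pd

# xmltodict returns a single dict (rather than a list of dicts) when the
# sitemap contains only one <url> entry, which trips up the DataFrame
# construction. This helper (my own naming) normalises both cases.
def urls_to_dataframe(sitemap: dict) -> pd.DataFrame:
    entries = sitemap["urlset"]["url"]
    if isinstance(entries, dict):  # single-URL sitemap
        entries = [entries]
    return pd.DataFrame.from_dict(entries)

# A tiny single-URL example to show the guard at work:
single = {"urlset": {"url": {"loc": "https://example.com/"}}}
df = urls_to_dataframe(single)
print(df["loc"].tolist())
```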

We can then export this data frame to a CSV using Pandas’ to_csv() function.

sitemap_df.to_csv('output.csv', index=False)

If you’re using my Colab notebook then you’ll find the output CSV under files on the left-hand side, where you can simply download it and open it in your spreadsheet software of choice.

A screenshot showing where to locate the output.csv file by navigating to the file tab in the top left corner and finding it below sample_data.
The output file can be found under the files tab.

The final file will look something like this –

Screenshot of the CSV file created by the Python script showing the sitemap URLs along with the lastmod, changefreq, priority and image data.
You now have a complete list from the sitemap to work with in Sheets

As you can see, you will also have additional useful data, such as when each page was last modified and the image files associated with it.

You could quite easily adapt this script further to provide additional insight, such as iterating through the list of URLs to check the status codes and identify any 404 pages in the sitemap. Maybe one for a future tutorial.
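As a taste of that, here’s a rough sketch of how such a status-code check could look, assuming urllib3 as above. The function name and the injectable pool argument are my own additions, made so the logic can be tried out without real network requests:

```python
import urllib3

# A rough sketch (my own naming): collect every URL in a list that
# responds with a 404. Taking the connection pool as a parameter lets
# you swap in a stub when testing without real network requests.
def find_404s(urls, pool=None):
    pool = pool or urllib3.PoolManager()
    broken = []
    for url in urls:
        response = pool.request("HEAD", url)  # HEAD is enough for a status check
        if response.status == 404:
            broken.append(url)
    return broken
```

You could then call find_404s() on the ‘loc’ column of the data frame from the previous step.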

Python is a fantastic language for automating tasks like this, just take a look at my guide to automating alt-text or how to easily do keyword research using the Reddit API.

How to extract sitemap URLs using Google Sheets

If you feel a bit lost using Python, then don’t worry, you can also extract URLs from a sitemap in Google Sheets, using a fairly simple formula. 

To do this we will be using IMPORTXML and an XPath query. As per Google’s documentation, the IMPORTXML formula:

“Imports data from various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds”.

The syntax this formula uses is IMPORTXML(url, xpath_query). 

NOTE: Make sure to wrap the URL in quotation marks or it won’t work. 

In this case, the URL will be the URL of the sitemap we want to use. For this example, we will be using https://pagenotfound.online/post-sitemap.xml

XPath is an expression language used to query/navigate XML documents. If you’re a Screaming Frog user you may have previously used XPath for custom extractions. 

For our XPath query, we’re going to be using the local-name() function, which returns a node’s name without its namespace prefix. Sitemaps declare a default XML namespace, so a plain //url/loc query won’t match anything. Our XPath query therefore looks like this: //*[local-name()='url']/*[local-name()='loc'].

To break this down we are specifying the <url> parent tag and then the child <loc> or location tag which contains the URL of the page.

Combining these we get our finalised formula, which is =IMPORTXML("https://pagenotfound.online/post-sitemap.xml","//*[local-name()='url']/*[local-name()='loc']")

Gif showing the formula importing sitemap URLs into a Google Sheet
A quick and easy way to import sitemap URLs into Google Sheets

As you can see, the formula quickly pulls the URLs from the sitemap into the sheet. To do this with your own sitemap, simply replace the URL in the formula.

How to extract sitemap URLs using Screaming Frog

The final method I will be showing you to extract URLs from a sitemap is Screaming Frog, which I like to think of as the SEO Swiss Army knife.

If you’re unfamiliar with Screaming Frog, it is a web crawler used for auditing websites and probably my single most used tool alongside Google Search Console.

There is a free version of Screaming Frog that allows you to crawl up to 500 URLs, but I really do recommend picking up an annual licence, as it unlocks all the features and is dirt cheap by SEO tool standards.

Once you’ve downloaded Screaming Frog and booted it up, the first thing you’ll want to do is go into the Spider configuration. 

Here you’ll want to scroll to the bottom of the Crawl tab and tick ‘Crawl Linked XML Sitemaps’.

Then you’ll want to select ‘Crawl These Sitemaps’ and paste in the URL of the sitemap you want to use.

Screaming Frog spider configuration options for XML sitemaps.

You may want to untick automatically detect sitemaps from robots.txt if your site has multiple sitemaps and you just want the URLs from one. 

Now simply type or paste in your site URL and let Screaming Frog crawl the site. Once it has finished, click on the Crawl Analysis tab in the top menu and select Configure.

Screenshot showing where to locate the crawl analysis tab in the top right of the menu in Screaming Frog

Then make sure that Sitemaps is ticked and start the crawl analysis.

Screaming Frog crawl analysis configuration menu showing sitemaps as the sixth option down.

Once the analysis has been completed, head over to the Sitemaps tab and simply export the list of URLs to a .csv. And that’s it.

Of course, that’s just scratching the surface of what can be done using Screaming Frog’s sitemap analysis, and one we’ll take a deeper dive into in the future. 

In the meantime, if you’re just starting out in SEO or are a YouTube content creator, I recommend checking out my guide to the new YouTube keyword research tool.