Web scraping has become popular in recent years. It is the process of selecting a target web page, identifying the patterns of the data you want, and then writing a computer program to extract the data and store it in a usable format.
With the advent of AI, mainly ChatGPT and other similar LLM chatbots, the process has become much simpler.
In this article, I will use the recently released ChatGPT 5 to show how to scrape data from a web page with much less effort.
ChatGPT can’t scan web pages itself yet, since it has no access to external sites, but it can write Python code to scrape the web page; all that’s required is to provide the patterns.
I will use a Wikipedia page, which isn’t behind a login and can be easily accessed from Python without issues. Web pages with logins or other access restrictions require more complex solutions such as Puppeteer.
As an example, I will use the Wikipedia List of Sovereign States.

If you don’t have Python installed on your system or don’t know how to use it, read this article: Get started programming with Python and ChatGPT.

Finding the pattern

To scrape data, you first have to manually find the pattern of the data you need to scrape.
To do that, open the web page you want to scrape in a web browser.
For the process below, I’m using the Chrome browser, but other browsers have similar tools.

  • Press Ctrl+Shift+I to open Chrome’s “Developer Tools”.
  • Click the “Elements” tab.
  • Click the arrow icon, which is the first icon on the left, as marked in the screenshot below.
  • Click on an element you want to scrape.
  • Click on the HTML code corresponding to the selected element in the “Elements” tab.
  • Right-click to open the context menu.
  • Select “Copy” and then “Copy full XPath”.
  • Repeat the process for two or three elements.



If you select the first few elements, you will get the XPaths below. Notice that Afghanistan doesn’t have a flag image in the table.

/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[2]/td[1]/b/span/span/span/img
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[2]/td[1]/b/a
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[3]/td[1]/b/span/span/span/img
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[3]/td[1]/b/a

When you analyse the XPaths, you will see that each sovereign state has fields with corresponding XPaths:

  • name: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/a, where N is the value that changes.
  • image: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/span/span/span/img, where N is the value that changes.
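Before prompting ChatGPT, you can sanity-check the tr[N] pattern with lxml. The sketch below uses a hypothetical inline HTML fragment that mimics the nesting of the Wikipedia table; on the real page the path prefix is much longer, and the names here are made up for illustration.

```python
from lxml import html

# Hypothetical miniature of the Wikipedia table structure, for illustration
# only -- on the real page the prefix is /html/body/div[2]/.../table[1].
fragment = """
<table>
  <tbody>
    <tr><td><b><a>Afghanistan</a></b></td></tr>
    <tr><td><b><span><span><span><img src="flag.svg"/></span></span></span>
        <a>Albania</a></b></td></tr>
  </tbody>
</table>
"""

tree = html.fromstring(fragment)

# Apply the same tr[N]/td[1]/b/a pattern used on the real page
names = [
    tree.xpath(f'//table/tbody/tr[{n}]/td[1]/b/a')[0].text_content()
    for n in (1, 2)
]
print(names)
```

If the pattern is right, this prints the two state names, confirming the tr[N] index is the only part that varies.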

ChatGPT 5 Prompt

Now that you know the XPaths you need from your web page, the following prompt will generate the Python code to scrape the data:

Write a python program to web scrape the webpage:
 - The webpage is `https://en.wikipedia.org/wiki/List_of_sovereign_states`
 - For each row N, it should append an item to a list named `states`, where each item has the following fields:
 - A field named `name` with XPath on the webpage: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/a where N is the variable element.
 - An optional field image with XPath on the webpage: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/span/span/span/img where N is the variable element.

Save the output in json format to `states.json`


It will output the following:
import json
import requests
from lxml import html

# URL of the webpage to scrape
url = "https://en.wikipedia.org/wiki/List_of_sovereign_states"

# Send a GET request to the webpage
response = requests.get(url)

# Parse the HTML content
tree = html.fromstring(response.content)

# Initialize an empty list to store the states
states = []

# Iterate over the rows of the table
for N in range(1, 195):  # Assuming there are 194 countries
    try:
        # Extract the state name using the provided XPath
        name_xpath = f'/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[{N}]/td[1]/b/a'
        name = tree.xpath(name_xpath)[0].text_content().strip()

        # Initialize the state dictionary
        state = {"name": name}

        # Attempt to extract the image URL, if available
        image_xpath = f'/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[{N}]/td[1]/b/span/span/span/img'
        image_elements = tree.xpath(image_xpath)
        if image_elements:
            image_url = image_elements[0].get('src')
            state["image"] = image_url

        # Append the state to the list
        states.append(state)

    except IndexError:
        # If the XPath fails (no more rows), break the loop
        break

# Save the list of states to a JSON file
with open('states.json', 'w', encoding='utf-8') as f:
    json.dump(states, f, ensure_ascii=False, indent=4)

print("Scraping complete! Data saved to states.json.")

The code above requires the popular external libraries requests and lxml.
If you don’t have them installed yet, you can install them with:
pip install requests lxml

When I executed the Python code, however, it generated an empty list of sovereign states.

Finding the problem

In many cases everything is straightforward, but sometimes things don’t go 100% well, which is why coding experience is always useful.
Was the problem on python code generated by ChatGPT 5 or the XPaths?
After a bit of digging, I detected that tr[N] wasn’t correct due to the table header: the loop had to start from N=2, not N=1.
By tweaking line 18 of the script from N in range(1, 195) to N in range(2, 195), it generates the name and image for each sovereign state.
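The header-row pitfall is easy to reproduce with a tiny hypothetical table: the first tr inside tbody holds th header cells rather than td data cells, so the data XPath matches nothing there. A minimal sketch with made-up markup, not the real page:

```python
from lxml import html

# Hypothetical miniature of the table: the first <tr> in <tbody> contains
# the <th> header cells, so the td-based XPath finds nothing in tr[1].
snippet = """
<table>
  <tbody>
    <tr><th>Name</th></tr>
    <tr><td><b><a>Afghanistan</a></b></td></tr>
  </tbody>
</table>
"""
tree = html.fromstring(snippet)

# tr[1] is the header row: the data XPath matches no elements there
header_match = tree.xpath('//table/tbody/tr[1]/td[1]/b/a')

# tr[2] holds the first actual state -- hence range(2, ...) in the loop
first = tree.xpath('//table/tbody/tr[2]/td[1]/b/a')[0].text_content()
print(header_match, first)
```

The empty match on tr[1] is exactly why the generated script produced an empty list until the loop was shifted to start at 2.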

Conclusion

  • ChatGPT 5 can’t scrape a web page itself, but it can generate Python code to do the job.
  • You have to manually find the XPaths of the data you want to scrape.
  • ChatGPT 5 was able to successfully generate the Python code to scrape the required data.
  • The XPaths might have issues; in that case, you have to debug the problem to find the correct XPath.