Web scraping has become popular in recent years. It is the process of selecting a target web page, identifying the patterns of the data you want, and then writing a computer program to extract the data and store it in a usable format.
With the advent of AI, mainly ChatGPT and other similar LLM chatbots, the process has become much simpler.
In this article, I will use the recently released ChatGPT 5 to show how to scrape data from a web page with much less effort.
ChatGPT can’t scan web pages itself yet, since it has no access to external sites, but it can write Python code to scrape the web page; all that’s required is to provide the patterns.
I will use a Wikipedia page, which isn’t behind a login and can be easily accessed from Python without issues. Web pages with logins or other access restrictions require more complex solutions such as Puppeteer.
As an example, I will use the Wikipedia List of Sovereign States.

If you don’t have Python installed on your system or don’t know how to use it, read this article: Get started programming with Python and ChatGPT.

Finding the pattern

To scrape data, you first have to manually find the pattern of the data you need to scrape.
To do that, open the web page you want to scrape in a web browser.
For the process below, I’m using the Chrome browser, but other browsers have similar tools.

  • Press Ctrl+Shift+I to open Chrome’s “Developer Tools”.
  • Click the “Elements” tab.
  • Click the arrow icon, which is the first icon on the left, as marked in the screenshot below.
  • Click on an element you want to scrape.
  • Click on the HTML code corresponding to the selected element in the “Elements” tab.
  • Right-click to open the context menu.
  • Select “Copy” and then “Copy full XPath”.
  • Repeat the process for two or three elements.



If you select the first few elements, you will get the XPaths below. Notice that Afghanistan doesn’t have a flag image in the table.

/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[2]/td[1]/b/span/span/span/img
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[2]/td[1]/b/a
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[3]/td[1]/b/span/span/span/img
/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[3]/td[1]/b/a

When you analyse the XPaths, you will see that each sovereign state has fields with corresponding XPaths:

  • name: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/a, where N is the value that changes.
  • image: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/span/span/span/img, where N is the value that changes.
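Before prompting ChatGPT, you can sanity-check the tr[N] pattern with lxml. The sketch below uses a hypothetical inline HTML fragment that mimics the nesting of the Wikipedia table; on the real page the path prefix is much longer, and the names here are made up for illustration.

```python
from lxml import html

# Hypothetical miniature of the Wikipedia table structure, for illustration
# only -- on the real page the prefix is /html/body/div[2]/.../table[1].
fragment = """
<table>
  <tbody>
    <tr><td><b><a>Afghanistan</a></b></td></tr>
    <tr><td><b><span><span><span><img src="flag.svg"/></span></span></span>
        <a>Albania</a></b></td></tr>
  </tbody>
</table>
"""

tree = html.fromstring(fragment)

# Apply the same tr[N]/td[1]/b/a pattern used on the real page
names = [
    tree.xpath(f'//table/tbody/tr[{n}]/td[1]/b/a')[0].text_content()
    for n in (1, 2)
]
print(names)
```

If the pattern is right, this prints the two state names, confirming the tr[N] index is the only part that varies.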

ChatGPT 5 Prompt

Now that you know the XPaths you need from your web page, the following prompt will generate the Python code to scrape the data:

Write a python program to web scrape the webpage:
 - The webpage is `https://en.wikipedia.org/wiki/List_of_sovereign_states`
 - For each row N, it should append an item to a list named `states`, where each item has the following fields:
 - A field named `name` with XPath on the webpage: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/a where N is the variable element.
 - An optional field image with XPath on the webpage: /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[N]/td[1]/b/span/span/span/img where N is the variable element.

Save the output in json format to `states.json`


It will output the following:
import json
import requests
from lxml import html

# URL of the webpage to scrape
url = "https://en.wikipedia.org/wiki/List_of_sovereign_states"

# Send a GET request to the webpage
response = requests.get(url)

# Parse the HTML content
tree = html.fromstring(response.content)

# Initialize an empty list to store the states
states = []

# Iterate over the rows of the table
for N in range(1, 195):  # Assuming there are 194 countries
    try:
        # Extract the state name using the provided XPath
        name_xpath = f'/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[{N}]/td[1]/b/a'
        name = tree.xpath(name_xpath)[0].text_content().strip()

        # Initialize the state dictionary
        state = {"name": name}

        # Attempt to extract the image URL, if available
        image_xpath = f'/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[{N}]/td[1]/b/span/span/span/img'
        image_elements = tree.xpath(image_xpath)
        if image_elements:
            image_url = image_elements[0].get('src')
            state["image"] = image_url

        # Append the state to the list
        states.append(state)

    except IndexError:
        # If the XPath fails (no more rows), break the loop
        break

# Save the list of states to a JSON file
with open('states.json', 'w', encoding='utf-8') as f:
    json.dump(states, f, ensure_ascii=False, indent=4)

print("Scraping complete! Data saved to states.json.")

The code above requires the popular external libraries requests and lxml.
If you don’t have them installed yet, you can install them with:
pip install requests lxml

When I executed the Python code, however, it generated an empty list of sovereign states.

Finding the problem

In many cases everything is straightforward, but sometimes things don’t go 100% well, which is why coding experience is always useful.
Was the problem on python code generated by ChatGPT 5 or the XPaths?
After a bit of digging, I detected that tr[N] wasn’t correct due to the table header: the loop had to start from N=2, not N=1.
By tweaking line 18 of the script from N in range(1, 195) to N in range(2, 195), it generates the name and image for each sovereign state.
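The header-row pitfall is easy to reproduce with a tiny hypothetical table: the first tr inside tbody holds th header cells rather than td data cells, so the data XPath matches nothing there. A minimal sketch with made-up markup, not the real page:

```python
from lxml import html

# Hypothetical miniature of the table: the first <tr> in <tbody> contains
# the <th> header cells, so the td-based XPath finds nothing in tr[1].
snippet = """
<table>
  <tbody>
    <tr><th>Name</th></tr>
    <tr><td><b><a>Afghanistan</a></b></td></tr>
  </tbody>
</table>
"""
tree = html.fromstring(snippet)

# tr[1] is the header row: the data XPath matches no elements there
header_match = tree.xpath('//table/tbody/tr[1]/td[1]/b/a')

# tr[2] holds the first actual state -- hence range(2, ...) in the loop
first = tree.xpath('//table/tbody/tr[2]/td[1]/b/a')[0].text_content()
print(header_match, first)
```

The empty match on tr[1] is exactly why the generated script produced an empty list until the loop was shifted to start at 2.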

Conclusion

  • ChatGPT 5 can’t scrape a web page itself, but it can generate Python code to do the job.
  • You have to manually find the XPaths of the data you want to scrape.
  • ChatGPT 5 was able to successfully generate the Python code to scrape the required data.
  • The XPaths might have issues; in that case, you have to debug the problem to find the correct XPath.