Creating a Google Image Scraper with Python
Scraping images from websites can be useful for a variety of tasks, including image classification and training machine learning models. In this post, we will learn how to create an image scraper using Python!
This post is actually part 1 of 2 of a classification blog post. In the next blog post we will take our scraped images and use them to classify objects!
All the code can be found here if you want to get right into it.
What is an Image Scraper?
An image scraper is a program that automatically downloads images from a website. It falls under the broader category of “web scraping,” which involves extracting data from websites.
To make an image scraper, there are essentially five steps:
- Get all your prerequisites set up
- Open a website using a web testing library
- Loop through the website in a way that can find all the images in full size
- Store those URLs
- Download the images at the URLs locally!
Not too bad, right?
Making a basic image scraper can be pretty simple. The hard part, depending on the website, is step 3. For example, Google Images only shows you thumbnails, which aren't full size, until you click on them. Some websites also take a while to load.
Prerequisites
For this we will be using the Selenium and Requests libraries, so you will need to install those:
pip install selenium
pip install requests
We will also use shutil, but it is part of Python's standard library, so there is nothing to install for it.
Other popular web scraping libraries include BeautifulSoup, Scrapy, and lxml. Each of these has its strengths, and I recommend exploring them if you’re interested in web scraping beyond images.
Establishing a Connection
First, we need to fetch the webpage we’re scraping. If you’re targeting a specific website, you can simply hardcode its URL. But since we want to scrape images based on various search queries, we need to make our URL dynamic.
# Build the Google Query.
search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
# load the page
wd.get(search_url.format(q=query))
Here, wd is a webdriver.Chrome() instance from Selenium. Creating it opens a Chrome browser window, and wd.get navigates that window to the search results.
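One thing to watch with a dynamic URL is that search terms can contain spaces or characters like &. Here is a minimal sketch of building the URL with the query encoded first. The helper name build_search_url is my own for illustration, and the snippet above passes the raw query rather than an encoded one.

```python
from urllib.parse import quote_plus

# Same template as in the snippet above.
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"


def build_search_url(query: str) -> str:
    """Return the image-search URL for `query`, URL-encoding the query first."""
    # quote_plus turns spaces into '+' and escapes reserved characters like '&'.
    return SEARCH_URL.format(q=quote_plus(query))


print(build_search_url("golden retriever"))
```

You could then call wd.get(build_search_url(query)) instead of formatting the template inline.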
[!tip]
If you're using Edge, use webdriver.Edge(). Edge is Chromium-based, but webdriver.Chrome() launches Google Chrome itself. If you are using Safari or Firefox, use webdriver.Safari() or webdriver.Firefox() respectively.
Finding Image URLs
Next, we need to find the URLs of all the images. We start by finding all the thumbnails with the CSS selector img.Q4LuWd. What does this mean? It matches every img element with the class Q4LuWd, which is the class Google Images puts on its thumbnails. I found this by using inspect element on the images (Ctrl + Shift + I). Class names like this are auto-generated, so check that the selector still matches before relying on it.
How do we find these elements using Selenium? Easy! Just use wd.find_elements, with By imported from selenium.webdriver.common.by:
thumbnail_results = wd.find_elements(By.CSS_SELECTOR, "img.Q4LuWd")
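If the selector syntax is new to you, here is a small sketch of what img.Q4LuWd matches, using only the standard library so you can run it without a browser. The sample HTML and the ThumbnailFinder class are made up for illustration. Selenium does this matching for you on the live page.

```python
from html.parser import HTMLParser


class ThumbnailFinder(HTMLParser):
    """Collect the src of every <img> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element can carry several space-separated classes.
        classes = (attrs.get("class") or "").split()
        if tag == "img" and self.css_class in classes:
            self.sources.append(attrs.get("src"))


# Made-up markup: two thumbnails and one unrelated image.
SAMPLE_HTML = """
<div>
  <img class="Q4LuWd extra" src="thumb1.jpg">
  <img class="logo" src="logo.png">
  <img class="Q4LuWd" src="thumb2.jpg">
</div>
"""

finder = ThumbnailFinder("Q4LuWd")
finder.feed(SAMPLE_HTML)
print(finder.sources)  # ['thumb1.jpg', 'thumb2.jpg']
```

The logo image is skipped because its class list does not contain Q4LuWd, which is exactly the filtering the selector does.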
Looping Through and Clicking
The next step is to click each thumbnail in turn so that Google loads the full-size version of the image, and then grab that image's URL.
[Image: what navigating through the thumbnails looks like in the browser]
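Here is a rough sketch of what that click-through loop could look like. Treat it as an illustration rather than the exact code from the repo: the function name collect_full_size_urls is hypothetical, and the full-size selector img.n3VNCb is an assumption that you should confirm with inspect element, since Google changes these class names. The function only needs an object with find_elements, so you can pass it your Selenium wd.

```python
import time

# Selenium's By.CSS_SELECTOR is the plain string "css selector". Using the
# literal keeps this sketch importable even without Selenium installed.
CSS = "css selector"


def collect_full_size_urls(wd, max_urls, thumb_selector="img.Q4LuWd",
                           full_selector="img.n3VNCb", pause=0.5):
    """Click each thumbnail and collect the http(s) src of the enlarged image.

    full_selector is an assumed class name; verify it in your browser.
    """
    urls = []
    for thumb in wd.find_elements(CSS, thumb_selector):
        if len(urls) >= max_urls:
            break
        try:
            thumb.click()
        except Exception:
            continue  # thumbnail was not clickable, so skip it
        time.sleep(pause)  # give the full-size image time to load
        for img in wd.find_elements(CSS, full_selector):
            src = img.get_attribute("src")
            # Skip inline data: URIs and anything we already collected.
            if src and src.startswith("http") and src not in urls:
                urls.append(src)
    return urls[:max_urls]
```

The pause matters because the full-size image loads after the click. If you get thumbnails or nothing back, try a longer pause first.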
Downloading images
Now we have to download the images from the URLs we found. We loop through each URL we retrieved and attempt to download it.
We'll use the Requests library to send a GET request to each URL.
We check the status code before saving anything (HTTP status code 200 means OK: the request succeeded).
From there, we can save the file locally using Python's file-writing capabilities.
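import os
import shutil

import requests

def download_image(folder_path, url, i, quantity, index):
    # Signature inferred from the call in search_and_download.
    try:
        # Stream the response so the whole image is not held in memory.
        r = requests.get(url, stream=True)
        if r.status_code == 200:
            image_number = i + quantity * index
            file_path = os.path.join(folder_path, str(image_number) + '.jpg')
            with open(file_path, 'wb') as f:
                # Decode gzip/deflate content before copying the raw bytes.
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            print(f"SUCCESS - Got image {image_number} - saved to {file_path}")
        else:
            print(f"ERROR - Could not download {url} - {r.status_code}")
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")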
Putting it all together
At this point, we're all done! I added a driver function that finds the image URLs and then downloads each of them.
I also added a main.py where you can easily change the number of images you want, what your search terms are, etc.
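import os
from selenium import webdriver

# fetch_image_urls, download_image and SLEEP_BETWEEN_INTERACTIONS are defined elsewhere in the project.
def search_and_download(search_term, title_term, target_path, number_images, index):
    # NOTE: the parameter list is reconstructed from the names the body uses.
    # Create a folder name and make sure the folder exists.
    target_folder = os.path.join(target_path, "_".join(title_term.lower().split(" ")))
    os.makedirs(target_folder, exist_ok=True)
    # Open Chrome
    with webdriver.Chrome() as wd:
        # Search for image URLs.
        res = fetch_image_urls(
            search_term,
            number_images,
            wd=wd,
            sleep_between_interactions=SLEEP_BETWEEN_INTERACTIONS,
        )
    # Download the images.
    if res is not None:
        for i, elem in enumerate(res):
            download_image(target_folder, elem, i, number_images, index)
    else:
        print(f"Failed to return links for term: {search_term}")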
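To make the folder naming concrete, here is the same expression pulled out into a tiny helper you can run on its own. The name make_target_folder and the example values are mine for illustration.

```python
import os


def make_target_folder(target_path: str, title_term: str) -> str:
    """Mirror the folder-name logic above: lowercase, spaces become underscores."""
    return os.path.join(target_path, "_".join(title_term.lower().split(" ")))


# "Golden Retriever" ends up in a golden_retriever folder under images.
print(make_target_folder("images", "Golden Retriever"))
```

Keeping one folder per term is handy for part 2, since the folder name can double as the class label.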
Conclusion
That’s it for part 1! You’ve now built a basic image scraper that can collect images from Google based on any search term. In part 2, we’ll explore how to use these images for object classification. Stay tuned!
Resources: