Though initially intended for testing web pages, Selenium is an invaluable tool for web scraping with Python. Covered below are the basics of using the tool and how to implement web scraping with Python regardless of scale.
Getting Started With Selenium
Selenium conveniently has thorough documentation for beginners. The setup takes a couple of steps because the framework enables cross-browser and cross-platform automation. This is a great feature, especially if large-scale web scraping is the goal, because the same code will run across multiple machines with little modification.
At the same time, though, this also means that language and browser-specific components of Selenium will need to be installed onto each machine. Python is the language for the tutorial linked in the following section, and Chrome is the browser of choice.
With the correct Selenium library and browser driver installed, a recommended first step is to attempt to open and close a browser window. The Python code for this step is given in Selenium’s documentation:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://selenium.dev")
driver.quit()
WebDriver is the primary tool in Selenium’s toolbox for automating whatever is needed within the browser. The second line creates a Chrome-specific instance of WebDriver, the third line tells that instance (named “driver”) to navigate to the given URL, and the last line closes the browser.
Basic Implementation
The basic but crucial first steps to Python web scraping are now complete. It’s time to move on to the general implementation with Selenium. Though there are other tools for slightly different needs, Selenium is a good choice because it can handle pages that rely on JavaScript to render their content.
This tutorial provides a solid introduction to web scraping with Python. Steps 1-3 should look familiar, though the tutorial requires more modules to be imported.
You should note that this tutorial, from Step 4 onward, introduces web scraping in the context of searching Reddit for a keyword using the site’s search box. This step is notable because although the basic principles of web scraping with Python remain the same, different contexts will require specific methods of obtaining the necessary data from the webpage.
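The general pattern for that kind of keyword search looks roughly like the sketch below. Note that the By.NAME locator and the “q” field name are assumptions for illustration only; sites change their markup, so you would confirm the real locator by inspecting the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.reddit.com")

# The locator below is an assumption for illustration; inspect the live
# page to confirm how the search box can actually be found.
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("selenium")
search_box.send_keys(Keys.RETURN)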
Ultimately, it will be up to you to inspect the webpage’s HTML to determine what the code needs to “look” for to retrieve data effectively. Is the data in an H1 or an H2 tag? Is it bold? Does it always contain a particular string? These are the kinds of questions to consider for your own web scraping needs.
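For instance, if inspection suggests the titles you want are rendered as H2 elements, the extraction step might look like the following sketch, which continues from the driver created above. The tag name and the filter string are placeholders to swap out for whatever your own inspection reveals.

from selenium.webdriver.common.by import By

# The tag name and the filter string are placeholders; replace them with
# whatever your own inspection of the page's HTML shows.
headings = driver.find_elements(By.TAG_NAME, "h2")
matching_titles = [h.text for h in headings if "selenium" in h.text.lower()]
print(matching_titles)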
Another vital puzzle piece is what the web crawler should do if the page isn’t fully loaded. The above tutorial takes advantage of Selenium’s WebDriverWait and expected conditions, but remember that your implementation will need to know what to “wait” for on the page, in the same way that it needs to know what to “look” for when retrieving data.
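A minimal sketch of that idea could look like the lines below, again continuing from the driver above. The ten-second timeout and the “search-results” ID are hypothetical placeholders; your target page dictates what to wait for and for how long.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a (hypothetical) results container to appear
# before trying to read anything out of it.
wait = WebDriverWait(driver, 10)
results_container = wait.until(
    EC.presence_of_element_located((By.ID, "search-results"))
)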
How to Scale Up Your Web Scraping Implementation
Luckily, the Selenium umbrella also includes a tool for scaling up your web scraping implementation: Selenium Grid. Grid makes it possible to run web scraping jobs in parallel; spread across four machines, for example, a job takes roughly one-fourth the time it would if the code ran sequentially on a single machine.
This step is where the previously discussed advantage of cross-browser and cross-platform automation comes into play. Neither the platform nor the browser on any given machine needs to be a concern when you distribute the work across devices in parallel.
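In practice, each machine points its code at the Grid hub through a Remote WebDriver rather than a local driver. A minimal sketch, assuming a Selenium 4 Grid hub is already running at a placeholder address of http://localhost:4444, might look like this:

from selenium import webdriver

# The hub address is a placeholder; point it at wherever your Grid hub runs.
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor="http://localhost:4444",
    options=options,
)
driver.get("http://selenium.dev")
driver.quit()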
Conclusion
Parallel computing is a vast topic in its own right, but rest assured that the tools discussed will get you well on your way to scalable Python web scraping. Selenium’s WebDriver and Grid, alongside helpful documentation and tutorials, make the automated retrieval of data an achievable task.