How can we automate the process of scraping the title and header h1 from web pages using Excel VBA?
Our macro code should be able to do the following:
- Open a web browser like Internet Explorer automatically
- Load the relevant hyperlink from our Excel worksheet
- Extract the title and the header h1
- Place the title and the header data in appropriate cells in our worksheet
- Go to the next web-link using looping process and do the same process again till it has automatically scanned all the links and extracted all the relevant data and placed it into relevant cells of our worksheet
- In case the header h1 is not available for whatever reasons, the macro should perform an error handling and do its job neatly
- Finally the Internet Explorer or web browser should quit
- The extracted data in the relevant cells should be neatly ‘auto-fitted’.
View the training video:
The complete VBA code:
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim sURL As String
Dim lastrow As Long
lastrow = Sheet1.Cells(Rows.Count, “A”).End(xlUp).Row
For i = 2 To lastrow
Set wb = CreateObject(“internetExplorer.Application”)
sURL = Cells(i, 1)
wb.navigate sURL
wb.Visible = True
While wb.Busy
DoEvents
Wend
‘HTML document
Set doc = wb.document
Cells(i, 2) = doc.title
On Error GoTo err_clear
Cells(i, 3) = doc.GetElementsByTagName(“h1”)(0).innerText
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
Next i
End Sub
Note:
- Go Tools and the References in the Microsoft Visual Basic Editor and check (a) Microsoft HTML Object Library and (b) Microsoft Internet Controls
- If your Internet speed is slow or erratic check your extracted data carefully
Download a sample file:
Further reading: