Developing for Modern Windows

Tips, tricks, and guides for developing on modern Windows platforms

Download a website as HTML text in Windows 8

A common requirement for apps is downloading website content. The downloaded HTML can then be parsed for links or other content. My app XBMC Buddy uses this technique to find media links in a webpage. In a nutshell, the DownloadStringFromUrl function downloads the entire HTML from the web address it is given and returns that HTML as a text string. You would use it like this:


downloadedHtmlText = Await DownloadStringFromUrl("http://www.website.com")

Here's the code:


Imports System.Net.Http
Imports System.Threading.Tasks ' needed for Task(Of String)

Public Class MyFunctions

    Public Async Function DownloadStringFromUrl(url As String) As Task(Of String)
        ' Return a webpage as a string (the XML, HTML, etc.), or Nothing on failure.
        Dim client As HttpClient = New HttpClient
        Dim response As HttpResponseMessage
        Try
            ' Send the request and wait for the server's reply.
            response = Await client.GetAsync(url)
        Catch ex As Exception
            Return Nothing
        End Try

        Try
            ' Read the body of the reply as text.
            Dim downloadedHtml As String = Await response.Content.ReadAsStringAsync()
            Return downloadedHtml
        Catch ex As Exception
            Return Nothing
        End Try

    End Function

End Class

What the code is doing


Imports System.Net.Http

This imports the System.Net.Http namespace, which we need to use HTTP functionality.


Public Async Function DownloadStringFromUrl(url As String) As Task(Of String)

This is a function (i.e. a method that returns a value), and it is async: it can use the Await keyword to effectively pause itself while waiting for an action to complete, letting the rest of your application continue its work in the meantime. DownloadStringFromUrl is our function name. (url As String) tells us what must be passed to the function when it is called: a string that will be referred to as url within the function. As Task(Of String) is our function's return type, a task that produces a string, which is required because this is an async function that doesn't immediately return a result.


Dim client As HttpClient = New HttpClient
Dim response As HttpResponseMessage

These are objects we need within the function. HttpClient is the object that does our downloading, and HttpResponseMessage holds the reply that HttpClient gets back from the server, giving us a response we can test.

 

Try / Catch blocks

Try / Catch blocks are used because downloading from the Internet can be unpredictable. This function will return a value of Nothing if any error occurs, but you can customise the logic to include any error message or fallback you might want to use.
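As one possible way to customise that logic, here is a minimal sketch that returns an error message instead of a bare Nothing. The DownloadResult class, the DownloadHelpers class, and the TryDownloadStringFromUrl name are hypothetical, made up for this illustration; only HttpClient, HttpResponseMessage, and HttpRequestException come from System.Net.Http.

```vb
Imports System.Net.Http
Imports System.Threading.Tasks

' Hypothetical result type: holds either the downloaded text or an error message.
Public Class DownloadResult
    Public Property Html As String
    Public Property ErrorMessage As String

    Public ReadOnly Property Succeeded As Boolean
        Get
            Return ErrorMessage Is Nothing
        End Get
    End Property
End Class

Public Class DownloadHelpers

    ' Hypothetical variant of DownloadStringFromUrl that reports what went wrong.
    Public Async Function TryDownloadStringFromUrl(url As String) As Task(Of DownloadResult)
        Dim result As New DownloadResult()
        Try
            Using client As New HttpClient()
                Dim response As HttpResponseMessage = Await client.GetAsync(url)
                result.Html = Await response.Content.ReadAsStringAsync()
            End Using
        Catch ex As HttpRequestException
            ' Network-level failure, such as a failed DNS lookup or a refused connection.
            result.ErrorMessage = "Could not reach " & url & ": " & ex.Message
        Catch ex As Exception
            ' Anything else, for example a malformed URL.
            result.ErrorMessage = "Unexpected error: " & ex.Message
        End Try
        Return result
    End Function

End Class
```

The caller can then check Succeeded and show ErrorMessage to the user, rather than guessing why the result was Nothing.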


response = Await client.GetAsync(url)

This line sends the request to the website. The Await means that this function pauses here until the request completes; in the meantime, control goes back to the method that called this function. This asynchronous behaviour is what allows your apps to remain responsive even while they are completing a process, such as downloading, that takes a while. This function will not proceed to the next line of code until client.GetAsync(url) is complete, as it's awaiting it. If this succeeds, response will hold the server's reply for the webpage.
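To make the pausing behaviour more concrete, here is a small sketch that does not touch the network at all. SlowOperation is a hypothetical stand-in for a download (it just waits one second using Task.Delay), and the AwaitDemo and RunAsync names are made up for this example.

```vb
Imports System.Threading.Tasks

Public Class AwaitDemo

    ' Stand-in for a slow download: waits one second, then returns fixed text.
    Private Async Function SlowOperation() As Task(Of String)
        Await Task.Delay(1000)
        Return "<html></html>"
    End Function

    Public Async Function RunAsync() As Task(Of String)
        ' Calling the function starts the work and hands back a Task straight away.
        Dim pending As Task(Of String) = SlowOperation()

        ' Code here runs while the operation is still in progress.
        Dim status As String = "Download started"

        ' Await suspends RunAsync until the result is ready, without blocking the thread.
        Dim html As String = Await pending
        Return status & ", received " & html.Length.ToString() & " characters"
    End Function

End Class
```

Writing Await directly in front of the call, as the download function does, is simply the short form of these two steps.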


Return Nothing

If client.GetAsync(url) fails, the function returns a value of Nothing because it could not connect to the website.
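One thing worth knowing is that GetAsync only throws when the request itself fails. If the server answers with an error page (a 404 or 500, for example), GetAsync still returns a response, and the function would hand that error page back as if it were the content. The sketch below shows one way to guard against this using the IsSuccessStatusCode property of HttpResponseMessage. The StatusCheckExample class and DownloadIfSuccessful name are hypothetical.

```vb
Imports System.Net.Http
Imports System.Threading.Tasks

Public Class StatusCheckExample

    ' Hypothetical variant that also returns Nothing for HTTP error statuses.
    Public Async Function DownloadIfSuccessful(url As String) As Task(Of String)
        Dim client As HttpClient = New HttpClient
        Dim response As HttpResponseMessage
        Try
            response = Await client.GetAsync(url)
        Catch ex As Exception
            Return Nothing
        End Try

        ' GetAsync returns normally for replies such as 404 or 500,
        ' so the status has to be checked separately.
        If Not response.IsSuccessStatusCode Then
            Return Nothing
        End If

        Try
            Return Await response.Content.ReadAsStringAsync()
        Catch ex As Exception
            Return Nothing
        End Try
    End Function

End Class
```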

 


Dim downloadedHtml as String = Await response.Content.ReadAsStringAsync()

This is the hero line, which reads the website text and stores it in a string variable called downloadedHtml. We take response from earlier and call its Content.ReadAsStringAsync() method to get all the text from it.


Return downloadedHtml

Now we return downloadedHtml as the result of the function.

An example usage

Here is a method from an app that downloads HTML from a webpage. Note that we use the Await keyword because DownloadStringFromUrl is an async function that returns a Task, and that the calling method must itself be marked Async in order to use Await.


Public Async Sub DownloadHtmlFromWebPage()
    Dim rawHtmlText As String
    ' Pause here until the download finishes, without blocking the app.
    rawHtmlText = Await DownloadStringFromUrl("http://www.website.com")
    MyTextBox.Text = rawHtmlText
End Sub

When the above method is run, the download will start. This method effectively pauses once the download starts, and lets the rest of the program continue while downloading continues in the background. Once the download is complete, the value of the awaited download gets put in the rawHtmlText string.
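Once you have the HTML as a string, you can search it for links, as mentioned at the start of this post. The following is only a rough sketch of that idea using a regular expression; the LinkFinder module and FindLinks function are hypothetical names, and the pattern only handles quoted href attributes inside a tags.

```vb
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module LinkFinder

    ' Returns the href values of <a> tags found in the given HTML text.
    Public Function FindLinks(html As String) As List(Of String)
        Dim links As New List(Of String)
        If String.IsNullOrEmpty(html) Then
            ' Covers the case where the download returned Nothing.
            Return links
        End If

        ' Matches href="..." or href='...' inside an <a ...> tag.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ' Group 1 is the text between the quotes.
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

End Module
```

You could call FindLinks(rawHtmlText) right after the download completes. For anything beyond simple pages, a proper HTML parser is likely to be more reliable than a regular expression.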
