9 Ways to Connect to Web-Based CSVs in Tableau
That title is a bit of a mouthful and this may seem like a somewhat niche topic, but if you’ve ever tried to connect to a csv from the web, you know how painful it can be. There are many public services, including government data portals, that make their data publicly-available in comma-separated-values (CSV) format on their websites. We can easily download these files then connect to them in Tableau, but what if you want to regularly refresh the data on Tableau Cloud or Server? That’s when things start to get tricky. So, in this blog, I’m going to share nine methods for connecting Tableau to a web-based CSV and refreshing them in Tableau Server or Cloud.
For this blog, I’m going to use a feed of earthquake data provided by USGS. The USGS provides several options for
downloading this data, based on time as shown below.
To get the path of the csv file, we can
right-click and link and choose “Copy link address” (the name of this option
will differ from browser to browser). For example, if we click the “All Earthquakes”
option under “Past 30 Days”, we’ll get the following csv: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv
One last item before I jump into the
nine methods is to acknowledge that the examples above are all publicly-available
data files. You may have a scenario where your web-based csv requires some sort
of authentication. That will complicate (and possibly eliminate) some of these
methods, depending on the authentication protocols used.
1) The Manual Way
Let’s start with the manual method.
In this case, we simply manually download the csv and save it to our computer.
We then either manually refresh the Tableau data source and publish to Server
or Cloud or, in the case of Cloud, we could use a local Bridge client to
refresh the file in a more automated fashion.
An alternative is to save the file to
a shared network storage location. The advantage of this method is that the
file can be accessed by other resources on your network. Thus, you’ll be able
to refresh automatically from Server or leverage an instance of Tableau Bridge
that does not reside on your computer. For more details on refreshing network-based
files, see Auto-Refreshing File-Based Data Sources (note: while focused on Tableau Server, many of the
lessons would also apply to Bridge).
So why would you want to implement
such a manual process for this? Well, perhaps you only need to refresh the data
occasionally. In that case, it may not be worth the additional effort to implement
a more automated solution, making the manual method good enough for your use
case.
2) Semi-Automated using Cloud Storage
One of the biggest issues with the
manual method is that, if you’re using Tableau Cloud, it requires you to
leverage Tableau Bridge. I personally try to avoid Bridge whenever possible as
it is just another component that must be maintained, upgraded, etc. We can
eliminate this by using some sort of cloud-based storage such as Google Drive,
OneDrive, Amazon S3, or any other service for which Tableau has a native
connector.
Note: For data in Amazon S3, you can
either use the new S3 connector or you can connect Amazon Athena to S3 and
create a sort of “virtual” database, then use the Tableau Athena connector.
In this scenario, we’ll still
manually download the csv, but then we’ll upload that file to our cloud
service. We then use the appropriate cloud connector in Tableau to connect to
the file. When that data source gets published, it will be able to refresh
automatically—regardless of whether you’re using Server or Cloud—because the
file is accessible via the internet.
Note: Be careful when using desktop cloud
sync applications. In some cases, these applications will change the internal
identifiers of files, causing Tableau to be unable to locate the file, even
though the name is the same. Before using these apps, be sure to test to make
sure everything works and refreshes as expected. If you do get missing file
errors during refreshes, then manually upload the files to the cloud service
(via the web) instead of using sync apps.
3) Semi-Automated using Code
Our next method automates the
downloading of the file from the web using code. Python is my go-to coding
language, but you could use the language of your choice. With Python, we can
easily load the csv into a dataframe then write that dataframe to a csv file:
We’d then schedule the code to run on
some regular basis using a scheduling tool so that it regularly downloads
updated data. From here, we use one of the previously discussed methods to
refresh the data on Tableau Cloud or Server. If paired with one of the methods
that automatically performs the Cloud/Server refresh, then you could create a
fully-automated solution.
4) Fully-Automated using Code & a
Database
Okay, so the manual and semi-automated
methods aren’t good enough for you? Fortunately, you’ve now arrived at the
fully-automated section. For the next three methods, we’re going to continue
with code-based solutions and simply take the next step towards a fully-automated
solution.
The first of these options is to write
code that will download the data from the csv then push that data into a
database. I haven’t written sample code for this, but my general approach is to
use the sqlalchemy library to connect
to a database then use the dataframe’s to_sql method to insert data.
Ideally, we’d use a cloud-based
database, such as those offered by AWS, Azure, Google, etc. By doing this,
Tableau Cloud would be able to refresh automatically without the need for
Tableau Bridge. If you did, however, use an on-premise database, then Bridge
would be required for refreshing. If you’re using Server, then you’ll be able
to refresh easily whether the database is on-prem or cloud.
5) Fully-Automated using Code & Cloud
Storage
If you don’t have a database
available or just don’t want to deal with the stringency of databases (particularly,
the need for well-defined tables, columns, data types, etc.), then you could
have your code upload the downloaded file to one of the cloud storage solutions
we previously mentioned.
This is, unfortunately, somewhat complicated
due to the need to build authentication mechanisms into your code. In my
experience, many cloud services use different authentication methods and none
of them are particularly straightforward. Of course, I’m just an amateur Python
developer so folks with more experience will likely find this easier. My recommendation
is to engage one of these developers as they’ll be able to quickly develop the
code needed and do so in a secure manner that will prevent any unauthorized
access to your cloud environment.
6) Fully-Automated using Table
Extensions
How about a fully-automated method
that can be built right into Tableau? Last year, Tableau introduced Table Extensions, a feature that allows you to write code directly in a data
model. To add a table extension, we first need to connect to a data source. In
this case, we’re not going to use that data source at all, so we can connect to
anything—I’ll simply connect to the Superstore data source. I’ve used the “Saved
Data Source” so my next step will be to remove all tables from the data model.
Next, we drag “New Table Extension”
into our data model. This will expose a code window.
It will also create a new connection
called “tableau-analytics-extension”. We’re only going to use this connection,
so we can close the connection to Superstore.
Next, we write our code. The code
sample below first loads the web-based csv into a dataframe. When using Python,
the table extension script must return a Python dictionary, so the final
step is to convert the dataframe to a dictionary then return it.
When done with your code, click Apply,
then you can start using the data source.
Note: Table Extensions work similarly
to Live connections in that they refresh each time you open a workbook or
refresh a data source.
While this method seems pretty easy,
the one drawback is that it requires that you have an analytics extension
configured (in my case, a TabPy server) and that you’ve connected to it in
Tableau Desktop. You can get to the Desktop setup by going to Help | Settings
and Performance | Manage Analytics Extension Connection, then
specifying the hostname and port of the server. In my case, I’ve configured
TabPy to run on my local computer.
For more information on setting up a TabPy
server, see TabPy Installation Instructions.
7) Fully-Automated using a Web Data
Connector
What if you don’t want to mess with
code at all? There has to be a non-code solution, right? Yes, there is!!! The
first solution was developed by Keshia Rose, a former
Senior Product Manger at Tableau. If you’re not familiar with Keshia, you
should be as she did incredible things at Tableau. And you may remember her
from Iron Viz as she acted as co-host for several years.
Keshia created a web data connector (WDC) that
allows you to easily extract data from a web-based csv. You simply enter your
URL, click Get Data! and the WDC does the rest.
While this method is quite easy to
use, there are some drawbacks. First, this is a version 2.0 connector, which requires
Tableau Bridge for Tableau Cloud refreshes. For Tableau Server, on the other
hand, you can add the WDC to your safe list and it will refresh automatically.
Additionally, version 2.0 is slated to be eliminated at some point in the
future and replaced by 3.0, so this connector won’t work indefinitely. Hopefully
someone creates a similar WDC for 3.0!!
8) Fully-Automated using Google
Sheets
For our next code-free solution, we’re
going to leverage a handy function, IMPORTDATA, in Google Sheets. IMPORTDATA is
specifically built for importing data from a text file (csv or tsv) via URL.
So, we’ll create a new Google Sheet then type the following formulate into the
first cell hit Enter.
=IMPORTDATA("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv")
Google Sheets will connect to the csv
file then pull in the data.
We can then simply use Tableau’s Google
Drive connector to get our data and refresh automatically on Cloud or
Server. This command will automatically run every time you open the
spreadsheet, including when Tableau Cloud/Server open it for a refresh. Thus,
you’ll be sure to get updated data.
One drawback I’ve seen with this
approach is that it can’t handle large data files. For example, I tried to
import the 30-day earthquake file and that resulted in an error stating “Resource
at url contents exceeded maximum size.” Google Sheets has a 5 million cell
limit, but this data file is much smaller than that, so I suspect that Google
has some additional limits on the amount of data that can be transferred using
this command. For further discussion on this topic and alternatives, see this post on Stack Overflow.
Another drawback I’ve seen is that
the automatic update of data can often take a bit of time to complete,
particularly for slower websites. In those cases, I’ve seen situations where
the data isn’t yet updated in the Google Sheet when Tableau tries to refresh,
leaving incomplete data. So be sure to test everything before leveraging this
method.
9) Fully-Automated using Excel &
OneDrive
Finally, we can use Excel and
OneDrive to do something similar. We start by creating an Excel file one
OneDrive/SharePoint. Then go to the Data menu and, in the Get &
Transform Data section, choose From Web. Enter your csv into the
dialog box.
Then click OK then click Connect
then Load. The data will be loaded into an Excel “table”.
Once loaded, click the Query menu
then edit the Properties. In the dialog box, check the option to “Refresh
data when opening the file”. This will ensure that the data gets refreshed when
Tableau Cloud/Server attempts to refresh it.
We can then simply use Tableau’s OneDrive and SharePoint Online connector to get our data and refresh automatically on Cloud or Server.
I haven’t used this method as much as
# 8, but I suspect it has some of the same limitations. However, as far as size
is concerned, Excel can handle a lot more data than Google Sheets. Unlike
Google Sheets, I was able to load the full 30-day earthquake file without any
problems.
Closing
So, there are nine methods for connecting
to web-based csv files and refreshing them. I suspect that there are several
additional methods for doing this, so if you have any other ideas, feel free to
share them in the comments.
While this is a somewhat niche use
case, I think it’s important to note that some of these methods could be used
in other scenarios as well. For example, perhaps you have a business system that
doesn’t allow you to connect to its database or doesn’t have APIs. You might
have no choice but to manually download the data from that system in order to
use it in Tableau. While that’s a manual process, you could leverage some of
the techniques shared above to automate the refresh part of it and to avoid use
of extra components such as Tableau Bridge.
Thanks for reading!!
Header image: Photo by Mika Baumeister on Unsplash
Ken Flerlage, February
19, 2024
Twitter | LinkedIn | GitHub | Tableau Public
No comments: