Services in the Microsoft Azure cloud have been developing rapidly recently. Since long the offering is not limited only to the possibility to rent scalable and reliable computing power and large amounts of cheap storage space; the Azure cloud offering now includes lots of interesting PaaS services: from a relational database service (Azure SQL Database), a MPP (multi parallel processing) database service (Azure SQL Data Warehouse), an ETL service (Data Factory), machine learning service to the PowerBI – a cloud-based BI service that can be used (also) as a front-end visualization tool. We can now build an entire end-to-end analytics solution and host it in Azure; this includes the back-end to get the data from multiple sources, the database to persist and transform the data and the front-end dashboards and reports to visualize the data. The hosting in Azure (compared to an on-premise solution) gives us many benefits: it’s scalable, reliable and cost-effective.
We have done an internal project using Azure components to analyse our Twitter activities and the reach of our tweets: we fetch the data from Twitter, load and transform it to a dimensional data model and expose the results as PowerBI dashboards.
At Atos Consulting we wanted to have a solution to quickly see the impact of our activities on Twitter. We wanted a few simple dashboards that would provide us a short overview of the number of tweets posted, number of impressions, the engagement of the users with our tweets and other similar statistics. The dashboards should be simple and informative. To build such a solution we need:
- a back-end module to get the data from Twitter and store it,
- a database to persist the data and transform it to a structure that’s suitable for analysis (in our case a dimensional data model),
- a front-end interface to view and visualize the data.
In the Azure cloud, we can use these services (among others) to build such a solution:
Web Apps is a PaaS offering to host and run your web applications. It’s fully managed, so you don’t need to worry about the underlying hardware or operating system. You simply deploy your web application, either in ASP.NET, Node.js, Java, PHP or Python and the system will take care that it’s running properly (the guaranteed SLA is 99.95%). However, in the Web App service you can also run just a simple script or a small background task – either on a regular schedule, via a trigger or in a continuous mode. This is a feature known as WebJobs. Sure, you could achieve the same by running your own VM and scheduling jobs there via a cron job or a Windows scheduler. But – do you really want to bother with managing a VM just for this? WebJobs are fully managed, you simply upload your script or an app (in .NET, Python, PHP…) and schedule its execution in the Azure portal. You can put as many jobs as you want in one Web App and schedule them accordingly.
In our case we can use an Azure Web App service to run our jobs that fetch the data from Twitter and write it to a persistent storage (a database).
Azure SQL Database is a PaaS relational database service. This is again a fully managed service: you don’t need to worry about the underlying SQL Server engine (patching, updating, etc.). You can simply create a database and develop the database objects using the T-SQL language – with (more or less) the same functionality as the on-premise SQL Server database engine. You can also anytime change the performance tier of your database; e.g. you decide how much computing power you need and you’re billed accordingly.
In our case we can use an Azure SQL Server Database to store and transform our data.
PowerBI is a Microsoft self-service BI platform. It can fetch data in different formats from different sources and you can create interesting dynamic visualizations. You can author the reports in the Power BI desktop application and deploy it to the PowerBI cloud-based service – the reports can be either accessed by a group of internal users or the reports can be deployed to the web and accessed by public.
Azure Blob Storage is inexpensive cloud-based object storage for large unstructured data. You can think of it as an (almost) unlimited dump where you can put anything you’d like: images, video, audio, documents and more.
Azure Queue Storage is an interesting messaging service for workflow processing and orchestration between application components or applications. Applications can use this service for communication between each other; the messages sent to the queue storage are simply plain-text strings that are queued until they are picked up by the receiving end.
This is a high-level overview of our finished solution:
“Obtain Twitter Data” WebJob is a Python script (written for Python version 3.4) that gets the Twitter data and saves it as a CSV file to the blob storage. If the data is successfully received, it sends a message to the queue storage signaling to the next component (“Process Queue” component) that there is data to process. Python as a language is very nice and easy to learn, it’s dynamically-typed and it’s easy to create a simple script that fetches the data from a REST API or even does web scraping.
“Process Queue” WebJob is a .NET C# application that runs continuously and listens to the queue storage for new messages. If a message is received, it processed it – it reads the newly uploaded file from the blob storage and copies the data to the SQL database. It’s actually doing a simple Extract-Load job. This component is written in .NET C# because this way we can leverage the existing .NET libraries for accessing and sending data to Azure SQL database.
You might be wondering the reason to use a C# application for such a simple process. The Data Factory service (which could be used for such task as it’s marketed as a cloud ETL service) was the first selection. However, during the development it has proven to be very difficult to develop and debug and it was very limiting. There’s definitively a lot of room for improvement and progress.
The database hosted on the Azure SQL Database service is modelled as a simple star-schema data mart. The data is transformed and loaded into fact and dimension tables using the T-SQL stored procedures; the T-SQL language is very powerful for transformation and set-based logic. The data is also not moved around (better performance) as everything happens within the database.
PowerBI is a full-featured BI platform that can do data loading, transformation and visualization of data. However, the transformation and loading part can be a bit difficult to handle in PowerBI. We also believe that the best place to do the data transformations is in a properly modelled data warehouse – it’s faster and more powerful. In the OLAP (or visualization) layer, you sometimes have to define additional calculations (like ratios, comparisons, etc.), but for everything else you have a data warehouse. The PowerBI reports that were authored locally using the desktop app are uploaded to the PowerBI cloud service and automatically refreshed every day (that’s the limitation of the free PowerBI edition).
Here is a sneek preview of the dashboards:
Conclusion (and Pricing)
As you can see, Microsoft Azure gives us some interesting possibilities to host our applications, data and visuals. The most interesting point is the reliability, scalability and the cost-effectiveness. The Azure PaaS services (that are not in Preview mode) are managed and have a guaranteed SLA. The components (generally) work well together. The performance tier can be easily upgraded or downgraded. The pricing is also generally low. In our case the total monthly cost for running this solution is:
- WebApp (Basic B1 performance tier) @ 29.56 CHF/month
- Azure Blob Storage @ 0.02 CHF/month
- Azure SQL Database (Basic performance tier) @ 4.5 CHF/month
- Total: 34.08 CHF/month
Latest posts by Saso Koren (see all)
- Building an End-to-End Analytics Solution Using Azure Cloud Components - February 21, 2017