Web activities in Azure Data Factory
Introduction
When we think of data sources, we typically think of relational databases, NoSQL databases, file-based data sources, data lakes, or data warehouses. The web pages that are made available to the public or private users of the internet constitute one significant unstructured or semi-structured data source.
Text, images, media, and other elements can be found on these pages. Web tables, on the other hand, are one of the most valuable components of web pages from a data perspective because they can be directly mapped to data objects in data repositories or even stored as files.
Web scraping is a well-known method for reading or extracting data from web pages.
Web scraping isn't a new technique. Most well-known frameworks and programming languages, such as R, Python, .NET, and Java, provide libraries that can scrape data from the web directly, convert and parse it into JSON format, and process it as desired.
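To make the idea concrete, here is a minimal sketch in Python that uses only the standard library to read a table out of HTML and convert it to JSON. The class name TableParser, the helper table_to_records, and the sample HTML are all made up for illustration; a real scraper would first download the page, and this sketch does not handle nested tables or merged cells.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows (lists of cell text)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_records(table):
    """Treat the first row as the header and map each later row to a dict."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Made-up sample markup standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Example A</td></tr>
  <tr><td>2</td><td>Example B</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
records = table_to_records(parser.tables[0])
print(json.dumps(records, indent=2))
```

Each table row becomes one JSON object keyed by the header cells, which is the shape most data stores and file sinks expect.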
Extract Transform Load (ETL) services are now available in a Platform-as-a-Service model thanks to cloud computing platforms. This makes it possible to create data pipelines that run on a managed infrastructure.
For large-scale web scraping, such as scraping data from Wikipedia at regular intervals across hundreds to thousands of pages, a data pipeline that runs on managed cloud infrastructure is usually preferable to custom code running on a single virtual machine.
Azure Data Factory is the Azure ETL service for building data pipelines. It also supports pipelines that web-scrape data.
Using Azure Data Factory to build data pipelines for web scraping
To web-scrape data, the first thing we need is the data itself. It should be presented as one or more tables on a website that is accessible to the general public. Wikipedia is one of the easiest places to find such pages.
For this exercise, you can use any publicly accessible webpage with a table as a data source. We will use a Wikipedia page, which we'll scrape with the data pipeline we build in Azure Data Factory.
There are numerous tables on this page, and scrolling down reveals more of them. We've chosen a page with multiple tables because, in practice, we may want to scrape one or more of the tables on a page.
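When a page holds several tables, the pipeline has to be told which one to read. In Azure Data Factory's web table connector this is done with a zero-based index on the dataset. The sketch below builds such a dataset definition as a Python dictionary and prints it as JSON. The function name, the dataset and linked service names, and the page path are placeholders assumed for illustration, and the property layout should be checked against the current connector documentation before use.

```python
import json


def web_table_dataset(name, linked_service, path, index):
    """Build a WebTable dataset definition that targets one table on a page.

    path  -- page location relative to the URL of the Web linked service
    index -- zero-based position of the table on the page
    """
    if index < 0:
        raise ValueError("index must be zero or greater")
    return {
        "name": name,
        "properties": {
            "type": "WebTable",
            "typeProperties": {"index": index, "path": path},
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
        },
    }


# Hypothetical names: pick the second table (index 1) on a placeholder page.
dataset = web_table_dataset(
    name="WikipediaTableDataset",
    linked_service="WikipediaLinkedService",
    path="Example_page",
    index=1,
)
print(json.dumps(dataset, indent=2))
```

Scraping several tables from the same page would then mean one dataset per table, each with a different index.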
Now that we have identified the data source, it is time to begin creating our data pipeline. This assumes you have the necessary access to the Azure Data Factory service on the Azure platform.
Open the Data Factories service in the Azure portal, and the dashboard page listing all your Azure Data Factory instances will open. If this is your first time using Azure Data Factory, you might not have created any instances yet.
To create a new instance of Azure Data Factory, select the Add button. Provide the basic information (your subscription, the resource group, and the name of the data factory instance), then click the Create button to create the new data factory instance.
Click the name of the instance to open its dashboard. The "Author and Monitor" link should be visible in the middle of the screen.
Clicking this link opens the Azure Data Factory portal, which serves as the development and administrative console for data pipelines. Our data pipeline, which will web-scrape data from the webpage we identified earlier, will be built and hosted here.
We intend to copy data from the web table on the page, so we will begin with the Copy Data feature. Clicking the "Copy data" link immediately starts the process of creating a new data pipeline that copies data from one location to another.
In the first step, we need to provide a name for the task and, optionally, a description. Then, we need to provide the scheduling information for running this task. For now, we can continue with the "Run once now" selection; this can be changed later. When you are finished, select the Next button to move on to the next step.
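If the task later needs to run on a schedule instead of once, Azure Data Factory expresses the schedule as a trigger attached to the pipeline. The sketch below builds a schedule trigger definition as a Python dictionary. The trigger name, pipeline name, and start time are placeholders assumed for illustration, and the exact property layout should be confirmed against the schedule trigger documentation.

```python
import json


def schedule_trigger(name, pipeline, frequency, interval, start_time):
    """Build a ScheduleTrigger definition that runs one pipeline repeatedly.

    frequency  -- recurrence unit, for example "Hour", "Day" or "Week"
    interval   -- number of units between runs
    start_time -- ISO 8601 timestamp of the first run, in UTC
    """
    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": frequency,
                    "interval": interval,
                    "startTime": start_time,
                    "timeZone": "UTC",
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "type": "PipelineReference",
                        "referenceName": pipeline,
                    }
                }
            ],
        },
    }


# Hypothetical example: run the scraping pipeline once a day.
trigger = schedule_trigger(
    name="DailyScrapeTrigger",
    pipeline="CopyWebTablePipeline",
    frequency="Day",
    interval=1,
    start_time="2024-01-01T00:00:00Z",
)
print(json.dumps(trigger, indent=2))
```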
I hope this article was helpful to you.