Introduction
Migrating data from on-premises Hadoop Distributed File System (HDFS) to Microsoft Fabric Lakehouse represents a critical modernisation step for organisations seeking to leverage cloud-native analytics capabilities. As enterprises increasingly adopt hybrid cloud architectures, the ability to seamlessly transfer large-scale data workloads while maintaining security and performance has become paramount.
Microsoft Fabric Lakehouse combines the flexibility of a data lake with the structured querying capabilities of a data warehouse, providing organisations with a unified analytics platform built on OneLake storage. This migration pathway enables companies to break free from legacy infrastructure constraints while unlocking advanced analytics, machine learning, and real-time processing capabilities.
With Fabric’s native HDFS connector, some users have faced challenges with connectivity, authentication, and network stability, especially in environments with strict firewall policies. This article explains how to overcome these limitations by using a Self-Hosted Integration Runtime (SHIR) within an Azure Synapse workspace as a secure bridge between the on-premises environment and Fabric Lakehouse. The guide walks through every critical step, from firewall configuration to data validation, to ensure the migration is both secure and efficient.
Foundation Setup: Preparing Your Migration Environment
Before initiating your HDFS to Fabric Lakehouse migration, ensure you have established the proper foundation. Your Self-Hosted Integration Runtime should be installed on a dedicated virtual machine within your on-premises network, acting as the secure communication bridge between your Hadoop cluster and Microsoft Fabric.
Key Requirements:
• Firewall rules configured to allow connectivity to HDFS.
• Proxy settings configured correctly on the VM.
• Self-Hosted Integration Runtime (SHIR) installed on a VM with the correct proxy configuration.
Firewall rules configuration:
Network security forms the backbone of successful HDFS migration. Your firewall configuration must allow specific ports and protocols while maintaining security best practices.
Essential ports to be whitelisted in Firewall Rules:
1. Outbound HTTPS (443)
2. HDFS NameNode Port (50070)
3. HDFS DataNode Ports (50010-50020)
4. Azure Relay Ports (5671, 5672)
Pro Tip: Test connectivity using PowerShell commands before proceeding:
o Open PowerShell and run the following command to test network connectivity to a specific port (a sketch that loops over all of the required ports follows below):
o Test-NetConnection -ComputerName <HDFS_Server> -Port <Port_Number>
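A minimal PowerShell sketch that checks the HDFS ports listed above in one pass. <HDFS_Server> is a placeholder for your NameNode/DataNode host, and the port list assumes the default Hadoop 2.x ports; adjust both to your cluster:

# Minimal sketch: verify the SHIR VM can reach the HDFS endpoints on the required ports.
# (Ports 443, 5671 and 5672 are outbound from the SHIR VM to Azure, so they are not tested against the HDFS host.)
$hdfsHost = "<HDFS_Server>"
$ports = @(50070) + (50010..50020)
foreach ($port in $ports) {
    $result = Test-NetConnection -ComputerName $hdfsHost -Port $port -WarningAction SilentlyContinue
    "{0}:{1} -> {2}" -f $hdfsHost, $port, $(if ($result.TcpTestSucceeded) { "OPEN" } else { "BLOCKED" })
}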

Proxy settings configuration:
If the proxy settings are incorrect, WebHDFS will not be accessible even when the firewall rules are in place. The proxy can be verified and configured with the following steps:
o Open a browser on the VM and try to connect to the WebHDFS UI.
o Ensure port 443 is enabled for the HDFS subnet.
o Configure the correct bypass entries in the proxy settings on the VM.
NOTE: It is also worth confirming the proxy settings from the command line with the following command:
netsh winhttp show proxy

If the correct proxy settings are not returned, set them from the command line, for example:
netsh winhttp set proxy proxy-server="http://{your_address}:8080" bypass-list=""
(Add the HDFS host names to bypass-list if they should not be routed through the proxy.)
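To confirm that WebHDFS itself is reachable from the VM, and not just that the ports are open, a quick check against the WebHDFS REST API can be run from PowerShell. This is a minimal sketch; <HDFS_NameNode>, the default port 50070, and the /data path are placeholders for your environment:

# Minimal sketch: list a directory through the WebHDFS REST API to confirm end-to-end access.
$uri = "http://<HDFS_NameNode>:50070/webhdfs/v1/data?op=LISTSTATUS"
try {
    $response = Invoke-WebRequest -Uri $uri -UseBasicParsing
    "WebHDFS reachable, HTTP {0}" -f $response.StatusCode
} catch {
    "WebHDFS not reachable: $($_.Exception.Message)"
}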

Install and Register SHIR on the VM with the Right Proxy Settings
Install SHIR:
o Download the latest version of the SHIR application from the official Microsoft download page.
o Install the application on the VM and register it against the Synapse workspace using the integration runtime authentication key.
Configure Proxy Details for SHIR:
1. Edit Configuration Files:
o Open diahost.exe.config and diawp.exe.config in the SHIR installation directory and configure the proxy details.
Add the proxy address and bypass entries, as shown in the sketch below:
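A minimal sketch of the <system.net> section that goes into those configuration files, based on the standard .NET defaultProxy settings. The proxy address and the bypass host pattern are placeholders for your environment:

<!-- Minimal sketch: defaultProxy configuration for the SHIR config files. -->
<!-- Replace the proxy address and the bypass pattern with your own values. -->
<system.net>
  <defaultProxy enabled="true">
    <proxy bypassonlocal="true" proxyaddress="http://{your_address}:8080" />
    <bypasslist>
      <add address="your-hdfs-host\.yourdomain\.com" />
    </bypasslist>
  </defaultProxy>
</system.net>

Restart the Integration Runtime service after saving the files so that the new proxy settings take effect.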

2. Check SHIR Status after updating the proxy settings:
o Ensure the Integration Runtime app on the VM is running, and confirm that the runtime also shows as Running in the Synapse workspace.
o If it does not, check the error logs in the Event Viewer for configuration issues.

Make sure the application is configured to use the system proxy; the proxy setting can be reviewed in the Microsoft Integration Runtime Configuration Manager on the VM.
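A quick way to confirm the runtime picks up the new settings is to check and restart its Windows service. This sketch assumes the default service name DIAHostService used by the Microsoft Integration Runtime:

# Minimal sketch: check the Integration Runtime Windows service and restart it
# so that updated proxy settings are picked up (the service name DIAHostService is assumed here).
Get-Service -Name "DIAHostService" | Select-Object Status, Name, DisplayName
Restart-Service -Name "DIAHostService"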


Configuring the Data Pipeline
1. Configure Linked Service:
o Create a linked service in Synapse with the HDFS connection details.
o Select the SHIR as the integration runtime and test the connection to ensure it is successful; a sketch of the resulting definition follows below.
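For reference, a minimal sketch of an HDFS linked service definition in JSON. It assumes anonymous authentication, the default WebHDFS port 50070, and an integration runtime named SelfHostedIR; all of these are placeholders for your environment:

{
  "name": "HdfsLinkedService",
  "properties": {
    "type": "Hdfs",
    "typeProperties": {
      "url": "http://<HDFS_NameNode>:50070/webhdfs/v1/",
      "authenticationType": "Anonymous"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}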

2. Create Datasets
1. HDFS Source Dataset:
o Create a dataset using the linked service for HDFS.

2. Lakehouse Files Destination Dataset:
o Create a dataset for the Lakehouse files using the built-in integration runtime.
o Define the Lakehouse files directory in the dataset.

• Note that the Lakehouse Files dataset in Synapse does not allow parameters to be passed dynamically, unlike in Azure Data Factory.
• Hence, a new sink dataset must be created for each table/source directory in HDFS.
3. Create Synapse Data Pipeline
• Copy Activity:
o Create a Synapse data pipeline and add a copy activity.
o For the HDFS dataset, pass the folder dynamically using a pipeline parameter: create a parameter dir in the dataset and use it in the folder path (see the sketch after this list).

• Use *.snappy.parquet as the wildcard file name, and select the recursive option so that multiple files can be copied in parallel.
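A minimal, hedged sketch of how the dir parameter and the wildcard settings can look in JSON, assuming Parquet source files and the HdfsLinkedService name used in the earlier sketch:

{
  "name": "HdfsSourceDataset",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": { "referenceName": "HdfsLinkedService", "type": "LinkedServiceReference" },
    "parameters": { "dir": { "type": "string" } },
    "typeProperties": {
      "location": {
        "type": "HdfsLocation",
        "folderPath": { "value": "@dataset().dir", "type": "Expression" }
      }
    }
  }
}

In the copy activity source, the wildcard and recursive options map to the store settings:

"source": {
  "type": "ParquetSource",
  "storeSettings": {
    "type": "HdfsReadSettings",
    "recursive": true,
    "wildcardFileName": "*.snappy.parquet"
  }
}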


• Preview Data:
o Preview the source data before running the pipeline in debug mode.

• Run Pipeline & confirm data in Lakehouse:
o Run the pipeline once the data preview looks good.
o Check the Lakehouse files directory to confirm that the files from the HDFS source directory have been copied successfully.
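As a simple validation step, the number of source files can be counted through the WebHDFS REST API and compared with what the copy activity reports and what the Lakehouse Files view shows. A minimal sketch, with <HDFS_NameNode> and /data/my_table as placeholders:

# Minimal sketch: count the source files in HDFS via WebHDFS so the number can be
# compared with the files landed in the Lakehouse Files directory.
$uri = "http://<HDFS_NameNode>:50070/webhdfs/v1/data/my_table?op=LISTSTATUS"
$listing = Invoke-RestMethod -Uri $uri
$files = $listing.FileStatuses.FileStatus | Where-Object { $_.type -eq "FILE" }
"Source file count: {0}" -f $files.Count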

CONCLUSION:
Successfully migrating from HDFS to Microsoft Fabric Lakehouse requires careful planning, robust security measures, and systematic execution. This modernisation journey positions an organisation to leverage cutting-edge analytics capabilities while maintaining data governance and security standards.
The Self-Hosted Integration Runtime serves as your trusted bridge, enabling secure data movement without compromising your existing security posture. By following these detailed steps and best practices, one can achieve a smooth transition that unlocks the full potential of cloud-native analytics.
Samil is an experienced Big Data Engineer specializing in cloud-based data solutions, real-time analytics, and secure PII (Personally Identifiable Information) data handling. With expertise in AWS, Azure, Databricks, Spark, and Terraform, he has played a key role in migrating data to cloud lakehouses, implementing real-time analytics, automating DevOps workflows, and optimizing cloud-based ETL pipelines. He holds a master’s degree in information technology from The University of Auckland with a focus on Data and AI. Beyond work, Samil is passionate about technology and data-driven decision-making. He enjoys making content for social media, playing FIFA, table tennis and squash, and operating HAM radio.


