Housden996
Helper I

Ingesting 1.5TB of data

Hi 

 

Not necessarily a pipeline question, but what would be the best way to ingest a 1.5TB database into Fabric?

I feel like a notebook into a KQL DB would be the most optimised approach. I am going to run through tutorials such as the taxi data one, but I am wondering if anyone has had any experience importing data at this scale into Fabric.

 

Thanks

1 ACCEPTED SOLUTION
v-nikhilan-msft
Community Support

Hi @Housden996 
Thanks for using Fabric Community.
Ingesting large-scale data into Microsoft Fabric can be achieved through several methods, each with its own strengths depending on your specific use case.

Here are some options:
COPY (Transact-SQL): This method offers flexible, high-throughput data ingestion from an external Azure storage account. You can use the COPY statement as part of your existing ETL/ELT logic in Transact-SQL code.

Data Pipelines: Pipelines offer a code-free or low-code experience for data ingestion. Using pipelines, you can orchestrate robust workflows for a full Extract, Transform, Load (ETL) experience that includes activities to help prepare the destination environment, run custom Transact-SQL statements, perform lookups, or copy data from a source to a destination.

Dataflows: An alternative to pipelines, dataflows enable easy data preparation, cleaning, and transformation using a code-free experience.

Cross-Warehouse Ingestion: Data ingestion from workspace sources is also possible. This scenario might be required when you need to create a new table with a subset of a different table, or as a result of joining different tables in the warehouse and the lakehouse.

For large-scale data ingestion, you might want to consider using data pipelines or the COPY statement for the highest data ingestion throughput possible. You can also use Dataflow Gen2. Dataflows provide a low-code interface for ingesting data from hundreds of data sources and transforming it with 300+ transformations. A rough sketch of the COPY route is shown below.
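As a rough illustration only (not something the linked docs prescribe), driving the COPY statement from Python with pyodbc might look like the sketch below. The warehouse endpoint, database, table name, storage path and SAS token are all placeholders, and it assumes the data has first been staged as Parquet files in an external Azure storage account.

```python
# Minimal sketch: run the COPY (Transact-SQL) statement against a Fabric Warehouse
# from Python via pyodbc. Every name, endpoint and secret below is a placeholder, and
# it assumes the source data has already been staged as Parquet in Azure storage.
import pyodbc

# The SQL connection string can be copied from the warehouse settings in the Fabric portal.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-sql-endpoint>;"
    "Database=<warehouse-name>;"
    "Authentication=ActiveDirectoryInteractive;"
)

copy_sql = """
COPY INTO dbo.big_table
FROM 'https://<storageaccount>.blob.core.windows.net/<container>/big_table/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
"""

cursor = conn.cursor()
cursor.execute(copy_sql)  # the COPY statement does the heavy lifting inside the warehouse
conn.commit()
cursor.close()
conn.close()
```

Splitting the staged data into many files generally lets COPY parallelise the load, which matters at the 1.5TB scale.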

For more information refer to these links:
https://www.youtube.com/watch?v=sXZkrFtN5oc
Ingesting data into the warehouse - Microsoft Fabric | Microsoft Learn
https://www.linkedin.com/pulse/mastering-data-loading-microsoft-fabric-comprehensive-machado-ph-d-/
https://community.fabric.microsoft.com/t5/Desktop/Need-help-on-handling-the-large-volume-of-data/td-...
https://www.linkedin.com/pulse/data-ingestion-patterns-microsoft-fabric-sankha-chakraborty-zei4c/

Hope this helps. Please let me know if you have any further questions.


6 REPLIES
jwinchell40
Advocate I

@Housden996 

  1. Where is the database hosted and what type is it? 
  2. How many tables would you be ingesting?
  3. Would you be looking to sunset this DB or are you just trying to clone it?

If it is in Azure, the new Mirroring integration may simplify things for you:  Mirroring - Microsoft Fabric | Microsoft Learn

1. This is a MySQL server hosted on the client side.

2. Unsure how many tables; they didn't let me know.

3. Clone for the time being; it could be sunsetted once migrated over to MongoDB.

 

Unfortunately, I don't think I can mirror this type of DB.
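For context, the notebook route I was picturing looks roughly like the sketch below: read each MySQL table over JDBC from a Fabric Spark notebook and land it as a Delta table in a Lakehouse (just one possible destination; a KQL database would use a different ingestion path). The host, credentials, table name and partition column are placeholders, and it assumes the MySQL JDBC driver is available to the Spark session.

```python
# Rough sketch: pull one MySQL table over JDBC from a Fabric Spark notebook and land it
# as a Delta table in the attached Lakehouse. Host, credentials, table and partition
# column are placeholders; assumes the MySQL JDBC driver is available to the session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` in a Fabric notebook

df = (
    spark.read.format("jdbc")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("url", "jdbc:mysql://<client-host>:3306/<database>")
    .option("dbtable", "orders")                # hypothetical table name
    .option("user", "<user>")
    .option("password", "<password>")
    # Parallelise the read so 1.5TB is not pulled through a single connection.
    .option("partitionColumn", "id")            # assumes a numeric, evenly distributed key
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "64")
    .load()
)

# Write to the default Lakehouse as a Delta table.
df.write.mode("overwrite").format("delta").saveAsTable("orders")
```

Looping over the table list (e.g. from information_schema) would extend this to the whole database.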

v-nikhilan-msft
Community Support


Hi @Housden996 
We haven't heard back from you since the last response and were just checking to see if your query has been resolved. If not, please reply with more details and we will try to help.
Thanks


Thanks, and apologies for the delay.

 

I will be going down the cloning route. How the potential client currently uses Tableau Server is a bit unusual: they don't really use it for reporting, more just for working with the data, so notebooks and the ML side of Fabric are more useful to them.

 

I have looked into the COPY feature and this seems more cost-effective for the client!
