- December 9, 2024
- by Admin
- Data Extraction
Data extraction techniques and data extraction types play distinct roles in retrieving data from diverse sources. Techniques are the specific methods and tools used to access data and pull it from its origin, such as web scraping, APIs, and optical character recognition (OCR). Types, by contrast, are the strategies that govern how much data is extracted and when. They are generally classified as full extraction, incremental extraction (batch or stream), Change Data Capture (CDC), Slowly Changing Dimensions (SCD), and manual extraction. The most prevalent types are described below.
Full extraction
Full extraction retrieves the complete dataset from a source system, transferring all data from the source to the target in every extraction cycle. After each run the target holds a complete copy of the source as of the time of extraction. The method is relatively simple to implement because it needs no change tracking, but it can demand significant resources, which makes it most appropriate for initial loads, data migrations, and backups.
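As a rough illustration, a full extraction can be sketched with Python's built-in sqlite3 module. The `customers` table, its columns, and the `full_extract` helper are assumptions made up for this example, and two in-memory databases stand in for the real source and target systems.

```python
import sqlite3


def full_extract(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Replace the target's copy of `customers` with every row from the source."""
    rows = source.execute("SELECT id, name FROM customers").fetchall()
    with target:  # single transaction: delete and reload commit together
        target.execute("DELETE FROM customers")
        target.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
    return len(rows)


# Hypothetical source and target, both in memory for the sake of the example.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

source.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
source.commit()

copied = full_extract(source, target)
print(copied, "rows copied")
```

Every cycle reads and rewrites the whole table, which is why the cost grows with the size of the source rather than with the amount of change.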
Incremental batch extraction
Incremental batch extraction captures only the data that has changed since the previous extraction. Newly added or updated records are identified and extracted in batches, typically by means of a timestamp column or a change tracking system. This improves efficiency and reduces resource consumption, but it requires a change tracking mechanism, which adds complexity to the extraction process.
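The timestamp approach is often implemented with a "watermark": the extractor remembers the latest modification time it has seen and asks only for newer rows on the next run. The sketch below assumes a hypothetical `customers` table with an `updated_at` column holding ISO 8601 strings (which sort correctly as text); the `extract_since` helper is likewise invented for illustration.

```python
import sqlite3


def extract_since(source: sqlite3.Connection, watermark: str):
    """Return rows modified after `watermark`, plus the new watermark."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark only if something was extracted.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-12-01T10:00:00"), (2, "Grace", "2024-12-02T09:30:00")],
)

# First run: an empty watermark sorts before every timestamp, so all rows match.
first_batch, watermark = extract_since(source, "")

# A later change in the source is the only row picked up by the next run.
source.execute(
    "UPDATE customers SET name = 'Ada L.', updated_at = '2024-12-03T08:00:00' "
    "WHERE id = 1"
)
second_batch, watermark = extract_since(source, watermark)
print(len(first_batch), len(second_batch))
```

One limitation of a plain timestamp column is that rows deleted from the source simply disappear and are never seen by the query, so deletions need separate handling.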
Incremental stream extraction
Incremental stream extraction continuously monitors a source and extracts data changes in real time or near real time. Changes are captured and processed as they occur, so downstream systems receive updates promptly.
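To make the contrast with batch extraction concrete, here is a minimal sketch of a stream consumer. A `queue.Queue` stands in for whatever change feed or message broker a real pipeline would use, and the event shape, the `consume` function, and the `STOP` sentinel are all assumptions for the example.

```python
import queue
import threading

events = queue.Queue()  # stand-in for a change feed or message broker
STOP = object()         # sentinel that tells the consumer to shut down


def consume(target: dict) -> None:
    """Apply each change to the target as soon as it arrives."""
    while True:
        event = events.get()  # blocks until the next change is available
        if event is STOP:
            break
        key, value = event
        target[key] = value   # the downstream copy is updated immediately


replica = {}
worker = threading.Thread(target=consume, args=(replica,))
worker.start()

# The source emits changes one at a time instead of in scheduled batches.
events.put(("order-1", "created"))
events.put(("order-1", "shipped"))
events.put(STOP)
worker.join()

print(replica)
```

There is no extraction schedule here: the consumer is always running, and the delay between a change and its arrival downstream is only the time it takes to process one event.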
Change data capture (CDC)
Change Data Capture (CDC) is an incremental stream extraction technique that tracks and captures changes made to data in a database or other data source. It identifies and records insertions, updates, and deletions, allowing these changes to be replicated to other systems or downstream applications in real time or near real time. This low-latency replication and synchronization supports use cases such as data warehousing, data integration, and business intelligence. However, CDC can be complex to implement and may add overhead on the source system.
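One simple way to see CDC in action is a trigger-based change table, sketched below with sqlite3. This is only one of several CDC approaches (production systems frequently read the database's transaction log instead), and the table names, trigger names, and `read_changes` helper are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Change table: one row per insert, update, or delete on customers.
CREATE TABLE customers_changes (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,
    op   TEXT,
    id   INTEGER,
    name TEXT
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('D', OLD.id, OLD.name);
END;
""")


def read_changes(after_seq: int):
    """Return every captured change with a sequence number above `after_seq`."""
    return db.execute(
        "SELECT seq, op, id, name FROM customers_changes WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


db.execute("INSERT INTO customers VALUES (1, 'Ada')")
db.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
db.execute("DELETE FROM customers WHERE id = 1")

changes = read_changes(0)
for change in changes:
    print(change)
```

Unlike a timestamp filter, the change table records the delete as well. The cost is visible too: every write to `customers` now triggers a second write, which is the kind of source-side overhead mentioned above.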
Slowly changing dimensions (SCDs)
Slowly Changing Dimensions (SCDs) are a data warehousing method, applied as incrementally extracted changes are loaded, for managing changes in dimensional data over time. The approach governs how changes to attributes such as customer addresses or product prices are recorded, so that historical data stays intact and accurate. This makes querying and analysing dimensional data as it stood at any point in time more efficient.
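SCDs come in several variants. The sketch below shows the one commonly called Type 2, in which a change closes the current row and adds a new one instead of overwriting it. Choosing Type 2 is an assumption for this example, as are the `CustomerVersion` record and the `apply_change` and `address_on` helpers.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_address: str, changed_on: date) -> None:
    """Type 2 update: close the current version and append a new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.address == new_address:
                return  # attribute unchanged, keep the current version
            row.valid_to = changed_on
    history.append(CustomerVersion(customer_id, new_address, changed_on))


def address_on(history: List[CustomerVersion], customer_id: int,
               day: date) -> Optional[str]:
    """Look up the address that was valid for a customer on a given day."""
    for row in history:
        if (row.customer_id == customer_id and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.address
    return None


history: List[CustomerVersion] = []
apply_change(history, 1, "12 Oak St", date(2023, 1, 1))
apply_change(history, 1, "98 Elm Ave", date(2024, 6, 1))

print(address_on(history, 1, date(2023, 7, 1)))
print(address_on(history, 1, date(2024, 7, 1)))
```

Because the old row is kept with its validity dates, a report run for mid-2023 still sees the address that applied then, while current queries see the new one.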
Manual extraction
Manual data extraction refers to the procedure of obtaining data from various sources without the aid of automated tools or software. This process necessitates human involvement to access, evaluate, and extract information from sources like databases, documents, or websites. Although it offers flexibility and is cost-effective, it is labour-intensive and does not scale well. This method is generally appropriate for small-scale or one-time extraction tasks, but it is not feasible for larger datasets or regular extraction cycles.