Open source data integration tools can be a low-cost alternative to commercial packaged data integration solutions.
If you do not have the time or resources in-house to build a custom ETL solution — or the funding to purchase one — an open source solution may be a practical option. Further, open source ETL solutions can be a great fit for smaller projects, or places where data analysis is not mission critical. Keep in mind that most open source ETL solutions will still require some configuration and setup work (if not actual coding). So even if you avoid having to hand code a solution, you still may need to have some systems or programming expertise available.
Pentaho Kettle is the component of Pentaho responsible for the ETL processes. It enables users to ingest, blend, cleanse, and prepare diverse data from any source. Pentaho also includes in-line analytics and visualization tools. This community version is free, but offers fewer capabilities than the paid version.
Talend Open Studio
Talend offers Open Studio for Data Integration as a limited-functionality open source (Apache license) version of its Data Management Platform. It offers connectors for various RDBMS, SaaS, packaged apps, and technologies.
Spectra’s ETL Implementation Methodology
- Accurately identify the business information that must be contained in the Warehouse
- Identify and prioritize subject areas to be included in the Data Warehouse
- Define the correct level of summarization to support business decision making
- Establish a refresh program that is consistent with business needs, timing and cycles
- While an Enterprise Data Store and Metadata Store(s) are always included in a sound Data Warehouse design, the specific number of Data Marts (if any) and the need for an Operational Data Store are judgment calls. Potential Data Warehouse configurations should be evaluated and a logical architecture determined according to business requirements.
- During the Architecture Review and Design stage, the logical Data Warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the Warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores
- Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. Requirements of these four architectures are carefully analyzed so that the Data Warehouse can be optimized to serve the users
- Gap analysis is conducted to determine which components of each architecture already exist in the organization and can be reused, and which components must be developed (or purchased) and configured for the Data Warehouse.
- Finally, the detailed design of all procedures for the implementation project is completed and documented. Procedures to achieve the following activities are designed
- Warehouse Capacity Growth
- Data Extraction/Transformation/Cleansing
- Data Load
- Data Refresh
- Data Access
- Backup and Recovery
- Disaster Recovery
- Data Archiving
- Configuration Management
- Extract, cleanse, aggregate, transform
- Programs are developed to extract, cleanse, transform and load the source data and to periodically refresh the existing data in the Warehouse, and the programs are individually unit tested against a test database with sample source data. Metrics are captured for the load process. The metadata repository is loaded with transformational and business user metadata. Canned production reports are developed and sample ad-hoc queries are run against the test database, and the validity of the output is measured. User access to the data in the Warehouse is established. Once the programs have been developed and unit tested and the components are in place, system functionality and user acceptance testing is conducted for the complete integrated Data Warehouse system. System support processes of database security, system backup and recovery, system disaster recovery, and data archiving are implemented and tested as the system is prepared for deployment.
- Validate the data to ensure accuracy and consistency
- The final step is to conduct the Production Readiness Review prior to transitioning the Data Warehouse system into production. During this review, the system is evaluated for acceptance by the customer organization
- The Transition to Production stage moves the Data Warehouse development project into the production environment. The production database is created, and the extraction/cleanse/transformation routines are run on the operations system source data. The development team works with the Operations staff to perform the initial load of this data to the Warehouse and execute the first refresh cycle. The Operations staff is trained, and the Data Warehouse programs and processes are moved into the production libraries and catalogs. Rollout presentations and tool demonstrations are given to the entire customer community, and end-user training is scheduled and conducted. The Help Desk is established and put into operation. A Service Level Agreement is developed and approved by the customer organization. Finally, the new system is positioned for ongoing maintenance through the establishment of a Change Management Board and the implementation of change control procedures for future development cycles.