ETL Tool POC

Project Description

Request was initiated by Roald van den Berg

Introduction
As reporting across multiple systems becomes more of a priority at NWU, more systems are being integrated with the operational data store (ODS). The current ETL process that feeds the ODS is a custom-developed process that is outdated, difficult to manage and does not allow NWU to easily add systems to the ODS. Creating and maintaining ETL processes is cumbersome, with many interdependent elements where things can go wrong. Managing and verifying runs is equally cumbersome and takes up unnecessary resources.

The current process only caters for Oracle and MySQL source databases, which limits the number of systems that can be integrated with the ODS and so impedes the strategy of an integrated reporting environment. The current process also does not cater for transactions reversed in the source systems, and transferring only modified records would require changes to some of the existing systems currently in use at NWU. Since we cannot transfer only modified records for some of our existing systems, the runtime of the ETL keeps increasing, and we will soon face the problem that not all data can be transferred to the ODS in time because of the growing volume of data. An ETL tool might also allow real-time integration with the ODS, unlike the current process where runs have to be scheduled for the evening.
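To illustrate the two data problems described above, here is a minimal sketch, not the NWU ETL itself. It assumes a hypothetical transactions table with a modified_at column on both the source and the ODS side, and uses in-memory SQLite purely for demonstration. It shows that transferring only modified records keeps each run small, but a record deleted or reversed in the source is never seen by such a run and has to be detected separately, here by comparing keys.

```python
import sqlite3

# All table, column and function names here are hypothetical illustrations.

def make_db():
    """Create an in-memory database with an example transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
    )
    return conn

def incremental_sync(source, target, last_run):
    """Copy only rows modified after last_run from source to target."""
    rows = source.execute(
        "SELECT id, amount, modified_at FROM transactions WHERE modified_at > ?",
        (last_run,),
    ).fetchall()
    # Upsert: replace the target row when the primary key already exists.
    target.executemany(
        "INSERT OR REPLACE INTO transactions (id, amount, modified_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    return len(rows)

def find_orphans(source, target):
    """Return ids still in the target that no longer exist in the source."""
    source_ids = {r[0] for r in source.execute("SELECT id FROM transactions")}
    target_ids = {r[0] for r in target.execute("SELECT id FROM transactions")}
    return sorted(target_ids - source_ids)

def demo():
    source, target = make_db(), make_db()
    source.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 100.0, "2016-01-01"), (2, 250.0, "2016-01-01")],
    )
    first = incremental_sync(source, target, "2015-12-31")

    # Next day: row 1 is updated, row 2 is deleted (reversed) in the source.
    source.execute(
        "UPDATE transactions SET amount = 120.0, modified_at = '2016-01-02' "
        "WHERE id = 1"
    )
    source.execute("DELETE FROM transactions WHERE id = 2")
    second = incremental_sync(source, target, "2016-01-01")

    # The incremental run did not remove row 2 from the target.
    return first, second, find_orphans(source, target)
```

Calling demo() runs an initial load followed by one incremental run and returns the number of rows copied in each, plus the ids left behind in the target. The key comparison in find_orphans scans every key on both sides, so it is only one possible way of catching deletes and trades run time for correctness.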

Purpose
The purpose of this project is to evaluate three different ETL tools in order to determine which would best suit NWU’s requirements and address all the issues we currently face with our existing process. During the first phase we investigated six different products, and after consulting with all stakeholders that list was reduced to the three that would best address the current issues. By doing a POC on the shortlisted tools we will be able to choose the product that best satisfies NWU’s requirements. It will also help us choose a product that enables faster development and deployment.

Product shortlist
After comparing the features of a number of ETL tools with NWU’s needs, we decided on the following tools:
Talend Data Integration, JBoss Data Virtualization & Pentaho Data Integration.

Infrastructure requirements
In order to successfully complete the investigation we will need an environment to deploy the tools and perform tests. The following environment is required:
1 Linux machine
8 GB Memory
60 GB System Disk
50 GB Install Disk
4 CPUs

Documents

No documents at this time.

Project Progress

100%

Project Timing

  • Start
    Nov 24 2015
  • End
    Sep 30 2016

Overall Project Completion

100%

Project Discussion

11 Responses to ETL Tool POC

  1. Recommendation
    Although Talend makes the ETL creation process more manageable, it is still a cumbersome task and difficult to manage, since tasks are still dependent on one another and those dependencies must still be managed. Talend also provides the functionality to handle deletes in the source system, although that must be done with triggers in the source database. In my opinion Talend is not the preferred tool to address the problems we currently face with our ETL/ODS.

    JBoss Data Virtualization (JDV) eliminates the need for ETL runs since the data is read directly from the data source. This would provide users with real-time data, which is not currently the case. Since the data is read directly from the production environment, the issue we currently face with deletes is also addressed and users will not have deleted transactions on their reports. Since JDV does not have ETL steps that are dependent on one another, the risk of having inconsistent data is also eliminated. JDV would also make ad-hoc reporting easier, as views can easily be set up using different sources.

    Some concerns have been raised regarding JBoss Data Virtualization reading data directly from the production environment. We tried to test the load on the servers, but OI could not see any spikes even though we used the most resource-intensive report. Since JDV supports multiple data sources, one could point real-time reports and reports affected by deletes at the production systems, and point reports that need not be real-time and are not affected by deletes at a copy of production, thus reducing the load on the production servers.

    In my opinion JDV addresses all the problems we currently face and is relatively easy to use.

    October 4, 2016 at 8:29 am
    ROALD VAN DEN BERG
  2. Project will be extended by a month to complete stress testing on the JDV environment.

    August 24, 2016 at 8:43 am
    ROALD VAN DEN BERG
  3. JBoss Data Virtualization POC completed. Will provide feedback and comparison between Talend & JBoss within the next couple of weeks.

    July 14, 2016 at 1:35 pm
    ROALD VAN DEN BERG
  4. Red Hat JBoss Data Virtualization Proof of Concept is being scoped; quote received. Final arrangements to be made for execution of the POC in the upcoming month.

    May 31, 2016 at 10:07 am
    MARI PRINSLOO
  5. Talend POC was done on 18 & 19 April.

    Received feedback from LSD regarding the POC and will determine the way forward next week. Detailed feedback will be provided on both tools once the JBoss DV POC has been completed.

    May 20, 2016 at 10:30 am
    ROALD VAN DEN BERG
  6. Still waiting for Red Hat; will follow up on the JBoss Data Virtualization license. Will have a discussion with DataWave regarding Talend on Monday 7/3.

    March 3, 2016 at 4:19 pm
    ROALD VAN DEN BERG
  7. Infrastructure available. LSD is busy negotiating a 90-day full evaluation with support from Red Hat. Waiting for their feedback.

    February 26, 2016 at 12:44 pm
    ROALD VAN DEN BERG
  8. Still waiting for resources – expect feedback on 15/02/2016

    February 11, 2016 at 10:30 am
    ROALD VAN DEN BERG
  9. Waiting on infrastructure – will probably have it by end Jan.

    January 22, 2016 at 9:09 am
    ROALD VAN DEN BERG
  10. Summary of kick-off meeting held 4-DEC-2015
    Present:
    Hannes, Riaan, Phillip, Liaan & Roald

    Tools that will be part of POC
    1. Talend
    2. JBoss data virtualization
    3. Kafka

    Server Requirements for Tools
    1x Linux Machine
    4 CPUs
    60 GB System disk
    100 GB Install disk
    16 GB RAM
    OI will be needed to deploy the tools, but that will be discussed at a later stage since this meeting was only to give background on the project and provide OI with the infrastructure needs.

    Clone of current ODS dev environment
    approx. 8 CPUs
    12 GB RAM
    600 GB Disk (OI will check and amend the request accordingly)

    BSS requested the servers end of Jan 2016 – OI will put in a request for the infrastructure and let us know if that is possible/when it will be available.

    December 4, 2015 at 9:34 am
    ROALD VAN DEN BERG
  11. Kick-off meeting will be held 4-Dec-2015 to discuss environments needed for the project.

    December 2, 2015 at 8:52 am
    ROALD VAN DEN BERG
