Tuesday, September 29, 2020

Ubuntu 20.04 Docker Installation

 Recently, I was following the installation instructions at https://docs.docker.com/engine/install/ubuntu/ 

So, I updated the apt package index and installed packages to allow apt to use a repository over HTTPS:

    • sudo apt-get update

    • sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common

Next, added Docker’s official GPG key:

    • curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Verified the key by searching on its fingerprint (using the last 8 characters of the fingerprint):

    • sudo apt-key fingerprint 0EBFCD88

So far, all went well.

However, when the following command was used to set up the stable (as well as the nightly and test) repository:

    • sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable nightly test"

This error was thrown:

    • The repository 'https://download.docker.com/linux/ubuntu/dists focal Release' does not have a Release file. 

The source of the error was the deb references to docker.com in the sources.list file:

    • sudo nano /etc/apt/sources.list

After much trial and error, all of the original docker.com references in sources.list were commented out. This is the docker.com reference that works, and which was added to sources.list:

    • deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable nightly test

The following command, used to set up the stable (as well as the nightly and test) repository, then worked successfully:

    • sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable nightly test"

Next, updated the apt package index:

    • sudo apt-get update

And then ran this command to install Docker:

    • sudo apt-get install docker-ce docker-ce-cli containerd.io

To check the state of Docker after the installation, run this command:

    • docker --version

Which returned the message:

Docker version 19.03.13, build 4484c46d9d

Then, since Docker is installed, try this command to set up and run the doc-rabbitmq Docker container, which hosts a RabbitMQ message broker:

  • docker run -p 5672:5672 --hostname doc-rabbitmq rabbitmq:3

After pages of information are displayed in the console, this output appears at the tail end:

Ready to start client connection listeners

2020-09-29 18:21:52.895 [info] <0.44.0> Application rabbitmq_prometheus started on node 'rabbit@nameko-rabbitmq'

2020-09-29 18:21:52.896 [info] <0.839.0> started TCP listener on [::]:5672

 completed with 3 plugins.

2020-09-29 18:21:53.214 [info] <0.689.0> Server startup complete; 3 plugins started.

 * rabbitmq_prometheus

 * rabbitmq_web_dispatch

 * rabbitmq_management_agent

2020-09-29 18:21:53.214 [info] <0.689.0> Resetting node maintenance status
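
As an optional sanity check from the host, a few lines of Python (a quick sketch, assuming the pika client library has been installed with pip install pika) can confirm that the broker is accepting connections on the published port 5672:

        import pika  # assumes: pip install pika

        # Connect to the RabbitMQ broker published by the container on localhost:5672
        params = pika.ConnectionParameters(host="localhost", port=5672)
        connection = pika.BlockingConnection(params)
        print("Connected to RabbitMQ:", connection.is_open)
        connection.close()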

Docker on Ubuntu 20.04 is good to go!


Monday, July 10, 2017

NiFi + Anaconda + Kafka Streams + Spark MLlib (part 1)

Intro

This is the 2nd blog in a series focused on Change Data Capture (CDC) using Apache NiFi, near-real-time transformation of CDC records using Apache Kafka Streams, and publishing those multivariate time-series data messages to a machine learning (ML) service - initially implemented in Python using the (Continuum Analytics) Anaconda data science platform, subsequently refactored and implemented using MLlib, and deployed on an Apache Spark cluster.

The BPM Use Case(s)

At the operational level, the business has a sizable staff focused on the collection of payments from outstanding loans. To be highly effective (and thereby meet their quotas, which dictate their weekly bonuses), the staff desires immediate notice when a loan payment has been missed. Of course, the staff would prefer to act as opposed to react, so they welcome usable 'heads up' news (a.k.a., predictions) about bothersome loans. In addition, the staff prefers to focus on salvaging viable loans displayed in their dashboard, and not on those that are predictably destined to become unsalvageable.

The stakeholders would prefer not to have bothersome loans in their portfolio, which make it necessary to hire said staff. However, the large volume and high velocity of the data events related to loans are beyond the ability of humans to be consistently proactive, in a timely manner, during the genesis of bothersome loans. Enter ML.

At points in the life-cycle of a loan, it has to be serviced by human beings. In addition, some loans do default. The mission of the ML service is to minimize the presence of bothersome loans, by:
  • quickly, and consistently, identifying the genesis of bothersome loans, so they can be purged from the portfolio sooner than later, thereby, in the long run,
  • minimizing staff head-count, on a given shift, on a given day.
Consequently, the savings derived from minimizing the ongoing costs of the staff, and from consistently and promptly identifying bothersome loans, must - at a certain point in time - be greater than the capital and operating expenses caused by the creation and deployment of an ML service.

In addition, there are local and national market forces, such as oil and natural gas boom and bust cycles, which act, like fast growing plagues, to drive down the value of loans originated from a specific geographic region undergoing a market/economic collapse. Evolving patterns and trends such as these stress borrowers and drive up loan defaults.

Given the magnitude (and number) of larger market forces, it is extremely challenging to sustain a real-time ML service that can accurately measure the present value of a loan, let alone predict its future distressed value (and do so far enough in advance to wisely jettison the loan from the current portfolio). In the genesis of a bothersome loan, it is difficult to see which straw breaks the camel's back, as well as when that straw (or straws) might land on said back, given where the loan currently sits.

The more complicated the business process model (BPM), the less likely the ML service will deliver a dollar value to the stakeholders which exceeds the capital and operating expenses incurred to fulfill its mission. In this use case, the business process for a loan is regulated in 50 different ways - literally - depending on which state the loan originates in and/or is serviced in; each governing state regulates the BPM activities as it deems fit. Since you can only eat an elephant one bite at a time, one state was selected by the team, for which outcomes produced by the ML service were validated and verified according to key performance indicators (KPIs) dictated by the stakeholder(s).


The Initial ML Engineering Process

Before fully funding the development and deployment of an ML service, you must, in a cost-effective way, investigate the available data sets (internal to the company, as well as external data sets), and identify what shape the ML prediction needs to take for it to satisfy the KPIs. To minimize the risk of losing capital, you must know for certain what shape the ML algorithms must take, as well as the data required to support the prediction(s), prior to developing Kafka Streams or Spark MLlib code.

Enter the Anaconda platform and Python to the rescue. Investigating source data sets to determine what the data represents, how it needs to be prepared/manipulated, and how each of its elements/attributes does (or does not) relate to the prediction task, is an iterative process. Since this initial phase often accounts for 80-90% of the design and development costs of the ML service, a cost-effective approach is demanded.

Ignoring the major level of effort required to rigorously investigate the source data sets - an oversight driven by the occupational hazard of wishful thinking - is the root cause of ill-fated ML projects. A contributing factor of failed ML services is the inability of the ML engineers to detect (sooner rather than later) the real patterns inherent within the source data set(s). Worldwide, humans tend to project onto phenomena characteristics and properties which are not inherently present, and Data Scientists are certainly not liberated from this universal habit of faulty human cognition.

During the design phase, ML engineers (I'll refrain from using the term Data Scientist, as the role is that of technician not scientist) iterate between framing the problem, selecting useful data, training the model, producing predictions, and evaluating performance of the algorithm. In a nutshell, like any capable lab technician, the ML engineer conducts a series of lengthy and complex experiments intended to figure out how things tick in the present, so that the model can dependably and accurately predict its 'tocks' in advance.

Each iteration in this process of experimentation begins by deciding what individual pieces of the source data set(s) (a.k.a., elements, attributes) can be used to make predictions. This activity is called attribute extraction, where an attribute is a characteristic/property of the problem space captured within the data set. If you are used to analyzing the contents of relational tables, then an attribute is akin to a table column. In ML lingo, an attribute is also called a feature, an independent variable, an input, or a predictor.

After identifying the desired attributes, those attributes are extracted from the data set, transformed (as may be needed), and combined into a data structure best suited to train the ML model. This activity is called attribute engineering.
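
To make these two activities concrete, here is a minimal Python sketch of attribute extraction and attribute engineering; the file name loan.csv is an assumption, and the column names (loan_amnt, addr_state) come from the loan data set introduced later in this post:

        import csv

        # Attribute extraction: pull candidate attributes (columns) out of the source CSV
        # (loan.csv is an assumed file name)
        with open("loan.csv", newline="") as f:
            rows = list(csv.DictReader(f))

        loan_amounts = [float(r["loan_amnt"]) for r in rows if r["loan_amnt"]]
        state_codes = [r["addr_state"] for r in rows if r["addr_state"]]

        # Attribute engineering: encode the categorical state code as one-hot indicator vectors,
        # a structure better suited to training most ML models
        domain = sorted(set(state_codes))
        one_hot = [[1 if code == c else 0 for c in domain] for code in state_codes]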

Once you have a suitable collection of attributes, in a viable form, you can then train the predictive model, assess its performance as well as the quality of its predictions, and make the business decision to proceed with development of the ML service using the Kafka and Spark frameworks, or to kill the ML project entirely. To help accelerate feature extraction and feature engineering (and keep cost to a minimum, and as importantly, avoid ML project failure sooner rather than later), Anaconda has been adopted by ML engineers and product owners alike, worldwide.

As regards any data set chosen to undergo feature extraction and feature engineering, it must have a meaningful data dictionary. To be meaningful, each entry in the dictionary must define an attribute's/element's identity and role/function within the BPM, as well as disclose all applicable data governance and data management rules/regulations. No data dictionary means the data set does not get an entry pass into the ML project. The ML engineer must not be in a position where they are guessing the meaning or the governance of the attribute. If the data source is internal to a company, and for whatever reason/excuse/rationalization its data dictionary does not accompany the data set, then consider pulling the plug on the project asap. Knowledge (and its absence) has political mass, and is not just the plain facts.

With access to the data set and data dictionary, the ML engineer begins their journey to discover the real patterns intrinsic to the data set (and not mere images projected upon it by a wishful thinker). The vehicle of choice for making that journey is the Anaconda platform, where the native language is Python. With some foresight, the Python code produced in this phase can more readily be refactored for later use within the Kafka and Spark frameworks.

When we start attribute extraction and attribute engineering, we'll first segregate a portion of the source data set so that it is never used until after training of the ML model has completed. That segregated portion of data will be used to verify and validate the KPIs delivered by the ML service.
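
A minimal sketch of that segregation step (the 80/20 split ratio is an assumption, and rows stands for the list of parsed source records, e.g., from the extraction sketch above):

        import random

        # Segregate a hold-out portion before any attribute engineering or training begins
        random.seed(42)                      # fixed seed so the split is reproducible
        shuffled = rows[:]                   # 'rows' is the list of parsed source records
        random.shuffle(shuffled)
        cutoff = int(len(shuffled) * 0.8)    # 80/20 split is an assumed ratio
        training_rows = shuffled[:cutoff]    # used for attribute engineering and model training
        holdout_rows = shuffled[cutoff:]     # untouched until KPI verification and validation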

The ML Team Roles - Phase #1

To cross the finish line sooner rather than later, it is best to have more than one ML engineer on the project. Each ML engineer is provided with an instance of the Anaconda platform, as well as with access to both the data sets and their dictionaries. Using Python, each individual ML engineer independently extracts features from the data set(s), engineers those features into their desired form, and designs, develops, and trains their own ML model. The extent to which ML engineers might collaborate on these efforts is limited to a shared degree of mutual style. Studies show that 97% of all programmers do not enjoy pair programming, and therefore it is not a highly 'mutual' style.

The purpose in bringing multiple ML engineers onto Phase #1 of the project is to develop many models, of varying complexity, as quickly as possible. A multiplicity of different models increases the probability of producing at least 1 model that satisfies the stakeholders' KPIs, and doing so quickly. The productivity and innovation of ML engineers are significantly diminished by excessive collaboration. By all means, share ideas and insights as often as warranted, and, as always, refrain from pair programming. Too many chefs spoil the soup. Only after the best prediction model has been determined (relative to the KPIs) does a team of people write the Kafka and Spark codesets of the ML service that will be deployed. Until that finish line is crossed, each horse runs its own race (and its jockey is the product owner).

In addition to the ML engineer(s), there is a representative of the stakeholders with command of the parameters of the solution. The ML engineer(s) work directly with that representative (a.k.a., a product owner, in the literal sense of that term - they have their (budgeted) capital invested in the ML service), from whom the ML engineer gains their understanding of the outcomes (quantifying the financial impact of each outcome), the KPIs (how they will be verified and validated), the resources available to the project, timing issues, and alternatives (the consequences of your current situation form the basis of your selection criteria). Because the ML engineer communicates directly with an authentic product owner, there is no scrum master. The ML engineer(s) are professionals and so organize their time and effort as they see fit, in light of their communications with an authentic product owner (and so they do not need to be babysat, as they are always in sight of the product owner).

How Anaconda Supports the initial ML Process

Studies confirm that (like BI and Big Data projects) 80-90% of ML project expenses are spent on attribute extraction and attribute engineering (which are completed before you begin the design of the ML algorithm). Therefore, you need a tool set (and a skill set) which delivers excellent support for attribute extraction and attribute engineering. At present, these skill sets are commanded by ETL professionals who currently use a programming language to create ETL solutions (not by operating an ETL service). If you can use a programming language and libraries/packages well suited to these efforts, as well as to the development of the ML algorithm itself, there is no better alternative. Let's look at how Anaconda supports the best efforts undertaken in a day in the life of an ML engineer.

Understand the data, as the data

The 'nature' of an attribute (in a data set) determines if and how it can/might be used by the ML service. Most importantly, the 'nature' of an attribute dictates which type(s) of ML algorithm can consume said feature. In ML lingo, nature is an alternative term for the 'literal data type' of the attribute, e.g., integer, text, etc. From an ML viewpoint, a phenomenon in this universe is either a numeric variable or a categorical (a.k.a., factor) variable. In this use case, a postal code for a state in the USA is a (multi-class) categorical variable.

An attribute is input to the ML algorithm that is used to make predictions, and a label is an observed outcome you want to predict (e.g., a given loan will default), which the ML algorithm will use to build a predictive model. The process of building a predictive model is called training. Like attributes, labels have a literal data type, and so likewise influence the shape of the ML algorithm. In ML lingo, a label is also known as a dependent variable, a target, an outcome, or a response.
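
For example, in this use case the label can be derived from the loan_status attribute; a minimal sketch (the status strings 'Default' and 'Charged Off' are assumptions about the domain values, and rows holds the parsed records):

        # Derive a binary label (1 = defaulted/bothersome, 0 = otherwise) from loan_status
        def loan_label(status):
            return 1 if status in ("Default", "Charged Off") else 0   # assumed domain values

        labels = [loan_label(r["loan_status"]) for r in rows]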

A power of the Python language and its stats, numpy, pylab, and scipy libraries/packages is the relative ease with which you can create scripts to extract variables from a data source, and to then engineer them into a data structure best suited to the ML model. In the real world, the source data sets extracted and engineered for this use case are confidential; therefore, the source CSV data set of loans used in this demonstration, as well as its data dictionary, is available (to Kaggle members) at this URL: https://www.kaggle.com/wendykan/lending-club-loan-data

The first task is to verify if the data values of the variables contained in the source data file match the data types defined in the data dictionary. In the real world, the quality of data captured by a source application is rarely good. Most applications do not rigorously ensure the quality of the data they capture. Adding to this chronic handicap, data governance and management rules/regulations/policies intentionally transform meaningful source data values into meaningless extracted data values. 

So, the first task is to conduct a 'sanity check'. The sanity check examines the data values of each variable in each record, and determines if the variable is empty, or contains a number or a string. And, since this ML project focuses on time-series data, we need to determine if a date variable's data value is a 'properly formatted' date (formats are conventions, so you need to read the dates to discover how they are formatted by the source system). The Python code which accomplishes the basic sanity check (even with the regular expression used to validate the format of the date values) is rather simple:

        import re

        # Dates are expected in a 'Mon-YYYY' format, e.g., 'Dec-2015'
        qaRegEx_1 = "[A-Z][a-z][a-z]-[0-9][0-9][0-9][0-9]"
        re_1 = re.compile(qaRegEx_1)
        """
        Per-variable tallies:
        [0] numeric
        [1] date
        [2] string
        [3] empty
        """
        # Variables is the list of parsed CSV rows; variablesCount is the number of columns per row
        data_type = [0]*4
        data_type_Counts = list()
        for indx in range(variablesCount):
            for row in Variables:
                try:
                    var = float(row[indx])
                    if isinstance(var, float):
                        data_type[0] += 1
                except ValueError:
                    # Not a number: decide among date, non-empty string, and empty
                    if len(row[indx]) > 0:
                        result = re_1.fullmatch(row[indx])
                        if result is None:
                            data_type[2] += 1
                        else:
                            data_type[1] += 1
                    else:
                        data_type[3] += 1
            data_type_Counts.append(data_type)
            data_type = [0]*4
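
For context, Variables and variablesCount in the script above are assumed to come from the parsed source CSV; a minimal sketch of how they might be populated (loan.csv is an assumed file name):

        import csv

        # Read the source CSV; the first row holds the variable (column) names
        with open("loan.csv", newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            Variables = [row for row in reader]

        variablesCount = len(header)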


By running this sanity check on the source data set, you determine the number of rows and the count of variables per row, and, most importantly, you discover if:
  • variables are missing data values (are empty), to a significant degree
  • numeric variables only contain numbers
  • date variables only contain properly formatted dates
  • a string variable is a string, and not a number or a date.
In the sanity check, there is no validation of data values (other than date format). The data type report of this source CSV file confirms that the majority of numeric, date, and categorical variables are corrupted, while, as is usually the case, only a few variables are free of corruption and are not missing data values (of any type) to a significant degree.


Count of Rows: 887379
Count of Variables per row: 74
Number of variables to check: 65666046

Data Type Report
Name Indx Number Date String Empty
id 0 887379 0 0 0
member_id 1 887379 0 0 0
loan_amnt 2 887379 0 0 0
funded_amnt 3 887379 0 0 0
funded_amnt_inv 4 887379 0 0 0
term 5 0 0 887379 0
int_rate 6 887379 0 0 0
installment 7 887379 0 0 0
grade 8 0 0 887379 0
sub_grade 9 0 0 887379 0
emp_title 10 35 0 835887 51457
emp_length 11 0 0 887375 4
home_ownership 12 0 0 887378 1
annual_inc 13 871217 0 16158 4
verification_status 14 15580 0 871799 0
issue_d 15 489 871221 15669 0
loan_status 16 83 15580 871716 0
pymnt_plan 17 6 489 886884 0
url 18 0 83 887296 0
desc 19 0 6 136775 750598
purpose 20 130 0 876817 10432
title 21 199 0 886692 488
zip_code 22 88 0 887239 52
addr_state 23 52 0 887315 12
dti 24 833575 0 53794 10
delinq_2yrs 25 865453 0 21905 21
earliest_cr_line 26 41971 833516 11868 24
inq_last_6mths 27 848127 31899 7321 32
mths_since_last_delinq 28 450011 10043 4871 422454
mths_since_last_record 29 159655 4548 3519 719657
open_acc 30 847381 2456 2619 34923
pub_rec 31 871997 1354 2019 12009
revol_bal 32 879106 909 1614 5750
revol_util 33 881898 591 1287 3603
total_acc 34 884045 412 1035 1887
initial_list_status 35 51397 327 834389 1266
out_prncp 36 853712 255 32609 803
out_prncp_inv 37 875975 173 10633 598
total_pymnt 38 881731 159 5028 461
total_pymnt_inv 39 884056 117 2853 353
total_rec_prncp 40 885330 107 1681 261
total_rec_int 41 885884 85 1191 219
total_rec_late_fee 42 886293 68 842 176
recoveries 43 886558 43 630 148
collection_recovery_fee 44 886718 35 509 117
last_pymnt_d 45 53308 816099 415 17557
last_pymnt_amnt 46 855056 31779 309 235
next_pymnt_d 47 43455 627256 269 216399
last_credit_pull_d 48 17105 850663 210 19401
collections_12_mths_ex_med 49 842620 36851 195 7713
mths_since_last_major_derog 50 252901 12393 161 621924
policy_code 51 852141 5908 134 29196
application_type 52 40295 3283 833634 10167
annual_inc_joint 53 15455 1923 31994 838007
dti_joint 54 8171 1331 10119 867758
verification_status_joint 55 4678 908 5135 876658
acc_now_delinq 56 836546 630 2511 47692
tot_coll_amt 57 820496 537 1406 64940
tot_cur_bal 58 820145 394 961 65879
open_acc_6m 59 54229 306 633 832211
open_il_6m 60 31951 256 435 854737
open_il_12m 61 26196 199 354 860630
open_il_24m 62 23990 164 279 862946
mths_since_rcnt_il 63 22431 130 191 864627
total_bal_il 64 22469 102 174 864634
il_util 65 19442 89 135 867713
open_rv_12m 66 21985 57 121 865216
open_rv_24m 67 21832 60 93 865394
max_bal_bc 68 21756 67 80 865476
all_util 69 21692 45 50 865592
total_rev_hi_lim 70 786902 30 43 100404
inq_fi 71 43449 41 40 843849
total_cu_tl 72 26808 29 49 860493
inq_last_12m 73 23420 15 25 863919

After you have a basic understanding of the real data

As the 'sanity check' data type report (and the data dictionary) reveals, 31 variables are unsuitable for this use case because the variables are:

  • missing data values to a significant degree
  • unstructured and free-form text
  • candidate keys assigned to the record by the source application. 

This leaves 43 numeric, date, and categorical variables that are potentially of use. Of these, only 8 can be extracted without needing to clean their data values. That leaves 35 variables which need to be cleaned during variable extraction. These variables are:

  • 4 date variables (# 15, 45, 47 & 48) which are validated using a regular expression, and stripped of numbers and text
  • 7 categorical variables (# 14, 16, 17, 22, 23, 35, 52) which must be stripped of number and date data values. The 'cleaned' data values may need further cleaning to remove meaningless strings which are clearly not legitimate domain values, e.g., an invalid US state code (see the sketch after this list).
  • 24 numeric variables which must be stripped of date and text data values
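
A sketch of one such domain-value cleaning rule, using hypothetical helper and set names (only a handful of the 50 state codes are listed here):

        # Keep only legitimate two-letter US state codes; blank out everything else
        VALID_STATE_CODES = {"AL", "AK", "AZ", "CA", "CO", "NY", "TX"}   # ...and the rest of the 50

        def clean_state_code(value):
            code = value.strip().upper()
            return code if code in VALID_STATE_CODES else ""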

After understanding the data as the data, there is a sequence of 4 'must do first' tasks to be completed on each source data set. Together, they are a best practice. Gratefully, all 4 of these tasks are fully supported by Python and Anaconda. You must:
  • Detect if an instance of a variable is not initialized with a value, and track that fact.
  • Detect if an instance of a variable contains an invalid data value, and track that fact.
  • For each (cleaned) numeric variable, calculate the mean and standard deviation, and plot the distribution of its data values. Also, detect outliers, and generate quartile and decile boundaries.
  • For each (cleaned) categorical variable, detect its unique values, as well as the count of unique values (e.g., the count of US state codes).
A power of the Python language and its stats, numpy, pylab, and scipy libraries/packages (that are bundled with Anaconda) is the relative ease with which you can create a single script that accomplishes these 4 must-do-first tasks. The Python code that executes these 4 tasks over CSV source data sets can be obtained by sending a request to charles@thesolusgroupllc.com.
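
As an illustration only (not the script mentioned above), a minimal sketch of the numeric and categorical profiling tasks using numpy and the standard library; cleaned_numeric_values and cleaned_state_codes are assumed inputs (one cleaned variable each):

        import numpy as np
        from collections import Counter

        # Numeric profiling: mean, standard deviation, a simple 3-sigma outlier rule,
        # and quartile/decile boundaries
        values = np.array(cleaned_numeric_values, dtype=float)   # assumed input
        mean, std = values.mean(), values.std()
        outliers = values[np.abs(values - mean) > 3 * std]
        quartiles = np.percentile(values, [25, 50, 75])
        deciles = np.percentile(values, list(range(10, 100, 10)))

        # Categorical profiling: the unique values and their counts (e.g., US state codes)
        state_counts = Counter(cleaned_state_codes)              # assumed input
        print(len(state_counts), "unique state codes")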

After these quality assurance tests were completed, just 29 numeric variables and 10 categorical variables (including the date fields) in the source CSV data set proved to be usable for this use case (that's only 39 out of 74 variables). In the next blog, we'll explore how Anaconda and Python can be used to develop the ML predictive model.


Thursday, August 18, 2016

Change Data Capture using Apache NiFi

Change data capture (CDC) is a notoriously difficult challenge, and one that is critical to successful data sharing. To that end, a number of data flow vendors have proprietary CDC solutions, each of which is very expensive to purchase, support and operate. For the cost conscious enterprise, there is a viable and robust alternative which costs nothing to purchase and which has relatively low support and operating expenses: Apache NiFi.

At the heart of the CDC challenge are the INSERT, UPDATE and DELETE data manipulation language (DML) statements committed against the source RDBMS table. The top-tier RDBMSs, e.g., ORACLE, Microsoft SQL Server, have CDC features which can be enabled for a source table, and which will manage a CDC shadow table. In the CDC shadow table are the columns from the source table as well as additional columns which hold CDC metadata, e.g., operation type, operation event time-stamp, etc. The notoriously difficult part of the challenge is keeping downstream target RDBMS tables synchronized with the contents of the source CDC shadow table as per the service level agreement (SLA), given disk failures, network outages, and other types of inevitable events which invariably impede data flows.

At its creation in 2006, the National Security Agency (NSA) put everything possible into NiFi to make it the best and most useful data platform on the globe. Moreover, since 2006, the NSA has run NiFi in large-scale production environments. By doing that, the NSA has proven the long-range value and benefit of NiFi. As such, you can depend on NiFi for superior service over the long haul. Recently, the NSA conveyed NiFi to the Apache community. However, unlike most Apache projects, NiFi is an appliance - a highly secure, and very easy to use, appliance. The NiFi UI is web browser based.

To build a data flow using NiFi, you simply drag and drop any 1 of over 140 processors onto the canvas. For each processor icon on the canvas, you fill in the mandatory property values, and link the processor to other processors as needed. Which processors you choose to use is specific to your business use case. Since Apache NiFi supports all 4 enterprise integration patterns:
  1. file replication
  2. shared database
  3. remote procedure invocation
  4. messaging

the potential combinations of processors are truly phenomenal. 

For a CDC use case, using NiFi, the replicated DML statements are streamed, on a first-in-first-out (FIFO) basis, from the source RDBMS CDC table to the target RDBMS table, and transformed in flight (as may be needed). Using NiFi for CDC does not require file replication. As such, CDC using NiFi stays squarely within the 'shared database' enterprise integration pattern. 

For a CDC use case using NiFi, these processors are typically used:
  • DBCPConnectionPool
  • QueryDatabaseTable
  • PutFile (for logging)
  • RouteOnAttribute
  • SplitAvro
  • ConvertAvroToJson
  • DetectDuplicate
  • DistributedMapCacheClientService
  • DistributedMapCacheServer
  • PutDistributedMapCache
  • ExecuteScript (running Groovy)
  • PutSQL

At the source RDBMS and at the target RDBMS there exists a dedicated DBCPConnectionPool processor. The QueryDatabaseTable processor periodically submits a SELECT * query against the source CDC table, and returns 0-N records, in binary Avro format. As with any data flow, at any step in the flow, problems can occur, and so the PutFile processor is used to log those issues to disk. 

In a worst case CDC scenario, CDC DML statements are redundantly applied against the target table. To handle that scenario, you first need to know how many records are in the query result set. A NiFi flow file has 2 parts: attributes, which hold metadata about the flow, and content, which holds the actual data. The QueryDatabaseTable output has metadata about the count of records in the result set. If the record count is > 1, the query result set in binary Avro format is split into single records (using the SplitAvro processor). The RouteOnAttribute processor uses the row count flow metadata attribute to determine if the query result set needs to be split.

Each individual record in binary Avro format is then converted into JSON, by using the ConvertAvroToJson processor. This streaming JSON is next sent to an ExecuteScript processor which executes a Groovy script. This Groovy script copies values of elements in the JSON string and appends them as new attributes, new metadata fields, of the flow. These new attributes are used by the subsequent DetectDuplicate processor, which in turn uses a DistributedMapCacheClientService to check with a DistributedMapCacheServer to determine if the incoming CDC record has already been handled by the downstream target RDBMS. If it is a duplicate CDC record then it is discarded, else the record is sent to a PutDistributedMapCache processor which enters the new CDC record into the cache.

From there, the data flow is routed to the corresponding ExecuteScript processor which, using Groovy, converts the JSON stream into either an INSERT or an UPDATE DML statement. This DML statement becomes the new content of the flow file, and is then sent on to the PutSQL processor which applies the DML to the target RDBMS table.
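
To illustrate the kind of transformation those Groovy scripts perform, here is the same logic as a plain Python sketch (outside of NiFi); the cdc_operation field, loan_id key, and target_loans table name are hypothetical:

        import json

        def cdc_json_to_dml(json_text):
            # cdc_operation, loan_id, and target_loans are hypothetical names
            record = json.loads(json_text)
            operation = record.pop("cdc_operation")          # e.g., "INSERT" or "UPDATE"
            if operation == "INSERT":
                cols = ", ".join(record.keys())
                vals = ", ".join("'%s'" % v for v in record.values())
                return "INSERT INTO target_loans (%s) VALUES (%s)" % (cols, vals)
            assignments = ", ".join("%s = '%s'" % (c, v) for c, v in record.items() if c != "loan_id")
            return "UPDATE target_loans SET %s WHERE loan_id = '%s'" % (assignments, record["loan_id"])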

Of course, there are other scenarios which arise within a CDC data flow, and which must be handled by a production solution. The design described above is a generic template, not a universal solution to all CDC scenarios. However, the abundance of existing drag and drop NiFi processors enables easy development of more complex CDC scenarios than the one described above. For example, your CDC scenario may require directing the records into Kafka, or into HDFS, or into Solr, and in each case there is a corresponding NiFi processor to support that solution.

As regards ExecuteScript, it supports not just Groovy (which I prefer with NiFi given its integration with Java), but other script languages as well, e.g., JavaScript, Python, Ruby, Lua, etc.

Lastly, though this generic template contains all of the processors running in a single NiFi canvas, the processors can be distributed as well. You could easily handle all of the query executions and removal of duplicates with a NiFi instance running at or near the source RDBMS, and run the transform, query generation, and DML puts on another NiFi instance, one at or near the target RDBMS.

For a demonstration of these CDC capabilities of Apache NiFi, or to investigate the significant cost savings delivered by Apache NiFi, just send an email to charles@thesolusgroupllc.com



Tuesday, June 14, 2016

A valid IT Use Case has KPIs

When IT initiatives are undertaken, most fail to deliver value for two reasons. Most often the stakeholders in the organization do not:


  1. Have a comprehensive view of key aspects of their current situation which impact the IT initiative;
  2. Adequately understand the future (the 1 year out, 3 years out) state, i.e., those new/optimized processes which they aspire to deliver to their organization.


The easiest way to recognize the presence of either problem in an IT initiative (and therefore, the initiative's inescapable failure) is the absence of key performance indicators (KPI) regarding the IT initiative itself.

Indicators are all about having the ability, as a stakeholder, to know what's going on - to quantitatively measure the value delivered by a process (or group of processes) to the organization, throughout their lifecycle, from start to finish. To be suitable, an indicator is an event, or physical situation, which  can be seen, heard, or perceived, by a stakeholder. Indicators are recognizable by the stakeholders; they are observations a stakeholder can make - they are definitely not opinions; nor are they qualitative.

When you take a closer look, indicators are symptoms, they are clues to the underlying cause(s) of the problems. It is these causes of the problems - the individual dissatisfactions among the stakeholders - that have to be reached through open discussions. Those individual dissatisfactions are the vehicles which drive organizational changes/transformation, for the benefit of the stakeholders.

The delivery of a new/optimized process changes/transforms organization behavior, and often organization structure as well. As such, it is very important that each stakeholder understand their current situation, how it overlays the IT initiative. The current situation is the baseline against which to measure how effectively the IT initiative is performing - the KPIs provide actionable insights into the effectiveness of the on-going IT initiative. 

The common cause for not fully comprehending key aspects of their current situation is the organization's habit of not discussing tough issues. Stakeholders usually only make changes when they are experiencing the consequences of a problem. When we do not discuss tough topics, we lose the opportunity to avoid the consequences of a problem.

Discussing a tough topic is difficult because each of us quickly forms our own perspective, confirming our concerns, and narrowing our focus onto specific areas of our dissatisfaction with the current situation. Through this individual habit, ever present throughout the group as a whole, we 'fail to communicate' in a timely manner. This lack of openness is cultural - people just don't feel safe enough to raise that issue for discussion within their organization. In such contexts, people are only able to discuss the problem after stakeholders are experiencing the consequence(s) of the problem - at an intolerable level.

Given this common organizational behavior, the only way to ensure a quality decision to change/transform is to ensure that the stakeholders are clear about the consequences (which they acknowledge that they are experiencing now, and which they see will persist if they do not change the organization behavior/structure). In business, the tool that gives greatest clarity into a consequence of a problem is the simple act of measuring its financial impact. 

Is the problem having a negative impact on the acquisition/growth/preservation of capital? If yes, then what is its financial impact? Is the financial impact/cost/lost-opportunity great enough to proceed to make a change to the current situation? By answering these questions, the stakeholders both clarify their thoughts about their current situation, as well as take ownership of problems - before deciding to make a change, or to not make a change. Even if a change provides a new capability, there is still the financial cost of adopting that change, and those costs have to be defined.

Some may prefer not to label an event/situation which has a negative impact on the acquisition/growth/preservation of capital as a problem, but instead refer to it as their current situation. In a manner of speaking, the label 'current situation' is more neutral, and so succeeds at failing to carry any warranted alarm. Besides, people tend to prefer their 'current situation' over any change to their current situation, unless their level of dissatisfaction with their 'current situation' necessitates a change/transformation, i.e., they are experiencing an intolerable problem.

When stakeholders assign a $ value to their current situation, they put a price tag on the individual dissatisfactions the organization is experiencing. The price tag empowers the stakeholders to:
  • prioritize various aspects of their current situation, and to 
  • make better, informed, rational, decisions - do we decide to do nothing about the problem (and therefore continue to incur the financial costs), or do we seek a solution to our problem? 


Unless the costs exceed levels acceptable to the stakeholders, nothing will be done by the stakeholders to change/transform the organization behavior/structure.

In an IT initiative, the tangible remediation of the financial costs of key indicators of the current situation has to be measured throughout the life cycle of the initiative. These KPIs tell the stakeholders whether or not their resources (within the IT initiative) are being used in a manner which actually solves their problem, a.k.a., remediates/optimizes their current situation. That is, at a bare minimum, through the KPIs the stakeholders can, as a direct consequence of the IT initiative, see/hear/perceive that the events and situations that were the cause of the problems are no longer present (and if present, they are producing a measurably reduced effect) within their current situation (or within some alternative future state).

For a stakeholder to be able to see, hear, perceive the value of the IT initiative in a meaningful way, they need to adopt use cases which they verify and validate, through testing. What is it about the use case that is being verified and validated? In general, a valuable use case is perceived by the stakeholders, through testing, to reduce/eliminate the financial cost of a key indicator of their current situation.

When each use case in the IT initiative is evaluated along these lines, it is rare that an organization loses its investment of resources in change/transformation - in situations when such losses do occur, these events are due to disruptive market forces that overpower the transformed organization.

Friday, June 3, 2016

Lifting MS SQL EDW into Amazon Redshift

In a previous posting (Lifting MS SQL EDW into Apache Hive), the migration of a Microsoft SQL based Enterprise Data Warehouse (EDW) into Apache Hive was demonstrated. In this posting, the migration to Amazon Redshift will be covered.

As in the case of migrating from MS SQL to Hive, the shareholders are unlikely to undertake a migration to Amazon Redshift unless they feel dissatisfied with MS SQL. But, in the case of migrating to Amazon Redshift, unlike the case of migrating to Apache Hive, the shareholders must not only feel dissatisfied; the financial costs of Amazon Redshift, relative to the significant expenses of the MS SQL EDW cluster, must also be calculated, and the shareholders must be ready to publicly defend the resulting valuation. To have shareholders willingly do that, they must be certain that the resulting value will be delivered to the organization, in the time-frame which they demand.

Today, all high-quality decisions are the result of orchestrating a consensus among the stakeholders. In order to orchestrate such a change, do 2 things:
  1. Choose a critically important data flow, one with compelling business value, as your use case. You'll use that use case to verify and validate the 'migration hypothesis.' You want a use case that helps stakeholders make sense of their situation and connect it to the Amazon Redshift solution and the measured value which Amazon Redshift delivers - hypothetical value, so far, which brings me to the next point: 
  2. Undertake an Amazon Redshift Proof of Value (POV). Through the Amazon Redshift POV event the stakeholders measure the value delivered by Amazon Redshift features. 

A technology POV is a complex social environment in which outcomes are increasingly random and unpredictable. Therefore, a best practice is to select a use case which helps the stakeholders make several decisions about whether to eliminate a problem/significant-expense and how to change your situation for the better. Keep in mind that with higher risk comes greater resistance, so lay out a course of escalating risk/change management to apply after the POV concludes successfully.

One thought to consider is that the existing EDW schema is a valuable asset of the organization, as are SQL queries used with the EDW. As such, perhaps it is worthwhile to 'lift' that schema into Redshift and execute those SQL queries against it, as the use case for the Amazon Redshift POV. That focus would serve to reveal any ripple effect on the schema and on the SQL, which would need to be remedied at some point in the migration process -  consequently raising the cost of the migration hypothesis.

So, what would occur in a POV that lifts an MS SQL schema into Amazon Redshift?
  1. Extract the EDW schema from the MS SQL
  2. Extract (all of) the data from the EDW tables
  3. Format the extracted data
  4. Persist that formatted data to storage
  5. Create Amazon Redshift schema which mirrors source EDW
  6. Load extracted data into Amazon Redshift
  7. Submit SQL queries to Amazon Redshift in any number of ways that you'd prefer. 
Steps 1-6 can involve quite a lot of software engineering. For example, since the rule is that the EDW schema must be mirrored within Amazon Redshift, that schema has to be preserved when data is extracted, when that data is formatted, when that data is stored, when the Amazon Redshift schema is created, and when the Amazon Redshift schema is loaded with data. How would these numerous requirements affect the cost of the migration if your organization had 300+ tables within the EDW schema?

The cost of designing, writing, testing and operating that volume of code for a POV is significant. So, is there anything about this use case which can be used to drive down the software engineering expense of the Amazon Redshift POV?  Yes there is - the schema is being preserved. That rule greatly simplifies the software engineering effort. In fact, it reduces the effort to such a degree that you only have to write a program which generates all of the required code!

The C# code is available on GitHub, and the URL to the code will be provided after a brief description of the code generator. But before we do that, it has to be pointed out that the generator has to produce Sqoop scripts, Avro scripts, and Amazon Redshift scripts. The Sqoop scripts will extract all data records from a table, and write out that data to disk, in AVRO format. The Amazon Redshift scripts will create the schema in Amazon Redshift and load the data into that schema. Where these scripts will be run, and how they will be orchestrated, is beyond the scope of this posting. But, imagine for this example, that they will use S3 buckets for storage.

The crux of the design of the generator is the EDW schema metadata which drives its script generation functions. To acquire that metadata, SMO is used to pull the schema metadata out of the MS SQL system catalog and return a collection of tables (and views) to the generator. Next, for each table in the collection, a Sqoop "import" script is generated, followed by the table's AVRO schema script (with all data type conversions handled transparently). After that, the Amazon Redshift "create table" script is produced (with all data type conversions handled transparently, once again).
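
The generator itself is written in C#, but the core pattern is easy to see in a short Python sketch; the table metadata, JDBC connection string, S3 bucket, and type map below are all hypothetical stand-ins:

        # Hypothetical table metadata, standing in for what SMO returns from the MS SQL catalog
        table = {"name": "Loan",
                 "columns": [("loan_id", "bigint"), ("loan_amnt", "decimal(10,2)"), ("issue_d", "datetime")]}

        MSSQL_TO_REDSHIFT = {"bigint": "BIGINT", "decimal(10,2)": "DECIMAL(10,2)", "datetime": "TIMESTAMP"}

        def sqoop_import_script(table):
            # Emit a Sqoop import that lands the table's records on S3 in Avro format
            return ("sqoop import --connect 'jdbc:sqlserver://edw-host;databaseName=EDW' "
                    "--table %s --as-avrodatafile --target-dir s3://edw-staging/%s"
                    % (table["name"], table["name"]))

        def redshift_create_table_script(table):
            cols = ", ".join("%s %s" % (c, MSSQL_TO_REDSHIFT[t]) for c, t in table["columns"])
            return "CREATE TABLE %s (%s);" % (table["name"], cols)

        print(sqoop_import_script(table))
        print(redshift_create_table_script(table))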

The pattern of generating each of these 3 types of scripts repeats itself for every table (and view) in the EDW. This means that all scripts can be generated for every table and view in the source MS SQL schema! During the Amazon Redshift POV, you can run these scripts in any number of ways - from the command line, or from Oozie, or from Apache NiFi. The choice is yours.

You can find the C# code and its configuration file at: github.com/charlesbclifford/csharp_mssql_to_redshift.

It is recommended that you start by reviewing the Program.cs file, then the App.config, and then move on to the 3 generators: Sqoop.cs, Avro.cs and Redshift.cs. The POV I wrote this script generator for had 453 tables and views in the EDW schema. All 3 scripts for each of the 453 relations were generated in a matter of a few minutes. The generator was designed, developed, tested and deployed to operations over a four day period. You should be able to build a C# project in your favorite IDE, import these code files into that project, change the App.config key-value pairs in a matter of minutes, and run the script generator through its unit tests. These same tasks are once again left for the Java developer, but this time, you are left with the task of translating C# into Java (which is not that difficult). You can use the C# and Java code for the Hive migration as your detailed guide.

Since Amazon Redshift did not, prior to Q3 2015, support AVRO, a C# class which converts AVRO to CSV format is provided. That class, Pig.cs, generates the Pig scripts used to convert file formats.

Since it was asserted in the previous posting that it is a relatively straightforward task to have the generator consume both schema and data in ORACLE, TeraData, Netezza, etc., and likewise migrate these into Apache Hive/Qubole, this posting demonstrates that the generators can easily be extended to migrate to a variety of target data stores.

If you need support, send an email to charles@thesolusgroupllc.com

Thursday, April 28, 2016

Lifting Microsoft SQL EDW into Apache Hive (Qubole)

Today, the migration of a Microsoft SQL based Enterprise Data Warehouse (EDW) into Apache Hive is no longer seen as a radical move. But that migration involves change, and unless you feel dissatisfied with your MS SQL EDW, or at risk in your current situation, you're unlikely to undertake a migration. Migration is an appealing value proposition to many since free open source Apache Hive provides bottom-line growth via reduced expenses (since licenses for commodity EDW servers frequently cost several hundred thousand to several million dollars, that reduction is dramatic).

Today, all high-quality decisions are the result of orchestrating a consensus among the stakeholders. In order to orchestrate such a change, do 2 things:
  1. Choose a critically important data flow, one with compelling business value, as your use case. You'll use that use case to verify and validate the 'migration hypothesis.' You want a use case that helps stakeholders make sense of their situation and connect it to the Apache Hive solution and the measured value which Apache Hive delivers - hypothetical value, so far, which brings me to the next point: 
  2. Undertake an Apache Hive Proof of Value (POV). Through the Apache Hive POV event, the stakeholders measure the value delivered by Apache Hive features. 
A technology POV is a complex social environment in which outcomes are increasingly random and unpredictable. Therefore, a best practice is to select a use case which helps the stakeholders make several decisions about whether to eliminate a problem/significant-expense and how to change your situation for the better. Keep in mind that with higher risk comes greater resistance, so lay out a course of escalating risk/change management to apply after the POV concludes successfully.

One thought to consider is that the existing EDW schema is a valuable asset of the organization, as are SQL queries used with the EDW. As such, perhaps it is worthwhile to 'lift' that schema into Apache Hive and execute those SQL queries against it, as the use case for the Apache Hive POV. That focus would serve to reveal any ripple effect on the schema and on the SQL, which would need to be remedied at some point in the migration process -  consequently raising the cost of the migration hypothesis.

So, what would occur in a POV that lifts an MS SQL schema into Apache Hive?
  1. Extract the EDW schema from the MS SQL
  2. Extract (all of) the data from the EDW tables
  3. Format the extracted data
  4. Persist that formatted data to storage
  5. Create Hive schema which mirrors source EDW
  6. Load extracted data into Hive
  7. Submit SQL queries to Hive in any number of ways that you'd prefer. 
Steps 1-6 can involve quite a lot of software engineering. For example, since the rule is that the EDW schema must be mirrored within Hive, that schema has to be preserved when data is extracted, when that data is formatted, when that data is stored, when the Hive schema is created, and when the Hive schema is loaded with data. How would these numerous requirements affect the cost of the migration if your organization had 300+ tables within the EDW schema?

The cost of designing, writing, testing and operating that volume of code for a POV is significant. So, is there anything about this use case which can be used to drive down the software engineering expense of the Apache Hive POV?  Yes there is - the schema is being preserved. That rule greatly simplifies the software engineering effort. In fact, it reduces the effort to such a degree that you only have to write a program which generates all of the required code!

The C# code (and a Java version) are available on GitHub, and the URL(s) to the code will be provided after a brief description of the code generator. But before we do that, it has to be pointed out that the generator has to produce Sqoop scripts, Avro scripts, and Hive scripts. The Sqoop scripts will extract all data records from a table, and write out that data to disk, in AVRO format. The Hive scripts will create the schema in Hive/Qubole and load the data into that schema. Where these scripts will be run, and how they will be orchestrated, is beyond the scope of this posting. But, imagine for this example, that they will all be run on AWS, will use S3 buckets for storage, and will use Hive on Hortonworks Data Platform, or Qubole on AWS - either one will work with the Hive scripts.

The crux of the design of the generator is the EDW schema metadata which drives its script generation functions. To acquire that metadata, SMO is used to pull the schema metadata out of the MS SQL system catalog and return a collection of tables (and views) to the generator. Next, for each table in the collection, a Sqoop "import" script is generated, followed by the table's AVRO schema script (with all data type conversions handled transparently). After that, the Hive "create external table" script is produced (with all data type conversions handled transparently, once again).
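
To give a feel for the "create external table" generation, here is a short Python sketch of that one step; the MS SQL to Hive type map, table metadata, and S3 bucket name are hypothetical stand-ins for what the C# generator derives from SMO:

        # Hypothetical MS SQL to Hive type map and table metadata
        MSSQL_TO_HIVE = {"bigint": "BIGINT", "nvarchar": "STRING", "datetime": "TIMESTAMP"}
        table = {"name": "Loan",
                 "columns": [("loan_id", "bigint"), ("title", "nvarchar"), ("issue_d", "datetime")]}

        def hive_create_external_table(table, bucket="s3://edw-staging"):
            # Mirror the source schema as a Hive external table over the Avro files on S3
            cols = ", ".join("%s %s" % (c, MSSQL_TO_HIVE[t]) for c, t in table["columns"])
            return ("CREATE EXTERNAL TABLE %s (%s) STORED AS AVRO LOCATION '%s/%s/';"
                    % (table["name"], cols, bucket, table["name"]))

        print(hive_create_external_table(table))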

The pattern of generating each of these 3 types of scripts repeats itself for every table (and view) in the EDW. This means that all scripts can be generated for every table and view in the source MS SQL schema! During the Apache Hive/Qubole POV, you can run these scripts in any number of ways - from the command line, or from Oozie, or from Hortonworks Dataflow (NiFi). The choice is yours.

You can find the C# code and its configuration file at: github.com/charlesbclifford/csharp_mssql_to_hive. And the Java code at: github.com/charlesbclifford/java_mssql_to_hive

It is recommended that you start by reviewing the Program.cs file, then the App.config, and then move on to the 3 generators: Sqoop.cs, Avro.cs and Hive.cs. The POV I wrote this script generator for had 453 tables and views in the EDW schema. All 3 scripts for each of the 453 relations were generated in a matter of a few minutes. The generator was designed, developed, tested and deployed to operations over a four day period. You should be able to build a C# project in your favorite IDE, import these code files into that project, change the App.config key-value pairs in a matter of minutes, and run the script generator. These same tasks are left for the Java developer. It is a relatively straightforward task to have the generator consume both schema and data in ORACLE, TeraData, Netezza, etc., and likewise migrate these into Apache Hive/Qubole as well. Bon appétit!

If you need support, send an email to charles@thesolusgroupllc.com