Tuesday, April 18, 2017

Continuent -- Product Management Newsletter -- April 2017

Tungsten Release 5.1.0
We are nearly ready with our new 5.1.0 release. It is a y-release, and therefore contains some new functionality that we want to share. In particular:
  • A simplified method for sharing tpm diag information
  • Some minor usability improvements to the thl command
  • The first phase in improving our filters, including a standard JS library for the JS filters, and upstream filter requirements (for example, pkey in heterogeneous appliers)
  • Numerous improvements to our core Hadoop support and associated tools
  • Some further fixes to tpm, net-ssh, and improvements to tpm and Ruby compatibility

The code is currently going through QA right now, so we expect to release in April. For more information, join us on the Webinar on the 19th April, details at the bottom of this newsletter!

Percona Live 2017 - Continuent Talks

Continuent is a Diamond Sponsor for Percona Live in Santa Clara, CA from the 24th-27th April. We have five different sessions if you would like to come along and meet us, or learn more about our products:

  • Continuent is back! But what does Continuent do Anyway? Tuesday, April 25- 9:20AM - 9:45AM, Eero Teerikorpi, CEO and Continuent Customers
  • Real-time Data Loading from MySQL and Oracle into Analytics/Big Data Tuesday, April 25 - 3:50 PM - 4:40 PM at Room 210, MC Brown, VP Products
  • Keynote Panel Wednesday, April 26 - 9:25 AM - 9:45 AM at Main Scenario, MC Brown, VP Products
  • Multi-Site, Multi-Master Done Right Wednesday, April 26 - 1:00 PM - 1:50 PM at Room 210, Matt Lang, Director of Professional Services - Americas
  • Spread the Database Love with Heterogeneous Replication Thursday, April 27 - 3:00 PM - 3:50 PM at Ballroom B, MC Brown, VP Products

We will of course be doing our best to share the info and presentations after the conference, but if you are near Santa Clara, or are visiting the conference anyway, please drop by the booth and meet the team.

Continuent Product Release Webinar on May 3, 2017 at 9am PST/1pm EST/5pm BST

We are holding another product release webinar on the 19th April where we will look back at some of the issues and challenges in 4.0.7 and how we are approaching 5.1.0, and our new release schedule.

We’ll also give you a sneak peek at 5.1.0 features and a look forward a few months to the upcoming releases in this quarter, and maybe even some experimental items.

Also, of course, you’ll get a chance to ask questions.

Improving Replicator Status and Usability

One of my own personal frustrations with the replicator (and at different times clustering) is that when we are executing large transactions and trying to identify what the transaction error is for a specific issue, sometimes it can be very difficult to see what is going on and why the replicator seems ‘stuck’. Furthermore, the status output from the trepctl command is not really very useful, since it shows a lot of information that is not helpful in comparison to the few statistics you really want.

So, we’ve done two things that will appear in an upcoming release (probably 5.2.0). Getting information about the THL is the first step, and the best way to do that is to extend the usability of the thl command. There are a few different elements to this, so first we’ve added more ways to select the THL entries. In 5.1.0, we’ve already added aliases from the ‘-low’ and ‘-high’ options to the more user friendly ‘-from’ and ‘-to’ options:

$ thl list -from 45 -to 65
In 5.2.0 we’ve added -first and -last, and these both default to showing one single transaction (either the first or last stored), or you can supply a number and get the first or last N transactions:

$ thl list -last 5
This helps pin down a transaction more easily, and makes it quicker to view a failing transaction instead of doing ‘thl list -low 575’, then ‘thl list 588’, etc. to get the one you really want.

These positioning requirements also work with other options, so if you extracted or applied a bad or corrupted single transaction (for example, because of a disk full error) and need to remove the last one, you can do:

$ thl purge -last
...without first having to work out which one is the last one!
From a large-transaction standpoint, we’ve also added a summary view of the size of transactions. So when you have a massive single transaction (thousands, hundreds of thousands or millions of rows) over one or more sequences, you can now check the transaction size without trying to judge from the output of ‘thl list’. For example, with a single massive transaction you can get a summary to understand the overall size:

$ thl list -sizes
1384 121 2017-04-07 10:09:31.0 125 chunks SQL 0 bytes (0 avg bytes per chunk) Rows 19072 (152 avg rows per chunk)
1384 122 2017-04-07 10:09:32.0 123 chunks SQL 0 bytes (0 avg bytes per chunk) Rows 19216 (156 avg rows per chunk)
1384 123 2017-04-07 10:09:32.0 123 chunks SQL 0 bytes (0 avg bytes per chunk) Rows 19266 (156 avg rows per chunk)
1384 124 2017-04-07 10:09:32.0 90 chunks SQL 0 bytes (0 avg bytes per chunk) Rows 14209 (157 avg rows per chunk)
Event total: 15421 chunks 0 bytes in SQL statements 2401906 rows

So if you do a trepctl status and see transaction 1384 is being processed, but the replicator is not doing anything, at least we now know that with 2.4 million rows it’s going to take a while to apply.

And, of course, the new position options work too:

$ thl list -sizes -last
...gets you the size of your last transaction.

In terms of the overall replicator status, because of the way the replicator works, it’s difficult to get information mid-transaction. The replicator is a state machine, and status is only reported at the end of the transaction. However, we’ve added a simple indication that the replicator is processing and that we know it’s still working by showing how long we’ve currently been applying an event:

$ trepctl status -name tasks
timeInCurrentEvent    : 379.218

It’s not hugely descriptive, but you can see how long we’ve been processing the current event, and know we haven’t hit an error (or the status would be offline). Match that with a quick thl sizes command on the current event and you can make a quick ‘oh, it’s just a big transaction’ determination.

Two other areas of the trepctl command will also be improved in 5.2.0. First, we’ve built a simplified status command that provides the key information, along with sizes and units, and simplified overall output depending on the current replicator state. This is a bit more human-readable and contains the key information you need to identify the current state:

$ trepctl qs
State: Online for 4398.38s, running for 5879.502s
Latency: 0.531s from DB commit time on ubuntu into THL
        2149.685s since last database commit
Sequence: 19193 last applied, (1362-19193 stored)

Of course if something goes wrong, you get more information:

$ trepctl qs
State: Faulty (Offline) for 3.06s because SEQNO 18012 did not apply, reason:
   Event application failed: seqno=18012 fragno=0 message=java.sql.SQLException: Statement failed on slave but succeeded on master

This is a quicker way to see the important information without a lot excessive detail and static settings, like the current version or timezone setting, which don’t change between installs.

In terms of understanding what the replicator is doing, we’ve improved the statistical output that you could previously get from trepctl status -name tasks and -name stages. The output from these is not very clear - in particular, it’s not really explaining what each stage is actually doing, or indeed what the counters mean. The ‘applyTime’ for example is actually the total time spent applying since the replicator last went online, not the last apply, and not since the replicator was started.

In place of that we have unified the output and made the content clearer:

$ trepctl perf
Statistics since last put online 1222.798s ago
Stage                |      Seqno |     Latency |      Events |  Extraction |   Filtering |    Applying |       Other |       Total
binlog-to-q          |      18009 |      0.315s |        1987 |    829.958s |      0.599s |     30.914s |      0.094s |    861.565s
                                          Avg time per Event |      0.418s |      0.000s |      0.000s |      0.016s
                                           Filters in stage | colnames,pkey,pkey
q-to-thl             |      18009 |      0.315s |        1987 |    734.173s |      2.011s |    124.850s |      0.523s |    861.557s
                                          Avg time per Event |      0.369s |      0.001s |      0.000s |      0.063s
                                           Filters in stage | enumtostring,settostring

Now it’s much easier to see what is configured, the correct sequence, and whether, say, filter execution time is what is ultimately affecting your latency.

These are still in their rough-cut early beta status, but we would welcome any feedback you have on the output and what you’d like to see in the output.

Kafka Applier for Tungsten Replicator

We know that many of our customers are interested in a Kafka applier, and some in a Kafka extractor, and the requests and interest is increasing substantially.

We have already done some work on this in terms of understanding the development load, and the basics of how the Kafka applier (in particular) would work. There are some important considerations when looking at the applier and the replicator.

Fundamentally, the replicator expects to replicate entire transactions between systems. That’s fine if each transactions only updates one or two smaller rows in a table, but what happens with larger updates? Depending on the environment and use case individual transactions can be huge.

Current thoughts go along these lines:
Have a configurable full transaction/individual row selection in the applier
Allow for splitting large single transactions into multiple messages if possible automatically

Other questions:
What formats do we want to write the data in; JSON, CSV?
Should we batch items (as we do with our BigData items), or just push a continuous stream
Any kind of rate-throttling required?
What kinds of tags/topics to use?

There are many other questions still to answer, but if you have thoughts and want to share, please let MC (mc.brown@continuent.com) know.