How to Write Complex Programs Running on Vertica
Even though my roots are in the field of system administration, I learned early on how important SQL and databases are. Over the years I have worked on different projects on Oracle, MySQL and Vertica, both from a DBA and a developer perspective. Even though the examples here are for Vertica, most of this can and should be adapted to other databases as well.
If you have to process and aggregate data there are different ways to tackle the problem. I have seen a lot of programs fetching the complete data out of the database, trying to do the aggregation within the application, and FAIL. Even in areas where they should have known better, like Business Intelligence software, this approach is quite common.
To be honest I never understood that way of thinking. There is this big iron containing all the necessary data. Instead of letting that database do the work, this poor, probably small computer has to fetch all the information and process it itself.
Rather you should always keep the data where it is. If not everything is stored in one place you will find a way to sync it into one place. If you are lucky your tables will include columns like last_modified. This makes it trivial to sync to another database.
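A hedged sketch of such an incremental pull, with made-up names (source_table for the remote data, destination for the local copy): fetch only the rows that changed after the newest row you already have.
-- hypothetical names: source_table, destination
SELECT *
  FROM source_table
 WHERE last_modified > ( SELECT MAX(last_modified) FROM destination );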
If you have no such column, don’t worry. You can still compare the data with what you have. Create a temporary table, import the data and run something like
SELECT COUNT(*)
  FROM ( ( SELECT * FROM destination        EXCEPT SELECT * FROM v_temp.destination ) a
         UNION
         ( SELECT * FROM v_temp.destination EXCEPT SELECT * FROM destination ) b
       ) c;
If there is nothing to sync you can just drop your temporary table. With the temporary table you put less stress on the database. It is not a permanent table where this transaction would end up in some kind of transaction log. If your database system is a multi-node system like master/slave or an MPP (massively parallel processing) architecture like Vertica, this is very important. Otherwise other systems have to replay those transactions. Imagine you just do something like this:
SET SESSION AUTOCOMMIT TO OFF;
DELETE FROM destination;
INSERT INTO destination SELECT * FROM source_table; -- a full reload of the very same data from some hypothetical source table
COMMIT;
And now do this every 20 minutes even though the content of the table doesn’t change at all. Your whole database system gets busy without doing any real work. Just don’t do it!
Nowadays database systems bring more than simple SQL. My favorite database, Vertica, has many analytic functions included. So there is no need to do this in your application.
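A minimal sketch with made-up table and column names: instead of fetching all orders and ranking them in the application, let the database do it with a window function.
-- hypothetical table "orders" with customer_id, order_id, amount
SELECT customer_id,
       order_id,
       amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
  FROM orders;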
Processing and aggregating data is usually a recurring process. The reported data must be up to date. Thus your software should include mechanisms to know on its own when and what must be reprocessed.
If you had clever database designers your tables will contain columns like last_modified. Unfortunately this is not always the case. But don’t give up yet. There is a good chance that you have columns in your source tables which end up in the destination tables, and you can use SUM() to compare them. If there is a difference you have to reprocess. Even if you have those last_modified columns: if you had to transfer them from a different data source you can’t use them. The rows might have been modified after your last transfer but before the last run of your program. In those cases you need information in your local database about when the last sync happened and whether it contained updates. Otherwise your code will miss some changes. Now you can base your checks on that sync status.
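A sketch of such a SUM() comparison, with made-up table and column names: compare a measure per day between source and destination and reprocess only the days where the sums differ or where the destination has no data yet.
-- hypothetical names: source_table, destination_table, report_date, amount
SELECT s.report_date
  FROM ( SELECT report_date, SUM(amount) AS amount FROM source_table      GROUP BY report_date ) s
  LEFT JOIN
       ( SELECT report_date, SUM(amount) AS amount FROM destination_table GROUP BY report_date ) d
    ON s.report_date = d.report_date
 WHERE d.report_date IS NULL
    OR s.amount <> d.amount;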
Developing these algorithms takes a bit more time compared to blindly reprocessing today, yesterday and maybe the day before that on every run, without knowing if the outcome will be any different. But it’s well spent time.
On Vertica we are talking about a multi-node database cluster. As we all know, hardware and even virtual hardware will fail from time to time. With Vertica this is not an issue. In a perfect scenario you might even lose half of your cluster and the database is still up and running. The problem arises when a node joins the cluster again. Its first question is: “What has changed since I left?”. And now you are in big trouble if you reprocess data without any changes in the result. Sure, the data looks the same. But this itty-bitty node has to replay the change log, probably totally unnecessarily because its data will look the same afterwards. In the meantime there will be more unnecessary “changes”. So be nice to your database. Let it work only if there is really new data to process.
If you don’t transfer all data from the database to your application to process it locally: Welcome to a new way of debugging your code! Let your program run until it fails or does the wrong thing. Capture all SQL statements it fired on the database.
Now you can use those to manually execute them in your SQL client. What’s the benefit? You can see the data your program sees. You can peek into those temporary tables your application is using. You can easily capture timings for each query, and if something is taking more time than expected you can use the good old EXPLAIN to analyze the query. Even if it is using temporary tables.
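For example (the temporary table and its columns are made up here), EXPLAIN works just as well against a temporary table your program created in the same session:
-- hypothetical temporary table v_temp.staging_orders
EXPLAIN SELECT customer_id,
               SUM(amount)
          FROM v_temp.staging_orders
         GROUP BY customer_id;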
Depending on the systems you work with you might have to rely on debug logs. I am not talking about the printf in your code. Usually you can enable debug logging in your database driver as well. If you are lucky the complete SQL statements will show up and you can collect them. Currently this is not the case with Vertica. But it’s still worth looking into those logs. In my world I found this message:
DIAG [01000] WARNING 9081: ENABLE_WITH_CLAUSE_MATERIALIZATION hint is ignored with prepared statements
You should use prepared statements wherever you can. If you run the same queries over and over again the database already has an execution plan at hand. But there is a security aspect as well: prepared statements contain placeholders and are your best weapon against SQL injection. Use them wherever you can. I am still using Perl. One of the reasons is the DBI module to connect to databases. It not only supports prepared statements: in your subroutines you can use prepare_cached. So the next time the same subroutine gets called your program and the database have less work to do. But the issue we have here: when I created the SQL statement I was running it manually. So the database hint +ENABLE_WITH_CLAUSE_MATERIALIZATION was working and speeding up the query. With the prepared statement it is slower, and I spent some time creating a different query without the WITH clause.
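For reference, a hedged sketch of such a query with made-up names; as far as I remember the hint goes right after the WITH keyword:
-- hypothetical tables and columns
WITH /*+ENABLE_WITH_CLAUSE_MATERIALIZATION*/ daily AS (
    SELECT report_date, SUM(amount) AS amount
      FROM orders
     GROUP BY report_date
)
SELECT report_date, amount
  FROM daily
 WHERE amount > 1000;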
As I mentioned earlier, debug logging is of no help with Vertica to capture the whole SQL statements. But Vertica being Vertica, we have far better tools in our pocket: v_monitor.query_profiles.
One of the first things my program does is running
SELECT session_id FROM v_monitor.current_session;
and writing that information in the log file.
Now we can use the session_id to collect all queries the program has run:
\t
\o programlogfile_20220612.sql
SELECT query || ';'
  FROM v_monitor.query_profiles
 WHERE user_name = 'programname'
   AND session_id = 'vmart_node0001-203745:0x1b73f15'
 ORDER BY query_start;
\o
After running this code snippet you might want to open the file and run some SQL formatter over it. There is one small issue though: your whole SQL statement is one line in that table. If you have comments in your multiline statement you have to adjust:
-- this comment will bite you
/* this comment is fine */
With your newly created SQL logfile you can either run it query by query, spending a lot of time, or break the file up into multiple parts and use vsql to get to the interesting part:
\set ECHO queries
\timing
\i programlogfile_20220612_part1.sql
As you execute commands from a file you want to know which query is currently running. Thus the echo. Timings are always nice. Store them in your notes. You never know how much your future self will appreciate that information. Or maybe even a colleague.
I usually stop before a COMMIT or TRUNCATE TABLE to peek into the current state. Well, temporary tables of course. Your permanent tables you should not TRUNCATE. Yes, you should not DELETE, UPDATE or INSERT in bulk on your tables in Vertica either. As the data is stored per column, and not per row as in other RDBMS, the database has to rebuild the columns if you modify the content. Inserting in general is fine, even mass inserts if you use COPY. But what should you do instead?
Create a staging table with the same layout, including the projections, as your main table. Start with a TRUNCATE TABLE of your staging table. Always! You never know what a previous run of your code might have left in that table. Insert your new data into the staging table. If necessary even run a SELECT on the main table to INSERT some old data. When done, run SWAP_PARTITIONS_BETWEEN_TABLES. This happens instantly, and all queries by other users get either the old content or the new content of the table. In contrast, a TRUNCATE TABLE on the main table is not part of a transaction, so everybody accessing that table would get an empty result set until you have filled the table again.
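Put together, a hedged sketch of that workflow; the table names and the partition range are made up, and the staging table must have the same definition and partition expression as the main table:
-- always start with a clean staging table
TRUNCATE TABLE destination_staging;

-- build the new content (hypothetical columns)
INSERT INTO destination_staging
SELECT report_date, customer_id, SUM(amount)
  FROM source_table
 GROUP BY report_date, customer_id;

COMMIT;

-- atomically swap one month of partitions into the main table
SELECT SWAP_PARTITIONS_BETWEEN_TABLES('destination_staging', '2022-06-01', '2022-06-30', 'destination');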
But there are more columns in v_monitor.query_profiles.
First of all you should start with labeling your statements:
SELECT /*+label(programname_myselectquery)*/ COUNT(*) FROM t;
While you are at it, set your client connection label as well!
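If I remember correctly this can be done with a simple function call (many client drivers also offer a connection parameter for it), for example:
-- 'programname' is just a placeholder label
SELECT SET_CLIENT_LABEL('programname');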
Now you have more tools to figure out where your program is spending its time:
SELECT identifier,
       session_id,
       TO_CHAR( AVG(query_duration_us/1000/1000), '999g999g999g990d99') AS AVG,
       TO_CHAR( MIN(query_duration_us/1000/1000), '999g999g999g990d99') AS MIN,
       TO_CHAR( MAX(query_duration_us/1000/1000), '999g999g999g990d99') AS MAX
  FROM v_monitor.query_profiles
 WHERE identifier LIKE 'programname_%'
   AND is_executing IS FALSE
 GROUP BY 1,2
 ORDER BY 5 DESC
 LIMIT 50;

SELECT identifier,
       session_id,
       MAX(processed_row_count) AS max_rows,
       TO_CHAR( MAX(reserved_extra_memory_b/1024/1024), '999g999g999g990d99') AS max_memory,
       TO_CHAR( MAX(query_duration_us/1000/1000), '999g999g999g990d99') AS MAX
  FROM v_monitor.query_profiles
 WHERE identifier LIKE 'programname_%'
   AND is_executing IS FALSE
 GROUP BY 1,2
 ORDER BY 3 DESC
 LIMIT 50;
On the other hand, you already write a log file in your application containing timestamps and the number of rows each query returned, don’t you?
By using these techniques your database can spend more time on user queries, and you will very seldom be bothered with requests to reprocess anything.