Thursday, February 23, 2012

AE table in 11 lines of code: Why it is possible.


Here I explain, in more detail, the 11 lines of code that produce the AE table, to clear up some misunderstandings.

Here is code for table A3 in the Vilno Table Programming Language white paper, an AE table with the typical bells and whistles:

title "Table A3: AE table, with chi-square (75-patient dbase)" ;
directoryref a="/home/robert/test" ;
inputdset asc a/patinfo3 patid trt 1*(patid) ;
inputdset asc a/advevt3a patid bodysys prefterm ;
thing pat uniqval(patid) a/patinfo3 ;
n~n(pat) ;
printto "/home/robert/test/outp01" ;
denom trt ;
model chisq(thisrow?*trt*n) ;
col (trt all)*(n %) pvalue ;
row all have(a/advevt3a) bodysys*(all nothave prefterm) ;

(Please note: patinfo3 and advevt3a are the names of datasets: the patient-info dataset and the adverse-events dataset.)
(Please note: trt is the variable name for treatment group (here, 3 groups); bodysys and prefterm are, obviously, the variable names for body system and preferred term.)

As I've already said, the best way to explain and learn this language is to focus attention on 3 lines of code: the model statement, the column statement, and the row statement (the last 3 lines of code in the example above). By reading these 3 lines of code you can see what the statistician is asking for; most of the analysis logic is in them. If you produce many statistical tables from the same database, the other lines of code require little modification.
But to dispel some confusion, I will go through all 11 lines. Naturally, I begin with the model, column, and row statements:

Look at the column statement:
col (trt all)*(n %) pvalue ;
You have an N and % column for each treatment group (there are 3 treatment groups in table A3). You have an N and % column for all patients together. In the right-most column, you have a p-value from a simple categorical test.

Look at the row statement:
row all have(a/advevt3a) bodysys*(all nothave prefterm) ;
Going through each piece of the row statement in order, what it does:
row
all -> row with the grand total N for each group
have(a/advevt3a) ->
row with the number of patients who have any AE at all (no matter what body system or preferred term)
bodysys*( -> for each body system, a set of rows that includes the following:
all -> row with the number of patients with an AE for this body system
nothave -> row with the number of patients who do NOT have an AE for this body system
prefterm -> for each preferred term, the number of patients with that AE
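The counting behind the row statement can be sketched in plain Python. This is my own toy illustration, with made-up AE records, not the language's implementation; the key point is that every count is a count of distinct patients:

```python
# Toy sketch of the row-statement counting logic. Each AE record is
# (patid, bodysys, prefterm); all counts are distinct-patient counts.
ae_records = [
    (1, "CARDIAC", "PALPITATIONS"),
    (1, "CARDIAC", "TACHYCARDIA"),
    (2, "CARDIAC", "PALPITATIONS"),
    (3, "GASTRO",  "NAUSEA"),
]
all_patients = {1, 2, 3, 4, 5}   # from the patient-info dataset

# have(...): patients with any AE at all
have_any = {pid for pid, _, _ in ae_records}

# bodysys*all: patients with an AE in each body system
by_bodysys = {}
for pid, bs, _ in ae_records:
    by_bodysys.setdefault(bs, set()).add(pid)

# bodysys*nothave: patients with NO AE in that body system
nothave = {bs: all_patients - pids for bs, pids in by_bodysys.items()}

# bodysys*prefterm: patients per preferred term
by_term = {}
for pid, bs, pt in ae_records:
    by_term.setdefault((bs, pt), set()).add(pid)

print(len(have_any))               # patients with any AE at all
print(len(by_bodysys["CARDIAC"]))  # patients with a cardiac AE
print(len(nothave["CARDIAC"]))     # patients with no cardiac AE
```

Note that patient 1 has two cardiac AE records but is counted once; that is the whole point of counting distinct patients rather than records.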


This AE table is a famous example. It is not a beginner's example, for two reasons: for each body system there is a row for the number and % of patients who do NOT have an event; and a categorical p-value is requested for EVERY row in the table (for every preferred term). For these two advanced needs, two advanced keywords in the programming language are used: NOTHAVE and THISROW?, respectively.
In a very simple example, the model statement would look like this:
model chisq(gender*race*N) ;
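The statistic behind "chisq(thisrow?*trt*n)" is an ordinary Pearson chi-square on a 2x3 contingency table: in-this-row yes/no crossed with the 3 treatment groups. Here is a minimal stdlib-only sketch of that statistic for one row, with illustrative counts that are not from the paper:

```python
# Pearson chi-square statistic for one table row: has-this-AE (yes/no)
# crossed with 3 treatment groups. Counts are made up for illustration.
observed = [
    [10,  6,  8],   # patients in the row category ("thisrow? = yes")
    [15, 19, 17],   # remaining patients in each treatment group
]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

chisq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chisq += (obs - expected) ** 2 / expected

print(round(chisq, 4))
```

In practice you would feed the statistic to a chi-square distribution with 2 degrees of freedom (for 2 rows x 3 columns) to get the p-value printed in the right-most column.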

The column and row statements, which are at the center of this programming language, use a "visual tree" method: you write syntax that becomes a tree data structure, and this tree data structure becomes the visual display at the top of (and left side of) the statistical table. The terms in the column and row statements are called TABLE FACTORS, and they become nodes in the visual tree.
THE TREE DATA STRUCTURE IS A COMBINATION OF DETAILS THAT SHOWS WHAT THE STATISTICIAN IS ASKING FOR.
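To make the tree idea concrete: in a column spec like "(trt all)*(n %) pvalue", the "*" crosses two groups of factors and juxtaposition appends another column. A rough Python sketch of that expansion (treatment-group names are made up; this is my reading of the syntax, not the compiler's actual algorithm):

```python
from itertools import product

# Hypothetical expansion of the column spec "(trt all)*(n %) pvalue".
# "trt" fans out into its category levels, "all" is one extra node,
# "*" takes the cross product, and "pvalue" is appended as a final leaf.
trt_levels = ["TRT_A", "TRT_B", "TRT_C"]   # made-up treatment group names

group1 = trt_levels + ["all"]              # (trt all)
group2 = ["n", "%"]                        # (n %)
columns = [" / ".join(pair) for pair in product(group1, group2)] + ["pvalue"]

print(columns)
print(len(columns))   # 4 * 2 crossed columns plus the p-value column = 9
```

Eight leaf columns from the cross product plus the p-value column gives the nine columns you see across the top of table A3.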

With elementary tables, the table factors are: categorical variable names, boolean expressions (such as [age<65]), names of statistics (such as N, %, mean, std (standard deviation), pvalue, est (estimate for a least-squares mean), fvalue (F-statistic), and so on), plus the ALL keyword.
With more advanced tables, more advanced keywords in the programming language might be used: HAVE NOTHAVE THISROW? THISROWCAT .

Lines 1, 2, and 7 are simple, obvious, and nothing new.
Lines 3, 4, 5, and 6 are input-dataset description code. That includes the thing statement, which is a new, innovative feature: you have to define WHAT you are counting, and later specify WHERE you are counting it. If you want to count people AND count event records (using ae_record_idnum, if it is in the dataset), you could have TWO thing statements for the same table. (I've seen that asked for, when tracking before database lock, but it's not that common a request.)
The keyword "inputdset" means: here is an input dataset, which I describe here (obviously there is a PATINFO dataset and an AE dataset, as you would expect). I'll get into data-source identification later, but if you have two datasets, it's pretty obvious to the compiler.

The denominator statement is "denom trt ;". Is that fairly obvious? It is actually part of the table paragraph, along with the last three lines.
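Putting the thing statement and the denominator statement together: the unit being counted is the distinct patient (uniqval(patid)), and "denom trt" says the % denominators are the distinct-patient counts per treatment group. A toy sketch, with invented records:

```python
# Sketch of "thing pat uniqval(patid)" plus "denom trt": the thing being
# counted is the distinct patient, and the denominator for % columns is
# the distinct-patient count within each treatment group. Toy (patid, trt)
# records stand in for the patient-info dataset.
patinfo = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B")]

groups = {}
for pid, trt in patinfo:
    groups.setdefault(trt, set()).add(pid)   # distinct patids per group

denoms = {trt: len(pids) for trt, pids in groups.items()}
print(denoms)   # -> {'A': 2, 'B': 3}
```

Every % in the table is then a numerator count (from a row cell) divided by the denominator for that column's treatment group.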

This is not a macro library; it is a new language, and it can produce a huge variety of different tables. (I still haven't shown you the linear-model example, table A2, which is going to blow your mind!)


Robert Wilkins



2 comments:

  1. I am intrigued. How does it compare to "R"?

    ReplyDelete
  2. It could not be more different from R. The R programming language (which is the S programming language) was not designed to use tree data structures in this way to boost worker productivity.
    The most popular programming language is not always the one with the best worker productivity and usability (learning curve and readability). R has good usability for some, but not all, types of statistical work (R is good if you assume the data cleaning is already done, you are running several ad-hoc analyses, and the statistical output is not needed in a customized tabular report).
    I have an older product, Vilno Data Transformation, which I open-sourced back in 2007. It is a solid candidate for data transformation/preparation/cleaning (and Vilno Table depends upon it for low-level work). But the R folks have chosen not to listen, largely due to the hype surrounding the R product. A couple of people recommended R + Awk, but from what I have seen of Awk, that isn't the best way to provide a user-friendly data-cleaning solution for the part-time programmer.
    To solve usability and productivity problems, sometimes you need product designs that don't come from AT&T laboratories. Perhaps you can persuade them.

    ReplyDelete