Five Times Faster: 2012

Monday, December 31, 2012

SAS programmers are preventing cost-cutting reforms

If the pharmaceutical industry wants cost-efficiency reforms in statistical programming services through higher worker productivity, it must evaluate, test, and use alternatives to the SAS programming language.

The pharmaceutical industry has tried to solve this worker productivity problem over the past two decades by writing one SAS MACRO library, after another, after another. When the latest library doesn't solve the productivity problem, write another library. This approach does not work. (How has the cost per statistical table in a Clinical Study Report changed from 1990 to 2012, if at all?)

Senior pharmaceutical SAS programmers do not want any non-SAS product design to be examined, tested, or used. That is their position today, and has been their position for decades. The logical conclusion is that pharmaceutical SAS programmers are preventing cost-efficiency reforms.

The Vilno Table programming language gives much higher worker productivity for complex statistical table production than the SAS programming language does. Often a table that needs 400 lines of SAS code needs fewer than 20 lines of Vilno Table code (and the logic of the Vilno code is also easier to read). That's not a typo: you reduce the amount of code you have to write by a factor of 20 (400 divided by 20 = 20)! Higher worker productivity leads to lower costs.

The evidence, logical arguments, and illustrative examples showing this are rock solid.

The response of most (but not all) senior pharmaceutical SAS programmers is, for all practical purposes:
"We don't care. And we prefer you not talk."

With a few exceptions, the response is apathy, silence, and in a few cases more obvious obstructionism and censorship (moderators of professional internet forums such as PharmaSUG and PhUSE have the power to censor). And they don't have a counterargument that makes logical sense. (Some will say "You can write a macro for that", but they do not provide the external APIs of the macro library.)

Pharmaceutical shareholders would benefit from such a change. Pharmaceutical SAS programmers do not want change.

Sunday, December 30, 2012

Flexibility and Productivity

Vilno Table gives a simultaneous combination of high flexibility and high worker productivity that no SAS MACRO library can come close to achieving.
Senior pharmaceutical SAS developers who have a vested interest in preventing change retort: "A SAS MACRO can solve that problem. Non-SAS product designs are unwelcome in the pharmaceutical industry".

A SAS MACRO library might (even here there is difficulty) work for a small subset of tables that satisfy a very strict set of assumptions. But for the wide variety of statistical tables that statisticians and physicians ask for every day, a SAS MACRO library simply does not work.

The problem is the flexibility of the SAS programming language goes only so deep: whenever you try to solve the productivity problem with a SAS MACRO library, flexibility collapses. Any candidate for a standard SAS MACRO library is rigid and difficult to use - it lacks flexibility.

The Vilno Table programming language has high worker productivity: the worker can do a complex statistical table in 20 lines of code (instead of spending 3 hours writing hundreds of lines of code, as with SAS). But Vilno Table combines high worker productivity with high flexibility: summary statistics and advanced statistics on the same page, advanced statistics from different models in the same table, different parts of the table are allowed to use different data sources and different computations, and so on.

Take the framework of tree data structures, blend it in a very careful manner with the types of concepts in a statistician's head just before she says "I want you to code for me a table that's BLA-BLA-BLA", and what you get is a product design that completely blows the SAS programming language out of the water. In terms of worker productivity, it's not even a close call, even if many senior SAS programmers do not want to admit it.

Chi-Square and Linear Model Connectors

The current software implementation of Vilno Table has just two syntax connectors (at the moment): the chi-square connector and the linear model connector. Recall that summary statistics (N, %, median, etc.) do not need such connectors. (The extern-syntax connector (for rare procedures) is not yet operational.)

Consider briefly the implications of the chi-square connector and the linear model connector. The chi-square procedure is (just barely) more advanced than a summary statistic. But the structure of the output data is very simple. So the small code module that is the chi-square connector is simple and small. Other procedures that are just barely more complex than a summary statistic (Fisher's exact, t-test, etc.) will only require a connector module that is quite similar to the chi-square connector.

The linear model connector is quite different - it's a proof of concept (like the New York song: if you can make it here, you can make it anywhere). The structure of the output data from a linear model procedure is nuanced. This means that if I can write a fully functional and correct syntax connector for the linear model (which I have already done in the current software version), then it is very plausible to do so for any statistical procedure: mixed model, factor analysis, survival analysis, etc.

SYNTAX CONNECTORS

With as little as THREE lines of Vilno Table code, the end user can draft a statistical table with a mixture of summary statistics and advanced statistics (p-values, etc.), a table that in SAS needs hundreds of lines of code and hours of work. How can this be possible? Part of the answer: syntax connectors.

A syntax connector is a small code module attached to the main conceptual compiler. Each advanced statistical procedure (chi-square, survival analysis, linear model, etc.) needs a syntax connector. When the syntax connector for a statistical procedure (say, linear model) is available, then you can request the output statistics directly in the column and row statements ("fvalue" is F-statistic, "pval_pw" is p-value for a pairwise comparison, and so on), which makes your work so much easier. (For summary statistics, syntax connectors are not needed).
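To make the idea concrete, here is a rough Python sketch of what a connector lookup could look like. This is an illustration only, not the actual Vilno Table implementation: the registry layout and the names CONNECTORS, SUMMARY_STATS, and resolve are assumptions invented for this sketch. Only the statistic keywords themselves (pvalue, est, fvalue, est_pw, pval_pw) come from the language.

```python
# Hypothetical registry: statistical procedure -> statistic keywords it supplies.
# The layout is an assumption for illustration, not the real compiler internals.
CONNECTORS = {
    "chisq": {"pvalue"},
    "lm": {"est", "fvalue", "est_pw", "pval_pw"},
}

# Summary statistics need no connector at all.
SUMMARY_STATS = {"N", "%", "mean", "std", "median"}


def resolve(keyword, model=None):
    """Say who supplies a statistic keyword found in a col or row statement."""
    if keyword in SUMMARY_STATS:
        return "summary"
    if model in CONNECTORS and keyword in CONNECTORS[model]:
        return model
    raise KeyError(f"no connector supplies {keyword!r} for model {model!r}")


print(resolve("N"))              # summary statistic, no connector involved
print(resolve("pval_pw", "lm"))  # supplied by the linear model connector
```

The point of the sketch: once a connector is registered for a procedure, its output statistics become ordinary keywords that the end user can drop into the column and row statements.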

[ To some degree, this blog post is inside baseball: the end user drafting tables won't read or write syntax connectors - they are part of the Vilno Table product. But if you are asking how this can be possible, this blog post is for you. ]

The current version of the software has a chi-square connector and a linear model connector under the hood. What that means is that the conceptual compiler "understands" how to handle chi-square statistics and linear model statistics (including F-statistic, regression coefficients, least-square means, pairwise difference estimates). Of course, more syntax connectors shall be added (survival analysis, mixed model, and so on). In addition, an all-purpose extern-syntax connector shall be added for very rarely used statistical procedures.

**************************************************

The extern-syntax connector will allow for statistical tables that include advanced statistics from statistical procedures that do not yet have a syntax connector written for them. This will be especially useful for statistical procedures that are highly specialized (and not widely known) but have an implementation in one of the computational back engines that Vilno Table can be connected to (R, S-Plus, SAS/STAT, etc.).

For the most part, using the extern-syntax connector will involve writing a paragraph of code in the native syntax of the statistical procedure computational back engine. In addition, appended to the end of this paragraph is a line of code for meta-data description (telling the conceptual compiler how to expect the output data to be structured). [ By contrast, for statistics that have regular syntax connectors installed, the end-user does not have to write or read any native syntax.]

Tuesday, December 18, 2012

Computational Back Engines

The software implementation of Vilno Table is (mostly) a front end. It needs to be connected to two computational back engines: one engine for data transformation, data preparation, and summary statistics, and the other engine for advanced statistical procedures (linear model, survival analysis, etc.).

The front end, a "conceptual compiler", controls the two computational back engines: giving them lower level code files (that the worker does not have to look at) to execute and gathering the results to put in a statistical table.

The choices for the data transformation back engine include:
1. Vilno Data Transformation (VDT). A low-cost open source replacement for the SAS datastep (as well as proc transpose, proc means, and proc sql (for many-to-many joins))
2. SAS/BASE (SAS datastep, proc sql, proc transpose, proc means)
3. SPSS (possibly), depending on demand
4. others, depending on demand

The choices for the statistical procedure back engine include:
1. S programming language implementation (R or S-Plus)
2. SAS/STAT (procedures for linear model, survival analysis, etc.)
3. SPSS, depending on demand
4. others, depending on demand

At this time, the current implementation of Vilno Table uses low-cost choices for the computational back engines: VDT and R. (Remember: the worker does not have to write or read any VDT code or R code to produce statistical tables, Vilno Table does that work under the hood.)

Also, when a later version of Vilno Table is configured to two or three statistical procedure back engines (at the same time, on the same hard drive), which is very doable, then you can easily do a statistical table with p-values and other statistics from R and S-Plus and SAS/STAT, all on the same page.

changing gears

SWITCHING GEARS, WHEN IT'S NEEDED

The history of programming languages shows that when the syntax of a programming language is at a higher level of abstraction, you can get a huge increase in worker productivity, but situations where a lower level of abstraction is still needed can occasionally occur.

Programming languages such as SQL, SPSS, SAS, VDT (Vilno Data Transformation), and Vilno Table are data paragraph scripting languages - consisting of paragraphs of code that read and create datasets. R is often used as a data paragraph language (Python can be, but its purpose is more general).

The worker can use source code files that consist of multiple paragraphs of code. Different paragraphs can use different levels of abstraction and different syntaxes, as long as the datasets created have a common format (i.e. the row-and-column format used with SQL). For Vilno Table, the end user can write paragraphs of Vilno Table code (such as the table paragraph), and the end user can also write paragraphs of code in the native syntax of the computational back engines, if needed.

Fine. But WHY would the end user want to do that? Because statistical programming is a multi-stage work process: data analysis, data reporting, and multiple stages of data transformation (data preparation). In particular you have:

1: Early-stage Data Transformation
2: Mid-stage Data Transformation
3: Table-stage Data Transformation
4: Statistical Table Production
5: Rarely used niche statistical procedures

Table-stage Data Transformation: Most of the data transformation code you see in statistical table SAS programs is table-stage data transformation, data transformation that's an integral part of statistical table formation. You do not want workers writing table-stage data transformation code by hand - it's a big waste of time and money. The table paragraph handles such data transformation "under the hood".

Statistical Table Production: Use the table paragraph, and the statistical table needs less than 15 lines of code. Use the SAS programming language, and the same table needs 400 lines of code.

Niche statistical procedures: For very common statistical procedures, Vilno Table has now or will have syntax connectors installed - you can include advanced statistics directly in the table with just a couple of lines of code. But for rarely used statistical procedures (that lack a syntax connector), it will be necessary to write a paragraph of code to call that statistical procedure in the native syntax of the chosen statistical procedure back engine (typically, R, S-Plus, SAS/STAT).

Data Transformation (data preparation, data cleaning):
You can always use the old-fashioned syntax of the data transformation computational back engine to do data transformation, data preparation, and data cleaning: usually either VDT or the SAS datastep (and someone who knows one can learn the other very quickly).
Data transformation needs are diverse and hard to categorize, but I divide it into early-stage data transformation and mid-stage data transformation.

Early-stage data transformation is often called "data cleaning". Typically, you tell the statistician what's wrong with the data (partially missing data, unusual visits, etc.) and she decides what data adjustments to make prior to analysis. This is tedious work, and a product design that multiplies worker productivity for data cleaning is still an unsolved computer science problem. It's unusual for data cleaning code to be in statistical table programs: it's supposed to be finalized in the programs that create the ready-for-analysis datasets from the raw datasets. For data cleaning, there is simply no choice: use old-fashioned syntax (VDT or SAS datastep), with low worker productivity. [This is a big branch of applied computer science in which I am active, but in which academic computer scientists (and employees of SAS Institute) have accomplished absolutely nothing over the past 20 years - that's worth noting. There is only one serious researcher in this part of computer science, and that is me.]

Mid-stage data transformation is more geometric, almost like a generalized transposing. Calculation of % change from baseline is an example of mid-stage data transformation. Usually mid-stage data transformation is done at the end of programs creating ready-for-analysis datasets, or at the beginning of a statistical table program. A worker can often do mid-stage data transformation (with higher worker productivity) using the parity paragraph. Using old-fashioned syntax (VDT or SAS) is another option.
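For anyone unfamiliar with the calculation, here is a small Python sketch of % change from baseline. It is not VDT, SAS, or parity paragraph code. The record layout (patid, visit, value), the convention that visit 0 is the baseline, and the function name are all assumptions made for this illustration.

```python
def pct_change_from_baseline(records, baseline_visit=0):
    """Attach % change from baseline to every post-baseline record.

    records: list of dicts with keys patid, visit, value (hypothetical layout).
    """
    # One baseline value per patient.
    baseline = {r["patid"]: r["value"]
                for r in records if r["visit"] == baseline_visit}
    out = []
    for r in records:
        base = baseline.get(r["patid"])
        # Skip the baseline row itself, and rows with no usable baseline.
        if r["visit"] == baseline_visit or base in (None, 0):
            continue
        out.append({**r, "pct_chg": 100.0 * (r["value"] - base) / base})
    return out


records = [
    {"patid": 1, "visit": 0, "value": 80.0},
    {"patid": 1, "visit": 1, "value": 84.0},
    {"patid": 2, "visit": 0, "value": 50.0},
    {"patid": 2, "visit": 1, "value": 45.0},
]
result = pct_change_from_baseline(records)
for r in result:
    print(r["patid"], r["visit"], r["pct_chg"])
```

Notice the "geometric" flavor: a value from one row (the baseline visit) has to be carried sideways onto other rows of the same patient before the arithmetic can happen.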

Tuesday, August 7, 2012

My blogs: what goes where

This is my fivetimesfaster blog, for specific issues of the Vilno Table programming language.

For more general issues of applied computer science, I have another blog at

http://comp-sci-stuff.blogspot.com

There I discuss more general issues of statistical software, data transformation and data cleaning.

Tuesday, May 8, 2012

Vilno Table Programming Language White Paper

Here is a link to the
"Vilno Table Programming Language White Paper"

It is a .PDF file stored in a Google Documents area:

https://docs.google.com/open?id=0B2oCqHJ9cxhCNkp2T0dnMFFBdlE

Rather than viewing it in the Google Documents area, download it and open the downloaded copy: when a .PDF file is viewed within Google Docs, a couple of pages might show up blurry.

The first 8 table examples in the appendix are the most intuitive way to get a feel for how this language works.

Monday, April 16, 2012

Vilno Table, Real Examples of How It Works

These are actual examples produced by the software product (not yet version 1.0, but very functional: as you can see, the technology works).

Because the tables produced by the software are not so easy to put into a blog post, I am using a link to a document in "Google Docs":

https://docs.google.com/document/d/1O21VX4MCwcs2u_1eCwx5An-Di9NDld0-GCp0yHUYZLA/edit

So click on it, and you will see the examples from the appendix of the white paper.

Each example is 2 pages: the code (usually less than 15 lines), and the table that it produces.

I am having difficulty choosing landscape/portrait for each page, so every page, unfortunately, is landscape. When you upload to Google Docs from a Word document on the hard drive, it throws some stuff out, like page-by-page choices for landscape/portrait.

Sunday, April 1, 2012

More Examples of Vilno Table

Here I show a few examples of rather simple statistical tables. Vilno Table has an enormous worker productivity advantage over SAS for complex, customized, picky statistical table requests. But it's also easy to use for simple tables. The more complex and picky the statistical table request is, the greater the advantage that Vilno Table has over SAS (and Excel) in terms of worker productivity. (Therefore, for very simple requests, the difference in productivity is less).

Most of these examples use only summary statistics (the last example will be a one-way ANOVA). Each example has only one available dataset and essentially one analysis (Vilno Table can produce a table using multiple data sources and different analyses, but these examples are simple).

This is code that describes the available datasets. Here, there is only one dataset, the PATINFO dataset :

inputdset asc a/PATINFO site patid trtgrp gender race happy weight age ;

( trtgrp means "Treatment Group", happy is a categorical outcome variable).

This is just a two-way frequency table:

denom trtgrp ;

col trtgrp*( N % ) ;

row happy ;
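For comparison, here is roughly what those three lines ask for, spelled out by hand in Python. This is only a sketch with made-up data, not what Vilno Table generates under the hood. I am assuming here that "denom trtgrp" means each % uses the treatment group total as its denominator. The function name and the eight sample records are hypothetical.

```python
from collections import Counter


def two_way_freq(rows, col_var, row_var):
    """Return {(row_level, col_level): (N, %)} with % taken within each column."""
    denom = Counter(r[col_var] for r in rows)               # N per treatment group
    cells = Counter((r[row_var], r[col_var]) for r in rows)  # N per table cell
    return {
        (row_level, col_level): (n, 100.0 * n / denom[col_level])
        for (row_level, col_level), n in cells.items()
    }


# Made-up PATINFO-like records, for illustration only.
patinfo = (
    [{"trtgrp": "Placebo", "happy": "yes"}] * 1
    + [{"trtgrp": "Placebo", "happy": "no"}] * 3
    + [{"trtgrp": "60mg", "happy": "yes"}] * 3
    + [{"trtgrp": "60mg", "happy": "no"}] * 1
)
freq = two_way_freq(patinfo, "trtgrp", "happy")
for key in sorted(freq):
    print(key, freq[key])
```

Even this toy version needs more code than the table paragraph, and it does not yet lay anything out on a page.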

This is the same thing, but add a chi-square p-value in the upper right corner:

( I add a model statement, and add the word "pvalue" to the column statement )

denom trtgrp ;

model chisq(trtgrp*happy) ;

col trtgrp*( N % ) pvalue ;

row happy ;
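To show what sits behind that single word "pvalue", here is a hand-written Python sketch of the Pearson chi-square test on a contingency table. It is an illustration, not the code of the chi-square connector or of the back engine. The counts are made up, no continuity correction is applied (an assumption on my part), and the p-value helper only covers the 1 degree of freedom case.

```python
import math


def pearson_chisq(observed):
    """Pearson chi-square statistic and degrees of freedom for a 2-D count table."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df


def pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Made-up counts: rows are happy = yes / no, columns are Placebo / 60mg.
observed = [[1, 3],
            [3, 1]]
stat, df = pearson_chisq(observed)
p = pvalue_1df(stat)
print(stat, df, round(p, 4))
```

For larger tables (more than 1 degree of freedom) you would hand the statistic to a statistics package, which is exactly the job of the statistical procedure back engine.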

This is assorted summary statistics, for certain demographic subgroups:

col all gender*race [age<65] ;

row N mean(weight) std(weight) median(weight) ;
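Here is a hand-written Python sketch of the same request, to show which subgroups become columns. It is an illustration with made-up records, not generated code. The race codes "A" and "B" are placeholders, and I am assuming "std" means the sample standard deviation.

```python
import statistics


def summarize(values):
    """N, mean, sample standard deviation, and median of a list of numbers."""
    return {
        "N": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else None,
        "median": statistics.median(values),
    }


# Made-up PATINFO-like records, for illustration only.
patinfo = [
    {"gender": "F", "race": "A", "age": 70, "weight": 60.0},
    {"gender": "F", "race": "A", "age": 50, "weight": 64.0},
    {"gender": "M", "race": "B", "age": 40, "weight": 80.0},
    {"gender": "M", "race": "B", "age": 66, "weight": 90.0},
]

# Columns requested by: col all gender*race [age<65] ;
columns = {"all": patinfo}
for r in patinfo:
    columns.setdefault((r["gender"], r["race"]), []).append(r)
columns["age<65"] = [r for r in patinfo if r["age"] < 65]

table = {name: summarize([r["weight"] for r in rows])
         for name, rows in columns.items()}
for name, stats in table.items():
    print(name, stats)
```

Each table factor in the column statement (all, gender*race, [age<65]) simply selects a different subset of patients, and the row statement says which statistics to compute on each subset.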

Okey-dokey, let's try one slightly more advanced example, a one-way ANOVA (well, technically "age" is a continuous covariate). Lm stands for "Linear model".

model lm( weight ~ trtgrp -1 + age ) ;

col trtgrp*est all*( est_pw("60mg"-"Placebo") pval_pw("60mg"-"Placebo") ) ;

row all ;

What you get is least-square mean for every treatment group, and just one pairwise comparison.

Let's make the row and column headers easier to read with:

label trtgrp "Treatment Group" est "Least Square Mean"

all "60mg vs Placebo" ""

est_pw "Difference" pval_pw "P-value" ;

To the dataset description code at the top, for the linear model, I'll need to add:

categorical trtgrp ;

continuous weight age ;

(This extra description code was not needed for the summary statistics tables.)

The above 4 examples are a lot simpler than most of the examples in the appendix of the Vilno Table Programming Language white paper. The above examples do not show the full flexibility of Vilno. They show it's easy to produce tables that should be easy.

Just one more example before I go: the next table is several frequency cross-tabulations, each one with a chi-square p-value on the right side, stacked vertically (each row section has a different row category, but the column category is always treatment group).

Several categorical tests, with N and % , and the p-value in the right column, like a baseline characteristic table. Again, most of the important stuff is in the model statement, the column statement, and the row statement.  

model chisq(thisrowcat*trtgroup*N) ; 

col (trtgroup all)*(N %) pvalue ;

row gender race age_group ;

This is fairly similar to table A1 in the Vilno Table white paper. When the current beta version has cr-modifier statements added to the parser, a later version will be able to put continuous and categorical statistics into the same column, but not yet. (A baseline characteristic table crams a lot of stuff into the same page, so N and % must share the same column with mean and std. deviation).

Thursday, February 23, 2012

AE table in 11 lines of code: Why it is possible.

Here I explain the 11 lines of code that produce the AE table in more detail, to clear up some misunderstandings:

Here is code for table A3 in the Vilno Table Programming Language white paper, an AE table with the typical bells and whistles:

title "Table A3: AE table, with chi-square (75-patient dbase)" ;

directoryref a="/home/robert/test" ;

inputdset asc a/patinfo3 patid trt 1*(patid) ;

inputdset asc a/advevt3a patid bodysys prefterm ;

thing pat uniqval(patid) a/patinfo3 ;

n~n(pat) ;

printto "/home/robert/test/outp01" ;

denom trt ;

model chisq(thisrow?*trt*n) ;

col (trt all)*(n %) pvalue ;

row all have(a/advevt3a) bodysys*(all nothave prefterm) ;

(Please note: patinfo3 and advevt3a are names of datasets, the patient_info dataset and the adverse events dataset.)

(Please note: trt is the variable name for treatment group (here 3 groups); bodysys and prefterm are, obviously, variable names for body system and preferred term.)

As I've already said, the best way to explain and learn this language is to focus attention on 3 lines of code: the model statement, the column statement, and the row statement (which are the last 3 lines of code in the above example). It is by reading these 3 lines of code that you can see what the statistician is asking for. Most of the analysis logic is in these 3 lines. If you do many statistical tables off of the same database, the other lines of code require little modification.

But to dispel some confusion, I will go through all 11 lines. Naturally, I begin with the model, column, and row statements:

Look at the column statement:

col (trt all)*(n %) pvalue ;

You have an N and % column for each treatment group (there are 3 treatment groups in table A3). You have an N and % column for all patients together. In the right-most column, you have a p-value from a simple categorical test.

Look at the row statement :

row all have(a/advevt3a) bodysys*(all nothave prefterm) ;

Going through each piece of the row statement in order, what it does:

row

all -> row with grand total N for each group

have(a/advevt3a) ->

row with number of patients with any AE at all (no matter what body system or preferred term)

bodysys*( -> for each body system, a set of rows that include the following:

all -> row for number of patients with AE for this body system

nothave -> row for number of patients who do NOT have an AE (for this body system)

prefterm -> for each preferred term, number of patients with AE
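The row-by-row breakdown above can be sketched by hand in Python as follows. This is an illustration with made-up data, not what the compiler generates. The treatment labels "A" and "B", the row labels, the function names, and the alphabetical ordering are assumptions made for this sketch. Note that it counts patients, not event records, which is what the thing statement asks for.

```python
from collections import Counter


def n_by_trt(patids, trt_of):
    """Number of distinct patients per treatment group."""
    return dict(Counter(trt_of[p] for p in patids))


def ae_rows(trt_of, ae_records):
    """Build the rows: all, have, then per body system: all, nothave, each term."""
    everyone = set(trt_of)
    rows = [("all", n_by_trt(everyone, trt_of))]
    with_ae = {pid for pid, _, _ in ae_records}
    rows.append(("have AE", n_by_trt(with_ae, trt_of)))
    for bodysys in sorted({b for _, b, _ in ae_records}):
        in_sys = {pid for pid, b, _ in ae_records if b == bodysys}
        rows.append((bodysys + " / all", n_by_trt(in_sys, trt_of)))
        rows.append((bodysys + " / nothave", n_by_trt(everyone - in_sys, trt_of)))
        for term in sorted({t for _, b, t in ae_records if b == bodysys}):
            with_term = {pid for pid, b, t in ae_records
                         if b == bodysys and t == term}
            rows.append((bodysys + " / " + term, n_by_trt(with_term, trt_of)))
    return rows


# Made-up data: patid -> treatment group, and (patid, bodysys, prefterm) records.
trt_of = {1: "A", 2: "A", 3: "B", 4: "B"}
ae_records = [
    (1, "Cardiac", "Tachycardia"),
    (1, "Cardiac", "Palpitations"),  # same patient, second Cardiac event
    (3, "Cardiac", "Tachycardia"),
    (3, "GI", "Nausea"),
    (4, "GI", "Nausea"),
]
rows = dict(ae_rows(trt_of, ae_records))
for label, counts in rows.items():
    print(label, counts)
```

Patient 1 has two Cardiac events but is counted once in the Cardiac rows. The percentages and the per-row chi-square p-values are left out of this sketch.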

This AE table is a famous example. It is not a beginner's example for two reasons: for each body system there is a row for number and % of patients who do NOT have an event ; and the request for a categorical p-value for EVERY row in the table (for every preferred term). For these two advanced issues, two advanced keywords in the programming language are used: NOTHAVE and THISROW? respectively.

In a very simple example, the model statement would look like this:

model chisq(gender*race*N) ;

The column and row statements, which are at the center of this programming language, use a "visual tree" method: you write a syntax that becomes a tree data structure, this tree data structure becomes a visual display at the top of (and left side of) the statistical table. The terms in the column and row statement are called TABLE FACTORS, which become nodes in the visual tree.

THE TREE DATA STRUCTURE IS A COMBINATION OF DETAILS THAT SHOW WHAT THE STATISTICIAN IS ASKING FOR.
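Here is a small Python sketch of the "visual tree" idea, using the column statement "col (trt all)*(n %) pvalue ;" as the example. The nested-list representation and the function name leaves are assumptions invented for this illustration; they are not the internal data structures of the conceptual compiler.

```python
from itertools import product


def leaves(factors):
    """Expand a side-by-side list of table factors into column paths.

    A factor is either a plain name (str) or a ("cross", [group, group, ...])
    node, where each group is itself a side-by-side list of factors.
    """
    paths = []
    for f in factors:
        if isinstance(f, str):
            paths.append((f,))
        else:
            _, groups = f
            # The * operator: every path of one group paired with every path
            # of the next group.
            for combo in product(*(leaves(g) for g in groups)):
                paths.append(tuple(name for part in combo for name in part))
    return paths


# col (trt all)*(n %) pvalue ;
col = [("cross", [["trt", "all"], ["n", "%"]]), "pvalue"]
for path in leaves(col):
    print(" > ".join(path))
```

Each printed path is one branch of the header tree. In a real table, the trt node would fan out further into one column block per treatment group found in the data.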

With elementary tables, the table factors are: categorical variable names, boolean expressions (such as [age<65]), names of statistics (such as N, %, mean, std (std. deviation), pvalue, est (estimate for least square mean), fvalue (F-statistic), and so on), plus the ALL keyword.

With more advanced tables, more advanced keywords in the programming language might be used: HAVE NOTHAVE THISROW? THISROWCAT .

Lines 1, 2, and 7 are simple, obvious, and nothing new.

Lines 3, 4, 5, and 6 are input dataset description code. That includes the thing statement, which is a new innovative feature: you have to define WHAT you are counting, and later specify WHERE you are counting. If you want to count people AND count event records (using ae_record_idnum if in the dataset), you could have TWO thing statements for the same table (I've seen it asked for (tracking before database lock), but it's not that common a request).

The keyword "inputdset" means: here is an input dataset which I describe here (obviously there is a PATINFO dataset and an AE dataset, as you expect). I'll get into data source identification later, but if you have two datasets, it's pretty obvious to the compiler.

The denominator statement is "denom trt ;" . Is that fairly obvious? It actually is part of the table paragraph, with the last three lines.

This is not a macro library, it is a new language, and it can produce a huge variety of different tables. (I still haven't shown you the linear model example, table A2, that is going to blow your mind!).

Robert Wilkins