Monday, December 31, 2012

SAS programmers are preventing cost-cutting reforms

If the pharmaceutical industry wants cost-efficiency reforms in statistical programming services, achieved by raising worker productivity, it must evaluate, test, and use alternatives to the SAS programming language.

The pharmaceutical industry has tried to solve this worker productivity problem over the past two decades by writing one SAS MACRO library, after another, after another. When the latest library doesn't solve the productivity problem, write another library. This approach does not work. (How has the cost per statistical table in a Clinical Study Report changed from 1990 to 2012, if at all?)

Senior pharmaceutical SAS programmers do not want any non-SAS product design to be examined, tested, or used. That is their position today, and has been their position for decades. The logical conclusion is that pharmaceutical SAS programmers are preventing cost-efficiency reforms.

The Vilno Table programming language gives much higher worker productivity for complex statistical table production than the SAS programming language does: often a table that needs 400 lines of SAS code needs fewer than 20 lines of Vilno Table code (and the logic of the Vilno code is also easier to read). That's not a typo: you reduce the amount of code you have to write by a factor of 20 (400 divided by 20 = 20)! Higher worker productivity leads to lower costs.

The evidence, logical arguments, and illustrative examples showing this are rock solid.

The response of most (but not all) senior pharmaceutical SAS programmers is, for all practical purposes:
"We don't care. And we prefer you not talk."

With a few exceptions, the response is apathy and silence, and in a few cases more overt obstructionism and censorship (moderators of professional internet forums such as PharmaSUG and PhUSE have the power to censor). And they don't have a counterargument that makes logical sense. (Some will say "You can write a macro for that", but then never provide the macro library's external APIs.)

Pharmaceutical shareholders would benefit from such a change. Pharmaceutical SAS programmers do not want change.

Sunday, December 30, 2012

Flexibility and Productivity

Vilno Table gives a simultaneous combination of high flexibility and high worker productivity that no SAS MACRO library can come close to achieving.
Senior pharmaceutical SAS developers who have a vested interest in preventing change retort: "A SAS MACRO can solve that problem. Non-SAS product designs are unwelcome in the pharmaceutical industry".

A SAS MACRO library might work (even here there is difficulty) for a small subset of tables that satisfy a very strict set of assumptions. But for the wide variety of statistical tables that statisticians and physicians ask for every day, a SAS MACRO library simply does not work.

The problem is that the flexibility of the SAS programming language goes only so deep: whenever you try to solve the productivity problem with a SAS MACRO library, flexibility collapses. Any candidate for a standard SAS MACRO library is rigid and difficult to use - it lacks flexibility.

The Vilno Table programming language has high worker productivity: the worker can do a complex statistical table in 20 lines of code (instead of spending 3 hours writing hundreds of lines of code, as with SAS). But Vilno Table combines high worker productivity with high flexibility: summary statistics and advanced statistics on the same page, advanced statistics from different models in the same table, different parts of the table are allowed to use different data sources and different computations, and so on.

Take the framework of tree data structures, blend it in a very careful manner with the types of concepts in a statistician's head just before she says "I want you to code for me a table that's BLA-BLA-BLA", and what you get is a product design that completely blows the SAS programming language out of the water. In terms of worker productivity, it's not even a close call, even if many senior SAS programmers do not want to admit it.

Chi-Square and Linear Model Connectors



The current software implementation of Vilno Table has just two syntax connectors at the moment: the chi-square connector and the linear model connector. Recall that summary statistics (N, %, median, etc.) do not need such connectors. (The extern-syntax connector, for rarely used procedures, is not yet operational.)

Consider briefly the implications of the chi-square connector and the linear model connector. The chi-square procedure is (just barely) more advanced than a summary statistic, but the structure of its output data is very simple. So the chi-square connector is a small, simple code module. Other procedures that are just barely more complex than a summary statistic (Fisher's exact test, the t-test, etc.) will only require connector modules quite similar to the chi-square connector.
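To make the "simple output structure" point concrete, here is a small illustration in R (one of the computational back engines) - this is plain R, not Vilno Table code: the entire result of a chi-square test is a handful of scalar values, which is why the chi-square connector can stay small.

  # Plain R, shown only to illustrate why the chi-square output is simple:
  # the whole result is a few scalars (statistic, degrees of freedom, p-value).
  tbl <- table(mtcars$am, mtcars$cyl)   # a small contingency table
  ct  <- chisq.test(tbl)
  ct$statistic                          # chi-square statistic
  ct$parameter                          # degrees of freedom
  ct$p.value                            # the p-value a table cell would display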

The linear model connector is quite different - it's a proof of concept (like the New York song: if you can make it here, you can make it anywhere). The structure of the output data from a linear model procedure is nuanced. This means that if I can write a fully functional and correct syntax connector for the linear model (which I have already done in the current software version), then it is very plausible to do so for any statistical procedure: mixed model, factor analysis, survival analysis, etc.
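Again as a plain R illustration (not Vilno Table code), compare how much more structure a linear model produces; this is the nuance the linear model connector has to understand:

  # Plain R: a linear model's output is spread across several structures.
  fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
  anova(fit)            # F-statistics and p-values, one row per model term
  summary(fit)$coef     # regression coefficients, standard errors, t-tests
  confint(fit)          # confidence intervals, yet another piece of output
  # least-square means and pairwise difference estimates require still more
  # post-processing on top of the fitted model object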


SYNTAX CONNECTORS


With as little as THREE lines of Vilno Table code, the end user can draft a statistical table with a mixture of summary statistics and advanced statistics (p-values, etc.), a table that in SAS needs hundreds of lines of code and hours of work. How can this be possible? Part of the answer: syntax connectors.

A syntax connector is a small code module attached to the main conceptual compiler. Each advanced statistical procedure (chi-square, survival analysis, linear model, etc.) needs a syntax connector. When the syntax connector for a statistical procedure (say, linear model) is available, then you can request the output statistics directly in the column and row statements ("fvalue" is F-statistic, "pval_pw" is p-value for a pairwise comparison, and so on), which makes your work so much easier. (For summary statistics, syntax connectors are not needed).

[ To some degree, this blog post is inside baseball: the end user drafting tables won't read or write syntax connectors - they are part of the Vilno Table product. But if you are asking how this can be possible, this blog post is for you. ]

The current version of the software has a chi-square connector and a linear model connector under the hood. What that means is that the conceptual compiler "understands" how to handle chi-square statistics and linear model statistics (including F-statistic, regression coefficients, least-square means, pairwise difference estimates). Of course, more syntax connectors shall be added (survival analysis, mixed model, and so on). In addition, an all-purpose extern-syntax connector shall be added for very rarely used statistical procedures.
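For readers who want to picture the mechanics: the following is a hypothetical sketch, in R, of the idea behind a syntax connector. It is not the actual Vilno Table implementation, and the function name and output layout are invented for illustration. The point is that a connector reduces a procedure's output to a uniform set of named statistics that the conceptual compiler can place into table cells.

  # Hypothetical sketch only - not the real connector code.
  linear_model_sketch <- function(formula, data) {
    fit <- lm(formula, data = data)
    av  <- anova(fit)
    # hand the compiler a flat (name, value) dataset; "fvalue" is the name
    # an end user would request in a column or row statement
    data.frame(statistic = c("fvalue", "pval"),
               value     = c(av[1, "F value"], av[1, "Pr(>F)"]))
  }
  linear_model_sketch(mpg ~ factor(cyl), mtcars)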

**************************************************

The extern-syntax connector will allow for statistical tables that include advanced statistics from statistical procedures that do not yet have a syntax connector written for them. This will be especially useful for statistical procedures that are highly specialized (and not widely known) but have an implementation in one of the computational back engines that Vilno Table can be connected to (R, S-Plus, SAS/STAT, etc.).

For the most part, using the extern-syntax connector will involve writing a paragraph of code in the native syntax of the statistical procedure computational back engine. In addition, appended to the end of this paragraph is a line of code giving a meta-data description (telling the conceptual compiler how to expect the output data to be structured). [By contrast, for statistics that have regular syntax connectors installed, the end user does not have to write or read any native syntax.]
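As a rough illustration of the native-syntax part only (the appended meta-data description line is Vilno Table syntax and is not shown), a paragraph for a specialized procedure might look like the following R sketch; the procedure, variable names, and output file are chosen arbitrarily for the example:

  # Hypothetical native-syntax paragraph: run a specialized procedure in R
  # and leave its results as an ordinary row-and-column dataset that the
  # conceptual compiler can pick up.
  library(MASS)                                  # rlm = robust regression
  fit <- rlm(stack.loss ~ ., data = stackloss)
  out <- data.frame(term     = names(coef(fit)),
                    estimate = unname(coef(fit)))
  write.csv(out, "rlm_output.csv", row.names = FALSE)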

Tuesday, December 18, 2012

Computational Back Engines


The software implementation of Vilno Table is (mostly) a front end; it needs to be connected to two computational back engines: one engine for data transformation, data preparation, and summary statistics, and the other engine for advanced statistical procedures (linear model, survival analysis, etc.).

The front end, a "conceptual compiler", controls the two computational back engines: giving them lower-level code files (that the worker does not have to look at) to execute, and gathering the results to put into a statistical table.
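As a very rough sketch of that control flow (hypothetical, written in R, with invented file names - the real product generates its own lower-level code): the front end writes a small script, has the back engine execute it, and reads the results back.

  # Hypothetical sketch of front-end / back-engine control flow.
  writeLines(c(
    "smry <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)",
    "write.csv(smry, 'summary_stats.csv', row.names = FALSE)"
  ), "generated_step.R")                          # 1. generate lower-level code
  system2("Rscript", "generated_step.R")          # 2. the back engine executes it
  results <- read.csv("summary_stats.csv")        # 3. the front end gathers the results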

The choices for the data transformation back engine include:
1. Vilno Data Transformation (VDT). A low-cost open source replacement for the SAS datastep (as well as proc transpose, proc means, and proc sql (for many-to-many joins))
2. SAS/BASE (SAS datastep, proc sql, proc transpose, proc means)
3. SPSS (possibly), depending on demand
4. others, depending on demand

The choices for the statistical procedure back engine include:
1. S programming language implementation (R or S-Plus)
2. SAS/STAT (procedures for linear model, survival analysis, etc.)
3. SPSS, depending on demand
4. others, depending on demand


The current implementation of Vilno Table uses low-cost choices for the computational back engines: VDT and R. (Remember: the worker does not have to write or read any VDT code or R code to produce statistical tables; Vilno Table does that work under the hood.)

Also, when a later version of Vilno Table is configured to use two or three statistical procedure back engines (at the same time, on the same hard drive), which is very doable, you will be able to produce a statistical table with p-values and other statistics from R, S-Plus, and SAS/STAT, all on the same page.


SWITCHING GEARS, WHEN IT'S NEEDED

The history of programming languages shows that when the syntax of a programming language is at a higher level of abstraction, you can get a huge increase in worker productivity - but situations where a lower level of abstraction is still needed do occasionally occur.

Programming languages such as SQL, SPSS, SAS, VDT (Vilno Data Transformation), and Vilno Table are data paragraph scripting languages - they consist of paragraphs of code that read and create datasets. R is often used as a data paragraph language (Python can be, but its purpose is more general).

The worker can use source code files that consist of multiple paragraphs of code. Different paragraphs can use different levels of abstraction and different syntaxes, as long as the datasets created have a common format (i.e. the row-and-column format used with SQL). For Vilno Table, the end user can write paragraphs of Vilno Table code (such as the table paragraph), and the end user can also write paragraphs of code in the native syntax of the computational back engines, if needed.

Fine. But WHY would the end user want to do that? Because statistical programming is a multi-stage work process: data analysis, data reporting, and multiple stages of data transformation (data preparation). In particular you have:

1: Early-stage Data Transformation
2: Mid-stage Data Transformation
3: Table-stage Data Transformation
4: Statistical Table Production
5: Rarely used niche statistical procedures

Table-stage Data Transformation: Most of the data transformation code you see in statistical table SAS programs is table-stage data transformation, data transformation that's an integral part of statistical table formation. You do not want workers writing table-stage data transformation code by hand - it's a big waste of time and money. The table paragraph handles such data transformation "under the hood".

Statistical Table Production: Use the table paragraph, and the statistical table needs less than 15 lines of code. Use the SAS programming language, and the same table needs 400 lines of code.

Niche statistical procedures: For commonly used statistical procedures, Vilno Table either already has a syntax connector installed or soon will - you can include advanced statistics directly in the table with just a couple of lines of code. But for rarely used statistical procedures (which lack a syntax connector), it will be necessary to write a paragraph of code calling that statistical procedure in the native syntax of the chosen statistical procedure back engine (typically R, S-Plus, or SAS/STAT).

Data Transformation (data preparation, data cleaning):
You can always use the old-fashioned syntax of the data transformation computational back engine to do data transformation, data preparation, and data cleaning: usually either VDT or the SAS datastep (and someone who knows one can learn the other very quickly).
Data transformation needs are diverse and hard to categorize, but I divide them into early-stage data transformation and mid-stage data transformation.

Early-stage data transformation is often called "data cleaning". Typically, you tell the statistician what's wrong with the data (partially missing data, unusual visits, etc.) and she decides what data adjustments to make prior to analysis. This is tedious work, and a product design that multiplies worker productivity for data cleaning is still an unsolved computer science problem. It's unusual for data cleaning code to be in statistical table programs; it's supposed to be finalized in the programs that create the ready-for-analysis datasets from the raw datasets. For data cleaning, there is simply no choice: use old-fashioned syntax (VDT or SAS datastep), with low worker productivity. [This is a big branch of applied computer science in which I am active, but in which academic computer scientists (and employees of SAS Institute) have accomplished absolutely nothing over the past 20 years - that's worth noting. There is only one serious researcher in this part of computer science, and that is me.]

Mid-stage data transformation is more geometric, almost like a generalized transposing. Calculation of % change from baseline is an example of mid-stage data transformation. Usually mid-stage data transformation is done at the end of the programs creating ready-for-analysis datasets, or at the beginning of a statistical table program. A worker can often do mid-stage data transformation (with higher worker productivity) using the parity paragraph; using old-fashioned syntax (VDT or SAS) is another option.
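For concreteness, here is what % change from baseline looks like as a mid-stage transformation, written in plain R rather than the parity paragraph or VDT; the dataset and variable names are invented for the example:

  # Mid-stage data transformation in plain R: percent change from baseline.
  lab <- data.frame(subject = c(1, 1, 1, 2, 2, 2),
                    visit   = c("baseline", "week4", "week8",
                                "baseline", "week4", "week8"),
                    value   = c(100, 90, 80, 200, 210, 190))
  base <- lab[lab$visit == "baseline", c("subject", "value")]
  names(base)[2] <- "baseline"
  lab <- merge(lab, base, by = "subject")
  lab$pct_change <- 100 * (lab$value - lab$baseline) / lab$baseline
  lab[lab$visit != "baseline", ]                 # the "generalized transposing" result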