Tuesday, December 18, 2012

changing gears

SWITCHING GEARS, WHEN IT'S NEEDED

The history of programming languages show that when the syntax of a programming language is at a higher level of abstraction, you can get a huge increase in worker productivity, but situations where a lower level of abstraction is still needed can occasionally occur.

Programming languages such as SQL, SPSS, SAS, VDT (Vilno Data Transformation), and Vilno Table are data paragraph scripting languages - consisting of paragraphs of code that read and create datasets. R is often used as a data paragraph language (Python can be, but it's purpose is more general).

The worker can use source code files that consist of multiple paragraphs of code. Different paragraphs can use different levels of abstraction and different syntaxes, as long as the datasets created have a common format (i.e. the row-and-column format used with SQL). For Vilno Table, the end user can write paragraphs of Vilno Table code (such as the table paragraph), and the end user can also write paragraphs of code in the native syntax of the computational back engines, if needed.

Fine. But WHY would the end user want to do that? Because statistical programming is a multi-stage work process: data analysis, data reporting, and multiple stages of data transformation (data preparation). In particular you have:

1: Early-stage Data Transformation
2: Mid-stage Data Transformation
3: Table-stage Data Transformation
4: Statistical Table Production
5: Rarely used niche statistical procedures

Table-stage Data Transformation: Most of the data transformation code you see in statistical table SAS programs is table-stage data transformation, data transformation that's an integral part of statistical table formation. You do not want workers writing table-stage data transformation code by hand - it's a big waste of time and money. The table paragraph handles such data transformation "under the hood".

Statistical Table Production: Use the table paragraph, and the statistical table needs less than 15 lines of code. Use the SAS programming language, and the same table needs 400 lines of code.

Niche statistical procedures: For very common statistical procedures, Vilno Table has now or will have syntax connectors installed - you can include advanced statistics directly in the table with just a couple of lines of code. But for rarely used statistical procedures (that lack a syntax connector), it will be necessary to write a paragraph of code to call that statistical procedure in the native syntax of the chosen statistical procedure back engine (typically, R, S-Plus, SAS/STAT).

Data Transformation (data preparation, data cleaning):
You can always use the old-fashioned syntax of the data transformation computational back engine to do data transformation, data preparation, and data cleaning: usually either VDT or the SAS datastep (and someone who knows one can learn the other very quickly).
Data transformation needs are diverse and hard to categorize, but I divide it into early-stage data transformation and mid-stage data transformation.

Early-stage data transformation is often called "data cleaning". Typically, you tell the statistician what's wrong with the data (partially missing data, unusual visits, etc.) and she decides what data adjustments to make prior to analysis. This is tedious work and a product design that multiplies worker productivity for data cleaning is still an unsolved computer science problem. It's unusual for data cleaning code to be in statistical table programs, it's supposed to be finalized in the programs that create the ready-for-analysis datasets from the raw datasets. For data cleaning, there is simply no choice: use old-fashioned syntax (VDT or SAS datastep), with low worker productivity. [This is a big branch of applied computer science in which I am active, but in which academic computer scientists (and employees of SAS Institute ) have accomplished absolutely nothing over the past 20 years - that's worth noting. There is only one serious researcher in this part of computer science, and that is me.]

Mid-stage data transformation is more geometric, almost like a generalized transposing. Calculation of % change from baseline is an example of mid-stage data transformation. Usually mid-stage data transformation is done at the end of programs creating ready-for-analysis datasets, or at the beginning of a statistical table program. A worker can often do mid-stage data transformation (with higher worker productivity) using the parity paragraph, using old fashioned syntax (VDT or SAS) is another option.

No comments:

Post a Comment