[Linux-aus] Australian Digital Civil Rights Quiz

Wed Jun 21 21:00:04 UTC 2006

Nathan Bailey wrote:
> I'm not much of a stats buff, but I get the impression that the value of 
> packages like R <URL:http://www.r-project.org/> (and SAS, SPSS, Matlab, 
> etc.) is in the scripts/analyses built on top of the basic tool.  In 
> this case, it is quite possible that:
> a) Proprietary tools lock in scripts to reduce portability (= lock in 
> costs)
> b) Proprietary tools require non-standard scripting language/approach (= 
> lack of economies of scale in skills/knowledge)
> c) Proprietary tools may have plug-ins that can't be ported to other tools
> 
> (a) and (c) are particularly relevant to DRM -- if I am not legally 
> allowed to work out how to migrate my scripts, or a script set I bought, 
> then my own IP is being restricted by the DRM.
> 
> (i.e. equivalent to MS Office proprietary formats not being allowed to 
> be reverse engineered for portability [which isn't the case now, AFAIK, 
> but was less clear in earlier days of their EULA])

Hi Nathan,

I am a stats buff -- I was responsible for the technical accuracy
of Australian Bureau of Statistics computing for some years.

Tools like S-Plus, R, SAS, SPSS and Matlab have two strengths:
  - they are correct
     - they handle overflow and underflow, null values, NaN propagation
       and so on. COBOL, C and the like suck at this, and even have
       trouble with statistically correct definitions of basic functions
       like max().
  - they implement statistical functions
     - you want an X11-ARIMA month weighted average, that's one function
       call.  In COBOL or C that's 1,000 lines of code, and the coder
       will forget to calculate a quality estimator.
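
To make the max() point concrete, here is a minimal Python sketch of
one possible convention: any missing or undefined input makes the
result undefined, rather than being silently skipped.  The name
stat_max and the choice to treat None as "missing" are assumptions
for illustration, not how any of the packages above implement it.

```python
import math


def stat_max(values):
    """Maximum that propagates missing data instead of hiding it.

    Assumption: None stands for a missing value and NaN for an
    undefined one.  If either appears, or if there is no data at
    all, the result is NaN rather than a misleading number.
    """
    result = -math.inf
    seen = False
    for v in values:
        if v is None or math.isnan(v):
            return math.nan
        seen = True
        if v > result:
            result = v
    return result if seen else math.nan


# Python's built-in max() gives an answer that depends on where the
# NaN sits, because every ordering comparison against NaN is False.
naive_a = max([1.0, math.nan, 3.0])   # the NaN is silently ignored
naive_b = max([math.nan, 1.0, 3.0])   # the NaN wins

print(naive_a, naive_b, stat_max([1.0, math.nan, 3.0]))
```

Whether to propagate or to skip missing values is a policy decision;
the point is only that a stats tool makes the policy explicit, while
a general-purpose language tends to leave it to accident.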

Increasingly these tools also implement data exploration and
visualisation.  But there's nothing to prevent the use of other
tools for that.

The vendors make a huge investment in documentation.  Stats
is not taught well at school and the vendors have to do a
great deal of education for their users to be able to use
their product effectively. It's that education that tends
to produce the lock-in and a strong loyalty to the product.

This isn't to say FOSS has no hope. Like experienced programmers,
experienced users of stats tools are happy to drive a number
of programs and tend to choose them for fitness to task.

On one hand, there's not much script portability between the
languages.  On the other hand the scripts are pretty high level
and it's plain to see what's being attempted, and thus re-writing
code isn't a huge drama. Regression testing also tends to be
simple, since stats people are obsessive about retaining past
data and summary statistics.
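
A regression test of that kind might look like the sketch below,
which recomputes summary statistics with a re-written script and
compares them with archived values.  The data, the archived figures
and the tolerance are all made up for illustration; a real test
would use the retained data and the precision actually published.

```python
import math
import statistics


def summarise(values):
    """Summary statistics as produced by the re-written script."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
    }


def regressions(archived, recomputed, rel_tol=1e-6):
    """Names of statistics that no longer match the archived values."""
    return [
        name
        for name, old in archived.items()
        if not math.isclose(old, recomputed[name], rel_tol=rel_tol)
    ]


# Hypothetical retained data and the summary archived alongside it.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
archived = {"n": 8, "mean": 5.0, "stdev": 2.138090}

print(regressions(archived, summarise(data)))
```

An empty list means the new code reproduces the old results to the
stated tolerance; any name in the list points at the statistic that
needs another look.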

I wouldn't be worried about plugins. Part of the problem with
stats is that you need to sweat the details, often to a number
of decimal places. No stats user is interested in vaguely
described black boxes.  The level of precision of description
they need is enough to act as a specification for the reimplementation
of the black box.

Statistical processing has the flow
   define --> collect --> input --> edit --> summary --> explore --> archive
and you want to be able to use different software at each step.
So DRMed output which does not interoperate is never going
to take off -- people simply won't buy the product.

This isn't to say that stats people don't want DRM.  Government
statistical agencies can compel responses, and in exchange
give strong privacy assurances.  So they want to secure data
from collection to summary.

Some statistics are of high value if disclosed.  Knowing the
GDP results prior to their release at 11AM would allow you
to make a few $m.

Some exploration results are of high value if disclosed. For
example the location of a possible mineral deposit from
mining exploration data.

So again it's a question of whom any DRM is protecting.  Is
it forcing vendor lock-in by tying your use of software
at one step of the stats process to using that vendor's
software at the next step?  Or is it providing privacy
for the owner of the statistics?

File formats tend to be simple, mainly because this allows
simple scripts in the language of your choice to correct
errors in the data that become obvious at the explore stage.
A real-life example: move all "domestic trailers" heavier
than 1000 kg into the "prime mover trailer" category, and
print the address and frequency of all trailers moved.
These errors tend to result from faulty definitions and poor
editing, but no one can get those right the first time around.
For performance reasons most products have their own internal
data format, but expect input and output to be in simple
formats like columns, comma delimited, etc.
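
As a rough sketch of such a correction script, assuming a comma
delimited file with hypothetical columns address, category and
mass_kg (the real layout would differ), and reading "frequency" as
the number of trailers moved per address:

```python
import csv
import io
from collections import Counter

MASS_LIMIT_KG = 1000.0


def recategorise(rows):
    """Move heavy "domestic trailer" rows to "prime mover trailer".

    Returns the corrected rows and a Counter of how many trailers
    were moved at each address.
    """
    corrected, moved = [], Counter()
    for row in rows:
        row = dict(row)
        heavy = float(row["mass_kg"]) > MASS_LIMIT_KG
        if row["category"] == "domestic trailer" and heavy:
            row["category"] = "prime mover trailer"
            moved[row["address"]] += 1
        corrected.append(row)
    return corrected, moved


# Made-up sample data standing in for the real input file.
sample = """address,category,mass_kg
1 Example St,domestic trailer,750
2 Example Rd,domestic trailer,1800
2 Example Rd,domestic trailer,2400
3 Example Ave,prime mover trailer,9000
"""

rows, moved = recategorise(csv.DictReader(io.StringIO(sample)))
for address, count in sorted(moved.items()):
    print(f"{address}\t{count}")
```

Because the input is plain delimited text, the same fix could be
written just as easily in R, SAS or a shell pipeline, which is the
practical argument for keeping the exchange formats simple.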

There's another entire issue around *transparency* and models.

For a real-life example, take the beach near my house.  They've
built an artificial sand bar further down the beach to protect
some houses from erosion. That causes some erosion up-current
of the sand bar.  The modeling showed that this erosion would
be minor.

That didn't seem correct to me -- the outputs of the model seemed
sensitive to the inputs, especially to the slope of the beach,
which is something that can alter radically for some weeks after
a large storm.

So it seemed to me that the model didn't explore the risks of
storms with enough care.

So I asked for a copy of the model.  No can do.  External
consultancy.  So I asked for previous before+after validations
of the model. No can do.  Commercial in confidence.

As I wrote to Janet, transparency is important, but not our fight.

Hope this helps,
Glen