[Linux-aus] Tutorial on data science at LCA

Wed Jan 13 22:30:46 AEDT 2016

Hi Paul,

The video from PyCon AU last year is available, but you might as well wait
for the LCA version, which will be somewhat updated with the latest
information. Code samples and slides from last year are on GitHub:
https://github.com/tleeuwenburg/pyconau15. I can think of a few ways to
improve the slides and the code examples, and I hope to find the time for
that. If nothing else, I have the first few days of the conference :)

For me, computer science is where the subject domain is computation. It
includes investigations into memory, disk, algorithm efficiency, design
methodology, etc. A fun definition I heard for data science is "what they
call statistics in Silicon Valley". I think that data science focuses on
the identification of trends, features and relationships in both structured
and unstructured data. Typically, this involves both algorithmic processing
and analytical understanding from a human analyst. Machine learning is the
exploration of effective algorithms for the prediction of future states
based on complex inputs and complex (and potentially hidden) rules and
relationships. Happy to provide more exposition here. I might not include
it in the presentation, because I'm not sure how interested in 'computer
science', or how academic, the general audience is likely to be. Maybe in
2017 I'll re-run this as a kind of 'masterclass' rather than a tutorial
and get into some more challenging territory.

I struggle with mathematics but have made the effort to learn some
fundamentals. It is possible to use many machine learning techniques as a
"black box", but it's not really possible to do data science that way. I
think it would be worth learning:
  -- The standard deviation
  -- The normal distribution
  -- How to draw a good graph
  -- How to draw a scatter plot
  -- How to draw a Venn diagram of the state space. Almost all complex
probability theory can be more easily understood when you start with a
square representing "everything", then start dividing it into smaller
sub-populations. That makes it easier to understand Bayes' rule,
correlation, causation, likelihoods...

Visual methods are highly effective.
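
To make the "square of everything" idea concrete, here is a minimal Python
sketch. All of the counts are made up for illustration, and the names
(with_a, a_and_b, b_only) are just labels I picked for the regions of the
square. It works out P(A|B) twice: once by counting regions, and once with
Bayes' rule.

```python
# All numbers are hypothetical, chosen only to illustrate the idea.
total = 1000    # the whole square: "everything"
with_a = 10     # sub-population A (say, people who have some condition)
a_and_b = 9     # inside A and also inside B (say, tested positive)
b_only = 99     # inside B but outside A (false positives)

# 1. Read the answer straight off the picture:
#    of everyone in region B, what share is also in region A?
p_a_given_b_counts = a_and_b / (a_and_b + b_only)

# 2. The same quantity via Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_a = with_a / total
p_b_given_a = a_and_b / with_a
p_b = (a_and_b + b_only) / total
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(round(p_a_given_b_counts, 4), round(p_a_given_b_bayes, 4))
```

Both routes give about 0.083 (9 out of 108), which is the point of the
picture: Bayes' rule is just bookkeeping over the regions of the square.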

Regarding trending and anomaly detection ... I am straying outside of my
main knowledge here, but in general you will be looking at data
normalisation, hypothesis testing, bias elimination/identification and
significance thresholding. Some methods are more robust to these issues
than others. If you have a use case, I'd be happy to hear about it! :)
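
As a rough illustration of normalisation plus thresholding, here is a
minimal Python sketch using only the standard library. It is not taken from
the tutorial material; the function name find_anomalies, the sample
readings and the 2.5 cut-off are all assumptions chosen for the example.
Each value is converted to a z-score (its distance from the mean, in
standard deviations) and flagged if that score exceeds the threshold.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Hypothetical helper for illustration; the threshold is a judgment call.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]

# Made-up monitoring readings with one obvious spike at index 10.
readings = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49, 95, 50]
anomalies = find_anomalies(readings)
print(anomalies)
```

Note that the spike itself inflates both the mean and the standard
deviation, so this simple approach gets less sensitive as outliers pile up.
Measures based on the median are one commonly used, more robust
alternative.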

On 13 January 2016 at 20:59, Paul Gear <paul at gear.dyndns.org> wrote:

> On 13/01/16 18:30, Tennessee Leeuwenburg wrote:
>
> Hi all,
>
> I am running a tutorial on data science at LCA. The chief language used
> will be Python, but users of other technologies will still find the
> concepts relevant. In preparation for this, I will be dusting off my slide
> deck, re-running the code, and updating the content with a small amount of
> new findings from the last 6 months. This is also an opportunity to focus
> the content on what the LCA audience might be most interested in.
>
> Does anyone on this list have any particular questions around data science
> / machine learning / AI which they would like to see answered?
>
> The session is practical, with supplied code and data, and audience
> members should be able to re-create the results while the session is being
> presented. Are there any particular problems that people are confronted
> with? I might not be able to re-work a major case study, but I should be
> able to incorporate some relevant examples...
>
> Cheers,
> -T
>
>
> Hi Tennessee,
>
> I won't be at LCA this year, but would love to see the slides/code
> samples.  Here are my suggestions, not having any real background in data
> science:
>
>    - What's the difference between data science and computer science?
>    i.e. What are the important characteristics which distinguish it as a field
>    (or sub-field) in its own right?
>    - My eyes tend to glaze over at the first sign of grade 12 or higher
>    maths (even though I did pretty well at it in grade 12).  What are the main
>    mathematical concepts that non-data scientists need to brush up on to
>    understand what data scientists are telling them?
>    - Keen to hear anything you can teach about the theory behind trending
>    & anomaly detection, especially as it relates to modern monitoring systems.
>
> Regards,
> Paul
>
> _______________________________________________
> linux-aus mailing list
> linux-aus at lists.linux.org.au
> http://lists.linux.org.au/mailman/listinfo/linux-aus
>
>

-- 
--------------------------------------------------
Tennessee Leeuwenburg
http://myownhat.blogspot.com/
"Don't believe everything you think"