Thursday, January 25, 2018

Another "Masters" in Data Science via MOOCs ... (3)

Last update: February 25, 2018
Two years ago David Venturi posted a widely read description of the path he followed to obtain a self-designed "Masters" degree in data science via MOOCs, i.e., a free education that was equivalent in scope and quality to what he would have received had he completed a Masters degree program from one of the nation's leading universities.  You can read his account here ==> MY DATA SCIENCE MASTER'S. The following note describes the "Masters" in data science that I designed for myself, a program that I will complete within the next five or six months.

As the reader will see, my program is quite different from Mr. Venturi's. There are many reasons, but the two that I will focus on are age and prior academic experience.
  • Age -- Mr. Venturi was in his early to middle twenties when he embarked on his "Masters"; I was in my mid-seventies.
  • Prior academic experience -- Mr. Venturi had just obtained a bachelor's degree, whereas I obtained a Ph.D. more than forty years ago (1975). I spent 20 years as a tenured member of the faculty of a prominent university, then another 20 years as a member of its senior staff developing EdTech software and implementing blended/online programs.
In other words, Mr. Venturi wanted to use what he had learned from his "Masters" program to begin his professional career, whereas I am hoping to use the knowledge from my "Masters" for a few more laps around the track.

I was "retired" in early 2014, sulked for about a year or so, then began my serious MOOC studies in early 2015. At the time I just wanted to add a few more tools to my skill set before re-entering the job market. In another post on this blog, I describe such a move as a "pivot". I read Venturi's memo in 2016 at about the time that I completed my first data science certificate program. I realized that I had not yet learned enough to address the policy issues of greatest concern to me, so I knew that I needed to make a "jump", rather than a "pivot". As I use the term, a "jump" would require substantially larger additions to my skill set, additions that were traditionally acquired via Masters degree programs. So following Venturi's example, I reframed my studies in terms of a "Masters" degree. Then I identified the additional skills that I thought I needed.

A. Textbooks vs. Course Lists
Mr. Venturi provided lists of the courses he took wherein each course on the list was hyperlinked to a description of the course on the sponsor's Website. While I might give the benefit of some doubts to computer science courses offered by Harvard or Stanford, my 40 plus years in academia taught me to be wary of most course descriptions, even when descriptions include detailed lists of the topics covered. I found that the required texts for a course provided far more reliable indicators of the real scope and depth of most courses.
  • Most MOOCs do not have required texts because most MOOCs are still tethered to the notion that MOOCs should be free -- except for small certification fees. Unfortunately, the cost of textbooks, especially for courses related to STEM fields, has gone through the roof in recent decades, with new texts typically costing anywhere from $100 to $200. As long as MOOCs are "free", they cannot require that students pay $100 to $200 for textbooks. So most MOOCs don't use textbooks.
  • Of course, a viable solution to this problem is the adoption of Open Educational Resource (OER) texts, i.e., free/inexpensive textbooks that are available online. Unfortunately, only a few OER texts related to Data Science have been produced. However, readers will note that three of the five texts that I am about to cite are available online for free in PDF format ... :-)

B. My "Textbooks"
I put the word "textbooks" in quotes to remind readers that these books were not required reading for any of the MOOCs that I took. They are the texts I used to gauge the breadth and depth of my understanding of the topics discussed in my courses. Indeed, these texts also guided my selection of courses. I won't award myself a "Masters" degree in Data Science until I feel confident that I have a thorough grasp of most of the topics they cover. 
  • Using R for Introductory Statistics,  J. Verzani, CRC Press: Boca Raton, FL, 2nd edition, 2014 ... fundamentals of descriptive & inferential statistics; basic R 
  • R for Data Science, H. Wickham and G. Grolemund, O'Reilly Media, Inc, Sebastopol, CA, 2017 ... data wrangling via "tidyverse" packages developed by Wickham and associates ... free PDF here
  • Advanced R, H. Wickham, CRC Press: Boca Raton, FL, 2015 ... advanced R code, developing R packages ... free PDF here
  • Modern Data Science with R, B. Baumer, D. Kaplan, N. Horton, CRC Press: Boca Raton, FL, 2017 ... comprehensive overview of data science; IMHO the first full-fledged Data Science text ... great examples, superb graphics, and extensive sets of problems ... employs "tidyverse" code 
  • An Introduction to Statistical Learning, G. James, D. Witten, T. Hastie, R. Tibshirani, Springer Verlag, New York, NY, 2013 ... the classic text for statistical models ... free PDF here
As the reader can see, all five texts use the R language. Within a few months of enrolling in my first Data Science course, I perceived that the Data Science community was divided into two groups that sometimes seemed more like competing churches. There was the "High Church of R" and the "Pythonic Congregation". The R-group had academic roots in statistics programs, the Python-group in computer science programs. 

I continued my studies in the R-group because I perceived greater interest among statisticians in the kinds of public policy issues that concerned me than among computer scientists. For this reason I remained in the R-group even after taking a couple of short intro to Python courses from Udacity. These courses were a revelation, a "religious experience" in the "other" church. What an incredibly elegant language!!! -- Was I the last tech on the planet to learn that Python is a simple but powerful, general purpose language that runs fast enough to support any kind of application? Compared to Python, Base R is ... aaarrraRrgh!!!!!! ...  So this is a serious deficiency in my self-designed "Masters" program: I didn't require that I become as proficient in Python and its data-science packages as I have become in R.

Another serious deficiency in my program, as embodied in my five texts, has been the absence, until recently, of calculus and linear algebra. These subjects might not be as important in the Python data science community, but statistical literature is written in multivariate calculus. I began my graduate studies in a PhD program in pure math many years ago, but dropped out. As I noted in another post on this blog -- MOOCs, Data Science, and Me -- I have been pleasantly surprised by how much I remember about these subjects. Nevertheless I just enrolled in a series of rigorous calculus review courses and a course in linear algebra to bring these skills back to my fingertips. I will supplement these MOOCs with a couple of my old grad school texts.

C. Is "Data Science" a long-term field of study???
I ask this question in the context of the preliminary draft of a directory of academic degrees and certificate programs in "Data Science" that I recently posted on this blog. As I compiled the links to the 50 plus programs, mostly for Masters degrees, I read the descriptions of the programs and was stunned by their variety. They range from here to way over there, and from hither to way out yonder. The simple division of the data science community into R-based vs. Python-based subgroups that I referenced in the preceding paragraphs is a substantial misrepresentation of this variety. It's difficult to believe that all of these programs are training graduates in the same field. So I offer two hypotheses and will leave it up to the reader to decide which one is null ... :-)
  • Just another rational wave -- Perhaps "data science" is just the latest in a series of rational waves that began with industrial engineering at the end of the 19th century. Each wave was powered by super-techs who vowed to apply new mathematical/computer abstractions to solve problems that had previously evaded domain experts ... operations research, systems analysis ... (lots in between) ... Business Analytics, Big Data, and now Data Science. Super-techs brought new techniques to old domains and made different adaptations of their techniques to the different domains. After a while the super-tech innovations became common knowledge among the domain experts. Indeed they became part of the required skill sets for experts in these domains.
  • Lemonade stands -- The dean says, "This Data Science thing is becoming big, really big, even bigger than Big Data, the last big thing. We have to jump on it while it's still on the upside of the hype cycle. Who do we have that we can assign to our version? A duck, three dolphins, and a platypus? ... Hmmmmmm, not quite the same team as the Data Science program at Stanford; but what the hell. If that's all we got, then we will just have to present their qualifications in the most favorable light in our brochures" ... :-)

D. Acknowledgements
I will end this note with a few words of well-deserved praise for two of my favorite Data Science instructors:
  • Roger Peng ... Dr. Peng is the leader of the group of instructors who offer the Johns Hopkins Data Science program via Coursera. He inspires students by his own example. Time and again I was greatly impressed by his grasp of effective ways to handle data sets that had no relationship whatsoever to his expertise in biostatistics. How did he ever find the time to learn all this stuff? Although I remain unsure as to whether there really is such a thing as "Data Science" as a long-term field of study, I am firmly convinced of the existence of at least two data scientists, one of whom is Dr. Peng. May he live long and prosper ... :-)

    Note: I did not include Dr. Peng's inexpensive LeanPub online manuals in my list of textbooks because as works-in-progress they lacked some of the polished components that I look for in the best textbooks. However, I anticipate Dr. Peng's eventual consolidation of these drafts into an inexpensive one or two volume publication. I am sure that it will be a worthy competitor to Baumer's "Modern Data Science".
  • Hadley Wickham ... Dr. Wickham is my other real Data Scientist because he does what someone who had never heard the words "Data Scientist" before might think every data scientist does. He creates functions that manipulate and/or display data, any kind of data, including other functions. He has recently begun referring to the various interrelated packages he's developed as the "tidyverse". I only took one short course from Dr. Wickham via DataCamp, and it was excellent. Nevertheless, he's one of my favorite data science instructors because I've learned so much from his books, and even more from understanding how to use the various tools he's created in his "tidyverse". My earlier growl about "Base R" refers to the basic R package as it comes out of the box. But installing the packages in Dr. Wickham's "tidyverse" transforms Base R into a formidable tool that rivals Python in its power and simplicity. 

    Roy L Beasley, PhD
    DLL Editor

    P.S. #1 -- For a "Very Short History of Data Science" see this article by Gil Press in Forbes (5/28/13)

    P.S. #2 -- In early February 2018 I dropped out of the Linear Algebra MOOC because it did not offer a rigorous theoretical presentation. Instead, I will study the third edition of Serge Lang's "Linear Algebra" (Springer 1987), i.e., I will relearn this material the old fashioned way by studying a good book ... :-)

    P.S. #3 -- On second thought, I have added Garrett Grolemund's book Hands-On Programming with R (O'Reilly Media, Inc, Sebastopol, CA, 2014) to my list of required texts (free PDF here) for my own online Masters. Whereas Verzani's text (cited above) uses R to introduce descriptive and inferential statistics, Grolemund's book provides an excellent introduction to the art of writing good programs in R.
