Saturday, November 21, 2015

Where Did They Get Their Numbers???

Last update: Saturday 11/21/15
I have been dismayed by the fallacious "pipeline" explanations offered by the Silicon Valley corporate elite that there aren't enough Black students majoring in Computer Science to enable them to employ racially diversified technical staffs. More recently, however, I have also been dismayed by fallacious declarations from some Black academic leaders that it's possible to diversify the faculty of all colleges and universities to a level close to 12 percent because Black Americans are 12 percent of the total U.S. population. So I find myself shaking my head at both elites and wondering, "Where did they get their numbers?"

I know where they didn't get their numbers. They didn't get them from the U.S. Departments of Education, Labor, or Commerce, nor from the National Science Foundation, nor from the Carnegie Foundation, nor from any other reliable source of public information of which I am aware. 
  • The numbers I retrieved from reliable sources indicate that Silicon Valley, a very small but very wealthy sector, could easily hire a few thousand highly qualified Black technical staff by outbidding other sectors, thereby achieving 12 percent diversity in four or five years.
     
  • On the other hand, there is reason to question the existence of enough Black PhDs in most fields to enable the financially pressed academic sector to hire the hundreds of thousands of additional Black faculty that would be required to achieve 12 percent diversity in all colleges and universities within the next four or five years. And only the wealthiest fifteen or twenty institutions would be able to compete with Silicon Valley (and Wall Street) for the best of the best.
At this point I am sorely tempted to trot out my own numbers (again) from reliable sources to support both of my assertions, but I won't. I won't because I no longer think that sharing raw data is good enough. I am now persuaded that it's better to share the raw data plus the computer code that downloaded the raw data and then transformed it into the calculations, tables, and graphs required to support my conclusions. In other words, my readers should not only be able to directly verify that I used the correct data; they should also be able to verify that I made all of the calculations, produced all of the tables, and generated all of the graphs correctly. How can this be done?

As regular readers of this blog may be aware, for the last seven months I have been enrolled in certified MOOCs in "Data Science" -- the first from M.I.T. and the most recent six from Johns Hopkins University. These courses cover a collection of techniques that some call "data science", others call "statistical learning", and still others call "machine learning". The impressive hotshots teaching the courses at Hopkins also present persuasive arguments for something they call "Reproducible Research". In my case, reproducible research would look something like this:
  • In addition to publishing a report on this blog that begins with raw data retrieved from reliable sources and then derives results from the raw data, I would also include a link to a "literate program" on GitHub that would look like the same report. It would include the same text, tables, and graphs. But it would also include the code I actually used to download the data from the original sources and all of the code that I used to perform the calculations, produce the tables, and generate the graphs.
My reproducible reports will be more transparent than the kinds of traditional reports that I have previously posted on this blog. Access to the GitHub versions will enable readers to immediately verify that my data really came from IPEDS or NSF or Carnegie or wherever. Then my actual data transformations could be examined, executed, and modified by readers who fork/clone my reports to see what would have happened if I had made other choices. How would the derived data change? What would the new graphs look like? And of course, using powerful machine learning techniques will enable me and my reviewers to uncover important patterns in the data more easily than by using traditional techniques.
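
To make the idea a little more concrete, here is a minimal sketch in Python of what the code portion of such a report might look like. It is only an illustration under stated assumptions: the inline CSV text stands in for a file that a real report would download from its original source, the values in it are placeholders rather than real statistics, and every name in it (RAW_CSV, fingerprint, load_rows, add_share, render_table) is hypothetical.

```python
import csv
import hashlib
import io

# Placeholder rows, NOT real statistics. In an actual report this text would be
# downloaded from the original source by code that is part of the report itself.
RAW_CSV = """field,group_count,total_count
Field A,10,200
Field B,30,300
Field C,5,500
"""


def fingerprint(raw_text):
    """Return a SHA-256 digest so readers can confirm they hold the same raw data."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


def load_rows(raw_text):
    """Parse the raw CSV text into a list of dicts with integer counts."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {
            "field": row["field"],
            "group_count": int(row["group_count"]),
            "total_count": int(row["total_count"]),
        }
        for row in reader
    ]


def add_share(rows):
    """Derive each field's percentage share from the raw counts."""
    return [
        dict(row, share_pct=round(100 * row["group_count"] / row["total_count"], 1))
        for row in rows
    ]


def render_table(rows):
    """Format the derived data as a plain-text table."""
    lines = [f"{'field':<10}{'share_pct':>10}"]
    lines += [f"{row['field']:<10}{row['share_pct']:>10.1f}" for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print("raw data sha256:", fingerprint(RAW_CSV))
    print(render_table(add_share(load_rows(RAW_CSV))))
```

A reader who clones a report like this can change one step, say the formula in add_share, rerun it, and see exactly how the derived table changes, while the printed digest shows whether the raw data is still the same.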

One initial disadvantage of my new approach derives from the fact that I will be coding my first reports in R, the preferred language of professional statisticians. R is an extraordinarily powerful language with dazzling graphics packages but, unfortunately, it is also one of the quirkiest computer languages ever devised. Although its popularity within the data science/machine learning community is rapidly increasing, it's unlikely that it will ever become as popular as Python, a language that is justly famous for being easy to learn. 

Why did I start off with R? I started with R because R was the language used in my first data science course, the MOOC I took last Spring that was offered by M.I.T., the course that blew my mind as few other courses have ever done. Enrolling in the extensive set of R-based courses offered by Hopkins was a logical follow-up.

Nevertheless, I want my reproducible reports to be accessible to the largest possible readership, so in Spring 2016 I will enroll in Udacity's nanodegree program in Machine Learning, wherein Python is the language of choice. These MOOCs will not only strengthen my skills in the application of machine learning techniques; I anticipate that they will also enable me to create reproducible reports coded in Python.