useR! 2016 participants and the R community: a multivariate analysis

R Foundation taskforce on women and other under-represented groups.
Author

Julie Josse and Heather Turner

Published

June 18, 2017

Introduction

Background

The useR! survey gathers responses to questions on different themes for 455 attendees: demographic characteristics, R programming and involvement in the R community. The separate analyses of each question in each theme are detailed in the posts Mapping useRs, useRs Relationship with R and Community Involvement of useRs. A multivariate analysis of the R programming questions is detailed in a companion report. Here, we perform a multiple correspondence analysis (MCA) on the R community questions.

R community questions

The questions are as follows:

  • Q18 Do you consider yourself part of the R community?
  • Q19 Which of the following resources do you use for support? Select all that apply
    • Q19_A The R mailing lists
    • Q19_B The #rstats hashtag on Twitter
    • Q19_C The R StackOverflow queues
    • Q19_D The R IRC channel
    • Q19_E The rOpenSci mailing lists or chat forums
    • Q19_F The Bioconductor support site
    • Q19_H Other
  • Q20 What would be your preferred medium for R community news (e.g. events, webinars, opportunities)?
  • Q21 Do you attend R user group meetings in your local area?
  • Q22 If you do: what type of user group is it?
  • Q23 If you do not: why not?
  • Q24 Which of the following would make you more likely to participate in the R community, or improve your experience? Tick any that apply.
    • Q24_A New R user group near me (specify location in comments box)
    • Q24_B New R user group near me aimed at my demographic (specify relevant group in comments box)
    • Q24_C Free local introductory R workshops
    • Q24_D Paid local advanced R workshops
    • Q24_E R workshop at conference in my domain (specify domain/conference in comments box)
    • Q24_F R workshop aimed at my demographic (specify relevant group in comments box)
    • Q24_G Mentoring (e.g. first CRAN submission/useR! abstract submission/GitHub contribution)
    • Q24_H Training in non-English language (specify language in comments box)
    • Q24_I Training that accommodates my disability (specify disability in comments box)
    • Q24_J Online forum to discuss R-related issues
    • Q24_K Online support group for my demographic (specify relevant group in comments box)
    • Q24_L Special facilities at R conferences (give further detail in comments box)
    • Q24_M Comments on answers above
  • Q25 What other ideas do you have for improving the R community?
  • Q26 Do you have any feedback for the survey authors?

Coding

  • Question Q20 has a category “others” which corresponds to free-text answers. We add new categories for this variable by grouping common answers, and we name the recoded variables with the suffix “_new”. For instance, the variable Q20_new has a new category medium_R-bloggers.

  • Question Q19_H also corresponds to free-text answers. It has been recoded into the variables Q19_H_new to Q19_N_new by grouping common answers: for instance, Q19_K_new has two categories, Github_yes and Github_no.

  • Q21, Q22 and Q23 have been merged into a variable Q21_new, since Q22 and Q23 are filled in depending on the answer to Q21. For example, if a respondent answers no to Q21 Do you attend R user group meetings in your local area?, they then answer Q23 If you do not: why not?, for instance with I am too busy. This is coded in the variable Q21_new with the category meeting_no_I am too busy.

  • Q24_M, Q25 and Q26 are not analysed.

  • Non-respondents identified in the MCA Rprogramming document are also excluded, since they did not answer the R community questions either.

  • Categories are renamed to be easily understandable. In particular, the two categories for answers completed by checking a box are recoded as yes, if the box is checked, or recoded as no otherwise.
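The recoding rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code used for the report: the function names and the `meeting_NA` label for a missing answer are hypothetical, while the category labels follow those quoted in the text.

```python
def recode_checkbox(raw, label):
    """Recode a checkbox: checked -> '<label>_yes', anything else -> '<label>_no'."""
    return f"{label}_yes" if raw else f"{label}_no"


def merge_q21(q21, q22=None, q23=None):
    """Merge Q21, Q22 and Q23 into a single Q21_new category."""
    if q21 == "Yes":
        return f"meeting_yes_{q22}"
    if q21 == "No":
        return f"meeting_no_{q23}"
    return "meeting_NA"  # hypothetical label for a missing Q21 answer


print(merge_q21("No", q23="I am too busy"))
print(merge_q21("Yes", q22="A general user group"))
print(recode_checkbox(True, "R_twitter"))
print(recode_checkbox(None, "R_twitter"))  # unchecked and missing both become "_no"
```

Note that with this coding an unchecked box and a skipped question are indistinguishable, which is a limitation of checkbox questions in general.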

Analysis

MCA summarizes the relationships between the categories of the R community variables.

plot of chunk unnamed-chunk-3

To interpret the main dimensions of variability in the response profiles, we study the proximities between categories. Indeed, two categories are close on the graph if most respondents taking the first category also take the other category.
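To make the notion of proximity concrete, here is a minimal sketch of the indicator (complete disjunctive) table that MCA starts from, together with the share of respondents taking one category who also take another. The toy answers are invented for illustration. The sketch does not perform the MCA itself, which additionally weights categories by their frequency and extracts dimensions by a singular value decomposition.

```python
# Toy answers (invented for illustration, not survey data).
respondents = [
    {"Q18": "in_Rcommunity_yes", "Q19_B": "R_twitter_yes"},
    {"Q18": "in_Rcommunity_yes", "Q19_B": "R_twitter_yes"},
    {"Q18": "in_Rcommunity_yes", "Q19_B": "R_twitter_no"},
    {"Q18": "in_Rcommunity_no", "Q19_B": "R_twitter_no"},
]


def labels(answers):
    """One label per (question, answer) pair, e.g. 'Q18=in_Rcommunity_yes'."""
    return {f"{q}={a}" for q, a in answers.items()}


categories = sorted(set().union(*(labels(r) for r in respondents)))

# Indicator matrix: one row per respondent, one 0/1 column per category.
indicator = [[int(c in labels(r)) for c in categories] for r in respondents]


def share_also_taking(first, second):
    """Among respondents taking `first`, the share who also take `second`."""
    i, j = categories.index(first), categories.index(second)
    takers = [row for row in indicator if row[i] == 1]
    return sum(row[j] for row in takers) / len(takers)


print(share_also_taking("Q19_B=R_twitter_yes", "Q18=in_Rcommunity_yes"))
print(share_also_taking("Q19_B=R_twitter_no", "Q18=in_Rcommunity_yes"))
```

In this toy table, every Twitter user is also in the community category, so those two categories would be expected to lie close together on an MCA map.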

First dimension of variability

The first dimension (horizontal axis) opposes people who did not answer the questions about involvement in the R community, on the left, to the others, on the right. More precisely, those on the left did not answer Q18, Q20 and Q21 and did not check any boxes for the other questions (categories ending in _no). Individuals with the smallest coordinates on the first dimension are more often employed in industry; a statistical test confirms this, but the result remains difficult to interpret.

Second dimension of variability

At the top are respondents who do not attend an R user group because they feel too inexperienced. They think they would be more likely to participate in the R community, or would improve their experience, if a user group, a workshop or an online support group dedicated to their socio-demographic group were available. Facebook would be their preferred medium for R community news. They do not use Twitter for support (question Q19_B: the category R_twitter_no means that the box The #rstats hashtag on Twitter was not checked for the question Which of the following resources do you use for support?). Nor would they choose it as a medium for R community news (question Q20). They consider themselves to be outside the R community.

At the bottom, in contrast, are individuals who tend to consider themselves part of the R community. They attend general R user group meetings and they use Twitter. Twitter would be their preferred medium for R community news.

plot of chunk unnamed-chunk-4

Relationship with demographic

In the following graph, answers to the demographic questions are projected as supplementary variables on the first two dimensions of variability. Supplementary variables are not used to build the dimensions of variability, but they are projected a posteriori to highlight relationships between the dimensions of variability and demographics. Thus, supplementary variables can be used to characterize the response profiles described previously.

Only demographic characteristics that are significantly related to the R community profiles are represented. To select them, an analysis of variance has been performed with the coordinates of the individuals on the dimensions as the response variables and the demographic variables as explanatory factors.
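As an illustration of this kind of test, the sketch below computes the F statistic of a one-way analysis of variance, assuming that the coordinate on one dimension is treated as the response and one demographic variable as the grouping factor. The coordinates are invented, and the function name is hypothetical; it is not the code used for the report.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a dict {category: [coordinates]}."""
    values = [v for g in groups.values() for v in g]
    grand_mean = sum(values) / len(values)
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (means[name] - grand_mean) ** 2
                     for name, g in groups.items())
    # Variability of the individuals around their own group mean.
    ss_within = sum((v - means[name]) ** 2
                    for name, g in groups.items() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy coordinates on the second dimension (invented, not survey data).
toy_coordinates = {
    "Doctorate": [-0.4, -0.2, -0.3],
    "Undergraduate degree": [0.5, 0.3, 0.4],
}
f_statistic = one_way_anova_f(toy_coordinates)
print(f_statistic)
```

A large F statistic indicates that the categories of the demographic variable sit at clearly different positions along the dimension; the corresponding p-value would be read from an F distribution with the two degrees of freedom computed in the function.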

plot of chunk unnamed-chunk-5

It appears that people with an undergraduate or master's degree are located at the top of the graph, while those with a doctorate are at the bottom. This suggests that people with an undergraduate or master's degree are more likely to consider themselves outside the R community than people with a doctorate. The same holds for women compared with men. Interestingly, caregivers for children or adult dependents tend to be at the bottom of the graph, while non-caregivers are at the top.

We also represent the age categories on the map, since confounding with other demographic variables could explain the previous results. The youngest respondents tend to be at the top of the second axis, which suggests that they consider themselves outside the community.

plot of chunk unnamed-chunk-6

To gain more insight into this possible confounding effect, we created a new variable, sex-age, and represent it:
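A combined variable of this kind can be built by pasting the two categories together. The sketch below assumes labels of the form `20-28_Women`, as they appear for the variable Q2Q3 in the cluster descriptions; the function name, the handling of missing values and the toy respondents are assumptions made for illustration.

```python
from collections import Counter


def combine_age_sex(age_category, sex):
    """Return a combined category such as '20-28_Women', or None if a part is missing."""
    if age_category is None or sex is None:
        return None
    return f"{age_category}_{sex}"


# Toy respondents (invented for illustration, not survey data).
toy = [("20-28", "Women"), ("20-28", "Women"), (">58", "Men"), ("29-33", None)]
combined = [combine_age_sex(age, sex) for age, sex in toy]

# Frequency of each combined category, ignoring missing values.
counts = Counter(c for c in combined if c is not None)
print(counts)
```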

plot of chunk unnamed-chunk-7

It highlights that both young respondents (men and women alike) and women tend not to consider themselves part of the R community.

Similarities between respondents

A clustering of the participants is performed from their answers to the questions about their involvement in the R community. More precisely, a hierarchical clustering is performed on the principal components of the MCA. The rationale is to improve the stability of the clustering by building it from denoised data rather than from the original dataset.¹ Several rules can be used to select the number of components used for the clustering. Here we choose 44 dimensions, which correspond to 95% of the projected inertia (variance) of the data.

The number of clusters is chosen according to the bar plot of the inertia gains, which suggests a partition into 3 clusters. However, this clustering is not informative since most respondents are in the same group:

        1    2    3
size   12  425    6
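The rule of keeping enough components to reach a given share of the inertia can be sketched as follows. The eigenvalues are invented for illustration (the real analysis has many more dimensions), and the function name is hypothetical.

```python
def n_components_for_share(eigenvalues, share=0.95):
    """Smallest number of leading components whose cumulative inertia reaches `share`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, eigenvalue in enumerate(eigenvalues, start=1):
        cumulative += eigenvalue
        if cumulative / total >= share:
            return k
    return len(eigenvalues)


# Toy eigenvalues in decreasing order (invented, not taken from the survey MCA).
toy_eigenvalues = [0.50, 0.25, 0.15, 0.06, 0.04]
print(n_components_for_share(toy_eigenvalues, share=0.95))
```

With these toy values, the first three components account for 90% of the inertia and the first four for 96%, so the rule keeps four components.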

We investigate the number of respondents in each group according to the number of components used to build the clustering:

       clust1 clust2 clust3 clust4 clust5 clust6 clust7
ncp=2      12    307    124      0      0      0      0
ncp=3      12    208    162     61      0      0      0
ncp=4      12    231    139     61      0      0      0
ncp=5      12    212    148     65      6      0      0
ncp=6      12    249    160     22      0      0      0
ncp=7      12    269    159      3      0      0      0
ncp=8      12    255    173      3      0      0      0
ncp=9      12    236    192      3      0      0      0
ncp=10     12    271    157      3      0      0      0
ncp=11     12    273    155      3      0      0      0
ncp=12     12    428      3      0      0      0      0
ncp=13     12    425      6      0      0      0      0
ncp=14     12    425      6      0      0      0      0
ncp=15     12    425      6      0      0      0      0
ncp=16     12    425      6      0      0      0      0
ncp=17     12    425      6      0      0      0      0
ncp=18     12    425      6      0      0      0      0
ncp=19     12    425      6      0      0      0      0
ncp=20     12    425      6      0      0      0      0
ncp=21     12    425      6      0      0      0      0
ncp=22     12    425      6      0      0      0      0

We choose to keep 10 components as this will keep enough information, while highlighting a new cluster in the previous big one.

plot of chunk unnamed-chunk-12

Respondents are grouped in the clusters as follows:

       1      2      3     4
%   2.71  61.17  35.44  0.68
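These percentages can be checked against the cluster sizes reported for ncp=10 in the table of cluster sizes by number of components. A quick check in Python:

```python
# Cluster sizes reported for ncp=10.
sizes = {1: 12, 2: 271, 3: 157, 4: 3}

total = sum(sizes.values())
percentages = {cluster: round(100 * n / total, 2) for cluster, n in sizes.items()}

print(total)
print(percentages)
```

The sizes sum to 443 respondents, and the rounded shares agree with the percentages given above.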

plot of chunk unnamed-chunk-14

Based on the previous interpretation of the first two dimensions of variability, the location of the clusters on the graph provides a brief summary: cluster 1 is the group of non-respondents, while cluster 2 contains respondents who tend to feel they are part of the R community, unlike the respondents in cluster 3.

Cluster 1

To describe the first cluster more precisely, we use statistical tests to compare the proportion of each category in the cluster against the proportion in the population of the respondents.

Unsurprisingly, there is an over-representation of people who did not answer the questions about involvement in the R community. The demographic characteristics show that they are mainly from industry.

Cluster 2

The following categories are over-represented in cluster 2:

Cla/Mod  Mod/Cla  Global  p.value  Category
  78.19    92.62   72.46  < 2e-16  Q24=new_gp_near_no
  70.64    94.10   81.49  < 2e-16  Q18=in_Rcommunity_yes
  85.89    51.66   36.79  < 2e-16  Q21=Yes
  86.82    41.33   29.12  1.1e-13  Q21_new=meeting_yes_A general user group
  85.59    37.27   26.64  3.4e-11  Q19_B=R_twitter_yes
  66.75    94.10   86.23  3.1e-09  Q24_C=free_workshop_no
  90.00    23.25   15.80  8.3e-09  Q20_new=medium_Twitter
  73.89    49.08   40.63  4.8e-06  Q19=R_mailing_yes
  64.08    97.42   93.00  7.9e-06  Q24_B=new_gp_demo_no
  96.15     9.23    5.87  3.5e-05  Q21_new=meeting_no_It is not necessary (I have the support I need)
  64.92    91.51   86.23  7.5e-05  Q24_D=paid_workshop_no
  66.17    81.55   75.40  0.00020  Q19_C=R_StackOverflow_yes
  77.92    22.14   17.38  0.00071  Q21_new=meeting_no_I am too busy
  68.91    49.08   43.57  0.00334  Q7=Doctorate
  69.02    46.86   41.53  0.00428  Q1_new=Q1_new_Academic/Govt
  62.65    97.79   95.49  0.00481  Q24_F=demo_workshop_no
  93.33     5.17    3.39  0.00648  Q9=Part-time
 100.00     3.69    2.26  0.00687  Q7=Professional degree or certification
  62.50    97.79   95.71  0.00928  Q19_L_new=web_no
  62.35    97.79   95.94  0.01753  Q24_K=online_demo_no
  80.65     9.23    7.00  0.01900  Q19_F=Biocond_yes
  85.00     6.27    4.51  0.02259  Q2Q3=>58_Men
  81.48     8.12    6.09  0.02320  Q3_cat=>58
  65.22    66.42   62.30  0.02582  Q2=Men
  74.14    15.87   13.09  0.02851  Q2Q3=29-33_Men

The numbers read as follows, e.g. for Q18: 70.64% of the individuals who consider themselves part of the R community are in cluster 2 (Cla/Mod); in cluster 2, 94.1% of the individuals consider themselves part of the R community (Mod/Cla), compared with 81.49% in the population of all respondents (Global). The test is significant, which implies that this cluster has an over-representation of people who consider themselves part of the community. Members of this cluster also tend not to be interested in free workshops, online support groups for their demographic, etc. According to the demographic characteristics, there is an over-representation of people employed permanently in academia and of people with caring responsibilities, as well as a significantly larger proportion of men and of people with a doctorate.

We also studied the categories under-represented in the cluster. However, since most variables are binary, this table does not provide new information. We note, however, that young people are under-represented, as are people who did not participate in R user group meetings because of inexperience.
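The three percentages and the test can be reproduced from four counts. In the sketch below, the counts for Q18 are back-calculated from the percentages above and from the size of cluster 2 (271 of 443 respondents), so they are an inference rather than published figures. The test shown is a one-sided hypergeometric test, which is an assumption about the kind of test involved; the software used for the report may compute a different variant, so the p-value need not match the table exactly.

```python
from math import comb


def describe_category(n_total, n_category, n_cluster, n_both):
    """Return Cla/Mod, Mod/Cla, Global (in %) and a one-sided hypergeometric p-value."""
    cla_mod = 100 * n_both / n_category      # share of the category found in the cluster
    mod_cla = 100 * n_both / n_cluster       # share of the cluster taking the category
    global_pct = 100 * n_category / n_total  # share of all respondents taking the category
    # P(X >= n_both) when n_cluster respondents are drawn at random without replacement.
    upper = min(n_category, n_cluster)
    tail = sum(comb(n_category, k) * comb(n_total - n_category, n_cluster - k)
               for k in range(n_both, upper + 1))
    p_value = tail / comb(n_total, n_cluster)
    return cla_mod, mod_cla, global_pct, p_value


# Counts inferred from the reported percentages (assumption, see lead-in):
# 443 respondents, 361 in the R community, 271 in cluster 2, 255 in both.
cla_mod, mod_cla, global_pct, p_value = describe_category(443, 361, 271, 255)
print(round(cla_mod, 2), round(mod_cla, 2), round(global_pct, 2))
print(p_value)
```

A very small p-value means that finding this many community members in the cluster would be highly unlikely if cluster membership were unrelated to the answer.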

Cluster 3

Cla/Mod  Mod/Cla  Global  p.value  Category
  81.97    63.69   27.54  < 2e-16  Q24=new_gp_near_yes
  78.82    42.68   19.19  < 2e-16  Q21_new=meeting_no_There is no group nearby/the group is inactive
  50.75    86.62   60.50  < 2e-16  Q21=No
  76.81    33.76   15.58  2.4e-14  Q18=in_Rcommunity_no
  73.77    28.66   13.77  5.9e-11  Q24_C=free_workshop_yes
  90.32    17.83    7.00  7.2e-11  Q21_new=meeting_no_I feel too inexperienced
  43.69    90.45   73.36  2.5e-10  Q19_B=R_twitter_no
  62.30    24.20   13.77  4.8e-06  Q24_D=paid_workshop_yes
  74.19    14.65    7.00  6.9e-06  Q24_B=new_gp_demo_yes
  52.73    36.94   24.83  1.8e-05  Q20_new=medium_Website
  42.97    71.97   59.37  5.6e-05  Q19=R_mailing_no
  68.42     8.28    4.29  0.00351  Q19_L_new=web_yes
  66.67     8.92    4.74  0.00351  Q20_new=medium_Facebook
  42.37    47.77   39.95  0.01354  Q8=Q8_Employed in industry
  45.83    28.03   21.67  0.01803  Q3_cat=20-28
  60.00     7.64    4.51  0.02503  Q24_F=demo_workshop_yes
  50.00    15.92   11.29  0.02609  Q7=Undergraduate degree
  61.11     7.01    4.06  0.02707  Q24_K=online_demo_yes
  50.00    15.29   10.84  0.02974  Q2Q3=20-28_Women
 100.00     1.91    0.68  0.04396  Q20_new=medium_Google+
 100.00     1.91    0.68  0.04396  Q7=High school/Secondary school or lower

In this cluster, there is an over-representation of those who support most of the proposals to increase their participation in the R community. Most of them do not attend user group meetings. The cluster has an over-representation of people who use blogs to get help, but also a large proportion of people who do not use Twitter. For news, a website and Facebook are over-represented as preferred media.

From a demographic point of view, the cluster contains a significantly higher proportion of women than the respondent population, and a larger proportion of those with a master’s or undergraduate degree. People under the age of 30 are also over-represented.

Note that we need to be cautious about the interpretation of the clustering, as it varies depending on the number of MCA components selected. The stability of the solution should be investigated further. In addition, the number of individuals per cluster may vary slightly because a “consolidation step”² is applied, i.e. a k-means algorithm is run after the hierarchical clustering.
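The idea of a consolidation step can be sketched as a k-means algorithm started from an existing partition, here standing in for the one obtained by cutting the hierarchical tree. The points and initial labels are invented for illustration, and the function is a minimal sketch, not the implementation used for the report.

```python
def consolidate(points, labels, n_iter=10):
    """Run k-means iterations starting from an existing partition."""
    labels = list(labels)
    for _ in range(n_iter):
        clusters = sorted(set(labels))
        # Centre of gravity of each current cluster.
        centers = {}
        for c in clusters:
            members = [p for p, lab in zip(points, labels) if lab == c]
            centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Reassign every point to its nearest centre (squared Euclidean distance).
        new_labels = [
            min(clusters,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # partition is stable, stop early
            break
        labels = new_labels
    return labels


# Toy coordinates on two components and an initial partition (both invented).
toy_points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (3.0, 0.0), (3.2, 0.0)]
initial_labels = [1, 1, 2, 2, 2]
print(consolidate(toy_points, initial_labels))
```

In this toy example the third point starts in cluster 2 but lies closer to the centre of cluster 1, so the consolidation moves it, which is how cluster sizes can change slightly after the hierarchical clustering.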

Conclusion

Missing values in the questions about involvement in the R community often occur on all the questions simultaneously.

Two main groups of respondents have been highlighted:

  • People who do not feel they are part of the R community, but who would like to participate in workshops or in an R user group meeting close to them (from a geographic or demographic point of view). They are more likely to be women or people with a master's or undergraduate degree.

  • People who feel they are part of the R community, but do not have time to participate in meetings. They are more likely to be men, people with caring responsibilities or people with a doctorate.

In addition, since we recoded many variables, we could make recommendations for the next questionnaire. In particular, we suggest using “yes” and “no” answers for the checkboxes to distinguish missing values from “no”.

Footnotes

  1. For more details about the classification based on principal components methods, see: Francois Husson, Jerome Pages, Sebastien Le. Exploratory Multivariate Analysis by Example Using R. CRC Press Taylor & Francis, pp.240, 2010, Chapman & Hall/CRC Computer Science & Data Analysis.↩︎

  2. For more details about the classification based on principal components methods, see: Francois Husson, Jerome Pages, Sebastien Le. Exploratory Multivariate Analysis by Example Using R. CRC Press Taylor & Francis, pp.240, 2010, Chapman & Hall/CRC Computer Science & Data Analysis.↩︎