This READ_ME file contains information about a Perl program written to detect occurrences of Simpson's paradox in bioinformatics datasets for the classification task of machine learning, as reported in the following published paper:

A.A. Freitas. Investigating the role of Simpson's paradox in the analysis of top-ranked features in high-dimensional bioinformatics datasets. Briefings in Bioinformatics, Published online (ahead of print) on 9 Jan. 2019, DOI: 10.1093/bib/bby126
https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby126/5280899

Note that the Perl code of this program is being made freely available without any guarantees or technical support; and the code has not been optimized in any way. Also, the user has to set up several parameters before running the program, as explained below.

In addition, note that the program was written specifically for discovering occurrences of Simpson's paradox in the datasets used in the above paper, which have a relatively simple structure consisting mainly of binary variables (attributes). It is possible to use the program to detect Simpson's paradoxes in other datasets, but the program was not tested with other datasets, and this additional use is subject to several limitations.

In particular, as explained in more detail below, the user has to specify the class (goal) attribute, the class label of interest, the list of candidate predictor attributes, and the list of candidate confounder attributes in the dataset. By candidate predictor or confounder we mean an attribute that can occur as a predictor or confounder in an instance of Simpson's paradox, as defined in the above paper. The class attribute and all candidate predictor attributes in the specified list must be binary, whilst the confounder attributes must be categorical (nominal, or discrete) - not necessarily binary.
The program only checks whether the class, predictor and confounder attributes are nominal; it does not check whether the class and predictor attributes are binary. If the class attribute or any candidate predictor attribute is nominal but not binary, the program will not work. The input dataset may contain numerical (integer-valued or real-valued) attributes, but these will be ignored; they cannot be used as class, predictor or confounder attributes (taking into account the definition of Simpson's paradox in the above paper). In addition, the program cannot cope with missing values.

The remainder of this READ_ME file is divided into 3 sections:
(1) Setting up the input file, in the ARFF format
(2) Setting up the parameters of the program (based on the input dataset) and running it
(3) Interpreting the Output of the Program

=================================================
(1) Setting up the input file, in the ARFF format
=================================================

The program detects occurrences of Simpson's paradox in a given input dataset. The file with this dataset must be in the ARFF format used by the WEKA data mining (machine learning) tool, which is freely available from:
http://www.cs.waikato.ac.nz/ml/weka/

Some datasets used in the above paper are also provided in a separate zip file, and these datasets are already in the ARFF format. That zip file also includes another READ_ME file describing the datasets.

In any case, here is some brief information about the ARFF file format. In the file containing the input dataset, each instance (row) corresponds to an object being classified (a gene in the above published paper). Each instance (gene) is described by a number of attributes (columns), or features. The last attribute is the class to which the gene belongs.
The lines containing the actual data start after the line with the keyword "@data". Before that line, the file contains a line with the name of the dataset (starting with the keyword "@relation") and many lines starting with the keyword "@attribute". The "@attribute" lines describe the names of the attributes (features) and their possible values, as well as the class labels to be predicted.

For more information about WEKA's ".arff" file format, as well as information about the algorithms available in WEKA and about data mining (or machine learning) in general, the reader is referred to the following book:
I.H. Witten, E. Frank, M.A. Hall, C.J. Pal. Data Mining: practical machine learning tools and techniques. 4th Ed. Morgan Kaufmann, 2016.
as well as to the WEKA website:
http://www.cs.waikato.ac.nz/ml/weka/

==============================================================
(2) Setting up the parameters of the program
    (based on the input dataset) and running it.
==============================================================

The program detects occurrences of Simpson's paradox in a given input dataset. The file name for that dataset is passed as an argument to the program. Hence, the program can be executed with the following command line (assuming perl is installed on your computer):

    perl program_name input_dataset_name.arff.txt

In the above file name, we included the extension ".txt" because, although the ARFF file is essentially a plain text format, perl does not recognize the extension ".arff" as a text file. Note that the input file name is not required to include the ".arff" extension (we keep it in the middle of the file name just as a 'reminder' that the file is in this format). What is required is that the content of the input file be in the ARFF file format.

Before running the program with the above command, the user has to set up four types of parameters required by the program.
The values of these parameters are hard-coded inside the program code (i.e., the user has to manually change the settings of these parameters in the program code). These parameters are:

First, the id number of the Goal attribute of interest, which is the Class variable in the classification datasets used in the above paper.

Second, the id number of the Goal Attribute Value (class label) of interest. This id is typically coded as 1 or 0 in the input file when the class variable is binary.

Third, a list of the id numbers of all attributes that are candidates to take the role of the predictor attribute in an occurrence of Simpson's paradox.

Fourth, a list of the id numbers of all attributes that are candidates to take the role of a confounder attribute in an occurrence of Simpson's paradox.

(For an explanation of the terms "predictor" and "confounder", see the above paper.)

In the experiments reported in the above paper, for each dataset, the third parameter was set to the list of the top-15 predictor attributes in the ranking produced by running some attribute ranking method in the WEKA machine learning tool; whilst the fourth parameter, the list of candidate confounder attributes, was set to all attributes (except the class attribute) in the dataset. Note that an attribute can be included in both these lists, since an attribute can act as a predictor in one paradox instance and as a confounder in another paradox instance. When an attribute is included in both lists, the program makes sure the attribute is not used as both predictor and confounder in the same paradox instance.

When getting the id numbers of the top-15 predictors based on the ranking produced by WEKA, there is one caveat: in its output screen, WEKA reports the id of each attribute assuming attributes are sequentially numbered starting from 1, whilst in the perl program attributes are sequentially numbered starting from 0.
Hence, we have to subtract 1 (one) from the attribute ids output by WEKA, when specifying the attribute ids in the lists of predictor and confounder attributes for the perl program. In addition, in the datasets used in the paper, the goal (class) attribute of interest was the last attribute in the dataset file.

As an example, we consider the attribute ranking produced by the Information Gain feature selection/ranking method in WEKA when the input file was the Mouse dataset (Dataset-Mouse-GO-attrs-class-anti-longev.arff.txt). The list of ids of the top 15 attributes in this ranking, as reported in the WEKA output, is:

    4,738,737,791,806,733,656,55,73,90,102,620,404,402,72

Then, subtracting 1 (one) from each of the above attribute ids, the list of ids of candidate predictor attributes for Simpson's paradox detection to be specified in the perl program is:

    3,737,736,790,805,732,655,54,72,89,101,619,403,401,71

Also, in this dataset, the goal attribute id is 886; and all attributes (except the class one) can be used as confounders, since they are all nominal attributes (in this particular dataset, all attributes are binary). So, the specification of the list of confounder attributes in the program is: 0..885.

Finally, to make the above example precise, once the settings of the above four types of parameters have been chosen by the user (based on the input file), these parameter settings are included in some lines near the start of the Perl program as hard-coded values, as follows (where the '#' at the start of a line means that line is treated as just a comment by Perl):

#############################################################
# The below hard-coded values are associated with the       #
# input file: Mouse-GO-attrs-class-anti-longev.arff.txt     #
#############################################################
my $GA_ID = 886;         # Goal (Class) Attr. Id in Mouse dataset
my $GOAL_VAL = '0';      # Goal Value as recorded in input file
my $GOAL_VAL_NAME = 'anti-longevity'; # meaningful value name
### Top-15 predictor attributes in ranking by InfoGain ###
my @part_attr_1_list = (3,737,736,790,805,732,655,54,72,89,101,619,403,401,71);
### The list of confounders has all 886 predictor attribs ###
my @part_attr_2_list = (0..885);
########### END OF VARIABLES WITH HARD-CODED VALUES #########

For the convenience of users, we also provide, in a separate zip file, several versions of the perl program where the above parameters have already been specified in the program code - one program version for each dataset or each combination of a dataset and an attribute ranking method. Each program's name indicates the type of ranking method that was used to select the top-15 attributes (in the above paper) and the dataset (type of organism). Each program version can be run using the above command line. For instance, to run the program for detecting Simpson's paradox in the Mouse dataset, using the top-15 predictors as ranked by Information Gain, the command line would be:

    perl Simp-Parad-detect-top15-rank-InfoGain-Mouse Dataset-Mouse-GO-attrs-class-anti-longev.arff.txt

==========================================
(3) Interpreting the Output of the Program
==========================================

The program produces two output files. One is a file titled 'LOG-Simpsons-paradoxes.txt', in which processing errors (if any) are reported. This log file also reports the names, data types and values of the attributes in the input file, as well as the total number of attributes and instances read from that file.

The main output file is titled 'Simpsons-paradoxes.txt', where all found occurrences (if any) of Simpson's paradox are reported. More precisely, this main output file has the following information:

First, the name of the input file that has been read.
Second, the goal attribute's id and name ('class', in the above paper's datasets), the goal attribute's values ('0' and '1' in the above paper's datasets), the goal attribute's value of interest and that value's meaningful name.

Third, the attribute ids in the list of candidate predictor attributes.

Fourth, a list of all (if any) occurrences of Simpson's paradox found for each predictor attribute. For each paradox occurrence, the output file shows:

(a) The id and name of the predictor and confounder attributes.
(b) The goal attribute's id and name (class), and its value (class label) of interest - this information is the same for all paradox occurrences found in the input dataset.
(c) The id and name of the predictor attribute and its values.
(d) The probability of the class label of interest given each of the two values of the predictor attribute.
(e) The effect of changing the observed value of the predictor attribute from value1 ('0') to value2 ('1') on the probability of the class label of interest (i.e. whether that change increases or decreases or has no effect on that probability).
(f) The id and name of the confounder attribute and its values.
(g) The probability of the class label of interest given each pair of attribute values: a value of the predictor attribute and a value of the confounder attribute.
(h) For each fixed value of the confounder variable, the effect of changing the observed value of the predictor from value1 ('0') to value2 ('1') on the probability of the class label of interest (i.e. whether that change increases or decreases or has no effect on the probability of the class label of interest given that fixed confounder's value).

For more details about the computation of the above probabilities, please see the above published paper.
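
As an illustration of items (d), (e), (g) and (h), the following is a minimal Perl sketch; it is not the code of the program itself. It assumes the common formulation of Simpson's paradox, in which the direction of the effect observed in the whole dataset is reversed for every value of the confounder. The definition actually used by the program is the one given in the above paper, which may treat special cases (such as "no effect" or empty groups) differently. The counts are a widely cited kidney-stone treatment illustration, not data from the paper's datasets, and the names %counts, direction and is_simpson_reversal are hypothetical.

```perl
use strict;
use warnings;

# Toy counts (assumption: illustrative data, not from the paper's datasets).
# $counts{predictor_value}{confounder_value} = [n_with_goal_label, n_total]
my %counts = (
    '0' => { small => [ 81,  87], large => [192, 263] },
    '1' => { small => [234, 270], large => [ 55,  80] },
);

# Effect of changing the predictor from '0' to '1' on P(goal label):
# +1 = increase, -1 = decrease, 0 = no effect.
sub direction {
    my ($p0, $p1) = @_;
    return $p1 <=> $p0;
}

# Returns 1 if the overall effect is reversed within every confounder value.
sub is_simpson_reversal {
    my ($c) = @_;
    my @strata = sort keys %{ $c->{'0'} };

    # Item (d): P(goal label | predictor value), ignoring the confounder.
    my @overall;
    for my $v ('0', '1') {
        my ($hits, $total) = (0, 0);
        for my $s (@strata) {
            $hits  += $c->{$v}{$s}[0];
            $total += $c->{$v}{$s}[1];
        }
        return 0 if $total == 0;    # probability undefined: report no paradox
        push @overall, $hits / $total;
    }

    # Item (e): overall direction of the effect.
    my $overall_dir = direction(@overall);
    return 0 if $overall_dir == 0;  # assumption: "no effect" cannot be reversed

    # Items (g) and (h): the same comparison within each confounder value.
    for my $s (@strata) {
        my ($h0, $n0) = @{ $c->{'0'}{$s} };
        my ($h1, $n1) = @{ $c->{'1'}{$s} };
        return 0 if $n0 == 0 || $n1 == 0;
        return 0 unless direction($h0 / $n0, $h1 / $n1) == -$overall_dir;
    }
    return 1;
}

print is_simpson_reversal(\%counts) ? "paradox\n" : "no paradox\n";   # prints "paradox"
```

In this toy example, changing the predictor from '0' to '1' increases the overall probability of the goal label (273/350 versus 289/350), yet decreases it within both confounder values, so the sketch reports a paradox.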

After reporting all (if any) occurrences of Simpson's paradox found for a given predictor attribute, the output file reports the total number of paradox occurrences where that attribute occurred as the predictor (i.e., not counting that attribute's occurrences as the confounder).

For the convenience of users, we also provide, in a separate zip file, the two output files (the LOG file and the main output file) for each of the several versions of the perl program mentioned at the end of section (2) of this READ_ME file. Again, each output file's name indicates the type of ranking method that was used (in the above paper) and the dataset (type of organism).