Project Report for Phys291:
Project description:
The goal of the project was to write a program that takes a distribution of cancer risk as a function of age, applies it to a population with randomly chosen ages, and fills the resulting data into a ROOT histogram. I then wanted to see whether I could get the distribution back out by fitting it to the histogram.
Source code:
The source code is available here.
Progress log:
My progress log is available here.
Other code:
Some of the failed or discontinued programs can be found here, together with a readme.
The Distribution:
I used the fit to the cancer data found in "Age Distribution of Cancer: The Incidence Turnover at Old Age" by Francesco Pompei and Richard Wilson, found here.
The distribution is f(t) = (a*t)^(k-1) * (1 - b*t) * 100000. It gives the number of cancer cases per 100000 people of age t, with fit parameters a (alpha), b (beta) and k-1. I have used it to calculate the risk of an individual having the cancer in question, which is f(t)/100000 for a person of age t.
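As an illustration, the formula can be written as a small C++ function. This is only a sketch and not the project's actual source: the struct and function names are made up here, and clamping the risk to the range 0 to 1 is an assumption of the sketch (the factor (1 - b*t) turns negative once t is larger than 1/b).

```cpp
#include <cmath>

// Fit parameters of the distribution (illustrative names, not from the project code).
struct CancerParams {
    double a;        // alpha
    double b;        // beta
    double kMinus1;  // k-1
};

// Cases per 100000 people of age t: f(t) = (a*t)^(k-1) * (1 - b*t) * 100000.
double incidencePer100k(const CancerParams& p, double t) {
    return std::pow(p.a * t, p.kMinus1) * (1.0 - p.b * t) * 100000.0;
}

// Risk for one individual of age t, taken as f(t)/100000.
// Assumption of this sketch: values outside [0, 1] are clamped.
double individualRisk(const CancerParams& p, double t) {
    double r = incidencePer100k(p, t) / 100000.0;
    if (r < 0.0) return 0.0;
    if (r > 1.0) return 1.0;
    return r;
}
```

With the original male lung cancer parameters from Pompei and Wilson, incidencePer100k gives roughly 390 cases per 100000 at age 70.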
The program:
During this project I experimented with a few different ways of coding what I wanted the program to do. Here I will outline what my main data-generating program does and what my main ROOT analysis program does.
Data generator:
My data generator starts by asking the user whether it should create a data set for a male cancer, a female cancer, both of these sets, or a set where the sexes are mixed (population).
The last option is not very relevant for the project, since it cannot be compared back to Pompei and Wilson; it was something I added on a whim. The program then asks which cancer to test, from 5 choices: 4 of these are shared between male and female, and one is a sex-specific cancer (prostate or breast cancer). Next the program asks how many people to test and gives a recommendation. The recommended value is designed to give ~100000 counts at every age included, for easy comparison with the original.
The program then goes through a for loop as many times as the user chose. Each time it chooses a random age between 0 and 100 and uses this age to calculate the risk to the individual. This risk is then used to decide whether the cancer is true or false, and the age and the true/false result are printed to a txt file. The program also counts the number of times the for loop has run, mostly so I could see that the program was running.
ROOT analysis:
The analysis program starts in much the same way as the data generator. It asks for the sex to analyse and the cancer that was tested in the data set. It then opens the file made by the data generator, reads it, and fills the values into the histogram(s). Finally it fits the original distribution, unique for every cancer, to the histogram(s) and draws the histogram(s).
Assumptions:
In most of my programs I have assumed that age is uniformly distributed. This is because Pompei and Wilson, whose fits I want to compare my histogram fits to, use cases per 100000 for their y-axis. This means that every age has equally many potential cancer victims. An age distribution in which the ages had slightly different chances of being chosen would make the comparison between my histogram(s) and their graphs non-trivial.
Problems:
I had several problems during this project. Most of these I overcame by doing some searching; for some I ran out of time before solving them. One problem was in the distributions for the sex-specific cancers; this is explained further where I present those histograms. Most of the problems are also outlined in my progress log.
Results:
After generating data for 10000000 persons and running it through the ROOT analysis program I got this histogram and fit for male lung cancer:
The fit parameters were: a = 0,0075244 with error = 0,0000334171
b = 0,0104385 with error = 0,00000488508
k-1 = 6,66583 with error = 0,0473411
This gives these intervals: 0,007490983 < a < 0,007557817
0,010433615 < b < 0,010443385
6,6184889 < k-1 < 6,7131711
The original parameters were: a = 0,00755
b = 0,0105
k-1 = 6,6
The differences between the original parameters and those found by the fit, expressed as the ratio original/fit, are:
For a: 0,00755/0,0075244 = 1,003402265, meaning the fitted parameter is ~0,34% smaller than the original.
For b: 0,0105/0,0104385 = 1,005891651, meaning the fitted parameter is ~0,59% smaller than the original.
For k-1: 6,6/6,66583 = 0,990124261, meaning the fitted parameter is ~0,99% larger than the original.
From this we see that the fit to my histogram is quite similar to, though not exactly the same as, the distribution I used for the probability of cancer. The original a is within one error of the fitted value and the original k-1 is within two errors, but the original b lies more than twelve errors away. Still, both a and b differ from the original by well under 1%, and k-1 by just under 1%.
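The two comparisons used here, distance in units of the fit error and percent difference from the ratio original/fit, can be written as two small helper functions. The function names are made up for this sketch.

```cpp
#include <cmath>

// Number of fit errors between the original parameter and the fitted one.
double deviationInErrors(double original, double fit, double error) {
    return std::fabs(original - fit) / error;
}

// Percent difference based on the ratio original/fit.
// Positive: the fitted parameter is smaller than the original.
// Negative: the fitted parameter is larger than the original.
double percentSmaller(double original, double fit) {
    return (original / fit - 1.0) * 100.0;
}
```

For the male lung cancer fit, deviationInErrors gives about 0,8 for a, about 12,6 for b and about 1,4 for k-1, while percentSmaller gives about 0,34 for a and about -0,99 for k-1.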
Creating a similar data set for female lung cancer and doing the ROOT analysis provided me with this histogram and fit:
The fit parameters were: a = 0,00707970 with error = 0,0000525194
b = 0,0107433 with error = 0,00000978719
k-1 = 6,64276 with error = 0,0674143
This gives these intervals: 0,007027181 < a < 0,007132219
0,010733513 < b < 0,010753087
6,5753457 < k-1 < 6,7101743
The original parameters were: a = 0,007
b = 0,0108
k-1 = 6,5
The differences between the original parameters and those found by the fit, expressed as the ratio original/fit, are:
For a: 0,007/0,00707970 = 0,988742461, meaning the fitted parameter is ~1,13% larger than the original.
For b: 0,0108/0,0107433 = 1,005277708, meaning the fitted parameter is ~0,53% smaller than the original.
For k-1: 6,5/6,64276 = 0,978508933, meaning the fitted parameter is ~2,15% larger than the original.
For this histogram the fit isn't as good as for its male counterpart. None of the original parameters are within the error of the fitted parameters, and the % differences between original and fit are larger. The largest difference is still in k-1, at slightly over 2%.
Some other histograms:
Using the "both" option in the ROOT analysis program provides histograms like this:
These are the histograms and fits for a data set of bladder cancer. The title only says "cancer" because I didn't want to code several different histograms in my program; for the others I have simply changed the title of the histogram in the code.
The prostate and breast plots are the only ones that start at age 15, as mentioned earlier; Pompei and Wilson also started their fit there, with t = 0 at age 15. This presented a problem for me in comparing the fits, since my fit starts with x = 0 at age 0 even though the histogram is restricted to between 15 and 100. I have tried many different approaches to fix this, but I have yet to find a way to make it work other than running the data generator from 0 to 85 with t = age instead of t = age - 15 as I did for the histogram above. The data from this run resulted in this histogram and fit from the ROOT program:
The problem then is that the histograms have moved down in age, so the age range does not present the data as it truly should be. However, I can now compare the fits of my generated histograms with Pompei and Wilson's fits.
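The relation between age and t for the sex-specific cancers can be made explicit in a small function. This is a sketch under assumptions: the function name is invented, and returning zero below the starting age is a choice made for the example, not something taken from Pompei and Wilson.

```cpp
#include <cmath>

// Cases per 100000 at a given age when t is counted from ageOffset,
// i.e. t = age - ageOffset (ageOffset = 15 for prostate and breast cancer).
double incidenceAtAge(double a, double b, double kMinus1,
                      double age, double ageOffset) {
    double t = age - ageOffset;
    if (t < 0.0) return 0.0;  // assumption: no cases below the starting age
    return std::pow(a * t, kMinus1) * (1.0 - b * t) * 100000.0;
}
```

With ageOffset = 15, age 75 gives the same value as age 60 does with ageOffset = 0, which is the shift seen between the two sets of histograms. In principle the same shift could be written into the fit expression by using (x - 15) in place of x; whether that resolves the problem described above is not something this sketch verifies.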
Tables:
Now follow tables of the different values I got from running the programs:
Men | Original a | Fit a | Original b | Fit b | Original k-1 | Fit k-1
Lung | 0,00755 | 0,00752443 | 0,01050 | 0,01043850 | 6,6 | 6,66583
Colon | 0,00732 | 0,00729708 | 0,01003 | 0,00997667 | 7,0 | 7,03797
Bladder | 0,00688 | 0,00691392 | 0,01007 | 0,01002260 | 7,2 | 7,31089
Non-Hod | 0,00509 | 0,00500969 | 0,00997 | 0,00991823 | 5,7 | 5,66871
Prostate | 0,00850 | 0,00851574 | 0,01220 | 0,01212600 | 4,8 | 4,87296
Women | Original a | Fit a | Original b | Fit b | Original k-1 | Fit k-1
Lung | 0,00700 | 0,00707970 | 0,01080 | 0,01074330 | 6,5 | 6,64278
Colon | 0,00717 | 0,00720891 | 0,00995 | 0,00990295 | 7,3 | 7,47542
Bladder | 0,00525 | 0,00542464 | 0,00980 | 0,00981651 | 6,7 | 6,98149
Non-Hod | 0,00481 | 0,00486456 | 0,01010 | 0,01003890 | 5,7 | 5,84788
Breast | 0,00375 | 0,00382116 | 0,01150 | 0,01144390 | 2,8 | 2,85112
I then looked at the % difference between the original and the fit parameters, as shown earlier:
Men | The fit parameter a is: | The fit parameter b is: | The fit parameter k-1 is:
Lung | ~0,34% smaller | ~0,59% smaller | ~0,99% larger
Colon | ~0,31% smaller | ~0,53% smaller | ~0,54% larger
Bladder | ~0,49% larger | ~0,47% smaller | ~1,52% larger
Non-Hod | ~1,60% smaller | ~0,52% smaller | ~0,55% smaller
Prostate | ~0,18% larger | ~0,61% smaller | ~1,50% larger
Women | The fit parameter a is: | The fit parameter b is: | The fit parameter k-1 is:
Lung | ~1,13% larger | ~0,53% smaller | ~2,15% larger
Colon | ~0,54% larger | ~0,48% smaller | ~2,35% larger
Bladder | ~3,22% larger | ~0,17% larger | ~4,03% larger
Non-Hod | ~1,12% larger | ~0,61% smaller | ~2,53% larger
Breast | ~1,86% larger | ~0,49% smaller | ~1,79% larger
I also ran the data generator for 100 million persons for lung cancer and changed the fit expression accordingly, since it includes how many data points of each age should exist. This produced this histogram and fit:
The fit parameters for men here are a = 0,00752878, b = 0,0104432 and k-1 = 6,62983.
Compared with the original parameters, a is ~0,28% smaller, b ~0,54% smaller and k-1 ~0,45% larger. These differences are all smaller than in the 10 million run.
The fit parameters for women here are a = 0,00697116, b = 0,0107440 and k-1 = 6,51166.
Compared with the original parameters, a is ~0,41% smaller, b ~0,52% smaller and k-1 ~0,18% larger. These differences are also all smaller than in the 10 million run.
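The reason the fit expression has to change with the size of the run can be sketched as follows. The expected number of cases at an age is the risk times the number of persons generated at that age, so the factor 100000 in f(t) is replaced by the persons per age. The function names are invented for this sketch, and dividing the run size by exactly 100 ages is an assumption; the real programs may count the ages differently.

```cpp
#include <cmath>

// Average number of generated persons per age for uniformly drawn ages.
double averagePersonsPerAge(double nPersons, int nAges) {
    return nPersons / nAges;
}

// Expected number of cancer cases at age t when personsPerAge persons of
// that age were generated: (a*t)^(k-1) * (1 - b*t) * personsPerAge.
double expectedCases(double a, double b, double kMinus1,
                     double t, double personsPerAge) {
    return std::pow(a * t, kMinus1) * (1.0 - b * t) * personsPerAge;
}
```

Under these assumptions a 10 million run has 100000 persons per age, which matches the per 100000 scale of Pompei and Wilson, while a 100 million run has ten times as many, so the expected counts in every bin are ten times larger.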
This suggests that the fit parameters get closer to the original as the amount of data goes up, as they should. This prompted me to run the programs with 1 billion persons. I chose colon cancer because it seemed to have the fit parameters closest to the original already. I limited myself to making only one data set, female, even though the female data sets seem to have the largest differences between the original parameters and the fit. I limited it this way because of the time it took to make and analyse the data sets. I got this histogram and fit as a result:
The fit parameters here are a = 0,00712403, b = 0,00989500 and k-1 = 7,29332.
Compared with the original parameters, a is ~0,65% smaller, b ~0,56% smaller and k-1 ~0,09% smaller. Of these parameters only k-1 has a large change in difference compared to the 10 million run. This might be because of missing data points in the 1 billion data set: about 205 runs are missing from the data set, and the fit depends on knowing how many data points of each age exist. This is hard to confirm, though.
Discussion:
From my limited experience with projects like this and with fitting in ROOT, I find that I got the ROOT fit quite close to the original distribution that constituted my probability function for cancer at the different ages. This indicates that my data program created data that was true to the source material. I would also like to add that, throughout the process of trying to get good fits, the shape of the histograms and the rough number of cancer cases have been quite similar to the graphs shown in Pompei and Wilson. This, at least, I am very pleased with.
Outlook:
I see multiple areas where my C++ coding and understanding can improve, and likewise with ROOT. Some fairly simple things, like getting functions to work, would make my code much simpler and especially easier to correct at a later stage. When I continue my coding, this is what I will focus on first.
Acknowledgements:
Thanks to Dan Hart for proofreading my report, it looks much better for it.
Reference:
Francesco Pompei and Richard Wilson, 2001, Age Distribution of Cancer: The Incidence Turnover at Old Age, Human and Ecological Risk Assessment, Vol. 7, No. 6, pp. 1619-1650