Arthur Charpentier


Friday, October 1, 2010

Blog transfer

As mentioned during the past few weeks, the blog has been transferred (please update links and bookmarks). You will be redirected (shortly) to

Wednesday, September 29, 2010

Statistique de l'assurance STT6705V, part 5

For the fifth lecture, we will continue on the modelling of claim costs, based on the slides online here. The first part will deal with the comparison of the Gamma and the lognormal models, and the second with the capping of large claims.
Care is needed with the logarithmic transformation. A model of the form

$$\log Y_i = X_i'\beta+\varepsilon_i$$
will always look "better" than a model
$$Y_i = X_i'\beta+\varepsilon_i$$
(as long as the $Y_i$'s are much larger than 1). Let us look at a classical dataset, linking vehicle speed and braking distance.
>  summary(lm(dist~speed,data=cars))
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -17.5791     6.7584  -2.601   0.0123 * 
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12

>  summary(lm(log(dist)~speed,data=cars))
Call:
lm(formula = log(dist) ~ speed, data = cars)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  1.67612    0.19614   8.546 3.34e-11 ***
speed        0.12077    0.01206  10.015 2.41e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4463 on 48 degrees of freedom
Multiple R-squared: 0.6763,     Adjusted R-squared: 0.6696
F-statistic: 100.3 on 1 and 48 DF,  p-value: 2.413e-13

The $R^2$ is higher, the residual standard error is much lower, and the log likelihood (had we computed it) would look better too, etc.
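For instance (a quick sketch: logLik can be called on both fits, but the log likelihood of the model in logs lives on the log scale, so the Jacobian term, minus the sum of the log-observations, has to be added before the two numbers become comparable),
> logLik(lm(dist~speed,data=cars))
> logLik(lm(log(dist)~speed,data=cars))-sum(log(cars$dist))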
One should never forget that the predictions of the model in logs must then be transformed back, taking the exponential. This is, by the way, what is done for instance in the Box-Cox approach, where the sums of squared residuals are compared on the scale of the $Y_i$'s, i.e.
$$\sum_{i=1}^n \left(Y_i-\widehat{Y}_i\right)^2$$
and
$$\sum_{i=1}^n \left(Y_i-\widetilde{Y}_i\right)^2$$
where $\widetilde{Y}_i=\exp\left(X_i'\widehat{\beta}+\widehat{\sigma}^2/2\right)$.

>  s=summary(lm(log(dist)~speed,data=cars))$sigma
>  mean((cars$dist-predict(lm(dist~speed,data=cars)))^2)
[1] 227.0704
>  mean((cars$dist-exp(predict(lm(log(dist)~speed,data=cars))+.5*s^2))^2)
[1] 296.2027
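Alternatively, a sketch with the boxcox function from the MASS package, which profiles the likelihood of the Box-Cox parameter (the log transform corresponds to lambda=0, the untransformed model to lambda=1),
> library(MASS)
> boxcox(lm(dist~speed,data=cars),lambda=seq(-1,2,by=.1))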
All that to conclude that the linear model might not be that bad after all...

Saturday, September 25, 2010

EM and mixture estimation, with R (part 2)

Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).
So, we get back to our simple mixture model,

$$f(x)=p\,f_1(x)+(1-p)\,f_2(x)$$
In order to describe how the EM algorithm works, assume first that both $f_1$ and $f_2$ are perfectly known, so that the mixture parameter $p$ is the only one we care about.
  • The simple model, with only one unknown parameter
Here, the likelihood is
$$\mathcal{L}(p;\boldsymbol{x})=\prod_{i=1}^n\left[p\,f_1(x_i)+(1-p)\,f_2(x_i)\right]$$
so that we write the log likelihood as
$$\log\mathcal{L}(p;\boldsymbol{x})=\sum_{i=1}^n\log\left[p\,f_1(x_i)+(1-p)\,f_2(x_i)\right]$$
which might not be simple to maximize. Recall that the mixture model can be interpreted through a latent variable $Z_i$ (that cannot be observed), taking value 1 when $X_i$ is drawn from $f_1$, and 0 when it is drawn from $f_2$. More generally (especially if we want to extend the model to mixtures of 3, 4, ... components), define $Z_{1,i}=\boldsymbol{1}(X_i\text{ drawn from }f_1)$ and $Z_{2,i}=\boldsymbol{1}(X_i\text{ drawn from }f_2)$.
With that notation, the likelihood becomes
$$\mathcal{L}(p;\boldsymbol{x},\boldsymbol{z})=\prod_{i=1}^n\left[p\,f_1(x_i)\right]^{Z_{1,i}}\left[(1-p)\,f_2(x_i)\right]^{Z_{2,i}}$$
and the log likelihood
$$\log\mathcal{L}(p;\boldsymbol{x},\boldsymbol{z})=\sum_{i=1}^n\left[Z_{1,i}\log p+Z_{2,i}\log(1-p)\right]+\sum_{i=1}^n\left[Z_{1,i}\log f_1(x_i)+Z_{2,i}\log f_2(x_i)\right]$$
where the term on the right can be ignored, since we only care about $p$ here. From there, consider the following iterative procedure.
Assume that the mixture probability $p$ is known, denoted $p_k$. Then I can predict the value of $Z_i$ (i.e. $Z_{1,i}$ and $Z_{2,i}$) for each observation,
$$\widehat{Z}_{1,i}=\mathbb{E}\left[Z_{1,i}\mid X_i\right]=\frac{p_k\,f_1(X_i)}{p_k\,f_1(X_i)+(1-p_k)\,f_2(X_i)}$$
So I can inject those values into my log likelihood, i.e. in
$$\sum_{i=1}^n\left[\widehat{Z}_{1,i}\log p+\left(1-\widehat{Z}_{1,i}\right)\log(1-p)\right]$$
which reaches its maximum (no need to run numerical tools here) at
$$\widehat{p}=\frac{1}{n}\sum_{i=1}^n\widehat{Z}_{1,i}$$
which will be denoted $p_{k+1}$. And I can iterate from there.
Formally, the first step is the expectation (E) step, where $\widehat{Z}_{1,i}$ is computed as the best predictor of $Z_{1,i}$ given my observations (as well as my current belief in $p_k$). Then comes the maximization (M) step, where, using those $\widehat{Z}_{1,i}$'s, I can re-estimate the probability $p$.
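As a minimal sketch of that two-step loop (assuming, for illustration only, that $f_1$ and $f_2$ are the $\mathcal{N}(0,1)$ and $\mathcal{N}(4,1)$ densities used to generate the sample X below),
> f1 = function(x) dnorm(x,0,1)      # first component, assumed perfectly known
> f2 = function(x) dnorm(x,4,1)      # second component, assumed perfectly known
> p = .5                             # initial value of the mixture probability
> for(k in 1:100){
+ Z = p*f1(X)/(p*f1(X)+(1-p)*f2(X))  # E step: predict the latent Z's
+ p = mean(Z)                        # M step: update the mixture probability
+ }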
  • A more general framework, where all parameters are now unknown
So far, it was simple, since we assumed that $f_1$ and $f_2$ were perfectly known, which is not realistic. And there is not much to change to get a complete algorithm that estimates $\theta=(p,\mu_1,\sigma_1,\mu_2,\sigma_2)$. Recall that we had $\widehat{Z}_{1,i}$, the expected value of $Z_{1,i}$, i.e. the probability that observation $i$ has been drawn from $f_1$.
If $\widehat{Z}_{1,i}$, instead of lying in the interval $[0,1]$, took values in $\{0,1\}$, we could simply compute the mean and standard deviation of the observations such that $\widehat{Z}_{1,i}=1$ (to estimate the parameters of $f_1$), and similarly on the subset of observations such that $\widehat{Z}_{1,i}=0$ (for those of $f_2$).
But we can't. So what can be done is to consider $\widehat{Z}_{1,i}$ as the weight given to observation $i$ when estimating the parameters of $f_1$, and similarly $1-\widehat{Z}_{1,i}$ as the weight given to observation $i$ when estimating the parameters of $f_2$.
So we set, as before (writing $\varphi(\cdot;\mu,\sigma)$ for the Gaussian density with mean $\mu$ and standard deviation $\sigma$),
$$\widehat{Z}_{1,i}=\frac{p_k\,\varphi\left(X_i;\mu_{1,k},\sigma_{1,k}\right)}{p_k\,\varphi\left(X_i;\mu_{1,k},\sigma_{1,k}\right)+(1-p_k)\,\varphi\left(X_i;\mu_{2,k},\sigma_{2,k}\right)}$$
and then update the mixture probability, $p_{k+1}=\frac{1}{n}\sum_{i=1}^n\widehat{Z}_{1,i}$, and the two weighted means,
$$\mu_{1,k+1}=\frac{\sum_{i=1}^n\widehat{Z}_{1,i}\,X_i}{\sum_{i=1}^n\widehat{Z}_{1,i}}\qquad\mu_{2,k+1}=\frac{\sum_{i=1}^n\left(1-\widehat{Z}_{1,i}\right)X_i}{\sum_{i=1}^n\left(1-\widehat{Z}_{1,i}\right)}$$
and for the variances, well, they are weighted means again,
$$\sigma_{1,k+1}^2=\frac{\sum_{i=1}^n\widehat{Z}_{1,i}\left(X_i-\mu_{1,k+1}\right)^2}{\sum_{i=1}^n\widehat{Z}_{1,i}}\qquad\sigma_{2,k+1}^2=\frac{\sum_{i=1}^n\left(1-\widehat{Z}_{1,i}\right)\left(X_i-\mu_{2,k+1}\right)^2}{\sum_{i=1}^n\left(1-\widehat{Z}_{1,i}\right)}$$

and this is it.
  • Let us run the code on the same data as before
Here, the code is rather simple: let us start by generating a sample,
> n = 200                                    # sample size, as in the previous post
> X1 = rnorm(n,0,1)                          # first component, N(0,1)
> X20 = rnorm(n,0,1)
> Z  = sample(c(1,2,2),size=n,replace=TRUE)  # component labels, P(Z=1)=1/3
> X2 = 4+X20                                 # second component, N(4,1)
> X = c(X1[Z==1],X2[Z==2])                   # observed mixture sample
then, given a vector of initial values $\theta_0$ (the analogue of the $p_0$ above),
>  s = c(0.5, mean(X)-1, mean(X)+1, var(X), var(X))   # (p, mu1, mu2, var1, var2)
I define my function as,
>  em = function(X0,s) {
+  # E step: probability that each observation comes from the first component
+  Ep = s[1]*dnorm(X0, s[2], sqrt(s[4]))/(s[1]*dnorm(X0, s[2], sqrt(s[4])) +
+  (1-s[1])*dnorm(X0, s[3], sqrt(s[5])))
+  # M step: weighted updates of the five parameters
+  s[1] = mean(Ep)                             # mixture probability
+  s[2] = sum(Ep*X0) / sum(Ep)                 # mean of the first component
+  s[3] = sum((1-Ep)*X0) / sum(1-Ep)           # mean of the second component
+  s[4] = sum(Ep*(X0-s[2])^2) / sum(Ep)        # variance of the first component
+  s[5] = sum((1-Ep)*(X0-s[3])^2) / sum(1-Ep)  # variance of the second component
+  return(s)
+  }

Then I get $\theta_1$ (and in particular $p_1$). So this is it! We just need to iterate (here I stop after 200 iterations), and we can see that, actually, the algorithm converges quite fast,
> for(i in 2:200){
+ s=em(X,s)
+ }
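To visualize that fast convergence, a small sketch: store the successive values of $\theta_k$ and plot, say, the trajectory of the estimated mixture probability,
> S = matrix(NA,200,5)                                 # one row per iteration
> S[1,] = c(0.5, mean(X)-1, mean(X)+1, var(X), var(X))
> for(i in 2:200){
+ S[i,] = em(X,S[i-1,])
+ }
> plot(S[,1],type="l",xlab="iteration",ylab="mixture probability")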

Let us run the same procedure as before, i.e. generate samples of size 200, where the difference between the means can be small (0) or large (4),
[figure: estimated parameters with the EM algorithm, as the difference between the two means varies]
Ok, Nicolas, you were right, we're doing much better! Maybe we should also go for a Gibbs sampling procedure?... next time, maybe...

Thursday, September 23, 2010

Optimization and mixture estimation, with R

Recently, one of my students asked me about optimization routines in R. He told me that R performed well on the estimation of a time series model with different regimes, while he had trouble with a (simple) GARCH process, and he was wondering whether R had good optimization routines. Actually, I had always thought that mixtures (and regimes) were difficult to estimate, so I was a bit surprised...

Indeed, it reminded me of some trouble I once experienced when talking about maximum likelihood estimation for a non-standard distribution, i.e. when the optimization has to be run on the log likelihood function. Even when generating nice samples and giving appropriate initial values (actually the true values used in the random generation), each time I tried to optimize my log likelihood, it failed. So I decided to play a little bit with standard optimization functions, to see which one performed better when trying to estimate the parameters of a mixture (from a mixture-based sample). Here, I generate a mixture of two Gaussian distributions, and I would like to see how different the means should be to have a high probability of estimating the parameters properly.
The density here is $f$, proportional to
$$p\,\varphi\left(x;\mu_1,\sigma_1\right)+(1-p)\,\varphi\left(x;\mu_2,\sigma_2\right)$$
The true model is $p=1/3$, $\mu_1=0$ and $\sigma_1=\sigma_2=1$, with $\mu_2$ being a parameter that will change, from 0 to 4.
The log likelihood (actually, I add a minus sign, since most optimization functions minimize rather than maximize) is
> logvraineg <- function(param, obs) {
+ p <- param[1]
+ m1 <- param[2]
+ sd1 <- param[3]
+ m2 <- param[4]
+ sd2 <- param[5]
+  -sum(log(p * dnorm(x = obs, mean = m1, sd = sd1) + (1 - p) *
+ dnorm(x = obs, mean = m2, sd = sd2)))
+  }
The code to generate my samples is the following,
> n = 200    # sample size
> m = 4      # mean of the second component (varied from 0 to 4 below)
> X1 = rnorm(n,0,1)
> X20 = rnorm(n,0,1)
> Z  = sample(c(1,2,2),size=n,replace=TRUE)
> X2 = m+X20
> X = c(X1[Z==1],X2[Z==2])

Then I use two functions to optimize my log likelihood, with identical initial values,
> O1=nlm(f = logvraineg, p = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)), obs = X)
> logvrainegX <- function(param) {logvraineg(param,X)}
> O2=optim( par = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)),
+   fn = logvrainegX)

Actually, since I might have identification problems, I take either $\widehat{p}$ or $1-\widehat{p}$, depending on whether $\widehat{\mu}_1$ or $\widehat{\mu}_2$ is the smallest parameter.
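That relabeling step is only a couple of lines, e.g. with the output of nlm above (a sketch, assuming the estimates are stacked as $(p,\mu_1,\sigma_1,\mu_2,\sigma_2)$, as in the initial vector),
> theta = O1$estimate
> # keep the probability attached to the component with the smallest mean
> phat = ifelse(theta[2]<theta[4],theta[1],1-theta[1])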
[figure: estimated mixture probability with nlm, against the difference between the two means]
On the graph above, the x-axis is the difference between the means of the mixture (as in the animated graph above). The red point is the median of the estimated parameter (here $\widehat{p}$), and I have included something that can be interpreted as a confidence interval, i.e. the range where my estimate ended up in 90% of the scenarios: the black vertical segments. Obviously, when the sample is not heterogeneous enough (i.e. when $\mu_1$ and $\mu_2$ are too close), I cannot estimate my parameters properly; I might even get a probability exceeding 1 (I did not add any constraint). The blue solid horizontal line is the true value of the parameter, while the blue dotted horizontal line is the initial value of the parameter in the optimization algorithm (I started by assuming that the mixture probability was around 0.2).
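As an aside, one way to keep the estimated probability inside $[0,1]$ (not what was done here, just a sketch) is to add box constraints, e.g. with the L-BFGS-B method,
> O3 = optim(par = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)),
+   fn = logvrainegX, method = "L-BFGS-B",
+   lower = c(0,-Inf,1e-6,-Inf,1e-6), upper = c(1,Inf,Inf,Inf,Inf))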
The graph below is based on the second optimization routine (with identical starting values, and of course on the same generated samples),
[figure: estimated mixture probability with optim, on the same samples]
(just to be honest, in many cases it did not converge, so the loop stopped and I had to run it again... so finally, my study is based on slightly fewer than 500 samples (times 15, since I considered several values for the mean of my second underlying distribution), each with 200 observations generated from the mixture).
The graph below compares the two (empty circles for the first algorithm, filled circles for the second one),
[figure: comparison of the two optimization routines]
On average, it is not so bad... but the probability of being far away from the true value is not small at all... except when the difference between the two means exceeds 3...
If I change the starting values of the optimization algorithm (previously, I assumed that the mixture probability was 1/5; here I start from 1/2), we get the following graph,
[figure: same exercise, starting from a mixture probability of 1/2]
which looks like the previous one, except for small differences between the two underlying distributions (just as if the initial values had no impact on the optimization, but that might come from the fact that the surface is nice, and we never get trapped around a local minimum).
Thus, I am far from being an expert in optimization routines in R (see here for further information), but so far, it looks like R is not doing so badly... and the two algorithms perform similarly (with maybe the first one being a bit closer to the true parameter).

Statistique de l'assurance STT6705V, part 4b

First of all, to review the last lecture, just go here and there.
Otherwise, as promised, the datasets for the projects are online here. The principle is simple. There are 28 datasets, all similar (but of course different), identified by the numbers below. As always, first come, first served, so the datasets will be assigned as requests come in.

  • step 1: choose a dataset
You pick a number and click on it, in order to let me know your choice (by email, using the table below). Projects are done in pairs; please put in the body of the message the names of the two people who will work on the project.
  • step 2: retrieve the data
once I have confirmed that nobody else has chosen the same dataset, you can load the data in R with the following code,
> k=1
> nom=paste("http://perso.univ-rennes1.fr/arthur.charpentier/6705V/UdM-baseC-",
+     k,".txt",sep="")
> baseC=read.table(nom,header=TRUE)

for the policy dataset (here, for the group that picked dataset 1), and, for the claims dataset,
> nom=paste("http://perso.univ-rennes1.fr/arthur.charpentier/6705V/UdM-baseS-",
+     k,".txt",sep="")
> baseS=read.table(nom,header=TRUE)
Then off you go: the goal is to propose several ratemaking models, and to compute the pure premium under each of them, for one person from the list of people with the following characteristics,
> client=data.frame(
+ exposition=rep(1,9),
+ zone=c("A","A","A","C","D","E","F","F","F"),
+ puissance=c(6,7,11,6,7,11,6,7,11),
+ agevehicule=c(0,1,5,10,5,1,0,6,10),
+ ageconducteur=c(25,18,55,55,55,40,21,20,18),
+ bonus=c(80,100,50,60,55,50,100,125,100),
+ marque=c(1,2,12,12,12,1,1,1,2),
+ carburant=c("D","E","E","D","D","E","E","D","D"),
+ densite=rep(3000,9),
+ region=rep(baseC$region[1],9))
(if some levels are not present in your dataset, pick someone else... out of the nine, there should be at least one who could appear in your database). In case of problems, you have my email address... And I will come back later to the form of what I expect.
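For what it is worth, a possible starting point, only a sketch: the names of the claim-count and claim-cost columns below (nombre in baseC, cout in baseS) are hypothetical and must be adapted to your dataset, and the claims dataset is assumed to carry the rating factors (otherwise, merge it with the policy dataset first). The pure premium is then the expected claim frequency times the expected claim cost,
> # frequency: Poisson GLM, with the exposure as an offset
> regN = glm(nombre~zone+puissance+agevehicule+ageconducteur+bonus+
+   carburant+densite+offset(log(exposition)),family=poisson,data=baseC)
> # severity: Gamma GLM with a log link
> regC = glm(cout~zone+puissance+agevehicule,family=Gamma(link="log"),data=baseS)
> # pure premium for the nine profiles above
> prime = predict(regN,newdata=client,type="response")*
+   predict(regC,newdata=client,type="response")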
