
vendredi 1 octobre 2010
Blog transfert
Par Arthur Charpentier le vendredi 1 octobre 2010, 00:01 - liens externes
mercredi 29 septembre 2010
Statistique de l'assurance STT6705V, partie 5
Par Arthur Charpentier le mercredi 29 septembre 2010, 02:54 - actuariat 10/11 STT6705V
Pour la cinquième séance de cours, nous allons continuer sur la modélisation des coûts, à partir des transparents en
ligne ici.
La première partie portera sur la comparaison du modèle Gamma et du
modèle lognormal, et la seconde sur l’écrêtement des gros sinistres.
Il convient de faire attention à la transformation logarithmique. Un modèle de la forme


soient bien plus grands que 1). Regardons un jeu de données classique, liant vitesse du véhicule et distance de freinage.> summary(lm(dist~speed,data=cars))
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.490e-12

Call:
lm(formula = log(dist) ~ speed, data = cars)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.67612 0.19614 8.546 3.34e-11 ***
speed 0.12077 0.01206 10.015 2.41e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4463 on 48 degrees of freedom
Multiple R-squared: 0.6763, Adjusted R-squared: 0.6696
F-statistic: 100.3 on 1 and 48 DF, p-value: 2.413e-13

Il ne faut jamais oublier qu’on doit ensuite prendre l’exponentiel sur modèle en log. C’est d’ailleurs ce qui est fait par exemple dans la transformation de Box Cox, où l’on compare les sommes des carrés des résidus sur les
, i.e.

.
> mean((cars$dist-predict(lm(dist~speed,data=cars)))^2)
[1] 227.0704
> mean((cars$dist-exp(predict(lm(log(dist)~speed,data=cars))+.5*s^2))^2)
[1] 296.2027
Tout ça pour conclure que le modèle linéaire n’est peut être pas si mauvais que ça....
samedi 25 septembre 2010
EM and mixture estimation, with R (part 2)
Par Arthur Charpentier le samedi 25 septembre 2010, 02:56 - Informatique / R
Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).
So, we get back to our simple mixture model,

and
are perfectly known, and the mixture parameter is the only one we care about.- The simple model, with only one parameter that is unknown


(that cannot be observed), taking value when
is drawn from
, and 0 if it is drawn from
. More generally (especially in the case we want to extend our model to 3, 4, ... mixtures),
and
.With that notation, the likelihood becomes


Assume that the mixture probability
is known, denoted
. Then I can predict the value of
(i.e.
and
) for all observations,


. And I can iterate from here.Formally, the first step is where we calculate an expected (E) value, where
is the best predictor of
given my observations (as well as my belief in
). Then comes a maximization (M) step, where using
, I can estimate probability
.- A more general framework, all parameters are now unkown
and
were perfectly known. Which is not reallistic. An there is not much to change to get a complete algorithm, to estimate
. Recall that we had
which was the expected value of Z_{1,i}, i.e. it is a probability that observation i has been drawn from
. If
, instead of being in the segment
was in
, then we could have considered mean and standard deviations of observations such that
=0, and similarly on the subset of observations such that
=1.But we can’t. So what can be done is to consider
as the weight we should give to observation i when estimating parameters of
, and similarly, 1-
would be weights given to observation i when estimating parameters of
.So we set, as before





and this is it.
- Let us run the code on the same data as before
> X1 = rnorm(n,0,1)
> X20 = rnorm(n,0,1)
> Z = sample(c(1,2,2),size=n,replace=TRUE)
> X2=4+X20
> X = c(X1[Z==1],X2[Z==2])
then, given a vector of initial values (that I called
and then
before),> s = c(0.5, mean(X)-1, var(X), mean(X)+1, var(X))
I define my function as,
> em = function(X0,s) {
+ Ep = s[1]*dnorm(X0, s[2], sqrt(s[4]))/(s[1]*dnorm(X0, s[2], sqrt(s[4])) +
+ (1-s[1])*dnorm(X0, s[3], sqrt(s[5])))
+ s[1] = mean(Ep)
+ s[2] = sum(Ep*X0) / sum(Ep)
+ s[3] = sum((1-Ep)*X0) / sum(1-Ep)
+ s[4] = sum(Ep*(X0-s[2])^2) / sum(Ep)
+ s[5] = sum((1-Ep)*(X0-s[3])^2) / sum(1-Ep)
+ return(s)
+ }
Then I get
, or
.
So this is it ! We just need to iterate (here I stop after 200
iterations) since we can see that, actually, our algorithm converges
quite fast,> for(i in 2:200){
+ s=em(X,s)
+ }


jeudi 23 septembre 2010
Optimization and mixture estimation, with R
Par Arthur Charpentier le jeudi 23 septembre 2010, 23:09 - Informatique / R
Recently, one of my students asked me about optimization routines in R. He told me he that R performed well on the estimation of a time series model with different regimes, while he had trouble with a (simple) GARCH process, and he was wondering if R was good in optimization routines. Actually, I always thought that mixtures (and regimes) was something difficult to estimate, so I was a bit surprised...


proportional to
, and
being a parameter that will change, from 0 to 4.The log likelihood (actually, I add a minus since most of the optimization functions actually minimize functions) is
> logvraineg <- function(param, obs) {
+ p <- param[1]
+ m1 <- param[2]
+ sd1 <- param[3]
+ m2 <- param[4]
+ sd2 <- param[5]
+ -sum(log(p * dnorm(x = obs, mean = m1, sd = sd1) + (1 - p) *
+ dnorm(x = obs, mean = m2, sd = sd2)))
+ }
The code to generate my samples is the following,
>X1 = rnorm(n,0,1)
> X20 = rnorm(n,0,1)
> Z = sample(c(1,2,2),size=n,replace=TRUE)
> X2=m+X20
> X = c(X1[Z==1],X2[Z==2])
Then I use two functions to optimize my log likelihood, with identical intial values,
> O1=nlm(f = logvraineg, p = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)), obs = X)
> logvrainegX <- function(param) {logvraineg(param,X)}
> O2=optim( par = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)),
+ fn = logvrainegX)
Actually, since I might have identification problems, I take either
or
, depending whether
or
is the smallest parameter.
), and I have included something that can be interpreted as a confidence interval, i.e. where I have been in 90% of my scenarios: the black vertical segments. Obviously, when the sample is not enough heterogeneous (i.e.
and
rather different), I cannot estimate properly my parameters, I might
even have a probability that exceed 1 (I did not add any constraint).
The blue plain horizontal line is the true value of the parameter, while the blue
dotted horizontal line is the initial value of the parameter in the
optimization algorithm (I started assuming that the mixture probability
was around 0.2).The graph below is based on the second optimization routine (with identical starting values, and of course on the same generated samples),

The graph below compares the two (empty circles are the first algorithm, while plain circles the second one),

If I change starting values for the optimization algorithm (previously, I assumed that the mixture probability was 1/5, here I start from 1/2), we have the following graph

Thus, I am far from being an expert in optimization routines in R (see here for further information), but so far, it looks like R is not doing so bad... and the two algorithm perform similarly (maybe the first one being a bit closer to the true parameter).
Statistique de l'assurance STT6705V, partie 4b
Par Arthur Charpentier le jeudi 23 septembre 2010, 21:17 - actuariat 10/11 STT6705V
Avant toute autre chose, pour revoir le dernier cours, il suffit
d’aller ici
et là.
Bon, sinon, comme promis, les bases de données pour les
projets sont en
ligne ici. Le principe est simple. Il y a 28 bases de
données,
toutes semblables (mais bien sûr différentes), par
les
numéros ci-dessous. Comme toujours, premier
arrivé,
premier servi, donc les bases vont être attribuées
au fur
et à mesure
- étape 1: choisir une base
- étape 2: récupérer les données
> k=1
> nom=paste("http://perso.univ-rennes1.fr/arthur.charpentier/6705V/UdM-baseC-",
+ k,".txt",sep="")
> baseC=read.table(nom,header=TRUE)
pour la base de contrats (et pour le groupe qui aurait choisi la base 1) et pour la base des sinistres,
> nom=paste("http://perso.univ-rennes1.fr/arthur.charpentier/6705V/UdM-baseS-",
+ k,".txt",sep="")
> baseS=read.table(nom,header=TRUE)
Ensuite, c’est parti, il s’agit de me proposer différents modèles de tarification, et de calculer les primes pures avec les différents modèles, pour une personne parmi les listes des personnes ayant les caractéristiques suivantes,
> client=data.frame(
+ exposition=rep(1,9),
+ zone=c("A","A","A","C","D","E","F","F","F"),
+ puissance=c(6,7,11,6,7,11,6,7,11),
+ agevehicule=c(0,1,5,10,5,1,0,6,10),
+ ageconducteur=c(25,18,55,55,55,40,21,20,18),
+ bonus=c(80,100,50,60,55,50,100,125,100),
+ marque=c(1,2,12,12,12,1,1,1,2),
+ carburant=c("D","E","E","D","D","E","E","D","D"),
+ densite=rep(3000,9),
+ client=rep(baseC$region[1],9))
(si certains modalités ne sont pas présentes dans la base, il faut choisir quelqu’un d’autre.... sur les neuf, il doit bien en avoir un(e) qui pourrait être présent(e) dans votre base de données). En cas de problème, vous avez mon adresse électronique.... Et je reviendrais ultérieurement sur la forme de ce que j’attends.
« billets précédents - page 1 de 157