
post-pilot parameters

Posted: Wed Aug 20, 2025 4:23 am
by Joy_Lawrence
Hello Prof! We have data from our 100-person pilot study, so the next step was to find the parameters to use in the equation for the sample size required for the main study.
To get there, I estimated the MNL model on this data as:
mlogit choice i.AttribA i.AttribB i.AttribC i.AttribD i.AttribE i.AttribF
1) From here I got coefficient values for each of these attributes (with one level as the base, whose coefficient is absent). Are they sufficient as parameters for the future sample size calculation? NOTE: none of the coefficients come out significant here; is that OK or normal?
[img][https://drive.google.com/file/d/1me7hQn ... haring/img]
2) Do I need to use ,cluster(respondent_id) as an option after the above command, given that each respondent was shown multiple (8) choice tasks? If respondents answered multiple choice sets, should we use the cluster or panel options? I have seen some resources declare this a panel dataset using cmset (for DCEs where one person is shown multiple choice scenarios at a single point in time). Can we regard such cases as panel data, given that the DCE is conducted only once and not over time?

Re: post-pilot parameters

Posted: Fri Aug 22, 2025 11:38 am
by Michiel Bliemer
1. In a pilot study you will often find that coefficients are not statistically significant because the sample size may be too low. 100 respondents is a reasonable size for a pilot study though, so often you would see more parameters that are significant. In your case, only 1 parameter is significant. The coefficient for price is not significant in your study, which is somewhat surprising because this is an attribute that a lot of people usually find quite important. Could it be that your price level range is relatively narrow and therefore not many trade-offs are made with respect to price?
Yes, these parameters suffice for sample size calculations, but these calculations will likely not be very reliable because your parameters have large standard errors.
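The link between pilot standard errors and the required main-study sample size can be made concrete with a common rule of thumb: if standard errors shrink roughly with 1/sqrt(N), a pilot coefficient and its standard error imply how many respondents are needed for that coefficient to become significant. A Python sketch (the numbers are hypothetical illustrations, and this is a simplification, not Ngene's full S-estimate calculation):

```python
import math

def required_n(beta, se_pilot, n_pilot, z=1.96):
    """Rule-of-thumb sample size for one coefficient to reach significance
    at the 5% level, assuming se(N) ~ se_pilot * sqrt(n_pilot / N).
    Require |beta| / se(N) >= z and solve for N."""
    return math.ceil(n_pilot * (z * se_pilot / abs(beta)) ** 2)

# Hypothetical pilot (N=100): coefficient 0.15 with standard error 0.20
# gives t ~ 0.75, far from significant; the rule of thumb then suggests
# several hundred respondents would be needed for this parameter alone.
print(required_n(0.15, 0.20, 100))
```

This also shows why unreliable pilot estimates make such calculations shaky: the required N scales with the square of se/beta, so a noisy coefficient can swing the answer wildly.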

2. I do not know the software you are using for estimation; I only know Apollo and Biogeme, so I cannot say which option to use. SP data is always panel data if a single respondent provides multiple observations. The panel structure indicates which observations come from the same respondent, so correlations across them can be accounted for. You would therefore typically use the respondent id to group the data. This mostly matters when estimating a mixed logit model, not when estimating a multinomial logit model (it should give you the same parameter estimates).

Michiel

Re: post-pilot parameters

Posted: Tue Sep 23, 2025 3:05 am
by Joy_Lawrence
In this case, as we were short on time, we went ahead with our main survey without specifying priors (since, in the pilot, the coefficients did not come out significant).
The main survey data came in two parts: wide-form choice data (2 blocks, 8 tasks, 2 alternatives, 7 attributes with 2/3/4 levels) and a second file with covariates. I reshaped the choice data to long form and merged the two into a joint long-form dataset, so where there were 8 choice tasks per person (RID), there are now 16 rows per RID. I then ran a conditional logit (Stata: cmclogit) after setting the data as panel data. I did not account for the two different blocks.
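For reference, the wide-to-long reshape described above can be sketched in Python/pandas (the column names rid, task, price1/price2, speed1/speed2, and pref are hypothetical stand-ins for the actual survey columns; Stata's reshape long follows the same logic):

```python
import pandas as pd

# Hypothetical wide data: one row per respondent (rid) and choice task,
# with attribute columns for each alternative plus the stated preference.
wide = pd.DataFrame({
    "rid":    [6, 6],
    "task":   [1, 2],
    "price1": [10, 20], "speed1": [1, 2],   # alternative 1 attributes
    "price2": [20, 10], "speed2": [2, 1],   # alternative 2 attributes
    "pref":   [1, 2],                       # 1 = option 1 chosen, 2 = option 2
})

# Reshape so each task yields two rows, one per alternative.
long = pd.wide_to_long(wide, stubnames=["price", "speed"],
                       i=["rid", "task"], j="alt").reset_index()
long = long.sort_values(["rid", "task", "alt"])

# 2 tasks x 2 alternatives = 4 rows for this respondent; with 8 tasks per
# RID this gives the 16 rows per RID mentioned above.
print(len(long))  # 4
```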

The attached image tries to show the steps:
[img][https://drive.google.com/file/d/1CUE3cr ... haring/img]
[img][https://drive.google.com/file/d/1Hycd62 ... haring/img]

Am I doing something fundamentally wrong in the model, the data analysis, or the reshaping that is causing these errors?
Is the correlation between P & S too high?

Re: post-pilot parameters

Posted: Tue Sep 23, 2025 9:18 am
by Michiel Bliemer
Note that this is a model estimation question, not an experimental design question, and I have no familiarity with Stata, so it would be best to ask on the Stata forum. The correlations are unlikely to be an issue; even with correlations as high as 0.9 or 0.95 you should typically be able to estimate a choice model.

Perhaps check how you constructed your Choice column? It seems that pref=1 is converted to (0,1) and pref=2 is converted to (1,0), should this not be the other way around?

Michiel

Re: post-pilot parameters

Posted: Thu Sep 25, 2025 3:09 am
by Joy_Lawrence
I looked at the construction of the ''choice'' variable as you suggested and could not find the error. After reshaping, I constructed an 'alt' variable in a new column that simply lists the two unlabelled alternatives one after the other for each of the 8 choice tasks per RID, and then a 'choice' variable such that when the alternative is device 1 and pref1 is 1, 'choice' is set to 1, i.e. 'yes, chosen' (where, according to SEngine, pref1 = "Which would you choose?", 1 = option 1, 2 = option 2); similarly, when the alternative is device 2 and pref1 is 2, 'choice' = 1. Am I still making the coding mistake you mentioned? Or can it be improved?
Tried to show it in the image for just RID6 in wide and long form:
[img][https://drive.google.com/file/d/1BbgO6i ... haring/img]
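The coding rule described above ('choice' = 1 when the row's alternative matches pref1) can be written and sanity-checked in a few lines. A Python/pandas sketch with hypothetical column names, including the per-task checks that would catch the swapped-row-order problem mentioned earlier:

```python
import pandas as pd

# Hypothetical long-form data for one respondent and two tasks:
# 'alt' lists the two unlabelled alternatives, 'pref' is the stated choice
# (1 = option 1, 2 = option 2), repeated on both rows of a task.
long = pd.DataFrame({
    "rid":  [6, 6, 6, 6],
    "task": [1, 1, 2, 2],
    "alt":  [1, 2, 1, 2],
    "pref": [1, 1, 2, 2],
})

# choice = 1 on the row whose alternative number matches the stated preference
long["choice"] = (long["alt"] == long["pref"]).astype(int)

# Sanity checks worth running on the full dataset:
# exactly one chosen alternative per (rid, task) ...
assert (long.groupby(["rid", "task"])["choice"].sum() == 1).all()
# ... and on every chosen row, 'alt' equals 'pref'
chosen = long[long["choice"] == 1]
assert (chosen["alt"] == chosen["pref"]).all()
```

Note that this mapping is only correct if the row order of the alternatives matches the order the respondent saw; if the rows are swapped in the data, the check still passes mechanically but the meaning of 'alt' is inverted, which is exactly the issue discussed above.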

Re: post-pilot parameters

Posted: Thu Sep 25, 2025 10:14 pm
by Michiel Bliemer
I see. In your previous screenshot Charger1 and Charger2 are swapped around. So Pref=1 means that the option in the second row is chosen. In that case it may be correct as long as Stata understands that the first charging option (Charger2) is actually the second listed option.

Re: post-pilot parameters

Posted: Sun Sep 28, 2025 6:21 am
by Joy_Lawrence
Yes, that's right. I rechecked. Thank you.
1) If the pilot study doesn't give significant coefficients, does not converge, or gives very large standard errors, can we still use zero priors for the main study? Would that lower the efficiency of the results?
Note: I didn't set any restrictions between speed and price, so there is no preference order or priors for the main study, unlike the pilot.

2) When I tried to put near-zero negative priors on the price attribute levels, Ngene could not generate a design after many attempts. Why could that be? We are allowed to set such a preference order, aren't we?

3) Are my 48 rows with 6 blocks for 500-700 respondents the problem in the design that won't converge? (ERROR shown: multiple cases not alternative-specific; there is no within-case variability)

4) When 4 of my 7 attributes are regressed on the dependent variable 'choice', the model actually converges and gives reliable results, but when we include the other 3 attributes (with 3/3/4 levels each), it never converges. Then again, if we regress those 3 attributes on 'choice' and exclude the other 4, they converge separately; together, they do not. Can we actually run/report them separately in our paper?

PS: I understand I have posted long queries; I'm just trying to figure out and learn from my mistakes for future designs, as this feels like a failed DCE now. Was my initial zero-prior design for the main survey (with no limiting condition or preference order) a flawed decision, or could I have improved it? And could this be the reason the conditional logit isn't converging, despite having more than 500 respondents?
~~~~~
For reference:
This was my prepilot model:
design
;alts = opt1*, opt2*
;eff = (mnl, d)
;alg = swap
;rows = 48
;block = 6
;cond:
if(opt1.x1= 0, opt1.x2<=1),
if(opt1.x1= 2, opt1.x2>=1),
if(opt2.x1= 0, opt2.x2<=1),
if(opt2.x1= 2, opt2.x2>=1)

;model:
U(opt1) = b1.dummy[0|0.000] * x1[1,2,0]
+ b2.dummy[0.000|0.000] * x2[1,2,0]
+ b3.dummy[0|0] * x3[1,2,0]
+ b4.dummy[0|0|0] * x4[1,2,3,0]
+ b5.dummy[0.000|0.000] * x5[1,2,0]
+ b6.dummy[0|0] * x6[1,2,0]
+ b7.dummy[0.00] * x7[1,0] /
U(opt2) = b1 * x1
+ b2 * x2
+ b3 * x3
+ b4 * x4
+ b5 * x5
+ b6 * x6
+ b7 * x7
$
~~~~
And this my postpilot design:
design
;alts = opt1*, opt2*
;eff = (mnl, d)
;alg = swap
;rows = 48
;block = 6
;model:
U(opt1) = b1.dummy[0.000|0.000] * x1[1,2,0]
+ b2.dummy[0.000|0.000] * x2[1,2,0]
+ b3.dummy[0.000|0.000] * x3[1,2,0]
+ b4.dummy[0.000|0.000|0.000] * x4[1,2,3,0]
+ b5.dummy[0.000|0.000] * x5[1,2,0]
+ b6.dummy[0.000|0.000] * x6[1,2,0]
+ b7.dummy[0.000] * x7[1,0] /
U(opt2) = b1 * x1
+ b2 * x2
+ b3 * x3
+ b4 * x4
+ b5 * x5
+ b6 * x6
+ b7 * x7

$

Re: post-pilot parameters

Posted: Tue Sep 30, 2025 10:00 am
by Michiel Bliemer
1. Yes, you can use zero priors; that is always a safe option. The further away your priors are from the true parameters, the more efficiency you lose. Note that parameter estimates are often not significant in a pilot study because of the small sample size, yet the estimate is still the best guess for the prior. But if the priors are very unreliable, with very large standard errors, then one has to be careful not to select too-large parameters, as that can result in inefficient designs, so setting zero priors may in that case be preferred.

2. Dominance checks can only be performed if at least two attributes have non-zero priors; setting a preference order on only one attribute makes the problem infeasible. Even with only two attributes having non-zero priors it will still be very difficult to find a design without any dominant alternatives, because Ngene will try to avoid dominance based on those two attributes alone. The more attributes have non-zero priors, the easier it is to avoid dominant alternatives.
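The dominance logic described in point 2 can be sketched in a few lines of Python (the attributes price and speed, their levels, and the sign convention are hypothetical; this is an illustration of the idea, not Ngene's actual algorithm):

```python
def dominates(a, b, signs):
    """Return True if alternative a dominates b given prior signs:
    +1 = higher level is better, -1 = lower is better, 0 = no prior.
    a dominates b if it is at least as good on every signed attribute
    and strictly better on at least one."""
    signed = {k: s for k, s in signs.items() if s != 0}
    at_least_as_good = all(s * (a[k] - b[k]) >= 0 for k, s in signed.items())
    strictly_better = any(s * (a[k] - b[k]) > 0 for k, s in signed.items())
    return at_least_as_good and strictly_better

a = {"price": 10, "speed": 1}
b = {"price": 20, "speed": 2}

# With only one non-zero prior (negative on price), any pair differing on
# price has a dominant member, so a dominance-free design is impossible:
only_price = {"price": -1, "speed": 0}
print(dominates(a, b, only_price))  # True: a is cheaper, nothing else counts

# With two signed attributes a trade-off becomes possible and dominance
# can be avoided: a is cheaper but slower, so neither dominates.
two_signed = {"price": -1, "speed": +1}
print(dominates(a, b, two_signed))  # False
```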

3. I do not know what that error means. I doubt that 48 rows and 6 blocks are an issue; it could be a data coding error or a model specification issue in Stata. In one of the screenshots of your model estimates it strikes me that all your coefficients are very large (dummy-coded coefficients would typically not be so large, and neither would your alternative-specific constant, because you have an unlabelled experiment). I also see that it states "non-concave" for the log-likelihood function, which is also strange, because the LL function of an MNL model with a linear-in-the-parameters utility function is by definition concave.

4. I have no explanation for that, except that perhaps there is multicollinearity in your data. But an Ngene-generated design should not have any multicollinearity. I would not report the two models separately, since the respondents did see the other attributes as well and those cannot be ignored.
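One quick way to probe the multicollinearity hypothesis is to inspect the pairwise correlations and condition number of the dummy-coded design matrix actually used in estimation. A Python sketch with random stand-in data (the real check would use the estimation dataset's attribute columns):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the dummy-coded attribute columns of the long-form data:
# 200 rows, 6 dummy columns. Replace with the real design matrix.
X = rng.integers(0, 2, size=(200, 6)).astype(float)

corr = np.corrcoef(X, rowvar=False)   # pairwise correlations between columns
cond = np.linalg.cond(X)              # condition number of the design matrix

off_diag = np.abs(corr[~np.eye(6, dtype=bool)])
print(off_diag.max())  # worst pairwise correlation; values near 1 flag trouble
print(cond)            # very large values also indicate near-collinearity
```

If a column pair from the 3 problematic attributes shows a correlation near plus or minus 1, that would explain why the joint model fails to converge while each subset converges on its own.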

Did you ask on the Stata forum to check whether you coded your data correctly and whether you specified the correct model in Stata? I really cannot help here. I am familiar with Apollo and Biogeme packages, which are both free, but I have never used Stata.

Michiel