How should I handle a mass-point in the dependent variable when running OLS regression in R?
I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the weekly combined gas and electricity bill. One issue is that about 40% of the 5008 households report exactly the same value: £2.2428 per week.
All of these cases correspond to households whose payment method is coded as “Not Direct Debit”. For these households the energy expenditure variable has no variation at all: every value is £2.2428.
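For reference, this is roughly how the overlap between the mass point and the payment category can be checked. It is a minimal sketch on simulated data, not my real file: the data frame `hh` and the columns `payment`, `energy` and `at_mass` are placeholder names I made up for illustration.

```r
# Simulated stand-in for the real data; all object and column names are hypothetical.
set.seed(1)
n <- 5008
hh <- data.frame(
  payment = sample(c("Direct Debit", "Not Direct Debit"), n,
                   replace = TRUE, prob = c(0.6, 0.4))
)

# Direct Debit households get a continuous bill; the rest get the constant value.
hh$energy <- ifelse(hh$payment == "Not Direct Debit", 2.2428,
                    round(rlnorm(n, meanlog = 3, sdlog = 0.4), 4))

# Flag the mass point with a tolerance rather than ==, to be safe against
# floating-point noise in the stored values.
hh$at_mass <- abs(hh$energy - 2.2428) < 1e-8

# Cross-tabulate payment method against the mass-point flag.
tab <- table(hh$payment, hh$at_mass)
print(tab)

# Within-group variance of the outcome; effectively zero for "Not Direct Debit".
grp_var <- tapply(hh$energy, hh$payment, var)
print(grp_var)
```

In my real data the cross-tabulation has the same shape: every mass-point record falls in the “Not Direct Debit” row and none in the other.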
Although these records share the same expenditure figure, the households themselves are clearly distinct: they differ across the roughly 30 other variables in the dataset.
My current plan is to estimate two models:
Full sample OLS model (all households, including the mass-point)
Restricted sample OLS model (excluding all households where expenditure = £2.2428)
The idea is that the first model reflects population-wide differences including tariff/payment effects, while the second model isolates variation among households with meaningful, continuous expenditure data.
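To make the plan concrete, here is a minimal sketch of the two fits on simulated data. The predictors `hh_size` and `income`, the indicator `not_dd`, and the data-generating numbers are all assumptions for illustration only; my real models use the dataset's own covariates.

```r
# Simulated stand-in for the real data; all names and parameters are hypothetical.
set.seed(42)
n <- 5008
hh <- data.frame(
  not_dd  = rbinom(n, 1, 0.4),                 # 1 = "Not Direct Debit"
  hh_size = sample(1:6, n, replace = TRUE),
  income  = rlnorm(n, meanlog = 6, sdlog = 0.5)
)

# Continuous bill for Direct Debit households, constant value for the rest.
bill <- 5 + 3 * hh$hh_size + 0.01 * hh$income + rnorm(n, sd = 4)
hh$energy <- ifelse(hh$not_dd == 1, 2.2428, bill)

# Model 1: full sample, mass point included.
full_fit <- lm(energy ~ hh_size + income, data = hh)

# Model 2: restricted sample, dropping every record at the mass point.
restricted_fit <- lm(energy ~ hh_size + income, data = hh,
                     subset = abs(energy - 2.2428) > 1e-8)

print(coef(summary(full_fit)))
print(coef(summary(restricted_fit)))
print(c(full = nobs(full_fit), restricted = nobs(restricted_fit)))
```

In this toy setup the `hh_size` slope from the full sample is smaller than the one from the restricted sample, because the constant-valued records do not respond to the covariates at all. That difference is what I am unsure how to interpret.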
My questions are:
Is this two-model approach statistically valid for handling a dependent variable with a large mass-point?
Are there better practices in R for dealing with this kind of issue?
Should the payment-method variable be included as an explanatory variable in the full model, or does this introduce its own bias because the mass-point is entirely determined by that category?
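For the last question, these are the specifications I am weighing, again sketched on simulated data with made-up names (`not_dd`, `hh_size`, `income`). One version adds the payment indicator as a plain dummy; the other interacts it with every covariate. I am not claiming either is the right model.

```r
# Simulated stand-in for the real data; all names and parameters are hypothetical.
set.seed(7)
n <- 5008
hh <- data.frame(
  not_dd  = rbinom(n, 1, 0.4),                 # 1 = "Not Direct Debit"
  hh_size = sample(1:6, n, replace = TRUE),
  income  = rlnorm(n, meanlog = 6, sdlog = 0.5)
)
hh$energy <- ifelse(hh$not_dd == 1, 2.2428,
                    5 + 3 * hh$hh_size + 0.01 * hh$income + rnorm(n, sd = 4))

# (a) Payment method as an additive dummy only.
dummy_fit <- lm(energy ~ hh_size + income + not_dd, data = hh)

# (b) Payment method interacted with every covariate.
inter_fit <- lm(energy ~ (hh_size + income) * not_dd, data = hh)

# (c) Restricted sample, for comparison.
restricted_fit <- lm(energy ~ hh_size + income, data = hh, subset = not_dd == 0)

# Compare the main-effect slopes across the three fits.
slopes <- c("hh_size", "income")
print(rbind(dummy      = coef(dummy_fit)[slopes],
            interacted = coef(inter_fit)[slopes],
            restricted = coef(restricted_fit)[slopes]))

# Largest absolute residual among mass-point households in the interacted model.
max_resid_mass <- max(abs(resid(inter_fit)[hh$not_dd == 1]))
print(max_resid_mass)
```

In the simulation, the fully interacted model reproduces the restricted-sample slopes and fits the mass-point households with essentially zero residual, so it seems to carry no information beyond model 2. The additive-dummy version gives slopes that differ from both.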
Any advice on best practice for handling this structure in OLS or alternative models would be appreciated.